An MP2 core-electron binding energy (CEBE) computed in a large basis set (or extrapolated to the CBS limit) and corrected by the difference between CCSD and MP2 energies in a small basis set recovers CC CBS values to within 0.02 eV. A meta-lesson: when benchmarking new methods, CEBEs for ionizations of different elements should be analyzed separately to look for element-specific trends.
You can calculate core-electron binding energy for 2nd row elements with the accuracy matching that of the most expensive methods (within 0.10-0.15 eV of experimental values) at significantly lower computational cost. Also, yet another example of Simpson's paradox in the wild.
Imagine wanting a bespoke, custom-tailored suit (the most accurate quantum calculation) but only having the budget for an off-the-rack one (a cheaper method). Our work provides a set of precise, inexpensive tailoring instructions (a small correction) that makes the cheap suit fit almost identically to the bespoke one. This trick allows us to accurately model chemical systems that were previously too expensive to simulate.
You’re probably familiar with UV-Vis spectroscopy, which measures electronic transitions between valence and virtual orbitals. In a similar fashion, X-Ray Absorption Spectroscopy (XAS) reports excitations of electrons from core orbitals (e.g. 1s orbitals for 2nd row elements), and X-ray Photoelectron Spectroscopy (XPS) measures energies (Core Electron Binding Energies, CEBEs) required to fully ionize those core orbitals. X-ray spectroscopy has several advantages:
While downstream applications (e.g. ultrafast chemical dynamics) employ XAS more frequently, the ability to accurately assess the energy of the core orbital is required both for XAS (even if implicitly) and XPS, making the latter a more foundational challenge.
Ejection of a core electron results in a significant redistribution of electron density, so traditional linear response methods such as TDDFT exhibit errors of 10+ eV, whereas proper interpretation of experimental spectra requires errors below 0.2 eV. Errors can be reduced to 1-3 eV with the use of specifically optimized functionals (such optimizations often mean fitting the functional to experimental data, which, as you might imagine, is a slippery slope for core spectroscopy).
Alternatively, one can use coupled-cluster based methods (EOM-CC) within the Core-Valence Separation (CVS) approximation (CVS is needed to avoid calculating transitions from valence orbitals). Unfortunately, CVS-EOM-CCSD only brings the mean absolute error (MAE) down to 1.75 eV, and you need CVS-EOM-CCSDT in a quadruple-zeta basis to reduce it to 0.15 eV. CVS-EOM-CCSDTQ further reduces the MAE to 0.07 eV. Accurate, but incredibly expensive! (CCSD scales as $O(N^6)$, CCSDT as $O(N^8)$, and CCSDTQ as $O(N^{10})$, where $N$ is the number of basis functions, which grows in proportion to the number of atoms $n$ and is larger for quadruple-zeta than for triple-zeta basis sets. And mind you, this is just the cost of a single CC iteration; you might need 10-100 to reach convergence.)
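To get a feel for how steeply these costs grow, here is a minimal sketch that uses the standard formal scalings ($N^6$ for CCSD, $N^8$ for CCSDT, $N^{10}$ for CCSDTQ) and ignores all prefactors. The basis sizes are hypothetical numbers picked only for illustration, and `relative_cost` is a made-up helper name.

```python
# Formal per-iteration scaling exponents of the coupled-cluster hierarchy.
SCALING_EXPONENT = {"CCSD": 6, "CCSDT": 8, "CCSDTQ": 10}


def relative_cost(method, n_small, n_large):
    """Factor by which one iteration gets more expensive when the number
    of basis functions grows from n_small to n_large (prefactors ignored)."""
    return (n_large / n_small) ** SCALING_EXPONENT[method]


# Hypothetical basis sizes (assumption): the larger basis has twice as many functions.
n_small_basis, n_large_basis = 300, 600

for method in SCALING_EXPONENT:
    factor = relative_cost(method, n_small_basis, n_large_basis)
    print(f"{method}: one iteration becomes {factor:,.0f}x more expensive")
```

The point of the sketch is only the trend: the higher the excitation level, the more brutally the cost punishes any increase in basis set size.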
An entirely different approach is to explicitly optimize the wavefunction of the core-ionized state to properly account for orbital relaxation effects. The CEBE can then be calculated as the difference between the energies of the core-ionized and ground states. Remarkably, even HF (Hartree-Fock is the cheapest and simplest method in quantum chemistry; it's almost trivially naive, as it assumes that the movements of electrons don't affect each other) calculates CEBEs within 1 eV, and MP2 (a relatively cheap, non-iterative way of correcting HF) brings errors down to 0.5 eV.
The coupled-cluster method is considered the gold standard of computational chemistry. So yes, it's expensive, but why not just write some CUDA kernels and let GPUs go brrr?
Let's say you have a system with a core orbital $c$, valence orbitals $v_1, v_2$ and an empty orbital $a$. Their energies are ordered as $\varepsilon_c \ll \varepsilon_{v_1}, \varepsilon_{v_2} < \varepsilon_a$. Let's say you want to find the energy after ionizing (removing one electron from) the orbital $c$. When you solve the coupled-cluster equations (within, say, CCSD), you'll have to calculate so-called double transitions (these have nothing to do with exciting the molecule; they are just part of the normal process of calculating the energy with CCSD) of the form:

$$t_{v_1 v_2}^{c a} \propto \frac{1}{\varepsilon_{v_1} + \varepsilon_{v_2} - \varepsilon_c - \varepsilon_a}$$

Because $\varepsilon_c \ll \varepsilon_{v_1}, \varepsilon_{v_2}$, a combination of energies may (and often does) exist such that the sum in the denominator is near zero, so $t_{v_1 v_2}^{c a}$ explodes, and the whole procedure diverges.
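A toy numerical illustration of that near-zero denominator, assuming the usual orbital-energy denominator for a double transition that moves two valence electrons into the core hole and an empty orbital. All orbital energies below are hypothetical values in eV, not numbers from any real calculation.

```python
def doubles_denominator(e_v1, e_v2, e_core, e_virt):
    """Orbital-energy denominator for moving two valence electrons into
    the core hole and an empty (virtual) orbital."""
    return e_v1 + e_v2 - e_core - e_virt


# Hypothetical orbital energies in eV, for illustration only.
e_core = -300.0            # the core hole sits very low in energy
e_v1, e_v2 = -15.0, -12.0  # two valence orbitals
virtuals = [5.0, 80.0, 272.5, 400.0]  # a large basis has many high-lying virtuals

for e_virt in virtuals:
    d = doubles_denominator(e_v1, e_v2, e_core, e_virt)
    print(f"e_virt = {e_virt:6.1f} eV  denominator = {d:7.1f} eV  1/denominator = {1 / d:+.4f}")
```

The larger the basis set, the denser the ladder of empty orbitals, so the more likely it is that one of them lands close to the value that makes the denominator vanish.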
Zheng and Cheng (2019) have shown that if you manually exclude such transitions and apply a few corrections, you can get accurate CEBE predictions. Arias-Martinez et al. (2022) proposed a few more systematic improvements and benchmarked the methods for 18 small organic molecules.
To recap: we can get accurate CEBEs with methods that might suffer from convergence issues. Is there any chance we can get CC grade predictions from cheaper methods?
The answer is yes. If you extrapolate the MP2 CEBEs to the complete basis set (CBS) limit and add a (CC-MP2) correction evaluated in a small basis, you can quantitatively recover CC energies in the CBS limit. (The CBS limit is the true prediction you're supposed to get with a method on a true wave function, which is a linear combination of an infinite-dimensional basis. We can't work with infinite basis sets in practice, so we have to extrapolate the results we get from basis sets of different sizes.)
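In code, the recipe is just a few lines. The sketch below is an illustration rather than the paper's implementation: the two-point inverse-cubic extrapolation is a common textbook form that I'm assuming here (the paper's exact extrapolation scheme may differ), the function names are made up, and every number is hypothetical.

```python
def extrapolate_cbs(e_x, e_y, x, y):
    """Two-point extrapolation, assuming E(X) = E_CBS + A / X**3.
    x and y are basis-set cardinal numbers (e.g. 3 for TZ, 4 for QZ)."""
    return (x**3 * e_x - y**3 * e_y) / (x**3 - y**3)


def composite_cebe(mp2_cbs, cc_small, mp2_small):
    """MP2 at the CBS limit plus a (CC - MP2) correction from a small basis."""
    return mp2_cbs + (cc_small - mp2_small)


# Hypothetical CEBEs in eV for a single ionization (NOT data from the paper).
mp2_tz, mp2_qz = 291.30, 291.10      # MP2 in triple- and quadruple-zeta bases
cc_small, mp2_small = 291.62, 291.90  # CCSD and MP2 in the same small basis

mp2_cbs = extrapolate_cbs(mp2_tz, mp2_qz, 3, 4)
estimate = composite_cebe(mp2_cbs, cc_small, mp2_small)
print(f"MP2/CBS = {mp2_cbs:.3f} eV, composite estimate = {estimate:.3f} eV")
```

Note that the expensive method only ever touches the small basis; everything that needs a large basis is done with MP2.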
The y-axis is the absolute difference between the predicted and experimental CEBE (smaller values mean smaller errors). The x-axis shows a few methods. The gold-standard CCSD (extrapolated to the CBS limit) scores an average error (over 94 CEBEs) of 0.123 eV (error bars show standard deviations of the MAE, in this case roughly 0.15 eV). The CBS-extrapolated MP2 scores 0.28 eV, but if you add the correction (our method), it's practically equivalent to the CCSD predictions. The correction is evaluated in a small, but still decent basis. (3-21G and STO-3G are laughably cheap.)
Basically, instead of doing CC calculations in a large basis set, you do an MP2 calculation in a large basis and a CC calculation in a small one.
Method | Basis | Scaling | Practical Runtime |
---|---|---|---|
MP2 | small | once | 1 s |
MP2 | big | once | 1 min |
CCSD | small | iterative | 30 s |
CCSD | big | iterative | 2.4 hrs |
So instead of hours, you're done in 2 minutes.
Effectively, all of this rests on the observation that if you plot MP2 and CCSD energies as a function of basis set size, you'll get two curves that have the same shape but are vertically offset. This means that the CC-MP2 difference is the same in the CBS limit, in a large basis, and in a small basis.
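If you want to check this "same shape, constant offset" assumption on your own data, a minimal sketch could look like the one below. The helper names are made up and the numbers are hypothetical placeholders, not values from the paper.

```python
def correction_by_basis(mp2, cc):
    """Return the CC - MP2 difference for every basis present in both dicts."""
    return {basis: cc[basis] - mp2[basis] for basis in mp2 if basis in cc}


def spread(values):
    """Difference between the largest and the smallest value."""
    values = list(values)
    return max(values) - min(values)


# Hypothetical CEBEs in eV for one ionization (NOT data from the paper).
mp2 = {"DZ": 291.90, "TZ": 291.30, "QZ": 291.10}
cc = {"DZ": 291.62, "TZ": 291.03, "QZ": 290.82}

deltas = correction_by_basis(mp2, cc)
for basis, delta in deltas.items():
    print(f"{basis}: CC - MP2 = {delta:+.2f} eV")
print(f"spread across basis sets: {spread(deltas.values()):.2f} eV")
```

A spread that is small compared to your target accuracy is what justifies evaluating the correction in the cheapest basis you can get away with.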
The paper has quite a few more interesting results on the nuances:
When I started to bring my first results (the errors of different methods or some plots) to the weekly discussions, they were always taken as axiomatically correct. In other words, no one double-checked the accuracy of my calculations or plots; all discussion was predicated on the data being correct and centered around the implications of that data. While I appreciated the trust, given that this was my first theoretical project, I couldn't help but panic: what if I make a small mistake when collecting values from the output file (these are usually text files of at least 3k lines that log the progress of the calculation and all final results) and take the wrong number? What if I make some mistake when selecting values for plots or tables? This is especially concerning when your results are good: how do you prove it was an honest mistake, and not data manipulation?
I quickly decided on a solution: every single piece of data manipulation should be written as a script that ingests from the source (in this case output file) and ends up with a final table or figure to be used in the paper (or any internal meetings). Now, obviously, you can still make a mistake in your script, but:
As a nice side bonus, this approach also significantly simplifies your research process.
`.py` file (e.g. perform_analysis.py in the CEBE repo).

Now, that "assuming" is doing a lot of heavy lifting: the magnitude of the benefits depends on the quality of the code you write, which might seem daunting; however, there's no better way to figure out how to do it than to actually start doing it. I think I refactored/rewrote my scripts for the CEBE project from scratch at least 3 times. And if I were to write them today, 18 months later, I'd do it completely differently. And that is great!
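To make the "script from source to final number" idea concrete, here is a minimal sketch. It is not the code from the CEBE repo: the `Total energy = ...` log line format is a hypothetical one (adapt the pattern to whatever your program prints), and the function names are made up for illustration.

```python
import re

HARTREE_TO_EV = 27.211386245988

# Hypothetical log line format (assumption); adapt to your program's output.
ENERGY_PATTERN = re.compile(r"^\s*Total energy\s*=\s*(-?\d+\.\d+)", re.MULTILINE)


def parse_final_energy(log_text):
    """Return the LAST reported total energy (in Hartree) from a log."""
    matches = ENERGY_PATTERN.findall(log_text)
    if not matches:
        raise ValueError("no 'Total energy' line found")
    return float(matches[-1])


def cebe_ev(ground_log, ionized_log):
    """CEBE in eV as the energy difference between the two states."""
    delta = parse_final_energy(ionized_log) - parse_final_energy(ground_log)
    return delta * HARTREE_TO_EV


# Tiny fake logs standing in for multi-thousand-line output files.
ground = "iter 1\nTotal energy = -76.000000\niter 2\nTotal energy = -76.100000\n"
ionized = "iter 1\nTotal energy = -56.100000\n"
print(f"CEBE = {cebe_ev(ground, ionized):.3f} eV")
```

In a real project you would read the text with `pathlib.Path(...).read_text()` and loop over all molecules, but the principle stays the same: no number ever gets copied by hand.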
During the preparation of the manuscript, Prof. Troy Van Voorhis gave me a great rule that I have tried to live by ever since:
Every single figure in the paper should convey a clear and concise idea. The standard is that if you show it to anyone, they should be able to figure out the intended message without reading the paper.
Let's take Fig. 1 as an example.
Let me be bold and assume that the conclusions you draw are:
which is pretty much exactly what a domain expert would conclude, except they would say HF instead of "the green method" or CCSD(T) instead of "the blue method".
This all might seem like too much of a common-sense take, but in practice maximizing the clarity of a figure often means sacrificing details or some nuances. For example, I initially intended to show extrapolated MP2 and CC energies on the same figure, which would have had the benefit of showing how much the CBS extrapolation reduces the error, but would also have made the figure too loaded. As a result, the extrapolated values were moved into a separate Fig. 4.
That's all for today. I hope I made you curious enough to check out the paper (or at least the figures).
The code repository above contains all the data and scripts needed to recreate all tables and figures from the paper.
I'm quite proud that other members of the Batista Group have followed suit and started to include figure reproduction scripts in their projects (e.g. CardioGenAI by Dr. Kyro, or Quantum to Classical Transfer Learning by Dr. Smaldone). If you're doing research, consider joining this little trend of ours.
@article{mp2cebe,
author = {Morgunov, Anton and Tran, Henry K. and Meitei, Oinam Romesh and Chien, Yu-Che and Van Voorhis, Troy},
title = {MP2-Based Composite Extrapolation Schemes Can Predict Core-Ionization Energies for First-Row Elements with Coupled-Cluster Level Accuracy},
journal = {The Journal of Physical Chemistry A},
volume = {128},
number = {33},
pages = {6989-6998},
year = {2024},
doi = {10.1021/acs.jpca.4c01606},
}