Data-driven techniques are increasingly used to replace electronic-structure calculations of matter. In this context, a relevant question is whether machine learning (ML) should be applied directly to predict the desired properties or be combined explicitly with physically grounded operations. We present an example of an integrated modeling approach, in which a symmetry-adapted ML model of an effective Hamiltonian is trained to reproduce electronic excitations from a quantum-mechanical calculation. The resulting model can make predictions for molecules that are much larger and more complex than those it is trained on, and allows for dramatic computational savings by indirectly targeting the outputs of well-converged calculations while using a parameterization corresponding to a minimal atom-centered basis. Our results on a comprehensive dataset of hydrocarbons emphasize the merits of intertwining data-driven techniques with physical approximations, improving the transferability and interpretability of ML models without affecting their accuracy and computational efficiency, and providing a blueprint for developing ML-augmented electronic-structure methods.
This record contains the dataset accompanying the paper linked below. It covers the following hydrocarbons: ethane, ethene, butadiene, hexane, hexatriene, isoprene, styrene, polyalkenes (dodecahexaene, tetradecaheptaene, hexadecaoctaene, octadecanonaene, eicosadecaene), aromatics (benzene, azulene, naphthalene, biphenyl), anthracene, beta-carotene, and fullerene. We also provide the scripts used to generate the Fock and overlap matrices in this dataset. The machine-learning code is available at the Software reference below.
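As an illustration of how a Fock matrix F and an overlap matrix S are typically used together, the sketch below solves the generalized eigenvalue problem F C = S C ε by Löwdin symmetric orthogonalization. It is a minimal example under stated assumptions, not the workflow of the accompanying paper: the function name `orbital_energies` is hypothetical, the 2x2 matrices are made-up toy values rather than entries from this dataset, and no assumption is made about the file format in which the dataset stores its matrices.

```python
import numpy as np


def orbital_energies(fock, overlap):
    """Solve F C = S C eps for a symmetric F and a positive-definite S.

    Hypothetical helper for illustration; uses Loewdin symmetric
    orthogonalization, X = S^(-1/2).
    """
    fock = np.asarray(fock, dtype=float)
    overlap = np.asarray(overlap, dtype=float)

    # Diagonalize S to build S^(-1/2) = V diag(1/sqrt(s)) V^T.
    s_vals, s_vecs = np.linalg.eigh(overlap)
    if np.any(s_vals <= 0.0):
        raise ValueError("overlap matrix must be positive definite")
    s_inv_sqrt = (s_vecs / np.sqrt(s_vals)) @ s_vecs.T

    # Transform F to the orthogonalized basis and diagonalize it there.
    fock_orth = s_inv_sqrt @ fock @ s_inv_sqrt
    eps, c_orth = np.linalg.eigh(fock_orth)

    # Back-transform the eigenvectors to the original (non-orthogonal) basis.
    coeffs = s_inv_sqrt @ c_orth
    return eps, coeffs


# Toy two-orbital example (illustrative numbers only, not from the dataset).
F = np.array([[-0.5, -0.6],
              [-0.6, -0.5]])
S = np.array([[1.0, 0.4],
              [0.4, 1.0]])

eps, C = orbital_energies(F, S)
gap = eps[1] - eps[0]  # difference between the two orbital energies
print("orbital energies:", eps)
print("gap:", gap)
```

The returned coefficients satisfy C^T S C = I, so they are orthonormal with respect to the overlap metric. Differences between orbital energies give only a crude single-particle estimate of excitation energies.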