Machine Learning for Quantum Mechanical Properties of Atoms in Molecules

Matthias Rupp,∗ Raghunathan Ramakrishnan, and O. Anatole von Lilienfeld†
Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), Department of Chemistry, University of Basel, Klingelbergstr. 80, CH-4056 Basel, Switzerland
(Dated: August 26, 2015)

We introduce machine learning models of quantum mechanical observables of atoms in molecules. Instant out-of-sample predictions for proton and carbon nuclear chemical shifts, atomic core level excitations, and forces on atoms reach accuracies on par with the density functional theory reference. Locality is exploited within non-linear regression via local atom-centered coordinate systems. The approach is validated on a diverse set of 9 k small organic molecules. Linear scaling of computational cost in system size is demonstrated for saturated polymers with up to sub-mesoscale lengths.
PACS numbers: 03.65.-w, 31.15.A-, 31.15.E-, 02.60.Ed
This work has subsequently been published in the Journal of Physical Chemistry Letters 6(16): 3309–3313, American Chemical Society, 2015. To access the final edited and published work see DOI 10.1021/acs.jpclett.5b01456.

Accurate solutions to the many-electron problem in molecules have become possible due to progress in hardware and methods. [1–4] Their prohibitive computational cost, however, prevents both routine atomistic modeling of large systems and high-throughput screening. [5] Machine learning (ML) models can be used to infer quantum mechanical (QM) expectation values of molecules, based on reference calculations across chemical space. [6, 7] Such models can speed up predictions by several orders of magnitude, demonstrated for relevant molecular properties such as enthalpies, entropies, polarizabilities, electron correlation, and electronic excitations. [8–10] A major drawback is their lack of transferability: e.g., ML models trained on bond dissociation energies of small molecules will not be predictive for larger molecules. In this work, we introduce ML models for properties of atoms in molecules. These models exploit locality to achieve transferability to larger systems and across chemical space, for systems that are locally similar to the ones trained on (Fig. 1). These aspects have only been treated in isolation before. [6, 11]

We model spectroscopically relevant observables, namely ¹³C and ¹H nuclear magnetic resonance (NMR) chemical shifts [12] and 1s core level ionization energies (CIE), as well as atomic forces, crucial for structural relaxation and molecular dynamics. Nuclear shifts and ionization energies are dominated by inherently local core electron-nucleus interactions.
Atomic forces are expectation values of the differential operator applied to an atom's position in the Hamiltonian [13], and scale quadratically with inverse distance. Inductive modeling of QM properties of atoms in molecules constitutes a high-dimensional interpolation problem with spatial and compositional degrees of freedom. QM reference calculations provide training examples {(x_i, y_i)}_{i=1}^n, where the x_i encode atoms in their molecular environment and the y_i are atomic property values. ML interpolation between training examples then provides predicted property values for new atoms.

FIG. 1. Sketch illustrating the local nature of atomic properties for the example of the force ⟨Ψ| ∂_{R_Q} Ĥ |Ψ⟩ acting on a query atom Q in a molecule (mid), inferred from similar atoms in training molecules (top, bottom). Shown are force vectors (arrows), integrated electron density ∫ dx dy n(r) (solid), and the integrated electronic term of the Hellmann-Feynman force along z (dashed).

The electronic Hamiltonian is determined by the number of electrons, nuclear charges {Z_I}, and positions {R_I}, which can be challenging for direct interpolation. [14] Proposed requirements for representations include uniqueness, continuity, as well as invariance to translation, rotation, and nuclear permutations. [15] For scalar properties (NMR, CIE), we use the sorted Coulomb matrix [6] to represent a query atom Q and its environment: M_II = 0.5 Z_I^2.4 and M_IJ = Z_I Z_J / |R_I − R_J|, where atom indices I, J run over Q and all atoms in its environment, sorted by distance to Q. Note that all molecules in this study are neutral, and no explicit encoding of charge is necessary.

Atomic forces are vector quantities requiring a basis, which should depend only on the local environment; in particular, it should be independent of the global frame of reference used to construct the Hamiltonian in the QM calculation.
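The sorted Coulomb matrix of an atomic environment can be sketched in a few lines. This is a minimal illustration of the representation described above (diagonal 0.5 Z_I^2.4, off-diagonal Z_I Z_J/|R_I − R_J|, atoms ordered by distance to the query atom, environment restricted to a cut-off sphere); the function and argument names are our own, and the units and cut-off value are illustrative.

```python
import numpy as np

def local_coulomb_matrix(Z, R, q, tau):
    """Sorted Coulomb matrix for atom q and its environment within radius tau.

    Z: (n,) nuclear charges; R: (n, 3) Cartesian positions; q: query atom index.
    A sketch of the representation described in the text, not reference code.
    """
    d = np.linalg.norm(R - R[q], axis=1)      # distances to the query atom
    idx = np.argsort(d)                       # order atoms by distance to q
    idx = idx[d[idx] <= tau]                  # locality: keep atoms inside cutoff
    Zs, Rs = Z[idx], R[idx]
    m = len(idx)
    M = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            if i == j:
                M[i, j] = 0.5 * Zs[i] ** 2.4  # diagonal: 0.5 Z_I^2.4
            else:                              # off-diagonal: Z_I Z_J / |R_I - R_J|
                M[i, j] = Zs[i] * Zs[j] / np.linalg.norm(Rs[i] - Rs[j])
    return M
```

Because rows and columns are sorted by distance to the query atom and the environment is truncated at τ, the representation has constant size regardless of the overall molecule, which is what later makes predictions scale linearly in system size.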
We project force vectors into a local coordinate system centered on atom Q, and predict each component separately. The predicted force vector is then reconstructed from these component-wise predictions. We use principal component analysis (PCA) to obtain an atom-centered orthogonal three-dimensional local coordinate system. In analogy to the electronic term in Hellmann-Feynman forces, ∫ dr (r − R_Q) Z_Q n(r) / ‖r − R_Q‖³ [13], we weight atoms by Z_I / ‖R_I − R_Q‖², increasing the influence of heavy atoms and decreasing the influence of distant atoms. Non-degenerate PCA axes are unique only up to sign; we address this by defining the center of charge to be in the positive quadrant. A matching matrix representation is obtained via M_I = (Z_I, X′_I, Y′_I, Z′_I), where X′, Y′, Z′ are projected atom coordinates, and rows are ordered by distance to the central atom Q, yielding an m × 4 matrix, where m is the number of atoms. In both representations, we impose locality by constraining Q's environment to neighboring atoms within a sphere of radius τ.

For interpolation between atomic environments we use kernel ridge regression (KRR) [16], a non-linear regularized regression method effectively carried out implicitly in a high-dimensional Hilbert space ("kernel trick"). [17] Predictions are linear combinations over all training examples in the basis of a symmetric positive definite kernel k: f(z) = Σ_{i=1}^n α_i k(x_i, z), where the α are regression weights for each example, obtained from a closed-form expression minimizing the regularized error on the training data. See Refs. [6, 18–20] for details. As kernel k, we use the Laplacian kernel k(x, z) = exp(−‖x − z‖₁ / (σ d)), where ‖·‖₁ is the L1-norm, σ is a length scale, and d = dim(x). This kernel has shown the best performance for prediction of molecular properties.
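The KRR machinery just described fits into a few lines of linear algebra. The sketch below, with naming of our own and an illustrative regularization value, implements the Laplacian kernel, the closed-form ridge solution α = (K + λI)⁻¹ y, prediction f(z) = Σ_i α_i k(x_i, z), and the 4× median nearest-neighbor heuristic for σ mentioned further on; it assumes inputs are already flattened to fixed-length vectors.

```python
import numpy as np

def laplacian_kernel(X, Xp, sigma):
    # k(x, z) = exp(-||x - z||_1 / (sigma * d)), with d = dim(x)
    d = X.shape[1]
    D = np.abs(X[:, None, :] - Xp[None, :, :]).sum(axis=2)  # pairwise L1 distances
    return np.exp(-D / (sigma * d))

def krr_fit(X, y, sigma, lam=1e-10):
    # Closed-form ridge solution: alpha = (K + lambda I)^(-1) y
    K = laplacian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_query, sigma):
    # f(z) = sum_i alpha_i k(x_i, z)
    return laplacian_kernel(X_query, X_train, sigma) @ alpha

def sigma_heuristic(X):
    # sigma = 4 x median nearest-neighbor L1 distance in the training set
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    np.fill_diagonal(D, np.inf)
    return 4.0 * np.median(D.min(axis=1))
```

With a small λ the model fits the training data closely; evaluating the fitted model back at the training inputs reproduces the reference values up to the regularization, which is the behavior described in the text.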
[18] Our models contain three free parameters: the cut-off radius τ, the regularization strength λ, and the kernel length scale σ. The regularization strength λ, controlling the smoothness of the model, was set to a small constant, forcing the model to fit the QM values closely. As for the length scale σ, note that for the Laplacian kernel, non-trivial behavior requires distances ‖x − z‖₁ on the order of σ. We set σ to four times the median nearest-neighbor L1-norm distance in the training set. [21] Cut-off radii τ were then chosen to minimize RMSE in initial experiments (Fig. 2). For the comparatively insensitive F_C, other statistics (maxAE, R²) yielded an unambiguous choice.

FIG. 2. Locality of properties, measured by model performance as a function of cut-off radius τ. Root mean square error (RMSE) shown as fraction of the corresponding property's range [22] for nuclear shifts (¹³C δ, ¹H δ), core level ionization energy (1s C δ), and atomic forces (F_C, F_H). Asterisks (∗) mark chosen values. Shaded areas indicate 1.6 standard deviations over 15 repetitions.

We used three datasets for validation: For NMR chemical shifts and CIEs, both scalar properties, we employed a dataset of 9 k synthetically accessible organic molecules containing 7–9 C, N, or O atoms, with open valencies saturated by H, a subset of a larger dataset. [23, 24] Relaxation and property calculations were done at the DFT/PBE0/def2TZVP level of theory [25–30] using Gaussian [31]. For forces, we distorted molecular equilibrium geometries of the 168 isomers of C₇H₁₀O₂ using normal mode analysis [32–34] with random perturbations, generating 100 perturbed geometries for each isomer. Computationally inexpensive semi-empirical quantum chemistry approximations are readily available for forces. We exploit this to improve accuracy by modeling the difference between baseline PM7 [35] and DFT reference forces (Δ-learning [10]). To demonstrate linear scaling of computational cost with system size, we used a third dataset of organic saturated polymers, namely linear polyethylene, the most common plastic, with random substitutions of some CH₂ units by NH or O for chemical variety. All prediction errors were measured on out-of-sample hold-out sets never used during training.

Table I presents performance estimates for models trained on 10 k randomly chosen atoms, measured on a hold-out set of 1 k other atoms. Comparison with literature estimates of typical errors of the employed DFT reference method suggests in all cases that the ML models achieve similar accuracy, at negligible computational cost after training. Statistical learning theory shows that under certain assumptions the accuracy of a ML model asymptotically improves with increasing training set size as O(1/√n). [36] Fig. 3 presents corresponding learning curves for all properties. Errors are shown as percentages of property ranges [22], enabling comparison of properties with different units. All errors start off in the single-digit percent range at 1 k training atoms, and decay systematically to roughly half their initial value at 10 k training atoms.

FIG. 3. Systematic improvement in accuracy of atomic property predictions with increasing training set size n. Root mean square error (RMSE) shown as fraction of the corresponding property's range [22] for nuclear shifts (¹³C δ, ¹H δ), core level ionization energy (1s C δ), and atomic forces (F_C, F_H). Values from 15 repetitions; see Table I for ranges and standard deviations. Solid lines are fits to the theoretical asymptotic performance of O(1/√n).

ML predictions and DFT values for the chemical shifts of all 50 k carbon atoms in the dataset are featured in Fig. 4. The shielding of the nuclear spin from the magnetic field depends strongly on the atom's local chemical environment. In accordance with the diversity of the dataset, we find a broad distribution with four pronounced peaks, characteristic of up- or downshifts of the resonant frequencies of nuclear carbon spins. The peaks at 30, 70, 150, and 210 ppm typically correspond to saturated sp³-hybridized carbon atoms, strained sp³-hybridized carbons, conjugated or sp²-hybridized carbon atoms, and carbon atoms in carbonyl groups, respectively. A ML model trained on only 500 atoms already reproduces all major peaks; larger training sets yield systematically improved distributions. For 10 k training examples, predictions are hardly distinguishable from the DFT reference, except for a small deviation at 140 ppm. Using the same model, we predicted shifts for 847 k carbon atoms in all 134 k molecules published in Ref. [24].
The resulting distribution is roughly similar, reflecting the similar chemical composition of the molecules in this much larger dataset, which is beyond the current limits of the DFT reference calculations employed here.

FIG. 4. Distribution of 50 k ¹³C chemical shifts in 9 k organic molecules. ML predictions for increasing training set sizes (0.5 k, 1 k, 10 k) approach DFT reference values. Molecular structures highlight chemical diversity and the effect of the molecular environment on the chemical shift of the query atom (orange; see main text). GDB9 corresponds to ML predictions for 847 k carbon atoms in 134 k similar molecules published in Ref. [24].

The presented approach to modeling atomic properties scales linearly: since only a finite volume around an atom is considered, its numerical representation is of constant size; [45] in particular, it does not scale with the system's overall size. Comparing atoms, and thus kernel evaluations, therefore requires constant computational effort, rendering the overall computational cost of predictions linear in system size, with a small prefactor. Furthermore, a form of chemical extrapolation can be achieved despite the fact that ML models are interpolation models. As long as the local chemical environments of atoms are similar to those in the training set, the model can interpolate. Consequently, using similar local "building blocks", large molecules can be constructed that are very different from the ones used in the training set, but amenable to prediction.

To verify this, we trained a ML model on atoms drawn from the short polymers in the third dataset, then applied the same model to predict properties of atoms in polymers of increasing length. Training set polymers had a backbone length of 29 C, N, O atoms; for validation, we used up to ten times longer backbones, reaching lengths of 355 Å and 696 atoms in total. Fig. 5 presents numerical evidence for excellent near-constant accuracy of model predictions, independent of system size, validated by DFT. Although trained only on the smallest instances, the model's accuracy varies negligibly with system size, confirming both the transferability and the chemically extrapolative predictive power of the ML model.

Individual ML predictions are 4–5 orders of magnitude faster than reference DFT calculations. The overall speed-up depends on the dataset and reference method, and is dominated by training set generation, i.e., the ratio between the number of predictions and the training set size.
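The Δ-learning scheme used above for forces, a cheap PM7 baseline plus an ML-predicted difference to the DFT reference, reduces to a few lines. The sketch below uses a trivial 1-nearest-neighbor regressor as a stand-in for the actual model (the KRR described earlier is one choice); all function names are ours and the data are illustrative.

```python
import numpy as np

# Schematic Delta-learning: train on the difference between reference (DFT)
# and baseline (PM7) values, then add the predicted correction back to the
# baseline at query time. A 1-nearest-neighbor lookup stands in for the
# regressor; any interpolation model can take its place.

def delta_train(X, f_baseline, f_reference):
    # Keep training inputs together with the baseline -> reference residuals.
    return np.asarray(X, float), np.asarray(f_reference) - np.asarray(f_baseline)

def delta_predict(model, x, f_baseline_x):
    # Baseline value plus the residual of the nearest training environment.
    X, residuals = model
    i = np.argmin(np.abs(X - np.asarray(x, float)).sum(axis=1))
    return f_baseline_x + residuals[i]
```

Because the residual between baseline and reference is typically smoother and smaller in magnitude than the reference property itself, the same training set yields a more accurate model than learning the reference directly.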
DFT and ML calculations were done on a high-performance compute cluster and a laptop, respectively.

TABLE I. Prediction errors for ML models trained on 10 k atoms and predicting properties of 1 k other out-of-sample atoms. Calculated properties are NMR chemical shifts (¹³C δ, ¹H δ), core level ionization energy (1s C δ), and forces (F_C, F_H).a

Property        Ref.              Range    MAE    RMSE   maxAE   R²      σ    τ/Å
¹³C δ /ppm      2.4 [30, 37, 38]  6–211    3.9
¹H δ /ppm       0.11 [38–40]      0–10     0.28
1s C δ /mE_h                                      17      0.971
F_C /mE_h/a₀
F_H /mE_h/a₀

a Shown are the MAE of the DFT reference from the literature (Ref.), property ranges [22], mean absolute error (MAE), root mean squared error (RMSE), maximum absolute error (maxAE), squared correlation (R²), and hyperparameters (kernel length scale σ, cut-off radius τ). Averages ± standard deviations over 15 randomly drawn training sets.

FIG. 5. Linear scaling and chemical extrapolation for ML predictions of saturated polymers of increasing length. Shown are root mean square errors (RMSE), given as fractions of the corresponding property's range [22], as well as indicative compute times of cubically scaling DFT calculations (gray bars) and ML predictions (black bars, enlarged for visibility), which scale linearly with a low prefactor. See Table I for property ranges.

In conclusion, we have introduced ML models for QM properties of atoms in molecules. Performance and applicability have been demonstrated for chemical shifts, core level ionization energies, and atomic forces of 9 k chemically diverse organic molecules and 168 isomers of C₇H₁₀O₂, respectively. The accuracy of predictions is on par with the QM reference method. We have used the ML model to predict chemical shifts of all 847 k carbon atoms in the 134 k molecules published in Ref. [24]. Locality of the modeled atomic properties is exploited through the use of atomic environments as building blocks. Consequently, the model scales linearly in system size, which we have demonstrated for saturated linear polymers over 30 nm in length. The results suggest that the model could be useful in mesoscale studies.

For the investigated molecules and properties, the locality assumption, implemented as a finite cut-off radius in the representation, has proven sufficient. This might not necessarily be true in general. The Hellmann-Feynman force, for example, depends directly on the electron density, which can be altered substantially by long-range substituent effects such as those in conjugated π-bond systems. For other systems and properties, larger cut-offs or additional measures might be necessary.

The presented ML models could also be used for nuclear shift assignment in NMR structure determination, or for molecular dynamics of macromolecules or condensed molecular phases.
We consider efficient sampling, i.e., improving the ratio of performance to training set size ("sample efficiency"), and improving representations to be the primary challenges in the further development of these models.

We thank Tristan Bereau, Zhenwei Li, and Kuang-Yu Samuel Chang for helpful discussions. OAvL acknowledges the Swiss National Science Foundation for support (SNF grant PP00P2 138932). Calculations were performed at the sciCORE (scicore.unibas.ch) scientific computing core facility at the University of Basel. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. DOE under contract DE-AC02-06CH11357.

∗ [email protected]; Current address: Fritz Haber Institute of the Max Planck Society, Faradayweg 4–6, 14195 Berlin, Germany.
† [email protected]

[1] C. Bekas and A. Curioni, Comput. Phys. Comm., 1057 (2010).
[2] I. Duchemin and F. Gygi, Comput. Phys. Comm., 855 (2010).
[3] G. H. Booth, A. J. Thom, and A. Alavi, J. Chem. Phys., 054106 (2009).
[4] A. Benali, L. Shulenburger, N. A. Romero, J. Kim, and O. A. von Lilienfeld, J. Chem. Theor. Comput., 3417 (2014).
[5] T. Helgaker, P. Jørgensen, and J. Olsen, Molecular Electronic-Structure Theory (Wiley, Chichester, England, 2000).
[6] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett., 058301 (2012).
[7] O. A. von Lilienfeld, Int. J. Quant. Chem., 1676 (2013).
[8] G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, New J. Phys., 095003 (2013).
[9] R. Ramakrishnan, M. Hartmann, E. Tapavicza, and O. A. von Lilienfeld, arXiv, 1504.01966 (2015).
[10] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, J. Chem. Theor. Comput., 2087 (2015).
[11] Z. Li, J. R. Kermode, and A. D. Vita, Phys. Rev. Lett., 096405 (2015).
[12] Calculated as δ = (σ_ref − σ_sample)/(1 − σ_ref).
[13] R. P. Feynman, Phys. Rev., 340 (1939).
[14] L. M.
Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, Phys. Rev. Lett., 105503 (2015).
[15] O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, and A. Knoll, Int. J. Quant. Chem., 1084 (2015).
[16] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. (Springer, New York, 2009).
[17] B. Schölkopf and A. Smola,
Learning with Kernels (MIT Press, Cambridge, 2002).
[18] K. Hansen, G. Montavon, F. Biegler, S. Fazli, M. Rupp, M. Scheffler, O. A. von Lilienfeld, A. Tkatchenko, and K.-R. Müller, J. Chem. Theor. Comput., 3543 (2013).
[19] K. Vu, J. Snyder, L. Li, M. Rupp, B. F. Chen, T. Khelif, K.-R. Müller, and K. Burke, Int. J. Quant. Chem., 1115 (2015).
[20] L. Li, J. C. Snyder, I. M. Pelaschier, J. Huang, U.-N. Niranjan, P. Duncan, M. Rupp, K.-R. Müller, and K. Burke, arXiv, 1404.1333 (2014).
[21] J. Moody and C. J. Darken, Neural. Comput., 281 (1989).
[22] Instead of max(p) − min(p), which are extrema that strongly fluctuate across subsets of the data, we use the more robust 99 % and 1 % quantiles.
[23] L. Ruddigkeit, R. van Deursen, L. C. Blum, and J.-L. Reymond, J. Chem. Inf. Model., 2864 (2012).
[24] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, Scientific Data, 140022 (2014).
[25] K. Burke, J. Chem. Phys., 150901 (2012).
[26] A. D. Becke, J. Chem. Phys., 18A301 (2014).
[27] J. P. Perdew, K. Burke, and M. Ernzerhof, Phys. Rev. Lett., 3865 (1996).
[28] F. Weigend and R. Ahlrichs, Phys. Chem. Chem. Phys., 3297 (2005).
[29] F. Weigend, Phys. Chem. Chem. Phys., 1057 (2006).
[30] C. Adamo and V. Barone, Chem. Phys. Lett., 113 (1998).
[31] M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G. A. Petersson, H. Nakatsuji, M. Caricato, X. Li, H. P. Hratchian, A. F. Izmaylov, J. Bloino, G. Zheng, J. L. Sonnenberg, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J.-Y. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, J. Montgomery, John A., J. E. Peralta, F. Ogliaro, M. Bearpark, J. J. Heyd, E. Brothers, K. N. Kudin, V. N. Staroverov, R. Kobayashi, J. Normand, K. Raghavachari, A. Rendell, J. C. Burant, S. S. Iyengar, J. Tomasi, M. Cossi, N. Rega, J. M. Millam, M. Klene, J. E. Knox, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J.
Austin, R. Cammi, C. Pomelli, J. W. Ochterski, R. L. Martin, K. Morokuma, V. G. Zakrzewski, G. A. Voth, P. Salvador, J. J. Dannenberg, S. Dapprich, A. D. Daniels, Ö. Farkas, J. B. Foresman, J. V. Ortiz, J. Cioslowski, and D. J. Fox, "Gaussian 09 revision D.01," Gaussian Inc., Wallingford CT, 2009.
[32] J. W. Ochterski, Vibrational Analysis in Gaussian, Gaussian Inc., Wallingford, Connecticut, USA (1999).
[33] W. Schneider and W. Thiel, Chem. Phys. Lett., 367 (1989).
[34] M. J. Mills and P. L. Popelier, Theor. Chim. Acta, 1137 (2012).
[35] J. J. P. Stewart, J. Comput. Chem., 209 (1989).
[36] S. Amari, N. Fujita, and S. Shinomoto, Neural. Comput., 605 (1992).
[37] K. Dybiec and A. Gryff-Keller, Magn. Reson. Chem., 63 (2009).
[38] D. Flaig, M. Maurer, M. Hanni, K. Braunger, L. Kick, M. Thubauville, and C. Ochsenfeld, J. Chem. Theor. Comput., 572 (2014).
[39] P. d'Antuono, E. Botek, B. Champagne, M. Spassova, and P. Denkova, J. Chem. Phys., 144309 (2006).
[40] D. E. Hill, N. Vasdev, and J. P. Holland, Comput. Theor. Chem., 161 (2015).
[41] V. Myrseth, J. D. Bozek, E. Kukk, L. J. Sæthre, and T. D. Thomas, J. Electron Spectros. Relat. Phenom., 57 (2002).
[42] T. Karlsen, K. J. Børve, L. J. Sæthre, K. Wiesner, M. Bässler, and S. Svensson, J. Am. Chem. Soc., 7866 (2002).
[43] A. Holme, K. J. Børve, L. J. Sæthre, and T. D. Thomas, J. Chem. Theor. Comput., 4104 (2011).
[44] Gaussian's default convergence threshold of 1 mE_h/a₀