Alchemical and structural distribution based representation for improved QML
Felix A. Faber, Anders S. Christensen, Bing Huang, O. Anatole von Lilienfeld
Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials, Department of Chemistry, University of Basel, Switzerland.
(Dated: December 25, 2017)

We introduce a representation of any atom in any chemical environment for the generation of efficient quantum machine learning (QML) models of common electronic ground-state properties. The representation is based on scaled distribution functions explicitly accounting for elemental and structural degrees of freedom. Resulting QML models afford very favorable learning curves for properties of out-of-sample systems including organic molecules, non-covalently bonded protein side-chains, (H2O)40 clusters, as well as diverse crystals. The elemental components help to lower the learning curves, and, through interpolation across the periodic table, even enable "alchemical extrapolation" to covalent bonding between elements not part of training, as evinced for single, double, and triple bonds among main-group elements.

I. INTRODUCTION
Ground-state properties of chemical compounds can generally be estimated with acceptable accuracy using methods such as ab initio quantum chemistry or density functional theory (DFT). However, these can be computationally expensive and therefore have limited applicability, especially for larger systems. Alternatively, inductive quantum machine learning (QML) models can infer properties directly, or even predict the electron density which in turn can be used to calculate properties, by training on large data sets of reference property/compound pairs. ML models can have an exceptional trade-off between predictive accuracy and computational cost. For example, in 2017 we showed that QML models can estimate hybrid DFT atomization energies, as well as several other properties, of medium-sized organic molecules with prediction errors lower than chemical accuracy (~1 kcal/mol).

The system variables defining the ground-state properties of a given compound are its external potential, a simple function of interatomic distances and nuclear charges. However, using this information directly to measure similarity results in QML models with rather disappointing predictive power. This can be mitigated by transformation of the system variables into "representations". Such transformations can either be designed by human intuition, or be included in the learning problem, e.g. when using neural networks (NN) which include representation learning in the supervised learning task. Letting a NN find the representation has proven to yield models with low out-of-sample prediction errors. This approach, however, has the drawback that representation and model are intermingled within the NN, making it less amenable to human understanding, interpretation, adaptation, and further improvement. Furthermore, such machine-designed representations do not necessarily lead to better QML performance than representations based on human design (vide infra).

There are many ways of manually encoding the 3D structure and chemical composition of a compound into a suitable representation. For example, we can represent a compound as a list of interatomic potentials. Another approach consists of creating a fingerprint of the compound, transforming internal coordinates into a fixed set of numbers. For example, this can be done by projecting the coordinates onto a set of basis functions, or by creating a "fingerprint" from the topology of the structure. Distributions of internal coordinates represent another systematic approach, shown to yield well-performing QML models applicable throughout chemical space. Additional use of bags containing angular and dihedral distributions has led to further improvements in resulting QML models. Bagging based on atom types, however, severely hinders resulting QML models from transferring what has been learned from one atom type to another, a desirable feature for chemically diverse systems.

FIG. 1. The three-body term (A_3(·)) as a function of radial (d) and angular (θ) degrees of freedom in the atomic environments of O, C and H (circled) in ethanol. For simplicity, we show the three-body term without elemental smearing, where it reduces to a number of two-dimensional distributions for each element triplet.

In this work we introduce a new atomic environment representation, with two key differences to previous distribution-based work. (i) The representation is not binned by atom types. Instead, compositional information is encoded directly into the distributions. This allows measuring not only structural differences, but also "alchemical" differences between elements in the atomic environments. The idea of computational alchemy, amounting to continuous interpolation of the Hamiltonians of two different systems, is well established in quantum chemistry and statistical mechanics, and can be exploited for virtual exploration campaigns in chemical space with increased efficiency.
Recently, it has been shown that alchemical estimates of covalent bond potentials can even surpass generalized-gradient-approximated DFT accuracy. The foundation of a continuous chemical space has been reviewed previously. Alchemical distance measures in the context of QML were already exploited previously when using the Coulomb matrix, Fourier-series distribution based representations, the Faber, Lindmaa, Lilienfeld, Armiento (FLLA) crystal representation, and within smooth overlap of atomic potentials (SOAP) representations. For this work we have identified a new functional form with improved performance due to alchemical contributions to the distance measure. (ii) We use a set of multidimensional distributions of interatomic many-body expansions scaled by simple power laws, rather than several 1D bins of internal coordinates. The distributions are built recursively, so that an m-body distribution contains the same information as the (m−1)-body distribution, plus the additional m-body information. This particular combination combines similarity to the potential energy target function and compliance with many known (translational, rotational, permutational) invariances.

II. THEORY
In this section, we first motivate the ideas which have led to this study. Thereafter, we discuss the functional form and the variational degrees of freedom which we have introduced, as well as the resulting compound distances. Then, an analysis of the functional form is performed using the water molecule as an example. Finally, numerical results for parameter optimization runs are discussed.
A. Kernel ridge regression
In order to profit from robustness, ease of error convergence, computational efficiency, and simplicity, we base our studies preferably on kernel ridge regression (KRR) models. However, we consider this rather a question of taste, and believe that other regressors, such as neural networks, will produce similar results if properly converged.

KRR estimates property p of query compound C as a weighted sum of kernel basis functions placed on each of N training compounds {C_k},

p_est(C) = Σ_{k=1}^{N} α_k K(C, C_k),   (1)

α = (K + λI)^{-1} p_train,   (2)

where the solution for the weights {α_k} is obtained through linear regression with regularizer λ (typically negligibly small because of the absence of noise in quantum training data). Note that throughout this work we rely on atomistic (scalable) Gaussian kernels, K(C, C') = Σ_{I∈C} Σ_{J∈C'} k(Δ(A_M(I), A_M(J))), as already used previously. As such, KRR renders mandatory the selection of a functional form which represents an atom in its chemical environment. Obviously, this choice is fundamentally related to our understanding of chemistry, and is known to dramatically affect the performance of resulting QML models. It is for this reason that we draw our inspiration from the fundamental laws of quantum mechanics, which specify the definition of system (Hamiltonian) and property (observable), and which spell out the numerical recipe which links the two.

The genesis of this study is due to the fact that the total potential energy, the expectation value of a compound's electronic Hamiltonian, constitutes the central figure of merit for convergence towards the wavefunction by virtue of the variational principle. When considering Eq. (2) it should be obvious that the kernel (and thereby the representation) is independent of the specific property; units and property dependence are introduced through the regression weights only.
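As a concrete illustration, Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch using a plain Gaussian kernel on fixed-size feature vectors; in the paper the kernel instead acts on the atomic distances Δ(A_M(I), A_M(J)) defined below, and all names here are illustrative:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_train(X_train, p_train, sigma, lam=1e-10):
    """Eq. (2): solve (K + lambda*I) alpha = p_train for the weights."""
    K = gaussian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), p_train)

def krr_predict(X_query, X_train, alpha, sigma):
    """Eq. (1): p_est(C) = sum_k alpha_k K(C, C_k)."""
    return gaussian_kernel(X_query, X_train, sigma) @ alpha
```

Since the kernel matrix depends only on the representations, a new property requires only a new solve of Eq. (2) with a different p_train; the kernel itself is reused.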
This has also already been demonstrated numerically for multiple properties using the same kernel. As such, the role of the kernel is reminiscent of the wavefunction, which can be used to predict arbitrarily many observables by evaluating the expectation values of the corresponding operators, always using the same wavefunction: once the kernel is inverted, arbitrarily many sets of regression coefficients can easily be generated, provided that their corresponding property reference values have been provided. The potential electronic energy being the central property in quantum mechanics which defines the wavefunction, it is plausible to assume that a representation optimized for energy predictions only is fundamentally more advantageous than representations obtained by minimizing prediction errors of alternative observables. Consequently, the focus of this study has been to identify a representation which is inspired by the energy changes occurring due to changes in chemical composition and covalent and non-covalent bonding. The accuracy of quantum mechanics when predicting other properties (observables) as expectation values of operators depends crucially on the quality of the wavefunction. Here, we follow a similar argument: the better the representation, the better the energy prediction, implying that energy prediction errors can be minimized in the functional space of the representation, enabling systematic convergence towards an ideal representation.

B. Representation
We use a set of interatomic M-body expansions A_M(I) = {A_1(I), A_2(I), A_3(I), ..., A_M(I)} which contain up to M-body interactions to represent the structural and chemical environment of an atom I in compound C. A_m(I) is a weighted sum that runs over all m-body interactions. Each element in the sums consists of Gaussian basis functions, placed on structural and elemental degrees of freedom, and multiplied by a scaling function ξ_m. Structural values encode geometrical information about the system, such as interatomic distances or angles. As elemental parameters we use the period P and group G from the periodic table. The scaling functions ξ_m are used to weigh the importance of each Gaussian, based on internal system coordinates. We now consider only the first three distributions in A_M(I) for an atom I. We have also derived, implemented, and tested the 4-body distributions A_4(I); we refer to the supporting information (SI) for the derivation. The predictive accuracy improvements of resulting QML models, however, were found to be negligible in comparison to the 3-body expansion. Also, the computational cost for generating large kernel matrices increases substantially when going from third to fourth order terms.

The first-order expansion A_1(I) accounts for chemical composition (stoichiometry) and is modeled by a Gaussian function placed on period P_I and group G_I of element I:

A_1(I) = N(x^(1)_I) = exp( −(P_I − χ_1)²/σ_P² − (G_I − χ_2)²/σ_G² ),   (3)

where x^(1)_I = {P_I, σ_P; G_I, σ_G}, with respective widths σ_P and σ_G. σ_P and σ_G can be seen as elemental smearing parameters, which control the near-sightedness of elements in the periodic table. χ_1 and χ_2 represent dummy variables for period and group, to be integrated out when evaluating the Euclidean distance (see Eq. (4)). For A_1(I), the scaling function is set to unity, since stoichiometry is geometry independent.
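A minimal numerical sketch of Eq. (3); the widths are chosen arbitrarily for illustration:

```python
import numpy as np

def A1(P_I, G_I, chi1, chi2, sigma_P=1.6, sigma_G=1.6):
    """Eq. (3): Gaussian placed at period P_I and group G_I, evaluated
    at the dummy variables (chi1, chi2). The widths sigma_P and sigma_G
    are the elemental smearing parameters (values here illustrative)."""
    return np.exp(-(P_I - chi1) ** 2 / sigma_P ** 2
                  - (G_I - chi2) ** 2 / sigma_G ** 2)

# Carbon sits in period 2, group 14; A1 peaks at (2, 14) and decays
# smoothly for neighboring periods and groups, encoding elemental similarity.
peak = A1(2, 14, 2.0, 14.0)      # = 1.0
silicon = A1(2, 14, 3.0, 14.0)   # overlap with period 3, same group
```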
We are not aware of other representations in the literature which employ similar distribution functions in the periodic table.

The second-order expansion A_2(I) is a product of A_1(I) and a sum that runs over all neighboring atoms i:

A_2(I) = N(x^(1)_I) Σ_{i≠I} N(x^(2)_iI) ξ_2(d_iI),

where x^(2)_iI = {d_iI, σ_d; P_i, σ_P; G_i, σ_G}, and d_iI and σ_d correspond to the interatomic distance at which a Gaussian is placed, and its width, respectively. ξ_2 corresponds to the 2-body, interatomic-distance-dependent scaling function, which takes the form of the power laws discussed below. Note that letting σ_P and σ_G approach zero is equivalent to using a radial distribution function (RDF) for each element pair. This attribute of the representation holds for any of the A_m(I), i.e. σ_P, σ_G → 0 yields a distribution function for each element m-tuple in A_m(I). While atom-pairwise distribution functions are rampant as representation choice, especially for fitting potential energy surfaces of systems with fixed chemical composition, to the best of our knowledge, combining them with scaling functions is novel.

A_3(I) is the logical extension from A_2(I); it has a different scaling function with an additional summation, running over all neighboring atoms j:

A_3(I) = N(x^(1)_I) Σ_{i≠I} N(x^(2)_iI) Σ_{j≠i,I} N(x^(3)_ijI) ξ_3(d_iI, d_jI, θ_Iij),

where x^(3)_ijI = {θ_Iij, σ_θ; P_j, σ_P; G_j, σ_G}. P_j and G_j, similarly to P_i and G_i, correspond to the period and group of atom j. Again, ξ_3(d_iI, d_jI, θ_Iij) is the (three-body) scaling function, and θ_Iij the principal angle between the two distance vectors r_Ii and r_Ij, which span from I to i and from I to j, respectively. σ_θ is the width of the Gaussian placed at θ_Iij. Letting σ_d go to infinity in A_3 is equivalent to using a type of angular distribution function (ADF), which in one form or another has already been used in several representations.
A_3 can therefore be seen as a generalized ADF containing more structural information. Fig. 1 illustrates how A_3(I) looks for a hydrogen, a carbon, and the oxygen atom in ethanol. Three-body distributions are less frequent as representation choice, and again, to the best of our knowledge, combining them with scaling functions is novel.

The scaling functions ξ we have chosen for this work correspond to simple power laws. They have been modified from the leading-order two- and three-body dispersion laws by London (1/r^6) and Axilrod-Teller-Muto. Such dispersion expressions were previously already used by some of us. Our scaling functions, however, use different exponents for the radial decay, and set the C6 and C9 coefficients to unity, as early tests indicated better performance for this choice. For periodic systems, however, a very large cutoff radius would be needed in order to converge the distances between two atomic environments when using the optimized exponents. We have therefore augmented the scaling functions by a previously used soft cutoff function, which goes to zero at 9 Å.

C. Distances and scalar products
In order to train and evaluate the KRR model in Eq. (1), proper distance measures must be specified. We have found good performance when using, as a distance between two atomic environments A_M(I) and A_M(J), a weighted sum of the distances between each m-body expansion: Δ(A_M(I), A_M(J)) ≡ Σ_{m=0}^{M} β_m Δ(A_m(I), A_m(J)). Here, β_m is another hyperparameter, which weighs the importance of each expansion order.

The distances between each distribution term are evaluated as Euclidean (L2) norms, as shown in Eq. (4), where the ς_m are normalization factors which ensure that all individual basis functions integrate to 1 in the L2-norm:

Δ(A_m(I), A_m(J))² = (1/ς_m) ∫ dχ_1 ⋯ dχ_{D_m} (A_m(I) − A_m(J))²,   (4)

with D_m the number of structural and elemental dummy variables of the m-body term. All integrals can be solved analytically since they consist of sums of Gaussian products. For the first-order term, for instance, the scalar product reduces to

(1/ς_1) ∫ dχ_1 dχ_2 A_1(I) A_1(J) = (1/2) exp( −(P_I − P_J)²/(2σ_P²) − (G_I − G_J)²/(2σ_G²) ).

The analogous scalar products between the A_2 and A_3 terms additionally carry single and double sums over the neighboring atoms of I and J, with each Gaussian overlap in distance, angle, period, and group weighted by the scaling functions ξ_2 and ξ_3; their explicit forms can be found in the SI. Note that the third and fourth order terms become prohibitively expensive to calculate directly. However, this can to a large extent be circumvented by slightly modifying the distributions and solving the angular integrals in Fourier space. Further details about the corresponding equations and derivations can also be found in the SI.
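Because every term is a Gaussian, the scalar products behind Eq. (4) reduce to closed-form overlap integrals. A small sanity check of the one-dimensional building block (our own verification sketch, not code from the paper):

```python
import numpy as np

def overlap_analytic(a, b, sigma):
    """Closed form of  int exp(-(a-x)^2/s^2) * exp(-(b-x)^2/s^2) dx."""
    return sigma * np.sqrt(np.pi / 2.0) * np.exp(-(a - b) ** 2 / (2.0 * sigma ** 2))

def overlap_numeric(a, b, sigma):
    """Brute-force quadrature of the same Gaussian product."""
    x = np.linspace(-50.0, 50.0, 200001)
    f = np.exp(-(a - x) ** 2 / sigma ** 2) * np.exp(-(b - x) ** 2 / sigma ** 2)
    return f.sum() * (x[1] - x[0])
```

The same factorization carries over dimension by dimension, which is why the distances between A_2 and A_3 terms remain analytic.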
D. Comparison to other distribution based representations
Probably the largest difference in how A_3 represents nuclear configurations, when compared to many of the previously published distribution-based representations, lies in the 3-body term (since A_2(·) reduces to a radial distribution function if σ_P and σ_G go to zero). In this subsection, we highlight the differences between A_3(·) and conventional ADFs or RDFs for representing the structure of the water molecule.

As ADF, we use A_3(·) in the limit σ_d → ∞, and we model the RDF by A_2(·). Furthermore, no scaling function (ξ_2 = ξ_3 = 1) is used, and we let σ_P and σ_G go to zero, since we only examine how representations distinguish structural differences among different geometries of the water molecule. This results in A_3(·) and the ADF being Σ_{i≠I} N(d_iI, σ_d) Σ_{j≠i,I} N(θ_Iij, σ_θ) and Σ_{i≠I} Σ_{j≠i,I} N(θ_Iij, σ_θ) for each element triplet, and the RDF being Σ_{i≠I} N(d_iI, σ_d) for each element pair.

Fig. 2 shows how the distance measure changes as one distorts the geometry away from its equilibrium structure. Both RDF and ADF result, for oxygen as well as for H, in large configurational domains with substantially zero distance to the minimum, implying a severe lack of uniqueness. A_3, by contrast, produces a qualitatively meaningful picture with a single well-defined well around the minimum.

We have also studied the performance for modeling the energy of the water molecule. In Fig. 3, the training error for atomization energies is shown for a linear-kernel KRR model with A_3(·), ADF, RDF, or RDF + ADF as representations. The linear kernel is used as a difficult test of how far representations can model a nonlinear property, such as the energy, in terms of linear basis functions. The errors are significantly lower when using A_3 instead of the other representations, including RDF + ADF.
Generally, the potential energy surface of a three-atom system cannot be decomposed into many-body terms each a function of only one internal coordinate (internuclear distance d or angle θ). That is, E(d, θ) ≠ E_1(d) + E_2(θ). Using an ADF, an RDF, or a linear combination of the two, however, would result precisely in such a model, as would most force fields. This also explains the relatively large errors for these representations, as well as the unreliable performance of pair-wise potentials when it comes to distorted molecules. A_3, on the other hand, does not decouple distances and angles, and can, by construction, model any three-body potential.

These observations give insight as to why our new representation performs better than the other distribution-based representations: ADFs and RDFs may be able to capture slices of the many-body picture; the fact that there is a linear mapping between A_n(·) and an n-body potential energy surface, however, appears to make it easier to improve the performance also for non-linear kernels.

FIG. 2. Heat maps of normalized L2 distances for three representations (RDF, ADF, and our new representation). The color code from black to white indicates a distance range from 0 to 1, respectively. The distances are measured between oxygen (LEFT) and hydrogen atoms (RIGHT) in two different water molecules. One water molecule is being distorted by uniform stretching of both OH bonds (d_OH1 = d_OH2 = l) and bending (φ). The other water molecule is kept fixed at its equilibrium geometry (cross).
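The non-separability argument can be checked directly: a least-squares fit of the additive form E_1(d) + E_2(θ) to a deliberately coupled toy surface leaves a finite residual, while a separable surface is fitted exactly. This is an illustrative toy example, not the DFT data of Fig. 3:

```python
import numpy as np

# Coupled toy "potential" E(d, theta) = d * theta on a small grid.
d = np.linspace(0.8, 1.2, 5)
t = np.linspace(1.5, 2.1, 5)
D, T = np.meshgrid(d, t, indexing="ij")
E = (D * T).ravel()

# Design matrix for the additive model E1(d) + E2(theta):
# one indicator column per grid value of d and of theta.
X = np.zeros((E.size, d.size + t.size))
for k, (i, j) in enumerate(np.ndindex(d.size, t.size)):
    X[k, i] = 1.0           # E1(d_i)
    X[k, d.size + j] = 1.0  # E2(theta_j)

coef, *_ = np.linalg.lstsq(X, E, rcond=None)
residual = np.abs(X @ coef - E).max()        # > 0: coupling not captured

# A genuinely separable surface, by contrast, is reproduced exactly.
E_sep = (D + T).ravel()
coef2, *_ = np.linalg.lstsq(X, E_sep, rcond=None)
sep_residual = np.abs(X @ coef2 - E_sep).max()
```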
1. Hyper-parameters
The use of our representation in combination with KRR yields multiple hyperparameters. While one could, in principle, attempt to optimize all of them using several data sets and efficient optimizers, such as gradient, Monte Carlo, genetic, or simplex methods, we have found that the problem is sensitive only to a small subset of parameters. As such, the exact choice of many hyperparameters is not critical for the out-of-sample errors, and resulting models typically perform well as long as values of a similar order of magnitude are used. Unless otherwise specified, the following hyperparameter values have been used: σ_P = σ_G = 1.…, σ_d = 0.…, σ_θ = π, β_1 = 1, β_2 = √…, β_3 = 1.6. For the water cluster and the SSI data set there is little to no variation in chemical composition, and no alchemical smearing has been used.

FIG. 3. Heat maps of the signed error of atomization energies of the water molecule, for the same coordinate system as in Fig. 2. The errors correspond to linear kernels in KRR fitted to DFT-calculated (PBE/def2-SVP) energies. Four representations have been used: TOP LEFT: our new A_3(·). TOP RIGHT: radial distribution function for each element pair (RDF). BOTTOM LEFT: angular distribution function for each element triplet (ADF). BOTTOM RIGHT: RDF + ADF. The training data consists of an equidistant grid of 50-by-50 points along l and φ within the range of the figures.
2. Scaling powerlaw parameters
We have screened radial exponents for the scaling functions ξ_2(d_iI) = 1/d_iI^{n_2} and ξ_3(d_iI, d_jI, θ_Iij) = (1 + 3 cos θ_Iij cos θ_iIj cos θ_jiI)/(d_iI d_jI d_ij)^{n_3}, using atomization energies for a subset of the QM9 dataset in order to identify the optimal exponents. Corresponding learning curves are shown in Fig. 4. First, we screened ξ_2, using A_2 as representation, identifying the exponent n_2 yielding the lowest learning-curve offset. Keeping ξ_2 fixed, we then proceeded to screen the exponent n_3 of ξ_3 in A_3. We have used the resulting values throughout this work; unless otherwise specified, the optimal scaling functions read

ξ_2(d_iI) = 1/d_iI^{n_2},   ξ_3(d_iI, d_jI, θ_Iij) = (1 + 3 cos θ_Iij cos θ_iIj cos θ_jiI)/(d_iI d_jI d_ij)^{n_3},   (5)

with the optimal exponents as identified in Fig. 4.
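A sketch of the scaling functions discussed above. The exponents and the cutoff form are illustrative assumptions here (the half-cosine cutoff is a common choice; the paper's exact forms and optimal exponents are given in the SI and Fig. 4):

```python
import numpy as np

def soft_cutoff(d, r_cut=9.0):
    """Smooth damping that is 1 at d = 0 and 0 beyond r_cut (Angstrom).
    Half-cosine form; the exact cutoff used in the paper may differ."""
    return np.where(d < r_cut, 0.5 * (np.cos(np.pi * d / r_cut) + 1.0), 0.0)

def xi2(d_iI, n2=4.0):
    """Two-body power-law scaling 1/d^n2 (exponent illustrative)."""
    return d_iI ** -n2

def xi3(d_iI, d_jI, d_ij, n3=2.0):
    """Three-body Axilrod-Teller-Muto-like scaling, (1 + 3 cos cos cos)
    over the product of the three sides raised to n3 (exponent illustrative).
    The angles follow from the law of cosines for the triangle (I, i, j)."""
    cos_I = (d_iI**2 + d_jI**2 - d_ij**2) / (2 * d_iI * d_jI)
    cos_i = (d_iI**2 + d_ij**2 - d_jI**2) / (2 * d_iI * d_ij)
    cos_j = (d_jI**2 + d_ij**2 - d_iI**2) / (2 * d_jI * d_ij)
    return (1.0 + 3.0 * cos_I * cos_i * cos_j) / (d_iI * d_jI * d_ij) ** n3
```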
FIG. 4. Optimization of exponents in scaling power laws. Out-of-sample MAE for atomization/formation energy predictions as a function of training set size on the QM9 data set. LEFT: learning curves generated using KRR with A_2 as representation; the legend indicates the exponent n_2 in ξ_2(d). RIGHT: learning curves generated using KRR with A_3 as representation; the legend indicates the exponent n_3 in ξ_3(d, d, θ).
FIG. 5. Changes in out-of-sample MAE as a function of uniform Gaussian width (σ_P and σ_G) used for elemental smearing. Results for energy predictions on the OQMD (LEFT) and QM9 (RIGHT) datasets, respectively. Legends indicate the training set size.
3. Alchemical smearing
Parameters associated with the elemental smearing have also turned out to have a strong effect on the predictive power of the QML models. We have therefore screened the corresponding values of σ_P and σ_G using energy prediction errors for the OQMD and QM9 data sets for different training set sizes. These two datasets have been used due to their relatively high (OQMD) and low (QM9) chemical diversity in terms of the number of differing elements in the data set. The optimal alchemical Gaussian widths vary only slightly across the two sets, as shown in Fig. 5. For OQMD, a circular Gaussian with the optimal width σ_P = σ_G lowers the MAE by ~34%, compared to vanishing widths σ_P = σ_G → 0, when 1k training samples are used. Prediction errors for the QM9 data set indicate similar behavior, yet much less pronounced. For the largest training set (1000 molecules), the optimization well becomes very shallow, consistent with the lack of compositional diversity in QM9.

Unsurprisingly, datasets with higher chemical diversity benefit more from using the optimized elemental widths. It may therefore not always be beneficial to include any elemental overlap, especially for datasets with low elemental diversity, as it is computationally more expensive to do so.
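The effect of the smearing widths can be seen directly from the normalized overlap of two A_1-type Gaussians (a sketch; the widths are illustrative):

```python
import numpy as np

def elemental_overlap(PG1, PG2, sigma_P, sigma_G):
    """Normalized overlap of two A1 Gaussians in (period, group) space:
    1 for identical elements, decaying with elemental 'distance'."""
    (P1, G1), (P2, G2) = PG1, PG2
    return np.exp(-(P1 - P2) ** 2 / (2.0 * sigma_P ** 2)
                  - (G1 - G2) ** 2 / (2.0 * sigma_G ** 2))

C, Si = (2, 14), (3, 14)  # same group, adjacent periods

wide = elemental_overlap(C, Si, 1.6, 1.6)    # substantial overlap
narrow = elemental_overlap(C, Si, 0.1, 0.1)  # effectively disjoint
```

With finite widths, what the model learns about C environments partially transfers to Si; as σ_P, σ_G → 0 the elements decouple and the representation reduces to element-resolved distribution functions.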
III. DATA SETS
We have used multiple datasets to benchmark out-of-sample accuracy of energy predictions of our model. These datasets include organic molecules, crystals, biomolecular dimers, water clusters, and main-group diatomics. Some of the datasets are high-quality, have already been published, and are in widespread use. Additional lower-quality data sets have been generated merely in order to accumulate additional evidence for the relative improvement of the new representation. Since training set errors are always close to zero by construction, we exclusively report prediction errors as out-of-sample errors (averaged through cross-validation) with respect to reference validation numbers. All errors reported correspond to at least 10 cross-validation runs for each training set size.
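The error-reporting protocol above can be sketched as follows: for each training set size N, draw repeated random train/test splits and average the out-of-sample MAE. This is a schematic with synthetic data and a trivial stand-in regressor, illustrative only:

```python
import numpy as np

def learning_curve(X, y, train_sizes, n_repeats=10, n_test=200, seed=0):
    """Mean out-of-sample MAE vs. training set size, averaged over
    n_repeats random splits (here with a 1-nearest-neighbour 'model')."""
    rng = np.random.default_rng(seed)
    curve = []
    for n in train_sizes:
        maes = []
        for _ in range(n_repeats):
            idx = rng.permutation(len(X))
            tr, te = idx[:n], idx[n:n + n_test]
            # 1-NN prediction as a stand-in for the KRR regressor
            d = np.abs(X[te, None] - X[None, tr])
            pred = y[tr][d.argmin(axis=1)]
            maes.append(np.abs(pred - y[te]).mean())
        curve.append(np.mean(maes))
    return curve
```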
A. Organic molecules: QM9
The QM9 dataset corresponds to hybrid DFT based structures and properties of 134k organic molecules with up to nine atoms (C, O, N, or F), not counting hydrogen. SMILES strings of these molecules correspond to a subset of the GDB-17 dataset. The 3k organic molecules which fail SMILES consistency tests were removed before use.

A random subset of 22k molecules was selected from QM9 for training and testing: 2k molecules were used for testing, and up to 20k for training.

B. Organic molecules: QM7b
Due to widespread use we also included the more established QM7b dataset. QM7b was also derived from GDB. It contains hybrid DFT (PBE0) structures and properties of ~7k organic molecules with up to seven atoms (C, O, N, S or Cl), not counting H. We have drawn at random up to 5k molecules for training, and 2k for testing.
C. Biomolecular dimers: SSI
For inter-molecular and non-equilibrium interactions we used a subset of 2356 neutral dimers from the recently published protein sidechain-sidechain interaction (SSI) dataset by Burns et al. The SSI dataset is a collection of dimers mimicking configurations of interacting amino-acid sidechains as observed in a set of 47 high-resolution protein crystal structures. The energies correspond to the DW-CCSD(T**)-F12 level of theory.

D. Water cluster
We also include a dataset which we calculated for 4,000 snapshots drawn from a molecular dynamics trajectory of a water cluster consisting of 40 water molecules. For the molecular dynamics we used the NVT ensemble at 300 K and the TIP3P potential, as implemented in CHARMM (c41a1). The energies correspond to an optimized combination of basis sets and DFT functional (PBEh-3c), as implemented in Orca, obtained for each of the 4,000 geometries.

E. Solids: OQMD
We have used the Inorganic Crystal Structure Database subset corresponding to the Open Quantum Materials Database (OQMD) by Wolverton and co-workers. This data set has already been used to develop and benchmark a random-forest based QML model (Voronoi) by Ward et al.

F. Solids: Elpasolites
We have also tested our representation on the Elpasolite crystal structure data set. This data set consists of ~10k Elpasolite structures and DFT (PBE) formation energies. The crystals correspond to quaternary main-group elemental compositions with all elements up to Bismuth (39 in total). We have used a random subset of 7k structures, with up to 6k and 1k for training and testing, respectively.
FIG. 6. Learning curves for atomization/formation energy predictions corresponding to various QML models. Out-of-sample MAE is shown as a function of training set size N for molecular (QM9 and QM7b), protein side-chain dimer (SSI), liquid water ((H2O)40 snapshots, "Water cluster"), and crystalline (OQMD and Elpasolites) data sets. Each panel compares our model ("This work") to literature models; the panel legends list CM, BOB, SLATM, aSLATM, BAML, SOAP, HDAD, and enn-s2s for QM9; CM, BOB, SLATM, aSLATM, BAML, SOAP, and MBTR for QM7b; CM, BOB, SLATM, and aSLATM for SSI and the water cluster; FLLA for Elpasolites; and the sine matrix and Voronoi models for OQMD.

G. Main-group diatomics
To test the predictive power for alchemical interpolation we have also included a set of previously published DFT (PBE) results for single, double, and triple bonds among main-group diatomics saturated with hydrogens.

IV. RESULTS AND DISCUSSION
Using learning curves (always resulting in straight lines when recorded on log-log plots due to their inverse power-law relationship), we first present numerical results which indicate the predictive power of our QML model for atomization and formation energies in various data sets. When available for the same data set, we also compare to other QML models in the literature. Thereafter, the alchemical extrapolation capacity is demonstrated by predicting covalent bonds in molecules with elements that were not part of training. Finally, log-log plots of learning curves for nine electronic ground-state properties of organic molecules (QM9) are reported and discussed.

A. Energies of molecules, clusters, and solids
Fig. 6 displays the performance overview for energy predictions on six different data sets (QM9, QM7b, SSI, water, elpasolites, OQMD). Mean absolute out-of-sample energy prediction errors (MAEs) are shown as a function of training set size. The results indicate remarkable performance for all data sets: the QML model improves systematically with increasing training set size, and the learning curves yield out-of-sample MAEs that are consistently lower than, or similar to, those of previously published models. For QM9, the MAE reaches the highly coveted chemical accuracy threshold (1 kcal/mol, or ∼0.043 eV, for the enthalpy of formation) with only 2k training points. Previously published QML models had to include an order of magnitude more training molecules to reach such accuracy. This is similar to the number of training molecules required when using the Coulomb matrix representation in conjunction with semi-empirical or DFT-based baselines to estimate electron-correlated energies, as demonstrated in 2015 with the ∆-ML model.

For QM9, aSLATM and the SOAP multi-kernel model reach a performance nearly as good as our QML model. aSLATM, however, performs worse for the SSI and water-cluster data sets. The SOAP multi-kernel QML model performs an expansion in kernel-function space acting on a distance for which all degrees of freedom have already been integrated out. As such it is, strictly speaking, not an improved representation but rather an improved regressor. Note that single-kernel SOAP QML models perform significantly worse. The reader should also take notice that the SOAP learning curves presented in Fig. 6 include the ∼3k structures which had failed the SMILES consistency test. These QML models are therefore not exactly comparable, and the SOAP results would likely improve slightly if these faulty structures were removed. One should also note that the SOAP results shown for QM7b correspond to the multi-kernel SOAP model.

Other models presented correspond to the Coulomb matrix (CM), bags of bonds (BOB), bonds-and-angles-based machine learning (BAML), histograms of distances, angles, and dihedrals (HDAD), Spectral London Axilrod-Teller-Muto (SLATM), atomic SLATM (aSLATM), the crystal representation by Faber, Lindmaa, von Lilienfeld, and Armiento (FLLA), the sine matrix, and the unified many-body tensor representation (MBTR). We also compared to QML models which are not based on KRR, such as the message-passing neural network model (enn-s2s) and the Voronoi-tessellation-based random-forest model (Voronoi) of Ward et al.

The MAE of our new QML model is consistently the lowest for all data sets and large training sets. For the set of 4,000 non-equilibrium water clusters, there is a noticeable difference between the global representations (CM, BOB, and SLATM) and the atomic ones (aSLATM and the new representation we introduce in this work): the global models exhibit very little learning at first, and only for larger N do their learning curves begin to turn downward. The atomic models, by contrast, improve rapidly with increasing training set size. We believe that sorting and crowding in the global representations make it difficult to accurately account for the purely geometrical changes in structure that contribute to total-energy variations.

Impressive predictive power is also observed for the OQMD dataset, a structurally and compositionally very diverse set of solids. Our new model has a lower out-of-sample MAE for all N when compared to the sine matrix representation on the OQMD dataset.
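The inverse power-law behavior behind these learning curves can be illustrated with a minimal kernel ridge regression sketch on synthetic one-dimensional data. This is only a stand-in for a real atomic representation; the target function, kernel width `sigma`, and regularizer `lam` are illustrative choices, not the hyperparameters or kernels used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(A, B, sigma):
    # Pairwise Gaussian kernel between two sets of 1-D inputs.
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_learning_curve(Ns, sigma=0.3, lam=1e-8):
    """Train Gaussian-kernel KRR on a toy 1-D target for growing
    training sets; return the out-of-sample MAE for each N."""
    x_test = rng.uniform(0.0, 2.0 * np.pi, 2000)
    y_test = np.sin(x_test)
    maes = []
    for N in Ns:
        x = rng.uniform(0.0, 2.0 * np.pi, N)
        y = np.sin(x)
        K = gaussian_kernel(x, x, sigma)
        # Closed-form KRR solution: alpha = (K + lam*I)^{-1} y
        alpha = np.linalg.solve(K + lam * np.eye(N), y)
        y_pred = gaussian_kernel(x_test, x, sigma) @ alpha
        maes.append(np.mean(np.abs(y_pred - y_test)))
    return np.array(maes)

Ns = np.array([25, 50, 100, 200, 400])
maes = krr_learning_curve(Ns)
# On a log-log plot the curve is roughly straight: MAE ~ a * N^(-b),
# so the fitted slope in log space is -b.
slope, offset = np.polyfit(np.log(Ns), np.log(maes), 1)
```

The fitted `slope` and `offset` correspond directly to the "learning rate" and "offset" compared between models throughout this section: a lower offset shifts the whole curve down, while a steeper (more negative) slope means faster improvement per added training sample.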
The offset of the learning curve of our new model is larger than that of the Voronoi-based random-forest model of Ward et al. However, the learning rate of our QML model is significantly steeper, surpassing the Voronoi model already at just ∼250 training samples. Results for a solid-state variant of the CM, designed for use in periodic systems, have also been included (sine matrix). It has a similar slope to the Voronoi model, but a substantially larger offset.

For the elpasolite data set, with large compositional diversity but identical crystal structures, the learning curve of the FLLA representation has a slightly higher offset than that of our new QML model, yet exhibits a steeper learning curve; our model converges towards the same slope for larger training set sizes. We can only speculate on the reasons for such behavior. The FLLA representation differs qualitatively from the other representations in this study: it does not include any explicit information about coordinates and only encodes the periodic-table row and column of the elements which populate each crystal-structure site. The QML model then learns to infer ground-state energies without knowing the exact configuration. This leads to a very low-dimensional model that is still unique for the system, which might be the cause of this behavior. It does, however, need to be investigated more carefully before any conclusions can be drawn.

B. Alchemical predictions
Our new scaled many-body expansion explicitly accounts not only for distributions of interatomic distances and angles but also for elemental distributions in the periodic table. We have therefore studied its capability to predict the covalent binding of molecules containing chemical elements which were not present in the molecules used for training. More specifically, we have investigated single, double, and triple bonds with one bonding atom coming from group IV, i.e. C, Si, or Ge. In order to increase the covalent bond order, we have varied the valency of the bonding partner as follows: for single bonds, group IV atoms are bound to halogens (group VII); for double bonds, to chalcogens (group VI); and for triple bonds, to group V atoms. Dangling valencies of the group IV atoms have been saturated with hydrogen. Similar covalent bonding potentials have also recently been used to assess the predictive power of first- and second-order perturbation-theory-based alchemical predictions.

FIG. 7. Covalent bond potentials calculated by DFT (star) and estimated by QML (circle) for 27 main-group diatomic molecules. Bonding occurs between a group IV element (C blue, Si green, or Ge red) and halogens (single bond), chalcogens (double bond), or group V elements (triple bond). Columns correspond to triple (LEFT), double (MID), and single bonds (RIGHT). Rows correspond to the period of the group IV atom's binding partner: 2nd period (TOP), 3rd period (MID), 4th period (BOTTOM).

In order to test this alchemical "extrapolation", we trained on the covalent bonds of all other compounds (16 curves) which contained neither the group IV atom nor the corresponding bonding partner in question. The predictive power for the out-of-sample molecules, on display in Fig. 7, is impressive. Albeit not quantitative (chemical accuracy is not reached), the results are semi-quantitative and certainly provide a physically very adequate picture of the covalent bonding in single, double, and triple bonds for main-group atoms in periods 2 to 4. The fact that predictions for compounds of central elements, such as H2SiS, are more accurate (easier to interpolate) than others is consistent with this interpretation.
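Why such extrapolation is possible at all can be sketched with a toy version of the elemental-distribution idea: if each element is smeared out as a Gaussian in a continuous (period, group) space, an element absent from training still has nonzero overlap with its trained periodic-table neighbors. The coordinates, the `width` parameter, and the overlap formula below are illustrative assumptions, not the paper's actual functional form or scaling:

```python
import numpy as np

# Illustrative (period, group) coordinates on the periodic table;
# a stand-in for the paper's elemental degrees of freedom.
POSITION = {"C": (2, 14), "Si": (3, 14), "Ge": (4, 14),
            "O": (2, 16), "S": (3, 16), "Se": (4, 16)}

def elemental_overlap(a, b, width=1.0):
    """Overlap of two unit Gaussians of equal width placed at the
    elements' (period, group) coordinates; smoothly decays with
    periodic-table distance instead of dropping to zero."""
    pa = np.asarray(POSITION[a], dtype=float)
    pb = np.asarray(POSITION[b], dtype=float)
    d2 = np.sum((pa - pb) ** 2)
    return np.exp(-d2 / (4.0 * width**2))

# Ge never seen in training still overlaps strongly with Si (same
# group, adjacent period), but only weakly with O (different group
# and period) -- the basis for interpolating across elements.
sim_ge_si = elemental_overlap("Ge", "Si")
sim_ge_o = elemental_overlap("Ge", "O")
```

Because such overlaps are smooth functions of the compositional coordinates, a kernel built on them can interpolate between chemical elements, which is exactly the regime probed by the 16-curve leave-elements-out experiment above.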
We also note that the deviation is worst for 2nd-period elements: owing to their lack of d-orbitals, they differ substantially more from their 3rd- and 4th-period analogues than the latter two differ from each other. Because of their poor performance, we have not included other representations in this test.

These results clearly demonstrate that alchemical extrapolation is possible when interpolating elemental groups and periods in the periodic table through an appropriate representation. Since the representation is continuous in the corresponding compositional space, we also believe this indicates that the calculation of alchemical derivatives is meaningful, similar in spirit to Ref. .

C. Other ground state properties of molecules
Finally, we also investigated how well QML models based on our new representation, optimized for energies, perform for predicting the other ground-state quantum properties in the QM9 dataset. More specifically, we have included atomization energies, HOMO and LUMO eigenvalues as well as their gap, dipole moment, polarizability, zero-point vibrational energy, heat capacity, and the vibrational frequency of the highest-lying fundamental (ω). Results are shown in Fig. 8, and provide overwhelming evidence that the resulting models enable predictions that systematically improve with training set size, no matter the property. For comparison, we have also included results for the aSLATM model. aSLATM results are typically worse for extensive properties, such as energies, polarizability, or heat capacity. For intensive properties, such as eigenvalues or dipole moment, aSLATM is on par with or even slightly better than our model, with the exception of ω. ω corresponds to the frequency associated with the vibrational stretch of CH, NH, or OH bonds, a property with hardly any variance at all. Previously we have seen that this property is best predicted by a random-forest model which has poor performance for all other properties. Our interpretation is that predicting this property is much more a classification problem than a supervised regression task.

Fig. 8 also includes learning curves for the root-mean-square error (RMSE). As expected, these have a slightly higher offset than the mean-absolute-error curves, but they improve systematically with training set size with similar slopes. This is a reassuring result, indicating once again that predictions for outliers also improve as the training set size is increased.

Furthermore, for Fig. 8 we have also distinguished between two- and three-body contributions (as well as four-body for BAML). For all properties but ω the trend meets the expectation, as also confirmed previously for BAML: adding the higher-order term systematically lowers the learning curves by a significant amount.

CONCLUSION
We have introduced a universal representation of an atom in a chemical compound for use in QML models. An atom is represented by a sum of multidimensional Gaussians, each term corresponding to elemental, atom-pairwise, and angular distributions scaled by respective power laws. For the compounds and properties studied, we have found four-body contributions to be insignificant. System-independent hyperparameters, such as the exponents in the scaling functions and the Gaussian widths, have been optimized using the out-of-sample prediction error for the energy as a penalty. Analytical expressions have been derived for the corresponding distances between arbitrary chemical compounds. These distances can directly be used within kernel ridge regression based QML models of electronic ground-state properties. For energies of organic molecules, water clusters, amino-acid side chains, and crystalline solids, the resulting QML models lead to learning curves with very low offset and steep learning rate. For compositionally diverse systems, chemical accuracy (∼1 kcal/mol) is within reach. Future work will deal with forces and other properties.

FIG. 8. Learning curves for out-of-sample MAE (solid lines) and RMSE (dashed lines) as a function of training set size N for nine electronic ground-state properties in the QM9 dataset. QML predictions have been made using either a molecular kernel with BAML as representation, or atomic kernels with our new representation. The BAML representation includes bonds (M_B); bonds and angles (M_A); and bonds, angles, and torsional angles (M_T). Predicted properties include: atomization energy at 0 Kelvin (U); HOMO-LUMO gap (∆ε); HOMO eigenvalue (ε_HOMO); LUMO eigenvalue (ε_LUMO); norm of the dipole moment (µ); static isotropic polarizability (α); zero-point vibrational energy (ZPVE); heat capacity at room temperature (C_v); and the highest fundamental vibrational frequency (ω).

ACKNOWLEDGEMENT
F.A.F. would like to thank Kuang-Yu Samuel Chang and Michele Ceriotti for providing data. O.A.v.L. acknowledges funding from the Swiss National Science Foundation (No. PP00P2_138932 and 407540_167186 NFP 75 Big Data). This research was partly supported by the NCCR MARVEL, funded by the Swiss National Science Foundation. Calculations were performed at the sciCORE (http://scicore.unibas.ch/) scientific computing core facility at the University of Basel.

∗ [email protected]

F. Jensen, Introduction to Computational Chemistry (John Wiley, West Sussex, England, 2007).
F. Brockherde, L. Vogt, L. Li, M. E. Tuckerman, K. Burke, and K.-R. Müller, Nature Communications, 872 (2017).
F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld, Journal of Chemical Theory and Computation, pMID: 28926232, http://dx.doi.org/10.1021/acs.jctc.7b00577.
K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, Nat. Commun., 13890 (2017).
J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, in Proceedings of the 34th International Conference on Machine Learning, ICML 2017 (2017).
D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, in Advances in Neural Information Processing Systems (2015) pp. 2215–2223.
M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett., 058301 (2012).
K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, J. Phys. Chem. Lett., 2326 (2015).
B. Huang and O. A. von Lilienfeld, J. Chem. Phys., 161102 (2016).
J. Behler and M. Parrinello, Phys. Rev. Lett., 146401 (2007).
D. Rogers and M. Hahn, J. Chem. Inf. Model., 742 (2010).
O. A. von Lilienfeld, Int. J. Quantum Chem., 1676 (2013).
K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller, and E. K. U. Gross, Phys. Rev. B, 205118 (2014).
B. Huang and O. A. von Lilienfeld, arXiv preprint arXiv:1707.04146 (2017).
H. Huo and M. Rupp, arXiv preprint arXiv:1704.06439 (2017).
O. A. von Lilienfeld, in Many-Electron Approaches in Physics, Chemistry and Mathematics (Springer, 2014) pp. 169–189.
K. S. Chang, S. Fias, R. Ramakrishnan, and O. A. von Lilienfeld, The Journal of Chemical Physics, 174110 (2016).
O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, and A. Knoll, Int. J. Quantum Chem., 1084 (2015).
F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Phys. Rev. Lett., 135502 (2016).
S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, Phys. Chem. Chem. Phys., 13754 (2016).
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, IEEE Transactions on Neural Networks, 181 (2001).
B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, 2002).
V. Vovk, "Kernel ridge regression," in Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik, edited by B. Schölkopf, Z. Luo, and V. Vovk (Springer Berlin Heidelberg, Berlin, Heidelberg, 2013) pp. 105–116.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. (Springer, New York, 2011).
S. Mathias, Master thesis: http://wissrech.ins.uni-bonn.de/teaching/master/masterthesis mathias revised.pdf (2015).
J. Barker, J. Bulin, J. Hamaekers, and S. Mathias, arXiv preprint arXiv:1611.05126 (2016).
A. P. Bartók and G. Csányi, International Journal of Quantum Chemistry, 1051 (2015).
R. Ramakrishnan and O. A. von Lilienfeld, Chimia, 182 (2015).
B. M. Axilrod and E. Teller, J. Chem. Phys., 299 (1943).
Y. Muto, Proc. Phys. Math. Soc. Japan, 629 (1943).
A. S. Christensen, M. Elstner, and Q. Cui, The Journal of Chemical Physics, 084123 (2015).
R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, Sci. Data (2014).
P. J. Stephens, F. J. Devlin, C. F. Chabalowski, and M. J. Frisch, J. Phys. Chem., 11623 (1993).
L. Ruddigkeit, R. van Deursen, L. C. Blum, and J.-L. Reymond, J. Chem. Inf. Model., 2864 (2012).
G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, New Journal of Physics, 095003 (2013).
L. C. Blum and J.-L. Reymond, J. Am. Chem. Soc., 8732 (2009).
M. Ernzerhof and G. E. Scuseria, J. Chem. Phys., 5029 (1999).
C. Adamo and V. Barone, J. Chem. Phys., 6158 (1999).
L. A. Burns, J. C. Faver, Z. Zheng, M. S. Marshall, D. G. A. Smith, K. Vanommeslaeghe, A. D. MacKerell Jr., K. M. Merz Jr., and C. D. Sherrill, The Journal of Chemical Physics, 161727 (2017).
M. S. Marshall and C. D. Sherrill, Journal of Chemical Theory and Computation, 3978 (2011).
W. L. Jorgensen, J. Chandrasekhar, J. D. Madura, R. W. Impey, and M. L. Klein, The Journal of Chemical Physics, 926 (1983).
B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, Journal of Computational Chemistry, 187 (1983).
S. Grimme, J. G. Brandenburg, C. Bannwarth, and A. Hansen, The Journal of Chemical Physics, 054107 (2015).
F. Neese, Wiley Interdisciplinary Reviews: Computational Molecular Science, 73 (2012).
A. Belsky, M. Hellenbrandt, V. L. Karen, and P. Luksch, Acta Crystallographica Section B Structural Science, 364 (2002).
G. Bergerhoff, R. Hundt, R. Sievers, and I. D. Brown, Journal of Chemical Information and Computer Sciences, 66 (1983).
S. Kirklin, J. E. Saal, B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl, and C. Wolverton, npj Computational Materials, 15010 (2015).
J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton, JOM, 1501 (2013).
L. Ward, R. Liu, A. Krishna, V. I. Hegde, A. Agrawal, A. Choudhary, and C. Wolverton, Phys. Rev. B, 024104 (2017).
J. P. Perdew, K. Burke, and M. Ernzerhof, Phys. Rev. Lett., 3865 (1996).
R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, J. Chem. Theory Comput., 2087 (2015).
A. P. Bartók, S. De, C. Poelking, N. Bernstein, J. Kermode, G. Csányi, and M. Ceriotti, arXiv preprint arXiv:1706.00179 (2017).
A. P. Bartók, R. Kondor, and G. Csányi, Phys. Rev. B, 184115 (2013).
F. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Int. J. Quantum Chem., 1094 (2015).
O. A. von Lilienfeld, J. Chem. Phys. 131