Machine learning based energy-free structure predictions of molecules (closed and open-shell), transition states, and solids
Dominik Lemm, Guido Falk von Rudorff, O. Anatole von Lilienfeld
EEnergy-free machine learning predictions of ab initio structures
Dominik Lemm, Guido Falk von Rudorff, and O. Anatole von Lilienfeld
1, 2, a) Faculty of Physics, University of Vienna, Kolingasse 14-16, 1090 Vienna, Austria Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL),Department of Chemistry, University of Basel, Klingelbergstr. 80, 4056 Basel, Switzerland (Dated: 5 February 2021)
The computational prediction of atomistic structure is a long-standing problem in physics, chemistry, materials, andbiology. Within conventional force-field or ab initio calculations, structure is determined through energy minimization,which is either approximate or computationally demanding. Alas, the accuracy-cost trade-off prohibits the generationof synthetic big data records with meaningful energy based conformational search and structure relaxation output. Ex-ploiting implicit correlations among relaxed structures, our kernel ridge regression model, dubbed Graph-To-Structure(G2S), generalizes across chemical compound space, enabling direct predictions of relaxed structures for out-of-samplecompounds, and effectively bypassing the energy optimization task. After training on constitutional and compositionalisomers (no conformers) G2S infers atomic coordinates relying solely on stoichiometry and bond-network informationas input (Our numerical evidence includes closed and open shell molecules, transition states, and solids). For all dataconsidered, G2S learning curves reach mean absolute interatomic distance prediction errors of less than 0.2 Å for lessthan eight thousand training structures — on par or better than popular empirical methods. Applicability test of G2Sinclude meaningful structures of molecules for which standard methods require manual intervention, improved initialguesses for subsequent conventional ab initio based relaxation, and input for structural based representations commonlyused in quantum machine learning models, (bridging the gap between graph and structure based models).
I. INTRODUCTION
The prediction of 3-dimensional (3D) structures from amolecular graph is a universal challenge relevant to manybranches of the natural sciences. 3D coordinates of all atomsdefine a system’s Hamiltonian, and thereby all observableswhich can be calculated as expectation values of approximatesolutions to Schrödinger’s equation. Energy and force esti-mates can subsequently be used to identify structural optima.The many degrees of freedom and various levels of theoryfor describing potential energy surfaces make the structureprediction problem challenging. The problem is aggravatedby the combinatorially large number of possible conformers(cf. Levinthal’s paradox ) mapping to the same graph. Oftenonly low energy conformations are desired, e.g. as a practi-cally relevant starting configuration to a chemical reaction, ,or to binding poses in computational drug design , requir-ing conformational scans to identify promising representativecandidate geometries. While feasible for few and smaller sys-tems, conformational scans of large databases remain compu-tationally prohibitive.State of the art approaches for generating 3D molecularstructures e.g. ETKDG and Gen3D are very efficient yetbased on fixed functional forms, empirical parameters, andknowledge-based heuristic rules. While applicable to knownand well behaved regions of chemical compound space, thesemethods lack generality and cannot be applied to more chal-lenging systems such as open-shell molecules or transitionstates. Recent generative machine learning developments canproduce structural candidates to solve inverse molecular de-sign problems but have, to the best of our knowledge, notyet been used to tackle the structure prediction problem in a) Electronic mail: [email protected] general.Here we propose a machine learning method, Graph ToStructure (G2S), which uses kernel ridge regression (KRR)to predict all elements in the pairwise distance matrix of asingle atomic configuration of a molecule or solid, and usingonly bond-network and stoichometry based information (seeFig. 1(a)) By exploiting correlations among data-sets of con-former free structural minima (constitutional and composi-tional isomers), G2S learns the direct mapping from chemicalgraph to the structural minima in the training data set, therebybypassing the expensive process of energy based conforma-tional search and relaxation. We have evaluated G2S on QMstructures of thousands of constitutional isomers, singlet statecarbenes, E2/S N and Gen3D (forthose closed shell molecules for which the latter are applica-ble), but also exhibit high geometric similarity to the refer-ence quantum chemical structure. Further numerical evidencesuggest the applicability of G2S to resolve uncharacterizedmolecules, to generate input for QM based relaxations, andfor QML based property predictions. Analysis of G2S resultsindicates that interatomic distances between covalently boundatoms are easier to learn than between non-bonded atoms. II. METHODSA. Kernel Ridge Regression (KRR)
We rely on kernel based methods which have shownpromise in predicting quantum properties throughout chemi-cal compound space after training on sufficient data . De-veloped in the 1950s, kernel methods learn a mapping func-tion from a representation vector x to a target property y . a r X i v : . [ phy s i c s . c h e m - ph ] F e b Heavy
Atoms LebedevDGSOL
CC1C2CCOC(=O)C12
Training C o o r d i n a t e s Prepare Molecules Featurize Learn
Intramolecular
Distances • Bond Order • Bond Hop • Bond Length • graph-CM • graph-BoB • Bond Length
Prediction
SMILES
Input Predict
Intramolecular
Distances Reconstruct
Geometry
Hydrogens One Machine per Pairwise Distance12 3 5 4 76 89 . Å Å (a) One Machine per Pairwise Distance (b)
One Machine per Pairwise DistanceOne Machine per Pairwise Distance
DiskRod Sphere C O H C NOH (c) FIG. 1. Schematic of the G2S workflow and QM9 constitutional isomer dataset. (a) From left to right: molecules in the training set areseparated into heavy atoms and hydrogens and featurized with a given representation. During training, one machine is used for each pairwisedistance. For the prediction of new structures, only the molecular connectivity is needed, which can be provided via SMILES or SELFIES .The machines predict all pairwise distances. The full 3D geometry is then reconstructed using DGSOL for heavy atoms and a Lebedevsphere optimization scheme for hydrogen atoms. (b) Example isomers and distance matrix distributions of the C O H QM9 constitutionalisomer dataset. The sorting of the atoms and the distance matrix is dependent on the sorting of the molecular representation (example shownfor the bond length representation). (c) Energy distribution and principal moments of inertia of the C O H and C NOH dataset. G2S attempts to predict interatomic distances in a sorteddistance matrix. The focus on the prediction of internal de-grees of freedom facilitates the learning process because ofrotational, translational, and index invariance. Note that thesubsequent reconstruction of the Cartesian coordinates from acomplete set of noisy interatomic distances is straightforward(see below). Within G2S, the interatomic distance target label y between any pair of atoms I and J is defined as y G2S IJ ( x ) = N ∑ i α ( IJ ) i k ( x i , x ) (1)with α i being the i -th regression coefficient, x i being the i -thmolecule in the training set and k being a kernel function toquantify the similarity of two molecules. The regression co-efficients α are obtained from reference interatomic distances y ref according to standard KRR training procedure. α IJ = ( K + λ I ) − y ref IJ (2) with a regularization coefficient λ and the identity matrix I .The regularization strength λ is dependent on the anticipatednoise in the data and has been determined by hyperparameteroptimization. Note that while each interatomic distance ma-trix element IJ is predicted by a separate G2S model (Eq. 1),formally the training for all models requires only one ma-trix inversion (Eq. 2). In this sense, G2S represents a singlekernel/multi-property KRR model Due to numerical efficiency, however, in practice we havesimply relied on repeated Cholesky decomposition as imple-mented in QMLcode .To relate the molecular representations via a similarity mea-sure, a kernel function k has to be chosen, as for example, k ( x i , x j ) = exp (cid:18) || x i − x j || σ (cid:19) (3) k ( x i , x j ) = exp (cid:18) || x i − x j || σ (cid:19) (4)Equations 3 and 4 represent Laplacian and Gaussian ker-nel functions, respectively, a standard choice in KRR basedQML.[Assessment and validation of machine learning meth-ods for predicting molecular atomization energies] Whileindex-dependent representations can benefit from Wassersteinnorms , we enforce index invariance by sorting (see below),and have therefore only used either L1 or L2 norm.We optimize the hyperparameters σ , λ with differentchoices of kernel function (Gaussian or Laplacian) and rep-resentation by using a grid-search and nested five-fold cross-validation. The performance of all models has been tracked interms of mean-absolute-error (MAE) of all distances, as wellas root-mean-square-deviation RMSD. To assess the generalizing capability of G2S for variousrepresentations, kernels, and data-sets the test error has beenrecorded in terms of training set size N . The relationship be-tween the test error of a machine learning method in depen-dence of training set size N , a.k.a. learning curve, is known tobe linearly decaying on a logarithmic scale, which facilitatesassessment of learning efficiency and predictive power. B. Graph-based Representations
We use bond order matrices to define molecular graphs,with elements being {0, 1,2,3} for bond types none, single,double, and triple, respectively (Bond Order). For discon-nected molecular graphs, e.g. transition states, a fully con-nected graph between attacking/leaving groups and reactioncenters is assumed. We have also used a denser way to de-scribe the connectivity of a molecule by counting the num-ber of bonds between atoms following the shortest connectingpath (Bond Hop). These representations capture the connec-tivity of a molecule but neglect information about atom types.To incorporate atomic information as well as a form of spatialrelationship, we weigh the total bond length l i j on the shortestpath between atoms i and j by covalent atomic radii taken fromRefs. (Bond Length). We have introduced more physics(decreasing off-diagonal magnitude with increasing distance)by adapting the Coulomb matrix (CM) representation usingthe bond length l in the following form,graph CM i j = (cid:40) . Z . i , i = j , Z i Z j l ij , i (cid:54) = j . (5)with nuclear charges Z (graph CM). The two-body bag formof the CM, Bag of Bonds (BoB) , was shown to yield im-proved quantum property machine learning models, and hasalso been adapted correspondingly for this work (graph BoB).We canonicalize the order of atoms in the representa-tion and distance matrix by sorting the atoms such that || x i || ≤ || x i + || with x i being the i -th row. Due to the use of L1 and L2 norms as metrics in the kernel, the canonicaliza-tion process is necessary in order to guarantee that the repre-sentation and distance matrix is invariant to the initial order ofatoms. Depending on the graph representation, this can leadto an implicit sorting of the distance matrix that is easier tolearn, e.g. by sorting short ranges together (Fig. 1 (b)). Forthe graph BoB representation, distances are ordered similar tothe atom-wise binning procedure.While for molecules the bonding-pattern varies, we pre-sume solids in the same crystal structure to share a fixedadjacency matrix implying that they can solely be describedby stoichometry. The FLLA representation, introduced forElpasolite crystals in 2016, exploits this fact by describingeach representative site n solely by the row (principal quan-tum number) and column (number of valence electrons) in theperiodic table resulting in an ( n x 2-tuple), with sites beingordered according to the Wyckoff sequence of the crystal. In2017, and using a similar representation, Botti and co-workershave studied the stability of perovskites with great success C. Work flow
The training of G2S starts with the separation of heavyatoms and hydrogens from the target molecules (Fig. 1 (a)).We generate the heavy atom scaffold first, followed by satu-rating all valencies with hydrogens. This leads to the scaffoldand hydrogen training being independent problems.After the separation, the input’s molecular bonding patternshave to be featurized into a fixed size graph representation.To learn the pairwise distance matrix, we use one model perdistance-pair, resulting in n ( n − ) / n heavy atoms (matrices forsmaller molecules are padded with zeros). For hydrogens,only the distances to the four closest heavy atom neighbors(not forming a plane) are being considered, requiring four ma-chine learning models. This working hypothesis is consistentwith the observation that the deprotonation of small moleculestypically only involves local electron density changes , mak-ing only local geometries predominantly important.In order to predict interatomic distances for out-of samplemolecules (Fig. 1 (a)), only information about the bondingpattern and nuclear charges is required, e.g. by providing asimplified molecular-input line-entry system (SMILES) orSELFIE string. RDKit is used to generate the correspondingadjacency matrix from which we construct the representation.To convert the predicted interatomic distances to 3D coordi-nates, the distance geometry problem has to be solved. Forheavy atoms, we use DGSOL , a robust distance geometrysolver that works with noisy and sparse distance sets. After re-constructing the heavy atom coordinates, all valencies are sat-urated by placing hydrogens on a spherical surface providedby a Lebedev sphere. Note that solving the distance geom-etry problem is independent from G2S, any other approachcould have been used just as well.Regarding the Elpasolite crystal structure predictions and inorder to allow the conversion from fractional to Cartesian co-ordinates, an additional machine has been trained to also pre-dict the unit cell length of each stoichometry. By learning thelength of the unit cell with an additional machine, fractionalcoordinates can then be converted back to Cartesian coordi-nates. D. Data
To assess G2S, several quantum-based datasets contain-ing structures of closed shell, open shell, transition states aswell as crystal structures have been considered. The QM9database has already served as an established benchmarkand recently has been used to test generative machine learn-ing models. All QM9 molecules were optimized at theB3LYP/6-31G(2df,p) level of theory. From QM9, thelargest subsets of constitutional isomers, i.e. 6’095 and 5’859molecules with C O H and C NOH sum formula, respec-tively, have been selected for this work. Note that already pureconstitutional isomers (fixed composition) constitute a diffi-cult target since similar molecular graphs can lead to vastlydifferent 3D geometries. Fig. 1 illustrates three exemplarymolecules, as well as distance, energy, and moments of iner-tia distributions for both constitutional isomer sets. As evidentfrom inspection of the latter, the molecular shapes tend to belong and flat with few spherical structures.In order to push G2S to its limits, systems without well de-fined Lewis structures have been considered as represented bytwo distinct and recent data sets: Carbene and transition state(TS) geometries. The QMSpin database reports over 5 thou-sand singlet and triplet carbene structures (derived through hy-drogen abstraction of random molecules drawn from QM9),for which common structure generation methods would re-quire manual intervention. These structures were optimizedusing CASSCF in a cc-pVDZ-F12 orbital basis, andaug-cc-pVTZ density fitting basis. We have used all singletstate carbene structures for training and testing of G2S.We have also trained and tested G2S on thousands ofTS geometries from the QMrxn20 dataset. QMrxn20 con-sists of C H based reactant scaffolds, substituted with -NO2, -CN, -CH3, -NH2, -F, -Cl and -Br functional groups,for which E2/S N level of theory.Regarding solids, we have relied on the Elpasolite data-set corresponding to 10,000 training systems made up frommain-group elements. All crystal structures had been re-laxed using DFT (PBE) with projector augmented wavepseudopotentials.
Finally, we have also extracted the list of 3,054 SMILESof ’uncharacterized’ molecules from the QM9 database, forwhich the structure generation and B3LYP geometry opti-mization procedure had led to a mismatch with initial Lewisstructures.
III. RESULTS AND DISCUSSIONA. G2S performance
We report G2S performance curves for heavy atom coor-dinates (not hydrogens) of constitutional isomers, carbenes,TS, and elpasolite structure predictions in Fig. 2. For all datasets and representations studied, root mean square deviationsof reconstructed geometries of out-of-sample input graphs de-crease systematically with training set size. For all QM9 basedsets (isomers and carbenes), the Bond Length and Bond Hoprepresentations yield systematic improvements with lowestoff-set. While Bond Order exhibits a similar slope, its off-set however is markedly higher. This difference is likely dueto Bond Order encoding substantially less information. Notethat graph CM and BoB representation, both yielding betterlearning curves for atomization energies due to their inversedistance format , perform both worse than Bond Length.Since geometry is directly proportional to distance (and not in-versely such as energy), this trend is therefore consistent withthe literature findings.For the TS based performance curves, similar trends areobserved with the exception of the Bond Order representa-tion now resulting in the most accurate G2S model. This isin line with the findings in Ref. where the simple one-hot-encoding representation outperforms more physics based rep-resentations when it comes to the prediction of activation ener-gies. It is an open question if and how the physics of transitionstates can be properly accounted for within a representation.From the curves on display in Fig. 2 (a) it is clear that G2Sdelivers similar performance no matter if Lewis structures oftarget systems are well defined or not. For comparison, empir-ical structure prediction methods ETKDG and Gen3D andhave also been applied to the isomer sets (application to car-benes and TS is not possible since these methods are restrictedto systems with valid Lewis structure formulas). Their RMSDfrom QM9 geometries is reached by G2S after training on over4’000 structures. In addition to their quantitative limitations,ETKDG and Gen3D were respectively in 15% and 1.5% of thecases for C O H , and 6.3% and 19% for C NOH not ableto generate a structure from given SMILES at all. This indi-cates that structure generation can be a challenge for empiricalmethods, even when it comes to simple closed shell and smallorganic molecules. Note how the constant slope of the G2Sperformance curves suggests that even lower prediction errorsshould be possible for larger training sets. The kinks in theperformance curves of the carbene data set result from noisein the DGSOL prediction when solving the distance geometryproblem: Actual learning curves of interatomic distances aresmooth for all data-sets (See SI Fig. S4).In complete analogy to predicting pairwise atomic dis-tances in molecules, G2S can be trained to predict pairwisedistance of atomic sites in a crystal. Due to the dependenceon the size of the unit cell, pairwise distances are predicted infractional coordinate space instead of Cartesian coordinates,and an additional G2S model is trained to predict the latticeconstant, based on the exact same representation of stoichiom-etry (FLLA). The performance in Fig. 2(b) indicates, just as (a) (b) Gen3D ETKDG HY XH XY
FIG. 2. Systematic improvement of predictive G2S accuracy with increasing training data for all data sets studied. Performance curvesshow mean heavy atom root-mean-square deviation (RMSD) of G2S generated structures (with different representations). (a) Performancecurves of isomers, carbenes, and transition states (TS). Insets depict exemplary Lewis structures of each dataset. Horizontal lines show meanRMSD of generated structures with ETKDG and Gen3D from SMILES. (b) Performance curves of the elpasolite dataset using the FLLArepresentation. Top, mid, and bottom panel depict prediction errors for unit cell length, interatomic distances, and coordinates. The insetillustrates the AlNaK F elpasolite crystal (Figure taken from reference ). for the molecular cases, systematically decaying prediction er-rors with growing training set size.To gain an overview, we also report the best mean absoluteand root mean square errors for G2S models after training onthe largest training set sizes available in Table I). Mean abso-lute errors of less than 0.2Å are obtained in all cases. Exem-plary predicted structures, drawn at random and superimposedwith their reference, are on display for all molecular data setsin Fig. 3. Visual inspection confirms qualitative to quantitativeagreement, the largest deviations corresponding to conforma-tional isomers which can be expected to exhibit small energydifferences.Based on the promising performance of G2S, we have alsoassessed its performance for 3’054 uncharacterized moleculeswhich had failed the QM9 generation protocol . To revisitthe problem of predicting the geometries for these uncharac-terized molecules, G2S has been trained on 5,000 randomlychosen QM9 molecules (varying constitution and composi-tion), and used to predict coordinates for each of them. Subse- TABLE I. Accuracy of the best performing representation for eachdataset at maximum training set size N test and for test set size N test specified. Mean-absolute-error (MAE) of interatomic distancesand root-mean-square-deviation (RMSD) calculated for heavy atomsonly. N train N test MAE [Å] RMSD [Å] RepresentationC O H NOH N quent geometry optimization at a B3LYP/6-31G(2df,p) levelof theory showed successful convergence of 90% of them. Asimilar success rate has been reached with Gen3D and Open-Babel. Fig. 3 (f) depicts randomly drawn structures together (a) (d) (e) (i) 0.35 Å (iii) 0.69 Å (iv) 0.33 Å (ii) 0.54 Å (i) 0.28 Å (iv) 0.35 Å (iii) 0.23 Å (ii) 0.18 Å (ii) 0.37 Å (iii) 0.47 Å (i) 0.10 Å (i) 0.67 Å (ii) 1.52 Å (iii) 0.70 Å (iv) 0.28 Å (iii) 0.12 Å (ii) 0.51 Å (i) 0.11 Å (b) (f)(c) FIG. 3. Exemplary structures generated with G2S (cyan) for all molecular datasets. Reference structures are shown in green with correspondingheavy atom root mean squared deviation. Panels (a)-(b) constitutional isomers C O H and C NOH , respectively. (c)-(d) correspond to E2and S N with the respective structural formula. At a B3LYP level oftheory, 92% of the uncharacterized molecules are expected toconverge to a local minimum, which makes G2S a viable ini-tial guess for ab initio structure relaxation. B. From G2S output to QM relaxation
Assessing the usefulness of structure prediction models canbe challenging. While from a machine learning perspective,naturally the error is calculated w.r.t. to the test dataset (Fig. 4error type A), energy based optimization methods are typi-cally evaluated by their deviation from the closest minimumof a higher level of theory structure (Fig. 4 error type B). SinceG2S is trained on quantum-based structures, it should inher-ently be able to predict structures close to the minimum of theused reference method, and should therefore be a useful toolfor the automatized generation of meaningful initial structureguesses which can subsequently be used as input in energybased convergence of the geometry.We have relaxed the test sets of the C NOH constitu-tional isomer set and the as E2/S N for both, as well as DFT (B3LYP) and post-Hartree-Fock (MP2) based relaxation, re-spectively. The resulting performance curves are shown inFigure 5 and, again, indicate systematic improvement withtraining set size, reaching even PM6 (semi-empirical quan-tum chemistry) level of theory for error type B of the reactants.The results for error type A in Figure 5 (blue curves), how-ever, show that subsequent QM based structure relaxationdoes not necessarily lead to further improvement for the re-actants. While the constitutional isomers improve by almost0.1 Å, the E2/S N N NOH isomers, the respective error betweenGFN2-xTB and B3LYP is only 0.13 Å, which could explainan almost equal average distance to both minima. A detailedoverview of baseline errors of different methods is given in inthe SI. ReferencePredictionPrediction BA=BAA: Error w.r.t. a reference structure
B: Error w.r.t. to the closest minimum
FIG. 4. Illustration of the structure prediction problem on a bu-tane dihedral energy profile (xTB). The quality of predicted struc-tures can be quantified in two ways. Error A: the overall accuracyof the machine learning model to reproduce a specific configura-tion is measured. Error B: Relaxing a predicted structure, the errorw.r.t. the closest minimum is calculated, allowing one-to-one com-parisons with energy based structure optimization methods.
C. From G2S output to QML predictions
The availability of molecular structures is not only a prob-lem for molecular simulations, but also for structure basedmachine learning of molecular quantum properties . In or-der to push the boundary in exploration of chemical space, ei-ther a graph-based model is required, or 3D structures have tobe generated. In case of the latter, generated structure shouldbe close to the level of theory of the training data in orderto avoid large prediction errors. G2S enables us to circum-vent this problem by allowing structure based machine learn-ing models to be trained on predicted structures. Thereby, theproperty predicting machines learn to compensate the noise ofG2S structures, which allows for the future query structures tooriginate from G2S.In order to quantify the usefulness of G2S for this prob-lem, we have used G2S output coordinates without furthergeometry optimization as an input to standard QML repre-sentations such as FCHL18 , FCHL19 or BoB . We havefocussed on the prediction of atomization and formation ener-gies of constitutional isomers and elpasolites, respectively. InFig. 6, we compare the resulting performance curves to stan-dard QML machines that had access to the ‘true’ referencecoordinates as input, as well as to QML machines that usedtopology only (input graphs for G2S) as input. Again, we notethat all performance curves improve systematically with train-ing set size. For atomization energy prediction of C O H PM6
E2/S N (a)(b) Constitutional IsomersC NOH RMSD w.r.t.
MP2 referenceRMSD w.r.t. G2SRMSD w.r.t.
B3LYP referenceRMSD w.r.t. G2S
FIG. 5. Performance curves of G2S predicted input coordinates ofC NOH constitutional isomers and E2/S N ab initio based geometry optimization runs. Blue lines mea-sure the RMSD w.r.t. a quantum-based reference structure (Fig. 4error type A). Orange lines measure the RMSD w.r.t. G2S pre-dicted structures after a stuctural relaxation (Fig. 4 error type B).(a) C NOH constitutional isomers optimized with GFN2-xTB andB3LYP/6-31G(2df,p), respectively. (b) E2/S N and C NOH isomers, G2S and FCHL19 still reaches an ac-curacy of 5 kcal/mol MAE at 1024 training points, slowly ap-proaching the coveted chemical accuracy of 1 kcal/mol, andalmost matching the accuracy of a DFT structure based BoBmodel.On average, and as one would expect, performance curvesimprove as one goes from topology to G2S to QM coordinatesas input for QML. The advantage is most substantial for smalltraining set, in the limit of larger data sets, the performancecurves of predictions based on G2S input level off, presum-ably due to the noise levels introduced by aforementioned er-ror type B, i.e. inherent noise and conformational effects ofthe predicted structures. (a) (b) (c) Representation Representation
QM9 Constitutional Isomers Elpasolite CrystalsC O H C NOH AlNaK F Input
TopologyG2S Outputxyz Relaxed
FIG. 6. Systematic improvement of energy prediction accuracy with increasing training data using G2S predictions (blue) as well as DFTstructures (orange) as an input to QML models. (a) and (b) atomization energy prediction of C O H and C NOH constitutional isomers,respectively. The inset depicts a speedup estimate of a G2S based QML model over a DFT dependent QML model. This assumes an averageof 16 DFT optimization steps required before a structure can be used in QML. (c) Prediction of formation energies of elpasolite crystals. Comparing the impact of the molecular QML representa-tions, the familiar trend that FCHL is more accurate than BoB,is reproduced for either input . For crystals, conformationaleffects do not exist, which is the reason why FLLA performsalmost comparably well to FCHL18 trained on the originalstructures. Nevertheless, G2S with FCHL18 still reaches anaccuracy of 0.3 eV/atom MAE in training set sizes of less than1,000 points.While the G2S based predictions for the training sets spec-ified are not yet comparable to state of the art QML mod-els, an advantage over standard approaches is the generationof new query structures. While 3D structures are availableonly for a tiny fraction of chemical space, molecular graphsare abundant and can be enumerated systematically . Espe-cially when manual intervention and expensive optimizationmethods are required, the generation of new target structuresbecomes almost as difficult as generating the training data it-self. In short, a regular QML query requires a structure tobe generated with a force field method followed by geometricoptimization. G2S circumvents this procedure by producingstructures within milliseconds (Intel(R) Xeon(R) Gold [email protected]) that can directly be used with a QML model.resulting in orders of magnitude speedups compared to theconventional way (Fig. 6 inset (b)). D. Analysis and Limitations
The analysis of machine learning predictions is crucial inorder to better understand the G2S model. Fig. 7 reportsthe distribution of predicted (largest training set) and refer-ence distances for the C NOH data. We note that, as ex-pected from the integrated results discussed above, the pre-dicted distance distribution overlaps substantially with the re-spective reference distribution. Small deviations indicate thatG2S slightly overestimates covalent bond-lengths, and that itunderestimates distances to third neighbours. The density dif-ferences for second neighbours can hardly be discerned.The scatter error heat-map plot of predicted versus refer-ence distances (Fig. 7(b)) indicates absence of major system-atic errors (in line with remarkably good averages), but revealsa larger variance for distances larger than 2.5 Å. A plausiblereason for this could be the natural flexibility of molecularstructures for flat and long compounds (as opposed to sys-tems dominated by cage-like connectivities). This explana-tion is corroborated by the trend observed among individualMAE obtained for each distance pair of the distance matrixof C NOH (Fig. 7 (c)): The larger the distance the largererror. As mentioned, the sorting of the representation and dis-tance matrix depends on the norm of the feature values of each (a)(b) (c) FIG. 7. Distributions and mean absolute error (MAE) heatmap ofG2S predicted distances vs QM9 (B3LYP) reference for C NOH constitutional isomer set. For all predictions, G2S was used withBond Length representation and maximal training set size (4’687).(a) Histograms of reference (blue) and predicted distances (orange).(b) Hexbin heatmap visualization of reference and predicted dis-tances. (c) Heatmap of MAE for each entry of predicted distancematrix. row, naturally sorting larger distances to higher indices for theBond Length representation. Such prediction errors can thenlead to the generation of the wrong conformer, or even di-astereomer.A potential solution could be the decomposition of thefull distance matrix into only considering close neighbor dis-tances. Conceptually similar to how local atomic representa-tions in QML work, the position of an atom would only de-pend on the distances of the closest four atoms, making its po-sition uniquely defined. Furthermore, the scalability of G2Swould be improved since instead of n ( n − ) / n -heavy atoms, only four machines are necessary. However, ithas to be highlighted that the depicted structures in Fig. 3 havebeen generated with less than 5,000 training molecules avail-able. To that end, the linear trend of the logarithmic learningcurves indicates that more data will still improve the accu-racy, meaning that fundamentally the learning capability ofG2S has not yet achieved its full potential.. An improvementin accuracy would make the solution to the distance geometryproblem less ambiguous and, therefore, would lead to fewercases of conformer/diastereomer misclassifications.In order to further explore the role of the target format, wehave also attempted to build machine learning models of en-tries in the Z-matrix. However, the Z-matrix based predictionsdid not improve over the distance matrix based model esti-mates (see SI). Possible further strategies to improve on G2Scould include ∆ -machine learning where deviations from tabulated (or universal force-field based) estimates are mod-eled. IV. CONCLUSION
We have presented Graph To Structure (G2S), a kernelridge regression based machine learning capable of predict-ing pairwise interatomic distances in a variety of materials(closed shell organic molecules, transition states, carbenes,and crystals). G2S learning curves indicate improving pre-dictive power as training set increases. Training on less than5,000 stuctures affords prediction errors of less than 0.2 ÅMAE in distances for out-of-sample compounds and with nosign of convergence. Generated structures are chemicallymeaningful and exhibit high geometric similarity towards ref-erence geometries. However, noise in the distance matrixcan lead to dramatic changes towards other conformers or di-astereomers. Comparison to empirical popular structure gen-erators (ETKDG and Gen3D) has shown that on average, G2Sgenerated structures are closer to quantum chemistry refer-ence geometries.A solely data driven approach such as G2S is appealingdue to its inherent capability to further improve and gener-alize across chemical compound space as more training datais being made available. The usefulness of G2S output struc-tures has been shown by successfully elucidating 90% of the3,054 uncharacterized molecules from the QM9 database, byusing predicted structures as improved initial guesses for sub-sequent ab initio based geometry optimizations, and by usingthem as an input to structure based quantum machine learningmodels. The universal applicability of G2S has been shownby training on unconventional chemistries such as carbenes,E2/S N ACKNOWLEDGEMENT
O.A.v.L. acknowledges support from the Swiss NationalScience foundation (407540_167186 NFP 75 Big Data) andfrom the European Research Council (ERC-CoG grant QMLand H2020 projects BIG-MAP and TREX). This project hasreceived funding from the European Union’s Horizon 2020research and innovation programme under Grant Agreements http://scicore.unibas.ch/ ) scientific com-puting center at University of Basel.0
DATA AVAILABILITY
The constitutional isomer subset are part of the QM9database available at https://dx.doi.org/10.6084/m9.figshare.c.978904.v5 . The QMSpin and QM-rxn databases are both available on materialscloud at https://dx.doi.org/10.24435/materialscloud:2020.0051/v1 and https://dx.doi.org/10.24435/materialscloud:sf-tz , respectively. The elpasolitedataset is provided as part of the supplemental informationat https://dx.doi.org/10.1103/PhysRevLett.117.135502 . R. Zwanzig, A. Szabo, and B. Bagchi, “Levinthal’s paradox,”
Proceedingsof the National Academy of Sciences , vol. 89, pp. 20–22, Jan. 1992. Pub-lisher: National Academy of Sciences Section: Research Article. G. F. v. Rudorff, S. Heinen, M. Bragato, and O. A. v. Lilienfeld, “Thousandsof reactants and transition states for competing E2 and SN2 reactions,”
Ma-chine Learning: Science and Technology , 2020. T. N. Doman, S. L. McGovern, B. J. Witherbee, T. P. Kasten, R. Kurumbail,W. C. Stallings, D. T. Connolly, and B. K. Shoichet, “Molecular Dockingand High-Throughput Screening for Novel Inhibitors of Protein TyrosinePhosphatase-1B,”
Journal of Medicinal Chemistry , vol. 45, pp. 2213–2221,May 2002. Publisher: American Chemical Society. N. Yoshikawa and G. R. Hutchison, “Fast, efficient fragment-based coor-dinate generation for Open Babel,”
Journal of Cheminformatics , vol. 11,p. 49, Aug. 2019. S. Riniker and G. A. Landrum, “Better Informed Distance Geometry: UsingWhat We Know To Improve Conformation Generation,”
Journal of Chem-ical Information and Modeling , vol. 55, pp. 2562–2574, Dec. 2015. Pub-lisher: American Chemical Society. G. N. C. Simm and J. M. Hernández-Lobato, “A Generative Model forMolecular Distance Geometry,” arXiv:1909.11459 [cs, stat] , Mar. 2020.arXiv: 1909.11459. E. Mansimov, O. Mahmood, S. Kang, and K. Cho, “Molecular GeometryPrediction using a Deep Generative Graph Neural Network,”
Scientific Re-ports , vol. 9, p. 20381, Dec. 2019. M. Hoffmann and F. Noé, “Generating valid Euclidean distance matrices,” arXiv:1910.03131 [cs, stat] , Nov. 2019. arXiv: 1910.03131. N. W. A. Gebauer, M. Gastegger, and K. T. Schütt, “Symmetry-adaptedgeneration of 3d point sets for the targeted discovery of molecules,” arXiv:1906.00957 [physics, stat] , Jan. 2020. arXiv: 1906.00957. V. Nesterov, M. Wieser, and V. Roth, “3DMolNet: A Generative Networkfor Molecular Structures,” arXiv:2010.06477 [cs, q-bio] , Oct. 2020. arXiv:2010.06477. D. Weininger, “SMILES, a chemical language and information system. 1.Introduction to methodology and encoding rules,”
Journal of Chemical In-formation and Computer Sciences , vol. 28, pp. 31–36, Feb. 1988. Publisher:American Chemical Society. M. Krenn, F. Häse, A. Nigam, P. Friederich, and A. Aspuru-Guzik,“Self-referencing embedded strings (SELFIES): A 100% robust molecularstring representation,”
Machine Learning: Science and Technology , vol. 1,p. 045024, Nov. 2020. Publisher: IOP Publishing. J. J. Moré and Z. Wu, “Distance Geometry Optimization for Protein Struc-tures,”
Journal of Global Optimization , vol. 15, pp. 219–234, Oct. 1999. M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, “Fastand Accurate Modeling of Molecular Atomization Energies with MachineLearning,”
Physical Review Letters , vol. 108, p. 058301, Jan. 2012. O. A. von Lilienfeld, “Quantum Machine Learning in Chemical CompoundSpace,”
Angewandte Chemie International Edition , vol. 57, pp. 4164–4169,Apr. 2018. O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, “Exploring chem-ical compound space with quantum-based machine learning,”
Nature Re-views Chemistry , vol. 4, pp. 347–358, July 2020. Number: 7 Publisher:Nature Publishing Group. B. Huang and O. A. von Lilienfeld, “Ab initio machine learning in chem-ical compound space,” arXiv:2012.07502 [physics] , Dec. 2020. arXiv:2012.07502. D. G. Krige, “A statistical approach to some basic mine valuation problemson the Witwatersrand,”
Journal of the Southern African Institute of Miningand Metallurgy , vol. 52, pp. 119–139, Dec. 1951. Publisher: SouthernAfrican Institute of Mining and Metallurgy. V. N. Vapnik,
The Nature of Statistical Learning Theory . New York, NY:Springer New York, 2000. R. Ramakrishnan and O. A. von Lilienfeld, “Many Molecular Propertiesfrom One Kernel in Chemical Space,”
CHIMIA , vol. 69, p. 182, 2015. http://arxiv.org/abs/1502.04563 . A. S. Christensen, F. A. Faber, B. Huang, L. A. Bratholm, A. Tkatchenko,K.-R. Müller, and O. A. v. Lilienfeld, “"QML: A Python Toolkit for Quan-tum Machine Learning" https://github.com/qmlcode/qml,” 2017. O. Çaylak, O. A. v. Lilienfeld, and B. Baumeier, “Wasserstein metric forimproved quantum machine learning with adjacency matrix representa-tions,”
Machine Learning: Science and Technology , vol. 1, p. 03LT01, Aug.2020. Publisher: IOP Publishing. J. C. Kromann, “Calculate Root-mean-square deviation (RMSD) of twomolecules,”
Github . https://github.com/charnley/rmsd, ce7f533. W. Kabsch, “A solution for the best rotation to relate two sets of vectors,”
Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoreticaland General Crystallography , vol. 32, pp. 922–923, Sept. 1976. Number:5 Publisher: International Union of Crystallography. M. W. Walker, L. Shao, and R. A. Volz, “Estimating 3-D location pa-rameters using dual number quaternions,”
CVGIP: Image Understanding ,vol. 54, pp. 358–367, Nov. 1991. C. Cortes, L. D. Jackel, S. A. Solla, V. Vapnik, and J. S. Denker, “Learn-ing Curves: Asymptotic Values and Rate of Convergence,” in
Advances inNeural Information Processing Systems 6 (J. D. Cowan, G. Tesauro, andJ. Alspector, eds.), pp. 327–334, Morgan-Kaufmann, 1994. P. Pyykkö and M. Atsumi, “Molecular Single-Bond Covalent Radii for Ele-ments 1–118,”
Chemistry – A European Journal , vol. 15, pp. 186–197, Jan.2009. Publisher: John Wiley & Sons, Ltd. P. Pyykkö and M. Atsumi, “Molecular Double-Bond Covalent Radii forElements Li–E112,”
Chemistry – A European Journal , vol. 15, pp. 12770–12779, Nov. 2009. Publisher: John Wiley & Sons, Ltd. P. Pyykkö, S. Riedel, and M. Patzschke, “Triple-Bond Covalent Radii,”
Chemistry – A European Journal , vol. 11, pp. 3511–3520, June 2005. Pub-lisher: John Wiley & Sons, Ltd. K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilien-feld, K.-R. Müller, and A. Tkatchenko, “Machine Learning Predictionsof Molecular Properties: Accurate Many-Body Potentials and Nonlocal-ity in Chemical Space,”
The Journal of Physical Chemistry Letters , vol. 6,pp. 2326–2331, June 2015. F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, “MachineLearning Energies of 2 Million Elpasolite $(AB{C}_{2}{D}_{6})$ Crys-tals,”
Physical Review Letters , vol. 117, p. 135502, Sept. 2016. Publisher:American Physical Society. J. Schmidt, J. Shi, P. Borlido, L. Chen, S. Botti, and M. A. Marques, “Pre-dicting the thermodynamic stability of solids combining density functionaltheory and machine learning,”
Chemistry of Materials , vol. 29, no. 12,pp. 5090–5103, 2017. G. F. v. Rudorff and O. A. v. Lilienfeld, “Rapid and accurate molecular de-protonation energies from quantum alchemy,”
Physical Chemistry Chem-ical Physics , vol. 22, no. 19, pp. 10519–10525, 2020. Publisher: RoyalSociety of Chemistry. L. Liberti, C. Lavor, N. Maculan, and A. Mucherino, “Euclidean distancegeometry and applications,” arXiv:1205.0349 [q-bio] , May 2012. arXiv:1205.0349. V. I. Lebedev, “Quadratures on a sphere,”
USSR Computational Mathemat-ics and Mathematical Physics , vol. 16, pp. 10–24, Jan. 1976. R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, “Quantumchemistry structures and properties of 134 kilo molecules,”
Scientific Data ,vol. 1, pp. 1–7, Aug. 2014. Number: 1 Publisher: Nature Publishing Group. A. D. Becke, “Density-functional thermochemistry. III. The role of exactexchange,”
The Journal of Chemical Physics , vol. 98, pp. 5648–5652, Apr.1993. Publisher: American Institute of Physics. C. Lee, W. Yang, and R. G. Parr, “Development of the Colle-Salvetticorrelation-energy formula into a functional of the electron density,”
Phys-ical Review B , vol. 37, pp. 785–789, Jan. 1988. Publisher: American Phys-ical Society. P. J. Stephens, F. J. Devlin, C. F. Chabalowski, and M. J. Frisch, “Ab Ini-tio Calculation of Vibrational Absorption and Circular Dichroism SpectraUsing Density Functional Force Fields,”
The Journal of Physical Chem-istry , vol. 98, pp. 11623–11627, Nov. 1994. Publisher: American ChemicalSociety. R. Ditchfield, W. J. Hehre, and J. A. Pople, “Self-Consistent Molecular-Orbital Methods. IX. An Extended Gaussian-Type Basis for Molecular-Orbital Studies of Organic Molecules,”
The Journal of Chemical Physics ,vol. 54, pp. 724–728, Jan. 1971. Publisher: American Institute of Physics. W. J. Hehre, R. Ditchfield, and J. A. Pople, “Self—Consistent MolecularOrbital Methods. XII. Further Extensions of Gaussian—Type Basis Setsfor Use in Molecular Orbital Studies of Organic Molecules,”
The Journal ofChemical Physics , vol. 56, pp. 2257–2261, Mar. 1972. Publisher: AmericanInstitute of Physics. P. C. Hariharan and J. A. Pople, “The influence of polarization functionson molecular orbital hydrogenation energies,”
Theoretica chimica acta ,vol. 28, pp. 213–222, Sept. 1973. M. Schwilk, D. N. Tahchieva, and O. A. von Lilienfeld, “Large yetbounded: Spin gap ranges in carbenes,” arXiv:2004.10600 [physics] , Apr.2020. arXiv: 2004.10600. H. Werner and P. J. Knowles, “A second order multiconfiguration SCFprocedure with optimum convergence,”
The Journal of Chemical Physics ,vol. 82, pp. 5053–5063, June 1985. Publisher: American Institute ofPhysics. D. A. Kreplin, P. J. Knowles, and H.-J. Werner, “Second-order MCSCFoptimization revisited. I. Improved algorithms for fast and robust second-order CASSCF convergence,”
The Journal of Chemical Physics , vol. 150,p. 194106, May 2019. Publisher: American Institute of Physics. T. Busch, A. D. Esposti, and H. Werner, “Analytical energy gradients formulticonfiguration self-consistent field wave functions with frozen core or-bitals,”
The Journal of Chemical Physics , vol. 94, pp. 6708–6715, May1991. Publisher: American Institute of Physics. K. A. Peterson, T. B. Adler, and H.-J. Werner, “Systematically convergentbasis sets for explicitly correlated wavefunctions: The atoms H, He, B–Ne,and Al–Ar,”
The Journal of Chemical Physics , vol. 128, p. 084102, Feb.2008. Publisher: American Institute of Physics. M. J. Frisch, J. A. Pople, and J. S. Binkley, “Self-consistent molecular or-bital methods 25. Supplementary functions for Gaussian basis sets,”
TheJournal of Chemical Physics , vol. 80, pp. 3265–3269, Apr. 1984. Pub-lisher: American Institute of Physics. L. A. Curtiss, M. P. McGrath, J. Blaudeau, N. E. Davis, R. C. Binning, andL. Radom, “Extension of Gaussian-2 theory to molecules containing third-row atoms Ga–Kr,”
The Journal of Chemical Physics , vol. 103, pp. 6104–6113, Oct. 1995. Publisher: American Institute of Physics. A. D. McLean and G. S. Chandler, “Contracted Gaussian basis sets formolecular calculations. I. Second row atoms, Z=11–18,”
The Journal ofChemical Physics , vol. 72, pp. 5639–5648, May 1980. Publisher: AmericanInstitute of Physics. R. Krishnan, J. S. Binkley, R. Seeger, and J. A. Pople, “Self-consistentmolecular orbital methods. XX. A basis set for correlated wave functions,”
The Journal of Chemical Physics , vol. 72, pp. 650–654, Jan. 1980. Pub-lisher: American Institute of Physics. T. Clark, J. Chandrasekhar, G. W. Spitznagel, and P. V. R. Schleyer, “Effi-cient diffuse function-augmented basis sets for anion calculations. III. The3-21+G basis set for first-row elements, Li–F,”
Journal of ComputationalChemistry , vol. 4, pp. 294–301, Sept. 1983. Publisher: John Wiley & Sons,Ltd. P. E. Blöchl, “Projector augmented-wave method,”
Physical Review B ,vol. 50, pp. 17953–17979, Dec. 1994. Publisher: American Physical Soci-ety. G. Kresse and D. Joubert, “From ultrasoft pseudopotentials to the projectoraugmented-wave method,”
Physical Review B , vol. 59, pp. 1758–1775, Jan.1999. Publisher: American Physical Society. B. Huang and O. A. von Lilienfeld, “Communication: Understandingmolecular representations in machine learning: The role of uniqueness andtarget similarity,” vol. 145, no. 16, 2016. S. Heinen, G. F. von Rudorff, and O. A. von Lilienfeld, “Quantum basedmachine learning of competing chemical reaction profiles,” arXiv preprintarXiv:2009.13429 , 2020. S. Senthil, S. Chakraborty, and R. Ramakrishnan, “Troubleshooting Unsta-ble Molecules in Chemical Space,” arXiv:2010.02635 [physics] , Oct. 2020.arXiv: 2010.02635. C. Bannwarth, S. Ehlert, and S. Grimme, “GFN2-xTB—An Accurate andBroadly Parametrized Self-Consistent Tight-Binding Quantum ChemicalMethod with Multipole Electrostatics and Density-Dependent DispersionContributions,”
Journal of Chemical Theory and Computation , vol. 15,pp. 1652–1671, Mar. 2019. Publisher: American Chemical Society. J. ˇRezáˇc, J. Fanfrlík, D. Salahub, and P. Hobza, “Semiempirical QuantumChemical PM6 Method Augmented by Dispersion and H-Bonding Correc-tion Terms Reliably Describes Various Types of Noncovalent Complexes,”
Journal of Chemical Theory and Computation , vol. 5, pp. 1749–1760, July2009. Publisher: American Chemical Society. F. A. Faber, A. S. Christensen, B. Huang, and O. A. von Lilienfeld,“Alchemical and structural distribution based representation for universalquantum machine learning,”
The Journal of Chemical Physics , vol. 148,p. 241717, Mar. 2018. A. S. Christensen, L. A. Bratholm, F. A. Faber, and O. Anatole von Lilien-feld, “FCHL revisited: Faster and more accurate quantum machine learn-ing,”
The Journal of Chemical Physics , vol. 152, p. 044107, Jan. 2020.Publisher: American Institute of Physics. L. Ruddigkeit, R. van Deursen, L. C. Blum, and J.-L. Reymond, “Enu-meration of 166 Billion Organic Small Molecules in the Chemical Uni-verse Database GDB-17,”
Journal of Chemical Information and Modeling ,vol. 52, pp. 2864–2875, Nov. 2012. R. Ramakrishnan, P. Dral, M. Rupp, and O. A. von Lilienfeld, “Big Datameets Quantum Chemistry Approximations: The ∆∆