An assessment of the structural resolution of various fingerprints commonly used in machine learning
Behnam Parsaeifard, Deb Sankar De, Anders S. Christensen, Felix A. Faber, Emir Kocer, Sandip De, Joerg Behler, Anatole von Lilienfeld, Stefan Goedecker
Department of Physics, University of Basel, Klingelbergstrasse 82, CH-4056 Basel, Switzerland
National Center for Computational Design and Discovery of Novel Materials (MARVEL), Switzerland
Institute of Physical Chemistry, Department of Chemistry, University of Basel, Klingelbergstr. 80, CH-4056 Basel, Switzerland
Universität Göttingen, Institut für Physikalische Chemie, Theoretische Chemie, Tammannstr. 6, 37077 Göttingen, Germany
Present address: BASF SE, 67056 Ludwigshafen am Rhein, Germany
Laboratory of Computational Science and Modelling, Institute of Materials, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
(Dated: August 10, 2020)

Atomic environment fingerprints are widely used in computational materials science, from machine learning potentials to the quantification of similarities between atomic configurations. Many approaches to the construction of such fingerprints, also called structural descriptors, have been proposed. In this work, we compare the performance of fingerprints based on the Overlap Matrix (OM), the Smooth Overlap of Atomic Positions (SOAP), Behler-Parrinello atom-centered symmetry functions (ACSF), modified Behler-Parrinello symmetry functions (MBSF) used in the ANI-1ccx potential and the Faber-Christensen-Huang-Lilienfeld (FCHL) fingerprint under various aspects. We study their ability to resolve differences in local environments and in particular examine whether there are certain atomic movements that leave the fingerprints exactly or nearly invariant. For this purpose, we introduce a sensitivity matrix whose eigenvalues quantify the effect of atomic displacement modes on the fingerprint. Further, we check whether these displacements correlate with the variation of localized physical quantities such as forces. Finally, we extend our examination to the correlation between molecular fingerprints obtained from the atomic fingerprints and global quantities of entire molecules.
I. INTRODUCTION
Materials science and chemistry are becoming data-driven sciences. Both experimental and theoretical data often contain similar or duplicate structures which differ only by the noise that is present in any experimental measurement as well as in theoretical structure predictions. Such structures can be eliminated based on fingerprint distances. If the structures differ by more than just noise, one frequently wants to quantify their dissimilarity. This is particularly important for applications of supervised machine learning in materials science, where fingerprints form in most schemes the input for neural networks or other machine learning schemes, but also for eliminating redundant structures, e.g. in the global exploration of potential-energy surfaces. Both for the detection of duplicate structures and for machine learning, various atomic environment descriptors have been proposed to date. In the pioneering work of Behler and Parrinello, so-called symmetry functions were introduced to describe the chemical environment of each atom and to form the input to atomic neural networks. Two schemes related to the original Behler-Parrinello atom-centered symmetry functions (ACSF) will also be used here and denoted as MBSF and FCHL. The numerically more efficient discretized version of the FCHL fingerprint is used in our study. Another fingerprint that is widely used in the context of machine learning is the Smooth Overlap of Atomic Positions (SOAP) atomic environment descriptor. The last fingerprint that is included in our tests is the Overlap Matrix (OM) fingerprint, which has been used to find duplicate structures in minima-hopping-based structure predictions and to bias the potential energy landscape to find chemical reaction pathways, as well as in machine learning. Many other types of fingerprints have been proposed in the literature to date.
In the following all these descriptors will be called fingerprints.

Cartesian coordinates of atoms in structures, augmented in the crystalline case with the vectors describing the unit cell, form an elementary representation of a configuration or atomic environment. However, such Cartesian descriptors are problematic since they are not invariant under translations, rotations and atomic index permutations. Other descriptors are therefore needed, which must be invariant under translations, rotations, and other symmetry operations as well as under permutations of identical atoms. All the fingerprints considered in this work are invariant under these operations. The fingerprint distance between two structures can for instance be calculated as the Euclidean norm of the difference between the two fingerprint vectors. In this work, we compare the structural resolution of various fingerprints, i.e. their ability to recognize and quantify differences in atomic environments based on such fingerprint distances.

II. DESCRIPTION OF FINGERPRINTS USED
In this section we give a very brief summary of the fingerprints used in this study. For a complete description of the fingerprints, the reader is referred to the original publications on OM, SOAP, FCHL, ACSF, and MBSF.

The OM method is inspired by the experimental approach to identifying structures. Experimental approaches typically use some spectrum, such as a vibrational spectrum or an electronic excitation spectrum, to identify structures. Both are related to the eigenvalues of certain matrices. As was shown by Sadeghi et al., the eigenvalues of the Hessian matrix or of the Kohn-Sham Hamiltonian matrix are excellent fingerprints for molecular structures, but these matrices are quite expensive to calculate. Fortunately, it turns out that the eigenvalues of a matrix that is extremely fast to calculate, namely the overlap matrix, which contains the full structural information, are of comparable quality. To calculate the fingerprint of an atom $k$ in the OM scheme, a sphere of radius $R_c$ is centered on it. We place a minimal basis set of four Gaussian type orbitals (GTOs) $G_\nu(\mathbf{r} - \mathbf{R}_i)$ (i.e. radial Gaussians times spherical harmonics) on each atom $i$ in the sphere, namely one s-type GTO ($\nu = 1$) and three p-type GTOs ($\nu = 2, 3, 4$); this choice is denoted OM[sp]. The width of the radial Gaussian is given by the covalent radius of the element. Then the overlap between all orbitals on the atoms in the sphere is calculated as

$$S^k_{i,\nu;j,\mu} = \int G_\nu(\mathbf{r} - \mathbf{R}_i)\, G_\mu(\mathbf{r} - \mathbf{R}_j)\, d\mathbf{r}.$$

The off-diagonal elements of the overlap matrix decay quite fast with respect to the distance from the central atom. This decay is also exploited in linear-scaling electronic structure calculations. Such a fast decay has been shown in a similar context to be advantageous compared to a slower inverse power law decay.
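As an illustration of the construction described above, the following minimal Python sketch builds the overlap matrix for s-type orbitals only and takes its sorted eigenvalues as the fingerprint. It is a simplified stand-in and not the OM[sp] implementation used in this work: the p-type orbitals are omitted, no cutoff is applied, and the closed-form overlap of two normalized Gaussians $\exp(-r^2/2w^2)$ is used. The function name `om_s_fingerprint`, the random test geometry and the width of 0.76 Å are assumptions made for the example.

```python
import numpy as np

def om_s_fingerprint(positions, widths):
    """Sorted eigenvalues of the overlap matrix of normalized s-type Gaussians.

    Simplified sketch: s orbitals only, no cutoff function (assumptions).
    """
    positions = np.asarray(positions, dtype=float)
    widths = np.asarray(widths, dtype=float)
    # Pairwise squared distances |R_i - R_j|^2
    diff = positions[:, None, :] - positions[None, :, :]
    d2 = np.sum(diff**2, axis=-1)
    # Closed-form overlap of two normalized Gaussians exp(-r^2 / (2 w^2))
    w2 = widths[:, None] ** 2 + widths[None, :] ** 2
    prefactor = (2.0 * widths[:, None] * widths[None, :] / w2) ** 1.5
    overlap = prefactor * np.exp(-d2 / (2.0 * w2))
    # Sorting the eigenvalues makes the result independent of the atom order
    return np.sort(np.linalg.eigvalsh(overlap))[::-1]

# Hypothetical test geometry: five atoms with an assumed width of 0.76 Angstrom
rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 3))
w = np.full(5, 0.76)

# Apply an orthogonal transformation, a translation and a permutation
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
perm = rng.permutation(5)
fp_ref = om_s_fingerprint(pos, w)
fp_moved = om_s_fingerprint((pos @ q.T + 1.5)[perm], w[perm])
print(np.max(np.abs(fp_ref - fp_moved)))
```

The printed difference should vanish up to rounding errors, which reflects the invariance of the eigenvalue spectrum under translations, rotations and permutations of the atoms.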
Each element $S^k_{i,j}$ of this matrix is then multiplied by two amplitudes $f_c(|\mathbf{R}_k - \mathbf{R}_i|)$ and $f_c(|\mathbf{R}_k - \mathbf{R}_j|)$, where $f_c(r) = \left(1 - \left(\frac{r}{2w}\right)^2\right)^2$ is a cutoff function which smoothly tends to zero at $r = 2w = R_c$. So the width $w$, which determines the cutoff radius, is the only parameter in this scheme.

The vector $\mathbf{F}_k$ containing all the eigenvalues of this matrix is then the fingerprint of atom $k$. The fingerprint distance between two atoms $I$ and $J$ is defined to be the Euclidean distance between their fingerprint vectors: $\Delta_{IJ} = |\mathbf{F}_I - \mathbf{F}_J|$.

The fingerprint distance defined above has a discontinuity in the first derivative when two eigenvalues cross. This is an extremely rare event and does not cause problems in most applications. If a completely continuous distance is desired, the following post-processing of the eigenvalues can be used to generate a new set $\tilde{\mathbf{F}}$ of fingerprints that gives rise to completely continuous fingerprint distances:

$$\tilde{F}_i = \frac{\sum_l F_l \exp\left(-\left(\frac{F_l - F_i}{a}\right)^2\right)}{\sum_l \exp\left(-\left(\frac{F_l - F_i}{a}\right)^2\right)} \qquad (1)$$

In the SOAP (Smooth Overlap of Atomic Positions) scheme, a Gaussian of width $\sigma$ is centered on each atom within the cutoff distance around the central atom $k$. The resulting density of atoms,

$$\rho_k(\mathbf{r}) = \sum_i \exp\left(-\frac{(\mathbf{r} - \mathbf{R}_{ki})^2}{2\sigma^2}\right) \times f_{cut}(|\mathbf{r} - \mathbf{R}_{ki}|),$$

multiplied with a cutoff function which goes smoothly to zero at the cutoff radius over a characteristic width $r_\delta$, is then expanded in terms of orthogonal radial functions $g_n(r)$ and spherical harmonics $Y_{lm}(\theta, \phi)$ as

$$\rho_k(\mathbf{r}) = \sum_{nlm} c^k_{nlm}\, g_n(r)\, Y_{lm}(\theta, \phi),$$

where $c^k_{nlm} = \langle g_n Y_{lm} | \rho_k \rangle$. The quantity $p^k_{nn'l} = \sqrt{\frac{8\pi^2}{2l+1}} \sum_m c^k_{nlm} (c^k_{n'lm})^*$ is invariant under rotations, and the vector $\mathbf{F}_k$ containing all $p^k_{nn'l}$ with $n, n' \le n_{max}$ and $l \le l_{max}$ is the SOAP fingerprint vector of atom $k$.
The fingerprint distance between atoms $I$ and $J$ can then either be defined as $\Delta_{IJ} = |\mathbf{F}_I - \mathbf{F}_J|$ or as $\Delta_{IJ} = (1 - \mathbf{F}_I \cdot \mathbf{F}_J)^{1/2}$. Since the second definition is used in the majority of machine learning applications and since we could not find any difference in preliminary tests, we use the second definition of the fingerprint distance for SOAP. This definition requires the fingerprint vector to be normalized to 1, such that $\sum_i F_i^2 = 1$.

This has the strange side effect that the $N$ fingerprints of a system of $N$ atoms remain identical if $N$ additional atoms are placed on top of the original $N$ atoms. Further, the fingerprint vectors are the same for a dimer whose two atoms are at a very large distance and at zero distance.

The QUIPPY software was used to generate the SOAP fingerprints, with the following parameters: $n_{max} = l_{max} = 12$ and σ = 0 . , r δ = 4 . Å.

The atom-centered symmetry functions (ACSF) proposed by Behler and Parrinello in 2007 were the first descriptors suitable as input for ML methods for the description of high-dimensional multi-atom systems. They form atomic fingerprint vectors consisting of sets of atom-centered many-body radial and angular functions, which describe the chemical environments of the atoms in the system.

Radial functions are sums of two-body terms and describe the radial environment of an atom $i$. They have, for instance, the analytical form

$$G_i = \sum_j e^{-\eta (R_{ij} - R_s)^2} f_c(R_{ij}).$$

The angular functions are sums of three-body terms and describe the angular environment of the atom. Two examples are defined below:

$$G_i = 2^{1-\zeta} \sum_{j,k \neq i}^{\mathrm{all}} (1 + \lambda \cos\theta_{ijk})^{\zeta}\, e^{-\eta (R_{ij}^2 + R_{ik}^2 + R_{jk}^2)} f_c(R_{ij}) f_c(R_{ik}) f_c(R_{jk}) \qquad (2)$$

$$G_i = 2^{1-\zeta} \sum_{j,k \neq i}^{\mathrm{all}} (1 + \lambda \cos\theta_{ijk})^{\zeta}\, e^{-\eta (R_{ij}^2 + R_{ik}^2)} f_c(R_{ij}) f_c(R_{ik}) \qquad (3)$$

where $\theta_{ijk}$ is the angle between $\mathbf{R}_{ij}$ and $\mathbf{R}_{ik}$ and $f_c(r)$ is a smooth cutoff function.
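The radial function and the angular function of Eq. (3) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the RuNNer implementation: the central atom is placed at the origin, the cutoff is assumed to be the cosine form $f_c(r) = \frac{1}{2}(\cos(\pi r / R_c) + 1)$ (the text only requires a smooth cutoff), the angular sum runs over all ordered pairs $j \neq k$, and the function names and parameter values are hypothetical.

```python
import numpy as np

def cosine_cutoff(r, r_c):
    """Assumed smooth cutoff: 0.5 * (cos(pi r / r_c) + 1) for r < r_c, else 0."""
    r = np.asarray(r, dtype=float)
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def radial_sf(neighbors, eta, r_s, r_c):
    """Radial function: sum_j exp(-eta (R_ij - R_s)^2) f_c(R_ij).

    `neighbors` holds neighbor positions relative to the central atom.
    """
    r = np.linalg.norm(neighbors, axis=1)
    return float(np.sum(np.exp(-eta * (r - r_s) ** 2) * cosine_cutoff(r, r_c)))

def angular_sf(neighbors, eta, zeta, lam, r_c):
    """Angular function of the form of Eq. (3), i.e. without the R_jk terms."""
    r = np.linalg.norm(neighbors, axis=1)
    unit = neighbors / r[:, None]
    # Cosine of the angle j-i-k for all neighbor pairs, clipped against rounding
    cos_theta = np.clip(unit @ unit.T, -1.0, 1.0)
    radial = np.exp(-eta * r**2) * cosine_cutoff(r, r_c)
    terms = (1.0 + lam * cos_theta) ** zeta * radial[:, None] * radial[None, :]
    np.fill_diagonal(terms, 0.0)  # exclude j == k
    return float(2.0 ** (1.0 - zeta) * np.sum(terms))

# Hypothetical environment: two neighbors at 1 Angstrom forming a 90 degree angle
neighbors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
g_rad = radial_sf(neighbors, eta=0.5, r_s=0.0, r_c=6.0)
g_ang = angular_sf(neighbors, eta=0.1, zeta=2.0, lam=1.0, r_c=6.0)
print(g_rad, g_ang)
```

A fingerprint vector in the sense of the text is obtained by evaluating such functions for several parameter sets and collecting the results in one vector.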
The vector $\mathbf{F}_i$ containing all the $G_i$'s for various values of $\eta$, $\lambda$, $R_s$, and $\zeta$ is the fingerprint vector of atom $i$. In the present work, we used 10 radial symmetry functions of type G and 48 angular symmetry functions of type G, which have been generated with the software RuNNer. We have used CUR to find the most relevant symmetry functions, as we found that larger sets did not lead to significant improvements.

Isayev et al. made two modifications to the original Behler-Parrinello angular symmetry functions to obtain modified Behler-Parrinello symmetry functions (MBSFs) while retaining the form of the radial functions. These modifications are the addition of a reference angle $\theta_s$ to the term $\cos(\theta_{ijk})$, which allows an arbitrary number of shifts in the angular environment, and the addition of a shift $R_s$ to the exponential term in the angular symmetry functions. The $R_s$ addition allows the angular environment to be considered within radial shells based on the average of the distances from the neighboring atoms, similar to the radial shift $R_s$ in the original Behler-Parrinello radial functions. So their modified angular symmetry function is

$$G^A_i = 2^{1-\zeta} \sum_{j,k \neq i}^{\mathrm{all}} (1 + \lambda \cos(\theta_{ijk} - \theta_s))^{\zeta}\, e^{-\eta \left(\frac{R_{ij} + R_{ik}}{2} - R_s\right)^2} f_c(R_{ij}) f_c(R_{ik}) \qquad (4)$$

In this approach, a single $\eta$ and multiple values of $R_s$ and $\theta_s$ are used to generate the fingerprint vector $\mathbf{F}_i$. We used 32 evenly spaced radial shifting parameters for the radial part, and a total of 8 radial and 8 angular shifting parameters for the angular part of the MBSF, resulting in a total of 96 symmetry functions. The QML software package was then used to generate the MBSF fingerprints.

The last fingerprint that we study is the discretized FCHL fingerprint introduced by Faber et al. FCHL encodes geometric and elemental information into the fingerprint, with up to three-body terms included.
The two-body terms consist of sums of log-normal radial functions of the form

$$G^{2\text{-body}} = \xi_2(r_{IJ})\, f_{cut}(r_{IJ})\, \frac{1}{R_s\, \sigma(r_{IJ}) \sqrt{2\pi}} \exp\left(-\frac{(\ln R_s - \mu(r_{IJ}))^2}{2\sigma(r_{IJ})^2}\right) \qquad (5)$$

where $f_{cut}(r_{IJ})$ is a smooth cut-off function, $\xi_2(r_{IJ})$ is a weight function of the form $1/r_{IJ}^{N_2}$ which serves to put a higher weight in the regression on effects from atoms at closer distances, $\mu(r_{IJ}) = \ln\left(\frac{r_{IJ}}{\sqrt{1 + w/r_{IJ}^2}}\right)$, and $\sigma(r_{IJ}) = 1 + \frac{w}{r_{IJ}^2}$. The three-body term in FCHL is the product of a radial part and an angular part, where the angular part uses a (truncated) Fourier expansion of the angular spectrum:

$$G^{3\text{-body}} = \xi_3\, G^{3\text{-body}}_{\mathrm{Radial}}\, G^{3\text{-body}}_{\mathrm{Angular}}\, f_{cut}(r_{IJ})\, f_{cut}(r_{JK})\, f_{cut}(r_{KI}) \qquad (6)$$

where

$$G^{3\text{-body}}_{\mathrm{Radial}} = \sqrt{\frac{\eta_3}{\pi}} \exp\left(-\eta_3 \left(\frac{1}{2}(r_{IJ} + r_{IK}) - R_s\right)^2\right) \qquad (7)$$

and $G^{3\text{-body}}_{\mathrm{Angular}}$ contains the sine and cosine terms below with $n = 1$:

$$G^{\cos}_n = \exp\left(-\frac{(\zeta n)^2}{2}\right) \left(\cos(n\theta_{KIJ}) - \cos(n(\theta_{KIJ} + \pi))\right) \qquad (8)$$

$$G^{\sin}_n = \exp\left(-\frac{(\zeta n)^2}{2}\right) \left(\sin(n\theta_{KIJ}) - \sin(n(\theta_{KIJ} + \pi))\right) \qquad (9)$$

where $\theta_{KIJ}$ is the angle between the atoms $I$, $J$ and $K$. Furthermore, the three-body symmetry functions are weighted with an Axilrod-Teller-Muto term defined as:

$$\xi_3 = c_3\, \frac{1 + 3\cos(\theta_{KIJ}) \cos(\theta_{IJK}) \cos(\theta_{JKI})}{(r_{IJ}\, r_{JK}\, r_{KI})^{N_3}} \qquad (10)$$

This again serves to attribute a higher weight to atomic configurations that are likely to interact more strongly. We used the default parameters described in the original references and the QML software to generate the FCHL fingerprints.

For all fingerprints related to the Behler-Parrinello symmetry functions, i.e. for ACSF, MBSF and FCHL, we use the Euclidean norm of the difference of the fingerprint vectors as the fingerprint distance.

For a fair comparison we have chosen the same cutoff radius for all fingerprints, namely 6.0 Å. This or very similar values were used in numerous studies. So all the methods see exactly the same environment and could therefore in principle encode the same information in their resulting fingerprint vectors. With this choice of parameters, the length of the fingerprints was 240 for OM, 1015 for SOAP, 58 for ACSF, 96 for MBSF and 64 for FCHL.

III. RESULTS
In this section we introduce some criteria to assess the performance of the various fingerprints. First, we derive a formalism that allows us to examine the behavior of the different fingerprints under infinitesimal changes of the atomic coordinates. We show that there is a matrix, which we call the sensitivity matrix, that describes this behavior. In particular, the displacement modes of this matrix that belong to zero eigenvalues leave the fingerprint constant for movements along these modes and therefore indicate a failure of the fingerprint to detect geometry changes. Next, we compare the distances obtained by different fingerprints for a test set. This test helps us to find cases where a certain fingerprint cannot recognize differences between different chemical environments. In addition, we correlate in both cases changes in fingerprint distances with changes of physical quantities such as forces, energies and densities of states.
A. Behavior of fingerprints under infinitesimal displacements
To study the evolution of fingerprint distances under small displacements, we consider the change of the squared fingerprint distance up to second order in a Taylor expansion around a reference configuration $\mathbf{R}^0$. Denoting the fingerprint of the reference configuration by $\mathbf{F}^0$ and the fingerprint of a configuration displaced by $\Delta\mathbf{R}$ by $\mathbf{F}(\mathbf{R})$, we get

$$(\mathbf{F}(\mathbf{R}) - \mathbf{F}^0)^2 = \sum_{\alpha,\beta} \Delta R_\alpha \left(\sum_i g_{i,\alpha}\, g_{i,\beta}\right) \Delta R_\beta \qquad (11)$$

where $g_{i,\alpha}$ is the gradient of the $i$-th component of the fingerprint vector with respect to the Cartesian component $\alpha$ (x, y, or z of one of the atoms) of the position vector $\mathbf{R}$, i.e.

$$g_{i,\alpha} = \left.\frac{\partial F_i}{\partial R_\alpha}\right|_{\mathbf{R} = \mathbf{R}^0} \qquad (12)$$

In taking this derivative we have to consider only the atomic positions within the sphere around the central atom, because by construction atoms outside the sphere have no influence on the fingerprint. We call the matrix $\sum_i g_{i,\alpha}\, g_{i,\beta}$ the sensitivity matrix. It has the dimension $3N \times 3N$, where $N$ is the number of atoms within the cutoff sphere around the reference atom. In the following, we will examine its eigenvalues and eigenvectors. To allow a meaningful comparison of the fingerprints obtained by different methods, we have scaled all the eigenvalues such that the largest eigenvalue is one. Since the fingerprint is invariant under a uniform translation and rotation of all the atoms in the sphere, the sensitivity matrix always has at least 6 zero eigenvalues. More than 6 zero eigenvalues indicate that there are other displacement modes which leave the fingerprint invariant. This is highly problematic since it indicates that one can generate different atomic environments which do not change the fingerprint. By iteratively calculating these zero-eigenvalue displacement modes and then moving the system by an infinitesimal amount along those consecutive modes, one can construct from a sequence of infinitesimally small displacements a finite displacement which leaves the fingerprint invariant.
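The construction above can be illustrated numerically. To first order, $F_i(\mathbf{R}) - F_i^0 \approx \sum_\alpha g_{i,\alpha} \Delta R_\alpha$, so the sensitivity matrix is $J^T J$ with the Jacobian $J_{i\alpha} = g_{i,\alpha}$. The Python sketch below approximates $J$ by central finite differences and counts the zero eigenvalues. It is a toy illustration under stated assumptions: the descriptor `pair_distances` (the list of all interatomic distances of four atoms) is a hypothetical stand-in chosen because it is invariant under translations and rotations, it is not one of the fingerprints studied here, and the step size and the tolerance are arbitrary choices.

```python
import numpy as np

def sensitivity_matrix(fingerprint, positions, h=1e-5):
    """Return J^T J with J[i, a] = dF_i/dR_a from central finite differences."""
    shape = np.shape(positions)
    x0 = np.asarray(positions, dtype=float).ravel()
    columns = []
    for a in range(x0.size):
        step = np.zeros_like(x0)
        step[a] = h
        f_plus = fingerprint((x0 + step).reshape(shape))
        f_minus = fingerprint((x0 - step).reshape(shape))
        columns.append((f_plus - f_minus) / (2.0 * h))
    jac = np.stack(columns, axis=1)  # shape (fingerprint length, 3N)
    return jac.T @ jac

def pair_distances(positions):
    """Toy descriptor (assumption): all interatomic distances."""
    i, j = np.triu_indices(len(positions), k=1)
    return np.linalg.norm(positions[i] - positions[j], axis=1)

# Hypothetical non-planar configuration of four atoms
pos = np.array([[0.0, 0.0, 0.0],
                [1.4, 0.0, 0.0],
                [0.3, 1.2, 0.0],
                [0.5, 0.4, 1.1]])
S = sensitivity_matrix(pair_distances, pos)
evals = np.linalg.eigvalsh(S)
evals = evals / evals.max()          # scale the largest eigenvalue to one
n_zero = int(np.sum(evals < 1e-8))   # modes that leave the descriptor invariant
print(n_zero)
```

For this toy descriptor exactly the six rigid-body modes (three translations and three rotations) are found as zero modes. A descriptor with fewer independent components than $3N - 6$ necessarily produces additional zero eigenvalues, which is the situation described in the text.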
Equally problematic are eigenvalues that are very small. In this case the fingerprint variation will not be exactly zero, but will be extremely small. We now study the sensitivity matrix for the two configurations of 60 carbon atoms shown in Fig. 1. An analogous analysis will be presented in the supplementary information for two more structures.

In Fig. 1a the reference atom forms three bonds with its three nearest neighbors and is surrounded by one pentagon and two hexagons, while in Fig. 1b the atom of interest resides on a chain and has fewer neighbors compared to the atom in Fig. 1a.

Table I: The parameters used for each fingerprint. The letters a to h refer to the value lists below the table.

Fingerprint | Name | Value | Unit | Description
MBSF | $R_s$ ($G^R$) | a | Å | Two-body radial bins
MBSF | $R_s$ ($G^A$) | b | Å | Three-body radial bins
MBSF | $\theta_s$ | c | | Three-body angular bins
MBSF | $r_{cut}$ | | |
MBSF | $\eta$ ($G^R$) | | | Two-body width parameter
MBSF | $\eta$ ($G^A$) | | | Three-body width parameter
MBSF | $\zeta$ | | |
FCHL | $n_{Rs}$ | d | Å | Two-body radial bins
FCHL | $n_{Rs}$ | e | Å | Sin three-body radial bins
FCHL | $n_{Rs}$ | f | Å | Cos three-body radial bins
FCHL | $w$ | 0.32 | Å | Two-body width parameter
FCHL | $\eta$ | | | Three-body width parameter
FCHL | $N$, $N$, $c$ | | | Three-body weight
FCHL | $\zeta$ | $\pi$ | | Angular exponent
FCHL | $r_{cut}$ | | |
SOAP | $\sigma$, $l_{max}$, $n_{max}$, $r_\delta$ | | |
OM | $R_c$ | $R_c = 2w$ | |
OM | s and p-type orbitals | 3 p-type: $p_x$, $p_y$, $p_z$ | |
ACSF | $\eta$ ($G$) | g | Å$^{-2}$ | Two-body width parameter
ACSF | $\eta$ ($G$) | h | Å$^{-2}$ | Three-body width parameter
ACSF | $\lambda$, $\zeta$, $R_c$ | | |

a [0.8, 0.968, 1.135, 1.303, 1.471, 1.639, 1.806, 1.974, 2.142, 2.31, 2.477, 2.645, 2.813, 2.981, 3.148, 3.316, 3.484, 3.652, 3.819, 3.987, 4.155, 4.323, 4.490, 4.658, 4.826, 4.994, 5.161, 5.329, 5.497, 5.665, 5.832, 6.0]
b [0.8, 1.543, 2.286, 3.0286, 3.771, 4.514, 5.257, 6.0]
c [0.0, 0.449, 0.898, 1.346, 1.795, 2.244, 2.693, 3.142]
d [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5, 3.75, 4.0, 4.25, 4.5, 4.75, 5.0, 5.25, 5.5, 5.75, 6.0]
e [0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3.0, 3.3, 3.6, 3.9, 4.2, 4.5, 4.8, 5.1, 5.4, 5.7, 6.0]
f [0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3.0, 3.3, 3.6, 3.9, 4.2, 4.5, 4.8, 5.1, 5.4, 5.7, 6.0]
g [0.003, 0.018, 0.036, 0.054, 0.071, 0.089, 0.125, 0.161, 0.214, 0.285]
h [0.0, 0.004, 0.018, 0.071, 0.214, 0.285]

Figure 1: Two environments, shown in panels (a) and (b), which are used for studying the behavior of various fingerprints. The two atoms whose environment needs to be described are shown in red. Both structures are meta-stable.

In Fig. 2a we show the eigenvalues of the sensitivity matrix of configuration 1a for all the fingerprints examined in our study. The eigenvalues of the sensitivity matrix for ACSF, MBSF, and FCHL decrease much more rapidly to zero than the eigenvalues of SOAP and OM[sp]. This means that in ACSF, MBSF, and FCHL there exist only a few modes that have a strong influence on the fingerprint. It is also of interest to look at the associated modes shown in Figs. 3 and 4. In the context of machine learning one might hope that the modes that are associated with the largest eigenvalues, and therefore lead to the strongest variation in the fingerprint, also lead to the largest variation of physical properties such as forces. Since movements of atoms close to the central atom will in general lead to a strong variation of the environment of the reference atom, this means that modes belonging to large eigenvalues should be localized around the central atom.

[Figure 2, panels (a) and (b): eigenvalues of the sensitivity matrix (arb. u.) versus eigenvalue number (arb. u.) for OM[sp], MBSF, ACSF, SOAP and FCHL.]
Figure 2: The eigenvalues of the sensitivity matrix in (a) for the reference atom of 1a and in (b) for the reference atom of 1b. The configuration in 1a has 42 atoms in the sphere around the central atom, giving rise to $3 \times 42 - 6$ non-zero eigenvalues, whereas the configuration in 1b has 14 atoms in the sphere, giving rise to $3 \times 14 - 6$ non-zero eigenvalues. All eigenvalues that are non-zero up to machine precision are shown.

The movement that leads to the strongest variation of the energy for the configurations shown in Fig. 3 is clearly a bond stretching mode where the 3 neighboring atoms either move towards the central atom or away from it (Fig. 3a, 3b, 3c). Then follows a movement where two bonds of the central atom are compressed and one is stretched, and finally an out-of-plane movement of the central atom. These three modes are exactly the modes associated with the 3 largest eigenvalues of the OM sensitivity matrix. SOAP and FCHL also describe the physically important modes with reasonably large eigenvalues. In the ACSF and MBSF fingerprints, however, only an out-of-plane mode has a reasonably large eigenvalue. The modes belonging to the few largest eigenvalues are always localized on the reference atom and a few surrounding atoms. As the eigenvalues become smaller the modes should get more delocalized, and this is indeed true in most cases. There are however some exceptions, such as the modes of ACSF shown in the panels l of Fig. 3 and Fig. 4, the modes of MBSF in the panels p of Fig. 3 and Fig. 4, and a mode of SOAP shown in panel h of Fig. 4.

This discussion, which was based on some physical insight into which modes are important, can also be made more quantitative. We do this by plotting the change in the force acting on the central atom when the system is moved along the different modes against the eigenvalue of the respective mode. This is shown in Fig. 5. A clear correlation is found for OM and SOAP, while for ACSF, MBSF and FCHL the correlation is substantially weaker, with FCHL showing at least the correct trend.
This means that movements along modes associated with large and small eigenvalues have almost the same influence on the force on the reference atom.

Even though the environment of Fig. 1b is quite different, the performance of the fingerprints is quite similar. Only OM and SOAP detect the physically important modes (Fig. 2b), i.e. assign a large eigenvalue to these modes. They are also the only two fingerprints that give a good correlation between the eigenvalues and the change in the force (Fig. 5).

While SOAP performs well in our test case where many atoms are contained in the sphere, it was recently shown that for a methane molecule there are movements that leave the SOAP fingerprint of the carbon atom invariant. We detected the same deficiency for ACSF, MBSF and FCHL. We have also tested the OM fingerprint for these configurations and did not find any small or even zero eigenvalues. This is to be expected since the OM fingerprint is based on a matrix diagonalization scheme that is similar to the diagonalization of the Hamiltonian matrix in a quantum-mechanical calculation. Hence the scheme is not restricted to the information obtained only from the radial and angular distribution of the atoms in the sphere.

B. Correlation of fingerprint distances
In this section, we compare the resolution power of different fingerprints, i.e. their numerical sensitivity to small dissimilarities between atomic environments. To perform the tests we have generated a set of 1000 C$_{60}$ structures using minima hopping coupled to DFTB. In this way we have obtained $1000 \times 60$ environments arising from a large variety of structural motifs such as chains, planar structures and cages. We will in the following correlate all the $60000 \times (60000 - 1)/2$ pairwise atomic fingerprint distances obtained from different fingerprint types. Obviously, large fingerprint distances should be obtained for environments that are quite different, whereas small distances correspond to similar environments. Since the absolute value of a fingerprint distance is arbitrary, we scale all our fingerprint distances such that a distance of one corresponds to the noise level. We define the noise level as the fingerprint distance between identical structures whose atoms were randomly displaced by an amount of up to ± . Å.

Figure 3: The eigenvectors belonging to the three largest eigenvalues and one representative small eigenvalue of the sensitivity matrix for the atomic environment in 1a. Panels (a)-(d) show OM[sp], (e)-(h) SOAP, (i)-(l) ACSF, (m)-(p) MBSF and (q)-(t) FCHL; in each row the first three panels belong to the three largest eigenvalues $\lambda$ and the fourth panel to the small eigenvalue. The red atom is again the reference atom. The displacement modes given by the eigenvectors are represented by arrows. Only atomic eigenvector components whose length is larger than 0.1 are shown. For this reason it is not always visible that all the components exactly sum up to zero.

Figure 4: Same as Fig. 3 but for the environment of 1b.

Figure 5: Changes of the absolute forces $\Delta F$ (Ha/Bohr) upon displacements along the eigenvectors of the sensitivity matrix vs. its eigenvalues (arb. u.) for (a) OM[sp], (b) SOAP, (c) ACSF, (d) MBSF and (e) FCHL. For each fingerprint the atoms in the system are moved along the respective eigenvectors and the force changes are calculated using DFT. The red and the blue curves belong to the reference atoms in 1a and 1b, respectively. There is a strong correlation for OM and SOAP, since eigenvectors of large eigenvalues are localized around the reference atom and eigenvectors of small eigenvalues are localized farther away from the reference atom, whereas for ACSF, MBSF, and FCHL this is not the case (there is no preferred spatial order of the components, which is why a clear correlation cannot be seen).

Since the number of environment pairs is huge, we would not be able to resolve each pair in a simple correlation plot where we would plot the fingerprint distances $\Delta^A_{I,J}$ according to fingerprint A versus the distances $\Delta^B_{I,J}$ according to fingerprint B. However, this large amount of data allows us to generate a histogram. This histogram tells us how many pairs of environments have fingerprint distances $\Delta^A_{I,J}$ and $\Delta^B_{I,J}$. These two distances are plotted along the x and y axes, and the height of the bins of the histogram is indicated by the color in the plot shown in Fig. 6.

As can be seen in Fig.
6, in most cases, the intensity is peaked around the diagonal, which implies that both fingerprints agree on the degree of similarity or dissimilarity between the environment pairs. It cannot be expected that all the points lie directly on the diagonal, since different fingerprints weight different types of similarity or dissimilarity in different ways. There is however a problem if a point lies exactly on or very close to the x or y axis, which means that one $\Delta$ is either zero or very small. This means that one fingerprint categorizes this pair of environments as identical whereas the other fingerprint can detect differences, i.e. its $\Delta$ value is large. In Table II we show several pairs of environments that correspond to such problematic points in the correlation plot.

In Table II a we show the two most distinct environments in the data according to OM[sp]. One environment is at the end of a chain and the other is 3-fold coordinated. So OM recognizes the atoms with the highest and lowest coordination number found in this data set as being the most different. The fingerprint distance is $\Delta_{OM[sp]} = 317$. Diamond-like environments were not in our MH generated data set. Due to their large number of surface dangling bonds, such structures are considerably higher in energy than the structures arising from sp2 and sp1 hybridized carbon atoms. However, when we add such a diamond-derived cluster by hand, OM predicts the central 4-fold coordinated atom of this cluster together with the previous atom at the end of the chain as the most distinct atoms. So again it classifies the two environments with the highest and lowest coordination as the most different ones. SOAP, FCHL, and MBSF predict the environments in Table II b, and ACSF those in Table II c, to be the most distinct environments in the data. The fingerprint distances are $\Delta_{SOAP} = 214$, $\Delta_{FCHL} = 315$, $\Delta_{MBSF} = 1224$, and $\Delta_{ACSF} = 822$, respectively. This is not in agreement with our basic chemical concepts of what structural differences are important.
According to these concepts the coordination number is the most important quantity in the chemistry of carbon, since it is related to the hybridization state. When the four-fold coordinated carbon from the diamond-like cluster is added, ACSF, MBSF and FCHL correctly identify this four-fold coordinated environment and the one from the end of the chain as the most different ones. The assignment of the largest fingerprint distance in SOAP is however unchanged by the addition of this four-fold coordinated environment. So the assignments of the symmetry-function-related fingerprints are at least partly compatible with chemical concepts, whereas for SOAP this is not the case. It is unclear whether a fingerprint that is compatible with chemical concepts gives better performance in machine learning schemes. By choosing a shorter $r_\delta$ in the case of SOAP and shorter cutoff radii for ACSF-related fingerprints, it is however expected that the immediate environment gets more weight and that the other fingerprints can then also better distinguish different coordinations. We note that also for the cutoff employed in the present work, individual components of the fingerprint vectors in ACSF-related fingerprints adopt different values for varying coordinations, while this effect is much less visible in the combined fingerprint distances. In the following we look at the correlation plots of fingerprint distances obtained with different fingerprints. We check whether some fingerprints cannot recognize structural differences.

Fig. 6 a shows the resolution plot between the OM and SOAP fingerprints. In this case, the OM[sp] and SOAP fingerprints agree quite well on similarities and dissimilarities between the environments.

Fig. 6 b shows the resolution intensity plot between OM[sp] and ACSF. There exist some points with significant values on the OM[sp] axis. These points represent different environments whose differences ACSF cannot resolve, since the ACSF fingerprint distance is close to zero.
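The comparison procedure of this section (scaling to the noise level, building the two-dimensional histogram and locating points close to an axis) can be sketched as follows. This is a hedged illustration on synthetic numbers, not the analysis code of this work: the function names, the thresholds `small` and `large`, the number of bins and the random test data are assumptions made for the example.

```python
import numpy as np

def scale_to_noise(distances, noise_level):
    """Express fingerprint distances in units of the noise level."""
    return np.asarray(distances, dtype=float) / noise_level

def correlation_histogram(d_a, d_b, bins=50):
    """2D histogram counting pairs with distances (d_a, d_b)."""
    counts, edges_a, edges_b = np.histogram2d(d_a, d_b, bins=bins)
    return counts, edges_a, edges_b

def unresolved_pairs(d_a, d_b, small=1.0, large=10.0):
    """Indices of pairs that fingerprint B puts at or below the noise level
    (d_b <= small) although fingerprint A sees a clear difference (d_a >= large).
    The thresholds are hypothetical."""
    return np.flatnonzero((d_b <= small) & (d_a >= large))

# Synthetic distances for 1000 environment pairs (assumed data)
rng = np.random.default_rng(1)
d_a = scale_to_noise(rng.uniform(0.0, 300.0, size=1000), noise_level=1.0)
d_b = d_a * rng.uniform(0.5, 1.5, size=1000)
d_a[:3] = 50.0   # three artificial pairs that A distinguishes clearly ...
d_b[:3] = 0.2    # ... but that B places below the noise level

counts, edges_a, edges_b = correlation_histogram(d_a, d_b)
flagged = unresolved_pairs(d_a, d_b)
print(flagged)
```

Pairs flagged in this way correspond to points lying close to one axis of the correlation plot, i.e. to environments that one fingerprint treats as identical while the other resolves them.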
In Table II d we show two atomic environments which are obviously quite different, but whose ACSF distance is very small. The two environments are very different since the central atom in the left panel makes one bond with its nearest neighbor while the central atom in the right panel is two-fold coordinated. In Table II e we show another example where the difference vector of the ACSF fingerprints is rather small.

Fig. 6 c shows the correlation intensity plot between OM[sp] and FCHL. There are no points on the axes with significant values, so both fingerprints agree on similarities.

Fig. 6 d shows the correlation plot between OM[sp] and MBSF. In Table II f and g we show two examples in which the MBSF does not recognize the differences between the two environments. In Table II f left, the central atom is in the middle of the chain and has two nearest neighbors, while on the right it is at the end of the chain and has one nearest neighbor. In Table II g left, the reference atom is again at the end of a chain, while on the right it is three-fold coordinated.

Fig. 6 e shows the correlation intensity between SOAP and ACSF. Here we also see problematic points where the fingerprint distance is very small according to ACSF but not according to SOAP. In Table II h we show an example of two different environments for which ACSF predicts a very small fingerprint distance. Although the central atom has one nearest neighbour in both cases, the second and third shells are different. Table III a shows another example in which ACSF does not recognize the differences in the local environment.

Figure 6: The correlation intensity plot for a) OM vs. SOAP; b) OM vs. ACSF; c) OM vs. FCHL; d) OM vs. MBSF; e) SOAP vs. ACSF; f) SOAP vs. FCHL; g) SOAP vs. MBSF; h) ACSF vs. FCHL; i) ACSF vs. MBSF; j) MBSF vs. FCHL.

Table II: The most distinct atomic environments according to a) OM; b) SOAP, FCHL, and MBSF; and c) ACSF. The rest of the panels are problematic atomic environments in which one fingerprint predicts a large fingerprint distance whereas the other fingerprint predicts a small one. The first number is the absolute fingerprint distance, whereas the number in parentheses is this distance expressed as a fraction of the largest distance found for that fingerprint. The reference atom whose environment we want to describe is colored red, the atoms in the vicinity of the reference atom are colored blue, and the remaining atoms in the structure, which are outside of the cutoff sphere and do not affect the fingerprint, are shown in brown.

Panel   ∆_OM[sp]     ∆_SOAP       ∆_ACSF       ∆_FCHL       ∆_MBSF
a       317 (1.0)    189 (0.88)   738 (0.90)   256 (0.82)   844 (0.69)
b       251 (0.79)   214 (1.0)    802 (0.98)   315 (1.0)    1224 (1.0)
c       292 (0.92)   206 (0.96)   822 (1.0)    292 (0.93)   1119 (0.91)
d       38 (0.12)    67 (0.32)    3 (0.0)      33 (0.11)    5 (0.0)
e       34 (0.11)    43 (0.2)     2 (0.0)      17 (0.05)    8 (0.01)
f       79 (0.25)    66 (0.31)    34 (0.04)    46 (0.15)    13 (0.01)
g       78 (0.25)    79 (0.37)    22 (0.03)    60 (0.19)    11 (0.01)
h       37 (0.12)    74 (0.35)    7 (0.01)     23 (0.07)    14 (0.01)

Table III: Further problematic environments.

Panel   ∆_OM[sp]     ∆_SOAP       ∆_ACSF       ∆_FCHL       ∆_MBSF
a       34 (0.11)    74 (0.35)    6 (0.01)     18 (0.06)    18 (0.01)
b       55 (0.18)    85 (0.40)    37 (0.05)    35 (0.11)    7 (0.01)
c       37 (0.12)    73 (0.34)    16 (0.02)    26 (0.08)    7 (0.01)
d       46 (0.15)    60 (0.28)    8 (0.01)     46 (0.15)    25 (0.02)
e       36 (0.12)    50 (0.24)    8 (0.01)     44 (0.14)    29 (0.02)
f       52 (0.16)    76 (0.36)    28 (0.4)     33 (0.11)    5 (0.0)
g       34 (0.11)    43 (0.20)    14 (0.02)    31 (0.10)    5 (0.0)
h       21 (0.07)    34 (0.16)    15 (0.02)    29 (0.09)    5 (0.0)

The correlation intensity between SOAP and FCHL is shown in Fig. 6 f. There are no points with significant values on either axis, and both fingerprints therefore agree on similarities and differences between environments.

The correlation intensity between SOAP and the MBSF is shown in Fig. 6 g. There again exist some problematic points on the SOAP axis, which indicates that there are some different environments that MBSF predicts to be the same or very similar. In Table III b and c we show two such examples.

The correlation intensity between ACSF and FCHL is shown in Fig. 6 h. There are also some points lying on and very close to the FCHL axis (points with fingerprint distances up to 50 near the FCHL axis). These points indicate environments which are different according to FCHL and very similar according to ACSF. In Table III d and e we show two such examples where the two environments are different while the fingerprint distance according to ACSF is very small. The reference atom is two-fold coordinated in one case and three-fold coordinated in the other.

In Fig. 6 i we show the correlation intensity between ACSF and the MBSF. The two fingerprints agree on most similarities and there are no points on the axes with significant values.

As a last illustration we show the correlation plot between the MBSF and FCHL in Fig. 6 j. In Table III f, g, and h we show examples where the MBSF does not recognize differences between the local environments and predicts very small fingerprint distances compared to FCHL.

To summarize, our analysis of the eigenmodes of the sensitivity matrix shows that ACSF, MBSF, and partly FCHL are quite insensitive to certain displacements of the neighbouring atoms and therefore have an unsatisfactory structural resolution power. SOAP and OM perform significantly better in this respect.

IV. CORRELATION BETWEEN MOLECULAR FINGERPRINTS AND GLOBAL PHYSICAL PROPERTIES
According to our analysis reported above, several fingerprints that are widely and successfully used, for instance in machine learning schemes, are apparently sometimes unable to distinguish between different chemical environments. One would thus expect that this gives rise to errors in the prediction of physical properties. One typical application that in principle could be affected is the development of machine learning potentials, which predict the energy and forces as a function of the atomic positions. Most of these ML potentials rely on a construction of the total energy as a sum of environment-dependent atomic energies and thus should be sensitive to deficiencies in the discrimination of these environments. In this section we will discuss possible implications of our findings with respect to such applications of ML.

For our investigation, we need to distinguish between local and global properties. While local properties like forces are observables that can be uniquely assigned to individual atoms, the total energy of the system is not an observable, and there is no physically unique definition of atomic energies. While ML potentials are supposed to represent both forces and energies consistently and with high accuracy, the analysis of these two quantities requires different approaches.

We will now investigate the role of the total energy as a global property. It has been shown, for instance for the distribution of atomic energies within extended systems, that atomic energies determined by ML can compensate each other to yield the correct total energy if there is enough flexibility in the system. For many systems this flexibility can be reduced by adding constraints on the energy distribution in the form of different stoichiometries, but in general there is no way to extract unique atomic energies for arbitrary systems using ML. 
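The compensation of atomic energies can be made explicit with a short worked equation. This is only an illustrative sketch in notation assumed here ($E_i$ for the atomic energy of atom $i$, $c_i$ for an arbitrary redistribution among the $N$ atoms); it is not taken from the studies referred to above.

$$E = \sum_{i=1}^{N} E_i, \qquad E_i' = E_i + c_i \ \ \text{with} \ \ \sum_{i=1}^{N} c_i = 0 \quad \Longrightarrow \quad \sum_{i=1}^{N} E_i' = E$$

Any such redistribution that a model is flexible enough to represent reproduces the same total energy, so a fit to total energies alone does not determine the individual $E_i$.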
This finding is independent of the ability of the fingerprint vectors to distinguish chemically inequivalent atomic environments.

Here, we now go one step further and investigate whether even a few "wrong" environment descriptions, which cannot resolve some structural differences as reported above, might be tolerable because the total energy could still be well represented due to some error cancellation. To check the correlation of global properties with various atomic fingerprints we first have to construct a global, i.e. molecular, fingerprint from our local atomic fingerprints. We do this by finding the optimal matching between all the atomic environments in the two structures, i.e. the matching that minimizes the root-mean-square distance (RMSD) between the two molecules. In this approach the fingerprint distance between two molecules p and q is defined as

$$\Delta_{p,q} = \min_{P} \left( \sum_{i}^{N} \left| F_i^p - F_{P(i)}^q \right|^2 \right)^{1/2} \qquad (13)$$

where $F_i^p$ is the fingerprint vector for atom i in configuration p and $F_{P(i)}^q$ is the fingerprint of the best matching atom P(i) in configuration q. The permutation function P which gives the best overall match is found with the Hungarian algorithm in polynomial time. We note, however, that this construction of a global molecular fingerprint is different from the procedure that is usually applied in the construction of ML potentials, and here we use it primarily as a tool to detect correlations between global properties and the entire structure of a system.

While the atomic fingerprint distance shows how different two atomic environments are, the molecular fingerprint distance indicates the difference between two entire molecules. In the next step, we calculate the correlation between molecular fingerprints and two global properties, namely the total energy and the density of states (DOS). If two molecules have different energies or DOSs, they have to be different and so the fingerprint distance should be non-zero. 
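A minimal Python sketch of the molecular fingerprint distance of Eq. (13) is given below. It assumes that the atomic fingerprints of each molecule are available as equal-length lists of numbers and that both molecules contain the same number of atoms. The function names `hungarian` and `molecular_distance` are chosen here for illustration, and the assignment step is a compact textbook O(N³) variant of the Hungarian algorithm, not necessarily the implementation used in this work.

```python
import math


def hungarian(cost):
    """Minimum-cost assignment for a square cost matrix (list of lists).

    Returns a list `match` where row i is assigned to column match[i].
    Uses the potentials-based O(n^3) formulation with 1-based helper arrays.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (n + 1)      # column potentials
    p = [0] * (n + 1)        # p[j]: row currently matched to column j
    way = [0] * (n + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while True:              # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    match = [0] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match


def molecular_distance(fp_p, fp_q):
    """Eq. (13): square root of the summed squared distances of matched atoms."""
    cost = [[sum((a - b) ** 2 for a, b in zip(f_p, f_q)) for f_q in fp_q]
            for f_p in fp_p]
    match = hungarian(cost)
    total = sum(cost[i][match[i]] for i in range(len(fp_p)))
    return math.sqrt(total), match


# Toy one-component fingerprints for two three-atom "molecules" (made-up numbers).
fp_p = [[0.0], [1.0], [3.0]]
fp_q = [[2.9], [0.2], [1.1]]
dist, match = molecular_distance(fp_p, fp_q)
print(dist, match)
```

Because the cost matrix holds squared distances between fingerprint vectors, the assignment that minimizes the summed cost is also the one that minimizes the square root in Eq. (13). Library routines such as SciPy's `scipy.optimize.linear_sum_assignment` solve the same assignment problem and could replace the hand-written `hungarian` function.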
On the other hand, if two molecules have nearly the same energies or DOS they can be similar (in case of degeneracy) or different. So the fingerprint distance does not necessarily need to be non-zero.

Figure 7: The correlation between the molecular fingerprint distance ∆FP (arb. u.) and (a) ∆E (in Ha, left-hand side) and (b) ∆DOS (right-hand side) for OM[sp], SOAP, ACSF, FCHL, and MBSF. The global minimum of C is taken as the reference structure. The fingerprint distances are scaled such that the maximum fingerprint distance for each fingerprint is 1.0.

The density of states for molecule p, $D_p(\epsilon)$, is

$$D_p(\epsilon) = \sum_i \delta(\epsilon - \epsilon_i^p) \qquad (14)$$

where $\epsilon_i^p$ are the Kohn-Sham eigenvalues for molecule p. We replace $\delta(\epsilon - \epsilon_i^p)$ with the normalized Gaussian $\frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(\epsilon - \epsilon_i^p)^2}{2\sigma^2} \right)$, with σ some smearing parameter. We define the difference between the densities of states to be

$$\Delta\mathrm{DOS}_{p,q} = \sqrt{\int d\epsilon \, \left( D_p(\epsilon) - D_q(\epsilon) \right)^2} \qquad (15)$$

Taking advantage of the properties of Gaussian functions, we can calculate the integral analytically. Hence, $\Delta\mathrm{DOS}_{p,q}$ can be calculated as

$$\Delta\mathrm{DOS}_{p,q} = \sqrt{\frac{1}{2\sqrt{\pi}\sigma} \sum_{i,j} \left( e^{-(\epsilon_i^p - \epsilon_j^p)^2 / 4\sigma^2} + e^{-(\epsilon_i^q - \epsilon_j^q)^2 / 4\sigma^2} - e^{-(\epsilon_i^p - \epsilon_j^q)^2 / 4\sigma^2} - e^{-(\epsilon_i^q - \epsilon_j^p)^2 / 4\sigma^2} \right)} \qquad (16)$$

We chose σ = 0 . Ha in this work. The molecule with the lowest energy is taken as the reference structure, and fingerprint distances and energy differences are calculated with respect to it. In Fig. 7 we show the correlation between the molecular fingerprint distance ∆FP and ∆E and ∆DOS with respect to the global minimum for OM[sp], SOAP, ACSF, FCHL, and MBSF.

Remarkably, all fingerprints show a quite similar behavior in these tests. In particular, we could not find any pair of molecules that has a very small molecular fingerprint distance but a different energy or DOS. 
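The analytic evaluation of the DOS difference can be sketched in Python as follows. The sketch assumes normalized Gaussians of width σ, for which the overlap integral of two Gaussians centered at a and b equals exp(-(a-b)²/4σ²)/(2√π σ). The function names, the sample eigenvalues and the value of `sigma` are hypothetical and serve only as an illustration; a trapezoidal integration of Eq. (15) is included as a cross-check.

```python
import math


def gaussian_overlap(a, b, sigma):
    """Integral over energy of two normalized Gaussians centered at a and b."""
    return (math.exp(-((a - b) ** 2) / (4.0 * sigma ** 2))
            / (2.0 * math.sqrt(math.pi) * sigma))


def dos_distance(eps_p, eps_q, sigma):
    """Closed-form DOS difference for Gaussian-smeared eigenvalue spectra."""
    def pair_sum(xs, ys):
        return sum(gaussian_overlap(x, y, sigma) for x in xs for y in ys)

    # The two cross terms of the double sum are equal by symmetry, hence 2 * pq.
    squared = (pair_sum(eps_p, eps_p) + pair_sum(eps_q, eps_q)
               - 2.0 * pair_sum(eps_p, eps_q))
    return math.sqrt(max(squared, 0.0))   # guard against round-off below zero


def dos_distance_numeric(eps_p, eps_q, sigma, step=1e-3, pad=10.0):
    """Cross-check: trapezoidal integration of (D_p - D_q)^2 on an energy grid."""
    levels = list(eps_p) + list(eps_q)
    lo = min(levels) - pad * sigma
    hi = max(levels) + pad * sigma
    n = int(round((hi - lo) / step))
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)

    def dos(energy, eigenvalues):
        return norm * sum(math.exp(-((energy - e) ** 2) / (2.0 * sigma ** 2))
                          for e in eigenvalues)

    integral = 0.0
    for k in range(n + 1):
        energy = lo + k * step
        diff = dos(energy, eps_p) - dos(energy, eps_q)
        weight = 0.5 if k in (0, n) else 1.0
        integral += weight * diff * diff * step
    return math.sqrt(integral)


# Hypothetical eigenvalues (in Ha) and smearing, for illustration only.
eps_p = [-0.50, -0.20]
eps_q = [-0.45, -0.10]
sigma = 0.05

print(dos_distance(eps_p, eps_q, sigma))
print(dos_distance_numeric(eps_p, eps_q, sigma))
```

The closed form needs only a double loop over eigenvalue pairs, so no energy grid has to be chosen, which is the practical advantage of evaluating the integral analytically.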
As also noted in a study highlighting difficulties in the structural description of methane, the fingerprints of neighboring atoms usually change under displacements even if the fingerprint of the central atom remains invariant. Through this effect machine learning schemes may compensate the deficiencies of a fingerprint, and the quality of the machine learning results for global quantities based on different fingerprints can become very similar in practice.

However, these findings are strictly true only if the fingerprint vectors of different environments are exactly the same. If the fingerprint vectors are only similar, the findings have to be treated with care in the context of machine learning, for several reasons. While correlations between physical properties and fingerprints certainly support the construction of an ML model, most ML algorithms are highly non-linear methods, which are able to distinguish fingerprint vectors even if they are overall very similar, as measured by the fingerprint difference, but are sufficiently different in at least one or a few components. For instance, this is the case for the ACSF fingerprint vectors of the reference atoms shown in Table II d. In this case the radial symmetry functions with large η parameters are rather sensitive to the local coordination and provide different numerical values for the exemplified one- and two-fold coordination of the reference atom. This is usually sufficient to distinguish these environments. Further, in ML applications fingerprint vectors are commonly scaled such that the values of each individual fingerprint component are normalized to a range between zero and one. We have not done this in the present work to avoid any bias in the comparison of the performance of different fingerprints. Further, any scaling, although common practice, depends on the fingerprint values in the available data set. We observed in Fig. 8 that scaling has some effect on ACSFs, in that it increases the eigenvalues and therefore enhances the overall sensitivity of the fingerprint, and similar effects are also expected for the other fingerprint types.

Finally, for instance in the case of ML potentials, usually not only the total energy, as a rather insensitive global property, but also the atomic forces are used in the fitting process, which contain local atomic information about the potential energy surface. The inability to distinguish chemically different atomic environments thus results in large force errors, which can be used to improve the fingerprint set.

Irrespective of these aspects of ML applications, which reduce the effect of similar fingerprint vectors, it has been demonstrated in this work and elsewhere that the detection of fingerprint vectors remaining exactly invariant upon structural changes is a major challenge and of utmost importance for many applications.

V. CONCLUSIONS
We have introduced stringent tests for the resolution power of atomic fingerprints describing the environment around a reference atom. First we introduced the sensitivity matrix, which can detect atomic displacement modes that leave the fingerprint invariant. Based on a large data set of carbon structures we then investigated the correlation between fingerprint distances calculated with various fingerprints. For SOAP, ACSF, MBSF and FCHL, there exist atomic movements that leave the fingerprints invariant. This behavior can apparently only be found for some small molecules and it did not occur in our study of larger systems. For the symmetry-function-related fingerprints, we found many movement modes that leave the fingerprint nearly invariant and we found many cases where environments that were classified as nearly identical were actually quite different. In all the tests we saw an improvement when going from the ACSF and MBSF to the FCHL fingerprint. The OM fingerprint is the only fingerprint for which no atomic displacement was ever found that leaves the fingerprint invariant. It is also the fingerprint whose distance assignments correspond best to basic chemical concepts. This comes from the fact that the OM fingerprint is obtained from a matrix diagonalization that is akin to the solution of the Schrödinger equation and therefore naturally incorporates the full many-body character of the atomic environment.

However, the limited resolution of some atomic fingerprints for some environments is most critical for structural discrimination, while there is still a good correlation of global molecular fingerprints with extensive properties such as the total energies of systems that are composed of a large number of environments. Applications like machine learning are also less affected, as they are able to resolve even subtle differences in the fingerprints.
VI. ACKNOWLEDGMENTS
This research was performed within the NCCR MARVEL funded by the Swiss National Science Foundation. The calculations were done using the computational resources of the Swiss National Supercomputer (CSCS) under project s963 and on the Scicore computing center of the University of Basel. JB thanks the Deutsche Forschungsgemeinschaft for support (Be3264/13-1, project number 411538199). We thank Gábor Csányi for help in finding good parameters for the SOAP fingerprints, Michele Ceriotti for providing us with the problematic methane configurations and Jonas Finkler for the careful reading of the manuscript.

Figure 8: The eigenvalues of the sensitivity matrix (arb. u.) versus eigenvalue number for the ACSF and the scaled ACSF, in a: for the reference atom of 1a and in b: for the reference atom in 1b.