The role of feature space in atomistic learning
Alexander Goscinski, Guillaume Fraux, Giulio Imbalzano, Michele Ceriotti
Laboratory of Computational Science and Modeling, IMX, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
Efficient, physically-inspired descriptors of the structure and composition of molecules and materials play a key role in the application of machine-learning techniques to atomistic simulations. The proliferation of approaches, as well as the fact that each choice of features can lead to very different behavior depending on how they are used, e.g. by introducing non-linear kernels and non-Euclidean metrics to manipulate them, makes it difficult to objectively compare different methods, and to address fundamental questions on how one feature space is related to another. In this work we introduce a framework to compare different sets of descriptors, and different ways of transforming them by means of metrics and kernels, in terms of the structure of the feature space that they induce. We define diagnostic tools to determine whether alternative feature spaces contain equivalent amounts of information, and whether the common information is substantially distorted when going from one feature space to another. We compare, in particular, representations that are built in terms of n-body correlations of the atom density, quantitatively assessing the information loss associated with the use of low-order features. We also investigate the impact of different choices of basis functions and hyperparameters of the widely used SOAP and Behler-Parrinello features, and investigate how the use of non-linear kernels, and of a Wasserstein-type metric, changes the structure of the feature space in comparison to a simpler linear feature space.

I. INTRODUCTION
The construction of efficient and insightful descriptors of atomic configurations has been one of the focal points of the development of data-driven applications for atomic-scale modeling. Two of the core ideas that underlie most of the existing schemes are the use of an atom-centred description, which is particularly well-suited to model additive, extensive properties, and the incorporation of geometric and atom permutation symmetries. While the incorporation of symmetries makes representations much more data efficient, it raises subtle issues of whether the mapping from structure to descriptor is injective or not. Many of the structural representations that fulfill these symmetry requirements are closely related to one another, corresponding to projections of n-body correlations of the atom density. Yet, comparing them is not straightforward. When used to build an interatomic potential, or to predict another atomic-scale property, representations are used together with different supervised learning schemes, so it is difficult to disentangle the interplay of descriptor, regression method, and target property that combine to determine the accuracy and computational cost of the different methods. Juxtaposing alternative choices of representations is complicated by the fact that non-linear transformations are often applied as a part of the data processing algorithm, and so it would be equally important to be able to analyze the effect of these transformations.
Efforts to compare different choices of descriptors have been mostly focused thus far on a comparison of their compressibility, their ability to represent atomic structures uniquely, their role in constructing a metric, and their sensitivity to perturbations of the atomic structure.

a) Electronic mail: michele.ceriotti@epfl.ch

Here we propose a strategy to compare feature spaces both in terms of their mutual information content – which we define transparently as the ability to linearly or non-linearly reconstruct each other – and in terms of the amount of deformation that has to be applied to match the common information between the two. Note that the definition we use here differs from that used in information-theoretical treatments, based on Shannon entropy – which is difficult to compute in high dimensions, and does not reflect as naturally the behavior of different features when used in the context of atomistic machine learning.

We demonstrate this strategy by applying it to elucidate several issues related to the behavior of density-based representations. First, we investigate the role of the basis and of the density smearing in the practical implementation of 3-body density correlation features; we then estimate the loss of information that one incurs by truncating the description to low body-orders of correlations; finally, we discuss the role of the metric used to compare two structures, by testing the commonly used Euclidean distance against kernel-induced and Wasserstein-type metrics. We provide an open-source implementation of functions to compute these quantities at https://github.com/cosmo-epfl/sklearn-cosmo.

II. COMPARING FEATURE SPACES
Consider a dataset D = {x_i} containing n items. For a given choice of features F, each item is described by an m_F-dimensional feature vector x_i. As a whole, the dataset is described by a feature matrix X_{DF} ∈ R^{n × m_F}. We consider all of the feature matrices in this work to be standardized, i.e. centred and scaled so as to have zero mean and unit variance for the selected data set. Consider a second featurization F′. We want to be able to compare the behavior of different choices of feature spaces when representing the dataset D, e.g. which of two sets of features has more expressive power, and how much one representation is distorted relative to the other.

A. Global feature space reconstruction error
As a simple, easily-interpretable measure of the relative expressive power of F and F′, we introduce the global feature space reconstruction error GFRE_D(F, F′), defined as the mean-square error that one incurs when using the feature matrix X_F to linearly regress X_{F′}. In this work we compute the GFRE by a 2-fold split of the dataset, i.e. we compute the regression weights P_{FF′} over a train set D_train composed of half the entries in D,

P_{FF'} = \operatorname*{argmin}_{P \in \mathbb{R}^{m_F \times m_{F'}}} \left\| X^{D_\text{train}}_{F'} - X^{D_\text{train}}_{F} P \right\|^2 = \left( (X^{D_\text{train}}_{F})^T X^{D_\text{train}}_{F} \right)^{-1} (X^{D_\text{train}}_{F})^T X^{D_\text{train}}_{F'},   (1)

and then compute the error over the remaining test set D_test,

\text{GFRE}_D(F, F') = \sqrt{ \left\| X^{D_\text{test}}_{F'} - X^{D_\text{test}}_{F} P_{FF'} \right\|^2 / n_\text{test} },   (2)

averaging, if needed, over multiple random splits. The GFRE is a positive quantity, which is equal to zero when there is no error in the reconstruction, and which is usually bounded by one. For numbers of features larger than n_train, the covariance matrix is not full rank, and one needs to compute a pseudoinverse. Without loss of generality, one can regularize the regression to stabilize the calculation. In this paper, we computed the pseudoinverse by means of a singular value decomposition, and we determined the optimal regularization in terms of the truncation of the singular value spectrum, using 2-fold cross-validation over the training set to determine the optimal truncation threshold. Often, it is also useful to observe the behavior of the GFRE in the absence of any regularization: overfitting is in itself a signal of the instability of the mapping between feature spaces.

In general, GFRE_D(F, F′) is not symmetric. If GFRE_D(F, F′) ≈ GFRE_D(F′, F) ≈ 0, F and F′ contain similar types of information; if GFRE_D(F, F′) ≈ 0 while GFRE_D(F′, F) > 0, one can say that F is more descriptive than F′: this is, for instance, what one would observe if F′ consists of a sparse version of F, with some important and linearly-independent features removed; finally, if GFRE_D(F, F′) ≈ GFRE_D(F′, F) > 0, the two feature spaces contain different, and complementary, kinds of information, and it may be beneficial to combine them to achieve a more thorough description of the problem.
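As an illustration, the 2-fold GFRE of Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is a simplified sketch (a plain pseudoinverse in place of the cross-validated spectrum truncation, and a single split); the function name and defaults are our own, and the reference implementation is the sklearn-cosmo package cited above.

```python
import numpy as np

def gfre(X_F, X_Fp, rcond=1e-8, seed=0):
    """GFRE_D(F, F'): RMS error of linearly reconstructing the (standardized)
    feature matrix X_Fp from X_F, using a 2-fold train/test split."""
    n = X_F.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    train, test = idx[: n // 2], idx[n // 2:]
    # Eq. (1): least squares via a truncated-SVD pseudoinverse
    P = np.linalg.pinv(X_F[train], rcond=rcond) @ X_Fp[train]
    # Eq. (2): residual on the held-out half
    residual = X_Fp[test] - X_F[test] @ P
    return np.sqrt(np.sum(residual**2) / len(test))

# F' is an exact linear function of F, so GFRE(F, F') is essentially zero,
# while the reverse reconstruction cannot recover the discarded directions
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X_p = X @ rng.normal(size=(5, 2))
```

Here gfre(X, X_p) is close to zero while gfre(X_p, X) is not, reflecting the asymmetry discussed above.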
B. Global feature space reconstruction distortion
The feature space reconstruction error gives insights into whether a feature space can be inferred from knowledge of a second one. However, having both a small GFRE_D(F, F′) and GFRE_D(F′, F) does not imply that the two feature spaces are identical. Even though they contain similar amounts of information, one feature space could give more emphasis to some features compared to the other, which can eventually result in different performance when building a model. To assess the amount of distortion of F′ relative to F, we introduce the global feature space reconstruction distortion GFRD_D(F, F′). To evaluate it, we first compute the singular value decomposition of the projector of Eq. (1), P_{FF′} ≈ UΣV^T, and then use it to reduce the two feature spaces to a common basis, in which the reconstruction error is zero, because the residual has been discarded:

\tilde{X}_F = X_F U, \qquad \tilde{X}_{F'} = \tilde{X}_F \Sigma.   (3)

When the second feature space F′ has a lower dimensionality than F, some combinations of the starting features are not used to compute F̃′. In this case, we pad Σ with zeros, so that F̃′ has the same dimensionality m_F as the starting space. This choice ensures that the GFRD takes the same value it would have in the case F′ had the same dimensionality as F, but lower rank. In the opposite case, with m_F < m_{F′}, padding Σ and U with zeros, or truncating V, yields the same GFRD.

We can then address the question of whether X̃_F and X̃_{F′} are linked by a unitary transformation (in which case the GFRD should be zero), or whether there is a distortion involved. A possible answer involves solving the orthogonal Procrustes problem, i.e. finding the orthogonal transformation that "aligns" X̃_F as well as possible to X̃_{F′}:

Q_{FF'} = \operatorname*{argmin}_{Q \in U(m \times m)} \left\| \tilde{X}^{D_\text{train}}_{F'} - \tilde{X}^{D_\text{train}}_{F} Q \right\|^2 = \tilde{U}\tilde{V}^T,   (4)

where \tilde{U}\tilde{\Sigma}\tilde{V}^T = (\tilde{X}^{D_\text{train}}_{F})^T \tilde{X}^{D_\text{train}}_{F'}. The amount of distortion can then be computed by assessing the residual on the test set,

\text{GFRD}_D(F, F') = \sqrt{ \left\| \tilde{X}^{D_\text{test}}_{F'} - \tilde{X}^{D_\text{test}}_{F} Q_{FF'} \right\|^2 / n_\text{test} }.   (5)

If desired, the error can be averaged over multiple random splits of the reference data set D.

C. Local feature space reconstruction error
FIG. 1: A schematic representation of the different measures of feature-space dissimilarity we introduce in this work, discussed from left to right. The figure considers a dataset containing five samples, embedded in a two-dimensional feature space F and a one-dimensional feature space F′. As shown, the relationship between the two embeddings can involve arbitrary linear and non-linear transformations. The global feature space reconstruction error (GFRE) amounts to finding the best linear mapping between the two feature spaces. This measure is not symmetric: in this example GFRE(F, F′) is smaller than GFRE(F′, F), since F contains an additional nonzero dimension. The local version of the reconstruction error (LFRE) makes it possible to probe whether a non-linear map exists between the two spaces: in this case, the sine function can be locally approximated by a linear map, and so it is possible to achieve a low LFRE(F′, F). Finally, the global feature space reconstruction distortion (GFRD) determines whether the two featurizations are connected by an orthogonal transformation, by finding the best alignment between F and the approximation of F′ that can be obtained as a linear projection of F. Even though GFRE(F, F′) is small, one of the components of F is scaled down to zero, resulting in a large value of GFRD(F, F′).

A downside of the global feature comparison schemes introduced above is that the linear nature of the regression means that they cannot detect if F and F′ contain analogous information but differ by a non-linear transformation. In the next Section we discuss how one can generalize the schemes to use kernel features, which can also be used to detect non-linear relationships between the original feature spaces. An alternative approach is to compute a local version of the feature space reconstruction error, LFRE_D(F, F′), loosely inspired by locally-linear embedding.
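The locally-linear reconstruction (defined formally in Eqs. (6) and (7) below) can be sketched as follows. This is our own simplification, which regresses each sample from its own k-neighbourhood without a separate train/test split:

```python
import numpy as np

def lfre(X_F, X_Fp, k=15, rcond=1e-8):
    """Local feature reconstruction error of F' from F: each sample is
    reconstructed by a linear regression fitted on its k nearest
    neighbours (Euclidean distance in F), and the per-sample residuals
    are combined into an RMS error."""
    n = X_F.shape[0]
    errors = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X_F - X_F[i], axis=1)
        neigh = np.argsort(d)[:k]
        mean_F = X_F[neigh].mean(axis=0)
        mean_Fp = X_Fp[neigh].mean(axis=0)
        # Local regression weights, fitted on the centred neighbourhood
        P_i = np.linalg.pinv(X_F[neigh] - mean_F, rcond=rcond) @ (X_Fp[neigh] - mean_Fp)
        x_tilde = mean_Fp + (X_F[i] - mean_F) @ P_i
        errors[i] = np.sum((X_Fp[i] - x_tilde) ** 2)
    return np.sqrt(errors.mean())

# A non-linear map F -> F' that a global linear fit cannot capture,
# but that is almost exactly linear within each small neighbourhood
x = np.linspace(0.0, 2.0 * np.pi, 500).reshape(-1, 1)
y = np.sin(3.0 * x)
```

Here lfre(x, y, k=8) is close to zero, whereas the corresponding global (GFRE-like) error is large, since sin(3x) is nearly uncorrelated with x over full periods.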
To compute the LFRE, a local regression is set up, computed in the k-neighbourhood D^(i)_{k-neigh} around sample i – the set of k nearest neighbours of sample i, based on the Euclidean distance between the samples in F – to reproduce the F′ features using F as input features, centred around their mean values x̄_{F′} and x̄_F. A local embedding of x_i is determined as

\tilde{x}'_i = \bar{x}_{F'} + (x_i - \bar{x}_F)\, P^{(i)}_{FF'},   (6)

where P^{(i)}_{FF'} contains the regression weights computed from D^(i)_{k-neigh}. The local feature space reconstruction error is given by the residual discrepancy between the F′ counterpart of the i-th point and its local embedding (6):

\text{LFRE}_D(F, F') = \sqrt{ \sum_i \| x'_i - \tilde{x}'_i \|^2 / n_\text{test} }.   (7)

Inspecting the error associated with the reconstruction of individual points can reveal regions of feature space for which the mapping between F and F′ is particularly problematic. Similarly, one can compute a local version of the GFRD, which could be useful to detect strong local distortions that might indicate the presence of a singularity in the mapping between two feature spaces.

D. Bending space: comparing induced feature spaces
It is often possible to substantially improve the performance of regression or dimensionality reduction algorithms without explicitly changing the feature vectors. This can be achieved by introducing a (non-linear) similarity measure to compare the x_i, which takes the form of a kernel function k(x, x′), or a dissimilarity measure which takes the form of a distance d(x, x′). Let us recall that a positive-definite kernel induces a kernel distance by the relation

d_k(x, x')^2 = k(x, x) + k(x', x') - 2\, k(x, x'),   (8)

and that any negative-definite distance can be used to build positive-definite kernels, such as the substitution kernel

k^{x_0}_d(x, x') = -\tfrac{1}{2} \left( d(x, x')^2 - d(x, x_0)^2 - d(x_0, x')^2 \right), \quad x_0 \in F,   (9)

or the radial basis function (RBF) kernel

k^{\text{RBF}}_d(x, x') = \exp\left(-\gamma\, d(x, x')^2\right), \quad \gamma \in \mathbb{R}^+.   (10)

A positive definite kernel induces a feature space H, commonly known as a reproducing kernel Hilbert space (RKHS), in which the similarity measure can be expressed as a dot product:

k(x, x') = \langle \phi(x) | \phi(x') \rangle, \quad x, x' \in F, \quad \phi : F \to H.   (11)

While in general φ(x) is not known, for a given dataset D it is possible to approximate the RKHS features by using a kernel principal component analysis. Since linear regression on RKHS features is equivalent to kernel ridge regression, we simply use kernel features computed on the training dataset D_train to reduce the problem of comparing kernel (or distance) induced features to that of comparing explicit features, and use the GFRE and GFRD as defined in Eqs. (2) and (5). It is possible to re-formulate these measures in an explicit kernelized form, as well as to compute low-rank approximations of the kernel to reduce the computational cost for very large datasets (see e.g. Ref. 22 for a pedagogic discussion). In this paper we simply use the explicit RKHS features, which can be obtained by diagonalizing the kernel matrix K = UΛU^T, with K_ij = k(x_i, x_j), and defining

X_H = U \Lambda^{1/2},   (12)

which is then standardized as we do for any other set of features. To define a feature space associated with a metric, rather than a kernel, we first center the squared distance matrix (which is equivalent to computing a substitution kernel analogous to Eq. (9)) and then proceed similarly by diagonalizing the resulting matrix.

E. Dataset selection
We use four different datasets, chosen to emphasize different aspects of the problem of representing atomic structures: a random methane dataset, consisting of different random displacements of the four hydrogen atoms around the central carbon atom, to cover the complete configurational space of CH4 structures; a carbon dataset of approximately 10'000 minimum-energy carbon structures, obtained as the result of ab initio random structure search, as an example of a realistic dataset of condensed-phase structures; a degenerate methane dataset, composed of two groups of methane structures (which we refer to as X+ and X−), each associated with a 2D manifold parameterised by two parameters (u, v): structures with v = 0 in the two manifolds have exactly the same C-centred 3-body correlations, despite being different (as discussed in Ref. 19); a displaced methane dataset, which consists of an ideal, tetrahedral CH4 geometry with one hydrogen atom pulled away from the central carbon atom, as an example of a set of structures that are distinguished by a clearly identifiable structural feature, here the C–H distance.

III. COMPARING ATOM-CENTRED REPRESENTATIONS
Atom-centred representations that are based on a symmetrized expansion of the atom density constitute one of the most successful and widely adopted classes of features for atomistic machine learning. The construction begins by describing a structure A in terms of a sum of localized functions g(x) ≡ ⟨x|g⟩ (e.g. a Gaussian with variance σ_G²) centred on the atom positions r_i,

\langle x | A; \rho \rangle = \sum_i \langle x - r_i | g \rangle.   (13)

Symmetrizing over translations and rotations leads to a description of the structure in terms of a sum of environment features

\langle x_1; \ldots x_\nu | A; \rho^{\otimes \nu}_i \rangle   (14)

that describe ν-point correlations of the density centred on atom i (effectively corresponding to a (ν+1)-body correlation function in the sense used e.g. in the statistical mechanics of liquids). Different values of ν correspond to conceptually distinct descriptions of the system – higher body order terms being more complicated, but potentially more information-rich – while different discretizations of the abstract vectors on a basis are a matter of computational convenience: they affect the computational cost of the different approaches, but their descriptive power should become equivalent in the limit of a complete basis set. We demonstrate the use of the GFRE, LFRE and GFRD to assess with quantifiable measures the effect of some of the different choices one can make when designing a representation.

A. SOAP and symmetry functions
We begin by considering two practical realizations of atom-centred symmetrized features of order ν = 2: smooth overlap of atomic positions (SOAP) features, as implemented in librascal, and Behler-Parrinello symmetry functions (BPSF), as implemented in the n2p2 package. In the SOAP representation the atom-centred density is written as a sum of Gaussians with finite width σ_G, and the density is expanded in a basis that is a product of spherical harmonics Y_lm(x̂) ≡ ⟨x̂|lm⟩ and a radial basis R_n(x) ≡ ⟨x|n⟩,

\langle nlm | A; \rho_i \rangle = \sum_j \int \mathrm{d}x\, \langle n | x \rangle \langle lm | \hat{x} \rangle \langle x - r_{ji} | g \rangle,   (15)

where r_ji = r_j − r_i. We consider two different basis sets here: Gaussian-type orbitals (GTO),

\langle r | n; \text{GTO} \rangle = N_n\, r^n \exp(-b_n r^2),   (16a)

with

N_n = \sqrt{ 2 / \left( \sigma_n^{2n+3}\, \Gamma(n + 3/2) \right) },   (16b)

b_n = 1 / (2\sigma_n^2), \qquad \sigma_n = r_c \max(\sqrt{n}, 1) / n_\text{max},   (16c)

that are orthogonalized with respect to each other, and a discrete variable representation (DVR) basis,

\langle r | n; \text{DVR} \rangle = \sqrt{w_n}\, \delta(r - r_n),   (17)

where the r_n are Gaussian quadrature points and the w_n their corresponding weights. For both bases, the integral (15) can be evaluated analytically, and the density coefficients computed as a sum over the neighbours of the i-th atom.

FIG. 2: Comparison of the GFRE and GFRD for increasing numbers of radial (left, with fixed l_max = 4) and angular (middle, with fixed n_max = 4) basis functions. On the right, an explicit comparison of the two basis sets in terms of GFRE(GTO, DVR), GFRE(DVR, GTO) and the corresponding measures of distortion.

Even though they can be seen as a projection on an appropriate basis of the symmetrized atom density that underlies SOAP, Behler-Parrinello symmetry functions (BPSF) are usually computed in real space, as a sum over tuples of neighboring atoms of functions of interatomic angles and distances.
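Such a real-space evaluation is straightforward to sketch; the snippet below implements the two-body function G(2) of Eq. (18) below, with a cosine cutoff f_c. The function names and the specific cutoff form are our own choices for illustration, not taken from n2p2.

```python
import numpy as np

def f_cut(r, r_c):
    """Cosine cutoff function: decays smoothly to zero at r = r_c."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def g2(center, neighbors, eta, R_s, r_c):
    """Two-body symmetry function: sum_j exp(-eta (r_ij - R_s)^2) f_c(r_ij),
    evaluated as a sum over the neighbors of the central atom."""
    r_ij = np.linalg.norm(neighbors - center, axis=1)
    return float(np.sum(np.exp(-eta * (r_ij - R_s) ** 2) * f_cut(r_ij, r_c)))

# Example: a methane-like environment, four H atoms ~1.09 Å from the C atom
d = 1.09 / np.sqrt(3.0)
H = d * np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
value = g2(np.zeros(3), H, eta=4.0, R_s=1.0, r_c=4.0)
```

Varying η and R_s over a grid, as done for the feature sets compared below, produces families of such functions probing different radial ranges.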
Among the many functional forms that have been proposed, we consider the two-body functions

G^{(2)}_i = \sum_j e^{-\eta (r_{ji} - R_s)^2}\, f_c(r_{ji}),   (18)

and the three-body functions

G^{(3)}_i = 2^{1-\zeta} \sum_j \sum_{k \neq j} (1 + \lambda\, \hat{r}_{ij} \cdot \hat{r}_{ik})^\zeta\, e^{-\eta (r_{ji}^2 + r_{ik}^2 + r_{jk}^2)}\, f_c(r_{ji})\, f_c(r_{ik})\, f_c(r_{jk}),   (19)

where f_c is a cutoff function, and η, ζ, λ, R_s are parameters that define the shape of each BPSF. We generate systematically groups of symmetry functions of different size by varying the values of these parameters, following the prescriptions discussed in Ref. 41. The list of values for the BPSF parameters we used is supplied in the supplementary information.

GTO and DVR radial basis.
We start by considering the convergence of the SOAP representation with different choices of radial basis. Figure 2 demonstrates the convergence with the number of radial functions n_max and angular momentum channels l_max (in a Cauchy sense, i.e. comparing results for successive increments of these parameters). Overall, the GTO basis converges faster than DVR in most cases, both in terms of GFRE and GFRD. The slower radial convergence of the reconstruction distortion indicates that even as the discretization approaches convergence, the changing position of peaks and nodes of the basis functions gives different emphasis to interatomic correlations over different ranges. This is consistent with the observation that, particularly for small (n_max, l_max), regression accuracy depends on the number of basis functions in a way that is not necessarily monotonic. When considering the convergence of the angular component l_max, GTO and DVR show nearly identical error decay, indicating that the convergence of the radial and angular bases are largely independent of each other.

The faster convergence of the GTO basis suggests that, for a given n_max, a representation expanded on this basis should contain a greater amount of information on the structure. This is reflected in the direct comparison of the two bases, with GFRE(GTO_{n_max}, DVR_{n_max}) < GFRE(DVR_{n_max}, GTO_{n_max}) for small n_max. When both basis sets have converged, they become essentially equivalent. Since the two representations are then related to each other by a unitary transformation, GFRD(GTO_{n_max}, DVR_{n_max}) → 0 as n_max → ∞.

FIG. 3: Comparison of the GFRE (top) and GFRD (bottom): a),b) for different smearings σ_G (r_c = 4 Å); c),d) for different cutoff values; e),f) for different radial scaling exponents; (n_max, l_max) = (10, 6) were used. The feature specified by the row is used to reconstruct the feature specified by the column.

Gaussian smearing.
The Gaussian smearing used in SOAP features acts as a parameter controlling the balance between local resolution and the smoothness of the mapping between Cartesian coordinates and symmetrized density features. A small σ_G value can identify minute changes more accurately, but too small a value of σ_G can lead to ill-conditioned regression, as the features associated with different structures show little overlap with each other. In fact, there is a tight interplay between the density smearing, the choice of the basis set, and the regularization of a regression model. As seen in Fig. 3(a,b), in the case of the smooth GTO basis set there is relatively little reconstruction error, and in general smaller σ_G values give a better reconstruction of large-σ_G features than vice versa. The opposite is true for the δ-like DVR basis: the GFRE for DVR is larger than in the case of GTO, and it is harder to reconstruct large-σ_G features from their sharp-Gaussian counterparts than vice versa. It should also be added that, without an automatic choice of regularization, results depend greatly on the way the feature mapping is executed. In particular, sharp-to-smooth mapping can lead to major overfitting problems, with the GFRE becoming much larger than one for the test set. Even in cases where the GFRE is small, the feature space distortion is large, which highlights the fact that the Gaussian smearing changes significantly the emphasis given to different structural correlations, and can therefore affect the accuracy of regression models.

Radial cutoff and scaling.
One of the most important hyperparameters when defining an atom-centred representation is the cutoff distance, which restricts the contributions to the density to the atoms with r_ji < r_c. Fig. 3(c,d) shows that the GFRE captures the loss of information associated with an aggressive truncation of the environment, with very similar behavior between the GTO and DVR bases. The figure also reflects specific features of the different data sets: for instance, GFRE(r_c = 4 Å, r_c = 6 Å) is close to zero for the random methane data set, because there are no structures where atoms are farther than 4 Å from the centre of the environment. A nonzero GFRE in the opposite direction reflects the need for a larger n_max to fully describe the structure of an environment when using a large value of r_c, which is consistent with the greater amount of information encoded within a larger environment. The GFRD plot also underscores the strong impact of the choice of r_c on the emphasis that is given to different parts of the atom-density correlations. This effect explains the strong dependency of regression performance on r_c, and the success of multi-scale models that combine features built on different lengthscales. A similar modulation of the contributions from different radial distances can be achieved by scaling the neighbour contribution to the atom-centred density by a decaying function, e.g. 1/(1 + (r_ji/r_0)^s). This approach has proven to be very effective in fine-tuning the performance of regression models using density-based features. As shown in Fig. 3(e,f), this is an example of a transformation of the feature space that entails essentially no information loss, resulting in a very small GFRE between different values of the scaling exponent s. However, it does result in a substantial GFRD, providing additional evidence of how the emphasis given by a set of features to different inter-atomic correlations can affect regression performance even if it does not remove altogether pieces of structural information.

Behler-Parrinello symmetry functions.
BPSF can be seen as projections of the same abstract symmetrized density features that underlie the construction of SOAP features. While the latter representation is usually implemented using an orthogonal set of basis functions, BPSFs are non-orthogonal, and are usually chosen based on a careful analysis of the inter-atomic correlations that are relevant for a given system, or selected automatically out of a large pool of candidates. Fig. 4 shows clearly that an orthogonal basis set provides a more effective strategy to converge a representation than the grid-based enumeration of the non-linear hyperparameters of non-orthogonal basis functions: GFRE(SOAP, BPSF) < GFRE(BPSF, SOAP) for all feature set sizes and both data sets. As usual, we remark that zero reconstruction error does not imply equivalence for regression purposes: the GFRD remains very high even for the largest feature set sizes.

FIG. 4: Comparison of the GFRE and the GFRD between SOAP(GTO) and BPSF features with systematically-increasing sizes of the feature vectors. BPSF features are generated by varying over a grid the hyperparameters entering the definitions of G^(2) and G^(3), following Ref. 41. SOAP expansion truncation parameters (n_max, l_max) are adjusted to approximately match the number of BPSF features.

Given that, in real scenarios, one would usually combine systematic enumeration of BPSF features with an automatic selection method, we also use the feature reconstruction framework to investigate the convergence of the automatic screening procedure, i.e. the error in reconstructing the full feature vector based on the first m features chosen with a CUR decomposition-based procedure. Figure 5 shows that a few dozen CUR-selected features allow the full feature vector to be reconstructed almost perfectly.
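The CUR-based screening can be sketched as a greedy, leverage-score-based column selection. The snippet below is our own simplified version (deflating against each selected column in turn), not the exact implementation used in the paper:

```python
import numpy as np

def cur_column_select(X, m, k=1):
    """Greedy CUR-style column (feature) selection: pick the column with the
    highest leverage score computed from the top-k right singular vectors,
    then orthogonalize the remaining columns against it and repeat."""
    X = np.array(X, dtype=float, copy=True)
    selected = []
    for _ in range(m):
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        scores = np.sum(Vt[:k] ** 2, axis=0)
        if selected:
            scores[selected] = -np.inf        # never pick a column twice
        j = int(np.argmax(scores))
        selected.append(j)
        c = X[:, j : j + 1]
        norm2 = (c.T @ c).item()
        if norm2 > 1e-12:                     # deflate: remove the span of column j
            X -= c @ (c.T @ X) / norm2
    return selected

rng = np.random.default_rng(0)
# A highly redundant feature matrix: six features spanning only a rank-3 space
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 6))
selected = cur_column_select(X, m=3)
```

Because the three selected columns span the full feature space, a linear regression from them reconstructs X essentially exactly, i.e. a GFRE of zero in the language of Section II A.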
The convergence is particularly fast for BPSF, where m = 50 leads to a minuscule GFRE, indicating that the non-orthogonal features are highly redundant, and explaining the saturation in model performance that was observed in Ref. 41.

B. Body order feature truncation
The examples in Section III A demonstrate the impact of implementation details and hyperparameter choices on the information content of features that are all equivalent to a three-body correlation of the atom density. A more substantial issue is connected to the use of representations based on different ν-body correlations of a decorated atom density, which are equivalent to the pair correlation function (2-body, ν = 1), to the SOAP power spectrum (3-body, ν = 2) or to the bispectrum (4-body, ν = 3). Different orders incorporate conceptually distinct kinds of information: when used in linear regression, different density correlation orders correspond to a body-order expansion of the target property, and the link between the convergence of the body-order expansion and the injectivity of the structure-feature map is an open problem, with known counter-examples showing that low values of ν are insufficient to achieve a complete representation of an atomic environment.

FIG. 5: Convergence of a CUR approximation of the full SOAP/BPSF feature vectors (the largest size considered in Fig. 4) with the number of retained features.

FIG. 6: GFRE and GFRD comparison between features of different body order ν, using GTOs as radial basis functions, r_c = 4 Å, up to ν = 4.

FIG. 7: Convergence of the LFRE between 2- and 3-body density correlation features with the number of neighbours k (using GTOs as radial basis, r_c = 4 Å and (n_max, l_max) = (6, 4)).

Fig. 6 shows that high-order features cannot be recovered as linear functions of lower-order features, while an approximate (if not complete) reconstruction of lower-ν components based on high-ν components is possible. Reconstructing features of different order entails a large amount of distortion, with the GFRD approaching one in most cases. We also include in the comparison features obtained with the recently-developed N-body iterative contraction of equivariants (NICE) framework, which identifies the most important features for each ν value, and uses them to compute (ν+1)-order features. Keeping 400 features for each body order is sufficient to achieve perfect reconstruction of the 2- and 3-body features, but not for the 4-body (bispectrum) term, which cannot be reconstructed fully with 400 NICE features. Considering however that GFRE(NICE, ν = 3) ≪ GFRE(ν = 3, NICE), one can infer that the information loss associated with truncating the body order expansion is more severe than that from restricting the number of 4-body features.

The comparison of features of different order can also be used to elucidate the role of the (non-)linearity of the mapping between feature spaces. Figure 7 compares global and local feature reconstruction errors between 2- and 3-body density correlation features, for the random CH4 data set. In the case of the low-to-high body order reconstruction, the LFRE is only marginally lower than its global counterpart, indicating that the large GFRE(ν = 1, ν = 2) is a consequence of lower information content and not only of the linear nature of the map.
The reverse case is also revealing: for small k-neighborhood sizes, LFRE(ν = 2, ν = 1) > GFRE(ν = 2, ν = 1), because the small number of neighbors included in the model reduces the accuracy of the feature reconstruction map. When the number of neighbors approaches the intrinsic dimensionality of the ν = 2 features, instead, LFRE < GFRE, because the reconstruction is based on a locally-linear map that can approximate a non-linear relationship between features. As k approaches the full train set size, the LFRE approaches the GFRE, as the locality of the mapping is lost.

The LFRE also makes it possible to identify regions of phase space for which the construction of a mapping between feature spaces is difficult or impossible. Consider the case of the degenerate manifold discussed in Ref. 19. The dataset includes two sets of CH4 environments, and those parameterised by v = 0 cannot be distinguished from each other using 3-body (ν = 2) features. Fig. 8 shows the LFRE for each point along the two manifolds. When trying to reconstruct 3-body features using as inputs 4-body features (that take different values for the two manifolds), the LFRE is essentially zero. When using the 3-body features as inputs, instead, one observes a very large error for points along the degenerate line, while points that are farther along the manifold can be reconstructed well. This example demonstrates the use of the LFRE to identify regions of feature space for which a simple, low-body-order representation is insufficient to fully characterize the structure of an environment, and can be used as a more stringent, explicit test of the presence of degeneracies than the comparison of pointwise distances discussed in Ref. 19.

C. Kernel-induced feature spaces
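The locally-linear reconstruction underlying the LFRE can be sketched as follows. This is an illustrative variant, not the authors' code: the toy feature pair, neighbourhood size k, local centering, and regularization are all assumptions.

```python
# Hedged sketch of a local (k-neighbourhood) feature reconstruction error:
# for each test point, fit a linear map using only its k nearest neighbours
# in the source feature space, then record the pointwise error on the target.
import numpy as np

def lfre_pointwise(X1_train, X2_train, X1_test, X2_test, k=10, reg=1e-8):
    errs = []
    for x1, x2 in zip(X1_test, X2_test):
        d = np.linalg.norm(X1_train - x1, axis=1)
        nn = np.argsort(d)[:k]                      # indices of k nearest neighbours
        A, B = X1_train[nn], X2_train[nn]
        A_c, B_c = A - A.mean(0), B - B.mean(0)     # centre locally
        P = np.linalg.solve(A_c.T @ A_c + reg * np.eye(A.shape[1]), A_c.T @ B_c)
        pred = (x1 - A.mean(0)) @ P + B.mean(0)
        errs.append(np.linalg.norm(pred - x2))
    return np.array(errs)

rng = np.random.default_rng(1)
X1 = rng.uniform(-1, 1, size=(300, 2))
X2 = np.sin(3 * X1)        # smooth non-linear relation between the two "feature spaces"
errs = lfre_pointwise(X1[:250], X2[:250], X1[250:], X2[250:], k=20)
```

Because the map is refitted around every test point, a smooth non-linear relation between spaces yields small local errors even where a single global linear map would fail, mirroring the LFRE < GFRE regime described in the text.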
With the exception of the trivial, scalar-product form, a kernel introduces a non-linear transformation of the feature space, potentially allowing one to obtain more accurate regression models. A crucial aspect of kernel methods is the fact that this non-linear transformation gives rise to a linear feature space that is defined by the combination of the kernel and the training samples – or the active samples in the case of sparse kernel methods. We can then use our feature-space reconstruction framework to compare quantitatively the linear feature space with the kernel-induced features. We do so using a radial basis function kernel, varying the γ parameter; in the γ → 0 limit the kernel becomes effectively linear, and the induced features approach the linear ones. The use of standardized input features means that γ is effectively unitless, and we also standardize the kernel-induced features. To reduce noise, given that the kernel matrices are often very ill-conditioned, we only retain the RKHS features that are associated with the largest eigenvalues, preserving those that together contribute to approximately 99% of the variance.

Figure 9 plots the GFRE and GFRD for the mapping between linear and RBF features computed for 2- and 3-body density correlations. The non-linear nature of the transformation is apparent in the increase of the GFRE(linear, RBF) for larger values of γ, for both ν = 1 and ν = 2. The transformation is not entirely lossless, as evidenced by the fact that the reverse GFRE is also non-zero. The GFRE(RBF, linear) becomes particularly large for very large values of the γ parameter. This can be understood from the fact that the decay of the kernel becomes very sharp, so that it only provides information about the nearest neighbors of each point – effectively leading to an ill-conditioned regression problem, as we show in more detail below.

Having assessed the impact of non-linear kernel features on a single body-order representation, we can then investigate whether a non-linear transformation helps inferring high-body-order correlations from low-body-order features.
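The truncated kernel-induced (RKHS) features described above can be sketched by diagonalizing the kernel matrix and keeping the leading eigenpairs. The 99% variance cutoff follows the text; the toy data, the specific γ, and the absence of centering are assumptions of this illustration.

```python
# Hedged sketch: RKHS features from an RBF kernel, truncated at the
# eigenvalues that account for ~99% of the variance.
import numpy as np

def rkhs_features(X, gamma, var_keep=0.99):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                     # RBF kernel matrix
    evals, evecs = np.linalg.eigh(K)
    evals, evecs = evals[::-1], evecs[:, ::-1]  # sort eigenpairs in descending order
    frac = np.cumsum(evals) / evals.sum()
    m = np.searchsorted(frac, var_keep) + 1     # number of features retained
    # features Φ such that Φ Φᵀ ≈ K, restricted to the leading m components
    return evecs[:, :m] * np.sqrt(np.clip(evals[:m], 0.0, None))

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 3))
Phi = rkhs_features(X, gamma=0.5)
K_approx = Phi @ Phi.T    # reproduces the kernel matrix up to the discarded ~1% of variance
```

The columns of `Phi` play the role of the "kernel-induced features" that the text compares against the linear feature space via the GFRE/GFRD.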
This is relevant because the use of non-linear kernels has been proposed (and used in practice for a long time) as a strategy to describe many-body effects on atomistic properties. We compute the GFRE for promoting ν = 1 (2-body) to ν = 2 (3-body) and ν = 2 to ν = 3 features, for different values of the RBF kernel parameter γ. In Figure 10 we show these curves for both the usual GFRE definition (that involves a separate test set) and for a prediction carried out on the train set. These results show that while a non-linear kernel does allow a low-body-order model to discern higher body-order features, it does so in a poorly transferable way: high-γ models show much reduced GFRE for train-set predictions, but lead to a degradation of the feature reconstruction for the test set. Only low-γ models show a small improvement in the test-set GFRE compared to an entirely linear mapping. In this regime, the RBF kernel is dominated by the low-exponent components of the Gaussian expansion, vindicating the choice of low-order polynomial kernels, that are used in most of the published SOAP-based potentials.

FIG. 9: GFRE on the random methane dataset for interconverting the linear 2-body (left) and 3-body (right) feature spaces with those induced by an RBF kernel with different inverse kernel width γ.

FIG. 10: GFRE curves on the random methane dataset as a function of the γ hyperparameter of k_RBF. Values for train and test sets are plotted separately. The horizontal lines correspond to the GFRE of the linear features.

A better understanding of the effect of a non-linear feature-space transformation can be obtained by analyzing the distribution of reconstruction errors for individual samples,

GFRE^(i)(F, F') = ‖x'_i − x_i P_{FF'}‖.  (20)

The histograms of this “pointwise GFRE” (Fig. 11) show that increasing the non-linearity of the kernel does indeed allow one to reconstruct more accurately a fraction of both the test and the train set.
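The per-sample error of Eq. (20) is a straightforward refinement of the global error. A minimal sketch, using a hypothetical toy pair of feature sets and a plain least-squares map in place of the full standardized, regularized procedure:

```python
# Hedged sketch of the pointwise GFRE of Eq. (20): fit one global linear map P
# between two feature matrices, then resolve the reconstruction error sample
# by sample instead of averaging it over the whole test set.
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal(size=(120, 5))
X2 = np.tanh(X1 @ rng.normal(size=(5, 3)))   # mildly non-linear target features

# global least-squares map fitted on the first 100 samples
P, *_ = np.linalg.lstsq(X1[:100], X2[:100], rcond=None)

# pointwise GFRE^(i) = ||x'_i - x_i P|| for each held-out sample
pointwise = np.linalg.norm(X2[100:] - X1[100:] @ P, axis=1)
```

Histogramming `pointwise` over the test set is what produces plots of the kind discussed around Fig. 11, revealing outliers that an averaged error would hide.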
When extrapolating the mapping to points that have not been seen before, however, there is an increasingly large fraction of outliers for which the reconstruction is catastrophically poor.

The pointwise errors are also revealing of the different nature of the ν = 1 → ν = 2 and ν = 2 → ν = 3 cases. In the former case, the clear lack of information in the 2-body descriptor makes it impossible, even for a highly non-linear kernel, to obtain an accurate reconstruction of higher body-order features. In the latter case, instead, the train-set reconstruction becomes nearly perfect with large γ – indicating that, despite the existence of degenerate manifolds of configurations, it is possible to reconstruct 4-body features using only 3-body inputs, for structures that are not exactly on the degenerate manifold. However, the increasingly large tail of very high test-set GFRE samples suggests that this mapping is not smooth, and rather unstable. When building a regression model for a property that depends strongly on 4-body terms, this instability may translate into poor extrapolative power for a non-linear model based on 3-body features.

D. Wasserstein metric
As an example of the transformation induced by a non-Euclidean metric we consider the effect of using a Wasserstein distance to compare ν = 1 density correlation features. The Wasserstein distance (also known as the Earth Mover's Distance, EMD) is defined as the minimum “work” that is needed to transform one probability distribution into another – with the work defined as the amount of probability density multiplied by the extent of the displacement. The EMD has been used to define a “regularized entropy match” kernel to combine local features into a comparison between structures, to obtain permutation-invariant kernels based on Coulomb matrices, and has been shown to be equivalent to the Euclidean distance between vectors of sorted distances. Here we use the Wasserstein distance to compare 2-body (ν = 1) features, that can be expressed on a real-space basis and take the form of one-dimensional probability distributions.

The formal definition of the Wasserstein distance of order 2 between two probability distributions p(r) and p'(r) defined on a domain M reads

W(p, p')² = inf_{γ ∈ Γ(p, p')} ∫_{M×M} d(r, r')² dγ(r, r'),  (21)

where Γ(p, p') is the set of all joint distributions with marginals p and p'. For 1-dimensional distributions, W(p, p') can be expressed as the 2-norm of the difference between the associated inverse cumulative distribution functions (ICDF) P⁻¹ of the two environments, W(p, p')² = ∫₀¹ |P⁻¹(s) − P'⁻¹(s)|² ds, with P(r) = ∫₀ʳ p(r') dr'.

In order to express the symmetrized 2-body correlation function as a probability density, we first write it on a real-space basis ⟨r|, and evaluate it on 200 Gaussian

FIG. 11: Histograms of the pointwise reconstruction error for the 2 → 3 and 3 → 4 body-order mappings, using RBF kernels of increasing γ (top to bottom, γ = 0.01, 1.0, 10) to reconstruct the higher body-order features.
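The one-dimensional closed form quoted above lends itself to a simple numerical evaluation. The sketch below is illustrative only: the grids, the cumulative-sum CDF, and the interpolation-based inverse CDF are choices made here, not the authors' quadrature.

```python
# Hedged sketch of the order-2 Wasserstein distance between two 1-D
# distributions, evaluated through their inverse CDFs, as in Eq. (21)
# specialized to one dimension.
import numpy as np

def wasserstein2_1d(p, q, r, n_s=400):
    dr = r[1] - r[0]
    p = p / (p.sum() * dr)            # normalize both densities on the grid
    q = q / (q.sum() * dr)
    P = np.cumsum(p) * dr             # cumulative distribution functions
    Q = np.cumsum(q) * dr
    s = np.linspace(0.0, 1.0, n_s)
    Pinv = np.interp(s, P / P[-1], r)  # inverse CDFs by interpolation
    Qinv = np.interp(s, Q / Q[-1], r)
    ds = s[1] - s[0]
    return np.sqrt(((Pinv - Qinv) ** 2).sum() * ds)

r = np.linspace(0.0, 10.0, 1000)
g = lambda mu: np.exp(-0.5 * (r - mu) ** 2 / 0.1)   # toy Gaussian "density peak"
# two identical peaks displaced by 2: W2 should approach the displacement
w = wasserstein2_1d(g(3.0), g(5.0), r)
```

For two identical, rigidly shifted distributions the result approaches the shift itself, which is precisely the "minimal distortion" behavior exploited later for the displaced-methane example.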
Red curves refer to the train set points, blue curves to the test set, and the black line corresponds to the linear train-test GFRE, that serves as a reference.

quadrature points, that we also use to evaluate the CDF and its inverse. We then proceed to normalize it, so that it can be interpreted as a probability density. We estimate the integral of the distribution (that effectively counts the number of atoms within the cutoff distance),

Z_i = ∫₀^{r_c} ⟨r|A; ρ_i^⊗⟩ dr,  (22)

and the maximum value of the integral over the entire dataset, Z_D. A simple scaling of the correlation function,

p_i^s(r) = (1/Z_i) ⟨r|A; ρ_i^⊗⟩,  (23)

distorts the comparison between environments with different numbers of atoms. To see how, we use the displaced methane dataset, in which three H atoms in a CH4 molecule are held fixed in the ideal tetrahedral geometry, at a distance of 1 Å from the carbon centre. The fourth atom, aligned along the z axis, is displaced along it, so that each configuration is parameterised by a single coordinate z_H. Figure 12(a) shows the distance computed between pairs of configurations with different z_H, demonstrating the problem with the renormalized probability (23): p^s loses information on the total number of atoms within the cutoff, and so once the tagged atom moves beyond r_c the remaining CH3 environment becomes indistinguishable from an ideal CH3 geometry.

One can obtain a more physical behavior when atoms enter and leave the cutoff by introducing a δ-like “sink” at the cutoff distance, defining

p_i^δ(r) = (1/Z_D) [⟨r|A; ρ_i^⊗⟩ + (Z_D − Z_i) δ(r − r_c)].  (24)

Fig. 12b shows that with this choice the Wasserstein metric between the p_i^δ(r) reflects the distance between the moving atoms. With this normalization, in fact, the Wasserstein metric corresponds to a smooth version of the Euclidean metric computed between vectors of sorted interatomic distances, shown in Fig. 12c.
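The two normalizations of Eqs. (23) and (24) can be contrasted on a toy radial density. Everything below is a hypothetical illustration: the Gaussian smearing width, the cutoff, and the neighbour lists are choices made for this sketch, and the δ term is discretized into the last grid bin.

```python
# Hedged sketch: simple rescaling (Eq. 23) vs the δ-"sink" normalization
# (Eq. 24) of a Gaussian-smeared radial density.
import numpy as np

r = np.linspace(0.0, 5.0, 500)     # radial grid up to a cutoff r_c = 5 (illustrative)
dr = r[1] - r[0]

def density(distances, sigma=0.1):
    # Gaussian-smeared radial density for a set of neighbour distances
    return sum(np.exp(-0.5 * (r - d) ** 2 / sigma**2) for d in distances)

rho_a = density([1.0, 1.0, 1.0, 2.0])   # four neighbours inside the cutoff
rho_b = density([1.0, 1.0, 1.0])        # one neighbour has left the cutoff

Z_a, Z_b = rho_a.sum() * dr, rho_b.sum() * dr
Z_D = max(Z_a, Z_b)                     # dataset-wide maximum integral, Eq. (22)

# simple rescaling, Eq. (23): both environments integrate to one,
# so the atom count is lost
p_s_a, p_s_b = rho_a / Z_a, rho_b / Z_b

# δ-sink normalization, Eq. (24): the missing mass is stored at the cutoff
p_d_a = rho_a / Z_D
p_d_b = rho_b / Z_D
p_d_a[-1] += (Z_D - Z_a) / (Z_D * dr)   # discretized δ(r - r_c) term (zero here)
p_d_b[-1] += (Z_D - Z_b) / (Z_D * dr)
```

Both `p_s` and `p_δ` integrate to one, but only the δ-sink version keeps track of how much density has left the cutoff, which is what makes the Wasserstein comparison between environments with different atom counts physically meaningful.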
The distortions that can be seen in the comparison between Fig. 12b,c are a consequence of the Gaussian smearing, the smooth cutoff function, and the SO(3) integration that modulates the contribution to ⟨r|A; ρ_i^⊗⟩ coming from atoms at different distances.

Having defined a meaningful normalization and a probabilistic interpretation of the radial density correlation features, we can investigate how the feature space induced by a Wasserstein metric relates to that induced by a Euclidean distance.

FIG. 12: Distance between two displaced methane configurations with different values of z_H, computed using (a) a Wasserstein distance with scaling normalization; (b) a Wasserstein distance with cutoff-δ normalization; (c) the Euclidean distance between sorted interatomic distance vectors.

Figure 13 shows the error in the reconstruction of z_H for the displaced methane dataset when restricting the training set to 0.05 Å- and 1.0 Å-spaced grids. Using a Euclidean distance with a sharp σ_G leads to a highly non-linear mapping between the displacement coordinate and feature space, and a linear model cannot interpolate accurately between the points of a sparse grid. A Wasserstein metric, on the other hand, measures the minimal distortion needed to transform one structure into another, and so provides a much more natural interpolation along z_H, which is robust even with a sharp density and large spacing between training samples. It is worth stressing that the sorted-distance metric – which effectively corresponds to the δ-density limit of the Wasserstein metric – performs rather poorly, and cannot even reproduce the training points. This is because the mapping between feature space and z_H is not exactly linear, changing slope when z_H crosses 1 Å (because the sorting of the vector changes) and 4 Å (because one atom exits the cutoff).
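The sorted-distances metric referred to above can be sketched directly. Padding the shorter neighbour list with the cutoff value mirrors the δ-sink idea, but that padding convention, like the cutoff value itself, is an assumption of this sketch rather than the authors' definition.

```python
# Hedged sketch of the sorted interatomic-distance metric: the Euclidean norm
# between vectors of sorted neighbour distances, with missing neighbours
# padded at the cutoff so that environments with different atom counts
# remain comparable.
import numpy as np

def sorted_distance_metric(d1, d2, r_c=5.0):
    n = max(len(d1), len(d2))
    v1 = np.sort(np.pad(np.asarray(d1, float), (0, n - len(d1)), constant_values=r_c))
    v2 = np.sort(np.pad(np.asarray(d2, float), (0, n - len(d2)), constant_values=r_c))
    return float(np.linalg.norm(v1 - v2))

# one neighbour at 2.0 has left the second environment: its sorted-distance
# entry is replaced by the cutoff value
d_ab = sorted_distance_metric([1.0, 1.0, 1.0, 2.0], [1.0, 1.0, 1.0])
```

Because the metric depends on the sorted vector, its slope changes abruptly whenever the ordering of distances changes or an atom crosses the cutoff, which is exactly the piecewise-linear behavior discussed above.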
The sorted-distances feature space does not have sufficient flexibility to regress this piecewise-linear map, as opposed to its smooth Wasserstein counterpart.

Having rationalized the behavior of the Wasserstein metric for a toy model, we can test how it compares to the conventional Euclidean metric on a more realistic data set. We consider in particular the AIRSS carbon data set, and compare different levels of density smearing as well as Euclidean and Wasserstein metrics. Figure 14 paints a rather nuanced picture of the relationship between the linear and the Wasserstein-induced feature spaces. The GFRE is non-zero in both directions, meaning that (in a linear sense) Wasserstein and Euclidean features provide complementary types of information. Smearing of the density has a small effect on the Wasserstein metric: the GFRE between Wasserstein features computed with sharp and with smooth smearing is small in both directions, indicating that the choice of σ_G induces small information loss, but a large distortion of feature space. Overall, there is no sign of the pathological behavior seen in Fig. 13, which is an indication that (at least for 2-body features) the carbon dataset is sufficiently dense, and that the better interpolative behavior of the EMD does not lead to a more informative feature space.

IV. CONCLUSION
Applications of machine learning to atomistic modelling suggest that the featurization that is chosen to represent a molecule or material can be equally or more important than the choice of regression scheme. This has led to the proliferation of approaches to build descriptors, that often differ from each other only in implementation details. The framework we introduce in this work enables a comparison of alternative choices of representations that does not depend on the target property, and makes it possible to determine objectively which of two sets of features contains more information – based on a feature-space reconstruction error – and how much distortion is present in the way they describe the information that is common to the pair – based on a measure of feature-space distortion. Even though the framework is linear in nature, it can be generalized to account for non-linear relationships between feature spaces, either by using kernel-induced features, or by decomposing the feature comparison problem into a collection of local mappings.

Using this framework we demonstrate that the choice of basis set can affect substantially the convergence of SOAP features, and that for instance Gaussian type orbitals are more stable in the limit of small density smearing than the discrete variable representation basis. In practice the convergence of the representation with the number of basis functions should be considered together with the computational cost of the basis.

FIG. 13: Errors when reproducing the atomic displacement z_H for a fine (top) and coarse (bottom) grid of training points, and different Gaussian widths σ_G and metrics. A constant regularization that discards small singular values has been applied to all pointwise GFRE calculations.

The analytical expressions for GTOs involve special functions that are usually harder to evaluate than those that appear for a DVR basis.
This overhead, however, can be avoided by using a spline approximation: computational cost depends substantially on the details of the implementation. We also show quantitatively that a systematic orthogonal basis is much more effective in describing the atom density than the heuristic symmetry functions of the Behler-Parrinello kind – notwithstanding the considerable success that the latter approach has had in the construction of neural-network-based interatomic potentials.

A more systematic difference between atomistic machine-learning frameworks arises from the choice of the order of inter-atomic correlations that underlies the representation.

FIG. 14: Comparison of the GFRE and GFRD for the carbon dataset, using sharp and smooth density smearing, and Euclidean and Wasserstein metrics.

The ν-to-(ν+1)-body mapping error decreases with increasing ν, indicating that less information is added with higher body orders. This is consistent with the satisfactory results that have been obtained in the regression of atomistic properties using only 3-body information, even though the fundamental incompleteness of 3-body features has been shown to have implications for the asymptotic learning performance. An analysis based on the GFRE might help determine the high-order correlations that provide the highest amount of information, and, combined with an iterative scheme to evaluate the corresponding features, provide a strategy to increase model accuracy with an affordable computational effort.

We also investigate the effect of changing the metric used to compare features, by juxtaposing the Euclidean distance (that is induced by a linear description of the feature space) with a Wasserstein metric, that can be applied to the comparison of n-body correlation features when expressed as real-space distributions.
We find that – with an appropriate normalization – the Wasserstein distance can be seen as a proxy of the minimal amount of distortion needed to transform an environment into another, and that this behavior induces smooth interpolation between sparse reference points, contrary to what is observed for the Euclidean distance. However, both an aggressive smearing of the atom density and the use of a more realistic data set cure the pathological behavior of the linear featurization, so that the Wasserstein metric should not be regarded as superior to the Euclidean one, but complementary. Generalizing the Wasserstein metric to higher body-order correlations, which induce a higher-dimensional feature space that is more likely to be sparsely populated, would be an interesting further research direction.

The analysis in this work can be extended to compare atom-density representations with a broader class of descriptors based on topological or physicochemical information, as well as property-dependent representations induced by neural network frameworks. Even more broadly, an objective measure of the relative effectiveness of features can guide the development of machine learning schemes for any problem that depends strongly on the strategy used to obtain a mathematical description of its inputs. The feature-space reconstruction error and distortion can be incorporated into any machine learning framework, to drive feature selection algorithms or to ensure that implementation choices that improve computational efficiency do not cause a degradation in the resolving power of the resulting features.

ACKNOWLEDGMENTS
The authors would like to thank Giulio Imbalzano for generating BPSF features to compare against. AG and MC acknowledge support from the Swiss National Science Foundation (Project No. 200021-182057). GF acknowledges support by the SCCER Efficiency of Industrial Processes, and by the European Center of Excellence MaX, Materials at the Exascale – GA No. 676598.

1. J. Behler and M. Parrinello, Phys. Rev. Lett., 146401 (2007).
2. A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, Phys. Rev. Lett., 136403 (2010).
3. M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett., 058301 (2012).
4. A. P. Bartók, R. Kondor, and G. Csányi, Phys. Rev. B, 184115 (2013).
5. S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, Phys. Chem. Chem. Phys., 13754 (2016).
6. M. Eickenberg, G. Exarchakis, M. Hirn, and S. Mallat, Adv. Neural Inf. Process. Syst., 6541 (2017).
7. H. Huo and M. Rupp, arXiv preprint arXiv:1704.06439 (2017).
8. F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld, J. Chem. Theory Comput., 5255 (2017).
9. S. Chmiela, H. E. Sauceda, K.-R. Müller, and A. Tkatchenko, Nat. Commun., 1 (2018).
10. L. Zhang, J. Han, H. Wang, R. Car, and W. E, Phys. Rev. Lett., 143001 (2018).
11. M. J. Willatt, F. Musil, and M. Ceriotti, J. Chem. Phys., 154110 (2019).
12. R. Drautz, Phys. Rev. B, 014104 (2019).
13. A. S. Christensen, L. A. Bratholm, F. A. Faber, and O. Anatole von Lilienfeld, J. Chem. Phys., 044107 (2020).
14. C. van der Oord, G. Dusson, G. Csányi, and C. Ortner, Mach. Learn.: Sci. Technol., 015004 (2020).
15. L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, Phys. Rev. Lett. (2015), 10.1103/physrevlett.114.105503.
16. L. Zhu, M. Amsler, T. Fuhrer, B. Schaefer, S. Faraji, S. Rostami, S. A. Ghasemi, A. Sadeghi, M. Grauzinyte, C. Wolverton, and S. Goedecker, J. Chem. Phys., 034203 (2016).
17. G. A. Gallet and F. Pietrucci, J. Chem. Phys., 074101 (2013).
18. O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, and A. Knoll, Int. J. Quantum Chem., 1084 (2015).
19. S. N. Pozdnyakov, M. J. Willatt, A. P. Bartók, C. Ortner, G. Csányi, and M. Ceriotti, Phys. Rev. Lett., 166001 (2020).
20. M. J. Willatt, F. Musil, and M. Ceriotti, Phys. Chem. Chem. Phys., 29661 (2018).
21. Y. Zuo, C. Chen, X. Li, Z. Deng, Y. Chen, J. Behler, G. Csányi, A. V. Shapeev, A. P. Thompson, M. A. Wood, and S. P. Ong, J. Phys. Chem. A, acs.jpca.9b08723 (2020).
22. B. Helfrecht, R. K. Cersonsky, G. Fraux, and M. Ceriotti, Mach. Learn.: Sci. Technol. (2020), 10.1088/2632-2153/aba9ef.
23. B. Onat, C. Ortner, and J. R. Kermode, J. Chem. Phys., 144106 (2020).
24. J. E. Moussa, Phys. Rev. Lett., 059801 (2012).
25. A. Sadeghi, S. A. Ghasemi, B. Schaefer, S. Mohr, M. A. Lill, and S. Goedecker, J. Chem. Phys., 184118 (2013).
26. B. Parsaeifard, D. S. De, A. S. Christensen, F. A. Faber, E. Kocer, S. De, J. Behler, A. von Lilienfeld, and S. Goedecker, Mach. Learn.: Sci. Technol. (2020), 10.1088/2632-2153/abb212.
27. K. Torkkola, J. Mach. Learn. Res., 1415 (2003).
28. This is due to the fact that feature matrices are standardized, and so ‖X^test_F'‖/n_test is of the order of one.
29. P. H. Schönemann, Psychometrika, 1 (1966).
30. S. T. Roweis and L. K. Saul, Science, 2323 (2000).
31. B. Schölkopf, in Advances in Neural Information Processing Systems (2001) pp. 301–307.
32. B. Haasdonk and C. Bahlmann, in Joint Pattern Recognition Symposium (Springer, 2004) pp. 220–227.
33. B. Schölkopf, A. Smola, and K.-R. Müller, in International Conference on Artificial Neural Networks (Springer, 1997) pp. 583–588.
34. C. J. Pickard and R. J. Needs, J. Phys.: Condens. Matter, 053201 (2011).
35. C. J. Pickard, “AIRSS data for carbon at 10GPa and the C+N+H+O system at 1GPa,” (2020).
36. A. Thompson, L. Swiler, C. Trott, S. Foiles, and G. Tucker, J. Comput. Phys., 316 (2015).
37. F. Musil, M. Veit, T. Junge, and M. Stricker, “LIBRASCAL”.
38. J. Behler, J. Chem. Phys. (2011), 10.1063/1.3553717.
39. A. Singraber, T. Morawietz, J. Behler, and C. Dellago, J. Chem. Theory Comput., 3075 (2019).
40. J. Behler, Phys. Chem. Chem. Phys., 17930 (2011).
41. G. Imbalzano, A. Anelli, D. Giofré, S. Klees, J. Behler, and M. Ceriotti, J. Chem. Phys., 241730 (2018).
42. A. P. Bartók, S. De, C. Poelking, N. Bernstein, J. R. Kermode, G. Csányi, and M. Ceriotti, Sci. Adv., e1701816 (2017).
43. F. M. Paruzzo, A. Hofstetter, F. Musil, S. De, M. Ceriotti, and L. Emsley, Nat. Commun., 4501 (2018).
44. K. V. Jose, N. Artrith, and J. Behler, J. Chem. Phys., 194111 (2012).
45. J. Behler, Int. J. Quantum Chem., 1032 (2015).
46. M. W. Mahoney and P. Drineas, Proc. Natl. Acad. Sci., 697 (2009).
47. A. V. Shapeev, Multiscale Model. Simul., 1153 (2016).
48. A. Glielmo, C. Zeni, and A. De Vita, Phys. Rev. B, 184307 (2018).
49. R. Jinnouchi, F. Karsai, C. Verdi, R. Asahi, and G. Kresse, J. Chem. Phys., 234102 (2020).
50. J. Nigam, S. Pozdnyakov, and M. Ceriotti, J. Chem. Phys., 121101 (2020).
51. S. Vallender, Theory Probab. Its Appl., 784 (1974).
52. S. D. Cohen and L. Guibas, 1 (1997).
53. M. Cuturi, Int. Jt. Conf. Artif. Intell. IJCAI, 732 (2007).
54. O. Çaylak, A. von Lilienfeld, and B. Baumeier, Mach. Learn.: Sci. Technol. (2020), 10.1088/2632-2153/aba048.
55. J. Behler, J. Chem. Phys., 170901 (2016).
56. O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo, and A. Tropsha, Nat. Commun., 1 (2017).
57. S. Liu, M. F. Demirel, and Y. Liang, in Advances in Neural Information Processing Systems (2019) pp. 8466–8478.
58. G. Pilania, C. Wang, X. Jiang, S. Rajasekaran, and R. Ramprasad, Sci. Rep., 1 (2013).
59. L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, npj Comput. Mater., 1 (2016).
60. R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, and L. M. Ghiringhelli, Phys. Rev. Mater., 083802 (2018).
61. K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller, J. Chem. Phys., 241722 (2018).
62. T. S. Cohen, M. Geiger, J. Köhler, and M. Welling, arXiv preprint arXiv:1801.10130 (2018).