Atom-Density Representations for Machine Learning
Michael J. Willatt, Félix Musil, and Michele Ceriotti∗
Laboratory of Computational Science and Modeling, Institute of Materials, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
National Center for Computational Design and Discovery of Novel Materials (MARVEL)
The applications of machine learning techniques to chemistry and materials science become more numerous by the day. The main challenge is to devise representations of atomic systems that are at the same time complete and concise, so as to reduce the number of reference calculations that are needed to predict the properties of different types of materials reliably. This has led to a proliferation of alternative ways to convert an atomic structure into an input for a machine-learning model. We introduce an abstract definition of chemical environments that is based on a smoothed atomic density, using a bra-ket notation to emphasize basis set independence and to highlight the connections with some popular choices of representations for describing atomic systems. The correlations between the spatial distribution of atoms and their chemical identities are computed as inner products between these feature kets, which can be given an explicit representation in terms of the expansion of the atom density on orthogonal basis functions, that is equivalent to the smooth overlap of atomic positions (SOAP) power spectrum, but also in real space, corresponding to $n$-body correlations of the atom density. This formalism lays the foundations for a more systematic tuning of the behavior of the representations, by introducing operators that represent the correlations between structure, composition, and the target properties. It provides a unifying picture of recent developments in the field and indicates a way forward towards more effective and computationally affordable machine-learning schemes for molecules and materials.

I. INTRODUCTION
Machine learning (ML) is used routinely nowadays as an expedient to circumvent costly electronic structure calculations. Given a set of atomic structures and a means to represent them concisely as a vector of numbers (often referred to as features, descriptors or fingerprints), machine learning models can make property predictions by interpolating over the known properties of a subset used for training. Provided the machine learning model yields an acceptable level of accuracy, this allows for exploratory investigations of vast numbers of molecules in chemical compound space, which is essential for e.g. high-throughput screening, drug discovery and molecular classification.
It also enables the development of atomic potentials that reach the accuracy of ab initio alternatives at a smaller computational expense.
For a ground-state electronic structure calculation, an atomic structure is completely characterized by the number of electrons, the positions of the nuclei and their identities (nuclear charges), and in principle a machine learning model could use the same information as a representation of its input. However, this information is not consistent with obvious physical symmetries. Any translation or rotation of a structure, or permutation of identical atoms within, will not affect scalar properties like energies. A representation that reflects this invariance is therefore desirable for an efficient comparison to be made between structures.

Different strategies have been proposed to incorporate these symmetries. Approaches based on internal coordinates (e.g. Coulomb matrices, eigenvalues of overlap matrices or deep potential molecular dynamics) are automatically invariant to rotations and translations but require an additional symmetrization over the permutation group. For low-dimensional problems this symmetrization can be performed exactly. For larger systems, one can proceed by sorting the vector of interatomic distances or eigenvalues of a matrix that depends on interatomic distances. However, both procedures introduce derivative discontinuities. Instead, many approaches to represent atomistic configurations, e.g. radial distribution functions, wavelets, invariant polynomials, symmetry functions following Behler and Parrinello, or a many-body tensor representation, rely more or less explicitly on atomic distributions.

In this article, we start from an abstract representation of structures and atomic environments to discuss atom-density-based approaches to chemical machine learning. We emphasize the basis-set independence of this representation by using the Dirac bra-ket notation.

∗ michele.ceriotti@epfl.ch
This framework provides a unifying picture of the field, in that several popular techniques can be seen as alternative representations of the same abstract feature vectors. In particular, we show that by representing these kets based on an expansion of atom-centered Gaussians in radial basis functions and spherical harmonics one recovers the SOAP power spectrum, which has been used very successfully to compare structures and predict scalar properties, has been generalized to model tensorial properties, and has been sparsified to reduce the computational cost. This formalism also provides a way to fine-tune or reduce the dimensionality of the feature vector by introducing a general coupling between different components in the definition of the kernel, recovering the flexibility of using different kinds of density-based representations within an elegant, unified framework.

II. A DIRAC NOTATION FOR ATOMIC CONFIGURATIONS AND ENVIRONMENTS
The Dirac (bra-ket) notation is often used to streamline the formulation of quantum-mechanical expressions, since it stresses basis-independence of quantum states. In a kernel ridge (Gaussian process) regression framework, kernel functions between atomic configurations $K(\mathcal{A},\mathcal{B})$ are used as the basis to build machine-learning models of atomic-scale properties,

$$ y(\mathcal{A}) = \sum_M x_M K(\mathcal{A}, \mathcal{X}_M). \quad (1) $$

Here $y$ is a property one wants to predict, the $\mathcal{X}_M$ are a set of reference atomic structures, and the $x_M$ are weights that can be determined by minimization of a loss function that measures the discrepancy between the predictions of the model and a set of reference calculations.

Well-behaved (positive-semidefinite) kernels correspond to inner products between (possibly infinite-dimensional) feature vectors, and are also independent of the choice of basis that is used to represent them. This suggests that it might be beneficial to adopt the bra-ket notation to formulate input representations $|\mathcal{A}\rangle$ and the associated kernels $K(\mathcal{A},\mathcal{B}) = \langle\mathcal{A}|\mathcal{B}\rangle$. When using feature vectors explicitly in a computer simulation, it is necessary to settle on some (finite) basis. In combination with the advantage that some bases provide over others for proving certain results, this is perhaps why the Dirac notation has not been used in this context up to now.

A. Density-based structural representations
As a starting point to represent atomic structures, every atomic configuration $\mathcal{A}$ could be associated with a ket $|\mathcal{A}\rangle$ that describes the elemental composition and geometric arrangement of atoms. To make this idea concrete, we might center a smooth, real, positive function $g(\mathbf{r})$ (e.g. a normalized Gaussian) on each atom and decorate them with orthonormal kets $|\alpha\rangle$ to represent their elemental identities. Smoothness in the representation is beneficial as it leads to smooth kernels and better-behaved regression. Such an atomic configuration is written in position space as

$$ \langle \mathbf{r}|\mathcal{A}\rangle = \sum_i g(\mathbf{r}-\mathbf{r}_i)\,|\alpha_i\rangle, \quad (2) $$

where the sum is taken over all atoms in the configuration. This expansion could be generalized by using e.g. element-dependent widths in $g(\mathbf{r})$, i.e. $g(\mathbf{r}) \to g(s(\alpha)\mathbf{r})$. Regardless of the particular form of $g(\mathbf{r})$ (provided that it is sufficiently localized), $|\mathcal{A}\rangle$ provides a unique representation of the structure, but is variant with respect to fundamental physical symmetries, such as rigid translations $\hat{t}$ and rotations $\hat{R}$ of the constituent atoms, $\{\mathbf{r}_i\} \to \{\hat{R}\hat{t}\,\mathbf{r}_i\}$. As has been stressed many times previously, to achieve efficient learning one should use representations that possess the same symmetries as the property one wants to learn.

B. Symmetry-invariant representations
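Before discussing symmetrization, it may help to see the decorated density of Eq. (2) in concrete form. The sketch below (function and variable names are illustrative, not from the paper) evaluates one density channel per element, each a sum of normalized atom-centered Gaussians:

```python
import numpy as np

def density_channels(r, positions, elements, sigma=0.5):
    """Evaluate <r|A> of Eq. (2): one channel per element alpha,
    each a sum of normalized Gaussians g(r - r_i) over atoms of that element."""
    norm = (2.0 * np.pi * sigma**2) ** -1.5
    out = {}
    for pos, el in zip(positions, elements):
        d2 = np.sum((np.asarray(r, dtype=float) - pos) ** 2)
        out[el] = out.get(el, 0.0) + norm * np.exp(-d2 / (2.0 * sigma**2))
    return out

# Toy water-like configuration
pos = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
els = ["O", "H", "H"]
rho = density_channels([0.0, 0.0, 0.0], pos, els)
```

Evaluating the density at a rigidly shifted point on a rigidly shifted structure returns the same values, which illustrates why $|\mathcal{A}\rangle$ itself is not translationally invariant: the ket changes, even though the physics does not.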
To address the variance of Eq. (2) with respect to a symmetry operation $\hat{S}$, one can proceed by formally averaging the ket over the corresponding symmetry group (a procedure often referred to as Haar integration):

$$ \left|\mathcal{A}^{(1)}\right\rangle_{\hat{S}} = \int \mathrm{d}\hat{S}\, \hat{S}\,|\mathcal{A}\rangle. \quad (3) $$

To see how this translates into symmetry-invariant representations, let us start by considering the relatively simple case of the integration over the translation group, which simply corresponds to an integration over $\mathbb{R}^3$. Averaging directly over the position representation of $|\mathcal{A}\rangle$ leads to a rather uninformative representation, which eliminates all structural information and only counts the number $N_\alpha$ of atoms belonging to each species,

$$ \left\langle \mathbf{r}\middle|\mathcal{A}^{(1)}\right\rangle_{\hat{t}} = \int \mathrm{d}\hat{t}\, \langle\mathbf{r}|\hat{t}|\mathcal{A}\rangle = \sum_i |\alpha_i\rangle \int \mathrm{d}\mathbf{t}\, g(\mathbf{t}+\mathbf{r}-\mathbf{r}_i) = \sum_\alpha N_\alpha |\alpha\rangle, \quad (4) $$

where we have used the position representation of the translation operator. To avoid this information loss, one can perform the Haar integration over tensor products of $|\mathcal{A}\rangle$, and define

$$ \left|\mathcal{A}^{(\nu)}\right\rangle_{\hat{t}} = \int \mathrm{d}\hat{t}\, \underbrace{\hat{t}|\mathcal{A}\rangle \otimes \hat{t}|\mathcal{A}\rangle \otimes \cdots \otimes \hat{t}|\mathcal{A}\rangle}_{\nu}. \quad (5) $$

For $\nu=2$, and assuming for simplicity that the same smooth density function is used for each atom, one gets

$$ \left\langle \mathbf{r}\mathbf{r}'\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \int \mathrm{d}\hat{t} \sum_{ij} g(\hat{t}\mathbf{r}-\mathbf{r}_i)\, g(\hat{t}\mathbf{r}'-\mathbf{r}_j)\,|\alpha_i\alpha_j\rangle = \sum_{ij} |\alpha_i\alpha_j\rangle \int \mathrm{d}\mathbf{t}\, g(\mathbf{t}+\mathbf{r}-\mathbf{r}_i)\, g(\mathbf{t}+\mathbf{r}'-\mathbf{r}_j) = \sum_{ij} |\alpha_i\alpha_j\rangle\, (g \ast g)(\mathbf{r}'-\mathbf{r}-\mathbf{r}_{ij}), \quad (6) $$

where $\mathbf{r}_{ij} = \mathbf{r}_i - \mathbf{r}_j$, and $\ast$ is the standard convolution operator. We can simplify the notation for $\left|\mathcal{A}^{(2)}\right\rangle_{\hat{t}}$ in the position representation by (1) noting that the convolution in Eq.
(6) only depends on $\mathbf{r}-\mathbf{r}'$, so we can write the ket as a function of $\Delta\mathbf{r} = \mathbf{r}-\mathbf{r}'$ alone; (2) redefining the convolution of two atom-density functions as $h = g \ast g$:

$$ \left\langle \Delta\mathbf{r}\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \sum_j |\alpha_j\rangle \sum_i h(\Delta\mathbf{r}-\mathbf{r}_{ij})\,|\alpha_i\rangle. \quad (7) $$

C. Tensor-product representations
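A central claim developed in this section is that taking tensor products of a feature ket is equivalent to raising the associated linear kernel to an elementwise integer power. This is a purely algebraic fact that can be checked numerically on arbitrary finite feature vectors (names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)  # stand-in feature vector for structure A
b = rng.normal(size=5)  # stand-in feature vector for structure B

k_lin = a @ b                         # linear kernel <A|B>
k_sq = np.kron(a, a) @ np.kron(b, b)  # inner product of tensor-product features

# (a (x) a) . (b (x) b) = (a . b)^2
assert np.isclose(k_sq, k_lin**2)
```

The same identity extends to any power $\zeta$: $\zeta$-fold Kronecker products of the features correspond to the $\zeta$-th elementwise power of the kernel.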
Before proceeding further, let us comment briefly on the implications of the form of this ket for machine-learning of the properties associated with the structure $|\mathcal{A}\rangle$ using kernel ridge regression, taking for simplicity a single-species compound so we can ignore the elemental kets. Learning from a linear kernel $K(\mathcal{A},\mathcal{B}) = \left\langle\mathcal{A}^{(2)}\middle|\mathcal{B}^{(2)}\right\rangle_{\hat{t}}$ is equivalent to the optimization of a linear mapping between the ket and the property, i.e.

$$ y(\mathcal{A}) = \int \mathrm{d}\mathbf{r}\, y(\mathbf{r}) \left\langle \mathbf{r}\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}}. \quad (8) $$

Taking the Dirac $\delta$ distribution limit of $g(\mathbf{r})$, one sees this is an (orientation-dependent) pair potential,

$$ y(\mathcal{A}) = \sum_{ij} y(\mathbf{r}_{ij}), \quad (9) $$

and it is therefore easy to conceive of properties that cannot be represented in this form. The feature vector itself, however, contains complete information about the structure, which can be recovered by taking tensor products of $|\mathcal{A}\rangle$, or (equivalently) raising the associated kernel to an integer power. For instance, if one takes the outer product of the translationally-symmetrized ket (whose corresponding kernel is just the elementwise square of the linear kernel), learning amounts to the optimization of a function that depends on two displacement vectors simultaneously,

$$ y(\mathcal{A}) = \int \mathrm{d}\mathbf{r}\,\mathrm{d}\mathbf{r}'\, y(\mathbf{r},\mathbf{r}') \left\langle \mathbf{r}'\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} \left\langle \mathbf{r}\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \sum_{iji'j'} \int \mathrm{d}\mathbf{r}\,\mathrm{d}\mathbf{r}'\, y(\mathbf{r},\mathbf{r}')\, h(\mathbf{r}-\mathbf{r}_{ij})\, h(\mathbf{r}'-\mathbf{r}_{i'j'}) \xrightarrow{h\to\delta} \sum_{iji'j'} y(\mathbf{r}_{ij},\mathbf{r}_{i'j'}), \quad (10) $$

and so on. This simple example highlights how high-order correlations between atomic positions can be incorporated in the model by taking the tensor product of the structural ket before taking the Haar integral (that is, choosing a high value of $\nu$ in Eq.
(5)) or by taking a tensor product of the invariant ket,

$$ \underbrace{\left|\mathcal{A}^{(\nu)}\right\rangle_{\hat{t}} \otimes \left|\mathcal{A}^{(\nu)}\right\rangle_{\hat{t}} \otimes \cdots \otimes \left|\mathcal{A}^{(\nu)}\right\rangle_{\hat{t}}}_{\zeta} \to \left|\mathcal{A}^{(\nu)}\right\rangle^{(\zeta)}_{\hat{t}}. \quad (11) $$

The latter choice corresponds to taking elementwise powers of the linear invariant kernel. In other terms, using a unique representation of a structure in a non-linear ML model can introduce higher body-order correlations than those explicitly afforded by the feature vector itself.

D. Atom-centered representations
Having clarified how tensor-product kets can be used to incorporate higher-order correlations between the atoms, let us move on to discuss how an atom-centered description arises naturally as a by-product of symmetrization over the translation group. By grouping together the terms in the sum corresponding to displacement vectors involving atom $j$, the translationally-invariant second-order ket Eq. (7) decomposes into atom-centered contributions,

$$ \left|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \sum_j |\alpha_j\rangle\,|\mathcal{X}_j\rangle, \quad (12) $$

where we have dropped the indication of the translational averaging from $|\mathcal{X}_j\rangle$ to keep at bay the complexity of the notation. Note that Eq. (12) implies an additive definition of the relation between the representations of the entire structures and those associated with atom-centered environments. This definition translates into a "global" kernel between structures that is a sum (or average) of kernels between environments. While other choices are possible, this formal derivation shows how naturally an additive kernel arises from a symmetrization of a density-based "global" feature vector.

The position representation of the environmental atom-centered ket $|\mathcal{X}_j\rangle$ is

$$ \langle\mathbf{r}|\mathcal{X}_j\rangle = \sum_i f_c(r_{ij})\, h(\mathbf{r}-\mathbf{r}_{ij})\,|\alpha_i\rangle. \quad (13) $$

In this definition we have introduced a smooth cutoff function $f_c(r_{ij})$ so that each environment only depends on the position of the atoms in a spherical neighborhood centered on atom $j$. While one could in principle proceed with an atom-centered description that incorporates information from the entire structure, by making $f_c(r)=1$, localisation is useful for computational reasons and is justified when studying atomic problems in light of the nearsightedness principle of electronic matter, which underlies most linear-scaling electronic structure methods.
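The text leaves the form of $f_c$ open. A common concrete choice in the literature (an assumption here, in the spirit of a cosine cutoff; the paper does not prescribe this form) is one that equals 1 at the center and decays smoothly to zero, with zero slope, at the cutoff radius:

```python
import numpy as np

def f_cut(r, r_cut=5.0):
    """Smooth cutoff: 1 at r = 0, decays to 0 with zero slope at r = r_cut,
    and is identically 0 beyond it, so the environment has compact support."""
    r = np.asarray(r, dtype=float)
    inside = 0.5 * (1.0 + np.cos(np.pi * r / r_cut))
    return np.where(r < r_cut, inside, 0.0)
```

The smoothness of $f_c$ at $r_{\mathrm{cut}}$ matters: a hard cutoff would reintroduce the derivative discontinuities that the smooth-density construction is designed to avoid.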
Note that when the ket is written in this form it might make sense to further generalize the definition of $h$, e.g. by making its width dependent on $r_{ij} = |\mathbf{r}_{ij}|$, $h(\mathbf{r}) \to h(s(r_{ij})\mathbf{r})$, or by choosing a form other than a Gaussian that is more flexible or computationally efficient. The notation can be further simplified by emphasizing the representations of structure and composition,

$$ \psi^{\alpha}_{\mathcal{X}_j}(\mathbf{r}) \equiv \langle\alpha\,\mathbf{r}|\mathcal{X}_j\rangle = \sum_{i\in\alpha} f_c(r_{ij})\, h(\mathbf{r}-\mathbf{r}_{ij}), \quad (14) $$

writing the ket as a sum over all elements $\alpha = \mathrm{H}, \mathrm{He}, \ldots$

$$ \langle\mathbf{r}|\mathcal{X}_j\rangle = \sum_\alpha \psi^{\alpha}_{\mathcal{X}_j}(\mathbf{r})\,|\alpha\rangle. \quad (15) $$

This translationally-invariant atom-centered environment representation can also be adapted by taking a linear transformation $\hat{U}|\mathcal{X}_j\rangle \to |\mathcal{X}_j\rangle$, where the linear operator $\hat{U}$ might act in the position space, the element space or both. As we will see, the freedom in choosing the form of $\hat{U}$ can be used to tune the behavior of the representation to describe in a more efficient way the relation between structure and properties.

E. Rotationally-invariant representations
In order to obtain a rotationally-invariant representation, one can formally average the ket $|\mathcal{X}\rangle$ over the $SO(3)$ rotation group,

$$ \left|\mathcal{X}^{(1)}\right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R}\, \hat{R}\,|\mathcal{X}\rangle. \quad (16) $$

This ket can be readily computed in the position representation. Taking for simplicity the case where only one element is present, one gets

$$ \left\langle\mathbf{r}\middle|\mathcal{X}^{(1)}\right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R}\, \psi(\hat{R}\mathbf{r}) = \int \mathrm{d}\hat{R}\, \psi(r\hat{R}\hat{\mathbf{e}}_z), \quad (17) $$

where we have used the fact that the integral is over all the rotation matrices, and so we can always rotate $\mathbf{r}$ to be aligned with the Cartesian $z$ axis $\hat{\mathbf{e}}_z$ before taking the integral. The average can be written explicitly in terms of a suitable parameterization of the rotations, e.g. using Euler angles,

$$ \frac{1}{8\pi^2} \int_0^{2\pi} \mathrm{d}\alpha \int_0^{\pi} \sin\beta\, \mathrm{d}\beta \int_0^{2\pi} \mathrm{d}\gamma\; \psi\big(r\hat{R}(\alpha,\beta,\gamma)\hat{\mathbf{e}}_z\big). \quad (18) $$

One can then recognize that the $\gamma$ angle does not affect $\hat{\mathbf{e}}_z$, so the integral can be written equivalently as an average over the unit sphere. We can then define

$$ \left\langle r\middle|\mathcal{X}^{(1)}\right\rangle_{\hat{R}} \propto r \int \mathrm{d}\hat{R}\, \psi(r\hat{R}\hat{\mathbf{e}}_z) = \frac{1}{4\pi}\, r \int \mathrm{d}\hat{\mathbf{r}}\, \psi(r\hat{\mathbf{r}}), \quad (19) $$

where we have highlighted the fact that the position representation only depends on $r$, and we have explicitly included a factor of $r$ so that the kernel can be computed as a plain radial integral,

$$ \left\langle\mathcal{X}^{(1)}_k\middle|\mathcal{X}^{(1)}_j\right\rangle_{\hat{R}} = \int \mathrm{d}r\, \left\langle\mathcal{X}^{(1)}_k\middle|r\right\rangle_{\hat{R}} \left\langle r\middle|\mathcal{X}^{(1)}_j\right\rangle_{\hat{R}}. \quad (20) $$

Much like in the case of translations, the average over rotations eliminates too much information, and $\left|\mathcal{X}^{(1)}\right\rangle_{\hat{R}}$ does not retain knowledge of the angular correlations of atoms around the center of the environment.
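The spherical average of Eq. (19) can be checked numerically: for a single off-center Gaussian, the average of $\psi(r\hat{\mathbf{r}})$ over the unit sphere depends only on $r$ and on the distance of the atom from the center, not on its direction. Below is a sketch using a Gauss-Legendre grid in $\cos\theta$ and a uniform grid in $\phi$ (all function names are illustrative):

```python
import numpy as np

def spherical_average(psi, r, n_theta=64, n_phi=128):
    """(1/4pi) * integral over the unit sphere of psi(r * rhat)."""
    x, w = np.polynomial.legendre.leggauss(n_theta)  # nodes in cos(theta)
    phi = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    total = 0.0
    for ct, wt in zip(x, w):
        st = np.sqrt(1.0 - ct**2)
        for p in phi:
            rhat = np.array([st * np.cos(p), st * np.sin(p), ct])
            total += wt * psi(r * rhat)
    return total * (2.0 * np.pi / n_phi) / (4.0 * np.pi)

def gaussian(center, sigma=0.5):
    """Atom-density contribution centered at `center` (unnormalized)."""
    return lambda pt: np.exp(-np.sum((pt - center) ** 2) / (2.0 * sigma**2))

# Two atoms at the same distance from the center, along different directions:
a = spherical_average(gaussian(np.array([1.0, 0.0, 0.0])), r=0.8)
b = spherical_average(gaussian(np.array([0.0, 1.0 / np.sqrt(2), 1.0 / np.sqrt(2)])), r=0.8)
```

The two averages agree to quadrature accuracy, which is exactly the information loss the text describes: the $\nu=1$ ket cannot distinguish where on the sphere a neighbor sits.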
A more general family of invariant kets can be obtained by starting from the tensor products of (possibly different) environmental kets, $\hat{U}_1\left|\mathcal{X}^1\right\rangle \otimes \hat{U}_2\left|\mathcal{X}^2\right\rangle \otimes \cdots$, and then symmetrizing over the rotation group,

$$ \left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R} \prod_{\aleph=1}^{\nu} \otimes\, \hat{R}\hat{U}_\aleph \left|\mathcal{X}^\aleph_j\right\rangle. \quad (21) $$

As for the case of translational averages, one can use a linear map (or equivalently a linear kernel) to build a machine-learning model of a property based on these symmetrized kets. Non-linear kernels correspond to tensor products of symmetrized kets such as

$$ \underbrace{\left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}} \otimes \left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}} \otimes \cdots \otimes \left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}}}_{\zeta} \to \left|\mathcal{X}^{(\nu)}\right\rangle^{(\zeta)}_{\hat{R}}, \quad (22) $$

and one could further generalize the construction by taking products of kets built from different $\hat{U}$ operators.

III. A UNIFIED PICTURE OF DENSITY-BASED REPRESENTATIONS
Eq. (21) provides a very general (and abstract) definition of a density-based representation of an atomic structure that encodes translational, rotational and permutation symmetries. This level of abstraction provides a unifying picture of the field, in that many of the representations that have been used for machine-learning of atomic-scale properties can be seen as special cases of this form, or as the result of projection onto a particular choice of basis.

Let us start by considering the translationally-invariant ket $\left|\mathcal{A}^{(2)}\right\rangle_{\hat{t}}$, writing it in a plane-wave basis $\{|\mathbf{k}\rangle\}$, and taking for simplicity the $h\to\delta$ limit. One obtains a representation that is equivalent to the diffraction pattern generated by the structure, decomposed in multiple channels that correspond to the reciprocal-space correlations between different atomic species,

$$ \left\langle\mathbf{k}\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \sum_{ij} |\alpha_i\alpha_j\rangle\, e^{i\mathbf{k}\cdot\mathbf{r}_{ij}}. \quad (23) $$

When considering a periodic structure, and with an appropriate normalization, this representation is directly connected with the fingerprints that have been recently used to identify crystalline structures, highlighting how different choices of basis may be best suited to different applications.

FIG. 1. Atom-density-based structural representations, expressed in the real-space $\langle\mathbf{r}|$ basis. (a) A structure can be mapped onto a smooth atom density built as a superposition of smooth atom-centered functions. The overall density can be decomposed in atom-centered environments, and information on chemical compositions can be stored by decorating the functions with elemental kets. (b) The $\nu=1$ invariant ket corresponds to spherical averaging of the environmental atom density.
(c) The $\nu=2$ invariant ket corresponds to three-body correlations, which are obtained by integrating over all rotations a stencil corresponding to two distances along two directions with a fixed angle $\arccos\omega$ between them.

A. Many-body kernels and representations
Moving on to the case of rotationally-invariant kets, let us take for simplicity $\hat{U}_\aleph = 1$, and assume that all the environmental kets that are multiplied in Eq. (21) are the same. We will revisit later the possibility of introducing a linear operator to fine-tune the properties of the representation. Since we have started from a position representation for the environmental kets, it is natural to write Eq. (21) explicitly in a complete basis of position and element states, $\left|\prod_\aleph^\nu \alpha_\aleph \mathbf{r}_\aleph\right\rangle \equiv \prod_\aleph^\nu \otimes\, |\alpha_\aleph \mathbf{r}_\aleph\rangle$,

$$ \left\langle \prod_{\aleph}^{\nu} \alpha_\aleph \mathbf{r}_\aleph \middle| \mathcal{X}^{(\nu)}_j \right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R} \prod_{\aleph}^{\nu} \left\langle \alpha_\aleph \hat{R}\mathbf{r}_\aleph \middle| \mathcal{X}_j \right\rangle. \quad (24) $$

One can see clearly that the kernels associated with Eq. (24) are in the form of the invariant $n$-body kernels discussed in Ref. 16 (more specifically, as we will see below, they correspond precisely to the SOAP kernels if $g$ is a Gaussian). Considering the case with a single element,

$$ \left\langle \mathcal{X}^{(\nu)}_k \middle| \mathcal{X}^{(\nu)}_j \right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R}\,\mathrm{d}\hat{R}' \left[ \int \mathrm{d}\mathbf{r}\, \psi_{\mathcal{X}_k}(\hat{R}'\mathbf{r})\, \psi_{\mathcal{X}_j}(\hat{R}\mathbf{r}) \right]^\nu, \quad (25) $$

it is clear that one of the two Haar integrals is redundant and can be eliminated.

Let us consider the effect of $\nu$ on the representation and the information that it captures. As discussed in the case of $\nu=1$ following Eq. (17), one of the input vectors $\mathbf{r}$ can be aligned with a fixed reference axis, e.g. $\hat{\mathbf{e}}_z$. The fact that this axis is invariant under one of the Euler rotations makes it possible to align a second vector so that it lies in the $xz$ plane.
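In the sharp-density limit discussed below, the $\nu=2$ invariant reduces an environment to the list of triplets $(r, r', \omega)$: two neighbor distances and the cosine of the angle they subtend at the central atom. A sketch of that bookkeeping, with illustrative names and a toy geometry:

```python
import numpy as np

def triplet_features(center, neighbors):
    """List (r, r', omega) for all ordered pairs of neighbors of `center`,
    with omega the cosine of the angle subtended at the central atom."""
    feats = []
    vecs = [np.asarray(n, dtype=float) - np.asarray(center, dtype=float)
            for n in neighbors]
    for vi in vecs:
        for vk in vecs:
            ri, rk = np.linalg.norm(vi), np.linalg.norm(vk)
            omega = float(vi @ vk / (ri * rk))
            feats.append((ri, rk, omega))
    return feats

# Water-like environment around O: two H at 0.96, ~104.5 degree angle
o = [0.0, 0.0, 0.0]
h1 = [0.96, 0.0, 0.0]
h2 = [0.96 * np.cos(np.radians(104.5)), 0.96 * np.sin(np.radians(104.5)), 0.0]
feats = triplet_features(o, [h1, h2])
```

These triplets are invariant under any rigid rotation or translation of the whole environment, since they are built from distances and dot products of relative vectors only.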
For $\nu=1$ and $\nu=2$ this analysis leads to

$$ \left\langle \alpha r \middle| \mathcal{X}^{(1)} \right\rangle_{\hat{R}} \propto r \int \mathrm{d}\hat{R}\, \psi^\alpha(r\hat{R}\hat{\mathbf{e}}_z), $$
$$ \left\langle \alpha r\, \alpha' r'\, \omega \middle| \mathcal{X}^{(2)} \right\rangle_{\hat{R}} \propto rr' \int \mathrm{d}\hat{R}\, \psi^{\alpha'}(r'\hat{R}\hat{\mathbf{e}}_z)\, \psi^{\alpha}\!\left(r\hat{R}\big(\omega\hat{\mathbf{e}}_z + \sqrt{1-\omega^2}\,\hat{\mathbf{e}}_x\big)\right), \quad (26) $$

where $\omega = \hat{\mathbf{r}}\cdot\hat{\mathbf{r}}'$ (see Fig. 1). After one has aligned the first two $\mathbf{r}_\aleph$'s, the position of all the other $\mathbf{r}_\aleph$'s cannot be manipulated, so in practice for $\nu > 2$ only the first two vectors can be taken to be aligned with the $z$ axis and to lie in the $xz$ plane. For $\nu=3$,

$$ \left\langle \alpha r\, \alpha' r'\, \omega\, \alpha'' r'' \hat{\mathbf{r}}'' \middle| \mathcal{X}^{(3)} \right\rangle_{\hat{R}} \propto rr'r'' \int \mathrm{d}\hat{R}\, \psi^{\alpha}(r\hat{R}\hat{\mathbf{e}}_z)\, \psi^{\alpha'}\!\left(r'\hat{R}\big(\omega\hat{\mathbf{e}}_z + \sqrt{1-\omega^2}\,\hat{\mathbf{e}}_x\big)\right) \psi^{\alpha''}\!\left(r''\hat{R}\hat{\mathbf{r}}''\right). \quad (27) $$

Also note that we have incorporated the square root of the Jacobian in the definition of the representations so that the corresponding kernels can be computed straightforwardly as the inner product between two vectors without scaling.

By expanding the densities as sums over atoms, it becomes clear that these kets are representations of the $(\nu+1)$-body order correlations between atoms within an environment (Fig. 2). To start with, we return to the delta function limit of the atomic densities. In the limit in which each atomic density is represented by Dirac $\delta$ distributions, the position representations of the invariant vectors take very simple forms.

FIG. 2. Isocontours of the 3-body correlation functions associated with the environment centered on the tagged carbon atom of an ethanol molecule.
From left to right, the figures correspond to $\left\langle \mathrm{C}\, r\, \mathrm{H}\, r'\, \omega \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}}/rr'$, $\left\langle \mathrm{O}\, r\, \mathrm{H}\, r'\, \omega \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}}/rr'$, and $\left\langle \mathrm{O}\, r\, \mathrm{H}\, r'\, \omega \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}}/rr'$.

In the $h\to\delta$ limit these simple forms read

$$ \left\langle \alpha r \middle| \mathcal{X}^{(1)}_j \right\rangle_{\hat{R}} \overset{h\to\delta}{\propto} r \sum_{i\in\alpha} f_c(r_{ij})\, \delta(r-r_{ij}), $$
$$ \left\langle \alpha r\, \alpha' r'\, \omega \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}} \overset{h\to\delta}{\propto} rr' \sum_{i\in\alpha} \sum_{i'\in\alpha'} \delta(r-r_{ij})\, \delta(r'-r_{i'j})\, \delta(\omega - \hat{\mathbf{r}}_{ij}\cdot\hat{\mathbf{r}}_{i'j})\, f_c(r_{ij})\, f_c(r_{i'j}). \quad (28) $$

One then sees how linear regression based on $\left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}}$ corresponds to $(\nu+1)$-body potentials, e.g. for the 3-body term,

$$ y(\mathcal{X}_j) = \int \mathrm{d}r\,\mathrm{d}r'\,\mathrm{d}\omega\; y^{(2)}(r,r',\omega) \left\langle rr'\omega \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}} = \sum_{ik} y^{(2)}(r_{ij}, r_{jk}, \omega_{ijk}). \quad (29) $$

There are however good reasons to use non-linear functions of the feature vector in an ML model. In the case of sufficiently sharp atom-centered density functions, the ket with $\nu=1$ contains information on the list of all pair distances within an environment, which is not sufficient to reconstruct the structure of the environment unequivocally. The representation with $\nu=2$, on the other hand, contains information on pair distances and angles between triplets of atoms. To the best of our knowledge, and based on extensive numerical experiments, this information is sufficient to reconstruct a configuration modulo arbitrary rigid translations, rotations and inversion symmetry. Tensor products of the $\left|\mathcal{X}^{(2)}\right\rangle_{\hat{R}}$ ket are then sufficient as a basis to represent arbitrarily complex invariant functions of the atomic coordinates.

B. Behler-Parrinello symmetry functions
An expression of Eq. (21) in the position representation and in the $h\to\delta$ limit is an ideal starting point to investigate the relationship of $\left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}}$ with other density-based frameworks. These expressions reveal the connection between these invariant kets and several popular fingerprints designed to capture pair and 3-body interactions. The link between $\left\langle \alpha r \middle| \mathcal{X}^{(1)}_j \right\rangle$ and the pair distribution function is obvious. Behler-Parrinello symmetry functions, and similar weighted averages of $n$-body correlations, can be seen as projections of the $SO(3)$ invariant ket over suitable test functions $G$. For instance, for a 2-body symmetry function $G(r)$ one has

$$ \langle \alpha\beta G|\mathcal{X}_j\rangle = \langle\alpha|\alpha_j\rangle \int \mathrm{d}r\, G(r)\, r \left\langle \beta r \middle| \mathcal{X}^{(1)}_j \right\rangle_{\hat{R}, h\to\delta}, \quad (30) $$

and an analogous expression can be written for a 3-body symmetry function $G(r,r',\omega)$. Expressions similar to Eq. (28) can be obtained by inserting into Eq. (26) Gaussians, or alternative basis functions. The relationship to other density-based representations, such as those discussed in Refs. 23 and 51, is less transparent, but several of the essential ingredients (such as scaling functions that modulate geometric and chemical correlations) can be introduced in terms of appropriate choices of the $\hat{U}$ operators, as we will discuss in the next section.

C. Smooth Overlap of Atomic Positions
We have left as a last example a discussion of the connection between the symmetrized ket and the smooth overlap of atomic positions (SOAP) power spectrum.
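The power-spectrum contraction of Eq. (33) is easy to sketch numerically. Below, random complex numbers stand in for the expansion coefficients $\langle \alpha nlm|\mathcal{X}_j\rangle$ of a single element (array layout and names are illustrative); the key property checked is that the contraction is unchanged when every coefficient picks up the $e^{-im\phi}$ phase of a rotation about the $z$ axis:

```python
import numpy as np

def power_spectrum(c):
    """p[n, n2, l] = (2l+1)^(-1/2) * sum_m conj(c[n, l, m]) * c[n2, l, m],
    with c indexed as c[n, l, m + l_max] so m runs from -l to l."""
    n_max, l_max_plus1, _ = c.shape
    l_max = l_max_plus1 - 1
    p = np.zeros((n_max, n_max, l_max + 1), dtype=complex)
    for l in range(l_max + 1):
        ms = np.arange(-l, l + 1) + l_max
        for n in range(n_max):
            for n2 in range(n_max):
                p[n, n2, l] = np.vdot(c[n, l, ms], c[n2, l, ms]) / np.sqrt(2 * l + 1)
    return p

rng = np.random.default_rng(1)
c = rng.normal(size=(3, 4, 7)) + 1j * rng.normal(size=(3, 4, 7))  # n_max=3, l_max=3

# Rotation about z multiplies c_{nlm} by exp(-i m phi); the m-sum cancels the phase.
phi = 0.7
m_grid = np.arange(-3, 4)
c_rot = c * np.exp(-1j * m_grid * phi)
```

A full rotational-invariance check would require Wigner D-matrices; the $z$-rotation case already shows the mechanism, since the phases cancel in the $\sum_m c^\star c$ contraction.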
In fact, if we take as we did before $\hat{U}_\aleph = 1$ and $\left|\mathcal{X}^1_j\right\rangle = \left|\mathcal{X}^2_j\right\rangle = \ldots = \left|\mathcal{X}^\nu_j\right\rangle$ in Eq. (21), the SOAP power spectrum is nothing but an alternative representation of $\left|\mathcal{X}^{(2)}_j\right\rangle_{\hat{R}}$. To see how, one can start by expanding the translationally-invariant environmental ket Eq. (24) in a basis of orthonormal radial basis functions $R_n(r)$ and spherical harmonics $Y_{lm}(\hat{\mathbf{r}})$,

$$ \langle \alpha nlm | \mathcal{X}_j \rangle = \int \mathrm{d}\mathbf{r}\, R_n(r)\, Y_{lm}(\hat{\mathbf{r}})\, \langle \alpha \mathbf{r} | \mathcal{X}_j \rangle. \quad (31) $$

Using a basis of spherical harmonics is extremely useful and practical because they block-diagonalize the angular momentum operator (and thus the rotation operator), which allows for explicit integration over the rotation group in Eq. (24). For $\nu=1$, this leads to the following feature vector,

$$ \left\langle \alpha n \middle| \mathcal{X}^{(1)}_j \right\rangle_{\hat{R}} \propto \langle \alpha n 0 0 | \mathcal{X}_j \rangle. \quad (32) $$

For $\nu=2$, the feature vector corresponds to the SOAP power spectrum,

$$ \left\langle \alpha n\, \alpha' n'\, l \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}} \propto \frac{1}{\sqrt{2l+1}} \sum_m \langle \alpha nlm | \mathcal{X}_j \rangle^\star \langle \alpha' n' lm | \mathcal{X}_j \rangle. \quad (33) $$

For $\nu=3$ the representation corresponds to the bispectrum,

$$ \left\langle \alpha_1 n_1 l_1\, \alpha_2 n_2 l_2\, \alpha nl \middle| \mathcal{X}^{(3)}_j \right\rangle_{\hat{R}} \propto \frac{1}{\sqrt{2l+1}} \sum_{m m_1 m_2} \langle \mathcal{X}_j | \alpha nlm \rangle \langle \alpha_1 n_1 l_1 m_1 | \mathcal{X}_j \rangle \langle \alpha_2 n_2 l_2 m_2 | \mathcal{X}_j \rangle \langle l_1 m_1\, l_2 m_2 | lm \rangle, \quad (34) $$

where $\langle l_1 m_1\, l_2 m_2 | lm \rangle$ is a Clebsch-Gordan coefficient. The bispectrum is used as a four-body feature vector in SOAP and in Spectral Neighbor Analysis Potentials (SNAP), where its high resolution is exploited to construct accurate interatomic potentials through linear regression.

Seen in the light of the present formalism, the remarkable fact that the SOAP kernel (Eq.
(25) with densities written as a sum of Gaussians) can be expressed as an explicit scalar product between vectors, representing a truncated expansion of the power spectrum, emerges as a natural consequence of the definition of the kernel as the scalar product between invariant kets. It should also be noted that in practical applications of SOAP the kernels are often (but not always) normalized and raised to an integer power $\zeta$, which corresponds to taking a tensor product of the kets and introduces a many-body character in the model built on such kernels.

D. Tensorial Smooth Overlap of Atomic Positions ($\lambda$-SOAP)

The feature vectors that appear in the tensorial extension of SOAP are of the form in Eq. (21), with $\hat{U}_\aleph = \hat{I}$ for $\aleph = 1, 2, \ldots, \nu+1$, $\left|\mathcal{X}^\aleph_j\right\rangle = |\mathcal{X}_j\rangle$ for $\aleph = 1, \ldots, \nu$, and $\left|\mathcal{X}^{\nu+1}_j\right\rangle = |\lambda\mu\rangle$, where $|\lambda\mu\rangle$ is an angular momentum ket:

$$ \left|\mathcal{X}^{(\nu)}_{\lambda\mu}\right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R}\; \hat{R}|\lambda\mu\rangle \prod_{\aleph=1}^{\nu} \otimes\, \hat{R}|\mathcal{X}_j\rangle. \quad (35) $$

FIG. 3. Schematic representation of the construction of a real-space representation of a tensorial ket associated with a $\lambda$-SOAP kernel. The (smooth) atom density is evaluated at two points corresponding to a stencil $(r, r', \omega)$, and the spherical harmonic $Y_{\lambda\mu}$ is evaluated at the angles $(\theta, \phi)$, relative to the reference frame that is used to describe the stencil.

The ket is rotationally invariant,

$$ \left[\prod_{\aleph=1}^{\nu+1} \otimes\, \hat{R}'\right] \left|\mathcal{X}^{(\nu)}_{\lambda\mu}\right\rangle_{\hat{R}} = \left|\mathcal{X}^{(\nu)}_{\lambda\mu}\right\rangle_{\hat{R}}, \quad (36) $$

but not in the subspace that describes the atomic environments,

$$ \left[\hat{I} \otimes \prod_{\aleph=1}^{\nu} \otimes\, \hat{R}'\right] \left|\mathcal{X}^{(\nu)}_{\lambda\mu}\right\rangle_{\hat{R}} \neq \left|\mathcal{X}^{(\nu)}_{\lambda\mu}\right\rangle_{\hat{R}}.
(37)The inner product between two of these vectors is eas-ily shown to be (cid:68) X ( ν ) j λµ (cid:12)(cid:12)(cid:12) X ( ν ) k λ (cid:48) µ (cid:48) (cid:69) ˆ R = δ λλ (cid:48) (cid:90) d ˆ R D λµµ (cid:48) ( ˆ R ) (cid:12)(cid:12) (cid:104)X j | ˆ R |X k (cid:105) (cid:12)(cid:12) ν , (38)which agrees with the usual definition of the λ -SOAPkernel, (cid:68) X ( ν ) j λµ (cid:12)(cid:12)(cid:12) X ( ν ) k λµ (cid:48) (cid:69) ˆ R = k λµµ (cid:48) ( X j , X k ) . (39)While (cid:12)(cid:12)(cid:12) X ( ν ) j λµ (cid:69) ˆ R can be represented very effectivelyusing a spherical-harmonics expansion of the atomdensity , it is also possible to express it in terms of areal-space basis. Following arguments similar to thoseused to derive Eq. (27), one can see that in this formthe tensorial ket corresponds to the evaluation of athree-body correlation function of the atom density,multiplied by a spherical harmonic of appropriate or-der computed in the reference frame of the ( r, r (cid:48) , ω )stencil (see Fig. 3). a) b) c) FIG. 4. (a) Permutation-variant structural descriptors can be stored in a vector to be used as an atomic-scale repre-sentation. (b) Sorting this vector makes it permutationally invariant. (c) It is easy to see how the sorted vector relatesto the cumulative distribution function associated with the histogram of the values of the structural features.
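Returning to the power spectrum of Eq. (33): the contraction over $m$ can be sketched numerically. The snippet below is a deliberately minimal illustration, not any particular SOAP implementation; the array layout, dimensions, and the random toy coefficients are assumptions made for this example only.

```python
import numpy as np

def power_spectrum(c):
    """nu = 2 invariant of Eq. (33) from density-expansion coefficients.

    c: complex array of shape (n_sp, n_max, l_max + 1, 2*l_max + 1), with
       c[a, n, l, l_max + m] standing in for <alpha n l m|X_j>.
    Returns p[a, n, a', n', l] = (2l+1)^{-1/2} sum_m c*_{anlm} c_{a'n'lm}.
    """
    n_l = c.shape[2]
    # contract the m index between conjugated and plain coefficients
    p = np.einsum('anlm,bklm->anbkl', c.conj(), c)
    return p / np.sqrt(2 * np.arange(n_l) + 1)

# toy coefficients: 2 species, 3 radial channels, l_max = 2
rng = np.random.default_rng(0)
shape = (2, 3, 3, 5)
c = rng.normal(size=shape) + 1j * rng.normal(size=shape)
p = power_spectrum(c)

# Hermitian symmetry of Eq. (33): p[an, a'n', l] = p[a'n', an, l]*
assert p.shape == (2, 3, 2, 3, 3)
assert np.allclose(p, np.conj(p.transpose(2, 3, 0, 1, 4)))
```

The symmetry check reflects the fact that exchanging $(\alpha n)$ and $(\alpha' n')$ in Eq. (33) conjugates the feature, so only roughly half of the components are independent.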
Taking tensor products of $|\mathcal{X}_{\lambda\mu}^{(\nu)}\rangle_{\hat{R}}$ with itself increases the order of body correlations that are explicitly included in the feature vector, but destroys the required symmetry properties. Instead, one can take tensor products with $\lambda = 0$ kets, which are rotationally invariant in the subspace that describes the atomic environments while preserving the desired symmetry of the representation, e.g.

$|\mathcal{X}_{\lambda\mu}^{(\nu)}\rangle_{\hat{R}} \prod_{k=1}^{\zeta-1} \otimes\, |\mathcal{X}_{00}^{(\nu)}\rangle_{\hat{R}} \rightarrow |\mathcal{X}_{\lambda\mu}^{(\nu)}\rangle^{(\zeta)}_{\hat{R}} .$  (40)

This procedure has been found effective in practice for increasing the order of body correlations in tensorial SOAP.

E. Distributions vs sorted vectors
It is worth making some further considerations that extend somewhat the generality of this construction to include representations that are not based explicitly on atom densities. Many approaches in the literature rely on computing quantities that are not permutationally invariant per se, for instance the elements of the matrix of pair distances between atoms, transformed elementwise by some function, or the eigenvalues of such matrices. In order to make these representations invariant to atom permutations, one often proceeds to sort these sets of items, and uses the Euclidean distance between the sorted vectors as the building block of kernels or other statistical learning frameworks.

In fact, it is easy to see that, given a set of elements $\{a_i \in \mathbb{R}\}$, the sorted list contains the same amount of information as the histogram of the elements $h(a)$. Scaling the index of the sorted items by the total number of items $N$, and considering the limit in which $x = i/N$ can be treated as a continuous index, one sees that $x(\tilde{a})$ counts the fraction of entries that are smaller than $\tilde{a}$, that is

$x(\tilde{a}) = \int_{-\infty}^{\tilde{a}} \mathrm{d}a\, h(a) .$  (41)

It follows that $a(x)$, which is a continuous representation of the vector of sorted distances, is just the inverse cumulative distribution function (iCDF) associated with $h(a)$. The Euclidean distance between two vectors of sorted elements is proportional to the $L_2$ norm of the difference between the iCDFs of the histograms associated with the two sets. Interestingly, if one considers the $L_1$ norm, the distance between the sorted vectors corresponds to the earth mover's distance between the two distributions in one dimension. The connection between different density-based representations is more direct than that which can be established between density-based and sorted-vector descriptions, especially since the relation between atom positions and the permutation-variant items might be far from trivial, e.g. when the representation involves the eigenvalues of an overlap matrix. However, the argument we present here highlights the fact that incorporating physical symmetries in the description of atomistic systems leads to representations that contain essentially the same information.

IV. GENERALIZED INVARIANT DENSITY REPRESENTATIONS
The formalism we have introduced in the previous section provides an elegant framework to construct a rotationally-invariant representation of the atomic density that can be used for machine-learning purposes. While the formalism provides a complete description of structural correlations of a given order within an atomic environment, the quality and the computational cost of the regression scheme can be improved substantially in practice by transforming the representation so that it incorporates some degree of chemical intuition. For instance, the combination of multiple kernels corresponding to different interatomic distances has been shown to improve the quality of the ML model. Likewise, a scaling of the weights of different atomic distances within an environment has been shown to be beneficial when using ML to predict atomic-scale properties.

We will discuss how many of these modifications can be incorporated in this framework through the inclusion of a rotationally-invariant Hermitian operator $\hat{U} = \hat{U}_1 = \hat{U}_2 = \dots$ (as introduced earlier) that leads to a coupling of the geometric and elemental components of the translationally-invariant atom-centered ket $|\mathcal{X}_j\rangle$. For concreteness, and to provide a formulation that can be directly applied to an existing framework, we discuss $\hat{U}$ written in the orthonormal basis of radial functions and spherical harmonics $\{|\alpha n l m\rangle\}$ that corresponds to the SOAP power spectrum.

The requirement that $\hat{U}$ is rotationally invariant (and thus commutes with an arbitrary rotation operator) means that it must have the following form,

$\langle \alpha n l m | \hat{U} | \alpha' n' l' m' \rangle = \delta_{ll'}\, \delta_{mm'}\, \langle \alpha n l | \hat{U} | \alpha' n' l \rangle .$  (42)

Eq. (42) is the most general form compatible with $SO(3)$ symmetry, and can be seen as a way to introduce correlations between different radial and elemental components of the features, and to weight the contribution from different angular channels.
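The block structure of Eq. (42) is easy to exercise numerically: one matrix per angular channel $l$, acting identically on every $m$ component. The sketch below uses toy dimensions and random Hermitian blocks standing in for an actual, physically-motivated $\hat{U}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_an, l_max = 6, 2              # combined (alpha, n) index size, angular cutoff

# Eq. (42): U-hat is block diagonal in l and proportional to the identity in m
U = [rng.normal(size=(n_an, n_an)) for _ in range(l_max + 1)]
U = [0.5 * (u + u.T) for u in U]  # make each block Hermitian (real symmetric here)

# toy coefficients <alpha n l m|X_j>, with (alpha, n) flattened together
c = rng.normal(size=(n_an, l_max + 1, 2 * l_max + 1))

# apply the same l-block to every m column of that angular channel
c_t = np.stack([U[l] @ c[:, l, :] for l in range(l_max + 1)], axis=1)
assert c_t.shape == c.shape

# every m component of a given l is transformed by the same matrix, which is
# precisely what makes the operator commute with rotations
assert np.allclose(c_t[:, 1, 0], U[1] @ c[:, 1, 0])
```

Any $m$-dependence in the blocks would break the commutation with the rotation operator, and hence the invariance of the features built from the transformed coefficients.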
A. Low-rank expansion of the $\hat{U}$ operator

Since $\hat{U}$ is Hermitian, it can be diagonalized and expressed in the orthogonal basis of its eigenkets $\{|J\rangle\}$,

$\hat{U} = \sum_J |J\rangle\, U_J\, \langle J| .$  (43)

Taking $U_J \langle J| \rightarrow \langle J|$ allows us to express $\hat{U}$ as $\hat{U} = \sum_J |J\rangle \langle J|$. The transformed SO(3) vector components can be written in terms of the components of $|J\rangle$ in the chemical basis, $u^J_{\alpha n l} = \langle J | \alpha n l \rangle$. This yields

$\langle J J' | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} = \sum_{\alpha \alpha' n n' l} u^J_{\alpha n l}\, u^{J'}_{\alpha' n' l} \sum_m \langle \alpha n l m | \mathcal{X}_j \rangle^\star \langle \alpha' n' l m | \mathcal{X}_j \rangle .$  (44)

By choosing a low-rank expansion of $\hat{U}$ one can greatly reduce the dimensionality of the $SO(3)$ fingerprint vector, similarly to what was done in Ref. 37 by applying standard sparse decomposition techniques to the $SO(3)$ fingerprints.

A possible approach is to determine this low-rank approximation based on the correlations found between environments that are part of the data set. For a given $l$, consider the spherically-symmetric covariance matrix between the features of the expanded atomic density,

$C^{(l)}_{\alpha n,\alpha' n'} = \frac{1}{N} \sum_j \sum_m \langle \alpha n l m | \mathcal{X}_j \rangle^\star \langle \alpha' n' l m | \mathcal{X}_j \rangle = \frac{\sqrt{2l+1}}{N} \sum_j \langle \alpha n\, \alpha' n'\, l | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} .$  (45)

The eigenvectors of $C^{(l)}$, $v^{(l)}_J$, can then be used as the $u^J_{\alpha n l}$ in Eq. (44). It is easy to see that this transformation identifies components of the data that are linearly independent within the training set, and have a spread that is equal to the corresponding eigenvalues $\lambda^{(l)}_J$.
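The covariance-based compression of Eq. (45) can be sketched as follows; the array shapes and the random stand-in coefficients are assumptions for this example, not part of any reference implementation.

```python
import numpy as np

def compress_features(coeffs, n_J):
    """Data-driven low-rank contraction for one angular channel l, Eq. (45).

    coeffs: complex array of shape (N_env, n_sp * n_max, 2*l + 1), holding
            the <alpha n l m|X_j> coefficients with (alpha, n) flattened.
    Returns the n_J leading eigenvalues and eigenvectors of the covariance;
    the eigenvectors play the role of the u^J_{alpha n l} in Eq. (44).
    """
    N = coeffs.shape[0]
    # spherically-symmetric covariance: average over environments and m
    C = np.einsum('jam,jbm->ab', coeffs.conj(), coeffs) / N
    evals, evecs = np.linalg.eigh(C)          # Hermitian, ascending eigenvalues
    order = np.argsort(evals)[::-1]           # rank by decreasing spread
    return evals[order[:n_J]], evecs[:, order[:n_J]]

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 12, 5)) + 1j * rng.normal(size=(200, 12, 5))
lam, u = compress_features(x, n_J=4)
assert u.shape == (12, 4)
assert np.all(lam[:-1] >= lam[1:])            # eigenvalues sorted by magnitude
```

Retaining only the columns of `u` with large eigenvalues discards directions in feature space along which the training data show essentially no variance.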
The feature space can then be compressed by retaining only a certain number of components $n_J$, which can be determined from the magnitude of the associated eigenvalues.

B. Radially-scaled kernels
In a system with relatively uniform atom density, the overlap between environments $\langle \mathcal{X}_j | \mathcal{X}_k \rangle$ is dominated by the region farthest from the center. This could be regarded as rather unphysical: since interactions between atoms decay with distance, the closest atoms should give the most significant contribution to properties, which is reflected in the observation that multi-scale kernels tend to perform best when very low weights are assigned to the long-range kernels. This effect can be counteracted by multiplying the atomic probability amplitude Eq. (13) by a radial scaling $u(r)$,

$\langle \alpha \mathbf{r} | \hat{U} | \mathcal{X}_j \rangle = u(r)\, \psi^{\alpha}_{\mathcal{X}_j}(\mathbf{r}) .$  (46)

In the context of the SOAP power spectrum, this change can be represented in terms of a $\hat{U}$ operator that reads

$\langle n | \hat{U} | n' \rangle = \int \mathrm{d}r\, r^2\, R_n(r)\, R_{n'}(r)\, u(r) ,$  (47)

since an operator that scales states in the position representation must be diagonal in it,

$\langle r | \hat{U} | r' \rangle = \delta(r - r')\, u(r) ,$  (48)

and its matrix elements in the basis of radial basis functions are

$\langle n | \hat{U} | n' \rangle = \int \mathrm{d}r \int \mathrm{d}r'\, r\, r'\, R_n(r)\, R_{n'}(r')\, \delta(r - r')\, u(r) ,$  (49)

which reduces to Eq. (47).

Radial scaling in the form of Eq. (46) can be approximated, when using narrow atom-centered functions, with $\sum_i u(r_{ij})\, f_c(r_{ij})\, h(\mathbf{r} - \mathbf{r}_{ij})$, where for simplicity we also consider the case with a single species. Besides the fact that it is simpler to implement this form of scaling in an existing code, this approximation also makes apparent the connection between the general density-based framework we introduce here and the descriptors of Ref. 51. When $h$ is taken to be a Gaussian function, the weight on the central atom is set to zero, and one considers the two-body invariant representations, this ansatz is essentially equivalent to the two-body features in Ref. 51:

$\langle r | \hat{U} | \mathcal{X}_j^{(1)} \rangle_{\hat{R}} = \sum_{i \neq j} \frac{u(r_{ij})}{\sqrt{2\pi}\, r_{ij}\, \sigma} \left[ e^{-(r - r_{ij})^2/2\sigma^2} - e^{-(r + r_{ij})^2/2\sigma^2} \right] \approx \sum_{i \neq j} \frac{u(r_{ij})}{\sqrt{2\pi}\, r_{ij}\, \sigma}\, e^{-(r - r_{ij})^2/2\sigma^2} .$  (50)

C. Alchemical kernels
In the presence of multiple species, one could make the scaling element-dependent, or devise a more complex operator that couples channels of different species. As a first test of the generalization of SOAP in the presence of multiple elements, we consider an operator of the form

$\langle \alpha n l m | \hat{U} | \alpha' n' l' m' \rangle = \delta_{ll'}\, \delta_{mm'}\, \delta_{nn'}\, \langle \alpha | \hat{U} | \alpha' \rangle ,$  (51)

which ignores couplings between the structure of an environment and the elements within it. One can always write a low-rank expansion of the operator, $\hat{U} \approx \sum_{J\alpha} |J\rangle\, u_{J\alpha}\, \langle \alpha|$, which allows one to write

$\hat{U} \otimes \hat{U}\, |\mathcal{X}_j^{(2)}\rangle_{\hat{R}} = \sum_{J J' \alpha \alpha'} |J J'\rangle\, u_{J\alpha}\, u_{J'\alpha'}\, \langle \alpha \alpha' | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} .$  (52)

In the context of SOAP, one can define the projections of the power spectrum in this "alchemical basis",

$\langle J n\, J' n'\, l | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} = \sum_{\alpha \alpha'} u_{J\alpha}\, u_{J'\alpha'} \sum_m \langle \alpha n l m | \mathcal{X}_j \rangle^\star \langle \alpha' n' l m | \mathcal{X}_j \rangle ,$  (53)

which was shown in Ref. 66 to yield a substantial improvement in the learning efficiency in the presence of many chemical elements, and to result in a low-dimensional representation of elemental space that shares some similarities with the grouping found in the periodic table of the elements.

One can see the relationship between these "alchemical features" and previous attempts to incorporate cross-species correlations through the generalized SOAP environmental kernel,

$\int \mathrm{d}\hat{R}\, \left| \langle \mathcal{X}_j | \hat{U}^\dagger \hat{U} \hat{R} | \mathcal{X}_k \rangle \right|^2 = \sum_{J n J' n' l} \langle \mathcal{X}_j^{(2)} | J n\, J' n'\, l \rangle_{\hat{R}}\, \langle J n\, J' n'\, l | \mathcal{X}_k^{(2)} \rangle_{\hat{R}} .$  (54)

By writing out this inner product explicitly in terms of the full power spectrum elements $\langle \alpha n\, \alpha' n'\, l | \mathcal{X}_j^{(2)} \rangle_{\hat{R}}$, one can see that the matrix elements $\langle \alpha | \hat{U}^\dagger \hat{U} | \alpha' \rangle$ are nothing but the elements of the alchemical kernel $\kappa_{\alpha\alpha'}$ that was introduced in Ref. 42, where it was shown that taking $\kappa_{\alpha\alpha'} \neq \delta_{\alpha\alpha'}$ can improve property predictions with kernel ridge regression. Off-diagonal couplings between chemical elements have also been used in other representations, including those of Ref. 51. The expression in terms of reduced features, Eq. (53), is, however, more efficient to compute, and it clarifies how this approach enables the introduction of correlations between elements, as well as a reduction of the space dimensionality.
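As a minimal sketch of this idea (toy dimensions, and random numbers standing in for optimized weights $u_{J\alpha}$), the contraction of per-element density amplitudes into a few chemical channels can be written as:

```python
import numpy as np

n_sp, d_J, n_grid = 10, 2, 50
rng = np.random.default_rng(3)

# one density amplitude psi^alpha_{X_j}(r) per chemical element,
# tabulated here on a toy radial grid
psi = rng.normal(size=(n_sp, n_grid))

# element weights u_{J alpha}, as in the low-rank expansion of Eq. (53)
u = rng.normal(size=(d_J, n_sp))

# contract the n_sp species into d_J chemical "channels" before any
# further expansion or symmetrization of the density
psi_J = u @ psi
assert psi_J.shape == (d_J, n_grid)

# the nu = 2 invariants built on the channels scale as d_J^2 rather than
# n_sp^2: a compression by (n_sp / d_J)^2 = 25 for these toy numbers
assert (n_sp / d_J) ** 2 == 25.0
```

The point of the contraction is that everything downstream of the density (radial expansion, symmetrized averages) sees only $d_J$ channels, so the cost reduction propagates through the whole pipeline.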
The full SOAP feature vector contains a number of components that is proportional to the square of the number of species present, $n_{\mathrm{sp}}$, while limiting the expansion to a number $d_J \ll n_{\mathrm{sp}}$ of basis kets reduces the dimensionality of the feature vector by a factor $(n_{\mathrm{sp}}/d_J)^2$.

Note that one does not even need to compute all the elements in the $|\alpha n l m\rangle$ expansion of the density, since the alchemical projection can be brought down to the level of the atom density, which can be defined for $d_J$ chemical "channels" rather than for each element separately,

$\langle J \mathbf{r} | \mathcal{X}_j \rangle = \sum_\alpha u_{J\alpha}\, \psi^{\alpha}_{\mathcal{X}_j}(\mathbf{r}) .$  (55)

Density-based representations that assign a weight to each species have been explored as a means to reduce the complexity of ML representations in cases where many elements are present simultaneously, which corresponds essentially to the case with $d_J = 1$. For instance, the compositional descriptor of Ref. 67 is equivalent to Eq. (30) computed on a single invariant density,

$\langle \mathbf{r} | \mathcal{X}_j^{(1)} \rangle_{\hat{R}} = \sum_i u_{\alpha_i}\, \delta(\mathbf{r} - \mathbf{r}_{ij})\, f_c(r_{ij}) ,$  (56)

where the weights of different species are rather arbitrarily set to $u_\alpha = 0, \pm 1, \pm 2, \dots$. The more general formulation in Eqs. (53)-(55) provides a way to alter the dimensionality of the representation, and to optimize the projections to obtain the most efficient features for a given regression problem.

D. Non-factorizable operators
In order to relate Eq. (21) to other density-based representations that involve more complicated scaling functions of the internal coordinates, it is necessary to introduce a further linear transformation $\hat{U}^{(\nu)}$ which does not factorize into components that act independently on each term in the $\nu$-order tensor product. Such an operator must be chosen with care to ensure that it is rotationally invariant, otherwise the rotational invariance of the transformed ket will be lost. As far as the $\nu = 2$ rotationally-invariant kets are concerned, a generic operator is completely determined by its action on the basis vectors $\{|\alpha \mathbf{r}\, \alpha' \mathbf{r}'\rangle\}$. Rotationally-invariant operators must act on $|\alpha \hat{R}\mathbf{r}\, \alpha' \hat{R}\mathbf{r}'\rangle$ in the same way as on $|\alpha \mathbf{r}\, \alpha' \mathbf{r}'\rangle$, followed by the rotation $\hat{R}$. The upshot of this observation is

$\langle \alpha \mathbf{r}_1\, \alpha' \mathbf{r}'_1 | \hat{U}^{(2)} | \beta \mathbf{r}_2\, \beta' \mathbf{r}'_2 \rangle = \langle \alpha r_1\, \alpha' r'_1\, \omega_1 | \hat{U}^{(2)} | \beta r_2\, \beta' r'_2\, \omega_2 \rangle ,$  (57)

i.e. any non-internal coordinate must be cyclic. If a distance- and angle-based scaling is required, then the operator is diagonal,

$\langle \alpha \mathbf{r}_1\, \alpha' \mathbf{r}'_1 | \hat{U}^{(2)} | \beta \mathbf{r}_2\, \beta' \mathbf{r}'_2 \rangle = \delta_{\alpha\beta}\, \delta_{\alpha'\beta'}\, \delta(r_1 - r_2)\, \delta(r'_1 - r'_2)\, \delta(\omega_1 - \omega_2)\, u(\alpha, r_1, \alpha', r'_1, \omega_1) .$  (58)

For example, the scaling function in the three-body descriptor of Ref. 51 corresponds to the following choice for $u(r_1, r_2, \omega)$,

$u(r_1, r_2, \omega) = \frac{1 - \omega\, \omega_1\, \omega_2}{(r_1 r_2 r_3)^n} ,$  (59)

where $r_3^2 = r_1^2 + r_2^2 - 2 r_1 r_2 \omega$, $\omega_1 = (r_2^2 - r_1^2 - r_3^2)/2 r_1 r_3$, $\omega_2 = (r_1^2 - r_2^2 - r_3^2)/2 r_2 r_3$, and $n$ is an adjustable parameter. Faber et al. do not specify a scaling function for four-body and higher-body descriptors, but the analysis presented here clearly extends to any hypothetical scaling function that involves the internal coordinates of a collection of $\nu + 1$ positions.

Starting from the SOAP power spectrum, one can exploit the fact that each component is separately symmetry invariant. It is then possible to introduce an arbitrary linear operator coupling the $|\alpha n\, \alpha' n'\, l\rangle$ components, $\langle \alpha n\, \alpha' n'\, l | \hat{U} | \beta \bar{n}\, \beta' \bar{n}'\, \bar{l} \rangle$. Being a linear operation, this transformation amounts to a change of regularization for the ridge regression problem, and is most useful if applied to reduce the dimensionality of the feature vectors. This can be done e.g. by finding the principal components of the covariance matrix of the SOAP power spectrum or, as done in Ref. 37, by a sparse decomposition that singles out a subset of the components that suffices to obtain a thorough description of the relevant structures. This corresponds to the contracted representation

$\langle J | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} = \sum_k u_{Jk}\, \langle \alpha_k n_k\, \alpha'_k n'_k\, l_k | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} ,$  (60)

where $k$ runs over the set of selected components, which can be determined with different schemes, from CUR to farthest point sampling. The coefficients $u_{Jk}$ are the elements of a square matrix that ensures the contracted vectors in Eq. (60) generate a kernel that is as close as possible to the full kernel.

E. Optimization of the density representation
The optimization of the $\hat{U}$ operator in its more general form (see Eq. (42)) involves a large number of parameters, leading to a very concrete risk of overfitting. This is exacerbated by the fact that the feature vector is then used as the input for regression, so one has to balance the amount of data used to optimize the elements of $\hat{U}$ against that used for the training of the ridge regression model. The simplest approach to reduce the optimization of $\hat{U}$ to a small number of free parameters uses the compression method discussed in Section IV A to identify the most important combinations of $\langle \alpha n l m | \mathcal{X}_j \rangle$ components that are linearly independent for the data at hand. We would like to be able to optimize $\hat{U}$ based on the correlations found between environments that are part of the training set. The idea is that further optimization using target properties will be less likely to overfit after this dimensionality reduction.

Another possible use of the principal-component representation of $\langle \alpha n l m | \mathcal{X}_j \rangle$ is to obtain a simpler ansatz to further optimize the $\hat{U}$ operator. For instance, one could combine linearly the different components using the $\hat{U}$ operator defined through Eq. (45),

$\langle I J l m | \mathcal{X}_j \rangle = \sum_{I' \alpha J' n} f^{(l)}_{I I' J J'}\, u^{(l)}_{I' \alpha J' n}\, \langle \alpha n l m | \mathcal{X}_j \rangle ,$  (61)

where the scaling coefficients $f^{(l)}_{I I' J J'}$ are determined so as to make the representation better suited to build a regression model for the target property $y$. A systematic exploration of the different possibilities, as well as their benchmarking on different regression problems, is left for future work.

V. CONCLUSIONS
We have introduced a general formulation of the problem of representing atomic structures in terms of a (smooth) atom density, which is independent of the basis that is used to expand it. Starting from a representation of a 3D structure in terms of a superposition of atom-centered functions decorated with elemental kets, we introduce symmetries by formally averaging the feature vectors over the continuous translation and rotation groups. The averaging removes information, but a complete, unique description can be retained by taking tensor products of the ket before computing the integral. Different representations, capturing varying amounts of inter-atomic correlations, can be obtained depending on the combination of tensor products and symmetrized averages.

The framework we introduced provides a unified picture of density-based representations for machine learning of atomic-scale properties, with several popular frameworks emerging by taking different limits, or by using specific basis sets to represent the abstract invariant kets. In particular, using a basis of radial functions and spherical harmonics shows clearly the 1:1 mapping between the symmetrized kets and different flavors of the SOAP kernel. Even alternative schemes that start from rotation- and translation-invariant internal coordinates and proceed to ensure permutation invariance appear to contain comparable information. Physically-motivated extensions of existing frameworks, relying also on real-space formulations that are more directly related to body-order expansions and to the atom correlation functions that appear in classical density functional theory, might provide some leeway to improve the performance of a representation, as shown for instance in Ref. 66. We discussed how several modifications and optimizations can be introduced in terms of operators that couple and scale different channels of the representation, focusing in particular on the SOAP power spectrum representation.
Such modifications can be advantageous as they allow for a reduction in the dimensionality of the problem, and make it possible to incorporate different kinds of chemical insight, effectively generating several "views" of a material that emphasize different kinds of structural and chemical correlations. Seeing many different approaches as alternative implementations of the same family of representations might help coordinate efforts in optimizing the computational efficiency and regression accuracy of density-based models of atomic structures.
ACKNOWLEDGEMENTS
The authors would like to thank Gábor Csányi, Bingqing Cheng and Andrea Grisafi for discussion and comments during the preparation of this work. MC and MJW were supported by the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 677013-HBMAP). FM was supported by the NCCR MARVEL, funded by the Swiss National Science Foundation.

[1] R. Ramakrishnan and O. A. von Lilienfeld, , 182 (2015).
[2] F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Phys. Rev. Lett., 135502 (2016).
[3] A. P. Bartók et al., Sci. Adv., e1701816 (2017).
[4] G. Montavon et al., New J. Phys., 095003 (2013).
[5] J. B. O. Mitchell, Wiley Interdisciplinary Reviews: Computational Molecular Science, 468 (2014).
[6] J. Carrete, W. Li, N. Mingo, S. Wang, and S. Curtarolo, Physical Review X, 011019 (2014).
[7] L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, npj Computational Materials, 16028 (2016).
[8] J. Behler and M. Parrinello, Phys. Rev. Lett., 146401 (2007).
[9] W. J. Szlachta, A. P. Bartók, and G. Csányi, Phys. Rev. B, 104108 (2014).
[10] A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, Phys. Rev. Lett., 136403 (2010).
[11] T. Morawietz, V. Sharma, and J. Behler, Journal of Chemical Physics, 064103 (2012).
[12] R. Kobayashi, D. Giofré, T. Junge, M. Ceriotti, and W. A. Curtin, Phys. Rev. Mater., 053604 (2017).
[13] J. S. Smith, O. Isayev, and A. E. Roitberg, Chem. Sci., 3192 (2017).
[14] A. Khorshidi and A. A. Peterson, Computer Physics Communications, 310 (2016).
[15] M. Gastegger, J. Behler, and P. Marquetand, Chem. Sci., 6924 (2017).
[16] A. Glielmo, C. Zeni, and A. De Vita, Physical Review B (2018).
[17] S. Chmiela et al., Science Advances, e1603015 (2017).
[18] K. Yao, J. E. Herr, and J. Parkhill, Journal of Chemical Physics, 014106 (2017).
[19] A. P. Bartók, R. Kondor, and G. Csányi, Phys. Rev. B, 184115 (2013).
[20] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Physical Review Letters, 058301 (2011).
[21] J. Barker, J. Bulin, J. Hamaekers, and S. Mathias, LC-GAP: Localized Coulomb descriptors for the Gaussian approximation potential, in Scientific Computing and Algorithms in Industrial Simulations: Projects and Products of Fraunhofer SCAI, pages 25-42, Springer, Cham, 2017.
[22] A. Sadeghi et al., Journal of Chemical Physics, 184118 (2013).
[23] L. Zhang, J. Han, H. Wang, R. Car, and E. Weinan, Physical Review Letters, 143001 (2018).
[24] B. J. Braams and J. M. Bowman, International Reviews in Physical Chemistry, 577 (2009).
[25] Z. Xie and J. M. Bowman, Journal of Chemical Theory and Computation, 26 (2010).
[26] B. Jiang and H. Guo, The Journal of Chemical Physics (2013).
[27] K. T. Schütt et al., Phys. Rev. B, 205118 (2014).
[28] M. Hirn, S. Mallat, and N. Poilvert, Multiscale Modeling & Simulation, 827 (2017).
[29] M. Eickenberg, G. Exarchakis, M. Hirn, and S. Mallat, Advances in Neural Information Processing Systems 30, 6522 (2017).
[30] A. V. Shapeev, Multiscale Modeling & Simulation, 1153 (2015).
[31] A. P. Bartók, M. J. Gillan, F. R. Manby, and G. Csányi, Phys. Rev. B, 054104 (2013).
[32] V. L. Deringer and G. Csányi, Phys. Rev. B, 094203 (2017).
[33] R. Jinnouchi and R. Asahi, Journal of Physical Chemistry Letters, 4279 (2017).
[34] W. J. Szlachta, A. P. Bartók, and G. Csányi, Physical Review B - Condensed Matter and Materials Physics, 104108 (2014).
[35] S. Kajita, N. Ohba, R. Jinnouchi, and R. Asahi, Scientific Reports (2017).
[36] A. Grisafi, D. M. Wilkins, G. Csányi, and M. Ceriotti, Phys. Rev. Lett., 036002 (2018).
[37] G. Imbalzano et al., J. Chem. Phys., 241730 (2018).
[38] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
[39] A. Glielmo, P. Sollich, and A. De Vita, Phys. Rev. B, 214302 (2017).
[40] L. Nachbin, The Haar Integral, R. E. Krieger Pub. Co., 1976.
[41] For a generic basis function g, g*g might be a complicated function, but when g is a Gaussian, g*g is simply a Gaussian with double the variance.
[42] S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, Phys. Chem. Chem. Phys., 13754 (2016).
[43] E. Prodan and W. Kohn, Proc. Natl. Acad. Sci. USA, 11635 (2005).
[44] W. Yang, Phys. Rev. Lett., 1438 (1991).
[45] G. Galli and M. Parrinello, Phys. Rev. Lett., 3547 (1992).
[46] S. Goedecker, Rev. Mod. Phys., 1085 (1999).
[47] M. Ceriotti, T. D. Kühne, and M. Parrinello, J. Chem. Phys., 24707 (2008).
[48] This is a consequence of the fact that SO(3) is the product of SO(2) and S².
[49] A. Ziletti, D. Kumar, M. Scheffler, and L. M. Ghiringhelli, Nature Comm., 2775 (2018).
[50] O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, and A. Knoll, International Journal of Quantum Chemistry, 1084 (2015).
[51] F. A. Faber, A. S. Christensen, B. Huang, and O. A. von Lilienfeld, Journal of Chemical Physics, 241717 (2018).
[52] A. P. Bartók, R. Kondor, and G. Csányi, Phys. Rev. B, 184115 (2013).
[53] S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, Phys. Chem. Chem. Phys., 13754 (2016).
[54] A. Thompson, L. Swiler, C. Trott, S. Foiles, and G. Tucker, J. Comput. Phys., 316 (2015).
[55] A. Grisafi, D. M. Wilkins, G. Csányi, and M. Ceriotti, Phys. Rev. Lett., 036002 (2018).
[56] A. Grisafi et al., ACS Central Science, 57 (2019).
[57] D. M. Wilkins et al., arXiv:1809.05337 (2018).
[58] F. Pietrucci and R. Martoňák, J. Chem. Phys., 104704 (2015).
[59] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett., 058301 (2012).
[60] A. Sadeghi et al., J. Chem. Phys., 184118 (2013).
[61] V. M. Panaretos and Y. Zemel, Annual Review of Statistics and Its Application (2019).
[62] A. P. Bartók et al., Science Advances, e1701816 (2017).
[63] B. Huang and O. A. von Lilienfeld, J. Chem. Phys., 161102 (2016).
[64] Note that, apart from an l-dependent scaling, the covariance matrix is just the average of the SOAP power spectrum over the training set.
[65] F. M. Paruzzo et al., Nature Comm., 4501 (2018).
[66] M. J. Willatt, F. Musil, and M. Ceriotti, Physical Chemistry Chemical Physics, 29661 (2018).
[67] N. Artrith, A. Urban, and G. Ceder, Phys. Rev. B, 014112 (2017).
[68] M. Gastegger, L. Schwiedrzik, M. Bittermann, F. Berzsenyi, and P. Marquetand, J. Chem. Phys., 241709 (2018).
[69] H. Huo and M. Rupp, arXiv:1704.06439 (2017).
[70] The sparsification could also be represented explicitly by a Û operator that zeroes out all of the unnecessary components.
[71] M. W. Mahoney and P. Drineas, Proceedings of the National Academy of Sciences, 697 (2009).
[72] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis II, SIAM Journal on Computing, 563 (1977).
[73] M. Ceriotti, G. A. Tribello, and M. Parrinello, J. Chem. Theory Comput.