Atom-Density Representations for Machine Learning
Michael J. Willatt, Félix Musil, and Michele Ceriotti∗
Laboratory of Computational Science and Modeling, Institute of Materials, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
National Center for Computational Design and Discovery of Novel Materials (MARVEL)
The applications of machine learning techniques to chemistry and materials science become more numerous by the day. The main challenge is to devise representations of atomic systems that are at the same time complete and concise, so as to reduce the number of reference calculations that are needed to predict the properties of different types of materials reliably. This has led to a proliferation of alternative ways to convert an atomic structure into an input for a machine-learning model. We introduce an abstract definition of chemical environments that is based on a smoothed atomic density, using a bra-ket notation to emphasize basis set independence and to highlight the connections with some popular choices of representations for describing atomic systems. The correlations between the spatial distribution of atoms and their chemical identities are computed as inner products between these feature kets, which can be given an explicit representation in terms of the expansion of the atom density on orthogonal basis functions, that is equivalent to the smooth overlap of atomic positions (SOAP) power spectrum, but also in real space, corresponding to $n$-body correlations of the atom density. This formalism lays the foundations for a more systematic tuning of the behavior of the representations, by introducing operators that represent the correlations between structure, composition, and the target properties. It provides a unifying picture of recent developments in the field and indicates a way forward towards more effective and computationally affordable machine-learning schemes for molecules and materials.

I. INTRODUCTION
Machine learning (ML) is used routinely nowadays as an expedient to circumvent costly electronic structure calculations. Given a set of atomic structures and a means to represent them concisely as a vector of numbers (often referred to as features, descriptors or fingerprints), machine learning models can make property predictions by interpolating over the known properties of a subset used for training. Provided the machine learning model yields an acceptable level of accuracy, this allows for exploratory investigations of vast numbers of molecules in chemical compound space, which is essential for e.g. high-throughput screening, drug discovery and molecular classification.
It also enables the development of atomic potentials that reach the accuracy of ab initio alternatives at a smaller computational expense.
For a ground-state electronic structure calculation, an atomic structure is completely characterized by the number of electrons, the positions of the nuclei and their identities (nuclear charges), and in principle a machine learning model could use the same information as a representation of its input. However, this information is not consistent with obvious physical symmetries. Any translation or rotation of a structure, or permutation of identical atoms within, will not affect scalar properties like energies. A representation that reflects this invariance is therefore desirable for an efficient comparison to be made between structures.

Different strategies have been proposed to incorporate these symmetries. Approaches based on internal coordinates (e.g. Coulomb matrices, eigenvalues of overlap matrices or deep potential molecular dynamics) are automatically invariant to rotations and translations but require an additional symmetrization over the permutation group. For low-dimensional problems this symmetrization can be performed exactly. For larger systems, one can proceed by sorting the vector of interatomic distances or eigenvalues of a matrix that depends on interatomic distances. However, both procedures introduce derivative discontinuities. Instead, many approaches to represent atomistic configurations, e.g. radial distribution functions, wavelets, invariant polynomials, symmetry functions following Behler and Parrinello, or a many-body tensor representation, rely more or less explicitly on atomic distributions.

In this article, we start from an abstract representation of structures and atomic environments to discuss atom-density-based approaches to chemical machine learning. We emphasize the basis-set independence of this representation by using the Dirac bra-ket notation.

∗ michele.ceriotti@epfl.ch
This framework provides a unifying picture of the field, in that several popular techniques can be seen as alternative representations of the same abstract feature vectors. In particular, we show that by representing these kets based on an expansion of atom-centered Gaussians in radial basis functions and spherical harmonics one recovers the SOAP power spectrum, which has been used very successfully to compare structures and predict scalar properties, has been generalized to model tensorial properties, and has been sparsified to reduce the computational cost. This formalism also provides a way to fine-tune or reduce the dimensionality of the feature vector by introducing a general coupling between different components in the definition of the kernel, recovering the flexibility of using different kinds of density-based representations within an elegant, unified framework.

II. A DIRAC NOTATION FOR ATOMIC CONFIGURATIONS AND ENVIRONMENTS
The Dirac (bra-ket) notation is often used to streamline the formulation of quantum-mechanical expressions, since it stresses basis-independence of quantum states. In a kernel ridge (Gaussian process) regression framework, kernel functions between atomic configurations $K(\mathcal{A},\mathcal{B})$ are used as the basis to build machine-learning models of atomic-scale properties,

$$ y(\mathcal{A}) = \sum_M x_M K(\mathcal{A}, \mathcal{X}_M). \quad (1) $$

Here $y$ is a property one wants to predict, the $\mathcal{X}_M$ are a set of reference atomic structures, and the $x_M$ are weights that can be determined by minimization of a loss function that measures the discrepancy between the predictions of the model and a set of reference calculations.

Well-behaved (positive-semidefinite) kernels correspond to inner products between (possibly infinite-dimensional) feature vectors, and are also independent of the choice of basis that is used to represent them. This suggests that it might be beneficial to adopt the bra-ket notation to formulate input representations $|\mathcal{A}\rangle$ and the associated kernels $K(\mathcal{A},\mathcal{B}) = \langle\mathcal{A}|\mathcal{B}\rangle$. When using feature vectors explicitly in a computer simulation, it is necessary to settle on some (finite) basis. In combination with the advantage that some bases provide over others for proving certain results, this is perhaps why the Dirac notation has not been used in this context up to now.

A. Density-based structural representations
As a starting point to represent atomic structures, every atomic configuration $\mathcal{A}$ could be associated with a ket $|\mathcal{A}\rangle$ that describes the elemental composition and geometric arrangement of atoms. To make this idea concrete, we might center a smooth, real, positive function $g(\mathbf{r})$ (e.g. a normalized Gaussian) on each atom and decorate them with orthonormal kets $|\alpha\rangle$ to represent their elemental identities. Smoothness in the representation is beneficial as it leads to smooth kernels and better-behaved regression. Such an atomic configuration is written in position space as

$$ \langle \mathbf{r}|\mathcal{A}\rangle = \sum_i g(\mathbf{r}-\mathbf{r}_i)\,|\alpha_i\rangle, \quad (2) $$

where the sum is taken over all atoms in the configuration. This expansion could be generalized by using e.g. element-dependent widths in $g(\mathbf{r})$, i.e. $g(\mathbf{r}) \to g(s(\alpha)\mathbf{r})$. Regardless of the particular form of $g(\mathbf{r})$ (provided that it is sufficiently localized), $|\mathcal{A}\rangle$ provides a unique representation of the structure, but is variant with respect to fundamental physical symmetries, such as rigid translations $\hat{t}$ and rotations $\hat{R}$ of the constituent atoms, $\{\mathbf{r}_i\} \to \{\hat{R}\hat{t}\,\mathbf{r}_i\}$. As has been stressed many times previously, to achieve efficient learning one should use representations that possess the same symmetries as the property one wants to learn.

B. Symmetry-invariant representations
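Before discussing symmetrization, it may help to see the decorated density of Eq. (2) in concrete form. The sketch below (function and variable names are illustrative, not from the paper) evaluates one density channel per element, each a sum of normalized atom-centered Gaussians:

```python
import numpy as np

def density_channels(r, positions, elements, sigma=0.5):
    """Evaluate <r|A> of Eq. (2): one channel per element alpha,
    each a sum of normalized Gaussians g(r - r_i) over atoms of that element."""
    norm = (2.0 * np.pi * sigma**2) ** -1.5
    out = {}
    for pos, el in zip(positions, elements):
        d2 = np.sum((np.asarray(r, dtype=float) - pos) ** 2)
        out[el] = out.get(el, 0.0) + norm * np.exp(-d2 / (2.0 * sigma**2))
    return out

# Toy water-like configuration
pos = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
els = ["O", "H", "H"]
rho = density_channels([0.0, 0.0, 0.0], pos, els)
```

Evaluating the density at a rigidly shifted point on a rigidly shifted structure returns the same values, which illustrates why $|\mathcal{A}\rangle$ itself is not translationally invariant: the ket changes, even though the physics does not.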
To address the variance of Eq. (2) with respect to a symmetry operation $\hat{S}$, one can proceed by formally averaging the ket over the corresponding symmetry group (a procedure often referred to as Haar integration):

$$ \left|\mathcal{A}^{(1)}\right\rangle_{\hat{S}} = \int \mathrm{d}\hat{S}\, \hat{S}\,|\mathcal{A}\rangle. \quad (3) $$

To see how this translates into symmetry-invariant representations, let us start by considering the relatively simple case of the integration over the translation group, which simply corresponds to an integration over $\mathbb{R}^3$. Averaging directly over the position representation of $|\mathcal{A}\rangle$ leads to a rather uninformative representation, which eliminates all structural information and only counts the number $N_\alpha$ of atoms belonging to each species,

$$ \left\langle \mathbf{r}\middle|\mathcal{A}^{(1)}\right\rangle_{\hat{t}} = \int \mathrm{d}\hat{t}\, \langle\mathbf{r}|\hat{t}|\mathcal{A}\rangle = \sum_i |\alpha_i\rangle \int \mathrm{d}\mathbf{t}\, g(\mathbf{t}+\mathbf{r}-\mathbf{r}_i) = \sum_\alpha N_\alpha |\alpha\rangle, \quad (4) $$

where we have used the position representation of the translation operator. To avoid this information loss, one can perform the Haar integration over tensor products of $|\mathcal{A}\rangle$, and define

$$ \left|\mathcal{A}^{(\nu)}\right\rangle_{\hat{t}} = \int \mathrm{d}\hat{t}\, \underbrace{\hat{t}|\mathcal{A}\rangle \otimes \hat{t}|\mathcal{A}\rangle \otimes \cdots \otimes \hat{t}|\mathcal{A}\rangle}_{\nu}. \quad (5) $$

For $\nu=2$, and assuming for simplicity that the same smooth density function is used for each atom, one gets

$$ \left\langle \mathbf{r}\mathbf{r}'\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \int \mathrm{d}\hat{t} \sum_{ij} g(\hat{t}\mathbf{r}-\mathbf{r}_i)\, g(\hat{t}\mathbf{r}'-\mathbf{r}_j)\,|\alpha_i\alpha_j\rangle = \sum_{ij} |\alpha_i\alpha_j\rangle \int \mathrm{d}\mathbf{t}\, g(\mathbf{t}+\mathbf{r}-\mathbf{r}_i)\, g(\mathbf{t}+\mathbf{r}'-\mathbf{r}_j) = \sum_{ij} |\alpha_i\alpha_j\rangle\, (g \ast g)(\mathbf{r}'-\mathbf{r}-\mathbf{r}_{ij}), \quad (6) $$

where $\mathbf{r}_{ij} = \mathbf{r}_i - \mathbf{r}_j$, and $\ast$ is the standard convolution operator. We can simplify the notation for $\left|\mathcal{A}^{(2)}\right\rangle_{\hat{t}}$ in the position representation by (1) noting that the convolution in Eq.
(6) only depends on $\mathbf{r}-\mathbf{r}'$, so we can write the ket as a function of $\Delta\mathbf{r} = \mathbf{r}-\mathbf{r}'$ alone; (2) redefining the convolution of two atom-density functions as $h = g \ast g$:

$$ \left\langle \Delta\mathbf{r}\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \sum_j |\alpha_j\rangle \sum_i h(\Delta\mathbf{r}-\mathbf{r}_{ij})\,|\alpha_i\rangle. \quad (7) $$

C. Tensor-product representations
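A central claim developed in this section is that taking tensor products of a feature ket is equivalent to raising the associated linear kernel to an elementwise integer power. This is a purely algebraic fact that can be checked numerically on arbitrary finite feature vectors (names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)  # stand-in feature vector for structure A
b = rng.normal(size=5)  # stand-in feature vector for structure B

k_lin = a @ b                         # linear kernel <A|B>
k_sq = np.kron(a, a) @ np.kron(b, b)  # inner product of tensor-product features

# (a (x) a) . (b (x) b) = (a . b)^2
assert np.isclose(k_sq, k_lin**2)
```

The same identity extends to any power $\zeta$: $\zeta$-fold Kronecker products of the features correspond to the $\zeta$-th elementwise power of the kernel.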
Before proceeding further, let us comment briefly on the implications of the form of this ket for machine-learning of the properties associated with the structure $|\mathcal{A}\rangle$ using kernel ridge regression, taking for simplicity a single-species compound so we can ignore the elemental kets. Learning from a linear kernel $K(\mathcal{A},\mathcal{B}) = \left\langle\mathcal{A}^{(2)}\middle|\mathcal{B}^{(2)}\right\rangle_{\hat{t}}$ is equivalent to the optimization of a linear mapping between the ket and the property, i.e.

$$ y(\mathcal{A}) = \int \mathrm{d}\mathbf{r}\, y(\mathbf{r}) \left\langle \mathbf{r}\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}}. \quad (8) $$

Taking the Dirac $\delta$ distribution limit of $g(\mathbf{r})$, one sees this is an (orientation-dependent) pair potential,

$$ y(\mathcal{A}) = \sum_{ij} y(\mathbf{r}_{ij}), \quad (9) $$

and it is therefore easy to conceive of properties that cannot be represented in this form. The feature vector itself, however, contains complete information about the structure, which can be recovered by taking tensor products of $|\mathcal{A}\rangle$, or (equivalently) raising the associated kernel to an integer power. For instance, if one takes the outer product of the translationally-symmetrized ket (whose corresponding kernel is just the elementwise square of the linear kernel), learning amounts to the optimization of a function that depends on two displacement vectors simultaneously,

$$ y(\mathcal{A}) = \int \mathrm{d}\mathbf{r}\,\mathrm{d}\mathbf{r}'\, y(\mathbf{r},\mathbf{r}') \left\langle \mathbf{r}'\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} \left\langle \mathbf{r}\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \sum_{iji'j'} \int \mathrm{d}\mathbf{r}\,\mathrm{d}\mathbf{r}'\, y(\mathbf{r},\mathbf{r}')\, h(\mathbf{r}-\mathbf{r}_{ij})\, h(\mathbf{r}'-\mathbf{r}_{i'j'}) \xrightarrow{h\to\delta} \sum_{iji'j'} y(\mathbf{r}_{ij},\mathbf{r}_{i'j'}), \quad (10) $$

and so on. This simple example highlights how high-order correlations between atomic positions can be incorporated in the model by taking the tensor product of the structural ket before taking the Haar integral (that is, choosing a high value of $\nu$ in Eq.
(5)) or by taking a tensor product of the invariant ket,

$$ \underbrace{\left|\mathcal{A}^{(\nu)}\right\rangle_{\hat{t}} \otimes \left|\mathcal{A}^{(\nu)}\right\rangle_{\hat{t}} \otimes \cdots \otimes \left|\mathcal{A}^{(\nu)}\right\rangle_{\hat{t}}}_{\zeta} \to \left|\mathcal{A}^{(\nu)}\right\rangle^{(\zeta)}_{\hat{t}}. \quad (11) $$

The latter choice corresponds to taking elementwise powers of the linear invariant kernel. In other terms, using a unique representation of a structure in a non-linear ML model can introduce higher body-order correlations than those explicitly afforded by the feature vector itself.

D. Atom-centered representations
Having clarified how tensor-product kets can be used to incorporate higher-order correlations between the atoms, let us move on to discuss how an atom-centered description arises naturally as a by-product of symmetrization over the translation group. By grouping together the terms in the sum corresponding to displacement vectors involving atom $j$, the translationally-invariant second-order ket Eq. (7) decomposes into atom-centered contributions,

$$ \left|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \sum_j |\alpha_j\rangle\,|\mathcal{X}_j\rangle, \quad (12) $$

where we have dropped the indication of the translational averaging from $|\mathcal{X}_j\rangle$ to keep at bay the complexity of the notation. Note that Eq. (12) implies an additive definition of the relation between the representations of the entire structures and those associated with atom-centered environments. This definition translates into a "global" kernel between structures that is a sum (or average) of kernels between environments. While other choices are possible, this formal derivation shows how naturally an additive kernel arises from a symmetrization of a density-based "global" feature vector.

The position representation of the environmental atom-centered ket $|\mathcal{X}_j\rangle$ is

$$ \langle\mathbf{r}|\mathcal{X}_j\rangle = \sum_i f_c(r_{ij})\, h(\mathbf{r}-\mathbf{r}_{ij})\,|\alpha_i\rangle. \quad (13) $$

In this definition we have introduced a smooth cutoff function $f_c(r_{ij})$ so that each environment only depends on the position of the atoms in a spherical neighborhood centered on atom $j$. While one could in principle proceed with an atom-centered description that incorporates information from the entire structure, by making $f_c(r)=1$, localisation is useful for computational reasons and is justified when studying atomic problems in light of the nearsightedness principle of electronic matter, which underlies most linear-scaling electronic structure methods.
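The text leaves the form of $f_c$ open. A common concrete choice in the literature (an assumption here, in the spirit of a cosine cutoff; the paper does not prescribe this form) is one that equals 1 at the center and decays smoothly to zero, with zero slope, at the cutoff radius:

```python
import numpy as np

def f_cut(r, r_cut=5.0):
    """Smooth cutoff: 1 at r = 0, decays to 0 with zero slope at r = r_cut,
    and is identically 0 beyond it, so the environment has compact support."""
    r = np.asarray(r, dtype=float)
    inside = 0.5 * (1.0 + np.cos(np.pi * r / r_cut))
    return np.where(r < r_cut, inside, 0.0)
```

The smoothness of $f_c$ at $r_{\mathrm{cut}}$ matters: a hard cutoff would reintroduce the derivative discontinuities that the smooth-density construction is designed to avoid.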
Note that when the ket is written in this form it might make sense to further generalize the definition of $h$, e.g. by making its width dependent on $r_{ij} = |\mathbf{r}_{ij}|$, $h(\mathbf{r}) \to h(s(r_{ij})\mathbf{r})$, or by choosing a form other than a Gaussian that is more flexible or computationally efficient. The notation can be further simplified by emphasizing the representations of structure and composition,

$$ \psi^{\alpha}_{\mathcal{X}_j}(\mathbf{r}) \equiv \langle\alpha\,\mathbf{r}|\mathcal{X}_j\rangle = \sum_{i\in\alpha} f_c(r_{ij})\, h(\mathbf{r}-\mathbf{r}_{ij}), \quad (14) $$

writing the ket as a sum over all elements $\alpha = \mathrm{H}, \mathrm{He}, \ldots$

$$ \langle\mathbf{r}|\mathcal{X}_j\rangle = \sum_\alpha \psi^{\alpha}_{\mathcal{X}_j}(\mathbf{r})\,|\alpha\rangle. \quad (15) $$

This translationally-invariant atom-centered environment representation can also be adapted by taking a linear transformation $\hat{U}|\mathcal{X}_j\rangle \to |\mathcal{X}_j\rangle$, where the linear operator $\hat{U}$ might act in the position space, the element space or both. As we will see, the freedom in choosing the form of $\hat{U}$ can be used to tune the behavior of the representation to describe in a more efficient way the relation between structure and properties.

E. Rotationally-invariant representations
In order to obtain a rotationally-invariant representation, one can formally average the ket $|\mathcal{X}\rangle$ over the $SO(3)$ rotation group,

$$ \left|\mathcal{X}^{(1)}\right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R}\, \hat{R}\,|\mathcal{X}\rangle. \quad (16) $$

This ket can be readily computed in the position representation. Taking for simplicity the case where only one element is present, one gets

$$ \left\langle\mathbf{r}\middle|\mathcal{X}^{(1)}\right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R}\, \psi(\hat{R}\mathbf{r}) = \int \mathrm{d}\hat{R}\, \psi(r\hat{R}\hat{\mathbf{e}}_z), \quad (17) $$

where we have used the fact that the integral is over all the rotation matrices, and so we can always rotate $\mathbf{r}$ to be aligned with the Cartesian $z$ axis $\hat{\mathbf{e}}_z$ before taking the integral. The average can be written explicitly in terms of a suitable parameterization of the rotations, e.g. using Euler angles,

$$ \frac{1}{8\pi^2} \int_0^{2\pi} \mathrm{d}\alpha \int_0^{\pi} \sin\beta\, \mathrm{d}\beta \int_0^{2\pi} \mathrm{d}\gamma\; \psi\big(r\hat{R}(\alpha,\beta,\gamma)\hat{\mathbf{e}}_z\big). \quad (18) $$

One can then recognize that the $\gamma$ angle does not affect $\hat{\mathbf{e}}_z$, so the integral can be written equivalently as an average over the unit sphere. We can then define

$$ \left\langle r\middle|\mathcal{X}^{(1)}\right\rangle_{\hat{R}} \propto r \int \mathrm{d}\hat{R}\, \psi(r\hat{R}\hat{\mathbf{e}}_z) = \frac{1}{4\pi}\, r \int \mathrm{d}\hat{\mathbf{r}}\, \psi(r\hat{\mathbf{r}}), \quad (19) $$

where we have highlighted the fact that the position representation only depends on $r$, and we have explicitly included a factor of $r$ so that the kernel can be computed as a plain radial integral,

$$ \left\langle\mathcal{X}^{(1)}_k\middle|\mathcal{X}^{(1)}_j\right\rangle_{\hat{R}} = \int \mathrm{d}r\, \left\langle\mathcal{X}^{(1)}_k\middle|r\right\rangle_{\hat{R}} \left\langle r\middle|\mathcal{X}^{(1)}_j\right\rangle_{\hat{R}}. \quad (20) $$

Much like in the case of translations, the average over rotations eliminates too much information, and $\left|\mathcal{X}^{(1)}\right\rangle_{\hat{R}}$ does not retain knowledge of the angular correlations of atoms around the center of the environment.
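The spherical average of Eq. (19) can be checked numerically: for a single off-center Gaussian, the average of $\psi(r\hat{\mathbf{r}})$ over the unit sphere depends only on $r$ and on the distance of the atom from the center, not on its direction. Below is a sketch using a Gauss-Legendre grid in $\cos\theta$ and a uniform grid in $\phi$ (all function names are illustrative):

```python
import numpy as np

def spherical_average(psi, r, n_theta=64, n_phi=128):
    """(1/4pi) * integral over the unit sphere of psi(r * rhat)."""
    x, w = np.polynomial.legendre.leggauss(n_theta)  # nodes in cos(theta)
    phi = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    total = 0.0
    for ct, wt in zip(x, w):
        st = np.sqrt(1.0 - ct**2)
        for p in phi:
            rhat = np.array([st * np.cos(p), st * np.sin(p), ct])
            total += wt * psi(r * rhat)
    return total * (2.0 * np.pi / n_phi) / (4.0 * np.pi)

def gaussian(center, sigma=0.5):
    """Atom-density contribution centered at `center` (unnormalized)."""
    return lambda pt: np.exp(-np.sum((pt - center) ** 2) / (2.0 * sigma**2))

# Two atoms at the same distance from the center, along different directions:
a = spherical_average(gaussian(np.array([1.0, 0.0, 0.0])), r=0.8)
b = spherical_average(gaussian(np.array([0.0, 1.0 / np.sqrt(2), 1.0 / np.sqrt(2)])), r=0.8)
```

The two averages agree to quadrature accuracy, which is exactly the information loss the text describes: the $\nu=1$ ket cannot distinguish where on the sphere a neighbor sits.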
A more general family of invariant kets can be obtained by starting from the tensor products of (possibly different) environmental kets, $\hat{U}_1\left|\mathcal{X}^1\right\rangle \otimes \hat{U}_2\left|\mathcal{X}^2\right\rangle \otimes \cdots$, and then symmetrizing over the rotation group,

$$ \left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R} \prod_{\aleph=1}^{\nu} \otimes\, \hat{R}\hat{U}_\aleph \left|\mathcal{X}^\aleph_j\right\rangle. \quad (21) $$

As for the case of translational averages, one can use a linear map (or equivalently a linear kernel) to build a machine-learning model of a property based on these symmetrized kets. Non-linear kernels correspond to tensor products of symmetrized kets such as

$$ \underbrace{\left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}} \otimes \left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}} \otimes \cdots \otimes \left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}}}_{\zeta} \to \left|\mathcal{X}^{(\nu)}\right\rangle^{(\zeta)}_{\hat{R}}, \quad (22) $$

and one could further generalize the construction by taking products of kets built from different $\hat{U}$ operators.

III. A UNIFIED PICTURE OF DENSITY-BASED REPRESENTATIONS
Eq. (21) provides a very general (and abstract) definition of a density-based representation of an atomic structure that encodes translational, rotational and permutation symmetries. This level of abstraction provides a unifying picture of the field, in that many of the representations that have been used for machine-learning of atomic-scale properties can be seen as special cases of this form, or as the result of projection onto a particular choice of basis.

Let us start by considering the translationally-invariant ket $\left|\mathcal{A}^{(2)}\right\rangle_{\hat{t}}$, writing it in a plane-wave basis $\{|\mathbf{k}\rangle\}$, and taking for simplicity the $h\to\delta$ limit. One obtains a representation that is equivalent to the diffraction pattern generated by the structure, decomposed in multiple channels that correspond to the reciprocal-space correlations between different atomic species,

$$ \left\langle\mathbf{k}\middle|\mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \sum_{ij} |\alpha_i\alpha_j\rangle\, e^{i\mathbf{k}\cdot\mathbf{r}_{ij}}. \quad (23) $$

When considering a periodic structure, and with an appropriate normalization, this representation is directly connected with the fingerprints that have been recently used to identify crystalline structures, highlighting how different choices of basis may be best suited to different applications.

FIG. 1. Atom-density-based structural representations, expressed in the real-space $\langle\mathbf{r}|$ basis. (a) A structure can be mapped onto a smooth atom density built as a superposition of smooth atom-centered functions. The overall density can be decomposed in atom-centered environments, and information on chemical compositions can be stored by decorating the functions with elemental kets. (b) The $\nu=1$ invariant ket corresponds to spherical averaging of the environmental atom density.
(c) The $\nu=2$ invariant ket corresponds to three-body correlations, which are obtained by integrating over all rotations a stencil corresponding to two distances along two directions with a fixed angle $\arccos\omega$ between them.

A. Many-body kernels and representations
Moving on to the case of rotationally-invariant kets, let us take for simplicity $\hat{U}_\aleph = 1$, and assume that all the environmental kets that are multiplied in Eq. (21) are the same. We will revisit later the possibility of introducing a linear operator to fine-tune the properties of the representation. Since we have started from a position representation for the environmental kets, it is natural to write Eq. (21) explicitly in a complete basis of position and element states, $\left|\prod_\aleph^\nu \alpha_\aleph \mathbf{r}_\aleph\right\rangle \equiv \prod_\aleph^\nu \otimes\, |\alpha_\aleph \mathbf{r}_\aleph\rangle$,

$$ \left\langle \prod_{\aleph}^{\nu} \alpha_\aleph \mathbf{r}_\aleph \middle| \mathcal{X}^{(\nu)}_j \right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R} \prod_{\aleph}^{\nu} \left\langle \alpha_\aleph \hat{R}\mathbf{r}_\aleph \middle| \mathcal{X}_j \right\rangle. \quad (24) $$

One can see clearly that the kernels associated with Eq. (24) are in the form of the invariant $n$-body kernels discussed in Ref. 16 (more specifically, as we will see below, they correspond precisely to the SOAP kernels if $g$ is a Gaussian). Considering the case with a single element,

$$ \left\langle \mathcal{X}^{(\nu)}_k \middle| \mathcal{X}^{(\nu)}_j \right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R}\,\mathrm{d}\hat{R}' \left[ \int \mathrm{d}\mathbf{r}\, \psi_{\mathcal{X}_k}(\hat{R}'\mathbf{r})\, \psi_{\mathcal{X}_j}(\hat{R}\mathbf{r}) \right]^\nu, \quad (25) $$

it is clear that one of the two Haar integrals is redundant and can be eliminated.

Let us consider the effect of $\nu$ on the representation and the information that it captures. As discussed in the case of $\nu=1$ following Eq. (17), one of the input vectors $\mathbf{r}$ can be aligned with a fixed reference axis, e.g. $\hat{\mathbf{e}}_z$. The fact that this axis is invariant under one of the Euler rotations makes it possible to align a second vector so that it lies in the $xz$ plane.
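In the sharp-density limit discussed below, the $\nu=2$ invariant reduces an environment to the list of triplets $(r, r', \omega)$: two neighbor distances and the cosine of the angle they subtend at the central atom. A sketch of that bookkeeping, with illustrative names and a toy geometry:

```python
import numpy as np

def triplet_features(center, neighbors):
    """List (r, r', omega) for all ordered pairs of neighbors of `center`,
    with omega the cosine of the angle subtended at the central atom."""
    feats = []
    vecs = [np.asarray(n, dtype=float) - np.asarray(center, dtype=float)
            for n in neighbors]
    for vi in vecs:
        for vk in vecs:
            ri, rk = np.linalg.norm(vi), np.linalg.norm(vk)
            omega = float(vi @ vk / (ri * rk))
            feats.append((ri, rk, omega))
    return feats

# Water-like environment around O: two H at 0.96, ~104.5 degree angle
o = [0.0, 0.0, 0.0]
h1 = [0.96, 0.0, 0.0]
h2 = [0.96 * np.cos(np.radians(104.5)), 0.96 * np.sin(np.radians(104.5)), 0.0]
feats = triplet_features(o, [h1, h2])
```

These triplets are invariant under any rigid rotation or translation of the whole environment, since they are built from distances and dot products of relative vectors only.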
For $\nu=1$ and $\nu=2$ this analysis leads to

$$ \left\langle \alpha r \middle| \mathcal{X}^{(1)} \right\rangle_{\hat{R}} \propto r \int \mathrm{d}\hat{R}\, \psi^\alpha(r\hat{R}\hat{\mathbf{e}}_z), $$
$$ \left\langle \alpha r\, \alpha' r'\, \omega \middle| \mathcal{X}^{(2)} \right\rangle_{\hat{R}} \propto rr' \int \mathrm{d}\hat{R}\, \psi^{\alpha'}(r'\hat{R}\hat{\mathbf{e}}_z)\, \psi^{\alpha}\!\left(r\hat{R}\big(\omega\hat{\mathbf{e}}_z + \sqrt{1-\omega^2}\,\hat{\mathbf{e}}_x\big)\right), \quad (26) $$

where $\omega = \hat{\mathbf{r}}\cdot\hat{\mathbf{r}}'$ (see Fig. 1). After one has aligned the first two $\mathbf{r}_\aleph$'s, the position of all the other $\mathbf{r}_\aleph$'s cannot be manipulated, so in practice for $\nu > 2$ only the first two vectors can be taken to be aligned with the $z$ axis and to lie in the $xz$ plane. For $\nu=3$,

$$ \left\langle \alpha r\, \alpha' r'\, \omega\, \alpha'' r'' \hat{\mathbf{r}}'' \middle| \mathcal{X}^{(3)} \right\rangle_{\hat{R}} \propto rr'r'' \int \mathrm{d}\hat{R}\, \psi^{\alpha}(r\hat{R}\hat{\mathbf{e}}_z)\, \psi^{\alpha'}\!\left(r'\hat{R}\big(\omega\hat{\mathbf{e}}_z + \sqrt{1-\omega^2}\,\hat{\mathbf{e}}_x\big)\right) \psi^{\alpha''}\!\left(r''\hat{R}\hat{\mathbf{r}}''\right). \quad (27) $$

Also note that we have incorporated the square root of the Jacobian in the definition of the representations so that the corresponding kernels can be computed straightforwardly as the inner product between two vectors without scaling.

By expanding the densities as sums over atoms, it becomes clear that these kets are representations of the $(\nu+1)$-body order correlations between atoms within an environment (Fig. 2). To start with, we return to the delta function limit of the atomic densities. In the limit in which each atomic density is represented by Dirac $\delta$ distributions, the position representations of the invariant vectors take very simple forms.

FIG. 2. Isocontours of the 3-body correlation functions associated with the environment centered on the tagged carbon atom of an ethanol molecule.
From left to right, the figures correspond to $\left\langle \mathrm{C}\, r\, \mathrm{H}\, r'\, \omega \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}}/rr'$, $\left\langle \mathrm{O}\, r\, \mathrm{H}\, r'\, \omega \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}}/rr'$, and $\left\langle \mathrm{O}\, r\, \mathrm{H}\, r'\, \omega \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}}/rr'$.

In the $h\to\delta$ limit these simple forms read

$$ \left\langle \alpha r \middle| \mathcal{X}^{(1)}_j \right\rangle_{\hat{R}} \overset{h\to\delta}{\propto} r \sum_{i\in\alpha} f_c(r_{ij})\, \delta(r-r_{ij}), $$
$$ \left\langle \alpha r\, \alpha' r'\, \omega \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}} \overset{h\to\delta}{\propto} rr' \sum_{i\in\alpha} \sum_{i'\in\alpha'} \delta(r-r_{ij})\, \delta(r'-r_{i'j})\, \delta(\omega - \hat{\mathbf{r}}_{ij}\cdot\hat{\mathbf{r}}_{i'j})\, f_c(r_{ij})\, f_c(r_{i'j}). \quad (28) $$

One then sees how linear regression based on $\left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}}$ corresponds to $(\nu+1)$-body potentials, e.g. for the 3-body term,

$$ y(\mathcal{X}_j) = \int \mathrm{d}r\,\mathrm{d}r'\,\mathrm{d}\omega\; y^{(2)}(r,r',\omega) \left\langle rr'\omega \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}} = \sum_{ik} y^{(2)}(r_{ij}, r_{jk}, \omega_{ijk}). \quad (29) $$

There are however good reasons to use non-linear functions of the feature vector in an ML model. In the case of sufficiently sharp atom-centered density functions, the ket with $\nu=1$ contains information on the list of all pair distances within an environment, which is not sufficient to reconstruct the structure of the environment unequivocally. The representation with $\nu=2$, on the other hand, contains information on pair distances and angles between triplets of atoms. To the best of our knowledge, and based on extensive numerical experiments, this information is sufficient to reconstruct a configuration modulo arbitrary rigid translations, rotations and inversion symmetry. Tensor products of the $\left|\mathcal{X}^{(2)}\right\rangle_{\hat{R}}$ ket are then sufficient as a basis to represent arbitrarily complex invariant functions of the atomic coordinates.

B. Behler-Parrinello symmetry functions
An expression of Eq. (21) in the position representation and in the $h\to\delta$ limit is an ideal starting point to investigate the relationship of $\left|\mathcal{X}^{(\nu)}\right\rangle_{\hat{R}}$ with other density-based frameworks. These expressions reveal the connection between these invariant kets and several popular fingerprints designed to capture pair and 3-body interactions. The link between $\left\langle \alpha r \middle| \mathcal{X}^{(1)}_j \right\rangle$ and the pair distribution function is obvious. Behler-Parrinello symmetry functions, and similar weighted averages of $n$-body correlations, can be seen as projections of the $SO(3)$ invariant ket over suitable test functions $G$. For instance, for a 2-body symmetry function $G(r)$ one has

$$ \langle \alpha\beta G|\mathcal{X}_j\rangle = \langle\alpha|\alpha_j\rangle \int \mathrm{d}r\, G(r)\, r \left\langle \beta r \middle| \mathcal{X}^{(1)}_j \right\rangle_{\hat{R}, h\to\delta}, \quad (30) $$

and an analogous expression can be written for a 3-body symmetry function $G(r,r',\omega)$. Expressions similar to Eq. (28) can be obtained by inserting into Eq. (26) Gaussians, or alternative basis functions. The relationship to other density-based representations, such as those discussed in Refs. 23 and 51, is less transparent, but several of the essential ingredients (such as scaling functions that modulate geometric and chemical correlations) can be introduced in terms of appropriate choices of the $\hat{U}$ operators, as we will discuss in the next section.

C. Smooth Overlap of Atomic Positions
We have left as a last example a discussion of the connection between the symmetrized ket and the smooth overlap of atomic positions (SOAP) power spectrum.
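The power-spectrum contraction of Eq. (33) is easy to sketch numerically. Below, random complex numbers stand in for the expansion coefficients $\langle \alpha nlm|\mathcal{X}_j\rangle$ of a single element (array layout and names are illustrative); the key property checked is that the contraction is unchanged when every coefficient picks up the $e^{-im\phi}$ phase of a rotation about the $z$ axis:

```python
import numpy as np

def power_spectrum(c):
    """p[n, n2, l] = (2l+1)^(-1/2) * sum_m conj(c[n, l, m]) * c[n2, l, m],
    with c indexed as c[n, l, m + l_max] so m runs from -l to l."""
    n_max, l_max_plus1, _ = c.shape
    l_max = l_max_plus1 - 1
    p = np.zeros((n_max, n_max, l_max + 1), dtype=complex)
    for l in range(l_max + 1):
        ms = np.arange(-l, l + 1) + l_max
        for n in range(n_max):
            for n2 in range(n_max):
                p[n, n2, l] = np.vdot(c[n, l, ms], c[n2, l, ms]) / np.sqrt(2 * l + 1)
    return p

rng = np.random.default_rng(1)
c = rng.normal(size=(3, 4, 7)) + 1j * rng.normal(size=(3, 4, 7))  # n_max=3, l_max=3

# Rotation about z multiplies c_{nlm} by exp(-i m phi); the m-sum cancels the phase.
phi = 0.7
m_grid = np.arange(-3, 4)
c_rot = c * np.exp(-1j * m_grid * phi)
```

A full rotational-invariance check would require Wigner D-matrices; the $z$-rotation case already shows the mechanism, since the phases cancel in the $\sum_m c^\star c$ contraction.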
In fact, if we take as we did before $\hat{U}_\aleph = 1$ and $\left|\mathcal{X}^1_j\right\rangle = \left|\mathcal{X}^2_j\right\rangle = \ldots = \left|\mathcal{X}^\nu_j\right\rangle$ in Eq. (21), the SOAP power spectrum is nothing but an alternative representation of $\left|\mathcal{X}^{(2)}_j\right\rangle_{\hat{R}}$. To see how, one can start by expanding the translationally-invariant environmental ket Eq. (24) in a basis of orthonormal radial basis functions $R_n(r)$ and spherical harmonics $Y_{lm}(\hat{\mathbf{r}})$,

$$ \langle \alpha nlm | \mathcal{X}_j \rangle = \int \mathrm{d}\mathbf{r}\, R_n(r)\, Y_{lm}(\hat{\mathbf{r}})\, \langle \alpha \mathbf{r} | \mathcal{X}_j \rangle. \quad (31) $$

Using a basis of spherical harmonics is extremely useful and practical because they block-diagonalize the angular momentum operator (and thus the rotation operator), which allows for explicit integration over the rotation group in Eq. (24). For $\nu=1$, this leads to the following feature vector,

$$ \left\langle \alpha n \middle| \mathcal{X}^{(1)}_j \right\rangle_{\hat{R}} \propto \langle \alpha n 0 0 | \mathcal{X}_j \rangle. \quad (32) $$

For $\nu=2$, the feature vector corresponds to the SOAP power spectrum,

$$ \left\langle \alpha n\, \alpha' n'\, l \middle| \mathcal{X}^{(2)}_j \right\rangle_{\hat{R}} \propto \frac{1}{\sqrt{2l+1}} \sum_m \langle \alpha nlm | \mathcal{X}_j \rangle^\star \langle \alpha' n' lm | \mathcal{X}_j \rangle. \quad (33) $$

For $\nu=3$ the representation corresponds to the bispectrum,

$$ \left\langle \alpha_1 n_1 l_1\, \alpha_2 n_2 l_2\, \alpha nl \middle| \mathcal{X}^{(3)}_j \right\rangle_{\hat{R}} \propto \frac{1}{\sqrt{2l+1}} \sum_{m m_1 m_2} \langle \mathcal{X}_j | \alpha nlm \rangle \langle \alpha_1 n_1 l_1 m_1 | \mathcal{X}_j \rangle \langle \alpha_2 n_2 l_2 m_2 | \mathcal{X}_j \rangle \langle l_1 m_1\, l_2 m_2 | lm \rangle, \quad (34) $$

where $\langle l_1 m_1\, l_2 m_2 | lm \rangle$ is a Clebsch-Gordan coefficient. The bispectrum is used as a four-body feature vector in SOAP and in Spectral Neighbor Analysis Potentials (SNAP), where its high resolution is exploited to construct accurate interatomic potentials through linear regression.

Seen in the light of the present formalism, the remarkable fact that the SOAP kernel (Eq.
(25) with densities written as a sum of Gaussians) can be expressed as an explicit scalar product between vectors, representing a truncated expansion of the power spectrum, emerges as a natural consequence of the definition of the kernel as the scalar product between invariant kets. It should also be noted that in practical applications of SOAP the kernels are often (but not always) normalized and raised to an integer power $\zeta$, which corresponds to taking a tensor product of the kets and introduces a many-body character in the model built on such kernels.

D. Tensorial Smooth Overlap of Atomic Positions ($\lambda$-SOAP)

The feature vectors that appear in the tensorial extension of SOAP are of the form in Eq. (21), with $\hat{U}_\aleph = \hat{I}$ for $\aleph = 1, 2, \ldots, \nu+1$, $\left|\mathcal{X}^\aleph_j\right\rangle = |\mathcal{X}_j\rangle$ for $\aleph = 1, \ldots, \nu$, and $\left|\mathcal{X}^{\nu+1}_j\right\rangle = |\lambda\mu\rangle$, where $|\lambda\mu\rangle$ is an angular momentum ket:

$$ \left|\mathcal{X}^{(\nu)}_{\lambda\mu}\right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R}\; \hat{R}|\lambda\mu\rangle \prod_{\aleph=1}^{\nu} \otimes\, \hat{R}|\mathcal{X}_j\rangle. \quad (35) $$

FIG. 3. Schematic representation of the construction of a real-space representation of a tensorial ket associated with a $\lambda$-SOAP kernel. The (smooth) atom density is evaluated at two points corresponding to a stencil $(r, r', \omega)$, and the spherical harmonic $Y_{\lambda\mu}$ is evaluated at the angles $(\theta, \phi)$, relative to the reference frame that is used to describe the stencil.

The ket is rotationally invariant,

$$ \left[\prod_{\aleph=1}^{\nu+1} \otimes\, \hat{R}'\right] \left|\mathcal{X}^{(\nu)}_{\lambda\mu}\right\rangle_{\hat{R}} = \left|\mathcal{X}^{(\nu)}_{\lambda\mu}\right\rangle_{\hat{R}}, \quad (36) $$

but not in the subspace that describes the atomic environments,

$$ \left[\hat{I} \otimes \prod_{\aleph=1}^{\nu} \otimes\, \hat{R}'\right] \left|\mathcal{X}^{(\nu)}_{\lambda\mu}\right\rangle_{\hat{R}} \neq \left|\mathcal{X}^{(\nu)}_{\lambda\mu}\right\rangle_{\hat{R}}.
(37)The inner product between two of these vectors is eas-ily shown to be (cid:68) X ( ν ) j λµ (cid:12)(cid:12)(cid:12) X ( ν ) k λ (cid:48) µ (cid:48) (cid:69) ˆ R = δ λλ (cid:48) (cid:90) d ˆ R D λµµ (cid:48) ( ˆ R ) (cid:12)(cid:12) (cid:104)X j | ˆ R |X k (cid:105) (cid:12)(cid:12) ν , (38)which agrees with the usual definition of the λ -SOAPkernel, (cid:68) X ( ν ) j λµ (cid:12)(cid:12)(cid:12) X ( ν ) k λµ (cid:48) (cid:69) ˆ R = k λµµ (cid:48) ( X j , X k ) . (39)While (cid:12)(cid:12)(cid:12) X ( ν ) j λµ (cid:69) ˆ R can be represented very effectivelyusing a spherical-harmonics expansion of the atomdensity , it is also possible to express it in terms of areal-space basis. Following arguments similar to thoseused to derive Eq. (27), one can see that in this formthe tensorial ket corresponds to the evaluation of athree-body correlation function of the atom density,multiplied by a spherical harmonic of appropriate or-der computed in the reference frame of the ( r, r (cid:48) , ω )stencil (see Fig. 3). a) b) c) FIG. 4. (a) Permutation-variant structural descriptors can be stored in a vector to be used as an atomic-scale repre-sentation. (b) Sorting this vector makes it permutationally invariant. (c) It is easy to see how the sorted vector relatesto the cumulative distribution function associated with the histogram of the values of the structural features.
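Returning to the power spectrum of Eq. (33): the contraction over $m$ can be sketched numerically. The snippet below is a deliberately minimal illustration, not any particular SOAP implementation; the array layout, dimensions, and the random toy coefficients are assumptions made for this example only.

```python
import numpy as np

def power_spectrum(c):
    """nu = 2 invariant of Eq. (33) from density-expansion coefficients.

    c: complex array of shape (n_sp, n_max, l_max + 1, 2*l_max + 1), with
       c[a, n, l, l_max + m] standing in for <alpha n l m|X_j>.
    Returns p[a, n, a', n', l] = (2l+1)^{-1/2} sum_m c*_{anlm} c_{a'n'lm}.
    """
    n_l = c.shape[2]
    # contract the m index between conjugated and plain coefficients
    p = np.einsum('anlm,bklm->anbkl', c.conj(), c)
    return p / np.sqrt(2 * np.arange(n_l) + 1)

# toy coefficients: 2 species, 3 radial channels, l_max = 2
rng = np.random.default_rng(0)
shape = (2, 3, 3, 5)
c = rng.normal(size=shape) + 1j * rng.normal(size=shape)
p = power_spectrum(c)

# Hermitian symmetry of Eq. (33): p[an, a'n', l] = p[a'n', an, l]*
assert p.shape == (2, 3, 2, 3, 3)
assert np.allclose(p, np.conj(p.transpose(2, 3, 0, 1, 4)))
```

The symmetry check reflects the fact that exchanging $(\alpha n)$ and $(\alpha' n')$ in Eq. (33) conjugates the feature, so only roughly half of the components are independent.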
Taking tensor products of $|\mathcal{X}_{\lambda\mu}^{(\nu)}\rangle_{\hat{R}}$ with itself increases the order of body correlations that are explicitly included in the feature vector, but destroys the required symmetry properties. Instead, one can take tensor products with $\lambda = 0$ kets, which are rotationally invariant in the subspace that describes the atomic environments while preserving the desired symmetry of the representation, e.g.

$|\mathcal{X}_{\lambda\mu}^{(\nu)}\rangle_{\hat{R}} \prod_{k=1}^{\zeta-1} \otimes\, |\mathcal{X}_{00}^{(\nu)}\rangle_{\hat{R}} \rightarrow |\mathcal{X}_{\lambda\mu}^{(\nu)}\rangle^{(\zeta)}_{\hat{R}} .$  (40)

This procedure has been found effective in practice for increasing the order of body correlations in tensorial SOAP.

E. Distributions vs sorted vectors
It is worth making some further considerations that extend somewhat the generality of this construction to include representations that are not based explicitly on atom densities. Many approaches in the literature rely on computing quantities that are not permutationally invariant per se, for instance the elements of the matrix of pair distances between atoms, transformed elementwise by some function, or the eigenvalues of such matrices. In order to make these representations invariant to atom permutations, one often proceeds to sort these sets of items, and uses the Euclidean distance between the sorted vectors as the building block of kernels or other statistical learning frameworks.

In fact, it is easy to see that, given a set of elements $\{a_i \in \mathbb{R}\}$, the sorted list contains the same amount of information as the histogram of the elements $h(a)$. Scaling the index of the sorted items by the total number of items $N$, and considering the limit in which $x = i/N$ can be treated as a continuous index, one sees that $x(\tilde{a})$ counts the fraction of entries that are smaller than $\tilde{a}$, that is

$x(\tilde{a}) = \int_{-\infty}^{\tilde{a}} \mathrm{d}a\, h(a) .$  (41)

It follows that $a(x)$, which is a continuous representation of the vector of sorted distances, is just the inverse cumulative distribution function (iCDF) associated with $h(a)$. The Euclidean distance between two vectors of sorted elements is proportional to the $L_2$ norm of the difference between the iCDFs of the histograms associated with the two sets. Interestingly, if one considers the $L_1$ norm, the distance between the sorted vectors corresponds to the earth mover's distance between the two distributions in one dimension. The connection between different density-based representations is more direct than that which can be established between density-based and sorted-vector descriptions, especially since the relation between atom positions and the permutation-variant items might be far from trivial, e.g. when the representation involves the eigenvalues of an overlap matrix. However, the argument we present here highlights the fact that incorporating physical symmetries in the description of atomistic systems leads to representations that contain essentially the same information.

IV. GENERALIZED INVARIANT DENSITY REPRESENTATIONS
The formalism we have introduced in the previous section provides an elegant framework to construct a rotationally-invariant representation of the atomic density that can be used for machine-learning purposes. While the formalism provides a complete description of structural correlations of a given order within an atomic environment, the quality and the computational cost of the regression scheme can be improved substantially in practice by transforming the representation so that it incorporates some degree of chemical intuition. For instance, the combination of multiple kernels corresponding to different interatomic distances has been shown to improve the quality of the ML model. Likewise, a scaling of the weights of different atomic distances within an environment has been shown to be beneficial when using ML to predict atomic-scale properties.

We will discuss how many of these modifications can be incorporated in this framework through the inclusion of a rotationally-invariant Hermitian operator $\hat{U} = \hat{U}_1 = \hat{U}_2 = \dots$ (as introduced earlier) that leads to a coupling of the geometric and elemental components of the translationally-invariant atom-centered ket $|\mathcal{X}_j\rangle$. For concreteness, and to provide a formulation that can be directly applied to an existing framework, we discuss $\hat{U}$ written in the orthonormal basis of radial functions and spherical harmonics $\{|\alpha n l m\rangle\}$ that corresponds to the SOAP power spectrum.

The requirement that $\hat{U}$ is rotationally invariant (and thus commutes with an arbitrary rotation operator) means that it must have the following form,

$\langle \alpha n l m | \hat{U} | \alpha' n' l' m' \rangle = \delta_{ll'}\, \delta_{mm'}\, \langle \alpha n l | \hat{U} | \alpha' n' l \rangle .$  (42)

Eq. (42) is the most general form compatible with $SO(3)$ symmetry, and can be seen as a way to introduce correlations between different radial and elemental components of the features, and to weight the contribution from different angular channels.
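The block structure of Eq. (42) is easy to exercise numerically: one matrix per angular channel $l$, acting identically on every $m$ component. The sketch below uses toy dimensions and random Hermitian blocks standing in for an actual, physically-motivated $\hat{U}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_an, l_max = 6, 2              # combined (alpha, n) index size, angular cutoff

# Eq. (42): U-hat is block diagonal in l and proportional to the identity in m
U = [rng.normal(size=(n_an, n_an)) for _ in range(l_max + 1)]
U = [0.5 * (u + u.T) for u in U]  # make each block Hermitian (real symmetric here)

# toy coefficients <alpha n l m|X_j>, with (alpha, n) flattened together
c = rng.normal(size=(n_an, l_max + 1, 2 * l_max + 1))

# apply the same l-block to every m column of that angular channel
c_t = np.stack([U[l] @ c[:, l, :] for l in range(l_max + 1)], axis=1)
assert c_t.shape == c.shape

# every m component of a given l is transformed by the same matrix, which is
# precisely what makes the operator commute with rotations
assert np.allclose(c_t[:, 1, 0], U[1] @ c[:, 1, 0])
```

Any $m$-dependence in the blocks would break the commutation with the rotation operator, and hence the invariance of the features built from the transformed coefficients.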
A. Low-rank expansion of the $\hat{U}$ operator

Since $\hat{U}$ is Hermitian, it can be diagonalized and expressed in the orthogonal basis of its eigenkets $\{|J\rangle\}$,

$\hat{U} = \sum_J |J\rangle\, U_J\, \langle J| .$  (43)

Taking $U_J \langle J| \rightarrow \langle J|$ allows us to express $\hat{U}$ as $\hat{U} = \sum_J |J\rangle \langle J|$. The transformed SO(3) vector components can be written in terms of the components of $|J\rangle$ in the chemical basis, $u^J_{\alpha n l} = \langle J | \alpha n l \rangle$. This yields

$\langle J J' | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} = \sum_{\alpha \alpha' n n' l} u^J_{\alpha n l}\, u^{J'}_{\alpha' n' l} \sum_m \langle \alpha n l m | \mathcal{X}_j \rangle^\star \langle \alpha' n' l m | \mathcal{X}_j \rangle .$  (44)

By choosing a low-rank expansion of $\hat{U}$ one can greatly reduce the dimensionality of the $SO(3)$ fingerprint vector, similarly to what was done in Ref. 37 by applying standard sparse decomposition techniques to the $SO(3)$ fingerprints.

A possible approach is to determine this low-rank approximation based on the correlations found between environments that are part of the data set. For a given $l$, consider the spherically-symmetric covariance matrix between the features of the expanded atomic density,

$C^{(l)}_{\alpha n,\alpha' n'} = \frac{1}{N} \sum_j \sum_m \langle \alpha n l m | \mathcal{X}_j \rangle^\star \langle \alpha' n' l m | \mathcal{X}_j \rangle = \frac{\sqrt{2l+1}}{N} \sum_j \langle \alpha n\, \alpha' n'\, l | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} .$  (45)

The eigenvectors of $C^{(l)}$, $v^{(l)}_J$, can then be used as the $u^J_{\alpha n l}$ in Eq. (44). It is easy to see that this transformation identifies components of the data that are linearly independent within the training set, and have a spread that is equal to the corresponding eigenvalues $\lambda^{(l)}_J$.
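The covariance-based compression of Eq. (45) can be sketched as follows; the array shapes and the random stand-in coefficients are assumptions for this example, not part of any reference implementation.

```python
import numpy as np

def compress_features(coeffs, n_J):
    """Data-driven low-rank contraction for one angular channel l, Eq. (45).

    coeffs: complex array of shape (N_env, n_sp * n_max, 2*l + 1), holding
            the <alpha n l m|X_j> coefficients with (alpha, n) flattened.
    Returns the n_J leading eigenvalues and eigenvectors of the covariance;
    the eigenvectors play the role of the u^J_{alpha n l} in Eq. (44).
    """
    N = coeffs.shape[0]
    # spherically-symmetric covariance: average over environments and m
    C = np.einsum('jam,jbm->ab', coeffs.conj(), coeffs) / N
    evals, evecs = np.linalg.eigh(C)          # Hermitian, ascending eigenvalues
    order = np.argsort(evals)[::-1]           # rank by decreasing spread
    return evals[order[:n_J]], evecs[:, order[:n_J]]

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 12, 5)) + 1j * rng.normal(size=(200, 12, 5))
lam, u = compress_features(x, n_J=4)
assert u.shape == (12, 4)
assert np.all(lam[:-1] >= lam[1:])            # eigenvalues sorted by magnitude
```

Retaining only the columns of `u` with large eigenvalues discards directions in feature space along which the training data show essentially no variance.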
The feature space can then be compressed by retaining only a certain number of components $n_J$, which can be determined from the magnitude of the associated eigenvalues.

B. Radially-scaled kernels
In a system with relatively uniform atom density, the overlap between environments $\langle \mathcal{X}_j | \mathcal{X}_k \rangle$ is dominated by the region farthest from the center. This could be regarded as rather unphysical: since interactions between atoms decay with distance, the closest atoms should give the most significant contribution to properties, which is reflected in the observation that multi-scale kernels tend to perform best when very low weights are assigned to the long-range kernels. This effect can be counteracted by multiplying the atomic probability amplitude Eq. (13) by a radial scaling $u(r)$,

$\langle \alpha \mathbf{r} | \hat{U} | \mathcal{X}_j \rangle = u(r)\, \psi^{\alpha}_{\mathcal{X}_j}(\mathbf{r}) .$  (46)

In the context of the SOAP power spectrum, this change can be represented in terms of a $\hat{U}$ operator that reads

$\langle n | \hat{U} | n' \rangle = \int \mathrm{d}r\, r^2\, R_n(r)\, R_{n'}(r)\, u(r) ,$  (47)

since an operator that scales states in the position representation must be diagonal in it,

$\langle r | \hat{U} | r' \rangle = \delta(r - r')\, u(r) ,$  (48)

and its matrix elements in the basis of radial basis functions are

$\langle n | \hat{U} | n' \rangle = \int \mathrm{d}r \int \mathrm{d}r'\, r\, r'\, R_n(r)\, R_{n'}(r')\, \delta(r - r')\, u(r) ,$  (49)

which reduces to Eq. (47).

Radial scaling in the form of Eq. (46) can be approximated, when using narrow atom-centered functions, with $\sum_i u(r_{ij})\, f_c(r_{ij})\, h(\mathbf{r} - \mathbf{r}_{ij})$, where for simplicity we also consider the case with a single species. Besides the fact that it is simpler to implement this form of scaling in an existing code, this approximation also makes apparent the connection between the general density-based framework we introduce here and the descriptors of Ref. 51. When $h$ is taken to be a Gaussian function, the weight on the central atom is set to zero, and one considers the two-body invariant representations, this ansatz is essentially equivalent to the two-body features in Ref. 51:

$\langle r | \hat{U} | \mathcal{X}_j^{(1)} \rangle_{\hat{R}} = \sum_{i \neq j} \frac{u(r_{ij})}{\sqrt{2\pi}\, r_{ij}\, \sigma} \left[ e^{-(r - r_{ij})^2/2\sigma^2} - e^{-(r + r_{ij})^2/2\sigma^2} \right] \approx \sum_{i \neq j} \frac{u(r_{ij})}{\sqrt{2\pi}\, r_{ij}\, \sigma}\, e^{-(r - r_{ij})^2/2\sigma^2} .$  (50)

C. Alchemical kernels
In the presence of multiple species, one could make the scaling element-dependent, or devise a more complex operator that couples channels of different species. As a first test of the generalization of SOAP in the presence of multiple elements, we consider an operator of the form

$\langle \alpha n l m | \hat{U} | \alpha' n' l' m' \rangle = \delta_{ll'}\, \delta_{mm'}\, \delta_{nn'}\, \langle \alpha | \hat{U} | \alpha' \rangle ,$  (51)

which ignores couplings between the structure of an environment and the elements within it. One can always write a low-rank expansion of the operator, $\hat{U} \approx \sum_{J\alpha} |J\rangle\, u_{J\alpha}\, \langle \alpha|$, which allows one to write

$\hat{U} \otimes \hat{U}\, |\mathcal{X}_j^{(2)}\rangle_{\hat{R}} = \sum_{J J' \alpha \alpha'} |J J'\rangle\, u_{J\alpha}\, u_{J'\alpha'}\, \langle \alpha \alpha' | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} .$  (52)

In the context of SOAP, one can define the projections of the power spectrum in this "alchemical basis",

$\langle J n\, J' n'\, l | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} = \sum_{\alpha \alpha'} u_{J\alpha}\, u_{J'\alpha'} \sum_m \langle \alpha n l m | \mathcal{X}_j \rangle^\star \langle \alpha' n' l m | \mathcal{X}_j \rangle ,$  (53)

which was shown in Ref. 66 to yield a substantial improvement in the learning efficiency in the presence of many chemical elements, and to result in a low-dimensional representation of elemental space that shares some similarities with the grouping found in the periodic table of the elements.

One can see the relationship between these "alchemical features" and previous attempts to incorporate cross-species correlations through the generalized SOAP environmental kernel,

$\int \mathrm{d}\hat{R}\, \left| \langle \mathcal{X}_j | \hat{U}^\dagger \hat{U} \hat{R} | \mathcal{X}_k \rangle \right|^2 = \sum_{J n J' n' l} \langle \mathcal{X}_j^{(2)} | J n\, J' n'\, l \rangle_{\hat{R}}\, \langle J n\, J' n'\, l | \mathcal{X}_k^{(2)} \rangle_{\hat{R}} .$  (54)

By writing out this inner product explicitly in terms of the full power spectrum elements $\langle \alpha n\, \alpha' n'\, l | \mathcal{X}_j^{(2)} \rangle_{\hat{R}}$, one can see that the matrix elements $\langle \alpha | \hat{U}^\dagger \hat{U} | \alpha' \rangle$ are nothing but the elements of the alchemical kernel $\kappa_{\alpha\alpha'}$ that was introduced in Ref. 42, where it was shown that taking $\kappa_{\alpha\alpha'} \neq \delta_{\alpha\alpha'}$ can improve property predictions with kernel ridge regression. Off-diagonal couplings between chemical elements have also been used in other representations, including those of Ref. 51. The expression in terms of reduced features, Eq. (53), is, however, more efficient to compute, and it clarifies how this approach enables the introduction of correlations between elements, as well as a reduction of the space dimensionality.
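As a minimal sketch of this idea (toy dimensions, and random numbers standing in for optimized weights $u_{J\alpha}$), the contraction of per-element density amplitudes into a few chemical channels can be written as:

```python
import numpy as np

n_sp, d_J, n_grid = 10, 2, 50
rng = np.random.default_rng(3)

# one density amplitude psi^alpha_{X_j}(r) per chemical element,
# tabulated here on a toy radial grid
psi = rng.normal(size=(n_sp, n_grid))

# element weights u_{J alpha}, as in the low-rank expansion of Eq. (53)
u = rng.normal(size=(d_J, n_sp))

# contract the n_sp species into d_J chemical "channels" before any
# further expansion or symmetrization of the density
psi_J = u @ psi
assert psi_J.shape == (d_J, n_grid)

# the nu = 2 invariants built on the channels scale as d_J^2 rather than
# n_sp^2: a compression by (n_sp / d_J)^2 = 25 for these toy numbers
assert (n_sp / d_J) ** 2 == 25.0
```

The point of the contraction is that everything downstream of the density (radial expansion, symmetrized averages) sees only $d_J$ channels, so the cost reduction propagates through the whole pipeline.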
The full SOAP feature vector contains a number of components that is proportional to the square of the number of species present, $n_{\mathrm{sp}}$, while limiting the expansion to a number $d_J \ll n_{\mathrm{sp}}$ of basis kets reduces the dimensionality of the feature vector by a factor $(n_{\mathrm{sp}}/d_J)^2$.

Note that one does not even need to compute all the elements in the $|\alpha n l m\rangle$ expansion of the density, since the alchemical projection can be brought down to the level of the atom density, which can be defined for $d_J$ chemical "channels" rather than for each element separately,

$\langle J \mathbf{r} | \mathcal{X}_j \rangle = \sum_\alpha u_{J\alpha}\, \psi^{\alpha}_{\mathcal{X}_j}(\mathbf{r}) .$  (55)

Density-based representations that assign a weight to each species have been explored as a means to reduce the complexity of ML representations in cases where many elements are present simultaneously, which corresponds essentially to the case with $d_J = 1$. For instance, the compositional descriptor of Ref. 67 is equivalent to Eq. (30) computed on a single invariant density,

$\langle \mathbf{r} | \mathcal{X}_j^{(1)} \rangle_{\hat{R}} = \sum_i u_{\alpha_i}\, \delta(\mathbf{r} - \mathbf{r}_{ij})\, f_c(r_{ij}) ,$  (56)

where the weights of different species are rather arbitrarily set to $u_\alpha = 0, \pm 1, \pm 2, \dots$. The more general formulation in Eqs. (53)-(55) provides a way to alter the dimensionality of the representation, and to optimize the projections to obtain the most efficient features for a given regression problem.

D. Non-factorizable operators
In order to relate Eq. (21) to other density-based representations that involve more complicated scaling functions of the internal coordinates, it is necessary to introduce a further linear transformation $\hat{U}^{(\nu)}$ which does not factorize into components that act independently on each term in the $\nu$-order tensor product. Such an operator must be chosen with care to ensure that it is rotationally invariant, otherwise the rotational invariance of the transformed ket will be lost. As far as the $\nu = 2$ rotationally-invariant kets are concerned, a generic operator is completely determined by its action on the basis vectors $\{|\alpha \mathbf{r}\, \alpha' \mathbf{r}'\rangle\}$. Rotationally-invariant operators must act on $|\alpha \hat{R}\mathbf{r}\, \alpha' \hat{R}\mathbf{r}'\rangle$ in the same way as on $|\alpha \mathbf{r}\, \alpha' \mathbf{r}'\rangle$, followed by the rotation $\hat{R}$. The upshot of this observation is

$\langle \alpha \mathbf{r}_1\, \alpha' \mathbf{r}'_1 | \hat{U}^{(2)} | \beta \mathbf{r}_2\, \beta' \mathbf{r}'_2 \rangle = \langle \alpha r_1\, \alpha' r'_1\, \omega_1 | \hat{U}^{(2)} | \beta r_2\, \beta' r'_2\, \omega_2 \rangle ,$  (57)

i.e. any non-internal coordinate must be cyclic. If a distance- and angle-based scaling is required, then the operator is diagonal,

$\langle \alpha \mathbf{r}_1\, \alpha' \mathbf{r}'_1 | \hat{U}^{(2)} | \beta \mathbf{r}_2\, \beta' \mathbf{r}'_2 \rangle = \delta_{\alpha\beta}\, \delta_{\alpha'\beta'}\, \delta(r_1 - r_2)\, \delta(r'_1 - r'_2)\, \delta(\omega_1 - \omega_2)\, u(\alpha, r_1, \alpha', r'_1, \omega_1) .$  (58)

For example, the scaling function in the three-body descriptor of Ref. 51 corresponds to the following choice for $u(r_1, r_2, \omega)$,

$u(r_1, r_2, \omega) = \frac{1 - \omega\, \omega_1\, \omega_2}{(r_1 r_2 r_3)^n} ,$  (59)

where $r_3^2 = r_1^2 + r_2^2 - 2 r_1 r_2 \omega$, $\omega_1 = (r_2^2 - r_1^2 - r_3^2)/2 r_1 r_3$, $\omega_2 = (r_1^2 - r_2^2 - r_3^2)/2 r_2 r_3$, and $n$ is an adjustable parameter. Faber et al. do not specify a scaling function for four-body and higher-body descriptors, but the analysis presented here clearly extends to any hypothetical scaling function that involves the internal coordinates of a collection of $\nu + 1$ positions.

Starting from the SOAP power spectrum, one can exploit the fact that each component is separately symmetry invariant. It is then possible to introduce an arbitrary linear operator coupling the $|\alpha n\, \alpha' n'\, l\rangle$ components, $\langle \alpha n\, \alpha' n'\, l | \hat{U} | \beta \bar{n}\, \beta' \bar{n}'\, \bar{l} \rangle$. Being a linear operation, this transformation amounts to a change of regularization for the ridge regression problem, and is most useful if applied to reduce the dimensionality of the feature vectors. This can be done e.g. by finding the principal components of the covariance matrix of the SOAP power spectrum or, as done in Ref. 37, by a sparse decomposition that singles out a subset of the components that suffices to obtain a thorough description of the relevant structures. This corresponds to the contracted representation

$\langle J | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} = \sum_k u_{Jk}\, \langle \alpha_k n_k\, \alpha'_k n'_k\, l_k | \mathcal{X}_j^{(2)} \rangle_{\hat{R}} ,$  (60)

where $k$ runs over the set of selected components, which can be determined with different schemes, from CUR to farthest point sampling. The coefficients $u_{Jk}$ are the elements of a square matrix that ensures the contracted vectors in Eq. (60) generate a kernel that is as close as possible to the full kernel.

E. Optimization of the density representation
The optimization of the $\hat{U}$ operator in its more general form (see Eq. (42)) involves a large number of parameters, leading to a very concrete risk of overfitting. This is exacerbated by the fact that the feature vector is then used as the input for regression, so one has to balance the amount of data used to optimize the elements of $\hat{U}$ against that used for the training of the ridge regression model. The simplest approach to reduce the optimization of $\hat{U}$ to a small number of free parameters uses the compression method discussed in Section IV A to identify the most important combinations of $\langle \alpha n l m | \mathcal{X}_j \rangle$ components that are linearly independent for the data at hand. We would like to be able to optimize $\hat{U}$ based on the correlations found between environments that are part of the training set. The idea is that further optimization using target properties will be less likely to overfit after this dimensionality reduction.

Another possible use of the principal-component representation of $\langle \alpha n l m | \mathcal{X}_j \rangle$ is to obtain a simpler ansatz to further optimize the $\hat{U}$ operator. For instance, one could combine linearly the different components using the $\hat{U}$ operator defined through Eq. (45),

$\langle I J l m | \mathcal{X}_j \rangle = \sum_{I' \alpha J' n} f^{(l)}_{I I' J J'}\, u^{(l)}_{I' \alpha J' n}\, \langle \alpha n l m | \mathcal{X}_j \rangle ,$  (61)

where the scaling coefficients $f^{(l)}_{I I' J J'}$ are determined so as to make the representation better suited to build a regression model for the target property $y$. A systematic exploration of the different possibilities, as well as their benchmarking on different regression problems, is left for future work.

V. CONCLUSIONS
We have introduced a general formulation of the problem of representing atomic structures in terms of a (smooth) atom density, which is independent of the basis that is used to expand it. Starting from a representation of a 3D structure in terms of a superposition of atom-centered functions decorated with elemental kets, we introduce symmetries by formally averaging the feature vectors over the continuous translation and rotation groups. The averaging removes information, but a complete, unique description can be retained by taking tensor products of the ket before computing the integral. Different representations, capturing varying amounts of inter-atomic correlations, can be obtained depending on the combination of tensor products and symmetrized averages.

The framework we introduced provides a unified picture of density-based representations for machine learning of atomic-scale properties, with several popular frameworks emerging by taking different limits, or by using specific basis sets to represent the abstract invariant kets. In particular, using a basis of radial functions and spherical harmonics shows clearly the 1:1 mapping between the symmetrized kets and different flavors of the SOAP kernel. Even alternative schemes that start from rotation- and translation-invariant internal coordinates and proceed to ensure permutation invariance appear to contain comparable information. Physically-motivated extensions of existing frameworks, relying also on real-space formulations that are more directly related to body-order expansions and to the atom correlation functions that appear in classical density functional theory, might provide some leeway to improve the performance of a representation, as shown for instance in Ref. 66. We discussed how several modifications and optimizations can be introduced in terms of operators that couple and scale different channels of the representation, focusing in particular on the SOAP power spectrum representation.
Such modifications can be advantageous as they allow for a reduction in the dimensionality of the problem, and make it possible to incorporate different kinds of chemical insight, effectively generating several "views" of a material that emphasize different kinds of structural and chemical correlations. Seeing many different approaches as alternative implementations of the same family of representations might help coordinate efforts in optimizing the computational efficiency and regression accuracy of density-based models of atomic structures.
ACKNOWLEDGEMENTS
The authors would like to thank Gábor Csányi, Bingqing Cheng and Andrea Grisafi for discussion and comments during the preparation of this work. MC and MJW were supported by the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 677013-HBMAP). FM was supported by the NCCR MARVEL, funded by the Swiss National Science Foundation.

[1] R. Ramakrishnan and O. A. von Lilienfeld, , 182 (2015).
[2] F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Phys. Rev. Lett., 135502 (2016).
[3] A. P. Bartók et al., Sci. Adv., e1701816 (2017).
[4] G. Montavon et al., New J. Phys., 095003 (2013).
[5] J. B. O. Mitchell, Wiley Interdisciplinary Reviews: Computational Molecular Science, 468 (2014).
[6] J. Carrete, W. Li, N. Mingo, S. Wang, and S. Curtarolo, Physical Review X, 011019 (2014).
[7] L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, npj Computational Materials, 16028 (2016).
[8] J. Behler and M. Parrinello, Phys. Rev. Lett., 146401 (2007).
[9] W. J. Szlachta, A. P. Bartók, and G. Csányi, Phys. Rev. B, 104108 (2014).
[10] A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, Phys. Rev. Lett., 136403 (2010).
[11] T. Morawietz, V. Sharma, and J. Behler, Journal of Chemical Physics, 064103 (2012).
[12] R. Kobayashi, D. Giofré, T. Junge, M. Ceriotti, and W. A. Curtin, Phys. Rev. Mater., 053604 (2017).
[13] J. S. Smith, O. Isayev, and A. E. Roitberg, Chem. Sci., 3192 (2017).
[14] A. Khorshidi and A. A. Peterson, Computer Physics Communications, 310 (2016).
[15] M. Gastegger, J. Behler, and P. Marquetand, Chem. Sci., 6924 (2017).
[16] A. Glielmo, C. Zeni, and A. De Vita, Physical Review B (2018).
[17] S. Chmiela et al., Science Advances, e1603015 (2017).
[18] K. Yao, J. E. Herr, and J. Parkhill, Journal of Chemical Physics, 014106 (2017).
[19] A. P. Bartók, R. Kondor, and G. Csányi, Phys. Rev. B, 184115 (2013).
[20] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Physical Review Letters, 058301 (2011).
[21] J. Barker, J. Bulin, J. Hamaekers, and S. Mathias, LC-GAP: Localized Coulomb descriptors for the Gaussian approximation potential, in Scientific Computing and Algorithms in Industrial Simulations: Projects and Products of Fraunhofer SCAI, pages 25-42, Springer, Cham, 2017.
[22] A. Sadeghi et al., Journal of Chemical Physics, 184118 (2013).
[23] L. Zhang, J. Han, H. Wang, R. Car, and E. Weinan, Physical Review Letters, 143001 (2018).
[24] B. J. Braams and J. M. Bowman, International Reviews in Physical Chemistry, 577 (2009).
[25] Z. Xie and J. M. Bowman, Journal of Chemical Theory and Computation, 26 (2010).
[26] B. Jiang and H. Guo, The Journal of Chemical Physics (2013).
[27] K. T. Schütt et al., Phys. Rev. B, 205118 (2014).
[28] M. Hirn, S. Mallat, and N. Poilvert, Multiscale Modeling & Simulation, 827 (2017).
[29] M. Eickenberg, G. Exarchakis, M. Hirn, and S. Mallat, Advances in Neural Information Processing Systems 30, 6522 (2017).
[30] A. V. Shapeev, Multiscale Modeling & Simulation, 1153 (2015).
[31] A. P. Bartók, M. J. Gillan, F. R. Manby, and G. Csányi, Phys. Rev. B, 054104 (2013).
[32] V. L. Deringer and G. Csányi, Phys. Rev. B, 094203 (2017).
[33] R. Jinnouchi and R. Asahi, Journal of Physical Chemistry Letters, 4279 (2017).
[34] W. J. Szlachta, A. P. Bartók, and G. Csányi, Physical Review B - Condensed Matter and Materials Physics, 104108 (2014).
[35] S. Kajita, N. Ohba, R. Jinnouchi, and R. Asahi, Scientific Reports (2017).
[36] A. Grisafi, D. M. Wilkins, G. Csányi, and M. Ceriotti, Phys. Rev. Lett., 036002 (2018).
[37] G. Imbalzano et al., J. Chem. Phys., 241730 (2018).
[38] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
[39] A. Glielmo, P. Sollich, and A. De Vita, Phys. Rev. B, 214302 (2017).
[40] L. Nachbin, The Haar Integral, R. E. Krieger Pub. Co., 1976.
[41] For a generic basis function g, g*g might be a complicated function, but when g is a Gaussian, g*g is simply a Gaussian with double the variance.
[42] S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, Phys. Chem. Chem. Phys., 13754 (2016).
[43] E. Prodan and W. Kohn, Proc. Natl. Acad. Sci. USA, 11635 (2005).
[44] W. Yang, Phys. Rev. Lett., 1438 (1991).
[45] G. Galli and M. Parrinello, Phys. Rev. Lett., 3547 (1992).
[46] S. Goedecker, Rev. Mod. Phys., 1085 (1999).
[47] M. Ceriotti, T. D. Kühne, and M. Parrinello, J. Chem. Phys., 24707 (2008).
[48] This is a consequence of the fact that SO(3) is the product of SO(2) and S².
[49] A. Ziletti, D. Kumar, M. Scheffler, and L. M. Ghiringhelli, Nature Comm., 2775 (2018).
[50] O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, and A. Knoll, International Journal of Quantum Chemistry, 1084 (2015).
[51] F. A. Faber, A. S. Christensen, B. Huang, and O. A. von Lilienfeld, Journal of Chemical Physics, 241717 (2018).
[52] A. P. Bartók, R. Kondor, and G. Csányi, Phys. Rev. B, 184115 (2013).
[53] S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, Phys. Chem. Chem. Phys., 13754 (2016).
[54] A. Thompson, L. Swiler, C. Trott, S. Foiles, and G. Tucker, J. Comput. Phys., 316 (2015).
[55] A. Grisafi, D. M. Wilkins, G. Csányi, and M. Ceriotti, Phys. Rev. Lett., 036002 (2018).
[56] A. Grisafi et al., ACS Central Science, 57 (2019).
[57] D. M. Wilkins et al., arXiv:1809.05337 (2018).
[58] F. Pietrucci and R. Martoňák, J. Chem. Phys., 104704 (2015).
[59] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett., 058301 (2012).
[60] A. Sadeghi et al., J. Chem. Phys., 184118 (2013).
[61] V. M. Panaretos and Y. Zemel, Annual Review of Statistics and Its Application (2019).
[62] A. P. Bartók et al., Science Advances, e1701816 (2017).
[63] B. Huang and O. A. von Lilienfeld, J. Chem. Phys., 161102 (2016).
[64] Note that, apart from an l-dependent scaling, the covariance matrix is just the average of the SOAP power spectrum over the training set.
[65] F. M. Paruzzo et al., Nature Comm., 4501 (2018).
[66] M. J. Willatt, F. Musil, and M. Ceriotti, Physical Chemistry Chemical Physics, 29661 (2018).
[67] N. Artrith, A. Urban, and G. Ceder, Phys. Rev. B, 014112 (2017).
[68] M. Gastegger, L. Schwiedrzik, M. Bittermann, F. Berzsenyi, and P. Marquetand, J. Chem. Phys., 241709 (2018).
[69] H. Huo and M. Rupp, arXiv:1704.06439 (2017).
[70] The sparsification could also be represented explicitly by a Û operator that zeroes out all of the unnecessary components.
[71] M. W. Mahoney and P. Drineas, Proceedings of the National Academy of Sciences, 697 (2009).
[72] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis II, SIAM Journal on Computing, 563 (1977).
[73] M. Ceriotti, G. A. Tribello, and M. Parrinello, J. Chem. Theory Comput.