A Bin and Hash Method for Analyzing Reference Data and Descriptors in Machine Learning Potentials
Martín Leandro Paleico a) and Jörg Behler b)
Universität Göttingen, Institut für Physikalische Chemie, Theoretische Chemie, Tammannstraße 6, 37077 Göttingen, Germany
(Dated: 26 August 2020)
In recent years the development of machine learning (ML) potentials (MLPs) has become a very active field of research. Numerous approaches have been proposed, which make it possible to perform extended simulations of large systems at a small fraction of the computational costs of electronic structure calculations. The key to the success of modern ML potentials is the close to first-principles quality description of the atomic interactions. This accuracy is reached by using very flexible functional forms in combination with high-level reference data from electronic structure calculations. These data sets can include up to hundreds of thousands of structures covering millions of atomic environments to ensure that all relevant features of the potential energy surface are well represented. The handling of such large data sets is nowadays becoming one of the main challenges in the construction of ML potentials. In this paper we present a method, the bin-and-hash (BAH) algorithm, to overcome this problem by enabling the efficient identification and comparison of large numbers of multidimensional vectors. Such vectors emerge in multiple contexts in the construction of ML potentials. Examples are the comparison of local atomic environments to identify and avoid unnecessary redundant information in the reference data sets, which is costly in terms of both the electronic structure calculations and the training process, the assessment of the quality of the descriptors used as structural fingerprints in many types of ML potentials, and the detection of possibly unreliable data points. The BAH algorithm is illustrated for the example of high-dimensional neural network potentials using atom-centered symmetry functions for the geometrical description of the atomic environments, but the method is general and can be combined with any current type of ML potential.

a) Electronic mail: [email protected]
b) Electronic mail: [email protected]

I. INTRODUCTION

Machine learning (ML) has become an important tool for the development of atomistic potentials, with a wide variety of applications in chemistry, physics, and materials science. Machine learning potentials, like many other applications of machine learning algorithms, aim at approximating unknown functions, which in the present case is the multidimensional potential energy surface (PES) of the system of interest as a function of the atomic positions. The required information is obtained from sampling the PES at discrete points, i.e. particular atomic configurations, utilizing comparably demanding electronic structure methods such as density functional theory (DFT). Once constructed, the ML potential can then be used to perform cheap simulations with first principles accuracy for systems of significantly increased size and for extended time scales, to address problems which are inaccessible, e.g., to ab initio molecular dynamics simulations. Many types of ML potentials have been developed in recent years, including different flavors of artificial neural-network based potentials, Gaussian approximation potentials, moment tensor potentials, spectral neighbor analysis potentials, and many others. Apart from reproducing atomic interactions, machine learning methods have also seen increasing applications that attempt to predict derived properties instead of those directly associated with the PES, such as dipole moments, charges, electronegativities, band gaps, spins, and atomization energies.
All these applications of ML algorithms rely on the availability of large reference data sets that are used to train the respective ML method to reliably reproduce the property of interest. Generating these data sets is computationally very demanding, and thus the amount of data should be kept as small as possible, which is a very challenging task. In the present work we address this by introducing the bin and hash (BAH) algorithm, enabling a computationally very efficient analysis of large data sets. This analysis is possible before training of the ML algorithm of choice has been performed, and even before the electronic structure calculations are carried out, which makes it possible to guide the selection of the most important structures.

Data set maintenance and analysis, as well as atomic fingerprint selection, i.e. finding suitable representations of atomic geometric environments, have been active areas of research accompanying the rise in popularity of ML methods. The use of large and increasingly automatically generated data sets and algorithms to programmatically explore PESs has led to the need for tools that can deal with the amount and complexity of data. One such method is the dimensionality reduction algorithm SketchMap, which can be utilized to group structures together into similarity clusters. More direct tools measuring distances in configuration space and structural similarities of solids are also useful for analyzing collections of structures. Previous attempts based on ML descriptors such as SOAPs have also been successful at establishing a similarity measurement algorithm, and recently a more generalized study has been published, looking at the most common ML descriptors and their relative behavior in describing atomic environments, as well as the relationships between property space (in this case energy) and distances in descriptor space.

As an inherent part of most MLP approaches, atomic fingerprint selection has also attracted a lot of attention. In the wider field of machine learning this is done with meta-analysis methods, such as hyperparameter optimization. Unfortunately these methods are usually rather complex and expensive, requiring multiple training and fitting iterations, which precludes their use for large MLP data sets. Methods specifically designed for MLPs also exist that attempt to refine the contents of these atomic fingerprints. Among them we find attempts at utilizing genetic algorithm optimization to select the best fingerprint sets through evolution, or CUR decomposition to select fingerprints through dimensionality reduction.

In this work we use high-dimensional neural network potentials (HDNNPs) as proposed by Behler and Parrinello in 2007 to illustrate our algorithm, but the algorithm is very general and can be used in combination with many other types of ML potentials and atomic environment descriptors. The main idea of the HDNNP approach, which is also used in most other classes of high-dimensional ML potentials, is the construction of the total potential energy E of the system as a sum of atomic energy contributions E_i from all N_atoms atoms in the system as

E = Σ_{i=1}^{N_atoms} E_i .   (1)

These atomic energies depend on the local chemical environments up to a cutoff radius R_c, which has to be chosen large enough to capture all energetically relevant atomic interactions. Typically cutoff values of 6-10 Å are used. The positions of all neighboring atoms in the resulting cutoff sphere must be provided to individual element-dependent atomic neural networks yielding the atomic energies.
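To make Eq. (1) concrete, the following minimal Python sketch (our illustration, not the reference HDNNP code; the toy "networks" are arbitrary stand-ins for trained element-specific atomic neural networks) evaluates the total energy as a sum of atomic contributions:

import numpy as np

def total_energy(environments, atomic_nets):
    """Eq. (1): sum per-atom energies predicted from each atomic environment.

    environments: list of (element, G_vector) tuples, one per atom.
    atomic_nets:  dict mapping element -> callable returning the atomic
                  energy for one descriptor vector (a trained atomic NN
                  in a real HDNNP).
    """
    return sum(atomic_nets[elem](np.asarray(G)) for elem, G in environments)

# Toy stand-in "networks" (hypothetical; real atomic NNs are fitted to DFT data):
nets = {"Zn": lambda G: -1.0 + 0.01 * G.sum(),
        "O":  lambda G: -2.0 + 0.02 * G.sum()}
envs = [("Zn", [0.1, 0.4]), ("O", [0.3, 0.2])]
print(total_energy(envs, nets))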
Many types of descriptors are available in the literature, and the most frequently used type in the context of HDNNPs are atom-centered symmetry functions (ACSFs), which form a vector G_i of input coordinates for each atomic neural network that is invariant with respect to rotation, translation and permutation, i.e. the order of the atoms in the system. A detailed discussion of the functional forms of ACSFs and their properties can be found in Ref. 54, and here we just use them as placeholders for any ordered set of descriptor values that provides a meaningful structural fingerprint of the local atomic environments.

The atomic neural networks represent the analytic functional form of the HDNNP and contain a large number of fitting parameters, the neural network weights, which are optimized in an iterative training process to reproduce a given reference data set of energies and forces for representative systems obtained from electronic structure calculations. Once the HDNNP has been trained using this data, the energies and forces of a large number of configurations can be computed at a small fraction of the computational costs of the underlying electronic structure method, which enables extended molecular dynamics and Monte Carlo simulations of large systems with close to first-principles quality. For all details about the method, the training process and the validation strategies for HDNNPs we refer the interested reader to a series of recent reviews.

The construction of HDNNPs involves the use of large amounts of data, and the generation of the reference electronic structure data often represents the computationally most demanding step. It is therefore desirable to reduce the amount of data as much as possible by only including those structures – or more specifically atomic environments – which are different enough from the data already included in the reference set to justify the effort of an electronic structure calculation. In addition, the training process of the HDNNP also becomes more time consuming with an increasing amount of data. In recent years, active learning has become a standard procedure to identify the most relevant structures. Still, the inclusion of a wide range of structurally different atomic environments in the training process is essential for the construction of a reliable HDNNP, as the underlying functional form is non-physical, and the correct physical shape of the potential-energy surface can only be learned if all of its relevant features are included in the training set. Consequently, for each system a compromise between the effort of constructing large data sets and the accuracy and range of applicability of the HDNNP has to be found.

The use of large amounts of data poses several challenges. First, a set of ACSF descriptors has to be defined for each element in the system to construct structural fingerprints that can be used by the atomic neural networks to construct the energy expression of the HDNNP. These ACSFs can be used for the quantification of the similarity of different atomic environments. Typically, a set of 20-100 ACSFs is used for this purpose, which depend on parameters defining their spatial shape. Second, to keep the data sets small, the inclusion of redundant information has to be avoided, which requires an efficient analysis and comparison of the local chemical environments of the atoms given by the ACSF vectors.
As we will see below, naive pairwise comparisons are not a viable option for the typical data sets consisting of tens of thousands of structures, each containing up to a few hundred atoms. Third, the costs of the reference electronic structure calculations should be kept as low as possible, but numerical noise that can arise, e.g., from loose but time-saving settings of the electronic structure codes must be avoided. Substantial noise in the data represents contradictory information, which prevents a smooth convergence of the fitting process to low root-mean squared errors for the energies and forces.

In this paper, we propose a simple, fast and efficient algorithm based on the well known hash table data structure. The algorithm is described in Sec. II. We use the vector of ACSF values belonging to an atomic environment, the same vector that an atomic neural network in HDNNPs would receive as an input, but we first pre-process it by a bin and hash approach. Binning is described in Sec. II B 3, and the procedure of hashing and the workings of hash tables in Sec. II B 4. This creates a numerically unique representation of each environment, for which searches for repeated representations are fast and scale well with the number of environments under consideration. In addition, this procedure does not depend on the availability of a trained HDNNP, which is an advantage compared to active learning strategies. The procedure is very fast, and we benchmark it in relation to a naive direct comparison approach in Sec. II B 2, with the big O notation scaling discussed in Sec. II C.

In Sec. III, we show results from the application of the algorithm. Concrete timings are presented in Sec. III A, confirming the scaling expected based on theoretical considerations. Section III B demonstrates how the BAH algorithm reproduces distances in ACSF vector space, while Sec. III C shows the behavior of the algorithm when changing the number of binning subdivisions and the ACSF set description of the data set, and how this can be utilized to qualitatively evaluate the suitability of a given ACSF set, without requiring the lengthy process of first fitting a potential. Finally, Sec. III D shows how the method can be easily utilized to find similar atomic environments and contradicting information in a data set.

Overall these applications are examples of the well known and complex problem of efficiently finding distances and nearest neighbors for points in multi-dimensional data. Previous approaches include making use of binary tree data structures such as kd-trees, which can efficiently store data points according to their mutual distance in multi-dimensional space and rapidly reduce a search space due to their binary structure; and dimensionality reduction algorithms such as principal component analysis (PCA) and SketchMap, which instead reduce the size of the space under consideration. All of these algorithms are very powerful and suited for their particular applications, but are often too complex and slow for the current goal. Our BAH approach is fast and simple, and works in principle for any dimensionality. It simplifies the process of dimensionality reduction by performing a reduction evenly across the coordinate space instead of centering on the most important directions like PCA and SketchMap.

FIG. 1. Stacked histogram plot of the values of the first 10 radial ACSFs in the ZnO data set describing the atomic environments of the oxygen atoms.
II. THE BIN AND HASH ALGORITHM

A. Description of the Algorithm
Here, we will first give a general overview of the bin and hash algorithm, summarized as pseudocode in Code Block 1. The details of each of its components will be discussed in the following sections. As an example system we choose zinc oxide.

divs = number of subdivisions in ACSF space
for atom_env_i in dataset:
    for acsf_j in acsf_set:
        calculate symmetry function vector Gi = {Gj}
find Gjmax and Gjmin across each ACSF component Gj
initialize empty hash table Ht
for each Gi vector:
    bin Gi vector to Bi = {Bj},
        Bj = divs * (Gjmax - Gj) / (Gjmax - Gjmin)
    calculate hash Hi = hash(Bi)
    if Hi not in Ht:
        store it: Ht[Hi] = i
    else:
        count as collision: ncolls += 1
        add to the existing record in the hash table:
            Ht[Hi].append(i)

Code Block 1. Pseudocode for the bin and hash algorithm.
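For concreteness, the following is a minimal runnable Python sketch of Code Block 1 (our own illustration, not the implementation of Sec. II D; the function name bin_and_hash and the data layout are assumptions). Python's built-in hashing of the binned integer tuples plays the role of steps 3 and 4:

from collections import defaultdict

def bin_and_hash(acsf_vectors, divs):
    """Bin each ACSF vector to integers and collect colliding environments.

    acsf_vectors: list of equal-length lists of floats (one vector G_i per
                  atomic environment); divs: number of subdivisions.
    Returns the hash table (binned tuple -> list of environment indices)
    and the total number of collisions.
    """
    # Per-component maxima and minima across the whole data set.
    gmax = [max(col) for col in zip(*acsf_vectors)]
    gmin = [min(col) for col in zip(*acsf_vectors)]

    table = defaultdict(list)  # Python's dict hashes the tuple keys for us
    n_coll = 0
    for idx, G in enumerate(acsf_vectors):
        # Eq. (2): map each float component to an integer bin index.
        B = tuple(round(divs * (hi - g) / (hi - lo)) if hi > lo else 0
                  for g, hi, lo in zip(G, gmax, gmin))
        if table[B]:
            n_coll += 1  # same binned vector seen before: a collision
        table[B].append(idx)
    return table, n_coll

vectors = [[0.10, 2.0], [0.11, 2.0], [0.90, 5.0]]
table, n_coll = bin_and_hash(vectors, divs=10)
print(n_coll, [ids for ids in table.values() if len(ids) > 1])

With 10 subdivisions the first two (nearly identical) toy vectors fall into the same bucket, while the third does not.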
A typical distribution of ACSF values is presented in Fig. 1 in the form of a stacked histogram plot, for the first 10 ACSFs of a small data set containing 1192 configurations of a ZnO(10¯10) surface slab with in total 75360 atomic environments. The structures included in the data set consist of bulk cut slabs, relaxed slabs, and configurations extracted from MD trajectories, with different numbers of layers. Overall, 58 distinct atom-centered symmetry functions are used per element to describe the atomic environments, and the parameters defining the ACSFs are given in the supporting information. We can see that even for such a relatively small data set the distribution of data already has a rather complex form.

The individual steps forming the BAH algorithm are illustrated in Fig. 2. Starting from the histogram of ACSF values shown schematically, in a first step the range of each ACSF is split into a predefined number of subdivisions (typically some power of ten, see Sec. III C), taking into account the maximum and minimum values present in the data set. This transforms the ACSF vector G_i for a given atomic environment i from a float-based continuous representation to an integer-valued binned vector B_i of the same dimensionality (step 2). This binned vector is then hashed, generating the one-dimensional hash key H_i (step 3), which is then used for constructing a hash table H_t (step 4). The binning achieves two goals at once: getting rid of the floating point representation, which does not allow for an accurate transformation to a hash, since the hash would be numerically very sensitive to the round-off errors of the floating point values, and binning similar ACSF vectors to the same B_i vector, finally yielding the same hash key. The step of hashing the integer vectors into hash buckets enables fast and efficient storage and lookup for large data sets. Both parts of the algorithm – binning and hashing – are thus vital for its performance.

FIG. 2. Illustration of the BAH approach: Each atomic environment in this example is characterized by a two-dimensional ACSF vector G = (G_1, G_2). In step 1, the histograms corresponding to the ACSFs are generated as a visualization aid. The values of G_1 and G_2 are highlighted by the crosses for one particular example environment. This ACSF vector is then binned to a pair of integer values, forming the binned vector B = (B_1, B_2) in step 2. In step 3 the hash H(B) of this binned vector is calculated. Finally, in step 4 this hash is used (directly or indirectly) to index into the hash table, and the atomic environment is added to a counter for similar environments.

Any G_i vectors that result in a hash collision, i.e. that end up in the same hash table bucket, are deemed to be similar, and – depending on the number of subdivisions – are usually exactly the same apart from floating point round-off errors (see Sec. III B). The algorithm keeps track of the total number of collisions recorded for a data set and the maximum number of collisions over all the buckets. Additionally, every time a collision is detected the ID of the colliding atomic environment is stored in the hash table in the corresponding bucket, which makes it possible to retrieve the colliding environments afterwards for analysis.

An obvious problem of this algorithm is that environments might be very close to the border between two bins. Given two very similar environments, both could be assigned to different bins resulting in completely different hash values, although the atomic configurations are essentially identical. In this case, two environments that should lead to a collision do not.
A straightforward solution to this problem is to use the algorithm with multiple different divisions of the ACSF domain, and to compare the obtained binnings (see the sketch below). In this way it can be excluded that very similar environments are converted to different hash keys. Still, even when using multiple binnings, the algorithm remains computationally very efficient.
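One possible realization of this multiple-binning check, as a sketch reusing the hypothetical bin_and_hash function from above: collision pairs are collected for several division counts and their union is taken, so that a pair split by a bin border for one choice of divisions is still recovered by another:

from itertools import combinations

def border_tolerant_collisions(acsf_vectors, divisions=(9, 10, 11)):
    """Pairs of environments sharing a bin for at least one subdivision choice.

    Two nearly identical vectors sitting on a bin border for one division
    count will usually share a bin for a slightly different count, so the
    union over several binnings recovers such near-collisions.
    """
    pairs = set()
    for divs in divisions:
        table, _ = bin_and_hash(acsf_vectors, divs)  # sketch from Sec. II A
        for ids in table.values():
            if len(ids) > 1:
                pairs.update(combinations(sorted(ids), 2))
    return pairs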
B. Analysis of the Algorithm

Next, we analyze the scaling of the algorithm. This scaling is of particular relevance given the sheer size of the typical data sets used in the construction of ML potentials. Many other more sophisticated algorithms work perfectly well when tested on small example cases, but scale very inefficiently for realistic data sets containing tens or even hundreds of thousands of structures, each consisting of many atomic environments. Initially, we comment on the possibility of utilizing neighbor lists. Then, we describe the naive approach of a brute force comparison as a reference, before discussing the behavior of the binning and hashing operations. Finally, we derive the scaling in big O notation.
1. Cell-Based Neighbor Lists
Efficient distance calculation is a common problem in molecular dynamics simulations, since most force fields depend on interatomic distances in one way or another. A simple and common approach is to utilize cell lists, where the system is divided into smaller cubic cells, and atoms are assigned to these cells according to their coordinates. If the size of the cells is chosen properly with respect to the cutoff radius of the potential, checking for neighbors becomes simple: for each atom only atoms within the same cell and the directly neighboring cells need to be considered. It is possible to envision taking this approach to further dimensions, where we would now create cells not in coordinate space but in the higher-dimensional ACSF space. Unfortunately, this simple approach is unfeasible as the computational costs increase rapidly with dimensionality: in a one-dimensional system we need to check the central cell plus two neighbor cells, in two dimensions it is the central cell plus eight cells organized in a square, and so on, with the total number of cells to be checked scaling as 3^D, with D the dimensionality of the space. This is clearly unfeasible for an ACSF set whose dimensionality starts at 20 but can contain as many as 100 ACSFs per atomic environment, and even cases with many hundred functions have been reported.

In conclusion, cell-based neighbor lists efficiently reduce the degrees of freedom of the problem by creating cells, which we essentially also use for the binning step in the BAH algorithm. However, this approach rapidly fails when used in higher dimensions, which we avoid in our BAH algorithm by only finding points in ACSF space that are in the same bin/cell, and by utilizing hash tables to perform this check very efficiently using only a one-dimensional property for the comparison.
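The combinatorial growth is easy to make concrete (a toy illustration):

# The number of cells to inspect (central cell plus all direct neighbors)
# grows as 3^D with the dimensionality D of the descriptor space:
for D in (1, 2, 3, 20, 58, 100):
    print(f"D = {D:3d}: {3 ** D:.3e} cells to check")

Already at D = 20 more than three billion cells would have to be inspected per lookup.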
2. The Naive Approach
The naive approach to comparing atomic environments is to compare the ACSF vectors for each pair of atoms directly. The only obvious simplification is that only atoms of the same element need to be compared. The performance of this procedure is very poor, since it scales linearly with the number of ACSFs and quadratically with the number of environments in the data set: for environment number N, we need to compare it with all the previous N − 1 environments. The BAH algorithm avoids this quadratic scaling by means of a hash table, which offers a lookup time that does not depend on the amount of data already stored in the table. Binning is needed before reaching this point, since similar floating point numbers would have very different hash values without a preparatory discretization step.

3. Binning

Consequently, binning is the first step in the algorithm. The maximum and minimum values of each ACSF depend on the available data set and are known beforehand. For each ACSF, the resulting range is divided into an arbitrary number of intervals and the binning is done according to
B_j = nint( divs · (ACSF_max − ACSF_val) / (ACSF_max − ACSF_min) ) ,   (2)

where B_j is the bin value for the j-th ACSF; nint is the nearest integer function, i.e., a round-off to the closest integer; and ACSF_max, ACSF_min, and ACSF_val are the maximum, minimum, and current value of the ACSF under consideration, respectively. The number of intervals is kept the same for all the ACSF types, although some of them might have larger or smaller ranges (see for example Fig. 1). A possible improvement to the binning procedure would thus be to aim for a certain density of ACSF values in each division, by tailoring the length and number of divisions to each ACSF.

This binning achieves multiple goals. In the first place, it transforms floating point numbers, which are imprecise and hard to hash, into integers. Floats should not be hashed directly because small changes in the accuracy of the floating point number representation, such as the limited precision when reading it from a file or small deviations resulting from rounding errors, give rise to very different hash values. Integers, on the other hand, are easy to convert to a hash.

Additionally, binning provides a sense of “distance” in the data set. Calculating distances directly from the difference between ACSF vectors suffers from the same scaling problems as the naive approach, and the usefulness of a Euclidean distance decreases with the size of the vector, as it becomes less unique and loses meaning as dimensionality increases. As the bins get smaller, fewer ACSF vectors will coincide, making the algorithm more sensitive and only leaving those environments that are more and more similar in the same bucket.

Binning on its own does not solve the problem of the naive approach, since we would still need to do an all-against-all comparison of the individual bin vectors, with integers instead of floats. To solve this, a hash table is required, as described in the following section.
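The float round-off problem that motivates the binning, and how Eq. (2) absorbs it, can be demonstrated in a few lines of Python (a toy illustration of ours, with an arbitrary range and number of divisions):

a, b = 0.1 + 0.2, 0.3          # differ only by floating point round-off
print(a == b)                  # False
print(hash(a) == hash(b))      # False: round-off yields unrelated hashes

divs, lo, hi = 1000, 0.0, 1.0  # arbitrary range and subdivisions
bin_of = lambda x: round(divs * (hi - x) / (hi - lo))  # Eq. (2)
print(bin_of(a) == bin_of(b))  # True: binning absorbs the round-off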
4. Hashing and Hash Tables

Hash functions are a family of functions that can map data of arbitrary size to data of fixed size. In effect, a hash is a one-way function that can assign an integer to any data type. This assignment is not unique, as two objects that are different can result in the same hash value, i.e. a hash collision. The conversion is usually non-reversible, such that if the hash is known, it is not possible to reconstruct the original object except by brute force trial and error, comparing the resulting hashes. If two objects share the same hash (a “hash collision”), they will usually be either exactly equal or very different, which is a desired property in some applications. Small changes to the input object will result in very different hash values, so the hash value in principle cannot be used directly as a measure of distance in input space. Hash functions are used in a variety of fields, such as in cryptography, where passwords are usually stored pre-hashed instead of in plaintext, or in the realm of data validation and proofing, such as in checksums, credit card numbers, bank routing numbers, ISBN book numbers, or blockchains. Hash functions make heavy use of the modulo function and byte-shifting operations.

The properties of a hash function allow us to create a hash table. A hash table resembles an array, but instead of assigning positions sequentially as in a normal array, positions in the hash table's “buckets” are assigned using the hash function. In effect, the hash function is used to index the hash table array using

index = hash % array_size ,   (3)

where “index” is the index to be used when accessing the hash table array, “hash” is the hash function value of the object of interest, “array_size” is the size of the array holding the hash table, and % is the modulo operator. The hash will always index an array position, no matter the size of the array.

One apparent problem arises here: the binning can use a large number of subdivisions per ACSF. For the usual dozens to hundreds of symmetry functions required for a HDNNP data set, this amounts to a number of possible bin vectors that grows in a combinatorial fashion. How then is it possible to map all the possible bin vectors into a hash table of restricted size? As mentioned above, hash functions map larger spaces into smaller ones, so collisions are unavoidable. Various solutions exist for this problem, which are implementation dependent. One possibility, known as separate chaining, is to store all the collided keys in the same bucket as a list. Assignment to the hash table then consists of rapidly finding the correct bucket as in Eq. 3, followed by a slower (but short) search through the list of keys in this bucket. Another possibility, known as open addressing, is to assign keys to the first open bucket address if the current one is already occupied. Assignment of a new key then consists of using Eq. 3 to find an initial bucket (a fast operation), and then continuing through the bucket addresses until an unoccupied address is found (a slower but short process). Whatever the implementation utilized for collision resolution, it inflicts a computational overhead on all hash table operations, but if the number of collisions is kept low, this is not a problem. In normal operation not every possible bin vector will be encountered, since the data utilized to construct a HDNNP is not completely random, so this is not expected to involve much overhead.

An interesting feature of hashes is that this ansatz results in a constant (when the number of hash collisions is not too high) search, assignment and insertion time of data into the table. In a normal array, if we want to check whether a new object is already present in the array, we need to traverse the array and compare element by element until it is either found and we stop the search early, or we reach the end of the array. In a hash table, we instead calculate the hash of the object and immediately check the corresponding position in the table.

This efficiency comes at the cost of some overhead: more memory is required for storing the hash table, since many buckets might be empty if the hash table is constructed with sequential memory positions; the hash has to be precomputed for objects going into the table, although hash calculations are usually fast; and hash collisions have to be dealt with when they happen if we want to maintain unique buckets. Due to their properties, hash tables are a basic data structure in computer science, often utilized for efficient storage and retrieval of data.
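As a toy illustration of Eq. (3) with separate chaining (our own minimal sketch, not the implementation used in this work):

class ChainedHashTable:
    """Minimal separate-chaining hash table illustrating Eq. (3)."""

    def __init__(self, size=1024):
        self.size = size
        self.slots = [[] for _ in range(size)]  # one chain per array slot

    def _slot(self, key):
        return hash(key) % self.size            # index = hash % array_size

    def add(self, key, value):
        """Store value under key; return True if the key was seen before."""
        chain = self.slots[self._slot(key)]
        for k, vals in chain:                   # short linear scan of the chain
            if k == key:
                vals.append(value)
                return True                     # collision: key already present
        chain.append((key, [value]))
        return False

table = ChainedHashTable()
print(table.add((3, 7, 1), "env 0"))   # False: new binned vector
print(table.add((3, 7, 1), "env 1"))   # True: same binned vector collides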
A final advantage of hash tables for the use in this work is that they can be easily stored in a text file for future use. This way, a data set can be preprocessed into a hash table, and future structures can easily be compared against this record to detect repeated configurations. To store the hash table, all that is needed is to write the unique binned integer vectors to the file (in an arbitrary order), optionally with a numeric ID associated to the structures in the data set that fall into that bucket of the table for easy identification. To reconstruct the table, these binned arrays are read and used as members of a new table.

Algorithm            Scaling
Naive Comparison     O(M * N^2)
Binning              O(M * N)
Hashing              O(M * N)
Hash Table Lookup    O(N)

TABLE I. Big O notation scaling of the different algorithms under consideration. N is the number of atoms, corresponding to the number of atomic environments in the data set. M is the number of functions in the atom-centered symmetry function vector.

C. Scaling
Next, we look at the scaling of the different parts of the algorithm in big O notation. This is important to realize why the naive approach soon becomes unfeasible and how the BAH algorithm improves on it. The results are summarized in Table I. We will consider the case of searching once through a complete data set, attempting to find repeated atomic environments.

In the following discussion, N is the number of environments in the data set, i.e., the total number of atoms in all structures. M is the number of functions in each ACSF vector, corresponding to the dimensionality of our problem. We note that atoms of the same element always have the same ACSF sets, but this is not necessarily true for different elements. The scaling with respect to N can be more important than that with respect to M, since the number of ACSFs in a HDNNP is usually less than 100 per element for most systems, while the number of atomic environments can reach millions and has no upper bound.

The following scaling is observed:

• Naive comparison and lookup: Comparison scales at worst as O(M), since we need to compare each element in one ACSF vector to the corresponding element in another ACSF vector, but we might end early if a mismatch is detected. We then need to compare environment 1 with the next N − 1 environments, environment 2 with the next N − 2 environments, and so on. This is a mathematical series that in the end scales as O(N^2). Both parts of the algorithm together scale as O(M * N^2).

• Binning: Binning scales with the number of elements in each ACSF vector – since we need to bin each element individually – as O(M). Additionally, it has to be done for each of the N atomic environments (O(N)). Combined, it scales as O(N * M). This operation is usually very fast.

• Hashing: Hashing scales weakly with the size of the object being hashed (O(M)). There is some dependence on the specific implementation of the hash function (see Sec. III A), and the hashing needs to be repeated for each ACSF vector to be compared (O(N)). It is a comparably slow operation compared with the straight division in binning.

• Hash tables: Addition of data to a hash table and lookup are constant with respect to the size of the stored data set (which would be proportional to N), except for hash collisions (O(1)). This is where the main time saving comes from. We have to repeat this N times, once per hashed array, resulting in a scaling of O(N).

Now we can estimate the total processing times. The naive case is simple: we need to perform M * N^2 operations to process the whole data set. For the BAH algorithm, we need to first bin the whole data set, then hash the resulting binned arrays, and finally store the result in the hash table, detecting a collision if present. All of these times are additive since they are independent sequential operations. Putting this all together, we obtain

t_naive = k_comp and lookup * O(M * N^2)
t_BAH = k_binning * O(M * N) + k_hashing * O(M * N) + k_hash lookup * O(N) ,   (4)

where each k is the timing constant to perform that operation once, which depends on the actual implementation of each algorithm, the programming language of choice, and the CPU architecture. Notice that the naive approach shows the worst scaling, since it scales as N^2, with N typically reaching into the millions. The BAH algorithm, on the other hand, consists of three linearly scaling additive components. This is tested in Section III A for an illustrative example, where the different timing constants are estimated for a Python implementation; a sketch of such a benchmark follows.
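A minimal timing sketch of ours, reusing the hypothetical bin_and_hash function from Sec. II A (absolute timings are machine dependent; the data are random stand-ins for ACSF vectors):

import random
import time

def naive_count(vectors):
    """All-against-all comparison of descriptor vectors: O(M * N^2)."""
    n_coll = 0
    for i in range(len(vectors)):
        for j in range(i):
            if vectors[i] == vectors[j]:
                n_coll += 1
                break
    return n_coll

for n in (1000, 2000, 4000):
    data = [[round(random.random(), 2) for _ in range(10)] for _ in range(n)]
    t0 = time.perf_counter()
    naive_count(data)
    t1 = time.perf_counter()
    bin_and_hash(data, divs=100)
    t2 = time.perf_counter()
    print(f"N = {n}: naive {t1 - t0:.3f} s, BAH {t2 - t1:.3f} s")

Doubling N roughly quadruples the naive time but only doubles the BAH time, in line with Table I.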
D. Implementation

The algorithm has been implemented in Python 3.5, using the dict data structure, which is a hash table with the possibility to associate arbitrary data to each hash bucket.
TABLE II. Estimated timing constants of Eq. 4 (k_comp and lookup, k_binning, k_hashing, k_hash lookup, and the global constant k_BAH) for M = 10 (scaling is assumed linear for other M values, in the cases where relevant). Units are in seconds required per operation (s/op). The inverse constant is also given, providing the number of operations per second (op/s). Note that the naive algorithm only seems “faster” because it is expressed in terms of op^2.

The set data structure is similar and can also be used, but can only store the hashed object and no other associated data. The algorithm can also be implemented easily in many other languages, since hash tables are a widely used data structure, and only pointers or allocatable arrays are needed to implement them from scratch. The dict object in Python already incorporates the step of hashing the data, so no explicit hash function is required in this case, and the actual implementation of the hash function is not relevant to the result as long as it avoids as many spurious collisions as possible.

The algorithm is straightforward to parallelize if this is required for larger data sets, or for non-synchronous processing, e.g. using a compute cluster associated with a database. This is due to the fact that hash tables can be easily combined. A central master process can hold the copy of the hash table and dispatch binning and hashing operations to the slave processes; or each slave process can hold its own hash table and report back to a central process, which combines the slave sub-tables into a master hash table, as sketched below.
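A sketch of this combination step (our own illustration; process management and communication are omitted):

from collections import defaultdict

def merge_tables(worker_tables):
    """Combine per-worker hash tables into one master table.

    Each worker bins and hashes its share of the environments independently;
    buckets with the same binned key are simply concatenated.
    """
    master = defaultdict(list)
    for table in worker_tables:
        for key, ids in table.items():
            master[key].extend(ids)
    return master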
FIG. 3. Plots of the timing of the different algorithms with increasing system size. a) Naive lookup vs. squared size of data set. b) Different parts of the BAH algorithm vs. size of data set. c) All algorithms together in log scale for comparison. d) Relative speedup or time gain of the different parts of the BAH algorithm compared to the naive approach, calculated as t_algo / t_naive, with t_algo the timings of the different parts of the algorithm from b). e) Scaling of the hash calculation with ACSF vector size, per 100000 operations. f) Behavior of hash table operations with data set size, per 100000 operations.

III. RESULTS

A. Performance and Timings

For illustrative purposes, we present the timings and scalings of the naive and BAH algorithms on randomly generated values, as obtained from Python 3.5 on an Intel Core i5-5300U CPU at 2.30 GHz. Fig. 3 plots the behavior of the different algorithms for increasing data sets.

As can be seen in Fig. 3a, the naive algorithm for the comparison of the atomic environments scales with the square of the data set size, while the BAH algorithm in Fig. 3b scales linearly. In the logarithmic scale of Fig. 3c, combining the data of panels a) and b), it can be clearly seen that the costs of the naive algorithm increase much faster than those of the BAH algorithm. Fig. 3d shows the speedup (the relative time gain, t_algo / t_naive, for any of the sub-algorithms involved in BAH) between the BAH and the naive algorithms. Notice that this speedup increases as the data set size increases, since the naive approach scales as the square of the data set size but BAH scales linearly. Consequently, the larger the data set becomes, the faster the BAH approach becomes with respect to the naive approach. Fig. 3e shows that the hashing algorithm scales linearly with the size of the ACSF vector under consideration, but is extremely fast for typical vector dimensionalities. Finally, Fig. 3f confirms that, as expected, operations regarding the hash table object – assignment to the hash table, and looking up whether an object belongs to the hash table – remain constant in time with data set size.

From these analyses we can estimate the different proportionality constants of Eq. 4; they are compiled in Table II. Notice that the naive and BAH halves of the table have different units. The fastest part of the BAH algorithm is the hash calculation (k_hashing), while the bottleneck in the current implementation seems to be the binning (k_binning). This is probably due to the division and nearest-integer rounding operations involved in binning, and it could probably be improved with some vectorization or better numerical libraries. Not considered here is the I/O required to read ACSF data from a file, which might become a more serious bottleneck for larger data sets, but is however common to both algorithms. The values obtained here represent only an approximate order of magnitude, since they will change significantly for different implementations and computing architectures.
B. Analysis of the Distance in Symmetry Function Space

An interesting question is how the algorithm reflects distances in ACSF space, since some information is lost in the process of binning and hashing of the atomic environment vectors. Hashes themselves are not a useful measure of distance, since the resulting hash is not smoothly continuous with respect to its inputs, but we would expect similar ACSF vectors to end up in the same bucket. A reliable binning of only similar structures is an important condition for the BAH method to be useful. For this purpose, we now investigate all the ACSF vector distances obtained for atomic environments that fall in the same bucket, using different subdivisions of the ACSF space. We define a relative distance in ACSF space, δ_ij, between atoms i and j of the same element, as

δ_ij = |G_i − G_j| / (0.5 (|G_i| + |G_j|)) ,   (5)

where G_i and G_j are a pair of symmetry function vectors corresponding to atomic environments that ended up in the same bucket, and which are thus similar for the BAH algorithm.

FIG. 4. a)-d) Histograms of the typical intra-bucket ACSF relative distance (δ) values for four different numbers of subdivisions in the ZnO slab data set. Other intermediate subdivisions exhibit similar behaviors. The counts axis is logarithmic for better visualization.

We plot a histogram of the calculated distances in Fig. 4 for different subdivision numbers. Most of the distances in the histogram are close to zero, as expected. Notice that as we increase the number of subdivisions, the maximum intra-bucket distance drops quickly due to the more stringent criterion for structural similarity in the binning process, becoming close to the floating point noise (either due to the limited precision of floating point numbers in a computer representation, a.k.a. the “machine epsilon”, or the limited precision of data such as coordinates and ACSF values held in text format) for the maximum number of subdivisions, such that the differences for many subdivisions are probably due to round-off errors and float-to-string conversions rather than significant distances in ACSF space. Consequently, the histograms show that the BAH algorithm is indeed closely correlated to distances in ACSF space, up to a given maximum distance depending on how the multi-dimensional space is subdivided for the binning step.

FIG. 5. Maximum and average intra-bucket relative distances for the histograms in Fig. 4 versus number of subdivisions, in log scales. Notice that they follow approximately linear relationships; trendlines with corresponding fitting equations are included.

Interestingly, as shown in Fig. 5, the maximum and average δ obtained from these histograms follow a linear relationship with the number of subdivisions on a double logarithmic scale. Therefore, changing the subdivisions parameter allows us to fine-tune the maximum detected atomic environment distance in a predictable way.

Given this behavior of the distances in ACSF space, it is also of interest to study the corresponding behavior of the properties associated with each atomic environment, such as the atomic forces.
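Eq. (5) and the intra-bucket distance analysis translate directly into code (a sketch of ours, operating on a hash table as produced by the hypothetical bin_and_hash function from Sec. II A):

from itertools import combinations

import numpy as np

def rel_distance(Gi, Gj):
    """Relative distance in ACSF space, Eq. (5)."""
    Gi, Gj = np.asarray(Gi, float), np.asarray(Gj, float)
    return (np.linalg.norm(Gi - Gj)
            / (0.5 * (np.linalg.norm(Gi) + np.linalg.norm(Gj))))

def intra_bucket_distances(table, vectors):
    """All pairwise delta values within each colliding bucket (Fig. 4 data)."""
    return [rel_distance(vectors[i], vectors[j])
            for ids in table.values() if len(ids) > 1
            for i, j in combinations(ids, 2)]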
FIG. 6. Difference in force magnitude vs. the ACSF relative distance, δ, for different subdivisions of the BAH algorithm applied to the ZnO slab data set. The points present in each subplot are not always the same, since the plots are generated from environments that collided for a given number of subdivisions. Notice the difference in the scale of the X and particularly the Y axis for a) when compared to b)-d), and the force spread for structures with δ_ij ≈ 0.

In Fig. 6 we plot the difference in force magnitude vs. the ACSF relative distance, δ, for different subdivisions. As shown in a), there is a relationship between the two quantities, since one would expect that atoms whose environments/ACSF vectors are similar should also present similar forces. Despite this, the relationship is not strong, since distances in “force space” do not necessarily transfer linearly into ACSF space. As the number of divisions increases and the force vectors considered correspond to closer environments, the force distance quickly falls. In the end (panel d), this force distance corresponds to the numerical noise present in the reference DFT data, since the environments detected are actually identical (up to numerical noise).

FIG. 7. Panels a) and b) show the total and maximum number of hash table collisions, i.e., configurations that hash into the same bucket due to similarity of their ACSF vectors, vs. the number of ACSFs, for different binning divisions. Panels c) and d) show the same properties as a function of the number of binning divisions, for different numbers of ACSFs.
C. Results for Different Divisions and Symmetry Functions
An interesting question is how the resolution power of the algorithm, i.e., the ability to differentiate ACSF vectors, changes as we increase the number of binning subdivisions, and as we change the ACSF descriptor set itself. For this purpose, we have analyzed the ZnO(10¯10) slab data set.

A count of collisions was performed on this data set, which as described before occur when two environments end up in the same hash table bucket due to their binned vectors being the same, which implies that their original ACSF vectors were at least similar. We keep track of the total number of collisions, and the maximum number of collisions in a single bin, for different divisions and an increasing ACSF set.

We would expect both the total and the maximum number of collisions to go down as both the divisions and the number of ACSFs increase, since more divisions mean that environments need to be more similar in ACSF space to collide (see Sec. III B), and more ACSFs lead to a more granular description of each environment. Eventually, this count converges as we are left with only the environments that are exactly the same, which can happen in a data set due to repeated parts of a configuration, for example if parts of a slab far away from a chemically modified region remain essentially constant. This is in fact found in Fig. 7. Here we have performed the BAH analysis on an increasing number of ACSFs, in the order presented in the supporting information.

In this figure we note that in a), collisions go down extremely quickly as we increase the ACSF descriptor set, and then plateau with a slight downward trend that is hard to observe due to the scale of the plot. The line with 10 divisions seems to offer the most granularity, showing changes across the whole ACSF set under consideration. Being able to differentiate chemical environments is a necessary (but not sufficient) condition for a good HDNNP fit, in which case the BAH algorithm could be utilized to identify a minimum floor for the size of the ACSF set.

At this point, the question arises which subdivision range is “best” to describe a given data set, and whether this is actually dependent on the specific data set. As can be seen from Fig. 5, the number of subdivisions roughly corresponds to the symmetry function space distance between the collided atomic environments. As such, the “right” subdivision range depends on whether we want to detect environments that are only roughly similar or exactly the same, and there is not a single ideal value. For the type of analysis presented in Fig. 7, a lower number of subdivisions provides a more granular behavior in the number of collisions vs. symmetry functions utilized, which results in an easier to analyze trend. For detecting contradictions (see Sec. III D) we require environments that are either extremely similar or exactly the same, in which case the upper range of subdivisions is better suited.

Whether the number of subdivisions required depends on the specific data set is harder to evaluate. Since our data sets are derived from physically “reasonable” configurations corresponding to chemical systems, they share roughly the same properties, with some differences depending on the involved elements, states of matter present, energy ranges covered, etc. The parameters of the trendlines in Fig. 5 might depend on the specific composition of the data in the data set, but as long as the relationship with ACSF space distance remains, the specific parameters are not crucial.
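The two counters used for Fig. 7 can be read directly off the resulting hash table (a small sketch of ours):

def collision_statistics(table):
    """Total and maximum number of collisions in a BAH hash table.

    A bucket holding k colliding environments contributes k - 1 collisions;
    the maximum is taken over all buckets.
    """
    sizes = [len(ids) for ids in table.values()]
    total = sum(s - 1 for s in sizes if s > 1)
    largest = max(sizes, default=0)
    return total, largest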
In the end, no specific number of subdivisions is ideal for every situation, and this has to be tested with each data set and adapted to each desired analysis, but the BAH process is so fast that binning a data set multiple times is not a problem. Our recommendation is to test three widely separated orders of magnitude of subdivisions, and to refine according to the results.

D. Comparison of Atomic Environments and Conflicting Information
FIG. 8. Force components and force vector magnitude for 22 environments found in a collision bucket. Note that although the ACSF vector for all environments is identical, there are slight differences in force values arising from numerical noise in the DFT calculations.
The result of running the BAH algorithm is a list of environments that fall into the same bucket. That is, we obtain a list of collisions representing structurally similar atomic environments as defined above. This is valuable information and can be used to predict whether a new configuration obtained from a simulation employing the HDNNP is sufficiently different from the available data to justify an inclusion in the reference data set to refine the potential. All the atomic environments in a large number of structures obtained in long validation simulations can be screened in this way, and for a most efficient use of subsequent electronic structure calculations it is possible to identify those structures from this pool in which the highest fraction of environments is sufficiently different from the existing reference data.

Another possibility is the search for contradictions in the data set. Contradictions in this case means atoms whose ACSF sets are similar, but whose derived properties (any per-atom predicted property, such as force, spin, charge, etc.) differ by more than an acceptable threshold. This could be due to a too small ACSF set or cutoff radius of the ACSFs, which does not allow chemically different atomic environments to be correctly distinguished, due to the neglect of long-range interactions beyond the cutoff radius, or due to incorrect electronic structure data resulting, e.g., from a poor convergence level. Contradictions are detrimental to the fitting process, since in case of conflicting data the HDNNP cannot reach a high fitting accuracy.

If we apply this analysis to our data set, with a large number of binning divisions we find that the bucket with the most collisions contains 22 environments. The ACSF vector of these configurations is identical, but plotting their DFT force components and magnitudes results in Fig. 8. We can see that the forces are not exactly identical, but they are within the expected error margin for the HDNNP, i.e. below about 100 meV/Bohr. In this case, no contradiction is detected, but in other situations we found structures that had not properly been converged for various reasons. Identifying and eliminating these data substantially improved the HDNNPs in this case. For larger data sets, the points within buckets could be automatically analyzed, and a contradiction warning raised if the force difference is above a given threshold, as sketched below.
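Such an automated check could look as follows (a sketch of ours; the force-array layout and the threshold value are assumptions, guided by the ~100 meV/Bohr margin quoted above):

import numpy as np

def find_contradictions(table, forces, threshold=0.1):
    """Flag buckets whose supposedly similar environments disagree in force.

    forces: array of shape (n_envs, 3) holding the reference force vectors;
    threshold: acceptable spread in force magnitude within one bucket
               (on the order of the expected HDNNP error margin).
    """
    suspicious = []
    for ids in table.values():
        if len(ids) < 2:
            continue
        mags = np.linalg.norm(np.asarray([forces[i] for i in ids]), axis=1)
        if mags.max() - mags.min() > threshold:
            suspicious.append(list(ids))  # conflicting information found
    return suspicious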
IV. SUPPORTING INFORMATION

In the supporting information we present:

• A list of ACSF parameters for the studied ZnO slab data set.

• The code utilized to perform the scaling tests in Sec. III A.
V. CONCLUSIONS
In this work we have presented the bin and hash method, which allows a computationally very efficient comparison of the large numbers of geometric atomic environments that are used in the construction of modern machine learning potentials. In the case of high-dimensional neural network potentials, which we use as a typical example here, these environments are usually described by vectors of atom-centered symmetry functions. We show that the ability of the method to identify similar atomic environments can be systematically controlled by the number of subdivisions used in the binning process of the ACSF vectors, but the large number of alternative descriptors proposed in the literature is equally applicable.

The method is fast, simple and robust, with many applications in the construction of machine learning potentials. One example is the identification of redundant atomic environments in the reference data sets used for the construction of the potential, as a basis for the decision which structures should be included in the training set. This is an essential step, as a systematic coverage of the configuration space is very important for obtaining reliable potentials, while an excessive amount of data would render the construction and use of the potentials unfeasible. Due to the use of hash functions and tables, the method can process millions of candidate atomic environments in a matter of minutes, being much faster than a naive direct comparison approach. The obtained information can be stored in data libraries that can be efficiently searched at a later stage if needed. We note that in this context the BAH algorithm is complementary to the use of active learning, as the BAH algorithm is based on the geometric structure and its description, while it does not require the availability of trained ML potentials, as no property evaluations are needed. Active learning, on the other hand, is based on the comparison of predicted properties, which allows one to focus on the reliability of the target property, while it depends on the availability of preliminary models and their evaluation.

Another application is the validation of the structural resolution capabilities of the descriptors used for the discrimination of different atomic environments. Poor descriptor sets result in a large number of environments erroneously appearing to be structurally similar although local physical properties like forces differ substantially. Finally, the method can be used to identify conflicting data in the training set, which might result from an insufficient convergence level of the reference electronic structure calculations or other types of errors resulting in inconsistent information. Consequently, the bin and hash method has been found to be a useful tool for solving a variety of challenges emerging in the construction of machine learning potentials, with many additional potential applications in other fields requiring the efficient comparison of structural features, such as genetic algorithms, minima hopping, and kinetic Monte Carlo simulations.

ACKNOWLEDGMENTS
We thank the Deutsche Forschungsgemeinschaft (DFG) for financial support (Be3264/10-1, project number 289217282 and INST186/1294-1 FUGG, project number 405832858). JB gratefully acknowledges a DFG Heisenberg professorship (Be3264/11-2, project number 329898176). We would also like to thank the North-German Supercomputing Alliance (HLRN) for computing time under project number NIC00046.
REFERENCES

J. Behler, “Perspective: Machine learning potentials for atomistic simulations,” J. Chem. Phys., 170901 (2016).
V. Botu, R. Batra, J. Chapman, and R. Ramprasad, “Machine learning force fields: Construction, validation, and outlook,” J. Phys. Chem. C, 511–522 (2017).
V. L. Deringer, M. A. Caro, and G. Csányi, “Machine learning interatomic potentials as emerging tools for materials science,” Adv. Mater., 1902765 (2019).
P. Hohenberg and W. Kohn, “Inhomogeneous Electron Gas,” Phys. Rev., B864–B871 (1964).
W. Kohn and L. J. Sham, “Self-Consistent Equations Including Exchange and Correlation Effects,” Phys. Rev., A1133–A1138 (1965).
T. B. Blank, S. D. Brown, A. W. Calhoun, and D. J. Doren, “Neural network models of potential energy surfaces,” J. Chem. Phys., 4129–4137 (1995).
J. Behler and M. Parrinello, “Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces,” Phys. Rev. Lett., 146401 (2007).
B. Jiang and H. Guo, “Permutation invariant polynomial neural network approach to fitting potential energy surfaces,” J. Chem. Phys., 054112 (2013).
S. Lorenz, A. Groß, and M. Scheffler, “Representing high-dimensional potential-energy surfaces for reactions at surfaces by neural networks,” Chem. Phys. Lett., 210–215 (2004).
S. Manzhos and T. Carrington, Jr, “Using neural networks, optimized coordinates, and high-dimensional model representations to obtain a vinyl bromide potential surface,” J. Chem. Phys., 224104 (2008).
O. T. Unke and M. Meuwly, “PhysNet: A neural network for predicting energies, forces, dipole moments, and partial charges,” J. Chem. Theory Comput., 3678–3693 (2019).
K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller, “SchNet – a deep learning architecture for molecules and materials,” J. Chem. Phys., 241722 (2018).
L. Zhang, J. Han, H. Wang, R. Car, and W. E, “Deep potential molecular dynamics: A scalable model with the accuracy of quantum mechanics,” Phys. Rev. Lett., 143001 (2018).
J. S. Smith, O. Isayev, and A. E. Roitberg, “ANI-1: An extensible neural network potential with DFT accuracy at force field computational cost,” Chem. Sci., 3192–3203 (2017).
A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, “Gaussian Approximation Potentials: The Accuracy of Quantum Mechanics, without the Electrons,” Phys. Rev. Lett., 136403 (2010).
A. P. Bartók and G. Csányi, “Gaussian approximation potentials: A brief tutorial introduction,” Int. J. Quant. Chem., 1051–1057 (2015).
A. V. Shapeev, “Moment Tensor Potentials: A Class of Systematically Improvable Interatomic Potentials,” Multiscale Model. Simul., 1153–1173 (2016).
A. P. Thompson, L. P. Swiler, C. R. Trott, S. M. Foiles, and G. J. Tucker, “Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials,” Journal of Computational Physics, 316–330 (2015).
J. Jenke, A. P. A. Subramanyam, M. Densow, T. Hammerschmidt, D. G. Pettifor, and R. Drautz, “Electronic structure based descriptor for characterizing local atomic environments,” Phys. Rev. B, 144102 (2018).
R. M. Balabin and E. I. Lomakina, “Support vector machine regression (LS-SVM) – an alternative to artificial neural networks (ANNs) for the analysis of quantum chemistry data?” Phys. Chem. Chem. Phys., 11710 (2011).
M. Gastegger, J. Behler, and P. Marquetand, “Machine learning molecular dynamics for the simulation of infrared spectra,” Chem. Sci., 6924 (2017).
Popelier, “Beyond point charges: Dynamic polar-ization from neural net predicted multipole moments,” J. Chem. Theor. Comput. , 1435–1448(2008). F. Pereira and J. Aires-de Sousa, “Machine learning for the prediction of molecular dipole mo-ments obtained by density functional theory,” Journal of Cheminformatics , 43 (2018). N. Artrith, T. Morawietz, and J. Behler, “High-dimensional neural-network potentials for mul-ticomponent systems: Applications to zinc oxide,” Phys. Rev. B , 153101 (2011). T. Morawietz, V. Sharma, and J. Behler, “A neural network potential-energy surface for thewater dimer based on environment-dependent atomic energies and charges,” J. Chem. Phys. , 064103 (2012). K. Yao, J. E. Herr, D. W. Toth, R. Mckintyre, and J. Parkhill, “The TensorMol-0.1 modelchemistry: a neural network augmented with long-range physics,” Chem. Sci. , 2261–2269(2018). T. Bereau, D. Andrienko, and O. A. von Lilienfeld, “Transferable atomic multipole machinelearning models for small organic molecules,” J. Chem. Theory Comput. , 3225–3233 (2015). S. Faraji, S. A. Ghasemi, S. Rostami, R. Rasoulkhani, B. Schaefer, S. Goedecker, and M. Am-29ler, “High accuracy and transferability of a neural network potential through charge equilibra-tion for calcium fluoride,” Phys. Rev. B , 104105 (2017). J. Lee, A. Seko, K. Shitara, K. Nakayama, and I. Tanaka, “Prediction model of band gap forinorganic compounds by combination of density functional theory calculations and machinelearning techniques,” Phys. Rev. B , 115104 (2016). G. Pilania, J. E. Gubernatis, and T. Lookman, “Multi-fidelity machine learning models foraccurate bandgap predictions of solids,” Computational Materials Science , 156–163 (2017). M. Eckhoff, K. N. Lausch, P. E. Blöchl, and J. Behler, “Predicting oxidation and spinstates by high-dimensional neural networks: Applications to lithium manganese oxide spinels,”arXiv:2007.00335 (2020). M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, “Fast and Accurate Model-ing of Molecular Atomization Energies with Machine Learning,” Phys. Rev. Lett. , 058301(2012). L. B. Pártay, A. P. Bartók, and G. Csányi, “Efficient Sampling of Atomic ConfigurationalSpaces,” J. Phys. Chem. B , 10502–10512 (2010). E. L. Kolsbjerg, A. A. Peterson, and B. Hammer, “Neural-network-enhanced evolutionary al-gorithm applied to supported metal nanoparticles,” Phys. Rev. B , 195424 (2018). P. C. Jennings, S. Lysgaard, J. S. Hummelshøj, T. Vegge, and T. Bligaard, “Genetic algorithmsfor computational materials discovery accelerated by machine learning,” npj Comput Mater ,1–6 (2019). M. Ceriotti, G. A. Tribello, and M. Parrinello, “Simplifying the representation of complex free-energy landscapes using sketch-map,” Proc Natl Acad Sci USA , 13023–13028 (2011). S. De, F. Musil, T. Ingram, C. Baldauf, and M. Ceriotti, “Mapping and classifying moleculesfrom a high-throughput structural database,” Journal of Cheminformatics , 6 (2017). A. Sadeghi, S. A. Ghasemi, B. Schaefer, S. Mohr, M. A. Lill, and S. Goedecker, “Metrics formeasuring distances in configuration spaces,” J. Chem. Phys. , 184118 (2013). L. Zhu, M. Amsler, T. Fuhrer, B. Schaefer, S. Faraji, S. Rostami, S. A. Ghasemi, A. Sadeghi,M. Grauzinyte, C. Wolverton, and S. Goedecker, “A fingerprint based metric for measuringsimilarities of crystalline structures,” J. Chem. Phys. , 034203 (2016). S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, “Comparing molecules and solids across struc-tural and alchemical space,” Physical Chemistry Chemical Physics , 13754–13769 (2016). B. 
F. Hutter, J. Lücke, and L. Schmidt-Thieme, "Beyond Manual Tuning of Hyperparameters," Künstl. Intell., 329–337 (2015).
G. Luo, "A review of automatic selection methods for machine learning algorithms and hyper-parameter values," Netw. Model. Anal. Health Inform. Bioinforma., 18 (2016).
A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter, "Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets," in Artificial Intelligence and Statistics (2017), pp. 528–536.
M. Gastegger, L. Schwiedrzik, M. Bittermann, F. Berzsenyi, and P. Marquetand, "wACSF - weighted atom-centered symmetry functions as descriptors in machine learning potentials," J. Chem. Phys., 241709 (2018).
N. J. Browning, R. Ramakrishnan, O. A. von Lilienfeld, and U. Roethlisberger, "Genetic Optimization of Training Sets for Improved Machine Learning Models of Molecular Properties," J. Phys. Chem. Lett., 1351–1359 (2017).
G. Imbalzano, A. Anelli, D. Giofré, S. Klees, J. Behler, and M. Ceriotti, "Automatic selection of atomic fingerprints and reference configurations for machine-learning potentials," J. Chem. Phys., 241730 (2018).
J. Behler, "First Principles Neural Network Potentials for Reactive Simulations of Large Molecular and Condensed Systems," Angew. Chem. Int. Ed., 12828–12840 (2017).
A. P. Bartók, R. Kondor, and G. Csányi, "On representing chemical environments," Phys. Rev. B, 184115 (2013).
W. Pronobis, A. Tkatchenko, and K.-R. Müller, "Many-body descriptors for predicting molecular properties with machine learning: Analysis of pairwise and three-body interactions in molecules," J. Chem. Theory Comput., 2991–3003 (2018).
S. Jindal, S. Chiriki, and S. S. Bulusu, "Spherical harmonics based descriptor for neural network potentials: Structure and dynamics of Au nanocluster," J. Chem. Phys., 204301 (2017).
E. Kocer, J. K. Mason, and H. Erturk, "A novel approach to describe chemical environments in high-dimensional neural network potentials," J. Chem. Phys., 154102 (2019).
F. A. Faber, A. S. Christensen, B. Huang, and O. A. von Lilienfeld, "Alchemical and structural distribution based representation for universal quantum machine learning," J. Chem. Phys., 241717 (2018).
J. Behler, "Atom-centered symmetry functions for constructing high-dimensional neural network potentials," J. Chem. Phys., 074106 (2011).
J. Behler, "Representing potential energy surfaces by high-dimensional neural network potentials," J. Phys.: Condens. Matter, 183001 (2014).
J. Behler, "Constructing high-dimensional neural network potentials: A tutorial review," Int. J. Quantum Chem., 1032–1050 (2015).
H. S. Seung, M. Opper, and H. Sompolinsky, "Query by committee," Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 287–294 (1992).
N. Artrith and J. Behler, "High-dimensional neural network potentials for metal surfaces: A prototype study for copper," Phys. Rev. B, 045439 (2012).
E. V. Podryabinkin and A. V. Shapeev, "Active learning of linearly parametrized interatomic potentials," Comput. Mater. Sci., 171–180 (2017).
L. Zhang, D.-Y. Lin, H. Wang, R. Car, and W. E, "Active learning of uniformly accurate interatomic potentials for materials simulation," Phys. Rev. Mater., 023804 (2019).
C. Schran, J. Behler, and D. Marx, "Automated fitting of neural network potentials at coupled cluster accuracy: Protonated water clusters as testing ground," J. Chem. Theory Comput., 88–99 (2020).
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms (MIT Press, 2009).
J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, 509–517 (1975).
K. Pearson, "On lines and planes of closest fit to systems of points in space," The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 559–572 (1901).
H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, 417–441 (1933).
D. Frenkel and B. Smit, Understanding Molecular Simulation (Academic Press, 2002).
C. C. Aggarwal, A. Hinneburg, and D. A. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space," in Database Theory - ICDT 2001, Lecture Notes in Computer Science, edited by J. Van den Bussche and V. Vianu (Springer, Berlin, Heidelberg, 2001), pp. 420–434.
"Python 3.8.5 documentation - 5. Data Structures," docs.python.org/3/tutorial/datastructures.html.
Note: Care should be taken when comparing force components directly. ACSF vectors are invariant with respect to rotations and translations in coordinate space, but forces are not. This is due to the derivatives involved in going from the energy to the forces, which introduce a directional component. As a result, the same ACSF vector can be associated with different force vector orientations, i.e., the individual components of the force vectors might not match. The predicted magnitude of the force vector, being directionless, should on the other hand remain consistent. A trivial example is an unrelaxed, unmodified slab with two interfaces: atoms in the top and bottom surfaces have identical environments as described by their ACSFs, but by symmetry the Z components of their force vectors are necessarily opposite. The situation becomes more complicated for more homogeneous systems such as liquids and amorphous solids, where the same atomic environment can occur in a variety of orientations. Thus, only force vector magnitudes should be compared, or a consistent orientation of the environments has to be established in some way.
J. Weinreich, A. Römer, M. L. Paleico, and J. Behler, "Properties of alpha-Brass Nanoparticles. 1. Neural Network Potential Energy Surface," J. Phys. Chem. C, 12682–12695 (2020).
D. M. Deaven and K. M. Ho, "Molecular Geometry Optimization with a Genetic Algorithm," Phys. Rev. Lett., 288–291 (1995).
S. Goedecker, "Minima hopping: An efficient search method for the global minimum of the potential energy surface of complex molecular systems," J. Chem. Phys., 9911–9917 (2004).
A. F. Voter, "Introduction to the Kinetic Monte Carlo Method," in