[PDF] An Accurate and Transferable Machine Learning Potential for Carbon

Abstract

We present an accurate machine learning (ML) model for atomistic simulations of carbon, constructed using the Gaussian approximation potential (GAP) methodology. The potential, named GAP-20, describes the properties of the bulk crystalline and amorphous phases, crystal surfaces and defect structures with an accuracy approaching that of direct ab initio simulation, but at a significantly reduced cost. We combine structural databases for amorphous carbon and graphene, which we extend substantially by adding suitable configurations, for example, for defects in graphene and other nanostructures. The final potential is fitted to reference data computed using the optB88-vdW density functional theory (DFT) functional. Dispersion interactions, which are crucial to describe multilayer carbonaceous materials, are therefore implicitly included. We additionally account for long-range dispersion interactions using a semianalytical two-body term and show that an improved model can be obtained through an optimisation of the many-body smooth overlap of atomic positions (SOAP) descriptor. We rigorously test the potential on lattice parameters, bond lengths, formation energies and phonon dispersions of numerous carbon allotropes. We compare the formation energies of an extensive set of defect structures, surfaces and surface reconstructions to DFT reference calculations. The present work demonstrates the ability to combine, in the same ML model, the previously attained flexibility required for amorphous carbon [Phys. Rev. B, 95, 094203, (2017)] with the high numerical accuracy necessary for crystalline graphene [Phys. Rev. B, 97, 054303, (2018)], thereby providing an interatomic potential that will be applicable to a wide range of applications concerning diverse forms of bulk and nanostructured carbon.

Full PDF

AAn Accurate and Transferable Machine Learning Potential for Carbon

Patrick Rowe , Volker L. Deringer , Piero Gasparotto , G´abor Cs´anyi , and Angelos Michaelides Thomas Young Centre, London Centre for Nanotechnology, and Department of Physics andAstronomy, University College London, Gower Street, London, WC1E 6BT, U.K. Department of Chemistry, Inorganic Chemistry Laboratory, University of Oxford, Oxford OX13QR, U.K. Engineering Laboratory, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ,U.K. a Author to whom correspondence should be addressed: [email protected] 25, 2020

Abstract

We present an accurate machine learning (ML) model for atomistic simulations of carbon, constructed using theGaussian approximation potential (GAP) methodology. The potential, named GAP-20, describes the properties ofthe bulk crystalline and amorphous phases, crystal surfaces and defect structures with an accuracy approachingthat of direct ab initio simulation, but at a signiﬁcantly reduced cost. We combine structural databases foramorphous carbon and graphene, which we extend substantially by adding suitable conﬁgurations, for example,for defects in graphene and other nanostructures. The ﬁnal potential is ﬁtted to reference data computed using theoptB88-vdW density functional theory (DFT) functional. Dispersion interactions, which are crucial to describemultilayer carbonaceous materials, are therefore implicitly included. We additionally account for long-rangedispersion interactions using a semianalytical two-body term and show that an improved model can be obtainedthrough an optimisation of the many-body smooth overlap of atomic positions (SOAP) descriptor. We rigorouslytest the potential on lattice parameters, bond lengths, formation energies and phonon dispersions of numerouscarbon allotropes. We compare the formation energies of an extensive set of defect structures, surfaces and surfacereconstructions to DFT reference calculations. The present work demonstrates the ability to combine, in the sameML model, the previously attained ﬂexibility required for amorphous carbon [

Phys. Rev. B , , 094203, (2017)]with the high numerical accuracy necessary for crystalline graphene [ Phys. Rev. B , , 054303, (2018)], therebyproviding an interatomic potential that will be applicable to a wide range of applications concerning diverse formsof bulk and nanostructured carbon. a r X i v : . [ phy s i c s . c o m p - ph ] J un Introduction

The same characteristics which make carbon a fascinat-ing element for study also make it challenging to modelcomputationally. It exhibits some of the greatest struc-tural diversity - and associated diversity of properties - ofany of the elements [1–7]. Its allotropes range from zeroto three-dimensional, have metallic, semiconducting andinsulating phases and boast mechanical properties includ-ing some of the highest tensile strengths, hardnesses andbulk moduli measured [8, 9]. It is unsurprising, there-fore, that carbon is considered to be not just an elementof prime technological importance, but also remains thesubject of continued fundamental scientiﬁc study [10–16].Current applications of elemental carbon are numer-ous, they include lightweight and strong structural mate-rials, anodes of batteries and components in advanced op-tical technologies [9, 13, 16–19]. The usefulness of elemen-tal carbon has also clearly not yet been exhausted; futureapplications propose to make use of graphene’s uniqueelectronic properties for advanced electronics [11, 20], car-bon nanotubes’ structural and optical characteristics forperformance materials [18, 21] and the thermal and op-tical properties of diamonds for laser optics [9, 22, 23],amongst innumerable other examples.Atomistic simulations have played a major role in de-veloping our understanding of carbon materials. Amongthe many scientiﬁc problems that have been addressedwith carbon potentials, we may mention the wear pro-cess of diamond [24] or the compression behaviour ofglassy carbon [25]. The ﬁrst many-body interatomic po-tential for modelling carbon was published in 1988 byTersoﬀ. This potential was used to investigate the prop-erties of carbon’s crystalline and amorphous allotropes[26]. The reactive empirical bond order (REBO, REBO-II) potentials built on the original Tersoﬀ formulation toinclude a wider range of parameters and data in the ﬁt,as well as adding additional conjugation and torsionalterms, and modifying the bond order expression for smallangles [27, 28]. However, the Tersoﬀ and REBO-II poten-tials only considered nearest neighbour interactions anddid not account for the eﬀects of dispersion. The adaptiveintermolecular REBO (AIREBO) potential [29] sought tocorrect this by adding an additional long-range Lennard-Jones term between atoms with larger separations, whilemaking no modiﬁcations to the short-range part of thepotential. The long-range carbon bond order potential(LCBOP) not only increased the range of the potentialto account for longer-range interactions but also consti-tuted a complete reparameterisation of the bond orderpotential to improve the accuracy and transferability ofthe model, though long-range dispersion interactions arestill omitted [30].Further developments, beyond the scope of detaileddiscussion here, include the LCBOP-II potential whichexpanded the application range of the model to includethe liquid phase [31, 32], the environment dependant in- teratomic potential (EDIP) for carbon [33] which em-ployed properties calculated from ab initio simulation inits parameterisation, the introduction of a dynamic cut-oﬀ to bond-order potentials [34] and a recent reparameter-isation of a carbon ReaxFF potential [35]. Notwithstand-ing the long-standing success of these potential models,there are inherent limitations to even the most advancedof them. Such issues are particularly relevant when onedeparts from the idealised structures (diamond, graphite, etc. ), as shown in two detailed benchmark studies by deTomas et al. [36, 37].Machine learning (ML) has recently arisen as a way ofaddressing some of these limitations. A number of prac-tical approaches for modelling the potential energy sur-face (PES) using ML have been developed in recent years,employing algorithms including artiﬁcial neural networks,Gaussian process regression, and compressed sensing [38–45]. The demonstrated ability of ML algorithms to ﬁtarbitrary functions with an extremely high accuracy [46],combined with recent developments in high-dimensionaldescriptors for atomic systems makes ML approaches forthe development of interatomic potentials an increasinglypopular approach [42, 47–50]. Indeed, the use of machinelearning methodologies to model carbon has a signiﬁcantprecedent. Some of the ﬁrst examples of ML potentials,using both Gaussian process regression and artiﬁcial neu-ral networks, were ﬁtted for graphite and diamond, thesewere tested with regards to properties oﬀ the crystallinephases and the graphite-diamond phase coexistence line[38, 39, 51]. One of the ﬁrst neural-network potentialswas used for large-scale simulations of the diamond nucle-ation mechanism [52]. We recently introduced a machinelearning potential for pristine graphene constructed us-ing the Gaussian approximation potential (GAP) frame-work, which achieved excellent accuracy when bench-marked against DFT and experiment for a wide range oflattice and dynamical properties, including the phonondispersion relations, thermal expansion and Raman spec-tra at diﬀerent temperatures [53].While achieving good accuracy in a speciﬁc regionof conﬁguration space is not trivial, the problem of thetransferability of a potential is much more challenging tosolve from a ML perspective. In 2017, some of us reporteda highly transferable GAP model trained primarily onthe amorphous and liquid phases of carbon (henceforthtermed GAP-17), based on DFT-LDA reference data.The focus, there, was somewhat complementary—to beable to describe very diverse structural environments, al-beit accepting a degree of numerical error. As an exam-ple, the in-plane force errors for a pristine graphene sheetare 0.03 eV ˚A − with the graphene-only GAP mentionedabove, as compared to 0.27 eV ˚A − with GAP-17. Forcomparison, these errors for a range of commonly usedempirically ﬁtted potentials range from 0.6 to 3.1 eV ˚A − (more details are in Ref. [53]). In return, owing to theﬂexibility and transferability ensuing from its choice ofreference database, GAP-17 enabled the study of a num-2er of scientiﬁc problems which involve diverse structuralenvironments, including understanding the mechanism ofgrowth of sp hybridised amorphous carbon by ion de-position [54], extensive studies of the surface properties(and chemical reactivity) of tetrahedral amorphous car-bon [54–56], the structure of “porous” carbonaceous ma-terials which are relevant to applications in batteries andsupercapacitors [57, 58], and crystal-structure prediction[5]. The model we present here, GAP-20, builds on all ofthe previous work applying the GAP machine learningmethodology to the development of carbon potentials, toachieve the accuracy required for capturing subtle diﬀer-ences in formation energies of nanostructures or in defectformation energies, and for describing phonon dispersionsto within meV accuracy - while maintaining the ﬂexibil-ity and transferability of GAP-17. Importantly, all dataare generated using a dispersion-corrected DFT methodwhich properly accounts for longer-range interactions inlow-dimensional carbon structures, and the ﬁtting archi-tecture is adapted to account for those. Our tests suggestGAP-20 to be suitable as a “general-purpose” carbon MLpotential for diverse areas of study.While detailed discussions of the construction andtesting of the potential will be given in subsequent sec-tions, we take a moment to highlight the main pointshere. The composition of the training data set and perfor-mance of this potential is summarised in ﬁgure 1. GAP-20 correctly predicts the formation energies of diamond,graphite, fullerenes and nanotubes, to an accuracy of afew meV, and achieves comparable accuracy for a numberof crystalline and amorphous surfaces. The computed for-mation energies of defects are also accurate, with overallerrors signiﬁcantly lower than those obtained from com-parable empirical models. At the same time, GAP-20can accurately predict the behaviour of high temperatureliquid carbon over a wide range of temperatures and den-sities, which will be shown below. We believe that thesefeatures make GAP-20 a useful tool for the accurate mod-elling of nanostructured carbons; nanotubes, graphitisedcarbon and materials with varying degrees of defects anddisorder.The rest of the paper will be organised as follows.We ﬁrst describe our process for the construction of atraining set suitable for developing such a potential. Wethen give details on the construction and training of themodel itself, with discussion of particular aspects whichrequired special attention or optimisation. Subsequently,we present an extensive and rigorous testing of our model,for a wide range of properties. We also compare the re-sults of our model to a selection of commonly used em-pirical potentials which model the interatomic interac-tions in carbon with diﬀering degrees of simpliﬁcation.Speciﬁcally we choose the Tersoﬀ, REBO-II, AIREBOand LCBOP models. The selection of potentials consid-ered here is by no means exhaustive and is only intendedto give some basis for comparison between previous work and the model we introduce, as well as illustrating howthe inclusion or exclusion of diﬀerent interactions (e.g.dispersion interactions) may aﬀect the performance of amodel. A more detailed benchmarking across a widerrange of potentials, complementing the existing detailedtests for amorphous and “graphitised” carbons [36], maybe the subject of future work. One of the challenges inherent in constructing a gener-alised potential for carbon is the enormous variety ofstructures which must be considered. In addition to itsmore commonly encountered crystalline phases; diamondand graphite, carbon may be found in forms of diﬀeringdimensionality, from zero dimensional fullerenes, to onedimensional nanotubes, two dimensional graphene andthree dimensional amorphous forms [6].Speciﬁcally in the case of ML, one is drawn to theproblem of the composition of the large database of ex-ample conﬁgurations, known as the training data set. Fora potential to be both accurate and transferable, its train-ing data set ought to include representative conﬁgurationsfrom all of the thermally accessible chemical space. Onemight initially suggest that the problem is therefore in-tractable, if in order to produce a potential which is ca-pable of accurately modelling all of the relevant phasesof carbon, we must explore the entirety of the vast 3N-dimensional chemical space. It is an empirical observa-tion, however, that the thermally accessible and physi-cally relevant regions of this chemical space constitute avastly reduced subset of all of the available conﬁgurations[59–61]. Further, rather than an exploration of the 3N di-mensional space, in ﬁtting the parameters of a ML algo-rithm we are primarily concerned with an exploration ofthe reduced dimensionality descriptor space [59, 62–64].In the case of atom centred descriptors such as the smoothoverlap of atomic positions (SOAP), this represents thelocal environment around a particular atom rather thanthe global structure [63]. While the structural variabilityof carbon is globally almost inﬁnite, many of these struc-tures are constructed from similar local motifs, for exam-ple the tetrahedral building blocks of diamond [65, 66].Similar logic may be applied to more complex structures.The reference conﬁgurations which comprise ourstructural database are drawn from a wide variety ofsources. Regardless of the origin of the conﬁgurationitself (e.g., from the GAP-17 database), its properties,those being the total energy, atomic forces and virialstresses, which comprise the actual training data, arealways computed using the same level of tightly con-verged plane-wave DFT including dispersion corrections.We use the VASP plane-wave DFT code, we performspin-polarised calculations with the optB88-vdW disper-sion inclusive exchange-correlation functional [67–70], a3igure 1: Overview of some of the key structures included in the training shown through a sketch-map representation(top) as well as selected information on the performance of the potential for a variety of properties. (a) Sketch-maprepresentation of the total data set for carbon generated as part of this work. Select structures are identiﬁedfor graphite, diamond, hexagonal diamond (Lonsdaleite), amorphous carbon and fullerenes. Points are colouredaccording to their energy, while contours indicate the density of the database population in a particular region.Bottom, is a summary of (b) the predicted crystalline formation energies, (c) defect formation energies and (d)surface energies, comparing the DFT (optB88-vdW) reference (black circles) GAP-20 (red crosses) and all othermodels (blue crosses). 4lane-wave cut-oﬀ of 600 eV and a projector augmentedwave pseudopotential [71–73]. A Gaussian smearing of0.1 eV is applied to the energy levels and dense recip-rocal space Monkhorst-Pack grids are used [74]. In thecase of the reduced dimensionality allotropes; grapheneand nanotube structures, the reciprocal space samplingis only performed in the directions in which the allotropeis periodic. The properties of fullerene structures arecalculated at the gamma point. For this potential, wechoose the optB88-vdW functional as it has already beendemonstrated to provide an excellent description of car-bonaceous materials, in particular graphitic carbon – forwhich its prediction of the binding energy and interlayerspacing is in good agreement with experimental values[75].The database of conﬁgurations presented here uses asits foundation a combination of the training data sets forthe two carbon potentials previously published primar-ily for liquid and amorphous carbon (GAP-17), and forpristine graphene, respectively [53, 62]. A large numberof new conﬁgurations are considered in addition to theseexisting ML data sets [5, 6, 76]. We endeavour to com-prehensively cover all the possible crystalline phases ofcarbon found at moderate temperatures and pressures,including more exotic allotropes. To that end, DFT opti-mised structures for graphite, graphene, cubic and hexag-onal diamond are included, as well as the structures of a li-brary of fullerenes comprised of fewer than 240 atoms andall nanotube structures with chiral indices, 3 ≤ n, m ≤ abinitio and some iteratively improved GAP driven molecu-lar dynamics simulations at a number of temperatures soas to also sample the region of phase space close to theselocal minima [53, 62]. The resulting database is com-prised of ca. 17000 conﬁgurations each containing from 1to 240 atoms per cell.The choice of which structures might be importantfor training a potential requires for the most part chem-ical or physical intuition on the part of the researcher[42, 59, 85]. Some of these choices may be clear, for ex-ample the need to include conﬁgurations representing thebulk structures of diamond and graphite. Others, how-ever, such as the inclusion or exclusion of particular defector surface structures, will depend on the desired applica-tion of the potential (and, to some extent, on personalchoice). To maximise the transferability of our model,we have produced as comprehensive a database as pos-sible – too large to train on with current computationalfacilities. Rather than using the full database for sparsi-ﬁcation, as commonly done in GAP ﬁtting (including in the development of GAP-17), we instead allow the bulkof our training conﬁgurations to be chosen from the to-tal dataset using a sampling method known as farthestpoint sampling (FPS) [42, 86]. Within this set, we thencarefully check the data saturation of our training withrespect to the number of sparse points, which is discussedin section 3.This method allows us to start with a much more com-prehensive database than previously, while still keepingthe computational eﬀort at the ﬁtting stage tractable. Wewish for our training dataset to have the widest possiblesampling of descriptors and forces – leaving no physicallyrelevant conﬁgurations unsampled, while avoiding over-representation of particular regions of phase space. FPSfacilitates this, by allowing a selection of frames to bemade based on a measure of the global similarity (in de-scriptor space) between possible conﬁgurations [42, 86].Given a set of n descriptors of type d for a number offrames, Q = { q d , avg i =1 ... n } , which are themselves the averageof the individual descriptors of the atoms in a particu-lar frame q di , the FPS algorithm selects conﬁgurations sothat at each step, the kernel distance between previouslyselected conﬁgurations Q selected = { q d , avg1 . . . q d , avgm } andthe new conﬁguration q d , avgm+1 is maximised. That is, q d , avgm+1 = argmax q d , avg [D( Q selected , q d , avg )] , (1)where D is the kernel distance between the selected de-scriptors (and associated frames) Q selected and the candi-ate descriptor q d , avg . In our case, we use the SOAP de-scriptor as a structural ﬁngerprint of a conﬁguration andthe dot product between two SOAP descriptors as ourkernel similarity measure [63]. As has previously beenshown for molecular systems, we ﬁnd that this method ofselection enables the training of a potential which demon-strates good transferability [59]. However, due to thenature of the sampling, it lacks the dense populationof conﬁgurations around particular local minima whichwe ﬁnd are important for achieving very high accuracyon particular crystalline properties. We therefore chooseto augment the training dataset selected through FPSwith a number of mandatory conﬁgurations chosen usingchemical intuition, focused on the bulk crystalline phasesand certain defect and surface structures. Speciﬁcally,we note that optimised geometries for structures used inthe validation sections of this paper are included in thetraining. The ﬁnal database is comprised of the unionof the 4000 FPS-selected points and the existing GAP-17 dataset, while a further ca. 1000 conﬁgurations aremanually added to target speciﬁc properties.The selected conﬁgurations, as well as a representationof their position in phase space, can be seen illustrated inﬁgure (Fig. 1). This sketch-map [86, 87] representationof the total training dataset uses the same measure ofkernel similarity as discussed above to position points in areduced dimensionality such that points which are similar5n the full high-dimensional descriptor space are closertogether, and those which are dissimilar are further apart.This sketch-map representation also serves as a qual-itative overview of the type of structures to which we ﬁtour model. Structures with carbon atoms of highly variedcoordination environments, from sp to sp and sp canbe seen. Those allotropes which are sp hybridised, suchas graphene, graphite and carbon nanotubes are clus-tered together towards the right of the map. Amorphousstructures can be seen as a large region in the centre,with low density (sp and sp rich) amorphous carbonat the far right, high density sp rich amorphous car-bon towards the left, eventually approaching crystallinediamond at the very far left of the map. The more ex-otic, sometimes hypothetical structures collected from theSACADA database are often found separated from bulkcrystalline or amorphous conﬁgurations. In the far topright of the map, isolated gas phase dimer conﬁgurationsare found. We choose to construct GAP-20 to represent the PES asthe combination of contributions from a two body (2b),a three body (3b) and a high dimensional many-body(MB) component. It is an empirical observation that alarge proportion of the interaction in an atomistic systemmay be satisfactorily captured by considering 2b interac-tions. In particular this is the case for the exchange re-pulsion experienced as interatomic distances become verysmall. Representing this exchange repulsion in its fullhigh-dimensional form would be costly from the perspec-tive of both training data generation, potential genera-tion and the ultimate evaluation of the potential. Thenature of bonding interactions for carbon may also becaptured in an approximate way, being generally attrac-tive between 1.2 - 1.6 ˚A, with an attractive tail at longdistances. We design the 2b part of our model as a GAPﬁtted 2b component (V short ) r < ≥ r > long ) which decays as r − .This long range component is ﬁtted to correctly reproduce(albeit without many-body contributions) the long-rangeattraction due to van der Waals interactions of graphiticlayers. A smooth transition is achieved by ﬁrst ﬁtting theanalytical form of V long to the graphene bilayer interac-tion curve from 3.0 to 10.0 ˚A. V short is then trained byﬁrst subtracting V long from the total energy and ﬁtting tothe diﬀerence. The resultant 2b potential (Fig. 2) simplyhas the ﬁnal form V short + V long The true subtleties of interatomic bonding are inher-ently many-body in character, however. We representthese higher-order contributions to the potential energyusing a combination of a 3b descriptor and the aforemen-tioned SOAP descriptor. The full details of the construc-tion of the 3b and SOAP descriptors is given in detail elsewhere [53, 62, 63, 88–91].Figure 2: Construction of the long-range 2b component ofthe GAP model. An analytical spline is ﬁtted to a func-tion which decays as r − , designed to recover the long-range attraction between graphene layers. (a) Shows thepredicted energies for each component for the interactionof two graphene layers at diﬀerent distances. The longrange attraction is well characterised by the r − compo-nent of the r − potential, which is in turn well recoveredby the analytical spline. The GAP ﬁt using a 2b, (V short ),3b and SOAP descriptor provides the appropriate repul-sive potential at short distances but is too short ranged todescribe the attractive tail. GAP-20 reproduces the wholecurve with good accuracy across a range of distances. (b)Shows the same decomposition for the gas phase dimer.In this case, the strong bonding interactions are domi-nated by the GAP 2b (V short ), 3b and SOAP descriptorcomponents. The energy of the r − component becomeslarge and negative for short distances. (b, inset) Providesa closer view of panel B, and shows how the 2b splineﬁt to the r − component is brought smoothly to 0 at adistance of 3 ˚A.In short, the 3b term is a symmetrized transforma-tion of the Cartesian coordinates of triplets of atoms,which is designed to be permutationally invariant to theswapping the atomic indices [53, 62]. In the constructionof the SOAP descriptor, the local environment arounda target atom is represented by a ‘local neighbour den-sity’, constructed by placing a Gaussian basis functionon each neighbouring atom within a certain cutoﬀ, whichwe choose to be 4.5 ˚A. The Gaussian basis functions arescaled by a factor of 1 / r . to reﬂect the greater contri-bution to material properties of atoms which are closertogether [92–94]. Other functional forms for the radialscaling exist and the introduction of this scaling was per-formed independently of the optimisation of the SOAPdescriptor cutoﬀ, the choice of which is motivated below.As a result, there may still be scope for further optimi-sation of these parameter sets beyond what is performedhere. The local neighbour density is expanded in a ba-sis set of spherical harmonics, the coeﬃcients of whichform a ‘SOAP vector’. In our case we use a basis set upto order l = 4 and n = 12, our motivation for which is6iscussed in the following paragraph. This SOAP vectorconstitutes a unique representation of the local environ-ment, which satisﬁes the requirements of being transla-tionally, rotationally and permutationally invariant. TheSOAP kernel, used for regression, is constructed as thescalar product of individual SOAP vectors. Such a kernelis physically interpretable, as it corresponds directly tothe integral of two neighbour densities for all possible 3Drotations. Details for the speciﬁc choices for a number ofassociated hyper-parameters are given in the supplemen-tary material, while further details on their signiﬁcanceis given elsewhere [39, 53, 62, 63, 88–91].Figure 3: Mean absolute force errors calculated for anindependent test set of conﬁgurations for diﬀerent SOAPdescriptors. (a) Force error behaviour and cost of evalua-tion (relative to the fastest GAP) of the resultant modelas a function of the SOAP descriptor cutoﬀ, the selectedvalue of 4.5 ˚A is indicated by the dashed red line. (b)Force error convergence and relative cost as a functionof the number of sparse points included in the training.(c) Dependence of the force error and model evaluationcost on the order of the SOAP neighbour density basis setexpansion. Force errors are indicated by the colour bar,while relative costs are shown by contour lines, our choice l max = 4, n max = 12 is highlighted by the red square.We now provide the details of select convergence testsfor the optimisation of our GAP model. These tests con-sider the independent optimisation of the SOAP descrip-tor cutoﬀ, number of sparse points and the order of the ra- dial basis set expansion. In general we begin with a SOAPdescriptor with an expansion of the neighbour density upto l max = 8, n max = 8, a cutoﬀ of 4.2 ˚A, σ force = 0 . − and σ energy = 0 .

001 eV and ζ = 4. We modifyone parameter at a time in isolation, while keeping theremainder ﬁxed. We calculate the force error for the re-sulting models on a randomly selected independent set oftest conﬁgurations which is not included in the training.Figure 3(a) shows the behaviour of the force errors asa function of the SOAP descriptor cutoﬀ. The force errorhas a minimum for a cutoﬀ of 2.9 ˚A, after which it beginsto rise again as the increased size of the descriptor spaceexpands beyond what can be populated with the numberof available training conﬁgurations. A na¨ıve optimisa-tion of these parameters based purely on the force errorswould therefore select a cutoﬀ radius of 2.9 ˚A. However,selection of these parameters cannot be performed in iso-lation from the intended application of the potential, butmust also be motivated by knowledge of the behaviourof the material of interest. In this regard, the force (orenergy) error alone may be regarded as an imperfect orincomplete target property for optimisation. Speciﬁcally,we ﬁnd that although the minimum in the force error isfound at much shorter distances, a longer cutoﬀ of 4.5˚A must be used in order to correctly describe graphiticstructures, a feature which we consider to be mandatoryfor this potential. The inter-layer spacing of graphiticstructures is typically large (approx. 3.3 ˚A) and a po-tential must incorporate enough of the structure of bothlayers to correctly model properties such as the bindingcurve of graphene layers or the energy diﬀerence betweenAA and AB stacked graphite. The eﬀect of these shortcutoﬀs can be seen in the unsatisfactory behaviour of theTersoﬀ and REBO-II models when modelling the inter-layer spacing of graphite (Table 1), or graphene bilayers(Supplementary ﬁg. 6). However, the problem of pro-ducing a single analytical metric for optimisation, whichsatisfactorily includes properties such as the lattice pa-rameters, defect formation energies or phonon errors aswell as the force errors themselves is a challenging one. Inthis instance, the design choice of selecting an appropriatedescriptor cutoﬀ remains partly qualitative in nature.Figure 3(b) shows the convergence of the mean ab-solute force errors as a function of the number of sparsepoints used in the training, this may be considered as ameasure of the data saturation of GAP-20. The force er-rors decrease rapidly up to approx. 1500 sparse points,at which point they begin to level oﬀ, although we notethat a further increase in the number of sparse points hasa negligible impact on the cost of evaluating the model.Our choice of 9000 sparse points is therefore very tightlyconverged.In ﬁgure 3(c) we show how the force error for ourmodel converges as a function of the order of the basis setof radial functions used to expand the SOAP neighbourdensity. The relative computational cost of each basis setis indicated on the same plot by labelled contour lines.7e ﬁnd that the radial ( n ) component of the expansion,typically has a greater impact on the rate of convergencethan the angular ( l ) component. While previously in theconstruction of GAP models, band limits n max = l max ,were used for the SOAP descriptor, we ﬁnd that surpris-ingly, an improvement in accuracy can be achieved withessentially no additional cost by making a selection forthe basis set expansion which is strongly biased to theradial ( n max ) component. Of course the cost must alsobe taken into account, the use of a larger radial compo-nent is more expensive than an identical increase in theangular component, due to the greater number of basisfunctions introduced. Our selection of l max = 4, n max =12 is chosen as a compromise between accuracy and eﬃ-ciency. Although a small improvement in the force errorscan be achieved by selecting n max = 12 , l max >

4, the re-sultant increase in the cost of training restricts both thesize of the training data set which can be used and thesize and length scales to which the resultant potential canbe applied.

Among the most important material properties for anypotential to predict accurately are those of the bulk crys-talline phases. Table 1 compares the lattice parametersand bond lengths as predicted by GAP-20 to those fromthe reference DFT method, in addition to a number ofempirical models. In ﬁgure 4, we also provide both theatomisation energies, and the formation energies of thecrystalline phases relative to the thermodynamically sta-ble state of carbon (at standard temperature and pres-sure), i.e. graphite. We deﬁne the atomisation energy ofa phase relative to the isolated gas phase carbon atomE at as, E f = E bulk − n E at , (2)where E bulk is the energy of the bulk phase and n is thenumber of atoms in the bulk. Lattice parameters are cal-culated by independently optimising the cell vectors foreach allotrope, until the total energy is converged to lessthan 10 − eV. GAP-20 accurately predicts the lattice pa-rameters and bond lengths of all of the tested crystallineallotropes with an average error of 0.2%, and their for-mation energy to within 0.5%.Accurately modelling the graphite c lattice parame-ter, corresponding to the spacing between graphitic layersproved particularly challenging for candidate GAP mod-els, as did the formation energy. This is in large part dueto the shallow nature of the energetic minimum charac-terising the graphite inter-layer interactions and the longrange of the atomic descriptor required to capture it. Asdiscussed above, the choice of SOAP cut-oﬀ was speciﬁ-cally informed by a desire to capture this property cor-rectly. We consider this in particular to be a mandatory characteristic of a general carbon potential which wouldbe absent for any model with a shorter cut-oﬀ.It is useful here to make a brief comparison to selectedempirical potentials. While we do include DFT referencedata for all properties, in subsequent sections these ref-erence values are only computed using the same level ofDFT used to train GAP-20. For benchmarking GAP-20,which is our primary purpose, this is not problematic,however we do not fully account for the potential im-pact of functional dependence, or the errors of DFT withrespect to experiment, when making comparisons to em-pirical models. Many of the empirical models consideredare ﬁtted to experimental data, or contain values fromother DFT functionals, typically LDA, which should betaken into account when comparing diﬀerent model pre-dictions to our optB88-vdW reference values. To givesome indication of the functional dependence, referencevalues for the formation energies in ﬁgure 4 are given us-ing both the optB88-vdW and LDA functionals. We alsore-iterate that the GAP-17 model was ﬁtted to LDA data,so it would be expected to accurately reproduce DFT val-ues at this level only.On average, GAP-20 predicts the lattice parameters ofthe tested crystalline phases with an error of 0.2%, whilethe Tersoﬀ, LCBOP, REBO-II, AIREBO potentials haveerrors averaging 5%, 0.3%, 4% and 1% respectively (Ta-ble 1. What the Tersoﬀ and REBO-II potentials gain ineﬃciency by using short cutoﬀs, they lose in accuracy,notably by predicting dramatically incorrect inter-layerspacings (c lattice parameters) for graphite. This erroris ﬁxed by the inclusion of medium and long-range termsto account for van der Waals interactions in the LCBOPand AIREBO models, however. Despite its inaccuracy forgraphene, the REBO-II potential does achieve good accu-racy on the remaining lattice parameters; the additionalterms included in the bond-order potential constitute adramatic improvement over the Tersoﬀ potential. Duein part to its complete reparameterisation to account forthe eﬀects of long-range interactions in the bond-orderpart of the potential, LCBOP does outperform the otherempirical potentials tested here in most cases.8igure 4: Formation energies of the crystalline phases ofcarbon, comparing results from DFT (optB88-vdW andLDA) to those from GAP-20 and the other models tested.(a) Atomisation energies using the isolated gas phase car-bon atom as a reference, diﬀerences are dominated byoverstabilisation of the gas phase atom by empirical mod-els. (b) Formation energies given relative to the graphiteformation energy of each particular model.In absolute terms, the atomisation energies (ﬁg. 4(a))from the tested empirical potentials diﬀer signiﬁcantlyfrom those predicted by both reference DFT methods,due to the very diﬀerent energies of the isolated gas phaseatom. In the case of GAP-17, the small oﬀset betweenthe LDA reference and the model prediction is the resultof the isolated atom not being included in the trainingdataset. When using the formation energy of graphiteas a reference state however, (ﬁg. 4(b)) this oﬀset is re-moved and the agreement between the empirical modelsand DFT improves considerably. When using both thegas phase atom and graphite as a reference, there is anexcellent agreement between GAP-20 and the optB88-vdW DFT reference for all of the phases considered here. GAP-20 uniformly predicts the atomisation energies ofthe tested allotropes to within an error of 1%, includingthe relatively subtle diﬀerence in energetics between nor-mal cubic and hexagonal diamond and the energetics ofnanotubes and fullerenes. The inclusion of the gas-phaseatom in the training is vital to accurately predict theseatomisation energies. There is surprisingly little diﬀer-ence between the formation energies predicted by the dif-ferent many-body potentials tested here, though there area few points of note. Firstly, due to their short cutoﬀs,the Tersoﬀ and REBO-II potentials do not distinguish be-tween graphite and graphene as the thermodynamicallystable phase and as such their formation energies are pre-dicted to be equal. Similarly, only the GAP-20, LCBOPand AIREBO models correctly favour cubic over hexago-nal diamond, although the AIREBO model overestimatesthe diﬀerence in energy by a factor of 5, while the othermodels considered do not distinguish between the two di-amond phases. A more complete evaluation of the forma-tion energies for diﬀerent chiralities of nanotubes is givenin the supplemental material, for GAP-20, the energy er-rors for a signiﬁcantly wider range of structure types arealso given.In addition to the static properties of the crystallineallotropes, it is an important characteristic of any poten-tial that it accurately model the lattice dynamics of bulkcrystals, i.e. their behaviour at ﬁnite temperature. Thephonon spectrum of a material is a direct probe of thiswhich is experimentally measurable. In addition, a num-ber of thermodynamically relevant properties, includingthe thermal expansion coeﬃcient and the constant vol-ume heat capacity of a material may be calculated di-rectly from the phonon dispersion relation by calculationof the free energy. It is clear therefore, why a correctprediction of the phonon dispersion relation is a highlydesirable feature of an interatomic potential.9able 1: Lattice parameters and bond lengths of the crystalline carbon phases. In the case of fullerenes, bond lengthsare given in lieu of lattice parameters. Absolute values for the lattice parameters are given, with percentage errorsrelative to DFT in brackets. The Tersoﬀ and REBO-II potentials have no interaction between graphite layers forany physically reasonable lattice parameters and as such these values are omitted.Lattice Parameter(s) [˚A] (% Error)DFT GAP-20 Tersoﬀ LCBOP REBO-II AIREBOGraphite ( a ) 2.46 2.47 (0.4) 2.53 (2.4) 2.46 (0.4) 2.46 (0.4) 2.42 (2.0)Graphite ( c ) 6.65 6.71 (0.9) - 6.36 (4.4) - 6.72 (1.1)Graphene 2.46 2.46 (0.0) 2.53 (2.8) 2.46 (0.0) 2.46 (0.0) 2.42 (1.6)Diamond 3.58 3.59 (0.3) 3.57 (0.3) 3.57 (0.3) 3.57 (0.3) 3.56 (0.6)Hexagonal Diamond 2.52 2.53 (0.4) 2.52 (0.0) 2.52 (0.0) 2.52 (0.0) 2.52 (0.0)Nanotube-(9, 9) 4.26 4.25 (0.2) 4.35 (2.1) 4.24 (0.5) 4.26 (0.0) 4.18 (1.9)Nanotube-(9, 0) 2.41 2.39 (0.8) 2.53 (5.0) 2.47 (2.5) 2.47 (2.5) 2.43 (0.8)C Fullerene 1.40 1.40 (0.0) 1.46 (4.5) 1.41 (0.7) 1.42 (1.4) 1.40 (0.0)C

Fullerene 1.39 1.39 (0.0) 1.39 (0.0) 1.39 (0.0) 1.39 (0.0) 1.39 (0.0)Figure 5: A. Phonon dispersion relation for diamond aspredicted by GAP-20 (black) with comparison to DFT(optB88-vdW) reference data (red). B. Graphene phonondispersion relation comparing GAP-20 and DFT (optB88-vdW) reference data. The dashed blue line shows thepredicted phonon dispersion curve for the graphene-only model previously published [53] C. (9,0)-Nanotubephonon dispersion and vibrational density of states. D.(9, 9)-Nanotube phonon dispersion relation and vibra-tional density of states. Equivalent comparisons for theother models tested are given in the supplementary ma-terial.Figure 5 shows the comparison between the phonondispersion curves calculated using the reference DFTmethod and those calculated using the carbon GAPmodel for graphene, diamond, a zig-zag (9, 9) and an arm-chair (9, 0) carbon nanotube. Phonon dispersion curveswere computed using the ﬁnite displacement method asimplemented in the Phonopy Python package [95]. Equiv-alent curves for the other models tested are provided inthe supplementary material.We have previously reported results comparing thephonon dispersion relation for a purely graphene GAP model, to those from experimental x-ray scattering dataand a number of reference DFT methods [53]. It is usefulto make a comparison between the highly targeted modelpreviously published and the much more general GAP-20introduced here. A particular concern might be that sig-niﬁcantly expanding the conﬁgurational space on whichwe wish to train, as we have done here, would necessarilydamage the quality of the predictions for graphene com-pared to the previous model – particularly for a propertyas sensitive as the phonon dispersion curves. It is demon-strated in ﬁgure 5(b) that this is not the case; the dis-persion relation of the phonon curves for graphene fromGAP-20 are comparable to those of the previously pub-lished graphene GAP model [53]. The energies of thephonon bands are correctly predicted across all of thehigh symmetry directions plotted, while the frequencies(in particular at the high symmetry points) are found tobe correct to within 4 meV, which may be compared to avalue of 1 meV for the pristine graphene model [53]. Thequality of the GAP-20 model prediction is comparablefor diamond (which cannot be modelled at all with thepristine graphene model), though with marginally largererrors for the prediction of the energies of certain bands,up to 7 meV for the higher frequency modes. GAP-20also captures the diﬀerence in vibrational behaviour be-tween armchair and zig-zag nanotubes remarkably well.There are some diﬀerences in the energy of certain split-tings for some bands, but the magnitude of these errors issmall, typically on the order of a 2-3 meV. In particular,it can be seen from ﬁg. 5 that the vibrational density ofstates for the two nanotube systems agrees well with theDFT reference.

From the point of view of atomistic simulation, surfacespresent a major challenge, as their correct description re-quires a treatment of a number of competing physical10able 2: Surface energies of low Miller index surfaces for common carbon allotropes. Reference energies are calculatedusing DFT, absolute values from each model are given, with their percentage error in brackets. Note that for theamorphous surfaces, the surface energy is averaged over a large number of diﬀerent surfaces. In the amorphous case,the error provided is the average of the individual point-wise errors, rather than the error between the average surfaceenergies. Surface Energy [eV ˚A − ] (% Error)DFT GAP-20 Tersoﬀ LCBOP REBO-II AIREBODiamond (100) (As cut) 0.56 0.60 (7) 0.47 (16) 0.61 (9) 0.69 (23) 0.73 (30)Diamond (100) (Relaxed) 0.54 0.56 (4) 0.42 (22) 0.59 (9) 0.69 (28) 0.72 (33)Diamond (111) (As cut) 0.64 0.73 (14) 0.88 (38) 1.07 (67) 1.00 (56) 1.03 (61)Diamond (111) (Relaxed) 0.62 0.66 (6) 0.88 (42) 1.07 (73) 1.00 (61) 1.03 (66)Diamond (110) (As cut) 0.68 0.70 (3) 0.70 (3) 0.89 (31) 0.74 (9) 0.74 (9)Diamond (110) (Relaxed) 0.68 0.67 (1) 0.63 (7) 0.83 (22) 0.69 (1) 0.69 (1)Graphite (0001) (As Cut) 0.015 0.013 (13) 0 (100) 0.005 (67) 0 (100) 0.011 (27)Graphite (0001) (Relaxed) 0.015 0.012 (20) 0 (100) 0.005 (67) 0 (100) 0.011 (27)Amorphous Surfaces (As Cut) 0.26 0.27 (4) 0.25 (4) 0.25 (4) 0.25 (4) 0.25 (4)interactions [81–84, 96].We compute the surface energy for each model by ﬁrstoptimising the bulk structure for the parent crystal untilthe total energy is converged to 10 − eV. We then cutthe surface along the desired direction and compute thespeciﬁc surface energy γ s at T = 0 K as, γ s = (E n − nE bulk ) / , (3)where E n is the energy of the n slab layer containing twosurfaces, which may be as-cut (unrelaxed) or allowed torelax and E bulk is the energy of a single atom in the bulkstructure and A is the area of the surface structure. Inthe case of the amorphous surfaces, due to the extent ofthe surface relaxation observed, we report only the as-cutsurface energies.Graphite may be readily cleaved to expose its (0001)surface, which is remarkably stable and is by far the pre-dominant face of graphite, while in diamond, the (100),(111) and (110) surfaces are of particular interest [97]. Wealso compute the as-cut surface energies for an ensembleof amorphous structures, by cutting bulk amorphous sys-tems along diﬀerent directions.The energies of several important surface cuts andtheir reconstructions are given in Table 2. GAP-20 typ-ically predicts the diamond surface energies correctly towithin 7 %, the exception being the case of the relaxeddiamond (111) surface, where the error is slightly largerat 15 %. The structures of the relaxed surfaces were alsofound to be in excellent agreement, with the average errorin the positions of individual surface atoms being 10 − ˚A.The graphite (0001) surface energy is extremely small andit thus proved challenging to produce a model which couldcorrectly characterise this, however, GAP-20 predicts theunrelaxed and relaxed surface energies correctly to withinan error of 3 meV ˚A − (20 %). With the inclusion of vdW interactions considered in their construction, the LCBOPand AIREBO potentials both predict the graphite (0001)surface energy rather well, with errors of 67 and 27 %respectively.While GAP-20 achieves low errors for the surface en-ergies of all the diamond surfaces considered, the othermodels generally perform well for at least one diamondsurface, though none exhibit uniformly low errors. TheTersoﬀ, REBO-II and AIREBO models predict the en-ergies of the diamond (110) surfaces to within 10 % ofthe reference value. Conversely, of the empirical modelsonly the LCBOP potential correctly predicts the energy ofthe diamond (100) surface; errors for the Tersoﬀ, REBO-II and AIREBO potentials were 22, 28 and 33% respec-tively. None of the empirical potentials performed well forthe (111) surface of diamond. The Tersoﬀ and REBO-IImodels do not show any binding between graphitic lay-ers for any reasonable initial geometry. This would leadto the spontaneous exfoliation of graphite layers and theeventual disintegration of graphite crystals in simulationsemploying these models. A certain concentration of defects is a guarantee in anyexperimental material sample. Such imperfections mayhave a strong impact on the structural, optical and ther-mal properties of a material and may be introduced into acrystal structure to induce or modify properties. The en-gineering of defects is of great technological importanceand consequently their accurate modelling by an inter-atomic potential is highly desirable. The possibility of re-hybridisation, which allows carbon atoms to reconstructwith diﬀering numbers of bonds to stabilise particularstructures allows carbon to have a wider variety of de-11ects than most other elements.To the best of our knowledge, there is not a set of de-fect formation energies for a wide range of carbon defectscomputed at precisely the same level of theory. There-fore, we here assemble such a reference set, for which wecompute defect formation energies in large supercells toavoid defect self-interaction in the computation of ener-gies. For graphite, a (6 × ×

2) supercell with 288 atomsand four graphite layers was used [77, 98]. In the caseof graphene, a (10 ×

10) supercell with 200 atoms wasemployed and for diamond a (3 × ×

3) supercell with216 atoms was used [53, 78]. Defect formation energiesare calculated for the representative (9, 9) and (9, 0) in-dex carbon nanotubes, which had 174 and 180 atoms inthe supercells used respectively [79]. For each structure,the lattice parameters and ionic positions of the pristinestructures were optimised as discussed previously. Theionic positions of the defective structures were then opti-mised until the energy was converged to within 10 − eV,while keeping the lattice parameters ﬁxed. We computethe formation energy E f of a vacancy defect relative tothe energy of an atom in an ideal parent structure: E f = E d − (n E at + E bulk ) (4)Where E d is the energy of the defective supercell struc-ture, E bulk is the energy of the undefective bulk struc-ture and E at is the energy of a single atom in the bulkstructure, while n is the number of carbon atoms added(positive n) or removed (negative n) to form the defect.The simplest of defects involves the absence of one ortwo atoms from their regular position in the lattice, form-ing monovacancy and divacancy defects. Monovacancydefects often result in unsaturated bonds at the defectsite, while divacancy structures, particularly in sp hy-bridised systems, can reconstruct to produce saturatedconﬁgurations. In graphene, graphite and carbon nan-otubes, the 14-membered ring formed by the removal oftwo adjacent atoms from the structure reconstructs toform a saturated sp structure with two 5-membered andone 8-membered ring - a more stable structure known asa 5-8-5 divacancy. In graphene, this defect may furtherreconstruct to remove the unfavourable 8-membered ringto form a 555-777 or 5555-6-7777 divacancy reconstruc-tion. Monovacancy coalescence is also observed in dia-mond, whereupon annealing at high temperature, mono-vacancies migrate to form divacancy defects, with fewerunsaturated bonds per absent carbon atom. Figure 6: Images of selected carbon defect structures,with atoms in the immediate vicinity of the defecthighlighted in red. (a) graphene divacancy defect (b)graphene Stone-Wales defect (c) graphene monovacancy(d) (9, 9)-nanotube Stone-Wales defect (transverse orien-tation) (e) (9, 9)-nanotube Stone-Wales defect (parallelorientation) (f) diamond split interstitial defectGraphite is the only allotrope of carbon in which trueinterstitial defects are known, wherein interstitial atomsmay be found between graphite layers [77]. The moststable arrangement of this is in a ‘dumbbell’ conﬁgura-tion, where the adatom displaces an atom in the graphitestructure to form a symmetric arrangement of trigonallybonded carbon atoms above and below the sheet. Isolatedinterstitial atoms are not known either experimentally orfrom theory to be stable in diamond, rather a split in-terstitial is found, where a lattice site is shared by twocarbon atoms which are displaced along the [100] and[¯100] directions [99].In sp bonded allotropes of carbon, the rotation of asingle C-C bond transforms four 6-membered rings intoa cluster of two 7-membered and two 5-membered rings,forming a Stone-Wales type defect [100, 101, 101, 102].Table 3 compares the energies of a number of defectsas computed with DFT, GAP-20 and the other modelsconsidered. In most cases, GAP-20 correctly predicts thedefect formation energy to within an error of 10%. Typ-ically, the prediction of the formation energies of Stone-Wales type defects was found to be extremely accurate,with no error (to within the precision of the values given)in either the graphite or graphene cases and only smallerrors for nanotubes. The errors for the formation ener-gies of diamond defects tend to be larger, ranging from25-35%, while those for defective nanotubes range from0-11%. Anecdotally, we note that although relevant train-ing data for the defects considered are represented inthe training data, it proved challenging to achieve de-fect formation energies which were universally accurate.In particular this is due to the sensitivity of the formationenergies to aspects such as the SOAP descriptor cutoﬀ,speciﬁc training data included and the number of sparsepoints used in the training.12able 3: Formation energies of common defects in carbon structures for GAP-20 and the other models considered,with DFT (optB88-vdW) values given as reference. Data are given in eV, with percentage errors relative to DFTgiven in brackets. In each case, the value given is for the optimal geometry of the defect found with that particularmodel. Formation Energy [eV] (% Error)DFT GAP-20 Tersoﬀ LCBOP REBO-II AIREBOGraphene Stone-Wales 4.9 4.8 (2) 1.9 (61) 4.5 (8) 5.3 (8) 5.4 (10)Graphene Monovacancy 7.7 7.0 (8) 2.5 (68) 6.9 (10) 7.5 (3) 7.2 (6)Graphene Divacancy (5-8-5) 7.4 7.9 (7) 5.1 (31) 7.5 (1) 7.5 (1) 9.2 (24)Graphene Divacancy (555-777) 6.6 6.9 (5) 5.2 (21) 6.6 (0) 6.8 (3) 8.7 (32)Graphene Divacancy (5555-6-7777) 6.9 7.4 (7) 7.9 (14) 7.2 (4) 7.6 (10) 9.5 (38)Graphene Adatom 6.4 5.9 (8) 6.7 (5) 6.8 (6) 7.4 (16) 7.8 (22)Graphite Monovacancy 7.8 7.3 (6) 7.1 (9) 7.8 (0) 7.9 (1) 7.6 (3)Graphite Divacancy (5-8-5) 9.6 9.2 (4) 12.6 (31) 8.2 (15) 8.0 (17) 9.7 (1)Graphite Stone-Wales 5.4 5.6 (4) 12.8 (137) 5.7 (6) 6.0 (11) 6.0 (11)Graphite Interstitial 7.4 7.9 (7) 9.7 (31) 7.2 (3) 7.1 (4) 6.8 (8)Diamond Monovacancy 6.6 4.3 (35) 5.2 (36) 7.2 (11) 7.1 (4) 6.8 (8)Diamond Divacancy 9.1 6.6 (27) 5.1 (44) 10.6 (16) 10.7 (18) 10.1 (16)Diamond Split Interstitial 11.4 8.3 (27) 12.4 (9) 9.8 (14) 11.0 (4) 11.4 (0)Nanotube-(9, 9) Monovacancy 6.4 5.8 (9) -5.1 (180) 3.8 (41) -1.6 (125) -2.5 (139)Nanotube-(9, 9) Divacancy 4.7 4.8 (2) -5.5 (217) 2.9 (38) -0.9 (119) -2.3 (149)Nanotube-(9, 9) Stone-Wales (Parallel) 4.4 4.5 (2) -4.5 (202) 2.1 (52) -2.1 (148) -3.9 (189)Nanotube-(9, 9) Stone-Wales (Transverse) 3.5 3.5 (0) -5.8 (261) 2.0 (44) -3.3 (192) -2.00 (156)Nanotube-(9, 0) Monovacancy 5.3 4.9 (8) -0.9 (117) 4.4 (17) 4.7 (11) 3.4 (36)Nanotube-(9, 0) Divacancy 3.6 3.5 (3) -1.0 (128) 3.0 (17) 4.1 (14) 2.8 (22)Nanotube-(9, 0) Stone-Wales (Parallel) 2.7 3.1 (15) -1.3 (148) 3.2 (19) 3.4 (26) 3.6 (33)Nanotube-(9, 0) Stone-Wales (Transverse) 3.5 3.2 (9) -1.1 (131) 3.1 (11) 4.2 (20) 2.6 (26)Considering the empirical potentials, we ﬁnd that themodiﬁcations to the Tersoﬀ potential included in theREBO-II model dramatically improve the quality of thepredicted defect formation energies; percentage errors areoften decreased by an order of magnitude or more whencomparing these two potentials. Surprisingly, these re-sults show that the inclusion of the long-range Lennard-Jones term in the AIREBO model often has a nega-tive impact on the accuracy of its predicted defect for-mation energies, indicating that the addition of a long-range term without reparameterisation of the short-rangecomponents has adversely impacted the energetics of themodel. Indeed, in the case of LCBOP, where this repa-rameterisation of the short range bond-order potentialhas been performed, we ﬁnd that the errors are signiﬁ-cantly reduced, and are in many cases comparable withthe performance of GAP-20. The exception to this be-ing the case of defective nanotubes, where LCBOP ex-hibits errors ranging from 11-52%. In fact, the predictionof nanotube defect formation energies proved challengingfor all of the empirical models considered. In a numberof cases, defect formation was found to be an energeti-cally favourable process and was associated with a strongrelaxation of the nanotube structure after defects were induced.As well as accurately predicting the energetic cost ofinducing defects in carbon structures, GAP-20 was alsofound to accurately predict the structures of these defects.We quantify this accuracy by calculating the structuralsimilarity between the defect structures optimised withour GAP model and those from DFT, in the form of theroot mean squared error (RMSE) between the two opti-mally overlapped structures. In all but a handful of cases,the RMSE for these defects is vanishingly small, withatoms having an error in their position of less than 10 − ˚A, when comparing identical atoms from GAP-20 andDFT structures. In particular, the presence and heightof the characteristic buckling of the Stone-Wales defect ingraphene was well described, as was the structural distor-tion resulting from the presence of defects in both (9,9)and (9,0) index carbon nanotubes. Similarly, the rehy-bridisation and reconstruction of (5-8-5), (555-777) and(5555-6-7777) graphene divacancy defects was accuratelyreproduced, as were the geometries of all of the diamonddefects considered. Situations in which GAP-20 showedstructural inaccuracies were the nanotube monovacancystructures and the parallel Stone-Wales defect in the (9,9)index nanotube, for which the GAP model predicted a13arger distortion of the bulk nanotube structure due tothe presence of the defects. We also ﬁnd, that as with allthe models considered here, GAP-20 does not correctlydescribe the asymmetry introduced through a Jahn-Tellerdistortion of the graphene monovacancy defect – insteadpredicting the monovacancy to have a symmetric geom-etry. This is perhaps unsurprising as the energy diﬀer-ence between the symmetric and asymmetric geometriesis typically small (ca 350 meV). However, even in thecases illustrated here the typical error in the position ofany atom was found to be only 0.1 ˚A. That GAP-20 iscapable of accurately modelling both the energetics andstructural characteristics of a wide range of carbon de-fects indicates its potential usefulness in a wide range ofsimulations in which defective structures may be relevant,including fracture, atom bombardment and simulations ofmembrane characteristics. As discussed previously, the requirements of a potentialfor the satisfactory modelling of crystalline and liquidor amorphous phases are signiﬁcantly diﬀerent. In thecase of crystalline materials, a highly accurate descrip-tion close to a local minimum for a system is required[53]. Conversely, in a liquid simulation, a vastly greaternumber of local conﬁgurations are explored, requiring ahigh degree of ﬂexibility and transferability [62]. As in thecase of GAP-17, we therefore use the liquid as a bench-mark for the ﬂexibility of our potential [62], scanning overa wide range of densities and (here) temperatures. Theaim is to diagnose any possible issues which might be ex-posed by visiting a very diverse set of conﬁgurations dur-ing the simulations. There is a strong precedent for thestudy of high temperature liquids, including carbon, usingDFT [103, 104]. A good agreement with DFT-MD data istherefore strong evidence for the usefulness of the poten-tial for further studies of liquid carbon, which is presentonly under extreme conditions, but is nonetheless vitallyimportant, e.g. for understanding the nucleation and for-mation of diamond and graphite under a wide range ofcircumstances [105–108].The radial distribution function (RDF) of a liquid rep-resents a convenient measure of its local structuring, asdoes its angular distribution function (ADF). Here, wecompare the results of constant volume ab initio molecu-lar dynamics simulations to those of GAP-20. We performtwo sets of simulations, one for a range of densities be-tween 1 . − . − at 5000 K and the other for a rangeof temperatures between 5000 - 9500 K at a ﬁxed densityof 2.5 g cm − . These simulations were performed for 216atom systems using a chain of 5 Nos´e-Hoover thermostats. Ab initio trajectories were generated using VASP, simula-tions were performed at the gamma point and data werecollected for 5 ps at each temperature and pressure [71–73]. We ﬁnd that there is a very good agreement between the ab initio data and the GAP-20 predictions for boththe RDF and ADF across the wide range of tempera-tures studied (see ﬁgure 7). GAP-20 correctly modelsthe increased structuring of the liquid carbon as the tem-perature is reduced from 9500-4500 K. At temperaturesbelow approximately 3500 K, the GAP model predicts theliquid to form an amorphous glass which slowly graphi-tises (which is entirely expected because the temperatureis now below the melting line). While a full discussionon the mechanism of formation and resulting morphol-ogy of graphitised amorphous carbons generated usingGAP-20 is beyond the scope of the current work, this pro-cess has previously been shown to be an excellent methodof diﬀerentiation between the numerous available carbonpotentials [36, 37]. Figure 8 shows the RDF and ADFcomputed with both GAP-20 and optB88-vdW across awide range of densities, from 1 . . − at 5000K. This test is particularly important as it represents dy-namical simulations of structures from highly sp and sp hybridised (low density) through to a predominantly sp hybridised liquid at higher density. GAP-20 captures thischange in the bonding characteristics of liquid carbon, inparticular the increase in the sp hybridised fraction ofthe liquid at very low densities (reﬂected in bond anglesclose to 180 degrees), qualitatively similar to GAP-17 [62].Figure 7: Angular and radial distribution functions forliquid carbon at a ﬁxed density of 2.5 g cm − for temper-atures between 5000 - 9500 K. GAP-20 results are shownin black, while reference DFT (optB88-vdW) data aregiven in red.14hat GAP-20 can model the atomistic structure ofliquid carbon at a wide variety of temperatures and den-sities, while maintaining the ability to accurately predictproperties such as the phonon relation and defect for-mation energies is a reﬂection of the ﬂexibility of theGAP methodology. Such wildly diﬀerent systems explorea range of characteristic energies, where important ﬂuc-tuations cover many orders of magnitude; in carbon, thiscan be anywhere from the meV range in the case of dif-ferences between graphite defect energies to ﬂuctuationson the order of tens of electron volts as encountered inthe liquid. Despite these very diﬀerent energy ranges, itis not unlikely that a potential may encounter all of themover the course of a single simulation (for example duringthe crystallisation of a solid phase directly from the liq-uid) and it is therefore important that they be handledcorrectly.Figure 8: Angular and radial distribution functions forliquid carbon at 5000 K for a range of densities from 1.5to 3.5 g cm − . GAP-20 results are shown in black, whilereference DFT (optB88-vdW) data are given in red. Ultimately, the purpose of any interatomic potential isthat it may be used for the discovery of new and interest-ing phenomena. Consequently,in its application it mayencounter structures which were not explicitly consideredin its construction, in this case meaning that it must model structures which were not included in the trainingdata base. It has therefore been a criticism of ML poten-tials that their poor performance in extrapolation mightinhibit their use for scientiﬁc discovery. As discussed ear-lier, the problem of extrapolation is circumvented by thefact that we consider only the local environment around aparticular atom to be important for predicting its atomicenergy and the forces acting upon it. While the problemof exploring the entirety of the 3N dimensional chemicalspace is indeed intractable, suﬃciently sampling all of thephysically relevant local environments is not [59].We demonstrate this here by performing a diagnosticGAP driven random structure search (GAP-RSS), simi-lar in spirit to Refs. [109] and [5], and demonstrate thatthe predicted energies of these structures agree well withthose from DFT [5, 60, 85]. We then calculate a numberof high energy pathways for speciﬁc transformations notincluded in the training and compare these to DFT. Bothof these tests serve the purpose of exploring the high en-ergy regions of the potential energy surface which may beexplored during molecular dynamics simulations or geom-etry optimisation and which must be well described foran ML model to be transferable. Importantly, they areboth explicitly designed to include conﬁgurations whichare not present in the training data set of GAP-20.To perform the ﬁrst test, we generate a cubic unitcell with lattice parameter a = 3 ˚A. In this cell, we ran-domly place 8 carbon atoms, avoiding any overlaps suchthat the distance between any two carbon atoms is notless than 2 ˚A. This process is performed to generate 1000initial randomised geometries. The LAMMPS package isthen used to optimise lattice vectors of the cell indepen-dently using conjugate gradient descent, while maintain-ing their orthogonality, until the total energy is convergedto within 10 − eV [110]. Following this, the positions ofthe atoms in the unit cell are optimised using a FIRE al-gorithm [111], until the total energy is again converged towithin 10 − eV. This cycle is repeated twice more beforeperforming a ﬁnal conjugate gradient optimisation of theatomic positions and cell vectors until the total energy isconverged to 10 − eV.15igure 9: (a) A comparison of the histograms of ener-gies of structures identiﬁed by GAP-RSS, given in eV /atom, showing good agreement for the prediction of theenergy of structures between GAP-20 (shown in black)and DFT (optB88-vdW) (shown in red). A number ofexamples of structures identiﬁed in GAP-20 driven ran-dom structure search are shown. The position of eachof the example structures on the histogram is indicatedby their numbering, 1) AB stacked graphite, AA stackedgraphite. 2) cubic and hexagonal diamond. 3) Haekelite4) crosslinked graphitic structure. 5) Novel carbon struc-ture with high proportion of sp hybridised carbon atoms(b) The structures resulting from the GAP-RSS projectedinto the sketch-map representation from Fig 1. The den-sity of the structures present in the training data are indi-cated by the contour lines, while the structures identiﬁedfrom the GAP-RSS are shown as individual points.To validate the results of our GAP-RSS, we recom-pute the energies of the structures found using the ref- erence DFT method used to train the model. We notethat for across all 1000 structures, the predicted energyagrees well with the energy predicted from DFT. It haspreviously been shown that correctly identifying low en-ergy structures from a RSS is an extremely challengingtask for empirical models, which often predict qualita-tively incorrect behaviour and fail to ﬁnd physically rele-vant conﬁgurations due to their having many more localminima than the DFT PES [112, 113].Our GAP-RSS correctly identiﬁes a range of impor-tant low-energy carbon allotropes, as well as numerousmore exotic species. In particular, AB-stacked graphitewas found as the lowest energy allotrope of carbon. AA-and ABC- stacked graphite allotropes are also identiﬁedin the search, their energy is correctly predicted to behigher than that of the AB stacked graphite structure.Furthermore, both diamond and lonsdaleite are both cor-rectly identiﬁed. We also identify a number of more exoticcarbon allotropes, some of which are known either fromexperiment or theory but were not included in the train-ing dataseset, including crosslinked graphite structures,porous carbon cages and a variety of haekelite structures.For the vast majority of structures found during the GAP-20 driven random structure search, the predicted energiesfrom both DFT and GAP-20 agree well (ﬁgure 9).We also return at this stage to the sketch-map rep-resentation of the training dataset given in ﬁgure 1. Inﬁgure 9(b) we provide a projection of the GAP-RSS struc-tures onto this sketch-map representation. GAP-RSSpoints are coloured according to the GAP-20 energy error.The density of structures present in the original trainingdataset is indicated by black contour lines. It is clear thatmost structures found are clustered in the region repre-senting the bulk amorphous and crystalline polymorphs,with very few structures representative of fullerenes ornanotubes identiﬁed. This is a reﬂection of the fact thatonly 8 atoms are included in the unit cell used for theRSS. Additionally, the RSS procedure employed beginswith simulation cells which are fully periodic and with nosymmetry constraints imposed on the initial atomic posi-tions. In the lower left of the sketch-map is a cluster whichis structurally distinct from those present in the train-ing data, as indicated by its large separation from otherpoints in the sketch-map. These structures are charac-terised by their highly sp rich character. Although a sig-niﬁcant number of amorphous structures which are rich insp hybridised carbon atoms are included in the trainingdata, there are indeed very few crystalline sp rich struc-tures. Despite being structurally distinct from anythingincluded in the training dataset, the error in the GAP-20 prediction for the energy of these structures remainslow. This indicates excellent performance for GAP-20in applications where transferability to potentially novelstructures is important.16igure 10: Energies for rigid transformations of a C-Cbond in graphene (a) and in a C60 fullerene (b). Re-sults from GAP-20, DFT (optB88-vdW) and a selectionof empirical potentials are shown.We also test GAP-20 on a number of speciﬁc struc-tural transformations. Although our GAP model is nottrained explicitly on reaction barriers, it is useful to testhow well the model performs for the prediction of thetypes of barriers which might be encountered in studiesof the reactivity of carbon nanostructures. To this end,we compare the predictions of our GAP model to those ofDFT for two approximate transformations; a rigid bondrotation in graphene and a C60 fullerene. Since these cal-culations are performed on rigid structures, rather thanfor example using nudged elastic band calculations, thebarriers calculated here will not be true defect forma-tion barriers. They are, however, still representations ofphysically reasonable points on the potential energy sur-face which are not included in the training dataset andso form a useful test of the potential compared to othermodels [66, 114].Figure 10 (a) shows the barrier to rigid rotation of aC-C bond within a graphene sheet as predicted by GAP-20, and the tested empirical potentials. The performanceof GAP-20 on this test is reassuring for its wider applica-tion, it achieves excellent accuracy with respect to boththe height of the local minimum in the rotation and theheight of the barrier. The AIREBO and REBO potentialscapture the general shape and height of the barrier, butpredict jagged curves for the rotation, as compared to thesmooth variation from DFT. The LCBOP and Tersoﬀ po- tentials perform poorly, overestimating the energy of therotation by more than 30 eV, and erroneously situatinga maximum in the potential energy where a minimum isfound from DFT. In the case of the Tersoﬀ potential, twoadditional spurious minima are located close to where theDFT maxima are located.A similar situation is observed for the behaviour ofthe C60 rotational barrier in ﬁgure 10 (b). Here, it isagain seen that GAP-20 performs well, providing a goodestimation of the barrier height and shape with respectto DFT. The REBO, AIREBO and Tersoﬀ potentials allsituate spurious minima in the potential energy close towhere the maxima in the reference DFT curve are located,although they do also predict a minimum at the correctrotation. LCBOP again overestimates the energy of thebarrier by 20 eV, and situates a maximum in the potentialenergy surface where a minimum ought to be located. The advantages conferred by the ﬂexibility of the Gaus-sian approximation potential framework are made clearby the wide variety of structures which are accuratelytreated by GAP-20. The variable hybridisation of carbonmakes it an extremely challenging element to model us-ing empirical potentials; its structurally diverse allotropesare energetically similar and the properties of these de-pend on an broad range of physical interactions, fromthe weak van der Waals forces binding graphite to thestiﬀ covalent bonds of diamond. We have demonstratedhere a model which is equally suited to modelling notjust these two bulk structures, but defects, surfaces andliquid carbon as well. Wherever possible, we have vali-dated the performance of GAP-20 against the referenceDFT method and shown it to perform well for a num-ber of physical properties across the diﬀerent phases. In-cluded in these are a number of processes involving bondbreaking and formation, some of which have been chal-lenging for cheaper empirical potentials by construction.Tests for transferability, speciﬁcally by diagnostic GAP-RSS runs and the study of transformations not includedin the training, suggest that GAP-20 could readily be ap-plied to more thorough explorations of the carbon poten-tial energy landscape, for example, in the search for largerfullerenes [61, 115] or in crystal-structure prediction byexpanding on Refs. [5] and [109]. Further applicationsmay include the more detailed study of non-graphitisingor “hard” carbons [116–119], following on from earlierGAP-17 based studies in Ref. [57] and [58].Despite the many potential applications of GAP-20,the model is not without its shortcomings. While it re-mains signiﬁcantly more computationally aﬀordable thandirect ab initio simulation (in particular for large systems)the cost of its evaluation is much greater than that of em-pirical potentials (Supplementary Fig. 12), and thereforethe latter will still give access to even larger-scale systems1736]. We also note that ‘real’ carbon is rarely found inisolation – hydrogenation and oxidation of carbon struc-tures is not considered here. The expansion of the scope ofthe potential to treat hydrogenated or oxidised structureswould complicate the process of training both by requir-ing a larger training dataset and by requiring the inclusionof a number of interactions not considered here. In ad-dition to long ranged van der Waals interactions (whichare only considered approximately in the current work),the introduction of other elements introduces the associ-ated complexities of substantial charge rearrangements:polar bonds and partial charges. Long ranged Coulombinteractions, dipole-dipole and higher order multipole in-teractions remain a challenge for ML potentials. We notethat the combination of pure carbon simulations (usingGAP-17) and subsequent density-functional analyses ofhydrogenation and oxidation [36] or metal intercalation[58] has proven fruitful, and we expect that further stud-ies of this type will be facilitated by GAP-20, particu-larly when low-density, dispersion-dominated nanostruc-tures are concerned.We believe that we have achieved an excellent com-promise for our potential, in that it accurately modelsthe wide range of structures required to make it broadlyapplicable. We do not claim perfect accuracy for all prop-erties, however; we accept that ﬁtting to such a widerange of structures will necessarily impact the accuracy insome areas. Notably, a large number of structures whichwere generated as part of the total training dataset areexcluded from the ﬁnal training. Conversely, many havebeen included which might be irrelevant for a researcher’sintended purpose. With this in mind, we have made freelyavailable the total training dataset (structures, energies,forces and virial coeﬃcients) produced as part of thiswork. While we do not believe that this will typicallybe necessary, it is a further virtue of the GAP frameworkthat a potential may be readily retrained to suit a partic-ular purpose simply by modifying the composition of thetraining conﬁgurations used, we believe it is beneﬁcial tooﬀer the opportunity for users to tune the model to targethigher accuracy in a particular region of interest.In addition to the training dataset, the potential in-troduced here is provided in the form of an XML ﬁleand has been made freely available, along with the GAPcode at , it has the uniqueidentiﬁer GAP 2020 4 27 60 2 50 5 436 and may be usedwithin the QUIP software package which can be foundat https://github.com/libAtoms/QUIP . GAP-20 maybe used for simulations directly in LAMMPS, using theQUIP for LAMMPS plugin [110].

10 Supplementary Material

See [Supplementary Material] for full details of computedformation energies and phonon dispersion curves for allmodels, further information on GAP hyper-parameter se- lection and command line argument used, graphene bi-layer separation curves and force errors for various con-ﬁgurations. More information on the optimisation andcomputational cost of GAP-20 compared to DFT is alsogiven.

11 Acknowledgements

A.M. was supported by the European Research Coun-cil under the European Union’s Seventh Framework Pro-gramme (FP/2007-2013) / ERC Grant Agreement No.616121 (HeteroIce project). V.L.D. acknowledges aLeverhulme Early Career Fellowship and support fromthe Isaac Newton Trust. We are grateful to the UKMaterials and Molecular Modelling Hub for computa-tional resources, which is partially funded by the EPSRC(EP/P020194/1). We are also grateful for computationalsupport from the UK national high-performance comput-ing service, ARCHER, for which access was obtained viathe UKCP consortium and funded by EPSRC grant refEP/P022561/1. In addition, we are grateful for the useof the UCL Grace High Performance Computing Facil-ity (Grace@UCL), and associated support services, in thecompletion of this work.

12 Data Availability

The data that support the ﬁndings of this study areopenly available at and on request from the authors.18 eferences [1] K. S. Novoselov, A. K. Geim, S. V. Moro-zov, D. Jiang, Y. Zhang, S. V. Dubonos, I. V.Griegorieva, and A. A. Firsov, “Electric Field Ef-fect in Atomically Thin Carbon Films,”

Science ,vol. 306, no. 5696, pp. 666–669, 2004.[2] H. Kroto, J. Heath, S. O’Brien, R. Curl, andR. E. Smalley, “C60: Buckminsterfullerene,”

Na-ture , vol. 318, pp. 162–163, 1985.[3] M. Martinez-Canales and C. J. Pickard, “Thermo-dynamically Stable Phases of Carbon at Multitera-pascal Pressures,”

Physical Review Letters , vol. 108,no. 4, p. 045704, 2012.[4] R. C. Powles, N. A. Marks, and D. W. M. Lau,“Self-assembly of sp2 -bonded carbon nanostruc-tures from amorphous precursors,”

Physical Re-view B - Condensed Matter and Materials Physics ,vol. 79, no. 7, pp. 1–11, 2009.[5] V. L. Deringer, G. Cs´anyi, and D. M. Proserpio,“Extracting Crystal Chemistry from AmorphousCarbon Structures,”

ChemPhysChem , vol. 18,no. 8, pp. 873–877, 2017.[6] D. Tom´anek,

Guide Through the Nanocarbon Jun-gle . San Rafael: Morgan & Claypool Publishers,1 ed., 2014.[7] R. Mirzayev, K. Mustonen, M. R. A. Monazam,A. Mittelberger, T. J. Pennycook, C. Mangler,T. Susi, J. Kotakoski, and J. C. Meyer, “Bucky-ball sandwiches,”

Science Advances , vol. 3, no. 6,p. e1700176, 2017.[8] C. Lee, X. Wei, J. W. Kysar, and J. Hone, “Mea-surement of the elastic properties and intrinsicstrength of monolayer graphene.,”

Science , vol. 321,no. 5887, pp. 385–388, 2008.[9] M. H. Nazare and A. J. Neves,

Properties Growthand Applications of Diamond . London: INSPEC,1 ed., 2001.[10] A. K. Geim, “Graphene: Status and prospects,”

Science , vol. 324, no. 5934, pp. 1530–1534, 2009.[11] P. Avouris, Z. Chen, and V. Perebeinos, “Carbon-based electronics.,”

Nature Nanotechnology , vol. 2,no. 10, pp. 605–15, 2007.[12] Y. Yan, J. Miao, Z. Yang, F. X. Xiao, H. B. Yang,B. Liu, and Y. Yang, “Carbon nanotube catalysts:Recent advances in synthesis, characterization andapplications,”

Chemical Society Reviews , vol. 44,no. 10, pp. 3295–3346, 2015. [13] W. Zhang, S. Zhu, R. Luque, S. Han, L. Hu, andG. Xu, “Recent development of carbon electrodematerials and their bioanalytical and environmen-tal applications,”

Chemical Society Reviews , vol. 45,no. 3, pp. 715–752, 2016.[14] Y.-n. Zhang, Q. Niu, X. Gu, N. Yang, and G. Zhao,“Recent Progress of Carbon Nanomaterials forElectrochemical Detection and Removal of Environ-mental Pollutants,”

Nanoscale , 2019.[15] Z. Yang, J. Ren, Z. Zhang, X. Chen, G. Guan,L. Qiu, Y. Zhang, and H. Peng, “Recent Advance-ment of Nanostructured Carbon for Energy Ap-plications,”

Chemical Reviews , vol. 115, no. 11,pp. 5159–5223, 2015.[16] W. Yuan, Y. Zhang, L. Cheng, H. Wu, L. Zheng,and D. Zhao, “The applications of carbon nan-otubes and graphene in advanced rechargeablelithium batteries,”

Journal of Materials ChemistryA , vol. 4, no. 23, pp. 8932–8951, 2016.[17] A. S. Aric`o, P. Bruce, B. Scrosati, J. M. Tarascon,and W. Van Schalkwijk, “Nanostructured materi-als for advanced energy conversion and storage de-vices,”

Nature Materials , vol. 4, no. 5, pp. 366–377,2005.[18] I. N. Kholmanov, C. W. Magnuson, R. Piner, J. Y.Kim, A. E. Aliev, C. Tan, T. Y. Kim, A. A.Zakhidov, G. Sberveglieri, R. H. Baughman, andR. S. Ruoﬀ, “Optical, electrical, and electrome-chanical properties of hybrid graphene/carbon nan-otube ﬁlms,”

Advanced Materials , vol. 27, no. 19,pp. 3053–3059, 2015.[19] F. Bonaccorso, Z. Sun, T. Hasan, and A. C. Ferrari,“Graphene Photonics and Optoelectronics,”

NaturePhotonics , vol. 4, no. 9, pp. 611–622, 2010.[20] A. H. Castro Neto, F. Guinea, N. M. R. Peres, K. S.Novoselov, and A. K. Geim, “The electronic prop-erties of graphene,”

Reviews of Modern Physics ,vol. 81, no. 1, pp. 109–162, 2009.[21] C. Farrera, F. Torres And´on, and N. Feliu, “Car-bon Nanotubes as Optical Sensors in Biomedicine,”

ACS Nano , vol. 11, no. 11, pp. 10637–10643, 2017.[22] A. Courvoisier, M. J. Booth, and P. S. Salter, “In-scription of 3D waveguides in diamond using anultrafast laser,”

Applied Physics Letters , vol. 109,no. 3, 2016.[23] P. Latawiec, V. Venkataraman, M. J. Burek,B. J. M. Hausmann, I. Bulu, and M. Lonˇcar, “On-chip diamond Raman laser,”

Optica , vol. 2, no. 11,p. 924, 2015.1924] L. Pastewka, S. Moser, P. Gumbsch, andM. Moseler, “Anisotropic mechanical amorphiza-tion drives wear in diamond,”

Nature Materials ,vol. 10, no. 1, pp. 34–38, 2011.[25] T. B. Shiell, D. G. McCulloch, D. R. McKenzie,M. R. Field, B. Haberl, R. Boehler, B. A. Cook,C. De Tomas, I. Suarez-Martinez, N. A. Marks, andJ. E. Bradby, “Graphitization of Glassy Carbon af-ter Compression at Room Temperature,”

PhysicalReview Letters , vol. 120, no. 21, p. 215701, 2018.[26] J. Tersoﬀ, “Empirical interatomic potential forcarbon, with applications to amorphous carbon,”

Physical Review Letters , vol. 61, no. 25, pp. 2879–2882, 1988.[27] D. W. Brenner, “Empirical potential for hydrocar-bons for use in simulating the chemical vapor depo-sition of diamond ﬁlms,”

Physical Review B , vol. 42,no. 15, pp. 9458–9471, 1990.[28] T. C. O’Connor, J. Andzelm, and M. O. Robbins,“AIREBO-M: A reactive model for hydrocarbons atextreme pressures,”

Journal of Chemical Physics ,vol. 142, no. 2, p. 024903, 2015.[29] S. J. Stuart, A. B. Tutein, and J. A. Harrison, “A re-active potential for hydrocarbons with intermolecu-lar interactions,”

The Journal of Chemical Physics ,vol. 112, no. 2000, pp. 6472–6486, 2000.[30] J. H. Los and a. Fasolino, “Intrinsic long-rangebond-order potential for carbon: Performance inMonte Carlo simulations of graphitization,”

Physi-cal Review B , vol. 68, no. 2, p. 24107, 2003.[31] L. M. Ghiringhelli, J. H. Los, A. Fasolino, and E. J.Meijer, “Improved long-range reactive bond-orderpotential for carbon. II. Molecular simulation of liq-uid carbon,”

Physical Review B , vol. 72, no. 21,p. 214103, 2005.[32] J. H. Los, L. M. Ghiringhelli, E. J. Meijer, andA. Fasolino, “Improved long-range reactive bond-order potential for carbon. I. Construction (Phys-ical Review B. Condensed Matter and MaterialsPhysics (2005) 72 (214102)),”

Physical Review B ,vol. 73, no. 22, p. 229901, 2006.[33] N. A. Marks, “Generalizing the environment-dependent interaction potential for carbon,”

Phys-ical Review B , vol. 63, no. December 2000, pp. 1–7,2006.[34] L. Pastewka, P. Pou, R. P´erez, P. Gumbsch,and M. Moseler, “Describing bond-breaking pro-cesses by reactive potentials: Importance of anenvironment-dependent interaction range,”

Physi-cal Review B - Condensed Matter and MaterialsPhysics , vol. 78, no. 16, p. 161402, 2008. [35] S. Goverapet Srinivasan, A. C. T. Van Duin, andP. Ganesh, “Development of a ReaxFF potential forcarbon condensed phases and its application to thethermal fragmentation of a large fullerene,”

Journalof Physical Chemistry A , vol. 119, no. 4, pp. 571–580, 2015.[36] C. de Tomas, A. Aghajamali, J. L. Jones, D. J. Lim,M. J. Lopez, I. Suarez-Martinez, and N. A. Marks,“Transferability in interatomic potentials for car-bon,”

Carbon , vol. 155, pp. 624–634, 2019.[37] C. de Tomas, I. Suarez-Martinez, and N. A. Marks,“Graphitization of amorphous carbons: A com-parative study of interatomic potentials,”

Carbon ,vol. 109, pp. 681–693, 2016.[38] J. Behler and M. Parrinello, “Generalized neural-network representation of high-dimensionalpotential-energy surfaces,”

Physical Review Let-ters , vol. 98, no. 14, p. 146401, 2007.[39] A. P. Bart´ok, M. C. Payne, R. Kondor, andG. Cs´anyi, “Gaussian approximation potentials:The accuracy of quantum mechanics, without theelectrons,”

Physical Review Letters , vol. 104, no. 13,p. 136403, 2010.[40] K. T. Sch¨utt, H. E. Sauceda, P. Kindermans,A. Tkatchenko, K. M¨uller, H. E. Sauceda, P. Kin-dermans, A. Tkatchenko, and K. M, “SchNet –A deep learning architecture for molecules andmaterials SchNet – A deep learning architecturefor molecules and materials,”

Journal of ChemicalPhysics , vol. 148, p. 241722, 2018.[41] T. D. Huan, R. Batra, J. Chapman, S. Krishnan,L. Chen, and R. Ramprasad, “A universal strategyfor the creation of machine learning-based atomisticforce ﬁelds,” npj Computational Materials , vol. 3,no. 1, p. 37, 2017.[42] A. P. Bart´ok, S. De, C. Poelking, N. Bernstein,J. R. Kermode, G. Cs´anyi, and M. Ceriotti, “Ma-chine learning uniﬁes the modeling of materials andmolecules,”

Science Advances , vol. 3, p. e1701816,2017.[43] B. Kolb, X. Luo, X. Zhou, B. Jiang, and H. Guo,“High-Dimensional Atomistic Neural Network Po-tentials for Molecule-Surface Interactions: HClScattering from Au(111),”

Journal of PhysicalChemistry Letters , vol. 8, pp. 666–672, 2017.[44] P. E. Dolgirev, I. A. Kruglov, and A. R. Oganov,“Machine learning scheme for fast extraction ofchemically interpretable interatomic potentials,”

AIP Advances , vol. 6, p. 085318, 2016.2045] A. Glielmo, P. Sollich, and A. De Vita, “Accu-rate interatomic force ﬁelds via machine learningwith covariant kernels,”

Physical Review B , vol. 95,no. 21, p. 214302, 2017.[46] V. K˚urkov´a, “Kolmogorov’s theorem and multilayerneural networks,”

Neural Networks , vol. 5, no. 3,pp. 501–506, 1992.[47] C. M. Handley and J. Behler, “Next generation in-teratomic potentials for condensed systems,”

TheEuropean Physical Journal B , vol. 87, p. 152, 2014.[48] J. Behler, “Neural network potential-energy sur-faces in chemistry: A tool for large-scale sim-ulations,”

Physical Chemistry Chemical Physics ,vol. 13, no. 40, pp. 17930–17955, 2011.[49] J. Behler, “Perspective: Machine learning poten-tials for atomistic simulations,”

Journal of Chemi-cal Physics , vol. 145, no. 17, p. 170901, 2016.[50] V. L. Deringer, M. A. Caro, and G. Cs´anyi, “Ma-chine Learning Interatomic Potentials as EmergingTools for Materials Science,”

Advanced Materials ,vol. 31, no. 46, p. 1902765, 2019.[51] R. Z. Khaliullin, H. Eshet, T. D. K¨uhne, J. Behler,and M. Parrinello, “Graphite-diamond phase coex-istence study employing a neural-network mappingof the ab initio potential energy surface,”

PhysicalReview B , vol. 81, no. 10, p. 100103, 2010.[52] R. Z. Khaliullin, H. Eshet, T. D. K¨uhne, J. Behler,and M. Parrinello, “Nucleation mechanism for thedirect graphite-to-diamond phase transition,”

Na-ture Materials , vol. 10, no. 9, pp. 693–697, 2011.[53] P. Rowe, G. Cs´anyi, D. Alf`e, and A. Michaelides,“Development of a machine learning potential forgraphene,”

Physical Review B , vol. 97, p. 054303,2018.[54] M. A. Caro, V. L. Deringer, J. Koskinen, T. Lau-rila, and G. Cs´anyi, “Growth Mechanism and Ori-gin of High sp3 Content in Tetrahedral AmorphousCarbon,”

Physical Review Letters , vol. 120, no. 16,p. 166101, 2018.[55] V. L. Deringer, M. A. Caro, R. Jana, A. Aarva,S. R. Elliott, T. Laurila, G. Cs´anyi, andL. Pastewka, “Computational Surface Chemistryof Tetrahedral Amorphous Carbon by CombiningMachine Learning and Density Functional Theory,”

Chemistry of Materials , vol. 30, no. 21, pp. 7438–7445, 2018.[56] M. A. Caro, A. Aarva, V. L. Deringer, G. Cs´anyi,and T. Laurila, “Reactivity of Amorphous CarbonSurfaces: Rationalizing the Role of Structural Mo-tifs in Functionalization Using Machine Learning,”

Chemistry of Materials , vol. 30, no. 21, pp. 7446–7455, 2018.[57] V. L. Deringer, C. Merlet, Y. Hu, T. H. Lee, J. A.Kattirtzi, O. Pecher, G. Cs´anyi, S. R. Elliott, andC. P. Grey, “Towards an atomistic understandingof disordered carbon electrode materials,”

ChemicalCommunications , vol. 54, no. 47, pp. 5988–5991,2018.[58] J. X. Huang, G. Cs´anyi, J. B. Zhao, J. Cheng,and V. L. Deringer, “First-principles study of alkali-metal intercalation in disordered carbon anode ma-terials,”

Journal of Materials Chemistry A , vol. 7,no. 32, pp. 19070–19080, 2019.[59] A. P. Bart´ok, J. Kermode, and N. Bernstein, “Ma-chine Learning a General-Purpose Interatomic Po-tential for Silicon,”

Physical Review X , vol. 8, no. 4,p. 41048, 2018.[60] C. J. Pickard and R. J. Needs, “Ab initio ran-dom structure searching,”

Journal of Physics: Con-densed Matter , vol. 23, p. 053201, 2011.[61] D. Wales,

Energy Landscapes: Applications to Clus-ters, Biomolecules and Glasses . Cambridge: Cam-bridge University Press, 1 ed., 2004.[62] V. L. Deringer and G. Cs´anyi, “Machine learningbased interatomic potential for amorphous carbon,”

Physical Review B , vol. 95, no. 9, p. 094203, 2017.[63] A. P. Bart´ok, R. Kondor, and G. Cs´anyi, “On rep-resenting chemical environments,”

Physical ReviewB , vol. 87, no. 18, p. 184115, 2013.[64] J. Behler, “Atom-centered symmetry functions forconstructing high-dimensional neural network po-tentials,”

Journal of Chemical Physics , vol. 134,no. 7, p. 074106, 2011.[65] P. Gasparotto, R. H. Meißner, and M. Ceriotti,“Recognizing Local and Global Structural Motifsat the Atomic Scale,”

Journal of Chemical Theoryand Computation , vol. 14, pp. 486–498, 2018.[66] D. J. Wales, “Closed-shell structures and the build-ing game,”

Chemical Physics Letters , vol. 141,no. 6, pp. 478–484, 1987.[67] K. Lee, ´E. D. Murray, L. Kong, B. I. Lundqvist,and D. C. Langreth, “Higher-accuracy van derWaals density functional,”

Physical Review B -Condensed Matter and Materials Physics , vol. 82,no. 8, p. 081101, 2010.[68] J. Klimeˇs, D. R. Bowler, and A. Michaelides, “Vander Waals density functionals applied to solids,”

Physical Review B , vol. 83, no. 19, p. 195131, 2011.2169] M. Dion, H. Rydberg, E. Schr¨oder, D. C. Langreth,and B. I. Lundqvist, “Van der Waals density func-tional for general geometries,”

Physical Review Let-ters , vol. 92, no. 24, p. 246401, 2004.[70] J. Klimeˇs, D. R. Bowler, and A. Michaelides,“Chemical accuracy for the van der Waals densityfunctional.,”

Journal of Physics: Condensed Mat-ter , vol. 22, no. 2, p. 022201, 2010.[71] G. Kresse and J. Furthm¨uller, “Eﬃcient iterativeschemes for ab initio total-energy calculations usinga plane-wave basis set,”

Physical Review B , vol. 54,no. 16, pp. 11169–11186, 1996.[72] G. Kresse and J. Furthm¨uller, “Eﬃciency of ab-initio total energy calculations for metals and semi-conductors using a plane-wave basis set,”

Compu-tational Materials Science , vol. 6, no. 1, pp. 15–50,1996.[73] G. Kresse, “From ultrasoft pseudopotentials to theprojector augmented-wave method,”

Physical Re-view B , vol. 59, no. 3, pp. 1758–1775, 1999.[74] H. Monkhorst and J. Pack, “Special points for Bril-louin zone integrations,”

Physical Review B , vol. 13,no. 12, pp. 5188–5192, 1976.[75] G. Graziano, J. Klimeˇs, F. Fernandez-Alonso, andA. Michaelides, “Improved description of soft lay-ered materials with van der Waals density func-tional theory,”

Journal of Physics: Condensed Mat-ter , vol. 24, no. 42, p. 424216, 2012.[76] R. Hoﬀmann, A. A. Kabanov, A. A. Golov, andD. M. Proserpio, “ Homo Citans and Carbon Al-lotropes: For an Ethics of Citation ,”

AngewandteChemie International Edition , vol. 55, no. 37,pp. 10962–10976, 2016.[77] L. Li, S. Reich, and J. Robertson, “Defect energiesof graphite: Density-functional calculations,”

Phys-ical Review B - Condensed Matter and MaterialsPhysics , vol. 72, no. 18, p. 184109, 2005.[78] T. Xu and L. Sun, “Structural defects in graphene,”

Defects in Advanced Electronic Materials and NovelLow Dimensional Structures , vol. 5, no. 1, pp. 137–160, 2018.[79] J. C. Charlier, “Defects in carbon nanotubes,”

Accounts of Chemical Research , vol. 35, no. 12,pp. 1063–1069, 2002.[80] S. T. Skowron, I. V. Lebedeva, A. M. Popov, andE. Bichoutskaia, “Energetics of atomic scale struc-ture changes in graphene,”

Chemical Society Re-views , vol. 44, no. 44, pp. 3143–3176, 2015. [81] J. Ristein, “Diamond surfaces : familiar and amaz-ing,”

Applied Physics A , vol. 384, pp. 377–384,2006.[82] G. Kern and J. Hanfer, “Ab initio molecular-dynamics studies of the graphitization of ﬂat andstepped diamond (111) surfaces,”

Physical ReviewB , vol. 19, p. 13167, 1998.[83] N. Ooi, A. Rairkar, and J. B. Adams, “Density func-tional study of graphite bulk and surface proper-ties,”

Carbon , vol. 44, no. 2, pp. 231–242, 2006.[84] G. Kern and J. Hafner, “Ab initio calculationsof the atomic and electronic structure of diamond(111) surfaces with steps,”

Physical Review B ,vol. 58, no. 4, pp. 2161–2169, 1998.[85] V. L. Deringer, C. J. Pickard, and G. Cs´anyi,“Data-Driven Learning of Total and Local Ener-gies in Elemental Boron,”

Physical Review Letters ,vol. 120, no. 15, p. 156001, 2018.[86] S. De, A. P. Bart´ok, G. Cs´anyi, and M. Ceriotti,“Comparing molecules and solids across structuraland alchemical space,”

Physical Chemistry Chemi-cal Physics , vol. 18, no. 20, p. 13754, 2016.[87] M. Ceriotti, G. A. Tribello, and M. Parrinello,“Simplifying the representation of complex free-energy landscapes using sketch-map.,”

Proceedingsof the National Academy of Sciences , vol. 108,no. 32, pp. 13023–13028, 2011.[88] A. P. Bart´ok, M. J. Gillan, F. R. Manby, andG. Cs´anyi, “Machine-learning approach for one-and two-body corrections to density functional the-ory: Applications to molecular and condensed wa-ter,”

Physical Review B , vol. 88, no. 5, p. 054104,2013.[89] D. Dragoni, T. D. Daﬀ, G. Csanyi, and N. Marzari,“Achieving DFT accuracy with a machine-learninginteratomic potential: thermomechanics and de-fects in bcc ferromagnetic iron,”

Physical ReviewMaterials , vol. 2, p. 013808, 2017.[90] S. Fujikake, V. L. Deringer, T. H. Lee, M. Krynski,S. R. Elliott, G. Cs´anyi, S. Fujikake, V. L. Deringer,H. Lee, and M. Krynski, “Gaussian approximationpotential modeling of lithium intercalation in car-bon nanostructures,”

Journal of Chemical Physics ,vol. 148, p. 241714, 2018.[91] W. J. Szlachta, A. P. Bart´ok, and G. Cs´anyi, “Accu-racy and transferability of Gaussian approximationpotential models for tungsten,”

Physical Review B ,vol. 90, no. 10, p. 104018, 2014.[92] M. J. Willatt, F. Musil, and M. Ceriotti, “Featureoptimization for atomistic machine learning yields a22ata-driven construction of the periodic table of theelements,”

Physical Chemistry Chemical Physics ,vol. 20, no. 47, pp. 29661–29668, 2018.[93] F. A. Faber, A. S. Christensen, B. Huang, and O. A.Von Lilienfeld, “Alchemical and structural distri-bution based representation for universal quantummachine learning,”

Journal of Chemical Physics ,vol. 148, no. 24, 2018.[94] A. S. Christensen, L. A. Bratholm, F. A. Faber,and O. Anatole Von Lilienfeld, “FCHL revisited:Faster and more accurate quantum machine learn-ing,”

Journal of Chemical Physics , vol. 152, no. 4,p. 044107, 2020.[95] A. Togo and I. Tanaka, “First principles phonon cal-culations in materials science,”

Scripta Materialia ,vol. 108, pp. 1–5, 2015.[96] G. Kern and J. Hafner, “Ab initio calculations ofthe atomic and electronic structure of clean and hy-drogenated diamond (110) surfaces,”

Physical Re-view B , vol. 56, no. 7, pp. 4203–4210, 1997.[97] S. Thinius, M. M. Islam, and T. Bredow, “Recon-struction of low-index graphite surfaces,”

SurfaceScience , vol. 649, no. 1, pp. 60–65, 2016.[98] M. I. J. Probert and M. C. Payne, “Improving theconvergence of defect calculations in supercells: Anab initio study of the neutral silicon vacancy,”

Phys-ical Review B , vol. 67, no. 7, p. 075204, 2003.[99] D. Hunt, D. Twitchen, M. Newton, J. Baker, andT. Anthony, “Identiﬁcation of the neutral carbon(100)-split interstitial in diamond,”

Physical Re-view B - Condensed Matter and Materials Physics ,vol. 61, no. 6, pp. 3863–3876, 2000.[100] A. J. Stone and D. J. Wales, “Theoretical Stud-ies of Icosahedral C60 and Some Related Species,”

Chemical Physics Letters , vol. 128, no. 5, pp. 501–503, 1986.[101] J. Kotakoski, J. C. Meyer, S. Kurasch, D. Santos-Cottin, U. Kaiser, and A. V. Krashenin-nikov, “Stone-Wales-type transformations in car-bon nanostructures driven by electron irradiation,”

Physical Review B - Condensed Matter and Mate-rials Physics , vol. 83, no. 24, p. 245420, 2011.[102] J. Ma, D. Alf`e, A. Michaelides, and E. Wang,“Stone-Wales defects in graphene and other planarsp2 -bonded materials,”

Physical Review B , vol. 80,no. 4, p. 033407, 2009.[103] A. A. Correa, S. A. Bonev, and G. Galli, “Carbonunder extreme conditions: Phase boundaries andelectronic properties from ﬁrst-principles theory,”

Proceedings of the National Academy of Sciences of the United States of America , vol. 103, no. 5,pp. 1204–1208, 2006.[104] M. Pozzo, C. Davies, D. Gubbins, and D. Alf`e,“Transport properties for liquid silicon-oxygen-ironmixtures at Earth’s core conditions,”

Physical Re-view B - Condensed Matter and Materials Physics ,vol. 87, no. 1, p. 014110, 2013.[105] H. M. Strong and R. E. Hanneman, “Crystallizationof Diamond and Graphite,”

Journal of ChemicalPhysics , vol. 46, no. 9, pp. 3668–3676, 1967.[106] A. Sorkin, J. Adler, and R. Kalish, “Nucleation ofdiamond from liquid carbon under extreme pres-sures : Atomistic simulation,”

Physical Review B ,vol. 74, no. 6, p. 064115, 2006.[107] W. H. Gust, “Phase transition and shock-compression parameters to 120 GPa for 3 typesof graphite,”

Physical Review B , vol. 22, no. 10,pp. 4744–4756, 1980.[108] C. J. Mundy, A. Curioni, N. Goldman, I. F. Kuo,E. J. Reed, L. E. Fried, and M. Ianuzzi, “Ultra-fast transformation of graphite to diamond: An abinitio study of graphite under shock compression,”

Journal of Chemical Physics , vol. 128, no. 18, 2008.[109] V. L. Deringer, D. M. Proserpio, G. Cs´anyi, andC. J. Pickard, “Data-driven learning and predic-tion of inorganic crystal structures,”

Faraday Dis-cussions , vol. 211, pp. 45–59, 2018.[110] S. Plimpton, “Fast Parallel Algorithms for Short –Range Molecular Dynamics,”

Journal of Computa-tional Physics , vol. 117, no. June 1994, pp. 1–19,1995.[111] E. Bitzek, P. Koskinen, F. G¨ahler, M. Moseler, andP. Gumbsch, “Structural relaxation made simple,”

Physical Review Letters , vol. 97, no. 17, p. 170201,2006.[112] C. J. Pickard, “Predicted Pressure-Induced s -BandFerromagnetism in Alkali Metals,”

Physical ReviewLetters , vol. 107, no. 8, p. 087201, 2011.[113] R. J. Needs and C. J. Pickard, “Perspective: Roleof structure prediction in materials discovery anddesign,”

APL Materials , vol. 4, no. 5, 2016.[114] Y. Kumeda and D. J. Wales, “Ab initio study ofrearrangements between C60 fullerenes,”

ChemicalPhysics Letters , vol. 374, no. 1-2, pp. 125–131, 2003.[115] D. J. Wales, “Global Optimization of Clusters,Crystals, and Biomolecules,”

Science , vol. 285,no. 5432, pp. 1368–1372, 1999.23116] P. J. Harris and S. C. Tsang, “High-resolution elec-tron microscopy studies of non- graphitizing car-bons,”

Philosophical Magazine A: Physics of Con-densed Matter, Structure, Defects and MechanicalProperties , vol. 76, no. 3, pp. 667–677, 1997.[117] P. J. Harris, “New perspectives on the structure ofgraphitic carbons,”

Critical Reviews in Solid Stateand Materials Sciences , vol. 30, no. 4, pp. 235–253,2005.[118] P. J. F. Harris, “Structure of non-graphitising car-bons,”

International Materials Reviews , vol. 42,no. 5, pp. 206–218, 2014.[119] X. Dou, I. Hasa, D. Saurel, C. Vaalma, L. Wu,D. Buchholz, D. Bresser, S. Komaba, andS. Passerini, “Hard carbons for sodium-ion batter-ies: Structure, analysis, sustainability, and electro-chemistry,”