AtomSets -- A Hierarchical Transfer Learning Framework for Small and Large Materials Datasets
Chi Chen and Shyue Ping Ong ∗ Materials Virtual Lab, Department of NanoEngineering, University of California San Diego, 9500 Gilman Dr, Mail Code 0448, La Jolla, CA 92093-0448, United States
E-mail: [email protected]

Abstract

Predicting materials properties from composition or structure is of great interest to the materials science community. Deep learning has recently garnered considerable interest in materials predictive tasks, achieving low model errors when dealing with large materials data. However, deep learning models suffer in the small data regime that is common in materials science. Here we leverage the transfer learning concept and the graph network deep learning framework to develop the AtomSets machine learning framework for consistently high model accuracy on both small and large materials data. The AtomSets models can work with both compositional and structural materials data. By combining with transfer-learned features from graph networks, they can achieve state-of-the-art accuracy even on small compositional datasets.

Introduction
Machine learning (ML) has garnered substantial interest as an effective method for developing surrogate models for materials property predictions in recent years. However, a critical bottleneck is that materials datasets are often small and inhomogeneous, making it challenging to train reliable models. While large density functional theory (DFT) databases such as the Materials Project, the Open Quantum Materials Database (OQMD) and AFLOWLIB contain relaxed structures and computed energies for hundreds of thousands of crystals, data on other computed properties such as band gaps, elastic constants and dielectric constants tend to be several times or even two orders of magnitude fewer. In general, deep learning models based on neural networks tend to require much more data to train, resulting in lower performance on small datasets relative to non-deep-learning models. For example, Dunn et al. have found that while deep learning models such as the MatErials Graph Networks (MEGNet) and the Crystal Graph Convolutional Neural Network (CGCNN) achieve state-of-the-art performance for datasets with more than ~10^4 data points, ensembles of non-deep-learning models (using AutoMatminer) outperform these deep learning models when the dataset size is below ~10^4, and especially when it is below ~10^3.

Several approaches have been explored to address the data bottleneck. The most popular approach is transfer learning (TL), wherein the weights from models trained on a property with a large data size are "transferred" to a model for a smaller dataset. Most TL studies were performed on the same property. For example, Hutchinson et al. developed three TL approaches that reduced the model errors in predicting experimental band gaps by including DFT band gaps. Similarly, Jha et al. trained models on the formation energies in the large OQMD database and demonstrated that transferring the model weights from OQMD can improve the models on the small DFT-computed and even experimental formation energy data. TL has also been demonstrated between different properties in some cases. For example, the present authors found that transferring the weights from large-data-size formation energy MEGNet models to smaller-data-size band gap and elastic moduli models improved convergence rate and accuracy. Another approach uses multi-fidelity models, where datasets of multiple fidelities (e.g., band gaps computed with different functionals or measured experimentally) are used to improve prediction performance on the more valuable, high-fidelity properties. For example, two-fidelity co-kriging methods have demonstrated successes in improving the predictions of the Heyd-Scuseria-Ernzerhof (HSE) band gaps of perovskites, defect energies in hafnia, and DFT bulk moduli. In a recently published work, the present authors also developed multi-fidelity MEGNet models that utilize band gap data from four DFT functionals (Perdew-Burke-Ernzerhof or PBE, Gritsenko-Leeuwen-Lenthe-Baerends with solid correction or GLLB-SC, strongly constrained and appropriately normed or SCAN, and HSE) and experimental measurements to significantly improve the prediction of experimental band gaps.

In this work, we develop "AtomSets", a hierarchical framework for TL using MEGNet models that can achieve uniformly excellent performance across diverse datasets of different sizes. The AtomSets framework unifies compositional and structural features under one umbrella. We show, for the first time, TL from structural models to compositional models. Using 13 MatBench datasets, we show that the AtomSets models can achieve excellent performance even when the inputs are compositional and the dataset contains only a few hundred data points.

Methods
MatErials Graph Network
The MEGNet formalism has been described extensively in previous works, and interested readers are referred to those publications for details. Briefly, the MEGNet framework featurizes a material into a graph G = (V, E, u), where v_i ∈ V are the atom or node features, e_k ∈ E are the edge or bond features, and u are the state features. The feature matrices/vectors are V = [v_1; ...; v_{N_a}] ∈ R^{N_a × N_f}, E = [e_1; ...; e_{N_b}] ∈ R^{N_b × N_{bf}} and u ∈ R^{N_u}, where N_a, N_b, N_f, N_{bf} and N_u are the numbers of atoms, bonds, atom features, bond features and state features, respectively. For compositional models, N_a is the number of atoms in the formula. For simplicity, the atom and bond features are represented as matrices; however, shuffling the first dimension does not change the results of the models, so the atoms and bonds are essentially sets. A graph convolution (GC) operation uses the connectivity of the bonds to transform the input graph features (V, E, u) into output graph features (V', E', u'), as shown in Figure 1a.
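To make the featurization concrete, the following minimal Python sketch builds the (V, E, u) triple for a toy structure. This is an illustration under simplifying assumptions, not the MEGNet implementation: in particular it ignores the periodic-boundary images that a real crystal graph construction must include, and all names are hypothetical.

```python
# A minimal sketch of featurizing a material into the graph G = (V, E, u):
# atom features start as atomic numbers, and bonds are atom pairs within a
# cutoff radius R_c. Periodic images are deliberately omitted for brevity.
import numpy as np

def featurize(atomic_numbers, coords, r_cut=4.0, n_state=2):
    coords = np.asarray(coords, dtype=float)
    v = np.asarray(atomic_numbers)                      # (N_a,) node features
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    send, recv = np.where((dist > 0) & (dist < r_cut))  # bond indices (s_k, r_k)
    e = dist[send, recv][:, None]                       # (N_b, 1) bond lengths
    u = np.zeros(n_state)                               # initial state vector
    return v, e, send, recv, u

# toy usage on a hypothetical two-atom cell
v, e, send, recv, u = featurize([26, 28], [[0, 0, 0], [0, 0, 2.0]])
```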
Figure 1: Graph networks and AtomSets schematics. a, The graph convolution (GC) takes an input graph with labeled atom (V), bond (E) and state (u) attributes and outputs a new computed graph with updated attributes. b, The graph network model architecture. The input to the model is the structure graph (S) with atomic numbers as the atom attributes. The graph is then passed to an atom embedding (AE) layer, followed by three GC layers. After the GC layers, the graph is read out to a structure-wise vector f, and f is further passed to multi-layer perceptron (MLP) models. Within the model, each layer output is captured for later use. c, The AtomSets model takes a site-wise/element-wise feature matrix and passes it to MLP layers. After the MLP, a readout function is applied to derive a structure-wise/formula-wise vector, followed by final MLP layers.

The GC updates the atom, bond and state features as follows:

e_k^{(i)} = \phi_e ( e_k^{(i-1)}, v_{s_k}^{(i-1)}, v_{r_k}^{(i-1)}, u^{(i-1)} )   (1)

v_j^{(i)} = \phi_v ( v_j^{(i-1)}, \{ v_k^{(i-1)} \}_{k \in N(j)}, \{ e_k^{(i)} \}_{r_k = j}, u^{(i-1)} )   (2)

u^{(i)} = \phi_u ( \frac{1}{N_b} \sum_k e_k^{(i)}, \frac{1}{N_a} \sum_k v_k^{(i)}, u^{(i-1)} )   (3)

where i is an index indicating the layer of the GC, e_k^{(i)} and v_j^{(i)} are the bond attributes of bond k and the atom attributes of atom j at layer i, respectively, the \phi's are update functions approximated using multi-layer perceptrons (MLPs), N(j) indicates the neighbor atom indices of atom j, and r_k = j selects the bonds whose receiving atom index is j.

In the initial structure graph (S), the atom attributes are simply the atomic numbers of the elements, embedded into a vector space via an atom embedding (AE) layer (AE: Z → R^{N_f}) to obtain V_0 ∈ R^{N_a × N_f}, as shown in Figure 1b. The bonds are constructed by considering atom pairs within a certain cutoff radius R_c. With each GC layer, information is exchanged between atoms, bonds and state. As more GC layers are stacked (e.g., GC2 and GC3 in Figure 1b), information on each atom can be propagated to farther distances.

In this work, a MEGNet model with three GC layers was first trained on the formation energies of more than 130,000 Materials Project crystals as of Jun 1, 2019, henceforth referred to as the "parent" model. The training procedures and hyperparameter settings of the MEGNet models are similar to those in the previous work.
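The update cycle in Eqs. (1)-(3) can be illustrated with a toy numpy sketch. This is not the MEGNet implementation: the trainable MLPs \phi_e, \phi_v and \phi_u are replaced by fixed random linear maps, and the neighbor aggregation in Eq. (2) is simplified to a sum over incoming bonds.

```python
# A self-contained toy sketch of one graph convolution step, Eqs. (1)-(3).
import numpy as np

rng = np.random.default_rng(0)

def make_phi(dim_in, dim_out):
    """Single random linear layer + ReLU as a stand-in for a trainable MLP."""
    W = rng.normal(size=(dim_in, dim_out)) / np.sqrt(dim_in)
    return lambda x: np.maximum(x @ W, 0.0)

def graph_convolution(V, E, u, send, recv, phi_e, phi_v, phi_u):
    N_a, N_b = V.shape[0], E.shape[0]
    # Eq. (1): update each bond from itself, its two atoms and the state
    E_new = phi_e(np.hstack([E, V[send], V[recv], np.tile(u, (N_b, 1))]))
    # Eq. (2): update each atom from itself, aggregated incoming bonds and the state
    agg = np.zeros((N_a, E_new.shape[1]))
    np.add.at(agg, recv, E_new)                   # sum of e_k over bonds with r_k = j
    V_new = phi_v(np.hstack([V, agg, np.tile(u, (N_a, 1))]))
    # Eq. (3): update the state from the mean bond and mean atom features
    u_new = phi_u(np.concatenate([E_new.mean(axis=0), V_new.mean(axis=0), u]))
    return V_new, E_new, u_new

# toy graph: 2 atoms, 2 directed bonds, widths N_f=4, N_bf=1, N_u=2
V = rng.normal(size=(2, 4)); E = rng.normal(size=(2, 1)); u = np.zeros(2)
send, recv = np.array([0, 1]), np.array([1, 0])
phi_e = make_phi(1 + 2 * 4 + 2, 8)   # bond + two atoms + state
phi_v = make_phi(4 + 8 + 2, 4)       # atom + aggregated bonds + state
phi_u = make_phi(8 + 4 + 2, 2)       # mean bond + mean atom + state
V1, E1, u1 = graph_convolution(V, E, u, send, recv, phi_e, phi_v, phi_u)
```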
AtomSets Framework

In our proposed AtomSets framework, the output atom features V_i = [v_1^{(i)}; ...; v_{N_a}^{(i)}] after each GC layer are extracted from the parent model and transferred to develop models for other properties. Bond features are not considered in TL since the number of bonds depends on the graph construction settings and parameters, such as the cutoff radius. As shown in Figure 1c, an AtomSets model takes an atom-wise feature matrix V_i of shape N_a × N_f as input to an MLP model. These features can be either compositional, e.g., elemental properties, or structural, e.g., local environment descriptors. Afterwards, the output feature matrix is read out to a vector, compressing the atom-number dimension.

The purpose of the readout function is to reduce feature matrices with different numbers of atoms to structure-wise vectors, subject to permutational invariance. Simple statistics computed along the atom-number dimension can be used as readout functions. In this work, we tested two types of readout functions. The linear mean readout function averages the feature vectors as follows:

\bar{x} = \frac{\sum_i w_i x_i}{\sum_i w_i}   (4)

where x_i is the feature row vector for atom i and w_i is the corresponding weight. The weights are the atom fractions on each site, e.g., w_Fe = 0.01 and w_Ni = 0.99 in Fe0.01Ni0.99.

We also tested a weight-modified, attention-based set2set readout function. We start with memory vectors m_i = x_i W + b and initialize q*_0 = 0, where W and b are learnable weights and biases, respectively. At step t, the updates are calculated using long short-term memory (LSTM) and attention mechanisms as follows:

q_t = LSTM(q*_{t-1})   (5)

e_{i,t} = m_i \cdot q_t   (6)

a_{i,t} = \frac{w_i \exp(e_{i,t})}{\sum_j w_j \exp(e_{j,t})}   (7)

r_t = \sum_i a_{i,t} m_i   (8)

q*_t = q_t \oplus r_t   (9)

A total of three steps are used in the weighted set2set readout function.
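Both readouts collapse the atom dimension into a fixed-length vector. A minimal numpy sketch of the simpler weighted linear-mean readout in Eq. (4) follows; the feature values are made up for illustration.

```python
# A minimal sketch of the weighted linear-mean readout in Eq. (4): it
# collapses an (N_a x N_f) feature matrix into a single formula-wise vector,
# with site occupancies as weights, so disordered compositions such as
# Fe0.01Ni0.99 are handled naturally.
import numpy as np

def weighted_mean_readout(X, w):
    """X: (N_a, N_f) atom-wise features; w: (N_a,) site weights."""
    w = np.asarray(w, dtype=float)
    return (w[:, None] * X).sum(axis=0) / w.sum()

# Fe0.01Ni0.99 as two weighted "sites" with toy 2-d features
X = np.array([[1.0, 0.0],      # hypothetical feature vector for Fe
              [0.0, 1.0]])     # hypothetical feature vector for Ni
print(weighted_mean_readout(X, [0.01, 0.99]))   # -> [0.01 0.99]
```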
The readout vector can then be used to predict properties with the help of MLPs or other models, as shown in Figure 1c. The feature matrix can either be taken from the pre-trained model as the feature matrices V_i (i = 0, 1, 2, 3) or trained on-the-fly via a trainable atom embedding layer prepended to the model. When the site-wise/atom-wise features are computed from pre-trained models, information gained from previous model training is retained, and the AtomSets models effectively transfer-learn part of the pre-trained models. A hierarchical TL scheme is achieved by including the outputs of different GC layers. The AtomSets models can also be used without transfer learning, by training the elemental embedding, and hence the atom-wise features, from the data.

The AtomSets framework is flexible in the choice of input features. For example, if symmetry functions are provided as inputs, the AtomSets model becomes the high-dimensional neural network potential. The AtomSets framework also shares similarities with the DeepSets model, where the summation of feature vectors is used to obtain the readout vectors. Since only simple MLPs underlie the AtomSets framework, model training can be extremely fast. The models investigated in the current work are provided in Table 1.
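As an illustration of the architecture in Figure 1c, the following Keras sketch chains an atom-wise MLP, a mean readout and formula-wise MLP layers. All layer sizes are assumptions, and the sketch omits details of the released maml implementation, such as input masking and the set2set readout.

```python
# A compact Keras sketch of the AtomSets architecture:
# atom-wise MLP -> permutation-invariant mean readout -> formula-wise MLP.
import tensorflow as tf
from tensorflow.keras import layers

n_features = 32                                      # assumed width of the input V_i features
inp = layers.Input(shape=(None, n_features))         # variable number of atoms N_a
h = layers.Dense(64, activation="softplus")(inp)     # MLP before the readout
h = layers.Dense(64, activation="softplus")(h)
pooled = layers.GlobalAveragePooling1D()(h)          # mean readout over the atom dimension
h = layers.Dense(64, activation="softplus")(pooled)  # MLP after the readout
out = layers.Dense(1)(h)                             # predicted property
model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="mae")
```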
Data and Model Training

The 13 materials datasets were obtained from the matbench repository. A summary is provided in Table S1; the data sizes range from 312 to 132,752, covering both compositional and structural data. The tasks include regression and classification. Detailed summaries are provided in the work by Dunn et al. For each model training, we split the data into 80%-10%-10% train-validation-test sets. The validation set was used to stop the model fitting when the validation metric, i.e., the mean absolute error (MAE) in regression or the area under the curve (AUC) in classification, did not improve for more than 200 consecutive epochs. The model with the lowest validation error was chosen as the "best" one. Each model was fitted five times using different random splits, and the average and standard deviation of the metric on the test set were reported.
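A sketch of this training protocol, reusing the toy model from the previous snippet and random stand-in data, might look as follows; real runs would load a MatBench dataset instead.

```python
# A sketch of the 80%-10%-10% split with early stopping, assuming `model` is
# the compiled Keras model from the previous snippet. X and y are random
# stand-ins with illustrative shapes.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

X = np.random.rand(1000, 10, 32).astype("float32")   # 1000 materials, up to 10 atoms
y = np.random.rand(1000).astype("float32")

X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# stop when the validation MAE has not improved for 200 consecutive epochs,
# keeping the weights of the best epoch
stopper = EarlyStopping(monitor="val_loss", patience=200, restore_best_weights=True)
model.fit(X_tr, y_tr, validation_data=(X_val, y_val),
          epochs=10000, callbacks=[stopper], verbose=0)
print("test MAE:", model.evaluate(X_te, y_te, verbose=0))
```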
Table 1: Models investigated in this work, categorized by model type, i.e., compositional (C) or structural (S), and whether they utilize transfer learning (TL). In our definition, S-type models contain compositional information as a superset. It should be noted that the MLP-u_i, MLP-f and MLP-v_r models are classified as S-type because u_i, f or v_r implicitly incorporate structural information due to the information passing in the graph convolution layers.

Model name | Type | TL | Description
AtomSets | C | No | Compositional models directly trained from data
AtomSets-V0 | C | Yes | Compositional models transferring the learned V0 from the parent formation energy model
AtomSets-V_i (i = 1, 2, 3) | S | Yes | Structural models transferring the learned V1, V2 or V3 features from the parent formation energy model
MLP-V0-stats | C | Yes | Compositional MLP models using statistics calculated on V0 from the parent formation energy model as inputs
MLP-u_i (i = 1, 2, 3), MLP-f and MLP-v_r | S | Yes | MLP models using the learned u_i, f or v_r from the parent formation energy model
MEGNet | S | No | Graph network models trained directly on each dataset without transfer learning

A grid search was performed on the hyperparameters of the AtomSets and MLP models. The parameter candidates are provided in Table S2. During the screening process, a 5-fold random shuffle split was applied to the dataset, and the parameter set with the lowest average validation error was chosen. The matbench_steels (compositional) and matbench_phonons (structural) datasets were first used to perform an initial screening for relatively good parameter sets. Starting from these parameter sets, a further grid search over all datasets for the most generalizable AtomSets-V0 (compositional) and AtomSets-V1 (structural) models was then performed.
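The grid search can be sketched as follows, where build_atomsets is a hypothetical factory returning a compiled model like the one sketched earlier and X, y are as in the previous snippets.

```python
# A sketch of the hyperparameter grid search: every candidate setting is
# scored by its average validation error over a 5-fold random shuffle split.
from itertools import product
import numpy as np
from sklearn.model_selection import ShuffleSplit

grid = {"n_neurons": [16, 32, 64, 128], "readout": ["mean", "set2set"]}
splitter = ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)

best_err, best_params = np.inf, None
for n_neurons, readout in product(grid["n_neurons"], grid["readout"]):
    errs = []
    for tr, val in splitter.split(X):
        m = build_atomsets(n_neurons=n_neurons, readout=readout)  # hypothetical factory
        m.fit(X[tr], y[tr], epochs=500, verbose=0)
        errs.append(m.evaluate(X[val], y[val], verbose=0))
    if np.mean(errs) < best_err:
        best_err = np.mean(errs)
        best_params = {"n_neurons": n_neurons, "readout": readout}
```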
Results

Model accuracies

The MAEs for the regression tasks and the AUCs for the classification tasks are shown in Table 2. In addition, hyperparameter optimization was carried out on the AtomSets-V0 and V1 models (see Table S3), but did not seem to have a significant effect on model performance. Here, we will focus our discussion on the models without further hyperparameter optimization. To frame our analysis, we will first recapitulate a key finding of Dunn et al.: MEGNet models tend to outperform other models when the data size is large (> ~10^4 data points). For the smaller datasets (e.g., E_exfoliation), the AtomSets models perform similarly to AutoMatminer, while for the larger formation energy (Perovskite and MP E_f) and MP band gap (E_g) datasets, the AtomSets models perform similarly to MEGNet. The only dataset where the AtomSets and MEGNet models substantially underperform relative to AutoMatminer is the refractive index of crystals from the Materials Project. This suggests that some of the additional features considered in the AutoMatminer algorithm (e.g., the electronic structure of the constituent elements) might be necessary for ML algorithms to predict the refractive index.

A somewhat surprising observation is that several target properties show minimal dependence on structural information. For example, the average MAEs of the compositional AtomSets-V0 models and the structural AtomSets-V1 models for the JDFT-2D exfoliation energy, the MP phonon DOS peak, and the refractive index datasets are within the standard deviations of each other.

Table 2: Performance of AtomSets models relative to state-of-the-art models. The averages and standard deviations of the MAE and AUC are reported for regression and classification tasks, respectively. The properties are sorted by dataset size. Some structural models (e.g., AtomSets-V1/V2/V3 for experimental band gaps) cannot be constructed as the dataset does not contain structural information ("-"). The best performing model(s) within the standard deviation are bolded for each target. [Most numerical entries could not be recovered from the source text and are shown as "…".]

Target, Data Size | AtomSets | AtomSets-V0 | AtomSets-V1 | AtomSets-V2 | AtomSets-V3 | MLP-f | MEGNet | AutoMatminer

Regression tasks
Yield Strength (GPa), 312 [a] | … | … | - | - | - | - | - | …
E_exfoliation (meV/atom), 636 [b] | … | … | … | … | … | … | … | …
PhonDOS Peak (1/cm), 1265 [c] | … | … | … | … | … | … | … | …
Expt. E_g (eV), 4604 [d] | … | … | - | - | - | - | - | …
ε, 4764 [e] | … | … | … | … | … | … | … | …
log(K_VRH) (GPa), 10987 [f] | … | … | … | … | … | … | … | …
log(G_VRH) (GPa), 10987 [g] | … | … | … | … | … | … | … | …
Perovskite E_f (meV/atom), 18928 [h] | … | … | … | … | … | … | … | …
MP E_g (eV), 106113 [i] | … | … | … | … | … | … | … | …
MP E_f (meV/atom), 132752 [j] | … | … | … | … | … | … | … | …

Classification tasks
Expt. Metallicity, 4921 [k] | … | … | - | - | - | - | - | 0.92
Glass Forming Ability, 5680 [l] | … | … | - | - | - | - | - | 0.86
MP Metallicity, 106113 [m] | … | … | … | … | … | … | … | …

a Steel yield strength data from Citrine Informatics. b Exfoliation energy of crystals from the JARVIS DFT 2D dataset. c Phonon DOS peak frequency from the Materials Project. d Experimental composition-band gap dataset from Zhuo et al. e Refractive index from the Materials Project. f Log of computed bulk moduli from the Materials Project. g Log of computed shear moduli from the Materials Project. h Computed perovskite formation energy from Castelli et al. i Computed PBE band gap data from the Materials Project. j Computed PBE formation energy data from the Materials Project. k Experimental metallicity (binary) from Zhuo et al. l Glass forming ability (binary) from the Landolt-Börnstein Handbook. m Computed PBE metallicity (binary) from the Materials Project.

Similarly, the structural AtomSets-V1 models for the MP elasticity data (log K_VRH and log G_VRH) exhibit only minor improvements in average MAE over the compositional AtomSets-V0 models. To investigate the implications of this observation, we analyzed the polymorphs for each composition in the elasticity dataset (see Figure S1). Out of the 10987 elasticity data points, 81% do not have polymorphs; for those materials, structural models likely perform similarly to the compositional models. For the compositions that have more than one polymorph (816 out of 9723 unique compositions), we calculated the range of the target values across the polymorphs, as shown in Figures S1b and c. The majority of polymorphs of the same composition have similar bulk and shear moduli, and the average ranges for log K_VRH and log G_VRH are 0.134 and 0.158, respectively. If we include the compositions with no polymorphs, i.e., a range of zero, the average ranges for log K_VRH and log G_VRH become 0.011 and 0.013, respectively. Such small per-composition ranges suggest that composition explains the majority of the variation in the bulk moduli, which is why the accuracy differences between AtomSets-V0 and AtomSets-V1 are minimal. This observation also offers insight into why compositional models have been reasonably successful. It should be noted that there are well-known polymorphs with vastly different mechanical properties, e.g., diamond and graphite, and for these the AtomSets-V1 models provide far better predictions. For example, the AtomSets-V1 model predicts the shear moduli of graphite (96 GPa) and diamond (520 GPa) to be 96 GPa and 490 GPa, respectively, while the AtomSets-V0 model predicts 177 GPa for both. In contrast, the perovskite and MP formation energy datasets require structural models to achieve accurate results. This observation is consistent with a recent study by Bartel et al.

Comparing the AtomSets models with the various V_i inputs, the results show that the features extracted from the earlier GC layers, e.g., V0 and V1, are more generalizable and give higher accuracy in all models compared to those produced by the later GC layers. The structure-wise state vectors u_i (i = 1, 2, 3) and the readout vector v_r are relatively poor features, as shown by the large errors of all models in Table S3. However, the final structure-wise readout vector f, together with MLP models, offers excellent accuracy in the MP metallicity and formation energy tasks.
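The polymorph analysis described above reduces to a simple group-by computation; a pandas sketch with made-up column names and values is shown below.

```python
# A sketch of the polymorph analysis: group the elasticity data by composition
# and compute the spread (max - min) of the target within each group.
import pandas as pd

df = pd.DataFrame({
    "composition": ["C", "C", "SiO2", "Fe2O3"],
    "log_kvrh": [2.64, 1.60, 1.57, 2.20],        # toy log10(K_VRH) values
})
grouped = df.groupby("composition")["log_kvrh"]
spread = grouped.max() - grouped.min()           # range across polymorphs
n_poly = grouped.size()
print(spread[n_poly > 1])                        # spread for polymorphic compositions only
```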
Model Convergence

A convergence study of the best models, i.e., two compositional models (AtomSets and AtomSets-V0) and two structural models (AtomSets-V1 and MLP-f), was performed with respect to data size, using different fractions of the maximum available data. Comparing the two compositional models, the AtomSets-V0 model achieves relatively higher performance across all tasks and generally converges faster than its non-TL counterpart, the AtomSets model, as shown in Figure 2. For the structural datasets in Figures 2c and 2d, consistent with the previous benchmark results, the structural AtomSets-V1 and MLP-f models are generally more accurate than the compositional models. The rapid convergence of the MLP-f models on the MP formation energy dataset is expected, since the structural features f were generated by the formation energy MEGNet model in the first place. The model convergence on the rest of the datasets is provided in Figure S2.

Figure 2: Model convergence for AtomSets, AtomSets-V0, AtomSets-V1 and MLP-f on small compositional (a: Expt. Metallicity, 4921; b: Expt. E_g, 4604) and large structural (c: MP Metallicity, 106113; d: MP E_f, 132752) datasets. (a) and (c) show the area under the curve (AUC) for classification tasks, and (b) and (d) show the mean absolute error (MAE) for regression tasks. The x-axis is plotted on a log scale to improve resolution at small data sizes. The shaded areas are the standard deviations across five random data splits. Additional model results are shown in Figure S2.

The model performance was also probed on tiny datasets. We used several MP datasets in this study to obtain consistent results and down-sampled the datasets to 100, 200, 400, 600, 1000, and 2000 data points. For comparison, we also include the non-TL MEGNet structural models, as shown in Figure 3. Similar to the previous convergence study at relatively large data sizes, the TL compositional AtomSets-V0 models outperform the non-TL compositional AtomSets models at all data sizes. For the structural models, the TL AtomSets-V1 models achieve consistent accuracy in the small data limit for all four tasks and consistently outperform the non-TL MEGNet models.

Figure 3: Model convergence in the small data limit. The four datasets are the (a) log10 of the bulk moduli (MP log(K_VRH), 10987), (b) band gap (MP E_g, 106113), (c) binary metallicity (MP Metallicity, 106113) and (d) formation energy (MP E_f, 132752) structural datasets from the Materials Project.

Interestingly, the MLP-f models specialize in the MP metallicity and MP formation energy data, in line with the benchmark results shown in Table 2. In particular, the MLP-f models converge rapidly on the MP metallicity task, with AUC exceeding 85% with only 200 data points and 90% with only 1000 data points. The MLP-f models also reach ~0.2 eV/atom error on the MP formation energy data at a data size of 600. In both cases, the MLP-f models outperform the MEGNet models by a considerable margin. However, in terms of generalizability, the AtomSets-V1 models seem to be a better fit for generic tasks.

At a data size of 600 (533 training data points), the formation energy and band gap model errors of AtomSets-V1 are 0.2 eV/atom and 0.702 eV, respectively, much lower than the 0.367 eV/atom and 0.78 eV errors achieved by the full MEGNet models.
The AtomSets-V1 errors in such small data regimes are on par with the 0.210 eV/atom and 0.71 eV errors (504 training data points) reported for the MODNet models that specialize in fitting small materials datasets. Interestingly, the compositional AtomSets-V0 model also achieved lower errors than the full MEGNet models, with a formation energy model error of 0.269 eV/atom and a band gap model error of 0.72 eV.
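The small-data convergence protocol can be sketched as follows; split_and_train is a hypothetical helper that wraps the 80%-10%-10% split, early-stopping fit and test evaluation described in the Methods.

```python
# A sketch of the down-sampling study: subsample each dataset to fixed sizes,
# fit five models on different random splits at each size, and record the
# mean and standard deviation of the test error.
import numpy as np

sizes = [100, 200, 400, 600, 1000, 2000]
rng = np.random.default_rng(0)

curve = {}
for n in sizes:
    idx = rng.choice(len(X), size=n, replace=False)        # random down-sample
    maes = [split_and_train(X[idx], y[idx], random_state=s)  # hypothetical helper
            for s in range(5)]
    curve[n] = (np.mean(maes), np.std(maes))
```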
Model Extrapolability

In a typical materials design problem, the target is not to find a material with performance similar to most existing materials, but rather materials with extraordinary properties that lie outside the current materials pool. Such extrapolation presents a major challenge for most ML models. Previous works have designed leave-one-cluster-out cross-validation (LOCO CV) or k-fold forward cross-validation to evaluate a model's ability to extrapolate into data regions outside the training data. Here we adopted the concept of forward cross-validation by splitting the data according to their target value ranges and applied the method to the elasticity data (MP log(K_VRH) and MP log(G_VRH)) to imitate the process of finding super-incompressible (high K) and superhard (roughly high G) materials. First, we held out the materials with the top 10% of target values as a test dataset (high-test, extrapolation). We then split the remainder into train, validation, and test (low-test, interpolation) datasets, giving two test data regimes in total. We selected the AtomSets, AtomSets-V0, AtomSets-V1, and MEGNet models for the comparison.

Figure 4: Absolute differences between the predicted and DFT log(K_VRH), i.e., |Δlog(K_VRH)|, against the DFT value range for the test data. The training and validation data are randomly sampled from the 0% to 90% (vertical dashed line) target quantile range. Half of the test data comes from the 90%-100% quantile (extrapolation) and the other half from the same target range as the train-validation data (interpolation).

For the bulk moduli K, the low-test errors of the compositional AtomSets and AtomSets-V0 models are nearly identical. However, when the test target values lie outside the training data range, the errors increase rapidly above the low-test errors. Nevertheless, the TL AtomSets-V0 models generalize better in the extrapolative high-test regime, as shown by the lower extrapolation errors in Figures 4a and 4b. For the structural models, the low-test errors are again almost the same, yet the TL AtomSets-V1 models have lower errors than their MEGNet counterparts, see Figures 4c and 4d. Similar conclusions can be reached with the shear moduli dataset, as shown in Figure S3. These results demonstrate that TL approaches can significantly enhance model accuracy in the extrapolation tasks that are critical to new materials discovery.
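The forward cross-validation split described in this section can be sketched as follows; the 80/10/10 partition within the low range is an assumption consistent with the Methods.

```python
# A sketch of the forward cross-validation split: the top 10% of target values
# are held out as the extrapolation (high-test) set, and the remaining 90% are
# split into train, validation and interpolation (low-test) sets.
import numpy as np

def forward_split(y, quantile=0.9, seed=0):
    rng = np.random.default_rng(seed)
    cut = np.quantile(y, quantile)
    high_test = np.where(y >= cut)[0]            # extrapolation regime
    rest = rng.permutation(np.where(y < cut)[0])
    n = len(rest)
    train = rest[: int(0.8 * n)]
    val = rest[int(0.8 * n): int(0.9 * n)]
    low_test = rest[int(0.9 * n):]               # interpolation regime
    return train, val, low_test, high_test
```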
Discussion

The hierarchical MEGNet features provide a cascade of descriptors that capture both short-ranged interactions at the early GC layers (e.g., V0, V1) and long-ranged interactions at the later GC layers (e.g., V2, V3). The early GC features are better TL features across various tasks, while the features generated by the later GC layers generally exhibit worse performance. This can be explained by drawing an analogy to convolutional neural networks (CNNs) in facial recognition, where the early feature maps capture generic features such as lines and shapes, and the later feature maps form human faces. It is not surprising that if such a CNN is transferred to other domains, for example recognizing general objects beyond faces, the early feature maps may still work, while the later ones will not.

One surprising result from our studies is the relatively good performance of the compositional models (AtomSets-V0) on many properties, e.g., the phonon DOS peak and the bulk and shear moduli. It would be erroneous to conclude that these properties are not structure-dependent. We believe the main reason for the outperformance of the compositional models is that most compositions either do not exhibit polymorphism or have polymorphs with somewhat similar properties, e.g., the well-known family of SiC polymorphs. These results highlight the importance of generating diverse data beyond the existing known materials. Existing databases such as the Materials Project typically prioritize computations on known materials, e.g., ICSD crystals. While such a strategy undoubtedly provides the most value to the community for the study of existing materials, the discovery of new materials with extraordinary properties requires exploration beyond known materials; additional training data on hypothetical materials is critical for the development of ML models that can extrapolate beyond known materials design spaces. The use of TL, as shown in this work, is nevertheless critical for improving the extrapolability of models.

The AtomSets framework can be viewed as a special case of the graph network models without the edge and global information updates. With TL, the edge and global information learned during MEGNet model training is implicitly folded into the node information and hence into the AtomSets model. AtomSets greatly simplifies the graph network models and can thus be trained at a much smaller computational cost without compromising model accuracy. For example, training an AtomSets model on the largest MP formation energy dataset (132,752 data points) takes about 10 seconds per epoch on one GTX 1080 Ti GPU, while training MEGNet can take more than 300 seconds per epoch.
Conclusion
This work introduces a new and straightforward deep learning framework, AtomSets, as an effective way to learn materials properties at all data sizes and for both compositional and structural data. By combining with TL, the structure-embedded compositional and structural information can be readily incorporated into the models. The simple model architecture makes it possible to train the models on much smaller datasets and with lower computational resources compared to graph models. We show that the AtomSets models can achieve consistently low errors from small data tasks, e.g., the steel strength dataset, to large data tasks, e.g., the MP computational datasets, and that the model accuracy further improves with TL. We also show better model convergence for the AtomSets models. The AtomSets framework provides a facile deep learning approach that can help accelerate the materials discovery process by combining accurate compositional and structural materials models.
Data Availability
The MatBench datasets are available from the AutoMatminer GitHub repository (https://github.com/hackingmaterials/automatminer).

Code Availability
The AtomSets framework and MEGNet featurizations are implemented in the open-source materials machine learning (maml) package (https://github.com/materialsvirtuallab/maml).

Acknowledgement
The authors acknowledge the support from the Materials Project, funded by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division under contract no. DE-AC02-05-CH11231: Materials Project program, KC23MP. The authors also acknowledge computational resources provided by the Triton Shared Computing Cluster (TSCC) at the University of California, San Diego, and the Extreme Science and Engineering Discovery Environment (XSEDE) supported by the National Science Foundation under grant no. ACI-1053575.

Author contributions
C.C. and S.P.O. conceived the idea. C.C. carried out the model construction and fitting under the supervision of S.P.O. C.C. and S.P.O. wrote the manuscript.
Declaration of interests
The authors declare no competing interests.
Supplementary Information accompanies this paper at21 eferences (1) Butler, K. T.; Davies, D. W.; Cartwright, H.; Isayev, O.; Walsh, A. Machine Learningfor Molecular and Materials Science.
Nature , , 547–555.(2) Chen, C.; Zuo, Y.; Ye, W.; Li, X.; Deng, Z.; Ong, S. P. A Critical Review of MachineLearning of Energy Materials. Advanced Energy Materials , , 1903242.(3) Jain, A.; Ong, S. P.; Hautier, G.; Chen, W.; Richards, W. D.; Dacek, S.; Cholia, S.;Gunter, D.; Skinner, D.; Ceder, G.; Persson, K. A. Commentary: The Materials Project:A Materials Genome Approach to Accelerating Materials Innovation. APL Materials , , 011002.(4) Kirklin, S.; Saal, J. E.; Meredig, B.; Thompson, A.; Doak, J. W.; Aykol, M.; R¨uhl, S.;Wolverton, C. The Open Quantum Materials Database (OQMD): Assessing the Accu-racy of DFT Formation Energies. npj Computational Materials , , 15010.(5) Curtarolo, S.; Setyawan, W.; Wang, S.; Xue, J.; Yang, K.; Taylor, R. H.; Nel-son, L. J.; Hart, G. L. W.; Sanvito, S.; Buongiorno-Nardelli, M.; Mingo, N.;Levy, O. AFLOWLIB.ORG: A Distributed Materials Properties Repository from High-Throughput Ab Initio Calculations. Computational Materials Science , , 227–235.(6) Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials PropertyPrediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. arXiv:2005.00707 [cond-mat, physics:physics] ,(7) Chen, C.; Ye, W.; Zuo, Y.; Zheng, C.; Ong, S. P. Graph Networks as a UniversalMachine Learning Framework for Molecules and Crystals. Chemistry of Materials , , 3564–3572. 228) Xie, T.; Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurateand Interpretable Prediction of Material Properties. Physical Review Letters , ,145301.(9) Hutchinson, M. L.; Antono, E.; Gibbons, B. M.; Paradiso, S.; Ling, J.; Meredig, B.Overcoming Data Scarcity with Transfer Learning. arXiv:1711.05099 [cond-mat, stat] ,(10) Jha, D.; Choudhary, K.; Tavazza, F.; Liao, W.-k.; Choudhary, A.; Campbell, C.;Agrawal, A. Enhancing Materials Property Prediction by Leveraging Computationaland Experimental Data Using Deep Transfer Learning. Nature Communications , , 5316.(11) Heyd, J.; Scuseria, G. E.; Ernzerhof, M. Hybrid Functionals Based on a ScreenedCoulomb Potential. The Journal of Chemical Physics , , 8207–8215.(12) Pilania, G.; Gubernatis, J. E.; Lookman, T. Multi-Fidelity Machine Learning Modelsfor Accurate Bandgap Predictions of Solids. Computational Materials Science , , 156–163.(13) Batra, R.; Pilania, G.; Uberuaga, B. P.; Ramprasad, R. Multifidelity Information Fusionwith Machine Learning: A Case Study of Dopant Formation Energies in Hafnia. ACSApplied Materials & Interfaces , acsami.9b02174.(14) Batra, R.; Sankaranarayanan, S. Machine Learning for Multi-Fidelity Scale Bridg-ing and Dynamical Simulations of Materials.
Journal of Physics: Materials , ,031002.(15) Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gradient Approximation MadeSimple. Physical Review Letters , , 3865–3868.2316) Gritsenko, O.; van Leeuwen, R.; van Lenthe, E.; Baerends, E. J. Self-Consistent Ap-proximation to the Kohn-Sham Exchange Potential. Physical Review A , , 1944–1954.(17) Kuisma, M.; Ojanen, J.; Enkovaara, J.; Rantala, T. T. Kohn-Sham Potential withDiscontinuity for Band Gap Materials. Physical Review B , , 115106.(18) Sun, J.; Ruzsinszky, A.; Perdew, J. P. Strongly Constrained and Appropriately NormedSemilocal Density Functional. Physical Review Letters , , 036402.(19) Chen, C.; Zuo, Y.; Ye, W.; Li, X.; Ong, S. P. Learning Properties of Ordered andDisordered Materials from Multi-Fidelity Data. Nature Computational Science , , 46–53.(20) Vinyals, O.; Bengio, S.; Kudlur, M. Order Matters: Sequence to Sequence for Sets. arXiv:1511.06391 [cs, stat] ,(21) Behler, J.; Parrinello, M. Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Physical Review Letters , , 146401.(22) Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R. R.; Smola, A. J.In Advances in Neural Information Processing Systems 30 ; Guyon, I., Luxburg, U. V.,Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran As-sociates, Inc., 2017; pp 3391–3401.(23) Conduit, G.; Bajaj, S. Citrination. https://citrination.com/datasets/153092/show search?searchMatchOption=fuzzyMatch,2017.(24) Choudhary, K.; Kalish, I.; Beams, R.; Tavazza, F. High-Throughput Identificationand Characterization of Two-Dimensional Materials Using Density Functional Theory.
Scientific Reports , , 5179. 2425) Petretto, G.; Dwaraknath, S.; Miranda, H. P. C.; Winston, D.; Giantomassi, M.;van Setten, M. J.; Gonze, X.; Persson, K. A.; Hautier, G.; Rignanese, G.-M. High-Throughput Density-Functional Perturbation Theory Phonons for Inorganic Materials. Scientific Data , , 180065.(26) Zhuo, Y.; Mansouri Tehrani, A.; Brgoch, J. Predicting the Band Gaps of InorganicSolids by Machine Learning. The Journal of Physical Chemistry Letters , , 1668–1673.(27) Petousis, I.; Mrdjenovich, D.; Ballouz, E.; Liu, M.; Winston, D.; Chen, W.; Graf, T.;Schladt, T. D.; Persson, K. A.; Prinz, F. B. High-Throughput Screening of InorganicCompounds for the Discovery of Novel Dielectric and Optical Materials. Scientific Data , , 160134.(28) de Jong, M.; Chen, W.; Angsten, T.; Jain, A.; Notestine, R.; Gamst, A.; Sluiter, M.;Krishna Ande, C.; van der Zwaag, S.; Plata, J. J.; Toher, C.; Curtarolo, S.; Ceder, G.;Persson, K. A.; Asta, M. Charting the Complete Elastic Properties of Inorganic Crys-talline Compounds. Scientific Data , , 150009.(29) Castelli, I. E.; Landis, D. D.; Thygesen, K. S.; Dahl, S.; Chorkendorff, I.;Jaramillo, T. F.; Jacobsen, K. W. New Cubic Perovskites for One- and Two-Photon Wa-ter Splitting Using the Computational Materials Repository. Energy & EnvironmentalScience , , 9034–9043.(30) Ong, S. P.; Cholia, S.; Jain, A.; Brafman, M.; Gunter, D.; Ceder, G.; Persson, K. A. TheMaterials Application Programming Interface (API): A Simple, Flexible and EfficientAPI for Materials Data Based on REpresentational State Transfer (REST) Principles. Computational Materials Science , , 209–215.(31) Kawazoe, Y., Yu, J.-Z., Tsai, A.-P., Masumoto, T., Eds. Nonequilibrium Phase Dia- rams of Ternary Amorphous Alloys ; Condensed Matter; Springer-Verlag: Berlin Hei-delberg, 1997.(32) Bartel, C. J.; Trewartha, A.; Wang, Q.; Dunn, A.; Jain, A.; Ceder, G. A Critical Exam-ination of Compound Stability Predictions from Machine-Learned Formation Energies. npj Computational Materials , , 97.(33) De Breuck, P.-P.; Hautier, G.; Rignanese, G.-M. Machine Learning Materials Propertiesfor Small Datasets. arXiv:2004.14766 [cond-mat] ,(34) Meredig, B.; Antono, E.; Church, C.; Hutchinson, M.; Ling, J.; Paradiso, S.;Blaiszik, B.; Foster, I.; Gibbons, B.; Hattrick-Simpers, J.; Mehta, A.; Ward, L. CanMachine Learning Identify the next High-Temperature Superconductor? ExaminingExtrapolation Performance for Materials Discovery. Molecular Systems Design & En-gineering , , 819–825.(35) Xiong, Z.; Cui, Y.; Liu, Z.; Zhao, Y.; Hu, M.; Hu, J. Evaluating Explorative PredictionPower of Machine Learning Algorithms for Materials Discovery Using K-Fold ForwardCross-Validation. Computational Materials Science , , 109203.(36) Lee, H.; Grosse, R.; Ranganath, R.; Ng, A. Y. Proceedings of the 26th Annual Interna-tional Conference on Machine Learning ; Association for Computing Machinery: NewYork, NY, USA, 2009; pp 609–616.(37) Battaglia, P. W. et al. Relational Inductive Biases, Deep Learning, and Graph Net-works. arXiv:1806.01261 [cs, stat] ,(38) Chen, C.; Zuo, Y.; Ye, W.; Ong, S. P. Maml - Materials Machine Learning Package.2021. 26 upplementary InformationAtomSets - A Hierarchical Transfer LearningFramework for Small and Large MaterialsDatasets
Table S1: Materials data names, data sizes, input types, target properties, units and task types. Type shows the input data type, where Comp means composition and Struct means structure. The tasks include regression (R) and classification (C). For the matbench_perovskites dataset, the original data gives the formation energy in eV; in the final presentation of the error metrics, we converted it to eV/atom.

Data name | Size | Type | Target | Unit | Task
matbench_steels | 312 | Comp | Yield strength | GPa | R
matbench_jdft2d | 636 | Struct | Exfoliation energy | meV/atom | R
matbench_phonons | 1265 | Struct | Peak frequency | 1/cm | R
matbench_dielectric | 4764 | Struct | Dielectric constant | - | R
matbench_expt_gap | 4604 | Comp | Band gap | eV | R
matbench_expt_is_metal | 4921 | Comp | Metallicity | - | C
matbench_glass | 5680 | Comp | Glass forming ability | - | C
matbench_log_kvrh | 10987 | Struct | Bulk moduli | log(GPa) | R
matbench_log_gvrh | 10987 | Struct | Shear moduli | log(GPa) | R
matbench_perovskites | 18928 | Struct | Formation energy | eV | R
matbench_mp_gap | 106113 | Struct | PBE band gap | eV | R
matbench_mp_is_metal | 106113 | Struct | PBE metallicity | - | C
matbench_mp_e_form | 132752 | Struct | Formation energy | eV | R

Table S2: Hyperparameters for the AtomSets and MLP models. The AtomSets model has two MLPs as components: one before the readout and one after the readout.

Model | Parameter name | Possible values
AtomSets | n layers before readout | 1, 2, 3
AtomSets | n layers after readout | 2, 3
AtomSets | n neurons per layer | 16, 32, 64, 128
AtomSets | readout function | "mean", "set2set"
AtomSets | activation function | "softplus", "relu"
MLP | n layers | 1, 2, 3, 4, 5
MLP | n neurons per layer | 16, 32, 64, 128
MLP | activation function | "softplus", "relu"

Figure S1: Analysis of the polymorph distribution in the elasticity dataset. (a) The distribution of the number of polymorphs N_p for a given composition. The range distributions of log10(K) (b) and log10(G) (c) within polymorphs for compositions with N_p > 1.

Figure S2: Model convergence for the AtomSets, AtomSets-V0, AtomSets-V1 and MLP-f models for the nine datasets indicated by the legends.

Figure S3: Absolute differences between the predicted and DFT log(G_VRH), i.e., |Δlog(G_VRH)|, against the DFT value range. The training and validation data are randomly sampled from the 0% to 90% (vertical dashed line) target quantile range. Half of the test data comes from the 90%-100% quantile (extrapolation) and the other half from the same target range as the train-validation data (interpolation).
Table S3: AtomSets model performance with the best hyperparameters. [Most numerical entries could not be recovered from the source text and are shown as "…".]

Target, Data Size | AtomSets-V0 | AtomSets-V1
Yield Strength (GPa), 312 | 90 ± 25 | -
Exfoliation Energy (meV/atom), 636 | 42 ± … | …
log(K_VRH) (GPa), 10987 | 0.08 ± … | …
log(G_VRH) (GPa), 10987 | 0.11 ± … | …

Target, Data Size | MLP-V0-stats | MLP-u1 | MLP-u2 | MLP-u3 | MLP-v_r | Dummy
Yield Strength (GPa), 312 | 107 ± 28 | - | - | - | - | 230
Exfoliation Energy (meV/atom), 636 | 61 ± … | … | … | … | … | 324
Expt. Band Gap (eV), 4604 | 0.61 ± … | … | … | … | … | …
log(K_VRH) (GPa), 10987 | 0.11 ± … | … | … | … | … | …
log(G_VRH) (GPa), 10987 | 0.13 ± … | … | … | … | … | …