Robust model benchmarking and bias-imbalance in data-driven materials science: a case study on MODNet
Pierre-Paul De Breuck, Matthew L. Evans, Gian-Marco Rignanese
Université catholique de Louvain (UCLouvain), Institute of Condensed Matter and Nanosciences (IMCN), Chemin des Étoiles 8, B-1348 Louvain-la-Neuve, Belgium. E-mail: [email protected]
Abstract.
As the number of novel data-driven approaches to materials science continues to grow, it is crucial to perform consistent quality, reliability and applicability assessments of model performance. In this paper, we benchmark the Materials Optimal Descriptor Network (MODNet) method and architecture against the recently released MatBench v0.1, a curated test suite of materials datasets. MODNet is shown to outperform current leaders on 4 of the 13 tasks, whilst closely matching the current leaders on a further 3 tasks; MODNet performs particularly well when the number of samples is below 10,000. Attention is paid to two topics of concern when benchmarking models. First, we encourage the reporting of a more diverse set of metrics, as this leads to a more comprehensive and holistic comparison of model performance. Second, an equally important task is the uncertainty assessment of a model towards a target domain. By applying a distance metric in feature space, we found that significant variations in validation errors can be observed, depending on the imbalance and bias in the training set (i.e., the similarity between training and application space). Both issues are often overlooked, yet they are important for successful real-world applications of machine learning in materials science and condensed matter.
1. Introduction
Functional materials are at the heart of many technological advances and are therefore essential to the growth of next-generation innovative and sustainable technologies [1]. The capability to predict structure- or composition-property relationships would further accelerate development; coupled with the proliferation of reliable materials simulations [2], open datasets [3], and high-throughput experimentation, machine learning (ML) has emerged as a particularly powerful approach [4, 5]. Complex properties can indeed be predicted by surrogate models in a fraction of the time with close to quantum accuracy, allowing for a much faster screening of materials; multi-fidelity models can go further and render these predictions more directly comparable with experiment. For example, the authors of Ref. [7] proposed a model that identifies Heusler compounds. From their study, novel predicted candidates from the MRu2Ga and RuM2Ga (M = Ti-Co) composition space were synthesised and confirmed to be Heusler compounds. Another example concerns the work of Stanev et al., which aimed to discover high-Tc superconductors by constructing an estimator for the critical temperature Tc based on the composition alone [8]. Databases of reported synthesised compositions were screened for new superconductors, yielding 35 non-cuprate and non-ferrous oxides as candidate compositions. These examples, just two among many [4, 5], show the impact of ML techniques in materials science and condensed matter; as such, ML is often considered the fourth paradigm of materials science [9].

Many ML approaches are being actively developed to accurately predict compound-property relationships. They differ in the amount, type and richness of descriptors and in model complexity, as well as in adopting other learning paradigms such as transfer or multi-task learning.
As a result, they exhibit very different characteristics, and no universally superior algorithm exists [10, 11]. Firstly, the input space, and thus the applicability, can be very different: some models are architecturally restricted to specific descriptors, e.g., requiring, or being unable to make use of, full structural information to make predictions. Secondly, the expected generalisation accuracy strongly depends on the amount of available data, with different approaches succeeding in contrasting data-poor or data-rich regimes. Often, a model or architecture that performs very well on a large dataset can be disappointing when applied to a small dataset (and vice versa).

Given the divergent types of models, and the growing number of approaches being released, it is important that the field follows a standard benchmarking pipeline, inspired by other fields (e.g., ImageNet for computer vision [12]). Too often, models are tested on only a very few datasets, with unclear validation and testing procedures. Significant variation in generalisation performance can be observed for different hold-out sets, as will be shown. To draw a fair comparison between two competing models, they should not just be benchmarked on the same dataset, but also on the same samples within the dataset; that is, using predetermined subsets for training and testing. Similarly, hyperparameter choices are often opaque and should instead follow a reproducible pipeline. Every model has advantages and disadvantages, and testing a diverse range of tasks shows that no model is universally superior; rather, the performance of a given architecture is a function of the amount of data, feature availability, the complexity of the target space and the computational resources available for tuning. These reasons motivate the creation of a standard benchmark. Dunn et al.
recently released the MatBench v0.1 test suite [13], which consists of 13 materials science-specific prediction tasks, ranging from small (~300 samples) to large (~130,000 samples) datasets. We benchmark MODNet [14] on this suite; uncertainty estimates on its predictions can also be obtained post hoc. As such, MODNet provides a promising alternative on small to medium-sized datasets to the automated but complex pipelines trained with Automatminer [13] and the data-intensive graph convolutional neural networks CGCNN [15] and MEGNet [16].

Furthermore, our testing reveals a variance in the generalisation error between cross-validation folds. In the second part of this paper, we link this (sometimes abrupt) change in prediction error to a distance metric in feature space, with a discussion of the importance of imbalance and bias in the dataset, and how model performance can be appropriately reported within these different regimes. This paves the way to a better understanding of a model's applicability for a particular target domain, allowing the estimation of the prediction error of unseen samples.
2. Methods
The Materials Optimal Descriptor Network (MODNet) is a universal tool for predicting materials properties from primitives such as composition or structure [14]. It consists of a feedforward neural network fed with a limited number of physically motivated descriptors. MODNet was designed to make the most efficient use of data for tasks where large, internally consistent training sets are prohibitively difficult or expensive to obtain; we observe that most existing materials datasets are limited in this way, particularly those derived from experimental results. In order to have the best possible performance at low dataset size, three key aspects are needed.

First, we observe that chemical and geometrical descriptors can provide more information than a raw graph representation. To this end, MODNet makes use of many of the existing descriptors designed in the community, as implemented in the Matminer package [17]. By giving the model physically motivated descriptors, part of the learning is already done, as they exploit existing chemical knowledge. In contrast, more complex models such as graph networks or embedding methods are often initialised with only atomic numbers and bond lengths [16, 15]. Impressively, graph models are able to bootstrap this chemical knowledge ex nihilo and perform highly accurate property predictions, but only when a large and sufficiently diverse dataset is available.

Second, the smaller the dataset, the greater the importance of the feature selection process in amplifying the signal-to-noise ratio. For instance, a previous application of MODNet to vibrational thermodynamics [14] showed that an average 12% improvement in error can be obtained by removing irrelevant features. For feature selection, MODNet ranks candidate features by their relevance to the target while penalising redundancy with already-selected features [14]. On the DFT refractive index dataset of Naccarato et al. [18] (4040 samples), MODNet achieves a mean absolute error of 0.051 (n.b., this dataset is distinct from the MatBench refractive index task).
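The relevance-redundancy idea behind this kind of feature selection can be illustrated with a short greedy sketch. MODNet's actual procedure (based on normalised mutual information, described in Ref. [14]) differs in detail; here, scikit-learn's `mutual_info_regression` estimator, the helper name `select_features` and the trade-off parameter `alpha` are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def select_features(X, y, n_keep=10, alpha=1.0):
    """Greedy relevance-redundancy selection (illustrative sketch).

    At each step, pick the feature with the highest relevance to the
    target, penalised by its redundancy with already-selected features.
    """
    n_features = X.shape[1]
    relevance = mutual_info_regression(X, y)  # MI(feature, target)
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_keep:
        scores = []
        for j in remaining:
            if selected:
                # redundancy: mean MI between candidate and chosen features
                red = np.mean(mutual_info_regression(X[:, selected], X[:, j]))
            else:
                red = 0.0
            scores.append(relevance[j] - alpha * red)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

A feature strongly correlated with the target is selected first, while near-duplicates of already-chosen features are suppressed, which is the mechanism that amplifies the signal-to-noise ratio on small datasets.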
A multi-target MODNet was also developed to study temperature-dependent results of the vibrational thermodynamics dataset of Petretto et al. [19] (1265 samples). A single model was simultaneously trained on the vibrational entropy, enthalpy, specific heat and Helmholtz energy across a wide temperature range. This model achieved an average error of 0.009 meV/K/atom on the room-temperature vibrational entropy (S°), four times lower than previous studies, with joint learning contributing to an error reduction of 8% compared to the best MODNet model trained on S° alone [14].

Since the initial release (v0.1.5) of MODNet as reported in [14], several improvements to the software framework have been implemented in the version used in this paper (v0.1.9). First, MODNet has been made increasingly flexible in the nature of the material primitives supported (a full structure or just a composition) and in the types of target properties that can be learned. For instance, the user can manually choose which Matminer featurizers [17] to apply to the data, or they can use presets suitable for structural or purely compositional samples. Second, it is now possible to use MODNet to perform binary and multi-label classification tasks, predicting class membership probability in both cases. Third, an automated procedure for hyperparameter optimisation has been added within the framework of nested cross-validation (NCV). Automated hyperparameter selection ensures a correct assessment of the generalisation error and minimises overfitting. That is, hyperparameters should never be chosen to minimise a test error, but rather chosen internally with a validation procedure. The specific hyperparameter optimisation process used in this work is described in Appendix A.

MODNet is an evolving framework and further improvements are planned.
We expect to apply it to even more datasets, as well as applying the trained models in future screening studies. MODNet is available at ppdebreuck/modnet on GitHub [20].
3. Results and discussion
The MatBench (v0.1) test suite contains 13 supervised ML tasks, taken from 10 datasets with sizes varying from 300 to 130,000 samples. Our previous benchmarking of MODNet has shown that it is most effective on small to medium-sized datasets, so efforts were focused on the 10 tasks with fewer than 15,000 samples. Of the remaining tasks, the largest two consist of the entire Materials Project [21]; the scaling of MODNet's performance on this dataset was previously studied in detail, with the conclusion that graph-based networks outperform MODNet by a factor of 2 when the entire dataset is considered [14].

The MatBench suite covers various observables in materials science driven by different length scales, from microscopic simulations of elastic, electronic, optical and vibrational properties, to macroscopic measurements of steel yield strengths and metallic glass formation. For a detailed description of the contents and curation of each dataset, the reader is referred to the MatBench paper itself [13].

Each task provides either a composition or a structural model per sample. The target is always a single property, either continuous or a discrete class label (such as metal or non-metal). Almost all properties are therefore trained in single-target mode, i.e., one target per task. The only exceptions are the elastic properties log K and log G, where both bulk and shear modulus are provided for each sample (across two MatBench datasets); here, model performance is improved by joint-learning both targets in a single model. Regression models were trained using the mean absolute error as a loss function. For classification, the output activation is set to a sigmoid function and the categorical cross-entropy is used as the loss function.

Our testing procedure follows the MatBench recommendations, i.e., an outer loop of five-fold cross-validation is used to determine the generalisation error (test error).
An inner validation is performed for both feature and hyperparameter selection, ensuring an unbiased generalisation error. Instead of using a nested cross-validation, we used a fixed inner validation split; this does not compromise the generalisation error (as the user is free to optimise the model given the training data only), but a nested cross-validation will in principle only select better (more general) hyperparameters, as defined by a lower generalisation error. More details of the hyperparameter optimisation process and training methods can be found in Appendix A.

Table 1 contains an evaluation of MODNet on the MatBench suite, alongside five alternative algorithms reported on the MatBench leaderboard [13], as measured by the mean absolute error (MAE) for regression tasks, or the area under the receiver-operating characteristic curve (ROC-AUC) for classification tasks. The Dummy algorithm forms a baseline (see Table 1).
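The evaluation protocol described above can be sketched as follows. The `Ridge` model and `alphas` grid are placeholders for MODNet and its hyperparameter space, and the 10% inner split is an illustrative choice; the key point is that the outer test fold is never used for model selection:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, train_test_split

def benchmark(X, y, alphas=(0.01, 0.1, 1.0, 10.0), seed=0):
    """Outer five-fold CV for the test error; hyperparameters are
    chosen on a fixed inner validation split (illustrative sketch)."""
    outer = KFold(n_splits=5, shuffle=True, random_state=seed)
    fold_maes = []
    for train_idx, test_idx in outer.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # fixed inner split, taken from the training fold only
        X_in, X_val, y_in, y_val = train_test_split(
            X_tr, y_tr, test_size=0.1, random_state=seed)
        best_alpha = min(
            alphas,
            key=lambda a: mean_absolute_error(
                y_val, Ridge(alpha=a).fit(X_in, y_in).predict(X_val)))
        # refit on the full training fold with the selected hyperparameter
        model = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
        fold_maes.append(
            mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(fold_maes)), fold_maes
```

Reporting both the mean and the per-fold errors, as done throughout this paper, exposes the fold-to-fold variance discussed in Section 3.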
Table 1. The performance of MODNet (v0.1.9) on the MatBench (v0.1) test suite alongside alternative algorithms (Automatminer, RF, CGCNN and MEGNet) reported by Dunn et al. [13]. The reported scores are either mean absolute errors (MAE) for regression or the area under the receiver-operating characteristic curve (ROC-AUC) for classification. These scores are averaged over the best models for each of the five cross-validation folds, following the recommendations of MatBench [13]. The hyperparameters are allowed to vary across these folds. Numbers in bold indicate the best performing model for that task, and those within 2% of the best performance relative to the best model. A column is included for a null ("Dummy") model that yields the mean of the dataset for each prediction for regression tasks, or returns the class frequency for classification tasks. Where present, bracketed numbers indicate the best score obtained after hand-tuning the MODNet models that were previously trained by the automatic hyperparameter optimisation. † An outlier was automatically detected when joint-learning on log K and log G; the prediction for this sample was replaced with the dataset mean (see 3.2.2 for details).

Regression task              n       MODNet         AM     RF     CGCNN   MEGNet  Dummy
Steel yield strength (MPa)   312     112.9 (91.4)   104    …      -       -       230
E_exfol. (eV/atom)           636     34.4 (32.7)    38.6   49.9   49.2    55.9    67.3
argmax(PhDOS) (1/cm)         1265    (34.3)         50.8   68     57.8    …       …
Exp. band gap (eV)           …       0.357 (0.330)  0.416  0.446  -       -       1.14
Refractive index             4764    0.321 (0.309)  …      …      …       …       …
K (log GPa) †                …       …              …      …      …       …       …
G (log GPa) †                …       …              …      …      …       …       …
E_form (eV/atom)             18928   -              0.194  0.235  0.0452  …       …
E_form (eV/atom)             132752  -              0.173  0.116  0.0332  …       …

The Dummy baseline predicts the mean of the training set in the case of regression, or a random label weighted by class size in the case of classification. A second baseline is formed by a Random Forest (RF) algorithm [22] using Magpie elemental descriptors when only a composition is provided [23], and the Sine Coulomb Matrix descriptor if structures are available [24]. MODNet is compared against Automatminer (AM), which is provided as a reference algorithm with the MatBench suite [13]. Automatminer creates automated end-to-end pipelines that couple materials datasets decorated with Matminer descriptors with an AutoML search using the TPOT package [25]; this approach generates ensembles of tree-based models to perform the final prediction. It has the advantage of being fully automated and therefore requires minimal ML expertise to run on a new dataset. Finally, for tasks that contain the crystal structure as input, the deep learning models CGCNN [15] and MEGNet [16] are also compared against; these models perform convolution and pooling operations on a graph representation of the crystal structure, allowing for structurally local interpretations of predicted properties.

The presented results show that MODNet significantly improves the error on 4 tasks (exp. band gap, exp. metallicity, glass-forming ability, 2D exfoliation energy), matches or outperforms the existing leader (within 2%) on a further 3 tasks, and has a higher error on only one task (steel yield strength). Figure 1 displays the results of the regression tasks relative to the Dummy model across all datasets.
Figure 1. Comparison of model performance across regression tasks, relative to a dummy model that predicts the mean of the dataset. The shaded region indicates the range of errors spanned by the MODNet cross-validation folds. An arbitrary vertical offset has been applied to each model to aid visual discrimination.

One can see that the Dummy algorithm is outperformed by all models by a factor of two to three. The spread between the performance on individual folds is correlated both with the improvement relative to the Dummy model and with the spread between all competing approaches. It should be noted that hand-tuning models after automatic hyperparameter optimisation can lower this error further. This effect was most prominent for the steel yield strength task, where model performance could be improved by nearly 20% with manual tuning, indicating that the hyperparameter optimisation approach could be improved for smaller datasets, where a simple grid search is inherently noisier.

Detailed MODNet benchmarking plots for each dataset can be found in Appendix B. These plots display the typical regression between target and predicted properties, and additionally a regression between targets and prediction errors to show any model biases as a function of target property. As one might expect, the sample density is often considerably lower for extremal target property values, so many datasets display a positive correlation between the errors and targets (i.e., predictions decrease in quality for larger target values). This effect is most prominent for the refractive index and exfoliation energy datasets shown in Figures B1 and B3.

As a general trend, it can be observed that MODNet performs relatively well on tasks that are limited to composition data only.
We explain this success by the usage of a limited-input neural network in which a hidden representation is learned for the different elemental contributions; this representation is much more flexible than the equivalent obtained by feature engineering for tree-based approaches. Recently presented deep representation learning approaches should be able to exploit this effect even further.
When predicting material properties with a machine learning model, one should keep in mind that the prediction error can vary significantly from material to material. The MAE is merely a mean value, with sometimes huge differences (lower or higher) on single predictions. For instance, for the experimental band gap, the errors span five orders of magnitude. This has two major consequences. First, a more comprehensive way of assessing a model's performance is needed. In particular, when comparing different models, it is generally worth having a compilation of metrics, or a full distribution of errors, instead of relying on an individual metric. Second, the ability to provide an error estimate on predictions a priori is extremely useful, so one could be tempted to seek to quantify this uncertainty. In other words, given the input material, how reliable is the final prediction? This section covers a solution to the first question, and proposes a simple distance metric in feature space to address the second question.

The results in Table 1 are an average over the 5 cross-validation folds, and significant variations between folds are often observed, as shown in Figure 1. This is caused by the underlying spread in errors over the individual predictions; when sampled, a strong discrepancy between folds can be seen, especially when the dataset is small. This variance over the points contains valuable information about the intrinsic model performance and should, in our opinion, not be neglected. Too often in the materials science field, models are assessed using a single metric, which is in general not enough to fully capture and compare ML models. Two models could have a similar MAE, but with a very different spread. Moreover, models can set their internal parameters to suit the benchmark metric (e.g., by setting the loss function to the metric of interest), and therefore optimise their model with respect to a benchmark rather than the domain of application.
A more holistic approach to comparing models is clearly needed. The advantage of using different metrics is that they illuminate different performance aspects of a model and, taken together, they give a more comprehensive view of a model's quality. Following the work of Vishwakarma et al. [28], we provide a compilation of metrics to assess MODNet performance. We suggest the following metrics for regression: mean absolute error (MAE), median absolute error (MedAE), root-mean-squared error (RMSE), mean absolute percentage error (MAPE), maximum absolute error (MaxAE), and the Pearson correlation (R) of an ordinary least squares regression between the target and predicted properties. The corresponding values for the different tasks on MODNet are given in Table 2 (excluding MAPE).

Comparing the MAE with the RMSE offers insights into the variability of prediction errors: large fluctuations in errors (outliers) will result in a significantly higher RMSE. The MAPE expresses errors relative to the ground truth, and thus complements the MAE and RMSE. It should be noted, however, that it is best avoided for properties with values equal or close to zero, as the MAPE diverges; it is therefore omitted in Table 2.

The MaxAE gives the worst error in the test set, and is thus related to the spread in errors. When making finite-cost decisions based on predictions from a model, it is crucial to have an idea of the worst-case error. As can be seen from the table, all tasks have a MaxAE that is orders of magnitude larger than the MAE among the test samples. This aspect and its relation to uncertainty are discussed hereafter.

The R value measures how close predictions are to the ground truth through their linear relationship, and has the advantage of being bounded between -1 and 1. A value of 1 indicates a perfect positive correlation. This enables comparing performance not only between models but also between tasks.
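The suggested battery of regression metrics can be computed in a few lines. This is a minimal sketch using NumPy and SciPy; the helper name `regression_metrics` is illustrative and not part of MODNet:

```python
import numpy as np
from scipy.stats import pearsonr

def regression_metrics(y_true, y_pred):
    """Battery of regression metrics: MAE, MedAE, RMSE, MAPE, MaxAE, R."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = np.abs(y_true - y_pred)
    return {
        "MAE": err.mean(),
        "MedAE": np.median(err),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        # MAPE diverges for targets near zero; report it only when safe
        "MAPE": (np.mean(err / np.abs(y_true))
                 if np.all(np.abs(y_true) > 1e-9) else np.nan),
        "MaxAE": err.max(),
        "R": pearsonr(y_true, y_pred)[0],
    }
```

Reporting this dictionary per task, rather than the MAE alone, exposes the spread and outlier behaviour discussed in the text.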
The best-performing tasks are found to be the phonons and elastic constants. In contrast, the refractive index and exfoliation energy are both seen to yield a low R. This is mainly caused by outliers, as the R-value metric is very sensitive to them.

Finally, an alternative to the ensemble of metrics is to provide a full probability distribution of errors, for instance by applying a Gaussian kernel density estimation to the discrete errors. An example is given in the previous work on MODNet [14].

Concerning classification, the previously mentioned ROC-AUC is an overall good metric for binary classification provided the labels are balanced, combining the model's performance at different class probability thresholds. Providing the full ROC curve is also good practice. However, for imbalanced datasets, precision-recall (PR) curves have greater utility [29]. Both the ROC and PR curves are given in Appendix B for the experimental metallicity and glass-forming ability tasks using MODNet. The corresponding ROC-AUC and average precision scores are given in Table 2. For multi-class classification, a confusion matrix is a simple yet powerful way of visualising the quality of the model for a given threshold.

Beyond benchmarking a model, it is equally important to assess the uncertainty of novel predictions. Given an unseen sample, how reliable is the model's prediction? This aspect is closely related to the applicability and generalisation capabilities of a model. This is not a straightforward problem in materials science and it should be given significant attention, in the same way much effort is spent on optimising an error metric. For instance, a MODNet model was created on the DFT-predicted refractive index dataset of Naccarato et al. [18]. A random test set of 200 samples yields an MAE of 0.051. However, when looking only at the non-oxides in this test set, the error jumps to 0.082.
Table 2. A battery of metrics indicating MODNet performance on regression and classification tasks of MatBench v0.1. † As in Table 1, one numerically unstable prediction was replaced with the dataset mean before reporting the metrics for the log K and log G tasks.

Regression task              MAE    MedAE  RMSE   MaxAE  R
Steel yield strength (MPa)   112.9  65.5   204.1  1689   0.79
E_exfol. (eV/atom)           34.4   6.98   103.8  1567   0.64
Refractive index             0.321  0.065  1.99   59.1   0.37
Exp. band gap (eV)           0.357  0.014  0.84   9.9    0.82
log K (log GPa) †            …      …      …      …      …
log G (log GPa) †            …      …      …      …      …

The oxides (defined loosely here as compounds containing at least one oxygen atom) in the test set have a much lower MAE of 0.035. Similarly, for the bulk modulus in MatBench, we observed an error of over 50,000 log(GPa) on a particular compound, as noted in the caption of Table 1. These variations are significant, and anticipating such fluctuations a priori is therefore important for reliable real-world applications.

On physical and chemical grounds, materials appear in distinct qualitative classes based on composition (e.g., oxides, phosphides, sulfides), structural classes (e.g., perovskite, wurtzite, zincblende) or intrinsic properties (e.g., metals and non-metals). Other more subtle groupings exist and are equally important, as materials often behave differently depending on subgroup membership. Therefore, it could be expected that a machine learning model trained on a particular class will fail to make good predictions when applied to another class. However, many of the classes mentioned previously are principally human-made: they are based on human intuition and concepts such as bond type or the periodic table, and in some sense do not exist for the machine [30]. In all generality, a machine learning model goes beyond these human concepts and forms its own material representation, considering a material as a purely mathematical object in a high-dimensional space.
For instance, one can use the feature space or, better, a condensed latent space. In this work, we consider materials either as points in the MODNet feature space, or in a lower-dimensional space obtained by PCA (principal component analysis). Classes now appear naturally as point clouds in these spaces, and a similarity metric can be defined through a simple distance-based measure. Materials that are close are similar and should behave similarly with respect to the learned property; if this is not the case, the feature space may not be sufficiently complete to fully discriminate samples.

The concepts of imbalance and bias are important in this context. A training set that has two or more distinct classes (i.e., clusters in the material space) that are sampled in different proportions is imbalanced. For instance, 84% of the refractive index dataset is comprised of oxides, which therefore constitutes an imbalanced training set. A training set is biased when it covers only a specific region of the material space (i.e., not all classes are covered), a more extreme case of imbalance.
Figure 2. PCA decomposition of the refractive index dataset (Naccarato et al. [18]) for the second and third components. The second component is linked to the bond length, while the third component is linked to the ionicity of the compound. Data points corresponding to oxides and non-oxides are coloured blue and red, respectively. Compounds having at least one oxygen atom tend to have shorter mean bond lengths and increased ionicity. An imbalance appears due to sampling: the dotted blue line divides the sample space into denser and coarser regions. The compounds in the well-sampled centre region yield on average the lowest error, regardless of whether they are an oxide. PCA components are provided in detail in Appendix C.
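An analysis of this kind can be sketched as follows, assuming a featurised matrix `X` and a boolean class mask (e.g., oxide vs. non-oxide). The helper name `pca_imbalance` and the centroid-separation diagnostic are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_imbalance(X, class_mask, n_components=3):
    """Project the standardised feature matrix onto its principal
    components and measure how two classes populate them."""
    Z = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(Z)
    class_mask = np.asarray(class_mask, dtype=bool)
    frac = class_mask.mean()  # sampling imbalance between the two classes
    # centroid separation per component: large values hint at class bias
    sep = coords[class_mask].mean(axis=0) - coords[~class_mask].mean(axis=0)
    return coords, pca.explained_variance_ratio_, float(frac), sep
```

Plotting `coords` coloured by the class mask reproduces the kind of view shown in Figure 2, with `frac` quantifying the imbalance and `sep` indicating along which components the classes separate.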
Both imbalance and bias have consequences for the generalisation of a model, as they express potential dissimilarities between the training and target domains.

In order to visualise imbalance-bias and the discrepancy in error between oxides and non-oxides for the refractive index DFT dataset of Naccarato et al. [18], we performed a PCA decomposition. The first three components together account for 25% of the variance. Their detailed description is provided in Appendix C, with the list of descriptors and corresponding weights in the feature space. The first component roughly corresponds to a mean of all features and therefore identifies outlier compounds.

To quantify how similar an unseen sample is to the training set, we compute its mean distance to the k-nearest neighbours (KNN) in the training set, d_KNN. The Euclidean distance (i.e., L2-norm) is used here, but other distances are possible. Figure 3 represents the absolute error of each test point for different MatBench tasks as a function of d_KNN. Features are scaled to fit a normal distribution; distances can therefore be interpreted as proportional to the number of standard deviations between samples. We should note that one is looking for small variations on average, as a systematic trend would indicate an underperforming ML model. A weak but significant upward trend is observed on most tasks, as shown by the positive Pearson correlation r and the corresponding linear regression depicted by the blue dashed line. The error range tends to increase with increasing distance, indicating an increase in uncertainty. Both the Pearson correlation and the linear regression are very sensitive to outliers. As a result, the trend is somewhat underestimated (resp. overestimated) for the dielectric and experimental gap (resp.
bulk modulus) tasks.
Figure 3. Absolute error of a test sample as a function of the distance to its k-nearest neighbours (d_KNN) in the training set, accumulated over the 5 folds. The value of k is set to 25 for the steels and experimental band gaps, and to 5 for the other datasets. A linear regression of the point cloud is depicted with the blue dashed line, while the average error over different bins of d_KNN is represented by orange circles. For reference, the mean absolute error is also reported as a green line. The Pearson correlation (r) is significantly above zero on most tasks. The maximum error for each task is circled in red, which in this case includes the outlier excluded from the previous metrics for the bulk modulus task.
Therefore, the average error as a function of d_KNN is binned and depicted in orange in Figure 3 for comparison with the MAE. It can be seen that the error is slightly below the overall MAE at lower distances, and increases above it at higher distances, following the trend observed in Appendix B for the error as a function of target property. For the dielectric, phonon, experimental band gap and bulk modulus tasks, a threshold value is observed below which the error is nearly zero for all samples. This can be explained by the corresponding larger datasets: similar materials facilitate good predictions.

In particular, this approach enables us to explain the previously removed outlier of the PtRh bulk modulus (i.e., the red-circled outlier in the bulk modulus panel of Figure 3). It can be seen that PtRh has a very high dissimilarity with respect to the training set, caused by uncontrolled numerical instabilities in the Voronoi featurization (i.e., a mean Voronoi volume is more than a hundred standard deviations away from the nearest cluster).

Finally, it should be noted that many model architectures intrinsically perform uncertainty quantification, namely Bayesian learning approaches and Gaussian processes, among others.
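The d_KNN measure described above can be sketched in a few lines, assuming scikit-learn; the feature scaling mirrors the normalisation described in the text, while the helper name `knn_distance` is an illustrative assumption:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def knn_distance(X_train, X_test, k=5):
    """Mean Euclidean distance of each test sample to its k nearest
    neighbours in the training set, in the standardised feature space."""
    scaler = StandardScaler().fit(X_train)
    nn = NearestNeighbors(n_neighbors=k).fit(scaler.transform(X_train))
    dists, _ = nn.kneighbors(scaler.transform(X_test))
    # larger d_KNN -> test sample less similar to the training domain
    return dists.mean(axis=1)
```

A prediction whose d_KNN exceeds a threshold calibrated on held-out data could then be flagged as uncertain, or removed, as suggested in the Summary.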
4. Summary
To summarise, we report benchmarks of the Materials Optimal Descriptor Network (MODNet) on the MatBench v0.1 test suite. MODNet is a universal model, both with respect to the material primitives (structure or composition) and the target type. MatBench is a standardised benchmarking pipeline, with fixed procedures for training and testing, containing various material prediction tasks of differing sizes and types. Our results show that MODNet performs well on small to medium-sized datasets. In particular, it was found to outperform or match the current leaders on the experimental band gap, experimental metallicity, glass formation ability, 2D exfoliation energy, elastic moduli (bulk and shear), and phonon DOS peak estimation tasks (see Table 1). We provide metrics beyond MAE and ROC-AUC scores, encourage the reporting of multiple metrics, and make all benchmarking data available for future model comparisons.

We show that, depending on the test set sampling (or fold), significant variation in error can be measured. This is due to the high dimensionality of the feature space spanned by materials and the associated sparsity of a given dataset; bias or imbalance is easily introduced into the training set by this sampling. Therefore, applicability assessment through uncertainty quantification is a crucial aspect for future developments. We show that dataset imbalance can be identified through PCA and that uncertainty can be studied via distance-based approaches in feature space. This offers the possibility of setting a confidence bound on individual predictions; uncertain predictions could be flagged or removed.

Finally, we emphasise that this work only forms a basis for better practice in model design and performance assessment. Uncertainty quantification, particularly for extrapolations, is a difficult task in general. We have presented some options in this regard, but further development and benchmarking is necessary.
For instance, Figure 3 shows various exceptions to the linearly increasing trend, which hinder an accurate overall quantification. Uncertainty quantification could be improved by computing the distances in the hidden representation learned by the network, rather than in the feature space.

We hope that this work encourages future developments on the topics of metrics, bias and uncertainty, which are crucial in materials science for robust, transparent testing and a better understanding of a model's applicability in a wider context. By providing all of our benchmarking data, we hope that the differences between competing approaches can be learned from and used to improve real-world model performance.

Acknowledgements
P.-P. D.B. and G.-M. R. are grateful to the F.R.S.-FNRS for financial support. M. L. E. and G.-M. R. acknowledge support from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 951786 (NOMAD CoE).

Computational resources were provided by the supercomputing facilities of the Université catholique de Louvain (CISM/UCL) and the Consortium des Équipements de Calcul Intensif en Fédération Wallonie-Bruxelles (CÉCI) funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under convention 2.5020.11 and by the Walloon Region.
References

[1] Magee C L 2012 Complexity
[2] Science URL https://dx.doi.org/10.1126/science.aad3000
[3] Himanen L, Geurts A, Foster A S and Rinke P 2019 Advanced Science URL https://doi.org/10.1002/advs.201900808
[4] Butler K T, Davies D W, Cartwright H, Isayev O and Walsh A 2018 Nature URL https://doi.org/10.1038/s41586-018-0337-2
[5] Schmidt J, Marques M R G, Botti S and Marques M A L 2019 npj Computational Materials
[6] Nature Computational Science URL https://doi.org/10.1038/s43588-020-00002-x
[7] Oliynyk A O, Antono E, Sparks T D, Ghadbeigi L, Gaultois M W, Meredig B and Mar A 2016 Chemistry of Materials
[8] npj Computational Materials ISSN 2057-3960
[9] Agrawal A and Choudhary A 2016 APL Materials
[10] IEEE Transactions on Evolutionary Computation
[11] Evolutionary Computation, IEEE Transactions on
[12] pp 248–255 ISSN 1063-6919
[13] Dunn A, Wang Q, Ganose A, Dopp D and Jain A 2020 npj Computational Materials
[14] arXiv:2004.14766 [cond-mat] (Preprint)
[15] Xie T and Grossman J C 2018 Physical Review Letters ISSN 0031-9007, 1079-7114
[16] Chen C, Ye W, Zuo Y, Zheng C and Ong S P 2019 Chemistry of Materials
[17] Computational Materials Science pp 60–69 ISSN 0927-0256
[18] Naccarato F, Ricci F, Suntivich J, Hautier G, Wirtz L and Rignanese G M 2019 Physical Review Materials
[19] Petretto G, Dwaraknath S, Miranda H P C, Winston D, Giantomassi M, van Setten M J, Gonze X, Persson K A, Hautier G and Rignanese G M 2018 Scientific Data
[20] URL https://github.com/ppdebreuck/modnet
[21] Jain A, Ong S P, Hautier G, Chen W, Richards W D, Dacek S, Cholia S, Gunter D, Skinner D, Ceder G and Persson K 2013 APL Materials URL https://dx.doi.org/10.1063/1.4812323
[22] Breiman L 2001 Machine Learning
[23] npj Computational Materials
[24] International Journal of Quantum Chemistry
[25] Proceedings of the Genetic and Evolutionary Computation Conference 2016 GECCO '16 (New York, NY, USA: ACM) pp 485–492 ISBN 978-1-4503-4206-3 URL http://doi.acm.org/10.1145/2908812.2908918
[26] Goodall R E and Lee A A 2020 Nature Communications URL https://doi.org/10.1038/s41524-020-00406-3
[27] Wang A, Kauwe S, Murdock R and Sparks T 2020 ChemRxiv URL https://doi.org/10.26434/chemrxiv.11869026.v2
[28] Vishwakarma G, Sonpal A and Hachmann J 2020 arXiv:2010.00110 [physics] (Preprint)
[29] Davis J and Goadrich M 2006 The relationship between Precision-Recall and ROC curves Proceedings of the 23rd International Conference on Machine Learning ICML '06 (New York, NY, USA: Association for Computing Machinery) pp 233–240 ISBN 978-1-59593-383-6
[30] George J and Hautier G 2020 Trends in Chemistry S2589597420302653 ISSN 25895974

Appendix A. Notes on featurization and training
All datasets were featurized using the DeBreuck2020Featurizer preset that is bundled with MODNet, though some datasets only made use of compositional descriptors. After featurization, the normalised mutual information was computed between all pairs of descriptors across the entire dataset and was referred back to when training each model.

Hyperparameter optimization was performed via 5-fold cross-validation, with 85% of each fold being used for training in the inner loop. For each fold, a grid search over batch sizes, learning rates, number of training features (N) and hidden layer depths was performed, and the hyperparameters with the lowest MAE on the remaining 15% of the fold were used to fit a new model on the entire fold. Feature selection is key to the performance of MODNet; the features were selected per fold based on a relevance-redundancy criterion and the top N features were used for training. Some typical architectures are illustrated in Figure A1 for a 1000-dimensional feature space.

The outer loop was performed in parallel, yielding five models per task. After featurization, training can be efficiently performed on commodity hardware, with a full grid search (20-60 hyperparameter combinations) requiring no more than 2 hours per outer fold when run on 4 cores of an AMD EPYC 7742 (64-core) CPU.

All benchmarking data and the scripts used to featurize and train models for each dataset are available on GitHub at ml-evs/modnet-matbench.

Figure A1.
Graphical depiction of the network architectures sampled during hyperparameter optimisation (an input layer followed by several hidden layers). Absolute values correspond to the layer depths for a 1000-dimensional feature space (before feature selection).

Appendix B. Detailed benchmark plots
Appendix B.1. matbench dielectric

[Parity plot of predicted refractive index (dimensionless) against the ideal line, with absolute and prediction error distributions; MAE = 0.321.]
Figure B1.
Results for the matbench dielectric dataset.
Appendix B.2. matbench expt gap

[Parity plot of predicted band gap (eV) against the ideal line, with absolute and prediction error distributions; MAE = 0.357 eV.]
Figure B2.
Results for the matbench expt gap dataset.

Appendix B.3. matbench jdft2d

[Parity plot of predicted exfoliation energy (meV/atom) against the ideal line, with absolute and prediction error distributions; MAE = 34.433 meV/atom.]
Figure B3.
Results for the matbench jdft2d dataset.

Appendix B.4. matbench log gvrh

[Parity plot of predicted log10(GVRH) against the ideal line, with absolute and prediction error distributions; MAE = 0.065.]
Figure B4.
Results for the matbench log gvrh dataset.
Appendix B.5. matbench log kvrh

[Parity plot of predicted log10(KVRH) against the ideal line, with absolute and prediction error distributions; MAE = 0.085.]
Figure B5.
Results for the matbench log kvrh dataset.

Appendix B.6. matbench phonons

[Parity plot of the predicted largest frequency in the phonon DOS (1/cm) against the ideal line, with absolute and prediction error distributions; MAE = 36.333 1/cm.]
Figure B6.
Results for the matbench phonons dataset.
Appendix B.7. matbench steels

[Parity plot of predicted yield strength (MPa) against the ideal line, with absolute and prediction error distributions; MAE = 112.929 MPa.]
Figure B7.
Results for the matbench steels dataset.

Appendix B.8. matbench expt is metal

[ROC curve (true positive rate vs. false positive rate; ROC-AUC: 0.953) and precision–recall curve (average AP score: 0.946).]
Figure B8.
Results for the matbench expt is metal dataset.
Appendix B.9. matbench glass

[ROC curve (true positive rate vs. false positive rate; ROC-AUC: 0.912) and precision–recall curve (average AP score: 0.954).]
Figure B9.
Results for the matbench glass dataset.

Appendix C. PCA
Appendix C.1. First component ($PC_1$)

$PC_1 = \sum_i w_{1,i} f_{1,i}$

Table C1.
The 20 highest contributing features of $PC_1$ with corresponding weights $w_{1,i}$.

$w_{1,i}$   Feature $f_{1,i}$
-0.0890     ChemEnvSiteFingerprint|mean SC:12
-0.0890     ChemEnvSiteFingerprint|mean SH:11
-0.0890     ChemEnvSiteFingerprint|std dev S:10
-0.0890     ChemEnvSiteFingerprint|mean DD:20
-0.0890     ChemEnvSiteFingerprint|mean H:10
-0.0890     ChemEnvSiteFingerprint|std dev CO:11
-0.0890     ChemEnvSiteFingerprint|mean S:12
-0.0890     ChemEnvSiteFingerprint|mean S:10
-0.0890     ChemEnvSiteFingerprint|mean CO:11
-0.0890     ChemEnvSiteFingerprint|std dev SH:11
-0.0890     ChemEnvSiteFingerprint|std dev S:12
-0.0890     ChemEnvSiteFingerprint|std dev H:10
-0.0890     ChemEnvSiteFingerprint|std dev DD:20
-0.0890     ChemEnvSiteFingerprint|std dev SC:12
-0.0890     ChemEnvSiteFingerprint|mean H:11
-0.0890     ChemEnvSiteFingerprint|mean HD:9
-0.0890     ChemEnvSiteFingerprint|mean SH:13
-0.0890     ChemEnvSiteFingerprint|std dev H:11
-0.0890     ChemEnvSiteFingerprint|mean PCPA:11
-0.0890     ChemEnvSiteFingerprint|mean TBSA:10

Appendix C.2. Second component ($PC_2$)

$PC_2 = \sum_i w_{2,i} f_{2,i}$

Table C2.
The 20 highest contributing features of $PC_2$ with corresponding weights $w_{2,i}$.

$w_{2,i}$   Feature $f_{2,i}$
-0.1070     GaussianSymmFunc|mean G2 20.0
-0.1053     AGNIFingerPrint|mean AGNI eta=1.23e+00
-0.1049     GeneralizedRDF|mean Gaussian center=1.0 width=1.0
—           ElementProperty|MagpieData mean Row
—           VoronoiFingerprint|mean Voro dist minimum
-0.0993     GeneralizedRDF|mean Gaussian center=0.0 width=1.0
-0.0973     AGNIFingerPrint|std dev AGNI eta=1.23e+00
—           AverageBondLength|mean Average bond length
-0.0965     AGNIFingerPrint|mean AGNI eta=1.88e+00
—           ElementProperty|MagpieData mean Number
-0.0949     GeneralizedRDF|std dev Gaussian center=0.0 width=1.0
-0.0948     GaussianSymmFunc|std dev G2 20.0
—           ElementProperty|MagpieData mean CovalentRadius
—           ElementProperty|MagpieData mean AtomicWeight
-0.0877     GaussianSymmFunc|mean G4 0.005 4.0 -1.0
-0.0876     AGNIFingerPrint|std dev AGNI dir=y eta=1.23e+00
-0.0864     AGNIFingerPrint|std dev AGNI dir=x eta=1.23e+00
-0.0859     GaussianSymmFunc|mean G2 4.0
-0.0848     AGNIFingerPrint|std dev AGNI dir=y eta=1.88e+00
-0.0845     GeneralizedRDF|std dev Gaussian center=1.0 width=1.0

Appendix C.3. Third component ($PC_3$)

$PC_3 = \sum_i w_{3,i} f_{3,i}$

Table C3.
The 20 highest contributing features of $PC_3$ with corresponding weights $w_{3,i}$.

$w_{3,i}$   Feature $f_{3,i}$
-0.1194     IonProperty|avg ionic char
-0.1174     ElementProperty|MagpieData avg dev Electronegativity
-0.1159     LocalPropertyDifference|mean local difference in Electronegativity
-0.1023     IonProperty|max ionic char
—           ElementProperty|MagpieData mean MendeleevNumber
-0.1005     ElementProperty|MagpieData avg dev CovalentRadius
-0.0995     DensityFeatures|packing fraction
-0.0990     ElementProperty|MagpieData range Electronegativity
-0.0959     ElectronegativityDiff|mean EN difference
-0.0957     ElementProperty|MagpieData avg dev MendeleevNumber
-0.0955     ElementProperty|MagpieData avg dev SpaceGroupNumber
-0.0950     ElementProperty|MagpieData avg dev Column
—           ElementProperty|MagpieData minimum Electronegativity
—           ElementProperty|MagpieData minimum Column
—           ElementProperty|MagpieData mean NpValence
—           ValenceOrbital|avg p valence electrons
—           ElementProperty|MagpieData minimum MendeleevNumber
-0.0922     ElectronegativityDiff|maximum EN difference
-0.0896     ElementProperty|MagpieData mean Column
-0.0896     —
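For reference, component weights such as those tabulated above can be extracted from a fitted PCA. The sketch below uses a random stand-in for the featurized design matrix (the feature_i names and the matrix shape are hypothetical); pca.components_ plays the role of the weights $w_{j,i}$.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the featurized design matrix; in practice the
# columns would be matminer descriptors such as those listed in Tables C1-C3.
feature_names = [f"feature_{i}" for i in range(50)]
X = rng.normal(size=(1000, 50))

# Standardise, then fit a PCA; pca.components_[j] holds the weights of the
# (j+1)-th principal component as a linear combination of the features.
pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))


def top_contributors(pca, names, component=0, n=20):
    """Return the n features with the largest |weight| on a given component,
    as (name, weight) pairs sorted by decreasing |weight|."""
    w = pca.components_[component]
    order = np.argsort(-np.abs(w))[:n]
    return [(names[i], float(w[i])) for i in order]


for name, weight in top_contributors(pca, feature_names, component=0, n=5):
    print(f"{name}: {weight:+.4f}")
```

Standardising before the fit matters here: descriptors on very different scales would otherwise dominate the leading components regardless of how informative they are.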