Reliable and Explainable Machine Learning Methods for Accelerated Material Discovery
Bhavya Kailkhura, Brian Gallagher, Sookyung Kim, Anna Hiszpanski, T. Yong-Jin Han
Lawrence Livermore National Laboratory
Corresponding authors: B.K. ([email protected]); T.Y.H. ([email protected])
Abstract
Material scientists are increasingly adopting the use of machine learning (ML) for making potentially important decisions, such as discovery, development, optimization, synthesis, and characterization of materials. However, despite ML’s impressive performance in commercial applications, several unique challenges exist when applying ML in materials science applications. In such a context, the contributions of this work are twofold. First, we identify common pitfalls of existing ML techniques when learning from underrepresented/imbalanced material data. Specifically, we show that with imbalanced data, standard methods for assessing quality of ML models break down and lead to misleading conclusions. Furthermore, we find that the model’s own confidence score cannot be trusted and model introspection methods (using simpler models) do not help as they result in loss of predictive performance (reliability-explainability trade-off). Second, to overcome these challenges, we propose a general-purpose explainable and reliable machine-learning framework. Specifically, we propose a novel pipeline that employs an ensemble of simpler models to reliably predict material properties. We also propose a transfer learning technique and show that the performance loss due to models’ simplicity can be overcome by exploiting correlations among different material properties. A new evaluation metric and a trust score to better quantify the confidence in the predictions are also proposed. To improve the interpretability, we add a rationale generator component to our framework which provides both model-level and decision-level explanations. Finally, we demonstrate the versatility of our technique on two applications: predicting properties of crystalline compounds, and identifying novel potentially stable solar cell materials. We also point to some outstanding issues yet to be resolved for a successful application of ML in material science.

March 12, 2019 DRAFT. arXiv [physics.comp-ph]

Fig. 1. Histograms (number of compounds vs. targeted property bin) of targeted properties of the OQMD database show heavily skewed distributions. We show that conventional machine learning approaches: (a) produce inaccurate inferences in sparse regions of the property-space and (b) are overconfident in the accuracy of such predictions. The proposed approach overcomes these shortcomings.

I. INTRODUCTION
A. Motivation
Driven by the success of machine learning (ML) in commercial applications (e.g., product recommendations and advertising), there are significant efforts to exploit these tools to analyze scientific data. One such effort is the emerging discipline of Materials Informatics, which applies ML methods to accelerate the selection, development, and discovery of materials by learning structure-property relationships. Materials Informatics researchers are increasingly adopting ML methods in their workflow to predict materials’ physical, mechanical, optoelectronic, and thermal properties (e.g., crystal structure, melting temperature, formation enthalpy, band gap). While commercial use cases and material science applications may appear similar in their overall goals, we argue that fundamental differences exist in the corresponding data, tasks, and requirements. Applying ML techniques without careful consideration of their assumptions and limitations may lead to missed opportunities at best and a waste of substantial resources and incorrect scientific inferences at worst. In the following, we mention unique challenges that the Materials Informatics community must overcome for universal acceptance of ML solutions in material science.
Learning From Underrepresented and Distributionally Skewed Data:
One of the fundamental assumptions of current ML methods is the availability of densely and uniformly sampled (or balanced) training data. When certain classes are underrepresented in the data, standard ML algorithms provide incorrect inferences across the classes of the data. Unfortunately, in most material science applications, balanced data is exceedingly rare, and virtually all problems of interest involve various forms of extrapolation due to underrepresented data and severe class distribution skews. As an example, materials scientists are often interested in designing (or discovering) compounds with uncommon targeted properties, e.g., high $T_C$ superconductivity or large $ZT$ for improved thermoelectric power, shape memory alloys (SMAs) with the targeted property of very low thermal hysteresis, and band gap energy in the desired range ( . − . eV) for solar cells. In such applications, we encounter highly imbalanced data (with targeted materials being in the minority class) due to these design choices or constraints. Consider a task of predicting material properties (e.g., bandgap energy, formation energy, stability, etc.) from a set of feature vectors (or descriptors) corresponding to crystalline compounds. One representative database for such a data set is the Open Quantum Materials Database (OQMD), which contains several properties of crystalline compounds as calculated using density functional theory (DFT). Note that the OQMD contains data sets with strongly imbalanced distributions of target variables, i.e., material properties. In Figure 1, we plot the histograms of several commonly targeted properties. The data set exhibits severe distribution skews. For example, of the compounds in the OQMD are possibly conductors with band gap value equal to zero. Note that if the sole aim of the ML model is to maximize overall accuracy, the ML algorithm will perform quite well by ignoring or discarding the minority class. However, in practice, correctly classifying and learning from the minority class of interest may be more important than possibly misclassifying the majority classes.

Explainable ML Methods without Compromising the Model Accuracy:
A common misconception is that increasing model complexity can address the challenges of underrepresented and distributionally skewed data. However, this can only superficially solve some of these problems. Furthermore, increasing the complexity of ML models may increase the overall accuracy of the system at the cost of making the model very hard to interpret. Understanding why an ML model made a certain prediction or recommendation is crucial, since it is this understanding that provides the confidence to make a decision and that will lead to new hypotheses and ultimately new scientific insights. Most of the existing approaches define explainability as the inverse of complexity and achieve explainability at the cost of accuracy. This introduces a risk of producing explainable but misleading predictions. With the advent of highly predictive but opaque ML models, it has become more important than ever to understand and explain the predictions of such models and to devise explainable scientific machine learning techniques without sacrificing predictive power.
Better Evaluation and Uncertainty Quantification Techniques for Building Trust in ML:
For a credible use of ML in material science applications, we need the ability to rigorously quantify the ML performance. Traditionally, the quality of an ML model is measured by the accuracy on test data using cross-validation. Considering the scarcity of densely sampled data in most material science problems, high accuracy on the test data can hardly provide confidence in the quality and generality of ML systems. A natural solution is to use a model’s own reported confidence (or uncertainty) score for quantifying trust in the prediction. However, a model’s confidence score alone may not be very reliable. For example, in computer vision, well-crafted perturbations to images can cause classifiers to make mistakes (such as identifying a panda as a gibbon or confusing a cat with a computer) with very high confidence. As we will show later, this problem also persists in the Materials Informatics pipeline (especially with distributional skewness). Nevertheless, knowing when a classifier’s (or regressor’s) prediction can be trusted is useful in several other applications for building assured ML solutions. Therefore, we need to augment current validation techniques with additional components to quantify the generalization performance of scientific ML algorithms and devise reliable uncertainty quantification methods to establish trust in these predictive models.

B. Literature Survey
In the recent past, the materials science community has used ML methods for building predictive models for several applications. Seko et al. considered the problem of building ML models to predict the melting temperatures of binary inorganic compounds. The problem of predicting the formation enthalpy of crystalline compounds using ML models was considered recently. Predictive models for crystal structure formation at a certain composition are also being developed. The problems of band gap energy prediction for certain classes of crystals and mechanical property prediction for metal alloys were also considered in the literature. Ward et al. proposed a general-purpose ML framework to predict diverse properties of crystalline and amorphous materials, such as band gap energy and glass-forming ability. Thus far, the research on applying ML methods to material science applications has predominantly focused on improving the overall accuracy of predictive modeling. However, imbalanced learning, explainability, and reliability of ML methods in material science have not received any significant attention. As mentioned earlier, these aspects pose a real problem in deriving correct and reliable scientific inferences and in the universal acceptance of machine learning solutions in material science, and deserve to be tackled head on.
C. Our Contributions
In this paper, we take some first steps in addressing the challenge of building reliable and explainable ML solutions for Materials Informatics applications. The main contributions of the paper are twofold. First, we identify some shortcomings in the training, testing, and uncertainty quantification steps of existing ML techniques when learning from underrepresented and distributionally skewed data. Our findings raise serious concerns regarding the reliability of existing Materials Informatics pipelines. Second, to overcome these challenges, we propose general-purpose explainable and reliable machine-learning methods for enabling reliable learning from underrepresented and distributionally skewed data. We propose the following solutions: a novel learning architecture to bias the training process toward the goals of imbalanced domains; sampling approaches to manipulate the training data distribution so as to allow the use of standard ML models; and reliable evaluation metrics and uncertainty quantification methods to better capture the application bias. To improve the explainability, as opposed to other existing approaches which train an independent regression model per property, we employ a simple and computationally cheap partitioning scheme. This scheme first partitions the data into sub-classes of materials based on their property values and then trains separate, simpler regression models for each group. Note that our approach differs in its motivation (and operation) from a similar concept utilized by Ward et al. Our motivation behind partitioning is to enhance the “explainability”, as opposed to the previous approach, where a computationally expensive exhaustive search was performed to find artificial groups to enhance the accuracy of predictions. In our case, the explainability-enhancing partitioning scheme in fact hurts our predictive performance (or accuracy).
To compensate for this performance loss, we utilize transfer learning by exploiting correlation among different material properties to improve the regression performance. We show that the proposed transfer learning technique can overcome the performance loss due to simplicity of the models. To further improve the interpretability of the ML system, we add a rationale generator component to our framework. The goal of the rationale generator is twofold: to provide explanations corresponding to an individual prediction, and to provide explanations corresponding to the regression model. For an individual prediction, the rationale generator provides explanations in terms of prototypes (or similar but known compounds). This helps a material scientist use their domain knowledge to verify whether similar known compounds or prototypes satisfy the requirements or constraints imposed. On the other hand, for regression models, the rationale generator provides global explanations regarding the whole material sub-classes. This is achieved by providing feature importance for every material sub-class. Finally, we propose a new evaluation metric and a trust score to better quantify confidence and establish trust in the ML predictions. We demonstrate the applicability of our technique by using it for two applications: predicting five physically distinct properties of crystalline compounds, and identifying potentially stable solar cells.

Fig. 2. An illustration of the proposed ML pipeline for material property prediction. Diagram block labels: Dataset 1 ({Compounds, Property 1}); Dataset 2 ({Compounds, Property 2}); Clustering based on target property; Sub-sampling of representative compounds; Multi-class Classification; Regression; Correlation based fusion; Rationale generator (1. Model-Level Explanations, 2. Decision-Level Explanations); Predictions with Uncertainty Quantification.

II. RESULTS AND DISCUSSIONS
First, we discuss the proposed ML method with a focus on reliability and explainability using data from the Open Quantum Materials Database (OQMD). Next, we demonstrate the application of our approach to two material science problems.
A. General-Purpose Reliable and Explainable ML Framework
To solve the problem of reliable learning and inference from distributionally skewed data, we propose a general-purpose ML framework (see Fig. 2). Instead of developing yet another ML algorithm to improve accuracy for a specific application, our objective is to develop generic methods to improve reliability, explainability, and accuracy in the presence of imbalanced data. The proposed framework is agnostic to the type of training data, can utilize a variety of already-developed ML algorithms, and can be reused for a broad variety of material science problems. The framework is composed of three main components: a novel training procedure for learning from imbalanced data, a rationale generator for model-level and decision-level explainability, and reliable testing and uncertainty quantification techniques to evaluate the prediction performance of ML pipelines.
1) Training Procedure:
Building an ML model for materials property prediction can be posed as a regression problem where the goal is to predict continuous-valued properties from a set of material attributes/features. The challenge in our target task is that, due to distributional skewness, ML models do not generalize well, specifically in domains that are not well represented in the available labeled data (the minority classes). To solve this problem, we propose a generic ML training process that is applicable to a broad range of materials science applications which suffer from distributionally skewed data. We explain the proposed training process with the help of the following running example: a material scientist is interested in learning an ML model targeting a specific class of material properties, e.g., stable wide bandgap materials in a certain targeted range. In most cases, we have domain knowledge about the range of property values for specific classes of materials, e.g., conductors have bandgap energies equal to zero, typical semiconductors have bandgap energies in the range of . to . eV, whereas wide bandgap materials have bandgap energies greater than eV. These requirements introduce a partition of the property space into multiple material classes [a]. Given $N$ training data samples $\{X_i, (Y_i^1, \cdots, Y_i^M)\}_{i=1}^{N}$, where $X_i$ is the feature/attribute vector and $Y_i^j$ is the $j$th property value corresponding to compound $i$, the steps in the proposed training procedure are as follows:

1) Partition the property space into $K$ regions/classes and obtain transformed training data samples $\{X_i, (Z_i^1, \cdots, Z_i^M)\}_{i=1}^{N}$, where $Z_i^j \in \{1, \cdots, K\}$.
2) For each property $j \in \{1, \cdots, M\}$, perform sub-sampling [b] on the compounds in the $K$ distinct classes, and obtain an evenly distributed training set $\{X_i, Z_i^j\}_{i=1}^{N_j}$.
3) Train $M$ multi-class classifiers (one per property) on the balanced datasets $\{X_i, Z_i^j\}_{i=1}^{N_j}$ to predict which class a compound belongs to.
4) For every $(j, k)$ pair, train a regressor on $\{X_i, Y_i^j\}_{i=1}^{N_j}$ to predict property values $\hat{Y}_i^j$.
5) Finally, utilize correlation among properties to improve the model accuracy by employing transfer learning (explained next).

At test time, to predict the $j$th property of a test compound, the ML algorithm first identifies the class the test compound belongs to by using the trained $j$th multi-class classifier. Next, depending on the predicted class $k$ for property $j$, the $(j, k)$th regressor is used, along with the transfer learning step, to predict the property value of the test compound. Next, we provide details and justifications for each of these steps in our ML pipeline.

Steps 1 to 3 transform a regression problem into a multi-class classification problem on sub-sampled training data. This change has the goal of balancing the distribution of the least represented (but more important) material classes with the more frequent observations [c]. Furthermore, instead of having a single model trained on the entire training set, having smaller and simpler models for different classes of materials helps to gain a better understanding of sub-domains using the rationale generator (explained later).

Next, we explain the proposed transfer learning technique, which exploits correlations present among different material properties to improve the regression performance. We devise a simple knowledge transfer scheme to utilize the marginal estimates/predictions from step 4, where regressors were trained independently for different properties. Note that, for each compound $i$, we get an independent estimate $\hat{\mathbf{Y}}_i \approx \{\hat{Y}_i^1, \cdots, \hat{Y}_i^M\}$ from step 4. In step 5, we augment the original attribute vector $X_i$ with the independent estimates $\hat{\mathbf{Y}}_i$, use it as a modified attribute vector, and train regressors for each $(j, k)$ pair. We found that this simple knowledge transfer scheme significantly improves the regression performance.

[a] This partition can also be introduced artificially by imposing constraints on the gradient of the property values so that compounds with similar property values are in the same class.
[b] Other sophisticated sampling techniques or generative modeling approaches can also be used.
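The data-handling parts of the procedure above (steps 1, 2, and 5) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions: the function names, the toy bandgap-like values, and the thresholds are hypothetical; classes are indexed from 0 rather than 1; and the sub-sampling simply under-samples every class to the size of the smallest one, whereas the paper selects sub-sampling ratios by cross-validation. The classifiers and regressors of steps 3 and 4 are left out.

```python
import bisect
import random
from collections import defaultdict


def assign_classes(values, thresholds):
    """Step 1: map each property value to a class index in {0, ..., K-1}.

    `thresholds` holds the K-1 sorted decision boundaries; a value equal to
    a boundary falls into the higher class.
    """
    return [bisect.bisect_right(thresholds, v) for v in values]


def balanced_subsample(labels, seed=0):
    """Step 2 (assumed variant): under-sample every class to the smallest class size.

    Returns the sorted indices of the retained samples.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_keep = min(len(members) for members in by_class.values())
    rng = random.Random(seed)
    kept = []
    for members in by_class.values():
        kept.extend(rng.sample(members, n_keep))
    return sorted(kept)


def augment_features(features, marginal_predictions):
    """Step 5: append the M marginal estimates to each attribute vector."""
    return [list(x) + list(y_hat) for x, y_hat in zip(features, marginal_predictions)]


# Toy data: bandgap-like values dominated by zeros (the majority class).
bandgap = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 1.2, 3.5, 4.1]
thresholds = [0.5, 3.0]  # hypothetical boundaries giving K = 3 classes
labels = assign_classes(bandgap, thresholds)
kept = balanced_subsample(labels)
augmented = augment_features([[1.0, 2.0]], [[0.3, -0.1]])

print(labels)                     # class index per compound
print([labels[i] for i in kept])  # two retained samples per class
print(augmented)                  # original attributes followed by marginal estimates
```

In a full pipeline, the retained indices would select the rows used to train the per-property classifier, and the augmented vectors would be the inputs to the second round of per-class regressors.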
2) Rationale Generator:
The goal of the rationale generator is to provide (a) decision-level explanations and (b) model-level explanations. Decision-level explanations provide reasoning such as: what made an ML algorithm make a specific decision/prediction? On the other hand, model-level explanations are focused on providing understanding at the class level, e.g., which chemical attributes help in discriminating among insulators, semiconductors, and conductors?

[c] Note that the proposed framework is general enough to utilize other sophisticated imbalanced learning strategies (such as ensemble learning, data pre-processing, and cost-based learning) to further improve the performance.
Decision Level Explanations:
The proposed ML pipeline explains its predictions for previously unseen compounds by providing similar known examples (or prototypes). Explanation by examples is motivated by studies of human reasoning, which have shown that the use of examples (analogy) is fundamental to the development of effective strategies for better decision-making. Example-based explanations are widely used in the effort to improve user explainability of complex ML models. In our context, for every unseen test example, in addition to the predicted property values, we provide similar experimentally known compounds with their corresponding similarity to the test compound in the feature space. Our feature space is heterogeneous (both continuous and categorical features), so Euclidean distance is not reliable. We therefore propose to quantify similarity using Gower’s metric. Gower’s metric can be used to measure similarity between data containing a combination of logical, numerical, categorical, or text entries. The distance is always a number between 0 (identical) and 1 (maximally dissimilar). Furthermore, as a consequence of breaking a large regression problem into a multi-class classification followed by a simpler regression problem, we can also provide a logical sequence of decisions taken to reach a prediction.

Model Level Explanations:
Knowing which chemical attributes are important in a model’s prediction (feature importance) and how they are combined can be very powerful in helping material scientists understand and trust automatic ML systems. Due to the structure of our pipeline (classification followed by regression), we can provide more fine-grained feature importance explanations compared to having a single regression model. Specifically, we break the feature importance of attributes for predicting a material property into: feature importance for discriminating among different material classes (inter-class), and feature importance for regression on a material sub-domain (intra-class). This provides a more in-depth explanation of the property prediction process.
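For the decision-level explanations described earlier in this subsection, prototype retrieval with a Gower-type distance can be sketched as follows. This is a minimal standard-library sketch, not the authors' implementation: the function names, the compound labels, and the three toy features are hypothetical, numeric features are compared by range-normalized absolute difference, and categorical features by simple mismatch.

```python
def gower_distance(a, b, ranges):
    """Gower distance between two mixed-type feature vectors.

    `ranges[k]` is the observed range (max - min) of numeric feature k, or
    None when feature k is categorical. The result lies in [0, 1] as long as
    numeric differences do not exceed the stated ranges.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0  # categorical: simple mismatch
        elif r > 0:
            total += abs(x - y) / r          # numeric: range-normalized difference
        # r == 0 means a constant feature, which contributes nothing
    return total / len(ranges)


def nearest_prototypes(query, known, ranges, n=2):
    """Return the n known compounds closest to `query` as (distance, name) pairs."""
    scored = [(gower_distance(query, feats, ranges), name) for name, feats in known.items()]
    return sorted(scored)[:n]


# Hypothetical known compounds: (numeric feature, numeric feature, categorical feature).
known = {
    "compound_A": (10.0, 0.5, "cubic"),
    "compound_B": (20.0, 1.5, "cubic"),
    "compound_C": (30.0, 2.5, "hexagonal"),
}
ranges = (20.0, 2.0, None)  # ranges of the two numeric features; third is categorical
query = (12.0, 0.5, "cubic")

prototypes = nearest_prototypes(query, known, ranges)
for distance, name in prototypes:
    print(f"{name}: Gower distance {distance:.3f}")
```

The returned prototypes, together with their distances, are what a domain expert would inspect to judge whether a prediction for the query compound is plausible.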
3) Robust Model Performance Evaluation and Uncertainty Quantification:
The distributionally skewed training data biases the learning system towards solutions that may not be in accordance with the user’s end goal. Most existing learning systems work by searching the space of possible models with the goal of optimizing some criterion (or numerical score). These metrics are usually related to some form of average performance over the whole train/test data and can be misleading in cases where the sampled train/test data is not representative of the true distribution. More specifically, commonly used evaluation metrics (such as mean squared error, R-squared, etc.) assume an unbiased (or uniform) sampling of the test data and break down in the presence of distributionally skewed test data (shown later). Therefore, we propose to perform class-specific evaluations (by partitioning the property space into multiple classes of interest), which better characterize the predictive performance of ML models in the presence of distributionally skewed data. We also recommend visualizing predicted and actual property values in combination with the numeric scores to build a better intuition about the predictive performance.

Note that having a robust evaluation metric only partially solves the problem, as ML models are susceptible to over-confident extrapolations. As we will show later, in imbalanced learning scenarios, ML models make overconfident extrapolations which have a higher probability of being wrong (e.g., predicting conductor to be an insulator with confidence). In other words, a model’s own confidence score cannot be trusted. To overcome this problem, we use a set of labeled experimentally known compounds as side information to help determine a model’s trustworthiness for a particular unseen test example. The trust score is defined as follows:

$$T(X_i) = 1 - \frac{d\left(X_i, \{X_j\}_{j \in c_i}\right)}{d\left(X_i, \{X_j\}_{j \in c_i}\right) + d\left(X_i, \{X_j\}_{j \notin c_i}\right)}. \qquad (1)$$

The trust score $T$ takes into account the average Gower distance $d$ from the test sample $X_i$ to other samples in the same class $c_i$ vs. the average Gower distance to nearby samples in other classes. $T$ ranges from 0 to 1, where a higher $T$ value indicates a more trustworthy prediction.

B. Example Applications
In this section, we discuss two distinct applications for our reliable and explainable ML pipeline to demonstrate its versatility: predicting five physically distinct properties of crystalline compounds and identifying potentially stable solar cells. In both cases, we use the same general framework, i.e., the same attributes and ML pipeline. Through these examples, we discuss all aspects of creating reliable and explainable ML models: building a reliable machine learning model from distributionally skewed training data, generating explanations to gain a better understanding of the data/model, evaluating model accuracy, and employing the model to predict new materials.
1) Predicting Properties of Crystalline Compounds:
Density functional theory (DFT) provides a means of predicting properties of chemical compounds, based on quantum mechanical modeling. However, the utility of DFT is limited by its computational complexity. An alternative approach is to use machine learning (ML) to train a surrogate model on a representative set of (input, output) pairs from prior DFT calculations. The surrogate then emulates DFT, producing approximate answers at dramatically lower computational cost (several orders of magnitude faster), enabling rapid screening of candidate materials. A potential drawback of this approach is that it requires many (potentially hundreds of thousands) DFT calculations in order to generate a suitable training set. Fortunately, several such training sets already exist.
Data Set:
We follow the lead of Ward et al. and use OQMD for training purposes. OQMD contains the results of DFT calculations on approximately , diverse compounds. Of these, we select only the lowest-energy compound for each composition. This yields a training set containing , unique examples. We use the same set of attributes/features to represent each compound as Ward et al. Using these features, we consider the problem of developing reliable and explainable ML models to predict five physically distinct properties currently available through the OQMD: bandgap energy (eV), volume/atom (Å³/atom), energy/atom (eV/atom), thermodynamic stability (eV/atom), and formation energy (eV/atom). Units for these properties are omitted in the rest of the paper for ease of notation. A description of the attributes (inputs) and properties (outputs) is provided in the Supplementary Materials.

Method:
We quantify the predictive performance of our approach using -fold cross-validation. Following the procedure mentioned in Sec. II-A1, we partition the property space for each property into $K = 3$ classes. The decision boundary thresholds for class separation (with class distributions) are as follows: bandgap energy ( . , . ) with ( , , ), volume/atom ( . , . ) with ( , , ), energy/atom ( − . , − . ) with ( , , ), stability ( . , . ) with ( , , ) and formation energy ( . , . ) with ( , , ) [d]. Sub-sampling ratios for sample compounds (for obtaining an evenly distributed training set) were determined using cross-validation. We train Extreme Gradient Boosting (XGB) classifiers to do multi-class ($K = 3$) classification using the softmax objective for each property. Next, we train Gradient Boosting Regressors (GBRs) for each property-class pair independently (and refer to them as marginal regressors). Using these marginal regressors, we create augmented feature vectors for correlation-based predictions. Finally, we train another set of GBR regressors for each property-class pair on the augmented data (and refer to them as joint regressors, as they exploit correlation present among properties to improve the prediction performance).

[d] We also tried different combinations of thresholds, and trends in the obtained results were found to be consistent. In practice, these thresholds can be provided by domain experts depending on a specific application (as done in Sec. II-B2).

TABLE I
Results for conventional technique with overall prediction scores. Cross-validation gives an impression that conventional regressors have excellent regression performance (i.e., low MAE/MSE and high $R^2$ score). However, later we show that these metrics provide misleading inferences due to the presence of distributionally skewed data.

Metrics | Energy/atom | Volume/atom | Bandgap Energy | Formation Energy | Stability
MAE | . | . | . | . | .
MSE | . | . | . | . | .
$R^2$ | . | . | . | . | .

Results:
For the conventional scheme, we train $M$ independent GBR regressors to directly predict properties from the features corresponding to the compounds. In Table I, we report different error metrics to quantify the regression performance using cross-validation. Note that these metrics report an accumulated/average error score on the test set (which comprises compounds from all partitions of properties). These results are comparable to the state of the art and suggest that conventional regressors have excellent regression performance (low MAE/MSE and high $R^2$ score). Relying on the inference made by this evaluation method, we may be tempted to use these regression models in practice for different applications (such as screening or discovery of novel solar cells). However, next we show that these metrics provide misleading inferences in the presence of distributionally skewed data. In Table II(a), we perform class-specific evaluations (i.e., we partition the property space for each property into $K = 3$ classes and use the test data belonging to each class separately). Surprisingly, Table II(a) shows that conventional regressors perform well only on a specific class (or range of property values), namely the majority classes (i.e., the property value ranges into which the majority of compounds fall). The conventional regression method performs particularly poorly on minority classes for bandgap energy and stability prediction, where the data distribution is highly skewed (see Fig. 1). Unfortunately, the test data is also distributionally skewed and is not representative of the true data distribution. Thus, traditional methods for assessing and ensuring generalizability of ML models provide misleading conclusions (as shown in Table I). On the other hand, class-specific evaluations better characterize the predictive performance of ML models in the presence of distributionally skewed data.
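The gap between an overall score and class-specific scores can be illustrated with a small standard-library sketch. The numbers below are made up for illustration and are not taken from the OQMD experiments; the thresholds and function names are hypothetical. A degenerate regressor that always predicts zero looks acceptable on a skewed test set when judged by overall MAE, while the per-class MAE exposes its failure on the minority classes.

```python
import bisect
from collections import defaultdict


def mean_absolute_error(y_true, y_pred):
    """Plain MAE over paired true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def class_specific_mae(y_true, y_pred, thresholds):
    """MAE computed separately for each class of the true property value."""
    groups = defaultdict(lambda: ([], []))
    for t, p in zip(y_true, y_pred):
        k = bisect.bisect_right(thresholds, t)  # class index of the true value
        groups[k][0].append(t)
        groups[k][1].append(p)
    return {k: mean_absolute_error(ts, ps) for k, (ts, ps) in sorted(groups.items())}


# Toy skewed test set: 18 majority-class samples at 0.0 and two minority samples.
y_true = [0.0] * 18 + [2.0, 4.0]
y_pred = [0.0] * 20  # a regressor that ignores the minority classes entirely

overall = mean_absolute_error(y_true, y_pred)
per_class = class_specific_mae(y_true, y_pred, thresholds=[0.5, 3.0])

print(overall)    # 0.3
print(per_class)  # {0: 0.0, 1: 2.0, 2: 4.0}
```

The same grouping can be applied to MSE or $R^2$; the point of the sketch is only that averaging over a skewed test set hides the per-class errors.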
TABLE II
Class-specific prediction score comparison. Class-specific cross-validation provides reliable inferences and shows the superiority of the proposed scheme over the conventional scheme.

(a) Conventional technique. Class-specific cross-validation shows that the conventional technique performs poorly on minority classes. This important observation cannot be made from Table I.

Metrics        Energy/atom         Volume/atom         Bandgap Energy        Formation Energy    Stability
Distribution   ( , , )             ( , , )             ( , , )               ( , , )             ( , , )
MAE            (0 . , . , . 11)    (0 . , . , . 3)     (0 . , . , . 77)      (0 . , . , . 14)    (0 . , . , . )
MSE            (0 . , . , . 03)    (0 . , . , . 3)     (0 . , . , . 29)      (0 . , . , . 07)    (0 . , . , . )
R²             (0 . , . , . 95)    (0 . , . , . 89)    (1 . , . , − . 13)    (0 . , . , . 47)    ( − . , . , − . )

(b) Proposed technique without transfer learning. Simplicity (or explainability) due to smaller and simpler models results in performance loss. This is not surprising, as there is a trade-off between simplicity/explainability and prediction performance.

Metrics        Energy/atom         Volume/atom         Bandgap Energy        Formation Energy    Stability
MAE            (0 . , . , . 18)    (0 . , . , . 3)     (0 . , . , . 81)      (0 . , . , . 20)    (0 . , . , . )
MSE            (0 . , . , . 07)    (0 . , . , . 8)     (0 . , . , . 64)      (0 . , . , . 10)    (0 . , . , . )
R²             (0 . , . , . 74)    (0 . , . , . 81)    (1 . , . , − . 43)    (0 . , . , . 22)    ( − . , . , − . )

(c) Proposed technique with transfer learning. The transfer learning step in our pipeline compensates for the performance loss due to the simplicity of the models and in fact outperforms the conventional technique (especially on minority classes). We suspect that this gain may also be due to the fact that simpler models perform better in the low-data regime (e.g., minority classes), as opposed to complex models, which may over-fit (and require a large amount of data to perform well).

Metrics        Energy/atom         Volume/atom         Bandgap Energy        Formation Energy    Stability
MAE            (0 . , . , . 07)    (0 . , . , . 2)     (0 . , . , . 70)      (0 . , . , . 2)     (0 . , . , . )
MSE            (0 . , . , . 02)    (0 . , . , . 7)     (0 . , . , . 42)      (0 . , . , . 07)    (0 . , . , . )
R²             (0 . , . , . 93)    (0 . , . , . 88)    (1 . , . , − . 24)    (0 . , . , . 46)    ( − . , . , − . )

In Table II(b), we show the effect of transforming a single complex regression model into an ensemble of smaller and simpler models to gain a better understanding of sub-domains (Step - in Sec. II-A1). We notice that the performance of these transformed simpler models is worse than that of a single complex model (as given in Table II(a)). This suggests that there is a trade-off between simplicity/explainability and accuracy.

Finally, Table II(c) shows how this performance loss due to simplicity of models can be overcome using the transfer learning (or correlation-based fusion) step in our pipeline. We observe that the proposed transfer learning technique exploits correlations in the property space well, which results in a significant performance gain compared to the conventional regression approach (see footnote e). Note that this gain is achieved in spite of having simpler and smaller models in our ML pipeline. This suggests that a user can achieve high accuracy without sacrificing explainability. We also observed that the sub-sampling step in our pipeline had a positive impact on the regression performance of minority classes.

Fig. 3. Uncertainty quantification of the regressor (ground truth is in blue, predictions are in red, and the gray shaded area represents uncertainty). (a) Bandgap energy, and (b) Stability. In several cases, regressors perform poorly in regions with high uncertainty.

Furthermore, our pipeline also quantifies uncertainties in its predictions, providing a confidence score to the user. We show an illustration of the uncertainty quantification of bandgap energy and stability predictions on test samples in Figure 3.
It can be seen that regressors perform poorly in regions with high uncertainty.

We would also like to point out that in cases where the data from a specific class is heavily under-represented, none of the model design strategies will improve the performance, and generating new data may be the only possible solution (e.g., bandgap energy prediction for minority classes). In such cases, relying solely on a cross-validation score or confidence score may not provide reliable inference (shown later). To overcome this challenge, explainable machine learning can be a potentially viable solution.

Footnote e: Surprisingly, we did not observe any gain when using transfer learning with the conventional technique. In fact, we observed that the models showed severe over-fitting behavior to the predicted properties.

Fig. 4. Feature importance for 3-class classification of bandgap energy. The rationale generator favors attributes related to melting temperature, electronegativity, and volume per atom for explaining bandgap-energy predictions. These attributes are all known to be highly correlated with the bandgap energy level of crystalline compounds.

Next, we show the output of the rationale generator in our pipeline. Specifically, we provide model-level explanations as well as decision-level explanations for each sub-class of materials. For model-level explanations, our pipeline provides feature importance for both classification and regression steps. Feature importance provides a score that indicates how useful (or valuable) each feature was in the construction of the model. The more an attribute is used to make key decisions with the (classification/regression) model, the higher its relative importance. This importance is calculated explicitly for each attribute in the data set, allowing attributes to be ranked and compared to each other. In Fig. 4, we show the feature importance for our 3-class classifier for bandgap energy. It shows the attributes that help in discriminating among the three classes of compounds (insulators, semiconductors, and conductors) based on their bandgap energy values. Note that the rationale generator picked attributes related to the melting temperature, electronegativity and volume per atom of constituent elements as the most important features in determining the bandgap energy level of the compounds. This is reasonable, as all these attributes are known to be highly correlated with the bandgap energy level of crystalline compounds. For example, the melting temperature of constituent elements is positively correlated with inter-atomic forces (and in turn inter-atomic distances). Increased inter-atomic spacing decreases the potential seen by the electrons in the material, which in turn reduces the bandgap energy. Therefore, the band structure changes as a function of inter-atomic forces, which are correlated with melting temperature. Similarly, in a multi-element material system, as the electronegativity difference between different atoms increases, so does the energy difference between bonding and anti-bonding orbitals. Therefore, the bandgap energy increases as the electronegativity difference between constituent elements increases. Thus, the bandgap energy has a strong correlation with the electronegativity of constituent elements.
Finally, the mean volume per atom of constituent elements is also correlated with the inter-atomic distance in a material system. As explained above, inter-atomic distance is negatively correlated with the bandgap energy, and so is the mean volume per atom of constituent elements. Similar feature importance results for class-specific predictors can also be obtained (see Supplementary Material).

TABLE III
Bandgap energy prediction and uncertainty quantification. The model’s own confidence score alone may not be very reliable, as it makes wrong and over-confident predictions on minority classes (i.e., classes 1 and 2). On the other hand, a higher (or lower) trust score consistently implies a higher (or lower) probability that the classifier (or regressor) is correct.

Test Compound   Ground Truth (Class, Bandgap)   Prediction (Class, Bandgap)   Confidence Score   Trust Score
Ge Na O         (2 , . 63)                      (0 , . 0)                     0.999              0 .
F Na Nb         (1 , . 1)                       (0 , . 0)                     0.998              0 .
C Mg            (2 , . 67)                      (0 , . 0)                     0.998              0 .
Rh              (0 , . 0)                       (0 , . 0)                     0.999              0 .

In Table III, we show test compounds with ground truths (class, bandgap energy value), predictions (class, bandgap energy value), and corresponding confidence scores. It can be seen that both the classifier and the regressor make wrong and over-confident predictions on minority classes (i.e., classes 1 and 2). In other words, a higher confidence score from the model for a minority class does not necessarily imply a higher probability that the classifier (or regressor) is correct. For compounds in minority classes, the ML model may simply not be the best judge of its own trustworthiness. On the other hand, the proposed trust score (as given in (1)) consistently outperforms the classifier’s/regressor’s own confidence score: a higher/lower trust score implies a higher/lower probability that the classifier (or regressor) is correct. Furthermore, as our trust score is computed using distances from experimentally known compounds from the Inorganic Crystal Structure Database (ICSD), it also provides some confidence in a compound’s amenability to be synthesized.
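To make the idea of a distance-based trust score concrete, the following sketch computes a generic nearest-neighbor trust ratio. It is not the definition given in (1), which is not reproduced here. It assumes (hypothetically) that each class has a set of reference feature vectors, e.g., known compounds, and that Euclidean distance is an adequate similarity measure. The function names and toy vectors are illustrative only.

```python
import math


def nearest_distance(x, points):
    """Euclidean distance from x to the closest point in `points`."""
    return min(math.dist(x, p) for p in points)


def trust_score(x, predicted_class, reference, eps=1e-12):
    """Distance-based trust in [0, 1] (illustrative, not the paper's Eq. (1)).

    `reference` maps each class label to a list of feature vectors of known
    compounds. The score is close to 1 when x lies near known compounds of
    the predicted class and far from those of every other class, and close
    to 0 in the opposite situation.
    """
    d_pred = nearest_distance(x, reference[predicted_class])
    d_other = min(
        nearest_distance(x, pts)
        for label, pts in reference.items()
        if label != predicted_class
    )
    return d_other / (d_pred + d_other + eps)


# Toy reference set with two classes (hypothetical 2-d features).
reference = {
    0: [(0.0, 0.0), (0.1, 0.0)],
    2: [(1.0, 1.0)],
}
x = (0.9, 1.0)  # lies next to the class-2 reference compound

print("trust if predicted class 0:", round(trust_score(x, 0, reference), 3))
print("trust if predicted class 2:", round(trust_score(x, 2, reference), 3))
```

In this toy case a prediction of class 0 receives a low trust value regardless of how confident the classifier itself is, because the compound sits far from every known class-0 example.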
2) Novel Stable Solar Cell Prediction:
To show how our ML pipeline can be used for discovering new materials, we simulate a search for stable compounds with bandgap energy within a desired range. To evaluate the ability of our approach to locate compounds that are stable and have bandgap energies within the target range, we set up an experiment where a model was fit on the training data set and was then tasked with selecting which compounds in the test data were most likely to be stable and have a bandgap energy in the desired range for solar cells: . − . eV.

TABLE IV
Compositions of materials predicted using the proposed ML pipeline to be stable candidates for solar cell applications, with experimentally known prototypes and their distances from predicted candidates.

Compounds     Bandgap Energy   Stability   (Prototype , Distance)   (Prototype , Distance)   Trust Score
Cr Se Cs      .                − . 10      ( Cs Mn Se , . 03)       ( Rb Mn Se , . 06)       0 .
S Sb Cs       .                − . 09      ( Cs Ge Te , . 06)       ( Cs Si As , . 07)       0 .
V Se Cs       .                − . 21      ( Cs Zr Se , . 02)       ( Cs Nb Ag Se , . 04)    0 .
C O Th        .                − . 16      ( Th S O , . 07)         ( Th S N , . 07)         0 .
Se Pm . Pt    .                − . 11      ( Sm Cu Se , . 06)       ( Ho Ag Se , . 07)       0 .
O Na Ag       .                − . 007     ( Na Cd O , . 02)        ( Na Ag O , . 03)        0 .

Data Set: As before, for the training data, we selected a subset of , compounds from OQMD that represents the lowest-energy compounds at each unique composition. We use the same attributes as before. Using these attributes/features, we consider the problem of developing reliable and explainable ML models to predict two physically distinct properties of stable solar cells: bandgap energy and stability. Note that this experiment is more challenging and practical than that of Ward et al., where the training data set was restricted to compounds reported in the ICSD to have been made experimentally (a total of , entries), so that only bandgap energy, and not stability, needed to be considered. We choose the test data set from Meredig et al. to be as-yet-undiscovered ternary compounds ( , entries). These compounds are not yet in the OQMD.

Method: Following the procedure mentioned in Sec. II-A1, we partition the property space for each property into K = 3 classes. The decision boundary thresholds for class separation are as follows: bandgap energy ( . , . ), and stability ( . , . ). Similar to Sec. II-B1, we use Extreme Gradient Boosting (XGB) classifiers (with default parameters) for multiclass (K = 3) classification and Gradient Boosting Regressors (GBRs) for marginal and joint regression. We use the models’ own confidence and the trust score to rank the potentially stable solar cells.

Results: We used the proposed ML pipeline to search for new stable compounds (i.e., those not yet in the OQMD). Specifically, we use trained models to predict the bandgap energy and stability of compositions that were suggested by Meredig et al. to be as-yet-undiscovered ternary compounds. We found that out of these compounds, compounds are likely to be stable and have favorable bandgap energies to be solar cells. A subset, with trust scores, is shown in Table IV. Similar experimentally known prototypes (as shown in Table IV) can also serve as an initial guess for the 3-d crystal structure of the predicted compounds. These recommendations appear reasonable, as four of the six suggested compounds (Cs CrSe , Cs Sb S, Cs VSe , Na AgO ) can be classified as I-III-VI semiconductors, which are semiconductors that contain an alkali metal, a transition metal, and a chalcogen; I-III-VI semiconductors are a known promising class of photovoltaic materials, as many have direct bandgap energies of ∼ . eV, making them well matched to the solar spectrum. The best-known I-III-VI photovoltaic is copper-indium-gallium-selenide (CIGS), which has solar cell power conversion efficiencies on par with those of silicon. The other two identified compounds, Th CO and Pm . PtSe , are unique in that they contain actinide and lanthanide elements. However, from a practical perspective, the scarcity and radioactivity of these elements may make it challenging to explore them experimentally. A detailed list of potentially stable solar cell compounds is provided in the Supplementary Material.

III. SOME OPEN ISSUES
There are still some issues yet to be resolved for a successful application of ML in materials science. First, in cases where the data from a specific class is heavily under-represented, none of the model design strategies will improve the performance, and generating new data may be the only possible solution. Solving this problem will require answering the following question: how many training samples are sufficient to learn a reliable model, and where should one sample if they are inadequate? Second, predictive models built on chemical attributes make recommendations (e.g., potential solar cells) in the form of chemical attributes. However, verifying these recommendations using DFT (or experiments) has its own challenges (e.g., identifying an appropriate crystal structure (or synthesis recipe)). A potentially viable solution is to bias the recommendation process towards compounds with favourable synthesis conditions. Finally, explainable ML methods based on feature importance still require a material scientist to make sense of model/decision explanations using domain knowledge, which may suffer from human bias. Solving these problems will require significant advances in current explainable ML techniques. Interactive ML and causal inference techniques can further help in resolving some of these issues.
IV. CONCLUSIONS
This paper considered the problem of learning reliable and explainable machine learning models from underrepresented and distributionally skewed materials science data. We identified common pitfalls of existing ML techniques when learning from imbalanced data. We showed how applying ML techniques without careful consideration of their assumptions and limitations can lead to both quantitatively and qualitatively incorrect predictive models. To overcome the limitations of existing ML techniques, we proposed a general-purpose explainable and reliable ML framework for learning from imbalanced material data. We also proposed a new evaluation metric and a trust score to better quantify confidence in the predictions. The rationale generator component in our pipeline provides useful model-level and decision-level explanations to establish trust in the ML model and its predictions. Finally, we demonstrated the applicability of our technique by predicting five physically distinct properties of crystalline compounds and identifying potentially stable solar cells.

V. MATERIALS AND METHODS
All machine learning models were created using the Scikit-learn and XGBoost machine learning libraries. The Materials Agnostic Platform for Informatics and Exploration (Magpie) was used to compute the attributes. Scikit-learn, XGBoost and Magpie are available under open-source licenses. The software, training data sets and input files used in this work are provided in the Supplementary Information associated with this manuscript.

VI. ACKNOWLEDGEMENTS
The authors would like to thank Dr. Joel Varley and Dr. Mike Surh for valuable feedback, suggestions and discussions in the preparation of this manuscript.

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 and was supported by the LLNL-LDRD Program under Project Nos. 16-ERD-019 and 19-SI-001. (LLNL-JRNL-764864)

VII. CONTRIBUTIONS
B.K. and T.Y.H. conceived the project, B.K. performed the experiments, and B.K. and B.G. analyzed the results. All authors discussed the results and contributed to the writing of the manuscript.

VIII. COMPETING INTERESTS
The authors declare no conflict of interest.

IX. DATA AVAILABILITY
All data generated or analysed during this study are included in this published article (and its supplementary information files).

REFERENCES
[1] N. Wagner and J. M. Rondinelli, “Theory-guided machine learning in materials science,” Frontiers in Materials, vol. 3, p. 28, 2016.
[2] D. Xue, P. V. Balachandran, J. Hogden, J. Theiler, D. Xue, and T. Lookman, “Accelerated search for materials with targeted properties by adaptive design,” Nature Communications, vol. 7, p. 11241, 2016.
[3] L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, “A general-purpose machine learning framework for predicting properties of inorganic materials,” npj Computational Materials, vol. 2, p. 16028, 2016.
[4] S. Kirklin, J. E. Saal, B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl, and C. Wolverton, “The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies,” npj Computational Materials, vol. 1, p. 15010, 2015.
[5] T. A. Hogan and B. Kailkhura, “Universal hard-label black-box perturbations: Breaking security-through-obscurity defenses,” arXiv preprint arXiv:1811.03733, 2018.
[6] S. Srinivasan and K. Rajan, “Property phase diagrams for compound semiconductors through data mining,” Materials, vol. 6, no. 1, pp. 279–290, 2013.
[7] L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, “Big data of materials science: Critical role of the descriptor,” Physical Review Letters, vol. 114, no. 10, p. 105503, 2015.
[8] B. Meredig, A. Agrawal, S. Kirklin, J. E. Saal, J. Doak, A. Thompson, K. Zhang, A. Choudhary, and C. Wolverton, “Combinatorial screening for new materials in unconstrained composition space with machine learning,” Physical Review B, vol. 89, no. 9, p. 094104, 2014.
[9] C. S. Kong, W. Luo, S. Arapan, P. Villars, S. Iwata, R. Ahuja, and K. Rajan, “Information-theoretic approach for the discovery of design rules for crystal chemistry,” Journal of Chemical Information and Modeling, vol. 52, no. 7, pp. 1812–1820, 2012.
[10] F. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, “Crystal structure representations for machine learning models of formation energies,” International Journal of Quantum Chemistry, vol. 115, no. 16, pp. 1094–1101, 2015.
[11] K. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. Müller, and E. Gross, “How to represent crystal structures for machine learning: Towards fast prediction of electronic properties,” Physical Review B, vol. 89, no. 20, p. 205118, 2014.
[12] G. Pilania, C. Wang, X. Jiang, S. Rajasekaran, and R. Ramprasad, “Accelerating materials property predictions using machine learning,” Scientific Reports, vol. 3, 2013.
[13] A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, “Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons,” Physical Review Letters, vol. 104, no. 13, p. 136403, 2010.
[14] A. Seko, T. Maekawa, K. Tsuda, and I. Tanaka, “Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single- and binary-component solids,” Physical Review B, vol. 89, no. 5, p. 054303, 2014.
[15] Z.-Y. Hou, Q. Dai, X.-Q. Wu, and G.-T. Chen, “Artificial neural network aided design of catalyst for propane ammoxidation,” Applied Catalysis A: General, vol. 161, no. 1, pp. 183–190, 1997.
[16] B. G. Sumpter and D. W. Noid, “On the design, analysis, and characterization of materials using computational neural networks,” Annual Review of Materials Science, vol. 26, no. 1, pp. 223–277, 1996.
[17] H. Bhadeshia, R. Dimitriu, S. Forsik, J. Pak, and J. Ryu, “Performance of neural networks in materials science,” Materials Science and Technology, vol. 25, no. 4, pp. 504–510, 2009.
[18] S. Atahan-Evrenk and A. Aspuru-Guzik, “Prediction and calculation of crystal structures,” Topics in Current Chemistry, vol. 345, 2014.
[19] L. Yang and G. Ceder, “Data-mined similarity function between material compositions,” Physical Review B, vol. 88, no. 22, p. 224107, 2013.
[20] A. M. Deml, R. O’Hayre, C. Wolverton, and V. Stevanović, “Predicting density functional theory total energies and enthalpies of formation of metal-nonmetal compounds by linear regression,” Physical Review B, vol. 93, no. 8, p. 085142, 2016.
[21] S. Curtarolo, D. Morgan, K. Persson, J. Rodgers, and G. Ceder, “Predicting crystal structures with data mining of quantum calculations,” Physical Review Letters, vol. 91, no. 13, p. 135503, 2003.
[22] C. C. Fischer, K. J. Tibbetts, D. Morgan, and G. Ceder, “Predicting crystal structure by merging data mining with quantum mechanics,” Nature Materials, vol. 5, no. 8, pp. 641–646, 2006.
[23] G. Hautier, C. Fischer, V. Ehrlacher, A. Jain, and G. Ceder, “Data mined ionic substitutions for the discovery of new compounds,” Inorganic Chemistry, vol. 50, no. 2, pp. 656–663, 2010.
[24] P. Dey, J. Bible, S. Datta, S. Broderick, J. Jasinski, M. Sunkara, M. Menon, and K. Rajan, “Informatics-aided bandgap engineering for solar materials,” Computational Materials Science, vol. 83, pp. 185–195, 2014.
[25] G. Pilania, A. Mannodi-Kanakkithodi, B. Uberuaga, R. Ramprasad, J. Gubernatis, and T. Lookman, “Machine learning bandgaps of double perovskites,” Scientific Reports, vol. 6, p. 19375, 2016.
[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[27] A. Newell, H. A. Simon et al., Human problem solving. Prentice-Hall Englewood Cliffs, NJ, 1972, vol. 104, no. 9.
[28] J. van den Hoven, “Clustering with optimised weights for Gower’s metric,” Netherlands: University of Amsterdam, 2015.
[29] A. A. Emery and C. Wolverton, “High-throughput DFT calculations of formation energy, stability and oxygen vacancy formation energy of ABO3 perovskites,” Scientific Data, vol. 4, p. 170153, 2017.
[30] A. Belsky, M. Hellenbrandt, V. L. Karen, and P. Luksch, “New developments in the Inorganic Crystal Structure Database (ICSD): accessibility in support of materials research and design,” Acta Crystallographica Section B: Structural Science, vol. 58, no. 3, pp. 364–369, 2002.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1953048.2078195
[32] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 785–794. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939785
[33] A. A. Emery and C. Wolverton, “High-throughput DFT calculations of formation energy, stability and oxygen vacancy formation energy of ABO3 perovskites,” Scientific Data, vol. 4, p. 170153, 2017.

SUPPLEMENTARY MATERIAL
A. Attributes and Properties
The first step of our pipeline is to compute attributes (or chemical descriptors) based on the composition of materials. These attributes should be descriptive enough to enable an ML algorithm to construct general rules that can possibly “learn” chemistry. Building on existing strategies, we use a set of attributes/features to represent each compound. These attributes comprise stoichiometric properties, elemental statistics, electronic structure attributes, and ionic compound attributes. A detailed procedure to compute these attributes can be found in the Materials Agnostic Platform for Informatics and Exploration (Magpie). Using these features, we consider the problem of developing reliable and explainable ML models to predict five physically distinct properties currently available through the OQMD: bandgap energy (eV), volume/atom (Å³/atom), energy/atom (eV/atom), thermodynamic stability (eV/atom), and formation energy (eV/atom). Formation energy is the total energy/atom minus some correction factors (i.e., the material with the lowest formation energy at each composition also has the lowest energy per atom). Stability indicates whether a particular material is thermodynamically stable or not: compounds with a negative stability are stable, and those with a positive stability are unstable. More information on the properties is provided by Emery et al.

B. Feature Importance for Class-specific Regression
Feature importance results for class-specific predictors can also be obtained. In Fig. 5, we show feature importance for the formation energy prediction regressors for all classes. For all three classes, the thermodynamic stability is found to be the most important attribute in predicting formation energy. From a thermodynamic point of view, this makes sense, as the stability is negatively correlated with the formation energy. More results are provided in the Supplementary Information associated with this manuscript.

C. Stable Solar Cell Compounds
A detailed list of potentially stable solar cell compounds (with corresponding property predictions and explanations) is provided in the Supplementary Information associated with this manuscript.
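For illustration, the selection of such candidates from model outputs can be sketched as a simple filter-and-rank step. This is a minimal sketch under stated assumptions, not the released software: the `Candidate` fields, the `screen` function, the placeholder formulas and the bandgap window used below are all hypothetical, and the window is not the range used in this work. The only criterion taken from the text is that a negative stability value indicates a stable compound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    formula: str                  # placeholder label, not a real compound
    bandgap_ev: float             # predicted bandgap energy (eV)
    stability_ev_per_atom: float  # predicted stability (eV/atom), negative = stable
    trust: float                  # trust score in [0, 1]


def screen(candidates, gap_min, gap_max, min_trust=0.0):
    """Keep stable candidates whose bandgap lies in [gap_min, gap_max],
    then rank them by decreasing trust score."""
    kept = [
        c for c in candidates
        if gap_min <= c.bandgap_ev <= gap_max
        and c.stability_ev_per_atom < 0.0
        and c.trust >= min_trust
    ]
    return sorted(kept, key=lambda c: c.trust, reverse=True)


# Hypothetical predictions for four placeholder compositions.
predictions = [
    Candidate("A", 1.4, -0.10, 0.60),
    Candidate("B", 0.2, -0.20, 0.90),  # bandgap outside the window
    Candidate("C", 1.2, -0.05, 0.80),
    Candidate("D", 1.3, 0.15, 0.95),   # predicted unstable
]

# The window (1.0, 2.0) eV is an illustrative choice only.
for c in screen(predictions, gap_min=1.0, gap_max=2.0):
    print(c.formula, c.bandgap_ev, c.stability_ev_per_atom, c.trust)
```

Ranking by trust score rather than by the model's own confidence follows the observation in Table III that confidence alone can be over-optimistic for minority classes.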
D. Other
The software, training data sets and input files used in this work are provided in the Supplementary Information associated with this manuscript.
Fig. 5. Feature importance for class-specific formation energy prediction regressors. Panels (a), (b), and (c) correspond to the three classes.