CLAIMED: A CLAssification-Incorporated Minimum Energy Design to explore a multivariate response surface with feasibility constraints
Yao Song*, Mert Y. Sengul**, Linglin He*, Adri C. T. van Duin***, Ying Hung* and Tirthankar Dasgupta*

* Department of Statistics, Rutgers University
** Materials Science and Engineering Department, Penn State University
*** Department of Mechanical Engineering, Penn State University
Abstract
Motivated by the problem of optimization of force-field systems in physics using large-scale computer simulations, we consider exploration of a deterministic complex multivariate response surface. The objective is to find input combinations that generate output close to some desired or "target" vector. In spite of reducing the problem to exploration of the input space with respect to a one-dimensional loss function, the search is nontrivial and challenging due to infeasible input combinations, high dimensionality of the input and output spaces, multiple "desirable" regions in the input space, and the difficulty of emulating the objective function well with a surrogate model. We propose an approach that combines machine learning techniques with smart experimental design ideas to locate multiple good regions in the input space.
Keywords: Multiple response; minimum energy design; space-filling design; classification; clustering; materials science; force field.

1. Introduction

Exploration of multi-response physical/engineering systems with the objective of determining "good" points in the input space with desirable or targeted values of each output variable is typically a challenging problem. Such problems become even more complicated if the number of responses or outputs is large (possibly larger than the number of inputs), and certain combinations of inputs are "infeasible", in the sense that they do not produce any reasonable output. Consider a system with p input variables X_1, ..., X_p and q output variables Y_1, ..., Y_q. The problem is to determine "good" combinations of inputs X_1, ..., X_p that produce responses Y_1, ..., Y_q as close to some pre-defined "target" values T_1, ..., T_q as possible. We assume that the functional relationships Y_j = f_j(X_1, ..., X_p) for j = 1, ..., q are in principle known, as is the case in computer experiments, but they may be expensive to compute. We will denote the q × 1 vectors of responses and targets by Y and T respectively.

If q = 1, i.e., for a single response, this problem can be formulated as a response surface optimization problem (Box and Draper, 1987). When certain input combinations are infeasible, this becomes a response surface optimization problem with unknown constraints (associated with feasible regions). Such a problem has been addressed in the engineering literature (Henkenjohann et al., 2005). When q ≥
2, a widely used approach is to optimize some one-dimensional loss function such as the weighted squared error loss

L(Y, T, W) = Σ_{j=1}^{q} {(Y_j − T_j)/w_j}^2 = (Y − T)ᵀ W⁻¹ (Y − T),    (1)

where w_1, ..., w_q are a set of weights associated with the q responses (Wu and Hamada, 2009) and W is a q × q diagonal matrix with entries w_1^2, ..., w_q^2. Such a method can also be applied with unknown constraints (e.g., Henkenjohann and Kunert (2007), where q = 2). However, one response may be closest to its target at a unique combination of inputs and farthest at a combination that is best for another response, leading to multiple local optima for the loss function.

We illustrate this situation with a toy example with p = q = 2. Let the input space be [0, 1]^2, and suppose the input-output relations are such that y_1 = log(2 + g_1(x_1)) depends only on x_1 through a quadratic function g_1, whereas y_2 = log(2 + g_2(x_1, x_2)) depends on both x_1 and x_2. Let the target vector for (y_1, y_2) be (T_1, T_2) = (log 2, log 2). Then for any weight vector (w_1, w_2), the loss function Σ_{j=1}^{2} {(y_j − T_j)/w_j}^2 is minimized at exactly two points in the input space, where it attains the value zero. Figure 1 shows the contour plot of the loss function. The infeasible region was created by defining a quadratic function u(x_1, x_2) of the inputs, setting π(u) = exp(u)/{1 + exp(u)}, and defining the binary feasibility variable z = 0 (infeasible) or z = 1 (feasible) according as π < 0.5 or π ≥ 0.5.

Figure 1: Contour plot of the loss function with two inputs and two outputs (the infeasible region and the best regions are marked).

Figure 1 makes it obvious that as q increases, trying to formulate the problem as global optimization of the loss function may not be a great idea. Rather, the objective should be to detect several "good" points within the feasible input space by efficient search algorithms. This is particularly true for the materials science application (explained in the next section) that motivated this research, where p can be fairly large, ranging from 15 to 300, and q is typically in the range of 500-600. We describe the motivating application in Section 2 and the essential steps of the proposed approach in Section 3, examine the approach through simulations in Section 4, demonstrate applications to the molybdenum disulfide (MoS2) ReaxFF system and the Nickel-Chromium (Ni-Cr) binary ReaxFF system in Section 5, and present some conclusions and opportunities for future research in Section 6.
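For reference, the loss (1) and the logistic feasibility mechanism of the toy example can be sketched in a few lines of code; the particular quadratic form of u below is an illustrative placeholder, not the exact function used to generate Figure 1:

```python
import numpy as np

def weighted_loss(y, t, w):
    """Weighted squared-error loss (1): sum_j {(y_j - t_j)/w_j}^2."""
    y, t, w = (np.asarray(a, dtype=float) for a in (y, t, w))
    return float(np.sum(((y - t) / w) ** 2))

def feasible(x1, x2):
    """Feasibility z in {0, 1}: pi(u) = exp(u)/(1 + exp(u)), z = 1 iff pi >= 0.5.
    The quadratic u below is an illustrative placeholder."""
    u = -0.25 + ((x1 - 0.5) / 0.3) ** 2 + ((x2 - 0.5) / 0.3) ** 2
    pi = np.exp(u) / (1.0 + np.exp(u))
    return int(pi >= 0.5)
```

Since π(u) ≥ 0.5 exactly when u ≥ 0, the infeasible region in this sketch is the set where the quadratic u is negative.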
2. Motivating example: Optimization of ReaxFF systems
Recent advances in materials science have established ReaxFF-based atomistic simulations as a promising method for atomistic-level investigations. The ReaxFF is a force field that incorporates complex functions with associated inputs in order to describe the inter- and intra-atomic interactions in materials systems. A typical ReaxFF force field consists of hundreds of parameters (inputs) per element type. During the development of a force field for a molecular system of interest, these parameters are optimized to reproduce reference values with reasonable accuracy. These reference values are molecular properties (e.g., bond lengths, bond angles, charges, and energies) of reference systems, also known as "gold standards", obtained by quantum chemistry methods (e.g., Density Functional Theory (DFT)) or experiments. In the notation defined in Section 1, for a ReaxFF system the X's represent parameters, the Y's represent the molecular properties, and the T's represent the gold standards or targets.

The conventional optimization method widely used by force field developers is a sequential one-input-at-a-time parabolic extrapolation method (Shchygol et al., 2019), which is not capable of switching between local minima to detect the lowest-loss regions in the input space. As a result, this method is susceptible to getting stuck in a local minimum, and it prevents parallelization of the optimization algorithm. Another big limitation of such one-factor-at-a-time approaches is their failure to capture important interactions among inputs. Due to these limitations of the conventional method, considerable effort has been directed toward solving this optimization problem using heuristic methods that include simulated annealing and genetic algorithms (Iype et al., 2013; Larsson et al., 2013; Dittner et al., 2015), but none of them have been found efficient for exploring the parameter space comprehensively.

The ReaxFF systems present several practical challenges to the exploration of complex input spaces in systems with multiple responses. The true input-output relationships are complex, typically involving systems of partial differential equations. Simulations for most ReaxFF systems are time consuming and expensive. The number of responses is large, making the loss function complex and multimodal. Another major complication arises from the fact that several parameter combinations do not produce meaningful outcomes, in the sense that either the simulation does not converge within the specified stopping time, or the results produce such a large discrepancy of the Y's from the T's that they are not meaningful at all. For many ReaxFF systems, the percentage of such infeasible combinations is larger than the percentage of feasible ones, making the process of output data generation even more expensive.

3. Essential steps involved in the proposed exploration approach

Space-filling designs have found extensive use in efficient exploration of complex response surfaces and for identification of good regions in the input space. As explained in Joseph (2016), space-filling designs focus on placing the design points in the experimental region in an intelligent manner so that there exist points everywhere in the experimental region with as few gaps or holes as possible. Therefore, ideally, if one can construct a very large space-filling design to cover the entire design space, then it is possible to identify the good input regions. However, as the dimension of the input space increases, the total number of points required to cover the entire space becomes prohibitively large, and additional strategies are required to facilitate exploration.
Thus, most strategies for exploring large and complex input spaces rely on a sequential strategy in which an initial exploration is done using a standard space-filling design.

Our first step therefore is to create an initial space-filling design of N points that performs an initial exploration of the input space. Each of the N design points in this initial design is a combination of the p inputs. Several of these N points are expected to be infeasible, i.e., they do not produce any output. Let N_1 and N_0 denote the numbers of feasible and infeasible points respectively, such that N_1 + N_0 = N. The N_1 feasible points generate an N_1 × q matrix of responses, where each column corresponds to an output Y_j.

The second step is to fit a classification model, using the information on the N_1 feasible and N_0 infeasible points, that predicts the probability that a given combination of ReaxFF parameters will result in a feasible simulation output. The classification model can be parametric, e.g., logistic regression, or based on non-parametric or machine learning methods like random forests.

The third step is the most crucial in the exploration process and involves exploring the input space using the N_1 × q matrix of responses as the starting point, and finding the best regions with lowest loss while making as few evaluations of new combinations as possible. This step also incorporates the classification model fitted in the previous step. It is done by application of a recently proposed design strategy called minimum energy designs (MED) (Joseph et al., 2015).

As in the case of the ReaxFF problem, we assume that the true response functions are deterministic and known, but the feasibility region is unknown. However, with increase in the dimension of the input space p, the actual number of evaluations of the true function required to obtain good results may be prohibitive. In such cases, one may need to substitute the simulator by a cheap or fast surrogate called the "emulator". This is an optional fourth step in the optimization algorithm. We now describe these four steps and illustrate each step using the toy example described in Section 1 and used to generate Figure 1.

3.1. Initial space-filling design

An initial space-filling design is generated in which the levels of the p input variables are simultaneously varied to produce N different combinations. Simulations are conducted at each of these N combinations and the outcome recorded. A p-dimensional input space can be explored using specific design strategies such as Latin hypercube sampling (McKay et al., 1979). The design generated using such a sampling strategy is called a Latin hypercube design (LHD). However, as observed by several authors, an arbitrary LHD does not necessarily have good space-filling properties, and it is necessary to incorporate additional criteria like orthogonality of inputs and maximin distance (which maximizes the minimum distance between every pair of points in the design space). The design proposed for initial exploration of the input space is the orthogonal-maximin Latin hypercube design (OMLHD), originally proposed in Joseph and Hung (2008). The OMLHD algorithm can generate parameter combinations, within ranges specific to each parameter, that are multidimensionally uniformly distributed, by reducing the pairwise correlations and maximizing the distance between design points.

The outcome of each simulation obtained from the initial OMLHD is recorded as follows: (i) a binary outcome variable Z taking values 0 and 1 according as whether the parameter combination is infeasible (does not produce any result) or feasible (produces results), and (ii) the values of the responses Y_1, ..., Y_q. The structure of the raw data matrix is shown in Table 1, in which the rows are arranged by values of Z without loss of generality. The data thus consist of N rows and p + q + 1 columns (p input variables, q responses, and one binary feasibility column).
The response columns will be missing for the N_0 infeasible combinations, shown as rows N_1 + 1, ..., N in Table 1.

Table 1: Data generated from initial design

No.       Inputs                       Responses                    Feasibility
          X_1  ...  X_p                Y_1  ...  Y_q                Z
1         X_{1,1}  ...  X_{1,p}        Y_{1,1}  ...  Y_{1,q}        1
...       ...                          ...                          ...
N_1       X_{N_1,1} ... X_{N_1,p}      Y_{N_1,1} ... Y_{N_1,q}      1
N_1 + 1   X_{N_1+1,1} ... X_{N_1+1,p}  --   ...   --                0
...       ...                          ...                          ...
N         X_{N,1}  ...  X_{N,p}        --   ...   --                0

Figure 2 demonstrates initial exploration of the loss function in the toy example described in Section 1 using a 20-point maximin LHD. The design generates N_1 = 18 points (90%) in the feasible region, which is consistent with the small size of the infeasible region. In our motivating example, this will typically not be the case, with the infeasible region often being larger than the feasible region, making N_0 > N_1 in most cases.

In this example, the initial design also finds four points in the desired region, and one of them is close to one of the two optima. Again, this will rarely be the case, because as the dimension of the input space increases, the volume of the desired region will typically be very small compared to the total input space.

Figure 2: Initial exploration of the loss function in the toy example using a 20-point maximin LHD (the infeasible region and the best regions are marked).

3.2. Classification of feasible and infeasible points

Having generated the data from the initial simulation experiment, the next task is to fit a classification model that predicts the feasibility of a combination of X_1, ..., X_p. Such a model is fit to the binary outcomes Z by applying a suitable supervised classification algorithm. There are several classification algorithms in classical statistics and machine learning (Hastie et al., 2009). Among these methods, logistic regression stands out as a versatile tool used in almost all scientific fields for several decades and is a natural candidate. On the machine learning side, random forest classification (Breiman, 2001) has attained tremendous popularity in scientific and engineering applications in recent years.
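As a concrete sketch, the feasibility-classification step can be prototyped with a plain logistic regression fit by gradient ascent (a stand-in for whichever library implementation is actually used; X and z below denote a generic input matrix and binary feasibility labels):

```python
import numpy as np

def fit_logistic(X, z, iters=500, lr=0.5):
    """Logistic-regression feasibility classifier fit by gradient ascent.
    Returns a function mapping a new input point to a predicted P(feasible)."""
    X = np.asarray(X, dtype=float)
    z = np.asarray(z, dtype=float)
    Xb = np.column_stack([np.ones(len(X)), X])     # add intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))       # current fitted probabilities
        beta += lr * Xb.T @ (z - p) / len(z)       # log-likelihood gradient step
    return lambda x: float(1.0 / (1.0 + np.exp(-(beta[0] + np.dot(beta[1:], x)))))
```

A point would then be classified feasible when the returned probability is at least 0.5, mirroring the π ≥ 0.5 rule of Section 1.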
In practice, multiple algorithms can be tried and the one with the best out-of-sample prediction performance chosen. Such a strategy necessitates splitting the data matrix into training and testing sets, fitting models using the training set, and comparing them on the basis of certain performance metrics using the testing set. Two performance measures – sensitivity and specificity – can be considered for comparing the approaches and identifying the best one. Let TP, TN, FP and FN denote the numbers of true positives (correct identification of feasibility), true negatives (correct identification of non-feasibility), false positives (incorrect classification of a truly infeasible point as feasible) and false negatives (incorrect classification of a truly feasible point as infeasible). Then sensitivity is defined as the true positive rate, measured as TP/(TP+FN), whereas specificity is defined as the true negative rate, measured as TN/(FP+TN).

3.3. Exploration using minimum energy designs

The primary objective of space-filling experimental designs is to spread points uniformly over the input space. The foregoing discussion in Section 1 and Section 3.1 establishes that this is not what is needed in the current problem. We need a sequential design or active learning (Sung and Niyogi, 1995; Cohn et al., 1996) strategy that helps avoid bad regions and generates more and more points from the desired region as the algorithm progresses. Thus the points sampled by the design should represent the response surface r(x) under exploration, which means more points should be generated in regions where r(x) is large and fewer points in regions where r(x) is small. In our case, because regions of lower loss are desirable, the response surface r(x) of interest can be taken as the inverse of the loss function L(x) in (1). Such an algorithm, known as minimum energy designs (MED), was proposed by Joseph et al. (2015).
We now briefly describe the minimum energy design algorithm and explain its usage in the context of our problem, again using the toy example. Consider the problem of exploring a p-dimensional input space, where the range of each factor is scaled to [0, 1]. Let q(x_i) be the weight associated with the i-th design point x_i. Visualize x_i as a charged particle in the box [0, 1]^p and q(x_i) as the positive charge associated with it. Let d(x_i, x_j) denote the Euclidean distance between the points x_i and x_j. Then the design that minimizes the total potential energy for n charged particles,

E = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} q(x_i) q(x_j) / d(x_i, x_j),

is the MED. If the objective is to select samples that represent a positive response surface r(x), the weight q(x) should be chosen as 1/{r(x)}^{1/(2p)}.

Based on the above idea, Joseph et al. (2019) proposed an efficient algorithm for computation of MED that optimizes a variant of the above energy function and selects points to represent a response surface r(x). The algorithm requires the user to specify (i) an initial n-run design, (ii) a function that computes the logarithm of r(x) (in our case, − log L(x)), and (iii) the number of iterations K. In each iteration, n new design points are generated and new evaluations of the logarithm of r(x) are made at these points. Thus a total of Kn evaluations of the function are necessary. This algorithm was implemented in the R package "mined" (Wang and Joseph, 2018).

Direct application of the above algorithm to our problem, however, is not possible because even if the true response functions y_j = f_j(x_1, ..., x_p) are known for j = 1, ..., q, several points in the initial designs are likely to fall in the unknown infeasible region, not returning any value of y_1, ..., y_q and consequently of L.
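Under the charge choice q(x) = 1/{r(x)}^{1/(2p)}, the MED criterion above can be sketched directly. This is a brute-force O(n^2) evaluation of the energy E, not the efficient optimization algorithm of Joseph et al. (2019):

```python
import numpy as np

def med_energy(X, log_r):
    """Total potential energy E = sum_{i<j} q(x_i) q(x_j) / d(x_i, x_j),
    with charge q(x) = 1 / r(x)^{1/(2p)} computed from log r(x)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # q(x) = exp(-log r(x) / (2p))
    q = np.exp(-np.array([log_r(x) for x in X]) / (2.0 * p))
    E = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            E += q[i] * q[j] / np.linalg.norm(X[i] - X[j])
    return E
```

A design is then sought that minimizes E; for a flat surface (log_r identically 0) all charges equal 1 and the criterion reduces to the sum of inverse pairwise distances, a standard space-filling criterion.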
A naive way to avoid this problem is to use a modified response function that only considers the observed responses from the feasible input points, thereby automatically truncating the infeasible points. In the context of Table 1, this would essentially mean ignoring the lower half of the data matrix and working only with the N_1 points in the upper half consisting of the feasible points.

Noting that the infeasible region is essentially a "bad" region that the MED is trying to avoid, a better strategy is to modify the response function by "imputing" the missing responses with values that generate a large value of the loss function. Since to run the MED algorithm we need a function that returns a value of the one-dimensional loss for a given input combination, we can simply replace the loss at each infeasible input combination by a value greater than or equal to the largest loss observed from the N_1 feasible points in the initial design.

We now illustrate this strategy again using the two-dimensional toy example introduced in Section 1. Recall the 20-point initial design shown in Figure 2, which generates two infeasible points for which the response values are missing. Among the remaining 18 feasible points, the largest observed loss was used as the imputed value for the two infeasible points.

Figure 3 shows the points generated after 8 iterations of the MED algorithm, with the contour of the loss function generated from the modified response function shown in the background. It is seen that the MED algorithm beautifully captures the best region, identifies the two optima, and also avoids the infeasible region. Note that one drawback of the R function "mined" is that it can produce points outside the input space in an attempt to broaden the search region. In this example, the algorithm ends up generating 12 points in the box [0, 1]^2, which are shown in Figure 3.

Although in this particular implementation of MED we end up avoiding generating points in the infeasible region, this is not guaranteed, because of the MED algorithm's inherent property of jumping out of good regions and exploring distant regions with uncertainty. Thus, we recommend using the classification model described in Section 3.2 to predict the feasibility outcome for each point generated by the MED as the last step.

This implementation of the MED algorithm requires 20 × 8 = 160 evaluations of the loss function. While this is manageable for p = 2, the number of evaluations necessary will increase with p. If the computation is expensive, one can use a surrogate model or emulator constructed from the initial design and use it as an approximation of the true simulator. We discuss a few strategies to fit such a surrogate model in the following subsection.

3.4. Surrogate model or emulator of the simulation model

Most methods for exploration of deterministic complex response surfaces use Gaussian process models as surrogates. However, while such models are flexible and capable of capturing complex non-linear input-output relationships, even with a moderate number of predictors like 20-25 they become computationally challenging due to identifiability issues associated with the model parameters. This is the situation in our motivating problem. For example, the MoS2 ReaxFF system has 45 input parameters and 599 output properties.

On the other hand, a naive statistical approach may be to fit individual regression models of the q responses on the p inputs, and then predict the total error based on the individual predictions. However, such an approach is not wise, because there may exist strong correlations among several properties that would not be exploited in the process. Further, any reasonable regression model should at least consider second-order terms, i.e., p square terms and p(p − 1)/2 pairwise interactions among the parameters.
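The growth of the second-order predictor count is easy to verify; with p = 45 inputs (as in the MoS2 system discussed next):

```python
from math import comb

p = 45
# linear terms + square terms + pairwise interaction terms
n_predictors = p + p + comb(p, 2)
print(n_predictors)  # 1080
```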
For the MoS2 system, inclusion of the second-order terms takes the total number of predictors to 1080, larger than the number of feasible data points obtained in the training data set. This makes ordinary linear regression an impossible proposition, although penalized regression methods like the Lasso (Tibshirani, 1996) can be used. Finally, independent prediction of each individual response adds up the individual noises or model-fitting errors associated with each fit, resulting in a large prediction error for a utility function that combines the individual discrepancies.

Another popular choice is to fit a multi-response machine learning model, such as a deep learning model, and use it as a surrogate. While in recent years deep learning (LeCun et al., 2015) has emerged as a popular tool for modeling multiple-input multiple-output data, there have also been several criticisms of the approach. First, as noted by Marcus (2018), "deep learning currently lacks a mechanism for learning abstractions through explicit, verbal definition, and works best when there are thousands, millions or even billions of training examples" and "is not an ideal solution in problems where data are limited". Second, its relative opacity has been met with criticism (Ribeiro et al., 2016). Third, like several machine learning algorithms, tuning a deep learning model is not straightforward.

For the MoS2 ReaxFF system, we used a combination of simple exploratory techniques for dimension reduction of the response vector and a sequentially fit clustered penalized linear regression involving first- and second-order terms of the input variables to obtain a surrogate model. For the Ni-Cr ReaxFF system, we fit a deep learning model as the surrogate. Applications of this approach are demonstrated in Section 5 and details are provided in the Appendix.

Let f̂_j(x), j = 1, ..., q denote the predictor of Y_j and Ẑ(x) denote the predictor of Z, the feasibility indicator, at input combination x = (x_1, ..., x_p).
Then the combined predictor of the loss function L at input combination x is given by

L̂(x) = Σ_{j=1}^{q} {(f̂_j(x) − T_j)/w_j}^2   if Ẑ(x) = 1,
L̂(x) = M                                      if Ẑ(x) = 0,    (2)

where M is a number at least as large as the maximum observed loss.
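The combined predictor (2) is straightforward to code given fitted emulators f̂_j, a feasibility classifier Ẑ, and the penalty M; all names below are placeholders for whichever fitted models are used:

```python
def loss_predictor(x, f_hats, z_hat, targets, weights, M):
    """Combined predictor (2): weighted squared-error loss of the emulated
    responses when x is classified feasible, the penalty M otherwise."""
    if z_hat(x) == 0:  # predicted infeasible: impute a large loss
        return M
    return sum(((f(x) - t) / w) ** 2
               for f, t, w in zip(f_hats, targets, weights))
```

The MED algorithm can then be run on −log L̂(x) in place of the expensive simulator.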
4. Simulations
We now examine the effectiveness of the proposed approach using simulations from a multiresponse system. We use a slightly modified version of the DTLZ2 function, a popular test function in multi-objective problems (Deb, 1999; Huband et al., 2006). The advantage of this function is its flexibility: it can be extended to any input and output dimensions p and q, and adjusted to create a suitable example for our problem. We first consider the case of p = q = 4, defining the response functions as

y_1 = {1 + g(x)} cos(x_1 π/2) cos(x_2 π/2) cos(x_3 π/2),
y_2 = {1 + g(x)} cos(x_1 π/2) cos(x_2 π/2) sin(x_3 π/2),
y_3 = {1 + g(x)} cos(x_1 π/2) sin(x_2 π/2),
y_4 = {1 + g(x)} sin(x_1 π/2),

where g(x) = (2x_4 − 0.5)^2 and x = (x_1, ..., x_4). The domain of the functions is 0 ≤ x_i ≤ 1 for i = 1, ..., 4, making the input space [0, 1]^4. We also assume that the infeasible region is D_1 ∩ D_2 ∩ D_3 ∩ D_4, where each D_i = {x : 0 ≤ x_i ≤ c_i} for suitable constants c_i. We set a target vector T (with fourth coordinate 0.7) and a weight vector w, and used an initial design of size N = 100.

Figure 4 shows a comparison of the loss function evaluated at the points in the initial design (a 100-point maximin LHD) and the MED. Clearly, the MED substantially shrinks the distribution of the loss. The best point identified by the initial design is (0.4961, 0.8342, 0.3506, 0.4381), and it produces a response vector (0.4755, 0.2920, 0.5624, 0.8219), leading to a loss of 0.2407. The best point identified by the MED is (0.0958, 0.1688, 0.3319, 0.5227), resulting in a response vector (0.7086, 0.4070, 0.8338, 0.5663) and a loss of 0.1098.

Figure 4: Box plots of loss for the initial and minimum energy designs.

Figure 5 shows the two-dimensional contour plot of the loss function (1) against x_1 and x_2, setting (x_3, x_4) at their best setting (0.3319, 0.5227) identified by the MED. The figure shows how the MED beautifully identifies the two good regions.

To confirm that the performance of the MED is not a one-time phenomenon and is robust to the choice of a reasonable space-filling initial design, we repeated the simulations 100 times, and the results are summarized in Figure 6. The figure compares (i) the difference between the minimum losses identified by the initial design and the MED, (ii) the difference between the median losses identified by the initial design and the MED, and (iii) the ratio of the standard deviations of the losses identified by the initial design and the MED. Out of 100 repetitions of the simulation, in 72 cases the MED identified a point with minimum loss lower than the one identified by the initial design, and in 25 cases the point with the minimum loss identified by both designs was the same. Only on three occasions did the best point identified by the initial design produce a marginally smaller loss than the best point identified by the MED.

Figure 5: Contour plot of L(x_1, x_2, 0.3319, 0.5227).

Figure 6: Improvement achieved by MED over the initial design (histograms of the minimum loss, median loss, and standard deviation of loss comparisons).
The simulations with the DTLZ2 function described above were also conducted by increasing the input dimension p to 10, 15 and 20. The output dimension q was kept the same as the input dimension. In each case, the size of the initial design N was varied from 50 to 200 in steps of 50. In almost all of these cases, the proposed strategy was able to locate points that were better than the best ones identified by the initial design. The computation times associated with each such simulation setting were also explored, and a representative set of times is shown in Figure 7. As seen from the figure, the computation time taken by the MED algorithm with a 20-dimensional DTLZ2 function and an initial design of size 200 is approximately 200 seconds.

Figure 7: Computation times for MED for different dimensions and initial design sizes.
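The modified DTLZ2 responses for p = q = 4, as reconstructed above, can be sketched as follows (the form of g shown here follows that reconstruction):

```python
import numpy as np

def dtlz2_responses(x):
    """Modified DTLZ2 responses for p = q = 4 (see the definitions above)."""
    x1, x2, x3, x4 = x
    g = (2.0 * x4 - 0.5) ** 2
    c = 1.0 + g
    a1, a2, a3 = (np.pi / 2.0) * np.array([x1, x2, x3])
    y1 = c * np.cos(a1) * np.cos(a2) * np.cos(a3)
    y2 = c * np.cos(a1) * np.cos(a2) * np.sin(a3)
    y3 = c * np.cos(a1) * np.sin(a2)
    y4 = c * np.sin(a1)
    return np.array([y1, y2, y3, y4])
```

A useful sanity check is the identity y_1^2 + y_2^2 + y_3^2 + y_4^2 = {1 + g(x)}^2, which holds for any x by construction of the trigonometric products.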
5. Applications
We now evaluate the proposed approach with the MoS2 and Ni-Cr ReaxFF systems. Different surrogate models were constructed in these two examples, followed by the MED algorithm to generate new points.

5.1. MoS2 ReaxFF system
We now demonstrate the proposed approach using the MoS2 ReaxFF system, which has 45 input parameters (X's) and 599 material properties or responses denoted by Y_1, ..., Y_599. Using the OMLHD design algorithm described in Section 3.1, 5000 different combinations of the 45 inputs were obtained. The space-filling property of the OMLHD is illustrated in Figure 8, where 2-D scatter plots of four pairs of parameters in the feasible design space are displayed.

Out of the simulation results produced by these 5000 combinations, 3846 were found infeasible and only the remaining 1154 simulations produced results, generating values of 599 different material characteristics. The outcome of each simulation was recorded in the format shown in Table 1 with N_1 = 1154, p = 45 and q = 599. The distribution of the loss function computed from these 1154 points is summarized in Table 2.

Table 2: Distribution of the loss function from the initial design of size 1154

min       median     mean        max
96459     2606367    17268394    848122062

Figure 8: OMLHD for four pairs of ReaxFF parameters (only feasible points).

The next step was to fit the classification model to predict the binary outcome variable Z from the X variables. The 5000 × 46 data matrix (45 ReaxFF parameters and the feasibility outcome Z, ignoring the data on the responses Y) was split into a training set consisting of 4000 data points and a testing set consisting of 1000 data points. The models fit with the training data were compared on the basis of their sensitivities and specificities (defined in Section 3.2) computed from the testing data.

A logistic regression classifier that included linear terms of all ReaxFF predictors resulted in a sensitivity of 0.6422 and a specificity of 0.9296 based on the testing data. The performance of the random forest classifier, like several other machine learning algorithms, depends on the choice of the tuning parameters. Two of the most important tuning parameters are NTREE (the number of trees to grow) and MTRY (the number of variables that should be selected at a node split). The random forest classifier was fit with the training data with several combinations of NTREE and MTRY, and the performances are summarized in Table 3.

It appears from Table 3 that in terms of predicting the feasibility of ReaxFF parameter combinations, the logistic regression based classifier outperforms the random forest classifier with respect to sensitivity, and is marginally inferior with respect to specificity. Considering the role of this classifier in predicting whether a promising new combination of ReaxFF parameters should be added to the exploration space, a higher true positive rate is possibly more important than achieving a higher true negative rate.
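Both metrics in this comparison follow the confusion-matrix definitions of Section 3.2:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP/(TP+FN), the true positive rate;
    specificity = TN/(TN+FP), the true negative rate."""
    return tp / (tp + fn), tn / (tn + fp)
```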
This is because, a positive outcome (feasible combination) leads to a17able 3: Sensitivity and specificity for random forest classifier with different tuning parametercombinations Tuning parameters PerformanceNTREE MTRY Sensitivity Specificity200 5 0.3088 0.9837200 10 0.4706 0.9611200 15 0.5245 0.9497200 20 0.5147 0.9447200 25 0.5686 0.9372400 5 0.3039 0.9925400 10 0.4510 0.9686400 15 0.5245 0.9497400 20 0.5441 0.9410400 25 0.5343 0.9384600 5 0.2990 0.9899600 10 0.4461 0.9673600 15 0.5294 0.9472600 20 0.5245 0.9422600 25 0.5343 0.9397800 5 0.2843 0.9962800 10 0.4363 0.9673800 15 0.5098 0.9497800 20 0.5245 0.9435800 25 0.5490 0.93841000 5 0.2941 0.99251000 10 0.4461 0.96731000 15 0.5000 0.95101000 20 0.5392 0.94101000 25 0.5441 0.9384 successful simulation and generates values of material characteristics, whereas a negative outcomeor infeasible point does not add anything to the existing knowledge. From this perspective, thelogistic regression based classifier was chosen over the random forest classifier.As a pre-cursor to fitting the surrogate model to predict Y ’s from the X ’s, an exploratoryanalysis, following the guidelines provided in Appendix 6, was conducted with the 1154 ×
599 matrixof Y s by summarizing the scaled individual errors (SIE) E ij = ( Y ij − T j ) /w j for i = 1 , . . . , j = 1 , . . . ,
599 in an attempt to identify the ones that have major contributions to the totaldiscrepancy. A graphical summary of the means and standard deviations of SIEs of the 599 responses(calculated from 1154 observations) is shown in Figure 9.Figure 9 revealed a very interesting aspect - several (in fact 190 out of 599) of the responsesappeared to have negligibly small standard deviations in the generated data, meaning that theyremained more or less constant over the 1154 feasible ReaxFF parameter combinations in the initialdesign. Clearly prediction models involving such properties are meaningless. Interestingly, most ofthese 190 properties also had their means very close to the gold standards, meaning they could notbe improved further. On the other extreme, there were some responses that varied widely across18igure 9:
Plot of means and sds of 599 responses obtained from 1154 data points; the + sign represents 37shortlisted responses − − − − Mean individual errors
Index M ean i nd i v i dua l e rr o r s SD of individual errors
Index S D o f i nd i v i dua l e rr o r s the parameter settings and were therefore potentially interesting candidates for model fitting. Asproposed in Appendix 6, using the measure P ( J ∗ ) given by (3), 37 properties were identified asthe top contributors to the total squared SIE (cid:80) i =1 (cid:80) j =1 E ij . These 37 properties contributed to96.7% of the total SIE in the generated data, i.e., P ( J ∗ ) = 0 .
967 where J ∗ represents the set ofindices of these properties. These 37 properties are represented by + signs in both panels of Figure9.Figure 10: Correlation heatmap among 37 log absolute weighted individual errors (red color indicates highcorrelation) property.116property.105property.115property.104property.308property.307property.306property.302property.303property.304property.305property.559property.560property.558property.556property.557property.588property.593property.592property.578property.595property.598property.597property.594property.596property.582property.550property.591property.590property.572property.373property.571property.370property.569property.581property.579property.580 p r ope r t y .
580 p r ope r t y .
579 p r ope r t y .
581 p r ope r t y .
569 p r ope r t y .
370 p r ope r t y .
571 p r ope r t y .
373 p r ope r t y .
572 p r ope r t y .
590 p r ope r t y .
591 p r ope r t y .
550 p r ope r t y .
582 p r ope r t y .
596 p r ope r t y .
594 p r ope r t y .
597 p r ope r t y .
598 p r ope r t y .
595 p r ope r t y .
578 p r ope r t y .
592 p r ope r t y .
593 p r ope r t y .
588 p r ope r t y .
557 p r ope r t y .
556 p r ope r t y .
558 p r ope r t y .
560 p r ope r t y .
559 p r ope r t y .
305 p r ope r t y .
304 p r ope r t y .
303 p r ope r t y .
302 p r ope r t y .
306 p r ope r t y .
307 p r ope r t y .
308 p r ope r t y .
104 p r ope r t y .
115 p r ope r t y .
105 p r ope r t y . The 37 shortlisted properties were grouped into five clusters on the basis of correlations among19he transformed responses U j = log | ( Y j − T j ) | /w j , i.e, the logarithms of absolute individual dif-ferences from the target. The logic behind transforming the responses is explained in Appendix 6.Figure 10 shows the heatmap of correlations.Finally, the predictor of the loss function (cid:98) L ( x ) at a new input combination x was obtainedalong the lines of (2) and (4). After running the MED algorithm with this predicted loss function,it was able to identify one point with a loss of approximately 80,000, providing a substantial (17%)improvement over the best point obtained from the initial 1154 points. We now illustrate the proposed approach using another ReaxFF system Ni-Cr, which has 16 inputparameters ( Xs ) and 90 material properties denoted by Y , . . . , Y . The main difference in thisapplication from the previous example of the MoS system was the surrogate modeling approach.Unlike the cluster-based sequential penalized regression model used for exploring the MoS system,we decided to fit a deep learning (DL) model as a surrogate. Fitting a DL model typically entailsgenerating a large number of training data points. Therefore, using the OMLHD design algorithmdescribed earlier, 79635 different combinations of the 16 inputs were obtained, and ReaxFF sim-ulations were conducted with these combinations. Out of these 79635 combinations, 4999 werefound infeasible. The rest produced meaningful values of the responses. The distribution of the lossfunction obtained from these points is summarized in Table 4.Table 4: Distribution of loss function from Ni-Cr initial designmin median mean max2812 2416505 63081433 1 . × The data obtained from the feasible region were split into two parts - a training set comprising80% of the points, and a testing set consisting of the remaining 20%. 
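This split-and-evaluate step can be sketched in a few lines. The sketch below is illustrative only: it uses synthetic data with the Ni-Cr dimensions (16 inputs, 90 responses) and an ordinary least-squares linear model standing in for the DL surrogate, and it scores the held-out 20% by mean absolute error.

```python
import numpy as np

# Sketch of the 80/20 split-and-evaluate step (synthetic stand-in data;
# a least-squares linear model replaces the deep-learning surrogate).
rng = np.random.default_rng(1)
n, p, q = 2000, 16, 90                     # points, inputs, responses (Ni-Cr sizes)
X = rng.uniform(size=(n, p))               # feasible input combinations
B = rng.normal(size=(p, q))                # hypothetical true coefficients
Y = X @ B + 0.05 * rng.normal(size=(n, q)) # responses with small noise

idx = rng.permutation(n)
n_train = int(0.8 * n)
train, test = idx[:n_train], idx[n_train:]

# fit: one least-squares coefficient vector per response
coef, *_ = np.linalg.lstsq(X[train], Y[train], rcond=None)
pred = X[test] @ coef
mae = np.mean(np.abs(pred - Y[test]))      # mean absolute error on held-out 20%
print(f"held-out MAE: {mae:.4f}")
```

With the actual simulation data, the same held-out MAE computation applies unchanged to whatever surrogate is fit on the training portion.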
The training data were used to fit the DL model with two hidden layers and 90 nodes, and the model was tested using mean absolute error (MAE) as the measure of prediction accuracy. Details on fitting the DL algorithm can be found in Sengul et al. (2020).

The trained DL model permitted prediction of the material properties ŷ at any new input combination x. Consequently, the predictor L̂(x) was obtained along the lines of (2) and (4). The initial 2000-run design for MED was chosen from a region centered around the parameter combination with the smallest loss obtained from the initial design points. After running the MED algorithm with the surrogate loss function based on the DL model, several points with losses smaller than 2812 were identified. The best point, with a loss of 2316.17, provided a substantial (18%) improvement over the best point obtained from the initial 79635 points. Our approach produced a significant improvement over the conventional method, which was not able to find input points with loss lower than 5614.26. It also stood out in terms of computation time, producing several good parameter combinations, including the optimum, in about two minutes, compared to the two weeks taken by an experienced force field developer using the conventional method.
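The way the classifier and the surrogate combine into a single search objective, as in (4), can be sketched as follows. The feasibility classifier and the per-response error predictors here are hypothetical toy stand-ins (not the fitted models from the paper); M plays the role of the large penalty assigned to points predicted infeasible.

```python
import numpy as np

# Sketch of the classifier-gated surrogate loss that drives the MED search:
# points predicted infeasible get a large penalty M; feasible points get the
# sum of squared back-transformed errors exp(U_hat)^2, where
# U_hat estimates log(|Y - T| / w) for each shortlisted response.
M = 1e12  # large penalty for points classified as infeasible

def surrogate_loss(x, predict_feasible, predict_log_abs_errors):
    """predict_feasible: x -> {0, 1};
    predict_log_abs_errors: x -> array of U_hat values."""
    if predict_feasible(x) == 0:
        return M
    u_hat = predict_log_abs_errors(x)
    return float(np.sum(np.exp(u_hat) ** 2))

# toy stand-ins: feasible iff every coordinate lies in [0, 1];
# hypothetical per-response log absolute errors, smallest near x = 0.5
feas = lambda x: int(np.all((x >= 0) & (x <= 1)))
errs = lambda x: np.log(np.abs(x - 0.5) + 1e-9)

print(surrogate_loss(np.array([0.4, 0.6]), feas, errs))
print(surrogate_loss(np.array([1.4, 0.6]), feas, errs))  # infeasible, returns M
```

In the actual workflow, `predict_feasible` would be the fitted logistic regression classifier and `predict_log_abs_errors` the cluster-based regression (MoS2) or DL (Ni-Cr) surrogate, with the MED algorithm sampling new points according to this loss.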
6. Concluding remarks
In this article, we have proposed a framework for finding good points in an input space that produce a vector of responses close to their targets, where the “goodness” or “closeness” is defined in terms of a one-dimensional loss function. This problem differs from traditional global optimization because several good regions with almost similar values of the loss function may exist, and because there are large unknown input regions that are infeasible in the sense that they do not produce any output. The responses are assumed to be deterministic, but the framework can be extended to accommodate noise. The problem is motivated by and applied to ReaxFF optimization systems in physics, and the initial results are encouraging. Two key elements of the proposed approach are classifying the input space into feasible and infeasible regions, and intelligently sampling points from the good regions using a method called minimum energy designs.

Another popular approach to defining “goodness” of points in multiresponse systems is to find solutions that are Pareto optimal (Lu et al., 2011; Lu and Anderson-Cook, 2014; Lu et al., 2014; Chen et al., 2018), i.e., solutions that cannot be improved so as to bring any one response closer to its target without moving at least one other response farther from its target. However, finding Pareto optimal solutions with unknown constraints, and with input and output dimensions as high as those encountered in a typical ReaxFF problem, has rarely been addressed. This can be an interesting direction for future research.

Acknowledgment
This work was supported by NSF EAGER Grant No. DMR 1842952, DMR 1942922 and MRI 1626251.
Appendix

Exploratory analysis, dimension reduction and clustering of responses
Recall that our objective is to obtain input combinations that produce low discrepancies of the responses from their respective targets, i.e., small Y_j − T_j for j = 1, ..., q. We first take a close look at the N × q matrix of scaled individual errors (SIE) E_ij = (Y_ij − T_j)/w_j, for i = 1, ..., N and j = 1, ..., q, where the w_j's denote weights, in an attempt to identify the responses that make major contributions to the total discrepancy. Plots of summary statistics of the SIEs, i.e., their sample means Ē_j = (1/N) Σ_{i=1}^N E_ij, sample standard deviations s_j = √( Σ_{i=1}^N (E_ij − Ē_j)² / (N − 1) ) for j = 1, ..., q, or sample quantiles, provide useful information on the distributions of the SIEs for each response.

Based on such exploratory analysis, one can choose a subset of output characteristics that contributes at least a targeted proportion of the total squared SIE Σ_i Σ_j E_ij². That is, we can choose a subset J* of the set of indices {1, ..., q} satisfying P(J*) ≥ δ, where

    P(J*) = [ Σ_{i=1}^N Σ_{j ∈ J*} E_ij² ] / [ Σ_{i=1}^N Σ_{j=1}^q E_ij² ]    (3)

represents the proportion of the total squared SIE from all responses that is explained by the responses Y_j for j ∈ J* in the generated data, and 0 < δ < 1 is a pre-specified proportion.

A cluster-based sequential penalized regression model
We now consider prediction of the responses Y_j for j ∈ J*, where J* is the set of responses identified in the previous section. Let q* ≤ q denote the cardinality of J*. A naive strategy is to fit independent regression models of each of the q* responses on the input variables X. However, such a strategy does not take into account potential correlations among the properties. We propose fitting a cluster-based penalized regression model to address this problem. Instead of modeling the responses Y, especially for the ReaxFF application, we propose modeling the transformed responses (TR) U = log(|Y − T|/w), where T is the target for Y and w is the weight defined earlier. This transformation is justified by the fact that it is the absolute difference between the response and the target that needs to be made small. Further, in the context of the ReaxFF problem, for most of the responses the distribution of the absolute individual error |(Y − T)/w| appeared to have moderate to heavy skewness, which could be corrected using the log transformation.

The first task is to cluster the q* TR variables U into C clusters based on their correlation matrix. This can be done using any clustering package in a standard statistical computing environment. Let C denote the number of clusters and K_1, ..., K_C the numbers of TR variables in clusters 1, ..., C, respectively.

Let U_ℓ1, ..., U_ℓK_ℓ denote the K_ℓ TRs in cluster ℓ ∈ {1, ..., C}. We use the following algorithm to fit a set of predictor models for U_ℓ1, ..., U_ℓK_ℓ.

(i) For each TR U in the cluster, using a Lasso regression (Tibshirani, 1996), select significant predictors in the penalized linear model of U on all 2p + (p choose 2) first- and second-order terms, i.e., X_h, X_h² and X_h X_h′ for h, h′ = 1, ..., p, h ≠ h′.

(ii) From these K_ℓ models, select the one with the maximum predictive power (determined using out-of-sample mean squared error, cross-validation methods or adjusted R²). This becomes the baseline model of the cluster, interpreted as the one with maximum predictive power based solely on the input variables X. Let U*_ℓ1 denote the TR in the baseline model; we call U*_ℓ1 the baseline TR in cluster ℓ. Let Û*_ℓ1 = f*_ℓ1(x) represent the baseline model.

(iii) Now consider the prediction of the remaining K_ℓ − 1 TRs. Identify the TR, say U*_ℓ2, that can be best predicted using the predictors chosen by Lasso in step (i) and the baseline TR U*_ℓ1 already identified in step (ii). Then update the model for U*_ℓ2 by including the baseline TR U*_ℓ1 as a predictor in addition to the X terms if it satisfies a pre-specified inclusion rule (such as achieving a threshold improvement in out-of-sample error or adjusted R²). Let Û*_ℓ2 = f*_ℓ2(x, U*_ℓ1) denote this model.

(iv) Repeat step (iii) sequentially for the remaining TRs in the cluster, including TRs from the previous steps as additional predictors beyond the X's if the inclusion rule is satisfied.

The above procedure is repeated for each cluster. Thus we have, for each cluster, a collection of models that predict each TR in that cluster. Let Û_ℓk(x) denote the predicted TR for the kth response of cluster ℓ. Then the surrogate model for the loss function at input combination x is

    L̂(x) = Σ_{ℓ=1}^C Σ_{k=1}^{K_ℓ} { exp( Û_ℓk(x) ) }²   if Ẑ(x) = 1,
    L̂(x) = M                                              if Ẑ(x) = 0,    (4)

where M is a very large number and Ẑ(x) is the predicted binary feasibility outcome from the classification model.

References

Box, G. E. P. and Draper, N. R. (1987).
Empirical Model-Building and Response Surfaces. Wiley, New York, NY.

Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.

Chen, P., Santner, T., and Dean, A. (2018). Sequential Pareto minimization of physical systems using calibrated computer simulators. Statistica Sinica, 28:671–692.

Cohn, D., Ghahramani, Z., and Jordan, M. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, pages 129–145.

Deb, K. (1999). Multiobjective genetic algorithms: Problem difficulties and construction of test problems. Evolutionary Computation, 7:205–230.

Dittner, M., Muller, J., Aktulga, H. M., and Hartke, B. (2015). Efficient global optimization of reactive force-field parameters. J Comput Chem, 36:1550–1561.

Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.

Henkenjohann, N., Gobel, R., Kleiner, M., and Kunert, J. (2005). An adaptive sequential procedure for efficient optimization of the sheet metal spinning process. Quality and Reliability Engineering International, 21:439–455.

Henkenjohann, N. and Kunert, J. (2007). An efficient sequential optimization approach based on the multivariate expected improvement criterion. Quality Engineering, 19:267–280.

Huband, S., Hingston, P., Barone, L., and While, L. (2006). A review of multi-objective test problems and a scalable test problem toolkit. IEEE Transactions on Evolutionary Computation, 10:477–506.

Iype, E., Hutter, M., Jansen, A. P., Nedea, S. V., and Rindt, C. C. (2013). Parameterization of a reactive force field using a Monte Carlo algorithm. J Comput Chem, 34:1143–1154.

Joseph, V. R. (2016). Space-filling designs for computer experiments: A review. Quality Engineering, 28:28–35.

Joseph, V. R., Dasgupta, T., Tuo, R., and Wu, C. F. J. (2015). Sequential exploration of complex surfaces using minimum energy designs. Technometrics, 57:64–74.

Joseph, V. R. and Hung, Y. (2008). Orthogonal maximin Latin hypercube designs. Statistica Sinica, 18:171–186.

Joseph, V. R., Wang, D., Gu, L., Lv, S., and Tuo, R. (2019). Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics, 61:297–308.

Larsson, H. R., van Duin, A. C., and Hartke, B. (2013). Global optimization of parameters in the reactive force field ReaxFF for SiOH. J Comput Chem, 34:2178–2189.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521:436–444.

Lu, L., Anderson-Cook, C., and Lin, D. (2014). Optimal designed experiments using a Pareto front search for focused preference of multiple objectives. Computational Statistics and Data Analysis, 30:1178–1192.

Lu, L. and Anderson-Cook, C. M. (2014). Balancing multiple criteria incorporating cost using Pareto front optimization for split-plot designed experiments. Quality and Reliability Engineering International, 30:37–55.

Lu, L., Anderson-Cook, C. M., and Robinson, T. J. (2011). Optimization of designed experiments based on multiple criteria utilizing a Pareto frontier. Technometrics, 53:353–365.

Marcus, G. (2018). Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631.

McKay, M., Beckman, R., and Conover, W. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21:239–245.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. KDD '16 Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.

Sengul, M., Nayir, N., Gao, Y., Hung, Y., Dasgupta, T., and van Duin, A. (2020). An initial design-enhanced deep learning-based optimization framework to parameterize multicomponent ReaxFF force fields. Technical report, ChemRxiv.

Shchygol, G., Yakovlev, A., Trnka, T., van Duin, A. C. T., and Verstraelen, T. (2019). Parameter optimization with Monte-Carlo and evolutionary algorithms: Guidelines and insights. J Chem Theory Comput, 15:6799–6812.

Sung, K. and Niyogi, P. (1995). Active learning for function approximation. Proc. Advances in Neural Information Processing Systems, 7.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.

Wang, D. and Joseph, V. R. (2018). mined: Minimum Energy Designs. R package version 1.0-2.

Wu, C. F. J. and Hamada, M. S. (2009). Experiments: Planning, Analysis, and Optimization, 2nd edition. Wiley, New York.