Designing compact training sets for data-driven molecular property prediction
Bowen Li, Srinivas Rangarajan
Department of Chemical and Biomolecular Engineering, Lehigh University, Bethlehem

June 26, 2019

Abstract
In this paper, we consider the problem of designing a training set, using the most informative molecules from a specified library, to build data-driven molecular property models. Specifically, using (i) sparse generalized group additivity and (ii) kernel ridge regression as two representative classes of models, we propose a method combining rigorous model-based design of experiments and cheminformatics-based diversity-maximizing subset selection within an ε-greedy framework to systematically minimize the amount of data needed to train these models. We demonstrate the effectiveness of the algorithm on subsets of various databases, including QM7, NIST, and a catalysis dataset. For sparse group additive models, a balance between exploration (diversity-maximizing selection) and exploitation (D-optimality selection) leads to learning with a fraction (sometimes as little as 15%) of the data while achieving accuracy similar to five-fold cross validation on the entire set. Kernel ridge regression, on the other hand, prefers diversity-maximizing selections.
In several applications [1, 2, 3], including high-throughput screening for molecule discovery, modeling of large reaction systems, and design of catalysts, one encounters the challenge of estimating multiple properties of a large set of molecules (or materials, reaction intermediates, etc.). Experimental or ab initio estimation of such properties can be prohibitively expensive; a more computationally scalable strategy is therefore needed.

Machine learning algorithms serve as an appealing alternative strategy for molecular property prediction. By training a surrogate model with reference data from quantum mechanical calculations or experiments, fast and accurate property prediction can be achieved. A number of molecular descriptors [4, 5, 6, 7] and accurate machine learning models [8, 9, 10] have been proposed for the property prediction of organic molecules. However, such techniques can be expensive because these machine-learned models are often built using large datasets comprising tens of thousands to millions of data points. In this broader context, we consider the following question:
Given a library (or a set) of molecules, S, for which we need to calculate properties, how does one identify the smallest subset of molecules S_T (S_T ⊂ S) to do experiments/computations on, such that a surrogate property model trained on S_T is accurate enough to be applied to the rest of the molecules in S?

The concept of active learning or sequential experimental design aims to minimize the total number of training samples needed for learning by adaptively selecting informative samples to include in the training data. This ultimately allows for building a relatively small but high-quality training set (relative to the design space) to save resources on data acquisition [11, 12, 13, 14]. For molecular property prediction, active learning has been applied to achieve better accuracy than random selection at the same training set size [15, 16].

In this work, we extend such a purely statistical approach by infusing domain information in an ε-greedy fashion, reminiscent of reinforcement learning. In particular, the "domain knowledge" we use here is a diversity maximization algorithm to select molecules into S_T. We consider the sequential experimental design as exploiting the model for the best short-term reward, and diversity maximization as exploration of the molecule space to improve the model by introducing new fragments into it. The ratio of exploitation to exploration moves is set by a user-provided fraction ε. As a proof of concept, we test this algorithm on three datasets of varying molecule sizes, types, and properties. Our property prediction models are based on a sparse version of a generalized group additivity formalism that offers a reasonably accurate and explainable data-driven model for molecular properties. For such models, we demonstrate the algorithm by showing that S_T can be a small fraction (sometimes as little as 15%) of S. Finally, we extend our algorithm to kernel-based property models, where we see that diversity maximization results in better training sets. We begin with a discussion of the methods.

Group additivity (GA), proposed by Benson and Buss in 1958, is a simple linear property model in which the property of interest is calculated as the sum of contributions from constituent atoms, bonds, or groups [17, 18, 19, 20]:

f(x) = \sum_{i=1}^{P} \beta_i x_i    (1)

where f(x) is, e.g., the enthalpy or entropy of formation, x_i is, for instance, the count of a group, and β_i is the coefficient weight (the contribution).

Traditionally, groups were "handpicked" by the developer. As an alternative, one can use any of the more modern cheminformatics-based fingerprints. We use pathway fingerprints, which are modified from Daylight fingerprints (a Boolean array indicating the absence or presence of patterns in a molecule) by enumerating each atom and linear substructure. Specifically, paths of one to seven (1-7) atoms in the molecular graph are identified. Any molecule can subsequently be represented (not uniquely) by storing the count of each path (essentially, each fragment) in a fingerprint vector. Two additional correction terms describing aromaticity and ring information are introduced, since these are not covered by linear substructures. The detailed steps to build the modified pathway fingerprints are given in section S1 of the supporting information. The design matrix X is formed with each row representing the fingerprint vector of a molecule in the dataset; GA can then be applied for property prediction using this matrix.
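As an illustration of this construction, the short sketch below counts linear paths of one to seven atoms with RDKit. It is a minimal rendition of the idea, not the paper's exact implementation; it omits the aromaticity and ring correction terms of section S1, and the function name and the use of fragment SMILES as feature keys are our own choices.

```python
from collections import Counter
from rdkit import Chem

def pathway_fingerprint(smiles, max_atoms=7):
    """Count linear substructures (paths of 1..max_atoms atoms) in a molecule.

    Sketch of the pathway-fingerprint idea; the ring/aromaticity correction
    terms described in section S1 are omitted here.
    """
    mol = Chem.MolFromSmiles(smiles)
    counts = Counter()
    # paths of a single atom
    for atom in mol.GetAtoms():
        counts[atom.GetSymbol()] += 1
    # paths of 2..max_atoms atoms, i.e. 1..max_atoms-1 bonds
    for n_bonds in range(1, max_atoms):
        for path in Chem.FindAllPathsOfLengthN(mol, n_bonds, useBonds=True):
            atoms = set()
            for bond_idx in path:
                bond = mol.GetBondWithIdx(bond_idx)
                atoms.update((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
            # the fragment SMILES of the path serves as the feature key
            frag = Chem.MolFragmentToSmiles(mol, atomsToUse=list(atoms),
                                            bondsToUse=list(path))
            counts[frag] += 1
    return counts

print(pathway_fingerprint("CCC"))  # propane: 3x 'C', 2x 'CC', 1x 'CCC'
```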
A common way to obtain the feature coefficients β_i is ordinary least squares (OLS) regression:

\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2    (2)

The coefficient vector is obtained as

\hat{\beta} = (X^T X)^{-1} X^T y    (3)

Pathway fingerprints can result in a large number of features. To prevent overfitting, we obtain sparse models using the Least Absolute Shrinkage and Selection Operator (LASSO), which picks relevant features by introducing a 1-norm regularization term with penalty λ into the OLS loss function:

\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^P} \left\{ \frac{1}{N} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}    (4)

A sparse solution of dimension p is thus obtained by penalizing the coefficient weights and pushing insignificant ones to zero. The value of the regularization term λ controls the sparsity of the solution; a larger λ results in fewer nonzero features. The optimal λ is chosen by minimizing a cross-validation error on the training set. A similar approach, utilizing exhaustively generated subgraphs of molecules and LASSO feature selection, has shown good performance on gaseous species and surface intermediates [21]. Our GA model is built with exhaustively generated pathway fingerprint features and the LASSO feature selection algorithm; LASSO is performed with scikit-learn's elastic net python module by setting the L2 penalty to zero.
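For concreteness, the snippet below performs the LASSO fit in exactly this way, with scikit-learn's ElasticNet reduced to a pure 1-norm penalty via l1_ratio=1; the random count matrix is a stand-in for the fingerprint design matrix and is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 300)).astype(float)  # stand-in for fingerprint counts
beta_true = np.zeros(300)
beta_true[:10] = rng.normal(size=10)                 # only 10 "real" fragments
y = X @ beta_true + 0.01 * rng.normal(size=100)

# l1_ratio=1 reduces the elastic-net penalty to a pure 1-norm (LASSO), Eq. 4
lasso = ElasticNet(alpha=0.05, l1_ratio=1.0, max_iter=10000)
lasso.fit(X, y)
selected = np.flatnonzero(lasso.coef_)               # indices of surviving features
print(len(selected), "nonzero coefficients")
```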
2.2 Max-Min dissimilarity-based subset selection

Structurally similar molecules are likely to exhibit similar properties [22]; S_T can therefore contain more property-relevant information if the coverage of fragments in the set is maximized. This concept is widely applied to maximize diversity in drug discovery and design, which favors screening small yet diverse fragment-based libraries over the conventional strategy of evaluating as many compounds as technologically possible [23, 24, 25].

In this work, we use the Max-Min method for diversity maximization and the Sørensen-Dice coefficient (DSC) for similarity quantification to select the most fragment-diverse subset [26]. The DSC converts similarity into a real number:

\mathrm{DSC} = \frac{2\,|Y \cap Z|}{|Y| + |Z|}    (5)

where Y and Z are two vectors or sets; the numerator is twice the number of elements common to both sets, and the denominator is the sum of the numbers of elements in each set. The larger the DSC value between two sets, the more similar they are.

The Max-Min method maximizes fragment diversity by (i) calculating the smallest dissimilarity between each compound in the remaining set and those already in the training set, and (ii) selecting the molecule with the largest such dissimilarity for addition to S_T. The method essentially tries, in a greedy sense, to minimize the maximum distance between molecules outside S_T and within it; hence the name "Max-Min". Note that the smallest dissimilarity corresponds to the largest Sørensen-Dice coefficient, and vice versa. Further, the algorithm can be employed a priori (before calculating the desired properties) and can be stopped when S_T reaches a desired size.

In this work, when applying the Max-Min method to maximize the pathway-fingerprint diversity of a given set, the maximization is performed through the rdkit python module 'MaxMinPicker' with its built-in pathway fingerprints module 'FingerprintMols', wherein the fingerprints are hashed and lumped to improve efficiency.
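A minimal sketch of such a Max-Min selection with RDKit is shown below, using the Dice dissimilarity of Equation 5 as the distance; the SMILES pool is arbitrary and for illustration only.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

# an arbitrary illustrative pool; the paper draws from QM7/NIST/catalysis sets
smiles = ["CCO", "CCCC", "c1ccccc1", "CC(C)=O", "C1CCCCC1", "CCN", "CC#N"]
fps = [FingerprintMols.FingerprintMol(Chem.MolFromSmiles(s)) for s in smiles]

def dice_distance(i, j, fps=fps):
    # dissimilarity = 1 - Sorensen-Dice coefficient (Eq. 5)
    return 1.0 - DataStructs.DiceSimilarity(fps[i], fps[j])

picker = MaxMinPicker()
# greedily pick 3 molecules, each maximizing its minimum distance to the picks
picks = picker.LazyPick(dice_distance, len(fps), 3)
print([smiles[i] for i in picks])
```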
When samples (molecules) are abundant but their labels (properties) are not readily available or are expensive to acquire, active learning or experimental design methods [27] can be used to choose the most informative samples to label (i.e., to acquire the property of a specific molecule). Active learning methods are supervised machine learning methods wherein the training data is updated adaptively to build a compact but high-quality model. Popular selection strategies include uncertainty sampling [28, 29], optimal experimental design [30], query-by-committee [31], and querying representative examples [32]. Based on a particular strategy, information-rich samples are selected from the remaining samples and added to the current training set after obtaining their labels (properties); the model is thus improved adaptively.

In OLS (Equation 2), the p × p matrix X^T X is called the information matrix of the parameters β; a larger determinant of X^T X translates to a more informative training set. The goal of optimal experimental design is to find the combination of samples from the dataset that optimizes some property of the information matrix according to an experimental design criterion [33]. The most commonly applied criterion is D-optimality [34], which maximizes the determinant of the information matrix and is equivalent to minimizing the determinant of the covariance matrix V = (X^T X)^{-1}.

Geometrically, the (1 − a)% confidence region of the estimated parameters of dimension p can be represented by the confidence ellipsoid obtained from the F test, which satisfies the following inequality [35]:

(\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta}) \le p s^2 F_{p,v,a}    (6)

where β is any point in the confidence region, s^2 is the estimate of the variance σ^2, and F_{p,v,a} is the F-statistic for v degrees of freedom, p parameters, and significance level a. The axes of the confidence ellipsoid represent the confidence intervals of the estimated parameters; shrinking the volume of the ellipsoid therefore reduces the parameters' errors. As the volume is proportional to the square root of the determinant of the covariance matrix V, applying the D-optimality criterion minimizes the generalized variance of the estimated parameters [33].

We note that for linear models, a D-optimal design with a desired number of training points can be identified a priori (e.g., with the DETMAX algorithm [36]). For nonlinear models, and for our sparse models, wherein the matrix X can change depending on which parameters (i.e., paths or fragments) are picked at each iteration, one can instead use a sequential design approach that greedily maximizes the information content.

We use sequential D-optimal experimental design as one of the strategies to iteratively add the most informative remaining molecule to the model. At each iteration, we first compute our sparse regression model to determine the matrix X; then, for each remaining molecule (in S − S_T) with feature vector x, the covariance matrix is updated using the Sherman-Morrison formula [37]

V' = V - \frac{V x^T x V}{1 + x V x^T}    (7)

as a numerically inexpensive way to compute the otherwise expensive inverse of the updated X^T X. Here, V is the current covariance matrix and V' is the updated one. The molecule that lowers the determinant of the updated covariance matrix the most is considered the most informative; its property is then calculated and it is added to the training set.

Within the sequential experimental design, the training set is thus updated at each step with the molecule calculated to be the most informative. Such sequential updates are greedy actions that keep exploiting the current model and can therefore be biased by it. One way to improve upon purely exploitative updates is to allow the selection of samples that do not offer the best short-term expectations but potentially offer the long-term benefit of improved solutions (i.e., a lower error on the remaining set for a given training set size). Selecting such samples is called exploration. Since at each step there is a choice between exploring the space and exploiting the current model, the problem is referred to as a "conflict" between exploration and exploitation [38].

The epsilon-greedy method is a widely used strategy to balance exploitation and exploration in reinforcement learning studies [39, 40], where the ratio of exploration is controlled by a hyperparameter ε. At each step, a random number ζ is generated between 0 and 1; if ζ falls in a range determined by ε, a random (or a more carefully chosen exploratory) action is taken instead of the best exploitative action; otherwise, the exploitative action is chosen.

We add exploration actions to the sequential experimental design (the exploitation action) through the epsilon greedy strategy. For exploration, we use the Max-Min method to pick the molecule that maximizes fragment diversity in the training set. At each iteration, the generated random number ζ determines whether we pick a molecule by the Max-Min method, to increase the model's feature diversity, or by D-optimality, to reduce the generalized variance of the estimated parameters: if ζ is smaller than 1 − ε + ε/n, where n is the number of remaining molecules, the molecule is selected with the D-optimality strategy; otherwise, it is selected with the Max-Min method. The flowchart of our algorithm is shown in Figure 1.
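Before walking through the full workflow, the per-iteration selection step can be sketched in a few lines of numpy: the Sherman-Morrison update of Equation 7, the D-optimality score (by the matrix determinant lemma, det V' = det V / (1 + x V xᵀ), so minimizing the updated determinant amounts to maximizing x V xᵀ), and the epsilon greedy choice between the two moves. The function names and the maxmin_pick callback are ours, and the threshold follows the rule stated above.

```python
import numpy as np

def sherman_morrison_update(V, x):
    """Update V = (X^T X)^{-1} after appending row x to X (Eq. 7)."""
    Vx = V @ x
    return V - np.outer(Vx, Vx) / (1.0 + x @ Vx)

def d_optimal_pick(V, candidates):
    """Index of the candidate minimizing the determinant of the updated covariance.

    det(V') = det(V) / (1 + x V x^T), so it suffices to maximize x V x^T.
    """
    scores = np.einsum('ij,jk,ik->i', candidates, V, candidates)
    return int(np.argmax(scores))

def select_next(V, candidates, eps, rng, maxmin_pick):
    """Epsilon-greedy choice between exploitation and exploration."""
    n = len(candidates)
    if rng.random() < 1.0 - eps + eps / n:    # threshold as described above
        return d_optimal_pick(V, candidates)  # exploit: D-optimality
    return maxmin_pick()                      # explore: Max-Min diversity
```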
Given a molecular space S, the design matrix X is constructed with descriptors (all pathway fingerprints up to a specified length) representing the molecules in the space. The algorithm starts by selecting molecules from S as the initial training set S_T0 and forming the initial training matrix X_T from X, with each row representing a selected molecule. The next step is to apply LASSO on X_T to select a subset of the columns; we further carry out a QR decomposition to remove linearly dependent columns. The model matrix can then be transformed into the information matrix needed for D-optimal sampling. After LASSO and QR decomposition, an OLS step (with the reduced set of columns) is carried out to obtain the final model and its associated predictions (and standard errors) for the remaining molecules in S − S_T. Subsequently, if the stopping criterion is not met, the epsilon greedy strategy determines whether to perform an exploration step (using the Max-Min method) or an exploitation step (using D-optimality). Based on the chosen strategy, a molecule from the remaining set S − S_T is selected and added to S_T; its property is obtained and its feature vector appended to X_T. This iterative update of S_T and X_T continues until the stopping criterion is reached. Note that LASSO at each iteration considers the whole set of P fingerprints, from which it identifies p salient ones. A critical aspect of this algorithm is deciding when to stop; we discuss potential criteria in a subsequent section. The details of how the regularization value for LASSO is selected at each step are discussed in section S2 of the supporting information.

Figure 1: Workflow of the proposed algorithm for building the optimal training set.

A kernel model assumes that samples (molecules) with similar patterns are likely to have similar labels (properties). The prediction for a sample is therefore

f(x) = \sum_{i=1}^{n} \alpha_i k\langle x_i, x \rangle    (8)

where α_i is the coefficient of training sample i, representing its contribution, and k⟨x_i, x⟩ is the kernel function chosen to replace the inner product in feature space to save computational resources.

We test both diversity maximization and variance sampling as selection strategies for the kernel model. For maximizing diversity, we use the Max-Min method to minimize the maximum kernel distance between molecules within the training set and those outside it. Variance sampling, similar to uncertainty sampling in classification, directly selects the molecule with the largest prediction variance. The prediction variance is calculated as k_i^T (K^T K)^{-1} k_i, based on the kernel matrix K of the current training set and the kernel distance vector k_i, whose elements are the kernel distances between the remaining molecule i and each training molecule. This is analogous to the sequential D-optimal design.

The model is based on the Laplacian kernel, and the hyperparameter σ (the kernel width) is set for the entire exercise as max({D_ij})/ln(2), where D_ij is the kernel distance between each pair of molecules in the initial training set. The regularization term λ is set following the arguments of Ramakrishnan and von Lilienfeld [41]. The bag of bonds (BOB) [5] molecular representation is used to quantify the difference between two molecules, with their 3D coordinates as inputs.
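As a sketch of this setup, the snippet below fits a Laplacian-kernel ridge regression with scikit-learn using the kernel-width heuristic just described; the random matrix stands in for bag-of-bonds vectors, and the regularization value is a placeholder rather than the paper's exact choice.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X_train = rng.random((50, 20))   # stand-in for bag-of-bonds descriptors
y_train = rng.random(50)

# kernel width heuristic: sigma = max pairwise (L1) distance / ln(2)
D = pairwise_distances(X_train, metric='manhattan')
sigma = D.max() / np.log(2.0)

# Laplacian kernel k(x, x') = exp(-gamma * ||x - x'||_1)
model = KernelRidge(kernel='laplacian', gamma=1.0 / sigma,
                    alpha=1e-9)  # alpha is a placeholder, not the paper's lambda
model.fit(X_train, y_train)
y_pred = model.predict(rng.random((5, 20)))
```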
Our method is benchmarked on three different sets of data: (1) ~6.5k gaseous molecules from the QM7 dataset with their heats of atomization [42, 43], (2) 598 hydrocarbons from the NIST chemistry webbook with their heats of formation [45], and (3) 591 surface intermediates on transition metal catalysts with their heats of formation [46].

The performance of a selection strategy is measured by the root mean squared error (RMSE) of the prediction on the remaining samples, using the model trained on the selected samples (S_T). For each dataset, the average RMSE of five-fold cross validation (CV) is shown as a baseline representing the performance of the model with a sufficiently large number of training molecules. We note that the RMSE on the remaining set is not theoretically bounded by this baseline; nevertheless, for a large S and a relatively small S_T, the baseline is a worthy target for the models. The relationship between the regularization values and the corresponding RMSE and MAE is shown in section S3 of the supporting information (SI).

For any step that involves random number generation, we report averages over ten different runs. This leads to averaging over (1) one hundred runs for random sampling (ten different initial sets and ten simulations of updates for each initial set), (2) ten runs for the D-optimal strategy, each starting with a different random initial training set, (3) ten runs for the largest error elimination curve, and (4) ten runs for each ε value in the epsilon greedy method.

Figure 2: Comparison of the learning rates of different molecule selection strategies on the QM7 dataset for GA-based prediction of heats of atomization. The y-axis measures the root mean squared error of the prediction on the remaining set during the updating process with (1) random sampling, (2) the largest error elimination method, and (3) the D-optimality strategy with the initial training set selected randomly or by the Max-Min method. The baseline represents the five-fold cross validation result of the dataset. The reported performance of random sampling, the largest error elimination method, and the D-optimality strategy with a randomly selected initial training set is averaged over multiple runs.

For the QM7 dataset, we first set the bounds for the selection strategies: (i) the worst case, in which the remaining set (S − S_T) is sampled randomly to pick the next molecule for S_T, and (ii) the best case, denoted the "largest error elimination method", wherein all labels (properties) are known and at each iteration we select the molecule with the largest prediction error (the difference between the true value and the value predicted by the model trained on S_T).

The performance of these two strategies is shown in Figure 2. Both selection strategies start from the same randomly chosen initial training set of eighty molecules (generated with the same random seed). We then evaluate their performance by inspecting the RMSE on the remaining set as a function of the size of S_T. Compared to random sampling, the largest error elimination strategy, as expected, rapidly converges to the CV baseline. In practice, however, all properties are not available, and the largest error elimination strategy is purely a theoretical exercise; nevertheless, the gap in learning rate (the rate of decrease of RMSE with increasing training set size) between the two strategies suggests that, in principle, a model trained on a carefully selected small subset can perform as well as one trained on the entire set.

Next, we investigate the performance of the D-optimality strategy, shown in red in Figure 2.
Clearly, this strategy learns faster than random sampling and is able to come within 1.5 kcal/mol of the CV baseline with just 600 molecules in the training set. However, even though the D-optimal learning rate is high early on, its progress starts to slow at about 300 molecules. This indicates that the initial training set does not sufficiently represent the salient features of the QM7 dataset; the D-optimal selection strategy then uses a statistical criterion to add new molecules to the training set. While this maximizes the determinant of the information matrix, it does not ensure that relevant missing features are included in the training set.

To investigate whether missing relevant features can be identified by picking appropriate molecules, we next apply the Max-Min method to build the initial training set for the D-optimal selection strategy (the "Max-Min 80 + D-optimality" curve in Figure 2). Early on, this curve outperforms the original D-optimal curve and learns almost as fast as the largest error elimination curve. If we further increase the initial training set from 80 to 200 molecules (the "Max-Min 200 + D-optimality" curve in Figure 2), the curve approaches the baseline and stays within 0.5 kcal/mol for sufficiently large |S_T|. These results indicate that the performance of the D-optimal strategy improves when starting with a diverse training set.

Figure 3: Comparison of the learning rates for different ε (eps) values on the QM7 dataset for GA-based prediction of heats of atomization. The value of ε determines the ratio of sample selection by the D-optimal criterion (exploitation) to Max-Min diversity maximization (exploration). The reported performance of strategies combining both exploration and exploitation (ε between 0.1 and 0.9) is averaged over ten runs.

Introducing diversity maximization to identify the initial training set improves the learning rate of the D-optimal strategy; the Max-Min algorithm clearly explores the molecule space better than random selection. In Figure 3, we show the performance of the epsilon-greedy method that combines the D-optimal and Max-Min strategies for molecule selection. The figure shows different values of the hyperparameter ε in order to identify the optimal one; ε ranges from 0 (purely exploitative, using D-optimality) to 1 (purely explorative, using Max-Min). Figure 3 shows that the D-optimality curve (ε = 0) performs best in the early stages, but its advantage vanishes after about 600 molecules, while the pure Max-Min method (ε = 1) decreases the slowest. The learning curve for ε = 0.5 lies between those of the pure methods (ε = 0/1) in the early steps and yields the best "long-term" performance (by about 0.2-0.3 kcal/mol when the training set size reaches 1000). These results show that exploitation provides a good short-term benefit, while exploration improves the model in the long run; a balance between the two strategies leads to a steady and overall better performance. Detailed values for the comparison of the selection strategies are reported in Table 1.

Table 1: Comparison of the RMSE, the largest prediction standard error, and the average prediction standard error of the remaining set for different training set sizes and selection strategies, for GA-based prediction of heats of atomization (kcal/mol) on the QM7 dataset.
The number in parentheses after each strategy is the size of the initial training set. The performance of all strategies except Max-Min + D-optimality (200) is averaged over multiple runs. The largest and average prediction standard errors are not reported for random sampling and the largest error elimination method.

Strategy:          | Random (80)        | Largest error elim. (80) | Random + D-optimality (80)
Training set size  | RMSE   Train. err. | RMSE   Train. err.       | RMSE   Train. err.   Largest std. err.   Avg. std. err.
200                | 15.13  3.07        | 3.88   6.34              | 6.08   2.66          4.81                2.21
600                |  5.14  2.50        | 2.41   6.24              | 3.87   3.07          2.30                1.26
800                |  –     –           | –      –                 | 3.52   3.19          1.49                1.06
1000               |  –     –           | –      –                 | 3.36   3.32          1.56                0.94

Strategy:          | Max-Min + D-optimality (200)                            | Epsilon greedy, ε = 0.5 (200)
Training set size  | RMSE   Train. err.   Largest std. err.   Avg. std. err. | RMSE   Train. err.   Largest std. err.   Avg. std. err.
200                | 5.07   0.97          15.90               3.32           | 5.07   0.97          15.90               3.32
600                | 3.08   2.07           1.97               1.30           | 3.16   2.24           2.32               1.31
800                | 3.05   2.36           1.56               1.07           | 2.90   2.41           1.96               1.13
1000               | 2.94   2.58           1.41               0.97           | 2.75   2.54           1.66               1.00
Figure 4: Comparison of the learning rates of different molecule selection strategies on the NIST chemistry webbook for GA-based prediction of heats of formation. The selection strategies involve (1) random sampling, (2) the D-optimality strategy with the initial training set selected randomly or by the Max-Min method, (3) the epsilon greedy method with ε set to 0.5, and (4) the pure Max-Min method. The baseline represents the five-fold cross validation result of the dataset. The performance of random sampling, the D-optimality strategy with a randomly selected initial training set, and the epsilon greedy method is averaged over multiple runs.

We apply these different sampling strategies to ~600 hydrocarbon molecules from the NIST database to investigate whether the algorithm performs well on a smaller experimental dataset comprising molecules of diverse sizes. Figure 4 shows the relative performance of the different strategies. We note that all methods eventually reach a lower RMSE than the "baseline" cross validation error, which is expected for a small set of molecules. Overall, the results are similar to those for the QM7 dataset: ε-greedy with ε = 0.5 is better than D-optimality with Max-Min initial selection, which in turn is better than purely random sampling. However, subtle differences arise; for instance, pure Max-Min sampling performs well in the early stages but slows down after ~200 molecules in the training set. Nevertheless, a combination of the D-optimality and Max-Min methods gives the best performance; this strategy approaches the CV baseline with ~140 molecules in the training set (compared to ~470 training molecules in each CV model) and reaches 5.2 kcal/mol with ~300 training molecules (compared to 11.0 kcal/mol as the CV error).
Figure 5: Comparison of the learning rates of different molecule selection strategies on the surface intermediates dataset for GA-based prediction of heats of formation. The selection strategies involve (1) random sampling, (2) the D-optimality strategy with the initial training set selected randomly or by the Max-Min method, (3) the epsilon greedy method with ε set to 0.5, and (4) the pure Max-Min method. The baseline represents the five-fold cross validation result of the dataset. The reported performance of random sampling, the D-optimality strategy with a randomly selected initial training set, and the epsilon greedy method is averaged over multiple runs.

The selection strategies are further tested on a dataset of surface intermediates to investigate their performance with a different descriptor (groups mined from surface intermediates) and a different type of molecule (surface intermediates). Figure 5 shows that despite the changes in descriptor and molecule type, the "active learning" algorithm remains effective; the selection strategies involving both the Max-Min method and D-optimality (the "Max-Min + D-optimality" and "Max-Min eps = 0.5" curves) yield the best overall performance and reach the CV baseline (which involves ~450 training molecules per CV model) with 40-50% fewer training molecules.
Finally, we investigate the best selection strategy for building kernel models. Figure 6 shows the results for the kernel model trained on the QM7 dataset using the "bag of bonds" descriptor. Note that we apply variance sampling for the "exploitation" step, which is akin to the D-optimal strategy; the Max-Min method uses kernel distance to quantify intermolecular similarity. Random sampling and largest error elimination continue to represent the worst and best cases, respectively. Variance sampling starting from a randomly selected initial training set outperforms random sampling, but its learning rate is not steady, and the RMSE drops almost instantly once the training set grows sufficiently large.

Figure 6: Comparison of the learning rates of different molecule selection strategies for the kernel model on the QM7 dataset, including the epsilon greedy method with different ε values. The baseline represents the five-fold cross validation result of the dataset. The reported performance of random sampling, the largest error elimination method, variance sampling with initial training molecules selected randomly, and the epsilon greedy method is averaged over multiple runs.

The curves of strategies that combine the Max-Min method and variance sampling via the epsilon greedy method lie between those of the pure methods (the "Max-Min + variance" curve and the pure "Max-Min" curve). The higher the ε value, viz. the more often the Max-Min method is applied, the closer the performance is to that of the Max-Min method. These results indicate that for similarity-based property prediction models (i.e., kernel models), the Max-Min method, which maximizes kernel-distance diversity, is preferable.

Figure 7: Stopping criterion for the epsilon greedy method (ε = 0.5) applied to the GA model to predict heats of atomization for the QM7 dataset. (left) The RMSE, the largest prediction standard error, and the average prediction standard error of the remaining set. At the end of each iteration, the prediction error of the selected molecule can be calculated; E_prev is the moving average of this error over the previous forty (40) iterations. (right) Comparison of the RMSE of the remaining set and the training set error.

It is essential to determine when to stop the algorithm. Although the prediction error on the remaining set is a useful metric for understanding the performance of our method, it is not practical, since labels are not available. Instead, we compute the prediction standard error as a surrogate to track the quality of the model and determine when the learning rate slows down. The prediction standard error of a molecule outside the training set is calculated as

s = \sigma \sqrt{x_i^T (X^T X)^{-1} x_i}    (9)

where x_i is the molecule's feature vector and X is the design matrix.

In Figure 7, we first investigate whether the prediction standard error follows a trend similar to the model prediction accuracy, thereby allowing its use as a surrogate. We compare the largest and average prediction standard errors of the remaining dataset with the RMSE at each iteration of the epsilon greedy strategy (eps = 0.5). As the two prediction standard error curves ("largest standard error" and "average standard error") decrease rapidly in the early iterations, the model learns fast (shown in the "Max-Min ε = 0.5" curve). As the prediction standard error curves converge in the later stages, the improvement in accuracy becomes negligible. The similar behavior of the RMSE and the average standard error indicates that we can rely on the trend of the latter to qualitatively infer the learning rate.
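These stopping-criterion quantities are inexpensive to compute; below is a minimal sketch (function name ours) of the per-molecule standard error of Equation 9 and the two summary statistics tracked in Figure 7.

```python
import numpy as np

def prediction_standard_errors(X_train, residuals, X_remaining):
    """Prediction standard error s = sigma * sqrt(x^T (X^T X)^{-1} x), Eq. 9.

    sigma is estimated from the training residuals; returns the largest and
    the average standard error over the remaining molecules.
    """
    n, p = X_train.shape
    sigma = np.sqrt(residuals @ residuals / max(n - p, 1))  # residual std. dev.
    V = np.linalg.inv(X_train.T @ X_train)                  # covariance matrix
    leverage = np.einsum('ij,jk,ik->i', X_remaining, V, X_remaining)
    s = sigma * np.sqrt(leverage)
    return s.max(), s.mean()
```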
In principle, (i) the largest prediction standard error in the remaining dataset reflects how much information can still be gained by carefully selecting informative molecules, and (ii) the average prediction standard error represents how confidently the model can predict the remaining molecules; one feasible stopping point is therefore when both fall below a user-defined threshold.

Further, we investigate whether the quality of the model can be estimated by tracking the training error directly (the "training error" curve) or the moving average of the difference between predicted and true values of the molecules selected in the previous forty iterations (denoted E_prev). Interestingly, the training error increases as more molecules are added, indicating that the model overfits the data early on. But as more diverse data is added to the training set, the model becomes progressively less biased. Eventually, with ~900 training molecules selected, the training error reaches the accuracy of the prediction on the remaining molecules. This result indicates that the training set error may even be used to estimate the prediction accuracy in the later stages of the iteration. E_prev also eventually drops to the RMSE of the remaining set and the baseline error; however, it fluctuates considerably, although it could still be used as a qualitative metric.

In this work, we present an optimal training set building algorithm that identifies the smallest set of informative molecules to be calculated from a given molecule library S in order to build a surrogate property model, trained on S_T, for predicting the remaining molecules S − S_T. The algorithm iteratively updates the sparse model, generated with LASSO feature selection from the design matrix X_T, with informative molecules selected either by D-optimal experimental design, as exploitation to improve the current model, or by Max-Min diversity maximization, as exploration to introduce diversity into X_T. The balance of exploration and exploitation is controlled by the ε-greedy method widely used in reinforcement learning studies. We validate the algorithm by comparing different selection strategies on three datasets that vary in the size, type, and properties of the molecules: (1) gaseous molecules from the QM7 dataset, (2) hydrocarbons from the NIST chemistry webbook, and (3) surface intermediates on transition metal catalysts. In all cases, the algorithm that balances D-optimality and the Max-Min method via the epsilon greedy method gives the best results, reducing the training molecules needed to ~15% of the QM7 dataset, ~25% of the NIST chemistry webbook, and ~45% of the surface intermediates dataset while achieving accuracy similar to their five-fold cross validation results. We further extend our algorithm to a kernel model and show that building the training set with diversity maximization based on kernel distance is preferred there. The GA and kernel models in this work serve as two simple representative model classes for validation; in principle, the algorithm can be applied to any property prediction model and any fingerprints, whether user-defined or machine-learned.
References

[1] Adam Z Weber, Matthew M Mench, Jeremy P Meyers, Philip N Ross, Jeffrey T Gostick, and Qinghua Liu. Redox flow batteries: a review. Journal of Applied Electrochemistry, 41(10):1137, 2011.
[2] Xianfeng Ma, Zheng Li, Luke EK Achenie, and Hongliang Xin. Machine-learning-augmented chemisorption model for CO2 electroreduction catalyst screening. The Journal of Physical Chemistry Letters, 6(18):3528–3533, 2015.
[3] Liping Yu and Alex Zunger. Identification of potential photovoltaic absorbers based on first-principles spectroscopic screening of materials. Physical Review Letters, 108(6):068701, 2012.
[4] Katja Hansen, Grégoire Montavon, Franziska Biegler, Siamac Fazli, Matthias Rupp, Matthias Scheffler, O Anatole von Lilienfeld, Alexandre Tkatchenko, and Klaus-Robert Müller. Assessment and validation of machine learning methods for predicting molecular atomization energies. Journal of Chemical Theory and Computation, 9(8):3404–3419, 2013.
[5] Katja Hansen, Franziska Biegler, Raghunathan Ramakrishnan, Wiktor Pronobis, O Anatole von Lilienfeld, Klaus-Robert Müller, and Alexandre Tkatchenko. Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space. The Journal of Physical Chemistry Letters, 6(12):2326–2331, 2015.
[6] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.
[7] Matthias Rupp. Machine learning for quantum mechanics in a nutshell. International Journal of Quantum Chemistry, 115(16):1058–1073, 2015.
[8] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of Chemical Information and Modeling, 53(7):1563–1575, 2013.
[9] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
[10] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.
[11] Daniel Reker, Petra Schneider, Gisbert Schneider, and JB Brown. Active learning for computational chemogenomics. Future Medicinal Chemistry, 9(4):381–402, 2017.
[12] Tobias Lang, Florian Flachsenberg, Ulrike von Luxburg, and Matthias Rarey. Feasibility of active machine learning for multiclass compound classification. Journal of Chemical Information and Modeling, 56(1):12–20, 2016.
[13] Daniel Reker and Gisbert Schneider. Active-learning strategies in computer-assisted drug discovery. Drug Discovery Today, 20(4):458–465, 2015.
[14] Ying Liu. Active learning with support vector machine applied to gene expression data for cancer classification. Journal of Chemical Information and Computer Sciences, 44(6):1936–1941, 2004.
[15] Yu-Hang Tang and Wibe A de Jong. Prediction of atomization energy using graph kernel and active learning. arXiv preprint arXiv:1810.07310, 2018.
[16] Konstantin Gubaev, Evgeny V Podryabinkin, and Alexander V Shapeev. Machine learning of molecular properties: Locality and active learning. The Journal of Chemical Physics, 148(24):241727, 2018.
[17] Sidney W Benson and Jerry H Buss. Additivity rules for the estimation of molecular properties. Thermodynamic properties. The Journal of Chemical Physics, 29(3):546–572, 1958.
[18] Sidney W Benson, FR Cruickshank, DM Golden, Gilbert R Haugen, HE O'Neal, AS Rodgers, Robert Shaw, and R Walsh. Additivity rules for the estimation of thermochemical properties. Chemical Reviews, 69(3):279–324, 1969.
[19] HK Eigenmann, DM Golden, and SW Benson. Revised group additivity parameters for the enthalpies of formation of oxygen-containing organic compounds. The Journal of Physical Chemistry, 77(13):1687–1691, 1973.
[20] N Cohen and SW Benson. Estimation of heats of formation of organic compounds by additivity methods. Chemical Reviews, 93(7):2419–2438, 1993.
[21] Geun Ho Gu, Petr Plechac, and Dionisios G Vlachos. Thermochemistry of gas-phase and surface species via LASSO-assisted subgraph selection. Reaction Chemistry & Engineering, 3(4):454–466, 2018.
[22] Mark A Johnson and Gerald M Maggiora. Concepts and Applications of Molecular Similarity. Wiley, 1990.
[23] Philip J Hajduk and Jonathan Greer. A decade of fragment-based drug design: strategic advances and lessons learned. Nature Reviews Drug Discovery, 6(3):211, 2007.
[24] Mark G Bures and Yvonne C Martin. Computational methods in molecular diversity and combinatorial chemistry. Current Opinion in Chemical Biology, 2(3):376–380, 1998.
[25] Ana G Maldonado, JP Doucet, Michel Petitjean, and Bo-Tao Fan. Molecular similarity and diversity in chemoinformatics: from theory to applications. Molecular Diversity, 10(1):39–79, 2006.
[26] Mark Ashton, John Barnard, Florence Casset, Michael Charlton, Geoffrey Downs, Dominique Gorse, John Holliday, Roger Lahana, and Peter Willett. Identification of diverse database subsets using property-based and fragment-based molecular descriptions. Quantitative Structure-Activity Relationships, 21(6):598–604, 2002.
[27] David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
[28] David D Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994, pages 148–156. Elsevier, 1994.
[29] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2(Nov):45–66, 2001.
[30] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd International Conference on Machine Learning, pages 1081–1088. ACM, 2006.
[31] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 287–294. ACM, 1992.
[32] Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active learning by querying informative and representative examples. In Advances in Neural Information Processing Systems, pages 892–900, 2010.
[33] Anthony Atkinson, Alexander Donev, and Randall Tobias. Optimum Experimental Designs, with SAS, volume 34. Oxford University Press, 2007.
[34] Kirstine Smith. On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations. Biometrika, 12(1/2):1–85, 1918.
[35] Norman R Draper and Harry Smith. Applied Regression Analysis, volume 326. John Wiley & Sons, 1998.
[36] Toby J Mitchell. An algorithm for the construction of "D-optimal" experimental designs. Technometrics, 16(2):203–210, 1974.
[37] Jack Sherman and Winifred J Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.
[38] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[39] Michel Tokic and Günther Palm. Value-difference based exploration: adaptive control between epsilon-greedy and softmax. In Annual Conference on Artificial Intelligence, pages 335–346. Springer, 2011.
[40] Michael Wunder, Michael L Littman, and Monica Babes. Classes of multiagent Q-learning dynamics with epsilon-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1167–1174. Citeseer, 2010.
[41] Raghunathan Ramakrishnan and O Anatole von Lilienfeld. Many molecular properties from one kernel in chemical space. CHIMIA International Journal for Chemistry, 69(4):182–186, 2015.
[42] Lorenz C Blum and Jean-Louis Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. Journal of the American Chemical Society, 131(25):8732–8733, 2009.
[43] Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108(5):058301, 2012.
[44] Philipp Buerger, Jethro Akroyd, Jacob W Martin, and Markus Kraft. A big data framework to validate thermodynamic data for chemical species. Combustion and Flame, 176:584–591, 2017.
[45] P Linstrom and W Mallard. NIST chemistry webbook. NIST Standard Reference Database, (69):20899, 2005.
[46] Geun Ho Gu and Dionisios G Vlachos. Group additivity for thermochemical property estimation of lignin monomers on Pt(111). The Journal of Physical Chemistry C, 120(34):19234–19241, 2016.

Supplemental Materials: Designing compact training sets for data-driven molecular property prediction

S1: Generalized pathway fingerprints for molecular representation
Figure 8: Illustration of the linear substructures and correction terms of the modified pathway fingerprints on propane and a fused-ring molecule. The notation includes (1) the SMILES scheme, (2) the linear substructures (FP), and (3) the correction terms.

Figure 8 illustrates the modified pathway fingerprints. The fingerprints consist of two parts: the linear substructures (FP), which enumerate paths of one to seven atoms emanating from each atom of the molecule, and the correction terms (Correction), which introduce additional ring information into the fingerprints. Ring information is lumped by the correction terms into the presence of aromaticity and the ring size, while fused-ring information is lumped into the sizes of the two adjacent rings. The lumped ring information is appended to the feature vector of linear substructures.
S2: Regularization value selection for the active learning algorithm
When the initial training set is built or a molecule is added to the training set, LASSO regression is performed to build the sparse model. The regularization value λ determines the number of selected features; its value is chosen from a range of candidates decreasing by factors of 2 from an upper bound λ_upper to a lower bound λ_low, where λ_upper is a predefined large value that allows only a few features to be selected when the training set is small, and λ_low is the ten-fold cross validation optimal regularization value based on the initial training set. At each iteration, the regularization value decreases from λ_upper toward λ_low in these base-2 logarithmic steps, stopping either at the value where the number of selected features exceeds the number of current training molecules or when the lower bound is reached. In the former case, we multiply the stopping value by 2 and use the result for LASSO, allowing the maximum number of features to be selected; in the latter case, we use the lower bound λ_low.

Sometimes the number of features determined by the regularization value is only slightly less than the number of molecules in the early iteration steps, which causes severe overfitting and a dramatic increase of the RMSE. To prevent this scenario, an additional restriction based on the prediction variance function, x_i^T (X^T X)^{-1} x_i, is added: if, across two consecutive iterations, the selected regularization value leads to a large increase of the average prediction variance function on the remaining set, we multiply the regularization value by 2 once to select fewer features. In this work, the multiplication is performed if the average prediction variance function increases by more than half of its value in the previous iteration. On the other hand, while more features can be included as the training set grows, in several cases this restriction prevents the regularization value from decreasing in the later stages even when the average prediction variance function is very small. The restriction is therefore removed, allowing more features to be selected, once the largest prediction variance function value falls below a small threshold indicating a stable model; in this work, the restriction is removed when the largest prediction variance function value is smaller than 1.5.

For some datasets, the lower bound λ_low determined by ten-fold cross validation on a Max-Min-selected initial training set is unreasonably large, which makes the features selected by LASSO too sparse to build an accurate model throughout the whole iteration. When this scenario occurs, the regularization value decreases to the lower bound within a few iterations and remains there for the rest of the run. One possible explanation is that, due to the diversity of the initial training set, during ten-fold cross validation only fragments common to most molecules are selected by LASSO, while specific fragments contained in only a few molecules are pushed to zero, so a large regularization value is favored. In this work, this scenario occurs when applying the algorithm to the surface intermediates set with a Max-Min-selected initial training set; our current solution is simply to take the lower bound λ_low from a randomly selected initial training set instead, as it is more representative of the whole dataset.
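A sketch of this schedule is given below; the structure and names are ours, with the back-off rule as described above and the LASSO fit abstracted into a callback.

```python
def choose_regularization(lam_upper, lam_low, n_train, n_selected_features):
    """Walk lambda down by factors of 2 from lam_upper toward lam_low.

    n_selected_features(lam) fits LASSO at the given lambda and returns the
    number of nonzero coefficients. Stop when more features than training
    molecules would be selected (then back off by one factor of 2), or when
    the lower bound is reached.
    """
    lam = lam_upper
    while lam > lam_low:
        if n_selected_features(lam) > n_train:
            return lam * 2.0   # back off: too many features at this lambda
        lam /= 2.0
    return lam_low
```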
In principle, another round of cross validation could be performed in the middle of a run if this overly sparse model scenario arises.

S3: Five-fold cross validation results for the three datasets
Figure 9: Comparison of the average RMSE/MAE for different regularization values across the five models during five-fold cross validation for (a) the 6.5k gaseous molecules of the QM7 dataset with their heats of atomization, (b) the 598 hydrocarbons of the NIST chemistry webbook with their heats of formation, and (c) the 591 surface intermediates with their heats of formation.

Figure 9 shows the average RMSE/MAE corresponding to different regularization values for the five models during five-fold cross validation on the three datasets. The lowest RMSE value for each dataset is selected as the optimal value, which is used as the baseline for the learning rate comparison.