Uncertainty quantification of molecular property prediction with Bayesian neural networks
Seongok Ryu,† Yongchan Kwon,‡ and Woo Youn Kim∗,†,¶

†Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
‡Department of Statistics, Seoul National University, Seoul 08826, Republic of Korea
¶KI for Artificial Intelligence, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
E-mail: [email protected]
Abstract
Deep neural networks have outperformed existing machine learning models in various molecular applications. In practical applications, however, it is still difficult to make confident decisions because of the uncertainty in predictions arising from the insufficient quality and quantity of training data. Here, we show that Bayesian neural networks are useful to quantify the uncertainty of molecular property prediction with three numerical experiments. In particular, this approach enables us to decompose the predictive variance into the model- and data-driven uncertainties, which helps to elucidate the source of errors. In the logP predictions, we show that data noise affected the data-driven uncertainties more significantly than the model-driven ones. Based on this analysis, we were able to find unexpected errors in the Harvard Clean Energy Project dataset. Lastly, we show that the confidence of prediction is closely related to the predictive uncertainty by performing bio-activity and toxicity classification experiments.

Introduction

Modern deep neural network (DNN) models have been used in various molecular applications, such as high-throughput screening for drug discovery, de novo molecular design and planning chemical reactions.
DNNs show comparable or sometimes better performance than traditional approaches grounded on quantum chemical theories in predicting some molecular properties, if a vast amount of well-qualified data is secured. Despite the remarkable potential of DNN models, the direct use of their outputs is sometimes limited because most data in practical applications is likely to involve undesirable problems caused by the lack of both data quality and quantity. Such data discourages a reliable statistical analysis based on DNN models, since their accuracy critically depends on training data. For example, Feinberg et al. mentioned that more qualified data should be provided to improve the prediction accuracy on drug-target interactions, which is a key step for drug discovery. The number of ligand-protein complex samples in the PDB-bind database is only about 15,000, limiting the development of reliable DNN models. In order to prepare more qualified data, expensive and time-consuming experiments are inevitable. Synthetic data from computations can be used as an alternative, like the Harvard Clean Energy Project set, but it often suffers from unintentional errors caused by the approximation methods employed. In addition, data-inherent bias and noise hurt the quality of data; the Tox21 and DUD-E datasets are such examples. The number of data points in the Tox21 dataset is less than 10,000, and there are far more negative samples than positive samples: of the various toxic types, the lowest percentage of positive samples is 2.9% and the highest is 15.5%. The DUD-E dataset is highly imbalanced in that the number of decoy samples is almost 50 times larger than that of active samples. All of these situations hinder the development of reliable models. It has been stressed in deep learning research that uncertainty analysis is necessary to address the so-called AI-safety problems, because even though DNNs push the bounds of data-driven approaches, they often make catastrophic decisions.
Uncertainty analysis has been performed to analyze the processes of decision making with deep neural networks. Kendall and Gal studied quantitative uncertainty analysis on computer vision problems by using Bayesian neural networks (BNNs). They separated model- and data-driven uncertainties, which helps to identify the sources of prediction errors. This is possible because Bayesian inference allows uncertainty assessments, giving probabilistic interpretations of model outputs. In this paper, we propose to exploit BNNs to quantify the uncertainties implied in molecular property predictions. Previous studies on uncertainty quantification have regarded a predictive variance as a predictive uncertainty.
The predictive uncertainty can be decomposed into (i) an aleatoric uncertainty arising from data noise and (ii) an epistemic uncertainty arising from the incompleteness of the model. We adopt the same method in this study. As a DNN model for molecular applications, we use augmented graph convolutional networks (GCNs).
In what follows, we briefly introduce BNNs, the uncertainty quantification methods based on Bayesian inference, and the augmented GCN used in this work. Then, we show the results of uncertainty analysis on three experimental studies. The main results are summarized as follows.

• We first applied the Bayesian GCN to a simple example, the logP prediction of molecules in the ZINC set, in order to demonstrate the uncertainty quantification in molecular applications. As expected, the aleatoric uncertainty increases as the data noise increases, while the epistemic uncertainty depends only slightly on the quality of data.

• Second, we evaluate the quality of synthetic data and find erroneous samples fabricated by poor approximations. The Harvard Clean Energy Project (CEP) set contains synthetic power conversion efficiency (PCE) values of molecules. We noted that molecules with exactly zero values have a conspicuously large aleatoric uncertainty, and these values have been verified as incorrect annotations.

• In the last example, for the binary classification of bio-activity and toxicity, we studied the relationship between predicted probability and uncertainties. Our analysis shows that predictions with a lower uncertainty turned out to be more accurate, indicating that the uncertainty can be regarded as the confidence of prediction.
Theoretical backgrounds
Bayesian neural network
For a given training set {X, Y}, let p(Y | X, w) and p(w) be a model likelihood and a prior distribution for a parameter w ∈ Ω, respectively. Under the Bayesian framework, the model parameter and output are considered as random variables. The posterior distribution is given by

\[ p(w \mid X, Y) = \frac{p(Y \mid X, w)\, p(w)}{p(Y \mid X)} \tag{1} \]

and the predictive distribution is defined as

\[ p(y^* \mid x^*, X, Y) = \int_\Omega p(y^* \mid x^*, w)\, p(w \mid X, Y)\, dw \tag{2} \]

for a new input x* and an output y*. These simple formulations make the two following tasks possible: (i) assessing uncertainty of the random variables in a conditional manner and (ii) predicting a distribution of the new output y* given both the new input x* and the training set {X, Y}.

However, direct computation of eq. (2) is often infeasible when deep neural network models are exploited, because the integration over the whole parameter space Ω entails heavy computational costs. Many practical approximation methods have been proposed to handle this computational cost. Variational inference, one of the most popular approximation methods, approximates the posterior distribution with a tractable distribution q_θ(w) parametrized by a variational parameter θ. Minimizing the Kullback-Leibler divergence,

\[ \mathrm{KL}\big(q_\theta(w) \,\|\, p(w \mid X, Y)\big) = \int_\Omega q_\theta(w) \log \frac{q_\theta(w)}{p(w \mid X, Y)}\, dw, \tag{3} \]

makes the two distributions similar to one another in principle. We can replace the intractable posterior distribution in (3) with p(Y | X, w) p(w) due to Bayes' theorem (1). Then, our minimization objective, called the negative evidence lower bound, is

\[ \mathcal{L}_{\mathrm{VI}}(\theta) = -\int_\Omega q_\theta(w) \log p(Y \mid X, w)\, dw + \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big). \tag{4} \]

In order to implement Bayesian models, we need to be cautious in choosing a variational distribution q_θ(w). Blundell et al. proposed to use a product of Gaussian distributions for the variational distribution q_θ(w).
In addition, a multiplicative normalizing flow can be applied to increase the expressive power of the variational distribution. However, these two approaches often require a large number of weight parameters. Monte-Carlo dropout (MC-dropout), which uses a dropout variational distribution, approximates the posterior distribution by a product of Bernoulli distributions. MC-dropout is practical in that it does not need extra learnable parameters to model the variational posterior distribution, and the integration over the whole parameter space can be easily approximated by summing over models sampled with a Monte-Carlo estimator.
Thus, we adopted MC-dropout in this work.
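The MC-dropout procedure described above can be sketched as follows. This is a minimal numpy illustration, not the paper's actual GCN: the toy weights, layer sizes and dropout rate are all hypothetical, and the only point is that keeping dropout active at prediction time and averaging T stochastic forward passes approximates the predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy weights for a single hidden layer; in the paper the model
# is the augmented GCN, used here only to illustrate MC-dropout sampling.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def stochastic_forward(x, p_drop=0.2):
    """One forward pass with a freshly sampled Bernoulli dropout mask.

    Keeping dropout active at prediction time corresponds to drawing one
    weight sample w_t from the dropout variational distribution q_theta(w).
    """
    h = np.maximum(x @ W1, 0.0)                       # ReLU hidden layer
    mask = rng.binomial(1, 1.0 - p_drop, size=h.shape) / (1.0 - p_drop)
    return (h * mask) @ W2                            # dropped-out output

def mc_dropout_predict(x, T=100):
    """Approximate the predictive mean and variance with T MC samples."""
    samples = np.stack([stochastic_forward(x) for _ in range(T)])  # (T, N, 1)
    return samples.mean(axis=0), samples.var(axis=0)

x = rng.normal(size=(3, 4))     # three hypothetical input feature vectors
mean, var = mc_dropout_predict(x)
print(mean.shape, var.shape)    # per-sample predictive mean and variance
```

Note that no extra parameters are introduced: the same dropout layers used for training provide the stochasticity at test time.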
Uncertainty quantification with Bayesian neural network
A variational inference approximating a posterior with a variational distribution q_θ(w) provides a variational predictive distribution of a new output y* given a new input x*:

\[ q_\theta^*(y^* \mid x^*) = \int_\Omega q_\theta(w)\, p\big(y^* \mid f^{w}(x^*)\big)\, dw, \tag{5} \]

where f^w(x*) is a model output with a given w. For regression tasks, a predictive mean of this distribution with T rounds of MC sampling is estimated by

\[ \hat{\mathbb{E}}[y^* \mid x^*] = \frac{1}{T} \sum_{t=1}^{T} f^{\hat{w}_t}(x^*), \tag{6} \]

and a predictive variance is estimated by

\[ \widehat{\mathrm{Var}}[y^* \mid x^*] = \sigma^2 I + \frac{1}{T} \sum_{t=1}^{T} f^{\hat{w}_t}(x^*)^{T} f^{\hat{w}_t}(x^*) - \hat{\mathbb{E}}[y^* \mid x^*]^{T}\, \hat{\mathbb{E}}[y^* \mid x^*], \tag{7} \]

with ŵ_t drawn from q_θ(w) at the sampling step t and the assumption p(y* | f^w(x*)) = N(y*; f^w(x*), σ²I). Here, the model assumes homoscedasticity with a known quantity, meaning that every data point gives a distribution with the same variance σ². Beyond this, obtaining distributions with different variances allows deducing a heteroscedastic uncertainty. Assuming heteroscedasticity, the output given the t-th sample ŵ_t is

\[ [\hat{y}^*_t, \hat{\sigma}^2_t] = f^{\hat{w}_t}(x^*). \tag{8} \]

The heteroscedastic predictive uncertainty given by (9) can be partitioned into two different uncertainties: aleatoric and epistemic.

\[ \widehat{\mathrm{Var}}[y^* \mid x^*] = \underbrace{\frac{1}{T}\sum_{t=1}^{T} (\hat{y}^*_t)^2 - \Big(\frac{1}{T}\sum_{t=1}^{T} \hat{y}^*_t\Big)^2}_{\text{epistemic}} + \underbrace{\frac{1}{T}\sum_{t=1}^{T} \hat{\sigma}^2_t}_{\text{aleatoric}}. \tag{9} \]

The aleatoric uncertainty arises from data-inherent noise, while the epistemic uncertainty is related to model incompleteness. Note that the latter can be reduced by increasing the amount of training data, because it comes from an insufficient amount of data as well as the use of an inappropriate model. In classification problems, Kwon et al. proposed a natural way to quantify aleatoric and epistemic uncertainties as follows.
\[ \widehat{\mathrm{Var}}[y^* \mid x^*] = \underbrace{\frac{1}{T}\sum_{t=1}^{T} (\hat{y}^*_t - \bar{y})(\hat{y}^*_t - \bar{y})^{T}}_{\text{epistemic}} + \underbrace{\frac{1}{T}\sum_{t=1}^{T} \big(\mathrm{diag}(\hat{y}^*_t) - \hat{y}^*_t (\hat{y}^*_t)^{T}\big)}_{\text{aleatoric}}, \tag{10} \]

where \(\bar{y} = \sum_{t=1}^{T} \hat{y}^*_t / T\) and \(\hat{y}^*_t = \mathrm{softmax}(f^{\hat{w}_t}(x^*))\). While Kendall and Gal's method requires extra parameters σ̂²_t at the last hidden layer and often causes unstable parameter updates in the training phase, the method of Kwon et al. has the advantage that models do not need the extra parameters. Equation (10) also utilizes a functional relationship between the mean and variance of multinomial random variables. We refer to Kwon et al. for more details.
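The two decompositions, eq. (9) for regression and eq. (10) for classification, can be computed directly from the stacked MC-dropout outputs. The sketch below is a numpy illustration under the assumption that the T stochastic passes have already been collected into arrays; the function and variable names are ours, not from the paper's code.

```python
import numpy as np

def regression_uncertainties(y_hat, sigma2_hat):
    """Heteroscedastic decomposition of eq. (9).

    y_hat:      (T, N) predicted means from T MC-dropout passes
    sigma2_hat: (T, N) predicted variances sigma_t^2 from the same passes
    """
    epistemic = (y_hat ** 2).mean(axis=0) - y_hat.mean(axis=0) ** 2
    aleatoric = sigma2_hat.mean(axis=0)
    return epistemic, aleatoric

def classification_uncertainties(probs):
    """Decomposition of eq. (10) from Kwon et al.

    probs: (T, N, C) softmax outputs y_hat_t from T MC-dropout passes.
    Returns per-sample (N, C, C) epistemic and aleatoric covariance matrices.
    """
    T, _, C = probs.shape
    mean = probs.mean(axis=0, keepdims=True)          # y_bar
    diff = probs - mean                               # y_hat_t - y_bar
    # epistemic: average outer product of deviations from the MC mean
    epistemic = np.einsum('tni,tnj->nij', diff, diff) / T
    # aleatoric: average of diag(y_hat_t) - y_hat_t y_hat_t^T
    diag = np.einsum('ij,tni->tnij', np.eye(C), probs)
    outer = np.einsum('tni,tnj->tnij', probs, probs)
    aleatoric = (diag - outer).mean(axis=0)
    return epistemic, aleatoric
```

A useful sanity check on eq. (10) is that the diagonal of the epistemic term equals the elementwise variance of the class probabilities across the T samples.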
Graph convolutional network for molecular property predictions
Molecules, social graphs, images and language sentences can be represented as graph structures. The GCN is one of the most popular graph neural networks and is widely adopted to process molecular graphs. The input to the GCN is G = (A, X), where A ∈ R^{N×N} is an adjacency matrix with the number of nodes N and X = H^{(0)} ∈ R^{N×F_inp} is a set of initial node features whose dimensionality is F_inp. The GCN gives new node features as follows:

\[ H^{(l+1)} = \mathrm{ReLU}\big(A H^{(l)} W^{(l)}\big), \tag{11} \]

where H^{(l)} ∈ R^{N×F} and W^{(l)} ∈ R^{F×F} are the node features and weight parameters of the l-th graph convolution layer for l ∈ {0, ..., L−1}, respectively. The GCN updates node features H^{(l+1)} with information from only adjacent nodes.

Applying self-attention enables the GCN to learn relations between node pairs by reflecting the importance of adjacent nodes. Updating node features with K-head self-attention is given by

\[ \tilde{H}^{(l+1)}_i = \Big[\mathrm{ReLU}\Big(\sum_{j \in \mathcal{N}_i} \alpha^{(l)}_{ij,1} H^{(l)}_j W^{(l)}_1\Big), \ldots, \mathrm{ReLU}\Big(\sum_{j \in \mathcal{N}_i} \alpha^{(l)}_{ij,K} H^{(l)}_j W^{(l)}_K\Big)\Big] W^{(l)}_O, \tag{12} \]

where 𝒩_i denotes the adjacent nodes of the i-th node, H^{(l)}_j ∈ R^{1×F} is the j-th node feature updated at the l-th graph convolution, W^{(l)}_k ∈ R^{F×F} is a weight parameter for the k-th attention head, W^{(l)}_O ∈ R^{KF×F} is a weight parameter to combine the node features from the K different attention heads, and the attention coefficient α^{(l)}_{ij,k} is given by

\[ \alpha^{(l)}_{ij,k} = \tanh\big((H^{(l)}_i W^{(l)}_k)\, C^{(l)}_k\, (H^{(l)}_j W^{(l)}_k)^{T}\big), \tag{13} \]

where C^{(l)}_k ∈ R^{F×F} is a weight parameter.

In addition, the GCN has room for improvement because its accuracy gradually degrades as the number of graph convolution layers increases. We used a gated skip connection to prevent this problem as follows.
\[ H^{(l+1)} = r \odot \tilde{H}^{(l+1)} + (1 - r) \odot H^{(l)}, \qquad r = \mathrm{sigmoid}\big(U_{r,1} H^{(l)} + U_{r,2} \tilde{H}^{(l+1)} + b_r\big), \tag{14} \]

where U_{r,1} and U_{r,2} are trainable parameters and ⊙ denotes the Hadamard product.

After computing the node features L times by following eq. (14), a graph feature z_G ∈ R^{d_G} is aggregated as the summation of all node features in a set of nodes 𝒱,

\[ z_G = \sum_{v \in \mathcal{V}} \mathrm{MLP}\big(H^{(L)}_v\big), \tag{15} \]

where MLP denotes a multi-layer perceptron. The graph feature is invariant to permutations of the node states. A molecular property, which is the final output from the model, is a function of the graph feature:

\[ y_{\mathrm{pred}} = \mathrm{MLP}(z_G). \tag{16} \]

Implementation details

Model architecture
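As a concrete illustration of the layer updates above, the plain graph convolution of eq. (11) combined with the gated skip connection of eq. (14) can be sketched in numpy. This is a simplified sketch: the weights are random placeholders, the multi-head attention of eqs. (12)-(13) is omitted for brevity, and the readout MLP of eqs. (15)-(16) is replaced by an identity map.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_gated_layer(A, H, W, U_r1, U_r2, b_r):
    """One graph convolution (eq. 11) followed by a gated skip connection
    (eq. 14). The attention augmentation of eqs. (12)-(13) is omitted here.
    """
    H_new = np.maximum(A @ H @ W, 0.0)              # ReLU(A H W), eq. (11)
    r = sigmoid(H @ U_r1 + H_new @ U_r2 + b_r)      # gate coefficients
    return r * H_new + (1.0 - r) * H                # eq. (14)

# Toy graph: 5 nodes, 8 features per node (hypothetical sizes).
N, F = 5, 8
A = rng.integers(0, 2, size=(N, N)).astype(float)
A = ((A + A.T) > 0).astype(float)
np.fill_diagonal(A, 1.0)                            # add self-loops
H = rng.normal(size=(N, F))
W, U1, U2 = (rng.normal(size=(F, F)) for _ in range(3))
b = np.zeros(F)

H_next = gcn_gated_layer(A, H, W, U1, U2, b)

# Readout of eq. (15), with the MLP dropped for brevity: a sum over nodes,
# which makes the graph feature invariant to node permutations.
z_G = H_next.sum(axis=0)
print(H_next.shape, z_G.shape)
```

Because the gate r mixes the updated features with the previous ones elementwise, stacking several such layers avoids the accuracy degradation noted above.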
Figure 1: The architecture of the Bayesian GCN used in this work. (a) The entire model is composed of three augmented graph convolutional layers, readout layers and three linear layers with non-linear activation. (b) Detailed description of the graph convolution layer augmented with attention and gate mechanisms. We added dropout layers in order for the model parameters to have stochasticity.

As illustrated in Figure 1, the graph convolutional MC-dropout network used in this work consists of the following three parts:

• Three augmented graph convolution layers update node features according to (14). The number of self-attention heads is four. The dimension of the output from each layer is (N × F), with N = 75.

• A readout function produces a graph feature whose dimension d_G is 256 by following (15).

• A feed-forward MLP, composed of two fully-connected layers, outputs a molecular property. The hidden dimension of each fully-connected layer is 256.

In order for the model parameters to have stochasticity, we applied dropout at every hidden layer. Note that we did not use the standard dropout with a pre-defined dropout rate, but used Concrete dropout to develop as accurate a Bayesian model as possible. By using Concrete dropout, we can obtain an optimal dropout rate for each individual hidden layer by stochastic optimization. We used Gaussian priors N(0, l) with length scale l = 10− for all model parameters. In the training phase, we used the Adam optimizer with an initial learning rate of 10−, and the learning rate is decayed by half every 10 epochs. The total number of training epochs is 100 and the batch size is 100. We randomly split the datasets in the ratio (0.72 : 0.08 : 0.2) for training, validation and test. The code used for the experiments is available at https://github.com/seongokryu/uq-molecule.

Experiments
Implication of data quality on aleatoric and epistemic uncertainties
Figure 2: Histograms of (a) aleatoric, (b) epistemic and (c) total uncertainties as the amount of additive noise σ increases.

In this experiment, we applied the uncertainty quantification method to a simple example, logP prediction. We chose this example because we can obtain the logP value of molecules from the analytic expression implemented in the RDKit, without data-inherent noise. To examine the effect of data quality on uncertainties, we adjusted the extent of noise in logP by adding random Gaussian noise ε ∼ N(0, σ). We trained the model with 97,287 samples and analyzed the uncertainties of each predicted logP for 27,023 samples. The samples were chosen randomly from the ZINC dataset.

Figure 2 shows the distribution of the three uncertainties as a function of the amount of additive noise σ. As the noise level increases, the aleatoric and total uncertainties increase, but the epistemic uncertainty changes only slightly. This result verifies that the aleatoric uncertainty arises from data-inherent noise, while the epistemic uncertainty does not depend on data quality. Theoretically, the epistemic uncertainty should not increase with the amount of data noise; we conjecture that its slight change arises from the stochastic numerical optimization of model parameters.

Evaluating quality of synthetic data based on uncertainty analysis
Figure 3: (a) Aleatoric, (b) epistemic, (c) total uncertainties and (d) predicted PCE against the PCE value in the dataset. The samples colored in red show a total uncertainty greater than two.

Based on the analysis in the previous experiment, we attempted to evaluate the quality of synthetic data. The synthetic PCE values in the CEP dataset were obtained from the Scharber model with statistical approximations. In this procedure, unintentional errors can be introduced into the resulting synthetic data. Since the aleatoric uncertainty arises from data quality, we evaluated the quality of the synthetic data by analyzing the uncertainties of the predicted PCE values. We used the same dataset as Duvenaud et al. for training and test.

Figure 3 shows the scatter plot of the three uncertainties in the CEP predictions for 5,995 molecules in the test set. Samples with a total uncertainty greater than two are highlighted in red. Some samples with large PCE values above eight had relatively large total uncertainties; their PCE values deviated considerably from the black line in Figure 3-(d). More interestingly, we found that most molecules with a zero PCE value had large total uncertainties as well. Those large uncertainties came from the aleatoric uncertainty, as depicted in Figure 3-(a), indicating that the data quality of those particular samples is relatively poor. Hence, we speculated that data-inherent noise might cause large prediction errors.

To elaborate the origin of such errors, we investigated the procedure used to obtain the PCE values. The Harvard Organic Photovoltaic Dataset contains both experimental and synthetic PCE values of 350 organic photovoltaic materials. The synthetic PCE values were computed according to (17), which is the result of the Scharber model:

\[ \mathrm{PCE} \propto V_{OC} \times FF \times J_{SC}, \tag{17} \]

where V_{OC} is the open circuit potential, FF is the fill factor, and J_{SC} is the short circuit current density. FF was set to 65%.
V_{OC} and J_{SC} were obtained from electronic structure calculations of the molecules. We found that J_{SC} of some molecules was zero or nearly zero, resulting in zero or almost-zero synthetic PCE values, in contrast to their non-zero experimental PCE values. In particular, J_{SC} and PCE values computed using the M06-2X functional were almost zero consistently. We suspect that those approximated values caused a significant drop in data quality, resulting in large aleatoric uncertainties, as highlighted in Figure 3. Consequently, the data noise due to poorly fabricated data was identified through the large aleatoric uncertainties.

https://github.com/HIPS/neural-fingerprint

Uncertainty as confidence indicator: bio-activity and toxicity classification

Figure 4: (a) Aleatoric, (b) epistemic and (c) total uncertainty of predicted probabilities in the classification of bio-activity against the EGFR target.

In this experiment, we demonstrate that uncertainty analysis can lead to reliable classification. In classification problems, one tends to interpret the final outputs from a sigmoid or softmax activation as a confidence, meaning that the higher the output probability, the higher the prediction accuracy. However, as Gal and Ghahramani pointed out, such an interpretation is erroneous. Thus, we applied uncertainty quantification to the bio-activity and toxicity classification problems and show that the predictive uncertainty can be used as the confidence of outcomes.

Figure 5: Test accuracy for the classifications of (a) bio-activities against the five target proteins in the DUD-E set and (b) the five toxic effects in the Tox21 set.

We trained the Bayesian GCN using 25,627 molecules with labels for EGFR activity in the DUD-E dataset. Figure 4 shows the results for 7,118 molecules in the test set. In order for the predictive uncertainty to be interpreted as a confidence, its value should be minimal at an output probability of zero or one and maximal at a probability of 0.5.
Indeed, the total uncertainty predicted from our model shows such behaviour. In other words, more uncertain outcomes have output probabilities closer to 0.5. We also noted that the aleatoric uncertainty affected the total uncertainty more significantly than the epistemic uncertainty did.

To further investigate the relationship between accuracy and uncertainty, we trained the Bayesian GCN for various bio-activity labels in the DUD-E dataset and toxicity labels in the Tox21 dataset. Then, we sorted the molecules in order of increasing uncertainty and divided them into five groups, such that the molecules in the i-th group fall into the i-th of five equal-width total-uncertainty intervals. As shown in Figure 5, the groups with lower uncertainty achieved higher test accuracy.

Conclusion
Deep neural network models show promising performance in the prediction of molecular properties. In practical applications, however, a lack of data quality and quantity discourages developing accurate models. To make reliable decisions in such a case, we have proposed to analyze uncertainties in the prediction results by using the Bayesian GCN.

Our first experiment on the logP prediction showed that data-inherent noise can be identified by the aleatoric uncertainty. The aleatoric uncertainty in the predicted logP values increases as the amount of noise increases. In contrast, the epistemic uncertainty depends only slightly on the data noise, as expected. In the second experiment, we applied the uncertainty analysis to the Harvard Clean Energy Project dataset. We were able to identify erroneous data by noting the abnormally increased aleatoric uncertainty in the poorly approximated synthetic data, which is helpful for finding the source of the errors. In the third experiment, on bio-activity and toxicity predictions, we showed that the uncertainty is closely related to the confidence of prediction for binary classification problems. When grouping the molecules in increasing order of uncertainty, the groups with lower uncertainty show higher accuracy than those with higher uncertainty.

We have demonstrated how useful uncertainty quantification is in molecular applications. By using the Bayesian GCN, we can analyze the quality of data that is often noisy because of the stochastic nature of experimental results. From the relationship between output probability and confidence of prediction, it is possible to selectively extract more reliable results from the entire set of predictions, which is critical to making desirable decisions. Such analysis can be used to screen bio-active and toxic molecules, where reliable prediction is vital. We believe that our study on the uncertainty quantification of molecular properties offers insights to tackle AI-safety problems in molecular applications.
References

(1) Gomes, J.; Ramsundar, B.; Feinberg, E. N.; Pande, V. S. Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603.
(2) Jiménez, J.; Skalic, M.; Martínez-Rosell, G.; De Fabritiis, G. KDEEP: Protein–Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks. Journal of Chemical Information and Modeling, 287–296.
(3) Mayr, A.; Klambauer, G.; Unterthiner, T.; Hochreiter, S. DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science, 80.
(4) Öztürk, H.; Özgür, A.; Ozkirimli, E. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics, i821–i829.
(5) De Cao, N.; Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.
(6) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 268–276.
(7) Guimaraes, G. L.; Sanchez-Lengeling, B.; Outeiral, C.; Farias, P. L. C.; Aspuru-Guzik, A. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843.
(8) Jin, W.; Barzilay, R.; Jaakkola, T. Junction Tree Variational Autoencoder for Molecular Graph Generation. arXiv preprint arXiv:1802.04364.
(9) Kusner, M. J.; Paige, B.; Hernández-Lobato, J. M. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925.
(10) Li, Y.; Vinyals, O.; Dyer, C.; Pascanu, R.; Battaglia, P. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324.
(11) Segler, M. H.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 120–131.
(12) You, J.; Liu, B.; Ying, R.; Pande, V.; Leskovec, J. Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. arXiv preprint arXiv:1806.02473.
(13) Segler, M. H.; Preuss, M.; Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 604.
(14) Wei, J. N.; Duvenaud, D.; Aspuru-Guzik, A. Neural networks for the prediction of organic chemistry reactions. ACS Central Science, 725–732.
(15) Zhou, Z.; Li, X.; Zare, R. N. Optimizing chemical reactions with deep reinforcement learning. ACS Central Science, 1337–1344.
(16) Faber, F. A.; Hutchison, L.; Huang, B.; Gilmer, J.; Schoenholz, S. S. et al. Prediction errors of molecular machine learning models lower than hybrid DFT error. Journal of Chemical Theory and Computation, 5255–5264.
(17) Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; Dahl, G. E. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.
(18) Schütt, K.; Kindermans, P.-J.; Felix, H. E. S.; Chmiela, S.; Tkatchenko, A. et al. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing Systems. 2017; pp 991–1001.
(19) Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K. R.; Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 13890.
(20) Smith, J. S.; Isayev, O.; Roitberg, A. E. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science, 3192–3203.
(21) Feinberg, E. N.; Sur, D.; Husic, B. E.; Mai, D.; Li, Y. et al. Spatial Graph Convolutions for Drug Discovery. arXiv preprint arXiv:1803.04465.
(22) Liu, Z.; Su, M.; Han, L.; Liu, J.; Yang, Q. et al. Forging the basis for developing protein–ligand interaction scoring functions. Accounts of Chemical Research, 302–309.
(23) Hachmann, J.; Olivares-Amaya, R.; Atahan-Evrenk, S.; Amador-Bedolla, C.; Sánchez-Carrera, R. S. et al. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. The Journal of Physical Chemistry Letters, 2241–2251.
(24) Mysinger, M. M.; Carchia, M.; Irwin, J. J.; Shoichet, B. K. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. Journal of Medicinal Chemistry, 6582–6594.
(25) Gal, Y. Uncertainty in deep learning. University of Cambridge.
(26) Begoli, E.; Bhattacharya, T.; Kusnezov, D. The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence, 20.
(27) McAllister, R.; Gal, Y.; Kendall, A.; Van Der Wilk, M.; Shah, A. et al. Concrete problems for autonomous vehicle safety: advantages of Bayesian deep learning. 2017.
(28) Kendall, A.; Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems. 2017; pp 5574–5584.
(29) Kwon, Y.; Won, J.-H.; Kim, B. J.; Paik, M. C. Uncertainty quantification using Bayesian neural networks in classification: Application to ischemic stroke lesion segmentation. International Conference on Medical Imaging with Deep Learning. 2018.
(30) Der Kiureghian, A.; Ditlevsen, O. Aleatory or epistemic? Does it matter? Structural Safety, 105–112.
(31) Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T. et al. Convolutional networks on graphs for learning molecular fingerprints. Advances in Neural Information Processing Systems. 2015; pp 2224–2232.
(32) Kipf, T. N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
(33) Ryu, S.; Lim, J.; Kim, W. Y. Deeply learning molecular structure-property relationships using graph attention neural network. arXiv preprint arXiv:1805.10988.
(34) Irwin, J. J.; Shoichet, B. K. ZINC - A free database of commercially available compounds for virtual screening. Journal of Chemical Information and Modeling, 177–182.
(35) Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
(36) Graves, A. Practical variational inference for neural networks. Advances in Neural Information Processing Systems. 2011; pp 2348–2356.
(37) Louizos, C.; Welling, M. Multiplicative normalizing flows for variational Bayesian neural networks. arXiv preprint arXiv:1703.01961.
(38) Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 1929–1958.
(39) Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. International Conference on Machine Learning. 2016; pp 1050–1059.
(40) Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V. et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
(41) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L. et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017; pp 5998–6008.
(42) Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P. et al. Graph attention networks. arXiv preprint arXiv:1710.10903.
(43) Gal, Y.; Hron, J.; Kendall, A. Concrete dropout. Advances in Neural Information Processing Systems. 2017; pp 3581–3590.
(44) Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
(45) Landrum, G. RDKit: Open-source cheminformatics. 2006.
(46) Scharber, M. C.; Mühlbacher, D.; Koppe, M.; Denk, P.; Waldauf, C. et al. Design rules for donors in bulk-heterojunction solar cells—Towards 10% energy-conversion efficiency. Advanced Materials, 789–794.
(47) Lopez, S. A.; Pyzer-Knapp, E. O.; Simm, G. N.; Lutzow, T.; Li, K. et al. The Harvard organic photovoltaic dataset. Scientific Data, 160086.
(48) Zhao, Y.; Truhlar, D. G. The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent interactions, excited states, and transition elements: two new functionals and systematic testing of four M06-class functionals and 12 other functionals. Theoretical Chemistry Accounts, 215–241.