Bayesian Graph Neural Networks for Molecular Property Prediction
George Lamb
University College London [email protected]
Brooks Paige
University College London [email protected]
Abstract
GNNs for molecular property prediction are frequently underspecified by data and fail to generalise to new scaffolds at test time. A potential solution is Bayesian learning, which can capture our uncertainty in the model parameters. This study benchmarks a set of Bayesian methods applied to a directed MPNN, using the QM9 regression dataset. We find that capturing uncertainty in both readout and message passing parameters yields enhanced predictive accuracy, calibration, and performance on a downstream molecular search task.
Graph neural networks (GNNs) are the state-of-the-art approach to molecular property prediction (Duvenaud et al., 2015; Gilmer et al., 2017; Wu et al., 2018; Yang et al., 2019). A GNN operates on the graph structure of a molecule in two phases. In the message passing phase, a molecular representation is learned by passing messages between atom or bond states. In the readout phase, a feed forward network (FFN) converts this representation into a prediction.
Motivation. The particular challenges of molecular property prediction marry well with the potential advantages of Bayesian learning. Generalisation is made difficult in cheminformatics by the concept of a molecular scaffold: the structural core of a compound to which functional groups are attached. Highly parameterised GNNs are prone to over-fit to training scaffolds, learning a poor molecular representation and failing to generalise at test time (Yang et al., 2019). Models are at risk of returning over-confident predictions when operating on new scaffolds, conveying little of the uncertainty associated with a new chemical space. Poorly quantified uncertainty makes it especially challenging to evaluate model robustness and out-of-domain applicability (Hirschfeld et al., 2020). We believe the best answer to these deficiencies is Bayesian modelling. Whereas a 'classical' neural network bets everything on one hypothesis, a Bayesian approach builds a predictive distribution by considering every possible setting of parameters. Bayesian marginalisation can improve the calibration (Maddox et al., 2019) and accuracy (Izmailov et al., 2019) of deep neural networks underspecified by data.
Related work. Two recent studies are particularly pertinent. Firstly, Hirschfeld et al. (2020) benchmark a set of methods for uncertainty quantification in molecular property prediction using the same GNN architecture that we employ in this paper. They find no consistently leading method, though replacing readout with a Gaussian process (GP) or random forest leads to reasonable performance across evaluation metrics. We extend the work of Hirschfeld et al. by considering four additional Bayesian methods (SWAG, SGLD, BBP and DUN). Secondly, Hwang et al. (2020) benchmark a set of Bayesian GNNs for molecular property prediction, assessing calibration and predictive accuracy across four classification datasets. They find that Stochastic Weight Averaging (SWA) and SWA-Gaussian (SWAG) demonstrate superior performance across metrics and datasets. We extend the work of Hwang et al. by (i) working in the regression setting where aleatoric and epistemic uncertainty are more explicitly separable, (ii) directly comparing a Bayesian readout phase with a full Bayesian GNN, and (iii) featuring a downstream molecular search experiment. We release PyTorch code at https://github.com/georgelamb19/chempropBayes.

Table 1: Accuracy (measured by Mean Rank), Miscalibration Area (MA) and Search Scores. For single models we present the mean and standard deviation across 5 runs; ensemble (Ensmbl.) scores are also reported for accuracy and MA. MAs are computed with post-hoc t-distribution likelihoods. Search Scores equate to the % of the top 1% of molecules discovered after 30 batch additions, for greedy and Thompson search. All Search Scores are computed for single models.

Method   Accuracy (Mean Rank)   Miscalibration Area   Search Score
         Single   Ensmbl.       Single   Ensmbl.      Greedy   Thompson
MAP      4.08

The D-MPNN. Our GNN is a directed message passing neural network (D-MPNN) (Yang et al., 2019), a variant of the MPNN family (Gilmer et al., 2017). The D-MPNN consistently matches or outperforms previous GNN architectures across datasets and split types (Yang et al., 2019). It has also demonstrated promise in a proof-of-concept antibiotic discovery pipeline (Stokes et al., 2020). The D-MPNN is the core of the Chemprop project (https://chemprop.readthedocs.io).

QM9. We perform all experiments on QM9. QM9 contains 12 geometric, energetic, electronic and thermodynamic properties for 133,885 small molecules (Ramakrishnan et al., 2014). Assessments of uncertainty calibration in Bayesian deep learning tend to focus on classification tasks (Lakshminarayanan et al., 2017; Maddox et al., 2019). We complement previous studies by exploring calibration and uncertainty quantification in a real-valued regression setting.
We implement eight separate methods.
MAP: Our baseline is classical maximum a posteriori training, in which we find the regularised maximum likelihood solution. GP: We replace the final layer of the readout FFN with a variational GP and train the resulting model end-to-end (deep kernel learning). The GP is a non-parametric Bayesian method which captures uncertainty in functional form. DropR: MC dropout uses a spike and slab variational distribution to view test time dropout as approximate variational inference (Gal and Ghahramani, 2016). 'DropR' is the implementation of MC dropout across readout FFN layers.
DropA: We separately implement MC dropout over the full GNN.
SWAG: Stochastic Weight Averaging (SWA) (Izmailov et al., 2018) computes an average of SGD iterates with a high constant learning rate schedule, providing improved generalisation. We implement SWA-Gaussian (Maddox et al., 2019), which builds on SWA by computing a 'low rank plus diagonal' covariance.
SGLD: Stochastic Gradient Langevin Dynamics (Welling and Teh, 2011) uses first order Langevin dynamics in the stochastic gradient setting. SGLD is a Markov Chain Monte Carlo (MCMC) method. Within this class of methods Hamiltonian Monte Carlo (HMC) (Neal, 1994) is the gold standard, but requires full gradients, which are intractable for modern neural networks. Chen et al. (2014) propose Stochastic Gradient HMC (SGHMC), but in practice tuning this method can be challenging.
BBP: Variational Bayesian (VB) methods fit a variational approximation to the true posterior by minimising a Kullback–Leibler (KL) divergence, or equivalently maximising an evidence lower bound (ELBO). Bayes by Backprop (BBP) (Blundell et al., 2015) assumes a fully factorised Gaussian posterior and utilises a reparameterisation trick to sample gradients; we also use 'local reparameterisation' (Kingma et al., 2015) as a variance reduction technique.
DUN: As an addition to the set of established methods above we implement a novel depth uncertainty network (DUN), which permits inference over both weights and the number of message passing iterations. Our DUN combines Bayes by Backprop with the 'vanilla' DUN proposed by Antorán et al. (2020), and is introduced in Appendix B. For context, Bayesian modelling is reviewed in Appendix A.
Figure 1: Reliability diagrams for single models given Gaussian likelihoods (left) and post-hoc t-distribution likelihoods (right). Each line on the diagrams is the average of 5 runs.

We perform 5 runs for each of the 8 methods, corresponding to different random seeds for weight initialisation. The runs enable an analysis of both 'single' models and model ensembles, the latter incorporating multiple posterior basins of attraction. We analyse single models by averaging scores across runs, computing a mean and standard deviation. We form a model ensemble by averaging predictive distributions across runs, constituting a second layer of Bayesian model averaging. For calibration analysis we model aleatoric noise; a scalar noise per QM9 property is learned within the D-MPNN. Each posterior sample yields an individual Gaussian predictive distribution, representing aleatoric uncertainty. The full Bayesian predictive distribution is a mixture of Gaussians, representing aleatoric and epistemic uncertainty. We create our own [train/val/test] split using Chemprop's 'scaffold split' function, which partitions molecules into bins based on their Murcko scaffold. Method implementation is detailed in Appendix C.

Evaluation. We measure the mean absolute error (MAE) of Bayesian predictive means and rank methods for each of the 12 QM9 tasks. The mean rank across the 12 tasks is our chief accuracy evaluation metric. To assess calibration we generalise reliability diagrams (Guo et al., 2017) to the regression setting and aggregate QM9 tasks. We consider confidence intervals (CIs) around the Bayesian predictive mean. CI size is plotted on the x-axis. On the y-axis we plot the proportion of molecules in our test set falling within each CI, minus CI size. A perfectly calibrated model is represented by the line y = 0. We summarise performance on the reliability diagram by computing miscalibration area (MA): the average absolute difference between confidence and accuracy.

Results (accuracy). Results are presented in Table 1 and Appendix D. The leading methods in both single and ensemble settings are BBP, SGLD, SWAG and GP. SGLD and SWAG may suffer slightly versus BBP because they employ vanilla SGD optimisation rather than Adam. SGLD has a higher rank in the single model setting, where it is distinguished by its ability to explore multiple posterior modes. Note that the GP captures uncertainty only in readout. Dropout performance is poor, which perhaps could be attributed to an insufficiently large network. DUN accuracy results should be considered in light of the fact that the variational posterior over depths collapses to d = 5 (the largest depth we consider), indicating that it has likely failed to capture the true posterior correctly.

Results (calibration). Reliability diagrams are shown in Figure 1 and Appendix E. With original Gaussian likelihoods we observe pathological underconfidence across methods. We find that this universal underconfidence is driven by overestimated aleatoric uncertainty, a consequence of heavy-tailed residual distributions containing extreme outliers. We improve calibration by fitting post-hoc t-distribution likelihoods to training residuals. MA results for post-hoc t-distribution likelihoods are shown in Table 1.
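For illustration, the reliability curves and miscalibration areas above could be computed along the following lines. This sketch uses equal-tailed intervals of the Gaussian mixture predictive distribution as a stand-in for intervals around the predictive mean, and the function and variable names (miscalibration_area, pred_means, pred_stds) are illustrative rather than taken from our released code.

```python
import numpy as np
from scipy import stats

def miscalibration_area(pred_means, pred_stds, targets, n_grid=100):
    """Reliability curve and miscalibration area for a per-molecule mixture
    of Gaussians predictive distribution.

    pred_means, pred_stds: arrays of shape (S, N) for S posterior samples and
    N test molecules; targets: array of shape (N,).
    """
    ci_sizes = np.linspace(0.0, 1.0, n_grid)
    # Mixture CDF evaluated at each target, averaged over posterior samples.
    cdf_at_targets = stats.norm.cdf(
        targets[None, :], loc=pred_means, scale=pred_stds
    ).mean(axis=0)
    coverages = []
    for ci in ci_sizes:
        lo, hi = 0.5 - ci / 2.0, 0.5 + ci / 2.0  # equal-tailed interval in CDF space
        coverages.append(((cdf_at_targets >= lo) & (cdf_at_targets <= hi)).mean())
    coverages = np.array(coverages)
    # The y-axis of the reliability diagram is coverage minus CI size; the
    # miscalibration area is the average absolute difference between the two.
    return ci_sizes, coverages - ci_sizes, np.abs(coverages - ci_sizes).mean()
```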
The post-hoc results motivate re-training with a gamma prior over the precision of our Gaussian likelihood function; placing a prior $\mathrm{Gam}(\tau \mid a, b)$ over the precision $\tau$ and integrating out the precision, we obtain a marginal distribution which, after conventional reparameterisation, equates to the t-distribution (see Bishop (2006, section 2.3.7)). We leave this to future work.

3.2 Molecular search

Framework. We follow the approximate Bayesian optimisation setup in Hernández-Lobato et al. (2017), running both Thompson sampling and greedy trials. Given an unlabelled dataset, the goal is to discover molecules with the largest values of some target property in as few evaluations as possible. At each Thompson iteration we: (i) draw S posterior samples to obtain S deterministic regressors; (ii) for each sample find the molecule with the largest predicted target value, yielding a total batch of S molecules; (iii) query said batch and add it to the labelled training set. Our dataset is a 100k subset of QM9 and our target is the first QM9 property, 'mu'. We begin with 5k labelled molecules (selected uniformly at random) and make 30 batch additions with S = 50. We perform 5 runs per method, corresponding to different base labelled sets and random weight initialisations.
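The batched Thompson sampling loop can be sketched as below. This is a toy illustration in which a Bayesian linear regressor stands in for posterior samples of the D-MPNN; the function names (thompson_search, linear_posterior_sample) are illustrative and do not correspond to our released code.

```python
import numpy as np

def thompson_search(features, labels, init_idx, sample_predictor,
                    n_rounds=30, batch_size=50, rng=None):
    """Batched Thompson sampling over a pool of candidate molecules.
    sample_predictor(X_lab, y_lab, rng) must return one deterministic
    regressor (a callable X -> predictions) drawn from the posterior."""
    rng = np.random.default_rng() if rng is None else rng
    labelled = set(int(i) for i in init_idx)
    for _ in range(n_rounds):
        batch = set()
        for _ in range(batch_size):                    # one posterior sample per pick
            idx = sorted(labelled)
            f = sample_predictor(features[idx], labels[idx], rng)
            preds = f(features)
            preds[sorted(labelled | batch)] = -np.inf  # exclude already-chosen molecules
            batch.add(int(np.argmax(preds)))
        labelled |= batch                              # query the oracle for the whole batch
    return labelled

def linear_posterior_sample(X, y, rng, noise_var=0.01, prior_var=1.0):
    """Exact posterior sample for Bayesian linear regression (toy stand-in)."""
    A = X.T @ X / noise_var + np.eye(X.shape[1]) / prior_var
    cov = np.linalg.inv(A)
    mean = cov @ X.T @ y / noise_var
    w = rng.multivariate_normal(mean, cov)
    return lambda Xq: Xq @ w

# Toy usage on random features and a linear oracle.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=1000)
chosen = thompson_search(X, y, range(50), linear_posterior_sample,
                         n_rounds=5, batch_size=10, rng=rng)
```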
Figure 2: Search trajectories for Thompson sampling. Fractions are averaged over 5 runs.
Evaluation. After each batch addition we measure the fraction of the top 1% of molecules discovered. The final metric used to compare methods is the fraction discovered following 30 batch additions, at the close of the experiment. At this point there are 6.5k labelled molecules.
Results. Search scores are presented in Table 1. Thompson sampling trajectories are shown in Figure 2 alongside a Monte Carlo baseline. We omit DUN given the collapse of the posterior over depths. As explained in Hernández-Lobato et al. (2017), Thompson sampling uses epistemic variance in the Bayesian predictive distribution to perform exploration. In contrast, greedy search selects molecules using the Bayesian predictive mean and exercises pure exploitation. Across methods we find no notable difference between Thompson and greedy search scores. This likely reflects reduced epistemic uncertainty; having randomly selected a subset of 5k molecules to initially label, we are operating 'in-distribution'. We considered smaller initially labelled sets but found BBP performed particularly poorly. Tuning BBP, SGLD and SWAG without a large validation set is challenging. In contrast, dropout methods and the GP demonstrate robustness to dataset size and hyperparameter settings. The particular success of dropout might also be attributed to its regularising effect.
The most performant methods involve Bayesian message passing as well as a Bayesian FFN. We conclude that there is meaningful and useful epistemic uncertainty to be captured in learned molecular representations as well as in readout. When applied to the full QM9 dataset, BBP, SGLD and SWAG enhance accuracy versus a MAP baseline. However, in the context of molecular search the sensitivity of these methods is limiting. Our recommendations follow the observed trends. For precise property prediction with large labelled datasets we suggest experimenting with BBP, SGLD, and SWAG. For molecular search, the robustness of dropout and deep kernel learning to different dataset sizes and hyperparameter settings is advantageous. Our results suggest single model SGLD is the best method for obtaining calibrated uncertainty estimates, though this is likely to be a task-specific phenomenon; extreme outlying residuals are still affecting calibration results despite post-hoc t-distribution likelihoods.

We identify three avenues for future work: (i) benchmarking Bayesian GNNs on the complete MoleculeNet dataset collection (see here); the majority of these datasets contain far fewer labelled molecules than QM9; (ii) adapting our D-MPNN by placing a gamma prior over the Gaussian likelihood precision, increasing network size for dropout, and experimenting with larger depths (following the DUN posterior collapse we trial depths up to d = 8 and find accuracy increases monotonically); and (iii) incorporating meta-learning into Bayesian GNNs to improve initialisation in search tasks; meta-initialisations enable rapid learning in low resource settings (Nguyen et al., 2020).

Acknowledgments and Disclosure of Funding

GL thanks Aneesh Pappu, Kush Madlani and Udeepa Meepegama for stalwart support, stimulating discussion and occasional commiseration. GL thanks Rob Lewis for blazing a trail from consulting to computer science. BP is supported by The Alan Turing Institute under the UK Engineering and Physical Sciences Research Council (EPSRC) grant no. EP/N510129/1.
References
Antorán, J., Allingham, J. U., and Hernández-Lobato, J. M. (2020). Depth uncertainty in neural networks. arXiv preprint arXiv:2006.08437.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.

Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic Gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232.

Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. arXiv preprint arXiv:1706.04599.

Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. arXiv preprint arXiv:1309.6835.

Hernández-Lobato, J. M., Requeima, J., Pyzer-Knapp, E. O., and Aspuru-Guzik, A. (2017). Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. arXiv preprint arXiv:1706.01825.

Hirschfeld, L., Swanson, K., Yang, K., Barzilay, R., and Coley, C. W. (2020). Uncertainty quantification using neural networks for molecular property prediction. arXiv preprint arXiv:2005.10036.

Hwang, D., Lee, G., Jo, H., Yoon, S., and Ryu, S. (2020). A benchmark study on reliable molecular supervised learning via Bayesian learning. arXiv preprint arXiv:2006.07021.

Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson, A. G. (2019). Subspace inference for Bayesian deep learning. arXiv preprint arXiv:1907.07504.

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.

Kingma, D. P., Salimans, T., and Welling, M. (2015). Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413.

Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2019). A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13153–13164.

Matthews, A. G. d. G., Hensman, J., Turner, R., and Ghahramani, Z. (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Journal of Machine Learning Research, 51:231–239.

Neal, R. M. (1994). Bayesian Learning for Neural Networks. PhD thesis, University of Toronto.

Nguyen, C. Q., Kreatsoulas, C., and Branson, K. M. (2020). Meta-learning GNN initializations for low-resource molecular property prediction. ICML 2020 Workshop on Graph Representation Learning and Beyond.

Ramakrishnan, R., Dral, P. O., Rupp, M., and Von Lilienfeld, O. A. (2014). Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):1–7.

Stokes, J. M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N. M., MacNair, C. R., French, S., Carfrae, L. A., Bloom-Ackerman, Z., et al. (2020). A deep learning approach to antibiotic discovery. Cell, 180(4):688–702.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Welling, M. and Teh, Y. W. (2011). Bayesian learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.

Wilson, A. G. (2020). The case for Bayesian deep learning. arXiv preprint arXiv:2001.10995.

Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530.

Yang, K., Swanson, K., Jin, W., Coley, C., Eiden, P., Gao, H., Guzman-Perez, A., Hopper, T., Kelley, B., Mathea, M., et al. (2019). Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388.

Zhang, R., Li, C., Zhang, J., Chen, C., and Wilson, A. G. (2019). Cyclical stochastic gradient MCMC for Bayesian deep learning. arXiv preprint arXiv:1902.03932.

Appendices

The appendices are structured as follows:

• Appendix A reviews the main concepts underlying Bayesian modelling, for readers less familiar with the Bayesian framework.
• Appendix B introduces our depth uncertainty network (DUN), which permits inference over both model weights and the number of message passing iterations.
• Appendix C describes the implementation of methods, and explains key hyperparameter choices.
• Appendix D contains granular predictive accuracy results (scaled MAE by task).
• Appendix E contains a full set of reliability diagrams. Three diagram pairs correspond to (i) Gaussian likelihoods, (ii) post-hoc t-distribution likelihoods, and (iii) omission of modelled aleatoric noise.

A Bayesian Modelling
Wilson (2020) emphasises that the distinguishing property of a Bayesian approach is marginalisation rather than optimisation. A Bayesian approach forms a predictive distribution by marginalising over different parameter settings, each weighted by their posterior probability. In contrast, classical learning involves maximising a posterior.
A.1 Bayesian marginalisation
We consider the case of Bayesian parametric regression. Given inputs $X$ and outputs $Y$, we desire the parameters $\omega$ of a function $f_\omega(\cdot)$ that is likely to have generated our outputs. We place a prior $p(\omega)$ over the space of possible parameter settings, representing our a priori belief about which parameters are likely to have generated the data. To transform the prior distribution in light of the observed data we define a likelihood distribution $p(y \mid x, \omega)$, the probabilistic model by which inputs generate outputs for parameter settings $\omega$. We look for the posterior distribution over the space of parameters by invoking Bayes' theorem:

$$p(\omega \mid \mathcal{D}) = \frac{p(Y \mid X, \omega)\, p(\omega)}{p(Y \mid X)}.$$

A predictive distribution is obtained by marginalising over $\omega$:

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \omega)\, p(\omega \mid \mathcal{D})\, d\omega. \tag{1}$$

Equation (1) is a Bayesian model average (BMA), representing model uncertainty.
A.2 Bayesian deep learning

Modern neural networks often contain millions of parameters. The posterior over these parameters is generally intractable. In Bayesian deep learning we deal with the problem of inference by making two layered approximations. Firstly, we approximate the Bayesian posterior; methods differ with respect to posterior approximation. Secondly, we approximate the Bayesian integral (1) by Monte Carlo (MC) sampling; MC integration is common across methods.

With $q(\omega \mid \mathcal{D})$ our approximate posterior, the MC BMA is

$$p(y \mid x, \mathcal{D}) \approx \frac{1}{J} \sum_{j=1}^{J} p(y \mid x, \omega_j), \qquad \omega_j \sim q(\omega \mid \mathcal{D}).$$

Following MC integration, we have approximated the true posterior with a set of point masses, where their locations are given by samples from $q(\omega \mid \mathcal{D})$:

$$p(\omega \mid \mathcal{D}) \approx \frac{1}{J} \sum_{j=1}^{J} \delta(\omega = \omega_j), \qquad \omega_j \sim q(\omega \mid \mathcal{D}).$$
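As a concrete illustration of the BMA used throughout the paper, the snippet below combines per-sample Gaussian predictions into the moments of the resulting mixture, separating aleatoric and epistemic contributions to the predictive variance; the function name is illustrative.

```python
import numpy as np

def mixture_predictive(pred_means, pred_stds):
    """Moments of the Bayesian model average formed from S Gaussian
    predictions. pred_means, pred_stds: arrays of shape (S, N)."""
    mean = pred_means.mean(axis=0)
    aleatoric_var = (pred_stds ** 2).mean(axis=0)   # expected noise variance
    epistemic_var = pred_means.var(axis=0)          # spread of predictive means
    return mean, aleatoric_var + epistemic_var
```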
B The Depth Uncertainty Network

Here we consider capturing uncertainty in model weights and the number of message passing iterations. For simplicity we refer to the latter parameter as 'depth'. There is motivation to acknowledge and capture uncertainty in the MPNN depth parameter. Different depths allow hidden states to represent different sized molecular substructures. Incorporating different sized spheres of representation at test time may enhance predictive accuracy.
B.1 Depth uncertainty in an FFN
Antorán et al. (2020) perform inference over the depth of an FFN. Different depths correspond to subnetworks which share weights. Exploiting the sequential structure of FFNs, Antorán et al. evaluate a training objective and make predictions with a single forward pass.

Antorán et al. define a categorical prior over network depth, $p_\beta(d) = \mathrm{Cat}(d \mid \{\beta_i\}_{i=0}^{D})$. They parameterise the likelihood for each depth using the corresponding subnetwork's output: $p(y \mid \mathbf{x}, d = i, \omega) = p(y \mid f_{D+1}(\mathbf{a}_i, \omega))$. Here, $f_{D+1}(\cdot)$ is an output layer and $\mathbf{a}_i$ the activation at depth $i \in [0, D]$ given input $\mathbf{x}$ and weight configuration $\omega$. For a given weight configuration, the likelihood for each depth and consequently the model's marginal log likelihood (MLL) can be computed from a single forward pass. The MLL is computed as

$$\log p(\mathcal{D} \mid \omega) = \log \sum_{i=0}^{D} \left( p_\beta(d = i) \cdot \prod_{n=1}^{N} p(y^{(n)} \mid x^{(n)}, d = i, \omega) \right).$$

The posterior over depth is a tractable categorical distribution which tells us how well each subnetwork explains the data given some set of weights $\omega$:

$$p(d \mid \mathcal{D}, \omega) = \frac{p(\mathcal{D} \mid d, \omega)\, p_\beta(d)}{p(\mathcal{D} \mid \omega)}.$$

Antorán et al. try learning weights by maximising the MLL directly using backpropagation and the log-sum-exp trick, but find the posterior collapses to a delta function over an arbitrary depth. This is explained by the gradients of each subnetwork being weighted by that subnetwork's posterior mass, leading to local optima where all but one subnetwork's gradients vanish.

The solution is to separate the optimisation of network weights from the posterior distribution, as in the expectation maximisation (EM) algorithm for latent variable models. Antorán et al. achieve this by performing stochastic gradient variational inference. They introduce a variational posterior $q_\alpha(d) = \mathrm{Cat}(d \mid \{\alpha_i\}_{i=0}^{D})$. They learn variational parameters $\alpha$ and weights $\omega$ by maximising the following ELBO:

$$\log p(\mathcal{D} \mid \omega) \ge \mathcal{L}(\alpha, \omega) = \sum_{n=1}^{N} \mathbb{E}_{q_\alpha(d)}\left[ \log p(y^{(n)} \mid x^{(n)}, d, \omega) \right] - \mathrm{KL}\left( q_\alpha(d)\, \|\, p_\beta(d) \right).$$

Maximising this ELBO by taking gradients actually constitutes exact inference. The ELBO is convex w.r.t. $\alpha$ because the variational and true posteriors are categorical. The variational family contains the exact posterior, thus at the maxima $q_\alpha(d) = p(d \mid \mathcal{D}, \omega)$.
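A minimal sketch of these two quantities, assuming the per-depth log likelihoods have already been computed in a single forward pass (tensor shapes and names are illustrative):

```python
import torch

def depth_mll_and_posterior(log_liks_per_depth, log_prior_depth):
    """Depth-marginal log likelihood and exact depth posterior.

    log_liks_per_depth: (D+1,) summed log likelihood of the data under each
    subnetwork's output; log_prior_depth: (D+1,) log categorical prior."""
    joint = log_prior_depth + log_liks_per_depth
    mll = torch.logsumexp(joint, dim=0)        # log p(D | w)
    posterior = torch.softmax(joint, dim=0)    # p(d | D, w)
    return mll, posterior
```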
B.2 Incorporating uncertainty in model weights

Our depth uncertainty network (DUN) combines the model above with Bayes by Backprop. We assume a fully factorised variational posterior over depth and neural network weights. We derive an ELBO by expanding the KL divergence:

$$\mathrm{KL}\big[ q(d \mid \alpha)\, q(\omega \mid \theta)\, \|\, p(d \mid \mathcal{D}, \omega)\, p(\omega \mid \mathcal{D}) \big] = \mathrm{KL}\big[ q(d \mid \alpha)\, \|\, p(d) \big] + \mathrm{KL}\big[ q(\omega \mid \theta)\, \|\, p(\omega) \big] - \mathbb{E}_{q(d \mid \alpha)\, q(\omega \mid \theta)}\big[ \log p(\mathcal{D} \mid d, \omega) \big] + \log p(\mathcal{D}) = -\mathcal{L}(\alpha, \theta) + \log p(\mathcal{D}).$$

$\mathcal{L}(\alpha, \theta)$ here is the ELBO. Due to the non-negativity of the KL divergence, $\mathcal{L}(\alpha, \theta) \le \log p(\mathcal{D})$.

We can learn variational parameters $\alpha$ and $\theta$ by maximising the ELBO using backpropagation. The KL term involving categorical distributions and the expectation of the log likelihood w.r.t. the posterior over depth can be computed analytically with a single forward pass. Expectations w.r.t. the variational posterior $q(\omega \mid \theta)$ are estimated as in Bayes by Backprop, by sampling unbiased gradients. In practice we can use mini-batches, estimating the ELBO as follows:

$$\mathcal{L}(\omega, \alpha, \theta) \approx \frac{N}{B} \sum_{n=1}^{B} \sum_{i=0}^{D} \Big( \log p(y^{(n)} \mid x^{(n)}, d = i, \omega) \cdot \alpha_i \Big) - \log \frac{q(\omega \mid \theta)}{p(\omega)} - \sum_{i=0}^{D} \alpha_i \log \frac{\alpha_i}{\beta_i}. \tag{2}$$

At test time, predictions for new data are made by the following Bayesian model average:

$$p(y^* \mid x^*, \mathcal{D}) \approx \frac{1}{J} \sum_{j=1}^{J} \sum_{i=0}^{D} p(y^* \mid x^*, d = i, \omega_j)\, q_\alpha(d = i), \qquad \omega_j \sim q(\omega \mid \theta).$$
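A sketch of the mini-batch ELBO estimate in equation (2), assuming (as in our implementation) a closed-form KL for the weights; argument names and shapes are illustrative:

```python
import torch

def dun_elbo(log_liks, log_alpha, kl_weights, log_beta, num_data):
    """Mini-batch ELBO estimate.

    log_liks: (B, D+1) per-example log likelihoods for each depth;
    log_alpha: (D+1,) log variational probabilities over depth (already
    normalised, e.g. via log_softmax of free parameters);
    kl_weights: scalar KL[q(w|theta) || p(w)] summed over Bayesian layers;
    log_beta: (D+1,) log prior probabilities over depth."""
    alpha = log_alpha.exp()
    batch_size = log_liks.shape[0]
    expected_ll = (log_liks * alpha).sum() * num_data / batch_size
    kl_depth = (alpha * (log_alpha - log_beta)).sum()
    return expected_ll - kl_weights - kl_depth
```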
C Implementation of Methods

In this section we describe the implementation of methods for the predictive accuracy and calibration experiment. Unless otherwise specified, hyperparameters are set to optimise validation MAE (averaged across QM9 tasks) following grid search. An exhaustive list of the hyperparameter settings for all our experiments can be found in the file chempropBayes/scripts/bayesHyp.py.
C.1 MAP

In order to learn aleatoric noise we instantiate a log standard deviation parameter within the D-MPNN model (the log ensures non-negativity). This parameter is a set of 12 scalars, one for each of the QM9 tasks. We henceforth refer to the parameter as 'log noise'.

Our full loss object is the negative log likelihood plus the negative log prior. A function to compute the former takes as input predictions, targets and log noise. We place a zero-mean Gaussian prior over each D-MPNN weight and control $\sigma_{\text{prior}}$ via a weight decay hyperparameter $\lambda$ inside our optimiser. In practice we scale the negative log likelihood to a manageable order of magnitude by dividing by the batch size. This scales the relationship between weight decay and our prior sigma. Precisely, we have $\lambda = 1 / (\sigma_{\text{prior}}^{2} N)$ where $N$ is the training set size.

The default batch size in Chemprop is 50 and we find this works well; we use this batch size across all methods. Our optimiser is Adam. Following grid search we fix the weight decay $\lambda$; the same value is used across methods for fair comparison. Chemprop utilises a 'noam' learning rate scheduler with piecewise linear increase and exponential decay (based on the scheduler in Vaswani et al. (2017)). We train for 200 epochs, using the 'noam' scheduler for the first 100. We linearly increase the learning rate from lr_min to lr_max over 2 epochs and decay back to lr_min over the following 98, from which point we remain at lr_min. Following grid search we set lr_min = 1e-4 and lr_max = 1e-3. The saved MAP model following each training run is that which achieves the best validation accuracy. We also apply this selection procedure to GP, DropR, DropA, BBP and DUN.

Architecture. We grid search over 24 architectures, exploring combinations of hidden size h (two settings), message passing depth d (four settings) and number of FFN readout layers L (three settings). The best-performing combination uses h = 500. We choose not to investigate larger hidden sizes or depths to manage compute requirements. For context, the Chemprop default hidden size is 300. We maintain a fixed architecture across all methods.

Standardising features and targets. Before training the D-MPNN we standardise atom and bond features in the training set, and apply the same transformation to validation and test molecules. We also standardise training targets, later applying the reverse transform when making predictions on validation or test molecules. Both these standardisations occur across all methods.
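A sketch of the scaled Gaussian negative log likelihood with a learned per-task log noise, assuming predictions and targets of shape (batch, 12); names are illustrative:

```python
import math
import torch

def scaled_gaussian_nll(preds, targets, log_noise):
    """Per-example Gaussian NLL with a learned scalar noise per task.
    preds, targets: (B, 12); log_noise: (12,), the log standard deviations."""
    inv_var = torch.exp(-2.0 * log_noise)
    nll = 0.5 * (inv_var * (preds - targets) ** 2
                 + 2.0 * log_noise
                 + math.log(2.0 * math.pi))
    return nll.sum(dim=1).mean()   # dividing by the batch size via .mean()

# The zero-mean Gaussian prior over weights enters through the optimiser's
# weight decay: with the likelihood scaled per example as above,
# lambda = 1 / (sigma_prior^2 * N) for training set size N.
```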
C.2 GP

Each GP run is initialised with the output of the corresponding MAP run (e.g. GP run 1 is initialised with the output of MAP run 1). We take the pre-trained MAP D-MPNN and replace the final layer of readout with 12 batched stochastic variational GPs (SVGPs), one per task. We train the resulting architecture end-to-end. This end-to-end training is known as deep kernel learning (DKL).

We implement the GPs in GPyTorch, following the example SVGP and DKL implementations as a guide (https://docs.gpytorch.ai). Our variational distribution is a multivariate normal ('CholeskyVariationalDistribution') with batch shape 12. We use a multitask variational strategy with 1200 inducing points based on methodology in Hensman et al. (2013). The variational strategy tells us how to transform a distribution q(u) over the inducing point values to a distribution q(f) over the latent function values for some input x. 1200 is the maximum feasible number of inducing points given compute constraints (corresponding to 10-15 minutes per epoch on a single GPU node). We note that the closeness of the variational GP approximation to the true posterior increases only with the log of the number of inducing points (Matthews et al., 2016).

Each GP is defined by a constant mean and RBF kernel. We train GP hyperparameters, a scalar aleatoric noise per task and D-MPNN weights by minimising a negative variational ELBO. We train with a batch size of 50 for 200 epochs, following the same learning rate profile as for MAP. We use the Adam optimiser. For fair comparison with other methods we regularise D-MPNN weights with the same weight decay used elsewhere.
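A single-task sketch of an SVGP readout head in GPyTorch is given below; our actual model uses 12 batched GPs with a multitask variational strategy, so this is a simplified illustration rather than the released implementation.

```python
import torch
import gpytorch

class SVGPHead(gpytorch.models.ApproximateGP):
    """Stochastic variational GP with a constant mean and RBF kernel,
    intended to replace the final readout layer (single task only)."""
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0))
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution,
            learn_inducing_locations=True)
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

# Training objective: the negative variational ELBO, optimised end-to-end
# together with the D-MPNN weights (deep kernel learning).
inducing = torch.randn(1200, 300)          # illustrative inducing point locations
model = SVGPHead(inducing)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=100_000)
```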
C.3 DropR, DropA

The dropout models follow a similar training procedure to MAP; here we highlight differences. For DropR we activate dropout layers following the D-MPNN atom representation step, and following every FFN layer except for the output layer. For DropA we additionally activate dropout layers following D-MPNN hidden state updates.

For both DropR and DropA we grid search over five dropout probabilities p; the same probability is optimal for both methods. We run the final models with this dropout probability during training and testing.

Both dropout methods take significantly longer to converge than MAP. This is expected given the noise inherent in dropout. We train for 300 epochs, extending the MAP learning rate profile by 100 additional epochs at a fixed learning rate of lr_min = 1e-4. At test time we draw 30 samples.
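A sketch of test-time MC dropout prediction, re-enabling only the dropout modules; names are illustrative:

```python
import torch

def mc_dropout_predict(model, x, n_samples=30):
    """Average predictions over stochastic forward passes with dropout on."""
    model.eval()
    for m in model.modules():                 # keep only dropout layers stochastic
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)
```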
C.4 SWAG

The SWAG implementation is based on code attached to the original SWAG paper (Maddox et al., 2019). We first build a wrapper around the D-MPNN, referring to the latter as our 'base' model. The wrapper contains a list of the parameter objects in the base model (where a parameter 'object' is, for example, a weight matrix). Within the wrapper list (which includes log noise) we register buffers for each parameter object to store first and second uncentred moments. During training we 'collect' models by looping through parameter objects in the base model and updating buffers in the wrapper list. At test time we generate new sample parameters directly within the wrapper list. Because the parameter objects in the list point to the same place in memory as the parameters in the base model, base model parameters also change when we sample.

The starting point for SWAG training is the pre-trained MAP model. We run SWAG training for 100 epochs, collecting one model per epoch after 20 warm-up epochs (thus collecting 80 models in total). We limit the rank of our estimated covariance matrix by using only the last K = 20 models to compose a deviation matrix (the same setting as in the original paper). For fair comparison with other methods we set the weight decay to the same value as for MAP.

The performance of the SWAG method is sensitive to learning rates. To prevent spikes in loss during training we make three changes versus MAP. Firstly, we lower the main learning rate; in practice 2e-5 is the highest rate with which we can achieve reasonable validation accuracy during training (SWAG should be run with a constant 'high' learning rate). Secondly, we reduce the learning rate even further for log noise; at all times it is one fifth of the learning rate applied to other model parameters. Thirdly, at the start of SWAG training we gradually increase learning rates from 1e-10 up to their maxima over 5 epochs, using a cosine scheduler (the scheduler is not particularly important; we use a cosine scheduler for alignment with SGLD). SWAG's sensitivity is a result of using the SGD optimiser as opposed to Adam (which enjoys momentum and adaptive learning rates). We try SWAG with several momentum values but see more volatile loss profiles, so momentum is kept at zero. At test time we draw 30 samples.
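A diagonal-only sketch of the moment collection and sampling described above; the low-rank deviation matrix is omitted for brevity and names are illustrative:

```python
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

class SwagMoments:
    """Running first and second uncentred moments of the flattened weights."""
    def __init__(self, model):
        vec = parameters_to_vector(model.parameters()).detach()
        self.n = 0
        self.mean = torch.zeros_like(vec)
        self.sq_mean = torch.zeros_like(vec)

    def collect(self, model):
        """Call once per epoch after the warm-up period."""
        w = parameters_to_vector(model.parameters()).detach()
        self.n += 1
        self.mean += (w - self.mean) / self.n
        self.sq_mean += (w ** 2 - self.sq_mean) / self.n

    def sample(self, model, scale=1.0):
        """Overwrite the model's parameters with a (diagonal) posterior sample."""
        var = (self.sq_mean - self.mean ** 2).clamp(min=1e-30)
        w = self.mean + scale * var.sqrt() * torch.randn_like(self.mean)
        vector_to_parameters(w, model.parameters())
```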
C.5 SGLD

SGLD parameter updates are equivalent to MAP updates with the addition of Gaussian noise. As with MAP, we scale the SGLD loss to a manageable order of magnitude by dividing by the training set size, $N$. This division effectively rescales the SGLD learning rate to be $\epsilon / N$. It follows that we should also divide the variance of our added Langevin noise by $N$. Denoting the batch size as $B$, our parameter update equation is

$$\Delta \omega_t = \frac{\epsilon}{N} \left( \nabla \log p(\omega_t) + \frac{N}{B} \sum_{i=1}^{B} \nabla \log p(y_i \mid x_i, \omega_t) \right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon / N).$$

We implement SGLD as an optimiser which inherits from PyTorch's SGD base class. The SGLD optimiser loops through parameter groups and adds two gradient terms to the already-computed log likelihood gradients. Firstly, it adds the gradient of the log prior; we parameterise this via a weight decay set to the same value as for other methods, for fair comparison. Secondly, the optimiser adds appropriately scaled Langevin noise.

The starting point for SGLD is the pre-trained MAP model. We run SGLD with a cyclical cosine learning rate schedule, following the proposal of Zhang et al. (2019). The idea is that larger steps discover new posterior modes during a period of exploration (effectively a burn-in phase), and that smaller steps characterise each mode. We use PyTorch's 'OneCycleLR' scheduler and configure a single cycle as follows: we cosine anneal the learning rate from 1e-10 to a maximum learning rate of 1e-4 over 5 epochs, and then cosine anneal from this maximum to a minimum learning rate of 1e-5 over the following 45 epochs. At all times the learning rate for log noise is one fifth of the main learning rate. We save a posterior sample at the end of each 50 epoch cycle. Given this relatively expensive serial sampling procedure, we collect only 20 samples for SGLD.
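A minimal sketch of such an optimiser, assuming the backward pass has populated gradients of the per-example negative log likelihood; the prior enters via weight decay and the noise variance mirrors the update above (names are illustrative, and the released code may differ):

```python
import torch

class SGLD(torch.optim.Optimizer):
    """SGLD step for a loss scaled to a per-example negative log likelihood;
    the Langevin noise variance is lr / num_data."""
    def __init__(self, params, lr, num_data, weight_decay=0.0):
        super().__init__(params, dict(lr=lr, num_data=num_data,
                                      weight_decay=weight_decay))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, n, wd = group["lr"], group["num_data"], group["weight_decay"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad + wd * p                          # add the log-prior gradient
                noise = torch.randn_like(p) * (lr / n) ** 0.5   # Langevin noise
                p.add_(-lr * grad + noise)
```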
C.6 BBP

Again, we scale the loss to be a per example measure. Given batch size $B$ the loss function is

$$\mathcal{L}(\omega, \theta) = \frac{1}{N} \left( \log q(\omega \mid \theta) - \log p(\omega) - \frac{N}{B} \sum_{i=1}^{B} \log p(y_i \mid x_i, \omega) \right).$$

In practice, we average this loss across 5 forward passes before every backward pass to reduce variance.

To implement BBP, we define a Bayesian linear layer class to replace the existing linear layers in the D-MPNN. Within the Bayesian linear layer we implement the 'local reparameterisation trick' (Kingma et al., 2015). This involves calculating the mean and variance of activations in closed form and sampling activations instead of weights. With the local reparameterisation trick the variance of our MC ELBO estimator scales as 1/B; sampling weights directly it scales as (B − 1)/B (with B the batch size). Within each Bayesian linear layer we also compute the KL divergence in closed form, using a result from Kingma and Welling (2013, Appendix B). Each layer returns standard output as well as a KL. We sum the latter across layers to compute a total KL.

We initialise BBP from the MAP solution and train for 100 epochs at a constant learning rate of 1e-4. We set $\sigma_{\text{prior}}$ such that the prior is approximately equivalent to the weight decay used for other methods, given our scaled loss. Initialising ρ parameters in the correct range is important for reasonable convergence; we initialise them uniformly at random within a small negative range. Each training run we save the BBP model with the best validation accuracy, where validation accuracy is calculated for the mean of the variational posterior. At test time we draw 30 samples.

C.7 DUN

The DUN method is implemented on top of BBP. Our loss is the negative of equation (2) (though we compute an exact BBP KL). As with previous methods, we rescale the loss by dividing by the training set size. The variational categorical distribution q(d | α) is learned as a set of parameters within the D-MPNN, where we use logs to ensure non-negativity. In a single forward pass our DUN D-MPNN returns a BBP KL (the sum of KLs computed in closed form within Bayesian linear layers), a KL of categorical distributions (also computed exactly) and predictions corresponding to five different depths. Our categorical prior is the uniform distribution.

Categorical distributions are over five depth settings. Note that in Chemprop the depth d counts the number of message passing steps plus the hidden state initialisation step. We do not exceed d = 5, for fair comparison; improved DUN performance versus other methods may otherwise be caused by the inclusion of a deeper sub-model alone. Recall that when selecting a model architecture we only grid search up to d = 5.

We train DUN models for 350 epochs. For the first 100, BBP ρ parameters are frozen at zero (thus our variational posterior over weights is a point mass) and we freeze q(d | α) to be a uniform distribution. This first phase of training is designed to minimise the chance of the variational categorical distribution later collapsing to a single depth. For the first 100 epochs we use the 'noam' scheduler with MAP learning rates. After 100 epochs we initialise ρ as in the BBP method and unfreeze our variational categorical parameters. From 100 epochs onward we use a constant learning rate of 1e-4. We save the DUN model achieving the best validation accuracy.

At test time we generate two sets of samples. We draw 30 samples from the marginal posterior over weights, each of which predicts by taking an expectation w.r.t. depth; we use these samples to evaluate DUN accuracy.
We also draw 100 samples from the joint posterior over weights and depth, using these to assess uncertainty calibration; the larger number of samples is necessary to minimise discretisation error.
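Both BBP and the DUN are built from Bayesian linear layers which return an output together with a closed-form KL. A minimal sketch of such a layer with the local reparameterisation trick (biases omitted; names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinearLR(nn.Module):
    """Mean-field Gaussian linear layer; activations, not weights, are sampled."""
    def __init__(self, d_in, d_out, prior_sigma=1.0):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.w_rho = nn.Parameter(torch.full((d_out, d_in), -5.0))
        self.prior_sigma = prior_sigma

    def forward(self, x):
        w_sigma = F.softplus(self.w_rho)
        act_mu = F.linear(x, self.w_mu)
        act_var = F.linear(x ** 2, w_sigma ** 2) + 1e-8
        act = act_mu + act_var.sqrt() * torch.randn_like(act_mu)
        # Closed-form KL between the factorised Gaussian posterior and prior.
        kl = (torch.log(self.prior_sigma / w_sigma)
              + (w_sigma ** 2 + self.w_mu ** 2) / (2 * self.prior_sigma ** 2)
              - 0.5).sum()
        return act, kl
```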
D Granular Predictive Accuracy Results
Tables 2 and 3 present granular predictive accuracy results. MAEs are scaled by task such that predicting with the mean of test targets would yield an MAE score of 100. The primary metric in these tables is 'mean rank', calculated (per run) by averaging the rank of a method across the 12 tasks. Using mean rank ensures we evenly weight the 12 tasks. To use MAE averaged across tasks (the 'All' row in the tables) would be to give a higher weighting to more difficult tasks.

Table 2: MAE for MAP and Bayesian model ensembles. The 5 runs in Table 3 constitute an ensemble of 5 models. Results are scaled by task such that predicting with the mean of test targets would yield an MAE score of 100.

Property   MAP     GP      DropR   DropA   SWAG    SGLD    BBP   DUN
mu         45.64   45.43   49.10   52.17   45.47   45.41

Table 3 (split table): MAE for MAP and Bayesian single models. Means and standard deviations are computed across 5 runs. Results are scaled by task such that predicting with the mean of test targets would yield an MAE score of 100.

Property   MAP (mean, std)   GP (mean, std)   DropR (mean, std)   DropA (mean, std)
mu         48.41, 0.43       48.95, 0.29      50.42, 0.67         52.77, 0.37
alpha       7.89, 0.15        8.01, 0.31       9.32, 0.22         10.12, 0.12
HOMO       30.06, 0.27       31.13, 0.40      30.76, 0.29         33.38, 0.40
LUMO       11.08, 0.15       11.44, 0.04      11.86, 0.36         12.34, 0.21
gap        15.35, 0.11       15.87, 0.12      16.07, 0.25         17.24, 0.38
R2         15.44, 0.17       15.74, 0.28      16.86, 0.27         17.53, 0.20
ZPVE        1.93, 0.06
E Reliability Diagrams
Figure 3 contains a complete set of reliability diagrams. Each row (pair of diagrams) corresponds to a different likelihood function. The first pair of diagrams are generated with Gaussian likelihoods; Gaussian noise parameters are learned when we train the D-MPNN. We observe pathological underconfidence across methods. The second pair of diagrams are generated with post-hoc t-distribution likelihoods and demonstrate improved calibration. The third pair of diagrams are generated without modelling aleatoric noise. In this case, the Bayesian predictive distribution is approximated as a single Gaussian rather than a mixture. We fit the single Gaussian to S × N predictive means, after drawing S posterior samples from an ensemble of N models. The third row demonstrates that overestimated aleatoric uncertainty drives the underconfidence in the first row. The elbow shape towards the end of the reliability lines in the third row points to the presence of outlying residuals.

Figure 3: Reliability diagrams for single models (left column) and model ensembles (right column). Rows correspond to Gaussian likelihoods (top), post-hoc t-distribution likelihoods (middle) and no modelled aleatoric noise (bottom).