TensorBNN: Bayesian Inference for Neural Networks using TensorFlow
B.S. Kronheim a, M.P. Kuchera a, H.B. Prosper b

a Department of Physics, Davidson College, Davidson, NC 28036
b Department of Physics, Florida State University, Tallahassee, FL 32306

Email addresses: [email protected] (B.S. Kronheim), [email protected] (M.P. Kuchera)
Abstract
TensorBNN is a new package based on TensorFlow that implements Bayesian inference for modern neural network models. The posterior density of neural network model parameters is represented as a point cloud sampled using Hamiltonian Monte Carlo. The TensorBNN package leverages TensorFlow's architecture and training features as well as its ability to use modern graphics processing units (GPU) in both the training and prediction stages.
Keywords: Bayesian Neural Networks, Machine Learning, TensorFlow, Hamiltonian Monte Carlo
1. Introduction
TensorBNN is a flexible implementation of Bayesian Neural Networks (BNNs) that uses the efficient GPU algorithms and machine learning platform of TensorFlow, extending the package to allow for a fully Bayesian treatment of neural networks. TensorFlow Probability contains probabilistic architectures, but does not currently contain support for a Bayesian inference procedure.

The Flexible Bayesian Modeling (FBM) toolkit, developed by Radford Neal [1], provides extensive capabilities for Bayesian inference for neural networks. However, machine learning technologies have evolved significantly since the first release of FBM. Robust, flexible machine learning platforms such as Google's TensorFlow [2] and Facebook's
PyTorch [3] contain functionality in architecture design and optimization methods that is unmatched in frameworks with smaller user bases. In addition, these packages provide a seamless interface with Graphics Processing Units (GPUs), enabling large-scale computations with orders of magnitude speedup over CPU-only software. The
TensorBNN package leverages these recent developments in order to provide an environment for Bayesian inference using the methods proposed by Neal [1].

Implementations of approximate BNNs, which can be run on GPUs, such as the
DenseFlipout layers in
TensorFlow Probability based on [4], have recently appeared. These
DenseFlipout layers approximate the prior and posterior densities with explicit functional forms, such as a Gaussian density, and optimize the parameters using gradient descent. While such methods can be effective and are much less computationally expensive than a full Bayesian analysis, they are limited by the choice of the functional form of the posterior densities and, therefore, are unable to model a posterior density as complex as that over the parameter space of a neural network.

The TensorBNN package approximates the posterior density of the parameters of a neural network as a point cloud; that is, a neural network model is "trained" not by finding a single neural network that best fits the training data, but rather by creating an ensemble of neural networks by sampling their parameters from the posterior density using a Markov chain Monte Carlo (MCMC) method.

The paper is organized as follows. We begin in Section 2 with a description of the salient mathematical underpinning of BNNs and TensorBNN, in particular. This is followed by a description of the TensorBNN package in Section 3. In Section 4 the performance of TensorBNN is illustrated with a simple example. A summary is given in Section 5.
2. Bayesian neural networks
Neural networks are the most commonly used models in supervised machine learning, in which the training data contain both inputs and known outputs from which a regression or classification model is constructed. The standard approach to training such a model, that is, fitting the model to data, is to minimize a suitable empirical risk function, which in practice is proportional to the average of a loss function. If the average loss is a linear function of the negative log likelihood, −ln p(D|θ), of the data D, then minimizing the average loss is identical to estimation of the neural network parameters θ via maximum likelihood. In this case, fitting a model, that is, a function f(θ), to data can be construed as a problem of inference. Furthermore, if a prior density π(θ) over the parameter space can be defined, then the inference can be performed using the Bayesian approach [1]. A Bayesian neural network can be represented as follows

p(x, D) = ∫ F(x, θ) p(θ|D) dθ,    (1)

where

p(θ|D) = p(D|θ) π(θ) / p(D)    (2)

is the posterior density over the parameter space Θ of the neural network with parameters θ ∈ Θ. Since each point θ corresponds to a different neural network (from the same function class), each point in general is associated with a different output value y = f(x, θ) for the same input x. Therefore, the posterior density and neural network f(x, θ) together induce a predictive distribution given by

p(y|x, D) = ∫ δ(y − f(x, θ)) p(θ|D) dθ.    (3)

The quantity D = {(t_k, x_k)} denotes the training data, which consists of the targets t_k associated with data x_k. In practice, the posterior density is represented by a point cloud {θ_i} sampled from the posterior density and Eq. (3) is approximated by binning the values y = f(x, θ) with θ ∈ {θ_i}.

A practical advantage of the sampling approach is that it provides a straightforward method to estimate the uncertainty in any quantity that depends on the network parameters. This advantage, of course, exists for the maximum likelihood approach also. However, for likelihood functions of the neural network parameters, it is unclear, given their highly non-Gaussian character, how to obtain meaningful uncertainty estimates. On the other hand, the Bayesian approach requires the specification of a prior, which introduces its own complication. But, since the network parameters are sampled from the posterior density, the sensitivity of inferences to the choice of prior can be assessed by re-weighting the sampled points by the ratio of the new prior to the one with which the sample was generated.

The current version of TensorBNN is restricted to fully-connected deep neural networks (DNN),

y(x, θ) = f(b_K + w_K h_{K−1}(· · · h_2(b_2 + w_2 h_1(b_1 + w_1 x)) · · ·)),    (4)

which are simply nested non-linear functions. The quantities b_j and w_j (the biases and weights) are matrices of parameters and h_j and f are the activation and output functions, respectively; j is the layer number. In Eq. (4), the functions h_j and f are applied element-wise to their matrix arguments. TensorBNN maintains the standard activation function options available in
TensorFlow, with the addition of one custom activation function, a modified
PReLU function, called
SquarePReLU:

h_j(x) = x        for x ≥ 0,
h_j(x) = a_j² x   for x < 0,    (5)

with the slope parameter a_j² for the negative-x region. We opted to take the parameter to be a_j² rather than a_j to ensure that the slope remains positive, thereby keeping the activation function one-to-one. The function f is

f(x) = 1/(1 + exp(−x))   for a classifier, and
f(x) = x                 for regression models.    (6)

The weights and biases of the network, along with the PReLU parameters, are the free parameters of the neural network model.

In the following subsection, we describe the pertinent details of this implementation, while the second subsection contains relevant information on the hardware used to perform the sampling for the examples described in Section 4.
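To make the point-cloud approximation of Eq. (3) concrete, the following sketch evaluates a sampled ensemble of networks at a single input and bins the outputs; the function f(x, θ) and the point_cloud list are generic placeholders for the sampled models, not parts of the TensorBNN API.

import numpy as np

def predictive_distribution(f, point_cloud, x, bins=50):
    # Approximate Eq. (3): evaluate the network output f(x, theta) for every
    # parameter sample theta in the point cloud and bin (histogram) the values.
    ys = np.array([f(x, theta) for theta in point_cloud])
    counts, edges = np.histogram(ys, bins=bins, density=True)
    # The mean and standard deviation of ys are convenient summaries of the
    # predictive distribution for the input x.
    return ys.mean(), ys.std(), counts, edges

The mean and standard deviation returned here are the usual summaries used to quote a prediction together with its uncertainty.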
The likelihood functions included in the package are modeled as follows:

p(D|θ) = ∏_k y_k^{t_k} (1 − y_k)^{1−t_k}   for classifiers, and
p(D|θ) = ∏_k N(t_k, y_k, σ)                for the regression models,    (7)

where y_k ≡ y(x_k, θ) and N(x, µ, σ) is a Gaussian density. For classifiers, the targets t_k are 1 and 0 for true and false identification, while for regression models they are the desired regression values.

2.2. The prior

Constructing a prior for a multi-parameter likelihood function is exceedingly challenging and is especially so for functions as complicated as those in Eq. (7). For an excellent and thorough review of this problem we recommend Nalisnick [5]. We sidestep this challenge by proceeding pragmatically and relying instead on ex post facto justification of the choices we make: the choices are justified based on the quality of the results. Moreover,
TensorBNN includes a re-weighting mechanism for studying the sensitivity of the results to these choices.

Each layer of the DNN contains three kinds of parameter: weights, biases, and the slope parameters of the activation functions. In the
FBM toolkit of Neal [1], the prior for each weight and bias is a zero mean Gaussian density, with the precisions, σ⁻², of these densities constrained by gamma hyper-priors. In TensorBNN, the weights w_i of a given layer are assigned the hierarchical prior

π(w) = ∏_i (πβ)⁻¹ [1 + ((w_i − α)/β)²]⁻¹      (prior)
       × N(α, µ_α, σ_α) N(β, µ_β, σ_β)        (hyper-prior),    (8)

comprising a product of Cauchy densities with location and scale parameters, α and β, respectively, constrained by hyper-priors modeled as Gaussian densities. A hierarchical prior of the same form is assigned to the biases of a given layer, but with different location and scale parameters. The prior for the slope parameter of the activation function can be either an exponential density with its rate parameter constrained by an exponential hyper-prior,

π(m) = λ exp(−λm) γ_λ exp(−γ_λ λ),    (9)

or a Gaussian prior with location and scale parameters constrained by Gaussian density hyper-priors as follows

π(m) = N(m, γ, ω) N(γ, µ_γ, σ_γ) N(ω, µ_ω, σ_ω).    (10)

The overall prior π(θ) is a product of the priors for all weights, biases, and slope parameters, and the associated hyper-priors. For regression models, there is, in addition, a flat prior for the standard deviation σ in Eq. (7), with a starting value set by the user.

2.3. Sampling the posterior density

Since the high-dimensional integral in Eq. (1) is intractable it is typically approximated using a Monte Carlo method to sample from the posterior density p(θ|D). The Monte Carlo method of choice is Hamiltonian Monte Carlo (HMC) [1, 6], in which the posterior density is written as p(θ|D) = exp(−V(θ)), where V = −ln p(θ|D) is viewed as a "potential" to which a "kinetic" term T = p²/2 is added to form the "Hamiltonian" H = T + V. The dimensionality of the "momentum" p equals that of the parameter space Θ. The HMC sampling algorithm alternates between deterministic traversals of the space Θ governed by a finite difference approximation of Hamilton's equations and random changes of direction. In order to achieve detailed balance and, therefore, ensure asymptotic convergence to the correct posterior density, the deterministic trajectories are computed using a reversible, leapfrog approximation to Hamilton's equations. The HMC method has two free parameters, the step size along the trajectories and the number of steps to take before executing a random change in direction. The method used to determine these parameters is described in Section 3.5.
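For concreteness, a single leapfrog trajectory of the kind described above can be sketched as follows; this is a generic, textbook version of the update for a potential V(θ) = −ln p(θ|D) with unit mass, not the TensorFlow Probability kernel that TensorBNN actually uses.

import numpy as np

def leapfrog_proposal(theta, grad_V, step_size, n_steps, rng=None):
    # One deterministic HMC trajectory: draw a random momentum, then integrate
    # Hamilton's equations with the reversible leapfrog scheme.
    rng = rng or np.random.default_rng()
    p = rng.normal(size=theta.shape)            # "kinetic" momentum, same dimension as theta
    theta = theta.copy()
    p = p - 0.5 * step_size * grad_V(theta)     # initial half step in momentum
    for _ in range(n_steps - 1):
        theta = theta + step_size * p           # full step in position
        p = p - step_size * grad_V(theta)       # full step in momentum
    theta = theta + step_size * p
    p = p - 0.5 * step_size * grad_V(theta)     # closing half step in momentum
    return theta, -p                            # momentum flip keeps the map reversible

A Metropolis accept/reject test on H = T + V, omitted here, completes a single HMC update.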
3. Implementation Details
TensorBNN, which is built using
TensorFlow [2] and
TensorFlow Probability [7], follows the design of BNNs described in Neal [1] with some improvements and modifications. Here we summarize the general structure of
TensorBNN and the improvements, which include the HMC parameter adaptation scheme and the addition of pretraining.
TensorBNN provides a framework for the user-friendly construction of BNNs in a manner similar to the
Keras [8] interface for building neural networks with
TensorFlow. The main object in the package is the network object, which is the base for all operations in the package. The options that can be specified when instantiating this object are the data type, e.g. float32 or float64, the input dimension, the training and validation data, and the normalization scaling for the output. An example network declaration is shown below.

model = network(dtype, inputDims, trainX, trainY,
                validateX, validateY)

After the network object is instantiated, the layers and activation functions are added. Each of these is a variant of the
Layer object. For example, a
DenseLayer is included in the package with multiple activation functions, but users can create their own
Layer variants and activation functions with custom priors.

The
DenseLayer object can be initialized either randomly or with pre-trained weights and biases. When using random initialization, the Gaussian He [9] initialization is used to determine starting values of the weights, and the biases are extracted from the same random distribution. This is done to keep the starting values small, while allowing variation. As discussed in Section 2.2, the priors for the weights and biases are Cauchy densities, with the location parameter α and scale parameter β constrained by Gaussian hyper-priors (see Eq. (8)). See Table 1 for the parameter values. The β parameters are made more flexible than the α values in order to keep a majority of the weights and biases close to 0, while allowing some of the weights and biases to have larger values. These values, which are summarized in Table 1, cannot be changed within a DenseLayer object, though it is a simple matter to create a new
Layer variant.
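For illustration, the He initialization used for the starting values can be sketched as follows; this is a generic NumPy version, with the fan-in based standard deviation of the He rule [9] and the biases drawn from the same distribution, as described above. The function and variable names are illustrative, not those used inside TensorBNN.

import numpy as np

def he_initialize(n_in, n_out, rng=None):
    # Zero-mean Gaussian He initialization: standard deviation sqrt(2 / fan_in)
    # keeps the starting values small while allowing some variation.
    rng = rng or np.random.default_rng()
    std = np.sqrt(2.0 / n_in)
    weights = rng.normal(0.0, std, size=(n_out, n_in))
    biases = rng.normal(0.0, std, size=n_out)   # biases drawn from the same distribution
    return weights, biases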
Table 1: DenseLayer initial values of the prior hyper-parameters α and β, and the fixed hyper-prior parameters µ_α, σ_α, µ_β, and σ_β.

In the code snippet below, a
DenseLayer and an activation function are added to the network. This process would be the same for any layer or activation function, with a different object added.

model.add(DenseLayer(inputDims, width, seed=seed, dtype=dtype))
model.add(Tanh())

Name          Description
Sigmoid       1/(1 + e^(−x))
Tanh          tanh(x)
Softmax       e^(x_i) / Σ_n e^(x_n)
ReLU          0 for x ≤ 0; x for x > 0
leakyReLU     αx for x ≤ 0; x for x > 0
Elu           e^x − 1 for x ≤ 0; x for x > 0
PReLU         leakyReLU with trainable α
SquarePReLU   α²x for x ≤ 0; x for x > 0, with α trainable

Table 2: The built-in activation functions of TensorBNN.

Within the package there are eight options for activation functions, which are listed in Table 2. All the activation functions except
SquarePReLU are standard and present in
TensorFlow. The
SquarePReLU was developed specifically for
TensorBNN. Both
PReLU and
SquarePReLU have trainable slope parameters α. They have, however, different priors and hyper-priors, given in Eqs. (9) and (10), with the fixed and initial values of their parameters listed in Table 3. For PReLU, the exponential distribution was chosen to model the prior belief that the slopes should be close to zero. The rates were picked to allow for larger slopes, while still enforcing the belief that smaller slopes are preferred. For
SquarePReLU, as we are considering the square root of the slope, which can be positive or negative, the Gaussian prior was chosen because it is continuous at 0 and enforces the prior preference for small slopes. Once again, the priors for these activation functions cannot be changed, but a custom activation function with different priors can be created.

Name          Parameter    Description
leakyReLU     m            fixed slope
PReLU         m            initial slope
              λ = 0.3      initial rate parameter for the m prior
              γ_λ          rate parameter for the λ hyper-prior
SquarePReLU   m            initial slope
              γ = 0.3      initial mean for the m prior
              ω = 0.3      initial standard deviation for the m prior
              µ_γ = 0.0    mean for the hyper-prior of γ
              σ_γ = 0.1    standard deviation for the γ hyper-prior
              µ_ω          mean for the ω hyper-prior
              σ_ω = 0.3    standard deviation for the ω hyper-prior

Table 3: The initial and fixed parameters of the built-in activation functions of TensorBNN.

When the add method of the network class is called, the layer or activation function is added to a list of layers. Additionally, the weights, biases, and trainable activation function parameters from the layer, along with the hyper-parameters, are stored.

After building the network architecture, the Hamiltonian Monte Carlo (HMC) sampler must be initialized. This is done through a method of the network class. An example usage is presented below. As described in Section 2.3, HMC is a Markov chain Monte Carlo method where sampling is performed by moving through the parameter space in a manner governed by a fictitious potential energy function determined by the posterior density of the neural network parameters. The numerical approximation used is the leapfrog method, in which the number of leapfrog steps together with the step size determine the distance traveled to the next proposed point. Naturally, larger step sizes yield longer deterministic trajectories, but they also increase the accumulated error due to the numerical approximation and so lower the acceptance rate. Unfortunately, choosing good values for the number of steps and the step size can be challenging. Therefore,
TensorBNN contains an algorithm, called the parameter adapter, to find these parameters automatically.

The adapter searches a discrete space of step sizes and numbers of leapfrog steps using the algorithm of [10]. This discrete space is determined by a minimum and maximum step size and a minimum and maximum number of steps. In addition, the adapter accepts the number of iterations before a reset is performed. A reset is performed if after this number of iterations no point has been accepted. It is also given the number of random pairs of step size and leapfrog steps to try before using the search algorithm from [10]. Finally, it also accepts two constants a and δ, which assume the values 4 and 0.1, respectively, as suggested in [10].

One of the changes from Neal's procedure is the use of HMC to sample the hyper-parameter space instead of Gibbs sampling. This was done because Gibbs sampling requires knowledge of the conditional distribution for each hyper-parameter given that the rest are fixed. While this is possible to calculate for some hyper-priors, the package allows these to be changed to custom priors, for which it may not be possible to compute the conditional densities. It was, therefore, simpler to use a second HMC sampler to sample the hyper-parameters, since this works for any hyper-prior. The number of steps for the HMC hyper-parameter sampler is kept constant, but the step size is modified depending on the acceptance rate of the sample for 80% of the burn-in period. The TensorFlow Probability
HMC implementation is used for the sampling.

model.setupMCMC(40, 50, 20, 20, 2)

3.3. Model Training

The model is trained using the train method of the network class. The parameters of the train method determine 1) the likelihood function, 2) the metrics to be calculated during training, 3) the number of burn-in epochs, 4) how often to save networks, and 5) the directory in which to store the models. An example is provided below. The likelihood is specified through a
Likelihood object. The likelihood functions included in the package are those given in Eq. (7), though any desired likelihood function can be used through a custom
Likelihood object. The built-in Gaussian likelihood used for regression allows specification of the initial value of its standard deviation σ (see Eq. (7)).

The metrics to be computed are specified by including a list of all desired
Metric objects. The built-in metrics are
PercentError, SquaredError, and
Accuracy. All three accept the normalization scalars of mean and standard deviation to calculate the metrics for unnormalized data, as well as an option to take the exponential of output values before computing the metrics, for the case of log-scaled outputs. The
PercentError metric is defined as
PE = 100 E[ |y_pred − y_true| / y_true ].

The SquaredError metric is defined as

SE = E[ (y_pred − y_true)² ].

In both cases, the expectation is with respect to the input data, y_pred is the predicted value for a given input, and y_true is the target. Finally, Accuracy is simply the number of correct predictions divided by the total number of predictions for a classification task.

likelihood = GaussianLikelihood(sd=0.1)
metricList = [SquaredError(mean=0, sd=1, scaleExp=False),
              PercentError(mean=10, sd=2, scaleExp=True)]
network.train(320, likelihood, metricList=metricList,
              folderName="Regression", networksPerFile=50)

In training, the HMC samplers run for the specified number of epochs. An epoch corresponds to iterating the differential equations inside the main HMC sampler for the specified number of leapfrog steps, updating the weights, biases, and activation functions, and then repeating this process with the hyper-parameters.
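As an illustration of the metric definitions above, a minimal NumPy sketch of the percent error and squared error calculations, together with the unnormalization step the built-in metrics perform, might look as follows; the function names and the unnormalize helper are illustrative, not the package's internal API.

import numpy as np

def unnormalize(y, mean, sd, scale_exp=False):
    # Undo the output normalization (and optional log scaling) before scoring.
    y = y * sd + mean
    return np.exp(y) if scale_exp else y

def percent_error(y_pred, y_true):
    # PE = 100 * E[ |y_pred - y_true| / y_true ], averaged over the input data.
    return 100.0 * np.mean(np.abs(y_pred - y_true) / y_true)

def squared_error(y_pred, y_true):
    # SE = E[ (y_pred - y_true)^2 ], averaged over the input data.
    return np.mean((y_pred - y_true) ** 2)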
In order to prevent the output files in which the networks are saved from becoming too large and to allow predictions midway through training, the number of networks written to each file can be specified by the user, as noted above. One file is saved with the shapes of each matrix that defines the network architecture so that they can be properly extracted. Another file is saved that contains the layer names so the network can be reconstructed. The predictor object, which is instantiated as follows,

network = predictor("modelDir/", dtype=dtype)

uses these files to make predictions. Once the object is instantiated, predictions can be made by calling its predict method with an input data matrix and specifying that only every n-th saved network is to be used.

predictions = network.predict(inputData, n=10)

Beyond making predictions from saved networks, the predictor class is also capable of re-weighting networks given a new set of priors. The ability to re-weight the point cloud of networks makes it possible to study the sensitivity of results to the choice of prior. We can compute the posterior density p_1(θ|D) with the prior used in the generation of the point cloud, and we can calculate the posterior density p_2(θ|D) using a different prior. By weighting each network j by the ratio p_2(θ_j|D)/p_1(θ_j|D), given the network parameters θ_j, we can approximate the point cloud that would have been generated had we used the alternative prior. In general, however, the only components of
p_1(θ|D) and p_2(θ|D) that differ are their priors. Therefore, we can substitute π_2(θ_j)/π_1(θ_j) for p_2(θ_j|D)/p_1(θ_j|D). But since it may be useful to study the effect of different likelihoods on the results, the code includes the option to supply new likelihoods. The option to re-weight allows fine tuning of the networks after training and makes it possible to explore the impact of different priors on the results.

Re-weighting is implemented in the reweight method of the predictor, which requires an architecture file containing the architecture of the network with the different priors. The method can also accept the training data and a likelihood object for use in calculating the network probabilities, should the impact of the likelihood need to be studied. The user can choose to use only every n-th network when making predictions, as illustrated in the code snippet below. In order to use these features a separate Layer object must be created for each new prior. These objects are then passed as a dictionary to the predictor object when it is instantiated. Optionally, a modified version of the likelihood object used while training can be included to study the impact of the modifications. The code below shows the instantiation of a predictor object with a custom Layer added and a call of the reweight method, which returns a weight for each network.

network = predictor("/path/to/saved/network", dtype=dtype,
                    customLayerDict={"dense2": Dense2},
                    likelihood=modifiedLikelihood)
weights = network.reweight(n=10, architecture="architecture2.txt")

The predictor can also calculate the autocorrelation function of the networks, which is useful for choosing a suitable value of n in the predictor. Given the sequence of networks f_1, f_2, · · ·, the autocorrelation and normalized autocorrelation are defined by

C(n) = (1/(N − n)) Σ_{i=1}^{N−n} (f_i − f̄)(f_{i+n} − f̄),  and  ρ(n) = C(n)/C(0),    (11)

respectively, where f̄ is the average prediction over the networks and n is the autocorrelation lag. One expects an approximately exponential decrease of ρ(n) with n. The larger the value of n the smaller the correlation between the networks separated by a lag of n and, therefore, the more independent the resulting ensemble of networks. The autocorrelation is computed with the autocorrelation method, to which an input data matrix inputData is passed along with the maximum lag n_max. The output of the method is a list containing the average autocorrelation for each value of n from 1 to n_max, where n enters the calculation as in Eq. (11). The average is taken with respect to the input data matrix.

The user can also opt to calculate the autocorrelation length, which may be taken to be the smallest recommended lag n. The autoCorrelationLength method accepts the same inputs as autocorrelation and returns a single float. Both of these methods make use of the emcee package [11]. Here is an example of the usage of both methods.

autocorrelations = network.autocorrelation(inputData, 75)
corrLength = network.autoCorrelationLength(inputData, 75)
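Eq. (11) can also be evaluated directly from the predictions of the successive saved networks at a single input; the sketch below is a plain NumPy version of the calculation (TensorBNN itself delegates this computation to the emcee package [11]).

import numpy as np

def normalized_autocorrelation(f, n_max):
    # Eq. (11): C(n) and rho(n) = C(n)/C(0) for a sequence f_1, ..., f_N of
    # predictions from successive saved networks at a single input.
    f = np.asarray(f, dtype=float)
    N = len(f)
    fbar = f.mean()
    def C(n):
        return np.sum((f[:N - n] - fbar) * (f[n:] - fbar)) / (N - n)
    C0 = C(0)
    return np.array([C(n) / C0 for n in range(1, n_max + 1)])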
The HMC parameter adaptation scheme briefly mentioned in Section 3.2 is a modified version of the method described in Ref. [10]. Our method adapts the number of steps and the step size in a leapfrog trajectory in an attempt to maximize the length of the trajectories in the network parameter space in each iteration of the HMC sampler. In
TensorBNN, two variations are introduced. First, in the case that the HMC sampler iterates through 10 leapfrog trajectories without having accepted a point, the algorithm resets and begins with the step size maximum and minimum values decreased by a factor of two. This was empirically observed to prevent the HMC sampler from remaining stagnant with an extremely low acceptance rate. Additionally, after each reset of the parameter adapter, the value of the step size and the number of leapfrog steps are randomized for the first 20 iterations in order to prevent the algorithm from converging to a boundary point of the interval that had a high acceptance rate, but was not optimal, as it was observed to do without this randomization. This algorithm was implemented using a combination of
TensorFlow and
NumPy [12].

In order to begin the Markov chain sampling of network parameters at a position that reduces the number of needed burn-in steps, we trained a fully connected neural network using
AMSGrad [13] with the version of
Keras in TensorFlow. We observed empirically that choosing a starting point for the Markov chain based on pre-training done using
Keras led to a faster burn-in time. The pre-training is done through three training cycles, each containing a fixed number of epochs. After each cycle the learning rate is decreased by a factor of ten, and the best network, judged by the loss computed using the validation data, is selected as the starting point for the next cycle.
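The pre-training cycle described above might be sketched as follows; AMSGrad is obtained by passing amsgrad=True to the Keras Adam optimizer [13], while the cycle count, epoch budget, loss, and checkpoint file name are illustrative placeholders rather than TensorBNN's actual pre-training code.

import tensorflow as tf

def pretrain(model, train_x, train_y, val_x, val_y,
             cycles=3, epochs_per_cycle=100, initial_lr=1e-2):
    # Three training cycles with the AMSGrad variant of Adam; the learning rate
    # drops by a factor of ten after each cycle, and the best validation-loss
    # weights from each cycle seed the next one (and, finally, the HMC sampler).
    lr = initial_lr
    for _ in range(cycles):
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr, amsgrad=True),
                      loss="mse")
        checkpoint = tf.keras.callbacks.ModelCheckpoint(
            "best.weights.h5", save_best_only=True, save_weights_only=True)
        model.fit(train_x, train_y, validation_data=(val_x, val_y),
                  epochs=epochs_per_cycle, verbose=0, callbacks=[checkpoint])
        model.load_weights("best.weights.h5")   # best network starts the next cycle
        lr /= 10.0
    return model.get_weights()                  # starting point for the Markov chain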
4. Use Cases
Here we present a simple application of
TensorBNN, which highlights the important properties of a BNN. The goal is to train a BNN to learn the function

y = x sin(2πx) − cos(πx).    (12)

In this example, the caveat is that the network will be trained on only 11 evenly spaced examples of x drawn from a limited interval, but we will then ask the network to make predictions over a wider interval. The training data and real curve can be seen in Figure 1. We also present training on 31 examples to show that the networks behave as expected: improved performance with more data.

The BNN trained on this data had three hidden layers with 10 perceptrons in each, and the hyperbolic tangent activation function was used. The values for each of the remaining training parameters can be found in Table 4. As the dataset used for this task was so small and the network was not especially large, training can be completed using either a GPU or CPU in a number of minutes.

Figure 1: The training data (blue dots) and actual curve (red line) for the training examples.

Table 4: Network parameters.

A graphical representation of the training results is shown in Figure 2. These plots demonstrate three important properties of BNNs. Firstly, in the upper graph, we see that as the predictions reach farther from the training samples, the uncertainty increases. This is expected behavior; near the training data, the network can be reasonably certain of the prediction due to the information provided to the model during training. However, farther from the training examples, there are many possible predictions that are consistent with the training data. Looking at the plots for training with 31 points, we also see that the overall predicted curve is much closer to the true curve and that the uncertainty is lower, as expected.

Secondly, the credible intervals of the BNN do not always contain the true value. In the upper graph in Figure 2, it is clear that the BNN is quite wrong about the behavior of the curve between the first two and last two training points. This, however, is not unreasonable, as the value of the curve decreases dramatically between those values, without any training data to indicate this behavior. The BNN clearly predicts that the curve should follow a roughly straight path between the points, with some possible variance. This is a plausible prediction given the training data, and demonstrates how, despite the uncertainty estimates inherent in the BNN, it is still susceptible to sudden jumps between training data points. This behavior highlights the importance of providing enough training data to accurately represent the mapping. While the BNN will predict greater uncertainty in regions without data, there is no guarantee that the credible intervals encompass the true value, especially if the true values vary vastly from the training data values. Note, however, that these conclusions are contingent on the form of the prior. It is therefore good practice to study the sensitivity of conclusions to modifications of the prior in any real-world analysis.

Figure 2: The training data (blue dots) and actual curve (red line) for the training example, along with the mean of the predictions (black line), and the mean plus and minus 1 and 2 standard deviations (green and yellow shading). The top plots are over the training range, and the bottom over an extended range. The left plots are with 11 training points, the right plots with 31.

The need for caution when extrapolating too far from where the data lie is demonstrated clearly in the bottom graph of Figure 2. Extrapolating beyond the region containing the training data, the BNN predictions become highly uncertain as expected, but the credible interval may not necessarily contain the true curve. This is, of course, a general observation about any method of extrapolation; the latter depends on assumptions about how the function ought to behave in the region devoid of data.

Beyond these simple examples, a more complex application of
TensorBNN can be found in [14]. In that paper, the package is used to predict the cross sections for supersymmetric particle creation at the Large Hadron Collider, to predict which supersymmetric model points are theoretically viable, and to predict the mass of the lightest neutral Higgs boson, which is generally identified with the Higgs boson discovered at CERN [15, 16].
5. Summary
TensorBNN is a framework that allows for the full Bayesian treatment of neural networks while leveraging GPU computing through the
TensorFlow platform. Through an automation of the search for the parameters needed for an effective Hamiltonian Monte Carlo sampler and the ease of using a pre-trained network,
TensorBNN is able to decrease the computation needed to converge to a sample from the posterior density of the neural network parameters. The algorithm automatically adapts hyper-parameters to reduce the amount of fine-tuning required by the user. As shown through the simple examples, the distributions of the network predictions behave according to expectations and can yield good results even on difficult learning problems. The package provides a flexible means to perform regression and binary classification and is designed with the potential for expansion in terms of allowed network architectures and output possibilities.
6. Acknowledgements
This work was supported in part by the Davidson Research Initiative and the U.S. Department of Energy Award No. DE-SC0010102.
References

[1] R. M. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, doi:10.1007/978-1-4612-0745-0.

[2] M. Abadi, et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, software available from tensorflow.org, 2015.

[3] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An Imperative Style, High-Performance Deep Learning Library, in: Advances in Neural Information Processing Systems 32, 8024–8035, 2019.

[4] Y. Wen, P. Vicol, J. Ba, D. Tran, R. Grosse, Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches, 2018.

[5] E. T. Nalisnick, On Priors for Bayesian Neural Networks, Ph.D. thesis, University of California, Irvine, URL https://escholarship.org/uc/item/1jq6z904, 2018.

[6] M. Betancourt, A Conceptual Introduction to Hamiltonian Monte Carlo, arXiv:1701.02434.

[7] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, R. A. Saurous, TensorFlow Distributions, 2017.

[8] F. Chollet, et al., Keras, https://keras.io, 2015.

[9] K. He, X. Zhang, S. Ren, J. Sun, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, in: The IEEE International Conference on Computer Vision (ICCV), 2015.

[10] Z. Wang, S. Mohamed, N. De Freitas, Adaptive Hamiltonian and Riemann Manifold Monte Carlo Samplers, in: Proceedings of the 30th International Conference on Machine Learning, ICML'13, JMLR.org, III-1462–III-1470, URL http://dl.acm.org/citation.cfm?id=3042817.3043100, 2013.

[11] D. Foreman-Mackey, D. W. Hogg, D. Lang, J. Goodman, emcee: The MCMC Hammer, PASP 125 (2013) 306–312, doi:10.1086/670067.

[12] E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python, URL http://www.scipy.org/, 2001–.

[13] S. J. Reddi, S. Kale, S. Kumar, On the Convergence of Adam and Beyond, in: International Conference on Learning Representations, URL https://openreview.net/forum?id=ryQu7f-RZ, 2018.