Combine and Conquer: Event Reconstruction with Bayesian Ensemble Neural Networks
Prepared for submission to JHEP
IPPP/20/74
Jack Y. Araz and Michael Spannowsky

Institute for Particle Physics Phenomenology, Durham University, South Road, Durham, DH1 3LE

E-mail: [email protected], [email protected]

Abstract:
Ensemble learning is a technique where multiple component learners are combined through a protocol. We propose an Ensemble Neural Network (ENN) that uses the combined latent-feature space of multiple neural network classifiers to improve the representation of the network hypothesis. We apply this approach to construct an ENN from Convolutional and Recurrent Neural Networks to discriminate top-quark jets from QCD jets. Such an ENN provides the flexibility to improve the classification beyond simple prediction-combining methods by increasing the classification error correlations. In combination with Bayesian techniques, we show that it can reduce epistemic uncertainties and the entropy of the hypothesis by exploiting various kinematic correlations of the system.
Keywords: deep learning, ensemble neural networks, bayesian neural networks

1 Introduction

Deep Learning (DL) has gained tremendous momentum on the verge of the latest developments in data analysis. Whilst boosted decision trees (BDT) have been used in the context of High-Energy Physics for over 30 years, wide usage of Deep Neural Networks (DNNs) only surged very recently. Since then, especially in applications to LHC physics, where a large amount of data with the need for its fast and automated analysis is gathered, there has been a profound improvement in the understanding of Neural Networks (NNs). The analysis of the internal structure of jets, highly complex collimated sprays of radiation [1], is a popular arena where reconstruction techniques evolved from sophisticated multi-variate approaches, e.g. HEPTopTagger [2-4], over theory-guided matrix-element methods [5-8] to data-driven NN techniques [9-12]. In particular, top tagging has been the prime example to benchmark the performance of various NN classifiers [13-21]. Similar tagging algorithms have been used for Higgs [22, 23] and W-boson [24, 25] tagging and quark-gluon discrimination [26-30]. (For a review of these methodologies and more, see refs. [14, 31], and other examples [32-44].) Thus, it became apparent that there is a wide range of use-cases for NNs in collider phenomenology, where particle tagging is just one of many applications.

A standard supervised learning algorithm produces a fitting function that aims to find an optimal contour of the decision boundary between competing hypotheses. (Here the word "fitting" is used to simplify the text; Deep Learning is not merely a fitting algorithm, as it looks for a higher-dimensional irreducible representation in which the feature-space lives.) The given algorithm takes a labelled feature-tensor and attempts to find the global minimum of a given objective function, the so-called loss function, resulting in the prediction of the algorithm. This is achieved by convoluting the input feature vector with non-linear functions, so-called activation functions, and updating the weights of the initial hypothesis through the backpropagation algorithm. Whilst such an approach offers increased flexibility, in general it can suffer from three major predicaments [45]. First, the problem of statistics denotes the lack of training examples within a particular domain, which can cause the learning algorithm to get stuck in various minima with comparable accuracies in each training. The second problem is computational. As mentioned before, a learning algorithm often employs a stochastic search algorithm, e.g. gradient descent. Even assuming the provision of sufficient data, the feature-space can be highly complex, creating a very non-trivial loss-hypersurface for which the algorithm is tasked to find the global minimum [46, 47]. Finally, the third problem is representational. By the nature of a "fitting" algorithm, it is not always possible to find a linear or non-linear representation of the actual function. Hence, it might be necessary to expand the representation space or employ various possible hypotheses to find a closer approximation of the actual function. Although the representation problem is directly linked to the previously mentioned issues, even with sufficient statistics and advanced algorithms, an optimization algorithm may not proceed after finding a hypothesis that can adequately explain the data [48].

The three most popular architectures for classification tasks in particle physics are currently Deep Neural Networks (DNN), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). Each of these networks is designed to exploit different features and correlations of the input data. For instance, CNNs are special-purpose networks that are widely used for image recognition [14, 18]. This method sweeps through the image by dividing it into subvolumes. Each subvolume is transferred to the next layer by passing through an activation function, allowing the network to filter the image's distinguishable features. RNNs are a different kind of specialised network that keeps track of the ordering of the feature vector and thus maintains a sense of "memory" by connecting each node in a graph via an ephemeral sequence. Long short-term memory (LSTM) networks have been employed to classify QCD events with high accuracy [17, 25, 49]. While each of these techniques can be powerful by itself, it is not clear whether they exploit the full amount of information contained in the feature vectors to perform an optimal classification between competing hypotheses. Thus, combining multiple networks into an Ensemble Neural Network (ENN) might allow to improve on their individual classification performances.

Ensemble learning is a paradigm which employs multiple neural networks to solve a problem. The main idea behind ensembling is to increase the generalisation of the data by harvesting many hypotheses trying to solve the same problem. Each of the networks mentioned above is specialised to learn a particular feature of the given data to achieve the same or a similar generalisation. An ensemble of these networks can access all the information presented in each component network and optimise it according to more generic information through data [50-55].

While some techniques to combine classifiers have been used in the context of collider phenomenology before (ref. [56] shows that combining predictions of BDTs with specific rules can improve the discrimination of BSM models from the SM; ref. [57] shows that injecting randomness into a hypothesis and combining its results can boost the accuracy of the classification; refs. [58, 59] use a stacked-combining method for Higgs tagging at the LHC; and ref. [60] combines the predictions of multiple different learners), to our knowledge we present, for the first time, a parallel combining method that goes beyond simple prediction combinations. As shown in previous studies [45, 46, 51, 61], combining predictions of various networks can significantly improve the overall performance for classification or regression. However, if networks are only combined at the prediction level, they are each separately trained for a specific property of the data. Parallel combined ensembles allow the network to train on a combined higher-dimensional latent-space and to optimize the entire network accordingly. Hence, having access to all component networks allows improvement upon the representation of the problem. We will show that such an approach allows the flexibility to improve background rejection beyond simple prediction combinations. Furthermore, we will show that it drastically improves the network's error correlations beyond the component and prediction-based-combined networks.

With continuously improving performance indicators for NNs, e.g. measured through receiver operating characteristic (ROC) curves, it becomes increasingly important to obtain an understanding of how this is achieved and how reliably the performance is evaluated [62-67]. Bayesian neural networks allow to estimate the intrinsic uncertainties of NNs by treating their weights as distributions instead of single trainable variables [68, 69]. Hence the network output is a distribution rather than a fixed value. To estimate the uncertainties of a network, multiple measurements of the same test data are combined to calculate the mean prediction alongside its standard deviation. We will employ Bayesian techniques to show that parallel combining methods, i.e. as implemented in ENNs, can reduce the standard deviation of the predictions and the epistemic uncertainties without requiring more data.

In Sec. 2 we provide a discussion of Ensemble Neural Networks and review their applications and benefits in improving the classification performance. In Sec. 3.1 we describe the procedure we employed to preprocess the input data before the training, and in Sec. 3.2 we present our results. Finally, in Sec. 4 we compare uncertainties between component networks and their ensemble, and we offer a summary and conclusions in Sec. 5.
2 Ensemble Neural Networks

Ensemble Neural Networks (ENNs) are protocols that aim to increase the generalizability of a hypothesis by combining multiple component networks. It has been shown that ENNs can provide the necessary resolutions or approximations for all three potential pitfalls of NNs mentioned in Sec. 1 [50-54]. Depending on the problem at hand, ensembling methods can be pooled under three paradigms [55]: (i) parallel combining, (ii) stacked combining and (iii) combining weak classifiers.

Combining classifiers spanning feature-spaces that contain different physical domains can provide an expanded representation of the hypothesis space, see Fig. 1. Such methods are studied under the so-called "parallel combining" method. Another technique, called "stacked combining", employs different classifiers to be trained on the same feature-tensor. Such techniques can provide simple solutions to the computational problem, where multiple non-correlated hypotheses can approximate the underlying function more efficiently. The final and most widely studied method is "combining weak classifiers" where, as the name suggests, weak but successful classifiers' predictions are assembled to create a NN that reaches accuracies beyond its constituents [45]. Here successful means that the hypothesis has greater accuracy than random selection. Although existing methods under the paradigms (ii) and (iii) can successfully optimize over statistical and computational shortcomings of the NNs [70-77], they cannot expand the representation of the hypothesis without acquiring an extended domain of the data. Hence one needs a dedicated approach to address the representation problem to learn over different types of correlations within distinct symmetries of the data.

While ENNs are known to improve on the statistics and computational problems [55], see Sec. 1, their benefit for the representation problem, which is prevalent in most collider phenomenological applications, is underrated. We propose the use of ENNs for event reconstruction at high-energy collider experiments under the paradigm of parallel combining. We will further show that this approach improves on the representation problem.

For this purpose, we will use two high-level classifiers, a CNN and a RNN, which are often used for image recognition and text recognition respectively. Both of these models generalise a specific property of a jet, i.e. the spatial position of the substructure of a jet and the sequential order of a cluster history, respectively. Naively, one could take the mean prediction of both classifiers, which will lead to a generalisation of the problem in the higher-dimensional hypothesis space. Although this can improve the performance, both component networks are optimised for their own feature space. In this study, we show that, instead of combining the component networks' predictions, optimising the network over the combined latent-feature space can lead to a more substantial and stable performance improvement for the problem at hand.

Thus, we propose to initialise multiple high-level classifiers separately. For the example of Sec. 3, these are chosen to be CNN and RNN classifiers. Each of the CNN and RNN provides a vector in the latent-feature space, corresponding to the flattened image for the CNN and the higher-dimensional representation sequence for the RNN. Concatenating these vectors will lead to a larger latent-space, including information from both image-type and sequence-type data. Training with this higher-dimensional feature space, with extra handles for the NN architecture such as more layers or nodes to generalise this latent-feature space, can lead to two significant improvements. Firstly, each component network's weights will be optimised with respect to the combined hypothesis space and hence will have access to more features of the base theory. Secondly, the ability to access a larger latent-feature space will make it possible to increase the complexity of the model for a larger hypothesis-space.

Fig. 1 shows a schematic representation of this approach, where one source of input is divided into multiple branches to be analysed within different architectures. Depending on the nature of the problem, one can employ multiple network architectures such as fully connected networks (blue), CNNs (purple), RNNs (green) or even more complex structures which, for the sake of simplicity, are not shown explicitly. The merging stage represents the concatenation process where, instead of the prediction of each model, one can combine the latent-space of each network after its individual i-th layer and continue training on this new feature space. Hence, the network's output will be the prediction optimised over each distinct feature of the problem.
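As an illustration, the following minimal Keras sketch contrasts the two combining strategies of Fig. 1; all layer sizes, input shapes and names here are illustrative placeholders rather than the configuration used later in Sec. 3.2.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

image_in = layers.Input(shape=(37, 37, 1), name="jet_image")
seq_in = layers.Input(shape=(20, 1), name="cluster_sequence")

# Component branches ending in latent-feature vectors, not predictions.
cnn_latent = layers.Dense(16, activation="relu")(
    layers.Flatten()(layers.Conv2D(8, 3, activation="relu")(image_in)))
rnn_latent = layers.LSTM(32)(seq_in)

# (a) Prediction-level combining: average two independent sigmoid outputs.
cnn_pred = layers.Dense(1, activation="sigmoid")(cnn_latent)
rnn_pred = layers.Dense(1, activation="sigmoid")(rnn_latent)
mean_model = Model([image_in, seq_in],
                   layers.Average()([cnn_pred, rnn_pred]))

# (b) Parallel combining: concatenate the latent spaces and keep training
# on the combined representation before a single output layer.
merged = layers.Dense(96, activation="relu")(
    layers.Concatenate()([cnn_latent, rnn_latent]))
enn_model = Model([image_in, seq_in],
                  layers.Dense(1, activation="sigmoid")(merged))
```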
Figure 1. A schematic representation of ensemble neural networks, where the blue box represents a NN with dense layers, purple represents a convolutional neural network and green represents a recurrent NN with inputs x_i and output values h_i for an operator A. The solid line at the bottom guides towards the latent-space concatenation which leads to the ensemble prediction. Dashed lines represent the same for the mean prediction of each network.

Whilst the network architectures discussed often unveil a strong performance improvement over conventional cut-based reconstruction strategies, one wonders if combining any NNs will increase accuracy. To answer this question one needs to investigate the bias-variance-covariance decomposition. The prediction of an ensemble estimator, constructed by averaging the prediction of each component estimator, assuming that they are independent from each other, can be cast as

\[
f_{\mathrm{ens}}(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x)\,, \qquad (2.1)
\]

where N is the number of component estimators, f_i(x) is the prediction of the i-th estimator and x is the feature-tensor. For such an object, the generalization error is given by [61, 78]

\[
\mathrm{Err}(f_{\mathrm{ens}}) = \frac{1}{N}\,\mathrm{Var}(x) + \left(1 - \frac{1}{N}\right)\mathrm{Cov}(x) + \mathrm{Bias}^2(x)\,, \qquad (2.2)
\]

where the three terms correspond to the variance, the covariance and the squared bias of the feature-tensor, respectively. Although such a construction assumes a very simplistic case, it shows that the generalization error of the average prediction of multiple hypotheses is also affected by the covariance. This shows that, if the component hypotheses are negatively correlated with each other, the average prediction will decrease the generalization error further. However, as the average bias will remain the same, the generalization error can only be reduced down to the bias term. Thus an ENN can improve the generalization error if and only if the given component estimators' errors are not completely correlated [50, 79].
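As a quick sanity check of Eq. (2.2), the following self-contained snippet builds N synthetic, correlated component predictions and compares the Monte-Carlo ensemble error with the decomposition; all numbers here (N = 4, correlation 0.3, bias 0.1) are illustrative assumptions, not values from this study.

```python
import numpy as np

rng = np.random.default_rng(0)
target, N, trials = 1.0, 4, 200_000

# Correlated component predictions around a common, slightly biased mean.
bias = 0.1
cov = 0.3 * np.ones((N, N)) + 0.7 * np.eye(N)   # unit variances, rho = 0.3
preds = rng.multivariate_normal([target + bias] * N, cov, size=trials)

ens_err = np.mean((preds.mean(axis=1) - target) ** 2)  # LHS of Eq. (2.2)

var_bar = np.mean([np.var(preds[:, i]) for i in range(N)])
cov_bar = np.mean([np.cov(preds[:, i], preds[:, j])[0, 1]
                   for i in range(N) for j in range(N) if i != j])
bias_bar = np.mean(preds.mean(axis=0) - target)

rhs = var_bar / N + (1.0 - 1.0 / N) * cov_bar + bias_bar ** 2
print(f"ensemble error {ens_err:.4f} vs decomposition {rhs:.4f}")
```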
3 Top Tagging

Using CNNs, the pixelated energy deposits in the calorimeters of multi-purpose high-energy experiments have been repeatedly shown to provide strong discriminatory power between the radiation profile of top quarks versus QCD jets. In the η-φ plane, each pixel corresponds to one or more particles, and the so-called colour or luminosity of a pixel can be measured by a combined intrinsic property of these particles, such as energy or transverse momentum. This allows the CNN to learn translationally invariant features of the top and jet system. RNNs instead maintain a sense of timing and memory in a given sequence used as input features. Due to the nature of the clustering algorithm, each jet has an embedded tree structure, where subjets are recombined with respect to a particular rule. Thus, CNNs and RNNs exploit different features of top and QCD jets to discriminate them from each other. We use the complementarity of both methods to combine them in an ENN that has improved performance over both approaches individually. An implementation of the code we use for preprocessing and network training is provided at https://gitlab.com/jackaraz/EnsembleNN.

3.1 Data preprocessing

As a case study, we will investigate the top tagging capabilities of NNs by employing a CNN and a RNN. To achieve this, we used the dataset provided in [60, 80], which consists of 14 TeV top signal and mixed quark-gluon background jets generated and showered by Pythia [81] and passed through the Delphes [82] fast detector simulation. Jets are clustered with the anti-kT algorithm [83] as defined in FastJet [84], using a radius parameter R = 0.8. The fat-jet transverse momentum has been limited to the [550, 650] GeV p_T range, and the resulting fat jets are further required to be within |η_j| < 2. Finally, the fat jets in the top signal sample have been matched with truth-level tops, requiring ΔR(j, t_truth) < 0.8.
This dataset consists of 1.2 million training events and 400,000 validation and test events, respectively. Within our framework, this dataset has been divided into two subsets, one for CNN-type training and one for RNN-type training. For both of the datasets, the provided PFlow objects are clustered into a fat jet as described above.

The CNN dataset has been prepared from the leading anti-kT fat-jet constituents, which are ordered by their corresponding transverse momentum. Each jet in the event has been centred with respect to the total p_T-weighted centroid, such that the jet vector is centred at (η, φ) = (0, 0) and the principal axis points in the direction of positive pseudo-rapidity for all constituents. These modified constituents are filled into 37 × 37 pixels in the η-φ plane, where the colour of a pixel is given by the total p_T within that pixel. To get the leading constituent into the first quadrant, the vertical half of the image with higher total p_T is flipped to the right and, similarly, the horizontal half of the image with higher p_T is flipped to the top. Fig. 2 shows the averaged top signal (left) and dijet background (right) images for 10,000 events projected on the modified η-φ plane. Colour represents the magnitude of the transverse momentum in the corresponding pixel. Note that these images have been zoomed in to highlight the relevant portion of the image. Since the network requires the input data within the [0, 1] range, each image has been normalized by 1 TeV before training.
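For concreteness, a minimal sketch of this image pipeline is given below, assuming the constituents of a jet are available as (η, φ, p_T) arrays; the grid half-width and the omission of the principal-axis rotation are simplifying assumptions, so this is not a verbatim transcription of our preprocessing code (see the repository linked above for that).

```python
import numpy as np

def jet_image(eta, phi, pt, n_pix=37, half_width=1.5):
    """eta, phi, pt: 1D arrays of constituent kinematics (pt in GeV)."""
    # Centre constituents on the total-pT-weighted centroid, i.e. (0, 0)
    eta = eta - np.average(eta, weights=pt)
    phi = phi - np.average(phi, weights=pt)
    # (The principal-axis rotation described in the text is omitted here.)
    edges = np.linspace(-half_width, half_width, n_pix + 1)
    img, _, _ = np.histogram2d(eta, phi, bins=(edges, edges), weights=pt)
    # Flip the higher-pT vertical/horizontal halves into the first quadrant
    half = n_pix // 2
    if img[:half, :].sum() > img[half + 1:, :].sum():
        img = img[::-1, :]
    if img[:, :half].sum() > img[:, half + 1:].sum():
        img = img[:, ::-1]
    return img / 1000.0   # normalize by 1 TeV so pixels lie in [0, 1]
```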
Figure 2. The left panel shows the averaged top signal image on the modified η-φ plane and the right panel shows the same for the dijet sample. Colour represents the combined transverse momentum of the constituents within a pixel. Each image includes 10,000 events.

The RNN dataset has been constructed using the leading anti-kT fat jet, where the constituents of this jet are re-clustered with the same radius parameter using the Cambridge/Aachen (C/A) clustering algorithm [85]. In order to construct the training sequence, three leading branches have been extracted from the clustering history, where the branches are defined by their respective transverse momentum. The initial two leading branches are constructed from the first two subjets in the clustering history, where the subjet with larger p_T has been chosen as the leading branch. The third leading branch has been chosen from within the parent subjets of the first leading subjet; the parent with the lowest p_T is considered as the third leading branch. Fig. 3 shows a schematic representation of this selection, where blue stands for the leading branch, following the subjets with relatively higher momentum than the consecutive parent subjet. Green is the second leading branch and purple is the third leading branch, following the same pattern as the leading branch. Black lines represent the discarded branches, which have less p_T compared to the corresponding parent subjet. Finally, red represents the C/A jet.

Figure 3. A schematic representation of the cluster history, where blue represents the leading branch with respect to the relative magnitude of transverse momentum, green is the second leading branch and purple is the third leading branch. Black lines show the discarded branches. Finally, dark red represents the initial clustered jet. The size of the circles represents the relative magnitude of transverse momentum.

The sequence has been constructed using the k_T-distances in the clustering history, defined as

\[
d_{i,j} = \min\left(p_{T,i}^2,\, p_{T,j}^2\right)\frac{\Delta R_{ij}^2}{R^2}\,.
\]

Here i, j label the parent subjets, ΔR_ij is the relative angular distance between the two subjets and R is the clustering radius, given as 0.8. For each parent subjet in a branch, the d_{i,j} value is stored in its chronological order; the distances indicated in Fig. 3 are included as part of the leading branch sequence. In order to compose the RNN sequence, we first used the mass of the anti-kT jet and then the mass of the C/A jet constructed using the Mass Drop Tagger [86], followed by the k_T-distances of the leading, second-leading and third-leading branches, respectively. Branches with fewer subjets are then padded with zeros. The upper panel of Fig. 4 shows the k_T sequence for 2000 top signal and 2000 dijet background events. Each event has been drawn with high transparency; hence the vibrant colours show the high occurrences of particular sequences, where blue and red stand for the top and dijet samples. The bottom two panels of Fig. 4 show the number of subjets in each branch, where the left and right panels correspond to the top and dijet samples, respectively. (We also tested our sequence by constructing it out of jets clustered by the kT and anti-kT algorithms; however, the discriminative power has been observed to be less than for the sequence clustered by the C/A algorithm.) Before passing the input feature vectors to the network for training, the dataset has been standardized using the RobustScaler within the Scikit-Learn package [87], using 100,000 mixed events from the training sample.
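A minimal sketch of this standardization step follows; the sequence length and the synthetic placeholder data are assumptions.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
sequences = rng.exponential(size=(200_000, 17))  # placeholder kT sequences

scaler = RobustScaler()              # centres on the median, scales by the IQR
scaler.fit(sequences[:100_000])      # fit on 100,000 mixed training events
train_scaled = scaler.transform(sequences)
# The same frozen statistics are then reused for validation and test data.
```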
Figure 4. The top panel shows the combined k_T-distances in the RNN sequence for 4000 events. Red represents the dijet events and blue represents the top signal events. The dominant colours show which events have high occurrence for a particular sequence. The bottom two panels show the number of subjets in each branch, where the left panel shows the top signal and the right panel the dijet background.

3.2 Results

In order to study the effects of ensembling multiple architectures, we will first introduce two "comparable" but independent architectures for the CNN and RNN types of dataset presented in Sec. 3.1. Our NN architecture relies on the Keras library [88] embedded in TensorFlow version 2.2 [89].

The CNN dataset has been trained with a network receiving 37 × 37 images which, after convolution and 2 × 2 max-pooling, are reduced to an 18 × 18 image with eight features. Finally, these images have been flattened and passed to a fully connected dense layer with sixteen nodes and a dropout probability of 25%. A rectified linear unit (ReLU) activation function has been used for each layer. The network is then completed by a dense output layer with a single node and sigmoid activation for classification.

Furthermore, the RNN dataset has been trained with a slightly more complex architecture, starting with an LSTM layer including 64 nodes. The activation and recurrent activation functions for the LSTM layer have been chosen as the hyperbolic tangent and sigmoid functions. It is followed by three fully connected dense layers with 64, 64 and 32 nodes, respectively, each dense layer followed by a dropout layer with 25% probability. As before, the ReLU activation function has been used for each dense layer. The network output has been generated from a final dense layer with a single node and sigmoid activation function.

Both networks aim to minimize a binary cross-entropy loss function via the Adam optimizer [90]. Networks are trained for 500 epochs, and the learning rate has been halved every 20 epochs if there is no improvement in the validation dataset's loss value. If the network did not improve the validation loss for 250 epochs, the training terminated automatically.

Since the goal of this study is to question whether a more extensive representation can generalize the given problem better than its component hypotheses, we employed two types of ensembling methods. As a reference case, we studied the mean of both CNN and RNN predictions. As mentioned in Sec. 1, such ensembles have been shown to go beyond the accuracies of their component networks. For the main case, we study an extended architecture where the CNN and RNN architectures are concatenated before their output layer, resulting in a latent-space of 48 features. To find an optimal generalization of this latent-feature space, they are further connected to a fully connected dense layer with 96 nodes, employing a ReLU activation function and L2 kernel regularization with a penalty strength of 0.05. This dense layer has been padded with 25% dropout layers before and after, and then connected to an output layer as before, activated via a sigmoid function.
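A sketch of the component and ensemble architectures just described is given below; where the text does not pin down a detail (convolution kernel, sequence length, initial learning rate), the values used here are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, callbacks, Model

image_in = layers.Input(shape=(37, 37, 1))
seq_in = layers.Input(shape=(17, 1))          # placeholder sequence length

# CNN branch: convolution + 2x2 max-pooling -> 18x18 image with 8 features,
# flattened into a 16-node latent layer (kernel size is a placeholder).
x = layers.Conv2D(8, 3, padding="same", activation="relu")(image_in)
x = layers.MaxPooling2D(2)(x)                 # 37x37x8 -> 18x18x8
x = layers.Flatten()(x)
cnn_latent = layers.Dropout(0.25)(layers.Dense(16, activation="relu")(x))

# RNN branch: LSTM(64) with tanh/sigmoid activations, then 64-64-32 dense
# layers, each followed by 25% dropout -> 32 latent features.
y = layers.LSTM(64, activation="tanh", recurrent_activation="sigmoid")(seq_in)
for nodes in (64, 64, 32):
    y = layers.Dropout(0.25)(layers.Dense(nodes, activation="relu")(y))

# Parallel combining: 16 + 32 = 48 latent features, a 96-node dense layer
# with L2 penalty 0.05 padded by 25% dropout, and a sigmoid output.
z = layers.Concatenate()([cnn_latent, y])
z = layers.Dropout(0.25)(z)
z = layers.Dense(96, activation="relu",
                 kernel_regularizer=regularizers.l2(0.05))(z)
z = layers.Dropout(0.25)(z)
output = layers.Dense(1, activation="sigmoid")(z)

enn = Model([image_in, seq_in], output)
enn.compile(optimizer=tf.keras.optimizers.Adam(1e-3),   # initial LR assumed
            loss="binary_crossentropy")

fit_callbacks = [
    callbacks.ReduceLROnPlateau(factor=0.5, patience=20),  # halve on plateau
    callbacks.EarlyStopping(patience=250),                 # stop after 250
]
# enn.fit([images, sequences], labels, epochs=500,
#         validation_data=..., callbacks=fit_callbacks)
```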
Figure 5. Receiver operating characteristic curves, where the CNN, RNN, the mean prediction of both, and the ENN architectures are represented by green, blue, orange and red curves. The epistemic uncertainty is represented by the transparent area around each curve for one standard deviation. The black curve represents a random choice. The inner plot zooms into a slice of the signal efficiency up to ε_S = 0.7.

In order to estimate the inherent uncertainty of each model, the test data has been divided into randomly selected, non-overlapping partitions of 50,000 events. Fig. 5 shows the ROC curve for each model. The RNN and CNN are represented by blue and green curves, alongside the inherent uncertainty for one standard deviation. The orange curve shows the mean prediction of these two models, which already indicates a higher generalization power than each component network. Finally, the red curve shows the minimalistic ENN configuration. Although the training on the concatenated latent-feature space is minimal, it still reveals an improvement beyond the mean prediction. The inner plot of Fig. 5 zooms into the slice of tagging efficiency up to 0.7 to emphasize this improvement. Fig. 5 also shows the area under the curve (AUC) value for each curve, where the improvement of the mean prediction and the ENN is also visible.
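The partitioned evaluation can be sketched as follows, where y_true and y_score stand for the test labels and a network's outputs; the interpolation grid is an implementation choice.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_with_band(y_true, y_score, part_size=50_000):
    grid = np.linspace(0, 1, 200)          # common signal-efficiency grid
    idx = np.random.default_rng(0).permutation(len(y_true))
    curves, aucs = [], []
    for start in range(0, len(idx) - part_size + 1, part_size):
        part = idx[start:start + part_size]
        fpr, tpr, _ = roc_curve(y_true[part], y_score[part])
        curves.append(np.interp(grid, tpr, fpr))  # bkg eff. vs signal eff.
        aucs.append(auc(fpr, tpr))
    curves = np.asarray(curves)
    return grid, curves.mean(axis=0), curves.std(axis=0), \
        np.mean(aucs), np.std(aucs)
```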
Figure 6. Squared-error correlation mapped on 50,000 randomly selected test images for the RNN (upper left), CNN (upper right), mean (lower left) and ENN (lower right).
As mentioned before, for the ENN to show a significant performance improvement over all pooled networks, it is important for the component networks to show mutually comparable performance. As seen from Fig. 5, both the ENN and the mean prediction are dominated by the CNN above a tagging efficiency of 0.8 and by the RNN below a tagging efficiency of 0.15. This explicitly shows that, no matter how complex the ENN architecture is, if one component network dominates the other component networks, the ensemble will closely follow the performance curve of the best component network. As seen from the interval of the ROC curve up to a tagging efficiency of 0.7, the ENN improvement is maximized when the component accuracies are similar.

As discussed in Sec. 2, combining hypotheses with non-correlated errors may improve an ensemble's prediction. In order to test this, Fig. 6 shows the correlations of the squared error, (y - ŷ)², mapped on the test images, where y is the truth label and ŷ is the prediction of the corresponding network. Fig. 6 shows the RNN (upper left panel), CNN (upper right panel), mean prediction (lower left panel) and ENN (lower right panel). Each correlation has been estimated using 50,000 randomly selected test images. One can immediately see the shrinking area of the blue, negatively correlated distribution. Although the correlation maps of the RNN and the CNN look similar, the mean prediction improves upon the non-overlapping portions of the two hypotheses. The ENN goes beyond the mean prediction's achievement by drastically shrinking the blue region and removing the fluctuations in the red (positively correlated) region, hence increasing the correlations between the squared error and the image pixels. As expected, similarly correlated regions changed neither for the mean prediction nor for the ENN. Thus, combining all available neural networks would not improve the accuracy if their errors are highly correlated. Instead, one can benefit from this methodology by employing networks with comparable accuracies and different error correlations to improve the latent-feature-space accuracy.
4 Uncertainties

For all phenomenological applications it is important to assess the intrinsic uncertainties of a NN model. Two major uncertainties can be modelled within the context of DL [62, 69]: the irreducible noise in the observations, called the aleatoric uncertainty, and the uncertainty intrinsic to the proposed hypothesis, called the epistemic uncertainty. Given sufficient data, epistemic uncertainties can be explained and reduced. The decomposition of the variance of a binary hypothesis is given as [91, 92]

\[
\mathrm{Var}(y) = \underbrace{\langle \hat{y}^2 \rangle - \langle \hat{y} \rangle^2}_{\text{epistemic}} + \underbrace{\langle \hat{y}(1-\hat{y}) \rangle}_{\text{aleatoric}}\,, \qquad (4.1)
\]

where ŷ represents the network's predictive distribution; the first term represents the epistemic uncertainty while the second term is the aleatoric uncertainty. In addition to the uncertainties, the entropy of the network's prediction also gives strong indications about the underlying uncertainties of the system, where higher entropy points to higher uncertainty. The entropy of binary classification is given as [93]

\[
S = -\left( \hat{y} \log \hat{y} + (1-\hat{y}) \log (1-\hat{y}) \right)\,, \qquad (4.2)
\]

where the first term stands for the classification of class 1 (top signal) and the second term for the classification of class 0 (dijet background).

In order to analyse the uncertainties underlying our neural networks, we used the TensorFlow Probability package version 0.10.0 [94]. We limited ourselves to prediction uncertainties by only changing each network's output layer to a DenseFlipout layer [95] with sigmoid activation. (To get the complete model uncertainties from each layer, one can instead build the entire network from Bayesian layers; this would double the number of trainable parameters in each layer, so in order to simplify our problem we concentrate on prediction uncertainties only.) The kernel divergence function has been chosen to be the mean Kullback-Leibler divergence. We employed the same network architectures presented in Sec. 3.2. As before, all networks are trained for 500 epochs with the Adam optimizer, and the learning rate is halved every 20 epochs if the validation loss has not improved. The final prediction has been reported using 100,000 randomly chosen test samples, where each network output has been sampled 100 times.

Although the notion of a "mean prediction" is ambiguous in the Bayesian context, in order to have a baseline we defined the mean prediction of the CNN and RNN networks as the mean of each network's 100 samples. This serves as the linear-combination ensemble baseline, which has not been trained on any latent-feature space beyond its component networks. To reveal the full effect of our ensembling technique, we used one ensemble learner with a single dense layer including 96 nodes, as before, and another ensemble learner with an additional dense layer with 96 nodes. (We did not observe a significant improvement in the ROC AUC from adding the extra dense layer; thus, further optimization beyond adding an extra layer would be required to improve the accuracy of the ensemble learner. Since this is beyond our scope, we limit ourselves to this simplistic architecture.)
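A minimal sketch of this Bayesian output-layer swap, using TensorFlow Probability's DenseFlipout layer; scaling the Kullback-Leibler divergence by the number of training examples is a common convention and an assumption here, as is the trunk placeholder.

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

n_train = 1_200_000  # training-set size, used to scale the KL term
kl_fn = lambda q, p, _: tfp.distributions.kl_divergence(q, p) / n_train

# Drop-in Bayesian replacement for the deterministic sigmoid output layer:
bayes_output = tfp.layers.DenseFlipout(
    1, activation="sigmoid", kernel_divergence_fn=kl_fn)

# With `trunk` denoting the rest of a (hypothetical) model, every forward
# pass samples the output-layer weights, so repeated evaluation of the same
# test inputs yields a predictive distribution:
# model = tf.keras.Sequential([trunk, bayes_output])
# samples = np.stack([model(x_test, training=False) for _ in range(100)])
```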
Figure 7. Mean entropy distribution with respect to the standard deviation of the entropy for the RNN (blue), CNN (green) and ENN (red), where the ensemble has two dense layers (left panel). Mean entropy distribution with respect to the percentage of binned events (right panel).

              RNN      CNN      Mean     ENN (1 layer)   ENN (2 layers)
μ̂_S < 0.5    ….92%    75.22%   72.61%   78.05%          79.…%

Table 1. Percentage of events for each network structure, i.e. RNN, CNN, Mean and ENN, with mean entropy below 0.5.

The left panel of Fig. 7 shows the mean entropy, μ̂_S, distribution with respect to the standard deviation of the entropy, σ̂_S, where the RNN, CNN and two-layer ENN are represented by blue, green and red points. In order to simplify the plot, the mean prediction and the one-layer ENN model are not shown. One can immediately conclude that the ensemble learner puts a considerable limitation on the standard deviation of the entropy: the CNN reaches beyond 0.025 and the RNN up to 0.015, but the ENN limits the standard deviation to below 0.0075. The right panel of Fig. 7 shows the percentage of events per mean entropy. As before, the RNN and CNN architectures are represented by blue and green solid curves; the separation between the curves, in both μ̂_S and σ̂_S, increases at intermediate entropy values. This is also summarized in Table 1, where more than 78% of the events for both ensemble learners reach a mean entropy μ̂_S of less than 0.5, while the RNN, CNN and mean prediction remain below 75.3%.

Figure 8. The left panel shows the normalised number of events per standard deviation in prediction. The right panel shows the same for the epistemic uncertainty. In each histogram the RNN, CNN, mean, one-layer ENN and two-layer ENN are represented by blue, green, orange, red and purple curves.
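Given such sampled outputs, Eqs. (4.1) and (4.2) can be evaluated per event as sketched below, with samples assumed to be an array of shape (n_samples, n_events).

```python
import numpy as np

def uncertainty_report(samples, eps=1e-12):
    """samples: array of shape (n_samples, n_events) of sigmoid outputs."""
    mean = samples.mean(axis=0)
    epistemic = (samples ** 2).mean(axis=0) - mean ** 2   # <y^2>-<y>^2, Eq. (4.1)
    aleatoric = (samples * (1.0 - samples)).mean(axis=0)  # <y(1-y)>,    Eq. (4.1)
    entropy = -(samples * np.log(samples + eps)
                + (1.0 - samples) * np.log(1.0 - samples + eps))  # Eq. (4.2)
    # Return epistemic/aleatoric terms plus mean and spread of the entropy
    return epistemic, aleatoric, entropy.mean(axis=0), entropy.std(axis=0)
```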
We also analyzed the standard deviation of the hypothesis prediction, which is crucial to keep small in order to achieve consistent predictions. The left panel of Fig. 8 shows the fraction of events per standard deviation in prediction, with the same colour scheme as before. Given a sufficiently complex problem, the ENN is observed to reduce σ̂_bayes significantly compared to each component network and to the mean combination of those networks. While the mean prediction reaches up to σ̂_bayes ∼ 0.01, the ENN limits the standard deviation to below 0.004, similar to the reduction observed for the standard deviation of the mean entropy. In the right panel of Fig. 8, we show the epistemic uncertainty, as given in the first term of Eq. (4.1), using the same colour labelling. Again, we find a significant reduction of the uncertainties with ensemble learners. These results show that learning over various symmetries leads to a more accurate representation of the given problem without requiring more data.

Thus we observed that employing different domains of data that are specialised for specific properties, and optimising a neural network with the combined properties of these component learners, drastically reduces the system's uncertainties. Such an ensemble network has been shown to learn the system's correlations much more accurately compared to its individual component networks.
5 Summary and Conclusions

We presented Ensemble Neural Networks for the reconstruction and classification of collider events and applied them to the discrimination of boosted, hadronically decaying top quarks from QCD jets. An ENN can improve the accuracy of the network beyond the individual contributions of its component networks by reducing the variance of the prediction, given that the errors of the component networks are not highly correlated. In this study, we showed that such techniques can be used in the reconstruction of collider events in order to overcome the representation problem of neural networks and to improve the prediction accuracy and uncertainties.

Special-purpose networks, such as CNNs or RNNs, have been repeatedly shown to be highly accurate for the classification of LHC events. These networks are specialised to learn and optimise their models with respect to the correlations of the given data. In the case of the classification of fat jets, these correlations can be represented through calorimeter images, where a network learns the spatial distribution of a jet's constituents. On the other hand, clustering algorithms produce a sequential tree structure which can be employed to learn distinct kinematic features of top decays and QCD backgrounds. An ensemble learner is a paradigm that allows the combination of these properties in one algorithm. Instead of optimising the network separately with respect to the distinct symmetries of images or cluster sequences, it allows optimisation over the combined latent-feature space. We showed that combining convolutional and recurrent neural networks and training the network further over their latent-feature space leads to higher accuracy for the classification task. Further, we found that such a technique explicitly reduces the variations in the error correlations of the component networks, hence improving in the domains where the component networks are not highly correlated.

A detailed understanding of the inner workings of Deep Learning techniques is often missing. To develop confidence in their applicability in measurements and searches for new physics, it is of vital importance to understand and, if possible, reduce the uncertainties of the networks. Bayesian techniques are designed to quantify such uncertainties. We found that ENNs can drastically reduce the uncertainty in the prediction of the network without increasing the amount of training data. We also showed that such methods reduce the entropy of the system as well as the epistemic uncertainties. ENNs can thus provide much more accurate predictions than their component networks. The methodology employed in this study can be applied to a broad scope of applications in HEP phenomenology. Instead of expanding the data domain, learning through the combined underlying correlations of the problem has been shown to be very effective.

While ensemble learners can reduce the variance of the hypothesis, we did not observe any improvement in the data's bias or aleatoric uncertainties. Although reducing the network's epistemic uncertainties and variance is a crucial step, the aleatoric uncertainties are observed to be larger than the epistemic uncertainties. As it has been shown that Genetic-Algorithm-based Selective Ensembles can reduce the biases as well as the variance of the system [50], it is an obvious next step to employ such techniques to reduce the biases as well as the variance of the network.
References

[1] S. Marzani, G. Soyez and M. Spannowsky, Looking inside jets: an introduction to jet substructure and boosted-object phenomenology, vol. 958. Springer, 2019, 10.1007/978-3-030-15709-8.
[2] T. Plehn, G. P. Salam and M. Spannowsky, Fat Jets for a Light Higgs, Phys. Rev. Lett. (2010) 111801.
[3] T. Plehn, M. Spannowsky, M. Takeuchi and D. Zerwas, Stop Reconstruction with Tagged Tops, JHEP (2010) 078.
[4] T. Plehn, M. Spannowsky and M. Takeuchi, How to improve top-quark tagging, Phys. Rev. D (Feb, 2012) 034029.
[5] D. E. Soper and M. Spannowsky, Finding top quarks with shower deconstruction, Phys. Rev. D (2013) 054012.
[6] D. E. Soper and M. Spannowsky, Finding physics signals with shower deconstruction, Phys. Rev. D (2011) 074002.
[7] D. E. Soper and M. Spannowsky, Finding physics signals with event deconstruction, Phys. Rev. D (2014) 094005.
[8] S. Prestel and M. Spannowsky, HYTREES: Combining Matrix Elements and Parton Shower for Hypothesis Testing, Eur. Phys. J. C (2019) 546.
[9] J. Brehmer, K. Cranmer, G. Louppe and J. Pavez, Constraining Effective Field Theories with Machine Learning, Phys. Rev. Lett. (2018) 111801.
[10] J. Brehmer, F. Kling, I. Espejo and K. Cranmer, MadMiner: Machine learning-based inference for particle physics, Comput. Softw. Big Sci. (2020) 3.
[11] G. Louppe, M. Kagan and K. Cranmer, Learning to Pivot with Adversarial Networks.
[12] C. K. Khosa and V. Sanz, Anomaly Awareness.
[13] L. G. Almeida, M. Backović, M. Cliche, S. J. Lee and M. Perelstein, Playing Tag with ANN: Boosted Top Identification with Pattern Recognition, JHEP (2015) 086.
[14] G. Kasieczka, T. Plehn, M. Russell and T. Schell, Deep-learning Top Taggers or The End of QCD?, JHEP (2017) 006.
[15] A. Butter, G. Kasieczka, T. Plehn and M. Russell, Deep-learned Top Tagging with a Lorentz Layer, SciPost Phys. (2018) 028.
[16] J. Pearkes, W. Fedorko, A. Lister and C. Gay, Jet Constituents for Deep Neural Network Based Top Quark Tagging.
[17] S. Egan, W. Fedorko, A. Lister, J. Pearkes and C. Gay, Long Short-Term Memory (LSTM) networks with jet constituents for boosted top tagging at the LHC.
[18] S. Macaluso and D. Shih, Pulling Out All the Tops with Computer Vision and Deep Learning, JHEP (2018) 121.
[19] S. Choi, S. J. Lee and M. Perelstein, Infrared Safety of a Neural-Net Top Tagging Algorithm, JHEP (2019) 132.
[20] L. Moore, K. Nordström, S. Varma and M. Fairbairn, Reports of My Demise Are Greatly Exaggerated: N-subjettiness Taggers Take On Jet Images, SciPost Phys. (2019) 036.
[21] A. Blance, M. Spannowsky and P. Waite, Adversarially-trained autoencoders for robust unsupervised new physics searches, JHEP (2019) 047.
[22] S. H. Lim and M. M. Nojiri, Spectral Analysis of Jet Substructure with Neural Networks: Boosted Higgs Case, JHEP (2018) 181.
[23] J. Lin, M. Freytsis, I. Moult and B. Nachman, Boosting H → bb̄ with Machine Learning, JHEP (2018) 101.
[24] P. Baldi, K. Bauer, C. Eng, P. Sadowski and D. Whiteson, Jet Substructure Classification in High-Energy Physics with Deep Neural Networks, Phys. Rev. D (2016) 094034.
[25] G. Louppe, K. Cho, C. Becot and K. Cranmer, QCD-Aware Recursive Neural Networks for Jet Physics, JHEP (2019) 057.
[26] J. Gallicchio and M. D. Schwartz, Quark and Gluon Jet Substructure, JHEP (2013) 090.
[27] P. T. Komiske, E. M. Metodiev and M. D. Schwartz, Deep learning in color: towards automated quark/gluon jet discrimination, JHEP (2017) 110.
[28] T. Cheng, Recursive Neural Networks in Quark/Gluon Tagging, Comput. Softw. Big Sci. (2018) 3.
[29] P. T. Komiske, E. M. Metodiev and J. Thaler, Energy Flow Networks: Deep Sets for Particle Jets, JHEP (2019) 121.
[30] S. Bright-Thonney and B. Nachman, Investigating the Topology Dependence of Quark and Gluon Jets, JHEP (2019) 098.
[31] A. J. Larkoski, I. Moult and B. Nachman, Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning, Phys. Rept. (2020) 1-63.
[32] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman and A. Schwartzman, Jet-images — deep learning edition, JHEP (2016) 069.
[33] O. Kitouni, B. Nachman, C. Weisser and M. Williams, Enhancing searches for resonances with machine learning and moment decomposition.
[34] X. Ju and B. Nachman, Supervised Jet Clustering with Graph Neural Networks for Lorentz Boosted Bosons, Phys. Rev. D (2020) 075014.
[35] A. Butter, S. Diefenbacher, G. Kasieczka, B. Nachman and T. Plehn, GANplifying Event Samples.
[36] S. Farrell, W. Bhimji, T. Kurth, M. Mustafa, D. Bard, Z. Lukic et al., Next Generation Generative Neural Networks for HEP, EPJ Web Conf. (2019) 09005.
[37] J. Lin, W. Bhimji and B. Nachman, Machine Learning Templates for QCD Factorization in the Search for Physics Beyond the Standard Model, JHEP (2019) 181.
[38] K. Datta, A. Larkoski and B. Nachman, Automating the Construction of Jet Observables with Machine Learning, Phys. Rev. D (2019) 095016.
[39] R. T. D'Agnolo, G. Grosso, M. Pierini, A. Wulzer and M. Zanetti, Learning Multivariate New Physics.
[40] R. T. D'Agnolo and A. Wulzer, Learning New Physics from a Machine, Phys. Rev. D (2019) 015014.
[41] B. Nachman and J. Thaler, E Pluribus Unum Ex Machina: Learning from Many Collider Events at Once.
[42] T. Faucett, J. Thaler and D. Whiteson, Mapping Machine-Learned Physics into a Human-Readable Space.
[43] C. K. Khosa, L. Mars, J. Richards and V. Sanz, Convolutional Neural Networks for Direct Detection of Dark Matter, J. Phys. G (2020) 095201.
[44] C. K. Khosa, V. Sanz and M. Soughton, WIMPs or else? Using Machine Learning to disentangle LHC signatures.
[45] T. G. Dietterich, Ensemble methods in machine learning, in Multiple Classifier Systems, (Berlin, Heidelberg), pp. 1-15, Springer Berlin Heidelberg, 2000.
[46] L. K. Hansen and P. Salamon, Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence (Oct, 1990) 993-1001.
[47] A. L. Blum and R. L. Rivest, Training a 3-node neural network is NP-complete, Neural Networks (1992) 117-127.
[48] K. Hornik, M. Stinchcombe and H. White, Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks, Neural Networks (1990) 551-560.
[49] C. Englert, M. Fairbairn, M. Spannowsky, P. Stylianou and S. Varma, Sensing Higgs boson cascade decays through memory, Phys. Rev. D (2020) 095027.
[50] Z.-H. Zhou, J. Wu and W. Tang, Ensembling neural networks: Many could be better than all, Artificial Intelligence (2002) 239-263.
[51] A. Krogh and J. Vedelsby, Neural network ensembles, cross validation and active learning, in Proceedings of the 7th International Conference on Neural Information Processing Systems, NIPS'94, (Cambridge, MA, USA), pp. 231-238, MIT Press, 1994.
[52] M. P. Perrone and L. N. Cooper, When networks disagree: Ensemble methods for hybrid neural networks, pp. 342-358. 10.1142/97898127958850025.
[53] J. Xie, B. Xu and Z. Chuang, Horizontal and vertical ensemble with deep representation for classification, 2013.
[54] L. Rokach, Ensemble-based classifiers, Artificial Intelligence Review (2010) 1-39.
[55] R. P. W. Duin and D. M. J. Tax, Experiments with classifier combining rules, in Multiple Classifier Systems, (Berlin, Heidelberg), pp. 16-29, Springer Berlin Heidelberg, 2000.
[56] J. Conrad and F. Tegenfeldt, Applying rule ensembles to the search for super-symmetry at the large hadron collider, JHEP (2006) 040, [hep-ph/0605106].
[57] P. Baldi, P. Sadowski and D. Whiteson, Enhanced Higgs Boson to τ+τ− Search with Deep Learning, Phys. Rev. Lett. (2015) 111801.
[58] A. Alves, Stacking machine learning classifiers to identify Higgs bosons at the LHC, JINST (2017) T05005.
[59] A. Alves and F. F. Freitas, Towards recognizing the light facet of the Higgs Boson, Mach. Learn. Sci. Tech. (2020) 045025.
[60] A. Butter et al., The Machine Learning Landscape of Top Taggers, SciPost Phys. (2019) 014.
[61] N. Ueda and R. Nakano, Generalization error of ensemble estimators, in Proceedings of International Conference on Neural Networks (ICNN'96), vol. 1, pp. 90-95, June, 1996.
[62] S. Bollweg, M. Haußmann, G. Kasieczka, M. Luchmann, T. Plehn and J. Thompson, Deep-Learning Jets with Uncertainties and More, SciPost Phys. (2020) 006.
[63] S. Marshall, A. Cobb, C. Raïssi, Y. Gal, A. Rozek, M. W. Busch et al., Using Bayesian Optimization to Find Asteroids' Pole Directions, in AAS/Division for Planetary Sciences Meeting Abstracts, vol. 50, p. 505.01D, Oct., 2018.
[64] J. Mukhoti, P. Stenetorp and Y. Gal, On the importance of strong baselines in bayesian deep learning, CoRR abs/1811.09385 (2018).
[65] B. Nachman, A guide for deploying Deep Learning in LHC searches: How to achieve optimality and account for uncertainty, SciPost Phys. (2020) 090.
[66] B. Nachman and J. Thaler, Neural resampler for Monte Carlo reweighting with preserved uncertainties, Phys. Rev. D (2020) 076004.
[67] C. Englert, P. Galler, P. Harris and M. Spannowsky, Machine Learning Uncertainties with Adversarial Neural Networks, Eur. Phys. J. C (2019) 4.
[68] Y. Gal and Z. Ghahramani, Dropout as a bayesian approximation: Representing model uncertainty in deep learning, 2016.
[69] A. Kendall and Y. Gal, What uncertainties do we need in bayesian deep learning for computer vision?, 2017.
[70] J. F. Kolen and J. B. Pollack, Back propagation is sensitive to initial conditions, in Proceedings of the 3rd International Conference on Neural Information Processing Systems, NIPS'90, (San Francisco, CA, USA), pp. 860-867, Morgan Kaufmann Publishers Inc., 1990.
[71] K. Cherkauer, Human expert-level performance on a scientific image analysis task by a system using combined artificial neural networks, in Working Notes of the AAAI Workshop on Integrating Multiple Learned Models, pp. 15-21, 1996.
[72] K. Tumer and J. Ghosh, Error correlation and error reduction in ensemble classifiers, Connection Science (1996) 385-404, [https://doi.org/10.1080/095400996116839].
[73] L. Breiman, Bagging predictors, Machine Learning (1996) 123-140.
[74] M. Gams, New measurements highlight the importance of redundant knowledge, in Proceedings of the Fourth European Working Session on Learning, pp. 71-79.
[75] B. Parmanto, P. Munro and H. Doyle, Improving committee diagnosis with resampling techniques, in Advances in Neural Information Processing Systems (D. Touretzky, M. C. Mozer and M. Hasselmo, eds.), vol. 8, pp. 882-888, MIT Press, 1996.
[76] Y. Freund and R. E. Schapire, A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, J. Comput. Syst. Sci. (1997) 119-139.
[77] Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, in Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148-156, Morgan Kaufmann, 1996.
[78] G. Brown, J. L. Wyatt and P. Tiňo, Managing diversity in regression ensembles, J. Mach. Learn. Res. (2005) 1621-1650.
[79] P. Domingos, A unified bias-variance decomposition and its applications, in Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, (San Francisco, CA, USA), pp. 231-238, Morgan Kaufmann Publishers Inc., 2000.
[80] G. Kasieczka, T. Plehn, J. Thompson and M. Russel, Top quark tagging reference dataset, Mar., 2019. 10.5281/zenodo.2603256.
[81] T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten et al., An Introduction to PYTHIA 8.2, Comput. Phys. Commun. (2015) 159-177.
[82] DELPHES 3 collaboration, J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens et al., DELPHES 3, A modular framework for fast simulation of a generic collider experiment, JHEP (2014) 057.
[83] M. Cacciari, G. P. Salam and G. Soyez, The Anti-k(t) jet clustering algorithm, JHEP (2008) 063.
[84] M. Cacciari, G. P. Salam and G. Soyez, FastJet User Manual, Eur. Phys. J. C72 (2012) 1896.
[85] S. Bentvelsen and I. Meyer, The Cambridge jet algorithm: Features and applications, Eur. Phys. J. C (1998) 623-629, [hep-ph/9803322].
[86] J. M. Butterworth, A. R. Davison, M. Rubin and G. P. Salam, Jet substructure as a new Higgs search channel at the LHC, Phys. Rev. Lett. (2008) 242001.
[87] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research (2011) 2825-2830.
[88] F. Chollet et al., "Keras." https://keras.io, 2015.
[89] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[90] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980 (2014).
[91] Y. Kwon, J.-H. Won, B. J. Kim and M. C. Paik, Uncertainty quantification using bayesian neural networks in classification: Application to biomedical image segmentation, Computational Statistics & Data Analysis (2020) 106816.
[92] N. Tagasovska and D. Lopez-Paz, Single-model uncertainties for deep learning, 2019.
[93] D. J. C. MacKay, Information Theory, Inference & Learning Algorithms. Cambridge University Press, USA, 2002.
[94] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean et al., Tensorflow: A system for large-scale machine learning, CoRR abs/1605.08695 (2016).
[95] Y. Wen, P. Vicol, J. Ba, D. Tran and R. B. Grosse, Flipout: Efficient pseudo-independent weight perturbations on mini-batches, CoRR abs/1803.04386 (2018).