E Pluribus Unum Ex Machina: Learning from Many Collider Events at Once
MIT-CTP 5271
Benjamin Nachman^{1,2,*} and Jesse Thaler^{3,4,†}

^1 Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
^2 Berkeley Institute for Data Science, University of California, Berkeley, CA 94720, USA
^3 Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
^4 The NSF AI Institute for Artificial Intelligence and Fundamental Interactions
There have been a number of recent proposals to enhance the performance of machine learning strategies for collider physics by combining many distinct events into a single ensemble feature. To evaluate the efficacy of these proposals, we study the connection between single-event classifiers and multi-event classifiers under the assumption that collider events are independent and identically distributed (IID). We show how one can build optimal multi-event classifiers from single-event classifiers, and we also show how to construct multi-event classifiers such that they produce optimal single-event classifiers. This is illustrated for a Gaussian example as well as for classification tasks relevant for searches and measurements at the Large Hadron Collider. We extend our discussion to regression tasks by showing how they can be phrased in terms of parametrized classifiers. Empirically, we find that training a single-event (per-instance) classifier is more effective than training a multi-event (per-ensemble) classifier, at least for the cases we studied, and we relate this fact to properties of the loss function gradient in the two cases. While we did not identify a clear benefit from using multi-event classifiers in the collider context, we speculate on the potential value of these methods in cases involving only approximate independence, as relevant for jet substructure studies.
CONTENTS
I. Introduction
II. The Statistics of Per-Ensemble Learning
   A. Review of Per-Instance Learning
   B. Per-Ensemble Binary Classification
   C. Comparing the Loss Gradients
   D. Per-Ensemble Regression
      1. Maximum Likelihood
      2. Classifier Loss
      3. Direct Regression
   E. Beyond Regression
III. Empirical Studies
   A. Classifiers: Multi-Event from Single-Event
      1. Two Gaussian Example
      2. Dijet Resonance Search
      3. Top Quark Mass Measurement
   B. Classifiers: Single-Event from Multi-Event
   C. Comparison of Regression Strategies
      1. Gaussian Mean Example
      2. Top Quark Mass Measurement
   D. Beyond Regression Example
IV. Conclusions
Code and Data
Acknowledgments
A. Deriving Maximum Likelihood Classifier Loss
References

* [email protected]
† [email protected]

I. INTRODUCTION
Modern machine learning techniques are being widely applied to enhance or replace existing analysis techniques across collider physics [1–6]. These approaches hold great promise for new particle searches, for Standard Model measurements, and for high-energy nuclear physics investigations. A subset of these proposals have advocated for a multi-event strategy whereby a machine-learned function acts on multiple collision events at the same time [7–13]. This multi-event (per-ensemble) strategy contrasts with more typical single-event (per-instance) machine learning methods that process one event at a time, although both strategies make use of many events during the training process. Intuitively, an ensemble approach might seem like a more promising learning strategy because there is more information contained in
N > 1 events than in a single event.

To an excellent approximation, collider events are statistically independent and identically distributed (IID). In simulation, this is exactly true up to deficiencies in random number generators. In data, there are some small time-dependent effects from changing conditions, and there are also some correlations between events introduced by detector effects with timescales longer than a typical bunch crossing. These event-to-event correlations, however, are truly negligible when considering the set of events typically used for physics analysis that are selected by triggers. The probability for two events next to each other in time to both be saved by the triggers is effectively zero, since triggers save only a tiny fraction of events. The IID nature of collision events therefore ensures that the information content is the same for ensembles of events and for single events drawn from an ensemble.

In equations, the probability to observe N events x_i is

    p({x_1, …, x_N} | θ) = ∏_{i=1}^N p(x_i | θ),   (1)

where θ represents possible parameters of the generative model, such as the physics process being studied or the values of coupling constants. The optimal classifier to distinguish whether events have been generated via θ_A or via θ_B depends only on the per-ensemble likelihood ratio [14]:

    p({x_1, …, x_N} | θ_A) / p({x_1, …, x_N} | θ_B) = ∏_{i=1}^N p(x_i | θ_A) / p(x_i | θ_B),   (2)

which by the IID assumption only depends on knowing the per-instance likelihood ratio p(x_i | θ_A) / p(x_i | θ_B). This equality explains the informational equivalence of per-ensemble and per-event learning.

Given the simplicity of Eq. (2), why are we writing a whole paper on this topic (apart from the opportunity to invoke a gratuitously Latinate paper title that incorporates an aspiration for national unity)? The studies in Refs. [7–13] find that per-ensemble learning is effective for their respective tasks, in some cases arguing why per-instance learning is deficient. It is certainly true that a set of events {x_1, …, x_N} contains more information than a single event x_i drawn from this set. What we will show in this paper is that if one carefully combines the per-instance information, one can recover the per-ensemble benefit, with the potential for a substantially reduced training cost. We emphasize that our analysis does not contradict the studies in Refs. [7–13]; rather, this work suggests the possibility of achieving the same or better results by replacing per-ensemble learning with per-instance learning. There may be specialized contexts where per-ensemble learning is superior, particularly if the training procedure itself can be made simpler, such as in the linear regression approach of Ref. [11]. This paper also gives us a chance to mention some facts about loss functions that are well known in the statistics literature but might not be as well appreciated in collider physics. Moving away from the IID case, we speculate on the relevance of our analysis for jet substructure tasks where there is a notion of approximate independence of emissions.

The remainder of this paper is organized as follows. In Sec. II, we provide the formal statistical basis for building multi-event classifiers from single-event classifiers, and vice versa, under the IID assumption. We also explain how regression tasks can be translated into the language of per-instance parametrized classification. In Sec. III, we present empirical studies that corroborate these analytic results.
Our conclusions are given in Sec. IV.

II. THE STATISTICS OF PER-ENSEMBLE LEARNING

A. Review of Per-Instance Learning
Suppose that a collider event is represented by features in E = R^M and we are trying to train a binary classifier to learn a target in [0, 1]. Let c : E → [0, 1] be a function that processes a single event, with the goal of distinguishing events generated by θ_A (c → 1) versus those generated by θ_B (c → 0). Such a function can be learned by minimizing the binary cross entropy (BCE) loss functional:

    L_BCE[c] = −∫ dx ( p(x|θ_A) log c(x) + p(x|θ_B) log(1 − c(x)) ),   (3)

where p(x|θ) is the probability density of x ∈ E given class θ. Here and throughout this discussion, we consider the infinite statistics limit such that we can replace sums over events by integrals. We have also dropped the prior factors p(θ_i), assuming that one has equal numbers of examples from the two hypotheses during training. While this is often true in practice, it is not strictly necessary for our main conclusions, though it does simplify the notation. It is well known [15, 16] (also in high-energy physics [17–29]) that an optimally trained c will have the following property:

    c(x) / (1 − c(x)) = p(x|θ_A) / p(x|θ_B),   (4)

such that one learns the per-instance likelihood ratio. By the Neyman–Pearson lemma [14], this defines the optimal single-event classifier.

There are many loss functionals that satisfy this property. Consider a more general loss functional that depends on a learnable function f : E → R (which unlike c may or may not map to [0, 1]) and two functions A : R → R and B : R → R:

    L[f] = −∫ dx ( p(x|θ_A) A(f(x)) + p(x|θ_B) B(f(x)) ).   (5)

Taking the functional derivative with respect to f(x), the extremum of L[f] satisfies the property:

    −B′(f(x)) / A′(f(x)) = p(x|θ_A) / p(x|θ_B).   (6)

As long as −B′(f) / A′(f) is a monotonic rescaling of f and the overall loss functional is convex, the function f(x) learned by minimizing Eq. (5) defines an optimal classifier. In many cases, the minimum value of L[f] itself is interesting in the context of statistical divergences and distances [30], and a few examples are shown in Table I.

TABLE I. Examples of loss functionals in the form of Eq. (5), with the associated location and value of the loss minimum, using the shorthand p_i ≡ p(x|θ_i). We use the symbol f in all cases to denote the classifier, but some choices require explicit constraints on f to be either non-negative or in the range [0, 1].

| Loss Name              | A(f)       | B(f)        | argmin_f L[f]    | Integrand of −min_f L[f]            | Related Divergence/Distance |
|------------------------|------------|-------------|------------------|-------------------------------------|-----------------------------|
| Binary Cross Entropy   | log f      | log(1 − f)  | p_A/(p_A + p_B)  | p_A log[p_A/(p_A + p_B)] + (A ↔ B)  | 2 (Jensen–Shannon − log 2)  |
| Mean Squared Error     | −(1 − f)²  | −f²         | p_A/(p_A + p_B)  | −p_A p_B/(p_A + p_B)                | ¼ (Triangular − 2)          |
| Square Root            | −√(1/f)    | −√f         | p_A/p_B          | −2 √(p_A p_B)                       | 2 (Hellinger² − 1)          |
| Maximum Likelihood Cl. | log f      | 1 − f       | p_A/p_B          | p_A log(p_A/p_B)                    | Kullback–Leibler            |

To simplify the following discussion, we will focus on the "maximum likelihood classifier" (MLC) loss:

    L_MLC[f] = −∫ dx ( p(x|θ_A) log f(x) + p(x|θ_B) (1 − f(x)) ).   (7)

This is of the general form in Eq. (5) with A(f) = log f and B(f) = 1 − f. To our knowledge, the MLC was first introduced in the collider physics context in Refs. [31, 32], although with an exponential parameterization of f(x). We call Eq. (7) the MLC loss to distinguish it from the related maximum likelihood loss that is often used to fit generative models [33–35]. Using Eq. (6), the minimum of this loss functional directly yields the likelihood ratio (see footnote 1):

    argmin_f L_MLC[f] = p(x|θ_A) / p(x|θ_B),   (8)

which will be useful to simplify later analyses. The MLC loss functional value at the minimum is

    −min_f L_MLC[f] = ∫ dx p(x|θ_A) log [ p(x|θ_A) / p(x|θ_B) ],   (9)

which is the Kullback–Leibler (KL) divergence, also known as the relative entropy from p(x|θ_B) to p(x|θ_A). See App. A for an intuitive derivation of Eq. (7).

Footnote 1: A variation of Eq. (8) holds for A(f) = log C(f) and B(f) = 1 − C(f), where C(f) is any monotonically increasing function with range that covers (0, ∞). In this case, C(argmin_f L[f]) = p(x|θ_A)/p(x|θ_B). This can be useful in practice if C(f) is everywhere positive, since f can take on negative values and still yield a valid likelihood ratio. See Fig. 10 for an empirical study of C(f) = exp f.
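As a concrete illustration of Eq. (7), the following minimal sketch implements the MLC loss as a custom Keras loss, adopting the C(f) = exp f parametrization of footnote 1 so that f is positive by construction; the function and variable names here are our own choices, not those of the code used in Sec. III:

```python
import tensorflow as tf

def mlc_loss(y_true, y_pred):
    """Maximum likelihood classifier (MLC) loss of Eq. (7).

    Events with y_true = 1 are drawn from theta_A and y_true = 0 from
    theta_B. The network output is interpreted as log f(x), i.e. the
    C(f) = exp f variant of footnote 1, so f = exp(y_pred) is positive
    and log f = y_pred. At the minimum, f estimates the likelihood
    ratio p(x|theta_A)/p(x|theta_B).
    """
    y_true = tf.cast(y_true, y_pred.dtype)
    log_f = y_pred
    f = tf.exp(y_pred)
    return -tf.reduce_mean(y_true * log_f + (1.0 - y_true) * (1.0 - f))

# Usage: a network with a linear output node trained with this loss
# returns the log of the per-instance likelihood ratio.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=mlc_loss)
```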
B. Per-Ensemble Binary Classification

To move from single-event classification to multi-event classification, we want to learn a classification function f_N that processes N events simultaneously. Here, we use f_N : E^N → R instead of c_N : E^N → [0, 1] to avoid algebraic manipulations like Eq. (4). We will use the vector notation

    x⃗ = {x_1, …, x_N}   (10)

to represent an element of E^N. Our goal is to distinguish whether x⃗ is drawn from p(x⃗|θ_A) (f_N → ∞) or from p(x⃗|θ_B) (f_N → 0). Note that we are testing whether all of the events were generated via θ_A or θ_B, which is a different question than trying to determine the proportion of events drawn from each class in a mixed event ensemble. For N = 1, f_1 is the same as the f discussed in Eq. (5).

If f_N is trained optimally, then the classification performance of f_N evaluated on N > 1 events will be better than that of f_1 evaluated on a single event, as relevant to the discussions in Refs. [7–13]. The key point of this paper is that one can construct a classifier f_{1→N} that is built only from f_1, acts on N events, and has the same asymptotic performance as f_N.

Using the MLC loss in Eq. (7), but now applied to N events, we have

    L_MLC[f_N] = −∫ d^N x ( p(x⃗|θ_A) log f_N(x⃗) + p(x⃗|θ_B) (1 − f_N(x⃗)) ),   (11)

whose minimum is the per-ensemble likelihood ratio:

    argmin_{f_N} L_MLC[f_N] = p(x⃗|θ_A) / p(x⃗|θ_B).   (12)

By the Neyman–Pearson lemma, this yields the optimal per-ensemble classifier.

On the other hand, once we have trained a single-event classifier f_1 using Eq. (7), we can build a multi-event classifier f_{1→N} without any additional training:

    f_{1→N}(x⃗) ≡ ∏_{i=1}^N f_1(x_i) → p(x⃗|θ_A) / p(x⃗|θ_B),   (13)

where in the last step we have combined the solution found in Eq. (8) with the IID condition in Eq. (2). Whereas minimizing Eq. (11) requires sampling over E^N, constructing f_{1→N} only requires sampling over E, which is a considerable reduction in computational burden for large N. The technical details of carrying out this procedure are explained in Sec. III A.

Going in the converse direction, we can learn a single-event classifier f_{N→1} starting from a constrained multi-event classifier f̃_N. Using weight sharing, we can minimize Eq. (11) subject to the constraint that f̃_N takes the functional form:

    f̃_N({x_1, …, x_N}) = ∏_{i=1}^N f_{N→1}(x_i),   (14)

where f_{N→1}(x) is a learnable function. Under the IID assumption, f̃_N can still learn the per-ensemble likelihood ratio, but the learned f_{N→1}(x) will now be the per-instance likelihood ratio, at least asymptotically. An examination of this converse construction is presented in Sec. III B.
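Once a per-instance likelihood-ratio estimate f_1 is in hand, the construction in Eq. (13) requires only a few lines; this sketch (with hypothetical function names) combines the per-event outputs in log space:

```python
import numpy as np

def f_1_to_N(f1, ensemble):
    """Eq. (13): build a multi-event classifier from a per-instance one.

    f1 is assumed to return the per-instance likelihood-ratio estimate
    p(x|theta_A)/p(x|theta_B) for an array of single events; ensemble is
    an array of N IID events. Logs are summed for numerical stability.
    """
    return np.exp(np.sum(np.log(f1(ensemble))))
```

Because the product of many likelihood ratios can overflow or underflow, working directly with the summed log likelihood ratio (which is monotonically related to f_{1→N}) is usually preferable in practice.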
C. Comparing the Loss Gradients

We have shown that the per-ensemble classifier f_N and the composite per-event classifier f_{1→N} have the same asymptotic information content, but one might wonder if there is nevertheless a practical performance gain to be had using per-ensemble learning.

Under the IID assumption, the optimal f_N takes the form of f̃_N in Eq. (14), and in our empirical studies, we found no benefit to letting f_N have more functional freedom. Therefore, to get a sense of the efficacy of per-ensemble versus per-instance training, we can compare the effective loss functions for f_{N→1} and f_1. Since the inputs and outputs of these functions are the same (i.e. E → R), we can do an apples-to-apples comparison of their behavior under gradient descent.

For the per-ensemble case, plugging Eq. (14) into Eq. (11) and using the IID relation in Eq. (1), we find the effective loss functional:

    L_MLC[f_{N→1}] + 1 = −N ∫ dx p(x|θ_A) log f_{N→1}(x) + ( ∫ dx p(x|θ_B) f_{N→1}(x) )^N.   (15)

This is to be contrasted with the per-instance loss functional from Eq. (7), repeated for convenience with the f_1 notation and typeset to be parallel to the above:

    L_MLC[f_1] + 1 = −∫ dx p(x|θ_A) log f_1(x) + ∫ dx p(x|θ_B) f_1(x).   (16)

To understand the loss gradients, we can Taylor expand the learned functions about the optimal solution:

    f_{N→1}(x) = p(x|θ_A)/p(x|θ_B) + ε(x),   (17)
    f_1(x) = p(x|θ_A)/p(x|θ_B) + ε(x).   (18)

Plugging these into their respective loss functionals and looking at the leading-order variations, we have:

    δL_MLC[f_{N→1}] / N = (1/2) ∫ dx ( p(x|θ_B) ε(x) )² / p(x|θ_A) + ((N − 1)/2) ( ∫ dx p(x|θ_B) ε(x) )²,   (19)

    δL_MLC[f_1] = (1/2) ∫ dx ( p(x|θ_B) ε(x) )² / p(x|θ_A).   (20)

These expressions are quadratic in ε(x), which means that we are expanding around the correct minimum.

The expression for δL_MLC[f_1] involves a single integral over x, so under gradient descent, the value of ε(x) can be independently adjusted at each point in phase space to find the minimum. By contrast, δL_MLC[f_{N→1}] has an additional piece involving an integral squared, so even if at a given point in phase space x_0 we have achieved ε(x_0) = 0, gradient descent will tend to push ε(x_0) away from the correct value until ε(x) = 0 everywhere. This correlated structure explains the slower convergence of L_MLC[f_{N→1}] compared to L_MLC[f_1] in our empirical studies. While we focused on the MLC loss to simplify the algebra, the appearance of these (typically counterproductive) correlations in the loss gradient appears to be a generic feature of per-ensemble learning.
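This gradient structure can be seen numerically. The following toy sketch (our choices of grid, perturbation, and N) discretizes phase space for Gaussian densities, perturbs the optimal classifier in a single bin, and compares finite-difference gradients of the effective losses in Eqs. (15) and (16):

```python
import numpy as np

# Discretized phase space with the Gaussian densities of Sec. III A 1.
x = np.linspace(-5.0, 5.0, 201)
dx = x[1] - x[0]
pA = np.exp(-0.5 * (x - 0.1) ** 2) / np.sqrt(2.0 * np.pi)
pB = np.exp(-0.5 * (x + 0.1) ** 2) / np.sqrt(2.0 * np.pi)
N = 10

def loss_instance(f):  # Eq. (16), dropping the constant +1
    return dx * np.sum(-pA * np.log(f) + pB * f)

def loss_ensemble(f):  # Eq. (15), dropping the constant +1
    return -N * dx * np.sum(pA * np.log(f)) + (dx * np.sum(pB * f)) ** N

def gradient(loss, f, h=1e-6):
    base = loss(f)
    g = np.zeros_like(f)
    for j in range(len(f)):
        fp = f.copy()
        fp[j] += h
        g[j] = (loss(fp) - base) / h
    return g

f = pA / pB     # the optimal solution
f[120] += 0.5   # epsilon(x) nonzero in one phase-space bin only

# The per-instance gradient vanishes wherever epsilon(x) = 0, while the
# integral-squared term makes the per-ensemble gradient nonzero everywhere.
print(gradient(loss_instance, f)[60], gradient(loss_ensemble, f)[60])
```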
D. Per-Ensemble Regression

While the discussion above focused on binary classification, the same basic idea applies to regression problems as well. The goal of regression is to infer parameters θ from the data x⃗. There are a variety of approaches that can be used for this task, and each can be connected to parametrized per-instance classification.
1. Maximum Likelihood
Maximum likelihood is the most common strategy for inference in collider physics. Symbolically, we are trying to find

    θ_ML = argmax_θ p(x⃗|θ).   (21)

One way to determine θ_ML is with a two-step approach. First, one can train a parametrized classifier f(x, θ) [25, 36] using, e.g., the per-instance MLC loss:

    L_MLC[f] = −∫ dx dθ ( p(x|θ) p(θ) log f(x, θ) + p(x|θ_0) p(θ) (1 − f(x, θ)) ).   (22)

The first term corresponds to a synthetic dataset where every event is generated from p(x|θ) with different θ values drawn from the probability density p(θ). The second term corresponds to a synthetic dataset where every event is generated using the same p(x|θ_0) for a fixed reference θ_0 and then augmented with a value θ that follows from p(θ) independently of x. Minimizing Eq. (22) with respect to f(x, θ), the asymptotic solution is the likelihood ratio:

    f(x, θ) = p(x|θ) / p(x|θ_0),   (23)

where the factors of p(θ) have cancelled out. Second, one can estimate θ_ML by using the IID properties of the event ensemble to relate likelihoods to the classifier output f(x, θ):

    θ_ML = argmin_θ { −∑_{i=1}^N log p(x_i|θ) }
         = argmin_θ { −∑_{i=1}^N log [ p(x_i|θ) / p(x_i|θ_0) ] }
         ≈ argmin_θ { −∑_{i=1}^N log f(x_i, θ) }.   (24)

Thus, even though maximum likelihood regression uses information from the full event ensemble, only a parametrized per-instance classifier is required for this procedure.
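In practice, the second step is a simple scan; a minimal sketch, assuming a trained parametrized classifier f_param as in Eq. (23):

```python
import numpy as np

def theta_ml(f_param, events, theta_grid):
    """Grid-scan maximum likelihood estimate via Eq. (24), a sketch.

    f_param(events, theta) is assumed to return the parametrized
    classifier output of Eq. (23), i.e. p(x|theta)/p(x|theta_0) for
    each event; theta_grid holds the candidate theta values.
    """
    nll = [-np.sum(np.log(f_param(events, theta))) for theta in theta_grid]
    return theta_grid[int(np.argmin(nll))]
```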
2. Classifier Loss
Two recent proposals for parameter estimation are explicitly built on classifiers for regression [17, 18]. For any classifier, one can perform the following optimization:

    θ_CL = argmax_θ { Loss of a classifier trained to distinguish θ from θ_data }.   (25)

Here, we are imagining that the θ samples come from synthetic data sets. The appearance of a maximum instead of a minimum in Eq. (25) is because, as highlighted in Table I, it is negative loss functionals that correspond to statistical divergences and distances.

Footnote 2: Note that Ref. [17] used the (non-differentiable) area under the curve instead of the classifier loss, as it is not sensitive to differences in the prior p(θ) between the two data sets.

In general, the θ_CL that optimizes the classifier loss will be different from the θ_ML that maximizes the likelihood. For the special case of the MLC loss, though, they are the same in the asymptotic limit if we set θ_A = θ_data and θ_B = θ. To see this, recall from Eq. (9) that after training, the value of the MLC loss is related to the KL divergence:

    argmax_θ { min_f L_MLC[f] }
    = argmax_θ { −∫ dx p(x|θ_data) log [ p(x|θ_data) / p(x|θ) ] }
    ≈ argmax_θ { ∑_{i=1}^N log [ p(x_i|θ) / p(x_i|θ_data) ] }
    = argmin_θ { −∑_{i=1}^N log p(x_i|θ) } = θ_ML,   (26)

where the sum is over data events.
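A minimal sketch of the scan in Eq. (25), assuming access to a simulator sample_from(theta, n) (a stand-in name) and one-dimensional events; for brevity this version uses the binary cross entropy loss, whereas the MLC loss would reproduce θ_ML asymptotically, as shown in Eq. (26):

```python
import numpy as np
import tensorflow as tf

def theta_cl(sample_from, data, theta_grid, epochs=10):
    """Classifier-loss parameter estimate, Eq. (25), a sketch.

    For each theta, train a small classifier to separate synthetic
    events at theta from data and record the final loss. theta_CL
    maximizes the loss: samples hardest to distinguish from data give
    the largest achievable minimum loss. Assumes scalar events.
    """
    losses = []
    for theta in theta_grid:
        sim = sample_from(theta, len(data))
        X = np.concatenate([data, sim])[:, None].astype("float32")
        y = np.concatenate([np.ones(len(data)), np.zeros(len(sim))])
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
        hist = model.fit(X, y, epochs=epochs, verbose=0)
        losses.append(hist.history["loss"][-1])
    return theta_grid[int(np.argmax(losses))]
```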
3. Direct Regression
In terms of information content, a regression model trained in the usual way can be built from a parametrized classification model. Suppose that θ ∈ R^Q and g_N : E^N → R^Q is a regression model trained with the mean squared error loss:

    L_MSE[g_N] = ∫ d^N x dθ p(x⃗, θ) ( g_N(x⃗) − θ )².   (27)

It is well known that the optimally trained g_N will be related to the expectation value of θ:

    g_N(x⃗) = E[θ | x⃗] = ∫ dθ θ p(θ|x⃗).   (28)

Other loss functions approximate other statistics, as discussed in Ref. [37]. For example, the mean absolute error loss approximates the median of θ. Ultimately, all direct regression methods are functionals of p(θ|x⃗).

We can relate p(θ|x⃗) to a parametrized classifier f_N(x⃗, θ) trained to distinguish θ from a baseline θ_0:

    p(θ|x⃗) = p(x⃗|θ) p(θ) / p(x⃗)
            = p(x⃗|θ) p(θ) / ∫ dθ′ p(x⃗|θ′) p(θ′)
            = [ p(x⃗|θ) / p(x⃗|θ_0) ] p(θ) / ∫ dθ′ [ p(x⃗|θ′) / p(x⃗|θ_0) ] p(θ′)
            = f_N(x⃗, θ) p(θ) / ∫ dθ′ f_N(x⃗, θ′) p(θ′),   (29)

where p(θ) is the probability density of θ used during the training of g_N. Following the same logic as Sec. II B, the per-ensemble classifier f_N(x⃗, θ) can be related to a per-instance classifier f(x, θ). Therefore, even though g_N acts on N events, it has the same information content as a parametrized classifier that acts on single events.

Performing regression via Eqs. (28) and (29) is straightforward but tedious. In practice, one would train a parametrized per-instance classifier f(x, θ) as in Eq. (23), multiply it to construct f_N(x⃗, θ) = ∏_{i=1}^N f(x_i, θ), and then sample over values of θ to approximate the integrals. We show examples of the above regression strategies in Sec. III C.
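A sketch of this recipe, with hypothetical function names; the likelihood ratios enter through a callable f_ens that, under the IID assumption, can be built as a product of per-instance classifier outputs:

```python
import numpy as np

def posterior_mean(f_ens, ensemble, theta_grid, prior):
    """Direct regression via Eqs. (28)-(29), a sketch.

    f_ens(ensemble, theta) is assumed to return the per-ensemble
    likelihood ratio of Eq. (29); theta_grid and prior are arrays that
    discretize the integrals over theta with weights p(theta).
    """
    log_w = np.array([np.log(f_ens(ensemble, t)) for t in theta_grid])
    log_w += np.log(prior)
    w = np.exp(log_w - log_w.max())   # stabilized unnormalized weights
    w /= w.sum()                      # discretized p(theta | ensemble), Eq. (29)
    return float(np.sum(w * theta_grid))   # E[theta | ensemble], Eq. (28)
```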
E. Beyond Regression

In addition to classification and regression, a standard machine learning task is density estimation. While some classical machine learning methods like k-nearest neighbors [38, 39] do require multi-instance information at prediction time, many of the standard deep learning solutions to implicit or explicit generative modeling are built on per-instance functions. Such methods include generative adversarial networks [40], variational autoencoders [42], and normalizing flows [43].

Footnote 3: In the context of adversarial training, it may be beneficial to use per-ensemble information in the discriminator to mitigate mode collapse, as utilized in Ref. [13]. This is also the philosophy behind mini-batch discrimination [41].

One reason for computing explicit densities is to estimate the distance to a reference density. A common set of tools for this task are the f-divergences mentioned earlier. As discussed in Ref. [30] and highlighted in Table I, there is a direct mapping between the loss value of a per-instance classification task and a corresponding f-divergence between the underlying probability densities.

A related quantity is the mutual information between two random variables X and Y:

    I(X, Y) = ∫ dx dy p(x, y) log [ p(x, y) / (p(x) p(y)) ].   (30)

For example, Y could be binary (a class label) and then I(X, Y) would encode how much information (in units of nats) is available in X for doing classification. This can be helpful in the context of ranking input features, and was studied in the context of quark/gluon jet classification in Ref. [44].

Naively, Eq. (30) might seem to require estimating the densities p(x), p(y), and p(x, y), which in turn may require ensemble information (see e.g. Ref. [45] for a study in the context of HEP). On the other hand, Eq. (30) takes the same form as the KL divergence in Eq. (9). Therefore, this quantity can be estimated using a similar strategy as in earlier sections, by training a classifier to distinguish data following p(x, y) from data following p(x) p(y) using the MLC loss, as sketched below. The value of the loss at the minimum will be an estimate of the mutual information. A simple example of this will be studied in Sec. III D.
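The following self-contained sketch illustrates this strategy on a Gaussian toy, anticipating the empirical study of Sec. III D; the value of ε, the sample size, and the architecture are our own illustrative choices:

```python
import numpy as np
import tensorflow as tf

# Train a classifier with the MLC loss to separate the joint density
# p(x, y) from the product of marginals p(x) p(y); the negative loss at
# the minimum estimates I(X, Y) of Eq. (30).
eps, n = 0.5, 50000
y = np.random.randint(0, 2, n)
x = np.random.normal(np.where(y == 1, eps, -eps), 1.0)
X_joint = np.stack([x, y], axis=1).astype("float32")
X_marg = np.stack([x, np.random.permutation(y)], axis=1).astype("float32")

X = np.concatenate([X_joint, X_marg])
labels = np.concatenate([np.ones(n), np.zeros(n)])

def mlc_loss(t, out):  # Eq. (7) with log f = network output
    t = tf.cast(t, out.dtype)
    return -tf.reduce_mean(t * out + (1.0 - t) * (1.0 - tf.exp(out)))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=mlc_loss)
model.fit(X, labels, epochs=10, batch_size=512, verbose=0)

# -min L = E_joint[log f] + E_marg[1 - f] estimates I(X, Y) in nats.
log_f = model.predict(X_joint, verbose=0).flatten()
f_marg = np.exp(model.predict(X_marg, verbose=0).flatten())
print(np.mean(log_f) + np.mean(1.0 - f_marg))
```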
III. EMPIRICAL STUDIES

We now present empirical studies comparing per-instance and per-ensemble data analysis strategies to highlight the points made in Sec. II. Our analyses are based on three case studies: a simple two-Gaussian example, searching for dijet resonances, and measuring the top quark mass.
A. Classifiers: Multi-Event from Single-Event
As argued in Sec. II B, under the IID assumption we can build multi-event classifiers from single-event classifiers. We now demonstrate how to construct the classifier f_{1→N} defined in Eq. (13), comparing its performance to f_N.
1. Two Gaussian Example
Our first case study involves one-dimensional Gaussian random variables. As shown in Fig. 1a, we consider two Gaussian distributions with different means (x̄ = ±ε) but the same variance (σ = 1). Here, the "signal" has positive mean while the "background" has negative mean, and we take ε = 0.1.

The per-instance (f_1) and per-ensemble (f_N) classifiers are parametrized by neural networks and implemented using Keras [46] with the Tensorflow backend [47] and optimized with Adam [48]. We use the binary cross entropy loss function, so Eq. (4) is needed to convert the classifier output to a likelihood ratio. Each classifier consists of two hidden layers with 128 nodes per layer. Rectified Linear Unit (ReLU) activation functions are used for the intermediate layers, while sigmoid activation is used for the last layer. The only difference between the per-instance and per-ensemble networks is that the input layer has one input for f_1 but N inputs for f_N. We train each network with 50,000 events to minimize the binary cross entropy loss, and we test the performance with an additional 50,000 events. For each network, we train for two epochs with a batch size of 128. We take N = 10 for the per-ensemble network, which means that in each batch, the training involves 1280 events, and therefore fewer batches per epoch than in the per-instance case. We did not do any detailed hyperparameter optimization for these studies.

In Fig. 1b, we show the performance of the resulting classifiers f_1 and f_10. We checked that the f_1 classifier parametrized by a neural network has essentially the same performance as the analytic function derived by taking the ratio of Gaussian probability densities, which means that the neural network f_1 is nearly optimal. As expected, the per-instance classifier f_1 has a worse receiver operating characteristic (ROC) curve than the per-ensemble classifier f_10. This is not a relevant comparison, however, because the two are solving different classification tasks (i.e. classifying individual events as coming from signal or background versus classifying an ensemble of N = 10 events as all coming from signal or background). With Eq. (13), we can use f_1 to build a 10-instance classifier f_{1→10}, whose ROC curve is nearly identical to f_10, if not even slightly better. Thus, as expected from Eq. (2), all of the information in the 10-instance classifier is contained in the per-instance classifier.

FIG. 1. Classification in the two Gaussian example. (a) A histogram of the Gaussian random variable X, for the "signal" (x̄ = 0.1) and "background" (x̄ = −0.1) processes. (b) ROC curves for the classifiers. Starting from the single-event classifier f_1, we can construct a multi-event classifier f_{1→10} that matches the performance of a classifier trained on 10 events simultaneously (f_10).
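A minimal Keras sketch of this setup (with our variable names), including the Eq. (13) combination at the end:

```python
import numpy as np
import tensorflow as tf

# Per-instance (f_1) and per-ensemble (f_10) networks as described
# above: two hidden layers of 128 ReLU nodes and a sigmoid output,
# differing only in the input dimension.
def make_classifier(n_inputs):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(n_inputs,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

eps, N, n_train = 0.1, 10, 50000
y = np.random.randint(0, 2, n_train)
x = np.random.normal(np.where(y == 1, eps, -eps), 1.0)

f1 = make_classifier(1)
f1.compile(optimizer="adam", loss="binary_crossentropy")
f1.fit(x[:, None], y, epochs=2, batch_size=128, verbose=0)

# Per-ensemble training: each example is N events of the same class.
yN = np.random.randint(0, 2, n_train // N)
xN = np.random.normal(np.where(yN == 1, eps, -eps)[:, None], 1.0,
                      size=(n_train // N, N))
f10 = make_classifier(N)
f10.compile(optimizer="adam", loss="binary_crossentropy")
f10.fit(xN, yN, epochs=2, batch_size=128, verbose=0)

# Eq. (13): combine per-event likelihood ratios from f_1, using Eq. (4)
# to convert the sigmoid output c into c / (1 - c).
c = f1.predict(xN.reshape(-1, 1), verbose=0).reshape(-1, N)
f_1_to_10 = np.prod(c / (1.0 - c), axis=1)
```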
2. Dijet Resonance Search
We now consider an example from collider physics, motivated by a search for new particles in a dijet final state. The simulations used for this study were produced for the LHC Olympics 2020 community challenge [49]. The background process involves generic quantum chromodynamics (QCD) dijet events with a requirement of at least one jet with transverse momentum p_T > 1.2 TeV. The signal is a hypothetical W′ boson with mass m_W′ = 3.5 TeV that decays via W′ → XY to two hypothetical particles X and Y of masses 500 GeV and 100 GeV, respectively. Each of the X and Y particles decays promptly into pairs of quarks. Due to the mass hierarchy between the W′ boson and its decay products, the final state is characterized by two large-radius jets with two-prong substructure. The background and signal are generated using Pythia [50, 51], with detector simulation provided by Delphes [52–54]. Jets are clustered with FastJet [55, 56] using the anti-k_t algorithm [57] with R = 1.0, and events are required to satisfy m_JJ ∈ [3.3, 3.7] TeV.
7] GeV.Four features are used to train our classifiers: the in-variant mass of the lighter jet, the mass difference of theleading two jets, and the N -subjettiess ratios τ [58, 59]of the leading two jets. The observable τ quantifiesthe degree to which a jet is characterized by two subjetsor one subjet, with smaller values indicating two-prongsubstructure. The mass features are recorded in units ofTeV so that they are numerically O (1). Histograms ofthe four features for signal and background are shown inFigs. 2a and 2b. The signal jet masses are localized atthe X and Y masses and the τ observables are shiftedtowards lower values, indicating that the jets have two-prong substructure.We train a per-instance classifier ( f ) and a per-ensemble classifier ( f ) classifier using the same toolsas for the Gaussian example above, again using binarycross entropy for the loss function. Because signal andbackground are so well separated in this example, werestrict our attention to N = 3 to avoid saturating theperformance. Note that this is an artificially constructedclassification problem, since in a more realistic contextone would be trying to estimate the signal fraction inan event ensemble, not classify triplets of events as allcoming from signal or background. E v e n t s BSM, W YZ versus QCD m J signal m J signal m J background m J background (a) E v e n t s BSM, W YZ versus QCD
1] and Φ : E → R L are neuralnetworks that are simultaneously optimized. The networkΦ embeds single events x i into a L -dimensional latentspace. The sum operator in Eq. (31) guarantees that f set N is invariant under permutations x σ ( i ) for σ ∈ S N , thepermutation group acting on N elements. We use thedefault parameters from the PFN code, with L = 128, Φhaving of two hidden layers with 100 nodes each, and F having three hidden nodes with 100 nodes each.The performance of f set3 is shown in Fig. 2, which getsmuch closer to matching the performance of f → . Partof this improvement is due to enforcing the permutationsymmetry, though there is also a potential gain from thefact the PFN we used for f set3 has more trainable weightsthan the fully connected network for f sort3 . All of the f variants were considerably more difficult to train than f → , likely for the reason discussed in Sec. II C. Thus, wehave empirical evidence for the superiority of single-eventtraining for multi-event classification.
FIG. 3. Classification in the top quark mass example. (a) A histogram of m_{b1μν} for top quark masses of 172.5 GeV and 175 GeV. The "wgt." curve is explained later in Sec. III C 2, where we test the performance of a likelihood reweighting. (b) The difference in efficiency for the 172.5 GeV top quark mass sample (true positive) and the 175 GeV top quark mass sample (false positive) as a function of the true positive rate for various binary classifiers. Once again, a multi-event classifier (f_{1→20}) built from the single-event classifier (f_1) has the best performance. For the classifiers trained to process 20 events simultaneously, the deep sets/PFN approach (f_20^set) does better than sorting the inputs (f_20^sort).
3. Top Quark Mass Measurement
Our third and final example is motivated by the top quark mass measurement, as recently studied in Refs. [11, 17]. Extracting the top quark mass is really a regression problem, which we investigate in Sec. III C. Here, we consider a related classification task to distinguish two event samples generated with different top quark masses (172.5 GeV and 175 GeV). This is a realistic hypothesis testing task that requires full event ensemble information, though only per-instance training, as we will see.

We use the same dataset as Ref. [17]. Top quark pair production is generated using Pythia [50, 51], with detector simulation provided by Delphes [52–54]. In the decay t t̄ → b W⁺ b̄ W⁻, one of the W bosons is forced to decay to μ⁺ν while the other W boson decays hadronically. Each event is recorded as a variable-length set of objects, consisting of jets, muons, and neutrinos. At simulation level, the neutrino is replaced with the missing transverse momentum. Generator-level and simulation-level jets are clustered with the anti-k_t algorithm using R = 0.4. A jet is labeled as b-tagged if the highest-energy parton inside the nearest generator-level jet (ΔR < 0.5) is a b quark. Jets are required to have p_T > 20 GeV, and they can only be b-tagged if |η| < 2.5. Events are required to have at least two b-tagged jets and at least two additional non-b-tagged jets. The b-jet closest to the muon in rapidity-azimuth is labeled b_1. Of the remaining b-tagged jets, the highest-p_T one is labeled b_2. The two highest-p_T non-b-tagged jets are labeled j_1 and j_2, and typically come from the W boson. (Imposing the W mass constraint on j_1 and j_2 would yield lower efficiency, though without significantly impacting the results.) The four-momentum of the detector-level neutrino (ν) is determined by solving the quadratic equation for the W boson mass; if there is no solution, the mass is set to zero, while if there are two real solutions, the one with the smaller |p_z| is selected. Four observables are formed for performing the top quark mass extraction, given by the following invariant masses: m_{b1μν}, m_{b2μν}, m_{b1j1j2}, and m_{b2j1j2}. A histogram of m_{b1μν} is shown for illustration in Fig. 3a.

We use the same neural network architectures and training procedure as in the BSM example above, with 1.5 million events per fixed-mass sample. For the per-ensemble classifier, we take N = 20, though of course for a realistic hypothesis testing situation, N would be as large as the number of top quark events recorded in data. To capture the permutation invariance of the inputs, we construct f_20^set using the deep sets approach in Eq. (31). We also build a classifier f_{1→20} from the per-instance classifier f_1 using Eq. (13).

In Fig. 3b, we see that f_{1→20} and f_20^set have comparable performance, though f_{1→20} is noticeably better. Some of this improvement may be due to differences in the network architecture, but we suspect that most of the gain is due to the more efficient training in the per-instance case. We checked that very poor performance is obtained for a classifier f_20 lacking permutation invariance, with a ROC curve that wasn't much better than f_1 alone. Explicitly breaking the invariance by sorting the inputs based on m_{b1μν} does help a little, as indicated by the f_20^sort curve in Fig. 3b, but does not reach the set-based approach.

FIG. 4. Computational performance of single-event versus multi-event training. Shown is the efficiency for the 175 GeV sample (false positive) for a fixed 50% efficiency for the 172.5 GeV sample (true positive), plotted as a function of training epoch. Single-event training (f_{1→20}) outperforms multi-event training (f_20^set), where both methods go through the full data set per epoch.

Given the similar performance of f_{1→20} and f_20^set, it is interesting to examine which learning strategy is more computationally efficient. In Fig. 4, we compare the performance as a function of the training epoch, using the difference of the true and false positive rates at a fixed 50% signal efficiency. In each epoch, both f_{1→20} and f_20^set see the full ensemble of events, so this is an apples-to-apples comparison as far as data usage is concerned. In particular, we plot this information per epoch instead of per compute time to avoid differences due to the structure of the neural networks. (There is not an easy way to control for possible differences in the training time due to the differences in the network structures, since the underlying tasks are different.) The f_{1→20} classifier trains much faster, in agreement with the analysis in Sec. II C, even though the ultimate asymptotic performance would be the same for both classifiers.
Once again, we see better empirical behavior from f_{1→20} trained on one event at a time versus f_20^set trained on multiple events simultaneously.

B. Classifiers: Single-Event from Multi-Event
In general, one cannot take a multi-event classifier f_N and extract a single-event classifier f_1. It is, however, possible to construct a special f̃_N network such that one can interpret a subnetwork as a per-event classifier, as discussed in Sec. II B. When using the MLC loss function, we can use the functional form in Eq. (14), where f̃_N is a product of f_{N→1} terms. Training f̃_N, where the only trainable weights are contained in f_{N→1}, we can learn a single-event classifier f_{N→1} from multi-event samples.

For the binary cross entropy loss used in our case studies, where Eq. (4) is needed to convert the classifier to a likelihood ratio, we have to introduce a slightly different structure than Eq. (14). Let f_N^set be a permutation-invariant classifier, as defined in Eq. (31) using the deep sets/PFN strategy. Taking the latent space dimension to be L = 1, the Φ network can be interpreted as a single-event classifier. Because the Φ network outputs are pooled via summation, we can build an optimal multi-event classifier if Φ learns the logarithm of the likelihood ratio; cf. Eq. (2). With this insight, we can fix the F function to achieve the same asymptotic performance as a trainable F by setting:

    F(x⃗) = exp( ∑_{i=1}^N Φ(x_i) ) / [ 1 + exp( ∑_{i=1}^N Φ(x_i) ) ].   (32)

Using Eq. (4), one can check that this F is monotonically related to the ensemble likelihood ratio. Similarly, Φ will be monotonically related to the optimal f_1, which we call f_{N→1} for the remainder of this discussion.

FIG. 5. Revisiting the ROC curves for the two Gaussian example from Fig. 1b. The multi-event classifier f̃_10^set with the restricted functional form in Eq. (32) has the same performance as f_10 with no restrictions. Using f̃_10^set, we can construct a single-event classifier f_{10→1} with the same performance as f_1 trained directly.

This construction is demonstrated in Fig. 5 for the Gaussian example. We see that the deep sets architecture with the fixed form of Eq. (32) (f̃_10^set) has the same or better performance as the 10-instance fully-connected classifier with more network capacity (f_10). Similarly, the Φ function used as a single-event classifier (f_{10→1}) has nearly the same performance as an independently trained single-event classifier (f_1).

The same conclusion holds for the BSM classification task, shown in Fig. 6. The only difference between the set-based architectures f̃_3^set and f_3^set is that the former uses the fixed functional form in Eq. (32). The fact that they achieve nearly the same performance is ensured by the IID relation in Eq. (2). The per-instance f_{3→1} network extracted from f̃_3^set is not quite as powerful as the f_1 network trained independently on single events, as expected from the gradient issue discussed in Sec. II C.

FIG. 6. Revisiting the ROC curves for the dijet resonance search example in Fig. 2c. The set-based multi-event classifiers f̃_3^set and f_3^set have similar performance, but we can use the former to construct a single-event classifier f_{3→1}. This construction is not as effective as performing single-event training directly (f_1).

While we found no benefit to extracting a single-event classifier from a multi-event classifier, it is satisfying to see these IID-derived theoretical predictions borne out in these empirical examples.
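A sketch of this constrained architecture (our layer sizes; the studies above use the PFN code) follows:

```python
import tensorflow as tf

# Deep sets model around Eq. (32): a one-dimensional latent space and a
# fixed sigmoid readout, so that the trainable Phi is monotonically
# related to the per-event log likelihood ratio.
N, n_features = 10, 1

phi_net = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1),          # L = 1 latent space
])

ens_in = tf.keras.Input(shape=(N, n_features))
phi = phi_net(ens_in)                  # Phi(x_i), applied per event
s = tf.keras.layers.Lambda(lambda z: tf.reduce_sum(z, axis=(1, 2)))(phi)
out = tf.keras.layers.Activation("sigmoid")(s)   # the fixed F of Eq. (32)
f_tilde = tf.keras.Model(ens_in, out)
f_tilde.compile(optimizer="adam", loss="binary_crossentropy")

# After training on N-event ensembles, the shared Phi evaluated on a
# single event plays the role of the classifier f_{N->1}.
evt_in = tf.keras.Input(shape=(1, n_features))
f_N_to_1 = tf.keras.Model(evt_in, phi_net(evt_in))
```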
C. Comparison of Regression Strategies

We now consider the regression methods introduced in Sec. II D. For classification, the mapping between per-instance and per-ensemble information is relatively straightforward. For regression, though, per-ensemble regression is structurally dissimilar from per-instance regression because of the need to integrate over priors on the regression parameters. Nevertheless, we can perform per-ensemble regression by first mapping the problem to per-instance parametrized classification.

We compare three different regression strategies in our empirical studies. The first method is a maximum-likelihood analysis, using the form in Eq. (24) based on the single-event parametrized classifier in Eq. (23). The second method is per-instance direct regression, using the construction in Eqs. (28) and (29) based on the same classifier as above. The third method is per-ensemble direct regression, based on minimizing the mean squared error loss in Eq. (27).
1. Gaussian Mean Example
Our first regression study is based on the same one-dimensional Gaussian distributions as Sec. III A 1. The prior distribution for the Gaussian means is taken to be uniform over a symmetric interval around zero, with fixed σ = 1. A training dataset is created from 100 examples each from 10,000 values of the Gaussian mean, for a total of one million training data points. For the reference sample p(x|θ_0) needed to build the single-event parametrized classifier f(x, μ) in Eq. (23), we create a second dataset with one million examples drawn from a standard normal distribution (i.e. μ = 0). To implement the p(θ) term in the second line of Eq. (22), each example x_i from the reference dataset is assigned a random mean value picked from the variable-mean dataset.

We train a parametrized neural network to distinguish the variable-mean dataset from the reference dataset. This network takes as input two features: one component of x⃗ and the random mean value μ. The architecture consists of three hidden layers (the first and last with 64 nodes) and ReLU activation. The output layer has a single node and sigmoid activation. Binary cross entropy is used to train the classifier, and Eq. (4) is used to convert it to the likelihood ratio form f(x, μ). The model is trained for 20 epochs with a batch size of 10% of the training statistics.

The same learned function f(x, μ) is used for both the maximum likelihood analysis and per-instance direct regression. For the maximum-likelihood analysis, the optimization in Eq. (24) is performed over a fixed grid of 20 evenly spaced μ values spanning the prior range. For per-instance direct regression, f_N(x⃗, μ) in Eq. (29) is constructed by taking a product of f(x, μ) outputs over all 100 examples in a given ensemble data point x⃗. The integrals in Eqs. (28) and (29) are approximated by evaluating f_N(x⃗, μ) on the same grid of 20 evenly spaced μ values.

For the per-ensemble direct regression, we train a network g_N that takes as input 100 values (i.e. all of x⃗) and predicts a single mean value. This network has the same architecture as f(x, μ), except it directly takes x⃗ as input and has linear (instead of sigmoid) activation for the output layer, since the predicted mean can be both positive and negative. It is trained to minimize the mean squared error loss in Eq. (27).

In Fig. 7, we see that all three approaches give nearly the same results in terms of bias and variance. Strictly speaking, maximum likelihood and direct regression are different tasks, so their behavior could be different. Per-instance and per-ensemble direct regression are constructed to yield the same asymptotic behavior, but there will be differences due to, e.g., the finite approximations to the integrals.

FIG. 7. Comparison of regression methods with the Gaussian example, with the predicted value of the mean plotted against the true value of the mean. The regression involves analyzing 100 instances drawn from the same Gaussian distribution. Bands are the standard deviation of the predictions over 10,000 generated samples. The per-instance direct regression uses single-event training, yet achieves comparable performance to per-ensemble direct regression that processes 100 events simultaneously.

Note that maximum likelihood and per-instance direct regression only use neural networks that process per-instance inputs; information about the rest of the events is used only through the training procedure. Thus, we have empirical evidence that per-ensemble regression can be accomplished via per-instance training.
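For completeness, a sketch of the per-ensemble baseline just described; the prior range and layer widths are illustrative assumptions of this sketch:

```python
import numpy as np
import tensorflow as tf

# Per-ensemble direct regression: a network g_N maps all 100 instances
# directly to the mean, trained with the MSE loss of Eq. (27).
n_sets, n_per_set = 10000, 100
mu = np.random.uniform(-0.5, 0.5, n_sets).astype("float32")
X = np.random.normal(mu[:, None], 1.0,
                     size=(n_sets, n_per_set)).astype("float32")

g_N = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(n_per_set,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),   # linear output: the predicted mean, cf. Eq. (28)
])
g_N.compile(optimizer="adam", loss="mse")
g_N.fit(X, mu, epochs=20, batch_size=1000, verbose=0)
```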
2. Top Quark Mass Measurement
As a physics example of regression, we consider extracting the top quark mass. Here, the top quark mass is the regression target, and the setup is similar to the Gaussian example above. We use the same event generation as Sec. III A 3, but now with top quark mass parameters sampled uniformly at random in m_t ∈ [170, 180] GeV. The event selection efficiency depends on m_t, which we account for when assigning dummy mass values to the reference sample.

The parametrized classifier now takes five inputs: the four mass features from Sec. III A 3 (m_{b1μν}, m_{b2μν}, m_{b1j1j2}, and m_{b2j1j2}) plus the top quark mass used for event generation. The neural network has three hidden layers with 50 nodes per layer and ReLU activation, and a single-node output layer with sigmoid activation. We train 10 models and take the median as the classifier output, using Eq. (4) to convert it to the likelihood ratio f(x, m_t). Each model is trained for 200 epochs with early stopping with a patience of 20 epochs and a batch size of 1000. To test the fidelity of the training, we extract the estimated likelihood ratio of m_t = 175 GeV over m_t = 172.5 GeV, shown in Fig. 8a.

For the per-instance direct regression, we sample over m_t ∈ [170, 180] GeV to convert f(x, m_t) into an estimate of E[m_t | x⃗]. The integrals in Eqs. (28) and (29) are approximated by sampling 50 random top quark masses per set of 100, following the probability density from the training dataset. Because 40 events are insufficient to make a precision measurement of the top quark mass, we find a noticeable bias between the estimated and true top mass values, which is exacerbated by edge effects at the ends of the training range. For this reason, we do not show a direct analog to Fig. 7, though this bias could be overcome with much larger training datasets with many more than 100 examples per mass value.

For the per-ensemble direct regression, we use the deep sets approach in Eq. (31) to handle the permutation invariance of the inputs. This approach is also well suited to handle the large variation in the number of events in each set due to the event selection effect. We again use PFNs for our practical implementation, with the default PFN hyperparameters from the https://energyflow.network/ package, except that we use linear activation in the output layer and the mean squared error loss function. We found that it was important for the model accuracy to standardize both the inputs and outputs of the network. Note that this is a different per-ensemble direct regression setup than used in Ref. [11], which found excellent performance using linear regression on sorted inputs.

In Fig. 8b, we compare the output of per-ensemble direct regression to the output of per-instance direct regression. We find a very strong correlation between these two very different approaches to computing the same quantity E[m_t | x⃗]. The band in Fig. 8b is the standard deviation over data sets with a true mass in the same one of 100 bins evenly spaced between 170 and 180 GeV. A key advantage of the per-instance approach is that it does not need to be retrained if more events are acquired. By contrast, the per-ensemble approach is only valid for event samples that have the same sizes as were used during training.
FIG. 8. Regression in the top quark mass example. (a) An estimate of the log likelihood for samples generated with 172.5 and 175 GeV top quark masses. The vertical axis has been shifted such that the minimum value is at zero. Note that the axis represents the average log likelihood, which differs from the total log likelihood by a factor of N_events. (b) Correlation between the per-instance predicted mass and the per-ensemble predicted mass in the context of direct regression. The per-ensemble mass values are put in bins of 0.1 GeV width, and the bands represent the standard deviation of the per-instance mass values in each bin.

D. Beyond Regression Example
As remarked in Sec. II E, the ideas discussed above apply to learning tasks beyond just standard classification and regression. As one simple example to illustrate this, we consider the Gaussian classification task from Sec. III A 1 and compute the mutual information between the Gaussian feature and the label. This quantifies how much information is available in the feature for classification and can be directly compared with other features and other classification tasks.

For this illustration, events are generated from each of two Gaussian distributions with means ±|ε| for fixed ε. The mutual information is estimated using a per-instance classifier as described in Sec. II E and also computed analytically via Eq. (30). For the per-instance classifier, we use a neural network that processes two inputs (label and feature), has two hidden layers with ReLU activation, and has a single-node sigmoid output. The classification task is to distinguish the nominal dataset from one where the labels are assigned uniformly at random to the features. The value of the MLC loss yields an estimate of the mutual information.

The mutual information results are presented in Fig. 9 as a function of ε.
FIG. 9. Mutual information between a Gaussian feature and a label, where the "signal" (x̄ = ε) and "background" (x̄ = −ε) have opposite means. The estimate using the MLC loss approach shows good agreement with the exact analytic expression.

As expected, the neural network strategy yields an excellent approximation to the analytic calculation. Note that this strategy does not require any binning and naturally extends to high-dimensional data, since the core component is a neural network classifier. We leave an investigation of this approach in the particle physics context to future work.

IV. CONCLUSIONS
We have demonstrated a connection between classifiers trained on single events and those that process multiple events at the same time. One can take a generic single-event classifier and build an N-event classifier using simple arithmetic operations. Such classifiers tend to outperform generic N-event classifiers, since we can enforce the IID assumption in the learning task. This performance gap can be mostly recovered by deploying a classifier that respects the permutation invariance of the set of N events. We used the deep sets/PFN architecture [60, 61] for this purpose, but other set-based architectures such as graph neural networks [62, 63] would also be appropriate.

An amusing feature of the deep sets approach is that we can use it to reverse-engineer a single-event classifier from a multi-event classifier by restricting the latent space to be one-dimensional and fixing a static output function. Even after enforcing these additional structures, though, we found both theoretically and empirically that the loss function gradients are better behaved for single-event classifiers than for multi-event classifiers. Going beyond classification, we explained how various regression tasks can be phrased in terms of per-instance parametrized classification, yielding similar performance to per-ensemble direct regression. We also mentioned how to compute distances and divergences between probability densities without requiring explicit density estimation. These results hold for any data sample satisfying the IID property.

Ultimately, we did not find any formal or practical advantage for training a multi-event classifier instead of a single-event classifier, at least for the cases we studied. With a carefully selected multi-event architecture, one can achieve similar performance to a scaled-up per-event classifier, but the latter will typically train faster. For direct regression, the per-ensemble strategy might be conceptually simpler than the per-instance method, though the per-instance methods allow for a simpler treatment of variably-sized data sets. Note that there may be situations where a simplifying assumption (e.g. the linear regression model in Ref. [11]) could yield better per-ensemble behavior than indicated by our case studies. At minimum, we hope this paper has demystified aspects of per-ensemble learning and highlighted some interesting features of the MLC loss function.

Going beyond the IID assumption, the duality between per-instance classifiers and per-ensemble classifiers could have applications to problems with approximate independence. For example, flavor tagging algorithms have traditionally exploited the approximate independence of individual track features within a jet [64, 65]. Similarly, emissions in the Lund jet plane [66, 67] are approximately independent, with exact independence in the strongly ordered limit of QCD. In both contexts, the instances are particles (or particle-like features) and the ensemble is the jet. A potentially powerful training procedure for these situations might be to first train a per-particle classifier, then build a per-jet classifier using the constructions described in this paper, and finally let the network train further to learn interdependencies between the particles.

CODE AND DATA
The code for this paper can be found at https://github.com/bnachman/EnsembleLearning. The physics datasets are hosted on Zenodo at Ref. [68] for the top quark dataset and Ref. [69] for the BSM dataset.
ACKNOWLEDGMENTS
We thank Anders Andreassen, Patrick Komiske, and Eric Metodiev for discussions about the MLC loss. We thank Rikab Gambhir and Ian Convy for discussions about mutual information. We thank Adi Suresh for discussions about the regression task with the classifier loss. We thank Katherine Fraser, Yue Lai, Duff Neill, Bryan Ostdiek, Mateusz Ploskon, Felix Ringer, and Matthew Schwartz for useful comments on our manuscript. BN is supported by the U.S. Department of Energy (DOE), Office of Science under contract DE-AC02-05CH11231. JT is supported by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/), and by the U.S. DOE Office of High Energy Physics under grant number DE-SC0012567. BN would also like to thank NVIDIA for providing Volta GPUs for neural network training.
Appendix A: Deriving Maximum Likelihood Classifier Loss
Beyond just the practical value of learning the likelihood ratio, the MLC loss in Eq. (7) has a nice interpretation in terms of learning probability distributions.

Consider trying to learn a function f(x) that is a normalized probability distribution, up to a Jacobian factor j(x):

\[ \int dx\, j(x)\, f(x) = 1. \tag{A1} \]

We are given samples from a probability distribution q(x), and we want to learn f(x) such that

\[ f(x) \to \frac{q(x)}{j(x)}. \tag{A2} \]

In other words, we want to learn a function f(x) that reproduces the sampled distribution q(x) after including the Jacobian factor. This problem was studied in Ref. [33], albeit in a context where f(x) had a restricted functional form such that Eq. (A1) was automatically enforced.

One strategy to accomplish this is to minimize the cross entropy of f(x) with respect to q(x), since the smallest cross entropy is obtained when f(x) has the same information content as q(x). The associated loss functional is:

\[ L[f] = -\int dx\, q(x) \log f(x) - \lambda \left( 1 - \int dx\, j(x)\, f(x) \right), \tag{A3} \]

where the first term is the cross entropy and λ is a Lagrange multiplier to enforce the normalization condition in Eq. (A1). Taking the functional derivative of Eq. (A3) with respect to f(x) and setting it equal to zero, we find the extremum condition:

\[ -\frac{q(x)}{f(x)} + \lambda\, j(x) = 0. \tag{A4} \]

Multiplying both sides of this equation by f(x) and integrating over x to set the Lagrange multiplier, we find that Eq. (A4) is solved for

\[ \lambda = 1, \qquad f(x) = \frac{q(x)}{j(x)}, \tag{A5} \]

so f(x) learns the q(x)/j(x) ratio as desired.

In the special case that j(x) is itself a normalized probability distribution, we can substitute for the Lagrange multiplier and rewrite Eq. (A3) in the following form:

\[ L[f] = -\int dx\, \Big( q(x) \log f(x) + j(x)\big(1 - f(x)\big) \Big). \tag{A6} \]

Identifying q(x) = p(x|θ_A) and j(x) = p(x|θ_B), this is precisely the MLC loss in Eq. (7). Therefore, we have an intuitive understanding of the MLC loss as trying to maximize the (log) likelihood of f(x) with respect to p(x|θ_A), subject to the constraint that f(x) p(x|θ_B) is a proper probability distribution.
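As a concrete illustration of this interpretation, here is a minimal sketch of learning a likelihood ratio directly with the MLC loss on a two-Gaussian toy problem, in Keras/TensorFlow. The exponential parametrization C(f) = exp f is implemented via an exponential output activation; the network size, sample size, and Gaussian parameters are illustrative choices, not necessarily those used for Fig. 10.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

def mlc_loss(y_true, y_pred):
    """MLC loss of Eq. (A6): -E_q[log f] - E_j[1 - f],
    with y_true = 1 for samples from q = p(x|theta_A)
    and y_true = 0 for samples from j = p(x|theta_B).
    At the minimum, y_pred -> p(x|theta_A)/p(x|theta_B)."""
    f = tf.clip_by_value(y_pred, 1e-7, 1e7)  # keep log f finite
    return -y_true * tf.math.log(f) - (1.0 - y_true) * (1.0 - f)

# Two overlapping Gaussians (illustrative parameters)
n = 100_000
x_A = np.random.normal(+0.1, 1.0, size=(n, 1)).astype("float32")
x_B = np.random.normal(-0.1, 1.0, size=(n, 1)).astype("float32")
x = np.concatenate([x_A, x_B])
y = np.concatenate([np.ones((n, 1)), np.zeros((n, 1))]).astype("float32")

# The exponential output activation enforces f(x) > 0, so the network
# effectively learns log f(x) (the C(f) = exp f parametrization).
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(1,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="exponential"),
])
model.compile(optimizer="adam", loss=mlc_loss)
model.fit(x, y, epochs=10, batch_size=1024, validation_split=0.2)

# The output estimates the likelihood ratio; at x = 0 the exact value is 1.
print(model.predict(np.zeros((1, 1), dtype="float32")))
```

Unlike binary cross entropy, no post-hoc transformation of the network output is needed: the learned f(x) is itself the likelihood ratio estimate plotted in Fig. 10.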
In Fig. 10, we plot the learned likelihood ratio between the two Gaussian samples from Fig. 1a, comparing the performance of MLC against binary cross entropy and the exact analytic expression. In all cases, a network is trained with 100 epochs and early stopping with a patience of 10 epochs. We also compare the MLC loss against the C(f) = exp f variant discussed in footnote 1. We see that both the linear (i.e. C(f) = f) and exponential parametrizations perform similarly in the region with ample data. That said, the exponential parametrization has a more robust extrapolation towards the edges, yielding similar behavior to binary cross entropy. Note that the exponential parametrization of the MLC loss was used in Ref. [31].

FIG. 10. A demonstration of the MLC loss for learning the likelihood ratio directly, using the Gaussian example from Fig. 1a. The linear (lin) and exponential (exp) parametrizations perform similarly. Shown for comparison is the likelihood ratio computed using the binary cross entropy (BCE) loss, which requires the manipulation in Eq. (4).

REFERENCES

[1] A. J. Larkoski, I. Moult, and B. Nachman, Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning, Phys. Rept., 1 (2020), arXiv:1709.04464 [hep-ph].
[2] D. Guest, K. Cranmer, and D. Whiteson, Deep Learning and its Application to LHC Physics, Ann. Rev. Nucl. Part. Sci., 161 (2018), arXiv:1806.11484 [hep-ex].
[3] K. Albertsson et al., Machine Learning in High Energy Physics Community White Paper, (2018), arXiv:1807.02876 [physics.comp-ph].
[4] A. Radovic et al., Machine learning at the energy and intensity frontiers of particle physics, Nature, 41 (2018).
[5] D. Bourilkov, Machine and Deep Learning Applications in Particle Physics, Int. J. Mod. Phys. A, 1930019 (2020), arXiv:1912.08245 [physics.data-an].
[6] HEP ML Community, A Living Review of Machine Learning for Particle Physics.
[7] Y. S. Lai, Automated Discovery of Jet Substructure Analyses, (2018), arXiv:1810.00835 [nucl-th].
[8] Y.-L. Du, K. Zhou, J. Steinheimer, L.-G. Pang, A. Motornenko, H.-S. Zong, X.-N. Wang, and H.
Stöcker, Identifying the nature of the QCD transition in relativistic collision of heavy nuclei with deep learning, Eur. Phys. J. C, 516 (2020), arXiv:1910.11530 [hep-ph].
[9] A. Mullin, H. Pacey, M. Parker, M. White, and S. Williams, Does SUSY have friends? A new approach for LHC event analysis, (2019), arXiv:1912.10625 [hep-ph].
[10] S. Chang, T.-K. Chen, and C.-W. Chiang, Distinguishing W′ Signals at Hadron Colliders Using Neural Networks, (2020), arXiv:2007.14586 [hep-ph].
[11] F. Flesher, K. Fraser, C. Hutchison, B. Ostdiek, and M. D. Schwartz, Parameter Inference from Event Ensembles and the Top-Quark Mass, (2020), arXiv:2011.04666 [hep-ph].
[12] M. Lazzarin, S. Alioli, and S. Carrazza, MCNNTUNES: tuning Shower Monte Carlo generators with machine learning, (2020), arXiv:2010.02213 [physics.comp-ph].
[13] Y. S. Lai, D. Neill, M. Płoskoń, and F. Ringer, Explainable machine learning of the underlying physics of high-energy particle collisions, (2020), arXiv:2012.06582 [hep-ph].
[14] J. Neyman and E. S. Pearson, On the problem of the most efficient tests of statistical hypotheses, Phil. Trans. R. Soc. Lond. A, 289 (1933).
[15] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics (Springer New York Inc., New York, NY, USA, 2001).
[16] M. Sugiyama, T. Suzuki, and T. Kanamori,
Density Ratio Estimation in Machine Learning (Cambridge University Press, 2012).
[17] A. Andreassen, S. Hsu, B. Nachman, N. Suaysom, and A. Suresh, Parameter Estimation using Neural Networks in the Presence of Detector Effects, (2020), arXiv:2010.03569 [hep-ph].
[18] A. Andreassen and B. Nachman, Neural Networks for Full Phase-space Reweighting and Parameter Tuning, Phys. Rev. D, 091901(R) (2020), arXiv:1907.08209 [hep-ph].
[19] M. Stoye, J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer, Likelihood-free inference with an improved cross-entropy estimator, (2018), arXiv:1808.00973 [stat.ML].
[20] J. Hollingsworth and D. Whiteson, Resonance Searches with Machine Learned Likelihood Ratios, (2020), arXiv:2002.04699 [hep-ph].
[21] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez, Constraining Effective Field Theories with Machine Learning, Phys. Rev. Lett., 111801 (2018), arXiv:1805.00013 [hep-ph].
[22] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez, A Guide to Constraining Effective Field Theories with Machine Learning, Phys. Rev. D, 052004 (2018), arXiv:1805.00020 [hep-ph].
[23] J. Brehmer, F. Kling, I. Espejo, and K. Cranmer, MadMiner: Machine learning-based inference for particle physics, Comput. Softw. Big Sci., 3 (2020), arXiv:1907.10621 [hep-ph].
[24] J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer, Mining gold from implicit models to improve likelihood-free inference, Proc. Nat. Acad. Sci., 201915980 (2020), arXiv:1805.12244 [stat.ML].
[25] K. Cranmer, J. Pavez, and G. Louppe, Approximating Likelihood Ratios with Calibrated Discriminative Classifiers, (2015), arXiv:1506.02169 [stat.AP].
[26] C. Badiali, F. Di Bello, G. Frattari, E. Gross, V. Ippolito, M. Kado, and J. Shlomi, Efficiency Parameterization with Neural Networks, (2020), arXiv:2004.02665 [hep-ex].
[27] A. Andreassen, B. Nachman, and D. Shih, Simulation Assisted Likelihood-free Anomaly Detection, Phys. Rev. D, 095004 (2020), arXiv:2001.05001 [hep-ph].
[28] A. Andreassen, P. T. Komiske, E. M. Metodiev, B. Nachman, and J. Thaler, OmniFold: A Method to Simultaneously Unfold All Observables, Phys. Rev. Lett., 182001 (2020), arXiv:1911.09107 [hep-ph].
[29] M. Erdmann, B. Fischer, D. Noll, Y. Rath, M. Rieger, and D. Schmidt, Adversarial Neural Network-based data-simulation corrections for jet-tagging at CMS, in Proc. 19th Int. Workshop on Adv. Comp., Anal. Techn. in Phys. Research, ACAT2019 (2019).
[30] X. Nguyen, M. J. Wainwright, and M. I. Jordan, On surrogate loss functions and f-divergences, arXiv Mathematics e-prints, math/0510521 (2005), arXiv:math/0510521 [math.ST].
[31] R. T. D'Agnolo and A. Wulzer, Learning New Physics from a Machine, Phys. Rev. D, 015014 (2019), arXiv:1806.02350 [hep-ph].
[32] R. T. D'Agnolo, G. Grosso, M. Pierini, A. Wulzer, and M. Zanetti, Learning Multivariate New Physics, (2019), arXiv:1912.12155 [hep-ph].
[33] A. Andreassen, I. Feige, C. Frye, and M. D. Schwartz, JUNIPR: a Framework for Unsupervised Machine Learning in Particle Physics, Eur. Phys. J. C, 102 (2019), arXiv:1804.09720 [hep-ph].
[34] J. Brehmer and K. Cranmer, Flows for simultaneous manifold learning and density estimation (2020), arXiv:2003.13913 [stat.ML].
[35] B. Nachman and D. Shih, Anomaly Detection with Density Estimation, Phys. Rev. D, 075042 (2020), arXiv:2001.04990 [hep-ph].
[36] P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, and D. Whiteson, Parameterized neural networks for high-energy physics, Eur. Phys. J. C, 235 (2016), arXiv:1601.07913 [hep-ex].
[37] S. Cheong, A. Cukierman, B. Nachman, M.
Safdari, and A. Schwartzman, Parametrizing the Detector Response with Neural Networks, JINST, P01030, arXiv:1910.03773 [physics.data-an].
[38] E. Fix and J. L. Hodges Jr., Discriminatory analysis - nonparametric discrimination: consistency properties, USAF School of Aviation Medicine, Project Number 21-49-004, Report Number 4 (1951).
[39] T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, 21 (1967).
[40] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems, Vol. 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014) pp. 2672–2680.
[41] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, Improved Techniques for Training GANs, arXiv e-prints, arXiv:1606.03498 (2016), arXiv:1606.03498 [cs.LG].
[42] D. P. Kingma and M. Welling, Auto-encoding variational bayes, in
ICLR, edited by Y. Bengio and Y. LeCun (2014).
[43] D. Rezende and S. Mohamed, Variational inference with normalizing flows, in
Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37, edited by F. Bach and D. Blei (PMLR, Lille, France, 2015) pp. 1530–1538.
[44] A. J. Larkoski, J. Thaler, and W. J. Waalewijn, Gaining (Mutual) Information about Quark/Gluon Discrimination, JHEP, 129, arXiv:1408.3122 [hep-ph].
[45] N. Carrara and J. Ernst, On the estimation of mutual information (2019), arXiv:1910.00365 [physics.data-an].
[46] F. Chollet, Keras, https://github.com/fchollet/keras (2017).
[47] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in OSDI, Vol. 16 (2016) pp. 265–283.
[48] D. Kingma and J. Ba, Adam: A method for stochastic optimization, (2014), arXiv:1412.6980 [cs].
[49] G. Kasieczka, B. Nachman, and D. Shih, R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge, 10.5281/zenodo.2629073 (2019), https://doi.org/10.5281/zenodo.2629073.
[50] T. Sjöstrand, S. Mrenna, and P. Z. Skands, PYTHIA 6.4 Physics and Manual, JHEP, 026, arXiv:hep-ph/0603175 [hep-ph].
[51] T. Sjöstrand, S. Mrenna, and P. Z. Skands, A Brief Introduction to PYTHIA 8.1, Comput. Phys. Commun., 852 (2008), arXiv:0710.3820 [hep-ph].
[52] J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi (DELPHES 3), DELPHES 3, A modular framework for fast simulation of a generic collider experiment, JHEP, 057, arXiv:1307.6346 [hep-ex].
[53] A. Mertens, New features in Delphes 3, Proceedings, 16th International workshop on Advanced Computing and Analysis Techniques in physics (ACAT 14): Prague, Czech Republic, September 1-5, 2014, J. Phys. Conf. Ser., 012045 (2015).
[54] M. Selvaggi, DELPHES 3: A modular framework for fast-simulation of generic collider experiments,
Proceedings, 15th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2013): Beijing, China, May 16-21, 2013, J. Phys. Conf. Ser., 012033 (2014).
[55] M. Cacciari, G. P. Salam, and G. Soyez, FastJet User Manual, Eur. Phys. J.
C72, 1896 (2012), arXiv:1111.6097 [hep-ph].
[56] M. Cacciari and G. P. Salam, Dispelling the N³ myth for the kt jet-finder, Phys. Lett. B641, 57 (2006), arXiv:hep-ph/0512210 [hep-ph].
[57] M. Cacciari, G. P. Salam, and G. Soyez, The anti-kt jet clustering algorithm, JHEP, 063, arXiv:0802.1189 [hep-ph].
[58] J. Thaler and K. Van Tilburg, Maximizing Boosted Top Identification by Minimizing N-subjettiness, JHEP, 093, arXiv:1108.2701 [hep-ph].
[59] J. Thaler and K. Van Tilburg, Identifying Boosted Objects with N-subjettiness, JHEP, 015, arXiv:1011.2268 [hep-ph].
[60] M. Zaheer, S. Kottur, S. Ravanbhakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola, Deep sets, in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17 (Curran Associates Inc., Red Hook, NY, USA, 2017) pp. 3394–3404.
[61] P. T. Komiske, E. M. Metodiev, and J. Thaler, Energy Flow Networks: Deep Sets for Particle Jets, JHEP, 121, arXiv:1810.05165 [hep-ph].
[62] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, The graph neural network model, Trans. Neur. Netw., 61–80 (2009).
[63] J. Shlomi, P. Battaglia, and J.-R. Vlimant, Graph Neural Networks in Particle Physics, 10.1088/2632-2153/abbf9a (2020), arXiv:2007.13681 [hep-ex].
[64] M. Aaboud et al. (ATLAS), Measurements of b-jet tagging efficiency with the ATLAS detector using tt̄ events at √s = 13 TeV, JHEP, 089, arXiv:1805.01845 [hep-ex].
[65] S. Chatrchyan et al. (CMS), Identification of b-Quark Jets with the CMS Experiment, JINST, P04013, arXiv:1211.4462 [hep-ex].
[66] B. Andersson, G. Gustafson, L. Lonnblad, and U. Pettersson, Coherence Effects in Deep Inelastic Scattering, Z. Phys. C, 625 (1989).
[67] F. A. Dreyer, G. P. Salam, and G. Soyez, The Lund Jet Plane, JHEP, 064, arXiv:1807.04758 [hep-ph].
[68] A. Andreassen, S.-C. Hsu, B. Nachman, N. Suaysom, and A. Suresh, Srgn: Pythia + Delphes pp → tt̄