E Pluribus Unum Ex Machina: Learning from Many Collider Events at Once
MIT-CTP 5271
Benjamin Nachman^{1,2,*} and Jesse Thaler^{3,4,†}

^1 Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
^2 Berkeley Institute for Data Science, University of California, Berkeley, CA 94720, USA
^3 Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
^4 The NSF AI Institute for Artificial Intelligence and Fundamental Interactions
There have been a number of recent proposals to enhance the performance of machine learning strategies for collider physics by combining many distinct events into a single ensemble feature. To evaluate the efficacy of these proposals, we study the connection between single-event classifiers and multi-event classifiers under the assumption that collider events are independent and identically distributed (IID). We show how one can build optimal multi-event classifiers from single-event classifiers, and we also show how to construct multi-event classifiers such that they produce optimal single-event classifiers. This is illustrated for a Gaussian example as well as for classification tasks relevant for searches and measurements at the Large Hadron Collider. We extend our discussion to regression tasks by showing how they can be phrased in terms of parametrized classifiers. Empirically, we find that training a single-event (per-instance) classifier is more effective than training a multi-event (per-ensemble) classifier, at least for the cases we studied, and we relate this fact to properties of the loss function gradient in the two cases. While we did not identify a clear benefit from using multi-event classifiers in the collider context, we speculate on the potential value of these methods in cases involving only approximate independence, as relevant for jet substructure studies.
CONTENTS
I. Introduction
II. The Statistics of Per-Ensemble Learning
   A. Review of Per-Instance Learning
   B. Per-Ensemble Binary Classification
   C. Comparing the Loss Gradients
   D. Per-Ensemble Regression
      1. Maximum Likelihood
      2. Classifier Loss
      3. Direct Regression
   E. Beyond Regression
III. Empirical Studies
   A. Classifiers: Multi-Event from Single-Event
      1. Two Gaussian Example
      2. Dijet Resonance Search
      3. Top Quark Mass Measurement
   B. Classifiers: Single-Event from Multi-Event
   C. Comparison of Regression Strategies
      1. Gaussian Mean Example
      2. Top Quark Mass Measurement
   D. Beyond Regression Example
IV. Conclusions
Code and Data
Acknowledgments
A. Deriving Maximum Likelihood Classifier Loss
References

* [email protected]
† [email protected]

I. INTRODUCTION
Modern machine learning techniques are being widely applied to enhance or replace existing analysis techniques across collider physics [1–6]. These approaches hold great promise for new particle searches, for Standard Model measurements, and for high-energy nuclear physics investigations. A subset of these proposals have advocated for a multi-event strategy whereby a machine-learned function acts on multiple collision events at the same time [7–13]. This multi-event (per-ensemble) strategy contrasts with more typical single-event (per-instance) machine learning methods that process one event at a time, although both strategies make use of many events during the training process. Intuitively, an ensemble approach might seem like a more promising learning strategy because there is more information contained in
N > 1 events than in a single event.

To an excellent approximation, collider events are statistically independent and identically distributed (IID). In simulation, this is exactly true up to deficiencies in random number generators. In data, there are some small time-dependent effects from changing conditions, and there are also some correlations between events introduced by detector effects with timescales longer than a typical bunch crossing. These event-to-event correlations, however, are truly negligible when considering the set of events typically used for physics analysis that are selected by triggers. The probability for two events next to each other in time to both be saved by the triggers is effectively zero, since triggers save only a tiny fraction of events. The IID nature of collision events therefore ensures that the information content is the same for ensembles of events and for single events drawn from an ensemble.

In equations, the probability to observe N events x_i is

    p({x_1, …, x_N} | θ) = ∏_{i=1}^N p(x_i | θ),   (1)

where θ represents possible parameters of the generative model, such as the physics process being studied or the values of coupling constants. The optimal classifier to distinguish whether events have been generated via θ_A or via θ_B depends only on the per-ensemble likelihood ratio [14]:

    p({x_1, …, x_N} | θ_A) / p({x_1, …, x_N} | θ_B) = ∏_{i=1}^N p(x_i | θ_A) / p(x_i | θ_B),   (2)

which by the IID assumption only depends on knowing the per-instance likelihood ratio p(x_i | θ_A) / p(x_i | θ_B). This equality explains the informational equivalence of per-ensemble and per-event learning.

Given the simplicity of Eq. (2), why are we writing a whole paper on this topic (apart from the opportunity to invoke a gratuitously Latinate paper title that incorporates an aspiration for national unity)? The studies in Refs. [7–13] find that per-ensemble learning is effective for their respective tasks, in some cases arguing why per-instance learning is deficient. It is certainly true that a set of events {x_1, …, x_N} contains more information than a single event x_i drawn from this set. What we will show in this paper is that if one carefully combines the per-instance information, one can recover the per-ensemble benefit, with the potential for a substantially reduced training cost. We emphasize that our analysis does not contradict the studies in Refs. [7–13]; rather, this work suggests the possibility of achieving the same or better results by replacing per-ensemble learning with per-instance learning. There may be specialized contexts where per-ensemble learning is superior, particularly if the training procedure itself can be made simpler, such as in the linear regression approach of Ref. [11]. This paper also gives us a chance to mention some facts about loss functions that are well known in the statistics literature but might not be as well appreciated in collider physics. Moving away from the IID case, we speculate on the relevance of our analysis for jet substructure tasks where there is a notion of approximate independence of emissions.

The remainder of this paper is organized as follows. In Sec. II, we provide the formal statistical basis for building multi-event classifiers from single-event classifiers, and vice versa, under the IID assumption. We also explain how regression tasks can be translated into the language of per-instance parametrized classification. In Sec. III, we present empirical studies that corroborate these analytic results.
Our conclusions are given in Sec. IV.

II. THE STATISTICS OF PER-ENSEMBLE LEARNING

A. Review of Per-Instance Learning
Suppose that a collider event is represented by features in E = R^M and we are trying to train a binary classifier to learn a target in [0, 1]. Let c : E → [0, 1] be a function that processes a single event, with the goal of distinguishing events generated by θ_A (c → 1) versus those generated by θ_B (c → 0). Such a function can be learned by minimizing the binary cross entropy (BCE) loss functional:

    L_BCE[c] = −∫ dx ( p(x|θ_A) log c(x) + p(x|θ_B) log(1 − c(x)) ),   (3)

where p(x|θ) is the probability density of x ∈ E given class θ. Here and throughout this discussion, we consider the infinite statistics limit such that we can replace sums over events by integrals. We have also dropped the prior factors p(θ_i), assuming that one has equal numbers of examples from the two hypotheses during training. While this is often true in practice, it is not strictly necessary for our main conclusions, though it does simplify the notation. It is well known [15, 16] (also in high-energy physics [17–29]) that an optimally trained c will have the following property:

    c(x) / (1 − c(x)) = p(x|θ_A) / p(x|θ_B),   (4)

such that one learns the per-instance likelihood ratio. By the Neyman–Pearson lemma [14], this defines the optimal single-event classifier.

There are many loss functionals that satisfy this property. Consider a more general loss functional that depends on a learnable function f : E → R (which unlike c may or may not map to [0, 1]) and two functions A : R → R and B : R → R:

    L[f] = −∫ dx ( p(x|θ_A) A(f(x)) + p(x|θ_B) B(f(x)) ).   (5)

Taking the functional derivative with respect to f(x), the extremum of L[f] satisfies the property:

    −B′(f(x)) / A′(f(x)) = p(x|θ_A) / p(x|θ_B).   (6)

As long as −B′(f) / A′(f) is a monotonic rescaling of f and the overall loss functional is convex, the function f(x) learned by minimizing Eq. (5) defines an optimal classifier. In many cases, the minimum value of L[f] itself is interesting in the context of statistical divergences and distances [30], and a few examples are shown in Table I.

TABLE I. Examples of loss functionals in the form of Eq. (5), with the associated location and value of the loss minimum, using the shorthand p_i ≡ p(x|θ_i). We use the symbol f in all cases to denote the classifier, but some choices require explicit constraints on f to be either non-negative or in the range [0, 1].

| Loss Name              | A(f)       | B(f)        | argmin_f L[f]    | Integrand of −min_f L[f]            | Related Divergence/Distance |
|------------------------|------------|-------------|------------------|-------------------------------------|-----------------------------|
| Binary Cross Entropy   | log f      | log(1 − f)  | p_A/(p_A + p_B)  | p_A log[p_A/(p_A + p_B)] + (A ↔ B)  | 2 (Jensen–Shannon − log 2)  |
| Mean Squared Error     | −(1 − f)²  | −f²         | p_A/(p_A + p_B)  | −p_A p_B/(p_A + p_B)                | ¼ (Triangular − 2)          |
| Square Root            | −√(1/f)    | −√f         | p_A/p_B          | −2 √(p_A p_B)                       | 2 (Hellinger² − 1)          |
| Maximum Likelihood Cl. | log f      | 1 − f       | p_A/p_B          | p_A log(p_A/p_B)                    | Kullback–Leibler            |

To simplify the following discussion, we will focus on the "maximum likelihood classifier" (MLC) loss:

    L_MLC[f] = −∫ dx ( p(x|θ_A) log f(x) + p(x|θ_B) (1 − f(x)) ).   (7)

This is of the general form in Eq. (5) with A(f) = log f and B(f) = 1 − f. To our knowledge, the MLC was first introduced in the collider physics context in Refs. [31, 32], although with an exponential parameterization of f(x). We call Eq. (7) the MLC loss to distinguish it from the related maximum likelihood loss that is often used to fit generative models [33–35]. Using Eq. (6), the minimum of this loss functional directly yields the likelihood ratio (see footnote 1):

    argmin_f L_MLC[f] = p(x|θ_A) / p(x|θ_B),   (8)

which will be useful to simplify later analyses. The MLC loss functional value at the minimum is

    −min_f L_MLC[f] = ∫ dx p(x|θ_A) log [ p(x|θ_A) / p(x|θ_B) ],   (9)

which is the Kullback–Leibler (KL) divergence, also known as the relative entropy from p(x|θ_B) to p(x|θ_A). See App. A for an intuitive derivation of Eq. (7).

Footnote 1: A variation of Eq. (8) holds for A(f) = log C(f) and B(f) = 1 − C(f), where C(f) is any monotonically increasing function with range that covers (0, ∞). In this case, C(argmin_f L[f]) = p(x|θ_A)/p(x|θ_B). This can be useful in practice if C(f) is everywhere positive, since f can take on negative values and still yield a valid likelihood ratio. See Fig. 10 for an empirical study of C(f) = exp f.
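As a concrete illustration of Eq. (7), the following minimal sketch implements the MLC loss as a custom Keras loss, adopting the C(f) = exp f parametrization of footnote 1 so that f is positive by construction; the function and variable names here are our own choices, not those of the code used in Sec. III:

```python
import tensorflow as tf

def mlc_loss(y_true, y_pred):
    """Maximum likelihood classifier (MLC) loss of Eq. (7).

    Events with y_true = 1 are drawn from theta_A and y_true = 0 from
    theta_B. The network output is interpreted as log f(x), i.e. the
    C(f) = exp f variant of footnote 1, so f = exp(y_pred) is positive
    and log f = y_pred. At the minimum, f estimates the likelihood
    ratio p(x|theta_A)/p(x|theta_B).
    """
    y_true = tf.cast(y_true, y_pred.dtype)
    log_f = y_pred
    f = tf.exp(y_pred)
    return -tf.reduce_mean(y_true * log_f + (1.0 - y_true) * (1.0 - f))

# Usage: a network with a linear output node trained with this loss
# returns the log of the per-instance likelihood ratio.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=mlc_loss)
```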
B. Per-Ensemble Binary Classification

To move from single-event classification to multi-event classification, we want to learn a classification function f_N that processes N events simultaneously. Here, we use f_N : E^N → R instead of c_N : E^N → [0, 1] to avoid algebraic manipulations like Eq. (4). We will use the vector notation

    x⃗ = {x_1, …, x_N}   (10)

to represent an element of E^N. Our goal is to distinguish whether x⃗ is drawn from p(x⃗|θ_A) (f_N → ∞) or from p(x⃗|θ_B) (f_N → 0). Note that we are testing whether all of the events were generated via θ_A or θ_B, which is a different question than trying to determine the proportion of events drawn from each class in a mixed event ensemble. For N = 1, f_1 is the same as the f discussed in Eq. (5).

If f_N is trained optimally, then the classification performance of f_N evaluated on N > 1 events will be better than that of f_1 evaluated on a single event, as relevant to the discussions in Refs. [7–13]. The key point of this paper is that one can construct a classifier f_{1→N} that is built only from f_1, acts on N events, and has the same asymptotic performance as f_N.

Using the MLC loss in Eq. (7), but now applied to N events, we have

    L_MLC[f_N] = −∫ d^N x ( p(x⃗|θ_A) log f_N(x⃗) + p(x⃗|θ_B) (1 − f_N(x⃗)) ),   (11)

whose minimum is the per-ensemble likelihood ratio:

    argmin_{f_N} L_MLC[f_N] = p(x⃗|θ_A) / p(x⃗|θ_B).   (12)

By the Neyman–Pearson lemma, this yields the optimal per-ensemble classifier.

On the other hand, once we have trained a single-event classifier f_1 using Eq. (7), we can build a multi-event classifier f_{1→N} without any additional training:

    f_{1→N}(x⃗) ≡ ∏_{i=1}^N f_1(x_i) → p(x⃗|θ_A) / p(x⃗|θ_B),   (13)

where in the last step we have combined the solution found in Eq. (8) with the IID condition in Eq. (2). Whereas minimizing Eq. (11) requires sampling over E^N, constructing f_{1→N} only requires sampling over E, which is a considerable reduction in computational burden for large N. The technical details of carrying out this procedure are explained in Sec. III A.

Going in the converse direction, we can learn a single-event classifier f_{N→1} starting from a constrained multi-event classifier f̃_N. Using weight sharing, we can minimize Eq. (11) subject to the constraint that f̃_N takes the functional form:

    f̃_N({x_1, …, x_N}) = ∏_{i=1}^N f_{N→1}(x_i),   (14)

where f_{N→1}(x) is a learnable function. Under the IID assumption, f̃_N can still learn the per-ensemble likelihood ratio, but the learned f_{N→1}(x) will now be the per-instance likelihood ratio, at least asymptotically. An examination of this converse construction is presented in Sec. III B.
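Once a per-instance likelihood-ratio estimate f_1 is in hand, the construction in Eq. (13) requires only a few lines; this sketch (with hypothetical function names) combines the per-event outputs in log space:

```python
import numpy as np

def f_1_to_N(f1, ensemble):
    """Eq. (13): build a multi-event classifier from a per-instance one.

    f1 is assumed to return the per-instance likelihood-ratio estimate
    p(x|theta_A)/p(x|theta_B) for an array of single events; ensemble is
    an array of N IID events. Logs are summed for numerical stability.
    """
    return np.exp(np.sum(np.log(f1(ensemble))))
```

Because the product of many likelihood ratios can overflow or underflow, working directly with the summed log likelihood ratio (which is monotonically related to f_{1→N}) is usually preferable in practice.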
C. Comparing the Loss Gradients

We have shown that the per-ensemble classifier f_N and the composite per-event classifier f_{1→N} have the same asymptotic information content, but one might wonder if there is nevertheless a practical performance gain to be had using per-ensemble learning.

Under the IID assumption, the optimal f_N takes the form of f̃_N in Eq. (14), and in our empirical studies, we found no benefit to letting f_N have more functional freedom. Therefore, to get a sense of the efficacy of per-ensemble versus per-instance training, we can compare the effective loss functions for f_{N→1} and f_1. Since the inputs and outputs of these functions are the same (i.e. E → R), we can do an apples-to-apples comparison of their behavior under gradient descent.

For the per-ensemble case, plugging Eq. (14) into Eq. (11) and using the IID relation in Eq. (1), we find the effective loss functional:

    L_MLC[f_{N→1}] + 1 = −N ∫ dx p(x|θ_A) log f_{N→1}(x) + ( ∫ dx p(x|θ_B) f_{N→1}(x) )^N.   (15)

This is to be contrasted with the per-instance loss functional from Eq. (7), repeated for convenience with the f_1 notation and typeset to be parallel to the above:

    L_MLC[f_1] + 1 = −∫ dx p(x|θ_A) log f_1(x) + ∫ dx p(x|θ_B) f_1(x).   (16)

To understand the loss gradients, we can Taylor expand the learned functions about the optimal solution:

    f_{N→1}(x) = p(x|θ_A)/p(x|θ_B) + ε(x),   (17)
    f_1(x) = p(x|θ_A)/p(x|θ_B) + ε(x).   (18)

Plugging these into their respective loss functionals and looking at the leading-order variations, we have:

    δL_MLC[f_{N→1}] / N = (1/2) ∫ dx ( p(x|θ_B) ε(x) )² / p(x|θ_A) + ((N − 1)/2) ( ∫ dx p(x|θ_B) ε(x) )²,   (19)

    δL_MLC[f_1] = (1/2) ∫ dx ( p(x|θ_B) ε(x) )² / p(x|θ_A).   (20)

These expressions are quadratic in ε(x), which means that we are expanding around the correct minimum.

The expression for δL_MLC[f_1] involves a single integral over x, so under gradient descent, the value of ε(x) can be independently adjusted at each point in phase space to find the minimum. By contrast, δL_MLC[f_{N→1}] has an additional piece involving an integral squared, so even if at a given point in phase space x_0 we have achieved ε(x_0) = 0, gradient descent will tend to push ε(x_0) away from the correct value until ε(x) = 0 everywhere. This correlated structure explains the slower convergence of L_MLC[f_{N→1}] compared to L_MLC[f_1] in our empirical studies. While we focused on the MLC loss to simplify the algebra, the appearance of these (typically counterproductive) correlations in the loss gradient appears to be a generic feature of per-ensemble learning.
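This gradient structure can be seen numerically. The following toy sketch (our choices of grid, perturbation, and N) discretizes phase space for Gaussian densities, perturbs the optimal classifier in a single bin, and compares finite-difference gradients of the effective losses in Eqs. (15) and (16):

```python
import numpy as np

# Discretized phase space with the Gaussian densities of Sec. III A 1.
x = np.linspace(-5.0, 5.0, 201)
dx = x[1] - x[0]
pA = np.exp(-0.5 * (x - 0.1) ** 2) / np.sqrt(2.0 * np.pi)
pB = np.exp(-0.5 * (x + 0.1) ** 2) / np.sqrt(2.0 * np.pi)
N = 10

def loss_instance(f):  # Eq. (16), dropping the constant +1
    return dx * np.sum(-pA * np.log(f) + pB * f)

def loss_ensemble(f):  # Eq. (15), dropping the constant +1
    return -N * dx * np.sum(pA * np.log(f)) + (dx * np.sum(pB * f)) ** N

def gradient(loss, f, h=1e-6):
    base = loss(f)
    g = np.zeros_like(f)
    for j in range(len(f)):
        fp = f.copy()
        fp[j] += h
        g[j] = (loss(fp) - base) / h
    return g

f = pA / pB     # the optimal solution
f[120] += 0.5   # epsilon(x) nonzero in one phase-space bin only

# The per-instance gradient vanishes wherever epsilon(x) = 0, while the
# integral-squared term makes the per-ensemble gradient nonzero everywhere.
print(gradient(loss_instance, f)[60], gradient(loss_ensemble, f)[60])
```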
D. Per-Ensemble Regression

While the discussion above focused on binary classification, the same basic idea applies to regression problems as well. The goal of regression is to infer parameters θ from the data x⃗. There are a variety of approaches that can be used for this task, and each can be connected to parametrized per-instance classification.
1. Maximum Likelihood
Maximum likelihood is the most common strategy for inference in collider physics. Symbolically, we are trying to find

    θ_ML = argmax_θ p(x⃗|θ).   (21)

One way to determine θ_ML is with a two-step approach. First, one can train a parametrized classifier f(x, θ) [25, 36] using, e.g., the per-instance MLC loss:

    L_MLC[f] = −∫ dx dθ ( p(x|θ) p(θ) log f(x, θ) + p(x|θ_0) p(θ) (1 − f(x, θ)) ).   (22)

The first term corresponds to a synthetic dataset where every event is generated from p(x|θ) with different θ values drawn from the probability density p(θ). The second term corresponds to a synthetic dataset where every event is generated using the same p(x|θ_0) for a fixed reference θ_0 and then augmented with a value θ that follows from p(θ) independently of x. Minimizing Eq. (22) with respect to f(x, θ), the asymptotic solution is the likelihood ratio:

    f(x, θ) = p(x|θ) / p(x|θ_0),   (23)

where the factors of p(θ) have cancelled out. Second, one can estimate θ_ML by using the IID properties of the event ensemble to relate likelihoods to the classifier output f(x, θ):

    θ_ML = argmin_θ { −∑_{i=1}^N log p(x_i|θ) }
         = argmin_θ { −∑_{i=1}^N log [ p(x_i|θ) / p(x_i|θ_0) ] }
         ≈ argmin_θ { −∑_{i=1}^N log f(x_i, θ) }.   (24)

Thus, even though maximum likelihood regression uses information from the full event ensemble, only a parametrized per-instance classifier is required for this procedure.
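In practice, the second step is a simple scan; a minimal sketch, assuming a trained parametrized classifier f_param as in Eq. (23):

```python
import numpy as np

def theta_ml(f_param, events, theta_grid):
    """Grid-scan maximum likelihood estimate via Eq. (24), a sketch.

    f_param(events, theta) is assumed to return the parametrized
    classifier output of Eq. (23), i.e. p(x|theta)/p(x|theta_0) for
    each event; theta_grid holds the candidate theta values.
    """
    nll = [-np.sum(np.log(f_param(events, theta))) for theta in theta_grid]
    return theta_grid[int(np.argmin(nll))]
```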
2. Classifier Loss
Two recent proposals for parameter estimation are explicitly built on classifiers for regression [17, 18]. For any classifier, one can perform the following optimization:

    θ_CL = argmax_θ { Loss of a classifier trained to distinguish θ from θ_data }.   (25)

Here, we are imagining that the θ samples come from synthetic data sets. The appearance of a maximum instead of a minimum in Eq. (25) is because, as highlighted in Table I, it is negative loss functionals that correspond to statistical divergences and distances.

Footnote 2: Note that Ref. [17] used the (non-differentiable) area under the curve instead of the classifier loss, as it is not sensitive to differences in the prior p(θ) between the two data sets.

In general, the θ_CL that optimizes the classifier loss will be different from the θ_ML that maximizes the likelihood. For the special case of the MLC loss, though, they are the same in the asymptotic limit if we set θ_A = θ_data and θ_B = θ. To see this, recall from Eq. (9) that after training, the value of the MLC loss is related to the KL divergence:

    argmax_θ { min_f L_MLC[f] }
    = argmax_θ { −∫ dx p(x|θ_data) log [ p(x|θ_data) / p(x|θ) ] }
    ≈ argmax_θ { ∑_{i=1}^N log [ p(x_i|θ) / p(x_i|θ_data) ] }
    = argmin_θ { −∑_{i=1}^N log p(x_i|θ) } = θ_ML,   (26)

where the sum is over data events.
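A minimal sketch of the scan in Eq. (25), assuming access to a simulator sample_from(theta, n) (a stand-in name) and one-dimensional events; for brevity this version uses the binary cross entropy loss, whereas the MLC loss would reproduce θ_ML asymptotically, as shown in Eq. (26):

```python
import numpy as np
import tensorflow as tf

def theta_cl(sample_from, data, theta_grid, epochs=10):
    """Classifier-loss parameter estimate, Eq. (25), a sketch.

    For each theta, train a small classifier to separate synthetic
    events at theta from data and record the final loss. theta_CL
    maximizes the loss: samples hardest to distinguish from data give
    the largest achievable minimum loss. Assumes scalar events.
    """
    losses = []
    for theta in theta_grid:
        sim = sample_from(theta, len(data))
        X = np.concatenate([data, sim])[:, None].astype("float32")
        y = np.concatenate([np.ones(len(data)), np.zeros(len(sim))])
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
        hist = model.fit(X, y, epochs=epochs, verbose=0)
        losses.append(hist.history["loss"][-1])
    return theta_grid[int(np.argmax(losses))]
```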
3. Direct Regression
In terms of information content, a regression model trained in the usual way can be built from a parametrized classification model. Suppose that θ ∈ R^Q and g_N : E^N → R^Q is a regression model trained with the mean squared error loss:

    L_MSE[g_N] = ∫ d^N x dθ p(x⃗, θ) ( g_N(x⃗) − θ )².   (27)

It is well known that the optimally trained g_N will be related to the expectation value of θ:

    g_N(x⃗) = E[θ | x⃗] = ∫ dθ θ p(θ|x⃗).   (28)

Other loss functions approximate other statistics, as discussed in Ref. [37]. For example, the mean absolute error loss approximates the median of θ. Ultimately, all direct regression methods are functionals of p(θ|x⃗).

We can relate p(θ|x⃗) to a parametrized classifier f_N(x⃗, θ) trained to distinguish θ from a baseline θ_0:

    p(θ|x⃗) = p(x⃗|θ) p(θ) / p(x⃗)
            = p(x⃗|θ) p(θ) / ∫ dθ′ p(x⃗|θ′) p(θ′)
            = [ p(x⃗|θ) / p(x⃗|θ_0) ] p(θ) / ∫ dθ′ [ p(x⃗|θ′) / p(x⃗|θ_0) ] p(θ′)
            = f_N(x⃗, θ) p(θ) / ∫ dθ′ f_N(x⃗, θ′) p(θ′),   (29)

where p(θ) is the probability density of θ used during the training of g_N. Following the same logic as Sec. II B, the per-ensemble classifier f_N(x⃗, θ) can be related to a per-instance classifier f(x, θ). Therefore, even though g_N acts on N events, it has the same information content as a parametrized classifier that acts on single events.

Performing regression via Eqs. (28) and (29) is straightforward but tedious. In practice, one would train a parametrized per-instance classifier f(x, θ) as in Eq. (23), multiply it to construct f_N(x⃗, θ) = ∏_{i=1}^N f(x_i, θ), and then sample over values of θ to approximate the integrals. We show examples of the above regression strategies in Sec. III C.
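A sketch of this recipe, with hypothetical function names; the likelihood ratios enter through a callable f_ens that, under the IID assumption, can be built as a product of per-instance classifier outputs:

```python
import numpy as np

def posterior_mean(f_ens, ensemble, theta_grid, prior):
    """Direct regression via Eqs. (28)-(29), a sketch.

    f_ens(ensemble, theta) is assumed to return the per-ensemble
    likelihood ratio of Eq. (29); theta_grid and prior are arrays that
    discretize the integrals over theta with weights p(theta).
    """
    log_w = np.array([np.log(f_ens(ensemble, t)) for t in theta_grid])
    log_w += np.log(prior)
    w = np.exp(log_w - log_w.max())   # stabilized unnormalized weights
    w /= w.sum()                      # discretized p(theta | ensemble), Eq. (29)
    return float(np.sum(w * theta_grid))   # E[theta | ensemble], Eq. (28)
```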
E. Beyond Regression

In addition to classification and regression, a standard machine learning task is density estimation. While some classical machine learning methods like k-nearest neighbors [38, 39] do require multi-instance information at prediction time, many of the standard deep learning solutions to implicit or explicit generative modeling are built on per-instance functions. Such methods include generative adversarial networks [40], variational autoencoders [42], and normalizing flows [43].

Footnote 3: In the context of adversarial training, it may be beneficial to use per-ensemble information in the discriminator to mitigate mode collapse, as utilized in Ref. [13]. This is also the philosophy behind mini-batch discrimination [41].

One reason for computing explicit densities is to estimate the distance to a reference density. A common set of tools for this task are the f-divergences mentioned earlier. As discussed in Ref. [30] and highlighted in Table I, there is a direct mapping between the loss value of a per-instance classification task and a corresponding f-divergence between the underlying probability densities.

A related quantity is the mutual information between two random variables X and Y:

    I(X, Y) = ∫ dx dy p(x, y) log [ p(x, y) / (p(x) p(y)) ].   (30)

For example, Y could be binary (a class label) and then I(X, Y) would encode how much information (in units of nats) is available in X for doing classification. This can be helpful in the context of ranking input features, and was studied in the context of quark/gluon jet classification in Ref. [44].

Naively, Eq. (30) might seem to require estimating the densities p(x), p(y), and p(x, y), which in turn may require ensemble information (see e.g. Ref. [45] for a study in the context of HEP). On the other hand, Eq. (30) takes the same form as the KL divergence in Eq. (9). Therefore, this quantity can be estimated using a similar strategy as in earlier sections, by training a classifier to distinguish data following p(x, y) from data following p(x) p(y) using the MLC loss, as sketched below. The value of the loss at the minimum will be an estimate of the mutual information. A simple example of this will be studied in Sec. III D.
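The following self-contained sketch illustrates this strategy on a Gaussian toy, anticipating the empirical study of Sec. III D; the value of ε, the sample size, and the architecture are our own illustrative choices:

```python
import numpy as np
import tensorflow as tf

# Train a classifier with the MLC loss to separate the joint density
# p(x, y) from the product of marginals p(x) p(y); the negative loss at
# the minimum estimates I(X, Y) of Eq. (30).
eps, n = 0.5, 50000
y = np.random.randint(0, 2, n)
x = np.random.normal(np.where(y == 1, eps, -eps), 1.0)
X_joint = np.stack([x, y], axis=1).astype("float32")
X_marg = np.stack([x, np.random.permutation(y)], axis=1).astype("float32")

X = np.concatenate([X_joint, X_marg])
labels = np.concatenate([np.ones(n), np.zeros(n)])

def mlc_loss(t, out):  # Eq. (7) with log f = network output
    t = tf.cast(t, out.dtype)
    return -tf.reduce_mean(t * out + (1.0 - t) * (1.0 - tf.exp(out)))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=mlc_loss)
model.fit(X, labels, epochs=10, batch_size=512, verbose=0)

# -min L = E_joint[log f] + E_marg[1 - f] estimates I(X, Y) in nats.
log_f = model.predict(X_joint, verbose=0).flatten()
f_marg = np.exp(model.predict(X_marg, verbose=0).flatten())
print(np.mean(log_f) + np.mean(1.0 - f_marg))
```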
III. EMPIRICAL STUDIES

We now present empirical studies comparing per-instance and per-ensemble data analysis strategies to highlight the points made in Sec. II. Our analyses are based on three case studies: a simple two-Gaussian example, searching for dijet resonances, and measuring the top quark mass.
A. Classifiers: Multi-Event from Single-Event
As argued in Sec. II B, under the IID assumption we can build multi-event classifiers from single-event classifiers. We now demonstrate how to construct the classifier f_{1→N} defined in Eq. (13), comparing its performance to f_N.
1. Two Gaussian Example
Our first case study involves one-dimensional Gaussian random variables. As shown in Fig. 1a, we consider two Gaussian distributions with different means (x̄ = ±ε) but the same variance (σ = 1). Here, the "signal" has positive mean while the "background" has negative mean, and we take ε = 0.1.

The per-instance (f_1) and per-ensemble (f_N) classifiers are parametrized by neural networks and implemented using Keras [46] with the Tensorflow backend [47] and optimized with Adam [48]. We use the binary cross entropy loss function, so Eq. (4) is needed to convert the classifier output to a likelihood ratio. Each classifier consists of two hidden layers with 128 nodes per layer. Rectified Linear Unit (ReLU) activation functions are used for the intermediate layers, while sigmoid activation is used for the last layer. The only difference between the per-instance and per-ensemble networks is that the input layer has one input for f_1 but N inputs for f_N. We train each network with 50,000 events to minimize the binary cross entropy loss, and we test the performance with an additional 50,000 events. For each network, we train for two epochs with a batch size of 128. We take N = 10 for the per-ensemble network, which means that in each batch, the training involves 1280 events, and therefore fewer batches per epoch than in the per-instance case. We did not do any detailed hyperparameter optimization for these studies.

In Fig. 1b, we show the performance of the resulting classifiers f_1 and f_10. We checked that the f_1 classifier parametrized by a neural network has essentially the same performance as the analytic function derived by taking the ratio of Gaussian probability densities, which means that the neural network f_1 is nearly optimal. As expected, the per-instance classifier f_1 has a worse receiver operating characteristic (ROC) curve than the per-ensemble classifier f_10. This is not a relevant comparison, however, because the two are solving different classification tasks (i.e. classifying individual events as coming from signal or background versus classifying an ensemble of N = 10 events as all coming from signal or background). With Eq. (13), we can use f_1 to build a 10-instance classifier f_{1→10}, whose ROC curve is nearly identical to f_10, if not even slightly better. Thus, as expected from Eq. (2), all of the information in the 10-instance classifier is contained in the per-instance classifier.

FIG. 1. Classification in the two Gaussian example. (a) A histogram of the Gaussian random variable X, for the "signal" (x̄ = 0.1) and "background" (x̄ = −0.1) processes. (b) ROC curves for the classifiers. Starting from the single-event classifier f_1, we can construct a multi-event classifier f_{1→10} that matches the performance of a classifier trained on 10 events simultaneously (f_10).
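A minimal Keras sketch of this setup (with our variable names), including the Eq. (13) combination at the end:

```python
import numpy as np
import tensorflow as tf

# Per-instance (f_1) and per-ensemble (f_10) networks as described
# above: two hidden layers of 128 ReLU nodes and a sigmoid output,
# differing only in the input dimension.
def make_classifier(n_inputs):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(n_inputs,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

eps, N, n_train = 0.1, 10, 50000
y = np.random.randint(0, 2, n_train)
x = np.random.normal(np.where(y == 1, eps, -eps), 1.0)

f1 = make_classifier(1)
f1.compile(optimizer="adam", loss="binary_crossentropy")
f1.fit(x[:, None], y, epochs=2, batch_size=128, verbose=0)

# Per-ensemble training: each example is N events of the same class.
yN = np.random.randint(0, 2, n_train // N)
xN = np.random.normal(np.where(yN == 1, eps, -eps)[:, None], 1.0,
                      size=(n_train // N, N))
f10 = make_classifier(N)
f10.compile(optimizer="adam", loss="binary_crossentropy")
f10.fit(xN, yN, epochs=2, batch_size=128, verbose=0)

# Eq. (13): combine per-event likelihood ratios from f_1, using Eq. (4)
# to convert the sigmoid output c into c / (1 - c).
c = f1.predict(xN.reshape(-1, 1), verbose=0).reshape(-1, N)
f_1_to_10 = np.prod(c / (1.0 - c), axis=1)
```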
2. Dijet Resonance Search
We now consider an example from collider physics, motivated by a search for new particles in a dijet final state. The simulations used for this study were produced for the LHC Olympics 2020 community challenge [49]. The background process involves generic quantum chromodynamics (QCD) dijet events with a requirement of at least one jet with transverse momentum p_T > 1.2 TeV. The signal is a hypothetical W′ boson with mass m_W′ = 3.5 TeV that decays via W′ → XY to two hypothetical particles X and Y of masses 500 GeV and 100 GeV, respectively. Each of the X and Y particles decays promptly into pairs of quarks. Due to the mass hierarchy between the W′ boson and its decay products, the final state is characterized by two large-radius jets with two-prong substructure. The background and signal are generated using Pythia [50, 51], with detector simulation provided by Delphes [52–54]. Jets are clustered with FastJet [55, 56] using the anti-k_t algorithm [57] with R = 1.0, and events are required to satisfy m_JJ ∈ [3.3, 3.7] TeV.
7] GeV.Four features are used to train our classifiers: the in-variant mass of the lighter jet, the mass difference of theleading two jets, and the N -subjettiess ratios τ [58, 59]of the leading two jets. The observable τ quantifiesthe degree to which a jet is characterized by two subjetsor one subjet, with smaller values indicating two-prongsubstructure. The mass features are recorded in units ofTeV so that they are numerically O (1). Histograms ofthe four features for signal and background are shown inFigs. 2a and 2b. The signal jet masses are localized atthe X and Y masses and the τ observables are shiftedtowards lower values, indicating that the jets have two-prong substructure.We train a per-instance classifier ( f ) and a per-ensemble classifier ( f ) classifier using the same toolsas for the Gaussian example above, again using binarycross entropy for the loss function. Because signal andbackground are so well separated in this example, werestrict our attention to N = 3 to avoid saturating theperformance. Note that this is an artificially constructedclassification problem, since in a more realistic contextone would be trying to estimate the signal fraction inan event ensemble, not classify triplets of events as allcoming from signal or background. E v e n t s BSM, W YZ versus QCD m J signal m J signal m J background m J background (a) E v e n t s BSM, W YZ versus QCD
1] and Φ : E → R L are neuralnetworks that are simultaneously optimized. The networkΦ embeds single events x i into a L -dimensional latentspace. The sum operator in Eq. (31) guarantees that f set N is invariant under permutations x σ ( i ) for σ ∈ S N , thepermutation group acting on N elements. We use thedefault parameters from the PFN code, with L = 128, Φhaving of two hidden layers with 100 nodes each, and F having three hidden nodes with 100 nodes each.The performance of f set3 is shown in Fig. 2, which getsmuch closer to matching the performance of f → . Partof this improvement is due to enforcing the permutationsymmetry, though there is also a potential gain from thefact the PFN we used for f set3 has more trainable weightsthan the fully connected network for f sort3 . All of the f variants were considerably more difficult to train than f → , likely for the reason discussed in Sec. II C. Thus, wehave empirical evidence for the superiority of single-eventtraining for multi-event classification.
FIG. 3. Classification in the top quark mass example. (a) A histogram of m_{b1μν} for top quark masses of 172.5 GeV and 175 GeV. The "wgt." curve is explained later in Sec. III C 2, where we test the performance of a likelihood reweighting. (b) The difference in efficiency for the 172.5 GeV top quark mass sample (true positive) and the 175 GeV top quark mass sample (false positive) as a function of the true positive rate for various binary classifiers. Once again, a multi-event classifier (f_{1→20}) built from the single-event classifier (f_1) has the best performance. For the classifiers trained to process 20 events simultaneously, the deep sets/PFN approach (f_20^set) does better than sorting the inputs (f_20^sort).
3. Top Quark Mass Measurement
Our third and final example is motivated by the top quark mass measurement, as recently studied in Refs. [11, 17]. Extracting the top quark mass is really a regression problem, which we investigate in Sec. III C. Here, we consider a related classification task to distinguish two event samples generated with different top quark masses (172.5 GeV and 175 GeV). This is a realistic hypothesis testing task that requires full event ensemble information, though only per-instance training, as we will see.

We use the same dataset as Ref. [17]. Top quark pair production is generated using Pythia [50, 51], with detector simulation provided by Delphes [52–54]. In the decay t t̄ → b W⁺ b̄ W⁻, one of the W bosons is forced to decay to μ⁺ν while the other W boson decays hadronically. Each event is recorded as a variable-length set of objects, consisting of jets, muons, and neutrinos. At simulation level, the neutrino is replaced with the missing transverse momentum. Generator-level and simulation-level jets are clustered with the anti-k_t algorithm using R = 0.4. A jet is labeled as b-tagged if the highest-energy parton inside the nearest generator-level jet (ΔR < 0.5) is a b quark. Jets are required to have p_T > 20 GeV, and they can only be b-tagged if |η| < 2.5. Events are required to have at least two b-tagged jets and at least two additional non-b-tagged jets. The b-jet closest to the muon in rapidity-azimuth is labeled b_1. Of the remaining b-tagged jets, the highest-p_T one is labeled b_2. The two highest-p_T non-b-tagged jets are labeled j_1 and j_2, and typically come from the W boson. (Imposing the W mass constraint on j_1 and j_2 would yield lower efficiency, though without significantly impacting the results.) The four-momentum of the detector-level neutrino (ν) is determined by solving the quadratic equation for the W boson mass; if there is no solution, the mass is set to zero, while if there are two real solutions, the one with the smaller |p_z| is selected. Four observables are formed for performing the top quark mass extraction, given by the following invariant masses: m_{b1μν}, m_{b2μν}, m_{b1j1j2}, and m_{b2j1j2}. A histogram of m_{b1μν} is shown for illustration in Fig. 3a.

We use the same neural network architectures and training procedure as in the BSM example above, with 1.5 million events per fixed-mass sample. For the per-ensemble classifier, we take N = 20, though of course for a realistic hypothesis testing situation, N would be as large as the number of top quark events recorded in data. To capture the permutation invariance of the inputs, we construct f_20^set using the deep sets approach in Eq. (31). We also build a classifier f_{1→20} from the per-instance classifier f_1 using Eq. (13).

In Fig. 3b, we see that f_{1→20} and f_20^set have comparable performance, though f_{1→20} is noticeably better. Some of this improvement may be due to differences in the network architecture, but we suspect that most of the gain is due to the more efficient training in the per-instance case. We checked that very poor performance is obtained for a classifier f_20 lacking permutation invariance, with a ROC curve that wasn't much better than f_1 alone. Explicitly breaking the invariance by sorting the inputs based on m_{b1μν} does help a little, as indicated by the f_20^sort curve in Fig. 3b, but does not reach the set-based approach.

FIG. 4. Computational performance of single-event versus multi-event training. Shown is the efficiency for the 175 GeV sample (false positive) for a fixed 50% efficiency for the 172.5 GeV sample (true positive), plotted as a function of training epoch. Single-event training (f_{1→20}) outperforms multi-event training (f_20^set), where both methods go through the full data set per epoch.

Given the similar performance of f_{1→20} and f_20^set, it is interesting to examine which learning strategy is more computationally efficient. In Fig. 4, we compare the performance as a function of the training epoch, using the difference of the true and false positive rates at a fixed 50% signal efficiency. In each epoch, both f_{1→20} and f_20^set see the full ensemble of events, so this is an apples-to-apples comparison as far as data usage is concerned. In particular, we plot this information per epoch instead of per compute time to avoid differences due to the structure of the neural networks. (There is not an easy way to control for possible differences in the training time due to the differences in the network structures, since the underlying tasks are different.) The f_{1→20} classifier trains much faster, in agreement with the analysis in Sec. II C, even though the ultimate asymptotic performance would be the same for both classifiers.
Once again, we see better empirical behavior from f_{1→20} trained on one event at a time versus f_20^set trained on multiple events simultaneously.

B. Classifiers: Single-Event from Multi-Event
In general, one cannot take a multi-event classifier f_N and extract a single-event classifier f_1. It is, however, possible to construct a special f̃_N network such that one can interpret a subnetwork as a per-event classifier, as discussed in Sec. II B. When using the MLC loss function, we can use the functional form in Eq. (14), where f̃_N is a product of f_{N→1} terms. Training f̃_N, where the only trainable weights are contained in f_{N→1}, we can learn a single-event classifier f_{N→1} from multi-event samples.

For the binary cross entropy loss used in our case studies, where Eq. (4) is needed to convert the classifier to a likelihood ratio, we have to introduce a slightly different structure than Eq. (14). Let f_N^set be a permutation-invariant classifier, as defined in Eq. (31) using the deep sets/PFN strategy. Taking the latent space dimension to be L = 1, the Φ network can be interpreted as a single-event classifier. Because the Φ network outputs are pooled via summation, we can build an optimal multi-event classifier if Φ learns the logarithm of the likelihood ratio; cf. Eq. (2). With this insight, we can fix the F function to achieve the same asymptotic performance as a trainable F by setting:

    F(x⃗) = exp( ∑_{i=1}^N Φ(x_i) ) / [ 1 + exp( ∑_{i=1}^N Φ(x_i) ) ].   (32)

Using Eq. (4), one can check that this F is monotonically related to the ensemble likelihood ratio. Similarly, Φ will be monotonically related to the optimal f_1, which we call f_{N→1} for the remainder of this discussion.

FIG. 5. Revisiting the ROC curves for the two Gaussian example from Fig. 1b. The multi-event classifier f̃_10^set with the restricted functional form in Eq. (32) has the same performance as f_10 with no restrictions. Using f̃_10^set, we can construct a single-event classifier f_{10→1} with the same performance as f_1 trained directly.

This construction is demonstrated in Fig. 5 for the Gaussian example. We see that the deep sets architecture with the fixed form of Eq. (32) (f̃_10^set) has the same or better performance as the 10-instance fully-connected classifier with more network capacity (f_10). Similarly, the Φ function used as a single-event classifier (f_{10→1}) has nearly the same performance as an independently trained single-event classifier (f_1).

The same conclusion holds for the BSM classification task, shown in Fig. 6. The only difference between the set-based architectures f̃_3^set and f_3^set is that the former uses the fixed functional form in Eq. (32). The fact that they achieve nearly the same performance is ensured by the IID relation in Eq. (2). The per-instance f_{3→1} network extracted from f̃_3^set is not quite as powerful as the f_1 network trained independently on single events, as expected from the gradient issue discussed in Sec. II C.

FIG. 6. Revisiting the ROC curves for the dijet resonance search example in Fig. 2c. The set-based multi-event classifiers f̃_3^set and f_3^set have similar performance, but we can use the former to construct a single-event classifier f_{3→1}. This construction is not as effective as performing single-event training directly (f_1).

While we found no benefit to extracting a single-event classifier from a multi-event classifier, it is satisfying to see these IID-derived theoretical predictions borne out in these empirical examples.
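A sketch of this constrained architecture (our layer sizes; the studies above use the PFN code) follows:

```python
import tensorflow as tf

# Deep sets model around Eq. (32): a one-dimensional latent space and a
# fixed sigmoid readout, so that the trainable Phi is monotonically
# related to the per-event log likelihood ratio.
N, n_features = 10, 1

phi_net = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1),          # L = 1 latent space
])

ens_in = tf.keras.Input(shape=(N, n_features))
phi = phi_net(ens_in)                  # Phi(x_i), applied per event
s = tf.keras.layers.Lambda(lambda z: tf.reduce_sum(z, axis=(1, 2)))(phi)
out = tf.keras.layers.Activation("sigmoid")(s)   # the fixed F of Eq. (32)
f_tilde = tf.keras.Model(ens_in, out)
f_tilde.compile(optimizer="adam", loss="binary_crossentropy")

# After training on N-event ensembles, the shared Phi evaluated on a
# single event plays the role of the classifier f_{N->1}.
evt_in = tf.keras.Input(shape=(1, n_features))
f_N_to_1 = tf.keras.Model(evt_in, phi_net(evt_in))
```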
C. Comparison of Regression Strategies

We now consider the regression methods introduced in Sec. II D. For classification, the mapping between per-instance and per-ensemble information is relatively straightforward. For regression, though, per-ensemble regression is structurally dissimilar from per-instance regression because of the need to integrate over priors on the regression parameters. Nevertheless, we can perform per-ensemble regression by first mapping the problem to per-instance parametrized classification.

We compare three different regression strategies in our empirical studies. The first method is a maximum-likelihood analysis, using the form in Eq. (24) based on the single-event parametrized classifier in Eq. (23). The second method is per-instance direct regression, using the construction in Eqs. (28) and (29) based on the same classifier as above. The third method is per-ensemble direct regression, based on minimizing the mean squared error loss in Eq. (27).
1. Gaussian Mean Example
Our first regression study is based on the same one-dimensional Gaussian distributions as Sec. III A 1. The prior distribution for the Gaussian means is taken to be uniform over a symmetric interval around zero, with fixed σ = 1. A training dataset is created from 100 examples each from 10,000 values of the Gaussian mean, for a total of one million training data points. For the reference sample p(x|θ_0) needed to build the single-event parametrized classifier f(x, μ) in Eq. (23), we create a second dataset with one million examples drawn from a standard normal distribution (i.e. μ = 0). To implement the p(θ) term in the second line of Eq. (22), each example x_i from the reference dataset is assigned a random mean value picked from the variable-mean dataset.

We train a parametrized neural network to distinguish the variable-mean dataset from the reference dataset. This network takes as input two features: one component of x⃗ and the random mean value μ. The architecture consists of three hidden layers (the first and last with 64 nodes) and ReLU activation. The output layer has a single node and sigmoid activation. Binary cross entropy is used to train the classifier, and Eq. (4) is used to convert it to the likelihood ratio form f(x, μ). The model is trained for 20 epochs with a batch size of 10% of the training statistics.

The same learned function f(x, μ) is used for both the maximum likelihood analysis and per-instance direct regression. For the maximum-likelihood analysis, the optimization in Eq. (24) is performed over a fixed grid of 20 evenly spaced μ values spanning the prior range. For per-instance direct regression, f_N(x⃗, μ) in Eq. (29) is constructed by taking a product of f(x, μ) outputs over all 100 examples in a given ensemble data point x⃗. The integrals in Eqs. (28) and (29) are approximated by evaluating f_N(x⃗, μ) on the same grid of 20 evenly spaced μ values.

For the per-ensemble direct regression, we train a network g_N that takes as input 100 values (i.e. all of x⃗) and predicts a single mean value. This network has the same architecture as f(x, μ), except it directly takes x⃗ as input and has linear (instead of sigmoid) activation for the output layer, since the predicted mean can be both positive and negative. It is trained to minimize the mean squared error loss in Eq. (27).

In Fig. 7, we see that all three approaches give nearly the same results in terms of bias and variance. Strictly speaking, maximum likelihood and direct regression are different tasks, so their behavior could be different. Per-instance and per-ensemble direct regression are constructed to yield the same asymptotic behavior, but there will be differences due to, e.g., the finite approximations to the integrals.

FIG. 7. Comparison of regression methods with the Gaussian example, with the predicted value of the mean plotted against the true value of the mean. The regression involves analyzing 100 instances drawn from the same Gaussian distribution. Bands are the standard deviation of the predictions over 10,000 generated samples. The per-instance direct regression uses single-event training, yet achieves comparable performance to per-ensemble direct regression that processes 100 events simultaneously.

Note that maximum likelihood and per-instance direct regression only use neural networks that process per-instance inputs; information about the rest of the events is used only through the training procedure. Thus, we have empirical evidence that per-ensemble regression can be accomplished via per-instance training.
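For completeness, a sketch of the per-ensemble baseline just described; the prior range and layer widths are illustrative assumptions of this sketch:

```python
import numpy as np
import tensorflow as tf

# Per-ensemble direct regression: a network g_N maps all 100 instances
# directly to the mean, trained with the MSE loss of Eq. (27).
n_sets, n_per_set = 10000, 100
mu = np.random.uniform(-0.5, 0.5, n_sets).astype("float32")
X = np.random.normal(mu[:, None], 1.0,
                     size=(n_sets, n_per_set)).astype("float32")

g_N = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(n_per_set,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),   # linear output: the predicted mean, cf. Eq. (28)
])
g_N.compile(optimizer="adam", loss="mse")
g_N.fit(X, mu, epochs=20, batch_size=1000, verbose=0)
```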
2. Top Quark Mass Measurement
As a physics example of regression, we consider extracting the top quark mass. Here, the top quark mass is the regression target, and the setup is similar to the Gaussian example above. We use the same event generation as Sec. III A 3, but now with top quark mass parameters sampled uniformly at random in m_t ∈ [170, 180] GeV. The event selection efficiency depends on m_t, which we account for when assigning dummy mass values to the reference sample.

The parametrized classifier now takes five inputs: the four mass features from Sec. III A 3 (m_{b1μν}, m_{b2μν}, m_{b1j1j2}, and m_{b2j1j2}) plus the top quark mass used for event generation. The neural network has three hidden layers with 50 nodes per layer and ReLU activation, and a single-node output layer with sigmoid activation. We train 10 models and take the median as the classifier output, using Eq. (4) to convert it to the likelihood ratio f(x, m_t). Each model is trained for 200 epochs with early stopping with a patience of 20 epochs and a batch size of 1000. To test the fidelity of the training, we extract the estimated likelihood ratio of m_t = 175 GeV over m_t = 172.5 GeV, shown in Fig. 8a.

For the per-instance direct regression, we sample over m_t ∈ [170, 180] GeV to convert f(x, m_t) into an estimate of E[m_t | x⃗]. The integrals in Eqs. (28) and (29) are approximated by sampling 50 random top quark masses per set of 100, following the probability density from the training dataset. Because 40 events are insufficient to make a precision measurement of the top quark mass, we find a noticeable bias between the estimated and true top mass values, which is exacerbated by edge effects at the ends of the training range. For this reason, we do not show a direct analog to Fig. 7, though this bias could be overcome with much larger training datasets with many more than 100 examples per mass value.

For the per-ensemble direct regression, we use the deep sets approach in Eq. (31) to handle the permutation invariance of the inputs. This approach is also well suited to handle the large variation in the number of events in each set due to the event selection effect. We again use PFNs for our practical implementation, with the default PFN hyperparameters from the https://energyflow.network/ package, except that we use linear activation in the output layer and the mean squared error loss function. We found that it was important for the model accuracy to standardize both the inputs and outputs of the network. Note that this is a different per-ensemble direct regression setup than used in Ref. [11], which found excellent performance using linear regression on sorted inputs.

In Fig. 8b, we compare the output of per-ensemble direct regression to the output of per-instance direct regression. We find a very strong correlation between these two very different approaches to computing the same quantity E[m_t | x⃗]. The band in Fig. 8b is the standard deviation over data sets with a true mass in the same one of 100 bins evenly spaced between 170 and 180 GeV. A key advantage of the per-instance approach is that it does not need to be retrained if more events are acquired. By contrast, the per-ensemble approach is only valid for event samples that have the same sizes as were used during training.
FIG. 8. Regression in the top quark mass example. (a) An estimate of the log likelihood for samples generated with 172.5 and 175 GeV top quark masses. The vertical axis has been shifted such that the minimum value is at zero. Note that the axis represents the average log likelihood, which differs from the total log likelihood by a factor of N_events. (b) Correlation between the per-instance predicted mass and the per-ensemble predicted mass in the context of direct regression. The per-ensemble mass values are put in bins of 0.1 GeV width, and the bands represent the standard deviation of the per-instance mass values in each bin.

D. Beyond Regression Example
As remarked in Sec. II E, the ideas discussed above apply to learning tasks beyond just standard classification and regression. As one simple example to illustrate this, we consider the Gaussian classification task from Sec. III A 1 and compute the mutual information between the Gaussian feature and the label. This quantifies how much information is available in the feature for classification and can be directly compared with other features and other classification tasks.

For this illustration, events are generated from each of two Gaussian distributions with means ±|ε| for fixed ε. The mutual information is estimated using a per-instance classifier as described in Sec. II E and also computed analytically via Eq. (30). For the per-instance classifier, we use a neural network that processes two inputs (label and feature), has two hidden layers with ReLU activation, and has a single-node sigmoid output. The classification task is to distinguish the nominal dataset from one where the labels are assigned uniformly at random to the features. The value of the MLC loss yields an estimate of the mutual information.

The mutual information results are presented in Fig. 9 as a function of ε.
FIG. 9. Mutual information between a Gaussian feature and a label, where the "signal" (x̄ = ε) and "background" (x̄ = −ε) have opposite means. The estimate using the MLC loss approach shows good agreement with the exact analytic expression.

As expected, the neural network strategy yields an excellent approximation to the analytic calculation. Note that this strategy does not require any binning and naturally extends to high-dimensional data, since the core component is a neural network classifier. We leave an investigation of this approach in the particle physics context to future work.

IV. CONCLUSIONS
We have demonstrated a connection between classifiers trained on single events and those that process multiple events at the same time. One can take a generic single-event classifier and build an N-event classifier using simple arithmetic operations. Such classifiers tend to outperform generic N-event classifiers, since we can enforce the IID assumption in the learning task. This performance gap can be mostly recovered by deploying a classifier that respects the permutation invariance of the set of N events. We used the deep sets/PFN architecture [60, 61] for this purpose, but other set-based architectures such as graph neural networks [62, 63] would also be appropriate.

An amusing feature of the deep sets approach is that we can use it to reverse-engineer a single-event classifier from a multi-event classifier by restricting the latent space to be one-dimensional and fixing a static output function. Even after enforcing these additional structures, though, we found both theoretically and empirically that the loss function gradients are better behaved for single-event classifiers than for multi-event classifiers. Going beyond classification, we explained how various regression tasks can be phrased in terms of per-instance parametrized classification, yielding similar performance to per-ensemble direct regression. We also mentioned how to compute distances and divergences between probability densities without requiring explicit density estimation. These results hold for any data sample satisfying the IID property.

Ultimately, we did not find any formal or practical advantage for training a multi-event classifier instead of a single-event classifier, at least for the cases we studied. With a carefully selected multi-event architecture, one can achieve similar performance to a scaled-up per-event classifier, but the latter will typically train faster. For direct regression, the per-ensemble strategy might be conceptually simpler than the per-instance method, though the per-instance methods allow for a simpler treatment of variably-sized data sets. Note that there may be situations where a simplifying assumption (e.g. the linear regression model in Ref. [11]) could yield better per-ensemble behavior than indicated by our case studies. At minimum, we hope this paper has demystified aspects of per-ensemble learning and highlighted some interesting features of the MLC loss function.

Going beyond the IID assumption, the duality between per-instance classifiers and per-ensemble classifiers could have applications to problems with approximate independence. For example, flavor tagging algorithms have traditionally exploited the approximate independence of individual track features within a jet [64, 65]. Similarly, emissions in the Lund jet plane [66, 67] are approximately independent, with exact independence in the strongly ordered limit of QCD. In both contexts, the instances are particles (or particle-like features) and the ensemble is the jet. A potentially powerful training procedure for these situations might be to first train a per-particle classifier, then build a per-jet classifier using the constructions described in this paper, and finally let the network train further to learn interdependencies between the particles.

CODE AND DATA
The code for this paper can be found at https://github.com/bnachman/EnsembleLearning. The physics datasets are hosted on Zenodo at Ref. [68] for the top quark dataset and Ref. [69] for the BSM dataset.
ACKNOWLEDGMENTS
We thank Anders Andreassen, Patrick Komiske, and Eric Metodiev for discussions about the MLC loss. We thank Rikab Gambhir and Ian Convy for discussions about mutual information. We thank Adi Suresh for discussions about the regression task with the classifier loss. We thank Katherine Fraser, Yue Lai, Duff Neill, Bryan Ostdiek, Mateusz Ploskon, Felix Ringer, and Matthew Schwartz for useful comments on our manuscript. BN is supported by the U.S. Department of Energy (DOE), Office of Science under contract DE-AC02-05CH11231. JT is supported by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/), and by the U.S. DOE Office of High Energy Physics under grant number DE-SC0012567. BN would also like to thank NVIDIA for providing Volta GPUs for neural network training.
Appendix A: Deriving Maximum Likelihood Classifier Loss
Beyond just the practical value of learning the likelihood ratio, the MLC loss in Eq. (7) has a nice interpretation in terms of learning probability distributions.

Consider trying to learn a function f(x) that is a normalized probability distribution, up to a Jacobian factor j(x):

\[ \int dx\, j(x)\, f(x) = 1. \tag{A1} \]

We are given samples from a probability distribution q(x), and we want to learn f(x) such that

\[ f(x) \to \frac{q(x)}{j(x)}. \tag{A2} \]

In other words, we want to learn a function f(x) that reproduces the sampled distribution q(x) after including the Jacobian factor. This problem was studied in Ref. [33], albeit in a context where f(x) had a restricted functional form such that Eq. (A1) was automatically enforced.

One strategy to accomplish this is to minimize the cross entropy of f(x) with respect to q(x), since the smallest cross entropy is obtained when f(x) has the same information content as q(x). The associated loss functional is:

\[ L[f] = -\int dx\, q(x) \log f(x) - \lambda \left( 1 - \int dx\, j(x)\, f(x) \right), \tag{A3} \]

where the first term is the cross entropy and λ is a Lagrange multiplier to enforce the normalization condition in Eq. (A1). Taking the functional derivative of Eq. (A3) with respect to f(x) and setting it equal to zero, we find the extremum condition:

\[ -\frac{q(x)}{f(x)} + \lambda\, j(x) = 0. \tag{A4} \]

Multiplying both sides of this equation by f(x) and integrating over x to set the Lagrange multiplier, we find that Eq. (A4) is solved for

\[ \lambda = 1, \qquad f(x) = \frac{q(x)}{j(x)}, \tag{A5} \]

so f(x) learns the q(x)/j(x) ratio as desired.

In the special case that j(x) is itself a normalized probability distribution, we can substitute for the Lagrange multiplier and rewrite Eq. (A3) in the following form:

\[ L[f] = -\int dx\, \Big( q(x) \log f(x) + j(x)\big(1 - f(x)\big) \Big). \tag{A6} \]

Identifying q(x) = p(x|θ_A) and j(x) = p(x|θ_B), this is precisely the MLC loss in Eq. (7). Therefore, we have an intuitive understanding of the MLC loss as trying to maximize the (log) likelihood of f(x) with respect to p(x|θ_A), subject to the constraint that f(x) p(x|θ_B) is a proper probability distribution.
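As a concrete illustration of this interpretation, here is a minimal sketch of learning a likelihood ratio directly with the MLC loss on a two-Gaussian toy problem, in Keras/TensorFlow. The exponential parametrization C(f) = exp f is implemented via an exponential output activation; the network size, sample size, and Gaussian parameters are illustrative choices, not necessarily those used for Fig. 10.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

def mlc_loss(y_true, y_pred):
    """MLC loss of Eq. (A6): -E_q[log f] - E_j[1 - f],
    with y_true = 1 for samples from q = p(x|theta_A)
    and y_true = 0 for samples from j = p(x|theta_B).
    At the minimum, y_pred -> p(x|theta_A)/p(x|theta_B)."""
    f = tf.clip_by_value(y_pred, 1e-7, 1e7)  # keep log f finite
    return -y_true * tf.math.log(f) - (1.0 - y_true) * (1.0 - f)

# Two overlapping Gaussians (illustrative parameters)
n = 100_000
x_A = np.random.normal(+0.1, 1.0, size=(n, 1)).astype("float32")
x_B = np.random.normal(-0.1, 1.0, size=(n, 1)).astype("float32")
x = np.concatenate([x_A, x_B])
y = np.concatenate([np.ones((n, 1)), np.zeros((n, 1))]).astype("float32")

# The exponential output activation enforces f(x) > 0, so the network
# effectively learns log f(x) (the C(f) = exp f parametrization).
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(1,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="exponential"),
])
model.compile(optimizer="adam", loss=mlc_loss)
model.fit(x, y, epochs=10, batch_size=1024, validation_split=0.2)

# The output estimates the likelihood ratio; at x = 0 the exact value is 1.
print(model.predict(np.zeros((1, 1), dtype="float32")))
```

Unlike binary cross entropy, no post-hoc transformation of the network output is needed: the learned f(x) is itself the likelihood ratio estimate plotted in Fig. 10.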
In Fig. 10, we plot the learned likelihood ratio between the two Gaussian samples from Fig. 1a, comparing the performance of MLC against binary cross entropy and the exact analytic expression. In all cases, a network is trained with 100 epochs and early stopping with a patience of 10 epochs. We also compare the MLC loss against the C(f) = exp f variant discussed in footnote 1. We see that both the linear (i.e. C(f) = f) and exponential parametrizations perform similarly in the region with ample data. That said, the exponential parametrization has a more robust extrapolation towards the edges, yielding similar behavior to binary cross entropy. Note that the exponential parametrization of the MLC loss was used in Ref. [31].

FIG. 10. A demonstration of the MLC loss for learning the likelihood ratio directly, using the Gaussian example from Fig. 1a. The linear (lin) and exponential (exp) parametrizations perform similarly. Shown for comparison is the likelihood ratio computed using the binary cross entropy (BCE) loss, which requires the manipulation in Eq. (4).

REFERENCES

[1] A. J. Larkoski, I. Moult, and B. Nachman, Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning, Phys. Rept., 1 (2020), arXiv:1709.04464 [hep-ph].
[2] D. Guest, K. Cranmer, and D. Whiteson, Deep Learning and its Application to LHC Physics, Ann. Rev. Nucl. Part. Sci., 161 (2018), arXiv:1806.11484 [hep-ex].
[3] K. Albertsson et al., Machine Learning in High Energy Physics Community White Paper, (2018), arXiv:1807.02876 [physics.comp-ph].
[4] A. Radovic et al., Machine learning at the energy and intensity frontiers of particle physics, Nature, 41 (2018).
[5] D. Bourilkov, Machine and Deep Learning Applications in Particle Physics, Int. J. Mod. Phys. A, 1930019 (2020), arXiv:1912.08245 [physics.data-an].
[6] HEP ML Community, A Living Review of Machine Learning for Particle Physics.
[7] Y. S. Lai, Automated Discovery of Jet Substructure Analyses, (2018), arXiv:1810.00835 [nucl-th].
[8] Y.-L. Du, K. Zhou, J. Steinheimer, L.-G. Pang, A. Motornenko, H.-S. Zong, X.-N. Wang, and H.
Stöcker, Identifying the nature of the QCD transition in relativistic collision of heavy nuclei with deep learning, Eur. Phys. J. C, 516 (2020), arXiv:1910.11530 [hep-ph].
[9] A. Mullin, H. Pacey, M. Parker, M. White, and S. Williams, Does SUSY have friends? A new approach for LHC event analysis, (2019), arXiv:1912.10625 [hep-ph].
[10] S. Chang, T.-K. Chen, and C.-W. Chiang, Distinguishing W′ Signals at Hadron Colliders Using Neural Networks, (2020), arXiv:2007.14586 [hep-ph].
[11] F. Flesher, K. Fraser, C. Hutchison, B. Ostdiek, and M. D. Schwartz, Parameter Inference from Event Ensembles and the Top-Quark Mass, (2020), arXiv:2011.04666 [hep-ph].
[12] M. Lazzarin, S. Alioli, and S. Carrazza, MCNNTUNES: tuning Shower Monte Carlo generators with machine learning, (2020), arXiv:2010.02213 [physics.comp-ph].
[13] Y. S. Lai, D. Neill, M. Płoskoń, and F. Ringer, Explainable machine learning of the underlying physics of high-energy particle collisions, (2020), arXiv:2012.06582 [hep-ph].
[14] J. Neyman and E. S. Pearson, On the problem of the most efficient tests of statistical hypotheses, Phil. Trans. R. Soc. Lond. A, 289 (1933).
[15] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics (Springer New York Inc., New York, NY, USA, 2001).
[16] M. Sugiyama, T. Suzuki, and T. Kanamori,
Density Ratio Estimation in Machine Learning (Cambridge University Press, 2012).
[17] A. Andreassen, S. Hsu, B. Nachman, N. Suaysom, and A. Suresh, Parameter Estimation using Neural Networks in the Presence of Detector Effects, (2020), arXiv:2010.03569 [hep-ph].
[18] A. Andreassen and B. Nachman, Neural Networks for Full Phase-space Reweighting and Parameter Tuning, Phys. Rev. D, 091901(R) (2020), arXiv:1907.08209 [hep-ph].
[19] M. Stoye, J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer, Likelihood-free inference with an improved cross-entropy estimator, (2018), arXiv:1808.00973 [stat.ML].
[20] J. Hollingsworth and D. Whiteson, Resonance Searches with Machine Learned Likelihood Ratios, (2020), arXiv:2002.04699 [hep-ph].
[21] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez, Constraining Effective Field Theories with Machine Learning, Phys. Rev. Lett., 111801 (2018), arXiv:1805.00013 [hep-ph].
[22] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez, A Guide to Constraining Effective Field Theories with Machine Learning, Phys. Rev. D, 052004 (2018), arXiv:1805.00020 [hep-ph].
[23] J. Brehmer, F. Kling, I. Espejo, and K. Cranmer, MadMiner: Machine learning-based inference for particle physics, Comput. Softw. Big Sci., 3 (2020), arXiv:1907.10621 [hep-ph].
[24] J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer, Mining gold from implicit models to improve likelihood-free inference, Proc. Nat. Acad. Sci., 201915980 (2020), arXiv:1805.12244 [stat.ML].
[25] K. Cranmer, J. Pavez, and G. Louppe, Approximating Likelihood Ratios with Calibrated Discriminative Classifiers, (2015), arXiv:1506.02169 [stat.AP].
[26] C. Badiali, F. Di Bello, G. Frattari, E. Gross, V. Ippolito, M. Kado, and J. Shlomi, Efficiency Parameterization with Neural Networks, (2020), arXiv:2004.02665 [hep-ex].
[27] A. Andreassen, B. Nachman, and D. Shih, Simulation Assisted Likelihood-free Anomaly Detection, Phys. Rev. D, 095004 (2020), arXiv:2001.05001 [hep-ph].
[28] A. Andreassen, P. T. Komiske, E. M. Metodiev, B. Nachman, and J. Thaler, OmniFold: A Method to Simultaneously Unfold All Observables, Phys. Rev. Lett., 182001 (2020), arXiv:1911.09107 [hep-ph].
[29] M. Erdmann, B. Fischer, D. Noll, Y. Rath, M. Rieger, and D. Schmidt, Adversarial Neural Network-based data-simulation corrections for jet-tagging at CMS, in Proc. 19th Int. Workshop on Adv. Comp., Anal. Techn. in Phys. Research, ACAT2019 (2019).
[30] X. Nguyen, M. J. Wainwright, and M. I. Jordan, On surrogate loss functions and f-divergences, arXiv Mathematics e-prints, math/0510521 (2005), arXiv:math/0510521 [math.ST].
[31] R. T. D'Agnolo and A. Wulzer, Learning New Physics from a Machine, Phys. Rev. D, 015014 (2019), arXiv:1806.02350 [hep-ph].
[32] R. T. D'Agnolo, G. Grosso, M. Pierini, A. Wulzer, and M. Zanetti, Learning Multivariate New Physics, (2019), arXiv:1912.12155 [hep-ph].
[33] A. Andreassen, I. Feige, C. Frye, and M. D. Schwartz, JUNIPR: a Framework for Unsupervised Machine Learning in Particle Physics, Eur. Phys. J. C, 102 (2019), arXiv:1804.09720 [hep-ph].
[34] J. Brehmer and K. Cranmer, Flows for simultaneous manifold learning and density estimation (2020), arXiv:2003.13913 [stat.ML].
[35] B. Nachman and D. Shih, Anomaly Detection with Density Estimation, Phys. Rev. D, 075042 (2020), arXiv:2001.04990 [hep-ph].
[36] P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, and D. Whiteson, Parameterized neural networks for high-energy physics, Eur. Phys. J. C, 235 (2016), arXiv:1601.07913 [hep-ex].
[37] S. Cheong, A. Cukierman, B. Nachman, M.
Safdari, and A. Schwartzman, Parametrizing the Detector Response with Neural Networks, JINST, P01030, arXiv:1910.03773 [physics.data-an].
[38] E. Fix and J. L. Hodges Jr., Discriminatory analysis - nonparametric discrimination: consistency properties, USAF School of Aviation Medicine, Project Number 21-49-004, Report Number 4 (1951).
[39] T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, 21 (1967).
[40] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems, Vol. 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014) pp. 2672–2680.
[41] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, Improved Techniques for Training GANs, arXiv e-prints, arXiv:1606.03498 (2016), arXiv:1606.03498 [cs.LG].
[42] D. P. Kingma and M. Welling, Auto-encoding variational bayes, in
ICLR, edited by Y. Bengio and Y. LeCun (2014).
[43] D. Rezende and S. Mohamed, Variational inference with normalizing flows, in
Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37, edited by F. Bach and D. Blei (PMLR, Lille, France, 2015) pp. 1530–1538.
[44] A. J. Larkoski, J. Thaler, and W. J. Waalewijn, Gaining (Mutual) Information about Quark/Gluon Discrimination, JHEP, 129, arXiv:1408.3122 [hep-ph].
[45] N. Carrara and J. Ernst, On the estimation of mutual information (2019), arXiv:1910.00365 [physics.data-an].
[46] F. Chollet, Keras, https://github.com/fchollet/keras (2017).
[47] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in OSDI, Vol. 16 (2016) pp. 265–283.
[48] D. Kingma and J. Ba, Adam: A method for stochastic optimization, (2014), arXiv:1412.6980 [cs].
[49] G. Kasieczka, B. Nachman, and D. Shih, R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge, 10.5281/zenodo.2629073 (2019), https://doi.org/10.5281/zenodo.2629073.
[50] T. Sjöstrand, S. Mrenna, and P. Z. Skands, PYTHIA 6.4 Physics and Manual, JHEP, 026, arXiv:hep-ph/0603175 [hep-ph].
[51] T. Sjöstrand, S. Mrenna, and P. Z. Skands, A Brief Introduction to PYTHIA 8.1, Comput. Phys. Commun., 852 (2008), arXiv:0710.3820 [hep-ph].
[52] J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi (DELPHES 3), DELPHES 3, A modular framework for fast simulation of a generic collider experiment, JHEP, 057, arXiv:1307.6346 [hep-ex].
[53] A. Mertens, New features in Delphes 3, Proceedings, 16th International workshop on Advanced Computing and Analysis Techniques in physics (ACAT 14): Prague, Czech Republic, September 1-5, 2014, J. Phys. Conf. Ser., 012045 (2015).
[54] M. Selvaggi, DELPHES 3: A modular framework for fast-simulation of generic collider experiments,
Proceedings, 15th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2013): Beijing, China, May 16-21, 2013, J. Phys. Conf. Ser., 012033 (2014).
[55] M. Cacciari, G. P. Salam, and G. Soyez, FastJet User Manual, Eur. Phys. J.
C72, 1896 (2012), arXiv:1111.6097 [hep-ph].
[56] M. Cacciari and G. P. Salam, Dispelling the N³ myth for the kt jet-finder, Phys. Lett. B641, 57 (2006), arXiv:hep-ph/0512210 [hep-ph].
[57] M. Cacciari, G. P. Salam, and G. Soyez, The anti-kt jet clustering algorithm, JHEP, 063, arXiv:0802.1189 [hep-ph].
[58] J. Thaler and K. Van Tilburg, Maximizing Boosted Top Identification by Minimizing N-subjettiness, JHEP, 093, arXiv:1108.2701 [hep-ph].
[59] J. Thaler and K. Van Tilburg, Identifying Boosted Objects with N-subjettiness, JHEP, 015, arXiv:1011.2268 [hep-ph].
[60] M. Zaheer, S. Kottur, S. Ravanbhakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola, Deep sets, in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17 (Curran Associates Inc., Red Hook, NY, USA, 2017) pp. 3394–3404.
[61] P. T. Komiske, E. M. Metodiev, and J. Thaler, Energy Flow Networks: Deep Sets for Particle Jets, JHEP, 121, arXiv:1810.05165 [hep-ph].
[62] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, The graph neural network model, Trans. Neur. Netw., 61–80 (2009).
[63] J. Shlomi, P. Battaglia, and J.-R. Vlimant, Graph Neural Networks in Particle Physics, 10.1088/2632-2153/abbf9a (2020), arXiv:2007.13681 [hep-ex].
[64] M. Aaboud et al. (ATLAS), Measurements of b-jet tagging efficiency with the ATLAS detector using tt̄ events at √s = 13 TeV, JHEP, 089, arXiv:1805.01845 [hep-ex].
[65] S. Chatrchyan et al. (CMS), Identification of b-Quark Jets with the CMS Experiment, JINST, P04013, arXiv:1211.4462 [hep-ex].
[66] B. Andersson, G. Gustafson, L. Lonnblad, and U. Pettersson, Coherence Effects in Deep Inelastic Scattering, Z. Phys. C, 625 (1989).
[67] F. A. Dreyer, G. P. Salam, and G. Soyez, The Lund Jet Plane, JHEP, 064, arXiv:1807.04758 [hep-ph].
[68] A. Andreassen, S.-C. Hsu, B. Nachman, N. Suaysom, and A. Suresh, Srgn: Pythia + Delphes pp → tt̄