A W^\pm polarization analyzer from Deep Neural Networks

Abstract

In this paper, we train a Convolutional Neural Network to classify longitudinally and transversely polarized hadronic W^\pm using the images of boosted W^{\pm} jets as input. The images capture angular and energy information from the jet constituents that is faithful to properties of the original quark/anti-quark W^{\pm} decay products without the need for invasive substructure cuts. We find that the difference between the polarizations is too subtle for the network to be used as an event-by-event tagger. However, given an ensemble of W^{\pm} events with unknown polarization, the average network output from that ensemble can be used to extract the longitudinal fraction f_L. We test the network on Standard Model pp \to W^{\pm}Z events and on pp \to W^{\pm}Z in the presence of dimension-6 operators that perturb the polarization composition.


Prepared for submission to JHEP

Taegyun Kim and Adam Martin
Department of Physics, University of Notre Dame, South Bend, IN 46556 USA

arXiv [hep-ph], Feb

1 Introduction

We are entering the precision LHC era. No light new particles have been seen to date, and while it is not impossible that the full run of the LHC will expose a new particle, we must consider the possibility that new physics is simply too heavy to produce substantially at the LHC. In this scenario, the search for new physics moves from obvious and direct (spectacular signals of on-shell particle production, such as resonant peaks or large missing energy signatures) to indirect and subtle, looking for deviations in distributions from the Standard Model (SM) prediction.

The polarization of massive gauge bosons is an interesting avenue to explore using the indirect approach. The transverse and longitudinal fractions vary depending on what process (e.g. single boson production versus diboson) and energy are considered, and are a detailed probe of the machinery of the SM.
Moreover, the longitudinal polarizations of the W^{\pm}/Z are especially sensitive to the mechanism of electroweak symmetry breaking, as perturbative unitarity in longitudinal boson scattering can be maintained only through a delicate balance of contributions [1, 2]. In scenarios where the Higgs properties deviate even slightly from the SM expectations, such as in composite Higgs scenarios [3–5], this balance breaks down and we expect dramatic signals.

If the scale of new physics is light, these signals usually take the form of resonances. However, if the scale of new physics is heavy, its effects can be captured by an effective Lagrangian, the SM augmented by a series of higher dimensional operators. The imprint of UV physics is left on the pattern of operators: the relative size and type of operator generated. Within the effective Lagrangian language, gauge bosons can appear either as field strengths F_{\mu\nu} or in the covariant derivatives of Higgs fields D_\mu H. The former are transversely polarized while the latter are (primarily) longitudinally polarized.

To disentangle the effects from the two types of operators, we need to differentiate polarizations. The polarization difference is clearest if one can reconstruct the W^{\pm}/Z and boost back to its rest frame, so current polarization studies have been restricted to leptonic final states [6, 7]. However, while leptonic final states are clean, they suffer from low branching ratios and ambiguities due to the presence of neutrinos.

The goal of this paper is to develop a polarization analyzer for hadronic W^{\pm} using machine learning tools. In order to avoid huge backgrounds and combinatorial issues, we focus on analyzing the polarization of boosted W^{\pm}. Boosted W^{\pm} have collimated decays, so they look like a single fat (\Delta R \sim ) jet at detector level.
As all the decay products are (theoretically) contained within the fat jet, this mitigates the headache of reconstructing the W^{\pm} and provides several useful handles, exploited through jet substructure techniques, for distinguishing the hadronic W from a QCD jet [8–12]. For our network input, we use images of the fat W jets (preprocessed and pixelized) rather than specific substructure variables. Then, using event samples where one polarization completely dominates for training, the resulting network is able to pick up on how the polarization of the boosted W is manifest in subtle image differences. For transverse W^{\pm}, we use W + jets as the training sample, while for longitudinal W^{\pm} we use a heavy Higgs H \to W^+ W^-.

Machine learning techniques have previously been applied to hadronic W in Ref. [13], focusing on extracting the polarization in semi-leptonic W produced in vector boson fusion, pp \to W(\ell\nu) W(jj) + jj, and showing promising results. Comparing our approach with theirs, we use jet images, while Ref. [13] used the four-vectors of the lepton and jets as the input to the network. More importantly, the simulation in Ref. [13] consisted of parton-level events smeared with detector efficiencies and without a genuine parton shower. As we will discuss in more detail below, the main difficulty with the hadronic W^{\pm} (in general, and for a polarization study in particular) comes from extra radiation, specifically in identifying the quarks the W^{\pm} decays to and their accompanying radiation, and weeding out extraneous radiation. As smeared parton-level events will never generate extra radiation, it is hard to extrapolate the results of Ref. [13] to a realistic collider environment.

The layout of the rest of this paper is as follows. In Sect. 2 we review the parton level observables, in the lab and W rest frames, that are sensitive to polarization information. Next, in Sect. 3, we describe our neural network structure, training samples, and performance.
Our main results are contained in Sec. 4, broken up into two subsections: i.) results for SM pp \to W^{\pm}Z production, Sec. 4.1, and ii.) pp \to W^{\pm}Z production in the presence of higher dimensional operators that alter the polarization fraction, Sec. 4.2. Section 5 contains our conclusions.

2 W^{\pm} polarization at parton level

In the rest frame of a W^{\pm} boson, the decay products from the longitudinal and transverse polarizations have different angular distributions. Taking the decay products to be massless,

    \frac{d\Gamma(W_L \to f f)}{d\cos\theta^*} \propto 1 - \cos^2\theta^*

    \frac{d\Gamma(W_T \to f f)}{d\cos\theta^*} \propto (1 \pm \cos\theta^*)^2

where \pm refers to the two possible transverse polarizations and the angle \theta^* is defined with respect to the direction of the W^{\pm}'s motion in the lab frame. Higher order corrections will disrupt this pattern, but the effect has been shown to be small [14]. The two distributions are shown below in Fig. 1. The \cos\theta^* information can also be captured in the lab frame as p_\theta \equiv \Delta E / p_V, where \Delta E is the energy difference between the decay products and p_V is the momentum of the W^{\pm}/Z [15].

Figure 1: Parton level angular distribution (probability density versus \cos\theta, with curves for W^+, W^- and W_L). Analogous relations can be derived for Z bosons. However, as the Z couples to both left and right-handed fermions, the relations are not as simple and the distributions not as distinct as in the W^{\pm} case.

For Z bosons, the leptonic modes allow clear access to the polarization information, though at the price of a small branching fraction. For leptonic W^{\pm} events where there is only one neutrino (i.e. W(\ell\nu) + jets or W(\ell\nu) W(jj)), one can attribute all missing energy in the event to the neutrino and solve for the longitudinal neutrino momentum by requiring that the 'neutrino' and the charged lepton reconstruct the W^{\pm}.
This method yields the full lab-frame neutrino four-vector, but it automatically introduces a two-fold ambiguity, as the W^{\pm} mass constraint is quadratic, and is subject to uncertainties from mis-measured missing energy and W^{\pm}s that are slightly off-shell.

For hadronically decaying W^{\pm}/Z, there is no clean solution due to the usual difficulties of jet physics: mis-measurement and the challenges of correctly filtering the W^{\pm}/Z-decay quark/anti-quark (and their associated radiation) from hadronic activity unrelated to the W^{\pm}/Z. Any uncertainty in extracting the momenta of the W^{\pm}/Z or their decay products mixes the polarizations and makes them harder to separate. For example, boosts along the true W^{\pm}/Z momentum don't mix polarizations, i.e. a transversely polarized W^{\pm} is left invariant by longitudinal (along the direction of motion) boosts, and the longitudinal polarization remains longitudinal, but this is not the case for boosts along other directions, such as the direction of an incorrectly reconstructed W^{\pm}/Z. Due to these complexities, polarization studies have focused primarily on leptonic W^{\pm}/Z.

One recent exception is Ref. [15], which studied the polarization of boosted hadronic W^{\pm} by using jet substructure techniques to extract p_\theta \equiv \Delta E / p_V. Specifically, by using the variable N-subjettiness [12], the fat W jet gets factored into smaller pieces, which can be used to select out clusters of energy to serve as proxies for the underlying quark and anti-quark. Bolstered by techniques to clear away extraneous QCD radiation [16], the subjets faithfully represent the partonic physics, and the authors demonstrate polarization discrimination in vector boson fusion and in the presence of a hypothetical new resonance that decays to W^+ W^-. The price for the substructure approach is additional cuts: on the mass of the W jet, the mass fraction of the subjets, and the 'subjettiness' variable itself.
These lead to a more accurate sample, but reduce the number of events and can potentially reintroduce interference among the different polarizations [17–20]. We would like to study the polarization differences using the same physics (the trace of the angular/energy correlations left in the W-jet substructure) but using a more inclusive, though arguably less transparent, method.

Before describing our method, it is useful to quantify how well one can possibly differentiate polarizations, e.g. as if we were able to work at parton level. For hadronic W^{\pm}, we cannot separate W^+ from W^-, so we must combine them. Let us introduce |\cos\theta_{cut}|, and classify all events with |\cos\theta^*| \le |\cos\theta_{cut}| as longitudinal. Varying \cos\theta_{cut}, we trace out a curve in efficiency vs. mistag rate. This curve is shown in Fig. 3, with the efficiency axis labeled "true positive rate" and the mistag rate labeled "false positive rate" to make the connection with our later network results easier. Tracing that curve, we see about longitudinal W can be successfully identified with a fake rate of , or success with a fake rate. These partonic, and therefore 'best case', efficiency/fake rates are fairly poor, especially when compared to rates from top/Higgs/V 'taggers' that differentiate between massive objects and QCD [21–23]. This is not surprising, given that we are aiming to distinguish between W^{\pm}s that have the same gross kinematic features (p_{T,W}, \eta_W) yet differ in polarization, so the angle between decay products is our only handle and the populations have non-negligible overlap near \cos\theta^* \sim .

3 From parton level to particle level: network setup and training
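As a reference point for the network results in this section, the parton-level benchmark of Sect. 2 (classifying events with |\cos\theta^*| \le |\cos\theta_{cut}| as longitudinal) can be evaluated in closed form. The sketch below is an idealization and not the authors' code: it assumes the tree-level shapes only, i.e. longitudinal \propto 1 - \cos^2\theta^* and, once both helicities are folded into |\cos\theta^*|, transverse \propto 1 + \cos^2\theta^*, with no kinematic cuts or detector effects. The function names are illustrative. Because of these assumptions the resulting area under the curve need not coincide with the partonic value quoted in Fig. 3.

```python
def longitudinal_efficiency(cos_cut):
    """Fraction of longitudinal W with |cos(theta*)| <= cos_cut (pdf ~ 1 - c^2)."""
    return 1.5 * cos_cut - 0.5 * cos_cut ** 3


def transverse_mistag(cos_cut):
    """Fraction of transverse W with |cos(theta*)| <= cos_cut (folded pdf ~ 1 + c^2)."""
    return 0.75 * cos_cut + 0.25 * cos_cut ** 3


def partonic_roc(n_points=2001):
    """Scan the cut from 0 to 1 and return (false positive, true positive) rates."""
    cuts = [i / (n_points - 1) for i in range(n_points)]
    fpr = [transverse_mistag(c) for c in cuts]
    tpr = [longitudinal_efficiency(c) for c in cuts]
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return sum(0.5 * (tpr[i] + tpr[i - 1]) * (fpr[i] - fpr[i - 1])
               for i in range(1, len(fpr)))


fpr, tpr = partonic_roc()
auc = area_under_curve(fpr, tpr)
print(f"idealized partonic AUC = {auc:.4f}")
```

Under these assumptions the area evaluates analytically to 11/16, which illustrates how little separation the angular distributions alone provide even before showering and detector effects.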

Moving from parton level to more realistic, detector level signals, we will attack this problem using jet images and deep neural networks (DNN). Deep neural networks have been shown to be a powerful tool for discriminating among different particles, such as quark vs. gluon jets [24, 25], W^{\pm} vs. QCD [22] or tops vs. QCD [21, 26], displaying superior performance over analyses using kinematic variables alone. Among different networks, we focus on convolutional neural networks (CNNs), which take boosted jet images as input and allow the network to pick up on minute angular and energetic correlations among jet constituents that are inherited from the initial partons. In the following sections, we describe the network construction, image preprocessing, and supervised training samples, then present visualizations of the trained network's performance and predictions.

To simulate boosted W^{\pm} bosons, we use MadGraph5 v2.6.5 and MadGraph5 v2.7.0 [27] at leading order and a center of mass energy of 13 TeV. (Footnote: The recent updates to MadGraph allow us to generate polarization-enforced events.) The parton level events are fed through PYTHIA [28, 29] to incorporate showering and hadronization, then through Delphes [30] to add detector effects. From the Delphes calorimeter output, we extract a list of all charged and neutral particle four-vectors in the event. (Footnote: Specifically, we use the EFlowTrack Delphes branch for charged particles, the EFlowNeutralHadron branch for neutral hadrons, and the EFlowPhoton branch for photons.) The list of four-vectors is clustered into 'fat' jets via FastJet [31, 32] using the anti-kT algorithm with R = 1. , minimum p_T = 100 GeV and max |\eta| = 2. . (Footnote: We use the package Pyjet [33] as a wrapper for FastJet.) These jets are pre-processed and pixelized, and the pixels are used as the inputs to our neural network. Preprocessing formats the jets, centering them and minimizing any angular anisotropy, so that extraneous features are not picked up by the network to distinguish between samples. We follow the preprocessing steps from Ref. [22]:

1.) Re-cluster the fat jet into subjets using the Cambridge/Aachen algorithm [34, 35] with \Delta R = 0. and minimum p_T = 1 GeV.
2.) Translate the jet constituents' (\eta, \phi) positions to put the leading (highest p_T) subjet at the origin.
3.) Rotate all jet constituents so that the sub-leading (second highest p_T) subjet is located below the origin.
4.) Reflect based on the number of subjets. For 2 subjets in a clustered jet, sum the p_T on the left and right sides of the image and place the higher p_T sum on the right hand side. For 3 or more subjets, reflect the jet image so that the third leading subjet is located on the right hand side of the image. Events with only one subjet are rejected.

(Footnote: In addition to centering and rotating, Ref. [22] also zooms, or rescales the p_T of the image constituents, so they can view jets across a wide range of p_T. As we focus on boosted W jets in a few, relatively small p_T windows, we do not perform this step.)

Next, the formatted jets are pixelized in \eta, \phi in a 20 × 20 grid, with each direction spanning − and around the center of the fat jet, keeping with the pixel size used in Ref. [22]. The value of each grid point is the accumulated p_T of the particles in that square.

Layer                            Input shape            Output shape
input_1: InputLayer              (None, 1, 20, 20)      (None, 1, 20, 20)
conv2d_3: Conv2D                 (None, 1, 20, 20)      (None, 20, 20, 20)
max_pooling2d_3: MaxPooling2D    (None, 20, 20, 20)     (None, 20, 10, 10)
dropout_4: Dropout               (None, 20, 10, 10)     (None, 20, 10, 10)
conv2d_4: Conv2D                 (None, 20, 10, 10)     (None, 40, 10, 10)
max_pooling2d_4: MaxPooling2D    (None, 40, 10, 10)     (None, 40, 5, 5)
dropout_5: Dropout               (None, 40, 5, 5)       (None, 40, 5, 5)
flatten_2: Flatten               (None, 40, 5, 5)       (None, 1000)
dense_5: Dense                   (None, 1000)           (None, 300)
dropout_6: Dropout               (None, 300)            (None, 300)
dense_6: Dense                   (None, 300)            (None, 100)
dense_7: Dense                   (None, 100)            (None, 100)
dense_8: Dense                   (None, 100)            (None, 1)

Figure 2: Visualization of CNN structure.

3.2 Neural Network structure and event information
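The layer shapes listed in Fig. 2 can be cross-checked with simple bookkeeping. The sketch below is not the authors' code: it assumes 'same' padding for the convolutions and 2 × 2 pooling (both inferred from the shapes in Fig. 2 rather than stated in the text), uses the dense widths 300/100/100 as listed in the figure, and all helper names are hypothetical.

```python
def conv2d_same(shape, filters, kernel):
    """Output shape and parameter count of a 'same'-padded 2D convolution."""
    channels, height, width = shape          # channels-first, as in Fig. 2
    params = filters * (kernel * kernel * channels + 1)  # weights + one bias per filter
    return (filters, height, width), params


def max_pool(shape, size=2):
    """Non-overlapping max pooling; it has no trainable parameters."""
    channels, height, width = shape
    return (channels, height // size, width // size), 0


def dense(n_inputs, units):
    """Output width and parameter count of a fully connected layer."""
    return units, units * (n_inputs + 1)


def trace_network():
    """Walk through the layer sequence of Fig. 2 and record each output shape."""
    shapes, total_params = [], 0
    shape = (1, 20, 20)                       # one-channel 20 x 20 jet image
    for filters in (20, 40):                  # two convolution blocks, kernel size 4
        shape, n = conv2d_same(shape, filters, kernel=4)
        total_params += n
        shapes.append(shape)
        shape, _ = max_pool(shape)            # dropout leaves the shape unchanged
        shapes.append(shape)
    width = shape[0] * shape[1] * shape[2]    # flatten
    shapes.append((width,))
    for units in (300, 100, 100, 1):          # dense stack; the last unit is the sigmoid output
        width, n = dense(width, units)
        total_params += n
        shapes.append((width,))
    return shapes, total_params


shapes, total_params = trace_network()
for s in shapes:
    print(s)
print("trainable parameters under these assumptions:", total_params)
```

The printed shapes reproduce the output column of Fig. 2 (ignoring the batch dimension and the shape-preserving dropout layers), which supports the reading that the convolutions preserve the image size and only the pooling layers shrink it.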

The pixelized 20 × 20 jet images form the first layer of our convolutional neural network (CNN). After the input layer, we follow the typical CNN structure example provided by Keras [36], a combination of convolutional and fully connected dense layers. Specifically, the input layer is followed by a 2-dimensional convolutional layer with 20 kernels of size 4. This layer is subject to Max pooling with Dropout, then fed into a second convolutional layer with 40 kernels of size 4. After again pooling with Dropout, the result is flattened and fed into 3 dense layers with 100 units each. Finally, the last dense layer is connected to the output layer of 1 unit. Throughout the network, we use the rectified linear unit (ReLU) function to introduce non-linearity, except for the output layer, which has a sigmoid activation function. With this architecture, the output sits in the range [0, 1] and can be interpreted as the probability that a given event comes from a longitudinal W. The structure of our network is illustrated in Fig. 2.

We arrived at this network architecture and set of hyperparameters by optimizing run time and performance on training samples (to be discussed shortly). In addition to the CNN, we explored how two networks from the literature performed. The networks we tested are MaxOut [22] and ResNet [37], with structures displayed in Figs. 5 and 6 respectively. MaxOut is a fully connected dense network with dedicated layers designed to mimic the filter features of a CNN and was built with the goal of differentiating W-jets from QCD, while ResNet (short for Residual Network) is an image based network that contains skip connections in an effort to avoid vanishing gradient issues; it is significantly more complicated than our CNN.
Comparing with MaxOut is a useful cross check with previous literature [38], while comparing with ResNet illustrates whether a more advanced network architecture is worth the added number of parameters. The results from these networks, along with more details of their architecture and how it differs from the CNN we use, are presented in Appendix A.

As our training (MC) samples, we want processes that have pure W^{\pm} polarization. For transverse W^{\pm} bosons, pp \to W^{\pm} + jets is an easy choice, while for longitudinal W^{\pm} we use pp \to H \to W^+ W^-, where H is a fictitious heavy Higgs boson with mass GeV. (Footnote: To generate this signal, we use the HEFT model included within MadGraph.) An alternative sample of longitudinal W^{\pm} that could be used more readily in a data-driven approach is associated production pp \to hW. For all training samples we lump hadronic W^+ and hadronic W^- events together, as they are experimentally indistinguishable.

We also break up the training into two p_T bins: p_T \in [200 GeV, 300 GeV], which we will refer to as the 'low-p_T' sample, and p_T \in [400 GeV, 500 GeV], the 'high-p_T' sample. These choices are motivated by the fact that W^{\pm} with p_T \le GeV are not boosted enough for their decay products to fall within \Delta R \le (our fat jet definition), while W^{\pm}s with p_T > GeV suffer from a low rate and tend to have such collimated decay products that both end up in the same subjet. Importantly, we do not impose any cuts other than p_T. This can be contrasted with Ref. [15], where additional substructure cuts, such as mass drop and N-subjettiness, must be applied to 'locate' the primary W decay products needed in p_\theta. These additional cuts have a signal efficiency of O(50%), thereby reducing the event sample size and ultimately feeding into the uncertainty.

As we create the training and validation samples, it is crucial to keep the number of events for each polarization the same to avoid unequal training.
As a result, we use 340k events for training and 85k for validation at lower p_T. At higher p_T, we use 236k for training and 59k for validation. When training the network, we stop if there is no significant improvement on the validation set for 10 iterations, within a maximum of 200 epochs. (Footnote: For other hyperparameter settings, we use the Keras callbacks EarlyStopping with patience = 15 and ReduceLROnPlateau with patience = 5.) The training/validation samples and their network output are summarized below in Table 1.

p_T range                  Training / validation events    Validation accuracy
200 GeV ≤ p_T ≤ 300 GeV    340k / 85k                      63%
400 GeV ≤ p_T ≤ 500 GeV    236k / 59k                      64%

Table 1: Summary of training samples and network validation accuracy. For both p_T bins, we use 20% of the sample for validation and 80% for training.

We plot the Receiver Operating Characteristic (ROC) curve based on the validation samples in order to visualize the network's performance. In Fig. 3, the partonic curve indicates the theoretical maximum calculated in the previous section, and we observe that our trained networks for both p_T samples nearly match the partonic version. While it is good to see that the network approaches the ideal/partonic curve, the true positive rates are not significantly larger than the corresponding false positive rates. As such, event-by-event tagging using our network is not particularly powerful. This result is seconded by the network's accuracy (63% and 64% in Table 1), defined as the correct classification probability when the threshold (the value between 0 and 1 at which we classify the event as longitudinal or transverse) is set to 0.5. Therefore, instead of treating the network output as a variable to cut on, event by event, to select a certain polarization population, we will keep all events and use the network output of the entire ensemble to extract the polarization fraction. (Footnote: For another example using ML event ensembles to extract information about model parameters, though with a DNN and engineered variables rather than a CNN and images, see Ref. [39].)

Footnote: Ref. [15] considered slightly different p_{T,W} regions for their analysis, p_{T,W} \sim − GeV, and it is possible the efficiencies for the additional substructure cuts carry some p_T-dependence.

Footnote: At this point we are assuming the same starting point as Ref. [15], a sample of pure Ws of unknown polarization. A more accurate comparison requires including non-W backgrounds. We will discuss the role of other backgrounds a little in Sect. 4, deferring a more complete study to later work.

Figure 3: Receiver Operating Characteristic comparison (true positive rate versus false positive rate). Curves: 200 < p_T (GeV) < 300 (AUC = 0.68); 400 < p_T (GeV) < 500 (AUC = 0.70); Partonic (AUC = 0.71). The partonic ROC curve is shown as a reference line against which to compare each trained network's performance. The more the curve bends toward the upper left corner, the better the performance of the network. Considering the angular distribution as the theoretical limit, our networks for both p_T bins show signs of reaching the limit.

The (area normalized) network outputs for the transverse (pp \to W^{\pm} + jets) and longitudinal (pp \to H \to W^+ W^-) validation samples are shown below in Fig. 4 for the two p_T regions. We can identify several features in the distributions: a true peak, a false peak and a central region. The true peak corresponds to when the network properly classifies a validation event, the false peak represents the mistagging of the network and tends to coincide with the true peak of true positive events, and the central region is populated by ambiguous outputs. Obviously, both the false peak and the central region dilute the performance. Comparing the two p_T regions, the network output in the higher p_T sample has a larger fluctuation in the central region.

Figure 4: The distribution of network outputs (number of events versus network output) for two different p_T bins, (a) 200 GeV ≤ p_T ≤ 300 GeV and (b) 400 GeV ≤ p_T ≤ 500 GeV, determined using our validation dataset. The distribution for the longitudinal sample (H \to W^+ W^-) is shown in red and peaks near 1, while the green line shows the transverse W sample distribution (from W + jet) and has more support towards 0. The distributions are unit normalized.

Knowing the network templates for the purely longitudinal and transverse samples, we interpolate between them to fit the network output from a signal whose polarization composition we'd like to find. Specifically, we interpret the network output as a probability distribution (D_i(x)) and set

    f_L \times D_L(x) + f_T \times D_T(x) = D_{unknown}(x)    (3.1)

Here f_L, f_T are the longitudinal and transverse fractions and D_L(x), D_T(x) are network distributions determined from the validation sets.
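Eq. (3.1) treats the unknown sample as a two-component mixture of the longitudinal and transverse templates. The toy check below shows that the mean of such a mixture is linear in f_L, so a single average is enough to recover the fraction. The five-bin templates are made up purely for illustration (they are not the distributions of Fig. 4), and the function names are hypothetical.

```python
def mean_output(template, centers):
    """Expectation value <x> of a unit-normalized binned template."""
    return sum(p * x for p, x in zip(template, centers))


def mix(template_l, template_t, f_l):
    """Eq. (3.1) with f_T = 1 - f_L, applied bin by bin."""
    return [f_l * dl + (1.0 - f_l) * dt for dl, dt in zip(template_l, template_t)]


def longitudinal_fraction(mean_unknown, mean_l, mean_t):
    """Invert the linear relation between the means to obtain f_L."""
    return (mean_unknown - mean_t) / (mean_l - mean_t)


# Hypothetical templates on five bins of network output in [0, 1].
centers = [0.1, 0.3, 0.5, 0.7, 0.9]
d_long = [0.05, 0.15, 0.20, 0.25, 0.35]    # more weight near 1
d_trans = [0.30, 0.25, 0.20, 0.15, 0.10]   # more weight near 0

d_unknown = mix(d_long, d_trans, f_l=0.25)
f_l = longitudinal_fraction(mean_output(d_unknown, centers),
                            mean_output(d_long, centers),
                            mean_output(d_trans, centers))
print(f"recovered f_L = {f_l:.3f}")
```

The input fraction of 0.25 is recovered exactly because the mean is a linear functional of the distribution; no binning choice or template fit enters the estimate.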
Setting f_T = 1 - f_L, multiplying by x and integrating, we find a relation between the expectation values of the validation distributions and the distribution with unknown polarization composition:

    f_L \langle x_L \rangle + (1 - f_L) \langle x_T \rangle = \langle x_{unknown} \rangle    (3.2)

Solving for f_L, we find:

    f_L = \frac{\langle x_{unknown} \rangle - \langle x_T \rangle}{\langle x_L \rangle - \langle x_T \rangle}    (3.3)

4 pp \to W^{\pm} Z

Our first test case is SM pp \to W^{\pm}(jj) Z(\ell\ell). This SM process is a good test candidate since it has a relatively high cross section and is not dominated by a single polarization (unlike, for example, pp \to W^{\pm} H, which is completely dominated by longitudinal Ws); pp \to W^{\pm}(jj) Z(\ell\ell) is also experimentally clean, as the presence of the leptonic Z will mitigate backgrounds from top quark production, a handle we don't have if looking at pp \to W^{\pm}(jj) W^{\mp}(\ell\nu). Using a newly introduced feature of MadGraph5 v2.7.0, we can specify the W/Z polarization when generating events. This lets us quickly check the truth-level polarization fraction for each set of cuts.

We generate 1M pp \to W^{\pm} Z testing events for the lower p_T bin (20k for the higher p_T bin), following the same preprocessing as for the training/validation samples. In situations where there are multiple jets passing the kinematic criteria, we select the jet whose \phi coordinate is closest to -\phi of the reconstructed Z. The testing events play the role of the sample with unknown polarization composition in the discussion above, and the size of the samples we generated is related to the number of expected events at the end of the HL-LHC era, as we will explain. Running these events through our network, then fitting the network output to a sum of the longitudinal and transverse templates, we find f_L. The results are quantified in Table 2, with the output average method showing good agreement with the ideal values.

p_T range                  \sigma(pp \to W^{\pm}(jj) Z(\ell\ell)) (fb)    truth \sigma_L/\sigma_{tot}    predicted f_L
200 GeV ≤ p_T ≤ 300 GeV    6.67                                          0.265                          0.259 ±
400 GeV ≤ p_T ≤ 500 GeV    0.35                                          0.304                          0.300 ±

Table 2: Longitudinal polarization fraction comparison between truth and using the network output average. The truth value of f_L is calculated using the cross section provided by MadGraph. The cross section shown in the second column includes branching fractions (\ell = e, \mu) and the acceptance for the p_T cuts for each row. For the parton level cuts and jet requirements we have assumed, the acceptance cut efficiency is 59% for the lower-p_T sample and 65% for the high-p_T sample. The uncertainty on the extracted f_L is determined using the bootstrap method explained in the text.

The network output values in Table 2 include uncertainty bands, which were estimated using the following approach:

• We assume that the statistical uncertainties on \langle x_T \rangle and \langle x_L \rangle are small, as they can be determined from large simulated datasets. Had we carried out a binned analysis rather than working with the network average, this assumption would have been hard to justify given our total training sample size of \sim few hundred thousand events.
• Assuming that the uncertainties are uncorrelated, propagation of uncertainty leads to

    \sigma_{f_L} = \left( \frac{\partial f_L}{\partial \langle x_{unknown} \rangle} \right) \sigma_{\langle x_{unknown} \rangle} = \left( \frac{1}{\langle x_L \rangle - \langle x_T \rangle} \right) \sigma_{\langle x_{unknown} \rangle}    (4.1)

• For a single set of testing data, we only get one number, the network average. To determine the uncertainty on the network average we can run pseudo-experiments (the 'bootstrapping' technique, in network terminology). Specifically, we randomly select subsets of the testing data that correspond to the number of signal events expected for a given luminosity, and calculate the network output for this subset. Iterating this procedure, we can use the distribution of results to define the uncertainty.

For this particular example, we select the size of the pp \to W^{\pm} Z dataset to correspond to the number of pp \to W^{\pm} Z events at the end of the HL-LHC run. Using the (LO) cross section from Table 2 and assuming L = 3 ab^{-1}, this corresponds to 20k events for p_T \in [200 GeV, 300 GeV] and 1k for p_T \in [400 GeV, 500 GeV]. Iterating 20 times, and plugging the extracted \sigma_{\langle x_{unknown} \rangle} into Eq. (4.1), we find the uncertainty on f_L quoted in the last column of Table 2. We find that \sigma_{\langle x_{unknown} \rangle} does not depend strongly on the number of iterations, provided the number is \gtrsim few. If we instead use batches corresponding to event sizes for fb^{-1} ( k events for p_T \in [200 GeV, 300 GeV], 100 events for p_T \in [400 GeV, 500 GeV]), the uncertainty on f_L increases to . (p_T \in [200 GeV, 300 GeV]) or 0.132 (for p_T \in [400 GeV, 500 GeV]); for event sizes corresponding to fb^{-1}, the uncertainty becomes . (0.190) for the low (high) p_T bins respectively.

Looking at Table 2, we see that the network prediction reproduces the truth value. Based on our pseudo-experiment test, the uncertainty on f_L for p_T \in [200 GeV, 300 GeV] is \sim . / \sqrt{N_{events}} for the number of events available with (roughly) the current LHC luminosity ( fb^{-1}), rising to . / \sqrt{N_{events}} for ab^{-1}.

Of course, the numbers quoted above assume we have been handed a sample of pure pp \to W(jj) Z(\ell\ell) events and therefore ignore the presence of other SM backgrounds. As our study here is simply a first step in hadronic W polarization analysis, we will stick with idealized 'WZ-only' events for the remaining examples. However, it is worthwhile to consider how other backgrounds will impact our story. For a pp \to W^{\pm}(jj) Z(\ell\ell) signal, the main worry is pp \to Z(\ell\ell) + jets. There has been a lot of recent progress in distinguishing massive vector bosons from QCD, both with substructure analysis and jet images [22]. The degree to which that background impacts our quantitative results depends on the W-tagging algorithm. As a back-of-the-envelope calculation, an additional cut to filter out QCD with efficiency \epsilon will inflate the uncertainty on our polarization fraction extraction by \sqrt{1/\epsilon}. This estimate ignores any biases the QCD-vs.-W cuts introduce, or pollution from mistags. As an example, the W^{\pm} tagger in Ref. [22] quotes a tagging efficiency of \epsilon \sim for a fake rate of , resulting in a \sim inflation in the uncertainties from decreased signal statistics alone. Further work combining polarization analysis into existing W^{\pm} vs. QCD algorithms and including all backgrounds would be interesting to pursue.

Having tested our method, we now explore how well our polarization analyzer performs at detecting the presence of higher dimensional operators.
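The pseudo-experiment estimate behind the uncertainties quoted for the SM sample, which is reused for the operator study that follows, can be sketched as below. This is a minimal illustration and not the authors' code: the network outputs are replaced by random toy numbers, the template means are invented, and subsets are drawn without replacement from the large test sample, which is one possible reading of "randomly select subsets".

```python
import random
import statistics


def bootstrap_sigma_mean(outputs, n_events, n_iter=20, seed=0):
    """Spread of the average network output over random subsets of size n_events."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.sample(outputs, n_events)) for _ in range(n_iter)]
    return statistics.stdev(means)


def sigma_f_l(sigma_mean, mean_l, mean_t):
    """Eq. (4.1): propagate the uncertainty on <x_unknown> to f_L."""
    return sigma_mean / abs(mean_l - mean_t)


# Toy stand-in for the network outputs of a large test sample (values in [0, 1]).
toy_rng = random.Random(1)
test_outputs = [toy_rng.betavariate(2.0, 2.0) for _ in range(50_000)]

mean_l, mean_t = 0.60, 0.45              # hypothetical template means
results = {}
for n_events in (1_000, 20_000):         # event counts expected for a given luminosity
    s_mean = bootstrap_sigma_mean(test_outputs, n_events)
    results[n_events] = sigma_f_l(s_mean, mean_l, mean_t)
    print(n_events, round(results[n_events], 4))
```

The smaller subsets give the larger spread, reflecting the roughly 1/\sqrt{N_{events}} scaling discussed in the text; the absolute numbers depend entirely on the toy inputs and carry no physical meaning.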
Different operators contribute to different W^{\pm}/Z gauge boson polarizations, therefore including them in processes involving electroweak gauge boson production can potentially change the ratio of transverse to longitudinal bosons.

There are several reasons to study this example. First, it is insensitive to the UV setup, as it can be applied to any scenario one can map into the SMEFT framework. This can be contrasted with a test that assumes a particular UV content, e.g. a resonance. Second, while measuring the cross section is an obvious way to look for the presence of higher dimension operators, it's possible for new physics to have negligible impact on the cross section, either because coefficients are small or because different effects conspire and cancel. In these cases, analyzing the polarization provides another handle and can potentially spot new physics or disentangle effects that the cross section is blind to.

We will focus on two particular higher dimensional operators that can impact the process pp \to W^{\pm} Z:

    L_{NP} = c_W O_W + c_{3W} O_{3W}    (4.2)

where c_{W,3W} are dimensionless Wilson coefficients and

    O_W = \frac{i g}{m_W^2} (H^\dagger \sigma^a \overleftrightarrow{D}^\mu H) D^\nu W^a_{\mu\nu}    (4.3)

    O_{3W} = \frac{i g^3}{m_W^2} \epsilon_{abc} W^{a\nu}_{\mu} W^{b}_{\nu\rho} W^{c\rho\mu},    (4.4)

following the convention of Ref. [40]. The factor of m_W^2 in the denominator is a bit unconventional; however, since any measurement will only reveal information on the ratio of the Wilson coefficient c_i to the scale suppressing the operator, we can always translate this normalization to any other suppression scale \Lambda. In our simulations, the vertices contained in O_W, O_{3W} are allowed to enter a given amplitude/diagram once. As such, the cross section is a quadratic function of the Wilson coefficients c_W, c_{3W}.

Footnote (on backgrounds to the pp \to W^{\pm}(jj) Z(\ell\ell) signal): Other backgrounds are present, such as fully leptonic \bar{t}t production and ZZ/Z\gamma; however, they are smaller. Leptonic \bar{t}t can be suppressed by the requirement of an on-shell leptonic Z, while ZZ/Z\gamma have small production rates.
The linear term represents the interference between the SM and the higher dimensional operators, while the quadratic term contains the square of the new physics amplitudes. Finally, as we are picking a subset of dimension-6 operators, this study should be viewed as a straw man to illustrate a technique rather than a genuine SMEFT analysis, as the latter requires working with a complete basis and a more consistent treatment of quadratic EFT effects.

From the field content of O_W and O_{3W}, we suspect that O_W will affect the production of longitudinal W, while O_{3W} only includes field strengths and can therefore only participate in transverse production. This thinking is backed up by Ref. [41], which analyzed diboson production in the presence of certain dimension-6 and -8 operators.

As a first test we turn on one operator at a time using Wilson coefficient value − for c_W and × − for c_{3W}. Rescaling to operators suppressed by Λ = (1 TeV) and with no explicit factors of g, this choice corresponds to an overall coefficient of . for (H^† σ^a ↔D^µ H) D^ν W^a_{µν} and . for ε_{abc} W^{aν}_µ W^b_{νρ} W^{cρµ}. For Monte Carlo purposes, we use the UFO implementation of O_W, O_{3W} from Ref. [40]. The network f_L output using the method of Sect. 3 for each operator choice and p_T bin is shown below in Table 3, along with the cross sections.

Comparing our value of f_L with the truth, we see that the network average performs well. As expected, O_W impacts the longitudinal fraction, while O_{3W} impacts the transverse fraction.
The sign of the impact depends on the sign of the Wilson coefficient and the relative size of the linear (in c_W, c_{3W}) and quadratic contributions to the cross section.

Using the uncertainties derived from pseudo-experiments with sample sizes corresponding to ab^{-1} of luminosity, we can take the ratio of the deviation in the polarization fraction (network f_L value in the presence of the higher dimensional operator minus the SM value) to δf_L as a rough measure of the discriminating power. We find this ratio is: 3.2 for O_W, low-p_T; 4.1 for O_W, high-p_T; for O_{3W}, low-p_T; and . for O_{3W}, high-p_T. The ratio is higher for O_W, despite the fact that O_{3W} has a larger effect on the total cross section (for this benchmark point). If we divide the difference in total cross section (with operators versus SM) by /√N_events, a proxy for the uncertainty on the cross section, we find much larger numbers, O(10). Therefore, at least for the benchmark values in Table 3, the total cross section is a more powerful measurement for detecting the presence of these operators. This is not surprising, as the polarization fraction is a more refined quantity. However, the polarization fraction can provide insight into what type of operator is responsible for any observed change in cross section. For example, the difference in f_L values in the presence of O_W versus O_{3W} for GeV ≤ p_T ≤ GeV in Table 3, which have similar impact on the cross section, is O(4) times the full luminosity HL-LHC uncertainty on f_L. As with the SM study in Sec. 4.1, these numbers neglect the impact from processes other than pp → W(jj)Z(ℓℓ).

Footnote: The impact of higher dimensional operators on the polarization breakdown can be found by studying how various W^±_λ Z_λ′ (λ, λ′ = T, L) subprocesses depend on the scales in the problem and identifying contributions that grow with the energy of the process. Following Ref. [41], the W_L Z_L cross section contributions involving c_W (both linear and quadratic) grow with energy, while c_{3W} does not contribute; for W_T Z_T all effects involving c_W are suppressed, the linear c_{3W} term is constant, and the c_{3W}^2 contribution grows with energy.

Operator   p_T range          σ(pp → W^±Z) (fb)   truth σ_L/σ_tot   predicted f_L
O_W        GeV ≤ p_T ≤ GeV    6.93                0.311             0.297 ±
O_W        GeV ≤ p_T ≤ GeV    0.42                0.439             0.391 ±
O_{3W}     GeV ≤ p_T ≤ GeV    6.58                0.258             0.254 ±
O_{3W}     GeV ≤ p_T ≤ GeV    0.50                0.198             0.181 ±

Table 3: Truth level and network average longitudinal fraction results when one dimension-6 operator at a time is included. As in Table 2, the truth values were determined by restricting the W polarization at generator level in MadGraph, and the quoted cross sections include branching ratios and acceptance for kinematic cuts. Uncertainties in the predicted f_L are calculated using the method of Sec. 4.1 and are based on pseudo-experiments with sample sizes matching the expected number of events for ab^{-1} of luminosity.

If we use smaller event samples to determine the uncertainty, corresponding to pseudo-experiments using smaller luminosity datasets, σ_{f_L} increases. As an example, we find σ_{f_L} = 0. for fb^{-1} of luminosity (σ_{f_L} = 0. for fb^{-1} of luminosity) in the low-p_T scenario for both O_W and O_{3W}. Propagating these larger uncertainties through, we find the difference in polarization fraction between samples with O_W and samples with O_{3W} (for the values in Table 3) is roughly ( . ) times σ_{f_L}.

As a second test, we explore a scenario where both c_W and c_{3W} are non-zero, but they have been tuned so that their net effect on the cross section is negligible. This test examines how well the polarization breakdown works as a way to detect the presence of new physics, given no hints of anything BSM from the cross section alone.
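The kind of cancellation behind this second test can be pictured with a toy model. The sketch below assumes only that the shift in the cross section is a quadratic function of two Wilson coefficients, called `c_a` and `c_b` here; the numerical pieces (`LIN_A`, `LIN_B`, `QUAD_A`, `QUAD_B`, `MIXED`) are made-up illustrative values, not simulation output. Given a choice of `c_a`, it solves for the `c_b` values that leave the total cross section unchanged.

```python
import math

# Hypothetical cross-section pieces in fb (illustrative only, not from the simulation).
LIN_A, LIN_B = 40.0, -12.0      # interference terms, linear in c_a and c_b
QUAD_A, QUAD_B = 900.0, 600.0   # squared new-physics terms
MIXED = 0.0                     # c_a * c_b cross term


def delta_sigma(c_a, c_b):
    """Shift of the cross section away from its SM value."""
    return (LIN_A * c_a + LIN_B * c_b
            + QUAD_A * c_a ** 2 + QUAD_B * c_b ** 2
            + MIXED * c_a * c_b)


def cancelling_c_b(c_a):
    """Return the real c_b values for which delta_sigma(c_a, c_b) = 0."""
    # Quadratic in c_b: QUAD_B * c_b**2 + (LIN_B + MIXED * c_a) * c_b + rest = 0
    a = QUAD_B
    b = LIN_B + MIXED * c_a
    c = LIN_A * c_a + QUAD_A * c_a ** 2
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []  # no real c_b can cancel this c_a
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])


solutions = cancelling_c_b(-0.002)
for c_b in solutions:
    print(f"c_b = {c_b:+.5f}, delta_sigma = {delta_sigma(-0.002, c_b):+.2e} fb")
```

Because the linear and quadratic pieces generally differ from one p_T bin to the next, a cancellation of this kind would have to be re-tuned bin by bin.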
We adjust the size of both coefficients for each p_T bin respectively: c_W = − . × − , c_{3W} = 5 . × − for p_T ∈ [200 GeV, GeV] and c_W = − × − , c_{3W} = 5 × − for p_T ∈ [400 GeV, GeV].

Yet again, we see that the network average reproduces the truth values; and while the cross section in the presence of O_W and O_{3W} matches the SM value by construction, the polarization fraction is clearly different. Plugging in numbers, the GeV ≤ p_T ≤ GeV polarization fraction differs from its SM value by O(5 σ_{f_L}) using the full luminosity HL-LHC uncertainty, or O(1. σ_{f_L}) using fb^{-1} values.

Footnote (to the comparison of the cross-section effects of O_W and O_{3W} at the Table 3 benchmark point): This is somewhat counterintuitive, given that O_W impacts the longitudinal fraction and amplitudes with longitudinal vector bosons tend to grow with energy. However, the pieces in the amplitude that are quadratic in c_W, c_{3W} grow with energy for all W polarizations. These quadratic pieces, and the fact that c_{3W} > c_W for this benchmark point, lead to greater cross section changes from O_{3W}.

p_T range                  σ(pp → W^±Z) (fb)   truth σ_L/σ_tot   predicted f_L
200 GeV ≤ p_T ≤ 300 GeV    6.68                0.202             . ± .
400 GeV ≤ p_T ≤ 400 GeV    0.34                0.285             . ± .

Table 4: Longitudinal fraction results, truth versus network average, in a scenario where both c_W and c_{3W} are nonzero. The coefficients have been chosen so there is essentially no impact on the pp → WZ cross section, which can be verified by comparing the second column here to the second column of Table 2. Uncertainties in the predicted f_L are calculated using the method of Sec. 4.1 and are based on pseudo-experiments with event sizes matching the expected number of events for ab^{-1} of luminosity.

In this paper, we have shown how a CNN can be used as a polarization analyzer for hadronic W^± bosons. The algorithm cannot distinguish between events accurately enough to be used as an event-by-event tagger, though this inability to perfectly separate polarizations is not a failure of the network and is present even at parton level. While event-by-event tagging is inefficient, we showed that a template analysis comparing the network average for an unknown sample to the average output of validation samples does accurately reveal the polarization composition. A benefit of the CNN method is that it keeps all events without reducing the discriminating power. This can be compared to substructure-based polarization analyzers, which introduce further cuts on top of the base kinematic selection of boosted W.
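The template comparison summarized above can be sketched in a few lines. The example below is an illustration under stated assumptions, not the analysis code: it assumes the mean network output of a mixed sample is the f_L-weighted average of the pure longitudinal and transverse template means, and it substitutes toy Gaussian scores (clipped to [0, 1]) for real network outputs. All function names, constants and sample sizes are hypothetical.

```python
import random
import statistics

CENTRE_LONG, CENTRE_TRANS, WIDTH = 0.55, 0.45, 0.2  # toy score shapes (assumed)


def extract_f_l(mean_unknown, mean_long, mean_trans):
    """Invert <y> = f_L * <y>_L + (1 - f_L) * <y>_T for f_L."""
    return (mean_unknown - mean_trans) / (mean_long - mean_trans)


def toy_scores(n, centre, rng):
    """Toy stand-in for network outputs: broad, overlapping, clipped to [0, 1]."""
    return [min(1.0, max(0.0, rng.gauss(centre, WIDTH))) for _ in range(n)]


def pseudo_experiments(f_l_true, n_events, n_trials, mean_long, mean_trans, rng):
    """Mean and spread of the extracted f_L over repeated toy samples."""
    estimates = []
    for _ in range(n_trials):
        # Each event is longitudinal with probability f_l_true.
        n_long = sum(rng.random() < f_l_true for _ in range(n_events))
        sample = (toy_scores(n_long, CENTRE_LONG, rng)
                  + toy_scores(n_events - n_long, CENTRE_TRANS, rng))
        estimates.append(
            extract_f_l(statistics.fmean(sample), mean_long, mean_trans))
    return statistics.fmean(estimates), statistics.stdev(estimates)


rng = random.Random(0)
# Template means from large pure samples (playing the role of validation samples).
mean_long = statistics.fmean(toy_scores(100_000, CENTRE_LONG, rng))
mean_trans = statistics.fmean(toy_scores(100_000, CENTRE_TRANS, rng))

f_l_mean, f_l_sigma = pseudo_experiments(0.3, 2_000, 200, mean_long, mean_trans, rng)
print(f"f_L = {f_l_mean:.3f} +/- {f_l_sigma:.3f}")
```

Even though the two toy score distributions overlap almost completely, so that no per-event cut would separate them, the ensemble average recovers the input fraction, with a spread that shrinks roughly as 1/√N_events.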
In keeping more events while maintaining discriminating power, the uncertainty on the extracted polarization is reduced, and fewer cuts means less concern about reintroducing interference between the different W polarizations.

We tested the network average method on pp → W^±(jj)Z(ℓℓ) production in the SM and in the presence of dimension-6 operators that impact different polarizations. In all cases, we find that the network average reproduces the truth level result and captures how the dimension-6 operator structure dictates how the boson polarizations are affected. Using pseudo-experiments to estimate the uncertainty on f_L, we find σ_{f_L} at the percent level assuming ab^{-1} of data, or ∼ for fb^{-1}, with the higher p_T samples having slightly larger uncertainties due to lower statistics. For the ab^{-1} estimates, these uncertainties are small compared to the deviations from the SM polarization fraction when the dimension-6 operator O_W is included with c_W = 0. , and comparable to the f_L deviations from including O_{3W}, c_{3W} = 0. . Furthermore, we find similar uncertainties in scenarios where c_W, c_{3W} have been chosen to cancel in the total cross section, a scenario where polarization analysis is the discovery tool for new physics. We obtain these results with training sizes of ∼ O(few 100K events); the training sample size could be enlarged in future work and may lead to better performance. Finally, the uncertainty estimates above are optimistic, as we have not considered the impact of cuts required to separate reducible SM backgrounds such as Z(ℓℓ) + jets, or pollution from those backgrounds. However, our analysis demonstrates the utility of network-based hadronic W^± polarization analyses.

Targets for future study include other processes, such as vector boson scattering, and hadronic Z-tagging possibilities. It would also be interesting to explore what information the CNN uses besides the cos θ* or p_θ variable, perhaps by an adversarial network, or to combine W^± vs. QCD differentiation and polarization analysis into a single network.

Acknowledgments

We thank Bryan Ostdiek for numerous helpful discussions. We also thank the Center for Research Computing (CRC) at Notre Dame for resources and continuous support. The work of AM was supported in part by the National Science Foundation under Grant Number PHY-1820860.

A Different Network example

In addition to the CNN, we tested the performance of two other networks, MaxOut and ResNet.

• In a MaxOut network, images are flattened into a vector of inputs, then fed through special 'MaxOut' layers that combine nearby inputs in several ways and output the maximum combination [38]. This process is designed to capture some proximity information on neighboring inputs, to inhibit the sparsity of hidden layer values, and to assist the dropout layer, as shown in [42]. For the problem at hand, we use a network with two sequential MaxOut layers, the first with 256 units and the second with 128. The second MaxOut layer is followed by fully connected dense layers with 64 and 25 units and ReLU activation, and a single output layer with a sigmoid activation function.

• ResNet networks are image based and are grouped into 'residual blocks'. Within each block, the input is processed by several convolution layers, then connected back to the original image. This 'bypassing' step was designed to minimize vanishing gradient issues, but comes at the price of increased complexity and thus more trainable parameters. After a number of blocks, the ResNet output is flattened and processed by dense layers. The ResNet structure we ended up with is shown below in Fig. 6.

For both networks, we supplemented the architecture with several Dropout layers. These were added, especially for ResNet, to avoid overtraining.

We train and test our MaxOut and ResNet networks in the same fashion as the CNN discussed in Sec. 3.3. Repeating the f_L and δf_L calculations on these comparison networks, we can compare results with the CNN. The extracted f_L values from the CNN, MaxOut and ResNet are presented in Tables 5 and 6. While the network outputs are different (Fig. 5 shows the MaxOut output for both p_T bins), all three networks perform similarly.
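To make the MaxOut description above concrete, here is a minimal sketch of a single MaxOut layer in plain Python. It follows the standard definition from [42], in which each output unit is the maximum over several affine maps of the input. The function name `maxout_layer`, the tiny layer sizes and the weights are hypothetical, chosen so that the result can be checked by hand; they are not taken from the network used here.

```python
def maxout_layer(x, weights, biases):
    """Maxout: output[j] = max over pieces p of (weights[j][p] . x + biases[j][p]).

    x        -- list of input floats (a flattened image in the setting above)
    weights  -- weights[j][p] is the weight vector of piece p for output unit j
    biases   -- biases[j][p] is the matching bias
    """
    outputs = []
    for unit_w, unit_b in zip(weights, biases):
        pieces = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
                  for w, b in zip(unit_w, unit_b)]
        outputs.append(max(pieces))
    return outputs


# Two inputs, two output units, two linear pieces per unit (hypothetical numbers).
x = [1.0, -2.0]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],    # unit 0: max(x0, x1)
    [[1.0, 1.0], [-1.0, -1.0]],  # unit 1: max(x0 + x1, -(x0 + x1)) = |x0 + x1|
]
biases = [[0.0, 0.0], [0.0, 0.0]]

result = maxout_layer(x, weights, biases)
print(result)  # [1.0, 1.0]
```

In the architecture described above, two such layers (256 and 128 units) would be stacked ahead of the dense layers; the number of linear pieces per unit is not stated in the text and is a free choice in this sketch.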
These results indicate that the network performance in predicting f_L has reached a saturation point, in the sense that additional network complexity does not yield more accurate results.

           truth    f_L CNN     f_L MaxOut   f_L ResNet
SM         0.265    ± 0.013     ± 0.011      ±
O_W                 ± 0.010     ± 0.010      ±
O_{3W}              ± 0.011     ± 0.012      ±

Table 5: f_L predictions at low p_T ∈ [200 GeV, GeV] for MaxOut, ResNet, and the CNN developed here, along with the truth value from MadGraph5 v2.7.0. The errors on f_L have been calculated using the method described in Sec. 4.1 and assuming ab^{-1} of data.

           truth    f_L CNN     f_L MaxOut   f_L ResNet
SM         0.304    ± 0.033     ± 0.026      ±
O_W        0.439    ± 0.033     ± 0.025      ±
O_{3W}     0.198    ± 0.043     ± 0.026      ±

Table 6: f_L predictions at high p_T ∈ [400 GeV, GeV] for MaxOut, ResNet, and the CNN developed here, along with the truth value from MadGraph5 v2.7.0. The errors on f_L have been calculated using the method described in Sec. 4.1 and assuming ab^{-1} of data.

[Figure 5: two panels showing the number of events versus neural network output (0.0 to 1.0) for the longitudinal and transverse samples.]

Figure 5: MaxOut distribution result for both p_T bins: left [200 GeV, GeV], and right [400 GeV, GeV].

Figure 6: For the ResNet structure, we stack several ResNet blocks with the network shown above. The output of the first block has the same dimension as the original image and the second block reduces the dimension. After the reduction, the convolved image is flattened and passed through a dense network to produce a single output.

References

[1] B. W. Lee, C. Quigg, and H. Thacker, “Weak Interactions at Very High-Energies: The Role of the Higgs Boson Mass,”

Phys. Rev. D (1977) 1519.
[2] M. S. Chanowitz and M. K. Gaillard, “The TeV Physics of Strongly Interacting W’s and Z’s,” Nucl. Phys. B (1985) 379–431.
[3] M. J. Dugan, H. Georgi, and D. B. Kaplan, “Anatomy of a Composite Higgs Model,” Nucl. Phys. B254 (1985) 299–326.
[4] K. Agashe, R. Contino, and A. Pomarol, “The Minimal composite Higgs model,” Nucl. Phys. B719 (2005) 165–187, arXiv:hep-ph/0412089 [hep-ph].
[5] G. F. Giudice, C. Grojean, A. Pomarol, and R. Rattazzi, “The Strongly-Interacting Light Higgs,” JHEP (2007) 045, arXiv:hep-ph/0703164 [hep-ph].
[6] CMS Collaboration, S. Chatrchyan et al., “Measurement of the Polarization of W Bosons with Large Transverse Momenta in W+Jets Events at the LHC,” Phys. Rev. Lett. (2011) 021802, arXiv:1104.3829 [hep-ex].
[7] ATLAS Collaboration, G. Aad et al., “Measurement of the W boson polarization in top quark decays with the ATLAS detector,” JHEP (2012) 088, arXiv:1205.2484 [hep-ex].
[8] J. Thaler and L.-T. Wang, “Strategies to Identify Boosted Tops,” JHEP (2008) 092, arXiv:0806.0023 [hep-ph].
[9] D. E. Kaplan, K. Rehermann, M. D. Schwartz, and B. Tweedie, “Top Tagging: A Method for Identifying Boosted Hadronically Decaying Top Quarks,” Phys. Rev. Lett. (2008) 142001, arXiv:0806.0848 [hep-ph].
[10] L. G. Almeida, S. J. Lee, G. Perez, G. F. Sterman, I. Sung, and J. Virzi, “Substructure of high-p_T Jets at the LHC,” Phys. Rev. D (2009) 074017, arXiv:0807.0234 [hep-ph].
[11] J. M. Butterworth, A. R. Davison, M. Rubin, and G. P. Salam, “Jet substructure as a new Higgs search channel at the LHC,” Phys. Rev. Lett. (2008) 242001, arXiv:0802.2470 [hep-ph].
[12] J. Thaler and K. Van Tilburg, “Identifying Boosted Objects with N-subjettiness,” JHEP (2011) 015, arXiv:1011.2268 [hep-ph].
[13] M. Grossi, J. Novak, D. Rebuzzi, and B. Kersevan, “Comparing Traditional and Deep-Learning Techniques of Kinematic Reconstruction for polarisation Discrimination in Vector Boson Scattering,” arXiv:2008.05316 [hep-ph].
[14] S. Groote, J. Korner, and P. Tuvike, “O(α_s) Corrections to the Decays of Polarized W^± and Z Bosons into Massive Quark Pairs,” Eur. Phys. J. C (2012) 2177, arXiv:1204.5295 [hep-ph].
[15] S. De, V. Rentala, and W. Shepherd, “Measuring the polarization of boosted, hadronic W bosons with jet substructure observables,” arXiv:2008.04318 [hep-ph].
[16] S. D. Ellis, C. K. Vermilion, and J. R. Walsh, “Techniques for improved heavy particle searches with jet substructure,” Phys. Rev. D (2009) 051501, arXiv:0903.5081 [hep-ph].
[17] A. Ballestrero, E. Maina, and G. Pelliccioli, “W boson polarization in vector boson scattering at the LHC,” JHEP (2018) 170, arXiv:1710.09339 [hep-ph].
[18] E. Mirkes and J. Ohnemus, “W and Z polarization effects in hadronic collisions,” Phys. Rev. D (1994) 5692–5703, arXiv:hep-ph/9406381.
[19] W. Stirling and E. Vryonidou, “Electroweak gauge boson polarisation at the LHC,” JHEP (2012) 124, arXiv:1204.6427 [hep-ph].
[20] A. Belyaev and D. Ross, “What Does the CMS Measurement of W-polarization Tell Us about the Underlying Theory of the Coupling of W-Bosons to Matter?,” JHEP (2013) 120, arXiv:1303.3297 [hep-ph].
[21] A. Butter, G. Kasieczka, T. Plehn, and M. Russell, “Deep-learned top tagging with a lorentz layer,” SciPost Physics (Sep, 2018). http://dx.doi.org/10.21468/SciPostPhys.5.3.028
[22] J. Barnard, E. N. Dawe, M. J. Dolan, and N. Rajcic, “Parton shower uncertainties in jet substructure analyses with deep neural networks,” Physical Review D (Jan, 2017). http://dx.doi.org/10.1103/PhysRevD.95.014018

[23] S. H. Lim and M. M. Nojiri, “Spectral analysis of jet substructure with neural networks: boosted higgs case,” Journal of High Energy Physics (Oct, 2018). http://dx.doi.org/10.1007/JHEP10(2018)181
[24] L. Lönnblad, C. Peterson, and T. Rögnvaldsson, “Finding gluon jets with a neural trigger,” Phys. Rev. Lett. (Sep, 1990) 1321–1324. https://link.aps.org/doi/10.1103/PhysRevLett.65.1321
[25] C. Peterson, T. Rögnvaldsson, and L. Lönnblad, “Jetnet 3.0 - a versatile artificial neural network package,” Computer Physics Communications (1994) no. 1, 185–220.
[26] S. Macaluso and D. Shih, “Pulling out all the tops with computer vision and deep learning,” Journal of High Energy Physics (Oct, 2018). http://dx.doi.org/10.1007/JHEP10(2018)121
[27] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H. S. Shao, T. Stelzer, P. Torrielli, and M. Zaro, “The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations,” JHEP (2014) 079, arXiv:1405.0301 [hep-ph].
[28] T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands, “An Introduction to PYTHIA 8.2,” Comput. Phys. Commun. (2015) 159–177, arXiv:1410.3012 [hep-ph].
[29] T. Sjostrand, S. Mrenna, and P. Z. Skands, “PYTHIA 6.4 Physics and Manual,” JHEP (2006) 026, arXiv:hep-ph/0603175 [hep-ph].
[30] DELPHES 3 Collaboration, J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi, “DELPHES 3, A modular framework for fast simulation of a generic collider experiment,” JHEP (2014) 057, arXiv:1307.6346 [hep-ex].
[31] M. Cacciari and G. P. Salam, “Dispelling the N^3 myth for the k_t jet-finder,” Phys. Lett. B (2006) 57–61, arXiv:hep-ph/0512210.
[32] M. Cacciari, G. P. Salam, and G. Soyez, “FastJet User Manual,” Eur. Phys. J. C (2012) 1896, arXiv:1111.6097 [hep-ph].
[33] “scikit-hep/pyjet: 1.6.0 (version 1.6.0).”
[34] Y. L. Dokshitzer, G. Leder, S. Moretti, and B. Webber, “Better jet clustering algorithms,” JHEP (1997) 001, arXiv:hep-ph/9707323.
[35] M. Wobisch and T. Wengler, “Hadronization corrections to jet cross-sections in deep inelastic scattering,” in Workshop on Monte Carlo Generators for HERA Physics (Plenary Starting Meeting), pp. 270–279. 4, 1998. arXiv:hep-ph/9907280.
[36] F. Chollet et al., “Keras.” https://github.com/fchollet/keras, 2015.
[37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2015.
[38] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, “Jet-images – deep learning edition,” Journal of High Energy Physics (June, 2016). https://doi.org/10.1007/JHEP07(2016)069

[39] F. Flesher, K. Fraser, C. Hutchison, B. Ostdiek, and M. D. Schwartz, “Parameter inference from event ensembles and the top-quark mass,” 2020.
[40] A. Alloul, B. Fuks, and V. Sanz, “Phenomenology of the Higgs Effective Lagrangian via FEYNRULES,” JHEP (2014) 110, arXiv:1310.5150 [hep-ph].
[41] D. Liu and L.-T. Wang, “Prospects for precision measurement of diboson processes in the semileptonic decay channel in future LHC runs,” Physical Review D (Mar, 2019). http://dx.doi.org/10.1103/PhysRevD.99.055001
[42] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” 2013.