SARM: Sparse Autoregressive Model for Scalable Generation of Sparse Images in Particle Physics
Yadong Lu,a Julian Collado,a Daniel Whiteson,b and Pierre Baldia
a Department of Computer Science, University of California, Irvine, CA, USA 92627
b Department of Physics and Astronomy, University of California, Irvine, CA, USA 92627
Corresponding author
Abstract:
Generation of simulated data is essential for data analysis in particle physics, but current Monte Carlo methods are very computationally expensive. Deep-learning-based generative models have successfully generated simulated data at lower cost, but struggle when the data are very sparse. We introduce a novel deep sparse autoregressive model (SARM) that explicitly learns the sparseness of the data with a tractable likelihood, making it more stable and interpretable than Generative Adversarial Networks (GANs) and other methods. In two case studies, we compare SARM to a GAN model and a non-sparse autoregressive model. As a quantitative measure of performance, we compute the Wasserstein distance (W_p) between the distributions of physical quantities calculated on the generated images and on the training images. In the first study, featuring images of jets in which more than 90% of the pixels are zero-valued, SARM produces images with W_p scores that are better than the scores obtained with other state-of-the-art generative models. In the second study, on calorimeter images in the vicinity of muons where more than 98% of the pixels are zero-valued, SARM again produces images with better W_p scores. Similar observations made with other metrics confirm the usefulness of SARM for sparse data in particle physics. Original data and software will be made available upon acceptance of the manuscript from the UCI Machine Learning in Physics web portal at: http://mlphysics.ics.uci.edu/.

1 Introduction
Experiments in particle physics seek to uncover the building blocks of matter and their interactions, which determine the structure of the Universe from subatomic to cosmic distances. Analyses of the data produced by these experiments make extensive use of simulations to predict the experimental signatures of particle interactions under various theoretical hypotheses. These simulations are used in likelihood-free inference as well as in the development of data selection and analysis strategies which optimize the statistical power of the data. Current state-of-the-art simulators apply Monte Carlo techniques to the microphysical processes governing individual particles' propagation and interaction [1], making them computationally expensive [2, 3].

Detectors in particle physics experiments have a multi-layer architecture which produces highly structured data. One essential layer, the calorimeter, measures the energy of passing particles and is subdivided into small cells to ensure spatial resolution. In collider experiments, the calorimeter is typically cylindrical [4], while in fixed-target experiments it may be a surface [5]. In both cases, the data can be represented as an image, allowing for the application of image-processing methods initially developed for natural images. However, in contrast to natural images, calorimeter images (figure 1) are very sparse: usually 90% or more of the pixel values are zero. In addition, these images are not as uniform as natural images, featuring clusters in the center and noise in the periphery.

Figure 1: Calorimeter images in particle physics are often very sparse, where most of their pixels have very small values.
Left: Typical signal image of a hadronic jet from [6].
Right: Typical signal image of the vicinity of a muon from [7].

Recently, deep generative models [8–10] have produced high-quality artificial natural images [11–13] at a relatively low computational cost. The successful application of machine learning in high energy physics [14–21], and of generative models to natural images, has inspired the use of these models for generating image-like data in physical sciences applications [6, 22–32], often employing Generative Adversarial Networks (GANs) [8] or, less frequently, Variational Auto-encoders (VAEs) [9]. However, the extreme sparsity of the images in particle physics and other areas of the physical sciences [33] presents unique challenges for generative models.

The leading applications of GAN-based generative models for sparse image synthesis in high-energy physics, LAGAN [6] and CALOGAN [34], make use of the ReLU activation function in the final layer to induce sparsity in the output image. The flat portion of the ReLU activation function can lead to many error gradients being zero at the output layer, creating challenges [35] for stochastic gradient descent [36, 37] methods. In addition, GANs are notoriously unstable during training [38] and can suffer from mode collapse, which restricts the diversity of events in the generated data [39, 40]. Despite these difficulties, GANs have been one of the most popular deep generative models in particle physics. However, other generative models may be better suited for sparse data. For example, deep autoregressive models (ARMs) have demonstrated impressive performance for generating natural images among likelihood-based generative models [10, 41]. In this paper, we develop sparse autoregressive models (SARMs), a class of ARMs specifically tuned to produce sparse images. We then evaluate SARMs on two benchmark datasets. Given their flexibility, SARMs may be applicable to areas beyond particle physics where sparse images must be generated.
An important statistical task in the analysis of particle physics data is identifying the particle source of a particular detector signature. Below, we describe two datasets: one which distinguishes between the detector signatures of single quarks and collimated pairs of quarks, and a second which distinguishes between muons produced in isolation and those produced as part of a shower of particles.
Quarks or gluons produced in collisions leave a particular detector signature: a jet, or shower of collimated particles, which deposit most of their energy in a tight core. In many applications, it is important to distinguish the signature of a single quark or gluon from that of a collimated pair of quarks, which may leave two potentially overlapping cores. This task is a natural setting for image-recognition algorithms and has been the focus of many deep learning studies [33, 42–45], which rely on simplified calorimeter simulations due to the cost of generating realistic samples. Thus, an inexpensive way to generate realistic datasets would be very valuable for producing classification training samples.

We use a set of benchmark jet images from [6], where quark pairs from W-boson decay are labeled as signal and single quark or gluon jets are labeled as background. The images are generated using PYTHIA 8.219 [46] simulations of proton collisions at a center-of-mass energy √s = 14 TeV, selecting jets with 250 < P_T < 300 GeV. Instead of a realistic detector simulation, the calorimeter response is mimicked via a regular 0.1 × 0.1 grid in η and φ coordinates. The intensity of each pixel represents the sum of the momentum transverse to the beam (P_T) over the particles which strike a particular cell. The jet images are constructed and preprocessed as described in [43], including the centering and rotation of the images. The resulting images are 25 × 25 pixels, with intensity values in the [0, 276] range. We divide them into a training set containing 400,000 signal images and 400,000 background images, and a testing set containing 36,000 signal images and 36,000 background images. A typical image from this dataset is shown in figure 1. This dataset has a high degree of sparseness: more than 90% of its pixels are zero-valued.

Muons leave a very clear detector signature which is difficult to mimic. However, physicists must distinguish between two modes of muon production: a rare mode in which muons are produced from the decay of a heavy boson and are isolated in the detector, and a second, prolific mode in which muons are produced inside a jet, surrounded by other particles. Fluctuations in the jet can occasionally produce apparently-isolated muons.

We use a set of benchmark calorimeter images from [7], where muons from heavy bosons are labeled as signal and muons produced within jets are labeled as background. The signal muons are generated with the process pp → Z′ → µ⁺µ⁻ for a fixed Z′ mass. Background muons are generated with the process pp → b b̄. Both signal and background datasets are generated at a center-of-mass energy √s = 13 TeV. The collisions and immediate decays are simulated with MADGRAPH5 [47], showering and hadronization with PYTHIA [46], and detector response with DELPHES [48]. Pile-up events are overlaid to simulate the presence of additional proton interactions, with an average of µ = 50 interactions per event. This dataset only considers muons with P_T of at least 10 GeV. The signal events are weighted to match the transverse muon momentum distribution of the background events. The calorimeter images in the vicinity of the muon are created from the calorimeter deposits within a fixed η–φ radius R around the muon and preprocessed by centering the image on the coordinates of the identified muon propagated to the calorimeter.
The images are pixelated using a 32 × 32 grid to roughly match the granularity of the calorimeters of ATLAS and CMS, and the pixels have values in the range [0, 172]. The training set contains 41,250 signal images and 41,246 background images, and the testing set contains 41,344 signal images and 41,151 background images. A typical image from this dataset is shown in figure 1. This dataset has an even greater level of sparsity: more than 98% of its pixels are zero-valued.

Autoregressive models (ARMs) approximate a high-dimensional data distribution P_data(x), where x ∈ R^D, with P(x), the distribution induced by the model. For example, when working with images, P_data(x) represents the distribution of the values of the D pixels in the image. ARMs are generative models that create outputs sequentially, where each new output is conditioned on the previous outputs [49]. Formally, ARMs transform the problem of learning the joint distribution P_data(x) into learning a sequence of tractable conditional distributions:

P(x) = ∏_{i=1}^{D} P(x_i | x_1, …, x_{i−1})    (3.1)
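The factorization in equation 3.1 maps directly onto a sequential generation loop. The following minimal sketch (illustrative Python, not the authors' code) shows the pattern; conditional_params is a hypothetical stand-in for a network that maps the previously generated pixels to the parameters of the next conditional distribution:

import torch

D = 16  # number of pixels in the flattened image (illustrative)

def conditional_params(prefix: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in: returns the mean of a unit-variance Gaussian
    # for the next pixel; a real ARM would use a masked network (see the
    # MADE structure in the Appendix).
    return prefix.mean() if prefix.numel() > 0 else torch.tensor(0.0)

x = torch.zeros(D)
for i in range(D):
    mu = conditional_params(x[:i])               # parameters from x_1..x_{i-1}
    x[i] = torch.normal(mu, torch.tensor(1.0))   # sample x_i ~ p(x_i | x_<i)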
In SARM, the likelihood function for the i-th pixel x_i is formulated as:

p(x_i | θ_i) = γ_i · δ_{x_i=0} + (1 − γ_i) · δ_{x_i≠0} · p(x_i | φ_i)    (4.1)

where the parameters θ_i = {γ_i, φ_i} are predicted by the underlying neural network taking x_1, …, x_{i−1} as inputs. Since the pixel values in calorimeter images represent the physical deposition of energy, they must be non-negative, i.e. p(x_i | φ_i) > 0 only when x_i > 0. To satisfy this constraint, we explore two options. First, we consider a discrete mass at zero with a discrete mixture of non-zero masses (D+D). Second, we employ a discrete mass at zero with a continuous non-zero distribution (D+C).
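Equation 4.1 yields a simple log-likelihood with a gate on zero-valued pixels. The sketch below is an illustration under our own naming, not the released SARM code; log_prob_nonzero stands in for the log density of whichever non-zero component (D+D or D+C) is used:

import torch

def sparse_log_likelihood(x, gamma, log_prob_nonzero):
    # gamma: predicted probability that a pixel is zero; log_prob_nonzero:
    # log p(x_i | phi_i) evaluated at the observed pixel values.
    eps = 1e-8  # guard against log(0)
    is_zero = (x == 0).float()
    return (is_zero * torch.log(gamma + eps)
            + (1.0 - is_zero) * (torch.log(1.0 - gamma + eps) + log_prob_nonzero))

# Example: three pixels, two of them zero-valued, 90% predicted sparsity.
x = torch.tensor([0.0, 2.5, 0.0])
gamma = torch.full((3,), 0.9)
log_p = torch.full((3,), -1.2)   # placeholder log densities
print(sparse_log_likelihood(x, gamma, log_p))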
Discrete Mixture Model (D+D): We discretize each pixel value x_i by rounding it to the nearest value in a pre-determined grid with points {0, g_1, …, g_N}, where g_j > 0 for j from 1 to N, and g_N corresponds to the largest pixel value after rounding. The model learns the probability of each discrete value as a categorical distribution:

p(x_i | θ_i) = γ_{i,0} · δ_{x_i=0} + ∑_{j=1}^{N} γ_{i,j} · δ_{x_i=g_j}    (4.2)

where each γ_{i,j} is predicted from the parameters θ_i = (θ_{i,0}, …, θ_{i,N}) using a softmax function. When the grid is uniform, this likelihood is the same as the discretized softmax likelihood used by Pixel RNN [10], which has achieved state-of-the-art results on benchmark datasets of natural images [54]. However, in particle physics the distribution of pixel values is typically far from uniform: in many typical cases there is a large number of pixels with small values and a few pixels with large values, as seen in figure 5a. To better represent the pixel distribution and minimize the quantization error, we assign more grid points to the region of low pixel values. We achieve this by using a power transformation x̂ = x^{1/p} on the pixel values, where p ≥ 1 is a hyperparameter.
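A minimal sketch of such a power-transformed grid follows; the values of N, the maximum pixel value, and p below are illustrative choices, not the settings used in the paper:

import numpy as np

def make_grid(max_val=276.0, N=64, p=2.0):
    # Points uniform in the transformed space x^(1/p) concentrate near zero
    # once mapped back to pixel space, reducing quantization error where
    # most non-zero pixels live.
    u = np.linspace(0.0, max_val ** (1.0 / p), N)
    return u ** p

def quantize(x, grid):
    # Round each pixel to the nearest grid point.
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)
    return grid[idx]

grid = make_grid()
print(quantize(np.array([0.0, 0.3, 5.0, 250.0]), grid))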
Discrete and Continuous Mixture Model (D+C): The pixel values of natural images are usually represented by unsigned integer values between 0 and 255. However, in particle physics images, the pixel values are typically real-valued. To avoid explicit rounding, SARM (D+C) is built with a truncated logistic distribution that models the non-zero component of each pixel. To generate the D+C mixture, we reparameterize each pixel as x_i = x̃_i · z_i, where x̃_i follows a truncated logistic distribution TL(µ_i, s_i) with mean µ_i and scale parameter s_i. Here z_i is a Bernoulli random variable with p(z_i = 0) = γ_i, which controls the sparsity level. By assuming independence of x̃_i and z_i, the likelihood function of x_i becomes:

p(x_i | θ_i) = γ_i · δ_{z_i=0} + (1 − γ_i) · δ_{z_i≠0} · p(x̃_i | µ_i, s_i)    (4.3)

where θ_i = {µ_i, s_i, γ_i} are functions of the previous pixel values x_1, …, x_{i−1}, to ensure the autoregressive structure. In order to allow for unconstrained optimization, we treat log(s_i) as the learning parameter and take its exponential in the likelihood equation 4.3. Since the pixel distribution can be multi-modal, we use a more flexible mixture of truncated logistic (MTL) distributions for x̃_i.

The mixture of truncated logistic likelihood differs from the discretized logistic mixture used in Pixel CNN++ [41] in the way it handles continuous pixel values. Pixel CNN++ requires discretizing x_i and then maximizing the probability mass on the discretized grid. In contrast, SARM can directly maximize the probability density function of x_i, allowing it to handle continuous pixel values without incurring quantization errors.

There are several differences between the D+D and the D+C models. The D+D model allows enough flexibility to represent multi-modal distributions, as each grid point has its own learnable probability mass. However, there is a price for this flexibility: it is significantly more time-consuming to generate an (N+1)-way softmax vector and sample from a discrete mixture (D+D) than it is to generate the parameters γ, µ, s and then sample from a discrete and continuous mixture (D+C). Other constrained-domain distributions, such as the exponential and the gamma distributions, were also considered but led to inferior results. The exponential distribution suffers from a lack of flexibility due to having only one learnable parameter.

Figure 4: Generation process for the D+C model. The blue circle dots represent the value sampled for each pixel. For example, given the first pixel value of 6.7, sampled from the empirical distribution of the dataset, the neural network outputs the distribution parameters γ, µ, s used to generate the second pixel. Then a Bernoulli random variable z_i is sampled, along with a logistic random variable x̃_i ∼ Logistic(µ_i, s_i). The value of the second pixel is the product of these two variables, x_i = z_i · x̃_i, which here equals zero. This sequential process is repeated until every pixel is generated.
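A sketch of one D+C sampling step as in figure 4, under the convention adopted above that γ_i is the probability of a zero pixel; the truncation of the logistic is handled here by simple rejection, which is our own simplification rather than the paper's exact scheme:

import torch

def sample_pixel(gamma, mu, s, max_val=276.0):
    z = torch.bernoulli(1.0 - gamma)   # z = 0 with probability gamma
    while True:
        u = torch.rand(())
        x_tilde = mu + s * torch.log(u / (1.0 - u))  # logistic via inverse CDF
        if 0.0 <= x_tilde <= max_val:                # reject outside [0, max_val]
            return z * x_tilde

torch.manual_seed(0)
print(sample_pixel(torch.tensor(0.9), torch.tensor(3.0), torch.tensor(1.0)))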
In many ARM applications, a single network is used to predict the parameters θ_i of the conditional probability distribution P(x_i | θ_i). This approach works well if the distribution of pixel values is similar across pixels, as is often the case in natural images. However, as shown in figure 5a, the pixel value distribution in the central square of a calorimeter image containing a jet is very different from the distribution in the rest of the image (see also [43]).

In order to handle these heterogeneous regions, we use a two-stage approach, stacking two distinct deep SARM modules, one for the center and one for the periphery. The model generates the image from the inside out: the outer module generates pixels conditioned on the outputs of the center module, as illustrated in figure 5b. We refer to this model as SARM-2, while the single-stage model is SARM-1. Since the center may not have a clear border, we treat the size of the center relative to the periphery as a hyperparameter during training. Note that in general the number of stages depends on the structure of the data and is not limited to two.

It is possible to learn the SARMs for each region in any order. However, in our experiments, we obtained better results by first learning and generating the pixels for the central region. In addition, we used a spiral ordering of the pixels when learning and generating pixels inside each stage.
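A minimal sketch of one such traversal, an outward counterclockwise spiral over an n × n image starting from the central pixel (an illustration of the ordering described above, not the authors' implementation):

def spiral_out_ccw(n):
    r = c = n // 2                       # start at the central pixel
    order = [(r, c)]
    moves = [(-1, 0), (0, -1), (1, 0), (0, 1)]  # up, left, down, right
    step, m = 1, 0
    while len(order) < n * n:
        for _ in range(2):               # each step length is used twice
            dr, dc = moves[m % 4]
            for _ in range(step):
                r, c = r + dr, c + dc
                if 0 <= r < n and 0 <= c < n:
                    order.append((r, c))
            m += 1
        step += 1
    return order

print(spiral_out_ccw(3))  # center first, then its ring: (1,1), (0,1), (0,0), ...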
Figure 5: (a) Distribution of pixel values in the jet substructure dataset for the 9 pixels in the center of the images (central region) and the rest of the pixels (peripheral region). Note that the majority of the pixels in the peripheral region are zero-valued and in general have lower variance than pixels in the central region. (b) Two-stage generation for the central and peripheral regions using a spiral path and two different SARM modules. Using different networks for each region improves performance.
The goal is to train generative models which create images indistinguishable from images created by the slower Monte Carlo methods. We compare the performance of our models, both in terms of image quality and generation time, against two other generative models: LAGAN [6], the current state-of-the-art generative model for sparse images in particle physics, and Pixel CNN++ [41], a widely used autoregressive model for natural images not tuned for sparse images. We evaluate all models on both datasets described above; note that LAGAN was designed to handle images typical of the jet substructure dataset, while the muon dataset features extreme sparsity in comparison. We measure the quality of the generated images both qualitatively and quantitatively.
Qualitative Evaluation:
We examine typical images generated by each model, as well as the pixel-wise average intensity of the generated images, comparing against the images produced by the Monte Carlo methods, which in the jet substructure study are referred to as the Pythia images. Additional qualitative comparisons are described in Appendices 8.3 and 8.4.
Quantitative Evaluation:
Comparisons of distributions in high-dimensional datasets should focus on the scientific context and potential applications. In particle physics, the calorimeter information is typically used to calculate physical quantities, such as invariant mass or transverse momentum (P_T), which are especially revealing as metrics because they have not been explicitly optimized by the models. In addition, calorimeter images are used to train classifiers which can identify particles from their patterns of depositions.

One-dimensional distributions of mass and P_T can be compared to the distributions from Monte Carlo generators using the Wasserstein distance, the minimum cost to transform one distribution into the other, expressed by:

W_p(P, Q) = ( inf_{J ∈ J(P,Q)} ∫ ‖x − y‖^p dJ(x, y) )^{1/p}    (5.1)

where J(P, Q) is the family of joint probability distributions with marginals P and Q, and p ≥ 1. When p = 1, this metric is also known as the Earth Mover's Distance [55]. To match the results in [6], we computed W_1(P, Q), where P represents one of the jet observable distributions from the Pythia images and Q represents the corresponding jet observable distribution from the generated images.

An important motivation for developing generative models for fast simulations is to provide a computationally inexpensive method to augment existing datasets in classification tasks [43, 56]. The jet substructure dataset was generated to train classifiers to distinguish between jets from W boson decays (signal) and those from single quarks and gluons, a well-known classification task [43, 56]. The muon isolation dataset was generated to train classifiers to distinguish isolated muons from those due to heavy-flavor jet production. Therefore, an essential test of the quality of the generated images is whether they can be used in these classification tasks. To quantify this, the generated images were used as training sets to develop a classifier whose performance was assessed using the Monte Carlo images. The same convolutional neural network architecture was trained with the same hyperparameters on five different datasets: Monte Carlo images, images generated by SARM-2 (D+C), images generated by SARM-2 (D+D), images generated by LAGAN, and images generated by Pixel CNN++. Because higher-quality images should lead to improved classification of the Monte Carlo images, we used the classification performance as the evaluation metric.
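For reference, a minimal sketch of how such a W_1 comparison can be computed with standard tools; the observable arrays below are hypothetical placeholders, whereas in the paper the observables are computed from the calorimeter images:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mass_reference = rng.normal(80.0, 10.0, size=10_000)   # stand-in for Pythia masses
mass_generated = rng.normal(81.0, 11.0, size=10_000)   # stand-in for model output

# W_1, i.e., the Earth Mover's Distance between the two 1D distributions.
print(wasserstein_distance(mass_reference, mass_generated))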
Speed:
Each generative model was used to generate batches of images multiple times to measure the average speed of image generation.
An example image from each generative model and from the Pythia Monte Carlo generator is shown in figure 6. SARM-2 (D+C) clearly excels at generating pixels with small values around the periphery in comparison to the other models. Additional samples for each model can be seen in Appendix 8.8. To assess the overall quality of the generated images, figure 7 shows the pixel-wise average of each dataset. The autoregressive models, the SARMs and Pixel CNN++, are able to model the peripheral radial region around the center more accurately. This region has a higher degree of sparseness than the central region, making it more challenging for the generative models to capture accurately. We note that the images from the SARM-2 (D+C) model appear to be most similar to the Pythia images, while the other models are less able to generate the peripheral region faithfully.
Figure 6: Example jet images generated from each model (columns: Pythia, SARM-2 (D+C), SARM-2 (D+D), Pixel CNN++, LAGAN; rows: signal and background). Notice that SARM-2 (D+C) is able to produce small-value pixels in the periphery of the images. The intensity of each pixel is shown on a log scale, where the white space represents pixels with value zero.
Figure 7: Pixel-wise average of the images generated by each model (columns: Pythia, SARM-2 (D+C), SARM-2 (D+D), Pixel CNN++, LAGAN; rows: signal and background). Notice that LAGAN struggles to capture the distribution of low-value pixels in the periphery of the images and has a non-smooth radial transition compared to the autoregressive models. The intensity of each pixel is shown on a log scale, where the white space represents pixels with value zero.

In addition, Pixel CNN++ struggles to achieve the radial structure present in the Pythia images and creates a square-like structure instead. In general, the images in figure 7 generated by the autoregressive models show a smooth transition from the highly activated center to the sparse border, similar to that seen in the Pythia dataset. In contrast, the border of the LAGAN images is irregular, which could be due to its reliance on the ReLU activation function to induce sparsity, leaving the model unable to estimate the sparseness level directly.
Figure 8: Distributions of jet observables (top: mass, bottom: P_T) calculated from images generated by several generative models and from the original images generated by Pythia. Signal images, with two collimated quarks, are on the left; background images, with a single quark or gluon, are on the right.

To quantify the fidelity of the images generated by each model as compared with the original samples, we insert them into typical applications in particle physics. In the context of collisions that produce jets, it is common to calculate the invariant mass and the transverse momentum of the jet. Distributions of jet mass and P_T are shown in figure 8 for all models, which all succeed in matching the general shape, though discrepancies are visible; the corresponding Wasserstein distances are shown in table 1.

All SARM variants achieve lower distances in the P_T distributions than LAGAN and Pixel CNN++, and comparable or better distances in jet mass. The best results in all categories are obtained by SARM-2 (D+D). Compared to the best of Pixel CNN++ and LAGAN, SARM-2 (D+D) improves the W_1 distance for P_T and provides a 23.79% improvement for mass, averaged over the signal and background sets. These results demonstrate the effectiveness of taking sparseness into account during learning and generation. Secondly, the SARM-2 models clearly outperform the SARM-1 models for both the (D+D) and (D+C) likelihoods, which shows the effectiveness of the multi-stage approach in modeling heterogeneous areas of the images.

Table 1: Comparison of images created by various generative models with original images from Pythia, evaluated using the Wasserstein distance (with p = 1) between one-dimensional distributions of physical quantities calculated from the images: jet P_T and invariant mass, also shown in figure 8. Smaller values indicate a closer match to the Pythia images. Four SARMs are evaluated: one-stage (SARM-1) and two-stage (SARM-2) models, each with either a discrete and continuous mixture (D+C) or a mixture of discrete distributions (D+D).

                    P_T                   Mass
Model           Signal   Background   Signal   Background
LAGAN           3.15     3.29         1.45     1.39
Pixel CNN++     3.46     3.59         1.09     1.56
SARM-1 (D+C)    2.33     2.46         1.07     1.54
SARM-2 (D+C)    2.32     2.71         1.06     1.39
SARM-1 (D+D)    1.95     2.52         1.34     2.45
SARM-2 (D+D)    1.44     1.66         0.94     0.92
6.1.3 Classification of Generated Images
An important application of generated calorimeter images is to augment training sets for networks learning to perform vital signal-background classification tasks. As a high-level test of image quality, we train networks using images generated by each model (200k signal, 200k background) and evaluate the performance on the original images from Pythia (20k signal, 20k background). Training sets which best mimic the original Pythia images should lead to networks which most closely match the performance of a network trained on Pythia images. Detailed information about the classifier and training procedure is given in Appendix 8.6. The receiver operating characteristic (ROC) curves for networks trained on images from Pythia, SARM-2 (D+C), SARM-2 (D+D), Pixel CNN++ and LAGAN are shown in figure 9. Classifiers trained on both SARM-generated datasets have higher AUC (area under the ROC curve) scores than the classifiers trained on the LAGAN and Pixel CNN++ images.
SARMs generate images pixel by pixel, conditioning each step on the previously generated pixels. The order of the pixel generation corresponds to a dependency decomposition in equation 3.1, which may impact training performance. The traversal path is especially important for images containing heterogeneous areas. For natural images, [50] uses an ensemble of models with random paths, while Pixel CNN++ and other models [10, 41] use a row-by-row pixel ordering.
Figure 9: Evaluation of the fidelity of images generated by several models in the context of a classification task. Images generated by each model are used to train a network to discriminate between signal and background, but performance is measured using the original Pythia images. AUC scores: Pythia 0.894, SARM-2 (D+D) 0.869, SARM-2 (D+C) 0.841, LAGAN 0.825, Pixel CNN++ 0.815.
Figure 10: Quality of images generated by SARMs with various pixel-generation orderings, as measured by the Wasserstein distance for the physical observables (P_T and mass) between the generated images and the original Pythia images. Spiral-in clockwise/counterclockwise (CW/CCW), spiral-out clockwise/counterclockwise (CW/CCW), column-wise, row-wise, and three random orderings are compared. The outward spiral orders show good performance due to the radial structure of the images.

The performance of various pixel orderings for SARM-1 (D+D) is shown in figure 10. Each order is evaluated using the Wasserstein distance between the distributions of the generated signal images and the Pythia signal images for the jet P_T and invariant mass. The two spiral paths originating from the center, clockwise (CW) and counterclockwise (CCW), achieve the strongest results. This can be understood in terms of mutual information between neighboring pixels: unlike the other orderings, a spiral ordering always generates a pixel adjacent to the previously generated pixel. Starting the spiral from the center outperforms inward spirals, indicating that it may be easier to learn the correlations between the pixels by starting with the pixels that have higher entropy. Additionally, comparing the two outward spiral orders, the CCW order performs better than the CW order, which suggests an information asymmetry in the training data. A full exploration of the ordering dependency is beyond the scope of this work and computationally challenging due to the factorial number of possible orderings.

Table 2 shows the speed of the generative models in comparison to the Monte Carlo method (Pythia). The SARM-2 models are five times slower than LAGAN, which is mainly due to the extra computational cost of the autoregressive structure. On the other hand, the SARM-2 models are two orders of magnitude faster than Pythia and Pixel CNN++. The forward pass of the Pixel CNN++ model is computationally expensive due to the ResNet blocks with convolutional layers and skip connections [41, 57]. In contrast, SARMs use a simple feed-forward network with disabled connections to preserve the autoregressive structure. The speed of the generative models was measured on a machine with 4 TITAN X GPU cards, each with 12 GB of memory. The speed of Pythia was assessed in [6] using Amazon Web Services (AWS) and an Intel Xeon E5-2686 v4 at 2.30 GHz CPU.
Table 2: Comparison of image generation speed between the Monte Carlo approach (Pythia) and various generative models. The SARM-2 models are slower than LAGAN, but still considerably faster than Pythia and Pixel CNN++.

Model           Speed (images/sec)
Pythia [6]      34
Pixel CNN++     50
SARM-2 (D+D)    1612
SARM-2 (D+C)    2480
LAGAN           10176

There is room to further optimize the speed of the SARM models. For instance, we find that reducing the size of the intermediate upsampling layer of the SARM (D+D) drastically reduces the memory requirements and improves the generation speed. Another direction is to explore model pruning and compression.
Typical calorimeter images in the vicinity of a muon generated by the standard Monte Carlo method, Pixel CNN++, and two SARMs are shown in figure 11. In this context, LAGAN suffered from mode collapse and failed to generate images of reasonable quality (see figure 19 in the Appendix). This is a well-known problem when training GANs [6, 38, 39], especially with sparse data.

Figure 12 shows the pixel-wise average images. The SARM-2 models and Pixel CNN++ reproduce the radial symmetry seen in the original images. However, the average images produced by Pixel CNN++ contain noticeable artifacts, potentially due to the convolutional layers in the model [58].
Figure 11: Example calorimeter images in the vicinity of a muon from the generative models as well as the original Monte Carlo generator (columns: Monte Carlo, SARM-2 (D+C), SARM-2 (D+D), Pixel CNN++). The top row shows isolated muons (signal), while the bottom shows muons produced in association with a jet (background). The intensity of each pixel is shown on a log scale, where the white space represents pixels with value zero.
To assess the fidelity of the images quantitatively, we calculate physical quantities which summarize the content of the images and allow for the comparison of one-dimensional distributions. While calorimeter images in the vicinity of a muon do not necessarily contain a clustered jet, the total P_T and invariant mass of the entire image do have physical meaning. Figure 13 shows the distributions of these quantities for the original Monte Carlo images as well as for the generated images, and table 3 provides the corresponding Wasserstein distances.

The datasets generated by both SARM-2 models have considerably smaller Wasserstein distances than the datasets generated by the Pixel CNN++ model, for both signal and background. The distributions of all the generated datasets approximate the shape of the Monte Carlo distributions quite well for P_T and mass, but the distributions of the Pixel CNN++ dataset have a small shift towards higher values, for both the signal and the background. In addition, for the background they are more concentrated around the mean. This is potentially due to the fact that Pixel CNN++ fails to model the right tail of the pixel distribution, where the pixels have higher values but appear much less frequently in the data (figure 22 in the Appendix). SARM-2 (D+D) has the best overall performance, with improvements for both P_T and mass, averaged over the signal and background datasets.
Figure 12: Pixel-wise averages of calorimeter images in the vicinity of a muon from the generative models as well as the original Monte Carlo generator. The top row shows isolated muons (signal), where little calorimeter activity is expected. The bottom row shows muons produced in association with a jet (background), which deposits significant energy near the muon.

Table 3: Comparison of images created by various generative models to the original Monte Carlo images using the Wasserstein distance (with p = 1) between one-dimensional distributions of physical quantities calculated from the images: P_T and invariant mass, also shown in figure 13. Smaller values indicate a closer match to the Monte Carlo images. Two SARMs are evaluated, with either a discrete and continuous mixture (D+C) or a mixture of discrete distributions (D+D).

                    P_T                   Mass
Model           Signal   Background   Signal   Background
Pixel CNN++     1.75     2.92         0.58     0.82
SARM-2 (D+C)    0.79     0.97         0.25
SARM-2 (D+D)
The fidelity of the images can also be evaluated in the context of the data analysis task for which they were created: training a network to distinguish between signal (calorimeter images near isolated muons) and background (calorimeter images near non-isolated muons).
Figure 13: Distributions of calorimeter observables (top: invariant mass, bottom: total P_T) calculated from images generated by several generative models and from the originals generated by a Monte Carlo generator. Signal images, in the vicinity of an isolated muon, are on the left. Background images, in the vicinity of a muon produced with an associated jet, are on the right.

A convolutional neural network classifier was trained using images generated exclusively by each of the models (SARM-2 (D+C), SARM-2 (D+D), or Pixel CNN++); one additional network was trained using images from the Monte Carlo generator. The quality of the images is measured by comparing the classification performance of these networks on images from the Monte Carlo generator; see figure 14. The classifiers trained on each SARM dataset have higher AUC scores than the classifier trained on the Pixel CNN++ dataset, providing additional evidence that the SARM datasets are more similar to the Monte Carlo images and thus better suited for downstream tasks such as data augmentation.

Calorimeter image generation speeds in the context of the muon isolation study are shown in table 4 for the SARM models, Pixel CNN++, and the Monte Carlo generator. The SARM models are one to two orders of magnitude faster than Pixel CNN++, similar to the observation in the jet substructure study. The generation speed of each generative model is measured with the same hardware as described in Section 6.1.5. The speed of the Monte Carlo generator is measured on an Intel Xeon E5-2680 at 2.70 GHz CPU.
Figure 14: Evaluation of the fidelity of images generated by several models in the context of a classification task, distinguishing muons produced in isolation from those produced in association with a jet. Images generated by each model are used to train a network to discriminate between signal and background, but performance is measured using the original Monte Carlo images. AUC scores: Monte Carlo 0.816, SARM-2 (D+D) 0.808, SARM-2 (D+C) 0.788, Pixel CNN++ 0.769.
Table 4: Comparison of image generation speed between the Monte Carlo approach and various generative models. The SARM-2 models are considerably faster than Pixel CNN++ and the Monte Carlo generator.

Model           Speed (images/sec)
Monte Carlo     5
Pixel CNN++     10
SARM-2 (D+D)    625
SARM-2 (D+C)    1136
Sparse images, prevalent in particle physics datasets, present unique challenges for generative models. We have developed and applied a new class of models, deep sparse autoregressive generative models (SARMs), specifically designed to handle extreme sparseness. These compositional models are also able to take advantage of the structure present in particle physics images by using a multi-stage generation approach. Using several different metrics, we compared SARMs to other generative models, in particular to Pixel CNN++, a popular autoregressive model not adapted for sparsity, and to LAGAN, a state-of-the-art GAN for sparse images. The comparisons were carried out using two benchmark datasets.

In the first case study, on jet substructure, the adaptation to sparseness enables SARMs to produce qualitatively and quantitatively higher-quality images than Pixel CNN++ and LAGAN. SARMs are also orders of magnitude faster than traditional Monte Carlo methods and Pixel CNN++, but slower than the non-autoregressive model LAGAN, showing a trade-off between speed and quality. The second case study features extremely sparse calorimeter images in the vicinity of muons. While competing models produce artifacts or suffer from mode collapse, SARMs are able to handle and model extreme degrees of sparseness.

In sum, given the prevalence of sparse images in particle physics and beyond, SARMs can be expected to provide an important option for rapid, high-quality image generation from training data. Because of their quality, the generated images will in turn be able to benefit a variety of downstream data analyses.
Acknowledgments
We wish to acknowledge a hardware grant from NVIDIA. The work of YL, JC, and PB is in part supported by grants NSF 1839429 and NSF NRT 1633631 to PB. DW is supported by the Department of Energy Office of Science. The authors would like to thank Benjamin Nachman for helpful feedback on an early draft.
Appendix
We simulate a dataset containing pairs of two variables x_1 and x_2, such that x_1 ∼ p(x_1 | x_2) and x_2 ∼ p(x_2). In this toy example we show that an autoregressive model is still able to learn the joint distribution of x_1 and x_2 even when, during training, it is forced to learn the marginal of x_1 first and then the dependency p(x_2 | x_1). The simulated training data contains 1000 pairs {x_1, x_2} generated according to x_2 ∼ N(0, 1) and x_1 = x_2 + ε, where ε ∼ N(0, 1) is a standard normal variable independent of x_2. The joint distribution of x_1 and x_2 is shown in figure 15. The toy autoregressive model learns to generate x_1 using two learnable parameters, µ_1 and log(σ_1), corresponding to the mean and log standard deviation of x_1. It has a single linear layer for predicting µ_2 and log(σ_2), the mean and log standard deviation of x_2 given x_1. The model is trained for 5000 iterations by maximizing the likelihood p(x_1, x_2). During the generation stage, the model generates x_1 without knowing x_2. Since the goal of the model is to generate the joint distribution (x_1, x_2) ∼ P(x_1, x_2), it only needs to learn the marginal distribution x_1 ∼ N(0, 2) and the relationship x_2 = x_1 − ε. Figure 15 shows the result of training this model: it correctly learns the means and variances of {x_1, x_2} along with the data distribution, despite the fact that it has to generate x_1 before generating x_2.
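A minimal sketch of this toy model under the stated setup (training details such as the learning rate are illustrative assumptions):

import torch
from torch import nn

torch.manual_seed(0)
x2 = torch.randn(1000, 1)                 # x2 ~ N(0, 1)
x1 = x2 + torch.randn(1000, 1)            # x1 = x2 + eps, eps ~ N(0, 1)

mu1 = torch.zeros(1, requires_grad=True)          # marginal parameters of x1
log_sigma1 = torch.zeros(1, requires_grad=True)
head2 = nn.Linear(1, 2)                   # predicts (mu2, log_sigma2) from x1

opt = torch.optim.Adam([mu1, log_sigma1, *head2.parameters()], lr=1e-2)
for _ in range(5000):
    out = head2(x1)
    mu2, log_sigma2 = out[:, :1], out[:, 1:]
    # Negative log-likelihood of the Gaussian factorization p(x1) p(x2 | x1).
    nll = (log_sigma1 + 0.5 * ((x1 - mu1) / log_sigma1.exp()) ** 2
           + log_sigma2 + 0.5 * ((x2 - mu2) / log_sigma2.exp()) ** 2).mean()
    opt.zero_grad()
    nll.backward()
    opt.step()

# Generation: sample x1 from its learned marginal, then x2 | x1.
x1_gen = mu1 + log_sigma1.exp() * torch.randn(1000, 1)
out = head2(x1_gen)
x2_gen = out[:, :1] + out[:, 1:].exp() * torch.randn(1000, 1)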
Figure 15: Left: Density plot of training data. Right: Density plot of generated data. The two distributions are very close, showing that the ARM is able to learn the joint distribution of x_1 and x_2 well.

The MADE structure enforces the autoregressive property on fully connected layers by using a carefully selected binary mask on the weights of each layer. The joint likelihood of the MADE structure can be evaluated in one forward pass of the network during training, which is not possible in other models like Pixel RNN [10] and Pixel CNN++ [41]. This allows MADE to take advantage of GPU acceleration. In our SARM implementation, we consider a simple MADE structure with input x and a stack of hidden layers h(x), each of which follows:

h(x) = f(b + (W ⊙ M^W) x)
θ = f(c + (V ⊙ M^V) h(x))    (8.1)

Here θ is the output and f is the activation function of the hidden layer. In practice, we found that Gaussian Error Linear Units (GELU) [59] work better in our experiments than other activations such as the sigmoid and tanh. Both W and V are weight matrices, with corresponding masks: the hidden mask M^W and the output mask M^V. Each matrix is multiplied element-wise with its mask. Suppose x ∈ R^D; then the input mask connects hidden unit k only to inputs that precede it in the ordering:

M^W_{k,d} = 1_{d ≤ (k mod D)}    (8.2)

Likewise, suppose h(x) ∈ R^H; then the output mask is strict, so that the parameters θ_d of pixel d depend only on x_1, …, x_{d−1}:

M^V_{k,d} = 1_{(k mod D) < d}    (8.3)

Figure 16: Error measured by subtracting the pixel-wise average of the images created by each generative model (SARM-2 (D+C), SARM-2 (D+D), Pixel CNN++, LAGAN) from the pixel-wise average of the images generated with Pythia. The SARM models have lower error than both Pixel CNN++ and LAGAN, with most of the errors concentrated in the center of the image.

Figure 17: Distribution of aggregated pixel intensity in the generated images for the jet substructure study. Notice that most of the differences occur at high pixel values, where there are fewer events. LAGAN also has a harder time replicating the distribution of background images across all pixel values compared to the other models.

All models have difficulties learning the high-value pixels, which is expected since there are very few pixels in this range in the Pythia distribution.

Despite our best efforts, the LAGAN model performed poorly every time it was trained on the muon isolation dataset. As seen in figures 18 and 19, the pixel-wise average image does not capture the radial structure present in the dataset, and some pixels with high values seem to be present in many of the images. This appears to be due to a low amount of variability in the generated images, typical of mode collapse in GANs. This performance is also reflected in the distributions of P_T and mass (figure 20) and the respective Wasserstein distances, which are one order of magnitude worse than the values for the other models (table 5).

Figure 18: Typical muon images generated using LAGAN. The figures are plotted on a log scale, where the white space represents pixels with value zero.

Figure 19: Pixel-wise average of muon images from LAGAN for signal and background. The average images generated by LAGAN fail to reproduce the radial structure present in the average Monte Carlo images (figure 12).
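Referring back to the masks of equations 8.2 and 8.3, a minimal PyTorch sketch of a masked layer follows; this is our own illustration, not the authors' released implementation:

import torch
from torch import nn

class MaskedLinear(nn.Linear):
    # A linear layer whose weight matrix is multiplied element-wise by a
    # fixed binary mask, as in equation 8.1.
    def __init__(self, in_features, out_features, mask):
        super().__init__(in_features, out_features)
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

D, H = 4, 8
k = torch.arange(H) % D                        # degree of each hidden unit
d = torch.arange(D)
mask_W = (d[None, :] <= k[:, None]).float()    # (H, D): hidden k sees inputs d <= k mod D
mask_V = (k[None, :] < d[:, None]).float()     # (D, H): output d sees hidden with k mod D < d

hidden = MaskedLinear(D, H, mask_W)
output = MaskedLinear(H, D, mask_V)
theta = output(nn.functional.gelu(hidden(torch.randn(2, D))))  # one forward pass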
Figure 21 shows the subtraction between the pixel-wise average of the images from each generative model and the pixel-wise average of the Monte Carlo images in the muon isolation dataset. For the signal data, all models show very small differences, evenly distributed across the radial structure of the images; in particular, Pixel CNN++ over-represents most of the pixels in the artificial checkerboard pattern noted before. For the background data the errors are slightly higher for all models. The SARM models have more difficulty with the pixels in the center and tend to over-represent them, while Pixel CNN++ under-represents the center and over-represents the periphery.

Figure 20: Comparison of the mass and P_T distributions of the images generated by LAGAN, SARM-2 (D+D), and the Monte Carlo simulations for both signal and background muons.

Table 5: Wasserstein distance between the P_T and mass distributions of the original muon images from the Monte Carlo generator and of the images created by the generative models. A small distance signifies good agreement. SARM-2 (D+D) is the two-stage SARM model with a discrete mixture.

                    P_T                   Mass
Model           Signal   Background   Signal   Background
LAGAN           4.81     10.88        1.81     2.17
SARM-2 (D+D)

Figure 21: Subtraction between the pixel-wise average of generated images and of Monte Carlo images. The errors are evenly distributed in the signal images, while they are concentrated in the center for the background images, where there is a larger number of high-intensity pixels.

Figure 22 shows the distribution of pixel values across all the generated images. For both signal and background, the Pixel CNN++ model under-represents pixels with high intensity, while the SARM models match the distribution quite well. As in the jet substructure study, most of the errors correspond to pixels with high intensity values, which is expected since these values are rare in the training data, making it difficult to correctly learn their distribution.

Figure 22: Distribution of pixel intensity for the muon isolation study. Pixel CNN++ under-represents the distribution, while the SARM models miss the high pixel values where there are fewer events.

8.5 Software Modifications

8.5.1 LAGAN

The code and weights of the original LAGAN model for the jet substructure dataset are publicly available. This makes it possible to generate new images using the original model's weights for this dataset, but the model needs to be retrained to generate images for a different dataset. The model was retrained for the muon isolation study, and it also had to be modified to adapt it to the larger images of 32 × 32 pixels, since it has upsampling layers in the generator part of the GAN.

8.5.2 Pixel CNN++

As a baseline for autoregressive models we used Pixel CNN++ [41].
Due to speed and memory restrictions, we had to modify the original model by reducing the number of filters in the masked convolutional layers and the number of residual blocks. Both the number of filters and the number of residual blocks were optimized as hyperparameters using grid search, with 5, 10 or 20 filters and 2 or 3 residual blocks. However, we found that most hyperparameter combinations have similar performance. The model with 20 filters and 3 blocks performs slightly better in the jet substructure study, and the model with 10 filters and 5 blocks performs slightly better in the muon isolation study. Even though the models we used are smaller than the original model in [41], they are almost as slow as the traditional Monte Carlo methods (tables 2 and 4).

8.5.3 SARM

We performed a search over the architectures of the SARMs, including the number of hidden layers, the size of the central area for the two-stage approach, and the size of the intermediate upsampling layer, using SHERPA [60]. We also conducted a search over the transformation parameter p for the D+D models. All models were implemented in PyTorch [61] and trained for 300 epochs with the outward spiral (CCW) order using the Adam optimizer [37], with a learning rate of 3e-4 decreased by half every 100 epochs and a mini-batch size of 128.

For the jet substructure study, the best SARM-2 configuration had a central area of side length 3. For the D+D models, we used 5 hidden layers with an upsampling layer of size 10 and found that a mild power transformation yields slightly better results. For the D+C models, we found that a model with 3 hidden layers and a mixture of 5 truncated logistics for the C component works well for both signal and background images. In the generation order experiments, we similarly used SARM-1 (D+D) models with 5 hidden layers, an upsampling layer of size 10, and a power transformation with p = 1.0, effectively no transformation; all models were trained with identical settings: a learning rate of 3e-4 decreased by half every 100 epochs and a mini-batch size of 128. For the LAGAN model we used the publicly available version of LAGAN optimized by the original authors.

For the muon isolation study, the best model we found had 5 hidden layers and a central area of side length 7 for both the D+D and D+C models. For the SARM-2 (D+D), we used an upsampling layer of size 10, with different power transformations for signal and background providing the best results. For the D+C models, we again found that a mixture of 5 truncated logistics for the C component works well for both signal and background images.

8.6 Classifier Details

For the classification tasks, we trained five convolutional neural networks with the same structure, one on each of the datasets. We randomly split the data into a 90% subset for training and a 10% subset for validation; the validation set is used for early stopping during training to avoid over-fitting. The convolutional neural network has 2 convolutional blocks, 2 fully connected layers with 100 rectified linear units, and a sigmoid unit at the end to predict the probability of the image being signal. Each convolutional block contains two convolutional layers with 3×3 kernels and 30 filters with rectified linear units, followed by a max-pooling layer with a 2×2 kernel. All classifiers were trained in PyTorch using the Adam optimizer with a batch size of 128.
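A minimal sketch of the evaluation classifier described above; the 25 × 25 input corresponds to the jet-image case, and padding and other unstated details are assumptions:

import torch
from torch import nn

class EvalCNN(nn.Module):
    def __init__(self, side=25):
        super().__init__()
        def block(c_in):
            # Two 3x3 convolutions with 30 filters, followed by 2x2 max-pooling.
            return nn.Sequential(
                nn.Conv2d(c_in, 30, 3, padding=1), nn.ReLU(),
                nn.Conv2d(30, 30, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(1), block(30))
        n_flat = 30 * (side // 4) ** 2
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 1), nn.Sigmoid(),   # probability of being signal
        )

    def forward(self, x):
        return self.head(self.features(x))

model = EvalCNN()
prob_signal = model(torch.randn(8, 1, 25, 25))  # batch of 8 single-channel images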
Next we compare the number of parameters of the different models in table 6. Note that the original Pixel CNN++ model [41] uses 160 convolutional filters. With all these filters, each forward pass takes more than 1 second on 4 NVIDIA TITAN X GPU cards, resulting in a generation speed that is one order of magnitude slower than the traditional Monte Carlo methods, thus defeating the original purpose. Therefore, in our implementation of the Pixel CNN++ model, we limit the number of filters to 20 to speed up the generation process and reduce the memory requirements.

Table 6: Model complexity comparison in terms of the number of parameters.

Model           Num. of Parameters
Pythia [6]      -
Pixel CNN++     0.7M
SARM-2 (D+D)    21M
SARM-2 (D+C)    7M
LAGAN           5M

In this section, we show more generated images from both the jet substructure study and the muon isolation study.

Figure 23: Additional typical images from the jet substructure study.

Figure 24: Additional typical images from the muon isolation study.

References

[1] GEANT4 Collaboration, S. Agostinelli et al., GEANT4: A Simulation Toolkit, Nucl. Instrum. Meth. A (2003).
[2] ATLAS Collaboration, G. Aad et al., The ATLAS Simulation Infrastructure, Eur. Phys. J. C (2010).
[3] R. Rahmat, R. Kroeger, and A. Giammanco, The Fast Simulation of the CMS Experiment, Journal of Physics: Conference Series (2012).
[4] ATLAS Collaboration, N. Nikiforou, Performance of the ATLAS Liquid Argon Calorimeter After Three Years of LHC Operation and Plans For a Future Upgrade, (2013). arXiv:1306.6756.
[5] LHCb Collaboration, LHCb Calorimeters: Technical Design Report. Technical Design Report LHCb. CERN, Geneva, (2000).
[6] L. de Oliveira, M. Paganini, and B. Nachman, Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis, Comput. Softw. Big Sci. (2017) [arXiv:1701.05927].
[7] Y. Lu, J. Collado, K. Bauer, D. Whiteson, and P. Baldi, Sparse Image Generation with Decoupled Generative Models, in Neural Information Processing Systems, Machine Learning and the Physical Sciences Workshop, (2019). https://ml4physicalsciences.github.io/2019/files/NeurIPS_ML4PS_2019_161.pdf.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative Adversarial Networks, in Advances in Neural Information Processing Systems 27. Curran Associates, Inc., (2014). arXiv:1406.2661.
[9] D. P. Kingma and M. Welling, Auto-Encoding Variational Bayes, (2014). arXiv:1312.6114.
[10] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, Pixel Recurrent Neural Networks, in Proceedings of the 33rd International Conference on Machine Learning (ICML), vol. 48 of Proceedings of Machine Learning Research, (New York, New York, USA), pp. 1747–1756, PMLR, (2016). arXiv:1601.06759.
[11] J. Zhu, T. Park, P. Isola, and A. A. Efros, Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, CoRR (2017) [arXiv:1703.10593].
[12] A. Brock, J. Donahue, and K.
Simonyan, Large Scale GAN Training for High Fidelity Natural Image Synthesis, in International Conference on Learning Representations (ICLR), (2019). arXiv:1809.11096.
[13] D. P. Kingma and P. Dhariwal, Glow: Generative Flow with Invertible 1x1 Convolutions, in Advances in Neural Information Processing Systems 31, pp. 10215–10224. Curran Associates, Inc., (2018). arXiv:1807.03039.
[14] P. Baldi, P. Sadowski, and D. Whiteson, Searching for Exotic Particles in High-Energy Physics with Deep Learning, Nature Communications (2014) 4308.
[15] S. Delaquis, M. Jewell, I. Ostrovskiy, M. Weber, T. Ziegler, J. Dalmasson, L. Kaufman, T. Richards, J. Albert, G. Anton, I. Badhrees, P. Barbeau, R. Bayerlein, D. Beck, V. Belov, M. Breidenbach, T. Brunner, G. Cao, W. Cen, and O. Zeldovich, Deep Neural Networks for Energy and Position Reconstruction in EXO-200, Journal of Instrumentation (2018).
[16] C. Shimmin, P. Sadowski, P. Baldi, E. Weik, D. Whiteson, E. Goul, and A. Søgaard, Decorrelated Jet Substructure Tagging using Adversarial Neural Networks, Physical Review D (2017) [arXiv:1703.03507].
[17] P. Baldi, J. Bian, L. Hertel, and L. Li, Improved Energy Reconstruction in NOvA with Regression Convolutional Neural Networks, Phys. Rev. D (2019) [arXiv:1811.04557].
[18] D. Guest, J. Collado, P. Baldi, S.-C. Hsu, G. Urban, and D. Whiteson, Jet Flavor Classification in High-Energy Physics with Deep Neural Networks, Physical Review D (2016) [arXiv:1607.08633].
[19] P. Sadowski, J. Collado, D. Whiteson, and P. Baldi, Deep Learning, Dark Knowledge, and Dark Matter, in Proceedings of the NIPS 2014 Workshop on High-Energy Physics and Machine Learning, vol. 42 of Proceedings of Machine Learning Research, (Montreal, Canada), pp. 81–87, PMLR, (2015). http://proceedings.mlr.press/v42/sado14.html.
[20] I. Seong, L. Hertel, J. Collado, L. Li, N. Nayak, J. Bian, and P. Baldi, Convolutional Neural Networks for Energy and Vertex Reconstruction in DUNE, (2019). https://ml4physicalsciences.github.io/2019/files/NeurIPS_ML4PS_2019_77.pdf.
[21] P. Baldi, Deep Learning in Science: Theory, Algorithms, and Applications. Cambridge University Press, Cambridge, UK, (2020). In press.
[22] M. Mustafa, D. Bard, W. Bhimji, Z. Lukić, R. Al-Rfou, and J. M. Kratochvil, CosmoGAN: Creating High-Fidelity Weak Lensing Convergence Maps using Generative Adversarial Networks, Computational Astrophysics and Cosmology (2019) [arXiv:1706.02390].
[23] P. Musella and F. Pandolfi, Fast and Accurate Simulation of Particle Detectors Using Generative Adversarial Networks, Computing and Software for Big Science (2018) [arXiv:1805.00850].
[24] K. Zhou, G. Endrődi, L.-G. Pang, and H. Stöcker, Regressive and Generative Neural Networks for Scalar Field Theory, Phys. Rev. D (2019) 011501.
[25] G. R. Khattak, S. Vallecorsa, and F. Carminati, Three Dimensional Energy Parametrized Generative Adversarial Networks for Electromagnetic Shower Simulation, pp. 3913–3917, (2018). https://ieeexplore.ieee.org/document/8451587.
[26] S. Alonso-Monsalve and L. Whitehead, Image-Based Model Parameter Optimization Using Model-Assisted Generative Adversarial Networks, IEEE Transactions on Neural Networks and Learning Systems (2020) 1–6, [arXiv:1812.00879].
[27] K. Deja, T. Trzciński, and Ł. Graczykowski, Generative Models for Fast Cluster Simulations in the TPC for the ALICE Experiment, in Information Technology, Systems Research, and Computational Physics, (Cham), pp. 267–280, Springer International Publishing, (2020).
[28] F. Carminati, M. P.
[28] F. Carminati, M. Pierini, G. Khattak, B. Hooberman, A. Farbin, W. Wei, M. Zhang, V. B. Pacela, S. Vallecorsa, M. Spiropulu, and J.-R. Vlimant, Calorimetry with Deep Learning: Particle Classification, Energy Regression, and Simulation for High-Energy Physics, in Deep Learning for Physical Sciences, Workshop at the 31st Conference on Neural Information Processing Systems (NeurIPS) (2017).
[29] V. Shah, A. Joshi, S. Ghosal, B. S. S. Pokuri, S. Sarkar, B. Ganapathysubramanian, and C. Hegde, Encoding Invariances in Deep Generative Models, CoRR (2019). arXiv:1906.01626.
[30] K. Cranmer, S. Gadatsch, A. Ghosh, T. Golling, G. Louppe, D. Rousseau, D. Salamani, and G. S., on behalf of the ATLAS Collaboration, Deep Generative Models for Fast Shower Simulation in ATLAS, in Bayesian Deep Learning, Workshop at the 32nd Conference on Neural Information Processing Systems (NeurIPS) (2018). http://bayesiandeeplearning.org/2018/papers/24.pdf.
[31] B. Hashemi, N. Amin, K. Datta, D. Olivito, and M. Pierini, LHC Analysis-Specific Datasets with Generative Adversarial Networks, CoRR (2019). arXiv:1901.05282.
[32] S. Otten, S. Caron, W. de Swart, M. van Beekveld, L. Hendriks, C. van Leeuwen, D. Podareanu, R. R. de Austri, and R. Verheyen, Event Generation and Statistical Sampling for Physics with Deep Generative Models and a Density Information Buffer. arXiv:1901.00875.
[33] J. Cogan, M. Kagan, E. Strauss, and A. Schwartzman, Jet-Images: Computer Vision Inspired Techniques for Jet Tagging, JHEP (2015). arXiv:1407.5675.
[34] M. Paganini, L. de Oliveira, and B. Nachman, CaloGAN: Simulating 3D High Energy Particle Showers in Multi-Layer Electromagnetic Calorimeters with Generative Adversarial Networks, Phys. Rev. D (2018). arXiv:1712.10321.
[35] S. Chintala, How to Train a GAN?, in Workshop on Generative Adversarial Networks (2016).
[36] L. Bottou, Large-Scale Machine Learning with Stochastic Gradient Descent, in COMPSTAT (2010). https://leon.bottou.org/publications/pdf/compstat-2010.pdf.
[37] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization (2014). arXiv:1412.6980.
[38] M. Arjovsky and L. Bottou, Towards Principled Methods for Training Generative Adversarial Networks (2017). arXiv:1701.04862.
[39] V. Nagarajan and J. Z. Kolter, Gradient Descent GAN Optimization Is Locally Stable, in Advances in Neural Information Processing Systems 30, pp. 5585–5595, Curran Associates, Inc. (2017). arXiv:1706.04156.
[40] A. Radford, L. Metz, and S. Chintala, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, CoRR (2015). arXiv:1511.06434.
[41] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications, CoRR (2017). arXiv:1701.05517.
[42] L. G. Almeida, M. Backović, M. Cliche, S. J. Lee, and M. Perelstein, Playing Tag with ANN: Boosted Top Identification with Pattern Recognition, JHEP (2015) 086. arXiv:1501.05968.
[43] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, Jet-Images — Deep Learning Edition, Journal of High Energy Physics (2016). arXiv:1511.05190.
[44] J. Barnard, E. N. Dawe, M. J. Dolan, and N. Rajcic, Parton Shower Uncertainties in Jet Substructure Analyses with Deep Neural Networks, Phys. Rev. D95 (2017), no. 1, 014018. arXiv:1609.00607.
[45] P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, Deep Learning in Color: Towards Automated Quark/Gluon Jet Discrimination, JHEP (2017) 110. arXiv:1612.01551.
[46] T. Sjostrand, S. Mrenna, and P. Z. Skands, PYTHIA 6.4 Physics and Manual, JHEP (2006) 026. hep-ph/0603175.
[47] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H. S. Shao, T. Stelzer, P. Torrielli, and M. Zaro, The Automated Computation of Tree-Level and Next-to-Leading Order Differential Cross Sections, and Their Matching to Parton Shower Simulations, JHEP (2014) 079. arXiv:1405.0301.
[48] DELPHES 3 Collaboration, J. de Favereau et al., DELPHES 3, A Modular Framework for Fast Simulation of a Generic Collider Experiment, JHEP (2014) 057. arXiv:1307.6346.
[49] H. Larochelle and I. Murray, The Neural Autoregressive Distribution Estimator, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, vol. 15 of Proceedings of Machine Learning Research, Fort Lauderdale, FL, USA, pp. 29–37, PMLR (2011). http://proceedings.mlr.press/v15/larochelle11a/larochelle11a.pdf.
[50] M. Germain, K. Gregor, I. Murray, and H. Larochelle, MADE: Masked Autoencoder for Distribution Estimation, CoRR abs/1502.03509 (2015). arXiv:1502.03509.
[51] B. Uria, I. Murray, and H. Larochelle, RNADE: The Real-Valued Neural Autoregressive Density-Estimator, in Advances in Neural Information Processing Systems 26, pp. 2175–2183, Curran Associates, Inc. (2013). arXiv:1306.0186.
[52] C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville, Neural Autoregressive Flows, in Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, Stockholmsmässan, Stockholm, Sweden, pp. 2078–2087, PMLR (2018). http://proceedings.mlr.press/v80/huang18d/huang18d.pdf.
[53] K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra, Deep AutoRegressive Networks, in Proceedings of the 31st International Conference on Machine Learning, vol. 32 of Proceedings of Machine Learning Research, Beijing, China, pp. 1242–1250, PMLR (2014). arXiv:1310.8499.
[54] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database, in Conference on Computer Vision and Pattern Recognition (2009).
[55] Y. Rubner, C. Tomasi, and L. J. Guibas, The Earth Mover's Distance as a Metric for Image Retrieval, International Journal of Computer Vision (2000). https://link.springer.com/article/10.1023/A:1026543900054.
[56] P. Baldi, K. Bauer, C. Eng, P. Sadowski, and D. Whiteson, Jet Substructure Classification in High-Energy Physics with Deep Neural Networks, Phys. Rev. D93 (2016). arXiv:1603.09349.
[57] K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). arXiv:1512.03385.
[58] A. Odena, V. Dumoulin, and C. Olah, Deconvolution and Checkerboard Artifacts, Distill (Oct. 2016). http://distill.pub/2016/deconv-checkerboard.
[59] D. Hendrycks and K. Gimpel, Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units, CoRR abs/1606.08415 (2016). arXiv:1606.08415.
[60] L. Hertel, J. Collado, P. Sadowski, J. Ott, and P. Baldi, Sherpa: Robust Hyperparameter Optimization for Machine Learning, SoftwareX (2020). arXiv:2005.04048. In press. Software available at: https://github.com/sherpa-ai/sherpa.
[61] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, Automatic Differentiation in PyTorch, in NIPS 2017 Workshop on Autodiff (2017).
https://openreview.net/forum?id=BJJsrmfCZ.