Decoding Photons: Physics in the Latent Space of a BIB-AE Generative Network

Erik Buhmann∗, Sascha Diefenbacher, Engin Eren, Frank Gaede, Gregor Kasieczka, Anatolii Korol, and Katja Krüger

Institut für Experimentalphysik, Universität Hamburg, Germany
Deutsches Elektronen-Synchrotron, Germany
Taras Shevchenko National University of Kyiv, Ukraine

∗ e-mail: [email protected]
Abstract. Given the increasing data collection capabilities and limited computing resources of future collider experiments, interest in using generative neural networks for the fast simulation of collider events is growing. In our previous study, the Bounded Information Bottleneck Autoencoder (BIB-AE) architecture for generating photon showers in a high-granularity calorimeter showed a high-accuracy modeling of various global differential shower distributions. In this work, we investigate how the BIB-AE encodes this physics information in its latent space. Our understanding of this encoding allows us to propose methods to further optimize the generation performance, for example by altering the latent space sampling or by suggesting specific changes to hyperparameters. In particular, we improve the modeling of the shower shape along the particle incident axis.

1 Introduction

High-quality simulations of fundamental processes and particle interactions with complex detectors are crucial to data analysis in high energy physics. Especially in the context of increasing data volumes from upcoming runs of the Large Hadron Collider (LHC) and future experiments, the production of datasets using Monte-Carlo-based simulators is increasingly becoming a computing bottleneck [1].

One way to accelerate simulations is based on generative machine learning models that leverage recent advances in computer vision and can be implemented in a parallelizable fashion on graphics processing hardware. Such fast simulations based on Generative Adversarial Networks (GANs) [2] for calorimeter physics were first introduced in Ref. [3] and have seen active development in recent years [4–11]. This approach starts with a small dataset obtained using classical simulation techniques and aims to amplify its usable statistics by training a generative model. The principal feasibility of such an amplification was shown in Ref. [12].

Inspired by Ref. [13], we have previously implemented an improved Bounded Information Bottleneck Autoencoder (BIB-AE) architecture and shown its generation accuracy for various differential distributions of photon shower data in a high-granularity calorimeter [14]. The BIB-AE architecture unifies ideas from different generative approaches, including GANs and Variational Autoencoders (VAEs) [15]. As an autoencoder, the model encodes input photon showers into a latent space from which in turn newly generated showers are sampled.

This contribution explores methods to understand the physics encoded in the latent space and introduces optimizations for improved generation fidelity. As opposed to Ref. [16], we do not aim to explicitly shape the latent space to match physical distributions, but rather investigate how the deviations and correlations of the optimally Gaussian normal latent space features correspond to physically important observables. Compared to Ref. [17], we focus on an information-theoretic perspective and investigate correlations with physical observables instead of the topological structure of the latent space.

In the following, we first briefly introduce the data (Sec. 1.1) and our BIB-AE architecture (Sec. 1.2). We then investigate the connection between generative performance and information encoded in the latent space in Sec. 2, study the correlation between learned latent space distributions and physical observables in Sec. 3, and show how this can be used to improve generative performance in Sec. 4. We close with a summary of results and draw our conclusions in Sec. 5.
1.1 Dataset

Calorimeters are an essential part of detectors used at high-energy particle colliders. They measure the energy particles deposit when interacting with material. Particles interacting with the matter in the calorimeter can produce secondary particles, resulting in cascades or showers. Such a particle shower is created, for example, by an initial electromagnetically interacting photon.

Modern sampling calorimeters are built in a sandwich structure of measuring active layers interspersed with dense passive material. The active material of modern high-granularity calorimeters consists of many small cells that are read out separately, yielding high-resolution 3-dimensional measurements of particle showers.

We created our photon shower dataset using the Geant4 toolkit [18] with a detector model of the International Large Detector (ILD) [19]. The energy depositions are projected onto calorimeter cells in a rectangular grid, resulting in 3d images of 30 × 30 × 30 = 27 000 cells. A fraction of the dataset as well as our implementation of the BIB-AE model are available at https://github.com/FLC-QU-hep/getting_high.

1.2 BIB-AE architecture

The BIB-AE architecture consists of several building blocks: an encoder network mapping the input calorimeter images into a latent representation; a decoder network transforming the latent representation back into calorimeter images; a Post-Processor network refining the pixel values of the decoded image; a reconstruction critic network calculating the Wasserstein distance between encoded and decoded image; and a latent critic network to regularize the latent space. The whole model is trained in two stages: first, the encoder, decoder, and critics are trained until sufficient fidelity is reached; afterwards, the whole model is trained in conjunction with the Post-Processor network to improve the accuracy of the cell energy generation. To generate energy-dependent samples, the BIB-AE is conditioned on the incident particle energy. An overview of the architecture is shown in Figure 1. A more detailed discussion of the network is provided in [14].

Figure 1. Overview of the BIB-AE generative model including the Post-Processor (PP) network.
2 Information content of the latent space

Like in any VAE-based model, the trained BIB-AE can be used to generate calorimeter shower images by sampling the latent space from Standard Normal distributions. To achieve good generation results, the latent space needs to be regularized towards such a Normal distribution. For this regularization the BIB-AE model employs several loss terms during training: a Kullback-Leibler divergence (KLD) loss L_KLD, the output of a latent critic network L_latent-critic, and a latent Maximum Mean Discrepancy (MMD) [20] term L_latent-MMD. Each latent regularizer contribution is scaled with a weight β, yielding a combined latent loss of

\[
\mathcal{L}_{\text{total-latent}} = \beta_{\text{KLD}} \cdot \mathcal{L}_{\text{KLD}} + \beta_{\text{latent-critic}} \cdot \mathcal{L}_{\text{latent-critic}} + \beta_{\text{latent-MMD}} \cdot \mathcal{L}_{\text{latent-MMD}} \tag{1}
\]

with the KL divergence of two discrete probability distributions P and Q defined as

\[
D_{\text{KL}}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log\!\left(\frac{P(x)}{Q(x)}\right) \tag{2}
\]

and calculated via

\[
D_{\text{KL},i} = D_{\text{KL}}\big(Z_i \,\|\, \mathcal{N}(0,1)\big) = -\frac{1}{2}\left(1 + \log(\sigma_i^2) - \mu_i^2 - \sigma_i^2\right). \tag{3}
\]

In the context of this publication, latent variables Z_i are Gaussian distributions with two trainable parameters µ_i and σ_i (Z_i ≡ N(µ_i, σ_i)) regularized towards a Standard Normal distribution. Its sampled values are z_i ∼ Z_i.

In previous work we have implemented the BIB-AE architecture with 24 trainable latent variables and an additional 488 variables that are not encoded but sampled directly from a Standard Normal distribution during training [14]. The number of trainable latent space variables we term the latent space size n. Hence the total KLD loss is given by L_KLD = Σ_{i=1}^{n} D_KL,i. The loss weight β_KLD has the highest impact on the latent regularization, as its scaling defines the magnitude of the KL divergence. Here the KL divergence measures the information content of the latent space [21, 22].
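For concreteness, the per-variable KL divergence of Eq. (3) and the combined latent loss of Eq. (1) can be written in a few lines of PyTorch. This is a minimal illustration, not the reference implementation; the tensor shapes, the critic and MMD loss values, and the critic and MMD weights are assumed inputs:

```python
import torch

def kld_per_latent(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Eq. (3): KL divergence of each latent Gaussian N(mu_i, sigma_i^2)
    to the Standard Normal N(0, 1), averaged over the batch (in nats).
    mu and log_var have shape (batch, n)."""
    return (-0.5 * (1 + log_var - mu.pow(2) - log_var.exp())).mean(dim=0)

def total_latent_loss(mu, log_var, l_latent_critic, l_latent_mmd,
                      beta_kld=0.05, beta_critic=1.0, beta_mmd=1.0):
    """Eq. (1): weighted sum of the three latent regularization terms.
    beta_kld = 0.05 is the baseline from the text; the other two weights
    are placeholders."""
    l_kld = kld_per_latent(mu, log_var).sum()
    return beta_kld * l_kld + beta_critic * l_latent_critic + beta_mmd * l_latent_mmd
```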
Intuitively, for fixed β_KLD, higher latent space sizes n should yield an increased total information in the latent space until a maximum corresponding to the showers' intrinsic relevant information is reached. We test this by re-training a BIB-AE architecture with latent space sizes between 2 and 512 for fixed β_KLD = 0.05. The number of additional Standard Normal sampled variables is adjusted such that the total number of 512 decoded variables stays constant.

Figure 2. Left: Integrated KLD for latent variables sorted by highest KLD for models with different latent space sizes. Right: KL divergence of individual latent space variables sorted with decreasing KLD for different latent space sizes. All models are trained with a baseline weight β_KLD = 0.05, except stated otherwise.

In Fig. 2 (left) we have sorted the trainable latent space variables by their individual KLD calculated via Eq. 3. On the vertical axis, the total information (i.e. the sum of KLD values up to and including latent variable i) is shown. Indeed, we observe increasing total encoded information with increasing latent space size until a saturation at approximately 45 nats (≈ 64 bits) is reached around a latent space size of 64. After that, a larger latent space does not substantially increase the learned information.

Next, we consider the KLD per latent space variable in Fig. 2 (right). All models follow a similar pattern: only a few variables contain a high amount of information (high KLD). In particular, there are always two variables that encode significantly more information than the remaining ones. Furthermore, about 60 variables contain a non-negligible amount of information, indicating a sufficient use of the latent space at a size of n = 64 to yield optimal performance.
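The integrated KLD curves of Fig. 2 (left) follow directly from the per-variable values: sort in decreasing order and accumulate. A minimal sketch, assuming the per-variable KLD values of Eq. (3) are available as an array:

```python
import numpy as np

def integrated_kld(kld_per_variable: np.ndarray) -> np.ndarray:
    """Sort per-variable KLD values (Eq. 3) in decreasing order and
    return the cumulative sum, i.e. the total encoded information
    up to and including the i-th most informative latent variable."""
    kld_sorted = np.sort(kld_per_variable)[::-1]
    return np.cumsum(kld_sorted)

# The saturation value is the last entry of the cumulative sum:
# total_nats = integrated_kld(kld)[-1]    # ~45 nats for n >= 64 (see text)
# total_bits = total_nats / np.log(2)     # ~64 bits
```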
Evaluating the performance of generative models is not straightforward and constitutes an active topic of research. Methods such as the Inception Score [23] were proposed to evaluate models which produce photographic images. However, such scores are typically domain-specific and cannot directly be applied to our dataset. We therefore define a problem-specific fidelity score S_JSD that summarizes the performance across a number of physically relevant observables. The score is calculated by combining the Jensen–Shannon distances (JSD) between histograms of Geant4 and generated showers. The chosen observables are standard differential distributions for photon shower analysis and were applied previously to judge model performances [14]. Additional details on how the score is calculated are given in Appendix A.

In Table 1 we show the fidelity score for different values of the latent space size n. Lower values correspond to better agreement with the underlying slow simulation. For very small n, the score is considerably worse and improves with increasing latent space size until the best observed value at n = 24. Seven trainings with identical network setup but different random weight initialization were performed for this point to obtain an estimate of the associated uncertainty (calculated as the standard deviation of individual results at n = 24). For larger n the performance is approximately stable within the uncertainty observed for n = 24. This implies that the maximum information content of ≈ 45 nats encoded in the latent space is not needed for optimal performance.

Table 1. Fidelity score S_JSD for the best epoch of multiple model configurations with different latent space sizes n. For n = 24 the best score out of multiple training runs is given, while the mean score for those trainings carries an uncertainty of ± 0.12. Only one training each was performed for sizes n ≠ 24.
3 Physics in the latent space

As only a few variables seem to encode most of the shower information, we investigate what kind of physics information is learned by these variables. In Fig. 3 the Pearson correlation coefficients ρ between different shower physics distributions and the distributions of the encoded µ of the five highest KLD latent variables, as well as the incident particle energy (which is used for conditioning and is included as a latent variable in the BIB-AE), are shown for four different model configurations. The physics distributions include the first and second moment in each spatial dimension (the first moment corresponds to the center of gravity), the visible energy, the incident particle energy, the number of hits, and the fraction of deposited energy in each third of the calorimeter (in the z-direction).

Figure 3. Pearson correlation coefficients between various physics variables and the encoded µ of the five highest KLD latent variables as well as the incident particle energy E for multiple model sizes n. The baseline latent weight is β_KLD = 0.05, except for one training with β_KLD = 0.4. Only non-zero values of the correlations are shown.

Regardless of the model configuration, it is apparent that the highest KLD latent variable strongly correlates with the center of gravity in z-direction (CoG-Z); weaker correlations with CoG-Z are also visible for further latent variables and the incident energy. This pattern persists across the different latent space sizes n.

We can use this observation to improve the CoG-Z distribution in the generated events, as this distribution was previously not particularly well-modeled (see Fig. 6 (bottom left)). In addition, the targeted sampling of a subspace of these latent variables allows us to generate showers with a specific shower start. This is visualized in Fig. 4 with multiple 3d images of a decoded calorimeter shower in which only the highest KLD latent z variable was altered. This variable change leads to a different shower start and hence to an altered center of gravity along the z-axis. Figure 5 visualizes the energy deposition per layer in z-direction of these five decoded shower images. The incident photon enters the calorimeter in the center of the x-y plane at z = 0.
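A correlation map like Fig. 3 can be sketched as follows, assuming the encoded µ values and the physics observables have been collected into arrays (all names here are illustrative, not taken from the published code):

```python
import numpy as np

def correlation_map(encoded_mu: np.ndarray, observables: np.ndarray) -> np.ndarray:
    """Pearson correlation coefficients rho between each encoded latent
    mean (columns of encoded_mu, shape [n_showers, n_latent]) and each
    physics observable (columns of observables, shape [n_showers, n_obs])."""
    n_latent = encoded_mu.shape[1]
    # np.corrcoef treats rows as variables, so stack columns and transpose.
    full = np.corrcoef(np.hstack([encoded_mu, observables]).T)
    return full[:n_latent, n_latent:]  # latent-vs-observable block
```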
Figure 4. 3d images of generated showers decoded from a latent space with all variables z_i = 0, except the highest KLD latent z variable, which is set to values between -3 and 3.

Figure 5. Deposited energy per layer in z-direction for showers which are decoded with all latent variables z_i = 0, except the highest KLD latent z variable, which is set to values between -3 and 3.

4 Improving generative performance

Understanding the encoded shower information, particularly the center of gravity, in the latent space helps us make educated optimization choices for improving model performance. Specifically, we can increase generation fidelity by either regularizing the latent space more strongly or by sampling from non-Gaussian distributions. Either optimization path can be approached in different ways. We have chosen one exemplary method for each: (1) increasing β_KLD reduces the overall KLD in the latent space, yielding latent distributions that are more strongly regularized towards Normal distributions and therefore allow more accurate sampling; or (2) keeping the already trained model but using a second density estimator, such as Kernel Density Estimation (KDE) [24], on the latent variables and sampling directly from the encoded latent space. The former approach is motivated by [25], while the latter mirrors a method used for the Buffer-VAE in Ref. [26].

Figure 6. Differential distributions comparing physics quantities between Geant4 and BIB-AE models trained with β_KLD = 0.05, with β_KLD = 0.4, and with β_KLD = 0.05 combined with the KDE sampling approach.
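The latent scan behind Figs. 4 and 5 amounts to decoding all-zero latent vectors in which only the highest-KLD entry is varied. A sketch under the assumption of a generic `decoder` callable (the actual model, including the additional sampled latent variables, is available in the repository linked above); the scan points are example values in the range quoted in the captions:

```python
import numpy as np

def latent_scan(decoder, n_latent: int, top_kld_index: int,
                values=(-3.0, -1.5, 0.0, 1.5, 3.0)):
    """Decode showers from all-zero latent vectors in which only the
    highest-KLD latent variable is set to a chosen value."""
    showers = []
    for v in values:
        z = np.zeros(n_latent, dtype=np.float32)
        z[top_kld_index] = v  # shifts the shower start along the z-axis
        showers.append(decoder(z))
    return showers
```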
Our baseline model uses a latent KLD weight of β_KLD = 0.05. However, as a higher value for β_KLD leads to a lower KLD value, less information is encoded in the latent space. Therefore, the latent space more closely approaches a Standard Normal distribution, and sampling from N(0, 1) in the generation step should yield showers resembling the Geant4 showers more closely. While this improves the CoG-Z modeling in Fig. 6 (bottom left), it comes with a trade-off for other distributions, such as the total energy sum (top center) and the number of hits (top right), which become narrower than the baseline and truth distributions. This can also be seen in Fig. 7: except for low energies, the energy linearity is better, but the relative width of the energy distributions is on average narrower than in the baseline model.

Figure 7. Mean and relative width of the energy deposited in the calorimeter for various incident particle energies for Geant4 and BIB-AE models trained with β_KLD = 0.05, with β_KLD = 0.4, and with β_KLD = 0.05 combined with the KDE sampling approach.
Table 2. Fidelity score S_JSD for the best epochs for multiple model and sampling configurations of BIB-AE models with a latent size of 24. For β_KLD = 0.05 the best score out of multiple training runs is given, while the mean score for those trainings carries an uncertainty of ± 0.12. Only one training was performed for β_KLD = 0.4.

Figure 8. Encoded µ values of the highest (left) and second-highest (right) KLD latent variables for 50k shower images for models with a latent size of 24 and β_KLD = 0.05 or β_KLD = 0.4. For reference, lines for a Normal distribution and the Kernel Density Estimate of the β_KLD = 0.05 model are added.

Figure 8 illustrates that for the highest (left) and second-highest (right) KLD latent variables, the encoded µ distributions for β_KLD = 0.4 are closer to a Standard Normal distribution than those for β_KLD = 0.05. While this improves the CoG-Z distribution, the overall fidelity score given in Table 2 is slightly worse for β_KLD = 0.4 than for the baseline.

Another way to improve the generative performance, particularly the CoG-Z distribution, is to utilize the latent variables highly correlated to the CoG-Z distribution. Using exactly the same model as in Ref. [14] (β_KLD = 0.05, n = 24) without retraining, one can see in Fig. 8 that the encoded distribution deviates from a Standard Normal distribution. In the usual VAE setup one would regardless sample these variables from N(0, 1) to generate new samples, thereby ignoring the correlations between the latent space and the shower physics. Instead, one could sample those latent variables from the distribution of the encoded µ. Since at least two variables as well as the incident energy are correlated to the CoG-Z distribution, one needs to account for correlations between latent variables when sampling. This can be done by encoding a sufficiently large number of showers (i.e. 500k) from the training set, applying a density estimation method such as KDE, and then sampling new latent variables from it. In the BIB-AE case with 24 encoded latent variables plus energy conditioning, this leads to training a KDE in a 25-dimensional space. The resulting KDE kernel can be used as a probability density function for sampling the latent z variables for improved shower generation.

As shown in Fig. 6, this KDE sampling approach yields global differential distributions very similar to the Geant4 simulation. Note that the model with β_KLD = 0.05 was chosen as the best out of seven training runs and the same model was used to simply add the KDE sampling step. This illustrates another benefit of the KDE approach: it can be applied to any already trained VAE-like model without expensive re-training.
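A minimal sketch of this sampling strategy using SciPy's `gaussian_kde`; the random array is a stand-in for the real encoded µ values, and the sample sizes follow the numbers quoted in the text:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in for the encoded mu values of the 24 latent variables plus the
# energy condition, collected for ~500k training showers.
mu_encoded = np.random.randn(500_000, 25)

# Fit the 25-dimensional density of the encoded latent space ...
kde = gaussian_kde(mu_encoded.T)  # gaussian_kde expects (dims, samples)

# ... and draw new latent vectors from it instead of from N(0, 1),
# preserving the correlations between latent variables.
z_new = kde.resample(size=10_000).T  # shape (10_000, 25)
# z_new can now be passed to the decoder to generate showers.
```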
5 Conclusions

Improving the simulation of calorimeter showers with generative models is an active topic of research motivated by these tasks' large resource consumption. As such generative models still require substantial training efforts and preclude large hyperparameter scans for optimization, we investigate how a better understanding of the latent space can be used to increase performance. While a BIB-AE architecture was used for these studies, the developed strategies should readily transfer to other generative models with an encoded latent space (i.e. VAE-like, but not GAN-like, architectures).

We first quantify the information encoded in the latent space and note that for a fixed value of β_KLD = 0.05, it saturates at ≈ 45 nats. However, generative performance, as measured by a metric defined to take the relevant physical distributions into account, achieves its best value at a latent space size of n = 24 with ≈ 28 nats. Put differently, more information encoded in the latent space will not necessarily translate into better generative performance.

This observation offers an interesting parallel to the information bottleneck principle [13, 27]. It proposes that for a supervised classification task, the latent space Z should maximise its mutual information I with the true class labels C but minimise information irrelevant for classification between data examples X and latent space:

\[
\mathcal{L}_S(\phi) = I_\phi(X; Z) - \beta\, I(Z; C). \tag{4}
\]

Here L_S is the supervised optimisation target, we minimise over parametric mappings φ from data to latent space, and the Lagrange multiplier β denotes the trade-off between the two goals.

For unsupervised tasks, no class labels are available, and the problem becomes:

\[
\mathcal{L}_U(\phi) = I_\phi(X; Z) - \beta\, I(Z; X), \tag{5}
\]

which is also the core of the BIB-AE loss formulation [13]. It is a much more challenging compression problem, as the entropy of a small number of class labels will, in general, be much smaller (and therefore easier to encode) than the entropy of the data distribution.

We observed that without additional constraints, such as restricting the latent space size n, more information than needed for good generative performance is encoded in the latent space, suggesting the need for additional regularising constraints. An interesting open question for future research is therefore how the useful encoded information might be quantified.

Regardless of the model configuration, only a few latent variables of the BIB-AE contain most of the shower information. Correlating the latent variables with various shower physics metrics reveals that the center of gravity in z-direction is always encoded into the two highest KLD latent variables. This encoding can be leveraged for targeted generation of photon showers with a specific shower start by sampling from a subspace of the highest KLD variable.

Furthermore, these observations can help improve the generative fidelity of the BIB-AE model. This can be achieved either by lowering the encoded KLD or by sampling directly from the encoded latent space density distribution, e.g. learned via Kernel Density Estimation. Forcing the latent distributions closer to a unit Normal naturally improves the physical observables most strongly correlated with the highest-KLD latent space variables, but decreases the performance of the others. The latter approach yields the best results, with the additional benefit of applying to the already trained BIB-AE model (or any other VAE-like model).

The increasing use of generative machine learning models motivates a closer look into their learned encoding. Especially in particle physics, the needed precision for many differential distributions over many orders of magnitude offers a rich laboratory to study the connection between generation fidelity and latent space. On the one hand, this offers several methods to probe and improve generative performance, for example by identifying poorly modeled distributions for which a discrepancy between the encoded-into and sampled-from latent space exists. Resolving this discrepancy yields better-generated showers. On the other hand, the observed difference between maximum-information and best-performance latent space capacity raises an interesting problem for future studies.

Acknowledgements
We would like to thank the Maxwell and National Analysis Facility (NAF) computing centers at DESY for the smooth operation and technical support. E. Buhmann is funded by a scholarship of the Friedrich Naumann Foundation for Freedom and by the German Federal Ministry of Science and Research (BMBF) via Verbundprojekt 05H2018 - R&D COMPUTING (Pilotmaßnahme ErUM-Data) Innovative Digitale Technologien für die Erforschung von Universum und Materie. S. Diefenbacher is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC 2121 "Quantum Universe" – 390833306. E. Eren was funded through the Helmholtz Innovation Pool project AMALEA, which provided a stimulating scientific environment for parts of the research done here.
References

[1] R. Jansky, The ATLAS Fast Monte Carlo Production Chain Project (2015), J. Phys. Conf. Ser., Vol. 664, No. 7
[2] I.J. Goodfellow et al., Generative Adversarial Nets, in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (Cambridge, MA, USA, 2014), NIPS'14, p. 2672–2680, https://dl.acm.org/doi/10.5555/2969033.2969125
[3] Accelerating Science with Generative Adversarial Networks: An Application to 3D Particle Showers in Multilayer Calorimeters (2018)
[4] L. de Oliveira, M. Paganini, B. Nachman, Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis (2017)
[5] M. Paganini, L. de Oliveira, B. Nachman, CaloGAN: Simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks (2018)
[6] M. Erdmann, L. Geiger, J. Glombitza, D. Schmidt, Generating and refining particle detector simulations using the Wasserstein distance in adversarial networks (2018)
[7] M. Erdmann, J. Glombitza, T. Quast, Precise simulation of electromagnetic calorimeter showers using a Wasserstein Generative Adversarial Network (2019)
[8] ATLAS Collaboration, Tech. Rep. ATL-SOFT-PUB-2018-001, CERN, Geneva (2018), http://cds.cern.ch/record/2630433
[9] ATLAS Collaboration, Tech. Rep. ATL-SOFT-SIM-2019-007, CERN (2019), https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PLOTS/SIM-2019-007/
[10] A. Ghosh (ATLAS Collaboration), Tech. Rep. ATL-SOFT-PROC-2019-007, CERN, Geneva (2019), https://cds.cern.ch/record/2680531
[11] D. Belayneh et al., Calorimetry with Deep Learning: Particle Simulation and Reconstruction for Collider Physics (2019)
[12] A. Butter, S. Diefenbacher, G. Kasieczka, B. Nachman, T. Plehn, GANplifying Event Samples (2020)
[13] S. Voloshynovskiy, M. Kondah, S. Rezaeifar, O. Taran, T. Holotyak, D.J. Rezende, Information bottleneck through variational glasses (2019)
[14] E. Buhmann, S. Diefenbacher, E. Eren, F. Gaede, G. Kasieczka, A. Korol, K. Krüger, Getting High: High Fidelity Simulation of High Granularity Calorimeters with High Speed (2021)
[15] D.P. Kingma, M. Welling, Auto-Encoding Variational Bayes (2014)
[16] J.N. Howard, S. Mandt, D. Whiteson, Y. Yang, Foundations of a Fast, Data-Driven, Machine-Learned Simulator (2021)
[17] J. Batson, C.G. Haaf, Y. Kahn, D.A. Roberts, Topological Obstructions to Autoencoding (2021)
[18] S. Agostinelli et al., Geant4—a simulation toolkit (2003), Nucl. Instrum. Meth. A 506, 250
[19] H. Abramowicz et al. (ILD Concept Group), International Large Detector: Interim Design Report (2020)
[20] A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, A.J. Smola, A Kernel Method for the Two-Sample Problem (2008)
[21] C.E. Shannon, A mathematical theory of communication (1948), Bell Syst. Tech. J. 27, 379, http://dblp.uni-trier.de/db/journals/bstj/bstj27.html
[22] S. Kullback, Information Theory and Statistics (Wiley, New York, 1959)
[23] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved Techniques for Training GANs (2016)
[24] E. Parzen, On Estimation of a Probability Density Function and Mode (1962), The Annals of Mathematical Statistics 33, 1065
[25] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, A. Lerchner, beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework, in ICLR (2017)
[26] S. Otten, S. Caron, W. de Swart, M. van Beekveld, L. Hendriks, C. van Leeuwen, D. Podareanu, R. Ruiz de Austri, R. Verheyen, Event Generation and Statistical Sampling for Physics with Deep Generative Models and a Density Information Buffer (2019)
[27] N. Tishby, F.C. Pereira, W. Bialek, The information bottleneck method (2000), arXiv preprint physics/0004057

A Fidelity score
Comparing histograms of shower variables such as the total energy, number of hits, shower profile, and center of gravity, as shown in Fig. 6, is a way to determine the generation performance of the generative model in comparison to the Geant4 simulation. However, it is difficult to quantify the model improvement by manually observing these plots. A quantification of the 'generation performance' or 'fidelity' can be calculated via the difference between the histograms of generated and Geant4 showers, expressed through a distance measure between the histograms; such a score is comparable to our fidelity score S_JSD. A similar fidelity metric was calculated in Ref. [26].

The JSD can be calculated for each of the six histograms in Fig. 6. To obtain one score combining all six histograms, one needs to weight each individual histogram's JSD in comparison to all other JSDs of the same model. This weighting is done in the following way:

1. Calculate the JSD for each of the six plots for each model configuration and epoch: JSD_{i,m,e}, with i running over the 6 plots, m over the x models, and e over the y epochs that are compared in the score.

2. Calculate the 6 weighting factors, one for the JSD of each plot i, as the average over all models and epochs: ⟨JSD_i⟩.

3. Calculate the fidelity score S_JSD for each model m and epoch e:

\[
S_{\text{JSD},m,e} = \langle \text{JSD} \rangle_{m,e} = \sum_i \text{JSD}_{i,m,e} \cdot \langle \text{JSD}_i \rangle .
\]

An example of this weighted S_JSD score is shown in Fig. 9 for an epoch-wise scan during the training of two models with different β_KLD weights, each with and without the Post-Processor network. Note that the KLD increases with each epoch and saturates over time. However, a higher KLD does not necessarily correlate with a lower fidelity score.
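For reference, this three-step weighting can be sketched in a few lines; the array layout, the histogram binning, and the helper names are illustrative assumptions rather than the exact setup used for the published numbers:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def fidelity_score(jsd: np.ndarray) -> np.ndarray:
    """Weighted fidelity score S_JSD.

    jsd has shape (n_plots, n_models, n_epochs) and holds the Jensen-Shannon
    distance between the Geant4 and generated histogram of each observable
    (step 1). The per-plot weights are the JSDs averaged over all models and
    epochs (step 2); the score sums the weighted JSDs per model and epoch
    (step 3)."""
    weights = jsd.mean(axis=(1, 2))              # <JSD_i>
    return np.einsum('ime,i->me', jsd, weights)  # S_JSD,m,e

def histogram_jsd(real: np.ndarray, generated: np.ndarray, bins=50) -> float:
    """JSD between histograms of one observable for real and generated showers."""
    lo = min(real.min(), generated.min())
    hi = max(real.max(), generated.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(generated, bins=bins, range=(lo, hi), density=True)
    return jensenshannon(p, q)  # normalizes p and q internally
```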
Figure 9. Evolution of the fidelity score S_JSD and the KL divergence over the course of the training for the two models with β_KLD = 0.05 and β_KLD = 0.4. Based on the fidelity score, the best epochs were chosen (epoch 39 and epoch 87, respectively). Color brightness implies training with or without the Post-Processor network (see Sec. 1.2).