Identifying the relevant dependencies of the neural network response on characteristics of the input space
Published by "Computing and Software for Big Science" (DOI: 10.1007/s41781-018-0012-1)
Stefan Wunsch · Raphael Friese · Roger Wolf · Günter Quast

Abstract
The relation between the input and output spaces of neural networks (NNs) is investigated to identify those characteristics of the input space that have a large influence on the output for a given task. For this purpose, the NN function is decomposed into a Taylor expansion in each element of the input space. The Taylor coefficients contain information about the sensitivity of the NN response to the inputs. A metric is introduced that allows for the identification of the characteristics that mostly determine the performance of the NN in solving a given task. Finally, the capability of this metric to analyze the performance of the NN is evaluated based on a task common to data analyses in high-energy particle physics experiments.
A neural network (NN) is a multi-parameter system, which, depending on its architecture, can consist of several thousands of weight and bias parameters, subject to one or more non-linear activation functions. Each of these adjustable parameters obtains its concrete value and meaning by minimization during the training process. Thus the same NN can be applied to several concrete tasks, which are only defined at the training step.
Stefan Wunsch, [email protected]
Raphael Friese, [email protected]
Roger Wolf, [email protected]
Günter Quast, [email protected]

Karlsruhe Institute of Technology, Institute of Experimental Particle Physics, Karlsruhe, Germany
CERN, Geneva, Switzerland
In applications in high-energy particle physics, which are supposed to distinguish a signal from one or more backgrounds, the training sample is obtained either from simulation or from an independent dataset without overlap with the sample of interest, to which the NN is applied. Usually the NN output itself is then subject to a detailed likelihood-based hypothesis test, to infer the presence and yield of the signal [1,2,3,4,5]. The likelihood may include information on the shape of a variable that is supposed to discriminate signal from background. This shape could (while it does not have to) be e.g. the output of an NN. Apart from one or more parameters of interest the hypothesis test may comprise several hundreds of nuisance parameters, steering the response of the test statistic on a corresponding set of uncertainties. The nuisance parameters can be correlated or uncorrelated with the shape of the discriminating variable and (directly or indirectly) depend on the response of the NN output on its input variables.

These kinds of analyses connect the observation of a measurement to a hypothesized truth. For NN applications they pose the intrinsic problem that, beyond statistical fluctuations, congruency between the training sample and the sample of interest may not be given. Deviations need to be identified and quantified within the uncertainty model of the hypothesis test. They may occur not only in the description of single input variables to the NN, but also in correlations across input variables, even if the marginal distributions of the individual input variables are reproduced. An NN can be sensitive to correlations across input variables; in fact this sensitivity is the main reason for potential performance gains, with respect to other approaches, like e.g. profile likelihoods. To make sure that this performance gain is not feigned, in addition to the marginal distributions, all correlations across input variables need to be carefully checked, and their influence on the test statistic identified and eventually mapped into the uncertainty model of the hypothesis test. The complexity of this methodology motivates the interest, not only in keeping the number of inputs to the NN at a manageable level, but above all in identifying those characteristics of the input space to the NN with the largest influence on the NN output. The definition of the uncertainty model of the hypothesis test can then be concentrated on these most influential characteristics.

This approach sets the scope of this study to not more than a few tens, up to a few hundred, partially highly correlated input variables in the context of particle physics experiments, or comparable applications. It differs from the approaches of weak supervision [6,7,8,9] and pivoting with adversaries [10] that have been discussed in the literature. Weak supervision tries to circumvent the problem that we are describing by replacing an originally ground-truth labeled training by a training based on unlabeled training data. The corresponding samples can be obtained from the data themselves. They do not depend on a simulation and may be chosen to be unbiased. This approach is well justified in classification tasks that are based just on the characteristics of the predefined training data. In the analyses that we are discussing the classification is tied to the hypothesized truth.
Replacing the ground-truth labeled training by unlabeled input data does not solve the problem that we are discussing. Our discussion is also beyond the scope of pivoting with adversaries, for which the mismodellings to address have to be known beforehand. Our discussion sets in at an earlier stage, which is the most complete identification of all uncertainties that can be of relevance for the physics analysis. After the most influential features of the input space have been identified, the method of pivoting with adversaries could be applied to mitigate potential mismodellings. A related approach to extract information about the characteristics of the input space is to flatten the distributions of sub-spaces so that possible discriminating features vanish [11,12]. From the performance degradation after retraining the NN on the modified inputs, information about the discriminating power of the respective sub-space can be obtained. However, this approach does not allow to evaluate the dependencies of the response of a unique NN function on the characteristics of the input space, since each retrained function may have learned different features.

So far, the questions we are raising have been addressed by methods that have been proposed to relate the output of NNs with certain regions of input pixels in the context of image classification [13,14]. These methods use only first-order derivatives of the NN function to propagate the output back layer by layer. What we propose is a Taylor expansion of the full NN function up to an arbitrary order, which allows to connect the input space directly to the NN output. While with this study we will demonstrate the application of the Taylor expansion only up to second order, we explicitly propose a generalization towards higher-order derivatives in the Taylor expansion to capture relations across variables, which usually play a more important role in data analyses in high-energy particle physics experiments.

Due to the high-performance computation of derivatives in modern software frameworks used for the implementation of NNs [15,16,17], this expansion can be obtained at each point of the input space, even if this space is of high dimension. In this way, the sensitivity of the NN response to the input space can be analyzed by the gradient of the NN function. For practical reasons we stop the expansion at second order. To facilitate the following interpretation, we define a feature to be a characteristic of a single element or a pair-wise relation between two elements of the input space. The first class of features relates to the coefficients of the expansion to first order (first-order features); the second class to the coefficients of the second-order expansion (second-order features). First-order features capture the influence of single input elements on the NN output throughout the input space; second-order features the influence of pair-wise or auto-correlations among the input elements. It is obvious that, depending on the given task, a certain feature can have a large influence on the output of the NN in a certain region of the input space, while it is less important in others.
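Spelled out to second order, the expansion around a point {a_j} of the input space reads as follows. This is the standard multivariate Taylor series; the factor 1/2 in the second-order coefficients is the usual Taylor normalization, which the text does not fix explicitly, so it should be read as our convention:

\[
f(\{x_j\}) \;=\; f(\{a_j\}) \;+\; \sum_j t_{x_j}\,(x_j - a_j) \;+\; \sum_{j,k} t_{x_j,x_k}\,(x_j - a_j)(x_k - a_k) \;+\; \dots
\]
\[
t_{x_j} = \frac{\partial f}{\partial x_j}\bigg|_{\{a_j\}}, \qquad t_{x_j,x_k} = \frac{1}{2}\,\frac{\partial^2 f}{\partial x_j\,\partial x_k}\bigg|_{\{a_j\}}.
\]

The coefficients t_{x_j} quantify the local sensitivity to single inputs (first-order features), the diagonal coefficients t_{x_j,x_j} the auto-correlations, and the off-diagonal coefficients t_{x_j,x_k} the pair-wise relations between inputs (second-order features).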
We propose the arithmetic mean of the absolute value of the corresponding Taylor coefficient, computed from the input space defined by the task to be solved,

\[
\langle t_i \rangle \;\equiv\; \frac{1}{N} \sum_{k=1}^{N} \bigl| t_i(\{x_j\}|_k) \bigr|, \qquad i \in \mathcal{P}(\{x_j\}), \tag{1}
\]

as a metric for the influence of a given feature of the input space on the output, where the sum runs over the whole testing sample of size N, t_i corresponds to the coefficients of the Taylor expansion, {x_j}|_k to the set of variables spanning the input space, evaluated for element k of the testing sample, and i is an element of the powerset of {x_j}. It should be noted that the ⟨t_i⟩ characterize the input space (as covered by the test data) and the sensitivity of the NN to it, after training, as a whole.

In section 2 we illustrate this choice with the help of four simple tasks emphasizing certain single features of the input space or their combination. In section 3 we point out that, when evaluated at each step of the minimization during the training process, the ⟨t_i⟩ can be utilized to illustrate and monitor the training process and the learning strategies adopted by the NN. In section 4 we show the application of the ⟨t_i⟩ to a more realistic task common to data analyses in high-energy particle physics experiments. Such tasks usually have the following attributes, which are of relevance for the following discussion:

– they consist of not more than several tens of important input parameters, which leads to a moderate dimensionality of the posed problem;
– they may rely on relations between elements more than they rely on single elements of the input space;
– they usually pose problems where a signal and a background class cannot be separated based on single or few input variables, but only from the combination of several input variables;
– they require a good understanding of the NN performance to turn the output into a reliable measurement.

In the following we illustrate the relation of the ⟨t_i⟩ to certain features of the input space.

The applied NN corresponds to a fully connected feed-forward model with a single hidden layer consisting of 100 nodes. As activation functions, a hyperbolic tangent is chosen for the hidden layer and a sigmoid for the output layer. A preprocessing of the inputs is performed following the (x − µ)/σ rule, with the mean µ and the standard deviation σ derived independently for each input variable. The free parameters of the NN are fitted to the training data using the cross-entropy loss and the Adam optimizer algorithm [18]. The full training dataset with 10^5 elements is split into two equal halves. One half is used for the calculation of the gradients used by the optimizer. The other half is used as an independent validation dataset. The training is stopped if the loss does not improve on the validation dataset three times in a row (early stopping). The independent test dataset used to calculate the ⟨t_i⟩ consists of 10^5 elements. We use the software packages Keras [19] and TensorFlow [15] for the implementation of the NN and the calculation of the derivatives.

For simplicity we choose binary classification tasks with two inputs, x_1 and x_2. For the signal and background classes we sample Gaussian distributions with parameters as summarized in Table 1.
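A minimal sketch of how this setup and the metric of Eq. 1 could be implemented in Keras and TensorFlow [15,19] is given below. It assumes TensorFlow 2 and the task of Fig. 1b from Table 1; the helper names build_model and taylor_metrics, the use of tf.GradientTape, and the toy-data generation are illustrative choices of ours, not taken from the authors' implementation.

```python
import numpy as np
import tensorflow as tf

def build_model(n_inputs=2):
    # Fully connected feed-forward NN as described in section 2: one hidden
    # layer with 100 nodes (tanh) and a sigmoid output node.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(100, activation="tanh"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

def taylor_metrics(model, x_test):
    # Arithmetic mean of the absolute Taylor coefficients over the test
    # sample (Eq. 1), up to second order.
    x = tf.convert_to_tensor(x_test, dtype=tf.float32)
    with tf.GradientTape() as tape2:
        tape2.watch(x)
        with tf.GradientTape() as tape1:
            tape1.watch(x)
            y = model(x)                         # NN response, shape (N, 1)
        grad = tape1.gradient(y, x)              # t_{x_j} per sample, shape (N, D)
    hess = tape2.batch_jacobian(grad, x)         # second derivatives, shape (N, D, D)
    t1 = tf.reduce_mean(tf.abs(grad), axis=0)            # <t_{x_j}>
    t2 = tf.reduce_mean(tf.abs(0.5 * hess), axis=0)      # <t_{x_j,x_k}>, Taylor factor 1/2
    return t1.numpy(), t2.numpy()

# Toy data for the task of Fig. 1b: both classes centered at the origin,
# with opposite correlation between x_1 and x_2.
rng = np.random.default_rng(1)
n = 50000
x = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], n),
               rng.multivariate_normal([0, 0], [[1.0, -0.5], [-0.5, 1.0]], n)])
y = np.hstack([np.ones(n), np.zeros(n)])
idx = rng.permutation(len(x))                    # shuffle before the validation split
x, y = x[idx], y[idx]
x = (x - x.mean(axis=0)) / x.std(axis=0)         # (x - mu) / sigma preprocessing

model = build_model()
model.fit(x, y, validation_split=0.5, epochs=100,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])
t1, t2 = taylor_metrics(model, x[:10000])
print("<t_x1>, <t_x2>:", t1)
print("<t_xj,xk>:", t2)
```

Note that batch_jacobian materializes the full D × D matrix of second derivatives per sample, which is cheap for D = 2 but grows quadratically with the dimension of the input space.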
From the Taylor series we obtain two metrics, ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩, indicating the influence of the marginal distributions of x_1 and x_2, and three metrics, ⟨t_{x_1,x_2}⟩, ⟨t_{x_1,x_1}⟩, and ⟨t_{x_2,x_2}⟩, indicating the influence of the relation between x_1 and x_2, and the two auto-correlations. In the upper row of Fig. 1 the distributions of the (red) signal and (blue) background classes in the input space are shown, where darker colors indicate a higher sample density. In the lower row of Fig. 1 the values obtained for the ⟨t_i⟩ after the training are shown for each corresponding task.

For the task shown in Fig. 1a the signal and background classes are shifted against each other. In both classes x_1 and x_2 are uncorrelated and of equal spread. The classification task becomes most difficult along the off-diagonal axis between the two classes through the origin, and simpler if both x_1 and x_2 take large or small values at the same time. Correspondingly, ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩ obtain large values, indicating the separation power that is already caused by the marginal distributions of x_1 and x_2. The orientation of the two classes with respect to each other also results in a non-negligible contribution of ⟨t_{x_1,x_2}⟩ to the NN response.

For the task shown in Fig. 1b the signal and background classes are both centered at the origin of the input space, with equal spread in x_1 and x_2, but with different correlation coefficients in the covariance matrix. The classification task is most difficult at the origin of the input space and becomes simpler if x_1 and x_2 take large absolute values. Correspondingly, the relation between x_1 and x_2 is identified as the most influential feature by the value of ⟨t_{x_1,x_2}⟩. The fact that large absolute values of x_1 and x_2 support the separability of the two classes is expressed by the relatively large values for ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩. A combination of the examples of Fig. 1a and 1b is shown in Fig. 1c.

For the task shown in Fig. 1d the signal and background classes are both centered at the origin of the input space, with different spread. In both classes x_1 and x_2 are uncorrelated. According to the symmetry of the posed problem, the relation between x_1 and x_2 is expected to not strongly contribute to the separability of the signal and background classes. This is confirmed by the lower value of ⟨t_{x_1,x_2}⟩. Instead ⟨t_{x_1}⟩, ⟨t_{x_2}⟩, ⟨t_{x_1,x_1}⟩, and ⟨t_{x_2,x_2}⟩ take larger values, as expected from the previous discussion.

Table 1: Parameters defining the signal and background classes used for the tasks discussed in section 2. The parameters correspond to two-dimensional Gaussian distributions.

Task     | Mean value: Signal (x_1, x_2) | Background (x_1, x_2) | Covariance: Signal    | Background
Fig. 1a  | (0.5, 0.5)                    | (-0.5, -0.5)          | ((1, 0), (0, 1))      | ((1, 0), (0, 1))
Fig. 1b  | (0, 0)                        | (0, 0)                | ((1, 0.5), (0.5, 1))  | ((1, -0.5), (-0.5, 1))
Fig. 1c  | (0.5, 0.5)                    | (-0.5, -0.5)          | ((1, 0.5), (0.5, 1))  | ((1, -0.5), (-0.5, 1))
Fig. 1d  | (0, 0)                        | (0, 0)                | ((0.5, 0), (0, 0.5))  | ((1, 0), (0, 1))

Fig. 1: (Upper row) Contours of the distributions used in the examples for the signal (red) and background (blue) classes discussed in section 2, and (lower row) the corresponding metrics ⟨t_i⟩.

When evaluated at each minimization step during the training, the metrics ⟨t_i⟩ may serve as a tool to analyze the learning progress of the NN. We illustrate this for the task shown in Fig. 1c. In Fig. 2 the evolving values of each ⟨t_i⟩ are shown, as continuous lines of different color, for the first 700 gradient steps. The stopping criterion of the training is reached after 339 gradient steps (indicated by the red vertical line in the figure). We measure the performance of the NN in separating the signal from the background class by the area under the curve (AUC) of the receiver operating characteristic (ROC). We have added the AUC at each training step to the figure, with a separate axis on the right. A rough distinction of two phases can be stated. Approximately up to minimization step 30 the performance of the NN shows a steep rise up to a plateau value of 0.84 for the AUC.
This rise coincides with increasing values of ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩. Both metrics have the same progression, which can be explained by the symmetry of the task. Also the values for ⟨t_{x_1,x_2}⟩, ⟨t_{x_1,x_1}⟩, and ⟨t_{x_2,x_2}⟩ show an increase, though much less pronounced. Roughly 100 minimization steps later, a second, more shallow rise of the AUC sets in, coinciding with increasing values for ⟨t_{x_1,x_2}⟩. We interpret this in the following way. During the first phase the NN adapts to the first-order features related to ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩, which is the most obvious choice to separate the signal from the background class. During this phase the learning progress of the NN is concentrated in the areas of the input space with medium to large values of x_1 and x_2. In the second phase the relation between x_1 and x_2, as a second-order feature, gains influence. This is when the NN learning progress concentrates on the region of the input space where the signal and background classes overlap. It can be seen that the influence of the features related to ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩ decreases from minimization step 50 on. Apparently this influence has been overestimated at first and is successively replaced, giving more importance to the second-order features, which are more difficult to identify. From our knowledge of the truth, this is indeed the "more correct" assessment, which, from minimization step 250 on, also leads to another gain in performance.
Fig. 2: Values of the metrics ⟨t_i⟩, as defined in Eq. 1, evaluated at each gradient step of the NN training, for the task discussed in section 2 and shown in Fig. 1c. On the axis to the right the AUC of the ROC curve, as a measure of the NN performance in solving the task at each training step, is shown. The red vertical line indicates after how many gradient steps the predefined stopping criterion, given in section 2, has been met.

Note that by the end of the training the progression of ⟨t_{x_1,x_2}⟩ has not converged yet. The stopping criterion represents a measure of success and not a measure of truth. It might well have happened that the stopping criterion had been met already between gradient steps 50 and 100. In that case the NN output would have been based on the assessment that ⟨t_{x_1,x_2}⟩ plays a less important role: success would have ruled over truth. In our example the a priori known, more correct assessment leads to another performance gain after a few more gradient steps. Stopping the training before gradient step 100 would have missed this performance gain. We would like to emphasize that Fig. 2 is no more than a monitor to visualize what steps have led to the training result of the NN. This information can help to interpret both the features of the input space and the NN sensitivity to them. A different NN configuration might reveal a different sensitivity to any of the ⟨t_i⟩. Also, there is no claim of proof that the increase in ⟨t_{x_1,x_2}⟩ causes the increase in the AUC.

In the following we investigate the behavior of the ⟨t_i⟩ when applied to a more complex task, typical for data analyses in high-energy particle physics. For this purpose we exploit a dataset that was released in the context of the Higgs boson machine learning challenge [20], in 2014. This challenge was inspired by the discovery of a Higgs particle in collisions of high-energy proton beams at the CERN LHC, in 2012 [21,22]. The search for Higgs bosons in the final state with two τ leptons [23,24,25] at the LHC has two main characteristics of relevance for this challenge:

– a Higgs boson will be produced in only a tiny fraction of the recorded collisions;
– there is no unambiguous physical signature to distinguish collisions containing Higgs bosons (defining the signal class) from other collisions (defining the background class).

Consequently, for such a search the signal needs to be inferred from a larger number of (potentially related) physical quantities of the recorded collisions, using statistical methods, which makes the task suited also for NN applications. For the challenge a typical set of proton-proton collisions was simulated, of which only a small subset contained Higgs bosons in the final state with two τ leptons. Important physical quantities to distinguish the signal and background classes are the momenta of certain collision products in the plane transverse to the incoming proton beams; the invariant mass of pairs of certain collision products; and their angular position relative to each other and to the beam axis. In the context of the challenge the values of 30 such quantities were released, whose names and exact physical meaning are given in [20].
Seventeen of these variables are basic quantities, characterizing a collision from direct measurements; the rest, like all invariant mass quantities, are called derived variables and are computed from the basic quantities. These derived variables have a high power to distinguish the signal and background classes. Other variables, like the azimuthal angle φ of single collision products in the plane transverse to the incoming proton beams, have no separating power between the signal and background classes, due to the symmetry of the posed problem. The task is solved by the same NN model and training approach as described in section 2. Applied to all 30 input quantities this results in an AUC of 0.92 and an approximate median significance, as defined in [20], of 2.61.

In total, the 30 input quantities result in 495 first- and second-order features. For further discussion we rank these features according to their extracted influence on the NN output, based on the values of the corresponding ⟨t_i⟩, in decreasing order. In Fig. 3 the ⟨t_i⟩ for all features are shown, split into (orange) first- and (blue) second-order features. The distribution shows a rapidly falling trend, suggesting that only a small number of the investigated features significantly contributes to the solution of the task.

The most important input variable is identified as the invariant mass calculated from the kinematics of two distinguished particles in the collision, the identified hadronic τ lepton decay and the additional light-flavor lepton associated with a leptonic decay of the τ lepton, DER_mass_vis, as defined in [20]. This variable also belongs to the most important quantities to identify Higgs particles in the published analyses [23,24,25], with a strong relation to the invariant mass of the new particle. It is a peaking unimodal distribution in the signal class, with a broader distribution, peaking in a different position, in the background class. Among the 10 most influential features, it appears as the most influential first-order feature (in position 10), reflecting the difference in the position of the peak in the signal and background classes, and as part of six further second-order features, including the auto-correlation (in position 6), characterizing the difference in the width of the peak in the signal and background classes. The NN is thus able to identify the most important features of DER_mass_vis: its peak position and width. The usage of this variable in an NN analysis requires a good understanding not only of the marginal distribution but also of all relevant relations to other variables, which should be reflected in the uncertainty model.

The most influential feature is found to be the relation of DER_mass_vis with the ratio of the transverse momenta of the two particles that enter the calculation of this variable, named DER_pt_ratio_lep_tau. This feature is shown in Fig. 4, visualizing the gain of the relation over a pure marginal distribution on each individual axis. Features related to φ, on the other hand, are consequently ranked at the end of the list, as can be seen from Fig. 5, with the first occurrence in position 82. Apart from DER_mass_vis, only eight more inputs, which are all well motivated from the physics expectation, contribute to the upper 5% of the most influential features. When exposed to only these nine input quantities, the NN solves the task with an AUC and ROC curve identical, within the numerical precision, to the ones that we observe when using all 30 input quantities, indicating the potential to reduce the input space from 30 to 9 dimensions without significant loss of information. We refrain from a more detailed analysis of the complete list of features, which quickly turns very abstract and cannot be fully appreciated without deeper knowledge of the exact physical meaning of the input quantities. We conclude that the metric of Eq. 1 allows for a detailed understanding of the role of each input quantity, even without knowing their exact meaning, and quantitatively confirms the intuition of the high-energy particle physics analyses that have been performed during the search for the Higgs boson in 2012 and afterwards. We would like to emphasize that the reduction of the dimension of the input space (in the demonstrated case from 30 to 9), which can be achieved also by other methods, like the principal component analysis [26], is not the main goal of our investigation. The main goal is an improved and more intuitive understanding of the features of the input space and the sensitivity of the NN output to it.
Fig. 3: Metrics ⟨t_i⟩, as defined in Eq. 1, obtained from the 30 inputs of the task discussed in section 4. The ⟨t_i⟩ have been ranked by value, in descending order. A color coding identifies (orange) first-order and (blue) second-order features.

Fig. 4: Relation between the variables DER_mass_vis and DER_pt_ratio_lep_tau, as defined in [20] and discussed in section 4, shown in a subset of the input space. The red (blue) contours correspond to the signal (background) class. Darker colors indicate a higher sample density. This relation is identified as the most influential feature after the NN training.

Fig. 5: Occurrence of features containing primitive φ variables and occurrence of DER_mass_vis, as discussed in section 4, in the ranked list of features.
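As a closing illustration, the ranking of the 495 features discussed above (for D = 30 inputs: 30 first-order features plus 30·31/2 = 465 second-order features, counting each auto-correlation once) could be produced as follows, reusing the hypothetical taylor_metrics helper sketched in section 2. The function rank_features and the naming scheme are again our own illustrative choices.

```python
import numpy as np

def rank_features(t1, t2, names):
    # t1: <t_{x_j}>, shape (D,); t2: <t_{x_j,x_k}>, shape (D, D);
    # names: the D input-variable names.
    d = len(names)
    feats = [(names[j], float(t1[j])) for j in range(d)]
    for j in range(d):
        for k in range(j, d):              # k = j keeps the auto-correlations
            feats.append((f"{names[j]}, {names[k]}", float(t2[j, k])))
    # Sort by the metric of Eq. 1 in descending order, as in Fig. 3.
    return sorted(feats, key=lambda f: f[1], reverse=True)

# Hypothetical usage with the 30 HiggsML input quantities:
# t1, t2 = taylor_metrics(model, x_test)
# for name, value in rank_features(t1, t2, variable_names)[:10]:
#     print(f"{value:.4f}  {name}")
```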
We have discussed the usage of the coefficients t_i from a Taylor expansion in each element of the input space {x_j} to identify the characteristics of the input space with the largest influence on the NN output. For practical reasons we have restricted the discussion to the expansion up to second order, concentrating on the characteristics of marginal distributions of input elements x_j, or relations between them, referred to as first- and second-order features. We propose the arithmetic mean of the absolute value of a corresponding Taylor coefficient, ⟨t_i⟩, built from the whole input space, as a metric to quantify the influence of the corresponding feature on the NN output. We have illustrated the relation between features and the corresponding ⟨t_i⟩ with the help of simple tasks emphasizing single features or relations between them. Evaluating the ⟨t_i⟩ at each step of the NN training allows for the analysis and monitoring of the learning process of the NN. Finally, we have applied the proposed metrics to a more complex task common to high-energy particle physics and found that the most important features, known from physics analyses, are reliably identified, while features known to be irrelevant are also identified as such. We consider this a first step towards identifying those characteristics of the NN input space that have the largest influence on the NN output, in the context of tasks typical for high-energy particle physics experiments. As shown for the example in section 4, these most influential characteristics may well correspond to relations between different inputs or to auto-correlations, and not just to the marginal distributions of single inputs. In subsequent steps the quantification of systematic uncertainties in the NN inputs can be concentrated on these most relevant inputs.

References
1. Junk, T.: Confidence level computation for combining searches with small statistics. Nuclear Instruments and Methods in Physics Research (2) (1999) 435
2. Read, A.L.: Presentation of search results: the CLs technique. Journal of Physics G: Nuclear and Particle Physics (10) (2002) 2693
3. The ATLAS and CMS collaborations: Procedure for the LHC Higgs boson search combination in summer 2011. Technical report, ATL-PHYS-PUB-2011-011, CMS NOTE 2011/005 (2011)
4. The CMS collaboration: Combined results of searches for the Standard Model Higgs boson in pp collisions at √s = 7 TeV. Phys. Lett. B (2012) 26
5. Cowan, G., Cranmer, K., Gross, E., Vitells, O.: Asymptotic formulae for likelihood-based tests of new physics. Eur. Phys. J. C (2) (2011) 1554
6. Metodiev, E., Nachman, B., Thaler, J.: Classification without labels: Learning from mixed samples in high energy physics. arXiv:1708.02949 (2017)
7. Dery, L.M., Nachman, B., Rubbo, F., Schwartzman, A.: Weakly supervised classification in high energy physics. Journal of High Energy Physics (5) (2017) 145
8. Komiske, P., Metodiev, E., Nachman, B., Schwartz, M.: Learning to classify from impure samples with high-dimensional data. arXiv:1801.10158 (2018)
9. Cohen, T., Freytsis, M., Ostdiek, B.: (Machine) Learning to Do More with Less. arXiv:1706.09451 (2018)
10. Louppe, G., Kagan, M., Cranmer, K.: Learning to pivot with adversarial networks. In: Advances in Neural Information Processing Systems (2017) 982
11. de Oliveira, L., Kagan, M., Mackey, L., Nachman, B., Schwartzman, A.: Jet-Images – Deep Learning Edition. arXiv:1511.05190 (2017)
12. Chang, S., Cohen, T., Ostdiek, B.: What is the machine learning? arXiv:1709.10106 (2018)
13. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE (7) (2015)
14. Montavon, G., Lapuschkin, S., Binder, A., Samek, W., Müller, K.R.: Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition (2017) 211
15. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 (2016)
16. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch (2017)
17. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: A CPU and GPU math compiler in Python. In: Proc. 9th Python in Science Conf. (2010) 1
18. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
19. Chollet, F., et al.: Keras. https://keras.io (2015)
20. Adam-Bourdarios, C., Cowan, G., Germain, C., Guyon, I., Kegl, B., Rousseau, D.: Learning to discover: the Higgs boson machine learning challenge. https://higgsml.lal.in2p3.fr/documentation/ Visited on January 3, 2018
21. The CMS collaboration: Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Phys. Lett. B (1) (2012) 30
22. The ATLAS collaboration: Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Phys. Lett. B (1) (2012) 1
23. The CMS collaboration: Evidence for the 125 GeV Higgs boson decaying to a pair of τ leptons. JHEP (2014) 104
24. The ATLAS collaboration: Evidence for the Higgs-boson Yukawa coupling to τ leptons with the ATLAS detector. JHEP (2015) 117
25. The CMS collaboration: Observation of the Higgs boson decay to a pair of τ leptons with the CMS detector. Phys. Lett. B (2018) 283
26. Abdi, H., Williams, L.J.: Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2