Identifying the relevant dependencies of the neural network response on characteristics of the input space
Published by "Computing and Software for Big Science" (DOI: 10.1007/s41781-018-0012-1)
Stefan Wunsch · Raphael Friese · Roger Wolf · Günter Quast

Abstract
The relation between the input and output spaces of neural networks (NNs) is investigated to identify those characteristics of the input space that have a large influence on the output for a given task. For this purpose, the NN function is decomposed into a Taylor expansion in each element of the input space. The Taylor coefficients contain information about the sensitivity of the NN response to the inputs. A metric is introduced that allows for the identification of the characteristics that mostly determine the performance of the NN in solving a given task. Finally, the capability of this metric to analyze the performance of the NN is evaluated based on a task common to data analyses in high-energy particle physics experiments.
A neural network (NN) is a multi-parameter system, which, depending on its architecture, can consist of several thousands of weight and bias parameters, subject to one or more non-linear activation functions. Each of these adjustable parameters obtains its concrete value and meaning by minimization during the training process. Thus the same NN can be applied to several concrete tasks, which are only defined at the training step.
Stefan Wunsch, [email protected]
Raphael Friese, [email protected]
Roger Wolf, [email protected]
Günter Quast, [email protected]

Karlsruhe Institute of Technology, Institute of Experimental Particle Physics, Karlsruhe, Germany
CERN, Geneva, Switzerland
In applications in high-energy particle physics, which are supposed to distinguish a signal from one or more backgrounds, the training sample is obtained either from simulation or from an independent dataset without overlap with the sample of interest, to which the NN is applied. Usually the NN output itself is then subject to a detailed likelihood-based hypothesis test, to infer the presence and yield of the signal [1,2,3,4,5]. The likelihood may include information on the shape of a variable that is supposed to discriminate signal from background. This shape could (while it does not have to) be e.g. the output of an NN. Apart from one or more parameters of interest the hypothesis test may comprise several hundreds of nuisance parameters, steering the response of the test statistic on a corresponding set of uncertainties. The nuisance parameters can be correlated or uncorrelated with the shape of the discriminating variable and (directly or indirectly) depend on the response of the NN output on its input variables.

These kinds of analyses connect the observation of a measurement to a hypothesized truth. For NN applications they pose the intrinsic problem that, beyond statistical fluctuations, congruency between the training sample and the sample of interest may not be given. Deviations need to be identified and quantified within the uncertainty model of the hypothesis test. They may occur not only in the description of single input variables to the NN, but also in correlations across input variables, even if the marginal distributions of the individual input variables are reproduced. An NN can be sensitive to correlations across input variables; in fact this sensitivity is the main reason for potential performance gains, with respect to other approaches, like e.g. profile likelihoods. To make sure that this performance gain is not feigned, in addition to the marginal distributions, all correlations across input variables need to be carefully checked, and their influence on the test statistic identified and eventually mapped into the uncertainty model of the hypothesis test. The complexity of this methodology motivates the interest, not only in keeping the number of inputs to the NN at a manageable level, but above all in identifying those characteristics of the input space to the NN with the largest influence on the NN output. The definition of the uncertainty model of the hypothesis test can then be concentrated on these most influential characteristics.

This approach sets the scope of this study to not more than a few tens, up to a few hundred, partially highly correlated input variables in the context of particle physics experiments, or comparable applications. It differs from the approaches of weak supervision [6,7,8,9] and pivoting with adversaries [10] that have been discussed in the literature. Weak supervision tries to circumvent the problem that we are describing by replacing an originally ground-truth labeled training by a training based on unlabeled training data. The corresponding samples can be obtained from the data themselves. They do not depend on a simulation and may be chosen to be unbiased. This approach is well justified in classification tasks that are based just on the characteristics of the predefined training data. In the analyses that we are discussing the classification is tied to the hypothesized truth.
Replacing the ground-truth labeled training by unlabeled input data does not solve the problem that we are discussing. Our discussion is also beyond the scope of pivoting with adversaries, for which the mismodellings to address have to be known beforehand. Our discussion sets in at an earlier stage, which is the most complete identification of all uncertainties that can be of relevance for the physics analysis. After the most influential features of the input space have been identified, the method of pivoting with adversaries could be applied to mitigate potential mismodellings. A related approach to extract information about the characteristics of the input space is to flatten the distributions of sub-spaces so that possible discriminating features vanish [11,12]. From the performance degradation after retraining the NN on the modified inputs, information about the discriminating power of the respective sub-space can be obtained. However, this approach does not allow to evaluate the dependencies of the response of a unique NN function on the characteristics of the input space, since each retrained function may have learned different features.

So far, the questions we are raising have been addressed by methods that have been proposed to relate the output of NNs with certain regions of input pixels in the context of image classification [13,14]. These methods use only first-order derivatives of the NN function to propagate the output back layer by layer. What we propose is a Taylor expansion of the full NN function up to an arbitrary order, which allows to connect the input space directly to the NN output. While with this study we will demonstrate the application of the Taylor expansion only up to second order, we explicitly propose a generalization towards higher-order derivatives in the Taylor expansion to capture relations across variables, which usually play a more important role in data analyses in high-energy particle physics experiments.

Due to the high-performance computation of derivatives in modern software frameworks used for the implementation of NNs [15,16,17], this expansion can be obtained at each point of the input space, even if this space is of high dimension. In this way, the sensitivity of the NN response to the input space can be analyzed by the gradient of the NN function. For practical reasons we stop the expansion at second order. To facilitate the following interpretation, we define a feature to be a characteristic of a single element or a pair-wise relation between two elements of the input space. The first class of features relates to the coefficients of the expansion to first order (first-order features); the second class to the coefficients of the second-order expansion (second-order features). First-order features capture the influence of single input elements on the NN output throughout the input space; second-order features the influence of pair-wise or auto-correlations among the input elements. It is obvious that, depending on the given task, a certain feature can have a large influence on the output of the NN in a certain region of the input space, while it is less important in others.
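Spelled out to second order, the expansion around a point {a_j} of the input space reads as follows. This is the standard multivariate Taylor series; the factor 1/2 in the second-order coefficients is the usual Taylor normalization, which the text does not fix explicitly, so it should be read as our convention:

\[
f(\{x_j\}) \;=\; f(\{a_j\}) \;+\; \sum_j t_{x_j}\,(x_j - a_j) \;+\; \sum_{j,k} t_{x_j,x_k}\,(x_j - a_j)(x_k - a_k) \;+\; \dots
\]
\[
t_{x_j} = \frac{\partial f}{\partial x_j}\bigg|_{\{a_j\}}, \qquad t_{x_j,x_k} = \frac{1}{2}\,\frac{\partial^2 f}{\partial x_j\,\partial x_k}\bigg|_{\{a_j\}}.
\]

The coefficients t_{x_j} quantify the local sensitivity to single inputs (first-order features), the diagonal coefficients t_{x_j,x_j} the auto-correlations, and the off-diagonal coefficients t_{x_j,x_k} the pair-wise relations between inputs (second-order features).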
We propose the arithmetic mean of the absolute value of the corresponding Taylor coefficient, computed from the input space defined by the task to be solved,

\[
\langle t_i \rangle \;\equiv\; \frac{1}{N} \sum_{k=1}^{N} \bigl| t_i(\{x_j\}|_k) \bigr|, \qquad i \in \mathcal{P}(\{x_j\}), \tag{1}
\]

as a metric for the influence of a given feature of the input space on the output, where the sum runs over the whole testing sample of size N, t_i corresponds to the coefficients of the Taylor expansion, {x_j}|_k to the set of variables spanning the input space, evaluated for element k of the testing sample, and i is an element of the powerset of {x_j}. It should be noted that the ⟨t_i⟩ characterize the input space (as covered by the test data) and the sensitivity of the NN to it, after training, as a whole.

In section 2 we illustrate this choice with the help of four simple tasks emphasizing certain single features of the input space or their combination. In section 3 we point out that, when evaluated at each step of the minimization during the training process, the ⟨t_i⟩ can be utilized to illustrate and monitor the training process and the learning strategies adopted by the NN. In section 4 we show the application of the ⟨t_i⟩ to a more realistic task common to data analyses in high-energy particle physics experiments. Such tasks usually have the following attributes, which are of relevance for the following discussion:

– they consist of not more than several tens of important input parameters, which leads to a moderate dimensionality of the posed problem;
– they may rely on relations between elements more than they rely on single elements of the input space;
– they usually pose problems where a signal and a background class cannot be separated based on single or few input variables, but only from the combination of several input variables;
– they require a good understanding of the NN performance to turn the output into a reliable measurement.

In the following we illustrate the relation of the ⟨t_i⟩ to certain features of the input space.

The applied NN corresponds to a fully connected feed-forward model with a single hidden layer consisting of 100 nodes. As activation functions, a hyperbolic tangent is chosen for the hidden layer and a sigmoid for the output layer. A preprocessing of the inputs is performed following the (x − µ)/σ rule, with the mean µ and the standard deviation σ derived independently for each input variable. The free parameters of the NN are fitted to the training data using the cross-entropy loss and the Adam optimizer algorithm [18]. The full training dataset with 10^5 elements is split into two equal halves. One half is used for the calculation of the gradients used by the optimizer. The other half is used as an independent validation dataset. The training is stopped if the loss does not improve on the validation dataset three times in a row (early stopping). The independent test dataset used to calculate the ⟨t_i⟩ consists of 10^5 elements. We use the software packages Keras [19] and TensorFlow [15] for the implementation of the NN and the calculation of the derivatives.

For simplicity we choose binary classification tasks with two inputs, x_1 and x_2. For the signal and background classes we sample Gaussian distributions with parameters as summarized in Table 1.
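A minimal sketch of how this setup and the metric of Eq. 1 could be implemented in Keras and TensorFlow [15,19] is given below. It assumes TensorFlow 2 and the task of Fig. 1b from Table 1; the helper names build_model and taylor_metrics, the use of tf.GradientTape, and the toy-data generation are illustrative choices of ours, not taken from the authors' implementation.

```python
import numpy as np
import tensorflow as tf

def build_model(n_inputs=2):
    # Fully connected feed-forward NN as described in section 2: one hidden
    # layer with 100 nodes (tanh) and a sigmoid output node.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(100, activation="tanh"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

def taylor_metrics(model, x_test):
    # Arithmetic mean of the absolute Taylor coefficients over the test
    # sample (Eq. 1), up to second order.
    x = tf.convert_to_tensor(x_test, dtype=tf.float32)
    with tf.GradientTape() as tape2:
        tape2.watch(x)
        with tf.GradientTape() as tape1:
            tape1.watch(x)
            y = model(x)                         # NN response, shape (N, 1)
        grad = tape1.gradient(y, x)              # t_{x_j} per sample, shape (N, D)
    hess = tape2.batch_jacobian(grad, x)         # second derivatives, shape (N, D, D)
    t1 = tf.reduce_mean(tf.abs(grad), axis=0)            # <t_{x_j}>
    t2 = tf.reduce_mean(tf.abs(0.5 * hess), axis=0)      # <t_{x_j,x_k}>, Taylor factor 1/2
    return t1.numpy(), t2.numpy()

# Toy data for the task of Fig. 1b: both classes centered at the origin,
# with opposite correlation between x_1 and x_2.
rng = np.random.default_rng(1)
n = 50000
x = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], n),
               rng.multivariate_normal([0, 0], [[1.0, -0.5], [-0.5, 1.0]], n)])
y = np.hstack([np.ones(n), np.zeros(n)])
idx = rng.permutation(len(x))                    # shuffle before the validation split
x, y = x[idx], y[idx]
x = (x - x.mean(axis=0)) / x.std(axis=0)         # (x - mu) / sigma preprocessing

model = build_model()
model.fit(x, y, validation_split=0.5, epochs=100,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])
t1, t2 = taylor_metrics(model, x[:10000])
print("<t_x1>, <t_x2>:", t1)
print("<t_xj,xk>:", t2)
```

Note that batch_jacobian materializes the full D × D matrix of second derivatives per sample, which is cheap for D = 2 but grows quadratically with the dimension of the input space.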
From the Taylor series we obtain two metrics, ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩, indicating the influence of the marginal distributions of x_1 and x_2, and three metrics, ⟨t_{x_1,x_2}⟩, ⟨t_{x_1,x_1}⟩, and ⟨t_{x_2,x_2}⟩, indicating the influence of the relation between x_1 and x_2, and the two auto-correlations. In the upper row of Fig. 1 the distributions of the (red) signal and (blue) background classes in the input space are shown, where darker colors indicate a higher sample density. In the lower row of Fig. 1 the values obtained for the ⟨t_i⟩ after the training are shown for each corresponding task.

For the task shown in Fig. 1a the signal and background classes are shifted against each other. In both classes x_1 and x_2 are uncorrelated and of equal spread. The classification task becomes most difficult along the off-diagonal axis between the two classes through the origin, and simpler if both x_1 and x_2 take large or small values at the same time. Correspondingly, ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩ obtain large values, indicating the separation power that is already caused by the marginal distributions of x_1 and x_2. The orientation of the two classes with respect to each other also results in a non-negligible contribution of ⟨t_{x_1,x_2}⟩ to the NN response.

For the task shown in Fig. 1b the signal and background classes are both centered at the origin of the input space, with equal spread in x_1 and x_2, but with different correlation coefficients in the covariance matrix. The classification task is most difficult at the origin of the input space and becomes simpler if x_1 and x_2 take large absolute values. Correspondingly, the relation between x_1 and x_2 is identified as the most influential feature by the value of ⟨t_{x_1,x_2}⟩. The fact that large absolute values of x_1 and x_2 support the separability of the two classes is expressed by the relatively large values for ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩. A combination of the examples of Fig. 1a and 1b is shown in Fig. 1c.

For the task shown in Fig. 1d the signal and background classes are both centered at the origin of the input space, with different spread. In both classes x_1 and x_2 are uncorrelated. According to the symmetry of the posed problem, the relation between x_1 and x_2 is expected to not strongly contribute to the separability of the signal and background classes. This is confirmed by the lower value of ⟨t_{x_1,x_2}⟩. Instead ⟨t_{x_1}⟩, ⟨t_{x_2}⟩, ⟨t_{x_1,x_1}⟩, and ⟨t_{x_2,x_2}⟩ take larger values, as expected from the previous discussion.

Table 1: Parameters defining the signal and background classes used for the tasks discussed in section 2. The parameters correspond to two-dimensional Gaussian distributions.

Task     | Mean value: Signal (x_1, x_2) | Background (x_1, x_2) | Covariance: Signal    | Background
Fig. 1a  | (0.5, 0.5)                    | (-0.5, -0.5)          | ((1, 0), (0, 1))      | ((1, 0), (0, 1))
Fig. 1b  | (0, 0)                        | (0, 0)                | ((1, 0.5), (0.5, 1))  | ((1, -0.5), (-0.5, 1))
Fig. 1c  | (0.5, 0.5)                    | (-0.5, -0.5)          | ((1, 0.5), (0.5, 1))  | ((1, -0.5), (-0.5, 1))
Fig. 1d  | (0, 0)                        | (0, 0)                | ((0.5, 0), (0, 0.5))  | ((1, 0), (0, 1))

Fig. 1: (Upper row) Contours of the distributions used in the examples for the signal (red) and background (blue) classes discussed in section 2, and (lower row) the corresponding metrics ⟨t_i⟩.

When evaluated at each minimization step during the training, the metrics ⟨t_i⟩ may serve as a tool to analyze the learning progress of the NN. We illustrate this for the task shown in Fig. 1c. In Fig. 2 the evolving values of each ⟨t_i⟩ are shown, as continuous lines of different color, for the first 700 gradient steps. The stopping criterion of the training is reached after 339 gradient steps (indicated by the red vertical line in the figure). We measure the performance of the NN in separating the signal from the background class by the area under the curve (AUC) of the receiver operating characteristic (ROC). We have added the AUC at each training step to the figure, with a separate axis on the right. A rough distinction of two phases can be stated. Approximately up to minimization step 30 the performance of the NN shows a steep rise up to a plateau value of 0.84 for the AUC.
This rise coincides with increasing values of ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩. Both metrics have the same progression, which can be explained by the symmetry of the task. Also the values for ⟨t_{x_1,x_2}⟩, ⟨t_{x_1,x_1}⟩, and ⟨t_{x_2,x_2}⟩ show an increase, though much less pronounced. Roughly 100 minimization steps later, a second, more shallow rise of the AUC sets in, coinciding with increasing values for ⟨t_{x_1,x_2}⟩. We interpret this in the following way. During the first phase the NN adapts to the first-order features related to ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩, which is the most obvious choice to separate the signal from the background class. During this phase the learning progress of the NN is concentrated in the areas of the input space with medium to large values of x_1 and x_2. In the second phase the relation between x_1 and x_2, as a second-order feature, gains influence. This is when the NN learning progress concentrates on the region of the input space where the signal and background classes overlap. It can be seen that the influence of the features related to ⟨t_{x_1}⟩ and ⟨t_{x_2}⟩ decreases from minimization step 50 on. Apparently this influence has been overestimated at first and is successively replaced, giving more importance to the second-order features, which are more difficult to identify. From our knowledge of the truth, this is indeed the "more correct" assessment, which, from minimization step 250 on, also leads to another gain in performance.
Fig. 2: Values of the metrics ⟨t_i⟩, as defined in Eq. 1, evaluated at each gradient step of the NN training, for the task discussed in section 2 and shown in Fig. 1c. On the axis to the right the AUC of the ROC curve, as a measure of the NN performance in solving the task at each training step, is shown. The red vertical line indicates after how many gradient steps the predefined stopping criterion, given in section 2, has been met.

Note that by the end of the training the progression of ⟨t_{x_1,x_2}⟩ has not converged yet. The stopping criterion represents a measure of success and not a measure of truth. It might well have happened that the stopping criterion had been met already between gradient steps 50 and 100. In that case the NN output would have been based on the assessment that ⟨t_{x_1,x_2}⟩ plays a less important role: success would have ruled over truth. In our example the a priori known, more correct assessment leads to another performance gain after a few more gradient steps. Stopping the training before gradient step 100 would have missed this performance gain. We would like to emphasize that Fig. 2 is no more than a monitor to visualize what steps have led to the training result of the NN. This information can help to interpret both the features of the input space and the NN sensitivity to them. A different NN configuration might reveal a different sensitivity to any of the ⟨t_i⟩. Also, there is no claim of proof that the increase in ⟨t_{x_1,x_2}⟩ causes the increase in the AUC.

In the following we investigate the behavior of the ⟨t_i⟩ when applied to a more complex task, typical for data analyses in high-energy particle physics. For this purpose we exploit a dataset that was released in the context of the Higgs boson machine learning challenge [20], in 2014. This challenge was inspired by the discovery of a Higgs particle in collisions of high-energy proton beams at the CERN LHC, in 2012 [21,22]. The search for Higgs bosons in the final state with two τ leptons [23,24,25] at the LHC has two main characteristics of relevance for this challenge:

– a Higgs boson will be produced in only a tiny fraction of the recorded collisions;
– there is no unambiguous physical signature to distinguish collisions containing Higgs bosons (defining the signal class) from other collisions (defining the background class).

Consequently, for such a search the signal needs to be inferred from a larger number of (potentially related) physical quantities of the recorded collisions, using statistical methods, which makes the task suited also for NN applications. For the challenge a typical set of proton-proton collisions was simulated, of which only a small subset contained Higgs bosons in the final state with two τ leptons. Important physical quantities to distinguish the signal and background classes are the momenta of certain collision products in the plane transverse to the incoming proton beams; the invariant mass of pairs of certain collision products; and their angular position relative to each other and to the beam axis. In the context of the challenge the values of 30 such quantities were released, whose names and exact physical meaning are given in [20].
Seventeen of these variables are basic quantities, characterizing a collision from direct measurements; the rest, like all invariant mass quantities, are called derived variables and are computed from the basic quantities. These derived variables have a high power to distinguish the signal and background classes. Other variables, like the azimuthal angle φ of single collision products in the plane transverse to the incoming proton beams, have no separating power between the signal and background classes, due to the symmetry of the posed problem. The task is solved by the same NN model and training approach as described in section 2. Applied to all 30 input quantities this results in an AUC of 0.92 and an approximate median significance, as defined in [20], of 2.61.

In total, the 30 input quantities result in 495 first- and second-order features. For further discussion we rank these features according to their extracted influence on the NN output, based on the values of the corresponding ⟨t_i⟩, in decreasing order. In Fig. 3 the ⟨t_i⟩ for all features are shown, split into (orange) first- and (blue) second-order features. The distribution shows a rapidly falling trend, suggesting that only a small number of the investigated features significantly contributes to the solution of the task.

The most important input variable is identified as the invariant mass calculated from the kinematics of two distinguished particles in the collision, the identified hadronic τ lepton decay and the additional light-flavor lepton associated with a leptonic decay of the τ lepton, DER_mass_vis, as defined in [20]. This variable also belongs to the most important quantities to identify Higgs particles in the published analyses [23,24,25], with a strong relation to the invariant mass of the new particle. It is a peaking unimodal distribution in the signal class, with a broader distribution, peaking in a different position, in the background class. Among the 10 most influential features, it appears as the most influential first-order feature (in position 10), reflecting the difference in the position of the peak in the signal and background classes, and as part of six further second-order features, including the auto-correlation (in position 6), characterizing the difference in the width of the peak in the signal and background classes. The NN is thus able to identify the most important features of DER_mass_vis: its peak position and width. The usage of this variable in an NN analysis requires a good understanding not only of the marginal distribution but also of all relevant relations to other variables, which should be reflected in the uncertainty model.

The most influential feature is found to be the relation of DER_mass_vis with the ratio of the transverse momenta of the two particles that enter the calculation of this variable, named DER_pt_ratio_lep_tau. This feature is shown in Fig. 4, visualizing the gain of the relation over a pure marginal distribution on each individual axis. Features related to φ, on the other hand, are consequently ranked at the end of the list, as can be seen from Fig. 5, with the first occurrence in position 82. Apart from DER_mass_vis, only eight more inputs, which are all well motivated from the physics expectation, contribute to the upper 5% of the most influential features. When exposed to only these nine input quantities, the NN solves the task with an AUC and ROC curve identical, within the numerical precision, to the ones that we observe when using all 30 input quantities, indicating the potential to reduce the input space from 30 to 9 dimensions without significant loss of information. We refrain from a more detailed analysis of the complete list of features, which quickly turns very abstract and cannot be fully appreciated without deeper knowledge of the exact physical meaning of the input quantities. We conclude that the metric of Eq. 1 allows for a detailed understanding of the role of each input quantity, even without knowing their exact meaning, and quantitatively confirms the intuition of the high-energy particle physics analyses that have been performed during the search for the Higgs boson in 2012 and afterwards. We would like to emphasize that the reduction of the dimension of the input space (in the demonstrated case from 30 to 9), which can be achieved also by other methods, like the principal component analysis [26], is not the main goal of our investigation. The main goal is an improved and more intuitive understanding of the features of the input space and the sensitivity of the NN output to it.
Fig. 3: Metrics ⟨t_i⟩, as defined in Eq. 1, obtained from the 30 inputs of the task discussed in section 4. The ⟨t_i⟩ have been ranked by value, in descending order. A color coding identifies (orange) first-order and (blue) second-order features.

Fig. 4: Relation between the variables DER_mass_vis and DER_pt_ratio_lep_tau, as defined in [20] and discussed in section 4, shown in a subset of the input space. The red (blue) contours correspond to the signal (background) class. Darker colors indicate a higher sample density. This relation is identified as the most influential feature after the NN training.

Fig. 5: Occurrence of features containing primitive φ variables and occurrence of DER_mass_vis, as discussed in section 4, in the ranked list of features.
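As a closing illustration, the ranking of the 495 features discussed above (for D = 30 inputs: 30 first-order features plus 30·31/2 = 465 second-order features, counting each auto-correlation once) could be produced as follows, reusing the hypothetical taylor_metrics helper sketched in section 2. The function rank_features and the naming scheme are again our own illustrative choices.

```python
import numpy as np

def rank_features(t1, t2, names):
    # t1: <t_{x_j}>, shape (D,); t2: <t_{x_j,x_k}>, shape (D, D);
    # names: the D input-variable names.
    d = len(names)
    feats = [(names[j], float(t1[j])) for j in range(d)]
    for j in range(d):
        for k in range(j, d):              # k = j keeps the auto-correlations
            feats.append((f"{names[j]}, {names[k]}", float(t2[j, k])))
    # Sort by the metric of Eq. 1 in descending order, as in Fig. 3.
    return sorted(feats, key=lambda f: f[1], reverse=True)

# Hypothetical usage with the 30 HiggsML input quantities:
# t1, t2 = taylor_metrics(model, x_test)
# for name, value in rank_features(t1, t2, variable_names)[:10]:
#     print(f"{value:.4f}  {name}")
```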
We have discussed the usage of the coefficients t_i from a Taylor expansion in each element of the input space {x_j} to identify the characteristics of the input space with the largest influence on the NN output. For practical reasons we have restricted the discussion to the expansion up to second order, concentrating on the characteristics of marginal distributions of input elements x_j, or relations between them, referred to as first- and second-order features. We propose the arithmetic mean of the absolute value of a corresponding Taylor coefficient, ⟨t_i⟩, built from the whole input space, as a metric to quantify the influence of the corresponding feature on the NN output. We have illustrated the relation between features and the corresponding ⟨t_i⟩ with the help of simple tasks emphasizing single features or relations between them. Evaluating the ⟨t_i⟩ at each step of the NN training allows for the analysis and monitoring of the learning process of the NN. Finally, we have applied the proposed metrics to a more complex task common to high-energy particle physics and found that the most important features, known from physics analyses, are reliably identified, while features known to be irrelevant are also identified as such. We consider this a first step towards identifying those characteristics of the NN input space that have the largest influence on the NN output, in the context of tasks typical for high-energy particle physics experiments. As shown for the example in section 4, these most influential characteristics may well correspond to relations between different inputs or to auto-correlations, and not just to the marginal distributions of single inputs. In subsequent steps the quantification of systematic uncertainties in the NN inputs can be concentrated on these most relevant inputs.

References
1. Junk, T.: Confidence level computation for combining searches with small statistics. Nuclear Instruments and Methods in Physics Research (2) (1999) 435
2. Read, A.L.: Presentation of search results: the CLs technique. Journal of Physics G: Nuclear and Particle Physics (10) (2002) 2693
3. The ATLAS and CMS collaborations: Procedure for the LHC Higgs boson search combination in summer 2011. Technical report, ATL-PHYS-PUB-2011-011, CMS NOTE 2011/005 (2011)
4. The CMS collaboration: Combined results of searches for the Standard Model Higgs boson in pp collisions at √s = 7 TeV. Phys. Lett. B (2012) 26
5. Cowan, G., Cranmer, K., Gross, E., Vitells, O.: Asymptotic formulae for likelihood-based tests of new physics. Eur. Phys. J. C (2) (2011) 1554
6. Metodiev, E., Nachman, B., Thaler, J.: Classification without labels: Learning from mixed samples in high energy physics. arXiv:1708.02949 (2017)
7. Dery, L.M., Nachman, B., Rubbo, F., Schwartzman, A.: Weakly supervised classification in high energy physics. Journal of High Energy Physics (5) (2017) 145
8. Komiske, P., Metodiev, E., Nachman, B., Schwartz, M.: Learning to classify from impure samples with high-dimensional data. arXiv:1801.10158 (2018)
9. Cohen, T., Freytsis, M., Ostdiek, B.: (Machine) Learning to Do More with Less. arXiv:1706.09451 (2018)
10. Louppe, G., Kagan, M., Cranmer, K.: Learning to pivot with adversarial networks. In: Advances in Neural Information Processing Systems (2017) 982
11. de Oliveira, L., Kagan, M., Mackey, L., Nachman, B., Schwartzman, A.: Jet-Images – Deep Learning Edition. arXiv:1511.05190 (2017)
12. Chang, S., Cohen, T., Ostdiek, B.: What is the machine learning? arXiv:1709.10106 (2018)
13. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE (7) (2015)
14. Montavon, G., Lapuschkin, S., Binder, A., Samek, W., Müller, K.R.: Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition (2017) 211
15. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 (2016)
16. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch (2017)
17. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: A CPU and GPU math compiler in Python. In: Proc. 9th Python in Science Conf. (2010) 1
18. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
19. Chollet, F., et al.: Keras. https://keras.io (2015)
20. Adam-Bourdarios, C., Cowan, G., Germain, C., Guyon, I., Kegl, B., Rousseau, D.: Learning to discover: the Higgs boson machine learning challenge. https://higgsml.lal.in2p3.fr/documentation/ Visited on January 3, 2018
21. The CMS collaboration: Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Phys. Lett. B (1) (2012) 30
22. The ATLAS collaboration: Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Phys. Lett. B (1) (2012) 1
23. The CMS collaboration: Evidence for the 125 GeV Higgs boson decaying to a pair of τ leptons. JHEP (2014) 104
24. The ATLAS collaboration: Evidence for the Higgs-boson Yukawa coupling to τ leptons with the ATLAS detector. JHEP (2015) 117
25. The CMS collaboration: Observation of the Higgs boson decay to a pair of τ leptons with the CMS detector. Phys. Lett. B (2018) 283
26. Abdi, H., Williams, L.J.: Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2