Model-Independent Detection of New Physics Signals Using Interpretable Semi-Supervised Classifier Tests
Purvasha Chakravarti, Mikael Kuusela, Jing Lei, Larry Wasserman
Department of Mathematics, Imperial College London, [email protected]
Department of Statistics & Data Science and NSF AI Planning Institute for Physics of the Future, Carnegie Mellon University, [email protected]; [email protected]
Department of Statistics & Data Science, Carnegie Mellon University, [email protected]

A central goal in experimental high energy physics is to detect new physics signals that are not explained by known physics. In this paper, we aim to search for new signals that appear as deviations from known Standard Model physics in high-dimensional particle physics data. To do this, we determine whether there is any statistically significant difference between the distribution of Standard Model background samples and the distribution of the experimental observations, which are a mixture of the background and a potential new signal. Traditionally, one also assumes access to a sample from a model for the hypothesized signal distribution. Here we instead investigate a model-independent method that does not make any assumptions about the signal and uses a semi-supervised classifier to detect the presence of the signal in the experimental data. We construct three test statistics using the classifier: an estimated likelihood ratio test (LRT) statistic, a test based on the area under the ROC curve (AUC), and a test based on the misclassification error (MCE). Additionally, we propose a method for estimating the signal strength parameter and explore active subspace methods to interpret the proposed semi-supervised classifier in order to understand the properties of the detected signal. We investigate the performance of the methods on a data set related to the search for the Higgs boson at the Large Hadron Collider at CERN.
We demonstrate that the semi-supervised tests have power competitive with the classical supervised methods for a well-specified signal, but much higher power for an unexpected signal which might be entirely missed by the supervised tests.
1. Introduction.
Statistical and machine learning tools have been extensively used over the past few decades to answer fundamental questions in particle physics (Bhat, 2011; Behnke et al., 2013). To answer these questions, one needs to experimentally test the predictions of the Standard Model, which describes our current understanding of elementary particles and their interactions. For example, the empirical confirmation of the Higgs boson at CERN in 2012 was an essential step towards its inclusion in the Standard Model (Aad et al., 2012; Chatrchyan et al., 2012).

In experiments conducted with large particle accelerators such as the Large Hadron Collider (LHC), the searches for new physics signals in the high-dimensional experimental data have been primarily conducted using model-dependent data analysis methods. These searches are generally structured as likelihood ratio tests based on a model assumption for the specific new signal (Cowan et al., 2011; ATLAS Collaboration and CMS Collaboration, 2011). Due to the high-dimensionality of the data, tests based on classifiers that are optimized to detect a particular hypothesized signal are preferred over density estimation or mixture model approaches. Specifically, tests based on supervised, multivariate classification algorithms such as neural networks and boosted decision trees have proved beneficial. The classifier output is then used to extract a signal-enriched sample which is used to perform likelihood ratio tests for the detection of the signal (Aad et al., 2012; Chatrchyan et al., 2012).

The training signal samples for the classifier are generated using Monte Carlo (MC) event generators, which enable sampling collision events from specific physics models. But there are disadvantages to this approach. First, this approach may have trouble finding novel, unexpected deviations from the background model. Second, systematic errors in the signal model can be influential on supervised classifiers, and so any error or imprecision in the signal model might adversely affect the method. Figure 1 illustrates the problem.

Keywords and phrases: collective anomaly detection, active subspace, mixture proportion estimation, signal strength estimation, likelihood ratio test, high-dimensional two-sample testing, Large Hadron Collider.

Fig 1: Decision boundary using a supervised classifier to separate the signal (red) from the background (grey). (a) The decision boundary of the classifier when trained on signal generated from the assumed signal model. (b) When there is a systematic error in the training signal data, the classifier completely misses the actual signal data when used on the experimental samples. We consider a two-dimensional example here for illustration purposes only. The real data can be of much higher dimensionality. In the experiments considered in this paper, we have a 16-dimensional sample.
If a classifier is trained on training signal data as shown in Figure 1(a), it gives the classification boundary as shown. But what if the signal data actually looks like Figure 1(b)? Then the classifier ends up misclassifying the signal as background. So an algorithm trained on a wrong signal model might completely miss the actual signal. The two-dimensional example considered here is for illustration purposes only. In reality, the data lie in a high-dimensional space which further aggravates the problem.

In this paper, we study model-independent tests for new physics signals. We use data from an event generator for background events together with experimental data, which are a mixture of background and a potential signal. But we do not use data from signal simulations. In other words, we assume access to labelled background data and unlabelled experimental data, where events may either have background or signal labels. We then use a semi-supervised approach that trains a classifier to differentiate the background data from the experimental data. Crucially, we do not assume availability of labelled signal data, which differentiates this approach from model-dependent or supervised methods.

In particular, we make three contributions:
1. We investigate several variants of classifier-based model-independent tests to detect the signal. We use the trained classifier mentioned above to construct three different test statistics to detect the presence of a signal in the experimental data and compare them. We
also compare bootstrap, permutation and asymptotic methods to obtain the null distribution of the tests.
2. We propose a method to estimate the signal strength in the experimental data.
3. We use active subspace methods to interpret the detected signal, which is a challenging problem in model-independent searches since we do not have information about the signal model to characterize the signal.

Related to our first contribution, one of the model-independent tests that we use for detecting the signal is similar to the test proposed by D'Agnolo and Wulzer (2019) and D'Agnolo et al. (2021). It is based on estimating the likelihood ratio test (LRT) statistic using the classifier output. The other two tests are based on the performance of the classifier measured using the area under the ROC curve (AUC) and the misclassification error (MCE). To the best of our knowledge, these two are novel in the present application. All the tests are based on the fact that in the absence of a signal in the experimental data, a classifier should not be able to differentiate the experimental data from the background data. This is analogous to a high-dimensional two-sample testing problem where we compare the distributions of the background and the experimental data (Kim et al., 2016, 2019). Throughout the paper, we leverage the fact that classifiers have been found to give accurate estimates of density ratios (Cranmer, Pavez and Louppe, 2015).

Though the test that uses the LRT is similar to the test proposed by D'Agnolo and Wulzer (2019) and D'Agnolo et al. (2021), unlike them, we use the nonparametric bootstrap as well as permutation methods to estimate the null distribution, comparing the performances of the different re-sampling techniques. We also apply the tests in a higher-dimensional setting than them.
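The core idea that a classifier's held-out performance measures the discrepancy between the background and experimental samples can be sketched with a toy permutation test. The crude mean-difference "classifier", the Gaussian toy data and all sample sizes below are illustrative assumptions, not the paper's random-forest implementation:

```python
import numpy as np

def held_out_mce(bg, ex, rng):
    """Fit a crude linear rule (projection onto the difference of class means)
    on half of each sample; return its misclassification error on the rest."""
    def split(a):
        idx = rng.permutation(len(a))
        half = len(a) // 2
        return a[idx[:half]], a[idx[half:]]
    bg_tr, bg_te = split(bg)
    ex_tr, ex_te = split(ex)
    w = ex_tr.mean(axis=0) - bg_tr.mean(axis=0)                 # separating direction
    c = 0.5 * (ex_tr.mean(axis=0) + bg_tr.mean(axis=0)) @ w     # midpoint cutoff
    fpr = np.mean(bg_te @ w > c)    # background called "experimental"
    fnr = np.mean(ex_te @ w < c)    # experimental called "background"
    return 0.5 * (fpr + fnr)

def permutation_pvalue(bg, ex, n_perm=200, seed=1):
    """Left-tailed permutation p-value: a small held-out MCE indicates a signal."""
    rng = np.random.default_rng(seed)
    obs = held_out_mce(bg, ex, rng)
    pooled, n_bg = np.vstack([bg, ex]), len(bg)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null.append(held_out_mce(pooled[idx[:n_bg]], pooled[idx[n_bg:]], rng))
    return obs, (1 + np.sum(np.array(null) <= obs)) / (1 + n_perm)

rng = np.random.default_rng(0)
bg = rng.normal(size=(500, 2))                         # background sample
ex = np.vstack([rng.normal(size=(350, 2)),             # experimental sample with a
                rng.normal(loc=3.0, size=(150, 2))])   # 30% "unexpected" signal cluster
obs, pval = permutation_pvalue(bg, ex)
```

Under the null, label permutations leave the pooled sample exchangeable, so the held-out MCE concentrates near 0.5; the signal cluster pulls the observed MCE well below it.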
Additionally, since our test has a simple null versus a composite alternative, the LRT is not guaranteed to yield an optimal test in this setting. Hence it is important to compare the performances of the different test statistics. We compare the three different test statistics as mentioned above.

Towards our second contribution of estimating the signal strength, we show that estimating the signal strength is equivalent to estimating a monotone density at a boundary point. In particular, we need to estimate the density of a statistic built by considering the quantile of a Neyman–Pearson style likelihood ratio. Similar to the estimator given by Storey (2002), which is a special case of a boundary density estimator using histograms, we use histograms to estimate the density, and then use a Poisson regression (Nelder and Wedderburn, 1972) model of the histogram bin counts to estimate the density at the boundary point. The additional use of Poisson regression makes the boundary estimate more stable.

For our third contribution, to interpret the classifier and characterize the signal, we propose using active subspace methods (Constantine, Dow and Wang, 2014; Constantine, 2015) to explore which part of the covariates is the most informative for the test. Active subspaces have been used as a form of sensitivity analysis to quantify uncertainty in an output (Constantine et al., 2015) and they have also been used to analyze the internal structure and vulnerability of deep neural networks (Cui et al., 2020). Similar to these works, in this paper, we propose active subspace methods to interpret the variability in the classifier output in terms of its inputs. The proposed methods study the directions that capture the most variability in the gradient of the classifier. This gives us information about the combinations of the covariates that most influence the classifier in distinguishing the experimental data from the background data, hence giving us an idea of the signal.
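A minimal numerical sketch of the active subspace idea, assuming a hypothetical smooth classifier h whose output depends on its four inputs only through one direction; the eigendecomposition of the gradient outer-product matrix then recovers that direction as the leading eigenvector:

```python
import numpy as np

# Hypothetical smooth "classifier": its output depends on the 4 inputs only
# through the direction u = (1, 1, 0, 0)/sqrt(2), which the active subspace
# analysis should recover as the dominant eigenvector.
def h(x):
    t = (x[..., 0] + x[..., 1]) / np.sqrt(2.0)
    return 1.0 / (1.0 + np.exp(-t))

def numerical_grad(f, X, eps=1e-5):
    """Central finite-difference gradient of f at each row of X."""
    G = np.zeros_like(X)
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[:, j] = eps
        G[:, j] = (f(X + E) - f(X - E)) / (2.0 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))          # evaluation points (e.g. a background sample)
G = numerical_grad(h, X)
C = G.T @ G / len(X)                    # gradient outer-product ("active subspace") matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues returned in ascending order
top = eigvecs[:, -1]                    # direction of largest classifier variability
```

Here the trailing eigenvalue dominates and `top` aligns (up to sign) with (1, 1, 0, 0)/√2, i.e. the combination of covariates driving the classifier.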
Importantly, this process helps us understand the classifier that detects the signal, which otherwise would be a black box.

Alternative variable importance methods have been used to understand the intrinsic predictiveness potential of the covariates (Van der Laan, 2006; Lei et al., 2018; Williamson et al., 2020), which can be used to understand the individual covariates that play an important role in the classifier. There exists a particularly rich literature on variable importance for random forests (Breiman, 2001; Ishwaran et al., 2007; Strobl et al., 2008; Grömping, 2009) and neural networks (Bach et al., 2015; Shrikumar, Greenside and Kundaje, 2017; Sundararajan, Taly and Yan, 2017). But contrary to the active subspace methods, these methods concentrate on finding the variables that are individually the most important and potentially miss detecting combinations of covariates that might have more predictive power.

The advantage of model-independent tests is that they can detect discrepancies between the background data and the experimental data irrespective of the distribution of the signal events. In the case that a signal is found in the experimental data, the signal should be investigated further in order to decide if it results from (a) an inaccurate background MC generator, (b) a particle detector defect or a lack of understanding of the detector, or (c) a previously unknown physics process. The active subspace methods can provide a preliminary indication in this process.

While there are advantages to using model-independent methods, there are disadvantages too. One disadvantage is that when a reliable signal model is available, the supervised approach should yield greater power since it uses the information available about the signal model. Another disadvantage is that when a reliable background model is unavailable, we cannot compare the experimental data to the background data to find discrepancies.
So, the model-independent and model-dependent methods are complementary approaches and not superior to each other. Depending on the context, one performs better than the other.

1.1. Related Work.
Due to their advantages, model-independent approaches have been used for new physics searches at the Tevatron (Aaltonen et al., 2009; Bertram et al., 2012), HERA (Aktas et al., 2004), and the LHC (CMS Collaboration, 2017; Aaboud et al., 2019; CMS Collaboration, 2020). These methods typically compare a large set of binned distributions to the prediction from the background Monte Carlo simulation, in search of bins in the experimental data that exhibit a deviation larger than some predefined threshold. For example, the approach of Aaboud et al. (2019), employed by the ATLAS experiment, uses a (quasi-)model-independent method that considers some generic features of the potential new physics signals. These approaches have two limitations: (a) they do not consider the multivariate dependency structures between the variables in the data and (b) they might miss certain signals that do not show a localized excess in one of the studied distributions.

More recent approaches like CWoLa (Classification Without Labels) Bump Hunting (Collins, Howe and Nachman, 2018, 2019), ANODE (Nachman and Shih, 2020), SALAD (Andreassen, Nachman and Shih, 2020) and simulation augmented CWoLa (SA-CWoLa) (Kasieczka et al., 2021) are also based on searching for anomalies by assuming that the signal is localized in a single feature. These approaches use this weak assumption about the signal, which causes them to have the second limitation mentioned above, namely they might miss signals that do not show a localized excess along one of the features. Kasieczka et al. (2021) describe a variety of current methods for signal detection along with results on the LHC Olympics 2020 datasets.

A different approach that uses a semi-supervised nonparametric clustering algorithm is presented by Casa and Menardi (2018). They assume that in high energy physics, a new particle manifests itself as a significant peak emerging from the background process.
They use nonparametric modal clustering to search for a signal that is expected to emerge as a bump in the background distribution. The BuHuLaSpa and Bump Hunter methods presented in Kasieczka et al. (2021) also use similar bump hunting ideas to find the signal. UCluster, presented in Kasieczka et al. (2021), uses clustering to find the localized signal. These methods suffer from the second problem mentioned above: they can only find localized signals.

Model-independent semi-supervised searches were also proposed by Kuusela et al. (2012) and Vatanen et al. (2012), who use multivariate Gaussian mixture models to estimate the
densities of the background and the experimental data. They test for the significance of the additional Gaussian components which quantify the anomalous contribution. The drawback of this method is that Gaussian mixture models are very difficult to fit in a high-dimensional setting. Additionally, since the signal strength is typically very small, the quality of the fit influences the power of the test in detecting the signal.

These drawbacks of the mixture modeling methods, along with the fact (as mentioned before) that classification algorithms have demonstrated excellent performance in detecting signals in the model-dependent approaches, especially in the high-dimensional setting, motivate us to use classifiers instead of mixture modeling to find the deviations of the experimental data from the background.

The problem of comparing the distributions of the background and the experimental data is also analogous to a high-dimensional two-sample testing problem (Kim et al., 2016, 2019), where the signal events appear as a collective anomaly (Becks and Perret-Gallix, 1994; Chandola, Banerjee and Kumar, 2009) in a cluster close to each other. Hence we also compare the methods proposed in this paper to the nearest-neighbor two-sample tests introduced in Schilling (1986) and Henze (1988) for a sub-sample of the data. In the experiments, we observe that methods that use a semi-supervised classifier have much higher power to detect the signal than the nearest-neighbor two-sample tests.

1.2. Outline.
In the following section, Section 2, we first introduce the problem setup mathematically. We then describe comparable supervised methods in Section 2.1. The proposed model-independent semi-supervised methods are introduced in Section 2.2. We introduce tests based on the LRT, the AUC and the MCE in this section. We then discuss methods to estimate the signal strength in the experimental data in Section 3. In Section 4 we describe active subspace methods to understand the subspace affecting the classifier the most, leading to an understanding of the detected signal. Finally, in Section 5, we demonstrate the performance of the proposed methods and compare them to the supervised approaches. We include concluding remarks and possible future directions of research in Section 6. We also include exploratory analysis of the Higgs boson data set, which is used to perform the experiments in Section 5, and some other experimental details that are left out of the main paper in the Supplementary Material (Chakravarti et al., 2021). We defer the proofs of the theorems and some of the proposed algorithms to the Appendix.
2. Classifier Based Tests for Signal Detection.
In this section, we introduce both model-dependent and model-independent approaches to signal detection in the experimental particle physics data. Both approaches use background data from MC event generators for background events based on the Standard Model, together with experimental data, which are a mixture of background and a potential signal. First, we discuss the model-dependent approach that additionally assumes access to signal data that are generated using MC event generators for a hypothesized signal model. Then we describe the model-independent approach that does not assume access to the signal data. We now introduce formal notation to discuss the two different approaches.

The background, signal and experimental data are samples from Poisson point processes (Cranmer, 2015; Reiss, 2012). We condition on the sample sizes of the individual datasets so that the data in all three of the cases may be treated as independent samples from a density. Specifically, we have three datasets:

(1)
Background: X = {X_1, ..., X_{m_b}}, X_i ~ p_b,
Signal: Y = {Y_1, ..., Y_{m_s}}, Y_i ~ p_s,
Experimental: W = {W_1, ..., W_n}, W_i ~ q = (1 − λ) p_b + λ p_s,

where p_b, p_s are the densities of the background and signal data respectively, q is the density of the experimental data and λ is a scalar parameter representing the signal strength.

The likelihood function for λ given the experimental data W (treating p_b and p_s as known for the moment) is

(2) L(λ) = ∏_i ((1 − λ) p_b(W_i) + λ p_s(W_i)).

The null hypothesis (H_0) that there is no signal corresponds to λ = 0. So the goal is to test H_0: λ = 0 versus H_1: λ > 0. We additionally have the likelihood ratio

(3) L(λ)/L(0) = ∏_i (1 − λ + λ ψ(W_i)), where ψ(w) = p_s(w)/p_b(w).
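The mixture model (1) and the likelihood ratio (3) can be sketched numerically in the idealized case where ψ is known. The one-dimensional Gaussian stand-ins for p_b and p_s below are hypothetical placeholders chosen only so that ψ has a closed form; the MLE of λ is obtained by a grid search, as the paper does in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
m_b, n, lam_true = 5000, 2000, 0.1

# Toy one-dimensional stand-ins for the densities (illustrative only):
# p_b = N(0, 1) and p_s = N(3, 1), so psi(w) = p_s(w)/p_b(w) = exp(3w - 4.5).
X = rng.normal(0.0, 1.0, size=m_b)              # background sample ~ p_b (setup (1))
is_sig = rng.random(n) < lam_true               # each event is signal w.p. lambda
W = np.where(is_sig,
             rng.normal(3.0, 1.0, size=n),      # signal component ~ p_s
             rng.normal(0.0, 1.0, size=n))      # background component ~ p_b

def log_lr(lam, psi_W):
    """log L(lam)/L(0) = sum_i log(1 - lam + lam * psi(W_i)), as in (3)."""
    return np.log1p(lam * (psi_W - 1.0)).sum()

psi_W = np.exp(3.0 * W - 4.5)                   # density ratio, known in this toy case
grid = np.linspace(0.0, 0.99, 500)
lam_hat = grid[np.argmax([log_lr(l, psi_W) for l in grid])]
```

With the components well separated, the grid maximizer `lam_hat` lands close to the true mixing proportion 0.1.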
Note that the function ψ can be seen as an infinite-dimensional nuisance parameter.

In the idealized case where p_b and p_s are known, we could use the usual likelihood ratio test (LRT) statistic

(4) T = −2 log(L(0)/L(λ̂)) = 2 Σ_i log(1 − λ̂ + λ̂ ψ(W_i)),

where λ̂ is the maximum likelihood estimator (MLE) of λ based on the experimental data W. Alternatively, we could use the score test statistic

(5) T = (1/n) Σ_i (ψ(W_i) − 1).

In practice, the densities p_b and p_s are unknown. The two different approaches described in this section show how a classifier can be used to estimate the desired statistics directly. We prefer estimating the density ratio required for the desired statistics directly instead of taking the ratio of the estimated densities. This is due to the high-dimensionality of the data, which makes estimating the high-dimensional density with limited data very difficult.

2.1. The Model-Dependent (Supervised) Case.
In this case, we make use of the signal data Y, where Y_1, ..., Y_{m_s} ~ p_s are generated using a MC event generator for a hypothesized signal model. Such an approach is standard in most new physics searches at the LHC, and serves here as a point of comparison against the model-independent methods. The strategy underlying our implementation of this approach is to use a classifier to estimate ψ = p_s/p_b and then use the LRT or the score test with the estimated ψ. Since ψ is estimated, we cannot rely on standard asymptotics to get the null distribution. Instead we use permutation or nonparametric bootstrap methods.

Before we train a classifier, we first combine the background and signal data into a single dataset {Z_1, ..., Z_{m_b+m_s}} = X ∪ Y = {X_1, ..., X_{m_b}, Y_1, ..., Y_{m_s}} and we define S_i = 1 if Z_i is from the signal data Y and S_i = 0 otherwise. We treat (Z_i, S_i) as a sample from a density p(z, s) with π := P(S = 1) = m_s/(m_b + m_s), the probability of any sample being from the signal distribution.

(Footnote: In most LHC analyses, ψ is currently used to extract a signal-enriched subset of the data which is then used in a low-dimensional parametric fit to form a test statistic (Radovic et al., 2018). Here we use instead the ψ-based high-dimensional LRT or score test to provide an apples-to-apples comparison with the model-independent methods.)

Let h(z) := P(S = 1 | Z = z) denote the probability of a sample being from the signal distribution given Z = z. Then using Bayes' rule,

h(z) = P(S = 1 | Z = z) = π ψ(z) / (π ψ(z) + (1 − π)).

Inverting this,

ψ(z) = ((1 − π)/π) (h(z)/(1 − h(z))).
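The inversion above is purely algebraic and can be checked numerically; a minimal sketch in which the lognormal ψ values and the value of π are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = 0.4                                 # P(S = 1) = m_s / (m_b + m_s), placeholder value
psi = rng.lognormal(size=1000)           # arbitrary positive density-ratio values

# Bayes' rule: the classifier regression function expressed through the ratio psi.
h = pi * psi / (pi * psi + (1.0 - pi))

# Inverting the classifier output recovers the density ratio.
psi_recovered = ((1.0 - pi) / pi) * h / (1.0 - h)
```

This round trip is what lets a probabilistic classifier serve as a density-ratio estimator: plugging an estimate ĥ into the inversion yields an estimate of ψ.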
This leads to our estimate of ψ,

(6) ψ̂(z) = ((1 − π)/π) (ĥ(z)/(1 − ĥ(z))),

where ĥ(z) is a classifier that separates the signal data Y from the background data X. In this paper, we take ĥ to be a random forest, but in principle, any classifier can be used instead.

To use ψ̂(z) to calculate the LRT statistic in (4), we estimate λ using its MLE as follows. We can write the likelihood of the experimental data as

L(λ) = ∏_i p_b(W_i) × ∏_i (1 − λ + λ ψ(W_i)).

The first term does not involve λ so we can ignore it. Hence,

L(λ) ∝ ∏_i (1 − λ + λ ψ(W_i)).

We then define the estimated likelihood

L̂(λ) ∝ ∏_i (1 − λ + λ ψ̂(W_i))

and we define λ̂ to be the maximizer of L̂(λ). We can now use λ̂ and ψ̂ to estimate the LRT and the score test statistics as given in (4) and (5).

To see the effect of maximizing the estimated likelihood using the estimated ψ instead of maximizing the actual likelihood, let ℓ(λ) = log L(λ), let ℓ̂(λ) = log L̂(λ), and note that, for small λ,

ℓ(λ) = λ Σ_i (ψ(W_i) − 1) − (λ²/2) Σ_i (ψ(W_i) − 1)² + o_P(λ²) + C,

where C = Σ_i log(p_b(W_i)) is just a constant and can be ignored. A similar relation holds for ℓ̂(λ) and ψ̂. So

n⁻¹ ℓ̂(λ) − n⁻¹ ℓ(λ) = λ n⁻¹ Σ_i (ψ̂(W_i) − ψ(W_i)) + o_P(λ),

which shows how the accuracy of the classifier affects the log-likelihood. The maximizer λ̃ of ℓ(λ) and the maximizer λ̂ of ℓ̂(λ) are

(7) λ̃ = [Σ_i (ψ(W_i) − 1) / Σ_i (ψ(W_i) − 1)²]_+ + o_P(λ), λ̂ = [Σ_i (ψ̂(W_i) − 1) / Σ_i (ψ̂(W_i) − 1)²]_+ + o_P(λ).
Hence, λ̂ − λ̃ = O_P(n⁻¹ Σ_i (ψ̂(W_i) − ψ(W_i))), emphasizing again the importance of an accurate classifier. Note that in practice, instead of using the above approximation, we evaluate the MLE of λ by performing a grid search on [0, 1].

As n → ∞, the usual likelihood ratio test statistic satisfies (Ghosh and Sen, 1984; Böhning et al., 1994)

(8) T = 2 Σ_i log((1 − λ̂) + λ̂ ψ(W_i)) →_d (1/2) δ_0 + (1/2) χ²_1,

where δ_0 is a degenerate distribution at 0. But when ψ̂ is substituted for ψ, the asymptotic distribution is unknown, so the null distribution needs to be estimated by simulating from the background model. If the available background data is limited, we can use permutation or nonparametric bootstrap methods; see Section 5.2.

Similar remarks apply to the score statistic T = n⁻¹ Σ_i (ψ(W_i) − 1) and its estimated version T̂ = n⁻¹ Σ_i (ψ̂(W_i) − 1). There are conditions under which T̂ has a tractable distribution. Suppose that ψ̂ is estimated on one part of the data and T̂ on another. Now

√n T̂ = √n T + √n (n⁻¹ Σ_i (ψ̂(W_i) − ψ(W_i))).

The first term converges to N(0, σ²) where σ² = E_{p_b}[(ψ(W) − 1)²]. If ψ is in a Hölder class of smoothness index β and β > d/2, then it can be shown that n⁻¹ Σ_i (ψ̂(W_i) − ψ(W_i)) = O_P(n^{−2β/(2β+d)}). Hence, if β > d/2, the second term is negligible so that √n T̂ ⇝ N(0, σ²). However, we have found that the Normal approximation is poor in practice and instead we use permutation and bootstrap methods to approximate the null distribution. Similar to the LRT, the null distribution can also be estimated by simulating from the background model.

To use nonparametric bootstrap or permutation methods, we randomly split the available background data X into two sets X_1 and X_2.
We use the first set X_1 along with the signal data Y to train the classifier ĥ. We use the second set X_2 along with the experimental data W to approximate the null distributions of the LRT and the score statistics. Note that, under the null H_0: λ = 0, the experimental data W and background data X_2 have the same distribution p_b. In the case of the nonparametric bootstrap, we approximate the null distributions by repeatedly sampling with replacement from X_2 ∪ W, and computing the test statistics. For the permutation test, we do the same using sampling without replacement.

2.2. Model-Independent (Semi-Supervised) Case.
In this case, we assume that we do not have access to (or do not completely trust) the signal training sample Y = {Y_1, ..., Y_{m_s}}. So the data available are X = {X_1, ..., X_{m_b}} and W = {W_1, ..., W_n}, where X_i ~ p_b and W_i ~ q = (1 − λ) p_b + λ p_s, with p_b, p_s and λ unknown. We want to test H_0: λ = 0 versus H_1: λ > 0, which is equivalent to testing H_0: p_b = q versus H_1: p_b ≠ q. Hence we are in a two-sample testing scenario (Kim et al., 2016, 2019).

Again, we want to leverage the fact that classifiers have been found to give accurate estimates of density ratios (Friedman, Hastie and Tibshirani, 2001; Goodfellow et al., 2014; Cranmer, Pavez and Louppe, 2015). One strategy is to use a classifier like before to obtain a likelihood ratio test statistic, but this time we estimate the density ratio ψ† = q/p_b. To do this, we train a classifier to differentiate between the experimental (W) and background events (X) instead of the signal (Y) and background events (X). As mentioned in the introduction, this strategy is similar to the one taken by D'Agnolo et al. (2021) and D'Agnolo and Wulzer (2019), who use a neural network to estimate the likelihood ratio test statistic.

A second strategy is to use the area under the curve statistic (AUC) (Hanley and McNeil, 1982) or the misclassification error/misclassification rate (MCE) to evaluate the performance of the classifier. The intuition behind the second strategy is that in the absence of a signal under the null, a classifier should not be able to differentiate between the background and the experimental data. We discuss both strategies below.

As before, let h denote a classifier in the combined (background and experimental) sample; then

ψ†(z) = ((1 − π)/π) (h(z)/(1 − h(z))),

where now π = n/(n + m_b).
This leads to our estimate of ψ†,

(9) ψ̂†(z) = ((1 − π)/π) (ĥ(z)/(1 − ĥ(z))),

where ĥ(z) is the trained classifier.

The likelihood ratio as given in (3) can be rewritten as

L(λ)/L(0) = ∏_i ψ†(W_i).

Hence, the LRT statistic is given by

(10) T = 2 Σ_i log ψ̂†(W_i).

Similar to the arguments presented in the model-dependent case, since we are estimating ψ† using ψ̂†, the usual asymptotics do not hold for the LRT statistic. Instead we propose a conditional asymptotic test. We split the background and the experimental data into training and test data and estimate ĥ on the training data. We then use the test experimental data to calculate the test statistic T and use the test background data to approximate the null distribution. Unlike the chi-squared null distribution considered by D'Agnolo et al. (2021) and D'Agnolo and Wulzer (2019), we instead use an asymptotic Normal distribution to approximate the null distribution. The detailed algorithm can be found below in Method B.1. We additionally propose nonparametric bootstrap and permutation methods to approximate the null distribution as well.

For the second strategy that uses the performance of the classifier ĥ in separating the background data from the experimental data, we consider two statistics: the AUC (Area Under the Curve) statistic and the MCE (Misclassification Error) statistic.

The AUC summarizes the ROC curve by calculating the area under it. The ROC curve (Metz, 1978; Hanley et al., 1989) demonstrates the performance of a classifier by plotting the true positive rate (TPR) versus the false positive rate (FPR) at various threshold settings of the classifier output. For example, for our classifier ĥ(z), given a threshold parameter t, an instance Z is classified as experimental ("positive") if ĥ(Z) > t and background ("negative") otherwise.
Now Z ~ q if Z actually belongs to the experimental class and Z ~ p_b if Z actually belongs to the background class. Therefore, the TPR is given by TPR(t) = P_q(ĥ(Z) > t) and the FPR is given by FPR(t) = P_{p_b}(ĥ(Z) > t), where P_q is the probability when Z ~ q and P_{p_b} is the probability when Z ~ p_b. Since the ROC curve plots TPR(t) (y-axis) versus FPR(t) (x-axis) with varying t, the AUC θ is given by

θ = ∫_0^1 TPR(FPR⁻¹(x)) dx = P(ĥ(W) > ĥ(X)),

by a standard derivation. So, the AUC can also be interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Hence we can estimate it using

(11) θ̂ = (1/(m_b n)) Σ_i Σ_j I{ĥ(W_j) > ĥ(X_i)}.

Under the null H_0: λ = 0, we have that q = p_b, i.e., X and W have the same distribution. So under H_0, θ = 0.5, and an AUC that is significantly greater than 0.5 provides evidence that q ≠ p_b. In other words, testing H_0: λ = 0 versus H_1: λ > 0 is equivalent to testing H_0: θ = 0.5 versus H_1: θ > 0.5.

We can additionally use the MCE (misclassification error/classification error rate) to measure the performance of the classifier (Kim et al., 2016), where we define the MCE as

M = 0.5 P(ĥ(X) > π) + 0.5 P(ĥ(W) < π).

Note that this is the average of the false positive rate (first term) and the false negative rate (second term) for threshold π = n/(n + m_b). This can be estimated using

(12) M̂ = (1/(2 m_b)) Σ_i I{ĥ(X_i) > π} + (1/(2n)) Σ_j I{ĥ(W_j) < π}.

Under the null, H_0: λ = 0 and as a result q = p_b, i.e., X_i =_d W_j. Hence, M = 0.5 under the null. The classifier will have a true accuracy significantly above half (and hence M below half) only if q ≠ p_b.
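As a concrete sketch of the estimators (11) and (12), assuming classifier scores are already in hand; the uniform and Beta score distributions below are hypothetical placeholders, not outputs of the paper's classifier:

```python
import numpy as np

def auc_stat(h_bg, h_exp):
    """AUC estimate as in (11): fraction of (background, experimental) score
    pairs that the classifier ranks in the correct order."""
    return np.mean(h_exp[None, :] > h_bg[:, None])

def mce_stat(h_bg, h_exp, pi):
    """MCE estimate as in (12): average of the false positive and false
    negative rates at the fixed threshold pi = n / (n + m_b)."""
    return 0.5 * np.mean(h_bg > pi) + 0.5 * np.mean(h_exp < pi)

rng = np.random.default_rng(0)
# Hypothetical classifier scores: under H0 both samples share one score
# distribution; under a signal, experimental scores are stochastically larger.
h_bg = rng.uniform(size=1000)
h_exp_null = rng.uniform(size=1000)
h_exp_sig = rng.beta(3.0, 2.0, size=1000)

auc_null, auc_sig = auc_stat(h_bg, h_exp_null), auc_stat(h_bg, h_exp_sig)
mce_null, mce_sig = mce_stat(h_bg, h_exp_null, 0.5), mce_stat(h_bg, h_exp_sig, 0.5)
```

Both estimators concentrate near 0.5 under the null scores, while the shifted scores push the AUC above 0.5 and the MCE below it, which is exactly the behavior the tests exploit.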
Hence we can use $\hat M$ as a test statistic and test $H_0: M = 0.5$ versus $H_1: M < 0.5$.

As with the LRT statistic, the tests using the AUC and the MCE statistics can also be performed using asymptotic, bootstrap and permutation methods. The asymptotic AUC method, unlike the asymptotic LRT and MCE methods, does not use a conditional test; for the AUC we derive the asymptotic distribution of the statistic using results presented in Newcombe (2006). These methods also use data splitting, where we use the training data to fit the classifier $\hat h$ and the test data to perform the test. We detail the nonparametric bootstrap and one of the permutation methods in Method 2.1 below. We provide algorithms for the other methods, including the asymptotic methods using the three statistics (LRT, AUC and MCE) and the slower in-sample permutation method, in Appendix B.

REMARK. All the model-independent methods presented in this section, except the in-sample permutation method (Method B.4), split the background and the experimental data into training and test data in order to approximate the null distribution and to perform the test. This has the benefit that the classifier can be kept fixed when estimating the null distribution. Here splitting a sample means randomly splitting it into two disjoint subsamples. The training data are used to train the classifier $\hat h$ and the test data are used to perform the signal detection test. The in-sample permutation method (Method B.4), on the other hand, re-trains the classifier on permuted background and experimental data when approximating the null. This has the advantage of using all the data to train the classifier, but the computing time is much longer due to the classifier re-training required when obtaining the null distribution.

METHOD 2.1 (Bootstrap and permutation tests).
1. Split the background data $X = \{X_1, \dots, X_{m_b}\}$ into $\mathcal{X}_1$ and $\mathcal{X}_2$ of sizes $m_1$ and $m_2$ respectively.
2. Split the experimental data $W = \{W_1, \dots, W_n\}$ into $\mathcal{W}_1$ and $\mathcal{W}_2$ of sizes $n_1$ and $n_2$ respectively, with $n_2 = m_2$.
3. Train the classifier $\hat h$ on $\mathcal{X}_1$ and $\mathcal{W}_1$.
4. Evaluate the LRT statistic $T$ on $\mathcal{W}_2$ using (9) and (10) as

(13) $\tilde T = \frac{T}{2n_2} = \log\left(\frac{1-\pi}{\pi}\right) + \frac{1}{n_2}\sum_{W_i\in\mathcal{W}_2}\log\left(\frac{\hat h(W_i)}{1-\hat h(W_i)}\right),$

where $\pi = n_1/(m_1+n_1)$. Similarly evaluate the AUC statistic $\hat\theta$, as defined in Equation (11), and the MCE statistic $\hat M$, as defined in Equation (12), on $\mathcal{X}_2$ and $\mathcal{W}_2$ as

(14) $\hat\theta = \frac{1}{m_2 n_2}\sum_{X_i\in\mathcal{X}_2}\sum_{W_j\in\mathcal{W}_2} I\{\hat h(W_j) > \hat h(X_i)\}$

(15) $\hat M = \frac{1}{2m_2}\sum_{X_i\in\mathcal{X}_2} I\{\hat h(X_i) > \pi\} + \frac{1}{2n_2}\sum_{W_j\in\mathcal{W}_2} I\{\hat h(W_j) < \pi\}$

respectively.
5. Estimate the null distribution of $\tilde T$, $\hat\theta$ and $\hat M$ by repeatedly drawing $m_2 + n_2$ random observations from $\mathcal{X}_2 \cup \mathcal{W}_2$ (with replacement for the bootstrap and without replacement for the permutation test) and randomly labelling $m_2$ of them as $X$'s and $n_2$ of them as $W$'s before computing $\tilde T$, $\hat\theta$ and $\hat M$ on them.
6. Calculate the p-values based on the estimated null distributions.
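Steps 5-6 of Method 2.1 can be sketched as follows, illustrated with the AUC statistic (the same loop applies to $\tilde T$ and $\hat M$). This is an illustrative implementation, assuming the classifier has already been trained on the training halves, so that `h_X2` and `h_W2` are its scores on the held-out test halves $\mathcal{X}_2$ and $\mathcal{W}_2$.

```python
import numpy as np

def auc_stat(h_X, h_W):
    # Eq (14): fraction of (background, experimental) score pairs ranked correctly.
    return np.mean(h_W[None, :] > h_X[:, None])

def resampling_pvalue(h_X2, h_W2, stat=auc_stat, larger_is_extreme=True,
                      n_cycles=1000, replace=False, seed=0):
    # Steps 5-6 of Method 2.1: the classifier stays fixed; only the test-set
    # labels are re-assigned. replace=True gives the bootstrap variant,
    # replace=False the (out-of-sample) permutation variant.
    rng = np.random.default_rng(seed)
    obs = stat(h_X2, h_W2)
    pooled = np.concatenate([h_X2, h_W2])
    m2, N = len(h_X2), len(pooled)
    null = np.empty(n_cycles)
    for i in range(n_cycles):
        draw = (rng.choice(pooled, size=N, replace=True) if replace
                else rng.permutation(pooled))
        null[i] = stat(draw[:m2], draw[m2:])
    if larger_is_extreme:       # AUC and LRT reject for large values ...
        return (1 + np.sum(null >= obs)) / (n_cycles + 1)
    return (1 + np.sum(null <= obs)) / (n_cycles + 1)   # ... MCE for small ones
```

For the MCE statistic one would pass `larger_is_extreme=False`, since small values of $\hat M$ are evidence against the null.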
3. Estimating λ in the Model-Independent Scenario. Here we discuss the problem of estimating the signal strength $\lambda$ in the semi-supervised setting. As we have seen in Section 2.2, the ratio of the densities $\psi^\dagger = q/p_b$ can be written as
$\psi^\dagger(z) = \left(\frac{1-\pi}{\pi}\right)\left(\frac{h(z)}{1-h(z)}\right),$
where $\pi = n/(n+m_b)$ and $h$ is the classifier differentiating the experimental data from the background data. Since

(16) $\psi^\dagger(z) = \frac{q(z)}{p_b(z)} = \frac{(1-\lambda)p_b(z)+\lambda p_s(z)}{p_b(z)} = 1-\lambda+\lambda\psi(z),$

we see that, for any $z$ such that $\psi(z)\neq 1$, we have

(17) $\lambda = \frac{1-\left(\frac{1-\pi}{\pi}\right)\left(\frac{h(z)}{1-h(z)}\right)}{1-\psi(z)}.$

Hence, if we can find any $z$ for which $p_s(z) = 0$ (no signal), then we can estimate $\lambda$ by

(18) $\hat\lambda = 1 - \left(\frac{1-\pi}{\pi}\right)\left(\frac{\hat h(z)}{1-\hat h(z)}\right),$

where $\hat h(z)$ is the trained semi-supervised classifier. The problem is that the search for such a $z$ may not be obvious.

Instead we take a different approach. To ensure identifiability, we assume $\inf_z p_s(z)/p_b(z) = 0$. We believe this is true in most high-energy physics problems. To simplify the discussion, we additionally assume $p_b, q > 0$ everywhere.

Next we show that the problem of estimating $\lambda$ can be transformed into a problem of estimating $g_q(1)$, where $g_q$ is the density of a univariate random variable supported on $[0,1]$. We define, for any $t \geq 0$, $C_t = \{z \in \mathbb{R}^d : p_s(z) \geq t\, p_b(z)\}$. Then for any $z \in \mathbb{R}^d$, we define the Neyman-Pearson quantile transform of $z$ as
$\rho(z) = P_{X\sim p_b}\left(\frac{q(X)}{p_b(X)} \geq \frac{q(z)}{p_b(z)}\right) = P_{X\sim p_b}\left(\psi^\dagger(X) \geq \psi^\dagger(z)\right) = P_{X\sim p_b}\left(h(X) \geq h(z)\right).$
Let $g_b$, $g_s$ and $g_q$ be the density functions of $\rho(Z)$ when $Z\sim p_b$, $Z\sim p_s$ and $Z\sim q$, respectively. Then we can show, as stated in Theorem 3.1 below, that $g_q(1) = 1-\lambda$ and therefore $\lambda = 1 - g_q(1)$.
So, we can estimate $\lambda$ using

(19) $\hat\lambda = 1 - \hat g_q(1),$

where $\hat g_q$ is a density estimate based on the $\hat\rho(W_i)$'s, defined as

(20) $\hat\rho(W_i) = \frac{1}{m_b}\sum_j I\{\hat h(X_j) \geq \hat h(W_i)\}.$

THEOREM 3.1.
Assuming that $\rho(Z)$ has a continuous density under either $Z \sim p_b$ or $Z \sim p_s$, the following hold:
1. $g_b(u) = 1$ for all $u \in [0,1]$. That is, $\rho(Z) \sim \mathrm{Unif}(0,1)$ if $Z \sim p_b$.
2. $g_s(u) = t_u$, where $t_u$ satisfies $P_{X\sim p_b}(X \in C_{t_u}) = u$. In particular, $g_s(1) = 0$.
3. $g_q(1) = 1-\lambda$.

We detail the proof of the theorem in Appendix A. Now the problem of estimating $\lambda$ reduces to estimating a monotone density at a boundary point. We can estimate the density $g_q(1)$ based on the $\hat\rho(W_i)$'s using a simple histogram-based estimator:

(21) $\hat g_q(1) = \frac{1}{nb}\sum_i I\{\hat\rho(W_i) \in (1-b, 1]\},$

where $b$ is the bin width of the histogram estimator. But since the density is a monotonically decreasing function, the density estimates at points close to 1 are also indicative of the estimate at 1.

We therefore propose using a Poisson regression on bins close to 1 in order to estimate the density at 1. We fit a Poisson regression $\hat f(t) = \exp(\beta_0 + \beta_1 t)$, with $\beta_1 \leq 0$, to the histogram estimates

(22) $H_t = \sum_i I\{\hat\rho(W_i) \in (t-b, t]\}, \quad T \leq t-b \leq t \leq 1,$

where $b$ is the bin width of the histogram estimator and $T$ determines the neighborhood of 1 that is used to estimate the density at 1. Then the estimated density at 1 is given by

(23) $\hat g_q(1) = \hat f(1),$

the estimate given by the Poisson regression at 1.

REMARK.
1. Note that the bins $(t-b, t]$ for $T \leq t-b \leq t \leq 1$ form a partition of $[T, 1]$, and we regress on the bin endpoints in the Poisson regression model.
2. We constrain $\beta_1 \leq 0$, since we know that $g_q$ is a monotonically decreasing function. In practice, we implement this condition by setting $\hat\beta_1 = 0$ whenever the Poisson regression estimates $\hat\beta_1 > 0$, and fit $\hat f(t) = \exp(\beta_0)$.
3. In practice, we additionally perform data splitting in order to get out-of-sample estimates of $\lambda$.
It is important to consider out-of-sample estimates since the Poisson regression model is conditional on the trained classifier, and computing the signal strength estimates on the same data that are used to train the classifier could give biased estimators.
4. Instead of fitting a Poisson regression model to the histogram estimates $H_t$, we also tried estimating $\hat f(1)$ by the mean of the histogram density estimates above a threshold $T$, as well as by a simple linear regression on $\log(H_t)$. These methods gave empirically poorer estimates than the Poisson model.

We compute the out-of-sample estimate of $\lambda$ as described in Method 3.1 below. We can additionally use the nonparametric bootstrap to understand the stability of the signal strength estimates under perturbations of the data. We detail the bootstrap procedure in Method 3.2 below. The method provides standard error estimates as well as bootstrapped confidence intervals that characterize the stability of the estimated signal strength $\lambda$.

METHOD 3.1 (Out-of-sample estimation of $\lambda$).
1. Split the background data $X = \{X_1,\dots,X_{m_b}\}$ into $\mathcal{X}_1$ and $\mathcal{X}_2$ of sizes $m_1$ and $m_2$ respectively.
2. Split the experimental data $W = \{W_1,\dots,W_n\}$ into $\mathcal{W}_1$ and $\mathcal{W}_2$ of sizes $n_1$ and $n_2 = m_2$ respectively.
3. Train the classifier $\hat h$ on $\mathcal{X}_1$ and $\mathcal{W}_1$.
4. Compute $\hat\rho(W_i)$ for all $W_i \in \mathcal{W}_2$, as defined in (20), based on $\hat h$ and $\mathcal{X}_2$:

(24) $\hat\rho(W_i) = \frac{1}{m_2}\sum_{X_j\in\mathcal{X}_2} I\{\hat h(X_j) \geq \hat h(W_i)\}, \quad W_i \in \mathcal{W}_2.$
5. Get the histogram estimates $H_t$ of the $\hat\rho(W_i)$'s for bins above $T$ with bin width $b$ using Equation (22).
6. Use the Poisson regression to estimate the density at 1 as $\hat g_q(1)$. Then $\hat\lambda = 1 - \hat g_q(1)$.

METHOD 3.2 (Bootstrap for $\hat\lambda$).
1. Repeatedly draw with replacement $m_b$ background samples from $X$ and $n$ experimental samples from $W$ and split them into $\mathcal{X}_1^*$, $\mathcal{X}_2^*$, $\mathcal{W}_1^*$ and $\mathcal{W}_2^*$ of sizes $m_1$, $m_2$, $n_1$ and $n_2$ at random, ensuring no overlap between the training and test data sets; that is, $\mathcal{X}_1^* \cap \mathcal{X}_2^* = \emptyset$ and $\mathcal{W}_1^* \cap \mathcal{W}_2^* = \emptyset$. Here $m_1$, $n_1$, $m_2$ and $n_2$ are the same as those used in Method 3.1 above.
2. Find $\hat\lambda^*$ in each case using steps 3-6 of Method 3.1. Note that the classifier is re-trained on $\mathcal{X}_1^* \cup \mathcal{W}_1^*$ for every random sample.
3. We can then use the empirical standard deviation or quantiles of the $\hat\lambda^*$'s to create bootstrap confidence intervals that estimate the stability of the estimator $\hat\lambda$ under perturbations of the data.

REMARK. Since the data are split to obtain an out-of-sample estimate of $\lambda$, it is also necessary for the training set to be disjoint from the test set in the bootstrapped samples, to avoid the problems caused by an in-sample estimate.
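Steps 4-6 of Method 3.1 can be sketched as follows (not the authors' code). `h_X2` and `h_W2` again stand for classifier scores on the held-out background and experimental halves; the Poisson regression is fit by direct minimization of the Poisson negative log-likelihood with the slope constrained to be non-positive. Since the regression is fit to bin counts, the fitted count at 1 is normalized by $n_2 b$ to convert it into the density estimate $\hat g_q(1)$, a detail the text leaves implicit; the bin width `b` and threshold `T` are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def lambda_hat(h_X2, h_W2, b=0.02, T=0.8):
    # Eq (24): Neyman-Pearson quantile transform of each experimental point,
    # computed via sorting: rho(w) = fraction of background scores >= h(w).
    srt = np.sort(h_X2)
    rho = 1.0 - np.searchsorted(srt, h_W2, side="left") / len(h_X2)
    # Eq (22): histogram counts H_t on bins partitioning [T, 1].
    edges = np.arange(T, 1.0 + 1e-9, b)
    H, _ = np.histogram(rho, bins=edges)
    t = edges[1:]                      # right bin endpoints
    # Poisson regression f(t) = exp(beta0 + beta1 * t) with beta1 <= 0.
    def nll(beta):
        eta = beta[0] + beta[1] * t
        return np.sum(np.exp(eta) - H * eta)
    res = minimize(nll, x0=[np.log(H.mean() + 1e-9), 0.0],
                   method="L-BFGS-B", bounds=[(None, None), (None, 0.0)])
    f1 = np.exp(res.x[0] + res.x[1])   # fitted count in the bin ending at 1
    g_q_1 = f1 / (len(h_W2) * b)       # counts -> density, eqs (21)/(23)
    return 1.0 - g_q_1                 # eq (19)
```

With an uninformative classifier and no signal, $\hat\rho$ is approximately uniform and $\hat\lambda \approx 0$; when a fraction $\lambda$ of experimental points receives high classifier scores, the density of $\hat\rho$ near 1 drops to about $1-\lambda$.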
4. Interpreting the Classifier.
The signal detection test relies on the classifier, which we take to be a random forest. So, to understand, characterize and interpret the new physics signal detected by the test, it is useful and necessary to find a way to interpret the fitted classifier. In this section, we propose two methods to help interpret the random forest classifier.

The methods are based on understanding how the gradient of the classifier surface $\hat h(z)$ behaves in different directions. The underlying idea is that directions in which the surface is sloped or changes a lot contain useful information for the classification, while directions in which the surface is flat do not. For the first method, we look at the density of the gradients of the classifier marginally along each of the variables. For the second method, we find multivariate dependencies that jointly affect the gradient of the classifier by finding the active subspace (Constantine, 2015) of the classifier surface. The active subspace is found by looking at the leading eigenvectors of the standardized gradients of the classifier surface.

Figure 2 demonstrates the active subspace method on a two-dimensional example for simplicity, where the data $(X_1, X_2) \in [-1,1] \times [-1,1]$ and the signal lies around lines parallel to the line $X_1 + X_2 = 0$. We then train a classifier $\hat h(X_1, X_2)$ to separate the signal from the background. We notice that the classifier output does not appear to have any relationship with either $X_1$ or $X_2$ marginally, as shown in Figure 2a. But the smoothed classifier surface detects the signal around lines parallel to the line $X_1 + X_2 = 0$, as can be seen from the ridges along those lines in Figure 2b. So, looking at the standardized gradients of the classifier surface gives us information about the directions in which the surface changes.
As seen in Figure 2c, the first eigenvector of the standardized gradients reflects the direction in which the classifier surface changes the most, which is along the $X_1 - X_2 = 0$ line. Identifying this subspace helps us understand the directions in the feature space that separate the signal from the background. Note that we consider a two-dimensional example here for illustration purposes only; the real data can be of much higher dimensionality.

In what follows, instead of directly looking at the classifier $\hat h(z)$, we look at $\mathrm{logit}(\hat h(z)) = \log\left(\hat h(z)/(1-\hat h(z))\right)$. For computational purposes, in practice we add a small amount of noise to $\hat h(z)$ when $\hat h(z) \in \{0,1\}$. We notice that taking the logit transformation provides more stable estimates of the gradients of the surface. Henceforth, we will work with the gradients of $H(z) := \mathrm{logit}(\hat h(z))$.

To find the gradient $\nabla_z H(z)$, we fit a local linear smoother to the logit of the random forest output. That is, we fit the logit of the classifier outputs $H(Z_i) = \mathrm{logit}(\hat h(Z_i))$ locally around $z^*$ using

(25) $\hat H(z) = \alpha(z^*) + \beta(z^*)^T(z - z^*) + o\left(\|z - z^*\|\right).$

Then $\hat\beta(Z_i)$ provides an estimate of the gradient of the logit classifier output at each data point. We furthermore replace the gradient estimates $\hat\beta_j(Z_i)$ by their standardized versions $\hat\beta_j(Z_i)/\mathrm{sd}(\hat\beta_j(Z_i))$ at every data point $Z_i$ to get more stable values. Henceforth, $\hat\beta(Z_i)$ will denote the standardized gradient estimates, for notational simplicity.
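The local linear fit in (25) can be sketched as follows; this is a simple kernel-weighted least-squares version with a single Gaussian bandwidth for all coordinates (an illustrative choice, not necessarily the authors' exact smoother). `y` holds the logit classifier outputs $H(Z_i)$.

```python
import numpy as np

def local_linear_gradients(Z, y, bandwidth=0.3):
    # For each point Z_i, fit y ~ alpha + beta^T (z - Z_i) with Gaussian
    # kernel weights; the slope beta estimates the gradient of H at Z_i.
    n, d = Z.shape
    grads = np.empty((n, d))
    for i in range(n):
        diff = Z - Z[i]
        w = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / bandwidth ** 2)
        A = np.hstack([np.ones((n, 1)), diff])   # columns: [alpha | beta]
        AtWA = (A * w[:, None]).T @ A            # weighted normal equations
        AtWy = (A * w[:, None]).T @ y
        coef = np.linalg.lstsq(AtWA, AtWy, rcond=None)[0]
        grads[i] = coef[1:]                      # slope = gradient estimate
    return grads
```

On an exactly linear surface the local linear fit recovers the gradient without bias, which makes a convenient sanity check before applying it to the logit random forest output.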
(a) $X_2$ versus $X_1$; $\hat h(X_1, X_2)$ versus $X_1$; and $\hat h(X_1, X_2)$ versus $X_2$. (b) Smoothed classifier surface. (c) PCA of the standardized gradients.

Fig 2: We demonstrate the active subspace method on a two-dimensional example, as shown in (a), where grey circles denote the uniformly distributed background and green triangles denote the signal. The signal lies around lines parallel to the line $X_1 + X_2 = 0$. Additionally, (a) also shows the output of the classifier that separates the signal from the background, $\hat h(X_1, X_2)$, as a function of $X_1$ and $X_2$ individually. Neither of these variables marginally has separation power for the signal. (b) shows the smoothed classifier surface as a function of both $X_1$ and $X_2$. The active subspace method then performs PCA on the standardized gradients of the smoothed classifier output in (b). (c) shows a scatter plot of the standardized gradients of the smoothed classifier output as well as the two eigenvectors. We see that, as expected, the first eigenvector picks the direction in which the classifier output varies the most, namely $X_1 - X_2 = 0$. We consider a two-dimensional example here for illustration purposes only; the real data can be of much higher dimensionality.

REMARK. The data used for estimating the gradient can be the background data, the experimental data, or a combination of both. We use a combination of both since, empirically, the combination captures the classifier surface better.

We can then look at:
(i) The density of $\hat\beta_j(Z)$, the estimated standardized gradient of $H(Z)$ along the $j$th variable.
(ii) The active subspace found using the estimated standardized gradients $\hat\beta(Z)$, as detailed in Section 4.1 below.

The density of $\hat\beta_j(Z)$ explains how the classifier behaves marginally along each variable, whereas the active subspace identifies relationships between multiple variables that cause the most change in the classifier. We now describe the active subspace method in detail below.

4.1. Active Subspace of the Classifier.
In this section we find the active subspace of the classifier $\hat h$, which is assumed to be fixed; all expectations in this section are with respect to the inputs of the fixed classifier. Let us consider the mean and the covariance matrix of $\nabla_z H(z)$. We define
$C = E_f\left[\left(\nabla_z H - E[\nabla_z H]\right)\left(\nabla_z H - E[\nabla_z H]\right)^T\right],$
where $E_f$ is the expectation with respect to the density $f = c\, p_b + (1-c)\, q$, where $c = m_2/(m_2+n_2)$ gives the proportion of background data in the combined sample of background data $\mathcal{X}_2$ of size $m_2$ and experimental data $\mathcal{W}_2$ of size $n_2$, and $q = (1-\lambda)p_b + \lambda p_s$. We consider expectations with respect to $f$ since, as explained in the remark above, we use a combination of the experimental data and the background data to find the active subspace of the classifier.

The mean $E_f[\nabla_z H]$ gives the expected slope of the surface of $H(z)$. For example, a positive mean gradient along the $j$th variable indicates an increasing trend in the classifier output with that variable, meaning that increasing that variable increases the probability of an experimental data point and decreases the probability of a background data point. The covariance matrix $C$, on the other hand, encodes the variable relationships that cause variability in the classifier surface.

Consider the real eigenvalue decomposition of $C$,

(26) $C = M \Lambda M^T, \quad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d), \quad \lambda_1 \geq \dots \geq \lambda_d \geq 0,$

where $M$ has columns $\{M_{\cdot 1}, \dots, M_{\cdot d}\}$, the normalized eigenvectors of $C$. Then the vector $M_{\cdot 1}$, corresponding to $\lambda_1$, gives the association between the variables (i.e., the direction in the input space) that best captures the changes in $H(z)$ around the expected slope, followed by $M_{\cdot 2}$ and so on. Therefore, the eigenvectors corresponding to the leading eigenvalues $\lambda_1, \lambda_2, \dots$ give us an idea of the directions along which the classifier output changes the most.
These are the directions that contain meaningful information for separating the experimental data from the background data and therefore enable us to characterize how the experimental data differ from the background data. Toward this end, we propose Method 4.1, which uses the gradient estimates $\hat\beta(Z)$ derived from a local linear smoother.

METHOD 4.1 (Active subspace of $\hat h$).
1. Split the background data into $\mathcal{X}_1$ and $\mathcal{X}_2$ of sizes $m_1$ and $m_2$ respectively. Split the experimental data into $\mathcal{W}_1$ and $\mathcal{W}_2$ of sizes $n_1$ and $n_2 = m_2$ respectively. Train the classifier $\hat h$ on $\mathcal{X}_1$ and $\mathcal{W}_1$.
2. Fit a local linear smoother that estimates $H(Z_i) = \mathrm{logit}\left(\hat h(Z_i)\right)$ for $Z_i \in \mathcal{X}_2 \cup \mathcal{W}_2$ using the model given in (25).
3. Consider the coefficients of the local linear smoother, $\hat\beta(Z_i)$, for $Z_i \in \mathcal{X}_2 \cup \mathcal{W}_2$. They provide an estimate of the gradient $\nabla_z H$ at $Z_i$, i.e., $\widehat{\nabla_z H}(Z_i) = \hat\beta(Z_i)$ for $Z_i \in \mathcal{X}_2 \cup \mathcal{W}_2$.
4. Standardize the estimated gradients $\hat\beta_j(Z_i)$ to $\hat\beta_j(Z_i)/\widehat{\mathrm{sd}}(\hat\beta_j(Z_i))$ using the estimate $\widehat{\mathrm{sd}}(\hat\beta_j(Z_i))$ given by the local linear smoother.
5. Then, instead of $C$, we can use the estimate

(27) $\hat C = \frac{1}{N}\sum_{Z_j\in\mathcal{X}_2\cup\mathcal{W}_2}\left(\hat\beta(Z_j) - \bar{\hat\beta}(Z)\right)\left(\hat\beta(Z_j) - \bar{\hat\beta}(Z)\right)^T,$

where $\bar{\hat\beta}(Z) = \sum_{Z_j\in\mathcal{X}_2\cup\mathcal{W}_2}\hat\beta(Z_j)/N$ and $N = m_2 + n_2$.
6. Find the eigenvalue decomposition of $\hat C$ as $\hat C = \hat M \hat\Lambda \hat M^T$, which gives the estimates $\hat M$ and $\hat\Lambda$ of $M$ and $\Lambda$ as defined in (26), respectively.
7. Then $\bar{\hat\beta}(Z)$ and the estimated eigenvectors $\hat M_{\cdot 1}, \hat M_{\cdot 2}, \dots$, given by the columns of $\hat M$, best capture the slope and the variations in $H(\cdot)$, and hence in the classifier surface, respectively.

We can additionally construct bootstrapped uncertainty intervals by repeatedly drawing with replacement $m_b$ background samples from $X$ and $n$ experimental samples from $W$ and splitting them into $\mathcal{X}_1^*$, $\mathcal{X}_2^*$, $\mathcal{W}_1^*$ and $\mathcal{W}_2^*$ of sizes $m_1$, $m_2$, $n_1$ and $n_2$ at random, ensuring no overlap between the training and test data sets; that is, $\mathcal{X}_1^* \cap \mathcal{X}_2^* = \emptyset$ and $\mathcal{W}_1^* \cap \mathcal{W}_2^* = \emptyset$. In each case, we first re-train the classifier on the new training data $\mathcal{X}_1^* \cup \mathcal{W}_1^*$ and then find $\bar{\hat\beta}(Z)$ and the estimated eigenvectors $\hat M_{\cdot 1}, \hat M_{\cdot 2}, \dots$ on the new test data $\mathcal{X}_2^* \cup \mathcal{W}_2^*$. Here $m_1$, $m_2$, $n_1$ and $n_2$ are the same as in Method 4.1. We can then use the standard deviation or quantiles of the bootstrapped estimates to create bootstrap confidence intervals that estimate the stability of the estimators given by the algorithm under perturbations of the data.
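Steps 4-7 of Method 4.1 amount to a PCA of the standardized gradients. Here is a minimal sketch, where `grads` holds the gradient estimates $\hat\beta(Z_i)$ (e.g., from a local linear smoother) and, as a simplification, each coordinate is standardized by its empirical standard deviation rather than by the smoother's standard error.

```python
import numpy as np

def active_subspace(grads):
    # Standardize each gradient coordinate, then eigendecompose the
    # empirical covariance (27); eigenpairs are returned in the
    # decreasing order of eq (26).
    B = grads / grads.std(axis=0, ddof=1)
    C = np.cov(B, rowvar=False)
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]
```

In the two-dimensional example of Figure 2, gradients proportional to $(1, 1)$ produce a dominant leading eigenvalue with eigenvector along the $X_1 - X_2 = 0$ direction, i.e., $(1,1)/\sqrt{2}$ up to sign.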
5. Experiments: Search for the Higgs Boson.
We demonstrate the performance of the proposed semi-supervised classifier tests on the Higgs boson machine learning challenge data set available on the CERN Open Data Portal at http://opendata.cern.ch/record/328 (ATLAS collaboration, 2014). The data set consists of simulated data provided by the ATLAS experiment at CERN's Large Hadron Collider to optimize the search for the Higgs boson.

Our goal is to demonstrate the performance of the proposed tests in identifying the presence of the Higgs boson without assuming an a priori ansatz for the signal, and to demonstrate their applicability to model-independent searches for new physics signals in experimental particle physics.

5.1.
Data Description.
The Higgs boson has many different ways through which it can decay in an experiment and produce other particles. The particular challenge from which our data set originates focuses on the collision events where the Higgs boson decays into two tau particles (Adam-Bourdarios et al., 2015). The data provided for the challenge consist of collision events labelled as background and signal. The signal class comprises events in which a Higgs boson (with a fixed mass of 125 GeV) was produced and then decayed into two taus. The events are simulated using the official ATLAS full detector simulator. The simulator yields simulated events whose properties mimic the statistical properties of real events of the signal type as well as of several important backgrounds. For the sake of simplicity, background events generated from only three different background processes were included in the challenge data set. Our objective is to show that semi-supervised classifier tests are able to identify the Higgs signal without any prior knowledge of it.

The data set has 818,238 observations, where each observation is a simulated proton-proton collision event in the detector. Each of these collision events produces clustered showers of hadrons, which originate from a quark or a gluon produced during the collision; these showers are called jets. The data contain information on the measured properties of the jets as well as of the other particles produced during the collision. There are d = 35 features, whose individual details can be found on CERN's Open Data Portal or in Appendix B of Adam-Bourdarios et al. (2015). Here we give some insight into the most important characteristics of the features.

The features whose names start with PRI are primitive variables that record the raw quantities as measured by the detector. The features whose names start with
DER are derived variables, which are evaluated as functions of the primitive variables. Since not all collision events produce the same number of jets, the number of jets produced in a collision, denoted by
PRI_jet_num, ranges from 0 to 3 (events with more than 3 jets are capped at 3). Note that it is possible for a collision to produce no jets (PRI_jet_num = 0), and hence there are structurally absent missing values in the data that relate to the jets produced in the collisions. To avoid these missing values, we only consider events that have two jets (
PRI_jet_num = 2). This results in 165,027 events: 80,806 background events and 84,221 signal events. Since the derived quantities are functions of the primitives, we use just the primitive variables (d = 16) for our analysis. Since we only use the primitive variables, we drop the prefix PRI from the variable names and further shorten some of the variable names intuitively for convenience. Descriptions of the primitive variables used are provided in Table 1.

TABLE 1
Descriptions of the variables used in the analysis of the Higgs boson machine learning challenge data set (Adam-Bourdarios et al., 2015). We drop the prefix
PRI from the variable names and further shorten some of the variable names intuitively for convenience.
Variable: Description
tau_pt: The transverse momentum of the hadronic tau.
tau_eta: The pseudorapidity η of the hadronic tau.
tau_phi: The azimuth angle φ of the hadronic tau.
lep_pt: The transverse momentum of the lepton (electron or muon).
lep_eta: The pseudorapidity η of the lepton.
lep_phi: The azimuth angle φ of the lepton.
met: The missing transverse energy.
met_phi: The azimuth angle φ of the missing transverse energy.
met_sumet: The total transverse energy in the detector.
lead_pt: The transverse momentum of the leading jet, i.e., the jet with the largest transverse momentum (undefined if PRI_jet_num = 0).
lead_eta: The pseudorapidity η of the leading jet (undefined if PRI_jet_num = 0).
lead_phi: The azimuth angle φ of the leading jet (undefined if PRI_jet_num = 0).
sublead_pt: The transverse momentum of the subleading jet, i.e., the jet with the second largest transverse momentum (undefined if PRI_jet_num ≤ 1).
sublead_eta: The pseudorapidity η of the subleading jet (undefined if PRI_jet_num ≤ 1).
sublead_phi: The azimuth angle φ of the subleading jet (undefined if PRI_jet_num ≤ 1).
all_pt: The scalar sum of the transverse momentum of all the jets in the event.
Weight: The event weight.
Label: The event label (string; s for signal, b for background).

Among the primitive features, five provide the azimuth angle φ of the particles generated in the event (variables ending with _phi). These features are rotation invariant in the sense that the event does not change if all of them are rotated together by the same angle. Hence, to interpret these variables more easily using the active subspace method, we remove the invariance of the azimuth angle variables by rotating all the φ's so that the azimuth angle of the leading jet is zero (lead_phi = 0). lead_phi can then be removed from the analysis, leading to a 15-dimensional feature space.

Additionally, we take logarithmic transformations of the variables that give the transverse momentum of the particles produced (variables ending with _pt), the missing transverse energy (met) and the total transverse energy in the detector (met_sumet).

Exploratory analysis of the data, as well as details and justifications for the transformations considered above, can be found in the Supplementary Material (Chakravarti et al., 2021).

Experimental Setting:
In all the following experiments on the Higgs boson data set, we randomly sample without replacement background and signal events from the original data set to form the background data of size $m_b$, the signal data of size $m_s$ (used by the supervised methods), and the experimental data of size $n$ with signal strength $\lambda$. The number of signal events in the experimental data is thus decided according to $\lambda$ by randomly generating from $\mathrm{Bin}(n, \lambda)$. For methods that require data splitting, the splitting is done by randomly splitting the background data into subsamples of sizes $m_1$ and $m_2$ and the experimental data into subsamples of sizes $n_1$ and $n_2$, with the numbers of signal events in the two subsamples generated from $\mathrm{Bin}(n_1, \lambda)$ and $\mathrm{Bin}(n_2, \lambda)$, respectively. Additionally, when randomly generating the experimental data, events are selected in a weighted fashion using the Weight variable provided in the data set, as described in Table 1. For each of the bootstrap and permutation methods we consider 1,000 bootstrap and permutation cycles, respectively.

In the following sections, we first explore the power of the classifier tests described in Section 2 to detect the presence of the Higgs boson signal in the experimental data. We then estimate the signal strength $\lambda$ using the methods introduced in Section 3 and use the active subspace method introduced in Section 4 to characterize the signal.

5.2. Anomaly Detection Using the Classifier Tests.
We compare the power of the model-dependent supervised methods and the model-independent semi-supervised methods introduced in Section 2 in detecting the Higgs boson signal, varying the signal strength from $\lambda = 0.01$ to $\lambda = 0.15$. We also check that the tests have the right error control in the null case ($\lambda = 0$). We demonstrate the performance of the different methods (asymptotic, bootstrap, permutation using out-of-sample statistics, and slow permutation using in-sample statistics), used along with the different test statistics: the likelihood ratio test (LRT) statistic, the score statistic, the area under the curve (AUC), and the misclassification error (MCE).

For the model-dependent methods, since we have only finitely many samples from the background available, and not the background generator itself, we split the available background data into training and test sets as described above. We then use the training set for fitting the classifier and the test set for estimating the null distribution of the test statistic by bootstrapping or permuting, as described in Section 2. In practice, since the background MC generator is known, one should, in many cases, be able to generate more training background samples for estimating the null distribution of the test statistic.

Anomaly Detection when Data has a Correctly Specified Signal.
We first compare the methods when the signal model is correctly specified. The tests are run on 50 random samplings of the data, and the percentage of times each test rejects the null hypothesis that there is no signal is given in Table 2. "Permutation" indicates the faster permutation method from Section 2.2, which uses out-of-sample test statistics, and "Slow Perm" indicates the slower permutation method from Section 2.2, which uses in-sample test statistics and re-trains the classifier in every permutation cycle. The significance level for all the tests is α = 0.05.

TABLE 2
Power of detecting the signal in the Higgs boson data for each method, in percentages. We consider 50 random samplings of the data and 1,000 bootstrap and permutation cycles. We perform each test at the α = 5% significance level.

                                            Signal Strength (λ)
Model                  Method        0.15  0.1  0.07  0.05  0.03  0.01    0
Supervised LRT         Asymptotic     100  100    96    62    18    18    6
                       Bootstrap      100   96    78    58     6     0    0
                       Permutation    100   98    98    86    28     6    0
Supervised Score       Bootstrap       64   66    74    50    18     0    0
                       Permutation     94   92   100    92    80    24   12
Semi-Supervised LRT    Asymptotic     100   98    74    38    16     6    2
                       Bootstrap      100   98    48    10     2     2    0
                       Permutation    100   98    72    38    16     6    2
                       Slow Perm       82    8     0     4     2     0    4
Semi-Supervised AUC    Asymptotic     100   96    78    32    14     4    2
                       Bootstrap      100   98    70    32    20     6    2
                       Permutation    100   98    68    32    20     4    2
                       Slow Perm      100  100    94    56    20     8    4
Semi-Supervised MCE    Asymptotic     100   92    60    28    14     2    2
                       Bootstrap      100   96    52    28    16     6    4
                       Permutation    100   96    52    30    14     6    6
                       Slow Perm      100   98    86    58    16     6    2
As seen in Table 2, among the supervised methods, the permutation tests outperform all the other supervised methods in terms of power. The permutation test with the score statistic has the most power for smaller values of $\lambda$, though it is worth noting that it might be anticonservative for $\lambda = 0$ and might not have the right significance level. On the other hand, the supervised tests that use the LRT statistic appear to be overly conservative for smaller values of $\lambda$, as seen in Table 2 and Figure 3. This occurs because the density ratio $p_s/p_b$ is unknown and is estimated using a classifier when evaluating the LRT statistic. The problem mainly appears when two things happen simultaneously: first, when the classifier that is used to learn the difference between the background and the signal overfits the class probabilities; and second, when $\lambda = 0$, i.e., when the experimental data contain no signal. When both of these things happen, then with probability higher than expected, almost the entire experimental data set is classified as background with $\hat h(Z_i) = 0$ and $\widehat{p_s/p_b} = 0$. This makes $\hat\lambda_{\mathrm{MLE}}$, the maximum likelihood estimator of $\lambda$, also zero, making the LRT statistic zero as well. This is also observed when we use the permutation or bootstrap methods to estimate the null distribution. The estimated distributions have a point mass at zero that is much higher than 0.5 (as given by the asymptotic distribution in Equation (8) when the true
Fig 3: Empirical CDF of the p-values given by the supervised tests for different signal strengths (λ). The columns in the grid of plots represent the different methods of testing (asymptotic, bootstrap and permutation) and the rows represent the different test statistics (LRT and Score). The plot for the asymptotic test using the score statistic is missing since we do not consider that test due to the poor quality of the asymptotic approximation.

Note that we additionally leave out the asymptotic score test from both Table 2 and Figure 3: we tried that test on a sub-sample of the data and decided not to consider it for the larger data set due to the poor quality of the asymptotic approximation.

Among the semi-supervised approaches, the AUC and MCE slow permutation methods ("Slow Perm") have power that is only slightly worse than that of the supervised methods, as shown in Table 2. Importantly, the semi-supervised methods achieve this without having access to the labeled signal sample during training. Among the asymptotic and the faster permutation methods, using the LRT or MCE gives performance similar to the AUC in the semi-supervised approaches. We also observe that the slow permutation method using the LRT statistic has very low power; this is due to bias in estimating the LRT with an in-sample estimate, which is influenced by over-fitting.

Figure 4 shows the empirical distributions of the p-values produced by the semi-supervised tests for different λ values. All the tests appear to have correct (or at least almost correct) type I error control, as indicated by the close-to-uniform CDFs in the λ = 0 case. We also notice that none of the tests have much power to detect signals that constitute only a few percent of the experimental data.
We omit the plot for the slow permutation method using the LRT since it has anomalously low power for most λ values, for the reasons mentioned above.

We additionally compared these methods to the nearest neighbor (NN) two-sample tests introduced by Schilling (1986) and Henze (1988). We considered the nearest neighbor tests, as opposed to other two-sample tests, because the signal we are trying to detect appears as a collection of nearby data points, so tests based on neighborhoods should have better power to detect it. We compared the methods presented in this paper to the NN tests (asymptotic and permutation versions) on a sub-sample of the data set. The asymptotic version in particular had very poor power. The permutation version performed better, but was outperformed by the semi-supervised AUC, MCE and LRT methods; moreover, it did not scale to the larger data set, so we concluded that it was computationally impractical to apply it to the full data set.
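The NN two-sample test can be sketched as follows. The statistic is the fraction of the k nearest neighbours of each point that share that point's sample label, with the null distribution obtained by permuting labels; the function names and parameter choices here are illustrative, not the exact implementation used in the experiments.

```python
import numpy as np

def nn_stat(data, labels, k=3):
    """Schilling-Henze statistic: proportion of k-nearest-neighbour pairs
    in the pooled sample that share the same sample label."""
    n = len(data)
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    nbrs = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours
    return float(np.mean(labels[nbrs] == labels[:, None]))

def nn_perm_test(x, w, k=3, n_perm=200, seed=0):
    """Permutation p-value for the NN two-sample test of H0: p_b = q."""
    rng = np.random.default_rng(seed)
    data = np.vstack([x, w])
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(w))])
    obs = nn_stat(data, labels, k)
    null = [nn_stat(data, rng.permutation(labels), k) for _ in range(n_perm)]
    return (1 + np.sum(np.array(null) >= obs)) / (1 + n_perm)
```

The pairwise-distance matrix makes this O(n²) per permutation, which illustrates why the permutation version did not scale to the full data set.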
Fig 4: Empirical CDF of the p-values given by the semi-supervised tests for different signal strengths (λ). The columns of the grid represent the different methods of testing (asymptotic, bootstrap, permutation and slow permutation) and the rows represent the different test statistics (LRT, AUC and MCE). We leave out the plot for the slow permutation test using the LRT statistic since in practice it demonstrates very poor performance; it suffers from bias caused by over-fitting, as the slow permutation method uses the in-sample LRT.

In conclusion, the slow permutation method using the AUC and MCE statistics outperforms the other semi-supervised methods and gives performance comparable to the supervised methods in detecting the signal in the experimental data. Note that the power of the slow permutation method using the AUC and MCE statistics is much better than the one using the LRT statistic, which is the statistic used by D'Agnolo and Wulzer (2019) and D'Agnolo et al. (2021). So these methods give an improvement over using just the LRT statistic.

5.2.2. Anomaly Detection when Data has a Misspecified Signal.
As mentioned in the introduction, the main motivation for using model-independent methods is the ability to detect a signal when there are systematic errors or imprecisions in the signal model, which adversely affect model-dependent methods. In this section we demonstrate the effect of such a misspecified signal model. We transform the signal in the experimental data to intentionally make it different from the signal model, rendering the signal model misspecified. We consider a transformation
in just one variable, transforming

tau_pt* = tau_pt − c (tau_pt − min(tau_pt)),

for a fixed shrinkage factor c ∈ (0, 1), in the experimental data. Meanwhile, we do not transform the signal data used by the supervised methods for training the classifier. This simulates a misspecified signal situation, where the signal in the experimental data does not come from the same distribution as the signal used by the supervised methods for training.

We chose this particular transformation for multiple reasons. First, it transforms the variable whose marginal means differ the most between the background and the signal data, as demonstrated in the Supplementary Material (Chakravarti et al., 2021). Second, the marginal distribution of the signal along this variable is different from the marginal distribution of the background. These two reasons make tau_pt a variable that potentially influences the classifier. By transforming the variable, we lower the mean tau_pt for the signal in the experimental data, causing the model-dependent methods to lose power in detecting the transformed signal.

We now compare the power of the supervised and the semi-supervised methods in detecting the transformed signal in the experimental data. The tests are run on random samplings of the data, and the percentage of times each test rejects the null hypothesis that there is no signal is given in Table 3. As before, "Permutation" indicates the faster permutation method from Section 2.2, which uses out-of-sample test statistics, and "Slow Perm" indicates the slower permutation method from Section 2.2, which uses in-sample test statistics and re-trains the classifier in every permutation cycle. The significance level for all tests is α = 0.05.

Table 3: Power of detecting the misspecified signal in the Higgs boson data for each model, in percentages. We consider 50 random samplings of the data and 1,000 bootstrap and permutation cycles.
We perform each test at the α = 5% significance level.

                                          Signal Strength (λ)
  Model                 Method        0.15   0.1  0.07  0.05  0.03  0.01     0
  Supervised LRT        Asymptotic      2    10     2     8     8     6     4
                        Bootstrap       0     0     0     0     0     0     0
                        Permutation     0     0     0     0     0     2     0
  Supervised Score      Bootstrap       0     0     0     0     0     0     0
                        Permutation     0     0     0     0     0     2     8
  Semi-Supervised LRT   Asymptotic    100   100   100    82    18     4     4
                        Bootstrap     100   100   100    60     4     2     0
                        Permutation   100   100   100    82    18     4     2
                        Slow Perm     100   100    78    22     2     4     6
  Semi-Supervised AUC   Asymptotic    100   100   100    78    16     8     4
                        Bootstrap     100   100   100    82    20    10     0
                        Permutation   100   100   100    80    20     8     2
                        Slow Perm     100   100   100   100    34    10     4
  Semi-Supervised MCE   Asymptotic    100   100   100    66    24     6     4
                        Bootstrap     100   100   100    62    16     6     4
                        Permutation   100   100   100    62    14     6     4
                        Slow Perm     100   100   100    98    22     8     2
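The misspecification transformation above can be sketched as follows; the coefficient c = 0.5 in the usage example is purely illustrative, whereas the paper fixes one specific value.

```python
import numpy as np

def shrink_toward_min(v, c):
    """Shrink each value toward the sample minimum by a factor c in (0, 1):
    v* = v - c * (v - min(v)). The support stays anchored at the same
    minimum while the mean of the variable is lowered."""
    v = np.asarray(v, dtype=float)
    return v - c * (v - v.min())

# Illustrative: lowering the signal's tau_pt relative to the trained-on model
tau_pt = np.array([20.0, 35.0, 80.0])
tau_pt_star = shrink_toward_min(tau_pt, 0.5)
assert tau_pt_star.min() == tau_pt.min()   # minimum preserved
assert tau_pt_star.mean() < tau_pt.mean()  # mean lowered
```

Applying this only to the experimental data, and not to the supervised training signal, produces exactly the train/test distribution mismatch studied in Table 3.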
As seen in Table 3, and as expected, the supervised methods have essentially no power to detect the transformed signal. An interesting observation from the table is that the supervised methods appear to be more conservative when some signal is present than when there is no signal at all. This occurs because the signal labels the supervised models are trained on are inconsistent with the signal in the experimental data whenever the signal is present.

The semi-supervised methods, on the other hand, still have power to detect the transformed signal since they are not trained on the misspecified signal training data. In fact, the semi-supervised methods appear to have higher power with the transformed signal than with the original signal, indicating that the transformed signal is somewhat easier to disentangle from the background than the original signal. Comparing the semi-supervised methods, we again observe that the slow permutation method using AUC has the most power. We also notice that the asymptotic and permutation tests using the LRT demonstrate power comparable to the AUC, and the slow permutation method with MCE performs comparably to the slow permutation method with AUC. The slow permutation method with LRT again has anomalously low power for smaller λ, due to the bias caused by using the in-sample LRT as the test statistic.

In conclusion, the semi-supervised methods continue to demonstrate power similar to the previous case where the signal model was not misspecified. The supervised models, on the other hand, have no power to detect the transformed signal in the misspecified case, which motivates the use of semi-supervised models in such settings.

5.3. Estimating the Higgs Boson Signal Strength.
In this section, we demonstrate the performance of the methods proposed in Section 3 for estimating the signal strength (λ) using the Higgs boson data set. We vary the signal strength from λ = 0.15 down to λ = 0 and examine the estimates of the signal strength as well as the bootstrapped uncertainty intervals given by Method 3.1 in Section 3. We use bootstrap resampling to obtain the bootstrapped uncertainty intervals. For the estimates, we consider two values of the threshold T and two values of the bin-width b, where T is the threshold that defines the neighborhood of 1 in which we fit the Poisson regression model and b is the bin-width of the histogram used to estimate the densities; each bin-width corresponds to a fixed number of bins. To construct the bootstrapped uncertainty intervals, we use three different methods with α = 5% and compare them in Figure 5 below.

First, we use the basic bootstrap confidence interval, also known as the reverse percentile interval, which uses empirical quantiles of the bootstrap distribution of λ̂ in reverse order (see Davison and Hinkley (1997), Eq. (5.6), p. 194):

(2λ̂ − λ*_{1−α/2}, 2λ̂ − λ*_{α/2}),

where λ*_{1−α/2} denotes the 1 − α/2 quantile of the bootstrapped estimates λ*.

Second, we use the bootstrapped quantiles directly to form the percentile bootstrap confidence interval (see Davison and Hinkley (1997), Eq. (5.18), p. 203; Tibshirani and Efron (1993), Eq. (13.5), p. 171):

(λ*_{α/2}, λ*_{1−α/2}).

Third, we use bootstrapped standard errors and the quantiles of the normal distribution to create bootstrap confidence intervals:

(λ̂ − z_{1−α/2} ŝe_{λ*}, λ̂ + z_{1−α/2} ŝe_{λ*}),

where z_{1−α/2} denotes the 1 − α/2 quantile of the standard normal distribution and ŝe_{λ*} is the standard error estimated by the empirical standard deviation of the bootstrapped estimates λ*. We additionally present the confidence intervals given by the GLM model itself (Dobson and Barnett, 2018) in Figure 5.

Fig 5: Estimates of the signal strength (λ) of the Higgs boson in the experimental data, along with the uncertainty intervals. The true λ, given by the dotted diagonal line, varies from 0 to 0.15. T specifies the threshold that controls the size of the neighborhood of 1 considered in the Poisson regression model, and Bins specifies the number of bins in the histogram used to estimate the density ĝ_q(1), as described in Section 3.

We observe from Figure 5 that the estimates of λ do not vary much with the number of bins used for the histogram estimate of the density. We also notice that the threshold value closer to 1 gives better estimates of λ. We additionally tried a third threshold value on a sub-sample of the data set, but the estimates in that case were not very stable, and hence we did not consider it for the final larger data set. The uncertainty intervals created using bootstrapped quantiles and bootstrapped standard errors appear to be relatively well-calibrated for the threshold closer to 1, i.e., they include the true value of λ, given by the dotted diagonal line, in each of the plots. The uncertainty intervals given by the GLM model and the basic bootstrap, on the other hand, do not contain the true λ for some smaller values of λ.
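The three bootstrap interval constructions above can be sketched as follows; the function name is our own, but the formulas follow the definitions just given.

```python
import numpy as np
from statistics import NormalDist

def bootstrap_intervals(lam_hat, lam_boot, alpha=0.05):
    """Basic (reverse percentile), percentile and normal bootstrap intervals
    for lambda from the point estimate lam_hat and replicates lam_boot."""
    lo, hi = np.quantile(lam_boot, [alpha / 2, 1 - alpha / 2])
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    se = np.std(lam_boot, ddof=1)            # bootstrap standard error
    return {
        "basic": (2 * lam_hat - hi, 2 * lam_hat - lo),
        "percentile": (lo, hi),
        "normal": (lam_hat - z * se, lam_hat + z * se),
    }
```

Note that the basic interval reflects the bootstrap quantiles around λ̂, while the normal interval is symmetric about λ̂ by construction, which is the centering advantage mentioned below.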
So λ̂ computed with the threshold closer to 1 and either choice of the number of bins, accompanied by either bootstrapped quantile or bootstrapped standard error uncertainty intervals, appears to give the best performance in estimating and quantifying the uncertainty of the signal strength λ. An additional advantage of the uncertainty intervals based on bootstrapped standard errors is that they are centered about the estimate λ̂.

5.4. Interpreting the Classifier Using Active Subspace Methods.
We demonstrate the application of the active subspace methods for a random simulation (one of the 50 simulations in Section 5.2) which detects the signal at significance level α = 0.05. We consider a signal strength λ for which the random forests demonstrate good power in detecting the signal (Table 2).

Fig 6: The active subspace variables for the classifier trained on data with this signal strength, computed using a local linear smoother with a Gaussian kernel and smoothing parameter h. (a) gives the distribution of the standardized gradients β̂_j(Z_i)/ŝd(β̂_j(Z_i)) given by the local linear smoother at every point in the combined test data Z_i ∈ X ∪ W; the dots denote the mean. In (b), (c) and (d), the violin plots and the dashes give the bootstrapped empirical distribution and the bootstrapped uncertainty intervals computed using the empirical quantiles, for the standardized mean, the first eigenvector and the second eigenvector, respectively. In (b) the dots give the mean standardized gradient as in (a). In (c) and (d) the dots represent the first and second eigenvector computed on the combined test data, respectively.

We then use Method 4.1 presented in Section 4 to find the active subspace of the fitted semi-supervised classifier. The second step of the algorithm requires us to choose a linear smoother as well as a smoothing parameter for it. We choose a Gaussian kernel smoother as the linear smoother. The smoothing parameter h is used to scale the standard deviations of the variables, which are then used as the standard deviations of the multivariate Gaussian kernel, i.e., ŝd(Z)/h is used as the standard deviation of the multivariate Gaussian kernel.
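A minimal sketch of this gradient-based construction is given below. It is simplified relative to Method 4.1: a single isotropic bandwidth replaces the per-variable scaling ŝd(Z)/h, and there is no bootstrapping.

```python
import numpy as np

def local_linear_gradients(Z, h_out, bandwidth):
    """Estimate the gradient of the classifier output at each point by a
    Gaussian-kernel-weighted linear fit (local linear smoother)."""
    n, d = Z.shape
    grads = np.empty((n, d))
    for i in range(n):
        # Gaussian kernel weights centered at Z[i]
        w = np.exp(-0.5 * np.sum(((Z - Z[i]) / bandwidth) ** 2, axis=1))
        X = np.hstack([np.ones((n, 1)), Z - Z[i]])  # intercept + centered coords
        W = X * w[:, None]
        # Weighted least squares: solve the weighted normal equations
        beta = np.linalg.lstsq(W.T @ X, W.T @ h_out, rcond=None)[0]
        grads[i] = beta[1:]  # slope part is the gradient estimate
    return grads

def active_subspace(grads):
    """Mean gradient and eigenvectors (descending eigenvalue order) of the
    gradient second-moment matrix."""
    C = grads.T @ grads / len(grads)
    eigval, eigvec = np.linalg.eigh(C)
    order = np.argsort(eigval)[::-1]
    return grads.mean(axis=0), eigvec[:, order]
```

For an exactly linear classifier surface, the local linear fit recovers the gradient exactly, so the mean gradient and leading eigenvector both align with the direction of steepest change.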
The smoothing parameter selection process is described in the Supplementary Material (Chakravarti et al., 2021).

Figure 6 gives the active variables given by the mean standardized gradient, the first eigenvector and the second eigenvector. The higher eigenvectors do not contain much information and have been included in the Supplementary Material (Chakravarti et al., 2021). The figure additionally gives the distribution of the bootstrapped active subspace estimates as violin plots, which allow us to construct uncertainty intervals for the active subspace variables using the bootstrapped empirical percentiles.

Figure 6(a) additionally provides the violin plots of the standardized gradient estimates β̂_j(Z_i)/ŝd(β̂_j(Z_i)) given by the local linear smoother at every point in the combined test data Z_i ∈ X ∪ W. We notice that these violin plots are not very informative: they are all symmetric about zero, i.e., they do not appear to convey any information about the slope of the classifier surface.

The mean standardized gradient in Figure 6(b) gives the direction in which the classifier output changes most rapidly on average. We see that jointly higher transverse momenta of the hadronic tau (tau_pt) and the remaining lepton (lep_pt) contribute the most to an increase in the classifier output. An increase in these transverse momenta, combined with a decrease in the transverse momentum of the leading jet (lead_pt) and in the scalar sum of the transverse momenta of all the jets (all_pt), leads to an increase in the classifier output. This implies that the detected signal events display higher momenta of the hadronic tau and the other lepton, and lower momenta of the leading jet and a lower scalar sum of the momenta of all the jets, compared to background events.
Note that the mean gradient vector not only captures the dependency of the classifier on each variable individually, but also characterizes the multivariable dependencies that influence the classifier.

The first eigenvector gives the first principal component of the gradients, which captures the relationship between the variables that causes the most variability in the gradients of the classifier. The first eigenvector, as seen in Figure 6(c), indicates that joint changes in the same direction of all_pt (the scalar sum of the transverse momenta), lead_pt (the transverse momentum of the leading jet), sublead_pt (the transverse momentum of the subleading jet) and met (the missing transverse energy) cause the most variation in the classifier. This means that these variables together help the classifier separate the signal from the background. The second eigenvector, i.e., the second principal component of the gradients, as seen in Figure 6(d), appears to capture the relationship between the azimuthal angles φ of the different objects in the event that most influences the classifier. Recall that the φ values of the objects have been transformed to denote the difference in angle between each object and the leading jet. So the second eigenvector indicates that the φ angle between the leading jet and all four other objects (the subleading jet, the hadronic tau, the lepton and the missing transverse energy) influences the classifier. This has an appealing physical interpretation: the azimuthal orientation between the leading jet and the rest of the event is a useful feature for separating the signal from the background.

REMARK. Note that for any eigenvector M_{·j} of the standardized gradients of the trained classifier, −M_{·j} is also an eigenvector. This causes the violin plots of the bootstrapped eigenvectors to be systematically symmetric about zero.
To solve this problem, we first take the variable with the largest absolute eigenvector entry, i.e., k_j = arg max_i |M̂_{ij}|. We then fix the sign of each bootstrapped eigenvector M̂*_{·j} so that the sign of its k_j-th entry matches, i.e., sgn(M̂*_{k_j j}) = sgn(M̂_{k_j j}). This procedure has been followed in Figures 6(c) and 6(d) to handle the problem.

So, the active subspace methods provide an algorithm for interpreting the semi-supervised classifier once it has detected a signal in the experimental data. The methods show that the classifier that detects the Higgs boson is positively influenced by tau_pt and lep_pt, and negatively influenced by all_pt and lead_pt. Additionally, it is influenced by joint changes in the values of all_pt, lead_pt, sublead_pt and met. The classifier is also influenced by the difference between the azimuthal angle φ of the leading jet and those of the other four objects, indicating that this may be an important feature for detecting the signal events. Finally, we note that without techniques like these it would be very difficult to understand what kind of signal the semi-supervised classifier has detected within the high-dimensional feature space.
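The sign-alignment step in the remark above can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def align_sign(m_hat, m_boot):
    """Fix the sign of a bootstrapped eigenvector m_boot so that the
    coordinate with the largest |entry| in the point estimate m_hat
    has a matching sign, resolving the +/- eigenvector ambiguity."""
    k = int(np.argmax(np.abs(m_hat)))          # k_j = argmax_i |M_hat_ij|
    if np.sign(m_boot[k]) != np.sign(m_hat[k]):
        return -m_boot
    return m_boot

m = np.array([0.1, -0.9, 0.3])                 # point-estimate eigenvector
b = np.array([-0.2, 0.8, -0.4])                # bootstrap replicate, flipped
assert np.allclose(align_sign(m, b), [0.2, -0.8, 0.4])
```

Applying this to every bootstrap replicate before plotting removes the artificial symmetry about zero in the violin plots.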
6. Conclusion.
In this paper, we studied model-independent anomaly detection tests using semi-supervised classifiers that can detect the presence of signal events hidden within background events in high energy physics data sets. Additionally, we proposed methods to estimate the signal strength and to identify the active subspace affecting the classifier most strongly, leading to an understanding of the detected signal. We demonstrated the performance of the methods and compared the proposed tests with comparable model-dependent supervised methods, as well as with nearest neighbor two-sample tests, on a data set related to the search for the Higgs boson at the Large Hadron Collider.

We presented multiple model-independent methods that search for the signal without assuming any knowledge of the signal model. By not assuming a signal model, we retain the ability to detect unknown and unexpected signals. This is an important capability for future searches at the Large Hadron Collider, where model-dependent searches have so far not yielded evidence of physics beyond the Standard Model. We used a semi-supervised classifier to distinguish the experimental data from the background data, and used the performance of the classifier to test for a significant difference between the two data sets. We proposed three test statistics for the test: the likelihood ratio test (LRT) statistic, the area under the ROC curve (AUC) statistic and the misclassification error (MCE) statistic.

We compared the use of the different test statistics, as well as the use of different techniques for obtaining the null distribution, in building the test with the most power to detect the signal. We compared the power of the methods to detect the Higgs boson at different signal strengths and showed that a version of the proposed AUC method has power that is competitive with the model-dependent methods.
So, even when the signal model is correctly assumed by the model-dependent methods, the proposed model-independent methods still have competitive power to detect the presence of the signal. However, when the signal model is incorrectly assumed or misspecified, the model-dependent methods can miss the signal entirely, whereas the proposed model-independent methods are still able to detect it, as demonstrated in our experiments. In particular, the proposed methods demonstrate the ability to find new particles without specific a priori knowledge of their properties.

As described in Section 2.1, the model-dependent methods currently in use in experimental high-energy physics are slightly different from the ones presented there. In this paper, to make the model-dependent methods comparable to the proposed model-independent ones, we considered high-dimensional model-dependent tests. In current practice, though, a threshold is placed on the supervised classifier output to select a subset of the experimental
data that is richer in signal events. The high-dimensional data set is then transformed into a univariate one, and the transformed data are fitted with a one-dimensional mixture model consisting of signal and background components. This is used to construct a profile likelihood ratio test (ATLAS Collaboration and CMS Collaboration, 2011), where some of the nuisance parameters are related to the one-dimensional background model. Dauncey et al. (2015) proposed a discrete profiling method for this test, treating the choice of the background functional form as a discrete nuisance parameter which is profiled in a way analogous to continuous nuisance parameters.

These considerations motivate multiple avenues for future exploration. We could construct an analogous model-independent version of the approach described above, where we first use the semi-supervised classifier output to select a signal-rich subset of the experimental data, then transform the selected data to a one-dimensional space, and finally perform a test in the one-dimensional space using a mixture model. For well-chosen transformations, this could increase the power of the test. More interestingly, for many signals, the signal distribution corresponding to a transformation into the invariant mass variable is predicted by quantum mechanics to be the Cauchy distribution (also known as the Breit–Wigner distribution in particle physics; see Chapter 49 in Zyla et al. (2020)), whose parameters, under the model-independent scenario, are unknown to us. It would be interesting to build this knowledge into the semi-supervised method.
A straightforward approach would be to simply use the Cauchy distribution in the one-dimensional mixture model, but it might also be possible to incorporate this knowledge into the training of the high-dimensional semi-supervised classifier.

Another important avenue of future work is to find ways to account for systematic uncertainties in the background training data. As described above, current supervised techniques account for systematics using one-dimensional profile likelihoods. This should also be feasible for one-dimensional semi-supervised tests. However, it might also be possible to use profiling in the high-dimensional semi-supervised likelihood ratio tests. For example, if we had two plausible background samples from two different MC generators, we could use optimal transport (Peyré et al., 2019) to find the geodesic between the samples. We could then profile over the parameter corresponding to the location on the geodesic, which would account for the systematic uncertainty corresponding to the envelope of background models spanned by the two MC generators.

Finally, it is worth noting that these ideas can be applied more generally. Here we demonstrated an application of the methods to the search for new phenomena in particle physics, but the methods presented here are applicable beyond particle physics. In general, they can be applied whenever there is a need to search for, detect or characterize collective anomalies in a high-dimensional space. Potential further applications range from engineering to medicine and other areas of the physical sciences.

SUPPLEMENTARY MATERIAL

Supplementary Material for "Model-Independent Detection of New Physics Signals Using Interpretable Semi-Supervised Classifier Tests". The supplementary material contains details of the exploratory data analysis of the Higgs boson data used in the experiments in Section 5.
It additionally describes the selection of the smoothing parameter for the active subspace methods.

APPENDIX A: PROOF OF THEOREM 3.1

We detail the proof of Theorem 3.1 here.
PROOF. To show Part 1, we note that ρ(Z) is just 1 minus the CDF of q(Z)/p_b(Z) under Z ∼ p_b. Therefore, ρ(Z) ∼ Unif(0, 1) if Z ∼ p_b.

To show Part 2, note that if z ∈ C_{t_u}, then p_s(z)/p_b(z) ≥ t_u. So P_{X∼p_b}(C_{t_u}) = u implies that ρ(z) ≤ u. Then by definition,

g(u) = lim_{δ↓0} P_{Z∼p_s}(u < ρ(Z) ≤ u + δ) / δ
     = lim_{δ↓0} P_{Z∼p_s}(Z ∈ C_{t_{u+δ}} \ C_{t_u}) / δ
     = lim_{δ↓0} [ ∫_{C_{t_{u+δ}} \ C_{t_u}} p_s(z) dz ] / δ
     = lim_{δ↓0} [ ∫_{C_{t_{u+δ}} \ C_{t_u}} (p_s(z)/p_b(z)) p_b(z) dz ] / δ
     = t_u lim_{δ↓0} [ ∫_{C_{t_{u+δ}} \ C_{t_u}} p_b(z) dz ] / δ
     = t_u.

The second-to-last equality holds because for all z ∈ C_{t_{u+δ}} \ C_{t_u}, we have p_s(z)/p_b(z) ∈ (t_u, t_{u+δ}], so as δ ↓ 0, p_s(z)/p_b(z) → t_u for all such z. The last equality follows from the definition of t_u and t_{u+δ}.

Part 3 follows trivially from Part 2 by noticing that g(1) = 0, while the corresponding density of ρ(Z) under p_b equals 1. Hence g_q(1) = (1 − λ) · 1 + λ g(1) = 1 − λ.

APPENDIX B: DETAILED MODEL-INDEPENDENT SEMI-SUPERVISED ALGORITHMS

Here we detail the remaining model-independent methods, which include the asymptotic tests using the three test statistics LRT, AUC and MCE, and the slow permutation method as introduced in Section 2.2.
METHOD B.1. LRT Asymptotic Method — fast method.

1. Split the background data X = {X_1, ..., X_{m_b}} into subsets X_1 and X_2 of sizes m_1 and m_2, respectively.
2. Split the experimental data W = {W_1, ..., W_n} into subsets W_1 and W_2 of sizes n_1 and n_2, respectively, with n_2 = m_2.
3. Train the classifier ĥ on X_1 and W_1.
4. Evaluate the LRT statistic on W_2 using (9) and (10) as

(28)   T̃ = log((1 − π_1)/π_1) + (1/n_2) Σ_{W_i ∈ W_2} log( ĥ(W_i) / (1 − ĥ(W_i)) ),

where π_1 = n_1/(m_1 + n_1).
5. Since under the null H_0: p_b = q, i.e., X_2 and W_2 have the same distribution, T̃ defined on W_2 has the same distribution as

T' = log((1 − π_1)/π_1) + (1/m_2) Σ_{X_i ∈ X_2} log( ĥ(X_i) / (1 − ĥ(X_i)) )

defined on X_2. Then, conditional on X_1 and W_1, since n_2 = m_2, under H_0,

√n_2 (T̃ − T') / √(2σ²) →d N(0, 1),

where σ² = Var( log( ĥ(X) / (1 − ĥ(X)) ) | X_1, W_1 ) and Var denotes the variance under p_b. This can be estimated by the sample variance over X_2.
6. Compute the Normal-distribution-based p-value.
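A sketch of steps 4–6, given the classifier outputs on the held-out splits (the function name is our own; the √(2σ²) scaling assumes equal split sizes n_2 = m_2, as in step 2):

```python
import numpy as np
from statistics import NormalDist

def lrt_asymptotic_pvalue(h_x2, h_w2, pi1):
    """Asymptotic p-value for the classifier-based LRT (cf. Method B.1),
    assuming the held-out splits have equal size."""
    h_x2 = np.clip(np.asarray(h_x2), 1e-12, 1 - 1e-12)
    h_w2 = np.clip(np.asarray(h_w2), 1e-12, 1 - 1e-12)
    offset = np.log((1 - pi1) / pi1)
    t_w = offset + np.mean(np.log(h_w2 / (1 - h_w2)))  # statistic on W_2
    t_x = offset + np.mean(np.log(h_x2 / (1 - h_x2)))  # null reference on X_2
    n2 = len(h_w2)
    sigma2 = np.var(np.log(h_x2 / (1 - h_x2)), ddof=1)  # estimated on X_2
    z = np.sqrt(n2) * (t_w - t_x) / np.sqrt(2 * sigma2)
    return 1.0 - NormalDist().cdf(z)  # one-sided: large T favours a signal
```

When the classifier scores on the experimental split are stochastically larger than on the background split, z is large and the p-value is small.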
METHOD B.2. AUC Asymptotic Method — fast method.

1. Perform steps 1–3 of Method B.1 to obtain a classifier ĥ based on X_1 and W_1.
2. Evaluate θ̂ as defined in Equation (11), using X_2 and W_2, as

(29)   θ̂ = (1/(m_2 n_2)) Σ_{X_i ∈ X_2} Σ_{W_j ∈ W_2} I{ ĥ(W_j) > ĥ(X_i) }.
3. The expectation and the approximate variance of the estimator θ̂ are given by Newcombe (2006) as

E[θ̂] = (1/(m_2 n_2)) Σ_{X_i ∈ X_2} Σ_{W_j ∈ W_2} E[ I{ ĥ(W_j) > ĥ(X_i) } ] = P( ĥ(W) > ĥ(X) ) = θ,

V(θ̂) ≈ ( θ(1 − θ) / (m_2 n_2) ) [ 2n* − 1 − (3n* − 3) / ((2 − θ)(1 + θ)) ],

where n* = (m_2 + n_2)/2. Under H_0, θ = 0.5 and hence

(30)   (θ̂ − 0.5) / √(V(θ̂)) →d N(0, 1),

where

(31)   V(θ̂) = ( 0.25 / (m_2 n_2) ) [ 2n* − 1 − (3n* − 3)/2.25 ].

We then compute the Normal-distribution-based p-value.
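The AUC test of Method B.2 can be sketched as follows, using the null variance approximation at θ = 1/2 given in (31):

```python
import numpy as np
from statistics import NormalDist

def auc_asymptotic_pvalue(h_x2, h_w2):
    """Asymptotic AUC test (cf. Method B.2): compare theta_hat against its
    null value 1/2 using Newcombe's (2006) variance approximation."""
    hx, hw = np.asarray(h_x2), np.asarray(h_w2)
    m2, n2 = len(hx), len(hw)
    # theta_hat: fraction of (X, W) pairs with h(W_j) > h(X_i)
    theta_hat = np.mean(hw[None, :] > hx[:, None])
    n_star = (m2 + n2) / 2.0
    var0 = 0.25 / (m2 * n2) * (2 * n_star - 1 - (3 * n_star - 3) / 2.25)
    z = (theta_hat - 0.5) / np.sqrt(var0)
    return 1.0 - NormalDist().cdf(z)  # one-sided: large AUC favours a signal
```

Identically distributed scores give θ̂ ≈ 1/2 and a moderate p-value, while well-separated scores push θ̂ toward 1 and the p-value toward 0.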
METHOD B.3. MCE Asymptotic Method — fast method.

1. Perform steps 1–3 of Method B.1 to obtain a classifier ĥ based on X_1 and W_1.
2. Evaluate M̂ using X_2 and W_2, based on Equation (12), as

(32)   M̂ = (1/(2 m_2)) Σ_{X_i ∈ X_2} I{ ĥ(X_i) > π_1 } + (1/(2 n_2)) Σ_{W_j ∈ W_2} I{ ĥ(W_j) < π_1 },

where π_1 = n_1/(m_1 + n_1).
3. Since under the null H_0: p_b = q, X_2 and W_2 have the same distribution, conditional on X_1 and W_1, under H_0,

( M̂ − 1/2 ) / √( p̂_h (1 − p̂_h) ( 1/(4 m_2) + 1/(4 n_2) ) ) →d N(0, 1),

where

p̂_h = P̂( ĥ(X) > π_1 ) = (1/(m_2 + n_2)) [ Σ_{X_i ∈ X_2} I{ ĥ(X_i) > π_1 } + Σ_{W_j ∈ W_2} I{ ĥ(W_j) > π_1 } ],

since under H_0, X_i =d W_j.
4. Compute the Normal-distribution-based p-value.
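The MCE test of Method B.3 can be sketched as follows; a small misclassification error indicates good separation, so the lower tail is evidence against H_0:

```python
import numpy as np
from statistics import NormalDist

def mce_asymptotic_pvalue(h_x2, h_w2, pi1):
    """Asymptotic MCE test (cf. Method B.3): compare the balanced
    misclassification error M_hat against its null value 1/2."""
    hx, hw = np.asarray(h_x2), np.asarray(h_w2)
    m2, n2 = len(hx), len(hw)
    # Balanced misclassification error at threshold pi1
    mce = 0.5 * np.mean(hx > pi1) + 0.5 * np.mean(hw < pi1)
    # Pooled estimate of P(h > pi1), valid under H0
    p_h = (np.sum(hx > pi1) + np.sum(hw > pi1)) / (m2 + n2)
    var0 = p_h * (1 - p_h) * (1.0 / (4 * m2) + 1.0 / (4 * n2))
    z = (mce - 0.5) / np.sqrt(var0)
    return NormalDist().cdf(z)  # lower tail: small MCE gives a small p-value
```

Under H_0 the two score samples are exchangeable, so the two error terms average to exactly 1/2 and z is centered at 0; strong separation drives the MCE toward 0 and the p-value toward 0.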
METHOD B.4. In-sample Permutation Method — this is a slow method.

1. Train a classifier ĥ on X = {X_1, ..., X_{m_b}} and W = {W_1, ..., W_n}.
2. Compute the classifier-based LRT statistic using (9) and (10) as

(33)   T = log((1 − π)/π) + (1/n) Σ_i log( ĥ(W_i) / (1 − ĥ(W_i)) ),

where π = n/(m_b + n). This is an estimate of (1/n) Σ_i log( q(W_i)/p_b(W_i) ). Note that the sum is only over W. Estimate θ̂ by Equation (11) and the MCE M̂ by Equation (12).
3. Estimate the null distribution of T, θ̂ and M̂ by repeatedly permuting the class labels of X ∪ W = {X_1, ..., X_{m_b}, W_1, ..., W_n}, re-training the classifier on the permuted labels each time, and computing the statistics T, θ̂ and M̂ in each case. This works since under H_0 the X_j's and the W_i's have the same distribution.
4. Calculate the p-value based on the estimated null distribution.

REFERENCES

AABOUD, M., AAD, G., ABBOTT, B., ABDINOV, O., ABELOOS, B., ABIDI, S. H., ABOUZEID, O., ABRAHAM, N., ABRAMOWICZ, H., ABREU, H. et al. (2019). A strategy for a general search for new phenomena using data-derived signal regions and its application within the ATLAS experiment. The European Physical Journal C.

AAD, G., ABAJYAN, T., ABBOTT, B., ABDALLAH, J., KHALEK, S. A., ABDELALIM, A. A., ABDINOV, O., ABEN, R., ABI, B., ABOLINS, M. et al. (2012). Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B.

AALTONEN, T., ADELMAN, J., AKIMOTO, T., ALBROW, M. G., GONZALEZ, B. A., AMERIO, S., AMIDEI, D., ANASTASSOV, A., ANNOVI, A., ANTOS, J. et al. (2009). Global search for new physics with 2.0 fb−1 at CDF. Physical Review D.

ADAM-BOURDARIOS, C., COWAN, G., GERMAIN, C., GUYON, I., KÉGL, B. and ROUSSEAU, D. (2015). The Higgs boson machine learning challenge. In Proceedings of the NIPS 2014 Workshop on High-energy Physics and Machine Learning (G. Cowan, C. Germain, I. Guyon, B. Kégl and D. Rousseau, eds.). Proceedings of Machine Learning Research.

AKTAS, A., ANDREEV, V., ANTHONIS, T., ASMONE, A., BABAEV, A., BACKOVIC, S., BÄHR, J., BARANOV, P., BARRELET, E., BARTEL, W. et al. (2004). A general search for new phenomena in ep scattering at HERA. Physics Letters B.

ANDREASSEN, A., NACHMAN, B. and SHIH, D. (2020). Simulation assisted likelihood-free anomaly detection. Physical Review D.

BACH, S., BINDER, A., MONTAVON, G., KLAUSCHEN, F., MÜLLER, K.-R. and SAMEK, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One e0130140.

BECKS, K.-H. and PERRET-GALLIX, D. (1994). New computing techniques in physics research. CERN Courier.

BEHNKE, O., KRÖNINGER, K., SCHÖRNER-SADENIUS, T. and SCHOTT, G. (2013). Data Analysis in High Energy Physics. Wiley-VCH, Weinheim, Germany.

BERTRAM, I., FOX, H., ROSS, A., W
ILLIAMS , M., R
ATOFF , P. et al. (2012). Model independent search for newphenomena in pp (bar) collisions at √ s = 1 . TeV.
Physical Review D .B HAT , P. C. (2011). Multivariate analysis methods in particle physics.
Annual Review of Nuclear and ParticleScience ÖHNING , D., D
IETZ , E., S
CHAUB , R., S
CHLATTMANN , P. and L
INDSAY , B. G. (1994). The distribution of thelikelihood ratio for mixtures of densities from the one-parameter exponential family.
Annals of the Institute ofStatistical Mathematics REIMAN , L. (2001). Random forests.
Machine learning ASA , A. and M
ENARDI , G. (2018). Nonparametric semisupervised classification for signal detection in highenergy physics. arXiv preprint arXiv:1809.02977 .C HAKRAVARTI , P., K
UUSELA , M., L EI , J. and W ASSERMAN , L. (2021). Supplement to “Model-IndependentDetection of Signals Using Interpretable Semi-Supervised Classifier Tests.”.C
HANDOLA , V., B
ANERJEE , A. and K
UMAR , V. (2009). Anomaly detection: A survey.
ACM computing surveys(CSUR) HATRCHYAN , S., K
HACHATRYAN , V., S
IRUNYAN , A. M., T
UMASYAN , A., A
DAM , W., A
GUILO , E.,B
ERGAUER , T., D
RAGICEVIC , M., E RÖ , J., F ABJAN , C. et al. (2012). Observation of a new boson at amass of 125 GeV with the CMS experiment at the LHC.
Physics Letters B
COLLABORATION (2014). Dataset from the ATLAS Higgs Boson Machine Learning Challenge 2014.
CERN Open Data Portal .CMS C
OLLABORATION (2017). MUSiC, a model unspecific search for new physics, in pp collisions at √ s = 8 TeV.
CMS Physics Analysis Summary CMS-PAS-EXO-14/016 .CMS C
OLLABORATION (2020). MUSiC, a model unspecific search for new physics, in pp collisions at sqrt (s)=13 TeV Technical Report, Technical report CMS-PAS-EXO-19-008, CERN, Geneva.ATLAS C
OLLABORATION AND
CMS C
OLLABORATION (2011). LHC Higgs Combination Group, Procedurefor the LHC Higgs boson search combination in Summer 2011 Technical Report, CMS-NOTE-2011-005.C
OLLINS , J., H
OWE , K. and N
ACHMAN , B. (2018). Anomaly detection for resonant new physics with machinelearning.
Physical review letters
OLLINS , J. H., H
OWE , K. and N
ACHMAN , B. (2019). Extending the search for new resonances with machinelearning.
Physical Review D ONSTANTINE , P. G. (2015).
Active subspaces: Emerging ideas for dimension reduction in parameter studies .SIAM.C ONSTANTINE , P. G., D OW , E. and W ANG , Q. (2014). Active subspace methods in theory and practice: appli-cations to kriging surfaces.
SIAM Journal on Scientific Computing A1500–A1524.C
ONSTANTINE , P. G., E
MORY , M., L
ARSSON , J. and I
ACCARINO , G. (2015). Exploiting active subspaces toquantify uncertainty in the numerical simulation of the HyShot II scramjet.
Journal of Computational Physics
OWAN , G., C
RANMER , K., G
ROSS , E. and V
ITELLS , O. (2011). Asymptotic formulae for likelihood-basedtests of new physics.
The European Physical Journal C C RANMER , K. (2015). Practical Statistics for the LHC. arXiv preprint arXiv:1503.07622 .C RANMER , K., P
AVEZ , J. and L
OUPPE , G. (2015). Approximating likelihood ratios with calibrated discrimina-tive classifiers. arXiv preprint arXiv:1506.02169 .C UI , C., Z HANG , K., D
AULBAEV , T., G
USAK , J., O
SELEDETS , I. and Z
HANG , Z. (2020). Active Subspace ofNeural Networks: Structural Analysis and Universal Attacks.
SIAM Journal on Mathematics of Data Science GNOLO , R. T. and W
ULZER , A. (2019). Learning new physics from a machine.
Physical Review D GNOLO , R. T., G
ROSSO , G., P
IERINI , M., W
ULZER , A. and Z
ANETTI , M. (2021). Learning multivariatenew physics.
The European Physical Journal C AUNCEY , P., K
ENZIE , M., W
ARDLE , N. and D
AVIES , G. (2015). Handling uncertainties in background shapes:the discrete profiling method.
Journal of Instrumentation P04015.D
AVISON , A. and H
INKLEY , D. (1997). Cambridge series in statistical and probabilistic mathematics.
Bootstrapmethods and their applications .D OBSON , A. J. and B
ARNETT , A. G. (2018).
An introduction to generalized linear models . CRC press.F
RIEDMAN , J., H
ASTIE , T. and T
IBSHIRANI , R. (2001).
The elements of statistical learning . Springer seriesin statistics New York.G HOSH , J. K. and S EN , P. K. (1984). On the asymptotic performance of the log likelihood ratio statistic for themixture model and related results. Berkeley Conference In Honor of Jerzy Neyman and Jack Kiefer .G OODFELLOW , I., P
OUGET -A BADIE , J., M
IRZA , M., X U , B., W ARDE -F ARLEY , D., O
ZAIR , S.,C
OURVILLE , A. and B
ENGIO , Y. (2014). Generative adversarial nets. In
Advances in neural informationprocessing systems
RÖMPING , U. (2009). Variable importance assessment in regression: linear regression versus random forest.
The American Statistician ANLEY , J. A. et al. (1989). Receiver operating characteristic (ROC) methodology: the state of the art.
Crit RevDiagn Imaging ANLEY , J. A. and M C N EIL , B. J. (1982). The meaning and use of the area under a receiver operating charac-teristic (ROC) curve.
Radiology
ENZE , N. (1988). A multivariate two-sample test based on the number of nearest neighbor type coincidences.
The Annals of Statistics
SHWARAN , H. et al. (2007). Variable importance in binary regression trees and forests.
Electronic Journal ofStatistics ASIECZKA , G., N
ACHMAN , B., S
HIH , D., A
MRAM , O., A
NDREASSEN , A., B
ENKENDORFER , K., B OR - TOLATO , B., B
ROOIJMANS , G., C
ANELLI , F., C
OLLINS , J. H. et al. (2021). The LHC Olympics 2020: ACommunity Challenge for Anomaly Detection in High Energy Physics. arXiv preprint arXiv:2101.08320 .K IM , I., R AMDAS , A., S
INGH , A. and W
ASSERMAN , L. (2016). Classification accuracy as a proxy for twosample testing. arXiv preprint arXiv:1602.02210 .K IM , I., L EE , A. B., L EI , J. et al. (2019). Global and local two-sample tests via regression. Electronic Journalof Statistics UUSELA , M., V
ATANEN , T., M
ALMI , E., R
AIKO , T., A
ALTONEN , T. and N
AGAI , Y. (2012). Semi-supervisedanomaly detection–towards model-independent searches of new physics. In
Journal of Physics: ConferenceSeries EI , J., G’S ELL , M., R
INALDO , A., T
IBSHIRANI , R. J. and W
ASSERMAN , L. (2018). Distribution-free predic-tive inference for regression.
Journal of the American Statistical Association
ETZ , C. E. (1978). Basic principles of ROC analysis. In
Seminars in nuclear medicine ACHMAN , B. and S
HIH , D. (2020). Anomaly detection with density estimation.
Physical Review D
ELDER , J. A. and W
EDDERBURN , R. W. (1972). Generalized linear models.
Journal of the Royal StatisticalSociety: Series A (General)
EWCOMBE , R. G. (2006). Confidence intervals for an effect size measure based on the Mann–Whitney statistic.Part 2: asymptotic methods and evaluation.
Statistics in Medicine EYRÉ , G., C
UTURI , M. et al. (2019). Computational optimal transport: With applications to data science.
Foun-dations and Trends® in Machine Learning ADOVIC , A., W
ILLIAMS , M., R
OUSSEAU , D., K
AGAN , M., B
ONACORSI , D., H
IMMEL , A., A
URISANO , A.,T
ERAO , K. and W
ONGJIRAD , T. (2018). Machine learning at the energy and intensity frontiers of particlephysics.
Nature
EISS , R.-D. (2012).
A course on point processes . Springer Science & Business Media.S
CHILLING , M. F. (1986). Multivariate two-sample tests based on nearest neighbors.
Journal of the AmericanStatistical Association HRIKUMAR , A., G
REENSIDE , P. and K
UNDAJE , A. (2017). Learning important features through propagatingactivation differences. arXiv preprint arXiv:1704.02685 .IGNAL DETECTION USING INTERPRETABLE SEMI-SUPERVISED CLASSIFIER TESTS S TOREY , J. D. (2002). A direct approach to false discovery rates.
Journal of the Royal Statistical Society: SeriesB (Statistical Methodology) TROBL , C., B
OULESTEIX , A.-L., K
NEIB , T., A
UGUSTIN , T. and Z
EILEIS , A. (2008). Conditional variableimportance for random forests.
BMC bioinformatics UNDARARAJAN , M., T
ALY , A. and Y AN , Q. (2017). Axiomatic attribution for deep networks. arXiv preprintarXiv:1703.01365 .T IBSHIRANI , R. J. and E
FRON , B. (1993). An introduction to the bootstrap.
Monographs on statistics and appliedprobability AN DER L AAN , M. J. (2006). Statistical inference for variable importance.
The International Journal of Bio-statistics .V ATANEN , T., K
UUSELA , M., M
ALMI , E., R
AIKO , T., A
ALTONEN , T. and N
AGAI , Y. (2012). Semi-superviseddetection of collective anomalies with an application in high energy particle physics. In
The 2012 InternationalJoint Conference on Neural Networks (IJCNN)
ILLIAMSON , B. D., G
ILBERT , P. B., S
IMON , N. R. and C
ARONE , M. (2020). A unified approach for inferenceon algorithm-agnostic variable importance. arXiv preprint arXiv:2004.03683 .Z YLA , P. A. et al. (2020). Review of Particle Physics.
PTEP2020