Managing Many Simultaneous Systematic Uncertainties
Luca Lista∗†
INFN Sezione di Napoli
E-mail: [email protected]

Agostino De Iorio
Università degli Studi di Napoli Federico II and INFN Sezione di Napoli
E-mail: [email protected]

Alberto Orso Maria Iorio
Università degli Studi di Napoli Federico II and INFN Sezione di Napoli
E-mail: [email protected]
Recent statistical evaluations for High-Energy Physics measurements, in particular those at the Large Hadron Collider, require the careful evaluation of many sources of systematic uncertainty at the same time. While the fundamental aspects of the statistical treatment are now consolidated, in both the frequentist and the Bayesian approach, the management of many sources of uncertainty and their corresponding nuisance parameters in analyses that combine multiple control regions and decay channels may, in practice, pose challenging implementation issues. These make the analysis infrastructure complex and hard to manage, eventually resulting in simplifications in the treatment of systematics and in limitations to the interpretation of results. Typical cases are discussed, with reference to the most popular implementation tool, RooStats, together with possible ideas for improving the management of such cases in future software implementations.
XIII Quark Confinement and the Hadron Spectrum - Confinement2018
31 July - 6 August 2018
Maynooth University, Ireland

∗Speaker.
†The speaker is node PI of the project INSIGHTS, funded by the European Union's Horizon 2020 research and innovation programme, call H2020-MSCA-ITN-2017, under Grant Agreement n. 765710.

© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). https://pos.sissa.it/
1. Systematic uncertainties and nuisance parameters
The dependence of a probabilistic model on sources of systematic uncertainty is modeled in terms of nuisance parameters. Those parameters may be known from external measurements with some uncertainty. Data samples can constrain nuisance parameters and reduce the original uncertainties. Different approaches are adopted in Bayesian and frequentist applications, but the resulting effect is similar.

Assume a signal-extraction problem based on a data sample x, modeled by parameter(s) of interest µ and nuisance parameters θ. In many cases µ is the so-called signal strength, i.e. the ratio of the measured cross section to the corresponding theory prediction.

Under the Bayesian approach, the posterior probability density for the unknown parameters µ and θ is [1]:
\[
P(\mu,\theta\,|\,x) = \frac{L(x;\mu,\theta)\,\pi(\mu,\theta)}{\int L(x;\mu',\theta')\,\pi(\mu',\theta')\,d\mu'\,d\theta'}\,. \qquad (1.1)
\]
From Eq. (1.1), the probability density of the parameter of interest µ alone is obtained by integrating P(µ, θ | x) over the nuisance parameters θ:
\[
P(\mu\,|\,x) = \int P(\mu,\theta\,|\,x)\,d\theta = \frac{\int L(x;\mu,\theta)\,\pi(\mu,\theta)\,d\theta}{\int L(x;\mu',\theta')\,\pi(\mu',\theta')\,d\mu'\,d\theta'}\,. \qquad (1.2)
\]
Under the frequentist approach, the preferred choice of test statistic is the profile likelihood ratio:
\[
\lambda(\mu) = \frac{L(\mu,\hat{\hat{\theta}})}{L(\hat{\mu},\hat{\theta})}\,, \qquad (1.3)
\]
where µ̂ and θ̂ are the best-fit values of the parameters µ and θ, respectively, and θ̂̂ is the best-fit value of θ for a fixed value of µ, given the data sample x. The distribution of q_µ = −2 ln λ(µ), or of other variations of this test statistic, is used to determine the signal strength parameter µ and/or to set upper limits on the new signal yield. In the case of a single parameter of interest, the distribution of −2 ln λ(µ), for µ equal to its true value, asymptotically follows a χ² distribution with one degree of freedom [2]. This result is due to Wilks' theorem.
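As a minimal numerical illustration of the profile likelihood of Eq. (1.3), one can consider a single-bin counting experiment with a Gaussian-constrained background nuisance parameter. All numbers here are illustrative, and SciPy is used in place of RooFit as a choice of this sketch:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.stats import poisson, norm

# Illustrative inputs: observed count, nominal background, constraint width
n_obs, b_nom, sigma_b = 15, 10.0, 1.0

def nll(mu, b):
    """Negative log likelihood: Poisson count times a Gaussian constraint on b."""
    return -(poisson.logpmf(n_obs, mu + b) + norm.logpdf(b_nom, b, sigma_b))

def profiled_nll(mu):
    """Minimize over the nuisance parameter b at fixed mu (the 'double hat')."""
    res = minimize_scalar(lambda b: nll(mu, b), bounds=(0.1, 30.0), method="bounded")
    return res.fun

# Global minimum over (mu, b) gives the denominator of lambda(mu)
res = minimize(lambda p: nll(p[0], p[1]), x0=[5.0, b_nom],
               bounds=[(0.0, 30.0), (0.1, 30.0)])
nll_min = res.fun

def q(mu):
    """q_mu = -2 ln lambda(mu); asymptotically chi2(1) by Wilks' theorem."""
    return 2.0 * (profiled_nll(mu) - nll_min)
```

By construction q(µ̂) vanishes at the best-fit point, while q(0) quantifies the incompatibility of the data with the background-only hypothesis.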
2. Simultaneous fits
A complementary dataset, or control sample, y, may be used to constrain the nuisance parameters θ. This could be the case of calibration data, background estimates from independent data samples, etc. The statistical problem can be formulated in terms of both the main data sample (x) and the control sample (y), assumed to be statistically independent, with a likelihood function given by the product of the likelihoods of the two samples:
\[
L(x, y; \mu, \theta) = L_x(x; \mu, \theta)\, L_y(y; \mu, \theta)\,, \qquad (2.1)
\]
where L_y does not depend on µ, unless there is signal contamination in the control sample.
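The product likelihood of Eq. (2.1) can be sketched for a two-region counting experiment, where the control region constrains a background that enters the signal region through a simulation-derived transfer factor. The counts and the transfer factor below are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

# Observed counts in signal and control regions, and SR/CR transfer factor
n_sig, n_ctl, tau = 25, 200, 0.1
s_nom = 10.0  # nominal signal yield (assumption of this sketch)

def nll(params):
    """Joint negative log likelihood, Eq. (2.1): L_x(x; mu, b) * L_y(y; b)."""
    mu, b_ctl = params                                  # signal strength, CR background
    nll_x = -poisson.logpmf(n_sig, mu * s_nom + tau * b_ctl)
    nll_y = -poisson.logpmf(n_ctl, b_ctl)
    return nll_x + nll_y

res = minimize(nll, x0=[1.0, float(n_ctl)],
               bounds=[(0.0, 10.0), (1.0, 500.0)])
mu_hat, b_hat = res.x
```

Fitting both regions at once lets the data constrain the background nuisance parameter instead of fixing it to its simulated value.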
Control-sample data are not always available in realistic cases, e.g. calibrations from test beams, data stored in different formats or analyzed with a different software framework, etc. A simple case may be modeled with a simplified probability density function (PDF), given the 'nominal' value θ_nom, which could be a Gaussian, log-normal, Gamma, etc. In this case, the likelihood function becomes:
\[
L(x, \theta_{\mathrm{nom}}; \mu, \theta) = L_x(x; \mu, \theta)\, L_{\theta_{\mathrm{nom}}}(\theta_{\mathrm{nom}}; \theta)\,. \qquad (2.2)
\]
A real-case example of an analysis performed by fitting control regions and a signal region simultaneously is the single-top cross-section measurement performed by CMS [3] at a center-of-mass energy of 8 TeV. In that measurement, events are categorized according to the number of selected hadronic jets and the number of jets identified as b jets, and background yields measured in background-enriched regions are extrapolated to the signal regions using scale factors predicted from simulation.

In many cases, an effective way to model nuisance parameters is to provide distributions modeled as histograms (templates), obtained from simulation by varying each source of systematic uncertainty by plus or minus one standard deviation of the corresponding nuisance parameter. Intermediate values (or values outside the ±1σ range) are obtained by interpolation (or extrapolation), using either parabolic or piece-wise linear models.

Systematic uncertainties may affect the rate (i.e. cross section) or the shape (i.e. distribution) of a process, or both. Examples are the luminosity, pile-up modeling in simulation, jet energy scale, b-tagging efficiency, misidentification probability, lepton selection, reconstruction and trigger efficiencies, as well as uncertainties related to theory modeling: individual cross-section predictions, shape and normalization effects due to renormalization and factorization scales, parton distribution function models, parton shower modeling, generator choice, etc. Uncertainty may also arise from the limited size of the Monte Carlo simulation samples.
3. Software implementations
Most of the methods adopted in High Energy Physics are implemented in the RooStats C++ framework, using convenient modeling of PDFs via the RooFit package [4], released as part of the ROOT toolkit [5]. PDFs from templates are derived from ROOT histograms (RooHistPdf class). Such PDF models, together with data and parameter definitions, are stored in a convenient file format using the class RooWorkspace. Asymptotic approximations from [2] are available and allow one to save CPU time by avoiding intensive toy Monte Carlo generation.

Many analyses in the CMS experiment use a command-line, datacard-driven, Python-powered tool originally developed for the combination of multiple Higgs production and decay channels. Code and documentation are open to public access [6]. A datacard language allows the definition of the analyzed channels and of the signal and background processes. Nuisance parameters are associated via datacards to individual channels and processes, and their PDF models and nominal values are defined.
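A minimal counting-experiment datacard for the CMS combine tool might look as follows; the channel name, process names, yields, and uncertainty values are purely illustrative:

```text
imax 1  # number of channels
jmax 1  # number of background processes
kmax 2  # number of nuisance parameters
------------
bin         ch1
observation 85
------------
bin         ch1    ch1
process     sig    bkg
process     0      1
rate        10.0   75.0
------------
lumi      lnN   1.025   1.025
bkg_norm  lnN   -       1.10
```

Shape analyses additionally declare, via `shapes` lines, the ROOT file and the histogram naming pattern from which each process/channel template and its systematic variations are read.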
Data and simulated distributions are stored as histograms. Special care should be given to the naming conventions used to identify the histograms related to specific processes and channels, and those with the proper one-sigma up or down variations of the nuisance parameters. Bookkeeping may become an issue for complex cases: histograms may be arranged in different files with overloaded names, in the same file with different names, or in the same file but in different ROOT sub-directories. Separators, usually underscores, are used in histogram titles in order to match tags with various meanings.

The limited simulation statistics in each bin is also a source of uncertainty: one parameter per bin implies many parameters in the model. Considering only the uncertainty in the bin content of the least-populated bins may speed up the computation considerably.

In some cases, backgrounds in the signal region are constrained from a control region scaled by bin-dependent factors:
\[
h_i^{\mathrm{sig}} = h_i^{\mathrm{bkg}}\,\alpha_i\,, \qquad (3.1)
\]
where the scale factors α_i are determined from Monte Carlo samples. The histogram content in each bin then depends on the values of the nuisance parameters. Scaled histograms can be represented by a customized RooAbsPdf object. The RooFit helper class RooFormulaVar may help, with the caveat that formulae are encoded into strings, which may require convoluted code in complex cases, and bugs in the string definition are only spotted at run time.

In some real applications, automatic datacard generation may simplify the problem. Large datacards can be generated with ad-hoc software, which anyway constitutes one extra layer on top of the CMS Higgs combine tool.

The organization of parameters into categories may simplify the definition of the problem. Parameters may be common to groups of distributions, e.g.:

• Common to all spectra:
  – Luminosity, jet-energy scale, b-tagging, ...
• Common to a process:
  – Theory uncertainties (renormalization and factorization scales, affecting both shape and rate)
• Common to a decay channel:
  – Muon and electron efficiencies (reconstruction, isolation, trigger)
• Specific to a single spectrum:
  – Statistical uncertainty from simulation in each bin

An easier management of the most commonly used cases may be approached with possible extensions of the CMS Higgs combine interface, which could potentially be promoted to a common HEP tool and eventually even be released within the ROOT toolkit.
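The per-bin extrapolation of Eq. (3.1) can be sketched directly with arrays; all yields below are illustrative, and in a real analysis the α_i would carry their own MC statistical uncertainties:

```python
import numpy as np

# Control-region data histogram and simulated yields in both regions
h_ctl_data = np.array([520.0, 310.0, 140.0, 60.0])   # observed, control region
mc_sig     = np.array([ 52.0,  37.0,  21.0, 12.0])   # simulation, signal region
mc_ctl     = np.array([500.0, 300.0, 150.0, 55.0])   # simulation, control region

# Eq. (3.1): per-bin scale factors from MC, applied to the control-region data
alpha = mc_sig / mc_ctl
h_sig_bkg = h_ctl_data * alpha   # predicted background in the signal region
```

Wrapping such a bin-by-bin product into a RooFit PDF is what motivates the custom RooAbsPdf or string-based RooFormulaVar constructions discussed above.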
4. The INSIGHTS project

INSIGHTS, the International Training Network of Statistics for High Energy Physics and Society [7], is a 4-year Marie Skłodowska-Curie Innovative Training Network project for the career development of 12 Early Stage Researchers (ESRs) at 10 partner institutions across Europe. INSIGHTS is focused on developing and applying the latest advances in statistics, and in particular machine learning, to particle physics. CERN is part of the network, with deep interconnections with the ROOT development team.

INSIGHTS' Early Stage Researchers have been selected and will shortly start working on different statistical tools and applications. One of the projects proposes developments for the problem presented here.
5. Conclusions
Most data analyses at the Large Hadron Collider, both precision measurements and searches for physics beyond the Standard Model, require the simultaneous statistical analysis of many data samples in order to constrain systematic uncertainties. Managing the resulting complexity requires a substantial amount of coding and challenges the structure of the present software interfaces. Ad-hoc solutions and mini-frameworks are implemented within experiments and for specific analyses. A common implementation in the framework of the RooFit/RooStats/ROOT tools is desirable in order to simplify the management of many applications.
References

[1] Luca Lista, Statistical Methods for Data Analysis in Particle Physics, 2nd edition, Springer, Lect. Notes Phys. 941 (2017), ISBN 978-3-319-62840-0.
[2] Glen Cowan, Kyle Cranmer, Eilam Gross and Ofer Vitells, Asymptotic formulae for likelihood-based tests of new physics, Eur. Phys. J. C71 (2011) 1554.
[3] CMS Collaboration, Measurement of the t-channel single-top-quark production cross section and of the |V_tb| CKM matrix element in pp collisions at √s = 8 TeV, J. High Energ. Phys. (2014) 2014:90.
[4] Wouter Verkerke and David P. Kirkby, The RooFit toolkit for data modeling, eConf C0303241 (2003) MOLT007.
[5] Rene Brun and Fons Rademakers, ROOT - An Object Oriented Data Analysis Framework, Proceedings AIHENP'96 Workshop, Lausanne, Sep. 1996, Nucl. Inst. Meth. in Phys. Res. A 389 (1997) 81-86. See also http://root.cern.ch/.
[6] CMS Combine tool, https://cms-hcomb.gitbooks.io/combine/content/.
[7] The INSIGHTS project.