Reproducibility and Replication of Experimental Particle Physics Results
Thomas R. Junk†,∗ and Louis Lyons‡,∗∗

†Fermi National Accelerator Laboratory, Batavia, IL, USA
‡Imperial College, London and Oxford University, UK

∗[email protected], ∗∗[email protected]
September 14, 2020
Abstract
The recent “replication crisis” has caused practitioners and journal editors in many fields in science to examine closely their methodologies and publishing criteria. Experimental particle physicists are no exceptions to this, but some of the unique features of this sub-field of physics make the issues of reproduction and replication of results a very interesting and informative topic. The experiments take many years to design, construct, and operate. Because the equipment is so large and complex, like that of the Large Hadron Collider and its associated particle detectors, the costs are very high. Large collaborations produce and check the results, and many papers are signed by more than three thousand authors. Experimental particle physics is a mature field with strong traditions followed by the collaborations. This paper gives an introduction to what experimental particle physics is and to some of the tools that are used to analyze the data. It describes the procedures used to ensure that results can be computationally reproduced, both internally and externally. It also describes methods particle physicists use to maximize the reliability of the results, which increases the probability that they can be replicated by other collaborations or even the same collaborations with more data and new personnel. Examples of results that were later found to be false are given, both with failed replication attempts and some with alarmingly successful replications. While some of the characteristics of particle physics experiments are unique, many of the procedures and techniques can be and are used in other fields.
Keywords:
Reliability, Reproducibility, Replication, Particle Physics
Media Summary

The recent “replication crisis” has caused quite a stir in many scientific fields. Scientists and statisticians alike have recoiled in horror at the low rate at which results have been confirmed when experiments are repeated. Much ink has been spilled explaining the shortcomings of methodologies commonly used in scientific experiments and the criteria that are used when selecting results for publication. Not every proposed solution makes sense, although many are good ideas. Particle physicists have long been aware of the precursors of non-replicable results. No one on a large collaboration who has worked long years on a very expensive experiment wishes to publish a wrong result, which would undermine the credibility of all results from that collaboration. Thus, many internal tests of reproducibility of the results, as well as conservative methods such as blind analysis and stringent review, all with the purpose of catching mistakes and well-intentioned but flawed work, are common in particle physics. Results are also published even if they disprove new theories; null results are not simply filed away. Discoveries of new particles and interactions have a very high bar to meet in particle physics: p values must be less than 3 × 10⁻⁷, not 0.05 as is common in some other fields of study. Particle physicists can easily point to past discovery claims that have had less significance and have vanished when more data were collected or when other groups attempted confirmation. Not every result is perfect or even replicable in particle physics, but the quality is generally quite high. New practitioners are always introduced to examples in which even the most careful analyzers have been able to fool themselves. While some of the techniques and procedures used by particle physicists to ensure the reliability of their results are specific to the sub-field, many can be used regardless of the scientific specialty.

Introduction

Experimental particle physics (EPP), also commonly known as high-energy physics, is a relatively mature field of research, with traditions spanning many decades. Particle physics experiments take years to design, build and run, and they require large financial resources, equipment, and effort. The experimental collaborations at the Large Hadron Collider (LHC) comprise more than 3000 physicists each, and collaborations in other sub-branches of particle physics such as neutrino physics are also becoming larger over time. Experimental particle physics has historically been on several cutting edges technologically, computationally, and sociologically. Some of the practices within EPP to assure the reliability of published results may be directly related to peculiarities of the field and its data, but most are generally applicable. It is the aim of this article to give a review of how issues of reproducibility and replication are addressed in the specific context of EPP, with the intention that they may be more broadly applicable. We also observe that reproducibility and replication, while necessary conditions for a reliable result, by themselves do not guarantee that a result is correct. Before addressing reproducibility and replication, however, we first give a gentle introduction to some of the most important features of EPP.
This paper proceeds by describing the target field of knowledge, elementary particle physics, and the tools used in the research: high-energy accelerators, particle detectors, data collection, processing, and reduction techniques. It then describes common statistical inference tools used in EPP. From there, reproducibility and replication are defined, using common data science conventions and comparing these terms with the language commonly used in EPP. Features of EPP that intrinsically help with reproducibility and replication are discussed, and the methods used to improve these qualities are described. Examples are given of experimental results that have been replicated, results for which attempts to replicate them have failed, and results that were replicated but were later found to be incorrect nonetheless.

Particle physics involves the study of matter and energy at its smallest and most fundamental level. The question of what are the smallest building blocks out of which everything is made has a long history. For the ancient Greeks, the elements were Air, Fire, Earth and Water. Experiments that confirmed this model were highly replicable. Even though the model is incomplete, it served many practical purposes.

Today’s elementary particles are quarks and leptons, all with spin 1/2. The latter consist of the electron and its two heavier versions, the muon (µ) and the heavy lepton (τ); for each of these, there is a corresponding neutrino. The proton, the neutron and other half-integer spin particles (baryons) are composed of combinations of three quarks, while mesons of integer spin (bosons) consist of a quark and an antiquark. Each quark and lepton has its own anti-particle.

The baryons and mesons, the particles made of quarks, are collectively known as hadrons. Quarks are confined within hadrons, and in contrast to leptons do not seem to lead an independent existence as free particles. In collisions between particles, as any struck quarks try to escape from the hadrons, they are converted into jets of pions, kaons and protons which leave visible tracks in the detectors (see Fig. 1).

In addition, the fundamental forces are each transmitted by their own carrier. These are

• Gravitation, transmitted via gravitons. Gravitational waves produced in the coalescence of a pair of black holes were observed in 2016.
• Electromagnetism. This is mediated by photons.
• The weak nuclear force. The intermediate vector bosons W and Z transmit this short-range force.
• The strong nuclear force. This is another short-range force, carried by gluons. They are responsible for binding quarks in hadrons, are produced in collisions involving quarks, and are detected by the jets of particles they produce.

Finally, there is the Higgs boson, with a mass of 125 GeV. The Higgs field is responsible for enabling fundamental particles to have mass; in the simplest version of the theory, particles would be massless. Even with the discovery of the Higgs boson, the numerical values of the masses are not understood.

In the standard model (SM) of particle physics, quarks and leptons are arranged in three generations; see Table 1. It is not understood why there are three generations. The SM also specifies the way the various particles and force carriers interact with each other via the fundamental interactions in the bottom three rows of Table 2; gravitation has not yet been unified with the other three forces.

Table 1: The quarks and leptons of the standard model, arranged in three generations.

Generation   Quarks              Leptons
1            Up u, Down d        Electron e, Neutrino ν_e
2            Charm c, Strange s  Muon µ, Neutrino ν_µ
3            Top t, Bottom b     Tau τ, Neutrino ν_τ
A possible source of confusion is the meaning of the phrase “elementary particle”. This should refer to our basic building blocks of matter at its smallest scale, the quarks and leptons. However, by tradition in this field, they are the particles which, prior to the quark model in the 1960s, were thought to be elementary. This includes protons, neutrons, π mesons and all the other hadrons, as well as the leptons (which still today really are considered to be elementary). Thus the neutron is an elementary particle, composed of an up quark and two down quarks.

A GeV is a unit of energy or mass, and is 1 × 10⁹ electron volts. For comparison, the lightest neutrino’s mass is less than 1.1 eV ([Aker et al., 2019]), the proton’s is about 1 GeV, and a gold atom’s is about 180 GeV.

Table 2: The fundamental forces, their carriers, and the particles they affect.

Force             Potential    Force carrier   Responsible for                          Particles affected
Gravitation       1/r          Graviton        Earth going round Sun, etc.              Everything
Electromagnetism  1/r          Photon          Coulomb repulsion, photon emission       Charged particles
Weak              Short range  W and Z bosons  Energy generation in Sun, β-decay        Quarks, leptons
Strong            Short range  Gluons          Nuclear binding, quark-quark scattering  Hadrons

The remaining sections of this introduction outline the various steps involved in the data analysis process, starting with the accelerator and detector, and ending with the data and analysis software storage for posterity.
Experiments can be divided into those performed at accelerators, and those carried out elsewhere. The accelerator ones either use a beam hitting a stationary target, or have antiparallel beams colliding with each other. An example of the former is a neutrino beam, with the detector hundreds of kilometers away. Colliding beams provide an easier way of achieving higher center-of-mass energies, but have more stringent beam requirements. The highest center-of-mass energy of 13 TeV has been obtained at CERN’s LHC with collisions between protons in a 26 km circumference ring some 100-150 m below the surface ([Evans and Bryant, 2008]). A typical analysis uses data collected over a running period of between several weeks and several years.

There are various forms of non-accelerator experiments. These include beams from nuclear reactors; studies of cosmic rays, solar and atmospheric neutrinos; searches for Dark Matter or proton decay; etc.
There are two large general-purpose detectors at the LHC, CMS ([Chatrchyan et al., 2008]) and ATLAS ([Aad et al., 2008]). Both are cylindrical, and consist of concentric sub-detectors with different functions:

• Vertex detector. This is a high spatial resolution detector placed as close as possible to the interaction region. It is useful for finding charged particles that come from the decays of heavier particles that have traveled millimeters before decaying.
• Tracker. This detects charged particles and measures their momenta.
• Electromagnetic calorimeter. Electrons and photons are identified by the showers they produce in the calorimeter’s material of high-Z nuclei.
• Hadron calorimeter. This is useful for detecting neutral hadrons, such as neutrons or K⁰_L mesons.
• Muon detector. These are placed at the outside of the whole detector, so that almost the only charged particles penetrating that far are muons.

These are in a magnetic field of several tesla, so that the momenta of particles can be measured. The length of ATLAS is 45 m and its height is 25 m.

Because ATLAS and CMS are general-purpose detectors, many different physics analyses are possible. Generally these will be performed on different subsets of the accumulated data.

Experiments at lower energy accelerators tend to have more individually designed, smaller detectors for their specific analysis.

Figure 1: An event display from the CMS collaboration, showing a tt̄H candidate interaction (pp → tt̄H + X). The Higgs boson candidate decays to two Z bosons which themselves each decay to µ⁺µ⁻. Reconstructed tracks are shown as curves that originate in the center of the detector, and calorimeter clusters and muon detector responses are shown with shaded blocks further out.

The intensities of the proton beams at the LHC are such that the rate of collisions at the center of ATLAS and CMS is of order 10⁹ Hz. The data acquisition system can record up to 1000 interactions (“events”) per second. An online trigger system is used to select interesting events to be stored for further analysis; this consists of several algorithms in parallel, to cater for the variety of the subsequent physics analyses. Events not recorded are lost. Studies are performed to evaluate and correct for the trigger efficiency for recording wanted events for each analysis.
The information from the detectors consists of a series of digitized electronic signals which measure energy deposits (“hits”) in small regions of known location in the detector. The job of the reconstruction programs is to link together appropriate hits and turn these into a series of three-dimensional tracks corresponding to the trajectories particles took after they were produced in the event. (In a region where the magnetic field is constant, the trajectories of charged particles are approximately helical; neutral particles travel in straight lines.) Figure 1 shows the reconstructed tracks and calorimeter clusters for a single event collected by the CMS detector. This event passes the candidate selection requirements for associated production of a Higgs boson and two top quarks ([Sirunyan et al., 2020]).

Event selection

In general, each physics analysis will use a small subset of the accumulated data, to reduce background from unwanted processes, while maintaining a high efficiency for retaining the signal. The initial stage of this process involves relatively simple selection criteria devised by physicists (for example, the event should contain a muon and an electron of opposite electrical charge). This is usually followed by machine learning (ML) techniques, such as neural networks or boosted decision trees. Recently, deep learning methods have been employed. In either case, the choice of training samples is important, as illustrated in the sketch below.
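To make the two-stage selection concrete, here is a minimal, hedged sketch in Python (not any experiment's actual code): simple physics cuts are applied first, then a boosted decision tree is trained on labelled simulated events. The variables, cut values, and yields are all illustrative assumptions.

```python
# A hedged sketch of a two-stage event selection: simple physics cuts,
# then a boosted decision tree trained on labelled simulated events.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def make_events(n, is_signal):
    """Toy simulated events: muon pT, electron pT, opposite-charge flag."""
    mu_pt = rng.exponential(40.0 if is_signal else 20.0, n)   # GeV
    el_pt = rng.exponential(35.0 if is_signal else 15.0, n)   # GeV
    opp_q = (rng.random(n) < (0.95 if is_signal else 0.5)).astype(float)
    return np.column_stack([mu_pt, el_pt, opp_q])

sig, bkg = make_events(10000, True), make_events(10000, False)
X = np.vstack([sig, bkg])
y = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])

# Stage 1: simple cuts devised by physicists.
cuts = (X[:, 0] > 20.0) & (X[:, 1] > 15.0) & (X[:, 2] == 1.0)
X, y = X[cuts], y[cuts]

# Stage 2: an ML classifier refines the selection.  The training sample
# is kept separate from the sample used for the final measurement.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
bdt = GradientBoostingClassifier(n_estimators=200, max_depth=3)
bdt.fit(X_train, y_train)
score = bdt.predict_proba(X_test)[:, 1]      # per-event signal probability
print("signal efficiency at score > 0.8:",
      (score[y_test == 1] > 0.8).mean())
```

In a real analysis the classifier would be trained on events that are statistically independent of those entering the final fit, so that the training step cannot bias the result.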
There are two large classes of analyses in EPP: parameter determination and hypothesis testing.

Parameter determination
This involves the determination of one or more parameters (e.g. the mass of the Higgs boson) and their uncertainties. It requires the use of some parameter determination technique, such as those listed below, each of which exists in several variants.

• Chi-squared: This can be the Pearson or the Neyman version, or can use a log-likelihood-ratio formulation.
• Likelihood: Either the usual form, where the probability density function on which the likelihood is based is normalized to unity; or the extended form, where the overall number of events is allowed to fluctuate.
• Bayesian posterior: Although particle physicists are loath to use Bayesian methods for hypothesis testing, there is less resistance to them for parameter determination. The choices here are the functional forms used for the Bayesian priors, and the way the credible interval is extracted from the posterior.
• Neyman construction: This guarantees coverage for the determined parameter(s). The resulting confidence intervals can be chosen to be one-sided upper limits (UL) or lower limits (LL); two-sided central intervals; or Feldman-Cousins (see Section 2.5).

As well as determining the actual confidence interval from the data, it is also important to calculate its expected value (sensitivity), either from the median of a set of values assuming data are distributed according to the relevant model, or from the “Asimov” data set, where the single set of invented “data” is exactly as predicted by the relevant model ([Cowan et al., 2011]).

For replicability comparisons, the sensitivities are probably better suited than the actual data values, as the former are not afflicted by statistical fluctuations. (Of course, each data value should be compared with the corresponding expected value for compatibility.) A further point regarding replicability is that the same interval method should be used, as in some cases the methods can produce very different answers.
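As a hedged illustration of the likelihood technique (a sketch with invented toy inputs, not any collaboration's code), the following fits the position of a Gaussian peak over a flat background in a simulated mass spectrum by minimizing the negative log-likelihood.

```python
# Toy maximum-likelihood fit: Gaussian peak plus flat background.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
lo_edge, hi_edge, res = 100.0, 150.0, 2.0          # GeV window, resolution
data = np.concatenate([rng.normal(125.0, res, 200),          # toy signal
                       rng.uniform(lo_edge, hi_edge, 1800)]) # toy background

def nll(pars):
    """Negative log-likelihood for peak position m and signal fraction f."""
    m, f = pars
    pdf = f * norm.pdf(data, m, res) + (1 - f) / (hi_edge - lo_edge)
    return -np.sum(np.log(pdf))

fit = minimize(nll, x0=[120.0, 0.05],
               bounds=[(lo_edge, hi_edge), (1e-4, 0.5)])
m_hat, f_hat = fit.x
print(f"fitted peak mass = {m_hat:.2f} GeV, signal fraction = {f_hat:.3f}")
```

A 68% interval could be read off from where the negative log-likelihood rises by 0.5 above its minimum, and an Asimov-style sensitivity could be estimated by replacing the toy data with the model prediction itself.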
Hypothesis testing

The other category is where we attempt to see if the data favor some version of new physics (hypothesis H1) as compared with a null hypothesis H0. Thus if we had a mass spectrum and were looking for a peak at some location in the spectrum (compare Fig. 3), our hypotheses could be:

• H0 = only well-known particles are produced.
• H1 = also the production of Higgs bosons, decaying via a pair of Z bosons to 4 charged leptons.

Alternatively, an example from neutrino physics would be

• H0 = “normal” ordering of the three neutrino masses, or
• H1 = “inverted” ordering ([Esteban et al., 2019, De Salas et al., 2018]).

Figure 2: Example distributions of the logarithm of the likelihood ratio test statistic, −2 ln(L_H1/L_H0), assuming H0 (black) and H1 (blue). An example observed outcome of the experiment is indicated by the red line. The p value p0 is the yellow-shaded area under the tail of the H0 distribution to the left of the observed value, and p1 is the green-shaded area under the H1 distribution to the right of the observed value. Panel (a) shows a case in which the experiment is expected to distinguish between H0 and H1 most of the time, and panel (b) shows the distributions for an experiment that is not as sensitive.

Here some form of hypothesis testing is used. This usually requires the choice of a data statistic, which may well be a likelihood ratio for the two hypotheses. Then the choice of hypothesis favored by the data may involve comparing the actual value of the data statistic with the expected distributions for the two hypotheses; these may be obtained by Monte Carlo simulation. Another possibility is to use the expected asymptotic distributions ([Cowan et al., 2011]), though care must be taken to use the asymptotic formulas only within their domains of applicability.

Possible outcomes of these comparisons are:

• Data are consistent with H1 but not with H0. For the first example above, this would constitute evidence for a discovery claim.
• Data are consistent with H0, but not with H1. This would result in the exclusion of the model of new physics, at least for some values of the parameters of that model. Figure 2a shows an example of this situation.
• Data are consistent with both H0 and H1, i.e. the experiment is not sensitive enough to distinguish between the two models. Figure 2b is an example of this situation.
• Data are inconsistent with both H0 and H1. This may indicate that some other model is required.

Particle physics analyses often result in the second situation above. There thus appear in the literature many papers entitled “Search for....”; this is a euphemism for “We looked for something and did not find it, but our search was sensitive enough to justify publication.” Publication of such null results is useful in that it serves to exclude the tested model of new physics at some confidence level, at least for some range of its parameter space. For example, hypothetical supersymmetric partners of electrons, called selectrons, are excluded at the 95% confidence level for masses below 500 GeV, although this is somewhat model dependent ([Aaboud et al., 2018]). That means that if the mass of the selectron were lighter, it would have produced a clear signal in the data, but this was not seen, so selectron masses below 500 GeV are ruled out.
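The construction in Fig. 2 can be sketched in a few lines. In this hedged toy example, a single Poisson counting experiment stands in for a full analysis (all yields are invented); the distributions of q = −2 ln(L_H1/L_H0) are built under both hypotheses with Monte Carlo toys, and p0 and p1 are read off for an example observed count.

```python
# Toy-MC p values for the likelihood-ratio test statistic of Fig. 2.
import numpy as np
from scipy.stats import poisson

b, s = 100.0, 30.0                   # expected background and signal yields

def q(n):
    """q = -2 ln(L_H1 / L_H0) for observed count(s) n."""
    return -2 * (poisson.logpmf(n, b + s) - poisson.logpmf(n, b))

rng = np.random.default_rng(7)
q_h0 = q(rng.poisson(b, 500_000))        # toys under H0 (background only)
q_h1 = q(rng.poisson(b + s, 500_000))    # toys under H1 (signal + background)

n_obs = 126                          # an example observed count
q_obs = q(n_obs)
p0 = np.mean(q_h0 <= q_obs)          # H0 tail on the signal-like side
p1 = np.mean(q_h1 >= q_obs)          # H1 tail on the background-like side
print(f"p0 = {p0:.4f}, p1 = {p1:.4f}")
```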
An additional reason for publishing null results is that it avoids the publication bias of accepting for publication only positive results.

Further specific topics related to physics analyses are discussed in Section 2.

Data and analysis preservation

Given the effort and expense of acquiring particle physics data, it is clearly mandatory for experimental groups and collaborations to store their data and analysis assets in a way which makes them accessible for decades, either for new analyses or for replicability tests. CERN has initiated both the Open Data and the CAP (“CERN Analysis Preservation”) projects ([Chen et al., 2019]) for storing all the data and also all the relevant information, software and tools needed to preserve an analysis at the large experiments at the LHC. The preserved analysis assets include any useful metadata to allow understanding of the analysis workflow, related code, systematic uncertainties, statistics procedures, meaningful keywords to assure the analysis is easily findable, etc., as well as links to publications and to back-up material. This is a very involved procedure, and is still in its testing stage, but it will clearly be very helpful for any subsequent reproducibility or replicability studies of the results of analyses using LHC data. It will, however, require a cultural change, with attention and effort on the part of physicists performing analyses.

Initially, access to the stored information would be restricted to members of the collaboration who produced the data, but eventually it could be used by other EPP physicists and the wider range of scientists and the general public, on a time scale decided by the collaboration.

Although developed at CERN for the EPP community, the CAP framework and its concepts may well be of interest to a wider range of scientists.
We here discuss several issues which appear in many physics analyses, and almost certainly are relevant for other fields too.
We define the “error” in a measurement to be the difference between the estimate of a model parameter produced by the analysis and the unknown true value. “Uncertainties” are numerical estimates of the possible values of the errors, and are typically reported as one-standard-deviation intervals centered on the best estimate of a measured quantity. Asymmetric confidence intervals are also used when appropriate. Particle physicists often use the word “error” when they mean “uncertainty.”

Physics analyses are affected by statistical uncertainties and by systematic ones. The former arise either from the limited precision of the apparatus and/or observer in making measurements, or from the random fluctuations (usually Poissonian) in counted events. They can be detected by the fact that, if the experiment is repeated several times, the measured physical quantity will vary.

Systematic effects, however, can cause the result to be shifted from its true value, but in a way that does not necessarily change from measurement to measurement. Measurements nearly always have some bias, and the question is by how much they are biased. Systematic effects are not easy to detect, and in general much more effort is needed to evaluate the corresponding uncertainties.

The simplest systematic effects are the ones associated with the measured quantities needed for the evaluation of the quantity of interest. The raw measurements may need to be corrected, and any uncertainty in this correction contributes to the overall systematic uncertainty. Another may arise from the uncertainty in some other relevant quantity, which has been measured in a subsidiary analysis in this experiment, or in some different experiment. Yet another can be that the relationship between the quantity of interest and the measured quantities involves implicit assumptions that are not quite true in Nature; the systematic arises from the uncertainty in the correction for this. The most difficult to deal with are theoretical uncertainties in evaluating the answer. These can be because of approximations used, or from different ways of estimating them.

Usually systematic errors are dealt with in a likelihood function by assigning them nuisance parameters, with constraint terms corresponding to the uncertainties on their values. Common ways of including their effects in final results such as p values and confidence intervals are to profile the likelihood function with respect to them, or to marginalize the posterior probability distribution in a Bayesian approach, as in the sketch below.
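A hedged sketch of the profiling procedure (toy numbers, not a real measurement): a counting experiment whose background estimate b0 ± Δb enters the likelihood as a nuisance parameter with a Gaussian constraint term, which is then minimized over at each value of the signal yield s.

```python
# Profiling a nuisance parameter in a Poisson counting experiment.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, poisson

n_obs, b0, db = 25, 15.0, 3.0        # observed count, background estimate

def nll(s, b):
    """Negative log-likelihood including the Gaussian constraint on b."""
    return -(poisson.logpmf(n_obs, s + b) + norm.logpdf(b, b0, db))

def profiled_nll(s):
    """Minimize the likelihood over the nuisance parameter b."""
    return minimize_scalar(lambda b: nll(s, b),
                           bounds=(0.01, 60.0), method="bounded").fun

s_grid = np.linspace(0.0, 30.0, 301)
curve = np.array([profiled_nll(s) for s in s_grid])
s_hat = s_grid[np.argmin(curve)]
# Approximate 68% interval: where -ln L rises by 0.5 above its minimum.
in_interval = s_grid[curve <= curve.min() + 0.5]
print(f"s = {s_hat:.1f} +{in_interval.max() - s_hat:.1f} "
      f"-{s_hat - in_interval.min():.1f}")
```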
The way systematic uncertainties on parameter determinations are reported is that, in the bulk of a paper, their numerical effects on the total systematic uncertainty are quoted separately for each source. This is so that if subsequently the magnitude of any of these can be updated, the total systematic uncertainty can be adjusted. Another reason is that if the results of two or more experiments measuring the same quantity are to be combined, this will ease the problem of taking into account the correlations between the systematic effects. The abstract and the conclusions will typically quote the result as µ ± σ_stat ± σ_syst, where µ is the measured quantity, σ_stat is the statistical uncertainty and σ_syst is the systematic one. This separation of the two uncertainties is because systematics are regarded as more problematic than statistical uncertainties, so an experiment with σ_stat = 4, σ_syst = 1 is regarded as superior to one with σ_stat = 1, σ_syst = 4.

More details on the subject of systematic uncertainties can be found in Section 5.7.1 and in [Heinrich and Lyons, 2007].

Blind analysis

Several methods for blinding analyses have been used in EPP and are currently in use ([Klein and Roodman, 2005]). One simple method is to optimize the analysis using simulated data and reserve the input of data from the experimental apparatus until the procedures have been decided upon, including the data selection, classification, and statistical procedures. Analyses are constructed and optimized based on predicted outcomes of expected signal and background contributions to the event yields, and so it is usually possible to perform the necessary steps without access to the data from the experiment. The collaboration must then agree to accept the result of the analysis after the data are input to the analysis (“unblinding”), without change to any step of the analysis, or the procedure is not fully blind.

A drawback of the simple blinding procedure described above is that it precludes the use of data from the experimental apparatus as a calibration source to help constrain the values of nuisance parameters and to help guide the data selection. This shortcoming is addressed by partial blinding. Data in control samples (events that fail one or more signal selection requirements, for example) are allowed to be input to the analysis procedure before the selected “signal” sample is made available for analysis. Sometimes surprises are found in the control samples: previously unappreciated sources of background events or miscalibrations can show up at this stage. The process of eventually revealing data passing signal selection requirements is often called “opening the box.” The process relies on the good faith of the collaboration members not to look at data that have been blinded. In a large collaboration with many different analyses being developed in parallel, sometimes one analysis group’s control sample is another group’s selected signal sample. However, now that sophisticated ML procedures are commonplace, one group’s histogram of a highly specific ML discriminant variable is unlikely to be meaningful to another group which may be seeking a different sort of signal entirely.

A similar strategy for partial blinding, which helps prevent big surprises from showing up when the signal box is opened, is to look at the data in the signal box first only for a subset of the running period over which the data were collected. The data from this running period may then be discarded from the final result if a fully blind analysis is to be claimed.

Yet another blinding procedure, which applies to precision measurements of physical quantities, is to encode an arbitrary, fixed offset in the final step of parameter inference in software, and to hide the value of this offset from researchers performing the analysis work. This sort of offset can help combat the possibly unconscious desire to get the “right” answer (see Sect. 6.1).
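A minimal sketch of how such a hidden offset might be implemented; the secret string, scale, and measured value are illustrative assumptions, not any collaboration's actual scheme.

```python
# Hidden-offset blinding: analyzers see only the blinded value.
import hashlib

def blinding_offset(secret, scale):
    """Deterministic offset in [-scale, +scale), derived from a secret."""
    h = int(hashlib.sha256(secret.encode()).hexdigest(), 16)
    u = (h % 10**8) / 10**8          # pseudo-uniform in [0, 1)
    return (2.0 * u - 1.0) * scale

measured = 80.433                     # the true fit output (kept hidden)
blinded = measured + blinding_offset("collab-secret-2020", 0.5)
print(f"blinded value: {blinded:.3f}")  # what analyzers see until unblinding
```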
P Values

Recently, p values have been under attack ([McShane et al., 2019]), with some journals actually banning their use ([Woolston, 2015]). The reasons seem to be:

• There are many results that claim to observe effects, based on having a p value less than 0.05, which are subsequently not replicated.
• People don’t understand p values, and confuse them with the probability of the null hypothesis being true.

The first point can be mitigated by having a lower cutoff on the p value criterion. The second argument is similar to claiming that matrices should be banned, because many people don’t understand them; what is required is simply better education. Physicists have been dismayed by some of the blunt tools proposed to solve the replicability crisis. A more careful examination of methodologies is much more valuable than blaming the use of p values ([Leek and Peng, 2015]).
p Values to Quantify Discovery Significance

Particle physicists make extensive use of p values in deciding whether to reject the null hypothesis and claim a discovery. These p values are denoted as p0 (see Fig. 2), as they represent a probability under the curve predicted by H0. To claim a specific discovery, it is also necessary to check that the data are consistent with the expectation from H1. Often in searches for new phenomena, the data statistic used for calculating p0 is the likelihood ratio for the two hypotheses, H0 and H1. This already takes note of the alternative hypothesis. The cut-off on p0 is conventionally taken as 3 × 10⁻⁷, corresponding to a z value of 5.0. Some statisticians scoff at this criterion, saying that probability distributions are not so well known in their extreme tails. The reasons in favor are:

• Claiming a fundamentally new effect has widespread repercussions, and can have very high publicity. Withdrawing a claim of discovery can be embarrassing for a collaboration, and, more specifically, the discovery proponents. Reputations and future credibility can be tarnished. The large author lists on particle physics experiments may serve as one reason for the extreme conservativeness in EPP, at least in recent decades. Many collaborators who worked hard on their experiment but not on a particular analysis have an interest in ensuring that their work does not contribute to false claims.
• Past experience shows that effects with z values of 3 and 4 have often not been replicated when more data are collected ([Franklin, 2013]).
• Estimating systematic uncertainties is generally more problematic than determining statistical uncertainties. If a systematics-dominated experiment had underestimated the systematic uncertainty by a factor of two, an interesting reported z score of 5.0 should really have been a much more mundane z = 2.5.
• The Look Elsewhere Effect can effectively increase a local p value to a more relevant global p value (see Section 2.4).
• An old but still relevant maxim is that “Extraordinary claims require extraordinary evidence.” Thus if we were looking for evidence of energy non-conservation in events at the LHC, we should require a z value of much more than 5.0 before rushing into print. From a Bayesian viewpoint, this corresponds to assigning a much smaller prior probability to a hypothesis involving a radically new idea, as compared with traditional well-established physics.

While p0 may be computed with Monte Carlo simulations of possible experimental outcomes, these calculations become very expensive with a threshold of 3 × 10⁻⁷. Asymptotic formulas, such as those in [Cowan et al., 2011], provide for more rapid calculation.
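The translation between z values and one-sided p values used above is the standard Gaussian tail integral; for example:

```python
# Conversion between z values and one-sided p values.
from scipy.stats import norm

for z in (3.0, 4.0, 5.0):
    print(f"z = {z:.0f}: one-sided p = {norm.sf(z):.2e}")
# z = 5 gives p = 2.87e-07, the ~3e-7 discovery criterion quoted above;
# norm.isf(2.87e-7) inverts the conversion and recovers z = 5.
```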
p Values to Reject Alternative Models

If we merely wish to exclude the alternative hypothesis, the convention is to use a p value for H1, denoted p1 (see Fig. 2), of 0.05 or 0.10. This weaker requirement than that for rejecting H0 is because the embarrassment of making a false exclusion is by no means as serious as that of an incorrect claim of some novel discovery. An exclusion of the model H1 using the criterion p1 < 0.05
will falsely exclude H1 at most 5% of the time if H1 is true. Most hypotheses of new physics, however, are not true, and the risk of falsely excluding a true model is therefore rather low. Exclusions are usually expressed in terms of upper limits on signal strengths. In 5% of cases, assuming H0 is true, all values of the signal strength including zero are excluded using this technique. A plot of an upper limit on the signal strength as a function of a model parameter, such as the mass of a hypothetical particle, will then exclude 5% of possible masses for all values of the signal strength, typically in disjoint subsets, assuming no new particle is truly present. Physicists do not wish to exclude models that they did not test, even if their experiment’s outcome is in what is called by statisticians an “identifiable subset” ([Mandelkern, 2002]). Furthermore, a plot showing exclusions all the way down to zero signal strength in 5% of its tested parameters is not expected to be replicable: the repeated experiment would have to get lucky or unlucky in the same way.

To combat the production of upper bounds on signal strengths being reported too small in 5% of cases, particle physicists do one of two things. They may use a modified p value, p1/(1 − p0), which has been given the confusing name CLs ([Junk, 1999, Read, 2002]).
If CLs < 0.05 then H1 is ruled out. It has the property that CLs ≥ p1, and so comparing it with 0.05 will exclude H1 no more often than the strictly frequentist test on p1. It also has the beneficial property of approaching 1.0 as the signal strength approaches zero, preventing exclusion of signals with zero strength. The other common technique is to use a Bayesian calculation of the posterior probability density as a function of the signal strength, and to exclude those values such that the integral from the upper limit to infinity of the posterior density is 0.05.
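For a simple Poisson counting search, the CLs criterion can be sketched with exact Poisson tails (the yields below are invented; in the notation above, CLs = p1/(1 − p0)):

```python
# Hedged sketch of the CLs criterion in a Poisson counting search.
from scipy.stats import poisson

b, s, n_obs = 10.0, 4.0, 8       # background, tested signal, observed count

cl_sb = poisson.cdf(n_obs, b + s)   # p1: background-like tail under H1 (s+b)
cl_b  = poisson.cdf(n_obs, b)       # 1 - p0: the same tail under H0 (b only)
cls = cl_sb / cl_b
print(f"CLs = {cls:.3f} ->", "excluded" if cls < 0.05 else "not excluded")
```

As the tested signal s goes to zero, cl_sb approaches cl_b and CLs approaches 1, so a signal of zero strength cannot be excluded.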
The Look-Elsewhere Effect

Very often our alternative hypothesis is composite. If we are looking for a signal that produces a peak above a smooth background, the location of the peak may be arbitrary. When we are assessing the chance of random fluctuations resulting in a peak as significant as the one we see in our actual data, the local p value is this probability for the given location in our data. Often, the smallest local p value is the most exciting. But more realistic is the global p value, for having a significant fluctuation anywhere in the spectrum. This is similar to the statistical issue of multiple testing, except that that considers discrete tests, while the LEE in the particle physics context often involves a continuous variable (e.g. the location of the peak). P values at neighboring locations are often correlated due to detector resolution, so there aren’t infinitely many independent tests even when the variable is continuous. Asymptotic formulae exist to compute the small p values needed for discovery while taking into account the LEE in cases of a continuous variable such as the invariant mass of a new particle ([Gross and Vitells, 2010]).

The LEE is more general than as described above. For example, the fluctuation could be in the distribution of the same physics variable, but produced by different selections; other possible distributions could also be relevant; etc. These effects can be avoided by using a blind analysis, which cannot be tuned to produce a desired result.

Another complication is that the definition of “elsewhere” depends on who you are. A graduate student might worry about possible fluctuations anywhere in his or her analysis, but the convenor of a physics group devoted to looking for evidence of the production of supersymmetric particles might well be worried about a statistical fluctuation in any of the many analyses searching for these particles. In view of these ambiguities, it is recommended that when global p values are being quoted, it is made clear what definition of “elsewhere” is being used.

Benjamini has commented that in some cases non-replicability can be caused by the original analysis ignoring the effects of multiple testing ([Benjamini, 2020]). In a similar vein, an unscrupulous member of the news media or other interested reader may dredge the preprint servers for the most significant result of the month and not report all of the others that were passed over in the search. Ignoring the LEE and reporting the smallest local p value from a collection of them is a form of “p-hacking” ([Ioannidis, 2005]).

There is no LEE to take into account when computing model exclusions or upper bounds. Each model parameter point is tested and excluded independently of others. If a researcher sifts through all of the model exclusions looking for the most firmly excluded one and holds that up as an example, then an LEE correction may be necessary, but generally this is not of interest; the set of excluded models is the important result. In cases where multiple collaborations test the same model spaces and arrive at excluded regions of these spaces, points in those spaces may have multiple opportunities to be falsely excluded. The warning here goes to presentations of results in which excluded regions are merely overlaid on one another and the union of all excluded regions is inferred to be excluded. In fact, a rigorous combination of the results is needed in order to make a single exclusion plot with proper coverage.

Feldman-Cousins Intervals

Historically, a particle physicist would first look at the data from an experiment and use it to choose between reporting a discovery and reporting an upper limit. This “flip-flopping” has been shown to cause undercoverage. To solve this issue, Feldman and Cousins ([Feldman and Cousins, 1998]) propose using the Neyman construction to produce confidence intervals using the likelihood ratio ordering principle. In the case of a bounded physical parameter, the method automatically selects between reporting a one-sided bound on the parameter and reporting a two-sided interval, while guaranteeing statistical coverage for all true values of the parameter. It readily generalizes to multiple parameters of interest, but its computational expense increases rapidly with the number of parameters.

The method of Feldman and Cousins (FC) has the property of never producing empty confidence intervals, although the Neyman construction with other ordering rules may do so. With the FC method, it is impossible to exclude an entire parameter space. It is therefore an ideal method to use when it is known that the parameter space contains a single point corresponding to the true value(s) in Nature. This is a very common situation. We know that electrons have a mass, and so the space of possible values of the electron mass contains the truth somewhere in it. On the other hand, supersymmetric electrons may not exist at all, regardless of what mass one might imagine they have. One must be careful not to come to unwarranted discovery claims based on model assumptions. If someone has lost their car keys and they look everywhere except in a place that is difficult to search and no keys have been found, they may deduce that the missing keys are located in the place that has not yet been investigated. If it is not known that the keys exist in the first place, or it is not known that the set of all considered locations they could be in is exhaustive, such a deduction is unwarranted. One may address this issue by including the null hypothesis in the model space considered, though the model space may still be incomplete.
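A compact, hedged sketch of the FC construction for a single Poisson observable n with known mean background b, using the likelihood-ratio ordering principle (coarse grids and invented numbers; see [Feldman and Cousins, 1998] for the full prescription):

```python
# Feldman-Cousins 90% CL interval for a Poisson signal with known background.
import numpy as np
from scipy.stats import poisson

b, cl = 3.0, 0.90
n_vals = np.arange(61)                      # possible observed counts
s_grid = np.arange(0.0, 20.001, 0.01)       # tested true signal values

def accepted_set(s):
    """Counts n accepted at confidence level cl for true signal s."""
    s_best = np.maximum(n_vals - b, 0.0)    # physical best-fit s for each n
    ratio = poisson.pmf(n_vals, s + b) / poisson.pmf(n_vals, s_best + b)
    acc = np.zeros(n_vals.size, dtype=bool)
    prob = 0.0
    for n in np.argsort(-ratio):            # add n in decreasing rank order
        acc[n] = True
        prob += poisson.pmf(n, s + b)
        if prob >= cl:
            break
    return acc

# The interval for an observation is every s whose accepted set contains it.
n_obs = 2
member = np.array([accepted_set(s)[n_obs] for s in s_grid])
print(f"90% CL Feldman-Cousins interval for s: "
      f"[{s_grid[member].min():.2f}, {s_grid[member].max():.2f}]")
```

For n_obs below the expected background, the construction returns a one-sided upper limit (a lower edge at zero) rather than an empty interval, which is the behavior described above.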
Reproducibility

The definition of the word “reproduce” as used in this paper is the extraction of consistent results using the same data, methods, software and model assumptions ([National Academies of Sciences and Medicine, 2019]). The only variables in this case are the human researchers, if there are any, and the separate runs on possibly different computers. A failure to reproduce results could arise from improper packaging of digital artifacts; a lack of documentation, knowledge or even patience on the part of either the original researchers or those attempting the reproduction; the use of random numbers in the computational step; or non-repeatability of calculations on computers, whether due to thread scheduling, radiological or cosmogenic interference with the computational equipment, or differences in the architectures of the computational equipment.

In past decades, the internal representations of floating-point numbers varied from one hardware vendor to the next, and results generally were not exactly reproducible when software was ported. As late as the 1990s, physicists used mixtures of DEC VAX, IBM mainframes, Cray supercomputers and various RISC architectures, such as HP PA-RISC, IBM POWER, Sun Microsystems SPARC, DEC Alpha, and MIPS, to name a few. Many of these architectures had idiosyncratic handling of floating-point arithmetic. Software had to be specially designed so that data files created on one computer architecture but read on another produced results as similar as possible.

Large computing grids currently used in EPP contain mixtures of hardware from different vendors, though these are almost entirely composed of x86-64 processors manufactured by Intel or AMD. The relatively uniform landscape today makes reproducing results much easier, but by no means can differences in computer architecture be ignored.

Compiling programs with different optimization options can produce different results when run on the same computer, due to intermediate floating-point registers carrying higher precision than representations in memory. If a program has an IF-statement in it that tests whether a floating-point number is greater than or less than some threshold, a very small difference in a computed result can be the root cause of a very large difference in the rest of the output of a program. EPP has long been sensitive to these issues, due to the complexity of the software in use and the large size of its data sets. While each triggered readout of a detector is processed independently of all others, the large number of triggered readouts to process virtually guarantees that rare cases of calculations that perform differently, or even fail, will occur during a large data processing campaign.
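A tiny, self-contained illustration of why bitwise reproducibility is fragile: two mathematically equivalent ways of summing the same numbers give different IEEE-754 doubles, and a threshold test then flips on the difference.

```python
# Two mathematically equivalent sums, two different IEEE-754 results,
# and a threshold test that flips on the tiny difference.
total_loop = sum([0.1] * 10)          # left-to-right accumulation
total_alt  = (0.1 + 0.1) * 5          # a different association
print(total_loop == total_alt)        # False: 0.9999999999999999 vs 1.0
threshold = 1.0
print(total_loop >= threshold, total_alt >= threshold)  # False True
```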
Particle physicists have long recognized the utility of compiling software using different compilers, with all of the warnings enabled, and even with warnings treated as errors so they cannot be ignored during development. Different compilers produce warning and error messages for different classes of errors in the source code, such as the use of uninitialized variables, some instances of which may go undetected by some compilers but which may be flagged by others. Fixing software mistakes identified in this way helps guard against undefined behavior in programs.

Running the programs on computers of different architectures and comparing the results has also long been a tradition in EPP. More recently, the process has been automated. On each release of the software, or, in some cases, as fine-grained as on each change in any software component committed to a central repository, continuous integration systems now compile and link the software stack and run basic tests, comparing the results against previous results. These systems automatically warn the authors and the maintainers of the software of variations in the program’s outputs. These safeguards are very important in large collaborations because not every person developing software is an expert in every part of a large software stack, and unintended consequences of changes can go unnoticed or their root causes can be misassigned unless they are identified quickly. Collaboration members who develop software sometimes leave for other jobs, placing high importance on good documentation and less reliance on individual memories of what the software does and how it works. These continuous integration systems require human attention, as some changes to the software produce intended improvements, and these must be separately identified from undesired outcomes.

Reproduction of entire analyses is common in EPP when analysis tools are handed from one analyzer, or team, to the next. For example, when a graduate student graduates, a new student is often given the task of extending the previous student’s work. The first exercise for the new student, however, is to reproduce the earlier result using the same data and software. This is usually referred to as “re-doing” or “checking” an analysis. The standards for comparison are very high: numbers must match to much better than their quoted uncertainties, and exact matches are preferable. Any discrepancy in this step points to a relatively simple flaw that can be remedied.

Reproducibility is a necessary but insufficient condition for reliability. Reproducibility only tests the integrity of the computational steps, not the correctness of the assumptions entering the analysis, or even the quality of the input data. A flawed result, when reproduced, contains the same flaws.

Replication

A more interesting process is to perform a similar analysis, either with the same data and a different analysis technique, or with different data and either the same or a different analysis technique. Not only can the results be compared, but, to the extent possible, intermediate quantities such as selected event counts, or even lists of which triggered readouts of the detector were selected, can be compared to check for consistency. The word “replicate” is defined to mean these kinds of independent tests ([National Academies of Sciences and Medicine, 2019]). In EPP, however, the word “replicate” is rarely used in this sense due to the use of the word “replica” to mean an identical copy, as in data sets distributed to distant computer centers or in geometry descriptions of repeated, identical detector components. Instead, “independent confirmation” is a more conventional phrase used in the case of successful replication, and “ruling out,” “exclusion,” and “refutation” are words that are used when replication attempts fail to confirm the previous result. If the data sets used in a replicated analysis overlap with those of the original analysis, the word “independent” is not used.
Replicated analyses often share sources of systematic error and thus also may fail to be independent even when the data sets, the experimental apparatus, and the collaborations are independent. In order to tell whether a second experimental result successfully replicates the first, shared and independent sources of error must be carefully taken into account. These are usually obtained from the quoted uncertainties on the measured values, but in the case of overlapping data sets, a more involved analysis is warranted. The end result of a comparison of a replicated measurement is often a p value expressing the probability that the two results would differ by as much as they were observed to or more, maximized over model parameters. The sample space in which the p value is computed consists of imaginary repetitions of the two experiments, assuming their outcomes are predicted by the same model. A result with a very large systematic uncertainty is consistent with more true values of the parameter(s) of interest and thus passes replication tests more easily than one with a smaller systematic uncertainty. Such a result is also less interesting because of its lack of constraint on the parameters of interest. A result can only be “wrong” if it has underestimated uncertainties. Even a non-reproducible result may not be a wrong result if it is accompanied by a systematic uncertainty that covers the amount of non-reproducibility.
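As a hedged sketch of such a comparison (all numbers, including the correlation coefficient, are illustrative; in practice the correlation comes from a careful accounting of the shared systematics), a two-measurement compatibility p value might be computed as:

```python
# Compatibility test between two measurements with partly shared systematics.
import numpy as np
from scipy.stats import norm

x1, stat1, syst1 = 80.40, 0.05, 0.04
x2, stat2, syst2 = 80.52, 0.06, 0.05
rho = 0.6                        # correlation between the two systematics

var_diff = (stat1**2 + stat2**2 + syst1**2 + syst2**2
            - 2.0 * rho * syst1 * syst2)   # variance of the difference
z = (x1 - x2) / np.sqrt(var_diff)
p = 2.0 * norm.sf(abs(z))        # two-sided p value for the difference
print(f"z = {z:.2f}, p = {p:.3f}")
```

Note that a positive correlation between the shared systematics reduces the variance of the difference, making the test more stringent than a naive quadrature sum would suggest.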
The conditions under which particle detectors are operated are known to affect their performance and thus may bias the results obtained from them. Particle physics experiments run for years at a time, and operating conditions are variable, making the data sets heterogeneous. Variations in accelerator parameters, such as the beam energy, the energy spread, the intensity, and stray particles accompanying the beam (“halo”), are constantly monitored and automatically recorded in databases for future retrieval during data analysis. Environmental variables such as ambient temperature, pressure and humidity are also included in these records. The concentrations of electronegative impurities in drift media are constantly measured and recorded. The status of high-voltage settings, electronics noise, and which detector components are functioning or broken are also recorded. Non-functioning detector components are often repaired during scheduled accelerator downtime, and some detector components may recover functionality when computer processes are restarted. If a particular physics analysis requires that the detector is fully functional, then only data that were taken while the detector satisfies the relevant requirements can be included in that analysis.

While an experiment is collecting data, physicists take shifts operating the detector and monitoring the data that come out of it. While most detector parameters that affect analyses can be identified in advance and monitored automatically, some surprises can and do occur. It is up to the shift crew to identify those conditions, notify experts who may be able to repair the errant condition, and mark the data appropriately so it does not bias physics results. The shift crew is aided by automated processes that analyze basic quantities of the data in near real time, providing input to their decisions.

Because the data processing and analyses in EPP require the use of a large amount of software that is under constant development by a large number of people, a well-thought-out version control system is required. Not only must the source code be under strict version control, but so too must the installed environments, which include auxiliary files and databases. Naive systems in which collaborators share computers on which the software is constantly updated to the latest version will find their analysis work difficult in a way that scales with the size and activity of the software development effort. Results obtained by a physicist running the same programs on the same data may differ from day to day, or programs that ran previously may fail to run at all. A new release of a software component may be objectively better than the older ones: bugs may have been fixed or the performance of the algorithms may have been improved. “Performance” here refers not to the speed with which the program runs on a computer, but rather to its ability to do its intended job. The probability for a track-finding algorithm to find a track may go up from one version to the next, for example. But if an analyzer has measured the performance of the algorithms, using experimental data if possible and Monte Carlo simulations otherwise, then those algorithms must be held constant or the calibration constants become invalid. One may “freeze” the software releases, but then some collaborators will require newer releases than other collaborators.

The solution chosen in EPP is to freeze and distribute pre-compiled binaries and associated data and configuration files for each release used by any collaborator. No software version is set up by default when a user logs in; a specific version must be specified. The version for a top-level software component determines the versions needed for all dependent components, which are automatically selected. Inconsistent version requests are treated as errors. New users are surprised at the need for this complexity, but they appreciate it later when they are finishing up their analysis work and are trying to keep every piece of their workflow stable.
A common and necessary practice in EPP is to compute the expected sensitivity of an analysis before the data are collected, or at least before they are analyzed. Since particle physics experiments are so expensive and take so long to design, construct, operate, and perform data analysis, funding agencies require that a collaboration demonstrate that their proposed experiment is capable of testing the desired hypotheses before approving the funding. The expected sensitivity usually takes the form of the expected length of the confidence interval on one or more measured parameters, or the median expected upper limit on the rate of a process assuming it truly is not present in Nature, or the median p value, assuming the new process is truly present. Typically, distributions of possible outcomes of the experiment are pre-computed before the analysis process is finished. The observed result can then be compared with the expected results once all the data are collected and the analysis is finished. Sometimes a spurious outcome is easily identifiable as being consistent with none of the considered hypotheses. This separation of the results into sensitivity and significance furthermore helps combat the “file-drawer” effect. Even if a result fails to be significant, if the sensitivity of the test is high, then the result is worth publishing.
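A hedged sketch of one such sensitivity calculation, the median expected upper limit: toy experiments are generated under the background-only hypothesis, and the median of the per-toy 95% upper limits is reported. The limit here is a Bayesian one with a flat prior in the signal yield and known background, one of several conventions in use; all numbers are invented.

```python
# Median expected 95% upper limit from background-only toy experiments.
import numpy as np
from scipy.stats import gamma

b = 12.0                             # known expected background
rng = np.random.default_rng(3)

def upper_limit(n, cl=0.95):
    """Bayesian 95% UL on s for observed count n (flat prior, known b).
    The posterior of mu = s + b is a Gamma(n+1) density truncated at b."""
    post = gamma(n + 1)
    below = post.cdf(b)              # posterior mass in the unphysical s < 0
    return post.ppf(below + cl * (1.0 - below)) - b

toys = rng.poisson(b, 100000)        # background-only pseudo-experiments
uniq, counts = np.unique(toys, return_counts=True)
limits = np.array([upper_limit(n) for n in uniq])
median_expected = np.median(np.repeat(limits, counts))
print(f"median expected 95% upper limit on s: {median_expected:.2f} events")
```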
One way in which physics analyses can benefit from replication, without waiting for another group on the same or a different collaboration to work on a similar analysis, is to use a calibration source, or a “standard candle.” Signals that have been long established ought to be visible in analyses that seek similar but not-yet-established signals. The analysis therefore replicates earlier work, and in so doing not only validates the earlier work, but increases the confidence in the present work. This step is particularly important in analyses that do not observe a new signal. One might think that the detector or the analysis method is simply not sensitive to the new signal and may have missed it. To show that a known signal is found in the same analysis with the expected strength and properties gives confidence that the whole chain is working as desired.

A related technique is the use of “control samples,” in which the desired signal is known not to exist, or, if it does, contributes a much smaller fraction of events than in the selected signal sample. Data that do not pass selection requirements, or which were collected in different accelerator conditions (off-resonance running is an example of a way to collect background-only data), can be used to estimate the rates and properties of processes that are not the intended signal but which can be confounded with it if not carefully controlled. Often multiple control samples are used, each one targeting a specific background process, or which may over-constrain the rates or properties of them. Disagreements in the predictions of background rates and properties from different control samples often contribute to the systematic uncertainty estimations used in the signal sample.

Large EPP collaborations have a difficult, complex task to perform to produce any individual result. The detectors have millions of active elements, and the conditions are variable. Often a physics analysis requires events to be selected with a specific particle content, say a lepton, a number of jets, and missing energy. Not every lepton is identified correctly, however, and not every jet’s energy is measured well. Instead of requiring every team that wants to analyze data to perform the work to calibrate all of the things that need calibrating, working groups are set up to perform these tasks. A group may be devoted just to b-jet tagging while another will calibrate the electron identification and energy scale. Other groups will form around each necessary task. Their results, along with systematic uncertainty estimates, are reviewed in much the same way as physics results are reviewed before being approved for use by the collaboration. In this way, consistent calibrations are available for all physics analyses performed by the collaboration, and mistakes are minimized. One avenue by which subconscious bias can affect an analysis is the calibration stage. The separation of the calibration efforts into dedicated groups reduces the possibility that collaborators wishing for specific results in their analyses can obtain them by (subconsciously) manipulating calibrations, as each calibration group must provide results for everyone in the collaboration, not just one set of interested parties.

Circular colliders typically have multiple detectors located at discrete interaction regions that produce identical physics processes because the beams are the same. The PEP ring at SLAC had four detectors: HRS, TPC-2γ, Mark-II and MAC. The LEP collider had the ALEPH, DELPHI, L3 and OPAL detectors. The LHC has ATLAS and CMS, and also the special-purpose detectors ALICE and LHCb. These detectors are only partially redundant; they are not exact copies of one another. Part of the purpose is to provide for replication of results, but it is also important to diversify the technology used in the experimental apparatus. While detector research and development is also a mature field, and technologies are deployed in large experiments only if they have been shown to work in prototypes, risks still exist. A particular technology may be better suited to a specific physics analysis than another, but it may be weaker in another analysis. Given the high costs of these detectors, the additional value accrued by exposing different technologies to the same physics is seen to be a better investment than exact duplication. Competition between collaborations also encourages scientists to optimize their analyses for sensitivity (and not significance), and to produce results quickly so as not to lose the race with competitors. High-profile discoveries, such as those of the top quark and the Higgs boson, are typically announced simultaneously by rival collaborations. This is not surprising at circular colliders, because each detector is delivered the same amount of collision data at any given time as each other detector on the collider.

Results are frequently interpreted in the context of other results obtained in similar but not identical processes, but for which the model explanation for one experiment’s result must have consequences for another experiment’s result. An example of this is the search for a light, sterile neutrino. The LSND and MiniBooNE collaborations observed excesses of ν_e events in beams dominantly composed of ν_µ, when compared to what was expected given what is known from three-flavor neutrino oscillation rates ([Aguilar-Arevalo et al., 2001, Aguilar-Arevalo et al., 2013, Aguilar-Arevalo et al., 2018]). A hypothesis to explain these data is that a fourth neutrino may exist which provides an oscillation path from ν_µ to ν_e.
We know from the LEP experiments' e+e− → Z lineshape measurements ([Schael et al., 2006]) that there are only three light neutrino species that interact with the Z boson. A fourth light neutrino must therefore be “sterile.” Nonetheless, in order to explain the LSND excess in this way, some ν_µ's must disappear as they oscillate into sterile neutrinos, while a few of these sterile neutrinos may oscillate back into ν_e. One can test this hypothesis by looking for ν_µ interactions in a ν_µ beam (any ν_µ beam, not necessarily LSND's) and seeing whether enough of them disappear as a function of the distance to the neutrino source divided by the neutrino energy. A recent combination of data from the MINOS, MINOS+, Daya Bay and Bugey experiments, which have measured this disappearance rate ([Adamson et al., 2020]), excludes parameter values consistent with the LSND result. While the newer experiments did not attempt to replicate the LSND experiment directly, they do provide interesting information. It remains to be seen whether the tensions in this field come from inadequate understanding of experimental effects, or from a more fundamental physical process, even if it is not a light, sterile neutrino.
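In the commonly used two-flavor approximation, the ν_µ survival probability depends on the distance-to-energy ratio L/E through P(ν_µ → ν_µ) = 1 − sin²(2θ) sin²(1.27 Δm² L/E), with Δm² in eV², L in km and E in GeV. A minimal sketch, using placeholder parameter values rather than any experiment's fit results:

    import numpy as np

    def numu_survival(L_over_E, sin2_2theta, dm2):
        """Two-flavor nu_mu survival probability.

        L_over_E    : distance/energy in km/GeV
        sin2_2theta : mixing amplitude sin^2(2*theta)
        dm2         : mass-squared splitting in eV^2
        """
        return 1.0 - sin2_2theta * np.sin(1.27 * dm2 * L_over_E) ** 2

    # Illustrative (placeholder) parameters in the eV^2 range probed by
    # sterile-neutrino fits; disappearance would show up as a dip in the
    # survival probability as a function of L/E.
    L_over_E = np.linspace(0.01, 2.0, 200)   # km/GeV
    prob = numu_survival(L_over_E, sin2_2theta=0.1, dm2=1.0)
    print("minimum survival probability:", prob.min())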
Large collaborations benefit from the availability of scientists with diverse experiences and points of view. All collaborators on the author list are given the opportunity to review each result that is published. While the number of papers published by each of the LHC collaborations is such that not every collaborator reads every paper, each collaborating institution is required to meet a quota of papers that are read and commented on by its members.

Results in preparation must pass through a lengthy, formal approval process before they can be presented outside of the collaboration. Working groups led by experienced physicists review each analysis by the members of the group and frequently point out flaws in logic, data handling, analysis and presentation. Before approval, a result must be fully presented to a working group, and a public note must be written. At this stage, questions are asked of the proponents, and some of these questions may require significant additional study to address. At a later date, the analysis must be presented again, and all questions and requests must be answered to the satisfaction of the group members. Only at this phase can a result be approved, though all figures and numbers must be labeled “Preliminary.”

Preliminary results sometimes do not have the final estimates of systematic uncertainties associated with them. Estimating systematic uncertainties is usually the most time-consuming aspect of analysis work, and it usually involves significant re-analysis or the generation of additional Monte Carlo samples to make model predictions corresponding to variations of each nuisance parameter. Even the list of all potential sources of systematic error may not be fully understood at the time a preliminary result is reviewed. If a result needs to be produced on a short timescale under these circumstances, systematic uncertainties are estimated conservatively, with the intention that further work will reduce their magnitude ([Barlow, 2002]).

There are negative consequences to overestimating systematic uncertainties. A set of measurements of the same physical quantity by several collaborations, all of which overestimate their uncertainties, will have a χ² that is smaller than expected, even if the sources of systematic error are different. More worrisome, however, is the possibility that a combination of results may assume a measurement is more sensitive to a particular nuisance parameter than it really is, due to an inflated or misclassified source of uncertainty. The measurement thus serves to constrain the nuisance parameter too strongly in the joint result, producing a final combined result with an underestimated uncertainty.
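The χ² deflation described above is easy to demonstrate with a toy simulation. In this minimal sketch (all numbers invented for illustration), several experiments measure the same quantity with Gaussian resolution but report uncertainties inflated by 50%; the χ² about the weighted average comes out well below the number of degrees of freedom.

    import numpy as np

    rng = np.random.default_rng(seed=2)
    mu_true, sigma_true, inflation = 10.0, 1.0, 1.5
    n_meas, n_trials = 8, 20_000

    chi2 = np.empty(n_trials)
    for i in range(n_trials):
        x = rng.normal(mu_true, sigma_true, size=n_meas)  # honest spread
        sigma_rep = sigma_true * inflation                # reported errors
        w = np.full(n_meas, 1.0 / sigma_rep**2)
        xbar = np.sum(w * x) / np.sum(w)                  # weighted average
        chi2[i] = np.sum(((x - xbar) / sigma_rep) ** 2)

    # With honest uncertainties the mean chi-squared would be n_meas - 1 = 7;
    # inflating every uncertainty by 1.5 suppresses it by a factor 1.5**2.
    print("mean chi2:", chi2.mean(), "vs.", (n_meas - 1) / inflation**2)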
After a preliminary result is released, the physicists who performed the analysis prepare a manuscript for publication. At this stage, and sometimes even during the preliminary result preparation stage, a committee of collaborators who have worked on similar topics but who are not directly involved in the particular result is set up to review the paper draft. Often the committee is involved at an early stage of writing the draft, and it meets regularly with the authors to improve the analysis and the presentation. All changes to the analysis must be approved by the working group specializing in the topic. Usually the review committee, rather than the analysis proponents, submits the manuscript for collaboration review. By the time a manuscript is released to the collaboration, the result has already been reviewed multiple times. The additional scrutiny from the large collaboration may uncover additional flaws, and the manuscript is re-released for a second collaboration review after the concerns raised in the first review have been addressed. The process is iterated until consensus is reached that the paper can be submitted for publication. This process often uncovers even tiny flaws, and it can take from several months to years to complete. High-profile results can be pushed through on accelerated timescales without compromising the integrity of the review, provided that the necessary effort can be directed towards them. Only in very rare instances is consensus not reached. In these cases, dissenting collaborators can request that their names be removed from the author list of that paper. A significant fraction of the collaboration refusing to sign a paper sends a strong signal to the analysis proponents and the readers of the article about the perceived validity of the results.

After a manuscript has been agreed upon and submitted to a journal, the editors use traditional blind peer review before publication. Referees who are known to be experts in the field, and who are frequently also members of rival collaborations, weigh in on the publication. Referees are not superhuman, though they do sometimes find issues with papers that thousands of authors may have missed. Sometimes a collaboration may fall into the trap of “group-think,” having repeated the same arguments to itself over and over again, so an independent check has as much value in EPP as in other fields. An independent review can also help improve the presentation of work that may not be clear to an outsider. Sometimes the root cause of the non-replicability of a result is merely inadequate or unclear documentation.
Large collaborations sometimes have a “statistics committee,” made up of collaborators who are experts on data analysis, inference, and the presentation of results. A good-sized committee has at least six members, and more are desirable. Statistical issues in analyses can be intricate, and they take some time to understand. Members of the committee sometimes disagree about issues with specific analyses. It is important for physicists who are embarking on a new analysis to consult with the collaboration's statistics committee, so that work is not steered in a direction that is only later found to be flawed under collaboration review. Frequently, the most challenging issues with an analysis relate to the treatment of systematic uncertainties. The enumeration of sources of uncertainty, their prior constraints, how to constrain them in situ with the data, and how to include their effects in the final results are common subjects that the statistics committee must address. Experimental collaborations typically must each have their own statistics committee, as results in preparation are usually confidential until a preliminary result is released, and review by members of other collaborations would spoil this confidentiality. Peer review generally does not spoil confidentiality because all or nearly all results are released as preliminary results first, and preprints are available for submitted manuscripts, thus establishing priority. Members of a collaboration may be wary of advice from members of competing collaborations, even if it is general statistical-methods advice. Members of one collaboration may prefer that their competitors treat their uncertainties more conservatively and thus appear to have a less reliable result. Even the fear of such bias in advice is enough to prevent the formation of joint statistics committees across collaborations.

Munafò and collaborators point out that independent methodological support committees have been very useful in clinical trials ([Munafò et al., 2017]). Particle physicists routinely reach out to statisticians, holding workshops titled PhyStat every couple of years ([Behnke et al., 2020]).

Often it is useful to combine results produced by competing collaborations, sometimes alongside the announcement of the separate results. In this case, the collaborations must agree on the exchange of data and appropriate methods of inferring results. The methods used are typically extensions of what the collaborations use to prepare their own results, and usually in a combination effort, members of each experiment perform the combination with their own methods and the results are compared for consistency. At this stage, mistakes in the creation or the exchange of digital artifacts may be exposed, and they must be addressed before the final results can be approved by all collaborations contributing to the combined results.
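As a minimal illustration of a combination (with invented numbers, and ignoring the systematic uncertainties and more sophisticated machinery that real combinations use), two counting experiments measuring the same signal strength µ can be combined by multiplying their Poisson likelihoods and maximizing:

    import numpy as np
    from scipy import stats, optimize

    # Hypothetical inputs from two experiments measuring the same signal
    # strength mu: observed counts, expected backgrounds, and the expected
    # number of signal events at mu = 1.
    obs = np.array([28, 41])
    bkg = np.array([20.0, 30.0])
    sig = np.array([5.0, 8.0])

    def neg_log_like(mu):
        """Negative log of the product of the two Poisson likelihoods."""
        return -np.sum(stats.poisson.logpmf(obs, bkg + mu * sig))

    fit = optimize.minimize_scalar(neg_log_like, bounds=(0.0, 10.0),
                                   method="bounded")
    print("combined best-fit mu:", round(fit.x, 3))

In practice, members of each collaboration would run their own implementation of the combination, and the outputs would be compared for consistency, as described above.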
Nearly all results in EPP are derived from counts of interactions in particle detectors. Each interaction has measurable properties and may differ in many ways from other interactions. These counts are typically binned in histograms, where each bin collects events with similar values of some observable quantity, like a reconstructed invariant mass. An example of such a histogram, showing the distribution of the reconstructed mass m_4ℓ in H → ZZ* → 4ℓ decays selected by the ATLAS collaboration ([Aad et al., 2020]), is given in Fig. 3. Event counts often are simply reported by themselves. Under imagined identical repetitions of the experiment, the event counts in each bin of each histogram are expected to be Poisson distributed, although the means are usually unknown or not perfectly known. The data provide an estimate of the Poisson mean, which is often directly related to a parameter of interest, such as an interaction probability. Data in EPP have been referred to as “marked Poisson” data, where the marks are the quantities measured by a particle detector, such as the energies, momenta, positions and angles of particles produced in collisions. The fact that all practitioners use the same underlying Poisson model for the data helps reproducibility and replication (a minimal sketch of such a binned Poisson likelihood appears below).

An advantage of particle physics analyses as compared with those in, for example, sociology is that elementary particles are more reliable than people. In selecting a sample of muons, we do not have to worry about their distribution in age, income, job satisfaction and ability at statistics. “Double-blind” analyses do not exist in particle physics. Furthermore, unlike humans, particles are relatively unaffected by environmental factors, and so these do not have to be taken into account in analyses. However, this is less true of the detectors. Their response to elementary particles and background noise depends on the ambient temperature and pressure, electrical noise, and their exposure to high doses of radiation from the beams in the accelerator. The raw data need to be corrected to take such variations into account. Thus calibrations of the detector responses have to be carried out at frequent intervals.

Another difference between particle physics and other areas of research is the way models are regarded. The basic model used for particle physics is the standard model, which contains the particles listed in Table 1 and the forces in Table 2 (apart from gravitation, which for most purposes can be neglected on a particle scale). This provides an excellent description of a large number of experimental distributions, but despite this it is not believed to be the ultimate description of Nature; for example, it does not explain dark matter or dark energy, and it has about 20 arbitrary parameters (such as particle masses) that have to be determined from experiment rather than being predicted by the theory. Thus we have a model that works very well, that despite our prejudices may even be the ultimate truth, but which we are constantly hoping to disprove. Most of our “search” experiments are seeking not merely to produce another verification of the SM, but are hoping to discover evidence for physics beyond the standard model (BSM). A convincing rejection of the SM would be a major discovery.
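Here is the minimal sketch of such a binned likelihood promised above (with invented bin contents, not the data of Fig. 3): each bin is treated as an independent Poisson count whose mean is the background prediction plus a signal strength µ times the expected signal shape.

    import numpy as np
    from scipy import stats

    # Invented five-bin histogram: expected background, expected signal
    # shape at mu = 1, and "observed" counts.
    background = np.array([40.0, 35.0, 30.0, 25.0, 20.0])
    signal     = np.array([0.5,  2.0,  5.0,  2.0,  0.5])
    observed   = np.array([42,   39,   38,   26,   19])

    def log_likelihood(mu):
        """Binned Poisson log-likelihood as a function of signal strength."""
        return np.sum(stats.poisson.logpmf(observed, background + mu * signal))

    for mu in (0.0, 1.0, 2.0):
        print(f"mu = {mu:.1f}: log L = {log_likelihood(mu):.2f}")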
Figure 3: A histogram showing event counts in bins of the reconstructed mass m_4ℓ for interactions selected in the H → ZZ* → 4ℓ decay mode in the ATLAS detector at √s = 13 TeV with 139 fb⁻¹ of data ([Aad et al., 2020]). The points with error bars show the numbers of observed interactions in each bin, and the shaded, stacked histograms show the model predictions. The red peak on the left corresponds to the well-known Z boson (an example of a “standard candle”), while the blue peak in the middle shows the prediction for the Higgs boson.

In contrast, in many other fields, the models are more ad hoc, in general do not provide very good detailed descriptions of the data, and almost no one believes in their ultimate truth. Approximate agreement between the data and the model is regarded as a successful outcome. On the other hand, a rejection of the model being used would merely result in it being replaced by a different ad hoc model.

Another peculiarity of EPP is the specialization of practitioners into theoretical and experimental categories. An important benefit of this division is that experimentalists almost never test theories that they themselves invented. Experimentalists are therefore usually not personally invested in the success or the failure of the models they are testing.

Furthermore, while there is only one true set (however incompletely known) of physical laws governing Nature, the set of speculative possibilities is limited only by physicists' imaginations. Theory and phenomenology preprints and publications abound in great numbers. Most theories are in fact not true, but generally, in order to be published, they must be consistent with existing data. Experimenters are well aware that most searches for new particles or interactions will come up empty-handed, even though the hope is that if one of them makes a discovery, then our understanding of fundamental physics will make a great stride forwards. This possibility makes all the null results worth it.

The LHC has been referred to as a “theory assassin,” owing to all of the null results excluding many speculative ideas. This process largely mitigates the “file-drawer” effect, as a null result excluding a published theoretical model is likely to be publishable and not ignored. The experimental tests must be accompanied by proof that they are sensitive to the predictions of the theories in question, however, before they are taken seriously for publication. Null results are therefore typically presented as upper limits on signal strengths. Theorists may counter a null result by predicting smaller signal strengths. Frequently, an iterative process is undertaken that progressively tightens the constraints on a model of new physics as more data are collected and analysis techniques are improved. Some models can never be fully ruled out because signal strengths could always be smaller. These, if interesting enough, are left as challenges to the next generation of experiments.

Examples of Non-Replicated Results in Particle Physics
The success of the quark model in explaining the spectroscopy of the many existing hadrons (and the absence of those that were forbidden by the model), as well as many features of the production processes for interactions, resulted in many searches for evidence of the existence of free quarks. Most of these used the fact that quarks have an electric charge of ±e/3 or ±2e/3. Experiments looked for quarks in cosmic rays, reactions at accelerators, the Sun, moon dust, meteorites, ocean sludge, mountain lava, oyster shells, etc., but almost all experiments yielded null results. Theorists accepted this as being due to the concept of “confinement”: quarks can exist only inside hadrons, but not as free particles.

However, in 1981, an experiment at Stanford reported a positive result ([Larue et al., 1981]). It involved levitating small niobium spheres and measuring their oscillations in an oscillating electric field; the amplitude is proportional to the charge on the ball. Of 39 measurements reported, 14 corresponded to the fractional charge of quarks. Although the word “quark” does not appear in the Stanford publication, this was possible evidence for their existence as free particles.

However, many analysis decisions had to be made in order to extract the charge on a ball from the raw measurements, e.g. whether or not to accept an experimental run, which corrections to apply for experimental features, etc. These decisions were made while looking at the possible result of the charge measurement. Luis Alvarez suggested that a form of blind analysis should be used (see Section 2.2). This involved the computer analyzing the data adding a random number onto the extracted charge visible to the physicists; this offset was subtracted from the result only after all necessary decisions about run acceptance and corrections had been made. The net result was that this experiment published no further results.
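A minimal sketch of this style of blinding follows (hypothetical code, not the Stanford group's actual procedure): the analysis software draws a secret offset once, every intermediate charge value seen by the analyzers includes it, and it is removed only after all selection and correction decisions have been frozen.

    import numpy as np

    # The offset is drawn once and deliberately kept out of the analyzers'
    # sight (a real implementation would hide it more carefully).
    _rng = np.random.default_rng()
    _hidden_offset = _rng.uniform(-0.5, 0.5)   # in units of the charge e

    def blinded_charge(raw_charge):
        """Charge value shown to analyzers while decisions are being made."""
        return raw_charge + _hidden_offset

    def unblind(blinded_value):
        """Called exactly once, after run acceptance and corrections are frozen."""
        return blinded_value - _hidden_offset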
In his review, “Pathological Science,” Stone collects several stories of experimental results that were later found to be wrong ([Stone, 2000]). In two of the cases, the Davis-Barnes effect and N-rays, clear defects in the experimental technique were uncovered during visits to the laboratories by independent experts. In the case of the split A2 resonance, several groups confirmed the false result but others did not, and at least one group that did see the splitting of the A2 reported adjusting the apparatus when the effect was not seen but not when it was seen. Franklin provides excellent commentary on some results that have been successfully replicated and some that have not ([Franklin, 2018]). Bailey has produced histograms of changes in measured values of particle properties expressed in terms of the reported uncertainties. While many repeated measurements of the same quantities are consistent, there is a long tail of highly discrepant results ([Bailey, 2017]).

More concerning than false results that are not replicated are false results that are replicated. Of course, these results can only be ascertained as false by further attempts at replication and/or the discovery of errors in the original results.

The search for a pentaquark is an interesting example of replication. Particles known as hadrons are divided into baryons and mesons. In the original quark model, baryons are composed of three quarks (and mesons of a quark and an antiquark); see Section 1.1. There was, however, no obvious reason why baryons could not be made of four quarks and an antiquark; these baryons would be pentaquark states. This would make available new types of baryons which could not be made from the simpler and more restrictive three-quark structure, and which could be identified by decay modes involving unconventional groupings of particles not accessible to three-quark baryons. Thus searches were made for these possible new particles.

In 2003, four experiments provided evidence suggesting the existence of one of these possibilities, known as the Θ+, with a mass around 1.54 GeV. The quoted significances were 4 to 5 σ. Indeed, national prizes were awarded to physicists involved in these experiments. In the next couple of years there were six more experiments quoting evidence in favor of its existence.

However, other studies, many with much higher event numbers than those with positive results, saw no evidence for the particle. Although most of these were not exact replications of the original positive ones, at least one was a continuation with much higher event numbers than the original study, and involved exactly the same reaction and the same beam energy; it failed to confirm the original result.

The net conclusion was that the Θ+ does not exist. Possible reasons for the apparently spurious early results include poor estimates of background; non-optimal methods of assessing significance; the effects of using non-blind methods for selecting the event sample and for the mass location of the Θ+; and unlucky statistical fluctuations. Hicks provides a detailed review of pentaquark search experiments and their methodologies ([Hicks, 2012]).

This topic is probably the one in which there were the most positive replications of an incorrect result. It demonstrates the care needed when taking a confirmatory replication as evidence that the analyses are correct, especially when the experiments involve smallish numbers of events.

The twist in the tale of this topic is that, more recently, pentaquark states have been observed by the LHCb experiment ([Aaij et al., 2015]). They are, however, much higher in mass than the Θ+, and have a different quark composition, so they are certainly not the same particle. More details of the interesting history of the search for pentaquarks and their eventual discovery and measurement can be found in the review on the subject by M. Karliner and T. Skwarnicki in the 2020 Review of Particle Physics ([Zyla et al., 2020]).

It is not only searches for new particles that can suffer from spurious replication; measured values of well-established particles and processes can also be affected. The Particle Data Group collects measurements of particle properties, averages them in cases of multiple measurements of the same quantity, and publishes these every two years ([Zyla et al., 2020]). One can see in the historical evolution of the averages that the error bars generally decrease over time and the differences between the measured values also decrease over time. There is considerable correlation from one average to the next, which is largely due to the same measurements contributing to multiple years' averages.
In order to see if there is an effect in which experimenters seek, consciously or not, to replicate earlier numbers without contradicting them, a meta-analysis was performed ([Klein and Roodman, 2005]) in which individual measurements of selected quantities were plotted as functions of time. Correlations are indeed visible even in these historical plots. Not all of the effects may be due to over-eagerness to replicate earlier work, because often shared sources of systematic error afflict multiple measurements.
In the last decade, a number of meta-analyses of published results in several scientific fields have uncovered a large fraction of results that were not replicated when tested ([National Academies of Sciences and Medicine, 2019]). Experimental particle physicists, when they heard the news of the “replication crisis” in other fields, felt the temptation to gloat a little, mainly because of the nature of their enterprise, the high standards applied to their results, and the tradition of publishing all results that have sensitivity to the effect under test without relying on the observed significance to determine whether or not to submit a manuscript. Indeed, many of the proposed solutions for the replication crisis have been, in some way or another, part of the culture of experimental particle physics for decades.

Particle physicists have long been cautioned about historical failures of even the most stringent checks and balances, and every new student is given examples of how well-meaning researchers can come to wrong conclusions because they misled themselves and therefore others. It is concern over repeating the mistakes of the past that justifies the rigor. That, and the fact that enormous amounts of time, money and effort go into particle physics experiments, make practitioners especially wary of producing wrong results due to relatively minor mistakes. While by no means do all results in experimental particle physics meet the most rigorous standards, the techniques used to make the majority of them the best that can be produced are held up as examples of good practice in science.

Disclosure Statement
The authors have no conflicts of interest to declare.
Acknowledgments
Work supported by the Fermi National Accelerator Laboratory, managed and operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes.
References

[Aaboud et al., 2018] Aaboud, M. et al. (2018). Search for electroweak production of supersymmetric particles in final states with two or three leptons at √s = 13 TeV with the ATLAS detector. Eur. Phys. J. C, 78(12):995.
[Aad et al., 2008] Aad, G. et al. (2008). The ATLAS Experiment at the CERN Large Hadron Collider. JINST, 3:S08003.
[Aad et al., 2020] Aad, G. et al. (2020). Measurements of the Higgs boson inclusive and differential fiducial cross sections in the 4ℓ decay channel at √s = 13 TeV.
[Aaij et al., 2015] Aaij, R. et al. (2015). Observation of J/ψp resonances consistent with pentaquark states in Λb → J/ψK−p decays. Phys. Rev. Lett., 115:072001.
[Adamson et al., 2020] Adamson, P. et al. (2020). Improved constraints on sterile neutrino mixing from disappearance searches in the MINOS, MINOS+, Daya Bay, and Bugey-3 experiments. Phys. Rev. Lett., 125(7):071801.
[Aguilar-Arevalo et al., 2001] Aguilar-Arevalo, A. et al. (2001). Evidence for neutrino oscillations from the observation of ν̄e appearance in a ν̄µ beam. Phys. Rev. D, 64:112007.
[Aguilar-Arevalo et al., 2013] Aguilar-Arevalo, A. et al. (2013). Improved search for ν̄µ → ν̄e oscillations in the MiniBooNE experiment. Phys. Rev. Lett., 110:161801.
[Aguilar-Arevalo et al., 2018] Aguilar-Arevalo, A. et al. (2018). Significant excess of electronlike events in the MiniBooNE short-baseline neutrino experiment. Phys. Rev. Lett., 121(22):221801.
[Aker et al., 2019] Aker, M. et al. (2019). Improved upper limit on the neutrino mass from a direct kinematic method by KATRIN. Phys. Rev. Lett., 123(22):221802.
[Bailey, 2017] Bailey, D. (2017). Not Normal: the uncertainties of scientific measurements. Royal Society Open Science, 4:160600.
[Barlow, 2002] Barlow, R. (2002). Systematic errors: facts and fictions. In Conference on Advanced Statistical Techniques in Particle Physics, pages 134–144.
[Behnke et al., 2020] Behnke, O., Cousins, R., Cowan, G., Cranmer, K., Junk, T., Kuusela, M., Lyons, L., and Wardle, N. (2020). PhyStat Workshop Series. https://espace.cern.ch/phystat/. Online; accessed 9 Sep 2020.
[Benjamini, 2020] Benjamini, Y. (2020). The replicability problems in science: it's not the p < 0.05 fault. Private communication.
[Chatrchyan et al., 2008] Chatrchyan, S. et al. (2008). The CMS Experiment at the CERN LHC. JINST, 3:S08004.
[Chen et al., 2019] Chen, X., Dallmeier-Tiessen, S., Dasler, R., Feger, S., Fokianos, P., Gonzalez, J. B., Hirvonsalo, H., Kousidis, D., Lavasa, A., Mele, S., Rodriguez, D. R., Šimko, T., Smith, T., Trisovic, A., Trzcinska, A., Tsanaktsidis, I., Zimmermann, M., Cranmer, K., Heinrich, L., Watts, G., Hildreth, M., Lloret Iglesias, L., Lassila-Perini, K., and Neubert, S. (2019). Open is not enough. Nature Physics, 15(2):113–119.
[Cowan et al., 2011] Cowan, G., Cranmer, K., Gross, E., and Vitells, O. (2011). Asymptotic formulae for likelihood-based tests of new physics. Eur. Phys. J. C, 71:1554. [Erratum: Eur. Phys. J. C 73, 2501 (2013)].
[De Salas et al., 2018] De Salas, P., Gariazzo, S., Mena, O., Ternes, C., and Tórtola, M. (2018). Neutrino mass ordering from oscillations and beyond: 2018 status and future prospects. Front. Astron. Space Sci., 5:36.
[Esteban et al., 2019] Esteban, I., Gonzalez-Garcia, M., Hernandez-Cabezudo, A., Maltoni, M., and Schwetz, T. (2019). Global analysis of three-flavour neutrino oscillations: synergies and tensions in the determination of θ23, δCP, and the mass ordering. JHEP, 01:106.
[Evans and Bryant, 2008] Evans, L. and Bryant, P. (2008). LHC Machine. JINST, 3:S08001.
[Feldman and Cousins, 1998] Feldman, G. J. and Cousins, R. D. (1998). A unified approach to the classical statistical analysis of small signals. Phys. Rev. D, 57:3873–3889.
[Franklin, 2013] Franklin, A. (2013). Shifting Standards: Experiments in Particle Physics in the Twentieth Century. University of Pittsburgh Press.
[Franklin, 2018] Franklin, A. (2018). Is It the Same Result: Replication in Physics. Morgan & Claypool Publishers.
[Gross and Vitells, 2010] Gross, E. and Vitells, O. (2010). Trial factors for the look elsewhere effect in high energy physics. Eur. Phys. J. C, 70:525–530.
[Heinrich and Lyons, 2007] Heinrich, J. and Lyons, L. (2007). Systematic errors. Ann. Rev. Nucl. Part. Sci., 57:145–169.
[Hicks, 2012] Hicks, K. H. (2012). On the conundrum of the pentaquark. Eur. Phys. J. H, 37:1–31.
[Ioannidis, 2005] Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8):e124.
[Junk, 1999] Junk, T. (1999). Confidence level computation for combining searches with small statistics. Nucl. Instrum. Meth. A, 434:435–443.
[Klein and Roodman, 2005] Klein, J. and Roodman, A. (2005). Blind analysis in nuclear and particle physics. Ann. Rev. Nucl. Part. Sci., 55:141–163.
[Larue et al., 1981] Larue, G., Phillips, J., and Fairbank, W. (1981). Observation of fractional charge of (1/3)e on matter. Phys. Rev. Lett., 46:967–970.
[Leek and Peng, 2015] Leek, J. and Peng, R. (2015). Statistics: P values are just the tip of the iceberg. Nature, 520:612.
[Mandelkern, 2002] Mandelkern, M. (2002). Setting confidence intervals for bounded parameters. Statist. Sci., 17(2):149–172.
[McShane et al., 2019] McShane, B. B., Gal, D., Gelman, A., Robert, C., and Tackett, J. L. (2019). Abandon statistical significance. The American Statistician, 73(sup1):235–245.
[Munafò et al., 2017] Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., and Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1):0021.
[National Academies of Sciences and Medicine, 2019] National Academies of Sciences, Engineering, and Medicine (2019). Reproducibility and Replicability in Science. The National Academies Press, Washington, DC.
[Read, 2002] Read, A. L. (2002). Presentation of search results: the CL(s) technique. J. Phys. G, 28:2693–2704.
[Schael et al., 2006] Schael, S. et al. (2006). Precision electroweak measurements on the Z resonance. Phys. Rept., 427:257–454.
[Sirunyan et al., 2020] Sirunyan, A. et al. (2020). Constraints on anomalous Higgs boson couplings to vector bosons and fermions in production and decay in the H → 4ℓ channel.
[Stone, 2000] Stone, S. (2000). Pathological Science. In Theoretical Advanced Study Institute in Elementary Particle Physics (TASI 2000): Flavor Physics for the Millennium, pages 557–575.
[Woolston, 2015] Woolston, C. (2015). Psychology journal bans P values. Nature, 519:9.
[Zyla et al., 2020] Zyla, P. et al. (2020). Review of Particle Physics. PTEP, 2020(8):083C01.