Reproducibility and Replication of Experimental Particle Physics Results
Thomas R. Junk†,∗ and Louis Lyons‡,∗∗

†Fermi National Accelerator Laboratory, Batavia, IL, USA
‡Imperial College, London and Oxford University, UK

∗[email protected], ∗∗[email protected]
September 14, 2020
Abstract
The recent “replication crisis” has caused practitioners and journal editors in many fields in science to examine closely their methodologies and publishing criteria. Experimental particle physicists are no exceptions to this, but some of the unique features of this sub-field of physics make the issues of reproduction and replication of results a very interesting and informative topic. The experiments take many years to design, construct, and operate. Because the equipment is so large and complex, like that of the Large Hadron Collider and its associated particle detectors, the costs are very high. Large collaborations produce and check the results, and many papers are signed by more than three thousand authors. Experimental particle physics is a mature field with strong traditions followed by the collaborations. This paper gives an introduction to what experimental particle physics is and to some of the tools that are used to analyze the data. It describes the procedures used to ensure that results can be computationally reproduced, both internally and externally. It also describes methods particle physicists use to maximize the reliability of the results, which increases the probability that they can be replicated by other collaborations or even the same collaborations with more data and new personnel. Examples of results that were later found to be false are given, both with failed replication attempts and some with alarmingly successful replications. While some of the characteristics of particle physics experiments are unique, many of the procedures and techniques can be and are used in other fields.
Keywords:
Reliability, Reproducibility, Replication, Particle Physics
Media Summary

The recent “replication crisis” has caused quite a stir in many scientific fields. Scientists and statisticians alike have recoiled in horror at the low rate at which results have been confirmed when experiments are repeated. Much ink has been spilled explaining the shortcomings of methodologies commonly used in scientific experiments and the criteria that are used when selecting results for publication. Not every proposed solution makes sense, although many are good ideas. Particle physicists have long been aware of the precursors of non-replicable results. No one on a large collaboration who has worked long years on a very expensive experiment wishes to publish a wrong result, which would undermine the credibility of all results from that collaboration. Thus, many internal tests of reproducibility of the results, as well as conservative methods such as blind analysis and stringent review, all with the purpose of catching mistakes and well-intentioned but flawed work, are common in particle physics. Results are also published even if they disprove new theories; null results are not simply filed away. Discoveries of new particles and interactions have a very high bar to meet in particle physics: p values must be less than 3 × 10⁻⁷, not 0.05 as is common in some other fields of study. Particle physicists can easily point to past discovery claims that have had less significance and have vanished when more data were collected or when other groups attempted confirmation. Not every result is perfect or even replicable in particle physics, but the quality is generally quite high. New practitioners are always introduced to examples in which even the most careful analyzers have been able to fool themselves. While some of the techniques and procedures used by particle physicists to ensure the reliability of their results are specific to the sub-field, many can be used regardless of the scientific specialty.

Introduction

Experimental particle physics (EPP), also commonly known as high-energy physics, is a relatively mature field of research, with traditions spanning many decades. Particle physics experiments take years to design, build and run, and they require large financial resources, equipment, and effort. The experimental collaborations at the Large Hadron Collider (LHC) comprise more than 3000 physicists each, and collaborations in other sub-branches of particle physics such as neutrino physics are also becoming larger over time. Experimental particle physics has historically been on several cutting edges technologically, computationally, and sociologically. Some of the practices within EPP to assure the reliability of published results may be directly related to peculiarities of the field and its data, but most are generally applicable. It is the aim of this article to give a review of how issues of reproducibility and replication are addressed in the specific context of EPP, with the intention that they may be more broadly applicable. We also observe that reproducibility and replication, while necessary conditions for a reliable result, by themselves do not guarantee that a result is correct. Before addressing reproducibility and replication, however, we first give a gentle introduction to some of the most important features of EPP.
This paper proceeds by describing the target field of knowledge, elementary particle physics, and the tools used in the research: high-energy accelerators, particle detectors, data collection, processing, and reduction techniques. It then describes common statistical inference tools used in EPP. From there, reproducibility and replication are defined, using common data science conventions and comparing these terms with the language commonly used in EPP. Features of EPP that intrinsically help with reproducibility and replication are discussed, and the methods used to improve these qualities are described. Examples are given of experimental results that have been replicated, results for which attempts to replicate them have failed, and results that were replicated but were later found to be incorrect nonetheless.

Particle physics involves the study of matter and energy at its smallest and most fundamental level. The question of what are the smallest building blocks out of which everything is made has a long history. For the ancient Greeks, the elements were Air, Fire, Earth and Water. Experiments that confirmed this model were highly replicable. Even though the model is incomplete, it served many practical purposes.

Today’s elementary particles are quarks and leptons, all with spin 1/2. The latter consist of the electron and its two heavier versions, the muon (µ) and the heavy lepton (τ); for each of these, there is a corresponding neutrino. The proton, the neutron and other half-integer spin particles (baryons) are composed of combinations of three quarks, while mesons of integer spin (bosons) consist of a quark and an antiquark. Each quark and lepton has its own anti-particle.

The baryons and mesons, the particles made of quarks, are collectively known as hadrons. Quarks are confined within hadrons, and in contrast to leptons do not seem to lead an independent existence as free particles. In collisions between particles, as any struck quarks try to escape from the hadrons, they are converted into jets of pions, kaons and protons which leave visible tracks in the detectors (see Fig. 1).

In addition, the fundamental forces are each transmitted by their own carrier. These are

• Gravitation, transmitted via gravitons. Gravitational waves produced in the coalescence of a pair of black holes were observed in 2016.
• Electromagnetism. This is mediated by photons.
• The weak nuclear force. The intermediate vector bosons W and Z transmit this short-range force.
• The strong nuclear force. This is another short-range force, carried by gluons. They are responsible for binding quarks in hadrons, are produced in collisions involving quarks, and are detected by the jets of particles they produce.

Finally, there is the Higgs boson, with a mass of 125 GeV. The Higgs field is responsible for enabling fundamental particles to have mass; in the simplest version of the theory, particles would be massless. Even with the discovery of the Higgs boson, the numerical values of the masses are not understood.

In the standard model (SM) of particle physics, quarks and leptons are arranged in three generations; see Table 1. It is not understood why there are three generations. The SM also specifies the way the various particles and force carriers interact with each other via the fundamental interactions in the bottom three rows of Table 2; gravitation has not yet been unified with the other three forces.

Table 1: The quarks and leptons of the standard model, arranged in three generations.

Generation   Quarks              Leptons
1            Up u, Down d        Electron e, Neutrino ν_e
2            Charm c, Strange s  Muon µ, Neutrino ν_µ
3            Top t, Bottom b     Tau τ, Neutrino ν_τ
A possible source of confusion is the meaning of the phrase “elementary particle”. This should refer to our basic building blocks of matter at its smallest scale, the quarks and leptons. However, by tradition in this field, they are the particles which, prior to the quark model in the 1960s, were thought to be elementary. This includes protons, neutrons, π mesons and all the other hadrons, as well as the leptons (which still today really are considered to be elementary). Thus the neutron is an elementary particle, composed of an up quark and two down quarks.

A GeV is a unit of energy or mass, and is 1 × 10⁹ electron volts. For comparison, the lightest neutrino’s mass is less than 1.1 eV ([Aker et al., 2019]), the proton’s is about 1 GeV, and a gold atom’s is about 180 GeV.

Table 2: The fundamental forces, their carriers, and the particles they affect.

Force             Potential    Force carrier   Responsible for                          Particles affected
Gravitation       1/r          Graviton        Earth going round Sun, etc.              Everything
Electromagnetism  1/r          Photon          Coulomb repulsion, photon emission       Charged particles
Weak              Short range  W and Z bosons  Energy generation in Sun, β-decay        Quarks, leptons
Strong            Short range  Gluons          Nuclear binding, quark-quark scattering  Hadrons

The remaining sections of this introduction outline the various steps involved in the data analysis process, starting with the accelerator and detector, and ending with the data and analysis software storage for posterity.
Experiments can be divided into those performed at accelerators, and those carried out elsewhere. The accelerator ones either use a beam hitting a stationary target, or have antiparallel beams colliding with each other. An example of the former is a neutrino beam, with the detector hundreds of kilometers away. Colliding beams provide an easier way of achieving higher center-of-mass energies, but have more stringent beam requirements. The highest center-of-mass energy of 13 TeV has been obtained at CERN’s LHC with collisions between protons in a 26 km circumference ring some 100-150 m below the surface ([Evans and Bryant, 2008]). A typical analysis uses data collected over a running period of between several weeks and several years.

There are various forms of non-accelerator experiments. These include beams from nuclear reactors; studies of cosmic rays, solar and atmospheric neutrinos; searches for Dark Matter or proton decay; etc.
There are two large general-purpose detectors at the LHC, CMS ([Chatrchyan et al., 2008]) and ATLAS ([Aad et al., 2008]). Both are cylindrical, and consist of concentric sub-detectors with different functions:

• Vertex detector. This is a high spatial resolution detector placed as close as possible to the interaction region. It is useful for finding charged particles that come from the decays of heavier particles that have traveled millimeters before decaying.
• Tracker. This detects charged particles and measures their momenta.
• Electromagnetic calorimeter. Electrons and photons are identified by the showers they produce in the calorimeter’s material of high-Z nuclei.
• Hadron calorimeter. This is useful for detecting neutral hadrons, such as neutrons or K⁰_L mesons.
• Muon detector. These are placed at the outside of the whole detector, so that almost the only charged particles penetrating that far are muons.

These are in a magnetic field of several tesla, so that the momenta of particles can be measured. The length of ATLAS is 45 m and its height is 25 m.

Because ATLAS and CMS are general-purpose detectors, many different physics analyses are possible. Generally these will be performed on different subsets of the accumulated data.

Experiments at lower energy accelerators tend to have more individually designed, smaller detectors for their specific analysis.

Figure 1: An event display from the CMS collaboration, showing a tt̄H candidate interaction (pp → tt̄H + X). The Higgs boson candidate decays to two Z bosons which themselves each decay to µ⁺µ⁻. Reconstructed tracks are shown as curves that originate in the center of the detector, and calorimeter clusters and muon detector responses are shown with shaded blocks further out.

The intensities of the proton beams at the LHC are such that the rate of collisions at the center of ATLAS and CMS is of order 10⁹ Hz. The data acquisition system can record up to 1000 interactions (“events”) per second. An online trigger system is used to select interesting events to be stored for further analysis; this consists of several algorithms in parallel, to cater for the variety of the subsequent physics analyses. Events not recorded are lost. Studies are performed to evaluate and correct for the trigger efficiency for recording wanted events for each analysis.
The information from the detectors consists of a series of digitized electronic signals which measure energy deposits (“hits”) in small regions of known location in the detector. The job of the reconstruction programs is to link together appropriate hits and turn these into a series of three-dimensional tracks corresponding to the trajectories particles took after they were produced in the event. (In a region where the magnetic field is constant, the trajectories of charged particles are approximately helical; neutral particles travel in straight lines.) Figure 1 shows the reconstructed tracks and calorimeter clusters for a single event collected by the CMS detector. This event passes the candidate selection requirements for associated production of a Higgs boson and two top quarks ([Sirunyan et al., 2020]).

Event selection

In general, each physics analysis will use a small subset of the accumulated data, to reduce background from unwanted processes, while maintaining a high efficiency for retaining the signal. The initial stage of this process involves relatively simple selection criteria devised by physicists (for example, the event should contain a muon and an electron of opposite electrical charge). This is usually followed by machine learning (ML) techniques, such as neural networks or boosted decision trees. Recently, deep learning methods have been employed. In either case, the choice of training samples is important, as illustrated in the sketch below.
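To make the two-stage selection concrete, here is a minimal, hedged sketch in Python (not any experiment's actual code): simple physics cuts are applied first, then a boosted decision tree is trained on labelled simulated events. The variables, cut values, and yields are all illustrative assumptions.

```python
# A hedged sketch of a two-stage event selection: simple physics cuts,
# then a boosted decision tree trained on labelled simulated events.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def make_events(n, is_signal):
    """Toy simulated events: muon pT, electron pT, opposite-charge flag."""
    mu_pt = rng.exponential(40.0 if is_signal else 20.0, n)   # GeV
    el_pt = rng.exponential(35.0 if is_signal else 15.0, n)   # GeV
    opp_q = (rng.random(n) < (0.95 if is_signal else 0.5)).astype(float)
    return np.column_stack([mu_pt, el_pt, opp_q])

sig, bkg = make_events(10000, True), make_events(10000, False)
X = np.vstack([sig, bkg])
y = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])

# Stage 1: simple cuts devised by physicists.
cuts = (X[:, 0] > 20.0) & (X[:, 1] > 15.0) & (X[:, 2] == 1.0)
X, y = X[cuts], y[cuts]

# Stage 2: an ML classifier refines the selection.  The training sample
# is kept separate from the sample used for the final measurement.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
bdt = GradientBoostingClassifier(n_estimators=200, max_depth=3)
bdt.fit(X_train, y_train)
score = bdt.predict_proba(X_test)[:, 1]      # per-event signal probability
print("signal efficiency at score > 0.8:",
      (score[y_test == 1] > 0.8).mean())
```

In a real analysis the classifier would be trained on events that are statistically independent of those entering the final fit, so that the training step cannot bias the result.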
There are two large classes of analyses in EPP: parameter determination and hypothesis testing.

Parameter determination
This involves the determination of one or more parameters (e.g. the mass of the Higgs boson) and their uncertainties. It requires the use of some parameter determination technique, such as those listed below, each of which exists in several variants.

• Chi-squared: This can be the Pearson or the Neyman version, or can use a log-likelihood-ratio formulation.
• Likelihood: Either the usual form, where the probability density function on which the likelihood is based is normalized to unity; or the extended form, where the overall number of events is allowed to fluctuate.
• Bayesian posterior: Although particle physicists are loath to use Bayesian methods for hypothesis testing, there is less resistance to them for parameter determination. The choices here are the functional forms used for the Bayesian priors, and the way the credible interval is extracted from the posterior.
• Neyman construction: This guarantees coverage for the determined parameter(s). The resulting confidence intervals can be chosen to be one-sided upper limits (UL) or lower limits (LL); two-sided central intervals; or Feldman-Cousins (see Section 2.5).

As well as determining the actual confidence interval from the data, it is also important to calculate its expected value (sensitivity), either from the median of a set of values assuming data are distributed according to the relevant model, or from the “Asimov” data set, where the single set of invented “data” is exactly as predicted by the relevant model ([Cowan et al., 2011]).

For replicability comparisons, the sensitivities are probably better suited than the actual data values, as the former are not afflicted by statistical fluctuations. (Of course, each data value should be compared with the corresponding expected value for compatibility.) A further point regarding replicability is that the same interval method should be used, as in some cases the methods can produce very different answers.
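As a hedged illustration of the likelihood technique (a sketch with invented toy inputs, not any collaboration's code), the following fits the position of a Gaussian peak over a flat background in a simulated mass spectrum by minimizing the negative log-likelihood.

```python
# Toy maximum-likelihood fit: Gaussian peak plus flat background.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
lo_edge, hi_edge, res = 100.0, 150.0, 2.0          # GeV window, resolution
data = np.concatenate([rng.normal(125.0, res, 200),          # toy signal
                       rng.uniform(lo_edge, hi_edge, 1800)]) # toy background

def nll(pars):
    """Negative log-likelihood for peak position m and signal fraction f."""
    m, f = pars
    pdf = f * norm.pdf(data, m, res) + (1 - f) / (hi_edge - lo_edge)
    return -np.sum(np.log(pdf))

fit = minimize(nll, x0=[120.0, 0.05],
               bounds=[(lo_edge, hi_edge), (1e-4, 0.5)])
m_hat, f_hat = fit.x
print(f"fitted peak mass = {m_hat:.2f} GeV, signal fraction = {f_hat:.3f}")
```

A 68% interval could be read off from where the negative log-likelihood rises by 0.5 above its minimum, and an Asimov-style sensitivity could be estimated by replacing the toy data with the model prediction itself.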
Hypothesis testing

The other category is where we attempt to see if the data favor some version of new physics (hypothesis H1) as compared with a null hypothesis H0. Thus if we had a mass spectrum and were looking for a peak at some location in the spectrum (compare Fig. 3), our hypotheses could be:

• H0 = only well-known particles are produced.
• H1 = also the production of Higgs bosons, decaying via a pair of Z bosons to 4 charged leptons.

Alternatively, an example from neutrino physics would be

• H0 = “normal” ordering of the three neutrino masses, or
• H1 = “inverted” ordering ([Esteban et al., 2019, De Salas et al., 2018]).

Figure 2: Example distributions of the logarithm of the likelihood ratio test statistic, −2 ln(L_H1/L_H0), assuming H0 (black) and H1 (blue). An example observed outcome of the experiment is indicated by the red line. The p value p0 is the yellow-shaded area under the tail of the H0 distribution to the left of the observed value, and p1 is the green-shaded area under the H1 distribution to the right of the observed value. Panel (a) shows a case in which the experiment is expected to distinguish between H0 and H1 most of the time, and panel (b) shows the distributions for an experiment that is not as sensitive.

Here some form of hypothesis testing is used. This usually requires the choice of a data statistic, which may well be a likelihood ratio for the two hypotheses. Then the choice of hypothesis favored by the data may involve comparing the actual value of the data statistic with the expected distributions for the two hypotheses; these may be obtained by Monte Carlo simulation. Another possibility is to use the expected asymptotic distributions ([Cowan et al., 2011]), though care must be taken to use the asymptotic formulas only within their domains of applicability.

Possible outcomes of these comparisons are:

• Data are consistent with H1 but not with H0. For the first example above, this would constitute evidence for a discovery claim.
• Data are consistent with H0, but not with H1. This would result in the exclusion of the model of new physics, at least for some values of the parameters of that model. Figure 2a shows an example of this situation.
• Data are consistent with both H0 and H1, i.e. the experiment is not sensitive enough to distinguish between the two models. Figure 2b is an example of this situation.
• Data are inconsistent with both H0 and H1. This may indicate that some other model is required.

Particle physics analyses often result in the second situation above. There thus appear in the literature many papers entitled “Search for....”; this is a euphemism for “We looked for something and did not find it, but our search was sensitive enough to justify publication.” Publication of such null results is useful in that it serves to exclude the tested model of new physics at some confidence level, at least for some range of its parameter space. For example, hypothetical supersymmetric partners of electrons, called selectrons, are excluded at the 95% confidence level for masses below 500 GeV, although this is somewhat model dependent ([Aaboud et al., 2018]). That means that if the mass of the selectron were lighter, it would have produced a clear signal in the data, but this was not seen, so selectron masses below 500 GeV are ruled out.
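The construction in Fig. 2 can be sketched in a few lines. In this hedged toy example, a single Poisson counting experiment stands in for a full analysis (all yields are invented); the distributions of q = −2 ln(L_H1/L_H0) are built under both hypotheses with Monte Carlo toys, and p0 and p1 are read off for an example observed count.

```python
# Toy-MC p values for the likelihood-ratio test statistic of Fig. 2.
import numpy as np
from scipy.stats import poisson

b, s = 100.0, 30.0                   # expected background and signal yields

def q(n):
    """q = -2 ln(L_H1 / L_H0) for observed count(s) n."""
    return -2 * (poisson.logpmf(n, b + s) - poisson.logpmf(n, b))

rng = np.random.default_rng(7)
q_h0 = q(rng.poisson(b, 500_000))        # toys under H0 (background only)
q_h1 = q(rng.poisson(b + s, 500_000))    # toys under H1 (signal + background)

n_obs = 126                          # an example observed count
q_obs = q(n_obs)
p0 = np.mean(q_h0 <= q_obs)          # H0 tail on the signal-like side
p1 = np.mean(q_h1 >= q_obs)          # H1 tail on the background-like side
print(f"p0 = {p0:.4f}, p1 = {p1:.4f}")
```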
An additional reason for publishing null results is that it avoids the publication bias of accepting for publication only positive results.

Further specific topics related to physics analyses are discussed in Section 2.

Data and analysis preservation

Given the effort and expense of acquiring particle physics data, it is clearly mandatory for experimental groups and collaborations to store their data and analysis assets in a way which makes them accessible for decades, either for new analyses or for replicability tests. CERN has initiated both the Open Data and the CAP (“CERN Analysis Preservation”) projects ([Chen et al., 2019]) for storing all the data and also all the relevant information, software and tools needed to preserve an analysis at the large experiments at the LHC. The preserved analysis assets include any useful metadata to allow understanding of the analysis workflow, related code, systematic uncertainties, statistics procedures, meaningful keywords to assure the analysis is easily findable, etc., as well as links to publications and to back-up material. This is a very involved procedure, and is still in its testing stage, but it will clearly be very helpful for any subsequent reproducibility or replicability studies of the results of analyses using LHC data. It will, however, require a cultural change, with attention and effort on the part of physicists performing analyses.

Initially, access to the stored information would be restricted to members of the collaboration who produced the data, but eventually it could be used by other EPP physicists and the wider range of scientists and the general public, on a time scale decided by the collaboration.

Although developed at CERN for the EPP community, the CAP framework and its concepts may well be of interest to a wider range of scientists.
We here discuss several issues which appear in many physics analyses, and almost certainly are relevant for other fields too.
We define the “error” in a measurement to be the difference between the estimate of a model parameter produced by the analysis and the unknown true value. “Uncertainties” are numerical estimates of the possible values of the errors, and are typically reported as one-standard-deviation intervals centered on the best estimate of a measured quantity. Asymmetric confidence intervals are also used when appropriate. Particle physicists often use the word “error” when they mean “uncertainty.”

Physics analyses are affected by statistical uncertainties and by systematic ones. The former arise either from the limited precision of the apparatus and/or observer in making measurements, or from the random fluctuations (usually Poissonian) in counted events. They can be detected by the fact that, if the experiment is repeated several times, the measured physical quantity will vary.

Systematic effects, however, can cause the result to be shifted from its true value, but in a way that does not necessarily change from measurement to measurement. Measurements nearly always have some bias, and the question is by how much they are biased. Systematic effects are not easy to detect, and in general much more effort is needed to evaluate the corresponding uncertainties.

The simplest systematic effects are the ones associated with the measured quantities needed for the evaluation of the quantity of interest. The raw measurements may need to be corrected, and any uncertainty in this correction contributes to the overall systematic uncertainty. Another may arise from the uncertainty in some other relevant quantity, which has been measured in a subsidiary analysis in this experiment, or in some different experiment. Yet another can be that the relationship between the quantity of interest and the measured quantities involves implicit assumptions that are not quite true in Nature; the systematic arises from the uncertainty in the correction for this. The most difficult to deal with are theoretical uncertainties in evaluating the answer. These can be because of approximations used, or from different ways of estimating them.

Usually systematic errors are dealt with in a likelihood function by assigning them nuisance parameters, with constraint terms corresponding to the uncertainties on their values. Common ways of including their effects in final results such as p values and confidence intervals are to profile the likelihood function with respect to them, or to marginalize the posterior probability distribution in a Bayesian approach, as in the sketch below.
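A hedged sketch of the profiling procedure (toy numbers, not a real measurement): a counting experiment whose background estimate b0 ± Δb enters the likelihood as a nuisance parameter with a Gaussian constraint term, which is then minimized over at each value of the signal yield s.

```python
# Profiling a nuisance parameter in a Poisson counting experiment.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, poisson

n_obs, b0, db = 25, 15.0, 3.0        # observed count, background estimate

def nll(s, b):
    """Negative log-likelihood including the Gaussian constraint on b."""
    return -(poisson.logpmf(n_obs, s + b) + norm.logpdf(b, b0, db))

def profiled_nll(s):
    """Minimize the likelihood over the nuisance parameter b."""
    return minimize_scalar(lambda b: nll(s, b),
                           bounds=(0.01, 60.0), method="bounded").fun

s_grid = np.linspace(0.0, 30.0, 301)
curve = np.array([profiled_nll(s) for s in s_grid])
s_hat = s_grid[np.argmin(curve)]
# Approximate 68% interval: where -ln L rises by 0.5 above its minimum.
in_interval = s_grid[curve <= curve.min() + 0.5]
print(f"s = {s_hat:.1f} +{in_interval.max() - s_hat:.1f} "
      f"-{s_hat - in_interval.min():.1f}")
```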
The way systematic uncertainties on parameter determinations are reported is that, in the bulk of a paper, their numerical effects on the total systematic uncertainty are quoted separately for each source. This is so that if subsequently the magnitude of any of these can be updated, the total systematic uncertainty can be adjusted. Another reason is that if the results of two or more experiments measuring the same quantity are to be combined, this will ease the problem of taking into account the correlations between the systematic effects. The abstract and the conclusions will typically quote the result as µ ± σ_stat ± σ_syst, where µ is the measured quantity, σ_stat is the statistical uncertainty and σ_syst is the systematic one. This separation of the two uncertainties is because systematics are regarded as more problematic than statistical uncertainties, so an experiment with σ_stat = 4, σ_syst = 1 is regarded as superior to one with σ_stat = 1, σ_syst = 4.

More details on the subject of systematic uncertainties can be found in Section 5.7.1 and in [Heinrich and Lyons, 2007].

Blind analysis

Several methods for blinding analyses have been used in EPP and are currently in use ([Klein and Roodman, 2005]). One simple method is to optimize the analysis using simulated data and reserve the input of data from the experimental apparatus until the procedures have been decided upon, including the data selection, classification, and statistical procedures. Analyses are constructed and optimized based on predicted outcomes of expected signal and background contributions to the event yields, and so it is usually possible to perform the necessary steps without access to the data from the experiment. The collaboration must then agree to accept the result of the analysis after the data are input to the analysis (“unblinding”), without change to any step of the analysis, or the procedure is not fully blind.

A drawback of the simple blinding procedure described above is that it precludes the use of data from the experimental apparatus as a calibration source to help constrain the values of nuisance parameters and to help guide the data selection. This shortcoming is addressed by partial blinding. Data in control samples (events that fail one or more signal selection requirements, for example) are allowed to be input to the analysis procedure before the selected “signal” sample is made available for analysis. Sometimes surprises are found in the control samples: previously unappreciated sources of background events or miscalibrations can show up at this stage. The process of eventually revealing data passing signal selection requirements is often called “opening the box.” The process relies on the good faith of the collaboration members not to look at data that have been blinded. In a large collaboration with many different analyses being developed in parallel, sometimes one analysis group’s control sample is another group’s selected signal sample. However, now that sophisticated ML procedures are commonplace, one group’s histogram of a highly specific ML discriminant variable is unlikely to be meaningful to another group which may be seeking a different sort of signal entirely.

A similar strategy for partial blinding, which helps prevent big surprises from showing up when the signal box is opened, is to look at the data in the signal box first only for a subset of the running period over which the data were collected. The data from this running period may then be discarded from the final result if a fully blind analysis is to be claimed.

Yet another blinding procedure, which applies to precision measurements of physical quantities, is to encode an arbitrary, fixed offset in the final step of parameter inference in software, and to hide the value of this offset from researchers performing the analysis work. This sort of offset can help combat the possibly unconscious desire to get the “right” answer (see Sect. 6.1).
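A minimal sketch of how such a hidden offset might be implemented; the secret string, scale, and measured value are illustrative assumptions, not any collaboration's actual scheme.

```python
# Hidden-offset blinding: analyzers see only the blinded value.
import hashlib

def blinding_offset(secret, scale):
    """Deterministic offset in [-scale, +scale), derived from a secret."""
    h = int(hashlib.sha256(secret.encode()).hexdigest(), 16)
    u = (h % 10**8) / 10**8          # pseudo-uniform in [0, 1)
    return (2.0 * u - 1.0) * scale

measured = 80.433                     # the true fit output (kept hidden)
blinded = measured + blinding_offset("collab-secret-2020", 0.5)
print(f"blinded value: {blinded:.3f}")  # what analyzers see until unblinding
```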
P Values

Recently, p values have been under attack ([McShane et al., 2019]), with some journals actually banning their use ([Woolston, 2015]). The reasons seem to be:

• There are many results that claim to observe effects, based on having a p value less than 0.05, which are subsequently not replicated.
• People don’t understand p values, and confuse them with the probability of the null hypothesis being true.

The first point can be mitigated by having a lower cutoff on the p value criterion. The second argument is similar to claiming that matrices should be banned, because many people don’t understand them; what is required is simply better education. Physicists have been dismayed by some of the blunt tools proposed to solve the replicability crisis. A more careful examination of methodologies is much more valuable than blaming the use of p values ([Leek and Peng, 2015]).
p Values to Quantify Discovery Significance

Particle physicists make extensive use of p values in deciding whether to reject the null hypothesis and claim a discovery. These p values are denoted as p0 (see Fig. 2), as they represent a probability under the curve predicted by H0. To claim a specific discovery, it is also necessary to check that the data are consistent with the expectation from H1. Often in searches for new phenomena, the data statistic used for calculating p0 is the likelihood ratio for the two hypotheses, H0 and H1. This already takes note of the alternative hypothesis. The cut-off on p0 is conventionally taken as 3 × 10⁻⁷, corresponding to a z value of 5.0. Some statisticians scoff at this criterion, saying that probability distributions are not so well known in their extreme tails. The reasons in favor are:

• Claiming a fundamentally new effect has widespread repercussions, and can have very high publicity. Withdrawing a claim of discovery can be embarrassing for a collaboration, and, more specifically, the discovery proponents. Reputations and future credibility can be tarnished. The large author lists on particle physics experiments may serve as one reason for the extreme conservativeness in EPP, at least in recent decades. Many collaborators who worked hard on their experiment but not on a particular analysis have an interest in ensuring that their work does not contribute to false claims.
• Past experience shows that effects with z values of 3 and 4 have often not been replicated when more data are collected ([Franklin, 2013]).
• Estimating systematic uncertainties is generally more problematic than determining statistical uncertainties. If a systematics-dominated experiment had underestimated the systematic uncertainty by a factor of two, an interesting reported z score of 5.0 should really have been a much more mundane z = 2.5.
• The Look Elsewhere Effect can effectively increase a local p value to a more relevant global p value (see Section 2.4).
• An old but still relevant maxim is that “Extraordinary claims require extraordinary evidence.” Thus if we were looking for evidence of energy non-conservation in events at the LHC, we should require a z value of much more than 5.0 before rushing into print. From a Bayesian viewpoint, this corresponds to assigning a much smaller prior probability to a hypothesis involving a radically new idea, as compared with traditional well-established physics.

While p0 may be computed with Monte Carlo simulations of possible experimental outcomes, these calculations become very expensive with a threshold of 3 × 10⁻⁷. Asymptotic formulas, such as those in [Cowan et al., 2011], provide for more rapid calculation.
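The translation between z values and one-sided p values used above is the standard Gaussian tail integral; for example:

```python
# Conversion between z values and one-sided p values.
from scipy.stats import norm

for z in (3.0, 4.0, 5.0):
    print(f"z = {z:.0f}: one-sided p = {norm.sf(z):.2e}")
# z = 5 gives p = 2.87e-07, the ~3e-7 discovery criterion quoted above;
# norm.isf(2.87e-7) inverts the conversion and recovers z = 5.
```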
p Values to Reject Alternative Models

If we merely wish to exclude the alternative hypothesis, the convention is to use a p value for H1, denoted p1 (see Fig. 2), of 0.05 or 0.10. This weaker requirement than that for rejecting H0 is because the embarrassment of making a false exclusion is by no means as serious as that of an incorrect claim of some novel discovery. An exclusion of the model H1 using the criterion p1 < 0.05
will falsely exclude H1 at most 5% of the time if H1 is true. Most hypotheses of new physics, however, are not true, and the risk of falsely excluding a true model is therefore rather low. Exclusions are usually expressed in terms of upper limits on signal strengths. In 5% of cases, assuming H0 is true, all values of the signal strength including zero are excluded using this technique. A plot of an upper limit on the signal strength as a function of a model parameter, such as the mass of a hypothetical particle, will then exclude 5% of possible masses for all values of the signal strength, typically in disjoint subsets, assuming no new particle is truly present. Physicists do not wish to exclude models that they did not test, even if their experiment’s outcome is in what is called by statisticians an “identifiable subset” ([Mandelkern, 2002]). Furthermore, a plot showing exclusions all the way down to zero signal strength in 5% of its tested parameters is not expected to be replicable: the repeated experiment would have to get lucky or unlucky in the same way.

To combat the production of upper bounds on signal strengths being reported too small in 5% of cases, particle physicists do one of two things. They may use a modified p value, p1/(1 − p0), which has been given the confusing name CLs ([Junk, 1999, Read, 2002]).
If CLs < 0.05 then H1 is ruled out. It has the property that CLs ≥ p1, and so comparing it with 0.05 will exclude H1 no more often than the strictly frequentist test on p1. It also has the beneficial property of approaching 1.0 as the signal strength approaches zero, preventing exclusion of signals with zero strength. The other common technique is to use a Bayesian calculation of the posterior probability density as a function of the signal strength, and to exclude those values such that the integral from the upper limit to infinity of the posterior density is 0.05.
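For a simple Poisson counting search, the CLs criterion can be sketched with exact Poisson tails (the yields below are invented; in the notation above, CLs = p1/(1 − p0)):

```python
# Hedged sketch of the CLs criterion in a Poisson counting search.
from scipy.stats import poisson

b, s, n_obs = 10.0, 4.0, 8       # background, tested signal, observed count

cl_sb = poisson.cdf(n_obs, b + s)   # p1: background-like tail under H1 (s+b)
cl_b  = poisson.cdf(n_obs, b)       # 1 - p0: the same tail under H0 (b only)
cls = cl_sb / cl_b
print(f"CLs = {cls:.3f} ->", "excluded" if cls < 0.05 else "not excluded")
```

As the tested signal s goes to zero, cl_sb approaches cl_b and CLs approaches 1, so a signal of zero strength cannot be excluded.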
The Look-Elsewhere Effect

Very often our alternative hypothesis is composite. If we are looking for a signal that produces a peak above a smooth background, the location of the peak may be arbitrary. When we are assessing the chance of random fluctuations resulting in a peak as significant as the one we see in our actual data, the local p value is this probability for the given location in our data. Often, the smallest local p value is the most exciting. But more realistic is the global p value, for having a significant fluctuation anywhere in the spectrum. This is similar to the statistical issue of multiple testing, except that that considers discrete tests, while the LEE in the particle physics context often involves a continuous variable (e.g. the location of the peak). P values at neighboring locations are often correlated due to detector resolution, so there aren’t infinitely many independent tests even when the variable is continuous. Asymptotic formulae exist to compute the small p values needed for discovery while taking into account the LEE in cases of a continuous variable such as the invariant mass of a new particle ([Gross and Vitells, 2010]).

The LEE is more general than as described above. For example, the fluctuation could be in the distribution of the same physics variable, but produced by different selections; other possible distributions could also be relevant; etc. These effects can be avoided by using a blind analysis, which cannot be tuned to produce a desired result.

Another complication is that the definition of “elsewhere” depends on who you are. A graduate student might worry about possible fluctuations anywhere in his or her analysis, but the convenor of a physics group devoted to looking for evidence of the production of supersymmetric particles might well be worried about a statistical fluctuation in any of the many analyses searching for these particles. In view of these ambiguities, it is recommended that when global p values are being quoted, it is made clear what definition of “elsewhere” is being used.

Benjamini has commented that in some cases non-replicability can be caused by the original analysis ignoring the effects of multiple testing ([Benjamini, 2020]). In a similar vein, an unscrupulous member of the news media or other interested reader may dredge the preprint servers for the most significant result of the month and not report all of the others that were passed over in the search. Ignoring the LEE and reporting the smallest local p value from a collection of them is a form of “p-hacking” ([Ioannidis, 2005]).

There is no LEE to take into account when computing model exclusions or upper bounds. Each model parameter point is tested and excluded independently of others. If a researcher sifts through all of the model exclusions looking for the most firmly excluded one and holds that up as an example, then an LEE correction may be necessary, but generally this is not of interest; the set of excluded models is the important result. In cases where multiple collaborations test the same model spaces and arrive at excluded regions of these spaces, points in those spaces may have multiple opportunities to be falsely excluded. The warning here goes to presentations of results in which excluded regions are merely overlaid on one another and the union of all excluded regions is inferred to be excluded. In fact, a rigorous combination of the results is needed in order to make a single exclusion plot with proper coverage.

Feldman-Cousins Intervals

Historically, a particle physicist would first look at the data from an experiment and use it to choose between reporting a discovery and reporting an upper limit. This “flip-flopping” has been shown to cause undercoverage. To solve this issue, Feldman and Cousins ([Feldman and Cousins, 1998]) propose using the Neyman construction to produce confidence intervals using the likelihood ratio ordering principle. In the case of a bounded physical parameter, the method automatically selects between reporting a one-sided bound on the parameter and reporting a two-sided interval, while guaranteeing statistical coverage for all true values of the parameter. It readily generalizes to multiple parameters of interest, but its computational expense increases rapidly with the number of parameters.

The method of Feldman and Cousins (FC) has the property of never producing empty confidence intervals, although the Neyman construction with other ordering rules may do so. With the FC method, it is impossible to exclude an entire parameter space. It is therefore an ideal method to use when it is known that the parameter space contains a single point corresponding to the true value(s) in Nature. This is a very common situation. We know that electrons have a mass, and so the space of possible values of the electron mass contains the truth somewhere in it. On the other hand, supersymmetric electrons may not exist at all, regardless of what mass one might imagine they have. One must be careful not to come to unwarranted discovery claims based on model assumptions. If someone has lost their car keys and they look everywhere except in a place that is difficult to search and no keys have been found, they may deduce that the missing keys are located in the place that has not yet been investigated. If it is not known that the keys exist in the first place, or it is not known that the set of all considered locations they could be in is exhaustive, such a deduction is unwarranted. One may address this issue by including the null hypothesis in the model space considered, though the model space may still be incomplete.
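A compact, hedged sketch of the FC construction for a single Poisson observable n with known mean background b, using the likelihood-ratio ordering principle (coarse grids and invented numbers; see [Feldman and Cousins, 1998] for the full prescription):

```python
# Feldman-Cousins 90% CL interval for a Poisson signal with known background.
import numpy as np
from scipy.stats import poisson

b, cl = 3.0, 0.90
n_vals = np.arange(61)                      # possible observed counts
s_grid = np.arange(0.0, 20.001, 0.01)       # tested true signal values

def accepted_set(s):
    """Counts n accepted at confidence level cl for true signal s."""
    s_best = np.maximum(n_vals - b, 0.0)    # physical best-fit s for each n
    ratio = poisson.pmf(n_vals, s + b) / poisson.pmf(n_vals, s_best + b)
    acc = np.zeros(n_vals.size, dtype=bool)
    prob = 0.0
    for n in np.argsort(-ratio):            # add n in decreasing rank order
        acc[n] = True
        prob += poisson.pmf(n, s + b)
        if prob >= cl:
            break
    return acc

# The interval for an observation is every s whose accepted set contains it.
n_obs = 2
member = np.array([accepted_set(s)[n_obs] for s in s_grid])
print(f"90% CL Feldman-Cousins interval for s: "
      f"[{s_grid[member].min():.2f}, {s_grid[member].max():.2f}]")
```

For n_obs below the expected background, the construction returns a one-sided upper limit (a lower edge at zero) rather than an empty interval, which is the behavior described above.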
Reproducibility

The definition of the word “reproduce” as used in this paper is the extraction of consistent results using the same data, methods, software and model assumptions ([National Academies of Sciences and Medicine, 2019]). The only variables in this case are the human researchers, if there are any, and the separate runs on possibly different computers. A failure to reproduce results could arise from improper packaging of digital artifacts; a lack of documentation, knowledge or even patience on the part of either the original researchers or those attempting the reproduction; the use of random numbers in the computational step; or non-repeatability of calculations on computers, whether due to thread scheduling, radiological or cosmogenic interference with the computational equipment, or differences in the architectures of the computational equipment.

In past decades, the internal representations of floating-point numbers varied from one hardware vendor to the next, and results generally were not exactly reproducible when software was ported. As late as the 1990s, physicists used mixtures of DEC VAX, IBM mainframes, Cray supercomputers and various RISC architectures, such as HP PA-RISC, IBM POWER, Sun Microsystems SPARC, DEC Alpha, and MIPS, to name a few. Many of these architectures had idiosyncratic handling of floating-point arithmetic. Software had to be specially designed so that data files created on one computer architecture but read on another produced results as similar as possible.

Large computing grids currently used in EPP contain mixtures of hardware from different vendors, though these are almost entirely composed of x86-64 processors manufactured by Intel or AMD. The relatively uniform landscape today makes reproducing results much easier, but by no means can differences in computer architecture be ignored.

Compiling programs with different optimization options can produce different results when run on the same computer, due to intermediate floating-point registers carrying higher precision than representations in memory. If a program has an IF-statement in it that tests whether a floating-point number is greater than or less than some threshold, a very small difference in a computed result can be the root cause of a very large difference in the rest of the output of a program. EPP has long been sensitive to these issues, due to the complexity of the software in use and the large size of its data sets. While each triggered readout of a detector is processed independently of all others, the large number of triggered readouts to process virtually guarantees that rare cases of calculations that perform differently, or even fail, will occur during a large data processing campaign.
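A tiny, self-contained illustration of why bitwise reproducibility is fragile: two mathematically equivalent ways of summing the same numbers give different IEEE-754 doubles, and a threshold test then flips on the difference.

```python
# Two mathematically equivalent sums, two different IEEE-754 results,
# and a threshold test that flips on the tiny difference.
total_loop = sum([0.1] * 10)          # left-to-right accumulation
total_alt  = (0.1 + 0.1) * 5          # a different association
print(total_loop == total_alt)        # False: 0.9999999999999999 vs 1.0
threshold = 1.0
print(total_loop >= threshold, total_alt >= threshold)  # False True
```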
Particle physicists have long recognized the utility of compiling software using different compilers, with all of the warnings enabled, and even with warnings treated as errors so they cannot be ignored during development. Different compilers produce warning and error messages for different classes of errors in the source code, such as the use of uninitialized variables, some instances of which may go undetected by some compilers but which may be flagged by others. Fixing software mistakes identified in this way helps guard against undefined behavior in programs.

Running the programs on computers of different architectures and comparing the results has also long been a tradition in EPP. More recently, the process has been automated. On each release of the software, or, in some cases, as fine-grained as on each change in any software component committed to a central repository, continuous integration systems now compile and link the software stack and run basic tests, comparing the results against previous results. These systems automatically warn the authors and the maintainers of the software of variations in the program’s outputs. These safeguards are very important in large collaborations because not every person developing software is an expert in every part of a large software stack, and unintended consequences of changes can go unnoticed or their root causes can be misassigned unless they are identified quickly. Collaboration members who develop software sometimes leave for other jobs, placing high importance on good documentation and less reliance on individual memories of what the software does and how it works. These continuous integration systems require human attention, as some changes to the software produce intended improvements, and these must be separately identified from undesired outcomes.

Reproduction of entire analyses is common in EPP when analysis tools are handed from one analyzer, or team, to the next. For example, when a graduate student graduates, a new student is often given the task of extending the previous student’s work. The first exercise for the new student, however, is to reproduce the earlier result using the same data and software. This is usually referred to as “re-doing” or “checking” an analysis. The standards for comparison are very high: numbers must match to much better than their quoted uncertainties, and exact matches are preferable. Any discrepancy in this step points to a relatively simple flaw that can be remedied.

Reproducibility is a necessary but insufficient condition for reliability. Reproducibility only tests the integrity of the computational steps, not the correctness of the assumptions entering the analysis, or even the quality of the input data. A flawed result, when reproduced, contains the same flaws.

Replication

A more interesting process is to perform a similar analysis, either with the same data and a different analysis technique, or with different data and either the same or a different analysis technique. Not only can the results be compared, but, to the extent possible, intermediate quantities such as selected event counts, or even lists of which triggered readouts of the detector were selected, can be compared to check for consistency. The word “replicate” is defined to mean these kinds of independent tests ([National Academies of Sciences and Medicine, 2019]). In EPP, however, the word “replicate” is rarely used in this sense due to the use of the word “replica” to mean an identical copy, as in data sets distributed to distant computer centers or in geometry descriptions of repeated, identical detector components. Instead, “independent confirmation” is a more conventional phrase used in the case of successful replication, and “ruling out,” “exclusion,” and “refutation” are words that are used when replication attempts fail to confirm the previous result. If the data sets used in a replicated analysis overlap with those of the original analysis, the word “independent” is not used.
Replicated analyses often share sources of systematic error and thus also may fail to be independent even when the data sets, the experimental apparatus, and the collaborations are independent. In order to tell whether a second experimental result successfully replicates the first, shared and independent sources of error must be carefully taken into account. These are usually obtained from the quoted uncertainties on the measured values, but in the case of overlapping data sets, a more involved analysis is warranted. The end result of a comparison of a replicated measurement is often a p value expressing the probability that the two results would differ by as much as they were observed to or more, maximized over model parameters. The sample space in which the p value is computed consists of imaginary repetitions of the two experiments, assuming their outcomes are predicted by the same model. A result with a very large systematic uncertainty is consistent with more true values of the parameter(s) of interest and thus passes replication tests more easily than one with a smaller systematic uncertainty. Such a result is also less interesting because of its lack of constraint on the parameters of interest. A result can only be “wrong” if it has underestimated uncertainties. Even a non-reproducible result may not be a wrong result if it is accompanied by a systematic uncertainty that covers the amount of non-reproducibility.
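As a hedged sketch of such a comparison (all numbers, including the correlation coefficient, are illustrative; in practice the correlation comes from a careful accounting of the shared systematics), a two-measurement compatibility p value might be computed as:

```python
# Compatibility test between two measurements with partly shared systematics.
import numpy as np
from scipy.stats import norm

x1, stat1, syst1 = 80.40, 0.05, 0.04
x2, stat2, syst2 = 80.52, 0.06, 0.05
rho = 0.6                        # correlation between the two systematics

var_diff = (stat1**2 + stat2**2 + syst1**2 + syst2**2
            - 2.0 * rho * syst1 * syst2)   # variance of the difference
z = (x1 - x2) / np.sqrt(var_diff)
p = 2.0 * norm.sf(abs(z))        # two-sided p value for the difference
print(f"z = {z:.2f}, p = {p:.3f}")
```

Note that a positive correlation between the shared systematics reduces the variance of the difference, making the test more stringent than a naive quadrature sum would suggest.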
The conditions under which particle detectors are operated are known to affect their performance and thus may bias the results obtained from them. Particle physics experiments run for years at a time, and operating conditions are variable, making the data sets heterogeneous. Variations in accelerator parameters, such as the beam energy, the energy spread, the intensity, and stray particles accompanying the beam (“halo”), are constantly monitored and automatically recorded in databases for future retrieval during data analysis. Environmental variables such as ambient temperature, pressure and humidity are also included in these records. The concentrations of electronegative impurities in drift media are constantly measured and recorded. The status of high-voltage settings, electronics noise, and which detector components are functioning or broken are also recorded. Non-functioning detector components are often repaired during scheduled accelerator downtime, and some detector components may recover functionality when computer processes are restarted. If a particular physics analysis requires that the detector is fully functional, then only data that were taken while the detector satisfies the relevant requirements can be included in that analysis.

While an experiment is collecting data, physicists take shifts operating the detector and monitoring the data that come out of it. While most detector parameters that affect analyses can be identified in advance and monitored automatically, some surprises can and do occur. It is up to the shift crew to identify those conditions, notify experts who may be able to repair the errant condition, and mark the data appropriately so it does not bias physics results. The shift crew is aided by automated processes that analyze basic quantities of the data in near real time, providing input to their decisions.

Because the data processing and analyses in EPP require the use of a large amount of software that is under constant development by a large number of people, a well-thought-out version control system is required. Not only must the source code be under strict version control, but so too must the installed environments, which include auxiliary files and databases. Naive systems in which collaborators share computers on which the software is constantly updated to the latest version will find their analysis work difficult in a way that scales with the size and activity of the software development effort. Results obtained by a physicist running the same programs on the same data may differ from day to day, or programs that ran previously may fail to run at all. A new release of a software component may be objectively better than the older ones: bugs may have been fixed or the performance of the algorithms may have been improved. “Performance” here refers not to the speed with which the program runs on a computer, but rather to its ability to do its intended job. The probability for a track-finding algorithm to find a track may go up from one version to the next, for example. But if an analyzer has measured the performance of the algorithms, using experimental data if possible and Monte Carlo simulations otherwise, then those algorithms must be held constant or the calibration constants become invalid. One may “freeze” the software releases, but then some collaborators will require newer releases than other collaborators.

The solution chosen in EPP is to freeze and distribute pre-compiled binaries and associated data and configuration files for each release used by any collaborator. No software version is set up by default when a user logs in; a specific version must be specified. The version for a top-level software component determines the versions needed for all dependent components, which are automatically selected. Inconsistent version requests are treated as errors. New users are surprised at the need for this complexity, but they appreciate it later when they are finishing up their analysis work and are trying to keep every piece of their workflow stable.
A common and necessary practice in EPP is to compute the expected sensitivity of an analysis before the data are collected, or at least before they are analyzed. Since particle physics experiments are so expensive and take so long to design, construct, operate, and perform data analysis, funding agencies require that a collaboration demonstrate that their proposed experiment is capable of testing the desired hypotheses before approving the funding. The expected sensitivity usually takes the form of the expected length of the confidence interval on one or more measured parameters, or the median expected upper limit on the rate of a process assuming it truly is not present in Nature, or the median p value, assuming the new process is truly present. Typically, distributions of possible outcomes of the experiment are pre-computed before the analysis process is finished. The observed result can then be compared with the expected results once all the data are collected and the analysis is finished. Sometimes a spurious outcome is easily identifiable as being consistent with none of the considered hypotheses. This separation of the results into sensitivity and significance furthermore helps combat the “file-drawer” effect. Even if a result fails to be significant, if the sensitivity of the test is high, then the result is worth publishing.
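A hedged sketch of one such sensitivity calculation, the median expected upper limit: toy experiments are generated under the background-only hypothesis, and the median of the per-toy 95% upper limits is reported. The limit here is a Bayesian one with a flat prior in the signal yield and known background, one of several conventions in use; all numbers are invented.

```python
# Median expected 95% upper limit from background-only toy experiments.
import numpy as np
from scipy.stats import gamma

b = 12.0                             # known expected background
rng = np.random.default_rng(3)

def upper_limit(n, cl=0.95):
    """Bayesian 95% UL on s for observed count n (flat prior, known b).
    The posterior of mu = s + b is a Gamma(n+1) density truncated at b."""
    post = gamma(n + 1)
    below = post.cdf(b)              # posterior mass in the unphysical s < 0
    return post.ppf(below + cl * (1.0 - below)) - b

toys = rng.poisson(b, 100000)        # background-only pseudo-experiments
uniq, counts = np.unique(toys, return_counts=True)
limits = np.array([upper_limit(n) for n in uniq])
median_expected = np.median(np.repeat(limits, counts))
print(f"median expected 95% upper limit on s: {median_expected:.2f} events")
```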
One way in which physics analyses can benefit from replication, without waiting for another group on the same or a different collaboration to work on a similar analysis, is to use a calibration source, or a “standard candle.” Signals that have been long established ought to be visible in analyses that seek similar but not-yet-established signals. The analysis therefore replicates earlier work, and in so doing not only validates the earlier work, but increases the confidence in the present work. This step is particularly important in analyses that do not observe a new signal. One might think that the detector or the analysis method is simply not sensitive to the new signal and may have missed it. To show that a known signal is found in the same analysis with the expected strength and properties gives confidence that the whole chain is working as desired.

A related technique is the use of “control samples,” in which the desired signal is known not to exist, or, if it does, contributes a much smaller fraction of events than in the selected signal sample. Data that do not pass selection requirements, or which were collected in different accelerator conditions (off-resonance running is an example of a way to collect background-only data), can be used to estimate the rates and properties of processes that are not the intended signal but which can be confounded with it if not carefully controlled. Often multiple control samples are used, each one targeting a specific background process, or which may over-constrain the rates or properties of them. Disagreements in the predictions of background rates and properties from different control samples often contribute to the systematic uncertainty estimations used in the signal sample.

Large EPP collaborations have a difficult, complex task to perform to produce any individual result. The detectors have millions of active elements, and the conditions are variable. Often a physics analysis requires events to be selected with a specific particle content, say a lepton, a number of jets, and missing energy. Not every lepton is identified correctly, however, and not every jet’s energy is measured well. Instead of requiring every team that wants to analyze data to perform the work to calibrate all of the things that need calibrating, working groups are set up to perform these tasks. A group may be devoted just to b-jet tagging while another will calibrate the electron identification and energy scale. Other groups will form around each necessary task. Their results, along with systematic uncertainty estimates, are reviewed in much the same way as physics results are reviewed before being approved for use by the collaboration. In this way, consistent calibrations are available for all physics analyses performed by the collaboration, and mistakes are minimized. One avenue by which subconscious bias can affect an analysis is the calibration stage. The separation of the calibration efforts into dedicated groups reduces the possibility that collaborators wishing for specific results in their analyses can obtain them by (subconsciously) manipulating calibrations, as each calibration group must provide results for everyone in the collaboration, not just one set of interested parties.

Circular colliders typically have multiple detectors located at discrete interaction regions that produce identical physics processes because the beams are the same. The PEP ring at SLAC had four detectors: HRS, TPC-2γ, Mark-II and MAC. The LEP collider had the ALEPH, DELPHI, L3 and OPAL detectors. The LHC has ATLAS and CMS, and also the special-purpose detectors ALICE and LHCb. These detectors are only partially redundant; they are not exact copies of one another. Part of the purpose is to provide for replication of results, but it is also important to diversify the technology used in the experimental apparatus. While detector research and development is also a mature field, and technologies are deployed in large experiments only if they have been shown to work in prototypes, risks still exist. A particular technology may be better suited to a specific physics analysis than another, but it may be weaker in another analysis. Given the high costs of these detectors, the additional value accrued by exposing different technologies to the same physics is seen to be a better investment than exact duplication. Competition between collaborations also encourages scientists to optimize their analyses for sensitivity (and not significance), and to produce results quickly so as not to lose the race with competitors. High-profile discoveries, such as those of the top quark and the Higgs boson, are typically announced simultaneously by rival collaborations. This is not surprising at circular colliders, because each detector is delivered the same amount of collision data at any given time as each other detector on the collider.

Results are frequently interpreted in the context of other results obtained in similar but not identical processes, but for which the model explanation for one experiment’s result must have consequences for another experiment’s result. An example of this is the search for a light, sterile neutrino. The LSND and MiniBooNE collaborations observed excesses of ν_e events in beams dominantly composed of ν_µ, when compared to what was expected given what is known from three-flavor neutrino oscillation rates ([Aguilar-Arevalo et al., 2001, Aguilar-Arevalo et al., 2013, Aguilar-Arevalo et al., 2018]). A hypothesis to explain these data is that a fourth neutrino may exist which provides an oscillation path from ν_µ to ν_e.
We know from the LEP experiments' e+e− → Z lineshape measurements ([Schael et al., 2006]) that there are only three light neutrino species that interact with the Z boson. A fourth light neutrino must therefore be “sterile.” Nonetheless, in order to explain the LSND excess in this way, some ν_µ's must disappear as they oscillate into sterile neutrinos, while a few of these sterile neutrinos may oscillate back into ν_e. One can test this hypothesis by looking for ν_µ interactions in a ν_µ beam (any ν_µ beam, not necessarily LSND's) and seeing whether enough of them disappear as a function of the distance to the neutrino source divided by the neutrino energy. A recent combination of data from the MINOS, MINOS+, Daya Bay and Bugey experiments, which have measured this disappearance rate ([Adamson et al., 2020]), excludes parameter values consistent with the LSND result. While the newer experiments did not attempt to replicate the LSND experiment directly, they do provide interesting information. It remains to be seen whether the tensions in this field come from inadequate understanding of experimental effects, or from a more fundamental physical process, even if it is not a light, sterile neutrino.
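In the commonly used two-flavor approximation, the ν_µ survival probability depends on the distance-to-energy ratio L/E through P(ν_µ → ν_µ) = 1 − sin²(2θ) sin²(1.27 Δm² L/E), with Δm² in eV², L in km and E in GeV. A minimal sketch, using placeholder parameter values rather than any experiment's fit results:

    import numpy as np

    def numu_survival(L_over_E, sin2_2theta, dm2):
        """Two-flavor nu_mu survival probability.

        L_over_E    : distance/energy in km/GeV
        sin2_2theta : mixing amplitude sin^2(2*theta)
        dm2         : mass-squared splitting in eV^2
        """
        return 1.0 - sin2_2theta * np.sin(1.27 * dm2 * L_over_E) ** 2

    # Illustrative (placeholder) parameters in the eV^2 range probed by
    # sterile-neutrino fits; disappearance would show up as a dip in the
    # survival probability as a function of L/E.
    L_over_E = np.linspace(0.01, 2.0, 200)   # km/GeV
    prob = numu_survival(L_over_E, sin2_2theta=0.1, dm2=1.0)
    print("minimum survival probability:", prob.min())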
Large collaborations benefit from the availability of scientists with diverse experiences and points of view. All collaborators on the author list are given the opportunity to review each result that is published. While the number of papers published by each of the LHC collaborations is such that not every collaborator reads every paper, each collaborating institution is required to meet a quota of papers that are read and commented on by its members.

Results in preparation must pass through a lengthy, formal approval process before they can be presented outside of the collaboration. Working groups led by experienced physicists review each analysis by the members of the group and frequently point out flaws in logic, data handling, analysis and presentation. Before approval, a result must be fully presented to a working group, and a public note must be written. At this stage, questions are asked of the proponents, and some of these questions may require significant additional study to address. At a later date, the analysis must be presented again, and all questions and requests must be answered to the satisfaction of the group members. Only at this phase can a result be approved, though all figures and numbers must be labeled “Preliminary.”

Preliminary results sometimes do not have the final estimates of systematic uncertainties associated with them. Estimating systematic uncertainties is usually the most time-consuming aspect of analysis work, and it usually involves significant re-analysis or the generation of additional Monte Carlo samples to make model predictions corresponding to variations of each nuisance parameter. Even the list of all potential sources of systematic error may not be fully understood at the time a preliminary result is reviewed. If a result needs to be produced on a short timescale under these circumstances, systematic uncertainties are estimated conservatively, with the intention that further work will reduce their magnitude ([Barlow, 2002]).

There are negative consequences to overestimating systematic uncertainties. A set of measurements of the same physical quantity by several collaborations, all of which overestimate their uncertainties, will have a χ² that is smaller than expected, even if the sources of systematic error are different. More worrisome, however, is the possibility that a combination of results may assume a measurement is more sensitive to a particular nuisance parameter than it really is, due to an inflated or misclassified source of uncertainty. The measurement thus serves to constrain the nuisance parameter too strongly in the joint result, producing a final combined result with an underestimated uncertainty.
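The χ² deflation described above is easy to demonstrate with a toy simulation. In this minimal sketch (all numbers invented for illustration), several experiments measure the same quantity with Gaussian resolution but report uncertainties inflated by 50%; the χ² about the weighted average comes out well below the number of degrees of freedom.

    import numpy as np

    rng = np.random.default_rng(seed=2)
    mu_true, sigma_true, inflation = 10.0, 1.0, 1.5
    n_meas, n_trials = 8, 20_000

    chi2 = np.empty(n_trials)
    for i in range(n_trials):
        x = rng.normal(mu_true, sigma_true, size=n_meas)  # honest spread
        sigma_rep = sigma_true * inflation                # reported errors
        w = np.full(n_meas, 1.0 / sigma_rep**2)
        xbar = np.sum(w * x) / np.sum(w)                  # weighted average
        chi2[i] = np.sum(((x - xbar) / sigma_rep) ** 2)

    # With honest uncertainties the mean chi-squared would be n_meas - 1 = 7;
    # inflating every uncertainty by 1.5 suppresses it by a factor 1.5**2.
    print("mean chi2:", chi2.mean(), "vs.", (n_meas - 1) / inflation**2)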
After a preliminary result is released, the physicists who performed the analysis prepare a manuscript for publication. At this stage, and sometimes even during the preliminary result preparation stage, a committee of collaborators who have worked on similar topics but who are not directly involved in the particular result is set up to review the paper draft. Often the committee is involved at an early stage of writing the draft, and it meets regularly with the authors to improve the analysis and the presentation. All changes to the analysis must be approved by the working group specializing in the topic. Usually the review committee, rather than the analysis proponents, submits the manuscript for collaboration review. By the time a manuscript is released to the collaboration, the result has already been reviewed multiple times. The additional scrutiny from the large collaboration may uncover additional flaws, and the manuscript is re-released for a second collaboration review after the concerns raised in the first review have been addressed. The process is iterated until consensus is reached that the paper can be submitted for publication. This process often uncovers even tiny flaws, and it can take from several months to years to complete. High-profile results can be pushed through on accelerated timescales without compromising the integrity of the review, provided that the necessary effort can be directed towards them. Only in very rare instances is consensus not reached. In these cases, dissenting collaborators can request that their names be removed from the author list of that paper. A significant fraction of the collaboration refusing to sign a paper sends a strong signal to the analysis proponents and the readers of the article about the perceived validity of the results.

After a manuscript has been agreed upon and submitted to a journal, the editors use traditional blind peer review before publication. Referees who are known to be experts in the field, and who are frequently also members of rival collaborations, weigh in on the publication. Referees are not superhuman, though they do sometimes find issues with papers that thousands of authors may have missed. Sometimes a collaboration may fall into the trap of “group-think,” having repeated the same arguments to itself over and over again, so an independent check has as much value in EPP as in other fields. An independent review can also help improve the presentation of work that may not be clear to an outsider. Sometimes the root cause of the non-replicability of a result is merely inadequate or unclear documentation.
Large collaborations sometimes have a “statistics committee,” made up of collaborators who are experts on data analysis, inference, and the presentation of results. A good-sized committee has at least six members, and more are desirable. Statistical issues in analyses can be intricate, and they take some time to understand. Members of the committee sometimes disagree about issues with specific analyses. It is important for physicists who are embarking on a new analysis to consult with the collaboration's statistics committee, so that work is not steered in a direction that is only later found to be flawed under collaboration review. Frequently, the most challenging issues with an analysis relate to the treatment of systematic uncertainties. The enumeration of sources of uncertainty, their prior constraints, how to constrain them in situ with the data, and how to include their effects in the final results are common subjects that the statistics committee must address. Experimental collaborations typically must each have their own statistics committee, as results in preparation are usually confidential until a preliminary result is released, and review by members of other collaborations would spoil this confidentiality. Peer review generally does not spoil confidentiality because all or nearly all results are released as preliminary results first, and preprints are available for submitted manuscripts, thus establishing priority. Members of a collaboration may be wary of advice from members of competing collaborations, even if it is general statistical-methods advice. Members of one collaboration may prefer that their competitors treat their uncertainties more conservatively and thus appear to have a less reliable result. Even the fear of such bias in advice is enough to prevent the formation of joint statistics committees across collaborations.

Munafò and collaborators point out that independent methodological support committees have been very useful in clinical trials ([Munafò et al., 2017]). Particle physicists routinely reach out to statisticians, holding workshops titled PhyStat every couple of years ([Behnke et al., 2020]).

Often it is useful to combine results produced by competing collaborations, sometimes alongside the announcement of the separate results. In this case, the collaborations must agree on the exchange of data and appropriate methods of inferring results. The methods used are typically extensions of what the collaborations use to prepare their own results, and usually in a combination effort, members of each experiment perform the combination with their own methods and the results are compared for consistency. At this stage, mistakes in the creation or the exchange of digital artifacts may be exposed, and they must be addressed before the final results can be approved by all collaborations contributing to the combined results.
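As a minimal illustration of a combination (with invented numbers, and ignoring the systematic uncertainties and more sophisticated machinery that real combinations use), two counting experiments measuring the same signal strength µ can be combined by multiplying their Poisson likelihoods and maximizing:

    import numpy as np
    from scipy import stats, optimize

    # Hypothetical inputs from two experiments measuring the same signal
    # strength mu: observed counts, expected backgrounds, and the expected
    # number of signal events at mu = 1.
    obs = np.array([28, 41])
    bkg = np.array([20.0, 30.0])
    sig = np.array([5.0, 8.0])

    def neg_log_like(mu):
        """Negative log of the product of the two Poisson likelihoods."""
        return -np.sum(stats.poisson.logpmf(obs, bkg + mu * sig))

    fit = optimize.minimize_scalar(neg_log_like, bounds=(0.0, 10.0),
                                   method="bounded")
    print("combined best-fit mu:", round(fit.x, 3))

In practice, members of each collaboration would run their own implementation of the combination, and the outputs would be compared for consistency, as described above.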
Nearly all results in EPP are derived from counts of interactions in particle detectors. Each interaction has measurable properties and may differ in many ways from other interactions. These counts are typically binned in histograms, where each bin collects events with similar values of some observable quantity, like a reconstructed invariant mass. An example of such a histogram, showing the distribution of the reconstructed mass m_4ℓ in H → ZZ* → 4ℓ decays selected by the ATLAS collaboration ([Aad et al., 2020]), is given in Fig. 3. Event counts often are simply reported by themselves. Under imagined identical repetitions of the experiment, the event counts in each bin of each histogram are expected to be Poisson distributed, although the means are usually unknown or not perfectly known. The data provide an estimate of the Poisson mean, which is often directly related to a parameter of interest, such as an interaction probability. Data in EPP have been referred to as “marked Poisson” data, where the marks are the quantities measured by a particle detector, such as the energies, momenta, positions and angles of particles produced in collisions. The fact that all practitioners use the same underlying Poisson model for the data helps reproducibility and replication (a minimal sketch of such a binned Poisson likelihood appears below).

An advantage of particle physics analyses as compared with those in, for example, sociology is that elementary particles are more reliable than people. In selecting a sample of muons, we do not have to worry about their distribution in age, income, job satisfaction and ability at statistics. “Double-blind” analyses do not exist in particle physics. Furthermore, unlike humans, particles are relatively unaffected by environmental factors, and so these do not have to be taken into account in analyses. However, this is less true of the detectors. Their response to elementary particles and background noise depends on the ambient temperature and pressure, electrical noise, and their exposure to high doses of radiation from the beams in the accelerator. The raw data need to be corrected to take such variations into account. Thus calibrations of the detector responses have to be carried out at frequent intervals.

Another difference between particle physics and other areas of research is the way models are regarded. The basic model used for particle physics is the standard model, which contains the particles listed in Table 1 and the forces in Table 2 (apart from gravitation, which for most purposes can be neglected on a particle scale). This provides an excellent description of a large number of experimental distributions, but despite this it is not believed to be the ultimate description of Nature; for example, it does not explain dark matter or dark energy, and it has about 20 arbitrary parameters (such as particle masses) that have to be determined from experiment rather than being predicted by the theory. Thus we have a model that works very well, that despite our prejudices may even be the ultimate truth, but which we are constantly hoping to disprove. Most of our “search” experiments are seeking not merely to produce another verification of the SM, but are hoping to discover evidence for physics beyond the standard model (BSM). A convincing rejection of the SM would be a major discovery.
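Here is the minimal sketch of such a binned likelihood promised above (with invented bin contents, not the data of Fig. 3): each bin is treated as an independent Poisson count whose mean is the background prediction plus a signal strength µ times the expected signal shape.

    import numpy as np
    from scipy import stats

    # Invented five-bin histogram: expected background, expected signal
    # shape at mu = 1, and "observed" counts.
    background = np.array([40.0, 35.0, 30.0, 25.0, 20.0])
    signal     = np.array([0.5,  2.0,  5.0,  2.0,  0.5])
    observed   = np.array([42,   39,   38,   26,   19])

    def log_likelihood(mu):
        """Binned Poisson log-likelihood as a function of signal strength."""
        return np.sum(stats.poisson.logpmf(observed, background + mu * signal))

    for mu in (0.0, 1.0, 2.0):
        print(f"mu = {mu:.1f}: log L = {log_likelihood(mu):.2f}")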
Figure 3: A histogram showing event counts in bins of the reconstructed mass m_4ℓ for interactions selected in the H → ZZ* → 4ℓ decay mode in the ATLAS detector at √s = 13 TeV with 139 fb⁻¹ of data ([Aad et al., 2020]). The points with error bars show the numbers of observed interactions in each bin, and the shaded, stacked histograms show the model predictions. The red peak on the left corresponds to the well-known Z boson (an example of a “standard candle”), while the blue peak in the middle shows the prediction for the Higgs boson.

In contrast, in many other fields, the models are more ad hoc, in general do not provide very good detailed descriptions of the data, and almost no one believes in their ultimate truth. Approximate agreement between the data and the model is regarded as a successful outcome. On the other hand, a rejection of the model being used would merely result in it being replaced by a different ad hoc model.

Another peculiarity of EPP is the specialization of practitioners into theoretical and experimental categories. An important benefit of this division is that experimentalists almost never test theories that they themselves invented. Experimentalists are therefore usually not personally invested in the success or the failure of the models they are testing.

Furthermore, while there is only one true set (however incompletely known) of physical laws governing Nature, the set of speculative possibilities is limited only by physicists' imaginations. Theory and phenomenology preprints and publications abound in great numbers. Most theories are in fact not true, but generally, in order to be published, they must be consistent with existing data. Experimenters are well aware that most searches for new particles or interactions will come up empty-handed, even though the hope is that if one of them makes a discovery, then our understanding of fundamental physics will make a great stride forwards. This possibility makes all the null results worth it.

The LHC has been referred to as a “theory assassin,” owing to all of the null results excluding many speculative ideas. This process largely mitigates the “file-drawer” effect, as a null result excluding a published theoretical model is likely to be publishable and not ignored. The experimental tests must be accompanied by proof that they are sensitive to the predictions of the theories in question, however, before they are taken seriously for publication. Null results are therefore typically presented as upper limits on signal strengths. Theorists may counter a null result by predicting smaller signal strengths. Frequently, an iterative process is undertaken that progressively tightens the constraints on a model of new physics as more data are collected and analysis techniques are improved. Some models can never be fully ruled out because signal strengths could always be smaller. These, if interesting enough, are left as challenges to the next generation of experiments.

Examples of Non-Replicated Results in Particle Physics
The success of the quark model in explaining the spectroscopy of the many existing hadrons (and the absence of those that were forbidden by the model), as well as many features of the production processes for interactions, resulted in many searches for evidence of the existence of free quarks. Most of these used the fact that quarks have an electric charge of ±e/3 or ±2e/3. Experiments looked for quarks in cosmic rays, reactions at accelerators, the Sun, moon dust, meteorites, ocean sludge, mountain lava, oyster shells, etc., but almost all experiments yielded null results. Theorists accepted this as being due to the concept of “confinement”: quarks can exist only inside hadrons, but not as free particles.

However, in 1981, an experiment at Stanford reported a positive result ([Larue et al., 1981]). It involved levitating small niobium spheres and measuring their oscillations in an oscillating electric field; the amplitude is proportional to the charge on the ball. Of 39 measurements reported, 14 corresponded to the fractional charge of quarks. Although the word “quark” does not appear in the Stanford publication, this was possible evidence for their existence as free particles.

However, many analysis decisions had to be made in order to extract the charge on a ball from the raw measurements, e.g. whether or not to accept an experimental run, which corrections to apply for experimental features, etc. These decisions were made while looking at the possible result of the charge measurement. Luis Alvarez suggested that a form of blind analysis should be used (see Section 2.2). This involved the computer analyzing the data adding a random number onto the extracted charge visible to the physicists; this offset was subtracted from the result only after all necessary decisions about run acceptance and corrections had been made. The net result was that this experiment published no further results.
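A minimal sketch of this style of blinding follows (hypothetical code, not the Stanford group's actual procedure): the analysis software draws a secret offset once, every intermediate charge value seen by the analyzers includes it, and it is removed only after all selection and correction decisions have been frozen.

    import numpy as np

    # The offset is drawn once and deliberately kept out of the analyzers'
    # sight (a real implementation would hide it more carefully).
    _rng = np.random.default_rng()
    _hidden_offset = _rng.uniform(-0.5, 0.5)   # in units of the charge e

    def blinded_charge(raw_charge):
        """Charge value shown to analyzers while decisions are being made."""
        return raw_charge + _hidden_offset

    def unblind(blinded_value):
        """Called exactly once, after run acceptance and corrections are frozen."""
        return blinded_value - _hidden_offset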
In his review, “Pathological Science,” Stone collects several stories of experimental results that were later found to be wrong ([Stone, 2000]). In two of the cases, the Davis-Barnes effect and N-rays, clear defects in the experimental technique were uncovered during visits to the laboratories by independent experts. In the case of the split A2 resonance, several groups confirmed the false result but others did not, and at least one group that did see the splitting of the A2 reported adjusting the apparatus when the effect was not seen but not when it was seen. Franklin provides excellent commentary on some results that have been successfully replicated and some that have not ([Franklin, 2018]). Bailey has produced histograms of changes in measured values of particle properties expressed in terms of the reported uncertainties. While many repeated measurements of the same quantities are consistent, there is a long tail of highly discrepant results ([Bailey, 2017]).

More concerning than false results that are not replicated are false results that are replicated. Of course, these results can only be ascertained as false by further attempts at replication and/or the discovery of errors in the original results.

The search for a pentaquark is an interesting example of replication. Particles known as hadrons are divided into baryons and mesons. In the original quark model, baryons are composed of three quarks (and mesons of a quark and an antiquark); see Section 1.1. There was, however, no obvious reason why baryons could not be made of four quarks and an antiquark; these baryons would be pentaquark states. This would make available new types of baryons which could not be made from the simpler and more restrictive three-quark structure, and which could be identified by decay modes involving unconventional groupings of particles not accessible to three-quark baryons. Thus searches were made for these possible new particles.

In 2003, four experiments provided evidence suggesting the existence of one of these possibilities, known as the Θ+, with a mass around 1.54 GeV. The quoted significances were 4 to 5 σ. Indeed, national prizes were awarded to physicists involved in these experiments. In the next couple of years there were six more experiments quoting evidence in favor of its existence.

However, other studies, many with much higher event numbers than those with positive results, saw no evidence for the particle. Although most of these were not exact replications of the original positive ones, at least one was a continuation with much higher event numbers than the original study, and involved exactly the same reaction and the same beam energy; it failed to confirm the original result.

The net conclusion was that the Θ+ does not exist. Possible reasons for the apparently spurious early results include poor estimates of background; non-optimal methods of assessing significance; the effects of using non-blind methods for selecting the event sample and for the mass location of the Θ+; and unlucky statistical fluctuations. Hicks provides a detailed review of pentaquark search experiments and their methodologies ([Hicks, 2012]).

This topic is probably the one in which there were the most positive replications of an incorrect result. It demonstrates the care needed when taking a confirmatory replication as evidence that the analyses are correct, especially when the experiments involve smallish numbers of events.

The twist in the tale of this topic is that, more recently, pentaquark states have been observed by the LHCb experiment ([Aaij et al., 2015]). They are, however, much higher in mass than the Θ+, and have a different quark composition, so they are certainly not the same particle. More details of the interesting history of the search for pentaquarks and their eventual discovery and measurement can be found in the review on the subject by M. Karliner and T. Skwarnicki in the 2020 Review of Particle Physics ([Zyla et al., 2020]).

It is not only searches for new particles that can suffer from spurious replication; measured values of well-established particles and processes can also be affected. The Particle Data Group collects measurements of particle properties, averages them in cases of multiple measurements of the same quantity, and publishes these every two years ([Zyla et al., 2020]). One can see in the historical evolution of the averages that the error bars generally decrease over time and the differences between the measured values also decrease over time. There is considerable correlation from one average to the next, which is largely due to the same measurements contributing to multiple years' averages.
In order to see if there is an effect in which experimenters seek, consciously or not, to replicate earlier numbers without contradicting them, a meta-analysis was performed ([Klein and Roodman, 2005]) in which individual measurements of selected quantities were plotted as functions of time. Correlations are indeed visible even in these historical plots. Not all of the effects may be due to over-eagerness to replicate earlier work, because often shared sources of systematic error afflict multiple measurements.
In the last decade, a number of meta-analyses of published results in several scientific fields have uncovered a large fraction of results that were not replicated when tested ([National Academies of Sciences and Medicine, 2019]). Experimental particle physicists, when they heard the news of the “replication crisis” in other fields, felt the temptation to gloat a little, mainly because of the nature of their enterprise, the high standards applied to their results, and the tradition of publishing all results that have sensitivity to the effect under test without relying on the observed significance to determine whether or not to submit a manuscript. Indeed, many of the proposed solutions for the replication crisis have been, in some way or another, part of the culture of experimental particle physics for decades.

Particle physicists have long been cautioned about historical failures of even the most stringent checks and balances, and every new student is given examples of how well-meaning researchers can come to wrong conclusions because they misled themselves and therefore others. It is concern over repeating the mistakes of the past that justifies the rigor. That, and the fact that enormous amounts of time, money and effort go into particle physics experiments, make practitioners especially wary of producing wrong results due to relatively minor mistakes. While by no means do all results in experimental particle physics meet the most rigorous standards, the techniques used to make the majority of them the best that can be produced are held up as examples of good practice in science.

Disclosure Statement
The authors have no conflicts of interest to declare.
Acknowledgments
Work supported by the Fermi National Accelerator Laboratory, managed and operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes.
References

[Aaboud et al., 2018] Aaboud, M. et al. (2018). Search for electroweak production of supersymmetric particles in final states with two or three leptons at √s = 13 TeV with the ATLAS detector. Eur. Phys. J. C, 78(12):995.
[Aad et al., 2008] Aad, G. et al. (2008). The ATLAS Experiment at the CERN Large Hadron Collider. JINST, 3:S08003.
[Aad et al., 2020] Aad, G. et al. (2020). Measurements of the Higgs boson inclusive and differential fiducial cross sections in the 4ℓ decay channel at √s = 13 TeV.
[Aaij et al., 2015] Aaij, R. et al. (2015). Observation of J/ψp resonances consistent with pentaquark states in Λb → J/ψK−p decays. Phys. Rev. Lett., 115:072001.
[Adamson et al., 2020] Adamson, P. et al. (2020). Improved constraints on sterile neutrino mixing from disappearance searches in the MINOS, MINOS+, Daya Bay, and Bugey-3 experiments. Phys. Rev. Lett., 125(7):071801.
[Aguilar-Arevalo et al., 2001] Aguilar-Arevalo, A. et al. (2001). Evidence for neutrino oscillations from the observation of ν̄e appearance in a ν̄µ beam. Phys. Rev. D, 64:112007.
[Aguilar-Arevalo et al., 2013] Aguilar-Arevalo, A. et al. (2013). Improved search for ν̄µ → ν̄e oscillations in the MiniBooNE experiment. Phys. Rev. Lett., 110:161801.
[Aguilar-Arevalo et al., 2018] Aguilar-Arevalo, A. et al. (2018). Significant excess of electronlike events in the MiniBooNE short-baseline neutrino experiment. Phys. Rev. Lett., 121(22):221801.
[Aker et al., 2019] Aker, M. et al. (2019). Improved upper limit on the neutrino mass from a direct kinematic method by KATRIN. Phys. Rev. Lett., 123(22):221802.
[Bailey, 2017] Bailey, D. (2017). Not Normal: the uncertainties of scientific measurements. Royal Society Open Science, 4:160600.
[Barlow, 2002] Barlow, R. (2002). Systematic errors: facts and fictions. In Conference on Advanced Statistical Techniques in Particle Physics, pages 134–144.
[Behnke et al., 2020] Behnke, O., Cousins, R., Cowan, G., Cranmer, K., Junk, T., Kuusela, M., Lyons, L., and Wardle, N. (2020). PhyStat Workshop Series. https://espace.cern.ch/phystat/. Online; accessed 9 Sep 2020.
[Benjamini, 2020] Benjamini, Y. (2020). The replicability problems in science: it's not the p < 0.05 fault. Private communication.
[Chatrchyan et al., 2008] Chatrchyan, S. et al. (2008). The CMS Experiment at the CERN LHC. JINST, 3:S08004.
[Chen et al., 2019] Chen, X., Dallmeier-Tiessen, S., Dasler, R., Feger, S., Fokianos, P., Gonzalez, J. B., Hirvonsalo, H., Kousidis, D., Lavasa, A., Mele, S., Rodriguez, D. R., Šimko, T., Smith, T., Trisovic, A., Trzcinska, A., Tsanaktsidis, I., Zimmermann, M., Cranmer, K., Heinrich, L., Watts, G., Hildreth, M., Lloret Iglesias, L., Lassila-Perini, K., and Neubert, S. (2019). Open is not enough. Nature Physics, 15(2):113–119.
[Cowan et al., 2011] Cowan, G., Cranmer, K., Gross, E., and Vitells, O. (2011). Asymptotic formulae for likelihood-based tests of new physics. Eur. Phys. J. C, 71:1554. [Erratum: Eur. Phys. J. C 73, 2501 (2013)].
[De Salas et al., 2018] De Salas, P., Gariazzo, S., Mena, O., Ternes, C., and Tórtola, M. (2018). Neutrino mass ordering from oscillations and beyond: 2018 status and future prospects. Front. Astron. Space Sci., 5:36.
[Esteban et al., 2019] Esteban, I., Gonzalez-Garcia, M., Hernandez-Cabezudo, A., Maltoni, M., and Schwetz, T. (2019). Global analysis of three-flavour neutrino oscillations: synergies and tensions in the determination of θ23, δCP, and the mass ordering. JHEP, 01:106.
[Evans and Bryant, 2008] Evans, L. and Bryant, P. (2008). LHC Machine. JINST, 3:S08001.
[Feldman and Cousins, 1998] Feldman, G. J. and Cousins, R. D. (1998). A unified approach to the classical statistical analysis of small signals. Phys. Rev. D, 57:3873–3889.
[Franklin, 2013] Franklin, A. (2013). Shifting Standards: Experiments in Particle Physics in the Twentieth Century. University of Pittsburgh Press.
[Franklin, 2018] Franklin, A. (2018). Is It the Same Result: Replication in Physics. Morgan & Claypool Publishers.
[Gross and Vitells, 2010] Gross, E. and Vitells, O. (2010). Trial factors for the look elsewhere effect in high energy physics. Eur. Phys. J. C, 70:525–530.
[Heinrich and Lyons, 2007] Heinrich, J. and Lyons, L. (2007). Systematic errors. Ann. Rev. Nucl. Part. Sci., 57:145–169.
[Hicks, 2012] Hicks, K. H. (2012). On the conundrum of the pentaquark. Eur. Phys. J. H, 37:1–31.
[Ioannidis, 2005] Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8):e124.
[Junk, 1999] Junk, T. (1999). Confidence level computation for combining searches with small statistics. Nucl. Instrum. Meth. A, 434:435–443.
[Klein and Roodman, 2005] Klein, J. and Roodman, A. (2005). Blind analysis in nuclear and particle physics. Ann. Rev. Nucl. Part. Sci., 55:141–163.
[Larue et al., 1981] Larue, G., Phillips, J., and Fairbank, W. (1981). Observation of fractional charge of (1/3)e on matter. Phys. Rev. Lett., 46:967–970.
[Leek and Peng, 2015] Leek, J. and Peng, R. (2015). Statistics: P values are just the tip of the iceberg. Nature, 520:612.
[Mandelkern, 2002] Mandelkern, M. (2002). Setting confidence intervals for bounded parameters. Statist. Sci., 17(2):149–172.
[McShane et al., 2019] McShane, B. B., Gal, D., Gelman, A., Robert, C., and Tackett, J. L. (2019). Abandon statistical significance. The American Statistician, 73(sup1):235–245.
[Munafò et al., 2017] Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., and Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1):0021.
[National Academies of Sciences and Medicine, 2019] National Academies of Sciences, Engineering, and Medicine (2019). Reproducibility and Replicability in Science. The National Academies Press, Washington, DC.
[Read, 2002] Read, A. L. (2002). Presentation of search results: the CL(s) technique. J. Phys. G, 28:2693–2704.
[Schael et al., 2006] Schael, S. et al. (2006). Precision electroweak measurements on the Z resonance. Phys. Rept., 427:257–454.
[Sirunyan et al., 2020] Sirunyan, A. et al. (2020). Constraints on anomalous Higgs boson couplings to vector bosons and fermions in production and decay in the H → 4ℓ channel.
[Stone, 2000] Stone, S. (2000). Pathological Science. In Theoretical Advanced Study Institute in Elementary Particle Physics (TASI 2000): Flavor Physics for the Millennium, pages 557–575.
[Woolston, 2015] Woolston, C. (2015). Psychology journal bans P values. Nature, 519:9.
[Zyla et al., 2020] Zyla, P. et al. (2020). Review of Particle Physics. PTEP, 2020(8):083C01.