arXiv [stat.OT]

Frequentist Inference without Repeated Sampling
Paul Vos∗ & Don Holbert†
Department of Biostatistics, East Carolina University
June 21, 2019
Abstract
Frequentist inference typically is described in terms of hypothetical repeated sampling, but there are advantages to an interpretation that uses a single random sample. Contemporary examples are given that indicate probabilities for random phenomena are interpreted as classical probabilities, and this interpretation is applied to statistical inference using urn models. Both classical and limiting relative frequency interpretations can be used to communicate statistical inference, and the effectiveness of each is discussed. Recent descriptions of p-values, confidence intervals, and power are viewed through the lens of classical probability based on a single random sample from the population.

Keywords: classical probability, statistical ensemble, multiset, p-value, confidence interval.

∗ [email protected]
† [email protected]

1 Introduction

Frequentist inference appears to require hypothetical repeated sampling. Cox (2006, page 8) describes frequentist inference as follows:

    Arguments involving probability only via its (hypothetical) long-run frequency interpretation are called frequentist. That is, we define procedures for assessing evidence that are calibrated by how they would perform were they used repeatedly. In that sense they do not differ from other measuring instruments.

The entry "Frequency Interpretation in Probability and Statistical Inference" in the
Encyclopedia of Statistical Sciences (ESS) also restricts the interpretation to repeated trials:

    . . . ordinary people ... [and] many professional people, both statisticians and physicists, ... will confine themselves to probabilities only in connection with hypothetically repeated trials. (Sverdrup, 2006)

Without proper context these quotes could misrepresent these authors as concerned only with long-run behavior. Cox (2006) recognizes the importance of interpreting specific data:

    We intend, of course, that this long-run behavior is some assurance that with our particular data currently under analysis sound conclusions are drawn. This raises important issues of ensuring, as far as is feasible, the relevance of the long run to the specific instance.

We contend that results from a particular study can be described more effectively by allowing for a more flexible probability interpretation, one that allows probability to be interpreted as a limiting relative frequency or as a simple proportion.

Interpreting probabilities as proportions is the classical interpretation, but it has been dismissed as having limited utility. The entry "Foundations of Probability" in the
Encyclopedia of Biostatistics states:

    Though influential in the early development of the subject, and still valuable in calculations, the classical view fails because it is seldom applicable. (Lindley, 2005)

In fact, for understanding p-values in particular, and statistical inference in general, the classical view is often applicable. Probabilities viewed as proportions fit naturally in the context of statistical inference. Introductory texts use 'frequency' and 'relative frequency' interchangeably with 'count' and 'proportion', respectively. In a population, the proportion of individuals having a certain characteristic provides the same numerical value as the probability that a single randomly chosen individual will have that characteristic.

Requiring that frequentist inference include repeated trials is unnecessary in all, or nearly all, situations. Interpreting probabilities simply as proportions will allow frequentists to better communicate p-values and other inferential concepts. Furthermore, the classical interpretation protects against the issue raised by Cox that long-run behavior may not be relevant to a specific instance.

2 Contemporary Examples

To effectively communicate the p-value, and statistical inference in general, we should know how the term probability, when describing a random phenomenon, is understood by the general public. Examples from the statistical literature that interpret probability can seem contrived and do not represent what we observe when considering real-world examples. (See, for example, Johnson (1996), pages 22 and 23.)

2.1 ESS Example
The following example appears in the aforementioned ESS entry:

    A convict with a death sentence hanging over his head may have a chance of being pardoned. He is to make a choice between white and black and then draw a ball randomly from an urn containing 999 white balls and 1 black ball. If the color agrees with his choice he will be pardoned.

Instead of using the proportion of white balls in the urn to describe a single random selection, the convict considers an unspecified number of hypothetical drawings:

    The convict replies that he will choose white because . . . out of many hypothetical drawings he will in 99.9% of the trials be pardoned and in 0.1% of the trials be executed. . . . the convict . . . attaches 99.9% probability to the single trial about to be performed.

The article says the convict can attach a probability to a single trial because that probability

    is a very real thing to the convict and it is reliably estimated from past experiences concerning urn drawings.

It would seem we need to add the condition that the convict has sufficient experience with urn drawings. Even if that were true, we would expect him to be open to the equally likely interpretation that clearly applies to a random draw from an urn. There is no need for a history of "past experiences concerning urn drawings" or a hypothetical future where convicts are executed repeatedly.
2.2 Poker Example

The broadcast of the 2018 Final Table of the 49th No-Limit Hold'em main event held in Las Vegas (aired 13 July 2018 on ESPN's World Series of Poker) listed the player Cada as having a 14% chance of winning while his opponent Miles had an 86% chance. These probabilities were based on two cards held by Cada, two held by Miles, and four cards on the table. These cards were dealt after the deck was thoroughly shuffled so that each ordering of the 52 cards was equally likely, or at least treated as such. There is one more card to be dealt, and the announcer says that Cada has 6 outs: cards that would provide him with a better hand than Miles. There are 44 cards remaining, so the chance that Cada wins is 6/44 ≈ 14%.

2.3 Lottery Example

North Carolina, like many states, has a lottery where numbers are selected by having balls jumbled with shots of air in a confined transparent space. The Pick-3 game consists of three clear boxes, each with 10 balls labeled with the numerals 0, 1, ..., 9. These balls are jumbled for a few seconds and then one is allowed to come to the top. The jumbling is vigorous enough that each ball is assumed to be equally likely to come up. While there may have been some players who waited for there to be sufficient history of Pick-3 drawings before placing a bet, we are confident there are many who did not require such history and still understood the probability of winning.
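The proportions in these two examples are simple enough to check directly. A quick sketch, assuming a straight Pick-3 play that must match all three digits:

```python
from fractions import Fraction

# Poker: 6 outs among 44 equally likely unseen cards
p_cada = Fraction(6, 44)
print(float(p_cada))                 # 0.13636..., reported as 14%

# Pick-3: each of the 10 balls in a box is equally likely, and the three
# boxes are independent, so any specific three-digit play wins with
p_pick3 = Fraction(1, 10) ** 3
assert p_pick3 == Fraction(1, 1000)
```
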
2.4 Clinical Trial Example

The examples above each had a known sample space of equally likely outcomes, and this allowed for the calculation of the proportion that provided, under suitable randomization, the interpretation for probability. For statistical inference, simple random sampling from the population provides equally likely outcomes, so these probabilities can also be interpreted as proportions. However, unlike the previous examples, not all population values are known, so proportions cannot be calculated without specifying a model for these values.

Consider a trial of 60 participants in which 30 are assigned randomly to treatment A and the remainder to treatment B. For simplicity we take the response variable to be dichotomous with values favorable and unfavorable. The population is the 60 participants, and the value for each participant is the ordered pair indicating the outcome, favorable or unfavorable, under treatment A and under treatment B. Only one value of each pair is observed. Suppose the number responding favorably to A is 25 and to B is 17.

One way to compare the treatments is by testing the hypothesis that the two treatments have the same effect on each participant; that is, that the values are identical in each of the 60 outcome pairs. Under this hypothesis there would be exactly 42 favorable responses regardless of the treatment assignment. The population values consist of 42 favorable and 18 unfavorable outcomes. By chance, 25 of the 42 favorable outcomes were assigned to treatment A. Each possible assignment of 30 outcomes to A can be enumerated and the proportion where 25 or more are favorable can be calculated. This proportion is 0.0235. Likewise, the proportion of assignments with 25 or more favorable responses in group B is also 0.0235. The interpretation is as follows: 4.7% of all possible treatment assignments have a discrepancy between groups as great as or greater than the observed discrepancy of 25 versus 17.
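These proportions can be computed exactly by counting equally likely assignments; a sketch in Python, using only the counts stated in the example:

```python
from math import comb

# Under the null hypothesis the 60 participants contribute exactly
# 42 favorable and 18 unfavorable outcomes, however they are assigned.
total = comb(60, 30)                 # equally likely assignments of 30 to A

# Proportion of assignments putting 25 or more favorable outcomes in group A
p_one = sum(comb(42, k) * comb(18, 30 - k) for k in range(25, 31)) / total
print(round(p_one, 4))               # 0.0235

# By symmetry the proportion for group B is the same, giving 0.047 in all
print(round(2 * p_one, 3))           # 0.047
```

No hypothetical repetitions appear anywhere in the calculation; it is a count over a finite, enumerable collection of assignments.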
Because the actual assignment was done in a manner such that each possible assignment was equally likely, this proportion is the probability of an observation as extreme as or more extreme than 25 versus 17. That is, the p-value is 0.047, and its interpretation does not require that we consider hypothetical random assignments of subjects to treatments.

3 Scope and Focus

The limiting relative frequency interpretation and the classical interpretation each describe the same numerical probability. It is not a question of which is correct. Both are correct and both are available for describing statistical inferences. The pertinent question is which is more useful, and the answer involves two factors. Before considering these factors we make a distinction between the definition of probability and an interpretation thereof.
There generally is wider agreement on how a p-value is calculated, its operational definition, than on its interpretation. There are many incorrect descriptions of the p-value, but this does not mean there is only one correct way to interpret its meaning. (See Greenland et al. (2016) for a useful accounting of misinterpretations.)

The confusion surrounding frequentist inference involves interpretations of probability, not its definition. Probability is defined axiomatically as a set function whose domain consists of subsets of a set S, the sample space. When the sample space is finite, the domain can be the power set of S. When the sample space is infinite, the power set is replaced with a sigma field. The common interpretation for the infinite case involves extending the finite sample space interpretation using limits, in particular limiting relative frequencies. Another approach is to approximate an infinite model with one having a finite sample space, thereby allowing probability to be interpreted as a proportion. We follow the latter approach here.

Epistemologically, what we call an interpretation could be considered a definition, but our concerns here are more practical than philosophical. The equally likely definition/interpretation is not intended to cover every situation where one might use the term probability, but it is useful for much of statistical inference. Furthermore, the variety of settings for statistical inference means proper interpretation is more easily conveyed when the term probability is not restricted to only one interpretation. We categorize these settings using two dichotomous factors: scope and focus.
3.1 Scope

The utility of each interpretation will depend on the intended audience. In the poker example, if the audience is Cada, the player holding a specific hand, probability is more usefully described as it was on the broadcast: as a proportion of equally likely cards. More generally, for casino gambling, if the audience is the house, then probability is usefully described as a limiting relative frequency over an unspecified, but very large, number of hands.

The lottery example did not include an interpretation of probability. However, if the audience is a ticket holder, then clearly there is interest in a specific drawing and the probability is naturally described as a proportion. On the other hand, the Lottery Commission is more concerned with ongoing drawings, and so long-run frequencies are natural for this audience.

In the
ESS example, where the audience is the convict, the proportion of white balls and the notion of equally likely provide a simpler description than hypothetical repeated drawings that involve this or other convicts. The collection of future draws and consequent executions would be relevant to the state.

For the investigators of the clinical trial, or anyone interested in the particular outcome of the study, the proportion of randomizations resulting in a discrepancy as great as 25 versus 17 provides a simple interpretation for the p-value. For statisticians interested in calibrating how inference procedures such as Fisher's exact test "would perform were they used repeatedly," significance levels would be specified and probabilities would be described in terms of limiting relative frequencies of hypothetical repeated randomizations.

The common factor in comparing the potential audience in each of these examples is the scope, either specific or generic, to which the probability extends. For a specific outcome, be it a hand of cards that could determine whether a player continues in the tournament, a lottery draw for a ticket holder, a convict whose life depends on a single draw from an urn, or a physician wanting to assess the evidence from a single study for the merits of a specific treatment, a proportion provides the natural interpretation for the probability related to a single randomization.

The scope is generic when a specific outcome is viewed as part of a collection and probability describes this collection. For statisticians who are concerned with how their methods perform in general, it is natural for the scope to be generic.
However, results from a specific study will be communicated more effectively when statisticians recognize that the scope is specific for their audience.

Scope is related to Cox's distinction between "long-run behavior" and a "specific instance" but differs in that the collection of outcomes when the scope is generic need not be constructed in the long run. An interpretation of the confidence interval having generic scope that does not require repeated sampling is given below.

3.2 Focus

Scope applies to the interpretation of random phenomena whether or not they are used for inference. Focus is meaningful only in the context of statistical inference, where we are concerned with an unknown distribution of numerical values. We call this distribution, whether it be measurements on individuals in a population or values obtained from random phenomena, the population distribution, or simply the population when the context makes it clear that we are considering a distribution of numerical values rather than a collection of individuals.

Statistical inference proceeds by positing that a known distribution, the model, is the same as, or an approximation to, the unknown population distribution. While statistical inference is always concerned with the population distribution, some inference procedures address the population directly and others indirectly using one or more models for the population. That is, the focus of an inference procedure can be on the population or on a model.

The probability calculated for the clinical trial is a p-value, and the calculation of any p-value requires the specification of a model (determined by the null hypothesis along with other assumptions). Unless the population is the same as the model, it is difficult to interpret the p-value as directly describing the population.

On the other hand, probability used to describe confidence intervals can have as its focus either the population or a family of models for the population.
For the former, the interpretation of a 95% confidence interval for the mean, say 0.03 to 41.83, is that this interval was the result of an interval-generating procedure applied to the population that has the property that 95% of the intervals from this procedure contain the population mean. Since 95% describes the procedure and not the specific interval, the scope of this interpretation is generic and the focus is the population.

Fisher (1949, pages 190-191) provides the following interpretation:

    An alternative view of the matter is to consider that variation of the unknown parameter, µ, generates a continuum of hypotheses each of which might be regarded as a null hypothesis, which the experiment is capable of testing. In this case the data of the experiment, and the test of significance based upon them, have divided this continuum into two portions. One, a region in which µ lies between the limits 0.03 and 41.83, is accepted by the test of significance, in the sense that the values of µ within this region are not contradicted by the data, at the level of significance chosen. The remainder of the continuum, including all values of µ outside these limits, is rejected by the test of significance.

Here the focus is on a collection of models. The scope is specific because each model is assessed in terms of how extreme the specific data would be for that model.

Simply checking whether a parameter value is in the interval shortchanges the inferential value of the confidence interval. The endpoints serve as guideposts indicating which models are such that the data would be unlikely enough to elicit doubt regarding the model. For models having mean slightly less than 0.03 the p-value is slightly less than 0.05, and for models having mean slightly greater than 0.03 the p-value is slightly greater than 0.05. Similar comments hold for models with means near 41.83.
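The guidepost role of the endpoints can be made concrete. The sketch below assumes a normal model with known standard error implied by the 95% interval (these modeling assumptions are ours, for illustration only):

```python
from statistics import NormalDist

# The interval (0.03, 41.83) from the text, read under an assumed normal
# model with known standard error
lo, hi = 0.03, 41.83
xbar = (lo + hi) / 2                  # center of the interval
se = (hi - lo) / 2 / 1.959964         # SE implied by a 95% interval

def p_value(theta):
    """Two-sided p-value for the model with mean theta."""
    z = abs(xbar - theta) / se
    return 2 * (1 - NormalDist().cdf(z))

# Models with mean just inside an endpoint are not contradicted at the
# 0.05 level; models just outside are.
print(round(p_value(0.04), 4), round(p_value(0.02), 4))
assert p_value(0.04) > 0.05 > p_value(0.02)
```
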
4 Urn Models

Urn models are a conceptual construction that provides a convenient tool for describing inferential results in terms of classical probability. One should conceive of a bowl filled with N balls that are indistinguishable with regard to their possible selection but completely distinguishable in terms of at least one feature. This distinguishable feature is needed to count the balls. The urn model is an example of a multiset, which is like a set except that multiplicities are allowed. For sets, {a, b} ∪ {b, c} = {a, b, c}, while for urns, ⌊a, b⌋ ∪ ⌊b, c⌋ = ⌊a, b, b, c⌋. Unions and other basic set operations used below also hold for multisets.

A population can be described using the conceptual construction of an urn model. This model may be thought of as a bowl that contains one ball for each member of the population. For a variable of interest X, the population urn ⌊X⌋_pop is the bowl where the numerical value for each member is written on the corresponding ball. In most cases the values on the balls and the number of balls N are unknown. From the population urn we construct another urn ⌊X⌋^n_pop containing (N choose n) balls. Each sample of n balls taken from ⌊X⌋_pop is represented by one ball in ⌊X⌋^n_pop; this ball is labeled with the n-tuple of values obtained from the balls of the corresponding sample from ⌊X⌋_pop. The only restriction on n is that it is a positive integer not greater than N. Notationally, this conceptual construction is

    ⌊X⌋_pop  --C_n-->  ⌊X⌋^n_pop

where the arrow indicates an enumeration of all possible samples of n balls, so that the observed sample corresponds to a ball (x)_obs in ⌊X⌋^n_pop. (Sampling plans other than SRS would require a different enumeration.)

For inference regarding the population, a model is posited for ⌊X⌋_pop, and the urn for the model is written ⌊X⌋_θ because often there will be a set of models indexed by a parameter θ ∈ Θ.
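As a computational aside, a multiset and the enumeration C_n are easy to realize in code; a sketch with hypothetical population values, using Python's collections.Counter for urns and itertools.combinations for the enumeration:

```python
from collections import Counter
from itertools import combinations

# Sets discard multiplicity; Counters keep it, so Counter addition
# plays the role of the urn union (a,b) with (b,c) giving (a,b,b,c).
assert {'a', 'b'} | {'b', 'c'} == {'a', 'b', 'c'}
assert Counter(['a', 'b']) + Counter(['b', 'c']) == Counter(['a', 'b', 'b', 'c'])

# A toy population urn with N = 5 balls (values are made up)
pop = [0, 0, 1, 1, 1]
n = 2

# The enumeration builds the urn of samples: one ball per possible sample
# of n balls, labeled with the n-tuple of sampled values; C(5,2) = 10 balls.
pop_n = [tuple(pop[i] for i in idx)
         for idx in combinations(range(len(pop)), n)]
assert len(pop_n) == 10
```

Sampling by index rather than by value keeps balls with equal values distinguishable, matching the requirement that the balls be countable.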
To assess how well ⌊X⌋_θ approximates ⌊X⌋_pop, the observed sample (x)_obs is compared to the possible samples in the model urn ⌊X⌋^n_θ, where

    ⌊X⌋_θ  --C_n-->  ⌊X⌋^n_θ.                                        (1)

Unlike ⌊X⌋^n_pop, the n-tuples on all balls in ⌊X⌋^n_θ are known. (The number of balls in the model urn ⌊X⌋_θ need not equal the number in the population urn; the relevant features are proportions rather than counts.)

The samples in ⌊X⌋^n_θ are compared to the observed sample using a test statistic T_θ, a real-valued function on R^n. The value of the observed test statistic is t^obs_θ = T_θ((x)_obs). The plausibility of a specific model ⌊X⌋_{θo} as an approximation to ⌊X⌋_pop is assessed by comparing (x)_obs to the samples in ⌊X⌋^n_{θo}; specifically, by finding the proportion of balls whose test statistic value is greater than or equal to t^obs_{θo}. This proportion is written as

    Pr⌊T_{θo} ≥ t^obs_{θo}⌋^n_{θo}                                   (2)

where

    Pr⌊T ≥ t⌋^n_θ = |{b ∈ ⌊X⌋^n_θ : T(b) ≥ t}| / |⌊X⌋^n_θ|.          (3)

No randomizations were used to construct the model urn ⌊X⌋^n_{θo}. However, for the proportion in (2) to be meaningful as a probability, the observed sample must have been obtained using a simple random sample (SRS) from the population. Given this randomization, the proportion in (2) is the p-value for testing H: θ = θo using the test statistic T_{θo}.

A (1 − α)100% confidence interval for θ obtained from (x)_obs is found by allowing θo in (2) to range over all possible values for θ:

    C_α((x)_obs) = {θ : Pr⌊T_θ ≥ t^obs_θ⌋^n_θ ≥ α}.                  (4)

The interval in (4) represents all the models, indexed by θ, for which the observed data would not be among the most extreme proportion α as measured by T_θ.
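Because every ball in a model urn is known, the proportions in (2)-(4) can be computed by brute-force enumeration. The sketch below uses a hypothetical 0/1 model family (urns of N = 10 balls, with θ the proportion of 1s) and the discrepancy statistic T_θ(s) = |mean(s) − θ|; these particular choices are ours, for illustration:

```python
from itertools import combinations
from statistics import mean

def p_value(model_urn, x_obs, T):
    """Eq. (3): proportion of samples b in the model urn with T(b) >= T(x_obs)."""
    samples = list(combinations(model_urn, len(x_obs)))
    t_obs = T(x_obs)
    return sum(T(s) >= t_obs for s in samples) / len(samples)

x_obs = (1, 1, 1, 0)          # hypothetical observed SRS of size n = 4
alpha = 0.05

# Eq. (4): keep every theta whose p-value is at least alpha
conf_set = []
for k in range(11):
    theta = k / 10
    urn = [1] * k + [0] * (10 - k)               # model urn for this theta
    T = lambda s, th=theta: abs(mean(s) - th)    # discrepancy from the model
    if p_value(urn, x_obs, T) >= alpha:
        conf_set.append(theta)
print(conf_set)                # → [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
```

The confidence set is simply the collection of models not contradicted by the observed sample at level α, computed with no randomization beyond the one SRS that produced x_obs.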
Even though the confidence interval C_α((x)_obs) involves many models, there is still only one randomization required: the randomization used to obtain the data from the population. (This notation and interpretation allow generalizing to a confidence region.)

The procedural interpretation of the confidence interval can be described using an urn of confidence intervals

    ⌊X⌋^n_pop  ←→  ⌊C_α⌋^n_pop                                       (5)

where the urn on the right is obtained by letting (x)_obs in (4) range over all possible samples from the population.

The sampling urns for the population and for models are constructed using enumeration. In contrast, the limiting relative frequency interpretation involves the conceptual construction of an infinite sequence where each term in the sequence is obtained by a hypothetical random sample. Notationally,

    ⌊X⌋_pop  --SRS_n-->  (x)_1, (x)_2, . . .                         (6)

where (x)_i is the n-tuple obtained from the i-th hypothetical sample. Because these are random samples, another sequence

    ⌊X⌋_pop  --SRS_n-->  (x)'_1, (x)'_2, . . .                       (7)

could be used. The sequences in (6) and (7) are different but have the same limiting relative frequency.

The structure in random sampling that allows the calculation of probabilities is represented in the limit of an infinite sequence whose order is immaterial to describing this structure. In contrast, the enumeration used to construct ⌊X⌋^n_pop imposes no artificial ordering and describes the structure without infinite limits.

For models, limiting relative frequency could be described using a conceptual construction where ⌊X⌋_pop is replaced with ⌊X⌋_θ in (6).
While actual random samples from a model can be useful for calculations, hypothetical random samples are not required for interpretation since all samples are known. Furthermore, when hypothetical randomizations are used to interpret model probabilities, probabilities that are independent of the data, these can be confused with hypothetical randomizations from the population that are intimately connected with the data. (Section 6 provides an example.)

Random variables are used to model data, and if X_rv is a random variable, then the terminology suggests thinking of X_rv as generating a sequence of values through repeated randomization:

    X_rv  --SRS_n-->  (x)_1, (x)_2, . . . .                          (8)

We use the notation ⌊·⌋ to emphasize that the model is an aggregate of values rather than a generator of infinite random sequences. (Common notation for a random variable would be X, but we are using X to represent a finite collection of values.) When the aggregate is finite, the distribution of ⌊X⌋_θ is described by proportions having an integer denominator. When the aggregate is infinite, the distribution of ⌊X_rv⌋_θ is described by the proportion of areas under a curve.

Neither the definition nor the interpretation of a probability model requires randomization. Both the definition and the interpretation of frequentist inference require randomization, but this need not be imagined as belonging to a hypothetical repetition of randomizations. The randomization required is the one that produced the data that were obtained:

    ⌊X⌋_pop  --SRS_n-->  (x)_obs.

To recognize the importance of this randomization from the population, models are described using (1) rather than (8).
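The distinction matters in practice: random draws from a model are merely a computational shortcut for the enumerated proportion. A sketch reusing the clinical-trial null urn (the Monte Carlo details here are ours, not the paper's):

```python
import random
from math import comb

# Exact proportion from the enumerated model urn (the trial's one-sided p-value)
exact = sum(comb(42, k) * comb(18, 30 - k) for k in range(25, 31)) / comb(60, 30)

# Random samples *from the model* only approximate that proportion;
# no hypothetical randomization is needed for its interpretation.
random.seed(1)                     # arbitrary seed, for reproducibility
urn = [1] * 42 + [0] * 18          # null-model urn: 42 favorable, 18 not
reps = 20_000
hits = sum(sum(random.sample(urn, 30)) >= 25 for _ in range(reps))
print(round(exact, 4), round(hits / reps, 4))
```

The simulated frequency approaches the enumerated proportion, but the proportion itself, not the simulation, carries the interpretation.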
5 Confidence Intervals

The Fisher interpretation of the observed interval is naturally described without repeated sampling using C_α((x)_obs). The interpretation of a confidence interval as having been produced by a procedure is typically described using repeated sampling. Section 5.1 shows that, in fact, a single random sample can be used for the procedural interpretation. Section 5.2 compares the single random sample interpretations of C_α((x)_obs) and ⌊C_α⌋^n_pop. (For the infinite case noted above: if X_rv is continuous, the curve describing its distribution is the probability density function; if X_rv is discrete, proportions of lengths describe the distribution.)

5.1 The Urn ⌊C_α⌋^n_pop

Greenland et al. (2016) provide the following interpretation for the 95% confidence interval:

    . . . the 95% refers only to how often 95% confidence intervals computed from very many studies would contain the true effect if all the assumptions used to compute the intervals were correct.

It seems the word "only" is used to discourage other procedural interpretations, since earlier in their paper the observed confidence interval is described in terms of testing, which we understand to be Fisher's interpretation. Even if the word "only" applies just to the procedural interpretation, this statement is too strong. As the urn models show, this interpretation need not be described in terms of limiting relative frequency. When the family of models contains the true model, ⌊X⌋_pop = ⌊X⌋_{θ*} for some θ*, the urn ⌊C_.05⌋^n_pop defined by (5) has the property that 95% of these intervals contain the true effect, θ*. The proportion 0.95 is a probability when each interval in ⌊C_.05⌋^n_pop is given an equally likely chance of being selected; i.e., the observed data were obtained by an SRS from the population.
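The urn of intervals can be exhibited concretely for a toy population; the interval rule and values below are our own illustration, not the paper's. Coverage is then a single proportion over the urn of intervals, with no repetition of the study:

```python
from itertools import combinations
from statistics import mean, stdev

pop = [2, 4, 4, 6, 8, 12]          # hypothetical population urn
true_mean = mean(pop)
n = 3

# One interval per possible sample: a crude mean +/- 2*(s/sqrt(n)) rule
intervals = []
for s in combinations(pop, n):
    half = 2 * stdev(s) / n ** 0.5
    intervals.append((mean(s) - half, mean(s) + half))

# Coverage is a proportion over the C(6,3) = 20 balls of the interval urn;
# an SRS from the population makes it the probability of covering the mean.
coverage = sum(lo <= true_mean <= hi for lo, hi in intervals) / len(intervals)
print(len(intervals), round(coverage, 2))
```

Whatever the coverage of this crude rule turns out to be, it is computed by enumeration, considered all at once, rather than as a limit of repeated studies.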
The procedural interpretation of the confidence interval does not require the procedure to be repeated many times, just as understanding Cada's probability of winning did not require repeatedly shuffling the remaining poker cards.

This requirement of conceptualizing very many studies leads to an unnecessary criticism of a common (mis)interpretation of an observed confidence interval:

    There is a 95% chance that the population mean is between 0.03 and 41.83.

A standard response is "Either the mean is between these values or it is not. The values 0.03, 41.83, and the population are not random, so probability is not meaningful here." (For a recent version of this response see Anderson (2019).) This statement warrants caution rather than correction. To understand how this can be a reasonable interpretation, we consider a version of the North Carolina Pick-3 Lottery where a Statistics professor buys 1,000 Pick-3 tickets, one for each possible combination of three digits from 000 to 999. The tickets are partitioned so that 20 tickets are placed into each of 50 envelopes that are labeled with the names of the 50 students in her class. The drawing is on Wednesday, and at Tuesday's lecture the professor asks Bob what is the probability that his envelope has the winning ticket. Bob responds 1 in 50. The professor will distribute the envelopes at Thursday's lecture.

Before distributing the envelopes on Thursday, Bob is asked the same question and again gives the probability of 1 in 50. Should the professor correct Bob and say that either he has or has not won, and that probability no longer applies? We think not. It is still meaningful to say the probability for each student is 1 in 50.

However, the situation on Tuesday is different from that on Thursday, and recognizing this difference indicates the necessary caution. On Thursday, when the first envelope is opened, the probability for each of the remaining envelopes changes to 1 in 49 or to 0.
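The envelope probabilities follow from direct counting over equally likely tickets; a sketch using exact arithmetic:

```python
from fractions import Fraction

tickets_per_env, n_env = 20, 50
total = tickets_per_env * n_env      # all 1,000 combinations were bought,
                                     # so exactly one ticket wins

# Before any envelope is opened: 20 of 1000 equally likely tickets
p_before = Fraction(tickets_per_env, total)
assert p_before == Fraction(1, 50)

# After one envelope is opened and found to lose: 980 equally likely
# tickets remain, 20 of them in each unopened envelope
p_after = Fraction(tickets_per_env, total - tickets_per_env)
assert p_after == Fraction(1, 49)
```
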
If the envelopes had been distributed before the drawing, any envelope could be opened and the probability would remain 1 in 50.

The interpretation "There is a 95% chance that the population mean is between 0.03 and 41.83" is incorrect when there is additional information from the population (i.e., opened envelopes). In particular, this interpretation cannot be used when there are two observed confidence intervals from the same population, let alone the "very many studies" that the repeated sampling interpretation above requires. However, without additional information from the population, this statement provides a reasonable description of the information in the data concerning the population mean. Cox and Hinkley (1979, pages 227-228) also consider interpreting the observed interval in terms of probability reasonable, given appropriate cautions.
5.2 Comparing C_α((x)_obs) and ⌊C_α⌋^n_pop

In terms of scope and focus, the interpretations represented by C_α((x)_obs) and ⌊C_α⌋^n_pop are very different. The interval C_α((x)_obs) is specific to the data that were observed, and the focus is on a collection of models. The collection of intervals ⌊C_α⌋^n_pop is generic, and the focus is on the population.

The interpretations also differ in the assumptions that are required. The urn ⌊C_α⌋^n_pop cannot be constructed directly, since the population is unknown, but relies on the assumption that there is a model with parameter θ* such that ⌊X⌋_{θ*} is a close approximation to ⌊X⌋_pop. This assumption is not required for the interpretation represented by C_α((x)_obs).

Coverage probability and expected length apply to ⌊C_α⌋^n_pop but not to C_α((x)_obs). When intervals are defined with these two criteria in mind but without inverting a test, there is great flexibility in how individual intervals are chosen. As a result, observed intervals can have poor properties when interpreted in terms of testing. (This issue arises when the sample space is discrete and the intervals are considered too conservative in terms of coverage probability; see, for example, Vos and Hudson (2008).) To maintain fidelity to the Fisher interpretation, Vos and Hudson (2005) introduce the criteria p-confidence and p-bias, which apply to C_α((x)_obs).

6 P-values

Confidence intervals allow for an interpretation that is population focused. Interpreting p-values in terms of population focus can lead to problems. As an example, we consider the issue of potential comparisons raised by Gelman (2016), who claims

    . . . to compute a valid p-value you need to know what analyses would have been done had the data been different. Even if the researchers only did a single analysis of the data at hand, they well could've done other analyses had the data been different.

We cannot be certain of Gelman's interpretation of the p-value, but the proportion in (2) is a valid p-value and requires only a single random sample.
Gelman considers repeated sampling from the population, but the p-value is a probability that describes a model, not the population. Comments by Fisher (1959, page 44) apply here:

    In general tests of significance are based on hypothetical probabilities calculated from the null hypotheses. They do not generally lead to any probability statements about the real world, but to a rational and well-defined measure of reluctance to the acceptance of the hypotheses they test.

Certainly p-values can be misused, but Gelman's statement is too strong because it makes p-values invalid even when there has been no actual misuse. A potential misuse of a p-value, or of any inference procedure, does not invalidate a single instance of proper use. Consider the following example from Texas Hold'em poker. A gambler calculates the probability of making a specific hand based on the proportion of unseen cards. This calculation is done under the following conditions: he is well rested, sober, knows the dealer, and has no reason to suspect cheating. The result of this calculation is a valid probability. The gambler's wife might say that if he were to play too much poker, then he would become sleepy, drink too much, and gamble at shady establishments. Regarding the long-run outcome of his gambling, these are legitimate concerns that bring the validity (utility) of future probability calculations into question. However, these potentialities do not affect the gambler's specific calculation made under the actual conditions. The scope for the gambler is specific, while for his wife it is generic.

The reader might find differences between our example and the discussion of potential comparisons. Our hope is that we can agree that hypothetical long-run sampling is problematic when used to address a specific instance, and our point is that repeated sampling is not required to interpret inference for the data actually observed.
We have seen that confidence intervals and p-values can be interpreted using a single random sample. Power calculations are done before data have been collected and do not require any randomization or hypothetical repetitions. This is in contrast to how power is often discussed. For example, Greenland et al. (2016) describe power as a probability "defined over repetitions of the same study design and so is a frequency probability."

Power calculations are done by comparing the model specified by a null hypothesis to a competing model. The urn ⌊X⌋^n_o of the null model is compared to the urn ⌊X⌋^n of the competing model in terms of a test statistic T_o. Specifically, the significance level α defines a value t* such that

Pr⌊T_o ≥ t*⌋^n_o = α

and the power β is given by

Pr⌊T_o ≥ t*⌋^n = β.

Both α and β are proportions. The power is the proportion of all samples of size n from the competing model (posited as an approximation to the population) that are more extreme than t*. Power calculations based on random variables are conducted in the same way, but now proportions with integer denominator are generalized to proportions of area or a more general measure. These proportions are meaningful as probabilities and useful for inference regarding the population when the observed data are obtained by an actual randomization from the population. Hypothetical repetitions from the population or one of the models are not required.

Describing the observed confidence interval as having been obtained from a procedure is often the only interpretation that is considered, but there are authors who recognize Fisher's interpretation. Examples include Kempthorne and Folks (1971), who call Fisher's interpretation a consonance interval.

Statistical ensemble
According to the Wikipedia entry (2019),

. . . an ensemble (also statistical ensemble) is an idealization consisting of a large number of virtual copies (sometimes infinitely many) of a system, considered all at once, each of which represents a possible state that the real system might be in.

A single simple random sample of n individuals from a population creates a statistical ensemble where the possible states consist exactly of the possible samples of size n from the population.

The conceptualization of a statistical ensemble differs from repeated sampling in that a large number is considered all at once, and this idea avoids several pitfalls associated with repeated sampling. Repeated sampling and terms such as "long run" introduce the notion of time even though time is not included in the definition of probability. Adding to the confusion is that when the scope is generic, such as a statistician defining procedures in terms of "how they would perform were they used repeatedly", time fits naturally in that particular interpretation. Furthermore, repetition generates a sequence, and the order of this sequence has nothing to do with the structure of the collection, so the idea of independence is needed to appropriately describe a random sequence. By considering the collection all at once, whether it is balls in an urn or states of an ensemble, these complications are avoided. A statistical ensemble can be applied when the scope is generic or specific but is especially useful in the latter case.

Recognizing that the focus can be either the population or the model sheds light on the role of randomization in statistical inference. Randomly selecting data from a population is fundamental for making inferences about the population, and models are used to make inferences, but no randomizations from the model are required.
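The power calculation described earlier can likewise be carried out by enumerating each model's ensemble all at once. The following is a minimal sketch; the urn contents, sample size, and α are our own choices. Because the urns are discrete, exact equality Pr⌊T_o ≥ t*⌋ = α is generally unattainable, so this sketch takes the smallest t* whose null tail proportion does not exceed α.

```python
from fractions import Fraction
from itertools import combinations

def tail_proportion(urn, n, t):
    """Proportion of all size-n samples from the urn whose count of 1's is >= t."""
    samples = list(combinations(range(len(urn)), n))
    hits = sum(1 for s in samples if sum(urn[i] for i in s) >= t)
    return Fraction(hits, len(samples))

# Toy urns (our own choice): tickets labeled 0 or 1, statistic T_o = number of 1's.
null_urn      = [0] * 5 + [1] * 5    # null model: half the tickets are 1's
competing_urn = [0] * 2 + [1] * 8    # competing model: 80% of the tickets are 1's
n, alpha = 4, Fraction(1, 20)

# Smallest t* whose tail proportion under the null urn does not exceed alpha.
t_star = next(t for t in range(n + 1)
              if tail_proportion(null_urn, n, t) <= alpha)

# Power: the proportion of the competing urn's samples at least as extreme as t*.
beta = tail_proportion(competing_urn, n, t_star)
print(t_star, beta)   # 4 1/3
```

Both α and β emerge as exact proportions with integer denominators, obtained from collections considered all at once; no sequence of repeated draws, and hence no appeal to independence or to time, is needed.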
Hypothetical repeated randomizations may be introduced as a means to interpret the probability obtained from the model, but these hypothetical randomizations, and the consequent confusion with the required randomization from the population, can be avoided by using urn models or statistical ensembles.

References
Anderson, A. A. (2019). Assessing statistical results: Magnitude, precision, and model uncertainty. The American Statistician 73(sup1), 118–121.

Cox, D. and D. Hinkley (1979). Theoretical Statistics. Chapman and Hall/CRC.

Cox, D. R. (2006). Principles of Statistical Inference. Cambridge University Press.

Fisher, R. (1949). The Design of Experiments (5th ed.). New York: Hafner Publishing Company Inc.

Fisher, R. (1959). Statistical Methods and Scientific Inference (2nd ed.). Hopetoun Street, University of Edinburgh: T. and A. Constable Ltd.

Gelman, A. (2016). The problems with p-values are not just with p-values. The American Statistician 70(2), online.

Greenland, S., S. J. Senn, K. J. Rothman, J. B. Carlin, C. Poole, S. N. Goodman, and D. G. Altman (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31(4), 337–350.

Hacking, I. (1976). Logic of Statistical Inference. Cambridge, England; New York: Cambridge University Press.

Johnson, R. (1996). Statistics: Principles and Methods. New York: Wiley.

Kempthorne, O. and L. Folks (1971). Probability, Statistics, and Data Analysis. Iowa State University Press.

Lindley, D. (2005). Foundations of probability. In P. Armitage and T. Colton (Eds.), Encyclopedia of Biostatistics (2nd ed.), Volume 3, pp. 1993–2001. Hoboken, NJ: John Wiley & Sons.

Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press.

Sverdrup, E. (2006). Frequency interpretation in probability and statistical inference. In S. Kotz (Ed.), Encyclopedia of Statistical Sciences (2nd ed.), Volume 4, pp. 2530–2536. Hoboken, NJ: Wiley-Interscience.

Vos, P. W. and S. Hudson (2005). Evaluation criteria for discrete confidence intervals: Beyond coverage and length. The American Statistician 59(2), 137–142.

Vos, P. W. and S. Hudson (2008). Problems with binomial two-sided tests and the associated confidence intervals.