arXiv: [stat.OT] Jul

Frequentism-as-model

Christian Hennig, Dipartimento di Scienze Statistiche, Università di Bologna, Via delle Belle Arti, 41, 40126 [email protected] 14, 2020

Abstract:

Most statisticians are aware that probability models interpreted in a frequentist manner are not really true in objective reality, but only idealisations. I argue that this is often ignored when actually applying frequentist methods and interpreting the results, and that keeping up the awareness for the essential difference between reality and models can lead to a more appropriate use and interpretation of frequentist models and methods, called frequentism-as-model. This is elaborated showing connections to existing work, appreciating the special role of i.i.d. models and subject matter knowledge, giving an account of how and under what conditions models that are not true can be useful, giving detailed interpretations of tests and confidence intervals, confronting their implicit compatibility logic with the inverse probability logic of Bayesian inference, re-interpreting the role of model assumptions, appreciating robustness, and the role of “interpretative equivalence” of models. Epistemic (often referred to as Bayesian) probability shares the issue that its models are only idealisations and not really true for modelling reasoning about uncertainty, meaning that it does not have an essential advantage over frequentism, as is often claimed. Bayesian statistics can be combined with frequentism-as-model, leading to what Gelman and Hennig (2017) call “falsificationist Bayes”.

Key Words: foundations of statistics, Bayesian statistics, interpretational equivalence, compatibility logic, inverse probability logic, misspecification testing, stability, robustness

INTRODUCTION

The frequentist interpretation of probability and frequentist inference such as hypothesis tests and confidence intervals have been strongly criticised recently (e.g., Hajek (2009); Diaconis and Skyrms (2018); Wasserstein et al. (2019)). In applied statistics they are still in very widespread use, and in theoretical statistics the number of new works still appearing that are based on frequentist principles is probably not lower than that of any competing approach to statistics.

$\{N(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 \in \mathbb{R}^+\}$, and a single distribution with fixed values for $\mu$ and $\sigma^2$, will be called “model”. A key issue in Davies’s concept is whether the real data to be analysed can be distinguished, in a sense to be defined by the choice of an appropriate statistic, from typical data sets generated by the model. I will refer to this as “compatibility logic”. It does not require any postulation of “truth” for the model. Although Davies stated that his approach is neither frequentist nor Bayesian, I interpret it as frequentist in the sense that models are taken as data generators that can in principle produce an arbitrarily large number of data sets with resulting relative frequencies that roughly correspond to the modelled probabilities.

Whereas I mostly agree with Davies, my focus is more on interpretation and what we can learn from probabilistic modelling, and as opposed to Davies I can see an interpretation of Bayesian inverse probability logic that is in line with frequentism-as-model, see Section 5.2.

In Section 2 I define what I mean by frequentism-as-model. Section 3 concerns the connection between frequentism-as-model and reality, first summarising my general view on mathematical models and reality, then explaining how sta-

WHAT IS FREQUENTISM-AS-MODEL?

In Section 2.1, I will explain frequentism-as-model. In order to understand its implications, the role of i.i.d. models and subject matter information is discussed in Section 2.2.

Frequentism-as-model is an interpretation of probability, by which I mean a way to connect the mathematical definition of probability to our concept and perception of reality.

To make distinctions clearer, “traditional” frequentism is an interpretation of probability as well, and this does in my view not include what is often seen as methods of frequentist inference such as tests and confidence intervals. These are normally motivated assuming a frequentist interpretation of probability, but holding a frequentist interpretation of probability does not make using these methods mandatory. An alternative to frequentism is epistemic probability, interpreted as expressing the level of uncertainty of either an individual or a group about events; often logical/objective and subjective epistemic interpretations are distinguished. Epistemic probability is an interpretation of probability, whereas I understand Bayesian statistics as statistics involving Bayes’ theorem and reasoning based on prior and posterior probabilities. This is often, but not always, done based on an epistemic interpretation of probability, and is therefore not in itself an interpretation of probability. See Gillies (2000); Galavotti (2005) for overviews of interpretations of probability.

I do not think that the adoption of one interpretation of probability in one situation should preclude a statistician from using another interpretation in another situation. A statistician can be a frequentist when analysing a randomised trial to compare a new drug with a placebo, and adopt a logical epistemic approach when

$A$ refers to the limiting relative frequency for observing $A$ under idealised infinite repetition of a random experiment with possible outcome $A$. There are substantial problems with frequentism. It is hard to define what a “random experiment” is in reality. In the following I will generally use the term “experiment” for a situation that is modelled as potentially having several different outcomes to which probabilities are assigned. One of the central proponents of frequentism, von Mises (1939), tried to ensure the random character of sequences by requiring relative frequency limits to be constant under certain place selection rules for extracting a subsequence from the infinite sequence of experiments. His and other attempts have been criticised for the lack of connection of both the hypothesised limit and the definition of admissible sequences to observed relative frequencies in finite sequences, which are the only ones we can observe (see, e.g., Hajek (2009); Diaconis and Skyrms (2018)).

Von Mises and most other frequentists claim that their conception is applicable to experiments allowing for lots of (idealised infinitely many, idealised identical) repetitions, but in fact in standard frequentist probability modelling such repetitions are modelled as identically and independently distributed (i.i.d.), meaning that a whole sequence of repetitions is modelled by a single probability model, modelling probabilities not only for the outcomes of a single replicate, but for combinations of outcomes of some or all replicates. Central results of probability theory such as the laws of large numbers and the central limit theorem apply to such models. Applying the traditional frequentist interpretation of probability to them would require whole independent and potentially infinite sequences to be repeated, which of course is impossible.

Frequentism-as-model deals with these issues in two ways.
Firstly, it emphasises that probability is a model and as such fundamentally different from observed reality. Adopting frequentism-as-model means to think about an experiment or a situation as if it were generated by a frequentist process, temporarily, without implying that this really is the case. This means that it is not required to make a case that the experiment is really infinitely or even very often repeatable and will really lead to the implied relative frequencies. Particularly this means that all available observations can be modelled as i.i.d. without having to postulate that the whole sequence of observations can be infinitely repeated. Obviously, insisting on the fundamental difference between model and reality in this way raises the question how we can use such a model to learn something useful about reality, which I will discuss in Section 3.2. Von Mises and other frequentists already acknowledged that frequentist probabilities are idealisations. To me it seems, however, that in traditional frequentism this is only a response to critics, and that otherwise it does not have consequences regarding data analyses and their interpretations. Frequentism-as-model is different, as elaborated below.

as if generated by a certain model, not ruling out the possibility that it could be different (for which we may consider different models), and even taking this possibility into account when making further data analytic decisions, frequentism-as-model takes the fundamental difference between model and reality much more seriously into account than a standard frequentist analysis, in which a model is used and normally no longer questioned after some potential initial checks. This leads to interpretations of the final results that often seem to naively imply that the model is true, even if the researcher using such an approach may well admit, when asked explicitly, that reality actually differs from the model.

Thinking about reality as if it were generated by a certain probability model has implications. A critical discussion of these implications is a way to decide about whether or not to adopt a certain model, temporarily, and on top of this it bears the potential to learn about the real situation. If a sequence is thought of as generated by an i.i.d. process, it means that the individual experiments are treated as identical and independent. De Finetti made the point that whatever can be distinguished is not identical. Treating experiments as identical in frequentism-as-model does not mean that the researcher believes that they are really identical, and neither do they have to be really identical in order to license the application of such a model. Rather it implies that the differences between the individual experiments are temporarily ignored or, equivalently, treated as irrelevant; analogously, using an independence model does not mean that the researcher believes that experiments are really independent in reality, but rather that she assesses potential sources of dependence as irrelevant to what she wants to do and how she wants to interpret the result. Of course, depending on what kind and strength of dependence can be found in reality, this may be inappropriate. On such grounds, the model can be criticised based on subject-matter knowledge, and may be revised. This allows for a discussion about whether the model is appropriate, on top of possible checks against the data, which may have a low power detecting certain deviations from an i.i.d. model.
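The remark that checks against the data may have low power against deviations from an i.i.d. model can be illustrated by a small simulation. The following is only a sketch under assumptions that are not taken from the text: the dependent alternative is a stationary AR(1) process with coefficient 0.5, the check statistic is the lag-1 sample autocorrelation, and the critical value is calibrated by simulating i.i.d. Gaussian data. All function names and parameter values are hypothetical choices for illustration.

```python
import math
import random
import statistics


def lag1_autocorr(xs):
    """Lag-1 sample autocorrelation of a sequence."""
    m = statistics.fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den


def iid_sample(rng, n):
    """n independent standard normal observations (the 'as if' i.i.d. model)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def ar1_sample(rng, n, phi=0.5):
    """Stationary AR(1) sequence, positively dependent for phi > 0 (assumed alternative)."""
    x = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - phi ** 2))
    xs = [x]
    for _ in range(n - 1):
        x = phi * x + rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs


def power(n, reps=2000, level=0.05, seed=0):
    """Estimated probability that the check flags AR(1) data of length n."""
    rng = random.Random(seed)
    # Calibrate the one-sided critical value by simulating the i.i.d. model.
    null_stats = sorted(lag1_autocorr(iid_sample(rng, n)) for _ in range(reps))
    crit = null_stats[int((1.0 - level) * reps)]
    hits = sum(lag1_autocorr(ar1_sample(rng, n)) > crit for _ in range(reps))
    return hits / reps


power_small = power(8)    # few observations
power_large = power(200)  # many observations
print(f"n=8: {power_small:.2f}, n=200: {power_large:.2f}")
```

Under these assumptions the same dependence structure is flagged far less often with eight observations than with two hundred, so with small samples the discussion of dependence has to rest largely on subject-matter knowledge.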


$n = 8$, and $n = 2$ for classical homeopathy only, which the authors may use to justify their decision to not make a modelling difference between different modes of applying homeopathy, selected by Shang et al. (2005) by a study quality criterion. The power of the resulting test is therefore very low, and consequently the confidence interval that the authors give for the odds ratio is very wide. This argument applies to meta-analyses with a random effect generally; in the literature one can sometimes find the informal remark that such analyses require a large number of studies, see Kulinskaya et al. (2008). On top of that it could be discussed whether there may be reasons for assuming systematic bias over all studies, which in the model would amount to a nonzero expectation of the random effect.

Obviously $n = 8$ is not sufficient to check the i.i.d.-assumption for the random study effect at any reasonable power. Rather than having any connection to an observable “truth”, the random effect is a convenient modelling device where individual study effects cannot be ignored, but the implications above are usually ignored.

Given this, it looks very dubious that the Lancet (2005)’s issue editorial concluded: “Surely the time has passed for selective analyses, biased reports, or further investment in research to perpetuate the homeopathy versus allopathy debate.” Regarding the purely statistical evidence, the study itself could as well be used to argue that there are not yet enough studies.

In principle such discussions can also take place regarding the traditional frequentist use of models, but frequentism-as-model re-frames these discussions in a helpful way. It makes us aware of the implicit meaning of the model assumptions, which can be used to discuss them.
And it emphasises that what has to be decided is not the truth of the model, but rather a balance between the capacity of the model to enable learning from the existing observations about future observations or underlying mechanisms on one hand, and on the other hand taking into account all kinds of peculiarities that may be present in the real situation of interest, but that may be hard to incorporate in such a model - even though they may have

think of the situation as hypothetically repeatable, and of certain outcomes having a certain tendency, or, in standard philosophical terms, propensity to happen, which would materialise in case of repetition, but not in reality.

Therefore frequentism-as-model does not bar single-case probabilities as von Mises’s flavour of frequentism does. But it is more difficult for the person who puts up such a model to convincingly justify it compared to a situation in which there are what is interpreted as “replicates”. Convincing justification is central due to the central role of communication and agreement for science in the philosophy outlined above. Whatever the model is used for, it must be taken into account that hypothesised values of the single-case probability cannot be checked against data. This changes if the experiment is embedded in a set of experiments that are not directly seen as repetitions of each other, but for which an assumption is made that their results are put together from systematic components and an error term, and the latter is interpreted as i.i.d. repetition. This is actually done in standard frequentist regression analyses, in which the $x$ is often assumed fixed. For a given set of explanatory variables $x$ a distribution for the response $y$ is implied, despite the fact that the random experiment can in reality not be repeated for any given $x$ in situations in which the researcher cannot control the $x$.
Insisting on repeatability for the existence of traditional frequentist probabilities would imply that nothing real corresponds to the distribution of $y$ for fixed $x$, and there is no way to check any probability model. The unobservable i.i.d. error term constructs repeatability artificially, but I have never seen this mentioned anywhere. From a frequentism-as-model perspective this needs to be acknowledged, and depending on the situation it may be seen as appropriate or inappropriate, once more using information from the data and about the subject matter background, but there is nothing essentially wrong or particularly suspicious about it, apart from the fact that it turns out that our possibilities to test model assumptions are quite generally more limited than many might think, see Section 4.3.

FREQUENTISM-AS-MODEL AND REALITY

Given the separation between reality and model in frequentism-as-model, using such methods to address real problems requires justification. Before this is addressed in Section 3.2, I will summarise my general ideas about how mathematical models relate to reality in Section 3.1.

I wrote Hennig (2010) mainly out of the conviction that the debate about the foundations of statistics often suffers from the lack of a general account of mathematical models and their relation to reality.

On the frequentist side often a rather naive connection between the models and reality is postulated, which is then often criticised by advocates of epistemic probability, implying that because this does not work, probability should be about human uncertainty rather than directly about the reality of interest. But modelling human uncertainty is still modelling, and similarly naive ideas of how such models relate to “real” human uncertainty are problematic in similar ways. More about this in Section 5.1.

In Hennig (2010) I have argued that mathematical models are thought constructs and means of communication about reality. I have distinguished “observer-independent (or objective) reality”, “personal reality” (the view of reality constructed by an individual observer and basis for their actions), and “social reality” (the view of reality constructed by communication between observers). I have argued from a constructivist perspective that implies that observer-independent reality is not accessible directly for human observers, and that, if we want to talk about reality in a meaningful way, it makes sense to refer to personal and social reality, which are accessible to the individual and to a social group, respectively, rather than only to observer-independent reality. Observer-independent reality manifests itself primarily in the generally shared experience that neither an individual observer nor a social system can construct their reality however they want; we all face resistance from reality. This resonates with Chang (2012)’s “Active Scientific Realism”: “I take reality as whatever is not subject to one’s will, and knowledge as an ability to act without being frustrated by resistance from reality.
This perspective allows an optimistic rendition of the pessimistic induction, which celebrates the fact that we can be successful in science without even knowing the truth. The standard realist argument from success to truth is shown to be ill-defined and flawed.”

In Hennig (2010) I interpret mathematical models as a particular form of social reality, based on the idea that mathematics is a particular form of communication that aims at being free of ambiguity, and that enables absolute agreement by means of proofs. Science is seen as a social endeavour aiming at general agreement about aspects of the world, for which unambiguous mathematical communication should be obviously useful. This comes at the price that mathematics needs to unify potentially diverging personal and social observations and constructions. There is no guarantee that by doing so mathematical models come closer to observer-independent reality; in fact, this view sees the domain of mathematics as distinct from the domain of observer-independent reality. The connection is that the constructs of personal and social reality are affected by “perturbation” (or, as Chang puts it, “resistance”) from observer-independent reality, and it can be hoped that such perturbation, if observed in sufficiently related ways by different individuals


From Section 3.1 it follows that the separation between reality and model is a general feature of mathematical models, not an exclusive one of frequentism-as-model, although frequentism-as-model acknowledges it explicitly. The question “how is frequentism-as-model useful, and how is it connected to reality?” is therefore connected with the general question how mathematical models can be useful, even if “wrong” in Box’s terms. The way to use mathematical models is obviously to take the mathematical objects that are involved in the model as representations of either perceived or theoretically implied aspects of reality; and then to use this to interpret the results of using the models.

In order for the model to be useful, does not reality have to be “somehow” like the model? According to the constructivist view in Section 3.1, reality is accessible only through personal perception and communication, so the best we can hope for is that the model can correspond to how we perceive, communicate, and think about reality. Regarding frequentism-as-model, this concerns in the first place the concept of i.i.d. repetition, see Section 2.2. As follows from there, i.i.d. repetition is a thought construct and not an objective reality; however, even apart from probability modelling it is easy to see how much of our world-view even before any science relies on the idea of repetition, be it the cycles of day and night and the seasons, or be it the expectation, out of experience, of roughly the same behaviour or even roughly the same distribution of behaviours when observing any kind of recurring process as diverse as what a baby needs to do to attract the attention of the parents, look, size, and edibility of plants of the same species, or any specific industrial production process. Objections against the truth of “i.i.d.” can be easily constructed for observations in all of these examples, yet analysing them in terms of i.i.d. models often gives us insight, new ideas, and practical

(1) Is the model compatible with the data? I will discuss this in more detail in Section 4.

(2) How does the model relate to subject matter knowledge? The model should be informed by perceptions and ideas (subject matter knowledge, which is not necessarily “objective”) regarding the real process to be modelled, such as specific reasons for dependence, non-identity, or also for or against particular distributional shapes such as symmetry. Just to give an example, dependence by having the same teacher and communication between students makes the use of an i.i.d. model for analysing student results for students from the same class or even school suspicious, and it can be controversially discussed to what extent introducing dependence only in form of an additive random effect addresses this appropriately.

(3) What is the aim of modelling? Probability models are used in different


$k$-means clustering, which directly measures the quality of the approximation of any data point by the closest cluster centroid (more generally, often loss functions can be constructed that formalise the aim of analysis). If achieving a good performance in this respect is the aim of analysis (for example if clustering is used for data set size reduction before some other analysis, and all clustered objects are replaced by the cluster centroid in a final analysis), $k$-means clustering is a suitable method even in situations that are very different from the model for which $k$-means is maximum likelihood, namely normal distributions with same spherical covariance matrix in all clusters but different cluster means. Statisticians have argued that $k$-

(4) How stable are conclusions against modelling differently?

The ultimate answer to the question how we can be sure that a model is appropriate for the situation we want to analyse is that we cannot be. Breiman (2001) mentioned the “multiplicity of good models” in line with Tukey (1997); Davies (2014), and different “good” models may lead to contradicting conclusions. On top of that, research hypotheses may be operationalised in different ways, using different measurements, different data collection and so on. There is no way to be sure of a scientific claim backed up by a statistical result based on the naive idea that the model is true and the method is optimal. In order to arrive at reliable conclusions, science will need to establish the stability of the conclusions against different ways of operationalising the problem. This is basically the same way in which we as individual human beings arrive at a stable concept of the world, as far as it concerns us; after lots of experiences, looking at something from different angles in different situations, using different senses, communicating with others about it etc., we start to rely on our “understanding” of something. In the same way, every single analysis only gives us a very restricted view of a problem, and does not suffice to secure stability, see the discussion of “interpretative hypotheses” in Section 4.1. Currently there is talk of a “reproducibility crisis” in a number of disciplines (Fidler and Wilcox (2018)). I think that a major reason for this is that no single analysis can establish stability, which is all too conveniently ignored, given that researchers and their funders like to tout big meaningful results with limited effort. Even if researchers who try to reproduce other researchers’ work have the intention to take the very same steps of analysis described in the original work that they try to reproduce, more often than not there are subtle differences that were not reported, like for example data dependent selection of a methodology that the reproducer then uses unconditionally.
Any single analysis will depend on researchers’ decisions that are rarely fully documented, could often be replaced by other decisions that are not in any way obviously worse, and may have a strong impact on the result. Having something confirmed by as large a number as possible of analyses investigating the same research hypothesis of interest in different ways is the royal road to increase the reliability of scientific results. In this I agree with the spirit of Mayo (2018)’s “severity”, and it also resonates with her “piecemeal testing”; a scientific claim is the more reliable, the harder researchers have tried to falsify it and failed. Frequentism-as-model surely does not license claims that anything substantial about the world can be proved by rejecting a single straw man null hypothesis.

FREQUENTISM-AS-MODEL FOR STATISTICAL INFERENCE

Adopting frequentism-as-model as interpretation of probability does not imply that classical methods of frequentist inference, tests, confidence intervals, or estimators have to be used. It may also be combined with Bayesian inference, see Section 5.2. However, the classical methods of statistical inference have valid frequentism-as-model interpretations, and I believe that understanding these interpretations properly will help to apply the methods in a meaningful and useful way. Therefore I disagree with recent calls to abandon significance testing or even frequentist inference as a whole (Wasserstein et al. (2019)). In Section 4.1 I introduce the key concept of “interpretational equivalence”. Section 4.2 is about the frequentism-as-model interpretation of these methods. Section 4.3 is devoted to the issue of model assumptions for frequentist inference, and how these are important given that they cannot literally be fulfilled.

The question how stable our conclusions are as raised in Section 3.2 does not only depend on models, methods, and the data, but also on the conclusions that are drawn from them, i.e., the interpretation of the results by the researcher. A helpful concept to clarify things is “interpretational equivalence”. I call two models “interpretationally equivalent” with respect to a researcher’s aim if her subject-matter interpretation of what she is interested in is the same regardless of which of the two models is true. For example, if a researcher is interested in testing whether a certain treatment improves blood pressure based on paired data before and after treatment, she will in all likelihood draw the same conclusion, namely that the treatment overall does not change the blood pressure, if the distribution of differences between after and before treatment is symmetric about zero, be it a Gaussian or a $t$-distribution, for example. For her research aim, the precise shape of the distribution is irrelevant. Comparing two models with expectation zero, one of which is symmetric whereas the other one is not, this is not so clear; the treatment changes the shape of the distribution, and whether this can be seen as an improvement may depend on the precise nature of the change. For example, consider the model

$$0.99\, N(\mu^*, \sigma^2) + 0.01\, \delta_{x_0}, \qquad (1)$$

$\delta_x$ being the one-point distribution in $x$. The researcher may think of the $\delta_{x_0}$-distribution as modelling erroneous observations, in which case $\mu^* = 0$ makes the model interpretationally equivalent to $N(0, \sigma^2)$, whereas for $\mu^* = -x_0/99 < 0$, despite expectation zero, the researcher will interpret an average effect of lowering the blood pressure, which the model implies for a subpopulation of 99% of people.

The importance of interpretational equivalence is that it allows us to discuss what deviations from a temporarily assumed model should be handled in which way.
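To make the role of the two components in model (1) concrete, the following minimal sketch draws from such a mixture and compares the overall mean with summaries of the 99% bulk. It is an illustration under assumptions, not part of the original analysis: the point mass is placed at the hypothetical location `x0 = 100`, $\sigma = 1$ is chosen arbitrarily, and `mu_star` is set so that the mixture has expectation zero.

```python
import random
import statistics


def mixture_sample(rng, n, mu_star, sigma, x0, eps=0.01):
    """Draw n values from (1 - eps) * N(mu_star, sigma^2) + eps * delta_{x0}."""
    return [x0 if rng.random() < eps else rng.gauss(mu_star, sigma)
            for _ in range(n)]


x0 = 100.0                       # hypothetical location of the point mass
eps = 0.01                       # weight of the point mass
mu_star = -eps * x0 / (1 - eps)  # chosen so that the mixture expectation is zero

rng = random.Random(0)
xs = mixture_sample(rng, 200_000, mu_star, sigma=1.0, x0=x0)

overall_mean = statistics.fmean(xs)                       # close to zero
bulk_mean = statistics.fmean([x for x in xs if x != x0])  # close to mu_star
median = statistics.median(xs)                            # follows the bulk

print(f"mu* = {mu_star:.4f}")
print(f"overall mean = {overall_mean:.3f}, bulk mean = {bulk_mean:.3f}, median = {median:.3f}")
```

In this sketch the mean reflects the expectation of the whole mixture, whereas the median stays close to $\mu^*$. Which of the two corresponds to the quantity of interest is a matter of interpretation and not something the data can decide.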
If conclusions are drawn from a method involving certain model assumptions, it would be desirable, and conclusions can be seen as stable, if the probability would

$\mu = 0$ null hypothesis, a model with $\mu \neq 0$ but a very small absolute value will lead to a rejection of $\mu = 0$ with a large probability for large $n$. This is a problem if and only if models with, say, $|\mu| < \epsilon$, $\epsilon > 0$, are interpretationally equivalent to $\mu = 0$, in other words, if a difference as small as $\epsilon$ is considered substantially irrelevant. If this can be specified, the question whether values of $\mu$ with $|\mu| < \epsilon$ are in the corresponding confidence interval, or what severity the test achieves ruling out $|\mu| < \epsilon$, see Mayo (2018), is more relevant than whether a test of $\mu = 0$ is significant.

Being concerned about interpretational equivalence of models is an entry point for robust methods, which are designed for dealing with the possibility that a nominal model is violated in ways that are hard to diagnose but may make a difference regarding analysis. Robustness theory is about limiting changes in results under small changes of the modelled processes. An issue with standard robustness theory is that it is often implied that “contamination” of a distribution should not have an influence on the results, whereas in practice the contamination may actually be meaningful. To decide this is a matter of assessing interpretational equivalence. Frequentism-as-model acknowledges that desirable behaviour of a statistical method is not just something “objective” but depends on how we interpret and assign meaning to differences between distributions. Robustness is helpful where the reduced sensitivity of robust methods to certain changes in the data or model takes effect where the changes are indeed interpreted as meaningless relative to the aim of analysis; for example there needs to be a decision whether in model (1) what corresponds to the quantity of interest is rather the mean of the distribution, or rather the $\mu^*$.
In some applications it is inappropriate to use a method whose results may not change much when replacing up to half of the data, namely where the resulting processes are considered as interpretationally very different, but in any case comparing the results of such a method with a more “sensitive” non-robust one may allow for a more differentiated perception of what is going on than does every single method on its own.

Statistical hypothesis tests have become very controversial (e.g., Wasserstein et al. (2019)), but the interpretation of tests from the position of frequentism-as-model is rather straightforward. They address to what extent models are compatible with the data in the sense formalised by the test statistic. For simplicity, I mostly call

$p$-values. Obviously a non-rejection cannot be an indication that the model is in fact true, but it means the absence of evidence in the data that reality is different in the way implied by the test statistic, making it impossible to claim such evidence. Sometimes tests, particularly two-sided tests, are criticised because “point null hypotheses are usually scientifically implausible and hence only a straw man” (e.g., Rice et al. (2020)), but then a parametric model allowing for any parameter value cannot be “really true” anyway, so that for the same reason for which the point null hypothesis ($H_0$) is “implausible”, the whole parametric family is implausible as well, but that does not stop inference about it from being informative and useful, see Section 3.2. A test can neither confirm $H_0$ as true, nor license the inference of any specific alternative in case of rejection. A two-sided test should be run if compatibility of the data with $H_0$ is a possibility of interest; even then it may not be the method of choice if the sample size is too large, see Section 4.1.

Often significance tests are presented implying that a rejection of the $H_0$ is meaningful whereas a non-rejection is not. Above I have explained the meaning of a non-rejection in frequentism-as-model; the meaning of rejection is a more complex issue. Obviously, rejection of the $H_0$ does not imply that there is evidence in favour of any particular model that is not part of the $H_0$. On the positive side, a rejection of the $H_0$ gives information about the “direction” of deviation from the $H_0$, which can and mostly should be substantiated using confidence intervals as sets of compatible parameter values, and particularly data visualisation for exploring how this plays out without relying on the specified model.

Consider a one-sample t-test of the $H_0: \mu = 0$ regarding $N(\mu, \sigma^2)$ against $\mu > 0$. The test statistic $T$ is the difference between the sample mean and the hypothesised value $\mu = 0$ standardised by the sample standard deviation. Sticking to model-based thinking while not implying that $N(\mu, \sigma^2)$ is true, rejection of the $H_0$ can be interpreted as providing evidence in favour of any distribution of the two samples for which a larger value of $T$ is observed with larger probability than under $H_0$, compared with the $H_0$. We do not only have evidence against $H_0$; we also learn that the problem with $H_0$ is that $\mu = 0$ is most likely too low, which is informative.

In many situations in which the one-sample t-test is used, the user is interested in testing $\mu = 0$ or rather its interpretational meaning, but not in the distributional shape $N(\mu, \sigma^2)$, which just enables her to run a test of $\mu = 0$. If in fact there were a true distribution, which of course we can consider, in the “as if”-model world, which isn’t $N(\mu, \sigma^2)$ with any specific $\mu$ or $\sigma^2$, it is not clear in a straightforward manner what would correspond to $\mu$. It may be the expected value, but in a model such as (1) it may well be seen as the $\mu^*$ that governs 99% of the observations. In any case, there are distributions that could be considered as interpretationally equivalent to the actual $H_0$. Any distribution symmetric about zero is in most applications of t-tests interpretationally equivalent to the $H_0$, meaning that if indeed a distribution different from $N(\mu, \sigma^2)$ but still symmetric about $\mu = 0$ were true, it would be desirable that the probability to reject $H_0$ would be as low as if $H_0$ were true.

One can then ask how likely it is to reject the $H_0$ under a model that is not formally part of the $H_0$, but interpretationally equivalent.
For the one-sample t-test, assuming existing variances, many distributions interpretationally equivalent to $\mathcal{N}(\mu, \sigma^2)$ are asymptotically equivalent as well. Results in Cressie (1980) and some simulations of my own indicate that it is very hard, if not impossible, even in finite samples to generate an error probability of more than 7% for rejecting an interpretationally true $H_0$ at a nominal level of 5%, even if the underlying distribution is skew and $\mu$ is taken to be the expected value. This means that testing at a slightly smaller nominal level than the maximum error probability that we want to achieve will still allow for the intended interpretation.

But the situation becomes much worse under some other deviations from the model assumptions, particularly positive dependence between observations, which increases the variation of $T$ and may very easily lead to significant $T$ and an increased type I error probability, also in situations that are interpretationally equivalent to $H_0$. Consider an example in which the expected change of the turnover generated by a salesperson after attending a sales seminar is of interest in order to evaluate the quality of the instructor. If all salespersons in the sample attend the sales seminar together, they may learn something useful from talking with each other about sales strategies, even if what the instructor does is itself useless.

What is required in order to achieve a reliable conclusion is just what was discussed in Section 3.2. The general question to address is whether a rejection of $H_0$ could have been caused by something interpretationally equivalent to $H_0$, in which case the conclusion could not be relied upon.
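The effect of positive dependence on the type I error probability can be illustrated by a small Monte Carlo sketch. It is not a reproduction of Cressie (1980) or of the simulations mentioned above. The stationary Gaussian AR(1) process, the sample size $n = 30$, the correlation values, and all function names are assumptions made purely for illustration; the critical value is a hard-coded approximation of the 95% quantile of the t-distribution with 29 degrees of freedom.

```python
import numpy as np

# Approximate 95% quantile of Student's t with 29 degrees of freedom (n = 30).
T_CRIT_DF29 = 1.699


def t_statistic(samples):
    """One-sample t statistic for H0: mu = 0, computed for every row."""
    n = samples.shape[1]
    return np.sqrt(n) * samples.mean(axis=1) / samples.std(axis=1, ddof=1)


def ar1_samples(rng, reps, n, rho):
    """Rows of a stationary Gaussian AR(1) process with mean 0 and variance 1.

    rho = 0 gives i.i.d. standard normal data; rho > 0 gives positive
    dependence between neighbouring observations (an illustrative choice).
    """
    eps = rng.standard_normal((reps, n))
    x = np.empty((reps, n))
    x[:, 0] = eps[:, 0]
    for t in range(1, n):
        x[:, t] = rho * x[:, t - 1] + np.sqrt(1.0 - rho**2) * eps[:, t]
    return x


def rejection_rates(rhos=(0.0, 0.2, 0.5), n=30, reps=20000, seed=1):
    """Estimated probability of rejecting H0: mu = 0 against mu > 0 when mu = 0."""
    if n != 30:
        raise ValueError("T_CRIT_DF29 is only valid for n = 30")
    rng = np.random.default_rng(seed)
    rates = {}
    for rho in rhos:
        t_values = t_statistic(ar1_samples(rng, reps, n, rho))
        rates[rho] = float(np.mean(t_values > T_CRIT_DF29))
    return rates


if __name__ == "__main__":
    for rho, rate in rejection_rates().items():
        print(f"rho = {rho:.1f}: estimated type I error = {rate:.3f}")
```

In all three settings the expected value is zero, so the interpretational $H_0$ holds; only the dependence structure differs. Under these assumptions the estimated rejection rate should stay close to the nominal 5% for `rho = 0` and grow clearly as `rho` increases.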
The questions to ask are: (1) is there any evidence in the data that such a thing may have happened, (2) is there any subject matter knowledge that suggests this (e.g., salespersons may have learnt from each other rather than from the seminar), (3) does the meaning of the test statistic correspond to what is of interest, and (4) is it feasible to run further analyses, with the same or other data, that can confirm the conclusion? (4) is important because even the best efforts to use (1) and (2) will not be able to remove all doubt. Models are always conceivable that can cause trouble but cannot be distinguished from the nominal model based on the data, and thinking about the subject matter may miss something important. (3) is in my view a very important issue that is not usually appreciated. Tests are usually derived using optimality considerations under the nominal model, but this does not imply that they are optimal for distinguishing the “interpretational $H_0$” from the “interpretational alternative”. In many cases at least they make some good sense, but a model like (1) may lead them astray. The full interpretational $H_0$ and the interpretational alternative will normally be too complex to derive any optimal test from them, but the form of the test statistic itself suggests what kinds of distributions the test actually distinguishes. This should be in line with what is interpreted as the difference of interest between $H_0$ and the alternative. At least under a point $H_0$, the distribution of any test statistic can be simulated (if $H_0$ is a set of distributions, parametric bootstrap can be used, though potentially involving a certain bias). […] $H_0$ and the interpretational alternative distinguished by the test as the set of distributions for which the test statistic is expected to be low, and the set of distributions for which it is expected to be larger, respectively.
In many situations a nonstandard test statistic may be more suitable than what is optimal under a simple nominal model (see Hennig and Lin (2015) for an example).

Bayesians often argue against frequentist inference by stating that the probabilities that characterise the performance of frequentist inference methods, such as test error and confidence interval coverage probabilities, are pre-data probabilities, and that they do not tell the researcher about the probability for their inferences to be true after having seen the data. One example is that a 95%-confidence interval for the mean of a Gaussian distribution in a situation in which the mean is constrained to be larger than zero may consist of negative values only, meaning that after having seen the data the researcher can be sure that the true value is not in the confidence interval. This is a problem if the coverage probability of the confidence interval is indeed interpreted as a probability for the true parameter value to be in the confidence interval, which is a misinterpretation; the coverage probability is a performance characteristic of the confidence interval pre-data, assuming the model. From the point of view of frequentism-as-model, this is not a big problem, because there is no such thing as a true model or a true parameter value, and therefore a probability for any model to be true is misleading anyway, although it can be given a meaning within a model in which the prior distribution is also interpreted in a frequentism-as-model sense, see Section 5.2. In practice, if in fact a confidence interval is found that does not contain any admissible value, this means that either the restriction of the parameter space that makes all the values in the confidence interval impossible can be questioned, or that a truly atypical sample was observed.
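The constrained-mean example can be made concrete with a few lines of arithmetic. The following is a minimal sketch under assumptions that are not in the text: the standard deviation is treated as known, so that a z-interval can be used, and the numbers (sample mean $-1$, $\sigma = 1$, $n = 25$) are hypothetical. It computes the interval and the pre-data probability, under the model, that the whole interval falls below zero.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def z_interval(sample_mean, sigma, n, z=1.959964):
    """Two-sided 95% confidence interval for the mean with known sigma."""
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


def prob_interval_entirely_negative(mu, sigma, n, z=1.959964):
    """Pre-data probability that the upper bound is below zero.

    Assumes i.i.d. N(mu, sigma^2) data, so that the sample mean is
    N(mu, sigma^2 / n) and the event is {sample mean < -z * sigma / sqrt(n)}.
    """
    return normal_cdf(-z - mu * math.sqrt(n) / sigma)


if __name__ == "__main__":
    lower, upper = z_interval(sample_mean=-1.0, sigma=1.0, n=25)
    print(f"interval: ({lower:.3f}, {upper:.3f})")
    for mu in (0.0, 0.5):
        p = prob_interval_entirely_negative(mu, sigma=1.0, n=25)
        print(f"mu = {mu}: P(interval entirely negative) = {p:.6f}")
```

With these hypothetical numbers the interval contains negative values only, so none of them is admissible if the mean is constrained to be positive. Under the model, the probability of such an outcome is at most about 2.5% (its value at the boundary $\mu = 0$), which is one way of reading “a truly atypical sample”.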
Davies (1995) defined “adequacy regions” that are basically confidence sets defined based on several statistics together (such as the mean or median, an extreme value index, and a discrepancy between distributional shapes; appropriately adjusting confidence levels), that can by definition in principle rule out all distributions of an “assumed” parametric family, meaning that no member of the family is compatible with the data given the combination of chosen statistics. Normally confidence intervals are interpreted as giving a set of candidate values for the truth, assuming that the parametric model is true; but if that assumption is dropped from the interpretation and the interval is just a set of models compatible with the data, defined by parameter values within a certain parametric family, it is possible that the whole family is not compatible with the data.

Overall I suspect that the biggest problem with interpreting standard frequentist inference is that many of the people who use it want to make stronger claims, and want to have greater certainty, than the probability setup that they use allows. An example of this is the ubiquity of implicit or explicit claims that the null hypothesis is true in case it was not rejected by a test. Interpreting the results of classical frequentist inference assuming the truth of the model will generally lead to overinterpretation. According to frequentism-as-model, classical frequentist in- […]

A standard statement regarding statistical methods is that these are based on model assumptions, and that the model assumptions have to be fulfilled for a method to be applied. This is misleading. Model assumptions cannot be fulfilled in reality, because models are thought constructs and operate on a domain different from observer-independent reality. For this reason they cannot even be approximately true in a well defined sense, because no distance between the unformalised “underlying real truth” and the model can be defined, although, following Davies, the approximation notion can be well defined by comparing observed data to a model.

What is the role of model assumptions then? There are theorems that guarantee a certain performance of the method, sometimes optimal, sometimes just good in a certain sense, under the model assumptions. The model assumptions are not required for applying the method, but for securing the performance achieved in theory. The theory can be helpful for choosing a method, and some theory leads to the development of methods, but the theoretical performance can never be guaranteed in reality. This does not mean that the performance will be bad whenever the model assumptions are not fulfilled. In fact, some aspects of the model assumptions are usually almost totally irrelevant for the performance, such as the application of methods derived from models for continuous data to data that are rounded to, say, two decimal places. Interpretational equivalence is an important concept also in this respect, because ideally methods would not only distinguish what is deemed relevant to distinguish by the researcher under the model assumptions, e.g., a certain $H_0$ and its nominal alternative, but would also lead to largely the same results for interpretationally equivalent distributions that do not fulfil the model assumptions.
As far as this is the case, the corresponding model assumption is irrelevant.

It makes sense, at first sight, to think that the method will be appropriate if the assumed model is a good model, i.e., if reality looks very much like it. This itself can be modelled, meaning that one can look at the performance of a method in a situation in which the assumed model does not hold, but a somewhat similar model does, which can be formalised using dissimilarity measures between distributions. This has been considered in robust statistics (Hampel et al. (1986), “qualitative […] $n$th observation has to fit well for passing. I called this the “goodness-of-fit (or misspecification) paradox” in Hennig (2007). In many situations this will not change error probabilities associated with the model-based method strongly, meaning that in case the model was true before misspecification testing, not much harm is done.

On the other hand, in case the model assumption is violated in a problematic way, the distribution conditional on passing a misspecification test will often not make the method work better, and will sometimes make it work even worse than before testing; keep in mind that just because the misspecification test does not reject the model assumption, it does not mean that it is fulfilled.
Shamsudheen and Hennig (2019) reviewed work investigating the actual performance of procedures that involve a misspecification test for one or more model assumptions before running a method that is based on these model assumptions, looking at data that originally […] $\lambda$, generated by the assumed model, and with probability $1 - \lambda$ by a distribution that could cause trouble for the model-based method (see Section 5.2 for using such a “Bayesian” setup in connection with frequentism-as-model). Under certain assumptions for the involved methods they showed that for a range of values of $\lambda$ the combined procedure beats both involved tests, the model-based one and the one not requiring the specific model assumption, regarding power, even if not winning for $\lambda = 0$ and $\lambda = 1$, which is what authors of previous work had investigated.

The surprisingly pessimistic assessment of misspecification testing by authors who investigated its effect may be due to the fact that most available misspecification tests test the model assumption against alternatives that are either easy to handle or very general, whereas little effort has been spent on developing tests that rule out specific violations of the model assumptions that are known to affect the performance of the model-based method strongly, i.e., leading to different conclusions for interpretationally equivalent models, or the same conclusions for models that are interpretationally very different. Such tests would be specifically connected to the model-based method with which they are meant to be combined, and to the assessment of interpretational equivalence.

Furthermore, as already mentioned, not everything can be tested on data. E.g., many conceivable dependence structures do not lead to patterns that can be used to reject independence. For example, it may occasionally but not regularly happen that one observation determines or changes the distribution of the next one.
Think of psychological tests in which sometimes a test person discusses the test with another participant who has not yet been tested, and where such communication can have a strong influence on the result.

FREQUENTISM-AS-MODEL AND BAYESIAN STATISTICS

Most Bayesians interpret their probabilities epistemically, and this is incompatible with frequentism-as-model. This does not imply that the epistemic interpretation is in my view in any way wrong, but I do not agree with the claim of some, including de Finetti, that epistemic probabilities make frequentist probabilities superfluous, see Section 5.1. Frequentism-as-model as an interpretation of probability is not committed to specific methodology such as tests and confidence intervals. It is also compatible with Bayesian methods, as long as they are interpreted accordingly, see Section 5.2.

The distinction between compatibility logic (i.e., asking whether certain probability models are compatible with the data, as addressed by tests and confidence intervals) and Bayesian inverse probability logic, in which the central outcomes are the posterior probabilities resulting from conditioning the prior on the data, is not fully in line with the distinction between epistemic and aleatory probability, but currently a matter of hot debate. The 2016 ASA Statement on p-values (Wasserstein and Lazar (2016)) has a single positive message on p-values, besides a number of negative ones: “p-values can indicate how incompatible the data are with a specified statistical model.” Wasserstein et al. (2019) go further and ask to “abandon statistical significance”, opening a Special Issue of The American Statistician containing a bewildering variety of alternative proposals. Many of the authors argue from an inverse probability logic, but this has problems that in my view are similarly severe. Section 5.3 compares the two logics.

The term “epistemic probability” refers to interpretations of probability that explain probabilities as rational measures of the uncertainty of a claim. They can roughly be divided into objectivist (logical) and subjectivist epistemic probabilities (see Gillies (2000); Galavotti (2005) for an overview). According to epistemic interpretations of probability, a priori existing probability assignments

[…] should be. But mixing probability axioms with otherwise unconstrained prior probability assessments produces a strange compromise of normative and empirical reasoning that looks artificially constructed rather than really existing in any conceivable sense. The only credible claim for existence is probably that a Bayesian researcher can say that she consciously adopts the resulting probabilities.

2. As is the case with frequentist probabilities, epistemic probabilities are used in a simplifying and idealising way. As elaborated in Section 2.2, frequentists need to rely on i.i.d. models not because they believe that the modelled process is really i.i.d., but rather in order to construct the kind of repetition that makes model-based learning from data possible. For the same reason, epistemic Bayesians normally rely on exchangeability (or a generalised concept for more complex situations, as was discussed for i.i.d. in Section 2.2). But when applied to real degrees of belief, the exchangeability assumption seems counterintuitive. In particular, once a process is assessed as exchangeable by an epistemic Bayesian, using standard Bayesian reasoning there is no longer any way to learn that the exchangeability assumption is not in line with the process to be modelled. I would think it rational, even if initially there is no reason to think that the order of observations matters regarding the probability of a sequence, to change that assessment if for example in a binary process 50 ones, then 50 zeroes, then 50 ones, then 23 zeroes are observed. Surely this should convince the subjectivist that observing a zero next is now more likely than it was in the middle of a run of ones! But if initially the observations were assessed to be exchangeable, this is not possible.

The point that I want to get across here is not that subjectivist epistemic probability involving exchangeability is in any way “wrong” or “useless”. In fact I am a pluralist, I can see areas where epistemic probability can be of use, and I accept the necessity of simplification and idealisation. The point is rather that most if not all criticism that subjectivists have about frequentism corresponds to trouble that exists within their own approach as well, which has to do with the fact that modelling does not match reality, be it aleatory or epistemic. Given that this is so, and given the obvious difficulty in many cases of choosing a prior distribution, it looks attractive to model the reality of interest directly, as frequentism-as-model does, rather than a degree of belief about it.

Objectivist epistemic probability is overall not in a better position than subjectivist epistemic probability. The issue with exchangeability is the same, except that subjectivists at least have a reference, namely the person holding the probability, who could take responsibility for either choosing exchangeability or for specifying a particular pattern deviating from it. I have not seen any discussion of objectivist epistemic modelling involving the ability to deviate from exchangeable assessments if the data provide strong evidence against them.

Another advantage that subjectivists have over epistemic objectivists is that they are allowed to formalise existing but informal evidence as they see fit, whereas it is unclear how epistemic objectivists could incorporate it.
This corresponds to the open license that frequentism-as-model grants the researcher to incorporate subjective assessments of the situation in their models, compared to traditional frequentists or propensity theorists. In practice almost everyone does it, but many do not admit it.

Overall I think that frequentism-as-model has something relevant to offer that is not covered by the major streams of epistemic probability, and that for major arguments that critics have against frequentism, corresponding arguments exist against epistemic probability. I do not deny that epistemic probability has its uses, but frequentism-as-model treats processes that the researchers think of as “random” more directly, avoiding the thorny if sometimes useful issue of specifying and justifying a prior.
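The binary-sequence example given above (50 ones, 50 zeroes, 50 ones, 23 zeroes) can be checked numerically. The sketch below is an illustration under assumptions that are not part of the text: exchangeability is represented by an i.i.d. Bernoulli model with a uniform prior on the success probability, and the contrasting first-order Markov chain with independent uniform priors on its transition probabilities is a hypothetical alternative, one of many ways in which order could be allowed to matter.

```python
from fractions import Fraction

# The binary sequence from the text: 50 ones, 50 zeroes, 50 ones, 23 zeroes.
SEQUENCE = [1] * 50 + [0] * 50 + [1] * 50 + [0] * 23


def exchangeable_prob_zero(seq):
    """Predictive P(next = 0) under i.i.d. Bernoulli with a uniform prior.

    Only the counts of zeroes and ones enter (Laplace's rule of succession),
    so the order of the observations is irrelevant.
    """
    return Fraction(seq.count(0) + 1, len(seq) + 2)


def markov_prob_zero(seq):
    """Predictive P(next = 0) under a first-order Markov chain.

    Each transition probability has an independent uniform prior, so only the
    transitions out of the current state (the last observation) are used.
    """
    last = seq[-1]
    successors = [b for a, b in zip(seq, seq[1:]) if a == last]
    return Fraction(successors.count(0) + 1, len(successors) + 2)


if __name__ == "__main__":
    reordered = sorted(SEQUENCE)  # same counts, completely different order
    print(exchangeable_prob_zero(SEQUENCE))   # 74/175
    print(exchangeable_prob_zero(reordered))  # 74/175
    print(markov_prob_zero(SEQUENCE))         # 36/37
    print(markov_prob_zero(SEQUENCE[:125]))   # 1/38, in the middle of a run of ones
```

Under the exchangeable model the predictive probability of a zero is 74/175 regardless of how the same observations are ordered, so the run structure cannot be learnt from. The hypothetical Markov model reacts to it strongly, but only because dependence on order was built in from the start.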

Epistemic probabilities model a personal or “objective” degree of belief, not the data generating process as such, and therefore they cannot be checked against and falsified by the data. See Dawid (1982) for a discussion of “calibration”, i.e., agreement or potential mismatch between predictions based on epistemic Bayesian probabilities and what is actually observed. Gelman and Shalizi (2013) argued that Bayesian statistics should allow for checking the model against the data, […] as if it describes the data generating process but can be dropped or modified if falsified by the data. The involved interpretation of probabilities is consistent if the parameter prior is interpreted as a model of a parameter generating process in the same way. Gelman and Shalizi (2013) stated that the prior distribution may encode “a priori knowledge” or a “subjective degree of belief”. This seems to mix up an epistemic interpretation of the parameter prior with an aleatory interpretation of the parametric model, and it is hard to justify using them in the same calculus. I believe that it would be better to refer to the parameter prior as an idealistic model of a process that generates parameters for different situations that are based on the same information, i.e., to interpret it in a frequentism-as-model way. This is in line with the fact that Gelman in presentations sometimes informally refers to the parameter prior as a distribution over parameters realistic in a distribution of different situations of similar kind in which datasets can be drawn. This is a very idealistic concept, and it is probably hard to connect the setup of parameter generation precisely to real observations. The modelling will normally indeed rely more on belief and informal knowledge than on observation of replicates of what is supposed to be parameter generation, but as in Section 2.1, an arbitrary amount of data can be generated from the fully specified model, and can be compared with the observed data. Testing the parameter prior is hard. In a standard simple Bayesian setup, it is assumed that only one parameter value generated all observed data, so the effective sample size for checking the parameter prior is smaller than one, because the single parameter is not even precisely observed. Therefore sensitivity to the prior specification will always be a concern, but see the discussion of single-case probabilities in Section 2.2.
In any case, the parametric model can be tested in frequentist ways. In case the parameter prior encodes valuable information about the parameter that can most suitably be encoded in this way, Bayesian reasoning based on such a model is clearly useful, and open to self-correction by falsificationist logic. As for model assumption testing followed by traditional frequentist methods (Shamsudheen and Hennig (2019)), it may be of interest to analyse the behaviour of Bayesian reasoning conditionally on model checking in case of fulfilled and not fulfilled model assumptions.

A benefit of such an interpretation could be that the prior no longer has to be claimed either to be objective or to model a specific person. It is a not necessarily unique researcher’s suggestion of how to imagine the parameter generating process based on a certain amount of information, and can as such be compared with alternatives and potentially rejected, if not by the data, then by open dis- […]

We have seen that the distinction between a frequentist and an epistemic interpretation of probability does not align perfectly with the distinction between compatibility logic and Bayesian inverse probability logic. Often posterior probabilities are interpreted as probabilities about where to find the true parameter value. The idea of a true parameter value is a traditional frequentist one. de Finetti (1974) argued that posteriors should be interpreted regarding observable quantities such as future observations and not regarding unobservables such as true parameter values that may well not exist, but according to Diaconis and Skyrms (2018), de Finetti’s Theorem implies that for a subjectivist, belief in exchangeability or a suitable generalisation of it implies belief in the existence of limiting relative frequencies, and therefore a limiting probability distribution that can be parametrised. This could be used to connect epistemic probability with compatibility logic, but advocates of epistemic probability do not seem to be very interested in this. In any case, a falsificationist Bayes perspective combined with a frequentism-as-model interpretation of probability licenses probabilistic statements about the parameter modelled as true.

A major difference between compatibility logic and inverse probability logic is that according to compatibility logic many models can be compatible with the data, and the compatibility of one model does not exclude or reduce the compatibility of another model. Inverse probability logic distributes an overall probability of one over the models modelled as possible, implying that a higher probability for one model automatically decreases the probability for the others. The models are competing for probability, so to say.

There are advantages and disadvantages of both approaches.
Many Bayesians such as Diaconis and Skyrms (2018) have pointed out that p-values and confidence levels are regularly misinterpreted as probabilities regarding the true parameter, because these should be the ultimate quantities of interest in statistical inference, or so it is claimed. Inverse probability logic deals with combining different inferences such as multiple testing, which creates trouble for standard compatibility logic approaches, in a unified and coherent way. Frequentists are not only interested in compatibility, but also in estimation; finding a best model amounts to a competition between models. A Bayesian can argue that in this case a probability distribution over parameters provides better information about how the parame- […]

[…] $\mathcal{N}(0, 1)$ is a reasonable approximation of the truth, $\mathcal{N}(10^{-[\ldots]}, 1)$ is a reasonable approximation as well, whereas according to inverse probability logic any two models compete for a part of the unit probability mass.

Furthermore, it may seem unfair to criticise tests and confidence intervals based on misinterpretations. Arguably many users do not only want to know the probability for certain parameter values to be true, which indeed tempts them to misinterpret confidence levels and p-values. They arguably also want this probability to be objective and independent of prior assessments, which they have a hard time specifying. This combination is not licensed by any properly understood philosophy of statistics, and ultimately statisticians need to accept that their job is often not to give the users what they want, but rather to defy wrong expectations.

The role of the prior distribution in inverse probability logic is a major distinction between the two approaches. Bayesians argue that the prior is a good and very useful vehicle for incorporating prior information. Actually prior information enters frequentist modelling as well (see earlier sections), but the Bayesian prior is still an additional tool on top of the options that frequentists have to involve information. But the requirement to set up a prior can also be seen as a major problem with inverse probability logic, given that prior information does not normally come in the form of prior probabilities, and that it is actually in most cases very difficult to translate existing information into the required form. It is not an accident that a very large number of applied Bayesian publications come with no or very scarce subject matter justification of the prior, and in most cases the prior information is very clearly compatible with many different potential priors, with comprehensive sensitivity analysis rarely done. If the sample size is large enough for the prior to lose most of its influence, one may wonder why to bother having one.
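The contrast drawn above between two nearly identical models that are both compatible with the data and two models that compete for unit probability mass can be illustrated numerically. This is a minimal sketch under assumptions of my own: the data are summarised by a hypothetical sample mean of 0.1 from $n = 100$ observations with known variance 1, the second model's mean of $10^{-6}$ is a stand-in value, and the two point models receive equal prior probabilities.

```python
import math


def two_sided_p_value(sample_mean, n, mu0, sigma=1.0):
    """p-value of the two-sided z-test of H0: mu = mu0 with known sigma."""
    z = abs(sample_mean - mu0) * math.sqrt(n) / sigma
    return math.erfc(z / math.sqrt(2.0))


def posterior_model_probs(sample_mean, n, mus, sigma=1.0):
    """Posterior probabilities of the point models N(mu, sigma^2), equal priors.

    The log-likelihoods are given up to a constant that is common to all models.
    """
    log_lik = [-0.5 * n * (sample_mean - mu) ** 2 / sigma**2 for mu in mus]
    top = max(log_lik)
    weights = [math.exp(value - top) for value in log_lik]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    means = (0.0, 1e-6)  # hypothetical pair of nearly identical models
    for mu0 in means:
        p = two_sided_p_value(sample_mean=0.1, n=100, mu0=mu0)
        print(f"mu0 = {mu0}: p-value = {p:.4f}")
    print(posterior_model_probs(sample_mean=0.1, n=100, mus=means))
```

Under these assumptions both p-values are about 0.32 and practically identical, so neither model's compatibility reduces the other's. The posterior probabilities, in contrast, must sum to one and are therefore each close to one half, although the two models are interpretationally indistinguishable.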
The question whether there is prior information that is meant to have an impact on the analysis and can be encoded convincingly in the form of a prior distribution is a key issue for deciding whether inverse probability or compatibility logic will be more promising in a given application of statistics.

Falsificationist Bayes combines the two by applying inverse probability logic within a Bayesian model, of which the compatibility with the data should also be investigated. Gelman et al. (1996) emphasise that the posterior distribution of the parameter is conditional on the truth of the model, and according to frequentism-as-model a researcher can interpret results temporarily as if this were the case, […]

CONCLUSION

It may seem to be my core message that models are models and as such different from reality. This is of course commonplace, and agreed by many if not all statisticians, although it rarely influences applied statistical analyses or even discussions about the foundations of statistics.

Here are some less obvious implications:

• The usual way of talking about model assumptions, namely that they “have to be fulfilled”, is misleading. The aim of model assumption checking is not to make sure that they are fulfilled, but rather to rule out issues that misguide the interpretation of the results. Combining model assumption checking and analyses chosen conditionally on the model checking results can itself be modelled and analysed, and depending on what exactly is done, it may or may not turn out to work well.
• There are always lots of models compatible with the data. Some of these can be favoured or excluded by plausibility considerations, prior information, or by the data, but some irregular ones have to be excluded simply because inference would be hopeless if they were correct.
• Model assessment based on the data involves decisions about “in what way to look”, i.e., what model deviations are relevant. Whether a model is compatible with the data cannot be decided independently of such considerations.
• Choosing a model implies decisions to ignore certain aspects of reality, e.g., differences between the conditions under which observations modelled as i.i.d. were gathered. These decisions should be transparent and open for discussion.
• Consideration of interpretational equivalence, i.e., what different models would be interpreted in the same or different way regarding the subject […]
• Frequentism-as-model is compatible with both compatibility and inverse probability logic. A key to deciding which one to prefer is to ask whether the parameter prior distribution required for inverse probability logic can be used to add valuable information in a convincing way. The parameter prior itself cannot normally be checked against the data with satisfactory power.

References

Box, G. E. P. (1979). Robustness in the strategy of scientific model building. In R. L. Launer and G. N. Wilkinson (Eds.), Robustness in Statistics, pp. 201–236. Cambridge, MA: Academic Press.

Breiman, L. (2001). Statistical modeling: the two cultures. Statistical Science 16, 199–215.

Chang, H. (2012). Is Water H2O? Evidence, Realism and Pluralism. Dordrecht, The Netherlands: Springer.

Cox, D. R. (1990). Role of models in statistical analysis. Statistical Science 5, 169–174.

Cressie, N. (1980). Relaxing assumptions in the one-sample t-test. Australian Journal of Statistics 22, 143–153.

Davies, P. L. (1995). Data features. Statistica Neerlandica 49, 185–245.

Davies, P. L. (2014). Data Analysis and Approximate Models. New York: Chapman and Hall/CRC.

Dawid, A. P. (1982). The well-calibrated Bayesian (with discussion). Journal of the American Statistical Association 77, 605–613.

de Finetti, B. (1974). Theory of Probability, Vol. 1. New York: Wiley.

Diaconis, P. and B. Skyrms (2018). Ten Great Ideas About Chance. Princeton, NJ: Princeton University Press.

Eagle, A. (2019). Chance versus randomness. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Spring 2019 ed.). Metaphysics Research Lab, Stanford University.

Fay, M. P. and M. A. Proschan (2010). Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules. Statistics Surveys 4, 1–39.


The Stanford Encyclopedia of Philosophy (Winter 2018 ed.). Metaphysics Research Lab, Stanford University.

Galavotti, M. C. (2005). Philosophical Introduction to Probability. Stanford, CA: CSLI Publications.

Gelman, A. and C. Hennig (2017). Beyond objective and subjective in statistics (with discussion). Journal of the Royal Statistical Society, Series A 180, 967–1033.

Gelman, A., X.-L. Meng, and H. Stern (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica 6(4), 733–808.

Gelman, A. and C. Shalizi (2013). Philosophy and the practice of Bayesian statistics (with discussion). British Journal of Mathematical and Statistical Psychology 66, 8–80.

Gillies, D. (2000). Philosophical Theories of Probability. London: Routledge.

Hacking, I. (1975). A Philosophical Study of Early Ideas about Probability, Induction and Statistical Inference. Cambridge, UK: Cambridge University Press.

Hajek, A. (2009). Fifteen arguments against hypothetical frequentism. Erkenntnis 70(2), 211–235.

Hampel, F. R., E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel (1986). Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.

Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science 21, 1–14.

Hennig, C. (2007). Falsification of propensity models by statistical tests and the goodness-of-fit paradox. Philosophia Mathematica 15(2), 166–192.

Hennig, C. (2010). Mathematical models and reality: A constructivist perspective. Foundations of Science 15, 29–48.

Hennig, C. and C.-J. Lin (2015). Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters. Statistics and Computing 25, 821–833.

Kahneman, D., P. Slovic, and A. Tversky (1982). Judgment Under Uncertainty: Heuristics and Biases. Cambridge, UK: Cambridge University Press.

Kulinskaya, E., S. Morgenthaler, and R. G. Staudte (2008). Meta Analysis. New York: Wiley.


The Lancet 366, 690.

Lehmann, E. L. (1986). Testing Statistical Hypotheses (2nd ed.). New York: Wiley.

Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge, UK: Cambridge University Press.

Rice, K., T. Bonnett, and C. Krakauer (2020). Knowing the signs: a direct and generalizable motivation of two-sided tests. Journal of the Royal Statistical Society: Series A (Statistics in Society) 183(2), 411–430.

Shamsudheen, M. I. and C. Hennig (2019, August). Does Preliminary Model Checking Help With Subsequent Inference? A Review And A New Result. arXiv e-prints (submitted for publication), arXiv:1908.02218.

Shang, A., K. Huwiler-Müntener, L. Nartey, P. Jüni, S. Dörig, J. A. C. Sterne, D. Pewsner, and M. Egger (2005). Are the clinical effects of homoeopathy placebo effects? Comparative study of placebo-controlled trials of homoeopathy and allopathy. The Lancet 366, 726–732.

Steinley, D. and M. J. Brusco (2011). Choosing the number of clusters in k-means clustering. Psychological Methods 16, 285–297.

Tukey, J. W. (1997). More honest foundations for data analysis. Journal of Statistical Planning and Inference 57, 21–28.

Vermunt, J. K. (2011). K-means may perform as well as mixture model clustering but may also be much worse: Comment on Steinley and Brusco (2011). Psychological Methods 16, 82–88.

von Mises, R. (1939). Probability, Statistics, and Truth (2nd ed.). London, UK: Macmillan.

Wasserstein, R. L. and N. A. Lazar (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician 70(2), 129–133.

Wasserstein, R. L., A. L. Schirm, and N. A. Lazar (2019). Moving to a world beyond “p < 0.05”. The American Statistician 73, 1–19.

Xie, M.-g. and K. Singh (2013). Confidence distribution, the frequentist distribution estimator of a parameter: A review.