arXiv [stat.OT]

On p-values

Laurie Davies
Faculty of Mathematics, University of Duisburg-Essen, 45117 Essen, Germany
email: [email protected]
1. The use and abuse of p-values

All branches of knowledge which require the analysis of data make use of p-values. Unfortunately in many cases 'make use of' could be replaced by 'abuse'; the many reports of widespread abuse are convincing. In response The American Statistician published a statement on p-values by the American Statistical Association together with supplementary material consisting of statements by several statisticians and philosophers ([Wasserstein and Lazar, 2016]).

The most detailed of the supplementary material is [Greenland et al., 2016]. The authors point out that there are many ways in which any usefulness of a p-value can be invalidated. One example is to perform several experiments and report only the one with the smallest p-value. Problems of this nature will not be discussed here. It will be assumed that the experiment is, so to speak, clean and the data are of high quality.

2. Probability models and approximation
2.1. Semantics.

There are two meanings of the word 'model' in statistics. The first refers to a parametric family of distributions. Thus the normal model is the family of all normal distributions. This meaning of the word 'model' is common in much of statistics, in particular in Bayesian statistics where such models are the objects of study.

The second meaning is that of a single probability measure. In this sense of the word the N(0, 1) probability measure is one model, the N(0, 2) probability measure is another model. This is the sense in which the word will be used in this paper. Models in this sense are, so to speak, the atoms of probability theory and hence the basic objects of stochastic modelling. The meaning of the word 'model' in the first sense is a parametric family of models in the second sense.

2.2. Approximate models.
The authors of [Greenland et al., 2016] state

(A) ... the distance between the data and the model prediction is measured using a test statistic ...

and

(B) In logical terms, the p-value tests all the assumptions about how the data were generated (the entire model) ...

Although it is never precisely stated it seems that the word 'model' in the above quotations is meant in the first sense, a parametric family of distributions. Whatever the meaning of the word 'model' the meaning of the quotations taken together is clear. The distance between the data and the model is based on a test statistic and the corresponding p-value measures this distance in a particular manner. The phrase 'In logical terms' in the quotation (B) suggests that in practice this is not so. Indeed in practice the parametric model is accepted and the p-value is based on a particular hypothesis H : µ = µ_0 using a statistic especially designed for testing this null hypothesis, for example a t-test. Such a single statistic cannot possibly test 'all the assumptions about how the data were generated (the entire model)'.

A similar attitude is taken in [Birnbaum, 1962]: consideration is restricted to

(C) models whose adequacy is postulated and not in question ... the adequacy of any such model is typically supported, more or less adequately, by a complex informal synthesis of previous experimental evidence of various kinds and theoretical considerations concerning both subject-matter and experimental techniques.

In contrast to the word 'adequacy' being applied to a family of probability measures it will here be applied to individual probability measures. Thus one asks of individual models, say the N(0, 1) model or the N(100, ·) model, whether each is adequate, that is consistent with the data.

The two different meanings of the word 'model' are not just a question of notation or definition. They reflect two different approaches to statistics. This may be seen in [Birnbaum, 1962] where a parametric family of probability measures has to be adequate without specifying the adequacy of any individual measure. This is necessary as the Likelihood Principle requires the proportionality of two different densities for all values of the parameter and not just for the adequate ones. A similar problem occurs when testing hypotheses. The parametric model is declared adequate without specifying the adequate values of the parameter. A hypothesis H : µ = µ_0 is then tested to see whether µ = µ_0 is consistent with the data. It only makes sense to do this if the adequate parameter values have not been specified when declaring the whole family to be adequate, as otherwise the test would be superfluous.

More generally a common approach is to perform a statistical analysis in two stages. In the first stage one or several parametric models will be investigated for adequacy, for example by using a goodness-of-fit test. Once an adequate model has been found it is made the basis of the second stage where it is treated as if it were true. Treating it as true means among other things ignoring the first stage. If indeed the model is now treated as if true then how we arrived at this truth is irrelevant. The following quotation is taken from Chapter 5 of Huber [Huber, 2011], entitled 'Approximate Models':

(D) In the opposite case, if a goodness-of-fit test does not reject a model, statisticians have become accustomed to acting as if it were true. Of course this is logically inadmissible, even more so if with McCullagh and Nelder one believes that all models are wrong a priori. Moreover, treating a model that has not been rejected as correct can be misleading and dangerous.
Perhaps this is the main lesson we have learned from robustness.

In [Davies, 2014] models are consistently treated as approximations. The basic idea is that a model P is an adequate approximation to a data set x_n = (x_1, ..., x_n) if typical data X_n(P) generated under P look like x_n. Data are generated under single probability distributions P and not under a family of such, that is, a model in the first sense of the word. This is the reason why single probability distributions are the basic objects of study and not families of distributions.

The definition of 'look like' will depend on the nature of the data being analysed and the model. As an example suppose that the model is that of i.i.d. N(0, 1) random variables. 'Look like' can then be expressed in terms of, for example, the mean, the variance, the presence of outliers and the distance of the empirical distribution function from the N(0, 1) distribution function. This will be done explicitly in Section 2.4 below.

It is worth noting that the concept of adequacy is defined in terms of several statistics and not just one. This is in contrast to the quotation (A) of Section 2.2 where it is based on a single statistic.

The approach described in [Davies, 2014] can be read as an attempt to replace a two stage methodology, EDA followed by formal inference, by a single stage methodology whereby all tests become misspecification tests from without, or from a distance. It is an instance of 'distanced rationality' due to D. W. Müller (see [Müller, 1974]). Here is my translation:

(E) ... distanced rationality. By this we mean an attitude to the given, which is not governed by any possible or imputed immanent laws but which confronts it with draft constructs of the mind in the form of models, hypotheses, working hypotheses, definitions, conclusions, alternatives, analogies, so to speak from a distance, in the manner of partial, provisional, approximate knowledge.

2.3. 'Adequate' parametric families.
Although the quotation C does not make it explicit it is clear from [Birnbaum, 1962] that Birnbaum is referring to families of parametric models. Thus the Poisson family may be declared adequate without specifying any particular value of λ which is consistent with the data. As an example suppose the parametric family is the Poisson family and that the chi-squared goodness-of-fit test is used to test adequacy. The test is typically based on some variant of the test statistic

(1)    sum_{j=0}^{k} (p̂_j − p_j(x̄_n))² / p_j(x̄_n)

where the p̂_j are the empirical frequencies, x̄_n is the mean of the data and p_j(λ) = λ^j exp(−λ)/j!. If the value of the test statistic (1) lies below a certain level then the Poisson model is declared adequate. Note that (1) does not specify any individual parameter values.

Given adequacy in this sense the whole parametric family is then transported to the second stage of formal inference in spite of the fact that the overwhelming majority of individual models will not be consistent with the data. For Birnbaum's argument to work this is essential: 'two likelihood functions, f(x, θ) and g(y, θ) are called the same if they are proportional, that is if there exists a positive constant c such that f(x, θ) = c g(y, θ) for all θ'.

In the second sense of the word 'model', an individual probability distribution, the goodness-of-fit procedure takes on a different form. In the concrete case of the Poisson family a given Poisson distribution, say P_λ with λ = 2, can be tested for adequacy using

(2)    sum_{j=0}^{k} (p̂_j − p_j(2))² / p_j(2).

The set of λ values for which the test statistic lies below a critical level specifies those λ values, if any, which are consistent with the data. This will not be the set of all possible values.

If one interprets the concept of adequacy for models in the first sense using the second sense it can only mean that there are some parameter values θ for which the single model P_θ is consistent with the data.
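The procedure based on (2) can be sketched numerically. The following is a minimal illustration, not an implementation from the paper: the statistic is given the conventional factor n, the cells j ≥ k are pooled, and the critical level for each candidate P_λ is obtained by simulation; the grid, simulation sizes and function names are all choices made here.

```python
import math
import random

def pois_probs(lam, k):
    # p_j(lam) = lam^j exp(-lam)/j! for j = 0..k-1, with j >= k pooled into a last cell
    p = [lam ** j * math.exp(-lam) / math.factorial(j) for j in range(k)]
    return p + [max(1e-12, 1.0 - sum(p))]

def chisq_stat(data, lam, k):
    # statistic (2) with the conventional scaling: n * sum_j (phat_j - p_j(lam))^2 / p_j(lam)
    n = len(data)
    counts = [0] * (k + 1)
    for x in data:
        counts[min(x, k)] += 1
    return n * sum((c / n - pj) ** 2 / pj
                   for c, pj in zip(counts, pois_probs(lam, k)))

def pois_sample(lam, n, rng):
    # inverse-transform Poisson sampler (standard library only)
    out = []
    for _ in range(n):
        u, j = rng.random(), 0
        pj = math.exp(-lam)
        cum = pj
        while u > cum and j < 1000:
            j += 1
            pj *= lam / j
            cum += pj
        out.append(j)
    return out

def adequate_lambdas(data, grid, k=8, alpha=0.95, n_sim=500, seed=1):
    # those lambda for which the statistic lies below the simulated alpha-quantile under P_lambda
    rng = random.Random(seed)
    region = []
    for lam in grid:
        sims = sorted(chisq_stat(pois_sample(lam, len(data), rng), lam, k)
                      for _ in range(n_sim))
        if chisq_stat(data, lam, k) <= sims[int(alpha * n_sim) - 1]:
            region.append(lam)
    return region
```

For data concentrated around 2 the returned set will typically contain values near λ = 2 and exclude distant ones; in contrast to (1), every λ in the set is individually consistent with the data.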
The likelihood principle is based on not specifying which values of θ these are.

2.4. Approximation regions: an example.
A model P is an adequate approximation to data x_n if typical data sets generated under P look like x_n. To make this susceptible to mathematical analysis the term 'look like' must be expressible in numerical quantities. This may not always be possible or easy. An animal may be easily recognizable as a dog but it is not easy to give this a mathematical expression. If a model is required which gives data sets looking like the daily returns of the Standard and Poor's 500 index it is not clear how 'look like' can be defined. In the following it will be assumed that 'look like' has a precise mathematical expression.

The following is taken from [Davies, 2014]. Given a probability measure P a sample of size n generated under P will be denoted by X_n(P) = (X_1(P), ..., X_n(P)). Given further a family P of probability measures and a number α, 0 < α ≤ 1, an α-approximation region for the data x_n is defined by

(3)    A(x_n, α, P) = {P ∈ P : x_n ∈ E_n(P)}

where for each P ∈ P, E_n(P) denotes a subset of R^n such that

(4)    P(X_n(P) ∈ E_n(P)) = α.

The choice of the E_n(P) depends on the situation and has in general to be augmented by some form of regularization, for example: minimum Fisher models, number of local extremes, convexity constraints. These and further examples are to be found in [Davies, 2014].

The definition (3) makes no assumption that the data x_n were generated under some model P ∈ P. The interpretation is that A(x_n, α, P) specifies those models P for which x_n 'looks like' a 'typical sample' X_n(P) generated under P: typical samples X_n(P) lie in E_n(P) so that points x_n ∈ E_n(P) look like typical samples X_n(P).

As an example suppose P is the family of normal distributions N = {N(µ, σ) : (µ, σ) ∈ R × R_+}. An approximation region can be based on the mean, the variance, outliers and the distance of the empirical measure from the model N(µ, σ) as measured by the Kuiper metric.
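The construction which follows, statistics for the mean, the variance, outliers and the Kuiper distance calibrated by simulation, can be sketched in code. This is a minimal illustration under assumptions made here (simulation sizes, quantile splits, helper names), not the implementation of [Davies, 2014]:

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kuiper(y):
    # Kuiper distance between the empirical measure of y and N(0,1):
    # D+ + D-, the maximal deviations above and below the model cdf
    ys, n = sorted(y), len(y)
    dplus = max((i + 1) / n - norm_cdf(v) for i, v in enumerate(ys))
    dminus = max(norm_cdf(v) - i / n for i, v in enumerate(ys))
    return dplus + dminus

def stats_T(y):
    # the four statistics applied to y = (x - mu)/sigma
    n = len(y)
    return (math.sqrt(n) * abs(sum(y) / n),   # T1: mean
            sum(v * v for v in y),            # T2: variance
            max(abs(v) for v in y),           # T3: outliers
            kuiper(y))                        # T4: shape

def quantiles(n, atilde, n_sim=2000, seed=0):
    # simulate the T_i under i.i.d. N(0,1) samples to obtain their quantiles
    rng = random.Random(seed)
    sims = [stats_T([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(n_sim)]
    t1, t2, t3, t4 = (sorted(c) for c in zip(*sims))
    def q(vals, p):
        return vals[min(n_sim - 1, int(p * n_sim))]
    # T2 gets lower and upper quantiles; the others are one-sided
    return (q(t1, atilde), q(t2, (1 - atilde) / 2), q(t2, (1 + atilde) / 2),
            q(t3, atilde), q(t4, atilde))

def in_region(x, mu, sigma, qs):
    # membership of (mu, sigma) in the approximation region
    q1, ql2, qu2, q3, q4 = qs
    t1, t2, t3, t4 = stats_T([(v - mu) / sigma for v in x])
    return t1 <= q1 and ql2 <= t2 <= qu2 and t3 <= q3 and t4 <= q4
```

in_region can then be evaluated over a grid of (µ, σ) pairs; the adjustment of the level α̃ so that the region is an exact α-approximation region is described in the text below.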
More precisely put y_n = (x_n − µ)/σ and

(5)    T_1(y_n) = √n |mean(y_n)|,    T_2(y_n) = sum_{i=1}^n y_i²,
       T_3(y_n) = max_i |y_i|,       T_4(y_n) = d_ku(P(y_n), N(0, 1)),

where P(y_n) is the empirical measure based on y_n. Given α̃ one can determine quantiles q_1(α̃), q_l2(α̃), q_u2(α̃), q_3(α̃), q_4(α̃) such that

(6)    P(T_1(Y_n) ≤ q_1(α̃)) = α̃,    P(q_l2(α̃) ≤ T_2(Y_n) ≤ q_u2(α̃)) = α̃,
       P(T_3(Y_n) ≤ q_3(α̃)) = α̃,    P(T_4(Y_n) ≤ q_4(α̃)) = α̃,

where the Y_n are i.i.d. N(0, 1). The approximation region is

(7)    A(x_n, α, R × R_+) = {(µ, σ) : T_1(y_n) ≤ q_1(α̃), q_l2(α̃) ≤ T_2(y_n) ≤ q_u2(α̃),
                             T_3(y_n) ≤ q_3(α̃), T_4(y_n) ≤ q_4(α̃), y_n = (x_n − µ)/σ}

where α̃ is adjusted so that the region is indeed an α-approximation region. A reasonable starting value for α̃ is (3 + α)/4. This will lead to an effective value α* > α of α which can be determined by simulations. A better approximation can now be obtained by putting α̃ = (3 + 2α − α*)/4. For a normal sample of size n = 50 and α = 0.9 this gives α̃ ≈ 0.97 compared with the starting value of 0.975.

The following data give the quantity of copper in milligrams per litre in a sample of drinking water ([Davies, 2014]):

(8)    2.…, 2.…, … , 2.…    (27 measurements; see [Davies, 2014]).

The 0.9 approximation region A(x_n, 0.9, R × R_+) for this data set is shown in Figure 1. An approximation region for µ alone can be obtained by projecting A(x_n, α, N) onto the µ-axis:

(9)    A(x_n, α, R) = {µ : there exists some σ s.t. (µ, σ) ∈ A(x_n, α, R × R_+)}.

This is equivalent to projecting the approximation region of Figure 1 onto the x-axis. The result is the interval [1.…, …]. The 0.95 confidence interval for µ based on the t-statistic is the smaller interval [1.…, …]. For normally distributed data the confidence interval for µ will be smaller than the corresponding approximation interval. If the data are not normally distributed then the approximation interval can be smaller, indeed much smaller, than the confidence interval. This will be discussed in Section 2.6 below.

In (7) the same α̃ is used for all four functionals. There is no need for this. If for example the Kuiper distance is not regarded as important as the other features it can be given less weight in terms of a higher value of α̃.

2.5. Multiple p-values.

The approximation region (7) is based on the four statistics T_i, i = 1, . . . ,
4. For each parameter pair (µ, σ) each of the statistics T_i, i = 1, 3, 4, comes with the p-value

(10)    p_i(µ, σ) = 1 − P(T_i(Y_n) ≤ T_i(y_n)),

the statistic T_2 comes with the p-value

(11)    p_2(µ, σ) = 2 min(P(T_2(Y_n) ≤ T_2(y_n)), 1 − P(T_2(Y_n) ≤ T_2(y_n)))

where the Y_n are i.i.d. N(0, 1) and y_n = (x_n − µ)/σ. Thus each parameter pair (µ, σ) comes with four p-values attached, p_i(µ, σ), i = 1, . . . , 4. It belongs to the approximation region if and only if

(12)    p(µ, σ) = min(p_i(µ, σ), i = 1, . . . , 4) ≥ 1 − α̃.

As an example the pair (2.…, 0.…) has the p_i-values (0.…, 0.…, 0.…, 0.…).

Figure 1. The approximation region A(x_n, 0.9, R × R_+) for the data (8).

The multiple p-values associated with each parameter value stand in contrast to the usual definition of a p-value which uses only one statistic (see the quotation A).

2.6. Approximation and confidence regions.
At first sight the approximation region (3) can be interpreted as a confidence region. If the data x_n were indeed generated under some model P ∈ P then because of (4) we have

(13)    P(P ∈ A(X_n(P), α, P)) = α for all P ∈ P.

Such an interpretation however causes difficulties. Consider the family P = {N(µ, σ) : (µ, σ) ∈ R × R_+}. A standard confidence region for the 'true' value µ_0 of µ is based on the assumption that there is indeed a 'true' value µ_0 of µ. That is, the data were generated under N(µ_0, σ_0) for some σ_0. This assumption is not checked in the formal inference phase and consequently a confidence region for µ is never empty. The interpretation is that it is a measure of the precision with which µ_0 can be determined.

The approximation region (9) on the other hand is not based on the assumption that the data were indeed generated as i.i.d. N(µ, σ) for some (µ, σ). It specifies those µ-values, if any, for which N(µ, σ) is an adequate approximation to the data for some σ. Thus if the adequacy region (9) is small this simply means that there are few values of µ for which N(µ, σ) is an adequate approximation to the data for some σ. It is not a measure of precision. If one imagines the data gradually becoming less and less normal then the region (7) will become smaller and eventually will be the empty set. One way of doing this is to gradually increase one value of the sample until this value becomes incompatible with the feature T_3 of (5). As an example Figure 2 gives the 0.9 approximation region for the copper data of (8) but with the smallest observation of 1.7 being replaced by 1.5.

Figure 2. The approximation region A(x_n, 0.9, R × R_+) for the data (8) but with 1.7 replaced by 1.5.

If the 1.7 is replaced by 1.267 the approximation region as calculated has exactly one point (1.…, 0.…). Interpreting (9) as a confidence region leads to complications. As the data become less and less like Gaussian data the region becomes smaller and smaller, which is interpreted as an increase in precision. Thus on this interpretation replacing 1.7 by 1.267 in (8) leads to exact values for (µ, σ), namely (1.…, 0.…).

2.7. An empty approximation region.
Consider the approximation region (7) with α̃ = 0.975 corresponding to α ≈ 0.92. Simulations show that for normal samples of size 50 the approximation region is empty in about 0.7% of the cases. This value is based on 5000 simulations. It is much smaller than the 8% of the cases where the approximation region does not contain the (µ, σ) pair used to generate the data.

If the approximation region is empty, that is, the family P contains no model which is an adequate approximation, there may well be an interest in quantifying just how poor the approximation is. One way of doing this is to determine the smallest value of α, say α*, such that the approximation region is non-empty. The corresponding p-value is defined as p* = 1 − α*, which is a measure of the goodness of the approximation: the smaller the p-value the worse the approximation.

For the approximation region (7) it is always possible to calculate p* as the quantiles q can be calculated. If the quantiles were obtained by simulation and the approximation is poor then it may not be possible to calculate the p-values. An alternative is suggested in [Lindsay and Liu, 2009]. It is based on the idea that it is easier for there to be an adequate approximation if the sample size is small. Samples x*_m of size m are drawn from the original sample x_n and the approximation region calculated. The measure of the degree of approximation is the largest value of m for which the approximation region is not empty in 50% of the cases. An example is given in Chapter 3.8 of [Davies, 2014] for a sample of size n = 189. The family of models considered was the family of discretized gamma models and the concept of adequacy was based on the total variation metric. The fit was so poor that even for α = 0.… the largest size of the random subsamples for which there was an adequate approximation in 50% of the cases was approximately 40.

Figure 3. Upper panel: the p-values (12) plotted against the size of the largest observation for a normal sample of size n = 27. Lower panel: the number of points in the approximation region also plotted against the size of the largest observation.

2.8. A non-empty approximation region.
If the approximation region is defined by statistics as in (7) then for each model P the p-values for each of the inequalities are calculated and the minimum value taken. The maximum of these values taken over the approximation region is then a measure of the degree of approximation. Again, the smaller this value the poorer the approximation. The statistician can base the decision on whether to use the family of models P at least in part on this maximum p-value. A cut-off point of 0.2 implies that there are adequate models where the p-values of the functionals involved all exceed 0.2.

Figure 3 shows an example of this. A N(0, 1) sample of size n = 50 was generated and the largest observation 2.130 gradually increased in steps of 0.06875. The upper panel shows the p-values as a function of the size of this observation, the lower panel the number of points in the approximation region evaluated over a grid of parameter values. It can be interpreted as a proxy for the area of the region.

The quantiles, or the p-values, of the p-values can be obtained from simulations. For a normal sample of size n = 28 the 0.001, 0.01, 0.05 and 0.1 quantiles are 0.0713, 0.210, 0.324 and 0.393 respectively. These values are based on 10000 simulations. The p-value based on the normal family of models is 0.406 with a p-value (p-value of the p-value) of about 0.1. If the smallest observation 1.70 is removed the p-value becomes 0.835 which corresponds to a p-value of 0.96. If the smallest value is set to 1.4 the p-value of the p-values is about 0.001. This raises the question as to how bad an approximation can be whilst still basing an analysis on the family of distributions.

The following quotation from [Huber, 2011] is relevant in this context. It immediately precedes the quotation D.

(F) If a goodness-of-fit test rejects a model, we are left with many alternative actions. Perhaps we do nothing (and continue to live with a model certified to be inaccurate). Perhaps we tinker with a model by adding or adjusting a few more features (and thereby destroy its conceptual simplicity). Or we switch to an entirely different model, maybe one based on a different theory, or maybe in the absence of such a theory, to a purely phenomenological one. In the opposite case, if a goodness-of-fit test does not reject a ...

2.9. p-values and hypotheses.

Consider the Gaussian family of models and the null hypothesis H_0 : µ ≥ µ_0.
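Such a one-sided hypothesis is conventionally assessed with the t-statistic, as in (14) below. A minimal sketch of that computation (standard library only; the exact t distribution, pt in R, is replaced here by its normal approximation, which is adequate for moderate n):

```python
import math
from statistics import NormalDist

def t_pvalue_one_sided(x, mu0):
    # p-value for H0: mu >= mu0 based on the t-statistic
    # sqrt(n) * (mean - mu0) / sd; small sample means count against H0.
    # Normal approximation used in place of the t_{n-1} distribution.
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    t = math.sqrt(n) * (m - mu0) / sd
    return NormalDist().cdf(t)
```

Applied to the copper data with µ_0 = 2.1 this would give a value of the same order as the one reported for (14); the figure quoted in the text uses the exact t distribution.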
The p-value defined by the t-statistic

(14)    pt( √n (mean(x_n) − µ_0) / sd(x_n), n − 1 )

is often used as a measure of the extent to which µ_0 is compatible with the data. This definition is not acceptable from the point of view of approximation as it does not specify any value for σ.

As an example consider the copper data (8). Suppose the legal limit is 2.1 milligrams per litre and we wish to test the hypothesis that this is exceeded. The Gaussian family of models will be used so that with the usual identification of the amount of copper in the water with µ the null hypothesis becomes

(15)    H_0 : µ ≥ 2.1.

The p-value as defined by (14) is 0.000436.

An equivalent definition of a standard p-value is the following. Given the p-value p* put α* = 1 − p*. Then α* is the smallest value of α such that the confidence region for µ contains µ_0. This can be used to define a p-value using the idea of an approximation region. This p-value is defined as p* = 1 − α* where α* is the smallest value of α such that the approximation region contains a point (µ_0, σ) for some σ. This is similar to the definition of a p-value for an empty approximation region given in Section 2.7. If this is done for the water data (8) using the approximation region (7) the resulting p-value is 0.045.

Replace now the smallest value 1.7 by 0.7. The p-value of (14) is now 0.015. At first sight this may seem surprising as the value 0.7 is less consistent with (15) than is 1.7. The reason is that the standard deviation is now 0.274 as against 0.116. The p-value based on the approximation region is 0.00018. The value of σ is 0.310. The reason is that the value 0.7 is essentially an outlier. This is picked up by the statistics T_3 and T_4 but not by the t-statistic. See the second Huber quotation (F). The outlyingness of 0.7 should have been detected in the EDA phase before moving on to the formal inference phase. This raises the question of how to react to the outlier.

3. p-values and functionals

The purpose of the copper measurements (8) is to give a point estimate of the amount of copper in the sample of drinking water combined with an interval of reasonable values. The mean and a confidence interval using a normal model give a reasonable solution for this particular data set, but there are problems.

One immediate question is why the Gaussian family and not the Laplace (double exponential) family? This question draws attention to the fact that the location-scale problem is ill-posed when density based methods, maximum likelihood or Bayes, are used. Some form of regularization is required. What are required are 'bland' or 'hornless' models (see Section 2, B is for Blandness, of [Tukey, 1993] and Chapter 1.3.6 of [Davies, 2014]). In the location-scale situation one possible form of regularization is to use minimum Fisher information models such as the Gaussian.

Another problem is to relate the parameters of the model to the real world. As the purpose of the copper data is to estimate the amount of copper in the water, simply estimating (in another sense of the word estimate) the parameters of a parametric model does not solve the problem. The parameters must be connected to the real world. For the Gaussian family this is not a problem as the canonical connection is to identify the location parameter µ with the actual amount of copper in the water. However this fails for the log-normal distribution, another minimum Fisher distribution. One can still associate the actual amount of copper with the mean but also with the median. This gives two different identifications for the same model.

The final problem is that of outliers. They are common in interlaboratory tests and any method of analysis must be able to deal with them. Neither the Gaussian, Laplace nor log-normal achieves this.

The path taken in Chapter 5 of [Davies, 2014] is to use M-functionals (see Chapters 4 and 5 of [Huber and Ronchetti, 2009]). Given a ψ-function ψ and a χ-function χ and a probability measure P over R the M-functional T_M is defined by T_M(P) = (T_L(P), T_S(P)) where T_L(P) and T_S(P) solve

(16)    ∫ ψ((x − T_L(P)) / T_S(P)) dP(x) = 0,    ∫ χ((x − T_L(P)) / T_S(P)) dP(x) = 0.

The functions ψ and χ can be so chosen that (i) (16) has a unique solution for all P with a largest atom of less than 0.5 and (ii) the functional T_M(P) is locally uniformly Fréchet differentiable in a Kolmogorov neighbourhood of P (see [Davies, 1998] and page 54 of [Hampel et al., 1986]). This gives stability of analysis with respect to P.
The functions used here are

(17)    ψ(u, c) = (exp(u/c) − 1) / (exp(u/c) + 1),    χ(u) = (u⁴ − 1) / (u⁴ + 1)

where c is a tuning constant set here to 5.

The connection with reality is achieved by identifying the amount of copper with T_L(P) for any adequate model P. The only form of adequacy required is that the model P is in a small Kolmogorov neighbourhood of the empirical distribution P_n of the data. This still leaves open the choice of T_M. This will be discussed at the end of the section.

Let X_n(P) denote a sample of size n of i.i.d. random variables with distribution P, and denote by q_ψ(·, n, P) the quantiles of

(1/√n) | sum_{i=1}^n ψ((X_i(P) − T_L(P)) / T_S(P)) |

with the corresponding definition of q_χ(·, n, P). Then an α-approximation region for the functional T_M is defined by

(18)    A(x_n, α, T_M) = { (T_L(P), T_S(P)) : d_ko(P_n, P) < qdk(α̃, n),
            √n | ∫ ψ((x − T_L(P)) / T_S(P)) dP_n(x) | ≤ q_ψ(α̃, n, P),
            √n | ∫ χ((x − T_L(P)) / T_S(P)) dP_n(x) | ≤ q_χ(α̃, n, P) }

where α̃ = (2 + α)/3, P_n denotes the empirical distribution of the data x_n, d_ko the Kolmogorov metric and qdk(·, n) its quantile function. The choice of α̃ corresponds to spending (1 − α)/3 on each of the three inequalities. As

E( ψ((X_i(P) − T_L(P)) / T_S(P)) ) = 0

and

E( ψ²((X_i(P) − T_L(P)) / T_S(P)) ) = ∫ ψ²((x − T_L(P)) / T_S(P)) dP(x)

it follows from the central limit theorem that

(1/√n) sum_{i=1}^n ψ((X_i(P) − T_L(P)) / T_S(P)) ≍ N( 0, ∫ ψ²((x − T_L(P)) / T_S(P)) dP(x) )

with the same result for χ.
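The defining equations (16), with ψ and χ as in (17), can be solved numerically on the empirical distribution. The following is a rough sketch using alternating bisections; the iteration scheme, tolerances and names are choices made here and not the algorithm of [Davies, 2014]:

```python
import math

def psi(u, c=5.0):
    # psi of (17): bounded, odd and increasing (equal to tanh(u/(2c)))
    if u > 40.0 * c:
        return 1.0
    if u < -40.0 * c:
        return -1.0
    e = math.exp(u / c)
    return (e - 1.0) / (e + 1.0)

def chi(u):
    # chi of (17): equals -1 at u = 0, increases with |u|, bounded by 1
    v = u ** 4
    return (v - 1.0) / (v + 1.0)

def _root(f, lo, hi):
    # bisection for a decreasing function f with f(lo) > 0 > f(hi)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def t_m(x, iters=50):
    # solve (16) on the empirical distribution by alternating bisections:
    # sum psi((x_i - m)/s) is decreasing in m,
    # sum chi((x_i - m)/s) is decreasing in s
    s = 1.0
    m = 0.0
    span = max(x) - min(x) + 1.0
    for _ in range(iters):
        m = _root(lambda t: sum(psi((v - t) / s) for v in x),
                  min(x) - 1.0, max(x) + 1.0)
        s = _root(lambda t: sum(chi((v - m) / t) for v in x),
                  1e-9, 10.0 * span)
    return m, s  # (T_L(P_n), T_S(P_n))
```

The returned pair is the location and scale at which a region such as (18) is centred; the calibration of q_ψ and q_χ does not depend on this solver.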
Thus asymptotically

(19)    q_ψ(α̃, n, P) ≈ qnorm(α̃) √( ∫ ψ²((x − T_L(P)) / T_S(P)) dP(x) )

with the same result for χ. As the random variables are bounded the normal approximation is good for small values of n.

The requirement d_ko(P_n, P) < qdk(α̃, n) forces P into a O(1/√n) Kolmogorov neighbourhood of P_n. This together with the locally uniform Fréchet differentiability implies q_ψ(α̃, n, P) ≈ q_ψ(α̃, n, P_n) (see pages 107-108 of [Davies, 2014]) and together with (19) it leads to the approximate approximation region

(20)    Ã(x_n, α, T_M) = { (T_L(P), T_S(P)) : d_ko(P_n, P) < qdk(α̃, n),
            √n | ∫ ψ((x − T_L(P)) / T_S(P)) dP_n(x) | ≤ qnorm(α̃) √(V_ψ(P_n)),
            √n | ∫ χ((x − T_L(P)) / T_S(P)) dP_n(x) | ≤ qnorm(α̃) √(V_χ(P_n)) }

where

V_ψ(P_n) = (1/n) sum_{i=1}^n ψ²((x_i − T_L(P)) / T_S(P)).

Figure 4. The 0.9 approximation region for (T_L, T_S) for the copper data using the ψ- and χ-functions of (17) with c = 5.

This approximation to (18) can be calculated over a grid of values. It is shown in Figure 4 for the copper data with α = 0.9. It may be compared with the approximation region based on the Gaussian distribution as shown in Figure 1. The approximation region for T_L(P) is obtained by projecting the approximation region onto the x-axis. For the copper data with α = 0.9 this gives the interval [1.…, …] compared with the interval [1.…, …] based on the t-statistic.

The approximation region (20) remains unchanged if the smallest observation 1.7 is replaced by zero. This is in sharp contrast to the approximation region based on the Gaussian family of models which is empty in this case. This is one example of the stability of analysis deriving from the use of T_M: small changes in the data, here a single data point, lead to only small changes in the result. It was pointed out above that the location-scale problem requires regularization. The use of the M-functional T_M is a regularization of the procedure, not of the models.

Hypothesis testing as in Section 2.9 can be done as follows. For the copper data the null hypothesis (15) is replaced by

H_0 : T_L(P) ≥ 2.1.

The p-value is p* = 1 − α* where α* is the smallest value of α such that (20) contains (2.1, σ) for some σ. Its value is p* = 0.… .

The M-functional used here is not the only one. There are many possible choices. Which one to use is an empirical question. A member of the committee which produced the German DIN standard ([DIN, 2003]) for analysing water, waste water and sludge reported that in his experience the median was better than the mean but worse than the mean after the elimination of outliers. The final decision was to use Hampel's redescending ψ-function (Example 1 on page 150 of [Hampel et al., 1986]) which can be seen as a smooth version of the mean after eliminating outliers.

4. Approximation and prediction

4.1. Prediction.
The concept of adequate approximation can be looked at in terms of prediction. Given a number α and based on a model P a prediction has to be made about a sample x_n. That the prediction is based on P means that if the sample were generated under P, that is x_n = X_n(P), then the prediction would be correct with probability α. In making the prediction it has to be decided which aspects of the data are regarded as important. In the definition of the approximation region (7) the important aspects are given by the statistics T_i, i = 1, . . . , 4. With P = N(µ, σ) the corresponding prediction is that all the inequalities of (6) will hold with y_n = (x_n − µ)/σ replacing Y_n. If the prediction is correct then the model N(µ, σ) is accepted as an adequate approximation to the data.

4.2. Jeffreys on p-values.

The following is often cited as an argument against the use of p-values:

(G) ... gives the probability of departures, measured in a particular way, equal to or greater than the observed set, and the contribution from the actual value is nearly always negligible. What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems to be a remarkable procedure. On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it. (page 385 of [Jeffreys, 1961])

Suppose the hypothesis is that the data follow the N(0,
1) distribution. Whatobservable results does this hypothesis predict? It seem pointless to predict a singlevalue as such a prediction would be wrong with probability 1. The prediction mustbe a set S of values with the prediction being regarded as correct if the observableresult x lies in S . Putting S = R results in the prediction being correct withprobability one but this is somewhat vacuous. A non-vacuous prediction can beobtained by specifying a probability α and a set S ( α ) such that the prediction is correct with probability α , P ( X ∈ S ( α )) = α . It is worthy of note that the larger α the more vacuous the prediction so to speak. As a simple example put α = 0 . S ( α ) = ( − . , .
96) and suppose that x = 3 .
121 is observed. The p -value is P ( | X | > . . α = 0 . α = 0 .
95. We now interpret ‘not predicted tooccur’ in the sense ‘predicted not to occur’ rather than in the sense ‘forgetting topredict’. If it were agreed beforehand that a false prediction would lead to the nullhypothesis to be rejected, then this is done because a value predicted not to occur,namely 3.121, did in fact occur. This seems an unremarkable procedure. How badthe prediction error is can be measured by the α = 0 . p -values and choice of covariates in stepwise regression The following is based on [Davies, 2016a]. Given a data set of size n consistingof a dependent variable y ( n ) and p ( n ) covariates x ( n ) the problem is to decidewhich if any of the covariates to include. The discussion below will be restricted tothe case where p ( n ) is chosen by stepwise regression but the idea can be extendedto considering all subsets of the covariates as long as p ( n ) is not too large, say p ( n ) ≤
20 (see [Davies, 2016b]).It would seem that all procedures for choosing the covariates are based on thestandard linear model(21) Y ( n ) = X ( n ) β ( n ) + ε ( n ) . The procedure to be described below is not based on this model. The basic idea isto compare the covariates x ( n ) with covariates which are simply standard Gaussianwhite noise. A covariate x j is included only if it is significantly better than whitenoise.Suppose that p ≤ n − S have already been been included in theregression and that the sum of squared residuals is ss . Denote by ss j the sum ofsquared residuals if the covariate x j with j / ∈ S is included. The next candidate forinclusion is that covariate for which ss j is smallest. Including this covariate leadsto a sum of squared residuals ss = min j / ∈S ss j . Replace all the covariates not in S in their entirety by standard Gaussian whitenoise. Let SS j denote the sum of squared residuals if the random covariate corre-sponding to x j is included. The inclusion of the best of the random covariates leadsto a sum of squared residuals SS = min j / ∈S SS j . The probability that the best random covariate is better than the best of the actualcovariates is P ( SS < ss ) = 1 − P ( SS ≥ ss ) = 1 − P (min j / ∈S SS j ≥ ss )= 1 − Y j / ∈S P ( SS j ≥ ss )It has been shown by Lutz D¨umbgen that(22) SS j D = ss (1 − B / , ( n − p − / )where B a,b denotes a beta random variable with parameters a and b and distributionfunction pbeta( · , a, b ). From this it follows that P ( SS j ≥ ss ) = pbeta(1 − ss /ss , / , ( n − p − / P ( SS ≤ ss ) = 1 − pbeta(1 − ss /ss , / , ( n − p − / p ( n ) − p . This is the p -value for the inclusion of the next covariate. The simplest procedureis to specify α < p -valueexceeds α . 
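As a minimal sketch, the p-value (23) and the threshold implicit in the stopping rule (24) can be computed in standard-library Python. The function names are illustrative, not from [Davies, 2016a]; the beta distribution function pbeta(·, 1/2, b) is evaluated by numerical integration (after the substitution t = sin²θ, which removes the singularity at 0) and its quantile qbeta by bisection, rather than by calling a statistics library (in R one would simply use pbeta and qbeta).

```python
import math

def pbeta_half(x, b, n_steps=2000):
    """CDF of Beta(1/2, b) at x. With t = sin^2(theta) the incomplete beta
    integral becomes (2 / B(1/2, b)) * integral_0^{asin(sqrt x)} cos(theta)^{2b-1} dtheta,
    a smooth integrand handled well by Simpson's rule."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    upper = math.asin(math.sqrt(x))
    h = upper / n_steps
    f = lambda t: math.cos(t) ** (2.0 * b - 1.0)
    s = f(0.0) + f(upper)
    for i in range(1, n_steps):
        s += (4.0 if i % 2 else 2.0) * f(i * h)
    integral = s * h / 3.0
    # B(1/2, b) via log-gamma to avoid overflow for large b
    log_B = math.lgamma(0.5) + math.lgamma(b) - math.lgamma(0.5 + b)
    return 2.0 * integral / math.exp(log_B)

def step_p_value(ss, ss1, n, p, p_n):
    """p-value (23): probability that the best of the p(n) - p pure-noise
    covariates reduces the residual sum of squares ss at least to ss1."""
    b = (n - p - 1) / 2.0
    return 1.0 - pbeta_half(1.0 - ss1 / ss, b) ** (p_n - p)

def stop_threshold(ss, n, p, p_n, alpha):
    """Right-hand side of the stopping rule (24): stop when ss1 exceeds
    ss * (1 - qbeta((1-alpha)^(1/(p(n)-p)), 1/2, (n-p-1)/2)). The beta
    quantile is found from pbeta_half by bisection."""
    b = (n - p - 1) / 2.0
    target = (1.0 - alpha) ** (1.0 / (p_n - p))
    lo, hi = 0.0, 1.0
    for _ in range(80):  # bisection for the quantile qbeta(target, 1/2, b)
        mid = 0.5 * (lo + hi)
        if pbeta_half(mid, b) < target:
            lo = mid
        else:
            hi = mid
    return ss * (1.0 - 0.5 * (lo + hi))
```

By construction, if ss_1 equals the right-hand side of (24) then the p-value (23) equals α, which provides a consistency check on the two formulas; the dimensions n = 72 and p(n) = 3571 of the leukemia example below give a concrete setting.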
Those covariates up to but excluding this last one are the selected ones. The stopping rule is

(24) ss_1 > ss (1 − qbeta((1 − α)^{1/(p(n)−p)}, 1/2, (n−p−1)/2))

where qbeta(·, a, b) is the quantile function of the beta distribution with parameters a and b.

The procedure is conceptually and algorithmically simple. It requires no regularization parameter, no cross-validation and no estimate of the error term in (21). It is invariant with respect to affine changes of unit of the covariates and equivariant with respect to a permutation of the covariates. It can be extended to non-linear parametric regression, robust regression and the Kullback-Leibler discrepancy where appropriate.

As an illustration consider the leukemia data of [Golub et al., 1999] consisting of n = 72 samples of tissue with p(n) = 3571 covariates. The dependent variable y(n) is either 0 or 1 depending on whether the patient suffers from acute lymphoblastic leukemia or acute myeloid leukemia. The first five genes in order of inclusion with their associated p-values as defined by (23) are as follows:

(25)
gene number  1182    1219     2888     1946     2102
p-value      0.0000  8.57e-4  3.56e-3  2.54e-1  1.48e-1

According to this the relevant genes are 1182, 1219 and 2888 and, given these, the remaining 3568 are no better than random noise. This applies to the gene 1946, but if a simple linear regression is performed using this gene alone, its p-value in the linear regression is 7.75e-9. This is much smaller than the 0.254 in (25). The p-value (23) takes into account the stepwise nature of the procedure, in particular that gene 1946 is the best of the remaining genes once the genes 1182, 1219 and 2888 have been included. A simple linear regression does not take this into account.

The data were gathered in the hope of using the gene expression data to classify the patients. If the classification is based on genes 1182, 1219 and 2888, a simple linear regression results in one misclassification. In [Dettling and Bühlmann, 2003] the authors considered 42 different classification schemes.
Only two of them resulted in a single misclassification. They used a 1-nearest-neighbour method based on 25 and 3571 genes. For this particular data set the procedure described above attains the same result and moreover specifies the relevant genes.

References

[DIN, 2003] (2003). DIN 38402-45:2003-09 German standard methods for the examination of water, waste water and sludge - General information (group A) - Part 45: Interlaboratory comparisons for proficiency testing of laboratories (A 45). Deutsches Institut für Normung.
[Birnbaum, 1962] Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57:269–326.
[Davies, 2014] Davies, L. (2014). Data Analysis and Approximate Models. Monographs on Statistics and Applied Probability 133. CRC Press.
[Davies, 2016a] Davies, L. (2016a). Stepwise choice of covariates in high dimensional regression. arXiv:1610.05131 [math.ST].
[Davies, 1995] Davies, P. L. (1995). Data features. Statistica Neerlandica, 49:185–245.
[Davies, 1998] Davies, P. L. (1998). On locally uniformly linearizable high breakdown location and scale functionals. Annals of Statistics, 26:1103–1125.
[Davies, 2008] Davies, P. L. (2008). Approximating data (with discussion). Journal of the Korean Statistical Society, 37:191–240.
[Davies, 2016b] Davies, P. L. (2016b). Functional choice and non-significance regions in regression. arXiv:1605.01936 [math.ST].
[Dettling and Bühlmann, 2003] Dettling, M. and Bühlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics, 19(9):1061–1069.
[Golub et al., 1999] Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., and Lander, E. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(15):531–537.
[Greenland et al., 2016] Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016). Statistical tests, p-values, confidence intervals, and power: A guide to misinterpretations. The American Statistician, Volume 70. Supplemental material to 'The ASA's Statement on p-Values: Context, Process and Purpose'.
[Hampel et al., 1986] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
[Huber, 2011] Huber, P. J. (2011). Data Analysis. Wiley, New Jersey.
[Huber and Ronchetti, 2009] Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics. Wiley, New Jersey, second edition.
[Jeffreys, 1961] Jeffreys, H. (1961). Theory of Probability. Oxford Classic Texts in the Physical Sciences. Oxford University Press, third edition.
[Lindsay and Liu, 2009] Lindsay, B. and Liu, J. (2009). Model assessment tools for a model false world. Statistical Science, 24(3):303–318.
[Müller, 1974] Müller, D. W. (1974). Thesen zur Didaktik der Mathematik. Math. phys. Semesterberichte, N.F., 21:164–169.
[Tukey, 1993] Tukey, J. W. (1993). Issues relevant to an honest account of data-based inference, partially in the light of Laurie Davies's paper. Princeton University, Princeton.
[Wasserstein and Lazar, 2016] Wasserstein, R. L. and Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose.