Bayesian selective inference: sampling models and non-informative priors
Daniel G. Rasines and G. Alastair Young
Department of Mathematics, Imperial College London
August 12, 2020
Abstract
We discuss Bayesian inference for parameters selected using the data. We argue that, in general, an adjustment for selection is necessary in order to achieve approximate repeated-sampling validity, and discuss two issues that emerge from such adjustment. The first concerns a potential ambiguity in the choice of posterior distribution. The second concerns the choice of non-informative prior densities that lead to well-calibrated posterior inferences. We show that non-informative priors that are independent of the sample size tend to overstate regions of the parameter space with low selection probability.
Keywords: Bayesian inference; conditioning on selection; prior; selective inference.
1 Introduction

We consider the problem of providing Bayesian inference in situations where the parameter on which inference is performed is selected using the same data. A typical example of this situation involves N independent observations from distributions with N different parameters. If N is large, we may decide to provide inference only for those parameters that appear to be more interesting based on the sample. For instance, we may want to study only the parameters that gave the largest K sample means, for some K < N.

A crucial realisation in statistical inference is that, in repeated sampling from a model involving selection, the distribution of the data corresponding to the experiments that lead to a specific parameter being selected depends, in general, on the selection mechanism. This issue has clear implications in the frequentist paradigm and has been extensively investigated in the literature. Of the techniques proposed to counteract selection effects, one that has received considerable attention due to its firm theoretical justification is the so-called conditional approach, which advocates basing inference for selected parameters on the conditional distribution of the data given selection; see Fithian et al. (2017) and Kuffner and Young (2018).

The Bayesian standpoint regarding selection is less clear. The classical view is that inference should not be altered by selection. The argument is that, since Bayesian inference operates conditionally on the observed data, the recognition that a different realisation could have led to a different inferential problem should have no effect on the analysis.
The settings we are going to consider in this work can be formalised as follows. Suppose we have data Y whose distribution has a density or probability mass function $f(y; \theta)$, $y \in \mathcal{Y}$, known up to a finite-dimensional parameter $\theta \in \Theta$, and that there exists a potential parameter of interest, $\psi \equiv \psi(\theta)$, that may be selected for future study after observing the data, possibly by an artificially randomised procedure. More precisely, we assume that there exists a function $p(y)$ taking values in [0, 1] and determined prior to the data collection, such that, having observed Y = y, inference on ψ is performed with probability $p(y)$. The function $p(y)$, which fully characterises the selection mechanism, will be referred to as the selection function. The following examples illustrate this framework.

Example 1.1. Let $Y_1 \sim N(\theta, 1)$. We wish to provide inference for θ if we suspect that it is greater than zero. To determine this we run the test $Y_1 > 2$ and, if $y_1 > 2$, we collect a second sample $Y_2 \sim N(\theta, 1)$. We therefore have $Y = (Y_1, Y_2)$ and $p(y_1, y_2) = \mathbb{1}(y_1 > 2)$, where we use $\mathbb{1}(A)$ to denote the indicator function of the event A.
Example 1.2. There are N treatments for a given disease, with unknown effects $\theta_1, \ldots, \theta_N$. These treatments are tested, and the most promising one is put forward for further study. In the testing stage we get to observe $Y_i = \theta_i + \varepsilon_i$ for $i = 1, \ldots, N$, where the errors are independent and identically distributed according to a certain known distribution. After this, a second test is carried out on the parameter that gave the largest $y_i$, which we may assume to be the first one without loss of generality. Denote this second sample by $Y'$. At the end, the whole data vector is $Y = (Y_1, Y', Y_2, \ldots, Y_N)$, and the selection function is $p(y_1, y', y_2, \ldots, y_N) = \mathbb{1}(y_1 = \max\{y_i\})$.

Example 1.3. Let $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times m}$ denote a response vector and a matrix of covariates, respectively. Suppose that from the original set of covariates, only a small number are selected for inclusion in a regression model, and denote the design matrix which includes only the selected covariates by $X'$. Assume also that we decide to analyse the fixed-design linear model $Y \sim N(X'\beta, \sigma^2 I_n)$, with β the parameter of interest and $\sigma^2$ a nuisance parameter. In this case, we decide to provide inference for $\psi = \beta$ only upon observing a particular event $y \in E$, comprised of all data points that would have led to the selection of that particular set of covariates, given the original design matrix X. Therefore, the selection function is $p(y) = \mathbb{1}(y \in E)$.

Different approaches to inference lead to two opposing views as to the correct analysis of the data in the presence of selection. On the one hand, frequentist methods evaluate the accuracy of inferential procedures with respect to the sampling distribution of the data. Since selection modifies the sampling distribution by favouring data points with higher selection probability, it is clear that this reported accuracy should be appropriately modified. For example, the mean squared error of an estimator should be evaluated with respect to the conditional distribution of the data given selection. On the other hand, subjective Bayesians adopt the view that, once the data has been observed, the recognition that a different realisation could have resulted in a different inferential problem, or in no problem at all, should have no effect on the inference. This discrepancy is discussed in detail by Dawid (1994).

In the frequentist literature, the first viewpoint has led to the development of the conditional approach to selective inference, which advocates that inference for ψ should be based on the conditional distribution of the data given the event that selection took place. In our notation, this distribution has density

$$f_S(y; \theta) = \frac{f(y; \theta)\, p(y)}{\varphi(\theta)}, \qquad (1)$$

where the normalising constant $\varphi(\theta)$ is the probability that ψ gets selected when θ is the true parameter. We will refer to the conditional distribution of Y given selection as the selective distribution.

Basing inference on (1) ensures that, under repeated sampling, if we only report inferences for those samples that get selected, probabilistic statements are well calibrated, in the sense that the error assessments are accurate. This approach has attracted much attention recently, particularly in the context of variable selection for linear models. Some notable references include Lockhart et al. (2014); Loftus and Taylor (2014); Lee and Taylor (2014); Lee et al. (2016); Tibshirani et al. (2018); and Hyun et al. (2018).
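To make (1) concrete, for Example 1.1 the normalising constant is $\varphi(\theta) = \Phi(\theta - 2)$, and the selective density can be evaluated directly. The following minimal Python sketch is our own illustration, assuming only scipy's standard `norm` API:

```python
import numpy as np
from scipy.stats import norm

def selective_density(y1, y2, theta, threshold=2.0):
    """Selective density (1) for Example 1.1: Y1, Y2 ~ N(theta, 1),
    with selection occurring when y1 > threshold."""
    if y1 <= threshold:
        return 0.0  # p(y) = 0: such a sample is never selected
    # phi(theta) = P_theta(Y1 > threshold) = Phi(theta - threshold)
    phi = norm.cdf(theta - threshold)
    return norm.pdf(y1, loc=theta) * norm.pdf(y2, loc=theta) / phi
```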
A unified theory of conditional selective inference is described in Fithian et al. (2017) and framed within Fisherian statistical thought by Kuffner and Young (2018).

Conditioning on selection can be understood in terms of information splitting. Let R be the random variable that takes the value 1 if ψ gets selected and 0 otherwise; that is, $R \mid \{Y = y\} \sim \mathrm{Bernoulli}(p(y))$. Following Fithian et al. (2017), the data-generating process of Y may be thought of as consisting of two stages. In the first stage, the value of R, r say, is sampled from its marginal distribution, and in the second, Y is sampled from the conditional distribution $Y \mid \{R = r\}$. Since it is R that determines whether we are going to provide inference for ψ or not, inference based on information revealed in stage two is necessarily free of any selection bias, because it removes the information about the parameter provided by R. From this viewpoint, conditioning on selection may be regarded as a refined form of data splitting, where R constitutes the "training" data.

Traditionally, the subjective Bayesian viewpoint has been that inference should be unaffected by selection. However, this view can sometimes be in conflict with the widely accepted principle that any mode of inference has to be well calibrated, in the sense that the reported accuracy of the inferences should approximately match the actual one in the long term; see Bayarri and Berger (2004). To avoid this issue, following the previous discussion, it is natural to consider posterior inference based only on the information provided by Y given R. If the density π(θ) represents our prior beliefs about θ, selection-free Bayesian inference may be provided via the posterior distribution with density

$$\pi_S(\theta \mid y) \propto \pi(\theta)\, f_S(y; \theta). \qquad (2)$$

We will throughout refer to posteriors of this form as selective posteriors. Inference based on selective posteriors allows the injection of prior information while avoiding potential problems arising from selection. Posterior densities of this type are discussed by Bayarri and DeGroot (1987) and Bayarri and Berger (1998). They also appear, in a different context, in Bayarri and Berger (2000), where they are referred to as "partial posterior densities".

More recently, Yekutieli (2012) has argued that posterior (2) is the "correct" posterior density in some contexts. Yekutieli identifies two types of parameter, according to how each is affected by selection. Assuming π(θ) is the genuine marginal density of the parameter, θ is said to be random when the joint sampling scheme for the parameter and data is such that pairs (θ, Y) are jointly sampled from $\pi(\theta) f(y; \theta)$ until selection occurs, and fixed when θ is sampled from its marginal distribution, held fixed, and Y is sampled from $f(y; \theta)$ until selection occurs (mixed parameters are also possible, but will not be discussed here). For a random parameter, the conditional distribution given Y = y is

$$\pi(\theta \mid y) \propto \pi(\theta \mid r)\, f(y \mid r; \theta) \propto \pi(\theta)\, f(y; \theta), \qquad (3)$$

while that for a fixed parameter is

$$\pi(\theta \mid y) \propto \pi(\theta)\, f(y \mid r; \theta) = \pi(\theta)\, f_S(y; \theta). \qquad (4)$$

Hence, in the first case selection does not have an effect on the posterior, but in the second it does, and the posterior coincides with (2). Panigrahi and Taylor (2018) and Panigrahi et al. (2020) have proposed approximations of (2) for regression models for which the selection probability $\varphi(\theta)$ is difficult to compute exactly.
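In one-parameter problems a selective posterior of the form (2) can be computed on a grid. The sketch below does this for Example 1.1; the N(0, 1) prior and the grid are our own illustrative choices, not part of the original analysis:

```python
import numpy as np
from scipy.stats import norm

theta = np.linspace(-5.0, 5.0, 2001)
y1, y2, threshold = 2.5, 0.8, 2.0             # a sample that was selected

log_prior = norm.logpdf(theta)                # illustrative N(0, 1) prior
log_lik_sel = (norm.logpdf(y1, loc=theta) + norm.logpdf(y2, loc=theta)
               - norm.logcdf(theta - threshold))   # log f_S(y; theta), eq. (1)

log_post = log_prior + log_lik_sel
post = np.exp(log_post - log_post.max())
post /= post.sum() * (theta[1] - theta[0])    # normalised density on the grid
```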
While the posterior densities (3) and (4) are formally correct given the respective sampling mechanisms, our view is that it is not clear that a parameter can be labelled as random or fixed without explicit consideration of this sampling process, which is to some extent arbitrary. However, this distinction is clarifying, in that it captures the two types of bias that can arise because of selection. The first occurs when the same parameter is analysed multiple times: if some of the analyses are not reported because the parameter did not appear to be significant in light of the data, the published reports about it will on average overstate this significance. The second is a bias in the parameter space: parameters that are in some way significant are more likely to be selected for inference than those which are not. Only the former bias, to which in principle any parameter is subject, is problematic regarding the correct calibration of the inferential statements. The motivation for using posterior (2) is that it counteracts this bias if it were ever to arise, and remains valid if it does not. To illustrate these ideas let us consider the following model.

Example 2.1. Let $Y_1, Y_2 \sim N(\theta, 1/n)$ independently, with n = 20 and selection event $Y_1 > n^{-1/2}$. The prior distribution of θ is the standard Laplace distribution. We consider three possible posterior distributions:

$$\pi_A(\theta \mid y) \propto \pi(\theta)\, f(y_1, y_2; \theta); \qquad \pi_B(\theta \mid y) \propto \pi(\theta)\, f(y_2; \theta); \qquad \pi_C(\theta \mid y) \propto \pi(\theta)\, f(y_1, y_2 \mid Y_1 > n^{-1/2}; \theta).$$

The first is the usual posterior, the second corresponds to using only the information provided by the second sample, and the third to using $Y_2$ and the information left in $Y_1$ after taking out the information provided by the random variable $\mathbb{1}(Y_1 > n^{-1/2})$. We generated $10^4$ pairs of (θ, Y) under the random-parameter sampling regime, and for each pair computed three credible intervals, from the 0.05 and 0.95 quantiles of each posterior. We know that the repeated-sampling coverage of $\pi_A$ is equal to the nominal one (0.9) by definition. The estimated coverages of $\pi_B$ and $\pi_C$ were both 0.89. Furthermore, the average lengths of the intervals were 0.52, 0.73, and 0.55, respectively, so, averaged over the prior, the loss of information resulting from conditioning on selection has a minimal impact on the power.

Now, if instead of averaging the results over the prior, we compute them for fixed values of the parameter, we observe the behaviour shown in Figure 1. For large values of θ, all possibilities give reasonably good coverage. However, if θ is small, corresponding to a lower selection probability, the actual coverage of $\pi_A$ (blue) is significantly lower than the nominal one, while the other two options remain well calibrated. This sharp behaviour makes the "true" posterior more sensitive to prior misspecification in a random-parameter regime. In addition, there is a gain in inferential power obtained by using $\pi_C$ (orange) instead of $\pi_B$ (green), as can be appreciated in the second plot. For small values of θ, the intervals derived from $\pi_C$ widen to account for the selection effect, but as θ increases, $\pi_C$ is able to shrink by using the leftover information available in $Y_1 \mid Y_1 > n^{-1/2}$, while $\pi_B$ does not. If θ is very large, selection has a negligible effect in the model and the results of posteriors $\pi_A$ and $\pi_C$ are virtually identical. All figures were estimated from $10^4$ repetitions.
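The coverage comparison above can be reproduced in spirit with a short Monte Carlo experiment. The following sketch estimates the coverage of $\pi_A$ and $\pi_C$ under the random-parameter regime; the grid posterior and the run length are our own simplifications:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 20
grid = np.linspace(-6, 6, 1201)
log_prior = -np.abs(grid)                 # standard Laplace prior, up to a constant

def credible_interval(log_post):
    p = np.exp(log_post - log_post.max())
    cdf = np.cumsum(p) / p.sum()
    return np.interp([0.05, 0.95], cdf, grid)

covered = {"A": 0, "C": 0}
done = 0
while done < 2000:
    theta = rng.laplace()                 # random-parameter regime
    y1, y2 = theta + rng.normal(size=2) / np.sqrt(n)
    if y1 <= n ** -0.5:
        continue                          # not selected: nothing is reported
    loglik = norm.logpdf(y1, grid, n ** -0.5) + norm.logpdf(y2, grid, n ** -0.5)
    log_sel = norm.logcdf(np.sqrt(n) * grid - 1)    # log P(Y1 > n^{-1/2}; theta)
    for key, lp in (("A", log_prior + loglik), ("C", log_prior + loglik - log_sel)):
        lo, hi = credible_interval(lp)
        covered[key] += (lo <= theta <= hi)
    done += 1

print({k: round(v / done, 3) for k, v in covered.items()})  # both near 0.9
```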
Figure 1: Coverage and average lengths of 0.9-credible intervals derived from posteriors $\pi_A$ (blue), $\pi_B$ (green), and $\pi_C$ (orange).

3 Ambiguity in the choice of selective distribution
The conditional approach to selective inference is usually presented as straightforward to apply from a conceptual perspective (though actually implementing it may be far from trivial): if we know the selection event, and are comfortable with a particular sampling model for the data, then the selective distribution (1) is unequivocally determined. In this section we argue that, just as the determination of an appropriate sampling distribution in classical statistics is generally not a trivial issue, neither, in some circumstances, is the appropriate choice of the conditional distribution for selective inference. To motivate the discussion, let us consider the following example.
Example 3.1. Let $Y_1, \ldots, Y_n$ be a random sample from a $N(\theta, 1)$ distribution, and suppose that the sample size is itself random, with probability mass function $f_N(n)$ supported on n > 2. In the same spirit as in Example 1.1, we decide to split the data into two sets, of sizes $n_1 = [n/2]$ and $n_2 = n - n_1$, say, and use the first of them to decide whether θ is likely to be greater than zero, by checking the condition $y_1 + \cdots + y_{n_1} > n_1^{1/2}$. In the joint model for $(N, Y_1, \ldots, Y_N)$, with selection function $p(n, y_1, \ldots, y_n) = \mathbb{1}(y_1 + \cdots + y_{n_1} > n_1^{1/2})$, the conditional distribution of the full data given selection has density

$$f_S(n, y_1, \ldots, y_n; \theta) = \frac{f_N(n)\, f(y_1, \ldots, y_n \mid n; \theta)\, p(n, y_1, \ldots, y_n)}{\sum_{\tilde n > 2} f_N(\tilde n)\, \Phi(\theta\, \tilde n_1^{1/2} - 1)},$$

where Φ denotes the standard normal distribution function and $\tilde n_1 = [\tilde n / 2]$. In this model the sample size is not independent of θ. Its density is given by

$$f_S(n; \theta) = \frac{f_N(n)\, \Phi(\theta\, n_1^{1/2} - 1)}{\sum_{\tilde n > 2} f_N(\tilde n)\, \Phi(\theta\, \tilde n_1^{1/2} - 1)}.$$

The intuitive reason is that, if the true θ is smaller than zero, we have a higher chance of falsely concluding that θ > 0 when the sample size (and hence $n_1$) is small, and vice versa. Therefore, posterior inference based on this sampling model would make use of the sample size distribution, and would arrive at a different conclusion had the sample size been decided in advance by the statistician rather than random, which appears counter-intuitive. Instead, it is more reasonable (and would presumably be done in practice without thinking about it) to work with the model

$$f_S(n, y_1, \ldots, y_n; \theta) = \frac{f_N(n)\, f(y_1, \ldots, y_n \mid n; \theta)\, p(n, y_1, \ldots, y_n)}{\Phi(\theta\, n_1^{1/2} - 1)}.$$

This model conditions on selection after the sample size has been observed, which therefore remains independent of the parameter after conditioning. This reflects more closely the selection process, since the selection test was designed to achieve a certain significance level conditionally on n, for any possible value of n. That is, the information about θ in the selection step is interpreted according to the conditional model $Y_1, \ldots, Y_N \mid N$.

The analysis of Example 3.1 suggests that in models that admit ancillary statistics, that is, statistics with distribution not depending on the parameter, it may be more coherent to condition on selection after conditioning on the observed value of the ancillary. Let us suppose that the sampling model is such that the data Y can be reduced to a minimal sufficient statistic (T, A), where A is ancillary. In such circumstances, the conditionality principle asserts that the information provided by Y about θ is equivalent to that provided by T given the observed value of A; see Birnbaum (1962). If the selection process complies with this principle, information about θ in the selection stage ought to be interpreted via the conditional distribution $T \mid A$, and thus it is more reasonable to define the selective density as

$$f_S(t, a; \theta) = \frac{f(a)\, f(t \mid a; \theta)\, p(t, a)}{\varphi(\theta; a)}$$

rather than

$$f_S(t, a; \theta) = \frac{f(a)\, f(t \mid a; \theta)\, p(t, a)}{\varphi(\theta)},$$

where $p(t, a)$ is the selection probability given $(T, A) = (t, a)$, independent of θ by sufficiency, and $\varphi(\theta; a) = \int_t f(t \mid a; \theta)\, p(t, a)\, \mathrm{d}t$ is the selection probability given A = a when θ is the true parameter.
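Returning to the first model of Example 3.1, the dependence of the selected sample size on θ is easy to see numerically. In the sketch below, $f_N$ is taken to be a Poisson(20) pmf restricted to n > 2, an arbitrary choice of ours for illustration:

```python
import numpy as np
from scipy.stats import norm, poisson

# Selective sample-size density f_S(n; theta) from Example 3.1.
def f_S_n(theta, n_max=60):
    ns = np.arange(3, n_max + 1)
    f_N = poisson.pmf(ns, mu=20)                 # illustrative choice of f_N
    n1 = ns // 2
    sel = norm.cdf(theta * np.sqrt(n1) - 1)      # Phi(theta * n1^{1/2} - 1)
    w = f_N * sel
    return ns, w / w.sum()

ns, p_neg = f_S_n(-0.5)
_, p_pos = f_S_n(+0.5)
print(ns @ p_neg, ns @ p_pos)   # mean selected sample size shifts with theta
```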
Example 3.2. Let $Y_1, \ldots, Y_n$ be a random sample from a location model with density $f(y_i; \theta) = g(y_i - \theta)$, $\theta \in \mathbb{R}$. It is widely agreed that sample evidence about θ should be interpreted via the conditional distribution of $\hat\theta$ given A, where $\hat\theta$ is the maximum likelihood estimator of θ and $A = (Y_1 - \hat\theta, \ldots, Y_n - \hat\theta)$ is the configuration ancillary. The density of $\hat\theta \mid \{A = a\}$ admits the simple expression

$$f(z \mid a; \theta) = c(\theta, a) \prod_{i=1}^{n} g(z + a_i - \theta),$$

where $c(\theta, a)$ is a normalising constant. Suppose θ is selected for inference if the p-value

$$u(\hat\theta, a) = c(0, a) \int_{\hat\theta}^{\infty} \prod_{i=1}^{n} g(z + a_i)\, \mathrm{d}z$$

is below some level α. Then, the selective density of $(\hat\theta, A)$ should be

$$f_S(z, a; \theta) = \frac{f(a)\, c(\theta, a) \prod_{i=1}^{n} g(z + a_i - \theta)\, p(z, a)}{P_\theta(u(\hat\theta, a) \le \alpha \mid a)},$$

where $p(z, a) = \mathbb{1}(u(z, a) \le \alpha)$.

A parallel case can be made about some models involving nuisance parameters. Suppose that $\theta = (\psi, \chi)$, where ψ is the potential parameter of interest and χ is a nuisance parameter. In these models it is sometimes possible to identify a minimal sufficient statistic (T, A) such that the distribution of $T \mid A$ is independent of χ and the distribution of A is independent of ψ. In such cases, a natural extension of the conditionality principle asserts that information about ψ should be interpreted as coming from the conditional model $T \mid A$; see Cox and Hinkley (1974). As before, if the selection process interprets the sample information about ψ via the conditional distribution $T \mid A$, we argue that it is more natural to define the selective density as

$$f_S(t, a; \theta) = \frac{f(a; \chi)\, f(t \mid a; \psi)\, p(t, a)}{\varphi(\psi; a)}$$

rather than

$$f_S(t, a; \theta) = \frac{f(a; \chi)\, f(t \mid a; \psi)\, p(t, a)}{\varphi(\psi, \chi)}.$$

Here we are using the same notational convention as before. The number of settings that admit such a convenient decomposition is admittedly fairly small, but a class of them arises frequently in selective inference, namely those in which one has independent datasets relative to different parameters and chooses which parameter, or parameters, to provide inference for on the basis of some performance statistic. For example, we may choose the parameter that yielded the largest sample mean. The following example is a simple instance of such a scenario, but the discussion applies in more general settings.
Example 3.3. Consider the setting of Example 1.2, and as before let us assume that the selected parameter for the given sample is $\theta_1$. We can identify two natural sampling mechanisms that are consistent with selection of the first mean. In the first one, the whole vector $(Y_1, \ldots, Y_N)$ is sampled until $Y_1$ is observed to be the maximum. The corresponding density of this generative process is

$$f_S(y_1, y', y_2, \ldots, y_N; \theta) = \frac{f(y'; \theta_1) \prod_{i=1}^{N} f(y_i; \theta_i)\, \mathbb{1}(y_1 = \max\{y_i\})}{P_\theta(Y_1 > Y_i\ \forall\, i > 1)}. \qquad (5)$$

The second sampling mechanism is that in which $(Y_2, \ldots, Y_N)$ is sampled from its unconditional distribution, and, conditionally on its observed value, $Y_1$ is sampled until it exceeds $\max(y_2, \ldots, y_N)$. The density of the data under this model is

$$f_S(y_1, y', y_2, \ldots, y_N; \theta) = \frac{f(y'; \theta_1) \prod_{i=1}^{N} f(y_i; \theta_i)\, \mathbb{1}(y_1 = \max\{y_i\})}{P_\theta(Y_1 > Y_i\ \forall\, i > 1 \mid y_2, \ldots, y_N)}. \qquad (6)$$

In the first model, conditioning on selection breaks the independence structure of the data. As a consequence, the observations from the non-selected parameters depend on $\theta_1$, and the marginal distribution of $Y_1$ depends on $\theta_2, \ldots, \theta_N$. This makes manipulation of the likelihood function awkward and computationally expensive if N is large, despite the apparent simplicity of the problem. One could nevertheless argue that it is appropriate to condition on the observed values of $Y_2, \ldots, Y_N$, on the basis that they depend only mildly on the parameter of interest. However, if selection is used to determine whether $\theta_1$ is the largest $\theta_i$, and since the observed values of $Y_2, \ldots, Y_N$ do not provide any direct information about this fact, preference for (6) over (5) can be justified without resorting to computational considerations.

This argument can also be invoked to justify conditioning on the design matrix in a regression problem such as that of Example 1.3. In a fixed-design setting, if the variable-selection algorithm operates conditionally on the design matrix, the selection event should be conditioned on after conditioning on the design matrix, so the selective analysis is also a fixed-design one.

4 Non-informative priors

Non-informative priors allow the derivation of posterior distributions without explicitly incorporating any prior information about the parameter. They serve different purposes: the resulting posterior can be employed as a reference against which posteriors derived from subjective priors can be compared in order to assess their impact; they can be used to derive frequentist methods that retain some of the appealing properties of Bayesian methods, such as good conditional performance; or they allow an analysis to be carried out when very little prior information is available. A common requirement for these priors is that they lead to posterior claims about the parameter of interest that hold up to a repeated-sampling scrutiny from the data-generating model. More precisely, they are required to satisfy

$$P_\theta\{\psi \le \Pi^{-1}(\alpha \mid Y)\} = \alpha + \varepsilon(\alpha, \theta), \qquad (7)$$

where $\Pi(\psi \mid Y)$ is the marginal posterior distribution function of ψ and $\varepsilon(\alpha, \theta)$ is small. In the context of conditional selective inference, condition (7) is required to hold with respect to the conditional distribution of the data given selection.

In non-selective regimes involving independent and identically distributed observations from a regular model, this quantile-matching requirement holds with an error decreasing at rate $n^{-1/2}$ for essentially any smooth prior, where n is the sample size.
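The matching condition (7) can be checked by simulation for any candidate prior. The helper below is our own generic scaffolding; the two function arguments are user-supplied hooks, for instance a rejection sampler from the selective model and a grid-based posterior quantile routine:

```python
import numpy as np

def matching_error(sample_selective, posterior_quantile, theta, alpha,
                   reps=4000, seed=0):
    """Monte Carlo estimate of the calibration error in (7) under the
    selective model: P_theta{theta <= Pi^{-1}(alpha | Y) | selection} - alpha."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        y = sample_selective(theta, rng)       # one dataset, given selection
        hits += (theta <= posterior_quantile(y, alpha))
    return hits / reps - alpha
```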
In addition, certain priors, known as probability-matching priors, improve on this error rate, lowering it to $n^{-1}$ in general, and to $n^{-3/2}$ for some distributions; see Datta and Mukerjee (2004). However, these priors are typically developed in settings with normal asymptotic behaviour, and selective models are not in general asymptotically normal.

In the selective inference context, it is argued in Yekutieli (2012) that non-informative priors should not be altered in the presence of selection. So, for example, for a selective normal-location model, the improper uniform prior $\pi(\theta) \propto 1$ would still be used. The following result shows that this prior is poorly calibrated under the selective model.
Lemma 1. Let $Y \sim N(\theta, \sigma^2)$, with $\sigma > 0$ known, and $p(y) = \mathbb{1}(y > t)$ for some fixed $t \in \mathbb{R}$, and let $\Pi(\theta \mid Y)$ be the selective posterior distribution function given Y based on the uniform prior $\pi(\theta) \propto 1$. Then

$$P_\theta\{\theta \le \Pi^{-1}(\alpha \mid Y) \mid S\} < \alpha \quad \forall\, (\alpha, \theta) \in (0, 1) \times \mathbb{R},$$

and $\Pi^{-1}(\alpha \mid Y)$ does not have a first-order moment for any $0 < \alpha \le 1/2$ and any true θ.

Generalising from the proof of Lemma 1, it can be shown that in a conditional normal-location model exact probability matching can only be achieved with a data-dependent prior. Consider the model $Y \sim N(\theta, \sigma^2)$ with an arbitrary selection function $p(y)$, where $\sigma > 0$ is known. The p-value function

$$H(\theta, Y) = \frac{\int_Y^{\infty} \phi(\sigma^{-1}(\tilde y - \theta))\, p(\tilde y)\, \mathrm{d}\tilde y}{\int_{-\infty}^{\infty} \phi(\sigma^{-1}(\tilde y - \theta))\, p(\tilde y)\, \mathrm{d}\tilde y}$$

is uniformly distributed over (0, 1) when θ is the true parameter value, where φ is the standard normal density. Furthermore, for any observed value y in the interior of the support of $p(y)$, it is easy to check that $\theta \mapsto H(\theta, y)$ is a distribution function. Hence, if we differentiate it with respect to θ and divide it by the likelihood, we obtain a formal data-dependent prior density that gives exact quantile matching when combined with the likelihood. The resulting prior is

$$\pi_y(\theta) \propto -\frac{\frac{\partial}{\partial\theta} H(\theta, y)}{\frac{\partial}{\partial y} H(\theta, y)} \propto \frac{\int_y^{\infty} \left\{\sigma^{-2}(\tilde y - \theta) - \frac{\partial}{\partial\theta} \log \varphi(\theta)\right\} \phi(\sigma^{-1}(\tilde y - \theta))\, p(\tilde y)\, \mathrm{d}\tilde y}{\phi(\sigma^{-1}(y - \theta))}, \qquad (8)$$

where, as usual, $\varphi(\theta) = \int_{-\infty}^{\infty} \sigma^{-1} \phi(\sigma^{-1}(\tilde y - \theta))\, p(\tilde y)\, \mathrm{d}\tilde y$. Furthermore, this is the only sensible prior that satisfies the quantile-matching requirement for every θ. This follows from the matching equation

$$P_\theta\{\Pi(\theta \mid Y) \le \alpha \mid S\} = \alpha, \qquad \alpha \in (0, 1).$$

Since the model is stochastically increasing in θ, $\Pi(\theta \mid y)$ has to be a decreasing function of y for every θ. Denoting its inverse with respect to y by $l_\theta(\alpha)$, we have that

$$P_\theta\{Y \ge l_\theta(\alpha) \mid S\} = H\{\theta, l_\theta(\alpha)\} = \alpha, \qquad \alpha \in (0, 1).$$

Setting $\alpha = \Pi(\theta \mid y)$ gives the equality. Priors of the form (8) are discussed in Fraser et al. (2010).
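For the one-sided rule $p(y) = \mathbb{1}(y > t)$ with σ = 1, one has $H(\theta, y) = \Phi(\theta - y)/\Phi(\theta - t)$, and (8) can be approximated by numerical differentiation. The following sketch is our own illustration (a closed-form expression for this case appears further below):

```python
import numpy as np
from scipy.stats import norm

def H(theta, y, t=0.0):
    """H(theta, y) for p(y) = 1(y > t), sigma = 1."""
    return norm.cdf(theta - y) / norm.cdf(theta - t)

def matching_prior(theta, y, t=0.0, eps=1e-5):
    """Unnormalised matching prior (8) via central finite differences."""
    dH_dtheta = (H(theta + eps, y, t) - H(theta - eps, y, t)) / (2 * eps)
    dH_dy = (H(theta, y + eps, t) - H(theta, y - eps, t)) / (2 * eps)
    return -dH_dtheta / dH_dy

vals = matching_prior(np.linspace(-3, 3, 601), y=1.0)  # prior values at y = 1
```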
In general, these priors do not admit a simple closed expression, and evaluating them requires numerical evaluation of several integrals. This is naturally not a problem for inference in this particular model, as we could use $H(\theta, y)$ directly, but it may be problematic when generalising to other settings. To this end, we also consider the Jeffreys prior of the conditional model, given by

$$\pi_J(\theta) \propto \left\{\sigma^{-2} + \frac{\partial^2}{\partial\theta^2} \log \varphi(\theta)\right\}^{1/2}. \qquad (9)$$

As we will see, this prior provides good matching properties and is easier to evaluate in some contexts.

As a basic class of selective inference models, let us consider settings involving n independent observations $Y_i^{(j)} \sim N(\theta, 1)$, $i = 1, \ldots, n_j$, $j = 1, 2$, with $n_1 + n_2 = n$, where the first set of observations, corresponding to j = 1, is used for selection, in the sense that θ gets selected if and only if $(Y_1^{(1)}, \ldots, Y_{n_1}^{(1)}) \in E \subseteq \mathbb{R}^{n_1}$ for some prespecified event E, and the rest of the observations are used only in the inferential stage, should selection occur.
As in the non-selective case, the conditional model $(Y_1^{(1)}, \ldots, Y_{n_1}^{(1)}, Y_1^{(2)}, \ldots, Y_{n_2}^{(2)}) \mid \{(Y_1^{(1)}, \ldots, Y_{n_1}^{(1)}) \in E\}$ can be reduced by sufficiency to the selective model involving the single observation $Y = \gamma \bar Y^{(1)} + (1 - \gamma) \bar Y^{(2)} \sim N(\theta, n^{-1})$ and the selection function $p(y) = P\{(Y_1^{(1)}, \ldots, Y_{n_1}^{(1)}) \in E \mid Y = y\}$, where $\gamma = n_1/n$ and $\bar Y^{(j)}$ is the sample average of the j-th set of observations. Note that the case γ = 1 corresponds to a situation where we have access to all the data during the selection stage, while the case 0 < γ < 1 corresponds to a form of data splitting. For the selection event $\bar Y^{(1)} > t$, for some threshold t, the probability-matching prior (8) is given by

$$\pi_y(\theta) \propto 1 - \frac{h_1(n^{1/2}(\theta - t))}{h_1(n^{1/2}(\theta - y))}$$

when γ = 1, and by

$$\pi_y(\theta) \propto \gamma^{1/2}\, \frac{\phi(n_1^{1/2}(\theta - t))}{\phi(n^{1/2}(\theta - y))} \left\{1 - \Phi\left(\frac{n^{1/2}(y - \theta + \gamma(\theta - t))}{(1 - \gamma)^{1/2}}\right) - \frac{\int_y^{\infty} n^{1/2} \phi(n^{1/2}(\tilde y - \theta))\, p(\tilde y)\, \mathrm{d}\tilde y}{\Phi(n_1^{1/2}(\theta - t))}\right\} + \Phi\left(\frac{n^{1/2}(y - t)}{(1 - \gamma)^{1/2}}\right) \qquad (10)$$

when 0 < γ < 1.
The Jeffreys prior (9) is given by

$$\pi_J(\theta) \propto \left\{1 + \gamma\, h_2\left(n_1^{1/2}(\theta - t)\right)\right\}^{1/2},$$

where

$$h_1(x) = \frac{\phi(x)}{\Phi(x)}, \qquad h_2(x) = -x\, h_1(x) - h_1(x)^2$$

are the first two derivatives of $\log \Phi(x)$. The derivation of prior (10) can be found in the appendix; the other two cases follow directly from the definitions. Both priors depend on the sample size and the selection event, which suggests that this dependence is a necessary condition for achieving correct frequentist calibration. Figure 2 shows both priors, and the resulting posterior densities, for n = 20, γ = 0.5, t = 0, and y = 0. The uniform prior and its posterior density are also plotted for comparison. The posterior densities corresponding to the two proposed priors are virtually identical, a behaviour that extends to other choices of parameters and observation, which suggests that the Jeffreys prior is well calibrated from a frequentist viewpoint. Empirical evidence of this fact can be found in Table 1, which shows the estimated coverage of the credible intervals $(-\infty, \Pi^{-1}(\alpha \mid Y)]$ for this model with n = 20, t = 0, and different combinations of (θ, γ, α), computed numerically. The posteriors were constructed using the uniform prior (U) and the Jeffreys prior (J). The Jeffreys prior performs better in most cases, and is considerably superior to the uniform prior when all the data is used for selection (γ = 1).

Figure 2: Left: uniform prior (black), probability-matching prior for y = 0 (red), and Jeffreys prior (blue) for the normal model. Right: the resulting posteriors for y = 0 (the red density overlaps the blue one in this panel).

Table 1: Estimated coverages of $(-\infty, \Pi^{-1}(\alpha \mid Y)]$ for the normal-location model derived from the uniform (U) and Jeffreys (J) priors.

                     α = 0.05     0.1       0.25      0.5       0.75      0.9       0.95
                      U    J    U    J    U    J    U    J    U    J    U    J    U    J
θ = −0.5  γ = 0.75  .035 .047 .075 .096 .209 .247 .458 .502 .724 .754 .889 .903 .944 .952
          γ = 1     .020 .051 .041 .098 .110 .234 .250 .465 .448 .714 .637 .879 .738 .938
θ = 0     γ = 0.75  .038 .050 .077 .099 .199 .243 .427 .492 .693 .748 .872 .901 .935 .952
          γ = 1     .033 .056 .065 .108 .161 .251 .331 .477 .537 .710 .709 .868 .794 .929
θ = 0.5   γ = 0.75  .048 .052 .095 .103 .236 .255 .471 .501 .713 .744 .871 .893 .930 .945
          γ = 1     .048 .052 .096 .105 .237 .259 .469 .509 .702 .749 .852 .890 .908 .939

Consider now a more general setting. Suppose we have a random sample $Y_1, \ldots, Y_n$ from a one-parameter exponential family with density or mass function

$$f(y_i; \theta) = h(y_i) \exp\{s(y_i)\theta - A(\theta)\}.$$

We assume that the parameter space Θ is open and convex, and that $A''(\theta) > 0$ for all θ ∈ Θ. The maximum likelihood estimator of θ in the non-selective model, $\hat\theta = (A')^{-1}(n^{-1} \sum_{i=1}^{n} s(Y_i))$, is sufficient for θ without selection. Let us assume that the sample is divided into two sets of sizes $n_1$ and $n_2$, $(Y_1, \ldots, Y_{n_1})$ and $(Y_{n_1+1}, \ldots, Y_n)$, with respective maximum likelihood estimators $\hat\theta_1$ and $\hat\theta_2$, and that in the selection stage we only have access to the first set of samples. That is, selection occurs if $(Y_1, \ldots, Y_{n_1}) \in E_{n_1}$ for some measurable set $E_{n_1}$. Our goal is to devise a prior π(θ) that yields good frequentist calibration in the selective model

$$f_S(y_1, \ldots, y_n; \theta) = \frac{f(y_1, \ldots, y_n; \theta)\, \mathbb{1}\{(y_1, \ldots, y_{n_1}) \in E_{n_1}\}}{\varphi_n(\theta)}. \qquad (11)$$

Note that, by sufficiency, for any selection event $E_{n_1}$, the selection function of $\hat\theta_1$, defined as $p_{n_1}(z) = P((Y_1, \ldots, Y_{n_1}) \in E_{n_1} \mid \hat\theta_1 = z)$, is free of θ. Moreover, by the factorisation theorem, $\hat\theta$ is also sufficient for θ in the conditional model (11).

If $i(\theta) = -E_\theta[(\partial^2/\partial\theta^2) \log f(Y_i; \theta)] = A''(\theta)$ denotes the Fisher information of a single observation in the non-selective model, any function g(θ) satisfying $g'(\theta) = i(\theta)^{1/2}$ is known as a variance-stabilising transformation, since, by the delta method, $n^{1/2}\{g(\hat\theta) - g(\theta)\} \stackrel{d}{\longrightarrow} N(0, 1)$ as n → ∞. In particular, for i = 1, 2,

$$n_i^{1/2}\{g(\hat\theta_i) - g(\theta)\} \stackrel{d}{\longrightarrow} N(0, 1) \quad \text{as } n_i \to \infty.$$
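A single entry of Table 1 can be approximated along the following lines. The sketch below, with tuning constants of our own choosing, estimates the coverage at one (θ₀, γ, α) combination for the selective Jeffreys prior in the γ = 1 case, using a grid posterior:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, t, theta0, alpha = 20, 0.0, 0.0, 0.5
grid = np.linspace(-4, 4, 4001)

def h1(x): return np.exp(norm.logpdf(x) - norm.logcdf(x))
def h2(x): return -x * h1(x) - h1(x) ** 2

# Jeffreys prior (9) for gamma = 1: {1 + h2(sqrt(n)(theta - t))}^{1/2}
log_prior = 0.5 * np.log1p(h2(np.sqrt(n) * (grid - t)))

done = hits = 0
while done < 4000:
    y = theta0 + rng.normal() / np.sqrt(n)
    if y <= t:
        continue                                          # not selected
    log_post = (log_prior + norm.logpdf(y, grid, 1 / np.sqrt(n))
                - norm.logcdf(np.sqrt(n) * (grid - t)))   # selective likelihood
    p = np.exp(log_post - log_post.max())
    p /= p.sum()
    hits += (p[grid <= theta0].sum() <= alpha)   # theta0 below the alpha-quantile
    done += 1

print(hits / done)   # close to alpha for a well-calibrated prior
```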
Thus, for large values of $n_1$ and $n_2$, the original selective model is approximately equivalent to a selective normal-location model involving two independent observations, $Y_1 \sim N(g(\theta), n_1^{-1})$ and $Y_2 \sim N(g(\theta), n_2^{-1})$, and selection function $p(y_1, y_2) = p_{n_1}(g^{-1}(y_1))$. This heuristic suggests choosing as a non-informative prior for ψ = g(θ) one that is reliable in the limiting model. Denoting the chosen prior by $\pi_\psi(\psi)$, the resulting prior in the original parametrisation is given by $\pi_\theta(\theta) \propto i(\theta)^{1/2}\, \pi_\psi(g(\theta))$. In view of the previous discussion and the simulation results, it is natural to take $\pi_\psi(\psi)$ to be either the probability-matching prior (8) or the Jeffreys prior (9). Note that, in the absence of selection, both choices give $\pi_\psi(\psi) \propto 1$, and $\pi_\theta(\theta)$ specialises to the standard Jeffreys prior $\pi_\theta(\theta) \propto i(\theta)^{1/2}$.

As an example, consider providing non-informative Bayesian inference for a selected probability θ ∈ [0, 1] on the basis of two observations $Y_1 \sim \mathrm{Bin}(n_1, \theta)$ and $Y_2 \sim \mathrm{Bin}(n_2, \theta)$. Initially only $Y_1$ is observed, and if it satisfies $n_1^{-1} Y_1 \ge 0.5$, the second sample $Y_2$ is collected for inference. The variance-stabilising transformation of the maximum likelihood estimator in this model is $g(\theta) = \sin^{-1}(\theta^{1/2})$. The left panel of Figure 3 shows the standard, non-selective Jeffreys prior $\pi(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}$, and the two non-informative selective priors for $n_1 = 8$, $n_2 = 2$, $y_1 = 4$, and $y_2 = 1$. The right panel shows the corresponding posterior densities. As in the normal case, the two selective priors are almost identical and favour parameter values with large selection probability.
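A sketch of this construction for the binomial example follows; here the second derivative of the log selection probability is taken numerically on the ψ scale, a shortcut of ours rather than the closed-form route:

```python
import numpy as np
from scipy.stats import norm, binom

n1, n2 = 8, 2
n = n1 + n2
theta = np.linspace(0.001, 0.999, 999)
psi = np.arcsin(np.sqrt(theta))                  # variance-stabilising map

phi_sel = binom.sf(n1 / 2 - 1, n1, theta)        # P(Y1 >= n1/2) = P(Y1 >= 4)
d2 = np.gradient(np.gradient(np.log(phi_sel), psi), psi)

prior_psi = np.sqrt(np.maximum(n + d2, 0.0))     # (9) with sigma^{-2} = n
i_theta = 1.0 / (theta * (1.0 - theta))          # Fisher information, Bin(1, theta)
prior_theta = np.sqrt(i_theta) * prior_psi       # transformed back to theta scale
```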
More formally, let us consider an asymptotic setting in which the selection probability at the true parameter, $\varphi_n(\theta_0)$, is bounded away from zero as n increases. This assumption allows us to approximate the general conditional model by a normal conditional model around the true parameter asymptotically. Tian and Taylor (2018) show that consistent inference is possible in certain randomised settings even when $\varphi_n(\theta_0) \to 0$. In what follows we consider selection events of the form $\sum_{i=1}^{n_1} s(Y_i) > t_{n_1}$ for some threshold $t_{n_1} \in \mathbb{R}$, corresponding to a uniformly most powerful test for a one-sided hypothesis about θ, although the analysis extends trivially to similar selection procedures.

Figure 3: Left: non-selective Jeffreys prior (black), probability-matching prior for $y_1 = 4$ and $y_2 = 1$ (red), and Jeffreys prior (blue) for the binomial model. Right: the resulting posteriors (the red lines overlap the blue ones).

The following result states that, under the boundedness condition, the selection probability under the original model can be uniformly approximated by the selection probability under a normal model in $n^{-1/2}$-neighbourhoods of the true parameter, and that the selective distribution of the maximum likelihood estimator corresponding to the selection data is asymptotically selective normal. In what follows, $O_p(\cdot)$ and $o_p(\cdot)$ statements are relative to the selective distribution of the data. The proof of the result can be found in the appendix.
Lemma 2. Consider the selective exponential model (11) and an arbitrary parameter ψ = g(θ), with g continuously differentiable and strictly increasing in Θ. Denote by $\varphi_n^\psi(\psi) = \varphi_n(g^{-1}(\psi))$ the selection probability as a function of ψ, and let

$$\tilde\varphi_n^\psi(\psi) = P\left(N\left(\psi, \frac{1}{n_1\, i_\psi(\psi)}\right) > g\left\{(A')^{-1}\left(\frac{t_{n_1}}{n_1}\right)\right\}\right)$$

be the approximation of the selection probability derived by substituting the distribution of $\hat\psi_1 = g(\hat\theta_1)$ by its limit as $n_1 \to \infty$, where $i_\psi(\psi)$ denotes the single-observation Fisher information in the ψ-parametrisation. If $\varphi_n^\psi(\psi_0)$ is bounded away from zero as n → ∞, where $\psi_0 = g(\theta_0)$, then

$$\frac{\varphi_n^\psi(\psi_0 + n^{-1/2} t)}{\tilde\varphi_n^\psi(\psi_0 + n^{-1/2} t)} \to 1$$

as n → ∞, uniformly in $t \in [-M, M]$ for any M > 0 such that $[g^{-1}(\psi_0 - M), g^{-1}(\psi_0 + M)] \subseteq \Theta$, and $G_n(\hat\psi_1) \stackrel{d}{\longrightarrow} U(0, 1)$ conditionally on selection, where $\hat\psi_1 = g(\hat\theta_1)$ and $G_n$ is the distribution function of a $N(\psi_0, (n_1\, i_\psi(\psi_0))^{-1})$ distribution truncated to the interval $(g\{(A')^{-1}(n_1^{-1} t_{n_1})\}, \infty)$.

For the variance-stabilising parametrisation ψ, let us consider the posterior density of the local parameter $t = n^{1/2}(\psi - \psi_0)$. Lemma 2 allows us to analyse the asymptotic behaviour of model (11) via the language of convergence of experiments; see van der Vaart (1998), chapter 9. If $\varphi_n^\psi(\psi_0)$ is bounded away from zero, for any convergent sequence $t_n \to t$, a Taylor expansion of the non-selective component of the log-likelihood gives

$$\log \frac{f(Y_1, \ldots, Y_n; \psi_0 + n^{-1/2} t_n)}{f(Y_1, \ldots, Y_n; \psi_0)} = \gamma^{1/2}\, t\, n_1^{1/2}(\hat\psi_1 - \psi_0) + (1 - \gamma)^{1/2}\, t\, n_2^{1/2}(\hat\psi_2 - \psi_0) - \frac{t^2}{2} + o_p(1),$$

as both $n_1, n_2 \to \infty$, where $\gamma = n_1/n$. The order of the error holds unconditionally, and consequently also conditionally on selection since, for any random variable X and event A, $P(|X| > x \mid A) \le P(|X| > x)/P(A)$. Together with Lemma 2, this remark implies that the log-likelihood ratio of the selective model satisfies, for any convergent sequence $t_n \to t$,

$$\log \frac{f_S(Y_1, \ldots, Y_n; \psi_0 + n^{-1/2} t_n)}{f_S(Y_1, \ldots, Y_n; \psi_0)} = \gamma^{1/2}\, t\, n_1^{1/2}(\hat\psi_1 - \psi_0) + (1 - \gamma)^{1/2}\, t\, n_2^{1/2}(\hat\psi_2 - \psi_0) - \frac{t^2}{2} - \log \frac{\tilde\varphi_n^\psi(\psi_0 + n^{-1/2} t)}{\tilde\varphi_n^\psi(\psi_0)} + o_p(1),$$

so the dominant term is asymptotically distributed as the log-likelihood ratio of the selective model with observations $Y_1 \sim N(\psi, n_1^{-1})$ and $Y_2 \sim N(\psi, n_2^{-1})$, and selection event $Y_1 > g\{(A')^{-1}(n_1^{-1} t_{n_1})\}$.

The following examples illustrate the performance of the proposed non-informative priors in non-Gaussian settings.
Example 4.1. Let $Y_1, \ldots, Y_n$ be a random sample from an exponential distribution with rate parameter θ > 0. For this model the variance-stabilising transformation is g(θ) = log(θ). We consider the selection event $\hat\theta_1 > 1$, where $\hat\theta_1$ is the maximum likelihood estimator of θ based on a subsample of size $n_1 = 0.5 \times n$. For the simulation we consider the values n = 10, 30, 80, and for each sample size we consider three true parameter values defined to satisfy $\varphi_n(\theta_0) = 0.1, 0.5, 0.9$. For each pair $(n, \theta_0)$, we plot the coverage of the interval $(-\infty, \Pi^{-1}(\alpha \mid Y_1, \ldots, Y_n)]$ as a function of α for the non-selective Jeffreys prior $\pi(\theta) \propto \theta^{-1}$ and for the two non-informative priors proposed in this work. The coverages were approximated via $10^4$ simulations from the conditional model $(Y_1, \ldots, Y_n) \mid \hat\theta_1 > 1$. The results can be found in Figure 5 (Appendix D). The performances of the selective Jeffreys and probability-matching priors are practically identical and significantly better than that of the non-selective prior.
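One cell of this experiment can be reproduced along the following lines. In the sketch, the exact selection probability is available through the Gamma distribution of the subsample mean, and the value of θ₀ is an arbitrary stand-in of ours rather than one calibrated to a target selection probability:

```python
import numpy as np
from scipy.stats import norm, gamma

rng = np.random.default_rng(3)
n, n1 = 30, 15
theta0, alpha = 0.8, 0.5
grid = np.linspace(np.log(0.05), np.log(20), 4001)   # psi = log(theta) grid
th = np.exp(grid)

# exact phi(theta) = P(theta_hat_1 > 1) = P(mean of n1 Exp(theta) draws < 1)
log_phi = gamma.logcdf(1.0, a=n1, scale=1 / (n1 * th))
d2 = np.gradient(np.gradient(log_phi, grid), grid)
log_prior = 0.5 * np.log(np.maximum(n + d2, 1e-12))  # (9) on the psi scale

done = hits = 0
while done < 2000:
    y = rng.exponential(1 / theta0, size=n)
    if 1 / y[:n1].mean() <= 1:
        continue                                     # not selected
    loglik = n * (grid - th * y.mean()) - log_phi    # selective log-likelihood
    lp = loglik + log_prior
    p = np.exp(lp - lp.max())
    p /= p.sum()
    hits += (p[grid <= np.log(theta0)].sum() <= alpha)
    done += 1
print(hits / done)   # near alpha for the selective prior
```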
Example 4.2.
Let $Y_1, \ldots, Y_n$ be a random sample from an inverse Gaussian distribution with unknown mean θ > 0 and shape parameter λ = 1, with density

$$f(y_i; \theta) = \frac{1}{(2\pi y_i^3)^{1/2}} \exp\left\{-\frac{(y_i - \theta)^2}{2\theta^2 y_i}\right\}.$$

The variance-stabilising transformation for this model can be seen to be $g(\theta) = -2\theta^{-1/2}$, and the non-selective Jeffreys prior is $\pi(\theta) \propto \theta^{-3/2}$. We consider the same settings as before. The selection event is $\hat\theta_1 > 1$, where $\hat\theta_1$ is the maximum likelihood estimator of θ based on a subsample of size $n_1 = 0.5 \times n$ for n = 10, 30, 80, and we define the true values of the parameter in the same way as before. The results are similar to those of the exponential model. They can be found in Figure 6 (Appendix E).

4.1 Normal distribution with unknown variance
In this subsection we explore an extension of the previous ideas to a model with a nuisance parameter. Let $Y_1, \ldots, Y_n \sim N(\mu, \sigma^2)$, where both μ and $\sigma^2$ are unknown, but only μ is of direct interest. Suppose that the sample is divided into two sets of sizes $n_1$ and $n_2$, with respective maximum likelihood estimators $\hat\theta_1 = (\bar Y_1, V_1)$ and $\hat\theta_2 = (\bar Y_2, V_2)$ of $\theta = (\mu, \sigma^2)$. In order to determine whether μ is of interest, we conduct the t-test $V_1^{-1/2} \bar Y_1 > n_1^{-1/2} t$, for some pre-specified value of t. The selective density of the data is

$$f_S(y_1, \ldots, y_n; \mu, \sigma) = \frac{f(y_1, \ldots, y_n; \mu, \sigma)\, \mathbb{1}(v_1^{-1/2} \bar y_1 > n_1^{-1/2} t)}{P_{\mu,\sigma}(V_1^{-1/2} \bar Y_1 > n_1^{-1/2} t)}.$$

Note that in the selective model the marginal distribution of $V_1$, with marginal density

$$f_S(v_1; \mu, \sigma) = \frac{f(v_1; \sigma)\, P_{\mu,\sigma}(v_1^{-1/2} \bar Y_1 > n_1^{-1/2} t \mid v_1)}{P_{\mu,\sigma}(V_1^{-1/2} \bar Y_1 > n_1^{-1/2} t)},$$

depends on μ, even though its non-selective density is free of it. Under selection, small values of μ favour larger values of $V_1$, corresponding to noisier samples, and vice versa, analogously to the situation of Example 3.1. Similarly to the univariate case, standard non-informative priors such as $\pi(\mu, \sigma) \propto \sigma^{-1}$ produce marginal posteriors for μ that overstate, on average, smaller values of the parameter. To counteract this, we propose the improper prior defined by $\pi(\sigma) \propto \sigma^{-1}$ and

$$\pi(\mu \mid \sigma) \propto \left\{1 + \frac{n_1}{n}\, h_2\left(n_1^{1/2}\, \frac{\mu}{\sigma} - t\right)\right\}^{1/2},$$

which assigns lower prior probability to small values of μ relative to σ. The conditional prior of μ given σ is the selective Jeffreys prior of the model $\bar Y \sim N(\mu, n^{-1}\sigma^2)$ with selection event $\bar Y_1 > n_1^{-1/2}\sigma t$, assuming σ is known, where the selection threshold $n_1^{-1/2} V_1^{1/2} t$ has been approximated by its asymptotic limit.

Figure 4 illustrates the performance of this prior under repeated sampling from the selective model with $n_1 = 50$, $n_2 = 10$, t = 2, and true parameter $(\mu_0, \sigma_0) = (0, 1)$. We show the estimated empirical distribution functions of $\Pi(\mu_0 \mid Y_1, \ldots, Y_n)$ for the prior $\pi(\mu, \sigma) \propto \sigma^{-1}$ (orange) and for the proposed prior (green), as estimated from $5 \times 10^4$ repetitions. The selection probability was computed by numerical evaluation of the integral

$$P_{\mu,\sigma}(V_1^{-1/2} \bar Y_1 > n_1^{-1/2} t) = \int_0^{\infty} f(v; \sigma)\, \Phi\left(\frac{n_1^{1/2}}{\sigma}\left(\mu - \frac{t\, v^{1/2}}{n_1^{1/2}}\right)\right) \mathrm{d}v,$$

and the marginal posterior distribution of μ was approximated with a Metropolis-Hastings algorithm with $5 \times 10^4$ steps. The results show that the selection-adjusted prior produces posterior inference with a more reliable frequentist calibration than the unadjusted one.
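The selection probability integral and the proposed conditional prior translate directly into code. The following sketch uses scipy quadrature, with the illustrative constants $n_1 = 50$, $n_2 = 10$ from above:

```python
import numpy as np
from scipy.stats import chi2, norm
from scipy.integrate import quad

def selection_prob(mu, sigma, n1=50, t=2.0):
    """P(V1^{-1/2} Ybar1 > n1^{-1/2} t): V1 is the first-sample MLE of
    sigma^2, so n1 * V1 / sigma^2 ~ chi2(n1 - 1)."""
    def integrand(v):
        dens = (n1 / sigma**2) * chi2.pdf(n1 * v / sigma**2, df=n1 - 1)
        return dens * norm.cdf(np.sqrt(n1) / sigma
                               * (mu - t * np.sqrt(v) / np.sqrt(n1)))
    return quad(integrand, 0, np.inf)[0]

def h2(x):
    h1 = np.exp(norm.logpdf(x) - norm.logcdf(x))
    return -x * h1 - h1**2

def prior_mu_given_sigma(mu, sigma, n1=50, n=60, t=2.0):
    """Proposed conditional prior pi(mu | sigma), up to proportionality."""
    return np.sqrt(1 + (n1 / n) * h2(np.sqrt(n1) * mu / sigma - t))
```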
Figure 4: Estimated empirical CDFs of $\Pi(\mu_0 \mid Y_1, \ldots, Y_n)$ for the prior $\pi(\mu, \sigma) \propto \sigma^{-1}$ (orange) and for the proposed prior (green).

5 Discussion

Providing valid inference for a selected parameter is a central problem in statistics. The commonly accepted notion that Bayesian inference is immune to selection has recently been questioned by some authors, most notably Yekutieli (2012). In this paper we have argued in favour of a general selection adjustment of the posterior distribution, which allows the injection of prior information while accounting for the selection bias. This adjustment, however, introduces two difficulties in the analysis, which we have addressed in sections 3 and 4. The first requires us to think carefully about how the selection mechanism operates, that is, about how it uses the sample information when determining whether the parameter is selected. The second issue, concerning the choice of non-informative priors, is partially addressed by asymptotically approximating a generic selective regular model by a selective normal-location model and analysing the latter. While exact probability matching in these models can only be achieved with a data-dependent prior, we show that the Jeffreys prior provides essentially the same results, while being data-free and generally easier to implement. Many important issues remain unsolved, including the rigorous verification of the heuristic analysis, the extension of the priors to multi-dimensional settings, and the development of efficient computational methods for implementing the procedures in complex settings.

Because the conditional approach discards all the information used by the selection mechanism, it can be very conservative. In frequentist and non-informative Bayesian analyses it can easily lead to very large confidence, or credible, bounds if all the data is used for selection, as noted by Fithian et al. (2017) and Benjamini et al. (2019). In a Bayesian setting, prior information can provide a partial remedy to this issue, but it does not address it on a fundamental level; in such circumstances, the posterior distribution will be very similar to the prior. The conditional approach is generally safer to use in scenarios where there is a holdout dataset that is not used in the selection stage. This includes situations where the data is split into two sets, one of which is not used to make the selection decision, and also situations where, after selecting a parameter, we collect more data. In these cases, conditioning on selection provides an elegant refinement to the standard data-splitting approach, which uses only the holdout data in the inferential stage.

A key assumption in our formulation of the problem, which is prevalent in the conditional approach literature, is that the selection mechanism is known and fully specified before the analysis. This enables us to model the situation entirely and pose a well-defined decision problem. However, while this assumption may be true in some simple settings, in other cases the selection mechanism may be too complex to be pinned down precisely, or may involve subjective decisions from the statistician that cannot be modelled. In these circumstances, even though we may not be able to fully model the selection mechanism, we may still be able to identify the key part of it. For example, if we use a variable-selection algorithm to select covariates in a regression problem, and then choose a model that incorporates the selected covariates in a more subjective way, we can still condition on the observed output of the variable-selection algorithm. From this point of view we can argue that the conditional approach is correcting part of the selection bias. Alternatively, if we do not know the selection event but have some information about it, we can exploit the Bayesian machinery to allow for random selection mechanisms. For example, if we believe that a selection event is of the form Y > t but we are uncertain about the value of t, we can treat it as an unknown parameter, assign a prior distribution to it, and integrate it out from the selective posterior.

References
Bayarri, M. J. and Berger, J. O. (1998), 'Robust Bayesian analysis of selection models', Ann. Stat. 26(2), 645–659.
Bayarri, M. J. and Berger, J. O. (2000), 'P values for composite null models', J. Am. Stat. Assoc. 95(452), 1127–1142.
Bayarri, M. J. and Berger, J. O. (2004), 'The interplay of Bayesian and frequentist analysis', Stat. Sci. 19(1), 58–80.
Bayarri, M. J. and DeGroot, M. H. (1987), 'Bayesian analysis of selection models', J. R. Statist. Soc. D (The Statistician) 36(2/3), 137–146.
Benjamini, Y., Hechtlinger, Y. and Stark, P. B. (2019), 'Confidence intervals for selected parameters'. arXiv:1906.00505v1.
Birnbaum, A. (1962), 'On the foundations of statistical inference', J. Am. Stat. Assoc. 57, 269–326.
Cox, D. R. and Hinkley, D. V. (1974), Theoretical Statistics, Chapman and Hall/CRC, London.
Datta, G. S. and Mukerjee, R. (2004), Probability Matching Priors: Higher Order Asymptotics, Springer, New York.
Dawid, A. P. (1994), 'Selection paradoxes of Bayesian inference', in T. W. Anderson, K. T. Fang and I. Olkin, eds, 'Multivariate Analysis and Its Applications', Vol. 24, Institute of Mathematical Statistics Monograph Series, pp. 211–220.
Fithian, W., Sun, D. L. and Taylor, J. E. (2017), 'Optimal inference after model selection'. arXiv:1410.2597v4.
Fraser, D. A. S., Reid, N., Marras, E. and Yi, G. Y. (2010), 'Default priors for Bayesian and frequentist inference', J. R. Statist. Soc. B 72(5), 631–654.
Hyun, S., G'Sell, M. and Tibshirani, R. J. (2018), 'Exact post-selection inference for the generalized lasso path', Electron. J. Stat. 12, 1053–1097.
Kuffner, T. A. and Young, G. A. (2018), 'Principled statistical inference in data science', in N. Adams, E. Cohen and Y.-K. Guo, eds, 'Statistical Data Science', World Scientific Publishing, pp. 21–36.
Lee, J. D., Sun, D. L., Sun, Y. and Taylor, J. E. (2016), 'Exact post-selection inference, with application to the lasso', Ann. Stat. 44(3), 907–927.
Lee, J. D. and Taylor, J. E. (2014), 'Exact post model selection inference for marginal screening', Adv. Neural Inf. Process. Syst., 136–144.
Lockhart, R., Taylor, J. E., Tibshirani, R. and Tibshirani, R. (2014), 'A significance test for the lasso', Ann. Stat. 42(2), 413–468.
Loftus, J. R. and Taylor, J. E. (2014), 'A significance test for forward stepwise model selection'. arXiv:1405.3920v1.
Owen, D. B. (1980), 'A table of normal integrals', Commun. Statist. - Simula. Computa. 9(4), 389–419.
Panigrahi, S. and Taylor, J. (2018), 'Scalable methods for Bayesian selective inference', Electron. J. Stat. 12, 2355–2400.
Panigrahi, S., Taylor, J. and Weinstein, A. (2020), 'Integrative methods for post-selection inference under convex constraints'. arXiv:1605.08824v7.
Tian, X. and Taylor, J. E. (2018), 'Selective inference with a randomized response', Ann. Stat. 46(2), 679–710.
Tibshirani, R., Rinaldo, A., Tibshirani, R. and Wasserman, L. (2018), 'Uniform asymptotic inference and the bootstrap after model selection', Ann. Stat. 46(3), 1255–1287.
van der Vaart, A. W. (1998), Asymptotic Statistics, Cambridge University Press.
Yekutieli, D. (2012), 'Adjusted Bayesian inference for selected parameters', J. R. Statist. Soc. B 74(3), 515–541.

A Proof of Lemma 1
Without loss of generality we may assume that σ = 1 and t = 0. Let $h_i(x)$ denote the i-th derivative of $\log \Phi(x)$ and, for y > 0, define

$$H(\theta, y) = P_\theta(Y \ge y \mid Y > 0) = \frac{\Phi(\theta - y)}{\Phi(\theta)},$$

which is a distribution function as a function of θ, and

$$g(\theta; y) = -\frac{\frac{\partial}{\partial\theta} H(\theta, y)}{\frac{\partial}{\partial y} H(\theta, y)} = 1 - \frac{h_1(\theta)}{h_1(\theta - y)} > 0.$$

This function is strictly increasing in θ. Indeed,

$$g'(\theta; y) = \frac{h_1(\theta)\, h_2(\theta - y) - h_2(\theta)\, h_1(\theta - y)}{h_1(\theta - y)^2},$$

which is positive if $h_2(\theta - y)/h_1(\theta - y) > h_2(\theta)/h_1(\theta)$, that is, if $h_2(x)/h_1(x)$ is strictly decreasing. Since $h_2(x) = -x\, h_1(x) - h_1(x)^2$, we have $h_2(x)/h_1(x) = -x - h_1(x)$, so $(\partial/\partial x)\{h_2(x)/h_1(x)\} = -1 - h_2(x) < 0$ for all $x \in \mathbb{R}$. The latter claim follows from the fact that $1 + h_2(x)$ is the variance of a N(x, 1) distribution truncated to $(0, \infty)$.

Since g(θ; y) is strictly increasing, if used as a formal prior density for θ, it satisfies

$$\Pi(\theta_0 \mid y) > \frac{\int_{-\infty}^{\theta_0} g(\theta; y)\, \Phi(\theta)^{-1} \phi(\theta - y)\, \mathrm{d}\theta}{\int_{-\infty}^{\infty} g(\theta; y)\, \Phi(\theta)^{-1} \phi(\theta - y)\, \mathrm{d}\theta}.$$

But, by definition, g(θ; y) is such that

$$g(\theta; y)\, \frac{\phi(\theta - y)}{\Phi(\theta)} = \frac{\partial}{\partial\theta} H(\theta, y),$$

so

$$P_{\theta_0}\{\Pi(\theta_0 \mid Y) < \alpha \mid Y > 0\} < P_{\theta_0}\{H(\theta_0, Y) < \alpha \mid Y > 0\} = \alpha,$$

since $H(\theta_0, Y)$ is uniformly distributed given selection.

For the second claim, let m(y) denote the posterior mode of θ given Y = y. We have that

$$\frac{\partial}{\partial\theta} \log \pi(\theta \mid y) = y - \theta - h_1(\theta).$$

Now, for all x > 0,

$$\frac{\partial}{\partial\theta} \log \pi(m(y) - x \mid y) + \frac{\partial}{\partial\theta} \log \pi(m(y) + x \mid y) = 2(y - m(y)) - h_1(m(y) - x) - h_1(m(y) + x) = 2 h_1(m(y)) - h_1(m(y) - x) - h_1(m(y) + x) = x\{h_2(m(y) - x_1) - h_2(m(y) + x_2)\} < 0,$$

where $x_1, x_2 \in (0, x)$. Here we have used that $y = m(y) + h_1(m(y))$, the mean value theorem, and the fact that $h_2(x)$ is strictly increasing. Integrating this inequality we find that $\pi(m(y) + x \mid y) < \pi(m(y) - x \mid y)$ for all x > 0, whence, for $\alpha \le 1/2$, $\Pi^{-1}(\alpha \mid y) < m(y)$.

We now show that $m(y) + y^{-1}$ is bounded for $y \in (0, k]$, for any positive k. To this end, let $r(x) = x + h_1(x)$, so that $y = r(m(y))$. Using L'Hôpital's rule,

$$\lim_{x \to -\infty} x\, r(x) = \lim_{x \to -\infty} \frac{x^2\, \Phi(x) + x\, \phi(x)}{\Phi(x)} = \lim_{x \to -\infty} \frac{2x\, \Phi(x) + \phi(x)}{\phi(x)} = 1 + 2 \lim_{x \to -\infty} \frac{x\, \Phi(x)}{\phi(x)} = -1.$$

The claimed boundedness follows from the change of variable x = m(y) and from noting that $m(y) \to -\infty$ as $y \to 0^+$.

To conclude the proof, fix k > 0 and write

$$\int_0^k y^{-1}\, \phi(y - \theta)\, \mathrm{d}y \le \int_0^k |\Pi^{-1}(\alpha \mid y)|\, \phi(y - \theta)\, \mathrm{d}y + \int_0^k |y^{-1} + \Pi^{-1}(\alpha \mid y)|\, \phi(y - \theta)\, \mathrm{d}y.$$

Since the integral on the left diverges and the final one does not, it follows that $\int_0^k |\Pi^{-1}(\alpha \mid y)|\, \phi(y - \theta)\, \mathrm{d}y = \infty$, verifying the claim.

B Derivation of the probability-matching prior for 0 < γ < 1

Let $Y = \gamma \bar Y^{(1)} + (1 - \gamma) \bar Y^{(2)}$ and $Y^{\perp} = (1 - \gamma)(\bar Y^{(1)} - \bar Y^{(2)})$, which are uncorrelated, and therefore independent. We have

$$p(y) = P(\bar Y^{(1)} > t \mid Y = y) = P(Y + Y^{\perp} > t \mid Y = y) = P(Y^{\perp} > t - y) = P(N(0, n^{-1}(\gamma^{-1} - 1)) > t - y) = \Phi\left(\left(\frac{n\gamma}{1 - \gamma}\right)^{1/2}(y - t)\right).$$

We also have that $\varphi(\theta) = P_\theta(\bar Y^{(1)} > t) = \Phi(n_1^{1/2}(\theta - t))$. The definition gives

$$\pi_y(\theta) \propto \phi(n^{1/2}(y - \theta))^{-1} \left\{\int_y^{\infty} n(\tilde y - \theta)\, \phi(n^{1/2}(\tilde y - \theta))\, p(\tilde y)\, \mathrm{d}\tilde y - \frac{\varphi'(\theta)}{\varphi(\theta)} \int_y^{\infty} \phi(n^{1/2}(\tilde y - \theta))\, p(\tilde y)\, \mathrm{d}\tilde y\right\}. \qquad (12)$$

Using that

$$\int x\, \phi(x)\, \Phi(a + bx)\, \mathrm{d}x = \frac{b}{d}\, \phi\left(\frac{a}{d}\right) \Phi\left(xd + \frac{ab}{d}\right) - \phi(x)\, \Phi(a + bx)$$

for any constants a and b (see Owen (1980)), where $d = (1 + b^2)^{1/2}$, the first integral can be expressed as

$$\gamma^{1/2}\, \phi(n_1^{1/2}(\theta - t)) \left\{1 - \Phi\left(\frac{n^{1/2}}{(1 - \gamma)^{1/2}}\,(y - \theta + \gamma(\theta - t))\right)\right\} + \phi(n^{1/2}(y - \theta))\, \Phi\left(\frac{n^{1/2}}{(1 - \gamma)^{1/2}}\,(y - t)\right).$$

Substituting this expression in (12) and reorganising the terms gives the claimed expression.
C Proof of Lemma 2
We prove the result for the natural parameter θ to simplify the notation, but all the steps hold for a generic parameter. Let us write

$$\varphi_n(\theta_0) = P_{\theta_0}\left(\sum_{i=1}^{n_1} s(Y_i) > t_{n_1}\right) = P_{\theta_0}\left(n_1^{1/2} A''(\theta_0)^{-1/2} \left\{n_1^{-1} \sum_{i=1}^{n_1} s(Y_i) - A'(\theta_0)\right\} > n_1^{1/2} A''(\theta_0)^{-1/2} \left\{\frac{t_{n_1}}{n_1} - A'(\theta_0)\right\}\right) \equiv P_{\theta_0}\left(n_1^{1/2} A''(\theta_0)^{-1/2} \left\{n_1^{-1} \sum_{i=1}^{n_1} s(Y_i) - A'(\theta_0)\right\} > h_{n_1}(\theta_0)\right).$$

By the Berry–Esseen theorem and Jensen's inequality,

$$\left|\varphi_n(\theta_0) - \left[1 - \Phi\{h_{n_1}(\theta_0)\}\right]\right| \le \frac{C\, E_{\theta_0}\left[|s(Y_1) - A'(\theta_0)|^3\right]}{n_1^{1/2} A''(\theta_0)^{3/2}} \le \frac{C\, A^{(6)}(\theta_0)^{1/2}}{n_1^{1/2} A''(\theta_0)^{3/2}}$$

for some positive constant C. Similarly,

$$\tilde\varphi_n(\theta_0) = P\left(N\left(\theta_0, \frac{1}{n_1 i(\theta_0)}\right) > (A')^{-1}\left(\frac{t_{n_1}}{n_1}\right)\right) = P\left(N(0, 1) > n_1^{1/2} A''(\theta_0)^{1/2} \left\{(A')^{-1}\left(\frac{t_{n_1}}{n_1}\right) - \theta_0\right\}\right) \equiv P(N(0, 1) > g_{n_1}(\theta_0)).$$

Writing

$$\frac{t_{n_1}}{n_1} - A'(\theta_0) = A''(x)\left\{(A')^{-1}\left(\frac{t_{n_1}}{n_1}\right) - \theta_0\right\},$$

where $|x - \theta_0| \le |(A')^{-1}(n_1^{-1} t_{n_1}) - \theta_0| = \{n_1 A''(\theta_0)\}^{-1/2} |g_{n_1}(\theta_0)|$, we find that

$$h_{n_1}(\theta_0) - g_{n_1}(\theta_0) = \left\{\frac{A''(x)}{A''(\theta_0)} - 1\right\} g_{n_1}(\theta_0).$$

Now, for any $\tilde M > 0$ and $\tilde\gamma \in (0, 1/2)$, write

$$|\Phi\{h_{n_1}(\theta)\} - \Phi\{g_{n_1}(\theta)\}| = |\Phi\{h_{n_1}(\theta)\} - \Phi\{g_{n_1}(\theta)\}|\, \mathbb{1}\{|g_{n_1}(\theta)| \le \tilde M n_1^{\tilde\gamma}\} + |\Phi\{h_{n_1}(\theta)\} - \Phi\{g_{n_1}(\theta)\}|\, \mathbb{1}\{|g_{n_1}(\theta)| > \tilde M n_1^{\tilde\gamma}\}.$$

The first term vanishes uniformly in compact sets of θ, since

$$|h_{n_1}(\theta) - g_{n_1}(\theta)|\, \mathbb{1}\{|g_{n_1}(\theta)| \le \tilde M n_1^{\tilde\gamma}\} \le \left|\frac{A''(x) - A''(\theta)}{A''(\theta)}\right| \mathbb{1}\{|g_{n_1}(\theta)| \le \tilde M n_1^{\tilde\gamma}\}\, \tilde M n_1^{\tilde\gamma}$$

and $|x - \theta| \le n_1^{\tilde\gamma - 1/2} \{A''(\theta)\}^{-1/2} \tilde M$. For the second term, consider the case $g_{n_1}(\theta) < -\tilde M n_1^{\tilde\gamma}$. This implies that $\Phi\{g_{n_1}(\theta)\} < \Phi(-\tilde M n_1^{\tilde\gamma}) = o(1)$ and $\Phi\{h_{n_1}(\theta)\} < \Phi(-\tilde M n_1^{\tilde\gamma}\, A''(x)/A''(\theta)) = o(1)$ uniformly in compact sets of θ. An analogous argument can be made for the case $g_{n_1}(\theta) > \tilde M n_1^{\tilde\gamma}$, to conclude that $|\Phi\{h_{n_1}(\theta)\} - \Phi\{g_{n_1}(\theta)\}|$, and therefore $|\varphi_n(\theta) - \tilde\varphi_n(\theta)|$ by the previous remark, vanishes uniformly in compact sets of θ.

To conclude the proof of the first claim it suffices to show that $\varphi_n(\theta_0 + n^{-1/2} t)$ is uniformly bounded away from zero in bounded sets of t. Let

$$X_n(t) = \log f(Y_1, \ldots, Y_n; \theta_0 + n^{-1/2} t) - \log f(Y_1, \ldots, Y_n; \theta_0)$$

and write

$$\frac{\varphi_n(\theta_0 + n^{-1/2} t)}{\varphi_n(\theta_0)} = E_{\theta_0}\left[e^{X_n(t)} \mid S_n\right] \ge e^{E_{\theta_0}[X_n(t) \mid S_n]},$$

where $S_n$ denotes the selection event, and we have again used Jensen's inequality. It therefore suffices to show boundedness of $|E_{\theta_0}[X_n(t) \mid S_n]|$. Taylor's theorem gives

$$E_{\theta_0}[X_n(t)] = n^{1/2} t A'(\theta_0) - n\{A(\theta_0 + n^{-1/2} t) - A(\theta_0)\} = -\tfrac{1}{2} t^2 A''(\theta_0 + n^{-1/2} \varepsilon),$$

where $|\varepsilon| \le |t|$. Also, $\mathrm{Var}_{\theta_0}(X_n(t)) = t^2 A''(\theta_0)$. These two facts, combined with the assumption that $\varphi_n(\theta_0)$ is bounded away from zero, imply the result, since

$$|E_{\theta_0}[X_n(t) \mid S_n] - E_{\theta_0}[X_n(t)]| = |E_{\theta_0}[X_n(t) - E_{\theta_0}[X_n(t)] \mid S_n]| = \frac{\left|E_{\theta_0}\left[\{X_n(t) - E_{\theta_0}[X_n(t)]\}\, \mathbb{1}\left(\sum_{i=1}^{n_1} s(Y_i) > t_{n_1}\right)\right]\right|}{\varphi_n(\theta_0)} \le \frac{E_{\theta_0}[|X_n(t) - E_{\theta_0}[X_n(t)]|]}{\varphi_n(\theta_0)} \le \frac{\mathrm{Var}_{\theta_0}(X_n(t))^{1/2}}{\varphi_n(\theta_0)}.$$

To prove the last claim we need to show that $P_{\theta_0}(\hat\theta_1 \le x \mid \hat\theta_1 > a_{n_1}) - G_n(x)$ converges uniformly to zero, where $a_{n_1} = (A')^{-1}(n_1^{-1} t_{n_1})$. This is true because $P_{\theta_0}(a_{n_1} < \hat\theta_1 \le x) - P(a_{n_1} < N(\theta_0, (n_1 A''(\theta_0))^{-1}) \le x)$ converges to zero uniformly in x, by the arguments above, and because $\varphi_n(\theta_0)$ is bounded away from 0, which we have just shown.

D Exponential model simulation

Figure 5: Exponential model results (Example 4.1). Coverage of the interval $(-\infty, \Pi^{-1}(\alpha \mid Y_1, \ldots, Y_n)]$ as a function of α for the non-selective Jeffreys prior (red), the probability-matching prior (orange), and the selective Jeffreys prior (green). Panels correspond to n = 10, 30, 80 and selection probabilities 0.1, 0.5, 0.9. The orange lines partially overlap the green ones.

E Inverse Gaussian simulation

Figure 6: Inverse Gaussian model results (Example 4.2). Coverage of the interval $(-\infty, \Pi^{-1}(\alpha \mid Y_1, \ldots, Y_n)]$ as a function of α for the non-selective Jeffreys prior (red), the probability-matching prior (orange), and the selective Jeffreys prior (green). Panels correspond to n = 10, 30, 80 and selection probabilities 0.1, 0.5, 0.9.