Parameter Restrictions for the Sake of Identification: Is there Utility in Asserting that Perhaps a Restriction Holds?
Paul Gustafson*
Department of Statistics, University of British Columbia
September 28, 2020
Abstract
Statistical modeling can involve a tension between assumptions and statistical identification. The law of the observable data may not uniquely determine the value of a target parameter without invoking a key assumption, and, while plausible, this assumption may not be obviously true in the scientific context at hand. Moreover, there are many instances of key assumptions which are untestable, hence we cannot rely on the data to resolve the question of whether the target is legitimately identified. Working in the Bayesian paradigm, we consider the grey zone of situations where a key assumption, in the form of a parameter space restriction, is scientifically reasonable but not incontrovertible for the problem being tackled. Specifically, we investigate statistical properties that ensue if we structure a prior distribution to assert that maybe or perhaps the assumption holds. Technically this simply devolves to using a mixture prior distribution putting just some prior weight on the assumption, or one of several assumptions, holding. However, while the construct is straightforward, there is very little literature discussing situations where Bayesian model averaging is employed across a mix of fully identified and partially identified models.

Keywords: Bayesian model averaging; Bayes risk; large-sample theory; partial identification.

* The author gratefully acknowledges research support from the Natural Sciences and Engineering Research Council of Canada.

1 Introduction
In many applications of statistical modeling, a tension can arise. In order to fully identify a parameter of interest, one or more "strong" model assumptions may be required. Of course it is not prudent to invoke an assumption without a solid rationale for doing so. If identification is not obtained, however, then an uncomfortable truth ensues: even an infinite amount of data would not reveal the true value of the target parameter. To further muddy the waters, often an assumption that would identify the target is not testable empirically. So we cannot necessarily rely on the data to resolve the situation.

Say we are faced with a context where an identifying assumption or parameter restriction is scientifically plausible but not incontrovertible. One strategy, at least within the Bayesian paradigm, is to specify a prior distribution asserting that "large" violations of the restriction are unlikely. For instance, say the restriction takes the form λ = 0. Then a prior of the form λ ~ N(0, τ²), for a suitably small choice of τ, could encode this information. Some efforts along this line in the context of instrumental variable problems are pursued by Gustafson and Greenland (2006) and Gustafson (2007). More generally, in observational epidemiology settings, "Bayesian-like" approaches, often referred to as probabilistic bias analysis, have been considered (Greenland, 2003, 2005; Lash et al., 2009, 2014).

Assigning a prior distribution which probabilistically limits the magnitude of violation of an identifying restriction is not the only way to proceed. In fact, an arguably more direct encoding of "plausible but not incontrovertible" would result from specifying a mixture prior distribution, giving some weight to the restriction being met exactly, and the remaining weight to it being violated (without necessarily making a stringent judgement that the violation is unlikely to be large).
While mixture prior distributions are ubiquitous in Bayesian hypothesis testing and model selection procedures, there is scant literature on their use in the context of identifying restrictions.

1.2 Motivating Example: Prevalence Estimation with Missing Data

To motivate the developments to follow, consider the HIV surveillance study described by Verstraeten et al. (1998), which was also used to motivate the methodological developments in Vansteelandt et al. (2006). In this study, blood draws were taken from a sample of n = 787 members of the target population. At a population level, let Y indicate the HIV test result (0 for negative, 1 for positive), let R indicate the observation of the result (1 for observed, 0 for missing), and let p_{ry} = Pr(R = r, Y = y). Hence the target parameter, HIV prevalence, can be expressed as ψ = Pr(Y = 1) = p_{01} + p_{11}. The study data are summarized by c_{10} = 699 negative HIV tests (R = 1, Y = 0), c_{11} = 52 positive HIV tests (R = 1, Y = 1), and c_{0} = 36 missing test results (R = 0).

Say we proceed to infer ψ without invoking any identifying restriction, i.e., we allow that the missing data mechanism may be nonignorable (see Daniels and Hogan (2008) or Little and Rubin (2014) for full discussions of nonignorable missingness). A possible "neutral" prior specification is (p_{00}, p_{01}, p_{10}, p_{11}) ~ Dirichlet(1, 1, 1, 1). Upon reparameterizing to (s, p_{0}, p_{10}, p_{11}), where p_{0} = p_{00} + p_{01} and s = p_{01}/(p_{00} + p_{01}), the posterior distribution is characterized by (p_{0}, p_{10}, p_{11}) ~ Dirichlet(2 + c_{0}, 1 + c_{10}, 1 + c_{11}) and, independently, s ~ Unif(0, 1), with the target expressed as ψ = s p_{0} + p_{11}. Here the impact of not having an identifying restriction is clear. Since the posterior distribution of s is Unif(0, 1) for any dataset, the posterior distribution of ψ will not reduce to a point-mass in the infinite limit of further data collection. For the data at hand, the posterior distribution of ψ is depicted in Figure 1.

Alternately, say we believe the missing-at-random (MAR) assumption is justified, i.e., we presume R to be independent of Y. Then a possible neutral prior specification is to let p_{ry} = (1 − γ)^{1−r} γ^{r} (1 − ψ)^{1−y} ψ^{y}, with γ = Pr(R = 1) and ψ = Pr(Y = 1) independently and identically distributed as Unif(0, 1). The resulting posterior distribution of ψ is simply Beta(1 + c_{11}, 1 + c_{10}). Clearly this posterior distribution will concentrate to a point-mass in the infinite limit of further data collection. For the data at hand, the posterior is also depicted in Figure 1.
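The two within-model posterior distributions shown in Figure 1 can be simulated directly; the sketch below (assuming the prior specifications above, with variable names of our own choosing) draws from the NIM posterior via the Dirichlet–uniform factorization and from the MAR posterior via its Beta form:

```python
import numpy as np

rng = np.random.default_rng(0)
c10, c11, c0 = 699, 52, 36  # observed negatives, observed positives, missing

# NIM posterior: (p0, p10, p11) ~ Dirichlet(2 + c0, 1 + c10, 1 + c11), s ~ Unif(0, 1)
m = 100_000
p0, p10, p11 = rng.dirichlet([2 + c0, 1 + c10, 1 + c11], size=m).T
s = rng.uniform(size=m)
psi_nim = s * p0 + p11                       # target: psi = s * p0 + p11

# MAR posterior: psi ~ Beta(1 + c11, 1 + c10)
psi_mar = rng.beta(1 + c11, 1 + c10, size=m)

print(psi_nim.mean(), psi_mar.mean())        # NIM mean exceeds MAR mean
```

The NIM posterior mean is pulled upward relative to the MAR one, since half of the missing-cell probability is attributed to undetected positives a priori.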
Figure 1: Posterior distribution of HIV prevalence without an identifying restriction (allowing nonignorable missingness, NIM), with the missing-at-random (MAR) restriction, and the Bayesian model averaged (BMA) synthesis of the two distributions.

For an investigator having good reason to believe that the MAR assumption might hold, a natural route to expressing this belief is through a model-averaging prior. See Hoeting et al. (1999) or Wasserman (2000) for a full discussion of Bayesian model averaging (BMA). Specifically, the prior for (p_{00}, p_{01}, p_{10}, p_{11}) is taken as a mixture of the two specifications above. It is a matter of conjugate updating algebra to verify that the Bayes factor contrasting the two specifications (with MAR in the numerator versus NIM in the denominator) is:

    b = (n + 2)(n + 3) / {6 (c_{0} + 1)(n − c_{0} + 1)},   (1)

which evaluates to b = 3.
73 for the present data. Of course the posterior odds favoring the MAR specification are b times the prior odds. For instance, if the mixture prior gives equal weights of 0.5 to the two specifications, then the model-averaged posterior assigns weight 0.211 to the NIM posterior and weight 0.789 to the MAR posterior. This model-averaged posterior for the target is also depicted in Figure 1.

With both the NIM and MAR "within-model" prior specifications being neutral in some sense, it may be surprising that a Bayes factor of nearly four is obtained, given that MAR is well known to be an untestable assumption. Similarly, it may be surprising to see from the form of (1) how sensitive the Bayes factor is to the proportion of data that are missing. This raises the question of what statistical properties are possessed by the model-averaged inference scheme in a context such as this. In general there is a rich literature on Bayesian model averaging. We are not aware, however, of any focus on the case when one constituent model does not identify the target parameter, while the other model, or models, impose identifying restrictions. Thus the rationale for the present work is to quantify statistical performance specifically when model averaging is used to declare the a priori supposition that perhaps an identifying restriction holds.
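As a quick arithmetic check on (1), the Bayes factor and the resulting model-averaged posterior weights for the HIV data can be computed directly (a sketch; the variable names are ours):

```python
# Bayes factor (1) contrasting MAR (numerator) with NIM (denominator)
n, c0 = 787, 36   # sample size and number of missing results

b = (n + 2) * (n + 3) / (6 * (c0 + 1) * (n - c0 + 1))

# Posterior model weights under equal prior weights (0.5, 0.5):
# posterior odds = b * prior odds, so the weight on MAR is b / (1 + b).
w_mar = b / (1 + b)
w_nim = 1 - w_mar

print(round(b, 2), round(w_nim, 3), round(w_mar, 3))  # 3.73 0.211 0.789
```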
Say that the initial statistical model at hand, which we refer to as M_0, is parameterized by θ = (φ, λ). Here, in the terminology of Gustafson (2015), this is taken to be a transparent parameterization, with the distribution of the data depending on θ only through φ. Or, put more directly in Bayesian terms, the data D are conditionally independent of λ, given φ. The lack of full identification is thus made clear: the data inform φ (indeed we presume that they fully inform φ, in the sense that the (D | φ) model supports consistent estimation of φ). However, for any dataset, the posterior conditional distribution of (λ | φ) is the same as the prior conditional distribution. In what follows, the primary target of inference is expressed as ψ = g(φ, λ). Provided that g() varies non-trivially with λ, the target is not fully identified.

Whereas we write the overall parameter space for M_0 as θ ∈ Θ, we let the marginal parameter spaces implied by Θ be φ ∈ Φ and λ ∈ Λ. Importantly, many partially identified models that arise in practice are such that φ and λ are not variation independent, with the consequence that direct learning about φ from the observed data may induce some indirect learning about λ, via the support of λ depending on φ (Gustafson, 2015). Commensurately, note for future reference that Θ may be a proper subset of the Cartesian space Φ × Λ.

We consider one or more sub-models of M_0, each of which has some level of a priori scientific credence, and each of which identifies the target. We express the j-th of J ≥ 1 such sub-models as M_j: θ ∈ Θ_j ⊂ Θ. That restriction of θ to the j-th sub-model identifies the target implies that over (φ, λ) ∈ Θ_j, g(φ, λ) does not vary with λ. Stated more pragmatically, if M_j holds, then φ, the value of which can be learned from the data, uniquely determines the target ψ.

As alluded to in the previous section, the framework allowing the user to postulate that perhaps one of the identifying restrictions holds is that of BMA.
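The hallmark that the (λ | φ) posterior equals the (λ | φ) prior can be seen in a toy partially identified model (our own construction, purely for illustration): observations are N(φ, 1), the target is ψ = φ + λ, and a priori φ ~ N(0, 1) independently of λ ~ N(0, 1). The data never update λ, so the posterior uncertainty in ψ cannot shrink below the prior uncertainty in λ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy transparent parameterization: data ~ N(phi, 1), target psi = g(phi, lam) = phi + lam.
# Priors: phi ~ N(0, 1) and lam ~ N(0, 1), independent, so (lam | phi) is just N(0, 1).
n = 10_000
data = rng.normal(2.0, 1.0, size=n)       # generated under phi_dagger = 2

# Conjugate posterior for phi: N(n * xbar / (n + 1), 1 / (n + 1))
post_mean = n * data.mean() / (n + 1)
post_sd = np.sqrt(1 / (n + 1))

m = 200_000
phi_draws = rng.normal(post_mean, post_sd, size=m)
lam_draws = rng.normal(0.0, 1.0, size=m)  # posterior of (lam | phi) = prior
psi_draws = phi_draws + lam_draws

# phi is consistently estimated, yet the posterior sd of psi stays near the prior sd of lam
print(round(float(np.std(psi_draws)), 2))  # close to 1.0
```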
Let π_j() be the prior density specified for θ within model M_j. Then the BMA prior distribution is a mixture, whose density takes the form:

    π_MIX(θ) = Σ_{j=0}^{J} w_j π_j(θ),   (2)

where w_j = Pr(M_j) is the prior probability that model M_j is correct, hence Σ_{j=0}^{J} w_j = 1. As a technical note, we presume that each sub-model has zero probability under π_0() (as well as under the other sub-models). Consequently, it is not necessary to explicitly define the support of π_0() as excluding the sub-models.

Upon receipt of data D, standard Bayesian updating from the mixture prior (2) gives the posterior distribution as a mixture:

    π_MIX(θ | D) = Σ_{j=0}^{J} w̃_j(D) π_j(θ | D).

Here π_j(θ | D) is the standard "within-model" posterior distribution of parameters, i.e., based on π_j(θ | D) ∝ L(θ; D) π_j(θ), for likelihood function L(;). And w̃_j(D) is the posterior probability that M_j is correct. In the usual fashion,

    w̃_j(D) = w_j f_j(D) / Σ_{k=0}^{J} w_k f_k(D),

where f_j(D) = ∫_{Θ_j} L(θ; D) π_j(θ) dθ is the marginal density of the data under the j-th model.

While the framework above is standard, its implications are unexplored when the list of candidate models is a mix of partially and fully identified models. To a large extent, estimator behaviour will be seen to be specific to the particular choice of partially identified model and identified sub-models, and to the nature of the specified within-model priors. However, to foreshadow the examples which follow, there are situations where:

• For every dataset, w̃_1(D)/w̃_0(D) = w_1/w_0, i.e., the data have zero ability to discriminate between the base model and an identified sub-model.

• For data generated under some parameter values in Θ − Θ_1, w̃_1(D) tends to zero as more data accumulate, but under other such values the limit is positive. That is, a falsely asserted restriction may or may not be fully refuted.

• As the underlying θ ranges over Θ, the large-sample limit of w̃_1(D) takes values in [c, 1), with c > 0. Since the limit is never zero or one, the restriction can never be fully refuted or fully supported. Also, since c is positive, there are no parameter values under which we get close to complete refutation of the restriction. In the other direction, however, there are parameter values under which we get arbitrarily close to full support for the restriction.

Thus we find a surprising richness in the variety of behaviors that can be encountered.

Toward understanding the structure of the posterior model weights in our present framework, note that

    f_j(D) = ∫_{Θ_j} L(θ) π_j(θ) dθ = ∫_{Θ_j} L(φ) π_j(φ, λ) dφ dλ = ∫_{Φ_j} L(φ) π_j(φ) dφ.

Thus the marginal prior distribution of φ arising from the specified joint prior on (φ, λ) plays a key role. Asymptotically, imagine a datastream of independent observations generated under true parameter values (φ, λ) = (φ†, λ†). (We will generally use the 'dagger' notation to emphasize specific parameter values that give rise to the observable data.) Let D_n denote the first n observations, and let w*_j = lim_{n→∞} w̃_j(D_n) be the limiting posterior weight on the j-th model. Then we immediately have

    w*_j / w*_k = {lim_{n→∞} f_j(D_n)/f_k(D_n)} (w_j/w_k) = {π_j(φ†)/π_k(φ†)} (w_j/w_k).   (3)

Just as we can characterize the limiting posterior weights on models, we can also characterize the limiting values of within-model estimators. Generically, let ψ̂(D; π_I) be the posterior mean of the target ψ arising from the prior specification π_I and data D, where the I subscript reminds us this is the investigator's choice of prior distribution. It is easy to verify that without an identifying restriction we have

    ψ*_0 ≐ lim_{n→∞} ψ̂(D_n; π_0) = ∫ g(φ†, λ) π_0(λ | φ†) dλ.

We can similarly define ψ*_j ≐ lim_{n→∞} ψ̂(D_n; π_j) for j > 0. Specifically, when the investigator invokes the j-th restriction, provided that φ† ∈ Φ_j, we can express the limit as ψ*_j = g(φ†, ·), upon recalling that g(φ, λ) is constant in λ when (φ, λ) ∈ Θ_j. In a situation where φ† ∉ Φ_j, typically misspecified model theory (e.g., White (1982)) would be needed to determine ψ*_j. Assembling the pieces thus far, the large-sample limit of the BMA posterior mean is

    ψ*_MIX ≐ Σ_{j=0}^{J} w*_j ψ*_j.

We quantify the performance of estimators by averaging frequentist performance across the parameter space. If data were generated under parameter value θ, the estimator ψ̂(D; π_I) would incur mean-squared error (MSE) of E_θ[{ψ̂(D; π_I) − ψ(θ)}²]. We then average the MSE across different underlying parameter values θ, according to θ ~ π_N(), where the subscript N serves to remind us that this is Nature's choice of prior distribution. For data D_n with sample size n, then, the average mean-squared error (AMSE) is:

    AMSE_n(π_N, π_I) ≐ E_{π_N} E_θ[{ψ̂(D_n; π_I) − ψ(θ)}²].   (4)

Note that we quite deliberately keep both the 'A' and the 'M' in 'AMSE,' to remind us that two averagings are in play: the mean (M) of squared error across D_n given θ, distinct from the averaging (A) of this result across θ.

Standard decision-theoretic terminology would see (4) referred to as the Bayes risk under π_N. Indeed, standard arguments tell us that for fixed π_N, (4) is minimized by taking π_I = π_N, and that in fact this is the minimum achievable across all estimators, not just those arising as Bayesian estimators from some choice of prior. As a further formality, we will largely focus on the large-sample limit. Notationally we achieve this by defining AMSE() = lim_{n→∞} AMSE_n(). Also, for the sake of interpretability we will tend to report the square root of AMSE, i.e.,
RAMSE = AMSE^{1/2}.

The Nature versus investigator prior framework embedded in (4) seems particularly applicable to quantifying the pros and cons of "maybe" assertions about identified sub-models. We can compute AMSE(π_N, π_I) for the four combinations arising from π_I ∈ {π_0, π_MIX} and π_N ∈ {π_0, π_MIX}. Necessarily, AMSE(π_MIX, π_MIX) is lower than (or possibly equal to) AMSE(π_MIX, π_0). The extent to which it is lower reflects the benefit of making "maybe" assertions in contexts where appropriate, since setting π_N = π_MIX implies we are studying average-case performance across a mix of scenarios, some with, and some without, one of the identifying restrictions being true.

On the other hand, AMSE(π_0, π_MIX) is guaranteed to be higher than (or possibly equal to) AMSE(π_0, π_0), and the extent to which it is higher reflects the risk of inappropriately making "maybe" assertions. That is, across scenarios where none of the identifying restrictions hold, making the maybe assertions increases average-case MSE. So, if we regard invoking "maybe" restrictions in contexts where all the identified sub-models are a priori implausible as a form of cheating, then the extent to which AMSE(π_0, π_MIX) exceeds AMSE(π_0, π_0) is the extent to which cheating results in empirically worse estimation performance.

It is important to note that the AMSE comparisons just outlined could be applied in a fully identified setting. That is, say that ψ were identified under M_0, and consequently also identified under each sub-model M_j, j = 1, . . . , J. For a finite sample size n, the four AMSE_n values would still have the interpretations given above, quantifying both the value and the risk of speculating that the sub-models are a priori plausible. However, all four AMSE_n values would tend to zero as n tends to infinity, since both ψ̂(D_n; π_0) and ψ̂(D_n; π_MIX) would consistently estimate ψ, under any values of θ ∈ Θ. Conversely, in the situation we study, with ψ not identified under M_0, consistent estimation does not arise. Hence the positive large-sample limits of the AMSE values are fundamental descriptors of the situation.
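To make definition (4) concrete in the fully identified case just described, the sketch below estimates AMSE_n by plain Monte Carlo in a toy Beta–Bernoulli setting of our own devising, with π_N = π_I = Unif(0, 1); for this textbook case the Bayes risk of the posterior mean is known in closed form to be 1/{6(n + 2)}:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy fully identified model: theta ~ Unif(0, 1) (Nature's prior = investigator's prior),
# data D_n summarized by x ~ Binomial(n, theta), estimator = posterior mean under Beta(1, 1).
n = 10
reps = 400_000

theta = rng.uniform(size=reps)        # draw theta ~ pi_N
x = rng.binomial(n, theta)            # draw data given theta
est = (x + 1) / (n + 2)               # posterior mean under the Beta(1, 1) prior

amse = np.mean((est - theta) ** 2)    # Monte Carlo AMSE_n per definition (4)
print(amse, 1 / (6 * (n + 2)))        # both close to 0.01389
```

As n grows this quantity tends to zero, in line with the fully identified discussion above; the distinguishing feature of the partially identified setting is that the analogous limit stays positive.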
Example 1

For a first example of quantifying performance, we simply return to the motivating example of Section 1.2. Here M_0 is the NIM specification, which does not fully identify the target, whereas M_1 is the MAR specification, which does. The particular prior specifications π_0() and π_1() are those given in Section 1.2. From the posterior forms, we see the posterior mean of the disease prevalence ψ has large-sample limit ψ*_0 = p†_{11} + p†_{0}/2 when the investigator makes no identifying restriction, versus ψ*_1 = p†_{11}/(p†_{10} + p†_{11}) when the investigator presumes MAR.

The respective prior marginal densities of φ = (p_{0}, p_{10}, p_{11}) are readily obtained. Expressed as densities for (p_{10}, p_{11}) over the lower triangle of the unit square (rather than densities over the probability simplex), we have π_0(p_{10}, p_{11}) = 6 p_{0} for the NIM specification, and π_1(p_{10}, p_{11}) = (1 − p_{0})^{−1} for the MAR specification. Consequently the limiting Bayes factor is

    w*_1/w*_0 ÷ (w_1/w_0) = 1 / {6 p†_{0} (1 − p†_{0})}.

Note that this expression could also be deduced by direct inspection of (1) as n increases.

Based on the prior specifications and the limiting posterior forms, and taking w = (0.5, 0.5), we obtain

    AMSE(π_0, π_0) = E_{π_0}{(p_{01} − p_{00})²/4} = 1/40,

    AMSE(π_MIX, π_0) = (1/2) E_{π_0}{(p_{01} − p_{00})²/4} + (1/2) E_{π_1}{(p_{01} − p_{00})²/4} = (1/2)(1/40 + 1/36).

The other two quantities needed take the form AMSE(π_0, π_MIX) = E_{π_0}{w(p)} and AMSE(π_MIX, π_MIX) = (1/2) E_{π_0}{w(p)} + (1/2) E_{π_1}{w(p)}, where w(p) is the squared large-sample error of the BMA posterior mean, namely

    w(p) = [ {6 p_{0}(1 − p_{0})}/{1 + 6 p_{0}(1 − p_{0})} · {(p_{00} − p_{01})/2} + {1 + 6 p_{0}(1 − p_{0})}^{−1} · {p_{11}/(p_{10} + p_{11}) − (p_{01} + p_{11})} ]².

Given the nonlinearity of w(), the needed expectations do not have analytic forms. They are easily evaluated numerically, however, say via Monte Carlo draws from π_0() and π_1(). Table 1 gives the four RAMSE values. On the pro side, appropriate use of the "maybe" assertion leads to a 16% reduction in RAMSE compared to no assertion, i.e., we see this reduction when Nature's prior is the mixture, so we are averaging performance across some scenarios where the assertion holds and others where it does not. On the con side, inappropriate use of the "maybe" assertion leads to a 12% increase in RAMSE. This is seen when Nature's prior is π_0 alone, so we are averaging performance across scenarios which all involve nonignorable missingness.

Table 1: RAMSE values for different combinations of Nature's prior and the investigator's prior, in Example 1. Values in the first column are exact. Values in the second column are computed as Monte Carlo averages based on draws from each of π_0() and π_1(). The impact of the investigator using π_MIX rather than π_0 is an 11.8% increase in RAMSE when this use is not warranted, but a 16.0% decrease in RAMSE when it is warranted. The Monte Carlo standard errors for these two percentage changes are 0.3 percentage points and 0.2 percentage points, respectively.

                     Investigator
    Nature           π_0       π_MIX
    π_0              0.158     0.177
    π_MIX            0.162     0.136

Example 2

Say that interest lies in the distribution of three binary variables, (C, X, Y), in a population. Here C is a potential confounding variable, X is an exposure variable, and Y is a health outcome variable. The particular target of inferential interest is taken to be the average risk difference, ψ = E{E(Y | X = 1, C) − E(Y | X = 0, C)}. (As an aside, this target can be motivated as an average treatment effect in a framework using counterfactual outcomes, under an assumption that the counterfactual outcomes are conditionally independent of X given C.) Say the available data, however, are sampled conditional on C. That is,
a pre-specified number of realizations n_{0} are drawn from (Y, X | C = 0), and similarly a pre-specified number of realizations n_{1} are drawn from (Y, X | C = 1).

For this problem, let φ = (φ_{0}, φ_{1}), where φ_{c} comprises the (X, Y | C = c) cell probabilities (so φ_{c} has three free elements, with the fourth cell probability consequently determined since these probabilities must sum to one). Also, let λ = Pr(C = 1). Then (φ, λ) is a transparent parameterization, with the likelihood L(φ) = L_{0}(φ_{0}) L_{1}(φ_{1}) based on two independent multinomial samples. The target parameter can be expressed as g(φ, λ) = (1 − λ) v(φ_{0}) + λ v(φ_{1}), where v() returns the risk difference from a single set of cell probabilities. For the partially identified model M_0 we specify a prior with independencies of the form π_0(φ, λ) = π(φ_{0}) π(φ_{1}) π(λ). For future reference, let λ̄ and σ²_{λ} be the mean and variance of λ under the specified prior.

The first identifying restriction we consider is that the prevalence of C is exactly known from external sources, i.e., M_1 is defined as the sub-model of M_0 given by λ = λ̃, where λ̃ is user-specified. We express this model via the prior specification π_1(φ, λ) = π_0(φ) δ_{λ̃}(λ), where δ_{x}() is the Dirac delta function, i.e., the 'density' of a point-mass at x. We take π_1(φ) = π_0(φ) as before, but now λ = λ̃ is taken as known.

The second identifying restriction we consider is that there is no interaction on the risk difference scale. Hence we obtain model M_2 from model M_0 by constraining φ according to v(φ_{0}) = v(φ_{1}). An appropriate prior specification for the marginal π_2(φ) would be the distribution induced from π_0(φ) by conditioning on v(φ_{0}) = v(φ_{1}). For M_2 we can leave π_2(λ | φ) unspecified, since it will play no role whatsoever in the posterior distribution of ψ or in the posterior weight of M_2 relative to the other two models.

From the forms of π_j(φ) for j = 0, 1, 2, the limiting posterior model weights are

    (w*_0, w*_1, w*_2) = (0, 0, 1) if v(φ†_{0}) = v(φ†_{1}); (w_0 + w_1)^{−1}(w_0, w_1, 0) otherwise.   (5)

Specifically, M_1 is completely untestable compared to M_0, in the sense that f_1(D) = f_0(D) for every dataset. On the other hand M_2 is completely testable compared to M_0, in the sense that the constraint v(φ_{0}) = v(φ_{1}) is either proven or disproven in the limit of infinite sample size.

In terms of inference within models, under M_0 it is immediate that the posterior distribution of (ψ | φ) is a location-scale shift of the prior distribution of λ [with location v(φ_{0}) and scale v(φ_{1}) − v(φ_{0})]. The limiting posterior mean of ψ is then v(φ†_{0}) + {v(φ†_{1}) − v(φ†_{0})} λ̄. Whereas, in a related spirit, under M_1 the limiting posterior mean is v(φ†_{0}) + {v(φ†_{1}) − v(φ†_{0})} λ̃. This limit does or does not match the target ψ† depending on whether M_1 holds. If M_2 holds, then asymptotically the M_2 posterior will concentrate at the correct value of v(φ†_{0}) = v(φ†_{1}). The limiting posterior mean under M_2 when M_2 does not hold is governed by standard wrong-model asymptotic theory (e.g., see White (1982)). We will not need to determine this limit, however, as M_2 is discredited when it is wrong, as per (5).

Armed with (5) and the limiting posterior means of ψ under each model, we proceed to determine AMSE for the four combinations of Nature's prior and the investigator's prior based on π_0() or π_MIX(). Letting k = Var{v(φ_{1}) − v(φ_{0})} = 2 Var{v(φ_{c})} under π_0(), direct calculation gives:

    AMSE(π_0; π_0) = k σ²_{λ},
    AMSE(π_0; π_MIX) = k [ σ²_{λ} + {w_1/(w_0 + w_1)}² (λ̃ − λ̄)² ],
    AMSE(π_MIX; π_0) = k { w_0 σ²_{λ} + w_1 (λ̃ − λ̄)² },
    AMSE(π_MIX; π_MIX) = k { w_0 σ²_{λ} + (w_0 w_1 / (w_0 + w_1)) (λ̃ − λ̄)² }.

Note that these expressions are sufficiently simple that one can "read off" the extent to which AMSE(π_0; π_MIX) exceeds AMSE(π_0; π_0) and the extent to which AMSE(π_MIX; π_MIX) is reduced compared to AMSE(π_MIX; π_0). Note also that both these gaps collapse to zero in situations where λ̄ (the prior mean of λ under model M_0) equals λ̃ (the presumed value of λ under Model M_1).

To give a concrete example, say that external information leads to the specification of λ̃ = 0.
15 for the identifying restriction under M_1. And say the same information suggests the prior λ ~ Beta(4, 18) under Model M_0, giving the prior mode at 0.15, as well as a prior mean of λ̄ = 4/22 and prior variance of σ²_{λ} = (4 × 18)/(22² × 23) ≈ 0.0065. Also say that φ_{c} ~ Dirichlet(1, 1, 1, 1) independently for c = 0, 1, under which k = 2 Var{v(φ_{c})} = 1/3. Taking equal prior model weights (w_0, w_1, w_2) = (1/3, 1/3, 1/3), the resulting RAMSE values appear in Table 2.

Table 2: RAMSE values for different combinations of Nature's prior and the investigator's prior in Example 2. The impact of the investigator using π_MIX rather than π_0 is a 1.9% increase when this use is not warranted, but a 3.5% decrease when it is warranted.

                     Investigator
    Nature           π_0        π_MIX
    π_0              0.0464     0.0473
    π_MIX            0.0288     0.0278

Example 3

In a somewhat similar spirit to Example 2, say we are interested in the risk difference Pr(Y = 1 | X = 1) − Pr(Y = 1 | X = 0), for binary variables X and Y. However, the available data are a sample of (X, Y*) realizations, where Y* may be an imperfect surrogate for Y,
while Y itself is latent. Specifically, say Y* and X are conditionally independent given Y (the so-called "nondifferential" assumption), with λ_{y} = Pr(Y* = Y | Y = y), i.e., λ_{0} and λ_{1} are respectively the specificity and sensitivity of the surrogate. Letting ω_{x} = Pr(Y = 1 | X = x), the target of inference is ψ = ω_{1} − ω_{0}. At present we have parameterized the problem at hand by (ω, λ). However, to obtain a transparent parameterization we reparameterize to (φ, λ), where φ = (φ_{0}, φ_{1}) has components φ_{x} = Pr(Y* = 1 | X = x) = (1 − ω_{x})(1 − λ_{0}) + ω_{x} λ_{1}. With respect to this parameterization, the target is expressed as ψ = (φ_{1} − φ_{0})/(λ_{0} + λ_{1} − 1).

We specify a prior for M_0 by assigning a prior distribution in the original parameterization. We let ω follow a uniform distribution over (0, 1)², and independently we let λ follow a uniform distribution over (a_{0}, 1) × (a_{1}, 1), where a_{0} and a_{1} are worst-case assertions about the magnitude of outcome misclassification. (We presume a_{0} > 0.5 and a_{1} > 0.5, so that there is a positive association between Y and Y*.) Note that the specified prior independence between ω and λ is intuitive; there seems no reason to tie prior assertions about the (Y | X) prevalences to prior assertions about the quality of Y* as a surrogate for Y. (And note also that it would be impossible to assert prior independence of φ and λ, since the support of λ depends on φ.)

Upon moving to the transparent parameterization, following the related model formulation in Gustafson et al. (2001), we find the M_0 prior density transforms to

    π_0(φ, λ) = I_A(φ, λ) / {(1 − a_{0})(1 − a_{1})(λ_{0} + λ_{1} − 1)²}.   (6)

Here A is the intersection of three sets, imposing the restrictions that: (i) φ ∈ (0, 1)²; (ii) λ ∈ (a_{0}, 1) × (a_{1}, 1); and (iii) φ and λ are compatible with each other, in that (1 − λ_{0}) < φ_{x} < λ_{1} for x = 0, 1.

Model M_1 is the sub-model of M_0 defined by λ = (1, 1)′. Under M_1 we take the prior on φ to be uniform over (0, 1)².
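The reparameterization and its implied target formula are easy to check numerically; the short sketch below (our own illustration, with hypothetical parameter values) confirms that ψ computed in the transparent parameterization recovers ω_1 − ω_0:

```python
# Transparent reparameterization for the misclassified-outcome problem:
#   phi_x = Pr(Y* = 1 | X = x) = (1 - omega_x) * (1 - lam0) + omega_x * lam1
#   psi   = (phi_1 - phi_0) / (lam0 + lam1 - 1)
def to_phi(omega, lam0, lam1):
    return [(1 - w) * (1 - lam0) + w * lam1 for w in omega]

omega = (0.30, 0.55)     # hypothetical (Y | X) prevalences
lam0, lam1 = 0.9, 0.8    # hypothetical specificity and sensitivity

phi = to_phi(omega, lam0, lam1)
psi = (phi[1] - phi[0]) / (lam0 + lam1 - 1)

print(round(psi, 10))    # recovers omega_1 - omega_0 = 0.25
```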
This is consistent with M_0 in that it matches the (φ | λ = (1, 1)′) prior.

For a datastream arising under φ = φ†, we have the limiting Bayes factor π_1(φ†)/π_0(φ†) as per (3). For the present prior specifications, π_1(φ) = 1, while marginalizing (6) gives:

    π_0(φ) = [log b_{0}(φ) + log b_{1}(φ) − log{b_{0}(φ) + b_{1}(φ) − 1}] / {(1 − a_{0})(1 − a_{1})},   (7)

where b_{0}(φ) = max{a_{0}, 1 − φ_{0}, 1 − φ_{1}} and b_{1}(φ) = max{a_{1}, φ_{0}, φ_{1}}. Upon scrutiny, the bivariate density (7) over the unit square is seen to have a "tabletop" shape. It is constant at its maximum value over the square defined by min{φ_{0}, φ_{1}} > 1 − a_{0} and max{φ_{0}, φ_{1}} < a_{1}.
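As a sanity check on (7), the marginal density should integrate to one over the unit square. A Monte Carlo sketch follows; the hyperparameter values (a_0, a_1) = (0.6, 0.85) are our own assumption for illustration:

```python
import math
import random

random.seed(3)
a0, a1 = 0.6, 0.85   # hypothetical hyperparameter values (both above 0.5, as required)

def pi0_marginal(phi0, phi1):
    # Equation (7): marginal prior density of phi under model M0
    b0 = max(a0, 1 - phi0, 1 - phi1)
    b1 = max(a1, phi0, phi1)
    # a0, a1 > 0.5 guarantees b0 + b1 - 1 > 0, so the logarithms are well defined
    return (math.log(b0) + math.log(b1) - math.log(b0 + b1 - 1)) / ((1 - a0) * (1 - a1))

m = 400_000
total = sum(pi0_marginal(random.random(), random.random()) for _ in range(m))
print(total / m)   # close to 1, confirming (7) is a proper density
```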
For illustration, take hyperparameter values (a_{0}, a_{1}), with a_{1} = 0.85, for π_0(), along with equal a priori model weights, (w_0, w_1) = (0.5, 0.5). Histograms of the resulting limiting posterior weights w*_1 appear in Figure 2. As we might expect, we see relatively more mass at the lower boundary for w*_1 when M_1 is false. And in fact for the present hyperparameter values this lower boundary is strictly positive, so that M_1 is never completely refuted. Also, as per the discussion above, we see the right tail of the w*_1 values approaching one when M_1 is true, i.e., there are occasional circumstances where the data can strongly support the identifying restriction.

In terms of limiting posterior means for the target, when the investigator presumes M_1, the limit is simply ψ*_1 = φ†_{1} − φ†_{0}. For M_0, numerical integration is required to obtain ψ*_0 for a given φ†. Specifically, ψ*_0 = ∫ g(φ†, λ) π_0(λ | φ†) dλ, where the prior conditional density π_0(λ | φ) is derived from the joint density (6).

The RAMSE values are given in Table 3. We now see rather higher stakes than were manifested in Example 2. Imposing the maybe restriction when not warranted incurs a penalty of a 45% increase in RAMSE (as per the first row of the table). Whereas imposing the maybe restriction when warranted produces a 25% reduction in RAMSE (as per the second row of the table).

Table 3: RAMSE values for different combinations of Nature's prior and the investigator's prior in Example 3. Each value is computed as a Monte Carlo average, using 20000 realizations from each of π_0() and π_1(). The impact of the investigator using π_MIX rather than π_0 is a 44.8% increase in RAMSE when this use is not warranted, but a 24.9% decrease in RAMSE when it is warranted. [Columns: Investigator π_0, π_MIX; rows: Nature π_0, π_MIX.]

Example 4

Our final, and most complex, example blends elements of Examples 2 and 3. As per Example 2, say we are interested in properties of the joint distribution of (C, X, Y), where C is a binary confounding variable, X is a binary exposure variable, and Y is a binary outcome variable.
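The standardized (C-averaged) risk difference serving as the target here and in Example 2 is simple to compute from a full (C, X, Y) cell-probability table; a small sketch with hypothetical numbers:

```python
# Average risk difference psi = E{ Pr(Y=1 | X=1, C) - Pr(Y=1 | X=0, C) },
# computed from joint cell probabilities p[c][x][y] for binary (C, X, Y).
def avg_risk_difference(p):
    psi = 0.0
    for c in (0, 1):
        pc = sum(p[c][x][y] for x in (0, 1) for y in (0, 1))   # Pr(C = c)
        rd = (p[c][1][1] / (p[c][1][0] + p[c][1][1])
              - p[c][0][1] / (p[c][0][0] + p[c][0][1]))        # risk difference in stratum c
        psi += pc * rd
    return psi

# Hypothetical joint distribution with Pr(Y=1 | X=x, C=c) = 0.2 + 0.3 * x (no confounding):
p = {c: {x: {} for x in (0, 1)} for c in (0, 1)}
for c in (0, 1):
    for x in (0, 1):
        pcx = 0.25                 # C and X independent, each uniform
        risk = 0.2 + 0.3 * x
        p[c][x][1] = pcx * risk
        p[c][x][0] = pcx * (1 - risk)

print(avg_risk_difference(p))      # 0.3 up to floating point, matching the crude difference
```

With no confounding built in, the standardized difference coincides with the crude risk difference, as expected.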
Also as per Example 2, the target of inference is presumed to be the average risk difference, ψ = E{Pr(Y = 1 | X = 1, C) − Pr(Y = 1 | X = 0, C)}. However, the exposure X may be subject to misclassification. Hence the observable variables are (C, X*, Y), where the surrogate X* may differ from the exposure of interest, X. We confine attention to a simple form of exposure misclassification. We take it as known that the misclassification is nondifferential, in the sense that X* is conditionally independent of (C, Y) given X. And we also take it as known that Pr(X* = 0 | X = 0) = 1, i.e., the classification scheme has perfect specificity. However, the sensitivity, λ = Pr(X* = 1 | X = 1), may be less than one. Some recent work on Bayesian inference in problems with "unidirectional" misclassification of this form includes Xia and Gustafson (2016, 2018).

To organize prior specification and posterior computation across the multiple models at hand, it is helpful to specify a convenience prior distribution which in turn produces a convenience posterior distribution possessing a simple form. Then appropriate model-specific prior specifications and posterior calculations can be expressed in terms of tweaks to the convenience analysis.

Let P_{d} be the space of probability vectors over d mutually exclusive and exhaustive outcomes, and let φ ∈ P_{8} be the cell probabilities describing the distribution of (C, X*, Y). We specify the convenience prior density on θ = (φ, λ) to be π*(φ, λ) = π*(φ) π*(λ), where π*(φ) is the Dirichlet(1, . . . , 1) density, while π*(λ) is the Uniform(b, 1) density, where hyperparameter b is an a priori specified lower bound on the sensitivity of the exposure classification. For illustration, a particular value of b is adopted in what follows.

Model M_0

In the absence of any identifying restrictions, a first thought is that the convenience prior might be employed as the actual prior.
This is not actually possible, however, since we do not have a Cartesian parameter space for (φ, λ). For a given sensitivity λ, let s_λ() map from the (C, X, Y) cell probabilities to the (C, X*, Y) cell probabilities. Hence s_λ() maps from P₈ to its image S_λ ⊂ P₈. This map is invertible, and for a given φ ∈ P₈ it is possible to numerically determine whether φ ∈ S_λ, and, if so, compute s_λ⁻¹(φ). Thus an obvious adaptation of the convenience prior is to take the Model M₁ prior as:

π₁(φ, λ) = π*(φ, λ) I{φ ∈ S_λ} / Pr*{φ ∈ S_λ}. (8)

Thus we simply truncate the convenience prior to only those φ and λ pairs which are compatible with each other.

It is also possible to establish that for a given φ ∈ P₈, φ ∈ S_λ if and only if λ exceeds a threshold. That is, the cell probabilities for the observables (C, X*, Y) imply a lower bound on the sensitivity of X* as a surrogate for X. (Note that this bound may or may not exceed the investigator-specified lower bound b.) Using t(φ) to denote this lower bound, we have φ ∈ S_λ if and only if λ > t(φ). Combined with the specification of π*(λ) as Unif(b, 1), this gives

π₁(φ, λ) = π*(φ, λ) I[λ > max{b, t(φ)}] / {(1 − b)⁻¹ E*[1 − max{b, t(φ)}]}.

Importantly, this marginalizes to

π₁(φ) = π*(φ) [1 − max{b, t(φ)}] / E*[1 − max{b, t(φ)}]. (9)

In terms of the parameter of interest, say that g̃() maps from the (C, X, Y) cell probabilities to the target parameter, i.e., g̃() returns E{E(Y | X = 1, C) − E(Y | X = 0, C)}. In the parameterization at hand then, the target parameter is g(φ, λ) = g̃(s_λ⁻¹(φ)).

Model M₂

In a similar spirit to Example 3, the first identifying restriction considered is simply that the surrogate X* is in fact perfect. So model M₂ is the sub-model of M₁ corresponding to λ = 1. An obvious prior specification is

π₂(φ, λ) = π*(φ) δ₁(λ),

where δ₁() denotes a point mass at λ = 1. This simply marginalizes to

π₂(φ) = π*(φ).
(10)

Under this restriction φ is identically the (C, X, Y) cell probabilities. Hence the target parameter is simply expressed as g(φ) = g̃(φ).

Model M₃

Model M₃ renders the target identifiable via a different restriction on M₁. Namely, as per Example 2, we presume the (C, X, Y) distribution does not involve any interaction on the risk difference scale, so that E(Y | X = 1, C = c) − E(Y | X = 0, C = c) does not depend on c. If we let φ̃ parameterize the so-restricted (C, X, Y) cell probabilities (so that φ̃ has only six degrees of freedom), then the resultant (C, X*, Y) cell probabilities can be expressed as h(φ̃, λ). Here the map h() is invertible. However, its image, which we denote as H, is a strict subset of P₈. While it does not seem possible to express h⁻¹() in closed form, for a given φ ∈ P₈ one can numerically determine whether φ ∈ H, and, if so, compute h⁻¹(φ). A sensible prior construction for M₃ thus takes the form:

π₃(φ, λ) = {π*(φ) I_H(φ) / Pr*(φ ∈ H)} δ_m(φ)(λ),

where m(φ) is the unique sensitivity value implied by φ. Clearly this prior marginalizes to

π₃(φ) = π*(φ) I_H(φ) / Pr*(φ ∈ H). (11)

Note that the parameter of interest in this formulation, the constant risk difference, can be simply expressed as one element of h⁻¹(φ).

The limiting posterior model weights are governed by (9), (10), and (11), giving

(w₁*/w₂*) / (w₁/w₂) = [1 − max{b, t(φ†)}] / {1 − E*[max{b, t(φ)}]}, (12)

and

(w₃*/w₂*) / (w₃/w₂) = {Pr*(φ ∈ H)}⁻¹ if φ† ∈ H, (13)

and zero otherwise. From (12) we see that the data only mildly discriminate between M₁ and M₂, in that a positive and finite limiting Bayes factor arises. Moreover, this limit is not one (except for a set of φ† values with probability zero under all three model-specific priors). Of course we are reminded that this is indeed mild discrimination, since the limiting Bayes factor would be zero or infinity if M₁ were identified.
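To make the maps concrete, here is a brief sketch in Python (the paper's own code is in R; the function names here are ours). Under perfect specificity and nondifferential misclassification, s_λ sends the (C, X, Y) cell probabilities p(c, x, y) to φ(c, 1, y) = λ p(c, 1, y) and φ(c, 0, y) = p(c, 0, y) + (1 − λ) p(c, 1, y); inverting this and requiring nonnegative cells yields t(φ) = max over (c, y) of φ(c, 1, y)/(φ(c, 0, y) + φ(c, 1, y)). This closed form for t(φ) is a routine derivation of ours; the paper leaves the threshold implicit. The value b = 0.6 below is a hypothetical choice.

```python
import numpy as np

def s_lambda(p, lam):
    """Map (C, X, Y) cell probabilities p[c, x, y] to (C, X*, Y) cell
    probabilities, for nondifferential misclassification with perfect
    specificity and sensitivity lam."""
    phi = np.empty_like(p)
    phi[:, 1, :] = lam * p[:, 1, :]                       # X* = 1 requires X = 1
    phi[:, 0, :] = p[:, 0, :] + (1.0 - lam) * p[:, 1, :]  # true zeros plus missed ones
    return phi

def s_lambda_inv(phi, lam):
    """Invert s_lambda for a given sensitivity lam (valid when lam > t(phi))."""
    p = np.empty_like(phi)
    p[:, 1, :] = phi[:, 1, :] / lam
    p[:, 0, :] = phi[:, 0, :] - (1.0 - lam) / lam * phi[:, 1, :]
    return p

def t_threshold(phi):
    """Smallest sensitivity for which s_lambda_inv(phi, .) has no negative cells."""
    return float(np.max(phi[:, 1, :] / (phi[:, 0, :] + phi[:, 1, :])))

rng = np.random.default_rng(0)
b = 0.6  # hypothetical prior lower bound on sensitivity

# Sanity checks: any phi generated from a pair (p, lam) satisfies t(phi) <= lam,
# and s_lambda_inv recovers p exactly.
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)
lam = 0.9
phi = s_lambda(p, lam)
assert t_threshold(phi) <= lam
assert np.allclose(s_lambda_inv(phi, lam), p)

# Monte Carlo estimate of 1 - E*[max{b, t(phi)}], the normalizing quantity
# appearing in the limiting Bayes factor, with phi ~ Dirichlet(1, ..., 1).
draws = rng.dirichlet(np.ones(8), size=20000).reshape(-1, 2, 2, 2)
t_vals = np.max(draws[:, :, 1, :] / (draws[:, :, 0, :] + draws[:, :, 1, :]),
                axis=(1, 2))
denom12 = 1.0 - np.mean(np.maximum(b, t_vals))
print(denom12)  # strictly between 0 and 1
```

The Monte Carlo estimate in the last step is the ingredient needed to evaluate the limiting Bayes factor between M₁ and M₂ at any particular φ†.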
From (13) we see that when M₃ is false, enough data may or may not discredit it. That is, the true model (either M₁ or M₂) may give rise to φ† ∉ H, in which case M₃ is discredited asymptotically. However, any of the three models can yield φ† ∈ H, and (13) cannot distinguish between the three situations. Note that if M₃ is not discredited, then it necessarily receives more support than M₂, i.e., the Bayes factor (13) exceeds one.

To help convey the ability of data to discriminate between the three models, we examine w* values arising for three ensembles of φ† values, where the ensembles are randomly drawn from π₁(φ), π₂(φ), and π₃(φ) respectively. Equal prior weights w = (1/3, 1/3, 1/3) are employed. The results are plotted in Figure 3. For both the ensembles corresponding to M₁ and M₂, we see w₃* = 0 for large majorities of points. So M₃ is often, but not always, fully discredited when it is false. On the other hand, when M₃ is not discredited (the minorities of scenarios arising under M₁ and M₂, but all the scenarios arising under M₃), it receives strong support, with w₃* values ranging upward from 0.72.

In terms of the w₃* = 0 cases generated under M₁ and M₂, we see modest discriminatory power, in the form of a modest tendency for w₂* = 1 − w₁* to be smaller when M₁ is true and larger when M₂ is true. Still focussing on these cases, we also see an asymmetry. The most extreme evidence in favour of M₁ corresponds to a value of w₂* = 1 − w₁* well below one, and such values arise when M₁ is true as well as when M₂ is true. On the other hand, there are some narrow circumstances under which w₂* = 1 − w₁* is very close to one. This asymmetry echoes what was seen in Example 3. There is more scope to support the identifying restriction M₂ than there is to criticize it.

The RAMSE values for this example appear in Table 4. As per Example 3, we see fairly high stakes involved with the investigator invoking the maybe assumptions.
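Schematically, each RAMSE entry follows a generic recipe: Nature draws the parameter from one prior, the investigator reports the limiting posterior mean implied by (possibly) another prior, and the root of the average squared error over many replications is recorded. Below is a deliberately minimal Python sketch of this recipe (the paper's own code is in R); the toy target ψ = φ + λ, the Uniform(0, 1) prior on λ, and the mixture weight w = 0.5 are our own illustrative choices, not the paper's examples. The restriction λ = 0 is untestable in this toy because λ does not affect the law of the observables, so the limiting mixture weight stays at its prior value.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rep = 100_000
w = 0.5  # prior weight the mixture investigator puts on the restriction lambda = 0

def ramse(nature_restricted, investigator_mix):
    """Root-average MSE of the limiting posterior mean of psi = phi + lambda,
    where phi is identified (so it cancels from the error) and lambda is not."""
    lam = np.zeros(n_rep) if nature_restricted else rng.uniform(0.0, 1.0, n_rep)
    # The restriction is untestable, so the limiting model weight equals its
    # prior value: the mixture investigator reports w*0 + (1 - w)*E[lambda]
    # above phi, while the no-restriction investigator reports E[lambda] = 0.5.
    estimate = (1.0 - w) * 0.5 if investigator_mix else 0.5
    return float(np.sqrt(np.mean((lam - estimate) ** 2)))

for nature_restricted in (True, False):
    for investigator_mix in (True, False):
        print(nature_restricted, investigator_mix,
              round(ramse(nature_restricted, investigator_mix), 3))
```

Even in this toy, the qualitative pattern of Tables 3 and 4 emerges: relative to the no-restriction investigator, the mixture investigator does worse when the restriction fails (approximately 0.382 versus 0.289 here) but substantially better when it holds (0.25 versus 0.5).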
RAMSE is increased by 31% in terms of average performance across scenarios where neither restriction holds, but decreased by 39% on average across scenarios where a restriction does hold.

Figure 3: Distribution of the limiting posterior weights (w₁*, w₂*, w₃*), in Example 4, under M₁ (upper-left), M₂ (upper-right), and M₃ (lower-left). In each instance, an ensemble of 100 φ† points is simulated from π_j(). Points in the upper panels are jittered with a small amount of random noise, in order to better see the distribution of those points with w₁* + w₂* = 1, w₃* = 0.

Table 4: RAMSE values for different combinations of Nature's prior and the investigator's prior in Example 4. Each value is computed as a Monte Carlo average, using 1600 draws from each of π₁(), π₂(), and π₃(). The impact of the investigator using π_MIX rather than π is a 31.3% increase in RAMSE when this use is not warranted, but a 39.5% decrease in RAMSE when it is warranted. Monte Carlo standard errors for these two percentage changes are small relative to the changes themselves.

Discussion

Of course, there is no free lunch. When specifying a prior distribution, we implicitly choose the estimator with optimal average-case behavior, where the average is with respect to the joint distribution of parameters and data arising from the specified prior. In situations with one or more plausible identifying restrictions, we have seen that the stakes associated with prior assertions can be quite high. In one direction, AMSE(π, π_MIX) can be substantially higher than AMSE(π, π). Purely wishful thinking that perhaps an identifying restriction holds can come with a steep penalty. Equally, however, AMSE(π_MIX, π_MIX) can be substantially lower than AMSE(π_MIX, π).
Realistic assessment that perhaps an identifying restriction holds comes with a reward.

Our examples included some identifying restrictions that are empirically untestable, in the sense that the limiting ratio of posterior model weights is neither zero nor infinite, i.e., even an infinite amount of data would neither definitively prove nor definitively disprove the identifying assumption. In some such cases, however, the positive and finite value of this ratio does vary with the underlying parameter values θ† (via dependence on φ†). Thus a nuanced situation of partial learning about the plausibility of a restriction can result. Consequently, the pros and cons of making "perhaps" suppositions a priori are not easily and generally intuited. In a given problem, however, analysis of the kind demonstrated in this paper can reveal the structure of inference. In this way, the risks and rewards of giving some prior credence to one or more identifying restrictions can be quantified.

Supplementary Materials

R code to produce all the empirical results in the paper will be made available online via a GitHub repository.

References

Daniels, M. J. and J. W. Hogan (2008). Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Chapman & Hall/CRC Press.

Greenland, S. (2003). The impact of prior distributions for uncontrolled confounding and response bias: A case study of the relation of wire codes and magnetic fields to childhood leukemia. Journal of the American Statistical Association 98, 47–55.

Greenland, S. (2005). Multiple-bias modelling for analysis of observational data. Journal of the Royal Statistical Society, Series A 168, 267–306.

Gustafson, P. (2007). Measurement error modeling with an approximate instrumental variable. Journal of the Royal Statistical Society, Series B 69, 797–815.

Gustafson, P. (2015). Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data. Chapman and Hall/CRC Press.

Gustafson, P. and S. Greenland (2006).
The performance of random coefficient regression in accounting for residual confounding. Biometrics 62, 760–768.

Gustafson, P., N. D. Le, and R. Saskin (2001). Case-control analysis with partial knowledge of exposure misclassification probabilities. Biometrics 57, 598–609.

Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky (1999). Bayesian model averaging: A tutorial. Statistical Science 14, 382–401.

Lash, T. L., M. P. Fox, and A. K. Fink (2009). Applying Quantitative Bias Analysis to Epidemiologic Data. Springer.

Lash, T. L., M. P. Fox, R. F. MacLehose, G. Maldonado, L. C. McCandless, and S. Greenland (2014). Good practices for quantitative bias analysis. International Journal of Epidemiology 43(6), 1969–1985.

Little, R. J. and D. B. Rubin (2014). Statistical Analysis with Missing Data, Second Edition. John Wiley & Sons.

Vansteelandt, S., E. Goetghebeur, M. G. Kenward, and G. Molenberghs (2006). Ignorance and uncertainty regions as inferential tools in a sensitivity analysis. Statistica Sinica 16, 953–979.

Verstraeten, T., B. Farah, L. Duchateau, and R. Matu (1998). Pooling sera to reduce the cost of HIV surveillance: a feasibility study in a rural Kenyan district. Tropical Medicine and International Health 3(9), 747–750.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology 44(1), 92–107.

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25.

Xia, M. and P. Gustafson (2016). Bayesian regression models adjusting for unidirectional covariate misclassification. Canadian Journal of Statistics 44(2), 198–218.

Xia, M. and P. Gustafson (2018). Bayesian inference for unidirectional misclassification of a binary response trait.