Parameter Restrictions for the Sake of Identification: Is there Utility in Asserting that Perhaps a Restriction Holds?
Paul Gustafson*
Department of Statistics, University of British Columbia
September 28, 2020
Abstract
Statistical modeling can involve a tension between assumptions and statistical identification. The law of the observable data may not uniquely determine the value of a target parameter without invoking a key assumption, and, while plausible, this assumption may not be obviously true in the scientific context at hand. Moreover, there are many instances of key assumptions which are untestable, hence we cannot rely on the data to resolve the question of whether the target is legitimately identified. Working in the Bayesian paradigm, we consider the grey zone of situations where a key assumption, in the form of a parameter space restriction, is scientifically reasonable but not incontrovertible for the problem being tackled. Specifically, we investigate statistical properties that ensue if we structure a prior distribution to assert that maybe or perhaps the assumption holds. Technically this simply devolves to using a mixture prior distribution putting just some prior weight on the assumption, or one of several assumptions, holding. However, while the construct is straightforward, there is very little literature discussing situations where Bayesian model averaging is employed across a mix of fully identified and partially identified models.

Keywords: Bayesian model averaging; Bayes risk; large-sample theory; partial identification.

* The author gratefully acknowledges research support from the Natural Sciences and Engineering Research Council of Canada.

1 Introduction
In many applications of statistical modeling, a tension can arise. In order to fully identify a parameter of interest, one or more "strong" model assumptions may be required. Of course it is not prudent to invoke an assumption without a solid rationale for doing so. If identification is not obtained, however, then an uncomfortable truth ensues: even an infinite amount of data would not reveal the true value of the target parameter. To further muddy the waters, often an assumption that would identify the target is not testable empirically. So we cannot necessarily rely on the data to resolve the situation.

Say we are faced with a context where an identifying assumption or parameter restriction is scientifically plausible but not incontrovertible. One strategy, at least within the Bayesian paradigm, is to specify a prior distribution asserting that "large" violations of the restriction are unlikely. For instance, say the restriction takes the form λ = 0. Then a prior of the form λ ~ N(0, τ²), for a suitably small choice of τ, could encode this information. Some efforts along this line in the context of instrumental variable problems are pursued by Gustafson and Greenland (2006) and Gustafson (2007). More generally, in observational epidemiology settings, "Bayesian-like" approaches, often referred to as probabilistic bias analysis, have been considered (Greenland, 2003, 2005; Lash et al., 2009, 2014).

Assigning a prior distribution which probabilistically limits the magnitude of violation of an identifying restriction is not the only way to proceed. In fact, an arguably more direct encoding of "plausible but not incontrovertible" would result from specifying a mixture prior distribution, giving some weight to the restriction being met exactly, and the remaining weight to it being violated (without necessarily making a stringent judgement that the violation is unlikely to be large).
While mixture prior distributions are ubiquitous in Bayesian hypothesis testing and model selection procedures, there is scant literature on their use in the context of identifying restrictions.

1.2 Motivating Example: Prevalence Estimation with Missing Data

To motivate the developments to follow, consider the HIV surveillance study described by Verstraeten et al. (1998), which was also used to motivate the methodological developments in Vansteelandt et al. (2006). In this study, blood draws were taken from a sample of n = 787 members of the target population. At a population level, let Y indicate the HIV test result (0 for negative, 1 for positive), let R indicate the observation of the result (1 for observed, 0 for missing), and let p_{ry} = Pr(R = r, Y = y). Hence the target parameter, HIV prevalence, can be expressed as ψ = Pr(Y = 1) = p_{01} + p_{11}. The study data are summarized by c_{10} = 699 negative HIV tests (R = 1, Y = 0), c_{11} = 52 positive HIV tests (R = 1, Y = 1), and c_{0} = 36 missing test results (R = 0).

Say we proceed to infer ψ without invoking any identifying restriction, i.e., we allow that the missing data mechanism may be nonignorable (see Daniels and Hogan (2008) or Little and Rubin (2014) for full discussions of nonignorable missingness). A possible "neutral" prior specification is (p_{00}, p_{01}, p_{10}, p_{11}) ~ Dirichlet(1, 1, 1, 1). Upon reparameterizing to (s, p_{0}, p_{10}, p_{11}), where p_{0} = p_{00} + p_{01} and s = p_{01}/(p_{00} + p_{01}), the posterior distribution is characterized by (p_{0}, p_{10}, p_{11}) ~ Dirichlet(2 + c_{0}, 1 + c_{10}, 1 + c_{11}) and, independently, s ~ Unif(0, 1), with the target expressed as ψ = s p_{0} + p_{11}. Here the impact of not having an identifying restriction is clear. Since the posterior distribution of s is Unif(0, 1) for any dataset, the posterior distribution of ψ will not reduce to a point-mass in the infinite limit of further data collection. For the data at hand, the posterior distribution of ψ is depicted in Figure 1.

Alternately, say we believe the missing-at-random (MAR) assumption is justified, i.e., we presume R to be independent of Y. Then a possible neutral prior specification is to let p_{ry} = (1 − γ)^{1−r} γ^{r} (1 − ψ)^{1−y} ψ^{y}, with γ = Pr(R = 1) and ψ = Pr(Y = 1) independently and identically distributed as Unif(0, 1). The resulting posterior distribution of ψ is simply Beta(1 + c_{11}, 1 + c_{10}). Clearly this posterior distribution will concentrate to a point-mass in the infinite limit of further data collection. For the data at hand, the posterior is also depicted in Figure 1.
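The two within-model posterior distributions shown in Figure 1 can be simulated directly; the sketch below (assuming the prior specifications above, with variable names of our own choosing) draws from the NIM posterior via the Dirichlet–uniform factorization and from the MAR posterior via its Beta form:

```python
import numpy as np

rng = np.random.default_rng(0)
c10, c11, c0 = 699, 52, 36  # observed negatives, observed positives, missing

# NIM posterior: (p0, p10, p11) ~ Dirichlet(2 + c0, 1 + c10, 1 + c11), s ~ Unif(0, 1)
m = 100_000
p0, p10, p11 = rng.dirichlet([2 + c0, 1 + c10, 1 + c11], size=m).T
s = rng.uniform(size=m)
psi_nim = s * p0 + p11                       # target: psi = s * p0 + p11

# MAR posterior: psi ~ Beta(1 + c11, 1 + c10)
psi_mar = rng.beta(1 + c11, 1 + c10, size=m)

print(psi_nim.mean(), psi_mar.mean())        # NIM mean exceeds MAR mean
```

The NIM posterior mean is pulled upward relative to the MAR one, since half of the missing-cell probability is attributed to undetected positives a priori.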
Figure 1: Posterior distribution of HIV prevalence without an identifying restriction (allowing nonignorable missingness, NIM), with the missing-at-random (MAR) restriction, and the Bayesian model averaged (BMA) synthesis of the two distributions.

For an investigator having good reason to believe that the MAR assumption might hold, a natural route to expressing this belief is through a model-averaging prior. See Hoeting et al. (1999) or Wasserman (2000) for a full discussion of Bayesian model averaging (BMA). Specifically, the prior for (p_{00}, p_{01}, p_{10}, p_{11}) is taken as a mixture of the two specifications above. It is a matter of conjugate updating algebra to verify that the Bayes factor contrasting the two specifications (with MAR in the numerator versus NIM in the denominator) is:

    b = (n + 2)(n + 3) / {6 (c_{0} + 1)(n − c_{0} + 1)},   (1)

which evaluates to b = 3.
73 for the present data. Of course the posterior odds favoring the MAR specification are b times the prior odds. For instance, if the mixture prior gives equal weights of 0.5 to the two specifications, then the model-averaged posterior assigns weight 0.211 to the NIM posterior and weight 0.789 to the MAR posterior. This model-averaged posterior for the target is also depicted in Figure 1.

With both the NIM and MAR "within-model" prior specifications being neutral in some sense, it may be surprising that a Bayes factor of nearly four is obtained, given that MAR is well known to be an untestable assumption. Similarly, it may be surprising to see from the form of (1) how sensitive the Bayes factor is to the proportion of data that are missing. This raises the question of what statistical properties are possessed by the model-averaged inference scheme in a context such as this. In general there is a rich literature on Bayesian model averaging. We are not aware, however, of any focus on the case when one constituent model does not identify the target parameter, while the other model, or models, impose identifying restrictions. Thus the rationale for the present work is to quantify statistical performance specifically when model averaging is used to declare the a priori supposition that perhaps an identifying restriction holds.
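As a quick arithmetic check on (1), the Bayes factor and the resulting model-averaged posterior weights for the HIV data can be computed directly (a sketch; the variable names are ours):

```python
# Bayes factor (1) contrasting MAR (numerator) with NIM (denominator)
n, c0 = 787, 36   # sample size and number of missing results

b = (n + 2) * (n + 3) / (6 * (c0 + 1) * (n - c0 + 1))

# Posterior model weights under equal prior weights (0.5, 0.5):
# posterior odds = b * prior odds, so the weight on MAR is b / (1 + b).
w_mar = b / (1 + b)
w_nim = 1 - w_mar

print(round(b, 2), round(w_nim, 3), round(w_mar, 3))  # 3.73 0.211 0.789
```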
Say that the initial statistical model at hand, which we refer to as M_0, is parameterized by θ = (φ, λ). Here, in the terminology of Gustafson (2015), this is taken to be a transparent parameterization, with the distribution of the data depending on θ only through φ. Or, put more directly in Bayesian terms, the data D are conditionally independent of λ, given φ. The lack of full identification is thus made clear: the data inform φ (indeed we presume that they fully inform φ, in the sense that the (D | φ) model supports consistent estimation of φ). However, for any dataset, the posterior conditional distribution of (λ | φ) is the same as the prior conditional distribution. In what follows, the primary target of inference is expressed as ψ = g(φ, λ). Provided that g() varies non-trivially with λ, the target is not fully identified.

Whereas we write the overall parameter space for M_0 as θ ∈ Θ, we let the marginal parameter spaces implied by Θ be φ ∈ Φ and λ ∈ Λ. Importantly, many partially identified models that arise in practice are such that φ and λ are not variation independent, with the consequence that direct learning about φ from the observed data may induce some indirect learning about λ, via the support of λ depending on φ (Gustafson, 2015). Commensurately, note for future reference that Θ may be a proper subset of the Cartesian space Φ × Λ.

We consider one or more sub-models of M_0, each of which has some level of a priori scientific credence, and each of which identifies the target. We express the j-th of J ≥ 1 such sub-models as M_j: θ ∈ Θ_j ⊂ Θ. That restriction of θ to the j-th sub-model identifies the target implies that over (φ, λ) ∈ Θ_j, g(φ, λ) does not vary with λ. Stated more pragmatically, if M_j holds, then φ, the value of which can be learned from the data, uniquely determines the target ψ.

As alluded to in the previous section, the framework allowing the user to postulate that perhaps one of the identifying restrictions holds is that of BMA.
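The hallmark that the (λ | φ) posterior equals the (λ | φ) prior can be seen in a toy partially identified model (our own construction, purely for illustration): observations are N(φ, 1), the target is ψ = φ + λ, and a priori φ ~ N(0, 1) independently of λ ~ N(0, 1). The data never update λ, so the posterior uncertainty in ψ cannot shrink below the prior uncertainty in λ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy transparent parameterization: data ~ N(phi, 1), target psi = g(phi, lam) = phi + lam.
# Priors: phi ~ N(0, 1) and lam ~ N(0, 1), independent, so (lam | phi) is just N(0, 1).
n = 10_000
data = rng.normal(2.0, 1.0, size=n)       # generated under phi_dagger = 2

# Conjugate posterior for phi: N(n * xbar / (n + 1), 1 / (n + 1))
post_mean = n * data.mean() / (n + 1)
post_sd = np.sqrt(1 / (n + 1))

m = 200_000
phi_draws = rng.normal(post_mean, post_sd, size=m)
lam_draws = rng.normal(0.0, 1.0, size=m)  # posterior of (lam | phi) = prior
psi_draws = phi_draws + lam_draws

# phi is consistently estimated, yet the posterior sd of psi stays near the prior sd of lam
print(round(float(np.std(psi_draws)), 2))  # close to 1.0
```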
Let π_j() be the prior density specified for θ within model M_j. Then the BMA prior distribution is a mixture, whose density takes the form:

    π_MIX(θ) = Σ_{j=0}^{J} w_j π_j(θ),   (2)

where w_j = Pr(M_j) is the prior probability that model M_j is correct, hence Σ_{j=0}^{J} w_j = 1. As a technical note, we presume that each sub-model has zero probability under π_0() (as well as under the other sub-models). Consequently, it is not necessary to explicitly define the support of π_0() as excluding the sub-models.

Upon receipt of data D, standard Bayesian updating from the mixture prior (2) gives the posterior distribution as a mixture:

    π_MIX(θ | D) = Σ_{j=0}^{J} w̃_j(D) π_j(θ | D).

Here π_j(θ | D) is the standard "within-model" posterior distribution of parameters, i.e., based on π_j(θ | D) ∝ L(θ; D) π_j(θ), for likelihood function L(;). And w̃_j(D) is the posterior probability that M_j is correct. In the usual fashion,

    w̃_j(D) = w_j f_j(D) / Σ_{k=0}^{J} w_k f_k(D),

where f_j(D) = ∫_{Θ_j} L(θ; D) π_j(θ) dθ is the marginal density of the data under the j-th model.

While the framework above is standard, its implications are unexplored when the list of candidate models is a mix of partially and fully identified models. To a large extent, estimator behaviour will be seen to be specific to the particular choice of partially identified model and identified sub-models, and to the nature of the specified within-model priors. However, to foreshadow the examples which follow, there are situations where:

• For every dataset, w̃_1(D)/w̃_0(D) = w_1/w_0, i.e., the data have zero ability to discriminate between the base model and an identified sub-model.

• For data generated under some parameter values in Θ − Θ_1, w̃_1(D) tends to zero as more data accumulate, but under other such values the limit is positive. That is, a falsely asserted restriction may or may not be fully refuted.

• As the underlying θ ranges over Θ, the large-sample limit of w̃_1(D) takes values in [c, 1), with c > 0. Since the limit is never zero or one, the restriction can never be fully refuted or fully supported. Also, since c is positive, there are no parameter values under which we get close to complete refutation of the restriction. In the other direction, however, there are parameter values under which we get arbitrarily close to full support for the restriction.

Thus we find a surprising richness in the variety of behaviors that can be encountered.

Toward understanding the structure of the posterior model weights in our present framework, note that

    f_j(D) = ∫_{Θ_j} L(θ) π_j(θ) dθ = ∫_{Θ_j} L(φ) π_j(φ, λ) dφ dλ = ∫_{Φ_j} L(φ) π_j(φ) dφ.

Thus the marginal prior distribution of φ arising from the specified joint prior on (φ, λ) plays a key role. Asymptotically, imagine a datastream of independent observations generated under true parameter values (φ, λ) = (φ†, λ†). (We will generally use the 'dagger' notation to emphasize specific parameter values that give rise to the observable data.) Let D_n denote the first n observations, and let w*_j = lim_{n→∞} w̃_j(D_n) be the limiting posterior weight on the j-th model. Then we immediately have

    w*_j / w*_k = {lim_{n→∞} f_j(D_n)/f_k(D_n)} (w_j/w_k) = {π_j(φ†)/π_k(φ†)} (w_j/w_k).   (3)

Just as we can characterize the limiting posterior weights on models, we can also characterize the limiting values of within-model estimators. Generically, let ψ̂(D; π_I) be the posterior mean of the target ψ arising from the prior specification π_I and data D, where the I subscript reminds us this is the investigator's choice of prior distribution. It is easy to verify that without an identifying restriction we have

    ψ*_0 ≐ lim_{n→∞} ψ̂(D_n; π_0) = ∫ g(φ†, λ) π_0(λ | φ†) dλ.

We can similarly define ψ*_j ≐ lim_{n→∞} ψ̂(D_n; π_j) for j > 0. Specifically, when the investigator invokes the j-th restriction, provided that φ† ∈ Φ_j, we can express the limit as ψ*_j = g(φ†, ·), upon recalling that g(φ, λ) is constant in λ when (φ, λ) ∈ Θ_j. In a situation where φ† ∉ Φ_j, typically misspecified model theory (e.g., White (1982)) would be needed to determine ψ*_j. Assembling the pieces thus far, the large-sample limit of the BMA posterior mean is

    ψ*_MIX ≐ Σ_{j=0}^{J} w*_j ψ*_j.

We quantify the performance of estimators by averaging frequentist performance across the parameter space. If data were generated under parameter value θ, the estimator ψ̂(D; π_I) would incur mean-squared error (MSE) of E_θ[{ψ̂(D; π_I) − ψ(θ)}²]. We then average the MSE across different underlying parameter values θ, according to θ ~ π_N(), where the subscript N serves to remind us that this is Nature's choice of prior distribution. For data D_n with sample size n, then, the average mean-squared error (AMSE) is:

    AMSE_n(π_N, π_I) ≐ E_{π_N} E_θ[{ψ̂(D_n; π_I) − ψ(θ)}²].   (4)

Note that we quite deliberately keep both the 'A' and the 'M' in 'AMSE,' to remind us that two averagings are in play: the mean (M) of squared error across D_n given θ, distinct from the averaging (A) of this result across θ.

Standard decision-theoretic terminology would see (4) referred to as the Bayes risk under π_N. Indeed, standard arguments tell us that for fixed π_N, (4) is minimized by taking π_I = π_N, and that in fact this is the minimum achievable across all estimators, not just those arising as Bayesian estimators from some choice of prior. As a further formality, we will largely focus on the large-sample limit. Notationally we achieve this by defining AMSE() = lim_{n→∞} AMSE_n(). Also, for the sake of interpretability we will tend to report the square root of AMSE, i.e.,
RAMSE = AMSE^{1/2}.

The Nature versus investigator prior framework embedded in (4) seems particularly applicable to quantifying the pros and cons of "maybe" assertions about identified sub-models. We can compute AMSE(π_N, π_I) for the four combinations arising from π_I ∈ {π_0, π_MIX} and π_N ∈ {π_0, π_MIX}. Necessarily, AMSE(π_MIX, π_MIX) is lower than (or possibly equal to) AMSE(π_MIX, π_0). The extent to which it is lower reflects the benefit of making "maybe" assertions in contexts where appropriate, since setting π_N = π_MIX implies we are studying average-case performance across a mix of scenarios, some with, and some without, one of the identifying restrictions being true.

On the other hand, AMSE(π_0, π_MIX) is guaranteed to be higher than (or possibly equal to) AMSE(π_0, π_0), and the extent to which it is higher reflects the risk of inappropriately making "maybe" assertions. That is, across scenarios where none of the identifying restrictions hold, making the maybe assertions increases average-case MSE. So, if we regard invoking "maybe" restrictions in contexts where all the identified sub-models are a priori implausible as a form of cheating, then the extent to which AMSE(π_0, π_MIX) exceeds AMSE(π_0, π_0) is the extent to which cheating results in empirically worse estimation performance.

It is important to note that the AMSE comparisons just outlined could be applied in a fully identified setting. That is, say that ψ were identified under M_0, and consequently also identified under each sub-model M_j, j = 1, . . . , J. For a finite sample size n, the four AMSE_n values would still have the interpretations given above, quantifying both the value and the risk of speculating that the sub-models are a priori plausible. However, all four AMSE_n values would tend to zero as n tends to infinity, since both ψ̂(D_n; π_0) and ψ̂(D_n; π_MIX) would consistently estimate ψ, under any values of θ ∈ Θ. Conversely, in the situation we study, with ψ not identified under M_0, consistent estimation does not arise. Hence the positive large-sample limits of the AMSE values are fundamental descriptors of the situation.
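To make definition (4) concrete in the fully identified case just described, the sketch below estimates AMSE_n by plain Monte Carlo in a toy Beta–Bernoulli setting of our own devising, with π_N = π_I = Unif(0, 1); for this textbook case the Bayes risk of the posterior mean is known in closed form to be 1/{6(n + 2)}:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy fully identified model: theta ~ Unif(0, 1) (Nature's prior = investigator's prior),
# data D_n summarized by x ~ Binomial(n, theta), estimator = posterior mean under Beta(1, 1).
n = 10
reps = 400_000

theta = rng.uniform(size=reps)        # draw theta ~ pi_N
x = rng.binomial(n, theta)            # draw data given theta
est = (x + 1) / (n + 2)               # posterior mean under the Beta(1, 1) prior

amse = np.mean((est - theta) ** 2)    # Monte Carlo AMSE_n per definition (4)
print(amse, 1 / (6 * (n + 2)))        # both close to 0.01389
```

As n grows this quantity tends to zero, in line with the fully identified discussion above; the distinguishing feature of the partially identified setting is that the analogous limit stays positive.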
Example 1

For a first example of quantifying performance, we simply return to the motivating example of Section 1.2. Here M_0 is the NIM specification, which does not fully identify the target, whereas M_1 is the MAR specification, which does. The particular prior specifications π_0() and π_1() are those given in Section 1.2. From the posterior forms, we see the posterior mean of the disease prevalence ψ has large-sample limit ψ*_0 = p†_{11} + p†_{0}/2 when the investigator makes no identifying restriction, versus ψ*_1 = p†_{11}/(p†_{10} + p†_{11}) when the investigator presumes MAR.

The respective prior marginal densities of φ = (p_{0}, p_{10}, p_{11}) are readily obtained. Expressed as densities for (p_{10}, p_{11}) over the lower triangle of the unit square (rather than densities over the probability simplex), we have π_0(p_{10}, p_{11}) = 6 p_{0} for the NIM specification, and π_1(p_{10}, p_{11}) = (1 − p_{0})^{−1} for the MAR specification. Consequently the limiting Bayes factor is

    w*_1/w*_0 ÷ (w_1/w_0) = 1 / {6 p†_{0} (1 − p†_{0})}.

Note that this expression could also be deduced by direct inspection of (1) as n increases.

Based on the prior specifications and the limiting posterior forms, and taking w = (0.5, 0.5), we obtain

    AMSE(π_0, π_0) = E_{π_0}{(p_{01} − p_{00})²/4} = 1/40,

    AMSE(π_MIX, π_0) = (1/2) E_{π_0}{(p_{01} − p_{00})²/4} + (1/2) E_{π_1}{(p_{01} − p_{00})²/4} = (1/2)(1/40 + 1/36).

The other two quantities needed take the form AMSE(π_0, π_MIX) = E_{π_0}{w(p)} and AMSE(π_MIX, π_MIX) = (1/2) E_{π_0}{w(p)} + (1/2) E_{π_1}{w(p)}, where w(p) is the squared large-sample error of the BMA posterior mean, namely

    w(p) = [ {6 p_{0}(1 − p_{0})}/{1 + 6 p_{0}(1 − p_{0})} · {(p_{00} − p_{01})/2} + {1 + 6 p_{0}(1 − p_{0})}^{−1} · {p_{11}/(p_{10} + p_{11}) − (p_{01} + p_{11})} ]².

Given the nonlinearity of w(), the needed expectations do not have analytic forms. They are easily evaluated numerically, however, say via Monte Carlo draws from π_0() and π_1(). Table 1 gives the four RAMSE values. On the pro side, appropriate use of the "maybe" assertion leads to a 16% reduction in RAMSE compared to no assertion, i.e., we see this reduction when Nature's prior is the mixture, so we are averaging performance across some scenarios where the assertion holds and others where it does not. On the con side, inappropriate use of the "maybe" assertion leads to a 12% increase in RAMSE. This is seen when Nature's prior is π_0 alone, so we are averaging performance across scenarios which all involve nonignorable missingness.

Table 1: RAMSE values for different combinations of Nature's prior and the investigator's prior, in Example 1. Values in the first column are exact. Values in the second column are computed as Monte Carlo averages based on draws from each of π_0() and π_1(). The impact of the investigator using π_MIX rather than π_0 is an 11.8% increase in RAMSE when this use is not warranted, but a 16.0% decrease in RAMSE when it is warranted. The Monte Carlo standard errors for these two percentage changes are 0.3 percentage points and 0.2 percentage points, respectively.

                     Investigator
    Nature           π_0       π_MIX
    π_0              0.158     0.177
    π_MIX            0.162     0.136

Example 2

Say that interest lies in the distribution of three binary variables, (C, X, Y), in a population. Here C is a potential confounding variable, X is an exposure variable, and Y is a health outcome variable. The particular target of inferential interest is taken to be the average risk difference, ψ = E{E(Y | X = 1, C) − E(Y | X = 0, C)}. (As an aside, this target can be motivated as an average treatment effect in a framework using counterfactual outcomes, under an assumption that the counterfactual outcomes are conditionally independent of X given C.) Say the available data, however, are sampled conditional on C. That is,
a pre-specified number of realizations n_{0} are drawn from (Y, X | C = 0), and similarly a pre-specified number of realizations n_{1} are drawn from (Y, X | C = 1).

For this problem, let φ = (φ_{0}, φ_{1}), where φ_{c} comprises the (X, Y | C = c) cell probabilities (so φ_{c} has three free elements, with the fourth cell probability consequently determined since these probabilities must sum to one). Also, let λ = Pr(C = 1). Then (φ, λ) is a transparent parameterization, with the likelihood L(φ) = L_{0}(φ_{0}) L_{1}(φ_{1}) based on two independent multinomial samples. The target parameter can be expressed as g(φ, λ) = (1 − λ) v(φ_{0}) + λ v(φ_{1}), where v() returns the risk difference from a single set of cell probabilities. For the partially identified model M_0 we specify a prior with independencies of the form π_0(φ, λ) = π(φ_{0}) π(φ_{1}) π(λ). For future reference, let λ̄ and σ²_{λ} be the mean and variance of λ under the specified prior.

The first identifying restriction we consider is that the prevalence of C is exactly known from external sources, i.e., M_1 is defined as the sub-model of M_0 given by λ = λ̃, where λ̃ is user-specified. We express this model via the prior specification π_1(φ, λ) = π_0(φ) δ_{λ̃}(λ), where δ_{x}() is the Dirac delta function, i.e., the 'density' of a point-mass at x. We take π_1(φ) = π_0(φ) as before, but now λ = λ̃ is taken as known.

The second identifying restriction we consider is that there is no interaction on the risk difference scale. Hence we obtain model M_2 from model M_0 by constraining φ according to v(φ_{0}) = v(φ_{1}). An appropriate prior specification for the marginal π_2(φ) would be the distribution induced from π_0(φ) by conditioning on v(φ_{0}) = v(φ_{1}). For M_2 we can leave π_2(λ | φ) unspecified, since it will play no role whatsoever in the posterior distribution of ψ or in the posterior weight of M_2 relative to the other two models.

From the forms of π_j(φ) for j = 0, 1, 2, the limiting posterior model weights are

    (w*_0, w*_1, w*_2) = (0, 0, 1) if v(φ†_{0}) = v(φ†_{1}); (w_0 + w_1)^{−1}(w_0, w_1, 0) otherwise.   (5)

Specifically, M_1 is completely untestable compared to M_0, in the sense that f_1(D) = f_0(D) for every dataset. On the other hand M_2 is completely testable compared to M_0, in the sense that the constraint v(φ_{0}) = v(φ_{1}) is either proven or disproven in the limit of infinite sample size.

In terms of inference within models, under M_0 it is immediate that the posterior distribution of (ψ | φ) is a location-scale shift of the prior distribution of λ [with location v(φ_{0}) and scale v(φ_{1}) − v(φ_{0})]. The limiting posterior mean of ψ is then v(φ†_{0}) + {v(φ†_{1}) − v(φ†_{0})} λ̄. Whereas, in a related spirit, under M_1 the limiting posterior mean is v(φ†_{0}) + {v(φ†_{1}) − v(φ†_{0})} λ̃. This limit does or does not match the target ψ† depending on whether M_1 holds. If M_2 holds, then asymptotically the M_2 posterior will concentrate at the correct value of v(φ†_{0}) = v(φ†_{1}). The limiting posterior mean under M_2 when M_2 does not hold is governed by standard wrong-model asymptotic theory (e.g., see White (1982)). We will not need to determine this limit, however, as M_2 is discredited when it is wrong, as per (5).

Armed with (5) and the limiting posterior means of ψ under each model, we proceed to determine AMSE for the four combinations of Nature's prior and the investigator's prior based on π_0() or π_MIX(). Letting k = Var{v(φ_{1}) − v(φ_{0})} = 2 Var{v(φ_{c})} under π_0(), direct calculation gives:

    AMSE(π_0; π_0) = k σ²_{λ},
    AMSE(π_0; π_MIX) = k [ σ²_{λ} + {w_1/(w_0 + w_1)}² (λ̃ − λ̄)² ],
    AMSE(π_MIX; π_0) = k { w_0 σ²_{λ} + w_1 (λ̃ − λ̄)² },
    AMSE(π_MIX; π_MIX) = k { w_0 σ²_{λ} + (w_0 w_1 / (w_0 + w_1)) (λ̃ − λ̄)² }.

Note that these expressions are sufficiently simple that one can "read off" the extent to which AMSE(π_0; π_MIX) exceeds AMSE(π_0; π_0) and the extent to which AMSE(π_MIX; π_MIX) is reduced compared to AMSE(π_MIX; π_0). Note also that both these gaps collapse to zero in situations where λ̄ (the prior mean of λ under model M_0) equals λ̃ (the presumed value of λ under Model M_1).

To give a concrete example, say that external information leads to the specification of λ̃ = 0.
15 for the identifying restriction under M_1. And say the same information suggests the prior λ ~ Beta(4, 18) under Model M_0, giving the prior mode at 0.15, as well as a prior mean of λ̄ = 4/22 and prior variance of σ²_{λ} = (4 × 18)/(22² × 23) ≈ 0.0065. Also say that φ_{c} ~ Dirichlet(1, 1, 1, 1) independently for c = 0, 1, under which k = 2 Var{v(φ_{c})} = 1/3. Taking equal prior model weights (w_0, w_1, w_2) = (1/3, 1/3, 1/3), the resulting RAMSE values appear in Table 2.

Table 2: RAMSE values for different combinations of Nature's prior and the investigator's prior in Example 2. The impact of the investigator using π_MIX rather than π_0 is a 1.9% increase when this use is not warranted, but a 3.5% decrease when it is warranted.

                     Investigator
    Nature           π_0        π_MIX
    π_0              0.0464     0.0473
    π_MIX            0.0288     0.0278

Example 3

In a somewhat similar spirit to Example 2, say we are interested in the risk difference Pr(Y = 1 | X = 1) − Pr(Y = 1 | X = 0), for binary variables X and Y. However, the available data are a sample of (X, Y*) realizations, where Y* may be an imperfect surrogate for Y,
while Y itself is latent. Specifically, say Y* and X are conditionally independent given Y (the so-called "nondifferential" assumption), with λ_{y} = Pr(Y* = Y | Y = y), i.e., λ_{0} and λ_{1} are respectively the specificity and sensitivity of the surrogate. Letting ω_{x} = Pr(Y = 1 | X = x), the target of inference is ψ = ω_{1} − ω_{0}. At present we have parameterized the problem at hand by (ω, λ). However, to obtain a transparent parameterization we reparameterize to (φ, λ), where φ = (φ_{0}, φ_{1}) has components φ_{x} = Pr(Y* = 1 | X = x) = (1 − ω_{x})(1 − λ_{0}) + ω_{x} λ_{1}. With respect to this parameterization, the target is expressed as ψ = (φ_{1} − φ_{0})/(λ_{0} + λ_{1} − 1).

We specify a prior for M_0 by assigning a prior distribution in the original parameterization. We let ω follow a uniform distribution over (0, 1)², and independently we let λ follow a uniform distribution over (a_{0}, 1) × (a_{1}, 1), where a_{0} and a_{1} are worst-case assertions about the magnitude of outcome misclassification. (We presume a_{0} > 0.5 and a_{1} > 0.5, so that there is a positive association between Y and Y*.) Note that the specified prior independence between ω and λ is intuitive; there seems no reason to tie prior assertions about the (Y | X) prevalences to prior assertions about the quality of Y* as a surrogate for Y. (And note also that it would be impossible to assert prior independence of φ and λ, since the support of λ depends on φ.)

Upon moving to the transparent parameterization, following the related model formulation in Gustafson et al. (2001), we find the M_0 prior density transforms to

    π_0(φ, λ) = I_A(φ, λ) / {(1 − a_{0})(1 − a_{1})(λ_{0} + λ_{1} − 1)²}.   (6)

Here A is the intersection of three sets, imposing the restrictions that: (i) φ ∈ (0, 1)²; (ii) λ ∈ (a_{0}, 1) × (a_{1}, 1); and (iii) φ and λ are compatible with each other, in that (1 − λ_{0}) < φ_{x} < λ_{1} for x = 0, 1.

Model M_1 is the sub-model of M_0 defined by λ = (1, 1)′. Under M_1 we take the prior on φ to be uniform over (0, 1)².
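The reparameterization and its implied target formula are easy to check numerically; the short sketch below (our own illustration, with hypothetical parameter values) confirms that ψ computed in the transparent parameterization recovers ω_1 − ω_0:

```python
# Transparent reparameterization for the misclassified-outcome problem:
#   phi_x = Pr(Y* = 1 | X = x) = (1 - omega_x) * (1 - lam0) + omega_x * lam1
#   psi   = (phi_1 - phi_0) / (lam0 + lam1 - 1)
def to_phi(omega, lam0, lam1):
    return [(1 - w) * (1 - lam0) + w * lam1 for w in omega]

omega = (0.30, 0.55)     # hypothetical (Y | X) prevalences
lam0, lam1 = 0.9, 0.8    # hypothetical specificity and sensitivity

phi = to_phi(omega, lam0, lam1)
psi = (phi[1] - phi[0]) / (lam0 + lam1 - 1)

print(round(psi, 10))    # recovers omega_1 - omega_0 = 0.25
```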
This is consistent with M_0 in that it matches the (φ | λ = (1, 1)′) prior.

For a datastream arising under φ = φ†, we have the limiting Bayes factor π_1(φ†)/π_0(φ†) as per (3). For the present prior specifications, π_1(φ) = 1, while marginalizing (6) gives:

    π_0(φ) = [log b_{0}(φ) + log b_{1}(φ) − log{b_{0}(φ) + b_{1}(φ) − 1}] / {(1 − a_{0})(1 − a_{1})},   (7)

where b_{0}(φ) = max{a_{0}, 1 − φ_{0}, 1 − φ_{1}} and b_{1}(φ) = max{a_{1}, φ_{0}, φ_{1}}. Upon scrutiny, the bivariate density (7) over the unit square is seen to have a "tabletop" shape. It is constant at its maximum value over the square defined by min{φ_{0}, φ_{1}} > 1 − a_{0} and max{φ_{0}, φ_{1}} < a_{1}.
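As a sanity check on (7), the marginal density should integrate to one over the unit square. A Monte Carlo sketch follows; the hyperparameter values (a_0, a_1) = (0.6, 0.85) are our own assumption for illustration:

```python
import math
import random

random.seed(3)
a0, a1 = 0.6, 0.85   # hypothetical hyperparameter values (both above 0.5, as required)

def pi0_marginal(phi0, phi1):
    # Equation (7): marginal prior density of phi under model M0
    b0 = max(a0, 1 - phi0, 1 - phi1)
    b1 = max(a1, phi0, phi1)
    # a0, a1 > 0.5 guarantees b0 + b1 - 1 > 0, so the logarithms are well defined
    return (math.log(b0) + math.log(b1) - math.log(b0 + b1 - 1)) / ((1 - a0) * (1 - a1))

m = 400_000
total = sum(pi0_marginal(random.random(), random.random()) for _ in range(m))
print(total / m)   # close to 1, confirming (7) is a proper density
```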
For illustration, take hyperparameter values (a_{0}, a_{1}), with a_{1} = 0.85, for π_0(), along with equal a priori model weights, (w_0, w_1) = (0.5, 0.5). Histograms of the resulting limiting posterior weights w*_1 appear in Figure 2. As we might expect, we see relatively more mass at the lower boundary for w*_1 when M_1 is false. And in fact for the present hyperparameter values this lower boundary is strictly positive, so that M_1 is never completely refuted. Also, as per the discussion above, we see the right tail of the w*_1 values approaching one when M_1 is true, i.e., there are occasional circumstances where the data can strongly support the identifying restriction.

In terms of limiting posterior means for the target, when the investigator presumes M_1, the limit is simply ψ*_1 = φ†_{1} − φ†_{0}. For M_0, numerical integration is required to obtain ψ*_0 for a given φ†. Specifically, ψ*_0 = ∫ g(φ†, λ) π_0(λ | φ†) dλ, where the prior conditional density π_0(λ | φ) is derived from the joint density (6).

The RAMSE values are given in Table 3. We now see rather higher stakes than were manifested in Example 2. Imposing the maybe restriction when not warranted incurs a penalty of a 45% increase in RAMSE (as per the first row of the table). Whereas imposing the maybe restriction when warranted produces a 25% reduction in RAMSE (as per the second row of the table).

Table 3: RAMSE values for different combinations of Nature's prior and the investigator's prior in Example 3. Each value is computed as a Monte Carlo average, using 20000 realizations from each of π_0() and π_1(). The impact of the investigator using π_MIX rather than π_0 is a 44.8% increase in RAMSE when this use is not warranted, but a 24.9% decrease in RAMSE when it is warranted. [Columns: Investigator π_0, π_MIX; rows: Nature π_0, π_MIX.]

Example 4

Our final, and most complex, example blends elements of Examples 2 and 3. As per Example 2, say we are interested in properties of the joint distribution of (C, X, Y), where C is a binary confounding variable, X is a binary exposure variable, and Y is a binary outcome variable.
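The standardized (C-averaged) risk difference serving as the target here and in Example 2 is simple to compute from a full (C, X, Y) cell-probability table; a small sketch with hypothetical numbers:

```python
# Average risk difference psi = E{ Pr(Y=1 | X=1, C) - Pr(Y=1 | X=0, C) },
# computed from joint cell probabilities p[c][x][y] for binary (C, X, Y).
def avg_risk_difference(p):
    psi = 0.0
    for c in (0, 1):
        pc = sum(p[c][x][y] for x in (0, 1) for y in (0, 1))   # Pr(C = c)
        rd = (p[c][1][1] / (p[c][1][0] + p[c][1][1])
              - p[c][0][1] / (p[c][0][0] + p[c][0][1]))        # risk difference in stratum c
        psi += pc * rd
    return psi

# Hypothetical joint distribution with Pr(Y=1 | X=x, C=c) = 0.2 + 0.3 * x (no confounding):
p = {c: {x: {} for x in (0, 1)} for c in (0, 1)}
for c in (0, 1):
    for x in (0, 1):
        pcx = 0.25                 # C and X independent, each uniform
        risk = 0.2 + 0.3 * x
        p[c][x][1] = pcx * risk
        p[c][x][0] = pcx * (1 - risk)

print(avg_risk_difference(p))      # 0.3 up to floating point, matching the crude difference
```

With no confounding built in, the standardized difference coincides with the crude risk difference, as expected.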
Also as per Example 2, the target of inference is presumed to be the average risk difference, ψ = E{Pr(Y = 1 | X = 1, C) − Pr(Y = 1 | X = 0, C)}. However, the exposure X may be subject to misclassification. Hence the observable variables are (C, X*, Y), where the surrogate X* may differ from the exposure of interest, X. We confine attention to a simple form of exposure misclassification. We take it as known that the misclassification is nondifferential, in the sense that X* is conditionally independent of (C, Y) given X. And we also take it as known that Pr(X* = 0 | X = 0) = 1, i.e., the classification scheme has perfect specificity. However, the sensitivity, λ = Pr(X* = 1 | X = 1), may be less than one. Some recent work on Bayesian inference in problems with "unidirectional" misclassification of this form includes Xia and Gustafson (2016, 2018).

To organize prior specification and posterior computation across the multiple models at hand, it is helpful to specify a convenience prior distribution which in turn produces a convenience posterior distribution possessing a simple form. Then appropriate model-specific prior specifications and posterior calculations can be expressed in terms of tweaks to the convenience analysis.

Let P_{d} be the space of probability vectors over d mutually exclusive and exhaustive outcomes, and let φ ∈ P_{8} be the cell probabilities describing the distribution of (C, X*, Y). We specify the convenience prior density on θ = (φ, λ) to be π*(φ, λ) = π*(φ) π*(λ), where π*(φ) is the Dirichlet(1, . . . , 1) density, while π*(λ) is the Uniform(b, 1) density, where hyperparameter b is an a priori specified lower bound on the sensitivity of the exposure classification. For illustration, a particular value of b is adopted in what follows.

Model M_0

In the absence of any identifying restrictions, a first thought is that the convenience prior might be employed as the actual prior.
This is not actually possible, however, since we do not have a Cartesian parameter space for (φ, λ). For a given sensitivity λ, let s_λ() map from the (C, X, Y) cell probabilities to the (C, X*, Y) cell probabilities. Hence s_λ() maps from P₈ to its image S_λ ⊂ P₈. This map is invertible, and for a given φ ∈ P₈ it is possible to numerically determine whether φ ∈ S_λ, and, if so, compute s_λ⁻¹(φ). Thus an obvious adaptation of the convenience prior is to take the Model M₁ prior as:

π₁(φ, λ) = π*(φ, λ) I{φ ∈ S_λ} / Pr*{φ ∈ S_λ}. (8)

Thus we simply truncate the convenience prior to only those φ and λ pairs which are compatible with each other.

It is also possible to establish that for a given φ ∈ P₈, φ ∈ S_λ if and only if λ exceeds a threshold. That is, the cell probabilities for the observables (C, X*, Y) imply a lower bound on the sensitivity of X* as a surrogate for X. (Note that this bound may or may not exceed the investigator-specified lower bound b.) Using t(φ) to denote this lower bound, we have φ ∈ S_λ if and only if λ > t(φ). Combined with the specification of π*(λ) as Unif(b, 1), this gives

π₁(φ, λ) = π*(φ, λ) I[λ > max{b, t(φ)}] / {(1 − b)⁻¹ E*[1 − max{b, t(φ)}]}.

Importantly, this marginalizes to

π₁(φ) = π*(φ) [1 − max{b, t(φ)}] / E*[1 − max{b, t(φ)}]. (9)

In terms of the parameter of interest, say that g̃() maps from the (C, X, Y) cell probabilities to the target parameter, i.e., g̃() returns E{E(Y | X = 1, C) − E(Y | X = 0, C)}. In the parameterization at hand then, the target parameter is g(φ, λ) = g̃(s_λ⁻¹(φ)).

Model M₂

In a similar spirit to Example 3, the first identifying restriction considered is simply that the surrogate X* is in fact perfect. So model M₂ is the sub-model of M₁ corresponding to λ = 1. An obvious prior specification is

π₂(φ, λ) = π*(φ) δ₁(λ),

where δ₁() denotes a point mass at λ = 1. This simply marginalizes to

π₂(φ) = π*(φ).
(10)

Under this restriction φ is identically the (C, X, Y) cell probabilities. Hence the target parameter is simply expressed as g(φ) = g̃(φ).

Model M₃

Model M₃ renders the target identifiable via a different restriction on M₁. Namely, as per Example 2, we presume the (C, X, Y) distribution does not involve any interaction on the risk difference scale, so that E(Y | X = 1, C = c) − E(Y | X = 0, C = c) does not depend on c. If we let φ̃ parameterize the so-restricted (C, X, Y) cell probabilities (so that φ̃ has only six degrees of freedom), then the resultant (C, X*, Y) cell probabilities can be expressed as h(φ̃, λ). Here the map h() is invertible. However, its image, which we denote as H, is a strict subset of P₈. While it does not seem possible to express h⁻¹() in closed form, for a given φ ∈ P₈ one can numerically determine whether φ ∈ H, and, if so, compute h⁻¹(φ). A sensible prior construction for M₃ thus takes the form:

π₃(φ, λ) = {π*(φ) I_H(φ) / Pr*(φ ∈ H)} δ_m(φ)(λ),

where m(φ) is the unique sensitivity value implied by φ. Clearly this prior marginalizes to

π₃(φ) = π*(φ) I_H(φ) / Pr*(φ ∈ H). (11)

Note that the parameter of interest in this formulation, the constant risk difference, can be simply expressed as one element of h⁻¹(φ).

The limiting posterior model weights are governed by (9), (10), and (11), giving

(w₁*/w₂*) / (w₁/w₂) = [1 − max{b, t(φ†)}] / {1 − E*[max{b, t(φ)}]}, (12)

and

(w₃*/w₂*) / (w₃/w₂) = {Pr*(φ ∈ H)}⁻¹ if φ† ∈ H, (13)

and zero otherwise. From (12) we see that the data only mildly discriminate between M₁ and M₂, in that a positive and finite limiting Bayes factor arises. Moreover, this limit is not one (except for a set of φ† values with probability zero under all three model-specific priors). Of course we are reminded that this is indeed mild discrimination, since the limiting Bayes factor would be zero or infinity if M₁ were identified.
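To make the maps concrete, here is a brief sketch in Python (the paper's own code is in R; the function names here are ours). Under perfect specificity and nondifferential misclassification, s_λ sends the (C, X, Y) cell probabilities p(c, x, y) to φ(c, 1, y) = λ p(c, 1, y) and φ(c, 0, y) = p(c, 0, y) + (1 − λ) p(c, 1, y); inverting this and requiring nonnegative cells yields t(φ) = max over (c, y) of φ(c, 1, y)/(φ(c, 0, y) + φ(c, 1, y)). This closed form for t(φ) is a routine derivation of ours; the paper leaves the threshold implicit. The value b = 0.6 below is a hypothetical choice.

```python
import numpy as np

def s_lambda(p, lam):
    """Map (C, X, Y) cell probabilities p[c, x, y] to (C, X*, Y) cell
    probabilities, for nondifferential misclassification with perfect
    specificity and sensitivity lam."""
    phi = np.empty_like(p)
    phi[:, 1, :] = lam * p[:, 1, :]                       # X* = 1 requires X = 1
    phi[:, 0, :] = p[:, 0, :] + (1.0 - lam) * p[:, 1, :]  # true zeros plus missed ones
    return phi

def s_lambda_inv(phi, lam):
    """Invert s_lambda for a given sensitivity lam (valid when lam > t(phi))."""
    p = np.empty_like(phi)
    p[:, 1, :] = phi[:, 1, :] / lam
    p[:, 0, :] = phi[:, 0, :] - (1.0 - lam) / lam * phi[:, 1, :]
    return p

def t_threshold(phi):
    """Smallest sensitivity for which s_lambda_inv(phi, .) has no negative cells."""
    return float(np.max(phi[:, 1, :] / (phi[:, 0, :] + phi[:, 1, :])))

rng = np.random.default_rng(0)
b = 0.6  # hypothetical prior lower bound on sensitivity

# Sanity checks: any phi generated from a pair (p, lam) satisfies t(phi) <= lam,
# and s_lambda_inv recovers p exactly.
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)
lam = 0.9
phi = s_lambda(p, lam)
assert t_threshold(phi) <= lam
assert np.allclose(s_lambda_inv(phi, lam), p)

# Monte Carlo estimate of 1 - E*[max{b, t(phi)}], the normalizing quantity
# appearing in the limiting Bayes factor, with phi ~ Dirichlet(1, ..., 1).
draws = rng.dirichlet(np.ones(8), size=20000).reshape(-1, 2, 2, 2)
t_vals = np.max(draws[:, :, 1, :] / (draws[:, :, 0, :] + draws[:, :, 1, :]),
                axis=(1, 2))
denom12 = 1.0 - np.mean(np.maximum(b, t_vals))
print(denom12)  # strictly between 0 and 1
```

The Monte Carlo estimate in the last step is the ingredient needed to evaluate the limiting Bayes factor between M₁ and M₂ at any particular φ†.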
From (13) we see that when M₃ is false, enough data may or may not discredit it. That is, the true model (either M₁ or M₂) may give rise to φ† ∉ H, in which case M₃ is discredited asymptotically. However, any of the three models can yield φ† ∈ H, and (13) cannot distinguish between the three situations. Note that if M₃ is not discredited, then it necessarily receives more support than M₂, i.e., the Bayes factor (13) exceeds one.

To help convey the ability of data to discriminate between the three models, we examine w* values arising for three ensembles of φ† values, where the ensembles are randomly drawn from π₁(φ), π₂(φ), and π₃(φ) respectively. Equal prior weights w = (1/3, 1/3, 1/3) are employed. The results are plotted in Figure 3. For both the ensembles corresponding to M₁ and M₂, we see w₃* = 0 for large majorities of points. So M₃ is often, but not always, fully discredited when it is false. On the other hand, when M₃ is not discredited (the minorities of scenarios arising under M₁ and M₂, but all the scenarios arising under M₃), it receives strong support, with w₃* values ranging upward from 0.72.

In terms of the w₃* = 0 cases generated under M₁ and M₂, we see modest discriminatory power, in the form of a modest tendency for w₂* = 1 − w₁* to be smaller when M₁ is true and larger when M₂ is true. Still focussing on these cases, we also see an asymmetry. The most extreme evidence in favour of M₁ corresponds to a value of w₂* = 1 − w₁* well below one, and such values arise when M₁ is true as well as when M₂ is true. On the other hand, there are some narrow circumstances under which w₂* = 1 − w₁* is very close to one. This asymmetry echoes what was seen in Example 3. There is more scope to support the identifying restriction M₂ than there is to criticize it.

The RAMSE values for this example appear in Table 4. As per Example 3, we see fairly high stakes involved with the investigator invoking the maybe assumptions.
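Schematically, each RAMSE entry follows a generic recipe: Nature draws the parameter from one prior, the investigator reports the limiting posterior mean implied by (possibly) another prior, and the root of the average squared error over many replications is recorded. Below is a deliberately minimal Python sketch of this recipe (the paper's own code is in R); the toy target ψ = φ + λ, the Uniform(0, 1) prior on λ, and the mixture weight w = 0.5 are our own illustrative choices, not the paper's examples. The restriction λ = 0 is untestable in this toy because λ does not affect the law of the observables, so the limiting mixture weight stays at its prior value.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rep = 100_000
w = 0.5  # prior weight the mixture investigator puts on the restriction lambda = 0

def ramse(nature_restricted, investigator_mix):
    """Root-average MSE of the limiting posterior mean of psi = phi + lambda,
    where phi is identified (so it cancels from the error) and lambda is not."""
    lam = np.zeros(n_rep) if nature_restricted else rng.uniform(0.0, 1.0, n_rep)
    # The restriction is untestable, so the limiting model weight equals its
    # prior value: the mixture investigator reports w*0 + (1 - w)*E[lambda]
    # above phi, while the no-restriction investigator reports E[lambda] = 0.5.
    estimate = (1.0 - w) * 0.5 if investigator_mix else 0.5
    return float(np.sqrt(np.mean((lam - estimate) ** 2)))

for nature_restricted in (True, False):
    for investigator_mix in (True, False):
        print(nature_restricted, investigator_mix,
              round(ramse(nature_restricted, investigator_mix), 3))
```

Even in this toy, the qualitative pattern of Tables 3 and 4 emerges: relative to the no-restriction investigator, the mixture investigator does worse when the restriction fails (approximately 0.382 versus 0.289 here) but substantially better when it holds (0.25 versus 0.5).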
RAMSE is increased by 31% in terms of average performance across scenarios where neither restriction holds, but decreased by 39% on average across scenarios where a restriction does hold.

Figure 3: Distribution of the limiting posterior weights (w₁*, w₂*, w₃*), in Example 4, under M₁ (upper-left), M₂ (upper-right), and M₃ (lower-left). In each instance, an ensemble of 100 φ† points is simulated from π_j(). Points in the upper panels are jittered with a small amount of random noise, in order to better see the distribution of those points with w₁* + w₂* = 1, w₃* = 0.

Table 4: RAMSE values for different combinations of Nature's prior and the investigator's prior in Example 4. Each value is computed as a Monte Carlo average, using 1600 draws from each of π₁(), π₂(), and π₃(). The impact of the investigator using π_MIX rather than π is a 31.3% increase in RAMSE when this use is not warranted, but a 39.5% decrease in RAMSE when it is warranted. Monte Carlo standard errors for these two percentage changes are small relative to the changes themselves.

Discussion

Of course, there is no free lunch. When specifying a prior distribution, we implicitly choose the estimator with optimal average-case behavior, where the average is with respect to the joint distribution of parameters and data arising from the specified prior. In situations with one or more plausible identifying restrictions, we have seen that the stakes associated with prior assertions can be quite high. In one direction, AMSE(π, π_MIX) can be substantially higher than AMSE(π, π). Purely wishful thinking that perhaps an identifying restriction holds can come with a steep penalty. Equally, however, AMSE(π_MIX, π_MIX) can be substantially lower than AMSE(π_MIX, π).
Realistic assessment that perhaps an identifying restriction holds comes with a reward.

Our examples included some identifying restrictions that are empirically untestable, in the sense that the limiting ratio of posterior model weights is neither zero nor infinite, i.e., even an infinite amount of data would neither definitively prove nor definitively disprove the identifying assumption. In some such cases, however, the positive and finite value of this ratio does vary with the underlying parameter values θ† (via dependence on φ†). Thus a nuanced situation of partial learning about the plausibility of a restriction can result. Consequently, the pros and cons of making "perhaps" suppositions a priori are not easily and generally intuited. In a given problem, however, analysis of the kind demonstrated in this paper can reveal the structure of inference. In this way, the risks and rewards of giving some prior credence to one or more identifying restrictions can be quantified.

Supplementary Materials

R code to produce all the empirical results in the paper will be made available online via a GitHub repository.

References

Daniels, M. J. and J. W. Hogan (2008). Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Chapman & Hall/CRC Press.

Greenland, S. (2003). The impact of prior distributions for uncontrolled confounding and response bias: A case study of the relation of wire codes and magnetic fields to childhood leukemia. Journal of the American Statistical Association 98, 47–55.

Greenland, S. (2005). Multiple-bias modelling for analysis of observational data. Journal of the Royal Statistical Society, Series A 168, 267–306.

Gustafson, P. (2007). Measurement error modeling with an approximate instrumental variable. Journal of the Royal Statistical Society, Series B 69, 797–815.

Gustafson, P. (2015). Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data. Chapman and Hall/CRC Press.

Gustafson, P. and S. Greenland (2006).
The performance of random coefficient regression in accounting for residual confounding. Biometrics 62, 760–768.

Gustafson, P., N. D. Le, and R. Saskin (2001). Case-control analysis with partial knowledge of exposure misclassification probabilities. Biometrics 57, 598–609.

Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky (1999). Bayesian model averaging: A tutorial. Statistical Science 14, 382–401.

Lash, T. L., M. P. Fox, and A. K. Fink (2009). Applying Quantitative Bias Analysis to Epidemiologic Data. Springer.

Lash, T. L., M. P. Fox, R. F. MacLehose, G. Maldonado, L. C. McCandless, and S. Greenland (2014). Good practices for quantitative bias analysis. International Journal of Epidemiology 43(6), 1969–1985.

Little, R. J. and D. B. Rubin (2014). Statistical Analysis with Missing Data, Second Edition. John Wiley & Sons.

Vansteelandt, S., E. Goetghebeur, M. G. Kenward, and G. Molenberghs (2006). Ignorance and uncertainty regions as inferential tools in a sensitivity analysis. Statistica Sinica 16, 953–979.

Verstraeten, T., B. Farah, L. Duchateau, and R. Matu (1998). Pooling sera to reduce the cost of HIV surveillance: a feasibility study in a rural Kenyan district. Tropical Medicine and International Health 3(9), 747–750.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology 44(1), 92–107.

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25.

Xia, M. and P. Gustafson (2016). Bayesian regression models adjusting for unidirectional covariate misclassification. Canadian Journal of Statistics 44(2), 198–218.

Xia, M. and P. Gustafson (2018). Bayesian inference for unidirectional misclassification of a binary response trait.