Identification of Causal Effects Within Principal Strata Using Auxiliary Variables
Zhichao Jiang and Peng Ding∗

Abstract
In causal inference, principal stratification is a framework for dealing with a post-treatment intermediate variable between a treatment and an outcome, in which the principal strata are defined by the joint potential values of the intermediate variable. Because the principal strata are not fully observable, the causal effects within them, also known as the principal causal effects, are not identifiable without additional assumptions. Several previous empirical studies leveraged auxiliary variables to improve the inference of principal causal effects. We establish a general theory for identification and estimation of the principal causal effects with auxiliary variables, which provides a solid foundation for statistical inference and more insights for model building in empirical research. In particular, we consider two commonly-used strategies for principal stratification problems: principal ignorability, and the conditional independence between the auxiliary variable and the outcome given principal strata and covariates. For these two strategies, we give non-parametric and semi-parametric identification results without modeling assumptions on the outcome. When the assumptions for neither strategy are plausible, we propose a large class of flexible parametric and semi-parametric models for identifying the principal causal effects. Our theory not only ensures formal identification results for several models that have been used in previous empirical studies but also generalizes them to allow for different types of outcomes and intermediate variables.
Keywords:
Augmented design; Auxiliary independence; Identification; Principal ignorability; Principal stratification

∗ Zhichao Jiang (Email: [email protected]) is Assistant Professor, Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst. Peng Ding (Email: [email protected]) is Assistant Professor, Department of Statistics, University of California, Berkeley.

1 Introduction
Complications arise in causal inference with an intermediate variable between the treatment and the outcome. Cochran (1957), Rosenbaum (1984) and Frangakis and Rubin (2002) pointed out that naively conditioning on the observed intermediate variable does not yield valid causal interpretations in general. Frangakis and Rubin (2002) proposed to use principal stratification, the joint potential values of the intermediate variable under both the treatment and control, to define subgroup causal effects, because it acts as a pretreatment covariate vector unaffected by the treatment. Principal stratification has a wide range of applications, with meanings varying in different scientific contexts. In noncompliance problems, where the treatment received can differ from the treatment assigned, principal stratification represents individual potential compliance behavior (Angrist et al., 1996). In truncation-by-death problems, where some units die before the measurement time point of their outcomes, principal stratification represents individual potential survival status (Rubin, 2006). In surrogate evaluation problems, principal stratification helps to clarify criteria for good surrogate endpoints (Frangakis and Rubin, 2002; Gilbert and Hudgens, 2008). In mediation analysis, Rubin (2004), Gallop et al. (2009), Elliott et al. (2010) and Mattei and Mealli (2011) defined direct effects as treatment effects on the outcome within principal strata with identical potential intermediate variables under the treatment and control. VanderWeele (2008) and Forastiere et al. (2018) linked the principal stratification approach with the direct and indirect effect approach, and Jo (2008) linked the principal stratification approach with structural equation models for mediation analysis. These problems with intermediate variables concern average causal effects within principal strata, which are also known as the principal causal effects (PCEs).

Because we cannot observe the joint potential values of the intermediate variable, we do not know the principal stratum of every individual and thus cannot identify the PCEs without additional assumptions. For a binary intermediate variable, Zhang and Rubin (2003), Cheng and Small (2006) and Imai (2008) derived large-sample bounds, which can be too wide to be informative. Angrist et al. (1996), Little and Yau (1998), Zhang et al. (2009) and Frumento et al. (2012) imposed additional structural or model assumptions to achieve identification. When the intermediate variable is continuous, identification becomes more difficult because of the infinitely many principal strata. To estimate the PCEs, Gilbert and Hudgens (2008) assumed parametric models and used a likelihood approach. Jin and Rubin (2008), Schwartz et al. (2011), and Zigler and Belin (2012) proposed different forms of parametric and semi-parametric Bayesian approaches. However, the identifiability of their models is not formally established. Without identifiability, the likelihood function may be flat over a region of some parameters, and Bayesian inference can be sensitive to prior specifications. See Gustafson (2009) and Ding and Li (2018) for more discussion on identifiability.

Identification is sometimes achievable with a pretreatment auxiliary variable satisfying certain conditional independence assumptions. We focus on two categories. The first category assumes that the outcome is independent of the principal strata given the auxiliary variable. This assumption is known as principal ignorability (Jo et al., 2011; Ding and Lu, 2017).
Under principal ignorability, Jo and Stuart (2009) and Stuart and Jo (2015) used principal scores to analyze data with one-sided noncompliance, and Joffe et al. (2007) suggested using principal scores to estimate general causal effects within principal strata. Ding and Lu (2017) established formal identification results for PCEs with a binary intermediate variable in randomized experiments. The other category assumes the conditional independence between the outcome and the auxiliary variable within principal strata. We will refer to this conditional independence as auxiliary independence. This assumption motivates several identification and estimation strategies in different contexts. For a binary intermediate variable, Ding et al. (2011) used the baseline quality of life as an auxiliary variable for identification when evaluating the effect on the quality of life with outcomes truncated by death. Under monotonicity, Mealli and Pacini (2013) relaxed Ding et al. (2011)'s assumptions and discussed bounds and identification of the PCEs with a binary secondary variable. Wang et al. (2017) extended the strategy to observational studies and relaxed monotonicity in a sensitivity analysis. In a study with multiple independent trials, Jiang et al. (2016) used the trial number as an auxiliary variable and proposed strategies to identify the PCEs. Yuan et al. (2019) weakened the identification assumptions and applied the methodology to a multi-site trial in education. Similar ideas have also been used to deal with continuous intermediate variables. In assessing the effect of an HIV vaccine on infection rate through immune response, Follmann (2006) used the baseline immune response to the rabies vaccine as an auxiliary variable. Qin et al. (2008) extended this idea to deal with time-to-event endpoints under case-cohort sampling. Gilbert and Hudgens (2008) and Huang and Gilbert (2011) proposed approaches to evaluating biomarkers based on principal stratification by incorporating baseline covariates as auxiliary variables to predict the biomarkers. These strategies also provided insights for better experimental designs. In particular, Gabriel and Follmann (2016) proposed the augmented treatment run-in design and used a baseline measure as a predictor of the potential values of the intermediate variable. However, under auxiliary independence, formal identification results are established only for binary intermediate variables (Ding et al., 2011; Mealli and Pacini, 2013; Jiang et al., 2016).

This paper discusses the identification of PCEs defined by a general intermediate variable with auxiliary variables. We first generalize the identification results under principal ignorability in Ding and Lu (2017) to general intermediate variables in both randomized experiments and observational studies. We then study identification under auxiliary independence in various scenarios. With auxiliary independence, we establish non-parametric identification results for discrete intermediate variables and semi-parametric identification results for continuous intermediate variables. These results do not require modeling the outcome. Without principal ignorability or auxiliary independence, we propose a large class of parametric models to identify the PCEs, whose identifiability has not been formally established before.
Compared with models used in previous empirical studies, our models require weaker assumptions and can deal with different types of data.

Identifiability is a cornerstone for both frequentist (Bickel and Doksum, 2015) and Bayesian (Gustafson, 2015) inference. Our results provide theoretical bases for checking the identifiability of PCEs before analyzing data. Practitioners can use our results to guide model building for principal stratification problems. Our results imply that some existing models are indeed identifiable but some are not (e.g., Follmann, 2006; Gilbert and Hudgens, 2008; Zigler and Belin, 2012). Moreover, our results reveal that some existing models invoked unnecessary assumptions for identification, for example, restricting the parameter space or imposing informative priors, although these assumptions make finite-sample inference more convenient.

The paper uses the following notation. Let i.i.d. denote "independently and identically distributed,"
A ⊥⊥ B | C denote the conditional independence of A and B given C, and A =_d B denote that A has the same distribution as B. Let P(·) be the probability mass or density function, and Φ(·) be the cumulative distribution function of the standard Normal distribution. We say that functions {f_1(x), ..., f_J(x)} are linearly independent if c_1 f_1(x) + · · · + c_J f_J(x) = 0 for all x implies c_1 = c_2 = · · · = c_J = 0. We say that a family Q of probability distributions is complete if ∫ f(v) Q(dv) = 0 for all Q ∈ Q implies f(v) = 0, a.s. (Lehmann and Romano, 2006).

Let Z be a binary treatment indicator with Z = 1 for the treatment and 0 for the control, Y be an outcome of interest, and S be an intermediate variable between the treatment and the outcome. Let S_{iz} and Y_{iz} be the potential values of the intermediate variable and the outcome if unit i were to receive treatment z (z = 0, 1). The observed values are S_i = Z_i S_{i1} + (1 − Z_i) S_{i0} and Y_i = Z_i Y_{i1} + (1 − Z_i) Y_{i0}. Assume that {Z_i, S_{i1}, S_{i0}, Y_{i1}, Y_{i0} : i = 1, ..., n} are i.i.d. samples drawn from an infinite superpopulation, and thus the observed {Z_i, S_i, Y_i : i = 1, ..., n} are also i.i.d. As a result, we can drop the subscript i.

Frangakis and Rubin (2002) defined principal stratification as U_i = (S_{i1}, S_{i0}), the joint potential values of the intermediate variable, and the PCEs as τ_{s1 s0} = E{Y_1 − Y_0 | U = (s_1, s_0)} for all s_1, s_0. The PCEs are not identifiable in general because U is latent. It is common to exploit a pretreatment auxiliary variable for identifying the PCEs. Let W_i denote this variable, with meanings varying in different settings. We start with the following basic assumption.

Assumption 1. Z ⊥⊥ (Y_1, Y_0, S_1, S_0) | W.

Assumption 1 is often guaranteed by design. In completely randomized experiments, Assumption 1 holds because Z ⊥⊥ (Y_1, Y_0, S_1, S_0, W). In a multi-center experiment with W being the center number, Assumption 1 holds because Z is randomized within each center. We consider two different assumptions for identification. The first is the conditional independence between the potential outcome Y_z and the principal stratum U given the auxiliary variable W.

Assumption 2 (principal ignorability). Y_z ⊥⊥ U | W for z = 0, 1. That is, conditional on W, the principal stratification variable is as if randomly assigned with respect to the potential outcomes.

Assumption 2 requires that, conditioning on the auxiliary variable, there is no difference between the distributions of the potential outcomes across principal strata. Many applied researchers have invoked it to estimate the PCEs (Follmann, 2000; Jo and Stuart, 2009; Jo et al., 2011; Stuart and Jo, 2015). To make Assumption 2 more plausible, researchers often need to include all pretreatment covariates in W. We provide two examples below.
Example 1. Follmann (2000) studied the effect of a multi-factor intervention on mortality due to coronary heart disease, where Z is the indicator of the intervention and Y is the survival time of the patients. One-sided noncompliance occurred in the experiment: patients assigned to the treatment group might not actually take the treatment. Let S denote the actual treatment received, which can be different from Z. The principal stratification variable characterizes the compliance behavior of the patients. Follmann (2000) argued that patients with different compliance behavior would be similar conditional on pretreatment covariates W, and estimated the PCEs under principal ignorability.
Example 2. Ding and Lu (2017) gave an example of a randomized experiment with truncation by death, where Z is the treatment indicator, S is the binary survival status, and Y is the health-related quality of life. Because the outcome is well defined only for the patients who survive, the parameter of interest is the PCE within the principal stratum of patients who would survive regardless of the treatment. They used all the covariates as the auxiliary variables and invoked principal ignorability in their analysis, which means that the health-related quality of life for the always-survived patients would be identical to that for the other patients given the covariates.

The second identification assumption is the conditional independence between the potential outcome Y_z and the auxiliary variable W given the principal stratum U.

Assumption 3 (auxiliary independence). Y_z ⊥⊥ W | U for z = 0, 1.

Under Assumptions 1 and 3, we have Y ⊥⊥ W | (Z, U), i.e., the auxiliary variable is independent of the observed outcome conditional on the treatment and the principal strata. Including additional pretreatment covariates X can make this assumption more plausible. However, for notational simplicity, we condition on X implicitly and omit X below. We provide two examples, in which Assumption 3 is justifiable by design.

Example 3. Follmann (2006) introduced an augmented design to assess immune response in vaccine trials, where Z is the indicator of an HIV vaccine injection, S is the immune response to this vaccine, and Y is the infection indicator. Before the randomization of Z, all patients receive the rabies vaccine. Let W denote the immune response to the rabies vaccine, which is correlated with S. Because the rabies vaccine is irrelevant to HIV infection, the potential HIV infection status should depend only on the immune response to the HIV vaccine but not on that to the rabies vaccine. This justifies auxiliary independence, based on which Follmann (2006) estimated the PCEs.
Example 4. Jiang et al. (2016) proposed approaches to identifying the PCEs with multiple independent trials, where Z is the treatment indicator, S is the indicator of three-year cancer recurrence, and Y is the five-year survival status. The data are from multiple trials, with the trial number denoted by W. Jiang et al. (2016) argued that the principal stratification variable is a measure of physical status, and assumed that the survival status does not depend on the trial number W given the patient's physical status. As a result, they identified the PCEs under auxiliary independence.

When S is binary as in Example 4, Jiang et al. (2016) showed the identifiability of the PCEs. With a general S as in Example 3, formal identification results have not been established, although several parametric or semi-parametric models have been proposed in data analysis.

In the following two sections, we give a unified theory for the identification of the PCEs with an auxiliary variable under various scenarios. The theoretical results depend on three factors: (a) whether or not the potential intermediate variable under control, S_0, is constant; (b) whether the intermediate variable S is discrete or continuous; and (c) whether Assumption 2 or 3 holds. Table 1 presents the roadmap for our paper. We encourage the readers to revisit it after reading Sections 3 and 4.

We first consider the cases with a constant intermediate variable under control.
Assumption 4. S_{i0} = c for all i, where c is a constant.

In some vaccine trials (e.g., Follmann, 2006; Hudgens and Gilbert, 2009), Assumption 4 is plausible because vaccine antigens must be present to induce a specific immune response, and they are absent in the control group. For a binary S, Assumption 4 with c = 0 is called strong monotonicity, which holds in the one-sided noncompliance setting because individuals assigned to the control group do not have access to the treatment (Sommer and Zeger, 1991; Imbens and Rubin, 2015). Under Assumption 4, S_0 is constant, and therefore it is not necessary to include it in U, simplifying the PCEs to

τ_{s1} = E(Y_1 − Y_0 | S_1 = s_1) = E(Y_1 | S_1 = s_1) − E(Y_0 | S_1 = s_1).

Because S_1 is observed in the treatment group, we can identify E(Y_1 | S_1 = s_1) = E(Y | Z = 1, S = s_1) under Assumption 1. We then need only to identify E(Y_0 | S_1 = s_1). Because S_1 is missing in the control group, the PCEs are not identifiable without additional assumptions. Below we discuss the identification of the PCEs under Assumption 2 or 3.

Table 1: Roadmap of the sufficient conditions for identifying the PCEs. Note that the results with a non-constant S_0 require the identification of P(S_1, S_0 | W).

                 Assumptions   Type of S   Requirement for W            Outcome model
Constant S_0
  Section 3.1    1, 2 and 4    General     No                           No
  Section 3.2    1, 3 and 4    Discrete    More categories than S       No
  Section 3.3    1, 3 and 4    General     Completeness                 No
  Section 3.4    1 and 4       General     Depends on the model of S    Yes
Non-constant S_0
  Section 4.2    1 and 2       General     No                           No
  Section 4.3    1 and 3       Discrete    More categories than S       No
  Section 4.4    1 and 3       General     Completeness                 No
  Section 4.5    1             General     Depends on the model of S    Yes

We extend the definition of the principal score to a general S: the probability of the principal strata conditional on the auxiliary variable, π_{s1,s0}(W) = P(S_1 = s_1, S_0 = s_0 | W). Under Assumption 4, it reduces to π_{s1}(W) = P(S_1 = s_1 | W), which is identified by π_{s1}(W) = P(S = s_1 | Z = 1, W) under Assumption 1. The proportions of the strata are then identified by π_{s1} = P(S_1 = s_1) = E{π_{s1}(W)}. With principal ignorability, we can identify the PCEs using the principal scores, as shown in the following theorem.
Theorem 1. Under Assumptions 1, 2 and 4, the PCEs are identified by

τ_{s1} = E(Y | Z = 1, S = s_1) − E{ (π_{s1}(W)/π_{s1}) · (1 − Z)Y / (1 − e(W)) },   (1)

where e(W) = P(Z = 1 | W) is the propensity score.

Theorem 1 shows that E(Y_0 | S_1 = s_1) can be identified by the average of the outcomes in a weighted sample, with the weight depending on both the principal score and the propensity score. The principal score accounts for the relationship between the principal stratum membership and the covariates, whereas the propensity score accounts for the relationship between the treatment and the covariates. Ding and Lu (2017)'s result holds only in randomized experiments with a binary S, and Theorem 1 generalizes it to allow for different types of S in observational studies.

Based on Theorem 1, we can estimate the PCEs using a two-step procedure. We first estimate the principal score and the propensity score. We then plug the estimated principal score and propensity score into (1) and replace the expectation with the empirical average to obtain estimates of the PCEs.
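The following is a minimal sketch of this two-step estimator for a discrete S, using off-the-shelf logistic regressions for both the principal score and the propensity score; the function name and the model choices are illustrative assumptions rather than part of the original proposal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pce_weighting(Z, S, Y, W):
    """Two-step estimator of tau_{s1} based on (1), for a discrete S.

    Step 1: fit the principal score pi_{s1}(W) = P(S1 = s1 | W) on the treated
    arm (where S = S1 under Assumption 4) and the propensity score e(W).
    Step 2: plug both into (1), replacing expectations with sample averages.
    """
    Z, S, Y = map(np.asarray, (Z, S, Y))
    W = np.asarray(W, dtype=float).reshape(len(Z), -1)

    ps_model = LogisticRegression(max_iter=1000).fit(W[Z == 1], S[Z == 1])
    e_hat = LogisticRegression(max_iter=1000).fit(W, Z).predict_proba(W)[:, 1]
    pi_w_all = ps_model.predict_proba(W)          # columns follow ps_model.classes_

    tau = {}
    for j, s1 in enumerate(ps_model.classes_):
        pi_w = pi_w_all[:, j]                     # pi_{s1}(W_i)
        pi_s1 = pi_w.mean()                       # pi_{s1} = E{pi_{s1}(W)}
        term1 = Y[(Z == 1) & (S == s1)].mean()    # E(Y | Z = 1, S = s1)
        term0 = np.mean(pi_w / pi_s1 * (1 - Z) * Y / (1 - e_hat))
        tau[s1] = term1 - term0
    return tau
```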
Theorem 2. Suppose S ∈ {s_1, ..., s_K} and W ∈ {w_1, ..., w_L}. Let M denote the K × L matrix with (k, l)-th element P(S = s_k | Z = 1, W = w_l). Under Assumptions 1, 3 and 4, if rank(M'M) = K, then the PCEs are identifiable.

From Theorem 2, a necessary condition for identification is L ≥ K, i.e., W must have more categories than S. Because M depends only on the distribution of the observed data, the condition rank(M'M) = K is testable. The following example illustrates the identifiability for the case with binary intermediate and auxiliary variables.
Example 5. Consider binary S and W. First, from the observed distribution and Assumption 1, we can identify θ_{sw} = P(S_1 = s | W = w) = P(S = s | Z = 1, W = w) and δ_w = E(Y_0 | W = w) = E(Y | Z = 0, W = w) for s, w = 0, 1. Second, from Assumption 3, we have

δ_0 = E(Y_0 | S_1 = 1) θ_{10} + E(Y_0 | S_1 = 0) θ_{00},
δ_1 = E(Y_0 | S_1 = 1) θ_{11} + E(Y_0 | S_1 = 0) θ_{01},

which are two linear equations in E(Y_0 | S_1 = 1) and E(Y_0 | S_1 = 0). If the condition rank(M'M) = 2 holds, or, equivalently, S and W are dependent given Z = 1, we can uniquely solve the two linear equations and obtain

E(Y_0 | S_1 = 1) = (δ_0 θ_{01} − δ_1 θ_{00}) / (θ_{10} θ_{01} − θ_{11} θ_{00}),   E(Y_0 | S_1 = 0) = (δ_1 θ_{10} − δ_0 θ_{11}) / (θ_{10} θ_{01} − θ_{11} θ_{00}).

Therefore, the PCEs are identifiable.
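More generally, the moment equations behind Theorem 2, δ_w = Σ_k E(Y_0 | S_1 = s_k) P(S_1 = s_k | Z = 1, W = w), can be solved by linear algebra once M and the vector of control-arm means are estimated. Below is a small sketch under these assumptions; the function name is ours, and the rank condition of Theorem 2 corresponds to the linear system having a unique solution.

```python
import numpy as np

def control_means_by_stratum(M, delta):
    """Solve delta = M' m for m, where
       M[k, l]  = P(S1 = s_k | Z = 1, W = w_l)  (estimable from the treated arm),
       delta[l] = E(Y | Z = 0, W = w_l)         (estimable from the control arm),
       m[k]     = E(Y0 | S1 = s_k)              (the target in Theorem 2).
    The solution is unique when M has full row rank K."""
    M = np.asarray(M, dtype=float)
    delta = np.asarray(delta, dtype=float)
    if np.linalg.matrix_rank(M @ M.T) < M.shape[0]:
        raise ValueError("rank condition of Theorem 2 fails")
    m, *_ = np.linalg.lstsq(M.T, delta, rcond=None)
    return m
```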
Identification is more difficult with a continuous intermediate variable, which generates infinitely many principal strata. Let 𝒲 be the support of W, and P_W = {P(S_1 | W = w) : w ∈ 𝒲} be the family of probability distributions indexed by w. Based on the definition of completeness, we give a sufficient condition for identification.
Theorem 3. Under Assumptions 1, 3 and 4, if P_W is complete, then the PCEs are identifiable.

As discussed before, the key to identifying the PCEs is to identify E(Y_0 | S_1). From Assumptions 1 and 3, we have

E(Y | Z = 0, W = w) = E(Y_0 | W = w) = E{E(Y_0 | S_1) | W = w} = ∫ E(Y_0 | S_1 = s_1) Q(ds_1)   (2)

for any probability measure Q(s_1) = P(S_1 ≤ s_1 | W = w) in P_W. The left-hand side of (2) is directly estimable from the observed data, and the distributions in P_W are identified by P(S_1 | W) = P(S | Z = 1, W). Therefore, (2) is an integral equation for E(Y_0 | S_1 = s_1). As a result, E(Y_0 | S_1 = s_1) is identifiable if it is uniquely determined by (2), which is guaranteed by the completeness of P_W. When S is discrete, the integral in (2) becomes a summation, and the completeness is the same as the rank condition in Theorem 2.

Theorem 3 is general but abstract. From the well-known completeness property of the exponential family (Lehmann and Romano, 2006), we have a more interpretable sufficient condition for identifying the PCEs.
Theorem 4. Under Assumptions 1, 3 and 4, further assume

P(S_1 = s | W = w) = h(s) g(w) exp{η(w)' t(s)},

where s → t(s) is a one-to-one mapping and {η(w) : w ∈ 𝒲} contains an open set in R^d, where d is the length of the vector function η(w). Then the PCEs are identifiable.

Theorem 4 requires that the distribution of S_1 conditional on W belongs to an exponential family, but it does not require any model for the potential outcomes Y_z. Therefore, Theorem 4 guarantees semi-parametric identifiability and allows for different types of outcomes. Below we give an example for Normal distributions.
Corollary 1. Under Assumptions 1, 3 and 4, if (S_1, W) follows a bivariate Normal distribution, then the PCEs are identifiable.
Remark 1. For a binary outcome, Follmann (2006) assumed that the outcome follows a Probit model and that (S_1, W) follows a bivariate Normal distribution, which is a special case of Corollary 1. Thus, Follmann (2006)'s model is semi-parametrically identified even without the outcome model, and his parametric outcome model is invoked only for convenience in the finite-sample inference.

To further improve the applicability of Theorem 3, we review the following lemma (Hu and Shiu, 2017, Lemma 4) on the completeness of a class of location-scale distribution families, which works for distributions outside the exponential family.
Lemma 1. Suppose the support of W has an interior point, and S_1 = h(W) + σ(W)ε with continuously differentiable h(w) and σ(w), and ε ⊥⊥ W. Then P_W is complete if the characteristic function φ(t) and the density function f(ε) of ε satisfy the following conditions:

(a) 0 < |φ(t)| < C exp(−δ|t|) for all t ∈ R and some constants C, δ > 0;

(b) f(ε) is continuously differentiable, ∫ |x f'(x)| dx < +∞, and ∫ f²(x) dx < +∞;

(c) for any positive integer J, the functions {f((x − h_1)/σ_1), ..., f((x − h_J)/σ_J)} are linearly independent, where the (h_j, σ_j)'s are distinct.

The existence of the infinite sequence required by Lemma 1 holds automatically for a continuous W but fails for a discrete W. Conditions (a) and (b) in Lemma 1 are technical requirements on the distribution of the error term ε. Condition (c) means that finite location-scale mixtures of the distribution of ε are identifiable, which holds for many distributions (Everitt and Hand, 1981). For example, Appendix B.1 shows that Conditions (a)–(c) hold when ε follows a Normal, t or Logistic distribution. Combining Theorem 3 with Lemma 1, we obtain the following theorem for location-scale distribution families.
Theorem 5. Suppose that W is continuous, Assumptions 1, 3 and 4 hold, S_1 = h(W) + σ(W)ε with continuously differentiable h(w) and σ(w), and ε ⊥⊥ W. If ε satisfies Conditions (a)–(c) in Lemma 1, then the PCEs are identifiable.

Theorem 5 guarantees the identifiability of the PCEs in many models involving distributions that do not belong to an exponential family. It allows for heteroscedastic errors and enables flexible model choices. For example, if we replace the bivariate Normal distribution assumption on (S_1, W) with S_1 | W = w ∼ N(µ(w), σ²(w)), then Theorem 4 and Corollary 1 cannot be applied, because {η(w) = (1/σ²(w), µ(w)/σ²(w)) : w ∈ 𝒲} is a one-dimensional curve in R² and thus cannot contain an open set. However, Theorem 5 ensures that the PCEs are still identifiable.

The conditional independence in Assumption 2 or 3 may be violated. In Example 2, covariates may not be sufficient to account for the difference in the health-related quality of life across principal strata, which makes Assumption 2 implausible; in Example 4, different centers may have different qualities of services, which makes Assumption 3 implausible. Without conditional independence, W does not help to achieve non-parametric or semi-parametric identification. One solution is to conduct a sensitivity analysis, which, however, requires sensitivity parameters characterizing the violation of the assumptions and a specification of their ranges. A sensitivity analysis gives a range of estimates rather than a point estimate, and it often depends on additional model assumptions. We do not pursue this direction in this paper. Instead, in this subsection, we seek an alternative route and propose parametric models for identifying the PCEs, in which the auxiliary variable W satisfies certain modeling assumptions. We can also include other covariates X in our models, but do not require any modeling assumptions for X. So, again, we condition on X implicitly and omit it below.
Proposition 1. Under Assumptions 1 and 4, assume that (S_1, Y_0) follow additive models:

S_1 = g(W) + σ_S(W) ε_S,   (3)
Y_0 = β_0 + α S_1 + Σ_{j=1}^{J} β_j f_j(W) + σ_Y(W) ε_Y,   (4)

where E(ε_S | W) = E(ε_Y | W) = 0, and g(w), σ_S(w) and σ_Y(w) can be unknown functions. If {1, g(w), f_1(w), ..., f_J(w)} are linearly independent, then the PCEs are identifiable.

We do not need to specify g(w) and σ_S(w) because we can identify them from the observed distribution P(S, W | Z = 1) under Assumption 1. In contrast, we need to specify the f_j(w)'s in the model of Y_0.

Intuitively, in Proposition 1, replacing S_1 in (4) by (3), we obtain an additive model of Y_0 on W, and the linear independence condition allows us to disentangle the coefficients of the different terms involving W. For example, if g(W) is quadratic in W in (3) and {J = 1, f_1(W) = W} in (4), then the linear independence assumption in Proposition 1 holds. However, if g(w) is linear in w, then the linear independence assumption fails.

If f_j(w) = 0 for all j, then Proposition 1 becomes a special case of Theorem 5. Proposition 1 guarantees the identifiability of the PCEs in additive models without specifying the distributions of the error terms.

In the model for Y_0, we require S_1 to enter linearly. Identification may also be possible for other forms of S_1, but will require knowledge of the distributions of the error terms.
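To make the disentangling concrete, here is a rough two-stage regression sketch under the models (3)–(4): the first stage estimates g(w) = E(S_1 | W = w) from the treated arm, and the second stage regresses the control-arm outcomes on {g_hat(W), f_1(W), ..., f_J(W)}, using E(Y | Z = 0, W) = β_0 + α g(W) + Σ_j β_j f_j(W). The flexible first-stage learner and the function names are illustrative assumptions, not prescriptions from this paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

def fit_additive_components(Z, S, Y, W, f_list):
    """Recover (beta_0, alpha, beta_1, ..., beta_J) in (4) under model (3).

    f_list: the user-specified functions f_1, ..., f_J of w appearing in (4).
    Returns the intercept beta_0, the coefficient alpha of S1, and the beta_j's.
    """
    Z, S, Y = map(np.asarray, (Z, S, Y))
    W = np.asarray(W, dtype=float).reshape(-1, 1)

    # Stage 1: g(w) = E(S1 | W = w), identified from P(S, W | Z = 1).
    g_hat = GradientBoostingRegressor().fit(W[Z == 1], S[Z == 1]).predict(W)

    # Stage 2: E(Y | Z = 0, W) = beta_0 + alpha * g(W) + sum_j beta_j f_j(W).
    design = np.column_stack([g_hat] + [f(W).ravel() for f in f_list])
    reg = LinearRegression().fit(design[Z == 0], Y[Z == 0])
    return reg.intercept_, reg.coef_[0], reg.coef_[1:]
```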
For binary outcomes, we show an identification result below for the Probit model.

Proposition 2. Under Assumptions 1 and 4, assume that S_1 follows an additive model with Normal error and Y_0 follows a Probit model:

S_1 = g(W) + ε_S,   ε_S ⊥⊥ W,   ε_S ∼ N(0, σ²),
P(Y_0 = 1 | S_1 = s_1, W = w) = Φ( β_0 + α s_1 + Σ_{j=1}^{J} β_j f_j(w) ),

where g(w) can be unknown. If {1, g(w), f_1(w), ..., f_J(w)} are linearly independent, then the PCEs are identifiable.

The model of S_1 in Proposition 2 requires that the variance of the error term ε_S does not depend on W, which is different from Proposition 1. Identification may also be possible with the variance depending on W, but will rely on the functional form of var(S_1 | W).
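One reason the Normal error is convenient is that integrating the Probit outcome model over S_1 | W = w ∼ N(g(w), σ²) yields another Probit probability in w, so the implied control-arm probability has a closed form. The sketch below only illustrates this marginalization, with argument names of our own choosing; it is not a full estimation procedure.

```python
import numpy as np
from scipy.stats import norm

def implied_control_prob(w, g, sigma, beta0, alpha, betas, f_list):
    """P(Y = 1 | Z = 0, W = w) implied by Proposition 2's model:
    averaging Phi(beta0 + alpha*S1 + sum_j beta_j f_j(w)) over
    S1 | W = w ~ N(g(w), sigma^2) gives a Probit with an attenuated index."""
    index = beta0 + alpha * g(w) + sum(b * f(w) for b, f in zip(betas, f_list))
    return norm.cdf(index / np.sqrt(1.0 + alpha ** 2 * sigma ** 2))
```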
Remark 2. Our result is not contrary to Follmann (2006). Without Assumption 3, Follmann (2006) assumed a bivariate Normal distribution for (S_1, W) and used the following Probit model for Y:

P(Y = 1 | Z, S_1, W) = Φ( β_0 + β_1 Z + β_2 S_1 + β_3 W + β_4 Z S_1 ).   (5)

Under Assumption 1, (5) is equivalent to

P(Y_1 = 1 | S_1, W) = Φ{ β_0 + β_1 + (β_2 + β_4) S_1 + β_3 W },
P(Y_0 = 1 | S_1, W) = Φ( β_0 + β_2 S_1 + β_3 W ).

From Proposition 2, without the model of Y_1, the PCEs are not identifiable because the linear independence condition is violated. The identifiability comes from the parallel model assumption that restricts the coefficients of W to be the same in the models of Y_1 and Y_0.
Remark 3. Without the linear independence condition, researchers often use additional information on the parameters to improve identification. Using a Bayesian approach, Zigler and Belin (2012) imposed informative priors on α. In a similar setting with a time-to-event outcome, Qin et al. (2008) imposed the principal ignorability Y_0 ⊥⊥ S_1 | W, or, equivalently, α = 0.

Assumption 4 does not hold in many applications. Without it, we can never simultaneously observe S_1 and S_0, and therefore it is challenging to identify the joint distribution of (S_1, S_0) in the first place, let alone the PCEs. Below we first use a copula model for the joint distribution of (S_1, S_0), and then discuss identification of the PCEs.

Under Assumption 1, P(S_z | W) = P(S | Z = z, W), and thus we can identify the marginal distributions of S_z given W from the observed data. To recover the joint distribution of (S_1, S_0) given W from the marginal distributions, we need some prior knowledge about the association between S_1 and S_0 conditional on W. For a binary S, a commonly-used assumption to recover the joint distribution of (S_1, S_0) is the monotonicity assumption S_1 ≥ S_0. Under this assumption, we can identify P(S_1 = 1, S_0 = 1 | W) = P(S = 1 | Z = 0, W), P(S_1 = 0, S_0 = 0 | W) = P(S = 0 | Z = 1, W), and P(S_1 = 1, S_0 = 0 | W) = P(S = 1 | Z = 1, W) − P(S = 1 | Z = 0, W). For a continuous S, Efron and Feldman (1991) and Jin and Rubin (2008) discussed the equipercentile equating assumption, i.e., F_1(S_1 | W) = F_0(S_0 | W), where F_z(· | W) is the cumulative distribution function of S_z given W for z = 0, 1.
Under this assumption, S_z determines S_{1−z} based on F_1(· | W) and F_0(· | W) for z = 0, 1. More generally, for each w, we assume

P(S_1, S_0 | W = w) = C_ρ{ P(S_1 | W = w), P(S_0 | W = w) },   (6)

where C_ρ(·, ·) is a copula and ρ is a measure of the association between S_1 and S_0. If we know ρ, then we can identify P(S_1, S_0 | W = w) from the marginal distributions by (6). Otherwise, we can view ρ as a sensitivity parameter and conduct a sensitivity analysis by varying ρ.
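As one concrete choice, a Gaussian copula C_ρ gives an explicit joint density of (S_1, S_0) given W = w in terms of the two identified marginals. The sketch below only illustrates (6) under that specific copula; the Gaussian copula is one possible choice of C_ρ, and the function name and interface are ours.

```python
import numpy as np
from scipy import stats

def gaussian_copula_joint_density(s1, s0, f1, F1, f0, F0, rho):
    """Joint density of (S1, S0) given W = w implied by (6) with a Gaussian copula.

    f1, F1 (f0, F0): the marginal density and CDF of S1 (S0) given W = w,
    identified from the treated (control) arm under Assumption 1.
    rho: the sensitivity parameter measuring the association between S1 and S0.
    """
    q1 = stats.norm.ppf(F1(s1))
    q0 = stats.norm.ppf(F0(s0))
    det = 1.0 - rho ** 2
    # Gaussian copula density c_rho evaluated at (F1(s1), F0(s0))
    copula = np.exp(-(rho ** 2 * (q1 ** 2 + q0 ** 2) - 2.0 * rho * q1 * q0)
                    / (2.0 * det)) / np.sqrt(det)
    return copula * f1(s1) * f0(s0)
```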
Assume now that the principal score π_{s1,s0}(W) = P(S_1 = s_1, S_0 = s_0 | W) is identifiable. The density of the principal strata then equals π_{s1,s0} = E{π_{s1,s0}(W)}. Similar to Section 3.1, we establish the identification of the PCEs using the principal scores.

Theorem 6. Under Assumptions 1 and 2, if π_{s1,s0}(W) is identifiable, then the PCEs are identified by

τ_{s1 s0} = E{ (π_{s1,s0}(W)/π_{s1,s0}) · ZY/e(W) } − E{ (π_{s1,s0}(W)/π_{s1,s0}) · (1 − Z)Y/(1 − e(W)) }.

Theorem 6 generalizes Theorem 1 to the case with a non-constant intermediate variable under control. It shows that E(Y_z | S_1 = s_1, S_0 = s_0) can be identified by the average of the outcomes in a weighted sample, with the weight depending on both the principal score and the propensity score.
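As with Theorem 1, the formula in Theorem 6 suggests a plug-in weighting estimator once π_{s1,s0}(W) and e(W) have been estimated. A minimal sketch, with hypothetical argument names, is below.

```python
import numpy as np

def pce_weighting_joint(Z, Y, W, e_hat, pi_fn):
    """Plug-in version of the weighting identity in Theorem 6 for one stratum (s1, s0).

    pi_fn: function returning the identified principal score pi_{s1,s0}(W_i),
           e.g., implied by the copula model (6) with a chosen rho.
    e_hat: estimated propensity scores e(W_i) = P(Z = 1 | W_i).
    """
    Z, Y, e_hat = map(np.asarray, (Z, Y, e_hat))
    pi_w = np.asarray(pi_fn(W))
    pi = pi_w.mean()                                 # pi_{s1,s0} = E{pi_{s1,s0}(W)}
    term1 = np.mean(pi_w / pi * Z * Y / e_hat)       # treated-arm weighted mean
    term0 = np.mean(pi_w / pi * (1 - Z) * Y / (1 - e_hat))
    return term1 - term0
```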
We now give the identification results for discrete intermediate variables.
Theorem 7. Suppose S ∈ {s_1, ..., s_K}, W ∈ {w_1, ..., w_L}, Assumptions 1 and 3 hold, and P(S_1, S_0 | W) is identifiable.

(a) Let M_{s0} denote the K × L matrix with (k, l)-th element P(S_1 = s_k | S_0 = s_0, W = w_l). For a fixed s_0, if rank(M_{s0}' M_{s0}) = K, then P(Y_0 | S_1, S_0 = s_0) is identifiable.

(b) Let M_{s1} denote the K × L matrix with (k, l)-th element P(S_0 = s_k | S_1 = s_1, W = w_l). For a fixed s_1, if rank(M_{s1}' M_{s1}) = K, then P(Y_1 | S_1 = s_1, S_0) is identifiable.

(c) If (a) and (b) above hold for all s_0 and s_1, then the PCEs are identifiable.

Theorem 7 extends Theorem 2. As special cases of Theorem 7, Ding et al. (2011) and Jiang et al. (2016) gave the identification results for a binary intermediate variable under monotonicity, where the rank conditions in Theorem 7 simplify to the requirements that S_1 and W are dependent given S_0, and that S_0 and W are dependent given S_1.

Recall that 𝒲 is the support of W. For fixed s_0 and s_1, let P_{W,s0} = {P(S_1 | S_0 = s_0, W = w) : w ∈ 𝒲} and P_{W,s1} = {P(S_0 | S_1 = s_1, W = w) : w ∈ 𝒲} be the families of distributions indexed by w given s_0 and s_1, respectively. Similar to Section 3.3, the identifiability of the PCEs reduces to the completeness of P_{W,s0} and P_{W,s1}.
Theorem 8. Suppose that Assumptions 1 and 3 hold, and P(S_1, S_0 | W) is identifiable.

(a) If P_{W,s0} is complete for all s_0, then P(Y, S_1, S_0, W | Z = 0) is identifiable.

(b) If P_{W,s1} is complete for all s_1, then P(Y, S_1, S_0, W | Z = 1) is identifiable.

(c) If (a) and (b) above hold, then the PCEs are identifiable.

Similar to Theorem 3, Theorem 8 does not require any model for the distribution of Y_z (z = 0, 1).
Corollary 2. For a continuous W, suppose that Assumptions 1 and 3 hold. If

(S_1, S_0) | W = w ∼ N( (µ_1(w), µ_0(w))', [ σ_1²(w), ρ(w)σ_1(w)σ_0(w); ρ(w)σ_1(w)σ_0(w), σ_0²(w) ] ),   (7)

with a known ρ(w), then the PCEs are identifiable.

Corollary 2 does not need any model for the outcome, but it requires the auxiliary variable to be continuous. In Corollary 2, with a known ρ(w), we can identify the joint distribution of (S_1, S_0) given W from the marginal distributions of S_z given W. Therefore, the PCEs are identifiable from Theorem 8. To apply Corollary 2, we need to pre-specify the correlation coefficient ρ(w), which serves as a sensitivity parameter in practice.

Similar to the case with a constant control intermediate variable, we propose some useful parametric models for identifying the PCEs using the auxiliary variable W when Assumption 2 or 3 fails.
Proposition 3. For a binary S with monotonicity S_1 ≥ S_0, suppose that Assumption 1 holds, and Y_1 and Y_0 follow linear models

E(Y_z | S_1, S_0, W) = β_{z0} + β_{z1} S_1 + β_{z2} S_0 + β_{z3} W   (z = 0, 1).   (8)

If both

P(S = 1 | Z = 1, W = w) / P(S = 1 | Z = 0, W = w)   and   P(S = 0 | Z = 1, W = w) / P(S = 0 | Z = 0, W = w)   (9)

are not constant in w, then the PCEs are identifiable.

We can use the observed data to check whether the two ratios in (9) are constant in w. For a binary W, the only restriction of (8) is that there is no interaction term among (S_1, S_0, W) in the model of Y_z, which is similar to existing no-interaction or homogeneity assumptions (Ding et al., 2011; Wang et al., 2017).

For a continuous intermediate variable, we give the following proposition.

Proposition 4. Suppose that Assumption 1 holds, and (S_1, S_0) given W follows (7) with a known ρ(w). Suppose Y_1 and Y_0 follow additive models:

Y_1 = β_0 + α_1 S_1 + α_0 S_0 + Σ_{j=1}^{J} β_j f_j(W) + σ_1(W) ε_{Y1},
Y_0 = β_0' + α_1' S_1 + α_0' S_0 + Σ_{j=1}^{J} β_j' h_j(W) + σ_0(W) ε_{Y0},

with (ε_{Y1}, ε_{Y0}) ⊥⊥ (S_1, S_0, W). The PCEs are identifiable if the following two conditions hold:

(a) {1, s_1, E(S_0 | S_1 = s_1, W = w), f_1(w), ..., f_J(w)} are linearly independent as functions of (s_1, w);

(b) {1, s_0, E(S_1 | S_0 = s_0, W = w), h_1(w), ..., h_J(w)} are linearly independent as functions of (s_0, w).

Proposition 4, as an extension of Proposition 1, is mostly useful for continuous outcomes. The Normality in (7) implies that, given W, S_0 is linear in S_1 plus an independent Normal error, with coefficients determined by the distribution of (S_1, S_0) given W. Then, in Proposition 4, we can obtain an additive model of Y_1 on S_1 and W by replacing S_0 in the model of Y_1, and the linear independence condition (a) allows us to disentangle the coefficients of the different terms involving S_1 and W. A similar discussion applies to condition (b).

The Normality in (7) is also helpful for binary outcomes. The following proposition gives the identification result under Probit models for Y_z.
Proposition 5. Suppose that Assumption 1 holds, and (S_1, S_0) given W follows (7) with a known ρ(w). Suppose Y_1 and Y_0 follow Probit models:

P(Y_1 = 1 | S_1 = s_1, S_0 = s_0, W = w) = Φ( β_0 + α_1 s_1 + α_0 s_0 + Σ_{j=1}^{J} β_j f_j(w) ),   (10)
P(Y_0 = 1 | S_1 = s_1, S_0 = s_0, W = w) = Φ( β_0' + α_1' s_1 + α_0' s_0 + Σ_{j=1}^{J} β_j' h_j(w) ).   (11)

If Conditions (a) and (b) in Proposition 4 hold, then the PCEs are identifiable.
Remark 4. Using a Bayesian approach, Zigler and Belin (2012) assumed a trivariate Normal distribution for (S_1, S_0, W), with a sensitivity parameter characterizing the correlation between S_1 and S_0, and Probit models for Y_z with f_j(w) and h_j(w) linear in w. Under their model, the conditional expectation E(S_0 | S_1 = s_1, W = w) is linear in both s_1 and w, and E(S_1 | S_0 = s_0, W = w) is linear in both s_0 and w. Thus, the linear independence condition is violated, and the parameters are not identifiable. To mitigate the inferential difficulties, Zigler and Belin (2012) imposed informative priors on α_1 − α_1' and α_0 − α_0'.

In frequentist inference, non-identifiability renders the likelihood function flat over a region for some parameters, and the classical repeated-sampling theory of maximum likelihood estimation does not apply (Bickel and Doksum, 2015). Computationally, the Bayesian machinery is still applicable as long as the priors are proper. The simulation below, however, highlights the importance of identifiability in Bayesian inference. In both the case with a constant and the case with a non-constant control intermediate variable, we use two models to estimate the PCEs under several data generating processes (DGPs). The two models seem similar in form but have different identifiability. We use the Gibbs sampler to simulate the posterior distributions of the PCEs with 20000 iterations, treating the first 4000 iterations as the burn-in period. The Markov chains mix well, with Gelman–Rubin diagnostic statistics close to one based on multiple chains.
We generate data from DGP 1:

Z ∼ Bernoulli(0.5),   W ∼ N(0, 1),   Z ⊥⊥ W,
S_1 | W ∼ N(γ_0 + γ_1 W, σ²),
P(Y_z = 1 | S_1, W) = Φ(β_{z0} + β_{z1} S_1 + β_{z2} W),

with fixed true values for (β_{10}, β_{11}, β_{12}), (β_{00}, β_{01}, β_{02}) and (γ_0, γ_1, σ²). We name the model corresponding to DGP 1 as model 1. Because g(w) = γ_0 + γ_1 w is linear in w, the set {1, γ_0 + γ_1 W, W} is not linearly independent, so the identifiability of model 1 is not guaranteed by Proposition 2.
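For concreteness, here is a data-generating sketch for DGP 1; the numeric parameter values are illustrative placeholders, not the constants used in our simulation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2020)
n = 1000

# Illustrative placeholder values; not the exact constants used in the paper.
gamma0, gamma1, sigma = 1.0, 0.5, 1.0
b1 = (1.0, -0.5, 0.5)    # (beta_10, beta_11, beta_12) for the treated potential outcome
b0 = (0.5, -0.5, 0.5)    # (beta_00, beta_01, beta_02) for the control potential outcome

Z = rng.binomial(1, 0.5, n)
W = rng.normal(0.0, 1.0, n)                   # Z independent of W
S1 = rng.normal(gamma0 + gamma1 * W, sigma)   # S1 | W ~ N(gamma0 + gamma1*W, sigma^2)
Y1 = rng.binomial(1, norm.cdf(b1[0] + b1[1] * S1 + b1[2] * W))
Y0 = rng.binomial(1, norm.cdf(b0[0] + b0[1] * S1 + b0[2] * W))

S = np.where(Z == 1, S1, 0.0)                 # S0 = c = 0 under Assumption 4
Y = np.where(Z == 1, Y1, Y0)
```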
For DGP 2, Z, W and Y_z are generated as in DGP 1, but S_1 | W ∼ N(γ_0 + γ_1 W + γ_2 W², σ²), with fixed true values for (γ_0, γ_1, γ_2, σ²). We name the corresponding model as model 2. Because {1, γ_0 + γ_1 W + γ_2 W², W} are linearly independent, the PCEs are identifiable based on Proposition 2.

For both DGPs 1 and 2, we use the true models to analyze the generated data with sample size 1000. We choose the following two sets of priors to assess the sensitivity of the posterior inference:

(A) diffuse Normal priors centered at zero for (β_{z0}, β_{z1}, β_{z2}) for z = 0, 1, the non-informative prior p(σ²) ∝ 1/σ², and diffuse Normal priors centered at zero for (γ_0, γ_1) in model 1 (correspondingly, for (γ_0, γ_1, γ_2) in model 2);

(B) the same as (A), except that the Normal prior for (β_{z0}, β_{z1}, β_{z2}) has much smaller variances.

The prior for (β_{z0}, β_{z1}, β_{z2}) is thus much less diffuse in prior (B) than in prior (A).

Figure 1 shows the posterior distributions of four of the outcome-model coefficients. For model 2, the posterior 95% credible intervals cover the true parameters under both priors. For model 1, the posterior distributions of two of the coefficients differ greatly under the two priors and are far away from the true values under prior (A), which shows strong evidence of non-identifiability or weak identifiability of model 1.

Figure 1: Posterior distributions of the parameters in Section 5.1. The grey histograms are the results with prior (A); the white histograms are the results with prior (B). The vertical dashed lines are the true values of the parameters.

Similar to Section 5.1, we now describe two DGPs with different identifiability and evaluate the finite-sample performance of Bayesian inference under each DGP. We choose two models corresponding to two nested DGPs so that we can go beyond Section 5.1 and assess the performance of Bayesian inference with a mis-specified model.
We first specify the two DGPs. For DGP 3, W ∼ Bernoulli(0.5) and Z | W = w ∼ Bernoulli(α_w), with fixed values of (α_1, α_2). We then generate U = (S_1, S_0) from categorical distributions conditional on W, and Y from Bernoulli distributions conditional on Z and U, with the true values of the parameters given in Table 2(a). We name the model corresponding to DGP 3 as model 3. For model 3, Assumptions 1 and 3 hold. Because the stratum (S_1, S_0) = (0, 1) does not exist, monotonicity holds and thus the distribution of (S_1, S_0) given W is identifiable. From Theorem 7, the PCEs are identifiable.

For DGP 4, we generate W and Z in the same way as in DGP 3. We then generate U = (S_1, S_0) from categorical distributions conditional on W, and Y from Bernoulli distributions conditional on Z and U, with the true values of the parameters given in Table 2(b). We name the model corresponding to DGP 4 as model 4. For model 4, the stratum (S_1, S_0) = (0, 1) exists, so monotonicity does not hold. Without monotonicity, the distribution of (S_1, S_0) | W is not identifiable, and thus the PCEs are not identifiable.

Table 2: True values of the parameters under DGP 3 and DGP 4.

(a) DGP 3. The true PCEs are τ_{11} = 0.3, τ_{10} = 0.4, and τ_{00} = 0.5.

P(U = u | W = w)          u = (1,1)   u = (1,0)   u = (0,0)
  w = 1                      0.5         0.3         0.2
  w = 2                      0.2         0.3         0.5
P(Y = 1 | Z = z, U = u)   u = (1,1)   u = (1,0)   u = (0,0)
  z = 1                      0.8         0.7         0.6
  z = 0                      0.5         0.3         0.1

(b) DGP 4. The true PCEs are τ_{11} = 0.3, τ_{10} = 0.4, τ_{00} = 0.5, and τ_{01} = −0.3.

P(U = u | W = w)          u = (1,1)   u = (1,0)   u = (0,0)   u = (0,1)
  w = 1                      0.5         0.3         0.1         0.1
  w = 2                      0.1         0.3         0.5         0.1
P(Y = 1 | Z = z, U = u)   u = (1,1)   u = (1,0)   u = (0,0)   u = (0,1)
  z = 1                      0.8         0.7         0.6         0.2
  z = 0                      0.5         0.3         0.1         0.5

Use models 3 and 4 to analyze the data simulated from DGP 3.
Because model 4 is a generalization of model 3, both are correctly specified under DGP 3. However, the true value of τ_{01} in model 4 is not well-defined.

We choose two sample sizes, 1000 and 50000. For model 3, we choose the following priors: P(W = 1) ∼ Beta(1, 1), α_w ∼ Beta(1, 1), and (π_{(1,1),w}, π_{(1,0),w}, π_{(0,0),w}) ∼ Dirichlet(1, 1, 1) for w = 1, 2. We choose two different priors for the outcome probabilities δ_{zu} = P(Y = 1 | Z = z, U = u): one is the uniform prior Beta(1, 1) and the other is Beta(0.5, 0.5). For model 4, the prior for (π_{(1,1),w}, π_{(1,0),w}, π_{(0,0),w}, π_{(0,1),w}) is Dirichlet(1, 1, 1, 1), and the priors for the other parameters are the same as those for model 3.

Figure 2(a) shows the posterior distributions of τ^m_{11}, τ_{11} and τ_{01}, where τ^m_{11} is the PCE within the stratum (S_1, S_0) = (1, 1) under model 3, and τ_{11} and τ_{01} are the PCEs within the strata (S_1, S_0) = (1, 1) and (0, 1) under model 4, respectively. Comparing the two rows of plots in Figure 2(a), we can see that as the sample size increases, the posterior 95% credible intervals of τ^m_{11} become narrower and always cover the true value, regardless of the priors. For model 4, the posterior distributions of the PCEs change greatly across the priors, and the posterior 95% credible intervals do not shrink as those under model 3 do. When the sample size is 50000, the posterior distribution of τ_{11} is far away from the true value with the flat prior Beta(1, 1) and is not unimodal with the prior Beta(0.5, 0.5). In contrast, for model 3 the Beta(1, 1) and Beta(0.5, 0.5) priors result in only small discrepancies. The drastic differences across sample sizes and priors show strong evidence of non-identifiability or weak identifiability of model 4, which can yield misleading estimates and inferences.
Use models 3 and 4 to analyze data simulated from DGP 4.
The true model, model 4, is not identifiable, and model 3 is mis-specified. Figure 2(b) shows the results for τ^m_{11}, τ_{11} and τ_{01}. Although model 3 is not the true model, the results under this model are very stable under different priors, and the 95% credible intervals of τ^m_{11} cover the true value. This may be due to our choice of small values of π_{(0,1),1} and π_{(0,1),2}, which makes model 3 deviate only slightly from the true model. In contrast, the results under model 4 change greatly under different priors even when the sample size is large.

Figure 2: Posterior distributions of the PCEs in Section 5.2. Panel (a): data simulated from DGP 3 and analyzed by models 3 and 4; panel (b): data simulated from DGP 4 and analyzed by models 3 and 4. τ^m_{11} is the PCE within the stratum (S_1, S_0) = (1, 1) under model 3; τ_{11} and τ_{01} are the PCEs within the strata (S_1, S_0) = (1, 1) and (0, 1) under model 4. The grey histograms are the results with prior Beta(1, 1) for δ_{zu}; the white histograms are the results with prior Beta(0.5, 0.5) for δ_{zu}. The vertical dashed lines are the true values of the parameters.
The posterior distributions of the PCEs under model 4 are multimodal even with a very large sample size. Therefore, using an unidentifiable model may lead to undesirable results even if it is the true model.

Our simulation demonstrates that identification is important in Bayesian inference; otherwise, the results are extremely sensitive to the priors. More importantly, the simulation suggests that when the proposed model is not identifiable, using an identifiable model "close" to it may be a reasonable compromise.

The Job Search Intervention Study was a randomized field experiment investigating the efficacy of a job training intervention on unemployed workers (Vinokur et al., 1995; Vinokur and Schul, 1997; Tingley et al., 2014). The program was designed not only to increase reemployment among the unemployed but also to enhance the mental health of the job seekers. In the study, 600 unemployed workers were randomly assigned to the treatment group (Z = 1) and 299 were assigned to the control group (Z = 0). Those in the treatment group participated in workshops that covered skills for job search and coping with stress. Those in the control group received a booklet describing job-search tips. The intermediate variable S is a measure of job-search self-efficacy ranging from 1 to 5. It measures the participants' confidence in being able to successfully perform six essential job-search activities, such as completing a job application or resume, using their social network to discover promising job openings, and getting their point across in a job interview. The outcome Y is a measure of depressive symptoms based on the Hopkins Symptom Checklist. It measures how much they had been bothered or distressed in the last two weeks by various depression symptoms, such as feeling blue, having thoughts of ending one's life, and crying easily.

Let W be the previous occupation, which is a nominal variable with seven categories. We assume that (S_1, S_0) given W follows (7), where ρ(w) is the correlation coefficient of S_1 and S_0 given W = w. We further assume linear models for Y_1 and Y_0:

Y_z = β_{z0} + β_{z1} S_1 + β_{z2} S_0 + ε_{Yz},

where ε_{Y1} ∼ N(0, σ²_{Y1}), ε_{Y0} ∼ N(0, σ²_{Y0}), and (ε_{Y1}, ε_{Y0}) ⊥⊥ (S_1, S_0, W). We choose the linear model because of its simplicity for illustration, acknowledge its limitations, and leave the task of building more flexible models for Y_1 and Y_0 to future work. Under this model, the PCEs equal

τ_{s1 s0} = (β_{10} − β_{00}) + (β_{11} − β_{01}) s_1 + (β_{12} − β_{02}) s_0.

We assume ρ(w) = ρ and treat ρ as a sensitivity parameter within {0, 0.2, 0.4, 0.6, 0.8}. From Proposition 2, the PCEs are identifiable. We use a Bayesian approach and simulate the posterior distributions of the PCEs. To assess the sensitivity of our results to different priors, we choose two different priors. Denote β_1 = (β_{10}, β_{11}, β_{12}) and β_0 = (β_{00}, β_{01}, β_{02}). For the first prior, we choose multivariate Normal priors for β_z and µ_w: β_z ∼ N(0, Ω_z) and µ_w ∼ N(0, Ω_0), with Ω_z = 10 diag(1, 1, 1) and Ω_0 = 10 diag(1, 1) for z = 0, 1 and w = 1, ..., 7. We choose the following non-informative priors for the other parameters: f(σ²_{zw}) ∝ 1/σ²_{zw}, f(σ²_{Yz}) ∝ 1/σ²_{Yz}, {P(W = 1), ..., P(W = 7)} ∼ Dirichlet(1, ..., 1), and P(Z = 1 | W = w) ∼ Beta(1, 1) for z = 0, 1 and w = 1, ..., 7. For the second prior, we choose Ω_z = diag(1, 1, 1) and Ω_0 = diag(1, 1) and keep the other prior distributions unchanged. We present the results for the first prior in the main text and show the sensitivity of the results to different priors in Appendix C.2.

Figure 3: Posterior medians of the PCEs with ρ = 0.

Figure 3 shows the posterior medians of τ_{s1 s0} for all (s_1, s_0) under ρ = 0.
The surface of these posterior medians rises from its lowest point at principal stratum (5, 1) to its highest point as the difference between S_1 and S_0 decreases. That is, for people who gain more in job-search self-efficacy from the treatment, the treatment lowers the risk of depression to a larger extent. Imai et al. (2010) analyzed these data using a mediation analysis and found that the indirect effect of the treatment through job-search self-efficacy is negative, implying that program participation decreases depressive symptoms by increasing the level of job-search self-efficacy. Jo et al. (2011) used the principal stratification approach by dichotomizing job-search self-efficacy, and found that the treatment has a negative effect on depression for people whose job-search self-efficacy is improved by the treatment. Our conclusion corroborates their findings.

In our analysis, we assume that Assumption 3 holds conditional on the previous occupation. It is plausible because, conditional on the potential values of job-search self-efficacy, the previous occupation may not affect the depressive symptoms. In Appendix C.3, we also conduct an analysis of Assumption 3 by including more covariates.

To assess the sensitivity of the PCEs to ρ, we choose five principal strata, consisting of the maximum, minimum, and the 25%, 50%, and 75% quantiles of S_1 and S_0. Table 3 shows their posterior medians and 95% credible intervals. The point estimates are not sensitive to the values of ρ, and the interval estimates are not sensitive to small values of ρ. But as ρ grows larger, the intervals tend to become wider, which makes the results insignificant.

Table 3: Point and interval estimates of some PCEs using the Bayesian approach, for ρ ∈ {0, 0.2, 0.4, 0.6, 0.8}. The intervals excluding zero are highlighted in bold.

Two technical issues arise in our data analysis. First, Proposition 2 requires W to be continuous, but W is categorical in our application. In Appendix C.1, we give a formal justification of the identifiability of the PCEs in our model with a discrete W. Second, the Normality assumptions on the outcomes are invoked for convenience in the Bayesian computation. In fact, without Normality, we can use the method of moments to estimate the PCEs. The results from the method of moments are similar to those from the Bayesian inference; see Appendix C.2.

Identification of the PCEs is an important but challenging problem. Although several empirical studies have leveraged auxiliary variables to improve inference for the PCEs, formal identification results had not been established, especially for non-binary intermediate variables. Our results supplement previous empirical studies with theoretical justifications for identification. We give identification results for several models based on Normal distributions, which can be generalized to other commonly-used distributions. Appendix B.4 gives identification results for models based on t distributions, which are useful for robust analysis of data with heavy tails.

Researchers have conducted sensitivity analyses for principal ignorability and auxiliary independence using different kinds of models under various settings. For example, Ding and Lu (2017) proposed a sensitivity analysis for principal ignorability with a binary intermediate variable, and Jiang et al. (2016) proposed a sensitivity analysis for auxiliary independence using a random-effects model.
However, there is no general setup for the sensitivity analysis of these assumptions, which may depend on the specification of the model and the types of the outcome and the intermediate variable. We believe that sensitivity analysis should be routinely conducted in problems with principal stratification, but leave the development and the technical details to future research.

Auxiliary variables play different roles in identifying the PCEs, depending on the underlying assumptions. Under principal ignorability, auxiliary variables can be viewed as "confounders" between the principal stratification variable and the outcome. In contrast, under auxiliary independence, auxiliary variables can be treated as "instrumental variables" for the relationship between the principal stratification variable and the outcome. Therefore, the comparison between principal ignorability and auxiliary independence for identifying the PCEs resembles the comparison between the ignorability assumption (Rosenbaum and Rubin, 1983) and the instrumental variable method (Angrist et al., 1996) for identifying the average causal effect. The methods based on principal ignorability are easy to employ because the assumption generally conditions on all baseline variables. However, they bear similar disadvantages as the methods based on ignorability for estimating the average causal effect: we do not know whether we have conditioned on sufficient variables (Pearl, 2000, 2009). In contrast, the methods based on auxiliary independence may place a larger burden on analysts and content experts, because one needs to carve out a specific baseline variable as a designated auxiliary variable. However, the advantage is that we can intentionally target the variable based on scientific or expert knowledge, or by design. For example, this assumption can be used in a multi-center trial as in Example 4, and, based on Example 3, Follmann (2006) proposed the augmented design, which is useful for assessing the effect of vaccination.

Although we restrict the auxiliary variable W to be pretreatment in this paper, the auxiliary independence assumption allows it to be affected by the treatment. It only requires the auxiliary variable to be independent of the outcome conditional on the treatment and the principal strata, which can hold even if the auxiliary variable is posttreatment. For example, for a binary S, Mealli and Pacini (2013) identified the PCEs in completely randomized experiments using a secondary outcome as the auxiliary variable. In contrast, the principal ignorability assumption is unlikely to hold with a posttreatment auxiliary variable: the required independence would fail because of the bias induced by conditioning on a posttreatment variable.

Alternative identification strategies also exist that do not require an auxiliary variable. For a binary intermediate variable, without monotonicity or exclusion restrictions, Hirano et al. (2000) suggested using parallel outcome models to improve identifiability, where the regression coefficients of the covariates are the same for all types of non-compliers. Mealli et al. (2016) used concentration graph theory to study the identification of the PCEs. It is of interest to combine these strategies in theory and practice.

The identification issue of the PCEs is closely related to finite mixture models.
For example, with a binary intermediate variable, the observed data with $(Z = 1, S = 1)$ are a mixture of principal strata $(S_1 = 1, S_0 = 1)$ and $(S_1 = 1, S_0 = 0)$, and the observed data with $(Z = 1, S = 0)$ are a mixture of principal strata $(S_1 = 0, S_0 = 0)$ and $(S_1 = 0, S_0 = 1)$. From this perspective, principal ignorability and auxiliary independence help to separate the components of the finite mixture model. Researchers sometimes use parametric finite mixture models for principal stratification problems (Zhang et al., 2009; Frumento et al., 2012). However, even if those models are parametrically identifiable, the estimators often have poor finite-sample properties (Frumento et al., 2016; Feller et al., 2019). These findings echo the caveat from Cox and Donnelly (2011, page 96): "If an issue can be addressed nonparametrically then it will often be better to tackle it parametrically; however, if it cannot be resolved nonparametrically then it is usually dangerous to resolve it parametrically." This is an important motivation for us to seek nonparametric and semiparametric identifiability, as presented in this paper.

References
Abramowitz, M., I. A. Stegun, et al. (1966). Handbook of mathematical functions. Applied Mathematics Series 55, 62.

Angrist, J. D., G. W. Imbens, and D. B. Rubin (1996). Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association 91, 444-455.

Bartolucci, F. and L. Grilli (2011). Modeling partial compliance through copulas in a principal stratification framework. Journal of the American Statistical Association 106, 469-479.

Bickel, P. J. and K. A. Doksum (2015). Mathematical Statistics: Basic Ideas and Selected Topics, Volume I. Boca Raton, FL: CRC Press.

Cheng, J. and D. S. Small (2006). Bounds on causal effects in three-arm trials with non-compliance. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 815-836.

Cochran, W. G. (1957). Analysis of covariance: its nature and uses. Biometrics 13, 261-281.

Conlon, A., J. Taylor, and M. Elliott (2017). Surrogacy assessment using principal stratification and a Gaussian copula model. Statistical Methods in Medical Research 26, 88-107.

Cox, D. R. and C. A. Donnelly (2011). Principles of Applied Statistics. Cambridge: Cambridge University Press.

Daniels, M. J., J. A. Roy, C. Kim, J. W. Hogan, and M. G. Perri (2012). Bayesian inference for the causal effect of mediation. Biometrics 68, 1028-1036.

Ding, P. (2016). On the conditional distribution of the multivariate t distribution. The American Statistician 70, 293-295.

Ding, P., Z. Geng, W. Yan, and X. H. Zhou (2011). Identifiability and estimation of causal effects by principal stratification with outcomes truncated by death. Journal of the American Statistical Association 106, 1578-1591.

Ding, P. and F. Li (2018). Causal inference: A missing data perspective. Statistical Science 33, 214-237.

Ding, P. and J. Lu (2017). Principal stratification analysis using principal scores. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 757-777.

Efron, B. and D. Feldman (1991). Compliance as an explanatory variable in clinical trials (with discussion). Journal of the American Statistical Association 86, 9-17.

Elliott, M. R., T. E. Raghunathan, and Y. Li (2010). Bayesian inference for causal mediation effects using principal stratification with dichotomous mediators and outcomes. Biostatistics 11, 353-372.

Everitt, B. S. and D. J. Hand (1981). Finite Mixture Distributions. New York: Chapman and Hall.

Feller, A., E. Greif, N. Ho, L. Miratrix, and N. Pillai (2019). Weak separation in mixture models and implications for principal stratification. arXiv preprint arXiv:1602.06595.

Follmann, D. (2006). Augmented designs to assess immune response in vaccine trials. Biometrics 62, 1161-1169.

Follmann, D. A. (2000). On the effect of treatment among would-be treatment compliers: An analysis of the multiple risk factor intervention trial. Journal of the American Statistical Association 95, 1101-1109.

Forastiere, L., A. Mattei, and P. Ding (2018). Principal ignorability in mediation analysis: through and beyond sequential ignorability. Biometrika 105, 979-986.

Frangakis, C. E. and D. B. Rubin (2002). Principal stratification in causal inference. Biometrics 58, 21-29.

Frumento, P., F. Mealli, B. Pacini, and D. B. Rubin (2012). Evaluating the effect of training on wages in the presence of noncompliance, nonemployment, and missing outcome data. Journal of the American Statistical Association 107, 450-466.

Frumento, P., F. Mealli, B. Pacini, and D. B. Rubin (2016). The fragility of standard inferential approaches in principal stratification models relative to direct likelihood approaches. Statistical Analysis and Data Mining 9, 58-70.

Gabriel, E. E. and D. Follmann (2016). Augmented trial designs for evaluation of principal surrogates. Biostatistics 17, 453-467.

Gallop, R., D. S. Small, J. Y. Lin, M. R. Elliott, M. Joffe, and T. R. Ten Have (2009). Mediation analysis with principal stratification. Statistics in Medicine 28, 1108-1130.

Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2014). Bayesian Data Analysis (3rd ed.). London: Chapman and Hall/CRC.

Gilbert, P. B. and M. G. Hudgens (2008). Evaluating candidate principal surrogate endpoints. Biometrics 64, 1146-1154.

Gustafson, P. (2009). What are the limits of posterior distributions arising from nonidentified models, and why should we care? Journal of the American Statistical Association 104, 1682-1695.

Gustafson, P. (2015). Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data. Chapman and Hall/CRC.

Hirano, K., G. W. Imbens, D. B. Rubin, and X.-H. Zhou (2000). Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics 1, 69-88.

Hu, Y. and J.-L. Shiu (2017). Nonparametric identification using instrumental variables: sufficient conditions for completeness. Econometric Theory 34, 1-35.

Huang, Y. and P. B. Gilbert (2011). Comparing biomarkers as principal surrogate endpoints. Biometrics 67, 1442-1451.

Hudgens, M. G. and P. B. Gilbert (2009). Assessing vaccine effects in repeated low-dose challenge experiments. Biometrics 65, 1223-1232.

Imai, K. (2008). Sharp bounds on the causal effects in randomized experiments with "truncation-by-death". Statistics and Probability Letters 78, 144-149.

Imai, K., L. Keele, and D. Tingley (2010). A general approach to causal mediation analysis. Psychological Methods 15, 309-334.

Imbens, G. W. and D. B. Rubin (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge: Cambridge University Press.

Jiang, Z., P. Ding, and Z. Geng (2016). Principal causal effect identification and surrogate end point evaluation by multiple trials. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78, 829-848.

Jin, H. and D. B. Rubin (2008). Principal stratification for causal inference with extended partial compliance. Journal of the American Statistical Association 103, 101-111.

Jo, B. (2008). Causal inference in randomized experiments with mediational processes. Psychological Methods 13, 314-336.

Jo, B. and E. A. Stuart (2009). On the use of propensity scores in principal causal effect estimation. Statistics in Medicine 28, 2857-2875.

Jo, B., E. A. Stuart, D. P. MacKinnon, and A. D. Vinokur (2011). The use of propensity scores in mediation analysis. Multivariate Behavioral Research 46, 425-452.

Joffe, M. M., D. Small, C.-Y. Hsu, et al. (2007). Defining and estimating intervention effects for groups that will develop an auxiliary outcome. Statistical Science 22, 74-97.

Kim, C., L. R. Henneman, C. Choirat, and C. M. Zigler (2020). Health effects of power plant emissions through ambient air quality. Journal of the Royal Statistical Society: Series A (Statistics in Society).

Lange, K. L., R. J. Little, and J. M. Taylor (1989). Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84, 881-896.

Lehmann, E. L. and J. P. Romano (2006). Testing Statistical Hypotheses (3rd ed.). New York: Springer.

Little, R. J. and L. H. Yau (1998). Statistical techniques for analyzing data from prevention trials: Treatment of no-shows using Rubin's causal model. Psychological Methods 3, 147.

Liu, C. (2004). Robit regression: a simple robust alternative to logistic and probit regression, pp. 227-238. In Applied Bayesian Modeling and Causal Inference From Incomplete-Data Perspectives (A. Gelman and X. L. Meng, eds.), New York: Wiley.

Mattei, A. and F. Mealli (2011). Augmented designs to assess principal strata direct effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, 729-752.

Mealli, F. and B. Pacini (2013). Using secondary outcomes to sharpen inference in randomized experiments with noncompliance. Journal of the American Statistical Association 108, 1120-1131.

Mealli, F., B. Pacini, and E. Stanghellini (2016). Identification of principal causal effects using additional outcomes in concentration graphs. Journal of Educational and Behavioral Statistics 41, 463-480.

Nelsen, R. B. (2007). An Introduction to Copulas (2nd ed.). New York: Springer.

Pearl, J. (2000). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge: Cambridge University Press.

Pearl, J. (2009). Letter to the editor: Remarks on the method of propensity score. Statistics in Medicine 28, 1415-1416.

Qin, L., P. B. Gilbert, D. Follmann, and D. Li (2008). Assessing surrogate endpoints in vaccine trials with case-cohort sampling and the Cox model. The Annals of Applied Statistics 2, 386-407.

Rosenbaum, P. R. (1984). The consequences of adjustment for a concomitant variable that has been affected by the treatment. Journal of the Royal Statistical Society: Series A 147, 656-666.

Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41-55.

Roy, J., J. W. Hogan, and B. H. Marcus (2008). Principal stratification with predictors of compliance for randomized trials with 2 active treatments. Biostatistics 9, 277-289.

Rubin, D. B. (2004). Direct and indirect causal effects via potential outcomes (with discussion and reply). Scandinavian Journal of Statistics 31, 161-170.

Rubin, D. B. (2006). Causal inference through potential outcomes and principal stratification: application to studies with "censoring" due to death (with discussion). Statistical Science 21, 299-309.

Schwartz, S. L., F. Li, and F. Mealli (2011). A Bayesian semiparametric approach to intermediate variables in causal inference. Journal of the American Statistical Association 106, 1331-1344.

Sommer, A. and S. L. Zeger (1991). On estimating efficacy from clinical trials. Statistics in Medicine 10, 45-52.

Stuart, E. A. and B. Jo (2015). Assessing the sensitivity of methods for estimating principal causal effects. Statistical Methods in Medical Research 24, 657-674.

Tingley, D., T. Yamamoto, K. Hirose, L. Keele, and K. Imai (2014). Mediation: R package for causal mediation analysis. Journal of Statistical Software 59, 1-38.

VanderWeele, T. J. (2008). Simple relations between principal stratification and direct and indirect effects. Statistics and Probability Letters 78, 2957-2962.

Vinokur, A. D., R. H. Price, and Y. Schul (1995). Impact of the JOBS intervention on unemployed workers varying in risk for depression. American Journal of Community Psychology 23, 39-74.

Vinokur, A. D. and Y. Schul (1997). Mastery and inoculation against setbacks as active ingredients in the JOBS intervention for the unemployed. Journal of Consulting and Clinical Psychology 65, 867.

Wang, L., X.-H. Zhou, and T. S. Richardson (2017). Identification and estimation of causal effects with outcomes truncated by death. Biometrika 104, 597-612.

Yang, F. and P. Ding (2018). Using survival information in truncation by death problems without the monotonicity assumption. Biometrics 74, 1232-1239.

Yuan, L.-H., A. Feller, L. W. Miratrix, et al. (2019). Identifying and estimating principal causal effects in a multi-site trial of early college high schools. The Annals of Applied Statistics 13, 1348-1369.

Zellner, A. (1976). Bayesian and non-Bayesian analysis of the regression model with multivariate Student-t error terms. Journal of the American Statistical Association 71, 400-405.

Zhang, J. L. and D. B. Rubin (2003). Estimation of causal effects via principal stratification when some outcomes are truncated by "death". Journal of Educational and Behavioral Statistics 28, 353-368.

Zhang, J. L., D. B. Rubin, and F. Mealli (2009). Likelihood-based analysis of causal effects of job-training programs using principal stratification. Journal of the American Statistical Association 104, 166-176.

Zigler, C. M. and T. R. Belin (2012). A Bayesian approach to improved estimation of causal effect predictiveness for a principal surrogate endpoint. Biometrics 68, 922-932.

Supplementary Materials
Appendix A provides the proofs of the theorems. Appendix B provides the proofs of the corollaries and propositions, and presents additional results for models related to $t$ distributions. Let $t_p(\mu, \Sigma, \nu)$ denote the $p$-dimensional $t$ distribution with median $\mu$, scale matrix $\Sigma$, and degrees of freedom $\nu$, and let $T_\nu(\cdot)$ denote the cumulative distribution function of the standard $t$ distribution with degrees of freedom $\nu$. Appendix C provides more details for the data analysis.

Appendix A: Proofs of the theorems
To prove the theorems, we need the following lemma from importance sampling.
Lemma 2.
Let $f_X(x)$ and $f_Y(y)$ be the density functions of $X$ and $Y$. For any function $g(\cdot)$,
\[
E\{g(X)\} = E\left\{ \frac{f_X(Y)}{f_Y(Y)}\, g(Y) \right\},
\]
provided the moments exist.

Proof of Theorem 1.
From the law of total expectation,
\[
\begin{aligned}
E\left\{ \frac{\pi_{s_1}(W)}{\pi_{s_1}} \cdot \frac{(1-Z)Y}{1-e(W)} \right\}
&= E\left[ E\left\{ \frac{\pi_{s_1}(W)}{\pi_{s_1}} \cdot \frac{(1-Z)Y}{1-e(W)} \,\middle|\, W \right\} \right] \\
&= E\left[ \frac{\pi_{s_1}(W)}{\pi_{s_1}\{1-e(W)\}}\, E\{(1-Z)Y(0) \mid W\} \right] \\
&= E\left[ \frac{\pi_{s_1}(W)}{\pi_{s_1}\{1-e(W)\}}\, E\{(1-Z) \mid W\}\, E\{Y(0) \mid W\} \right] \qquad \text{(Assumption 1)} \\
&= E\left[ \frac{\pi_{s_1}(W)}{\pi_{s_1}}\, E\{Y(0) \mid W\} \right]. \qquad (12)
\end{aligned}
\]
On the other hand, from the law of total expectation,
\[
E\{Y(0) \mid S_1 = s_1\} = E\big[ E\{Y(0) \mid W, S_1 = s_1\} \mid S_1 = s_1 \big] = E\big[ E\{Y(0) \mid W\} \mid S_1 = s_1 \big], \qquad (13)
\]
where the last equality follows from Assumption 2. The expectation in (12) is with respect to the distribution of $W$, and the outer expectation in (13) is with respect to the distribution of $W \mid S_1 = s_1$. Therefore, from Lemma 2,
\[
E\big[ E\{Y(0) \mid W\} \mid S_1 = s_1 \big]
= E\left\{ \frac{P(W \mid S_1 = s_1)}{P(W)}\, E\{Y(0) \mid W\} \right\}
= E\left\{ \frac{\pi_{s_1}(W)}{\pi_{s_1}}\, E\{Y(0) \mid W\} \right\}.
\]
As a result,
\[
E\left\{ \frac{\pi_{s_1}(W)}{\pi_{s_1}} \cdot \frac{(1-Z)Y}{1-e(W)} \right\} = E\{Y(0) \mid S_1 = s_1\}. \qquad \Box
\]
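To make the weighting identity concrete, here is a small Monte Carlo sketch (my own illustration, not from the paper): the auxiliary variable, treatment, principal stratum, and control potential outcome are simulated so that Assumptions 1 and 2 hold by construction, and the principal score and propensity score are known; all numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Binary auxiliary variable W; treatment assignment depends only on W (Assumption 1 given W).
W = rng.binomial(1, 0.5, n)
e_W = 0.3 + 0.4 * W                       # propensity score e(W) = P(Z = 1 | W)
Z = rng.binomial(1, e_W)

# Binary principal stratum S1 with principal score pi_1(W) = P(S1 = 1 | W).
pi1_W = 0.2 + 0.6 * W
S1 = rng.binomial(1, pi1_W)

# Control potential outcome depends on W only (principal ignorability given W).
Y0 = 1.0 + 2.0 * W + rng.normal(0, 1, n)
Y = (1 - Z) * Y0                          # only control-arm outcomes enter the estimator

# Weighting estimator for E{Y(0) | S1 = 1}.
pi1 = pi1_W.mean()                        # pi_1 = P(S1 = 1)
weights = pi1_W / pi1 * (1 - Z) / (1 - e_W)
est = np.mean(weights * Y)

# Benchmark computed directly from the simulated potential outcomes and strata.
truth = Y0[S1 == 1].mean()
print(f"weighting estimate: {est:.3f}, benchmark: {truth:.3f}")
```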
Proof of Theorem 2. Because $\mathrm{rank}(M^{\top} M) = K$, we can find an invertible $K \times K$ sub-matrix of $M$, denoted by $H$. Without loss of generality, we use $w_1, \ldots, w_K$ to denote the corresponding values of $W$ in $H$. First, for any $s_1$, $P\{Y(1) = y \mid S_1 = s_1\} = P(Y = y \mid Z = 1, S_1 = s_1)$, which can be identified from the observed data. Assumption 3 implies
\[
\begin{aligned}
P(Y = y \mid Z = 0, W = w_k)
&= \sum_{s_1} P(Y = y \mid Z = 0, S_1 = s_1, W = w_k)\, P(S_1 = s_1 \mid Z = 0, W = w_k) \\
&= \sum_{s_1} P\{Y(0) = y \mid S_1 = s_1\}\, P(S_1 = s_1 \mid Z = 0, W = w_k), \qquad (14)
\end{aligned}
\]
for $k = 1, \ldots, K$, where $P(Y = y \mid Z = 0, W = w_k)$ and $P(S_1 = s_1 \mid Z = 0, W = w_k)$ can be identified from the observed data. Because $H$ is invertible, we can solve (14) to obtain $P\{Y(0) = y \mid S_1 = s_1\}$ for all $s_1$. Therefore, the PCEs are identifiable. $\Box$
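As a toy numerical illustration of how the invertible sub-matrix $H$ is used (my own sketch; all probabilities below are made up), equation (14) is a $K \times K$ linear system in the unknown stratum-specific outcome probabilities, which can be solved directly once its coefficients are identified.

```python
import numpy as np

# Rows: auxiliary values w1, w2, w3; columns: principal strata s1 = 1, 2, 3.
# H[k, s] = P(S1 = s | Z = 0, W = w_k), identified from the observed data.
H = np.array([
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.3, 0.6],
])

# Unknown stratum-specific probabilities P(Y(0) = y | S1 = s) we want to recover.
p_true = np.array([0.8, 0.5, 0.2])

# Left-hand side of (14): P(Y = y | Z = 0, W = w_k), identified from the control arm.
lhs = H @ p_true

# Because H is invertible, (14) pins down the stratum-specific probabilities.
p_recovered = np.linalg.solve(H, lhs)
print(p_recovered)   # [0.8, 0.5, 0.2]
```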
Proof of Theorem 3. For a fixed $y$, suppose there exist two functions of $S_1$, $f_1(Y = y \mid S_1)$ and $f_2(Y = y \mid S_1)$, satisfying
\[
E\{f_1(Y = y \mid S_1) \mid W\} = E\{f_2(Y = y \mid S_1) \mid W\}.
\]
Then $E\{g(S_1) \mid W\} = 0$, where $g(S_1) = f_1(Y = y \mid S_1) - f_2(Y = y \mid S_1)$. From the definition of completeness, $g(S_1) = 0$, and hence $P(Y = y \mid S_1)$ is identifiable. Therefore, the PCEs are identifiable. $\Box$

Proof of Theorem 4.
From the completeness of the exponential family (shown in Result S1 in the next section), $\mathcal{P}_W$ is complete. Then from Theorem 3, the PCEs are identifiable. $\Box$

Proof of Theorem 5.
From Lemma 1, $\mathcal{P}_W$ is complete. Then from Theorem 3, the PCEs are identifiable. $\Box$

Proof of Theorem 6.
Similar to the proof of Theorem 1, we can show that
\[
E\left\{ \frac{\pi_{s_1,s_0}(W)}{\pi_{s_1,s_0}} \cdot \frac{ZY}{e(W)} \right\} = E\{Y(1) \mid S_1 = s_1, S_0 = s_0\},
\qquad
E\left\{ \frac{\pi_{s_1,s_0}(W)}{\pi_{s_1,s_0}} \cdot \frac{(1-Z)Y}{1-e(W)} \right\} = E\{Y(0) \mid S_1 = s_1, S_0 = s_0\}.
\]
Therefore, the PCEs are identifiable. $\Box$

Proof of Theorem 7.
Because $\mathrm{rank}(M_{s_0}^{\top} M_{s_0}) = K$, we can find an invertible $K \times K$ sub-matrix of $M_{s_0}$, denoted by $H_{s_0}$, with the corresponding values of $W$ denoted by $w_1, \ldots, w_K$. For any fixed $s_0$, Assumption 3 implies
\[
\begin{aligned}
P(Y = y \mid Z = 0, S_0 = s_0, W = w_j)
&= \sum_{s_1} P(Y = y \mid Z = 0, S_1 = s_1, S_0 = s_0, W = w_j)\, P(S_1 = s_1 \mid S_0 = s_0, W = w_j) \\
&= \sum_{s_1} P\{Y(0) = y \mid S_1 = s_1, S_0 = s_0\}\, P(S_1 = s_1 \mid S_0 = s_0, W = w_j), \qquad (15)
\end{aligned}
\]
for $j = 1, \ldots, K$, where $P(Y = y \mid Z = 0, S_0 = s_0, W = w_j)$ and $P(S_1 = s_1 \mid S_0 = s_0, W = w_j)$ can be identified from the observed data. Because $H_{s_0}$ is invertible, we can solve (15) to obtain $P\{Y(0) = y \mid S_1 = s_1, S_0 = s_0\}$ for all $s_1$. Similarly, we can identify $P\{Y(1) = y \mid S_1 = s_1, S_0 = s_0\}$ for all $s_0$ under $\mathrm{rank}(M_{s_1}^{\top} M_{s_1}) = K$. Therefore, the PCEs are identifiable under conditions (a) and (b). $\Box$

Proof of Theorem 8.
For any fixed $y$ and $s_0$, suppose there exist two functions of $S_1$, $f_1(Y = y \mid S_1, S_0 = s_0)$ and $f_2(Y = y \mid S_1, S_0 = s_0)$, satisfying
\[
E\{f_1(Y = y \mid S_1, S_0 = s_0) \mid W\} = E\{f_2(Y = y \mid S_1, S_0 = s_0) \mid W\}.
\]
Then $E\{g(S_1) \mid W\} = 0$, where $g(S_1) = f_1(Y = y \mid S_1, S_0 = s_0) - f_2(Y = y \mid S_1, S_0 = s_0)$. Because $\mathcal{P}_{W,s_0}$ is complete, $g(S_1) = 0$. Therefore, the distribution $P(Y = y \mid S_1, S_0 = s_0)$ is identifiable. Similarly, the distribution $P(Y = y \mid S_1 = s_1, S_0)$ is identifiable for any fixed $y$ and $s_1$ if $\mathcal{P}_{W,s_1}$ is complete. Therefore, the PCEs are identifiable under conditions (a) and (b). $\Box$

Appendix B: Proofs of the corollaries and propositions
Appendix B.1: Proofs of the completeness of some distribution families
In this section, we show the completeness of the exponential family and of three distribution families based on the Normal, $t$, and Logistic distributions, respectively.

Result S1. (Lehmann and Romano, 2006, Theorem 4.3.1) Let $X$ be a random vector with probability distribution
\[
\mathrm{d}P_{\theta}(x) = C(\theta) \exp\left\{ \sum_{j=1}^{J} \theta_j t_j(x) \right\} \mathrm{d}\mu(x),
\]
and let $\mathcal{P}_{\omega}$ be the family of distributions of $t = (t_1(X), \ldots, t_J(X))$ as $\theta = (\theta_1, \ldots, \theta_J)$ varies over the set $\omega$. Then $\mathcal{P}_{\omega}$ is complete provided $\omega$ contains a $J$-dimensional rectangle in $\mathbb{R}^J$.

Result S2.
Suppose $S_1 = h(W) + \sigma(W)\varepsilon$ with $\varepsilon \perp W$, and $h(w)$ and $\sigma(w)$ continuously differentiable. If $\varepsilon \sim N(0, 1)$, then $\mathcal{P}_W$ is complete.

Proof of Result S2.
We verify the conditions in Lemma 1. Because $\phi(t) = \exp(-t^2/2)$, conditions (a) and (b) hold. Due to the identifiability of Normal mixture models (Everitt and Hand, 1981), condition (c) holds. Thus, from Lemma 1, $\mathcal{P}_W$ is complete. $\Box$

Result S3.
Suppose $S_1 = h(W) + \sigma(W)\varepsilon$ with $\varepsilon \perp W$, and $h(w)$ and $\sigma(w)$ continuously differentiable. If $\varepsilon \sim t(0, 1, \nu)$, then $\mathcal{P}_W$ is complete.

Proof of Result S3.
We verify the conditions in Lemma 1. The characteristic function of $\varepsilon$ is
\[
\phi(t) = \frac{K_{\nu/2}(\sqrt{\nu}\,|t|)\,(\sqrt{\nu}\,|t|)^{\nu/2}}{\Gamma(\nu/2)\,2^{\nu/2-1}},
\]
where $\Gamma(\cdot)$ is the Gamma function and $K_{\nu/2}(\cdot)$ is the modified Bessel function of the second kind. Abramowitz et al. (1966) ensures that
\[
\lim_{t \to \infty} \frac{K_{\nu/2}(\sqrt{\nu}\,|t|)}{\sqrt{\pi/(2\sqrt{\nu}\,|t|)}\, \exp(-\sqrt{\nu}\,|t|)} = 1.
\]
Therefore, the dominating term in $\phi(t)$ is $\exp(-\sqrt{\nu}\,|t|)$, and hence condition (a) holds. It is easy to verify condition (b). Due to the identifiability of $t$ mixture models (Everitt and Hand, 1981), condition (c) holds. Thus, from Lemma 1, $\mathcal{P}_W$ is complete. $\Box$
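As a quick numerical sanity check of the Bessel-function asymptotic invoked above (my own illustration; the value of $\nu$ is arbitrary), the ratio of $K_{\nu/2}(\sqrt{\nu}|t|)$ to $\sqrt{\pi/(2\sqrt{\nu}|t|)}\exp(-\sqrt{\nu}|t|)$ should approach one as $t$ grows.

```python
import numpy as np
from scipy.special import kv   # modified Bessel function of the second kind

nu = 5.0                       # arbitrary degrees of freedom for the illustration
for t in [1.0, 5.0, 20.0, 100.0]:
    x = np.sqrt(nu) * t
    exact = kv(nu / 2, x)
    asymptotic = np.sqrt(np.pi / (2 * x)) * np.exp(-x)
    print(f"t = {t:7.1f}   ratio = {exact / asymptotic:.6f}")
```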
Result S4. Suppose $S_1 = h(W) + \sigma(W)\varepsilon$ with $\varepsilon \perp W$, and $h(w)$ and $\sigma(w)$ continuously differentiable. If $\varepsilon$ follows a standard Logistic distribution with density $\exp(-\varepsilon)/\{1 + \exp(-\varepsilon)\}^2$, then $\mathcal{P}_W$ is complete.

Proof of Result S4. We verify the conditions in Lemma 1. The characteristic function of $\varepsilon$ is $\phi(t) = \pi t / \sinh(\pi t)$. Because the dominating term in $\phi(t)$ is $\exp(-\pi |t|)$, condition (a) holds. It is also easy to show that condition (b) holds. From the identifiability of Logistic mixture models (Everitt and Hand, 1981), condition (c) holds. Thus, from Lemma 1, $\mathcal{P}_W$ is complete. $\Box$

Appendix B.2: Proofs of the corollaries
Proof of Corollary 1.
Suppose that the distribution of $(S_1, W)$ is bivariate Normal with means $(\mu_{S_1}, \mu_W)$, variances $(\sigma_{S_1}^2, \sigma_W^2)$ and correlation coefficient $\rho$. From
\[
P(S_1 = s_1, W = w) = P(S_1 = s_1 \mid W = w, Z = 1)\, P(W = w),
\]
we can identify the joint distribution of $(S_1, W)$:
\[
\begin{aligned}
P(S_1 = s_1 \mid W = w)
&= \frac{1}{\sqrt{2\pi \sigma^2(w)}} \exp\left[ -\frac{\{s_1 - \mu(w)\}^2}{2\sigma^2(w)} \right] \\
&= \frac{1}{\sqrt{2\pi \sigma^2(w)}} \exp\left\{ -\frac{\mu^2(w)}{2\sigma^2(w)} \right\} \exp\left\{ -\frac{s_1^2}{2\sigma^2(w)} \right\} \exp\left\{ \frac{s_1\, \mu(w)}{\sigma^2(w)} \right\},
\end{aligned}
\]
where $\mu(w) = \mu_{S_1} + \rho\, \sigma_{S_1}/\sigma_W\, (w - \mu_W)$ and $\sigma^2(w) = \sigma_{S_1}^2 (1 - \rho^2)$. Thus, we have $T(x) = x$ and $\eta(w) = \mu(w)/\sigma^2(w)$. From Theorem 4, the PCEs are identifiable. $\Box$

Proof of Corollary 2.
First, we can identify $\{\mu_1(w), \mu_0(w), \sigma_1(w), \sigma_0(w)\}$ from the observed data. From the bivariate Normal assumption, the conditional distribution of $S_1 \mid (S_0 = s_0, W = w)$ is Normal with mean $\mu(s_0, w) = \mu_1(w) + \rho(w)\,\sigma_1(w)/\sigma_0(w)\,\{s_0 - \mu_0(w)\}$ and variance $\sigma^2(w) = \{1 - \rho^2(w)\}\,\sigma_1^2(w)$, i.e.,
\[
S_1 = \mu(S_0, W) + \sigma(W)\,\varepsilon,
\]
where $\varepsilon \perp W$ and $\varepsilon \sim N(0, 1)$. From Result S2, $\mathcal{P}_{W,s_0}$ is complete. Similarly, $\mathcal{P}_{W,s_1}$ is also complete. Therefore, from Theorem 8, the PCEs are identifiable. $\Box$

Appendix B.3: Proofs of the propositions
Proof of Proposition 1.
First, because we observe $S_1$ when $Z = 1$, we can identify $P\{Y(1) \mid S_1\}$ and $g(W) = E(S_1 \mid W)$. Second, the linear models of $S_1$ and $Y(0)$ imply
\[
Y(0) = \beta_0 + \alpha\, g(W) + \sum_{j=1}^{J} \beta_j f_j(W) + \alpha\, \sigma_S(W)\,\varepsilon_S + \sigma_Y(W)\,\varepsilon_Y.
\]
Because $(\varepsilon_S, \varepsilon_Y) \perp W$ and $\{1, g(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, the parameters are identifiable from the classic result of linear models. Therefore, we can identify $P\{Y(0) \mid S_1\}$ and hence the PCEs. $\Box$
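The two-step identification argument can be imitated in a small simulation (my own sketch, not the authors' code; the functions $g$ and $f_1$ and all parameter values are made up): estimate $g(W) = E(S_1 \mid W)$ from the treated arm, then regress the control-arm outcomes on the estimated $g(W)$ and $f_1(W)$ to recover $\alpha$ and the $\beta$'s.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

W = rng.uniform(0, 2, n)
Z = rng.binomial(1, 0.5, n)

g = lambda w: 1.0 + np.sin(w)            # g(W) = E(S1 | W); treated as unknown and re-estimated
f1 = lambda w: w                          # known covariate function f_1(W)

S1 = g(W) + rng.normal(0, 0.5, n)         # intermediate variable; observed only when Z = 1
beta0, alpha, beta1 = 0.5, 1.5, -0.8
Y0 = beta0 + alpha * S1 + beta1 * f1(W) + rng.normal(0, 1, n)   # control potential outcome

# Step 1: estimate g(W) from the treated arm with a crude binned mean.
bins = np.linspace(0, 2, 41)
idx = np.digitize(W, bins)
g_hat = np.zeros(n)
for b in np.unique(idx):
    g_hat[idx == b] = S1[(idx == b) & (Z == 1)].mean()

# Step 2: regress control-arm outcomes on {1, g_hat(W), f1(W)}.
ctrl = Z == 0
X = np.column_stack([np.ones(ctrl.sum()), g_hat[ctrl], f1(W[ctrl])])
coef, *_ = np.linalg.lstsq(X, Y0[ctrl], rcond=None)
print("estimated (beta0, alpha, beta1):", np.round(coef, 3))    # close to (0.5, 1.5, -0.8)
```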
Proof of Proposition 2. First, because we observe $S_1$ when $Z = 1$, we can identify $P\{Y(1) \mid S_1\}$. Second, the distribution of $(S_1, W)$ is identifiable, and so are $g(w)$ and $\sigma^2$. Third,
\[
\begin{aligned}
P(Y = 1 \mid Z = 0, W = w)
&= \int P(Y = 1 \mid Z = 0, S_1 = s, W = w)\, P(S_1 = s \mid W = w)\, \mathrm{d}s \\
&= \int P\{Y(0) = 1 \mid S_1 = s, W = w\}\, P(S_1 = s \mid W = w)\, \mathrm{d}s \\
&= \int \Phi\left\{ \beta_0 + \alpha s + \sum_{j=1}^{J} \beta_j f_j(w) \right\} P(S_1 = s \mid W = w)\, \mathrm{d}s \\
&= \Phi\left\{ \frac{\beta_0 + \alpha g(w) + \sum_{j=1}^{J} \beta_j f_j(w)}{\sqrt{1 + \alpha^2\sigma^2}} \right\}.
\end{aligned}
\]
Because $\{1, g(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, from the identifiability of the Probit model, we can identify
\[
\frac{\alpha}{\sqrt{1 + \alpha^2\sigma^2}},\ \frac{\beta_0}{\sqrt{1 + \alpha^2\sigma^2}},\ \ldots,\ \frac{\beta_J}{\sqrt{1 + \alpha^2\sigma^2}}.
\]
Because $\sigma^2$ is identifiable, we can identify $\{\alpha, \beta_0, \ldots, \beta_J\}$ and hence the PCEs. $\Box$
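The key step in this proof is the Normal-Probit integral identity $\int \Phi(a + \alpha s)\, \mathrm{d}N(s; m, \sigma^2) = \Phi\{(a + \alpha m)/\sqrt{1 + \alpha^2\sigma^2}\}$; the following short Monte Carlo check (my own illustration with arbitrary constants) verifies it numerically.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
a, alpha, m, sigma = -0.3, 1.2, 0.7, 0.9     # arbitrary constants for the check

s = rng.normal(m, sigma, 2_000_000)           # S1 | W = w ~ N(g(w), sigma^2)
mc = norm.cdf(a + alpha * s).mean()           # Monte Carlo value of the integral
closed_form = norm.cdf((a + alpha * m) / np.sqrt(1 + alpha**2 * sigma**2))
print(f"Monte Carlo: {mc:.4f}   closed form: {closed_form:.4f}")
```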
Proof of Proposition 3. First, we can identify the joint distribution of $(S_1, S_0)$ given $W$:
\[
\begin{aligned}
P(S_1 = 1, S_0 = 1 \mid W = w) &= P(S_1 = 1, S_0 = 1 \mid Z = 0, W = w) = P(S = 1 \mid Z = 0, W = w), \\
P(S_1 = 0, S_0 = 0 \mid W = w) &= P(S_1 = 0, S_0 = 0 \mid Z = 1, W = w) = P(S = 0 \mid Z = 1, W = w), \\
P(S_1 = 1, S_0 = 0 \mid W = w) &= 1 - P(S_1 = 1, S_0 = 1 \mid W = w) - P(S_1 = 0, S_0 = 0 \mid W = w).
\end{aligned}
\]
Then
\[
P(S_0 = 1 \mid S_1 = 1, W = w) = \frac{P(S = 1 \mid Z = 0, W = w)}{P(S = 1 \mid Z = 1, W = w)}, \qquad
P(S_0 = 1 \mid S_1 = 0, W = w) = 0.
\]
Because the subpopulation with $(Z = 1, S = 1)$ is a mixture of the subpopulations with $(S_1 = 1, S_0 = 1)$ and $(S_1 = 1, S_0 = 0)$, we have
\[
\begin{aligned}
E(Y \mid Z = 1, S = 1, W = w)
&= E\{Y(1) \mid S_1 = 1, S_0 = 1, W = w\}\, P(S_0 = 1 \mid S_1 = 1, W = w) \\
&\quad + E\{Y(1) \mid S_1 = 1, S_0 = 0, W = w\}\, P(S_0 = 0 \mid S_1 = 1, W = w) \\
&= \beta_0 + \beta_1 + \beta_2\, \frac{P(S = 1 \mid Z = 0, W = w)}{P(S = 1 \mid Z = 1, W = w)} + \beta_3 w. \qquad (16)
\end{aligned}
\]
The subpopulation with $(Z = 1, S = 0)$ is the same as the subpopulation with $(S_1 = 0, S_0 = 0)$, and then
\[
E(Y \mid Z = 1, S = 0, W = w) = E\{Y(1) \mid S_1 = 0, S_0 = 0, W = w\} = \beta_0 + \beta_3 w. \qquad (17)
\]
Because $P(S = 1 \mid Z = 0, W = w)/P(S = 1 \mid Z = 1, W = w)$ is not constant in $w$, we can identify $(\beta_0, \beta_1, \beta_2, \beta_3)$ from (16) and (17). Similarly, we can identify $(\beta_0', \beta_1', \beta_2', \beta_3')$ and hence the PCEs. $\Box$
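The mixture equations (16) and (17) can also be checked in a small simulation (my own sketch with made-up parameter values, not the authors' code): with the ratio $P(S = 1 \mid Z = 0, W)/P(S = 1 \mid Z = 1, W)$ known by construction, a stacked least-squares fit of the two equations recovers the coefficients of the treated-arm outcome model.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

W = rng.uniform(0, 1, n)
Z = rng.binomial(1, 0.5, n)

# Binary principal strata with monotonicity S1 >= S0.
p11 = 0.2 + 0.3 * W                       # P(S1 = 1, S0 = 1 | W)
p10 = 0.3                                 # P(S1 = 1, S0 = 0 | W)
u = rng.uniform(0, 1, n)
S0 = (u < p11).astype(int)
S1 = (u < p11 + p10).astype(int)
S = np.where(Z == 1, S1, S0)

# E{Y(1) | S1, S0, W} = b0 + b1*S1 + b2*S0 + b3*W, with made-up coefficients.
b = np.array([0.5, 1.0, -0.7, 0.4])
Y1 = b[0] + b[1] * S1 + b[2] * S0 + b[3] * W + rng.normal(0, 1, n)

# Ratio entering (16); identified from the observed data, here known by construction.
ratio = p11 / (p11 + p10)
g16 = (Z == 1) & (S == 1)                 # units obeying equation (16)
g17 = (Z == 1) & (S == 0)                 # units obeying equation (17)

# Stacked design: columns correspond to (b0, b1, b2, b3).
X16 = np.column_stack([np.ones(g16.sum()), np.ones(g16.sum()), ratio[g16], W[g16]])
X17 = np.column_stack([np.ones(g17.sum()), np.zeros(g17.sum()), np.zeros(g17.sum()), W[g17]])
X = np.vstack([X16, X17])
y = np.concatenate([Y1[g16], Y1[g17]])
est, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(est, 2))                   # approximately (0.5, 1.0, -0.7, 0.4)
```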
Proof of Proposition 4. From the linear model for $Y(1)$,
\[
\begin{aligned}
E(Y \mid Z = 1, S_1 = s_1, W = w)
&= \int E\{Y(1) \mid S_1 = s_1, S_0 = s_0, W = w\}\, P(S_0 = s_0 \mid S_1 = s_1, W = w)\, \mathrm{d}s_0 \\
&= \int \left\{ \beta_0 + \alpha_1 s_1 + \alpha_0 s_0 + \sum_{j=1}^{J} \beta_j f_j(w) \right\} P(S_0 = s_0 \mid S_1 = s_1, W = w)\, \mathrm{d}s_0 \\
&= \beta_0 + \alpha_1 s_1 + \alpha_0\, E(S_0 \mid S_1 = s_1, W = w) + \sum_{j=1}^{J} \beta_j f_j(w).
\end{aligned}
\]
Because $\{1, s_1, E(S_0 \mid S_1 = s_1, W = w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, we can identify $(\beta_0, \ldots, \beta_J, \alpha_1, \alpha_0)$. Similarly, we can identify $(\beta_0', \ldots, \beta_J', \alpha_1', \alpha_0')$. Therefore, we can identify the PCEs. $\Box$

Proof of Proposition 5.
We can first identify $\{\mu_1(w), \mu_0(w), \sigma_1(w), \sigma_0(w)\}$. From the bivariate Normality, the conditional distribution of $S_0 \mid (S_1 = s_1, W = w)$ is Normal with mean $\mu(s_1, w) = \mu_0(w) + \rho(w)\,\sigma_0(w)/\sigma_1(w)\,\{s_1 - \mu_1(w)\}$ and variance $\sigma^2(w) = \{1 - \rho^2(w)\}\,\sigma_0^2(w)$. Then
\[
\begin{aligned}
P(Y = 1 \mid Z = 1, S_1 = s_1, W = w)
&= \int P\{Y(1) = 1 \mid S_1 = s_1, S_0 = s_0, W = w\}\, P(S_0 = s_0 \mid S_1 = s_1, W = w)\, \mathrm{d}s_0 \\
&= \int \Phi\left\{ \beta_0 + \alpha_1 s_1 + \alpha_0 s_0 + \sum_{j=1}^{J} \beta_j f_j(w) \right\} P(S_0 = s_0 \mid S_1 = s_1, W = w)\, \mathrm{d}s_0 \\
&= \Phi\left\{ \frac{\beta_0 + \alpha_1 s_1 + \alpha_0\, \mu(s_1, w) + \sum_{j=1}^{J} \beta_j f_j(w)}{\sqrt{1 + \alpha_0^2\, \sigma^2(w)}} \right\}.
\end{aligned}
\]
Because $\{1, s_1, E(S_0 \mid S_1 = s_1, W = w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, we can use Probit regression to first identify $\alpha_1$ and $\alpha_0$ by fixing $w$ and then identify $\beta_j$ for $j = 1, \ldots, J$ by fixing $s_1$. Similarly, we can identify $(\beta_0', \ldots, \beta_J', \alpha_1', \alpha_0')$. Therefore, we can identify the PCEs. $\Box$

Appendix B.4: Models based on t distributions

We give identification results for models based on $t$ distributions. These models are frequently applied for robust analysis when the data have heavier tails than the standard Normal distribution (Zellner, 1976; Lange et al., 1989; Liu, 2004; Gelman et al., 2014). We first give the results under Assumption 3. The following result is an application of Theorem 5 to models based on $t$ distributions with a constant control intermediate variable.

Corollary S1.
Suppose that $W$ is continuous and Assumptions 1, 3 and 4 hold, and $S_1 \mid W = w \sim t(h(w), \sigma^2(w), \nu)$ with unknown $\nu$, where $h(w)$ and $\sigma(w)$ are continuously differentiable functions that can be unknown. Then the PCEs are identifiable.

Proof of Corollary S1.
From Result S3, $\mathcal{P}_W$ is complete. Thus, from Theorem 5, the PCEs are identifiable. $\Box$

Similarly, the following result is an application of Theorem 8 to models based on $t$ distributions with a non-constant control intermediate variable.

Corollary S2.
For a continuous $W$, suppose that Assumptions 1 and 3 hold. If
\[
(S_1, S_0) \mid W = w \sim t_2\left( \begin{pmatrix} \mu_1(w) \\ \mu_0(w) \end{pmatrix},
\begin{pmatrix} \sigma_1^2(w) & \rho(w)\,\sigma_1(w)\,\sigma_0(w) \\ \rho(w)\,\sigma_1(w)\,\sigma_0(w) & \sigma_0^2(w) \end{pmatrix}, \nu \right),
\]
with known values of $\rho(w)$, then the PCEs are identifiable.

Proof of Corollary S2.
First, we can identify $\{\mu_1(w), \mu_0(w), \sigma_1(w), \sigma_0(w)\}$. For any fixed $s_0$, Ding (2016) implies that
\[
S_1 = h_{s_0}(W) + \sigma_{s_0}(W)\,\varepsilon,
\]
where
\[
h_{s_0}(W) = \mu_1(W) + \rho(W)\,\frac{\sigma_1(W)}{\sigma_0(W)}\,\{s_0 - \mu_0(W)\}, \qquad
\sigma_{s_0}(W) = \sqrt{\frac{\nu + \{s_0 - \mu_0(W)\}^2/\sigma_0^2(W)}{\nu + 1}} \cdot \sqrt{\{1 - \rho^2(W)\}\,\sigma_1^2(W)},
\]
and $\varepsilon \sim t(0, 1, \nu + 1)$ with $\varepsilon \perp W$. From Result S3, $\mathcal{P}_{W,s_0}$ is complete. Similarly, $\mathcal{P}_{W,s_1}$ is also complete. Therefore, from Theorem 8, the PCEs are identifiable. $\Box$
We then give the results when neither Assumption 2 nor Assumption 3 holds. The following proposition extends the Probit model in Proposition 2 to the Robit model (Liu, 2004) with a constant control intermediate variable.
Proposition 6.
Under Assumptions 1 and 4, assume
\[
S_1 = g(W) + \varepsilon_S, \qquad
Y(0) = I\left\{ \beta_0 + \alpha S_1 + \sum_{j=1}^{J} \beta_j f_j(W) + \varepsilon_Y > 0 \right\}, \qquad
(\varepsilon_S, \varepsilon_Y) \sim t_2(\mu, \Sigma, \nu),
\]
where $(\varepsilon_S, \varepsilon_Y) \perp W$, $\mu = (0, 0)^{\top}$, $\Sigma = \mathrm{diag}(\sigma^2, 1)$, and $g(w)$ can be unknown. If $\{1, g(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, then the PCEs are identifiable.

Proposition 2 does not assume a joint model for $(S_1, Y)$. In contrast, Proposition 6 assumes a joint model for these two variables by restricting the error terms to follow a bivariate $t$ distribution. This is because two independent Normal random variables make a bivariate Normal random variable, but two independent $t$ random variables do not make a bivariate $t$ random variable, even with the same degrees of freedom.

Proof of Proposition 6.
First, because we observe $S_1$ when $Z = 1$, we can identify $P\{Y(1) \mid S_1\}$. Second, because the distribution of $(S_1, W)$ is identifiable, we can identify $g(w)$ and $\sigma^2$. Then
\[
\begin{aligned}
P(Y = 1 \mid Z = 0, W = w)
&= \int P(Y = 1 \mid Z = 0, S_1 = s, W = w)\, P(S_1 = s \mid W = w)\, \mathrm{d}s \\
&= \int P\{Y(0) = 1 \mid S_1 = s, W = w\}\, P(S_1 = s \mid W = w)\, \mathrm{d}s \\
&= T_{\nu}\left\{ \frac{\beta_0 + \alpha g(w) + \sum_{j=1}^{J} \beta_j f_j(w)}{\sqrt{1 + \alpha^2\sigma^2}} \right\}.
\end{aligned}
\]
Because $\{1, g(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, from the identifiability of the Robit model, we can identify
\[
\frac{\alpha}{\sqrt{1 + \alpha^2\sigma^2}},\ \frac{\beta_0}{\sqrt{1 + \alpha^2\sigma^2}},\ \ldots,\ \frac{\beta_J}{\sqrt{1 + \alpha^2\sigma^2}}.
\]
Because $\sigma^2$ is identifiable, we can identify $\{\alpha, \beta_0, \ldots, \beta_J\}$ and hence the PCEs. $\Box$

Similarly, for the case with a non-constant control intermediate variable, we give an identification result for Robit outcome models as an extension of the Probit outcome models.

Proposition 7.
Under Assumption 1, assume that
\[
S_1 = g_1(W) + \varepsilon_{S_1}, \qquad S_0 = g_0(W) + \varepsilon_{S_0},
\]
\[
Y(1) = I\left\{ \beta_0 + \alpha_1 S_1 + \alpha_0 S_0 + \sum_{j=1}^{J} \beta_j f_j(W) + \varepsilon_{Y_1} > 0 \right\}, \qquad
Y(0) = I\left\{ \beta_0' + \alpha_1' S_1 + \alpha_0' S_0 + \sum_{i=1}^{k} \beta_i' h_i(W) + \varepsilon_{Y_0} > 0 \right\},
\]
where $(\varepsilon_{S_1}, \varepsilon_{S_0}, \varepsilon_{Y_1}, \varepsilon_{Y_0}) \sim t_4(\mu, \Sigma, \nu)$, $(\varepsilon_{S_1}, \varepsilon_{S_0}, \varepsilon_{Y_1}, \varepsilon_{Y_0}) \perp W$, and
\[
\mu = \mathbf{0}, \qquad
\Sigma = \begin{pmatrix}
\sigma_{S_1}^2 & \sigma_{S_1S_0} & \sigma_{S_1Y_1} & 0 \\
\sigma_{S_1S_0} & \sigma_{S_0}^2 & 0 & \sigma_{S_0Y_0} \\
\sigma_{Y_1S_1} & 0 & \sigma_{Y_1}^2 & \sigma_{Y_1Y_0} \\
0 & \sigma_{Y_0S_0} & \sigma_{Y_1Y_0} & \sigma_{Y_0}^2
\end{pmatrix},
\]
with $\sigma_{S_1S_0}$ known. The PCEs are identifiable if $\{1, g_0(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent and $\{1, g_1(w), h_1(w), \ldots, h_k(w)\}$ are linearly independent.

Proof of Proposition 7.
We can first identify $g_1(\cdot)$, $g_0(\cdot)$ and all the terms in $\Sigma$ except $\sigma_{Y_1Y_0}$. Then
\[
\begin{aligned}
P(Y = 1 \mid Z = 1, S_1 = s_1, W = w)
&= P\left\{ \beta_0 + \alpha_1 S_1 + \alpha_0 S_0 + \sum_{j=1}^{J} \beta_j f_j(W) + \varepsilon_{Y_1} \geq 0 \,\middle|\, S_1 = s_1, W = w \right\} \\
&= P\left\{ \beta_0 + \alpha_1 s_1 + \alpha_0 g_0(W) + \sum_{j=1}^{J} \beta_j f_j(W) + \varepsilon_{Y_1} + \alpha_0 \varepsilon_{S_0} \geq 0 \,\middle|\, g_1(W) + \varepsilon_{S_1} = s_1, W = w \right\} \\
&= P\left\{ \varepsilon_{Y_1} + \alpha_0 \varepsilon_{S_0} \geq -\beta_0 - \alpha_1 s_1 - \alpha_0 g_0(W) - \sum_{j=1}^{J} \beta_j f_j(W) \,\middle|\, \varepsilon_{S_1} = s_1 - g_1(W), W = w \right\}.
\end{aligned}
\]
Ding (2016) implies
\[
(\varepsilon_{Y_1}, \varepsilon_{S_0}) \mid \varepsilon_{S_1} = x \sim t_2\left( \mu_c,\ \frac{\nu + d}{\nu + 1}\, \Sigma_c,\ \nu + 1 \right),
\]
where
\[
\mu_c = \begin{pmatrix} \sigma_{Y_1S_1}\, x/\sigma_{S_1}^2 \\ \sigma_{S_1S_0}\, x/\sigma_{S_1}^2 \end{pmatrix}, \qquad
\Sigma_c = \begin{pmatrix}
\sigma_{Y_1}^2 - \sigma_{Y_1S_1}^2/\sigma_{S_1}^2 & -\sigma_{Y_1S_1}\sigma_{S_1S_0}/\sigma_{S_1}^2 \\
-\sigma_{Y_1S_1}\sigma_{S_1S_0}/\sigma_{S_1}^2 & \sigma_{S_0}^2 - \sigma_{S_1S_0}^2/\sigma_{S_1}^2
\end{pmatrix}, \qquad
d = x^2/\sigma_{S_1}^2.
\]
Therefore,
\[
\varepsilon_{Y_1} + \alpha_0 \varepsilon_{S_0} \mid \varepsilon_{S_1} = x \sim t\left( \left( \alpha_0\, \frac{\sigma_{S_1S_0}}{\sigma_{S_1}^2} + \frac{\sigma_{Y_1S_1}}{\sigma_{S_1}^2} \right) x,\ \frac{\nu + d}{\nu + 1}\, r(\alpha_0),\ \nu + 1 \right),
\]
where $r(\alpha_0) = (1, \alpha_0)\, \Sigma_c\, (1, \alpha_0)^{\top}$. Therefore,
\[
P(Y = 1 \mid Z = 1, S_1 = s_1, W = w) = T_{\nu+1}\left\{ \frac{\beta_0 + \alpha_1 s_1 + \alpha_0 g_0(w) - \left( \alpha_0 \sigma_{S_1S_0}/\sigma_{S_1}^2 + \sigma_{Y_1S_1}/\sigma_{S_1}^2 \right)\{s_1 - g_1(w)\} + \sum_{j=1}^{J} \beta_j f_j(w)}{\sqrt{t(s_1, w)\, r(\alpha_0)}} \right\},
\]
where
\[
t(s_1, w) = \frac{\nu + \{s_1 - g_1(w)\}^2/\sigma_{S_1}^2}{\nu + 1}.
\]
Because $\{1, g_0(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, we can identify $(\beta_0, \ldots, \beta_J, \alpha_1, \alpha_0)$ from the identifiability of the Robit model. Similarly, we can identify $(\beta_0', \ldots, \beta_k', \alpha_1', \alpha_0')$. Therefore, we can identify the PCEs. $\Box$

Appendix C: More details for the application
Appendix C.1: Technical issues with a discrete auxiliary variable
Corollary 2 requires W to be continuous but W is categorical in the application. We givethe following proposition to formally justify the identifiability of the PCEs in our dataanalysis. Proposition S8.
Under Assumptions 1 and 2, assume that $P(S_1, S_0 \mid W = w)$ is identifiable, and $Y(1)$ and $Y(0)$ follow linear models
\[
E\{Y(z) \mid S_1, S_0, W\} = \beta_{0z} + \beta_{1z} S_1 + \beta_{2z} S_0, \qquad (z = 0, 1).
\]
If there exist $s_1$ and $s_0$ such that $E(S_0 \mid S_1 = s_1, W = w)$ and $E(S_1 \mid S_0 = s_0, W = w)$ are not constant in $w$, then the PCEs are identifiable.

Proof of Proposition S8.
From the linear models for $Y(0)$,
\[
\begin{aligned}
E(Y \mid Z = 0, S_0 = s_0, W = w)
&= \int E\{Y(0) \mid S_1 = s_1, S_0 = s_0, W = w\}\, P(S_1 = s_1 \mid S_0 = s_0, W = w)\, \mathrm{d}s_1 \\
&= \int (\beta_{00} + \beta_{10} s_1 + \beta_{20} s_0)\, P(S_1 = s_1 \mid S_0 = s_0, W = w)\, \mathrm{d}s_1 \\
&= \beta_{00} + \beta_{10}\, E(S_1 \mid S_0 = s_0, W = w) + \beta_{20} s_0.
\end{aligned}
\]
Because $E(S_1 \mid S_0 = s_0, W = w)$ is not constant in $w$, $\beta_{00}$, $\beta_{10}$ and $\beta_{20}$ are identifiable. Similarly, $\beta_{01}$, $\beta_{11}$ and $\beta_{21}$ are identifiable, and hence the PCEs are identifiable. $\Box$

In Proposition S8, the condition that $E(S_0 \mid S_1 = s_1, W = w)$ and $E(S_1 \mid S_0 = s_0, W = w)$ are constant in $w$ is testable because $P(S_1, S_0 \mid W = w)$ is identifiable. In our application, the conditions in Proposition S8 hold, and thus the PCEs are identifiable even though the auxiliary variable $W$ is categorical.

Appendix C.2: Bayesian analysis with different priors

[Figure S1 here: histograms of the posterior distributions of $\beta_{11} - \beta_{10}$ and $\beta_{21} - \beta_{20}$ under different values of $\rho$; panel (a) shows results with prior A and panel (b) with prior B. The solid lines are the medians and the dashed lines are the posterior 2.5% and 97.5% quantiles.]

Let $\beta_1 = (\beta_{01}, \beta_{11}, \beta_{21})$ and $\beta_0 = (\beta_{00}, \beta_{10}, \beta_{20})$. We choose multivariate Normal priors for $\beta_z$ and $\mu_w$: $\beta_z \sim N(0, \Omega_z)$ and $\mu_w \sim N(0, \Omega)$ for $w = 1, \ldots, 7$. Prior A uses $\Omega_z = 10\,\mathrm{diag}(1, 1, 1)$ and $\Omega = 10\,\mathrm{diag}(1, 1)$ for $z = 0, 1$; Prior B uses $\Omega_z = \mathrm{diag}(1, 1, 1)$ and $\Omega = \mathrm{diag}(1, 1)$ for $z = 0, 1$. We choose the following non-informative priors for the other parameters: $f(\sigma_{zw}) \propto 1/\sigma_{zw}$, $f(\sigma_{Yz}) \propto 1/\sigma_{Yz}$, $\{P(W = 1), \ldots, P(W = 7)\} \sim \mathrm{Dirichlet}(1, \ldots, 1)$, and $P(Z = 1 \mid W = w) \sim \mathrm{Beta}(1, 1)$ for $z = 0, 1$ and $w = 1, \ldots, 7$. Figure S1 presents the results, which are not sensitive to the different priors.
Appendix C.3: More sensitivity analysis for the application
Section 6 conducts the analysis without covariates. This section includes covariates in the data analysis, which can make Assumption 3 more plausible. Similar to the main text, we also assess the sensitivity of the results to the correlation coefficient between $S_1$ and $S_0$ given $W$.

Let $X$ denote the covariates. We consider the following model for $(S_1, S_0, W)$:
\[
(S_1, S_0) \mid W = w, X = x \sim N\left( \begin{pmatrix} \mu_1(w, x) \\ \mu_0(w, x) \end{pmatrix},\
\Sigma(w) = \begin{pmatrix} \sigma_1^2(w) & \rho(w)\,\sigma_1(w)\,\sigma_0(w) \\ \rho(w)\,\sigma_1(w)\,\sigma_0(w) & \sigma_0^2(w) \end{pmatrix} \right),
\]
where $\rho(w) = \rho$, $\mu_1(w, x) = \alpha_{1w} + \gamma_{1w}^{\top} x$ and $\mu_0(w, x) = \alpha_{0w} + \gamma_{0w}^{\top} x$. Therefore, we allow the means of $S_1$ and $S_0$ to depend on the covariates in the sensitivity analysis. We then consider the following model for the potential outcomes:
\[
Y(z) = \beta_{1z} S_1 + \beta_{2z} S_0 + \beta_{z,X}^{\top} X + \varepsilon_{Yz}.
\]
For the simplicity of the sensitivity analysis, we use an alternative estimation strategy based on the method of moments, which does not require specifying the distributions of $\varepsilon_{Y1}$ and $\varepsilon_{Y0}$:

1. Obtain the estimates of $\mu_1(w, x)$, $\mu_0(w, x)$ and $\Sigma(w)$ for all $w$ from the observed distribution of $(S_1, S_0, W, X)$.

2. For units with $Z_i = 1$, impute $S_{0i}$ with $\widehat{S}_{0i} = \widehat{\mu}_0(w, x) + \rho\, \widehat{\sigma}_0(w)/\widehat{\sigma}_1(w)\, \{S_{1i} - \widehat{\mu}_1(w, x)\}$; for units with $Z_i = 0$, impute $S_{1i}$ with $\widehat{S}_{1i} = \widehat{\mu}_1(w, x) + \rho\, \widehat{\sigma}_1(w)/\widehat{\sigma}_0(w)\, \{S_{0i} - \widehat{\mu}_0(w, x)\}$.

3. Obtain the estimates of $(\beta_{11}, \beta_{21}, \beta_{1,X})$ by regressing $Y_i$ on $S_{1i}$, $\widehat{S}_{0i}$ and $X_i$ for units with $Z_i = 1$; obtain the estimates of $(\beta_{10}, \beta_{20}, \beta_{0,X})$ by regressing $Y_i$ on $\widehat{S}_{1i}$, $S_{0i}$ and $X_i$ for units with $Z_i = 0$.

A code sketch of this moment-based procedure appears after Table S1 below. We use the bootstrap to obtain the 95% confidence intervals of the PCEs.

We first conduct the sensitivity analysis without covariates. The upper panel of Table S1 shows the point estimates and 95% confidence intervals of the five PCEs from the method of moments. The results are similar to those from the Bayesian approach presented in the main text. We then consider the sensitivity analysis with covariates. Due to the limited sample size, we only include the demographic variables: gender, age, and race. The lower panel of Table S1 shows the point estimates and 95% confidence intervals of the five PCEs from the method of moments. Although the estimates change in the sensitivity analysis, we can still conclude that for people who gain more in job-search self-efficacy from the treatment, the treatment lowers the risk of depression to a larger extent.

Table S1: Point and interval estimates of representative PCEs using the method of moments. The intervals excluding zero are highlighted in bold.
[Table S1 here: rows are five representative principal strata $(s_1, s_0)$, columns are $\rho = 0, 0.2, 0.4, 0.6, 0.8$; the upper panel reports results without covariates and the lower panel reports results with covariates. For the first listed stratum, the point estimates are 1.210, 1.362, 1.440, 1.406, 1.366 without covariates and 1.581, 1.704, 1.732, 1.466, 1.562 with covariates; the remaining entries are not legible in the source.]
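Below is a rough, self-contained sketch of the three-step moment-based procedure on simulated data (entirely my own illustration, not the authors' code: the data-generating values, the single covariate, and the helper functions are all hypothetical, and $\rho$ is fixed at its true value so the imputation step is correct by construction).

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 50_000, 0.4                      # rho is the sensitivity parameter, assumed known

# Simulated data: categorical W, one covariate X, bivariate Normal (S1, S0) given (W, X).
W = rng.integers(0, 3, n)
X = rng.normal(0, 1, n)
mu1 = 1.0 + 0.5 * W + 0.3 * X
mu0 = 0.5 + 0.2 * W + 0.3 * X
sd1, sd0 = 1.0, 0.8
e1, e0 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], n).T
S1, S0 = mu1 + sd1 * e1, mu0 + sd0 * e0
Z = rng.binomial(1, 0.5, n)
S = np.where(Z == 1, S1, S0)              # only one potential intermediate value is observed
Y = np.where(Z == 1,
             2.0 * S1 - 1.0 * S0 + 0.5 * X + rng.normal(0, 1, n),   # Y(1) model
             1.0 * S1 - 0.5 * S0 + 0.5 * X + rng.normal(0, 1, n))   # Y(0) model

def fit_mean(S_obs, W_obs, X_obs):
    """Step 1: regress the observed intermediate variable on X within each level of W."""
    params = {}
    for w in np.unique(W_obs):
        m = W_obs == w
        A = np.column_stack([np.ones(m.sum()), X_obs[m]])
        coef, *_ = np.linalg.lstsq(A, S_obs[m], rcond=None)
        params[w] = (coef, (S_obs[m] - A @ coef).std())
    return params

p1 = fit_mean(S[Z == 1], W[Z == 1], X[Z == 1])   # mu1_hat(w, x) and sigma1_hat(w)
p0 = fit_mean(S[Z == 0], W[Z == 0], X[Z == 0])   # mu0_hat(w, x) and sigma0_hat(w)

m1 = np.array([p1[w][0][0] + p1[w][0][1] * x for w, x in zip(W, X)])
m0 = np.array([p0[w][0][0] + p0[w][0][1] * x for w, x in zip(W, X)])
s1_hat = np.array([p1[w][1] for w in W])
s0_hat = np.array([p0[w][1] for w in W])

# Step 2: impute the missing potential intermediate value (only the relevant arm is used).
S0_imp = m0 + rho * s0_hat / s1_hat * (S - m1)   # used for treated units, where S = S1
S1_imp = m1 + rho * s1_hat / s0_hat * (S - m0)   # used for control units, where S = S0

# Step 3: arm-specific regressions of Y on (S1, S0_hat, X) and (S1_hat, S0, X).
def regress(y, cols):
    A = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(A, y, rcond=None)[0]

t, c = Z == 1, Z == 0
print("treated-arm coefficients:", np.round(regress(Y[t], [S[t], S0_imp[t], X[t]]), 2))
print("control-arm coefficients:", np.round(regress(Y[c], [S1_imp[c], S[c], X[c]]), 2))
# Roughly recovers (2.0, -1.0, 0.5) and (1.0, -0.5, 0.5) after the intercepts.
```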