Identification of Causal Effects Within Principal Strata Using Auxiliary Variables
Zhichao Jiang and Peng Ding∗

Abstract
In causal inference, principal stratification is a framework for dealing with a post-treatment intermediate variable between a treatment and an outcome, in which the principal strata are defined by the joint potential values of the intermediate variable. Because the principal strata are not fully observable, the causal effects within them, also known as the principal causal effects, are not identifiable without additional assumptions. Several previous empirical studies leveraged auxiliary variables to improve the inference of principal causal effects. We establish a general theory for identification and estimation of the principal causal effects with auxiliary variables, which provides a solid foundation for statistical inference and more insights for model building in empirical research. In particular, we consider two commonly-used strategies for principal stratification problems: principal ignorability, and the conditional independence between the auxiliary variable and the outcome given principal strata and covariates. For these two strategies, we give non-parametric and semi-parametric identification results without modeling assumptions on the outcome. When the assumptions for neither strategy are plausible, we propose a large class of flexible parametric and semi-parametric models for identifying the principal causal effects. Our theory not only ensures formal identification results for several models that have been used in previous empirical studies but also generalizes them to allow for different types of outcomes and intermediate variables.
Keywords:
Augmented design; Auxiliary independence; Identification; Principal ignorability; Principal stratification

∗ Zhichao Jiang (Email: [email protected]) is Assistant Professor, Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst. Peng Ding (Email: [email protected]) is Assistant Professor, Department of Statistics, University of California, Berkeley.

1 Introduction
Complications arise in causal inference with an intermediate variable between the treatment and the outcome. Cochran (1957), Rosenbaum (1984) and Frangakis and Rubin (2002) pointed out that naively conditioning on the observed intermediate variable does not yield valid causal interpretations in general. Frangakis and Rubin (2002) proposed to use principal stratification, the joint potential values of the intermediate variable under both the treatment and control, to define subgroup causal effects, because it acts as a pretreatment covariate vector unaffected by the treatment. Principal stratification has a wide range of applications, with meanings varying in different scientific contexts. In noncompliance problems, where the treatment received can differ from the treatment assigned, principal stratification represents individual potential compliance behavior (Angrist et al., 1996). In truncation-by-death problems, where some units die before the measurement time point of their outcomes, principal stratification represents individual potential survival status (Rubin, 2006). In surrogate evaluation problems, principal stratification helps to clarify criteria for good surrogate endpoints (Frangakis and Rubin, 2002; Gilbert and Hudgens, 2008). In mediation analysis, Rubin (2004), Gallop et al. (2009), Elliott et al. (2010) and Mattei and Mealli (2011) defined direct effects as treatment effects on the outcome within principal strata with identical potential intermediate variables under the treatment and control. VanderWeele (2008) and Forastiere et al. (2018) linked the principal stratification approach with the direct and indirect effect approach, and Jo (2008) linked the principal stratification approach with structural equation models for mediation analysis. These problems with intermediate variables concern average causal effects within principal strata, which are also known as the principal causal effects (PCEs).

Because we cannot observe the joint potential values of the intermediate variable, we do not know the principal stratum of every individual and thus cannot identify the PCEs without additional assumptions. For a binary intermediate variable, Zhang and Rubin (2003), Cheng and Small (2006) and Imai (2008) derived large-sample bounds, which can be too wide to be informative. Angrist et al. (1996), Little and Yau (1998), Zhang et al. (2009) and Frumento et al. (2012) imposed additional structural or model assumptions to achieve identification. When the intermediate variable is continuous, identification becomes more difficult because of the infinitely many principal strata. To estimate the PCEs, Gilbert and Hudgens (2008) assumed parametric models and used a likelihood approach. Jin and Rubin (2008), Schwartz et al. (2011), and Zigler and Belin (2012) proposed different forms of parametric and semi-parametric Bayesian approaches. However, the identifiability of their models is not formally established. Without identifiability, the likelihood function may be flat over a region of some parameters, and Bayesian inference can be sensitive to prior specifications. See Gustafson (2009) and Ding and Li (2018) for more discussion on identifiability.

Identification is sometimes achievable with a pretreatment auxiliary variable satisfying certain conditional independence assumptions. We focus on two categories. The first category assumes that the outcome is independent of the principal strata given the auxiliary variable. This assumption is known as principal ignorability (Jo et al., 2011; Ding and Lu, 2017).
Under principal ignorability, Jo and Stuart (2009) and Stuart and Jo (2015) used principal scores to analyze data with one-sided noncompliance, and Joffe et al. (2007) suggested using principal scores to estimate general causal effects within principal strata. Ding and Lu (2017) established formal identification results for PCEs with a binary intermediate variable in randomized experiments. The other category assumes the conditional independence between the outcome and the auxiliary variable within principal strata. We will refer to this conditional independence as auxiliary independence. This assumption motivates several identification and estimation strategies in different contexts. For a binary intermediate variable, Ding et al. (2011) used the baseline quality of life as an auxiliary variable for identification when evaluating the effect on the quality of life with outcomes truncated by death. Under monotonicity, Mealli and Pacini (2013) relaxed Ding et al. (2011)'s assumptions and discussed bounds and identification of the PCEs with a binary secondary variable. Wang et al. (2017) extended the strategy to observational studies and relaxed monotonicity in a sensitivity analysis. In a study with multiple independent trials, Jiang et al. (2016) used the trial number as an auxiliary variable and proposed strategies to identify the PCEs. Yuan et al. (2019) weakened the identification assumptions and applied the methodology to a multi-site trial in education. Similar ideas have also been used to deal with continuous intermediate variables. In assessing the effect of an HIV vaccine on infection rate through immune response, Follmann (2006) used the baseline immune response to the rabies vaccine as an auxiliary variable. Qin et al. (2008) extended this idea to deal with time-to-event endpoints under case-cohort sampling. Gilbert and Hudgens (2008) and Huang and Gilbert (2011) proposed approaches to evaluating biomarkers based on principal stratification by incorporating baseline covariates as auxiliary variables to predict the biomarkers. These strategies also provided insights for better experimental designs. In particular, Gabriel and Follmann (2016) proposed the augmented treatment run-in design and used a baseline measure as a predictor of the potential values of the intermediate variable. However, under auxiliary independence, formal identification results are established only for binary intermediate variables (Ding et al., 2011; Mealli and Pacini, 2013; Jiang et al., 2016).

This paper discusses the identification of PCEs defined by a general intermediate variable with auxiliary variables. We first generalize the identification results under principal ignorability in Ding and Lu (2017) to general intermediate variables in both randomized experiments and observational studies. We then study identification under auxiliary independence in various scenarios. With auxiliary independence, we establish non-parametric identification results for discrete intermediate variables and semi-parametric identification results for continuous intermediate variables. These results do not require modeling the outcome. Without principal ignorability or auxiliary independence, we propose a large class of parametric models to identify the PCEs, whose identifiability has not been formally established before.
Compared with models used in previous empirical studies, our models require weaker assumptions and can deal with different types of data.

Identifiability is a cornerstone for both frequentist (Bickel and Doksum, 2015) and Bayesian (Gustafson, 2015) inference. Our results provide theoretical bases for checking the identifiability of PCEs before analyzing data. Practitioners can use our results to guide model building for principal stratification problems. Our results imply that some existing models are indeed identifiable but some are not (e.g., Follmann, 2006; Gilbert and Hudgens, 2008; Zigler and Belin, 2012). Moreover, our results reveal that some existing models invoked unnecessary assumptions for identification, for example, restricting the parameter space or imposing informative priors, although these assumptions make finite-sample inference more convenient.

The paper uses the following notation. Let i.i.d. denote "independently and identically distributed,"
A ⊥⊥ B | C denote the conditional independence of A and B given C, and A =_d B denote that A has the same distribution as B. Let P(·) be the probability mass or density function, and Φ(·) be the cumulative distribution function of the standard Normal distribution. We say that functions {f_1(x), ..., f_J(x)} are linearly independent if c_1 f_1(x) + · · · + c_J f_J(x) = 0 for all x implies c_1 = c_2 = · · · = c_J = 0. We say that a family Q of probability distributions is complete if ∫ f(v) Q(dv) = 0 for all Q ∈ Q implies f(v) = 0, a.s. (Lehmann and Romano, 2006).

Let Z be a binary treatment indicator with Z = 1 for the treatment and 0 for the control, Y be an outcome of interest, and S be an intermediate variable between the treatment and the outcome. Let S_{iz} and Y_{iz} be the potential values of the intermediate variable and the outcome if unit i were to receive treatment z (z = 0, 1). The observed values are S_i = Z_i S_{i1} + (1 − Z_i) S_{i0} and Y_i = Z_i Y_{i1} + (1 − Z_i) Y_{i0}. Assume that {Z_i, S_{i1}, S_{i0}, Y_{i1}, Y_{i0} : i = 1, ..., n} are i.i.d. samples drawn from an infinite superpopulation, and thus the observed {Z_i, S_i, Y_i : i = 1, ..., n} are also i.i.d. As a result, we can drop the subscript i.

Frangakis and Rubin (2002) defined principal stratification as U_i = (S_{i1}, S_{i0}), the joint potential values of the intermediate variable, and the PCEs as τ_{s1 s0} = E{Y_1 − Y_0 | U = (s_1, s_0)} for all s_1, s_0. The PCEs are not identifiable in general because U is latent. It is common to exploit a pretreatment auxiliary variable for identifying the PCEs. Let W_i denote this variable, with meanings varying in different settings. We start with the following basic assumption.

Assumption 1. Z ⊥⊥ (Y_1, Y_0, S_1, S_0) | W.

Assumption 1 is often guaranteed by design. In completely randomized experiments, Assumption 1 holds because Z ⊥⊥ (Y_1, Y_0, S_1, S_0, W). In a multi-center experiment with W being the center number, Assumption 1 holds because Z is randomized within each center. We consider two different assumptions for identification. The first is the conditional independence between the potential outcome Y_z and the principal stratum U given the auxiliary variable W.

Assumption 2 (principal ignorability). Y_z ⊥⊥ U | W for z = 0, 1. That is, conditional on W, the principal stratification variable is as if randomly assigned with respect to the potential outcomes.

Assumption 2 requires that, conditioning on the auxiliary variable, there is no difference between the distributions of the potential outcomes across principal strata. Many applied researchers have invoked it to estimate the PCEs (Follmann, 2000; Jo and Stuart, 2009; Jo et al., 2011; Stuart and Jo, 2015). To make Assumption 2 more plausible, researchers often need to include all pretreatment covariates in W. We provide two examples below.
Example 1. Follmann (2000) studied the effect of a multi-factor intervention on mortality due to coronary heart disease, where Z is the indicator of the intervention and Y is the survival time of the patients. One-sided noncompliance occurred in the experiment: patients assigned to the treatment group might not actually take the treatment. Let S denote the actual treatment received, which can be different from Z. The principal stratification variable characterizes the compliance behavior of the patients. Follmann (2000) argued that patients with different compliance behavior would be similar conditional on pretreatment covariates W, and estimated the PCEs under principal ignorability.
Example 2. Ding and Lu (2017) gave an example of a randomized experiment with truncation by death, where Z is the treatment indicator, S is the binary survival status, and Y is the health-related quality of life. Because the outcome is well defined only for the patients who survive, the parameter of interest is the PCE within the principal stratum of patients who would survive regardless of the treatment. They used all the covariates as the auxiliary variables and invoked principal ignorability in their analysis, which means that the health-related quality of life for the always-survived patients would be identical to that for the other patients given the covariates.

The second identification assumption is the conditional independence between the potential outcome Y_z and the auxiliary variable W given the principal stratum U.

Assumption 3 (auxiliary independence). Y_z ⊥⊥ W | U for z = 0, 1.

Under Assumptions 1 and 3, we have Y ⊥⊥ W | (Z, U), i.e., the auxiliary variable is independent of the observed outcome conditional on the treatment and the principal strata. Including additional pretreatment covariates X can make this assumption more plausible. However, for notational simplicity, we condition on X implicitly and omit X below. We provide two examples, in which Assumption 3 is justifiable by design.

Example 3. Follmann (2006) introduced an augmented design to assess immune response in vaccine trials, where Z is the indicator of an HIV vaccine injection, S is the immune response to this vaccine, and Y is the infection indicator. Before the randomization of Z, all patients receive the rabies vaccine. Let W denote the immune response to the rabies vaccine, which is correlated with S. Because the rabies vaccine is irrelevant to HIV infection, the potential HIV infection status should depend only on the immune response to the HIV vaccine but not on that to the rabies vaccine. This justifies auxiliary independence, based on which Follmann (2006) estimated the PCEs.
Example 4. Jiang et al. (2016) proposed approaches to identifying the PCEs with multiple independent trials, where Z is the treatment indicator, S is the indicator of three-year cancer recurrence, and Y is the five-year survival status. The data are from multiple trials, with the trial number denoted by W. Jiang et al. (2016) argued that the principal stratification variable is a measure of physical status, and assumed that the survival status does not depend on the trial number W given the patient's physical status. As a result, they identified the PCEs under auxiliary independence.

When S is binary as in Example 4, Jiang et al. (2016) showed the identifiability of the PCEs. With a general S as in Example 3, formal identification results have not been established, although several parametric or semi-parametric models have been proposed in data analysis.

In the following two sections, we give a unified theory for the identification of the PCEs with an auxiliary variable under various scenarios. The theoretical results depend on three factors: (a) whether or not the potential intermediate variable under control, S_0, is constant; (b) whether the intermediate variable S is discrete or continuous; and (c) whether Assumption 2 or 3 holds. Table 1 presents the roadmap for our paper. We encourage the readers to revisit it after reading Sections 3 and 4.

We first consider the cases with a constant intermediate variable under control.
Assumption 4. S_{i0} = c for all i, where c is a constant.

In some vaccine trials (e.g., Follmann, 2006; Hudgens and Gilbert, 2009), Assumption 4 is plausible because vaccine antigens must be present to induce a specific immune response, and they are absent in the control group. For a binary S, Assumption 4 with c = 0 is called strong monotonicity, which holds in the one-sided noncompliance setting because individuals assigned to the control group do not have access to the treatment (Sommer and Zeger, 1991; Imbens and Rubin, 2015). Under Assumption 4, S_0 is constant, and therefore it is not necessary to include it in U, simplifying the PCEs to

τ_{s1} = E(Y_1 − Y_0 | S_1 = s_1) = E(Y_1 | S_1 = s_1) − E(Y_0 | S_1 = s_1).

Because S_1 is observed in the treatment group, we can identify E(Y_1 | S_1 = s_1) = E(Y | Z = 1, S = s_1) under Assumption 1. We then need only to identify E(Y_0 | S_1 = s_1). Because S_1 is missing in the control group, the PCEs are not identifiable without additional assumptions. Below we discuss the identification of the PCEs under Assumption 2 or 3.

Table 1: Roadmap of the sufficient conditions for identifying the PCEs. Note that the results with a non-constant S_0 require the identification of P(S_1, S_0 | W).

                 Assumptions   Type of S   Requirement for W            Outcome model
Constant S_0
  Section 3.1    1, 2 and 4    General     No                           No
  Section 3.2    1, 3 and 4    Discrete    More categories than S       No
  Section 3.3    1, 3 and 4    General     Completeness                 No
  Section 3.4    1 and 4       General     Depends on the model of S    Yes
Non-constant S_0
  Section 4.2    1 and 2       General     No                           No
  Section 4.3    1 and 3       Discrete    More categories than S       No
  Section 4.4    1 and 3       General     Completeness                 No
  Section 4.5    1             General     Depends on the model of S    Yes

We extend the definition of the principal score to a general S: the probability of the principal strata conditional on the auxiliary variable, π_{s1,s0}(W) = P(S_1 = s_1, S_0 = s_0 | W). Under Assumption 4, it reduces to π_{s1}(W) = P(S_1 = s_1 | W), which is identified by π_{s1}(W) = P(S = s_1 | Z = 1, W) under Assumption 1. The proportions of the strata are then identified by π_{s1} = P(S_1 = s_1) = E{π_{s1}(W)}. With principal ignorability, we can identify the PCEs using the principal scores, as shown in the following theorem.
Theorem 1. Under Assumptions 1, 2 and 4, the PCEs are identified by

τ_{s1} = E(Y | Z = 1, S = s_1) − E{ (π_{s1}(W)/π_{s1}) · (1 − Z)Y / (1 − e(W)) },   (1)

where e(W) = P(Z = 1 | W) is the propensity score.

Theorem 1 shows that E(Y_0 | S_1 = s_1) can be identified by the average of the outcomes in a weighted sample, with the weight depending on both the principal score and the propensity score. The principal score accounts for the relationship between the principal stratum membership and the covariates, whereas the propensity score accounts for the relationship between the treatment and the covariates. Ding and Lu (2017)'s result holds only in randomized experiments with a binary S, and Theorem 1 generalizes it to allow for different types of S in observational studies.

Based on Theorem 1, we can estimate the PCEs using a two-step procedure. We first estimate the principal score and the propensity score. We then plug the estimated principal score and propensity score into (1) and replace the expectation with the empirical average to obtain estimates of the PCEs.
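The following is a minimal sketch of this two-step estimator for a discrete S, using off-the-shelf logistic regressions for both the principal score and the propensity score; the function name and the model choices are illustrative assumptions rather than part of the original proposal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pce_weighting(Z, S, Y, W):
    """Two-step estimator of tau_{s1} based on (1), for a discrete S.

    Step 1: fit the principal score pi_{s1}(W) = P(S1 = s1 | W) on the treated
    arm (where S = S1 under Assumption 4) and the propensity score e(W).
    Step 2: plug both into (1), replacing expectations with sample averages.
    """
    Z, S, Y = map(np.asarray, (Z, S, Y))
    W = np.asarray(W, dtype=float).reshape(len(Z), -1)

    ps_model = LogisticRegression(max_iter=1000).fit(W[Z == 1], S[Z == 1])
    e_hat = LogisticRegression(max_iter=1000).fit(W, Z).predict_proba(W)[:, 1]
    pi_w_all = ps_model.predict_proba(W)          # columns follow ps_model.classes_

    tau = {}
    for j, s1 in enumerate(ps_model.classes_):
        pi_w = pi_w_all[:, j]                     # pi_{s1}(W_i)
        pi_s1 = pi_w.mean()                       # pi_{s1} = E{pi_{s1}(W)}
        term1 = Y[(Z == 1) & (S == s1)].mean()    # E(Y | Z = 1, S = s1)
        term0 = np.mean(pi_w / pi_s1 * (1 - Z) * Y / (1 - e_hat))
        tau[s1] = term1 - term0
    return tau
```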
Theorem 2. Suppose S ∈ {s_1, ..., s_K} and W ∈ {w_1, ..., w_L}. Let M denote the K × L matrix with (k, l)-th element P(S = s_k | Z = 1, W = w_l). Under Assumptions 1, 3 and 4, if rank(M'M) = K, then the PCEs are identifiable.

From Theorem 2, a necessary condition for identification is L ≥ K, i.e., W must have more categories than S. Because M depends only on the distribution of the observed data, the condition rank(M'M) = K is testable. The following example illustrates the identifiability for the case with binary intermediate and auxiliary variables.
Example 5. Consider binary S and W. First, from the observed distribution and Assumption 1, we can identify θ_{sw} = P(S_1 = s | W = w) = P(S = s | Z = 1, W = w) and δ_w = E(Y_0 | W = w) = E(Y | Z = 0, W = w) for s, w = 0, 1. Second, from Assumption 3, we have

δ_0 = E(Y_0 | S_1 = 1) θ_{10} + E(Y_0 | S_1 = 0) θ_{00},
δ_1 = E(Y_0 | S_1 = 1) θ_{11} + E(Y_0 | S_1 = 0) θ_{01},

which are two linear equations in E(Y_0 | S_1 = 1) and E(Y_0 | S_1 = 0). If the condition rank(M'M) = 2 holds, or, equivalently, S and W are dependent given Z = 1, we can uniquely solve the two linear equations and obtain

E(Y_0 | S_1 = 1) = (δ_0 θ_{01} − δ_1 θ_{00}) / (θ_{10} θ_{01} − θ_{11} θ_{00}),   E(Y_0 | S_1 = 0) = (δ_1 θ_{10} − δ_0 θ_{11}) / (θ_{10} θ_{01} − θ_{11} θ_{00}).

Therefore, the PCEs are identifiable.
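More generally, the moment equations behind Theorem 2, δ_w = Σ_k E(Y_0 | S_1 = s_k) P(S_1 = s_k | Z = 1, W = w), can be solved by linear algebra once M and the vector of control-arm means are estimated. Below is a small sketch under these assumptions; the function name is ours, and the rank condition of Theorem 2 corresponds to the linear system having a unique solution.

```python
import numpy as np

def control_means_by_stratum(M, delta):
    """Solve delta = M' m for m, where
       M[k, l]  = P(S1 = s_k | Z = 1, W = w_l)  (estimable from the treated arm),
       delta[l] = E(Y | Z = 0, W = w_l)         (estimable from the control arm),
       m[k]     = E(Y0 | S1 = s_k)              (the target in Theorem 2).
    The solution is unique when M has full row rank K."""
    M = np.asarray(M, dtype=float)
    delta = np.asarray(delta, dtype=float)
    if np.linalg.matrix_rank(M @ M.T) < M.shape[0]:
        raise ValueError("rank condition of Theorem 2 fails")
    m, *_ = np.linalg.lstsq(M.T, delta, rcond=None)
    return m
```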
Identification is more difficult with a continuous intermediate variable, which generates infinitely many principal strata. Let 𝒲 be the support of W, and P_W = {P(S_1 | W = w) : w ∈ 𝒲} be the family of probability distributions indexed by w. Based on the definition of completeness, we give a sufficient condition for identification.
Theorem 3. Under Assumptions 1, 3 and 4, if P_W is complete, then the PCEs are identifiable.

As discussed before, the key to identifying the PCEs is to identify E(Y_0 | S_1). From Assumptions 1 and 3, we have

E(Y | Z = 0, W = w) = E(Y_0 | W = w) = E{E(Y_0 | S_1) | W = w} = ∫ E(Y_0 | S_1 = s_1) Q(ds_1)   (2)

for any probability measure Q(s_1) = P(S_1 ≤ s_1 | W = w) in P_W. The left-hand side of (2) is directly estimable from the observed data, and the distributions in P_W are identified by P(S_1 | W) = P(S | Z = 1, W). Therefore, (2) is an integral equation for E(Y_0 | S_1 = s_1). As a result, E(Y_0 | S_1 = s_1) is identifiable if it is uniquely determined by (2), which is guaranteed by the completeness of P_W. When S is discrete, the integral in (2) becomes a summation, and the completeness is the same as the rank condition in Theorem 2.

Theorem 3 is general but abstract. From the well-known completeness property of the exponential family (Lehmann and Romano, 2006), we have a more interpretable sufficient condition for identifying the PCEs.
Theorem 4. Under Assumptions 1, 3 and 4, further assume

P(S_1 = s | W = w) = h(s) g(w) exp{η(w)' t(s)},

where s → t(s) is a one-to-one mapping and {η(w) : w ∈ 𝒲} contains an open set in R^d, where d is the length of the vector function η(w). Then the PCEs are identifiable.

Theorem 4 requires that the distribution of S_1 conditional on W belongs to an exponential family, but it does not require any model for the potential outcomes Y_z. Therefore, Theorem 4 guarantees semi-parametric identifiability and allows for different types of outcomes. Below we give an example for Normal distributions.
Corollary 1. Under Assumptions 1, 3 and 4, if (S_1, W) follows a bivariate Normal distribution, then the PCEs are identifiable.
Remark 1. For a binary outcome, Follmann (2006) assumed that the outcome follows a Probit model and that (S_1, W) follows a bivariate Normal distribution, which is a special case of Corollary 1. Thus, Follmann (2006)'s model is semi-parametrically identified even without the outcome model, and his parametric outcome model is invoked only for convenience in the finite-sample inference.

To further improve the applicability of Theorem 3, we review the following lemma (Hu and Shiu, 2017, Lemma 4) on the completeness of a class of location-scale distribution families, which works for distributions outside the exponential family.
Lemma 1. Suppose the support of W has an interior point, and S_1 = h(W) + σ(W)ε with continuously differentiable h(w) and σ(w), and ε ⊥⊥ W. Then P_W is complete if the characteristic function φ(t) and the density function f(ε) of ε satisfy the following conditions:

(a) 0 < |φ(t)| < C exp(−δ|t|) for all t ∈ R and some constants C, δ > 0;

(b) f(ε) is continuously differentiable, ∫ |x f'(x)| dx < +∞, and ∫ f²(x) dx < +∞;

(c) for any positive integer J, the functions {f((x − h_1)/σ_1), ..., f((x − h_J)/σ_J)} are linearly independent, where the (h_j, σ_j)'s are distinct.

The existence of the infinite sequence required by Lemma 1 holds automatically for a continuous W but fails for a discrete W. Conditions (a) and (b) in Lemma 1 are technical requirements on the distribution of the error term ε. Condition (c) means that finite location-scale mixtures of the distribution of ε are identifiable, which holds for many distributions (Everitt and Hand, 1981). For example, Appendix B.1 shows that Conditions (a)–(c) hold when ε follows a Normal, t or Logistic distribution. Combining Theorem 3 with Lemma 1, we obtain the following theorem for location-scale distribution families.
Theorem 5. Suppose that W is continuous, Assumptions 1, 3 and 4 hold, S_1 = h(W) + σ(W)ε with continuously differentiable h(w) and σ(w), and ε ⊥⊥ W. If ε satisfies Conditions (a)–(c) in Lemma 1, then the PCEs are identifiable.

Theorem 5 guarantees the identifiability of the PCEs in many models involving distributions that do not belong to an exponential family. It allows for heteroscedastic errors and enables flexible model choices. For example, if we replace the bivariate Normal distribution assumption on (S_1, W) with S_1 | W = w ∼ N(µ(w), σ²(w)), then Theorem 4 and Corollary 1 cannot be applied, because {η(w) = (1/σ²(w), µ(w)/σ²(w)) : w ∈ 𝒲} is a one-dimensional curve in R² and thus cannot contain an open set. However, Theorem 5 ensures that the PCEs are still identifiable.

The conditional independence in Assumption 2 or 3 may be violated. In Example 2, covariates may not be sufficient to account for the difference in the health-related quality of life across principal strata, which makes Assumption 2 implausible; in Example 4, different centers may have different qualities of services, which makes Assumption 3 implausible. Without conditional independence, W does not help to achieve non-parametric or semi-parametric identification. One solution is to conduct a sensitivity analysis, which, however, requires sensitivity parameters characterizing the violation of the assumptions and a specification of their ranges. A sensitivity analysis gives a range of estimates rather than a point estimate, and it often depends on additional model assumptions. We do not pursue this direction in this paper. Instead, in this subsection, we seek an alternative route and propose parametric models for identifying the PCEs, in which the auxiliary variable W satisfies certain modeling assumptions. We can also include other covariates X in our models, but do not require any modeling assumptions for X. So, again, we condition on X implicitly and omit it below.
Proposition 1. Under Assumptions 1 and 4, assume that (S_1, Y_0) follow additive models:

S_1 = g(W) + σ_S(W) ε_S,   (3)
Y_0 = β_0 + α S_1 + Σ_{j=1}^{J} β_j f_j(W) + σ_Y(W) ε_Y,   (4)

where E(ε_S | W) = E(ε_Y | W) = 0, and g(w), σ_S(w) and σ_Y(w) can be unknown functions. If {1, g(w), f_1(w), ..., f_J(w)} are linearly independent, then the PCEs are identifiable.

We do not need to specify g(w) and σ_S(w) because we can identify them from the observed distribution P(S, W | Z = 1) under Assumption 1. In contrast, we need to specify the f_j(w)'s in the model of Y_0.

Intuitively, in Proposition 1, replacing S_1 in (4) by (3), we obtain an additive model of Y_0 on W, and the linear independence condition allows us to disentangle the coefficients of the different terms involving W. For example, if g(W) is quadratic in W in (3) and {J = 1, f_1(W) = W} in (4), then the linear independence assumption in Proposition 1 holds. However, if g(w) is linear in w, then the linear independence assumption fails.

If f_j(w) = 0 for all j, then Proposition 1 becomes a special case of Theorem 5. Proposition 1 guarantees the identifiability of the PCEs in additive models without specifying the distributions of the error terms.

In the model for Y_0, we require S_1 to enter linearly. Identification may also be possible for other forms of S_1, but will require knowledge of the distributions of the error terms.
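To make the disentangling concrete, here is a rough two-stage regression sketch under the models (3)–(4): the first stage estimates g(w) = E(S_1 | W = w) from the treated arm, and the second stage regresses the control-arm outcomes on {g_hat(W), f_1(W), ..., f_J(W)}, using E(Y | Z = 0, W) = β_0 + α g(W) + Σ_j β_j f_j(W). The flexible first-stage learner and the function names are illustrative assumptions, not prescriptions from this paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

def fit_additive_components(Z, S, Y, W, f_list):
    """Recover (beta_0, alpha, beta_1, ..., beta_J) in (4) under model (3).

    f_list: the user-specified functions f_1, ..., f_J of w appearing in (4).
    Returns the intercept beta_0, the coefficient alpha of S1, and the beta_j's.
    """
    Z, S, Y = map(np.asarray, (Z, S, Y))
    W = np.asarray(W, dtype=float).reshape(-1, 1)

    # Stage 1: g(w) = E(S1 | W = w), identified from P(S, W | Z = 1).
    g_hat = GradientBoostingRegressor().fit(W[Z == 1], S[Z == 1]).predict(W)

    # Stage 2: E(Y | Z = 0, W) = beta_0 + alpha * g(W) + sum_j beta_j f_j(W).
    design = np.column_stack([g_hat] + [f(W).ravel() for f in f_list])
    reg = LinearRegression().fit(design[Z == 0], Y[Z == 0])
    return reg.intercept_, reg.coef_[0], reg.coef_[1:]
```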
For binary outcomes, we show an identification result below for the Probit model.

Proposition 2. Under Assumptions 1 and 4, assume that S_1 follows an additive model with Normal error and Y_0 follows a Probit model:

S_1 = g(W) + ε_S,   ε_S ⊥⊥ W,   ε_S ∼ N(0, σ²),
P(Y_0 = 1 | S_1 = s_1, W = w) = Φ( β_0 + α s_1 + Σ_{j=1}^{J} β_j f_j(w) ),

where g(w) can be unknown. If {1, g(w), f_1(w), ..., f_J(w)} are linearly independent, then the PCEs are identifiable.

The model of S_1 in Proposition 2 requires that the variance of the error term ε_S does not depend on W, which is different from Proposition 1. Identification may also be possible with the variance depending on W, but will rely on the functional form of var(S_1 | W).
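One reason the Normal error is convenient is that integrating the Probit outcome model over S_1 | W = w ∼ N(g(w), σ²) yields another Probit probability in w, so the implied control-arm probability has a closed form. The sketch below only illustrates this marginalization, with argument names of our own choosing; it is not a full estimation procedure.

```python
import numpy as np
from scipy.stats import norm

def implied_control_prob(w, g, sigma, beta0, alpha, betas, f_list):
    """P(Y = 1 | Z = 0, W = w) implied by Proposition 2's model:
    averaging Phi(beta0 + alpha*S1 + sum_j beta_j f_j(w)) over
    S1 | W = w ~ N(g(w), sigma^2) gives a Probit with an attenuated index."""
    index = beta0 + alpha * g(w) + sum(b * f(w) for b, f in zip(betas, f_list))
    return norm.cdf(index / np.sqrt(1.0 + alpha ** 2 * sigma ** 2))
```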
Remark 2. Our result is not contrary to Follmann (2006). Without Assumption 3, Follmann (2006) assumed a bivariate Normal distribution for (S_1, W) and used the following Probit model for Y:

P(Y = 1 | Z, S_1, W) = Φ( β_0 + β_1 Z + β_2 S_1 + β_3 W + β_4 Z S_1 ).   (5)

Under Assumption 1, (5) is equivalent to

P(Y_1 = 1 | S_1, W) = Φ{ β_0 + β_1 + (β_2 + β_4) S_1 + β_3 W },
P(Y_0 = 1 | S_1, W) = Φ( β_0 + β_2 S_1 + β_3 W ).

From Proposition 2, without the model of Y_1, the PCEs are not identifiable because the linear independence condition is violated. The identifiability comes from the parallel model assumption that restricts the coefficients of W to be the same in the models of Y_1 and Y_0.
Remark 3. Without the linear independence condition, researchers often use additional information on the parameters to improve identification. Using a Bayesian approach, Zigler and Belin (2012) imposed informative priors on α. In a similar setting with a time-to-event outcome, Qin et al. (2008) imposed the principal ignorability Y_0 ⊥⊥ S_1 | W, or, equivalently, α = 0.

Assumption 4 does not hold in many applications. Without it, we can never simultaneously observe S_1 and S_0, and therefore it is challenging to identify the joint distribution of (S_1, S_0) in the first place, let alone the PCEs. Below we first use a copula model for the joint distribution of (S_1, S_0), and then discuss identification of the PCEs.

Under Assumption 1, P(S_z | W) = P(S | Z = z, W), and thus we can identify the marginal distributions of S_z given W from the observed data. To recover the joint distribution of (S_1, S_0) given W from the marginal distributions, we need some prior knowledge about the association between S_1 and S_0 conditional on W. For a binary S, a commonly-used assumption to recover the joint distribution of (S_1, S_0) is the monotonicity assumption S_1 ≥ S_0. Under this assumption, we can identify P(S_1 = 1, S_0 = 1 | W) = P(S = 1 | Z = 0, W), P(S_1 = 0, S_0 = 0 | W) = P(S = 0 | Z = 1, W), and P(S_1 = 1, S_0 = 0 | W) = P(S = 1 | Z = 1, W) − P(S = 1 | Z = 0, W). For a continuous S, Efron and Feldman (1991) and Jin and Rubin (2008) discussed the equipercentile equating assumption, i.e., F_1(S_1 | W) = F_0(S_0 | W), where F_z(· | W) is the cumulative distribution function of S_z given W for z = 0, 1.
Under this assumption, S_z determines S_{1−z} based on F_1(· | W) and F_0(· | W) for z = 0, 1. More generally, for each w, we assume

P(S_1, S_0 | W = w) = C_ρ{ P(S_1 | W = w), P(S_0 | W = w) },   (6)

where C_ρ(·, ·) is a copula and ρ is a measure of the association between S_1 and S_0. If we know ρ, then we can identify P(S_1, S_0 | W = w) from the marginal distributions by (6). Otherwise, we can view ρ as a sensitivity parameter and conduct a sensitivity analysis by varying ρ.
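As one concrete choice, a Gaussian copula C_ρ gives an explicit joint density of (S_1, S_0) given W = w in terms of the two identified marginals. The sketch below only illustrates (6) under that specific copula; the Gaussian copula is one possible choice of C_ρ, and the function name and interface are ours.

```python
import numpy as np
from scipy import stats

def gaussian_copula_joint_density(s1, s0, f1, F1, f0, F0, rho):
    """Joint density of (S1, S0) given W = w implied by (6) with a Gaussian copula.

    f1, F1 (f0, F0): the marginal density and CDF of S1 (S0) given W = w,
    identified from the treated (control) arm under Assumption 1.
    rho: the sensitivity parameter measuring the association between S1 and S0.
    """
    q1 = stats.norm.ppf(F1(s1))
    q0 = stats.norm.ppf(F0(s0))
    det = 1.0 - rho ** 2
    # Gaussian copula density c_rho evaluated at (F1(s1), F0(s0))
    copula = np.exp(-(rho ** 2 * (q1 ** 2 + q0 ** 2) - 2.0 * rho * q1 * q0)
                    / (2.0 * det)) / np.sqrt(det)
    return copula * f1(s1) * f0(s0)
```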
Assume now that the principal score π_{s1,s0}(W) = P(S_1 = s_1, S_0 = s_0 | W) is identifiable. The density of the principal strata then equals π_{s1,s0} = E{π_{s1,s0}(W)}. Similar to Section 3.1, we establish the identification of the PCEs using the principal scores.

Theorem 6. Under Assumptions 1 and 2, if π_{s1,s0}(W) is identifiable, then the PCEs are identified by

τ_{s1 s0} = E{ (π_{s1,s0}(W)/π_{s1,s0}) · ZY/e(W) } − E{ (π_{s1,s0}(W)/π_{s1,s0}) · (1 − Z)Y/(1 − e(W)) }.

Theorem 6 generalizes Theorem 1 to the case with a non-constant intermediate variable under control. It shows that E(Y_z | S_1 = s_1, S_0 = s_0) can be identified by the average of the outcomes in a weighted sample, with the weight depending on both the principal score and the propensity score.
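As with Theorem 1, the formula in Theorem 6 suggests a plug-in weighting estimator once π_{s1,s0}(W) and e(W) have been estimated. A minimal sketch, with hypothetical argument names, is below.

```python
import numpy as np

def pce_weighting_joint(Z, Y, W, e_hat, pi_fn):
    """Plug-in version of the weighting identity in Theorem 6 for one stratum (s1, s0).

    pi_fn: function returning the identified principal score pi_{s1,s0}(W_i),
           e.g., implied by the copula model (6) with a chosen rho.
    e_hat: estimated propensity scores e(W_i) = P(Z = 1 | W_i).
    """
    Z, Y, e_hat = map(np.asarray, (Z, Y, e_hat))
    pi_w = np.asarray(pi_fn(W))
    pi = pi_w.mean()                                 # pi_{s1,s0} = E{pi_{s1,s0}(W)}
    term1 = np.mean(pi_w / pi * Z * Y / e_hat)       # treated-arm weighted mean
    term0 = np.mean(pi_w / pi * (1 - Z) * Y / (1 - e_hat))
    return term1 - term0
```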
We now give the identification results for discrete intermediate variables.
Theorem 7. Suppose S ∈ {s_1, ..., s_K}, W ∈ {w_1, ..., w_L}, Assumptions 1 and 3 hold, and P(S_1, S_0 | W) is identifiable.

(a) Let M_{s0} denote the K × L matrix with (k, l)-th element P(S_1 = s_k | S_0 = s_0, W = w_l). For a fixed s_0, if rank(M_{s0}' M_{s0}) = K, then P(Y_0 | S_1, S_0 = s_0) is identifiable.

(b) Let M_{s1} denote the K × L matrix with (k, l)-th element P(S_0 = s_k | S_1 = s_1, W = w_l). For a fixed s_1, if rank(M_{s1}' M_{s1}) = K, then P(Y_1 | S_1 = s_1, S_0) is identifiable.

(c) If (a) and (b) above hold for all s_0 and s_1, then the PCEs are identifiable.

Theorem 7 extends Theorem 2. As special cases of Theorem 7, Ding et al. (2011) and Jiang et al. (2016) gave the identification results for a binary intermediate variable under monotonicity, where the rank conditions in Theorem 7 simplify to the requirements that S_1 and W are dependent given S_0, and that S_0 and W are dependent given S_1.

Recall that 𝒲 is the support of W. For fixed s_0 and s_1, let P_{W,s0} = {P(S_1 | S_0 = s_0, W = w) : w ∈ 𝒲} and P_{W,s1} = {P(S_0 | S_1 = s_1, W = w) : w ∈ 𝒲} be the families of distributions indexed by w given s_0 and s_1, respectively. Similar to Section 3.3, the identifiability of the PCEs reduces to the completeness of P_{W,s0} and P_{W,s1}.
Theorem 8. Suppose that Assumptions 1 and 3 hold, and P(S_1, S_0 | W) is identifiable.

(a) If P_{W,s0} is complete for all s_0, then P(Y, S_1, S_0, W | Z = 0) is identifiable.

(b) If P_{W,s1} is complete for all s_1, then P(Y, S_1, S_0, W | Z = 1) is identifiable.

(c) If (a) and (b) above hold, then the PCEs are identifiable.

Similar to Theorem 3, Theorem 8 does not require any model for the distribution of Y_z (z = 0, 1).
Corollary 2. For a continuous W, suppose that Assumptions 1 and 3 hold. If

(S_1, S_0) | W = w ∼ N( (µ_1(w), µ_0(w))', [ σ_1²(w), ρ(w)σ_1(w)σ_0(w); ρ(w)σ_1(w)σ_0(w), σ_0²(w) ] ),   (7)

with a known ρ(w), then the PCEs are identifiable.

Corollary 2 does not need any model for the outcome, but it requires the auxiliary variable to be continuous. In Corollary 2, with a known ρ(w), we can identify the joint distribution of (S_1, S_0) given W from the marginal distributions of S_z given W. Therefore, the PCEs are identifiable from Theorem 8. To apply Corollary 2, we need to pre-specify the correlation coefficient ρ(w), which serves as a sensitivity parameter in practice.

Similar to the case with a constant control intermediate variable, we propose some useful parametric models for identifying the PCEs using the auxiliary variable W when Assumption 2 or 3 fails.
Proposition 3. For a binary S with monotonicity S_1 ≥ S_0, suppose that Assumption 1 holds, and Y_1 and Y_0 follow linear models

E(Y_z | S_1, S_0, W) = β_{z0} + β_{z1} S_1 + β_{z2} S_0 + β_{z3} W   (z = 0, 1).   (8)

If both

P(S = 1 | Z = 1, W = w) / P(S = 1 | Z = 0, W = w)   and   P(S = 0 | Z = 1, W = w) / P(S = 0 | Z = 0, W = w)   (9)

are not constant in w, then the PCEs are identifiable.

We can use the observed data to check whether the two ratios in (9) are constant in w. For a binary W, the only restriction of (8) is that there is no interaction term among (S_1, S_0, W) in the model of Y_z, which is similar to existing no-interaction or homogeneity assumptions (Ding et al., 2011; Wang et al., 2017).

For a continuous intermediate variable, we give the following proposition.

Proposition 4. Suppose that Assumption 1 holds, and (S_1, S_0) given W follows (7) with a known ρ(w). Suppose Y_1 and Y_0 follow additive models:

Y_1 = β_0 + α_1 S_1 + α_0 S_0 + Σ_{j=1}^{J} β_j f_j(W) + σ_1(W) ε_{Y1},
Y_0 = β_0' + α_1' S_1 + α_0' S_0 + Σ_{j=1}^{J} β_j' h_j(W) + σ_0(W) ε_{Y0},

with (ε_{Y1}, ε_{Y0}) ⊥⊥ (S_1, S_0, W). The PCEs are identifiable if the following two conditions hold:

(a) {1, s_1, E(S_0 | S_1 = s_1, W = w), f_1(w), ..., f_J(w)} are linearly independent as functions of (s_1, w);

(b) {1, s_0, E(S_1 | S_0 = s_0, W = w), h_1(w), ..., h_J(w)} are linearly independent as functions of (s_0, w).

Proposition 4, as an extension of Proposition 1, is mostly useful for continuous outcomes. The Normality in (7) implies that, given W, S_0 is linear in S_1 plus an independent Normal error, with coefficients determined by the distribution of (S_1, S_0) given W. Then, in Proposition 4, we can obtain an additive model of Y_1 on S_1 and W by replacing S_0 in the model of Y_1, and the linear independence condition (a) allows us to disentangle the coefficients of the different terms involving S_1 and W. A similar discussion applies to condition (b).

The Normality in (7) is also helpful for binary outcomes. The following proposition gives the identification result under Probit models for Y_z.
Proposition 5. Suppose that Assumption 1 holds, and (S_1, S_0) given W follows (7) with a known ρ(w). Suppose Y_1 and Y_0 follow Probit models:

P(Y_1 = 1 | S_1 = s_1, S_0 = s_0, W = w) = Φ( β_0 + α_1 s_1 + α_0 s_0 + Σ_{j=1}^{J} β_j f_j(w) ),   (10)
P(Y_0 = 1 | S_1 = s_1, S_0 = s_0, W = w) = Φ( β_0' + α_1' s_1 + α_0' s_0 + Σ_{j=1}^{J} β_j' h_j(w) ).   (11)

If Conditions (a) and (b) in Proposition 4 hold, then the PCEs are identifiable.
Remark 4. Using a Bayesian approach, Zigler and Belin (2012) assumed a trivariate Normal distribution for (S_1, S_0, W), with a sensitivity parameter characterizing the correlation between S_1 and S_0, and Probit models for Y_z with f_j(w) and h_j(w) linear in w. Under their model, the conditional expectation E(S_0 | S_1 = s_1, W = w) is linear in both s_1 and w, and E(S_1 | S_0 = s_0, W = w) is linear in both s_0 and w. Thus, the linear independence condition is violated, and the parameters are not identifiable. To mitigate the inferential difficulties, Zigler and Belin (2012) imposed informative priors on α_1 − α_1' and α_0 − α_0'.

In frequentist inference, non-identifiability renders the likelihood function flat over a region for some parameters, and the classical repeated-sampling theory of maximum likelihood estimation does not apply (Bickel and Doksum, 2015). Computationally, the Bayesian machinery is still applicable as long as the priors are proper. The simulation below, however, highlights the importance of identifiability in Bayesian inference. In both the case with a constant and the case with a non-constant control intermediate variable, we use two models to estimate the PCEs under several data generating processes (DGPs). The two models seem similar in form but have different identifiability. We use the Gibbs sampler to simulate the posterior distributions of the PCEs with 20000 iterations, treating the first 4000 iterations as the burn-in period. The Markov chains mix well, with Gelman–Rubin diagnostic statistics close to one based on multiple chains.
We generate data from DGP 1:

Z ∼ Bernoulli(0.5),   W ∼ N(0, 1),   Z ⊥⊥ W,
S_1 | W ∼ N(γ_0 + γ_1 W, σ²),
P(Y_z = 1 | S_1, W) = Φ(β_{z0} + β_{z1} S_1 + β_{z2} W),

with fixed true values for (β_{10}, β_{11}, β_{12}), (β_{00}, β_{01}, β_{02}) and (γ_0, γ_1, σ²). We name the model corresponding to DGP 1 as model 1. Because g(w) = γ_0 + γ_1 w is linear in w, the set {1, γ_0 + γ_1 W, W} is not linearly independent, so the identifiability of model 1 is not guaranteed by Proposition 2.
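For concreteness, here is a data-generating sketch for DGP 1; the numeric parameter values are illustrative placeholders, not the constants used in our simulation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2020)
n = 1000

# Illustrative placeholder values; not the exact constants used in the paper.
gamma0, gamma1, sigma = 1.0, 0.5, 1.0
b1 = (1.0, -0.5, 0.5)    # (beta_10, beta_11, beta_12) for the treated potential outcome
b0 = (0.5, -0.5, 0.5)    # (beta_00, beta_01, beta_02) for the control potential outcome

Z = rng.binomial(1, 0.5, n)
W = rng.normal(0.0, 1.0, n)                   # Z independent of W
S1 = rng.normal(gamma0 + gamma1 * W, sigma)   # S1 | W ~ N(gamma0 + gamma1*W, sigma^2)
Y1 = rng.binomial(1, norm.cdf(b1[0] + b1[1] * S1 + b1[2] * W))
Y0 = rng.binomial(1, norm.cdf(b0[0] + b0[1] * S1 + b0[2] * W))

S = np.where(Z == 1, S1, 0.0)                 # S0 = c = 0 under Assumption 4
Y = np.where(Z == 1, Y1, Y0)
```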
For DGP 2, Z, W and Y_z are generated as in DGP 1, but S_1 | W ∼ N(γ_0 + γ_1 W + γ_2 W², σ²), with fixed true values for (γ_0, γ_1, γ_2, σ²). We name the corresponding model as model 2. Because {1, γ_0 + γ_1 W + γ_2 W², W} are linearly independent, the PCEs are identifiable based on Proposition 2.

For both DGPs 1 and 2, we use the true models to analyze the generated data with sample size 1000. We choose the following two sets of priors to assess the sensitivity of the posterior inference:

(A) diffuse Normal priors centered at zero for (β_{z0}, β_{z1}, β_{z2}) for z = 0, 1, the non-informative prior p(σ²) ∝ 1/σ², and diffuse Normal priors centered at zero for (γ_0, γ_1) in model 1 (correspondingly, for (γ_0, γ_1, γ_2) in model 2);

(B) the same as (A), except that the Normal prior for (β_{z0}, β_{z1}, β_{z2}) has much smaller variances.

The prior for (β_{z0}, β_{z1}, β_{z2}) is thus much less diffuse in prior (B) than in prior (A).

Figure 1 shows the posterior distributions of four of the outcome-model coefficients. For model 2, the posterior 95% credible intervals cover the true parameters under both priors. For model 1, the posterior distributions of two of the coefficients differ greatly under the two priors and are far away from the true values under prior (A), which shows strong evidence of non-identifiability or weak identifiability of model 1.

Figure 1: Posterior distributions of the parameters in Section 5.1. The grey histograms are the results with prior (A); the white histograms are the results with prior (B). The vertical dashed lines are the true values of the parameters.

Similar to Section 5.1, we now describe two DGPs with different identifiability and evaluate the finite-sample performance of Bayesian inference under each DGP. We choose two models corresponding to two nested DGPs so that we can go beyond Section 5.1 and assess the performance of Bayesian inference with a mis-specified model.
We first specify the two DGPs. For DGP 3, W ∼ Bernoulli(0.5) and Z | W = w ∼ Bernoulli(α_w), with fixed values of (α_1, α_2). We then generate U = (S_1, S_0) from categorical distributions conditional on W, and Y from Bernoulli distributions conditional on Z and U, with the true values of the parameters given in Table 2(a). We name the model corresponding to DGP 3 as model 3. For model 3, Assumptions 1 and 3 hold. Because the stratum (S_1, S_0) = (0, 1) does not exist, monotonicity holds and thus the distribution of (S_1, S_0) given W is identifiable. From Theorem 7, the PCEs are identifiable.

For DGP 4, we generate W and Z in the same way as in DGP 3. We then generate U = (S_1, S_0) from categorical distributions conditional on W, and Y from Bernoulli distributions conditional on Z and U, with the true values of the parameters given in Table 2(b). We name the model corresponding to DGP 4 as model 4. For model 4, the stratum (S_1, S_0) = (0, 1) exists, so monotonicity does not hold. Without monotonicity, the distribution of (S_1, S_0) | W is not identifiable, and thus the PCEs are not identifiable.

Table 2: True values of the parameters under DGP 3 and DGP 4.

(a) DGP 3. The true PCEs are τ_{11} = 0.3, τ_{10} = 0.4, and τ_{00} = 0.5.

P(U = u | W = w)          u = (1,1)   u = (1,0)   u = (0,0)
  w = 1                      0.5         0.3         0.2
  w = 2                      0.2         0.3         0.5
P(Y = 1 | Z = z, U = u)   u = (1,1)   u = (1,0)   u = (0,0)
  z = 1                      0.8         0.7         0.6
  z = 0                      0.5         0.3         0.1

(b) DGP 4. The true PCEs are τ_{11} = 0.3, τ_{10} = 0.4, τ_{00} = 0.5, and τ_{01} = −0.3.

P(U = u | W = w)          u = (1,1)   u = (1,0)   u = (0,0)   u = (0,1)
  w = 1                      0.5         0.3         0.1         0.1
  w = 2                      0.1         0.3         0.5         0.1
P(Y = 1 | Z = z, U = u)   u = (1,1)   u = (1,0)   u = (0,0)   u = (0,1)
  z = 1                      0.8         0.7         0.6         0.2
  z = 0                      0.5         0.3         0.1         0.5

Use models 3 and 4 to analyze the data simulated from DGP 3.
Because model 4 is a generalization of model 3, both are correctly specified under DGP 3. However, the true value of τ_{01} in model 4 is not well-defined.

We choose two sample sizes, 1000 and 50000. For model 3, we choose the following priors: P(W = 1) ∼ Beta(1, 1), α_w ∼ Beta(1, 1), and (π_{(1,1),w}, π_{(1,0),w}, π_{(0,0),w}) ∼ Dirichlet(1, 1, 1) for w = 1, 2. We choose two different priors for the outcome probabilities δ_{zu} = P(Y = 1 | Z = z, U = u): one is the uniform prior Beta(1, 1) and the other is Beta(0.5, 0.5). For model 4, the prior for (π_{(1,1),w}, π_{(1,0),w}, π_{(0,0),w}, π_{(0,1),w}) is Dirichlet(1, 1, 1, 1), and the priors for the other parameters are the same as those for model 3.

Figure 2(a) shows the posterior distributions of τ^m_{11}, τ_{11} and τ_{01}, where τ^m_{11} is the PCE within the stratum (S_1, S_0) = (1, 1) under model 3, and τ_{11} and τ_{01} are the PCEs within the strata (S_1, S_0) = (1, 1) and (0, 1) under model 4, respectively. Comparing the two rows of plots in Figure 2(a), we can see that as the sample size increases, the posterior 95% credible intervals of τ^m_{11} become narrower and always cover the true value, regardless of the priors. For model 4, the posterior distributions of the PCEs change greatly across the priors, and the posterior 95% credible intervals do not shrink as those under model 3 do. When the sample size is 50000, the posterior distribution of τ_{11} is far away from the true value with the flat prior Beta(1, 1) and is not unimodal with the prior Beta(0.5, 0.5). In contrast, for model 3 the Beta(1, 1) and Beta(0.5, 0.5) priors result in only small discrepancies. The drastic differences across sample sizes and priors show strong evidence of non-identifiability or weak identifiability of model 4, which can yield misleading estimates and inferences.
Use models 3 and 4 to analyze data simulated from DGP 4.
The true model, model 4, is not identifiable, and model 3 is mis-specified. Figure 2(b) shows the results for τ^m_{11}, τ_{11} and τ_{01}. Although model 3 is not the true model, the results under this model are very stable under different priors, and the 95% credible intervals of τ^m_{11} cover the true value. This may be due to our choice of small values of π_{(0,1),1} and π_{(0,1),2}, which makes model 3 deviate only slightly from the true model. In contrast, the results under model 4 change greatly under different priors even when the sample size is large.

Figure 2: Posterior distributions of the PCEs in Section 5.2. Panel (a): data simulated from DGP 3 and analyzed by models 3 and 4; panel (b): data simulated from DGP 4 and analyzed by models 3 and 4. τ^m_{11} is the PCE within the stratum (S_1, S_0) = (1, 1) under model 3; τ_{11} and τ_{01} are the PCEs within the strata (S_1, S_0) = (1, 1) and (0, 1) under model 4. The grey histograms are the results with prior Beta(1, 1) for δ_{zu}; the white histograms are the results with prior Beta(0.5, 0.5) for δ_{zu}. The vertical dashed lines are the true values of the parameters.
The posterior distributions of the PCEs under model 4 are multimodal even with a very large sample size. Therefore, using an unidentifiable model may lead to undesirable results even if it is the true model.

Our simulation demonstrates that identification is important in Bayesian inference; otherwise, the results are extremely sensitive to the priors. More importantly, the simulation suggests that when the proposed model is not identifiable, using an identifiable model "close" to it may be a reasonable compromise.

The Job Search Intervention Study was a randomized field experiment investigating the efficacy of a job training intervention on unemployed workers (Vinokur et al., 1995; Vinokur and Schul, 1997; Tingley et al., 2014). The program was designed not only to increase reemployment among the unemployed but also to enhance the mental health of the job seekers. In the study, 600 unemployed workers were randomly assigned to the treatment group (Z = 1) and 299 were assigned to the control group (Z = 0). Those in the treatment group participated in workshops that covered skills for job search and coping with stress. Those in the control group received a booklet describing job-search tips. The intermediate variable S is a measure of job-search self-efficacy ranging from 1 to 5. It measures the participants' confidence in being able to successfully perform six essential job-search activities, such as completing a job application or resume, using their social network to discover promising job openings, and getting their point across in a job interview. The outcome Y is a measure of depressive symptoms based on the Hopkins Symptom Checklist. It measures how much they had been bothered or distressed in the last two weeks by various depression symptoms, such as feeling blue, having thoughts of ending one's life, and crying easily.

Let W be the previous occupation, which is a nominal variable with seven categories. We assume that (S_1, S_0) given W follows (7), where ρ(w) is the correlation coefficient of S_1 and S_0 given W = w. We further assume linear models for Y_1 and Y_0:

Y_z = β_{z0} + β_{z1} S_1 + β_{z2} S_0 + ε_{Yz},

where ε_{Y1} ∼ N(0, σ²_{Y1}), ε_{Y0} ∼ N(0, σ²_{Y0}), and (ε_{Y1}, ε_{Y0}) ⊥⊥ (S_1, S_0, W). We choose the linear model because of its simplicity for illustration, acknowledge its limitations, and leave the task of building more flexible models for Y_1 and Y_0 to future work. Under this model, the PCEs equal

τ_{s1 s0} = (β_{10} − β_{00}) + (β_{11} − β_{01}) s_1 + (β_{12} − β_{02}) s_0.

We assume ρ(w) = ρ and treat ρ as a sensitivity parameter within {0, 0.2, 0.4, 0.6, 0.8}. From Proposition 2, the PCEs are identifiable. We use a Bayesian approach and simulate the posterior distributions of the PCEs. To assess the sensitivity of our results to different priors, we choose two different priors. Denote β_1 = (β_{10}, β_{11}, β_{12}) and β_0 = (β_{00}, β_{01}, β_{02}). For the first prior, we choose multivariate Normal priors for β_z and µ_w: β_z ∼ N(0, Ω_z) and µ_w ∼ N(0, Ω_0), with Ω_z = 10 diag(1, 1, 1) and Ω_0 = 10 diag(1, 1) for z = 0, 1 and w = 1, ..., 7. We choose the following non-informative priors for the other parameters: f(σ²_{zw}) ∝ 1/σ²_{zw}, f(σ²_{Yz}) ∝ 1/σ²_{Yz}, {P(W = 1), ..., P(W = 7)} ∼ Dirichlet(1, ..., 1), and P(Z = 1 | W = w) ∼ Beta(1, 1) for z = 0, 1 and w = 1, ..., 7. For the second prior, we choose Ω_z = diag(1, 1, 1) and Ω_0 = diag(1, 1) and keep the other prior distributions unchanged. We present the results for the first prior in the main text and show the sensitivity of the results to different priors in Appendix C.2.

Figure 3: Posterior medians of the PCEs with ρ = 0.

Figure 3 shows the posterior medians of τ_{s1 s0} for all (s_1, s_0) under ρ = 0.
The surface of these posterior medians rises from its lowest point at principal stratum (5, 1) to its highest point as the difference between S_1 and S_0 decreases. That is, for people who gain more in job-search self-efficacy from the treatment, the treatment lowers the risk of depression to a larger extent. Imai et al. (2010) analyzed these data using a mediation analysis and found that the indirect effect of the treatment through job-search self-efficacy is negative, implying that program participation decreases depressive symptoms by increasing the level of job-search self-efficacy. Jo et al. (2011) used the principal stratification approach by dichotomizing job-search self-efficacy, and found that the treatment has a negative effect on depression for people whose job-search self-efficacy is improved by the treatment. Our conclusion corroborates their findings.

In our analysis, we assume that Assumption 3 holds conditional on the previous occupation. It is plausible because, conditional on the potential values of job-search self-efficacy, the previous occupation may not affect the depressive symptoms. In Appendix C.3, we also conduct an analysis of Assumption 3 by including more covariates.

To assess the sensitivity of the PCEs to ρ, we choose five principal strata, consisting of the maximum, minimum, and the 25%, 50%, and 75% quantiles of S_1 and S_0. Table 3 shows their posterior medians and 95% credible intervals. The point estimates are not sensitive to the values of ρ, and the interval estimates are not sensitive to small values of ρ. But as ρ grows larger, the intervals tend to become wider, which makes the results insignificant.

Table 3: Point and interval estimates of some PCEs using the Bayesian approach, for ρ ∈ {0, 0.2, 0.4, 0.6, 0.8}. The intervals excluding zero are highlighted in bold.

Two technical issues arise in our data analysis. First, Proposition 2 requires W to be continuous, but W is categorical in our application. In Appendix C.1, we give a formal justification of the identifiability of the PCEs in our model with a discrete W. Second, the Normality assumptions on the outcomes are invoked for convenience in the Bayesian computation. In fact, without Normality, we can use the method of moments to estimate the PCEs. The results from the method of moments are similar to those from the Bayesian inference; see Appendix C.2.

Identification of the PCEs is an important but challenging problem. Although several empirical studies have leveraged auxiliary variables to improve inference for the PCEs, formal identification results had not been established, especially for non-binary intermediate variables. Our results supplement previous empirical studies with theoretical justifications for identification. We give identification results for several models based on Normal distributions, which can be generalized to other commonly-used distributions. Appendix B.4 gives identification results for models based on t distributions, which are useful for robust analysis of data with heavy tails.

Researchers have conducted sensitivity analyses for principal ignorability and auxiliary independence using different kinds of models under various settings. For example, Ding and Lu (2017) proposed a sensitivity analysis for principal ignorability with a binary intermediate variable, and Jiang et al. (2016) proposed a sensitivity analysis for auxiliary independence using a random-effects model.
However, there is no general setup for the sensitivity analysis of these assumptions, which may depend on the specification of the model and the types of the outcome and the intermediate variable. We believe that sensitivity analysis should be routinely conducted in problems with principal stratification, but leave the development and the technical details to future research.

Auxiliary variables play different roles in identifying the PCEs, depending on the underlying assumptions. Under principal ignorability, auxiliary variables can be viewed as "confounders" between the principal stratification variable and the outcome. In contrast, under auxiliary independence, auxiliary variables can be treated as "instrumental variables" for the relationship between the principal stratification variable and the outcome. Therefore, the comparison between principal ignorability and auxiliary independence for identifying the PCEs resembles the comparison between the ignorability assumption (Rosenbaum and Rubin, 1983) and the instrumental variable method (Angrist et al., 1996) for identifying the average causal effect. The methods based on principal ignorability are easy to employ because the assumption generally conditions on all baseline variables. However, they bear similar disadvantages as the methods based on ignorability for estimating the average causal effect: we do not know whether we have conditioned on sufficient variables (Pearl, 2000, 2009). In contrast, the methods based on auxiliary independence may place a larger burden on analysts and content experts, because one needs to carve out a specific baseline variable as a designated auxiliary variable. However, the advantage is that we can intentionally target the variable based on scientific or expert knowledge, or by design. For example, this assumption can be used in a multi-center trial as in Example 4, and, based on Example 3, Follmann (2006) proposed the augmented design, which is useful for assessing the effect of vaccination.

Although we restrict the auxiliary variable W to be pretreatment in this paper, the auxiliary independence assumption allows it to be affected by the treatment. It only requires the auxiliary variable to be independent of the outcome conditional on the treatment and the principal strata, which can hold even if the auxiliary variable is posttreatment. For example, for a binary S, Mealli and Pacini (2013) identified the PCEs in completely randomized experiments using a secondary outcome as the auxiliary variable. In contrast, the principal ignorability assumption is unlikely to hold with a posttreatment auxiliary variable: the required independence would fail because of the bias induced by conditioning on a posttreatment variable.

Alternative identification strategies also exist that do not require an auxiliary variable. For a binary intermediate variable, without monotonicity or exclusion restrictions, Hirano et al. (2000) suggested using parallel outcome models to improve identifiability, where the regression coefficients of the covariates are the same for all types of non-compliers. Mealli et al. (2016) used concentration graph theory to study the identification of the PCEs. It is of interest to combine these strategies in theory and practice.

The identification issue of the PCEs is closely related to finite mixture models.
For example, with a binary intermediate variable, the observed data with $(Z = 1, S = 1)$ are a mixture of principal strata $(S_1 = 1, S_0 = 1)$ and $(S_1 = 1, S_0 = 0)$, and the observed data with $(Z = 1, S = 0)$ are a mixture of principal strata $(S_1 = 0, S_0 = 0)$ and $(S_1 = 0, S_0 = 1)$. From this perspective, principal ignorability and auxiliary independence help to separate the components of the finite mixture model. Researchers sometimes use parametric finite mixture models for principal stratification problems (Zhang et al., 2009; Frumento et al., 2012). However, even if those models are parametrically identifiable, the estimators often have poor finite-sample properties (Frumento et al., 2016; Feller et al., 2019). These findings echo the caveat from Cox and Donnelly (2011, page 96): "If an issue can be addressed nonparametrically then it will often be better to tackle it parametrically; however, if it cannot be resolved nonparametrically then it is usually dangerous to resolve it parametrically." This is an important motivation for us to seek nonparametric and semiparametric identifiability, as presented in this paper.

References
Abramowitz, M., I. A. Stegun, et al. (1966). Handbook of mathematical functions. Applied Mathematics Series 55, 62.

Angrist, J. D., G. W. Imbens, and D. B. Rubin (1996). Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association 91, 444-455.

Bartolucci, F. and L. Grilli (2011). Modeling partial compliance through copulas in a principal stratification framework. Journal of the American Statistical Association 106, 469-479.

Bickel, P. J. and K. A. Doksum (2015). Mathematical Statistics: Basic Ideas and Selected Topics, Volume I. Boca Raton, FL: CRC Press.

Cheng, J. and D. S. Small (2006). Bounds on causal effects in three-arm trials with non-compliance. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 815-836.

Cochran, W. G. (1957). Analysis of covariance: its nature and uses. Biometrics 13, 261-281.

Conlon, A., J. Taylor, and M. Elliott (2017). Surrogacy assessment using principal stratification and a Gaussian copula model. Statistical Methods in Medical Research 26, 88-107.

Cox, D. R. and C. A. Donnelly (2011). Principles of Applied Statistics. Cambridge: Cambridge University Press.

Daniels, M. J., J. A. Roy, C. Kim, J. W. Hogan, and M. G. Perri (2012). Bayesian inference for the causal effect of mediation. Biometrics 68, 1028-1036.

Ding, P. (2016). On the conditional distribution of the multivariate t distribution. The American Statistician 70, 293-295.

Ding, P., Z. Geng, W. Yan, and X. H. Zhou (2011). Identifiability and estimation of causal effects by principal stratification with outcomes truncated by death. Journal of the American Statistical Association 106, 1578-1591.

Ding, P. and F. Li (2018). Causal inference: A missing data perspective. Statistical Science 33, 214-237.

Ding, P. and J. Lu (2017). Principal stratification analysis using principal scores. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 757-777.

Efron, B. and D. Feldman (1991). Compliance as an explanatory variable in clinical trials (with discussion). Journal of the American Statistical Association 86, 9-17.

Elliott, M. R., T. E. Raghunathan, and Y. Li (2010). Bayesian inference for causal mediation effects using principal stratification with dichotomous mediators and outcomes. Biostatistics 11, 353-372.

Everitt, B. S. and D. J. Hand (1981). Finite Mixture Distributions. New York: Chapman and Hall.

Feller, A., E. Greif, N. Ho, L. Miratrix, and N. Pillai (2019). Weak separation in mixture models and implications for principal stratification. arXiv preprint arXiv:1602.06595.

Follmann, D. (2006). Augmented designs to assess immune response in vaccine trials. Biometrics 62, 1161-1169.

Follmann, D. A. (2000). On the effect of treatment among would-be treatment compliers: An analysis of the multiple risk factor intervention trial. Journal of the American Statistical Association 95, 1101-1109.

Forastiere, L., A. Mattei, and P. Ding (2018). Principal ignorability in mediation analysis: through and beyond sequential ignorability. Biometrika 105, 979-986.

Frangakis, C. E. and D. B. Rubin (2002). Principal stratification in causal inference. Biometrics 58, 21-29.

Frumento, P., F. Mealli, B. Pacini, and D. B. Rubin (2012). Evaluating the effect of training on wages in the presence of noncompliance, nonemployment, and missing outcome data. Journal of the American Statistical Association 107, 450-466.

Frumento, P., F. Mealli, B. Pacini, and D. B. Rubin (2016). The fragility of standard inferential approaches in principal stratification models relative to direct likelihood approaches. Statistical Analysis and Data Mining 9, 58-70.

Gabriel, E. E. and D. Follmann (2016). Augmented trial designs for evaluation of principal surrogates. Biostatistics 17, 453-467.

Gallop, R., D. S. Small, J. Y. Lin, M. R. Elliott, M. Joffe, and T. R. Ten Have (2009). Mediation analysis with principal stratification. Statistics in Medicine 28, 1108-1130.

Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2014). Bayesian Data Analysis (3rd ed.). London: Chapman and Hall/CRC.

Gilbert, P. B. and M. G. Hudgens (2008). Evaluating candidate principal surrogate endpoints. Biometrics 64, 1146-1154.

Gustafson, P. (2009). What are the limits of posterior distributions arising from nonidentified models, and why should we care? Journal of the American Statistical Association 104, 1682-1695.

Gustafson, P. (2015). Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data. Chapman and Hall/CRC.

Hirano, K., G. W. Imbens, D. B. Rubin, and X.-H. Zhou (2000). Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics 1, 69-88.

Hu, Y. and J.-L. Shiu (2017). Nonparametric identification using instrumental variables: sufficient conditions for completeness. Econometric Theory 34, 1-35.

Huang, Y. and P. B. Gilbert (2011). Comparing biomarkers as principal surrogate endpoints. Biometrics 67, 1442-1451.

Hudgens, M. G. and P. B. Gilbert (2009). Assessing vaccine effects in repeated low-dose challenge experiments. Biometrics 65, 1223-1232.

Imai, K. (2008). Sharp bounds on the causal effects in randomized experiments with "truncation-by-death". Statistics and Probability Letters 78, 144-149.

Imai, K., L. Keele, and D. Tingley (2010). A general approach to causal mediation analysis. Psychological Methods 15, 309-334.

Imbens, G. W. and D. B. Rubin (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge: Cambridge University Press.

Jiang, Z., P. Ding, and Z. Geng (2016). Principal causal effect identification and surrogate end point evaluation by multiple trials. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78, 829-848.

Jin, H. and D. B. Rubin (2008). Principal stratification for causal inference with extended partial compliance. Journal of the American Statistical Association 103, 101-111.

Jo, B. (2008). Causal inference in randomized experiments with mediational processes. Psychological Methods 13, 314-336.

Jo, B. and E. A. Stuart (2009). On the use of propensity scores in principal causal effect estimation. Statistics in Medicine 28, 2857-2875.

Jo, B., E. A. Stuart, D. P. MacKinnon, and A. D. Vinokur (2011). The use of propensity scores in mediation analysis. Multivariate Behavioral Research 46, 425-452.

Joffe, M. M., D. Small, C.-Y. Hsu, et al. (2007). Defining and estimating intervention effects for groups that will develop an auxiliary outcome. Statistical Science 22, 74-97.

Kim, C., L. R. Henneman, C. Choirat, and C. M. Zigler (2020). Health effects of power plant emissions through ambient air quality. Journal of the Royal Statistical Society: Series A (Statistics in Society).

Lange, K. L., R. J. Little, and J. M. Taylor (1989). Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84, 881-896.

Lehmann, E. L. and J. P. Romano (2006). Testing Statistical Hypotheses (3rd ed.). New York: Springer.

Little, R. J. and L. H. Yau (1998). Statistical techniques for analyzing data from prevention trials: Treatment of no-shows using Rubin's causal model. Psychological Methods 3, 147.

Liu, C. (2004). Robit regression: a simple robust alternative to logistic and probit regression, pp. 227-238. In Applied Bayesian Modeling and Causal Inference From Incomplete-Data Perspectives (A. Gelman and X. L. Meng, eds.), New York: Wiley.

Mattei, A. and F. Mealli (2011). Augmented designs to assess principal strata direct effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, 729-752.

Mealli, F. and B. Pacini (2013). Using secondary outcomes to sharpen inference in randomized experiments with noncompliance. Journal of the American Statistical Association 108, 1120-1131.

Mealli, F., B. Pacini, and E. Stanghellini (2016). Identification of principal causal effects using additional outcomes in concentration graphs. Journal of Educational and Behavioral Statistics 41, 463-480.

Nelsen, R. B. (2007). An Introduction to Copulas (2nd ed.). New York: Springer.

Pearl, J. (2000). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge: Cambridge University Press.

Pearl, J. (2009). Letter to the editor: Remarks on the method of propensity score. Statistics in Medicine 28, 1415-1416.

Qin, L., P. B. Gilbert, D. Follmann, and D. Li (2008). Assessing surrogate endpoints in vaccine trials with case-cohort sampling and the Cox model. The Annals of Applied Statistics 2, 386-407.

Rosenbaum, P. R. (1984). The consequences of adjustment for a concomitant variable that has been affected by the treatment. Journal of the Royal Statistical Society: Series A 147, 656-666.

Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41-55.

Roy, J., J. W. Hogan, and B. H. Marcus (2008). Principal stratification with predictors of compliance for randomized trials with 2 active treatments. Biostatistics 9, 277-289.

Rubin, D. B. (2004). Direct and indirect causal effects via potential outcomes (with discussion and reply). Scandinavian Journal of Statistics 31, 161-170.

Rubin, D. B. (2006). Causal inference through potential outcomes and principal stratification: application to studies with "censoring" due to death (with discussion). Statistical Science 21, 299-309.

Schwartz, S. L., F. Li, and F. Mealli (2011). A Bayesian semiparametric approach to intermediate variables in causal inference. Journal of the American Statistical Association 106, 1331-1344.

Sommer, A. and S. L. Zeger (1991). On estimating efficacy from clinical trials. Statistics in Medicine 10, 45-52.

Stuart, E. A. and B. Jo (2015). Assessing the sensitivity of methods for estimating principal causal effects. Statistical Methods in Medical Research 24, 657-674.

Tingley, D., T. Yamamoto, K. Hirose, L. Keele, and K. Imai (2014). Mediation: R package for causal mediation analysis. Journal of Statistical Software 59, 1-38.

VanderWeele, T. J. (2008). Simple relations between principal stratification and direct and indirect effects. Statistics and Probability Letters 78, 2957-2962.

Vinokur, A. D., R. H. Price, and Y. Schul (1995). Impact of the JOBS intervention on unemployed workers varying in risk for depression. American Journal of Community Psychology 23, 39-74.

Vinokur, A. D. and Y. Schul (1997). Mastery and inoculation against setbacks as active ingredients in the JOBS intervention for the unemployed. Journal of Consulting and Clinical Psychology 65, 867.

Wang, L., X.-H. Zhou, and T. S. Richardson (2017). Identification and estimation of causal effects with outcomes truncated by death. Biometrika 104, 597-612.

Yang, F. and P. Ding (2018). Using survival information in truncation by death problems without the monotonicity assumption. Biometrics 74, 1232-1239.

Yuan, L.-H., A. Feller, L. W. Miratrix, et al. (2019). Identifying and estimating principal causal effects in a multi-site trial of early college high schools. The Annals of Applied Statistics 13, 1348-1369.

Zellner, A. (1976). Bayesian and non-Bayesian analysis of the regression model with multivariate Student-t error terms. Journal of the American Statistical Association 71, 400-405.

Zhang, J. L. and D. B. Rubin (2003). Estimation of causal effects via principal stratification when some outcomes are truncated by "death". Journal of Educational and Behavioral Statistics 28, 353-368.

Zhang, J. L., D. B. Rubin, and F. Mealli (2009). Likelihood-based analysis of causal effects of job-training programs using principal stratification. Journal of the American Statistical Association 104, 166-176.

Zigler, C. M. and T. R. Belin (2012). A Bayesian approach to improved estimation of causal effect predictiveness for a principal surrogate endpoint. Biometrics 68, 922-932.

Supplementary Materials
Appendix A provides the proofs of the theorems. Appendix B provides the proofs of the corollaries and propositions, and presents additional results for models related to $t$ distributions. Let $t_p(\mu, \Sigma, \nu)$ denote the $p$-dimensional $t$ distribution with median $\mu$, scale matrix $\Sigma$, and degrees of freedom $\nu$, and let $T_\nu(\cdot)$ denote the cumulative distribution function of the standard $t$ distribution with degrees of freedom $\nu$. Appendix C provides more details for the data analysis.

Appendix A: Proofs of the theorems
To prove the theorems, we need the following lemma from importance sampling.
Lemma 2.
Let $f_X(x)$ and $f_Y(y)$ be the density functions of $X$ and $Y$. For any function $g(\cdot)$,
\[
E\{g(X)\} = E\left\{ \frac{f_X(Y)}{f_Y(Y)}\, g(Y) \right\},
\]
provided the moments exist.

Proof of Theorem 1.
From the law of total expectation,
\[
\begin{aligned}
E\left\{ \frac{\pi_{s_1}(W)}{\pi_{s_1}} \cdot \frac{(1-Z)Y}{1-e(W)} \right\}
&= E\left[ E\left\{ \frac{\pi_{s_1}(W)}{\pi_{s_1}} \cdot \frac{(1-Z)Y}{1-e(W)} \,\middle|\, W \right\} \right] \\
&= E\left[ \frac{\pi_{s_1}(W)}{\pi_{s_1}\{1-e(W)\}}\, E\{(1-Z)Y(0) \mid W\} \right] \\
&= E\left[ \frac{\pi_{s_1}(W)}{\pi_{s_1}\{1-e(W)\}}\, E\{(1-Z) \mid W\}\, E\{Y(0) \mid W\} \right] \qquad \text{(Assumption 1)} \\
&= E\left[ \frac{\pi_{s_1}(W)}{\pi_{s_1}}\, E\{Y(0) \mid W\} \right]. \qquad (12)
\end{aligned}
\]
On the other hand, from the law of total expectation,
\[
E\{Y(0) \mid S_1 = s_1\} = E\big[ E\{Y(0) \mid W, S_1 = s_1\} \mid S_1 = s_1 \big] = E\big[ E\{Y(0) \mid W\} \mid S_1 = s_1 \big], \qquad (13)
\]
where the last equality follows from Assumption 2. The expectation in (12) is with respect to the distribution of $W$, and the outer expectation in (13) is with respect to the distribution of $W \mid S_1 = s_1$. Therefore, from Lemma 2,
\[
E\big[ E\{Y(0) \mid W\} \mid S_1 = s_1 \big]
= E\left\{ \frac{P(W \mid S_1 = s_1)}{P(W)}\, E\{Y(0) \mid W\} \right\}
= E\left\{ \frac{\pi_{s_1}(W)}{\pi_{s_1}}\, E\{Y(0) \mid W\} \right\}.
\]
As a result,
\[
E\left\{ \frac{\pi_{s_1}(W)}{\pi_{s_1}} \cdot \frac{(1-Z)Y}{1-e(W)} \right\} = E\{Y(0) \mid S_1 = s_1\}. \qquad \Box
\]
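To make the weighting identity concrete, here is a small Monte Carlo sketch (my own illustration, not from the paper): the auxiliary variable, treatment, principal stratum, and control potential outcome are simulated so that Assumptions 1 and 2 hold by construction, and the principal score and propensity score are known; all numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Binary auxiliary variable W; treatment assignment depends only on W (Assumption 1 given W).
W = rng.binomial(1, 0.5, n)
e_W = 0.3 + 0.4 * W                       # propensity score e(W) = P(Z = 1 | W)
Z = rng.binomial(1, e_W)

# Binary principal stratum S1 with principal score pi_1(W) = P(S1 = 1 | W).
pi1_W = 0.2 + 0.6 * W
S1 = rng.binomial(1, pi1_W)

# Control potential outcome depends on W only (principal ignorability given W).
Y0 = 1.0 + 2.0 * W + rng.normal(0, 1, n)
Y = (1 - Z) * Y0                          # only control-arm outcomes enter the estimator

# Weighting estimator for E{Y(0) | S1 = 1}.
pi1 = pi1_W.mean()                        # pi_1 = P(S1 = 1)
weights = pi1_W / pi1 * (1 - Z) / (1 - e_W)
est = np.mean(weights * Y)

# Benchmark computed directly from the simulated potential outcomes and strata.
truth = Y0[S1 == 1].mean()
print(f"weighting estimate: {est:.3f}, benchmark: {truth:.3f}")
```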
Proof of Theorem 2. Because $\mathrm{rank}(M^{\top} M) = K$, we can find an invertible $K \times K$ sub-matrix of $M$, denoted by $H$. Without loss of generality, we use $w_1, \ldots, w_K$ to denote the corresponding values of $W$ in $H$. First, for any $s_1$, $P\{Y(1) = y \mid S_1 = s_1\} = P(Y = y \mid Z = 1, S_1 = s_1)$, which can be identified from the observed data. Assumption 3 implies
\[
\begin{aligned}
P(Y = y \mid Z = 0, W = w_k)
&= \sum_{s_1} P(Y = y \mid Z = 0, S_1 = s_1, W = w_k)\, P(S_1 = s_1 \mid Z = 0, W = w_k) \\
&= \sum_{s_1} P\{Y(0) = y \mid S_1 = s_1\}\, P(S_1 = s_1 \mid Z = 0, W = w_k), \qquad (14)
\end{aligned}
\]
for $k = 1, \ldots, K$, where $P(Y = y \mid Z = 0, W = w_k)$ and $P(S_1 = s_1 \mid Z = 0, W = w_k)$ can be identified from the observed data. Because $H$ is invertible, we can solve (14) to obtain $P\{Y(0) = y \mid S_1 = s_1\}$ for all $s_1$. Therefore, the PCEs are identifiable. $\Box$
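As a toy numerical illustration of how the invertible sub-matrix $H$ is used (my own sketch; all probabilities below are made up), equation (14) is a $K \times K$ linear system in the unknown stratum-specific outcome probabilities, which can be solved directly once its coefficients are identified.

```python
import numpy as np

# Rows: auxiliary values w1, w2, w3; columns: principal strata s1 = 1, 2, 3.
# H[k, s] = P(S1 = s | Z = 0, W = w_k), identified from the observed data.
H = np.array([
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.3, 0.6],
])

# Unknown stratum-specific probabilities P(Y(0) = y | S1 = s) we want to recover.
p_true = np.array([0.8, 0.5, 0.2])

# Left-hand side of (14): P(Y = y | Z = 0, W = w_k), identified from the control arm.
lhs = H @ p_true

# Because H is invertible, (14) pins down the stratum-specific probabilities.
p_recovered = np.linalg.solve(H, lhs)
print(p_recovered)   # [0.8, 0.5, 0.2]
```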
Proof of Theorem 3. For a fixed $y$, suppose there exist two functions of $S_1$, $f_1(Y = y \mid S_1)$ and $f_2(Y = y \mid S_1)$, satisfying
\[
E\{f_1(Y = y \mid S_1) \mid W\} = E\{f_2(Y = y \mid S_1) \mid W\}.
\]
Then $E\{g(S_1) \mid W\} = 0$, where $g(S_1) = f_1(Y = y \mid S_1) - f_2(Y = y \mid S_1)$. From the definition of completeness, $g(S_1) = 0$, and hence $P(Y = y \mid S_1)$ is identifiable. Therefore, the PCEs are identifiable. $\Box$

Proof of Theorem 4.
From the completeness of the exponential family (shown in Result S1 in the next section), $\mathcal{P}_W$ is complete. Then from Theorem 3, the PCEs are identifiable. $\Box$

Proof of Theorem 5.
From Lemma 1, $\mathcal{P}_W$ is complete. Then from Theorem 3, the PCEs are identifiable. $\Box$

Proof of Theorem 6.
Similar to the proof of Theorem 1, we can show that
\[
E\left\{ \frac{\pi_{s_1,s_0}(W)}{\pi_{s_1,s_0}} \cdot \frac{ZY}{e(W)} \right\} = E\{Y(1) \mid S_1 = s_1, S_0 = s_0\},
\qquad
E\left\{ \frac{\pi_{s_1,s_0}(W)}{\pi_{s_1,s_0}} \cdot \frac{(1-Z)Y}{1-e(W)} \right\} = E\{Y(0) \mid S_1 = s_1, S_0 = s_0\}.
\]
Therefore, the PCEs are identifiable. $\Box$

Proof of Theorem 7.
Because $\mathrm{rank}(M_{s_0}^{\top} M_{s_0}) = K$, we can find an invertible $K \times K$ sub-matrix of $M_{s_0}$, denoted by $H_{s_0}$, with the corresponding values of $W$ denoted by $w_1, \ldots, w_K$. For any fixed $s_0$, Assumption 3 implies
\[
\begin{aligned}
P(Y = y \mid Z = 0, S_0 = s_0, W = w_j)
&= \sum_{s_1} P(Y = y \mid Z = 0, S_1 = s_1, S_0 = s_0, W = w_j)\, P(S_1 = s_1 \mid S_0 = s_0, W = w_j) \\
&= \sum_{s_1} P\{Y(0) = y \mid S_1 = s_1, S_0 = s_0\}\, P(S_1 = s_1 \mid S_0 = s_0, W = w_j), \qquad (15)
\end{aligned}
\]
for $j = 1, \ldots, K$, where $P(Y = y \mid Z = 0, S_0 = s_0, W = w_j)$ and $P(S_1 = s_1 \mid S_0 = s_0, W = w_j)$ can be identified from the observed data. Because $H_{s_0}$ is invertible, we can solve (15) to obtain $P\{Y(0) = y \mid S_1 = s_1, S_0 = s_0\}$ for all $s_1$. Similarly, we can identify $P\{Y(1) = y \mid S_1 = s_1, S_0 = s_0\}$ for all $s_0$ under $\mathrm{rank}(M_{s_1}^{\top} M_{s_1}) = K$. Therefore, the PCEs are identifiable under conditions (a) and (b). $\Box$

Proof of Theorem 8.
For any fixed $y$ and $s_0$, suppose there exist two functions of $S_1$, $f_1(Y = y \mid S_1, S_0 = s_0)$ and $f_2(Y = y \mid S_1, S_0 = s_0)$, satisfying
\[
E\{f_1(Y = y \mid S_1, S_0 = s_0) \mid W\} = E\{f_2(Y = y \mid S_1, S_0 = s_0) \mid W\}.
\]
Then $E\{g(S_1) \mid W\} = 0$, where $g(S_1) = f_1(Y = y \mid S_1, S_0 = s_0) - f_2(Y = y \mid S_1, S_0 = s_0)$. Because $\mathcal{P}_{W,s_0}$ is complete, $g(S_1) = 0$. Therefore, the distribution $P(Y = y \mid S_1, S_0 = s_0)$ is identifiable. Similarly, the distribution $P(Y = y \mid S_1 = s_1, S_0)$ is identifiable for any fixed $y$ and $s_1$ if $\mathcal{P}_{W,s_1}$ is complete. Therefore, the PCEs are identifiable under conditions (a) and (b). $\Box$

Appendix B: Proofs of the corollaries and propositions
Appendix B.1: Proofs of the completeness of some distribution families
In this section, we show the completeness of the exponential family and of three distribution families based on the Normal, $t$, and Logistic distributions, respectively.

Result S1. (Lehmann and Romano, 2006, Theorem 4.3.1) Let $X$ be a random vector with probability distribution
\[
\mathrm{d}P_{\theta}(x) = C(\theta) \exp\left\{ \sum_{j=1}^{J} \theta_j t_j(x) \right\} \mathrm{d}\mu(x),
\]
and let $\mathcal{P}_{\omega}$ be the family of distributions of $t = (t_1(X), \ldots, t_J(X))$ as $\theta = (\theta_1, \ldots, \theta_J)$ varies over the set $\omega$. Then $\mathcal{P}_{\omega}$ is complete provided $\omega$ contains a $J$-dimensional rectangle in $\mathbb{R}^J$.

Result S2.
Suppose $S_1 = h(W) + \sigma(W)\varepsilon$ with $\varepsilon \perp W$, and $h(w)$ and $\sigma(w)$ continuously differentiable. If $\varepsilon \sim N(0, 1)$, then $\mathcal{P}_W$ is complete.

Proof of Result S2.
We verify the conditions in Lemma 1. Because $\phi(t) = \exp(-t^2/2)$, conditions (a) and (b) hold. Due to the identifiability of Normal mixture models (Everitt and Hand, 1981), condition (c) holds. Thus, from Lemma 1, $\mathcal{P}_W$ is complete. $\Box$

Result S3.
Suppose $S_1 = h(W) + \sigma(W)\varepsilon$ with $\varepsilon \perp W$, and $h(w)$ and $\sigma(w)$ continuously differentiable. If $\varepsilon \sim t(0, 1, \nu)$, then $\mathcal{P}_W$ is complete.

Proof of Result S3.
We verify the conditions in Lemma 1. The characteristic function of $\varepsilon$ is
\[
\phi(t) = \frac{K_{\nu/2}(\sqrt{\nu}\,|t|)\,(\sqrt{\nu}\,|t|)^{\nu/2}}{\Gamma(\nu/2)\,2^{\nu/2-1}},
\]
where $\Gamma(\cdot)$ is the Gamma function and $K_{\nu/2}(\cdot)$ is the modified Bessel function of the second kind. Abramowitz et al. (1966) ensures that
\[
\lim_{t \to \infty} \frac{K_{\nu/2}(\sqrt{\nu}\,|t|)}{\sqrt{\pi/(2\sqrt{\nu}\,|t|)}\, \exp(-\sqrt{\nu}\,|t|)} = 1.
\]
Therefore, the dominating term in $\phi(t)$ is $\exp(-\sqrt{\nu}\,|t|)$, and hence condition (a) holds. It is easy to verify condition (b). Due to the identifiability of $t$ mixture models (Everitt and Hand, 1981), condition (c) holds. Thus, from Lemma 1, $\mathcal{P}_W$ is complete. $\Box$
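As a quick numerical sanity check of the Bessel-function asymptotic invoked above (my own illustration; the value of $\nu$ is arbitrary), the ratio of $K_{\nu/2}(\sqrt{\nu}|t|)$ to $\sqrt{\pi/(2\sqrt{\nu}|t|)}\exp(-\sqrt{\nu}|t|)$ should approach one as $t$ grows.

```python
import numpy as np
from scipy.special import kv   # modified Bessel function of the second kind

nu = 5.0                       # arbitrary degrees of freedom for the illustration
for t in [1.0, 5.0, 20.0, 100.0]:
    x = np.sqrt(nu) * t
    exact = kv(nu / 2, x)
    asymptotic = np.sqrt(np.pi / (2 * x)) * np.exp(-x)
    print(f"t = {t:7.1f}   ratio = {exact / asymptotic:.6f}")
```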
Result S4. Suppose $S_1 = h(W) + \sigma(W)\varepsilon$ with $\varepsilon \perp W$, and $h(w)$ and $\sigma(w)$ continuously differentiable. If $\varepsilon$ follows a standard Logistic distribution with density $\exp(-\varepsilon)/\{1 + \exp(-\varepsilon)\}^2$, then $\mathcal{P}_W$ is complete.

Proof of Result S4. We verify the conditions in Lemma 1. The characteristic function of $\varepsilon$ is $\phi(t) = \pi t / \sinh(\pi t)$. Because the dominating term in $\phi(t)$ is $\exp(-\pi |t|)$, condition (a) holds. It is also easy to show that condition (b) holds. From the identifiability of Logistic mixture models (Everitt and Hand, 1981), condition (c) holds. Thus, from Lemma 1, $\mathcal{P}_W$ is complete. $\Box$

Appendix B.2: Proofs of the corollaries
Proof of Corollary 1.
Suppose that the distribution of $(S_1, W)$ is bivariate Normal with means $(\mu_{S_1}, \mu_W)$, variances $(\sigma_{S_1}^2, \sigma_W^2)$ and correlation coefficient $\rho$. From
\[
P(S_1 = s_1, W = w) = P(S_1 = s_1 \mid W = w, Z = 1)\, P(W = w),
\]
we can identify the joint distribution of $(S_1, W)$:
\[
\begin{aligned}
P(S_1 = s_1 \mid W = w)
&= \frac{1}{\sqrt{2\pi \sigma^2(w)}} \exp\left[ -\frac{\{s_1 - \mu(w)\}^2}{2\sigma^2(w)} \right] \\
&= \frac{1}{\sqrt{2\pi \sigma^2(w)}} \exp\left\{ -\frac{\mu^2(w)}{2\sigma^2(w)} \right\} \exp\left\{ -\frac{s_1^2}{2\sigma^2(w)} \right\} \exp\left\{ \frac{s_1\, \mu(w)}{\sigma^2(w)} \right\},
\end{aligned}
\]
where $\mu(w) = \mu_{S_1} + \rho\, \sigma_{S_1}/\sigma_W\, (w - \mu_W)$ and $\sigma^2(w) = \sigma_{S_1}^2 (1 - \rho^2)$. Thus, we have $T(x) = x$ and $\eta(w) = \mu(w)/\sigma^2(w)$. From Theorem 4, the PCEs are identifiable. $\Box$

Proof of Corollary 2.
First, we can identify $\{\mu_1(w), \mu_0(w), \sigma_1(w), \sigma_0(w)\}$ from the observed data. From the bivariate Normal assumption, the conditional distribution of $S_1 \mid (S_0 = s_0, W = w)$ is Normal with mean $\mu(s_0, w) = \mu_1(w) + \rho(w)\,\sigma_1(w)/\sigma_0(w)\,\{s_0 - \mu_0(w)\}$ and variance $\sigma^2(w) = \{1 - \rho^2(w)\}\,\sigma_1^2(w)$, i.e.,
\[
S_1 = \mu(S_0, W) + \sigma(W)\,\varepsilon,
\]
where $\varepsilon \perp W$ and $\varepsilon \sim N(0, 1)$. From Result S2, $\mathcal{P}_{W,s_0}$ is complete. Similarly, $\mathcal{P}_{W,s_1}$ is also complete. Therefore, from Theorem 8, the PCEs are identifiable. $\Box$

Appendix B.3: Proofs of the propositions
Proof of Proposition 1.
First, because we observe $S_1$ when $Z = 1$, we can identify $P\{Y(1) \mid S_1\}$ and $g(W) = E(S_1 \mid W)$. Second, the linear models of $S_1$ and $Y(0)$ imply
\[
Y(0) = \beta_0 + \alpha\, g(W) + \sum_{j=1}^{J} \beta_j f_j(W) + \alpha\, \sigma_S(W)\,\varepsilon_S + \sigma_Y(W)\,\varepsilon_Y.
\]
Because $(\varepsilon_S, \varepsilon_Y) \perp W$ and $\{1, g(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, the parameters are identifiable from the classic result of linear models. Therefore, we can identify $P\{Y(0) \mid S_1\}$ and hence the PCEs. $\Box$
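The two-step identification argument can be imitated in a small simulation (my own sketch, not the authors' code; the functions $g$ and $f_1$ and all parameter values are made up): estimate $g(W) = E(S_1 \mid W)$ from the treated arm, then regress the control-arm outcomes on the estimated $g(W)$ and $f_1(W)$ to recover $\alpha$ and the $\beta$'s.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

W = rng.uniform(0, 2, n)
Z = rng.binomial(1, 0.5, n)

g = lambda w: 1.0 + np.sin(w)            # g(W) = E(S1 | W); treated as unknown and re-estimated
f1 = lambda w: w                          # known covariate function f_1(W)

S1 = g(W) + rng.normal(0, 0.5, n)         # intermediate variable; observed only when Z = 1
beta0, alpha, beta1 = 0.5, 1.5, -0.8
Y0 = beta0 + alpha * S1 + beta1 * f1(W) + rng.normal(0, 1, n)   # control potential outcome

# Step 1: estimate g(W) from the treated arm with a crude binned mean.
bins = np.linspace(0, 2, 41)
idx = np.digitize(W, bins)
g_hat = np.zeros(n)
for b in np.unique(idx):
    g_hat[idx == b] = S1[(idx == b) & (Z == 1)].mean()

# Step 2: regress control-arm outcomes on {1, g_hat(W), f1(W)}.
ctrl = Z == 0
X = np.column_stack([np.ones(ctrl.sum()), g_hat[ctrl], f1(W[ctrl])])
coef, *_ = np.linalg.lstsq(X, Y0[ctrl], rcond=None)
print("estimated (beta0, alpha, beta1):", np.round(coef, 3))    # close to (0.5, 1.5, -0.8)
```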
Proof of Proposition 2. First, because we observe $S_1$ when $Z = 1$, we can identify $P\{Y(1) \mid S_1\}$. Second, the distribution of $(S_1, W)$ is identifiable, and so are $g(w)$ and $\sigma^2$. Third,
\[
\begin{aligned}
P(Y = 1 \mid Z = 0, W = w)
&= \int P(Y = 1 \mid Z = 0, S_1 = s, W = w)\, P(S_1 = s \mid W = w)\, \mathrm{d}s \\
&= \int P\{Y(0) = 1 \mid S_1 = s, W = w\}\, P(S_1 = s \mid W = w)\, \mathrm{d}s \\
&= \int \Phi\left\{ \beta_0 + \alpha s + \sum_{j=1}^{J} \beta_j f_j(w) \right\} P(S_1 = s \mid W = w)\, \mathrm{d}s \\
&= \Phi\left\{ \frac{\beta_0 + \alpha g(w) + \sum_{j=1}^{J} \beta_j f_j(w)}{\sqrt{1 + \alpha^2\sigma^2}} \right\}.
\end{aligned}
\]
Because $\{1, g(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, from the identifiability of the Probit model, we can identify
\[
\frac{\alpha}{\sqrt{1 + \alpha^2\sigma^2}},\ \frac{\beta_0}{\sqrt{1 + \alpha^2\sigma^2}},\ \ldots,\ \frac{\beta_J}{\sqrt{1 + \alpha^2\sigma^2}}.
\]
Because $\sigma^2$ is identifiable, we can identify $\{\alpha, \beta_0, \ldots, \beta_J\}$ and hence the PCEs. $\Box$
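The key step in this proof is the Normal-Probit integral identity $\int \Phi(a + \alpha s)\, \mathrm{d}N(s; m, \sigma^2) = \Phi\{(a + \alpha m)/\sqrt{1 + \alpha^2\sigma^2}\}$; the following short Monte Carlo check (my own illustration with arbitrary constants) verifies it numerically.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
a, alpha, m, sigma = -0.3, 1.2, 0.7, 0.9     # arbitrary constants for the check

s = rng.normal(m, sigma, 2_000_000)           # S1 | W = w ~ N(g(w), sigma^2)
mc = norm.cdf(a + alpha * s).mean()           # Monte Carlo value of the integral
closed_form = norm.cdf((a + alpha * m) / np.sqrt(1 + alpha**2 * sigma**2))
print(f"Monte Carlo: {mc:.4f}   closed form: {closed_form:.4f}")
```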
Proof of Proposition 3. First, we can identify the joint distribution of $(S_1, S_0)$ given $W$:
\[
\begin{aligned}
P(S_1 = 1, S_0 = 1 \mid W = w) &= P(S_1 = 1, S_0 = 1 \mid Z = 0, W = w) = P(S = 1 \mid Z = 0, W = w), \\
P(S_1 = 0, S_0 = 0 \mid W = w) &= P(S_1 = 0, S_0 = 0 \mid Z = 1, W = w) = P(S = 0 \mid Z = 1, W = w), \\
P(S_1 = 1, S_0 = 0 \mid W = w) &= 1 - P(S_1 = 1, S_0 = 1 \mid W = w) - P(S_1 = 0, S_0 = 0 \mid W = w).
\end{aligned}
\]
Then
\[
P(S_0 = 1 \mid S_1 = 1, W = w) = \frac{P(S = 1 \mid Z = 0, W = w)}{P(S = 1 \mid Z = 1, W = w)}, \qquad
P(S_0 = 1 \mid S_1 = 0, W = w) = 0.
\]
Because the subpopulation with $(Z = 1, S = 1)$ is a mixture of the subpopulations with $(S_1 = 1, S_0 = 1)$ and $(S_1 = 1, S_0 = 0)$, we have
\[
\begin{aligned}
E(Y \mid Z = 1, S = 1, W = w)
&= E\{Y(1) \mid S_1 = 1, S_0 = 1, W = w\}\, P(S_0 = 1 \mid S_1 = 1, W = w) \\
&\quad + E\{Y(1) \mid S_1 = 1, S_0 = 0, W = w\}\, P(S_0 = 0 \mid S_1 = 1, W = w) \\
&= \beta_0 + \beta_1 + \beta_2\, \frac{P(S = 1 \mid Z = 0, W = w)}{P(S = 1 \mid Z = 1, W = w)} + \beta_3 w. \qquad (16)
\end{aligned}
\]
The subpopulation with $(Z = 1, S = 0)$ is the same as the subpopulation with $(S_1 = 0, S_0 = 0)$, and then
\[
E(Y \mid Z = 1, S = 0, W = w) = E\{Y(1) \mid S_1 = 0, S_0 = 0, W = w\} = \beta_0 + \beta_3 w. \qquad (17)
\]
Because $P(S = 1 \mid Z = 0, W = w)/P(S = 1 \mid Z = 1, W = w)$ is not constant in $w$, we can identify $(\beta_0, \beta_1, \beta_2, \beta_3)$ from (16) and (17). Similarly, we can identify $(\beta_0', \beta_1', \beta_2', \beta_3')$ and hence the PCEs. $\Box$
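The mixture equations (16) and (17) can also be checked in a small simulation (my own sketch with made-up parameter values, not the authors' code): with the ratio $P(S = 1 \mid Z = 0, W)/P(S = 1 \mid Z = 1, W)$ known by construction, a stacked least-squares fit of the two equations recovers the coefficients of the treated-arm outcome model.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

W = rng.uniform(0, 1, n)
Z = rng.binomial(1, 0.5, n)

# Binary principal strata with monotonicity S1 >= S0.
p11 = 0.2 + 0.3 * W                       # P(S1 = 1, S0 = 1 | W)
p10 = 0.3                                 # P(S1 = 1, S0 = 0 | W)
u = rng.uniform(0, 1, n)
S0 = (u < p11).astype(int)
S1 = (u < p11 + p10).astype(int)
S = np.where(Z == 1, S1, S0)

# E{Y(1) | S1, S0, W} = b0 + b1*S1 + b2*S0 + b3*W, with made-up coefficients.
b = np.array([0.5, 1.0, -0.7, 0.4])
Y1 = b[0] + b[1] * S1 + b[2] * S0 + b[3] * W + rng.normal(0, 1, n)

# Ratio entering (16); identified from the observed data, here known by construction.
ratio = p11 / (p11 + p10)
g16 = (Z == 1) & (S == 1)                 # units obeying equation (16)
g17 = (Z == 1) & (S == 0)                 # units obeying equation (17)

# Stacked design: columns correspond to (b0, b1, b2, b3).
X16 = np.column_stack([np.ones(g16.sum()), np.ones(g16.sum()), ratio[g16], W[g16]])
X17 = np.column_stack([np.ones(g17.sum()), np.zeros(g17.sum()), np.zeros(g17.sum()), W[g17]])
X = np.vstack([X16, X17])
y = np.concatenate([Y1[g16], Y1[g17]])
est, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(est, 2))                   # approximately (0.5, 1.0, -0.7, 0.4)
```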
Proof of Proposition 4. From the linear model for $Y(1)$,
\[
\begin{aligned}
E(Y \mid Z = 1, S_1 = s_1, W = w)
&= \int E\{Y(1) \mid S_1 = s_1, S_0 = s_0, W = w\}\, P(S_0 = s_0 \mid S_1 = s_1, W = w)\, \mathrm{d}s_0 \\
&= \int \left\{ \beta_0 + \alpha_1 s_1 + \alpha_0 s_0 + \sum_{j=1}^{J} \beta_j f_j(w) \right\} P(S_0 = s_0 \mid S_1 = s_1, W = w)\, \mathrm{d}s_0 \\
&= \beta_0 + \alpha_1 s_1 + \alpha_0\, E(S_0 \mid S_1 = s_1, W = w) + \sum_{j=1}^{J} \beta_j f_j(w).
\end{aligned}
\]
Because $\{1, s_1, E(S_0 \mid S_1 = s_1, W = w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, we can identify $(\beta_0, \ldots, \beta_J, \alpha_1, \alpha_0)$. Similarly, we can identify $(\beta_0', \ldots, \beta_J', \alpha_1', \alpha_0')$. Therefore, we can identify the PCEs. $\Box$

Proof of Proposition 5.
We can first identify $\{\mu_1(w), \mu_0(w), \sigma_1(w), \sigma_0(w)\}$. From the bivariate Normality, the conditional distribution of $S_0 \mid (S_1 = s_1, W = w)$ is Normal with mean $\mu(s_1, w) = \mu_0(w) + \rho(w)\,\sigma_0(w)/\sigma_1(w)\,\{s_1 - \mu_1(w)\}$ and variance $\sigma^2(w) = \{1 - \rho^2(w)\}\,\sigma_0^2(w)$. Then
\[
\begin{aligned}
P(Y = 1 \mid Z = 1, S_1 = s_1, W = w)
&= \int P\{Y(1) = 1 \mid S_1 = s_1, S_0 = s_0, W = w\}\, P(S_0 = s_0 \mid S_1 = s_1, W = w)\, \mathrm{d}s_0 \\
&= \int \Phi\left\{ \beta_0 + \alpha_1 s_1 + \alpha_0 s_0 + \sum_{j=1}^{J} \beta_j f_j(w) \right\} P(S_0 = s_0 \mid S_1 = s_1, W = w)\, \mathrm{d}s_0 \\
&= \Phi\left\{ \frac{\beta_0 + \alpha_1 s_1 + \alpha_0\, \mu(s_1, w) + \sum_{j=1}^{J} \beta_j f_j(w)}{\sqrt{1 + \alpha_0^2\, \sigma^2(w)}} \right\}.
\end{aligned}
\]
Because $\{1, s_1, E(S_0 \mid S_1 = s_1, W = w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, we can use Probit regression to first identify $\alpha_1$ and $\alpha_0$ by fixing $w$ and then identify $\beta_j$ for $j = 1, \ldots, J$ by fixing $s_1$. Similarly, we can identify $(\beta_0', \ldots, \beta_J', \alpha_1', \alpha_0')$. Therefore, we can identify the PCEs. $\Box$

Appendix B.4: Models based on t distributions

We give identification results for models based on $t$ distributions. These models are frequently applied for robust analysis when the data have heavier tails than the standard Normal distribution (Zellner, 1976; Lange et al., 1989; Liu, 2004; Gelman et al., 2014). We first give the results under Assumption 3. The following result is an application of Theorem 5 to models based on $t$ distributions with a constant control intermediate variable.

Corollary S1.
Suppose that $W$ is continuous and Assumptions 1, 3 and 4 hold, and $S_1 \mid W = w \sim t(h(w), \sigma^2(w), \nu)$ with unknown $\nu$, where $h(w)$ and $\sigma(w)$ are continuously differentiable functions that can be unknown. Then the PCEs are identifiable.

Proof of Corollary S1.
From Result S3, $\mathcal{P}_W$ is complete. Thus, from Theorem 5, the PCEs are identifiable. $\Box$

Similarly, the following result is an application of Theorem 8 to models based on $t$ distributions with a non-constant control intermediate variable.

Corollary S2.
For a continuous $W$, suppose that Assumptions 1 and 3 hold. If
\[
(S_1, S_0) \mid W = w \sim t_2\left( \begin{pmatrix} \mu_1(w) \\ \mu_0(w) \end{pmatrix},
\begin{pmatrix} \sigma_1^2(w) & \rho(w)\,\sigma_1(w)\,\sigma_0(w) \\ \rho(w)\,\sigma_1(w)\,\sigma_0(w) & \sigma_0^2(w) \end{pmatrix}, \nu \right),
\]
with known values of $\rho(w)$, then the PCEs are identifiable.

Proof of Corollary S2.
First, we can identify $\{\mu_1(w), \mu_0(w), \sigma_1(w), \sigma_0(w)\}$. For any fixed $s_0$, Ding (2016) implies that
\[
S_1 = h_{s_0}(W) + \sigma_{s_0}(W)\,\varepsilon,
\]
where
\[
h_{s_0}(W) = \mu_1(W) + \rho(W)\,\frac{\sigma_1(W)}{\sigma_0(W)}\,\{s_0 - \mu_0(W)\}, \qquad
\sigma_{s_0}(W) = \sqrt{\frac{\nu + \{s_0 - \mu_0(W)\}^2/\sigma_0^2(W)}{\nu + 1}} \cdot \sqrt{\{1 - \rho^2(W)\}\,\sigma_1^2(W)},
\]
and $\varepsilon \sim t(0, 1, \nu + 1)$ with $\varepsilon \perp W$. From Result S3, $\mathcal{P}_{W,s_0}$ is complete. Similarly, $\mathcal{P}_{W,s_1}$ is also complete. Therefore, from Theorem 8, the PCEs are identifiable. $\Box$
We then give the results when neither Assumption 2 nor Assumption 3 holds. The following proposition extends the Probit model in Proposition 2 to the Robit model (Liu, 2004) with a constant control intermediate variable.
Proposition 6.
Under Assumptions 1 and 4, assume
\[
S_1 = g(W) + \varepsilon_S, \qquad
Y(0) = I\left\{ \beta_0 + \alpha S_1 + \sum_{j=1}^{J} \beta_j f_j(W) + \varepsilon_Y > 0 \right\}, \qquad
(\varepsilon_S, \varepsilon_Y) \sim t_2(\mu, \Sigma, \nu),
\]
where $(\varepsilon_S, \varepsilon_Y) \perp W$, $\mu = (0, 0)^{\top}$, $\Sigma = \mathrm{diag}(\sigma^2, 1)$, and $g(w)$ can be unknown. If $\{1, g(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, then the PCEs are identifiable.

Proposition 2 does not assume a joint model for $(S_1, Y)$. In contrast, Proposition 6 assumes a joint model for these two variables by restricting the error terms to follow a bivariate $t$ distribution. This is because two independent Normal random variables make a bivariate Normal random variable, but two independent $t$ random variables do not make a bivariate $t$ random variable, even with the same degrees of freedom.

Proof of Proposition 6.
First, because we observe $S_1$ when $Z = 1$, we can identify $P\{Y(1) \mid S_1\}$. Second, because the distribution of $(S_1, W)$ is identifiable, we can identify $g(w)$ and $\sigma^2$. Then
\[
\begin{aligned}
P(Y = 1 \mid Z = 0, W = w)
&= \int P(Y = 1 \mid Z = 0, S_1 = s, W = w)\, P(S_1 = s \mid W = w)\, \mathrm{d}s \\
&= \int P\{Y(0) = 1 \mid S_1 = s, W = w\}\, P(S_1 = s \mid W = w)\, \mathrm{d}s \\
&= T_{\nu}\left\{ \frac{\beta_0 + \alpha g(w) + \sum_{j=1}^{J} \beta_j f_j(w)}{\sqrt{1 + \alpha^2\sigma^2}} \right\}.
\end{aligned}
\]
Because $\{1, g(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, from the identifiability of the Robit model, we can identify
\[
\frac{\alpha}{\sqrt{1 + \alpha^2\sigma^2}},\ \frac{\beta_0}{\sqrt{1 + \alpha^2\sigma^2}},\ \ldots,\ \frac{\beta_J}{\sqrt{1 + \alpha^2\sigma^2}}.
\]
Because $\sigma^2$ is identifiable, we can identify $\{\alpha, \beta_0, \ldots, \beta_J\}$ and hence the PCEs. $\Box$

Similarly, for the case with a non-constant control intermediate variable, we give an identification result for Robit outcome models as an extension of the Probit outcome models.

Proposition 7.
Under Assumption 1, assume that
\[
S_1 = g_1(W) + \varepsilon_{S_1}, \qquad S_0 = g_0(W) + \varepsilon_{S_0},
\]
\[
Y(1) = I\left\{ \beta_0 + \alpha_1 S_1 + \alpha_0 S_0 + \sum_{j=1}^{J} \beta_j f_j(W) + \varepsilon_{Y_1} > 0 \right\}, \qquad
Y(0) = I\left\{ \beta_0' + \alpha_1' S_1 + \alpha_0' S_0 + \sum_{i=1}^{k} \beta_i' h_i(W) + \varepsilon_{Y_0} > 0 \right\},
\]
where $(\varepsilon_{S_1}, \varepsilon_{S_0}, \varepsilon_{Y_1}, \varepsilon_{Y_0}) \sim t_4(\mu, \Sigma, \nu)$, $(\varepsilon_{S_1}, \varepsilon_{S_0}, \varepsilon_{Y_1}, \varepsilon_{Y_0}) \perp W$, and
\[
\mu = \mathbf{0}, \qquad
\Sigma = \begin{pmatrix}
\sigma_{S_1}^2 & \sigma_{S_1S_0} & \sigma_{S_1Y_1} & 0 \\
\sigma_{S_1S_0} & \sigma_{S_0}^2 & 0 & \sigma_{S_0Y_0} \\
\sigma_{Y_1S_1} & 0 & \sigma_{Y_1}^2 & \sigma_{Y_1Y_0} \\
0 & \sigma_{Y_0S_0} & \sigma_{Y_1Y_0} & \sigma_{Y_0}^2
\end{pmatrix},
\]
with $\sigma_{S_1S_0}$ known. The PCEs are identifiable if $\{1, g_0(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent and $\{1, g_1(w), h_1(w), \ldots, h_k(w)\}$ are linearly independent.

Proof of Proposition 7.
We can first identify $g_1(\cdot)$, $g_0(\cdot)$ and all the terms in $\Sigma$ except $\sigma_{Y_1Y_0}$. Then
\[
\begin{aligned}
P(Y = 1 \mid Z = 1, S_1 = s_1, W = w)
&= P\left\{ \beta_0 + \alpha_1 S_1 + \alpha_0 S_0 + \sum_{j=1}^{J} \beta_j f_j(W) + \varepsilon_{Y_1} \geq 0 \,\middle|\, S_1 = s_1, W = w \right\} \\
&= P\left\{ \beta_0 + \alpha_1 s_1 + \alpha_0 g_0(W) + \sum_{j=1}^{J} \beta_j f_j(W) + \varepsilon_{Y_1} + \alpha_0 \varepsilon_{S_0} \geq 0 \,\middle|\, g_1(W) + \varepsilon_{S_1} = s_1, W = w \right\} \\
&= P\left\{ \varepsilon_{Y_1} + \alpha_0 \varepsilon_{S_0} \geq -\beta_0 - \alpha_1 s_1 - \alpha_0 g_0(W) - \sum_{j=1}^{J} \beta_j f_j(W) \,\middle|\, \varepsilon_{S_1} = s_1 - g_1(W), W = w \right\}.
\end{aligned}
\]
Ding (2016) implies
\[
(\varepsilon_{Y_1}, \varepsilon_{S_0}) \mid \varepsilon_{S_1} = x \sim t_2\left( \mu_c,\ \frac{\nu + d}{\nu + 1}\, \Sigma_c,\ \nu + 1 \right),
\]
where
\[
\mu_c = \begin{pmatrix} \sigma_{Y_1S_1}\, x/\sigma_{S_1}^2 \\ \sigma_{S_1S_0}\, x/\sigma_{S_1}^2 \end{pmatrix}, \qquad
\Sigma_c = \begin{pmatrix}
\sigma_{Y_1}^2 - \sigma_{Y_1S_1}^2/\sigma_{S_1}^2 & -\sigma_{Y_1S_1}\sigma_{S_1S_0}/\sigma_{S_1}^2 \\
-\sigma_{Y_1S_1}\sigma_{S_1S_0}/\sigma_{S_1}^2 & \sigma_{S_0}^2 - \sigma_{S_1S_0}^2/\sigma_{S_1}^2
\end{pmatrix}, \qquad
d = x^2/\sigma_{S_1}^2.
\]
Therefore,
\[
\varepsilon_{Y_1} + \alpha_0 \varepsilon_{S_0} \mid \varepsilon_{S_1} = x \sim t\left( \left( \alpha_0\, \frac{\sigma_{S_1S_0}}{\sigma_{S_1}^2} + \frac{\sigma_{Y_1S_1}}{\sigma_{S_1}^2} \right) x,\ \frac{\nu + d}{\nu + 1}\, r(\alpha_0),\ \nu + 1 \right),
\]
where $r(\alpha_0) = (1, \alpha_0)\, \Sigma_c\, (1, \alpha_0)^{\top}$. Therefore,
\[
P(Y = 1 \mid Z = 1, S_1 = s_1, W = w) = T_{\nu+1}\left\{ \frac{\beta_0 + \alpha_1 s_1 + \alpha_0 g_0(w) - \left( \alpha_0 \sigma_{S_1S_0}/\sigma_{S_1}^2 + \sigma_{Y_1S_1}/\sigma_{S_1}^2 \right)\{s_1 - g_1(w)\} + \sum_{j=1}^{J} \beta_j f_j(w)}{\sqrt{t(s_1, w)\, r(\alpha_0)}} \right\},
\]
where
\[
t(s_1, w) = \frac{\nu + \{s_1 - g_1(w)\}^2/\sigma_{S_1}^2}{\nu + 1}.
\]
Because $\{1, g_0(w), f_1(w), \ldots, f_J(w)\}$ are linearly independent, we can identify $(\beta_0, \ldots, \beta_J, \alpha_1, \alpha_0)$ from the identifiability of the Robit model. Similarly, we can identify $(\beta_0', \ldots, \beta_k', \alpha_1', \alpha_0')$. Therefore, we can identify the PCEs. $\Box$

Appendix C: More details for the application
Appendix C.1: Technical issues with a discrete auxiliary variable
Corollary 2 requires W to be continuous but W is categorical in the application. We givethe following proposition to formally justify the identifiability of the PCEs in our dataanalysis. Proposition S8.
Under Assumptions 1 and 2, assume that $P(S_1, S_0 \mid W = w)$ is identifiable, and $Y(1)$ and $Y(0)$ follow linear models
\[
E\{Y(z) \mid S_1, S_0, W\} = \beta_{0z} + \beta_{1z} S_1 + \beta_{2z} S_0, \qquad (z = 0, 1).
\]
If there exist $s_1$ and $s_0$ such that $E(S_0 \mid S_1 = s_1, W = w)$ and $E(S_1 \mid S_0 = s_0, W = w)$ are not constant in $w$, then the PCEs are identifiable.

Proof of Proposition S8.
From the linear models for $Y(0)$,
\[
\begin{aligned}
E(Y \mid Z = 0, S_0 = s_0, W = w)
&= \int E\{Y(0) \mid S_1 = s_1, S_0 = s_0, W = w\}\, P(S_1 = s_1 \mid S_0 = s_0, W = w)\, \mathrm{d}s_1 \\
&= \int (\beta_{00} + \beta_{10} s_1 + \beta_{20} s_0)\, P(S_1 = s_1 \mid S_0 = s_0, W = w)\, \mathrm{d}s_1 \\
&= \beta_{00} + \beta_{10}\, E(S_1 \mid S_0 = s_0, W = w) + \beta_{20} s_0.
\end{aligned}
\]
Because $E(S_1 \mid S_0 = s_0, W = w)$ is not constant in $w$, $\beta_{00}$, $\beta_{10}$ and $\beta_{20}$ are identifiable. Similarly, $\beta_{01}$, $\beta_{11}$ and $\beta_{21}$ are identifiable, and hence the PCEs are identifiable. $\Box$

In Proposition S8, the condition that $E(S_0 \mid S_1 = s_1, W = w)$ and $E(S_1 \mid S_0 = s_0, W = w)$ are constant in $w$ is testable because $P(S_1, S_0 \mid W = w)$ is identifiable. In our application, the conditions in Proposition S8 hold, and thus the PCEs are identifiable even though the auxiliary variable $W$ is categorical.

Appendix C.2: Bayesian analysis with different priors

[Figure S1 here: histograms of the posterior distributions of $\beta_{11} - \beta_{10}$ and $\beta_{21} - \beta_{20}$ under different values of $\rho$; panel (a) shows results with prior A and panel (b) with prior B. The solid lines are the medians and the dashed lines are the posterior 2.5% and 97.5% quantiles.]

Let $\beta_1 = (\beta_{01}, \beta_{11}, \beta_{21})$ and $\beta_0 = (\beta_{00}, \beta_{10}, \beta_{20})$. We choose multivariate Normal priors for $\beta_z$ and $\mu_w$: $\beta_z \sim N(0, \Omega_z)$ and $\mu_w \sim N(0, \Omega)$ for $w = 1, \ldots, 7$. Prior A uses $\Omega_z = 10\,\mathrm{diag}(1, 1, 1)$ and $\Omega = 10\,\mathrm{diag}(1, 1)$ for $z = 0, 1$; Prior B uses $\Omega_z = \mathrm{diag}(1, 1, 1)$ and $\Omega = \mathrm{diag}(1, 1)$ for $z = 0, 1$. We choose the following non-informative priors for the other parameters: $f(\sigma_{zw}) \propto 1/\sigma_{zw}$, $f(\sigma_{Yz}) \propto 1/\sigma_{Yz}$, $\{P(W = 1), \ldots, P(W = 7)\} \sim \mathrm{Dirichlet}(1, \ldots, 1)$, and $P(Z = 1 \mid W = w) \sim \mathrm{Beta}(1, 1)$ for $z = 0, 1$ and $w = 1, \ldots, 7$. Figure S1 presents the results, which are not sensitive to the different priors.
Appendix C.3: More sensitivity analysis for the application
Section 6 conducts the analysis without covariates. This section includes covariates in the data analysis, which can make Assumption 3 more plausible. Similar to the main text, we also assess the sensitivity of the results to the correlation coefficient between $S_1$ and $S_0$ given $W$.

Let $X$ denote the covariates. We consider the following model for $(S_1, S_0, W)$:
\[
(S_1, S_0) \mid W = w, X = x \sim N\left( \begin{pmatrix} \mu_1(w, x) \\ \mu_0(w, x) \end{pmatrix},\
\Sigma(w) = \begin{pmatrix} \sigma_1^2(w) & \rho(w)\,\sigma_1(w)\,\sigma_0(w) \\ \rho(w)\,\sigma_1(w)\,\sigma_0(w) & \sigma_0^2(w) \end{pmatrix} \right),
\]
where $\rho(w) = \rho$, $\mu_1(w, x) = \alpha_{1w} + \gamma_{1w}^{\top} x$ and $\mu_0(w, x) = \alpha_{0w} + \gamma_{0w}^{\top} x$. Therefore, we allow the means of $S_1$ and $S_0$ to depend on the covariates in the sensitivity analysis. We then consider the following model for the potential outcomes:
\[
Y(z) = \beta_{1z} S_1 + \beta_{2z} S_0 + \beta_{z,X}^{\top} X + \varepsilon_{Yz}.
\]
For the simplicity of the sensitivity analysis, we use an alternative estimation strategy based on the method of moments, which does not require specifying the distributions of $\varepsilon_{Y1}$ and $\varepsilon_{Y0}$:

1. Obtain the estimates of $\mu_1(w, x)$, $\mu_0(w, x)$ and $\Sigma(w)$ for all $w$ from the observed distribution of $(S_1, S_0, W, X)$.

2. For units with $Z_i = 1$, impute $S_{0i}$ with $\widehat{S}_{0i} = \widehat{\mu}_0(w, x) + \rho\, \widehat{\sigma}_0(w)/\widehat{\sigma}_1(w)\, \{S_{1i} - \widehat{\mu}_1(w, x)\}$; for units with $Z_i = 0$, impute $S_{1i}$ with $\widehat{S}_{1i} = \widehat{\mu}_1(w, x) + \rho\, \widehat{\sigma}_1(w)/\widehat{\sigma}_0(w)\, \{S_{0i} - \widehat{\mu}_0(w, x)\}$.

3. Obtain the estimates of $(\beta_{11}, \beta_{21}, \beta_{1,X})$ by regressing $Y_i$ on $S_{1i}$, $\widehat{S}_{0i}$ and $X_i$ for units with $Z_i = 1$; obtain the estimates of $(\beta_{10}, \beta_{20}, \beta_{0,X})$ by regressing $Y_i$ on $\widehat{S}_{1i}$, $S_{0i}$ and $X_i$ for units with $Z_i = 0$.

A code sketch of this moment-based procedure appears after Table S1 below. We use the bootstrap to obtain the 95% confidence intervals of the PCEs.

We first conduct the sensitivity analysis without covariates. The upper panel of Table S1 shows the point estimates and 95% confidence intervals of the five PCEs from the method of moments. The results are similar to those from the Bayesian approach presented in the main text. We then consider the sensitivity analysis with covariates. Due to the limited sample size, we only include the demographic variables: gender, age, and race. The lower panel of Table S1 shows the point estimates and 95% confidence intervals of the five PCEs from the method of moments. Although the estimates change in the sensitivity analysis, we can still conclude that for people who gain more in job-search self-efficacy from the treatment, the treatment lowers the risk of depression to a larger extent.

Table S1: Point and interval estimates of representative PCEs using the method of moments. The intervals excluding zero are highlighted in bold.
[Table S1 here: rows are five representative principal strata $(s_1, s_0)$, columns are $\rho = 0, 0.2, 0.4, 0.6, 0.8$; the upper panel reports results without covariates and the lower panel reports results with covariates. For the first listed stratum, the point estimates are 1.210, 1.362, 1.440, 1.406, 1.366 without covariates and 1.581, 1.704, 1.732, 1.466, 1.562 with covariates; the remaining entries are not legible in the source.]
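Below is a rough, self-contained sketch of the three-step moment-based procedure on simulated data (entirely my own illustration, not the authors' code: the data-generating values, the single covariate, and the helper functions are all hypothetical, and $\rho$ is fixed at its true value so the imputation step is correct by construction).

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 50_000, 0.4                      # rho is the sensitivity parameter, assumed known

# Simulated data: categorical W, one covariate X, bivariate Normal (S1, S0) given (W, X).
W = rng.integers(0, 3, n)
X = rng.normal(0, 1, n)
mu1 = 1.0 + 0.5 * W + 0.3 * X
mu0 = 0.5 + 0.2 * W + 0.3 * X
sd1, sd0 = 1.0, 0.8
e1, e0 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], n).T
S1, S0 = mu1 + sd1 * e1, mu0 + sd0 * e0
Z = rng.binomial(1, 0.5, n)
S = np.where(Z == 1, S1, S0)              # only one potential intermediate value is observed
Y = np.where(Z == 1,
             2.0 * S1 - 1.0 * S0 + 0.5 * X + rng.normal(0, 1, n),   # Y(1) model
             1.0 * S1 - 0.5 * S0 + 0.5 * X + rng.normal(0, 1, n))   # Y(0) model

def fit_mean(S_obs, W_obs, X_obs):
    """Step 1: regress the observed intermediate variable on X within each level of W."""
    params = {}
    for w in np.unique(W_obs):
        m = W_obs == w
        A = np.column_stack([np.ones(m.sum()), X_obs[m]])
        coef, *_ = np.linalg.lstsq(A, S_obs[m], rcond=None)
        params[w] = (coef, (S_obs[m] - A @ coef).std())
    return params

p1 = fit_mean(S[Z == 1], W[Z == 1], X[Z == 1])   # mu1_hat(w, x) and sigma1_hat(w)
p0 = fit_mean(S[Z == 0], W[Z == 0], X[Z == 0])   # mu0_hat(w, x) and sigma0_hat(w)

m1 = np.array([p1[w][0][0] + p1[w][0][1] * x for w, x in zip(W, X)])
m0 = np.array([p0[w][0][0] + p0[w][0][1] * x for w, x in zip(W, X)])
s1_hat = np.array([p1[w][1] for w in W])
s0_hat = np.array([p0[w][1] for w in W])

# Step 2: impute the missing potential intermediate value (only the relevant arm is used).
S0_imp = m0 + rho * s0_hat / s1_hat * (S - m1)   # used for treated units, where S = S1
S1_imp = m1 + rho * s1_hat / s0_hat * (S - m0)   # used for control units, where S = S0

# Step 3: arm-specific regressions of Y on (S1, S0_hat, X) and (S1_hat, S0, X).
def regress(y, cols):
    A = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(A, y, rcond=None)[0]

t, c = Z == 1, Z == 0
print("treated-arm coefficients:", np.round(regress(Y[t], [S[t], S0_imp[t], X[t]]), 2))
print("control-arm coefficients:", np.round(regress(Y[c], [S1_imp[c], S[c], X[c]]), 2))
# Roughly recovers (2.0, -1.0, 0.5) and (1.0, -0.5, 0.5) after the intercepts.
```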