Doubly weighted M-estimation for nonrandom assignment and missing outcomes
Akanksha Negi † November 23, 2020
Abstract
This paper proposes a new class of M-estimators that double weight for the twin problems of nonrandom treatment assignment and missing outcomes, both of which are common issues in the treatment effects literature. The proposed class is characterized by a 'robustness' property, which makes it resilient to parametric misspecification in either a conditional model of interest (for example, mean or quantile function) or the two weighting functions. As leading applications, the paper discusses estimation of two specific causal parameters, average and quantile treatment effects (ATE, QTEs), which can be expressed as functions of the doubly weighted estimator, under misspecification of the framework's parametric components. With respect to the ATE, this paper shows that the proposed estimator is doubly robust even in the presence of missing outcomes. Finally, to demonstrate the estimator's viability in empirical settings, it is applied to Calónico and Smith (2017)'s reconstructed sample from the National Supported Work training program.
Keywords:
Unconfoundedness, Missing at random, Double weighting, M-estimation, Treatment effects
JEL Classification:
C13, C18, C31

∗I am grateful to Jeffrey M. Wooldridge, Steven Haider, Ben Zou, and Kenneth Frank. Special thanks to Tim Vogelsang, Wendun Wang, Alyssa Carlson, Christian Cox, Tymon Słoczyński, and seminar & conference participants, for insightful comments and suggestions on earlier drafts of this paper.

1 Introduction
When interest lies in causal inference, the prevalence of missing data poses a major identification challenge. A common issue is that the outcome of interest is missing for some proportion of the sample. In this case, the complete data method that drops observations with missing outcomes is widely used. While dropping is practically convenient, it not only leads to substantial loss of information but more importantly creates a nonrandom sample for estimation. In turn, dropping can generally lead to inconsistent treatment effect estimates. This paper proposes an estimator that double weights for the twin problems of nonrandom treatment assignment and missing outcomes by using information on covariates.

Weighting has been used extensively in both the missing data [Horvitz and Thompson (1952), Robins et al. (1994), Robins and Rotnitzky (1995), Wooldridge (2007)] and treatment effects [Rosenbaum and Rubin (1983), Hahn (1998), Hirano and Imbens (2001), Firpo (2007), Słoczyński and Wooldridge (2018)] literatures. However, a weighting approach that corrects for general missingness in the outcome to estimate treatment effects using observational data is yet to be proposed. Previous studies have considered weighting to deal with specific missing data issues such as attrition and non-response in the presence of endogenous treatment selection [Frölich and Huber (2014), Huber (2014), Fricke et al. (2020)]. Typically, the identification argument in these papers is based on one or more instruments, with discussion centered around estimation of average treatment effects.

This paper introduces inverse probability weighting alongside propensity score (PS) weighting in a general M-estimation framework to address two prevalent problems in the causal inference literature. Moreover, the objective function being solved is permitted to be non-smooth in the underlying parameters, thereby covering both average and quantile treatment effects.
A key feature of the proposed estimator is its robustness to parametric misspecification in either a conditional model of interest (such as mean or quantile) or the two weighting functions. In addition, the ATE estimator which uses the proposed strategy is shown to be 'doubly robust' [Słoczyński and Wooldridge (2018)] even in the presence of missing outcomes.

The key identifying assumptions for consistency of the doubly weighted estimator of a population level parameter are unconfoundedness and missing at random. Put differently, the two restrictions imply that the treatment assignment and missing outcomes mechanisms are as good as randomly assigned after conditioning on covariates. With respect to missingness, the mechanism also allows sample observability to depend on the treatment status. As such, it allows for differential non-response, attrition, and even non-compliance to the extent that conditioning variables predict it.

For many observational studies, unconfoundedness may be a reasonable assumption. Previous literature has found several situations where such an assumption is tenable, especially when pre-treatment values of the outcome variable are available. For example, LaLonde (1986) and Hotz et al. (2006) have shown that controlling for pre-training earnings alone reduces significant bias between non-experimental and experimental estimates. The literature assessing teacher impact on student achievement has reported similar findings with pre-test scores [Chetty et al. (2014), Kane and Staiger (2008), and Shadish et al. (2008)], indicating the plausibility of unconfoundedness in such settings.

[Footnote: This is a widely used assumption in the treatment effects literature and is known by a variety of names such as exogeneity, ignorability, selection on observables, and conditional independence assumption (CIA).]

Estimation proceeds in two steps: the first step estimates the assignment and missingness probabilities by binary response maximum likelihood, and the second step plugs in the estimated probabilities as weights to solve a general objective function.
Given the parametric nature of the first and second steps, this paper highlights a robustness property which allows the estimator to remain consistent for a parameter of interest under misspecification of either a conditional model or the two probability weights. Consequently, the asymptotic theory in this paper distinguishes between these two halves. The first half focuses on misspecification of either a conditional expectation function (CEF) or a conditional quantile function (CQF), whereas the second half considers misspecification in the weighting functions.

As illustrative examples, the paper discusses robust estimation of two specific causal parameters, namely, the ATE and QTEs, expressed as functions of the doubly weighted estimator. Consistent estimation of the ATE is achievable under both misspecification scenarios. Of particular interest is the case when the conditional mean function is misspecified. For estimation of quantile treatment effects, the paper considers three different parameters, namely, the conditional quantile treatment effect (CQTE), a linear approximation to the CQTE, and the unconditional quantile treatment effect (UQTE), each of which may be of interest to the researcher depending on whether features of the conditional or unconditional outcome distributions are of interest. Simulations show that the doubly weighted ATE and QTE estimates have the lowest finite sample bias compared to alternatives which ignore one or both problems.

Finally, the proposed method is applied to estimate average and distributional impacts of the National Supported Work (NSW) training program on earnings for the Aid to Families with Dependent Children (AFDC) target group. The sample is obtained from Calónico and Smith (2017), who recreate LaLonde's within-study analysis for the AFDC women.
The idea behind choosing this empirical application is to utilize the presence of experimental and non-experimental comparison groups for evaluating whether the strategy of double weighting brings us close to the experimental benchmark relative to other alternatives. The paper finds that the empirical bias for the doubly weighted estimate is much smaller than that for the unweighted estimate.

The rest of this paper is structured as follows. Section 2 describes the basic potential outcomes framework and provides a short description of the population models with an introduction to the naive unweighted estimator. Section 3 discusses the treatment assignment and missing outcome mechanisms, which leads us directly to the identification lemma. Section 4 develops the first half of the asymptotic theory for the doubly weighted estimator with a focus on misspecification of a conditional feature of interest. This half also requires the weights to be correct for delivering parameter identification. In contrast, section 5 considers the other half where a conditional model of interest is correctly specified but the weights may be misspecified. Identification here relies on the parameter solving a conditional problem. Section 6 studies the specifics of robustness for estimating the ATE and QTEs in rigorous detail. Section 7 provides supporting Monte Carlo evidence under three interesting cases of misspecification: correct conditional model with misspecified weights, misspecified conditional model with correct weights, and misspecified model and weights. Section 8 applies the proposed method to job training data from Calónico and Smith (2017) and section 9 concludes.

[Footnote: As a practical matter, researchers typically follow the convention of estimating these probabilities as flexible logit functions.]

[Footnote: Such as the unweighted estimator which drops missing outcomes and does not weight, or the PS-weighted estimator which drops the missing data and weights by the propensity score to correct for nonrandom assignment.]
2 Potential outcomes framework

Consider the standard Neyman–Rubin causal model. Let Y(1) and Y(0) denote potential outcomes corresponding to the treatment and control states and let W be an indicator for whether an individual received the treatment. Then the observed outcome is

Y = Y(0) · (1 − W) + Y(1) · W    (1)

Also, let X be a vector of pre-treatment characteristics which includes an intercept. Some feature of the distribution of (Y(g), X) ⊂ R^M is assumed to depend on a finite P_g × 1 vector θ_g, contained in a parameter space Θ_g ⊂ R^{P_g}. Let q(Y(g), X, θ_g) be an objective function that depends on outcomes, covariates, and the parameter vector, θ_g. Then, the parameter of interest is defined to be a solution to the following M-estimation problem.

Assumption 1. (Identification of θ_g) The parameter vector θ_g ∈ Θ_g is a unique solution to the population minimization problem

min_{θ_g ∈ Θ_g} E[q(Y(g), X, θ_g)]    (2)

for each g = 0, 1.

Examples include the smooth ordinary least squares (OLS) function, q(Y(g), X, θ_g) = (Y(g) − Xθ_g)², or the non-smooth conditional quantile regression (CQR) function of Koenker and Bassett (1978), q(Y(g), X, θ_g) = c_τ(Y(g) − Xθ_g). Other examples of q(·) can be log-likelihood and quasi-log-likelihood (QLL) functions.

An implicit point in assumption 1 is that θ_g is not assumed to be correctly specified for a conditional feature like a conditional mean, variance, or even the full conditional distribution. It simply requires θ_g to uniquely minimize the population problem in (2). If θ_g is correctly specified for any of the above mentioned quantities, then the parameter is of direct interest to researchers. However, if θ_g is misspecified for any of these distributional features, assumption 1 guarantees a unique pseudo-true solution, θ*_g [White (1982)].
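As a concrete illustration, the two example objective functions can be sketched as follows (a minimal sketch; the function names are mine, not the paper's):

```python
import numpy as np

def ols_loss(y, x, theta):
    """Squared-error objective q(Y, X, theta) = (Y - X theta)^2."""
    return (y - x @ theta) ** 2

def check_loss(u, tau):
    """Koenker-Bassett asymmetric ('check') loss c_tau(u) = (tau - 1{u < 0}) u."""
    return (tau - (u < 0).astype(float)) * u

# The tau-th quantile minimizes E[c_tau(Y - m)]; e.g. the median (tau = 0.5)
# penalizes positive and negative residuals symmetrically:
u = np.array([-2.0, 2.0])
print(check_loss(u, 0.5))  # [1. 1.]
```

For τ ≠ 0.5 the loss is asymmetric: with τ = 0.25, a residual of −1 costs 0.75 while a residual of +1 costs only 0.25, which is what pushes the minimizer down to the first quartile.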
In the case of misspecification, determining whether θ*_g is meaningful will depend on the conditional feature being studied and the estimation method used. For example, in the case of OLS, θ_g will index a linear projection if one is agnostic about linearity of the CEF. Angrist et al. (2006) establish analogous approximation properties for quantiles, where a misspecified CQF can still provide the best weighted mean square approximation to the true CQF.

[Footnote: As mentioned in Negi and Wooldridge (2020), X may include functions of covariates such as levels, squares, and interactions which will be chosen by the researcher. The dimension of the covariate vector is assumed fixed and does not grow with the sample size.]

[Footnote: For generality, the dimension of θ_g is allowed to be different for the treatment and control group problems and is also different than the dimension of X, where X ∈ X ⊂ R^{dim(X)}.]

[Footnote: For a random variable u, c_τ(u) = (τ − 1{u < 0})u is the asymmetric loss function for estimating quantiles and 1{·} is an indicator function.]

Let S be a binary indicator such that S = 1 if the outcome is observed and S = 0 otherwise. The objective of this paper is to consistently estimate θ_g. In the presence of missing outcomes, a common empirical strategy is to solve the following M-estimation problems for the treatment and control groups, respectively:

min_{θ ∈ Θ} Σ_{i=1}^N S_i · W_i · q(Y_i(1), X_i, θ),   min_{θ ∈ Θ} Σ_{i=1}^N S_i · (1 − W_i) · q(Y_i(0), X_i, θ)    (3)

Let us refer to the estimator that solves (3) as the unweighted M-estimator and denote it as θ̂_ug. This estimator uses the available sample after dropping the missing data to estimate θ_g. Using the analogy principle in reverse, θ̂_ug will be consistent for θ_g if θ_g solves the population analogue of (3), which may not be true.
As an example, consider Y(g) = Xθ_g + U(g), g = 0, 1, with E[X′U(g)] = 0. In this case, even if the treatment is randomly assigned, missingness may still be correlated with the treatment, observable factors, or both. Hence, the population first order condition for the selected sample, E[S · W · X′U(g)], is not zero even though E[X′U(g)] = 0. So identification of θ_g is now confounded on two grounds: nonrandom assignment, which renders the treatment and control groups incomparable, and missing outcomes, which violates the 'random sampling' assumption. The next section discusses the identification approach taken in this paper.

3 Treatment assignment and missing outcomes mechanisms

Without imposing any structure on the assignment and missingness mechanisms in the population, estimating θ_g remains difficult. To proceed with identification, I assume that the treatment is unconfounded conditional on covariates. Formally,
Assumption 2. (Strong ignorability) Assume,

{Y(0), Y(1) ⊥⊥ W} | X    (4)

i) The vector of pre-treatment covariates, X, is always observed for the entire sample.
ii) For all x ∈ X ⊂ R^{dim(X)}, define p(x) = P(W = 1 | X = x) such that κ < p(x) < 1 − κ for a constant κ > 0.

Equation (4) indicates that conditioning on covariates is enough to parse out any systematic differences that may exist between the treatment and control groups. One advantage of unconfoundedness is that, intuitively, it has a better chance of holding once we control for a rich set of covariates, X. Note that unconfoundedness not only includes cases where the treatment is a deterministic function of the covariates, for example stratified (or block) experiments, but also cases where the treatment is a stochastic function of covariates. Part i) requires that we observe these covariates for all individuals. Part ii) is an overlap condition which ensures that for all values of x in X, we observe units in both the treatment and control groups.

[Footnote: Like most other assumptions, unconfoundedness is non-refutable. For methods that indirectly test for its validity, see Huber and Melly (2015), de Luna and Johansson (2014), and Heckman and Hotz (1989).]

With respect to the missing outcomes mechanism, I assume selection on observables.
Assumption 3. (Missing at Random (MAR)) Assume,

{Y(0), Y(1) ⊥⊥ S} | X, W    (5)

i) In addition to X, W is always observed for the entire sample.
ii) For each (x, w) ∈ (X, W) ⊂ R^{dim(X)+1}, define r(x, w) ≡ P(S = 1 | X = x, W = w) such that η < r(x, w) < 1 for a constant η > 0 and w = 0, 1.

Equation (5) states that conditional on covariates and the treatment status, the individuals whose outcomes are missing do not differ systematically from those who are observed. This implies that adjusting for X and W renders the outcomes as good as randomly missing. In the statistics literature, this assumption is known as MAR and represents a mechanism wherein missingness only depends on observables and not on the missing values of the variable itself [Little and Rubin (2019)]. Special cases covered under this mechanism are patterns such as missing completely at random (MCAR) and the exogenous missingness considered in Wooldridge (2007). Allowing the missingness probability to be a function of the treatment indicator is particularly useful in cases of differential nonresponse. For instance, in the NSW, people assigned to the treatment group were less likely to drop out of the program compared to the control group. In such cases, covariates alone may not be sufficient for predicting missingness. To the extent that being observed in the sample is predicted by X and W, assumption 3 can accommodate non-observability due to sampling design, item non-response, and attrition in a two period panel.

Part i) of the above assumption ensures that X and W are fully observed and part ii) again imposes an overlap condition. It states that there is a positive probability of observing people in the sample for a given X and W.

Then solving the doubly weighted population problem given below is the same as solving the original M-estimation problem in (2). The following lemma establishes this equality.

Lemma 1.
(Identification) Given assumptions 1, 2, and 3, assume i) q(Y(g), X, θ_g) is a real valued function for all (Y(g), X) ⊂ R^M; ii) E[|q(Y(g), X, θ_g)|] < ∞ for all θ_g ∈ Θ_g, g = 0, 1. Then

E[ω_g · q(Y(g), X, θ_g)] = E[q(Y(g), X, θ_g)]    (6)

where

ω₁ = S · W / [r(X, W) · p(X)],   ω₀ = S · (1 − W) / [r(X, W) · (1 − p(X))].

[Footnote: For example, Hirano and Imbens (2001) control for a rich set of prognostic factors to justify unconfoundedness while estimating the effects of right heart catheterization (RHC) on survival rates of patients.]

[Footnote: Methods for checking overlap involve calculating normalized sample average differences for each covariate and checking the empirical distribution of propensity scores.]

[Footnote: For the case of attrition, one must assume that second period missingness is ignorable conditional on initial period covariates and the treatment status.]

Lemma 1 is important for us as it helps to illustrate the role of double weighting in dealing with the two issues at hand. However, to operationalize this argument, we first need to estimate r(X, W) and p(X) before introducing the estimator and studying its asymptotic properties.

The following assumptions posit that we have a correctly specified model for the two probabilities and that we estimate them using binary response maximum likelihood. Since both W and S are binary responses, estimation of γ and δ using MLE will be asymptotically efficient under correct specification of these functions. Consistency and asymptotic normality for γ̂ and δ̂ follow from theorems 2.5 and 3.3 of Newey and McFadden (1994).

Assumption 4. (Correct parametric specification of propensity score) Assume that i) There exists a known parametric function G(X, γ) for p(X) where γ ∈ Γ ⊂ R^I and 0 < G(X, γ) < 1 for all X ∈ X, γ ∈ Γ; ii) There exists γ ∈ Γ s.t.
p(X) = G(X, γ); iii) γ̂ is the binary response maximum likelihood estimator that solves

max_{γ ∈ Γ} Σ_{i=1}^N {W_i log G(X_i, γ) + (1 − W_i) log(1 − G(X_i, γ))}    (7)

Assumption 5. (Correct parametric specification of missing outcomes probability) Assume that i) There exists a known parametric function R(X, W, δ) for r(X, W) where δ ∈ ∆ ⊂ R^K and R(X, W, δ) > 0 for all X ∈ X, δ ∈ ∆; ii) There exists δ ∈ ∆ s.t. r(X, W) = R(X, W, δ); iii) δ̂ is the binary response maximum likelihood estimator that solves

max_{δ ∈ ∆} Σ_{i=1}^N {S_i log R(X_i, W_i, δ) + (1 − S_i) log(1 − R(X_i, W_i, δ))}    (8)

The influence function representations for γ̂ and δ̂ can then be written as

√N(γ̂ − γ) = [E(d_i d_i′)]⁻¹ N^{−1/2} Σ_{i=1}^N d_i + o_p(1),   √N(δ̂ − δ) = [E(b_i b_i′)]⁻¹ N^{−1/2} Σ_{i=1}^N b_i + o_p(1)    (9)

where d_i and b_i are scores of the binary response log-likelihood problems in (7) and (8) evaluated at the probability limits γ and δ, respectively.

[Footnote: Define q(Y, X, θ) = q(Y(1), X, θ) for W = 1 and q(Y(0), X, θ) for W = 0; then E[ω_g · q(Y, X, θ)] ≡ E[ω_g · q(Y(g), X, θ_g)], which makes it a function of the observed random vector {(Y_i, X_i, W_i, S_i) : i = 1, 2, . . . , N}.]

The doubly weighted estimator is then defined as:

θ̂_g = argmin_{θ_g ∈ Θ_g} Σ_{i=1}^N ω̂_ig · q(Y_i(g), X_i, θ_g)    (10)

where ω̂_i1 = S_i · W_i / [R(X_i, W_i, δ̂) · G(X_i, γ̂)] and ω̂_i0 = S_i · (1 − W_i) / [R(X_i, W_i, δ̂) · (1 − G(X_i, γ̂))] are the estimated weights for solving the treatment and control group problems, respectively.
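To make the two-step construction concrete, here is a minimal simulation sketch (all functional forms, variable names, and the bare-bones Newton logit routine are my own illustration, not the paper's code): logit models are fit for p(X) and r(X, W) in the first step, and the fitted probabilities form the weights ω̂ in a weighted least-squares second step, as in (10).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit_mle(X, y, iters=30):
    """Binary response MLE via Newton-Raphson (stand-in for a flexible logit)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (y - p))
    return beta

def weighted_ols(X, y, w):
    """Second-step weighted M-estimation with a squared-error objective."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

rng = np.random.default_rng(0)
N = 20_000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])                    # includes an intercept

W = rng.binomial(1, sigmoid(0.5 * x))                   # nonrandom assignment: p(X)
S = rng.binomial(1, sigmoid(0.3 + 0.5 * W - 0.4 * x))   # MAR missingness: r(X, W)
Y1 = 2.0 + 1.0 * x + rng.normal(size=N)                 # treated outcome, theta_1 = (2, 1)
Y = np.where(W == 1, Y1, -1.0 + 0.5 * x + rng.normal(size=N))

# First step: estimate p(X) and r(X, W) by binary response MLE
g_hat = sigmoid(X @ logit_mle(X, W))
XW = np.column_stack([X, W])
r_hat = sigmoid(XW @ logit_mle(XW, S))

# Second step: doubly weighted estimator for the treated group, eq. (10)
w1 = S * W / (r_hat * g_hat)
obs = S == 1                                            # Y is usable only where observed
theta1_hat = weighted_ols(X[obs], Y[obs], w1[obs])
print(theta1_hat)                                       # close to (2.0, 1.0)
```

Here both the conditional mean and the two weight functions are correctly specified, so θ̂₁ should recover (2, 1); the robustness results below say either half may be misspecified without losing consistency.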
4 A conditional feature of interest may be misspecified

Given the two-step nature of the estimation problem (the first step uses binary response MLE for estimating the probability weights and the second step solves an objective function using the first-step weights), the asymptotic theory utilizes results for two-step estimators with a non-smooth objective function to establish the large sample properties of θ̂_g. The following theorem fills in the primitive regularity conditions for applying the uniform law of large numbers.

Theorem 1. (Consistency) Suppose assumption 1 holds and that i) {(Y_i, X_i, W_i, S_i) : i = 1, 2, . . . , N} are i.i.d. draws satisfying assumptions 2 and 3; ii) Θ_g is compact for g = 0, 1; iii) G(X, γ) satisfies assumption 4 and is continuous for each γ on the support of X; similarly, R(X, W, δ) satisfies assumption 5 and is continuous for each δ on the support of (X, W); iv) q(Y(g), X, θ_g) is continuous at each θ_g ∈ Θ_g with probability one; v) E[sup_{θ_g ∈ Θ_g} |q(Y(g), X, θ_g)|] < ∞. Then, θ̂_g →p θ_g.

The proof follows from verifying the conditions in Lemma 2.4 of Newey and McFadden (1994). Under the dominance condition given in v), uniform convergence of sample averages holds quite generally.

For establishing asymptotic normality, I provide primitive conditions for the general case of non-smooth objective functions. Let the score of q(Y(g), X, θ_g) at the true parameter, θ_g, be denoted as h(Y(g), X, θ_g) ≡ h_g and suppose it exists with probability one. Let the population problem be denoted as Q(θ_g) ≡ E[ω_g · q(Y(g), X, θ_g)] and the sample analogue be given as

Q_N(θ_g) ≡ (1/(N ρ̂_g)) Σ_{i=1}^N ω̂_ig · q(Y_i(g), X_i, θ_g)

where ρ̂_g = N_g/N, N ρ̂_g → ∞, and ρ̂_g → ρ_g. For the sake of asymptotics, we may ignore the division by ρ̂_g.
The main condition needed for establishing asymptotic normality is stochastic equicontinuity of the empirical process

v_N(θ_g) ≡ (1/√N) Σ_{i=1}^N {ω̂_ig h_ig(θ_g) − E[ω̂_ig h_ig(θ_g)]}    (11)

[Footnote: When necessary, the estimated weights will also be denoted as ω_g(δ̂, γ̂) ≡ ω̂_g.]

[Footnote: The sampling fractions N₁ = Σ_{i=1}^N S_i · W_i and N₀ = Σ_{i=1}^N S_i · (1 − W_i) are random, which implies that their sum N₁ + N₀ is also random as opposed to being fixed ahead of time.]

Theorem 2. (Asymptotic Normality) In addition to the conditions mentioned in Theorem 1, assume i) θ_g ∈ int(Θ_g); ii) q(Y(g), X, θ_g) is continuously differentiable on int(Θ_g) with probability one; iii) (1/N) Σ_{i=1}^N ω̂_ig · h(Y_i(g), X_i, θ̂_g) = o_p(N^{−1/2}); iv) E[sup_{θ_g ∈ Θ_g} ‖h(Y(g), X, θ_g)‖²] < ∞; v) G(·, γ) and R(·, δ) are both twice continuously differentiable on int(Γ) and int(∆), respectively; vi) E[sup_{δ ∈ ∆} ‖b(X, W, S, δ)‖²] < ∞, E[sup_{γ ∈ Γ} ‖d(X, W, γ)‖²] < ∞; vii) E[ω_g · h(Y(g), X, θ_g)] is continuously differentiable on int(Θ_g); viii) H_g ≡ ∇_{θ_g} E[ω_g · h(Y(g), X, θ_g)] is nonsingular; ix) {v_N(θ_g) : N ≥ 1} is stochastically equicontinuous. Then,

√N(θ̂_g − θ_g) →d N(0, H_g⁻¹ Ω_g H_g⁻¹)

where

Ω_g = E(l_ig l_ig′) − E(l_ig b_i′)[E(b_i b_i′)]⁻¹ E(b_i l_ig′) − E(l_ig d_i′)[E(d_i d_i′)]⁻¹ E(d_i l_ig′)

for each g = 0, 1, and l_ig ≡ ω_ig h_ig is the score of the weighted objective function evaluated at θ_g.
Sufficient primitive conditions for stochastic equicontinuity may be found in Andrews (1994). The asymptotic variance expression derived above offers some interesting insights. First, the middle term, Ω_g, represents the variance of the residual from the population regression of the weighted score, l_ig, on the two binary response scores, b_i and d_i. Note that even though Ω_g would involve a covariance between the two MLE scores, that term is zero on account of the two scores being conditionally independent.

Second, the expression for Ω_g has an efficiency implication for the second step estimate, θ̂_g. When a researcher is only willing to assume identification of θ_g in the unconditional sense, it is potentially more efficient to estimate the two weights even when they are known. To show this formally, let us assume that p(X) and r(X, W) are known and ˜θ_g is the doubly weighted estimator that uses known weights, ω_g. Then,

Corollary 1. (Efficiency gain with estimated weights) Under the assumptions of theorem 2,
Avar[√N(˜θ_g − θ_g)] − Avar[√N(θ̂_g − θ_g)] = H_g⁻¹ Σ_g H_g⁻¹ − H_g⁻¹ Ω_g H_g⁻¹ = H_g⁻¹ (Σ_g − Ω_g) H_g⁻¹

is positive semi-definite, where Σ_g = E(l_ig l_ig′).

In other words, we do no worse, asymptotically, by estimating the weights even when we actually know them. This result can be seen as an extension of Wooldridge (2007) to the case when one has two sets of probability weights being estimated in the first stage.

[Footnote: In the missing data literature, this result has also been called the "efficiency puzzle". Prokhorov and Schmidt (2009) study this puzzle in a GMM framework using an augmented set of moment conditions, where the first set of moments corresponds to the weighted objective function and the second set belongs to the missing outcomes (or selection in their case) problem.]

5 A conditional feature of interest is correctly specified
The asymptotic results in the previous section were derived under the assumption that some feature of the conditional distribution of outcomes may be misspecified. This was implicit in defining θ_g as a solution to the unconditional M-estimation problem. Examples include estimating a misspecified linear conditional mean or quantile function. In contrast, this section highlights the other half of the asymptotic theory, which is formalized using a strong version of the identification assumption while allowing the weights to be misspecified.

Assumption 6. (Strong identification of θ_g) The parameter vector θ_g ∈ Θ_g is the unique solution to the population minimization problem

min_{θ_g ∈ Θ_g} E[q(Y(g), X, θ_g) | X];   g = 0, 1    (12)

under unconfoundedness (defined in 2) and MAR (defined in 3) for each X ∈ X ⊂ R^{dim(X)}.

The above can be seen as a strengthening of the identification assumption in section 4 since the LIE implies that θ_g is also a solution to the unconditional M-estimation problem. By requiring θ_g to solve (12), assumption 6 is intended for situations where a conditional feature of interest is correctly specified. An implication of this strengthened identification is that θ_g now solves the conditional score of the objective function, i.e., E[h(Y(g), X, θ_g) | X] = 0.

For instance, the conditional score will be zero in the case of estimating a correctly specified CEF with either OLS or quasi maximum likelihood estimation (QMLE) in the linear exponential family (LEF). This would also hold for a correctly specified CQF estimated either using quantile regression or QMLE in the tick exponential family [Komunjer (2005)].

Delineating these two identification scenarios is important for determining which causal parameter can be estimated consistently under each setting. As we will see in the next section, it is possible to estimate the ATE under both cases of misspecification. However, the same cannot be said for the QTE parameters.
In addition to assumption 6, the asymptotic results in this half do not rely on correct specification of the weights. In other words, assuming R(·, ·, δ) and G(·, γ) to be correctly specified is rather restrictive and not required for the doubly weighted estimator to be consistent for θ_g.

Assumption 7. (Parametric specification of propensity score) Assume that conditions i) and iii) of assumption 4 hold, where condition ii) is defined for some γ* ∈ Γ such that plim(γ̂) = γ*.

Assumption 8. (Parametric specification of missingness probability) Assume that conditions i) and iii) of assumption 5 hold, where condition ii) is defined for some δ* ∈ ∆ such that plim(δ̂) = δ*.

Note that assumptions 7 and 8 do not require the parametric models for the two probabilities to be correctly specified. Nevertheless, we continue to assume that γ̂ and δ̂ solve the same binary response problems as in assumptions 4 and 5, with probability limits given by the pseudo-true values γ* and δ*, respectively [White (1982)]. To show that θ_g is still a solution to the doubly weighted population problem with misspecified weights, a sketch of the argument is given below. Consider,

E[ω*_g · q(Y(g), X, θ_g)]    (13)

where ω*_g are asymptotic weights which use G(X, γ*) and R(X, W, δ*). Using the LIE along with unconfoundedness and MAR, I can rewrite the above expectation as

E[ξ_g(X) · E{q(Y(g), X, θ_g) | X}]

where ξ_g(X) is a function of the weights for g = 0, 1. The strong identification assumption implies

E[q(Y(g), X, θ_g) | X] ≤ E[q(Y(g), X, θ) | X],   ∀ θ ∈ Θ_g

Further, since ξ_g(X) > 0,

E[ω*_g · q(Y(g), X, θ_g)] ≤ E[ω*_g · q(Y(g), X, θ)],   ∀ θ ∈ Θ_g

where the inequality is strict when θ ≠ θ_g. Therefore, solving the doubly weighted problem identifies the parameter even if the weights are wrong.
In general, the parameter that solves (13) will be different from the one that solves the same problem with correct weights. But as long as θ_g is a unique solution, solving (13) will identify it. The following two theorems establish consistency and asymptotic normality of the doubly weighted estimator.

Theorem 3. (Consistency under strong identification) Under assumptions 2, 3, 6, 7, and 8, with regularity conditions (1), (2), and (3) of Theorem 1, θ̂_g →p θ_g as N → ∞, where θ̂_g is the doubly weighted estimator that solves (13).

Theorem 4. (Asymptotic Normality under strong identification) Under the assumptions of theorem 3 and the regularity conditions of theorem 2, where the MLE estimators γ̂ and δ̂ have probability limits given by γ* and δ*,

√N(θ̂_g − θ_g) →d N(0, H_g⁻¹ Ω_g H_g⁻¹)

where Ω_g = E(l_ig l_ig′), with H_g and l_ig defined in Theorem 2 except with asymptotic weights given by ω*_ig.

Substantively, there is no real difference in the proof of the above theorem compared to those derived in section 4, except that now γ̂ and δ̂ are converging to probability limits that could be potentially different from those indexing the true treatment and missing outcome probabilities. A consequence of the objective function solving the conditional problem is reflected in the asymptotic variance expression. Compared to the previous section, Ω_g now is simply the variance of the weighted score of the objective function without first stage adjustment for the estimated probabilities. This is because under assumption 6, E(l_ig b_i′) = E(l_ig d_i′) = 0. A sketch of the proof for E(l_ig b_i′) = 0 is provided below. The argument for E(l_ig d_i′) = 0 follows analogously.
E(l_{ig} b'_i) ≡ E(ω*_{ig} h_{ig} b'_i) = E[ζ_g(X_i) · E(h(Y_i(g), X_i, θ_g) | X_i)] = 0

where ζ_g(X) is a function of the weights. The first equality uses the definition of l_{ig} with misspecified weights, and the second applies the LIE together with unconfoundedness and MAR. In other words, the reason for obtaining a simpler expression for Ω_g is that the correlation between the weighted score of the objective function and the two binary response scores is zero when θ_g is correctly specified for the conditional feature of interest and an appropriate method is used to estimate it. When R(X, W, δ*) = r(X, W) and G(X, γ*) = p(X), solving (13) is the same as solving the problems in section 3.
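The resulting sandwich form, H_g^{-1} Ω_g H_g^{-1} with Ω_g the variance of the weighted score and no first-stage adjustment, is straightforward to compute. A sketch for the weighted least-squares case (an illustrative design of my own; the constant multiplying the squared-loss score cancels inside the sandwich, so it is dropped):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
G_hat = 1 / (1 + np.exp(-0.5 * x))     # fitted propensity (here the truth, for brevity)
w = rng.binomial(1, G_hat)
R_hat = np.full(n, 0.8)                # fitted observation probability
s = rng.binomial(1, 0.8, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

omega = (w * s) / (G_hat * R_hat)      # double weights for the treated group

# Doubly weighted least squares
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve((omega[:, None] * X).T @ X, (omega[:, None] * X).T @ y)
resid = y - X @ beta

# Sandwich variance: H = sample mean of omega * x x',
# Omega = sample variance of the weighted score l = omega * x * resid
H = (omega[:, None] * X).T @ X / n
l = (omega * resid)[:, None] * X
Omega = l.T @ l / n
Hinv = np.linalg.inv(H)
avar = Hinv @ Omega @ Hinv
se = np.sqrt(np.diag(avar) / n)        # standard errors for (intercept, slope)
```

Note that, per the discussion above, no correction terms for the estimated γ̂ and δ̂ appear in `Omega`.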
A simpler expression for Ω_g also means that we can no longer exploit these correlations between scores to obtain asymptotic efficiency gains when estimating θ_g. Again, let θ̃_g be the doubly weighted estimator that uses the true weights, ω_g. Then:

Corollary 2. (No gain with estimated weights under strong identification) Under the assumptions of Theorem 4,
Avar[√N(θ̃_g − θ_g)] = Avar[√N(θ̂_g − θ_g)] = H_g^{-1} Ω_g H_g^{-1}

Hence, knowledge of the true weights adds little when, for instance, we have a correctly specified CEF or CQF and use OLS or QR to estimate the parameters indexing these conditional models of interest.

A special case of weights misspecification is when ω*_g is a constant. This is plausible since R(X, W, δ*) and G(X, γ*) are allowed to be any bounded positive functions of X and W. In other words, the unweighted estimator, θ̂_{ug}, which does not weight to correct for either problem, is also consistent for θ_g by theorem 3. In fact, assumptions 7 and 8 suggest that any weighted estimator will suffice for estimating θ_g. In this case, one may turn to asymptotic efficiency to guide the choice between weighting and not weighting at all. The following result says that if the objective function satisfies the generalized conditional information matrix equality (GCIME), the unweighted estimator is asymptotically more efficient than any of its weighted counterparts (with correctly specified weights or not).

Corollary 3. (Efficiency gain with unweighted estimator under GCIME) Under the assumptions of theorem 4, suppose additionally that the objective function satisfies the GCIME in the population, defined as

E[h(Y(g), X, θ_g) h(Y(g), X, θ_g)' | X] = σ_g² · ∇_{θ_g} E[h(Y(g), X, θ_g) | X] = σ_g² · A(X, θ_g)   (14)

Then,
Avar[√N(θ̂_g − θ_g)] = H_g^{-1} Ω_g H_g^{-1} and Avar[√N(θ̂_{ug} − θ_g)] = (H_g^u)^{-1} Ω_g^u (H_g^u)^{-1}, and

Avar[√N(θ̂_g − θ_g)] − Avar[√N(θ̂_{ug} − θ_g)] is positive semi-definite.

The proof follows from noting that the difference in the two asymptotic variances can be expressed as the expected outer product of the population residuals from the regression of B_i on D_i, which are weighted versions of the square root of the matrix A_i (see appendix F for details). Hence the difference is positive semi-definite.

The GCIME is known to hold in a variety of estimation contexts. In the case of full maximum likelihood, it holds for q(Y(g), X, θ_g) = −ln f_g(Y | X, θ_g), where f_g(·|·) is the true conditional density, with σ_g² = 1. For estimating conditional mean parameters using QMLE in the linear exponential family (LEF), the GCIME holds if Var(Y(g) | X) = σ_g² · v[m(X, θ_g)]. In other words, the GCIME is satisfied if Var(Y(g) | X) obeys the generalized linear model variance assumption, irrespective of whether the higher-order moments of the conditional distribution correspond to the chosen QLL. For estimation using nonlinear least squares, the GCIME holds for q(Y(g), X, θ_g) = [Y(g) − m(X, θ_g)]² under homoskedasticity. Hence, in all these cases, the unweighted estimator is more efficient than its weighted counterparts. When the GCIME is not satisfied, however, the two may not be easy to rank.

The asymptotic theory can now be used to discuss estimation of specific causal estimands, such as the ATE and QTEs, which can be expressed as functions of the doubly weighted estimator, θ_g. As discussed in Słoczyński and Wooldridge (2018), DR estimators remain consistent for the population ATE despite misspecification in either the conditional mean function or the propensity score, but not both.
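Corollary 3 can be checked numerically: with homoskedastic errors the squared-loss objective satisfies the GCIME, so the unweighted estimator on the selected sample should be no less precise than the correctly weighted one. A small Monte Carlo sketch under that assumption (design choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def one_rep(n=2000):
    """One simulated draw: slope estimates with and without true double weights."""
    x = rng.normal(size=n)
    p = 1 / (1 + np.exp(-x))                   # true treatment probability
    keep = (rng.binomial(1, p) == 1) & (rng.binomial(1, 0.8, size=n) == 1)
    y = 1.0 + 2.0 * x + rng.normal(size=n)     # homoskedastic errors: GCIME holds
    X = np.column_stack([np.ones(n), x])[keep]
    yk = y[keep]
    omega = 1 / (p[keep] * 0.8)                # correct double weights on kept sample
    b_unw = np.linalg.lstsq(X, yk, rcond=None)[0]
    WX = omega[:, None] * X
    b_wt = np.linalg.solve(WX.T @ X, WX.T @ yk)
    return b_unw[1], b_wt[1]

draws = np.array([one_rep() for _ in range(400)])
sd_unweighted, sd_weighted = draws.std(axis=0)
```

Both estimators are centered on the true slope, but across replications the dispersion of the unweighted estimator is smaller, matching the corollary's ranking.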
The current doubly weighted framework, along with the results developed in sections 4 and 5, allows us to extend this result to the case with missing outcomes. Let m(X, θ_g) be a parametric model for the conditional mean, which is said to be correctly specified for the CEF if for some θ_g ∈ Θ_g,

E[Y(g) | X] = m(X, θ_g)

or, equivalently, Y(g) = m(X, θ_g) + U(g) with E[U(g) | X] = 0. Let us consider the following two scenarios in turn.

First half: Correct conditional mean
When the conditional mean function is correct, more than one estimation method can be used to consistently estimate θ_g, namely nonlinear least squares (NLS) and QMLE in the LEF. For both of these examples, the results of section 5 dictate that weighting is not needed for consistency. The fact that one could weight by the misspecified weights and still consistently estimate θ_g forms the 'first half' of the DR result with double weighting. Once θ_g has been estimated by solving the sample version of the NLS or QMLE problem, the ATE can be estimated as

Δ̂_ate = (1/N) Σ_{i=1}^N m(X_i, θ̂_1) − (1/N) Σ_{i=1}^N m(X_i, θ̂_0)

If, in addition to having a correct conditional mean, I also assume the error variance of the outcomes to be homoskedastic (E[U(g)² | X] = Var[U(g) | X] = σ_g²), then the NLS estimator that does not weight at all is the preferred alternative from an efficiency perspective, since the GCIME is satisfied by NLS under homoskedasticity.

Second half: Correct weights
If one acknowledges misspecification in the conditional mean model, there is no general way of consistently estimating the ATE. However, a useful mean-fitting property of QMLEs in the LEF, together with double weighting, can be used to obtain consistent estimates of the unconditional means, E[Y(g)], despite misspecification in the conditional means, E[Y(g) | X]. In the generalized linear model (GLM) literature, the link function, h^{-1}(·), relates the mean of the distribution to a linear index as

h^{-1}(E[Y(g) | X]) = Xθ_g   (15)

The estimation strategy then is to choose m(X, θ_g) to be the function h(·), with the QLL corresponding to a choice of LEF density. The population first-order conditions from solving this QMLE problem give us

E[ ∇_{θ_g} h(Xθ*_g)' · (Y(g) − h(Xθ*_g)) / v[h(Xθ*_g)] ] = 0   (16)

where v[h(·)] is the variance of the mean function and θ*_g denotes the pseudo-true parameter indexing the misspecified conditional mean model [White (1982)]. In particular, by choosing h^{-1}(·) to be the canonical link for the QLL associated with the density, the gradient in the numerator of (16) cancels with the variance term in the denominator. Note that this cancellation occurs only with the canonical link function. It ensures that, if one includes an intercept in X, the misspecified mean model fits the overall mean of the distribution (see Wooldridge (2010), chapter 13, for more detail), so that

E[Y(g)] = E[h(Xθ*_g)]

With nonrandom assignment and missing outcomes, solving the sample GLM FOC in (16) would still not be sufficient for consistently estimating θ*_g. One would instead solve the doubly weighted FOC given below.
(1/N) Σ_{i=1}^N ω̂_{i1} · X'_i · [Y_i − h(X_i θ̂_1)] = (1/N) Σ_{i=1}^N ω̂_{i0} · X'_i · [Y_i − h(X_i θ̂_0)] = 0   (17)

The role played by weighting is crucial here for θ̂_g to be consistent for the pseudo-true parameter θ*_g. This forms the 'second half' of the DR result with double weighting. If h(·) is the identity function, the first-order conditions above can be recognized as those of OLS, with the line of best fit passing through the mean of Y. This is because OLS is a QMLE with a normal QLL and identity link function, typically used for outcomes with unrestricted support. Other combinations of QLLs and canonical link functions can be found in Table 2 of Negi and Wooldridge (2020).

[Footnote: The property of QMLEs that we are most familiar with is that the parameters of a correctly specified conditional mean can be consistently estimated if we choose m(X, θ_g) so that its range corresponds to the chosen LEF density (or QLL function), irrespective of the range and nature of the outcome Y. This property is used in the first half of the DR result. Section F in the online appendix provides a detailed proof of how the population GLM FOCs identify the unconditional means (and hence the ATE).]

Summary.
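The mean-fitting property behind this 'second half' can be illustrated with the identity link, where the doubly weighted QMLE reduces to weighted least squares: with an intercept included, the weighted FOC forces the weighted residuals to sum to zero, so the weighted average of fitted values recovers E[Y(g)] even though the linear mean model is wrong. A sketch under an illustrative design of my own (for brevity, the treatment and observation probabilities are folded into a single selection probability):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(size=n)
y = np.exp(0.2 * x) + rng.normal(size=n)   # true CEF exp(0.2x): a linear model is wrong
p = 1 / (1 + np.exp(-x))                   # combined selection probability (illustrative)
keep = rng.binomial(1, p) == 1
omega = 1 / p[keep]                        # correct inverse-probability weights

# Weighted OLS = doubly weighted QMLE with normal QLL and identity (canonical) link
X = np.column_stack([np.ones(keep.sum()), x[keep]])
WX = omega[:, None] * X
beta = np.linalg.solve(WX.T @ X, WX.T @ y[keep])

# The intercept row of the weighted FOC: weighted residuals sum to zero exactly
foc_intercept = (omega * (y[keep] - X @ beta)).sum()

# Hence the weighted mean of fitted values estimates E[Y] = exp(0.2**2 / 2)
mu_hat = (omega * (X @ beta)).sum() / n
mu_true = np.exp(0.02)
```

Even with the misspecified linear mean, `mu_hat` lands on the true unconditional mean, which is exactly what the ATE construction below relies on.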
DR estimation of ATE with double weighting
Case 1: Correct mean, misspecified weights
1. Consistent estimates of the conditional mean parameters, θ_g, can be obtained using either NLS or QMLE in the LEF.
2. A consistent estimator of the ATE is then

Δ̂_ate = (1/N) Σ_{i=1}^N m(X_i, θ̂_1) − (1/N) Σ_{i=1}^N m(X_i, θ̂_0)

Case 2: Misspecified mean, correct weights
1. Depending upon the range and nature of the outcome, Y, choose an appropriate QLL associated with an LEF density. Choose the mean function m(X, θ_g) = h(Xθ_g), where h(·) is the inverse canonical link function associated with the chosen density. Using this combination of mean function and QLL, use the moment conditions in (17) to obtain consistent estimates, θ̂_g.
2. Consistent estimates of the ATE can then be obtained as

Δ̂_ate = (1/N) Σ_{i=1}^N h(X_i θ̂_1) − (1/N) Σ_{i=1}^N h(X_i θ̂_0)

where X includes an intercept and θ̂_g solves the GLM first-order conditions.

Unlike the case of the ATE, it is generally not possible to obtain the UQTE by averaging the CQTE over the distribution of X. In this section, I use double weighting to illustrate estimation of three different quantile estimands, namely the UQTE, the CQTE, and a weighted linear approximation (LP) to the true CQTE, each of which may be of interest depending on whether features of the conditional or unconditional outcome distribution are of interest. Whether θ_g indexes the true CQF or an approximation depends on what is assumed about the conditional quantile model and the estimation method used.

Assume the two potential outcomes are continuous on ℝ. It is typical to define the τth quantile of Y(g) as

Q_{τ,g} = inf{y : F_g(y) ≥ τ}, 0 < τ < 1

Then the UQTE for the τth quantile is defined as the difference in the marginal quantiles of the outcome distributions,

UQTE_τ = Q_{τ,1} − Q_{τ,0}

Similarly, define the τth conditional quantile of Y(g) given X = x as

Q_{τ,g}(x) = inf{y : F_g(y | x) ≥ τ}, 0 < τ < 1

where F_g(·|x) denotes the conditional distribution function of Y(g) given X = x.
Then, the CQTE at the τth quantile for the subgroup defined by X is

CQTE_τ(X) = Q_{τ,1}(X) − Q_{τ,0}(X)

Let q_τ(X, θ_g(τ)) be a parametric model for the τth conditional quantile of Y(g), which is said to be correctly specified if for some θ_g(τ) ∈ Θ_g,

Q_{τ,g}(X) = q_τ(X, θ_g(τ))   (18)

Estimation of CQTE_τ: Much like the conditional mean case, if the CQF_τ is correctly specified, there are two methods that ensure consistent estimation of the CQF parameters, θ_g(τ). The first is the conditional quantile regression (CQR) of Koenker and Bassett (1978); the second is a class of QML estimators, proposed by Komunjer (2005), that uses a special 'tick-exponential' family of distributions to construct consistent estimators of conditional quantile parameters. The method is analogous to estimating a correctly specified conditional mean function using QMLE in the linear exponential family. For estimation using CQR, θ_g(τ) will actually solve the stronger conditional problem,

θ_g(τ) = argmin_{θ_g ∈ Θ_g} E[c_τ(Y(g) − q_τ(X, θ_g(τ))) | X]   (19)

For estimation via QMLE, as long as the CQF is correct and we choose an appropriate QLL,

θ_g(τ) = argmin_{θ_g ∈ Θ_g} E[−ln{φ_τ(Y(g), q_τ(X, θ_g(τ)))} | X]   (20)

where φ_τ(·,·) is a density belonging to the tick-exponential family. As dictated by the results in section 5, weighting the QR or QML objective functions, whether or not the weights are correctly specified, also delivers a consistent estimator of θ_g(τ). Once θ̂_g(τ) has been obtained by solving either the QR or QML problem, the τth conditional quantile treatment effect for subgroup X can be estimated as

CQTÊ_τ(X) = q_τ(X, θ̂_1(τ)) − q_τ(X, θ̂_0(τ))

Estimation of the LP to CQTE_τ: The traditional literature on conditional quantile estimation has focused on correct specification. However, Angrist et al.
(2006) establish an approximation property of CQR that is analogous to the approximation property of linear regression. The main implication of this result is that solving CQR with q_τ(X, θ_g(τ)) = Xθ_g(τ) would still identify a weighted linear approximation to the true CQF_τ. Therefore, the difference in LPs of the τ-quantile CQFs is interpretable as identifying an LP to the CQTE_τ. As before, weighting becomes crucial in the presence of nonrandom assignment and missing outcomes for identifying the LP parameters:

θ̂_g(τ) = argmin_{θ_g ∈ Θ_g} Σ_{i=1}^N ω̂_{ig} · c_τ(Y_i − X_i θ_g(τ))   (21)

In other words, one would need to weight the CQR problem with correct weighting functions for θ̂_g(τ) →p θ*_g(τ), which indexes the true LP to the CQF_τ for group g. Then,

LP̂[CQTE_τ(X)] = X[θ̂_1(τ) − θ̂_0(τ)]   (22)

Direct estimation of UQTE_τ: As mentioned at the beginning of this section, estimating the UQTE_τ from the CQTE_τ is generally not possible even if we assume a correct model for the conditional quantiles of Y(g); one cannot obtain unconditional quantiles by averaging conditional quantiles over the distribution of X. In this case, we can directly estimate Q_{τ,g} by running a quantile regression of the outcome on an intercept alone (similar to Firpo (2007)). In the present case, the solution to the doubly weighted objective function gives

θ̂_g(τ) = argmin_{θ_g ∈ Θ_g} Σ_{i=1}^N ω̂_{ig} · c_τ(Y_i − θ_g(τ))

such that θ̂_g(τ) →p Q_{τ,g}. Weighting by G(·) and R(·) is crucial here, since these functions serve to remove the biases arising from nonrandom assignment and missing outcomes.

[Footnote: Here φ_τ(y, η) = exp[−(1 − τ)[a(η) − b(y)] · 1{y ≤ η} + τ[a(η) − c(y)] · 1{y > η}] is a probability density and η is the τ-quantile of φ_τ, such that ∫_{−∞}^{η} φ_τ(y, η) dy = τ. Komunjer (2005) shows that the CQR of Koenker and Bassett (1978) is a special case of this QMLE class.]
One can then obtain the unconditional quantile treatment effect as

UQTÊ_τ = θ̂_1(τ) − θ̂_0(τ)

An alternative method of estimating the UQTE_τ is to use the recentered influence functions of Firpo et al. (2009) (see appendix B). The next section discusses results from a Monte Carlo study evaluating the finite-sample behavior of the doubly weighted ATE and QTE estimators under three different misspecification scenarios.

This section compares the empirical distributions of the ATE and QTEs using the unweighted, ps-weighted, and d-weighted estimators. The discussion centers on three common misspecification scenarios that are interesting from an empirical standpoint. These cases are enumerated in tables A.1 and A.2.

[Footnote: Firpo (2007) uses propensity score weighting to directly estimate unconditional quantiles in the presence of nonrandom assignment. Details of the simulation design are given in section A of the online appendix.]
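For intuition, the intercept-only weighted quantile regression above has a closed-form solution: invert the weighted empirical CDF. A minimal sketch (the helper name is mine, purely illustrative), which with the double weights ω̂_{ig} plugged in estimates Q_{τ,g}:

```python
import numpy as np

def weighted_quantile(y, weights, tau):
    """Minimizer of sum_i w_i * c_tau(y_i - theta): the tau-th weighted quantile.

    Sort the outcomes and return the first value at which the normalized
    cumulative weight reaches tau (inversion of the weighted empirical CDF).
    """
    order = np.argsort(y)
    y_sorted = y[order]
    cum = np.cumsum(weights[order]) / np.sum(weights)
    return y_sorted[np.searchsorted(cum, tau)]

# Sanity check: with uniform weights this reduces to the ordinary empirical quantile
rng = np.random.default_rng(4)
y = rng.normal(size=10_001)
med = weighted_quantile(y, np.ones_like(y), 0.5)
```

Applying this once per treatment group with the estimated double weights and differencing the two results gives the direct UQTE estimate described above.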
Case (1) in Table A.1 considers a misspecified mean function with correct probability weights. This is the principal case covered in section 4, wherein weighting is crucial. The empirical distribution of the doubly weighted estimator is centered on the true ATE, whereas that of the unweighted estimator is shifted to the right (see figure A.1, Case 1).

Case (2) looks at what happens when everything, the conditional mean and the two weights, is misspecified. The theory in this paper does not address this particular case; however, it characterizes an interesting possibility, given that misspecification of all components is a valid concern. The simulation results do offer some insight here: the doubly weighted estimator appears to be the only choice that delivers the true ATE on average, whereas the other distributions are shifted away from the truth (see figure A.1, Case 2).

Finally, case (3) depicts the possibility of a correctly specified conditional mean function with misspecified weights. Here weighting has no bite in resolving the identification issue beyond what is already achieved by having a correct mean function. In figure A.1, case 3, the empirical distributions of the estimated ATE for the unweighted, ps-weighted, and d-weighted estimators all coincide and are centered on the true ATE.

[
Figure A.1 here]

As discussed earlier, there are three parameters worth discussing for QTEs: the CQTE, the LP to the CQTE, and the UQTE. Misspecification in the CQF shifts attention to consistently estimating a linear projection to the true CQTE; the first case in Table A.2 considers exactly such a scenario. Using the results in Angrist et al. (2006), I interpret the solution to the doubly weighted problem in (21) as providing a consistent weighted linear projection to the true CQF, which is then used to estimate an LP to the true CQTE. Case 1 of Figure A.2 plots the bias in the estimated LP relative to the true LP as a function of X for the three estimators. Note that weighting here is crucial for consistently estimating the LP: the relative bias of the doubly weighted estimator is the lowest of all and coincides with the line of no bias. Case 2 considers the situation in which, along with a misspecified CQF, the weights are also wrong. We still find the proposed estimator performing best in terms of bias.

Finally, figure A.3 considers a correctly specified CQF, in which case we can estimate the CQTE itself. One can observe in the figure that the estimated function coincides with the true CQTE irrespective of how we weight: all three estimators, unweighted, ps-weighted, and doubly weighted, are consistent for the true CQTE, and misspecification in the weights does not affect this result.

[Footnote: See section A of the online appendix for details regarding how the CQTE curve is plotted.]
I also consider direct estimation of the UQTE, which does not require parametric specification of the CQF, since the UQTE is simply a difference of marginal quantiles. The two weights are therefore the only components of the framework that affect consistency of the UQTE estimates. In figure A.4, case 1, when both weights are correct, not weighting and double weighting both bring us close to the true parameter. In the second case, where both probability models are misspecified, double weighting does slightly worse than not weighting at all; however, the results at other quantiles reflect more favorably on double weighting (see section H of the online appendix for results at the 50th and 75th quantiles). Propensity score weighting performs the worst in both cases, suggesting instances where weighting for nonrandom assignment after dropping missing data may not be the preferred alternative.

[
Figure A.2 here ] [
Figure A.3 here ] [
Figure A.4 here]

In this section, I apply the proposed estimator to the Aid to Families with Dependent Children (AFDC) sample of women from the National Supported Work (NSW) program compiled by Calónico and Smith (2017) (CS, hereafter). NSW was a transitional, subsidized work experience program implemented as a randomized experiment in the United States between 1975 and 1979. CS replicate LaLonde (1986)'s within-study analysis for the AFDC women in the program, where the purpose of such an analysis is to evaluate how training estimates obtained from non-experimental identification strategies (for example, the CIA) compare with experimental estimates. To compute the non-experimental estimates, CS combine the NSW experimental sample with two non-experimental comparison groups drawn from the PSID, called PSID-1 and PSID-2. In this paper, I utilize the within-study feature of this empirical application to estimate how close the doubly weighted estimates get to the experimental estimate, compared with the ps-weighted and unweighted estimates.

To construct these empirical bias measures, I first augment the CS sample to allow for women who had missing earnings information in 1979. This renders 26% of the experimental and 11% of the PSID samples missing. I then combine the experimental treatment group of NSW with three distinct comparison groups present in the CS dataset, namely the experimental control group and the two PSID samples, to compute the unweighted, ps-weighted, and d-weighted training estimates. The difference between the non-experimental estimate obtained using the doubly weighted estimator and the experimental estimate provides the first measure of estimated bias associated with the proposed strategy. Combining the experimental control group with the non-experimental comparison group gives a second measure of estimated bias [Heckman et al.
(1998)]. Much like CS, I report both of these estimates across a range of regression specifications for the average returns to training.

Given the growing importance of estimating the distributional impacts of job training programs, I also estimate returns to training at every 10th quantile of the 1979 earnings distribution. The role of double weighting is highlighted for the case of estimating marginal quantiles, since covariates enter only through the two weighting functions.

[Footnote: The PSID-1 sample constructed by CS keeps all female household heads continuously from 1975 to 1979 who were between 20 and 55 years of age in 1975 and were not retired in 1975. The PSID-2 sample further restricts PSID-1 to only those women who received AFDC welfare in 1975. For details regarding sample construction and estimation of weights, see section E of the online appendix.]
First, to evaluate whether women with missing earnings in 1979 differed significantly from those who were observed, Table A.2 reports the mean and standard deviation of the women's age, years of schooling, pre-training earnings, and other characteristics across the observed and missing samples. In terms of age, the women who were observed in the experimentally treated group of NSW and in the PSID-1 sample were, on average, older than those who were missing. The observed women in PSID-1 were also more likely to be married. For the PSID-2 sample, women who were observed had, on average, more kids and higher pre-training earnings. Apart from these minor differences, the observed women did not appear to be systematically different from those who were missing, as measured through observable characteristics.

The presence of non-experimental control groups implies that assignment was nonrandom and therefore an issue in the sample. This is because the comparison groups were drawn from the PSID after imposing only a partial version of the full NSW eligibility criteria. Table A.1 provides descriptive statistics for the covariates by treatment status. As expected, the treatment and control groups of NSW are not observably different, indicating the strong role randomization plays in producing comparable groups. In contrast, the women in the PSID-1 and PSID-2 groups are statistically different from the treatment group members, implying substantial scope for nonrandom assignment.

Table A.3 reports the d-weighted, ps-weighted, and unweighted average returns to training estimates using three different comparison groups: the NSW control group, PSID-1, and PSID-2. The unweighted (unadjusted and adjusted) experimental estimates, given in row 1, are the same as the estimates reported by CS in Table 3 of their paper.
Overall, one can see that the doubly weighted experimental estimates are more stable than the single-weighted or unweighted estimates across the different regression specifications, with a range of $824-$828.

For computing the ps-weighted and d-weighted non-experimental estimates, I first trim the sample to ensure common support between the treatment and comparison groups. This reduces the sample size from 1,248 to 1,016 observations for the PSID-1 estimates and from 782 to 720 observations for the PSID-2 estimates. A pattern that is consistent across the two sets of non-experimental estimates is that weighting gets us much closer to the benchmark relative to not weighting at all. For instance, the unweighted simple difference-in-means estimate of training, which uses the PSID-1 comparison group, is -$799, whereas the weighted estimates are $827 and $803. For the PSID-2 comparison group, the unweighted estimate controlling for all covariates is $335, whereas the weighted estimates are $905 and $904.

The second panel of Table A.3 reports the bias in training estimates from combining the experimental control group with the PSID comparison groups. A similar pattern is seen here, with the weighted bias estimates much closer to zero than the unweighted ones. For instance, the doubly weighted estimate that adjusts for all covariates using the PSID-1 comparison group is -$21, whereas the unweighted estimate is -$568. These results suggest that the argument for weighting is strong when using a non-experimental comparison group where nonrandom assignment and missing outcomes are both a concern. Figure A.5 plots the relative bias in UQTE estimates at every 10th quantile of the 1979 earnings distribution.

[Footnote: Appendix E describes estimation of the two probability weights along with the sample trimming criteria.]
Much like the average training estimates, the weighted estimates consistently lie below the unweighted estimates for most quantiles, irrespective of whether we use the PSID-1 or PSID-2 non-experimental group. Note that I do not plot UQTE estimates for quantiles below 0.46, since these are all zero.

This empirical application illustrates the role of the proposed estimator in both experimental and observational data contexts. The comparison involving the treatment and control groups of NSW demonstrates its use in an experiment with missing outcomes, whereas the non-experimental samples demonstrate its use in the more realistic observational data setting.

[
Table A.1 here ] [
Table A.2 here ] [
Table A.3 here ] [
Figure A.5 here]

In empirical research, the problems of nonrandom assignment and missing outcomes threaten identification of causal parameters. This paper proposes a new class of consistent and asymptotically normal M-estimators that addresses these two issues using a double weighting procedure. The method combines propensity score weighting with weighting for missing outcomes in a general M-estimation framework, which can be applied to a range of estimation methods, such as ordinary least squares, quasi maximum likelihood, and quantile regression. In addition, the proposed class has a robustness property that allows us to estimate meaningful causal quantities of interest despite misspecification in either a conditional model of interest or the two weighting functions.

As leading applications, the paper discusses estimation of the ATE and QTEs. A Monte Carlo study indicates that the doubly weighted estimates of average and quantile treatment effects have the lowest bias compared to the naive alternatives (unweighted or propensity score weighted estimators) under three realistic cases of misspecification. Finally, the estimator is applied to the data on AFDC women from the NSW program compiled by Calónico and Smith (2017). The presence of experimental and non-experimental comparison groups in this application helps to quantify the estimated bias of the doubly weighted returns to training estimates as well as of the other two estimators.

Since the severity and magnitude of the bias introduced by ignoring either problem cannot be assessed ex ante, a safe bet from the practitioner's perspective is to report both doubly weighted and unweighted causal effect estimates. Practically, the doubly weighted estimator for the ATE is easy to implement; Appendix D provides example code that uses Stata's gmm command. Computation of analytically correct standard errors, however, requires additional coding and is still a work in progress.
Alternatively, one can use bootstrapped standard errors, which provide asymptotically correct inference.

Even though missing outcomes are a common concern in empirical analysis, it is equally common to encounter missing data on the covariates. A particularly important future extension would be to allow for missing data on both. In that case, a generalized method of moments framework incorporating information on both complete and incomplete cases could provide efficiency gains.

[Footnote: Note that the large standard errors for the non-experimental estimates can be attributed to the small sample sizes and to the large residual variance of earnings in the PSID-1 and PSID-2 populations. There are many women in the experimental and PSID samples with zero real earnings in 1979.]
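As a sketch of the bootstrap approach just mentioned (the helper and its interface are mine, purely illustrative): resample observations with replacement, re-estimate everything, including the two weighting models, inside the estimator on each draw, and take the standard deviation of the resampled estimates.

```python
import numpy as np

def bootstrap_se(estimator, data, n_boot=200, seed=0):
    """Nonparametric bootstrap standard error for a scalar-valued estimator.

    `estimator` maps a dataset (dict of equal-length arrays) to a scalar and
    should re-estimate the weight models internally on each resample, so that
    first-stage estimation error is reflected in the standard error.
    """
    rng = np.random.default_rng(seed)
    n = len(next(iter(data.values())))
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                  # resample rows with replacement
        stats.append(estimator({k: v[idx] for k, v in data.items()}))
    return float(np.std(stats, ddof=1))

# Toy check: the bootstrap SE of a sample mean is close to s / sqrt(n)
rng = np.random.default_rng(5)
data = {"y": rng.normal(size=400)}
se_boot = bootstrap_se(lambda d: d["y"].mean(), data, n_boot=500)
se_formula = data["y"].std(ddof=1) / np.sqrt(400)
```

For the doubly weighted ATE or QTE, `estimator` would refit G(·, γ̂) and R(·, δ̂) and solve the weighted problem on each bootstrap sample.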
References

Andrews, D. W. (1994): "Empirical process methods in econometrics," Handbook of Econometrics, 4, 2247–2294.

Angrist, J., V. Chernozhukov, and I. Fernández-Val (2006): "Quantile Regression under Misspecification, with an Application to the U.S. Wage Structure," Econometrica, 74, 539–563.

Calónico, S. and J. Smith (2017): "The women of the national supported work demonstration," Journal of Labor Economics, 35, S65–S97.

Chetty, R., J. N. Friedman, and J. E. Rockoff (2014): "Measuring the impacts of teachers I: Evaluating bias in teacher value-added estimates," American Economic Review, 104, 2593–2632.

de Luna, X. and P. Johansson (2014): "Testing for the unconfoundedness assumption using an instrumental assumption," Journal of Causal Inference, 2, 187–199.

Firpo, S. (2007): "Efficient semiparametric estimation of quantile treatment effects," Econometrica, 75, 259–276.

Firpo, S., N. M. Fortin, and T. Lemieux (2009): "Unconditional quantile regressions," Econometrica, 77, 953–973.

Firpo, S. and C. Pinto (2016): "Identification and estimation of distributional impacts of interventions using changes in inequality measures," Journal of Applied Econometrics, 31, 457–486.

Fricke, H., M. Frölich, M. Huber, and M. Lechner (2020): "Endogeneity and non-response bias in treatment evaluation: nonparametric identification of causal effects by instruments," Journal of Applied Econometrics, 35, 481–504.

Frölich, M. and M. Huber (2014): "Treatment evaluation with multiple outcome periods under endogeneity and attrition," Journal of the American Statistical Association, 109, 1697–1711.

Hahn, J. (1998): "On the role of the propensity score in efficient semiparametric estimation of average treatment effects," Econometrica, 66, 315–331.

Heckman, J., H. Ichimura, J. Smith, and P. Todd (1998): "Characterizing Selection Bias Using Experimental Data," Econometrica, 66, 1017–1098.

Heckman, J. J. and V. J. Hotz (1989): "Choosing among alternative nonexperimental methods for estimating the impact of social programs: The case of manpower training," Journal of the American Statistical Association, 84, 862–874.

Hirano, K. and G. W. Imbens (2001): "Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization," Health Services and Outcomes Research Methodology, 2, 259–278.

Horvitz, D. G. and D. J. Thompson (1952): "A generalization of sampling without replacement from a finite universe," Journal of the American Statistical Association, 47, 663–685.

Hotz, V. J., G. W. Imbens, and J. A. Klerman (2006): "Evaluating the differential effects of alternative welfare-to-work training components: A reanalysis of the California GAIN program," Journal of Labor Economics, 24, 521–566.

Huber, M. (2014): "Treatment evaluation in the presence of sample selection," Econometric Reviews, 33, 869–905.

Huber, M. and B. Melly (2015): "A test of the conditional independence assumption in sample selection models," Journal of Applied Econometrics, 30, 1144–1168.

Kane, T. J. and D. O. Staiger (2008): "Estimating teacher impacts on student achievement: An experimental evaluation," Tech. rep., National Bureau of Economic Research.

Koenker, R. and G. Bassett (1978): "Regression Quantiles," Econometrica, 46, 33–50.

Komunjer, I. (2005): "Quasi-maximum likelihood estimation for conditional quantiles," Journal of Econometrics, 128, 137–164.

LaLonde, R. J. (1986): "Evaluating the econometric evaluations of training programs with experimental data," The American Economic Review, 604–620.

Little, R. J. and D. B. Rubin (2019): Statistical Analysis with Missing Data, vol. 793, John Wiley & Sons.

Negi, A. and J. M. Wooldridge (2020): "Revisiting regression adjustment in experiments with heterogeneous treatment effects," Econometric Reviews, 0, 1–31.

Newey, W. K. and D. McFadden (1994): "Large sample estimation and hypothesis testing," Handbook of Econometrics, 4, 2111–2245.

Prokhorov, A. and P. Schmidt (2009): "GMM redundancy results for general missing data problems," Journal of Econometrics, 151, 47–55.

Robins, J. M. and A. Rotnitzky (1995): "Semiparametric efficiency in multivariate regression models with missing data," Journal of the American Statistical Association, 90, 122–129.

Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994): "Estimation of regression coefficients when some regressors are not always observed," Journal of the American Statistical Association, 89, 846–866.
Journal of the American statistical Association ,89, 846–866.R
OSENBAUM , P. R.
AND
D. B. R
UBIN (1983): “The central role of the propensity score in obser-vational studies for causal effects,”
Biometrika , 70, 41–55.23
HADISH , W. R., M. H. C
LARK , AND
P. M. S
TEINER (2008): “Can nonrandomized experi-ments yield accurate answers? A randomized experiment comparing random and nonrandomassignments,”
Journal of the American statistical association , 103, 1334–1344.S
ŁOCZY ´ NSKI , T.
AND
J. M. W
OOLDRIDGE (2018): “A general double robustness result for esti-mating average treatment effects,”
Econometric Theory , 34, 112–133.W
HITE , H. (1982): “Maximum likelihood estimation of misspecified models,”
Econometrica:Journal of the Econometric Society , 1–25.W
OOLDRIDGE , J. M. (2007): “Inverse probability weighted estimation for general missing dataproblems,”
Journal of Econometrics , 141, 1281–1301.——— (2010):
Econometric analysis of cross section and panel data , MIT press.24
Tables and figures for main text

Figure A.1: Empirical distribution of estimated ATE for N = 5,000
Panels: Case 1: Misspecified CEF, correct weights; Case 2: Misspecified CEF, misspecified weights; Case 3: Correct CEF, misspecified weights.
Notes: This figure plots the empirical distributions of the unweighted, ps-weighted, and d-weighted ATE estimates using 1,000 Monte Carlo simulation draws of sample size 5,000. The average treated sample size is N1 = 5,000 × 0.41 × 0.38 = 779 and the average control sample size is N0 = 5,000 × (1 − 0.41) × 0.38 = 1,121. The true ATE = 0.096 and the population is generated using a million observations. The unweighted estimator does not weight the observed data. The ps-weighted estimator weights to correct only for nonrandom assignment, and the d-weighted estimator weights by both the treatment and missing-outcome probabilities.

Figure: Bias in the linear projections (LPs) to the CQTE along X for N = 5,000, by quantile level τ
Panels: Case 1: Misspecified CQF, correct weights; Case 2: Misspecified CQF, misspecified weights.
Notes: This figure plots the bias in the unweighted, ps-weighted, and d-weighted LPs to the CQTE relative to the true population LP for N = 5,000. The average treated sample size is N1 = 779 and the average control sample size is N0 = 1,121, as above. The unweighted estimator does not weight the observed data. The ps-weighted estimator corrects only for nonrandom assignment, and the d-weighted estimator weights by both the treatment and missing-outcome probabilities.

Figure: Estimated and true CQTE along X for N = 5,000, by quantile level τ
Panel: Case 3: Correct CQF, misspecified weights.
Notes: This figure plots the average d-weighted CQTE function together with the true CQTE along X for 1,000 Monte Carlo simulation draws of sample size N = 5,000. Along with these two graphs, the figure also plots the individual functions across the 1,000 simulation draws.

Figure: Empirical distribution of estimated UQTE for N = 5,000, by quantile level τ
Panels: Case 1: Correct weights; Case 2: Misspecified weights.
Notes: This figure plots the empirical distributions of the unweighted, ps-weighted, and d-weighted UQTE estimates using 1,000 Monte Carlo simulation draws of sample size 5,000. The unweighted estimator does not weight the observed data; the ps-weighted estimator corrects only for nonrandom assignment; and the d-weighted estimator uses both the treatment and missing-outcome propensity score models to deal with the nonrandom assignment and missing outcome problems.

Figure: Bias in UQTE estimates relative to the experimental benchmark
Panels: (a) PSID-1 control group; (b) PSID-2 control group.
Notes: This graph plots the bias in the unweighted, ps-weighted, and d-weighted UQTE estimates relative to the true experimental estimates across different quantiles of the 1979 earnings distribution. Panel (a) plots the relative bias estimates using the PSID-1 comparison group and panel (b) plots the same using the PSID-2 comparison group. The treatment and missing-outcome propensity score models have been estimated as flexible logits, and the samples used for constructing these estimates have been trimmed to ensure common support across the two groups. The treatment propensity score has been estimated using the full experimental sample along with either the PSID-1 or the PSID-2 comparison group. UQTE estimates at low quantiles are omitted from the graph since these are zero.

Table A.1: Covariate means and p-values from the test of equality of two means, by treatment status
Covariates                            Treatment          Control            P(|T|>|t|)   PSID-1             P(|T|>|t|)   PSID-2             P(|T|>|t|)
Age in years                          33.37 (7.42)       33.64 (7.19)       0.46         36.73 (10.60)      0.00         34.41 (9.48)       0.11
Years of education                    10.30 (1.92)       10.27 (2.00)       0.72         11.32 (2.71)       0.00         10.55 (2.09)       0.07
Proportion of high school dropouts    0.70 (0.46)        0.69 (0.46)        0.73         0.45 (0.50)        0.00         0.59 (0.49)        0.00
Proportion married                    0.02 (0.15)        0.04 (0.20)        0.03         0.02 (0.13)        0.05         0.01 (0.10)        0.08
Proportion Black                      0.84 (0.37)        0.82 (0.39)        0.29         0.66 (0.47)        0.00         0.87 (0.34)        0.13
Proportion Hispanic                   0.12 (0.32)        0.13 (0.33)        0.59         0.02 (0.12)        0.00         0.02 (0.16)        0.00
Number of children in 1975            2.17 (1.30)        2.26 (1.32)        0.21         1.70 (1.75)        0.00         2.91 (1.73)        0.00
Real earnings in 1975                 799.88 (1931.92)   811.19 (2041.32)   0.91         7446.15 (7515.59)  0.00         2069.65 (3474.10)  0.00
Observations                          796                795                             729                             204

Notes: Along with the covariate means and standard deviations (in parentheses), the table reports p-values from the test of equality of two means. Column 4 tests for differences between the NSW treatment and control groups; columns 6 and 8 report the same using the PSID-1 and PSID-2 comparison groups, respectively. Real earnings in 1975 are expressed in 1982 dollars.
Table A.2: Covariate means and p-values from the test of equality of two means, by whether the outcome is missing or observed

Panel (a): NSW control and treatment groups
                                      Control                                             Treatment
Covariates                            Missing            Observed           P(|T|>|t|)    Missing            Observed           P(|T|>|t|)
Age                                   33.36 (7.30)       33.74 (7.15)       0.51          32.15 (7.39)       33.77 (7.40)       0.01
Years of education                    10.29 (1.93)       10.26 (2.03)       0.85          10.29 (2.05)       10.31 (1.88)       0.89
Proportion of high school dropouts    0.70 (0.46)        0.68 (0.47)        0.57          0.69 (0.46)        0.70 (0.46)        0.77
Proportion married                    0.05 (0.21)        0.04 (0.19)        0.61          0.03 (0.16)        0.02 (0.15)        0.75
Proportion Black                      0.81 (0.39)        0.82 (0.39)        0.81          0.83 (0.38)        0.84 (0.37)        0.87
Proportion Hispanic                   0.12 (0.33)        0.13 (0.33)        0.87          0.13 (0.33)        0.12 (0.32)        0.64
Number of children in 1975            2.33 (1.29)        2.23 (1.34)        0.34          2.14 (1.32)        2.19 (1.29)        0.69
Real earnings in 1975                 621.54 (1,523.00)  879.28 (2,194.93)  0.12          610.77 (1,677.36)  861.65 (2,005.53)  0.11
Observations                          795                795                              796                796

Panel (b): PSID comparison groups
                                      PSID-1                                              PSID-2
Covariates                            Missing              Observed             P(|T|>|t|)    Missing            Observed             P(|T|>|t|)
Age                                   34.00 (10.50)        37.07 (10.57)        0.01          33.32 (10.81)      34.54 (9.34)         0.62
Years of education                    11.44 (2.17)         11.30 (2.77)         0.60          11.05 (1.73)       10.49 (2.13)         0.18
Proportion of high school dropouts    0.43 (0.50)          0.45 (0.50)          0.73          0.55 (0.51)        0.59 (0.49)          0.68
Proportion married                    0.00 (0.00)          0.02 (0.14)          0.00          0.00 (0.00)        0.01 (0.10)          0.16
Proportion Black                      0.74 (0.44)          0.65 (0.48)          0.10          0.91 (0.29)        0.86 (0.35)          0.50
Proportion Hispanic                   0.01 (0.11)          0.02 (0.12)          0.82          0.05 (0.21)        0.02 (0.15)          0.62
Number of children in 1975            1.54 (1.45)          1.71 (1.78)          0.33          2.41 (1.14)        2.97 (1.79)          0.05
Real earnings in 1975                 6,927.95 (7,330.74)  7,510.92 (7,541.41)  0.50          896.56 (2,315.12)  2,211.45 (3,567.50)  0.02
Observations                          729                  729                               204                204

Notes: Along with the covariate means and standard deviations (in parentheses), the table reports p-values from the test of equality of two means between the observed and missing samples. Real earnings in 1975 are expressed in 1982 dollars.

Table A.3: Unweighted and weighted earnings comparisons and estimated training effects using NSW and PSID comparison groups
                                Unadjusted                        Adjusted                          Adjusted
Comparison group                Unw.      PS-w.     D-w.          Unw.      PS-w.     D-w.          Unw.      PS-w.     D-w.

Post-training earnings estimates
NSW                             821       848       824           845       852       828           864       850       826
N = 1,185                       (307.22)  (304.04)  (304.61)      (303.60)  (302.94)  (303.53)      (303.47)  (302.96)  (303.58)
PSID-1                          -799      827       803           298       909       907           335       905       904
N = 1,016                       (444.84)  (503.00)  (503.26)      (428.60)  (497.76)  (501.54)      (440.18)  (518.54)  (522.97)
PSID-2                          -31       569       566           492       1,040     996           698       1,082     1,049
N = 720                         (713.88)  (1041.81) (1027.12)     (664.46)  (961.74)  (953.80)      (784.28)  (1264.18) (1217.46)

Bias estimates using NSW control
PSID-1                          -1,620    169       156           -493      -40       -21           -568      -38       -21
N = 1,001                       (431.75)  (561.74)  (553.07)      (427.93)  (499.91)  (501.44)      (434.59)  (504.19)  (507.02)
PSID-2                          -853      -228      -212          -109      207       200           -378      -17       -24
N = 705                         (707.87)  (1041.44) (1025.87)     (663.80)  (962.85)  (954.61)      (759.75)  (1195.47) (1156.39)

Adjusted covariates
Pre-training earnings (1975)                                      ✓         ✓         ✓             ✓         ✓         ✓
Age                                                               ✓         ✓         ✓             ✓         ✓         ✓
Age squared                                                       ✓         ✓         ✓             ✓         ✓         ✓
Education                                                         ✓         ✓         ✓             ✓         ✓         ✓
High school dropout                                               ✓         ✓         ✓             ✓         ✓         ✓
Black                                                             ✓         ✓         ✓             ✓         ✓         ✓
Hispanic                                                          ✓         ✓         ✓             ✓         ✓         ✓
Marital status                                                    ✓         ✓         ✓             ✓         ✓         ✓
Number of children (1975)                                                                           ✓         ✓         ✓
Notes: This table reports unadjusted and adjusted post-training earnings differences between the NSW treatment group and three different comparison groups, namely the NSW control, PSID-1, and PSID-2. The first row reports experimental training estimates, which combine the NSW treatment and control groups, whereas the second and third rows report non-experimental estimates computed using the PSID-1 and PSID-2 groups, respectively. Each non-experimental estimate should be compared to the experimental benchmark. The second panel reports bias estimates computed by combining the NSW control group with the PSID-1 and PSID-2 comparison groups, respectively; these represent a second measure of bias, which should be compared to zero. Bootstrapped standard errors, constructed using 10,000 replications, are given in parentheses. All values are in 1982 dollars. The samples used for estimating the training and bias estimates have been trimmed to ensure common support in the distribution of weights for the treatment and comparison groups. For more detail, see Appendix E.

Online Appendix

Akanksha Negi

November 23, 2020
Abstract
In this online appendix, section A provides details of the simulation study. Section B discusses an extension of the doubly weighted framework to the estimation of unconditional quantile treatment effects using recentered influence functions. Section C provides a simple extension to the case in which the treatment assumes multiple values. Section D provides the asymptotic variance expressions for the average treatment effect under the first and second halves of the asymptotic theory. Section E provides background information on the National Supported Work demonstration, along with details on augmenting Calónico and Smith (2017)'s sample for missing information and the trimming rules for the probability weights. Section F contains proofs of results in the main text. Finally, sections G and H provide supplementary tables and figures, respectively.
A Simulation details
This section outlines the details of the simulation study evaluating the finite-sample behavior of the unweighted, ps-weighted, and d-weighted (doubly weighted) estimators of the ATE and QTE parameters. For each data generating process, the population is generated using a million observations. The empirical distributions of the ATE and QTE estimators are simulated by drawing random vectors {(Y_i, X_i, W_i, S_i); i = 1, 2, ..., N} of size N one thousand times without replacement from the population. This is done to mimic the setting of "random sampling" from an infinite population.

A.1 Average treatment effect
To allow for possible misspecification of the regression functions $E[Y(g)|X]$, I simulate two binary potential outcomes generated using a probit, as follows:
$$Y(g) = \begin{cases} 1, & Y^*(g) > 0 \\ 0, & Y^*(g) \le 0 \end{cases} \qquad Y^*(g) = X\theta_g + U(g)$$
where $X$ includes an intercept. The linear index $X\theta_g$ is parameterized so that the covariates are only mildly predictive of the potential outcomes in the population. The two covariates and the two latent errors are drawn from two independent bivariate normal distributions,
$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \cdot \\ \cdot & 1 \end{pmatrix} \right) \quad \text{and} \quad \begin{pmatrix} U(0) \\ U(1) \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \cdot \\ \cdot & 1 \end{pmatrix} \right) \tag{A.1}$$
The assignment and missing-outcome mechanisms are simulated to ensure that unconfoundedness and MAR are satisfied:
$$W = \begin{cases} 1, & W^* > 0 \\ 0, & W^* \le 0 \end{cases} \qquad S = \begin{cases} 1, & S^* > 0 \\ 0, & S^* \le 0 \end{cases} \tag{A.2}$$
where
$$W^* = X\gamma + \nu, \qquad S^* = Z\delta + \upsilon$$
with the errors $\nu$ and $\upsilon$ drawn from two independent standard logistic distributions. Misspecification in the true assignment and missing-outcome distributions is allowed in both the functional form and the linear index: for the misspecified cases, I estimate a probit with $X_2$ omitted from the linear index. For scenarios where the conditional mean is misspecified, I estimate a linear model with a correct index. The parameters $\gamma$ and $\delta$, indexing the assignment and missingness mechanisms, are chosen so that the average propensity of assignment is 41% and the average propensity of being observed is 38%. The missing data are simulated to imitate empirical settings in which a significant portion of the outcomes is missing. The following table gives an estimation summary for the different cases of misspecification.
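Before turning to that summary, the data generating process described above can be sketched in Python. The coefficient vectors and the correlation parameter below are illustrative placeholders, since the paper's exact values are not recoverable from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5_000

# Placeholder coefficients: the exact theta_g, gamma, delta, and error
# correlations used in the paper are not recoverable, so these are illustrative.
theta0 = np.array([0.0, 0.1, 0.1])
theta1 = np.array([-0.1, 0.2, 0.1])
gamma = np.array([-0.4, -0.2, -0.1])
delta = np.array([-0.5, 0.1, 0.1, -0.1])
rho = 0.2  # assumed off-diagonal entry of the covariance matrices in (A.1)

# Correlated covariates and latent errors; X includes an intercept
X12 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=N)
X = np.column_stack([np.ones(N), X12])
U = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=N)

# Binary potential outcomes from the latent probit indices Y*(g) = X theta_g + U(g)
Y0 = (X @ theta0 + U[:, 0] > 0).astype(int)
Y1 = (X @ theta1 + U[:, 1] > 0).astype(int)

# Assignment and missingness with independent standard logistic errors (A.2),
# so that P(W=1|X) and P(S=1|W,X) are logistic functions of the linear indices
W = (X @ gamma + rng.logistic(size=N) > 0).astype(int)
Z = np.column_stack([np.ones(N), W, X12])  # Z = (1, W, X1, X2)
S = (Z @ delta + rng.logistic(size=N) > 0).astype(int)

# Observed outcome: missing whenever S = 0
Y = np.where(W == 1, Y1, Y0).astype(float)
Y[S == 0] = np.nan
```

The unweighted, ps-weighted, and d-weighted estimators would then be computed on the rows with S = 1 only.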
Table A.1: Estimation summary for different cases of misspecification

Scenario    CEF: model (C/M)      G(·): model (C/M)        R(·): model (C/M)
1           Xθ_g (M)              Λ(Xγ) (C)                Λ(Zδ) (C)
2           Xθ_g (M)              Φ(X(1)γ(1)) (M)          Φ(Z(1)δ(1)) (M)
3           Φ(Xθ_g) (C)           Φ(X(1)γ(1)) (M)          Φ(Z(1)δ(1)) (M)

Notes: C and M denote whether the estimated model is correctly specified or misspecified. X and Z both include an intercept; X(1) and Z(1) are the subsets of X and Z left after omitting X_2. G(·) refers to the propensity score model and R(·) to the missing-outcomes probability model. With cross-sectional data, covariates are typically only mildly predictive of the outcome; for example, in the National Supported Work dataset from Calónico and Smith (2017), baseline factors explain about 26-50 percent of the variation in the non-experimental sample and about 0.04-2 percent in the experimental sample, depending on the included subset of covariates. The true models imply $P(W = 1|X) \equiv p(X) = \Lambda(X\gamma)$ and $P(S = 1|W, X) \equiv r(X, W) = \Lambda(Z\delta)$, where $\Lambda(\cdot)$ is the standard logistic CDF and $Z = (1, W, X_1, X_2)$.

A.2 Quantile treatment effects

To ensure that the marginal quantiles of the potential outcome distributions are unique with no flat spots, I simulate two continuous non-negative outcomes as follows:
$$Y(g) = \exp[X\theta_g + U(g)], \qquad g = 0, 1$$
where $\theta_0$ and $\theta_1$ are parameterized so that the covariates are mildly predictive of the outcomes in the population. The two covariates and the two latent errors are drawn from two independent bivariate normal distributions following (A.1). The missing outcomes and the treatment assignment mechanisms are also generated according to (A.2).
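The characterization of the CQTE below relies on the fact that quantiles commute with strictly increasing transformations such as exp(·). A quick, self-contained numerical check of this equivariance property:

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=100_000)  # stand-in for the latent index X*theta_g + U(g)
tau = 0.75

# With method="lower" the sample quantile is an order statistic, so a strictly
# increasing transform commutes with it exactly.
q_of_exp = np.quantile(np.exp(v), tau, method="lower")
exp_of_q = np.exp(np.quantile(v, tau, method="lower"))
assert np.isclose(q_of_exp, exp_of_q)

# For a standard normal index, the tau-quantile is Phi^{-1}(tau), about 0.6745
# at tau = 0.75, so the sample quantile should be close to that value.
assert abs(np.quantile(v, tau) - 0.6745) < 0.05
```

The `method` keyword of `np.quantile` requires NumPy 1.22 or later; with the default linearly interpolated quantile, the identity holds only approximately.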
Since $\exp(\cdot)$ is an increasing continuous function, the equivariance property of quantiles implies that
$$Q_\tau[Y(g)|X] = Q_\tau\!\left[\exp(X\theta_g + U(g)) \,\middle|\, X\right] = \exp\!\left[ Q_\tau(X\theta_g + U(g)|X) \right] = \exp\!\left[ X\theta_g + Q_\tau(U(g)|X) \right] = \exp\!\left[ X\theta_g + \Phi^{-1}(\tau) \right]$$
where $\Phi^{-1}(\tau)$ is the inverse standard normal CDF evaluated at $\tau$. This equivariance property helps to characterize and estimate the CQTE in the cases where the CQF is correctly specified. The three different cases of misspecification are enumerated in Table A.2 below. Case 1 corresponds to the situation for which results are derived in section 4; Case 2 allows for misspecification in both the conditional quantile function and the probability weights. Even though the theory in this paper does not address that specific case, the simulation results show that the proposed estimator has the lowest bias among the three alternatives. Finally, Case 3 relates to situations considered in section 5: a correct CQF but misspecified weights.

Table A.2: Estimation summary for quantile effects under different cases of misspecification
Scenario    CQF: model (C/M)      G(·): model (C/M)        R(·): model (C/M)
1           Xθ_g(τ) (M)           Λ(Xγ) (C)                Λ(Zδ) (C)
2           Xθ_g(τ) (M)           Φ(X(1)γ(1)) (M)          Φ(Z(1)δ(1)) (M)
3           exp(Xθ_g(τ)) (C)      Φ(X(1)γ(1)) (M)          Φ(Z(1)δ(1)) (M)

Notes: C and M denote whether the estimated model is correctly specified or misspecified. X and Z both include an intercept; X(1) and Z(1) are the subsets of X and Z left after omitting X_2. Therefore, the probability models are misspecified in both the functional form and the linear index dimension. G(·) refers to the propensity score model and R(·) to the missing-outcomes probability model.

For plotting the estimated and true CQTE functions, I first collect the estimates that solve the unweighted, ps-weighted, and doubly weighted CQR problems (defined in (19)) at a given quantile level $\tau$ across the 1,000 Monte Carlo simulation draws. I then draw a linearly spaced vector of values for $X$ and simulate the CQTE using the 1,000 estimated conditional quantile coefficient vectors. Averaging these 1,000 functions at each point on the $X$ vector gives the estimated average CQTE function. I plot this along with the 1,000 individual functions and the true CQTE, which is calculated using the population conditional quantile parameters $\theta_g$.

B Unconditional quantile treatment effect using recentered influence functions
This section discusses an alternative method of estimating the UQTE using the recentered influence function (RIF) methodology of Firpo et al. (2009) (FFL, hereafter).

Following FFL, let $v(F)$ be a real-valued functional, $v: \mathcal{F}_v \to \mathbb{R}$, whose domain $\mathcal{F}_v$ is a class of distribution functions such that $F \in \mathcal{F}_v$ if $|v(F)| < +\infty$. One may define $v(\cdot)$ to be any distributional statistic of interest, such as the mean, variance, quantiles, or inequality indices. We can define various treatment effects as the difference in the functionals of the marginal outcome distributions:
$$\Delta_v = v_1 - v_0 \tag{B.1}$$
where $v_g \equiv v(F_g)$ is the functional of the distribution function of $Y(g)$. As defined in FFL, the RIF is simply the influence function recentered at the statistic $v_g$. Formally,
$$RIF(Y(g); v, F_g) = v(F_g) + IF(Y(g); v, F_g) \tag{B.2}$$
where $IF(Y(g); v, F_g)$ captures the change in $v_g$ resulting from an infinitesimal perturbation of the distribution $F_g$. FFL introduce the idea of running a standard regression of the RIF on $X$, with the objective of estimating the function
$$E\left[ RIF(Y(g); v, F_g) \mid X \right] = X\theta_g$$
One can then use the law of iterated expectations to express $v_g$ in terms of the regression function:
$$E\left[ E\left( RIF(Y(g); v, F_g) \mid X \right) \right] = v_g \tag{B.3}$$
For $v_g = Q_{\tau,g}$, equation (B.1) defines the UQTE for the $\tau$th quantile. The RIF for $Q_{\tau,g}$ is given by:
$$RIF(Y(g); Q_\tau, F_g) = Q_{\tau,g} + \frac{\tau - 1\{Y(g) \le Q_{\tau,g}\}}{f_g(Q_{\tau,g})} \tag{B.4}$$
where $f_g(\cdot)$ is the density of $Y(g)$. (Note that Firpo and Pinto (2016) use this formulation to consider inequality treatment effects by exclusively taking $v$ to be different inequality measures. Note also that FFL express the conditional RIF expectation as $E[RIF(Y(g); Q_\tau, F_g)|X] = c_{1,\tau,g} \cdot P[Y(g) > Q_{\tau,g}|X] + c_{2,\tau,g}$, where $c_{1,\tau,g} = 1/f_g(Q_{\tau,g})$ and $c_{2,\tau,g} = Q_{\tau,g} - c_{1,\tau,g} \cdot (1 - \tau)$ for the $\tau$th quantile of $Y(g)$.)

Estimation of the doubly weighted UQTE using RIFs then involves the following steps:

a. $\hat\theta_g = \left( N^{-1} \sum_{i=1}^N \hat\omega_{ig}\, X_i' X_i \right)^{-1} N^{-1} \sum_{i=1}^N \hat\omega_{ig}\, X_i' \cdot \widehat{RIF}(Y_i; \hat Q_\tau, \hat F_g)$

b. $\widehat{RIF}(Y(g); \hat Q_\tau, \hat F_g) = \hat Q_{\tau,g} + \dfrac{\tau - 1\{Y(g) \le \hat Q_{\tau,g}\}}{\hat f_g(\hat Q_{\tau,g})}$, where $\hat f_g(y)$ is a nonparametric kernel density estimator with bandwidth $h_g$

c. $\hat f_g(\hat Q_{\tau,g}) = \dfrac{1}{N} \sum_{i=1}^N \dfrac{\hat\omega_{ig}}{h_g}\, K_g\!\left( \dfrac{Y_i - \hat Q_{\tau,g}}{h_g} \right)$

d. $\hat Q_{\tau,g} = \operatorname{argmin}_{Q_g} \sum_{i=1}^N \hat\omega_{ig} \cdot c_\tau(Y_i - Q_g)$

e. $\hat\omega_{i1} = \dfrac{S_i \cdot W_i}{R(X_i, W_i, \hat\delta) \cdot G(X_i, \hat\gamma)}$ and $\hat\omega_{i0} = \dfrac{S_i \cdot (1 - W_i)}{R(X_i, W_i, \hat\delta) \cdot (1 - G(X_i, \hat\gamma))}$

Double weighting has to be performed at each stage that uses the observed sample. This implies that, for consistency of the UQTE, the weights necessarily have to be correctly specified. One may estimate the weights nonparametrically using sieves to sidestep this issue of misspecification. Estimating the UQTE in this manner also has the advantage, initially put forth in FFL, that one can directly estimate the effect of covariates on the UQTE.

C Multivalued treatments
One can easily extend the binary treatment case considered here to the case of multiple treatment values. Let $Y(g)$ denote the potential outcome for treatment level $g$, where $g = 0, 1, \ldots, T$, and let $W_g$ be a binary indicator for receiving treatment level $g$, such that
$$W_0 + W_1 + \ldots + W_T = 1, \qquad P(W_g = 1) \equiv \rho_g > 0$$
Also, let $W = (W_0, W_1, \ldots, W_T)$. The observed outcome is then
$$Y = W_0 \cdot Y(0) + W_1 \cdot Y(1) + \ldots + W_T \cdot Y(T)$$
Let $\rho_g(x) \equiv P(W_g = 1 \mid X = x)$ be the propensity score and $r(x, w) \equiv P(S = 1 \mid X = x, W_g = w)$ the missing-outcomes probability for treatment level $g$. One may then consider solving the same population problem, $Q(\theta)$, but with true weights given by
$$\omega_g = \frac{S \cdot W_g}{r(X, W_g) \cdot \rho_g(X)}$$
To construct the doubly weighted estimator, we would assume unconfoundedness and MAR, along with parametric models $R(X, W_g, \delta)$ and $G(X, \gamma_g)$ for the two probability weights.

D Asymptotic variance for ATE
Given $\sqrt{N}$-consistent and asymptotically normal estimators $\hat\theta_0$ and $\hat\theta_1$, the estimated average treatment effect
$$\hat\Delta_{ate} = \frac{1}{N}\sum_{i=1}^{N} m(X_i, \hat\theta_1) - \frac{1}{N}\sum_{i=1}^{N} m(X_i, \hat\theta_0)$$
is easily shown to also be $\sqrt{N}$-consistent and asymptotically normal [Wooldridge (2010), chapter 21]. Regularity conditions for such a result require that the parametric model $m(X, \theta_g)$ be continuously differentiable on the parameter space $\Theta_g \subset \mathbb{R}^{P_g}$ and that $\theta_g$ lie in the interior of $\Theta_g$. Then, by the continuous mapping theorem and Slutsky's theorem,
$$\sqrt{N}\left(\hat\Delta_{ate} - \Delta_{ate}\right) \overset{d}{\to} N(0, V)$$
where $V = E[\psi(X_i)\psi(X_i)']$. Denoting $E[\nabla_{\theta_g} m(X_i, \theta_g)] \equiv J_g$, the influence function is
$$\psi(X_i) = \left\{ m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate} \right\} - J_1 H_1^{-1} u_{i1} + J_0 H_0^{-1} u_{i0}$$
where $H_g$ is the Hessian for treatment group $g$ and $u_{ig}$ is the residual from the regression of the weighted score on the scores of the two probability models. For the case when the conditional mean model is correctly specified, the variance expression simplifies to
$$V = E\left[\left( m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate} \right)^2\right] + J_1 V_1 J_1' + J_0 V_0 J_0' \tag{D.1}$$
Here $V_1$ and $V_0$ are the asymptotic variances of the doubly weighted estimators that solve the treatment and control group problems, respectively. The formula makes clear that it is better to use more efficient estimators $\hat\theta_g$. But we know from the results in section 5 that when the conditional mean model is correctly specified, using estimated weights is as efficient as using known weights. Another alternative in this case is to use unweighted estimators of $\theta_g$, since under GCIME the unweighted estimator is more efficient than the doubly weighted estimator of $\theta_g$.

For the case when the mean model is misspecified, the asymptotic variance of the ATE is
$$V = E\left[\left( m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate} \right)^2\right] + J_1 V_1 J_1' + J_0 V_0 J_0' - 2\,E\left[\left\{ m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate} \right\} u_{i1}'\right] H_1^{-1} J_1' + 2\,E\left[\left\{ m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate} \right\} u_{i0}'\right] H_0^{-1} J_0' \tag{D.2}$$
In this case the variance expression is more complicated than before. Even though it is better to have more efficient estimators of $\theta_g$ here as well, it is not obvious whether that would yield a smaller variance for the ATE, since cross-correlation terms now appear in the variance expression.

D.1 Proofs

Asymptotic variance expression for ATE: correctly specified mean model. Assuming continuous differentiability of $m(X_i, \theta_g)$ on $\Theta_g$, a mean value expansion around $\theta_g$ gives
$$\frac{1}{N}\sum_{i=1}^{N} m(X_i, \hat\theta_g) = \frac{1}{N}\sum_{i=1}^{N} m(X_i, \theta_g) + \frac{1}{N}\sum_{i=1}^{N} \nabla_{\theta_g} m(X_i, \tilde\theta_g)\,(\hat\theta_g - \theta_g)$$
where $\tilde\theta_g$ lies between $\hat\theta_g$ and $\theta_g$. Since $\hat\theta_g \overset{p}{\to} \theta_g$, so does $\tilde\theta_g$. Hence, using the weak law of large numbers, we obtain
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N} m(X_i, \hat\theta_g) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} m(X_i, \theta_g) + J_g \cdot \sqrt{N}(\hat\theta_g - \theta_g) + o_p(1)$$
Adding and subtracting $\sqrt{N} \cdot E[m(X_i, \theta_g)]$ on both sides gives
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N} \left\{ m(X_i, \hat\theta_g) - E[m(X_i, \theta_g)] \right\} = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} \left\{ m(X_i, \theta_g) - E[m(X_i, \theta_g)] \right\} + J_g \cdot \sqrt{N}(\hat\theta_g - \theta_g) + o_p(1)$$
Then, using the asymptotic results from section 5, where we posit that the conditional feature of interest is correctly specified, we have
$$\sqrt{N}(\hat\theta_g - \theta_g) = -H_g^{-1} \frac{1}{\sqrt{N}}\sum_{i=1}^{N} l_{ig} + o_p(1), \qquad g = 0, 1$$
Therefore,
$$\sqrt{N}\left(\hat\Delta_{ate} - \Delta_{ate}\right) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left( \left\{ m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate} \right\} - J_1 H_1^{-1} l_{i1} + J_0 H_0^{-1} l_{i0} \right) + o_p(1)$$
We may rewrite the above using the influence function representation as
$$\sqrt{N}\left(\hat\Delta_{ate} - \Delta_{ate}\right) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} \psi(X_i) + o_p(1), \qquad E[\psi(X_i)] = 0$$
Then, provided that $E[\psi(X_i)\psi(X_i)']$ exists,
$$\mathrm{Avar}\left[\sqrt{N}\left(\hat\Delta_{ate} - \Delta_{ate}\right)\right] = E\left[\left( m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate}\right)^2\right] + J_1 V_1 J_1' + J_0 V_0 J_0'$$
Note that the covariance term involving $l_{i1}$ and $l_{i0}$ is zero since they are the scores for the treatment and control group problems, respectively. The covariance terms involving $\{ m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate} \}$ and $l_{ig}$ do not vanish merely because $\theta_g$ solves the conditional problem. However, using the fact that $E[h(Y_i(g), X_i, \theta_g) \mid X_i] = 0$ along with the LIE, those covariance terms can be shown to be zero.

Misspecified mean model. In the case of a misspecified mean model, we still have
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N} \left\{ m(X_i, \hat\theta_g) - E[m(X_i, \theta_g)] \right\} = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} \left\{ m(X_i, \theta_g) - E[m(X_i, \theta_g)] \right\} + J_g \cdot \sqrt{N}(\hat\theta_g - \theta_g) + o_p(1)$$
Now, using the results from section 4,
$$\sqrt{N}(\hat\theta_g - \theta_g) = -H_g^{-1} \frac{1}{\sqrt{N}}\sum_{i=1}^{N} \left\{ l_{ig} - E(l_{ig} b_i')\,E(b_i b_i')^{-1} b_i - E(l_{ig} d_i')\,E(d_i d_i')^{-1} d_i \right\} + o_p(1) = -H_g^{-1} \frac{1}{\sqrt{N}}\sum_{i=1}^{N} u_{ig} + o_p(1), \qquad g = 0, 1$$
Then,
$$\sqrt{N}\left(\hat\Delta_{ate} - \Delta_{ate}\right) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left( \left\{ m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate} \right\} - J_1 H_1^{-1} u_{i1} + J_0 H_0^{-1} u_{i0} \right) + o_p(1) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} \psi(X_i) + o_p(1)$$
and
$$\mathrm{Avar}\left[\sqrt{N}\left(\hat\Delta_{ate} - \Delta_{ate}\right)\right] = E\left[\left( m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate}\right)^2\right] + J_1 V_1 J_1' + J_0 V_0 J_0' - 2\,E\left[\left\{ m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate} \right\} u_{i1}'\right] H_1^{-1} J_1' + 2\,E\left[\left\{ m(X_i, \theta_1) - m(X_i, \theta_0) - \Delta_{ate} \right\} u_{i0}'\right] H_0^{-1} J_0'$$

D.2 Practical advice for obtaining doubly weighted ATE estimates
An easy way to obtain the doubly weighted estimates $\hat\theta_g$ for estimating the ATE is to combine the treatment and control group problems into a one-step GMM procedure. Essentially, this means stacking the moment conditions from the first and second steps, which can then be solved jointly via GMM. Since there are no over-identifying restrictions in the doubly weighted framework, one-step estimation of $\theta_g$ is equivalent to two-step estimation. Then, suppressing explicit dependence on the data,
$$\bar m(\theta_0, \theta_1, \gamma, \delta) = \frac{1}{N}\sum_{i=1}^{N} m_i(\theta_0, \theta_1, \gamma, \delta) = \frac{1}{N} \begin{pmatrix} (N/N_0) \sum_{i=1}^{N} m_i(\theta_0, \gamma, \delta) \\ (N/N_1) \sum_{i=1}^{N} m_i(\theta_1, \gamma, \delta) \\ \sum_{i=1}^{N} m_i(\gamma) \\ \sum_{i=1}^{N} m_i(\delta) \end{pmatrix}$$
where $N_0$ and $N_1$ denote the control and treated sample sizes and
$$m_i(\theta_0, \gamma, \delta) = \frac{S_i \cdot (1 - W_i)}{R(X_i, W_i, \hat\delta) \cdot (1 - G(X_i, \hat\gamma))} \cdot \nabla_{\theta_0}\, q(Y_i(0), X_i, \theta_0)'$$
$$m_i(\theta_1, \gamma, \delta) = \frac{S_i \cdot W_i}{R(X_i, W_i, \hat\delta) \cdot G(X_i, \hat\gamma)} \cdot \nabla_{\theta_1}\, q(Y_i(1), X_i, \theta_1)'$$
$$m_i(\gamma) = \nabla_\gamma G(X_i, \gamma)' \cdot \frac{W_i - G(X_i, \gamma)}{G(X_i, \gamma) \cdot (1 - G(X_i, \gamma))}$$
$$m_i(\delta) = \nabla_\delta R(X_i, W_i, \delta)' \cdot \frac{S_i - R(X_i, W_i, \delta)}{R(X_i, W_i, \delta) \cdot (1 - R(X_i, W_i, \delta))}$$
The example code below uses Stata's gmm command to compute the doubly weighted ATE estimate.
Example code using Stata's gmm:

local Rhat = "exp(b31+b32*w+b33*x1+b34*x2)/(1+exp(b31+b32*w+b33*x1+b34*x2))"
local Ghat = "exp(b21+b22*x1+b23*x2)/(1+exp(b21+b22*x1+b23*x2))"
gmm ((-2*s*(1-w)/(`Rhat'*(1-`Ghat')))*(y-b00-b01*x1-b02*x2)*(n/nc)) ///
    ((-2*s*w/(`Rhat'*`Ghat'))*(y-b10-b11*x1-b12*x2)*(n/nt)) ///
    (w-exp(b21+b22*x1+b23*x2)/(1+exp(b21+b22*x1+b23*x2))) ///
    (s-exp(b31+b32*w+b33*x1+b34*x2)/(1+exp(b31+b32*w+b33*x1+b34*x2))), ///
    instruments(1 2 3: x1 x2) instruments(4: w x1 x2) winitial(identity) ///
    nocommonesample onestep from(b00 0.1 b01 0.1 b02 0.1 b10 0.1 b11 0.1 b12 ///
    0.1 b21 0.1 b22 0.1 b23 0.1 b31 0.1 b32 0.1 b33 0.1 b34 0.1)
Then, using the GMM estimates, one can compute the average treatment effect as

gen y0hat = _b[b00:_cons] + _b[b01:_cons]*x1 + _b[b02:_cons]*x2
gen y1hat = _b[b10:_cons] + _b[b11:_cons]*x1 + _b[b12:_cons]*x2
egen ate = mean(y1hat - y0hat)
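The same double weighting scheme can also be coded directly outside Stata. Below is a minimal Python sketch of the two-step version: it fits the two logit weighting models by Newton-Raphson, forms the inverse probability weights, solves the two weighted least squares problems with linear conditional means, and averages the fitted values. This is illustrative only; the variable names and the simple linear mean specification are assumptions, not the paper's exact implementation.

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Fit a logit model by Newton-Raphson; X must include a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        p = np.clip(p, 1e-10, 1 - 1e-10)
        grad = X.T @ (y - p)                         # score
        hess = (X * (p * (1 - p))[:, None]).T @ X    # information matrix
        beta += np.linalg.solve(hess, grad)
    return beta

def doubly_weighted_ate(y, w, s, X):
    """Two-step doubly weighted ATE with linear means for E[Y(g)|X].
    y: outcome (only meaningful where s == 1), w: treatment dummy,
    s: observed-outcome dummy, X: covariate matrix without constant."""
    n = len(w)
    Xc = np.column_stack([np.ones(n), X])
    Ghat = 1.0 / (1.0 + np.exp(-Xc @ fit_logit(Xc, w)))            # P(W=1|X)
    Z = np.column_stack([Xc, w])
    Rhat = 1.0 / (1.0 + np.exp(-Z @ fit_logit(Z, s)))              # P(S=1|X,W)
    means = {}
    for g in (0, 1):
        idx = (w == g) & (s == 1)
        pg = Ghat if g == 1 else 1.0 - Ghat
        om = 1.0 / (Rhat[idx] * pg[idx])                           # composite weight
        Xg, yg = Xc[idx], y[idx]
        theta = np.linalg.solve((Xg * om[:, None]).T @ Xg,
                                (Xg * om[:, None]).T @ yg)         # weighted LS
        means[g] = np.mean(Xc @ theta)                             # average over full sample
    return means[1] - means[0]
```

As in the text, analytically correct standard errors require a first-stage adjustment; in practice one would bootstrap the whole procedure.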
Since I am estimating the two probability models as logits, the last two moments simplify to
\[
m_i(\gamma) = X_i'\cdot\big(W_i - \Lambda(X_i\gamma)\big), \qquad m_i(\delta) = Z_i'\cdot\big(S_i - \Lambda(Z_i\delta)\big)
\]
where \(Z_i \equiv (X_i, W_i)\). Even though this one-step estimation allows us to obtain variance estimates \(\hat V_0\) and \(\hat V_1\) for \(\hat\theta_0\) and \(\hat\theta_1\), respectively, obtaining analytically correct standard errors for the estimated ATE requires additional work. A command that implements the correct standard errors is still in the works. Meanwhile, one can use bootstrapped standard errors, which provide asymptotically correct inference.

E Appendix to CS (2017) Application
E.1 Description of National Supported Work Program
The NSW was a transitional and subsidized work experience program that was mainly intended to target four sub-populations: ex-offenders, former drug addicts, women on AFDC welfare, and high school dropouts. The program became operational in 1975 and continued until 1979 at fifteen locations in the United States. In ten of these sites, the program operated as a randomized experiment in which individuals who qualified for the training program were randomly assigned to either the treatment or control group. At the time of enrollment in April 1975, individuals were given a retrospective baseline survey, which was then followed by four follow-up interviews conducted at nine-month intervals. The survey data were collected using these baseline and follow-up interviews over a period of four years. The data included measurements of baseline covariates such as age, years of education, number of children in 1975, high school dropout status, marital status, two race indicators for the black and Hispanic sub-populations, and other demographic and socio-economic information. The main outcome of interest was real earnings in the post-training year of 1979.
E.2 Augmenting the CS sample to account for missing earnings in 1979
I obtain the data from CS's supplementary data files in the Journal of Labor Economics, where the authors recreate the experimental sample of AFDC women using the raw public use data files maintained by the Inter-University Consortium for Political and Social Research (ICPSR). Then, I use the PSID cross file provided by CS along with other supplementary data files to add back the individuals whom CS originally dropped from the analysis for not having valid earnings information between 1975-1979. For this, I apply the same filters applied by CS, who use them to match their PSID samples to the ones used by LaLonde (1986). These filters involve keeping all female household heads continuously from 1975-1979 who were between 20 and 55 years of age. (Footnote: The AFDC program is administered and funded by the federal and state governments and is meant to provide financial assistance to needy families. Source: US Census Bureau. Beyond the main eligibility criteria applied to all four target populations, the AFDC group was subject to two additional criteria: a) no child below 6 years of age, and b) on AFDC welfare for at least 30 of the last 36 months. Out of the 10 sites, 7 served AFDC women, with random assignment at one or more of these sites in operation from Feb 1976-Aug 1977 (CS (2017)).) This constitutes the first non-experimental sample that CS use in their analysis, which they call the PSID-1 sample. The second PSID sample, which they label PSID-2, further restricts the PSID-1 sample to include only those women who received AFDC welfare in 1975. In order to compare my sample with the original sample used by CS, I first apply all the above-mentioned filters and create a dummy variable which I call "cs". Next, I remove the filter that requires the women to be continuous household heads and instead only impose that filter for 1975 and 1976. The reason this filter is imposed for the years 1975 and 1976 but not for any other years is that, in the PSID datasets, the income information in a particular year corresponds to the previous calendar year. Merging the cross file with the separate single-year files for 1975 and 1976 thus guarantees that only those women are included who do not have any missing earnings information for the pre-training years of 1974 and 1975. This is important since pre-training earnings are treated like any other baseline covariate in this paper, on which I do not allow any missing information. After merging the cross-year individual file with the single-year family files, I then merge this PSID dataset with the NSW dataset using CS's .do files and generate the various sample dummies essentially in the same manner as they do. After this, I further restrict the sample to include only those women who have valid earnings information in 1975, which is the pre-training year for AFDC women.
I also drop the cases where the measured age or education is less than zero. In order to make sure that any observations not used by CS correspond only to the ones that have missing post-program earnings, I also drop observations that do not satisfy the CS criteria but have observed earnings in 1979.
E.3 Treatment and missing outcome probability specifications and sample trimming
In this application, I estimate three sets of treatment assignment and missing outcome probability models, depending upon which comparison group is used for obtaining the estimates. For the experimental estimates, I use the experimental treatment and control groups to estimate the propensity score model. For the PSID-1 estimates, I consider the NSW experimental observations to be the treatment group and use PSID-1 as the control group. For estimating the PSID-2 propensity score model, I switch to PSID-2 as the comparison control group. For estimating the missing outcome probability models, I include the treatment indicator depending upon the comparison group, as mentioned above. The probability models are estimated as logits and include the following covariates in their specification. For the treatment probability, I include real earnings in 1974 and 1975 along with an indicator variable for whether the individual had any zero earnings in 1974 and 1975. Beyond these, I also include age, age squared, education, high school dropout status, the race indicators for black and Hispanic, as well as the number of children in 1975. CS also add some interaction terms in their propensity score specification, which I do not; I noticed that allowing for those terms in my specifications drove the final weights for many women in the sample. (Footnote: For the additional filters that CS impose, see their supplementary material provided in JLE. Even though the two PSID comparison groups are not perfectly representative of women who would have proven eligible for NSW, there is no clear alternative, since the PSID data lack the detailed covariate information that would be needed to impose the full eligibility criteria on the PSID sample.) The composite weight used for the doubly weighted estimates is

weight = (w/Ghat + (1-w)/(1-Ghat))*(s/Rhat)
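The composite weight above, together with a simple overlap-based trimming rule, can be written compactly. The numpy sketch below is illustrative: the function names are hypothetical, and the min/max overlap rule is one common choice, not necessarily the exact trimming rule used in the paper.

```python
import numpy as np

def composite_weight(w, s, ghat, rhat):
    """Composite weight from the text: (w/Ghat + (1-w)/(1-Ghat)) * (s/Rhat).
    w: treatment dummy, s: observed-outcome dummy,
    ghat: estimated P(W=1|X), rhat: estimated P(S=1|X,W)."""
    return (w / ghat + (1 - w) / (1 - ghat)) * (s / rhat)

def common_support_mask(prob, w):
    """Keep observations whose estimated probability lies in the overlap
    of the treated (w == 1) and control (w == 0) supports."""
    lo = max(prob[w == 1].min(), prob[w == 0].min())
    hi = min(prob[w == 1].max(), prob[w == 0].max())
    return (prob >= lo) & (prob <= hi)
```

An observation with a missing outcome (s = 0) receives zero weight, so it drops out of the weighted objective while still contributing to the estimated weighting models.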
The trimming threshold for the ps-weighted estimates is kept the same as for computing the doubly weighted estimates, since the overlap problem was relatively more severe when using the composite weights than when using propensity scores only. The graphs below plot the kernel density of the probabilities Rhat*Ghat for the treatment group and Rhat*(1-Ghat) for the control group. The common support problem, due to which the samples were appropriately trimmed, can be seen in figure E.1. Additionally, figures E.2 and E.3 plot the estimated distributions of the propensity score and the missing outcome probability, where panels (a)-(c) display these for the three treatment and comparison group combinations. A couple of points emerge from the estimated graphs. For figure E.2, panel (a), we see that the treatment and control distributions appear very similar, confirming the strong role of randomization in producing groups that are balanced in terms of covariates. For panel (b), we see that the experimental observations have a relatively high probability of being treated, whereas the control group has low probabilities. Note, however, that the common support condition holds quite strongly for the PSID-1 group. In panel (c), while the estimated distribution for the treated units still has a higher mean, the PSID-2 comparison group distribution is more similar to it than the PSID-1 distribution in panel (b). These findings suggest that nonrandom assignment is predicted well by the covariates in the propensity score specifications. The same cannot be said for the estimated missing outcome probabilities, where panels (b) and (c) reveal a strong overlap problem. Moreover, we see that the treated units are less likely to be missing outcomes compared to the comparison groups.

Figure E.1: Kernel density plots for the composite probability. a) Experimental treatment and control groups; b) Experimental treatment and PSID-1 group; c) Experimental treatment and PSID-2 group
Notes:
The weights here correspond to the product of the estimated assignment and missing outcome probabilities. Following CS (2017), I exploit the efficiency gain from combining the experimental treatment and control groups for estimating the treatment and missing outcome probability models. For the PSID-1 group, this means using the full experimental group as the treatment group and PSID-1 as the control group. Similarly, to construct weights for the PSID-2 group, this means using the full experimental group along with PSID-2 as the control group.

Figure E.2: Estimated propensity score distributions. a) Experimental treatment and control groups; b) Experimental treatment and PSID-1 group; c) Experimental treatment and PSID-2 group
Notes:
Following CS (2017), I exploit the efficiency gains from combining the experimental treatment and control groups for estimating the propensity scores. For the PSID-1 group, this means using the full experimental group as the treatment group and PSID-1 as the control group. Similarly, to construct weights for the PSID-2 group, this means using the full experimental group along with PSID-2 as the control group.

Figure E.3: Estimated missing outcome probability distributions. a) Experimental treatment and control groups; b) Experimental treatment and PSID-1 group; c) Experimental treatment and PSID-2 group
Notes:
Following CS (2017), I exploit the efficiency gains from combining the experimental treatment and control groups for estimating the missing outcome probability. For the PSID-1 group, this means using the full experimental group as the treatment group and PSID-1 as the control group. Similarly, to construct weights for the PSID-2 group, this means using the full experimental group along with PSID-2 as the control group.

F Proofs
Proof of Lemma 1. Let us first consider the argument for \(\theta_1\). By the LIE and using the fact that
\[
q(Y, X, \theta) = W\cdot q(Y(1), X, \theta) + (1-W)\cdot q(Y(0), X, \theta)
\]
we can write
\[
E\big[\omega_1\cdot q(Y,X,\theta)\big]
= E\Big[E\Big(\frac{S}{r(X,W)}\cdot\frac{W}{p(X)}\cdot q\big(Y(1),X,\theta\big)\,\Big|\,Y(1),X,W\Big)\Big]
= E\Big[\frac{W}{r(X,W)\cdot p(X)}\cdot q\big(Y(1),X,\theta\big)\cdot P\big(S=1\,|\,Y(1),X,W\big)\Big]
\]
\[
= E\Big[\frac{W}{r(X,W)\cdot p(X)}\cdot q\big(Y(1),X,\theta\big)\cdot P\big(S=1\,|\,X,W\big)\Big]
= E\Big[\frac{W}{p(X)}\cdot q\big(Y(1),X,\theta\big)\Big]
\]
where the third equality follows from MAR and the fourth follows from part ii) of Assumption 3. Using another application of the LIE along with unconfoundedness, we obtain
\[
E\Big[\frac{W}{p(X)}\cdot q\big(Y(1),X,\theta\big)\Big] = E\big[q(Y(1),X,\theta)\big]
\]
The proof for \(\theta_0\) follows analogously.

Proof of Theorem 1. It has already been established that
\[
E\big[\omega_g\cdot q(Y,X,\theta)\big] \equiv E\big[\omega_g\cdot q(Y(g),X,\theta_g)\big] = E\big[q(Y(g),X,\theta_g)\big]
\]
for both g = 0, 1. By iii), \(\omega_g(\gamma,\delta)\) is continuous in \(\gamma\) and \(\delta\) and is bounded in absolute value by Assumptions 4 and 5. Moreover, \(\omega_g(\cdot,\gamma,\delta)\,q(\cdot,\theta)\) is continuous with probability one. Then, along with v), the DCT, and boundedness of \(\omega_g(\cdot,\cdot)\), we obtain
\[
\sup_{(\theta_g,\gamma,\delta)\in\Theta_g\times\tilde\Gamma\times\tilde\Delta}\Big|\frac{1}{N}\sum_{i=1}^N \omega_{ig}(\gamma,\delta)\cdot q(Y_i(g),X_i,\theta_g) - E\big[\omega_g(\gamma,\delta)\cdot q(Y(g),X,\theta_g)\big]\Big| \overset{p}{\to} 0 \qquad \text{(F.1)}
\]
by Lemma 2.4 in Newey and McFadden (1994). (Here \(\tilde\Gamma\) and \(\tilde\Delta\) are compact neighborhoods around \(\gamma\) and \(\delta\).) Then, by the triangle inequality,
\[
\sup_{\theta_g\in\Theta_g}\Big|\frac{1}{N}\sum_{i=1}^N \hat\omega_{ig}\cdot q(Y_i(g),X_i,\theta_g) - E\big[\omega_g\cdot q(Y(g),X,\theta_g)\big]\Big|
\le \sup_{\theta_g\in\Theta_g}\Big|\frac{1}{N}\sum_{i=1}^N \hat\omega_{ig}\cdot q(Y_i(g),X_i,\theta_g) - E\big[\hat\omega_g\cdot q(Y(g),X,\theta_g)\big]\Big| \qquad \text{(F.2)}
\]
\[
+ \sup_{\theta_g\in\Theta_g}\Big|E\big[\hat\omega_g\cdot q(Y(g),X,\theta_g)\big] - E\big[\omega_g\cdot q(Y(g),X,\theta_g)\big]\Big| \qquad \text{(F.3)}
\]
(F.2) is \(o_p(1)\) because of (F.1). (F.3) is \(o_p(1)\) due to \(\hat\gamma \overset{p}{\to} \gamma\), \(\hat\delta \overset{p}{\to} \delta\), and uniform continuity of \(E[\omega_g\cdot q(Y(g),X,\theta_g)]\) on \(\Theta_g\times\tilde\Gamma\times\tilde\Delta\). Then consistency of \(\hat\theta_g\) for \(\theta_g\) follows from Theorem 2.1 of Newey and McFadden (1994).
Proof of Theorem 2. Explicit dependence on data is suppressed for notational simplicity. Expanding \(\hat\omega_{ig}\) around \(\omega_{ig}\),
\[
\hat\omega_{ig} \approx \omega_{ig} - \tilde\omega_{ig}\, b_i'(\tilde\delta)\cdot(\hat\delta-\delta) - \tilde\omega_{ig}\, d_i'(\tilde\gamma)\cdot(\hat\gamma-\gamma)
\]
where \(\tilde\delta\) lies between \(\hat\delta\) and \(\delta\) and \(\tilde\gamma\) lies between \(\hat\gamma\) and \(\gamma\). Then, consider
\[
N^{-1/2}\sum_{i=1}^N \hat\omega_{ig}\cdot h_{ig}
= N^{-1/2}\sum_{i=1}^N\Big\{\omega_{ig}h_{ig} - \tilde\omega_{ig}h_{ig}\cdot b_i'(\tilde\delta)\cdot(\hat\delta-\delta) - \tilde\omega_{ig}h_{ig}\cdot d_i'(\tilde\gamma)\cdot(\hat\gamma-\gamma)\Big\}
\]
\[
= N^{-1/2}\sum_{i=1}^N \omega_{ig}h_{ig}
- \Big(N^{-1}\sum_{i=1}^N \tilde\omega_{ig}h_{ig}b_i'(\tilde\delta)\Big)\cdot\sqrt{N}(\hat\delta-\delta)
- \Big(N^{-1}\sum_{i=1}^N \tilde\omega_{ig}h_{ig}d_i'(\tilde\gamma)\Big)\cdot\sqrt{N}(\hat\gamma-\gamma)
\]
Now let \((\theta_g^*, \delta^*) = \arg\sup_{\theta_g\in\Theta_g,\,\delta\in\Delta} \|h(\theta_g)\cdot b'(\delta)\|\). Then,
\[
\Big(E\big[\|h(\theta_g^*)b'(\delta^*)\|\big]\Big)^2 \le E\big[\|h(\theta_g^*)\|^2\big]\,E\big[\|b'(\delta^*)\|^2\big]
\le E\Big[\sup_{\theta_g\in\Theta_g}\|h(\theta_g)\|^2\Big]\,E\Big[\sup_{\delta\in\Delta}\|b'(\delta)\|^2\Big] < \infty \qquad \text{(F.4)}
\]
where the first inequality holds by Cauchy-Schwarz, the second by the definition of the supremum, and the third by conditions iv) and vi). Then,
\[
E\Big[\sup_{\theta_g\in\Theta_g,\,\delta\in\Delta}\|h(\theta_g)b'(\delta)\|\Big] \le \Big(E\Big[\sup_{\theta_g\in\Theta_g,\,\delta\in\Delta}\|h(\theta_g)b'(\delta)\|^2\Big]\Big)^{1/2} < \infty
\]
where the first inequality holds trivially and the second holds because of (F.4). An analogous argument shows \(E\big[\sup_{\theta_g\in\Theta_g,\,\gamma\in\Gamma}\|h(\theta_g)d'(\gamma)\|\big] < \infty\). Using the fact that \(\omega_g(\gamma,\delta)\) is continuous and bounded, along with continuity of \(l(\theta_g)\) (condition ii)) and of \(b(\delta)\), \(d(\gamma)\) (condition iii) of Theorem 1), we obtain
\[
\frac{1}{N}\sum_{i=1}^N \tilde\omega_{ig}h_{ig}b_i'(\tilde\delta) = E\big[\omega_{ig}h_{ig}b_i'\big] + o_p(1), \qquad
\frac{1}{N}\sum_{i=1}^N \tilde\omega_{ig}h_{ig}d_i'(\tilde\gamma) = E\big[\omega_{ig}h_{ig}d_i'\big] + o_p(1) \qquad \text{(F.5)}
\]
using Lemma 4.3 in Newey and McFadden (1994), as \(\tilde\gamma \to_p \gamma\) and \(\tilde\delta \to_p \delta\). Rewriting (7) using influence function representations for \(\hat\gamma\) and \(\hat\delta\) along with (F.5),
\[
N^{-1/2}\sum_{i=1}^N \hat\omega_{ig}h_{ig}
= N^{-1/2}\sum_{i=1}^N\Big\{l_{ig} - E\big[l_{ig}b_i'\big]\cdot E\big[b_ib_i'\big]^{-1}b_i - E\big[l_{ig}d_i'\big]\cdot E\big[d_id_i'\big]^{-1}d_i\Big\} + o_p(1)
\equiv N^{-1/2}\sum_{i=1}^N u_{ig} + o_p(1) \overset{d}{\to} N(0,\Omega_g) \qquad \text{(F.6)}
\]
where \(u_{ig} \equiv l_{ig} - E[l_{ig}b_i']\,E[b_ib_i']^{-1}b_i - E[l_{ig}d_i']\,E[d_id_i']^{-1}d_i\). Since \(E(u_{ig}) = 0\),
\[
\Omega_g = E\big(l_{ig}l_{ig}'\big) - E\big(l_{ig}b_i'\big)E\big(b_ib_i'\big)^{-1}E\big(b_il_{ig}'\big) - E\big(l_{ig}d_i'\big)E\big(d_id_i'\big)^{-1}E\big(d_il_{ig}'\big)
\]
The next part of the proof uses the theory of empirical processes to obtain asymptotic normality of the doubly weighted estimator. Using the definition in (11), along with the fact that \(E[\hat\omega_{ig}h_i(\theta_g)] \overset{p}{\to} E[\omega_{ig}h_i(\theta_g)]\) (by continuity of \(\omega(\gamma,\delta)h(\theta_g)\), condition iv), and the DCT as \((\hat\gamma,\hat\delta)\overset{p}{\to}(\gamma,\delta)\)), rewrite
\[
v_N(\theta_g) = v_N^*(\theta_g) + o_p(1) \qquad \text{(F.7)}
\]
where \(v_N^*(\theta_g) \equiv N^{-1/2}\sum_{i=1}^N\big\{\hat\omega_{ig}h_i(\theta_g) - E[\omega_{ig}h_i(\theta_g)]\big\}\). Let
\[
\bar m_N(\theta_g) = \frac{1}{N}\sum_{i=1}^N \hat\omega_{ig}h_i(\theta_g), \qquad m_N^*(\theta_g) = E\big[\omega_{ig}h_i(\theta_g)\big]
\]
Then, performing element-by-element mean value expansions of \(m_N^*(\hat\theta_g)\) around \(\theta_g\), we obtain
\[
0 = \sqrt{N}\,m_N^*(\theta_g) = \sqrt{N}\,m_N^*(\hat\theta_g) - \nabla_{\theta_g}m_N^*(\tilde\theta_g)'\cdot\sqrt{N}(\hat\theta_g-\theta_g)
\]
where \(\tilde\theta_g\) lies between \(\hat\theta_g\) and \(\theta_g\). The population first order condition is zero at the truth:
\[
0 = \nabla_{\theta_g}E\big[\omega_g\cdot q(Y(g),X,\theta_g)\big] = E\big[\omega_g\cdot h(Y(g),X,\theta_g)\big] \equiv m_N^*(\theta_g)
\]
The second equality follows from dominance condition iv) and an application of Lemma 3.6 in Newey and McFadden (1994). Then, by continuity of \(\nabla_{\theta_g}E[\omega_{ig}h_i(\theta_g)]\) (condition vi)),
\[
\nabla_{\theta_g}m_N^*(\tilde\theta_g) \overset{p}{\to} H_g
\]
By the continuous mapping theorem and condition viii),
\[
\sqrt{N}(\hat\theta_g-\theta_g) = \big(H_g^{-1} + o_p(1)\big)\cdot\sqrt{N}\,m_N^*(\hat\theta_g) \qquad \text{(F.8)}
\]
Consider
\[
-\sqrt{N}\,m_N^*(\hat\theta_g) = v_N^*(\hat\theta_g) - \sqrt{N}\,\bar m_N(\hat\theta_g)
= v_N^*(\hat\theta_g) - v_N^*(\theta_g) + v_N^*(\theta_g) - \sqrt{N}\,\bar m_N(\hat\theta_g)
= v_N^*(\theta_g) + o_p(1)
\]
since \(v_N^*(\hat\theta_g) - v_N^*(\theta_g) = o_p(1)\) by the asymptotic equivalence in (F.7) and stochastic equicontinuity (condition ix)). Moreover, \(\sqrt{N}\,\bar m_N(\hat\theta_g) = o_p(1)\) by condition iii). Therefore,
\[
v_N^*(\theta_g) = \frac{1}{\sqrt{N}}\sum_{i=1}^N \hat\omega_{ig}h_{ig} \overset{d}{\to} N(0,\Omega_g)
\]
by (F.6). Then, using (F.8) along with Slutsky's theorem, \(\sqrt{N}(\hat\theta_g-\theta_g) \overset{d}{\to} N\big(0,\, H_g^{-1}\Omega_g H_g^{-1}\big)\).

Proof of Corollary 1.
Consider
\[
\Sigma_g - \Omega_g = E\big(l_{ig}l_{ig}'\big) - \Big\{E\big(l_{ig}l_{ig}'\big) - E\big(l_{ig}b_i'\big)E\big(b_ib_i'\big)^{-1}E\big(b_il_{ig}'\big) - E\big(l_{ig}d_i'\big)E\big(d_id_i'\big)^{-1}E\big(d_il_{ig}'\big)\Big\}
\]
\[
= E\big(l_{ig}b_i'\big)E\big(b_ib_i'\big)^{-1}E\big(b_il_{ig}'\big) + E\big(l_{ig}d_i'\big)E\big(d_id_i'\big)^{-1}E\big(d_il_{ig}'\big)
\]
Since each component matrix in the above expression is positive semi-definite, the sum of the two matrices is also positive semi-definite.

Proof of Theorem 3. It has already been established that \(\theta_g\) solves \(E\big[\omega_g^*\cdot q(Y(g),X,\theta_g)\big]\). The proof of uniform convergence follows similarly to the proof of Theorem 1, where we replace \(\omega_g\) by \(\omega_g^*\). Then, consistency of \(\hat\theta_g\) for \(\theta_g\) follows from Theorem 2.1 in Newey and McFadden (1994).

Proof of Theorem 4. The proof follows in the manner of Theorem 2, where we replace \(\omega_g\) by \(\omega_g^*\). Also, \(\Omega_g\) now denotes the variance of the score of the objective function, \(l_{ig}\), without the first-stage adjustment for the estimated weights. This is because \(E(l_{ig}b_i') = E(l_{ig}d_i') = 0\), since the conditional score satisfies \(E[h(Y(g),X,\theta_g)\,|\,X] = 0\) due to strong identification of \(\theta_g\).

Proof of Corollary 2. This proof follows from the proof of Theorem 4 and the asymptotic variance of the estimator that uses known weights, which is
\[
\mathrm{Avar}\big[\sqrt{N}\big(\tilde\theta_g-\theta_g\big)\big] = H_g^{-1}\Omega_g H_g^{-1}
\]
where \(\Omega_g = E\big(l_{ig}l_{ig}'\big)\). The result follows immediately.

Proof of Corollary 3 (Efficiency gain with unweighted estimator under GCIME). Using two applications of the LIE and invoking MAR and unconfoundedness, I can rewrite
\[
E\Big[\frac{S_i\cdot W_i}{R(X_i,W_i,\delta^*)\cdot G(X_i,\gamma^*)}\cdot q(Y_i(1),X_i,\theta_1)\Big]
= E\Big[\frac{r(X_i,1)}{R(X_i,1,\delta^*)}\cdot\frac{p(X_i)}{G(X_i,\gamma^*)}\cdot q(Y_i(1),X_i,\theta_1)\Big]
\]
Using another application of the LIE, I can rewrite the above as
\[
E\Big[\frac{r(X_i,1)}{R(X_i,1,\delta^*)}\cdot\frac{p(X_i)}{G(X_i,\gamma^*)}\cdot E\big\{q(Y_i(1),X_i,\theta_1)\,|\,X_i\big\}\Big]
\]
Then,
\[
H_1 = E\Big[\frac{r(X_i,1)}{R(X_i,1,\delta^*)}\cdot\frac{p(X_i)}{G(X_i,\gamma^*)}\cdot \nabla_{\theta_1}E\big\{h(Y_i(1),X_i,\theta_1)\,|\,X_i\big\}\Big]
= E\Big[\frac{r(X_i,1)}{R(X_i,1,\delta^*)}\cdot\frac{p(X_i)}{G(X_i,\gamma^*)}\cdot A(X_i,\theta_1)\Big]
\]
Similarly, I use the LIE to express \(\Omega_1\) as
\[
\Omega_1 = E\Big[\frac{r(X_i,1)}{R(X_i,1,\delta^*)^2}\cdot\frac{p(X_i)}{G(X_i,\gamma^*)^2}\cdot E\big\{h(Y_i(1),X_i,\theta_1)h(Y_i(1),X_i,\theta_1)'\,\big|\,X_i\big\}\Big]
= \sigma^2\cdot E\Big[\frac{r(X_i,1)}{R(X_i,1,\delta^*)^2}\cdot\frac{p(X_i)}{G(X_i,\gamma^*)^2}\cdot A(X_i,\theta_1)\Big]
\]
For the unweighted estimator, the variance simplifies, and this happens precisely due to the GCIME. To see this, consider \(H_1^u\). Using the LIE, I can rewrite
\[
H_1^u = E\big[r(X_i,1)\cdot p(X_i)\cdot\nabla_{\theta_1}E\big\{h(Y_i(1),X_i,\theta_1)\,|\,X_i\big\}\big] = E\big[r(X_i,1)\cdot p(X_i)\cdot A(X_i,\theta_1)\big]
\]
and similarly we can rewrite \(\Omega_1^u\) using the LIE as
\[
\Omega_1^u = E\big[r(X_i,1)\cdot p(X_i)\cdot E\big\{h(Y_i(1),X_i,\theta_1)h(Y_i(1),X_i,\theta_1)'\,|\,X_i\big\}\big] = \sigma^2\cdot E\big[r(X_i,1)\cdot p(X_i)\cdot A(X_i,\theta_1)\big]
\]
Therefore, the asymptotic variance simplifies to
\[
\mathrm{Avar}\Big[\sqrt{N}\big(\hat\theta_1^u-\theta_1\big)\Big] = \sigma^2\cdot\Big(E\big[r(X_i,1)\cdot p(X_i)\cdot A(X_i,\theta_1)\big]\Big)^{-1}
\]
Then,
\[
\Big[\mathrm{Avar}\big\{\sqrt{N}\big(\hat\theta_1^u-\theta_1\big)\big\}\Big]^{-1} - \Big[\mathrm{Avar}\big\{\sqrt{N}\big(\hat\theta_1-\theta_1\big)\big\}\Big]^{-1}
= \frac{1}{\sigma^2}\Big\{E\big(r_i\cdot p_i\cdot A_i\big) - E\Big(\frac{r_i\cdot p_i}{R_i\cdot G_i}\cdot A_i\Big)\cdot E\Big(\frac{r_i\cdot p_i}{R_i^2\cdot G_i^2}\cdot A_i\Big)^{-1}\cdot E\Big(\frac{r_i\cdot p_i}{R_i\cdot G_i}\cdot A_i\Big)\Big\}
\]
Let \(B_i = r_i^{1/2}\cdot p_i^{1/2}\cdot A_i^{1/2}\) and \(D_i = \big(r_i^{1/2}/R_i\big)\cdot\big(p_i^{1/2}/G_i\big)\cdot A_i^{1/2}\). Then the above equals
\[
\frac{1}{\sigma^2}\Big\{E\big(B_i'B_i\big) - E\big(B_i'D_i\big)\cdot E\big(D_i'D_i\big)^{-1}\cdot E\big(D_i'B_i\big)\Big\}
\]
where the quantity inside the brackets is nothing but the variance of the residuals from the population regression of \(B_i\) on \(D_i\). Hence, the difference is positive semi-definite. The results for g = 0 can be proven analogously.

F.1 Identification of ATE using pooled and separate slopes mean functions under second half of DR
Pooled slopes. Let us assume that \(m(X,\theta_g) = h(X\theta + \eta W)\) is the chosen mean function for \(E[Y(g)\,|\,X]\). Then, in the presence of nonrandom sampling, we have the following first order conditions:
\[
\sum_{i=1}^N S_i\cdot\Big(\frac{W_i}{\hat R_i\cdot\hat G_i} + \frac{1-W_i}{\hat R_i\cdot(1-\hat G_i)}\Big)\cdot\big[Y_i - h(X_i\hat\theta + \hat\eta W_i)\big] = 0
\]
\[
\sum_{i=1}^N S_i\cdot\frac{W_i}{\hat R_i\cdot\hat G_i}\cdot\big[Y_i - h(X_i\hat\theta + \hat\eta W_i)\big] = 0
\]
\[
\sum_{i=1}^N S_i\cdot\Big(\frac{W_i}{\hat R_i\cdot\hat G_i} + \frac{1-W_i}{\hat R_i\cdot(1-\hat G_i)}\Big)\cdot X_i'\big[Y_i - h(X_i\hat\theta + \hat\eta W_i)\big] = 0
\]
where \(\hat R = R(X,W,\hat\delta)\) and \(\hat G = G(X,\hat\gamma)\). Ignoring the last set of moment conditions, the population counterparts to the FOCs above are:
\[
E\Big[S\cdot\Big(\frac{W}{R\cdot G} + \frac{1-W}{R\cdot(1-G)}\Big)\cdot\big[Y - h(X\theta^* + \eta^*W)\big]\Big] = 0 \qquad \text{(F.9)}
\]
\[
E\Big[S\cdot\frac{W}{R\cdot G}\cdot\big[Y - h(X\theta^* + \eta^*W)\big]\Big] = 0 \qquad \text{(F.10)}
\]
where \(\theta^*\) and \(\eta^*\) are the probability limits of the QMLE estimators \(\hat\theta\) and \(\hat\eta\). Rearranging (F.9) and (F.10) gives us
\[
E\Big[\frac{S}{R}\cdot\Big(\frac{W}{G} + \frac{1-W}{1-G}\Big)\cdot Y\Big] = E\Big[\frac{S}{R}\cdot\Big(\frac{W}{G} + \frac{1-W}{1-G}\Big)\cdot h(X\theta^* + \eta^*W)\Big] \qquad \text{(F.11)}
\]
\[
E\Big[\frac{S\cdot W}{R\cdot G}\cdot Y\Big] = E\Big[\frac{S\cdot W}{R\cdot G}\cdot h(X\theta^* + \eta^*W)\Big] \qquad \text{(F.12)}
\]
Now, \(Y = Y(1)\cdot W + Y(0)\cdot(1-W)\), which implies that we can replace \(Y\) in the above two equations to obtain the LHS of (F.11) equal to
\[
E\Big[\frac{S}{R}\cdot\Big\{\frac{W}{G}\cdot Y(1) + \frac{1-W}{1-G}\cdot Y(0)\Big\}\Big]
\]
By using iterated expectations, we can rewrite the above as
\[
E\Big[\frac{W}{G\cdot R}\cdot E\big(S\cdot Y(1)\,|\,X,W\big) + \frac{1-W}{(1-G)\cdot R}\cdot E\big(S\cdot Y(0)\,|\,X,W\big)\Big]
\]
Due to MAR, we can split the conditional expectation into parts:
\[
E\Big[\frac{W}{G\cdot R}\cdot E\big(S\,|\,X,W\big)\cdot E\big(Y(1)\,|\,X,W\big) + \frac{1-W}{(1-G)\cdot R}\cdot E\big(S\,|\,X,W\big)\cdot E\big(Y(0)\,|\,X,W\big)\Big]
\]
Note that \(W\cdot E(S\,|\,X,W) = W\cdot R\) and, similarly, \((1-W)\cdot E(S\,|\,X,W) = (1-W)\cdot R\), and due to unconfoundedness we have \(E[Y(1)\,|\,X,W] = E[Y(1)\,|\,X]\) and \(E[Y(0)\,|\,X,W] = E[Y(0)\,|\,X]\). Therefore, we can simplify the above expression into
\[
E\Big[\frac{W\cdot R}{G\cdot R}\cdot E\big(Y(1)\,|\,X\big) + \frac{(1-W)\cdot R}{(1-G)\cdot R}\cdot E\big(Y(0)\,|\,X\big)\Big]
\]
Another application of iterated expectations gives us
\[
E\Big[\frac{E(Y(1)\,|\,X)}{G}\cdot E[W\,|\,X] + \frac{E(Y(0)\,|\,X)}{1-G}\cdot E[(1-W)\,|\,X]\Big] = E\big[E(Y(1)\,|\,X) + E(Y(0)\,|\,X)\big] = E[Y(1)] + E[Y(0)]
\]
For the RHS of (F.11), iterated expectations give
\[
E\Big[h(X\theta^* + \eta^*W)\cdot\Big\{\frac{W}{G\cdot R}\cdot E(S\,|\,X,W) + \frac{1-W}{(1-G)\cdot R}\cdot E(S\,|\,X,W)\Big\}\Big]
= E\Big[h(X\theta^* + \eta^*W)\cdot\Big\{\frac{W}{G} + \frac{1-W}{1-G}\Big\}\Big]
\]
\[
= E\Big[h(X\theta^* + \eta^*W)\cdot\frac{W}{G}\Big] + E\Big[h(X\theta^* + \eta^*W)\cdot\frac{1-W}{1-G}\Big]
\]
Therefore, combining the LHS and RHS gives the result
\[
E[Y(1)] + E[Y(0)] = E\Big[h(X\theta^* + \eta^*W)\cdot\frac{W}{G}\Big] + E\Big[h(X\theta^* + \eta^*W)\cdot\frac{1-W}{1-G}\Big] \qquad \text{(F.13)}
\]
Now, consider the LHS of (F.12):
\[
E\Big[\frac{S\cdot W}{R\cdot G}\cdot Y\Big] = E\Big[\frac{S\cdot W}{R\cdot G}\cdot Y(1)\Big] = E[Y(1)] \quad \text{(by the LIE)}
\]
Similarly using the LIE, the RHS of (F.12) can be rewritten as
\[
E\Big[\frac{S\cdot W}{R\cdot G}\cdot h(X\theta^* + \eta^*W)\Big] = E\Big[h(X\theta^* + \eta^*W)\cdot\frac{W}{G\cdot R}\cdot E(S\,|\,X,W)\Big] = E\Big[h(X\theta^* + \eta^*W)\cdot\frac{W}{G}\Big]
\]
Therefore, combining the LHS and RHS gives us
\[
E[Y(1)] = E\Big[h(X\theta^* + \eta^*W)\cdot\frac{W}{G}\Big] \qquad \text{(F.14)}
\]
Then, using (F.14) along with (F.13) implies that
\[
E[Y(0)] = E\Big[h(X\theta^* + \eta^*W)\cdot\frac{1-W}{1-G}\Big] \qquad \text{(F.15)}
\]
Consider
\[
E\big[h(X\theta^* + \eta^*W)\cdot W\,|\,X\big] = h(X\theta^* + \eta^*)\cdot P(W=1\,|\,X)
\;\Rightarrow\;
E\Big[h(X\theta^* + \eta^*W)\cdot\frac{W}{G}\Big] = E\big[h(X\theta^* + \eta^*)\big]
\]
Similarly, we can also show that
\[
E\Big[h(X\theta^* + \eta^*W)\cdot\frac{1-W}{1-G}\Big] = E\big[h(X\theta^*)\big]
\]
Hence, the pooled regression adjustment estimand can be written as \(\Delta^P_{ate} = E\big[h(X\theta^* + \eta^*)\big] - E\big[h(X\theta^*)\big]\), so a consistent estimator of the QMLE pooled regression adjustment estimand can be obtained by replacing the population expectation with the sample average in the above expression, which gives us
\[
\hat\Delta^P_{ate} = \frac{1}{N}\sum_{i=1}^N h(X_i\hat\theta + \hat\eta) - \frac{1}{N}\sum_{i=1}^N h(X_i\hat\theta)
\]
Separate slopes. Let us assume that \(m(X,\theta_g) = h(X\theta_g)\) is the chosen mean function for \(E[Y(g)\,|\,X]\). Then the population FOCs are
\[
E\Big[\frac{S\cdot W}{R\cdot G}\cdot\big[Y - h(X\theta_1^*)\big]\Big] = 0 \qquad \text{(F.16)}
\]
\[
E\Big[\frac{S\cdot(1-W)}{R\cdot(1-G)}\cdot\big[Y - h(X\theta_0^*)\big]\Big] = 0 \qquad \text{(F.17)}
\]
where \(\theta_g^*\) are the probability limits of the QMLE estimators \(\hat\theta_g\). Rearranging (F.16) and (F.17), just like in the pooled case, gives us the following equalities:
\[
E\Big[\frac{S\cdot W}{R\cdot G}\cdot Y\Big] = E\Big[\frac{S\cdot W}{R\cdot G}\cdot h(X\theta_1^*)\Big], \qquad
E\Big[\frac{S\cdot(1-W)}{R\cdot(1-G)}\cdot Y\Big] = E\Big[\frac{S\cdot(1-W)}{R\cdot(1-G)}\cdot h(X\theta_0^*)\Big]
\]
Proceeding with the above two equations in the same way as in the pooled case gives the results
\[
E[Y(1)] = E\big[h(X\theta_1^*)\big], \qquad E[Y(0)] = E\big[h(X\theta_0^*)\big]
\]
Therefore, \(\Delta^F_{ate} = E\big[h(X\theta_1^*)\big] - E\big[h(X\theta_0^*)\big]\), and a consistent estimator of the QMLE separate regression adjustment estimand can be obtained as
\[
\hat\Delta^F_{ate} = \frac{1}{N}\sum_{i=1}^N h(X_i\hat\theta_1) - \frac{1}{N}\sum_{i=1}^N h(X_i\hat\theta_0)
\]

G Supplementary Tables
Table G.1: Proportion of missing earnings in the experimental sample

Earnings in 1979   Treated   Control   Total
Missing                196       210     406
Observed               600       585   1,185
Total                  796       795   1,591

Table G.2: Proportion of missing data in the PSID samples

Earnings in 1979   PSID-1   PSID-2
Missing                81       22
Observed              648      182
Total                 729      204

Table G.3: Unweighted and weighted earnings comparisons and estimated training effects using NSW and PSID comparison groups
Comparison group                    Pre-training estimates
                              Unadjusted                              Adjusted
                   Unweighted  PS-weighted  D-weighted   Unweighted  PS-weighted  D-weighted
NSW                       -18           -9           1          -22          -10          -1
N=1,185              (123.45)      (51.07)     (48.76)     (124.70)      (51.34)     (48.97)
PSID-1                 -2,534         -222        -255       -2,804         -199        -222
N=1,016              (283.95)     (213.57)    (205.59)     (281.49)     (212.55)    (205.45)
PSID-2                 -2,080       -1,371      -1,357       -2,181       -1,505      -1,467
N=720                (411.23)     (331.41)    (317.41)     (427.24)     (359.98)    (342.16)

Bias using NSW control
PSID-1                 -2,517          289         236       -2,760          334         287
N=1,001              (279.38)     (256.93)    (247.18)     (283.09)     (257.50)    (248.20)
PSID-2                 -2,063       -1,249      -1,255       -2,144       -1,306      -1,297
N=705                (416.53)     (323.36)    (310.59)     (435.74)     (354.12)    (337.68)

Adjusted covariates
Pre-training earnings (1975)   ✓   ✓   ✓
Age                            ✓   ✓   ✓
Age²                           ✓   ✓   ✓
Education                      ✓   ✓   ✓
High school dropout            ✓   ✓   ✓
Black                          ✓   ✓   ✓
Hispanic                       ✓   ✓   ✓
Marital status                 ✓   ✓   ✓
Number of children (1975)
Notes:
This table reports unadjusted and adjusted pre-training earnings differences, where the first row reports the experimental estimates, which combine the NSW treatment and control groups. The second and third rows report non-experimental earnings estimates computed using the PSID-1 and PSID-2 comparison groups, respectively. The second panel of the table reports bias estimates computed from combining the NSW control group with the PSID-1 and PSID-2 comparison groups, respectively. Both the pre-training estimates and the bias estimates should be compared to zero. Bootstrapped standard errors are given in parentheses and have been constructed using 10,000 replications. All values are in 1982 dollars. The samples used for estimating the training and bias estimates using the PSID-1 and PSID-2 comparison groups have been trimmed to ensure common support in the distribution of weights for the NSW treatment and comparison groups. For more detail, see appendix E.

Table G.4: Unconditional quantile treatment effect (UQTE) using PSID-1 comparison group

Quantile   Experimental   Unweighted   PS-weighted   D-weighted
0.1                   0            0             0            0
                    (0)          (0)           (0)          (0)
0.2                   0            0             0            0
                    (0)          (0)           (0)          (0)
0.3                   0            0             0            0
                    (0)      (12.91)           (0)          (0)
0.4                   0     -1124.61             0            0
                (11.17)     (552.97)      (207.14)     (174.89)
0.5              993.52     -2227.26       2076.58      1847.04
               (695.93)     (983.43)      (851.09)     (829.42)
0.6             2004.40      -860.55       3602.76      3535.85
              (1112.82)     (964.97)     (1299.08)    (1284.64)
0.7             2129.93       428.01       3415.47      3340.84
               (716.04)     (728.22)      (988.24)     (992.95)
0.8             1753.27      -190.60       2019.44      2019.44
               (372.37)     (519.63)      (984.59)     (999.47)
0.9             1134.21     -1563.27       -385.45      -385.45
               (449.86)     (952.85)     (1059.43)    (1056.09)

Notes:
This table reports the experimental UQTE estimates (NSW treatment versus control) alongside unweighted, PS-weighted, and D-weighted estimates constructed with the PSID-1 comparison group. The estimates are reported at every 10th quantile of the 1979 earnings distribution. The experimental and PSID-1 estimates have been constructed using N=1,185 and N=1,016 observations, respectively. Bootstrapped standard errors are given in parentheses and have been constructed using 1,000 replications. All values are in 1982 dollars. The samples used for constructing these estimates have been trimmed to ensure common support across the treatment and comparison groups.
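The bootstrapped standard errors described in these notes can be obtained with a pairs bootstrap: resample observations with replacement and recompute the weighted quantile treatment effect in each replication. The sketch below is a minimal illustration with invented data, uniform weights, and the median; it is not the paper's sample or estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Illustrative data: treatment indicator w, outcome y, and unit weights wt.
# In the paper's setting wt would be the estimated inverse-probability weights.
w = rng.binomial(1, 0.5, size=n)
y = rng.normal(loc=w, size=n)   # treated outcomes shifted up by 1
wt = np.ones(n)                 # uniform here, purely for illustration

def weighted_quantile(v, q, weights):
    """Quantile of v at level q from the weighted empirical CDF."""
    order = np.argsort(v)
    cum = np.cumsum(weights[order]) / weights.sum()
    return v[order][np.searchsorted(cum, q)]

def uqte(w, y, wt, q=0.5):
    """Weighted treated-minus-control quantile difference at level q."""
    return (weighted_quantile(y[w == 1], q, wt[w == 1])
            - weighted_quantile(y[w == 0], q, wt[w == 0]))

point = uqte(w, y, wt)

# Pairs bootstrap: resample (w, y, wt) rows jointly and recompute the estimate.
B = 1_000
draws = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    draws[b] = uqte(w[idx], y[idx], wt[idx])
se = draws.std(ddof=1)
```

Resampling whole rows keeps the dependence between treatment status, outcome, and weight intact, which is why the pairs bootstrap is the natural choice here.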
Notes: This table reports the experimental UQTE estimates alongside unweighted, PS-weighted, and D-weighted estimates constructed with the PSID-2 comparison group. The estimates are reported at every 10th quantile of the 1979 earnings distribution. The experimental and PSID-2 estimates have been computed using N=1,185 and N=720 observations, respectively. Bootstrapped standard errors are given in parentheses and have been constructed using 1,000 replications. All values are in 1982 dollars. The samples used for constructing these estimates have been trimmed to ensure common support across the treatment and comparison groups.

Supplementary Figures
Figure H.1: Estimated CQTE with the true CQTE as a function of X for N=5,000. Case 3: Correct CQF, misspecified weights. Panels (b) and (c) correspond to two quantile levels τ.

Notes: This figure plots the average D-weighted CQTE function together with the true CQTE along X for 1,000 Monte Carlo simulation draws of sample size N = 5,000. Along with these two graphs, the figure also plots the individual functions across the 1,000 simulation draws. The average treated sample is N = 5,000 × 0.41 × 0.38 = 779 and the average control sample is N = 5,000 × (1 − 0.41) × 0.38 = 1,121.

Figure: … as a function of X for N=5,000. Case 1: Misspecified CQF, correct weights. Panels (b) and (c) correspond to two quantile levels τ.

Notes:
This figure plots the bias in the unweighted, PS-weighted, and D-weighted LP of the true CQTE relative to the true population LP of CQTE. The average treated sample is N = 5,000 × 0.41 × 0.38 = 779 and the average control sample is N = 5,000 × (1 − 0.41) × 0.38 = 1,121. The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for nonrandom assignment, and the D-weighted estimator weights by both the treatment and missing-outcome propensity score models to deal with the nonrandom assignment and missing outcome problems.

Figure: panels (b) and (c) at two quantile levels τ.

Notes:
This figure plots the bias in the unweighted, PS-weighted, and D-weighted LP of the true CQTE relative to the true population LP of CQTE. The average treated sample is N = 5,000 × 0.41 × 0.38 = 779 and the average control sample is N = 5,000 × (1 − 0.41) × 0.38 = 1,121. The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for nonrandom assignment, and the D-weighted estimator weights by both the treatment and missing-outcome propensity score models to deal with the nonrandom assignment and missing outcome problems.
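The distinction these notes draw between the unweighted, PS-weighted, and D-weighted estimators can be made concrete with a small simulation. The sketch below applies the three weighting schemes to a simple mean-difference estimand; the logit propensity models, coefficients, and heterogeneous effect are invented for illustration and use the true (rather than estimated) probabilities, so this is not the paper's estimator or data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated design (all coefficients invented for this illustration).
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-0.5 * x))           # P(W = 1 | x): treatment propensity
s = 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * x)))   # P(S = 1 | x): outcome is observed

w = rng.binomial(1, p)                        # nonrandom treatment assignment
y = w * (1.0 + x) + x + rng.normal(size=n)    # heterogeneous effect; true ATE = 1
obs = rng.binomial(1, s) == 1                 # S = 1 when the outcome is observed

def diff_in_means(keep, weights):
    """Weighted treated-minus-control mean among the kept observations."""
    t, c = keep & (w == 1), keep & (w == 0)
    return (np.average(y[t], weights=weights[t])
            - np.average(y[c], weights=weights[c]))

ps_w = np.where(w == 1, 1.0 / p, 1.0 / (1.0 - p))   # treatment-PS weights

unweighted  = diff_in_means(obs, np.ones(n))   # drops missing, no correction
ps_weighted = diff_in_means(obs, ps_w)         # corrects assignment only
d_weighted  = diff_in_means(obs, ps_w / s)     # double weight: also missingness
```

Because missingness here varies with x, the PS-weighted estimator remains biased for the ATE while the double weight 1/(p·s) for treated units (and 1/((1−p)·s) for controls) recovers it, mirroring the ordering the figures display.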
This figure plots the empirical distributions of the unweighted, PS-weighted, and D-weighted UQTE estimates using 1,000 Monte Carlo simulation draws of sample size 5,000. The average treated sample is N = 5,000 × 0.41 × 0.38 = 779 and the average control sample is N = 5,000 × (1 − 0.41) × 0.38 = 1,121. The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for nonrandom assignment, and the D-weighted estimator weights by both the treatment and missing-outcome propensity score models to deal with the nonrandom assignment and missing outcome problems.

Figure: panels (b) and (c) at two quantile levels τ.

Notes:
This figure plots the empirical distributions of the unweighted, PS-weighted, and D-weighted UQTE estimates using 1,000 Monte Carlo simulation draws of sample size 5,000. The average treated sample is N = 5,000 × 0.41 × 0.38 = 779 and the average control sample is N = 5,000 × (1 − 0.41) × 0.38 = 1,121. The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for nonrandom assignment, and the D-weighted estimator weights by both the treatment and missing-outcome propensity score models to deal with the nonrandom assignment and missing outcome problems.
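The notes repeatedly state that samples are trimmed to ensure common support across the treatment and comparison groups. One common trimming rule, sketched below, keeps only units whose estimated propensity score lies in the overlap of the two groups' score ranges; the rule and the toy scores here are illustrative and not necessarily the exact trimming procedure used in the paper:

```python
import numpy as np

def trim_common_support(pscore, w):
    """Keep units whose propensity score lies in the overlap region
    [max of the two group minima, min of the two group maxima]."""
    lo = max(pscore[w == 1].min(), pscore[w == 0].min())
    hi = min(pscore[w == 1].max(), pscore[w == 0].max())
    return (pscore >= lo) & (pscore <= hi)

# Toy example: comparison units concentrated at low scores,
# treated units at high scores, overlap only in the middle.
pscore = np.array([0.05, 0.1, 0.3, 0.4, 0.6, 0.7, 0.85, 0.9])
w      = np.array([0,    0,   0,   1,   0,   1,   1,    1])
keep = trim_common_support(pscore, w)   # True only for scores in [0.4, 0.6]
```

Units outside the overlap have no comparable counterparts in the other group, so dropping them prevents extreme inverse-probability weights from dominating the estimates.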