Double-Robust Identification for Causal Panel Data Models∗

Dmitry Arkhangelsky†  Guido W. Imbens‡

September 2019
Abstract
We study identification and estimation of causal effects of a binary treatment in settings with panel data. We highlight that there are two paths to identification in the presence of unobserved confounders. First, the conventional path based on making assumptions on the relation between the potential outcomes and the unobserved confounders. Second, a design-based path where assumptions are made about the relation between the treatment assignment and the confounders. We introduce different sets of assumptions that follow the two paths, and develop double-robust approaches to identification where we exploit both approaches, similar in spirit to the double-robust approaches to estimation in the program evaluation literature.
Keywords: fixed effects, cross-section data, clustering, causal effects, treatment effects, unconfoundedness.

∗ This paper benefited greatly from our discussions with Manuel Arellano and David Hirshberg.
† Assistant Professor, CEMFI, darkhangel@cemfi.es.
‡ Professor of Economics, Graduate School of Business and Department of Economics, Stanford University, SIEPR, and NBER, [email protected].

Introduction
Panel data are widely used to assess causal effects of policy interventions on economic outcomes. These data are particularly useful in settings where there is substantial heterogeneity both between units at the same point in time, as well as heterogeneity over time within units. Fundamentally, the presence of panel data allows for two conceptually different comparisons to estimate causal effects. First, we can compare treated and control outcomes for the same unit at different points in time, that is, make across-time within-unit comparisons. Such comparisons are not possible in cross-section settings. Second, following approaches in cross-sectional settings, we can compare treated and control outcomes at the same point in time for different units, i.e., within-period across-unit comparisons. In that case, we use the panel data simply to allow for a richer set of controls than we would use in a cross-section setting. Different sets of assumptions justify the two approaches. In practice, researchers often make assumptions that simultaneously justify both types of comparisons. For example, many empirical papers use a linear two-way fixed effect specification that implicitly justifies both the within-unit and within-period comparisons:

Y_{it} = α_i + λ_t + τ W_{it} + β^⊤ X_{it} + ε_{it}.   (1.1)

Here W_{it} is an indicator for the treatment, with τ the causal effect of interest, and X_{it} are the time-unit specific control variables. In this specification, the α_i capture the permanent unit-specific effects, and the λ_t capture the common time effects. After removing the unit and time fixed effects, we can compare outcomes for treated units both to outcomes for the same unit in time periods where the unit was not treated, or to control units in the same time period. In this paper, we take a different perspective, building on the program evaluation or causal inference literature.
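The two-way fixed effect specification (1.1) can be estimated by OLS with unit and time dummies. The following minimal sketch (all numbers, including the assignment rule, are illustrative assumptions, not taken from the paper) generates a noiseless panel satisfying (1.1) without covariates and recovers τ exactly:

```python
import numpy as np

# Illustrative two-way fixed effect regression: Y_it = a_i + l_t + tau*W_it.
rng = np.random.default_rng(0)
N, T, tau = 50, 4, 2.5
alpha = rng.normal(size=N)                    # unit effects a_i
lam = rng.normal(size=T)                      # time effects l_t
W = (rng.random((N, T)) < 0.4).astype(float)  # assumed random assignment
Y = alpha[:, None] + lam[None, :] + tau * W   # noiseless for clarity

# Long format with unit and time dummies (one time dummy dropped).
y, w = Y.ravel(), W.ravel()
unit = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)
X = np.column_stack([w, np.eye(N)[unit], np.eye(T)[time][:, 1:]])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
tau_hat = coef[0]                             # recovers tau exactly here
```

Since the outcome is exactly additive in unit effects, time effects, and the treatment, the OLS coefficient on W equals τ up to machine precision.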
We start with the assumption that conditional on an unobserved unit-specific variable U_i (possibly vector-valued), the T-component vector of treatment assignments over time for unit i, W_i, with t-th element equal to W_{it}, is independent of the vector of potential outcomes Y_i(w):

W_i ⊥⊥ {Y_i(w)}_w | U_i.   (1.2)

This assumption has no immediate content because we can make it hold by construction by setting U_i equal to the vector of assignments W_i. Nevertheless, it clarifies what the issue is and why cross-section data alone are not sufficient: there is an unobserved variable U_i that invalidates comparisons of observed outcomes by treatment status because this unobserved variable is correlated both with the potential outcomes and with the treatment assignment. Although it is not always articulated in this form, this conditional independence assumption is implicitly made in many of the approaches to identification in panel data settings used in the empirical literature.

For the case where (in contrast to the case we consider in the current paper) (1.2) holds with U_i observed, the program evaluation literature has developed a number of effective methods for estimating the average causal effect of W_{it} on Y_{it} (see Imbens [2004], Abadie and Cattaneo [2018] for reviews). One approach is to remove the association between U_i and the treatment W_{it} by using the propensity score, either through weighting or through conditioning. Second, one can transform the outcome by removing the association between the outcome and U_i. This is typically done by subtracting from the outcome the conditional mean of the outcome Y_{it} given U_i. Third, and most effectively, one can use double robust methods and combine the propensity score adjustment with the outcome modeling/transformation.
These methods inspire the proposals developed in the current paper for the case where U_i is not observed. In the case where U_i is not observed, one has to make additional assumptions to ensure point-identification. For the most part, applied researchers have focused on making assumptions regarding the relationship between the outcome and the unobserved characteristic. This approach is natural, often follows directly from an economic model, and is supported by the econometric theory (see, e.g., the surveys Chamberlain [1984], Arellano and Honoré [2001], Arellano [2003], Arellano and Bonhomme [2011]). At the same time, such restrictions are very different from (1.2) because they are not motivated by a model of W_i (a model of assignment). The point that we are making in this paper is that a model for W_i provides an alternative path to identification, and, moreover, it can be considered separately from the model for the outcome. We show that with panel data, one can base the identification argument on either the outcome model or the assignment model being correct. This is where our approach differs conceptually from the double robust estimation literature: here both the design assumptions and the outcome modeling approaches are used in the identification stage.

First, analogous to the outcome modeling, we can use models and assumptions to motivate a transformation of the potential outcomes such that the unobserved component is independent of the transformed potential outcomes, and the transformed outcomes themselves are informative about the causal effect of interest. Formally,

U_i ⊥⊥ g({Y_i(w)}_w),   (1.3)

for some function of the potential outcomes g(·), possibly after some conditioning. Many methods used in the empirical literature, including the two-way fixed effect estimator, can be thought of as fitting in this approach. For example, consider a two-period setting.
The two-way fixed effect estimator transforms the outcomes by taking differences, e.g., in the two-period case Δ_i = g(Y_i(w)) = Y_{i2}(w_2) − Y_{i1}(w_1), so that Δ_i is free of dependence on the unobserved component U_i.

The second approach is design-based, where the goal is to find a set of conditioning variables S_i that removes the association between the treatment assignment and the unobserved component, analogous to the propensity score approach:

U_i ⊥⊥ W_i | S_i.   (1.4)

A version of this assumption has been used in the panel literature before (e.g., the exchangeability assumption in Altonji and Matzkin [2005] or the exponential family assumption in Arkhangelsky and Imbens [2018]). In this paper, we argue that it holds for a variety of models that have been commonly used for binary data (e.g., Honoré and Kyriazidou [2000], Chamberlain [2010], Aguirregabiria et al. [2018]). In principle, the two-way fixed effect estimator can also be thought of as following this approach by comparing treated and control units at the same time within the set of units with the same fraction of treated periods, that is, conditioning on S_i = Σ_{t=1}^T W_{it}. However, as a general approach to identifying treatment effects in a panel data setting, this design-based approach that is common in the treatment effect literature has not been explored, and we do so in the current paper.

Third, we explore robust versions where we combine outcome modeling and assumptions on the assignment mechanism. Essentially, there we develop models that justify (1.3) for some transformation, and models that justify (1.4) for some conditioning variables S_i, and then consider strategies that only require that the independence in (1.3) holds within subpopulations defined by S_i:

U_i ⊥⊥ g({Y_i(w)}_w) | S_i.
(1.5)

The paper fits in with the recent literature on causal inference in panel data settings, including the closely related synthetic control literature (Abadie et al. [2010], Arkhangelsky et al. [2019], Xu [2017], Ben-Michael et al. [2018]), difference-in-differences methods (de Chaisemartin and D'Haultfœuille [2018], Goodman-Bacon [2017], Athey and Imbens [2018], Athey et al. [2017]), and fixed effect methods (Imai and Kim [2019], Arkhangelsky and Imbens [2018]).

Notation: For p ∈ [1, ∞] we use L^p(P) to denote the space of all random variables X that satisfy E[‖X‖^p]^{1/p} < ∞. For any two random variables X_1, X_2 ∈ L^p(P) we use ‖X_1 − X_2‖_p to denote the L^p(P) distance. For a random sample {X_i}_{i=1}^N and any real-valued functions f_1, f_2: X → R we define:

P_N f(X_i) := (1/N) Σ_{i=1}^N f(X_i),   ‖f_1 − f_2‖_{N,p} = (P_N (f_1(X_i) − f_2(X_i))^p)^{1/p}.   (1.6)

For a matrix A we use σ_min(A) to denote its smallest singular value.

We observe N units over T periods (i and t being a generic unit and period, respectively). We focus on settings with large N and fixed T. We are interested in the effect of a binary policy variable w on some economic outcome Y_{it}. To formalize this we consider a potential outcome framework (Imbens and Rubin [2015]). The policy can change over time, and so is indexed by unit i and time t, W_{it} ∈ {0, 1}. Let w^t ≡ (w_1, w_2, ..., w_t) denote the sequence of treatment exposures up to time t, with w as shorthand for the full vector of exposures w^T. Define W_i ≡ (W_{i1}, ..., W_{iT}) to be the full assignment vector for unit i. For the first part of the paper we assume that researchers do not observe additional unit-level covariates, and we explicitly introduce them in Section 4.
In general, one can view all our identification results as conditional on covariates. Let Y_{it}(w^t) denote the potential outcome for unit i at time t, given treatment history w^t up to time t:

Y_{it}(w^t) ≡ Y_{it}(w_1, w_2, ..., w_t).   (2.1)

In this paper we consider a static version of this general model.

Assumption 2.1. (No Dynamics)
For arbitrary w^t(1) and w^t(2) such that w_t(1) = w_t(2) we have the following:

Y_{it}(w^t(1)) = Y_{it}(w^t(2)).   (2.2)

This restriction implies that past treatment exposures do not affect contemporaneous outcomes. This assumption does not restrict time-series correlation in the realized outcomes, and so on its own it does not have any testable implications. However, given a particular assignment process, Assumption 2.1 can be tested. Since a substantial part of the empirical literature focuses on contemporaneous effects and assumes away dynamic effects, we view this as a natural starting point. The issues we raise are relevant for the dynamic treatment effect case as well but are discussed most easily in the static case.

Given the no-dynamics assumption we can index the potential outcomes by a single binary argument w, so we write Y_{it}(w), for w ∈ {0, 1}. In this setup we can be interested in various treatment effects. Define individual and time-specific treatment effects:

τ_{it} ≡ Y_{it}(1) − Y_{it}(0).   (2.3)

We focus primarily on average treatment effects, typically a convex combination of individual effects τ_{it}. Define also Y_i(w) ≡ (Y_{i1}(w_1), ..., Y_{iT}(w_T)) to be the vector of potential outcomes. We make two additional assumptions. First, we restrict our attention to settings with strictly exogenous covariates (e.g., Arellano [2003]) and make the following assumption:

Assumption 2.2. (Latent Unconfoundedness)
There exists a random element U_i ∈ U such that the following conditional independence holds:

W_i ⊥⊥ {Y_i(w)}_w | U_i.   (2.4)

This assumption effectively says that once we control for U_i, all the differences in the treatment paths W_i across units are unrelated to the potential outcomes. This type of assignment should be contrasted with sequential assignment, where W_{it} can depend on past outcomes and latent characteristics. See Arellano [2003] for a discussion in the linear case. On its own, Assumption 2.2 is not restrictive because we allow U_i to be unobserved: we can mechanically choose U_i = W_i so that this assumption is satisfied by construction. There are multiple papers that essentially follow this road, going back at least to Chamberlain [1992] (also see Chernozhukov et al. [2013] for a very general version of this approach).

We view U_i as a unit characteristic that we need to control for if we wish to compare outcomes across units. We formalize this by making the following assumption on the (infeasible) generalized propensity score (Imbens [2000]) that ensures that in principle such comparisons are possible.

Assumption 2.3. (Latent Overlap)
Define the infeasible generalized propensity score:

r_inf(w, u) ≡ pr(W_i = w | U_i = u).   (2.5)

For any u ∈ U: max_w {r_inf(w, u)} < 1.

This assumption guarantees that there are units with the same U_i but different values of W_i. This type of assumption is common in the (cross-section) program evaluation literature: without such an overlap assumption, even if we observed U_i we would not be able to identify the average causal effect of the treatment without functional form restrictions. However, this latent overlap assumption is not always maintained in the panel literature. For example, if only time-series variation is used to make causal statements, then one does not need to make Assumption 2.3. Of course, this comes at a cost: one has to restrict the way potential outcomes can change over time. At the same time, if one also wants to exploit the cross-sectional variation, then some version of Assumption 2.3 appears to be unavoidable, but the outcome model can be more flexible compared to the approaches that rely on over-time comparisons.

Before we consider identification in various models we need to define additional objects. Let W be the support of the vector of assignments W_i; we can think of W as a matrix with at most 2^T rows and T columns, where each row is an element of the support of W_i. Let W_k be the k-th row of the matrix W, a T-dimensional vector of zeros and ones. Let π_k ≡ pr(W_i = W_k) = E[1{W_i = W_k}]. All π_k are positive, otherwise the corresponding row of W can be dropped. Let K be the number of rows in W. For example, if T = 3 then W can have the following form:

W = ( 0 0 0
      0 1 1
      1 0 1
      1 1 1 ).   (3.1)

Each row of this matrix represents a possible assignment, and in this particular case only 4 out of the 2^3 = 8 possible combinations have positive probability. For a particular unit i, let k(i) be the index k such that W_k = W_i.
For the identification argument we assume we know W and the probabilities π_k, and we consider estimation in Section 4. We are interested in estimating weighted averages of the treatment effects τ_{it}. Our estimators will be linear in Y, with weights that depend on W_i:

τ̂ = (1/(NT)) Σ_{i=1}^N Σ_{t=1}^T ω_{it} Y_{it}.

Choosing an estimator therefore corresponds to choosing a set of weights ω_{it}. We maintain throughout this section the no-dynamics assumption (Assumption 2.1), the latent unconfoundedness assumption (Assumption 2.2), and latent overlap (Assumption 2.3).

As discussed briefly in the introduction, the latent unconfoundedness assumption can be exploited in two directions. To build intuition, it is useful to briefly make an analogy to the conventional unconfoundedness case with observed confounders, in a cross-section setting. Suppose we have unconfoundedness (Rosenbaum and Rubin [1983]) with an observed confounder X_i. Here we use its weak form (Imbens [2000]):

W_i = w ⊥⊥ Y_i(w) | X_i,  ∀w.   (3.2)

In that case researchers have followed two approaches. One is to exploit the propensity score result that (irrespective of whether (3.2) holds),

W_i = w ⊥⊥ X_i | pr(W_i = w | X_i),   (3.3)

where pr(W_i = w | X_i) is the generalized propensity score. (3.2) and (3.3) combined imply that conditional on the generalized propensity score we have

W_i = w ⊥⊥ Y_i(w) | pr(W_i = w | X_i).   (3.4)

Thus, we can condition on a variable, here pr(W_i = w | X_i), such that the association between the treatment indicator, here 1{W_i = w}, and the variable we originally needed to condition on, here X_i, vanishes.

A second approach is to transform the potential outcomes. Define the conditional expectations μ(w, x) ≡ E[Y_i(w) | X_i = x] and e(X_i) ≡ pr(W_i = 1 | X_i).
We do not actually need the full independence assumption in (3.2), only mean-independence, since it implies E[Y_i(w) | W_i = w, X_i] = E[Y_i(w) | X_i]. Now define

Ỹ_i(w) ≡ g(Y_i(w)) ≡ Y_i(w) − μ(w, X_i) + [E[e(X_i)]^{W_i} (1 − E[e(X_i)])^{1−W_i} / (e(X_i)^{W_i} (1 − e(X_i))^{1−W_i})] (μ(1, X_i) − μ(0, X_i)).

This transformation of the potential outcomes does not change the mean-independence of Ỹ_i(w) and W_i = w conditional on X_i, and we have E[Ỹ_i(w) | W_i, X_i] = E[Ỹ_i(w) | X_i]. However, for this transformed outcome we have something much stronger. Here we do not need the conditioning on X_i for the expected value to be free of dependence on W_i, and mean-independence holds without conditioning on X_i:

E[Ỹ_i(w) | W_i = 1] = E[Ỹ_i(w) | W_i = 0] = E[Ỹ_i(w)] = E[Y_i(1) − Y_i(0)].

We can combine these two approaches and estimate

E[Ỹ_i(1) | W_i = 1, e(X_i)]  and  E[Ỹ_i(0) | W_i = 0, e(X_i)],

and average the difference over the marginal distribution of e(X_i). This will have double robustness properties.

The first insight that we take to the panel data case is that we can either use the conditional distribution of the assignment given the confounder to remove biases associated with a direct comparison of treated and control units, or we can remove the dependence of the outcomes on the confounder. This general strategy works whether the confounder is observed or not, but implementing the two approaches is a bigger challenge if the confounder is not observed, and we need to make additional assumptions in order to do so. The second insight is that combining these two approaches may lead to more robust estimates of the treatment effects. In this section we consider a simple example that illustrates the main message of the paper.
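Before turning to the panel example, the cross-section double-robustness logic above can be sketched numerically with a standard AIPW-style moment (this is a hedged illustration in the spirit of the discussion, not the exact transformation in the text; the functional forms for e, μ, and the misspecified model are illustrative assumptions). With a discrete confounder we can compute population expectations exactly by enumeration:

```python
import numpy as np

# Discrete confounder X in {0,1}; true propensity score and outcome model.
p_x = np.array([0.5, 0.5])                 # pr(X = 0), pr(X = 1)
e = np.array([0.3, 0.7])                   # e(x) = pr(W = 1 | X = x)
mu = lambda w, x: 1.0 + 2.0 * x + w * (1.0 + x)   # true E[Y | W = w, X = x]
ate = sum(p_x[x] * (mu(1, x) - mu(0, x)) for x in (0, 1))   # = 1.5

def aipw(mu_model):
    """Population AIPW estimand: correct e(x), possibly wrong outcome model."""
    total = 0.0
    for x in (0, 1):
        for w in (0, 1):
            pr = p_x[x] * (e[x] if w == 1 else 1 - e[x])
            y = mu(w, x)                   # observed outcome (noiseless)
            ipw1 = w * (y - mu_model(1, x)) / e[x]
            ipw0 = (1 - w) * (y - mu_model(0, x)) / (1 - e[x])
            total += pr * (mu_model(1, x) - mu_model(0, x) + ipw1 - ipw0)
    return total

mu_bad = lambda w, x: 0.5 * w              # badly misspecified outcome model
tau_correct = aipw(mu)                     # both components correct
tau_robust = aipw(mu_bad)                  # wrong mu, correct e: still the ATE
```

With the correct propensity score, the inverse-weighting corrections exactly offset the bias of the misspecified outcome model, so both estimands equal the true average effect.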
For simplicity we start by assuming that τ_{it} = τ (constant treatment effects) and no covariates X_i. At the end of the section we discuss heterogeneity in treatment effects. We introduce covariates in Section 4. Consider the case with three periods and suppose that the distribution of W_i is given by Table 1. A researcher wants to use a standard fixed effects model and runs the following regression (in population):

Y_{it} = α_i + λ_t + τ_{fe} W_{it} + ε_{it},   E[ε_{it} | W_i, α_i] = 0.   (3.5)

Table 1: Assignment process and weights

P(W_i) | W_1 W_2 W_3 | ω^{fe}_1(W_i)  ω^{fe}_2(W_i)  ω^{fe}_3(W_i)
 0.09  |  0   0   0  |   0.46   -0.64    0.18
 0.04  |  1   0   0  |   5.70   -3.26   -2.44
 0.11  |  0   1   0  |  -2.16    4.60   -2.44
 0.14  |  1   1   0  |   3.08    1.98   -5.07
 0.07  |  0   0   1  |  -2.16   -3.26    5.42
 0.08  |  1   0   1  |   3.08   -5.88    2.80
 0.15  |  0   1   1  |  -4.78    1.98    2.80
 0.32  |  1   1   1  |   0.46   -0.64    0.18

Usual OLS logic implies that τ_{fe} has the following representation:

τ_{fe} = E[Y_{it} ω^{fe}_t(W_i)],   (3.6)

where ω^{fe}_t(W_i) are fixed effects weights that depend only on the distribution of W_i. For the distribution given above the weights are presented in Table 1. By construction these weights sum up to 0 for every row and every column (once reweighted by the probabilities). If the two-way model is correctly specified, then the estimator based on a sample analog of these weights has excellent statistical properties (see, e.g., Donoho et al. [1994], Armstrong and Kolesár [2018b], and references therein). At the same time, such an estimator is not entirely satisfactory.
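The fixed effects weights can be recovered from the assignment distribution alone by residualizing W_{it} on unit and time effects in population. The sketch below is our own reconstruction of that weighting argument (only the probabilities are taken from Table 1); it verifies the row/column zero-sum properties, the normalization, and the aggregation failure discussed around Table 2:

```python
import numpy as np

pi = np.array([0.09, 0.04, 0.11, 0.14, 0.07, 0.08, 0.15, 0.32])
W = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0],
              [0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
K, T = W.shape

# Weighted population regression of W_kt on type ("unit") and time dummies.
rows, targets, wts = [], [], []
for k in range(K):
    for t in range(T):
        d = np.zeros(K + T - 1)
        d[k] = 1.0                        # type effect
        if t > 0:
            d[K + t - 1] = 1.0            # time effect (period 1 dropped)
        rows.append(d); targets.append(W[k, t]); wts.append(pi[k])
X, y = np.array(rows), np.array(targets)
sw = np.sqrt(np.array(wts))
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
resid = (y - X @ coef).reshape(K, T)      # two-way residualized W

denom = np.sum(pi[:, None] * resid * W) / T
omega = resid / denom                     # fixed effects weights omega_t(W_k)

row_sums = omega.sum(axis=1)                      # zero per assignment type
col_sums = (pi[:, None] * omega).sum(axis=0)      # zero per period (weighted)
norm = np.sum(pi[:, None] * omega * W) / T        # normalization equals one

# Aggregating by the number of treated periods (cf. Table 2): these
# conditional means are nonzero, so the weights fail under the design
# model that conditions only on sum_t W_it.
S = W.sum(axis=1)
agg = (pi[S == 1][:, None] * omega[S == 1]).sum(axis=0) / pi[S == 1].sum()
```

Note that the never-treated and always-treated rows receive identical weights, matching the first and last rows of Table 1.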
In particular, assume that the assignment is random conditional on W̄_i ≡ (1/T) Σ_{t=1}^T W_{it}:

W_i ⊥⊥ {Y_i(w)}_w | W̄_i.   (3.7)

In this case, the relevant outcome model has the following structure:

Y_{it} = h_t(W̄_i) + τ W_{it} + ξ_{it},   E[ξ_{it} | W_i] = 0.   (3.8)

The estimator based on the fixed effect weights is consistent if the following condition is satisfied for every t and W̄_i:

E[ω^{fe}_t(W_i) | W̄_i] = 0.   (3.9)

Table 2 shows that this is not true for the given distribution of W_i. As a result, if the outcome

Table 2:
Aggregated weights

T·W̄_i | E[ω^{fe}_1(W_i)|W̄_i]  E[ω^{fe}_2(W_i)|W̄_i]  E[ω^{fe}_3(W_i)|W̄_i]
  0   |   0.46    -0.64     0.18
  1   |  -0.73     0.60     0.13
  2   |  -0.08     0.36    -0.28
  3   |   0.46    -0.64     0.18

model is given by (3.8), then the fixed effect weights give us an inconsistent estimator. This is not surprising, because the ω^{fe}_t(W_i) are not constructed to deal with such outcome models. At this point, it is natural to ask whether we can achieve both goals simultaneously, i.e., can we find weights that "work" if either the fixed effect model (3.5) or the design process (3.7) is correctly specified? The answer is positive, and the weights that satisfy this restriction are given in Table 3. It is evident that the weights sum up to zero for each row, and a simple

Table 3:
Doubly robust weights

ω^{dr}_1(W_i)  ω^{dr}_2(W_i)  ω^{dr}_3(W_i)
  0.00    0.00    0.00
  6.59   -3.95   -2.64
 -1.46    4.10   -2.64
  3.24    1.66   -4.90
 -1.46   -3.95    5.42
  3.24   -6.39    3.15
 -4.81    1.66    3.15
  0.00    0.00    0.00

calculation shows that E[ω^{dr}_t(W_i) | W̄_i] = 0 for every t and W̄_i. As a result, there is no trade-off in terms of identification, and we can construct an estimator that works for both models.

So far we have assumed that the treatment effects are constant. This assumption is very strong, and it is well documented that two-way estimators have problems in cases with heterogeneous treatment effects (e.g., see de Chaisemartin and D'Haultfœuille [2018]). This is evident after looking at Table 1: in the last row we assign negative weight to treated units in the second period. In contrast to this, all treated units receive non-negative weight when we use the doubly robust weights from Table 3. This is not a coincidence, and below we discuss a procedure that guarantees that this property is satisfied.

First we consider outcome models. Recall that by the no-dynamics assumption the potential outcomes Y_{it}(w) are indexed by a binary treatment w. A common outcome model that goes back at least to Chamberlain [1992] is the following one:

Assumption 3.1.
The potential outcomes satisfy:

E[Y_{it}(w) | U_i] = α(U_i) + λ_t + τ(U_i) w.   (3.10)

Given Assumption 2.2, the content of this model is that it restricts the time-dependency of the conditional mean of the control outcome and the treatment effect. Rewriting the model, we can see that more directly. The conditional control mean is

E[Y_{it}(0) | U_i] = α(U_i) + λ_t,

which is restricted to be additively separable in time, and the conditional treatment effect is

E[τ_{it} | U_i] = τ(U_i),

which is restricted to be time-invariant. We are interested in identifying a convex combination of the heterogeneous treatment effects τ(U_i) (which itself is a convex combination of τ_{it}) in this model. We do this by using weights ω_{kt} that satisfy the following restrictions:

(1/T) Σ_{k=1}^K Σ_{t=1}^T π_k ω_{kt} W_{kt} = 1,
∀k:  Σ_{t=1}^T ω_{kt} W_{kt} ≥ 0,
∀k:  (1/T) Σ_{t=1}^T ω_{kt} = 0,
∀t:  Σ_{k=1}^K π_k ω_{kt} = 0.   (3.11)

Let W_outc be the set of weights {ω_{kt}}_{t,k} that satisfy these restrictions. We can evaluate these restrictions, and thus we can construct this set. For any generic element ω ∈ W_outc define the random variables ω_{k(i)t}:

ω_{k(i)t} ≡ Σ_{k=1}^K ω_{kt} 1{W_i = W_k}.   (3.12)

Using these stochastic weights we can compute the following expectation:

τ(ω) = E[(1/T) Σ_{t=1}^T Y_{it} ω_{k(i)t}].   (3.13)

Proposition 1.
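As a concrete illustration, the restrictions in (3.11) can be checked numerically for a toy two-period support consisting of an adopter type and a never-treated type. The probabilities and the solved weights below are our own illustrative assumptions, not values from the paper:

```python
import numpy as np

pi = np.array([0.5, 0.5])                # pr of each assignment type (assumed)
W = np.array([[0.0, 1.0],                # adopter: treated in period 2
              [0.0, 0.0]])               # never treated
K, T = W.shape

# Weights solving (3.11) for this support: rows sum to zero,
# pi-weighted columns sum to zero, treated-weight normalization is one.
omega = np.array([[-4.0, 4.0],           # adopter row
                  [ 4.0, -4.0]])         # never-treated row

norm = np.sum(pi[:, None] * omega * W) / T           # should equal 1
row_sums = omega.sum(axis=1)                         # should be 0 per type
col_sums = (pi[:, None] * omega).sum(axis=0)         # should be 0 per period
treated_ok = (omega * W).sum(axis=1) >= 0            # per-row treated condition
```

This member of W_outc contrasts the adopter's change over time with the never-treated type's change, which is exactly the difference-in-differences comparison.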
Suppose Assumptions 2.1, 2.2, and 3.1 hold, and that ω ∈ W_outc. Then τ(ω) is a convex combination of τ(U_i).

As a result, a certain convex combination of τ(U_i) can be identified whenever W_outc is non-empty. A natural question is when this is the case. The answer is quite simple: the matrix W should contain at least one of the following three submatrices (up to permutations):

W^1 = ( 0 1
        0 0 ),   W^2 = ( 0 1
                         1 1 ),   W^3 = ( 0 1
                                          1 0 ).   (3.14)

Consider each of these three cases separately. In the first case there are adopters of the treatment with (W_{it} = 0, W_{it+1} = 1) and in the same periods t and t+1 non-adopters with (W_{it} = 0, W_{it+1} = 0). In the second case there are adopters of the treatment with (W_{it} = 0, W_{it+1} = 1) and in the same periods t and t+1 units who have already adopted and keep the treatment, with (W_{it} = 1, W_{it+1} = 1). In the last case there are adopters with (W_{it} = 0, W_{it+1} = 1) and units who switch out, with (W_{it} = 1, W_{it+1} = 0). To put this discussion in perspective, it is not sufficient to have assignment matrices of the type

W = ( 0 0
      1 1 ),   W = ( 0 1 ),

where with the first design some units are always in the control group and all others are always in the treatment group, and where with the second design all units adopt the treatment at exactly the same time.

In this section we consider assignment processes that satisfy a certain sufficiency property. We state it as a high-level assumption and then show examples of economic models that satisfy this assumption:
Assumption 3.2. (Sufficiency)
There exists a known W_i-measurable sufficient statistic S_i ∈ S and a subset A ⊂ S such that:

(i)  W_i ⊥⊥ U_i | S_i,   (3.15)

and (ii), for all s ∈ A:

max_w {r(w, s)} < 1,   (3.16)

where r(w, s) is the feasible generalized propensity score:

r(w, s) ≡ pr(W_i = w | S_i = s).   (3.17)

This assumption might look restrictive, but an S_i such that conditional on S_i the treatment W_i and the unobserved variable U_i are independent always exists, namely S_i^gen ≡ f_{U|W}(· | W_i), where f_{U|W}(x | y) is the conditional distribution of U_i given W_i. In general, S_i^gen is an infinite-dimensional object (a function) and is unknown, because f_{U|W}(x | y) is unknown. As a result, the first restriction that we make in Assumption 3.2 is that S_i is known. Part (ii) does not allow for S_i = S_i^gen because we require W_i to have a non-degenerate distribution given S_i. Below we consider various assignment models that are common in the empirical panel data literature and demonstrate that in all of them there exists an S_i that one can easily compute. The main implication of Assumption 3.2 coupled with Assumption 2.2 is summarized in the following proposition:

Proposition 2. Suppose Assumptions 2.1, 2.2, and 3.2 hold. Then for any w:

W_i = w ⊥⊥ Y_i(w) | S_i.   (3.18)

This proposition demonstrates that unconfoundedness conditional on U_i can be transformed into unconfoundedness conditional on S_i, under the additional assumption that restricts the assignment process. The assignment models that we consider in this section are restrictive, in the sense that they must satisfy Assumption 3.2. At the same time, most of the models for the binary time-series process W_{it} that are used in the applied and theoretical literature actually satisfy these restrictions (see, e.g., Honoré and Kyriazidou [2000], Chamberlain [2010], Aguirregabiria et al. [2018]).
In fact, in certain cases the existence of a sufficient statistic is a necessary requirement for estimation of common parameters (e.g., Magnac [2004]). This is especially relevant because many of such models have an underlying economic intuition and can be interpreted as models of optimal choice. We are not interested in estimating common parameters of the model for W_i, which is the standard object in non-linear panel analysis. Instead, we only require that the conditional distribution of W_i admits a certain representation. Parameters of this representation are not identified with fixed T, but they do not play any role in Proposition 2, which is the only result that we need.

Static model.
As a first example we consider a static logit model with heterogeneity over time. Formally, we consider the following model:

E[W_{it} | U_i] = exp(α(U_i)^⊤ ψ(t) + λ_t) / (1 + exp(α(U_i)^⊤ ψ(t) + λ_t)),
W_{it} ⊥⊥ {W_{il}}_{l≠t} | U_i,   (3.19)

where ψ(t) is a known function of t. It is easy to demonstrate that in this model

S_i = Σ_{t=1}^T ψ(t) W_{it} / T.  □
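The sufficiency claim for the static model can be verified numerically by enumeration. The sketch below takes ψ(t) = 1 (an illustrative choice, so that S_i = Σ_t W_{it}/T) and assumed values for λ_t and two levels of the heterogeneity α(U), and checks that the conditional law of the assignment path given S_i does not depend on α(U):

```python
import numpy as np
from itertools import product

T = 3
lam = np.array([0.2, -0.5, 1.0])        # common time effects (assumed values)

def seq_prob(w, alpha):
    """P(W = w | U) under the static logit model with psi(t) = 1."""
    p = 1.0
    for t in range(T):
        pt = np.exp(alpha + lam[t]) / (1 + np.exp(alpha + lam[t]))
        p *= pt if w[t] == 1 else (1 - pt)
    return p

def conditional_given_S(alpha):
    """P(W = w | S, U) for every sequence, keyed by (S, w)."""
    seqs = list(product([0, 1], repeat=T))
    probs = {w: seq_prob(w, alpha) for w in seqs}
    out = {}
    for w in seqs:
        s = sum(w)
        total = sum(p for v, p in probs.items() if sum(v) == s)
        out[(s, w)] = probs[w] / total
    return out

c1 = conditional_given_S(alpha=-1.0)
c2 = conditional_given_S(alpha=2.0)
max_gap = max(abs(c1[k] - c2[k]) for k in c1)   # zero up to rounding
```

Algebraically, α(U_i) enters P(W_i | U_i) only through exp(α Σ_t W_{it}), which cancels once we condition on the sum, so the gap is zero up to floating-point error.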
Next we consider a time-homogeneous Markov model:

E[W_{it} | U_i, W_i^{t−1}] = exp(α(U_i) + γ(U_i) W_{it−1}) / (1 + exp(α(U_i) + γ(U_i) W_{it−1})),
W_{it} ⊥⊥ {W_{il}}_{l>t} | U_i, W_i^{t−1}.   (3.20)

In this model

S_i = ( Σ_{t=2}^{T−1} W_{it},  Σ_{t=2}^T W_{it} W_{it−1},  W_{i1},  W_{iT} ).  □
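The same enumeration check works for the Markov model. In the sketch below we treat the initial condition by conditioning on W_{i1} (it is part of the statistic) and use illustrative parameter values for α(U) and γ(U); conditional on S_i, the law of the path should not depend on U:

```python
import numpy as np
from itertools import product

T = 4

def path_prob(w, alpha, gamma):
    """P(W_2,...,W_T = w_2,...,w_T | W_1, U) under the Markov logit model."""
    p = 1.0
    for t in range(1, T):
        z = alpha + gamma * w[t - 1]
        pt = np.exp(z) / (1 + np.exp(z))
        p *= pt if w[t] == 1 else (1 - pt)
    return p

def conditional_given_S(alpha, gamma):
    seqs = list(product([0, 1], repeat=T))
    probs = {w: path_prob(w, alpha, gamma) for w in seqs}
    stat = lambda w: (sum(w[1:T - 1]),                       # sum_{2..T-1} W_t
                      sum(w[t] * w[t - 1] for t in range(1, T)),
                      w[0], w[T - 1])                        # W_1 and W_T
    out = {}
    for w in seqs:
        s = stat(w)
        total = sum(p for v, p in probs.items() if stat(v) == s)
        out[(s, w)] = probs[w] / total
    return out

c1 = conditional_given_S(alpha=-0.5, gamma=1.5)
c2 = conditional_given_S(alpha=1.0, gamma=-2.0)
max_gap = max(abs(c1[k] - c2[k]) for k in c1)   # zero up to rounding
```

Here both the level α and the state dependence γ cancel out of the conditional distribution, since the path probability depends on the path only through the components of S_i.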
For sufficiency we need the following representation for the conditional distribution of W_i:

log(P(W_i | U_i)) = S(W_i)^⊤ α(U_i) + β(U_i) + γ(W_i),   (3.21)

where S(·) is a known function of W_i. All previous examples have this representation. More generally, Aguirregabiria et al. [2018] show that this structure arises in flexible models of dynamic choice.  □

Let S_i be a potential sufficient statistic. Let W^s be a matrix representation of the support of W_i conditional on S_i = s, and let W_k^s be a generic row (element of the support). For example, if S_i = Σ_t W_{it} and W is given by (3.1), then S_i takes 3 possible values and we have the following:

W^0 = ( 0 0 0 ),   W^2 = ( 0 1 1
                           1 0 1 ),   W^3 = ( 1 1 1 ).   (3.22)

When considering an identification strategy based on design assumptions, we do not restrict potential outcomes, but instead require that the assumptions behind Proposition 2 are satisfied. In this case, one can identify a convex combination of individual treatment effects using weights that satisfy the following restrictions (for all k, s, and t):

(1/T) Σ_{t,k} π_k ω_{kt} W_{kt} = 1,
Σ_{k: W_k ∈ W^s} π_k ω_{kt} W_{kt} ≥ 0,
Σ_{k: W_k ∈ W^s} π_k ω_{kt} = 0.   (3.23)

Let W_design be the set of weights {ω_{kt}}_{t,k} that satisfy these restrictions. It is easy to see that W_design is nonempty whenever there exists at least one s such that W^s contains at least two rows. This is guaranteed by the second part of Assumption 3.2. For any ω ∈ W_design define the random variables ω_{k(i)t} in the same way as before and consider the following expectation:

τ(ω) = E[(1/T) Σ_{t=1}^T Y_{it} ω_{k(i)t}].   (3.24)

Proposition 3.
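The design restrictions in (3.23) can also be checked in a toy example. Below, T = 2, S_i = Σ_t W_{it}, and the support within S = 1 is {(0,1), (1,0)}; the probabilities and solved weights are our own illustrative assumptions:

```python
import numpy as np

pi = np.array([0.6, 0.4])               # assumed type probabilities
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])              # both rows have S = 1
K, T = W.shape

# Weights solving (3.23) for the single S-group: pi-weighted period
# sums are zero and the treated-weight normalization equals one.
omega = np.array([[-1 / 0.6, 1 / 0.6],
                  [ 1 / 0.4, -1 / 0.4]])

norm = np.sum(pi[:, None] * omega * W) / T           # should equal 1
col_sums = (pi[:, None] * omega).sum(axis=0)         # zero within the S-group
treated = (pi[:, None] * omega * W).sum(axis=0)      # nonnegative per period
```

This member of W_design compares treated and control units within the same period, among units with the same total exposure.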
Suppose Assumptions 2.1, 2.2, 2.3, and 3.2 hold, and that ω ∈ W_design. Then τ(ω) is a convex combination of treatment effects.

The sets W_outc and W_design are motivated by different models and in general do not need to be similar. In some sense, one can say that the weights in W_outc target within-unit comparisons, while those in W_design target within-period comparisons. This interpretation is convenient, but it is not entirely correct because in general W_outc ∩ W_design is not empty. Consequently, one does not need to take a stand on what comparisons to use: those based on looking at the same units across time, or at different units for a fixed time period. As a result, we suggest using the weights in W_outc ∩ W_design. In fact, we restrict this set even further and define the following one:

W_dr ≡ {ω}  subject to:
(1/T) Σ_{t,k} π_k ω_{kt} W_{kt} = 1,
(1/T) Σ_{t=1}^T ω_{kt} = 0,
Σ_{k: W_k ∈ W^s} π_k ω_{kt} = 0,
ω_{kt} W_{kt} ≥ 0,   (3.25)

and note that W_dr ⊂ (W_outc ∩ W_design). The difference between W_outc ∩ W_design and W_dr is quite small: we simply impose the additional restriction that every treated unit receives a non-negative weight. Note that neither the weights in W_outc nor those in W_design in general satisfy this restriction. This is important in practice, because we want to be robust to arbitrary heterogeneity in treatment effects.

When is the set W_dr non-empty? Combining the earlier discussion of W_outc and W_design, it is easy to see that a necessary and sufficient condition for W_dr to be non-empty is that there exists an s such that the corresponding W^s contains at least one of the following two sub-matrices (up to permutations):

W = ( 0 1
      0 0 ),   W = ( 0 1
                     1 0 ).   (3.26)

In particular, note that the matrix W^2 from (3.14) is not sufficient. The reason for this is that we require the weights for treated units to be non-negative and to sum up to zero for each row. This implies that the first row should receive a zero weight, and thus we cannot make cross-sectional comparisons.
The requirement that $\mathbb{W}_s$ contain these sub-matrices is in general more demanding than the second part of Assumption 3.2. At the same time, if $S_i$ includes $\overline{W}_i$, then for any $s$, $\mathbb{W}_s$ can contain one of the sub-matrices in (3.26) only if it contains the other, and this is equivalent to the overlap condition.

Finally, we can state the main identification result. The following theorem is a direct consequence of Propositions 1 and 3:

Theorem 1.
Suppose Assumptions 2.1, 2.2, and 2.3 hold, and that either Assumption 3.1 or Assumption 3.2 (or both) holds. Then for any $\omega \in \mathcal{W}_{dr}$, the estimand $\tau(\omega)$ is a convex combination of treatment effects.

We assume that we observe a random sample $\{Y_i, W_i, X_i\}_{i=1}^{N}$ from some distribution $\mathbb{P}$, with $T$ (the number of periods) being fixed. We assume that the researcher has constructed a sufficient statistic $S_i \equiv S(W_i, X_i)$ based on a design model. We maintain Assumption 2.1 and additionally restrict the outcome model:

Assumption 4.1.
For each $t$, one of the following outcome models is correct. Either there exists a sufficient statistic $S_i$ such that the following is true:
\[
Y_{it}(0) = \beta_t + \psi_1(X_i, t)^\top \delta + \psi_2(X_i, S_i, t)^\top \gamma + \xi_{it}, \qquad \mathbb{E}[\xi_{it} \mid X_i, S_i] = 0, \qquad (\xi_{i1}, \ldots, \xi_{iT}) \perp\!\!\!\perp W_i \mid X_i, S_i, \tag{4.1}
\]
or $U_i = (W_i, X_i)$ and we have the following:
\[
Y_{it}(0) = \alpha(W_i, X_i) + \beta_t + \psi_1(X_i, t)^\top \delta + \varepsilon_{it}, \qquad \mathbb{E}[\varepsilon_{it} \mid X_i, W_i] = 0, \tag{4.2}
\]
where $\psi_1(X_i, t)$ and $\psi_2(X_i, S_i, t)$ are known $p$-dimensional functions.

This assumption allows for our design model to be correct, so that we only need to control for $(S_i, X_i)$, or for the more traditional fixed effects model to be correct. We do not impose any restrictions on $Y_{it}(1)$, and thus on heterogeneity in treatment effects. For simplicity we assume that in both cases the conditional expectations are linear in parameters with respect to a known finite-dimensional dictionary. Since all our identification results hold conditional on $X_i$, this assumption is not necessary, and the estimation procedure below can be adapted to allow for unknown $\psi_1$ and $\psi_2$. At the same time, we believe that our estimator is a natural alternative to the current status quo, which is a two-way fixed effects model estimated by OLS based on (4.2). We leave further nonparametric generalizations to future work.

Our estimator is defined in the following way:
\[
\hat{\tau} := \frac{1}{NT} \sum_{it} \hat{\omega}_{it} Y_{it}, \tag{4.3}
\]
where the weights $\{\hat{\omega}_{it}\}_{it}$ solve the optimization problem:
\[
\{\hat{\omega}_{it}\}_{it} = \arg\min_{\{\omega_{it}\}_{it}} \frac{1}{(NT)^2} \sum_{it} \omega_{it}^2 \quad \text{subject to:} \quad
\begin{cases}
\dfrac{1}{NT} \displaystyle\sum_{it} \omega_{it} W_{it} \ge 1, \\[6pt]
\dfrac{1}{T} \displaystyle\sum_{t} \omega_{it} = 0 \;\; \text{for each } i, \qquad \dfrac{1}{N} \displaystyle\sum_{i} \omega_{it} = 0 \;\; \text{for each } t, \\[6pt]
\dfrac{1}{NT} \displaystyle\sum_{it} \omega_{it} \psi(X_i, S_i, t) = 0, \qquad \omega_{it} W_{it} \ge 0,
\end{cases} \tag{4.4}
\]
where $\psi(X_i, S_i, t) := (\psi_1(X_i, t), \psi_2(X_i, S_i, t))$. At the optimum the first inequality is binding, and we write it in this form to simplify the dual representation below.
The weights $\hat{\omega}_{it}$ are related to standard OLS fixed effects weights, but here we explicitly look for weights that balance functions of $S_i$, not only fixed attributes $X_i$, and that satisfy certain inequality constraints. The last restriction is crucial, because it is well documented that standard OLS estimators with fixed effects in general do not correspond to reasonable estimands if the effects are heterogeneous (see, e.g., de Chaisemartin and D'Haultfœuille [2018]).

It is natural to ask whether the weights that solve the problem above exist. In Lemma A.1 we show that a necessary and sufficient condition for existence is that the control and treated units satisfy a certain overlap condition. In particular, there are no $\{\lambda_i, \mu_t, \gamma\}_{i,t}$ such that the following is true:
\[
\lambda_i + \mu_t + \psi_{it}^\top \gamma \ge 0, \qquad W_{it} = \mathbb{1}\{\lambda_i + \mu_t + \psi_{it}^\top \gamma > 0\}. \tag{4.5}
\]
This is a very mild overlap condition that is likely to be satisfied for any reasonable assignment process.

Our estimator fits naturally into the recent theoretical literature on balancing weights (e.g., Imai and Ratkovic [2014], Zubizarreta [2015], Athey et al. [2016], Hirshberg and Wager [2017], Chernozhukov et al. [2018a,b], Armstrong and Kolesár [2018a]). The main technical difference between our approach and the ones proposed in the literature is that we need to balance unit-specific functions and explicitly impose non-negativity constraints. At the same time, we only balance a small parametric class of functions of $(X_i, S_i)$, while others consider much more general functional classes. We leave this generalization to future research.
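The overlap condition in (4.5) can be checked as a linear-programming feasibility problem: with no covariates, assignment is "separated" if and only if some $\lambda_i + \mu_t$ is zero on control cells and strictly positive (normalized to $\ge 1$) on treated cells. The helper name below is ours, and covariates are dropped for brevity:

```python
import numpy as np
from scipy.optimize import linprog

def separated(W):
    """Return True iff (lambda_i, mu_t) as in (4.5) exist, i.e. overlap fails."""
    N, T = W.shape
    A_eq, b_eq, A_ub, b_ub = [], [], [], []
    for i in range(N):
        for t in range(T):
            row = np.zeros(N + T)
            row[i] = 1.0       # lambda_i
            row[N + t] = 1.0   # mu_t
            if W[i, t] == 1:
                A_ub.append(-row)  # lambda_i + mu_t >= 1 on treated cells
                b_ub.append(-1.0)
            else:
                A_eq.append(row)   # lambda_i + mu_t = 0 on control cells
                b_eq.append(0.0)
    res = linprog(np.zeros(N + T), A_ub=A_ub or None, b_ub=b_ub or None,
                  A_eq=A_eq or None, b_eq=b_eq or None,
                  bounds=[(None, None)] * (N + T))
    return res.status == 0  # feasible => assignment is separated

W_bad = np.array([[1.0, 1.0], [0.0, 0.0]])   # an always-treated unit: separated
W_good = np.array([[0.0, 1.0], [1.0, 0.0]])  # switching paths: overlap holds
```

Here `separated(W_bad)` is `True` (take $\lambda_1 = 1$, all other parameters zero), while `separated(W_good)` is `False`, so the weights in (4.4) exist for the second design.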
The Lagrangian saddle-point problem for the program (4.4) has the following form:
\[
\inf_{\omega_{it}} \sup_{\lambda(t), \lambda(i), \gamma, \mu_{it} \ge 0, \pi \ge 0} \frac{1}{(NT)^2} \sum_{it} \omega_{it}^2 + \frac{1}{N} \sum_{i} \lambda(i) \left( \frac{1}{T} \sum_{t} \omega_{it} \right) + \frac{1}{T} \sum_{t} \lambda(t) \left( \frac{1}{N} \sum_{i} \omega_{it} \right) + \pi \left( 1 - \frac{1}{NT} \sum_{it} \omega_{it} W_{it} \right) - \gamma^\top \left( \frac{1}{NT} \sum_{it} \omega_{it} \psi_{it} \right) - \frac{1}{NT} \sum_{it} \mu_{it} \omega_{it} W_{it}, \tag{4.6}
\]
where we use $\psi_{it}$ as shorthand for $\psi(X_i, S_i, t)$. In Lemma A.1 we show that strong duality holds and we can rearrange the minimization and maximization:
\[
\sup_{\lambda(t), \lambda(i), \gamma, \mu_{it} \ge 0, \pi \ge 0} \inf_{\omega_{it}} \; \frac{1}{(NT)^2} \sum_{it} \omega_{it}^2 + \frac{1}{N} \sum_{i} \lambda(i) \left( \frac{1}{T} \sum_{t} \omega_{it} \right) + \frac{1}{T} \sum_{t} \lambda(t) \left( \frac{1}{N} \sum_{i} \omega_{it} \right) - \pi \left( \frac{1}{NT} \sum_{it} \omega_{it} W_{it} - 1 \right) - \gamma^\top \left( \frac{1}{NT} \sum_{it} \omega_{it} \psi_{it} \right) - \frac{1}{NT} \sum_{it} \mu_{it} \omega_{it} W_{it}. \tag{4.7}
\]
Solving this in terms of $\omega_{it}$ (an unconstrained quadratic problem), we get the following representation:
\[
\inf_{\lambda(t), \lambda(i), \gamma, \mu_{it} \ge 0, \pi \ge 0} \mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \left( \pi W_{it} - \lambda(t) - \lambda(i) - \gamma^\top \psi_{it} - \mu_{it} W_{it} \right)^2 \right] - 4\pi. \tag{4.8}
\]
We can further simplify this expression by concentrating out $\mu_{it}$ and $\pi$. To this end, define the following loss function:
\[
\rho_z(x) := x^2 (1 - z) + (x_+)^2\, z, \qquad x_+ := \max\{x, 0\}. \tag{4.9}
\]
After some algebra we get the following:
\[
\inf_{\lambda(t), \lambda(i), \gamma} \mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \rho_{W_{it}} \left( W_{it} - \lambda(t) - \lambda(i) - \gamma^\top \psi_{it} \right) \right]. \tag{4.10}
\]
Let $\{\hat{\lambda}(t), \hat{\lambda}(i), \hat{\gamma}\}_{i,t}$ be the solutions to this problem.
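The concentrated problem (4.10) is a smooth convex minimization and the weights then follow from the fitted residuals. Below is a minimal sketch with no covariates ($\gamma$ dropped) on an illustrative panel; the data and names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def rho(x, z):
    # rho_z(x) = x^2 (1 - z) + max(x, 0)^2 z: control cells get the full
    # quadratic loss, treated cells are only penalized for positive residuals.
    return x ** 2 * (1 - z) + np.maximum(x, 0.0) ** 2 * z

W = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
N, T = W.shape

def dual_objective(theta):
    lam_i, lam_t = theta[:N], theta[N:]
    resid = W - lam_t[None, :] - lam_i[:, None]
    return rho(resid, W).sum() / (N * T)

res = minimize(dual_objective, x0=np.zeros(N + T), method="BFGS")
lam_i_hat, lam_t_hat = res.x[:N], res.x[N:]

# Weight construction: raw residuals on control cells, positive parts on
# treated cells, then normalize so that (1/NT) sum_it w_it W_it = 1.
resid = W - lam_t_hat[None, :] - lam_i_hat[:, None]
omega_un = resid * (1 - W) + np.maximum(resid, 0.0) * W
omega_hat = omega_un / ((omega_un * W).sum() / (N * T))
```

By construction the normalized weights are non-negative on treated cells and average to one there, matching the properties claimed for $\hat{\omega}_{it}$ below.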
The optimal unnormalized weights are equal to the following:
\[
\hat{\omega}^{(un)}_{it} = \left( W_{it} - \hat{\lambda}(t) - \hat{\lambda}(i) - \hat{\gamma}^\top \psi_{it} \right)(1 - W_{it}) + \left( W_{it} - \hat{\lambda}(t) - \hat{\lambda}(i) - \hat{\gamma}^\top \psi_{it} \right)_+ W_{it}, \tag{4.11}
\]
and the optimal weights are given by the normalization:
\[
\hat{\omega}_{it} := \frac{\hat{\omega}^{(un)}_{it}}{\frac{1}{NT} \sum_{it} \hat{\omega}^{(un)}_{it} W_{it}}. \tag{4.12}
\]
By construction, the weights are non-negative for the treated units and sum up to one once multiplied by $W_{it}$. The denominator is strictly positive under the conditions of Lemma A.1.

In order to state the inference results we need to make several statistical assumptions:
Assumption 4.2. (a) $\mathbb{P}$-a.s. $(X_i, S_i) \in \Omega$, a compact subset of some metric space; (b) $\psi(X_i, S_i, t)$ is a continuous function of its arguments (on $\Omega$); (c) the errors $u_{it}$ satisfy the following moment conditions:
\[
\mathbb{E}[u_{it}^2 \mid W_i, X_i] \le \sigma_u^2 < \infty, \qquad \mathbb{E}[u_{it}^4] < \infty. \tag{4.13}
\]
The part of the assumption about $u_{it}$ is standard in the literature on projection estimators. We assume compactness to streamline the proofs, and we think that it covers most problems that researchers face in applications. There is no doubt that it can be considerably relaxed.

Assumption 4.3. (a) $S_i$ includes $\overline{W}_i$; (b) for all $t$ and some $\eta > 0$ we have $\mathbb{E}[W_{it} \mid S_i, X_i] \le 1 - \eta$; (c) the following holds:
\[
\Gamma_{it} := (1 - W_{it}) \left( \psi_{it} - \frac{\sum_{l=1}^{T} (1 - W_{il}) \psi_{il}}{\sum_{l=1}^{T} (1 - W_{il})} \right), \qquad \sigma_{\min} \left( \sum_{t=1}^{T} \mathbb{E}\left[ \Gamma_{it} \Gamma_{it}^\top \right] \right) \ge \kappa > 0. \tag{4.14}
\]
The next result describes the asymptotic behavior of $\hat{\tau}$ and $\hat{\omega}_{it}$:

Theorem 2.
Suppose Assumptions 4.1, 4.2, and 4.3 are satisfied. Then there exists a collection of random variables $\{\omega^\star(X_i, W_i, t)\}_{t=1}^{T}$ such that the following holds:
\[
\frac{1}{T} \sum_{t=1}^{T} \| \hat{\omega}_t - \omega^\star_t \|_2 = o_p(1). \tag{4.15}
\]
Define the following conditional estimand:
\[
\tau_{emp} = \frac{1}{NT} \sum_{it} \hat{\omega}_{it} W_{it}\, \mathbb{E}[\tau_{it} \mid W_i, X_i]. \tag{4.16}
\]
The scaled difference between the estimator and $\tau_{emp}$ converges in distribution to a normal random variable:
\[
\sqrt{n} (\hat{\tau} - \tau_{emp}) \to \mathcal{N}(0, \sigma_\tau^2), \tag{4.17}
\]
where the variance has the following form:
\[
\sigma_\tau^2 := \mathbb{E}\left[ \left( \frac{1}{T} \sum_{t=1}^{T} \omega^\star_{it} \left( u_{it} + W_{it} (\tau_{it} - \mathbb{E}[\tau_{it} \mid W_i, X_i]) \right) \right)^2 \right], \tag{4.18}
\]
where $\omega^\star_{it} := \omega^\star(X_i, W_i, t)$, and $u_{it}$ is equal to either $\xi_{it}$ or $\varepsilon_{it}$.

This theorem describes the performance of our estimator in larger samples. The population weights $\omega^\star$ depend on $(X_i, W_i)$, not only on $S_i$, which is an implication of the fact that we need to deal with individual fixed effects.

Our next result shows that the standard nonparametric bootstrap provides a conservative estimator for $\sigma_\tau^2$.

Theorem 3.
Let $\{\hat{\tau}^{(b)}\}_{b=1}^{B}$ be a set of non-parametric (unit-level) bootstrap analogs of $\hat{\tau}$. Define:
\[
\hat{\sigma}^2 := \frac{N}{B} \sum_{b=1}^{B} \left( \hat{\tau}^{(b)} - \hat{\tau} \right)^2, \tag{4.19}
\]
and suppose that the assumptions of Theorem 2 hold. Then, if $\mathbb{E}[\tau_{it} \mid W_i, X_i] = \tau$, $\hat{\sigma}^2$ is consistent for $\sigma_\tau^2$; otherwise $\hat{\sigma}^2$ is conservative.

Conclusion
In this paper, we propose a novel identification argument that can be used to evaluate a causal effect using panel data. We show that one can naturally combine familiar restrictions on the relationship between the outcome and the unobserved unit-level characteristics with reasonable economic models of the assignment process. Our approach allows us to construct a doubly robust identification argument: our estimand has a causal interpretation if either the outcome model is correct, or the assignment model is correct (or both). Using these results, we construct a natural generalization of the standard two-way fixed effects estimator that is robust to arbitrary heterogeneity in treatment effects, and we show that it has reasonable theoretical properties.

References
Alberto Abadie and Matias D. Cattaneo. Econometric methods for program evaluation. Annual Review of Economics, 10:465–503, 2018.

Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505, 2010.

Victor Aguirregabiria, Jiaying Gu, and Yao Luo. Sufficient statistics for unobserved heterogeneity in structural dynamic logit models. arXiv preprint arXiv:1805.04048, 2018.

Joseph G. Altonji and Rosa L. Matzkin. Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica, 73(4):1053–1102, 2005.

Manuel Arellano. Panel Data Econometrics. Oxford University Press, 2003.

Manuel Arellano and Stéphane Bonhomme. Nonlinear panel data analysis. 2011.

Manuel Arellano and Bo Honoré. Panel data models: Some recent developments. Handbook of Econometrics, 5:3229–3296, 2001.

Dmitry Arkhangelsky and Guido Imbens. The role of the propensity score in fixed effect models. Technical report, National Bureau of Economic Research, 2018.

Dmitry Arkhangelsky, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager. Synthetic difference in differences. Technical report, National Bureau of Economic Research, 2019.

Timothy Armstrong and Michal Kolesár. Finite-sample optimal estimation and inference on average treatment effects under unconfoundedness. 2018a.

Timothy B. Armstrong and Michal Kolesár. Optimal inference in a class of regression models. Econometrica, 86(2):655–683, 2018b.

Susan Athey and Guido Imbens. Design-based analysis in difference-in-differences settings with staggered adoption. 2018.

Susan Athey, Guido Imbens, and Stefan Wager. Efficient inference of average treatment effects in high dimensions via approximate residual balancing.
arXiv preprint arXiv:1604.07125, 2016.

Susan Athey, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. Matrix completion methods for causal panel data models. arXiv preprint arXiv:1710.10251, 2017.

Eli Ben-Michael, Avi Feller, and Jesse Rothstein. The augmented synthetic control method. arXiv preprint arXiv:1811.04170, 2018.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Gary Chamberlain. Panel data. Handbook of Econometrics, 2:1247–1318, 1984.

Gary Chamberlain. Efficiency bounds for semiparametric regression. Econometrica, 60(3):567–596, 1992.

Gary Chamberlain. Binary response models for panel data: Identification and information. Econometrica, 78(1):159–168, 2010.

Victor Chernozhukov, Iván Fernández-Val, Jinyong Hahn, and Whitney Newey. Average and quantile effects in nonseparable panel models. Econometrica, 81(2):535–580, 2013.

Victor Chernozhukov, Whitney Newey, and James Robins. Double/de-biased machine learning using regularized Riesz representers. arXiv preprint arXiv:1802.08667, 2018a.

Victor Chernozhukov, Whitney K. Newey, and Rahul Singh. Learning L2 continuous regression functionals via regularized Riesz representers. arXiv preprint arXiv:1809.05224, 2018b.

Clément de Chaisemartin and Xavier D'Haultfœuille. Two-way fixed effects estimators with heterogeneous treatment effects. 2018.

David L. Donoho. Statistical estimation and optimal recovery. The Annals of Statistics, 22(1):238–270, 1994.

Andrew Goodman-Bacon. Difference-in-differences with variation in treatment timing. Technical report, Working Paper, 2017.

David A. Hirshberg and Stefan Wager. Augmented minimax linear estimation. arXiv preprint arXiv:1712.00038, 2017.

Bo E. Honoré and Ekaterini Kyriazidou. Panel data discrete choice models with lagged dependent variables. Econometrica, 68(4):839–874, 2000.

Kosuke Imai and In Song Kim.
When should we use unit fixed effects regression models for causal inference with longitudinal data? American Journal of Political Science, 63(2):467–490, 2019.

Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.

Guido Imbens. The role of the propensity score in estimating dose–response functions. Biometrika, 87(3):706–710, 2000.

Guido Imbens. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1):4–29, 2004.

Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Thierry Magnac. Panel binary variables and sufficiency: Generalizing conditional logit. Econometrica, 72(6):1859–1876, 2004.

Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25–39, 2014.

Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Yiqing Xu. Generalized synthetic control method: Causal inference with interactive fixed effects models. Political Analysis, 25(1):57–76, 2017.

José R. Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.
Appendix
Proof of Proposition 1: For any $\omega \in \mathcal{W}_{outc}$ we defined the random variables
\[
\omega_{k(i)t} \equiv \sum_{k=1}^{K} \omega_{kt}\, \mathbb{1}\{W_i = W_k\} \tag{A.1}
\]
and considered the following estimand:
\[
\tau(\omega) = \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} Y_{it}\, \omega_{k(i)t}\right]. \tag{A.2}
\]
By assumption we have the representation:
\[
\begin{aligned}
\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} Y_{it}\, \omega_{k(i)t}\right]
&= \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \left( \alpha(U_i) + \lambda_t + \tau(U_i) W_{it} + \varepsilon_{it} \right) \omega_{k(i)t}\right] \\
&= \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \left( \alpha(U_i) + \lambda_t + \tau(U_i) W_{it} + \varepsilon_{it} \right) \sum_{k=1}^{K} \omega_{kt}\, \mathbb{1}\{W_i = W_k\}\right] \\
&= \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{K} \alpha(U_i)\, \omega_{kt}\, \mathbb{1}\{W_i = W_k\}\right] + \frac{1}{T} \sum_{t=1}^{T} \lambda_t \sum_{k=1}^{K} \mathbb{E}\left[\omega_{kt}\, \mathbb{1}\{W_i = W_k\}\right] + \mathbb{E}\left[\tau(U_i)\, \frac{1}{T} \sum_{k=1}^{K} \sum_{t=1}^{T} \mathbb{1}\{W_i = W_k\} W_{kt}\, \omega_{kt}\right] \\
&= \frac{1}{T} \sum_{t=1}^{T} \lambda_t \sum_{k=1}^{K} \pi_k \omega_{kt} + \mathbb{E}[\tau(U_i)\, \xi(W_i)] = \mathbb{E}[\tau(U_i)\, \xi(W_i)],
\end{aligned} \tag{A.3}
\]
where $\xi(W_i) := \frac{1}{T} \sum_{k=1}^{K} \sum_{t=1}^{T} \mathbb{1}\{W_i = W_k\} W_{kt}\, \omega_{kt} \ge 0$. The first equality follows from the restrictions on the outcome model; the second by definition of the weights; the third because $\mathbb{E}[\varepsilon_i \mid U_i] = 0$ and the strict exogeneity assumption; finally, the last two equalities follow by construction of the weights. By construction we also have that $\xi(W_i) \ge 0$ and $\mathbb{E}[\xi(W_i)] = 1$. This proves the claim.

Proof of Proposition 3: The proof is very similar to the one above and is omitted.
Proof of Proposition 2: We need to prove the following for arbitrary $w$ and measurable $A_0$, $A_1$:
\[
\mathbb{E}[\mathbb{1}\{W_i = w\}\, \mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid S_i] = \mathbb{E}[\mathbb{1}\{W_i = w\} \mid S_i]\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid S_i]. \tag{A.4}
\]
We have the following chain of equalities, which proves the claim:
\[
\begin{aligned}
\mathbb{E}[\mathbb{1}\{W_i = w\}\, \mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid S_i]
&= \mathbb{E}\big[\mathbb{1}\{W_i = w\}\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid S_i, U_i, W_i] \mid S_i\big] \\
&= \mathbb{E}\big[\mathbb{1}\{W_i = w\}\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid U_i, S_i] \mid S_i\big] \\
&= \mathbb{E}\big[\mathbb{E}[\mathbb{1}\{W_i = w\} \mid S_i, U_i]\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid U_i, S_i] \mid S_i\big] \\
&= \mathbb{E}\big[\mathbb{E}[\mathbb{1}\{W_i = w\} \mid S_i]\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid U_i, S_i] \mid S_i\big] \\
&= \mathbb{E}[\mathbb{1}\{W_i = w\} \mid S_i]\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid S_i],
\end{aligned} \tag{A.5}
\]
where the second equality follows by strict exogeneity and the fourth by sufficiency.

Lemma A.1.
Suppose that $\{W_{it}\}_{i,t}$ are such that there are no $\{\alpha_i, \beta_t, \gamma\}_{i,t}$ for which the following is true:
\[
\alpha_i + \beta_t + \psi_{it}^\top \gamma \ge 0, \qquad W_{it} = \mathbb{1}\{\alpha_i + \beta_t + \psi_{it}^\top \gamma > 0\}. \tag{A.6}
\]
Then (a) the primal problem always has a unique solution, and (b) strong duality holds; i.e., for the function
\[
h(\lambda, \mu, \pi, \gamma, \omega) := \frac{1}{(nT)^2} \sum_{it} \omega_{it}^2 + \frac{1}{n} \sum_{i} \lambda(i) \left( \frac{1}{T} \sum_{t} \omega_{it} \right) + \frac{1}{T} \sum_{t} \lambda(t) \left( \frac{1}{n} \sum_{i} \omega_{it} \right) + \pi \left( 1 - \frac{1}{nT} \sum_{it} \omega_{it} W_{it} \right) - \gamma^\top \left( \frac{1}{nT} \sum_{it} \omega_{it} \psi_{it} \right) - \frac{1}{nT} \sum_{it} \mu_{it} \omega_{it} W_{it} \tag{A.7}
\]
we have
\[
\inf_{\omega_{it}} \sup_{\lambda(t), \lambda(i), \gamma, \mu_{it} \ge 0, \pi \ge 0} h(\lambda, \mu, \pi, \gamma, \omega) = \sup_{\lambda(t), \lambda(i), \gamma, \mu_{it} \ge 0, \pi \ge 0} \inf_{\omega_{it}} h(\lambda, \mu, \pi, \gamma, \omega). \tag{A.8}
\]

Proof. A direct application of the generalized Farkas lemma implies that the constraint set is empty if and only if there exist $(\alpha_i^\star, \beta_t^\star, \gamma^\star)$ such that the following is true:
\[
\alpha_i^\star + \beta_t^\star + \psi_{it}^\top \gamma^\star \ge 0, \qquad W_{it} = \mathbb{1}\{\alpha_i^\star + \beta_t^\star + \psi_{it}^\top \gamma^\star > 0\}. \tag{A.9}
\]
By assumption such $(\alpha_i^\star, \beta_t^\star, \gamma^\star)$ do not exist, and thus the constraint set is non-empty and convex. Since the objective function is strictly convex, the primal problem has a unique solution. Since all the inequality constraints are affine, strong duality holds (see Section 5.2.3 in Boyd and Vandenberghe [2004]), and we have the result.

Lemma A.2.
For arbitrary $\gamma$ define $g(X, W, \gamma)$ in the following way:
\[
g(X, W, \gamma) \in \arg\min_{\alpha} \left\{ \frac{1}{T} \sum_{t=1}^{T} \rho_{W_t} \left( W_t - \alpha - \psi_t^\top \gamma \right) \right\}. \tag{A.10}
\]
Then for any $W$ such that $\overline{W} < 1$ this function is uniquely defined. Moreover, if $\|\psi_t\|_\infty < K$, then $g(X, W, \gamma)$ is $\mathbb{P}$-a.s. uniformly (in $(X, W)$) Lipschitz in $\gamma$.

Proof. Suppose $\overline{W} < 1$. Define $h_t := W_t - \psi_t^\top \gamma$, and let $\tilde{h}_{(1)}, \ldots, \tilde{h}_{(\sum_{t=1}^{T} W_t)}$ be the decreasing ordering of $h_t$ over the periods with $W_t = 1$; let $\tilde{h}_{(0)} := 0$. For $k = 0, \ldots, \sum_{t=1}^{T} W_t$ define the following functions:
\[
g_k(X, W, \gamma) := \frac{\sum_{t=1}^{T} (1 - W_t) h_t + \sum_{l=0}^{k} \tilde{h}_{(l)}}{\sum_{t=1}^{T} (1 - W_t) + k}. \tag{A.11}
\]
It is easy to see that we have the following:
\[
g(X, W, \gamma) = g_0(X, W, \gamma) + \sum_{l=1}^{k} \mathbb{1}\{\tilde{h}_{(l)} \ge g_{l-1}\} \left( g_l(X, W, \gamma) - g_{l-1}(X, W, \gamma) \right). \tag{A.12}
\]
From this representation it follows that $g(X, W, \gamma)$ is differentiable and $\mathbb{P}$-a.s. uniformly (in $(X, W)$) Lipschitz in $\gamma$.

Lemma A.3.
Let $\{W_i, X_i\}$ be distributed according to $\mathbb{P}$; assume that $S_i$ includes $\overline{W}_i$ and $\mathbb{E}[W_{it} \mid S_i, X_i] < 1 - \eta$ $\mathbb{P}$-a.s. for some $\eta > 0$. Then there exist a $\sigma(W_i, X_i)$-measurable random variable $\alpha_i^\star$ and a vector $\gamma^\star$ such that the following conditions are satisfied:
\[
\begin{aligned}
&\xi_{it} := W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star, \\
&\mathbb{E}\left[ \sum_{t=1}^{T} \xi_{it} \psi_{it} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \right] = 0, \\
&\sum_{t=1}^{T} \xi_{it} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) = 0.
\end{aligned} \tag{A.13}
\]

Proof.
Define $\mathcal{F} := \{f \in L_2(\mathbb{P})^T : f_t = g(W_i, X_i) + h_t(S_i, X_i),\; g, h_t \in L_\infty(\mathbb{P})\}$, and similarly define $\mathcal{G} := \{g = (g_1, \ldots, g_T) : g_t = f + \psi_t^\top \gamma,\; f \in L_2(\mathbb{P}),\; \gamma \in \mathbb{R}^p\}$. Consider the following optimization program:
\[
\inf_{g \in \mathcal{G}} \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} \rho_{W_{it}} (W_{it} - g_{it}) \right], \tag{A.14}
\]
and let $r^\star$ be the value of the infimum; write $R(f) := \mathbb{E}[\frac{1}{T} \sum_{t=1}^{T} \rho_{W_{it}}(W_{it} - f_{it})]$ for the objective. We prove that there exists a function $g^\star \in \mathcal{G}$ that solves this problem. This is not entirely trivial, because $\mathcal{G}$ is not compact and the loss function is not quadratic, so we can directly use neither the Weierstrass theorem nor the standard projection theorem.

Consider the set $\mathcal{F}(r^\star) := \{f \in \mathcal{F} : R(f) \le r^\star\}$. Because $R$ is continuous and convex on $L_2^T(\mathbb{P})$, the set $\mathcal{F}(r^\star)$ is closed and convex. Now assume that $g^\star$ does not exist, so that $\mathcal{F}(r^\star) \cap \mathcal{G} = \emptyset$. By construction $\mathcal{G}$ is closed (in $L_2(\mathbb{P})$) and convex; as a result, we have two closed convex sets with empty intersection.

Assume for the moment that $\mathcal{F}(r^\star)$ is weakly compact. Then by the strict separating hyperplane theorem there exist $h^\star \in L_2^T(\mathbb{P})$ and constants $a_1 < a_2$ such that $\sup_{f \in \mathcal{F}(r^\star)} (f, h^\star) < a_1 < a_2 < \inf_{g \in \mathcal{G}} (g, h^\star)$. Let $f^\star \in \mathcal{F}(r^\star) \cup \mathcal{G}$ be such that $R(f^\star) \le R(f)$ for any $f \in \mathcal{F}(r^\star) \cup \mathcal{G}$. Fix an $\varepsilon > 0$ and take $g_\varepsilon \in \mathcal{G}$ such that $R(g_\varepsilon) < r^\star + \varepsilon$.

For $t \in [0, 1]$ consider the function $r(t) = R(f^\star + t(g_\varepsilon - f^\star))$. By convexity of $R$ it follows that $r(t)$ is convex, and by the definition of $f^\star$ it has a minimum at zero. For $t \in [0, 1]$ consider also the function
\[
(h^\star, f^\star + t(g_\varepsilon - f^\star)) =: a + bt, \tag{A.15}
\]
and define $t_1 := \frac{a_1 - a}{b}$ and $t_2 := \frac{a_2 - a}{b}$, so that $\frac{t_2 - t_1}{t_1} = \frac{a_2 - a_1}{a_1 - a} > 0$. By construction it follows that $r(t_1) \ge r^\star$ and $r(t_2) < r^\star + \varepsilon$, and by convexity we have
\[
r(t_2) \ge r(t_1) + \frac{r(t_1) - r(0)}{t_1}(t_2 - t_1) \ge r^\star + \frac{r^\star - R(f^\star)}{t_1}(t_2 - t_1).
\]
The right-hand side of this inequality does not depend on $\varepsilon$, which leads to a contradiction.

To finish the proof we need to show that (a) $f^\star$ exists and is unique, and (b) $\mathcal{F}(r^\star)$ is weakly compact. The latter follows if we prove that $\mathcal{F}(r^\star)$ is bounded in $L_2(\mathbb{P})$, which in turn follows because $R$ is convex and has a unique minimum at $f^\star$ in $\mathcal{F}(r^\star)$.

Finally, we prove that $R$ has a unique minimum. Consider $f^\star$ such that $f_t^\star := \mathbb{E}[W_{it} \mid S_i, X_i]$. Because $S_i$ includes $\overline{W}_i$, it follows that $\frac{1}{T} \sum_{t=1}^{T} f_t^\star = \overline{W}_i$. Take any function $f \in \mathcal{F}$ and consider the convex combination $f(\lambda) := f^\star + \lambda(f - f^\star)$. Because $f \in L_\infty(\mathbb{P})$ and $f_t^\star \le 1 - \eta$, there is a $\lambda_0 > 0$ such that for all $\lambda < \lambda_0$ we have $f_t(\lambda) < 1$, and therefore
\[
R(f(\lambda)) = \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} (W_t - f_t^\star)^2 \right] + \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} (f_t^\star - f_t(\lambda))^2 \right] > R(f^\star).
\]
By convexity of $R$ it follows that $R(f) > R(f^\star)$, which proves that $g^\star$ exists. The final result follows because $R$ is Gâteaux-differentiable on $\mathcal{F}$, and taking first-order conditions gives the claim.

A.3 Theorems

Proof of Theorem 2: We split the proof into two parts. First, we assume that $\|(\omega^\star)^{un} - \hat{\omega}^{un}\|_2 = o_p(1)$, that $(\omega^\star_{it})^{un}$ is uniformly bounded, and that $\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} (\omega^\star_{it})^{un} W_{it}\right] > 0$, and prove the normality result. Then we prove the first statement.
Part 1:
Assume that $\|(\omega^\star)^{un} - \hat{\omega}^{un}\|_2 = o_p(1)$. For the estimator $\hat{\tau}$ we have the following:
\[
\hat{\tau} = \frac{1}{nT} \sum_{it} \hat{\omega}_{it} Y_{it} = \frac{1}{nT} \sum_{it} \hat{\omega}_{it} \tau_{it} W_{it} + \frac{1}{nT} \sum_{it} \hat{\omega}_{it} u_{it} = \tau_{emp} + \frac{1}{nT} \sum_{it} \hat{\omega}_{it} u_{it} = \tau_{emp} + \frac{1}{\mathbb{P}_n \frac{1}{T} \sum_{t=1}^{T} \hat{\omega}^{un}_{it} W_{it}} \left( \frac{1}{nT} \sum_{it} (\omega^\star_{it})^{un} u_{it} + \frac{1}{nT} \sum_{it} \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) u_{it} \right). \tag{A.16}
\]
By construction and assumption we have the following:
\[
\mathbb{E}\left[ \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) u_{it} \mid \{W_j, X_j\}_{j=1}^{n} \right] = \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) \mathbb{E}[u_{it} \mid \{W_j, X_j\}_{j=1}^{n}] = \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) \mathbb{E}[u_{it} \mid W_i, X_i] = 0. \tag{A.17}
\]
This implies that by the conditional Chebyshev inequality we have the following:
\[
\zeta_n(\epsilon) := \mathbb{E}\left[ \mathbb{1}\left\{ \sqrt{n} \left| \mathbb{P}_n \frac{1}{T} \sum_{t=1}^{T} \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) u_{it} \right| \ge \epsilon \right\} \,\Big|\, \{W_j, X_j\}_{j=1}^{n} \right] \le \frac{\mathbb{P}_n \mathbb{E}\left[ \left( \sum_{t=1}^{T} \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) u_{it} \right)^2 \Big| \{W_j, X_j\}_{j=1}^{n} \right]}{T^2 \epsilon^2} \le \frac{\sigma_u^2}{T \epsilon^2} \|(\omega^\star)^{un} - \hat{\omega}^{un}\|_2^2 = o_p(1). \tag{A.18}
\]
Since the indicator is a bounded function, it follows that for any $\epsilon > 0$
\[
\mathbb{E}[\zeta_n(\epsilon)] = o(1), \tag{A.19}
\]
and thus we have $\frac{1}{nT} \sum_{it} \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) u_{it} = o_p\left(\frac{1}{\sqrt{n}}\right)$. Finally, we need to check that the CLT applies to $\frac{1}{nT} \sum_{it} (\omega^\star_{it})^{un} u_{it}$.
The mean of each summand is zero and the variance is bounded:
\[
\mathbb{E}\left[ \left( \frac{1}{T} \sum_{t=1}^{T} (\omega^\star_{it})^{un} u_{it} \right)^2 \right] \le \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ \left( (\omega^\star_{it})^{un} u_{it} \right)^2 \right] \le \frac{1}{T} \sum_{t=1}^{T} \sqrt{ \mathbb{E}[u_{it}^4]\, \mathbb{E}\left[ ((\omega^\star_{it})^{un})^4 \right] } < \infty. \tag{A.20}
\]
Finally, define:
\[
\omega^\star_{it} := \frac{(\omega^\star_{it})^{un}}{\mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} (\omega^\star_{it})^{un} W_{it} \right]}. \tag{A.21}
\]
It is easy to see that we have:
\[
\mathbb{P}_n \frac{1}{T} \sum_{t=1}^{T} \hat{\omega}^{un}_{it} W_{it} = \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} (\omega^\star_{it})^{un} W_{it} \right] + o_p(1), \tag{A.22}
\]
and thus we have the following:
\[
\|\omega^\star - \hat{\omega}\|_2 = o_p(1), \qquad \sqrt{n}(\hat{\tau} - \tau_{emp}) \to \mathcal{N}(0, \sigma_\tau^2), \tag{A.23}
\]
which concludes the first part.

Part 2:
In this part we prove that $\|(\omega^\star)^{un} - \hat{\omega}^{un}\|_2 = o_p(1)$, that $(\omega^\star_{it})^{un}$ is uniformly bounded, and that $\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} (\omega^\star_{it})^{un} W_{it}\right] > 0$. We use the dual representation derived in Section 4.3 and show that the solution converges to a population one.

The proof below shows that the empirical weights converge to oracle weights that solve a certain population problem. We use a natural adaptation of the "small-ball" argument from Mendelson [2014]. This is not necessary, and most likely one can construct a simpler proof using classical results for GMM estimators. We present a different argument because it can be naturally generalized to handle more sophisticated estimation procedures, something that we want to address in future work.

We start by defining the relevant oracle weights. Consider $(\{\alpha_i^\star\}_{i=1}^{n}, \gamma^\star)$ that satisfy the following restrictions:
\[
\begin{aligned}
&\xi_{it} := W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star, \\
&\mathbb{E}\left[ \sum_{t=1}^{T} \xi_{it} \psi_{it} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \right] = 0, \\
&\sum_{t=1}^{T} \xi_{it} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) = 0,
\end{aligned} \tag{A.24}
\]
where we include the time fixed effects $\lambda_t$ in the definition of $\psi_{it}$; since $T$ is fixed, this does not create any problems. We prove that oracle weights satisfying these restrictions exist in Lemma A.3.
Using these parameters, we consider a lower bound on the individual components of the loss function:
\[
\begin{aligned}
\rho_{W_{it}}(W_{it} - \alpha_i - \psi_{it}^\top \gamma)
&= (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i - \psi_{it}^\top \gamma \le 0\} \right) \\
&= (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) + (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 W_{it} \left( \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} - \mathbb{1}\{W_{it} - \alpha_i - \psi_{it}^\top \gamma \le 0\} \right) \\
&\ge (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) - (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 W_{it}\, \mathbb{1}\{\alpha_i^\star + \psi_{it}^\top \gamma^\star < W_{it} \le \alpha_i + \psi_{it}^\top \gamma\}.
\end{aligned} \tag{A.25}
\]
Using this and the properties of the oracle weights, we get the following inequality for the excess loss for unit $i$:
\[
\begin{aligned}
\sum_{t=1}^{T} &\left( \rho_{W_{it}}(W_{it} - \alpha_i - \psi_{it}^\top \gamma) - \rho_{W_{it}}(W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star) \right) \\
&\ge \sum_{t=1}^{T} \left( (\alpha_i^\star - \alpha_i) + \psi_{it}^\top (\gamma^\star - \gamma) \right)^2 \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \\
&\quad + 2 \sum_{t=1}^{T} \xi_{it} (\alpha_i^\star - \alpha_i) \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) + 2 \sum_{t=1}^{T} \xi_{it} \psi_{it}^\top (\gamma^\star - \gamma) \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \\
&\quad - \sum_{t=1}^{T} (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 W_{it}\, \mathbb{1}\{\alpha_i^\star + \psi_{it}^\top \gamma^\star < W_{it} \le \alpha_i + \psi_{it}^\top \gamma\} \\
&= \sum_{t=1}^{T} \left( (\alpha_i^\star - \alpha_i) + \psi_{it}^\top (\gamma^\star - \gamma) \right)^2 \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) + 2 \sum_{t=1}^{T} \xi_{it} \psi_{it}^\top (\gamma^\star - \gamma) \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \\
&\quad - \sum_{t=1}^{T} (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 W_{it}\, \mathbb{1}\{\alpha_i^\star + \psi_{it}^\top \gamma^\star < W_{it} \le \alpha_i + \psi_{it}^\top \gamma\}. 
\end{aligned} \tag{A.26}
\]
Note that the last equality follows by the definition of $\xi_{it}$ and $(\{\alpha_i^\star\}_{i=1}^{n}, \gamma^\star)$.

In Lemma A.2 we show that $\alpha_i^\star$ is a function of $\gamma^\star$ and the data for unit $i$:
\[
\alpha_i^\star = g(X_i, W_i, \gamma^\star), \tag{A.27}
\]
and we prove that $g$ is uniformly Lipschitz. By construction, for every $\gamma$ we only need to consider $\alpha_i$ that satisfies the following equality:
\[
\alpha_i = g(X_i, W_i, \gamma). \tag{A.28}
\]
Define:
\[
f_{it} = \alpha_i + \psi_{it}^\top \gamma, \qquad f^\star_{it} = \alpha_i^\star + \psi_{it}^\top \gamma^\star, \tag{A.29}
\]
and observe that we have the following:
\[
\mathbb{P}_n \sum_{t=1}^{T} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} < f^\star_{it}\} \right) (f_{it} - f^\star_{it})^2 \ge \mathbb{P}_n \sum_{t=1}^{T} (1 - W_{it}) (f_{it} - f^\star_{it})^2 \ge (\gamma - \gamma^\star)^\top \left( \sum_{t=1}^{T} \mathbb{P}_n\, \Gamma_{it} \Gamma_{it}^\top \right) (\gamma - \gamma^\star) \ge \kappa \|\gamma - \gamma^\star\|_2^2 + o_p(\|\gamma - \gamma^\star\|_2^2), \tag{A.30}
\]
where
\[
\Gamma_{it} := (1 - W_{it}) \left( \psi_{it} - \frac{\sum_{l=1}^{T} (1 - W_{il}) \psi_{il}}{\sum_{l=1}^{T} (1 - W_{il})} \right). \tag{A.31}
\]
Assume that $\|\gamma - \gamma^\star\|_2 = r$, which implies $|\alpha_i - \alpha_i^\star| \le C r$. The assumptions guarantee that $\psi_{it}$ is bounded, and thus $\sum_{t=1}^{T} \|f_t - f^\star_t\|_\infty \le C r$.
Using the Cauchy–Schwarz inequality we get the following:
\[
\mathbb{P}_n \sum_{t=1}^{T} \xi_{it} \psi_{it}^\top (\gamma^\star - \gamma) \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \le \|\gamma^\star - \gamma\|_2 \times \left\| \mathbb{P}_n \sum_{t=1}^{T} \xi_{it} \psi_{it} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \right\|_2. \tag{A.32}
\]
We also have the following inequality:
\[
\mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 W_{it}\, \mathbb{1}\{\alpha_i^\star + \psi_{it}^\top \gamma^\star < W_{it} \le \alpha_i + \psi_{it}^\top \gamma\} \right] \le \mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} (f^\star_{it} - f_{it})^2\, \mathbb{1}\{f^\star_{it} < W_{it} \le f_{it}\} \right] \le \|f^\star - f\|_\infty^2 \times \mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{f^\star_{it} < W_{it} \le f_{it}\} \right], \tag{A.33}
\]
where the first inequality follows because of the indicator, and the second by Hölder's inequality. Since $\|f^\star - f\|_\infty \le C r$, we have the following:
\[
\mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{f^\star_{it} < W_{it} \le f_{it}\} \right] \le \mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + C r\} \right]. \tag{A.34}
\]
The DKW inequality implies that we have the following with high probability:
\[
\mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + C r\} \right] \le \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + C r\} \right] + \frac{C}{\sqrt{n}}. \tag{A.35}
\]
It is now easy to see that if $r$ is greater than $O\left(\frac{1}{\sqrt{n}}\right)$, then the excess loss is positive with high probability. Since the loss function is convex, this implies that the optimum must belong to a ball of radius $\frac{1}{\sqrt{n}}$ around $(\{\alpha_i^\star\}_{i=1}^{n}, \gamma^\star)$ with high probability, which proves that for all $t$, $\|\hat{\omega}^{(un)}_t - (\omega^\star_t)^{un}\|_2 = o_p(1)$.

Proof of Theorem 3: Part 1:
For each observation $i$ define $M_i$ as the number of times this observation is sampled in a bootstrap sample. Using this notation we can define bootstrap analogs of $\alpha_i$ and $\gamma$ from the proof of Theorem 2:
\[
\{\alpha^{(b)}_i, \gamma^{(b)}\}_{i=1}^n = \arg\min_{\{\alpha_i\}_{i=1}^n,\,\gamma}\; P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \rho_{W_{it}}(W_{it} - \alpha_i - \psi_{it}^\top\gamma)\right] \tag{A.36}
\]
If $M_i = 0$, we define $\alpha^{(b)}_i$ using the function $g(X_i, W_i, \gamma^\star)$ from Lemma A.2. It is straightforward to extend the proof of Theorem 2 and show that the bootstrap weights converge to the population ones. Most parts of the argument follow from two key properties of $\{M_i\}_{i=1}^n$:
\[
P_n[M_i X_i] = \mathbb{E}[X_i] + o_p(1), \qquad P_n[M_i \varepsilon_i] = O_p\left(\frac{1}{\sqrt{n}}\right) \tag{A.37}
\]
for any square-integrable $X_i$ and any square-integrable mean-zero $\varepsilon_i$ (all independent of $M_i$). The second claim follows by applying Chebyshev's inequality, and the first one follows from the second one. The only additional result that we need is the following:
\[
\begin{aligned}
P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\}\right]
&= P_n\left[(M_i - 1)\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\}\right] \\
&\quad + P_n\left[\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\} - \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\}\right]\right] \\
&\quad + \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\}\right] \\
&= \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\}\right] + O_p\left(\frac{1}{\sqrt{n}}\right)
\end{aligned} \tag{A.38}
\]
where the last line follows by the DKW inequality, the fact that the relevant set of intervals is Donsker, and the fact that the multiplier process converges to the same limit process as the standard empirical one.
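Both properties in (A.37) are easy to see in simulation. The sketch below (all variable names and distributions are illustrative, not from the paper) draws $n$-out-of-$n$ bootstrap counts $M_i$ and checks that $P_n[M_i X_i]$ recenters at $\mathbb{E}[X_i]$ while $P_n[M_i \varepsilon_i]$ is of order $1/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# n-out-of-n bootstrap: M[i] = number of times unit i appears in the resample
idx = rng.integers(0, n, size=n)
M = np.bincount(idx, minlength=n)

X = 3.0 + rng.normal(size=n)    # square-integrable, E[X_i] = 3
eps = rng.normal(size=n)        # square-integrable, mean zero, independent of M

print(M.sum())                  # exactly n: the counts partition the resample
print((M * X).mean())           # P_n[M_i X_i] -> E[X_i] = 3
print((M * eps).mean())         # P_n[M_i eps_i] = O_p(1/sqrt(n)): small
```

Since $\mathbb{E}[M_i] = 1$ and $M_i$ is independent of $(X_i, \varepsilon_i)$, the second display is a mean-zero average with variance of order $1/n$, which is the Chebyshev step in the text.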
It follows that we have the convergence results:
\[
\|\omega^{(b)} - \omega^\star\|_\infty = o_p(1), \qquad \|\omega^{(b)} - \omega^\star\| = O_p\left(\frac{1}{\sqrt{n}}\right) \tag{A.39}
\]
Part 2: By construction of the bootstrap estimator we have the following representation:
\[
\begin{aligned}
\hat\tau^{(b)} - \hat\tau
&= P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}\tau_{it}W_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}\tau_{it}W_{it}\right] + P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}u_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}u_{it}\right] \\
&= P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}(\tau_{it} - \mathbb{E}[\tau_{it}])W_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}(\tau_{it} - \mathbb{E}[\tau_{it}])W_{it}\right] + P_n\left[(M_i - 1)\frac{1}{T}\sum_{t=1}^T \omega^\star_{it}u_{it}\right] + o_p\left(\frac{1}{\sqrt{n}}\right)
\end{aligned} \tag{A.40}
\]
From this representation it follows that if $\tau_{it}$ is constant, then the bootstrap estimator is consistent for the asymptotic variance of $\hat\tau$. If $\tau_{it}$ is heterogeneous, we further expand the first term. Define $\tau_t(W_i, X_i) := \mathbb{E}[\tau_{it} \mid W_i, X_i]$ and $\eta_{it} := \tau_{it} - \tau_t(W_i, X_i)$. We have the following:
\[
\begin{aligned}
&P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}\tau_{it}W_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}\tau_{it}W_{it}\right] \\
&\quad = P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}\tau_t(W_i,X_i)W_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}\tau_t(W_i,X_i)W_{it}\right] + P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}\eta_{it}W_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}\eta_{it}W_{it}\right] \\
&\quad = P_n\left[\frac{1}{T}\sum_{t=1}^T (M_i\omega^{(b)}_{it} - \hat\omega_{it})\tau_t(W_i,X_i)W_{it}\right] + P_n\left[(M_i - 1)\frac{1}{T}\sum_{t=1}^T \omega^\star_{it}\eta_{it}W_{it}\right] + o_p\left(\frac{1}{\sqrt{n}}\right)
\end{aligned} \tag{A.41}
\]
It follows that we have the following:
\[
\hat\tau^{(b)} - \hat\tau = P_n\left[(M_i - 1)\frac{1}{T}\sum_{t=1}^T \omega^\star_{it}(\eta_{it}W_{it} + u_{it})\right] + P_n\left[\frac{1}{T}\sum_{t=1}^T (M_i\omega^{(b)}_{it} - \hat\omega_{it})\tau_t(W_i,X_i)W_{it}\right] + \text{small order terms} \tag{A.42}
\]
Since the second summand is uncorrelated with the first one, the bootstrap variance is a conservative estimator of the correct variance.
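The last step can be spelled out in one line. Writing $A_n$ for the first summand in (A.42) and $B_n$ for the second (our labels, not the paper's), uncorrelatedness gives:

```latex
% Conservativeness of the bootstrap variance: Cov(A_n, B_n) = 0, so
\operatorname{Var}\!\left(\hat\tau^{(b)} - \hat\tau\right)
  = \operatorname{Var}(A_n) + 2\operatorname{Cov}(A_n, B_n) + \operatorname{Var}(B_n)
  = \operatorname{Var}(A_n) + \operatorname{Var}(B_n)
  \;\ge\; \operatorname{Var}(A_n),
```

so the bootstrap variance weakly exceeds the variance of the linear term $A_n$ that drives the asymptotic distribution of $\hat\tau$, with equality when the second summand vanishes, for example under homogeneous treatment effects.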