Double-Robust Identification for Causal Panel Data Models∗

Dmitry Arkhangelsky†  Guido W. Imbens‡

September 2019
Abstract
We study identification and estimation of causal effects of a binary treatment in settings with panel data. We highlight that there are two paths to identification in the presence of unobserved confounders. First, the conventional path based on making assumptions on the relation between the potential outcomes and the unobserved confounders. Second, a design-based path where assumptions are made about the relation between the treatment assignment and the confounders. We introduce different sets of assumptions that follow the two paths, and develop double-robust approaches to identification where we exploit both approaches, similar in spirit to the double-robust approaches to estimation in the program evaluation literature.
Keywords: fixed effects, cross-section data, clustering, causal effects, treatment effects, unconfoundedness.

∗ This paper benefited greatly from our discussions with Manuel Arellano and David Hirshberg.
† Assistant Professor, CEMFI, darkhangel@cemfi.es.
‡ Professor of Economics, Graduate School of Business and Department of Economics, Stanford University, SIEPR, and NBER, [email protected].

Introduction
Panel data are widely used to assess causal effects of policy interventions on economic outcomes. These data are particularly useful in settings where there is substantial heterogeneity both between units at the same point in time, as well as heterogeneity over time within units. Fundamentally, the presence of panel data allows for two conceptually different comparisons to estimate causal effects. First, we can compare treated and control outcomes for the same unit at different points in time, that is, make across-time within-unit comparisons. Such comparisons are not possible in cross-section settings. Second, following approaches in cross-sectional settings, we can compare treated and control outcomes at the same point in time for different units, i.e., within-period across-unit comparisons. In that case, we use the panel data simply to allow for a richer set of controls than we would use in a cross-section setting. Different sets of assumptions justify the two approaches. In practice, researchers often make assumptions that simultaneously justify both types of comparisons. For example, many empirical papers use a linear two-way fixed effect specification that implicitly justifies both the within-unit and within-period comparisons:

Y_{it} = α_i + λ_t + τ W_{it} + β^⊤ X_{it} + ε_{it}.   (1.1)

Here W_{it} is an indicator for the treatment, with τ the causal effect of interest, and X_{it} are the time-unit specific control variables. In this specification, the α_i capture the permanent unit-specific effects, and the λ_t capture the common time effects. After removing the unit and time fixed effects, we can compare outcomes for treated units both to outcomes for the same unit in time periods where the unit was not treated, or to control units in the same time period. In this paper, we take a different perspective, building on the program evaluation or causal inference literature.
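The two-way fixed effect specification (1.1) can be estimated by OLS with unit and time dummies. The following minimal sketch (all numbers, including the assignment rule, are illustrative assumptions, not taken from the paper) generates a noiseless panel satisfying (1.1) without covariates and recovers τ exactly:

```python
import numpy as np

# Illustrative two-way fixed effect regression: Y_it = a_i + l_t + tau*W_it.
rng = np.random.default_rng(0)
N, T, tau = 50, 4, 2.5
alpha = rng.normal(size=N)                    # unit effects a_i
lam = rng.normal(size=T)                      # time effects l_t
W = (rng.random((N, T)) < 0.4).astype(float)  # assumed random assignment
Y = alpha[:, None] + lam[None, :] + tau * W   # noiseless for clarity

# Long format with unit and time dummies (one time dummy dropped).
y, w = Y.ravel(), W.ravel()
unit = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)
X = np.column_stack([w, np.eye(N)[unit], np.eye(T)[time][:, 1:]])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
tau_hat = coef[0]                             # recovers tau exactly here
```

Since the outcome is exactly additive in unit effects, time effects, and the treatment, the OLS coefficient on W equals τ up to machine precision.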
We start with the assumption that conditional on an unobserved unit-specific variable U_i (possibly vector-valued), the T-component vector of treatment assignments over time for unit i, W_i, with t-th element equal to W_{it}, is independent of the vector of potential outcomes Y_i(w):

W_i ⊥⊥ {Y_i(w)}_w | U_i.   (1.2)

This assumption has no immediate content because we can make it hold by construction by setting U_i equal to the vector of assignments W_i. Nevertheless, it clarifies what the issue is and why cross-section data alone are not sufficient: there is an unobserved variable U_i that invalidates comparisons of observed outcomes by treatment status because this unobserved variable is correlated both with the potential outcomes and with the treatment assignment. Although it is not always articulated in this form, this conditional independence assumption is implicitly made in many of the approaches to identification in panel data settings used in the empirical literature.

For the case where (in contrast to the case we consider in the current paper) (1.2) holds with U_i observed, the program evaluation literature has developed a number of effective methods for estimating the average causal effect of W_{it} on Y_{it} (see Imbens [2004], Abadie and Cattaneo [2018] for reviews). One approach is to remove the association between U_i and the treatment W_{it} by using the propensity score, either through weighting or through conditioning. Second, one can transform the outcome by removing the association between the outcome and U_i. This is typically done by subtracting from the outcome the conditional mean of the outcome Y_{it} given U_i. Third, and most effectively, one can use double robust methods and combine the propensity score adjustment with the outcome modeling/transformation.
These methods inspire the proposals developed in the current paper for the case where U_i is not observed. In the case where U_i is not observed, one has to make additional assumptions to ensure point-identification. For the most part, applied researchers have focused on making assumptions regarding the relationship between the outcome and the unobserved characteristic. This approach is natural, often follows directly from an economic model, and is supported by the econometric theory (see, e.g., the surveys Chamberlain [1984], Arellano and Honoré [2001], Arellano [2003], Arellano and Bonhomme [2011]). At the same time, such restrictions are very different from (1.2) because they are not motivated by a model of W_i (a model of assignment). The point that we are making in this paper is that a model for W_i provides an alternative path to identification, and, moreover, it can be considered separately from the model for the outcome. We show that with panel data, one can base the identification argument on either the outcome model or the assignment model being correct. This is where our approach differs conceptually from the double robust estimation literature: here both the design assumptions and the outcome modeling approaches are used in the identification stage.

First, analogous to the outcome modeling, we can use models and assumptions to motivate a transformation of the potential outcomes such that the unobserved component is independent of the transformed potential outcomes, and the transformed outcomes themselves are informative about the causal effect of interest. Formally,

U_i ⊥⊥ g({Y_i(w)}_w),   (1.3)

for some function of the potential outcomes g(·), possibly after some conditioning. Many methods used in the empirical literature, including the two-way fixed effect estimator, can be thought of as fitting in this approach. For example, consider a two-period setting.
The two-way fixed effect estimator transforms the outcomes by taking differences, e.g., in the two-period case Δ_i = g(Y_i(w)) = Y_{i2}(w_2) − Y_{i1}(w_1), so that Δ_i is free of dependence on the unobserved component U_i.

The second approach is design-based, where the goal is to find a set of conditioning variables S_i that removes the association between the treatment assignment and the unobserved component, analogous to the propensity score approach:

U_i ⊥⊥ W_i | S_i.   (1.4)

A version of this assumption has been used in the panel literature before (e.g., the exchangeability assumption in Altonji and Matzkin [2005] or the exponential family assumption in Arkhangelsky and Imbens [2018]). In this paper, we argue that it holds for a variety of models that have been commonly used for binary data (e.g., Honoré and Kyriazidou [2000], Chamberlain [2010], Aguirregabiria et al. [2018]). In principle, the two-way fixed effect estimator can also be thought of as following this approach by comparing treated and control units at the same time within the set of units with the same fraction of treated periods, that is, conditioning on S_i = Σ_{t=1}^T W_{it}. However, as a general approach to identifying treatment effects in a panel data setting, this design-based approach that is common in the treatment effect literature has not been explored, and we do so in the current paper.

Third, we explore robust versions where we combine outcome modeling and assumptions on the assignment mechanism. Essentially, there we develop models that justify (1.3) for some transformation, and models that justify (1.4) for some conditioning variables S_i, and then consider strategies that only require that the independence in (1.3) holds within subpopulations defined by S_i:

U_i ⊥⊥ g({Y_i(w)}_w) | S_i.
(1.5)

The paper fits in with the recent literature on causal inference in panel data settings, including the closely related synthetic control literature (Abadie et al. [2010], Arkhangelsky et al. [2019], Xu [2017], Ben-Michael et al. [2018]), difference-in-differences methods (de Chaisemartin and D'Haultfœuille [2018], Goodman-Bacon [2017], Athey and Imbens [2018], Athey et al. [2017]), and fixed effect methods (Imai and Kim [2019], Arkhangelsky and Imbens [2018]).

Notation: For p ∈ [1, ∞] we use L^p(P) to denote the space of all random variables X that satisfy E[‖X‖^p]^{1/p} < ∞. For any two random variables X_1, X_2 ∈ L^p(P) we use ‖X_1 − X_2‖_p to denote the L^p(P) distance. For a random sample {X_i}_{i=1}^N and any real-valued functions f_1, f_2: X → R we define:

P_N f(X_i) := (1/N) Σ_{i=1}^N f(X_i),   ‖f_1 − f_2‖_{N,p} = (P_N (f_1(X_i) − f_2(X_i))^p)^{1/p}.   (1.6)

For a matrix A we use σ_min(A) to denote its smallest singular value.

We observe N units over T periods (i and t being a generic unit and period, respectively). We focus on settings with large N and fixed T. We are interested in the effect of a binary policy variable w on some economic outcome Y_{it}. To formalize this we consider a potential outcome framework (Imbens and Rubin [2015]). The policy can change over time, and so is indexed by unit i and time t, W_{it} ∈ {0, 1}. Let w^t ≡ (w_1, w_2, ..., w_t) denote the sequence of treatment exposures up to time t, with w as shorthand for the full vector of exposures w^T. Define W_i ≡ (W_{i1}, ..., W_{iT}) to be the full assignment vector for unit i. For the first part of the paper we assume that researchers do not observe additional unit-level covariates, and we explicitly introduce them in Section 4.
In general, one can view all our identification results as conditional on covariates. Let Y_{it}(w^t) denote the potential outcome for unit i at time t, given treatment history w^t up to time t:

Y_{it}(w^t) ≡ Y_{it}(w_1, w_2, ..., w_t).   (2.1)

In this paper we consider a static version of this general model.

Assumption 2.1. (No Dynamics)
For arbitrary w^t(1) and w^t(2) such that w_t(1) = w_t(2) we have the following:

Y_{it}(w^t(1)) = Y_{it}(w^t(2)).   (2.2)

This restriction implies that past treatment exposures do not affect contemporaneous outcomes. This assumption does not restrict time-series correlation in the realized outcomes, and so on its own it does not have any testable implications. However, given a particular assignment process, Assumption 2.1 can be tested. Since a substantial part of the empirical literature focuses on contemporaneous effects and assumes away dynamic effects, we view this as a natural starting point. The issues we raise are relevant for the dynamic treatment effect case as well but are discussed most easily in the static case.

Given the no-dynamics assumption we can index the potential outcomes by a single binary argument w, so we write Y_{it}(w), for w ∈ {0, 1}. In this setup we can be interested in various treatment effects. Define individual and time-specific treatment effects:

τ_{it} ≡ Y_{it}(1) − Y_{it}(0).   (2.3)

We focus primarily on average treatment effects, typically a convex combination of individual effects τ_{it}. Define also Y_i(w) ≡ (Y_{i1}(w_1), ..., Y_{iT}(w_T)) to be the vector of potential outcomes. We make two additional assumptions. First, we restrict our attention to settings with strictly exogenous covariates (e.g., Arellano [2003]) and make the following assumption:

Assumption 2.2. (Latent Unconfoundedness)
There exists a random element U_i ∈ U such that the following conditional independence holds:

W_i ⊥⊥ {Y_i(w)}_w | U_i.   (2.4)

This assumption effectively says that once we control for U_i, all the differences in the treatment paths W_i across units are unrelated to the potential outcomes. This type of assignment should be contrasted with sequential assignment, where W_{it} can depend on past outcomes and latent characteristics. See Arellano [2003] for a discussion in the linear case. On its own, Assumption 2.2 is not restrictive because we allow U_i to be unobserved: we can mechanically choose U_i = W_i so that this assumption is satisfied by construction. There are multiple papers that essentially follow this road, going back at least to Chamberlain [1992] (also see Chernozhukov et al. [2013] for a very general version of this approach).

We view U_i as a unit characteristic that we need to control for if we wish to compare outcomes across units. We formalize this by making the following assumption on the (infeasible) generalized propensity score (Imbens [2000]) that ensures that in principle such comparisons are possible.

Assumption 2.3. (Latent Overlap)
Define the infeasible generalized propensity score:

r_inf(w, u) ≡ pr(W_i = w | U_i = u).   (2.5)

For any u ∈ U: max_w {r_inf(w, u)} < 1.

This assumption guarantees that there are units with the same U_i but different values of W_i. This type of assumption is common in the (cross-section) program evaluation literature: without such an overlap assumption, even if we observed U_i we would not be able to identify the average causal effect of the treatment without functional form restrictions. However, this latent overlap assumption is not always maintained in the panel literature. For example, if only time-series variation is used to make causal statements, then one does not need to make Assumption 2.3. Of course, this comes at a cost: one has to restrict the way potential outcomes can change over time. At the same time, if one also wants to exploit the cross-sectional variation, then some version of Assumption 2.3 appears to be unavoidable, but the outcome model can be more flexible compared to the approaches that rely on over-time comparisons.

Before we consider identification in various models we need to define additional objects. Let W be the support of the vector of assignments W_i; we can think of W as a matrix with at most 2^T rows and T columns, where each row is an element of the support of W_i. Let W_k be the k-th row of the matrix W, a T-dimensional vector of zeros and ones. Let π_k ≡ pr(W_i = W_k) = E[1{W_i = W_k}]. All π_k are positive, otherwise the corresponding row of W can be dropped. Let K be the number of rows in W. For example, if T = 3 then W can have the following form:

W = ( 0 0 0
      0 1 1
      1 0 1
      1 1 1 ).   (3.1)

Each row of this matrix represents a possible assignment, and in this particular case only 4 out of the 2^3 = 8 possible combinations have positive probability. For a particular unit i, let k(i) be the index k such that W_k = W_i.
For the identification argument we assume we know W and the probabilities π_k, and we consider estimation in Section 4. We are interested in estimating weighted averages of the treatment effects τ_{it}. Our estimators will be linear in Y, with weights that depend on W_i:

τ̂ = (1/(NT)) Σ_{i=1}^N Σ_{t=1}^T ω_{it} Y_{it}.

Choosing an estimator therefore corresponds to choosing a set of weights ω_{it}. We maintain throughout this section the no-dynamics assumption (Assumption 2.1), the latent unconfoundedness assumption (Assumption 2.2), and latent overlap (Assumption 2.3).

As discussed briefly in the introduction, the latent unconfoundedness assumption can be exploited in two directions. To build intuition, it is useful to briefly make an analogy to the conventional unconfoundedness case with observed confounders, in a cross-section setting. Suppose we have unconfoundedness (Rosenbaum and Rubin [1983]) with an observed confounder X_i. Here we use its weak form (Imbens [2000]):

W_i = w ⊥⊥ Y_i(w) | X_i,  ∀w.   (3.2)

In that case researchers have followed two approaches. One is to exploit the propensity score result that (irrespective of whether (3.2) holds),

W_i = w ⊥⊥ X_i | pr(W_i = w | X_i),   (3.3)

where pr(W_i = w | X_i) is the generalized propensity score. (3.2) and (3.3) combined imply that conditional on the generalized propensity score we have

W_i = w ⊥⊥ Y_i(w) | pr(W_i = w | X_i).   (3.4)

Thus, we can condition on a variable, here pr(W_i = w | X_i), such that the association between the treatment indicator, here 1{W_i = w}, and the variable we originally needed to condition on, here X_i, vanishes.

A second approach is to transform the potential outcomes. Define the conditional expectations μ(w, x) ≡ E[Y_i(w) | X_i = x] and e(X_i) ≡ pr(W_i = 1 | X_i).
We do not actually need the full independence assumption in (3.2), only mean-independence, since it implies E[Y_i(w) | W_i = w, X_i] = E[Y_i(w) | X_i]. Now define

Ỹ_i(w) ≡ g(Y_i(w)) ≡ Y_i(w) − μ(w, X_i) + [E[e(X_i)]^{W_i} (1 − E[e(X_i)])^{1−W_i} / (e(X_i)^{W_i} (1 − e(X_i))^{1−W_i})] (μ(1, X_i) − μ(0, X_i)).

This transformation of the potential outcomes does not change the mean-independence of Ỹ_i(w) and W_i = w conditional on X_i, and we have E[Ỹ_i(w) | W_i, X_i] = E[Ỹ_i(w) | X_i]. However, for this transformed outcome we have something much stronger. Here we do not need the conditioning on X_i for the expected value to be free of dependence on W_i, and mean-independence holds without conditioning on X_i:

E[Ỹ_i(w) | W_i = 1] = E[Ỹ_i(w) | W_i = 0] = E[Ỹ_i(w)] = E[Y_i(1) − Y_i(0)].

We can combine these two approaches and estimate

E[Ỹ_i(1) | W_i = 1, e(X_i)]  and  E[Ỹ_i(0) | W_i = 0, e(X_i)],

and average the difference over the marginal distribution of e(X_i). This will have double robustness properties.

The first insight that we take to the panel data case is that we can either use the conditional distribution of the assignment given the confounder to remove biases associated with a direct comparison of treated and control units, or we can remove the dependence of the outcomes on the confounder. This general strategy works whether the confounder is observed or not, but implementing the two approaches is a bigger challenge if the confounder is not observed, and we need to make additional assumptions in order to do so. The second insight is that combining these two approaches may lead to more robust estimates of the treatment effects. In this section we consider a simple example that illustrates the main message of the paper.
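Before turning to the panel example, the cross-section double-robustness logic above can be sketched numerically with a standard AIPW-style moment (this is a hedged illustration in the spirit of the discussion, not the exact transformation in the text; the functional forms for e, μ, and the misspecified model are illustrative assumptions). With a discrete confounder we can compute population expectations exactly by enumeration:

```python
import numpy as np

# Discrete confounder X in {0,1}; true propensity score and outcome model.
p_x = np.array([0.5, 0.5])                 # pr(X = 0), pr(X = 1)
e = np.array([0.3, 0.7])                   # e(x) = pr(W = 1 | X = x)
mu = lambda w, x: 1.0 + 2.0 * x + w * (1.0 + x)   # true E[Y | W = w, X = x]
ate = sum(p_x[x] * (mu(1, x) - mu(0, x)) for x in (0, 1))   # = 1.5

def aipw(mu_model):
    """Population AIPW estimand: correct e(x), possibly wrong outcome model."""
    total = 0.0
    for x in (0, 1):
        for w in (0, 1):
            pr = p_x[x] * (e[x] if w == 1 else 1 - e[x])
            y = mu(w, x)                   # observed outcome (noiseless)
            ipw1 = w * (y - mu_model(1, x)) / e[x]
            ipw0 = (1 - w) * (y - mu_model(0, x)) / (1 - e[x])
            total += pr * (mu_model(1, x) - mu_model(0, x) + ipw1 - ipw0)
    return total

mu_bad = lambda w, x: 0.5 * w              # badly misspecified outcome model
tau_correct = aipw(mu)                     # both components correct
tau_robust = aipw(mu_bad)                  # wrong mu, correct e: still the ATE
```

With the correct propensity score, the inverse-weighting corrections exactly offset the bias of the misspecified outcome model, so both estimands equal the true average effect.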
For simplicity we start by assuming that τ_{it} = τ (constant treatment effects) and no covariates X_i. At the end of the section we discuss heterogeneity in treatment effects. We introduce covariates in Section 4. Consider the case with three periods and suppose that the distribution of W_i is given by Table 1. A researcher wants to use a standard fixed effects model and runs the following regression (in population):

Y_{it} = α_i + λ_t + τ_{fe} W_{it} + ε_{it},   E[ε_{it} | W_i, α_i] = 0.   (3.5)

Table 1: Assignment process and weights

P(W_i) | W_1 W_2 W_3 | ω^{fe}_1(W_i)  ω^{fe}_2(W_i)  ω^{fe}_3(W_i)
 0.09  |  0   0   0  |   0.46   -0.64    0.18
 0.04  |  1   0   0  |   5.70   -3.26   -2.44
 0.11  |  0   1   0  |  -2.16    4.60   -2.44
 0.14  |  1   1   0  |   3.08    1.98   -5.07
 0.07  |  0   0   1  |  -2.16   -3.26    5.42
 0.08  |  1   0   1  |   3.08   -5.88    2.80
 0.15  |  0   1   1  |  -4.78    1.98    2.80
 0.32  |  1   1   1  |   0.46   -0.64    0.18

Usual OLS logic implies that τ_{fe} has the following representation:

τ_{fe} = E[Y_{it} ω^{fe}_t(W_i)],   (3.6)

where ω^{fe}_t(W_i) are fixed effects weights that depend only on the distribution of W_i. For the distribution given above the weights are presented in Table 1. By construction these weights sum up to 0 for every row and every column (once reweighted by the probabilities). If the two-way model is correctly specified, then the estimator based on a sample analog of these weights has excellent statistical properties (see, e.g., Donoho et al. [1994], Armstrong and Kolesár [2018b], and references therein). At the same time, such an estimator is not entirely satisfactory.
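The fixed effects weights can be recovered from the assignment distribution alone by residualizing W_{it} on unit and time effects in population. The sketch below is our own reconstruction of that weighting argument (only the probabilities are taken from Table 1); it verifies the row/column zero-sum properties, the normalization, and the aggregation failure discussed around Table 2:

```python
import numpy as np

pi = np.array([0.09, 0.04, 0.11, 0.14, 0.07, 0.08, 0.15, 0.32])
W = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0],
              [0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
K, T = W.shape

# Weighted population regression of W_kt on type ("unit") and time dummies.
rows, targets, wts = [], [], []
for k in range(K):
    for t in range(T):
        d = np.zeros(K + T - 1)
        d[k] = 1.0                        # type effect
        if t > 0:
            d[K + t - 1] = 1.0            # time effect (period 1 dropped)
        rows.append(d); targets.append(W[k, t]); wts.append(pi[k])
X, y = np.array(rows), np.array(targets)
sw = np.sqrt(np.array(wts))
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
resid = (y - X @ coef).reshape(K, T)      # two-way residualized W

denom = np.sum(pi[:, None] * resid * W) / T
omega = resid / denom                     # fixed effects weights omega_t(W_k)

row_sums = omega.sum(axis=1)                      # zero per assignment type
col_sums = (pi[:, None] * omega).sum(axis=0)      # zero per period (weighted)
norm = np.sum(pi[:, None] * omega * W) / T        # normalization equals one

# Aggregating by the number of treated periods (cf. Table 2): these
# conditional means are nonzero, so the weights fail under the design
# model that conditions only on sum_t W_it.
S = W.sum(axis=1)
agg = (pi[S == 1][:, None] * omega[S == 1]).sum(axis=0) / pi[S == 1].sum()
```

Note that the never-treated and always-treated rows receive identical weights, matching the first and last rows of Table 1.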
In particular, assume that the assignment is random conditional on W̄_i ≡ (1/T) Σ_{t=1}^T W_{it}:

W_i ⊥⊥ {Y_i(w)}_w | W̄_i.   (3.7)

In this case, the relevant outcome model has the following structure:

Y_{it} = h_t(W̄_i) + τ W_{it} + ξ_{it},   E[ξ_{it} | W_i] = 0.   (3.8)

The estimator based on the fixed effect weights is consistent if the following condition is satisfied for every t and W̄_i:

E[ω^{fe}_t(W_i) | W̄_i] = 0.   (3.9)

Table 2 shows that this is not true for the given distribution of W_i. As a result, if the outcome

Table 2:
Aggregated weights

T·W̄_i | E[ω^{fe}_1(W_i)|W̄_i]  E[ω^{fe}_2(W_i)|W̄_i]  E[ω^{fe}_3(W_i)|W̄_i]
  0   |   0.46    -0.64     0.18
  1   |  -0.73     0.60     0.13
  2   |  -0.08     0.36    -0.28
  3   |   0.46    -0.64     0.18

model is given by (3.8), then the fixed effect weights give us an inconsistent estimator. This is not surprising, because the ω^{fe}_t(W_i) are not constructed to deal with such outcome models. At this point, it is natural to ask whether we can achieve both goals simultaneously, i.e., can we find weights that "work" if either the fixed effect model (3.5) or the design process (3.7) is correctly specified? The answer is positive, and the weights that satisfy this restriction are given in Table 3. It is evident that the weights sum up to zero for each row, and a simple

Table 3:
Doubly robust weights

ω^{dr}_1(W_i)  ω^{dr}_2(W_i)  ω^{dr}_3(W_i)
  0.00    0.00    0.00
  6.59   -3.95   -2.64
 -1.46    4.10   -2.64
  3.24    1.66   -4.90
 -1.46   -3.95    5.42
  3.24   -6.39    3.15
 -4.81    1.66    3.15
  0.00    0.00    0.00

calculation shows that E[ω^{dr}_t(W_i) | W̄_i] = 0 for every t and W̄_i. As a result, there is no trade-off in terms of identification, and we can construct an estimator that works for both models.

So far we have assumed that the treatment effects are constant. This assumption is very strong, and it is well documented that two-way estimators have problems in cases with heterogeneous treatment effects (e.g., see de Chaisemartin and D'Haultfœuille [2018]). This is evident after looking at Table 1: in the last row we assign negative weight to treated units in the second period. In contrast to this, all treated units receive non-negative weight when we use the doubly robust weights from Table 3. This is not a coincidence, and below we discuss a procedure that guarantees that this property is satisfied.

First we consider outcome models. Recall that by the no-dynamics assumption the potential outcomes Y_{it}(w) are indexed by a binary treatment w. A common outcome model that goes back at least to Chamberlain [1992] is the following one:

Assumption 3.1.
The potential outcomes satisfy:

E[Y_{it}(w) | U_i] = α(U_i) + λ_t + τ(U_i) w.   (3.10)

Given Assumption 2.2, the content of this model is that it restricts the time-dependency of the conditional mean of the control outcome and the treatment effect. Rewriting the model, we can see that more directly. The conditional control mean is

E[Y_{it}(0) | U_i] = α(U_i) + λ_t,

which is restricted to be additively separable in time, and the conditional treatment effect is

E[τ_{it} | U_i] = τ(U_i),

which is restricted to be time-invariant. We are interested in identifying a convex combination of the heterogeneous treatment effects τ(U_i) (which itself is a convex combination of τ_{it}) in this model. We do this by using weights ω_{kt} that satisfy the following restrictions:

(1/T) Σ_{k=1}^K Σ_{t=1}^T π_k ω_{kt} W_{kt} = 1,
∀k:  Σ_{t=1}^T ω_{kt} W_{kt} ≥ 0,
∀k:  (1/T) Σ_{t=1}^T ω_{kt} = 0,
∀t:  Σ_{k=1}^K π_k ω_{kt} = 0.   (3.11)

Let W_outc be the set of weights {ω_{kt}}_{t,k} that satisfy these restrictions. We can evaluate these restrictions, and thus we can construct this set. For any generic element ω ∈ W_outc define the random variables ω_{k(i)t}:

ω_{k(i)t} ≡ Σ_{k=1}^K ω_{kt} 1{W_i = W_k}.   (3.12)

Using these stochastic weights we can compute the following expectation:

τ(ω) = E[(1/T) Σ_{t=1}^T Y_{it} ω_{k(i)t}].   (3.13)

Proposition 1.
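As a concrete illustration, the restrictions in (3.11) can be checked numerically for a toy two-period support consisting of an adopter type and a never-treated type. The probabilities and the solved weights below are our own illustrative assumptions, not values from the paper:

```python
import numpy as np

pi = np.array([0.5, 0.5])                # pr of each assignment type (assumed)
W = np.array([[0.0, 1.0],                # adopter: treated in period 2
              [0.0, 0.0]])               # never treated
K, T = W.shape

# Weights solving (3.11) for this support: rows sum to zero,
# pi-weighted columns sum to zero, treated-weight normalization is one.
omega = np.array([[-4.0, 4.0],           # adopter row
                  [ 4.0, -4.0]])         # never-treated row

norm = np.sum(pi[:, None] * omega * W) / T           # should equal 1
row_sums = omega.sum(axis=1)                         # should be 0 per type
col_sums = (pi[:, None] * omega).sum(axis=0)         # should be 0 per period
treated_ok = (omega * W).sum(axis=1) >= 0            # per-row treated condition
```

This member of W_outc contrasts the adopter's change over time with the never-treated type's change, which is exactly the difference-in-differences comparison.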
Suppose Assumptions 2.1, 2.2, and 3.1 hold, and that ω ∈ W_outc. Then τ(ω) is a convex combination of τ(U_i).

As a result, a certain convex combination of τ(U_i) can be identified whenever W_outc is non-empty. A natural question is when this is the case. The answer is quite simple: the matrix W should contain at least one of the following three submatrices (up to permutations):

W^1 = ( 0 1
        0 0 ),   W^2 = ( 0 1
                         1 1 ),   W^3 = ( 0 1
                                          1 0 ).   (3.14)

Consider each of these three cases separately. In the first case there are adopters of the treatment with (W_{it} = 0, W_{it+1} = 1) and in the same periods t and t+1 non-adopters with (W_{it} = 0, W_{it+1} = 0). In the second case there are adopters of the treatment with (W_{it} = 0, W_{it+1} = 1) and in the same periods t and t+1 units who have already adopted and keep the treatment, with (W_{it} = 1, W_{it+1} = 1). In the last case there are adopters with (W_{it} = 0, W_{it+1} = 1) and units who switch out, with (W_{it} = 1, W_{it+1} = 0). To put this discussion in perspective, it is not sufficient to have assignment matrices of the type

W = ( 0 0
      1 1 ),   W = ( 0 1 ),

where with the first design some units are always in the control group and all others are always in the treatment group, and where with the second design all units adopt the treatment at exactly the same time.

In this section we consider assignment processes that satisfy a certain sufficiency property. We state it as a high-level assumption and then show examples of economic models that satisfy this assumption:
Assumption 3.2. (Sufficiency)
There exists a known W_i-measurable sufficient statistic S_i ∈ S and a subset A ⊂ S such that:

(i)  W_i ⊥⊥ U_i | S_i,   (3.15)

and (ii), for all s ∈ A:

max_w {r(w, s)} < 1,   (3.16)

where r(w, s) is the feasible generalized propensity score:

r(w, s) ≡ pr(W_i = w | S_i = s).   (3.17)

This assumption might look restrictive, but an S_i such that conditional on S_i the treatment W_i and the unobserved variable U_i are independent always exists, namely S_i^gen ≡ f_{U|W}(· | W_i), where f_{U|W}(x | y) is the conditional distribution of U_i given W_i. In general, S_i^gen is an infinite-dimensional object (a function) and is unknown, because f_{U|W}(x | y) is unknown. As a result, the first restriction that we make in Assumption 3.2 is that S_i is known. Part (ii) does not allow for S_i = S_i^gen because we require W_i to have a non-degenerate distribution given S_i. Below we consider various assignment models that are common in the empirical panel data literature and demonstrate that in all of them there exists an S_i that one can easily compute. The main implication of Assumption 3.2 coupled with Assumption 2.2 is summarized in the following proposition:

Proposition 2. Suppose Assumptions 2.1, 2.2, and 3.2 hold. Then for any w:

W_i = w ⊥⊥ Y_i(w) | S_i.   (3.18)

This proposition demonstrates that unconfoundedness conditional on U_i can be transformed into unconfoundedness conditional on S_i, under the additional assumption that restricts the assignment process. The assignment models that we consider in this section are restrictive, in the sense that they must satisfy Assumption 3.2. At the same time, most of the models for the binary time-series process W_{it} that are used in the applied and theoretical literature actually satisfy these restrictions (see, e.g., Honoré and Kyriazidou [2000], Chamberlain [2010], Aguirregabiria et al. [2018]).
In fact, in certain cases the existence of a sufficient statistic is a necessary requirement for estimation of common parameters (e.g., Magnac [2004]). This is especially relevant because many of such models have an underlying economic intuition and can be interpreted as models of optimal choice. We are not interested in estimating common parameters of the model for W_i, which is the standard object in non-linear panel analysis. Instead, we only require that the conditional distribution of W_i admits a certain representation. Parameters of this representation are not identified with fixed T, but they do not play any role in Proposition 2, which is the only result that we need.

Static model.
As a first example we consider a static logit model with heterogeneity over time. Formally, we consider the following model:

E[W_{it} | U_i] = exp(α(U_i)^⊤ ψ(t) + λ_t) / (1 + exp(α(U_i)^⊤ ψ(t) + λ_t)),
W_{it} ⊥⊥ {W_{il}}_{l≠t} | U_i,   (3.19)

where ψ(t) is a known function of t. It is easy to demonstrate that in this model

S_i = Σ_{t=1}^T ψ(t) W_{it} / T.  □
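The sufficiency claim for the static model can be verified numerically by enumeration. The sketch below takes ψ(t) = 1 (an illustrative choice, so that S_i = Σ_t W_{it}/T) and assumed values for λ_t and two levels of the heterogeneity α(U), and checks that the conditional law of the assignment path given S_i does not depend on α(U):

```python
import numpy as np
from itertools import product

T = 3
lam = np.array([0.2, -0.5, 1.0])        # common time effects (assumed values)

def seq_prob(w, alpha):
    """P(W = w | U) under the static logit model with psi(t) = 1."""
    p = 1.0
    for t in range(T):
        pt = np.exp(alpha + lam[t]) / (1 + np.exp(alpha + lam[t]))
        p *= pt if w[t] == 1 else (1 - pt)
    return p

def conditional_given_S(alpha):
    """P(W = w | S, U) for every sequence, keyed by (S, w)."""
    seqs = list(product([0, 1], repeat=T))
    probs = {w: seq_prob(w, alpha) for w in seqs}
    out = {}
    for w in seqs:
        s = sum(w)
        total = sum(p for v, p in probs.items() if sum(v) == s)
        out[(s, w)] = probs[w] / total
    return out

c1 = conditional_given_S(alpha=-1.0)
c2 = conditional_given_S(alpha=2.0)
max_gap = max(abs(c1[k] - c2[k]) for k in c1)   # zero up to rounding
```

Algebraically, α(U_i) enters P(W_i | U_i) only through exp(α Σ_t W_{it}), which cancels once we condition on the sum, so the gap is zero up to floating-point error.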
Next we consider a time-homogeneous Markov model:

E[W_{it} | U_i, W_i^{t−1}] = exp(α(U_i) + γ(U_i) W_{it−1}) / (1 + exp(α(U_i) + γ(U_i) W_{it−1})),
W_{it} ⊥⊥ {W_{il}}_{l>t} | U_i, W_i^{t−1}.   (3.20)

In this model

S_i = ( Σ_{t=2}^{T−1} W_{it},  Σ_{t=2}^T W_{it} W_{it−1},  W_{i1},  W_{iT} ).  □
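The same enumeration check works for the Markov model. In the sketch below we treat the initial condition by conditioning on W_{i1} (it is part of the statistic) and use illustrative parameter values for α(U) and γ(U); conditional on S_i, the law of the path should not depend on U:

```python
import numpy as np
from itertools import product

T = 4

def path_prob(w, alpha, gamma):
    """P(W_2,...,W_T = w_2,...,w_T | W_1, U) under the Markov logit model."""
    p = 1.0
    for t in range(1, T):
        z = alpha + gamma * w[t - 1]
        pt = np.exp(z) / (1 + np.exp(z))
        p *= pt if w[t] == 1 else (1 - pt)
    return p

def conditional_given_S(alpha, gamma):
    seqs = list(product([0, 1], repeat=T))
    probs = {w: path_prob(w, alpha, gamma) for w in seqs}
    stat = lambda w: (sum(w[1:T - 1]),                       # sum_{2..T-1} W_t
                      sum(w[t] * w[t - 1] for t in range(1, T)),
                      w[0], w[T - 1])                        # W_1 and W_T
    out = {}
    for w in seqs:
        s = stat(w)
        total = sum(p for v, p in probs.items() if stat(v) == s)
        out[(s, w)] = probs[w] / total
    return out

c1 = conditional_given_S(alpha=-0.5, gamma=1.5)
c2 = conditional_given_S(alpha=1.0, gamma=-2.0)
max_gap = max(abs(c1[k] - c2[k]) for k in c1)   # zero up to rounding
```

Here both the level α and the state dependence γ cancel out of the conditional distribution, since the path probability depends on the path only through the components of S_i.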
For sufficiency we need the following representation for the conditional distribution of W_i:

log(P(W_i | U_i)) = S(W_i)^⊤ α(U_i) + β(U_i) + γ(W_i),   (3.21)

where S(·) is a known function of W_i. All previous examples have this representation. More generally, Aguirregabiria et al. [2018] show that this structure arises in flexible models of dynamic choice.  □

Let S_i be a potential sufficient statistic. Let W^s be a matrix representation of the support of W_i conditional on S_i = s, and let W_k^s be a generic row (element of the support). For example, if S_i = Σ_t W_{it} and W is given by (3.1), then S_i takes 3 possible values and we have the following:

W^0 = ( 0 0 0 ),   W^2 = ( 0 1 1
                           1 0 1 ),   W^3 = ( 1 1 1 ).   (3.22)

When considering an identification strategy based on design assumptions, we do not restrict potential outcomes, but instead require that the assumptions behind Proposition 2 are satisfied. In this case, one can identify a convex combination of individual treatment effects using weights that satisfy the following restrictions (for all k, s, and t):

(1/T) Σ_{t,k} π_k ω_{kt} W_{kt} = 1,
Σ_{k: W_k ∈ W^s} π_k ω_{kt} W_{kt} ≥ 0,
Σ_{k: W_k ∈ W^s} π_k ω_{kt} = 0.   (3.23)

Let W_design be the set of weights {ω_{kt}}_{t,k} that satisfy these restrictions. It is easy to see that W_design is nonempty whenever there exists at least one s such that W^s contains at least two rows. This is guaranteed by the second part of Assumption 3.2. For any ω ∈ W_design define the random variables ω_{k(i)t} in the same way as before and consider the following expectation:

τ(ω) = E[(1/T) Σ_{t=1}^T Y_{it} ω_{k(i)t}].   (3.24)

Proposition 3.
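The design restrictions in (3.23) can also be checked in a toy example. Below, T = 2, S_i = Σ_t W_{it}, and the support within S = 1 is {(0,1), (1,0)}; the probabilities and solved weights are our own illustrative assumptions:

```python
import numpy as np

pi = np.array([0.6, 0.4])               # assumed type probabilities
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])              # both rows have S = 1
K, T = W.shape

# Weights solving (3.23) for the single S-group: pi-weighted period
# sums are zero and the treated-weight normalization equals one.
omega = np.array([[-1 / 0.6, 1 / 0.6],
                  [ 1 / 0.4, -1 / 0.4]])

norm = np.sum(pi[:, None] * omega * W) / T           # should equal 1
col_sums = (pi[:, None] * omega).sum(axis=0)         # zero within the S-group
treated = (pi[:, None] * omega * W).sum(axis=0)      # nonnegative per period
```

This member of W_design compares treated and control units within the same period, among units with the same total exposure.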
Suppose Assumptions 2.1, 2.2, 2.3, and 3.2 hold, and that ω ∈ W_design. Then τ(ω) is a convex combination of treatment effects.

The sets W_outc and W_design are motivated by different models and in general do not need to be similar. In some sense, one can say that the weights in W_outc target within-unit comparisons, while those in W_design target within-period comparisons. This interpretation is convenient, but it is not entirely correct because in general W_outc ∩ W_design is not empty. Consequently, one does not need to take a stand on what comparisons to use: those based on looking at the same units across time, or at different units for a fixed time period. As a result, we suggest using the weights in W_outc ∩ W_design. In fact, we restrict this set even further and define the following one:

W_dr ≡ {ω}  subject to:
(1/T) Σ_{t,k} π_k ω_{kt} W_{kt} = 1,
(1/T) Σ_{t=1}^T ω_{kt} = 0,
Σ_{k: W_k ∈ W^s} π_k ω_{kt} = 0,
ω_{kt} W_{kt} ≥ 0,   (3.25)

and note that W_dr ⊂ (W_outc ∩ W_design). The difference between W_outc ∩ W_design and W_dr is quite small: we simply impose the additional restriction that every treated unit receives a non-negative weight. Note that neither the weights in W_outc nor those in W_design in general satisfy this restriction. This is important in practice, because we want to be robust to arbitrary heterogeneity in treatment effects.

When is the set W_dr non-empty? Combining the earlier discussion of W_outc and W_design, it is easy to see that a necessary and sufficient condition for W_dr to be non-empty is that there exists an s such that the corresponding W^s contains at least one of the following two sub-matrices (up to permutations):

W = ( 0 1
      0 0 ),   W = ( 0 1
                     1 0 ).   (3.26)

In particular, note that the matrix W^2 from (3.14) is not sufficient. The reason for this is that we require the weights for treated units to be non-negative and to sum up to zero for each row. This implies that the first row should receive a zero weight, and thus we cannot make cross-sectional comparisons.
The requirement that $\mathbb{W}_s$ contain these sub-matrices is in general more demanding than the second part of Assumption 3.2. At the same time, if $S_i$ includes $\overline{W}_i$, then for any $s$, $\mathbb{W}_s$ can contain one of the sub-matrices in (3.26) only if it contains the other, and this is equivalent to the overlap condition.

Finally, we can state the main identification result. The following theorem is a direct consequence of Propositions 1 and 3:

Theorem 1.
Suppose Assumptions 2.1, 2.2, and 2.3 hold, and that either Assumption 3.1 or Assumption 3.2 (or both) holds. Then for any $\omega \in \mathcal{W}_{dr}$, the estimand $\tau(\omega)$ is a convex combination of treatment effects.

We assume that we observe a random sample $\{Y_i, W_i, X_i\}_{i=1}^{N}$ from some distribution $\mathbb{P}$, with $T$ (the number of periods) being fixed. We assume that the researcher has constructed a sufficient statistic $S_i \equiv S(W_i, X_i)$ based on a design model. We maintain Assumption 2.1 and additionally restrict the outcome model:

Assumption 4.1.
For each $t$, one of the following outcome models is correct. Either there exists a sufficient statistic $S_i$ such that the following is true:
\[
Y_{it}(0) = \beta_t + \psi_1(X_i, t)^\top \delta + \psi_2(X_i, S_i, t)^\top \gamma + \xi_{it}, \qquad \mathbb{E}[\xi_{it} \mid X_i, S_i] = 0, \qquad (\xi_{i1}, \ldots, \xi_{iT}) \perp\!\!\!\perp W_i \mid X_i, S_i, \tag{4.1}
\]
or $U_i = (W_i, X_i)$ and we have the following:
\[
Y_{it}(0) = \alpha(W_i, X_i) + \beta_t + \psi_1(X_i, t)^\top \delta + \varepsilon_{it}, \qquad \mathbb{E}[\varepsilon_{it} \mid X_i, W_i] = 0, \tag{4.2}
\]
where $\psi_1(X_i, t)$ and $\psi_2(X_i, S_i, t)$ are known $p$-dimensional functions.

This assumption allows for our design model to be correct, so that we only need to control for $(S_i, X_i)$, or for the more traditional fixed effects model to be correct. We do not impose any restrictions on $Y_{it}(1)$, and thus on heterogeneity in treatment effects. For simplicity we assume that in both cases the conditional expectations are linear in parameters with respect to a known finite-dimensional dictionary. Since all our identification results hold conditional on $X_i$, this assumption is not necessary, and the estimation procedure below can be adapted to allow for unknown $\psi_1$ and $\psi_2$. At the same time, we believe that our estimator is a natural alternative to the current status quo, which is a two-way fixed effects model estimated by OLS based on (4.2). We leave further nonparametric generalizations to future work.

Our estimator is defined in the following way:
\[
\hat{\tau} := \frac{1}{NT} \sum_{it} \hat{\omega}_{it} Y_{it}, \tag{4.3}
\]
where the weights $\{\hat{\omega}_{it}\}_{it}$ solve the optimization problem:
\[
\{\hat{\omega}_{it}\}_{it} = \arg\min_{\{\omega_{it}\}_{it}} \frac{1}{(NT)^2} \sum_{it} \omega_{it}^2 \quad \text{subject to:} \quad
\begin{cases}
\dfrac{1}{NT} \displaystyle\sum_{it} \omega_{it} W_{it} \ge 1, \\[6pt]
\dfrac{1}{T} \displaystyle\sum_{t} \omega_{it} = 0 \;\; \text{for each } i, \qquad \dfrac{1}{N} \displaystyle\sum_{i} \omega_{it} = 0 \;\; \text{for each } t, \\[6pt]
\dfrac{1}{NT} \displaystyle\sum_{it} \omega_{it} \psi(X_i, S_i, t) = 0, \qquad \omega_{it} W_{it} \ge 0,
\end{cases} \tag{4.4}
\]
where $\psi(X_i, S_i, t) := (\psi_1(X_i, t), \psi_2(X_i, S_i, t))$. At the optimum the first inequality is binding, and we write it in this form to simplify the dual representation below.
The weights $\hat{\omega}_{it}$ are related to standard OLS fixed effects weights, but here we explicitly look for weights that balance functions of $S_i$, not only fixed attributes $X_i$, and that satisfy certain inequality constraints. The last restriction is crucial, because it is well documented that standard OLS estimators with fixed effects in general do not correspond to reasonable estimands if the effects are heterogeneous (see, e.g., de Chaisemartin and D'Haultfœuille [2018]).

It is natural to ask whether the weights that solve the problem above exist. In Lemma A.1 we show that a necessary and sufficient condition for existence is that the control and treated units satisfy a certain overlap condition. In particular, there are no $\{\lambda_i, \mu_t, \gamma\}_{i,t}$ such that the following is true:
\[
\lambda_i + \mu_t + \psi_{it}^\top \gamma \ge 0, \qquad W_{it} = \mathbb{1}\{\lambda_i + \mu_t + \psi_{it}^\top \gamma > 0\}. \tag{4.5}
\]
This is a very mild overlap condition that is likely to be satisfied for any reasonable assignment process.

Our estimator fits naturally into the recent theoretical literature on balancing weights (e.g., Imai and Ratkovic [2014], Zubizarreta [2015], Athey et al. [2016], Hirshberg and Wager [2017], Chernozhukov et al. [2018a,b], Armstrong and Kolesár [2018a]). The main technical difference between our approach and the ones proposed in the literature is that we need to balance unit-specific functions and explicitly impose non-negativity constraints. At the same time, we only balance a small parametric class of functions of $(X_i, S_i)$, while others consider much more general functional classes. We leave this generalization to future research.
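The overlap condition in (4.5) can be checked as a linear-programming feasibility problem: with no covariates, assignment is "separated" if and only if some $\lambda_i + \mu_t$ is zero on control cells and strictly positive (normalized to $\ge 1$) on treated cells. The helper name below is ours, and covariates are dropped for brevity:

```python
import numpy as np
from scipy.optimize import linprog

def separated(W):
    """Return True iff (lambda_i, mu_t) as in (4.5) exist, i.e. overlap fails."""
    N, T = W.shape
    A_eq, b_eq, A_ub, b_ub = [], [], [], []
    for i in range(N):
        for t in range(T):
            row = np.zeros(N + T)
            row[i] = 1.0       # lambda_i
            row[N + t] = 1.0   # mu_t
            if W[i, t] == 1:
                A_ub.append(-row)  # lambda_i + mu_t >= 1 on treated cells
                b_ub.append(-1.0)
            else:
                A_eq.append(row)   # lambda_i + mu_t = 0 on control cells
                b_eq.append(0.0)
    res = linprog(np.zeros(N + T), A_ub=A_ub or None, b_ub=b_ub or None,
                  A_eq=A_eq or None, b_eq=b_eq or None,
                  bounds=[(None, None)] * (N + T))
    return res.status == 0  # feasible => assignment is separated

W_bad = np.array([[1.0, 1.0], [0.0, 0.0]])   # an always-treated unit: separated
W_good = np.array([[0.0, 1.0], [1.0, 0.0]])  # switching paths: overlap holds
```

Here `separated(W_bad)` is `True` (take $\lambda_1 = 1$, all other parameters zero), while `separated(W_good)` is `False`, so the weights in (4.4) exist for the second design.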
The Lagrangian saddle-point problem for the program (4.4) has the following form:
\[
\inf_{\omega_{it}} \sup_{\lambda(t), \lambda(i), \gamma, \mu_{it} \ge 0, \pi \ge 0} \frac{1}{(NT)^2} \sum_{it} \omega_{it}^2 + \frac{1}{N} \sum_{i} \lambda(i) \left( \frac{1}{T} \sum_{t} \omega_{it} \right) + \frac{1}{T} \sum_{t} \lambda(t) \left( \frac{1}{N} \sum_{i} \omega_{it} \right) + \pi \left( 1 - \frac{1}{NT} \sum_{it} \omega_{it} W_{it} \right) - \gamma^\top \left( \frac{1}{NT} \sum_{it} \omega_{it} \psi_{it} \right) - \frac{1}{NT} \sum_{it} \mu_{it} \omega_{it} W_{it}, \tag{4.6}
\]
where we use $\psi_{it}$ as shorthand for $\psi(X_i, S_i, t)$. In Lemma A.1 we show that strong duality holds and we can rearrange the minimization and maximization:
\[
\sup_{\lambda(t), \lambda(i), \gamma, \mu_{it} \ge 0, \pi \ge 0} \inf_{\omega_{it}} \; \frac{1}{(NT)^2} \sum_{it} \omega_{it}^2 + \frac{1}{N} \sum_{i} \lambda(i) \left( \frac{1}{T} \sum_{t} \omega_{it} \right) + \frac{1}{T} \sum_{t} \lambda(t) \left( \frac{1}{N} \sum_{i} \omega_{it} \right) - \pi \left( \frac{1}{NT} \sum_{it} \omega_{it} W_{it} - 1 \right) - \gamma^\top \left( \frac{1}{NT} \sum_{it} \omega_{it} \psi_{it} \right) - \frac{1}{NT} \sum_{it} \mu_{it} \omega_{it} W_{it}. \tag{4.7}
\]
Solving this in terms of $\omega_{it}$ (an unconstrained quadratic problem), we get the following representation:
\[
\inf_{\lambda(t), \lambda(i), \gamma, \mu_{it} \ge 0, \pi \ge 0} \mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \left( \pi W_{it} - \lambda(t) - \lambda(i) - \gamma^\top \psi_{it} - \mu_{it} W_{it} \right)^2 \right] - 4\pi. \tag{4.8}
\]
We can further simplify this expression by concentrating out $\mu_{it}$ and $\pi$. To this end, define the following loss function:
\[
\rho_z(x) := x^2 (1 - z) + (x_+)^2\, z, \qquad x_+ := \max\{x, 0\}. \tag{4.9}
\]
After some algebra we get the following:
\[
\inf_{\lambda(t), \lambda(i), \gamma} \mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \rho_{W_{it}} \left( W_{it} - \lambda(t) - \lambda(i) - \gamma^\top \psi_{it} \right) \right]. \tag{4.10}
\]
Let $\{\hat{\lambda}(t), \hat{\lambda}(i), \hat{\gamma}\}_{i,t}$ be the solutions to this problem.
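The concentrated problem (4.10) is a smooth convex minimization and the weights then follow from the fitted residuals. Below is a minimal sketch with no covariates ($\gamma$ dropped) on an illustrative panel; the data and names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def rho(x, z):
    # rho_z(x) = x^2 (1 - z) + max(x, 0)^2 z: control cells get the full
    # quadratic loss, treated cells are only penalized for positive residuals.
    return x ** 2 * (1 - z) + np.maximum(x, 0.0) ** 2 * z

W = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
N, T = W.shape

def dual_objective(theta):
    lam_i, lam_t = theta[:N], theta[N:]
    resid = W - lam_t[None, :] - lam_i[:, None]
    return rho(resid, W).sum() / (N * T)

res = minimize(dual_objective, x0=np.zeros(N + T), method="BFGS")
lam_i_hat, lam_t_hat = res.x[:N], res.x[N:]

# Weight construction: raw residuals on control cells, positive parts on
# treated cells, then normalize so that (1/NT) sum_it w_it W_it = 1.
resid = W - lam_t_hat[None, :] - lam_i_hat[:, None]
omega_un = resid * (1 - W) + np.maximum(resid, 0.0) * W
omega_hat = omega_un / ((omega_un * W).sum() / (N * T))
```

By construction the normalized weights are non-negative on treated cells and average to one there, matching the properties claimed for $\hat{\omega}_{it}$ below.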
The optimal unnormalized weights are equal to the following:
\[
\hat{\omega}^{(un)}_{it} = \left( W_{it} - \hat{\lambda}(t) - \hat{\lambda}(i) - \hat{\gamma}^\top \psi_{it} \right)(1 - W_{it}) + \left( W_{it} - \hat{\lambda}(t) - \hat{\lambda}(i) - \hat{\gamma}^\top \psi_{it} \right)_+ W_{it}, \tag{4.11}
\]
and the optimal weights are given by the normalization:
\[
\hat{\omega}_{it} := \frac{\hat{\omega}^{(un)}_{it}}{\frac{1}{NT} \sum_{it} \hat{\omega}^{(un)}_{it} W_{it}}. \tag{4.12}
\]
By construction, the weights are non-negative for the treated units and sum up to one once multiplied by $W_{it}$. The denominator is strictly positive under the conditions of Lemma A.1.

In order to state the inference results we need to make several statistical assumptions:
Assumption 4.2. (a) $\mathbb{P}$-a.s. $(X_i, S_i) \in \Omega$, a compact subset of some metric space; (b) $\psi(X_i, S_i, t)$ is a continuous function of its arguments (on $\Omega$); (c) the errors $u_{it}$ satisfy the following moment conditions:
\[
\mathbb{E}[u_{it}^2 \mid W_i, X_i] \le \sigma_u^2 < \infty, \qquad \mathbb{E}[u_{it}^4] < \infty. \tag{4.13}
\]
The part of the assumption about $u_{it}$ is standard in the literature on projection estimators. We assume compactness to streamline the proofs, and we think that it covers most problems that researchers face in applications. There is no doubt that it can be considerably relaxed.

Assumption 4.3. (a) $S_i$ includes $\overline{W}_i$; (b) for all $t$ and some $\eta > 0$ we have $\mathbb{E}[W_{it} \mid S_i, X_i] \le 1 - \eta$; (c) the following holds:
\[
\Gamma_{it} := (1 - W_{it}) \left( \psi_{it} - \frac{\sum_{l=1}^{T} (1 - W_{il}) \psi_{il}}{\sum_{l=1}^{T} (1 - W_{il})} \right), \qquad \sigma_{\min} \left( \sum_{t=1}^{T} \mathbb{E}\left[ \Gamma_{it} \Gamma_{it}^\top \right] \right) \ge \kappa > 0. \tag{4.14}
\]
The next result describes the asymptotic behavior of $\hat{\tau}$ and $\hat{\omega}_{it}$:

Theorem 2.
Suppose Assumptions 4.1, 4.2, and 4.3 are satisfied. Then there exists a collection of random variables $\{\omega^\star(X_i, W_i, t)\}_{t=1}^{T}$ such that the following holds:
\[
\frac{1}{T} \sum_{t=1}^{T} \| \hat{\omega}_t - \omega^\star_t \|_2 = o_p(1). \tag{4.15}
\]
Define the following conditional estimand:
\[
\tau_{emp} = \frac{1}{NT} \sum_{it} \hat{\omega}_{it} W_{it}\, \mathbb{E}[\tau_{it} \mid W_i, X_i]. \tag{4.16}
\]
The scaled difference between the estimator and $\tau_{emp}$ converges in distribution to a normal random variable:
\[
\sqrt{n} (\hat{\tau} - \tau_{emp}) \to \mathcal{N}(0, \sigma_\tau^2), \tag{4.17}
\]
where the variance has the following form:
\[
\sigma_\tau^2 := \mathbb{E}\left[ \left( \frac{1}{T} \sum_{t=1}^{T} \omega^\star_{it} \left( u_{it} + W_{it} (\tau_{it} - \mathbb{E}[\tau_{it} \mid W_i, X_i]) \right) \right)^2 \right], \tag{4.18}
\]
where $\omega^\star_{it} := \omega^\star(X_i, W_i, t)$, and $u_{it}$ is equal to either $\xi_{it}$ or $\varepsilon_{it}$.

This theorem describes the performance of our estimator in larger samples. The population weights $\omega^\star$ depend on $(X_i, W_i)$, not only on $S_i$, which is an implication of the fact that we need to deal with individual fixed effects.

Our next result shows that the standard nonparametric bootstrap provides a conservative estimator for $\sigma_\tau^2$.

Theorem 3.
Let $\{\hat{\tau}^{(b)}\}_{b=1}^{B}$ be a set of non-parametric (unit-level) bootstrap analogs of $\hat{\tau}$. Define:
\[
\hat{\sigma}^2 := \frac{N}{B} \sum_{b=1}^{B} \left( \hat{\tau}^{(b)} - \hat{\tau} \right)^2, \tag{4.19}
\]
and suppose that the assumptions of Theorem 2 hold. Then, if $\mathbb{E}[\tau_{it} \mid W_i, X_i] = \tau$, $\hat{\sigma}^2$ is consistent for $\sigma_\tau^2$; otherwise $\hat{\sigma}^2$ is conservative.

Conclusion
In this paper, we propose a novel identification argument that can be used to evaluate a causal effect using panel data. We show that one can naturally combine familiar restrictions on the relationship between the outcome and the unobserved unit-level characteristics with reasonable economic models of the assignment process. Our approach allows us to construct a doubly robust identification argument: our estimand has a causal interpretation if either the outcome model is correct, or the assignment model is correct (or both). Using these results, we construct a natural generalization of the standard two-way fixed effects estimator that is robust to arbitrary heterogeneity in treatment effects, and we show that it has reasonable theoretical properties.

References
Alberto Abadie and Matias D. Cattaneo. Econometric methods for program evaluation. Annual Review of Economics, 10:465–503, 2018.

Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505, 2010.

Victor Aguirregabiria, Jiaying Gu, and Yao Luo. Sufficient statistics for unobserved heterogeneity in structural dynamic logit models. arXiv preprint arXiv:1805.04048, 2018.

Joseph G. Altonji and Rosa L. Matzkin. Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica, 73(4):1053–1102, 2005.

Manuel Arellano. Panel Data Econometrics. Oxford University Press, 2003.

Manuel Arellano and Stéphane Bonhomme. Nonlinear panel data analysis. 2011.

Manuel Arellano and Bo Honoré. Panel data models: Some recent developments. Handbook of Econometrics, 5:3229–3296, 2001.

Dmitry Arkhangelsky and Guido Imbens. The role of the propensity score in fixed effect models. Technical report, National Bureau of Economic Research, 2018.

Dmitry Arkhangelsky, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager. Synthetic difference in differences. Technical report, National Bureau of Economic Research, 2019.

Timothy Armstrong and Michal Kolesár. Finite-sample optimal estimation and inference on average treatment effects under unconfoundedness. 2018a.

Timothy B. Armstrong and Michal Kolesár. Optimal inference in a class of regression models. Econometrica, 86(2):655–683, 2018b.

Susan Athey and Guido Imbens. Design-based analysis in difference-in-differences settings with staggered adoption. 2018.

Susan Athey, Guido Imbens, and Stefan Wager. Efficient inference of average treatment effects in high dimensions via approximate residual balancing.
arXiv preprint arXiv:1604.07125, 2016.

Susan Athey, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. Matrix completion methods for causal panel data models. arXiv preprint arXiv:1710.10251, 2017.

Eli Ben-Michael, Avi Feller, and Jesse Rothstein. The augmented synthetic control method. arXiv preprint arXiv:1811.04170, 2018.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Gary Chamberlain. Panel data. Handbook of Econometrics, 2:1247–1318, 1984.

Gary Chamberlain. Efficiency bounds for semiparametric regression. Econometrica, 60(3):567–596, 1992.

Gary Chamberlain. Binary response models for panel data: Identification and information. Econometrica, 78(1):159–168, 2010.

Victor Chernozhukov, Iván Fernández-Val, Jinyong Hahn, and Whitney Newey. Average and quantile effects in nonseparable panel models. Econometrica, 81(2):535–580, 2013.

Victor Chernozhukov, Whitney Newey, and James Robins. Double/de-biased machine learning using regularized Riesz representers. arXiv preprint arXiv:1802.08667, 2018a.

Victor Chernozhukov, Whitney K. Newey, and Rahul Singh. Learning L2 continuous regression functionals via regularized Riesz representers. arXiv preprint arXiv:1809.05224, 2018b.

Clément de Chaisemartin and Xavier D'Haultfœuille. Two-way fixed effects estimators with heterogeneous treatment effects. 2018.

David L. Donoho. Statistical estimation and optimal recovery. The Annals of Statistics, 22(1):238–270, 1994.

Andrew Goodman-Bacon. Difference-in-differences with variation in treatment timing. Technical report, Working Paper, 2017.

David A. Hirshberg and Stefan Wager. Augmented minimax linear estimation. arXiv preprint arXiv:1712.00038, 2017.

Bo E. Honoré and Ekaterini Kyriazidou. Panel data discrete choice models with lagged dependent variables. Econometrica, 68(4):839–874, 2000.

Kosuke Imai and In Song Kim.
When should we use unit fixed effects regression models for causal inference with longitudinal data? American Journal of Political Science, 63(2):467–490, 2019.

Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.

Guido Imbens. The role of the propensity score in estimating dose–response functions. Biometrika, 87(3):706–710, 2000.

Guido Imbens. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1):4–29, 2004.

Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Thierry Magnac. Panel binary variables and sufficiency: Generalizing conditional logit. Econometrica, 72(6):1859–1876, 2004.

Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25–39, 2014.

Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Yiqing Xu. Generalized synthetic control method: Causal inference with interactive fixed effects models. Political Analysis, 25(1):57–76, 2017.

José R. Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.
Appendix
Proof of Proposition 1: For any $\omega \in \mathcal{W}_{outc}$ we defined the random variables
\[
\omega_{k(i)t} \equiv \sum_{k=1}^{K} \omega_{kt}\, \mathbb{1}\{W_i = W_k\} \tag{A.1}
\]
and considered the following estimand:
\[
\tau(\omega) = \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} Y_{it}\, \omega_{k(i)t}\right]. \tag{A.2}
\]
By assumption we have the representation:
\[
\begin{aligned}
\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} Y_{it}\, \omega_{k(i)t}\right]
&= \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \left( \alpha(U_i) + \lambda_t + \tau(U_i) W_{it} + \varepsilon_{it} \right) \omega_{k(i)t}\right] \\
&= \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \left( \alpha(U_i) + \lambda_t + \tau(U_i) W_{it} + \varepsilon_{it} \right) \sum_{k=1}^{K} \omega_{kt}\, \mathbb{1}\{W_i = W_k\}\right] \\
&= \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{K} \alpha(U_i)\, \omega_{kt}\, \mathbb{1}\{W_i = W_k\}\right] + \frac{1}{T} \sum_{t=1}^{T} \lambda_t \sum_{k=1}^{K} \mathbb{E}\left[\omega_{kt}\, \mathbb{1}\{W_i = W_k\}\right] + \mathbb{E}\left[\tau(U_i)\, \frac{1}{T} \sum_{k=1}^{K} \sum_{t=1}^{T} \mathbb{1}\{W_i = W_k\} W_{kt}\, \omega_{kt}\right] \\
&= \frac{1}{T} \sum_{t=1}^{T} \lambda_t \sum_{k=1}^{K} \pi_k \omega_{kt} + \mathbb{E}[\tau(U_i)\, \xi(W_i)] = \mathbb{E}[\tau(U_i)\, \xi(W_i)],
\end{aligned} \tag{A.3}
\]
where $\xi(W_i) := \frac{1}{T} \sum_{k=1}^{K} \sum_{t=1}^{T} \mathbb{1}\{W_i = W_k\} W_{kt}\, \omega_{kt} \ge 0$. The first equality follows from the restrictions on the outcome model; the second by definition of the weights; the third because $\mathbb{E}[\varepsilon_i \mid U_i] = 0$ and the strict exogeneity assumption; finally, the last two equalities follow by construction of the weights. By construction we also have that $\xi(W_i) \ge 0$ and $\mathbb{E}[\xi(W_i)] = 1$. This proves the claim.

Proof of Proposition 3: The proof is very similar to the one above and is omitted.
Proof of Proposition 2: We need to prove the following for arbitrary $w$ and measurable $A_0$, $A_1$:
\[
\mathbb{E}[\mathbb{1}\{W_i = w\}\, \mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid S_i] = \mathbb{E}[\mathbb{1}\{W_i = w\} \mid S_i]\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid S_i]. \tag{A.4}
\]
We have the following chain of equalities, which proves the claim:
\[
\begin{aligned}
\mathbb{E}[\mathbb{1}\{W_i = w\}\, \mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid S_i]
&= \mathbb{E}\big[\mathbb{1}\{W_i = w\}\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid S_i, U_i, W_i] \mid S_i\big] \\
&= \mathbb{E}\big[\mathbb{1}\{W_i = w\}\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid U_i, S_i] \mid S_i\big] \\
&= \mathbb{E}\big[\mathbb{E}[\mathbb{1}\{W_i = w\} \mid S_i, U_i]\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid U_i, S_i] \mid S_i\big] \\
&= \mathbb{E}\big[\mathbb{E}[\mathbb{1}\{W_i = w\} \mid S_i]\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid U_i, S_i] \mid S_i\big] \\
&= \mathbb{E}[\mathbb{1}\{W_i = w\} \mid S_i]\, \mathbb{E}[\mathbb{1}\{Y_i(0) \in A_0, Y_i(1) \in A_1\} \mid S_i],
\end{aligned} \tag{A.5}
\]
where the second equality follows by strict exogeneity and the fourth by sufficiency.

Lemma A.1.
Suppose that $\{W_{it}\}_{i,t}$ are such that there are no $\{\alpha_i, \beta_t, \gamma\}_{i,t}$ for which the following is true:
\[
\alpha_i + \beta_t + \psi_{it}^\top \gamma \ge 0, \qquad W_{it} = \mathbb{1}\{\alpha_i + \beta_t + \psi_{it}^\top \gamma > 0\}. \tag{A.6}
\]
Then (a) the primal problem always has a unique solution, and (b) strong duality holds; i.e., for the function
\[
h(\lambda, \mu, \pi, \gamma, \omega) := \frac{1}{(nT)^2} \sum_{it} \omega_{it}^2 + \frac{1}{n} \sum_{i} \lambda(i) \left( \frac{1}{T} \sum_{t} \omega_{it} \right) + \frac{1}{T} \sum_{t} \lambda(t) \left( \frac{1}{n} \sum_{i} \omega_{it} \right) + \pi \left( 1 - \frac{1}{nT} \sum_{it} \omega_{it} W_{it} \right) - \gamma^\top \left( \frac{1}{nT} \sum_{it} \omega_{it} \psi_{it} \right) - \frac{1}{nT} \sum_{it} \mu_{it} \omega_{it} W_{it} \tag{A.7}
\]
we have
\[
\inf_{\omega_{it}} \sup_{\lambda(t), \lambda(i), \gamma, \mu_{it} \ge 0, \pi \ge 0} h(\lambda, \mu, \pi, \gamma, \omega) = \sup_{\lambda(t), \lambda(i), \gamma, \mu_{it} \ge 0, \pi \ge 0} \inf_{\omega_{it}} h(\lambda, \mu, \pi, \gamma, \omega). \tag{A.8}
\]

Proof. A direct application of the generalized Farkas lemma implies that the constraint set is empty if and only if there exist $(\alpha_i^\star, \beta_t^\star, \gamma^\star)$ such that the following is true:
\[
\alpha_i^\star + \beta_t^\star + \psi_{it}^\top \gamma^\star \ge 0, \qquad W_{it} = \mathbb{1}\{\alpha_i^\star + \beta_t^\star + \psi_{it}^\top \gamma^\star > 0\}. \tag{A.9}
\]
By assumption such $(\alpha_i^\star, \beta_t^\star, \gamma^\star)$ do not exist, and thus the constraint set is non-empty and convex. Since the objective function is strictly convex, the primal problem has a unique solution. Since all the inequality constraints are affine, strong duality holds (see Section 5.2.3 in Boyd and Vandenberghe [2004]), and we have the result.

Lemma A.2.
For arbitrary $\gamma$ define $g(X, W, \gamma)$ in the following way:
\[
g(X, W, \gamma) \in \arg\min_{\alpha} \left\{ \frac{1}{T} \sum_{t=1}^{T} \rho_{W_t} \left( W_t - \alpha - \psi_t^\top \gamma \right) \right\}. \tag{A.10}
\]
Then for any $W$ such that $\overline{W} < 1$ this function is uniquely defined. Moreover, if $\|\psi_t\|_\infty < K$, then $g(X, W, \gamma)$ is $\mathbb{P}$-a.s. uniformly (in $(X, W)$) Lipschitz in $\gamma$.

Proof. Suppose $\overline{W} < 1$. Define $h_t := W_t - \psi_t^\top \gamma$, and let $\tilde{h}_{(1)}, \ldots, \tilde{h}_{(\sum_{t=1}^{T} W_t)}$ be the decreasing ordering of $h_t$ over the periods with $W_t = 1$; let $\tilde{h}_{(0)} := 0$. For $k = 0, \ldots, \sum_{t=1}^{T} W_t$ define the following functions:
\[
g_k(X, W, \gamma) := \frac{\sum_{t=1}^{T} (1 - W_t) h_t + \sum_{l=0}^{k} \tilde{h}_{(l)}}{\sum_{t=1}^{T} (1 - W_t) + k}. \tag{A.11}
\]
It is easy to see that we have the following:
\[
g(X, W, \gamma) = g_0(X, W, \gamma) + \sum_{l=1}^{k} \mathbb{1}\{\tilde{h}_{(l)} \ge g_{l-1}\} \left( g_l(X, W, \gamma) - g_{l-1}(X, W, \gamma) \right). \tag{A.12}
\]
From this representation it follows that $g(X, W, \gamma)$ is differentiable and $\mathbb{P}$-a.s. uniformly (in $(X, W)$) Lipschitz in $\gamma$.

Lemma A.3.
Let $\{W_i, X_i\}$ be distributed according to $\mathbb{P}$; assume that $S_i$ includes $\overline{W}_i$ and $\mathbb{E}[W_{it} \mid S_i, X_i] < 1 - \eta$ $\mathbb{P}$-a.s. for some $\eta > 0$. Then there exist a $\sigma(W_i, X_i)$-measurable random variable $\alpha_i^\star$ and a vector $\gamma^\star$ such that the following conditions are satisfied:
\[
\begin{aligned}
&\xi_{it} := W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star, \\
&\mathbb{E}\left[ \sum_{t=1}^{T} \xi_{it} \psi_{it} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \right] = 0, \\
&\sum_{t=1}^{T} \xi_{it} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) = 0.
\end{aligned} \tag{A.13}
\]

Proof.
Define $\mathcal{F} := \{f \in L_2(\mathbb{P})^T : f_t = g(W_i, X_i) + h_t(S_i, X_i),\; g, h_t \in L_\infty(\mathbb{P})\}$, and similarly define $\mathcal{G} := \{g = (g_1, \ldots, g_T) : g_t = f + \psi_t^\top \gamma,\; f \in L_2(\mathbb{P}),\; \gamma \in \mathbb{R}^p\}$. Consider the following optimization program:
\[
\inf_{g \in \mathcal{G}} \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} \rho_{W_{it}} (W_{it} - g_{it}) \right], \tag{A.14}
\]
and let $r^\star$ be the value of the infimum; write $R(f) := \mathbb{E}[\frac{1}{T} \sum_{t=1}^{T} \rho_{W_{it}}(W_{it} - f_{it})]$ for the objective. We prove that there exists a function $g^\star \in \mathcal{G}$ that solves this problem. This is not entirely trivial, because $\mathcal{G}$ is not compact and the loss function is not quadratic, so we can directly use neither the Weierstrass theorem nor the standard projection theorem.

Consider the set $\mathcal{F}(r^\star) := \{f \in \mathcal{F} : R(f) \le r^\star\}$. Because $R$ is continuous and convex on $L_2^T(\mathbb{P})$, the set $\mathcal{F}(r^\star)$ is closed and convex. Now assume that $g^\star$ does not exist, so that $\mathcal{F}(r^\star) \cap \mathcal{G} = \emptyset$. By construction $\mathcal{G}$ is closed (in $L_2(\mathbb{P})$) and convex; as a result, we have two closed convex sets with empty intersection.

Assume for the moment that $\mathcal{F}(r^\star)$ is weakly compact. Then by the strict separating hyperplane theorem there exist $h^\star \in L_2^T(\mathbb{P})$ and constants $a_1 < a_2$ such that $\sup_{f \in \mathcal{F}(r^\star)} (f, h^\star) < a_1 < a_2 < \inf_{g \in \mathcal{G}} (g, h^\star)$. Let $f^\star \in \mathcal{F}(r^\star) \cup \mathcal{G}$ be such that $R(f^\star) \le R(f)$ for any $f \in \mathcal{F}(r^\star) \cup \mathcal{G}$. Fix an $\varepsilon > 0$ and take $g_\varepsilon \in \mathcal{G}$ such that $R(g_\varepsilon) < r^\star + \varepsilon$.

For $t \in [0, 1]$ consider the function $r(t) = R(f^\star + t(g_\varepsilon - f^\star))$. By convexity of $R$ it follows that $r(t)$ is convex, and by the definition of $f^\star$ it has a minimum at zero. For $t \in [0, 1]$ consider also the function
\[
(h^\star, f^\star + t(g_\varepsilon - f^\star)) =: a + bt, \tag{A.15}
\]
and define $t_1 := \frac{a_1 - a}{b}$ and $t_2 := \frac{a_2 - a}{b}$, so that $\frac{t_2 - t_1}{t_1} = \frac{a_2 - a_1}{a_1 - a} > 0$. By construction it follows that $r(t_1) \ge r^\star$ and $r(t_2) < r^\star + \varepsilon$, and by convexity we have
\[
r(t_2) \ge r(t_1) + \frac{r(t_1) - r(0)}{t_1}(t_2 - t_1) \ge r^\star + \frac{r^\star - R(f^\star)}{t_1}(t_2 - t_1).
\]
The right-hand side of this inequality does not depend on $\varepsilon$, which leads to a contradiction.

To finish the proof we need to show that (a) $f^\star$ exists and is unique, and (b) $\mathcal{F}(r^\star)$ is weakly compact. The latter follows if we prove that $\mathcal{F}(r^\star)$ is bounded in $L_2(\mathbb{P})$, which in turn follows because $R$ is convex and has a unique minimum at $f^\star$ in $\mathcal{F}(r^\star)$.

Finally, we prove that $R$ has a unique minimum. Consider $f^\star$ such that $f_t^\star := \mathbb{E}[W_{it} \mid S_i, X_i]$. Because $S_i$ includes $\overline{W}_i$, it follows that $\frac{1}{T} \sum_{t=1}^{T} f_t^\star = \overline{W}_i$. Take any function $f \in \mathcal{F}$ and consider the convex combination $f(\lambda) := f^\star + \lambda(f - f^\star)$. Because $f \in L_\infty(\mathbb{P})$ and $f_t^\star \le 1 - \eta$, there is a $\lambda_0 > 0$ such that for all $\lambda < \lambda_0$ we have $f_t(\lambda) < 1$, and therefore
\[
R(f(\lambda)) = \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} (W_t - f_t^\star)^2 \right] + \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} (f_t^\star - f_t(\lambda))^2 \right] > R(f^\star).
\]
By convexity of $R$ it follows that $R(f) > R(f^\star)$, which proves that $g^\star$ exists. The final result follows because $R$ is Gâteaux-differentiable on $\mathcal{F}$, and taking first-order conditions gives the claim.

A.3 Theorems

Proof of Theorem 2: We split the proof into two parts. First, we assume that $\|(\omega^\star)^{un} - \hat{\omega}^{un}\|_2 = o_p(1)$, that $(\omega^\star_{it})^{un}$ is uniformly bounded, and that $\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} (\omega^\star_{it})^{un} W_{it}\right] > 0$, and prove the normality result. Then we prove the first statement.
Part 1:
Assume that $\|(\omega^\star)^{un} - \hat{\omega}^{un}\|_2 = o_p(1)$. For the estimator $\hat{\tau}$ we have the following:
\[
\hat{\tau} = \frac{1}{nT} \sum_{it} \hat{\omega}_{it} Y_{it} = \frac{1}{nT} \sum_{it} \hat{\omega}_{it} \tau_{it} W_{it} + \frac{1}{nT} \sum_{it} \hat{\omega}_{it} u_{it} = \tau_{emp} + \frac{1}{nT} \sum_{it} \hat{\omega}_{it} u_{it} = \tau_{emp} + \frac{1}{\mathbb{P}_n \frac{1}{T} \sum_{t=1}^{T} \hat{\omega}^{un}_{it} W_{it}} \left( \frac{1}{nT} \sum_{it} (\omega^\star_{it})^{un} u_{it} + \frac{1}{nT} \sum_{it} \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) u_{it} \right). \tag{A.16}
\]
By construction and assumption we have the following:
\[
\mathbb{E}\left[ \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) u_{it} \mid \{W_j, X_j\}_{j=1}^{n} \right] = \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) \mathbb{E}[u_{it} \mid \{W_j, X_j\}_{j=1}^{n}] = \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) \mathbb{E}[u_{it} \mid W_i, X_i] = 0. \tag{A.17}
\]
This implies that by the conditional Chebyshev inequality we have the following:
\[
\zeta_n(\epsilon) := \mathbb{E}\left[ \mathbb{1}\left\{ \sqrt{n} \left| \mathbb{P}_n \frac{1}{T} \sum_{t=1}^{T} \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) u_{it} \right| \ge \epsilon \right\} \,\Big|\, \{W_j, X_j\}_{j=1}^{n} \right] \le \frac{\mathbb{P}_n \mathbb{E}\left[ \left( \sum_{t=1}^{T} \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) u_{it} \right)^2 \Big| \{W_j, X_j\}_{j=1}^{n} \right]}{T^2 \epsilon^2} \le \frac{\sigma_u^2}{T \epsilon^2} \|(\omega^\star)^{un} - \hat{\omega}^{un}\|_2^2 = o_p(1). \tag{A.18}
\]
Since the indicator is a bounded function, it follows that for any $\epsilon > 0$
\[
\mathbb{E}[\zeta_n(\epsilon)] = o(1), \tag{A.19}
\]
and thus we have $\frac{1}{nT} \sum_{it} \left( \hat{\omega}^{un}_{it} - (\omega^\star_{it})^{un} \right) u_{it} = o_p\left(\frac{1}{\sqrt{n}}\right)$. Finally, we need to check that the CLT applies to $\frac{1}{nT} \sum_{it} (\omega^\star_{it})^{un} u_{it}$.
The mean of each summand is zero and the variance is bounded:
\[
\mathbb{E}\left[ \left( \frac{1}{T} \sum_{t=1}^{T} (\omega^\star_{it})^{un} u_{it} \right)^2 \right] \le \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ \left( (\omega^\star_{it})^{un} u_{it} \right)^2 \right] \le \frac{1}{T} \sum_{t=1}^{T} \sqrt{ \mathbb{E}[u_{it}^4]\, \mathbb{E}\left[ ((\omega^\star_{it})^{un})^4 \right] } < \infty. \tag{A.20}
\]
Finally, define:
\[
\omega^\star_{it} := \frac{(\omega^\star_{it})^{un}}{\mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} (\omega^\star_{it})^{un} W_{it} \right]}. \tag{A.21}
\]
It is easy to see that we have:
\[
\mathbb{P}_n \frac{1}{T} \sum_{t=1}^{T} \hat{\omega}^{un}_{it} W_{it} = \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} (\omega^\star_{it})^{un} W_{it} \right] + o_p(1), \tag{A.22}
\]
and thus we have the following:
\[
\|\omega^\star - \hat{\omega}\|_2 = o_p(1), \qquad \sqrt{n}(\hat{\tau} - \tau_{emp}) \to \mathcal{N}(0, \sigma_\tau^2), \tag{A.23}
\]
which concludes the first part.

Part 2:
In this part we prove that $\|(\omega^\star)^{un} - \hat{\omega}^{un}\|_2 = o_p(1)$, that $(\omega^\star_{it})^{un}$ is uniformly bounded, and that $\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} (\omega^\star_{it})^{un} W_{it}\right] > 0$. We use the dual representation derived in Section 4.3 and show that the solution converges to a population one.

The proof below shows that the empirical weights converge to oracle weights that solve a certain population problem. We use a natural adaptation of the "small-ball" argument from Mendelson [2014]. This is not necessary, and most likely one can construct a simpler proof using classical results for GMM estimators. We present a different argument because it can be naturally generalized to handle more sophisticated estimation procedures, something that we want to address in future work.

We start by defining the relevant oracle weights. Consider $(\{\alpha_i^\star\}_{i=1}^{n}, \gamma^\star)$ that satisfy the following restrictions:
\[
\begin{aligned}
&\xi_{it} := W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star, \\
&\mathbb{E}\left[ \sum_{t=1}^{T} \xi_{it} \psi_{it} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \right] = 0, \\
&\sum_{t=1}^{T} \xi_{it} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) = 0,
\end{aligned} \tag{A.24}
\]
where we include the time fixed effects $\lambda_t$ in the definition of $\psi_{it}$; since $T$ is fixed, this does not create any problems. We prove that oracle weights satisfying these restrictions exist in Lemma A.3.
Using these parameters, we consider a lower bound on the individual components of the loss function:
\[
\begin{aligned}
\rho_{W_{it}}(W_{it} - \alpha_i - \psi_{it}^\top \gamma)
&= (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i - \psi_{it}^\top \gamma \le 0\} \right) \\
&= (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) + (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 W_{it} \left( \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} - \mathbb{1}\{W_{it} - \alpha_i - \psi_{it}^\top \gamma \le 0\} \right) \\
&\ge (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) - (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 W_{it}\, \mathbb{1}\{\alpha_i^\star + \psi_{it}^\top \gamma^\star < W_{it} \le \alpha_i + \psi_{it}^\top \gamma\}.
\end{aligned} \tag{A.25}
\]
Using this and the properties of the oracle weights, we get the following inequality for the excess loss for unit $i$:
\[
\begin{aligned}
\sum_{t=1}^{T} &\left( \rho_{W_{it}}(W_{it} - \alpha_i - \psi_{it}^\top \gamma) - \rho_{W_{it}}(W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star) \right) \\
&\ge \sum_{t=1}^{T} \left( (\alpha_i^\star - \alpha_i) + \psi_{it}^\top (\gamma^\star - \gamma) \right)^2 \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \\
&\quad + 2 \sum_{t=1}^{T} \xi_{it} (\alpha_i^\star - \alpha_i) \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) + 2 \sum_{t=1}^{T} \xi_{it} \psi_{it}^\top (\gamma^\star - \gamma) \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \\
&\quad - \sum_{t=1}^{T} (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 W_{it}\, \mathbb{1}\{\alpha_i^\star + \psi_{it}^\top \gamma^\star < W_{it} \le \alpha_i + \psi_{it}^\top \gamma\} \\
&= \sum_{t=1}^{T} \left( (\alpha_i^\star - \alpha_i) + \psi_{it}^\top (\gamma^\star - \gamma) \right)^2 \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) + 2 \sum_{t=1}^{T} \xi_{it} \psi_{it}^\top (\gamma^\star - \gamma) \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \\
&\quad - \sum_{t=1}^{T} (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 W_{it}\, \mathbb{1}\{\alpha_i^\star + \psi_{it}^\top \gamma^\star < W_{it} \le \alpha_i + \psi_{it}^\top \gamma\}. 
\end{aligned} \tag{A.26}
\]
Note that the last equality follows by the definition of $\xi_{it}$ and $(\{\alpha_i^\star\}_{i=1}^{n}, \gamma^\star)$.

In Lemma A.2 we show that $\alpha_i^\star$ is a function of $\gamma^\star$ and the data for unit $i$:
\[
\alpha_i^\star = g(X_i, W_i, \gamma^\star), \tag{A.27}
\]
and we prove that $g$ is uniformly Lipschitz. By construction, for every $\gamma$ we only need to consider $\alpha_i$ that satisfies the following equality:
\[
\alpha_i = g(X_i, W_i, \gamma). \tag{A.28}
\]
Define:
\[
f_{it} = \alpha_i + \psi_{it}^\top \gamma, \qquad f^\star_{it} = \alpha_i^\star + \psi_{it}^\top \gamma^\star, \tag{A.29}
\]
and observe that we have the following:
\[
\mathbb{P}_n \sum_{t=1}^{T} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} < f^\star_{it}\} \right) (f_{it} - f^\star_{it})^2 \ge \mathbb{P}_n \sum_{t=1}^{T} (1 - W_{it}) (f_{it} - f^\star_{it})^2 \ge (\gamma - \gamma^\star)^\top \left( \sum_{t=1}^{T} \mathbb{P}_n\, \Gamma_{it} \Gamma_{it}^\top \right) (\gamma - \gamma^\star) \ge \kappa \|\gamma - \gamma^\star\|_2^2 + o_p(\|\gamma - \gamma^\star\|_2^2), \tag{A.30}
\]
where
\[
\Gamma_{it} := (1 - W_{it}) \left( \psi_{it} - \frac{\sum_{l=1}^{T} (1 - W_{il}) \psi_{il}}{\sum_{l=1}^{T} (1 - W_{il})} \right). \tag{A.31}
\]
Assume that $\|\gamma - \gamma^\star\|_2 = r$, which implies $|\alpha_i - \alpha_i^\star| \le C r$. The assumptions guarantee that $\psi_{it}$ is bounded, and thus $\sum_{t=1}^{T} \|f_t - f^\star_t\|_\infty \le C r$.
Using the Cauchy–Schwarz inequality we get the following:
\[
\mathbb{P}_n \sum_{t=1}^{T} \xi_{it} \psi_{it}^\top (\gamma^\star - \gamma) \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \le \|\gamma^\star - \gamma\|_2 \times \left\| \mathbb{P}_n \sum_{t=1}^{T} \xi_{it} \psi_{it} \left( 1 - W_{it}\, \mathbb{1}\{W_{it} - \alpha_i^\star - \psi_{it}^\top \gamma^\star \le 0\} \right) \right\|_2. \tag{A.32}
\]
We also have the following inequality:
\[
\mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} (W_{it} - \alpha_i - \psi_{it}^\top \gamma)^2 W_{it}\, \mathbb{1}\{\alpha_i^\star + \psi_{it}^\top \gamma^\star < W_{it} \le \alpha_i + \psi_{it}^\top \gamma\} \right] \le \mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} (f^\star_{it} - f_{it})^2\, \mathbb{1}\{f^\star_{it} < W_{it} \le f_{it}\} \right] \le \|f^\star - f\|_\infty^2 \times \mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{f^\star_{it} < W_{it} \le f_{it}\} \right], \tag{A.33}
\]
where the first inequality follows because of the indicator, and the second by Hölder's inequality. Since $\|f^\star - f\|_\infty \le C r$, we have the following:
\[
\mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{f^\star_{it} < W_{it} \le f_{it}\} \right] \le \mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + C r\} \right]. \tag{A.34}
\]
The DKW inequality implies that we have the following with high probability:
\[
\mathbb{P}_n \left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + C r\} \right] \le \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + C r\} \right] + \frac{C}{\sqrt{n}}. \tag{A.35}
\]
It is now easy to see that if $r$ is greater than $O\left(\frac{1}{\sqrt{n}}\right)$, then the excess loss is positive with high probability. Since the loss function is convex, this implies that the optimum must belong to a ball of radius $\frac{1}{\sqrt{n}}$ around $(\{\alpha_i^\star\}_{i=1}^{n}, \gamma^\star)$ with high probability, which proves that for all $t$, $\|\hat{\omega}^{(un)}_t - (\omega^\star_t)^{un}\|_2 = o_p(1)$.

Proof of Theorem 3: Part 1:
For each observation $i$ define $M_i$ as the number of times this observation is sampled in a bootstrap sample. Using this notation we can define bootstrap analogs of $\alpha_i$ and $\gamma$ from the proof of Theorem 2:
\[
\{\alpha^{(b)}_i, \gamma^{(b)}\}_{i=1}^n = \arg\min_{\{\alpha_i\}_{i=1}^n,\,\gamma}\; P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \rho_{W_{it}}(W_{it} - \alpha_i - \psi_{it}^\top\gamma)\right] \tag{A.36}
\]
If $M_i = 0$, we define $\alpha^{(b)}_i$ using the function $g(X_i, W_i, \gamma^\star)$ from Lemma A.2. It is straightforward to extend the proof of Theorem 2 and show that the bootstrap weights converge to the population ones. Most parts of the argument follow from two key properties of $\{M_i\}_{i=1}^n$:
\[
P_n[M_i X_i] = \mathbb{E}[X_i] + o_p(1), \qquad P_n[M_i \varepsilon_i] = O_p\left(\frac{1}{\sqrt{n}}\right) \tag{A.37}
\]
for any square-integrable $X_i$ and any square-integrable mean-zero $\varepsilon_i$ (all independent of $M_i$). The second claim follows by applying Chebyshev's inequality, and the first one follows from the second one. The only additional result that we need is the following:
\[
\begin{aligned}
P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\}\right]
&= P_n\left[(M_i - 1)\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\}\right] \\
&\quad + P_n\left[\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\} - \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\}\right]\right] \\
&\quad + \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\}\right] \\
&= \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{f^\star_{it} < W_{it} \le f^\star_{it} + Cr\}\right] + O_p\left(\frac{1}{\sqrt{n}}\right)
\end{aligned} \tag{A.38}
\]
where the last line follows by the DKW inequality, the fact that the relevant set of intervals is Donsker, and the fact that the multiplier process converges to the same limit process as the standard empirical one.
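Both properties in (A.37) are easy to see in simulation. The sketch below (all variable names and distributions are illustrative, not from the paper) draws $n$-out-of-$n$ bootstrap counts $M_i$ and checks that $P_n[M_i X_i]$ recenters at $\mathbb{E}[X_i]$ while $P_n[M_i \varepsilon_i]$ is of order $1/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# n-out-of-n bootstrap: M[i] = number of times unit i appears in the resample
idx = rng.integers(0, n, size=n)
M = np.bincount(idx, minlength=n)

X = 3.0 + rng.normal(size=n)    # square-integrable, E[X_i] = 3
eps = rng.normal(size=n)        # square-integrable, mean zero, independent of M

print(M.sum())                  # exactly n: the counts partition the resample
print((M * X).mean())           # P_n[M_i X_i] -> E[X_i] = 3
print((M * eps).mean())         # P_n[M_i eps_i] = O_p(1/sqrt(n)): small
```

Since $\mathbb{E}[M_i] = 1$ and $M_i$ is independent of $(X_i, \varepsilon_i)$, the second display is a mean-zero average with variance of order $1/n$, which is the Chebyshev step in the text.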
It follows that we have the convergence results:
\[
\|\omega^{(b)} - \omega^\star\|_\infty = o_p(1), \qquad \|\omega^{(b)} - \omega^\star\| = O_p\left(\frac{1}{\sqrt{n}}\right) \tag{A.39}
\]
Part 2: By construction of the bootstrap estimator we have the following representation:
\[
\begin{aligned}
\hat\tau^{(b)} - \hat\tau
&= P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}\tau_{it}W_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}\tau_{it}W_{it}\right] + P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}u_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}u_{it}\right] \\
&= P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}(\tau_{it} - \mathbb{E}[\tau_{it}])W_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}(\tau_{it} - \mathbb{E}[\tau_{it}])W_{it}\right] + P_n\left[(M_i - 1)\frac{1}{T}\sum_{t=1}^T \omega^\star_{it}u_{it}\right] + o_p\left(\frac{1}{\sqrt{n}}\right)
\end{aligned} \tag{A.40}
\]
From this representation it follows that if $\tau_{it}$ is constant, then the bootstrap estimator is consistent for the asymptotic variance of $\hat\tau$. If $\tau_{it}$ is heterogeneous, we further expand the first term. Define $\tau_t(W_i, X_i) := \mathbb{E}[\tau_{it} \mid W_i, X_i]$ and $\eta_{it} := \tau_{it} - \tau_t(W_i, X_i)$. We have the following:
\[
\begin{aligned}
&P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}\tau_{it}W_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}\tau_{it}W_{it}\right] \\
&\quad = P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}\tau_t(W_i,X_i)W_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}\tau_t(W_i,X_i)W_{it}\right] + P_n\left[M_i \frac{1}{T}\sum_{t=1}^T \omega^{(b)}_{it}\eta_{it}W_{it}\right] - P_n\left[\frac{1}{T}\sum_{t=1}^T \hat\omega_{it}\eta_{it}W_{it}\right] \\
&\quad = P_n\left[\frac{1}{T}\sum_{t=1}^T (M_i\omega^{(b)}_{it} - \hat\omega_{it})\tau_t(W_i,X_i)W_{it}\right] + P_n\left[(M_i - 1)\frac{1}{T}\sum_{t=1}^T \omega^\star_{it}\eta_{it}W_{it}\right] + o_p\left(\frac{1}{\sqrt{n}}\right)
\end{aligned} \tag{A.41}
\]
It follows that we have the following:
\[
\hat\tau^{(b)} - \hat\tau = P_n\left[(M_i - 1)\frac{1}{T}\sum_{t=1}^T \omega^\star_{it}(\eta_{it}W_{it} + u_{it})\right] + P_n\left[\frac{1}{T}\sum_{t=1}^T (M_i\omega^{(b)}_{it} - \hat\omega_{it})\tau_t(W_i,X_i)W_{it}\right] + \text{small order terms} \tag{A.42}
\]
Since the second summand is uncorrelated with the first one, the bootstrap variance is a conservative estimator of the correct variance.
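The last step can be spelled out in one line. Writing $A_n$ for the first summand in (A.42) and $B_n$ for the second (our labels, not the paper's), uncorrelatedness gives:

```latex
% Conservativeness of the bootstrap variance: Cov(A_n, B_n) = 0, so
\operatorname{Var}\!\left(\hat\tau^{(b)} - \hat\tau\right)
  = \operatorname{Var}(A_n) + 2\operatorname{Cov}(A_n, B_n) + \operatorname{Var}(B_n)
  = \operatorname{Var}(A_n) + \operatorname{Var}(B_n)
  \;\ge\; \operatorname{Var}(A_n),
```

so the bootstrap variance weakly exceeds the variance of the linear term $A_n$ that drives the asymptotic distribution of $\hat\tau$, with equality when the second summand vanishes, for example under homogeneous treatment effects.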