Low-Rank Approximations of Nonseparable Panel Models
Iván Fernández-Val†, Hugo Freeman‡, Martin Weidner‡

October 2020
Abstract
We provide estimation methods for panel nonseparable models based on low-rank factor structure approximations. The factor structures are estimated by matrix-completion methods to deal with the computational challenges of principal component analysis in the presence of missing data. We show that the resulting estimators are consistent in large panels, but suffer from approximation and shrinkage biases. We correct these biases using matching and difference-in-difference approaches. Numerical examples and an empirical application to the effect of election day registration on voter turnout in the U.S. illustrate the properties and usefulness of our methods.
∗ This paper was prepared for the Econometrics Journal Special Session on "Econometrics of Panel Data" at the Royal Economic Society 2019 Annual Conference at Warwick University. We thank Shuowen Chen, and the participants of this conference and the 25th International Panel Data Conference for comments. This research was supported by the Economic and Social Research Council through the ESRC Centre for Microdata Methods and Practice grant RES-589-28-0001, and by the European Research Council grants ERC-2014-CoG-646917-ROMIA and ERC-2018-CoG-819086-PANEDA.
† Department of Economics, Boston University, 270 Bay State Road, Boston, MA 02215-1403, USA. Email: [email protected]
‡ Department of Economics, University College London, Gower Street, London WC1E 6BT, UK, and CeMMAP. Email: [email protected], [email protected]

Nonseparable models are useful to capture multidimensional unobserved heterogeneity, which is an important feature of economic data. The presence of this heterogeneity makes the effect of covariates on the outcome of interest different for each unit due to factors that are unobservable or unavailable to the researcher. In the absence of further restrictions, a different data generating process essentially operates for each unit, which creates identification and estimation challenges. One way to deal with these challenges is the use of panel data, where each unit is observed on multiple occasions. In this paper, we develop an approach to estimate nonseparable models from panel data based on homogeneity restrictions and low-rank factor approximations. Whilst homogeneity restrictions have been used previously in this context, the application of low-rank factor approximations is more novel.

The nonseparable model that we consider includes observed discrete covariates or treatments, multidimensional unobserved individual and time effects, and idiosyncratic errors. We construct the effects of interest as averages or quantiles of potential outcomes obtained from the model by exogenously manipulating the value of the treatments. These effects are generally not identified from the observed data because the treatment assignment is usually determined by the unobserved individual and time effects. Following the previous panel literature, we impose cross-section and time-series homogeneity restrictions to identify the effects of interest; see, e.g., Chamberlain (1982); Manski (1987); Honoré (1992); Evdokimov (2010); Graham and Powell (2012); Hoderlein and White (2012); Chernozhukov, Fernández-Val, Hahn and Newey (2013).

The estimation of the nonseparable model is challenging due to the presence of the multidimensional unobserved individual and time effects. We cannot just exclude these effects because they are endogenous, i.e., related to the treatments. We deal with this problem by approximating their effect with a low-rank factor structure. This approach can be interpreted as a series or sieve approximation on the unobservables. We characterize the error of this approximation in terms of the functional singular value decomposition of the expectation of the outcome conditional on the treatment and unobserved effects. For smooth conditional expectation functions, the mean square of the approximation error vanishes with the rank of the factor structure at a polynomial rate.

We develop an estimator of the low-rank factor approximation in the case where the covariate of interest is binary.
This is an empirically relevant case as it covers the treatment effect model for panel data. We also show how to extend the model to include additive controls and fixed effects. Here, we rely on the analogy between the estimation of treatment effects and the matrix completion problem previously noted by Athey, Bayati, Doudchenko, Imbens and Khosravi (2017) and Amjad, Shah and Shen (2018). Thus, given that the principal components program is combinatorially hard in the presence of missing data, we consider the convex relaxation of this program that replaces a constraint on the rank of a matrix by a constraint on its nuclear norm, following Srebro and Jaakkola (2003) and Fazel (2003). The resulting estimator is the matrix-completion estimator.

The main theoretical result of the paper is to show that the matrix-completion estimator is consistent under asymptotic sequences where the two dimensions of the panel grow to infinity at the same rate. This result does not follow from the existing matrix completion literature, which assumes that the matrix to be completed has low rank. In our case, the underlying matrix of interest can have full rank, but we impose appropriate smoothness assumptions on the data generating process that guarantee that the singular values of the matrix form a rapidly decreasing sequence. This allows a low-rank approximation, and it also implies a bound on the nuclear norm of the matrix. Our consistency proof for the matrix completion estimator therefore crucially relies on the bound on the nuclear norm, but does not impose any low-rank conditions. Our proof strategy also avoids the high-level restricted strong convexity assumption (see e.g. Negahban and Wainwright 2012). We instead provide interpretable conditions on the underlying process of the observable and unobservable variables directly.

The matrix-completion estimator is consistent, but can be biased in small samples. This bias comes from two different sources: approximation bias due to the low-rank factor structure approximation, and shrinkage bias due to the nuclear norm regularization of the principal component analysis program (Cai, Candès and Shen, 2010; Ma, Goldfarb and Chen, 2011; Bai and Ng, 2019b). We propose matching approaches to debias the estimator. For each treatment level, the simplest approach consists of finding the observation in the other treatment level that is the closest in terms of the estimated factor structure. We also propose a two-way matching procedure that combines matching with a difference-in-differences approach. The two-way procedure is related to several recent proposals such as the matching approach of Imai and Kim (2019) to estimate causal effects from panel data and the blind regression of Li, Shah, Song and Yu (2017) for matrix completion. The difference with these proposals is in the information used to match the observations. Imai and Kim (2019) use the treatment variable and Li, Shah, Song and Yu (2017) the outcome, whereas we use the estimated factor structure. In this sense, the estimation of the factor structure can be seen as a preliminary de-noising step of the data (Chatterjee, 2015). Amjad, Shah and Shen (2018) proposed a similar debiasing procedure based on the estimated factor structure, but they rely on synthetic control methods instead of matching.

We illustrate our methods with an empirical application to the effect of election day registration (EDR) on voter turnout and numerical simulations. We estimate average and quantile effects using a state-level panel dataset on the 24 U.S.
presidential elections between 1920 and 2012 collected by Xu (2017). We find that, after controlling for possible non-random adoption, EDR has a positive effect, especially at the bottom of the voter turnout distribution. Our methods uncover stronger effects than standard difference-in-difference methods that rely on restrictive parallel trend assumptions. The simulation results show that our theoretical results provide a good representation of the behavior of the estimators in small samples.

The rest of the paper is organized as follows. Section 2 describes the model and effects of interest. Section 3 introduces the low-rank factor approximation and derives the properties of its matrix-completion estimator. The matching methods to debias the matrix-completion estimator are discussed in Section 4. Section 5 reports the results of the numerical examples. All the proofs of the theoretical results are gathered in the Appendix.

Throughout this paper we consider the following nonseparable and nonparametric panel data model:
Assumption 1 (Model).
$$Y_{it} = g(X_{it}, A_i, B_t, U_{it}), \qquad i \in \mathcal{N} = \{1, \ldots, N\}, \quad t \in \mathcal{T} = \{1, \ldots, T\}, \quad (1)$$
where i and t index individual units and time periods, respectively; $Y_{it}$ is an observed outcome or response variable with support $\mathcal{Y} \subseteq \mathbb{R}$; g is an unknown function; $X_{it}$ is a vector of observed covariates or treatments with support $\mathcal{X} \subseteq \mathbb{R}^{d_x}$; $A_i$ and $B_t$ are vectors of individual and time unobserved effects, possibly correlated with $X_{it}$, with supports $\mathcal{A} \subseteq \mathbb{R}^{d_a}$ and $\mathcal{B} \subseteq \mathbb{R}^{d_b}$, respectively; and $U_{it}$ is a vector of unobserved error terms of unspecified dimension, for which we assume that
$$U_{it} \stackrel{d}{=} U_{js} \mid X^{NT}, A^N, B^T, \qquad \text{for all } i, j \in \mathcal{N},\ t, s \in \mathcal{T}, \quad (2)$$
and
$$U_{it} \perp\!\!\!\perp X_{js} \mid A^N, B^T, \qquad \text{for all } i, j \in \mathcal{N},\ t, s \in \mathcal{T}, \quad (3)$$
where $X^{NT} = \{X_{it} : i \in \mathcal{N}, t \in \mathcal{T}\}$, $A^N = \{A_i : i \in \mathcal{N}\}$, $B^T = \{B_t : t \in \mathcal{T}\}$, and $\perp\!\!\!\perp$ denotes stochastic independence.

This model can be motivated from a purely statistical perspective as a latent variable model using the Aldous-Hoover representation for exchangeable random matrices, e.g. Xu, Massoulié and Lelarge (2014), Chatterjee (2015), Orbanz and Roy (2015), and Li and Bell (2017). (In the Aldous-Hoover representation, $A_i$, $B_t$ and $U_{it}$ are independent uniform random variables.) We motivate it instead as a structural model where the unobserved effects $A_i$ and $B_t$ are associated with individual heterogeneity and aggregate shocks, respectively. Additional exogenous covariates can be incorporated in the usual way by carrying out the analysis conditional on them.

The main restriction imposed by Assumption 1 is the unit and time homogeneity in (2). A sufficient condition for unit homogeneity is that the observations are identically distributed across i, which is a common sampling assumption for panel data. Time homogeneity has also been commonly used in panel data models (Chamberlain, 1982; Manski, 1987; Honoré, 1992; Evdokimov, 2010; Graham and Powell, 2012; Hoderlein and White, 2012; Chernozhukov, Fernández-Val, Hahn and Newey, 2013). It implies that time is randomly assigned, conditional on covariates and unobserved effects. The additional restriction in (3) is an exogeneity condition of $X_{js}$ with respect to $U_{it}$. Given (2), it is a mild condition as time homogeneity already imposes that any relationship between $U_{it}$ and $X_{js}$ can only be unit and time-invariant. Taken together, (2) and (3) impose that $U_{it} \stackrel{d}{=} U_{js} \mid A^N, B^T$, for all $i, j \in \mathcal{N}$, $t, s \in \mathcal{T}$.

The model considered is similar to the static model in Chernozhukov, Fernández-Val, Hahn and Newey (2013), but there are three important differences. First, the structural function g has time effects as arguments and therefore allows the relationship between $Y_{it}$ and $X_{it}$ to vary over time in an unrestricted fashion even under (2). (Note that our model allows for g to depend on t because the dimension of $B_t$ is unspecified.) For example, it can include location and scale time effects. Second, Chernozhukov, Fernández-Val, Hahn and Newey (2013) impose that $Y_{it}$ and $X_{it}$ are identically distributed across i, which is stronger than the unit homogeneity in (2). Thus, unit homogeneity is conditional on the treatments and unobserved effects and therefore does not restrict the treatment assignment process. Third, they analyze short panels, whereas we rely on large T for identification. Our model also encompasses the nonseparable model with time effects of the form $Y_{it} = g_t(X_{it}, A_i^{\mathrm T} B_t + U_{it})$. We provide more examples of models covered by Assumption 1 below.

The structural function g is generally not identified, but can be used to construct interesting effects.
Let $Y_{it}(x) := g(x, A_i, B_t, U_{it}(x))$ be the potential outcome for individual i at time t, obtained by setting exogenously $X_{it} = x \in \mathcal{X}$ and drawing $U_{it}(x) \stackrel{d}{=} U_{it} \mid A^N, B^T$, where we impose rank similarity on $U_{it}(x)$ across the values of $x \in \mathcal{X}$. The main effects of interest are the average structural functions (ASFs)
$$\mu_t(x) := \frac{1}{N} \sum_{i=1}^N E\big[Y_{it}(x) \mid A^N, B^T\big], \qquad \mu(x) := \frac{1}{T} \sum_{t=1}^T \mu_t(x), \quad (4)$$
and the conditional average structural functions (CASFs)
$$\mu_t(x \mid \mathcal{X}_0) := \frac{1}{N_t(\mathcal{X}_0)} \sum_{i=1}^N 1\{X_{it} \in \mathcal{X}_0\}\, E\big[Y_{it}(x) \mid A^N, B^T\big], \qquad N_t(\mathcal{X}_0) = \sum_{i=1}^N 1\{X_{it} \in \mathcal{X}_0\},$$
$$\mu(x \mid \mathcal{X}_0) := \frac{1}{n(\mathcal{X}_0)} \sum_{t=1}^T N_t(\mathcal{X}_0)\, \mu_t(x \mid \mathcal{X}_0), \qquad n(\mathcal{X}_0) = \sum_{t=1}^T N_t(\mathcal{X}_0), \quad (5)$$
where $\mathcal{X}_0 \subseteq \mathcal{X}$. The ASFs and CASFs correspond to averages of the potential outcome $Y_{it}(x)$ at a given time period or aggregated over the observed time periods. In both cases the average is over the cross-sectional units in the observed sample or finite population. Infinite-population versions of the effects can be obtained by taking probability limits as $N \to \infty$. If $X_{it}$ includes only a binary treatment, the ASFs and CASFs can be used to form treatment effects. For example, $\mu(1) - \mu(0)$ is the time-aggregated average treatment effect and $\mu_t(1 \mid 1) - \mu_t(0 \mid 1)$ is the average treatment effect on the treated at time t. Distribution structural functions (DSFs) can be constructed analogously, replacing $Y_{it}(x)$ by $1\{Y_{it}(x) \le y\}$ in (4) and (5) for $y \in \mathcal{Y}$. Quantile effects can then be formed by taking left-inverses of the DSFs and taking differences. For example, the τ-quantile treatment effect at time t is $q_{t,\tau}(1) - q_{t,\tau}(0)$, where
$$q_{t,\tau}(x) = \inf\bigg\{ y \in \mathcal{Y} : \frac{1}{N} \sum_{i=1}^N E\big[ 1\{Y_{it}(x) \le y\} \mid A^N, B^T \big] \ge \tau \bigg\}.$$

We provide some examples of data generating processes that satisfy Assumption 1. The purpose is to show that Assumption 1 covers a great variety of models commonly used in empirical analysis. Our estimation methods are generic in that we do not need
to specify the data generating process beyond requiring that it satisfies Assumption 1. Of course, using more information about the data generating process would lead to more efficient estimators, but at the cost of robustness to model misspecification.
Example 1 (Linear factor model). Consider the linear panel model with factor structure in the error terms:
$$Y_{it}(x) = x^{\mathrm T} \beta + \lambda_i^{\mathrm T} f_t + \sigma_i(x)\, \sigma_t(x)\, U_{it}(x), \qquad U_{it}(x) \mid X^{NT}, A^N, B^T \sim \text{i.i.d. } F_U,$$
where $U_{it}(x)$ is a zero-mean random variable with marginal distribution $F_U$, which does not depend on x. This is a special case of Assumption 1 with $Y_{it} = Y_{it}(X_{it})$, $A_i = (\lambda_i, \{\sigma_i(x) : x \in \mathcal{X}\})$, $B_t = (f_t, \{\sigma_t(x) : x \in \mathcal{X}\})$, and $U_{it} = U_{it}(X_{it})$. The average effect of changing the covariate from $x_0$ to $x_1$ at t is
$$\mu_t(x_1) - \mu_t(x_0) = \mu_t(x_1 \mid x_0) - \mu_t(x_0 \mid x_0) = (x_1 - x_0)^{\mathrm T} \beta.$$
A version of this model was considered by Kim and Oka (2014) to analyze the effect of unilateral divorce laws on divorce rates in the U.S. This model encompasses the standard difference-in-difference model, $Y_{it}(x) = x^{\mathrm T}\beta + \lambda_i + f_t + \sigma_i(x)\,\sigma_t(x)\, U_{it}(x)$, by setting $\lambda_i = (\lambda_i, 1)^{\mathrm T}$ and $f_t = (1, f_t)^{\mathrm T}$.

Example 2 (Binary response model). Assume that the potential outcome $Y_{it}(x)$ is binary and generated by
$$Y_{it}(x) = 1\{ m(x, A_i, B_t) \ge U_{it}(x) \}, \qquad U_{it}(x) \mid X^{NT}, A^N, B^T \sim \text{i.i.d. } U(0, 1),$$
for some unknown function m. Here, assuming that $U_{it}(x)$ is uniform is a normalization, since m can be arbitrary. This nonparametric single index model with unobserved effects is a special case of Assumption 1 with $Y_{it} = Y_{it}(X_{it})$ and $U_{it} = U_{it}(X_{it})$. The ASF at x and t is
$$\mu_t(x) = \frac{1}{N} \sum_{i=1}^N m(x, A_i, B_t).$$
Similar single index models for count or censored responses are also covered by Assumption 1.
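To make Example 2 concrete, the sketch below simulates the binary response model and evaluates the ASF directly from the single-index function. It is a minimal illustration only: the logistic form of m, the dimensions, and the parameter values are hypothetical choices, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 50

def m(x, a, b):
    # Hypothetical single-index function; Example 2 allows any m.
    return 1.0 / (1.0 + np.exp(-(0.5 * x + a + b - 1.0)))

A = rng.uniform(size=N)        # individual effects A_i
B = rng.uniform(size=T)        # time effects B_t
U = rng.uniform(size=(N, T))   # U_it(x) ~ U(0,1); using the same draw for both
                               # treatment levels imposes rank invariance

# Potential outcomes Y_it(x) = 1{ m(x, A_i, B_t) >= U_it(x) } for x in {0, 1}
Y0 = (m(0, A[:, None], B[None, :]) >= U).astype(float)
Y1 = (m(1, A[:, None], B[None, :]) >= U).astype(float)

# ASF at period t: mu_t(x) = N^{-1} * sum_i m(x, A_i, B_t)
t = 0
print("mu_t(1) - mu_t(0) =", m(1, A, B[t]).mean() - m(0, A, B[t]).mean())
```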
Example 3 (Treatment effect factor model). Assume that $X_{it}$ contains only a binary treatment indicator, i.e., $\mathcal{X} = \{0, 1\}$. The potential outcomes are generated by the linear factor model
$$Y_{it}(x) = \lambda_i(x)^{\mathrm T} f_t(x) + \sigma_i(x)\,\sigma_t(x)\, U_{it}(x), \qquad U_{it}(x) \mid X^{NT}, A^N, B^T \sim \text{i.i.d. } F_U, \quad x \in \mathcal{X},$$
where $U_{it}(x)$ is a zero-mean random variable with marginal distribution $F_U$, which does not depend on x. This is a special case of Assumption 1 with $Y_{it} = Y_{it}(X_{it})$, $A_i = (\{\lambda_i(x), \sigma_i(x) : x \in \mathcal{X}\})$, $B_t = (\{f_t(x), \sigma_t(x) : x \in \mathcal{X}\})$, and $U_{it} = U_{it}(X_{it})$. The average treatment effect at t is
$$\mu_t(1) - \mu_t(0) = \frac{1}{N} \sum_{i=1}^N \big[\lambda_i(1)^{\mathrm T} f_t(1) - \lambda_i(0)^{\mathrm T} f_t(0)\big],$$
and the average effect on the treated at t is
$$\mu_t(1 \mid 1) - \mu_t(0 \mid 1) = \frac{1}{N_t(1)} \sum_{i=1}^N 1\{X_{it} = 1\}\,\big[\lambda_i(1)^{\mathrm T} f_t(1) - \lambda_i(0)^{\mathrm T} f_t(0)\big],$$
provided that $N_t(1) = \sum_{i=1}^N 1\{X_{it} = 1\} > 0$. Versions of this model have been considered by Hsiao, Steve Ching and Ki Wan (2012), Gobillon and Magnac (2016), Athey, Bayati, Doudchenko, Imbens and Khosravi (2017), Li and Bell (2017), Xu (2017), Li (2018), Bai and Ng (2019a), Xiong and Pelger (2019), and Chan and Kwok (2020). Example 1 is a special case with $\lambda_i(x)^{\mathrm T} f_t(x) = x^{\mathrm T}\beta + \lambda_i^{\mathrm T} f_t$.

Throughout this paper we use standard panel data notation, with the two panel dimensions being denoted by units i and time t. However, one could also consider pseudo-panel or network applications of our results, where the two panel dimensions are denoted by i and j, and $Y_{ij}$ could, for example, be the wage of worker i in firm j, the consumption of member i in household j, a friendship indicator between individuals i and j, or the volume of trade from country i to country j. The existing literature on two-way heterogeneity in network models usually either makes stronger parametric assumptions than we impose here (e.g. Graham 2017, Dzemski 2019, Chen, Fernández-Val and Weidner 2020, Zeleneev 2020) or uses stochastic blockmodels or graphon models, which typically ignore the effect of covariates (e.g. Holland, Laskey and Leinhardt 1983, Wolfe and Olhede 2013, Gao, Lu, Zhou et al. 2015, Auerbach 2019). Our methods for estimating nonparametric models with two-way heterogeneity may therefore also be of interest in a network context.

A natural starting point to estimate the effects in (4) and (5) is to use empirical analogs. This amounts to replacing $E[Y_{it}(x) \mid A^N, B^T]$ by an estimator. There are two complications with this approach. First, the potential outcome $Y_{it}(x)$ is not observable. We deal with this complication by noting that under Assumption 1,
$$E\big[Y_{it}(x) \mid A^N, B^T\big] = E[Y_{it} \mid X_{it} = x, A_i, B_t],$$
so that we can write the expectation of the potential outcome as an expectation of the observed outcome. The second complication is that $A_i$ and $B_t$ are not observable, so that we cannot directly estimate $E[Y_{it} \mid X_{it} = x, A_i, B_t]$. To deal with this complication, we start by noticing that
$$E[Y_{it} \mid X_{it} = x, A_i = a, B_t = b] =: m(x, a, b), \quad (6)$$
where the function m does not vary with i and t, by Assumption 1. We show next how this function can be approximated and estimated using a low-rank factor structure.

For ease of exposition we assume that the regressor domain $\mathcal{X}$ is finite in the rest of the paper. Accordingly, we denote the corresponding discrete covariate and its values by the scalars $X_{it}$ and x instead of the vector notation. For most of the analysis, we will focus on the binary treatment case where $\mathcal{X} = \{0, 1\}$.

The approximation that we propose is based on the singular value decomposition of the function $(a, b) \mapsto m(x, a, b)$ for each $x \in \mathcal{X}$. We make two assumptions on this decomposition. The first assumption is a sampling condition on the unobserved effects that will be useful to define a norm for the eigenfunctions.

Assumption 2 (Sampling of $A_i$ and $B_t$). (i) $A_i$ is independent and identically distributed across $i \in \mathcal{N}$, (ii) $B_t$ is independent and identically distributed over $t \in \mathcal{T}$, and (iii) $A_i$ and $B_t$ are independent for all i, t.
For simplicity we consider the case where both $A_i$ and $B_t$ are independently distributed across i and over t, but since we consider asymptotic sequences where both N and T become large, one could also allow for appropriate weak dependence across both i and t. Formalizing this weak dependence would complicate both the assumption and the proof of the following results, which is why we decided to stick to independence in our presentation here.

The next assumption imposes smoothness on $(a, b) \mapsto m(x, a, b)$.

Assumption 3 (Smoothness of $(a, b) \mapsto m(x, a, b)$). Let
$$m(x, a, b) = \sum_{j=1}^\infty s_j(x)\, u_j(x, a)\, v_j(x, b) \quad (7)$$
be the functional singular value decomposition of $m(x, a, b)$. We assume that $\sum_{j=1}^\infty s_j(x) < \infty$.

Assumption 3 is a high-level condition on the singular values of the function $m(x, a, b)$, which are defined by the decomposition (7). In this functional singular value decomposition the eigenfunctions $u_j(x, a) \in \mathbb{R}$ and $v_j(x, b) \in \mathbb{R}$ are normalized as $E\, u_j(x, A_i)^2 = 1$ and $E\, v_j(x, B_t)^2 = 1$, and they also satisfy the orthogonality conditions $E\, u_j(x, A_i)\, u_k(x, A_i) = 0$ and $E\, v_j(x, B_t)\, v_k(x, B_t) = 0$, for $j \ne k$. The singular values are sorted such that $s_1(x) \ge s_2(x) \ge s_3(x) \ge \ldots \ge 0$, and typically $s_j(x) \lesssim j^{-\alpha}$, where the decay coefficient α depends on the dimensions of the arguments a, b, and on the smoothness of $(a, b) \mapsto m(x, a, b)$. For sufficiently smooth functions, $\alpha > 1$, which guarantees $\sum_{j=1}^\infty s_j(x) < \infty$. For example, if $(a, b) \mapsto m(x, a, b)$ is continuously differentiable up to order s and $\mathcal{A}$ and $\mathcal{B}$ are compact, then $s_j(x) \lesssim j^{-s/(d_a \wedge d_b)}$, by Theorem 3.3 of Griebel and Harbrecht (2013), where $d_a \wedge d_b$ is the minimum of $d_a$ and $d_b$. This implies that $\sum_{j=1}^\infty s_j(x) < \infty$ if $s > d_a \wedge d_b$. Assumption 3 is therefore a high-level smoothness assumption on $(a, b) \mapsto m(x, a, b)$.

The formulation of this smoothness assumption is convenient for our purposes, because it immediately leads to a low-rank approximation of $m(x, a, b)$. The low-rank approximation truncates the singular value decomposition at the first R elements,
$$m(x, a, b) = \sum_{j=1}^\infty \underbrace{s_j(x)^{1/2}\, u_j(x, a)}_{=: \phi_j(x, a)}\, \underbrace{s_j(x)^{1/2}\, v_j(x, b)}_{=: \psi_j(x, b)} = \sum_{j=1}^R \phi_j(x, a)\, \psi_j(x, b) + \zeta_R(x, a, b). \quad (8)$$
The first term is the approximation and the second term is the approximation error. Under Assumption 3, $\zeta_R(x, a, b) \to 0$ as $R \to \infty$. In other words, the approximation error can be made negligible by increasing the truncation point R.
For example, if $s_j(x) \lesssim j^{-\alpha}$ with $\alpha > 1$, then
$$E\big[\zeta_R(X_{it}, A_i, B_t)^2\big] = E\bigg[\bigg(\sum_{j=R+1}^\infty s_j(X_{it})\, u_j(X_{it}, A_i)\, v_j(X_{it}, B_t)\bigg)^2\bigg] \lesssim R^{1-2\alpha}.$$
Hence, $\zeta_R(X_{it}, A_i, B_t)$ converges in mean square to zero.

Combining (6) and (8), we obtain the approximate factor model
$$Y_{it} = \lambda_i(X_{it})^{\mathrm T} f_t(X_{it}) + \zeta_R(X_{it}, A_i, B_t) + E_{it}, \qquad E_{it} := Y_{it} - E[Y_{it} \mid X_{it}, A_i, B_t], \quad (9)$$
where $\lambda_i(x) = [\phi_1(x, A_i), \ldots, \phi_R(x, A_i)]^{\mathrm T}$, $f_t(x) = [\psi_1(x, B_t), \ldots, \psi_R(x, B_t)]^{\mathrm T}$, and the composite error $\nu_{it} := \zeta_R(X_{it}, A_i, B_t) + E_{it}$ contains the approximation error, $\zeta_R(X_{it}, A_i, B_t)$, and the conditional expectation error, $E_{it}$. The factor structure $\lambda_i(X_{it})^{\mathrm T} f_t(X_{it})$ can be seen as a series approximation on unobserved individual and time effects to the function $m(X_{it}, A_i, B_t)$ if we let $R = R_{N,T}$ grow with N and T such that $\zeta_R(X_{it}, A_i, B_t)$ vanishes as $N, T \to \infty$. The factor structure approximation is exact in some cases for fixed R. For instance, in Example 3, $m(X_{it}, A_i, B_t) = \lambda_i(X_{it})^{\mathrm T} f_t(X_{it})$, so that $\zeta_R(X_{it}, A_i, B_t) = 0$ a.s. if R is greater than or equal to the number of factors.

In the model (9) the factor structure changes with the treatment level. In other words, we have a different pure factor model for each $x \in \mathcal{X}$, that is,
$$Y_{it} = \lambda_i(x)^{\mathrm T} f_t(x) + \nu_{it} \quad \text{if } X_{it} = x.$$
This observation leads to our first estimation strategy, where the data is partitioned by the treatment level and separate factors and factor loadings are estimated in each element of the partition by solving the least squares program
$$\min_{\{\lambda_i\}_{i=1}^N, \{f_t\}_{t=1}^T}\ \sum_{i=1}^N \sum_{t=1}^T D_{it}(x)\, (Y_{it} - \lambda_i^{\mathrm T} f_t)^2, \quad (10)$$
where $D_{it}(x) := 1\{X_{it} = x\}$. Unfortunately, we cannot solve this problem using standard principal component analysis due to the presence of missing data, that is, each observational unit (i, t) is not available at all treatment levels. In the next section, we apply matrix completion methods to deal with this problem.

3.2 Estimation by matrix completion methods

We start by expressing the program (10) in matrix form. Let $\Gamma^R(x) = \lambda^N(x)\, f^T(x)^{\mathrm T}$, where $\lambda^N(x) = [\lambda_1(x), \ldots, \lambda_N(x)]^{\mathrm T}$, an $N \times R$ matrix of factor loadings, and $f^T(x) = [f_1(x), \ldots, f_T(x)]^{\mathrm T}$, a $T \times R$ matrix of factors. The least squares estimator of $\Gamma^R(x)$ is the $N \times T$ matrix $\Gamma$ with typical element $\Gamma_{it}$ that solves
$$\min_{\{\Gamma \in \mathbb{R}^{N \times T} :\ \operatorname{rank}(\Gamma) \le R\}}\ \sum_{i=1}^N \sum_{t=1}^T D_{it}(x)\, (Y_{it} - \Gamma_{it})^2.$$
Let $Y(x)$ be an $N \times T$ matrix whose (i, t) element is $Y_{it}$ if $X_{it} = x$ and is missing otherwise. The previous program is closely related to the problem of completing the missing entries of $Y(x)$ using a low rank approximation matrix $\Gamma^R(x)$ (Rennie and Srebro, 2005; Candès and Recht, 2009; Candès and Tao, 2010). This connection was previously noticed by Athey, Bayati, Doudchenko, Imbens and Khosravi (2017) and Amjad, Shah and Shen (2018) in the context of treatment effects models. The solution is the $N \times T$ matrix of rank R whose entries are the closest in the mean squared error sense to the corresponding entries of $Y(x)$.

The previous program is combinatorially hard because of the constraint on the rank of the matrix (Srebro and Jaakkola, 2003).
Following Fazel (2003) we consider the convex relaxation of the program,
$$\min_{\{\Gamma \in \mathbb{R}^{N \times T} :\ \|\Gamma\|_* \le \bar R\}}\ \sum_{i=1}^N \sum_{t=1}^T D_{it}(x)\, (Y_{it} - \Gamma_{it})^2,$$
where $\|\Gamma\|_*$ is the nuclear norm of Γ and $\bar R$ is a positive constant such that $\bar R = f(R)$, where f is an increasing function. Hence, $\zeta_R(x, A_i, B_t)$ vanishes as $\bar R \to \infty$. We replace the rank constraint, $\operatorname{rank}(\Gamma) \le R$, by a constraint on the nuclear norm of the matrix, $\|\Gamma\|_* \le \bar R$, i.e. we replace a constraint on the number of nonzero singular values by a constraint on the sum of singular values. This program is convex in Γ and can be reformulated in Lagrange form as
$$\min_{\Gamma \in \mathbb{R}^{N \times T}}\ \sum_{i=1}^N \sum_{t=1}^T D_{it}(x)\, (Y_{it} - \Gamma_{it})^2 + \rho(\bar R)\, \|\Gamma\|_*, \quad (11)$$
where $\rho(\bar R) \ge 0$. There exist efficient algorithms to solve this program (Mazumder, Hastie and Tibshirani, 2010).

Let $\hat\Gamma(x)$ be a solution to (11) with typical element $\hat\Gamma_{it}(x)$. Then, we can form estimators of the ASF and CASF as
$$\hat\mu_t(x) = \frac{1}{N} \sum_{i=1}^N \Big[ D_{it}(x)\, Y_{it} + \{1 - D_{it}(x)\}\, \hat\Gamma_{it}(x) \Big],$$
and
$$\hat\mu_t(x \mid \tilde x) = \frac{\sum_{i=1}^N D_{it}(\tilde x) \big[ D_{it}(x)\, Y_{it} + \{1 - D_{it}(x)\}\, \hat\Gamma_{it}(x) \big]}{\sum_{i=1}^N D_{it}(\tilde x)}.$$
In the next section, we provide conditions under which these estimators are consistent using asymptotic sequences where $N, T \to \infty$. These estimators, however, might display shrinkage biases in finite samples due to the nuclear norm regularization (Cai, Candès and Shen, 2010; Ma, Goldfarb and Chen, 2011; Bai and Ng, 2019b). We propose two matching procedures to debias the estimator in Section 4.
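A minimal sketch of one such algorithm — the soft-impute iteration of Mazumder, Hastie and Tibshirani (2010), which alternates between filling the missing entries with the current fit and soft-thresholding the singular values — applied to the nuclear-norm-penalized least squares criterion. The starting value, stopping rule, and the 1/2-weighted scaling of the criterion (matching (13) below) are implementation choices, not prescriptions from the paper.

```python
import numpy as np

def svd_soft_threshold(M, rho):
    """Shrink all singular values of M by rho (the proximal map of rho * ||.||_*)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - rho, 0.0)) @ Vt

def matrix_completion(Y, D, rho, n_iter=500, tol=1e-7):
    """Minimize (1/2) * sum_{(i,t): D_it=1} (Y_it - G_it)^2 + rho * ||G||_* .

    Y : N x T outcomes (entries with D_it = 0 are never used)
    D : N x T binary mask, D_it = 1{X_it = x}
    """
    G = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        filled = np.where(D == 1, Y, G)   # observed entries, else current fit
        G_new = svd_soft_threshold(filled, rho)
        if np.linalg.norm(G_new - G) <= tol * max(1.0, np.linalg.norm(G)):
            return G_new
        G = G_new
    return G

# ASF estimate: observed outcomes where D_it = 1, fitted values elsewhere.
# G_hat = matrix_completion(Y, D, rho)
# mu_hat_t = np.where(D == 1, Y, G_hat).mean(axis=0)   # one value per period t
```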
Let $\Gamma^\infty(x)$ be the $N \times T$ matrix with typical element $\Gamma^\infty_{it}(x) = m(x, A_i, B_t)$ and $E(x)$ be the $N \times T$ matrix with typical element
$$E_{it}(x) := \begin{cases} E_{it} = Y_{it} - \Gamma^\infty_{it}(x) & \text{if } X_{it} = x, \\ 0 & \text{otherwise.} \end{cases} \quad (12)$$
Note that $\Gamma^\infty(x) = \lim_{R \to \infty} \Gamma^R(x)$ a.s. Furthermore, we introduce the notation $D(x) = \{(i, t) \in \mathcal{N} \times \mathcal{T} : X_{it} = x\}$, and $n(x) = |D(x)|$ for the number of observations with $X_{it} = x$. We assume $x \in \mathcal{X}$ throughout. Recall that
$$\hat\Gamma(x) \in \operatorname*{argmin}_{\Gamma \in \mathbb{R}^{N \times T}} Q_{NT}(\Gamma, \rho, x), \qquad Q_{NT}(\Gamma, \rho, x) = \frac{1}{2} \sum_{(i,t) \in D(x)} (Y_{it} - \Gamma_{it})^2 + \rho\, \|\Gamma\|_*, \quad (13)$$
where $\rho := \rho(\bar R)$. Here, if the argmin over $\Gamma \in \mathbb{R}^{N \times T}$ is not unique, then we can choose $\hat\Gamma(x)$ arbitrarily from the set of minimizers — our results are not affected by that; we only require that $Q_{NT}(\hat\Gamma(x), \rho, x) \le Q_{NT}(\Gamma, \rho, x)$, for all $\Gamma \in \mathbb{R}^{N \times T}$. We want to show that $\hat\Gamma(x)$ converges to $\Gamma^\infty(x)$ as $N, T \to \infty$ in some sense such that $\hat\mu(x) - \mu(x) = o_P(1)$. For that we require additional assumptions.

Assumption 4 (Error Moments). Conditional on $X^{NT}$, $A^N$ and $B^T$, $E_{it}(x)$ is independent across $(i, t) \in D(x)$, and there exists a constant $b < \infty$ that does not depend on i, t, N, T, such that
$$E\big[E_{it}(x)^4 \mid A^N, B^T, X^{NT}\big] \le b.$$
Furthermore, we assume that $n(x)^{-1} \sum_{(i,t) \in D(x)} \Gamma^\infty_{it}(x)^2 = O_P(1)$.

Assumption 4 could equivalently be replaced by the two high-level conditions
$$\frac{2}{n(x)} \sum_{(i,t) \in D(x)} \Gamma^\infty_{it}(x)\, E_{it} = o_P(1), \qquad \|E(x)\|_\infty = O_P\big(\sqrt{N+T}\big).$$
The first of those conditions is implied by Assumption 4 through application of the weak law of large numbers, while the second follows, for example, by the spectral norm inequality in Latała (2005). In principle, we could still derive those high-level conditions if we allowed for appropriate weak dependence of $E_{it}(x)$ across i and over t, but we again focus on the independent case for simplicity of presentation.

We first provide a consistency result for the entries of $\hat\Gamma(x)$ that correspond to the observed values of $Y(x)$.

Lemma 1. Let Assumptions 2, 3 and 4 hold, and assume that $\rho = \rho_{NT}$ is chosen such that $\rho_{NT}/\sqrt{N+T} \to \infty$ and $\rho_{NT}\sqrt{NT}/n(x) \to 0$ as $N, T \to \infty$. Then,
$$\frac{1}{n(x)} \sum_{(i,t) \in D(x)} \big[\hat\Gamma_{it}(x) - \Gamma^\infty_{it}(x)\big]^2 = o_P(1).$$

A necessary condition for the existence of the sequence $\rho = \rho_{NT}$ in Lemma 1 is $n(x)/\sqrt{(N+T)\,NT} \to \infty$; that is, the fraction $n(x)/(NT)$ of observations with $X_{it} = x$ can converge to zero, but not too fast. Apart from that, Lemma 1 does not restrict the assignment process that determines $X^{NT}$. Notice also that Lemma 1 does not require Assumption 1, because $\Gamma^\infty(x)$ is a reduced-form parameter.

Lemma 1 is not directly useful to show the consistency of the estimators of the ASF, because it only guarantees $\ell_2$-consistency of $\hat\Gamma(x)$ over the set of entries (i, t) for which $X_{it} = x$. Those are exactly the observations for which an unbiased estimator of $\Gamma^\infty_{it}(x) = m(x, A_i, B_t)$ is already available, namely $Y_{it}$. The consistency result we would like to obtain is
$$\frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T \big[\hat\Gamma_{it}(x) - \Gamma^\infty_{it}(x)\big]^2 = o_P(1), \quad (14)$$
but such a result will certainly require stronger assumptions on $X^{NT}$ than we have imposed so far. The existing literature on matrix completion relies on the concept of restricted strong convexity to derive (14). This approach shows that under certain conditions on a matrix M with entries $M_{it}$, and on $X^{NT}$ (which determines the set $D(x)$), there exists a constant $c > 0$ such that
$$\frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T M_{it}^2 \le \frac{c}{n(x)} \sum_{(i,t) \in D(x)} M_{it}^2.$$
See Theorem 1 in Negahban and Wainwright (2012), Lemma 12 in Klopp et al. (2014), and Lemma 3 in Athey, Bayati, Doudchenko, Imbens and Khosravi (2017). Thus, if $M_{it} = \hat\Gamma_{it}(x) - \Gamma^\infty_{it}(x)$ and $X^{NT}$ satisfy restricted strong convexity, then (14) would follow from Lemma 1.

We pursue a different strategy than the existing matrix completion literature to show that
$$\hat\mu(x) := \frac{1}{T} \sum_{t=1}^T \hat\mu_t(x) = \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T D_{it}(x)\, Y_{it} + \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T [1 - D_{it}(x)]\, \hat\Gamma_{it}(x)$$
is a consistent estimator of $(NT)^{-1} \sum_{i=1}^N \sum_{t=1}^T \Gamma^\infty_{it}(x)$, which under Assumption 1 is equal to $\mu(x)$ defined in (4). We believe that our approach is simpler in the setting of this paper, where $\Gamma^\infty_{it}(x)$ is not necessarily of low rank. In particular, we do not aim to show (14), but instead derive consistency of $\hat\mu(x)$ directly. However, the following theorem still requires additional assumptions on the assignment process that determines $X^{NT}$, in the same way that additional conditions on $X^{NT}$ are required to verify restricted strong convexity. For simplicity, we focus on consistency of $\hat\mu(x)$ in the main text, but results for more general weighted averages of the form $(NT)^{-1} \sum_{i=1}^N \sum_{t=1}^T W_{it}(x)\, \Gamma^\infty_{it}(x)$, with known weights $W_{it}(x) \in \mathbb{R}$, are presented in the appendix. For example, in the case of the treatment effects on the treated that we consider in the empirical application of Section 5.1, $W_{it}(x) = n(1)^{-1} X_{it}$.
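As a concrete check of these rate conditions (a worked example under stated assumptions, not taken from the paper): suppose $n(x) \asymp NT$ and $N \asymp T$. Then the choice $\rho_{NT} = \sqrt{N+T}\,\log(N+T)$ is admissible for Lemma 1, since
$$\frac{\rho_{NT}}{\sqrt{N+T}} = \log(N+T) \to \infty, \qquad \frac{\rho_{NT}\sqrt{NT}}{n(x)} \asymp \frac{\sqrt{N+T}}{\sqrt{NT}}\,\log(N+T) \asymp \frac{\log N}{\sqrt{N}} \to 0.$$
Any slowly diverging factor in place of the logarithm works equally well.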
Theorem 1. Let Assumptions 1, 2, 3 and 4 hold. Consider $N, T \to \infty$ at the same rate, and let $\rho = \rho_{NT}$ be chosen such that $\rho_{NT}/\sqrt{N+T} \to \infty$ and $\rho_{NT}/\sqrt{NT} \to 0$. Define $P_{it}(x) := \Pr\big(X_{it} = x \mid A^N, B^T\big)$, and assume that $\min_{i,t} P_{it}(x) > 0$ and that $(NT)^{-1} \sum_{i=1}^N \sum_{t=1}^T P_{it}^{-2}(x) = O_P(1)$. Let $G(x)$ be the $N \times T$ matrix with entries $G_{it}(x) = P_{it}^{-1}(x)\,(D_{it}(x) - P_{it}(x))$, and assume that $\|G(x)\|_\infty = O_P(\sqrt{N+T})$, and
$$\frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T P_{it}^{-1}(x)\, G_{it}(x) = o_P(1), \qquad \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T \Gamma^\infty_{it}(x)\, G_{it}(x) = o_P(1). \quad (15)$$
Then, $\hat\mu(x) = \mu(x) + o_P(1)$.

To interpret the conditions in Theorem 1, notice that due to the definitions $D_{it}(x) = 1\{X_{it} = x\}$ and $P_{it}(x) = \Pr\big(X_{it} = x \mid A^N, B^T\big)$, $E\big[G_{it}(x) \mid A^N, B^T\big] = 0$ by construction, and $G_{it}(x)$ therefore plays a role very similar to the error term $E_{it}(x)$. In particular, the conditions in (15) can be verified by a weak law of large numbers, as long as $P_{it}^{-1}(x)$ is not too large, and $G_{it}(x)$ is not too strongly correlated across i and over t. Regarding the condition on the spectral norm, $\|G(x)\|_\infty = O_P(\sqrt{N+T})$, there are many results in the random-matrix theory literature that show this rate for mean-zero random matrices $G(x)$; see, for example, Geman (1980), Silverstein (1989), Bai, Silverstein and Yin (1988), and Yin, Bai and Krishnaiah (1988). In particular, if $G_{it}(x)$ is independent across both i and t, then this rate result follows from the very elegant spectral norm inequality in Latała (2005); see the proof of Lemma 1 in the appendix, where we apply that inequality to $E_{it}(x)$. However, that simple argument would require $X_{it}$ to be independently distributed across i and t, conditional on $A^N$, $B^T$. More generally, we expect $\|G(x)\|_\infty = O_P(\sqrt{N+T})$ to hold whenever the matrix entries $G_{it}(x)$ have zero mean, sufficiently bounded moments, and weak correlation across both i and t; see Section S.2 of the supplementary material of Moon and Weidner (2017) for details.

An important restriction on the treatment design that is imposed by Theorem 1 is that $\Pr\big(X_{it} = x \mid A^N, B^T\big) > 0$ for all i and t. However, the key technical step in our proof of the theorem is Proposition 1 in the appendix, which does not necessarily require that strong condition. (This is because $P_{it}(x)$ need not be chosen equal to $\Pr\big(X_{it} = x \mid A^N, B^T\big)$ in that proposition, but verifying the conditions of the proposition is harder if $P_{it}(x)$ is chosen differently.) We will not explore deviations from that assumption here; the condition $P_{it}(x) > 0$ can hold even in applications with staggered adoption, where $X_{it} = 1\{t \ge \tau_i\}$ and $\tau_i$ is the date of the law change in state i. In that case, if we consider $\tau_i$ to be a random variable with sufficiently large support conditional on the unobserved effects, then the condition $P_{it}(x) > 0$ is satisfied.
We have thus shown that consistent estimates for ASFs can be obtained via the matrix completion estimator even if the estimand $\Gamma^\infty_{it}(x) = m(x, A_i, B_t)$ itself is not of low rank. This is the main technical result of this paper. However, inference on $\mu(x)$ based on $\hat\mu(x)$ can be problematic, because $\hat\mu(x)$ is subject to both low-rank approximation and shrinkage biases. The low-rank approximation bias is due to the approximation error $\zeta_R(x, a, b)$ in the decomposition of $m(x, a, b)$ in equation (8). The shrinkage bias comes from bias in $\hat\Gamma(x)$ due to the presence of the nuclear norm penalization in the objective function of (13). To isolate this bias, consider a simple case where $Y_{it}(x)$ follows a deterministic pure factor model
$$Y_{it}(x) = \Gamma_{it}(x) = \sum_{j=1}^R s_j(x)\, u_j(x, A_i)\, v_j(x, B_t).$$
Then, the matrix completion estimator of $\Gamma_{it}(x)$ in (13) yields
$$\hat\Gamma_{it}(x) = \sum_{j=1}^R [s_j(x) - \rho]_+\, u_j(x, A_i)\, v_j(x, B_t),$$
where $[z]_+ = \max(z, 0)$. Compared to $\Gamma(x)$, $\hat\Gamma(x)$ has the same eigenvectors but the singular values are shrunk toward zero. This argument carries over to the case where $Y_{it}(x)$ follows an approximate factor structure (Cai, Candès and Shen, 2010; Ma, Goldfarb and Chen, 2011; Bai and Ng, 2019b). Because of these biases, we explore alternative estimates for $\mu(x)$ in Section 4.

As we mentioned in Section 2, exogenous covariates can be incorporated by conditioning on their values. This method can produce very noisy estimators in small samples unless the covariates take on only a few values. Here we consider a semiparametric version of the model that imposes additivity in the effect of the exogenous covariates. It also allows for additive unobserved individual and time effects that might vary across the covariate level x. These effects can be subsumed in the factor structure, but are usually considered separately in empirical analysis, as the estimators perform better without regularizing them (Athey, Bayati, Doudchenko, Imbens and Khosravi, 2017).

Let $C_{it}$ be a $d_c$-vector of covariates, $\alpha(x) = (\alpha_1(x), \ldots, \alpha_N(x))$ be an N-vector of individual effects and $\delta(x) = (\delta_1(x), \ldots, \delta_T(x))$ be a T-vector of time effects. Then, we can replace the program (11) by
$$\min_{\{\beta \in \mathbb{R}^{d_c},\, \alpha \in \mathbb{R}^N,\, \delta \in \mathbb{R}^T,\, \Gamma \in \mathbb{R}^{N \times T}\}}\ \sum_{i=1}^N \sum_{t=1}^T 1\{X_{it} = x\}\, (Y_{it} - C_{it}^{\mathrm T}\beta - \alpha_i - \delta_t - \Gamma_{it})^2 + \rho(\bar R)\, \|\Gamma\|_*.$$
Chernozhukov, Hansen, Liao and Zhu (2018), Moon and Weidner (2018) and Beyhum and Gautier (2019) provide algorithms to solve this program. Let $\hat\beta(x)$, $\hat\alpha(x) = (\hat\alpha_1(x), \ldots, \hat\alpha_N(x))$, $\hat\delta(x) = (\hat\delta_1(x), \ldots, \hat\delta_T(x))$, and $\hat\Gamma(x)$ be the solution of the previous program. We can form estimators of the ASF and CASF as
$$\hat\mu_t(x) = \frac{1}{N} \sum_{i=1}^N \Big[ 1\{X_{it} = x\}\, Y_{it} + 1\{X_{it} \ne x\}\, \big\{ C_{it}^{\mathrm T}\hat\beta(x) + \hat\alpha_i(x) + \hat\delta_t(x) + \hat\Gamma_{it}(x) \big\} \Big],$$
and
$$\hat\mu_t(x \mid \tilde x) = \frac{\sum_{i=1}^N \Big[ 1\{X_{it} = x = \tilde x\}\, Y_{it} + 1\{X_{it} = \tilde x \ne x\}\, \big\{ C_{it}^{\mathrm T}\hat\beta(x) + \hat\alpha_i(x) + \hat\delta_t(x) + \hat\Gamma_{it}(x) \big\} \Big]}{\sum_{i=1}^N 1\{X_{it} = \tilde x\}}.$$

The matrix completion estimator of the ASF is generally biased. As we explained in Section 3.3, the bias comes from two sources: low-rank approximation bias and shrinkage bias.
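Before turning to the corrections, here is a small numerical sketch of the shrinkage component in isolation: for a fully observed, noiseless low-rank matrix, the minimizer of the penalized criterion (13) keeps the singular vectors but soft-thresholds the singular values by ρ, exactly as in the display in Section 3.3. The dimensions, rank, and penalty level below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, rho = 200, 100, 5.0

# Noiseless rank-3 matrix Gamma with singular values 40, 20, 10.
U, _ = np.linalg.qr(rng.standard_normal((N, 3)))
V, _ = np.linalg.qr(rng.standard_normal((T, 3)))
Gamma = U @ np.diag([40.0, 20.0, 10.0]) @ V.T

# With every entry observed, the penalized least squares fit is the
# soft-thresholded SVD: same singular vectors, values [s_j - rho]_+ .
Uh, s, Vth = np.linalg.svd(Gamma, full_matrices=False)
Gamma_hat = Uh @ np.diag(np.maximum(s - rho, 0.0)) @ Vth

print("singular values, true:", np.round(s[:3], 1))        # [40. 20. 10.]
print("singular values, fit :", np.round(np.maximum(s[:3] - rho, 0.0), 1))
print("relative shrinkage bias (Frobenius):",
      np.linalg.norm(Gamma_hat - Gamma) / np.linalg.norm(Gamma))
```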
One could attempt to correct the shrinkage bias by shifting the singular values of $\hat\Gamma(x)$ upwards. However, inference results on the ASFs based on matrix completion are generally very difficult to obtain even if $\Gamma^\infty(x)$ is truly low rank. In our setting, the presence of the additional low-rank approximation bias makes this even more challenging. We instead discuss alternative estimators and show that they have significantly lower biases than the matrix completion estimators in the numerical simulations of Section 5.2.

To construct the estimators of $\Gamma^\infty(x)$, we start by extracting the factor structure of $\hat\Gamma(x)$ in (13). Let $\hat\lambda_i(x)$ and $\hat f_t(x)$ be the $R \times 1$ vectors such that
$$\hat\Gamma_{it}(x) = \hat\lambda_i(x)^{\mathrm T} \hat f_t(x),$$
subject to the usual normalizations that $T^{-1} \sum_{t=1}^T \hat f_t(x)\, \hat f_t(x)^{\mathrm T}$ is the identity matrix of size R and $N^{-1} \sum_{i=1}^N \hat\lambda_i(x)\, \hat\lambda_i(x)^{\mathrm T}$ is a diagonal matrix. Next, we apply a matching procedure to this factor structure. In its simplest version, we estimate each entry $\Gamma^\infty_{it}(x)$ such that $X_{it} \ne x$ by matching with the observation with $X_{js} = x$ that is the nearest neighbor in terms of the vectors $\hat\lambda_i(x)$ and $\hat f_t(x)$. In particular, $\breve\Gamma_{it}(x) = Y_{i^{**}(i,t,x),\, t^{**}(i,t,x)}$, where $i^{**}(i,t,x) \in \mathcal{N}$ and $t^{**}(i,t,x) \in \mathcal{T}$ are a solution to the program
$$\min_{j \in \mathcal{N},\, s \in \mathcal{T}}\ \big\|\hat\lambda_i(x) - \hat\lambda_j(x)\big\|^2 + \big\|\hat f_t(x) - \hat f_s(x)\big\|^2 \quad \text{s.t.} \quad X_{js} = x.$$

We also consider a two-way matching procedure that combines matching with a difference-in-difference approach. It consists of two steps:

(i) For all $x \in \mathcal{X}$ and $(i, t) \in \mathcal{N} \times \mathcal{T}$ such that $X_{it} \ne x$, find the matches $i^*(i,t,x) \in \mathcal{N}$ and $t^*(i,t,x) \in \mathcal{T}$ that solve the program
$$\min_{j \in \mathcal{N},\, s \in \mathcal{T}}\ \big\|\hat\lambda_i(x) - \hat\lambda_j(x)\big\|^2 + \big\|\hat f_t(x) - \hat f_s(x)\big\|^2 \quad \text{s.t.} \quad X_{is} = X_{jt} = X_{js} = x.$$

(ii) Estimate $\Gamma_{it}(x)$ by
$$\tilde\Gamma_{it}(x) = Y_{i,\, t^*(i,t,x)} + Y_{i^*(i,t,x),\, t} - Y_{i^*(i,t,x),\, t^*(i,t,x)}.$$

In other words, we find the match (j, s) with $X_{js} = x$ that not only is the closest to (i, t) in terms of the estimated factor structure, but also corresponds to a unit j with $X_{jt} = x$ and a time period s with $X_{is} = x$. Then, we estimate the counterfactual $\Gamma_{it}(x)$ as a linear combination of $Y_{jt}$, $Y_{is}$ and $Y_{js}$.

The additional difference-in-difference step in the two-way procedure is useful to reduce bias. To see this, we can compare $\tilde\Gamma_{it}(x)$ with the simple matching estimator $\breve\Gamma_{it}(x)$. Abstracting from the estimation error in the factors and loadings,
$$E[\breve\Gamma_{it}(x) - \Gamma_{it}(x) \mid A^N, B^T, X^{NT}] = m(x, A_{i^{**}(i,t,x)}, B_{t^{**}(i,t,x)}) - m(x, A_i, B_t) = O_P\big( \|A_{i^{**}(i,t,x)} - A_i\| + \|B_{t^{**}(i,t,x)} - B_t\| \big),$$
by a first-order Taylor expansion of $(a_i, b_t) \mapsto m(x, a_i, b_t)$ around $(A_i, B_t)$; whereas
$$E[\tilde\Gamma_{it}(x) - \Gamma_{it}(x) \mid A^N, B^T, X^{NT}] = m(x, A_i, B_{t^*(i,t,x)}) + m(x, A_{i^*(i,t,x)}, B_t) - m(x, A_{i^*(i,t,x)}, B_{t^*(i,t,x)}) - m(x, A_i, B_t) = O_P\big( \|A_{i^*(i,t,x)} - A_i\|^2 + \|B_{t^*(i,t,x)} - B_t\|^2 \big),$$
by a second-order Taylor expansion of $(a_i, b_t) \mapsto m(x, a_i, b_t)$ around $(A_i, B_t)$. The two-way matching removes the leading term of the Taylor expansion, reducing the bias of the matching by one order of magnitude, because $i^{**}(i,t,x) \ne i$ or $t^{**}(i,t,x) \ne t$. On the other hand, $\|A_{i^*(i,t,x)} - A_i\| \ge \|A_{i^{**}(i,t,x)} - A_i\|$ and $\|B_{t^*(i,t,x)} - B_t\| \ge \|B_{t^{**}(i,t,x)} - B_t\|$ a.s., because the two-way procedure imposes the additional restrictions $X_{is} = X_{jt} = x$. Whether the first or second order bias dominates would generally be determined by the proportion of observations with $X_{js} = x$ and the distributions of $A_i$ and $B_t$. We provide a numerical comparison of the biases of the matching estimators in Section 5.2.

We develop the theory for a debiased estimator that allows for multiple matches and estimated factors and loadings. Multiple matches are expected to reduce dispersion at the cost of increasing bias. Let $\lambda_i = \lambda(x, A_i)$ and $f_t = f(x, B_t)$ be the transformations of $A_i$ and $B_t$ that are consistently estimated by $\hat\lambda_i$ and $\hat f_t$. We define
$$N_i = \big\{ j \in \mathcal{N} \setminus \{i\} : \|\hat\lambda_i - \hat\lambda_j\| \le \tau_{NT} \big\}, \qquad T_t = \big\{ s \in \mathcal{T} \setminus \{t\} : \|\hat f_t - \hat f_s\| \le \upsilon_{NT} \big\},$$
for some bandwidth parameters $\tau_{NT} > 0$ and $\upsilon_{NT} > 0$.
The debiased estimator of $\mu(x)$ is then given by
$$\tilde\mu(x) = \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T \tilde Y_{it}(x),$$
with
$$\tilde Y_{it}(x) = \begin{cases} Y_{it} & \text{if } X_{it} = x, \\[4pt] \dfrac{1}{n_{it}} \displaystyle\sum_{j \in N_i} \sum_{s \in T_t} 1\{X_{is} = X_{jt} = X_{js} = x\}\, (Y_{is} + Y_{jt} - Y_{js}) & \text{if } X_{it} \ne x \text{ and } n_{it} > 0, \\[4pt] \dfrac{1}{n(x)} \displaystyle\sum_{(j,s) \in D(x)} Y_{js} & \text{if } n_{it} = 0, \end{cases} \quad (16)$$
where $n_{it} := \sum_{j \in N_i} \sum_{s \in T_t} 1\{X_{is} = X_{jt} = X_{js} = x\}$. Here, for $X_{it} \ne x$, we construct the counterfactual $\tilde Y_{it}(x)$ by averaging over all pairs $(j, s) \in N_i \times T_t$ that satisfy the constraint $X_{is} = X_{jt} = X_{js} = x$. Notice that if $X_{it} \ne x$ and $n_{it} = 0$, then we cannot construct a suitable counterfactual by that method. In that case we assign to $\tilde Y_{it}(x)$ the average of the observations with $X_{js} = x$ to make sure that $\tilde\mu(x)$ is always well-defined, but our assumption below guarantees that this rarely happens.

This estimator has similar debiasing properties to the nearest-neighbor estimator described above, but it is more tractable theoretically because it varies more smoothly with respect to the estimated factors and loadings. (The matching method discussed here is also applicable to settings where the matching is based on variables other than the estimated factor structure. These include, for example, cross-section and time-series averages of the observable variables. See the appendix for a more general treatment.) $\tilde\mu(x)$ can be written as
$$\tilde\mu(x) = \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T \omega_{it}\, Y_{it},$$
where the weights $\omega_{it}$ are functions of $\hat\lambda_j$ and $\hat f_s$ for all $j \in \mathcal{N}$ and $s \in \mathcal{T}$. To show that $\tilde\mu(x)$ is a consistent estimator of $\mu(x)$, we use the following assumption:

Assumption 5 (Two-way Matching Estimator). There exists a sequence $\xi_{NT} > 0$ such that $\xi_{NT} \to 0$ as $N, T \to \infty$, and

(i) $\frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T 1\{X_{it} \ne x \text{ and } n_{it} = 0\} = O_P(\xi_{NT})$.

(ii) $Y_{it}$ is uniformly bounded over i, t, N, T.

(iii) $Y_{it}$ is independent across both i and t, conditional on $X^{NT}$, $A^N$, $B^T$.

(iv) The function $(a, b) \mapsto m(x, a, b)$ is at least twice continuously differentiable with uniformly bounded second derivatives.

(v) There exists $c > 0$ such that $\|a_1 - a_2\| \le c\, \|\lambda(a_1) - \lambda(a_2)\|$ for all $a_1, a_2 \in \mathcal{A}$, and $\|b_1 - b_2\| \le c\, \|f(b_1) - f(b_2)\|$ for all $b_1, b_2 \in \mathcal{B}$.

(vi) $\frac{1}{N} \sum_{i=1}^N \big( \|\hat\lambda_i - \lambda_i\|^2 + \max_{j \in N_i} \|\hat\lambda_j - \lambda_j\|^2 \big) = O_P(\xi_{NT})$ and $\frac{1}{T} \sum_{t=1}^T \big( \|\hat f_t - f_t\|^2 + \max_{s \in T_t} \|\hat f_s - f_s\|^2 \big) = O_P(\xi_{NT})$.

(vii) $\tau_{NT}^2 = O_P(\xi_{NT})$ and $\upsilon_{NT}^2 = O_P(\xi_{NT})$.

(viii) $\frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T E\big[\omega_{it}^2 \mid X^{NT}, A^N, B^T\big] = O_P(NT\, \xi_{NT}^2)$.

(ix) Let $Y^{NT}_{-(i,t),-(j,s)}$ be the outcome matrix $Y^{NT}$, but with $Y_{it}$ and $Y_{js}$ replaced by zero (or some other non-random number), and all other outcomes unchanged. We assume
$$\frac{1}{(NT)^2} \sum_{i,j=1}^N \sum_{t,s=1}^T 1\{(i,t) \ne (j,s)\}\, E\Big[ \big| \omega_{it}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, \omega_{js}\big(Y^{NT}_{-(i,t),-(j,s)}\big) - \omega_{it}(Y^{NT})\, \omega_{js}(Y^{NT}) \big| \,\Big|\, X^{NT}, A^N, B^T \Big] = O_P\big(\xi_{NT}^2\big).$$

Remark 1 (Assumption 5).
Part (i) guarantees that $X_{it} \ne x$ and $n_{it} = 0$ only happens for a small fraction of observations (i, t). We are therefore able to construct proper counterfactuals $\tilde Y_{it}(x)$ for most observations. Part (ii) is a boundedness condition that is standard in the matrix completion literature. Part (iii) is an independence condition that is convenient to simplify the derivations, but can be generalized to weak correlation across both i and t. We use part (iv) to bound the error terms of the Taylor expansions for the bias. Part (v) imposes an injectivity condition. The functions $a \mapsto \lambda(a)$ and $b \mapsto f(b)$ need to be such that $A_i$ and $B_t$ can be uniquely recovered from $\lambda_i = \lambda(A_i)$ and $f_t = f(B_t)$. A necessary condition is that the dimensions of $\lambda_i$ and $f_t$ are greater than or equal to the dimensions of $A_i$ and $B_t$, respectively. This holds in our factor structure approximation when we let R grow with the sample size, provided that the dimensions of $A_i$ and $B_t$ are fixed. Part (vi) holds if $\hat\lambda_i - \lambda_i$ and $\hat f_t - f_t$ are of order $N^{-1/2}$ and $T^{-1/2}$. We expect this assumption to be satisfied for rates $\xi_{NT} \gg \max(N^{-1}, T^{-1})$. The bandwidth parameters $\tau_{NT}$ and $\upsilon_{NT}$ should not be chosen too large, according to part (vii). For example, if we want to achieve a rate $\xi_{NT} \ll \max(N^{-1/2}, T^{-1/2})$, then we need $\tau_{NT} \ll \max(N^{-1/4}, T^{-1/4})$ and $\upsilon_{NT} \ll \max(N^{-1/4}, T^{-1/4})$. Part (viii) requires that any given outcome $Y_{it}$ is not chosen too often with too high weight in the construction of the counterfactuals $\tilde Y_{js}(x)$. Finally, part (ix) is a high-level assumption that could be justified by appropriate distributional assumptions on $X_{it}$, $A_i$, $B_t$, and on the estimators $\hat\lambda_i$ and $\hat f_t$. We prefer to present it as a high-level assumption, because formally working out the distributional assumptions is quite cumbersome. Intuitively, if $n_{it}$ is sufficiently large, then changing $Y^{NT}$ to $Y^{NT}_{-(i,t),-(j,s)}$ should not change the construction of the counterfactual $\tilde Y_{it}(x)$ very much. If that is true for all (i, t), then the weights $\omega_{it}(Y^{NT})$ should be very close to the weights $\omega_{it}\big(Y^{NT}_{-(i,t),-(j,s)}\big)$ and the assumption is satisfied.
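For concreteness, here is a brute-force sketch of the debiased estimator (16), assuming loadings and factors from the matrix completion step are already in hand. The naive $O(N^2 T^2)$ neighborhood search and the user-chosen bandwidths make this an illustration under stated assumptions, not the authors' implementation.

```python
import numpy as np

def two_way_matching_asf(Y, D, lam_hat, f_hat, tau, ups):
    """Debiased ASF estimate mu_tilde(x) from equation (16).

    Y        : N x T outcome matrix
    D        : N x T binary matrix with D[i, t] = 1{X_it = x}
    lam_hat  : N x R estimated loadings; f_hat : T x R estimated factors
    tau, ups : bandwidths defining the neighborhoods N_i and T_t
    """
    N, T = Y.shape
    du = ((lam_hat[:, None, :] - lam_hat[None, :, :]) ** 2).sum(axis=2)
    dt = ((f_hat[:, None, :] - f_hat[None, :, :]) ** 2).sum(axis=2)
    nbr_units = [np.flatnonzero((du[i] <= tau ** 2) & (np.arange(N) != i))
                 for i in range(N)]
    nbr_times = [np.flatnonzero((dt[t] <= ups ** 2) & (np.arange(T) != t))
                 for t in range(T)]
    fallback = Y[D == 1].mean()          # average over D(x), used when n_it = 0
    Y_tilde = np.where(D == 1, Y, np.nan)
    for i in range(N):
        for t in range(T):
            if D[i, t] == 1:
                continue
            # all (j, s) in N_i x T_t with X_is = X_jt = X_js = x
            vals = [Y[i, s] + Y[j, t] - Y[j, s]
                    for j in nbr_units[i] for s in nbr_times[t]
                    if D[i, s] == 1 and D[j, t] == 1 and D[j, s] == 1]
            Y_tilde[i, t] = np.mean(vals) if vals else fallback
    return Y_tilde.mean()
```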
Theorem 2. Under Assumptions 1 and 5, $\tilde\mu(x) - \mu(x) = O_P(\xi_{NT})$.

As discussed in the above remark, one can achieve rates $\xi_{NT} \ll \max(N^{-1/2}, T^{-1/2})$ for sufficiently regular data generating processes, and if the bandwidth parameters $\tau_{NT}$ and $\upsilon_{NT}$ are chosen sufficiently small. By contrast, the low-rank approximation bias in $\hat\mu(x)$ will usually prevent us from achieving such a convergence rate for $\hat\mu(x)$. This finding is consistent with our Monte Carlo results in Section 5.2, where $\tilde\mu(x)$ is found to typically have much smaller bias than $\hat\mu(x)$.

5 Numerical Examples
We illustrate the methods of the paper with an empirical application to the effect of allowing voter registration on election day on voter turnout in the U.S. (Xu, 2017). Voting in the U.S. used to require registration prior to election day in most states. Registration increased the cost of voting and was considered one possible reason for low turnout rates. In response, some states implemented Election Day Registration (EDR) laws that allowed eligible voters to register on election day when they arrive at the polling stations. These laws were not passed by all the states, and there was variation in the time of adoption across states: they were enacted by Maine, Minnesota and Wisconsin in 1976; Wyoming, Indiana and New Hampshire in 1994; and Connecticut in 2012.

We use a dataset on the 24 presidential elections for 47 states between 1920 and 2012 collected by Xu (2017). It includes state-level information about the turnout rate, $Y_{it}$, measured as the total ballots counted divided by the voting-age population in state i at election t, and a treatment indicator for EDR, $X_{it}$, that equals one if state i has an EDR law enacted at election t. Following Xu (2017), we exclude North Dakota, where registration was never needed, and Alaska and Hawaii, which were not states until 1959. Since there are only 9 states that are ever treated and the treatment started in the 1976 election, we focus on effects on the treated at the elections between 1976 and 2012. We estimate average treatment effects and quantile treatment effects at multiple quantile indices.

Figure 1 compares the average turnout of states that are ever treated with states that are never treated in elections prior to the first implementation of the EDR laws in 1976. It shows that ever treated states have higher turnout rates on average than never treated states without the EDR treatment. We consider several methods to deal with this likely nonrandom assignment of EDR to estimate the ATTs at each election after 1976. First, we do a naive comparison of means between treated and nontreated states in each election (Dmeans). Second, we consider a difference-in-difference method that uses the nontreated states as controls at each election (DiD). In particular, we estimate the effects from a linear regression with state effects and election effects interacted with an EDR indicator. This method yields the ATT for each election under a parallel trend
assumption between treated and nontreated states. (The DiD model is a special case of our model with additive effects. In this case, it imposes that there are only additive state and election effects that are the same for both treatment levels.) Third, we compute our estimator based on matrix completion methods without debiasing (MC) with additive state and election effects and the parameter ρ such that the number of factors is R = 6. Fourth, we debias the MC estimates using the two-way matching method with 10 matches (TWM-10). Fifth, we consider the simple matching method with 5 matches (SM-5). We choose the number of matches roughly based on the numerical simulations of Section 5.2.

[Figure 1: Pretrends in turnout rate before EDR by future treatment status. Average turnout (%) by election year for ever treated versus never treated states.]

Figure 2 reports the estimates of the ATT of EDR at each election. The methods that account for possible nonrandom assignment of the EDR produce lower estimates of the effect than the naive comparison of means between treated and nontreated states.
[Figure 2: Average treatment effect on the treated of EDR on turnout rate at each election. Estimates from Dmeans, DiD, TWM-10, SM-5, and MC.]

Figure 3 plots the estimates of the election-aggregated quantile treatment effect on the treated (QTT) of EDR as a function of the quantile index. We report estimates from four methods: a naive comparison of quantiles between treated and non-treated states (Dquantiles), our estimator based on matrix completion methods without debiasing (MC) with additive state and election effects and the parameter ρ such that the number of factors is R = 3, two-way matching with 10 matches (TWM-10), and simple matching with 5 matches (SM-5). The QTT is the difference of the quantiles between the observed turnout for the treated observations and the corresponding potential turnout had they not been treated. The quantiles of the observed turnout are estimated using sample quantiles. The estimates of the quantiles of the potential outcomes are obtained by inverting the corresponding estimates of the distribution, which are obtained by our methods replacing $Y_{it}$ by the indicator $1(Y_{it} \le y)$ and repeating the procedure over a grid of values of y that includes the sample quantiles of observed turnout at the reported quantile indices. (We rearrange the estimates of the distribution to guarantee that they are increasing with respect to y; Chernozhukov, Fernández-Val and Galichon, 2010.) Here, we find that the effect of EDR is decreasing across the distribution of turnout and ranges between 10% and 0% according to TWM-10. EDR is therefore more effective for states with low voter turnout. Comparing with the Dquantiles estimates, we find that the sign of the selection bias switches from positive to negative around the middle of the turnout distribution.
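In code, the QTT construction just described amounts to estimating the two DSFs on a grid of y values, rearranging them to be monotone, and taking left-inverses. In the sketch below, `estimate_dsf` is a hypothetical placeholder for whichever DSF estimator is used (MC, TWM, or SM applied to the indicators $1(Y_{it} \le y)$), and the grid assumes it covers the relevant range of the outcome.

```python
import numpy as np

def qtt_from_dsf(F1, F0, y_grid, taus):
    """Quantile treatment effects from two estimated distribution functions.

    F1, F0 : estimated DSFs of the treated/untreated potential outcomes on y_grid
    """
    # Rearrangement: sorting makes the DSF estimates nondecreasing in y
    # (Chernozhukov, Fernandez-Val and Galichon, 2010).
    F1, F0 = np.sort(F1), np.sort(F0)
    # Left-inverse: q_tau = inf{ y : F(y) >= tau }; np.argmax returns the
    # first index where the condition holds (the grid must span the support).
    q1 = np.array([y_grid[np.argmax(F1 >= tau)] for tau in taus])
    q0 = np.array([y_grid[np.argmax(F0 >= tau)] for tau in taus])
    return q1 - q0

# Example usage with hypothetical DSF estimates:
# y_grid = np.linspace(40, 80, 200)            # grid of turnout values
# F1 = estimate_dsf(y_grid, treated=True)      # placeholder estimator
# F0 = estimate_dsf(y_grid, treated=False)
# qtt = qtt_from_dsf(F1, F0, y_grid, taus=np.arange(0.1, 1.0, 0.1))
```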
5.2 Numerical Simulations

To evaluate the performance of our methods in a controlled synthetic environment, we generate potential outcomes from an additive linear model where
\[
Y_{it}(x) = x + g(A_i, B_t) + U_{it}(x), \qquad x \in \{0, 1\}, \quad i \in \{1, \ldots, N\}, \quad t \in \{1, \ldots, T\},
\]
with $U_{it}(x) \sim N(0, 1/4)$ independently over $i$, $t$ and $x$, $A_i \sim U(0,1)$ independently over $i$, $B_t \sim U(0,1)$ independently over $t$, $U_{it}(x)$, $A_j$ and $B_s$ independent for all $i$, $t$, $j$ and $s$, and $g$ the Gaussian kernel, i.e.,
\[
g(a, b) = \frac{1}{\sqrt{2\pi}\,\theta}\, \exp\!\left( -\frac{(a - b)^2}{2\theta^2} \right). \tag{17}
\]
(We find similar results in a multiplicative model where $Y_{it}(x) = (1 + x)\, g(A_i, B_t) + U_{it}(x)$; we omit these results for the sake of brevity.) This design is similar to that used in Bordenave, Coste and Nadakuditi (2020), with the kernel function specification taken from the numerical simulations in Griebel and Harbrecht (2010). The parameter $\theta$ controls the decay of the singular values of $g$ and can be calibrated to make sure the singular values decay slowly.
Smaller values of $\theta$ lead to greater dispersion in the kernel function $(a, b) \mapsto g(a, b)$ and a slower singular value decay, and $\theta$ can hence be interpreted as a measure of smoothness. (Smoothness here refers specifically to numerical smoothness, i.e. variability of the function within close neighbourhoods of its arguments.) The assignment of $X_{it}$, which determines which potential outcomes are observed, is similar to the election application. In particular, only observations for the first half of the units, $i \in \{1, \ldots, N/2\}$, and the second half of the panel, $t \in \{T/2 + 1, \ldots, T\}$, may be treated. For these observations, $X_{it}$ is related to the unobserved effects $(A_i, B_t)$ via $X_{it} = \mathbb{1}\{g(A_i, B_t) \ge c\}$, where $c$ is a constant calibrated so that $\Pr(g(A_i, B_t) \ge c)$ equals a pre-specified treated share.
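For concreteness, a minimal sketch of this design follows; the panel dimensions, the value of theta and the 50% treated share among eligible cells are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, theta = 100, 100, 0.2                  # illustrative sizes and smoothness

def g(a, b):
    # Gaussian kernel from equation (17)
    return np.exp(-((a - b) ** 2) / (2 * theta ** 2)) / (np.sqrt(2 * np.pi) * theta)

A = rng.uniform(0, 1, size=(N, 1))           # unit effects A_i
B = rng.uniform(0, 1, size=(1, T))           # time effects B_t
G = g(A, B)                                  # N x T matrix of g(A_i, B_t)

# Only the first half of the units in the second half of the panel may be
# treated; among those cells, treatment switches on when g(A_i, B_t) >= c.
eligible = np.zeros((N, T), dtype=bool)
eligible[: N // 2, T // 2:] = True
c = np.quantile(G[eligible], 0.5)            # calibrates the treated share
X = (eligible & (G >= c)).astype(int)

U0 = rng.normal(0.0, 0.5, size=(N, T))       # U_it(0) ~ N(0, 1/4), sd = 1/2
U1 = rng.normal(0.0, 0.5, size=(N, T))       # U_it(1) ~ N(0, 1/4)
Y0, Y1 = G + U0, 1.0 + G + U1                # potential outcomes Y_it(0), Y_it(1)
Y = np.where(X == 1, Y1, Y0)                 # observed outcome Y_it = Y_it(X_it)
```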
We apply methods similar to those of Section 5.1 to estimate the CASFs $\mu_t(0 \mid 1)$, $t \in \{T/2 + 1, \ldots, T\}$, and $\mu(0 \mid 1)$, using the observed variables $X_{it}$ and $Y_{it} = Y_{it}(X_{it})$. In particular, we consider Dmeans, DiD, MC without additive effects and with the parameter $\rho$ such that $R = 5$, and multiple versions of TWM and SM with the number of matches equal to 1, 5, 10, and 30. For each method, we compute the bias, standard deviation and rmse from 1,000 simulations.
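A minimal sketch of such a Monte Carlo evaluation is given below; estimate() and mu0 are hypothetical placeholders for any of the estimators above and the target CASF, and each call to estimate() is assumed to redraw the noise as described next.

```python
import numpy as np

def monte_carlo(estimate, mu0, n_sim=1000):
    """Bias, standard deviation and rmse of `estimate` over n_sim redraws."""
    draws = np.array([estimate() for _ in range(n_sim)])
    bias = draws.mean() - mu0
    sd = draws.std(ddof=1)
    rmse = np.sqrt(np.mean((draws - mu0) ** 2))
    return bias, sd, rmse
```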
Across the simulations, we redraw the values of $U_{it}(x)$ and hold $A_i$, $B_t$ and $X_{it}$ fixed. Table 1 reports the results for the time-aggregated CASF $\mu(0 \mid 1)$, the average of $\mu_t(0 \mid 1)$ over the treated periods $t$, and Figure 4 reports the results for $t \mapsto \mu_t(0 \mid 1)$. The results show that Dmeans, DiD and MC are severely biased relative to their standard deviations. All the matching estimators reduce bias and rmse, despite increasing dispersion. As one would expect, increasing the number of matches reduces the variability of the matching estimators but increases their biases. The number of matches that minimizes the rmse is larger for the TWM than for the SM. Overall, these small-sample findings agree with the asymptotic results of Sections 3.3 and 4.

[Figure 4 here: simulation results for $t \mapsto \mu_t(0 \mid 1)$ by estimator. Notes: based on 1,000 simulations.]

Figure 4: Results for $t \mapsto \mu_t(0 \mid 1)$.

References
Amjad, M., D. Shah, and D. Shen (2018). Robust synthetic control. Journal of Machine Learning Research 19(22), 1–51.
Athey, S., M. Bayati, N. Doudchenko, G. Imbens, and K. Khosravi (2017). Matrix completion methods for causal panel data models.
Auerbach, E. (2019). Identification and estimation of a partially linear regression model using network data. arXiv preprint arXiv:1903.09679.
Bai, J. and S. Ng (2019a). Matrix completion, counterfactuals, and factor analysis of missing data.
Bai, J. and S. Ng (2019b). Rank regularized estimation of approximate factor models. Journal of Econometrics 212(1), 78–96.
Bai, Z. D., J. W. Silverstein, and Y. Q. Yin (1988). A note on the largest eigenvalue of a large dimensional sample covariance matrix. Journal of Multivariate Analysis 26(2), 166–168.
Beyhum, J. and E. Gautier (2019). Square-root nuclear norm penalized estimator for panel data models with approximately low-rank unobserved heterogeneity.
Bordenave, C., S. Coste, and R. R. Nadakuditi (2020). Detection thresholds in very sparse matrix completion. arXiv preprint arXiv:2005.06062.
Cai, J.-F., E. J. Candès, and Z. Shen (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4), 1956–1982.
Candès, E. J. and B. Recht (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6), 717.
Candès, E. J. and T. Tao (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory 56(5), 2053–2080.
Chamberlain, G. (1982). Multivariate regression models for panel data. Journal of Econometrics 18(1), 5–46.
Chan, M. K. and S. Kwok (2020). The PCDID approach: Difference-in-differences when trends are potentially unparallel and stochastic. Working Papers 2020-03, University of Sydney, School of Economics.
Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. Annals of Statistics 43(1), 177–214.
Chen, M., I. Fernández-Val, and M. Weidner (2020). Nonlinear factor models for network and panel data. Journal of Econometrics.
Chernozhukov, V., I. Fernández-Val, and A. Galichon (2010). Quantile and probability curves without crossing. Econometrica 78(3), 1093–1125.
Chernozhukov, V., I. Fernández-Val, J. Hahn, and W. Newey (2013). Average and quantile effects in nonseparable panel models. Econometrica 81(2), 535–580.
Chernozhukov, V., C. Hansen, Y. Liao, and Y. Zhu (2018). Inference for heterogeneous effects using low-rank estimation of factor slopes.
Dzemski, A. (2019). An empirical model of dyadic link formation in a network with unobserved heterogeneity. Review of Economics and Statistics 101(5), 763–776.
Evdokimov, K. (2010). Identification and estimation of a nonparametric panel data model with unobserved heterogeneity. Department of Economics, Princeton University.
Fazel, S. M. (2003). Matrix rank minimization with applications.
Freyberger, J. (2017). Non-parametric panel data models with interactive fixed effects. The Review of Economic Studies 85(3), 1824–1851.
Gao, C., Y. Lu, and H. H. Zhou (2015). Rate-optimal graphon estimation. The Annals of Statistics 43(6), 2624–2652.
Geman, S. (1980). A limit theorem for the norm of random matrices. Annals of Probability 8(2), 252–261.
Gobillon, L. and T. Magnac (2016). Regional policy evaluation: Interactive fixed effects and synthetic controls. The Review of Economics and Statistics 98(3), 535–551.
Graham, B. S. (2017). An econometric model of network formation with degree heterogeneity. Econometrica 85(4), 1033–1063.
Graham, B. S. and J. L. Powell (2012). Identification and estimation of average partial effects in "irregular" correlated random coefficient panel data models. Econometrica 80(5), 2105–2152.
Griebel, M. and H. Harbrecht (2010). Approximation of two-variate functions: Singular value decomposition versus regular sparse grids. SFB 611.
Griebel, M. and H. Harbrecht (2013). Approximation of bi-variate functions: Singular value decomposition versus sparse grids. IMA Journal of Numerical Analysis 34(1), 28–54.
Hoderlein, S. and H. White (2012). Nonparametric identification in nonseparable panel data models with generalized fixed effects. Journal of Econometrics 168(2), 300–314.
Holland, P. W., K. B. Laskey, and S. Leinhardt (1983). Stochastic blockmodels: First steps. Social Networks 5(2), 109–137.
Honoré, B. (1992). Trimmed LAD and least squares estimation of truncated and censored regression models with fixed effects. Econometrica 60(3), 533–565.
Hsiao, C., H. S. Ching, and S. K. Wan (2012). A panel data approach for program evaluation: Measuring the benefits of political and economic integration of Hong Kong with mainland China. Journal of Applied Econometrics 27(5), 705–740.
Imai, K. and I. S. Kim (2019). On the use of two-way fixed effects regression models for causal inference with panel data. Unpublished paper: Harvard University.
Kim, D. and T. Oka (2014). Divorce law reforms and divorce rates in the USA: An interactive fixed-effects approach. Journal of Applied Econometrics.
Klopp, O. (2014). Noisy low-rank matrix completion with general sampling distribution. Bernoulli 20(1), 282–303.
Latała, R. (2005). Some estimates of norms of random matrices. Proceedings of the American Mathematical Society 133(5), 1273–1282.
Li, K. (2018). Inference for factor model based average treatment effects. Available at SSRN 3112775.
Li, K. T. and D. R. Bell (2017). Estimation of average treatment effects with panel data: Asymptotic theory and implementation. Journal of Econometrics 197(1), 65–75.
Li, Y., D. Shah, D. Song, and C. L. Yu (2017). Nearest neighbors for matrix estimation interpreted as blind regression for latent variable model.
Ma, S., D. Goldfarb, and L. Chen (2011). Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming 128(1-2), 321–353.
Manski, C. (1987). Semiparametric analysis of random effects linear models from binary panel data. Econometrica 55(2), 357–362.
Mazumder, R., T. Hastie, and R. Tibshirani (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research 11(80), 2287–2322.
Moon, H. R. and M. Weidner (2017). Dynamic linear panel regression models with interactive fixed effects. Econometric Theory 33(1), 158–195.
Moon, H. R. and M. Weidner (2018). Nuclear norm regularized estimation of panel regression models.
Negahban, S. and M. J. Wainwright (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research 13(1), 1665–1697.
Orbanz, P. and D. M. Roy (2015). Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(2), 437–461.
Rennie, J. D. M. and N. Srebro (2005). Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, New York, NY, USA, pp. 713–719. Association for Computing Machinery.
Silverstein, J. W. (1989). On the eigenvectors of large dimensional sample covariance matrices. Journal of Multivariate Analysis 30(1), 1–16.
Srebro, N. and T. Jaakkola (2003). Weighted low-rank approximations. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 720–727.
Wolfe, P. J. and S. C. Olhede (2013). Nonparametric graphon estimation. arXiv preprint arXiv:1309.5936.
Xiong, R. and M. Pelger (2019). Large dimensional latent factor modeling with missing observations and applications to causal inference.
Xu, J., L. Massoulié, and M. Lelarge (2014). Edge label inference in generalized stochastic block models: From spectral theory to impossibility results.
Xu, Y. (2017). Generalized synthetic control method: Causal inference with interactive fixed effects models. Political Analysis 25(1), 57–76.
Yin, Y. Q., Z. D. Bai, and P. Krishnaiah (1988). On the limit of the largest eigenvalue of the large-dimensional sample covariance matrix. Probability Theory and Related Fields 78, 509–521.
Zeleneev, A. (2020). Identification and estimation of network models with nonparametric unobserved heterogeneity.
A Proofs
A.1 Proof of Lemma 1
We start with a preliminary result that relates the nuclear norm of $\Gamma^{\infty}(x)$ to the sum of the singular values of the function $(a, b) \mapsto m(x, a, b)$. This link will be useful to bound the approximation error of $\widehat{\Gamma}(x)$. We define
\[
\|m(x, \cdot, \cdot)\|_* := \sum_{j=1}^{\infty} s_j(x).
\]

Lemma 2.
Let Assumptions 2 and 3 hold. Then, as $N, T \to \infty$,
\[
\|\Gamma^{\infty}(x)\|_* \le \sqrt{NT}\, \|m(x, \cdot, \cdot)\|_* + o_P(\sqrt{NT}) = O_P(\sqrt{NT}).
\]

Lemma 2 implies that $\|\Gamma^{\infty}(x)\|_*$ grows with $N$ and $T$ at the same rate as for any low-rank matrix $M$ with elements that are of order one with bounded second moments, since then
\[
\|M\|_* \le \sqrt{\operatorname{rank}(M)}\, \|M\|_F = \sqrt{\operatorname{rank}(M) \textstyle\sum_{i=1}^{N} \sum_{t=1}^{T} M_{it}^2} = O_P(\sqrt{NT}).
\]
This result will be useful for the proofs of Lemma 1 and of Theorem 1. The proof of Lemma 2 is provided in Appendix A.4.

The following technical lemma provides the key step in the proof of Lemma 1 in the main text.
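The low-rank bound $\|M\|_* \le \sqrt{\operatorname{rank}(M)}\, \|M\|_F$ and the resulting $O_P(\sqrt{NT})$ rate can be checked numerically; the sizes and rank below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, R = 300, 200, 4
M = rng.normal(size=(N, R)) @ rng.normal(size=(R, T))  # rank-R matrix, O(1) entries

nuc = np.linalg.norm(M, "nuc")         # nuclear norm: sum of singular values
fro = np.linalg.norm(M, "fro")         # Frobenius norm
assert nuc <= np.sqrt(R) * fro + 1e-8  # ||M||_* <= sqrt(rank(M)) * ||M||_F
print(nuc / np.sqrt(N * T))            # remains bounded as N and T grow
```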
Lemma 3.
Under Assumptions 2 and 3,
\[
\frac{1}{n(x)} \sum_{(i,t) \in D(x)} \Big(\widehat{\Gamma}_{it}(x) - \Gamma^{\infty}_{it}(x)\Big)^2
\le \frac{2\rho\, \|\Gamma^{\infty}(x)\|_*}{n(x)} - \frac{2}{n(x)} \sum_{(i,t) \in D(x)} \Gamma^{\infty}_{it}(x)\, E_{it},
\]
for all $\rho \ge \|E(x)\|_{\infty}$.

Notice that Lemma 3 is a non-stochastic finite-sample result, which only requires that $E_{it}(x)$ and $\widehat{\Gamma}(x)$ are as defined in (12) and (13). The proof of Lemma 3 is provided in Appendix A.4. We are now ready to provide the proof of the lemma in the main text.

Proof of Lemma 1.
The definition of $E_{it}(x)$ in (12) guarantees that $E[E_{it}(x) \mid A^N, B^T, X^{NT}] = 0$, and Assumption 4 furthermore guarantees that $E_{it}(x)$ is independent across $i$ and $t$ and has a finite fourth moment, conditional on $X^{NT}$, $A^N$ and $B^T$. Furthermore, $\Gamma^{\infty}_{it}(x) = m(x, A_i, B_t)$ only depends on $A^N$ and $B^T$. We therefore find
\[
E\Bigg[\bigg(\frac{1}{n(x)} \sum_{(i,t) \in D(x)} \Gamma^{\infty}_{it}(x)\, E_{it}\bigg)^{\!2} \,\Bigg|\, A^N, B^T, X^{NT}\Bigg]
= \frac{1}{n^2(x)} \sum_{(i,t) \in D(x)} [\Gamma^{\infty}_{it}(x)]^2\, E\big[E_{it}^2 \mid A^N, B^T, X^{NT}\big]
\le \frac{b^{1/2}}{n^2(x)} \sum_{(i,t) \in D(x)} [\Gamma^{\infty}_{it}(x)]^2 = O_P(1/n(x)),
\]
where $b$ is the constant from Assumption 4. From this we conclude that
\[
\frac{1}{n(x)} \sum_{(i,t) \in D(x)} \Gamma^{\infty}_{it}(x)\, E_{it} = O_P\big(n^{-1/2}(x)\big) = o_P(1). \tag{18}
\]
Next, applying Assumption 4 and Theorem 2 in Latała (2005) we find
\[
E\big[\|E(x)\|_{\infty} \mid A^N, B^T, X^{NT}\big]
\le C \Bigg\{ \max_t \sqrt{\sum_i E\big[E_{it}^2(x) \mid A^N, B^T, X^{NT}\big]}
+ \max_i \sqrt{\sum_t E\big[E_{it}^2(x) \mid A^N, B^T, X^{NT}\big]}
+ \bigg(\sum_{i,t} E\big[E_{it}^4(x) \mid A^N, B^T, X^{NT}\big]\bigg)^{\!1/4} \Bigg\}
\le C\, b^{1/4} \Big\{ \sqrt{N} + \sqrt{T} + n(x)^{1/4} \Big\} = O_P\big(\sqrt{N+T}\big),
\]
where $C$ is a universal constant. We therefore have $\|E(x)\|_{\infty} = O_P(\sqrt{N+T})$, and since we assume that $\rho = \rho_{NT}$ satisfies $\rho_{NT}/\sqrt{N+T} \to \infty$, we conclude that $\rho_{NT} \ge \|E(x)\|_{\infty}$ with probability approaching one. We can therefore apply Lemma 3 to find that, with probability approaching one,
\[
\frac{1}{n(x)} \sum_{(i,t) \in D(x)} \Big(\widehat{\Gamma}_{it}(x) - \Gamma^{\infty}_{it}(x)\Big)^2
\le \frac{2\rho_{NT}\, \|\Gamma^{\infty}(x)\|_*}{n(x)} - \frac{2}{n(x)} \sum_{(i,t) \in D(x)} \Gamma^{\infty}_{it}(x)\, E_{it}
= \frac{2\rho_{NT}\, O_P(\sqrt{NT})}{n(x)} + o_P(1) = o_P(1),
\]
where we applied (18) and Lemma 2, as well as the condition $\rho_{NT}\sqrt{NT}/n(x) \to 0$. $\square$

A.2 Proof of Theorem 1

In the following, consider a generic reduced form parameter
\[
\nu(x) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} W_{it}(x)\, \Gamma^{\infty}_{it}(x), \tag{19}
\]
with corresponding estimator
\[
\widehat{\nu}(x) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} W_{it}(x)\, \widehat{\Gamma}_{it}(x), \tag{20}
\]
where the $W_{it}(x)$ are given weights. The following proposition provides a finite-sample non-stochastic bound for the error of this reduced form estimator.

Proposition 1.
Let Assumptions 2, 3 and 4 hold. Let $P_{it}(x)$ be non-zero real numbers for all $(i,t) \in \mathcal{N} \times \mathcal{T}$. Define
\[
V_{it}(x) := \frac{W_{it}(x)\, P^{-1}_{it}(x)\, \big(D_{it}(x) - P_{it}(x)\big)}{\frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} W^2_{it}(x)\, P^{-1}_{it}(x)},
\qquad
c_0 := \frac{1 - \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} W_{it}(x)\, P^{-1}_{it}(x)\, V_{it}(x)}{\frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} W^2_{it}(x)\, P^{-1}_{it}(x)},
\]
\[
c_1 := \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} V_{it}(x)\, \Gamma^{\infty}_{it}(x),
\qquad
c_2 := \frac{2\rho}{c_0\, NT}\, \|\Gamma^{\infty}(x)\|_* - \frac{2}{c_0\, NT} \sum_{(i,t) \in D(x)} E_{it}(x)\, \Gamma^{\infty}_{it}(x) + \bigg(\frac{c_1}{c_0}\bigg)^{\!2},
\]
\[
c_3 := \sqrt{c_2} + \frac{|c_1|}{c_0},
\]
and let $V(x)$ be the $N \times T$ matrix with elements $V_{it}(x)$. If $c_0 > 0$ and $\rho > \|E(x)\|_{\infty} + c_3\, \|V(x)\|_{\infty}$, then $|\widehat{\nu}(x) - \nu(x)| \le c_3$.

The proof of Proposition 1 is provided in Appendix A.4. Proposition 1 is the key step required for the proof of Theorem 1. However, before proving this main text result we want to provide an informal remark on the usefulness of Proposition 1 more generally.
Remark 2 (Consistency of $\widehat{\nu}(x)$). Proposition 1 holds for all $P_{it}(x) \in \mathbb{R} \setminus \{0\}$, but for the proposition to be useful in showing consistency of $\widehat{\nu}(x)$ we need to choose $P_{it}(x)$ such that $c_3$ and $\|V(x)\|_{\infty}$ are not too large. The easiest way to guarantee this is to consider $X_{it}$ to be random and weakly correlated across both $i$ and $t$, and to define $P_{it}(x)$ as the propensity score, that is, $P_{it}(x) = \Pr\big(X_{it} = x \mid A^N, B^T\big)$, which is assumed to be positive and not too small; e.g. we need that
\[
q := \bigg[ \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} W^2_{it}(x)\, P^{-1}_{it}(x) \bigg]^{-1}
\]
converges to some positive constant. Then $V_{it}(x)$ has mean zero, analogous to $E_{it}(x)$, and
\[
c_0 = q + O_P(1/\sqrt{NT}), \qquad c_1 = O_P(1/\sqrt{NT}),
\]
\[
c_2 = \frac{2\rho q}{NT}\, \|\Gamma^{\infty}(x)\|_* + O_P(1/\sqrt{NT}),
\qquad
c_3 = \sqrt{\frac{2\rho q}{NT}\, \|\Gamma^{\infty}(x)\|_*} + \text{smaller order terms}.
\]
Thus, if, as in Lemma 1, $\rho = \rho_{NT}$ is such that $\rho_{NT}/\sqrt{N+T} \to \infty$ and $\rho_{NT}/\sqrt{NT} \to 0$ as $N, T \to \infty$, then $\widehat{\nu}(x) = \nu(x) + o_P(1)$.

The following proof formalizes this heuristic argument for the case that $W_{it}(x) = 1$.

Proof of Theorem 1.
Let $W_{it}(x) = 1$, and let $\nu(x)$ and $\widehat{\nu}(x)$ be as defined in (19) and (20) above. We then have $\mu(x) = \nu(x)$ and
\[
\widehat{\mu}(x) = \widehat{\nu}(x) + \frac{1}{NT} \sum_{(i,t) \in D(x)} E_{it}(x) - \frac{1}{NT} \sum_{(i,t) \in D(x)} \Big[\widehat{\Gamma}_{it}(x) - \Gamma^{\infty}_{it}(x)\Big]. \tag{21}
\]
We drop all the arguments $x$ in the rest of this proof. We want to apply Proposition 1 with $P_{it} = \Pr\big(X_{it} = x \mid A^N, B^T\big) > 0$.
Let $G_{it} = P^{-1}_{it}(D_{it} - P_{it})$ be as defined in Theorem 1, and also define $q := \big[\frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} P^{-1}_{it}\big]^{-1}$. Since $P_{it} \in [0,1]$ we also have $q \in [0,1]$, and $q^{-1} = O_P(1)$.
Using Lemma 2 we know that $\|\Gamma^{\infty}\|_* = O_P(\sqrt{NT})$, and we have already found that $\sum_{(i,t) \in D} \Gamma^{\infty}_{it} E_{it} = O_P(n^{1/2})$ in (18) above. Using this together with the other assumptions in the theorem we find that $V_{it} = q\, G_{it}$ and
\[
c_0 = q \bigg(1 - \frac{q}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} P^{-1}_{it} G_{it}\bigg) = q\, [1 - o_P(1)],
\qquad
c_1 = \frac{q}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} G_{it}\, \Gamma^{\infty}_{it} = o_P(1),
\]
\[
c_2 = \frac{2\rho\, O_P(\sqrt{NT})}{c_0\, NT} + \frac{O_P(n^{1/2})}{c_0\, NT} + \bigg(\frac{c_1}{c_0}\bigg)^{\!2} = o_P(1),
\qquad
c_3 = \sqrt{c_2} + \frac{|c_1|}{c_0} = o_P(1).
\]
We furthermore have
\[
\|V\|_{\infty} = q\, \|G\|_{\infty} = O_P(1)\, O_P(\sqrt{N+T}) = O_P(\sqrt{N+T}).
\]
In the proof of Lemma 1 we already argued that $\|E\|_{\infty} = O_P(\sqrt{N+T})$. Since we assume that $\rho = \rho_{NT}$ satisfies $\rho_{NT}/\sqrt{N+T} \to \infty$, we conclude that $\rho > \|E\|_{\infty} + c_3\, \|V\|_{\infty}$ with probability approaching one. We can therefore apply Proposition 1 to find that, with probability approaching one,
\[
|\widehat{\nu} - \nu| \le c_3 = o_P(1).
\]
We have thus shown that $\widehat{\nu} = \nu + o_P(1)$.

Furthermore, analogous to the result in (18), we can show that $\sum_{(i,t) \in D} E_{it} = O_P(n^{1/2})$, and we therefore have $\frac{1}{NT} \sum_{(i,t) \in D} E_{it} = o_P(1)$. Finally, applying Lemma 1, we know that
\[
\bigg(\frac{1}{n} \sum_{(i,t) \in D} \Big(\widehat{\Gamma}_{it} - \Gamma^{\infty}_{it}\Big)\bigg)^{\!2}
\le \frac{1}{n} \sum_{(i,t) \in D} \Big(\widehat{\Gamma}_{it} - \Gamma^{\infty}_{it}\Big)^2 = o_P(1),
\]
and therefore $\frac{1}{NT} \sum_{(i,t) \in D(x)} \big[\widehat{\Gamma}_{it}(x) - \Gamma^{\infty}_{it}(x)\big] = o_P(1)$. Plugging these results into (21) we find $\widehat{\mu}(x) = \mu(x) + o_P(1)$. $\square$

A.3 Proof of Theorem 2

In this section we present and prove a more general version of Theorem 2. Let $\phi_i = \phi(x, A_i)$ and $\psi_t = \psi(x, B_t)$ be transformations of $A_i$ and $B_t$, and let $\widehat{\phi}_i$ and $\widehat{\psi}_t$ be corresponding estimators. In the main text we presented the special case where $\widehat{\phi}_i$ and $\widehat{\psi}_t$ were equal to the factor loadings and factors obtained from $\widehat{\Gamma}(x)$, but many other choices of $\widehat{\phi}_i$ and $\widehat{\psi}_t$ are conceivable. We again define
\[
\mathcal{N}_i = \Big\{ j \in \mathcal{N} \setminus \{i\} : \big\|\widehat{\phi}_i - \widehat{\phi}_j\big\| \le \tau_{NT} \Big\},
\qquad
\mathcal{T}_t = \Big\{ s \in \mathcal{T} \setminus \{t\} : \big\|\widehat{\psi}_t - \widehat{\psi}_s\big\| \le \upsilon_{NT} \Big\},
\]
for some bandwidth parameters $\tau_{NT} > 0$ and $\upsilon_{NT} > 0$.
A debiased estimator of the reduced form parameter in (19) is given by
\[
\widetilde{\nu}(x) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} W_{it}(x)\, \widetilde{Y}_{it}(x),
\]
where $\widetilde{Y}_{it}(x)$ is defined as in (16). In the main text we only discussed the special case $W_{it}(x) = 1$. We can write $\widetilde{\nu}(x)$ as
\[
\widetilde{\nu}(x) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \omega_{it}\, Y_{it},
\]
where the weights $\omega_{it}$ are functions of $\widehat{\phi}_j$ and $\widehat{\psi}_s$ for all $j \in \mathcal{N}$ and $s \in \mathcal{T}$. Assumption 5 in the main text is generalized as follows.

Assumption 6.
There exists a sequence $\xi_{NT} > 0$ such that $\xi_{NT} \to 0$ as $N, T \to \infty$, and

(i) $\frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} W_{it}(x)\, \mathbb{1}\{X_{it} \neq x \;\&\; n_{it} = 0\} = O_P(\xi_{NT})$.

(ii) $Y_{it}$ and $W_{it}(x)$ are uniformly bounded over $i, t, N, T$.

(iii) $Y_{it}$ is independent across both $i$ and $t$, conditional on $X^{NT}, A^N, B^T$.

(iv) The function $(a, b) \mapsto m(x, a, b)$ is twice continuously differentiable with uniformly bounded second derivatives.

(v) There exists $c > 0$ such that $\|a_1 - a_2\| \le c\, \|\phi(a_1) - \phi(a_2)\|$ for all $a_1, a_2 \in \mathcal{A}$, and $\|b_1 - b_2\| \le c\, \|\psi(b_1) - \psi(b_2)\|$ for all $b_1, b_2 \in \mathcal{B}$.

(vi) $\frac{1}{N} \sum_{i=1}^{N} \Big( \|\widehat{\phi}_i - \phi_i\|^2 + \max_{j \in \mathcal{N}_i} \|\widehat{\phi}_j - \phi_j\|^2 \Big) = O_P(\xi_{NT}^2)$ and $\frac{1}{T} \sum_{t=1}^{T} \Big( \|\widehat{\psi}_t - \psi_t\|^2 + \max_{s \in \mathcal{T}_t} \|\widehat{\psi}_s - \psi_s\|^2 \Big) = O_P(\xi_{NT}^2)$.

(vii) $\tau_{NT} = O_P(\xi_{NT})$ and $\upsilon_{NT} = O_P(\xi_{NT})$.

(viii) $\frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} E\big[\omega_{it}^2 \mid X^{NT}, A^N, B^T\big] = O_P(NT\, \xi_{NT}^2)$.

(ix) Let $Y^{NT}_{-(i,t),-(j,s)}$ be the outcome matrix $Y^{NT}$, but with $Y_{it}$ and $Y_{js}$ replaced by zero (or some other non-random number), and all other outcomes unchanged. We assume
\[
\frac{1}{(NT)^2} \sum_{i,j=1}^{N} \sum_{t,s=1}^{T} \mathbb{1}\{(i,t) \neq (j,s)\}\,
E\Big[ \big| \omega_{it}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, \omega_{js}\big(Y^{NT}_{-(i,t),-(j,s)}\big) - \omega_{it}\big(Y^{NT}\big)\, \omega_{js}\big(Y^{NT}\big) \big| \,\Big|\, X^{NT}, A^N, B^T \Big] = O_P\big(\xi_{NT}^2\big).
\]

The generalized version of Theorem 2 is given in the following.
Theorem 3.
Under Assumptions 1 and 6, $\widetilde{\nu}(x) - \nu(x) = O_P(\xi_{NT})$.

Proof of Theorem 3 (containing Theorem 2 as a special case).
Define $m_{it}(x) := m(x, A_i, B_t)$. We decompose
\[
\widetilde{\nu}(x) - \nu(x) = e_1(x) + e_2(x) + e_3(x), \tag{22}
\]
where
\[
e_1(x) := \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} W_{it}(x)\, \mathbb{1}\{X_{it} \neq x \;\&\; n_{it} = 0\}\, [m_{it}(X_{it}) - m_{it}(x)],
\]
\[
e_2(x) := \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{1}\{X_{it} \neq x \;\&\; n_{it} > 0\}\, W_{it}(x)\, e_{2,it}(x),
\qquad
e_{2,it}(x) := \frac{\sum_{j \in \mathcal{N}_i} \sum_{s \in \mathcal{T}_t} \mathbb{1}\{X_{is} = X_{jt} = X_{js} = x\}\, [m_{is}(x) + m_{jt}(x) - m_{js}(x) - m_{it}(x)]}{\sum_{j \in \mathcal{N}_i} \sum_{s \in \mathcal{T}_t} \mathbb{1}\{X_{is} = X_{jt} = X_{js} = x\}},
\]
and
\[
e_3(x) := \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \omega_{it}\, E_{it}.
\]
In the following we consider $e_1(x)$, $e_2(x)$ and $e_3(x)$ separately.

Bound on $e_1(x)$: Assumptions 6(i) and (ii) guarantee that
\[
|e_1(x)| \le \Big( \max_{i,t} |m_{it}(X_{it}) - m_{it}(x)| \Big)\, \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} W_{it}(x)\, \mathbb{1}\{X_{it} \neq x \;\&\; n_{it} = 0\} = O_P(\xi_{NT}). \tag{23}
\]

Bound on $e_2(x)$: Assumption 6(iv) guarantees that there exists a constant $b > 0$ such that
\[
\bigg| m(x, a, b) - m(x, A_i, B_t) - (a - A_i)'\, \frac{\partial m(x, A_i, B_t)}{\partial A_i} - (b - B_t)'\, \frac{\partial m(x, A_i, B_t)}{\partial B_t} \bigg|
\le b\, \big( \|a - A_i\|^2 + \|b - B_t\|^2 \big).
\]
Using this we find that
\[
|m_{is}(x) + m_{jt}(x) - m_{js}(x) - m_{it}(x)| \le 2b\, \big( \|A_i - A_j\|^2 + \|B_t - B_s\|^2 \big),
\]
and therefore
\[
|e_{2,it}(x)| \le \frac{2b \sum_{j \in \mathcal{N}_i} \sum_{s \in \mathcal{T}_t} \mathbb{1}\{X_{is} = X_{jt} = X_{js} = x\}\, \big( \|A_i - A_j\|^2 + \|B_t - B_s\|^2 \big)}{\sum_{j \in \mathcal{N}_i} \sum_{s \in \mathcal{T}_t} \mathbb{1}\{X_{is} = X_{jt} = X_{js} = x\}}
\le 2b\, \Big( \max_{j \in \mathcal{N}_i} \|A_i - A_j\|^2 + \max_{s \in \mathcal{T}_t} \|B_t - B_s\|^2 \Big).
\]
We thus find
\[
|e_2(x)| \le 2b\, \Big( \max_{i,t} |W_{it}(x)| \Big) \bigg( \frac{1}{N} \sum_{i=1}^{N} \max_{j \in \mathcal{N}_i} \|A_i - A_j\|^2 + \frac{1}{T} \sum_{t=1}^{T} \max_{s \in \mathcal{T}_t} \|B_t - B_s\|^2 \bigg)
\le 2b c^2 \Big( \max_{i,t} |W_{it}(x)| \Big) \bigg( \frac{1}{N} \sum_{i=1}^{N} \max_{j \in \mathcal{N}_i} \|\phi(A_i) - \phi(A_j)\|^2 + \frac{1}{T} \sum_{t=1}^{T} \max_{s \in \mathcal{T}_t} \|\psi(B_t) - \psi(B_s)\|^2 \bigg)
= 2b c^2 \Big( \max_{i,t} |W_{it}(x)| \Big) \bigg( \frac{1}{N} \sum_{i=1}^{N} \max_{j \in \mathcal{N}_i} \|\phi_i - \phi_j\|^2 + \frac{1}{T} \sum_{t=1}^{T} \max_{s \in \mathcal{T}_t} \|\psi_t - \psi_s\|^2 \bigg).
\]
Using the triangle inequality, the definition of $\mathcal{N}_i$, and the general inequality $(x_1 + x_2 + x_3)^2 \le 3\,(x_1^2 + x_2^2 + x_3^2)$ for $x_1, x_2, x_3 \in \mathbb{R}$, we have
\[
\max_{j \in \mathcal{N}_i} \|\phi_i - \phi_j\|^2
\le \max_{j \in \mathcal{N}_i} \Big( \|\widehat{\phi}_i - \widehat{\phi}_j\| + \|\widehat{\phi}_i - \phi_i\| + \|\widehat{\phi}_j - \phi_j\| \Big)^{\!2}
\le \max_{j \in \mathcal{N}_i} \Big( \tau_{NT} + \|\widehat{\phi}_i - \phi_i\| + \|\widehat{\phi}_j - \phi_j\| \Big)^{\!2}
\le 3\tau_{NT}^2 + 3\, \|\widehat{\phi}_i - \phi_i\|^2 + 3 \max_{j \in \mathcal{N}_i} \|\widehat{\phi}_j - \phi_j\|^2,
\]
and analogously
\[
\max_{s \in \mathcal{T}_t} \|\psi_t - \psi_s\|^2 \le 3\upsilon_{NT}^2 + 3\, \|\widehat{\psi}_t - \psi_t\|^2 + 3 \max_{s \in \mathcal{T}_t} \|\widehat{\psi}_s - \psi_s\|^2.
\]
We thus obtain
\[
|e_2(x)| \le 6 b c^2 \Big( \max_{i,t} |W_{it}(x)| \Big) \bigg\{ \tau_{NT}^2 + \upsilon_{NT}^2
+ \frac{1}{N} \sum_{i=1}^{N} \Big( \|\widehat{\phi}_i - \phi_i\|^2 + \max_{j \in \mathcal{N}_i} \|\widehat{\phi}_j - \phi_j\|^2 \Big)
+ \frac{1}{T} \sum_{t=1}^{T} \Big( \|\widehat{\psi}_t - \psi_t\|^2 + \max_{s \in \mathcal{T}_t} \|\widehat{\psi}_s - \psi_s\|^2 \Big) \bigg\} = O_P(\xi_{NT}^2). \tag{24}
\]

Bound on $e_3(x)$: We have
\[
[e_3(x)]^2 = \frac{1}{(NT)^2} \sum_{i,j=1}^{N} \sum_{t,s=1}^{T} \omega_{it}(Y^{NT})\, \omega_{js}(Y^{NT})\, E_{it}\, E_{js} = T_1 + T_2 + T_3,
\]
where
\[
T_1 := \frac{1}{(NT)^2} \sum_{i=1}^{N} \sum_{t=1}^{T} \omega_{it}^2(Y^{NT})\, E_{it}^2,
\]
\[
T_2 := \frac{1}{(NT)^2} \sum_{i,j=1}^{N} \sum_{t,s=1}^{T} \mathbb{1}\{(i,t) \neq (j,s)\}
\Big[ \omega_{it}(Y^{NT})\, \omega_{js}(Y^{NT}) - \omega_{it}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, \omega_{js}\big(Y^{NT}_{-(i,t),-(j,s)}\big) \Big]\, E_{it}\, E_{js},
\]
\[
T_3 := \frac{1}{(NT)^2} \sum_{i,j=1}^{N} \sum_{t,s=1}^{T} \mathbb{1}\{(i,t) \neq (j,s)\}\, \omega_{it}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, \omega_{js}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, E_{it}\, E_{js}.
\]
We have
\[
E\big[T_1 \mid X^{NT}, A^N, B^T\big] \le \Big( \max_{i,t} E_{it}^2 \Big)\, \frac{1}{(NT)^2} \sum_{i=1}^{N} \sum_{t=1}^{T} E\big[\omega_{it}^2(Y^{NT}) \mid X^{NT}, A^N, B^T\big] = O_P(\xi_{NT}^2),
\]
\[
\big| E\big[T_2 \mid X^{NT}, A^N, B^T\big] \big| \le \Big( \max_{i,t} E_{it}^2 \Big)\, \frac{1}{(NT)^2} \sum_{i,j=1}^{N} \sum_{t,s=1}^{T} \mathbb{1}\{(i,t) \neq (j,s)\}\,
E\Big[ \big| \omega_{it}(Y^{NT})\, \omega_{js}(Y^{NT}) - \omega_{it}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, \omega_{js}\big(Y^{NT}_{-(i,t),-(j,s)}\big) \big| \,\Big|\, X^{NT}, A^N, B^T \Big] = O_P(\xi_{NT}^2),
\]
where we used that $Y_{it}$ (and thus $E_{it}$) is uniformly bounded, together with Assumptions 6(viii) and (ix). Next, for $(i,t) \neq (j,s)$ we have
\[
E\Big[ \omega_{it}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, \omega_{js}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, E_{it}\, E_{js} \,\Big|\, Y^{NT}_{-(i,t),-(j,s)}, X^{NT}, A^N, B^T \Big]
= \omega_{it}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, \omega_{js}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\,
E\big[E_{it} \mid Y^{NT}_{-(i,t),-(j,s)}, X^{NT}, A^N, B^T\big]\, E\big[E_{js} \mid Y^{NT}_{-(i,t),-(j,s)}, X^{NT}, A^N, B^T\big] = 0,
\]
where we used $E[E_{it} \mid X^{NT}, A^N, B^T] = 0$ together with the assumption that $Y_{it}$ (and thus $E_{it}$) is independent across both $i$ and $t$, conditional on $X^{NT}, A^N, B^T$. By the law of iterated expectations, the last display also implies that for $(i,t) \neq (j,s)$ we have
\[
E\Big[ \omega_{it}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, \omega_{js}\big(Y^{NT}_{-(i,t),-(j,s)}\big)\, E_{it}\, E_{js} \,\Big|\, X^{NT}, A^N, B^T \Big] = 0,
\]
and therefore $E\big[T_3 \mid X^{NT}, A^N, B^T\big] = 0$. Combining these results on $T_1$, $T_2$ and $T_3$ we obtain
\[
E\big\{ [e_3(x)]^2 \mid X^{NT}, A^N, B^T \big\} = O_P(\xi_{NT}^2),
\]
which implies $e_3(x) = O_P(\xi_{NT})$. Together with (22), (23) and (24), this gives the statement of the theorem. $\square$

A.4 Proof of Intermediate Results
Proof of Lemma 2.
Let $u_j(x)$ be the $N$-vector with elements $u_j(x, A_i)$, and let $v_j(x)$ be the $T$-vector with elements $v_j(x, B_t)$. Then we have $\Gamma^{\infty}(x) = \sum_{j=1}^{\infty} s_j(x)\, u_j(x)\, v_j'(x)$, and therefore
\[
\|\Gamma^{\infty}(x)\|_* \le \sum_{j=1}^{\infty} s_j(x)\, \|u_j(x)\|\, \|v_j(x)\|
= \sqrt{NT} \sum_{j=1}^{\infty} s_j(x) \sqrt{\frac{1}{N} \sum_{i=1}^{N} [u_j(x, A_i)]^2}\; \sqrt{\frac{1}{T} \sum_{t=1}^{T} [v_j(x, B_t)]^2}
\]
\[
\le \sqrt{NT} \sum_{j=1}^{\infty} s_j(x)\; \frac{1}{2} \bigg( \frac{1}{N} \sum_{i=1}^{N} [u_j(x, A_i)]^2 + \frac{1}{T} \sum_{t=1}^{T} [v_j(x, B_t)]^2 \bigg)
= \sqrt{NT} \sum_{j=1}^{\infty} s_j(x) + \sqrt{NT}\, R_{NT}
= \sqrt{NT}\, \|m(x, \cdot, \cdot)\|_* + \sqrt{NT}\, R_{NT},
\]
where for the second inequality we used that $\sqrt{z_1 z_2} \le \tfrac{1}{2}(z_1 + z_2)$ for all $z_1, z_2 \ge 0$, and we defined $R_{NT} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} r_{it}$, with
\[
r_{it} = \sum_{j=1}^{\infty} s_j(x) \bigg\{ \frac{[u_j(x, A_i)]^2 + [v_j(x, B_t)]^2}{2} - 1 \bigg\}.
\]
Assumption 3 guarantees that $[u_j(x, A_i)]^2$ and $[v_j(x, B_t)]^2$ have mean equal to one, which implies that $r_{it}$ has mean zero. Assumption 2 and the WLLN therefore guarantee that $R_{NT} = o_P(1)$. We have thus shown that $\|\Gamma^{\infty}(x)\|_* \le \sqrt{NT}\, \|m(x, \cdot, \cdot)\|_* + o_P(\sqrt{NT})$, and since $\|m(x, \cdot, \cdot)\|_*$ is finite and non-random we also have $\|\Gamma^{\infty}(x)\|_* = O_P(\sqrt{NT})$. $\square$
The nuclear norm (or trace norm) can be characterized by the duality
\[
\|\Gamma\|_* = \max_{\{M \in \mathbb{R}^{N \times T} :\; \|M\|_{\infty} \le 1\}} \operatorname{Tr}(M' \Gamma),
\qquad
\operatorname{Tr}(M' \Gamma) = \sum_{i=1}^{N} \sum_{t=1}^{T} M_{it}\, \Gamma_{it}. \tag{25}
\]
Our assumption $\rho \ge \|E(x)\|_{\infty}$ guarantees that a possible choice in this maximization is $M = \rho^{-1} E(x)$, and we therefore have
\[
\rho\, \|\Gamma\|_* \ge \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}(x)\, E_{it}(x)\, \Gamma_{it}.
\]
Using $Y_{it} = \Gamma^{\infty}_{it}(x) + E_{it}(x)$ we find that
\[
Q_{NT}(\Gamma, \rho, x) = \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}(x)\, (Y_{it} - \Gamma_{it})^2 + \rho\, \|\Gamma\|_*
\ge \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}(x)\, \big(\Gamma^{\infty}_{it}(x) + E_{it}(x) - \Gamma_{it}\big)^2 + \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}(x)\, E_{it}(x)\, \Gamma_{it}
\]
\[
= \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}(x)\, \big(\Gamma^{\infty}_{it}(x) - \Gamma_{it}\big)^2 + \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}(x)\, \Gamma^{\infty}_{it}(x)\, E_{it}(x) + \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}(x)\, E^2_{it}(x).
\]
By definition we have
\[
Q_{NT}\big(\widehat{\Gamma}(x), \rho, x\big) \le Q_{NT}\big(\Gamma^{\infty}(x), \rho, x\big) = \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}(x)\, E^2_{it}(x) + \rho\, \|\Gamma^{\infty}(x)\|_*.
\]
Combining the results in the last two displays gives the statement of the lemma. $\square$
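The dual characterization in (25) can also be verified numerically: for $\Gamma$ with singular value decomposition $U \Sigma V'$, the maximum is attained at $M = U V'$, whose singular values all equal one. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
Gamma = rng.normal(size=(8, 5))

U, s, Vt = np.linalg.svd(Gamma, full_matrices=False)
M_star = U @ Vt                                        # feasible: spectral norm is 1
assert np.isclose(np.trace(M_star.T @ Gamma), s.sum()) # attains ||Gamma||_*

M = rng.normal(size=(8, 5))
M /= np.linalg.norm(M, 2)                              # rescale to spectral norm 1
assert np.trace(M.T @ Gamma) <= s.sum() + 1e-10        # no feasible M does better
```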
Proof of Proposition 1.
In this proof we drop the argument $x$ everywhere, and we define $\theta = NT\,\nu$ and $\theta_0 = NT\,\nu_0$, where $\nu_0$ denotes the true value of the reduced form parameter defined in (19). Define the $NT$-vectors $\gamma = \operatorname{vec}(\Gamma)$, $\gamma^{\infty} = \operatorname{vec}(\Gamma^{\infty})$, $w = \operatorname{vec}(W_{it} : i \in \mathcal{N}, t \in \mathcal{T})$, $d = \operatorname{vec}(D_{it} : i \in \mathcal{N}, t \in \mathcal{T})$, and $p = \operatorname{vec}(P_{it} : i \in \mathcal{N}, t \in \mathcal{T})$. Then $\operatorname{diag}(p)$ is an $NT \times NT$ diagonal matrix. For $\rho > 0$ and $\theta \in \mathbb{R}$ we define
\[
L_{NT}(\theta, \rho) = \min_{\{\Gamma \in \mathbb{R}^{N \times T} :\; \theta = w'\gamma\}} Q_{NT}(\Gamma, \rho),
\]
which is the profile objective function that minimizes $Q_{NT}(\Gamma, \rho)$ over almost all parameters $\Gamma$, only keeping our parameter of interest fixed at $\theta = w'\gamma = \sum_{i=1}^{N} \sum_{t=1}^{T} W_{it}\, \Gamma_{it}$. Our goal is to show that the minimizing value
\[
\widehat{\theta} := \operatorname{argmin}_{\theta \in \mathbb{R}}\, L_{NT}(\theta, \rho) = \sum_{i=1}^{N} \sum_{t=1}^{T} W_{it}\, \widehat{\Gamma}_{it}
\]
is close to $\theta_0 := w'\gamma^{\infty} = \sum_{i=1}^{N} \sum_{t=1}^{T} W_{it}\, \Gamma^{\infty}_{it}$. Using the definition of $Q_{NT}(\Gamma, \rho)$ and $Y_{it} = \Gamma^{\infty}_{it} + E_{it}$ we find that
\[
L_{NT}(\theta_0, \rho) \le Q_{NT}(\Gamma^{\infty}, \rho) = \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}\, E^2_{it} + \rho\, \|\Gamma^{\infty}\|_*. \tag{26}
\]
If for a given value of $\theta = w'\gamma$ the matrix $M(\theta)$ with elements
\[
M_{it}(\theta) := D_{it}\, E_{it} - \frac{w'(\gamma - \gamma^{\infty})}{w'\, \operatorname{diag}(p)^{-1}\, w}\, (D_{it} - P_{it})\, W_{it}\, P^{-1}_{it}
\]
satisfies $\|M(\theta)\|_{\infty} \le \rho$, then by the definition of $\|\cdot\|_*$ in (25) we have $\rho\, \|\Gamma\|_* \ge \operatorname{Tr}(\Gamma' M(\theta)) = \sum_{i=1}^{N} \sum_{t=1}^{T} M_{it}(\theta)\, \Gamma_{it}$. Using this and $Y_{it} = \Gamma^{\infty}_{it} + E_{it}$ we find that
\[
Q_{NT}(\Gamma, \rho) = \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}\, (Y_{it} - \Gamma_{it})^2 + \rho\, \|\Gamma\|_*
\ge \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}\, \big[(\Gamma^{\infty}_{it} - \Gamma_{it}) + E_{it}\big]^2
+ \sum_{i=1}^{N} \sum_{t=1}^{T} \bigg\{ D_{it}\, E_{it} - \frac{(\gamma - \gamma^{\infty})' w}{w'\, \operatorname{diag}(p)^{-1}\, w}\, (D_{it} - P_{it})\, W_{it}\, P^{-1}_{it} \bigg\}\, \Gamma_{it}
\]
\[
= \underbrace{\frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}\, (\Gamma_{it} - \Gamma^{\infty}_{it})^2
- \frac{(\gamma - \gamma^{\infty})' w}{w'\, \operatorname{diag}(p)^{-1}\, w} \sum_{i=1}^{N} \sum_{t=1}^{T} (D_{it} - P_{it})\, W_{it}\, P^{-1}_{it}\, (\Gamma_{it} - \Gamma^{\infty}_{it})}_{=:\; Q^{(\mathrm{low})}_{1,NT}(\Gamma)}
+ \underbrace{\sum_{i=1}^{N} \sum_{t=1}^{T} M_{it}(\theta)\, \Gamma^{\infty}_{it} + \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}\, E^2_{it}}_{=:\; Q^{(\mathrm{low})}_{2,NT}},
\]
where in the last step we added and subtracted $\sum_{i,t} M_{it}(\theta)\, \Gamma^{\infty}_{it}$, and we multiplied out $[(\Gamma^{\infty}_{it} - \Gamma_{it}) + E_{it}]^2$, which leads to some simplifications. Notice that $D_{it}\, E_{it} = E_{it}$ by construction of $E_{it}$, so that some occurrences of $D_{it}$ above could be dropped, but we find it clearer to keep track of $D_{it}$ explicitly here.

Next, we define the $NT \times NT$ idempotent matrices
\[
\mathbf{P} = \frac{\operatorname{diag}(p)^{-1}\, w\, w'}{w'\, \operatorname{diag}(p)^{-1}\, w}
\qquad \text{and} \qquad
\mathbf{R} = I_{NT} - \mathbf{P}.
\]
We then have
\[
Q^{(\mathrm{low})}_{1,NT}(\Gamma)
= \frac{1}{2}\, (\gamma - \gamma^{\infty})'\, \operatorname{diag}(d)\, (\gamma - \gamma^{\infty})
- \frac{(\gamma - \gamma^{\infty})' w}{w'\, \operatorname{diag}(p)^{-1}\, w}\, \big[ w'\, \operatorname{diag}(p)^{-1}\, \operatorname{diag}(d - p)\, (\gamma - \gamma^{\infty}) \big]
\]
\[
= \frac{1}{2}\, (\gamma - \gamma^{\infty})'\, (\mathbf{P}' + \mathbf{R}')\, \operatorname{diag}(d)\, (\mathbf{P} + \mathbf{R})\, (\gamma - \gamma^{\infty})
- (\gamma - \gamma^{\infty})'\, \mathbf{P}'\, \operatorname{diag}(d - p)\, (\mathbf{P} + \mathbf{R})\, (\gamma - \gamma^{\infty})
\]
\[
= \frac{1}{2}\, (\gamma - \gamma^{\infty})'\, \mathbf{R}'\, \operatorname{diag}(d)\, \mathbf{R}\, (\gamma - \gamma^{\infty})
+ \frac{1}{2}\, (\gamma - \gamma^{\infty})'\, \mathbf{P}'\, \operatorname{diag}(2p - d)\, \mathbf{P}\, (\gamma - \gamma^{\infty})
\]
\[
= \frac{1}{2}\, (\gamma - \gamma^{\infty})'\, \mathbf{R}'\, \operatorname{diag}(d)\, \mathbf{R}\, (\gamma - \gamma^{\infty})
+ \frac{1}{2}\, (\gamma - \gamma^{\infty})'\, \mathbf{P}'\, \operatorname{diag}(p - d)\, \mathbf{P}\, (\gamma - \gamma^{\infty})
+ \frac{1}{2}\, \frac{[(\gamma - \gamma^{\infty})' w]^2}{w'\, \operatorname{diag}(p)^{-1}\, w},
\]
where all the mixed terms (that involve both $\mathbf{P}$ and $\mathbf{R}$) cancel because $\mathbf{P}'\, \operatorname{diag}(p)\, \mathbf{R} = 0$, and in the last step we used that $\mathbf{P}'\, \operatorname{diag}(p)\, \mathbf{P} = w\, w' / (w'\, \operatorname{diag}(p)^{-1}\, w)$.
We have
\[
\min_{\{\Gamma \in \mathbb{R}^{N \times T} :\; \theta = w'\gamma\}} (\gamma - \gamma^{\infty})'\, \mathbf{R}'\, \operatorname{diag}(d)\, \mathbf{R}\, (\gamma - \gamma^{\infty}) = 0,
\]
because $\gamma^* = \mathbf{R}\, \gamma^{\infty} + \theta\, \operatorname{diag}(p)^{-1} w / (w'\, \operatorname{diag}(p)^{-1}\, w)$ is a possible choice in the minimization problem, which satisfies $w'\gamma^* = \theta$ and $\mathbf{R}\, (\gamma^* - \gamma^{\infty}) = 0$. We therefore have
\[
\min_{\{\Gamma :\; \theta = w'\gamma\}} Q^{(\mathrm{low})}_{1,NT}(\Gamma)
= \frac{1}{2}\, (\theta - \theta_0)^2 \bigg( \frac{1}{w'\, \operatorname{diag}(p)^{-1}\, w} + \frac{w'\, \operatorname{diag}(p)^{-1}\, \operatorname{diag}(p - d)\, \operatorname{diag}(p)^{-1}\, w}{(w'\, \operatorname{diag}(p)^{-1}\, w)^2} \bigg)
\]
\[
= \frac{1}{2}\, (\theta - \theta_0)^2 \Bigg( \frac{1}{\sum_{i=1}^{N} \sum_{t=1}^{T} W^2_{it}\, P^{-1}_{it}} + \frac{\sum_{i=1}^{N} \sum_{t=1}^{T} W^2_{it}\, P^{-2}_{it}\, (P_{it} - D_{it})}{\big(\sum_{i=1}^{N} \sum_{t=1}^{T} W^2_{it}\, P^{-1}_{it}\big)^2} \Bigg)
= \frac{NT}{2}\, c_0\, (\nu - \nu_0)^2,
\]
with $c_0$ as defined in the statement of the proposition, and $\nu - \nu_0 = (NT)^{-1}(\theta - \theta_0)$. Thus, if $M_{it}(\theta) = D_{it}\, E_{it} - (\nu - \nu_0)\, V_{it}$ satisfies $\|M(\theta)\|_{\infty} \le \rho$, then we have
\[
L_{NT}(\theta, \rho) \ge \min_{\{\Gamma :\; \theta = w'\gamma\}} Q^{(\mathrm{low})}_{1,NT}(\Gamma) + Q^{(\mathrm{low})}_{2,NT}
= \frac{NT}{2}\, c_0\, (\nu - \nu_0)^2 + \sum_{i=1}^{N} \sum_{t=1}^{T} M_{it}(\theta)\, \Gamma^{\infty}_{it} + \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}\, E^2_{it},
\]
and combining this with (26) gives
\[
\frac{2\, [L_{NT}(\theta, \rho) - L_{NT}(\theta_0, \rho)]}{c_0\, NT}
\ge (\nu - \nu_0)^2 + \frac{2}{c_0\, NT} \sum_{i=1}^{N} \sum_{t=1}^{T} M_{it}(\theta)\, \Gamma^{\infty}_{it} - \frac{2\rho}{c_0\, NT}\, \|\Gamma^{\infty}\|_*
= (\nu - \nu_0)^2 + \frac{2}{c_0\, NT} \sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}\, E_{it}\, \Gamma^{\infty}_{it}
- (\nu - \nu_0)\, \frac{2}{c_0\, NT} \sum_{i=1}^{N} \sum_{t=1}^{T} V_{it}\, \Gamma^{\infty}_{it}
- \frac{2\rho}{c_0\, NT}\, \|\Gamma^{\infty}\|_*.
\]
Using the assumption $c_0 > 0$ and the definitions of $c_1$ and $c_2$ in the proposition, this inequality can equivalently be written as
\[
\frac{2\, [L_{NT}(NT\nu, \rho) - L_{NT}(NT\nu_0, \rho)]}{c_0\, NT}
\ge (\nu - \nu_0)^2 - \frac{2 c_1}{c_0}\, (\nu - \nu_0) + \bigg(\frac{c_1}{c_0}\bigg)^{\!2} - c_2
= \bigg( \nu - \nu_0 - \frac{c_1}{c_0} \bigg)^{\!2} - c_2. \tag{27}
\]
Notice that $c_2 \ge 0$, because $\|E\|_{\infty} < \rho$ and therefore $\rho\, \|\Gamma^{\infty}\|_* \ge \sum_{i,t} E_{it}\, \Gamma^{\infty}_{it}$, according to (25).

The inequality in (27) was derived under the assumption that $\|M(NT\nu)\|_{\infty} \le \rho$. Define $\nu^*_+(\varepsilon) \in \mathbb{R}$ and $\nu^*_-(\varepsilon) \in \mathbb{R}$ by
\[
\nu^*_{\pm}(\varepsilon) := \nu_0 \pm (c_3 + \varepsilon),
\qquad \text{for } 0 < \varepsilon \le \frac{\rho - \|E\|_{\infty} - c_3\, \|V\|_{\infty}}{\|V\|_{\infty}}.
\]
The assumption $\|E\|_{\infty} + c_3\, \|V\|_{\infty} < \rho$ guarantees that such an $\varepsilon > 0$ exists. We have
\[
\|M(NT\, \nu^*_{\pm}(\varepsilon))\|_{\infty} = \|E - (\nu^*_{\pm}(\varepsilon) - \nu_0)\, V\|_{\infty}
\le \|E\|_{\infty} + |\nu^*_{\pm}(\varepsilon) - \nu_0|\, \|V\|_{\infty} \le \rho,
\]
where the final inequality follows from the definition of $\nu^*_{\pm}(\varepsilon)$. The condition for (27) is therefore satisfied by $\nu = \nu^*_{\pm}(\varepsilon)$, that is, we have
\[
\frac{2\, \big[L_{NT}(NT\, \nu^*_{\pm}(\varepsilon), \rho) - L_{NT}(NT\, \nu_0, \rho)\big]}{c_0\, NT}
\ge \bigg( \nu^*_{\pm}(\varepsilon) - \nu_0 - \frac{c_1}{c_0} \bigg)^{\!2} - c_2
= \bigg( c_3 + \varepsilon \mp \frac{c_1}{c_0} \bigg)^{\!2} - c_2
= \bigg( \sqrt{c_2} + \varepsilon + \frac{|c_1|}{c_0} \mp \frac{c_1}{c_0} \bigg)^{\!2} - c_2
\ge \big( \sqrt{c_2} + \varepsilon \big)^2 - c_2 > 0,
\]
where we used the definition $c_3 = \sqrt{c_2} + |c_1|/c_0$.

$L_{NT}(NT\nu, \rho)$ is a convex function of $\nu = \theta/(NT)$, because it was obtained via profiling of the convex function $Q_{NT}(\Gamma, \rho)$. The value $\nu_0$ lies in the interval $[\nu^*_-(\varepsilon), \nu^*_+(\varepsilon)]$, and we have shown that $L_{NT}(NT\nu_0, \rho) < L_{NT}(NT\, \nu^*_{\pm}(\varepsilon), \rho)$. It must therefore be the case that the optimal $\widehat{\nu} = \widehat{\theta}/(NT)$ that minimizes $L_{NT}(NT\nu, \rho)$ also lies in the interval $[\nu^*_-(\varepsilon), \nu^*_+(\varepsilon)]$; otherwise we would obtain a contradiction to the convexity of $L_{NT}(NT\nu, \rho)$. Thus, we have shown that $|\widehat{\nu} - \nu_0| \le c_3 + \varepsilon$, and because we can choose $\varepsilon > 0$ arbitrarily small we conclude that $|\widehat{\nu} - \nu_0| \le c_3$, which is what we wanted to show. $\square$