Forward-Selected Panel Data Approach for Program Evaluation
Zhentao Shi and Jingyi Huang
Department of Economics, the Chinese University of Hong Kong
Abstract
Policy evaluation is central to economic data analysis, but economists mostly work with observational data in view of limited opportunities to carry out controlled experiments. In the potential outcome framework, the panel data approach (Hsiao, Ching and Wan, 2012) constructs the counterfactual by exploiting the correlation between cross-sectional units in panel data. The choice of cross-sectional control units, a key step in its implementation, is nevertheless unresolved in data-rich environments where many possible controls are at the researcher's disposal. We propose the forward selection method to choose control units, and establish the validity of post-selection inference. Our asymptotic framework allows the number of possible controls to grow much faster than the time dimension. The easy-to-implement algorithms and their theoretical guarantees extend the panel data approach to big data settings. Monte Carlo simulations are conducted to demonstrate the finite sample performance of the proposed method. Two empirical examples illustrate the usefulness of our procedure when many controls are available in real-world applications.
Key words: aggressive algorithm, average treatment effect, counterfactual analysis, post-selection inference
JEL codes: C13, C21, C23, C38, D73
Zhentao Shi (corresponding author): [email protected], Department of Economics, 912 Esther Lee Building, the Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China. Tel: (852) 3943-1432. Fax: (852) 2603-5805. Shi acknowledges the financial support from the Hong Kong Research Grants Council Early Career Scheme No. 24614817. We thank Cheng Hsiao, Peter Phillips and Yinchu Zhu for helpful comments. All remaining errors are ours.

1 Introduction
A controlled experiment compares outcomes of a treatment group with those from a control group. It is the gold standard for scientific research. While randomized controlled trials are useful in understanding economic mechanisms (Duflo, Glennerster, and Kremer, 2007; Banerjee and Duflo, 2009), for large-scale questions economists mostly have access only to observational datasets. For example, we rarely enjoy the luxury of implementing a controlled experiment in economic research at the national level—such an exercise can be prohibitively expensive or ethically unacceptable. Instead, a counterfactual, the potential outcome that never happens in the real world, is constructed from observational data for policy evaluation.

In view of the lack of genuine control groups in many important empirical economic questions, Hsiao, Ching, and Wan (2012) (HCW, henceforth) propose the panel data approach (PDA) to exploit the correlation between cross-sectional units in estimating the counterfactual. PDA is simply a linear regression on the cross-sectional units in the pre-event data; the estimated coefficients are then used to extrapolate the counterfactual of no policy intervention into the post-event period. Its convenience has attracted many applications and extensions, for example Bai, Li, and Ouyang (2014), Ouyang and Peng (2015), and Ke, Chen, Hong, and Hsiao (2017), to name a few. Compared with the popular difference-in-differences, the combination of control units allows a time-varying treatment effect. Alternatively, Abadie and Gardeazabal (2003) and Abadie, Diamond, and Hainmueller (2010) advocate the synthetic control method (SCM). Hsiao and Zhou (2019) and Gardeazabal and Vega-Bayo (2017) compare PDA and SCM in simulations and empirical applications.

The choice of the control units directly affects PDA's estimation and inference results, and thus a systematic variable selection scheme is of vital importance. HCW experiment with the Akaike information criterion (AIC) and the corrected AIC (AICC), and Du and Zhang (2015) recommend the latter for consistent variable selection. These conventional variable selection methods compute an information criterion for each candidate model and pick the "best subset". However, in PDA the total number of candidate models is $2^N$, where $N$ is the number of available potential control units. In spite of state-of-the-art computing technology, exhaustive search quickly becomes prohibitive for a moderate $N$. The exhaustive enumeration is inapplicable in the era of big data, when the rich-data environment offers information at an unprecedented scale. Furthermore, besides the computational difficulty, a large cross-sectional dimension also challenges PDA's theoretical justification. As PDA is often applied to aggregate data with low-frequency temporal observations in the time dimension, HCW's "fixed $N$, large $T$" asymptotic framework is unlikely to deliver satisfactory approximations in empirical studies where $N$ is comparable to $T$, or even exceeds $T$. To overcome the high dimensionality in practice, Li and Bell (2017) suggest using Lasso but provide no theoretical foundation, and Carvalho, Masini, and Medeiros (2018) develop the Lasso theory under the general framework of the Artificial Counterfactual (ArCo).

Within the PDA framework, this paper studies the estimation and inference of the average treatment effect (ATE) when a large number of candidate cross-sectional control units are present.
We contribute to PDA by formally tackling the control unit selection problem in the $N > T$ context, which is often encountered in real-world applications. In particular, we propose the forward selection method to pick the control units one by one until the iteration is stopped by an information criterion. Forward selection is computationally much more efficient than exhaustive search. For hypothesis testing decisions about the ATE, we suggest calculating the conventional $t$-statistic conditioning on the selected units and then comparing it with the critical value based on the standard normal distribution. This algorithm is very easy to implement and accessible to applied researchers.

Most statistical research on high-dimensional problems is conducted in the environment of independently and identically distributed (i.i.d.) data, which is too restrictive for economic investigations involving temporal observations. Accommodating heterogeneous weakly dependent time series, we establish the statistical theory in an asymptotic framework allowing $N/T \to \infty$ as $N, T \to \infty$. Forward selection achieves dimension reduction by singling out $R$ control units out of the total $N$ candidates, provided $R \to \infty$ and $R/T \to 0$. We show that forward selection is able to attain "nearly-optimal" model fitting relative to the best subset. For the testing of the ATE, our theory validates the seemingly naive practice of standard normal inference, as if the randomness in the selection step can be ignored. To assess the accuracy of the asymptotic approximation, extensive Monte Carlo simulations are conducted to check the finite sample behavior of the ATE estimator and the $t$-statistic.

Forward selection has been studied in ultrahigh-dimensional regressions by Wang (2009) and Zhong, Duan, and Zhu (2017) as a device for model determination. Kozbur (2017, 2018) and Hansen, Kozbur, and Misra (2018) investigate the test-based stopping criterion and post-selection inference. Our paper differs from these studies in that we assume neither the "$\beta$-min condition" nor sparsity for the underlying true coefficients, because our focus lies in the properties of the post-selection ATE, which is an easier statistical object than slope coefficient estimation from training data. The greedy nature of the algorithm is closely related to componentwise boosting (Bühlmann, 2006; Luo and Spindler, 2016a), which is familiar to econometricians (Bai and Ng, 2009; Shi, 2016; Luo and Spindler, 2016b; Fonseca, Medeiros, Vasconcelos, and Veiga, 2018). Alternatively, Carvalho, Masini, and Medeiros (2018)'s ArCo imposes the restricted eigenvalue condition (their Assumption 2), which is crucial for the asymptotic validity of Lasso-type methods in sparse models.

PDA is motivated from a factor model, as to be discussed in Section 2.1. In general, the linear regression induced by the factor model is dense in the regression coefficients. A noticeable difference of this paper from the statistical literature on sparse estimation is that we do not impose sparsity on the regression coefficients in the data generating process (DGP). As a consequence, we carry out variable selection in search of a sparse model to approximate the possibly dense model, and the criterion for model evaluation is not the recovery of the true active variables but the goodness of fit.

This theoretical extension makes it possible to apply PDA to investigate the impact of China's anti-corruption campaign on luxury watch imports.
Anecdotal evidence indicates that luxury watches were popular in China either for bribery or for conspicuous consumption. The raw data witness a slump in luxury watch imports since China's sweeping anti-corruption campaign was launched at the end of 2012. Using the comprehensive United Nations dataset with 88 categories of imported commodities, we assess the effectiveness of the anti-corruption campaign on watch imports.

Plan.
The rest of the paper is organized as follows. Section 2 introduces PDA, describes our new algorithm for variable selection and ATE inference, and presents the asymptotic analysis of this procedure. Section 3 reports the simulation results, and Section 4 carries out two real-data empirical applications for comparison and demonstration. All proofs and extra simulations are relegated to the appendix.

Notation.
We use standard econometric notation. For a real number, $\lceil\cdot\rceil$ is the ceiling function and $\lfloor\cdot\rfloor$ is the floor function. For a square matrix, $(\cdot)^{-}$ is the Moore-Penrose generalized inverse, and $\phi_{\min}(\cdot)$ and $\phi_{\max}(\cdot)$ are the minimal and maximal eigenvalues, respectively. For a discrete set $U$, we denote $|U|$ as its cardinality. For a panel data of $N+1$ cross-sectional units, the index set is denoted $\mathcal{N}_0 := \{0, 1, \ldots, N\}$, in which $j = 0$ indexes the sole treated unit, whereas $\mathcal{N} := \{1, \ldots, N\}$ is the index set of the $N$ control units. In the potential outcome framework, let $y^1_{jt}$ and $y^0_{jt}$ be the outcomes of unit $j$ at time $t$ with and without a policy intervention, respectively. We cannot witness $y^1_{jt}$ and $y^0_{jt}$ simultaneously; instead we observe $y_{jt} = y^0_{jt}(1 - d_{jt}) + y^1_{jt} d_{jt}$, where $d_{jt}$ is a dummy variable equal to 1 if the $j$-th unit is under intervention at time $t$ and 0 otherwise. The time dimension $t \in \mathcal{T} = \{1, \ldots, T\}$ consists of a pre-treatment period $\mathcal{T}_1 = \{1, \ldots, T_1\}$ and a post-treatment period $\mathcal{T}_2 = \{T_1 + 1, \ldots, T\}$, with lengths $T_1 = |\mathcal{T}_1|$ and $T_2 = |\mathcal{T}_2|$. As we work with heterogeneous time series, we define $\bar{E}[x_t] = T_1^{-1}\sum_{t=1}^{T_1} E[x_t]$ as the average of the population means in the pre-treatment period, in which $E[x_t]$ may vary across $t$. Similarly, define $\bar{E}^{(2)}[\cdot] = T_2^{-1}\sum_{t\in\mathcal{T}_2} E[\cdot]$ as the average of the population means of the post-treatment data. For simplicity of presentation, we assume $\bar{E}[y^0_{jt}] = 0$ for all $j \in \mathcal{N}_0$, and the linear regressions in Section 2 do not include an intercept. While incorporating the intercept incurs extra notation, this single additional constant regressor does not affect the asymptotic theory (Bühlmann and van de Geer, 2011, p.104). In real applications we can always accommodate $\bar{E}[y^0_{jt}] \neq 0$ by adding the intercept to the regressions in Algorithm 1 below, and this is what we do in Section 4. In the asymptotic theory, a universal constant is a strictly positive real number independent of the sample sizes.

2 The Panel Data Approach

2.1 Motivation from a Factor Model

PDA is motivated from a factor model. For the completeness of the paper, we briefly summarize HCW's proposal. Consider a standard pure factor model in which all cross-sectional units share at most $K$ common factors: $y^0_{jt} = \lambda_j' f_t + \eta_{jt}$, where $f_t$ is a mean-zero $K \times 1$ vector of latent factors, $\lambda_j$ is a $K \times 1$ factor loading, and $\eta_{jt}$ is a mean-zero idiosyncratic component. Stacking $y^0_t = (y^0_{0t}, \ldots, y^0_{Nt})'$, we write the $(N+1)$-equation system as

\[ y^0_t = \Lambda f_t + \eta_t, \tag{1} \]

where $\Lambda = (\lambda_0, \lambda_1, \ldots, \lambda_N)'$ is the $(N+1) \times K$ factor loading matrix and $\eta_t = (\eta_{0t}, \ldots, \eta_{Nt})'$ is the collection of zero-mean idiosyncratic errors.

HCW assume only one unit is exposed to the policy intervention, so the intervention does not affect the outcomes of the other units $j \in \mathcal{N}$. The treatment effect is

\[ \Delta_t = y^1_{0t} - y^0_{0t}, \quad t \in \mathcal{T}_2. \tag{2} \]

As we only observe $y^1_{0t}$ after the intervention, to evaluate the treatment effect we have to estimate the counterfactual $y^0_{0t}$ for $t \in \mathcal{T}_2$ from the observed data. Li and Bell (2017) show that, based on the factor model, there exists an $N \times 1$ vector $\beta$ such that we can rewrite (1) as

\[ y^0_{0t} = Y_{\mathcal{N}t}' \beta + \varepsilon_t, \quad \text{for } t \in \mathcal{T}, \tag{3} \]

where $Y_{\mathcal{N}t} = (y_{1t}, \ldots, y_{Nt})'$. The linear factor model (1) generates the regression equation (3), which is PDA's workhorse for estimation and inference.
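To see heuristically why the factor model delivers (3), consider the following sketch, which is our own illustration rather than Li and Bell (2017)'s formal argument; it assumes the control units' loading matrix $\Lambda_{\mathcal{N}} = (\lambda_1, \ldots, \lambda_N)'$ has full column rank $K$:

\[
\begin{aligned}
Y_{\mathcal{N}t} &= \Lambda_{\mathcal{N}} f_t + \eta_{\mathcal{N}t} \;\Longrightarrow\; f_t = (\Lambda_{\mathcal{N}}'\Lambda_{\mathcal{N}})^{-1}\Lambda_{\mathcal{N}}'\,(Y_{\mathcal{N}t} - \eta_{\mathcal{N}t}),\\
y^0_{0t} &= \lambda_0' f_t + \eta_{0t} = \underbrace{\lambda_0'(\Lambda_{\mathcal{N}}'\Lambda_{\mathcal{N}})^{-1}\Lambda_{\mathcal{N}}'}_{=:\ \beta'}\, Y_{\mathcal{N}t} + \underbrace{\eta_{0t} - \beta'\eta_{\mathcal{N}t}}_{=:\ \varepsilon_t}.
\end{aligned}
\]

The counterfactual of the treated unit is thus a linear combination of the contemporaneous outcomes of the control units plus a regression error. Strictly speaking, the display only illustrates why such a linear representation is plausible; the formal construction in Li and Bell (2017) defines $\beta$ as the population projection coefficient.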
In this sense, the factor model in HCW merely serves as a motivation and is irrelevant for PDA's implementation. In view of (2) and (3), it is straightforward to construct the counterfactual. With the pre-treatment sub-sample $\mathcal{T}_1$, HCW estimate $\hat\beta$ by OLS or GLS. Then they predict the counterfactual as $\hat y^0_{0t} = Y_{\mathcal{N}t}'\hat\beta$ for $t \in \mathcal{T}_2$ and thereby the treatment effect $\hat\Delta_t = y_{0t} - \hat y^0_{0t}$.

To conduct statistical inference, HCW are interested in the null hypothesis

\[ H_0: \bar{E}^{(2)}[\Delta_t] = 0, \]

that is, the ATE is zero. If we reject the null at a certain significance level, the data provide supportive evidence that the intervention, on average, shifts the mean of the treated unit.

2.2 Forward Selection Algorithm

The estimate of PDA depends on the choice of the control units. When the number of potential controls is large, the information criterion approach encounters computational difficulty in exhaustive search. To solve this problem, we propose an iterative selection method. Let $y_j = (y_{j1}, \ldots, y_{jT_1})'$ be the $j$-th time series, and let the $T_1 \times |U|$ matrix $Y_U = (y_j)_{j \in U}$ stack the $T_1$ temporal observations of a $|U|$-dimensional multivariate random vector, where $U$ is a generic subset of $\mathcal{N}$. In the first iteration, we regress $y_0$ on each $y_j$, $j \in \mathcal{N}$, and choose the one that minimizes the sum of squared residuals. We denote the index of the minimizer as $\hat j_1$ and let $\hat U_1 = \{\hat j_1\}$ be a single-element set. In the $r$-th iteration, where $r = 2, \ldots, R$, we run the least squares regression of $y_0$ against $Y_{\hat U_{r-1}}$ together with one more $y_j$, $j \in \mathcal{N} \setminus \hat U_{r-1}$, choose the one—denoted $\hat j_r$—that minimizes the sum of squared residuals, and incorporate it into the selection set $\hat U_r = \hat U_{r-1} \cup \{\hat j_r\}$. The total number of iterations, $R$, is a tuning parameter specified by the user.

The algorithm is described formally as follows. Let $P_U = Y_U (Y_U' Y_U)^{-} Y_U'$ be the projection matrix for the linear space spanned by $Y_U$, and $P^{\perp}_U = I_{T_1} - P_U$.

Algorithm 1 (Forward selection).
Step 1. Set the initial iteration index as $r = 0$ and the selection set as $\hat U_0 = \emptyset$.
Step 2.1. Update the iteration index $r \leftarrow r + 1$.

Step 2.2. Get $\hat j_r = \arg\min_{j \in \mathcal{N} \setminus \hat U_{r-1}} y_0' P^{\perp}_{\hat U_{r-1} \cup \{j\}} y_0$.

Step 2.3. Update the selected set $\hat U_r = \hat U_{r-1} \cup \{\hat j_r\}$.

Step 3. Repeat Steps 2.1–2.3 until $r > R$.

Remark 1. This is a greedy algorithm that takes the most aggressive direction in each step to reduce the sum of squared residuals conditional on the variables already included. Moreover, once a variable is selected, there is no mechanism to drop it. Greedy algorithms are common in modern machine learning. For example, Breiman (2001) grows regression trees by splitting a single variable each time at the deepest descent, and Bühlmann (2006)'s componentwise boosting also seeks the most greedy variable without adjusting the other coefficients.

After selecting $\hat U_R$, we run OLS of $y_0$ on $Y_{\hat U_R}$ to obtain the coefficient $\hat\beta_{\hat U_R}$ and the prediction $\hat y^0_{0t, \hat U_R} = Y_{\hat U_R, t}' \hat\beta_{\hat U_R}$ for $t \in \mathcal{T}_2$. The treatment effect is estimated as $\hat\Delta_{t, \hat U_R} = y_{0t} - \hat y^0_{0t, \hat U_R}$, $t \in \mathcal{T}_2$. Let

\[ \hat\rho_{\tau, U} = \frac{1}{T_2} \sum_{t, s \in \mathcal{T}_2} \hat\varepsilon_{tU} \hat\varepsilon_{sU} \cdot 1\{|t - s| \le \tau\} \]

be an estimate of the long-run variance, where the tuning parameter $\tau$ is the number of lags included in the estimation, and $\hat\varepsilon_{tU} = y_{0t} - Y_{Ut}'\hat\beta_U$, $t \in \mathcal{T}_2$, is the least-squares regression residual of $y_0$ on $Y_U$. We use the $t$-statistic

\[ Z_{T_2, \hat U_R} = \hat\rho^{-1/2}_{\tau, \hat U_R} \cdot \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} \hat\Delta_{t, \hat U_R}. \tag{4} \]

We will show that, under mild assumptions and under the null hypothesis $H_0$, $Z_{T_2, \hat U_R}$ converges in distribution to the standard normal. Therefore, we reject the null at size $\alpha$ if $|Z_{T_2, \hat U_R}| > \Phi^{-1}(1 - \alpha/2)$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution.

There are two tuning parameters in the procedure: $R$ for the total number of variables and $\tau$ for the long-run variance estimation. We suggest using Wang, Li, and Leng (2009)'s modified BIC criterion to choose $R$, while the choice of $\tau$ has been well studied in the econometrics literature (Newey and West, 1987; Andrews, 1991).

Before we conclude this section, we emphasize that we do not attempt to estimate the factor model directly, for the following reasons. (i) In the PDA framework the factor model is an abstraction independent of the algorithm based on linear regression, and this view is also followed by Li and Bell (2017) and Carvalho, Masini, and Medeiros (2018). (ii) To conduct inference in the factor model, we would need to estimate the $(N+1) \times (N+1)$ covariance matrix, which involves $(N+2)(N+1)/2$ entries, so that sparse matrix estimation techniques would have to be implemented for dimension reduction.
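For concreteness, the following Python sketch implements Algorithm 1 together with the post-selection ATE estimate and the $t$-statistic in (4). It is our own illustration rather than the authors' code; the function and variable names (`forward_select`, `ate_tstat`, `y0_pre`, and so on) are ours, the stopping point $R$ is taken as given (the modified BIC choice is described in Section 3), and, following the zero-mean simplification of the notation section, no intercept is included.

```python
import numpy as np

def forward_select(y0, Y, R):
    """Algorithm 1: greedily add the control unit that minimizes the SSR.

    y0 : (T1,) pre-treatment outcomes of the treated unit
    Y  : (T1, N) pre-treatment outcomes of the candidate control units
    """
    selected = []
    for _ in range(R):
        candidates = [j for j in range(Y.shape[1]) if j not in selected]
        ssr = []
        for j in candidates:
            X = Y[:, selected + [j]]
            b = np.linalg.lstsq(X, y0, rcond=None)[0]
            ssr.append(np.sum((y0 - X @ b) ** 2))  # equals y0' P_perp y0
        selected.append(candidates[int(np.argmin(ssr))])
    return selected

def ate_tstat(y0_pre, Y_pre, y0_post, Y_post, selected, tau):
    """Post-selection ATE estimate and the t-statistic Z in (4)."""
    X_pre, X_post = Y_pre[:, selected], Y_post[:, selected]
    beta = np.linalg.lstsq(X_pre, y0_pre, rcond=None)[0]  # OLS on selected units
    delta = y0_post - X_post @ beta                       # treatment effects
    T2 = delta.size
    # truncated long-run variance with tau lags
    rho = sum(delta[t] * delta[s]
              for t in range(T2) for s in range(T2) if abs(t - s) <= tau) / T2
    return delta.mean(), delta.sum() / np.sqrt(T2 * rho)
```

In applications with non-zero means, a column of ones would be appended to the regressor matrices, as noted in the notation section.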
2.3 Asymptotic Theory

In this section, we analyze the asymptotic guarantees of the algorithm proposed in Section 2.2. In the pre-treatment period, we take $T_1 \to \infty$, and the cross-sectional dimension $N$ is understood as a deterministic function of $T_1$ with $N \to \infty$, $\limsup_{T_1 \to \infty} N/T_1 \in [0, \infty]$ but $\limsup_{T_1 \to \infty} (\log N)/T_1 = 0$. In other words, asymptotically $N$ is allowed to grow at a faster speed than $T_1$ to accommodate high-dimensional settings, but $\log N$ must be dominated by $T_1$.

Next, we impose two high-level assumptions. The first one regularizes the eigenvalues of the Gram matrix. Let $\eta_r = \min_{|U| \le r} \phi_{\min}(\bar{E}[Y_{Ut} Y_{Ut}'])$ for $r \in \mathbb{N}$.

Assumption 1. For any small universal constant $\delta > 0$, there exists a sequence $(R = R_{T_1})$ such that $R^{-1} + R (T_1/\log N)^{-1/2} \to 0$, and $\liminf_{T_1 \to \infty} \eta_{(1+\delta)R} \ge c$ for some universal constant $c > 0$.

In the literature on large-dimensional factor models, it is common to assume $\eta_{N+1}$ bounded away from 0; see, for example, Bai (2003, p.141). Such a minimal eigenvalue condition on the $(N+1) \times (N+1)$ population Gram matrix is relaxed here to any $u \times u$ sub-Gram-matrix with $u = |U| \le (1+\delta)R$. It echoes the restricted eigenvalue condition or the compatibility condition routinely imposed in most high-dimensional regression papers (Bickel, Ritov, and Tsybakov, 2009; Bühlmann and van de Geer, 2011, Section 6.13). More precisely, our version is the sparse Riesz condition as in Zhang and Huang (2008) and Chen and Chen (2008); while these papers set $\delta = 1$, we relax it to any fixed $\delta > 0$. As $R$ diverges to infinity at a rate slower than $(T_1/\log N)^{1/2}$, the sample version of the $u \times u$ Gram matrix $T_1^{-1} \sum_{t \in \mathcal{T}_1} Y_{Ut} Y_{Ut}'$, involving the cross product of the $T_1 \times u$ matrix $Y_U$, is likely to be of full rank when $u \ll T_1$, with the help of the second assumption below about the population second moments as well as their sample counterparts.
Assumption 2. For the pre-treatment period $t \in \mathcal{T}_1$:

(a) $\max_{i,j \in \mathcal{N}_0} \left| T_1^{-1} \sum_{t=1}^{T_1} y_{it} y_{jt} - \bar{E}[y_{it} y_{jt}] \right| = O_p\left(\sqrt{(\log N)/T_1}\right)$;

(b) $\max_{j \in \mathcal{N}_0} \bar{E}[y^2_{jt}] \le C$ for a universal constant $C$.

Assumption 2(a) postulates a uniform convergence rate of the second moments, and (b) is a common assumption of finite population second moments. With independent observations, Belloni, Chen, Chernozhukov, and Hansen (2012) use the self-normalized Cramér-type moderate-deviation theory (Jing, Shao, and Wang, 2003) to establish the uniform probability bound. In the time series context, similar conditions are used in Medeiros and Mendes (2016), Kock and Callot (2015), and Koo, Anderson, Seo, and Yao (2019) under various assumptions on the tail bounds and the serial dependence.

Given the above assumptions, we state the first theoretical result, about the uniform estimation error of the variance. Let $\sigma^2_U$ be the variance of the projection residual in the population model using $Y_{Ut}$ as regressors, and let $\hat\sigma^2_U$ be the sample variance of $(\hat\varepsilon_{tU})_{t \in \mathcal{T}_1}$.
Lemma 1. Under Assumptions 1 and 2, we have $\sup_{|U| \le (1+\delta)R} \left| \hat\sigma^2_U - \sigma^2_U \right| = O_p\left(\sqrt{T_1^{-1} R^2 \log N}\right)$.

Here, in the population model, $P_U y_0$ denotes the projection of $y_0$ onto the closed linear span of $Y_U$, and $\varepsilon_{0U} = P^{\perp}_U y_0 = y_0 - P_U y_0$ is the projection residual, so that $\sigma^2_U = \bar{E}[\varepsilon^2_{0Ut}]$ is the projection residual's population average second moment.

Remark 2. Lemma 1 indicates that, uniformly over any set $U$ with no more than $(1+\delta)R$ elements, if $R$ diverges slowly enough that $R^{-1} + R(T_1/\log N)^{-1/2} \to 0$, then the difference between the sample variance of the residuals $\hat\sigma^2_U$ and its population counterpart $\sigma^2_U$ is negligible in probability.

Now we define our objective for variable selection. Let $U^* = \arg\min_{|U| \le u} \sigma^2_U$ be the best subset of $u$ elements, and let $\sigma^{*2}_u = \sigma^2_{U^*}$ be the corresponding noise level under this best subset. If $U^*$ is not unique, we simply refer to any of them as the best subset, and our analysis is not affected whether or not $U^*$ is unique. It is computationally expensive to locate the best subset $U^*$. Were the population quantity $\sigma^2_U$ known for each $U$, we would have to exhaustively compare the noise levels of $\binom{N}{u}$ models, which is of exponential order in $N$.

Instead of searching for $U^*$, we seek to identify a subset $\hat U_R$ on which $\hat\sigma^2_{\hat U_R}$ approximates the optimal variance $\sigma^{*2}_u$. Theorem 1 below states that the greedy Algorithm 1 selects a set $\hat U_R$ with a regression variance asymptotically as small as that of the desired $u$-element set if $R$ dominates $u$ as $T_1 \to \infty$. The greedy algorithm searches among only $\sum_{r=1}^{R}(N - r + 1) = NR - R(R-1)/2$ models, which is of linear order in $N$. The latter is often computationally much more efficient than the exhaustive search.
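The gap between the two search costs is easy to appreciate numerically. The short computation below is our own illustration, with $N = 100$ and $u = 5$ chosen arbitrarily; it contrasts the exhaustive count $\binom{N}{u}$ with the forward-selection count $NR - R(R-1)/2$ for $R = \lceil u \log\log N \rceil$:

```python
from math import ceil, comb, log

N, u = 100, 5
R = ceil(u * log(log(N)))                         # R = 8 here
exhaustive = comb(N, u)                           # 75,287,520 candidate subsets
greedy = sum(N - r + 1 for r in range(1, R + 1))  # N*R - R*(R-1)/2 = 772 OLS fits
print(exhaustive, greedy)
```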
Theorem 1. Suppose Assumptions 1 and 2 hold. For any sequence $u$ such that $u/R \to 0$, we have $\Pr\left(\hat\sigma^2_{\hat U_R} \le \sigma^{*2}_u + \delta\right) \to 1$ for any small universal constant $\delta > 0$.

Theorem 1 is a nearly-optimal result. It implies that, with high probability, the computationally feasible sample variance $\hat\sigma^2_{\hat U_R}$ is asymptotically no worse, up to an arbitrarily small tolerance $\delta$, than the computationally heavy but theoretically optimal $\sigma^{*2}_u$, the lower bound of the variance associated with the best subset. Such an approximation can be achieved by incorporating $R$ units. Though $R$ is of bigger order than $u$ in the asymptotic sense, if we specify $R = \lceil u \log\log N \rceil$, then the number of OLS regressions in our Algorithm 1 is obviously fewer than $Nu\log\log N$, and $Nu\log\log N \ll \binom{N}{u}$ for a non-trivial $u$ and large $N$.

Remark 3. If the best subset $U^*$ is sparse—for example, in a sparse linear regression with only a few non-zero coefficients satisfying the $\beta$-min condition—Theorem 1 may not be surprising, as those non-zero coefficients would all be selected easily. The novelty of this result lies in that it imposes no sparsity assumption on the regression coefficients of the DGP.
Example 1. Consider a regression equation $y_{0t} = \sum_{j=1}^{N} \beta_j y_{jt} + \varepsilon_t$, where the regressors $y_{jt} \sim \text{i.i.d. } N(0,1)$, the coefficients $\beta_j = c_j/\sqrt{N}$ for non-zero finite constants $c_j \in (0, \infty)$, and $\varepsilon_t$ is independent of the regressors. Since $\beta_j \asymp N^{-1/2} \neq 0$ for all $j$ here, this is an extremely dense regression. When $N/T_1 \to \infty$, it is impossible to accurately estimate all the coefficients. Nevertheless, in this setting Assumptions 1 and 2 are satisfied if $(\log N)/T_1 \to 0$. Thus, according to Theorem 1, our Algorithm 1 picks an $R$-regressor model that dominates the optimal set $U^*$ in terms of the associated population variance as long as $R/u \to \infty$ and $R/(T_1/\log N)^{1/2} \to 0$, even if $u \to \infty$.

Remark 4. The key technical innovation is Lemma A.1 in the Appendix, an inequality concerning the increment of the greedy algorithm. The result relies on Assumption 1, which is a natural implication of standard factor models in a high-dimensional setting (Bai, 2003).

After variable selection via forward selection, we use $Y_{\hat U_R t}$ to predict the counterfactual $y^0_{0t}$ after the policy intervention and obtain the time-varying treatment effect $\hat\Delta_{t, \hat U_R}$. Let $\mathcal{F}^{t_1, t_2}_N$ be the smallest $\sigma$-field generated by the Borel sets of the collection $\{(f_t', \eta_t')' \in \mathbb{R}^{K+N+1} : t_1 \le t \le t_2\}$, and define the $\alpha$-mixing coefficient

\[ \alpha(m) = \sup_{T_1, T, N \in \mathbb{N},\; T_1 + 1 \le t \le T - m} \left\{ |P(AB) - P(A)P(B)| : A \in \mathcal{F}^{T_1+1, t}_N,\; B \in \mathcal{F}^{t+m, T}_N \right\}. \tag{5} \]

The following are additional assumptions for valid post-selection inference.

Assumption 3.

(a) There exist two universal constants $a_1$ and $a_2$ such that $\alpha(m) \le a_1 \exp(-a_2 m)$ for all $m$.

(b) $\max_{j \in \mathcal{N}_0} \left| T_2^{-1} \sum_{t \in \mathcal{T}_2} y^0_{jt} \right| = O_p\left(\sqrt{(\log N)/T_2}\right)$.

(c) $\max_{i,j \in \mathcal{N}_0} \left| T_2^{-1} \sum_{t \in \mathcal{T}_2} y^0_{it} y^0_{jt} - \bar{E}^{(2)}[y^0_{it} y^0_{jt}] \right| = O_p\left(\sqrt{(\log N)/T_2}\right)$.

(d) $\max_{t \in \mathcal{T}_2,\; j \in \mathcal{N}_0} E\left[(y^0_{jt})^4\right] \le C < \infty$.

(e) $\liminf_{R \to \infty} \min_{U : |U| \le R} \sum_{\tau = -\infty}^{\infty} \bar{E}^{(2)}\left[\varepsilon_{Ut} \varepsilon_{U(t+\tau)}\right] \ge c$.

(f) $\limsup_{R \to \infty} \max_{U : |U| \le R} \sum_{\tau = -\infty}^{\infty} \left| \bar{E}^{(2)}\left[\varepsilon_{Ut} \varepsilon_{U(t+\tau)}\right] \right| \le C$.

Assumption 3(a) restricts the dependence of the heterogeneous time series, similarly to Carvalho, Masini, and Medeiros (2018)'s Assumption 3. In particular, the $\alpha$-mixing coefficient in (5) takes the supremum over all indices $T_1$, $T$ and $N$, so the time series is geometrically strong mixing for all sample sizes. We use it for an extra technical purpose: it allows us to invoke the Berry-Esseen bound for heterogeneous time series (Bentkus, Götze, and Tikhomirov, 1997). Under the null hypothesis, (b) is about the convergence rate of the sample mean to the population mean 0, although $y^0_{0t}$ is unobservable. In the post-treatment period, Assumption 3(c) is analogous to Assumption 2(a) in the pre-treatment period, and (d) is commonly imposed in high-dimensional factor models (Bai, 2003). The last two items of Assumption 3 concern the long-run variance: (e) bounds the long-run variance away from degeneracy, and (f) guarantees the absolute summability of the autocovariances. Items (d), (e) and (f) make sure that the self-normalized test statistic behaves well, so that the Berry-Esseen bound can be applied to establish the asymptotic normality of the test statistic.

Similarly to $N$, we again view $T_2$ as a deterministic non-decreasing function of $T_1$.
In the statement of the following Theorem 2 we only explicitly send $T_2 \to \infty$, while $(N, T_1, R)$ are understood to diverge to infinity as well. The relative rates of the sample sizes $(T_1, T_2, N)$ and the tuning parameter $R$ in Theorem 2 are more restrictive than those in Theorem 1. This is because the post-selection inference has to tolerate the estimation error from the pre-treatment period as well as to regularize the asymptotic distribution uniformly over the selected set $\hat U_R$.
Theorem 2. Suppose Assumptions 1, 2, and 3 hold. If $T_1^{-1} R^2 \log N \cdot \log T_2 + (\log N)/T_2 \to 0$ as $T_2 \to \infty$, then under the null hypothesis $H_0$ the $t$-statistic $Z_{T_2, \hat U_R} \stackrel{d}{\to} N(0,1)$ if we choose $\tau \to \infty$ and $\tau = o(T_2)$.

Remark 5. If a single dataset is used for model selection and parameter estimation, post-selection inference on the model coefficients is in general a very difficult statistical problem that often leads to non-standard asymptotic distributions (Leeb and Pötscher, 2005, 2006), and it is the direction of intensive recent research (Berk, Brown, Buja, Zhang, and Zhao, 2013; Belloni, Chernozhukov, and Kato, 2014; Belloni, Chernozhukov, Fernández-Val, and Hansen, 2017; Hansen, Kozbur, and Misra, 2018). However, in conditional (on the model selected from a training sample) predictive inference, post-selection asymptotic normality is achievable (Leeb, 2009) and the inference can be carried out following the standard asymptotically normal procedure. In our context, the estimated ATE is the average of the predicted outcomes over the post-event period $\mathcal{T}_2$. The pre-treatment sample, in which the model is selected, and the post-event sample, in which the counterfactual is predicted, are asymptotically independent under the $\alpha$-mixing condition in Assumption 3(a).

Theorem 2 is a uniform result over the selected set $\hat U_R$. In other words, the asymptotic normality holds for any $\hat U_R$, which is a random set determined by the pre-treatment data. Consider an alternative non-random way of choosing a sequence of sets. Given an arbitrary ordering of the control units, we may naively choose the first $R$ terms $U^{\text{naive}}_R = \{1, \ldots, R\}$ for $R$ satisfying the order in Theorem 2. By the Berry-Esseen bound for strong mixing time series (Bentkus, Götze, and Tikhomirov, 1997; Sunklodas, 2000), we would also have $Z_{T_2, U^{\text{naive}}_R} \stackrel{d}{\to} N(0,1)$. Nevertheless, $Z_{T_2, \hat U_R}$ is more powerful than the naive $Z_{T_2, U^{\text{naive}}_R}$ because $\hat U_R$ is aggressively chosen to reduce the variance remaining in the regression error.

The asymptotic normality in Theorem 2 holds regardless of the algorithm that selects a subset of no more than $R$ elements. It is also applicable to the $t$-statistic based on HCW's best subset method via AIC or AICC. When they developed the asymptotic inference, HCW heuristically took the selected variables, which we denote here as $\hat U^{\text{AICC}}_R$, as if they were fixed. Our result implies $Z_{T_2, \hat U^{\text{AICC}}_R} \stackrel{d}{\to} N(0,1)$, which helps justify HCW's practice. Instead of $\hat U^{\text{AICC}}_R$, we nevertheless advocate the forward selection algorithm for $\hat U_R$ in view of its computational effectiveness in high-dimensional settings.

3 Monte Carlo Simulations

In this section, we evaluate the finite-sample performance of our proposed algorithm by Monte Carlo simulations. We conduct extensive experiments with sparse and non-sparse coefficients, and with various degrees of cross-sectional correlation and time dependence. For comparison, we also estimate the model using Lasso (Tibshirani, 1996). For each DGP, we generate one treated unit $j = 0$ along with 100 control units $j = 1, \ldots, 100$. We run 1000 replications and check the out-of-sample root mean squared prediction error (RMSPE) as well as the test size or power for the ATE. For simplicity, we set equal lengths for the pre-treatment and post-treatment time series, with $T_1 = T_2 = 40$, 80, 100, and 200.
Both forward selection and Lasso need tuning parameters: the stopping time $R$ in forward selection and the penalty level $\lambda$ in Lasso. (Due to the limitation of space, in the main text we present results for a non-sparse underlying linear regression model. In Section B of the Appendix, we further show the performance of variable selection, parameter estimation and prediction accuracy in a design with a sparse linear model.) We adopt the modified BIC (Wang, Li, and Leng, 2009) to choose the tuning parameters. For forward selection, the stopping time $R$ is determined by

\[ \hat R = \arg\min_{r \in \mathbb{N}} \left\{ \log\left(\hat\sigma^2_r\right) + \frac{\log\log(N) \cdot \log(T_1)}{T_1} \cdot r \right\}, \]

where $\hat\sigma^2_r$ is the mean squared residual of the selected model in the $r$-th step. For the Lasso estimator,

\[ \hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^N} \frac{1}{T_1} \sum_{t \in \mathcal{T}_1} \left(y_{0t} - Y_{\mathcal{N}t}'\beta\right)^2 + \lambda \sum_{j=1}^{N} |\beta_j|, \]

where $\lambda$ is the penalty level; in finite samples it is determined by

\[ \hat\lambda = \arg\min_{\lambda} \left\{ \log\left( \frac{1}{T_1} \sum_{t \in \mathcal{T}_1} \left(y_{0t} - Y_{\mathcal{N}t}'\hat\beta_\lambda\right)^2 \right) + 2 \cdot \frac{\log\log(N) \cdot \log(T_1)}{T_1} \cdot \left\|\hat\beta_\lambda\right\|_0 \right\}. \]

In the second term of the modified BIC, we have the admittedly ad hoc constant 1 for forward selection and 2 for the Lasso, respectively. The difference arises because in our simulations Lasso would select many more variables than forward selection were the same constant shared by the two estimation methods, resulting in even less satisfactory performance.
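A minimal sketch of the stopping rule follows; it is our own illustration, in which `sigma2_path` is assumed to collect the mean squared residual $\hat\sigma^2_r$ recorded at each step of Algorithm 1:

```python
import numpy as np

def choose_R(sigma2_path, N, T1):
    """Modified BIC of Wang, Li and Leng (2009), with the constant set to 1:
    pick r minimizing log(sigma2_r) + r * loglog(N) * log(T1) / T1."""
    r = np.arange(1, len(sigma2_path) + 1)
    bic = np.log(sigma2_path) + r * np.log(np.log(N)) * np.log(T1) / T1
    return int(r[np.argmin(bic)])
```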
We first generate the data via a factor model with four common factors.

• (iid factors) All factors $f_{kt}$ are drawn as i.i.d. normal random variables across $t = 1, \ldots, T$ and $k = 1, \ldots, 4$. This DGP serves as a benchmark.

• (time-dependent factors) The dynamic factors follow iid, AR(1), MA(2), or ARMA(1,1) processes for $t = 1, \ldots, T$, driven by innovations $u_{kt} \sim N(0,1)$ independently across $t$ and $k$.
The factor loading $\lambda_{jk}$, $k = 1, \ldots, 4$, is drawn independently from a uniform distribution on an interval bounded away from zero for units $j = 0, \ldots, 4$, whereas for the remaining units $j = 5, \ldots, 100$ the loading $\lambda_{jk}$ is drawn from a uniform distribution on a short interval centered at zero. The idiosyncratic shock $\eta_{jt}$ in the factor model (1) is mean-zero normal, independent across $j$ and $t$.

For $t \in \mathcal{T}_2$, the treated unit $y_{0t}$ is subject to an exogenous shock $\Delta_t$. We generate $\Delta_t$ by seven DGPs, denoted $D_1$ to $D_7$: $D_1$ sets $\Delta_t = 0$; $D_2$ draws $\Delta_t$ as i.i.d. mean-zero normal noise; $D_3$ generates $\Delta_t$ from a mean-zero AR(1) process with normal innovations $w_t$; $D_4$ and $D_5$ draw $\Delta_t$ as i.i.d. normal with positive means (equal to 1 under $D_5$); and $D_6$ and $D_7$ generate $\Delta_t$ from AR(1) processes with positive intercepts (equal to 0.25 under $D_6$) and normal innovations $w_t$.
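The following Python sketch mimics this design. It is our own illustration: the AR coefficient, the loading supports, and the noise scale are placeholder values, since the paper's exact constants are design-specific, and only designs $D_1$ and $D_5$ are displayed.

```python
import numpy as np

rng = np.random.default_rng(0)
T1, T2, N, K = 40, 40, 100, 4
T = T1 + T2

# AR(1) factors; phi = 0.5 is a placeholder, not the paper's coefficient
u = rng.standard_normal((T, K))
f = np.zeros((T, K))
for t in range(1, T):
    f[t] = 0.5 * f[t - 1] + u[t]

# loadings: sizable for j = 0,...,4; near zero for the rest (placeholder ranges)
lam = np.vstack([rng.uniform(1.0, 2.0, (5, K)),
                 rng.uniform(-0.2, 0.2, (N - 4, K))])
y = f @ lam.T + 0.5 * rng.standard_normal((T, N + 1))  # y[:, 0] is treated

# D1: no treatment; D5: i.i.d. N(1, 1) shock after T1 (variance is assumed)
y_post_treated_D5 = y[T1:, 0] + rng.normal(1.0, 1.0, T2)

# out-of-sample bias and RMSPE as in (6), given a prediction y0_hat of y[T1:, 0]:
#   bias  = np.mean(y0_hat - y[T1:, 0])
#   rmspe = np.sqrt(np.mean((y0_hat - y[T1:, 0]) ** 2))
```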
[Table 1 about here. Notes: The upper panel is for iid factors and the lower panel for dynamic factors, with $T_1 = T_2 = 40$, 80, 100, and 200. FS is short for forward selection. The simulation is repeated 1000 times. The first two columns are the medians over replications of the number of selected variables by forward selection and Lasso, respectively. The remaining columns are the means of the bias and RMSPE over the replications.]
The null hypothesis is true under $D_1$–$D_3$ and false under $D_4$–$D_7$. The treatment is time-invariant under $D_1$, time-varying under $D_2$, and serially correlated under $D_3$. $D_4$ and $D_5$ introduce time-invariant shifts to the post-treatment outcomes, whereas $D_6$ and $D_7$ add time-varying treatment effects with non-zero means.

We use the pre-treatment data to estimate the regression coefficients, and then use the post-treatment data to evaluate the out-of-sample performance. Table 1 gives the number of non-zero coefficients, the empirical bias, and the RMSPE, defined as

\[ \text{bias} = \frac{1}{T_2} \sum_{t \in \mathcal{T}_2} \left(\hat y^0_{0t} - y^0_{0t}\right) \quad \text{and} \quad \text{RMSPE} = \sqrt{\frac{1}{T_2} \sum_{t \in \mathcal{T}_2} \left(\hat y^0_{0t} - y^0_{0t}\right)^2}, \tag{6} \]

where $\hat y^0_{0t}$ is the predicted value of $y^0_{0t}$: forward selection gives $\hat y^0_{0t} = Y_{\hat U_{\hat R}, t}' \hat\beta_{\hat U_{\hat R}}$ and Lasso gives $\hat y^0_{0t} = Y_{\mathcal{N}t}' \hat\beta_{\hat\lambda}$. In the simulations we observe that the number of variables selected by Lasso is more sensitive to the sample size ($T_1$) than that of forward selection. Under both factor structures, the bias and RMSPE of Lasso are larger than those of forward selection in all cases, and Lasso chooses more variables than forward selection except in the case of $T_1 = T_2 = 40$ under the dynamic factors.

In the post-treatment period $t \in \mathcal{T}_2$, the realized value of the treated unit is $y_{0t} = y^0_{0t} + \Delta_t$ for the various designs of $\Delta_t$. The estimated treatment effect is $\hat\Delta_t = y_{0t} - \hat y^0_{0t}$, $t \in \mathcal{T}_2$. We then estimate the long-run variance of $\hat\Delta_t$ (Newey and West, 1987) and construct the test statistic as in (4). The rejection probability—the proportion of instances in which the test statistic's absolute value exceeds the critical value—is displayed in Table 2. The nominal test size is 5%. As the null hypothesis is true in $D_1$–$D_3$, the rejection probability there is associated with the test size; the closer it is to 5%, the better the performance. For $D_4$–$D_7$, on the contrary, the larger the rejection probability, the more powerful the test.
[Table 2: Test Size and Power under $D_1$–$D_7$, for forward selection and Lasso with $T_1 = T_2 = 40$, 80, 100, 200, under iid and dynamic factors. Notes: The entries for $D_1$–$D_3$ display the test size and those for $D_4$–$D_7$ show the power. The rejection probability is computed over 1000 replications.]

We observe in Table 2 that, as the sample size increases, the test size based on forward selection falls toward 5% under both the static and dynamic factor structures, though it is less accurate in $D_3$ when dynamics are present in the factors. This is caused by the relatively imprecise long-run variance estimation. The test is powerful in general under $D_4$–$D_7$, when the null is violated. In contrast, the test size of the model selected by Lasso is subject to more severe size distortion when the latent factors have dynamics, and the test is less powerful. The unsatisfactory performance of the Lasso-based inference is largely due to the estimation bias intrinsic to shrinkage methods. For example, with $T_1 = T_2 = N = 100$, the size distortion of Lasso is visibly larger than that of forward selection under iid factors, and both sizes inflate further when the latent factors involve dynamics. Even with this size inflation, under all of $D_4$–$D_7$ the test power based on Lasso is smaller than that of forward selection.

To facilitate visualization, we plot in Figure 1 the estimated ATE under the various DGPs, sample sizes, and latent factor structures. In each panel, the null hypothesis is true for the first column of subgraphs, whereas the null is violated with a constant positive $E[\Delta_t]$ for all $t \in \mathcal{T}_2$ in the second column and $E[\Delta_t] = 1$ in the last column. We witness under both factor structures that forward selection estimates the counterfactual with little bias, and the variance is reduced as the sample size grows. Finally, the kernel density of the test statistic in (4) is shown in Figure 2. The test statistic is robust across the latent factor structures. Normality is approximated very well in $D_1$ and $D_2$, though slightly heavier tails are observed in $D_3$. Overall, the $t$-statistic graph is supportive of the theoretical result of asymptotic normality.

[Figure 1: Density of the Estimated Average Treatment Effect. Panels: (a) iid factors; (b) dynamic factors.]

[Figure 2: Density of the $t$-statistic. Notes: The blue bell-shaped curve is the density of the standard normal distribution $N(0,1)$, which is the limiting distribution of the $t$-statistic.]

4 Empirical Applications

We apply our algorithm to two real-data applications in this section. We first revisit HCW's empirical example, which is well documented and amenable to replication and comparison. Next, we investigate a high-dimensional problem in which the number of potential control units surpasses the sample size. Such a situation is often encountered in practice.
4.1 Revisiting HCW: Economic Integration of Hong Kong

The original application of PDA in HCW assesses the effect of the Closer Economic Partnership Arrangement (CEPA) on Hong Kong's GDP growth rate. The dataset contains 44 pre-treatment periods and 17 post-treatment periods. Hong Kong's GDP growth rate is the dependent variable, and those of 24 other countries are the control units. As $N = 24$ is of similar magnitude to $T_1 = 44$, variable selection is relevant despite $N < T_1$. We compare the R-squared of the models picked by forward selection and by exhaustive search for each $R$. For a given $R$, exhaustive search compares $\binom{N}{R}$ models and selects the one with the largest R-squared, namely the (in-sample) best subset. In the original paper, the criteria

\[ \text{AIC}(R) = T_1 \ln\left(\hat\sigma^2_R\right) + 2(R+2) \quad \text{and} \quad \text{AICC}(R) = \text{AIC}(R) + \frac{2(R+2)(R+3)}{T_1 - (R+1) - 2} \]

choose $\hat R = 6$ and $\hat R = 9$, respectively. The included countries at each step are listed in Table 3. The turnover is high across $R$. The models selected by forward selection track the best in-sample subset closely. Notice that the exhaustive search runs OLS more than 1.3 million times to pin down the 9 variables, whereas forward selection performs merely 180 OLS regressions for $R = 9$. Forward selection is computationally much more efficient.

We further add Lasso to the comparison. When Lasso's tuning parameter $\lambda$ is selected by the modified BIC, it yields a model with 9 non-zero coefficients, corresponding to Finland, Korea, Mexico, Norway, Singapore, the Philippines, Indonesia, Malaysia and Thailand; 5 members overlap with those chosen by AICC. In Figure 3, Lasso's R-squared is much weaker, due to the shrinkage bias when the coefficients are pushed toward zero. To improve Lasso, we try the post-Lasso estimator, a simple OLS on the aforementioned Lasso-selected variables, to reduce the shrinkage bias. Post-Lasso enhances the R-squared, but a non-trivial gap remains relative to that of forward selection.

[Figure 3: In-sample R-squared of the competing selection methods. Note: the star "*" on each curve is the stopping point determined by the modified BIC.]
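Both information criteria are straightforward to code. The sketch below is our own illustration, not HCW's code; it also reproduces the regression counts quoted above:

```python
from math import comb, log

T1, N = 44, 24

def aic(sigma2_R, R):   # AIC(R) = T1 * ln(sigma2_R) + 2(R + 2)
    return T1 * log(sigma2_R) + 2 * (R + 2)

def aicc(sigma2_R, R):  # AICC(R) = AIC(R) + 2(R+2)(R+3) / (T1 - (R+1) - 2)
    return aic(sigma2_R, R) + 2 * (R + 2) * (R + 3) / (T1 - (R + 1) - 2)

print(comb(N, 9))                            # 1,307,504 subsets of size 9
print(sum(N - r + 1 for r in range(1, 10)))  # 180 OLS fits for forward selection
```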
4.2 China's Anti-Corruption Campaign and Luxury Watch Imports

4.2.1 Background and Data

China launched an anti-corruption campaign of unprecedented scale in November 2012, shortly after Xi Jinping took office. The campaign aimed at cracking down on graft and power abuse in all party apparatus, government bureaucracies and military departments. The influence of the anti-corruption campaign has motivated academic research assessing its impact from various perspectives, for example, stock returns (Lin, Morck, Yeung, and Zhao, 2016; Ding, Fang, Lin, and Shi, 2017) and corporate behavior (Xu and Yano, 2016; Pan and Tian, 2017). In this paper, we investigate luxury goods importation.

We use the import data from the UN Comtrade Database (DESA/UNSD, United Nations Comtrade Database, http://comtrade.un.org/), which provides detailed statistics for international commodity trade; the monthly data for China are available since 2010. We focus on the category named "watches with case of, or clad with, precious metal", following Lan and Li (2018), who find that Chinese luxury watch imports co-move with leadership transitions and government turnover. To ensure that the control units are insusceptible to the anti-corruption policy, 7 categories commonly consumed as bribe goods or conspicuous consumption are excluded; as a result, $N = 88$ out of the total 95 categories are left to serve as control units. (The excluded categories are, with the UN Comtrade Database code in parentheses: Beverages, spirits and vinegar (22); Tobacco and manufactured tobacco substitutes (24); Essential oils, perfumes, cosmetics, toiletries (33); Articles of leather, animal gut, harness, travel goods (42); Fur-skins and artificial fur, manufactures thereof (43); Pearls, precious stones, metals, coins, etc. (71); Clocks and watches and parts thereof (91); and Works of art, collectors' pieces and antiques (97).)

The raw time series of Chinese luxury watch imports, plotted as the red curve in the lower subgraph of Figure 4, dropped sharply around the start of the anti-corruption campaign. However, a seemingly structural break can be the upshot of many factors that influenced the macroeconomic environment, for example, the terms of international trade, exchange rate volatility, or domestic political attitudes. During the period from 2013 to 2015, the Chinese economy slowed down, and it stirred a turmoil in the global commodity market. Besides watches, other commodity imports shrank as well. While the flagging economy would have weakened the imports of a myriad of commodities, we employ PDA to control for such overall effects in the hope of better isolating the impact of the anti-corruption campaign.

4.2.2 Results

We apply the PDA to construct counterfactuals. The dependent variable is the monthly growth rate of luxury watch imports in US dollars, and the independent variables are chosen by the greedy algorithm out of the import growth rates of the 88 commodity categories. We use the growth rates instead of the level data to avoid time series non-stationarity. January 2013, the month after the Eight-Point Policy announcement, is regarded as the time of the treatment.
There are 35 pre-treatment observations ranging from February 2010 to December 2012, and 36 post-treatment observations spanning January 2013 to December 2015. The algorithm selects 3 control units: "knitted or crocheted fabric", "cork and articles of cork", and "salt, sulphur, earth, stone, plaster, lime and cement". With the estimated model, we predict the counterfactual $\hat y^0_{0t}$ and estimate the treatment effect for $t = 36, \ldots, 71$.

Figure 4 displays the actual luxury watch import growth (solid line) and its estimated counterpart without the anti-corruption campaign (dashed line). January 2013, the time of the treatment, is highlighted by the vertical line in the middle. The upper subgraph shows the growth rate; the lower one shows the value in US dollars, where the counterfactual in monetary value is constructed according to the predicted growth rate. Before the intervention, the model fits the real data quite well, and the R-squared of the selected model is 77.85%. After January 2013, had the anti-corruption policy not been implemented, the import growth rate would have followed the track indicated by the dashed line, which is visibly higher than the realizations. In particular, in January 2013 the import value slumped by 42%, whereas our counterfactual prediction suggests it would have increased by 1.7%. The average treatment effect over the post-treatment period is

\[ \frac{1}{36} \sum_{t=36}^{71} \hat\Delta_{t, \hat U_{\hat R}} = -3.09\%, \]

which means that on average the anti-corruption campaign slowed down luxury watch imports by 3.09% per month. The associated $t$-statistic rejects the null hypothesis of a zero average treatment effect at the 5% size. Accumulating such a monthly ATE over 36 months leads to roughly two thirds of a reduction in importation, since $(1 - 0.0309)^{36} \approx 0.32$, which is manifested in the lower subgraph. In December 2015, while the realized import was 29.35 million US dollars, the counterfactual predicts 89.27 million had China not waged the campaign. Our empirical evidence suggests that China's anti-corruption campaign has been effective in slashing luxury watch imports.
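As a back-of-the-envelope consistency check (our own calculation), the cumulated monthly effect lines up with the December 2015 gap between the realized and counterfactual import values:

\[ (1 - 0.0309)^{36} = e^{36 \ln(0.9691)} \approx e^{-1.13} \approx 0.32, \qquad \frac{29.35}{89.27} \approx 0.33. \]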
5 Conclusion

In this paper, we propose an algorithm to select the control units in PDA. We show that the forward selection method is computationally much more efficient than the exhaustive search for the best subset. We establish asymptotic theory for the near optimality of forward selection, and show the validity of conducting post-selection inference for the ATE by the $t$-statistic conditional on the selected set. These extensions widen the applicability of PDA to real-world high-dimensional problems with big data. We demonstrate the usefulness of our methodology in simulations and real-data examples.

References

Abadie, A., A. Diamond, and
J. Hainmueller (2010): “Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program,” Journal of the American Statistical Association, 105(490), 493–505.

Abadie, A., and J. Gardeazabal (2003): “The economic costs of conflict: A case study of the Basque Country,” American Economic Review, 93(1), 113–132.

Andrews, D. (1991): “Heteroskedasticity and autocorrelation consistent covariance matrix estimation,” Econometrica, 59(3), 817–858.

Bai, C., Q. Li, and M. Ouyang (2014): “Property taxes and home prices: A tale of two cities,” Journal of Econometrics, 180(1), 1–15.

Bai, J. (2003): “Inferential theory for factor models of large dimensions,” Econometrica, 71(1), 135–171.
Bai, J., and S. Ng (2009): “Boosting diffusion indices,” Journal of Applied Econometrics, 24(4), 607–629.

Banerjee, A. V., and E. Duflo (2009): “The experimental approach to development economics,” Annual Review of Economics, 1(1), 151–178.

Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012): “Sparse models and methods for optimal instruments with an application to eminent domain,” Econometrica, 80(6), 2369–2429.

Belloni, A., V. Chernozhukov, I. Fernández-Val, and C. Hansen (2017): “Program evaluation and causal inference with high-dimensional data,” Econometrica, 85(1), 233–298.

Belloni, A., V. Chernozhukov, and K. Kato (2014): “Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems,” Biometrika, 102(1), 77–94.
Bentkus, V., F. Götze, and A. Tikhomirov (1997): “Berry-Esseen bounds for statistics of weakly dependent samples,” Bernoulli, 3(3), 329–349.

Berk, R., L. Brown, A. Buja, K. Zhang, and L. Zhao (2013): “Valid post-selection inference,” The Annals of Statistics, 41(2), 802–837.

Bickel, P., Y. Ritov, and A. Tsybakov (2009): “Simultaneous analysis of Lasso and Dantzig selector,” The Annals of Statistics, 37(4), 1705–1732.

Breiman, L. (2001): “Random forests,” Machine Learning, 45(1), 5–32.

Bühlmann, P. (2006): “Boosting for high-dimensional linear models,” The Annals of Statistics, 34(2), 559–583.
Bühlmann, P., and S. van de Geer (2011): Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.

Carvalho, C., R. Masini, and M. C. Medeiros (2018): “ArCo: An artificial counterfactual approach for high-dimensional panel time-series data,” Journal of Econometrics, in press.

Chen, J., and Z. Chen (2008): “Extended Bayesian information criteria for model selection with large model spaces,” Biometrika, 95(3), 759–771.

Ding, H., H. Fang, S. Lin, and K. Shi (2017): “Equilibrium consequences of corruption on firms: Evidence from China’s anti-corruption campaign,” working paper, University of Pennsylvania.
Du, Z., and L. Zhang (2015): “Home-purchase restriction, property tax and housing price in China: A counterfactual analysis,” Journal of Econometrics, 188(2), 558–568.

Duflo, E., R. Glennerster, and M. Kremer (2007): “Using randomization in development economics research: A toolkit,” Handbook of Development Economics, 4, 3895–3962.

Fonseca, Y., M. Medeiros, G. Vasconcelos, and A. Veiga (2018): “BooST: Boosting smooth trees for partial effect estimation in nonlinear regressions,” arXiv preprint arXiv:1808.03698.

Gardeazabal, J., and A. Vega-Bayo (2017): “An empirical comparison between the synthetic control method and Hsiao et al.’s panel data approach to program evaluation,” Journal of Applied Econometrics, 32(5), 983–1002.

Hansen, C., D. Kozbur, and S. Misra (2018): “Targeted undersmoothing,” working paper.

Hörmann, S. (2009): “Berry-Esseen bounds for econometric time series,” Latin American Journal of Probability and Mathematical Statistics, 6, 377–397.
Hsiao, C., S. H. Ching, and S. K. Wan (2012): “A panel data approach for program evaluation: Measuring the benefits of political and economic integration of Hong Kong with mainland China,” Journal of Applied Econometrics, 27(5), 705–740.

Hsiao, C., and Q. Zhou (2019): “Panel parametric, semiparametric, and nonparametric construction of counterfactuals,” Journal of Applied Econometrics, 34(4), 463–481.

Jing, B.-Y., Q.-M. Shao, and Q. Wang (2003): “Self-normalized Cramér-type large deviations for independent random variables,” The Annals of Probability, 31(4), 2167–2215.

Jirak, M. (2016): “Berry-Esseen theorems under weak dependence,” The Annals of Probability, 44(3), 2024–2063.

Ke, X., H. Chen, Y. Hong, and C. Hsiao (2017): “Do China’s high-speed-rail projects promote local economy?,” China Economic Review, 44, 203–226.
Kock, A. B., and L. Callot (2015): “Oracle inequalities for high dimensional vector autoregressions,” Journal of Econometrics, 186(2), 325–344.

Koo, B., H. M. Anderson, M. H. Seo, and W. Yao (2019): “High-dimensional predictive regression in the presence of cointegration,” Journal of Econometrics, forthcoming.

Kozbur, D. (2017): “Testing-based forward model selection,” American Economic Review, 107(5), 266–269.

Kozbur, D. (2018): “Sharp convergence rates for forward regression in high-dimensional sparse linear models,” Discussion paper.

Lan, X., and W. Li (2018): “Swiss watch cycles: Evidence of corruption during leadership transition in China,” Journal of Comparative Economics, 46(4), 1234–1252.

Leeb, H. (2009): “Conditional predictive inference post model selection,” The Annals of Statistics, 37(5B), 2838–2876.
Leeb, H., and B. M. Pötscher (2005): “Model selection and inference: Facts and fiction,” Econometric Theory, 21(1), 21–59.

Leeb, H., and B. M. Pötscher (2006): “Can one estimate the conditional distribution of post-model-selection estimators?,” The Annals of Statistics, 34(5), 2554–2591.

Li, K. T., and D. R. Bell (2017): “Estimation of average treatment effects with panel data: Asymptotic theory and implementation,” Journal of Econometrics, 197(1), 65–75.

Lin, C., R. Morck, B. Yeung, and X. Zhao (2016): “Anti-corruption reforms and shareholder valuations: Event study evidence from China,” Discussion paper, National Bureau of Economic Research.

Luo, Y., and M. Spindler (2016a): “High-dimensional $L_2$ boosting: Rate of convergence,” arXiv preprint arXiv:1602.08927.

Luo, Y., and M. Spindler (2016b): “$L_2$ boosting for economic applications,” arXiv preprint arXiv:1702.03244.

Medeiros, M. C., and E. F. Mendes (2016): “$\ell_1$-regularization of high-dimensional time-series models with non-Gaussian and heteroskedastic errors,” Journal of Econometrics, 191(1), 255–271.
Newey, W. K., and K. D. West (1987): “A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix,” Econometrica, 55(3), 703–708.

Ouyang, M., and Y. Peng (2015): “The treatment-effect estimation: A case study of the 2008 economic stimulus package of China,” Journal of Econometrics, 188(2), 545–557.

Pan, X., and G. G. Tian (2017): “Political connections and corporate investments: Evidence from the recent anti-corruption campaign in China,” Journal of Banking & Finance, in press.

Shi, Z. (2016): “Econometric estimation in high-dimensional moment equalities,” Journal of Econometrics, 195, 104–119.
Sunklodas, J. (1984): “On the rate of convergence in the central limit theorem for strongly mixing random variables,” Lithuanian Mathematical Journal, 24, 182–190.

Sunklodas, J. (2000): “Approximation of distributions of sums of weakly dependent random variables by the normal distribution,” in Limit Theorems of Probability Theory, pp. 113–165. Springer.

Tibshirani, R. (1996): “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B (Methodological), 58, 267–288.

Wang, H. (2009): “Forward regression for ultra-high dimensional variable screening,” Journal of the American Statistical Association, 104(488), 1512–1524.

Wang, H., B. Li, and C. Leng (2009): “Shrinkage tuning parameter selection with a diverging number of parameters,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3), 671–683.

Xu, G., and G. Yano (2016): “How does anti-corruption affect corporate innovation? Evidence from recent anti-corruption efforts in China,” Journal of Comparative Economics, 45(3), 498–519.
Zhang, C.-H., and J. Huang (2008): “The sparsity and bias of the lasso selection in high-dimensional linear regression,” The Annals of Statistics, 36(4), 1567–1594.

Zhong, W., S. Duan, and L. Zhu (2017): “Forward additive regression for ultrahigh-dimensional nonparametric additive models,” Statistica Sinica.
A Proofs
A.1 Proof of Lemma 1
For any $U \subset \mathcal{N}$ with cardinality $u = |U|$, define

\[ \hat L_U := \frac{1}{T_1} y_0' P_U y_0 = \frac{y_0' Y_U}{T_1} \left( \frac{Y_U' Y_U}{T_1} \right)^{-} \frac{Y_U' y_0}{T_1} = \left( \bar E[y_{Ut} y_{0t}] + \zeta_U \right)' (\Sigma_U + V_U)^{-1} \left( \bar E[y_{Ut} y_{0t}] + \zeta_U \right), \tag{A1} \]

where $\zeta_U = Y_U' y_0 / T_1 - \bar E[y_{Ut} y_{0t}]$, $\Sigma_U = \bar E[y_{Ut} y_{Ut}']$, and $V_U = (v_{ij})_{i,j \in U} = Y_U' Y_U / T_1 - \Sigma_U$. Under Assumption 2(a), we have $\|\zeta_U\|_\infty = O_p(\sqrt{(\log N)/T_1})$. The maximal eigenvalue of $V_U$ is bounded by

\[ \phi_{\max}(V_U) \le u \max_{i,j \in U} |v_{ij}| = O_p\left( u \sqrt{(\log N)/T_1} \right), \tag{A2} \]

where the stochastic order again follows from Assumption 2(a). Furthermore, (A2) implies

\[ (\Sigma_U + V_U)^{-1} = \Sigma_U^{-1/2} \left( I + \Sigma_U^{-1/2} V_U \Sigma_U^{-1/2} \right)^{-1} \Sigma_U^{-1/2} = \Sigma_U^{-1/2} \left( I + \sum_{l=1}^{\infty} \left( -\Sigma_U^{-1/2} V_U \Sigma_U^{-1/2} \right)^l \right) \Sigma_U^{-1/2} = \Sigma_U^{-1/2} (I + \Xi) \Sigma_U^{-1/2}, \tag{A3} \]

where $\Xi = \sum_{l=1}^{\infty} \left( -\Sigma_U^{-1/2} V_U \Sigma_U^{-1/2} \right)^l$. As

\[ \phi_{\max}\left( \Sigma_U^{-1/2} V_U \Sigma_U^{-1/2} \right) \le \phi_{\max}(V_U)\, \phi_{\max}(\Sigma_U^{-1}) = \phi_{\max}(V_U)\, \phi_{\min}^{-1}(\Sigma_U) = O_p\left( u \sqrt{(\log N)/T_1} \right) \cdot c^{-1} = O_p\left( u \sqrt{(\log N)/T_1} \right) \]

by Assumption 1, when $T_1$ is sufficiently large we have

\[ \phi_{\max}(\Xi) \le \frac{ \phi_{\max}\left( \Sigma_U^{-1/2} V_U \Sigma_U^{-1/2} \right) }{ 1 - \phi_{\max}\left( \Sigma_U^{-1/2} V_U \Sigma_U^{-1/2} \right) } = O_p\left( u \sqrt{\frac{\log N}{T_1}} \right). \tag{A4} \]

Substitute (A3) into (A1):

\[ \hat L_U = \left( \bar E[y_{Ut} y_{0t}] + \zeta_U \right)' \Sigma_U^{-1/2} (I + \Xi) \Sigma_U^{-1/2} \left( \bar E[y_{Ut} y_{0t}] + \zeta_U \right) = \left( \bar E[y_{Ut} y_{0t}] + \zeta_U \right)' \Sigma_U^{-1} \left( \bar E[y_{Ut} y_{0t}] + \zeta_U \right) \cdot \left( 1 + O_p\left( u \sqrt{(\log N)/T_1} \right) \right), \tag{A5} \]

where the last equality follows from (A4). Notice that

\[ \left( \bar E[y_{Ut} y_{0t}] + \zeta_U \right)' \Sigma_U^{-1} \left( \bar E[y_{Ut} y_{0t}] + \zeta_U \right) = L_U + 2 \zeta_U' \Sigma_U^{-1} \bar E[y_{Ut} y_{0t}] + \zeta_U' \Sigma_U^{-1} \zeta_U, \tag{A6} \]

where $L_U := \bar E[y_{Ut} y_{0t}]' \Sigma_U^{-1} \bar E[y_{Ut} y_{0t}]$. The third term on the right-hand side of the above equation is bounded by

\[ \zeta_U' \Sigma_U^{-1} \zeta_U \le \phi_{\min}^{-1}(\Sigma_U) \|\zeta_U\|^2 \le c^{-1} u \|\zeta_U\|_\infty^2 = O_p\left( \frac{u \log N}{T_1} \right), \tag{A7} \]

and the second term is bounded by

\[ 2 \zeta_U' \Sigma_U^{-1} \bar E[y_{Ut} y_{0t}] = 2 \left( \Sigma_U^{-1/2} \zeta_U \right)' \left( \Sigma_U^{-1/2} \bar E[y_{Ut} y_{0t}] \right) \le 2 \left( \zeta_U' \Sigma_U^{-1} \zeta_U \right)^{1/2} \sqrt{L_U} \le 2\, c^{-1/2} \sqrt{u}\, \|\zeta_U\|_\infty \sqrt{L_U} = O_p\left( \sqrt{u (\log N)/T_1} \right), \tag{A8} \]

where the first inequality follows from the Cauchy-Schwarz inequality, and $\|\cdot\|$ and $\|\cdot\|_\infty$ are the usual $L_2$-norm and sup-norm of a vector, respectively. Substituting (A6), (A7) and (A8) into (A5),

\[ \hat L_U = \left( L_U + O_p\left( \sqrt{u (\log N)/T_1} \right) \right) \left( 1 + O_p\left( u \sqrt{(\log N)/T_1} \right) \right) = L_U + O_p\left( u \sqrt{(\log N)/T_1} \right). \]

Since the above equality holds uniformly for all $U$ and Assumption 1 is stated for $R$, we have

\[ \sup_{|U| \le (1+\delta) R} \left| \hat L_U - L_U \right| = O_p\left( (1+\delta) R \sqrt{(\log N)/T_1} \right) = O_p\left( \sqrt{T_1^{-1} R^2 \log N} \right). \]
Finally, when $|U| = 0$, let $\widehat{\sigma}_y^2$ be the sample variance of $\{ y_t \}_{t \in \mathcal{T}_1}$ and $\sigma_y^2 = \bar{E}[y_t^2]$. Obviously, $\widehat{\sigma}_y^2 - \sigma_y^2 = O_p\left( T_1^{-1/2} \right)$. By definition, $\widehat{L}_U = \widehat{\sigma}_y^2 - \widehat{\sigma}_U^2$ and $L_U = \sigma_y^2 - \sigma_U^2$. The claim in the statement follows.
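To fix ideas, the quantities appearing in Lemma 1 are directly computable from the pre-treatment sample. The following sketch is our own illustration, not part of the original paper; it assumes numpy and computes $\widehat{L}_U = T_1^{-1} y' P_U y$ and the residual variance $\widehat{\sigma}_U^2 = \widehat{\sigma}_y^2 - \widehat{L}_U$.

    import numpy as np

    def L_hat(y, Y, U):
        # Explained second moment T1^{-1} y' P_U y for a candidate control set U.
        # y: (T1,) treated unit, pre-treatment; Y: (T1, N) candidate controls;
        # U: list of column indices.
        T1 = y.shape[0]
        if len(U) == 0:
            return 0.0
        YU = Y[:, U]
        G = YU.T @ YU / T1              # sample analogue of Sigma_U + V_U
        g = YU.T @ y / T1               # sample analogue of E[y_Ut y_t] + zeta_U
        return float(g @ np.linalg.solve(G, g))

    def sigma2_hat(y, Y, U):
        # Residual variance sigma_y^2 - L_U, the criterion forward selection reduces.
        return float(y @ y) / y.shape[0] - L_hat(y, Y, U)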
A.2 Proof of Theorem 1

The following Lemma A.1 shows that the greedy algorithm makes progress in every iteration. Let $v = |V|$ and $u = |U|$ for two generic index sets $V, U \subset \mathcal{N}$, and define $\sigma_{U|V}^2 := L_U - L_V$.

Lemma A.1.
Under Assumption 2, for any sets $U, V \subset \mathcal{N}$ such that $U \supset V$ and $u > v$, we have the inequality
$$\max_{j \in \mathcal{N}} \sigma_{\{V,j\}|V}^2 \ge \frac{\eta_u}{C (u - v)} \sigma_{U|V}^2. \tag{A9}$$

Remark. The left-hand side is the magnitude of the descent of the greedy algorithm. The right-hand side is a proportion of the total gap between $L_V$ and $L_U$. It means that each greedy pursuit can close the gap $\sigma_{U|V}^2$ by a nontrivial proportion.

Proof of Lemma A.1.
We first prove the case $V = \emptyset$. Define $\beta_{jU} := \left( \bar{E}[y_{Ut} y_{Ut}'] \right)^{-1} \bar{E}[y_{Ut} y_{jt}]$. We can write
$$\sigma_{U|\emptyset}^2 = \sigma_y^2 - \left( \sigma_y^2 - \bar{E}\left[ \left( y_{Ut}' \beta_U \right)^2 \right] \right) = \bar{E}\left[ \left( y_{Ut}' \beta_U \right)^2 \right] = \beta_U' \Sigma_U \beta_U = \bar{E}[y_{Ut} y_t]' \Sigma_U^{-1} \bar{E}[y_{Ut} y_t]$$
and similarly
$$\sigma_{\{j\}|\emptyset}^2 = \bar{E}\left[ \left( y_{jt} \beta_{\{j\}} \right)^2 \right] = \left( \bar{E}\left[ y_{jt}^2 \right] \right)^{-1} \left( \bar{E}[y_{jt} y_t] \right)^2.$$
By Assumption 2(b), $\bar{E}[y_{jt}^2] \le C$, so $\sigma_{\{j\}|\emptyset}^2 \ge C^{-1} \left( \bar{E}[y_{jt} y_t] \right)^2$, and it immediately implies
$$\max_{j \in \mathcal{N}} \left( \bar{E}[y_{jt} y_t] \right)^2 \le C \cdot \max_{j \in \mathcal{N}} \sigma_{\{j\}|\emptyset}^2. \tag{A10}$$
On the other hand,
$$\sigma_{U|\emptyset}^2 \le \eta_u^{-1} \left\| \bar{E}[y_{Ut} y_t] \right\|^2 \le \frac{u}{\eta_u} \left\| \bar{E}[y_{Ut} y_t] \right\|_\infty^2 \le \frac{u}{\eta_u} \max_{j \in \mathcal{N}} \left( \bar{E}[y_{jt} y_t] \right)^2 \le \frac{C u}{\eta_u} \max_{j \in \mathcal{N}} \sigma_{\{j\}|\emptyset}^2,$$
where the last inequality follows by (A10). The above inequality is the special case of (A9) with $V = \emptyset$ and $v = 0$.

A parallel argument applies when $V \ne \emptyset$ and $u > v$. Let the scalar random variable $\varepsilon_{jVt} := y_{jt} - y_{Vt}' \beta_{jV}$ for any $j \in \mathcal{N} \setminus V$, and the random vector $\varepsilon_{UVt} := \left( \varepsilon_{jVt} \right)_{j \in U \setminus V}$; they are the projection residuals of $y_{jt}$, for $j \in (U \setminus V) \cup \{0\}$, after the effect of $(y_{jt})_{j \in V}$ is partialled out. The gap $\sigma_{U|V}^2$ can be bounded by
$$\sigma_{U|V}^2 = \sigma_V^2 - \sigma_U^2 = L_U - L_V = \bar{E}\left[ \varepsilon_{UVt} \varepsilon_{0Vt} \right]' \left( \bar{E}\left[ \varepsilon_{UVt} \varepsilon_{UVt}' \right] \right)^{-1} \bar{E}\left[ \varepsilon_{UVt} \varepsilon_{0Vt} \right] \le \phi_{\min}^{-1}\left( \bar{E}\left[ \varepsilon_{UVt} \varepsilon_{UVt}' \right] \right) \sum_{j \in U \setminus V} \left( \bar{E}\left[ \varepsilon_{jVt} \varepsilon_{0Vt} \right] \right)^2$$
$$\le \phi_{\min}^{-1}\left( \bar{E}\left[ \varepsilon_{UVt} \varepsilon_{UVt}' \right] \right) (u - v) \max_{j \in U \setminus V} \left( \bar{E}\left[ \varepsilon_{jVt} \varepsilon_{0Vt} \right] \right)^2 \le (u - v) \cdot \phi_{\min}^{-1}\left( \bar{E}\left[ \varepsilon_{UVt} \varepsilon_{UVt}' \right] \right) \max_{j \in \mathcal{N}} \left( \bar{E}\left[ \varepsilon_{jVt} \varepsilon_{0Vt} \right] \right)^2. \tag{A11}$$
Since $\left( \bar{E}\left[ \varepsilon_{UVt} \varepsilon_{UVt}' \right] \right)^{-1}$ is a submatrix of $\Sigma_U^{-1}$, we have
$$\phi_{\min}^{-1}\left( \bar{E}\left[ \varepsilon_{UVt} \varepsilon_{UVt}' \right] \right) = \phi_{\max}\left( \left( \bar{E}\left[ \varepsilon_{UVt} \varepsilon_{UVt}' \right] \right)^{-1} \right) \le \phi_{\max}\left( \Sigma_U^{-1} \right) = \phi_{\min}^{-1}(\Sigma_U) \le \eta_u^{-1}. \tag{A12}$$
Similarly, for any $j \in \mathcal{N} \setminus V$,
$$\sigma_{\{V,j\}|V}^2 = \left( \mathrm{var}\left[ \varepsilon_{jVt} \right] \right)^{-1} \left( \bar{E}\left[ \varepsilon_{jVt} \varepsilon_{0Vt} \right] \right)^2 \ge \left( \bar{E}\left[ y_{jt}^2 \right] \right)^{-1} \left( \bar{E}\left[ \varepsilon_{jVt} \varepsilon_{0Vt} \right] \right)^2 \ge C^{-1} \left( \bar{E}\left[ \varepsilon_{jVt} \varepsilon_{0Vt} \right] \right)^2. \tag{A13}$$
Combining (A11), (A12) and (A13):
$$\sigma_{U|V}^2 \le \frac{u - v}{\eta_u} \max_{j \in \mathcal{N}} \left( \bar{E}\left[ \varepsilon_{jVt} \varepsilon_{0Vt} \right] \right)^2 \le \frac{C (u - v)}{\eta_u} \max_{j \in \mathcal{N}} \sigma_{\{V,j\}|V}^2.$$
The statement in the lemma follows when we rearrange the above inequality.

With the key inequality (A9), we proceed to the analysis of the greedy iteration. Define a collection of sequences of index sets
$$\mathcal{U}_R(\alpha) = \left\{ (U_1, \ldots, U_R) \,\middle|\, U_r \subset \mathcal{N},\ U_{r-1} \subset U_r,\ |U_r \setminus U_{r-1}| = 1,\ \text{and } \sigma_{U_r | U_{r-1}}^2 \ge (1 - \alpha) \max_{j \in \mathcal{N}} \sigma_{\{U_{r-1}, j\} | U_{r-1}}^2 \right\}$$
for some fixed $\alpha \in (0, 1)$. Any increasing sequence in $\mathcal{U}_R(\alpha)$ satisfies the inequality $\sigma_{U_r | U_{r-1}}^2 \ge (1 - \alpha) \max_{j \in \mathcal{N}} \sigma_{\{U_{r-1}, j\} | U_{r-1}}^2$. The constant $\alpha$ can be viewed as a tolerance: we do not have to be utterly greedy in the sense of grabbing the best choice given $U_{r-1}$.
As long as each iteration reduces the gap by at least a constant proportion of what the most greedy choice would achieve, we can still approach, or even surpass, our target. This is the message of the following lemma.

Lemma A.2.
For any sequence $(U_1, \ldots, U_R) \in \mathcal{U}_R(\alpha)$ and any $W \subset \mathcal{N}$, we have
$$\sigma_{U_R}^2 - \sigma_W^2 \le \sigma_y^2 \left( 1 - (1 - \alpha) \frac{\eta_{w+R}}{C w} \right)^R,$$
where $w = |W|$.

Remark. Lemma A.2 states what happens when the forward selection algorithm is applied to the population model. In each iteration, the index set includes one more element; however, the variance updates less greedily. Even with this less greedy algorithm, given the optimal set $W = U^*$, after $R$ iterations the difference between $\sigma_{U_R}^2$ and the optimal $\sigma_{U^*}^2$ converges to zero as $R \to \infty$. The tolerance will be needed when we bring the population greedy algorithm to the data.

Proof of Lemma A.2.
We first derive an inequality for generic sets $W, V \subset \mathcal{N}$ with $W \ne V$. Define $U = W \cup V$ so that $U \supset V$ and $u - v \ge 1$. Lemma A.1 gives
$$\max_{j \in \mathcal{N}} \sigma_{\{V,j\}|V}^2 \ge \frac{\eta_u}{C (u - v)} \sigma_{U|V}^2.$$
Since $\eta_u = \eta_{|W \cup V|} \ge \eta_{w+v}$ and $u - v = |W \cup V| - v \le w$, we continue the above inequality:
$$\frac{\eta_u}{C (u - v)} \sigma_{U|V}^2 \ge \frac{\eta_{w+v}}{C w} \sigma_{U|V}^2 = \frac{\eta_{w+v}}{C w} \left( \sigma_V^2 - \sigma_U^2 \right) \ge \frac{\eta_{w+v}}{C w} \left( \sigma_V^2 - \sigma_W^2 \right),$$
where the last inequality follows as $\sigma_U^2 \le \sigma_W^2$. Multiply both sides by $-(1 - \alpha)$ and add $\left( \sigma_V^2 - \sigma_W^2 \right)$:
$$\left( \sigma_V^2 - \sigma_W^2 \right) - (1 - \alpha) \max_{j \in \mathcal{N}} \sigma_{\{V,j\}|V}^2 \le \left( 1 - (1 - \alpha) \frac{\eta_{w+v}}{C w} \right) \left( \sigma_V^2 - \sigma_W^2 \right). \tag{A14}$$
Now consider
$$\sigma_{U_R}^2 - \sigma_W^2 = \left( \sigma_{U_{R-1}}^2 - \sigma_W^2 \right) - \sigma_{U_R | U_{R-1}}^2 \le \left( \sigma_{U_{R-1}}^2 - \sigma_W^2 \right) - (1 - \alpha) \max_{j \in \mathcal{N}} \sigma_{\{U_{R-1}, j\} | U_{R-1}}^2 \le \left( 1 - (1 - \alpha) \frac{\eta_{w+R}}{C w} \right) \left( \sigma_{U_{R-1}}^2 - \sigma_W^2 \right),$$
where the first inequality holds by the definition of $\mathcal{U}_R(\alpha)$, and the second inequality follows by (A14). We iterate the above inequality to obtain
$$\sigma_{U_R}^2 - \sigma_W^2 \le \left( 1 - (1 - \alpha) \frac{\eta_{w+R}}{C w} \right) \left( 1 - (1 - \alpha) \frac{\eta_{w+R-1}}{C w} \right) \left( \sigma_{U_{R-2}}^2 - \sigma_W^2 \right) \le \cdots \le \left( \sigma_\emptyset^2 - \sigma_W^2 \right) \prod_{r=0}^{R-1} \left( 1 - (1 - \alpha) \frac{\eta_{w+r}}{C w} \right) \le \sigma_y^2 \left( 1 - (1 - \alpha) \frac{\eta_{w+R}}{C w} \right)^R.$$

The calculation in Lemmas A.1 and A.2 is carried out in the population. Now we link the population with the sample to prove Theorem 1.
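The sample version of this greedy recursion, which the proof below analyzes, can be sketched as follows. This is our own minimal implementation, reusing sigma2_hat from the earlier sketch; it always grabs the best descent at every step, i.e., it does not make use of the tolerance $\alpha$.

    def forward_select(y, Y, R):
        # Greedily add, R times, the control unit that maximizes the sample
        # descent sigma2_{(j,U)|U} = sigma2_hat(U) - sigma2_hat(U + [j]).
        N = Y.shape[1]
        U, path = [], []
        for _ in range(R):
            base = sigma2_hat(y, Y, U)
            gains = {j: base - sigma2_hat(y, Y, U + [j])
                     for j in range(N) if j not in U}
            U = U + [max(gains, key=gains.get)]
            path.append(list(U))
        return path                      # nested sequence U_1 in ... in U_R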
Proof of Theorem 1.
By adding and subtracting,
$$\widehat{\sigma}_{\widehat{U}_R}^2 - \sigma_{U^*}^2 = \left( \widehat{\sigma}_{\widehat{U}_R}^2 - \sigma_{\widehat{U}_R}^2 \right) + \left( \sigma_{\widehat{U}_R}^2 - \sigma_{U^*}^2 \right). \tag{A15}$$
Since $|\widehat{U}_R| = R$, we invoke Lemma 1 so that $\widehat{\sigma}_{\widehat{U}_R}^2 - \sigma_{\widehat{U}_R}^2 = O_p\left( R \sqrt{(\log N)/T_1} \right) = o_p(1)$ uniformly for all $\widehat{U}_R$ if $T_1^{-1} R^2 \log N \to 0$.

We focus on $\sigma_{\widehat{U}_R}^2 - \sigma_{U^*}^2$. Let $\zeta_r = \max_{|V| \le r} \left| \widehat{\sigma}_V^2 - \sigma_V^2 \right|$. Define a collection of sets
$$\mathcal{A}_r(\alpha) = \left\{ V \,\middle|\, V \subset \mathcal{N},\ |V| = r,\ \max_{j \in \mathcal{N}} \sigma_{\{j,V\}|V}^2 > \frac{4 \zeta_{r+1}}{\alpha} \right\}. \tag{A16}$$
Let $\widehat{j} = \arg\max_{j \in \mathcal{N}} \widehat{\sigma}_{\{j,V\}|V}^2$ be the index selected by the greedy algorithm from the sample given the set $V$, and denote by $( \widehat{U}_1, \ldots, \widehat{U}_R )$ the sequence selected by the greedy algorithm. We discuss two cases.

(i) If $\widehat{U}_r \in \mathcal{A}_r(\alpha)$ for all $1 \le r \le R$, then
$$\sigma_{\{\widehat{j}, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 \ge \widehat{\sigma}_{\{\widehat{j}, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 - \left| \widehat{\sigma}_{\{\widehat{j}, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 - \sigma_{\{\widehat{j}, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 \right| \ge \widehat{\sigma}_{\{\widehat{j}, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 - 2 \max_{|U| \le r} \left| \widehat{\sigma}_U^2 - \sigma_U^2 \right| = \widehat{\sigma}_{\{\widehat{j}, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 - 2 \zeta_r$$
$$= \max_{j \in \mathcal{N}} \widehat{\sigma}_{\{j, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 - 2 \zeta_r \ge \max_{j \in \mathcal{N}} \left( \sigma_{\{j, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 - \left| \widehat{\sigma}_{\{j, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 - \sigma_{\{j, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 \right| \right) - 2 \zeta_r \ge \max_{j \in \mathcal{N}} \sigma_{\{j, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2 - 4 \zeta_r > (1 - \alpha) \max_{j \in \mathcal{N}} \sigma_{\{j, \widehat{U}_{r-1}\} | \widehat{U}_{r-1}}^2.$$
Thus we have $( \widehat{U}_1, \ldots, \widehat{U}_R ) \in \mathcal{U}_R(\alpha)$. By Lemma A.2 with $W = U^*$,
$$\sigma_{\widehat{U}_R}^2 - \sigma_{U^*}^2 \le \sigma_y^2 \left( 1 - (1 - \alpha) \frac{\eta_{w+R}}{C w} \right)^R \le \sigma_y^2 \left( 1 - \frac{(1 - \alpha) c}{C} \times \frac{1}{w} \right)^R \to 0 \tag{A17}$$
when the event $( \widehat{U}_1, \ldots, \widehat{U}_R ) \in \mathcal{U}_R(\alpha)$ occurs.

(ii) Suppose the selected sequence $( \widehat{U}_1, \ldots, \widehat{U}_R )$ has some element $\widehat{U}_r$ not belonging to $\mathcal{A}_r(\alpha)$. Let $\widehat{r} = \min\left\{ r \in \{1, \ldots, R\} \mid \widehat{U}_r \notin \mathcal{A}_r(\alpha) \right\}$ be the first occurrence of violation as the sequence progresses, so that
$$\max_{j \in \mathcal{N}} \sigma_{\{j, \widehat{U}_{\widehat{r}}\} | \widehat{U}_{\widehat{r}}}^2 \le \frac{4 \zeta_{\widehat{r}+1}}{\alpha}. \tag{A18}$$
If $U^* \subset \widehat{U}_{\widehat{r}}$, which is the ideal case when the selected set includes the population optimal set, then $\sigma_{\widehat{U}_R}^2 \le \sigma_{\widehat{U}_{\widehat{r}}}^2 \le \sigma_{U^*}^2$.
On the other hand, even if $U^*$ is not a subset of $\widehat{U}_{\widehat{r}}$, we have
$$\sigma_{\widehat{U}_R}^2 - \sigma_{U^*}^2 \le \sigma_{\widehat{U}_{\widehat{r}}}^2 - \sigma_{U^*}^2 \le \sigma_{\widehat{U}_{\widehat{r}}}^2 - \sigma_{U^* \cup \widehat{U}_{\widehat{r}}}^2 = \sigma_{( U^* \cup \widehat{U}_{\widehat{r}} ) | \widehat{U}_{\widehat{r}}}^2 \le \frac{C w}{\eta_{w + \widehat{r}}} \max_{j \in \mathcal{N}} \sigma_{\{j, \widehat{U}_{\widehat{r}}\} | \widehat{U}_{\widehat{r}}}^2 \le \frac{C w}{\eta_{w + R}} \max_{j \in \mathcal{N}} \sigma_{\{j, \widehat{U}_{\widehat{r}}\} | \widehat{U}_{\widehat{r}}}^2 \le \frac{C w}{c} \cdot \frac{4 \zeta_{\widehat{r}+1}}{\alpha} = o_p(1), \tag{A19}$$
where the third inequality follows by (A9), the fifth inequality by condition (A18) and Assumption 1 since $w + R \le (1 + \delta) R$ holds asymptotically for any $\delta > 0$ as $w / R \to 0$, and the stochastic order by Lemma 1.

Collecting (A17) and (A19), and in view of $\left| \widehat{\sigma}_{\widehat{U}_R}^2 - \sigma_{\widehat{U}_R}^2 \right| = o_p(1)$, we have the statement of the theorem.

A.3 Proof of Theorem 2

We first establish the first-stage estimation error bound of the OLS coefficients.
Lemma A.3.
Under Assumptions 1 and 2, the estimation error of the OLS coefficients satisfies
$$\max_{|U| \le R} \left\| \widehat{\beta}_U - \beta_U \right\| = O_p\left( R \sqrt{\frac{\log N}{T_1}} \right).$$

Proof of Lemma A.3.
For any $U$ such that $|U| = u \le R$, we have
$$\bar{E}\left[ y_t^2 \right] = \bar{E}\left[ \left( y_{Ut}' \beta_U + \varepsilon_{Ut} \right)^2 \right] = \beta_U' \bar{E}\left[ y_{Ut} y_{Ut}' \right] \beta_U + \bar{E}\left[ \varepsilon_{Ut}^2 \right] \ge \beta_U' \bar{E}\left[ y_{Ut} y_{Ut}' \right] \beta_U \ge \left\| \beta_U \right\|^2 \eta_u.$$
Therefore, under Assumptions 1 and 2(b), when $T_1$ is sufficiently large,
$$\max_{|U| \le R} \left\| \beta_U \right\| \le \sqrt{\bar{E}\left[ y_t^2 \right] / \eta_R} \le \sqrt{C / c} < \infty. \tag{A20}$$
Using the notation defined in the proof of Lemma 1, the OLS estimator is
$$\widehat{\beta}_U = \left( \frac{Y_U' Y_U}{T_1} \right)^{-1} \frac{Y_U' y}{T_1} = (\Sigma_U + V_U)^{-1} \left( \bar{E}[y_{Ut} y_t] + \zeta_U \right) = \Sigma_U^{-1/2} (I + \Xi) \Sigma_U^{-1/2} \left( \bar{E}[y_{Ut} y_t] + \zeta_U \right)$$
$$= \Sigma_U^{-1} \bar{E}[y_{Ut} y_t] + \Sigma_U^{-1/2} \Xi \Sigma_U^{-1/2} \left( \bar{E}[y_{Ut} y_t] + \zeta_U \right) + \Sigma_U^{-1} \zeta_U = \beta_U + \Sigma_U^{-1/2} \Xi \Sigma_U^{1/2} \beta_U + \Sigma_U^{-1/2} \Xi \Sigma_U^{1/2} \Sigma_U^{-1} \zeta_U + \Sigma_U^{-1} \zeta_U,$$
as $\beta_U = \Sigma_U^{-1} \bar{E}[y_{Ut} y_t]$. Subtract $\beta_U$ from both sides and take the $L_2$-norm:
$$\left\| \widehat{\beta}_U - \beta_U \right\| \le \left\| \Sigma_U^{-1/2} \Xi \Sigma_U^{1/2} \beta_U \right\| + \left\| \Sigma_U^{-1/2} \Xi \Sigma_U^{1/2} \Sigma_U^{-1} \zeta_U \right\| + \left\| \Sigma_U^{-1} \zeta_U \right\|$$
$$\le \phi_{\max}\left( \Sigma_U^{-1/2} \Xi \Sigma_U^{1/2} \right) \left\| \beta_U \right\| + \left( \phi_{\max}\left( \Sigma_U^{-1/2} \Xi \Sigma_U^{1/2} \right) + 1 \right) \left\| \Sigma_U^{-1} \zeta_U \right\| \le \phi_{\max}(\Xi)\, \phi_{\max}\left( \Sigma_U^{-1} \right) \left\| \beta_U \right\| + \left( \phi_{\max}(\Xi)\, \phi_{\max}\left( \Sigma_U^{-1} \right) + 1 \right) \phi_{\max}\left( \Sigma_U^{-1} \right) \left\| \zeta_U \right\|$$
$$\le O_p\left( u \sqrt{\frac{\log N}{T_1}} \right) \eta_u^{-1} \left\| \beta_U \right\| + \left( O_p\left( u \sqrt{\frac{\log N}{T_1}} \right) \eta_u^{-1} + 1 \right) \eta_u^{-1} \sqrt{u} \left\| \zeta_U \right\|_\infty \le O_p\left( u \sqrt{\frac{\log N}{T_1}} \right) + O_p\left( \sqrt{\frac{u \log N}{T_1}} \right) = O_p\left( R \sqrt{\frac{\log N}{T_1}} \right),$$
where the fourth inequality follows by (A4), and the last by (A20) and Assumptions 1 and 2(a), as $u \le R$.

Define
$$Z_{\mathcal{T}_2, U} = \widehat{\rho}_{\tau,U}^{-1} \cdot \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} \widehat{\Delta}_{t,U} = \widehat{\rho}_{\tau,U}^{-1} \cdot \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} \left( y_t - y_{Ut}' \widehat{\beta}_U \right), \qquad Z_{\mathcal{T}_2, U}^* = \widehat{\rho}_{\tau,U}^{*-1} \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} \left( y_t - y_{Ut}' \beta_U \right),$$
where $Z_{\mathcal{T}_2, U}^*$ is an infeasible version of $Z_{\mathcal{T}_2, U}$ as if the true coefficient $\beta_U$ were known, and similarly
$$\widehat{\rho}_{\tau,U}^{*2} = \frac{1}{T_2} \sum_{t,s \in \mathcal{T}_2} \epsilon_{Ut} \epsilon_{Us} \cdot 1\{ |t - s| \le \tau \}$$
is the infeasible counterpart of $\widehat{\rho}_{\tau,U}^2$ with known $\beta_U$. Under the null hypothesis $H_0$, the self-normalized statistics are $Z_{\mathcal{T}_2, U} = \widehat{\rho}_{\tau,U}^{-1} \cdot \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} \widehat{\epsilon}_{Ut}$ and $Z_{\mathcal{T}_2, U}^* = \widehat{\rho}_{\tau,U}^{*-1} \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} \epsilon_{Ut}$. The next result shows that the feasible $Z_{\mathcal{T}_2, U}$ converges in probability to $Z_{\mathcal{T}_2, U}^*$ uniformly for all $U$ such that $|U| \le R$.
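Before stating the lemma, here is a minimal sample-level sketch of the feasible statistic. This is our own code, assuming numpy is imported as np as in the earlier sketches; it forms the post-treatment residuals $\widehat{\Delta}_{t,U} = y_t - y_{Ut}' \widehat{\beta}_U$ and self-normalizes them by the $\tau$-truncated long-run variance $\widehat{\rho}_{\tau,U}^2$.

    def Z_stat(y_post, Y_post, beta_hat, tau):
        # Self-normalized statistic Z_{T2,U} from the post-treatment sample.
        # y_post: (T2,) treated unit; Y_post: (T2, u) selected controls;
        # beta_hat: (u,) pre-treatment OLS coefficients; tau: truncation lag.
        e = y_post - Y_post @ beta_hat     # Delta_hat_{t,U}; equals eps_hat under H0
        T2 = e.shape[0]
        lrv = float(e @ e) / T2            # s = 0 autocovariance term
        for s in range(1, tau + 1):
            lrv += 2.0 * float(e[:-s] @ e[s:]) / T2   # truncated autocovariances
        return float(e.sum()) / (np.sqrt(T2) * np.sqrt(lrv))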
Lemma A.4. Suppose Assumptions 1, 2, and 3 and the null hypothesis $H_0$ hold. If $T_1^{-1} R^2 u \log^2 N \log^2 T + T_2^{-1} \log N \to 0$, then
$$\max_{|U| \le R} \left| Z_{\mathcal{T}_2, U} - Z_{\mathcal{T}_2, U}^* \right| \stackrel{p}{\to} 0.$$

Remark. Lemma A.4 establishes the asymptotic equivalence of $Z_{\mathcal{T}_2, U}$ and $Z_{\mathcal{T}_2, U}^*$, which means that the former has the same asymptotic distribution as the latter. As the latter is a statistic involving no estimated parameters, it is much easier to pin down its asymptotic distribution by borrowing convergence-in-distribution results from the probability theory literature.

Proof of Lemma A.4.
For any $|U| = u \le R$, the difference between the numerators of $Z_{\mathcal{T}_2, U}^*$ and $Z_{\mathcal{T}_2, U}$ is bounded by
$$\left| \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} \left( \epsilon_{Ut} - \widehat{\epsilon}_{Ut} \right) \right| = \left| \left( \widehat{\beta}_U - \beta_U \right)' \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} y_{Ut} \right| \le \left\| \widehat{\beta}_U - \beta_U \right\| \left\| \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} y_{Ut} \right\| \le \left\| \widehat{\beta}_U - \beta_U \right\| \sqrt{u} \cdot \max_{j \in U} \left| \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} y_{jt} \right|$$
$$= O_p\left( R \sqrt{\frac{\log N}{T_1}} \right) \sqrt{u}\, O_p\left( \sqrt{\log N} \right) = O_p\left( \sqrt{T_1^{-1} R^2 u \log^2 N} \right), \tag{A21}$$
where the first inequality follows by the Cauchy-Schwarz inequality, and the stochastic order by Assumption 3(a). This bound holds uniformly over all $U$ such that $|U| \le R$.

Next, we deal with the long-run variance. Denote by $\rho_{\tau,U}^{*2} = \sum_{|s| \le \tau} \bar{E}^{(2)}\left[ \epsilon_{Ut} \epsilon_{U(t+s)} \right]$ the $\tau$-term truncated version of the long-run variance, and let $\mathcal{T}_{2,s} := \{ T_1 + 1, \ldots, T - s \}$. The difference between the denominators, i.e., the long-run variance estimates in $Z_{\mathcal{T}_2, U}^*$ and $Z_{\mathcal{T}_2, U}$, is bounded by
$$\left| \widehat{\rho}_{\tau,U}^{*2} - \widehat{\rho}_{\tau,U}^{2} \right| \le 2 \tau \max_{0 \le s \le \tau} \left| \frac{1}{T_2} \sum_{t \in \mathcal{T}_{2,s}} \left( \widehat{\epsilon}_{Ut} \widehat{\epsilon}_{U,t+s} - \epsilon_{Ut} \epsilon_{U,t+s} \right) \right| = 2 \tau \max_{0 \le s \le \tau} \left| \frac{1}{T_2} \sum_{t \in \mathcal{T}_{2,s}} \left( \left( \epsilon_{Ut} + \left( \widehat{\beta}_U - \beta_U \right)' y_{Ut} \right) \left( \epsilon_{U,t+s} + \left( \widehat{\beta}_U - \beta_U \right)' y_{U,t+s} \right) - \epsilon_{Ut} \epsilon_{U,t+s} \right) \right|$$
$$\le 2 \tau \max_{0 \le s \le \tau} \left| \left( \widehat{\beta}_U - \beta_U \right)' \frac{1}{T_2} \sum_{t \in \mathcal{T}_{2,s}} y_{Ut} y_{U,t+s}' \left( \widehat{\beta}_U - \beta_U \right) \right| + 4 \tau \max_{0 \le s \le \tau} \left| \left( \widehat{\beta}_U - \beta_U \right)' \frac{1}{T_2} \sum_{t \in \mathcal{T}_{2,s}} y_{Ut} \epsilon_{U,t+s} \right|$$
$$\le 2 \tau \left\| \widehat{\beta}_U - \beta_U \right\|^2 \max_{0 \le s \le \tau} \phi_{\max}\left( \frac{1}{T_2} \sum_{t \in \mathcal{T}_{2,s}} y_{Ut} y_{U,t+s}' \right) + 4 \tau \left\| \widehat{\beta}_U - \beta_U \right\| \max_{0 \le s \le \tau} \frac{1}{T_2} \left\| \sum_{t \in \mathcal{T}_{2,s}} y_{Ut} \epsilon_{U,t+s} \right\|. \tag{A22}$$
For any $s \in \{0, \ldots, \tau\}$ and $U$, in the above inequality (A22),
$$\phi_{\max}\left( \frac{1}{T_2} \sum_{t \in \mathcal{T}_{2,s}} y_{Ut} y_{U,t+s}' \right) \le \phi_{\max}\left( \frac{1}{T_2} \sum_{t = T_1 + 1}^{T} y_{Ut} y_{Ut}' \right) \le u \max_{j \in \mathcal{N}} \frac{1}{T_2} \sum_{t = T_1 + 1}^{T} y_{jt}^2 \le u \left( \max_{j \in \mathcal{N}} \bar{E}^{(2)}\left[ y_{jt}^2 \right] + O_p\left( \sqrt{\frac{\log N}{T_2}} \right) \right) = O_p(u), \tag{A23}$$
where the first inequality holds by the Cauchy-Schwarz inequality, and the stochastic order by Assumption 3(b). This bound also holds uniformly for all $U$ such that $|U| \le R$. Similarly, the other term on the right-hand side of (A22) is bounded by
$$\frac{1}{T_2} \left\| \sum_{t \in \mathcal{T}_{2,s}} y_{Ut} \epsilon_{U,t+s} \right\| \le \sqrt{u} \max_{j \in U} \left| \frac{1}{T_2} \sum_{t \in \mathcal{T}_{2,s}} y_{jt} \epsilon_{U,t+s} \right| \le \sqrt{u} \left( \frac{1}{T_2} \sum_{t \in \mathcal{T}_2} \epsilon_{Ut}^2 \right)^{1/2} \max_{j \in U} \left( \frac{1}{T_2} \sum_{t \in \mathcal{T}_2} y_{jt}^2 \right)^{1/2} \le \sqrt{u}\, C^{1/2} \left( C + O_p\left( \sqrt{\frac{\log N}{T_2}} \right) \right)^{1/2} = O_p\left( \sqrt{u} \right), \tag{A24}$$
where the inequalities follow by the Cauchy-Schwarz inequality, and the stochastic rate by Assumption 3(b). This bound also holds uniformly for all $U$ such that $|U| \le R$.

Substituting the bounds (A23), (A24) and Lemma A.3 into (A22), and noticing that $\tau$ is chosen as $O(\log T)$, we have
$$\max_{|U| \le R} \left| \widehat{\rho}_{\tau,U}^{*2} - \widehat{\rho}_{\tau,U}^{2} \right| \le \tau\, O_p\left( \frac{R^2 \log N}{T_1} \right) O_p(u) + \tau\, O_p\left( R \sqrt{\frac{\log N}{T_1}} \right) O_p\left( \sqrt{u} \right) = O_p\left( \sqrt{T_1^{-1} R^2 u \log N \log^2 T} \right) (1 + o_p(1)) = o_p(1).$$
The above inequality, along with the boundedness of the population long-run variance in Assumptions 3(e) and (f), ensures that the estimation error in the denominator is asymptotically negligible under the rate condition $T_1^{-1} R^2 u \log N \log^2 T \to 0$. In other words, the order of the difference between $Z_{\mathcal{T}_2, U}^*$ and $Z_{\mathcal{T}_2, U}$ is governed by the numerator as in (A21):
$$\max_{|U| \le R} \left| Z_{\mathcal{T}_2, U}^* - Z_{\mathcal{T}_2, U} \right| = O_p\left( \sqrt{T_1^{-1} R^2 u \log^2 N} \right) = o_p(1).$$

In view of Lemma A.3 and Lemma A.4, the proof of Theorem 2 is an application of a Berry-Esseen bound for time series. Many results in the probability theory literature deal with strictly stationary time series (Bentkus, Götze, and Tikhomirov, 1997; Jirak, 2016), but much fewer with heterogeneous time series. We use the result by Sunklodas (1984), which was originally in Russian and later re-interpreted in English in Sunklodas (2000, p.133, Theorem 10) and Hörmann (2009, p.380).
Let $S_n = \sum_{t=1}^{n} x_t$ for some generic zero-mean time series $(x_t)_{t=1}^{n}$, and let $B_n^2 = E\left[ \left( \sum_{t=1}^{n} x_t \right)^2 \right]$. If $(x_t)_{t=1}^{n}$ is $\alpha$-mixing with geometric rate, $\max_{t \le n} |x_t| \le C < \infty$, and $B_n^2 \ge n c$ for some $c > 0$ for all $n$ sufficiently large, then
$$\sup_{a \in \mathbb{R}} \left| P\left( S_n / B_n \le a \right) - \Phi(a) \right| \le C_{BE} \frac{\log B_n}{B_n} \max_{1 \le t \le n} E\left[ |x_t|^3 \right], \tag{A25}$$
where $C_{BE}$ is a constant that depends only on the geometric rate of strong mixing, $\max_{t \le n} |x_t|$ and $B_n^2 / n$; in particular, $C_{BE}$ is independent of the sample size.
The numerator of the $t$-statistic $Z_{\mathcal{T}_2, U}^*$ is $\frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} \left( y_t - y_{Ut}' \beta_U \right)$, so that
$$\max_{t \in \mathcal{T}_2} E\left[ \left| y_t - y_{Ut}' \beta_U \right|^3 \right] \le \max_{t \in \mathcal{T}_2} E\left[ |y_t|^3 \right] + \max_{t \in \mathcal{T}_2} E\left[ \left| y_{Ut}' \beta_U \right|^3 \right].$$
Uniformly for any $U$, by the Cauchy-Schwarz inequality and (A20),
$$E\left[ \left| y_{Ut}' \beta_U \right|^3 \right] \le \left\| \beta_U \right\|^3 E\left[ \left\| y_{Ut} \right\|^3 \right] \le (C/c)^{3/2} E\left[ \left\| y_{Ut} \right\|^3 \right], \qquad E\left[ \left\| y_{Ut} \right\|^3 \right] = E\left[ \Big( \sum_{j \in U} y_{jt}^2 \Big)^{3/2} \right] \le u^{3/2} \max_{j \in U} E\left[ |y_{jt}|^3 \right] \le C u^{3/2}.$$
We thus have $\max_{|U| \le R} \max_{t \in \mathcal{T}_2} E\left[ \left| y_t - y_{Ut}' \beta_U \right|^3 \right] = O\left( R^{3/2} \right)$. Let
$$Z_{\mathcal{T}_2, U}^{**} = \rho_U^{*-1} \frac{1}{\sqrt{T_2}} \sum_{t \in \mathcal{T}_2} \epsilon_{Ut},$$
where $\rho_U^{*2} = T_2^{-1} E\left[ \left( \sum_{t \in \mathcal{T}_2} \epsilon_{Ut} \right)^2 \right]$. By Assumption 3(f), the Berry-Esseen bound (A25) indicates that there exists a constant $C_{BE2}$ such that
$$\sup_{a \in \mathbb{R}} \left| P\left( Z_{\mathcal{T}_2, U}^{**} \le a \right) - \Phi(a) \right| \le C_{BE} \frac{\log\left( \sqrt{T_2} \rho_U^* \right)}{\sqrt{T_2} \rho_U^*} O\left( R^{3/2} \right) \le C_{BE2} \sqrt{T_2^{-1} R^3} \log T_2$$
for sufficiently large $T_2$, where the last inequality follows by Assumption 3(e). The constant $C_{BE2}$ is independent of $U$. It implies that
$$\sup_{a \in \mathbb{R}} \sup_{|U| \le R} \left| P\left( Z_{\mathcal{T}_2, U}^{**} \le a \right) - \Phi(a) \right| \le C_{BE2} \sqrt{T_2^{-1} R^3} \log T_2. \tag{A26}$$
Since $\rho_U^{*2}$ is bounded above for all $U$ such that $|U| \le R$ by Assumption 3(e), we have $\left| \rho_U^{*2} - \rho_{\tau,U}^{*2} \right| = o(1)$, and thus $\max_{|U| \le R} \left| Z_{\mathcal{T}_2, U}^* - Z_{\mathcal{T}_2, U}^{**} \right| = o_p(1)$; furthermore, by Lemma A.4, $\max_{|U| \le R} \left| Z_{\mathcal{T}_2, U} - Z_{\mathcal{T}_2, U}^{**} \right| = o_p(1)$. Replacing $Z_{\mathcal{T}_2, U}^{**}$ by $Z_{\mathcal{T}_2, U}$,
$$\sup_{a \in \mathbb{R}} \left| P\left( Z_{\mathcal{T}_2, U} \le a \right) - \Phi(a) \right| = O\left( \sqrt{T_2^{-1} R^3} \log T_2 \right).$$
Lastly, under the strong $\alpha$-mixing condition, the set $\widehat{U}_R$ selected from the pre-treatment period is asymptotically independent of the statistic $Z_{\mathcal{T}_2, U}$ computed from the post-treatment period. In other words, the selected variables are asymptotically independent of the prediction. We conclude that
$$\sup_{a \in \mathbb{R}} \left| P\left( Z_{\mathcal{T}_2, \widehat{U}_R} \le a \right) - \Phi(a) \right| = O\left( \sqrt{T_2^{-1} R^3} \log T_2 \right) \to 0.$$
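As an informal numerical check of the normal limit in Theorem 2 (our own experiment; none of the numbers below come from the paper, and numpy is assumed imported as np), one can simulate the self-normalized statistic under the null with AR(1) errors and verify that the two-sided 5% rejection rate is close to nominal:

    rng = np.random.default_rng(0)
    T2, tau, reps, rej = 200, 5, 2000, 0
    for _ in range(reps):
        u = rng.standard_normal(T2 + 100)
        eps = np.zeros(T2 + 100)
        for t in range(1, T2 + 100):          # AR(1) errors; burn-in discarded
            eps[t] = 0.5 * eps[t - 1] + u[t]
        e = eps[100:]
        lrv = float(e @ e) / T2
        for s in range(1, tau + 1):
            lrv += 2.0 * float(e[:-s] @ e[s:]) / T2
        lrv = max(lrv, 1e-12)                  # guard against a negative truncated estimate
        z = e.sum() / np.sqrt(T2 * lrv)
        rej += abs(z) > 1.96
    print(rej / reps)                          # roughly 0.05 if the normal limit holds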
B Additional Simulations

We run additional simulations to check the variable selection performance when the regression parameter is sparse. Such sparsity can also be generated from the factor model when many control units are linked by factors that do not affect the treated unit. To construct a sparse linear model, we consider three data generating processes for $Y_{\mathcal{N} t}$:

(a) (independent) Each $y_{jt} \sim$ iid $N(0, 1)$.

(b) (time series) Four categories of dependence structures, with an equal number of variables in each category: for $t = 1, \ldots, T$,
$$y_{jt} = \phi_1 y_{j,t-1} + u_{jt}, \qquad j \in \{1, \ldots, N/4\},$$
$$y_{jt} = \phi_2 y_{j,t-1} + u_{jt}, \qquad j \in \{N/4 + 1, \ldots, N/2\},$$
$$y_{jt} = u_{jt} + \theta_1 u_{j,t-1} + \theta_2 u_{j,t-2}, \qquad j \in \{N/2 + 1, \ldots, 3N/4\},$$
$$y_{jt} = \phi_3 y_{j,t-1} + u_{jt} + \theta_3 u_{j,t-1}, \qquad j \in \{3N/4 + 1, \ldots, N\},$$
where the AR and MA coefficients are constants in $(0, 1)$: the first two categories are AR(1) processes, the third is an MA(2) process, and the fourth is an ARMA(1,1) process.

(c) (cross-sectionally correlated time series) All covariates in $Y_{\mathcal{N} t}$ are generated by the same four dynamic factors as in Section 3, with factor loadings $\lambda_{jk}$ drawn independently from a normal distribution with mean 1 for all $j$ and $k$. In the factor model, the error $\eta_{jt}$ is also independently distributed as a mean-zero normal over $j$ and $t$.

Once the regressors are simulated, they are used to generate the dependent variable. The potential outcome under no treatment is $y_t = Y_{\mathcal{N} t}' \beta + \varepsilon_t$, where the true parameter is $\beta = (\underbrace{1, \ldots, 1}_{s}, \underbrace{0, \ldots, 0}_{N-s})'$, with $s$ the number of active variables, and $\varepsilon_t$ is a mean-zero normal innovation independent over $t$. Since the $(N - s)$ remaining true coefficients are all zeros, this is a sparse linear model.

The stopping criterion for forward selection and the tuning parameter for Lasso are exactly the same as those used in Section 3. We carry out 1000 replications. To evaluate the performance in variable selection, we compute (i) the empirical probability that all relevant variables are selected, $P_a = \frac{1}{1000} \sum_{i=1}^{1000} 1\left\{ \widehat{\beta}_j^{(i)} \ne 0,\ \forall j \in \{1, \ldots, s\} \right\}$; (ii) the average proportion of individual relevant variables selected, $P_b = \frac{1}{1000 s} \sum_{i=1}^{1000} \sum_{j=1}^{s} 1\left\{ \widehat{\beta}_j^{(i)} \ne 0 \right\}$; and (iii) the average proportion of individual irrelevant variables excluded, $P_c = \frac{1}{1000 (N-s)} \sum_{i=1}^{1000} \sum_{j=s+1}^{N} 1\left\{ \widehat{\beta}_j^{(i)} = 0 \right\}$.

As shown in Tables B1-B3, the probability of including all $s$ relevant variables approaches 100% in all three DGPs, and the probability of excluding the irrelevant variables also converges to 100%, which holds true for all factor structures. For example, in Table B1 for the case $s = 8$ and $N = 100$ (highlighted in bold font in the tables), $P_a$, $P_b$, and $P_c$ all rise toward 100% as $T$ increases from 40 to 200, although, as expected, the exclusion of irrelevant variables is imperfect, since forward selection has no mechanism to drop variables. It is worth mentioning that, when time dependence and cross-sectional correlation are both present in Table B3, forward selection still selects all the relevant variables with probability approaching one as $T$ increases from 40 to 200, while Lasso's $P_a$, $P_b$ and $P_c$ are lower.

Besides variable selection, we compute the average bias and RMSE of the parameters estimated by forward selection and Lasso in Tables B4-B6, in comparison with the oracle OLS estimator.
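For concreteness, the sketch below is our own code, reusing forward_select from Appendix A (numpy assumed imported as np); it simulates DGP (a) and estimates $P_a$, $P_b$, $P_c$ for forward selection. The error scale and the number of selected units R are illustrative choices, not the paper's settings.

    def selection_metrics(N=100, s=8, T=100, R=12, reps=200, seed=0):
        # Monte Carlo estimates of P_a, P_b, P_c under DGP (a).
        rng = np.random.default_rng(seed)
        relevant = set(range(s))
        hit_all = hit_rel = excl_irr = 0
        for _ in range(reps):
            Y = rng.standard_normal((T, N))              # iid N(0,1) controls
            beta = np.r_[np.ones(s), np.zeros(N - s)]    # sparse truth
            y = Y @ beta + rng.standard_normal(T)        # illustrative error scale
            sel = set(forward_select(y, Y, R)[-1])       # final selected set
            hit_all += relevant <= sel                   # all relevant selected?
            hit_rel += len(sel & relevant)               # relevant units selected
            excl_irr += (N - s) - len(sel - relevant)    # irrelevant units excluded
        return hit_all / reps, hit_rel / (reps * s), excl_irr / (reps * (N - s))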
The average bias and RMSE are defined as
$$\mathrm{Bias}_\beta = \frac{1}{N} \sum_{j \in \mathcal{N}} \left( \widehat{\beta}_j - \beta_j \right) \quad \text{and} \quad \mathrm{RMSE}_\beta = \sqrt{ \frac{1}{N} \sum_{j \in \mathcal{N}} \left( \widehat{\beta}_j - \beta_j \right)^2 }.$$
Thanks to consistent model selection, the bias and variance of the forward selection parameter estimates are controlled at a low level: the larger the sample size, the more precise the parameter estimates. The estimation error of the forward selection estimator approaches that of the oracle, and is in general smaller than that of Lasso.

In terms of prediction (Tables B7-B9), we compare the out-of-sample bias and root mean prediction squared error (RMPSE), defined in (6), of the oracle, forward selection, and Lasso. It is well known that overfitting undermines out-of-sample prediction, which is also visible here. The prediction bias of both forward selection and Lasso is not distant from that of the oracle. RMPSE decreases as the sample size grows and the ratio $s/N$ shrinks. Forward selection delivers less noisy predictions in all cases, and is closer to the oracle.

Lastly, we report the results of the ATE test in the sparse case (Table B10), where the same procedure is carried out as before. To save space, only the case of $N = 100$ and $s = 8$ is tabulated. The test performance is influenced by the dependence across cross-sectional units and over time, even for the oracle model. Forward selection achieves nearly oracle performance when the sample size is large, while Lasso is less satisfactory due to more severe size distortion and insufficient power.

Table B1: Performance in Variable Selection (IID)
[Table: selection probabilities $P_a$, $P_b$, $P_c$ (in %) for forward selection and Lasso; columns are $N = 40, 100, 200$ within each of $T = 40, 80, 100, 200$.]

Table B2: Performance in Variable Selection (Time Dependence)
[Table: same layout as Table B1.]

Table B3: Performance in Variable Selection (Time Dependence and Cross-Sectional Correlation)
[Table: same layout as Table B1.]

Table B4: Parameter Estimation: Bias and RMSE (both ×10⁻²) (IID)
[Table: average bias and RMSE of the estimated parameters for forward selection, Lasso, and the oracle OLS, for $s = 4, 8, 16$; columns are $N = 40, 100, 200$ within each of $T = 40, 80, 100, 200$.]

Table B5: Parameter Estimation: Bias and RMSE (both ×10⁻²) (Time Dependence)
[Table: same layout as Table B4.]

Table B6: Parameter Estimates: Bias and RMSE (both ×10⁻²) (Time Dependence and Cross-Sectional Correlation)
[Table: same layout as Table B4.]

Table B7: Out-of-sample Prediction: Bias (×10⁻²) and RMPSE (IID)
[Table: out-of-sample bias and RMPSE for forward selection, Lasso, and the oracle, for $s = 4, 8, 16$; columns are $N = 40, 100, 200$ within each of $T_1 = T_2 = 40, 80, 100, 200$.]

Table B8: Out-of-sample Prediction: Bias (×10⁻²) and RMPSE (Time Dependence)
[Table: same layout as Table B7.]

Table B9: Out-of-sample Prediction: Bias (×10⁻²) and RMPSE (Time Dependence and Cross-Sectional Correlation)
[Table: same layout as Table B7.]

Table B10: Size and Power of the ATE Test ($N = 100$ and $s = 8$)

                                        Size                       Power
    IID
        FS        0.074   0.085   0.119      0.542   0.889   0.395   0.802
        Lasso     0.080   0.080   0.127      0.495   0.891   0.371   0.803
        Oracle    0.072   0.079   0.141      0.783   1       0.502   0.935
    Time Dependence
        FS        0.317   0.292   0.296      0.411   0.566   0.368   0.532
        Lasso     0.314   0.291   0.291      0.404   0.573   0.358   0.544
        Oracle    0.189   0.120   0.167      0.734   0.997   0.490   0.913
    Time Dependence and Cross-Sectional Correlation
        FS        0.150   0.136   0.167      0.446   0.875   0.350   0.760
        Lasso     0.186   0.150   0.175      0.460   0.861   0.374   0.753
        Oracle    0.116   0.090   0.145      0.787   0.999   0.509   0.910