Optimal Experimental Design for Staggered Rollouts
Ruoxuan Xiong*
Department of Management Science & Engineering, Stanford University, [email protected]
Susan Athey
Graduate School of Business, Stanford University, [email protected]
Mohsen Bayati
Graduate School of Business, Stanford University, [email protected]
Guido Imbens
Graduate School of Business, Stanford University, [email protected]
Experimentation has become an increasingly prevalent tool for guiding decision-making and policy choices. A common hurdle in designing experiments is the lack of statistical power. In this paper, we study the optimal multi-period experimental design under the constraint that the treatment cannot be easily removed once implemented; for example, a government might implement a public health intervention in different geographies at different times, where the treatment cannot be easily removed due to practical constraints. The treatment design problem is to select which geographies (referred to as units) to treat at which time, intending to test hypotheses about the effect of the treatment. When the potential outcome is a linear function of unit and time effects, and discrete observed/latent covariates, we provide an analytically feasible solution to the optimal treatment design problem, whose treatment effect estimator has variance at most 1 + O(1/N) times the variance under the optimal treatment design, with N the number of units. This solution assigns units in a staggered treatment adoption pattern: if the treatment only affects one period, the optimal fraction of treated units in each period increases linearly in time; if the treatment affects multiple periods, the optimal fraction increases non-linearly in time, smaller at the beginning and larger at the end. In the general setting where outcomes depend on latent covariates, we show that historical data can be utilized in designing experiments. We propose a data-driven local search algorithm to assign units to treatment times. We demonstrate that our approach improves upon benchmark experimental designs via synthetic interventions on the influenza occurrence rate and synthetic experiments on interventions for in-home medical services and grocery expenditure.

Key words: Experimental Design, Contamination and Spillover Effects, Staggered Rollouts, Treatment Effect Estimation, Stratification
1. Introduction
* Alphabetical order other than the first author.

Experimentation is a cornerstone of decision-making. Large technology and retail companies run thousands or even tens of thousands of experiments per year to evaluate the impact of various decisions, design better products, and understand mechanisms (Muchnik et al. 2013, Aral and Walker 2014, Panniello et al. 2014, Cheung et al. 2017, Brynjolfsson et al. 2018, Cohen et al. 2018, Brynjolfsson et al. 2019, Salganik 2019, Gordon et al. 2019, Joo et al. 2019, Zhang et al. 2019a,b, Ülkü et al. 2019, Lee and Hosanagar 2019, Cui et al. 2020, Sun et al. 2020a,b). In some cases, a variety of firm constraints lead firms (units) to introduce new products or features at different times. In other cases, a variety of practical constraints lead clinical researchers and policymakers to introduce the treatment or intervention at different times. After such a rollout has occurred, analysts treat this type of "staggered adoption" of the new feature as a "natural experiment" and attempt to estimate causal effects (Athey and Stern 2002, Athey and Imbens 2018, Athey et al. 2018). However, less research has considered how to design such experiments in anticipation of precisely analyzing treatment effects.

In this paper, we study how to optimally design experiments in settings where the experimental designer can choose not just which units to treat but also when to treat them.
Our goal is to minimize the variance of the estimate for the treatment effect by choosing the optimal multi-period treatment design.
We maintain the assumption that during the course of the experiment, once treated, a unit continues being treated until the end of the experiment. This assumption is motivated by practical considerations such as difficulties communicating with experimental subjects or frictions in implementing changes; we also extend our results to the scenario where the treatment can be removed.

Our main motivation to study this question is a fundamental challenge in experimental design: statistical power (Pocock and Simon 1975, Cohen 1992, Kuhfeld 2005, Lawrie et al. 2015, Hemming et al. 2015, Li et al. 2018b, Bhat et al. 2019), which is the likelihood of detecting the effect of a new decision and is also a concern in observational studies (Fogarty and Small 2016, Heng et al. 2019). Statistical power can be low for various reasons, but the primary one is a small sample size. For an essential class of questions where the treatment has spillover effects, the relevant unit of analysis is an aggregated set of samples, such as a metropolitan area for testing public health interventions to reduce the occurrence of an infectious disease such as flu. Another example is randomization at the clinic level instead of the patient level in a clinical trial comparing different weight-loss programs, where patients may meet each other in the waiting room and communicate about their own strategies (Leahey et al. 2014). While randomizing at a higher level of aggregation can effectively avoid contamination and spillover effects, it substantially reduces the sample size and statistical power.

Measuring outcomes for units over time allows the experimental designer to improve the statistical power, as repeated observations for the same units allow the analyst to learn key parameters of the data-generating process, and thus reduce the error in estimating outcomes for units under counterfactual treatment status (Crowder and Hand 1990, Baltagi 2008, Hsiao 2014).
However, determining the optimal timing of treatment assignments is challenging for two reasons. First, it is difficult to find an explicit functional form for how the treatment design affects the variance of the treatment effect estimator (Hussey and Hughes 2007, Hemming et al. 2015, Li et al. 2018b). Second, assigning units to either the treated or control group at each point in time is an integer programming problem, which is, in general, computationally expensive to solve (Bertsimas et al. 2015, Kallus 2018, Bhat et al. 2019). For the first challenge, our analysis is based on the best linear unbiased estimator, which provides the explicit functional form and has the smallest variance among all linear unbiased estimators.
The main contribution of this paper is to provide closed-form solutions for the second challenge.
We provide theoretical guarantees for the optimality of our proposed multi-period experiment designs and support their practical relevance via extensive simulations on real data sets.

This paper's objective is to optimally allocate subjects from the control groups to the treated groups over multiple periods so that the variance of our treatment effect estimator is minimized. We restrict treated units from switching back to the control. We assume potential outcomes follow a two-way fixed-effect linear model, possibly with additional observed or latent covariates. Furthermore, we assume the treatment effect is constant, and we want to estimate this object as precisely as possible. This assumption is less restrictive than it looks. When the treatment effect is heterogeneous and time-varying, an object of interest is the average treatment effect (Rosenbaum and Rubin 1983, Abadie et al. 2010). We can estimate the average effect through the constant treatment effect model by decomposing the treatment effect into two terms: 1. the average effect, which is constant; 2. the heterogeneous and time-varying part with zero mean, which can be incorporated into the error term. We analyze the optimal experimental design to minimize the variance of the best linear unbiased estimator (BLUE) for the treatment effect. We find that a treatment design with a constant growth rate of the treated percentage is near-optimal. We propose an algorithm based on simulated annealing and the minimax decision criterion to actively improve this near-optimal solution using historical control data.

Specifically, our first main contribution is to provide simple sufficient conditions for the optimal treatment design when the potential outcome can have observed and latent covariates. In particular, we allow latent covariates to address the concern of model misspecification, namely that we may not fully observe all the covariates that affect the potential outcome.
Furthermore, if we have many observed covariates, and we do not know which matter, we could instead use a data-driven approach to estimate latent covariates that summarize the information in the data well. When covariates take finitely many values, in an optimal treatment design, the treated proportion (the fraction of treated units in each period) should be equal for each cluster of units with equal covariate values (stratum). We can interpret this treatment design as stratified randomization in both
Model: (i) two-way fixed effects; (ii) two-way fixed effects with discrete covariates; (iii) two-way fixed effects with carryover treatment effects and discrete covariates.
Optimal solution: (i) a (2t − 1)/(2T) fraction treated at each period t of a T-period design; (ii) a (2t − 1)/(2T) fraction treated for each stratum and all t; (iii) a nonlinear fraction treated in t, with five stages, for each stratum.

Table 1
This table summarizes the optimal treatment design in various potential outcome models. The base case has two-way (unit and time) fixed effects (Section 3.1). A more general case has two-way fixed effects with additional discrete observed and/or latent covariates (Section 3.2). The most general case allows the effect of a treatment to carry over to multiple periods (Section 3.3).

unit and time dimensions, in contrast to conventional A/B testing that only randomizes in the unit dimension. If the treatment only affects one period, the optimal treated proportion increases linearly in time. However, if the effect of the treatment carries over to future periods (Zhang et al. 2019a), the optimal treated proportion increases non-linearly in time, is convex at the beginning, and concave towards the end. When there is no feasible solution that attains the exact optimal treated proportion for each stratum, we provide a rounding scheme that yields a feasible solution whose estimation variance is within a factor of 1 + O(1/N) of the estimation variance of the optimal solution, where N is the number of units. Our main results are summarized in Table 1.

Our second main contribution is to propose a data-driven algorithm to efficiently find a better treatment design when covariates are latent or take infinitely many values. Our algorithm exploits covariate information in the historical control data and searches for a better treatment design based on simulated annealing and a minimax decision rule, which is motivated by our theoretical analysis. We illustrate our treatment design and algorithm's performance through synthetic experiments on three real data sets, studying interventions for reducing the flu occurrence rate, interventions for reducing the utilization of home-visit medical services, and marketing strategies for promoting grocery store sales.
The first example is motivated by several features of the critical problem of rigorously studying the effect of various interventions during a pandemic such as COVID-19. First, in such experiments, statistical power is low not only because we need to randomize at large geographical areas due to the contagious nature of the disease, but also because the treatment effect may be smaller than the (seasonal) fluctuations of the disease. Second, our proposed solutions in this paper allow experiments to start at the beginning of the peak season, continue during the peak season, and terminate at the end of the peak season. Third, the interventions usually have a carryover effect on the disease occurrence rate in multiple future periods. Overall, based on our theoretical results and empirical analysis of real data, we find that our proposed treatment designs substantially outperform benchmark designs.

The rest of the paper is organized as follows. Section 2 describes the model and our objective. The main results for the analysis of treatment designs are presented in Section 3, and Section 4 provides our algorithms for finding optimal treatment designs. Finally, in Section 5, we demonstrate the efficiency gain from our results and algorithms by analyzing synthetic experiments on interventions for reducing the flu occurrence rate, using a large multi-year real data set. We defer synthetic experiments on a secondary data set for reducing utilization of home-visit services and a tertiary data set for promoting grocery store sales to Appendix F.

We mainly focus on the problem of multi-period experimental design when treated units cannot switch back to the control, which is closely related to the stepped wedge design in clinical trials (Brown and Lilford 2006). In the stepped wedge design, an intervention is sequentially rolled out to participants (more often to clusters than to individuals) over several periods. Hemming et al.
(2015) point out that stepped wedge clustered studies tend to have better statistical performance than parallel clustered studies when the intra-cluster correlations are larger. Optimal allocation of clusters in the stepped wedge design has been studied under the linear model with time fixed effects, random cluster effects (Hussey and Hughes 2007, Hemming et al. 2015), and additional random interaction effects (Li et al. 2018b). In comparison, our setting and results are more general in the following three facets: 1. We allow each unit to have its own (observed or latent) covariates; 2. We allow the treatment to have carryover effects; 3. We show how historical control data can be used to improve treatment design. The first generalization is universal in practice; indeed, the assumption in prior literature on the linearity of outcomes in time effects and cluster effects is restrictive. The second generalization is relevant in many applications, such as in clinical trials (Grizzle 1965, Wallenstein and Fisher 1977, Willan and Pater 1986). For the third generalization, historical control data is commonly available in clinical trials and public health, so it is natural to leverage that data to enhance experiment design.

Our problem also shares similarities with the literature on covariate balancing and matching in experimental design (Pocock and Simon 1975, Cook et al. 2002, Hu et al. 2012, Bertsimas et al. 2015, 2019, Kallus 2018, Bhat et al. 2019, Krieger et al. 2019) and in causal inference (Nikolaev et al. 2013, Imai and Ratkovic 2014, Sauppe et al. 2014, Sun and Nikolaev 2016, Fan et al. 2016, Sauppe and Jacobson 2017, Li et al. 2018a, Kallus et al. 2018, Zhao 2019, Kallus 2020). Our experimental design setup is different in that we allocate treatment for multiple units over multiple periods, and the treatment cannot be removed once allocated. Compared with the literature on designing covariate balanced experiments (Bertsimas et al.
2015, 2019, Kallus 2018, Bhat et al. 2019), we allow covariates to be latent, and we propose to search for a better treatment design from historical control data that contains information about the latent covariates. Moreover, instead of directly solving the integer programming problem (which is, in general, computationally expensive, and can be infeasible when covariates are latent), we propose to start from the analytical solution that is optimal in the model without covariates, and actively improve this solution via historical data in an efficient way.

Another literature relevant to our findings is prior work on stratification in randomized experiments, such as in clinical trials (Simon 1979, Polit and Beck 2008, Athey and Imbens 2017), and in stratified sampling (Mulvey 1983, Cheng and Davenport 1989, Fox 2000). These works suggest that the statistical power of the treatment effect estimator can be increased by partitioning the population into small strata, randomizing within each stratum, and adjusting the standard errors. Our results convey a similar insight; we prove that stratification increases power in multi-period experimental design with discrete covariates. Furthermore, for more general covariates, we propose a data-driven stratification heuristic based on historical data.

The rich literature on online learning and the multi-armed bandit (e.g., Bubeck and Cesa-Bianchi (2012), Lattimore and Szepesvári (2018)) can also be viewed as a form of multi-period experimental design. However, in that setting, the decisions (or treatment designs) are adaptively updated given the experiment's partially observed results. In contrast, we consider settings where the experiment designer should commit to a procedure for the future rounds in advance.
2. Model and Problem Statement
In this section, we will introduce the potential outcome model under control and treatment in Section 2.1. Next, in Section 2.2, we describe the main treatment allocation problem that needs to be solved to maximize the statistical power of the experiment.
Before we massively scale up implementing a new policy, we want to evaluate its treatment effect. Assume we have N units to apply this new policy to, and we can observe their outcomes for T periods. We have the following assumption for the potential outcome.

Assumption 1 (Data Model Assumption). Assume the potential outcome for these N units over the T time periods can be modeled as

Y_it(z_it) = α_i + β_t + L_it + τ z_it + ε_it,   (1)

and in matrix notation,

Y(Z) = α 1_T^⊤ + 1_N β^⊤ + L + τ Z + ε,   (2)

where τ is the treatment effect, z_it ∈ {−1, 1} indicates whether unit i at time t is treated (z_it = 1) or not (z_it = −1), and α_i and β_t are unit and time fixed effects. Furthermore, we allow the matrix L = [L_it] ∈ R^{N×T} to be in one of the following two cases:
1. L = 0;
2. L = Xθ^⊤ + UV^⊤,
where X ∈ R^{N×r} and U = [u_1^⊤ ··· u_N^⊤]^⊤ ∈ R^{N×k} are observed and latent covariates, and θ ∈ R^{T×r} and V = [v_1^⊤ ··· v_T^⊤]^⊤ ∈ R^{T×k} are both unknown coefficients. We also assume that U is deterministic, and V is random and independent of the treatment with E[v_t | α, β, X] = 0 and Cov[v_t | α, β, X] = Σ_V for all t. We assume L is low-rank, that is, r, k ≪ min(N, T). In the second case, r and k can be 0. Furthermore, we assume ε_it is i.i.d. observation noise, independent of the treatment, with E[ε_it | α, β, X] = 0 and Var[ε_it | α, β, X] = σ² for all i and t.

Remark 1.
The term
UV^⊤ can be thought of as the structured component of the noise that models correlation in the noise across different units. Since in this structure the randomness is in V, the properties of U and V do not appear as symmetric variants of each other. Alternatively, we can assume V is deterministic and U is random, which would model the case where the noise has time-series correlations. We only focus on the former case since the latter would be similar. Another straightforward extension of our model that we do not cover in the paper is when there are two independent sources of structured noise (one with across-unit correlations and one with across-time correlations). Furthermore, the assumption that V has mean zero holds without loss of generality because otherwise one can update the fixed effects α and β as well as U and V so that this assumption holds. The assumptions on the randomness of this model, V and ε, are standard and allow regression estimators to be unbiased, as will be discussed in detail in Section 2.2.

In this paper, our primary focus is the second case, L = Xθ^⊤ + UV^⊤, where the potential outcome has covariates that can be either observed, latent, or both. However, we will discuss the simple case of L = 0 to build intuition for the general case.

We assume the treatment effect τ is constant, but Remark 2 below discusses how more general treatment effects can also be captured by this setup, and Remark 3 below discusses the extension of model (1) to carryover treatment effects.

We explicitly specify the fixed effects α 1_T^⊤ and 1_N β^⊤ in model (2) because it is common that the potential outcomes across units and across time periods have different means, and modeling them can potentially reduce the estimation bias of the parameters in model (2).

In this paper, we focus on the irreversible treatment adoption pattern, which is stated in the following assumption.
However, for completeness, we also solve the case when this assumption fails (the treatment can be removed) in Appendix A.1.

Assumption 2 (Irreversible Treatment Adoption Pattern). We assume the treatment adoption pattern is such that once a unit adopts the treatment, it stays treated afterwards, that is, z_it ≤ z_{i,t+1} for all i and 1 ≤ t ≤ T − 1.

First, we allow units to progressively adopt the treatment. Our setting is different from paired and stratified experiments, where the treatment status stays the same throughout the experiment. Our setting is similar to the crossover trial in a longitudinal study. However, we prohibit treated units from switching back to the control, which is closely related to the stepped wedge treatment design in randomized controlled trials (Brown and Lilford 2006, Hussey and Hughes 2007, Woertman et al. 2013, Hemming et al. 2015). This restriction is also closely related to the staggered treatment adoption pattern in observational studies (Athey and Stern 1998, Athey et al. 2018, Athey and Imbens 2018).

We study the irreversible scenario mainly for two reasons: 1. There are practical constraints restricting units from frequently switching between control and treatment. For example, policies at the cluster level, such as programs to improve the case management skills of healthcare staff at the hospital level, can hardly be suspended or rolled back. 2. When treated units switch back to control, they may not return to their original control status. For example, the improvement in the case management skills of healthcare staff can persist even if the programs are suspended.
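As a concrete illustration (ours, not from the paper), the data model of Assumption 1 together with the staggered adoption pattern of Assumption 2 can be simulated as follows; all dimensions, parameter values, and variable names are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, r, k = 40, 10, 2, 2   # units, periods, observed/latent ranks (illustrative)
tau = 1.5                   # constant treatment effect

alpha = rng.normal(size=(N, 1))                   # unit fixed effects
beta = rng.normal(size=(1, T))                    # time fixed effects
X, theta = rng.normal(size=(N, r)), rng.normal(size=(T, r))   # observed covariates
U, V = rng.normal(size=(N, k)), rng.normal(size=(T, k))       # latent part, V mean zero
L = X @ theta.T + U @ V.T                         # low-rank component, r, k << min(N, T)

# Assumption 2: unit i stays in control for adopt[i] periods, then stays treated
adopt = rng.integers(0, T + 1, size=N)
Z = np.where(np.arange(T)[None, :] >= adopt[:, None], 1, -1)

eps = rng.normal(scale=0.5, size=(N, T))          # i.i.d. observation noise
Y = alpha + beta + L + tau * Z + eps              # model (2)
```

Each row of `Z` is nondecreasing by construction, which is exactly the irreversibility constraint z_it ≤ z_{i,t+1}.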
Remark 2.
The assumption that τ is constant can be relaxed as follows. If the treatment effect is heterogeneous and time-varying, denoted by τ_it, we can still estimate the average treatment effect τ (which is typically the main goal in observational studies) by writing τ_it = τ + δ_it, where δ_it has mean 0, and incorporating δ_it into ε_it in model (2). This works because each entry of Z is ±1, so the combined noise term would be ε_it ± δ_it; this is still an i.i.d. noise model if δ_it has a symmetric distribution around the origin.

Remark 3.
In some cases, the treatment can affect multiple periods, and we are interested in estimating carryover treatment effects. We can extend model (1) to Y_it(z_it) = α_i + β_t + L_it + τ_1 z_it + τ_2 z_{i,t−1} + ··· + τ_{ℓ+1} z_{i,t−ℓ} + ε_it, where ℓ is the number of lagged periods to which the effect of the treatment can carry over. We also study the optimal experimental design for this carryover model in Section 3.3.

Our objective is to estimate the treatment effect τ as precisely as possible. We need to consider two aspects to achieve our objective:
1. How should we estimate τ?
2. How can we find the optimal design matrix Z ∈ R^{N×T} that minimizes the variance of the treatment effect estimator τ̂?
In the second question, the variance comes from the random observation noise ε and the random V in model (2).

As a preparation to discuss the first question, we state the well-known Gauss-Markov theorem.

Lemma 1 (Gauss-Markov Theorem). Consider a linear model y⃗ = Xβ + noise with E[noise | X] = 0 and Cov[noise | X] = Ω. If Ω = σ²I, then the ordinary least squares (OLS) estimator β̂ = (X^⊤X)^{−1}X^⊤y⃗ is the best linear unbiased estimator (BLUE). Otherwise, when Ω is not a multiple of the identity matrix, the generalized least squares (GLS) estimator β̂ = (X^⊤Ω^{−1}X)^{−1}X^⊤Ω^{−1}y⃗ is BLUE. Here, "best" means the estimator has the lowest variance among all unbiased linear estimators.

For the first question, in model (2), if either L = 0 or L = Xθ^⊤ (L only has observed covariates), then τ̂ estimated from OLS is BLUE. If L = Xθ^⊤ + UV^⊤, we can use GLS to estimate τ from

Y(Z) = α 1_T^⊤ + 1_N β^⊤ + Xθ^⊤ + τ Z + e,   (3)

where we incorporate UV^⊤ into the "error" term e, i.e., e = UV^⊤ + ε. Lemma 1 together with Assumption 1 implies that τ̂ from GLS is BLUE.
Hence, our theoretical analysis is based on OLS and GLS because they are BLUE.

In practice, the covariance matrix Ω of e⃗ is usually unknown, where e⃗ is the vectorization of the matrix e in Eq. (3). We can use feasible generalized least squares (feasible GLS), where we obtain an estimate of Ω from the residuals of OLS. When the errors' covariance matrix can be consistently estimated, feasible GLS is asymptotically more efficient than OLS. However, the covariance matrix has dimension NT × NT, which can be very large, and feasible GLS can be less efficient in finite samples. To address this efficiency concern, we can impose some structure on Ω, such as a diagonal-plus-low-rank structure. Alternatively, we propose an estimator (motivated by the low-rank matrix estimation literature in machine learning) that can be more efficient but may be biased. The details are presented in Sections 4 and 5.

For the second question, we can find the optimal treatment design by solving the following integer program:

min_Z Var(τ̂)   (4)
s.t. z_it ≤ z_{i,t+1}, for all i and 1 ≤ t ≤ T − 1,
     z_it ∈ {−1, 1}, for all i and t.

Solving the integer program (4) is challenging for two reasons. First, we need to understand how Var(τ̂) changes with z_it. Second, the decision variable z_it is an integer, and solving the integer program (4) can be computationally expensive. Specifically, even in the special case where T = 1 and τ̂ is obtained via OLS, we will see later in the integer program (8) that integer program (4) reduces to the number partitioning problem (Hayes 2002, Mertens 2006), which is NP-hard (the numbers corresponding to Γ can take infinitely many values).
We address the first challenge by analyzing the variance of τ̂ estimated from OLS or GLS. Specifically, this variance is a quadratic function of z_it. For the second challenge, we show in Section 3 that when τ̂ is estimated by OLS or GLS, depending on the structure of L, we can provide a near-optimal analytical solution, which has a surprisingly simple form and significantly outperforms benchmark treatment designs in empirical applications, as shown in Section 5.

Remark 4.
In the remainder, for notational simplicity, "z_it ≤ z_{i,t+1}" stands for "z_it ≤ z_{i,t+1}, for all i and 1 ≤ t ≤ T − 1", and "z_it ∈ {−1, 1}" stands for "z_it ∈ {−1, 1}, for all i and t".
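Before moving to the results, here is a minimal sketch (ours, not code from the paper) of the first question in the simplest case L = 0: τ is estimated by OLS after absorbing the unit and time fixed effects with dummy columns, normalizing α_N = 0 for identification; the function name is illustrative.

```python
import numpy as np

def estimate_tau_ols(Y, Z):
    """OLS for Y_it = alpha_i + beta_t + tau * z_it + eps_it.
    Stacks the panel period by period and regresses on z plus
    unit dummies (dropping unit N, i.e. alpha_N = 0) and time dummies."""
    N, T = Y.shape
    y = Y.T.reshape(-1)                        # (y_1^T, ..., y_T^T) stacked
    z = Z.T.reshape(-1).astype(float)
    unit = np.tile(np.arange(N), T)
    time = np.repeat(np.arange(T), N)
    W = np.column_stack([z,
                         np.eye(N)[unit][:, :-1],  # unit dummies
                         np.eye(T)[time]])         # time dummies
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    return coef[0]                             # tau_hat: coefficient on z
```

On noiseless data generated from a staggered design with at least two distinct adoption times, this recovers τ up to floating-point error; with noise it is the BLUE by Lemma 1.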
3. Results
We provide sufficient conditions for the optimal solution of the integer program (4) under mild assumptions, when we use the BLUE (either OLS or GLS) to estimate τ̂, in Sections 3.1 and 3.2. The sufficient conditions require the treated proportion to grow linearly in time. Based on the sufficient conditions, we demonstrate how to find a feasible solution whose objective value (variance) is at most 1 + O(1/N) times the objective value of the global optimum.

Furthermore, we extend our analysis to the scenario where the treatment has carryover effects, and our objective is to minimize the variance of the estimated direct and lagged treatment effects. We find that the optimal proportion is non-linear in time. The details are presented in Section 3.3.

In this section, we study the first case, i.e., L = 0 (no observed or latent covariates). Model (1) can be rewritten as

Y_it(z_it) = α_i + β_t + τ z_it + ε_it = [z_it  Γ_it^⊤] [τ; η] + ε_it,   (5)

where [z_it Γ_it^⊤] ∈ R^{1×(N+T)} and [τ; η] ∈ R^{(N+T)×1}. In matrix notation, we have

y⃗ = [z⃗  Γ] [τ; η] + ε⃗,   (6)

where y⃗ = [y⃗_1^⊤ y⃗_2^⊤ ··· y⃗_T^⊤]^⊤ ∈ R^{NT×1}, y⃗_t = [Y_1t ··· Y_Nt]^⊤ ∈ R^N, ˜I = [I_{N−1}  0⃗]^⊤ ∈ R^{N×(N−1)}, 0⃗ ∈ {0}^{N−1}, 1⃗ ∈ {1}^N, η = [α; β] with α = [α_1, ..., α_{N−1}]^⊤ and β = [β_1, ..., β_T]^⊤, and

Γ = [Γ_{:1}^⊤ Γ_{:2}^⊤ ··· Γ_{:T}^⊤]^⊤, with Γ_{:t} = [˜I  1⃗ e_t^⊤] ∈ R^{N×(N−1+T)},

where e_t ∈ R^T is the t-th standard basis vector (so the time-dummy block of Γ_{:t} is 1⃗ in column t and zero elsewhere). Note that we restrict α_N = 0 such that all other α_i and β_t can be uniquely identified. We use OLS to estimate τ and η. With some algebra, we have the following expression for the variance of τ̂:

Var(τ̂) = σ² / ( z⃗^⊤ (I_{NT} − Γ(Γ^⊤Γ)^{−1}Γ^⊤) z⃗ ).   (7)

Details to show Eq. (7) are presented in Appendix C. With Eq. (7), the integer program (4) is simplified to

min_{z⃗ = [z_it]}  z⃗^⊤ Γ(Γ^⊤Γ)^{−1}Γ^⊤ z⃗   (8)
s.t. z_it ≤ z_{i,t+1},
     z_it ∈ {−1, 1}.

Now we formally define our convex relaxation of the integer program (8). Let ω_t = (1/N) Σ_{i=1}^N z_it for all t. It is easy to see that ω_t takes values in {−1, −1 + 2/N, ..., 1 − 2/N, 1}, given the constraints of the integer program (8). We then relax the range of ω_t to −1 ≤ ω_t ≤ 1 and obtain the quadratic program

min_{ω⃗ ∈ R^T}  ω⃗^⊤ P_{1⃗} ω⃗ + 2 d⃗^⊤ ω⃗   (9)
s.t. −1 ≤ ω_t ≤ 1,
     ω_t ≤ ω_{t+1},

with P_{1⃗} = I − (1/T) 1⃗ 1⃗^⊤ ∈ R^{T×T}, d⃗ = [d_t] ∈ R^T, and d_t = (T + 1 − 2t)/T for all t. The following theorem provides the optimal treatment design for the quadratic program (9).

Theorem 1 (Fixed Effects Only). Under Assumptions 1-2, and when L = 0, an optimal solution for the quadratic program (9) is ω⃗* ∈ R^T defined by

ω*_t = (2t − 1 − T)/T, for all t.   (10)

Theorem 1 states that the analytical optimal solution ω*_t of the quadratic program (9) is linear in t. Given ω*_t, the optimal treated proportion is (1 + ω*_t)/2. Figure 1 visualizes the optimal treated proportion at every time period for a T-period treatment design problem, where T ranges from 2 to 20. The treated proportion increases at the constant rate 1/T.

Intuition.
In order to build intuition for Theorem 1, the following two lemmas provide the optimal solution for two special cases of the problem.
Lemma 2 (Time Fixed Effects Only). Consider the same assumptions as in Theorem 1, and in addition, assume that Y_it(z_it) = β_t + τ z_it + ε_it. Then, any treatment design is optimal if it satisfies ω_t = 0 for all t and z_it ≤ z_{i,t+1} for all i and t.

Lemma 2 states that if there are only time fixed effects in the potential outcome, the optimal treatment design is the parallel treatment design, in which half of the units are assigned the treatment while the others function as the control for all time periods. The intuition for Lemma 2 is that, given the time effect β_t, observations within a period serve as their own control to estimate τ. Formally, the variance only depends on the cross-sectional average of z_it. The variance is minimized when there are 50% treated and 50% control units at every time period. Rows of Z are exchangeable, so only the treated proportion, rather than whom to treat, matters. Table 2a provides an example of the optimal treatment design. Lemma 2 implies that it is optimal to randomly and evenly split units into the treated and control groups from the beginning, as in conventional A/B testing and placebo tests.

Lemma 3 (Unit Fixed Effects Only). Consider the same assumptions as in Theorem 1, and in addition, assume that Y_it(z_it) = α_i + τ z_it + ε_it. Then, any treatment design is optimal if it satisfies ω_t = −1 for all t < (T + 1)/2 and ω_t = 1 for all t > (T + 1)/2.

Lemma 3 states that if there are only unit fixed effects, the optimal treatment design is to allocate the treatment to all units at halftime. The intuition for Lemma 3 is symmetric to that for Lemma 2: given the unit effect α_i, every unit's observations serve as their own control to estimate τ. Formally, the variance only depends on the time-series average of z_it. The variance is minimized when there are 50% treated and 50% control periods for each unit.
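The variance expression (7) can also be evaluated numerically. The sketch below (our illustration, with arbitrary N and T) computes Var(τ̂)/σ² for a design Z in the two-way fixed effects model by projecting z⃗ off the unit and time dummy columns, and checks that the staggered design of Theorem 1 has a smaller variance than an alternative monotone design.

```python
import numpy as np

def var_tau(Z):
    """Var(tau_hat)/sigma^2 from Eq. (7): 1 / (z^T (I - P_Gamma) z), where
    P_Gamma projects onto the span of the unit and time dummy columns."""
    N, T = Z.shape
    z = Z.T.reshape(-1).astype(float)            # stack periods as in vec(y)
    unit = np.tile(np.arange(N), T)
    time = np.repeat(np.arange(T), N)
    G = np.column_stack([np.eye(N)[unit][:, :-1],   # unit dummies, alpha_N = 0
                         np.eye(T)[time]])          # time dummies
    resid = z - G @ np.linalg.lstsq(G, z, rcond=None)[0]   # (I - P_Gamma) z
    return 1.0 / (z @ resid)

N, T = 8, 4
# Theorem 1: treated count at period t is N*(2t-1)/(2T) = 1, 3, 5, 7 here
counts = [N * (2 * t - 1) // (2 * T) for t in range(1, T + 1)]
Z_stag = np.array([[1 if i >= N - c else -1 for c in counts] for i in range(N)])
# alternative monotone design: half the units adopt at t=2, the rest at t=4
Z_alt = np.array([[-1, 1, 1, 1]] * (N // 2) + [[-1, -1, -1, 1]] * (N // 2))
```

Note that a fully parallel design (no unit ever switches) makes z⃗ collinear with the unit dummies in the two-way model, so τ is not even identified there; this is why staggering is essential in this setting.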
Table 2b provides an example of the optimal treatment design.

Intuitively, when we have both unit and time fixed effects, the optimal treatment design is in between the optimal treatment designs in Lemmas 2 and 3. Specifically, the variance should depend on the average of $z_{it}$ over time and the average of $z_{it}$ over units. Under the irreversible treatment adoption pattern, the average of $z_{it}$ over time can be identified given the average of $z_{it}$ over units; therefore we can cancel out the average of $z_{it}$ over time. The Hessian of the objective function in the quadratic program (9) is the positive semidefinite matrix $P_{\vec 1}$, so the objective function is convex. The analytical optimal solution $\omega^*_t$ is linear in $t$ and balances the learning of both unit and time fixed effects. If we aggregate observations over all time periods, there are 50% control and 50% treated. Table 2c provides an example of the optimal treatment design.

The optimal solution in Theorem 1, $\omega^*_t = \frac{2t-1-T}{T}$, is equivalent to $\frac{N(2t-1)}{2T}$ treated units at time period $t$. However, $\frac{N(2t-1)}{2T}$ may not be an integer. In order to get a feasible solution of the integer program (4), we use the following nearest integer rounding rule. If $T$ is odd, units can be either treated or untreated at time period $t = \frac{T+1}{2}$.

  -1 -1      -1  1      -1 -1
  -1 -1      -1  1      -1 -1
  -1 -1      -1  1      -1  1
  -1 -1      -1  1      -1  1
   1  1      -1  1      -1  1
   1  1      -1  1      -1  1
   1  1      -1  1       1  1
   1  1      -1  1       1  1
(a) Time fixed effects   (b) Unit fixed effects   (c) Two-way fixed effects

Table 2
An optimal treatment design with $N = 8$ and $T = 2$ in the models with time fixed effects only, unit fixed effects only, and two-way fixed effects. Each row denotes a unit and each column denotes a time period.

Nearest integer rounding rule:
We obtain a feasible solution $Z^{\mathrm{rnd}}$ by rounding $\frac{N(2t-1)}{2T}$ to the nearest integer. Specifically, $Z^{\mathrm{rnd}} = [z^{\mathrm{rnd}}_{it}]$ is a treatment design satisfying the constraints in the integer program (4) whose number of treated units at time $t$ equals
$$\begin{cases} \big\lfloor \frac{N(2t-1)}{2T} \big\rfloor & \text{if } \mathrm{frac}\big(\frac{N(2t-1)}{2T}\big) < \frac12, \text{ or } \mathrm{frac}\big(\frac{N(2t-1)}{2T}\big) = \frac12 \text{ with } \frac{2t-1}{2T} < \frac12, \\[4pt] \big\lceil \frac{N(2t-1)}{2T} \big\rceil & \text{if } \mathrm{frac}\big(\frac{N(2t-1)}{2T}\big) > \frac12, \text{ or } \mathrm{frac}\big(\frac{N(2t-1)}{2T}\big) = \frac12 \text{ with } \frac{2t-1}{2T} \ge \frac12, \end{cases}$$
where $\mathrm{frac}\big(\frac{N(2t-1)}{2T}\big) = \frac{N(2t-1)}{2T} - \big\lfloor \frac{N(2t-1)}{2T} \big\rfloor$. The following theorem provides an upper bound for $\mathrm{Var}(\hat\tau)$ when the treatment design $Z^{\mathrm{rnd}}$ is used.

Theorem 2.
Under Assumption 1, let $Z^*$ be the optimal solution of the integer program (4) and $Z^{\mathrm{rnd}}$ be the feasible solution of the integer program (4) using the nearest integer rounding rule. Denote the variance of $\hat\tau$ using the treatment design $Z$ as $\mathrm{Var}_Z(\hat\tau)$. We have
$$\mathrm{Var}_{Z^{\mathrm{rnd}}}(\hat\tau) \le \frac{1}{1 - 1/N}\,\mathrm{Var}_{Z^*}(\hat\tau).$$
Theorem 2 states that the nearest integer rounding rule outputs a solution that approximates the global optimum within a factor of $1 + O(1/N)$.

In this section, we study the second case, i.e., $L = X\theta^\top + UV^\top$. Model (1) can be rewritten as
$$Y_{it}(z_{it}) = \alpha_i + \beta_t + X_i\theta_t^\top + \tau z_{it} + \underbrace{u_i^\top v_t + \varepsilon_{it}}_{e_{it}}. \tag{11}$$
Since $e_{it}$ has a factor structure, the $e_{it}$ are correlated in the unit dimension. We use GLS to estimate $\tau$, because GLS is BLUE when errors are correlated. The following theorem provides the sufficient conditions and the optimal treatment design for model (11).

Theorem 3 (Observed and Latent Covariates). Suppose Assumptions 1-2 hold, the rows of $X$ satisfy $\sum_{i=1}^N X_i = \vec 0$, the rows of $U$ satisfy $\sum_{i=1}^N u_i = \vec 0$ and $\sum_{i=1}^N X_i u_i^\top = 0_{r,k}$, $v_t$'s covariance is $\Sigma_V = I_k$,
$\mathrm{Cov}[v_t, v_{t-q} \mid \alpha, \beta, X] = 0$ for any $q \ge 1$, $\mathrm{Cov}[v_t, \varepsilon_{is} \mid \alpha, \beta, X] = 0$ for any $s, t$ and $i$, and $\varepsilon_{it}$'s variance $\sigma^2 = 1$. Then, $Z$ is an optimal treatment design if it satisfies
$$\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t-1-T}{T}, \quad \frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X, \quad \frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U, \quad \text{for all } t \tag{12}$$
for some $\mu_X \in \mathbb{R}^r$ and $\mu_U \in \mathbb{R}^k$, and if it satisfies the constraints in the integer program (4). Moreover, if $(X_i, u_i)$ can only take $G$ values, denoted as $(x_1, u_1), \cdots, (x_G, u_G)$, each with positive probability, we define $\omega_{g,t} = \frac{1}{|O_g|}\sum_{i \in O_g} z_{it}$, where $O_g = \{i : (X_i, u_i) = (x_g, u_g)\}$. Then any treatment design $Z$ is optimal if it satisfies the constraints in the integer program (4) and
$$\omega_{g,t} = \frac{2t-1-T}{T}, \quad \text{for all } t \text{ and } g. \tag{13}$$

Figure 1. Optimal treated proportion at every time period for a $T$-period treatment design problem. The legends with different colors denote the number of periods $T$. For example, the leftmost blue line represents the optimal treated proportion at time periods 1 and 2 for a two-period treatment design problem, while the rightmost grey line represents the optimal treated proportion at every time period for a 20-period treatment design problem.

Remark 5.
The dimensions of $X_i$ and $u_i$ (i.e., $r$ and $k$) can be zero. These correspond to the settings in which the model has only latent covariates ($r = 0$) or only observed covariates ($k = 0$). $0_{r,k}$ is a zero matrix with dimension $r \times k$.

Remark 6.
When there are latent covariates $u_i$ in the model, we need to know their values to stratify. If we have no information about $u_i$, we can only stratify over $X_i$, and each stratum has treated proportion $\frac{2t-1}{2T}$ (in this context, units in a stratum have the same $X_i$, but may have different $u_i$). In many applications, we have historical control data, which contains information about $u_i$. We propose an algorithm in Section 4 to mimic stratification using historical data.

The assumptions in Theorem 3 can hold without loss of generality. First, for the assumption $\sum_{i=1}^N X_i = \vec 0$ (i.e., $X$ is orthogonal to $\vec 1$) to hold, we can always use the Gram-Schmidt procedure (equivalent to a QR decomposition) to decompose $[\vec 1 \ X] = QR \in \mathbb{R}^{N \times (r+1)}$. Then the first column of $Q$ is proportional to $\vec 1$, and the remaining columns of $Q$ are orthogonal to $\vec 1$ and can be used as the observed covariates in model (11). We can use a similar procedure to project out the cross-sectional means of $u_i$ and $X_i u_i^\top$ so that the assumptions $\sum_{i=1}^N u_i = \vec 0$ and $\sum_{i=1}^N X_i u_i^\top = 0_{r,k}$ hold. Second, if $v_t$ has mean 0 and variance $\Sigma_V$ for some positive definite matrix $\Sigma_V$, we can right-multiply $u_i^\top$ by $\Sigma_V^{1/2}$ and left-multiply $v_t$ by $\Sigma_V^{-1/2}$ so that $v_t$ has variance $I_k$ (the conditions $\sum_{i=1}^N u_i = \vec 0$ and $\sum_{i=1}^N X_i u_i^\top = 0_{r,k}$ remain valid after this manipulation). Note that the assumptions on $U$ and $V$ are asymmetric; we provide a discussion in Remark 1. Moreover, we can multiply both the LHS and RHS of model (11) by $1/\sigma$ so that the variance of $\varepsilon_{it}$ is 1; the optimal treatment design does not change under this scaling. Moreover, we assume $u_i$ is deterministic and $v_t$ is random, following the literature on large-dimensional factor modeling (Bai and Ng 2002, Bai 2003). We provide additional discussion of the model and assumptions in Theorem 3 in Appendix A.2.

Theorem 3 states that when the covariates $X_i$ and $u_i$ take only finitely many values, we have an optimal treatment design. This treatment design has linear staggered rollouts with stratification. In other words, for each stratum, the group of units with the same covariate value, we randomize the treatment allocation over time such that the treated proportion is $\frac{2t-1}{2T}$ at time $t$. If we aggregate all strata, the overall treated proportion is $\frac{2t-1}{2T}$ at time $t$, which is the same as in Theorem 1. Table 3 provides an example of the optimal treatment design.

The complete proof is deferred to Appendix D; here we provide a high-level overview of the main steps.
First, via algebraic manipulations, we show that $\mathrm{Var}(\hat\tau) = \big(\vec z^\top(\Sigma_e^{-1} - \Sigma_e^{-1}\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top\Sigma_e^{-1})\vec z\big)^{-1}$, where $\Gamma$ is the more general (with covariates) version of the $\Gamma$ from Section 3.1 and $\Sigma_e$ is the covariance matrix of the vectorization of $e$. (In practice, if we do not know $\Sigma_e$, we can first run OLS and then use the residuals to estimate $\Sigma_e$; this yields a consistent estimator of $\Sigma_e$.) Second, we show that minimizing $\mathrm{Var}(\hat\tau)$ can be simplified to minimizing
$$\underbrace{(\vec\omega^{(1)})^\top P_{\vec 1}\,\vec\omega^{(1)} + 2\vec d^\top\vec\omega^{(1)}}_{(a)} + \underbrace{\sum_{j=2}^{r+1} (\vec\omega^{(j)})^\top P_{\vec 1}\,\vec\omega^{(j)}}_{(b)} + \underbrace{\vec z^\top M_U \vec z}_{(c)}$$
where $\omega^{(1)}_t = \frac{1}{N}\sum_{i=1}^N z_{it}$, $\omega^{(j+1)}_t = \frac{1}{N}\sum_{i=1}^N X_{ij} z_{it}$, $P_{\vec 1} = I_T - \frac{1}{T}\vec 1\vec 1^\top$, $d_t = \frac{T+1-2t}{T}$, and $M_U = P_{\vec 1} \otimes U(I_k + U^\top U)^{-1}U^\top$. If we can find a solution that separately minimizes (a), (b) and (c), then this solution minimizes their sum. Third, (a) is minimized when $Z$ satisfies $\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t-1-T}{T}$, following Theorem 1. (b) depends on the cross-sectional average of $z_{it}$ weighted by $X_i$ and is minimized when $Z$ satisfies $\frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X$, equivalently $\vec\omega^{(j+1)} = \mu_{X,j}\cdot\vec 1$, where $\mu_{X,j}$ is the $j$-th entry of $\mu_X$. (c) is minimized when $Z$ satisfies $\frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U$. Fourth, if $(X_i, u_i)$ can only take $G$ different values and there is a $Z$ that satisfies $\omega_{g,t} = \frac{2t-1-T}{T}$ for all $g$ and $t$, we plug this $Z$ into the LHS of the sufficient conditions (12) and verify that these conditions are satisfied.

Stratum $X_i$ (or $u_i$) $= 1$:     Stratum $X_i$ (or $u_i$) $= -1$:
  -1 -1                               -1 -1
  -1  1                               -1  1
  -1  1                               -1  1
   1  1                                1  1
(a) Two-way fixed effects with observed covariates $X_i$; (b) the same design with latent covariates $u_i$.

Table 3
An optimal treatment design with $N = 8$ and $T = 2$ under the model with either observed or latent covariates. In this toy example, there are two strata, $X_i$ (or $u_i$) $= 1$ and $X_i$ (or $u_i$) $= -1$.

In practice, we may not be able to find a feasible $Z$ that satisfies Eq. (13), because $\frac{|O_g|(2t-1)}{2T}$ may not be an integer for each stratum, in which case there does not exist a $Z$ that satisfies Eq. (13). In the case where we only have $X_i$ (no $u_i$), we use the same nearest integer rounding rule as in Section 3.1 to get a feasible treatment design $Z^{\mathrm{rnd}}$ for the integer program (4). The following theorem provides a guarantee for $Z^{\mathrm{rnd}}$.

Theorem 4.
Under the assumptions in Theorem 3 and $L = X\theta^\top$, let $Z^*$ be the optimal solution of the integer program (4) and $Z^{\mathrm{rnd}}$ be a feasible solution of the integer program (4) using the nearest integer rounding rule. Denote the variance of $\hat\tau$ using the treatment design $Z$ as $\mathrm{Var}_Z(\hat\tau)$, $x_{j,\max} = \max_g |x_{gj}|$, and $N_{\min} = \min_g |O_g|$. Assume $(1 + \sum_{j=1}^r x_{j,\max}^2)/N_{\min} < 1$; we have
$$\mathrm{Var}_{Z^{\mathrm{rnd}}}(\hat\tau) \le \frac{1}{1 - (1 + \sum_{j=1}^r x_{j,\max}^2)/N_{\min}}\,\mathrm{Var}_{Z^*}(\hat\tau).$$
Theorem 4 states that the nearest integer rounding rule outputs a feasible solution whose objective value is at most $1 + O(1/N_{\min})$ times the global optimum when $r$ and $x_{j,\max}$ are bounded. Since each realization of $X_i$ occurs with probability bounded away from 0, we have $1 + O(1/N_{\min}) = 1 + O(1/N)$ as $N \to \infty$. However, in finite samples, if $G$ is large compared with $N$, the feasible solution may not be close to the global optimum. In that case, we can use partition methods, such as K-means, to partition units into $G'$ groups based on their covariates' values; within each group, we can randomize the treatment allocation such that Eq. (13) is satisfied. If $X_i$ takes infinitely many values, we can use this method as well.

In some cases, the effect of a treatment carries over to multiple periods. Researchers often need to account for carryover effects when designing cross-over trials, which are widely used in medical and clinical research (Grizzle 1965, Wallenstein and Fisher 1977, Hills and Armitage 1979, Willan and Pater 1986, Freeman 1989, Senn 2002, Jones and Kenward 2014). The advertising effect on sales usually has a duration (Assmus et al. 1984), typically between six and nine months, as shown by Leone (1995).
In this section, we generalize model (1) and allow the treatment to have carryover effects,
$$Y_{it}(z_{it}) = \alpha_i + \beta_t + X_i\theta_t^\top + \tau_1 z_{it} + \tau_2 z_{i,t-1} + \cdots + \tau_{\ell+1} z_{i,t-\ell} + u_i^\top v_t + \varepsilon_{it}, \tag{14}$$
where $\varepsilon_{it} \overset{\text{i.i.d.}}{\sim} (0, \sigma^2)$. In this model, we use GLS to estimate $\tau_1, \cdots, \tau_{\ell+1}$, and our objective is to minimize the estimation variance of $\tau_1, \cdots, \tau_{\ell+1}$ by finding the optimal treatment design $Z$. We stack the observations $Y_{it}$ from time $\ell+1$ to $T$ together; in matrix notation,
$$\underbrace{\begin{bmatrix} \vec y_{\ell+1} \\ \vec y_{\ell+2} \\ \vdots \\ \vec y_T \end{bmatrix}}_{\vec y} = \underbrace{\begin{bmatrix} \vec z_1 & \vec z_2 & \cdots & \vec z_{\ell+1} \\ \vec z_2 & \vec z_3 & \cdots & \vec z_{\ell+2} \\ \vdots & \vdots & \ddots & \vdots \\ \vec z_{T-\ell} & \vec z_{T-\ell+1} & \cdots & \vec z_T \end{bmatrix}}_{Z} \underbrace{\begin{bmatrix} \tau_{\ell+1} \\ \vdots \\ \tau_1 \end{bmatrix}}_{\vec\tau} + \Gamma\begin{bmatrix}\alpha \\ \tilde\beta\end{bmatrix} + \underbrace{\begin{bmatrix} \vec e_{\ell+1} \\ \vec e_{\ell+2} \\ \vdots \\ \vec e_T \end{bmatrix}}_{\vec e},$$
where $\vec z_t = [z_{1t}, \cdots, z_{Nt}]^\top$, $\tilde\beta = [\tilde\beta_{\ell+1}^\top, \cdots, \tilde\beta_T^\top]^\top$, $\vec e_t = [e_{1t}, \cdots, e_{Nt}]^\top$, $\tilde\beta_t = [\beta_t \ \theta_t^\top]^\top$, $\vec y_t$ is the same as the $\vec y_t$ in Eq. (6), and $\Gamma$, now with $N(T-\ell)$ rows, is the more general (with covariates) version of the $\Gamma$ in Eq. (6).

Our objective is to minimize $\mathrm{Var}(\hat{\vec\tau})$, where
$$\mathrm{Var}(\hat{\vec\tau}) = \big(Z^\top\big(\Sigma_e^{-1} - \Sigma_e^{-1}\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top\Sigma_e^{-1}\big)Z\big)^{-1},$$
$\Sigma_e$ is the covariance matrix of the vectorization of $e = [e_{it}]$, and $e_{it} = u_i^\top v_t + \varepsilon_{it}$. Note that $\mathrm{Var}(\hat{\vec\tau})$ is a matrix, and minimizing a matrix is not a well-defined problem.
Instead, we consider minimizing $\mathrm{tr}(\mathrm{Var}(\hat{\vec\tau}))$, because the objects of interest are the $\mathrm{Var}(\hat\tau_l)$ (the diagonal entries of $\mathrm{Var}(\hat{\vec\tau})$) and minimizing $\mathrm{tr}(\mathrm{Var}(\hat{\vec\tau}))$ is a well-defined problem. A surrogate for minimizing $\mathrm{tr}(\mathrm{Var}(\hat{\vec\tau}))$ is maximizing $\mathrm{tr}(\mathrm{Var}(\hat{\vec\tau})^{-1})$, that is,
$$\max_Z \ \mathrm{tr}\big(Z^\top\big(\Sigma_e^{-1} - \Sigma_e^{-1}\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top\Sigma_e^{-1}\big)Z\big) \qquad \text{s.t. } z_{it} \le z_{i,t+1}, \ z_{it} \in \{-1, 1\}. \tag{15}$$
The optimal solution $Z$ of the integer program (15) is called the T (trace)-optimal treatment design. We also consider another objective, minimizing $\det(\mathrm{Var}(\hat{\vec\tau}))$, in Appendix A.3. The following theorem provides sufficient conditions for the optimal solution of the integer program (15), as well as the analytical optimal solution when the covariates take only finitely many values.

Theorem 5 (Carryover Effects with Observed and Latent Covariates). Suppose the potential outcome follows model (14), the assumptions in Theorem 3 hold, and
$T > \frac{\ell^3 + 13\ell^2 + 7\ell + 3}{8\ell}$. $Z$ is an optimal treatment design of the integer program (15) if it satisfies
$$\frac{1}{N}\sum_{i=1}^N z_{it} = \omega^*_t, \quad \frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X, \quad \frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U, \quad \text{for all } t \tag{16}$$
for some $\mu_X \in \mathbb{R}^r$ and $\mu_U \in \mathbb{R}^k$, and the constraints in the integer program (4), where $\vec\omega^* \in \mathbb{R}^T$ is defined by
$$\omega^*_t = \begin{cases} -1 & t \le \lfloor \ell/2 \rfloor \\ \big((A^{(\ell)})^{-1} b^{(\ell)}\big)_{t - \lfloor \ell/2 \rfloor} & \lfloor \ell/2 \rfloor < t \le \ell \\ \frac{2t - (\ell+1) - T}{T - \ell} & \ell < t \le T - \ell \\ -\big((A^{(\ell)})^{-1} b^{(\ell)}\big)_{T + 1 - \lfloor \ell/2 \rfloor - t} & T - \ell < t \le T - \lfloor \ell/2 \rfloor \\ 1 & T - \lfloor \ell/2 \rfloor < t \end{cases} \tag{17}$$
The definitions of $A^{(\ell)} \in \mathbb{R}^{(\ell - \lfloor \ell/2 \rfloor) \times (\ell - \lfloor \ell/2 \rfloor)}$ and $b^{(\ell)} \in \mathbb{R}^{\ell - \lfloor \ell/2 \rfloor}$ are provided in Eq. (22) and Eq. (23) in Appendix A.3. Moreover, if $(X_i, u_i)$ can only take $G$ values, each with positive probability, and $\omega_{g,t} = \frac{1}{|O_g|}\sum_{i \in O_g} z_{it}$ is defined as in Theorem 3, then any treatment design $Z$ is optimal if it satisfies the constraints in the integer program (4) and
$$\omega_{g,t} = \omega^*_t, \quad \text{for all } t \text{ and } g. \tag{18}$$

Remark 7.
Similar to Theorem 3, the dimensions of $X_i$ and $u_i$ (i.e., $r$ and $k$) can be zero.

Remark 8.
To provide a few concrete examples of $\omega^*_t$: if $\ell = 1$, Eq. (17) gives $\omega^*_t = \frac{2t - 2 - T}{T - 1}$ for $1 < t \le T - 1$, with the two endpoint values $\omega^*_1 = -\omega^*_T$ determined by the $1 \times 1$ system $A^{(1)}, b^{(1)}$. If $\ell = 2$, we have $\omega^*_1 = -1$, $\omega^*_t = \frac{2t - 3 - T}{T - 2}$ for $t = 3, \cdots, T-2$, $\omega^*_T = 1$, and $\omega^*_2 = -\omega^*_{T-1}$ determined by the scalar system $A^{(2)}, b^{(2)}$. We provide the expression of $\omega^*_t$ for $\ell = 3$ in Remark 16 in Appendix A.3. Figure 2 shows the optimal treated proportion for other $\ell$ in a $T = 10$ period problem.

Remark 9.
The assumption $T > \frac{\ell^3 + 13\ell^2 + 7\ell + 3}{8\ell}$ can potentially be relaxed. We verified using optimization software that Eq. (17) is optimal even when this assumption is not satisfied.

Figure 2. T-optimal design: optimal treated percentage at each period for a $T$-period treatment design in the presence of carryover treatment effects, where $T = 10$. Different colors represent the number of periods $\ell$ to which the treatment can carry over.

The treated proportion is convex at the beginning and concave towards the end. If the treatment has carryover effects over longer periods (i.e., larger $\ell$), the optimal treated proportion is smaller at the beginning and larger towards the end, as shown in Figure 2. There are five stages in the optimal solution from Theorem 5: in the first stage, all units are control units; in the second stage, $\omega^*_t$ grows non-linearly in time and is less than $-1 + \frac{\ell+1}{T-\ell}\cdot\frac{t - \lfloor \ell/2 \rfloor}{\lfloor \ell/2 \rfloor + 1}$ (implied by the assumption $T > \frac{\ell^3 + 13\ell^2 + 7\ell + 3}{8\ell}$); in the third stage, $\omega^*_t$ grows linearly in time; in the fourth stage, $\omega^*_t$ grows non-linearly in time again and is larger than $1 - \frac{\ell+1}{T-\ell}\cdot\frac{T + 1 - \lfloor \ell/2 \rfloor - t}{\lfloor \ell/2 \rfloor + 1}$ (implied by the same assumption); in the fifth stage, all units are treated. The optimal solution is symmetric with respect to the origin, because $f(\omega_1, \cdots, \omega_T) = f(-\omega_T, \cdots, -\omega_1)$, where $f(\omega_1, \cdots, \omega_T)$ is the objective function of the integer program (15).

Intuitively, treated proportions are larger at the end, compared with the direct-effect-only case in Theorem 3, because we want sufficiently many observations to precisely estimate the lagged treatment effects.
Treated proportions are smaller at the beginning because we only use the observed outcomes from $\ell + 1$ to $T$ to estimate $\vec\tau$, $\alpha$, and $\tilde\beta$, so increasing the treated proportion at the beginning can only marginally affect the efficiency.

The complete proof is deferred to Appendix E; here we provide an overview of the main steps. First, we start with the case where the potential outcome has no covariates (i.e., $r = k = 0$ and $\Sigma_e = I_{N(T-\ell)}$); maximizing the objective function of the integer program (15) is then equivalent to minimizing $\mathrm{tr}(Z^\top\Gamma(\Gamma^\top\Gamma)^{-1}\Gamma^\top Z)$. The corresponding convex relaxation is a sum of $\ell + 1$ quadratic functions,
$$\sum_{j=1}^{\ell+1}\Big[\vec\omega_{j:(T-\ell-1+j)}^\top P_{\vec 1}\,\vec\omega_{j:(T-\ell-1+j)} + 2\big(\vec d^{(j)}\big)^\top \vec\omega_{j:(T-\ell-1+j)}\Big], \tag{19}$$
where $\vec\omega_{j:(T-\ell-1+j)} = [\omega_j \cdots \omega_{T-\ell-1+j}]^\top$, $P_{\vec 1} = I - \frac{1}{T-\ell}\vec 1\vec 1^\top \in \mathbb{R}^{(T-\ell)\times(T-\ell)}$, $\vec d^{(j)} = [d^{(j)}_t] \in \mathbb{R}^{T-\ell}$ for all $j$, and the definition and constraints of $\omega_t$ are the same as those in the quadratic program (9). The details are presented in Lemma 4 in Appendix A.3. Second, we verify that $\omega^*$ in Eq. (17) satisfies the KKT conditions of the quadratic program with objective function (19); $\omega^*$ is therefore an optimal solution of this quadratic program. The details are presented in the proof of Theorem 5 in Appendix E. Third, we follow the same steps as in Theorem 3 to show that additional covariate conditions are needed when the potential outcome has covariates.
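As a quick consistency check (our own remark, not part of the proof), the five-stage solution (17) collapses to Theorem 1 when there are no carryover effects. For $\ell = 0$ we have $\lfloor \ell/2 \rfloor = 0$, so stages 1, 2, 4, and 5 are empty and the third stage covers $1 \le t \le T$ with
$$\omega^*_t = \frac{2t - (0+1) - T}{T - 0} = \frac{2t - 1 - T}{T},$$
which matches Eq. (10).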
4. Algorithms
In this section, we focus on the scenario where the potential outcome has latent covariates, which is common in practice,
$$Y(Z) = \alpha\vec 1^\top + \vec 1\beta^\top + UV^\top + \tau Z + \varepsilon. \tag{20}$$
For notational simplicity, we use $L$ to denote $UV^\top$ in this section. We propose algorithms to tackle two challenges:
1. How can we find the optimal treatment design in practice when covariates are unobserved?
2. Is there an alternative (potentially non-linear) estimation approach for $\tau$ that may have a smaller variance?
For the first challenge, in Section 4.1, we introduce a local search heuristic (motivated by the sufficient conditions in Theorem 3) that searches for a better treatment design by leveraging historical control data, which contains the latent covariates' information and is commonly available in practice. For the second challenge, in Section 4.2, we leverage advances in low-rank matrix estimation from the machine learning literature to introduce a non-linear estimator for $\tau$ that may be biased but can have a smaller variance. We refer to this estimator as low-rank matrix estimation with fixed effects (LRME).

Since the optimal treatment design depends on the values of the latent covariates (Theorem 3), historical control data with covariate information can help us find a better treatment design. To do so, we propose to (a) approximate $\mathrm{Var}(\hat\tau)$ using a robust surrogate when our estimation approach does not have a closed-form expression for $\mathrm{Var}(\hat\tau)$, and (b) find an approximate solution to the computationally challenging integer program of finding the optimal $Z$ that minimizes $\mathrm{Var}(\hat\tau)$.
We propose Algorithm 1, which addresses (a) and (b) as follows:

• For (a), we adopt a robust measure to approximate $\mathrm{Var}(\hat\tau)$ that consists of two phases:
(i) Split the historical control data into sub-matrices: we use a moving-window approach to split the historical control data $Y^{\mathrm{hist,All}}$ into $m$ sub-matrices $Y^{\mathrm{hist},(1)}, \cdots, Y^{\mathrm{hist},(m)}$, each with dimension $N \times T$.
(ii) We take a synthetic treatment effect $\tau$ (which can be 0) and apply a synthetic experiment to $Y^{\mathrm{hist},(j)}$ using treatment design $Z$. In more detail, we assume $Y^{\mathrm{hist},(j)}$ does not contain the intervention we study. We define the control outcome as $Y^{(j)}_{it}(-1) = Y^{\mathrm{hist},(j)}_{it}$, the treated outcome as $Y^{(j)}_{it}(1) = Y^{(j)}_{it}(-1) + \tau$, and the "observed" outcome as $Y^{(j)}_{it} = \frac{1+Z_{it}}{2}\cdot Y^{(j)}_{it}(1) + \frac{1-Z_{it}}{2}\cdot Y^{(j)}_{it}(-1)$ for every $i$, $t$ and $j$. We use $Y^{(j)}$ and $Z$ as the input to estimate $\tau$; the estimator is denoted $\hat\tau^{(j)}$. We calculate the maximum estimation error across all sub-matrices, $\max_j |\hat\tau^{(j)} - \tau|$, and use it as a robust surrogate for $\mathrm{Var}(\hat\tau)$.

• For (b), we start from an arbitrary treatment design $Z$ that satisfies $\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t-1-T}{T}$ and the constraints in the integer program (4). We then adopt the following local search heuristic (motivated by simulated annealing) to improve $Z$ until $Z$ converges or a maximum number of iterations is reached:
(i) Swap two random rows of $Z$ at every iteration and denote the new treatment design by $Z^{\mathrm{new}}$ (note that the cross-sectional average remains $\frac{2t-1-T}{T}$ at time period $t$).
(ii) Calculate the maximum estimation error $\max_j |\hat\tau^{(j)} - \tau|$ when treatment design $Z^{\mathrm{new}}$ is applied to $Y^{(1)}, \cdots, Y^{(m)}$. If $\max_j |\hat\tau^{(j)} - \tau|$ is reduced, replace $Z$ by $Z^{\mathrm{new}}$. Otherwise, keep $Z$ unchanged with probability $1 - p_{\mathrm{escape}}$, but still replace $Z$ by $Z^{\mathrm{new}}$ with probability $p_{\mathrm{escape}}$. Here, $p_{\mathrm{escape}}$ is a small positive number.
(iii) At the end of each iteration, define $Z^{\mathrm{opt}}$ to be the matrix $Z$ with the smallest $\max_j |\hat\tau^{(j)} - \tau|$ so far. Once the algorithm stops (the maximum number of iterations is reached or $Z$ has converged), return $Z^{\mathrm{opt}}$.

Remark 10.
We use the maximum estimation error instead of the mean-squared error because it is more robust to outliers.
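The row-swap search can be sketched as follows. This is our own simplification, not the paper's code: instead of the synthetic-experiment error, it uses a toy surrogate loss — the variation over $t$ of the $u$-weighted cross-sectional average, which the sufficient condition in Theorem 3 requires to be constant — and all names are illustrative.

```python
import random

def staggered_design(N, T):
    """z_it in {-1, +1}; unit i adopts treatment for a staggered number of periods."""
    Z = []
    for i in range(N):
        n_treat = round(i * T / N)  # rough stagger; the search then reassigns rows
        Z.append([-1] * (T - n_treat) + [1] * n_treat)
    return Z

def surrogate_loss(Z, u):
    """Toy stand-in for max_j |tau_hat^(j) - tau|: how far (1/N) sum_i u_i z_it
    is from being constant over t (the Theorem 3 sufficient condition)."""
    N, T = len(Z), len(Z[0])
    col = [sum(u[i] * Z[i][t] for i in range(N)) / N for t in range(T)]
    mean = sum(col) / T
    return sum((c - mean) ** 2 for c in col)

def local_search(Z, u, iters=1000, p_escape=0.02, seed=0):
    """Swap random rows; accept improvements, and worse moves with prob p_escape."""
    rng = random.Random(seed)
    cur = [row[:] for row in Z]
    cur_loss = surrogate_loss(cur, u)
    best, best_loss = [row[:] for row in cur], cur_loss
    for _ in range(iters):
        i, j = rng.sample(range(len(Z)), 2)
        cur[i], cur[j] = cur[j], cur[i]          # row swap: column sums unchanged
        new_loss = surrogate_loss(cur, u)
        if new_loss < cur_loss or rng.random() < p_escape:
            cur_loss = new_loss
            if new_loss < best_loss:
                best, best_loss = [row[:] for row in cur], new_loss
        else:
            cur[i], cur[j] = cur[j], cur[i]      # undo the swap
    return best, best_loss
```

Because swaps only permute rows, every candidate keeps the staggered column proportions of the starting design, mirroring Remark 11.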
Remark 11.
In the local search heuristic, when we swap rows of $Z$, the cross-sectional average does not change. This guarantees that one of the sufficient conditions for the optimal solution in Theorem 3 (i.e., $\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t-1-T}{T}$) is always satisfied. The goal of the algorithm is therefore to find a treatment design $Z$ that satisfies the other sufficient condition (i.e., $\frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U$).

Remark 12.
In the local search heuristic, we replace $Z$ by $Z^{\mathrm{new}}$ with some small probability $p_{\mathrm{escape}}$, following the idea of simulated annealing, to escape local minima that may trap $Z$.

Instead of incorporating $L$ into the error term, we can estimate $L$ jointly with $\tau$ using the following objective function,
$$\hat\tau, \hat\alpha, \hat\beta, \hat L = \arg\min_{\tau,\alpha,\beta,L} \ \frac{1}{NT}\Big\|Y - \alpha\vec 1^\top - \vec 1\beta^\top - L - \tau Z\Big\|_F^2 + \mu\|L\|_*, \tag{21}$$
denoted as low-rank matrix estimation with fixed effects (LRME). Here, $\|L\|_*$ is the nuclear norm (or trace norm) of the matrix $L$, which is equal to the sum of its singular values, and $\|\cdot\|_F$ refers to the Frobenius norm of a matrix.

Algorithm 1:
A data-driven local search heuristic for $Z$

Inputs: $Y^{(1)}_{it}(-1), \cdots, Y^{(m)}_{it}(-1)$, $Z$, $\tau$, $t_{\mathrm{init}}$, $t_{\mathrm{min}}$, $\upsilon$, $\mathrm{steps}_{\max}$
$Z^{\mathrm{current}} \leftarrow Z$; $Y^{(j)}_{it}(1) \leftarrow Y^{(j)}_{it}(-1) + \tau$ and $Y^{(j)}_{it} \leftarrow \frac{1+Z^{\mathrm{current}}_{it}}{2}\cdot Y^{(j)}_{it}(1) + \frac{1-Z^{\mathrm{current}}_{it}}{2}\cdot Y^{(j)}_{it}(-1)$ for all $i$, $t$ and $j$;
for $j = 1, \cdots, m$ do: estimate $\hat\tau^{(j)}, \hat\alpha^{(j)}, \hat\beta^{(j)}$ from $Y^{(j)}$ by OLS/GLS/LRME; end
$E^{\mathrm{current}} \leftarrow \max_j |\hat\tau^{(j)} - \tau|$; $E^{\mathrm{opt}} \leftarrow E^{\mathrm{current}}$; $Z^{\mathrm{opt}} \leftarrow Z^{\mathrm{current}}$; $t \leftarrow t_{\mathrm{init}}$; $i_{\mathrm{step}} \leftarrow 0$;
while $t > t_{\mathrm{min}}$ and $i_{\mathrm{step}} < \mathrm{steps}_{\max}$ do:
  randomly select distinct $i_1, i_2 \in [N]$; $Z^{\mathrm{new}} \leftarrow Z^{\mathrm{current}}$; $Z^{\mathrm{new}}_{i_1,:} \leftarrow Z^{\mathrm{current}}_{i_2,:}$; $Z^{\mathrm{new}}_{i_2,:} \leftarrow Z^{\mathrm{current}}_{i_1,:}$; $i_{\mathrm{step}} \leftarrow i_{\mathrm{step}} + 1$;
  $Y^{(j)}_{it} \leftarrow \frac{1+Z^{\mathrm{new}}_{it}}{2}\cdot Y^{(j)}_{it}(1) + \frac{1-Z^{\mathrm{new}}_{it}}{2}\cdot Y^{(j)}_{it}(-1)$ for all $i$, $t$ and $j$;
  for $j = 1, \cdots, m$ do: estimate $\hat\tau^{(j)}, \hat\alpha^{(j)}, \hat\beta^{(j)}$ from $Y^{(j)}$ by OLS/GLS/LRME; end
  $E^{\mathrm{new}} \leftarrow \max_j |\hat\tau^{(j)} - \tau|$; $dE \leftarrow E^{\mathrm{new}} - E^{\mathrm{current}}$;
  if $E^{\mathrm{new}} < E^{\mathrm{opt}}$ then $E^{\mathrm{current}} \leftarrow E^{\mathrm{new}}$, $Z^{\mathrm{current}} \leftarrow Z^{\mathrm{new}}$, $Z^{\mathrm{opt}} \leftarrow Z^{\mathrm{new}}$;
  else if $\exp(-dE/t) > \mathrm{random}(0,1)$ then $E^{\mathrm{current}} \leftarrow E^{\mathrm{new}}$, $Z^{\mathrm{current}} \leftarrow Z^{\mathrm{new}}$;
  $t \leftarrow t \times \upsilon$;
end
Outputs: $Z^{\mathrm{opt}}$

If the regularization parameter $\mu$ is larger, the rank of $\hat L$ tends to be smaller. It is known in the matrix estimation literature that such a bias in estimating $L$ can lead to a lower variance (Candes and Recht 2009, Candes and Plan 2010). If $\mathrm{Var}(\hat L)$ is lower, $\mathrm{Var}(\hat\tau)$ is likely to be smaller. We compare LRME with the BLUE estimator through synthetic experiments in Section 5.

The objective function (21) is convex in $\tau, \alpha, \beta$ and $L$, and has $NT + N + T + 1$ variables in total. Finding the global optimal solution of the convex program (21) can be slow with off-the-shelf software for convex optimization problems such as cvxpy. Alternatively, we propose the iterative singular value thresholding and ordinary least squares (iterative SVT and OLS) algorithm to solve the convex program (21) efficiently. The details of this algorithm are described in Algorithm
2. We can justify the SVT step using Theorem 1 of Hastie et al. (2015), which shows that the optimal solution of
$$\hat L = \arg\min_{\mathrm{rank}(L) \le k} \ \frac12\|Y - L\|_F^2 + \mu\|L\|_*$$
is $\hat L = U_k S_\mu(D_k) V_k^\top$, where the rank-$k$ SVD of $Y$ is $U_k D_k V_k^\top$ and $S_\mu(D_k)$ is a $k \times k$ diagonal matrix with diagonal entries $(\sigma_1 - \mu)_+, \cdots, (\sigma_k - \mu)_+$. When we have historical control data, we can use cross-validation to find the optimal $\mu$ by grid search. The details are provided in Appendix B.1.

Algorithm 2:
Iterative SVT and OLS
Inputs: $Y$, $Z$, $k$, $\mu_{NT}$, $\Delta_\tau$, and $t_{\max}$
$\hat\tau^{(-1)} \leftarrow \infty$; $t \leftarrow 0$; $\hat\tau^{(0)}, \hat\alpha^{(0)}, \hat\beta^{(0)} \leftarrow \arg\min_{\tau,\alpha,\beta} \big\|Y - \alpha\vec 1^\top - \vec 1\beta^\top - \tau Z\big\|_F^2$;
$\hat Y^{(0)}_e \leftarrow Y - \hat\alpha^{(0)}\vec 1^\top - \vec 1(\hat\beta^{(0)})^\top - \hat\tau^{(0)} Z$;
while $|\hat\tau^{(t)} - \hat\tau^{(t-1)}| > \Delta_\tau$ and $t < t_{\max}$ do:
  compute the rank-$k$ SVD of $\hat Y^{(t)}_e$, $U^{(t)}_k D^{(t)}_k (V^{(t)}_k)^\top$, where $D^{(t)}_k = \mathrm{diag}(d^{(t)}_1, \cdots, d^{(t)}_k)$;
  $S_{\mu_{NT}}(D^{(t)}_k) \leftarrow \mathrm{diag}\big((d^{(t)}_1 - \mu_{NT})_+, \cdots, (d^{(t)}_k - \mu_{NT})_+\big)$;
  $\hat L^{(t+1)} \leftarrow U^{(t)}_k S_{\mu_{NT}}(D^{(t)}_k)(V^{(t)}_k)^\top$;
  $\hat\tau^{(t+1)}, \hat\alpha^{(t+1)}, \hat\beta^{(t+1)} \leftarrow \arg\min_{\tau,\alpha,\beta} \big\|Y - \alpha\vec 1^\top - \vec 1\beta^\top - \tau Z - \hat L^{(t+1)}\big\|_F^2$;
  $\hat Y^{(t+1)}_e \leftarrow Y - \hat\alpha^{(t+1)}\vec 1^\top - \vec 1(\hat\beta^{(t+1)})^\top - \hat\tau^{(t+1)} Z$;
  $t \leftarrow t + 1$;
end
Outputs: $\hat\tau^{(t)}, \hat\alpha^{(t)}, \hat\beta^{(t)}, \hat L^{(t)}$
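The iteration can be sketched compactly in numpy. This is our own toy implementation, not the paper's code: the split of the grand mean between $\hat\alpha$ and $\hat\beta$ is one arbitrary normalization, and all defaults are illustrative.

```python
import numpy as np

def demean2(M):
    """Two-way (unit and time) demeaning."""
    return M - M.mean(1, keepdims=True) - M.mean(0, keepdims=True) + M.mean()

def fit_fixed_effects(Y, Z):
    """OLS for tau, alpha, beta in Y ~ alpha 1' + 1 beta' + tau Z (within estimator)."""
    Zt, Yt = demean2(Z), demean2(Y)
    tau = (Zt * Yt).sum() / (Zt * Zt).sum()
    R = Y - tau * Z
    alpha = R.mean(1, keepdims=True) - R.mean() / 2  # arbitrary split of grand mean
    beta = R.mean(0, keepdims=True) - R.mean() / 2
    return tau, alpha, beta

def svt(M, mu, k):
    """Rank-k SVD followed by soft-thresholding of the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s[:k] - mu, 0.0)
    return (U[:, :k] * s) @ Vt[:k]

def lrme(Y, Z, k=2, mu=1.0, max_iter=100, tol=1e-10):
    """Iterative SVT and OLS: alternate fixed-effects OLS and SVT of the residual."""
    L = np.zeros_like(Y)
    tau_prev = np.inf
    for _ in range(max_iter):
        tau, alpha, beta = fit_fixed_effects(Y - L, Z)
        L = svt(Y - alpha - beta - tau * Z, mu, k)  # SVT of the residual matrix
        if abs(tau - tau_prev) < tol:
            break
        tau_prev = tau
    return tau, L
```

On data with no factor component, a moderate $\mu$ shrinks $\hat L$ to zero and the estimator reduces to the two-way fixed-effects OLS.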
5. Empirical Study: Reducing Influenza Occurrence Rate
Influenza is primarily a community-based infection that is transmitted in households and community settings. The CDC estimates that influenza has resulted in 9 million to 45 million illnesses, 140,000 to 810,000 hospitalizations, and 12,000 to 61,000 deaths annually since 2010. The best way to prevent infectious diseases such as seasonal flu is vaccination. However, in the absence of a vaccine, as in the COVID-19 pandemic, other interventions such as social distancing or wearing a face cover are recommended. An important and challenging question is: what is the efficacy of such preventive measures, or of other interventions in different environments such as schools, workplaces, and businesses, in slowing down the spread of the disease and reducing its substantial burden on society?
The best way to answer this question is to run pilot programs or experiments. Since flu is contagious, we must randomize the experiments at an aggregate geographical level, such as the Metropolitan Statistical Area (MSA), to minimize spillover effects. The tradeoff is that we then have a smaller number of units, and therefore lower statistical power to detect the treatment effect. Moreover, the treatment effect is generally small compared with the natural fluctuations of the flu occurrence rate, so statistical power is critical for detecting the effect of new interventions. Hence, our solution in Section 3 and our algorithm in Section 4 are especially relevant in this setting for evaluating the effect of new interventions.
In this study, we focus on the MarketScan Research Databases (formerly called Truven), which contain longitudinal, patient-level medical claim data for healthcare research. The databases contain data on continuous periods of enrollment per individual, and inpatient and outpatient claim records from the beginning of 2007 to mid-2017. We focus on the inpatient admission records and outpatient service records where the primary diagnosis (DX1) is influenza according to ICD-9-CM diagnosis codes (the databases only have claim records with ICD-9 diagnosis codes). Overall, the data contains 21,277 inpatient admissions and 9,678,572 outpatient records with primary diagnosis influenza, indicating that patients are not usually admitted to hospitals for influenza. We consider this combined data for the analysis. We aggregate the influenza records by MSA and month. Furthermore, we aggregate individual enrollment data by MSA and month to get the total number of enrolled patients by MSA and month. We divide the number of influenza visits by the number of enrolled patients to obtain the flu visit occurrence rate by MSA and month. Due to the seasonal nature of the flu, we focus on the months from October to April, when flu is active and there is a relatively high flu visit occurrence rate. We further focus on the period between October 2007 and April 2015 because outside this period we have only a few observations, and the flu occurrence rate is 0 for most MSAs. We obtain a panel of 185 MSAs over 56 months, where all MSAs have a positive occurrence rate every month.
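The panel construction above amounts to a count-and-divide over (MSA, month) cells. A toy sketch with made-up records follows (the actual MarketScan schema and field names differ; everything here is illustrative):

```python
from collections import defaultdict

# Hypothetical per-claim records: one (msa, month) tuple per influenza visit.
flu_visits = [
    ("MSA_A", "2008-01"), ("MSA_A", "2008-01"), ("MSA_B", "2008-01"),
]
# Hypothetical enrollment counts per (msa, month) cell.
enrollment = {
    ("MSA_A", "2008-01"): 1000,
    ("MSA_B", "2008-01"): 500,
}

# Aggregate influenza records by (MSA, month).
counts = defaultdict(int)
for key in flu_visits:
    counts[key] += 1

# Divide visits by enrolled patients to get the occurrence rate per cell.
occurrence_rate = {k: counts[k] / enrollment[k] for k in enrollment}
# e.g. occurrence_rate[("MSA_A", "2008-01")] == 0.002
```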
Remark 13.
In addition to the flu occurrence problem, we also apply our treatment design algorithm to two other data sets (home medical visits and purchases from a large grocery store). The secondary and tertiary analyses are presented in Appendix F.
In this section, we present the models and estimation methods, the treatment designs, how we run the synthetic experiments, the metrics used to evaluate treatment designs, and the specification used in the synthetic experiments. (In ICD-9-CM, the following diagnosis codes indicate influenza: 488, 487.0, 487.1, 487.8, 488.0, 488.1, 488.01, 488.02, 488.09, 488.11, 488.12, 488.19, 488.81, 488.82, and 488.89.)

Model Assumptions and Estimation Methods.
Since we do not know the data generating process of the flu occurrence data, we consider two models when the treatment only has a direct effect: one is the two-way fixed effect model with direct effect only, $Y_{it}(z_{it}) = \alpha_i + \beta_t + \tau z_{it} + \varepsilon_{it}$, and the other is the latent factor model with direct effect only, $Y_{it}(z_{it}) = \alpha_i + \beta_t + u_i^\top v_t + \tau z_{it} + \varepsilon_{it}$. Moreover, we consider two models with carryover treatment effects: the two-way fixed effect model with carryover effects, $Y_{it}(z_{it}) = \alpha_i + \beta_t + \tau_1 z_{it} + \cdots + \tau_{\ell+1} z_{i,t-\ell} + \varepsilon_{it}$, and the latent factor model with carryover effects, $Y_{it}(z_{it}) = \alpha_i + \beta_t + u_i^\top v_t + \tau_1 z_{it} + \cdots + \tau_{\ell+1} z_{i,t-\ell} + \varepsilon_{it}$. For the two-way fixed effect model, we use OLS to estimate $\tau$ (or $\tau_1, \cdots, \tau_{\ell+1}$ for the model with carryover effects). For the latent factor model, we use two methods to estimate $\tau$ (or $\tau_1, \cdots, \tau_{\ell+1}$): one is feasible GLS, and the other is LRME, presented in Section 4.2.
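As a sketch of how the two-way fixed effect OLS estimate of $\tau$ can be computed (a hypothetical minimal implementation, not the paper's code; the latent factor GLS and LRME estimators are more involved):

```python
import numpy as np

def estimate_tau_ols(Y, Z):
    """OLS estimate of tau in Y_it = alpha_i + beta_t + tau * z_it + eps_it.

    Y and Z are N x T arrays; Z has entries in {-1, +1}.
    """
    N, T = Y.shape
    # Design matrix: unit dummies, time dummies (dropping the first time
    # dummy to avoid collinearity with the unit dummies), treatment column.
    unit = np.kron(np.eye(N), np.ones((T, 1)))          # (N*T) x N
    time = np.kron(np.ones((N, 1)), np.eye(T))[:, 1:]   # (N*T) x (T-1)
    X = np.hstack([unit, time, Z.reshape(-1, 1)])
    coef, *_ = np.linalg.lstsq(X, Y.reshape(-1), rcond=None)
    return coef[-1]  # tau is the last coefficient
```

With a staggered design, the treatment column is not collinear with the unit and time dummies, so $\tau$ is identified.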
Treatment Designs.
We compare the different treatment designs via synthetic experiments. Denote by $\omega_t = \frac{1}{N}\sum_{i=1}^N z_{it}$ the cross-sectional average of the treatment design $Z$ at time $t$.
1. Benchmark treatment designs:
(a) $Z^{\mathrm{FF}}$ (fifty-fifty): $Z^{\mathrm{FF}}$ satisfies $\omega_t = 0$ for all $t$, and is optimal when the potential outcome only has time fixed effects, as shown in Lemma 2. (In this case $\vec{z}$ lies in the span of $\Gamma$ in model (5), so $\tau$ cannot be uniquely identified; hence, we randomly select one more unit as control in the first time period and one more unit as treated in the last time period to uniquely identify $\tau$.)
(b) $Z^{\mathrm{BA}}$ (before-after): $Z^{\mathrm{BA}}$ satisfies $\omega_t = -1$ for $t < \frac{T+1}{2}$ and $\omega_t = 1$ for $t > \frac{T+1}{2}$, and is optimal when the potential outcome only has unit fixed effects, as shown in Lemma 3. (In this case $\vec{z}$ again lies in the span of $\Gamma$ defined in model (5), so $\tau$ cannot be uniquely identified; hence, we randomly select one more unit as treated in time period $T/2$ and one more unit as control in time period $T/2 + 1$ to uniquely identify $\tau$.)
(c) $Z^{\mathrm{FF+BA}}$ (fifty-fifty with before-after): $Z^{\mathrm{FF+BA}}$ satisfies $\omega_t = -1$ for $t < \frac{T+1}{2}$ and $\omega_t = 0$ for $t \ge \frac{T+1}{2}$, implying that half of the units switch from control to treated midway through the experiment.
2. $Z^{\mathrm{OPT}}$: $Z^{\mathrm{OPT}}$ satisfies $\omega_t = \frac{2t-1-T}{T}$ for all $t$, and is optimal when the potential outcome has both unit and time fixed effects and there are no covariates, as shown in Theorem 3.
3. $Z^{K=2}$ (or $Z^{K=3}$; partition and stratification): the treatment design that satisfies $\omega_t = \frac{2t-1-T}{T}$ within each group, where units are partitioned into $K$ groups based on their values in the largest singular vectors estimated from the historical data.
4. $Z^{\mathrm{OPT+}}$: the treatment design returned by Algorithm 1, where the input of Algorithm 1 is $Z^{\mathrm{OPT}}$.
(GLS is the BLUE estimator but is infeasible; we provide the details to implement feasible GLS in Appendix B.2.)
5. $Z^{\mathrm{OPT\text{-}CO}}$: $Z^{\mathrm{OPT\text{-}CO}}$ satisfies $\omega_t = \omega_t^*$ for all $t$, where $\omega_t^*$ is defined in Eq. (17) and is nonlinear in $t$. $Z^{\mathrm{OPT\text{-}CO}}$ is optimal when the treatment has carryover effects over $\ell$ lagged periods and there are no covariates, as shown in Theorem 5.
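To illustrate the staggered structure, a minimal sketch that constructs a design whose cross-sectional average tracks the linear target of $Z^{\mathrm{OPT}}$, assuming the target is $\omega_t = (2t - 1 - T)/T$ (consistent with Remark 15); the adoption order would be randomized in practice:

```python
import numpy as np

def staggered_design(N, T):
    """Build Z in {-1,+1}^{N x T} whose cross-sectional average
    omega_t = (1/N) sum_i z_it tracks the linear target (2t - 1 - T)/T.

    Units adopt treatment in a fixed order and never revert,
    so z_it <= z_{i,t+1} holds for every unit.
    """
    Z = -np.ones((N, T))
    for t in range(1, T + 1):
        omega = (2 * t - 1 - T) / T            # target cross-sectional average
        n_treated = int(round(N * (1 + omega) / 2))
        Z[:n_treated, t - 1] = 1               # earliest-listed units adopt first
    return Z
```

Because the target $\omega_t$ increases in $t$, the number of treated units grows over time, producing the staggered rollout pattern; rounding makes the realized $\omega_t$ differ from the target by at most $1/N$.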
Synthetic Treatment Effect.
The original data does not contain any specific treatment that we can study. We first consider the model with only a direct treatment effect. Entry $(i,t)$ of the original data is $Y_{it}(-1)$, and if entry $(i,t)$ is hypothetically assigned to the treatment group, i.e., $z_{it} = 1$, we define $Y_{it}(1) = Y_{it}(-1) + \tau$. Therefore, the final input to the different estimation methods is the matrix $Y(Z)$ as well as $Z$ itself. The value of $\tau$ and the counterfactual values of each $Y(Z)$ are hidden from the estimators. The true value of $\tau$ will later be used to assess the performance (RMSE) of the various treatment designs and estimation methods. Moreover, we consider a hypothetical intervention with synthetic carryover treatment effects $\tau_1, \tau_2, \cdots, \tau_{\ell+1}$ as defined in model (14). We use $Y_{it}((z_{i,t-\ell}, \cdots, z_{i,t}))$ to denote the potential outcome when the past $\ell+1$ periods' treatment statuses are $z_{i,t-\ell}, \cdots, z_{i,t}$. Entry $(i,t)$ of the original data is $Y_{it}((-1, \cdots, -1, -1))$. If unit $i$ switches from the control to the treated group at period $t$, i.e., $z_{i,t-1} = -1$ and $z_{it} = 1$, we define $Y_{it}((-1, \cdots, -1, 1)) = Y_{it}((-1, \cdots, -1, -1)) + \tau_1$; if unit $i$ switches from the control to the treated group at period $t - s$ ($1 \le s < \ell$), we define $Y_{it}((-1, \cdots, 1, \cdots, 1)) = Y_{it}((-1, \cdots, -1, \cdots, -1)) + \tau_1 + \tau_2 + \cdots + \tau_{s+1}$; if unit $i$ switches from the control to the treated group at period $t - s$ ($s \ge \ell$), we define $Y_{it}((1, \cdots, 1)) = Y_{it}((-1, \cdots, -1)) + \tau_1 + \tau_2 + \cdots + \tau_{\ell+1}$.

Evaluation Metrics.
We randomly select $m$ blocks, each with dimension $N \times \tilde{T}$. For each block, the first $\tilde{T} - T$ time periods are the historical control data, and we apply synthetic experiments to the last $N \times T$ time periods. We split the historical control data into $k$ matrices, where two adjacent matrices may have overlapping time periods. We use these $k$ matrices as the input for Algorithm 1. In the experiment with direct treatment effect only, for each treatment design $Z$ and each of the $m$ blocks, the estimated treatment effect is denoted $\hat{\tau}_Z^{(j)}$, where $j = 1, \cdots, m$. We report the following metrics for each treatment design $Z$:
$$\mathrm{RMSE}_Z = \Bigg( \frac{1}{m} \sum_{j=1}^m \big( \hat{\tau}_Z^{(j)} - \tau \big)^2 \Bigg)^{1/2}, \qquad \bar{\tau}_Z = \frac{1}{m} \sum_{j=1}^m \hat{\tau}_Z^{(j)}, \qquad \mathrm{Var}(\hat{\tau}_Z) = \frac{1}{m} \sum_{j=1}^m \big( \hat{\tau}_Z^{(j)} - \bar{\tau}_Z \big)^2.$$
In the experiment with carryover effects, the estimated direct and lagged treatment effects are denoted $\hat{\tau}_{Z,i}^{(j)}$ for $i = 1, \cdots, \ell+1$ and $j = 1, \cdots, m$. We report the RMSE across all treatment effect parameters:
$$\mathrm{RMSE}_{Z,\mathrm{Carryover}} = \Bigg( \frac{1}{m(\ell+1)} \sum_{j=1}^m \sum_{l=1}^{\ell+1} \big( \hat{\tau}_{Z,l}^{(j)} - \tau_l \big)^2 \Bigg)^{1/2}.$$

Specification.
We choose the number of random blocks $m = 500$ and the treatment effect $\tau = -0.01\%$ in the experiment with direct treatment effect only. Note that the median occurrence rate is 0.0714%, so the treatment effect is about 14% of the median occurrence rate. There are 61 MSAs with at least 0.01% occurrence rate in every month; our synthetic experiments are implemented on these 61 MSAs. Moreover, in the experiment with carryover effects, we consider a prevention policy that can affect the outcome in the current period and two periods into the future, with negative treatment effects $[\tau_1, \tau_2, \tau_3]$. We use $T = 7$, starting the experiment in October and ending in April. We use $N = 25$ and $N = 50$.

In this section, we demonstrate three main findings from the synthetic experiments on the flu data, as shown in Tables 4-5 and Figure 3, and on the home medical visit data and grocery data, as shown in Tables 6-9 and Figures 5-6 in Appendix F:
1. The linear staggered treatment design $Z^{\mathrm{OPT}}$ can significantly outperform all benchmark treatment designs.
2. We can use historical data to search for a better treatment design than $Z^{\mathrm{OPT}}$, for example, using our data-driven local search heuristic.
3. When the treatment has carryover effects, the nonlinear staggered treatment design $Z^{\mathrm{OPT\text{-}CO}}$ outperforms the linear treatment design $Z^{\mathrm{OPT}}$.

Linear Staggered Treatment Design Outperforms Benchmark Treatment Designs.
We first compare the RMSE of the optimal treatment design $Z^{\mathrm{OPT}}$ with the benchmark treatment designs $Z^{\mathrm{FF}}$, $Z^{\mathrm{BA}}$, and $Z^{\mathrm{FF+BA}}$ for all estimation methods in Table 4. Figure 3 shows the bias and variance of the benchmark treatment designs and $Z^{\mathrm{OPT}}$ on the flu data. Tables 6 and 8 and Figures 5 and 6 in Appendix F show the corresponding results for the home medical visit data and grocery data. We have the following findings:
1. $Z^{\mathrm{OPT}}$ can consistently outperform benchmark treatment designs for all estimation methods.
Table 4
Inpatient and outpatient flu visit rates: This table compares the RMSE based on $m = 500$ randomly sampled blocks for benchmark treatment designs, $Z^{\mathrm{OPT}}$, $Z^{K=2}$, $Z^{K=3}$, $Z^{\mathrm{OPT+}}$, and "Hist Winner" (the estimation method with minimax estimation error on historical matrices). In the synthetic experiments, the intervention only has a direct treatment effect and we choose the treatment effect $\tau = -0.01\%$. The left table shows that $Z^{\mathrm{OPT}}$ outperforms benchmark treatment designs. The right table shows that we can find a better treatment design $Z^{\mathrm{OPT+}}$ via historical data and our data-driven local search algorithm. The optimal estimation method found on historical data has similar performance to GLS and LRME.
Table 5
Inpatient and outpatient flu visit rates: This table compares the RMSE based on $m = 500$ randomly sampled blocks for $Z^{\mathrm{OPT}}$ and $Z^{\mathrm{OPT\text{-}CO}}$. In the synthetic experiments, the intervention has direct and carryover treatment effects, with negative treatment effects $[\tau_1, \tau_2, \tau_3]$. The nonlinear staggered design $Z^{\mathrm{OPT\text{-}CO}}$ outperforms the linear staggered design $Z^{\mathrm{OPT}}$.
2. The latent factor model, estimated by either GLS or LRME, outperforms the two-way fixed effect model estimated by OLS on all data sets. In other words, GLS and LRME are better estimation methods than OLS.
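For concreteness, the evaluation metrics reported in these comparisons (RMSE, average estimate, and variance across the $m$ blocks, as defined in the Evaluation Metrics paragraph) can be computed as in this minimal sketch:

```python
import numpy as np

def design_metrics(tau_hats, tau):
    """RMSE, mean, and variance of the estimates across m synthetic blocks."""
    tau_hats = np.asarray(tau_hats, dtype=float)
    rmse = np.sqrt(np.mean((tau_hats - tau) ** 2))
    tau_bar = tau_hats.mean()
    var = np.mean((tau_hats - tau_bar) ** 2)
    return rmse, tau_bar, var

def carryover_rmse(tau_hats, taus):
    """RMSE pooled over all l+1 effect parameters; tau_hats is m x (l+1)."""
    tau_hats = np.asarray(tau_hats, dtype=float)
    return np.sqrt(np.mean((tau_hats - np.asarray(taus)) ** 2))
```

Note that the RMSE combines bias and variance, so a design can have a similar bias but a much smaller RMSE if its variance is smaller.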
Remark 14.
In practice, we can use cross-validation to select the best estimation method using historical control data. Specifically, we use leave-one-(historical-matrix)-out cross-validation: first, apply a synthetic treatment to every historical matrix and estimate $\tau$ by OLS, GLS, and LRME; second, calculate the maximum estimation error of $\tau$ across all historical matrices for each method; third, select the estimation method with the smallest maximum estimation error. We use this method to estimate the treatment effect on the experimental data. We find that the RMSE from this approach is very close to the optimal RMSE, as shown in Table 4.

An Even Better Treatment Design Found via Historical Data.
Suppose the potential outcome has either observed or latent covariates. In that case, Theorem 3 shows that the treated proportion of $Z^{\mathrm{OPT}}$ satisfies one of the two sufficient conditions for the optimal treatment design. We can apply Algorithm 1 or K-means to the historical data, which contains information about the covariates, to search for a better treatment design that mimics stratification and satisfies the other sufficient condition.

Table 4 presents the RMSE of $Z^{\mathrm{OPT}}$, of $Z^{K=2}$ and $Z^{K=3}$ returned from K-means, and of $Z^{\mathrm{OPT+}}$ returned from Algorithm 1 for all estimation methods. Figure 3 shows the bias and variance of these four treatment designs on the flu data. The corresponding results for the other two data sets are presented in Tables 6 and 8 and Figures 5 and 6 in Appendix F. For all estimation methods and all data sets, $Z^{\mathrm{OPT+}}$, $Z^{K=2}$, and $Z^{K=3}$ have a smaller RMSE than $Z^{\mathrm{OPT}}$. Moreover, in general, $Z^{\mathrm{OPT+}}$ has a smaller RMSE than $Z^{K=2}$ and $Z^{K=3}$, especially on the medical home visit and grocery data. In empirical applications, we may not know how many groups to partition units into, i.e., whether $K = 2$, $3$, or some larger number. Thus, we suggest using $Z^{\mathrm{OPT+}}$.

Nonlinear Staggered Treatment Design Outperforms When the Treatment Has Carryover Effects.
When the treatment can affect multiple periods, Theorem 5 shows that the optimal treated proportion is nonlinear in time and equals the treated proportion of $Z^{\mathrm{OPT\text{-}CO}}$. $Z^{\mathrm{OPT\text{-}CO}}$ can consistently outperform $Z^{\mathrm{OPT}}$ for all estimation methods, as shown in Table 5. Moreover, GLS is a better estimation method than OLS and LRME. The findings are robust across the other two data sets, as shown in Tables 7 and 9.
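The minimax method-selection rule of Remark 14 can be sketched as follows (a hypothetical implementation with stand-in estimators; `design_fn` and the estimator interface are illustrative assumptions, not the paper's code):

```python
import numpy as np

def select_estimator(hist_matrices, estimators, design_fn, tau=1.0):
    """Minimax selection of an estimation method on historical control data.

    hist_matrices: list of N x T control panels.
    estimators: dict mapping name -> fn(Y, Z) returning an estimate of tau.
    design_fn(N, T): returns a {-1,+1} treatment design.
    Returns the method whose worst-case |tau_hat - tau| across the
    historical matrices is smallest.
    """
    worst = {name: 0.0 for name in estimators}
    for Y0 in hist_matrices:
        N, T = Y0.shape
        Z = design_fn(N, T)
        Y = Y0 + tau * (Z == 1)  # overlay a synthetic direct effect
        for name, est in estimators.items():
            worst[name] = max(worst[name], abs(est(Y, Z) - tau))
    return min(worst, key=worst.get)
```

In the paper's setting, the candidate estimators would be OLS, feasible GLS, and LRME, and the selected method ("Hist Winner" in Table 4) is then applied to the experimental data.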
6. Concluding Remarks
In this paper, we study the optimal multi-period design of experiments, where our goal is to maximize the statistical power of our estimate of the treatment effect. Formally, we minimize the variance of the estimated treatment effect. We assume that treated units cannot switch back to control during the experiment. In the simplest case, we assume the potential outcome follows a two-way fixed-effect linear model; in extensions, we consider additional observed or latent covariates and treatments with carryover effects. We find a set of sufficient conditions for the optimal treatment design. When the potential outcome does not have covariates, or has covariates that take discrete values, we provide a feasible analytical solution that approximates the global optimum within a factor of $1 + O(1/N)$ based on the sufficient conditions.

Figure 3
Inpatient and outpatient flu visit rates: This figure compares the average estimated treatment effect $\bar{\tau}_Z$ and its standard deviation $\sqrt{\mathrm{Var}(\hat{\tau}_Z)}$ based on $m = 500$ randomly sampled blocks for the benchmark designs, $Z^{\mathrm{OPT}}$, $Z^{K=2}$, $Z^{K=3}$, and $Z^{\mathrm{OPT+}}$. In the synthetic experiments, $N = 50$, $T = 7$, the intervention only has a direct treatment effect, and we choose the treatment effect $\tau = -0.01\%$. The bar indicates the average $\bar{\tau}_Z$, while the error bar indicates the standard deviation $\sqrt{\mathrm{Var}(\hat{\tau}_Z)}$. The red dashed line indicates the true value $\tau = -0.01\%$. The figure shows the advantage of $Z^{\mathrm{OPT}}$, $Z^{K=2}$, $Z^{K=3}$, and $Z^{\mathrm{OPT+}}$ over benchmark treatment designs: the bias of the various treatment designs is similar, while $Z^{\mathrm{OPT}}$ has much smaller variance than the benchmark designs, and the treatment designs found using historical data, such as $Z^{\mathrm{OPT+}}$ from Algorithm 1, have smaller variance than $Z^{\mathrm{OPT}}$.

If we have historical control data, we propose a local search algorithm based on the minimax decision rule to find a better treatment design based on the sufficient conditions when the potential outcome has latent covariates or covariates that can take infinitely many values. Finally, we demonstrate the practical relevance of our results through synthetic interventions on the influenza occurrence rate and synthetic experiments on in-home medical service data and grocery data. Our solution can significantly outperform benchmark treatment designs independent of the model assumption and treatment effect estimation approach.
Managerial Implications.
Our results have several implications for guiding companies and policymakers in conducting experiments to evaluate interventions. First, when the sample size is small, by leveraging the time dimension and tracking the same sample across multiple time periods, one can increase the number of observations and use our optimal multi-period experimental design to maximize the benefit of the experiment. Second, the structure of the multi-period experimental design is substantially impacted when the effect of an intervention lasts for multiple time periods. Third, in the presence of historical (pre-experimentation) data, we can further optimize our treatment designs and hence reduce the number of required samples, which lowers the experiment cost.

Future directions.
One interesting direction is how to dynamically adjust the treatment design to improve efficiency using both experimental and historical control data, under the constraint that the treatment cannot be removed once implemented, or that it is costly to switch from treated to control. This direction is related to online learning and multi-armed/contextual bandit problems, but the objective is different, and there is a switching constraint from treated to control.
Study Approval and Acknowledgement: This paper used the MarketScan Research Databases. Data access was provided by the Stanford Center for Population Health Sciences Data Core. The PHS Data Core is supported by a National Institutes of Health National Center for Advancing Translational Science Clinical and Translational Science Award (UL1 TR001085) and internal Stanford funding. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Appendix A: Additional Theoretical Results
A.1. Treatment is Reversible
Proposition 1.
Suppose Assumption 1 holds and the treatment is reversible. Let $\omega_t = \frac{1}{N}\sum_{i=1}^N z_{it}$ and $\zeta_i = \frac{1}{T}\sum_{t=1}^T z_{it}$.
1. Suppose the potential outcome can be modeled as $Y_{it}(z_{it}) = \beta_t + \tau z_{it} + \varepsilon_{it}$. Any treatment design is optimal if it satisfies $\omega_t = 0$ for all $t$.
2. Suppose the potential outcome can be modeled as $Y_{it}(z_{it}) = \alpha_i + \tau z_{it} + \varepsilon_{it}$. Any treatment design is optimal if it satisfies $\zeta_i = 0$ for all $i$.
3. Suppose the potential outcome can be modeled as $Y_{it}(z_{it}) = \alpha_i + \beta_t + \tau z_{it} + \varepsilon_{it}$. Any treatment design is optimal if it satisfies $\omega_t = 0$ and $\zeta_i = 0$ for all $t$ and $i$.
4. Suppose the potential outcome can be modeled as $Y_{it}(z_{it}) = \alpha_i + \beta_t + X_i^\top \theta_t + u_i^\top v_t + \tau z_{it} + \varepsilon_{it}$ (or $Y_{it}(z_{it}) = \alpha_i + \beta_t + X_i^\top \theta_t + \tau_1 z_{it} + \tau_2 z_{i,t-1} + \cdots + \tau_{\ell+1} z_{i,t-\ell} + u_i^\top v_t + \varepsilon_{it}$). Suppose the assumptions in Theorem 3 (or Theorem 5) hold, and let $\omega_{g,t} = \frac{1}{|O_g|}\sum_{i \in O_g} z_{it}$, where $O_g = \{i : X_i = x_g, u_i = u_{0,g}\}$. Any treatment design is optimal if it satisfies $\omega_{g,t} = 0$ for all $t$ and $g$, and $\zeta_i = 0$ for all $i$.

Proof of Proposition 1
The proof is a direct extension of the proofs of Lemmas 2 and 3 and Theorems 1, 3, and 5 in Appendix C. The major difference is that when the treatment can be reversed, the value of $\sum_{i=1}^N \zeta_i^2$ is not unique given $\omega_t$. We can then separately optimize the part involving $\zeta_i$ and the part involving $\omega_t$. Note that $\sum_{i=1}^N \zeta_i^2$ is minimized at $\zeta_i = 0$. Since the Hessian of the part involving $\omega_t$ (resp. $\omega_{g,t}$) is positive semi-definite, this part is minimized at $\omega_t = 0$ (resp. $\omega_{g,t} = 0$). $\square$

A.2. Additional Discussion of Theorem 3
Remark 15.
Consider a symmetric case of Theorem 3 in which the rows of $V$ are deterministic with $\sum_{t=1}^T v_t = 0$, and $u_i$ is random with mean $0$ and variance $I_k$, with $\mathrm{Cov}(u_i, u_j) = 0$ for any $i \ne j$. In this case, the errors are cross-sectionally independent but time-series correlated, and units are exchangeable. By symmetry, the sufficient conditions for the optimal solution in this case (analogous to Theorem 3) are
$$\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t - 1 - T}{T}, \qquad \frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X, \qquad \frac{1}{T}\sum_{t=1}^T v_t z_{it} = \mu_V \ \text{for all } i,$$
for some $\mu_V \in \mathbb{R}^k$. When $z_{it}$ is the same for all $i$ (the before-after design $Z^{\mathrm{BA}}$ defined in Section 5.2), the time-series average condition $\frac{1}{T}\sum_{t=1}^T v_t z_{it} = \mu_V$ for all $i$ is satisfied. However, this treatment design violates the first condition on the cross-sectional average, $\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t-1-T}{T}$. When $v_t$ is not stationary and has a strong time trend, as with GDP or stock indices over time, there may not exist a $Z$ satisfying all the sufficient conditions. Intuitively, the optimal $Z$ balances the sufficient conditions on the time-series and cross-sectional averages, and is closer to $Z^{\mathrm{BA}}$ than to $Z^{\mathrm{FF}}$.

A.3. More Results for Carryover Treatment Effects
In Theorem 5, $A(\ell)$ and $b(\ell)$ in Eq. (17) are defined in Eqs. (22) and (23), whose entries are functions of $\lfloor \ell/2 \rfloor$, $\ell$, and $T - \ell$.

The first step in proving Theorem 5 is to derive the convex relaxation of the integer program (15) when the potential outcome has no observed or latent covariates. When the model has no latent covariates, $\Sigma_e = \sigma^2 I_{N(T-\ell)}$ and
$$\mathrm{Var}(\hat{\vec{\tau}}) = \sigma^2 \big( Z^\top ( I_{N(T-\ell)} - \Gamma(\Gamma^\top \Gamma)^{-1} \Gamma^\top ) Z \big)^{-1}.$$
Then the integer program (15) is equivalent to the following integer program:
$$\min_Z \ \mathrm{tr}\Big( \big( Z^\top ( I_{N(T-\ell)} - \Gamma(\Gamma^\top \Gamma)^{-1} \Gamma^\top ) Z \big)^{-1} \Big) \quad (24)$$
$$\mathrm{s.t.} \quad z_{it} \le z_{i,t+1}, \qquad z_{it} \in \{-1, 1\}.$$
The convex relaxation of the integer program (24) is provided in the following lemma.
Lemma 4.
Suppose Assumptions 1–2 hold and the potential outcome follows model (14) without observed or latent covariates, i.e., Y_{it}(z_{it}) = α_i + β_t + τ_1 z_{it} + τ_2 z_{i,t−1} + ⋯ + τ_{ℓ+1} z_{i,t−ℓ} + ε_{it}. Then the convex relaxation of the integer program (24) is

min_ω Σ_{j=1}^{ℓ+1} [ Σ_{t=j}^{T−ℓ−1+j} ω_t² − (1/(T−ℓ)) (Σ_{t=j}^{T−ℓ−1+j} ω_t)² + (2/(T−ℓ)) Σ_{t=j}^{T−ℓ−1+j} (T − ℓ + 2j − 1 − 2t) ω_t ]  (25)
s.t. −1 ≤ ω_t ≤ 1, ω_t ≤ ω_{t+1}.

The objective function in (25) decomposes into ℓ + 1 quadratic problems and is equal to the objective function (19) in the overview of the proof of Theorem 5. If we minimize each of these ℓ + 1 problems separately, the optimal solution of each sub-problem is the same as that in Theorem 1 with the number of time periods equal to T − ℓ. The proof of Lemma 4 and the remaining steps to prove Theorem 5 are provided in Appendix E.

Remark 16.
Beyond the examples for ω* when ℓ = 1 and ℓ = 2 in Remark 8, the expression for ω* when ℓ = 3 has the same structure: ω*_1 = −1 and ω*_T = 1, the interior entries ω*_t are linear in t for t = 4, ⋯, T − 3, and the remaining boundary entries ω*_2, ω*_3, ω*_{T−2}, ω*_{T−1} are rational functions of T that interpolate between the endpoint values and the linear interior solution.

If we care not only about the variances of the estimated direct and lagged effects, but also about the covariances between the estimated direct and lagged effects, i.e., Cov(τ̂_j, τ̂_{j′}) for j ≠ j′, then it is natural to consider minimizing det(Var(τ̂)). Note that when we do not have latent covariates, det(Var(τ̂)) = 1/det(Z⊤(I_{N(T−ℓ)} − Γ(Γ⊤Γ)⁻¹Γ⊤)Z), so minimizing det(Var(τ̂)) is equivalent to minimizing −det(Z⊤(I_{N(T−ℓ)} − Γ(Γ⊤Γ)⁻¹Γ⊤)Z) and the corresponding optimization problem is

min_Z −det(Z⊤(I_{N(T−ℓ)} − Γ(Γ⊤Γ)⁻¹Γ⊤)Z)  (26)
s.t. z_{it} ≤ z_{i,t+1}, z_{it} ∈ {−1, 1}.

The optimal solution Z of the integer program (26) is called the D (determinant)-optimal design. We provide the relaxation of the integer program (26) in Lemma 5.

Lemma 5.
Suppose Assumptions 1–2 hold and the potential outcome follows model (14) without observed or latent covariates, i.e., Y_{it}(z_{it}) = α_i + β_t + τ_1 z_{it} + τ_2 z_{i,t−1} + ⋯ + τ_{ℓ+1} z_{i,t−ℓ} + ε_{it}. Then the relaxation of the integer program (26) is

min_ω −det(Θ)  (27)
s.t. −1 ≤ ω_t ≤ 1, ω_t ≤ ω_{t+1},

where each entry Θ_{jm} is a quadratic function of the cross-sectional averages ω_t: the diagonal entries (j = m) coincide with the j-th sub-problem objective in Eq. (25); the off-diagonal entries (j < m) additionally involve the cross terms Σ_t ω_t ω_{t+m−j} and Σ_{t=j}^{m−1}(ω_t − ω_{T−ℓ+t}), with piecewise-linear weights υ^{(j,m)}_t whose breakpoints are at t = ℓ + 1 − m, ℓ + 1 − j, T + 1 − m, and T + 1 − j; and Θ_{jm} = Θ_{mj} for j > m.

[Figure 4. D-optimal treatment design: optimal treated proportion at each period for a T-period treatment design in the presence of carryover treatment effects, where T = 10. Different colors represent the number of lags ℓ in the carryover effects.]

Note that each entry of Θ is a quadratic function of ω_t, the cross-sectional average of the treatment variables. However, the objective function in the optimization problem (27) is not a convex function of ω_t. In practice, we can use off-the-shelf software to find the optimal ω_t. We plot the optimal solution of the optimization problem (27) for T = 10 in Figure 4.
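Because the entries of Θ are unwieldy and (27) is non-convex, for very small instances one can sidestep the relaxation entirely and search over integer designs directly. The sketch below is only an illustration (it is not the paper's algorithm, and the instance sizes N, T, ℓ are arbitrary toy choices): it enumerates all monotone adoption patterns and maximizes det(Z⊤(I − Γ(Γ⊤Γ)⁻¹Γ⊤)Z), which is equivalent to the D-optimal integer program (26).

```python
import itertools
import numpy as np

N, T, ell = 4, 4, 1  # toy instance sizes, chosen only so enumeration is fast

# Stacked observations (unit i, period t) for t = ell..T-1 (0-indexed)
rows = [(i, t) for i in range(N) for t in range(ell, T)]
G = np.zeros((len(rows), N + (T - ell)))
for r, (i, t) in enumerate(rows):
    G[r, i] = 1.0            # unit fixed-effect dummy
    G[r, N + t - ell] = 1.0  # time fixed-effect dummy
M = np.eye(len(rows)) - G @ np.linalg.pinv(G)  # annihilator of the fixed effects

def design(adopt):
    """Current and lagged treatment columns for adoption times `adopt` (T = never)."""
    z = np.where(np.arange(T)[None, :] >= np.array(adopt)[:, None], 1.0, -1.0)
    return np.array([[z[i, t - j] for j in range(ell + 1)] for (i, t) in rows])

best_det, best = -np.inf, None
for adopt in itertools.product(range(T + 1), repeat=N):
    Zc = design(adopt)
    det = np.linalg.det(Zc.T @ M @ Zc)
    if det > best_det:
        best_det, best = det, adopt

# A simultaneous (before-after) rollout is degenerate: its treatment columns
# depend on t only, so the time fixed effects absorb them and det = 0.
Zba = design((T // 2,) * N)
det_ba = np.linalg.det(Zba.T @ M @ Zba)
```

As expected, the brute-force maximizer is a staggered pattern, while the before-after rollout yields a singular information matrix.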
As in all the previous results, the treated proportion is symmetric with respect to the point ((T+1)/2, 0.5). As in the T-optimal design in Theorem 5, if the treatment affects more periods (ℓ is larger), the optimal treated proportion is in general smaller at the beginning, increases at a faster rate in the middle, and is larger at the end. The proof of Lemma 5 is provided in Appendix E.

Appendix B: Additional Algorithm
B.1. Grid Search for the Tuning Parameter in Low-Rank Matrix Estimation with Fixed Effects
When we use LRME to estimate the treatment effect τ given a treatment design Z, we need to specify the regularization parameter µ. If we have historical control data, we can use grid search to find the optimal µ. The details are presented in Algorithm 3. Similar to Section 4.1, we split the historical data into sub-matrices. Given a treatment design matrix Z, we apply the synthetic experiment to each of these sub-matrices and use LRME to estimate τ with a particular µ. The optimal µ minimizes max_j |τ̂^{(j)} − τ|, where τ̂^{(j)} is estimated from the j-th sub-matrix. The next step is to use this treatment design matrix Z to generate synthetic experimental data and estimate τ by LRME using the optimal µ.

Algorithm 3:
Grid search for the optimal µ
Inputs: Y^{hist,(1)}, ⋯, Y^{hist,(m)}, Z, τ, µ_max, µ_min, N_grids, k, ∆τ, and t_max
  µ⃗ ← logspace(µ_min, µ_max, N_grids);
  for j = 1, ⋯, m do
    τ̂^{(j)}, α̂^{(j)}, β̂^{(j)}, L̂^{(j)} ← Algorithm 2(Y^{hist,(j)}, Z, k, µ_max, ∆τ, t_max);
  end
  E_opt ← max_j |τ̂^{(j)} − τ|; µ_opt ← µ_max;
  for µ ∈ µ⃗ do
    for j = 1, ⋯, m do
      τ̂^{(j)}, α̂^{(j)}, β̂^{(j)}, L̂^{(j)} ← Algorithm 2(Y^{hist,(j)}, Z, k, µ, ∆τ, t_max);
    end
    E_current ← max_j |τ̂^{(j)} − τ|;
    if E_current < E_opt then E_opt ← E_current, µ_opt ← µ;
  end
Outputs: µ_opt

B.2. Feasible GLS
In feasible GLS, we first estimate τ and the residuals from OLS, then estimate the errors' covariance matrix from the residuals, and finally estimate τ from a regression weighted by the inverse of the estimated covariance matrix. We estimate the errors' covariance matrix under two assumptions: 1. the errors are time-series uncorrelated and identically distributed over time; 2. the errors' covariance matrix is the sum of a low-rank matrix and a diagonal matrix (the same assumptions as in Theorem 3). We estimate the errors' covariance matrix Ω at any time period by

U, S, U⊤ = SVD((1/T) ê ê⊤),  Ω̂ = U_{:,1:k} S_{1:k,1:k} U⊤_{:,1:k} + diag((1/T) ê ê⊤ − U_{:,1:k} S_{1:k,1:k} U⊤_{:,1:k}),

where S is a diagonal matrix whose diagonal entries are the singular values of (1/T) ê ê⊤ in descending order, diag((1/T) ê ê⊤ − U_{:,1:k} S_{1:k,1:k} U⊤_{:,1:k}) is the diagonal matrix formed by the diagonal entries of (1/T) ê ê⊤ − U_{:,1:k} S_{1:k,1:k} U⊤_{:,1:k}, and k < T (k can be a preset value or can be chosen in a data-driven manner). The full estimated covariance matrix of the errors (of dimension NT × NT) is a block-diagonal matrix with each block equal to Ω̂. By assuming the errors are time-series uncorrelated, we estimate far fewer parameters, and each parameter is estimated from T observations.

Appendix C: No Covariates
Proof of Equation (7). We use linear regression to estimate τ and η, so

[τ̂; η̂] = ([z⊤; Γ⊤][z Γ])⁻¹ [z⊤; Γ⊤] y = [τ; η] + ([z⊤z, z⊤Γ; Γ⊤z, Γ⊤Γ])⁻¹ [z⊤; Γ⊤] ε.

Since Var(ε) = σ² I,

Var([τ̂; η̂]) = σ² ([z⊤z, z⊤Γ; Γ⊤z, Γ⊤Γ])⁻¹.

From block matrix inversion, we have

Var(τ̂) = σ² (z⊤z − z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z)⁻¹,

which is equivalent to Eq. (7). □

Proof of Lemma 2.
When there are only time fixed effects, Γ can be simplified to

Γ = I_T ⊗ 1⃗ ∈ ℝ^{NT×T}, where 1⃗ ∈ ℝ^N.

Then we have Γ⊤Γ = N · I_T and (1/N) z⊤Γ = [ω_1 ⋯ ω_T], where ω_t = (1/N) Σ_{i=1}^N z_{it}. The objective function in the integer program (8) has

z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z = N Σ_{t=1}^T ω_t².

It is minimized at ω_t = 0. Thus, any treatment design is optimal if it satisfies ω_t = 0 for all t and z_{it} = z_{i,t+1} for all i and t. In other words, if N is even, Var(τ̂) is minimized at ω_t = 0, implying that 50% of the units have z_{it} = 1 and the other 50% have z_{it} = −1. If N is odd, Var(τ̂) is minimized at |ω_t| = 1/N, implying that there are (N−1)/2 units with z_{it} = 1 for all t, there are (N−1)/2 units with z_{it} = −1 for all t, and there is one unit whose z_{it} is flexible as long as z_{it} ≤ z_{i,t+1} is satisfied. □

Proof of Lemma 3.
When there are only unit fixed effects, Γ can be simplified to

Γ = 1⃗ ⊗ I_N = [I_N; I_N; ⋯; I_N] ∈ ℝ^{NT×N}, where 1⃗ ∈ ℝ^T.

Then we have Γ⊤Γ = T · I_N and (1/T) z⊤Γ = [ζ_1 ⋯ ζ_N], where ζ_i = (1/T) Σ_{t=1}^T z_{it}. The objective function in the integer program (8) has

z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z = T Σ_{i=1}^N ζ_i².

It is minimized at ζ_i = 0. Thus, any treatment design is optimal if it satisfies ζ_i = 0 for all i. Given Assumption 2, the equivalent statement is that any treatment design is optimal if it satisfies ω_t = −1 for all t < (T+1)/2 and ω_t = 1 for all t > (T+1)/2. In other words, if T is even, Var(τ̂) is minimized at ζ_i = 0 for all i, implying that all units are untreated in the first half of the time periods and all switch to the treated group in the second half. If T is odd, Var(τ̂) is minimized at |ζ_i| = 1/T, implying that all units are in the control group for all t ≤ (T−1)/2, can be either treated or untreated at t = (T+1)/2, and are in the treated group for all t > (T+1)/2. □

Since we do not have time fixed effects here, we do not have the identification problem and we do not need to assume α_N = 0, in contrast to Theorem 1.

Proof of Theorem 1.
Denote ζ_i = (1/T) Σ_{t=1}^T z_{it} for all i and ω_t = (1/N) Σ_{i=1}^N z_{it} for all t. There are two steps to show Theorem 1:

1. The objective function in the integer program (8), z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z, can be simplified (up to an additive constant that does not depend on the design) to N(ω⊤P_1 ω + 2d⊤ω), with P_1 = I_T − (1/T)1⃗1⃗⊤ ∈ ℝ^{T×T}, d = [d_t] ∈ ℝ^T, and d_t = (T + 1 − 2t)/T for all t.

2. ω*_t = (2t − 1 − T)/T is the optimal solution that satisfies the first-order condition (FOC) of ω⊤P_1 ω + 2d⊤ω.

Now we start with the first step.
There are two main intermediate steps.

1. Express z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z in terms of ω_t and ζ_i. When potential outcomes follow Y_{it}(z_{it}) = α_i + β_t + τ z_{it} + ε_{it}, we have

Γ = [Γ_{:1}; Γ_{:2}; ⋯; Γ_{:T}] = [Ĩ 1⃗; Ĩ 1⃗; ⋱; Ĩ 1⃗] ∈ ℝ^{NT×(N+T−1)}

(the unit-dummy block Ĩ is repeated in every time period, and the 1⃗ columns form the block-diagonal time dummies), where Ĩ = [I_{N−1} 0⃗]⊤ ∈ ℝ^{N×(N−1)}, 0⃗ ∈ {0}^{N−1} and 1⃗ ∈ {1}^N. In z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z, we have

(Γ⊤Γ)⁻¹ = [Ξ₁ Ξ₂; Ξ₃ Ξ₄] = [M₀, −(1/N)M₀M_1; −(1/N)M_1⊤M₀, (1/N)I_T + (1/N²)M_1⊤M₀M_1] ∈ ℝ^{(N+T−1)×(N+T−1)},

where Ξ₁ = M₀, Ξ₂ = −(1/N)M₀M_1, Ξ₃ = −(1/N)M_1⊤M₀, Ξ₄ = (1/N)I_T + (1/N²)M_1⊤M₀M_1 and

M₀ = (1/T)(I_{N−1} − (1/N)1⃗1⃗⊤)⁻¹ = (1/T)(I_{N−1} + 1⃗1⃗⊤) ∈ ℝ^{(N−1)×(N−1)},  M_1 = [1⃗ ⋯ 1⃗] ∈ ℝ^{(N−1)×T},

where 1⃗ ∈ {1}^{N−1} in M₀ and M_1. Furthermore,

z⊤Γ = [Σ_t z_{1t} ⋯ Σ_t z_{N−1,t}, Σ_i z_{i1} ⋯ Σ_i z_{iT}] = [Tζ_1 ⋯ Tζ_{N−1}, Nω_1 ⋯ Nω_T].

Let φ := [Tζ_1 ⋯ Tζ_{N−1}]⊤ and ι := [Nω_1 ⋯ Nω_T]⊤. Then z⊤Γ = [φ⊤ ι⊤] and

z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z = φ⊤Ξ₁φ + 2φ⊤Ξ₂ι + ι⊤Ξ₄ι.

We simplify each term in z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z:

φ⊤Ξ₁φ = (1/T) φ⊤(I_{N−1} + 1⃗1⃗⊤)φ = T Σ_{i=1}^{N−1} ζ_i² + T (Σ_{i=1}^{N−1} ζ_i)²,
φ⊤Ξ₂ι = −(1/T) φ⊤(I_{N−1} + 1⃗1⃗⊤)1⃗ (Σ_{t=1}^T ω_t) = −(N/T) φ⊤1⃗ (Σ_{t=1}^T ω_t) = −N (Σ_{i=1}^{N−1} ζ_i)(Σ_{t=1}^T ω_t),
ι⊤Ξ₄ι = N Σ_{t=1}^T ω_t² + (N(N−1)/T)(Σ_{t=1}^T ω_t)².

From the definitions of ζ_i and ω_t, T Σ_{i=1}^N ζ_i = N Σ_{t=1}^T ω_t. Then Σ_{i=1}^{N−1} ζ_i = (N/T) Σ_{t=1}^T ω_t − ζ_N. Using this property, we have

z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z = T Σ_{i=1}^N ζ_i² − (N/T)(Σ_{t=1}^T ω_t)² + N Σ_{t=1}^T ω_t².  (28)

2. Express z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z in terms of ω_t. Following Assumption 2, Σ_{i=1}^N ζ_i² is determined by the ω_t. Given the ω_t, there are N(1+ω_1)/2, N(1+ω_2)/2, ⋯, N(1+ω_T)/2 treated units in time periods 1, 2, ⋯, T. Equivalently, N(1+ω_1)/2, N(ω_2−ω_1)/2, ⋯, N(ω_T−ω_{T−1})/2 untreated units start the treatment at time periods 1, 2, ⋯, T, and N(1−ω_T)/2 units remain in the control group at the end. A unit that starts the treatment at period s has ζ_i = (T − 2s + 2)/T, so

Σ_{i=1}^N ζ_i² = N [ (1+ω_1)/2 · 1 + (ω_2−ω_1)/2 · ((T−2)/T)² + ⋯ + (ω_T−ω_{T−1})/2 · ((2−T)/T)² + (1−ω_T)/2 · 1 ]
= N + (2N/T²) Σ_{t=1}^T (T + 1 − 2t) ω_t.

Therefore, up to an additive constant that does not depend on the design, we can write z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z as a function of ω alone:

z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z = const + N Σ_{t=1}^T ω_t² − (N/T)(Σ_{t=1}^T ω_t)² + (2N/T) Σ_{t=1}^T (T + 1 − 2t) ω_t = const + N (ω⊤P_1 ω + 2d⊤ω) = const + N f(ω),

where f(ω) := ω⊤P_1 ω + 2d⊤ω, so minimizing Var(τ̂) is equivalent to minimizing f(ω). Next we show the second step.
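The second-step claim — that ω*_t = (2t − 1 − T)/T minimizes f(ω) = ω⊤P_1ω + 2d⊤ω — can also be checked numerically before the formal argument. The sketch below (T = 10 is an arbitrary choice) runs plain gradient descent on f; the unconstrained minimizer happens to be feasible (monotone and in [−1, 1]), so no projection is needed for this illustration:

```python
import numpy as np

T = 10  # illustrative horizon
t = np.arange(1, T + 1)
P = np.eye(T) - np.ones((T, T)) / T   # P_1 = I_T - (1/T) 1 1^T
d = (T + 1 - 2 * t) / T               # d_t = (T + 1 - 2t)/T

def f(w):
    # quadratic objective from the proof of Theorem 1
    return w @ P @ w + 2 * d @ w

# Plain gradient descent on the convex quadratic f
w = np.zeros(T)
for _ in range(300):
    w -= 0.4 * (2 * P @ w + 2 * d)

w_star = (2 * t - 1 - T) / T          # claimed optimal solution
```

The iterates converge to ω*, and f(ω*) matches the closed form −(T+1)(T−1)/(3T) used in the proof of Theorem 2.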
Let f₁(ω) := Σ_{t=1}^T ω_t², f₂(ω) := −(1/T)(Σ_{t=1}^T ω_t)², and f₃(ω) := 2 Σ_{t=1}^T ((T+1−2t)/T) ω_t. Then we have f(ω) = f₁(ω) + f₂(ω) + f₃(ω). We first take the second-order derivatives:

∇²f₁(ω) = 2I_T,  ∇²f₂(ω) = −(2/T) 1⃗1⃗⊤,  ∇²f₃(ω) = 0.

Thus, ∇²f(ω) = ∇²f₁(ω) + ∇²f₂(ω) + ∇²f₃(ω) = 2(I_T − (1/T)1⃗1⃗⊤) ⪰ 0, so f(ω) is convex. Next we take the first-order derivative:

∇f(ω) = [2ω_1 + 2(T−1)/T − (2/T)Σ_tω_t, 2ω_2 + 2(T−3)/T − (2/T)Σ_tω_t, ⋯, 2ω_T − 2(T−1)/T − (2/T)Σ_tω_t].

When ω*_t = (2t−1−T)/T, ∇f(ω*) = 0. Thus, ω* is a local optimal solution. Since f(ω) is convex, the local minimum is also the global minimum. Therefore, ω* is the global optimal solution. □

Proof of Theorem 2.
Denote the objective function of the quadratic program (9) by f(ω) (as in the proof of Theorem 1). The variance of τ̂ (denoted g(ω)) is

g(ω) := Var(τ̂) = σ² (z⊤(I_{NT} − Γ(Γ⊤Γ)⁻¹Γ⊤)z)⁻¹ = σ²/(N(T − f(ω))).

The optimal solution in Theorem 1 is ω*_t = (2t−1−T)/T. We plug ω* = [ω*_t] into f(ω):

f(ω*) = −Σ_{t=1}^T ((T+1−2t)/T)² = −(T+1)(T−1)/(3T).

The rounded version of ω* is denoted ω^rnd = [ω^rnd_t] (from the nearest-integer rounding rule). We have |ω^rnd_t − ω*_t| ≤ 1/N and |Σ_{t=1}^T ω^rnd_t| ≤ T/N. Therefore,

f(ω^rnd) = Σ_{t=1}^T (ω^rnd_t + (T+1−2t)/T)² − (1/T)(Σ_{t=1}^T ω^rnd_t)² − Σ_{t=1}^T ((T+1−2t)/T)² ≤ 3T/N² − (T+1)(T−1)/(3T).

We have

g(ω^rnd)/g(ω*) = (T − f(ω*))/(T − f(ω^rnd)) ≤ [(4T²−1)/(3T)] / [(4T²−1)/(3T) − 3T/N²] = (4T²−1)/(4T²−1 − 9T²/N²) ≤ 1/(1 − 3/N²),

following 3T² ≤ 4T²−1. Note that f(ω*) attains the minimum objective value of the quadratic program (9), so f(ω*) is a lower bound for the minimum objective value of the original integer programming problem. Thus, g(ω*) ≤ Var_{Z*}(τ̂) and we have

Var_{Z^rnd}(τ̂) ≤ (1/(1 − 3/N²)) Var_{Z*}(τ̂). □

Appendix D: With Covariates
Proof of Theorem 3.
We can combine β_t with X_i⊤θ_t in the potential outcome model (11), that is,

Y_{it}(z_{it}) = α_i + [1 X_i⊤] [β_t; θ_t] + τ z_{it} + ε_{it}, with X̃_i⊤ := [1 X_i⊤],

and denote p := r + 1, so X̃_i ∈ ℝ^p. Denote ζ_i = (1/T) Σ_{t=1}^T z_{it} for all i and ω̃_t = (1/N) Σ_{i=1}^N X̃_i z_{it} ∈ ℝ^p for all t. First, when we use W = Σ_e⁻¹ in GLS, with some algebra we have the following expression for the variance of τ̂:

Var(τ̂) = (z⊤(Σ_e⁻¹ − Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹)z)⁻¹,  (29)

where Ĩ = [I_{N−p} 0_{N−p,p}]⊤ ∈ ℝ^{N×(N−p)}, 0_{N−p,p} is a matrix of zeros, 1⃗ ∈ {1}^N, and

Γ = [Γ_{:1}; Γ_{:2}; ⋯; Γ_{:T}] = [Ĩ X̃; Ĩ X̃; ⋱; Ĩ X̃] ∈ ℝ^{NT×(N−p+Tp)}  (30)

(the unit-dummy block Ĩ is repeated in every time period, and the X̃ blocks are block-diagonal across time). We restrict α_i = 0 for i = N−p+1, ⋯, N so that all other α_i and β_t can be uniquely identified. The details to show Eq. (29) are the same as those to show Eq. (7).

There are three steps to show Theorem 3:

1. Minimizing Var(τ̂) can be reduced to minimizing

(ω^{(1)})⊤P_1 ω^{(1)} + 2d⊤ω^{(1)} + Σ_{j=2}^p (ω^{(j)})⊤P_1 ω^{(j)} + (1/N) Σ_{t=1}^T z_t⊤U(I_k + U⊤U)⁻¹U⊤z_t − (1/(NT)) (Σ_{t=1}^T z_t)⊤U(I_k + U⊤U)⁻¹U⊤(Σ_{t=1}^T z_t),

where ω^{(j)} = [ω̃_{1,j} ⋯ ω̃_{T,j}]⊤ and ω̃_{t,j} = (1/N) Σ_{i=1}^N X̃_{i,j} z_{it} for j = 1, ⋯, p and t = 1, ⋯, T.

2. Take the first-order condition and show sufficient conditions for optimality.

3. Based on the sufficient conditions and the finitely many values of u_i, show ω_{g,t} = (2t−1−T)/T is optimal.

We start with the first step.
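Step 1 below repeatedly uses the Woodbury identity (I_N + UU⊤)⁻¹ = I_N − U(I_k + U⊤U)⁻¹U⊤ to invert the error covariance. It can be spot-checked numerically (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 8, 3                        # arbitrary dimensions for the check
U = rng.normal(size=(N, k))

# Woodbury identity used to invert Sigma_{e_t} = I_N + U U^T
lhs = np.linalg.inv(np.eye(N) + U @ U.T)
rhs = np.eye(N) - U @ np.linalg.inv(np.eye(k) + U.T @ U) @ U.T
```

The two expressions agree to machine precision, and the right-hand side only requires inverting a k × k matrix.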
The steps to simplify z⊤(Σ_e⁻¹ − Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹)z are as follows.

1. Calculate Σ_e⁻¹. Under the assumptions that v_t has mean 0 and variance I_k, Cov(v_t, v_{t−q}) = 0 for any q ≠ 0, Cov(v_t, ε_{is}) = 0 for any s, t and i, and σ = 1, we have Σ_e⁻¹ = diag(Ψ, Ψ, ⋯, Ψ), where

Ψ = Σ_{e_t}⁻¹ = (I_N + UU⊤)⁻¹ = I_N − U(I_k + U⊤U)⁻¹U⊤.

2. Calculate z⊤Σ_e⁻¹z. We have z⊤Σ_e⁻¹z = Σ_{t=1}^T z_t⊤Ψz_t.

3. Calculate z⊤Σ_e⁻¹Γ. We have

z⊤Σ_e⁻¹Γ = [Σ_{t=1}^T z_t⊤ΨĨ, z_1⊤ΨX̃, ⋯, z_T⊤ΨX̃] = [φ⊤ ι⊤],

where φ⊤ = Σ_{t=1}^T z_t⊤ΨĨ = (Σ_{t=1}^T z_t)⊤ΨĨ and ι⊤ = [z_1⊤ΨX̃ ⋯ z_T⊤ΨX̃]. Let ζ := (1/T) Σ_{t=1}^T z_t ∈ ℝ^N; then φ⊤ = Tζ⊤ΨĨ. Note that U and X̃ are orthogonal, so ΨX̃ = X̃ and X̃⊤ΨX̃ = N · I_p. Then

ι⊤ = [z_1⊤X̃ ⋯ z_T⊤X̃] = [N ω̃_1⊤ ⋯ N ω̃_T⊤] = N ω̃⃗⊤ ∈ ℝ^{Tp},

where ω̃_t = (1/N) Σ_{i=1}^N X̃_i z_{it} ∈ ℝ^p.

4. Calculate (Γ⊤Σ_e⁻¹Γ)⁻¹. Note that

(Γ⊤Σ_e⁻¹Γ)⁻¹ = [Ξ₁ Ξ₂; Ξ₃ Ξ₄] = [M₀, −M₀M̃; −M̃⊤M₀, M̄ + M̃⊤M₀M̃] ∈ ℝ^{(N−p+Tp)×(N−p+Tp)},

where Ξ₁ = M₀, Ξ₂ = −M₀M̃, Ξ₃ = −M̃⊤M₀, Ξ₄ = M̄ + M̃⊤M₀M̃, with

M₀ = (1/T)(Ĩ⊤ΨĨ − Ĩ⊤ΨX̃(X̃⊤ΨX̃)⁻¹X̃⊤ΨĨ)⁻¹ = (1/T)(Ĩ⊤ΨĨ − (1/N)X̃_{(1)}X̃_{(1)}⊤)⁻¹ ∈ ℝ^{(N−p)×(N−p)},
M̃ = [Ĩ⊤ΨX̃(X̃⊤ΨX̃)⁻¹ ⋯ Ĩ⊤ΨX̃(X̃⊤ΨX̃)⁻¹] = [(1/N)X̃_{(1)} ⋯ (1/N)X̃_{(1)}] ∈ ℝ^{(N−p)×Tp},
M̄ = diag((X̃⊤ΨX̃)⁻¹, ⋯, (X̃⊤ΨX̃)⁻¹) = (1/N) I_{Tp} ∈ ℝ^{Tp×Tp},

and X̃_{(1)} := [X̃_1 ⋯ X̃_{N−p}]⊤ ∈ ℝ^{(N−p)×p}. We can simplify M₀ by the Woodbury identity:

M₀ = (1/T)[(Ĩ⊤ΨĨ)⁻¹ + (Ĩ⊤ΨĨ)⁻¹X̃_{(1)}(N I_p − X̃_{(1)}⊤(Ĩ⊤ΨĨ)⁻¹X̃_{(1)})⁻¹X̃_{(1)}⊤(Ĩ⊤ΨĨ)⁻¹],
(Ĩ⊤ΨĨ)⁻¹ = (I_{N−p} − U_{(1)}(I_k + U⊤U)⁻¹U_{(1)}⊤)⁻¹ = I_{N−p} + U_{(1)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(1)}⊤,

where U_{(1)} = [u_1 ⋯ u_{N−p}]⊤ ∈ ℝ^{(N−p)×k} and U_{(2)} = [u_{N−p+1} ⋯ u_N]⊤ ∈ ℝ^{p×k} (note U = [U_{(1)}⊤ U_{(2)}⊤]⊤).

5. Calculate z⊤Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹z.
It is equivalent to calculating

[φ⊤ ι⊤][Ξ₁ Ξ₂; Ξ₃ Ξ₄][φ; ι] = φ⊤Ξ₁φ + φ⊤Ξ₂ι + ι⊤Ξ₃φ + ι⊤Ξ₄ι,

where (recall ζ = (1/T) Σ_{t=1}^T z_t and ω̃⃗ = [ω̃_1⊤ ⋯ ω̃_T⊤]⊤)

φ⊤Ξ₁φ = T² ζ⊤ΨĨM₀Ĩ⊤Ψζ,
φ⊤Ξ₂ι = −T ζ⊤ΨĨM₀X̃_{(1)}(Σ_{t=1}^T ω̃_t),
ι⊤Ξ₃φ = −T (Σ_{t=1}^T ω̃_t⊤) X̃_{(1)}⊤M₀Ĩ⊤Ψζ,
ι⊤Ξ₄ι = N Σ_{t=1}^T ω̃_t⊤ω̃_t + (Σ_{t=1}^T ω̃_t⊤) X̃_{(1)}⊤M₀X̃_{(1)} (Σ_{t=1}^T ω̃_t).

Let X̃_{(1)} := [X̃_1 ⋯ X̃_{N−p}]⊤ ∈ ℝ^{(N−p)×p} and X̃_{(2)} := [X̃_{N−p+1} ⋯ X̃_N]⊤ ∈ ℝ^{p×p} (note X̃ = [X̃_{(1)}⊤ X̃_{(2)}⊤]⊤). We can simplify φ⊤Ξ₁φ, φ⊤Ξ₂ι, ι⊤Ξ₃φ and ι⊤Ξ₄ι by calculating the following terms:

G = Ĩ(Ĩ⊤ΨĨ)⁻¹Ĩ⊤ = [I_{N−p} + U_{(1)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(1)}⊤, 0; 0, 0] ∈ ℝ^{N×N},
Ω = ΨGΨ = [I_{N−p} − U_{(1)}(I_k + U⊤U)⁻¹U_{(1)}⊤, −U_{(1)}(I_k + U⊤U)⁻¹U_{(2)}⊤; −U_{(2)}(I_k + U⊤U)⁻¹U_{(1)}⊤, U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(1)}⊤U_{(1)}(I_k + U⊤U)⁻¹U_{(2)}⊤] ∈ ℝ^{N×N},
(Ĩ⊤ΨĨ)⁻¹X̃_{(1)} = X̃_{(1)} − U_{(1)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤X̃_{(2)} ∈ ℝ^{(N−p)×p},
Ĩ(Ĩ⊤ΨĨ)⁻¹X̃_{(1)} = [X̃_{(1)} − U_{(1)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤X̃_{(2)}; 0] ∈ ℝ^{N×p},
δ = X̃_{(1)}⊤(Ĩ⊤ΨĨ)⁻¹X̃_{(1)} = N I_p − (X̃_{(2)}⊤X̃_{(2)} − X̃_{(2)}⊤U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤X̃_{(2)}) ∈ ℝ^{p×p},
γ = ΨĨ(Ĩ⊤ΨĨ)⁻¹X̃_{(1)} = [X̃_{(1)}; U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤X̃_{(2)}] ∈ ℝ^{N×p}.

Together with N Σ_{t=1}^T ω̃_t = T Σ_{i=1}^N X̃_i ζ_i = T X̃⊤ζ, we have

φ⊤Ξ₁φ = T ζ⊤(Ω + γ(N I_p − δ)⁻¹γ⊤)ζ,
φ⊤Ξ₂ι = −T ζ⊤(γ(N I_p − δ)⁻¹X̃⊤)ζ,
ι⊤Ξ₃φ = −T ζ⊤(X̃(N I_p − δ)⁻¹γ⊤)ζ,
ι⊤Ξ₄ι = N Σ_{t=1}^T ω̃_t⊤ω̃_t + T ζ⊤(X̃(N I_p − δ)⁻¹X̃⊤)ζ − (N/T)(Σ_{t=1}^T ω̃_t⊤)(Σ_{t=1}^T ω̃_t),

and

[φ⊤ ι⊤][Ξ₁ Ξ₂; Ξ₃ Ξ₄][φ; ι] = N Σ_{t=1}^T ω̃_t⊤ω̃_t − (N/T)(Σ_{t=1}^T ω̃_t⊤)(Σ_{t=1}^T ω̃_t) + T ζ⊤(Ω + (γ − X̃)(N I_p − δ)⁻¹(γ − X̃)⊤)ζ,

with

Ω + (γ − X̃)(N I_p − δ)⁻¹(γ − X̃)⊤ = Ω + [0, 0; 0, I_p − U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤] = I_N − U(I_k + U⊤U)⁻¹U⊤,

following U⊤U = U_{(1)}⊤U_{(1)} + U_{(2)}⊤U_{(2)} and

U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(1)}⊤U_{(1)}(I_k + U⊤U)⁻¹U_{(2)}⊤ − U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤
= U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹(U_{(1)}⊤U_{(1)} − I_k − U⊤U)(I_k + U⊤U)⁻¹U_{(2)}⊤ = −U_{(2)}(I_k + U⊤U)⁻¹U_{(2)}⊤.

Thus,

[φ⊤ ι⊤][Ξ₁ Ξ₂; Ξ₃ Ξ₄][φ; ι] = N Σ_{t=1}^T ω̃_t⊤ω̃_t − (N/T)(Σ_{t=1}^T ω̃_t⊤)(Σ_{t=1}^T ω̃_t) + T ζ⊤(I_N − U(I_k + U⊤U)⁻¹U⊤)ζ.

6. Calculate z⊤(Σ_e⁻¹ − Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹)z.
We have

z⊤Σ_e⁻¹z − z⊤Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹z = NT − Σ_{t=1}^T z_t⊤U(I_k + U⊤U)⁻¹U⊤z_t − N Σ_{t=1}^T ω̃_t⊤ω̃_t + (N/T)(Σ_{t=1}^T ω̃_t⊤)(Σ_{t=1}^T ω̃_t) − T ζ⊤(I_N − U(I_k + U⊤U)⁻¹U⊤)ζ.

Then maximizing z⊤Σ_e⁻¹z − z⊤Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹z is equivalent to minimizing

(ω^{(1)})⊤P_1 ω^{(1)} + 2d⊤ω^{(1)} + Σ_{j=2}^{r+1}(ω^{(j)})⊤P_1 ω^{(j)} + (1/N) Σ_{t=1}^T z_t⊤U(I_k + U⊤U)⁻¹U⊤z_t − (1/(NT)) (Σ_{t=1}^T z_t)⊤U(I_k + U⊤U)⁻¹U⊤(Σ_{t=1}^T z_t),  (31)

following T Σ_{i=1}^N ζ_i² = TN + (2N/T) Σ_{t=1}^T (T + 1 − 2t) ω̃_{t,1} from the proof of Theorem 1, and Tζ = Σ_{t=1}^T z_t. We refer to the first two terms of (31) as (a), the third term as (b), and the last two terms as (c). Next we show the second step.
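Before doing so, the key fact used for term (c) — that z⊤M_U z = z⊤((I_T − (1/T)1⃗1⃗⊤) ⊗ U(I_k + U⊤U)⁻¹U⊤)z is nonnegative and vanishes when the same cross-sectional assignment is used in every period — can be spot-checked numerically (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, k = 6, 5, 2                                       # arbitrary sizes for the check
U = rng.normal(size=(N, k))
A = U @ np.linalg.inv(np.eye(k) + U.T @ U) @ U.T        # PSD inner matrix
M_U = np.kron(np.eye(T) - np.ones((T, T)) / T, A)       # (I_T - (1/T) 1 1^T) ⊗ A

z_const = np.tile(rng.choice([-1.0, 1.0], size=N), T)   # identical z_t in every period
z_rand = rng.choice([-1.0, 1.0], size=N * T)            # arbitrary assignment

q_const = z_const @ M_U @ z_const                       # should be 0
q_rand = z_rand @ M_U @ z_rand                          # should be >= 0
```

The quadratic form is a sum over periods of PSD forms of the deviations z_t − z̄, so it is zero exactly when U⊤z_t is constant in t.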
This is the same as in the sketch of the proof: we minimize (a), (b) and (c) separately and obtain the sufficient conditions. For (c), note that

Σ_{t=1}^T z_t⊤U(I_k + U⊤U)⁻¹U⊤z_t − (1/T)(Σ_{t=1}^T z_t)⊤U(I_k + U⊤U)⁻¹U⊤(Σ_{t=1}^T z_t) = z⊤((I_T − (1/T)1⃗1⃗⊤) ⊗ U(I_k + U⊤U)⁻¹U⊤)z =: z⊤M_U z.

I_T − (1/T)1⃗1⃗⊤ is a positive semi-definite matrix with one zero eigenvalue, whose corresponding eigenvector is 1⃗. Thus, z⊤M_U z ≥ 0 for all z, and the minimum value 0 is attained when U⊤z_t is the same for all t, which is equivalent to Z⊤U = 1⃗µ_U⊤ ∈ ℝ^{T×k} for some µ_U ∈ ℝ^k. For (a), from Theorem 1, the optimal solution of T Σ_{i=1}^N ζ_i² + N Σ_{t=1}^T ω_t² − (N/T)(Σ_{t=1}^T ω_t)² (which equals (ω^{(1)})⊤P_1 ω^{(1)} + 2d⊤ω^{(1)} up to constants and a factor of N) is (1/N)Σ_{i=1}^N z_{it} = (2t−1−T)/T. For (b), the objective depends on the cross-sectional averages of z_{it} weighted by X_i, and it is minimized when Z satisfies (1/N)Σ_{i=1}^N X_i z_{it} = µ_X for all t — equivalently ω^{(j+1)} = µ_{X,j} · 1⃗, where µ_{X,j} is the j-th entry of µ_X. Hence, we get the sufficient conditions for the optimal solution:

(1/N)Σ_{i=1}^N z_{it} = (2t−1−T)/T,  (1/N)Σ_{i=1}^N X_i z_{it} = µ_X,  (1/N)Σ_{i=1}^N u_i z_{it} = µ_U, for all t,

for some µ_X ∈ ℝ^r and µ_U ∈ ℝ^k. Last is to show the third step.
Since there are G strata for (X_i, u_i) — that is, G distinct values of (X_i, u_i) — we have ω_t = Σ_{g=1}^G p_g ω_{g,t} and (1/N)Σ_{i=1}^N X_{ij} z_{it} = Σ_{g=1}^G p_g x_{gj} ω_{g,t} for j = 1, ⋯, r. We have

(b) = N Σ_{t=1}^T Σ_{j=1}^r (Σ_{g=1}^G p_g x_{gj} ω_{g,t})² − (N/T) Σ_{j=1}^r (Σ_{t=1}^T Σ_{g=1}^G p_g x_{gj} ω_{g,t})².

When ω_{g,t} = (2t−1−T)/T for all t and g, given that X is orthogonal to 1⃗, we have Σ_{g=1}^G p_g x_{gj} ω_{g,t} = 0, and then (b) is minimized with value zero. Similarly, we can show that when ω_{g,t} = (2t−1−T)/T for all t and g, given that U is orthogonal to 1⃗, (c) is minimized with value zero. Moreover, ω_t = Σ_{g=1}^G p_g ω_{g,t} = (2t−1−T)/T, so (a) is minimized as well. Hence, the sum of (a), (b) and (c) is minimized. □

Proof of Theorem 4.
Denote

f(ω_t, ω_{g,t}) = Σ_{t=1}^T ω_t² − (1/T)(Σ_{t=1}^T ω_t)² + (2/T) Σ_{t=1}^T (T + 1 − 2t) ω_t + Σ_{t=1}^T Σ_{j=1}^r (Σ_{g=1}^G p_g x_{gj} ω_{g,t})² − (1/T) Σ_{j=1}^r (Σ_{t=1}^T Σ_{g=1}^G p_g x_{gj} ω_{g,t})².

The variance of τ̂ is denoted g(ω_t, ω_{g,t}), and

g(ω_t, ω_{g,t}) := Var(τ̂) = σ² (z⊤(I − Γ(Γ⊤Γ)⁻¹Γ⊤)z)⁻¹ = σ²/(N(T − f(ω_t, ω_{g,t}))).

We evaluate f(ω_t, ω_{g,t}) at the optimal solution in Theorem 3 and have

f(ω*_t, ω*_{g,t}) = −Σ_{t=1}^T ((T+1−2t)/T)² = −(T+1)(T−1)/(3T),

where ω*_t = ω*_{g,t} = (2t−1−T)/T. When we use the nearest-integer rounding rule to get a feasible Z^rnd, the corresponding ω_t, ω_{g,t} are denoted ω^rnd_t, ω^rnd_{g,t}. We have |ω^rnd_{g,t} − ω*_{g,t}| ≤ 1/|O_g|, |Σ_{t=1}^T ω^rnd_{g,t}| ≤ T/|O_g|,

|ω^rnd_t − ω*_t| = |Σ_{g=1}^G p_g (ω^rnd_{g,t} − ω*_{g,t})| ≤ Σ_{g=1}^G p_g |ω^rnd_{g,t} − ω*_{g,t}| ≤ Σ_{g=1}^G p_g (1/N_min) = 1/N_min,

and |Σ_{t=1}^T ω^rnd_t| ≤ T/N_min. We have

Σ_{t=1}^T (ω^rnd_t + (T+1−2t)/T)² − (1/T)(Σ_{t=1}^T ω^rnd_t)² − Σ_{t=1}^T ((T+1−2t)/T)² ≤ 3T/N²_min − (T+1)(T−1)/(3T)

from Theorem 2,

|Σ_{g=1}^G p_g x_{gj} ω^rnd_{g,t}| = |Σ_{g=1}^G p_g x_{gj} (ω^rnd_{g,t} + (T+1−2t)/T)|
= |Σ_{g: x_{gj}>0} p_g x_{gj} (ω^rnd_{g,t} + (T+1−2t)/T) + Σ_{g: x_{gj}<0} p_g x_{gj} (ω^rnd_{g,t} + (T+1−2t)/T)|
≤ max{ |Σ_{g: x_{gj}>0} p_g x_{gj} (ω^rnd_{g,t} + (T+1−2t)/T)|, |Σ_{g: x_{gj}<0} p_g x_{gj} (ω^rnd_{g,t} + (T+1−2t)/T)| } ≤ x_{j,max}/N_min,

and

Σ_{t=1}^T Σ_{j=1}^r (Σ_{g=1}^G p_g x_{gj} ω^rnd_{g,t})² − (1/T) Σ_{j=1}^r (Σ_{t=1}^T Σ_{g=1}^G p_g x_{gj} ω^rnd_{g,t})² ≤ 3T Σ_{j=1}^r x²_{j,max}/N²_min.

We have

g(ω^rnd_t, ω^rnd_{g,t})/g(ω*_t, ω*_{g,t}) = (T − f(ω*_t, ω*_{g,t}))/(T − f(ω^rnd_t, ω^rnd_{g,t})) ≤ [(4T²−1)/(3T)] / [(4T²−1)/(3T) − 3T(1 + Σ_{j=1}^r x²_{j,max})/N²_min] ≤ 1/(1 − 3(1 + Σ_{j=1}^r x²_{j,max})/N²_min),

following 3T² ≤ 4T²−1. Note that f(ω*_t, ω*_{g,t}) is the optimal value of the convex relaxation problem, so it is a lower bound for the minimum objective value of the original integer programming problem. Thus, g(ω*_t, ω*_{g,t}) ≤ Var_{Z*}(τ̂) and we have

Var_{Z^rnd}(τ̂) ≤ (1/(1 − 3(1 + Σ_{j=1}^r x²_{j,max})/N²_min)) Var_{Z*}(τ̂). □

Appendix E: Extension to Carryover Effects
Proof of Lemmas 4 and 5
Let $\vec z^{(j)} := \big[\vec z_j^\top\ \vec z_{j+1}^\top\ \cdots\ \vec z_{T-\ell-1+j}^\top\big]^\top$ for $j=1,\dots,\ell+1$. Then $Z = \big[\vec z^{(1)}\ \vec z^{(2)}\ \cdots\ \vec z^{(\ell+1)}\big]$. Furthermore, we decompose the matrix $\Theta$ that determines $\mathrm{Var}(\hat{\vec\tau})$:
$$\Theta = Z^\top\big(I_{N(T-\ell)}-\Gamma(\Gamma^\top\Gamma)^{-1}\Gamma^\top\big)Z = \underbrace{Z^\top Z}_{\Theta_1}-\underbrace{Z^\top\Gamma(\Gamma^\top\Gamma)^{-1}\Gamma^\top Z}_{\Theta_2},$$
and simplify $\Theta_1$ and $\Theta_2$ separately. Let us first analyze $\Theta_1 = Z^\top Z$. For the $(j,j)$-th entry of $\Theta_1$, we have $\Theta_{1,jj} = N(T-\ell)$. For the $(j,m)$-th entry ($j<m$), we have
$$\Theta_{1,jm} = N\Big[(T-\ell)+\sum_{t=j}^{m-1}\big(\omega_t-\omega_{T-\ell+t}\big)\Big].$$
Next, let us consider $\Theta_2 = Z^\top\Gamma(\Gamma^\top\Gamma)^{-1}\Gamma^\top Z$. We have
$$(\Gamma^\top\Gamma)^{-1} = \begin{bmatrix}\Xi_{11} & \Xi_{12}\\ \Xi_{21} & \Xi_{22}\end{bmatrix} = \begin{bmatrix} M & -\tfrac{1}{N}M M_{\vec 1}\\[2pt] -\tfrac{1}{N}M_{\vec 1}^\top M & \tfrac{1}{N}I_{T-\ell}+\tfrac{1}{N^2}M_{\vec 1}^\top M M_{\vec 1}\end{bmatrix}\in\mathbb{R}^{(N+T-\ell-1)\times(N+T-\ell-1)},$$
where $\Xi_{11}=M$, $\Xi_{12}=-\frac1N M M_{\vec 1}$, $\Xi_{21}=-\frac1N M_{\vec 1}^\top M$, $\Xi_{22}=\frac1N I_{T-\ell}+\frac{1}{N^2}M_{\vec 1}^\top M M_{\vec 1}$, and
$$M = \frac{1}{T-\ell}\Big(I_{N-1}-\frac{1}{N}\vec 1\vec 1^\top\Big)^{-1} = \frac{1}{T-\ell}\big(I_{N-1}+\vec 1\vec 1^\top\big)\in\mathbb{R}^{(N-1)\times(N-1)},\qquad M_{\vec 1} = \big[\vec 1\ \cdots\ \vec 1\big]\in\mathbb{R}^{(N-1)\times(T-\ell)}.$$
Furthermore,
$$(\vec z^{(j)})^\top\Gamma = \Big[\textstyle\sum_{t=j}^{T-\ell-1+j}z_{1t}\ \cdots\ \sum_{t=j}^{T-\ell-1+j}z_{N-1,t}\quad \sum_{i=1}^{N}z_{ij}\ \cdots\ \sum_{i=1}^{N}z_{i,T-\ell-1+j}\Big] = \Big[(T-\ell)\zeta^{(j)}_1\ \cdots\ (T-\ell)\zeta^{(j)}_{N-1}\quad N\omega_j\ \cdots\ N\omega_{T-\ell-1+j}\Big],$$
where $\zeta^{(j)}_i = \frac{1}{T-\ell}\sum_{t=j}^{T-\ell-1+j}z_{it}$ for $i=1,\dots,N$ and $j=1,\dots,\ell+1$, and $\omega_t = \frac{1}{N}\sum_{i=1}^{N}z_{it}$ for $t=1,\dots,T$. We let $\alpha^{(j)} = \big[(T-\ell)\zeta^{(j)}_1\ \cdots\ (T-\ell)\zeta^{(j)}_{N-1}\big]^\top$ and $\beta^{(j)} = \big[N\omega_j\ \cdots\ N\omega_{T-\ell-1+j}\big]^\top$. Then
$$Z^\top\Gamma = \begin{bmatrix}(\alpha^{(1)})^\top & (\beta^{(1)})^\top\\ \vdots & \vdots\\ (\alpha^{(\ell+1)})^\top & (\beta^{(\ell+1)})^\top\end{bmatrix},$$
and the $(j,m)$-th entry of $\Theta_2 = Z^\top\Gamma(\Gamma^\top\Gamma)^{-1}\Gamma^\top Z$ is
$$\Theta_{2,jm} = (\alpha^{(j)})^\top\Xi_{11}\alpha^{(m)}+(\alpha^{(j)})^\top\Xi_{12}\beta^{(m)}+(\beta^{(j)})^\top\Xi_{21}\alpha^{(m)}+(\beta^{(j)})^\top\Xi_{22}\beta^{(m)}\in\mathbb{R}.$$
For each term in $\Theta_{2,jm}$:
$$(\alpha^{(j)})^\top\Xi_{11}\alpha^{(m)} = \frac{1}{T-\ell}(\alpha^{(j)})^\top\big(I_{N-1}+\vec 1\vec 1^\top\big)\alpha^{(m)} = (T-\ell)\sum_{i=1}^{N-1}\zeta^{(j)}_i\zeta^{(m)}_i+(T-\ell)\Big(\sum_{i=1}^{N-1}\zeta^{(j)}_i\Big)\Big(\sum_{i=1}^{N-1}\zeta^{(m)}_i\Big),$$
$$(\alpha^{(j)})^\top\Xi_{12}\beta^{(m)} = -\frac{1}{T-\ell}(\alpha^{(j)})^\top\big(I_{N-1}+\vec 1\vec 1^\top\big)\vec 1\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big) = -\frac{N}{T-\ell}(\alpha^{(j)})^\top\vec 1\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big) = -N\Big(\sum_{i=1}^{N-1}\zeta^{(j)}_i\Big)\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big),$$
$$(\beta^{(j)})^\top\Xi_{21}\alpha^{(m)} = -N\Big(\sum_{i=1}^{N-1}\zeta^{(m)}_i\Big)\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big),$$
$$(\beta^{(j)})^\top\Xi_{22}\beta^{(m)} = N\sum_{t=j}^{T-\ell-1+j}\omega_t\,\omega_{t+m-j}+\frac{N(N-1)}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big).$$
From the definitions of $\zeta^{(j)}_i$ and $\omega_t$, we have $(T-\ell)\sum_{i=1}^{N}\zeta^{(j)}_i = N\sum_{t=j}^{T-\ell-1+j}\omega_t$ for $j=1,\dots,\ell+1$, and $\sum_{i=1}^{N-1}\zeta^{(j)}_i = \frac{N}{T-\ell}\sum_{t=j}^{T-\ell-1+j}\omega_t-\zeta^{(j)}_N$. Using this property, we have
$$\Theta_{2,jm} = (T-\ell)\sum_{i=1}^{N}\zeta^{(j)}_i\zeta^{(m)}_i-\frac{N}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big)+N\sum_{t=j}^{T-\ell-1+j}\omega_t\,\omega_{t+m-j}.\tag{32}$$
Recall the definition $\zeta^{(j)}_i = \frac{1}{T-\ell}\sum_{t=j}^{T-\ell-1+j}z_{it}$. There are $T+1$ different values that $\zeta^{(j)}_i\zeta^{(m)}_i$ can take, denoted $\upsilon^{(j,m)}_0,\upsilon^{(j,m)}_1,\dots,\upsilon^{(j,m)}_T$, where $\upsilon^{(j,m)}_t$ denotes the value of $\zeta^{(j)}_i\zeta^{(m)}_i$ when unit $i$ starts treatment in period $T+1-t$ (and $\upsilon^{(j,m)}_0$ represents the value of $\zeta^{(j)}_i\zeta^{(m)}_i$ when unit $i$ stays in the control group for all periods). Without loss of generality, we assume $j\le m$ and have
$$\upsilon^{(j,m)}_t = \begin{cases} 1 & t\le \ell+1-m,\\[2pt] -\Big(1-\frac{2(T+1-t-m)}{T-\ell}\Big) & \ell+1-m<t\le \ell+1-j,\\[2pt] \Big(1-\frac{2(T+1-t-j)}{T-\ell}\Big)\Big(1-\frac{2(T+1-t-m)}{T-\ell}\Big) & \ell+1-j<t\le T+1-m,\\[2pt] 1-\frac{2(T+1-t-j)}{T-\ell} & T+1-m<t\le T+1-j,\\[2pt] 1 & T+1-j<t. \end{cases}$$
Given $\omega_t$, there are $\frac{N(1+\omega_1)}{2},\frac{N(1+\omega_2)}{2},\dots,\frac{N(1+\omega_T)}{2}$ treated units in periods $1,2,\dots,T$. Equivalently, $\frac{N(1+\omega_1)}{2},\frac{N(\omega_2-\omega_1)}{2},\dots,\frac{N(\omega_T-\omega_{T-1})}{2}$ untreated units start the treatment in periods $1,2,\dots,T$, leaving $\frac{N(1-\omega_T)}{2}$ units in the control group at the end. Therefore,
$$\sum_{i=1}^{N}\zeta^{(j)}_i\zeta^{(m)}_i = N\Big[\frac{1+\omega_1}{2}\,\upsilon^{(j,m)}_{T}+\frac{\omega_2-\omega_1}{2}\,\upsilon^{(j,m)}_{T-1}+\cdots+\frac{1-\omega_T}{2}\,\upsilon^{(j,m)}_{0}\Big] = N\Big[1+\frac{1}{2}\sum_{t=1}^{T}\big(\upsilon^{(j,m)}_{T+1-t}-\upsilon^{(j,m)}_{T-t}\big)\,\omega_t\Big],$$
following $\upsilon^{(j,m)}_0=\upsilon^{(j,m)}_T=1$. Then we can rewrite $\Theta_{2,jm}$ in terms of $\omega_t$ and $\upsilon^{(j,m)}_t$:
$$\Theta_{2,jm} = N(T-\ell)+\frac{N(T-\ell)}{2}\sum_{t=1}^{T}\big(\upsilon^{(j,m)}_{T+1-t}-\upsilon^{(j,m)}_{T-t}\big)\,\omega_t-\frac{N}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big)+N\sum_{t=j}^{T-\ell-1+j}\omega_t\,\omega_{t+m-j}.$$
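The bookkeeping between treated-unit counts $N(1+\omega_t)/2$ and new-adopter counts $N(\omega_t-\omega_{t-1})/2$ used above can be sketched numerically (toy numbers of our own choosing, assuming the $\pm1$ treatment coding):

```python
import numpy as np

T, N = 4, 8
# Linear optimal schedule from the no-carryover case: -3/4, -1/4, 1/4, 3/4.
omega = np.array([(2 * t - 1 - T) / T for t in range(1, T + 1)])

treated = N * (1 + omega) / 2                  # treated units in each period
adopters = N * np.diff(omega, prepend=-1) / 2  # units starting treatment each period
never = N * (1 - omega[-1]) / 2                # units still in control at the end

print(treated)    # [1. 3. 5. 7.]
print(adopters)   # [1. 2. 2. 2.]
assert np.allclose(np.cumsum(adopters), treated)   # adopters accumulate into treated
assert adopters.sum() + never == N                 # every unit accounted for
```

The cumulative-sum identity is exactly the "equivalently" step in the text: counting treated units per period is the same as counting new adopters per period plus the never-treated remainder.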
Recall $\Theta=\Theta_1-\Theta_2$. For the diagonal entries of $\Theta$,
$$\Theta_{jj} = -N\Big[\sum_{t=j}^{T-\ell-1+j}\omega_t^2-\frac{1}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)^2+\frac{T-\ell}{2}\sum_{t=1}^{T}\big(\upsilon^{(j,j)}_{T+1-t}-\upsilon^{(j,j)}_{T-t}\big)\,\omega_t\Big],$$
and for the off-diagonal entries of $\Theta$ with $j<m$,
$$\Theta_{jm} = -N\Big[\sum_{t=j}^{T-\ell-1+j}\omega_t\,\omega_{t+m-j}-\frac{1}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big)+\frac{T-\ell}{2}\sum_{t=1}^{T}\big(\upsilon^{(j,m)}_{T+1-t}-\upsilon^{(j,m)}_{T-t}\big)\,\omega_t-\sum_{t=j}^{m-1}\big(\omega_t-\omega_{T-\ell+t}\big)\Big],$$
with $\Theta_{jm}=\Theta_{mj}$ for $j>m$. Thus, for the T-optimal design, the optimization problem is
$$\min_{\omega_t}\ \sum_{j=1}^{\ell+1}\Big[\sum_{t=j}^{T-\ell-1+j}\omega_t^2-\frac{1}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)^2+\frac{T-\ell}{2}\sum_{t=1}^{T}\big(\upsilon^{(j,j)}_{T+1-t}-\upsilon^{(j,j)}_{T-t}\big)\,\omega_t\Big]\tag{33}$$
$$\text{s.t.}\quad -1\le\omega_t\le 1,\ t=1,\dots,T;\qquad \omega_t\le\omega_{t+1},\ t=1,\dots,T-1,$$
where $\frac{T-\ell}{2}\big(\upsilon^{(j,j)}_{T+1-t}-\upsilon^{(j,j)}_{T-t}\big)=\frac{2(T-\ell+2j-1-2t)}{T-\ell}$ for $j\le t\le T-\ell-1+j$, and the difference vanishes for $t$ outside this window. For the D-optimal treatment design, the optimization problem is
$$\min_{\omega_t}\ -\det(\Theta)\tag{34}$$
$$\text{s.t.}\quad -1\le\omega_t\le 1,\ t=1,\dots,T;\qquad \omega_t\le\omega_{t+1},\ t=1,\dots,T-1.\qquad\Box$$
The following lemma provides an optimal solution for the quadratic program (25) in Lemma 4. This lemma is an important intermediate step in the proof of Theorem 5.
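Before that lemma, the objectives (33)–(34) can be made concrete: for any feasible staggered design, $\Theta$ is obtained by residualizing the current and lagged treatment vectors on the two-way fixed effects, the T-optimal criterion targets $\mathrm{tr}(\Theta)$ and the D-optimal criterion targets $\det(\Theta)$. A toy sketch with $\ell=1$ (our own construction, for illustration only):

```python
import numpy as np

N, T, ell = 6, 5, 1
adopt = np.array([2, 3, 3, 4, 5, 6])          # first treated period; 6 = never treated
z = np.array([[1.0 if t >= a else -1.0 for t in range(1, T + 1)] for a in adopt])

periods = range(ell, T)                        # keep periods t = ell+1, ..., T
rows = [(i, t) for i in range(N) for t in periods]
Zc = np.array([[z[i, t], z[i, t - 1]] for (i, t) in rows])   # current and lagged treatment
D = np.zeros((len(rows), N + (T - ell)))
for r, (i, t) in enumerate(rows):
    D[r, i] = 1.0                              # unit dummy
    D[r, N + (t - ell)] = 1.0                  # time dummy
R = Zc - D @ np.linalg.lstsq(D, Zc, rcond=None)[0]   # residualize on two-way effects
Theta = Zc.T @ R                               # equals R'R, since the projection is idempotent

print(np.trace(Theta), np.linalg.det(Theta))   # T- and D-optimal criteria
```

$\Theta$ comes out symmetric positive semi-definite, as it must, since it is a Gram matrix of residualized columns; a design search would compare these two scalars across candidate adoption vectors.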
Lemma 6 (Carryover Effects without Covariates). Suppose Assumptions 1-2 hold, the potential outcome follows model (14) and has no observed or latent covariates, i.e., $Y_{it}(z_{it}) = \alpha_i+\beta_t+\tau_1 z_{it}+\cdots+\tau_{\ell+1}z_{i,t-\ell}+\varepsilon_{it}$, and $T>\frac{\ell^3+13\ell^2+7\ell+3}{8\ell}$. Then an optimal solution for the quadratic program (25) is the $\vec\omega^*\in\mathbb{R}^T$ defined in Eq. (17).

Proof of Lemma 6. Denote the objective in the quadratic program (25) in Lemma 4 by
$$f(\vec\omega) := \sum_{j=1}^{\ell+1}\Big[\sum_{t=j}^{T-\ell-1+j}\omega_t^2-\frac{1}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)^2+\sum_{t=j}^{T-\ell-1+j}\frac{2(T-\ell+2j-1-2t)}{T-\ell}\,\omega_t\Big].$$
The Lagrangian of $f(\vec\omega)$ is
$$\mathcal{L}(\vec\omega,\vec\lambda,\vec\kappa,\vec\iota) = f(\vec\omega)+\sum_{t=1}^{T}\lambda_t(-1-\omega_t)+\sum_{t=1}^{T}\kappa_t(\omega_t-1)+\sum_{t=1}^{T-1}\iota_t(\omega_t-\omega_{t+1}).$$
The KKT conditions of $\mathcal{L}(\vec\omega,\vec\lambda,\vec\kappa,\vec\iota)$ are
$$\frac{\partial\mathcal{L}}{\partial\omega_t} = t\,\omega_t-\frac{\sum_{j=1}^{t}s_j}{T-\ell}+\frac{(T-\ell-t)\,t}{T-\ell}-\lambda_t+\kappa_t+\iota_t-\iota_{t-1}=0,\qquad t\le\ell,\tag{35}$$
$$\frac{\partial\mathcal{L}}{\partial\omega_t} = (\ell+1)\,\omega_t-\frac{\sum_{j=1}^{\ell+1}s_j}{T-\ell}+\frac{(\ell+1)(T+1-2t)}{T-\ell}-\lambda_t+\kappa_t+\iota_t-\iota_{t-1}=0,\qquad \ell<t\le T-\ell,\tag{36}$$
$$\frac{\partial\mathcal{L}}{\partial\omega_t} = (T+1-t)\,\omega_t-\frac{\sum_{j=t-T+\ell+1}^{\ell+1}s_j}{T-\ell}+\frac{(T+1-t)(\ell+1-t)}{T-\ell}-\lambda_t+\kappa_t+\iota_t-\iota_{t-1}=0,\qquad t>T-\ell,\tag{37}$$
$$\lambda_t(-1-\omega_t)=0,\qquad \kappa_t(\omega_t-1)=0,\qquad \iota_t(\omega_t-\omega_{t+1})=0,$$
$$-1\le\omega_t\le 1,\qquad \omega_t\le\omega_{t+1},\qquad \lambda_t\ge 0,\quad\kappa_t\ge 0,\quad\iota_t\ge 0,$$
where $s_j=\sum_{t=j}^{T-\ell-1+j}\omega_t$ for $j=1,\dots,\ell+1$ and $\iota_0=0$.

The Hessian of $f(\vec\omega)$ is positive semi-definite, so any solution that satisfies the KKT conditions is optimal. First we show that the optimal solution is symmetric with respect to the origin. The proof is as follows. If $\vec\omega^\ddagger$ is an optimal solution of the quadratic program (25), then $\vec\omega^\dagger=\big[-\omega^\ddagger_T\ -\omega^\ddagger_{T-1}\ \cdots\ -\omega^\ddagger_1\big]$ attains the same objective value as $\vec\omega^\ddagger$, because $\sum_{j=1}^{\ell+1}\sum_{t=j}^{T-\ell-1+j}\frac{2(T-\ell+2j-1-2t)}{T-\ell}\,\omega_t$ is symmetric with respect to the origin, and similarly for the other two terms in $f(\vec\omega)$. Since the quadratic program (25) is convex,
$$f\Big(\frac{\vec\omega^\ddagger+\vec\omega^\dagger}{2}\Big) \le \frac{1}{2}\big[f(\vec\omega^\ddagger)+f(\vec\omega^\dagger)\big] = f(\vec\omega^\ddagger).$$
Then if $\vec\omega^\ddagger$ is optimal, we may take $\vec\omega^\dagger=\vec\omega^\ddagger$. We can therefore focus on $\vec\omega$ satisfying $\big[\omega_1\ \omega_2\ \cdots\ \omega_T\big]=\big[-\omega_T\ -\omega_{T-1}\ \cdots\ -\omega_1\big]$. From the definition of $s_j$, we have $s_j=-s_{\ell+2-j}$; if $\ell$ is even, $s_{\ell/2+1}=0$.

Now we verify that $\vec\omega^*=\big[\omega^*_1\ \omega^*_2\ \cdots\ \omega^*_T\big]$ defined in Eq. (17) satisfies the KKT conditions with feasible $\vec\lambda,\vec\kappa,\vec\iota$.
1. $\omega^*_t$ for $\ell<t\le T-\ell$: $\omega^*_t=\frac{2t-T-1}{T-\ell}$ satisfies Eq. (36) with $\lambda_t=\kappa_t=\iota_t=0$ and $\iota_\ell=0$.
2. $\omega^*_t$ for $t\le\ell$: given $\omega_t=-\omega_{T+1-t}$, we can simplify $s_j$ to
$$s_j = \begin{cases}\sum_{t=j}^{\ell+1-j}\omega_t & j=1,\dots,\lfloor(\ell+1)/2\rfloor,\\[2pt] \sum_{t=T+2-j}^{T-\ell-1+j}\omega_t & j=\lfloor(\ell+1)/2\rfloor+1,\dots,\ell+1.\end{cases}$$
As an example, when $\ell=2$ we have $s_1=\omega_1+\omega_2$, $s_2=0$, and $s_3=\omega_{T-1}+\omega_T$; when $\ell=3$ we have $s_1=\omega_1+\omega_2+\omega_3$, $s_2=\omega_2$, $s_3=\omega_{T-1}$, and $s_4=\omega_{T-2}+\omega_{T-1}+\omega_T$. Furthermore, $s_j+s_{\ell+2-j}=0$ for $1\le j\le\ell+1$. Using this property, for $\lfloor\ell/2\rfloor<t\le\ell$ we have $\sum_{j=1}^{t}s_j=\sum_{j=1}^{\ell+1-t}s_j$.

Next we show that when $\omega_t=-1$ for $t\le\lfloor\ell/2\rfloor$, there exist some $\omega_t$ for $\lfloor\ell/2\rfloor<t\le\ell$ and some feasible $\lambda_t,\kappa_t,\iota_t$ that satisfy Eq. (35). When $\omega_t=-1$ for $t\le\lfloor\ell/2\rfloor$, then for $\lfloor\ell/2\rfloor<t\le\ell$,
$$\sum_{j=1}^{t}s_j = \sum_{j=1}^{\ell+1-t}s_j = -\Big[\sum_{j=1}^{\ell+1-t}\big(\lfloor\ell/2\rfloor+1-j\big)\Big]+\min(\ell+1-t,\ell-\lfloor\ell/2\rfloor)\,\omega_{\lfloor\ell/2\rfloor+1}+\cdots+\min(\ell+1-t,2)\,\omega_{\ell-1}+\min(\ell+1-t,1)\,\omega_\ell.$$
As an example, when $\ell=2$, $s_1=-1+\omega_2$; when $\ell=3$, $s_1+s_2=-1+2\omega_2+\omega_3$ and $s_1+s_2+s_3=-1+\omega_2+\omega_3$. We can rewrite Eq. (35) for $\lfloor\ell/2\rfloor<t\le\ell$ in the following vectorized form (we will consider Eq. (35) for $t\le\lfloor\ell/2\rfloor$ in the later part of this proof):
$$\begin{bmatrix}\partial\mathcal{L}/\partial\omega_{\lfloor\ell/2\rfloor+1}\\ \vdots\\ \partial\mathcal{L}/\partial\omega_\ell\end{bmatrix} = A^{(\ell)}\begin{bmatrix}\omega_{\lfloor\ell/2\rfloor+1}\\ \vdots\\ \omega_\ell\end{bmatrix}-b^{(\ell)}-\begin{bmatrix}\lambda_{\lfloor\ell/2\rfloor+1}\\ \vdots\\ \lambda_\ell\end{bmatrix}+\begin{bmatrix}\kappa_{\lfloor\ell/2\rfloor+1}\\ \vdots\\ \kappa_\ell\end{bmatrix}+\begin{bmatrix}\iota_{\lfloor\ell/2\rfloor+1}\\ \vdots\\ \iota_\ell\end{bmatrix}-\begin{bmatrix}\iota_{\lfloor\ell/2\rfloor}\\ \vdots\\ \iota_{\ell-1}\end{bmatrix}=0,\tag{38}$$
where $A^{(\ell)}$ and $b^{(\ell)}$ are defined in Eq. (22) and Eq. (23). When $\big[\omega_{\lfloor\ell/2\rfloor+1}\ \cdots\ \omega_\ell\big]^\top=(A^{(\ell)})^{-1}b^{(\ell)}$, Eq. (38) holds with $\lambda_t=\kappa_t=\iota_t=0$ for $t=\lfloor\ell/2\rfloor+1,\dots,\ell$ and $\iota_{\lfloor\ell/2\rfloor}=0$.
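As a sanity check on the stationarity logic: in the special case $\ell=0$ the objective reduces to $f(\omega)=\sum_t\omega_t^2-\frac1T(\sum_t\omega_t)^2+2\sum_t\frac{T+1-2t}{T}\omega_t$ (our reconstruction of the no-carryover objective), whose minimizer over the box and monotonicity constraints should be the linear schedule $\omega^*_t=(2t-1-T)/T$ with all constraints slack. Plain gradient descent confirms this:

```python
import numpy as np

T = 10
t = np.arange(1, T + 1)
c = (T + 1 - 2 * t) / T          # coefficients of the linear term of f
omega = np.zeros(T)

# Gradient descent on f; the box/monotonicity constraints never bind here,
# so no projection step is needed (the iterates stay in the feasible set).
for _ in range(200):
    grad = 2 * omega - 2 * omega.sum() / T + 2 * c
    omega -= 0.25 * grad

print(omega[:3])                                     # -> [-0.9 -0.7 -0.5]
assert np.allclose(omega, (2 * t - 1 - T) / T)       # omega* = -c
assert np.all(np.diff(omega) >= 0) and np.all(np.abs(omega) <= 1)
```

The unconstrained stationarity condition $2\omega-\frac{2}{T}(\sum_t\omega_t)\vec 1+2c=0$ is solved by $\omega=-c$ (note $\sum_t c_t=0$), which is feasible, mirroring the way cases 1-2 above verify the KKT system with $\lambda_t=\kappa_t=\iota_t=0$ on the interior.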
The remaining step is to verify the constraints − ≤ ω t ≤ − (cid:96) +1 T − (cid:96) and ω t ≤ ω t +1 hold if (cid:2) ω (cid:98) (cid:96)/ (cid:99) +1 · · · ω (cid:96) (cid:3) (cid:62) = ( A ( (cid:96) ) ) − b ( (cid:96) ) .(a) The first step is to show − A ( (cid:96) ) (cid:126) ≤ b ( (cid:96) ) ≤ ( − (cid:96) +1 T − (cid:96) ) A ( (cid:96) ) (cid:126)
1. Note that the diagonal entries in A ( (cid:96) ) are positive while the off-diagonal entries in A ( (cid:96) ) are negative, then A ( (cid:96) ) t (cid:48) , : (cid:126)ω ( (cid:98) (cid:96)/ (cid:99) +1): L is increasing in ω t anddecreasing in ω s for t (cid:48) = 1 , · · · , (cid:96) − (cid:98) (cid:96)/ (cid:99) , t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) and s (cid:54) = t (cid:48) + (cid:98) (cid:96)/ (cid:99) . If − A ( (cid:96) ) (cid:126) ≤ b ( (cid:96) ) ≤ ( − (cid:96) +1 T − (cid:96) ) A ( (cid:96) ) (cid:126) ω t defined in (cid:2) ω (cid:98) (cid:96)/ (cid:99) +1 · · · ω (cid:96) (cid:3) (cid:62) = ( A ( (cid:96) ) ) − b ( (cid:96) ) is between − − (cid:96) +1 T − (cid:96) for t = (cid:98) (cid:96)/ (cid:99) + 1 , · · · , (cid:96) .First, let us show − A ( (cid:96) ) (cid:126) ≤ b ( (cid:96) ) , which is equivalent to showing every entry in A ( (cid:96) ) (cid:126) b ( (cid:96) ) is non-negative,that is, for t (cid:48) = 1 , · · · , (cid:96) − (cid:98) (cid:96)/ (cid:99) , ( A ( (cid:96) ) (cid:126) t (cid:48) + b ( (cid:96) ) t (cid:48) ≥
0. If (cid:96) is even, (cid:80) (cid:96) − tl =1 ( (cid:96) − (cid:98) (cid:96)/ (cid:99) + 1 − l ) = t ( (cid:96) +1 − t )2 and (cid:80) (cid:96) − tl =1 ( (cid:98) (cid:96)/ (cid:99) + 1 − l ) = t ( (cid:96) +1 − t )2 . Let t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) . We have( A ( (cid:96) ) (cid:126) t (cid:48) + b ( (cid:96) ) t (cid:48) = t − T − (cid:96) t ( (cid:96) + 1 − t )2 − t + t T − (cid:96) − T − (cid:96) t ( (cid:96) + 1 − t )2 = t (2 T − (cid:96) − T − (cid:96) ≥ . If (cid:96) is odd, (cid:80) (cid:96) − tj =1 ( (cid:96) − (cid:98) (cid:96)/ (cid:99) + 1 − j ) = ( t +1)( (cid:96) +1 − t )2 and (cid:80) (cid:96) − tj =1 ( (cid:98) (cid:96)/ (cid:99) + 1 − j ) = ( t − (cid:96) +1 − t )2 . We have( A ( (cid:96) ) (cid:126) t (cid:48) + b ( (cid:96) ) t (cid:48) = t − T − (cid:96) ( t + 1)( (cid:96) + 1 − t )2 − t + t T − (cid:96) − T − (cid:96) ( t − (cid:96) + 1 − t )2 = t (2 T − (cid:96) − T − (cid:96) ≥ . Second, let us show b ( (cid:96) ) ≤ ( − (cid:96) +1 T − (cid:96) ) A ( (cid:96) ) (cid:126)
1, which is equivalent to showing every entry in (1 − (cid:96) +1 T − (cid:96) ) A ( (cid:96) ) (cid:126) b ( (cid:96) ) is non-positive, that is, for t (cid:48) = 1 , · · · , (cid:96) − (cid:98) (cid:96)/ (cid:99) , (cid:0) A ( (cid:96) ) (1 − (cid:96) +1 T − (cid:96) ) (cid:126) (cid:1) t (cid:48) + b ( (cid:96) ) t (cid:48) ≤
0. If (cid:96) is even (cid:16) A ( (cid:96) ) (1 − (cid:96) + 1 T − (cid:96) ) (cid:126) (cid:17) t (cid:48) + b ( (cid:96) ) t (cid:48) = t (2 T − (cid:96) − T − (cid:96) − (cid:96) + 1 T − (cid:96) (cid:16) t − t ( (cid:96) + 1 − t )2( T − (cid:96) ) (cid:17) = t ( T − (cid:96) − T − (cid:96) (cid:16) − (cid:96) + 1 T − (cid:96) (cid:17) < t ( T − (cid:96) − < − (cid:96) +1 T − (cid:96) >
0. If (cid:96) is odd, (cid:16) A ( (cid:96) ) (1 − (cid:96) + 1 T − (cid:96) ) (cid:126) (cid:17) t (cid:48) + b ( (cid:96) ) t (cid:48) = t (2 T − (cid:96) − T − (cid:96) − (cid:96) + 1 T − (cid:96) (cid:16) t − ( t + 1)( (cid:96) + 1 − t )2( T − (cid:96) ) (cid:17) = t ( T − (cid:96) − T − (cid:96) (cid:16) − t + 12 t (cid:96) + 1 T − (cid:96) (cid:17) < t ( T − (cid:96) − < − t +12 t (cid:96) +1 T − (cid:96) > − b ( (cid:96) ) t (cid:48) / ( A ( (cid:96) ) (cid:126) t (cid:48) is non-decreasing in t (cid:48) for t (cid:48) = 1 , · · · , (cid:96) − (cid:98) (cid:96)/ (cid:99) . Note that the diagonalentries in A ( (cid:96) ) are positive while the off-diagonal entries in A ( (cid:96) ) are negative, then A ( (cid:96) ) t (cid:48) , : (cid:126)ω ( (cid:98) (cid:96)/ (cid:99) +1): L is increasingin ω t and decreasing in ω s for t (cid:48) = 1 , · · · , (cid:96) − (cid:98) (cid:96)/ (cid:99) , t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) and s (cid:54) = t (cid:48) + (cid:98) (cid:96)/ (cid:99) . If − b ( (cid:96) ) t (cid:48) / ( A ( (cid:96) ) (cid:126) t (cid:48) isnon-decreasing in t (cid:48) , then ω t is non-decreasing in t , where t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) .Let c t (cid:48) be the c t (cid:48) that satisfies ( A ( (cid:96) ) ( − c t (cid:48) T − (cid:96) ) (cid:126) t (cid:48) = b ( (cid:96) ) t (cid:48) and let t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) . If (cid:96) is even, we have t (2 T − (cid:96) − T − (cid:96) = c t (cid:48) T − (cid:96) (cid:18) t − T − (cid:96) t ( (cid:96) + 1 − t )2 (cid:19) ⇔ T − (cid:96) − c t (cid:48) t + 2 T − (cid:96) − T − (cid:96) ) . Since ∂ (2 T − (cid:96) − ∂t = 2, ∂ t +2 T − (cid:96) − T − (cid:96) ) ∂t = T − (cid:96) ) , and 2 > T − (cid:96) ) , we have c t (cid:48) increases in t and t (cid:48) . 
This implies − b ( (cid:96) ) t (cid:48) / ( A ( (cid:96) ) (cid:126) t (cid:48) is non-decreasing in t (cid:48) for even (cid:96) .If (cid:96) is odd, we have t (2 T − (cid:96) − T − (cid:96) = c t (cid:48) T − (cid:96) (cid:18) t − T − (cid:96) ( t + 1)( (cid:96) + 1 − t )2 (cid:19) ⇔ T − (cid:96) − c t (cid:48) (cid:18) t + 1)( T − (cid:96) − t ( T − (cid:96) ) (cid:19) . Since ∂ (2 T − (cid:96) − ∂t = 2, ∂ t +2 T − (cid:96) − T − (cid:96) ) ∂t ≤ (cid:96) +3 (cid:96) +1 1 T − (cid:96) , and 2 > (cid:96) +3( T − (cid:96) )( (cid:96) +1) , we have c t (cid:48) increases in t and t (cid:48) . This againimplies − b ( (cid:96) ) t (cid:48) / ( A ( (cid:96) ) (cid:126) t (cid:48) is non-decreasing in t (cid:48) for odd (cid:96) .We have verified that for (cid:98) (cid:96)/ (cid:99) < t ≤ (cid:96) , ω t defined in (cid:2) ω (cid:98) (cid:96)/ (cid:99) +1 · · · ω (cid:96) (cid:3) (cid:62) = ( A ( (cid:96) ) ) − b ( (cid:96) ) satisifies the KKTconditions. The remaining step is to verify for t ≤ (cid:98) (cid:96)/ (cid:99) , ω t defined as ω t = − ω t = −
1, constraints − ≤ ω t ≤ ω t ≤ ω t +1 for t ≤ (cid:98) (cid:96)/ (cid:99) and ω (cid:98) (cid:96)/ (cid:99) ≤ ω + (cid:98) (cid:96)/ (cid:99) +1 are satisfied. Weonly need to verify that we can find feasible λ t , κ t , ι t to satisfy Eq. (35). Since ω t = −
1, from complementaryslackness, κ t = 0. Plug ω t = − λ − ι = − s T − (cid:96)λ t − ι t + ι t − = − t + (cid:80) tj =1 s j T − (cid:96) for t = 2 , · · · (cid:98) (cid:96)/ (cid:99) − λ t + ι t − = − t + (cid:80) tj =1 s j T − (cid:96) for t = (cid:98) (cid:96)/ (cid:99) We only need to verify − t + (cid:80) tj =1 s j T − (cid:96) ≥ t = (cid:98) (cid:96)/ (cid:99) as for the other conditions, λ − ι and λ t − ι t + ι t − cantake any value by properly choosing λ t and ι .Note that T − (cid:96) (cid:80) (cid:98) (cid:96)/ (cid:99) j =1 s j = T − (cid:96) (cid:80) L +1 −(cid:98) (cid:96)/ (cid:99) j =1 s j = ( (cid:96) + 1 − (cid:98) (cid:96)/ (cid:99) )( ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1) − ( (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) ) T − (cid:96) . Furthermore, ifwe can show ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for even (cid:96) and ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for odd (cid:96) , then wehave − (cid:98) (cid:96)/ (cid:99) T − (cid:96) − ( (cid:96) + 1 − (cid:98) (cid:96)/ (cid:99) )( ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1) + ( (cid:96) + 1 − (cid:98) (cid:96)/ (cid:99) ) T − (cid:96) = ( (cid:96) + 1 − (cid:98) (cid:96)/ (cid:99) )( (cid:96) + 1) T − (cid:96) − ( (cid:96) + 1 − (cid:98) (cid:96)/ (cid:99) )( ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1) ≥ − t + (cid:80) tj =1 s j T − (cid:96) ≥ ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for even (cid:96) and ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for odd (cid:96) .” Denote c ut := − (cid:96) +1 T − (cid:96) t (cid:48) (cid:98) ( (cid:96) +1) / (cid:99) +1 . 
If we can show ω t ≤ − (cid:96) +1 T − (cid:96) t (cid:48) (cid:98) ( (cid:96) +1) / (cid:99) +1 := c ut for t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) , then it implies“ ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for even (cid:96) and ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for odd (cid:96) .” Note that the diagonalentries in A ( (cid:96) ) are positive while the off-diagonal entries in A ( (cid:96) ) are negative, then A ( (cid:96) ) t (cid:48) , : (cid:126)ω ( (cid:98) (cid:96)/ (cid:99) +1): L is increasingin ω t and decreasing in ω s for t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) and s (cid:54) = t (cid:48) + (cid:98) (cid:96)/ (cid:99) . We only need to show ( A ( (cid:96) ) c u ) (cid:96) −(cid:98) (cid:96)/ (cid:99) ≥ b ( (cid:96) ) (cid:96) −(cid:98) (cid:96)/ (cid:99) ,where c u = (cid:2) c u (cid:98) (cid:96)/ (cid:99) +1 · · · c u(cid:96) −(cid:98) (cid:96)/ (cid:99) (cid:3) (cid:62) . If (cid:96) is even, and when T > (cid:96) +11 (cid:96) +28 , − ( A ( (cid:96) ) c u ) (cid:96) −(cid:98) (cid:96)/ (cid:99) + b ( (cid:96) ) (cid:96) −(cid:98) (cid:96)/ (cid:99) = (cid:96) ( (cid:96) − T − (cid:96) − (cid:96) ( (cid:96) + 1) T − (cid:96) (cid:18) (cid:96)(cid:96) + 2 − T − (cid:96) (cid:19) = − (cid:96)T − (cid:96) (cid:18) (cid:96) + 2 − (cid:96) + 14( T − (cid:96) ) (cid:19) < . If (cid:96) is odd, and when T > (cid:96) +13 (cid:96) +7 (cid:96) +38 (cid:96) (note that (cid:96) +13 (cid:96) +7 (cid:96) +38 (cid:96) > (cid:96) +11 (cid:96) +28 ), − ( A ( (cid:96) ) c u ) (cid:96) −(cid:98) (cid:96)/ (cid:99) + b ( (cid:96) ) (cid:96) −(cid:98) (cid:96)/ (cid:99) = (cid:96) ( (cid:96) − T − (cid:96) − (cid:96) ( (cid:96) + 1) T − (cid:96) (cid:96) + 1 (cid:96) + 3 + ( (cid:96) + 1) T − (cid:96) ) = − T − (cid:96) (cid:18) (cid:96)(cid:96) + 3 − ( L + 2) T − (cid:96) ) (cid:19) < . 
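The two sample-size thresholds invoked above (reconstructed here as $T>(\ell^2+11\ell+2)/8$ for the even-$\ell$ step and $T>(\ell^3+13\ell^2+7\ell+3)/(8\ell)$ overall; the exact polynomial forms are our reading of a garbled original) are ordered as claimed, since their difference is $(2\ell^2+5\ell+3)/(8\ell)>0$. A quick exact-arithmetic check:

```python
from fractions import Fraction

def thr_strong(ell):
    """Reconstructed threshold T > (ell^3 + 13 ell^2 + 7 ell + 3) / (8 ell)."""
    return Fraction(ell**3 + 13 * ell**2 + 7 * ell + 3, 8 * ell)

def thr_weak(ell):
    """Reconstructed threshold T > (ell^2 + 11 ell + 2) / 8."""
    return Fraction(ell**2 + 11 * ell + 2, 8)

for ell in range(1, 51):
    gap = thr_strong(ell) - thr_weak(ell)
    assert gap == Fraction(2 * ell**2 + 5 * ell + 3, 8 * ell)
    assert gap > 0     # the strong threshold dominates, as the proof asserts
```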
3. $\omega^*_t$ for $t>T-\ell$: this is the mirror image of the case $t\le\ell$, and by the symmetry established above the argument for $t\le\ell$ carries over.
We have verified that the $\vec\omega^*$ defined in Eq. (17) satisfies the KKT conditions; since the Hessian of the objective function in the quadratic program (25) is positive semi-definite, $\vec\omega^*$ is an optimal solution for the quadratic program (25). $\Box$

Proof of Theorem 5
In this proof, we first provide an equivalent and simplified form of the objective function $\mathrm{tr}\big(Z^\top\Sigma_e^{-1}(\Sigma_e-\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top)\Sigma_e^{-1}Z\big)$ in the integer program (15). We will show that the objective function is a sum of $\ell+1$ convex objectives, each similar to objective function (31) in the proof of Theorem 3. We then obtain the optimal solution by using the results in Theorem 3 and Lemma 6.

Conceptually, we derive the equivalent form of the objective in the integer program (15) using the results in Theorem 3 and Lemma 4. We decompose this task into a few steps:
1. Calculate $\Theta_1 = Z^\top\Sigma_e^{-1}Z \in \mathbb{R}^{(\ell+1)\times(\ell+1)}$.
2. Calculate Θ = Z (cid:62) Σ − e Γ(Γ (cid:62) Σ − e Γ) − Γ (cid:62) Σ − e Z ∈ R ( (cid:96) +1) × ( (cid:96) +1) (a) Calculate Z (cid:62) Σ − e Γ ∈ R ( (cid:96) +1) × ( N +( T − (cid:96) − p ) (b) Calculate (Γ (cid:62) Σ − e Γ) − ∈ R ( N +( t − − (cid:96) ) p ) × ( N +( t − − (cid:96) ) p ) (c) Calculate ( (cid:126)z ( j ) ) (cid:62) Σ − e Γ(Γ (cid:62) Σ − e Γ) − Γ (cid:62) Σ − e (cid:126)z ( m ) ∈ R (d) Calculate ( (cid:126)z ( j ) ) (cid:62) (Σ − e − Σ − e Γ(Γ (cid:62) Σ − e Γ) − Γ (cid:62) Σ − e ) (cid:126)z ( m ) ∈ R where (cid:126)z ( j ) = (cid:2) (cid:126)z (cid:62) j (cid:126)z (cid:62) j +1 · · · (cid:126)z (cid:62) T − (cid:96) − j (cid:3) (cid:62) for j = 1 , · · · , (cid:96) + 1 and Γ is equal toΓ = Γ :( (cid:96) +1) Γ :( (cid:96) +2) ...Γ : T = ˜ I (cid:126) X ˜ I (cid:126) X ... . . .˜ I (cid:126) X = ˜ I ˜ X ˜ I ˜ X ... . . .˜ I ˜ X ∈ R ( N ( T − (cid:96) )) × ( N +( T − (cid:96) − p ) , (39)where ˜ I = (cid:2) I N − p N − p,p (cid:3) (cid:62) ∈ R N × ( N − p ) , N − p,p is a matrix of 0, (cid:126) ∈ { } N . Similarly as Theorem 3, Σ − e =diag(Ψ , Ψ , · · · , Ψ) ∈ R ( N ( T − (cid:96) )) × ( N ( T − (cid:96) )) , whereΨ = Σ − e t = ( I N + U U (cid:62) ) − = I N − U ( I k + U (cid:62) U ) − U (cid:62) ∈ R N × N . = Z (cid:62) Σ − e Z hasΘ ,jm = ( (cid:126)z ( j ) ) (cid:62) Σ − e (cid:126)z ( m ) = T − (cid:96) (cid:88) t =1 (cid:126)z (cid:62) j − t Ψ (cid:126)z m − t . For the second step, Z (cid:62) Σ − e Γ has( (cid:126)z ( j ) ) (cid:62) Σ − e Γ = ( (cid:126)z ( j ) ) (cid:62) Ψ ˜ I Ψ ˜ X Ψ ˜ I Ψ ˜ X ... . . .Ψ ˜ I Ψ ˜ X = (cid:2)(cid:80) T − (cid:96)t =1 (cid:126)z (cid:62) j − t Ψ ˜
I (cid:126)z (cid:62) j Ψ ˜ X · · · (cid:126)z (cid:62) T − (cid:96) + j − Ψ ˜ X (cid:3) = (cid:2) ( φ ( j ) ) (cid:62) ( ι ( j ) ) (cid:62) (cid:3) , where ( φ ( j ) ) (cid:62) = (cid:80) T − (cid:96)t =1 (cid:126)z (cid:62) j − t Ψ ˜ I = ( (cid:80) T − (cid:96)t =1 (cid:126)z j − t ) (cid:62) Ψ ˜ I and ( ι ( j ) ) (cid:62) = (cid:2) (cid:126)z (cid:62) j Ψ ˜ X · · · (cid:126)z (cid:62) T − (cid:96) + j − Ψ ˜ X (cid:3) . Let ζ ( j ) = T − (cid:96) (cid:80) T − (cid:96) + j − t = j (cid:126)z t ∈ R N , then φ = T ( ζ ( j ) ) (cid:62) Ψ ˜ I . Note that U and ˜ X are orthogonal, we have Ψ ˜ X = ˜ X and˜ X (cid:62) Ψ ˜ X = N · I p . Then ( ι ( j ) ) (cid:62) = (cid:2) (cid:126)z (cid:62) j ˜ X · · · (cid:126)z (cid:62) T − (cid:96) + j − ˜ X (cid:3) = (cid:2) N ˜ ω (cid:62) j · · · N ˜ ω (cid:62) T − (cid:96) + j − (cid:3) = N (˜ (cid:126)ω ( j ) ) (cid:62) ∈ R ( T − (cid:96) ) p , where˜ ω t = N (cid:80) Ni =1 ˜ X i z it ∈ R p . Next is to calculate (Γ (cid:62) Σ − e Γ) − . We have similar decomposition as Theorem 3,(Γ (cid:62) Σ e Γ) − = (cid:20) Ξ Ξ Ξ Ξ (cid:21) = (cid:20) M − M ˜ M − ˜ M (cid:62) M ¯ M + ˜ M (cid:62) M ˜ M (cid:21) ∈ R ( N +( T − (cid:96) − p )) × ( N +( T − (cid:96) − p ) , where Ξ = M , Ξ = − M ˜ M , Ξ = − ˜ M (cid:62) M , Ξ = ¯ M + ˜ M (cid:62) M ˜ M with M = 1 T − (cid:96) (cid:16) ˜ I (cid:62) Ψ ˜ I − ˜ I (cid:62) Ψ ˜ X ( ˜ X (cid:62) Ψ ˜ X ) − ˜ X (cid:62) Ψ ˜ I (cid:17) − = 1 T − (cid:96) (cid:18) ˜ I (cid:62) Ψ ˜ I − N ˜ X ˜ X (cid:62) (cid:19) − ∈ R ( N − p ) × ( N − p ) ˜ M = (cid:2) ˜ I (cid:62) Ψ ˜ X ( ˜ X (cid:62) Ψ ˜ X ) − · · · ˜ I (cid:62) Ψ ˜ X ( ˜ X (cid:62) Ψ ˜ X ) − (cid:3) = (cid:2) N ˜ X · · · N ˜ X (cid:3) ∈ R ( N − p ) × (( T − (cid:96) ) p ) ¯ M = diag(( ˜ X (cid:62) Ψ ˜ X ) − , ( ˜ X (cid:62) Ψ ˜ X ) − , · · · , ( ˜ X (cid:62) Ψ ˜ X ) − ) = 1 N I ∈ R (( T − (cid:96) ) p ) × (( T − (cid:96) ) p ) Similar as Theorem 3, we can simplify M . 
M = 1 T − (cid:96) (cid:34)(cid:16) ˜ I (cid:62) Ψ ˜ I (cid:17) − + (cid:16) ˜ I (cid:62) Ψ ˜ I (cid:17) − ˜ X (cid:18) N − ˜ X (cid:62) (cid:16) ˜ I (cid:62) Ψ ˜ I (cid:17) − ˜ X (cid:19) − ˜ X (cid:62) (cid:16) ˜ I (cid:62) Ψ ˜ I (cid:17) − (cid:35)(cid:16) ˜ I (cid:62) Ψ ˜ I (cid:17) − = (cid:0) I N − p − U (1) ( I k + U (cid:62) U ) − U (cid:62) (1) (cid:1) − = I N − p + U (1) ( I k + U (cid:62) (2) U (2) ) − U (cid:62) (1) , where U (1) = (cid:2) u u · · · u N − p (cid:3) (cid:62) ∈ R ( N − p ) × k and U (2) = (cid:2) u N − p +1 · · · u N (cid:3) (cid:62) ∈ R p × k ( U = (cid:2) U (cid:62) (1) U (cid:62) (2) (cid:3) (cid:62) ). Next is to calculate ( (cid:126)z ( j ) ) (cid:62) Σ − e Γ(Γ (cid:62) Σ − e Γ) − Γ (cid:62) Σ − e (cid:126)z ( m ) for ≤ j, m ≤ (cid:96) + 1. It is equivalent to calcu-lating (cid:2) ( φ ( j ) ) (cid:62) ( ι ( j ) ) (cid:62) (cid:3) (cid:20) Ξ Ξ Ξ Ξ (cid:21) (cid:20) φ ( m ) ι ( m ) (cid:21) = ( φ ( j ) ) (cid:62) Ξ φ ( m ) + ( φ ( j ) ) (cid:62) Ξ ι ( m ) + ( ι ( j ) ) (cid:62) Ξ φ ( m ) + ( ι ( j ) ) (cid:62) Ξ ι ( m ) ,where (recall ζ ( j ) = T − (cid:96) (cid:80) T − (cid:96) + j − t = j (cid:126)z t and ˜ (cid:126)ω ( j ) = (cid:2) ˜ ω (cid:62) j · · · ˜ ω (cid:62) T − (cid:96) + j − (cid:3) )( φ ( j ) ) (cid:62) Ξ φ ( m ) = ( T − (cid:96) ) ( ζ ( j ) ) (cid:62) Ψ ˜ IM ˜ I (cid:62) Ψ ζ ( m ) ( φ ( j ) ) (cid:62) Ξ ι ( m ) = − N ( T − (cid:96) )( ζ ( j ) ) (cid:62) Ψ ˜ IM ˜ M ˜ (cid:126)ω ( m ) = − ( T − (cid:96) )( ζ ( j ) )Ψ ˜ IM ˜ X (cid:32) T − (cid:96) + m − (cid:88) t = m ˜ ω m − t (cid:33) ( ι ( j ) ) (cid:62) Ξ φ ( m ) = − N ( T − (cid:96) )(˜ (cid:126)ω ( j ) ) (cid:62) ˜ M (cid:62) M Ψ ˜ Iζ ( m ) = − ( T − (cid:96) ) (cid:32) T − (cid:96) + j − (cid:88) t = j ˜ ω t (cid:33) ˜ X (cid:62) M Ψ ˜ I (cid:62) ζ ( m ) ( ι ( j ) ) (cid:62) Ξ ι ( m ) = N T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t ˜ ω t + m − j + (cid:32) T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t (cid:33) ˜ X (cid:62) M ˜ X (cid:32) T − (cid:96) + m − 
(cid:88) t = m ˜ ω t (cid:33) φ ( j ) ) (cid:62) Ξ φ ( m ) , ( φ ( j ) ) (cid:62) Ξ ι ( m ) , ( ι ( j ) ) (cid:62) Ξ φ ( m ) and ( ι ( j ) ) (cid:62) Ξ ι ( m ) can be simplified similar as the proof ofTheorem 3 (the definition of notations can be found in the proof of Theorem 3 as well).( φ ( j ) ) (cid:62) Ξ φ ( m ) = ( T − (cid:96) )( ζ ( j ) ) (cid:62) (cid:0) Ω + γ ( N I p − δ ) − γ (cid:62) (cid:1) ζ ( m ) ( φ ( j ) ) (cid:62) Ξ ι ( m ) = − ( T − (cid:96) )( ζ ( j ) ) (cid:62) (cid:16) γ ( N I p − δ ) − ˜ X (cid:62) (cid:17) ζ ( m ) ( ι ( j ) ) (cid:62) Ξ φ ( m ) = − ( T − (cid:96) )( ζ ( j ) ) (cid:62) (cid:16) ˜ X ( N I p − δ ) − γ (cid:62) (cid:17) ζ ( m ) ( ι ( j ) ) (cid:62) Ξ ι ( m ) = N T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t ˜ ω t + m − j + ( T − (cid:96) )( ζ ( j ) ) (cid:62) (cid:16) ˜ X ( N I p − δ ) − ˜ X (cid:62) (cid:17) ζ ( m ) − NT − (cid:96) (cid:32) T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t (cid:33) (cid:32) T − (cid:96) + m − (cid:88) t = m ˜ ω t (cid:33) Similar as the proof of Theorem 3, we can show (cid:2) ( φ ( j ) ) (cid:62) ( ι ( j ) ) (cid:62) (cid:3) (cid:20) Ξ Ξ Ξ Ξ (cid:21) (cid:20) φ ( m ) ι ( m ) (cid:21) = N T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t ˜ ω t + m − j − NT − (cid:96) (cid:32) T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t (cid:33) (cid:32) T − (cid:96) + m − (cid:88) t = m ˜ ω t (cid:33) +( T − (cid:96) )( ζ ( j ) ) (cid:62) (cid:0) I N − U ( I k + U (cid:62) U ) − U (cid:62) (cid:1) ζ ( m ) . Next is to calculate ( (cid:126)z ( j ) ) (cid:62) (Σ − e − Σ − e Γ(Γ (cid:62) Σ − e Γ) − Γ (cid:62) Σ − e ) (cid:126)z ( m ) for ≤ j, m ≤ (cid:96) + 1 . 
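Several of the simplifications above rely on the Woodbury identity $\Psi=(I_N+UU^\top)^{-1}=I_N-U(I_k+U^\top U)^{-1}U^\top$. A quick numerical confirmation of the identity (random $U$, our own toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 8, 3
U = rng.standard_normal((N, k))

# Direct inverse versus the Woodbury form used for Psi in the proof.
lhs = np.linalg.inv(np.eye(N) + U @ U.T)
rhs = np.eye(N) - U @ np.linalg.inv(np.eye(k) + U.T @ U) @ U.T
assert np.allclose(lhs, rhs)
```

The payoff is computational: inverting the $k\times k$ matrix $I_k+U^\top U$ instead of the $N\times N$ matrix $I_N+UU^\top$ is what makes the closed forms for $\Psi$ and $M$ tractable when $k\ll N$.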
\begin{align*}
&(\vec z^{(j)})^\top \Sigma_e^{-1} \vec z^{(m)} - (\vec z^{(j)})^\top \Sigma_e^{-1}\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top\Sigma_e^{-1}\vec z^{(m)} = \Theta_{1,jm} - \Theta_{2,jm}\\
&\quad= N(T-\ell) - \sum_{t=j}^{T-\ell+j-1} \vec z_t^\top U (I_k + U^\top U)^{-1} U^\top \vec z_{t+m-j} - N \sum_{t=j}^{T-\ell+j-1} \tilde\omega_t^\top \tilde\omega_{t+m-j}\\
&\qquad+ \frac{N}{T-\ell} \Bigg(\sum_{t=j}^{T-\ell+j-1} \tilde\omega_t\Bigg)^{\!\top}\Bigg(\sum_{t=m}^{T-\ell+m-1} \tilde\omega_t\Bigg) - (T-\ell)(\zeta^{(j)})^\top \big(I_N - U(I_k+U^\top U)^{-1}U^\top\big)\zeta^{(m)}\\
&\quad= N(T-\ell) \underbrace{- \Bigg[(T-\ell)(\zeta^{(j)})^\top \zeta^{(m)} - \frac{N}{T-\ell}\Bigg(\sum_{t=j}^{T-\ell+j-1}\omega^{(1)}_t\Bigg)\Bigg(\sum_{t=m}^{T-\ell+m-1}\omega^{(1)}_t\Bigg) + N\sum_{t=j}^{T-\ell+j-1}\omega^{(1)}_t\,\omega^{(1)}_{t+m-j}\Bigg]}_{=:\,-a^{(j,m)}}\\
&\qquad+ \underbrace{N\sum_{q=2}^{p}\Bigg[-\sum_{t=j}^{T-\ell+j-1}\omega^{(q)}_t\,\omega^{(q)}_{t+m-j} + \frac{1}{T-\ell}\Bigg(\sum_{t=j}^{T-\ell+j-1}\omega^{(q)}_t\Bigg)\Bigg(\sum_{t=m}^{T-\ell+m-1}\omega^{(q)}_t\Bigg)\Bigg]}_{=:\,-b^{(j,m)}}\\
&\qquad+ \underbrace{(T-\ell)(\zeta^{(j)})^\top U(I_k+U^\top U)^{-1}U^\top \zeta^{(m)} - \sum_{t=j}^{T-\ell+j-1}\vec z_t^\top U(I_k+U^\top U)^{-1}U^\top \vec z_{t+m-j}}_{=:\,-c^{(j,m)}},
\end{align*}
where $\omega^{(1)}_t = \frac{1}{N}\sum_{i=1}^N z_{it}$ and $\omega^{(q)}_t = \frac{1}{N}\sum_{i=1}^N X_{i,q-1} z_{it}$ for $q = 2,\dots,p$.

Note that maximizing $\operatorname{tr}\big(Z^\top \Sigma_e^{-1}(\Sigma_e - \Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top)\Sigma_e^{-1} Z\big)$ is equivalent to minimizing $-\operatorname{tr}\big(Z^\top \Sigma_e^{-1}(\Sigma_e - \Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top)\Sigma_e^{-1} Z\big)$, and, up to an additive constant that does not depend on $Z$,
\[
-\operatorname{tr}\big(Z^\top \Sigma_e^{-1}(\Sigma_e - \Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top)\Sigma_e^{-1} Z\big)
= -\sum_{j=1}^{\ell+1}(\vec z^{(j)})^\top\big(\Sigma_e^{-1} - \Sigma_e^{-1}\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top\Sigma_e^{-1}\big)\vec z^{(j)}
= \sum_{j=1}^{\ell+1}\big(a^{(j,j)} + b^{(j,j)} + c^{(j,j)}\big).
\]
From Lemma 4, we know
\[
\sum_{j=1}^{\ell+1} a^{(j,j)} = N\sum_{j=1}^{\ell+1}\Bigg[\sum_{t=j}^{T-\ell+j-1}(\omega^{(1)}_t)^2 - \frac{1}{T-\ell}\Bigg(\sum_{t=j}^{T-\ell+j-1}\omega^{(1)}_t\Bigg)^{\!2} + \sum_{t=j}^{T-\ell+j-1}\frac{2(T-\ell+j-1-t)+1}{T-\ell}\,\omega^{(1)}_t\Bigg].
\]
From Lemma 6, $\vec\omega^*$ defined in Eq. (17) minimizes $\sum_{j=1}^{\ell+1} a^{(j,j)}$. Next,
\[
\sum_{j=1}^{\ell+1} b^{(j,j)} = N\sum_{j=1}^{\ell+1}\sum_{q=2}^{p}\Bigg[\sum_{t=j}^{T-\ell+j-1}(\omega^{(q)}_t)^2 - \frac{1}{T-\ell}\Bigg(\sum_{t=j}^{T-\ell+j-1}\omega^{(q)}_t\Bigg)^{\!2}\Bigg].
\]
Each bracketed term is nonnegative by the Cauchy--Schwarz inequality and vanishes when $\omega^{(q)}_t$ is constant in $t$. Hence, if $\frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X$ for some $\mu_X \in \mathbb{R}^r$ and for all $t$, then $\sum_{j=1}^{\ell+1} b^{(j,j)}$ is minimized. Finally,
\begin{align*}
\sum_{j=1}^{\ell+1} c^{(j,j)} &= \sum_{j=1}^{\ell+1}\Bigg[\sum_{t=j}^{T-\ell+j-1}\vec z_t^\top U(I_k+U^\top U)^{-1}U^\top \vec z_t - (T-\ell)(\zeta^{(j)})^\top U(I_k+U^\top U)^{-1}U^\top \zeta^{(j)}\Bigg]\\
&= \sum_{j=1}^{\ell+1}\Bigg[\sum_{t=j}^{T-\ell+j-1}\vec z_t^\top U(I_k+U^\top U)^{-1}U^\top \vec z_t - \frac{1}{T-\ell}\Bigg(\sum_{t=j}^{T-\ell+j-1}\vec z_t\Bigg)^{\!\top} U(I_k+U^\top U)^{-1}U^\top \Bigg(\sum_{t=j}^{T-\ell+j-1}\vec z_t\Bigg)\Bigg].
\end{align*}
From Theorem 3, if $\frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U$ for some $\mu_U \in \mathbb{R}^k$ and for all $t$, then $\sum_{j=1}^{\ell+1} c^{(j,j)}$ is minimized. Hence, if the optimal design satisfies
\[
\frac{1}{N}\sum_{i=1}^N z_{it} = \omega^*_t, \qquad \frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X, \qquad \frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U, \qquad \text{for all } t \tag{40}
\]
for some $\mu_X \in \mathbb{R}^r$ and $\mu_U \in \mathbb{R}^k$, then $\sum_{j=1}^{\ell+1} a^{(j,j)}$, $\sum_{j=1}^{\ell+1} b^{(j,j)}$, and $\sum_{j=1}^{\ell+1} c^{(j,j)}$ are all minimized. Therefore $\sum_{j=1}^{\ell+1}\big(a^{(j,j)} + b^{(j,j)} + c^{(j,j)}\big)$ is minimized, which is equivalent to $\operatorname{tr}\big(Z^\top \Sigma_e^{-1}(\Sigma_e - \Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top)\Sigma_e^{-1} Z\big)$ being maximized. $\square$

Appendix F: Additional Empirical Results
F.1. Additional Empirical Datasets
Home Medical Visits.
This data set has 40,079 records of home medical visits from January 2016 to December 2018 in Barcelona, Spain, and is publicly available on Kaggle. We aggregate the records by city and week, calculate visit rates by taking a moving average of the number of visits, and obtain a panel of 102 cities over 144 weeks. We consider home medical visit rates instead of the raw number of visits for two reasons:
1. Home visits can naturally be modeled as Poisson distributed. We calculate the visit rate by averaging the number of home visits over the past m weeks (we use m = 16), which gives an unbiased and consistent estimator of the rate. It is therefore natural to measure the impact of environmental policies by their impact on the visit rate.
2. We are interested in environmental policies that can improve public health, i.e., reduce the number of home visits. If an entry has value zero, it cannot be reduced further. Visit rates, by contrast, are mostly positive and can potentially be reduced.

Transactions from a Large Grocery Store.
This data set contains 17,880,248 transactions between May 2005 and May 2007. We aggregate transactions by household and week and obtain a panel of 7,130 frequent households over 97 weeks. Our analysis focuses on frequent shoppers because they are more familiar with the reward or loyalty program and tend to pay more attention to its changes. We define frequent shoppers as households with expenditure in at least half of the weeks in the data set, that is, in more than 48 of the 97 total weeks. We want to study whether a new approach can increase frequent households' expenditure. Candidate interventions include: offering extra reward points or discounts if customers spend a certain amount of money on a shopping trip; increasing the marginal redemption rates if customers redeem more reward points; and offering personalized promotions on the categories that customers shop less frequently. This data set can be downloaded at
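The moving-average rate construction described for the home medical visit data can be sketched in a few lines. The function name and the simulation below are illustrative, not taken from the paper:

```python
import numpy as np

def visit_rate(counts, m=16):
    """Moving-average visit rate: at each week, average the counts of
    the previous m weeks (the paper uses m = 16).  For Poisson(lam)
    weekly counts, this average is an unbiased and consistent
    estimator of lam."""
    kernel = np.ones(m) / m
    # 'valid' drops the first m-1 weeks, where the window is incomplete
    return np.convolve(np.asarray(counts, dtype=float), kernel, mode="valid")

# quick sanity check on simulated Poisson counts
rng = np.random.default_rng(0)
counts = rng.poisson(lam=3.0, size=2000)
rates = visit_rate(counts, m=16)
```

Unlike the raw counts, the smoothed rates are essentially always strictly positive, which is the second reason given above for working with rates rather than counts.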
F.2. Additional Experiment Setups
Specification.
In this section, the results are calculated based on 100 randomly selected blocks, each divided into historical control data and synthetic experimental data. For the home medical visit data, we choose the treatment effect τ = −0.05 in the experiment with direct effect only; the median visit rate is roughly 1, so the τ we choose allows the treated outcome to stay positive. For the grocery data, we choose the treatment effect τ = 10 in the experiment with direct effect only; the median expenditure is 43.54, so the scale of the treatment effect is about 23% of the median expenditure. In the experiment with carryover effects, we consider a treatment that affects the current period and four periods in the future (equivalent to ℓ = 4), and we choose negative treatment effects [τ_0, τ_1, τ_2, τ_3, τ_4] for the home medical visit data and treatment effects [τ_0, τ_1, τ_2, τ_3, τ_4] with τ_0 = 6 for the grocery data.
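To make the designs concrete, here is a minimal sketch of a linear staggered design of the kind Z_OPT denotes: the treated fraction grows roughly linearly in t, adoption is irreversible, and assignment is stratified on a covariate so the treated group stays balanced over time. The function name, the ±1 covariate coding, and the schedule ω_t ≈ (t+1)/(T+1) are simplifying assumptions for illustration, not the paper's exact construction:

```python
import numpy as np

def linear_staggered_design(N, T, X):
    """Sketch of a linear staggered design (illustration only, not the
    paper's exact Z_OPT).  X is a centered covariate in {-1, +1} with
    equally many units of each type (a simplifying assumption); the
    adoption order alternates the two strata so every treated group
    is balanced on X."""
    order = np.empty(N, dtype=int)
    order[0::2] = np.flatnonzero(X == 1)   # stratum +1
    order[1::2] = np.flatnonzero(X == -1)  # stratum -1
    Z = np.zeros((N, T))
    for t in range(T):
        # target treated fraction grows linearly: omega_t ~ (t+1)/(T+1),
        # rounded to an even count so both strata stay exactly balanced
        n_treated = 2 * int(round((t + 1) / (T + 1) * N / 2))
        Z[order[:n_treated], t] = 1.0  # once treated, always treated
    return Z
```

By construction, z_it is nondecreasing in t for every unit (a staggered rollout), and the covariate sum over treated units is zero in every period, mirroring the balance conditions the theory asks of the optimal design.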
Choice of Estimation Method.
In practice, we need to select an estimation method to estimate thetreatment effect on the experimental data. We can leverage historical control data and use the procedure inRemark 14 to find the best estimation method. Then we use this method to estimate the treatment effecton the synthetic experimental data. The results are presented in the row “Hist Winner” in Tables 6 and 8.
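A simplified stand-in for this selection step can be sketched as follows; the actual procedure is in Remark 14, and the placebo scheme and names here are illustrative assumptions:

```python
import numpy as np

def hist_winner(panels, estimators, n_placebo=50, seed=0):
    """Pick the estimator with the smallest worst-case (minimax) RMSE
    across historical control panels.  Each placebo experiment applies
    a random staggered assignment with a known treatment effect of
    zero, so any nonzero estimate is pure error."""
    rng = np.random.default_rng(seed)
    worst_rmse = {}
    for name, estimate in estimators.items():
        rmses = []
        for Y in panels:
            N, T = Y.shape
            sq_errs = []
            for _ in range(n_placebo):
                start = rng.integers(1, T, size=N)  # random adoption times
                Z = (np.arange(T)[None, :] >= start[:, None]).astype(float)
                sq_errs.append(estimate(Y, Z) ** 2)  # true effect is 0
            rmses.append(float(np.sqrt(np.mean(sq_errs))))
        worst_rmse[name] = max(rmses)  # minimax over panels
    return min(worst_rmse, key=worst_rmse.get)
```

For example, given a simple difference-in-means estimator and a deliberately biased variant, the procedure selects the unbiased one.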
Remark 17.
One possible reason for the BLUE estimator performing worse than LRME on the home visit data is that the visit rates are strongly serially correlated by construction. In GLS, we calculate the errors' covariance matrix under the assumption that errors are serially uncorrelated and use the inverse of this covariance matrix as the weighting matrix. This assumption allows us to estimate fewer parameters (N × N rather than NT × NT) in the covariance matrix. Hence feasible GLS is less efficient when the assumption of no serial correlation is violated. We also estimated the full errors' covariance matrix with NT × NT parameters, which does not rely on that assumption; however, the corresponding RMSE is much larger than the RMSEs presented in Table 6.

[Table 6 about here: RMSEs of OLS, LRME, and Hist Winner under designs Z_BA, Z_FF, Z_FF+BA, Z_OPT (left) and Z_OPT, Z_K=2, Z_K=3, Z_OPT+ (right), for (N, T) in {(25,10), (25,20), (50,10), (50,20)}.]
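To illustrate the remark, the following sketch implements feasible GLS with only an N × N cross-sectional error covariance, i.e., under the no-serial-correlation assumption discussed above. The simplified outcome model (unit effects only, no time effects), the shrinkage step, and all names are assumptions made to keep the example short; this is not the paper's estimator:

```python
import numpy as np

def fgls_tau(Y, Z):
    """Feasible GLS in the toy model Y_it = tau * z_it + u_i + e_it.
    Errors are assumed serially uncorrelated, so only an N x N
    cross-sectional covariance is estimated (N*N parameters rather
    than NT*NT, as in the remark)."""
    N, T = Y.shape
    # pilot OLS of y on (z, unit dummies) to obtain residuals
    D = np.kron(np.eye(N), np.ones((T, 1)))  # unit fixed-effect dummies
    X = np.column_stack([Z.reshape(-1), D])  # rows: (unit, period) pairs
    y = Y.reshape(-1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = (y - X @ beta).reshape(N, T)
    # N x N covariance across units, pooled over time; shrink toward the
    # diagonal so it stays invertible even when T < N
    S = resid @ resid.T / T
    S = 0.5 * S + 0.5 * np.diag(np.diag(S))
    # GLS weight: inverse covariance of the stacked errors, S^{-1} kron I_T,
    # which is valid only under the no-serial-correlation assumption
    W = np.kron(np.linalg.inv(S), np.eye(T))
    coef = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return coef[0]  # tau_hat
```

Estimating instead a full NT × NT covariance from a single panel leaves far too many parameters per observation, consistent with the larger RMSE the remark reports for that variant.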
Table 6
Home medical visit data: This table compares the RMSE based on m = 100 randomly sampled blocks for benchmark treatment designs, Z_OPT, Z_K=2, Z_K=3, Z_OPT+, and "Hist Winner" (the estimation method with minimax estimation error on historical matrices). In the synthetic experiments, the intervention has a direct treatment effect only, and we choose the treatment effect τ = −0.05. The left table shows that the linear staggered design Z_OPT outperforms benchmark treatment designs. The right table shows that we can find a better treatment design Z_OPT+ via historical data and our data-driven local search algorithm. Both tables show that LRME is the best estimation method on the home medical visit data and that the optimal estimation method found on historical data performs similarly to LRME.

[Table 7 about here: RMSEs of OLS under Z_OPT and Z_OPT-CO for (N, T) in {(25,10), (25,20), (50,10), (50,20)}.]
Table 7
Home medical visit data: This table compares the RMSE based on m = 100 randomly sampled blocks for Z_OPT and Z_OPT-CO. In the synthetic experiments, the intervention has direct and carryover treatment effects, and we choose negative treatment effects [τ_0, τ_1, τ_2, τ_3, τ_4]. The nonlinear staggered design Z_OPT-CO outperforms the linear staggered design Z_OPT.

Figure 5
Home medical visit data: This figure compares the average estimated treatment effect τ̂_Z and its standard deviation based on m = 100 randomly sampled blocks for benchmark designs, Z_OPT, Z_K=2, Z_K=3, and Z_OPT+. In the synthetic experiments, N = 50, T = 20, the intervention has a direct treatment effect only, and we choose the treatment effect τ = −0.05. The height of each bar shows τ̂_Z, while the error bar indicates the standard deviation of τ̂_Z. The red dashed line indicates the true value τ = −0.05. Note that the figures in the second row have a different y-axis scale due to the superior performance of Z_OPT, Z_K=2, Z_K=3, and Z_OPT+ over benchmark treatment designs. The bias of the various treatment designs is similar, while Z_OPT has much smaller variance than the benchmark designs, and treatment designs found using historical data, such as Z_OPT+ from Algorithm 1, have smaller variance than Z_OPT.

[Table 8 about here: RMSEs of OLS and LRME under designs Z_BA, Z_FF, Z_FF+BA, Z_OPT (left) and Z_OPT, Z_K=2, Z_K=3, Z_OPT+ (right), for (N, T) in {(50,10), (50,20), (100,10), (100,20)}.]
Table 8
Grocery data: This table compares the RMSE based on m = 100 randomly sampled blocks for benchmark treatment designs, Z_OPT, Z_K=2, Z_K=3, Z_OPT+, and "Hist Winner" (the estimation method with minimax estimation error on historical matrices). In the synthetic experiments, the intervention has a direct treatment effect only, and we choose the treatment effect τ = 10. The left table shows that the linear staggered design Z_OPT outperforms benchmark treatment designs. The right table shows that we can find a better treatment design Z_OPT+ via historical data and our data-driven local search algorithm. Both tables show that GLS is the best estimation method and that the optimal estimation method found on historical data performs similarly to GLS.

[Table 9 about here: RMSEs of OLS and LRME under Z_OPT and Z_OPT-CO for (N, T) in {(50,10), (50,20), (100,10), (100,20)}.]
Table 9
Grocery data: This table compares the RMSE based on m = 100 randomly sampled blocks for Z_OPT and Z_OPT-CO. In the synthetic experiments, the intervention has direct and carryover treatment effects, and we choose treatment effects [τ_0, τ_1, τ_2, τ_3, τ_4] with τ_0 = 6. The nonlinear staggered design Z_OPT-CO outperforms the linear staggered design Z_OPT.

Figure 6