Optimal Experimental Design for Staggered Rollouts
Ruoxuan Xiong*
Department of Management Science & Engineering, Stanford University, [email protected]
Susan Athey
Graduate School of Business, Stanford University, [email protected]
Mohsen Bayati
Graduate School of Business, Stanford University, [email protected]
Guido Imbens
Graduate School of Business, Stanford University, [email protected]
Experimentation has become an increasingly prevalent tool for guiding decision-making and policy choices. A common hurdle in designing experiments is the lack of statistical power. In this paper, we study the optimal multi-period experimental design under the constraint that the treatment cannot be easily removed once implemented; for example, a government might implement a public health intervention in different geographies at different times, where the treatment cannot be easily removed due to practical constraints. The treatment design problem is to select which geographies (referred to as units) to treat at which time, intending to test hypotheses about the effect of the treatment. When the potential outcome is a linear function of unit and time effects, and discrete observed/latent covariates, we provide an analytically feasible solution to the optimal treatment design problem, whose treatment effect estimator has variance at most 1 + O(1/N) times the variance under the optimal treatment design, with N the number of units. This solution assigns units in a staggered treatment adoption pattern: if the treatment only affects one period, the optimal fraction of treated units in each period increases linearly in time; if the treatment affects multiple periods, the optimal fraction increases non-linearly in time, smaller at the beginning and larger at the end. In the general setting where outcomes depend on latent covariates, we show that historical data can be utilized in designing experiments. We propose a data-driven local search algorithm to assign units to treatment times. We demonstrate that our approach improves upon benchmark experimental designs via synthetic interventions on the influenza occurrence rate and synthetic experiments on interventions for in-home medical services and grocery expenditure.

Key words: Experimental Design, Contamination and Spillover Effects, Staggered Rollouts, Treatment Effect Estimation, Stratification
1. Introduction
* Alphabetical order other than the first author.

Experimentation is a cornerstone of decision-making. Large technology and retail companies run thousands or even tens of thousands of experiments per year to evaluate the impact of various decisions, design better products, and understand mechanisms (Muchnik et al. 2013, Aral and Walker 2014, Panniello et al. 2014, Cheung et al. 2017, Brynjolfsson et al. 2018, Cohen et al. 2018, Brynjolfsson et al. 2019, Salganik 2019, Gordon et al. 2019, Joo et al. 2019, Zhang et al. 2019a,b, Ülkü et al. 2019, Lee and Hosanagar 2019, Cui et al. 2020, Sun et al. 2020a,b). In some cases, a variety of firm constraints lead firms (units) to introduce new products or features at different times. In other cases, a variety of practical constraints lead clinical researchers and policymakers to introduce the treatment or intervention at different times. After such a rollout has occurred, analysts treat this type of "staggered adoption" of the new feature as a "natural experiment" and attempt to estimate causal effects (Athey and Stern 2002, Athey and Imbens 2018, Athey et al. 2018). However, less research has considered how to design such experiments in anticipation of precisely analyzing treatment effects.

In this paper, we study how to optimally design experiments in settings where the experimental designer can choose not just which units to treat but also when to treat them.
Our goal is to minimize the variance of the estimate for the treatment effect by choosing the optimal multi-period treatment design.
We maintain the assumption that during the course of the experiment, once treated, a unit continues being treated until the end of the experiment. This assumption is motivated by practical considerations such as difficulties communicating with experimental subjects or frictions in implementing changes; we also extend our results to the scenario where the treatment can be removed.

Our main motivation to study this question is a fundamental challenge in experimental design: statistical power (Pocock and Simon 1975, Cohen 1992, Kuhfeld 2005, Lawrie et al. 2015, Hemming et al. 2015, Li et al. 2018b, Bhat et al. 2019), which is the likelihood of detecting the effect of a new decision and is also a concern in observational studies (Fogarty and Small 2016, Heng et al. 2019). Statistical power can be low for various reasons, but the primary one is a small sample size. For an essential class of questions where the treatment has spillover effects, the relevant unit of analysis is an aggregated set of samples, such as a metropolitan area for testing public health interventions to reduce the occurrence of an infectious disease such as flu. Another example is randomization at the clinic level instead of the patient level in a clinical trial comparing different weight-loss programs, where patients may meet each other in the waiting room and communicate about their own strategies (Leahey et al. 2014). While randomizing at a higher level of aggregation can effectively avoid contamination and spillover effects, it substantially reduces the sample size and statistical power.

Measuring outcomes for units over time allows the experimental designer to improve the statistical power, as repeated observations for the same units allow the analyst to learn key parameters of the data-generating process, and thus reduce the error in estimating outcomes for units under counterfactual treatment status (Crowder and Hand 1990, Baltagi 2008, Hsiao 2014).
However, determining the optimal timing of treatment assignments is challenging for two reasons. First, it is difficult to find an explicit functional form for how the treatment design affects the variance of the treatment effect estimator (Hussey and Hughes 2007, Hemming et al. 2015, Li et al. 2018b). Second, assigning units to either the treated or control group at each point in time is an integer programming problem, which is, in general, computationally expensive to solve (Bertsimas et al. 2015, Kallus 2018, Bhat et al. 2019). For the first challenge, our analysis is based on the best linear unbiased estimator, which provides the explicit functional form and has the smallest variance among all linear unbiased estimators.
The main contribution of this paper is to provide closed-form solutions for the second challenge.
We provide theoretical guarantees for the optimality of our proposed multi-period experiment designs and support their practical relevance via extensive simulations on real data sets.

This paper's objective is to optimally allocate subjects from the control groups to the treated groups over multiple periods so that the variance of our treatment effect estimator is minimized. We restrict treated units from switching back to the control. We assume potential outcomes follow a two-way fixed-effect linear model, possibly with additional observed or latent covariates. Furthermore, we assume the treatment effect is constant, and we want to estimate this object as precisely as possible. This assumption is less restrictive than it looks. When the treatment effect is heterogeneous and time-varying, an object of interest is the average treatment effect (Rosenbaum and Rubin 1983, Abadie et al. 2010). We can estimate the average effect through the constant treatment effect model by decomposing the treatment effect into two terms: 1. the average effect, which is constant; 2. the heterogeneous and time-varying part with zero mean, which can be incorporated into the error term. We analyze the optimal experimental design to minimize the variance of the best linear unbiased estimator (BLUE) for the treatment effect. We find that a treatment design with a constant growth rate of the treated percentage is near-optimal. We propose an algorithm based on simulated annealing and the minimax decision criterion to actively improve this near-optimal solution using historical control data.

Specifically, our first main contribution is to provide simple sufficient conditions for the optimal treatment design when the potential outcome can have observed and latent covariates. In particular, we allow latent covariates to address the concern of model misspecification, namely that we may not fully observe all the covariates that affect the potential outcome.
Furthermore, if we have many observed covariates, and we do not know which matter, we could instead use a data-driven approach to estimate latent covariates that summarize the information in the data well. When covariates take finitely many values, in an optimal treatment design, the treated proportion (the fraction of treated units in each period) should be equal for each cluster of units with equal covariate values (stratum). We can interpret this treatment design as stratified randomization in both
Model: (i) two-way fixed effects; (ii) two-way fixed effects with discrete covariates; (iii) two-way fixed effects with carryover treatment effects and discrete covariates.
Optimal solution: (i) a (2t − 1)/(2T) fraction treated at each period t of a T-period design; (ii) a (2t − 1)/(2T) fraction treated for each stratum and all t; (iii) a nonlinear fraction treated in t, with five stages, for each stratum.

Table 1
This table summarizes the optimal treatment design in various potential outcome models. The base case has two-way (unit and time) fixed effects (Section 3.1). A more general case has two-way fixed effects with additional discrete observed and/or latent covariates (Section 3.2). The most general case allows the effect of a treatment to carry over to multiple periods (Section 3.3).

unit and time dimensions, in contrast to conventional A/B testing that only randomizes in the unit dimension. If the treatment only affects one period, the optimal treated proportion increases linearly in time. However, if the effect of the treatment carries over to future periods (Zhang et al. 2019a), the optimal treated proportion increases non-linearly in time, is convex at the beginning, and concave towards the end. When there is no feasible solution that attains the exact optimal treated proportion for each stratum, we provide a rounding scheme that yields a feasible solution whose estimation variance is within a factor of 1 + O(1/N) of the estimation variance of the optimal solution, where N is the number of units. Our main results are summarized in Table 1.

Our second main contribution is to propose a data-driven algorithm to efficiently find a better treatment design when covariates are latent or take infinitely many values. Our algorithm exploits covariate information in the historical control data and searches for a better treatment design based on simulated annealing and a minimax decision rule, which is motivated by our theoretical analysis. We illustrate our treatment design and algorithm's performance through synthetic experiments on three real data sets, studying interventions for reducing the flu occurrence rate, interventions for reducing the utilization of home-visit medical services, and marketing strategies for promoting grocery store sales.
The first example is motivated by several features of the critical problem of rigorously studying the effect of various interventions during a pandemic such as COVID-19. First, in such experiments, statistical power is low not only because we need to randomize at large geographical areas due to the contagious nature of the disease, but also because the treatment effect may be smaller than the (seasonal) fluctuations of the disease. Second, our proposed solutions in this paper allow experiments to start at the beginning of the peak season, continue during the peak season, and terminate at the end of the peak season. Third, the interventions usually have a carryover effect on the disease occurrence rate in multiple future periods. Overall, based on our theoretical results and empirical analysis of real data, we find that our proposed treatment designs substantially outperform benchmark designs.

The rest of the paper is organized as follows. Section 2 describes the model and our objective. The main results for the analysis of treatment designs are presented in Section 3, and Section 4 provides our algorithms for finding optimal treatment designs. Finally, in Section 5, we demonstrate the efficiency gain from our results and algorithms by analyzing synthetic experiments on interventions for reducing the flu occurrence rate, using a large multi-year real data set. We defer synthetic experiments on a secondary data set for reducing utilization of home-visit services and a tertiary data set for promoting grocery store sales to Appendix F.

We mainly focus on the problem of multi-period experimental design when treated units cannot switch back to the control, which is closely related to the stepped wedge design in clinical trials (Brown and Lilford 2006). In the stepped wedge design, an intervention is sequentially rolled out to participants (more often to clusters than to individuals) over several periods. Hemming et al.
(2015) point out that stepped wedge clustered studies tend to have better statistical performance than parallel clustered studies when the intra-cluster correlations are larger. Optimal allocation of clusters in the stepped wedge design has been studied under the linear model with time fixed effects, random cluster effects (Hussey and Hughes 2007, Hemming et al. 2015), and additional random interaction effects (Li et al. 2018b). In comparison, our setting and results are more general in the following three facets: 1. We allow each unit to have its own (observed or latent) covariates; 2. We allow the treatment to have carryover effects; 3. We show how historical control data can be used to improve treatment design. The first generalization is universal in practice; indeed, the assumption in prior literature on the linearity of outcomes in time effects and cluster effects is restrictive. The second generalization is relevant in many applications, such as in clinical trials (Grizzle 1965, Wallenstein and Fisher 1977, Willan and Pater 1986). For the third generalization, historical control data is commonly available in clinical trials and public health, so it is natural to leverage that data to enhance experiment design.

Our problem also shares similarities with the literature on covariate balancing and matching in experimental design (Pocock and Simon 1975, Cook et al. 2002, Hu et al. 2012, Bertsimas et al. 2015, 2019, Kallus 2018, Bhat et al. 2019, Krieger et al. 2019) and in causal inference (Nikolaev et al. 2013, Imai and Ratkovic 2014, Sauppe et al. 2014, Sun and Nikolaev 2016, Fan et al. 2016, Sauppe and Jacobson 2017, Li et al. 2018a, Kallus et al. 2018, Zhao 2019, Kallus 2020). Our experimental design setup is different in that we allocate treatment for multiple units over multiple periods, and the treatment cannot be removed once allocated. Compared with the literature on designing covariate balanced experiments (Bertsimas et al.
2015, 2019, Kallus 2018, Bhat et al. 2019), we allow covariates to be latent, and we propose to search for a better treatment design from historical control data that contains information about the latent covariates. Moreover, instead of directly solving the integer programming problem (which is, in general, computationally expensive, and can be infeasible when covariates are latent), we propose to start from the analytical solution that is optimal in the model without covariates, and actively improve this solution via historical data in an efficient way.

Another literature relevant to our findings is prior work on stratification in randomized experiments, such as in clinical trials (Simon 1979, Polit and Beck 2008, Athey and Imbens 2017), and in stratified sampling (Mulvey 1983, Cheng and Davenport 1989, Fox 2000). These works suggest that the statistical power of the treatment effect estimator can be increased by partitioning the population into small strata, randomizing within each stratum, and adjusting the standard errors. Our results convey a similar insight; we prove that stratification increases power in multi-period experimental design with discrete covariates. Furthermore, for more general covariates, we propose a data-driven stratification heuristic based on historical data.

The rich literature on online learning and the multi-armed bandit (e.g., Bubeck and Cesa-Bianchi (2012), Lattimore and Szepesvári (2018)) can also be viewed as a form of multi-period experimental design. However, in that setting, the decisions (or treatment designs) are adaptively updated given the experiment's partially observed results. In contrast, we consider settings where the experiment designer should commit to a procedure for the future rounds in advance.
2. Model and Problem Statement
In this section, we will introduce the potential outcome model under control and treatment in Section 2.1. Next, in Section 2.2, we describe the main treatment allocation problem that needs to be solved to maximize the statistical power of the experiment.
Before we massively scale up implementing a new policy, we want to evaluate its treatment effect. Assume we have N units to apply this new policy to, and we can observe their outcomes for T periods. We have the following assumption for the potential outcome.

Assumption 1 (Data Model Assumption). Assume the potential outcome for these N units over the T time periods can be modeled as

Y_it(z_it) = α_i + β_t + L_it + τ z_it + ε_it,   (1)

and in matrix notation,

Y(Z) = α 1_T^⊤ + 1_N β^⊤ + L + τ Z + ε,   (2)

where τ is the treatment effect, z_it ∈ {−1, 1} indicates whether unit i at time t is treated (z_it = 1) or not (z_it = −1), and α_i and β_t are unit and time fixed effects. Furthermore, we allow the matrix L = [L_it] ∈ R^{N×T} to be in one of the following two cases:
1. L = 0;
2. L = Xθ^⊤ + UV^⊤,
where X ∈ R^{N×r} and U = [u_1^⊤ ··· u_N^⊤]^⊤ ∈ R^{N×k} are observed and latent covariates, and θ ∈ R^{T×r} and V = [v_1^⊤ ··· v_T^⊤]^⊤ ∈ R^{T×k} are both unknown coefficients. We also assume that U is deterministic, and V is random and independent of the treatment with E[v_t | α, β, X] = 0 and Cov[v_t | α, β, X] = Σ_V for all t. We assume L is low-rank, that is, r, k ≪ min(N, T). In the second case, r and k can be 0. Furthermore, we assume ε_it is i.i.d. observation noise, independent of the treatment, with E[ε_it | α, β, X] = 0 and Var[ε_it | α, β, X] = σ² for all i and t.

Remark 1.
The term
UV^⊤ can be thought of as the structured component of the noise that models correlation in the noise across different units. Since in this structure the randomness is in V, the properties of U and V do not appear as symmetric variants of each other. Alternatively, we can assume V is deterministic and U is random, which would model the case where the noise has time-series correlations. We only focus on the former case since the latter would be similar. Another straightforward extension of our model that we do not cover in the paper is when there are two independent sources of structured noise (one with across-unit correlations and one with across-time correlations). Furthermore, the assumption that V has mean zero holds without loss of generality because otherwise one can update the fixed effects α and β as well as U and V so that this assumption holds. The assumptions on the randomness of this model, V and ε, are standard and allow regression estimators to be unbiased, as will be discussed in detail in Section 2.2.

In this paper, our primary focus is the second case, L = Xθ^⊤ + UV^⊤, where the potential outcome has covariates that can be either observed, latent, or both. However, we will discuss the simple case of L = 0 to build intuition for the general case.

We assume the treatment effect τ is constant, but Remark 2 below discusses how more general treatment effects can also be captured by this setup, and Remark 3 below discusses the extension of model (1) to carryover treatment effects.

We explicitly specify the fixed effects α 1_T^⊤ and 1_N β^⊤ in model (2) because it is common that the potential outcomes across units and across time periods have different means, and modeling them can potentially reduce the estimation bias of the parameters in model (2).

In this paper, we focus on the irreversible treatment adoption pattern, which is stated in the following assumption.
However, for completeness, we also solve the case when this assumption fails (the treatment can be removed) in Appendix A.1.

Assumption 2 (Irreversible Treatment Adoption Pattern). We assume the treatment adoption pattern is such that once a unit adopts the treatment, it stays treated afterwards, that is, z_it ≤ z_{i,t+1} for all i and 1 ≤ t ≤ T − 1.

First, we allow units to progressively adopt the treatment. Our setting is different from paired and stratified experiments, where the treatment status stays the same throughout the experiment. Our setting is similar to the crossover trial in a longitudinal study. However, we prohibit treated units from switching back to the control, which is closely related to the stepped wedge treatment design in randomized controlled trials (Brown and Lilford 2006, Hussey and Hughes 2007, Woertman et al. 2013, Hemming et al. 2015). This restriction is also closely related to the staggered treatment adoption pattern in observational studies (Athey and Stern 1998, Athey et al. 2018, Athey and Imbens 2018).

We study the irreversible scenario mainly for two reasons: 1. There are practical constraints restricting units from frequently switching between control and treatment. For example, policies at the cluster level, such as programs to improve the case management skills of healthcare staff at the hospital level, can hardly be suspended or rolled back. 2. When treated units switch back to control, they may not return to their original control status. For example, the improvement in the case management skills of healthcare staff can persist even if the programs are suspended.
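As a concrete illustration (ours, not from the paper), the data model of Assumption 1 together with the staggered adoption pattern of Assumption 2 can be simulated as follows; all dimensions, parameter values, and variable names are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, r, k = 40, 10, 2, 2   # units, periods, observed/latent ranks (illustrative)
tau = 1.5                   # constant treatment effect

alpha = rng.normal(size=(N, 1))                   # unit fixed effects
beta = rng.normal(size=(1, T))                    # time fixed effects
X, theta = rng.normal(size=(N, r)), rng.normal(size=(T, r))   # observed covariates
U, V = rng.normal(size=(N, k)), rng.normal(size=(T, k))       # latent part, V mean zero
L = X @ theta.T + U @ V.T                         # low-rank component, r, k << min(N, T)

# Assumption 2: unit i stays in control for adopt[i] periods, then stays treated
adopt = rng.integers(0, T + 1, size=N)
Z = np.where(np.arange(T)[None, :] >= adopt[:, None], 1, -1)

eps = rng.normal(scale=0.5, size=(N, T))          # i.i.d. observation noise
Y = alpha + beta + L + tau * Z + eps              # model (2)
```

Each row of `Z` is nondecreasing by construction, which is exactly the irreversibility constraint z_it ≤ z_{i,t+1}.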
Remark 2.
The assumption that τ is constant can be relaxed as follows. If the treatment effect is heterogeneous and time-varying, denoted by τ_it, we can still estimate the average treatment effect τ (which is typically the main goal in observational studies) by writing τ_it = τ + δ_it, where δ_it has mean 0, and incorporating δ_it into ε_it in model (2). This works because each entry of Z is ±1, so the combined noise term would be ε_it ± δ_it; this is still an i.i.d. noise model if δ_it has a symmetric distribution around the origin.

Remark 3.
In some cases, the treatment can affect multiple periods, and we are interested in estimating carryover treatment effects. We can extend model (1) to Y_it(z_it) = α_i + β_t + L_it + τ_1 z_it + τ_2 z_{i,t−1} + ··· + τ_{ℓ+1} z_{i,t−ℓ} + ε_it, where ℓ is the number of lagged periods to which the effect of the treatment can carry over. We also study the optimal experimental design for this carryover model in Section 3.3.

Our objective is to estimate the treatment effect τ as precisely as possible. We need to consider two aspects to achieve our objective:
1. How should we estimate τ?
2. How can we find the optimal design matrix Z ∈ R^{N×T} that minimizes the variance of the treatment effect estimator τ̂?
In the second question, the variance comes from the random observation noise ε and the random V in model (2).

As a preparation to discuss the first question, we state the well-known Gauss-Markov theorem.

Lemma 1 (Gauss-Markov Theorem). Consider a linear model y⃗ = Xβ + noise with E[noise | X] = 0 and Cov[noise | X] = Ω. If Ω = σ²I, then the ordinary least squares (OLS) estimator β̂ = (X^⊤X)^{−1}X^⊤y⃗ is the best linear unbiased estimator (BLUE). Otherwise, when Ω is not a multiple of the identity matrix, the generalized least squares (GLS) estimator β̂ = (X^⊤Ω^{−1}X)^{−1}X^⊤Ω^{−1}y⃗ is BLUE. Here, "best" means the estimator has the lowest variance among all unbiased linear estimators.

For the first question, in model (2), if either L = 0 or L = Xθ^⊤ (L only has observed covariates), then τ̂ estimated from OLS is BLUE. If L = Xθ^⊤ + UV^⊤, we can use GLS to estimate τ from

Y(Z) = α 1_T^⊤ + 1_N β^⊤ + Xθ^⊤ + τ Z + e,   (3)

where we incorporate UV^⊤ into the "error" term e, i.e., e = UV^⊤ + ε. Lemma 1 together with Assumption 1 implies that τ̂ from GLS is BLUE.
Hence, our theoretical analysis is based on OLS and GLS because they are BLUE.

In practice, the covariance matrix Ω of e⃗ is usually unknown, where e⃗ is the vectorization of the matrix e in Eq. (3). We can use feasible generalized least squares (feasible GLS), where we obtain an estimate of Ω from the residuals of OLS. When the errors' covariance matrix can be consistently estimated, feasible GLS is asymptotically more efficient than OLS. However, the covariance matrix has dimension NT × NT, which can be very large, and feasible GLS can be less efficient in finite samples. To address this efficiency concern, we can impose some structure on Ω, such as a diagonal-plus-low-rank structure. Alternatively, we propose an estimator (motivated by the low-rank matrix estimation literature in machine learning) that can be more efficient but may be biased. The details are presented in Sections 4 and 5.

For the second question, we can find the optimal treatment design by solving the following integer program:

min_Z Var(τ̂)   (4)
s.t. z_it ≤ z_{i,t+1}, for all i and 1 ≤ t ≤ T − 1,
     z_it ∈ {−1, 1}, for all i and t.

Solving the integer program (4) is challenging for two reasons. First, we need to understand how Var(τ̂) changes with z_it. Second, the decision variable z_it is an integer, and solving the integer program (4) can be computationally expensive. Specifically, even in the special case where T = 1 and τ̂ is obtained via OLS, we will see later in the integer program (8) that integer program (4) reduces to the number partitioning problem (Hayes 2002, Mertens 2006), which is NP-hard (the numbers corresponding to Γ can take infinitely many values).
We address the first challenge by analyzing the variance of τ̂ estimated from OLS or GLS. Specifically, this variance is a quadratic function of z_it. For the second challenge, we show in Section 3 that when τ̂ is estimated by OLS or GLS, depending on the structure of L, we can provide a near-optimal analytical solution, which has a surprisingly simple form and significantly outperforms benchmark treatment designs in empirical applications, as shown in Section 5.

Remark 4.
In the remainder, for notational simplicity, "z_it ≤ z_{i,t+1}" stands for "z_it ≤ z_{i,t+1}, for all i and 1 ≤ t ≤ T − 1", and "z_it ∈ {−1, 1}" stands for "z_it ∈ {−1, 1}, for all i and t".
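Before moving to the results, here is a minimal sketch (ours, not code from the paper) of the first question in the simplest case L = 0: τ is estimated by OLS after absorbing the unit and time fixed effects with dummy columns, normalizing α_N = 0 for identification; the function name is illustrative.

```python
import numpy as np

def estimate_tau_ols(Y, Z):
    """OLS for Y_it = alpha_i + beta_t + tau * z_it + eps_it.
    Stacks the panel period by period and regresses on z plus
    unit dummies (dropping unit N, i.e. alpha_N = 0) and time dummies."""
    N, T = Y.shape
    y = Y.T.reshape(-1)                        # (y_1^T, ..., y_T^T) stacked
    z = Z.T.reshape(-1).astype(float)
    unit = np.tile(np.arange(N), T)
    time = np.repeat(np.arange(T), N)
    W = np.column_stack([z,
                         np.eye(N)[unit][:, :-1],  # unit dummies
                         np.eye(T)[time]])         # time dummies
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    return coef[0]                             # tau_hat: coefficient on z
```

On noiseless data generated from a staggered design with at least two distinct adoption times, this recovers τ up to floating-point error; with noise it is the BLUE by Lemma 1.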
3. Results
We provide sufficient conditions for the optimal solution of the integer program (4) under mild assumptions, when we use the BLUE (either OLS or GLS) to estimate τ̂, in Sections 3.1 and 3.2. The sufficient conditions require the treated proportion to grow linearly in time. Based on the sufficient conditions, we demonstrate how to find a feasible solution whose objective value (variance) is at most 1 + O(1/N) times the objective value of the global optimum.

Furthermore, we extend our analysis to the scenario where the treatment has carryover effects, and our objective is to minimize the variance of the estimated direct and lagged treatment effects. We find that the optimal proportion is non-linear in time. The details are presented in Section 3.3.

In this section, we study the first case, i.e., L = 0 (no observed or latent covariates). Model (1) can be rewritten as

Y_it(z_it) = α_i + β_t + τ z_it + ε_it = [z_it  Γ_it^⊤] [τ; η] + ε_it,   (5)

where [z_it Γ_it^⊤] ∈ R^{1×(N+T)} and [τ; η] ∈ R^{(N+T)×1}. In matrix notation, we have

y⃗ = [z⃗  Γ] [τ; η] + ε⃗,   (6)

where y⃗ = [y⃗_1^⊤ y⃗_2^⊤ ··· y⃗_T^⊤]^⊤ ∈ R^{NT×1}, y⃗_t = [Y_1t ··· Y_Nt]^⊤ ∈ R^N, ˜I = [I_{N−1}  0⃗]^⊤ ∈ R^{N×(N−1)}, 0⃗ ∈ {0}^{N−1}, 1⃗ ∈ {1}^N, η = [α; β] with α = [α_1, ..., α_{N−1}]^⊤ and β = [β_1, ..., β_T]^⊤, and

Γ = [Γ_{:1}^⊤ Γ_{:2}^⊤ ··· Γ_{:T}^⊤]^⊤, with Γ_{:t} = [˜I  1⃗ e_t^⊤] ∈ R^{N×(N−1+T)},

where e_t ∈ R^T is the t-th standard basis vector (so the time-dummy block of Γ_{:t} is 1⃗ in column t and zero elsewhere). Note that we restrict α_N = 0 such that all other α_i and β_t can be uniquely identified. We use OLS to estimate τ and η. With some algebra, we have the following expression for the variance of τ̂:

Var(τ̂) = σ² / ( z⃗^⊤ (I_{NT} − Γ(Γ^⊤Γ)^{−1}Γ^⊤) z⃗ ).   (7)

Details to show Eq. (7) are presented in Appendix C. With Eq. (7), the integer program (4) is simplified to

min_{z⃗ = [z_it]}  z⃗^⊤ Γ(Γ^⊤Γ)^{−1}Γ^⊤ z⃗   (8)
s.t. z_it ≤ z_{i,t+1},
     z_it ∈ {−1, 1}.

Now we formally define our convex relaxation of the integer program (8). Let ω_t = (1/N) Σ_{i=1}^N z_it for all t. It is easy to see that ω_t takes values in {−1, −1 + 2/N, ..., 1 − 2/N, 1}, given the constraints of the integer program (8). We then relax the range of ω_t to −1 ≤ ω_t ≤ 1 and obtain the quadratic program

min_{ω⃗ ∈ R^T}  ω⃗^⊤ P_{1⃗} ω⃗ + 2 d⃗^⊤ ω⃗   (9)
s.t. −1 ≤ ω_t ≤ 1,
     ω_t ≤ ω_{t+1},

with P_{1⃗} = I − (1/T) 1⃗ 1⃗^⊤ ∈ R^{T×T}, d⃗ = [d_t] ∈ R^T, and d_t = (T + 1 − 2t)/T for all t. The following theorem provides the optimal treatment design for the quadratic program (9).

Theorem 1 (Fixed Effects Only). Under Assumptions 1-2, and when L = 0, an optimal solution for the quadratic program (9) is ω⃗* ∈ R^T defined by

ω*_t = (2t − 1 − T)/T, for all t.   (10)

Theorem 1 states that the analytical optimal solution ω*_t of the quadratic program (9) is linear in t. Given ω*_t, the optimal treated proportion is (1 + ω*_t)/2. Figure 1 visualizes the optimal treated proportion at every time period for a T-period treatment design problem, where T ranges from 2 to 20. The treated proportion increases at the constant rate 1/T.

Intuition.
In order to build intuition for Theorem 1, the following two lemmas provide the optimal solution for two special cases of the problem.
Lemma 2 (Time Fixed Effects Only). Consider the same assumptions as in Theorem 1, and in addition, assume that Y_it(z_it) = β_t + τ z_it + ε_it. Then, any treatment design is optimal if it satisfies ω_t = 0 for all t and z_it ≤ z_{i,t+1} for all i and t.

Lemma 2 states that if there are only time fixed effects in the potential outcome, the optimal treatment design is the parallel treatment design, in which half of the units are assigned the treatment while the others function as the control for all time periods. The intuition for Lemma 2 is that, given the time effect β_t, observations within a period serve as their own control to estimate τ. Formally, the variance only depends on the cross-sectional average of z_it. The variance is minimized when there are 50% treated and 50% control units at every time period. Rows of Z are exchangeable, so only the treated proportion, rather than whom to treat, matters. Table 2a provides an example of the optimal treatment design. Lemma 2 implies that it is optimal to randomly and evenly split units into the treated and control groups from the beginning, as in conventional A/B testing and placebo tests.

Lemma 3 (Unit Fixed Effects Only). Consider the same assumptions as in Theorem 1, and in addition, assume that Y_it(z_it) = α_i + τ z_it + ε_it. Then, any treatment design is optimal if it satisfies ω_t = −1 for all t < (T + 1)/2 and ω_t = 1 for all t > (T + 1)/2.

Lemma 3 states that if there are only unit fixed effects, the optimal treatment design is to allocate the treatment to all units at halftime. The intuition for Lemma 3 is symmetric to that for Lemma 2: given the unit effect α_i, every unit's observations serve as their own control to estimate τ. Formally, the variance only depends on the time-series average of z_it. The variance is minimized when there are 50% treated and 50% control periods for each unit.
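The variance expression (7) can also be evaluated numerically. The sketch below (our illustration, with arbitrary N and T) computes Var(τ̂)/σ² for a design Z in the two-way fixed effects model by projecting z⃗ off the unit and time dummy columns, and checks that the staggered design of Theorem 1 has a smaller variance than an alternative monotone design.

```python
import numpy as np

def var_tau(Z):
    """Var(tau_hat)/sigma^2 from Eq. (7): 1 / (z^T (I - P_Gamma) z), where
    P_Gamma projects onto the span of the unit and time dummy columns."""
    N, T = Z.shape
    z = Z.T.reshape(-1).astype(float)            # stack periods as in vec(y)
    unit = np.tile(np.arange(N), T)
    time = np.repeat(np.arange(T), N)
    G = np.column_stack([np.eye(N)[unit][:, :-1],   # unit dummies, alpha_N = 0
                         np.eye(T)[time]])          # time dummies
    resid = z - G @ np.linalg.lstsq(G, z, rcond=None)[0]   # (I - P_Gamma) z
    return 1.0 / (z @ resid)

N, T = 8, 4
# Theorem 1: treated count at period t is N*(2t-1)/(2T) = 1, 3, 5, 7 here
counts = [N * (2 * t - 1) // (2 * T) for t in range(1, T + 1)]
Z_stag = np.array([[1 if i >= N - c else -1 for c in counts] for i in range(N)])
# alternative monotone design: half the units adopt at t=2, the rest at t=4
Z_alt = np.array([[-1, 1, 1, 1]] * (N // 2) + [[-1, -1, -1, 1]] * (N // 2))
```

Note that a fully parallel design (no unit ever switches) makes z⃗ collinear with the unit dummies in the two-way model, so τ is not even identified there; this is why staggering is essential in this setting.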
Table 2b provides an example of the optimal treatment design.

Intuitively, when we have both unit and time fixed effects, the optimal treatment design is in between the optimal treatment designs in Lemmas 2 and 3. Specifically, the variance should depend on the average of $z_{it}$ over time and the average of $z_{it}$ over units. Under the irreversible treatment adoption pattern, the average of $z_{it}$ over time can be identified given the average of $z_{it}$ over units; therefore we can cancel out the average of $z_{it}$ over time. The Hessian of the objective function in the quadratic program (9) is the positive semidefinite matrix $P_{\vec 1}$, so the objective function is convex. The analytical optimal solution $\omega^*_t$ is linear in $t$ and balances the learning of both unit and time fixed effects. If we aggregate observations over all time periods, there are 50% control and 50% treated. Table 2c provides an example of the optimal treatment design.

The optimal solution in Theorem 1, $\omega^*_t = \frac{2t-1-T}{T}$, is equivalent to $\frac{N(2t-1)}{2T}$ treated units at time period $t$. However, $\frac{N(2t-1)}{2T}$ may not be an integer. In order to get a feasible solution of the integer program (4), we use the following nearest integer rounding rule. If $T$ is odd, units can be either treated or untreated at time period $t = \frac{T+1}{2}$.

  -1 -1      -1  1      -1 -1
  -1 -1      -1  1      -1 -1
  -1 -1      -1  1      -1  1
  -1 -1      -1  1      -1  1
   1  1      -1  1      -1  1
   1  1      -1  1      -1  1
   1  1      -1  1       1  1
   1  1      -1  1       1  1
(a) Time fixed effects   (b) Unit fixed effects   (c) Two-way fixed effects

Table 2
An optimal treatment design with $N = 8$ and $T = 2$ in the models with time fixed effects only, unit fixed effects only, and two-way fixed effects. Each row denotes a unit and each column denotes a time period.

Nearest integer rounding rule:
We obtain a feasible solution $Z^{\mathrm{rnd}}$ by rounding $\frac{N(2t-1)}{2T}$ to the nearest integer. Specifically, $Z^{\mathrm{rnd}} = [z^{\mathrm{rnd}}_{it}]$ is a treatment design satisfying the constraints in the integer program (4) whose number of treated units at time $t$ equals
$$\begin{cases} \big\lfloor \frac{N(2t-1)}{2T} \big\rfloor & \text{if } \mathrm{frac}\big(\frac{N(2t-1)}{2T}\big) < \frac12, \text{ or } \mathrm{frac}\big(\frac{N(2t-1)}{2T}\big) = \frac12 \text{ with } \frac{2t-1}{2T} < \frac12, \\[4pt] \big\lceil \frac{N(2t-1)}{2T} \big\rceil & \text{if } \mathrm{frac}\big(\frac{N(2t-1)}{2T}\big) > \frac12, \text{ or } \mathrm{frac}\big(\frac{N(2t-1)}{2T}\big) = \frac12 \text{ with } \frac{2t-1}{2T} \ge \frac12, \end{cases}$$
where $\mathrm{frac}\big(\frac{N(2t-1)}{2T}\big) = \frac{N(2t-1)}{2T} - \big\lfloor \frac{N(2t-1)}{2T} \big\rfloor$. The following theorem provides an upper bound for $\mathrm{Var}(\hat\tau)$ when the treatment design $Z^{\mathrm{rnd}}$ is used.

Theorem 2.
Under Assumption 1, let $Z^*$ be the optimal solution of the integer program (4) and $Z^{\mathrm{rnd}}$ be the feasible solution of the integer program (4) using the nearest integer rounding rule. Denote the variance of $\hat\tau$ using the treatment design $Z$ as $\mathrm{Var}_Z(\hat\tau)$. We have
$$\mathrm{Var}_{Z^{\mathrm{rnd}}}(\hat\tau) \le \frac{1}{1 - 1/N}\,\mathrm{Var}_{Z^*}(\hat\tau).$$
Theorem 2 states that the nearest integer rounding rule outputs a solution that approximates the global optimum within a factor of $1 + O(1/N)$.

In this section, we study the second case, i.e., $L = X\theta^\top + UV^\top$. Model (1) can be rewritten as
$$Y_{it}(z_{it}) = \alpha_i + \beta_t + X_i\theta_t^\top + \tau z_{it} + \underbrace{u_i^\top v_t + \varepsilon_{it}}_{e_{it}}. \tag{11}$$
Since $e_{it}$ has a factor structure, the $e_{it}$ are correlated in the unit dimension. We use GLS to estimate $\tau$, because GLS is BLUE when errors are correlated. The following theorem provides the sufficient conditions and the optimal treatment design for model (11).

Theorem 3 (Observed and Latent Covariates). Suppose Assumptions 1-2 hold, the rows of $X$ satisfy $\sum_{i=1}^N X_i = \vec 0$, the rows of $U$ satisfy $\sum_{i=1}^N u_i = \vec 0$ and $\sum_{i=1}^N X_i u_i^\top = 0_{r,k}$, $v_t$'s covariance is $\Sigma_V = I_k$,
$\mathrm{Cov}[v_t, v_{t-q} \mid \alpha, \beta, X] = 0$ for any $q \ge 1$, $\mathrm{Cov}[v_t, \varepsilon_{is} \mid \alpha, \beta, X] = 0$ for any $s, t$ and $i$, and $\varepsilon_{it}$'s variance $\sigma^2 = 1$. Then, $Z$ is an optimal treatment design if it satisfies
$$\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t-1-T}{T}, \quad \frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X, \quad \frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U, \quad \text{for all } t \tag{12}$$
for some $\mu_X \in \mathbb{R}^r$ and $\mu_U \in \mathbb{R}^k$, and if it satisfies the constraints in the integer program (4). Moreover, if $(X_i, u_i)$ can only take $G$ values, denoted as $(x_1, u_1), \cdots, (x_G, u_G)$, each with positive probability, we define $\omega_{g,t} = \frac{1}{|O_g|}\sum_{i \in O_g} z_{it}$, where $O_g = \{i : (X_i, u_i) = (x_g, u_g)\}$. Then any treatment design $Z$ is optimal if it satisfies the constraints in the integer program (4) and
$$\omega_{g,t} = \frac{2t-1-T}{T}, \quad \text{for all } t \text{ and } g. \tag{13}$$

Figure 1. Optimal treated proportion at every time period for a $T$-period treatment design problem. The legends with different colors denote the number of periods $T$. For example, the leftmost blue line represents the optimal treated proportion at time periods 1 and 2 for a two-period treatment design problem, while the rightmost grey line represents the optimal treated proportion at every time period for a 20-period treatment design problem.

Remark 5.
The dimensions of $X_i$ and $u_i$ (i.e., $r$ and $k$) can be zero. These correspond to the settings in which the model has only latent covariates ($r = 0$) or only observed covariates ($k = 0$). $0_{r,k}$ is a zero matrix with dimension $r \times k$.

Remark 6.
When there are latent covariates $u_i$ in the model, we need to know their values to stratify. If we have no information about $u_i$, we can only stratify over $X_i$, and each stratum has treated proportion $\frac{2t-1}{2T}$ (in this context, units in a stratum have the same $X_i$, but may have different $u_i$). In many applications, we have historical control data, which contains information about $u_i$. We propose an algorithm in Section 4 to mimic stratification using historical data.

The assumptions in Theorem 3 can hold without loss of generality. First, for the assumption $\sum_{i=1}^N X_i = \vec 0$ (i.e., $X$ is orthogonal to $\vec 1$) to hold, we can always use the Gram-Schmidt procedure (equivalent to a QR decomposition) to decompose $[\vec 1 \ X] = QR \in \mathbb{R}^{N \times (r+1)}$. Then the first column of $Q$ is proportional to $\vec 1$, and the remaining columns of $Q$ are orthogonal to $\vec 1$ and can be used as the observed covariates in model (11). We can use a similar procedure to project out the cross-sectional means of $u_i$ and $X_i u_i^\top$ so that the assumptions $\sum_{i=1}^N u_i = \vec 0$ and $\sum_{i=1}^N X_i u_i^\top = 0_{r,k}$ hold. Second, if $v_t$ has mean 0 and variance $\Sigma_V$ for some positive definite matrix $\Sigma_V$, we can right-multiply $u_i^\top$ by $\Sigma_V^{1/2}$ and left-multiply $v_t$ by $\Sigma_V^{-1/2}$ so that $v_t$ has variance $I_k$ (the conditions $\sum_{i=1}^N u_i = \vec 0$ and $\sum_{i=1}^N X_i u_i^\top = 0_{r,k}$ remain valid after this manipulation). Note that the assumptions on $U$ and $V$ are asymmetric; we provide a discussion in Remark 1. Moreover, we can multiply both the LHS and RHS of model (11) by $1/\sigma$ so that the variance of $\varepsilon_{it}$ is 1; the optimal treatment design does not change under this scaling. Moreover, we assume $u_i$ is deterministic and $v_t$ is random, following the literature on large-dimensional factor modeling (Bai and Ng 2002, Bai 2003). We provide additional discussion of the model and assumptions in Theorem 3 in Appendix A.2.

Theorem 3 states that when the covariates $X_i$ and $u_i$ take only finitely many values, we have an optimal treatment design. This treatment design has linear staggered rollouts with stratification. In other words, for each stratum, the group of units with the same covariate value, we randomize the treatment allocation over time such that the treated proportion is $\frac{2t-1}{2T}$ at time $t$. If we aggregate all strata, the overall treated proportion is $\frac{2t-1}{2T}$ at time $t$, which is the same as in Theorem 1. Table 3 provides an example of the optimal treatment design.

The complete proof is deferred to Appendix D; here we provide a high-level overview of the main steps.
First, via algebraic manipulations, we show that $\mathrm{Var}(\hat\tau) = \big(\vec z^\top(\Sigma_e^{-1} - \Sigma_e^{-1}\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top\Sigma_e^{-1})\vec z\big)^{-1}$, where $\Gamma$ is the more general (with covariates) version of the $\Gamma$ from Section 3.1 and $\Sigma_e$ is the covariance matrix of the vectorization of $e$. (In practice, if we do not know $\Sigma_e$, we can first run OLS and then use the residuals to estimate $\Sigma_e$; this yields a consistent estimator of $\Sigma_e$.) Second, we show that minimizing $\mathrm{Var}(\hat\tau)$ can be simplified to minimizing
$$\underbrace{(\vec\omega^{(1)})^\top P_{\vec 1}\,\vec\omega^{(1)} + 2\vec d^\top\vec\omega^{(1)}}_{(a)} + \underbrace{\sum_{j=2}^{r+1} (\vec\omega^{(j)})^\top P_{\vec 1}\,\vec\omega^{(j)}}_{(b)} + \underbrace{\vec z^\top M_U \vec z}_{(c)}$$
where $\omega^{(1)}_t = \frac{1}{N}\sum_{i=1}^N z_{it}$, $\omega^{(j+1)}_t = \frac{1}{N}\sum_{i=1}^N X_{ij} z_{it}$, $P_{\vec 1} = I_T - \frac{1}{T}\vec 1\vec 1^\top$, $d_t = \frac{T+1-2t}{T}$, and $M_U = P_{\vec 1} \otimes U(I_k + U^\top U)^{-1}U^\top$. If we can find a solution that separately minimizes (a), (b) and (c), then this solution minimizes their sum. Third, (a) is minimized when $Z$ satisfies $\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t-1-T}{T}$, following Theorem 1. (b) depends on the cross-sectional average of $z_{it}$ weighted by $X_i$ and is minimized when $Z$ satisfies $\frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X$, equivalently $\vec\omega^{(j+1)} = \mu_{X,j}\cdot\vec 1$, where $\mu_{X,j}$ is the $j$-th entry of $\mu_X$. (c) is minimized when $Z$ satisfies $\frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U$. Fourth, if $(X_i, u_i)$ can only take $G$ different values and there is a $Z$ that satisfies $\omega_{g,t} = \frac{2t-1-T}{T}$ for all $g$ and $t$, we plug this $Z$ into the LHS of the sufficient conditions (12) and verify that these conditions are satisfied.

Stratum $X_i$ (or $u_i$) $= 1$:     Stratum $X_i$ (or $u_i$) $= -1$:
  -1 -1                               -1 -1
  -1  1                               -1  1
  -1  1                               -1  1
   1  1                                1  1
(a) Two-way fixed effects with observed covariates $X_i$; (b) the same design with latent covariates $u_i$.

Table 3
An optimal treatment design with $N = 8$ and $T = 2$ under the model with either observed or latent covariates. In this toy example, there are two strata, $X_i$ (or $u_i$) $= 1$ and $X_i$ (or $u_i$) $= -1$.

In practice, we may not be able to find a feasible $Z$ that satisfies Eq. (13), because $\frac{|O_g|(2t-1)}{2T}$ may not be an integer for each stratum, in which case there does not exist a $Z$ that satisfies Eq. (13). In the case where we only have $X_i$ (no $u_i$), we use the same nearest integer rounding rule as in Section 3.1 to get a feasible treatment design $Z^{\mathrm{rnd}}$ for the integer program (4). The following theorem provides a guarantee for $Z^{\mathrm{rnd}}$.

Theorem 4.
Under the assumptions in Theorem 3 and $L = X\theta^\top$, let $Z^*$ be the optimal solution of the integer program (4) and $Z^{\mathrm{rnd}}$ be a feasible solution of the integer program (4) using the nearest integer rounding rule. Denote the variance of $\hat\tau$ using the treatment design $Z$ as $\mathrm{Var}_Z(\hat\tau)$, $x_{j,\max} = \max_g |x_{gj}|$, and $N_{\min} = \min_g |O_g|$. Assume $(1 + \sum_{j=1}^r x_{j,\max}^2)/N_{\min} < 1$; we have
$$\mathrm{Var}_{Z^{\mathrm{rnd}}}(\hat\tau) \le \frac{1}{1 - (1 + \sum_{j=1}^r x_{j,\max}^2)/N_{\min}}\,\mathrm{Var}_{Z^*}(\hat\tau).$$
Theorem 4 states that the nearest integer rounding rule outputs a feasible solution whose objective value is at most $1 + O(1/N_{\min})$ times the global optimum when $r$ and $x_{j,\max}$ are bounded. Since each realization of $X_i$ occurs with probability bounded away from 0, we have $1 + O(1/N_{\min}) = 1 + O(1/N)$ as $N \to \infty$. However, in finite samples, if $G$ is large compared with $N$, the feasible solution may not be close to the global optimum. In that case, we can use partition methods, such as K-means, to partition units into $G'$ groups based on their covariates' values; within each group, we can randomize the treatment allocation such that Eq. (13) is satisfied. If $X_i$ takes infinitely many values, we can use this method as well.

In some cases, the effect of a treatment carries over to multiple periods. Researchers often need to account for carryover effects when designing cross-over trials, which are widely used in medical and clinical research (Grizzle 1965, Wallenstein and Fisher 1977, Hills and Armitage 1979, Willan and Pater 1986, Freeman 1989, Senn 2002, Jones and Kenward 2014). The advertising effect on sales usually has a duration (Assmus et al. 1984), typically between six and nine months, as shown by Leone (1995).
In this section, we generalize model (1) and allow the treatment to have carryover effects,
$$Y_{it}(z_{it}) = \alpha_i + \beta_t + X_i\theta_t^\top + \tau_1 z_{it} + \tau_2 z_{i,t-1} + \cdots + \tau_{\ell+1} z_{i,t-\ell} + u_i^\top v_t + \varepsilon_{it}, \tag{14}$$
where $\varepsilon_{it} \overset{\text{i.i.d.}}{\sim} (0, \sigma^2)$. In this model, we use GLS to estimate $\tau_1, \cdots, \tau_{\ell+1}$, and our objective is to minimize the estimation variance of $\tau_1, \cdots, \tau_{\ell+1}$ by finding the optimal treatment design $Z$. We stack the observations $Y_{it}$ from time $\ell+1$ to $T$ together; in matrix notation,
$$\underbrace{\begin{bmatrix} \vec y_{\ell+1} \\ \vec y_{\ell+2} \\ \vdots \\ \vec y_T \end{bmatrix}}_{\vec y} = \underbrace{\begin{bmatrix} \vec z_1 & \vec z_2 & \cdots & \vec z_{\ell+1} \\ \vec z_2 & \vec z_3 & \cdots & \vec z_{\ell+2} \\ \vdots & \vdots & \ddots & \vdots \\ \vec z_{T-\ell} & \vec z_{T-\ell+1} & \cdots & \vec z_T \end{bmatrix}}_{Z} \underbrace{\begin{bmatrix} \tau_{\ell+1} \\ \vdots \\ \tau_1 \end{bmatrix}}_{\vec\tau} + \Gamma\begin{bmatrix}\alpha \\ \tilde\beta\end{bmatrix} + \underbrace{\begin{bmatrix} \vec e_{\ell+1} \\ \vec e_{\ell+2} \\ \vdots \\ \vec e_T \end{bmatrix}}_{\vec e},$$
where $\vec z_t = [z_{1t}, \cdots, z_{Nt}]^\top$, $\tilde\beta = [\tilde\beta_{\ell+1}^\top, \cdots, \tilde\beta_T^\top]^\top$, $\vec e_t = [e_{1t}, \cdots, e_{Nt}]^\top$, $\tilde\beta_t = [\beta_t \ \theta_t^\top]^\top$, $\vec y_t$ is the same as the $\vec y_t$ in Eq. (6), and $\Gamma$, now with $N(T-\ell)$ rows, is the more general (with covariates) version of the $\Gamma$ in Eq. (6).

Our objective is to minimize $\mathrm{Var}(\hat{\vec\tau})$, where
$$\mathrm{Var}(\hat{\vec\tau}) = \big(Z^\top\big(\Sigma_e^{-1} - \Sigma_e^{-1}\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top\Sigma_e^{-1}\big)Z\big)^{-1},$$
$\Sigma_e$ is the covariance matrix of the vectorization of $e = [e_{it}]$, and $e_{it} = u_i^\top v_t + \varepsilon_{it}$. Note that $\mathrm{Var}(\hat{\vec\tau})$ is a matrix, and minimizing a matrix is not a well-defined problem.
Instead, we consider minimizing $\mathrm{tr}(\mathrm{Var}(\hat{\vec\tau}))$, because the objects of interest are the $\mathrm{Var}(\hat\tau_l)$ (the diagonal entries of $\mathrm{Var}(\hat{\vec\tau})$) and minimizing $\mathrm{tr}(\mathrm{Var}(\hat{\vec\tau}))$ is a well-defined problem. A surrogate for minimizing $\mathrm{tr}(\mathrm{Var}(\hat{\vec\tau}))$ is maximizing $\mathrm{tr}(\mathrm{Var}(\hat{\vec\tau})^{-1})$, that is,
$$\max_Z \ \mathrm{tr}\big(Z^\top\big(\Sigma_e^{-1} - \Sigma_e^{-1}\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top\Sigma_e^{-1}\big)Z\big) \qquad \text{s.t. } z_{it} \le z_{i,t+1}, \ z_{it} \in \{-1, 1\}. \tag{15}$$
The optimal solution $Z$ of the integer program (15) is called the T (trace)-optimal treatment design. We also consider another objective, minimizing $\det(\mathrm{Var}(\hat{\vec\tau}))$, in Appendix A.3. The following theorem provides sufficient conditions for the optimal solution of the integer program (15), as well as the analytical optimal solution when the covariates take only finitely many values.

Theorem 5 (Carryover Effects with Observed and Latent Covariates). Suppose the potential outcome follows model (14), the assumptions in Theorem 3 hold, and
$T > \frac{\ell^3 + 13\ell^2 + 7\ell + 3}{8\ell}$. $Z$ is an optimal treatment design of the integer program (15) if it satisfies
$$\frac{1}{N}\sum_{i=1}^N z_{it} = \omega^*_t, \quad \frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X, \quad \frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U, \quad \text{for all } t \tag{16}$$
for some $\mu_X \in \mathbb{R}^r$ and $\mu_U \in \mathbb{R}^k$, and the constraints in the integer program (4), where $\vec\omega^* \in \mathbb{R}^T$ is defined by
$$\omega^*_t = \begin{cases} -1 & t \le \lfloor \ell/2 \rfloor \\ \big((A^{(\ell)})^{-1} b^{(\ell)}\big)_{t - \lfloor \ell/2 \rfloor} & \lfloor \ell/2 \rfloor < t \le \ell \\ \frac{2t - (\ell+1) - T}{T - \ell} & \ell < t \le T - \ell \\ -\big((A^{(\ell)})^{-1} b^{(\ell)}\big)_{T + 1 - \lfloor \ell/2 \rfloor - t} & T - \ell < t \le T - \lfloor \ell/2 \rfloor \\ 1 & T - \lfloor \ell/2 \rfloor < t \end{cases} \tag{17}$$
The definitions of $A^{(\ell)} \in \mathbb{R}^{(\ell - \lfloor \ell/2 \rfloor) \times (\ell - \lfloor \ell/2 \rfloor)}$ and $b^{(\ell)} \in \mathbb{R}^{\ell - \lfloor \ell/2 \rfloor}$ are provided in Eq. (22) and Eq. (23) in Appendix A.3. Moreover, if $(X_i, u_i)$ can only take $G$ values, each with positive probability, and $\omega_{g,t} = \frac{1}{|O_g|}\sum_{i \in O_g} z_{it}$ is defined as in Theorem 3, then any treatment design $Z$ is optimal if it satisfies the constraints in the integer program (4) and
$$\omega_{g,t} = \omega^*_t, \quad \text{for all } t \text{ and } g. \tag{18}$$

Remark 7.
Similar to Theorem 3, the dimensions of $X_i$ and $u_i$ (i.e., $r$ and $k$) can be zero.

Remark 8.
To provide a few concrete examples of $\omega^*_t$: if $\ell = 1$, Eq. (17) gives $\omega^*_t = \frac{2t - 2 - T}{T - 1}$ for $1 < t \le T - 1$, with the two endpoint values $\omega^*_1 = -\omega^*_T$ determined by the $1 \times 1$ system $A^{(1)}, b^{(1)}$. If $\ell = 2$, we have $\omega^*_1 = -1$, $\omega^*_t = \frac{2t - 3 - T}{T - 2}$ for $t = 3, \cdots, T-2$, $\omega^*_T = 1$, and $\omega^*_2 = -\omega^*_{T-1}$ determined by the scalar system $A^{(2)}, b^{(2)}$. We provide the expression of $\omega^*_t$ for $\ell = 3$ in Remark 16 in Appendix A.3. Figure 2 shows the optimal treated proportion for other $\ell$ in a $T = 10$ period problem.

Remark 9.
The assumption $T > \frac{\ell^3 + 13\ell^2 + 7\ell + 3}{8\ell}$ can potentially be relaxed. We verified using optimization software that Eq. (17) is optimal even when this assumption is not satisfied.

Figure 2. T-optimal design: optimal treated percentage at each period for a $T$-period treatment design in the presence of carryover treatment effects, where $T = 10$. Different colors represent the number of periods $\ell$ to which the treatment can carry over.

The treated proportion is convex at the beginning and concave towards the end. If the treatment has carryover effects over longer periods (i.e., larger $\ell$), the optimal treated proportion is smaller at the beginning and larger towards the end, as shown in Figure 2. There are five stages in the optimal solution from Theorem 5: in the first stage, all units are control units; in the second stage, $\omega^*_t$ grows non-linearly in time and is less than $-1 + \frac{\ell+1}{T-\ell}\cdot\frac{t - \lfloor \ell/2 \rfloor}{\lfloor \ell/2 \rfloor + 1}$ (implied by the assumption $T > \frac{\ell^3 + 13\ell^2 + 7\ell + 3}{8\ell}$); in the third stage, $\omega^*_t$ grows linearly in time; in the fourth stage, $\omega^*_t$ grows non-linearly in time again and is larger than $1 - \frac{\ell+1}{T-\ell}\cdot\frac{T + 1 - \lfloor \ell/2 \rfloor - t}{\lfloor \ell/2 \rfloor + 1}$ (implied by the same assumption); in the fifth stage, all units are treated. The optimal solution is symmetric with respect to the origin, because $f(\omega_1, \cdots, \omega_T) = f(-\omega_T, \cdots, -\omega_1)$, where $f(\omega_1, \cdots, \omega_T)$ is the objective function of the integer program (15).

Intuitively, treated proportions are larger at the end, compared with the direct-effect-only case in Theorem 3, because we want sufficiently many observations to precisely estimate the lagged treatment effects.
Treated proportions are smaller at the beginning because we only use the observed outcomes from $\ell + 1$ to $T$ to estimate $\vec\tau$, $\alpha$, and $\tilde\beta$, so increasing the treated proportion at the beginning can only marginally affect the efficiency.

The complete proof is deferred to Appendix E; here we provide an overview of the main steps. First, we start with the case where the potential outcome has no covariates (i.e., $r = k = 0$ and $\Sigma_e = I_{N(T-\ell)}$); maximizing the objective function of the integer program (15) is then equivalent to minimizing $\mathrm{tr}(Z^\top\Gamma(\Gamma^\top\Gamma)^{-1}\Gamma^\top Z)$. The corresponding convex relaxation is a sum of $\ell + 1$ quadratic functions,
$$\sum_{j=1}^{\ell+1}\Big[\vec\omega_{j:(T-\ell-1+j)}^\top P_{\vec 1}\,\vec\omega_{j:(T-\ell-1+j)} + 2\big(\vec d^{(j)}\big)^\top \vec\omega_{j:(T-\ell-1+j)}\Big], \tag{19}$$
where $\vec\omega_{j:(T-\ell-1+j)} = [\omega_j \cdots \omega_{T-\ell-1+j}]^\top$, $P_{\vec 1} = I - \frac{1}{T-\ell}\vec 1\vec 1^\top \in \mathbb{R}^{(T-\ell)\times(T-\ell)}$, $\vec d^{(j)} = [d^{(j)}_t] \in \mathbb{R}^{T-\ell}$ for all $j$, and the definition and constraints of $\omega_t$ are the same as those in the quadratic program (9). The details are presented in Lemma 4 in Appendix A.3. Second, we verify that $\omega^*$ in Eq. (17) satisfies the KKT conditions of the quadratic program with objective function (19); $\omega^*$ is therefore an optimal solution of this quadratic program. The details are presented in the proof of Theorem 5 in Appendix E. Third, we follow the same steps as in Theorem 3 to show that additional covariate conditions are needed when the potential outcome has covariates.
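As a quick consistency check (our own remark, not part of the proof), the five-stage solution (17) collapses to Theorem 1 when there are no carryover effects. For $\ell = 0$ we have $\lfloor \ell/2 \rfloor = 0$, so stages 1, 2, 4, and 5 are empty and the third stage covers $1 \le t \le T$ with
$$\omega^*_t = \frac{2t - (0+1) - T}{T - 0} = \frac{2t - 1 - T}{T},$$
which matches Eq. (10).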
4. Algorithms
In this section, we focus on the scenario where the potential outcome has latent covariates, which is common in practice,
$$Y(Z) = \alpha\vec 1^\top + \vec 1\beta^\top + UV^\top + \tau Z + \varepsilon. \tag{20}$$
For notational simplicity, we use $L$ to denote $UV^\top$ in this section. We propose algorithms to tackle two challenges:
1. How can we find the optimal treatment design in practice when covariates are unobserved?
2. Is there an alternative (potentially non-linear) estimation approach for $\tau$ that may have a smaller variance?
For the first challenge, in Section 4.1, we introduce a local search heuristic (motivated by the sufficient conditions in Theorem 3) that searches for a better treatment design by leveraging historical control data, which contains the latent covariates' information and is commonly available in practice. For the second challenge, in Section 4.2, we leverage advances in low-rank matrix estimation from the machine learning literature to introduce a non-linear estimator for $\tau$ that may be biased but can have a smaller variance. We refer to this estimator as low-rank matrix estimation with fixed effects (LRME).

Since the optimal treatment design depends on the values of the latent covariates (Theorem 3), historical control data with covariate information can help us find a better treatment design. To do so, we propose to (a) approximate $\mathrm{Var}(\hat\tau)$ using a robust surrogate when our estimation approach does not have a closed-form expression for $\mathrm{Var}(\hat\tau)$, and (b) find an approximate solution to the computationally challenging integer program of finding the optimal $Z$ that minimizes $\mathrm{Var}(\hat\tau)$.
We propose Algorithm 1, which addresses (a) and (b) as follows:

• For (a), we adopt a robust measure to approximate $\mathrm{Var}(\hat\tau)$ that consists of two phases:
(i) Split the historical control data into sub-matrices: we use a moving-window approach to split the historical control data $Y^{\mathrm{hist,All}}$ into $m$ sub-matrices $Y^{\mathrm{hist},(1)}, \cdots, Y^{\mathrm{hist},(m)}$, each with dimension $N \times T$.
(ii) We take a synthetic treatment effect $\tau$ (which can be 0) and apply a synthetic experiment to $Y^{\mathrm{hist},(j)}$ using treatment design $Z$. In more detail, we assume $Y^{\mathrm{hist},(j)}$ does not contain the intervention we study. We define the control outcome as $Y^{(j)}_{it}(-1) = Y^{\mathrm{hist},(j)}_{it}$, the treated outcome as $Y^{(j)}_{it}(1) = Y^{(j)}_{it}(-1) + \tau$, and the "observed" outcome as $Y^{(j)}_{it} = \frac{1+Z_{it}}{2}\cdot Y^{(j)}_{it}(1) + \frac{1-Z_{it}}{2}\cdot Y^{(j)}_{it}(-1)$ for every $i$, $t$ and $j$. We use $Y^{(j)}$ and $Z$ as the input to estimate $\tau$; the estimator is denoted $\hat\tau^{(j)}$. We calculate the maximum estimation error across all sub-matrices, $\max_j |\hat\tau^{(j)} - \tau|$, and use it as a robust surrogate for $\mathrm{Var}(\hat\tau)$.

• For (b), we start from an arbitrary treatment design $Z$ that satisfies $\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t-1-T}{T}$ and the constraints in the integer program (4). We then adopt the following local search heuristic (motivated by simulated annealing) to improve $Z$ until $Z$ converges or a maximum number of iterations is reached:
(i) Swap two random rows of $Z$ at every iteration and denote the new treatment design by $Z^{\mathrm{new}}$ (note that the cross-sectional average remains $\frac{2t-1-T}{T}$ at time period $t$).
(ii) Calculate the maximum estimation error $\max_j |\hat\tau^{(j)} - \tau|$ when treatment design $Z^{\mathrm{new}}$ is applied to $Y^{(1)}, \cdots, Y^{(m)}$. If $\max_j |\hat\tau^{(j)} - \tau|$ is reduced, replace $Z$ by $Z^{\mathrm{new}}$. Otherwise, keep $Z$ unchanged with probability $1 - p_{\mathrm{escape}}$, but still replace $Z$ by $Z^{\mathrm{new}}$ with probability $p_{\mathrm{escape}}$. Here, $p_{\mathrm{escape}}$ is a small positive number.
(iii) At the end of each iteration, define $Z^{\mathrm{opt}}$ to be the matrix $Z$ with the smallest $\max_j |\hat\tau^{(j)} - \tau|$ so far. Once the algorithm stops (the maximum number of iterations is reached or $Z$ has converged), return $Z^{\mathrm{opt}}$.

Remark 10.
We use the maximum estimation error instead of the mean-squared error because it is more robust to outliers.
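The row-swap search can be sketched as follows. This is our own simplification, not the paper's code: instead of the synthetic-experiment error, it uses a toy surrogate loss — the variation over $t$ of the $u$-weighted cross-sectional average, which the sufficient condition in Theorem 3 requires to be constant — and all names are illustrative.

```python
import random

def staggered_design(N, T):
    """z_it in {-1, +1}; unit i adopts treatment for a staggered number of periods."""
    Z = []
    for i in range(N):
        n_treat = round(i * T / N)  # rough stagger; the search then reassigns rows
        Z.append([-1] * (T - n_treat) + [1] * n_treat)
    return Z

def surrogate_loss(Z, u):
    """Toy stand-in for max_j |tau_hat^(j) - tau|: how far (1/N) sum_i u_i z_it
    is from being constant over t (the Theorem 3 sufficient condition)."""
    N, T = len(Z), len(Z[0])
    col = [sum(u[i] * Z[i][t] for i in range(N)) / N for t in range(T)]
    mean = sum(col) / T
    return sum((c - mean) ** 2 for c in col)

def local_search(Z, u, iters=1000, p_escape=0.02, seed=0):
    """Swap random rows; accept improvements, and worse moves with prob p_escape."""
    rng = random.Random(seed)
    cur = [row[:] for row in Z]
    cur_loss = surrogate_loss(cur, u)
    best, best_loss = [row[:] for row in cur], cur_loss
    for _ in range(iters):
        i, j = rng.sample(range(len(Z)), 2)
        cur[i], cur[j] = cur[j], cur[i]          # row swap: column sums unchanged
        new_loss = surrogate_loss(cur, u)
        if new_loss < cur_loss or rng.random() < p_escape:
            cur_loss = new_loss
            if new_loss < best_loss:
                best, best_loss = [row[:] for row in cur], new_loss
        else:
            cur[i], cur[j] = cur[j], cur[i]      # undo the swap
    return best, best_loss
```

Because swaps only permute rows, every candidate keeps the staggered column proportions of the starting design, mirroring Remark 11.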
Remark 11.
In the local search heuristic, when we swap rows of $Z$, the cross-sectional average does not change. This guarantees that one of the sufficient conditions for the optimal solution in Theorem 3 (i.e., $\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t-1-T}{T}$) is always satisfied. The goal of the algorithm is therefore to find a treatment design $Z$ that satisfies the other sufficient condition (i.e., $\frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U$).

Remark 12.
In the local search heuristic, we replace $Z$ by $Z^{\mathrm{new}}$ with some small probability $p_{\mathrm{escape}}$, following the idea of simulated annealing, to escape local minima that may trap $Z$.

Instead of incorporating $L$ into the error term, we can estimate $L$ jointly with $\tau$ using the following objective function,
$$\hat\tau, \hat\alpha, \hat\beta, \hat L = \arg\min_{\tau,\alpha,\beta,L} \ \frac{1}{NT}\Big\|Y - \alpha\vec 1^\top - \vec 1\beta^\top - L - \tau Z\Big\|_F^2 + \mu\|L\|_*, \tag{21}$$
denoted as low-rank matrix estimation with fixed effects (LRME). Here, $\|L\|_*$ is the nuclear norm (or trace norm) of the matrix $L$, which is equal to the sum of its singular values, and $\|\cdot\|_F$ refers to the Frobenius norm of a matrix.

Algorithm 1:
A data-driven local search heuristic for $Z$

Inputs: $Y^{(1)}_{it}(-1), \cdots, Y^{(m)}_{it}(-1)$, $Z$, $\tau$, $t_{\mathrm{init}}$, $t_{\mathrm{min}}$, $\upsilon$, $\mathrm{steps}_{\max}$
$Z^{\mathrm{current}} \leftarrow Z$; $Y^{(j)}_{it}(1) \leftarrow Y^{(j)}_{it}(-1) + \tau$ and $Y^{(j)}_{it} \leftarrow \frac{1+Z^{\mathrm{current}}_{it}}{2}\cdot Y^{(j)}_{it}(1) + \frac{1-Z^{\mathrm{current}}_{it}}{2}\cdot Y^{(j)}_{it}(-1)$ for all $i$, $t$ and $j$;
for $j = 1, \cdots, m$ do: estimate $\hat\tau^{(j)}, \hat\alpha^{(j)}, \hat\beta^{(j)}$ from $Y^{(j)}$ by OLS/GLS/LRME; end
$E^{\mathrm{current}} \leftarrow \max_j |\hat\tau^{(j)} - \tau|$; $E^{\mathrm{opt}} \leftarrow E^{\mathrm{current}}$; $Z^{\mathrm{opt}} \leftarrow Z^{\mathrm{current}}$; $t \leftarrow t_{\mathrm{init}}$; $i_{\mathrm{step}} \leftarrow 0$;
while $t > t_{\mathrm{min}}$ and $i_{\mathrm{step}} < \mathrm{steps}_{\max}$ do:
  randomly select distinct $i_1, i_2 \in [N]$; $Z^{\mathrm{new}} \leftarrow Z^{\mathrm{current}}$; $Z^{\mathrm{new}}_{i_1,:} \leftarrow Z^{\mathrm{current}}_{i_2,:}$; $Z^{\mathrm{new}}_{i_2,:} \leftarrow Z^{\mathrm{current}}_{i_1,:}$; $i_{\mathrm{step}} \leftarrow i_{\mathrm{step}} + 1$;
  $Y^{(j)}_{it} \leftarrow \frac{1+Z^{\mathrm{new}}_{it}}{2}\cdot Y^{(j)}_{it}(1) + \frac{1-Z^{\mathrm{new}}_{it}}{2}\cdot Y^{(j)}_{it}(-1)$ for all $i$, $t$ and $j$;
  for $j = 1, \cdots, m$ do: estimate $\hat\tau^{(j)}, \hat\alpha^{(j)}, \hat\beta^{(j)}$ from $Y^{(j)}$ by OLS/GLS/LRME; end
  $E^{\mathrm{new}} \leftarrow \max_j |\hat\tau^{(j)} - \tau|$; $dE \leftarrow E^{\mathrm{new}} - E^{\mathrm{current}}$;
  if $E^{\mathrm{new}} < E^{\mathrm{opt}}$ then $E^{\mathrm{current}} \leftarrow E^{\mathrm{new}}$, $Z^{\mathrm{current}} \leftarrow Z^{\mathrm{new}}$, $Z^{\mathrm{opt}} \leftarrow Z^{\mathrm{new}}$;
  else if $\exp(-dE/t) > \mathrm{random}(0,1)$ then $E^{\mathrm{current}} \leftarrow E^{\mathrm{new}}$, $Z^{\mathrm{current}} \leftarrow Z^{\mathrm{new}}$;
  $t \leftarrow t \times \upsilon$;
end
Outputs: $Z^{\mathrm{opt}}$

If the regularization parameter $\mu$ is larger, the rank of $\hat L$ tends to be smaller. It is known in the matrix estimation literature that such a bias in estimating $L$ can lead to a lower variance (Candes and Recht 2009, Candes and Plan 2010). If $\mathrm{Var}(\hat L)$ is lower, $\mathrm{Var}(\hat\tau)$ is likely to be smaller. We compare LRME with the BLUE estimator through synthetic experiments in Section 5.

The objective function (21) is convex in $\tau, \alpha, \beta$ and $L$, and has $NT + N + T + 1$ variables in total. Finding the global optimal solution of the convex program (21) can be slow with off-the-shelf software for convex optimization problems such as cvxpy. Alternatively, we propose the iterative singular value thresholding and ordinary least squares (iterative SVT and OLS) algorithm to solve the convex program (21) efficiently. The details of this algorithm are described in Algorithm
2. We can justify the SVT step using Theorem 1 of Hastie et al. (2015), which shows that the optimal solution of
$$\hat L = \arg\min_{\mathrm{rank}(L) \le k} \ \frac12\|Y - L\|_F^2 + \mu\|L\|_*$$
is $\hat L = U_k S_\mu(D_k) V_k^\top$, where the rank-$k$ SVD of $Y$ is $U_k D_k V_k^\top$ and $S_\mu(D_k)$ is a $k \times k$ diagonal matrix with diagonal entries $(\sigma_1 - \mu)_+, \cdots, (\sigma_k - \mu)_+$. When we have historical control data, we can use cross-validation to find the optimal $\mu$ by grid search. The details are provided in Appendix B.1.

Algorithm 2:
Iterative SVT and OLS
Inputs: $Y$, $Z$, $k$, $\mu_{NT}$, $\Delta_\tau$, and $t_{\max}$
$\hat\tau^{(-1)} \leftarrow \infty$; $t \leftarrow 0$; $\hat\tau^{(0)}, \hat\alpha^{(0)}, \hat\beta^{(0)} \leftarrow \arg\min_{\tau,\alpha,\beta} \big\|Y - \alpha\vec 1^\top - \vec 1\beta^\top - \tau Z\big\|_F^2$;
$\hat Y^{(0)}_e \leftarrow Y - \hat\alpha^{(0)}\vec 1^\top - \vec 1(\hat\beta^{(0)})^\top - \hat\tau^{(0)} Z$;
while $|\hat\tau^{(t)} - \hat\tau^{(t-1)}| > \Delta_\tau$ and $t < t_{\max}$ do:
  compute the rank-$k$ SVD of $\hat Y^{(t)}_e$, $U^{(t)}_k D^{(t)}_k (V^{(t)}_k)^\top$, where $D^{(t)}_k = \mathrm{diag}(d^{(t)}_1, \cdots, d^{(t)}_k)$;
  $S_{\mu_{NT}}(D^{(t)}_k) \leftarrow \mathrm{diag}\big((d^{(t)}_1 - \mu_{NT})_+, \cdots, (d^{(t)}_k - \mu_{NT})_+\big)$;
  $\hat L^{(t+1)} \leftarrow U^{(t)}_k S_{\mu_{NT}}(D^{(t)}_k)(V^{(t)}_k)^\top$;
  $\hat\tau^{(t+1)}, \hat\alpha^{(t+1)}, \hat\beta^{(t+1)} \leftarrow \arg\min_{\tau,\alpha,\beta} \big\|Y - \alpha\vec 1^\top - \vec 1\beta^\top - \tau Z - \hat L^{(t+1)}\big\|_F^2$;
  $\hat Y^{(t+1)}_e \leftarrow Y - \hat\alpha^{(t+1)}\vec 1^\top - \vec 1(\hat\beta^{(t+1)})^\top - \hat\tau^{(t+1)} Z$;
  $t \leftarrow t + 1$;
end
Outputs: $\hat\tau^{(t)}, \hat\alpha^{(t)}, \hat\beta^{(t)}, \hat L^{(t)}$
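The iteration can be sketched compactly in numpy. This is our own toy implementation, not the paper's code: the split of the grand mean between $\hat\alpha$ and $\hat\beta$ is one arbitrary normalization, and all defaults are illustrative.

```python
import numpy as np

def demean2(M):
    """Two-way (unit and time) demeaning."""
    return M - M.mean(1, keepdims=True) - M.mean(0, keepdims=True) + M.mean()

def fit_fixed_effects(Y, Z):
    """OLS for tau, alpha, beta in Y ~ alpha 1' + 1 beta' + tau Z (within estimator)."""
    Zt, Yt = demean2(Z), demean2(Y)
    tau = (Zt * Yt).sum() / (Zt * Zt).sum()
    R = Y - tau * Z
    alpha = R.mean(1, keepdims=True) - R.mean() / 2  # arbitrary split of grand mean
    beta = R.mean(0, keepdims=True) - R.mean() / 2
    return tau, alpha, beta

def svt(M, mu, k):
    """Rank-k SVD followed by soft-thresholding of the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s[:k] - mu, 0.0)
    return (U[:, :k] * s) @ Vt[:k]

def lrme(Y, Z, k=2, mu=1.0, max_iter=100, tol=1e-10):
    """Iterative SVT and OLS: alternate fixed-effects OLS and SVT of the residual."""
    L = np.zeros_like(Y)
    tau_prev = np.inf
    for _ in range(max_iter):
        tau, alpha, beta = fit_fixed_effects(Y - L, Z)
        L = svt(Y - alpha - beta - tau * Z, mu, k)  # SVT of the residual matrix
        if abs(tau - tau_prev) < tol:
            break
        tau_prev = tau
    return tau, L
```

On data with no factor component, a moderate $\mu$ shrinks $\hat L$ to zero and the estimator reduces to the two-way fixed-effects OLS.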
5. Empirical Study: Reducing Influenza Occurrence Rate
Influenza is primarily a community-based infection that is transmitted in households and community settings. The CDC estimates that influenza has resulted in 9 million to 45 million illnesses, 140,000 to 810,000 hospitalizations, and 12,000 to 61,000 deaths annually since 2010. The best way to prevent infectious diseases such as seasonal flu is vaccination. However, in the absence of a vaccine, as in the COVID-19 pandemic, other interventions such as social distancing or wearing a face cover are recommended. An important and challenging question is: what is the efficacy of such preventive measures, or of other interventions in different environments such as schools, workplaces, and businesses, in slowing down the spread of the disease and reducing its substantial burden on society?
The best way to answer this question is to run pilot programs or experiments. Since flu is contagious, we must randomize the experiments at an aggregate geographical level, such as the Metropolitan Statistical Area (MSA), to minimize spillover effects. The tradeoff is that we then have a smaller number of units, and therefore lower statistical power to detect the treatment effect. Moreover, the treatment effect is generally small compared with the natural fluctuations of the flu occurrence rate, so statistical power is critical for detecting the effect of new interventions. Hence, our solution in Section 3 and our algorithm in Section 4 are especially relevant in this setting for evaluating the effect of new interventions.
In this study, we focus on the MarketScan Research Databases (formerly called Truven), which contain longitudinal, patient-level medical claim data for healthcare research. The databases contain data on continuous periods of enrollment per individual, and inpatient and outpatient claim records from the beginning of 2007 to mid-2017. We focus on the inpatient admission records and outpatient service records where the primary diagnosis (DX1) is influenza according to ICD-9-CM diagnosis codes (the databases only have claim records with ICD-9 diagnosis codes). Overall, the data contains 21,277 inpatient admissions and 9,678,572 outpatient records with primary diagnosis influenza, indicating that patients are not usually admitted to hospitals for influenza. We consider this combined data for the analysis. We aggregate the influenza records by MSA and month. Furthermore, we aggregate individual enrollment data by MSA and month to get the total number of enrolled patients by MSA and month. We divide the number of influenza visits by the number of enrolled patients to obtain the flu visit occurrence rate by MSA and month. Due to the seasonal nature of the flu, we focus on the months from October to April, when flu is active and there is a relatively high flu visit occurrence rate. We further focus on the period between October 2007 and April 2015 because outside this period we have only a few observations, and the flu occurrence rate is 0 for most MSAs. We obtain a panel of 185 MSAs over 56 months, where all MSAs have a positive occurrence rate every month.
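The panel construction above amounts to a count-and-divide over (MSA, month) cells. A toy sketch with made-up records follows (the actual MarketScan schema and field names differ; everything here is illustrative):

```python
from collections import defaultdict

# Hypothetical per-claim records: one (msa, month) tuple per influenza visit.
flu_visits = [
    ("MSA_A", "2008-01"), ("MSA_A", "2008-01"), ("MSA_B", "2008-01"),
]
# Hypothetical enrollment counts per (msa, month) cell.
enrollment = {
    ("MSA_A", "2008-01"): 1000,
    ("MSA_B", "2008-01"): 500,
}

# Aggregate influenza records by (MSA, month).
counts = defaultdict(int)
for key in flu_visits:
    counts[key] += 1

# Divide visits by enrolled patients to get the occurrence rate per cell.
occurrence_rate = {k: counts[k] / enrollment[k] for k in enrollment}
# e.g. occurrence_rate[("MSA_A", "2008-01")] == 0.002
```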
Remark 13.
In addition to the flu occurrence problem, we also apply our treatment design algorithm to two other data sets (home medical visits and purchases from a large grocery store). The secondary and tertiary analyses are presented in Appendix F.
In this section, we present the models and estimation methods, the treatment designs, how we run the synthetic experiments, the metrics used to evaluate treatment designs, and the specification used in the synthetic experiments. (In ICD-9-CM, the following diagnosis codes indicate influenza: 488, 487.0, 487.1, 487.8, 488.0, 488.1, 488.01, 488.02, 488.09, 488.11, 488.12, 488.19, 488.81, 488.82, and 488.89.)

Model Assumptions and Estimation Methods.
Since we do not know the data generating process of the flu occurrence data, we consider two models when the treatment only has a direct effect: one is the two-way fixed effect model with direct effect only, $Y_{it}(z_{it}) = \alpha_i + \beta_t + \tau z_{it} + \varepsilon_{it}$, and the other is the latent factor model with direct effect only, $Y_{it}(z_{it}) = \alpha_i + \beta_t + u_i^\top v_t + \tau z_{it} + \varepsilon_{it}$. Moreover, we consider two models with carryover treatment effects: the two-way fixed effect model with carryover effects, $Y_{it}(z_{it}) = \alpha_i + \beta_t + \tau_1 z_{it} + \cdots + \tau_{\ell+1} z_{i,t-\ell} + \varepsilon_{it}$, and the latent factor model with carryover effects, $Y_{it}(z_{it}) = \alpha_i + \beta_t + u_i^\top v_t + \tau_1 z_{it} + \cdots + \tau_{\ell+1} z_{i,t-\ell} + \varepsilon_{it}$. For the two-way fixed effect model, we use OLS to estimate $\tau$ (or $\tau_1, \cdots, \tau_{\ell+1}$ for the model with carryover effects). For the latent factor model, we use two methods to estimate $\tau$ (or $\tau_1, \cdots, \tau_{\ell+1}$): one is feasible GLS, and the other is LRME, presented in Section 4.2.
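As a sketch of how the two-way fixed effect OLS estimate of $\tau$ can be computed (a hypothetical minimal implementation, not the paper's code; the latent factor GLS and LRME estimators are more involved):

```python
import numpy as np

def estimate_tau_ols(Y, Z):
    """OLS estimate of tau in Y_it = alpha_i + beta_t + tau * z_it + eps_it.

    Y and Z are N x T arrays; Z has entries in {-1, +1}.
    """
    N, T = Y.shape
    # Design matrix: unit dummies, time dummies (dropping the first time
    # dummy to avoid collinearity with the unit dummies), treatment column.
    unit = np.kron(np.eye(N), np.ones((T, 1)))          # (N*T) x N
    time = np.kron(np.ones((N, 1)), np.eye(T))[:, 1:]   # (N*T) x (T-1)
    X = np.hstack([unit, time, Z.reshape(-1, 1)])
    coef, *_ = np.linalg.lstsq(X, Y.reshape(-1), rcond=None)
    return coef[-1]  # tau is the last coefficient
```

With a staggered design, the treatment column is not collinear with the unit and time dummies, so $\tau$ is identified.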
Treatment Designs.
We compare the different treatment designs via synthetic experiments. Denote by $\omega_t = \frac{1}{N}\sum_{i=1}^N z_{it}$ the cross-sectional average of the treatment design $Z$ at time $t$.
1. Benchmark treatment designs:
(a) $Z^{\mathrm{FF}}$ (fifty-fifty): $Z^{\mathrm{FF}}$ satisfies $\omega_t = 0$ for all $t$, and is optimal when the potential outcome only has time fixed effects, as shown in Lemma 2. (In this case $\vec{z}$ lies in the span of $\Gamma$ in model (5), so $\tau$ cannot be uniquely identified; hence, we randomly select one more unit as control in the first time period and one more unit as treated in the last time period to uniquely identify $\tau$.)
(b) $Z^{\mathrm{BA}}$ (before-after): $Z^{\mathrm{BA}}$ satisfies $\omega_t = -1$ for $t < \frac{T+1}{2}$ and $\omega_t = 1$ for $t > \frac{T+1}{2}$, and is optimal when the potential outcome only has unit fixed effects, as shown in Lemma 3. (In this case $\vec{z}$ again lies in the span of $\Gamma$ defined in model (5), so $\tau$ cannot be uniquely identified; hence, we randomly select one more unit as treated in time period $T/2$ and one more unit as control in time period $T/2 + 1$ to uniquely identify $\tau$.)
(c) $Z^{\mathrm{FF+BA}}$ (fifty-fifty with before-after): $Z^{\mathrm{FF+BA}}$ satisfies $\omega_t = -1$ for $t < \frac{T+1}{2}$ and $\omega_t = 0$ for $t \ge \frac{T+1}{2}$, implying that half of the units switch from control to treated midway through the experiment.
2. $Z^{\mathrm{OPT}}$: $Z^{\mathrm{OPT}}$ satisfies $\omega_t = \frac{2t-1-T}{T}$ for all $t$, and is optimal when the potential outcome has both unit and time fixed effects and there are no covariates, as shown in Theorem 3.
3. $Z^{K=2}$ (or $Z^{K=3}$; partition and stratification): the treatment design that satisfies $\omega_t = \frac{2t-1-T}{T}$ within each group, where units are partitioned into $K$ groups based on their values in the largest singular vectors estimated from the historical data.
4. $Z^{\mathrm{OPT+}}$: the treatment design returned by Algorithm 1, where the input of Algorithm 1 is $Z^{\mathrm{OPT}}$.
(GLS is the BLUE estimator but is infeasible; we provide the details to implement feasible GLS in Appendix B.2.)
5. $Z^{\mathrm{OPT\text{-}CO}}$: $Z^{\mathrm{OPT\text{-}CO}}$ satisfies $\omega_t = \omega_t^*$ for all $t$, where $\omega_t^*$ is defined in Eq. (17) and is nonlinear in $t$. $Z^{\mathrm{OPT\text{-}CO}}$ is optimal when the treatment has carryover effects over $\ell$ lagged periods and there are no covariates, as shown in Theorem 5.
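To illustrate the staggered structure, a minimal sketch that constructs a design whose cross-sectional average tracks the linear target of $Z^{\mathrm{OPT}}$, assuming the target is $\omega_t = (2t - 1 - T)/T$ (consistent with Remark 15); the adoption order would be randomized in practice:

```python
import numpy as np

def staggered_design(N, T):
    """Build Z in {-1,+1}^{N x T} whose cross-sectional average
    omega_t = (1/N) sum_i z_it tracks the linear target (2t - 1 - T)/T.

    Units adopt treatment in a fixed order and never revert,
    so z_it <= z_{i,t+1} holds for every unit.
    """
    Z = -np.ones((N, T))
    for t in range(1, T + 1):
        omega = (2 * t - 1 - T) / T            # target cross-sectional average
        n_treated = int(round(N * (1 + omega) / 2))
        Z[:n_treated, t - 1] = 1               # earliest-listed units adopt first
    return Z
```

Because the target $\omega_t$ increases in $t$, the number of treated units grows over time, producing the staggered rollout pattern; rounding makes the realized $\omega_t$ differ from the target by at most $1/N$.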
Synthetic Treatment Effect.
The original data does not contain any specific treatment that we can study. We first consider the model with only a direct treatment effect. Entry $(i,t)$ of the original data is $Y_{it}(-1)$, and if entry $(i,t)$ is hypothetically assigned to the treatment group, i.e., $z_{it} = 1$, we define $Y_{it}(1) = Y_{it}(-1) + \tau$. Therefore, the final input to the different estimation methods is the matrix $Y(Z)$ as well as $Z$ itself. The value of $\tau$ and the counterfactual values of each $Y(Z)$ are hidden from the estimators. The true value of $\tau$ will later be used to assess the performance (RMSE) of the various treatment designs and estimation methods. Moreover, we consider a hypothetical intervention with synthetic carryover treatment effects $\tau_1, \tau_2, \cdots, \tau_{\ell+1}$ as defined in model (14). We use $Y_{it}((z_{i,t-\ell}, \cdots, z_{i,t}))$ to denote the potential outcome when the past $\ell+1$ periods' treatment statuses are $z_{i,t-\ell}, \cdots, z_{i,t}$. Entry $(i,t)$ of the original data is $Y_{it}((-1, \cdots, -1, -1))$. If unit $i$ switches from the control to the treated group at period $t$, i.e., $z_{i,t-1} = -1$ and $z_{it} = 1$, we define $Y_{it}((-1, \cdots, -1, 1)) = Y_{it}((-1, \cdots, -1, -1)) + \tau_1$; if unit $i$ switches from the control to the treated group at period $t - s$ ($1 \le s < \ell$), we define $Y_{it}((-1, \cdots, 1, \cdots, 1)) = Y_{it}((-1, \cdots, -1, \cdots, -1)) + \tau_1 + \tau_2 + \cdots + \tau_{s+1}$; if unit $i$ switches from the control to the treated group at period $t - s$ ($s \ge \ell$), we define $Y_{it}((1, \cdots, 1)) = Y_{it}((-1, \cdots, -1)) + \tau_1 + \tau_2 + \cdots + \tau_{\ell+1}$.

Evaluation Metrics.
We randomly select $m$ blocks, each with dimension $N \times \tilde{T}$. For each block, the first $\tilde{T} - T$ time periods are the historical control data, and we apply synthetic experiments to the last $N \times T$ time periods. We split the historical control data into $k$ matrices, where two adjacent matrices may have overlapping time periods. We use these $k$ matrices as the input for Algorithm 1. In the experiment with direct treatment effect only, for each treatment design $Z$ and each of the $m$ blocks, the estimated treatment effect is denoted $\hat{\tau}_Z^{(j)}$, where $j = 1, \cdots, m$. We report the following metrics for each treatment design $Z$:
$$\mathrm{RMSE}_Z = \Bigg( \frac{1}{m} \sum_{j=1}^m \big( \hat{\tau}_Z^{(j)} - \tau \big)^2 \Bigg)^{1/2}, \qquad \bar{\tau}_Z = \frac{1}{m} \sum_{j=1}^m \hat{\tau}_Z^{(j)}, \qquad \mathrm{Var}(\hat{\tau}_Z) = \frac{1}{m} \sum_{j=1}^m \big( \hat{\tau}_Z^{(j)} - \bar{\tau}_Z \big)^2.$$
In the experiment with carryover effects, the estimated direct and lagged treatment effects are denoted $\hat{\tau}_{Z,i}^{(j)}$ for $i = 1, \cdots, \ell+1$ and $j = 1, \cdots, m$. We report the RMSE across all treatment effect parameters:
$$\mathrm{RMSE}_{Z,\mathrm{Carryover}} = \Bigg( \frac{1}{m(\ell+1)} \sum_{j=1}^m \sum_{l=1}^{\ell+1} \big( \hat{\tau}_{Z,l}^{(j)} - \tau_l \big)^2 \Bigg)^{1/2}.$$

Specification.
We choose the number of random blocks $m = 500$ and the treatment effect $\tau = -0.01\%$ in the experiment with direct treatment effect only. Note that the median occurrence rate is 0.0714%, so the treatment effect is about 14% of the median occurrence rate. There are 61 MSAs with at least 0.01% occurrence rate in every month; our synthetic experiments are implemented on these 61 MSAs. Moreover, in the experiment with carryover effects, we consider a prevention policy that can affect the outcome in the current period and two periods into the future, with negative treatment effects $[\tau_1, \tau_2, \tau_3]$. We use $T = 7$, starting the experiment in October and ending in April. We use $N = 25$ and $N = 50$.

In this section, we demonstrate three main findings from the synthetic experiments on the flu data, as shown in Tables 4-5 and Figure 3, and on the home medical visit data and grocery data, as shown in Tables 6-9 and Figures 5-6 in Appendix F:
1. The linear staggered treatment design $Z^{\mathrm{OPT}}$ can significantly outperform all benchmark treatment designs.
2. We can use historical data to search for a better treatment design than $Z^{\mathrm{OPT}}$, for example, using our data-driven local search heuristic.
3. When the treatment has carryover effects, the nonlinear staggered treatment design $Z^{\mathrm{OPT\text{-}CO}}$ outperforms the linear treatment design $Z^{\mathrm{OPT}}$.

Linear Staggered Treatment Design Outperforms Benchmark Treatment Designs.
We first compare the RMSE of the optimal treatment design $Z^{\mathrm{OPT}}$ with the benchmark treatment designs $Z^{\mathrm{FF}}$, $Z^{\mathrm{BA}}$, and $Z^{\mathrm{FF+BA}}$ for all estimation methods in Table 4. Figure 3 shows the bias and variance of the benchmark treatment designs and $Z^{\mathrm{OPT}}$ on the flu data. Tables 6 and 8 and Figures 5 and 6 in Appendix F show the corresponding results for the home medical visit data and grocery data. We have the following findings:
1. $Z^{\mathrm{OPT}}$ can consistently outperform benchmark treatment designs for all estimation methods.
Table 4
Inpatient and outpatient flu visit rates: This table compares the RMSE based on $m = 500$ randomly sampled blocks for benchmark treatment designs, $Z^{\mathrm{OPT}}$, $Z^{K=2}$, $Z^{K=3}$, $Z^{\mathrm{OPT+}}$, and "Hist Winner" (the estimation method with minimax estimation error on historical matrices). In the synthetic experiments, the intervention only has a direct treatment effect and we choose the treatment effect $\tau = -0.01\%$. The left table shows that $Z^{\mathrm{OPT}}$ outperforms benchmark treatment designs. The right table shows that we can find a better treatment design $Z^{\mathrm{OPT+}}$ via historical data and our data-driven local search algorithm. The optimal estimation method found on historical data has similar performance to GLS and LRME.
Table 5
Inpatient and outpatient flu visit rates: This table compares the RMSE based on $m = 500$ randomly sampled blocks for $Z^{\mathrm{OPT}}$ and $Z^{\mathrm{OPT\text{-}CO}}$. In the synthetic experiments, the intervention has direct and carryover treatment effects, with negative treatment effects $[\tau_1, \tau_2, \tau_3]$. The nonlinear staggered design $Z^{\mathrm{OPT\text{-}CO}}$ outperforms the linear staggered design $Z^{\mathrm{OPT}}$.
2. The latent factor model, estimated by either GLS or LRME, outperforms the two-way fixed effect model estimated by OLS on all data sets. In other words, GLS and LRME are better estimation methods than OLS.
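For concreteness, the evaluation metrics reported in these comparisons (RMSE, average estimate, and variance across the $m$ blocks, as defined in the Evaluation Metrics paragraph) can be computed as in this minimal sketch:

```python
import numpy as np

def design_metrics(tau_hats, tau):
    """RMSE, mean, and variance of the estimates across m synthetic blocks."""
    tau_hats = np.asarray(tau_hats, dtype=float)
    rmse = np.sqrt(np.mean((tau_hats - tau) ** 2))
    tau_bar = tau_hats.mean()
    var = np.mean((tau_hats - tau_bar) ** 2)
    return rmse, tau_bar, var

def carryover_rmse(tau_hats, taus):
    """RMSE pooled over all l+1 effect parameters; tau_hats is m x (l+1)."""
    tau_hats = np.asarray(tau_hats, dtype=float)
    return np.sqrt(np.mean((tau_hats - np.asarray(taus)) ** 2))
```

Note that the RMSE combines bias and variance, so a design can have a similar bias but a much smaller RMSE if its variance is smaller.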
Remark 14.
In practice, we can use cross-validation to select the best estimation method using historical control data. Specifically, we use leave-one-(historical-matrix)-out cross-validation: first, apply a synthetic treatment to every historical matrix and estimate $\tau$ by OLS, GLS, and LRME; second, calculate the maximum estimation error of $\tau$ across all historical matrices for each method; third, select the estimation method with the smallest maximum estimation error. We use this method to estimate the treatment effect on the experimental data. We find that the RMSE from this approach is very close to the optimal RMSE, as shown in Table 4.

An Even Better Treatment Design Found via Historical Data.
Suppose the potential outcome has either observed or latent covariates. In that case, Theorem 3 shows that the treated proportion of $Z^{\mathrm{OPT}}$ satisfies one of the two sufficient conditions for the optimal treatment design. We can apply Algorithm 1 or K-means to the historical data, which contains information about the covariates, to search for a better treatment design that mimics stratification and satisfies the other sufficient condition.

Table 4 presents the RMSE of $Z^{\mathrm{OPT}}$, of $Z^{K=2}$ and $Z^{K=3}$ returned from K-means, and of $Z^{\mathrm{OPT+}}$ returned from Algorithm 1 for all estimation methods. Figure 3 shows the bias and variance of these four treatment designs on the flu data. The corresponding results for the other two data sets are presented in Tables 6 and 8 and Figures 5 and 6 in Appendix F. For all estimation methods and all data sets, $Z^{\mathrm{OPT+}}$, $Z^{K=2}$, and $Z^{K=3}$ have a smaller RMSE than $Z^{\mathrm{OPT}}$. Moreover, in general, $Z^{\mathrm{OPT+}}$ has a smaller RMSE than $Z^{K=2}$ and $Z^{K=3}$, especially on the medical home visit and grocery data. In empirical applications, we may not know how many groups to partition units into, i.e., whether $K = 2$, $3$, or some larger number. Thus, we suggest using $Z^{\mathrm{OPT+}}$.

Nonlinear Staggered Treatment Design Outperforms When the Treatment Has Carryover Effects.
When the treatment can affect multiple periods, Theorem 5 shows that the optimal treated proportion is nonlinear in time and equals the treated proportion of $Z^{\mathrm{OPT\text{-}CO}}$. $Z^{\mathrm{OPT\text{-}CO}}$ can consistently outperform $Z^{\mathrm{OPT}}$ for all estimation methods, as shown in Table 5. Moreover, GLS is a better estimation method than OLS and LRME. The findings are robust across the other two data sets, as shown in Tables 7 and 9.
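The minimax method-selection rule of Remark 14 can be sketched as follows (a hypothetical implementation with stand-in estimators; `design_fn` and the estimator interface are illustrative assumptions, not the paper's code):

```python
import numpy as np

def select_estimator(hist_matrices, estimators, design_fn, tau=1.0):
    """Minimax selection of an estimation method on historical control data.

    hist_matrices: list of N x T control panels.
    estimators: dict mapping name -> fn(Y, Z) returning an estimate of tau.
    design_fn(N, T): returns a {-1,+1} treatment design.
    Returns the method whose worst-case |tau_hat - tau| across the
    historical matrices is smallest.
    """
    worst = {name: 0.0 for name in estimators}
    for Y0 in hist_matrices:
        N, T = Y0.shape
        Z = design_fn(N, T)
        Y = Y0 + tau * (Z == 1)  # overlay a synthetic direct effect
        for name, est in estimators.items():
            worst[name] = max(worst[name], abs(est(Y, Z) - tau))
    return min(worst, key=worst.get)
```

In the paper's setting, the candidate estimators would be OLS, feasible GLS, and LRME, and the selected method ("Hist Winner" in Table 4) is then applied to the experimental data.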
6. Concluding Remarks
In this paper, we study the optimal multi-period design of experiments, where our goal is to maximize the statistical power of our estimate of the treatment effect. Formally, we minimize the variance of the estimated treatment effect. We assume that treated units cannot switch back to control during the experiment. In the simplest case, we assume the potential outcome follows a two-way fixed-effect linear model; in extensions, we consider additional observed or latent covariates and treatments with carryover effects. We find a set of sufficient conditions for the optimal treatment design. When the potential outcome does not have covariates, or has covariates that take discrete values, we provide a feasible analytical solution that approximates the global optimum within a factor of $1 + O(1/N)$ based on the sufficient conditions.

Figure 3
Inpatient and outpatient flu visit rates: This figure compares the average estimated treatment effect $\bar{\tau}_Z$ and its standard deviation $\sqrt{\mathrm{Var}(\hat{\tau}_Z)}$ based on $m = 500$ randomly sampled blocks for the benchmark designs, $Z^{\mathrm{OPT}}$, $Z^{K=2}$, $Z^{K=3}$, and $Z^{\mathrm{OPT+}}$. In the synthetic experiments, $N = 50$, $T = 7$, the intervention only has a direct treatment effect, and we choose the treatment effect $\tau = -0.01\%$. The bar indicates the average $\bar{\tau}_Z$, while the error bar indicates the standard deviation $\sqrt{\mathrm{Var}(\hat{\tau}_Z)}$. The red dashed line indicates the true value $\tau = -0.01\%$. The figure shows the advantage of $Z^{\mathrm{OPT}}$, $Z^{K=2}$, $Z^{K=3}$, and $Z^{\mathrm{OPT+}}$ over benchmark treatment designs: the bias of the various treatment designs is similar, while $Z^{\mathrm{OPT}}$ has much smaller variance than the benchmark designs, and the treatment designs found using historical data, such as $Z^{\mathrm{OPT+}}$ from Algorithm 1, have smaller variance than $Z^{\mathrm{OPT}}$.

If we have historical control data, we propose a local search algorithm based on the minimax decision rule to find a better treatment design based on the sufficient conditions when the potential outcome has latent covariates or covariates that can take infinitely many values. Finally, we demonstrate the practical relevance of our results through synthetic interventions on the influenza occurrence rate and synthetic experiments on in-home medical service data and grocery data. Our solution can significantly outperform benchmark treatment designs independent of the model assumption and treatment effect estimation approach.
Managerial Implications.
Our results have several implications for guiding companies and policymakers in conducting experiments to evaluate interventions. First, when the sample size is small, by leveraging the time dimension and tracking the same sample across multiple time periods, one can increase the number of observations and use our optimal multi-period experimental design to maximize the benefit of the experiment. Second, the structure of the multi-period experimental design is substantially impacted when the effect of an intervention lasts for multiple time periods. Third, in the presence of historical (pre-experimentation) data, we can further optimize our treatment designs and hence reduce the number of required samples, which lowers the experiment cost.

Future directions.
One interesting direction is how to dynamically adjust the treatment design to improve efficiency using both experimental and historical control data, under the constraint that the treatment cannot be removed once implemented, or that it is costly to switch from treated to control. This direction is related to online learning and multi-armed/contextual bandit problems, but the objective is different, and there is a switching constraint from treated to control.
Study Approval and Acknowledgement: This paper used the MarketScan Research Databases. Data access was provided by the Stanford Center for Population Health Sciences Data Core. The PHS Data Core is supported by a National Institutes of Health National Center for Advancing Translational Science Clinical and Translational Science Award (UL1 TR001085) and internal Stanford funding. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Appendix A: Additional Theoretical Results
A.1. Treatment is Reversible
Proposition 1.
Suppose Assumption 1 holds and the treatment is reversible. Let $\omega_t = \frac{1}{N}\sum_{i=1}^N z_{it}$ and $\zeta_i = \frac{1}{T}\sum_{t=1}^T z_{it}$.
1. Suppose the potential outcome can be modeled as $Y_{it}(z_{it}) = \beta_t + \tau z_{it} + \varepsilon_{it}$. Any treatment design is optimal if it satisfies $\omega_t = 0$ for all $t$.
2. Suppose the potential outcome can be modeled as $Y_{it}(z_{it}) = \alpha_i + \tau z_{it} + \varepsilon_{it}$. Any treatment design is optimal if it satisfies $\zeta_i = 0$ for all $i$.
3. Suppose the potential outcome can be modeled as $Y_{it}(z_{it}) = \alpha_i + \beta_t + \tau z_{it} + \varepsilon_{it}$. Any treatment design is optimal if it satisfies $\omega_t = 0$ and $\zeta_i = 0$ for all $t$ and $i$.
4. Suppose the potential outcome can be modeled as $Y_{it}(z_{it}) = \alpha_i + \beta_t + X_i^\top \theta_t + u_i^\top v_t + \tau z_{it} + \varepsilon_{it}$ (or $Y_{it}(z_{it}) = \alpha_i + \beta_t + X_i^\top \theta_t + \tau_1 z_{it} + \tau_2 z_{i,t-1} + \cdots + \tau_{\ell+1} z_{i,t-\ell} + u_i^\top v_t + \varepsilon_{it}$). Suppose the assumptions in Theorem 3 (or Theorem 5) hold, and let $\omega_{g,t} = \frac{1}{|O_g|}\sum_{i \in O_g} z_{it}$, where $O_g = \{i : X_i = x_g, u_i = u_{0,g}\}$. Any treatment design is optimal if it satisfies $\omega_{g,t} = 0$ for all $t$ and $g$, and $\zeta_i = 0$ for all $i$.

Proof of Proposition 1
The proof is a direct extension of the proofs of Lemmas 2 and 3 and Theorems 1, 3, and 5 in Appendix C. The major difference is that when the treatment can be reversed, the value of $\sum_{i=1}^N \zeta_i^2$ is not unique given $\omega_t$. We can then separately optimize the part involving $\zeta_i$ and the part involving $\omega_t$. Note that $\sum_{i=1}^N \zeta_i^2$ is minimized at $\zeta_i = 0$. Since the Hessian of the part involving $\omega_t$ (resp. $\omega_{g,t}$) is positive semi-definite, this part is minimized at $\omega_t = 0$ (resp. $\omega_{g,t} = 0$). $\square$

A.2. Additional Discussion of Theorem 3
Remark 15.
Consider a symmetric case of Theorem 3 in which the rows of $V$ are deterministic with $\sum_{t=1}^T v_t = 0$, and $u_i$ is random with mean $0$ and variance $I_k$, with $\mathrm{Cov}(u_i, u_j) = 0$ for any $i \ne j$. In this case, the errors are cross-sectionally independent but time-series correlated, and units are exchangeable. By symmetry, the sufficient conditions for the optimal solution in this case (analogous to Theorem 3) are
$$\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t - 1 - T}{T}, \qquad \frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X, \qquad \frac{1}{T}\sum_{t=1}^T v_t z_{it} = \mu_V \ \text{for all } i,$$
for some $\mu_V \in \mathbb{R}^k$. When $z_{it}$ is the same for all $i$ (the before-after design $Z^{\mathrm{BA}}$ defined in Section 5.2), the time-series average condition $\frac{1}{T}\sum_{t=1}^T v_t z_{it} = \mu_V$ for all $i$ is satisfied. However, this treatment design violates the first condition on the cross-sectional average, $\frac{1}{N}\sum_{i=1}^N z_{it} = \frac{2t-1-T}{T}$. When $v_t$ is not stationary and has a strong time trend, as with GDP or stock indices over time, there may not exist a $Z$ satisfying all the sufficient conditions. Intuitively, the optimal $Z$ balances the sufficient conditions on the time-series and cross-sectional averages, and is closer to $Z^{\mathrm{BA}}$ than to $Z^{\mathrm{FF}}$.

A.3. More Results for Carryover Treatment Effects
In Theorem 5, $A(\ell)$ and $b(\ell)$ in Eq. (17) are defined in Eqs. (22) and (23), whose entries are functions of $\lfloor \ell/2 \rfloor$, $\ell$, and $T - \ell$.

The first step in proving Theorem 5 is to derive the convex relaxation of the integer program (15) when the potential outcome has no observed or latent covariates. When the model has no latent covariates, $\Sigma_e = \sigma^2 I_{N(T-\ell)}$ and
$$\mathrm{Var}(\hat{\vec{\tau}}) = \sigma^2 \big( Z^\top ( I_{N(T-\ell)} - \Gamma(\Gamma^\top \Gamma)^{-1} \Gamma^\top ) Z \big)^{-1}.$$
Then the integer program (15) is equivalent to the following integer program:
$$\min_Z \ \mathrm{tr}\Big( \big( Z^\top ( I_{N(T-\ell)} - \Gamma(\Gamma^\top \Gamma)^{-1} \Gamma^\top ) Z \big)^{-1} \Big) \quad (24)$$
$$\mathrm{s.t.} \quad z_{it} \le z_{i,t+1}, \qquad z_{it} \in \{-1, 1\}.$$
The convex relaxation of the integer program (24) is provided in the following lemma.
Lemma 4.
Suppose Assumptions 1–2 hold and the potential outcome follows model (14) without observed or latent covariates, i.e., Y_{it}(z_{it}) = α_i + β_t + τ_1 z_{it} + τ_2 z_{i,t−1} + ⋯ + τ_{ℓ+1} z_{i,t−ℓ} + ε_{it}. Then the convex relaxation of the integer program (24) is

min_ω Σ_{j=1}^{ℓ+1} [ Σ_{t=j}^{T−ℓ−1+j} ω_t² − (1/(T−ℓ)) (Σ_{t=j}^{T−ℓ−1+j} ω_t)² + (2/(T−ℓ)) Σ_{t=j}^{T−ℓ−1+j} (T − ℓ + 2j − 1 − 2t) ω_t ]  (25)
s.t. −1 ≤ ω_t ≤ 1, ω_t ≤ ω_{t+1}.

The objective function in (25) decomposes into ℓ + 1 quadratic problems and is equal to the objective function (19) in the overview of the proof of Theorem 5. If we minimize each of these ℓ + 1 problems separately, the optimal solution of each sub-problem is the same as that in Theorem 1 with the number of time periods equal to T − ℓ. The proof of Lemma 4 and the remaining steps to prove Theorem 5 are provided in Appendix E.

Remark 16.
Beyond the examples for ω* when ℓ = 1 and ℓ = 2 in Remark 8, the expression for ω* when ℓ = 3 has the same structure: ω*_1 = −1 and ω*_T = 1, the interior entries ω*_t are linear in t for t = 4, ⋯, T − 3, and the remaining boundary entries ω*_2, ω*_3, ω*_{T−2}, ω*_{T−1} are rational functions of T that interpolate between the endpoint values and the linear interior solution.

If we care not only about the variances of the estimated direct and lagged effects, but also about the covariances between the estimated direct and lagged effects, i.e., Cov(τ̂_j, τ̂_{j′}) for j ≠ j′, then it is natural to consider minimizing det(Var(τ̂)). Note that when we do not have latent covariates, det(Var(τ̂)) = 1/det(Z⊤(I_{N(T−ℓ)} − Γ(Γ⊤Γ)⁻¹Γ⊤)Z), so minimizing det(Var(τ̂)) is equivalent to minimizing −det(Z⊤(I_{N(T−ℓ)} − Γ(Γ⊤Γ)⁻¹Γ⊤)Z) and the corresponding optimization problem is

min_Z −det(Z⊤(I_{N(T−ℓ)} − Γ(Γ⊤Γ)⁻¹Γ⊤)Z)  (26)
s.t. z_{it} ≤ z_{i,t+1}, z_{it} ∈ {−1, 1}.

The optimal solution Z of the integer program (26) is called the D (determinant)-optimal design. We provide the relaxation of the integer program (26) in Lemma 5.

Lemma 5.
Suppose Assumptions 1–2 hold and the potential outcome follows model (14) without observed or latent covariates, i.e., Y_{it}(z_{it}) = α_i + β_t + τ_1 z_{it} + τ_2 z_{i,t−1} + ⋯ + τ_{ℓ+1} z_{i,t−ℓ} + ε_{it}. Then the relaxation of the integer program (26) is

min_ω −det(Θ)  (27)
s.t. −1 ≤ ω_t ≤ 1, ω_t ≤ ω_{t+1},

where each entry Θ_{jm} is a quadratic function of the cross-sectional averages ω_t: the diagonal entries (j = m) coincide with the j-th sub-problem objective in Eq. (25); the off-diagonal entries (j < m) additionally involve the cross terms Σ_t ω_t ω_{t+m−j} and Σ_{t=j}^{m−1}(ω_t − ω_{T−ℓ+t}), with piecewise-linear weights υ^{(j,m)}_t whose breakpoints are at t = ℓ + 1 − m, ℓ + 1 − j, T + 1 − m, and T + 1 − j; and Θ_{jm} = Θ_{mj} for j > m.

[Figure 4. D-optimal treatment design: optimal treated proportion at each period for a T-period treatment design in the presence of carryover treatment effects, where T = 10. Different colors represent the number of lags ℓ in the carryover effects.]

Note that each entry of Θ is a quadratic function of ω_t, the cross-sectional average of the treatment variables. However, the objective function in the optimization problem (27) is not a convex function of ω_t. In practice, we can use off-the-shelf software to find the optimal ω_t. We plot the optimal solution of the optimization problem (27) for T = 10 in Figure 4.
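Because the entries of Θ are unwieldy and (27) is non-convex, for very small instances one can sidestep the relaxation entirely and search over integer designs directly. The sketch below is only an illustration (it is not the paper's algorithm, and the instance sizes N, T, ℓ are arbitrary toy choices): it enumerates all monotone adoption patterns and maximizes det(Z⊤(I − Γ(Γ⊤Γ)⁻¹Γ⊤)Z), which is equivalent to the D-optimal integer program (26).

```python
import itertools
import numpy as np

N, T, ell = 4, 4, 1  # toy instance sizes, chosen only so enumeration is fast

# Stacked observations (unit i, period t) for t = ell..T-1 (0-indexed)
rows = [(i, t) for i in range(N) for t in range(ell, T)]
G = np.zeros((len(rows), N + (T - ell)))
for r, (i, t) in enumerate(rows):
    G[r, i] = 1.0            # unit fixed-effect dummy
    G[r, N + t - ell] = 1.0  # time fixed-effect dummy
M = np.eye(len(rows)) - G @ np.linalg.pinv(G)  # annihilator of the fixed effects

def design(adopt):
    """Current and lagged treatment columns for adoption times `adopt` (T = never)."""
    z = np.where(np.arange(T)[None, :] >= np.array(adopt)[:, None], 1.0, -1.0)
    return np.array([[z[i, t - j] for j in range(ell + 1)] for (i, t) in rows])

best_det, best = -np.inf, None
for adopt in itertools.product(range(T + 1), repeat=N):
    Zc = design(adopt)
    det = np.linalg.det(Zc.T @ M @ Zc)
    if det > best_det:
        best_det, best = det, adopt

# A simultaneous (before-after) rollout is degenerate: its treatment columns
# depend on t only, so the time fixed effects absorb them and det = 0.
Zba = design((T // 2,) * N)
det_ba = np.linalg.det(Zba.T @ M @ Zba)
```

As expected, the brute-force maximizer is a staggered pattern, while the before-after rollout yields a singular information matrix.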
As in all the previous results, the treated proportion is symmetric with respect to the point ((T+1)/2, 0.5). As in the T-optimal design in Theorem 5, if the treatment affects more periods (ℓ is larger), the optimal treated proportion is in general smaller at the beginning, increases at a faster rate in the middle, and is larger at the end. The proof of Lemma 5 is provided in Appendix E.

Appendix B: Additional Algorithm
B.1. Grid Search for the Tuning Parameter in Low-Rank Matrix Estimation with Fixed Effects
When we use LRME to estimate the treatment effect τ given a treatment design Z, we need to specify the regularization parameter µ. If we have historical control data, we can use grid search to find the optimal µ. The details are presented in Algorithm 3. Similar to Section 4.1, we split the historical data into sub-matrices. Given a treatment design matrix Z, we apply the synthetic experiment to each of these sub-matrices and use LRME to estimate τ with a particular µ. The optimal µ minimizes max_j |τ̂^{(j)} − τ|, where τ̂^{(j)} is estimated from the j-th sub-matrix. The next step is to use this treatment design matrix Z to generate synthetic experimental data and estimate τ by LRME using the optimal µ.

Algorithm 3:
Grid search for the optimal µ
Inputs: Y^{hist,(1)}, ⋯, Y^{hist,(m)}, Z, τ, µ_max, µ_min, N_grids, k, ∆τ, and t_max
  µ⃗ ← logspace(µ_min, µ_max, N_grids);
  for j = 1, ⋯, m do
    τ̂^{(j)}, α̂^{(j)}, β̂^{(j)}, L̂^{(j)} ← Algorithm 2(Y^{hist,(j)}, Z, k, µ_max, ∆τ, t_max);
  end
  E_opt ← max_j |τ̂^{(j)} − τ|; µ_opt ← µ_max;
  for µ ∈ µ⃗ do
    for j = 1, ⋯, m do
      τ̂^{(j)}, α̂^{(j)}, β̂^{(j)}, L̂^{(j)} ← Algorithm 2(Y^{hist,(j)}, Z, k, µ, ∆τ, t_max);
    end
    E_current ← max_j |τ̂^{(j)} − τ|;
    if E_current < E_opt then E_opt ← E_current, µ_opt ← µ;
  end
Outputs: µ_opt

B.2. Feasible GLS
In feasible GLS, we first estimate τ and the residuals from OLS, then estimate the errors' covariance matrix from the residuals, and finally estimate τ from a regression weighted by the inverse of the estimated covariance matrix. We estimate the errors' covariance matrix under two assumptions: 1. the errors are time-series uncorrelated and identically distributed over time; 2. the errors' covariance matrix is the sum of a low-rank matrix and a diagonal matrix (the same assumptions as in Theorem 3). We estimate the errors' covariance matrix Ω at any time period by

U, S, U⊤ = SVD((1/T) ê ê⊤),  Ω̂ = U_{:,1:k} S_{1:k,1:k} U⊤_{:,1:k} + diag((1/T) ê ê⊤ − U_{:,1:k} S_{1:k,1:k} U⊤_{:,1:k}),

where S is a diagonal matrix whose diagonal entries are the singular values of (1/T) ê ê⊤ in descending order, diag((1/T) ê ê⊤ − U_{:,1:k} S_{1:k,1:k} U⊤_{:,1:k}) is the diagonal matrix formed by the diagonal entries of (1/T) ê ê⊤ − U_{:,1:k} S_{1:k,1:k} U⊤_{:,1:k}, and k < T (k can be a preset value or can be chosen in a data-driven manner). The full estimated covariance matrix of the errors (of dimension NT × NT) is a block-diagonal matrix with each block equal to Ω̂. By assuming the errors are time-series uncorrelated, we estimate far fewer parameters, and each parameter is estimated from T observations.

Appendix C: No Covariates
Proof of Equation (7). We use linear regression to estimate τ and η, so

[τ̂; η̂] = ([z⊤; Γ⊤][z Γ])⁻¹ [z⊤; Γ⊤] y = [τ; η] + ([z⊤z, z⊤Γ; Γ⊤z, Γ⊤Γ])⁻¹ [z⊤; Γ⊤] ε.

Since Var(ε) = σ² I,

Var([τ̂; η̂]) = σ² ([z⊤z, z⊤Γ; Γ⊤z, Γ⊤Γ])⁻¹.

From block matrix inversion, we have

Var(τ̂) = σ² (z⊤z − z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z)⁻¹,

which is equivalent to Eq. (7). □

Proof of Lemma 2.
When there are only time fixed effects, Γ can be simplified to

Γ = I_T ⊗ 1⃗ ∈ ℝ^{NT×T}, where 1⃗ ∈ ℝ^N.

Then we have Γ⊤Γ = N · I_T and (1/N) z⊤Γ = [ω_1 ⋯ ω_T], where ω_t = (1/N) Σ_{i=1}^N z_{it}. The objective function in the integer program (8) has

z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z = N Σ_{t=1}^T ω_t².

It is minimized at ω_t = 0. Thus, any treatment design is optimal if it satisfies ω_t = 0 for all t and z_{it} = z_{i,t+1} for all i and t. In other words, if N is even, Var(τ̂) is minimized at ω_t = 0, implying that 50% of the units have z_{it} = 1 and the other 50% have z_{it} = −1. If N is odd, Var(τ̂) is minimized at |ω_t| = 1/N, implying that there are (N−1)/2 units with z_{it} = 1 for all t, there are (N−1)/2 units with z_{it} = −1 for all t, and there is one unit whose z_{it} is flexible as long as z_{it} ≤ z_{i,t+1} is satisfied. □

Proof of Lemma 3.
When there are only unit fixed effects, Γ can be simplified to

Γ = 1⃗ ⊗ I_N = [I_N; I_N; ⋯; I_N] ∈ ℝ^{NT×N}, where 1⃗ ∈ ℝ^T.

Then we have Γ⊤Γ = T · I_N and (1/T) z⊤Γ = [ζ_1 ⋯ ζ_N], where ζ_i = (1/T) Σ_{t=1}^T z_{it}. The objective function in the integer program (8) has

z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z = T Σ_{i=1}^N ζ_i².

It is minimized at ζ_i = 0. Thus, any treatment design is optimal if it satisfies ζ_i = 0 for all i. Given Assumption 2, the equivalent statement is that any treatment design is optimal if it satisfies ω_t = −1 for all t < (T+1)/2 and ω_t = 1 for all t > (T+1)/2. In other words, if T is even, Var(τ̂) is minimized at ζ_i = 0 for all i, implying that all units are untreated in the first half of the time periods and all switch to the treated group in the second half. If T is odd, Var(τ̂) is minimized at |ζ_i| = 1/T, implying that all units are in the control group for all t ≤ (T−1)/2, can be either treated or untreated at t = (T+1)/2, and are in the treated group for all t > (T+1)/2. □

Since we do not have time fixed effects here, we do not have the identification problem and we do not need to assume α_N = 0, in contrast to Theorem 1.

Proof of Theorem 1.
Denote ζ_i = (1/T) Σ_{t=1}^T z_{it} for all i and ω_t = (1/N) Σ_{i=1}^N z_{it} for all t. There are two steps to show Theorem 1:

1. The objective function in the integer program (8), z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z, can be simplified (up to an additive constant that does not depend on the design) to N(ω⊤P_1 ω + 2d⊤ω), with P_1 = I_T − (1/T)1⃗1⃗⊤ ∈ ℝ^{T×T}, d = [d_t] ∈ ℝ^T, and d_t = (T + 1 − 2t)/T for all t.

2. ω*_t = (2t − 1 − T)/T is the optimal solution that satisfies the first-order condition (FOC) of ω⊤P_1 ω + 2d⊤ω.

Now we start with the first step.
There are two main intermediate steps.

1. Express z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z in terms of ω_t and ζ_i. When potential outcomes follow Y_{it}(z_{it}) = α_i + β_t + τ z_{it} + ε_{it}, we have

Γ = [Γ_{:1}; Γ_{:2}; ⋯; Γ_{:T}] = [Ĩ 1⃗; Ĩ 1⃗; ⋱; Ĩ 1⃗] ∈ ℝ^{NT×(N+T−1)}

(the unit-dummy block Ĩ is repeated in every time period, and the 1⃗ columns form the block-diagonal time dummies), where Ĩ = [I_{N−1} 0⃗]⊤ ∈ ℝ^{N×(N−1)}, 0⃗ ∈ {0}^{N−1} and 1⃗ ∈ {1}^N. In z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z, we have

(Γ⊤Γ)⁻¹ = [Ξ₁ Ξ₂; Ξ₃ Ξ₄] = [M₀, −(1/N)M₀M_1; −(1/N)M_1⊤M₀, (1/N)I_T + (1/N²)M_1⊤M₀M_1] ∈ ℝ^{(N+T−1)×(N+T−1)},

where Ξ₁ = M₀, Ξ₂ = −(1/N)M₀M_1, Ξ₃ = −(1/N)M_1⊤M₀, Ξ₄ = (1/N)I_T + (1/N²)M_1⊤M₀M_1 and

M₀ = (1/T)(I_{N−1} − (1/N)1⃗1⃗⊤)⁻¹ = (1/T)(I_{N−1} + 1⃗1⃗⊤) ∈ ℝ^{(N−1)×(N−1)},  M_1 = [1⃗ ⋯ 1⃗] ∈ ℝ^{(N−1)×T},

where 1⃗ ∈ {1}^{N−1} in M₀ and M_1. Furthermore,

z⊤Γ = [Σ_t z_{1t} ⋯ Σ_t z_{N−1,t}, Σ_i z_{i1} ⋯ Σ_i z_{iT}] = [Tζ_1 ⋯ Tζ_{N−1}, Nω_1 ⋯ Nω_T].

Let φ := [Tζ_1 ⋯ Tζ_{N−1}]⊤ and ι := [Nω_1 ⋯ Nω_T]⊤. Then z⊤Γ = [φ⊤ ι⊤] and

z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z = φ⊤Ξ₁φ + 2φ⊤Ξ₂ι + ι⊤Ξ₄ι.

We simplify each term in z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z:

φ⊤Ξ₁φ = (1/T) φ⊤(I_{N−1} + 1⃗1⃗⊤)φ = T Σ_{i=1}^{N−1} ζ_i² + T (Σ_{i=1}^{N−1} ζ_i)²,
φ⊤Ξ₂ι = −(1/T) φ⊤(I_{N−1} + 1⃗1⃗⊤)1⃗ (Σ_{t=1}^T ω_t) = −(N/T) φ⊤1⃗ (Σ_{t=1}^T ω_t) = −N (Σ_{i=1}^{N−1} ζ_i)(Σ_{t=1}^T ω_t),
ι⊤Ξ₄ι = N Σ_{t=1}^T ω_t² + (N(N−1)/T)(Σ_{t=1}^T ω_t)².

From the definitions of ζ_i and ω_t, T Σ_{i=1}^N ζ_i = N Σ_{t=1}^T ω_t. Then Σ_{i=1}^{N−1} ζ_i = (N/T) Σ_{t=1}^T ω_t − ζ_N. Using this property, we have

z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z = T Σ_{i=1}^N ζ_i² − (N/T)(Σ_{t=1}^T ω_t)² + N Σ_{t=1}^T ω_t².  (28)

2. Express z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z in terms of ω_t. Following Assumption 2, Σ_{i=1}^N ζ_i² is determined by the ω_t. Given the ω_t, there are N(1+ω_1)/2, N(1+ω_2)/2, ⋯, N(1+ω_T)/2 treated units in time periods 1, 2, ⋯, T. Equivalently, N(1+ω_1)/2, N(ω_2−ω_1)/2, ⋯, N(ω_T−ω_{T−1})/2 untreated units start the treatment at time periods 1, 2, ⋯, T, and N(1−ω_T)/2 units remain in the control group at the end. A unit that starts the treatment at period s has ζ_i = (T − 2s + 2)/T, so

Σ_{i=1}^N ζ_i² = N [ (1+ω_1)/2 · 1 + (ω_2−ω_1)/2 · ((T−2)/T)² + ⋯ + (ω_T−ω_{T−1})/2 · ((2−T)/T)² + (1−ω_T)/2 · 1 ]
= N + (2N/T²) Σ_{t=1}^T (T + 1 − 2t) ω_t.

Therefore, up to an additive constant that does not depend on the design, we can write z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z as a function of ω alone:

z⊤Γ(Γ⊤Γ)⁻¹Γ⊤z = const + N Σ_{t=1}^T ω_t² − (N/T)(Σ_{t=1}^T ω_t)² + (2N/T) Σ_{t=1}^T (T + 1 − 2t) ω_t = const + N (ω⊤P_1 ω + 2d⊤ω) = const + N f(ω),

where f(ω) := ω⊤P_1 ω + 2d⊤ω, so minimizing Var(τ̂) is equivalent to minimizing f(ω). Next we show the second step.
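The second-step claim — that ω*_t = (2t − 1 − T)/T minimizes f(ω) = ω⊤P_1ω + 2d⊤ω — can also be checked numerically before the formal argument. The sketch below (T = 10 is an arbitrary choice) runs plain gradient descent on f; the unconstrained minimizer happens to be feasible (monotone and in [−1, 1]), so no projection is needed for this illustration:

```python
import numpy as np

T = 10  # illustrative horizon
t = np.arange(1, T + 1)
P = np.eye(T) - np.ones((T, T)) / T   # P_1 = I_T - (1/T) 1 1^T
d = (T + 1 - 2 * t) / T               # d_t = (T + 1 - 2t)/T

def f(w):
    # quadratic objective from the proof of Theorem 1
    return w @ P @ w + 2 * d @ w

# Plain gradient descent on the convex quadratic f
w = np.zeros(T)
for _ in range(300):
    w -= 0.4 * (2 * P @ w + 2 * d)

w_star = (2 * t - 1 - T) / T          # claimed optimal solution
```

The iterates converge to ω*, and f(ω*) matches the closed form −(T+1)(T−1)/(3T) used in the proof of Theorem 2.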
Let f₁(ω) := Σ_{t=1}^T ω_t², f₂(ω) := −(1/T)(Σ_{t=1}^T ω_t)², and f₃(ω) := 2 Σ_{t=1}^T ((T+1−2t)/T) ω_t. Then we have f(ω) = f₁(ω) + f₂(ω) + f₃(ω). We first take the second-order derivatives:

∇²f₁(ω) = 2I_T,  ∇²f₂(ω) = −(2/T) 1⃗1⃗⊤,  ∇²f₃(ω) = 0.

Thus, ∇²f(ω) = ∇²f₁(ω) + ∇²f₂(ω) + ∇²f₃(ω) = 2(I_T − (1/T)1⃗1⃗⊤) ⪰ 0, so f(ω) is convex. Next we take the first-order derivative:

∇f(ω) = [2ω_1 + 2(T−1)/T − (2/T)Σ_tω_t, 2ω_2 + 2(T−3)/T − (2/T)Σ_tω_t, ⋯, 2ω_T − 2(T−1)/T − (2/T)Σ_tω_t].

When ω*_t = (2t−1−T)/T, ∇f(ω*) = 0. Thus, ω* is a local optimal solution. Since f(ω) is convex, the local minimum is also the global minimum. Therefore, ω* is the global optimal solution. □

Proof of Theorem 2.
Denote the objective function of the quadratic program (9) by f(ω) (as in the proof of Theorem 1). The variance of τ̂ (denoted g(ω)) is

g(ω) := Var(τ̂) = σ² (z⊤(I_{NT} − Γ(Γ⊤Γ)⁻¹Γ⊤)z)⁻¹ = σ²/(N(T − f(ω))).

The optimal solution in Theorem 1 is ω*_t = (2t−1−T)/T. We plug ω* = [ω*_t] into f(ω):

f(ω*) = −Σ_{t=1}^T ((T+1−2t)/T)² = −(T+1)(T−1)/(3T).

The rounded version of ω* is denoted ω^rnd = [ω^rnd_t] (from the nearest-integer rounding rule). We have |ω^rnd_t − ω*_t| ≤ 1/N and |Σ_{t=1}^T ω^rnd_t| ≤ T/N. Therefore,

f(ω^rnd) = Σ_{t=1}^T (ω^rnd_t + (T+1−2t)/T)² − (1/T)(Σ_{t=1}^T ω^rnd_t)² − Σ_{t=1}^T ((T+1−2t)/T)² ≤ 3T/N² − (T+1)(T−1)/(3T).

We have

g(ω^rnd)/g(ω*) = (T − f(ω*))/(T − f(ω^rnd)) ≤ [(4T²−1)/(3T)] / [(4T²−1)/(3T) − 3T/N²] = (4T²−1)/(4T²−1 − 9T²/N²) ≤ 1/(1 − 3/N²),

following 3T² ≤ 4T²−1. Note that f(ω*) attains the minimum objective value of the quadratic program (9), so f(ω*) is a lower bound for the minimum objective value of the original integer programming problem. Thus, g(ω*) ≤ Var_{Z*}(τ̂) and we have

Var_{Z^rnd}(τ̂) ≤ (1/(1 − 3/N²)) Var_{Z*}(τ̂). □

Appendix D: With Covariates
Proof of Theorem 3.
We can combine β_t with X_i⊤θ_t in the potential outcome model (11), that is,

Y_{it}(z_{it}) = α_i + [1 X_i⊤] [β_t; θ_t] + τ z_{it} + ε_{it}, with X̃_i⊤ := [1 X_i⊤],

and denote p := r + 1, so X̃_i ∈ ℝ^p. Denote ζ_i = (1/T) Σ_{t=1}^T z_{it} for all i and ω̃_t = (1/N) Σ_{i=1}^N X̃_i z_{it} ∈ ℝ^p for all t. First, when we use W = Σ_e⁻¹ in GLS, with some algebra we have the following expression for the variance of τ̂:

Var(τ̂) = (z⊤(Σ_e⁻¹ − Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹)z)⁻¹,  (29)

where Ĩ = [I_{N−p} 0_{N−p,p}]⊤ ∈ ℝ^{N×(N−p)}, 0_{N−p,p} is a matrix of zeros, 1⃗ ∈ {1}^N, and

Γ = [Γ_{:1}; Γ_{:2}; ⋯; Γ_{:T}] = [Ĩ X̃; Ĩ X̃; ⋱; Ĩ X̃] ∈ ℝ^{NT×(N−p+Tp)}  (30)

(the unit-dummy block Ĩ is repeated in every time period, and the X̃ blocks are block-diagonal across time). We restrict α_i = 0 for i = N−p+1, ⋯, N so that all other α_i and β_t can be uniquely identified. The details to show Eq. (29) are the same as those to show Eq. (7).

There are three steps to show Theorem 3:

1. Minimizing Var(τ̂) can be reduced to minimizing

(ω^{(1)})⊤P_1 ω^{(1)} + 2d⊤ω^{(1)} + Σ_{j=2}^p (ω^{(j)})⊤P_1 ω^{(j)} + (1/N) Σ_{t=1}^T z_t⊤U(I_k + U⊤U)⁻¹U⊤z_t − (1/(NT)) (Σ_{t=1}^T z_t)⊤U(I_k + U⊤U)⁻¹U⊤(Σ_{t=1}^T z_t),

where ω^{(j)} = [ω̃_{1,j} ⋯ ω̃_{T,j}]⊤ and ω̃_{t,j} = (1/N) Σ_{i=1}^N X̃_{i,j} z_{it} for j = 1, ⋯, p and t = 1, ⋯, T.

2. Take the first-order condition and show sufficient conditions for optimality.

3. Based on the sufficient conditions and the finitely many values of u_i, show ω_{g,t} = (2t−1−T)/T is optimal.

We start with the first step.
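Step 1 below repeatedly uses the Woodbury identity (I_N + UU⊤)⁻¹ = I_N − U(I_k + U⊤U)⁻¹U⊤ to invert the error covariance. It can be spot-checked numerically (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 8, 3                        # arbitrary dimensions for the check
U = rng.normal(size=(N, k))

# Woodbury identity used to invert Sigma_{e_t} = I_N + U U^T
lhs = np.linalg.inv(np.eye(N) + U @ U.T)
rhs = np.eye(N) - U @ np.linalg.inv(np.eye(k) + U.T @ U) @ U.T
```

The two expressions agree to machine precision, and the right-hand side only requires inverting a k × k matrix.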
The steps to simplify z⊤(Σ_e⁻¹ − Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹)z are as follows.

1. Calculate Σ_e⁻¹. Under the assumptions that v_t has mean 0 and variance I_k, Cov(v_t, v_{t−q}) = 0 for any q ≠ 0, Cov(v_t, ε_{is}) = 0 for any s, t and i, and σ = 1, we have Σ_e⁻¹ = diag(Ψ, Ψ, ⋯, Ψ), where

Ψ = Σ_{e_t}⁻¹ = (I_N + UU⊤)⁻¹ = I_N − U(I_k + U⊤U)⁻¹U⊤.

2. Calculate z⊤Σ_e⁻¹z. We have z⊤Σ_e⁻¹z = Σ_{t=1}^T z_t⊤Ψz_t.

3. Calculate z⊤Σ_e⁻¹Γ. We have

z⊤Σ_e⁻¹Γ = [Σ_{t=1}^T z_t⊤ΨĨ, z_1⊤ΨX̃, ⋯, z_T⊤ΨX̃] = [φ⊤ ι⊤],

where φ⊤ = Σ_{t=1}^T z_t⊤ΨĨ = (Σ_{t=1}^T z_t)⊤ΨĨ and ι⊤ = [z_1⊤ΨX̃ ⋯ z_T⊤ΨX̃]. Let ζ := (1/T) Σ_{t=1}^T z_t ∈ ℝ^N; then φ⊤ = Tζ⊤ΨĨ. Note that U and X̃ are orthogonal, so ΨX̃ = X̃ and X̃⊤ΨX̃ = N · I_p. Then

ι⊤ = [z_1⊤X̃ ⋯ z_T⊤X̃] = [N ω̃_1⊤ ⋯ N ω̃_T⊤] = N ω̃⃗⊤ ∈ ℝ^{Tp},

where ω̃_t = (1/N) Σ_{i=1}^N X̃_i z_{it} ∈ ℝ^p.

4. Calculate (Γ⊤Σ_e⁻¹Γ)⁻¹. Note that

(Γ⊤Σ_e⁻¹Γ)⁻¹ = [Ξ₁ Ξ₂; Ξ₃ Ξ₄] = [M₀, −M₀M̃; −M̃⊤M₀, M̄ + M̃⊤M₀M̃] ∈ ℝ^{(N−p+Tp)×(N−p+Tp)},

where Ξ₁ = M₀, Ξ₂ = −M₀M̃, Ξ₃ = −M̃⊤M₀, Ξ₄ = M̄ + M̃⊤M₀M̃, with

M₀ = (1/T)(Ĩ⊤ΨĨ − Ĩ⊤ΨX̃(X̃⊤ΨX̃)⁻¹X̃⊤ΨĨ)⁻¹ = (1/T)(Ĩ⊤ΨĨ − (1/N)X̃_{(1)}X̃_{(1)}⊤)⁻¹ ∈ ℝ^{(N−p)×(N−p)},
M̃ = [Ĩ⊤ΨX̃(X̃⊤ΨX̃)⁻¹ ⋯ Ĩ⊤ΨX̃(X̃⊤ΨX̃)⁻¹] = [(1/N)X̃_{(1)} ⋯ (1/N)X̃_{(1)}] ∈ ℝ^{(N−p)×Tp},
M̄ = diag((X̃⊤ΨX̃)⁻¹, ⋯, (X̃⊤ΨX̃)⁻¹) = (1/N) I_{Tp} ∈ ℝ^{Tp×Tp},

and X̃_{(1)} := [X̃_1 ⋯ X̃_{N−p}]⊤ ∈ ℝ^{(N−p)×p}. We can simplify M₀ by the Woodbury identity:

M₀ = (1/T)[(Ĩ⊤ΨĨ)⁻¹ + (Ĩ⊤ΨĨ)⁻¹X̃_{(1)}(N I_p − X̃_{(1)}⊤(Ĩ⊤ΨĨ)⁻¹X̃_{(1)})⁻¹X̃_{(1)}⊤(Ĩ⊤ΨĨ)⁻¹],
(Ĩ⊤ΨĨ)⁻¹ = (I_{N−p} − U_{(1)}(I_k + U⊤U)⁻¹U_{(1)}⊤)⁻¹ = I_{N−p} + U_{(1)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(1)}⊤,

where U_{(1)} = [u_1 ⋯ u_{N−p}]⊤ ∈ ℝ^{(N−p)×k} and U_{(2)} = [u_{N−p+1} ⋯ u_N]⊤ ∈ ℝ^{p×k} (note U = [U_{(1)}⊤ U_{(2)}⊤]⊤).

5. Calculate z⊤Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹z.
It is equivalent to calculating

[φ⊤ ι⊤][Ξ₁ Ξ₂; Ξ₃ Ξ₄][φ; ι] = φ⊤Ξ₁φ + φ⊤Ξ₂ι + ι⊤Ξ₃φ + ι⊤Ξ₄ι,

where (recall ζ = (1/T) Σ_{t=1}^T z_t and ω̃⃗ = [ω̃_1⊤ ⋯ ω̃_T⊤]⊤)

φ⊤Ξ₁φ = T² ζ⊤ΨĨM₀Ĩ⊤Ψζ,
φ⊤Ξ₂ι = −T ζ⊤ΨĨM₀X̃_{(1)}(Σ_{t=1}^T ω̃_t),
ι⊤Ξ₃φ = −T (Σ_{t=1}^T ω̃_t⊤) X̃_{(1)}⊤M₀Ĩ⊤Ψζ,
ι⊤Ξ₄ι = N Σ_{t=1}^T ω̃_t⊤ω̃_t + (Σ_{t=1}^T ω̃_t⊤) X̃_{(1)}⊤M₀X̃_{(1)} (Σ_{t=1}^T ω̃_t).

Let X̃_{(1)} := [X̃_1 ⋯ X̃_{N−p}]⊤ ∈ ℝ^{(N−p)×p} and X̃_{(2)} := [X̃_{N−p+1} ⋯ X̃_N]⊤ ∈ ℝ^{p×p} (note X̃ = [X̃_{(1)}⊤ X̃_{(2)}⊤]⊤). We can simplify φ⊤Ξ₁φ, φ⊤Ξ₂ι, ι⊤Ξ₃φ and ι⊤Ξ₄ι by calculating the following terms:

G = Ĩ(Ĩ⊤ΨĨ)⁻¹Ĩ⊤ = [I_{N−p} + U_{(1)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(1)}⊤, 0; 0, 0] ∈ ℝ^{N×N},
Ω = ΨGΨ = [I_{N−p} − U_{(1)}(I_k + U⊤U)⁻¹U_{(1)}⊤, −U_{(1)}(I_k + U⊤U)⁻¹U_{(2)}⊤; −U_{(2)}(I_k + U⊤U)⁻¹U_{(1)}⊤, U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(1)}⊤U_{(1)}(I_k + U⊤U)⁻¹U_{(2)}⊤] ∈ ℝ^{N×N},
(Ĩ⊤ΨĨ)⁻¹X̃_{(1)} = X̃_{(1)} − U_{(1)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤X̃_{(2)} ∈ ℝ^{(N−p)×p},
Ĩ(Ĩ⊤ΨĨ)⁻¹X̃_{(1)} = [X̃_{(1)} − U_{(1)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤X̃_{(2)}; 0] ∈ ℝ^{N×p},
δ = X̃_{(1)}⊤(Ĩ⊤ΨĨ)⁻¹X̃_{(1)} = N I_p − (X̃_{(2)}⊤X̃_{(2)} − X̃_{(2)}⊤U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤X̃_{(2)}) ∈ ℝ^{p×p},
γ = ΨĨ(Ĩ⊤ΨĨ)⁻¹X̃_{(1)} = [X̃_{(1)}; U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤X̃_{(2)}] ∈ ℝ^{N×p}.

Together with N Σ_{t=1}^T ω̃_t = T Σ_{i=1}^N X̃_i ζ_i = T X̃⊤ζ, we have

φ⊤Ξ₁φ = T ζ⊤(Ω + γ(N I_p − δ)⁻¹γ⊤)ζ,
φ⊤Ξ₂ι = −T ζ⊤(γ(N I_p − δ)⁻¹X̃⊤)ζ,
ι⊤Ξ₃φ = −T ζ⊤(X̃(N I_p − δ)⁻¹γ⊤)ζ,
ι⊤Ξ₄ι = N Σ_{t=1}^T ω̃_t⊤ω̃_t + T ζ⊤(X̃(N I_p − δ)⁻¹X̃⊤)ζ − (N/T)(Σ_{t=1}^T ω̃_t⊤)(Σ_{t=1}^T ω̃_t),

and

[φ⊤ ι⊤][Ξ₁ Ξ₂; Ξ₃ Ξ₄][φ; ι] = N Σ_{t=1}^T ω̃_t⊤ω̃_t − (N/T)(Σ_{t=1}^T ω̃_t⊤)(Σ_{t=1}^T ω̃_t) + T ζ⊤(Ω + (γ − X̃)(N I_p − δ)⁻¹(γ − X̃)⊤)ζ,

with

Ω + (γ − X̃)(N I_p − δ)⁻¹(γ − X̃)⊤ = Ω + [0, 0; 0, I_p − U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤] = I_N − U(I_k + U⊤U)⁻¹U⊤,

following U⊤U = U_{(1)}⊤U_{(1)} + U_{(2)}⊤U_{(2)} and

U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(1)}⊤U_{(1)}(I_k + U⊤U)⁻¹U_{(2)}⊤ − U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹U_{(2)}⊤
= U_{(2)}(I_k + U_{(2)}⊤U_{(2)})⁻¹(U_{(1)}⊤U_{(1)} − I_k − U⊤U)(I_k + U⊤U)⁻¹U_{(2)}⊤ = −U_{(2)}(I_k + U⊤U)⁻¹U_{(2)}⊤.

Thus,

[φ⊤ ι⊤][Ξ₁ Ξ₂; Ξ₃ Ξ₄][φ; ι] = N Σ_{t=1}^T ω̃_t⊤ω̃_t − (N/T)(Σ_{t=1}^T ω̃_t⊤)(Σ_{t=1}^T ω̃_t) + T ζ⊤(I_N − U(I_k + U⊤U)⁻¹U⊤)ζ.

6. Calculate z⊤(Σ_e⁻¹ − Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹)z.
We have

z⊤Σ_e⁻¹z − z⊤Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹z = NT − Σ_{t=1}^T z_t⊤U(I_k + U⊤U)⁻¹U⊤z_t − N Σ_{t=1}^T ω̃_t⊤ω̃_t + (N/T)(Σ_{t=1}^T ω̃_t⊤)(Σ_{t=1}^T ω̃_t) − T ζ⊤(I_N − U(I_k + U⊤U)⁻¹U⊤)ζ.

Then maximizing z⊤Σ_e⁻¹z − z⊤Σ_e⁻¹Γ(Γ⊤Σ_e⁻¹Γ)⁻¹Γ⊤Σ_e⁻¹z is equivalent to minimizing

(ω^{(1)})⊤P_1 ω^{(1)} + 2d⊤ω^{(1)} + Σ_{j=2}^{r+1}(ω^{(j)})⊤P_1 ω^{(j)} + (1/N) Σ_{t=1}^T z_t⊤U(I_k + U⊤U)⁻¹U⊤z_t − (1/(NT)) (Σ_{t=1}^T z_t)⊤U(I_k + U⊤U)⁻¹U⊤(Σ_{t=1}^T z_t),  (31)

following T Σ_{i=1}^N ζ_i² = TN + (2N/T) Σ_{t=1}^T (T + 1 − 2t) ω̃_{t,1} from the proof of Theorem 1, and Tζ = Σ_{t=1}^T z_t. We refer to the first two terms of (31) as (a), the third term as (b), and the last two terms as (c). Next we show the second step.
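Before doing so, the key fact used for term (c) — that z⊤M_U z = z⊤((I_T − (1/T)1⃗1⃗⊤) ⊗ U(I_k + U⊤U)⁻¹U⊤)z is nonnegative and vanishes when the same cross-sectional assignment is used in every period — can be spot-checked numerically (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, k = 6, 5, 2                                       # arbitrary sizes for the check
U = rng.normal(size=(N, k))
A = U @ np.linalg.inv(np.eye(k) + U.T @ U) @ U.T        # PSD inner matrix
M_U = np.kron(np.eye(T) - np.ones((T, T)) / T, A)       # (I_T - (1/T) 1 1^T) ⊗ A

z_const = np.tile(rng.choice([-1.0, 1.0], size=N), T)   # identical z_t in every period
z_rand = rng.choice([-1.0, 1.0], size=N * T)            # arbitrary assignment

q_const = z_const @ M_U @ z_const                       # should be 0
q_rand = z_rand @ M_U @ z_rand                          # should be >= 0
```

The quadratic form is a sum over periods of PSD forms of the deviations z_t − z̄, so it is zero exactly when U⊤z_t is constant in t.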
This is the same as in the sketch of the proof: we minimize (a), (b) and (c) separately and obtain the sufficient conditions. For (c), note that

Σ_{t=1}^T z_t⊤U(I_k + U⊤U)⁻¹U⊤z_t − (1/T)(Σ_{t=1}^T z_t)⊤U(I_k + U⊤U)⁻¹U⊤(Σ_{t=1}^T z_t) = z⊤((I_T − (1/T)1⃗1⃗⊤) ⊗ U(I_k + U⊤U)⁻¹U⊤)z =: z⊤M_U z.

I_T − (1/T)1⃗1⃗⊤ is a positive semi-definite matrix with one zero eigenvalue, whose corresponding eigenvector is 1⃗. Thus, z⊤M_U z ≥ 0 for all z, and the minimum value 0 is attained when U⊤z_t is the same for all t, which is equivalent to Z⊤U = 1⃗µ_U⊤ ∈ ℝ^{T×k} for some µ_U ∈ ℝ^k. For (a), from Theorem 1, the optimal solution of T Σ_{i=1}^N ζ_i² + N Σ_{t=1}^T ω_t² − (N/T)(Σ_{t=1}^T ω_t)² (which equals (ω^{(1)})⊤P_1 ω^{(1)} + 2d⊤ω^{(1)} up to constants and a factor of N) is (1/N)Σ_{i=1}^N z_{it} = (2t−1−T)/T. For (b), the objective depends on the cross-sectional averages of z_{it} weighted by X_i, and it is minimized when Z satisfies (1/N)Σ_{i=1}^N X_i z_{it} = µ_X for all t — equivalently ω^{(j+1)} = µ_{X,j} · 1⃗, where µ_{X,j} is the j-th entry of µ_X. Hence, we get the sufficient conditions for the optimal solution:

(1/N)Σ_{i=1}^N z_{it} = (2t−1−T)/T,  (1/N)Σ_{i=1}^N X_i z_{it} = µ_X,  (1/N)Σ_{i=1}^N u_i z_{it} = µ_U, for all t,

for some µ_X ∈ ℝ^r and µ_U ∈ ℝ^k. Last is to show the third step.
Since there are G strata for (X_i, u_i) — that is, G distinct values of (X_i, u_i) — we have ω_t = Σ_{g=1}^G p_g ω_{g,t} and (1/N)Σ_{i=1}^N X_{ij} z_{it} = Σ_{g=1}^G p_g x_{gj} ω_{g,t} for j = 1, ⋯, r. We have

(b) = N Σ_{t=1}^T Σ_{j=1}^r (Σ_{g=1}^G p_g x_{gj} ω_{g,t})² − (N/T) Σ_{j=1}^r (Σ_{t=1}^T Σ_{g=1}^G p_g x_{gj} ω_{g,t})².

When ω_{g,t} = (2t−1−T)/T for all t and g, given that X is orthogonal to 1⃗, we have Σ_{g=1}^G p_g x_{gj} ω_{g,t} = 0, and then (b) is minimized with value zero. Similarly, we can show that when ω_{g,t} = (2t−1−T)/T for all t and g, given that U is orthogonal to 1⃗, (c) is minimized with value zero. Moreover, ω_t = Σ_{g=1}^G p_g ω_{g,t} = (2t−1−T)/T, so (a) is minimized as well. Hence, the sum of (a), (b) and (c) is minimized. □

Proof of Theorem 4.
Denote

f(ω_t, ω_{g,t}) = Σ_{t=1}^T ω_t² − (1/T)(Σ_{t=1}^T ω_t)² + (2/T) Σ_{t=1}^T (T + 1 − 2t) ω_t + Σ_{t=1}^T Σ_{j=1}^r (Σ_{g=1}^G p_g x_{gj} ω_{g,t})² − (1/T) Σ_{j=1}^r (Σ_{t=1}^T Σ_{g=1}^G p_g x_{gj} ω_{g,t})².

The variance of τ̂ is denoted g(ω_t, ω_{g,t}), and

g(ω_t, ω_{g,t}) := Var(τ̂) = σ² (z⊤(I − Γ(Γ⊤Γ)⁻¹Γ⊤)z)⁻¹ = σ²/(N(T − f(ω_t, ω_{g,t}))).

We evaluate f(ω_t, ω_{g,t}) at the optimal solution in Theorem 3 and have

f(ω*_t, ω*_{g,t}) = −Σ_{t=1}^T ((T+1−2t)/T)² = −(T+1)(T−1)/(3T),

where ω*_t = ω*_{g,t} = (2t−1−T)/T. When we use the nearest-integer rounding rule to get a feasible Z^rnd, the corresponding ω_t, ω_{g,t} are denoted ω^rnd_t, ω^rnd_{g,t}. We have |ω^rnd_{g,t} − ω*_{g,t}| ≤ 1/|O_g|, |Σ_{t=1}^T ω^rnd_{g,t}| ≤ T/|O_g|,

|ω^rnd_t − ω*_t| = |Σ_{g=1}^G p_g (ω^rnd_{g,t} − ω*_{g,t})| ≤ Σ_{g=1}^G p_g |ω^rnd_{g,t} − ω*_{g,t}| ≤ Σ_{g=1}^G p_g (1/N_min) = 1/N_min,

and |Σ_{t=1}^T ω^rnd_t| ≤ T/N_min. We have

Σ_{t=1}^T (ω^rnd_t + (T+1−2t)/T)² − (1/T)(Σ_{t=1}^T ω^rnd_t)² − Σ_{t=1}^T ((T+1−2t)/T)² ≤ 3T/N²_min − (T+1)(T−1)/(3T)

from Theorem 2,

|Σ_{g=1}^G p_g x_{gj} ω^rnd_{g,t}| = |Σ_{g=1}^G p_g x_{gj} (ω^rnd_{g,t} + (T+1−2t)/T)|
= |Σ_{g: x_{gj}>0} p_g x_{gj} (ω^rnd_{g,t} + (T+1−2t)/T) + Σ_{g: x_{gj}<0} p_g x_{gj} (ω^rnd_{g,t} + (T+1−2t)/T)|
≤ max{ |Σ_{g: x_{gj}>0} p_g x_{gj} (ω^rnd_{g,t} + (T+1−2t)/T)|, |Σ_{g: x_{gj}<0} p_g x_{gj} (ω^rnd_{g,t} + (T+1−2t)/T)| } ≤ x_{j,max}/N_min,

and

Σ_{t=1}^T Σ_{j=1}^r (Σ_{g=1}^G p_g x_{gj} ω^rnd_{g,t})² − (1/T) Σ_{j=1}^r (Σ_{t=1}^T Σ_{g=1}^G p_g x_{gj} ω^rnd_{g,t})² ≤ 3T Σ_{j=1}^r x²_{j,max}/N²_min.

We have

g(ω^rnd_t, ω^rnd_{g,t})/g(ω*_t, ω*_{g,t}) = (T − f(ω*_t, ω*_{g,t}))/(T − f(ω^rnd_t, ω^rnd_{g,t})) ≤ [(4T²−1)/(3T)] / [(4T²−1)/(3T) − 3T(1 + Σ_{j=1}^r x²_{j,max})/N²_min] ≤ 1/(1 − 3(1 + Σ_{j=1}^r x²_{j,max})/N²_min),

following 3T² ≤ 4T²−1. Note that f(ω*_t, ω*_{g,t}) is the optimal value of the convex relaxation problem, so it is a lower bound for the minimum objective value of the original integer programming problem. Thus, g(ω*_t, ω*_{g,t}) ≤ Var_{Z*}(τ̂) and we have

Var_{Z^rnd}(τ̂) ≤ (1/(1 − 3(1 + Σ_{j=1}^r x²_{j,max})/N²_min)) Var_{Z*}(τ̂). □

Appendix E: Extension to Carryover Effects
Proof of Lemmas 4 and 5
Let $\vec z^{(j)} := \big[\vec z_j^\top\ \vec z_{j+1}^\top\ \cdots\ \vec z_{T-\ell-1+j}^\top\big]^\top$ for $j=1,\dots,\ell+1$. Then $Z = \big[\vec z^{(1)}\ \vec z^{(2)}\ \cdots\ \vec z^{(\ell+1)}\big]$. Furthermore, we decompose the matrix $\Theta$ that determines $\mathrm{Var}(\hat{\vec\tau})$:
$$\Theta = Z^\top\big(I_{N(T-\ell)}-\Gamma(\Gamma^\top\Gamma)^{-1}\Gamma^\top\big)Z = \underbrace{Z^\top Z}_{\Theta_1}-\underbrace{Z^\top\Gamma(\Gamma^\top\Gamma)^{-1}\Gamma^\top Z}_{\Theta_2},$$
and simplify $\Theta_1$ and $\Theta_2$ separately. Let us first analyze $\Theta_1 = Z^\top Z$. For the $(j,j)$-th entry of $\Theta_1$, we have $\Theta_{1,jj} = N(T-\ell)$. For the $(j,m)$-th entry ($j<m$), we have
$$\Theta_{1,jm} = N\Big[(T-\ell)+\sum_{t=j}^{m-1}\big(\omega_t-\omega_{T-\ell+t}\big)\Big].$$
Next, let us consider $\Theta_2 = Z^\top\Gamma(\Gamma^\top\Gamma)^{-1}\Gamma^\top Z$. We have
$$(\Gamma^\top\Gamma)^{-1} = \begin{bmatrix}\Xi_{11} & \Xi_{12}\\ \Xi_{21} & \Xi_{22}\end{bmatrix} = \begin{bmatrix} M & -\tfrac{1}{N}M M_{\vec 1}\\[2pt] -\tfrac{1}{N}M_{\vec 1}^\top M & \tfrac{1}{N}I_{T-\ell}+\tfrac{1}{N^2}M_{\vec 1}^\top M M_{\vec 1}\end{bmatrix}\in\mathbb{R}^{(N+T-\ell-1)\times(N+T-\ell-1)},$$
where $\Xi_{11}=M$, $\Xi_{12}=-\frac1N M M_{\vec 1}$, $\Xi_{21}=-\frac1N M_{\vec 1}^\top M$, $\Xi_{22}=\frac1N I_{T-\ell}+\frac{1}{N^2}M_{\vec 1}^\top M M_{\vec 1}$, and
$$M = \frac{1}{T-\ell}\Big(I_{N-1}-\frac{1}{N}\vec 1\vec 1^\top\Big)^{-1} = \frac{1}{T-\ell}\big(I_{N-1}+\vec 1\vec 1^\top\big)\in\mathbb{R}^{(N-1)\times(N-1)},\qquad M_{\vec 1} = \big[\vec 1\ \cdots\ \vec 1\big]\in\mathbb{R}^{(N-1)\times(T-\ell)}.$$
Furthermore,
$$(\vec z^{(j)})^\top\Gamma = \Big[\textstyle\sum_{t=j}^{T-\ell-1+j}z_{1t}\ \cdots\ \sum_{t=j}^{T-\ell-1+j}z_{N-1,t}\quad \sum_{i=1}^{N}z_{ij}\ \cdots\ \sum_{i=1}^{N}z_{i,T-\ell-1+j}\Big] = \Big[(T-\ell)\zeta^{(j)}_1\ \cdots\ (T-\ell)\zeta^{(j)}_{N-1}\quad N\omega_j\ \cdots\ N\omega_{T-\ell-1+j}\Big],$$
where $\zeta^{(j)}_i = \frac{1}{T-\ell}\sum_{t=j}^{T-\ell-1+j}z_{it}$ for $i=1,\dots,N$ and $j=1,\dots,\ell+1$, and $\omega_t = \frac{1}{N}\sum_{i=1}^{N}z_{it}$ for $t=1,\dots,T$. We let $\alpha^{(j)} = \big[(T-\ell)\zeta^{(j)}_1\ \cdots\ (T-\ell)\zeta^{(j)}_{N-1}\big]^\top$ and $\beta^{(j)} = \big[N\omega_j\ \cdots\ N\omega_{T-\ell-1+j}\big]^\top$. Then
$$Z^\top\Gamma = \begin{bmatrix}(\alpha^{(1)})^\top & (\beta^{(1)})^\top\\ \vdots & \vdots\\ (\alpha^{(\ell+1)})^\top & (\beta^{(\ell+1)})^\top\end{bmatrix},$$
and the $(j,m)$-th entry of $\Theta_2 = Z^\top\Gamma(\Gamma^\top\Gamma)^{-1}\Gamma^\top Z$ is
$$\Theta_{2,jm} = (\alpha^{(j)})^\top\Xi_{11}\alpha^{(m)}+(\alpha^{(j)})^\top\Xi_{12}\beta^{(m)}+(\beta^{(j)})^\top\Xi_{21}\alpha^{(m)}+(\beta^{(j)})^\top\Xi_{22}\beta^{(m)}\in\mathbb{R}.$$
For each term in $\Theta_{2,jm}$:
$$(\alpha^{(j)})^\top\Xi_{11}\alpha^{(m)} = \frac{1}{T-\ell}(\alpha^{(j)})^\top\big(I_{N-1}+\vec 1\vec 1^\top\big)\alpha^{(m)} = (T-\ell)\sum_{i=1}^{N-1}\zeta^{(j)}_i\zeta^{(m)}_i+(T-\ell)\Big(\sum_{i=1}^{N-1}\zeta^{(j)}_i\Big)\Big(\sum_{i=1}^{N-1}\zeta^{(m)}_i\Big),$$
$$(\alpha^{(j)})^\top\Xi_{12}\beta^{(m)} = -\frac{1}{T-\ell}(\alpha^{(j)})^\top\big(I_{N-1}+\vec 1\vec 1^\top\big)\vec 1\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big) = -\frac{N}{T-\ell}(\alpha^{(j)})^\top\vec 1\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big) = -N\Big(\sum_{i=1}^{N-1}\zeta^{(j)}_i\Big)\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big),$$
$$(\beta^{(j)})^\top\Xi_{21}\alpha^{(m)} = -N\Big(\sum_{i=1}^{N-1}\zeta^{(m)}_i\Big)\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big),$$
$$(\beta^{(j)})^\top\Xi_{22}\beta^{(m)} = N\sum_{t=j}^{T-\ell-1+j}\omega_t\,\omega_{t+m-j}+\frac{N(N-1)}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big).$$
From the definitions of $\zeta^{(j)}_i$ and $\omega_t$, we have $(T-\ell)\sum_{i=1}^{N}\zeta^{(j)}_i = N\sum_{t=j}^{T-\ell-1+j}\omega_t$ for $j=1,\dots,\ell+1$, and $\sum_{i=1}^{N-1}\zeta^{(j)}_i = \frac{N}{T-\ell}\sum_{t=j}^{T-\ell-1+j}\omega_t-\zeta^{(j)}_N$. Using this property, we have
$$\Theta_{2,jm} = (T-\ell)\sum_{i=1}^{N}\zeta^{(j)}_i\zeta^{(m)}_i-\frac{N}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big)+N\sum_{t=j}^{T-\ell-1+j}\omega_t\,\omega_{t+m-j}.\tag{32}$$
Recall the definition $\zeta^{(j)}_i = \frac{1}{T-\ell}\sum_{t=j}^{T-\ell-1+j}z_{it}$. There are $T+1$ different values that $\zeta^{(j)}_i\zeta^{(m)}_i$ can take, denoted $\upsilon^{(j,m)}_0,\upsilon^{(j,m)}_1,\dots,\upsilon^{(j,m)}_T$, where $\upsilon^{(j,m)}_t$ denotes the value of $\zeta^{(j)}_i\zeta^{(m)}_i$ when unit $i$ starts treatment in period $T+1-t$ (and $\upsilon^{(j,m)}_0$ represents the value of $\zeta^{(j)}_i\zeta^{(m)}_i$ when unit $i$ stays in the control group for all periods). Without loss of generality, we assume $j\le m$ and have
$$\upsilon^{(j,m)}_t = \begin{cases} 1 & t\le \ell+1-m,\\[2pt] -\Big(1-\frac{2(T+1-t-m)}{T-\ell}\Big) & \ell+1-m<t\le \ell+1-j,\\[2pt] \Big(1-\frac{2(T+1-t-j)}{T-\ell}\Big)\Big(1-\frac{2(T+1-t-m)}{T-\ell}\Big) & \ell+1-j<t\le T+1-m,\\[2pt] 1-\frac{2(T+1-t-j)}{T-\ell} & T+1-m<t\le T+1-j,\\[2pt] 1 & T+1-j<t. \end{cases}$$
Given $\omega_t$, there are $\frac{N(1+\omega_1)}{2},\frac{N(1+\omega_2)}{2},\dots,\frac{N(1+\omega_T)}{2}$ treated units in periods $1,2,\dots,T$. Equivalently, $\frac{N(1+\omega_1)}{2},\frac{N(\omega_2-\omega_1)}{2},\dots,\frac{N(\omega_T-\omega_{T-1})}{2}$ untreated units start the treatment in periods $1,2,\dots,T$, leaving $\frac{N(1-\omega_T)}{2}$ units in the control group at the end. Therefore,
$$\sum_{i=1}^{N}\zeta^{(j)}_i\zeta^{(m)}_i = N\Big[\frac{1+\omega_1}{2}\,\upsilon^{(j,m)}_{T}+\frac{\omega_2-\omega_1}{2}\,\upsilon^{(j,m)}_{T-1}+\cdots+\frac{1-\omega_T}{2}\,\upsilon^{(j,m)}_{0}\Big] = N\Big[1+\frac{1}{2}\sum_{t=1}^{T}\big(\upsilon^{(j,m)}_{T+1-t}-\upsilon^{(j,m)}_{T-t}\big)\,\omega_t\Big],$$
following $\upsilon^{(j,m)}_0=\upsilon^{(j,m)}_T=1$. Then we can rewrite $\Theta_{2,jm}$ in terms of $\omega_t$ and $\upsilon^{(j,m)}_t$:
$$\Theta_{2,jm} = N(T-\ell)+\frac{N(T-\ell)}{2}\sum_{t=1}^{T}\big(\upsilon^{(j,m)}_{T+1-t}-\upsilon^{(j,m)}_{T-t}\big)\,\omega_t-\frac{N}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big)+N\sum_{t=j}^{T-\ell-1+j}\omega_t\,\omega_{t+m-j}.$$
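The bookkeeping between treated-unit counts $N(1+\omega_t)/2$ and new-adopter counts $N(\omega_t-\omega_{t-1})/2$ used above can be sketched numerically (toy numbers of our own choosing, assuming the $\pm1$ treatment coding):

```python
import numpy as np

T, N = 4, 8
# Linear optimal schedule from the no-carryover case: -3/4, -1/4, 1/4, 3/4.
omega = np.array([(2 * t - 1 - T) / T for t in range(1, T + 1)])

treated = N * (1 + omega) / 2                  # treated units in each period
adopters = N * np.diff(omega, prepend=-1) / 2  # units starting treatment each period
never = N * (1 - omega[-1]) / 2                # units still in control at the end

print(treated)    # [1. 3. 5. 7.]
print(adopters)   # [1. 2. 2. 2.]
assert np.allclose(np.cumsum(adopters), treated)   # adopters accumulate into treated
assert adopters.sum() + never == N                 # every unit accounted for
```

The cumulative-sum identity is exactly the "equivalently" step in the text: counting treated units per period is the same as counting new adopters per period plus the never-treated remainder.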
Recall $\Theta=\Theta_1-\Theta_2$. For the diagonal entries of $\Theta$,
$$\Theta_{jj} = -N\Big[\sum_{t=j}^{T-\ell-1+j}\omega_t^2-\frac{1}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)^2+\frac{T-\ell}{2}\sum_{t=1}^{T}\big(\upsilon^{(j,j)}_{T+1-t}-\upsilon^{(j,j)}_{T-t}\big)\,\omega_t\Big],$$
and for the off-diagonal entries of $\Theta$ with $j<m$,
$$\Theta_{jm} = -N\Big[\sum_{t=j}^{T-\ell-1+j}\omega_t\,\omega_{t+m-j}-\frac{1}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)\Big(\sum_{t=m}^{T-\ell-1+m}\omega_t\Big)+\frac{T-\ell}{2}\sum_{t=1}^{T}\big(\upsilon^{(j,m)}_{T+1-t}-\upsilon^{(j,m)}_{T-t}\big)\,\omega_t-\sum_{t=j}^{m-1}\big(\omega_t-\omega_{T-\ell+t}\big)\Big],$$
with $\Theta_{jm}=\Theta_{mj}$ for $j>m$. Thus, for the T-optimal design, the optimization problem is
$$\min_{\omega_t}\ \sum_{j=1}^{\ell+1}\Big[\sum_{t=j}^{T-\ell-1+j}\omega_t^2-\frac{1}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)^2+\frac{T-\ell}{2}\sum_{t=1}^{T}\big(\upsilon^{(j,j)}_{T+1-t}-\upsilon^{(j,j)}_{T-t}\big)\,\omega_t\Big]\tag{33}$$
$$\text{s.t.}\quad -1\le\omega_t\le 1,\ t=1,\dots,T;\qquad \omega_t\le\omega_{t+1},\ t=1,\dots,T-1,$$
where $\frac{T-\ell}{2}\big(\upsilon^{(j,j)}_{T+1-t}-\upsilon^{(j,j)}_{T-t}\big)=\frac{2(T-\ell+2j-1-2t)}{T-\ell}$ for $j\le t\le T-\ell-1+j$, and the difference vanishes for $t$ outside this window. For the D-optimal treatment design, the optimization problem is
$$\min_{\omega_t}\ -\det(\Theta)\tag{34}$$
$$\text{s.t.}\quad -1\le\omega_t\le 1,\ t=1,\dots,T;\qquad \omega_t\le\omega_{t+1},\ t=1,\dots,T-1.\qquad\Box$$
The following lemma provides an optimal solution for the quadratic program (25) in Lemma 4. This lemma is an important intermediate step in the proof of Theorem 5.
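Before that lemma, the objectives (33)–(34) can be made concrete: for any feasible staggered design, $\Theta$ is obtained by residualizing the current and lagged treatment vectors on the two-way fixed effects, the T-optimal criterion targets $\mathrm{tr}(\Theta)$ and the D-optimal criterion targets $\det(\Theta)$. A toy sketch with $\ell=1$ (our own construction, for illustration only):

```python
import numpy as np

N, T, ell = 6, 5, 1
adopt = np.array([2, 3, 3, 4, 5, 6])          # first treated period; 6 = never treated
z = np.array([[1.0 if t >= a else -1.0 for t in range(1, T + 1)] for a in adopt])

periods = range(ell, T)                        # keep periods t = ell+1, ..., T
rows = [(i, t) for i in range(N) for t in periods]
Zc = np.array([[z[i, t], z[i, t - 1]] for (i, t) in rows])   # current and lagged treatment
D = np.zeros((len(rows), N + (T - ell)))
for r, (i, t) in enumerate(rows):
    D[r, i] = 1.0                              # unit dummy
    D[r, N + (t - ell)] = 1.0                  # time dummy
R = Zc - D @ np.linalg.lstsq(D, Zc, rcond=None)[0]   # residualize on two-way effects
Theta = Zc.T @ R                               # equals R'R, since the projection is idempotent

print(np.trace(Theta), np.linalg.det(Theta))   # T- and D-optimal criteria
```

$\Theta$ comes out symmetric positive semi-definite, as it must, since it is a Gram matrix of residualized columns; a design search would compare these two scalars across candidate adoption vectors.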
Lemma 6 (Carryover Effects without Covariates). Suppose Assumptions 1-2 hold, the potential outcome follows model (14) and has no observed or latent covariates, i.e., $Y_{it}(z_{it}) = \alpha_i+\beta_t+\tau_1 z_{it}+\cdots+\tau_{\ell+1}z_{i,t-\ell}+\varepsilon_{it}$, and $T>\frac{\ell^3+13\ell^2+7\ell+3}{8\ell}$. Then an optimal solution for the quadratic program (25) is the $\vec\omega^*\in\mathbb{R}^T$ defined in Eq. (17).

Proof of Lemma 6. Denote the objective in the quadratic program (25) in Lemma 4 by
$$f(\vec\omega) := \sum_{j=1}^{\ell+1}\Big[\sum_{t=j}^{T-\ell-1+j}\omega_t^2-\frac{1}{T-\ell}\Big(\sum_{t=j}^{T-\ell-1+j}\omega_t\Big)^2+\sum_{t=j}^{T-\ell-1+j}\frac{2(T-\ell+2j-1-2t)}{T-\ell}\,\omega_t\Big].$$
The Lagrangian of $f(\vec\omega)$ is
$$\mathcal{L}(\vec\omega,\vec\lambda,\vec\kappa,\vec\iota) = f(\vec\omega)+\sum_{t=1}^{T}\lambda_t(-1-\omega_t)+\sum_{t=1}^{T}\kappa_t(\omega_t-1)+\sum_{t=1}^{T-1}\iota_t(\omega_t-\omega_{t+1}).$$
The KKT conditions of $\mathcal{L}(\vec\omega,\vec\lambda,\vec\kappa,\vec\iota)$ are
$$\frac{\partial\mathcal{L}}{\partial\omega_t} = t\,\omega_t-\frac{\sum_{j=1}^{t}s_j}{T-\ell}+\frac{(T-\ell-t)\,t}{T-\ell}-\lambda_t+\kappa_t+\iota_t-\iota_{t-1}=0,\qquad t\le\ell,\tag{35}$$
$$\frac{\partial\mathcal{L}}{\partial\omega_t} = (\ell+1)\,\omega_t-\frac{\sum_{j=1}^{\ell+1}s_j}{T-\ell}+\frac{(\ell+1)(T+1-2t)}{T-\ell}-\lambda_t+\kappa_t+\iota_t-\iota_{t-1}=0,\qquad \ell<t\le T-\ell,\tag{36}$$
$$\frac{\partial\mathcal{L}}{\partial\omega_t} = (T+1-t)\,\omega_t-\frac{\sum_{j=t-T+\ell+1}^{\ell+1}s_j}{T-\ell}+\frac{(T+1-t)(\ell+1-t)}{T-\ell}-\lambda_t+\kappa_t+\iota_t-\iota_{t-1}=0,\qquad t>T-\ell,\tag{37}$$
$$\lambda_t(-1-\omega_t)=0,\qquad \kappa_t(\omega_t-1)=0,\qquad \iota_t(\omega_t-\omega_{t+1})=0,$$
$$-1\le\omega_t\le 1,\qquad \omega_t\le\omega_{t+1},\qquad \lambda_t\ge 0,\quad\kappa_t\ge 0,\quad\iota_t\ge 0,$$
where $s_j=\sum_{t=j}^{T-\ell-1+j}\omega_t$ for $j=1,\dots,\ell+1$ and $\iota_0=0$.

The Hessian of $f(\vec\omega)$ is positive semi-definite, so any solution that satisfies the KKT conditions is optimal. First we show that the optimal solution is symmetric with respect to the origin. The proof is as follows. If $\vec\omega^\ddagger$ is an optimal solution of the quadratic program (25), then $\vec\omega^\dagger=\big[-\omega^\ddagger_T\ -\omega^\ddagger_{T-1}\ \cdots\ -\omega^\ddagger_1\big]$ attains the same objective value as $\vec\omega^\ddagger$, because $\sum_{j=1}^{\ell+1}\sum_{t=j}^{T-\ell-1+j}\frac{2(T-\ell+2j-1-2t)}{T-\ell}\,\omega_t$ is symmetric with respect to the origin, and similarly for the other two terms in $f(\vec\omega)$. Since the quadratic program (25) is convex,
$$f\Big(\frac{\vec\omega^\ddagger+\vec\omega^\dagger}{2}\Big) \le \frac{1}{2}\big[f(\vec\omega^\ddagger)+f(\vec\omega^\dagger)\big] = f(\vec\omega^\ddagger).$$
Then if $\vec\omega^\ddagger$ is optimal, we may take $\vec\omega^\dagger=\vec\omega^\ddagger$. We can therefore focus on $\vec\omega$ satisfying $\big[\omega_1\ \omega_2\ \cdots\ \omega_T\big]=\big[-\omega_T\ -\omega_{T-1}\ \cdots\ -\omega_1\big]$. From the definition of $s_j$, we have $s_j=-s_{\ell+2-j}$; if $\ell$ is even, $s_{\ell/2+1}=0$.

Now we verify that $\vec\omega^*=\big[\omega^*_1\ \omega^*_2\ \cdots\ \omega^*_T\big]$ defined in Eq. (17) satisfies the KKT conditions with feasible $\vec\lambda,\vec\kappa,\vec\iota$.
1. $\omega^*_t$ for $\ell<t\le T-\ell$: $\omega^*_t=\frac{2t-T-1}{T-\ell}$ satisfies Eq. (36) with $\lambda_t=\kappa_t=\iota_t=0$ and $\iota_\ell=0$.
2. $\omega^*_t$ for $t\le\ell$: given $\omega_t=-\omega_{T+1-t}$, we can simplify $s_j$ to
$$s_j = \begin{cases}\sum_{t=j}^{\ell+1-j}\omega_t & j=1,\dots,\lfloor(\ell+1)/2\rfloor,\\[2pt] \sum_{t=T+2-j}^{T-\ell-1+j}\omega_t & j=\lfloor(\ell+1)/2\rfloor+1,\dots,\ell+1.\end{cases}$$
As an example, when $\ell=2$ we have $s_1=\omega_1+\omega_2$, $s_2=0$, and $s_3=\omega_{T-1}+\omega_T$; when $\ell=3$ we have $s_1=\omega_1+\omega_2+\omega_3$, $s_2=\omega_2$, $s_3=\omega_{T-1}$, and $s_4=\omega_{T-2}+\omega_{T-1}+\omega_T$. Furthermore, $s_j+s_{\ell+2-j}=0$ for $1\le j\le\ell+1$. Using this property, for $\lfloor\ell/2\rfloor<t\le\ell$ we have $\sum_{j=1}^{t}s_j=\sum_{j=1}^{\ell+1-t}s_j$.

Next we show that when $\omega_t=-1$ for $t\le\lfloor\ell/2\rfloor$, there exist some $\omega_t$ for $\lfloor\ell/2\rfloor<t\le\ell$ and some feasible $\lambda_t,\kappa_t,\iota_t$ that satisfy Eq. (35). When $\omega_t=-1$ for $t\le\lfloor\ell/2\rfloor$, then for $\lfloor\ell/2\rfloor<t\le\ell$,
$$\sum_{j=1}^{t}s_j = \sum_{j=1}^{\ell+1-t}s_j = -\Big[\sum_{j=1}^{\ell+1-t}\big(\lfloor\ell/2\rfloor+1-j\big)\Big]+\min(\ell+1-t,\ell-\lfloor\ell/2\rfloor)\,\omega_{\lfloor\ell/2\rfloor+1}+\cdots+\min(\ell+1-t,2)\,\omega_{\ell-1}+\min(\ell+1-t,1)\,\omega_\ell.$$
As an example, when $\ell=2$, $s_1=-1+\omega_2$; when $\ell=3$, $s_1+s_2=-1+2\omega_2+\omega_3$ and $s_1+s_2+s_3=-1+\omega_2+\omega_3$. We can rewrite Eq. (35) for $\lfloor\ell/2\rfloor<t\le\ell$ in the following vectorized form (we will consider Eq. (35) for $t\le\lfloor\ell/2\rfloor$ in the later part of this proof):
$$\begin{bmatrix}\partial\mathcal{L}/\partial\omega_{\lfloor\ell/2\rfloor+1}\\ \vdots\\ \partial\mathcal{L}/\partial\omega_\ell\end{bmatrix} = A^{(\ell)}\begin{bmatrix}\omega_{\lfloor\ell/2\rfloor+1}\\ \vdots\\ \omega_\ell\end{bmatrix}-b^{(\ell)}-\begin{bmatrix}\lambda_{\lfloor\ell/2\rfloor+1}\\ \vdots\\ \lambda_\ell\end{bmatrix}+\begin{bmatrix}\kappa_{\lfloor\ell/2\rfloor+1}\\ \vdots\\ \kappa_\ell\end{bmatrix}+\begin{bmatrix}\iota_{\lfloor\ell/2\rfloor+1}\\ \vdots\\ \iota_\ell\end{bmatrix}-\begin{bmatrix}\iota_{\lfloor\ell/2\rfloor}\\ \vdots\\ \iota_{\ell-1}\end{bmatrix}=0,\tag{38}$$
where $A^{(\ell)}$ and $b^{(\ell)}$ are defined in Eq. (22) and Eq. (23). When $\big[\omega_{\lfloor\ell/2\rfloor+1}\ \cdots\ \omega_\ell\big]^\top=(A^{(\ell)})^{-1}b^{(\ell)}$, Eq. (38) holds with $\lambda_t=\kappa_t=\iota_t=0$ for $t=\lfloor\ell/2\rfloor+1,\dots,\ell$ and $\iota_{\lfloor\ell/2\rfloor}=0$.
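As a sanity check on the stationarity logic: in the special case $\ell=0$ the objective reduces to $f(\omega)=\sum_t\omega_t^2-\frac1T(\sum_t\omega_t)^2+2\sum_t\frac{T+1-2t}{T}\omega_t$ (our reconstruction of the no-carryover objective), whose minimizer over the box and monotonicity constraints should be the linear schedule $\omega^*_t=(2t-1-T)/T$ with all constraints slack. Plain gradient descent confirms this:

```python
import numpy as np

T = 10
t = np.arange(1, T + 1)
c = (T + 1 - 2 * t) / T          # coefficients of the linear term of f
omega = np.zeros(T)

# Gradient descent on f; the box/monotonicity constraints never bind here,
# so no projection step is needed (the iterates stay in the feasible set).
for _ in range(200):
    grad = 2 * omega - 2 * omega.sum() / T + 2 * c
    omega -= 0.25 * grad

print(omega[:3])                                     # -> [-0.9 -0.7 -0.5]
assert np.allclose(omega, (2 * t - 1 - T) / T)       # omega* = -c
assert np.all(np.diff(omega) >= 0) and np.all(np.abs(omega) <= 1)
```

The unconstrained stationarity condition $2\omega-\frac{2}{T}(\sum_t\omega_t)\vec 1+2c=0$ is solved by $\omega=-c$ (note $\sum_t c_t=0$), which is feasible, mirroring the way cases 1-2 above verify the KKT system with $\lambda_t=\kappa_t=\iota_t=0$ on the interior.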
The remaining step is to verify the constraints − ≤ ω t ≤ − (cid:96) +1 T − (cid:96) and ω t ≤ ω t +1 hold if (cid:2) ω (cid:98) (cid:96)/ (cid:99) +1 · · · ω (cid:96) (cid:3) (cid:62) = ( A ( (cid:96) ) ) − b ( (cid:96) ) .(a) The first step is to show − A ( (cid:96) ) (cid:126) ≤ b ( (cid:96) ) ≤ ( − (cid:96) +1 T − (cid:96) ) A ( (cid:96) ) (cid:126)
1. Note that the diagonal entries in A ( (cid:96) ) are positive while the off-diagonal entries in A ( (cid:96) ) are negative, then A ( (cid:96) ) t (cid:48) , : (cid:126)ω ( (cid:98) (cid:96)/ (cid:99) +1): L is increasing in ω t anddecreasing in ω s for t (cid:48) = 1 , · · · , (cid:96) − (cid:98) (cid:96)/ (cid:99) , t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) and s (cid:54) = t (cid:48) + (cid:98) (cid:96)/ (cid:99) . If − A ( (cid:96) ) (cid:126) ≤ b ( (cid:96) ) ≤ ( − (cid:96) +1 T − (cid:96) ) A ( (cid:96) ) (cid:126) ω t defined in (cid:2) ω (cid:98) (cid:96)/ (cid:99) +1 · · · ω (cid:96) (cid:3) (cid:62) = ( A ( (cid:96) ) ) − b ( (cid:96) ) is between − − (cid:96) +1 T − (cid:96) for t = (cid:98) (cid:96)/ (cid:99) + 1 , · · · , (cid:96) .First, let us show − A ( (cid:96) ) (cid:126) ≤ b ( (cid:96) ) , which is equivalent to showing every entry in A ( (cid:96) ) (cid:126) b ( (cid:96) ) is non-negative,that is, for t (cid:48) = 1 , · · · , (cid:96) − (cid:98) (cid:96)/ (cid:99) , ( A ( (cid:96) ) (cid:126) t (cid:48) + b ( (cid:96) ) t (cid:48) ≥
0. If (cid:96) is even, (cid:80) (cid:96) − tl =1 ( (cid:96) − (cid:98) (cid:96)/ (cid:99) + 1 − l ) = t ( (cid:96) +1 − t )2 and (cid:80) (cid:96) − tl =1 ( (cid:98) (cid:96)/ (cid:99) + 1 − l ) = t ( (cid:96) +1 − t )2 . Let t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) . We have( A ( (cid:96) ) (cid:126) t (cid:48) + b ( (cid:96) ) t (cid:48) = t − T − (cid:96) t ( (cid:96) + 1 − t )2 − t + t T − (cid:96) − T − (cid:96) t ( (cid:96) + 1 − t )2 = t (2 T − (cid:96) − T − (cid:96) ≥ . If (cid:96) is odd, (cid:80) (cid:96) − tj =1 ( (cid:96) − (cid:98) (cid:96)/ (cid:99) + 1 − j ) = ( t +1)( (cid:96) +1 − t )2 and (cid:80) (cid:96) − tj =1 ( (cid:98) (cid:96)/ (cid:99) + 1 − j ) = ( t − (cid:96) +1 − t )2 . We have( A ( (cid:96) ) (cid:126) t (cid:48) + b ( (cid:96) ) t (cid:48) = t − T − (cid:96) ( t + 1)( (cid:96) + 1 − t )2 − t + t T − (cid:96) − T − (cid:96) ( t − (cid:96) + 1 − t )2 = t (2 T − (cid:96) − T − (cid:96) ≥ . Second, let us show b ( (cid:96) ) ≤ ( − (cid:96) +1 T − (cid:96) ) A ( (cid:96) ) (cid:126)
1, which is equivalent to showing every entry in (1 − (cid:96) +1 T − (cid:96) ) A ( (cid:96) ) (cid:126) b ( (cid:96) ) is non-positive, that is, for t (cid:48) = 1 , · · · , (cid:96) − (cid:98) (cid:96)/ (cid:99) , (cid:0) A ( (cid:96) ) (1 − (cid:96) +1 T − (cid:96) ) (cid:126) (cid:1) t (cid:48) + b ( (cid:96) ) t (cid:48) ≤
0. If (cid:96) is even (cid:16) A ( (cid:96) ) (1 − (cid:96) + 1 T − (cid:96) ) (cid:126) (cid:17) t (cid:48) + b ( (cid:96) ) t (cid:48) = t (2 T − (cid:96) − T − (cid:96) − (cid:96) + 1 T − (cid:96) (cid:16) t − t ( (cid:96) + 1 − t )2( T − (cid:96) ) (cid:17) = t ( T − (cid:96) − T − (cid:96) (cid:16) − (cid:96) + 1 T − (cid:96) (cid:17) < t ( T − (cid:96) − < − (cid:96) +1 T − (cid:96) >
0. If (cid:96) is odd, (cid:16) A ( (cid:96) ) (1 − (cid:96) + 1 T − (cid:96) ) (cid:126) (cid:17) t (cid:48) + b ( (cid:96) ) t (cid:48) = t (2 T − (cid:96) − T − (cid:96) − (cid:96) + 1 T − (cid:96) (cid:16) t − ( t + 1)( (cid:96) + 1 − t )2( T − (cid:96) ) (cid:17) = t ( T − (cid:96) − T − (cid:96) (cid:16) − t + 12 t (cid:96) + 1 T − (cid:96) (cid:17) < t ( T − (cid:96) − < − t +12 t (cid:96) +1 T − (cid:96) > − b ( (cid:96) ) t (cid:48) / ( A ( (cid:96) ) (cid:126) t (cid:48) is non-decreasing in t (cid:48) for t (cid:48) = 1 , · · · , (cid:96) − (cid:98) (cid:96)/ (cid:99) . Note that the diagonalentries in A ( (cid:96) ) are positive while the off-diagonal entries in A ( (cid:96) ) are negative, then A ( (cid:96) ) t (cid:48) , : (cid:126)ω ( (cid:98) (cid:96)/ (cid:99) +1): L is increasingin ω t and decreasing in ω s for t (cid:48) = 1 , · · · , (cid:96) − (cid:98) (cid:96)/ (cid:99) , t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) and s (cid:54) = t (cid:48) + (cid:98) (cid:96)/ (cid:99) . If − b ( (cid:96) ) t (cid:48) / ( A ( (cid:96) ) (cid:126) t (cid:48) isnon-decreasing in t (cid:48) , then ω t is non-decreasing in t , where t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) .Let c t (cid:48) be the c t (cid:48) that satisfies ( A ( (cid:96) ) ( − c t (cid:48) T − (cid:96) ) (cid:126) t (cid:48) = b ( (cid:96) ) t (cid:48) and let t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) . If (cid:96) is even, we have t (2 T − (cid:96) − T − (cid:96) = c t (cid:48) T − (cid:96) (cid:18) t − T − (cid:96) t ( (cid:96) + 1 − t )2 (cid:19) ⇔ T − (cid:96) − c t (cid:48) t + 2 T − (cid:96) − T − (cid:96) ) . Since ∂ (2 T − (cid:96) − ∂t = 2, ∂ t +2 T − (cid:96) − T − (cid:96) ) ∂t = T − (cid:96) ) , and 2 > T − (cid:96) ) , we have c t (cid:48) increases in t and t (cid:48) . 
This implies − b ( (cid:96) ) t (cid:48) / ( A ( (cid:96) ) (cid:126) t (cid:48) is non-decreasing in t (cid:48) for even (cid:96) .If (cid:96) is odd, we have t (2 T − (cid:96) − T − (cid:96) = c t (cid:48) T − (cid:96) (cid:18) t − T − (cid:96) ( t + 1)( (cid:96) + 1 − t )2 (cid:19) ⇔ T − (cid:96) − c t (cid:48) (cid:18) t + 1)( T − (cid:96) − t ( T − (cid:96) ) (cid:19) . Since ∂ (2 T − (cid:96) − ∂t = 2, ∂ t +2 T − (cid:96) − T − (cid:96) ) ∂t ≤ (cid:96) +3 (cid:96) +1 1 T − (cid:96) , and 2 > (cid:96) +3( T − (cid:96) )( (cid:96) +1) , we have c t (cid:48) increases in t and t (cid:48) . This againimplies − b ( (cid:96) ) t (cid:48) / ( A ( (cid:96) ) (cid:126) t (cid:48) is non-decreasing in t (cid:48) for odd (cid:96) .We have verified that for (cid:98) (cid:96)/ (cid:99) < t ≤ (cid:96) , ω t defined in (cid:2) ω (cid:98) (cid:96)/ (cid:99) +1 · · · ω (cid:96) (cid:3) (cid:62) = ( A ( (cid:96) ) ) − b ( (cid:96) ) satisifies the KKTconditions. The remaining step is to verify for t ≤ (cid:98) (cid:96)/ (cid:99) , ω t defined as ω t = − ω t = −
1, constraints − ≤ ω t ≤ ω t ≤ ω t +1 for t ≤ (cid:98) (cid:96)/ (cid:99) and ω (cid:98) (cid:96)/ (cid:99) ≤ ω + (cid:98) (cid:96)/ (cid:99) +1 are satisfied. Weonly need to verify that we can find feasible λ t , κ t , ι t to satisfy Eq. (35). Since ω t = −
1, from complementaryslackness, κ t = 0. Plug ω t = − λ − ι = − s T − (cid:96)λ t − ι t + ι t − = − t + (cid:80) tj =1 s j T − (cid:96) for t = 2 , · · · (cid:98) (cid:96)/ (cid:99) − λ t + ι t − = − t + (cid:80) tj =1 s j T − (cid:96) for t = (cid:98) (cid:96)/ (cid:99) We only need to verify − t + (cid:80) tj =1 s j T − (cid:96) ≥ t = (cid:98) (cid:96)/ (cid:99) as for the other conditions, λ − ι and λ t − ι t + ι t − cantake any value by properly choosing λ t and ι .Note that T − (cid:96) (cid:80) (cid:98) (cid:96)/ (cid:99) j =1 s j = T − (cid:96) (cid:80) L +1 −(cid:98) (cid:96)/ (cid:99) j =1 s j = ( (cid:96) + 1 − (cid:98) (cid:96)/ (cid:99) )( ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1) − ( (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) ) T − (cid:96) . Furthermore, ifwe can show ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for even (cid:96) and ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for odd (cid:96) , then wehave − (cid:98) (cid:96)/ (cid:99) T − (cid:96) − ( (cid:96) + 1 − (cid:98) (cid:96)/ (cid:99) )( ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1) + ( (cid:96) + 1 − (cid:98) (cid:96)/ (cid:99) ) T − (cid:96) = ( (cid:96) + 1 − (cid:98) (cid:96)/ (cid:99) )( (cid:96) + 1) T − (cid:96) − ( (cid:96) + 1 − (cid:98) (cid:96)/ (cid:99) )( ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1) ≥ − t + (cid:80) tj =1 s j T − (cid:96) ≥ ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for even (cid:96) and ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for odd (cid:96) .” Denote c ut := − (cid:96) +1 T − (cid:96) t (cid:48) (cid:98) ( (cid:96) +1) / (cid:99) +1 . 
If we can show ω t ≤ − (cid:96) +1 T − (cid:96) t (cid:48) (cid:98) ( (cid:96) +1) / (cid:99) +1 := c ut for t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) , then it implies“ ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for even (cid:96) and ω (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) + 1 ≤ (cid:96) +1 T − (cid:96) (cid:96) +1 −(cid:98) (cid:96)/ (cid:99) for odd (cid:96) .” Note that the diagonalentries in A ( (cid:96) ) are positive while the off-diagonal entries in A ( (cid:96) ) are negative, then A ( (cid:96) ) t (cid:48) , : (cid:126)ω ( (cid:98) (cid:96)/ (cid:99) +1): L is increasingin ω t and decreasing in ω s for t = t (cid:48) + (cid:98) (cid:96)/ (cid:99) and s (cid:54) = t (cid:48) + (cid:98) (cid:96)/ (cid:99) . We only need to show ( A ( (cid:96) ) c u ) (cid:96) −(cid:98) (cid:96)/ (cid:99) ≥ b ( (cid:96) ) (cid:96) −(cid:98) (cid:96)/ (cid:99) ,where c u = (cid:2) c u (cid:98) (cid:96)/ (cid:99) +1 · · · c u(cid:96) −(cid:98) (cid:96)/ (cid:99) (cid:3) (cid:62) . If (cid:96) is even, and when T > (cid:96) +11 (cid:96) +28 , − ( A ( (cid:96) ) c u ) (cid:96) −(cid:98) (cid:96)/ (cid:99) + b ( (cid:96) ) (cid:96) −(cid:98) (cid:96)/ (cid:99) = (cid:96) ( (cid:96) − T − (cid:96) − (cid:96) ( (cid:96) + 1) T − (cid:96) (cid:18) (cid:96)(cid:96) + 2 − T − (cid:96) (cid:19) = − (cid:96)T − (cid:96) (cid:18) (cid:96) + 2 − (cid:96) + 14( T − (cid:96) ) (cid:19) < . If (cid:96) is odd, and when T > (cid:96) +13 (cid:96) +7 (cid:96) +38 (cid:96) (note that (cid:96) +13 (cid:96) +7 (cid:96) +38 (cid:96) > (cid:96) +11 (cid:96) +28 ), − ( A ( (cid:96) ) c u ) (cid:96) −(cid:98) (cid:96)/ (cid:99) + b ( (cid:96) ) (cid:96) −(cid:98) (cid:96)/ (cid:99) = (cid:96) ( (cid:96) − T − (cid:96) − (cid:96) ( (cid:96) + 1) T − (cid:96) (cid:96) + 1 (cid:96) + 3 + ( (cid:96) + 1) T − (cid:96) ) = − T − (cid:96) (cid:18) (cid:96)(cid:96) + 3 − ( L + 2) T − (cid:96) ) (cid:19) < . 
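The two sample-size thresholds invoked above (reconstructed here as $T>(\ell^2+11\ell+2)/8$ for the even-$\ell$ step and $T>(\ell^3+13\ell^2+7\ell+3)/(8\ell)$ overall; the exact polynomial forms are our reading of a garbled original) are ordered as claimed, since their difference is $(2\ell^2+5\ell+3)/(8\ell)>0$. A quick exact-arithmetic check:

```python
from fractions import Fraction

def thr_strong(ell):
    """Reconstructed threshold T > (ell^3 + 13 ell^2 + 7 ell + 3) / (8 ell)."""
    return Fraction(ell**3 + 13 * ell**2 + 7 * ell + 3, 8 * ell)

def thr_weak(ell):
    """Reconstructed threshold T > (ell^2 + 11 ell + 2) / 8."""
    return Fraction(ell**2 + 11 * ell + 2, 8)

for ell in range(1, 51):
    gap = thr_strong(ell) - thr_weak(ell)
    assert gap == Fraction(2 * ell**2 + 5 * ell + 3, 8 * ell)
    assert gap > 0     # the strong threshold dominates, as the proof asserts
```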
3. $\omega^*_t$ for $t>T-\ell$: this is the mirror image of the case $t\le\ell$, and by the symmetry established above the argument for $t\le\ell$ carries over.
We have verified that the $\vec\omega^*$ defined in Eq. (17) satisfies the KKT conditions; since the Hessian of the objective function in the quadratic program (25) is positive semi-definite, $\vec\omega^*$ is an optimal solution for the quadratic program (25). $\Box$

Proof of Theorem 5
In this proof, we first provide an equivalent and simplified form of the objective function $\mathrm{tr}\big(Z^\top\Sigma_e^{-1}(\Sigma_e-\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top)\Sigma_e^{-1}Z\big)$ in the integer program (15). We will show that the objective function is a sum of $\ell+1$ convex objectives, each similar to objective function (31) in the proof of Theorem 3. We then obtain the optimal solution by using the results in Theorem 3 and Lemma 6.

Conceptually, we derive the equivalent form of the objective in the integer program (15) using the results in Theorem 3 and Lemma 4. We decompose this task into a few steps:
1. Calculate $\Theta_1 = Z^\top\Sigma_e^{-1}Z \in \mathbb{R}^{(\ell+1)\times(\ell+1)}$.
2. Calculate Θ = Z (cid:62) Σ − e Γ(Γ (cid:62) Σ − e Γ) − Γ (cid:62) Σ − e Z ∈ R ( (cid:96) +1) × ( (cid:96) +1) (a) Calculate Z (cid:62) Σ − e Γ ∈ R ( (cid:96) +1) × ( N +( T − (cid:96) − p ) (b) Calculate (Γ (cid:62) Σ − e Γ) − ∈ R ( N +( t − − (cid:96) ) p ) × ( N +( t − − (cid:96) ) p ) (c) Calculate ( (cid:126)z ( j ) ) (cid:62) Σ − e Γ(Γ (cid:62) Σ − e Γ) − Γ (cid:62) Σ − e (cid:126)z ( m ) ∈ R (d) Calculate ( (cid:126)z ( j ) ) (cid:62) (Σ − e − Σ − e Γ(Γ (cid:62) Σ − e Γ) − Γ (cid:62) Σ − e ) (cid:126)z ( m ) ∈ R where (cid:126)z ( j ) = (cid:2) (cid:126)z (cid:62) j (cid:126)z (cid:62) j +1 · · · (cid:126)z (cid:62) T − (cid:96) − j (cid:3) (cid:62) for j = 1 , · · · , (cid:96) + 1 and Γ is equal toΓ = Γ :( (cid:96) +1) Γ :( (cid:96) +2) ...Γ : T = ˜ I (cid:126) X ˜ I (cid:126) X ... . . .˜ I (cid:126) X = ˜ I ˜ X ˜ I ˜ X ... . . .˜ I ˜ X ∈ R ( N ( T − (cid:96) )) × ( N +( T − (cid:96) − p ) , (39)where ˜ I = (cid:2) I N − p N − p,p (cid:3) (cid:62) ∈ R N × ( N − p ) , N − p,p is a matrix of 0, (cid:126) ∈ { } N . Similarly as Theorem 3, Σ − e =diag(Ψ , Ψ , · · · , Ψ) ∈ R ( N ( T − (cid:96) )) × ( N ( T − (cid:96) )) , whereΨ = Σ − e t = ( I N + U U (cid:62) ) − = I N − U ( I k + U (cid:62) U ) − U (cid:62) ∈ R N × N . = Z (cid:62) Σ − e Z hasΘ ,jm = ( (cid:126)z ( j ) ) (cid:62) Σ − e (cid:126)z ( m ) = T − (cid:96) (cid:88) t =1 (cid:126)z (cid:62) j − t Ψ (cid:126)z m − t . For the second step, Z (cid:62) Σ − e Γ has( (cid:126)z ( j ) ) (cid:62) Σ − e Γ = ( (cid:126)z ( j ) ) (cid:62) Ψ ˜ I Ψ ˜ X Ψ ˜ I Ψ ˜ X ... . . .Ψ ˜ I Ψ ˜ X = (cid:2)(cid:80) T − (cid:96)t =1 (cid:126)z (cid:62) j − t Ψ ˜
I (cid:126)z (cid:62) j Ψ ˜ X · · · (cid:126)z (cid:62) T − (cid:96) + j − Ψ ˜ X (cid:3) = (cid:2) ( φ ( j ) ) (cid:62) ( ι ( j ) ) (cid:62) (cid:3) , where ( φ ( j ) ) (cid:62) = (cid:80) T − (cid:96)t =1 (cid:126)z (cid:62) j − t Ψ ˜ I = ( (cid:80) T − (cid:96)t =1 (cid:126)z j − t ) (cid:62) Ψ ˜ I and ( ι ( j ) ) (cid:62) = (cid:2) (cid:126)z (cid:62) j Ψ ˜ X · · · (cid:126)z (cid:62) T − (cid:96) + j − Ψ ˜ X (cid:3) . Let ζ ( j ) = T − (cid:96) (cid:80) T − (cid:96) + j − t = j (cid:126)z t ∈ R N , then φ = T ( ζ ( j ) ) (cid:62) Ψ ˜ I . Note that U and ˜ X are orthogonal, we have Ψ ˜ X = ˜ X and˜ X (cid:62) Ψ ˜ X = N · I p . Then ( ι ( j ) ) (cid:62) = (cid:2) (cid:126)z (cid:62) j ˜ X · · · (cid:126)z (cid:62) T − (cid:96) + j − ˜ X (cid:3) = (cid:2) N ˜ ω (cid:62) j · · · N ˜ ω (cid:62) T − (cid:96) + j − (cid:3) = N (˜ (cid:126)ω ( j ) ) (cid:62) ∈ R ( T − (cid:96) ) p , where˜ ω t = N (cid:80) Ni =1 ˜ X i z it ∈ R p . Next is to calculate (Γ (cid:62) Σ − e Γ) − . We have similar decomposition as Theorem 3,(Γ (cid:62) Σ e Γ) − = (cid:20) Ξ Ξ Ξ Ξ (cid:21) = (cid:20) M − M ˜ M − ˜ M (cid:62) M ¯ M + ˜ M (cid:62) M ˜ M (cid:21) ∈ R ( N +( T − (cid:96) − p )) × ( N +( T − (cid:96) − p ) , where Ξ = M , Ξ = − M ˜ M , Ξ = − ˜ M (cid:62) M , Ξ = ¯ M + ˜ M (cid:62) M ˜ M with M = 1 T − (cid:96) (cid:16) ˜ I (cid:62) Ψ ˜ I − ˜ I (cid:62) Ψ ˜ X ( ˜ X (cid:62) Ψ ˜ X ) − ˜ X (cid:62) Ψ ˜ I (cid:17) − = 1 T − (cid:96) (cid:18) ˜ I (cid:62) Ψ ˜ I − N ˜ X ˜ X (cid:62) (cid:19) − ∈ R ( N − p ) × ( N − p ) ˜ M = (cid:2) ˜ I (cid:62) Ψ ˜ X ( ˜ X (cid:62) Ψ ˜ X ) − · · · ˜ I (cid:62) Ψ ˜ X ( ˜ X (cid:62) Ψ ˜ X ) − (cid:3) = (cid:2) N ˜ X · · · N ˜ X (cid:3) ∈ R ( N − p ) × (( T − (cid:96) ) p ) ¯ M = diag(( ˜ X (cid:62) Ψ ˜ X ) − , ( ˜ X (cid:62) Ψ ˜ X ) − , · · · , ( ˜ X (cid:62) Ψ ˜ X ) − ) = 1 N I ∈ R (( T − (cid:96) ) p ) × (( T − (cid:96) ) p ) Similar as Theorem 3, we can simplify M . 
M = 1 T − (cid:96) (cid:34)(cid:16) ˜ I (cid:62) Ψ ˜ I (cid:17) − + (cid:16) ˜ I (cid:62) Ψ ˜ I (cid:17) − ˜ X (cid:18) N − ˜ X (cid:62) (cid:16) ˜ I (cid:62) Ψ ˜ I (cid:17) − ˜ X (cid:19) − ˜ X (cid:62) (cid:16) ˜ I (cid:62) Ψ ˜ I (cid:17) − (cid:35)(cid:16) ˜ I (cid:62) Ψ ˜ I (cid:17) − = (cid:0) I N − p − U (1) ( I k + U (cid:62) U ) − U (cid:62) (1) (cid:1) − = I N − p + U (1) ( I k + U (cid:62) (2) U (2) ) − U (cid:62) (1) , where U (1) = (cid:2) u u · · · u N − p (cid:3) (cid:62) ∈ R ( N − p ) × k and U (2) = (cid:2) u N − p +1 · · · u N (cid:3) (cid:62) ∈ R p × k ( U = (cid:2) U (cid:62) (1) U (cid:62) (2) (cid:3) (cid:62) ). Next is to calculate ( (cid:126)z ( j ) ) (cid:62) Σ − e Γ(Γ (cid:62) Σ − e Γ) − Γ (cid:62) Σ − e (cid:126)z ( m ) for ≤ j, m ≤ (cid:96) + 1. It is equivalent to calcu-lating (cid:2) ( φ ( j ) ) (cid:62) ( ι ( j ) ) (cid:62) (cid:3) (cid:20) Ξ Ξ Ξ Ξ (cid:21) (cid:20) φ ( m ) ι ( m ) (cid:21) = ( φ ( j ) ) (cid:62) Ξ φ ( m ) + ( φ ( j ) ) (cid:62) Ξ ι ( m ) + ( ι ( j ) ) (cid:62) Ξ φ ( m ) + ( ι ( j ) ) (cid:62) Ξ ι ( m ) ,where (recall ζ ( j ) = T − (cid:96) (cid:80) T − (cid:96) + j − t = j (cid:126)z t and ˜ (cid:126)ω ( j ) = (cid:2) ˜ ω (cid:62) j · · · ˜ ω (cid:62) T − (cid:96) + j − (cid:3) )( φ ( j ) ) (cid:62) Ξ φ ( m ) = ( T − (cid:96) ) ( ζ ( j ) ) (cid:62) Ψ ˜ IM ˜ I (cid:62) Ψ ζ ( m ) ( φ ( j ) ) (cid:62) Ξ ι ( m ) = − N ( T − (cid:96) )( ζ ( j ) ) (cid:62) Ψ ˜ IM ˜ M ˜ (cid:126)ω ( m ) = − ( T − (cid:96) )( ζ ( j ) )Ψ ˜ IM ˜ X (cid:32) T − (cid:96) + m − (cid:88) t = m ˜ ω m − t (cid:33) ( ι ( j ) ) (cid:62) Ξ φ ( m ) = − N ( T − (cid:96) )(˜ (cid:126)ω ( j ) ) (cid:62) ˜ M (cid:62) M Ψ ˜ Iζ ( m ) = − ( T − (cid:96) ) (cid:32) T − (cid:96) + j − (cid:88) t = j ˜ ω t (cid:33) ˜ X (cid:62) M Ψ ˜ I (cid:62) ζ ( m ) ( ι ( j ) ) (cid:62) Ξ ι ( m ) = N T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t ˜ ω t + m − j + (cid:32) T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t (cid:33) ˜ X (cid:62) M ˜ X (cid:32) T − (cid:96) + m − 
(cid:88) t = m ˜ ω t (cid:33) φ ( j ) ) (cid:62) Ξ φ ( m ) , ( φ ( j ) ) (cid:62) Ξ ι ( m ) , ( ι ( j ) ) (cid:62) Ξ φ ( m ) and ( ι ( j ) ) (cid:62) Ξ ι ( m ) can be simplified similar as the proof ofTheorem 3 (the definition of notations can be found in the proof of Theorem 3 as well).( φ ( j ) ) (cid:62) Ξ φ ( m ) = ( T − (cid:96) )( ζ ( j ) ) (cid:62) (cid:0) Ω + γ ( N I p − δ ) − γ (cid:62) (cid:1) ζ ( m ) ( φ ( j ) ) (cid:62) Ξ ι ( m ) = − ( T − (cid:96) )( ζ ( j ) ) (cid:62) (cid:16) γ ( N I p − δ ) − ˜ X (cid:62) (cid:17) ζ ( m ) ( ι ( j ) ) (cid:62) Ξ φ ( m ) = − ( T − (cid:96) )( ζ ( j ) ) (cid:62) (cid:16) ˜ X ( N I p − δ ) − γ (cid:62) (cid:17) ζ ( m ) ( ι ( j ) ) (cid:62) Ξ ι ( m ) = N T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t ˜ ω t + m − j + ( T − (cid:96) )( ζ ( j ) ) (cid:62) (cid:16) ˜ X ( N I p − δ ) − ˜ X (cid:62) (cid:17) ζ ( m ) − NT − (cid:96) (cid:32) T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t (cid:33) (cid:32) T − (cid:96) + m − (cid:88) t = m ˜ ω t (cid:33) Similar as the proof of Theorem 3, we can show (cid:2) ( φ ( j ) ) (cid:62) ( ι ( j ) ) (cid:62) (cid:3) (cid:20) Ξ Ξ Ξ Ξ (cid:21) (cid:20) φ ( m ) ι ( m ) (cid:21) = N T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t ˜ ω t + m − j − NT − (cid:96) (cid:32) T − (cid:96) + j − (cid:88) t = j ˜ ω (cid:62) t (cid:33) (cid:32) T − (cid:96) + m − (cid:88) t = m ˜ ω t (cid:33) +( T − (cid:96) )( ζ ( j ) ) (cid:62) (cid:0) I N − U ( I k + U (cid:62) U ) − U (cid:62) (cid:1) ζ ( m ) . Next is to calculate ( (cid:126)z ( j ) ) (cid:62) (Σ − e − Σ − e Γ(Γ (cid:62) Σ − e Γ) − Γ (cid:62) Σ − e ) (cid:126)z ( m ) for ≤ j, m ≤ (cid:96) + 1 . 
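Several of the simplifications above rely on the Woodbury identity $\Psi=(I_N+UU^\top)^{-1}=I_N-U(I_k+U^\top U)^{-1}U^\top$. A quick numerical confirmation of the identity (random $U$, our own toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 8, 3
U = rng.standard_normal((N, k))

# Direct inverse versus the Woodbury form used for Psi in the proof.
lhs = np.linalg.inv(np.eye(N) + U @ U.T)
rhs = np.eye(N) - U @ np.linalg.inv(np.eye(k) + U.T @ U) @ U.T
assert np.allclose(lhs, rhs)
```

The payoff is computational: inverting the $k\times k$ matrix $I_k+U^\top U$ instead of the $N\times N$ matrix $I_N+UU^\top$ is what makes the closed forms for $\Psi$ and $M$ tractable when $k\ll N$.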
\begin{align*}
&(\vec z^{(j)})^\top \Sigma_e^{-1} \vec z^{(m)} - (\vec z^{(j)})^\top \Sigma_e^{-1}\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top\Sigma_e^{-1}\vec z^{(m)} = \Theta_{1,jm} - \Theta_{2,jm}\\
&\quad= N(T-\ell) - \sum_{t=j}^{T-\ell+j-1} \vec z_t^\top U (I_k + U^\top U)^{-1} U^\top \vec z_{t+m-j} - N \sum_{t=j}^{T-\ell+j-1} \tilde\omega_t^\top \tilde\omega_{t+m-j}\\
&\qquad+ \frac{N}{T-\ell} \Bigg(\sum_{t=j}^{T-\ell+j-1} \tilde\omega_t\Bigg)^{\!\top}\Bigg(\sum_{t=m}^{T-\ell+m-1} \tilde\omega_t\Bigg) - (T-\ell)(\zeta^{(j)})^\top \big(I_N - U(I_k+U^\top U)^{-1}U^\top\big)\zeta^{(m)}\\
&\quad= N(T-\ell) \underbrace{- \Bigg[(T-\ell)(\zeta^{(j)})^\top \zeta^{(m)} - \frac{N}{T-\ell}\Bigg(\sum_{t=j}^{T-\ell+j-1}\omega^{(1)}_t\Bigg)\Bigg(\sum_{t=m}^{T-\ell+m-1}\omega^{(1)}_t\Bigg) + N\sum_{t=j}^{T-\ell+j-1}\omega^{(1)}_t\,\omega^{(1)}_{t+m-j}\Bigg]}_{=:\,-a^{(j,m)}}\\
&\qquad+ \underbrace{N\sum_{q=2}^{p}\Bigg[-\sum_{t=j}^{T-\ell+j-1}\omega^{(q)}_t\,\omega^{(q)}_{t+m-j} + \frac{1}{T-\ell}\Bigg(\sum_{t=j}^{T-\ell+j-1}\omega^{(q)}_t\Bigg)\Bigg(\sum_{t=m}^{T-\ell+m-1}\omega^{(q)}_t\Bigg)\Bigg]}_{=:\,-b^{(j,m)}}\\
&\qquad+ \underbrace{(T-\ell)(\zeta^{(j)})^\top U(I_k+U^\top U)^{-1}U^\top \zeta^{(m)} - \sum_{t=j}^{T-\ell+j-1}\vec z_t^\top U(I_k+U^\top U)^{-1}U^\top \vec z_{t+m-j}}_{=:\,-c^{(j,m)}},
\end{align*}
where $\omega^{(1)}_t = \frac{1}{N}\sum_{i=1}^N z_{it}$ and $\omega^{(q)}_t = \frac{1}{N}\sum_{i=1}^N X_{i,q-1} z_{it}$ for $q = 2,\dots,p$.

Note that maximizing $\operatorname{tr}\big(Z^\top \Sigma_e^{-1}(\Sigma_e - \Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top)\Sigma_e^{-1} Z\big)$ is equivalent to minimizing $-\operatorname{tr}\big(Z^\top \Sigma_e^{-1}(\Sigma_e - \Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top)\Sigma_e^{-1} Z\big)$, and, up to an additive constant that does not depend on $Z$,
\[
-\operatorname{tr}\big(Z^\top \Sigma_e^{-1}(\Sigma_e - \Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top)\Sigma_e^{-1} Z\big)
= -\sum_{j=1}^{\ell+1}(\vec z^{(j)})^\top\big(\Sigma_e^{-1} - \Sigma_e^{-1}\Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top\Sigma_e^{-1}\big)\vec z^{(j)}
= \sum_{j=1}^{\ell+1}\big(a^{(j,j)} + b^{(j,j)} + c^{(j,j)}\big).
\]
From Lemma 4, we know
\[
\sum_{j=1}^{\ell+1} a^{(j,j)} = N\sum_{j=1}^{\ell+1}\Bigg[\sum_{t=j}^{T-\ell+j-1}(\omega^{(1)}_t)^2 - \frac{1}{T-\ell}\Bigg(\sum_{t=j}^{T-\ell+j-1}\omega^{(1)}_t\Bigg)^{\!2} + \sum_{t=j}^{T-\ell+j-1}\frac{2(T-\ell+j-1-t)+1}{T-\ell}\,\omega^{(1)}_t\Bigg].
\]
From Lemma 6, $\vec\omega^*$ defined in Eq. (17) minimizes $\sum_{j=1}^{\ell+1} a^{(j,j)}$. Next,
\[
\sum_{j=1}^{\ell+1} b^{(j,j)} = N\sum_{j=1}^{\ell+1}\sum_{q=2}^{p}\Bigg[\sum_{t=j}^{T-\ell+j-1}(\omega^{(q)}_t)^2 - \frac{1}{T-\ell}\Bigg(\sum_{t=j}^{T-\ell+j-1}\omega^{(q)}_t\Bigg)^{\!2}\Bigg].
\]
Each bracketed term is nonnegative by the Cauchy--Schwarz inequality and vanishes when $\omega^{(q)}_t$ is constant in $t$. Hence, if $\frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X$ for some $\mu_X \in \mathbb{R}^r$ and for all $t$, then $\sum_{j=1}^{\ell+1} b^{(j,j)}$ is minimized. Finally,
\begin{align*}
\sum_{j=1}^{\ell+1} c^{(j,j)} &= \sum_{j=1}^{\ell+1}\Bigg[\sum_{t=j}^{T-\ell+j-1}\vec z_t^\top U(I_k+U^\top U)^{-1}U^\top \vec z_t - (T-\ell)(\zeta^{(j)})^\top U(I_k+U^\top U)^{-1}U^\top \zeta^{(j)}\Bigg]\\
&= \sum_{j=1}^{\ell+1}\Bigg[\sum_{t=j}^{T-\ell+j-1}\vec z_t^\top U(I_k+U^\top U)^{-1}U^\top \vec z_t - \frac{1}{T-\ell}\Bigg(\sum_{t=j}^{T-\ell+j-1}\vec z_t\Bigg)^{\!\top} U(I_k+U^\top U)^{-1}U^\top \Bigg(\sum_{t=j}^{T-\ell+j-1}\vec z_t\Bigg)\Bigg].
\end{align*}
From Theorem 3, if $\frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U$ for some $\mu_U \in \mathbb{R}^k$ and for all $t$, then $\sum_{j=1}^{\ell+1} c^{(j,j)}$ is minimized. Hence, if the optimal design satisfies
\[
\frac{1}{N}\sum_{i=1}^N z_{it} = \omega^*_t, \qquad \frac{1}{N}\sum_{i=1}^N X_i z_{it} = \mu_X, \qquad \frac{1}{N}\sum_{i=1}^N u_i z_{it} = \mu_U, \qquad \text{for all } t \tag{40}
\]
for some $\mu_X \in \mathbb{R}^r$ and $\mu_U \in \mathbb{R}^k$, then $\sum_{j=1}^{\ell+1} a^{(j,j)}$, $\sum_{j=1}^{\ell+1} b^{(j,j)}$, and $\sum_{j=1}^{\ell+1} c^{(j,j)}$ are all minimized. Therefore $\sum_{j=1}^{\ell+1}\big(a^{(j,j)} + b^{(j,j)} + c^{(j,j)}\big)$ is minimized, which is equivalent to $\operatorname{tr}\big(Z^\top \Sigma_e^{-1}(\Sigma_e - \Gamma(\Gamma^\top\Sigma_e^{-1}\Gamma)^{-1}\Gamma^\top)\Sigma_e^{-1} Z\big)$ being maximized. $\square$

Appendix F: Additional Empirical Results
F.1. Additional Empirical Datasets
Home Medical Visits.
This data set has 40,079 records of home medical visits from January 2016 to December 2018 in Barcelona, Spain, and is publicly available on Kaggle. We aggregate the records by city and week, calculate visit rates by taking a moving average of the number of visits, and obtain a panel of 102 cities over 144 weeks. We consider home medical visit rates instead of the raw number of visits for two reasons:
1. Home visits can naturally be modeled as Poisson distributed. We calculate the visit rate by averaging the number of home visits over the past m weeks (we use m = 16), which gives an unbiased and consistent estimator of the rate. It is therefore natural to measure the impact of environmental policies by their impact on the visit rate.
2. We are interested in environmental policies that can improve public health, i.e., reduce the number of home visits. If an entry has value zero, it cannot be reduced further. Visit rates, by contrast, are mostly positive and can potentially be reduced.

Transactions from a Large Grocery Store.
This data set contains 17,880,248 transactions between May 2005 and May 2007. We aggregate transactions by household and week and obtain a panel of 7,130 frequent households over 97 weeks. Our analysis focuses on frequent shoppers because they are more familiar with the reward or loyalty program and tend to pay more attention to its changes. We define frequent shoppers as households with expenditure in at least half of the weeks in the data set, that is, in more than 48 of the 97 total weeks. We want to study whether a new approach can increase frequent households' expenditure. Candidate interventions include: offering extra reward points or discounts if customers spend a certain amount of money on a shopping trip; increasing the marginal redemption rates if customers redeem more reward points; and offering personalized promotions on the categories that customers shop less frequently. This data set can be downloaded at
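The moving-average rate construction described for the home medical visit data can be sketched in a few lines. The function name and the simulation below are illustrative, not taken from the paper:

```python
import numpy as np

def visit_rate(counts, m=16):
    """Moving-average visit rate: at each week, average the counts of
    the previous m weeks (the paper uses m = 16).  For Poisson(lam)
    weekly counts, this average is an unbiased and consistent
    estimator of lam."""
    kernel = np.ones(m) / m
    # 'valid' drops the first m-1 weeks, where the window is incomplete
    return np.convolve(np.asarray(counts, dtype=float), kernel, mode="valid")

# quick sanity check on simulated Poisson counts
rng = np.random.default_rng(0)
counts = rng.poisson(lam=3.0, size=2000)
rates = visit_rate(counts, m=16)
```

Unlike the raw counts, the smoothed rates are essentially always strictly positive, which is the second reason given above for working with rates rather than counts.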
F.2. Additional Experiment Setups
Specification.
In this section, the results are calculated based on 100 randomly selected blocks, each divided into historical control data and synthetic experimental data. For the home medical visit data, we choose the treatment effect τ = −0.05 in the experiment with direct effect only; the median visit rate is roughly 1, so the τ we choose allows the treated outcome to stay positive. For the grocery data, we choose the treatment effect τ = 10 in the experiment with direct effect only; the median expenditure is 43.54, so the scale of the treatment effect is about 23% of the median expenditure. In the experiment with carryover effects, we consider a treatment that affects the current period and four periods in the future (equivalent to ℓ = 4), and we choose negative treatment effects [τ_0, τ_1, τ_2, τ_3, τ_4] for the home medical visit data and treatment effects [τ_0, τ_1, τ_2, τ_3, τ_4] with τ_0 = 6 for the grocery data.
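To make the designs concrete, here is a minimal sketch of a linear staggered design of the kind Z_OPT denotes: the treated fraction grows roughly linearly in t, adoption is irreversible, and assignment is stratified on a covariate so the treated group stays balanced over time. The function name, the ±1 covariate coding, and the schedule ω_t ≈ (t+1)/(T+1) are simplifying assumptions for illustration, not the paper's exact construction:

```python
import numpy as np

def linear_staggered_design(N, T, X):
    """Sketch of a linear staggered design (illustration only, not the
    paper's exact Z_OPT).  X is a centered covariate in {-1, +1} with
    equally many units of each type (a simplifying assumption); the
    adoption order alternates the two strata so every treated group
    is balanced on X."""
    order = np.empty(N, dtype=int)
    order[0::2] = np.flatnonzero(X == 1)   # stratum +1
    order[1::2] = np.flatnonzero(X == -1)  # stratum -1
    Z = np.zeros((N, T))
    for t in range(T):
        # target treated fraction grows linearly: omega_t ~ (t+1)/(T+1),
        # rounded to an even count so both strata stay exactly balanced
        n_treated = 2 * int(round((t + 1) / (T + 1) * N / 2))
        Z[order[:n_treated], t] = 1.0  # once treated, always treated
    return Z
```

By construction, z_it is nondecreasing in t for every unit (a staggered rollout), and the covariate sum over treated units is zero in every period, mirroring the balance conditions the theory asks of the optimal design.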
Choice of Estimation Method.
In practice, we need to select an estimation method to estimate thetreatment effect on the experimental data. We can leverage historical control data and use the procedure inRemark 14 to find the best estimation method. Then we use this method to estimate the treatment effecton the synthetic experimental data. The results are presented in the row “Hist Winner” in Tables 6 and 8.
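A simplified stand-in for this selection step can be sketched as follows; the actual procedure is in Remark 14, and the placebo scheme and names here are illustrative assumptions:

```python
import numpy as np

def hist_winner(panels, estimators, n_placebo=50, seed=0):
    """Pick the estimator with the smallest worst-case (minimax) RMSE
    across historical control panels.  Each placebo experiment applies
    a random staggered assignment with a known treatment effect of
    zero, so any nonzero estimate is pure error."""
    rng = np.random.default_rng(seed)
    worst_rmse = {}
    for name, estimate in estimators.items():
        rmses = []
        for Y in panels:
            N, T = Y.shape
            sq_errs = []
            for _ in range(n_placebo):
                start = rng.integers(1, T, size=N)  # random adoption times
                Z = (np.arange(T)[None, :] >= start[:, None]).astype(float)
                sq_errs.append(estimate(Y, Z) ** 2)  # true effect is 0
            rmses.append(float(np.sqrt(np.mean(sq_errs))))
        worst_rmse[name] = max(rmses)  # minimax over panels
    return min(worst_rmse, key=worst_rmse.get)
```

For example, given a simple difference-in-means estimator and a deliberately biased variant, the procedure selects the unbiased one.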
Remark 17.
One possible reason for the BLUE estimator performing worse than LRME on the home visit data is that the visit rates are strongly serially correlated by construction. In GLS, we calculate the errors' covariance matrix under the assumption that errors are serially uncorrelated and use the inverse of this covariance matrix as the weighting matrix. This assumption allows us to estimate fewer parameters (N × N rather than NT × NT) in the covariance matrix. Hence feasible GLS is less efficient when the assumption of no serial correlation is violated. We also estimated the full errors' covariance matrix with NT × NT parameters, which does not rely on that assumption; however, the corresponding RMSE is much larger than the RMSEs presented in Table 6.

[Table 6 about here: RMSEs of OLS, LRME, and Hist Winner under designs Z_BA, Z_FF, Z_FF+BA, Z_OPT (left) and Z_OPT, Z_K=2, Z_K=3, Z_OPT+ (right), for (N, T) in {(25,10), (25,20), (50,10), (50,20)}.]
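To illustrate the remark, the following sketch implements feasible GLS with only an N × N cross-sectional error covariance, i.e., under the no-serial-correlation assumption discussed above. The simplified outcome model (unit effects only, no time effects), the shrinkage step, and all names are assumptions made to keep the example short; this is not the paper's estimator:

```python
import numpy as np

def fgls_tau(Y, Z):
    """Feasible GLS in the toy model Y_it = tau * z_it + u_i + e_it.
    Errors are assumed serially uncorrelated, so only an N x N
    cross-sectional covariance is estimated (N*N parameters rather
    than NT*NT, as in the remark)."""
    N, T = Y.shape
    # pilot OLS of y on (z, unit dummies) to obtain residuals
    D = np.kron(np.eye(N), np.ones((T, 1)))  # unit fixed-effect dummies
    X = np.column_stack([Z.reshape(-1), D])  # rows: (unit, period) pairs
    y = Y.reshape(-1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = (y - X @ beta).reshape(N, T)
    # N x N covariance across units, pooled over time; shrink toward the
    # diagonal so it stays invertible even when T < N
    S = resid @ resid.T / T
    S = 0.5 * S + 0.5 * np.diag(np.diag(S))
    # GLS weight: inverse covariance of the stacked errors, S^{-1} kron I_T,
    # which is valid only under the no-serial-correlation assumption
    W = np.kron(np.linalg.inv(S), np.eye(T))
    coef = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return coef[0]  # tau_hat
```

Estimating instead a full NT × NT covariance from a single panel leaves far too many parameters per observation, consistent with the larger RMSE the remark reports for that variant.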
Table 6
Home medical visit data: This table compares the RMSE based on m = 100 randomly sampled blocks for benchmark treatment designs, Z_OPT, Z_K=2, Z_K=3, Z_OPT+, and "Hist Winner" (the estimation method with minimax estimation error on historical matrices). In the synthetic experiments, the intervention has a direct treatment effect only, and we choose the treatment effect τ = −0.05. The left table shows that the linear staggered design Z_OPT outperforms benchmark treatment designs. The right table shows that we can find a better treatment design Z_OPT+ via historical data and our data-driven local search algorithm. Both tables show that LRME is the best estimation method on the home medical visit data and that the optimal estimation method found on historical data performs similarly to LRME.

[Table 7 about here: RMSEs of OLS under Z_OPT and Z_OPT-CO for (N, T) in {(25,10), (25,20), (50,10), (50,20)}.]
Table 7
Home medical visit data: This table compares the RMSE based on m = 100 randomly sampled blocks for Z_OPT and Z_OPT-CO. In the synthetic experiments, the intervention has direct and carryover treatment effects, and we choose negative treatment effects [τ_0, τ_1, τ_2, τ_3, τ_4]. The nonlinear staggered design Z_OPT-CO outperforms the linear staggered design Z_OPT.

Figure 5
Home medical visit data: This figure compares the average estimated treatment effect τ̂_Z and its standard deviation based on m = 100 randomly sampled blocks for benchmark designs, Z_OPT, Z_K=2, Z_K=3, and Z_OPT+. In the synthetic experiments, N = 50, T = 20, the intervention has a direct treatment effect only, and we choose the treatment effect τ = −0.05. The height of each bar shows τ̂_Z, while the error bar indicates the standard deviation of τ̂_Z. The red dashed line indicates the true value τ = −0.05. Note that the figures in the second row have a different y-axis scale due to the superior performance of Z_OPT, Z_K=2, Z_K=3, and Z_OPT+ over benchmark treatment designs. The bias of the various treatment designs is similar, while Z_OPT has much smaller variance than the benchmark designs, and treatment designs found using historical data, such as Z_OPT+ from Algorithm 1, have smaller variance than Z_OPT.

[Table 8 about here: RMSEs of OLS and LRME under designs Z_BA, Z_FF, Z_FF+BA, Z_OPT (left) and Z_OPT, Z_K=2, Z_K=3, Z_OPT+ (right), for (N, T) in {(50,10), (50,20), (100,10), (100,20)}.]
Table 8
Grocery data: This table compares the RMSE based on m = 100 randomly sampled blocks for benchmark treatment designs, Z_OPT, Z_K=2, Z_K=3, Z_OPT+, and "Hist Winner" (the estimation method with minimax estimation error on historical matrices). In the synthetic experiments, the intervention has a direct treatment effect only, and we choose the treatment effect τ = 10. The left table shows that the linear staggered design Z_OPT outperforms benchmark treatment designs. The right table shows that we can find a better treatment design Z_OPT+ via historical data and our data-driven local search algorithm. Both tables show that GLS is the best estimation method and that the optimal estimation method found on historical data performs similarly to GLS.

[Table 9 about here: RMSEs of OLS and LRME under Z_OPT and Z_OPT-CO for (N, T) in {(50,10), (50,20), (100,10), (100,20)}.]
Table 9
Grocery data: This table compares the RMSE based on m = 100 randomly sampled blocks for Z_OPT and Z_OPT-CO. In the synthetic experiments, the intervention has direct and carryover treatment effects, and we choose treatment effects [τ_0, τ_1, τ_2, τ_3, τ_4] with τ_0 = 6. The nonlinear staggered design Z_OPT-CO outperforms the linear staggered design Z_OPT.

Figure 6