Design and Analysis of Switchback Experiments

Abstract

In switchback experiments, a firm sequentially exposes an experimental unit to a random treatment, measures its response, and repeats the procedure for several periods to determine which treatment leads to the best outcome. Although practitioners have widely adopted this experimental design technique, the development of its theoretical properties and the derivation of optimal design procedures have been, to the best of our knowledge, elusive. In this paper, we address these limitations by establishing the necessary results to ensure that practitioners can apply this powerful class of experiments with minimal assumptions. Our main result is the derivation of the optimal design of switchback experiments under a range of different assumptions on the order of carryover effect, that is, the length of time a treatment persists in impacting the outcome. We cast the experimental design problem as a minimax discrete robust optimization problem, identify the worst-case adversarial strategy, establish structural results for the optimal design, and finally solve the problem via a continuous relaxation. For the optimal design, we derive two approaches for performing inference after running the experiment. The first provides exact randomization-based p-values and the second uses a finite population central limit theorem to conduct conservative hypothesis tests and build confidence intervals. We further provide theoretical results for our inferential procedures when the order of the carryover effect is misspecified. For firms that possess the capability to run multiple switchback experiments, we also provide a data-driven strategy to identify the likely order of carryover effect. To study the empirical properties of our results, we conduct extensive simulations. We conclude the paper by providing some practical suggestions.


Design and Analysis of Switchback Experiments

Iavor Bojinov

Technology and Operations Management Unit, Harvard Business School, Boston, MA 02163

David Simchi-Levi

Institute for Data, Systems, and Society, Department of Civil and Environmental Engineering, and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139

Jinglong Zhao

Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139


1. Introduction

Academic scholars have appreciated the benefits that experimentation brings to firms for many decades (March 1991, Sitkin 1992, Sarasvathy 2001, Thomke 2001, Kohavi and Thomke 2017). However, widespread adoption of the practice has only taken off in the last decade, partly fueled by the rapid cost reductions achieved by firms in the technology sector (Kohavi et al. 2007, 2009, Azevedo et al. 2019, Kohavi et al. 2020). Most large firms now possess internal tools for experimentation, and a growing number of smaller and more conventional companies are purchasing the capabilities from third-party sellers that offer full-stack integration (Thomke 2020). These tools typically allow simple "A/B" tests that compare the standard offering "A" to a new or improved version "B". The comparisons are made across a range of different business outcomes, and the tests are usually conducted for at least a week. This simple practice has provided tremendous value to firms (Koning et al. 2019). Some firms and authors, however, have recognized the limitations of these simple A/B tests (Bojinov et al. 2020b). Principal among these are adequately handling interference (the scenario where the assignment of one subject impacts another) [1] and estimating heterogeneous (or personalized) effects [2].

[1] Many online platforms and retail marketplaces have observed different levels of interference when the assignment of one subject impacts another. See Kastelman and Ramesh (2018), Farronato et al. (2018), Glynn et al. (2020) for online platforms (e.g., DoorDash, Lyft, Uber), and Caro and Gallien (2012), Ferreira et al. (2016), Ma et al. (2020) for retail markets (e.g., AB InBev, Rue la la, Zara).

[2] See Nie et al. (2018), Deshpande et al. (2018), Hadad et al. (2019) for estimating heterogeneous effects.

In this paper, we simultaneously tackle both of these challenges by developing a theoretical framework for the optimal design and analysis of switchback (or time series) experiments. In switchback experiments, we sequentially expose a unit to a random treatment, measure its response, and repeat the procedure for a fixed period of time (Robins 1986, Bojinov and Shephard 2019). By administering alternate treatments to the same unit, we can directly estimate an individual-level causal effect and alleviate the challenges posed by interference.

There are two classes of applications where switchback experiments are widely used in practice. The first arises when units interfere with each other either through a network or some more complicated unknown structure. For example, consider a ride-hailing platform that wants to test a new fare pricing algorithm's effectiveness in a large city (Farronato et al. 2018). Administering the test version to a subset of drivers can impact their behavior, which, in turn, could change the behavior of drivers that are receiving the old version. Directly comparing the revenue generated by the drivers across the two groups will likely provide a biased estimate of what would happen if everyone were assigned to the new version compared to the old. Instead, practitioners consider the city a single aggregated unit and use a switchback experiment to estimate the intervention's effectiveness, thereby alleviating the problem caused by interference. A similar issue often arises in marketing when, for example, a retailer wants to test the effectiveness of a new promotion planning algorithm (Ferreira et al. 2016). Administering the new version to a subset of stock keeping units (SKUs) cannibalizes the sales from the other SKUs. Again, comparing the generated revenue across the two groups is unlikely to provide an accurate measure of the promotion's effectiveness. Instead, practitioners can treat all the SKUs as a single aggregated unit and use a switchback experiment to obtain accurate estimates of the promotion's effectiveness.

The second application arises when we have a limited number of experimental units, and we believe the effects are likely to be heterogeneous. For example, Bojinov and Shephard (2019) used switchback experiments to make causal claims about the relative effectiveness of algorithms compared with humans at executing large financial trades across a range of financial markets. More generally, psychologists and biostatisticians regularly use switchback experiments whenever studying the effectiveness of an intervention on a single unit (e.g., Lillie et al. (2011) and Boruvka et al. (2018)).

There are three significant challenges to using switchback experiments. The first is that causal estimators from switchback experiments have large variances, as the precision is a function of the total number of assignments. The second is that past interventions are likely to impact future outcomes; this is often referred to as a carryover effect. Typically, many authors assume that there are no carryover effects, e.g., Chamberlain (1982), Athey and Imbens (2018), and Imai and Kim (2019), although some recent work has relaxed this assumption (Sobel 2012, Bojinov et al. 2020a). The third is that standard super population inference (where we either assume a model for the outcome [3] or that the units are sampled from an infinitely large population) requires unrealistic assumptions that fail to capture the problem's personalized nature (Bojinov and Shephard 2019).

[3] For example, Wager and Xu (2019), Johari et al. (2020) assume that the market they are experimenting on is in an equilibrium state, Glynn et al. (2020) assumes an underlying Markovian model for the outcomes, and Athey et al. (2018), Eckles et al. (2016), Sussman and Airoldi (2017), Puelz et al. (2019) make assumptions on the structure of the interference.

This paper's main contributions are to address these three challenges and present a framework that allows firms and researchers to run reliable switchback experiments. First, we derive optimal designs for switchback experiments, ensuring that we can select a design that leads to the lowest variance among the most popular class of assignment mechanisms. Second, we assume the presence of a carryover effect and show that our estimation and inference are valid both when the order of carryover effect is correctly specified and when it is misspecified, the latter leading to a minor increase in the variance. For practitioners and managers, we also propose a method to identify the order of carryover effect by running a series of carefully designed switchback experiments. Finally, we take a purely design-based perspective on uncertainty; that is, we treat the outcomes as unknown but fixed (or equivalently, we condition on the set of potential outcomes) and assume that the assignment mechanism is the only source of randomness (Abadie et al. 2020). The main benefit of a design-based perspective is that the inference, and in turn the causal conclusions, do not depend on our ability to correctly specify a model describing the phenomena we are studying, ensuring that our findings are wholly non-parametric and robust to model misspecification (Imbens and Rubin 2015, Chapter 5).

The paper is structured as follows. In Section 2, we define the notations, the assumptions, and the assignment mechanism that we focus on, which is referred to as the regular switchback experiment. In Section 3, we discuss how to design an effective regular switchback experiment under the minimax rule. We cast the design problem as a minimax robust optimization problem. We identify the worst-case adversarial strategy, establish structural results, and then explicitly find the optimal design. In Section 4, we discuss how to perform inference and conduct statistical testing based on the results obtained from an optimally designed switchback experiment. We propose an exact test for sharp null hypotheses, and an asymptotic test for testing the average treatment effect. We provide p-values and construct confidence intervals based on these two hypothesis tests. In Section 5, we discuss cases when carryover effects are misspecified. We show that our estimation and inference still remain valid, with only a little more variance. In Section 6, we run simulations to test the correctness and effectiveness of our proposed experiments under various simulation setups. In Section 7, we discuss how to conduct hypothesis testing to identify the true order of carryover effects. We give empirical illustrations on how to conduct a switchback experiment in practice. All the proofs can be found in the Appendix.

2. Notations, Assumptions, and Regular Switchback Experiments

We focus our discussion on a single experimental unit. For example, this unit could be a ride-hailing platform testing the effectiveness of a new fare pricing algorithm in a large city, or a retailer testing the effectiveness of a new promotion planning algorithm over all its SKUs. At each time point $t \in [T] = \{1, 2, \dots, T\}$, we assign the unit to receive an intervention $W_t \in \{0, 1\}$. For example, one experimental period could be several minutes to one hour for a ride-hailing platform or one to two days for a retailer; the intervention could be a new pricing or promotion planning algorithm. In some applications, external factors determine the time horizon $T$; however, when $T$ is not fixed, Section 7.2 provides details for how the manager can determine an appropriate $T$.

Following convention, we say that the unit is assigned to treatment if $W_t = 1$ and control when $W_t = 0$; in A/B testing terminology, "A" is control and "B" is treatment. The assignment path is then the collection of assignments and is denoted using a vector notation whose dimensions are specified in the subscript, $W_{1:T} = (W_1, W_2, \dots, W_T) \in \{0, 1\}^T$. We adopt the convention that $W_{1:t}$ stands for a random assignment path, while $w_{1:t}$ stands for one realization of the random assignment path. Though we focus on binary assignments, our results easily extend to more complex settings.

After administering the assigned intervention, we observe a corresponding outcome. For example, this could be total traffic or total revenue generated during each experimental period. Let $\{t : t'\} = \{t, t+1, \dots, t'\}$. Following the extended potential outcomes framework, at time $t \in [T]$, we posit that for each possible assignment path $w_{1:T}$ there exists a corresponding potential outcome denoted by $Y_t(w_{1:T})$. The set of all potential outcomes will then be written as $\mathbb{Y} = \{Y_t(w_{1:T})\}_{t \in [T],\, w_{1:T} \in \{0,1\}^T}$.

Figure 1. Illustration of assignment paths and potential outcomes when $T = 4$. The green path stands for one assignment path $w_{1:4} = (1, 1, 0, 0)$. The two red dots stand for two potential outcomes that are equal in Example 3.

Example 1. When $T = 4$, there are 16 assignment paths as shown in Figure 1. Associated with each assignment path $w_{1:4}$ are potential outcomes $Y_1(w_{1:4}), Y_2(w_{1:4}), Y_3(w_{1:4}), Y_4(w_{1:4})$.

Throughout this paper, we do not directly model the potential outcomes or impose a parametric relationship with the assignment path; instead, we treat them as unknown but fixed quantities, or, equivalently, we will implicitly condition on $\mathbb{Y}$ (Imbens and Rubin 2015, Chapter 5). The benefit of this approach is that we can be completely agnostic to the outcome process, allowing us to make nonparametric causal claims. To make inference possible, we rely on the variation introduced by the random assignment path; this is commonly referred to as the finite-sample or design-based perspective. Unlike traditional sampling-based inference, our approach does not require a hypothetical population from which we sampled our experimental units (Abadie et al. 2020).

Since the potential outcomes are fixed but unknown, we can assume that their absolute values are bounded from above (Robins et al. 1999, Bai 2019, Li et al. 2020). Assumption 1 is almost always satisfied, since it only assumes that the potential outcomes are bounded by the same constant $B$, e.g., the total traffic or revenue generated from each experimental period is some finite amount.

Assumption 1 (Bounded Potential Outcomes). Assume that the potential outcomes are bounded by some constant, i.e., $\exists B > 0$ s.t. $\forall t \in [T]$, $\forall w_{1:T} \in \{0,1\}^T$, $|Y_t(w_{1:T})| \le B$, and denote by $\mathcal{Y}$ the set of all such bounded potential outcomes, so that $\mathbb{Y} \in \mathcal{Y}$. In particular, knowledge about the magnitude of $B$ is not required.

We further make the following two assumptions that limit the dependence of the potential outcomes on assignment paths.
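As a minimal sketch of Example 1, the snippet below enumerates the $2^T$ assignment paths for $T = 4$ and counts the potential outcomes attached to them before any further assumption is imposed. The helper name `assignment_paths` is ours and purely illustrative; it is not part of the paper.

```python
from itertools import product


def assignment_paths(T):
    """Return every assignment path w_{1:T} in {0, 1}^T as a tuple."""
    return list(product([0, 1], repeat=T))


T = 4
paths = assignment_paths(T)

# Each path w_{1:4} carries one potential outcome per period: Y_1(w), ..., Y_4(w).
num_potential_outcomes = len(paths) * T

print(len(paths), num_potential_outcomes)
```

Assumptions 2 and 3 below reduce how many of these potential outcomes are actually distinct.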

Assumption 2 (Non-anticipating Potential Outcomes). Assume for any $t \in [T]$, $w_{1:t} \in \{0,1\}^{t}$, and for any $w'_{t+1:T}, w''_{t+1:T} \in \{0,1\}^{T-t}$,

$$Y_t(w_{1:t}, w'_{t+1:T}) = Y_t(w_{1:t}, w''_{t+1:T}).$$

Assumption 2 states that the potential outcomes at time $t$ do not depend on future treatments (Bojinov and Shephard 2019, Basse et al. 2019, Rambachan and Shephard 2019). Since we control the assignment mechanism, the design ensures that this assumption is satisfied.

Example 2 (Example 1 Continued). Under Assumption 2, $Y_3(1, 1, 1, 1) = Y_3(1, 1, 1, 0)$. We use $Y_3(1, 1, 1)$ to stand for both $Y_3(1, 1, 1, 1)$ and $Y_3(1, 1, 1, 0)$.

Assumption 3 (No $m$-Carryover Effects). Assume there exists a fixed and given $m$, such that for any $t \in \{m+1, m+2, \dots, T\}$, $w_{t-m:T} \in \{0,1\}^{T-t+m+1}$, and for any $w'_{1:t-m-1}, w''_{1:t-m-1} \in \{0,1\}^{t-m-1}$,

$$Y_t(w'_{1:t-m-1}, w_{t-m:T}) = Y_t(w''_{1:t-m-1}, w_{t-m:T}).$$

Assumption 3 restricts the order of carryover effect (Laird et al. 1992, Senn and Lambrou 1998, Bojinov and Shephard 2019, Basse et al. 2019). In many applications, Assumption 3 is satisfied; however, practitioners must rely on their domain knowledge to choose an appropriate $m$. For example, surge pricing on a ride-hailing platform is typically known not to carry over for more than 1 ∼ … Misspecifying $m$ will not invalidate the subsequent inference, but will lead to a slight increase in variance. See Sections 5 and 6.5 for discussions. Moreover, we can always correctly identify $m$ with a little more experimental budget. See Section 7.1 for discussions.

Under both Assumptions 2 and 3, we simplify notations as follows. For any $t \in \{m+1, \dots, T\}$, $w_{t-m:t} \in \{0,1\}^{m+1}$, and for any $w'_{1:t-m-1}, w''_{1:t-m-1} \in \{0,1\}^{t-m-1}$ and any $w'_{t+1:T}, w''_{t+1:T} \in \{0,1\}^{T-t}$,

$$Y_t(w'_{1:t-m-1}, w_{t-m:t}, w'_{t+1:T}) = Y_t(w''_{1:t-m-1}, w_{t-m:t}, w''_{t+1:T}).$$

In the remainder of this paper, we will write $Y_t(w_{1:t-m-1}, w_{t-m:t}, w_{t+1:T}) = Y_t(w_{t-m:t})$.

Example 3 (Example 2 Continued). Suppose $m = 1$. Under Assumptions 2 and 3, the potential outcomes at the two red dots in Figure 1 are equal, i.e., $Y_3(1, 1, 1) = Y_3(0, 1, 1)$.

In the potential outcomes approach to causal inference, any comparison of potential outcomes has a causal interpretation. For any $p \in \mathbb{N}$, let $\mathbf{1}_{p+1} = (1, 1, \dots, 1)$ be a vector of $p+1$ ones; let $\mathbf{0}_{p+1} = (0, 0, \dots, 0)$ be a vector of $p+1$ zeros. In this paper, we focus on the average lag-$p$ causal effect of consecutive treatments on the outcome, defined for any $p \in [T-1]$ as

$$\tau_p(\mathbb{Y}) = \frac{1}{T-p} \sum_{t=p+1}^{T} \left[ Y_t(\mathbf{1}_{p+1}) - Y_t(\mathbf{0}_{p+1}) \right]. \tag{1}$$

This estimand captures the effects of permanently deploying a new policy, and has been widely studied in longitudinal experiments since the early work of Robins (1986).

Although we focus on an average causal effect, all of our results and analysis trivially extend to the total causal effect, which does not divide the sum by the number of summands, i.e., $(T-p)\,\tau_p(\mathbb{Y})$. The optimal design, as we will show in Section 3, will not be changed. We mainly focus on the $p = m$ case in this section. We refer to Section 5 when $m$ is misspecified, and to Section 7.1 to identify $m$.

The challenge of causal inference on switchback experiments is that we only observe one assignment path. So in each period $t$, we observe at most either $Y_t(\mathbf{1}_{p+1})$ or $Y_t(\mathbf{0}_{p+1})$ (and sometimes neither). To link the observed and potential outcomes, we assume there is only one version of the treatment, and there is no non-compliance. Let $w^{\mathrm{obs}}_{1:T}$ be the realized assignment path. Let $Y^{\mathrm{obs}}_t = Y_t(w^{\mathrm{obs}}_{1:T})$ be the observed outcome at time $t$, under the realized assignment path $w^{\mathrm{obs}}_{1:T}$.

The manager chooses the design of the switchback experiment. In this paper, we narrow our scope to the family of regular switchback experiments. This family of experiments is parameterized by

$$\mathbb{T} = \{t_0 = 1 < t_1 < t_2 < \dots < t_K\} \subseteq [T],$$

where $K \in \mathbb{N}$ belongs to the set of all positive integers, and $\mathbb{T}$ contains a total of $K+1$ integers. For ease of notation, also let $t_{K+1} = T + 1$.

Definition 1 (Regular Switchback Experiments).

For any $K \in \mathbb{N}$ and any $\mathbb{T} = \{t_0 = 1$ …

Figure 2 Two designs. The blue lines stand for the possible treatment assignments that a design couldadminister. Left: regular switchback experiment (Example 4); Right: irregular switchback experiment (Example 5).

In words, the manager decides on a collection of randomization points, which consists of flipping a fair coin at each period $t \in \{t_0, \dots, t_K\}$. If the resulting flip at period $t_k$ is heads, then the manager assigns the unit to treatment during periods $(t_k, t_k + 1, \dots, t_{k+1} - 1)$; if it is tails, then the manager assigns the unit to control during periods $(t_k, t_k + 1, \dots, t_{k+1} - 1)$. For each randomization point $t \in \mathbb{T}$, the treatment probability that leads to the smallest variance is $1/2$.

Example 4. When $T = 4$, $\mathbb{T} = \{t_0 = 1, t_1 = 3\}$ corresponds to the following design: with probability $1/4$, $W_{1:4} = (1, 1, 1, 1)$; with probability $1/4$, $W_{1:4} = (1, 1, 0, 0)$; with probability $1/4$, $W_{1:4} = (0, 0, 1, 1)$; with probability $1/4$, $W_{1:4} = (0, 0, 0, 0)$.

Example 5. Not all switchback experiments are regular. For example, when $T = 4$: with probability $1/\ldots$, $W_{1:4} = (1, \ldots)$; with probability $1/\ldots$, $W_{1:4} = (1, \ldots)$; with probability $1/\ldots$, $W_{1:4} = (0, \ldots)$; with probability $1/\ldots$, $W_{1:4} = (0, \ldots)$.

Any design of switchback experiment induces a probabilistic distribution over assignment paths $w_{1:T} \in \{0,1\}^T$. Define a design of switchback experiment to be any $\eta : \{0,1\}^T \to [0, 1]$ such that

$$\sum_{w_{1:T} \in \{0,1\}^T} \eta(w_{1:T}) = 1, \qquad \eta(w_{1:T}) \ge 0, \ \forall w_{1:T} \in \{0,1\}^T.$$

Explicitly, $\eta(\cdot)$ is the underlying discrete distribution of the random assignment path $W_{1:T}$. For any regular switchback experiment $\mathbb{T}$, we refer to the probability distribution using $\eta_{\mathbb{T}}(\cdot)$.

For any $\mathbb{T}$, there are a total of $2^{K+1}$ assignment paths, which are uniquely determined by the values of $W_{t_0}, W_{t_1}, \dots, W_{t_K}$. The assignment path is random, and follows the probability distribution $\eta_{\mathbb{T}}(\cdot)$:

$$\eta_{\mathbb{T}}(w_{1:T}) = \begin{cases} 1/2^{K+1}, & \text{if } \forall k \in \{0, 1, \dots, K\},\ w_{t_k} = w_{t_k+1} = \dots = w_{t_{k+1}-1}, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

In the remainder of this paper, unless explicitly noted, all probabilities and expectations are taken with respect to this probability distribution $\eta_{\mathbb{T}}(\cdot)$.

Now that $\eta_{\mathbb{T}}(\cdot)$ is determined, following any realization of the assignment path $W_{1:T} = w_{1:T}$, we use the Horvitz-Thompson estimator to estimate the lag-$p$ effect:

$$\hat{\tau}_p(\eta_{\mathbb{T}}, w_{1:T}, \mathbb{Y}) = \frac{1}{T-p} \sum_{t=p+1}^{T} \left\{ Y^{\mathrm{obs}}_t \frac{\mathbb{1}\{w_{t-p:t} = \mathbf{1}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{1}_{p+1})} - Y^{\mathrm{obs}}_t \frac{\mathbb{1}\{w_{t-p:t} = \mathbf{0}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{0}_{p+1})} \right\}. \tag{4}$$

Since the assignment path $W_{1:T}$ is random, this Horvitz-Thompson estimator is random as well. We emphasize that the estimator depends on (i) the probability distribution behind the random assignment path, (ii) the realization of the assignment path, and (iii) the potential outcomes.

Example 6. Suppose $T = 4$, $p = m = 1$. See Figure 1. Suppose the assignments are probabilistic and $\Pr(W_t = 1) = \Pr(W_t = 0) = 1/2$, $\forall t \in [4]$. With probability $1/16$ the green assignment path is administered, $W_{1:4} = (1, 1, 0, 0)$, and $\hat{\tau}_1 = \frac{1}{3}\{4 Y_2(1, 1) + 0 - 4 Y_4(0, 0)\}$.

It is well-known that the Horvitz-Thompson estimator is unbiased under the probabilistic treatment assignment assumption, which is satisfied by regular switchback experiments.
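The following Python sketch illustrates how a regular switchback assignment path can be drawn and how the estimator in (4) can be evaluated on it. It is a sketch under stated assumptions, not the authors' implementation: the function names (`sample_path`, `window_probability`, `horvitz_thompson`) and the numeric outcomes are hypothetical, and the window probability uses the fact that, under fair and independent coin flips, a window of $p+1$ periods is all-treatment (or all-control) with probability $2^{-c}$, where $c$ counts the distinct coins governing that window.

```python
import random
from fractions import Fraction


def sample_path(points, T, rng):
    """Flip a fair coin at each randomization point and hold the result
    until the next one. Assumes period 1 is a randomization point."""
    path, current = [], None
    for t in range(1, T + 1):
        if t in points:
            current = rng.randint(0, 1)
        path.append(current)
    return path


def window_probability(points, t, p):
    """Pr(W_{t-p:t} = all ones) = Pr(W_{t-p:t} = all zeros) for fair coins."""
    # One coin governs period t-p; each randomization point in (t-p, t] adds one.
    coins = 1 + sum(1 for s in points if t - p < s <= t)
    return Fraction(1, 2 ** coins)


def horvitz_thompson(points, path, y_obs, p):
    """Evaluate estimator (4) for a realized path and observed outcomes."""
    T = len(path)
    total = Fraction(0)
    for t in range(p + 1, T + 1):
        window = path[t - p - 1:t]  # w_{t-p:t}; periods are 1-indexed
        prob = window_probability(points, t, p)
        if all(w == 1 for w in window):
            total += Fraction(y_obs[t - 1]) / prob
        elif all(w == 0 for w in window):
            total -= Fraction(y_obs[t - 1]) / prob
    return total / (T - p)


# Hypothetical observed outcomes Y_1, ..., Y_4 along the path (1, 1, 0, 0).
y_obs = [5, 3, 7, 1]
print(horvitz_thompson({1, 2, 3, 4}, [1, 1, 0, 0], y_obs, p=1))
print(sample_path({1, 3}, 4, random.Random(0)))
```

With randomization points $\{1, 2, 3, 4\}$ the first call evaluates $\frac{1}{3}\{4 Y_2 - 4 Y_4\}$, matching the form in Example 6; the same path under randomization points $\{1, 3\}$ would be weighted by $2$ instead of $4$.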

Assumption 4 (Probabilistic Treatment Assignment). Assume for any $t \in \{p+1 : T\}$, both $0 < \Pr(W_{t-p:t} = \mathbf{1}_{p+1}) < 1$ and $0 < \Pr(W_{t-p:t} = \mathbf{0}_{p+1}) < 1$.

It is easy to verify that regular switchback experiments satisfy Assumption 4, since the assignment on the first period is random, i.e., $\Pr(W_1 = 1) = 1/2$.

Theorem 1 (Unbiasedness of the Horvitz-Thompson Estimator). In a regular switchback experiment, under Assumptions 2 and 3, the Horvitz-Thompson estimator is unbiased for the average lag-$p$ causal effect of consecutive treatments on outcome, i.e.,

$$\mathbb{E}_{W_{1:T} \sim \eta_{\mathbb{T}}}\left[\hat{\tau}_p(\eta_{\mathbb{T}}, W_{1:T}, \mathbb{Y})\right] = \tau_p(\mathbb{Y}).$$

The proof of Theorem 1 is standard, by checking the expectations. We defer its proof to Section EC.3 in the Appendix. When $m$ is misspecified, the above estimator is still meaningful with causal interpretations. See Section 5 for a discussion.

To evaluate the quality of a design of experiment, we adopt the decision-theoretic framework (Berger 2013, Bickel and Doksum 2015). When the random design is $\eta_{\mathbb{T}}(\cdot)$, the assignment path $W_{1:T}$ is random. For any realization of the assignment path $w_{1:T}$ and any set of potential outcomes $\mathbb{Y}$, we define the loss function

$$L(\eta_{\mathbb{T}}, w_{1:T}, \mathbb{Y}) = \left(\hat{\tau}_p(\eta_{\mathbb{T}}, w_{1:T}, \mathbb{Y}) - \tau_p(\mathbb{Y})\right)^2$$

and the risk function

$$r(\eta_{\mathbb{T}}, \mathbb{Y}) = \mathbb{E}_{W_{1:T} \sim \eta_{\mathbb{T}}(\cdot)}\left[L(\eta_{\mathbb{T}}, W_{1:T}, \mathbb{Y})\right] = \sum_{w_{1:T} \in \{0,1\}^T} \eta_{\mathbb{T}}(w_{1:T}) \cdot \left(\hat{\tau}_p(\eta_{\mathbb{T}}, w_{1:T}, \mathbb{Y}) - \tau_p(\mathbb{Y})\right)^2. \tag{5}$$

Such a risk function quantifies the expected loss incurred by one design of experiment. Since the estimator is unbiased, the risk function also has a second interpretation: the variance of the estimator. A design with lower risk is also a design whose estimator has a lower variance.

Example 7 (Examples 4 and 6 Revisited). Suppose $T = 4$ and $p = m = 1$. As in Example 4, the experiment is $\mathbb{T} = \{1, 3\}$. With probability $1/4$, $W_{1:4} = (1, 1, 0, 0)$ and $\hat{\tau}_1(\mathbb{T}) = \frac{1}{3}\{2 Y_2(1, 1) - 2 Y_4(0, 0)\}$. So

$$L(\eta_{\mathbb{T}}, w_{1:4}, \mathbb{Y}) = \frac{1}{9}\left(Y_2(1,1) + Y_2(0,0) - Y_3(1,1) + Y_3(0,0) - Y_4(1,1) - Y_4(0,0)\right)^2.$$

As in Example 6, $\tilde{\mathbb{T}} = \{1, 2, 3, 4\}$. With probability $1/16$, $W_{1:4} = (1, 1, 0, 0)$ and $\hat{\tau}_1(\tilde{\mathbb{T}}) = \frac{1}{3}\{4 Y_2(1, 1) - 4 Y_4(0, 0)\}$. So

$$L(\eta_{\tilde{\mathbb{T}}}, w_{1:4}, \mathbb{Y}) = \frac{1}{9}\left(3 Y_2(1,1) + Y_2(0,0) - Y_3(1,1) + Y_3(0,0) - Y_4(1,1) - 3 Y_4(0,0)\right)^2.$$

Example 7 suggests that, even if the two realizations of the assignment path are the same and the potential outcomes are the same, the corresponding loss functions can differ because the designs $\mathbb{T}$ and $\tilde{\mathbb{T}}$ induce different probability distributions. This motivates us to find a probability distribution with a smaller expected loss against some $\mathbb{Y}$, which we will detail in Section 3.
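Because a regular switchback design puts equal mass on only $2^{K+1}$ paths, the expectation in Theorem 1 and the risk in (5) can be computed exactly by enumeration for small $T$. The sketch below does this for the two designs of Example 7. It assumes $p = m$, so that each potential outcome depends only on the last $p + 1$ assignments; the function names and the two outcome functions (`linear_outcome`, `constant_outcome`) are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product


def regular_paths(points, T):
    """Yield the 2^(K+1) equally likely paths of a regular design.
    Assumes period 1 is a randomization point."""
    pts = sorted(points)
    for coins in product([0, 1], repeat=len(pts)):
        path, k = [], -1
        for t in range(1, T + 1):
            if k + 1 < len(pts) and pts[k + 1] == t:
                k += 1  # a new coin takes over at this randomization point
            path.append(coins[k])
        yield tuple(path)


def ht_estimate(points, path, outcome, p):
    """Estimator (4); outcome(t, window) plays the role of Y_t(w_{t-p:t})."""
    T = len(path)
    total = Fraction(0)
    for t in range(p + 1, T + 1):
        window = path[t - p - 1:t]
        coins = 1 + sum(1 for s in points if t - p < s <= t)
        if len(set(window)) == 1:  # all ones or all zeros
            sign = 1 if window[0] == 1 else -1
            total += sign * outcome(t, window) * 2 ** coins
    return total / (T - p)


def estimand(outcome, T, p):
    """Average lag-p effect (1)."""
    ones, zeros = (1,) * (p + 1), (0,) * (p + 1)
    diffs = (outcome(t, ones) - outcome(t, zeros) for t in range(p + 1, T + 1))
    return Fraction(sum(diffs), T - p)


def mean_and_risk(points, T, outcome, p):
    """Exact expectation of the estimator and exact risk (5)."""
    tau = estimand(outcome, T, p)
    estimates = [ht_estimate(points, w, outcome, p) for w in regular_paths(points, T)]
    mean = sum(estimates) / len(estimates)
    risk = sum((e - tau) ** 2 for e in estimates) / len(estimates)
    return mean, tau, risk


def linear_outcome(t, window):    # hypothetical integer-valued outcomes
    return 10 * t + 3 * sum(window)


def constant_outcome(t, window):  # every potential outcome equals 1
    return 1


for design in ({1, 3}, {1, 2, 3, 4}):
    mean, tau, _ = mean_and_risk(design, 4, linear_outcome, 1)
    _, _, risk = mean_and_risk(design, 4, constant_outcome, 1)
    print(sorted(design), mean == tau, risk)
```

For both designs the enumerated mean equals the estimand, as Theorem 1 states. With every potential outcome equal to 1, the exact risks are $32/9$ for $\mathbb{T} = \{1, 3\}$ and $40/9$ for $\tilde{\mathbb{T}} = \{1, 2, 3, 4\}$, so randomizing in every period is worse here.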

3. Design of Regular Switchback Experiments under Minimax Rule

The minimax decision rule (Berger 2013, Wu 1981) finds an optimal design of experiment, such that the worst-case risk against an adversarial selection of potential outcomes is minimized,

$$\min_{\mathbb{T} \subseteq [T]} \max_{\mathbb{Y} \in \mathcal{Y}} r(\eta_{\mathbb{T}}, \mathbb{Y}) = \min_{\mathbb{T} \subseteq [T]} \max_{\mathbb{Y} \in \mathcal{Y}} \sum_{w_{1:T} \in \{0,1\}^T} \eta_{\mathbb{T}}(w_{1:T}) \cdot \left(\hat{\tau}_p(\eta_{\mathbb{T}}, w_{1:T}, \mathbb{Y}) - \tau_p(\mathbb{Y})\right)^2. \tag{6}$$

The goal of this section is to find the optimal $\mathbb{T}^* \subseteq [T]$. Throughout this section we assume perfect knowledge of $m$ and assume $p = m$.

In practice, there is a trade-off between having too few randomization points (corresponding to small $K$) and too many (corresponding to large $K$). Intuitively, too many decreases the probability of observing an assignment path $\mathbf{1}_{m+1}$ or $\mathbf{0}_{m+1}$, which, in turn, decreases the amount of useful data. On the other hand, too few decreases the number of independent observations and reduces our ability to produce reliable results. Both of these scenarios reduce our ability to draw valid causal claims. To make switchback experiments useful in practice, we need to find the optimal number of randomization points that allows us to draw valid inference while minimizing the variance. We formalize this goal through the minimax framework, where we try to derive the best possible design for the worst possible set of potential outcomes.

To solve the minimax problem, we start by focusing on the inner maximization part of (6). We characterize the worst-case potential outcomes by identifying two dominating strategies for the adversarial selection of potential outcomes. Denote

$$\mathbb{Y}^+ = \{Y_t(\mathbf{1}_{m+1}) = Y_t(\mathbf{0}_{m+1}) = B\}_{t \in \{m+1:T\}} \quad \text{and} \quad \mathbb{Y}^- = \{Y_t(\mathbf{1}_{m+1}) = Y_t(\mathbf{0}_{m+1}) = -B\}_{t \in \{m+1:T\}}.$$

Lemma 1. Under Assumptions 1–3, $\mathbb{Y}^+$ and $\mathbb{Y}^-$ are the only two dominating strategies for the adversarial selection of potential outcomes. That is, for any $\mathbb{T} \subseteq [T]$ and for any $\mathbb{Y} \in \mathcal{Y}$,

$$r(\eta_{\mathbb{T}}, \mathbb{Y}^+) \ge r(\eta_{\mathbb{T}}, \mathbb{Y}); \qquad r(\eta_{\mathbb{T}}, \mathbb{Y}^-) \ge r(\eta_{\mathbb{T}}, \mathbb{Y}).$$

Moreover, for any $\mathbb{Y} \in \mathcal{Y}$ such that $\mathbb{Y} \ne \mathbb{Y}^+$ or $\mathbb{Y}^-$, the above two inequalities are strict.

The proof of Lemma 1 and an implication of Lemma 1 can be found in Sections EC.4.2.2 and EC.4.2.3, respectively.

Example 8 (Example 4 Continued). Suppose $T = 4$, $p = m = 1$, and $\mathbb{T} = \{1, 3\}$. The risk function can be calculated by

$$r(\eta_{\mathbb{T}}, \mathbb{Y}) = \frac{1}{9}\left\{ \sum_{t=2}^{4} \left(Y_t(1,1) + Y_t(0,0)\right)^2 + 2\,Y_3(1,1)^2 + 2\,Y_3(0,0)^2 + 2 \sum_{t=2}^{3} \left(Y_t(1,1) + Y_t(0,0)\right)\left(Y_{t+1}(1,1) + Y_{t+1}(0,0)\right) \right\},$$

which is maximized over $\mathbb{Y} \in \mathcal{Y}$ (only) at $Y_t(1, 1) = Y_t(0, 0) = \pm B$, $\forall t \in \{2, 3, 4\}$. This is what Lemma 1 says.
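As a quick arithmetic check of Example 8 (a worked substitution added for illustration, with the $1/9$ normalization coming from $(T - p)^{-2} = 3^{-2}$), setting $Y_t(1,1) = Y_t(0,0) = B$ for $t \in \{2, 3, 4\}$ gives

$$r(\eta_{\mathbb{T}}, \mathbb{Y}^+) = \frac{1}{9}\left\{ 3\,(2B)^2 + 2B^2 + 2B^2 + 2 \cdot 2 \cdot (2B)(2B) \right\} = \frac{12 + 4 + 16}{9}\,B^2 = \frac{32}{9}\,B^2 .$$

The same value results from $\mathbb{Y}^-$, since every term is quadratic in the potential outcomes. It also agrees with the closed form (7) for this design ($K = 1$, $t_0 = 1$, $t_1 = 3$, $t_2 = 5$), where only the first sum is nonzero: $4\,(2^2 + 2^2)\,B^2 / 3^2 = 32 B^2 / 9$.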

Lemma 1 simplifies the minimax problem in (6). Instead of directly solving the minimax problem, we can now replace $\mathbb{Y}$ by either $\mathbb{Y}^+$ or $\mathbb{Y}^-$, and solve only a minimization problem.

Using Lemma 1, we now establish two structural results that limit the class of optimal designs of regular switchback experiments. Lemma 2 states the optimal starting and ending structure; Lemma 3 states the optimal middle-case structure. The proofs of Lemma 2 and Lemma 3 are deferred to Sections EC.4.3.1 and EC.4.3.2, respectively.

Lemma 2. When $\mathbb{Y} = \mathbb{Y}^+$ or $\mathbb{Y} = \mathbb{Y}^-$, under Assumptions 1–3, any optimal design of experiment $\mathbb{T}$ must satisfy

$$t_1 \ge m + 2, \quad \text{and} \quad t_K \le T - m.$$

Lemma 2 suggests that the first coin flip on period 1 should be followed by at least $m$ periods that do not flip a coin, and that the last coin flip should be followed by at least $m$ periods that do not flip a coin. It guarantees that the assignment paths during $\{1 : m+1\}$ and during $\{T-m : T\}$ are both useful, i.e., $\mathbf{1}_{m+1}$ or $\mathbf{0}_{m+1}$.

Lemma 3. When $\mathbb{Y} = \mathbb{Y}^+$ or $\mathbb{Y} = \mathbb{Y}^-$, under Assumptions 1–3, any optimal design of experiment $\mathbb{T}$ must satisfy

$$t_{k+1} - t_{k-1} \ge m, \quad \forall k \in [K].$$

Lemma 3 suggests that in every consecutive $m + 1$ periods, there could be at most 3 randomization points. This is because too many randomization points in every consecutive $m + 1$ periods decreases the chance of observing a useful assignment path of $\mathbf{1}_{m+1}$ or $\mathbf{0}_{m+1}$. Lemma 3 formalizes such intuition, and suggests that when $m$ grows large, the optimal design randomizes less often.

Lemmas 2 and 3 restrict the space of possible optimal regular switchback experiments to a smaller class of switchback experiments, which we define below.

Definition 2 (Persistent Switchback Experiments). We say a regular switchback experiment $\mathbb{T}$ is persistent if it satisfies the following three conditions:

$$t_1 \ge m + 2; \qquad t_K \le T - m; \qquad t_{k+1} - t_{k-1} \ge m, \ \forall k \in [K].$$

For persistent switchback experiments, we can explicitly calculate the risk function $r(\eta_{\mathbb{T}}, \mathbb{Y})$.

Theorem 2 (Risk Function). When $\mathbb{Y} = \mathbb{Y}^+$ or $\mathbb{Y} = \mathbb{Y}^-$, under Assumptions 1–3, the risk function for any persistent switchback experiment is given by

$$r(\eta_{\mathbb{T}}, \mathbb{Y}) = \frac{1}{(T-m)^2}\left\{ 4 \sum_{k=1}^{K+1} (t_k - t_{k-1})^2 + 8m\,(t_K - t_1) + 4m^2 K - 4m^2 + 4 \sum_{k=2}^{K} \left[(m - t_k + t_{k-1})^+\right]^2 \right\} B^2. \tag{7}$$

Theorem 2 explicitly describes the risk function of any optimal design of regular switchback experiments, which lies in the class of persistent switchback experiments. The proof of Theorem 2 is deferred to Section EC.4.4 in the appendix.

To understand the risk function in (7), note that the first summation of squares, $\sum_{k=1}^{K+1} (t_k - t_{k-1})^2$, suggests that the gap between two consecutive randomization points should not be too big, while the last summation of squares, $\sum_{k=2}^{K} [(m - t_k + t_{k-1})^+]^2$, suggests that the gap should not be too small. Such a contrast formalizes the trade-off that we have described earlier in this section.

Based on the risk function in (7), we are able to describe the optimal design, as we state in the next theorem.

Theorem 3 (Optimal Design). Under Assumptions 1–3, the optimal solution to the design of regular switchback experiment as introduced in (6) is equivalent to the optimal solution to the following subset selection problem:

$$\min_{\mathbb{T} \subset [T]} \left\{ 4 \sum_{k=0}^{K} (t_{k+1} - t_k)^2 + 8m\,(t_K - t_1) + 4m^2 K - 4m^2 + 4 \sum_{k=1}^{K-1} \left[(m - t_{k+1} + t_k)^+\right]^2 \right\}. \tag{8}$$

In particular, when $m = 0$, then $\mathbb{T}^* = \{1, 2, 3, \dots, T\}$; when $m > 0$, and if there exists $n \ge 4$, $n \in \mathbb{N}$, s.t. $T = nm$, then $\mathbb{T}^* = \{1, 2m + 1, 3m + 1, \dots, (n-2)m + 1\}$.
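To make the subset selection problem concrete, the sketch below evaluates the closed-form risk of Theorem 2 (divided by $B^2$) and searches all persistent designs by brute force for a small horizon. It is an illustrative check under stated assumptions rather than the solution method of the paper, which goes through a continuous relaxation: the function names are hypothetical, only designs with $K \ge 1$ are enumerated, and the search is restricted to the persistent class of Definition 2, where formula (7) applies.

```python
from fractions import Fraction
from itertools import combinations


def is_persistent(points, T, m):
    """Check Definition 2 for randomization points t_0 = 1 < t_1 < ... < t_K."""
    t = list(points) + [T + 1]  # append t_{K+1} = T + 1
    K = len(points) - 1
    if K < 1 or t[0] != 1:
        return False
    if t[1] < m + 2 or t[K] > T - m:
        return False
    return all(t[k + 1] - t[k - 1] >= m for k in range(1, K + 1))


def risk_coefficient(points, T, m):
    """Worst-case risk of a persistent design divided by B^2, from (7)."""
    t = list(points) + [T + 1]
    K = len(points) - 1
    gaps = sum((t[k + 1] - t[k]) ** 2 for k in range(K + 1))
    short = sum(max(m - t[k + 1] + t[k], 0) ** 2 for k in range(1, K))
    total = (4 * gaps + 8 * m * (t[K] - t[1])
             + 4 * m ** 2 * K - 4 * m ** 2 + 4 * short)
    return Fraction(total, (T - m) ** 2)


def best_persistent_design(T, m):
    """Brute-force minimizer of (7) over all persistent designs."""
    candidates = []
    for size in range(1, T):
        for rest in combinations(range(2, T + 1), size):
            design = (1,) + rest
            if is_persistent(design, T, m):
                candidates.append((risk_coefficient(design, T, m), design))
    return min(candidates)


print(best_persistent_design(12, 2))
print(risk_coefficient((1, 4, 7, 10), 12, 2))
```

For $T = 12$ and $m = 2$ the search returns the design $\{1, 5, 7, 9\}$ with risk $\frac{64}{25} B^2$, which matches the form $\{1, 2m+1, 3m+1, \dots, (n-2)m+1\}$ with $n = 6$; the evenly spaced design $\{1, 4, 7, 10\}$ has the larger risk $\frac{68}{25} B^2$.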
The optimal designs under two notable special cases are: when $m = 0$, $\mathbb{T}^* = \{1, 2, 3, \ldots, T\}$; and when $m = 1$, $\mathbb{T}^* = \{1, 3, 4, \ldots, T-1\}$. When managers believe there to be very little carryover effect, the optimal designs are almost the same. Moreover, Theorem 3 presents the optimal design in a class of perfect cases where the time horizon splits evenly into epochs. In practice, selecting $T$ is part of the design of the experiment; our recommendation is to pick a $T$ that satisfies the conditions in Theorem 3. See Section 7.2.

We can also find the optimal design for other imperfect cases by solving (8); however, since there are integrality issues in the subset selection problem, the discussion of the optimal design in such imperfect cases is rather technical. We defer these details to Section EC.4.5 in the appendix. The proof of Theorem 3 is deferred to Section EC.4.5.1 in the appendix.

Example 9 (An Optimal Design).

When $T = 12$, $p = m = 2$, the optimal design of regular switchback experiment is $\mathbb{T}^* = \{1, 5, 7, 9\}$. See Table 1. □

Table 1. An example of the optimal design $\mathbb{T}^*$ versus an arbitrary design $\tilde{\mathbb{T}}$ when $T = 12$ and $p = m = 2$.

| $t$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| $\mathbb{T}^*$ | ✓ | − | − | − | ✓ | − | ✓ | − | ✓ | − | − | − |
| $\tilde{\mathbb{T}}$ | ✓ | − | − | ✓ | − | − | ✓ | − | − | ✓ | − | − |

Each checkmark beneath a time period $t$ indicates that $t$ is a randomization point.

It is worth noting that both the causal estimand and the Horvitz-Thompson estimator involve consecutive treatments or controls for $m+1$ periods. By contrast, Theorem 3 suggests that the optimal design has epochs of equal length $m$ (ignoring the first and last epoch).

At first sight this is counter-intuitive. Intuitively, each epoch should contain at least $m+1$ periods so that there exist periods that always have consecutive treatments $\mathbf{1}_{m+1}$ or $\mathbf{0}_{m+1}$ and always generate useful data; e.g., periods $\tilde{t}$ = 4, , 10 in the third row of Table 1. However, even if each epoch had $m+1$ periods, there are still many periods that do not always generate useful data (e.g., periods $\tilde{t}$ = 5, , ,
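The comparison in Example 9 can be checked numerically. The sketch below is an illustration rather than the authors' code: it evaluates the bracketed objective of (8) for a given set of randomization points, assuming the conventions $t_0 = 1$ and $t_{K+1} = T + 1$, and builds the closed-form design of Theorem 3. The function names `design_objective` and `optimal_design` are hypothetical.

```python
def design_objective(points, T, m):
    """Bracketed term of (7)/(8) for a persistent design.

    points: sorted randomization points t_0 = 1 < t_1 < ... < t_K (K >= 1).
    Assumes the convention t_{K+1} = T + 1 (an assumption of this sketch).
    """
    t = list(points) + [T + 1]
    K = len(points) - 1
    gaps = [t[k + 1] - t[k] for k in range(K + 1)]      # t_{k+1} - t_k, k = 0..K
    too_long = 4 * sum(g ** 2 for g in gaps)            # penalizes long gaps
    spread = 8 * m * (t[K] - t[1]) + 4 * m ** 2 * K - 4 * m ** 2
    # Penalizes interior gaps shorter than m (k = 1..K-1)
    too_short = 4 * sum(max(m - gaps[k], 0) ** 2 for k in range(1, K))
    return too_long + spread + too_short


def optimal_design(T, m):
    """Closed-form design of Theorem 3, valid when m = 0 or T = n*m with n >= 4."""
    if m == 0:
        return list(range(1, T + 1))
    n, rem = divmod(T, m)
    if rem != 0 or n < 4:
        raise ValueError("closed form requires T = n*m with integer n >= 4")
    return [1] + [k * m + 1 for k in range(2, n - 1)]   # 1, 2m+1, ..., (n-2)m+1


t_star = optimal_design(12, 2)      # design T* of Table 1
t_tilde = [1, 4, 7, 10]             # design with epochs of length m + 1
print(t_star, design_objective(t_star, 12, 2), design_objective(t_tilde, 12, 2))
# prints: [1, 5, 7, 9] 256 272
```

Multiplying either objective value by $B^2/(T-m)^2$ gives the corresponding worst-case risk in (7), so the smaller value for $\mathbb{T}^*$ reflects the smaller risk of the optimal design in this example.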

4. Inference and Statistical Testing

After designing and running the experiment, we obtain two time series. The first is the observed assignment path $w^{obs}_{1:T}$, and the second is the corresponding observed outcomes $Y^{obs}_{p+1:T}$. See Figure 3. To draw inference from these data we propose two methods, exact inference and asymptotic inference, as we detail below.

Figure 3. Illustration of the observed assignment path $w^{obs}_{1:T}$ (blue and red dots) and the observed outcomes $Y^{obs}_{p+1:T}$ (black curve). The dashed lines are the potential outcomes under consecutive treatments / controls.

Throughout this section we assume perfect knowledge of $m$, i.e., $p = m$, and we write $\tau_m$ and $\hat{\tau}_m$ for $\tau_p$ and $\hat{\tau}_p$, respectively. When $m \neq p$, our inference methods are still valid. See Section 5 for a discussion, and see Section 6.5 for a numerical example.

We propose an exact non-parametric test for the sharp null of no causal effect at every time point,

$$H_0: Y_t(\mathbf{1}_{m+1}) - Y_t(\mathbf{0}_{m+1}) = 0 \quad \text{for all } t = 1, \ldots, T. \qquad (9)$$

This will be tested against a portmanteau alternative. The sharp null hypothesis implies that $Y_t(w^{obs}_{t-m:t}) = Y_t(w'_{t-m:t})$ for all $w'_{t-m:t} \in \{0,1\}^{m+1}$; that is, regardless of the assignment path $w'_{t-m:t}$ we would have observed the same outcomes.

We can conduct exact tests by using the known assignment mechanism to simulate new assignment paths. Algorithm 1 provides the details of how to implement it. In particular, under the sharp null hypothesis of no treatment effect (9), any assignment path $w^{[i]}_{1:T}$ leads to the same observed outcomes. So in Step 3, we always plug in the same observed outcomes $Y^{obs}_{p+1:T}$. To obtain a confidence interval, we propose inverting a sequence of exact hypothesis tests to identify the region outside of which (9) is violated at the prespecified nominal level (Imbens and Rubin 2015, Chapter 5).

Algorithm 1. Algorithm for performing a sharp-null hypothesis test

Require: Fix $I$, the total number of samples drawn.
for $i$ in $1:I$ do
    Sample a new assignment path $w^{[i]}_{1:T}$ according to the assignment mechanism.
    Hold $Y^{obs}_{p+1:T}$ unchanged. Compute $\hat{\tau}^{[i]}$ according to (4),
    $$\hat{\tau}^{[i]} = \frac{1}{T-m} \sum_{t=m+1}^{T} \left\{ Y^{obs}_t \frac{\mathbb{1}\{w^{[i]}_{t-m:t} = \mathbf{1}_{m+1}\}}{\Pr(W_{t-m:t} = \mathbf{1}_{m+1})} - Y^{obs}_t \frac{\mathbb{1}\{w^{[i]}_{t-m:t} = \mathbf{0}_{m+1}\}}{\Pr(W_{t-m:t} = \mathbf{0}_{m+1})} \right\}.$$
end for
Compute $\hat{p}_F = I^{-1} \sum_{i=1}^{I} \mathbb{1}\{|\hat{\tau}^{[i]}| > |\hat{\tau}|\}$.
return $\hat{p}_F$, the estimated $p$-value. For large $I$, this is exact.

Now we introduce the asymptotic inference method, which tests the following null hypothesis,

$$H_0: \tau_m = \frac{1}{T-m} \sum_{t=m+1}^{T} \left[ Y_t(\mathbf{1}_{m+1}) - Y_t(\mathbf{0}_{m+1}) \right] = 0. \qquad (10)$$

The test of this null hypothesis is based on the Horvitz-Thompson estimator in (4) being asymptotically Gaussian. Below we introduce how to conduct asymptotic inference.

Assume that $n = T/m \geq 4$ is an integer, as in Theorem 3, so that the assignment path is determined by the realizations at the randomization points of the optimal design. To make the dependence on randomization clear, we introduce the following notation. Let $\mathbb{N}$ be the set of all non-negative integers. For any $k \in \mathbb{N}$, let $\bar{Y}_k(\mathbf{1}_{m+1}) = \sum_{t=(k+1)m+1}^{(k+2)m} Y_t(\mathbf{1}_{m+1})$ and $\bar{Y}_k(\mathbf{0}_{m+1}) = \sum_{t=(k+1)m+1}^{(k+2)m} Y_t(\mathbf{0}_{m+1})$. Moreover, let $\bar{Y}^{obs}_k = \sum_{t=(k+1)m+1}^{(k+2)m} Y^{obs}_t$ be the sum of observed outcomes.

Theorem 4 (Variance of the Horvitz-Thompson Estimator).
Under Assumptions 1–3, if $n = T/m \geq 4$ is an integer, then under the optimal design as shown in Theorem 3, the variance of the Horvitz-Thompson estimator, $\mathrm{Var}(\hat{\tau}_m)$, is

$$\begin{aligned}
\mathrm{Var}(\hat{\tau}_m) = \frac{1}{(T-m)^2} \Bigg\{ & \bar{Y}_0(\mathbf{1}_{m+1})^2 + \bar{Y}_0(\mathbf{0}_{m+1})^2 + 2\bar{Y}_0(\mathbf{1}_{m+1})\bar{Y}_0(\mathbf{0}_{m+1}) \\
& + \sum_{k=1}^{n-3} \left[ 3\bar{Y}_k(\mathbf{1}_{m+1})^2 + 3\bar{Y}_k(\mathbf{0}_{m+1})^2 + 2\bar{Y}_k(\mathbf{1}_{m+1})\bar{Y}_k(\mathbf{0}_{m+1}) \right] \\
& + \bar{Y}_{n-2}(\mathbf{1}_{m+1})^2 + \bar{Y}_{n-2}(\mathbf{0}_{m+1})^2 + 2\bar{Y}_{n-2}(\mathbf{1}_{m+1})\bar{Y}_{n-2}(\mathbf{0}_{m+1}) \\
& + 2\sum_{k=0}^{n-3} \left[ \bar{Y}_k(\mathbf{1}_{m+1}) + \bar{Y}_k(\mathbf{0}_{m+1}) \right] \cdot \left[ \bar{Y}_{k+1}(\mathbf{1}_{m+1}) + \bar{Y}_{k+1}(\mathbf{0}_{m+1}) \right] \Bigg\} \qquad (11)
\end{aligned}$$

Theorem 4 provides the variance of the estimator. Since we never observe all the potential outcomes, many of the cross-product terms in the variance can never be estimated. As an alternative, we provide the following two upper bounds, and detail two unbiased estimators of these two upper bounds, respectively.

Corollary 1.

Under the conditions in Theorem 4, there exist two upper bounds for the varianceof the Horvitz-Thompson estimator,

$$\mathrm{Var}(\hat{\tau}_m) \leq \mathrm{Var}^{U1}(\hat{\tau}_m) \leq \mathrm{Var}^{U2}(\hat{\tau}_m).$$

These two upper bounds $\mathrm{Var}^{U1}(\hat{\tau}_m)$ and $\mathrm{Var}^{U2}(\hat{\tau}_m)$ can be estimated by $\hat{\sigma}^2_{U1}$ and $\hat{\sigma}^2_{U2}$, respectively, where

$$\begin{aligned}
\hat{\sigma}^2_{U1} = \frac{1}{(T-m)^2} \Bigg\{ & 6(\bar{Y}^{obs}_0)^2 + \sum_{k=1}^{n-3} 24 (\bar{Y}^{obs}_k)^2 \, \mathbb{1}\{W_{km+1} = W_{(k+1)m+1}\} + 6(\bar{Y}^{obs}_{n-2})^2 \\
& + \sum_{k=0}^{n-3} 16 \, \bar{Y}^{obs}_k \bar{Y}^{obs}_{k+1} \, \mathbb{1}\{W_{km+1} = W_{(k+1)m+1} = W_{(k+2)m+1}\} \Bigg\},
\end{aligned}$$

and

$$\hat{\sigma}^2_{U2} = \frac{1}{(T-m)^2} \left\{ 8(\bar{Y}^{obs}_0)^2 + \sum_{k=1}^{n-3} 32 (\bar{Y}^{obs}_k)^2 \, \mathbb{1}\{W_{km+1} = W_{(k+1)m+1}\} + 8(\bar{Y}^{obs}_{n-2})^2 \right\}.$$

Moreover, $\hat{\sigma}^2_{U1}$ and $\hat{\sigma}^2_{U2}$ are unbiased, i.e., $E[\hat{\sigma}^2_{U1}] = \mathrm{Var}^{U1}(\hat{\tau}_m)$ and $E[\hat{\sigma}^2_{U2}] = \mathrm{Var}^{U2}(\hat{\tau}_m)$.

Corollary 1 provides the foundation for conservative inference. We make the following technical assumption for the asymptotic normal distribution to hold.

Assumption 5 (Non-negligible Variance). Assume that the randomization distribution has a non-negligible variance, i.e.,

$$\mathrm{Var}(\hat{\tau}_m) \geq \Omega(n^{-1}) \qquad (12)$$

In particular, one sufficient condition for (12) is that all the potential outcomes are positive, i.e., there exists some constant $b > 0$ such that $\forall t \in [T]$, $\forall w_{1:T} \in \{0,1\}^T$, $Y_t(w_{1:T}) \geq b$.

Intuitively, the key to the Central Limit Theorem is that all the variables have roughly the same order of variance, i.e., there cannot be a small number of variables whose variances make up the majority of the sum. Assumption 5 requires the variance to be large enough that it cannot come from only a few of the time periods.

Theorem 5 (Asymptotic Normality). Let $m$ be fixed. For any $n \geq 4 \in \mathbb{N}$, define an $n$-replica experiment such that there are $T = nm$ time periods. We take the optimal design as in Theorem 3, whose randomization points are at $\mathbb{T}^* = \{1, 2m+1, 3m+1, \ldots, (n-2)m+1\}$. Under Assumptions 2–3, and under Assumption 5, the Horvitz-Thompson estimator in the $n$-replica experiment is asymptotically normal. That is, let $\mathrm{Var}(\hat{\tau}_m)$ be defined as in Theorem 4. As $n \to +\infty$,

$$\frac{\hat{\tau}_m - \tau_m}{\sqrt{\mathrm{Var}(\hat{\tau}_m)}} \xrightarrow{D} N(0, 1).$$

In particular, Theorem 5 does not require $\mathrm{Var}(\hat{\tau}_m)$ to converge as $n \to +\infty$.

To conduct inference, we replace $\mathrm{Var}(\hat{\tau}_m)$ by $\hat{\sigma}^2$, one of the two estimated bounds provided in Corollary 1. Define the test statistic to be $z = |\hat{\tau}_m| / \sqrt{\hat{\sigma}^2}$. When the alternative hypothesis is two-sided, the estimated $p$-value is given by $\hat{p}_N = 2 - 2\Phi(z)$, where $\Phi$ is the CDF of a standard normal distribution.

The proofs of Theorem 4, Corollary 1, and Theorem 5 are deferred to Sections EC.5.2, EC.5.3, and EC.5.4 in the Appendix, respectively.
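The two inference procedures of this section can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: each randomization point is assumed to flip an independent fair coin, so that $\Pr(W_{t-m:t} = \mathbf{1}_{m+1}) = 2^{-J}$ where $J$ is the number of distinct coins in the window; outcomes are passed as a full-length list whose first $m$ entries are ignored; and all function names are hypothetical. Only the second variance bound $\hat{\sigma}^2_{U2}$ is implemented.

```python
import math
import random


def epoch_of(points, T):
    """For each period 1..T, index of the latest randomization point <= t."""
    out, j = [], 0
    for t in range(1, T + 1):
        while j + 1 < len(points) and points[j + 1] <= t:
            j += 1
        out.append(j)
    return out


def sample_path(points, T, rng):
    """One assignment path: a fair coin at each randomization point (assumed)."""
    coins = [rng.randint(0, 1) for _ in points]
    return [coins[j] for j in epoch_of(points, T)]


def ht_estimate(w, y, points, m):
    """Horvitz-Thompson estimate; w and y are 0-indexed lists of length T."""
    T, ep = len(w), epoch_of(points, len(w))
    total = 0.0
    for t in range(m, T):                            # periods m+1..T
        window = w[t - m:t + 1]
        prob = 0.5 ** len(set(ep[t - m:t + 1]))      # Pr(all ones) = Pr(all zeros)
        if all(v == 1 for v in window):
            total += y[t] / prob
        elif all(v == 0 for v in window):
            total -= y[t] / prob
    return total / (T - m)


def exact_p_value(w_obs, y_obs, points, m, draws=2000, seed=0):
    """Algorithm 1: redraw assignment paths while holding outcomes fixed."""
    rng = random.Random(seed)
    observed = abs(ht_estimate(w_obs, y_obs, points, m))
    hits = 0
    for _ in range(draws):
        w_new = sample_path(points, len(w_obs), rng)
        hits += abs(ht_estimate(w_new, y_obs, points, m)) > observed
    return hits / draws


def sigma2_u2(w, y, m):
    """Second conservative variance estimate of Corollary 1 (T = n*m, n >= 4)."""
    T = len(w)
    n = T // m
    ybar = [sum(y[(k + 1) * m:(k + 2) * m]) for k in range(n - 1)]  # k = 0..n-2
    total = 8 * ybar[0] ** 2 + 8 * ybar[n - 2] ** 2
    for k in range(1, n - 2):                        # k = 1..n-3
        if w[k * m] == w[(k + 1) * m]:               # W_{km+1} = W_{(k+1)m+1}
            total += 32 * ybar[k] ** 2
    return total / (T - m) ** 2


def normal_p_value(tau_hat, sigma2_hat):
    """Two-sided p-value 2 - 2*Phi(z), with z = |tau_hat| / sqrt(sigma2_hat)."""
    z = abs(tau_hat) / math.sqrt(sigma2_hat)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 2.0 - 2.0 * phi
```

A typical use would compute `tau = ht_estimate(w_obs, y_obs, points, m)`, then report `exact_p_value(w_obs, y_obs, points, m)` for the sharp null (9) and `normal_p_value(tau, sigma2_u2(w_obs, y_obs, m))` for the null (10).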

5. A Discussion about Misspecified m

So far we have discussed cases when $m$ is correctly specified. When $m$ is misspecified, the estimation and inference are still valid and meaningful. We detail below the cases when $m$ is either overestimated ($m < p$) or underestimated ($m > p$).

When $m$ is underestimated, $m > p$, the lag-$p$ effect in (1) is not well defined. Instead, we define the $m$-misspecified lag-$p$ causal effect that pads the $p+1$ assignments with the earlier observed treatments,

$$\tau^{(m)}_p(Y) = \frac{1}{T-p} \left\{ \sum_{t=p+1}^{m} \left[ Y_t(w^{obs}_{1:t-p-1}, \mathbf{1}_{p+1}) - Y_t(w^{obs}_{1:t-p-1}, \mathbf{0}_{p+1}) \right] + \sum_{t=m+1}^{T} \left[ Y_t(w^{obs}_{t-m:t-p-1}, \mathbf{1}_{p+1}) - Y_t(w^{obs}_{t-m:t-p-1}, \mathbf{0}_{p+1}) \right] \right\}. \qquad (13)$$

This is a special case of the weighted lag-$p$ causal effect introduced in Bojinov and Shephard (2019). Similarly to the average lag-$p$ causal effect, $\tau^{(m)}_p(Y)$ captures how administering $p+1$ consecutive treatments as opposed to $p+1$ consecutive controls impacts the outcomes at time $t$, conditional on the observed assignment path up to time $t - p - 1$. See Section 6.5 for numerical results.

When $m$ is underestimated ($m > p$), sometimes we have to slightly augment the results and study the conditional expectation. Define $f_{\mathbb{T}} : [T] \to \mathbb{T}$ to be the "determining randomization point of period $t$,"

$$f_{\mathbb{T}}(t) = \max\{ j \mid j \in \mathbb{T}, j \leq t \},$$

such that it is the realization at time $f_{\mathbb{T}}(t)$ that uniquely determines the assignment at time $t$, i.e., $W_t = W_{f_{\mathbb{T}}(t)}$, $\forall t \in [T]$. See Example 10 for an illustration of $f_{\mathbb{T}}(\cdot)$. When $\mathbb{T}$ is clear from the context we drop the subscript and use $f(\cdot) = f_{\mathbb{T}}(\cdot)$. Depending on whether $f(t-p) \leq t-m$, we establish an analogue of Theorem 1 for the $m$-underestimated case.

Theorem 6 (Unbiasedness of the Estimator when m is Misspecified). Under Assumptions 2 and 3, for $m > p$, at each time $t \geq m+1$, the Horvitz-Thompson estimator is either unbiased for the lag-$m$ causal effect when $f(t-p) \leq t-m$, or conditionally unbiased for the $m$-misspecified lag-$p$ causal effect when $f(t-p) > t-m$. When $p+1 \leq t \leq m$, the Horvitz-Thompson estimator is either unbiased for the lag-$t$ causal effect when $f(t-p) = 1$, or conditionally unbiased for the $m$-misspecified lag-$t$ causal effect when $f(t-p) > 1$.

To remove the conditional expectation, we can further take an outer expectation, averaged over the past assignment paths. Although this is somewhat different from the average lag-$p$ effect introduced earlier in (1), it does capture the impact of a sequence of treatments relative to a sequence of controls.

The mathematical expressions of Theorem 6, as well as its proof, are stated in Section EC.6.1 in the Appendix. See Example 10 below for a specific illustration of Theorem 6. For a numerical illustration of the estimand and estimator in more general setups, see Section 6.5.

Example 10 (Misspecified m). Suppose $T = 4$, $m = 2$, $p = 1$, $\mathbb{T} = \{1, 3\}$. Then the determining randomization points are $f_{\mathbb{T}}(1) = 1$, $f_{\mathbb{T}}(2) = 1$, $f_{\mathbb{T}}(3) = 3$, $f_{\mathbb{T}}(4) = 3$, and

$$E\left[ Y^{obs}_2 \frac{\mathbb{1}\{W_{1:2} = (1,1)\}}{\Pr(W_{1:2} = (1,1))} - Y^{obs}_2 \frac{\mathbb{1}\{W_{1:2} = (0,0)\}}{\Pr(W_{1:2} = (0,0))} \right] = Y_2(1,1) - Y_2(0,0),$$

$$E\left[ Y^{obs}_3 \frac{\mathbb{1}\{W_{2:3} = (1,1)\}}{\Pr(W_{2:3} = (1,1))} - Y^{obs}_3 \frac{\mathbb{1}\{W_{2:3} = (0,0)\}}{\Pr(W_{2:3} = (0,0))} \right] = Y_3(1,1,1) - Y_3(0,0,0),$$

$$E\left[ Y^{obs}_4 \frac{\mathbb{1}\{W_{3:4} = (1,1)\}}{\Pr(W_{3:4} = (1,1))} - Y^{obs}_4 \frac{\mathbb{1}\{W_{3:4} = (0,0)\}}{\Pr(W_{3:4} = (0,0))} \right] = \frac{1}{2}\left[ Y_4(1,1,1) + Y_4(0,1,1) - Y_4(0,0,0) - Y_4(1,0,0) \right]. \quad □$$

The exact inference procedure as in Section 4.1 remains valid when $m$ is misspecified. For the asymptotic inference procedure as in Section 4.2, Theorem 5 still holds when $m$ is misspecified, as we state in Corollary 2. The proof is deferred to Section EC.6.2 in the Appendix.

Corollary 2 (Asymptotic Normality when m is Misspecified). For any $n \geq 4 \in \mathbb{N}$, define an $n$-replica experiment such that there are $T = np$ time periods. Take the optimal design as in Theorem 3, whose randomization points are at $\mathbb{T}^* = \{1, 2p+1, 3p+1, \ldots, (n-2)p+1\}$. We have the following two observations.

i When m p, under Assumptions 1–3 and assuming $\mathrm{Var}(\hat{\tau}_p) \geq \Omega(n^{-1})$, the Horvitz-Thompson estimator in the $n$-replica experiment is asymptotically normal. That is, as $n \to +\infty$,

$$\frac{\hat{\tau}_p - \tau_p}{\sqrt{\mathrm{Var}(\hat{\tau}_p)}} \xrightarrow{D} N(0, 1).$$

Corollary 2, together with Theorem 5, is the key to identifying $m$, the order of the carryover effect. In Section 7.1, we provide methods to identify $m$.

So far we have discussed estimation and inference when the order of the carryover effect is misspecified. We conclude with a short discussion on the robustness of our method.

First, the optimal designs for $m = 0$ and $m = 1$ are almost the same. This suggests that when there is very little carryover effect, our proposed optimal design is robust. Second, as we will see in Section 6, when the order of the carryover effect is slightly overestimated, the variance is only a little larger. This adds an extra layer of robustness: a slight misspecification is often acceptable.
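The determining randomization point used in Theorem 6 is simple to compute. The following sketch (hypothetical helper names, written for illustration only) evaluates $f_{\mathbb{T}}(t)$ by binary search and reports which branch of Theorem 6 applies at each period, using the design of Example 10.

```python
from bisect import bisect_right


def determining_point(points, t):
    """f_T(t): the latest randomization point j in `points` with j <= t."""
    return points[bisect_right(points, t) - 1]


def theorem6_case(points, t, m, p):
    """Branch of Theorem 6 at period t (requires m > p and t >= p + 1)."""
    f = determining_point(points, t - p)
    if t >= m + 1:
        return "unbiased" if f <= t - m else "conditionally unbiased"
    return "unbiased" if f == 1 else "conditionally unbiased"


points, T, m, p = [1, 3], 4, 2, 1          # setup of Example 10
print([determining_point(points, t) for t in range(1, T + 1)])
# prints: [1, 1, 3, 3]
print({t: theorem6_case(points, t, m, p) for t in range(p + 1, T + 1)})
```

For this design, periods 2 and 3 fall in the unbiased branch and period 4 in the conditionally unbiased branch, which matches the three expectations displayed in Example 10.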

6. Simulation Study and Empirical Illustration

There are 5 goals for this simulation study. First, to illustrate how to conduct a switchback experiment for various outcome models. Second, to show that our proposed optimal design has the smallest risk, compared with two benchmarks. There are two dimensions for our comparison: the worst-case risk and the risk under a specific outcome model. Third, to verify the asymptotic normal distribution under a non-asymptotic setup, and to study the quality of the upper bound proposed in Corollary 1. Fourth, to understand the rejection rate and its dependence on the length of the time horizon. Fifth, under randomly generated cases, to study the performance of the optimal design under a misspecified $m$, and to compare the two inference methods proposed in Section 4.

The potential outcome framework is flexible. As we will see below, it is easy to use the potential outcome framework to describe many complex relationships between assignments and outcomes.

We start with a simple model which originates from Oman and Seiden (1988):

$$Y_t(w_{1:t}) = \mu + \alpha_t + \delta w_t + \gamma w_{t-1} + \epsilon_t \qquad (14)$$

where $\mu$ is a fixed effect; $\alpha_t$ is a fixed effect associated with period $t$; $\delta w_t$ is the contemporaneous effect; $\gamma w_{t-1}$ is the carryover effect from period $t-1$; and $\epsilon_t$ is the random noise in period $t$. Such a model, as well as a few very similar ones, is widely used in the literature (Hedayat et al. 1978, Jones and Kenward 2014).

A more general variant of the above model considers carryover effects of arbitrary order:

$$Y_t(w_{1:t}) = \mu + \alpha_t + \delta^{(1)} w_t + \delta^{(2)} w_{t-1} + \ldots + \delta^{(t)} w_1 + \epsilon_t \qquad (15)$$

where $\delta^{(1)}, \delta^{(2)}, \ldots, \delta^{(t)}$ are non-stochastic coefficients. The dotted terms are carryover effects of higher orders. All the other parameters are as defined in (14).
We will run simulations basedon this more general model, which enables us to test the performance of our proposed optimaldesign under a misspeciﬁed m .The autoregressive model (Arellano 2003) is even more general: Y ( w ) = δ , w + (cid:15) and ∀ t > Y t ( w t ) = φ t,t − Y t − ( w t − ) + φ t,t − Y t − ( w t − ) + ... + φ t, Y ( w )+ δ t,t w t + δ t,t − w t − + ... + δ t, w + (cid:15) t (16)where φ t, ˜ t and δ t, ˜ t are non-stochastic coeﬃcients; the dotted terms are carryover eﬀects of higherorders; (cid:15) t is the random noise in period t . We can iteratively replace Y t ( w t ) using a linear combi-nation of w t , w t − , ..., w . So the autoregressive model in (16) can be written in a similar form of(15). The only diﬀerence is that the coeﬃcients are diﬀerent and dependent on t .For all these models, we ﬁrst decide what is the order of carryover eﬀects, namely m . Then weuse Theorem 3 to ﬁnd the optimal design of experiment. Finally, we use the exact randomizationtest in Section 4 to conduct hypothesis test. We consider two setups. The ﬁrst setup is for the worst-case risk.We consider T = 120 , p = m = 2 where m is correctly identiﬁed, and Y t ( ) = Y t ( ) = 10. Wecompare three diﬀerent designs of switchback experiments. The ﬁrst one is our proposed optimaldesign as in Theorem 3, such that T ∗ = { , , , ..., } . The second one is the most common andnaive switchback experiment, which independently assign treatment/control in every period withhalf-half probability. It is parameterized by T H1 = { , , , ..., } . The third one is the “intuitive” experiment discussed in Example 9, which divides the time horizon into several epochs each withlength m + 1 = 3. It is parameterized by T H2 = { , , , ..., } .Second, we run simulations based on the outcome model as in (15). We consider T = 120 , p = m = 2 where m is correctly identiﬁed. For the outcome model, we consider µ = 0, α t = log ( t ), and (cid:15) t ∼ N (0 ,

1) are i.i.d. standard normal distributions. For any t >

3, let δ ( t ) = 0. We will vary thevalues of δ (1) , δ (2) , δ (3) ∈ { , } and conduct experiments under 2 = 8 diﬀerent scenarios. Again wecompare the same three diﬀerent designs of switchback experiments. T ∗ = { , , , ..., } , T H1 = { , , , ..., } , T H2 = { , , , ..., } .We simulate one assignment path at a time, and conduct experiment following this assignmentpath. Since the outcome model is prescribed, we can calculate both the causal estimand and andthe observed outcomes (along the simulated assignment path). Then we calculate the Horvitz-Thompson estimator based on the simulated assignment path and the simulated observed outcomes.With both the estimand and estimator, we can calculate the loss function. By repeating the aboveprocedure enough (in this simulation, 100000) times we approximately have the risk function. We calculate the worst-case risk functions via simulation. Noticethat even though we could calculate the worst-case risk function explicitly via Theorem 2, we stillrun the simulation to conﬁrm this result. See Table 2 for results.The causal eﬀect is τ = 0 because Y t ( ) = Y t ( ) = 10. The simulated estimator is E [ˆ τ ∗ ] = − . E [ˆ τ H1 ] = 0 . E [ˆ τ H2 ] = − . T − p then the errors are very clode to zero. The risk function is r ( η T ∗ ) = 26 . r ( η T H1 ) = 33 .

67 and $r(\eta_{\mathbb{T}_{H2}}) = 27.85$ for the two benchmarks, respectively. These simulation results suggest that our proposed optimal design has the smallest risk under the worst-case outcome model.
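The worst-case comparison above can be reproduced approximately with a short Monte Carlo simulation. The sketch below is illustrative only and rests on stated assumptions: every randomization point flips an independent fair coin, all potential outcomes equal $B = 10$ so that the estimand is zero, and a much smaller number of repetitions is used than in the simulations reported here. The name `simulate_risk` is hypothetical.

```python
import random


def simulate_risk(points, T, m, B=10.0, reps=3000, seed=0):
    """Monte Carlo estimate of E[(tau_hat - tau)^2] when Y_t = B, so tau = 0."""
    rng = random.Random(seed)
    # epoch[t-1]: index of the randomization point governing period t
    epoch, j = [], 0
    for t in range(1, T + 1):
        while j + 1 < len(points) and points[j + 1] <= t:
            j += 1
        epoch.append(j)
    # For each period t >= m+1, the distinct coins its window t-m..t depends on
    windows = [sorted(set(epoch[t - m:t + 1])) for t in range(m, T)]
    total = 0.0
    for _ in range(reps):
        coins = [rng.randint(0, 1) for _ in points]
        est = 0.0
        for idx in windows:
            first = coins[idx[0]]
            if all(coins[i] == first for i in idx):
                sign = 1.0 if first == 1 else -1.0
                est += sign * B * 2 ** len(idx)   # divide by Pr = 2^(-#coins)
        total += (est / (T - m)) ** 2
    return total / reps


T, m = 120, 2
designs = {
    "optimal": [1] + list(range(5, 118, 2)),      # {1, 5, 7, ..., 117}
    "every period": list(range(1, T + 1)),        # {1, 2, 3, ..., 120}
    "epochs of m+1": list(range(1, T + 1, 3)),    # {1, 4, 7, ..., 118}
}
for name, pts in designs.items():
    print(name, round(simulate_risk(pts, T, m), 2))
```

With a few thousand repetitions the estimates are noisy, so only the gap between the optimal design and the every-period benchmark is expected to be clearly visible; separating the optimal design from the epochs-of-$m+1$ benchmark needs many more repetitions.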

Table 2 Simulation results for the worst-case risk function. τ E [ˆ τ ∗ ] E [ˆ τ H1 ] E [ˆ τ H2 ] r ( η T ∗ ) r ( η T H1 ) r ( η T H2 )0 0 . . . .

78 33 .

67 27 . The optimal design T ∗ as suggested in Theorem 3 yields the smallest risk. We also calculate the risk functions based on the outcome model in (15). See Table 3. As we varythe values of δ (1) , δ (2) and δ (3) , the total lag-2 causal eﬀect is being changed. All three estimatorsare able to reﬂect the change as the estimand changes. The risk function can be simulated and wesee that the risk function associated with the ﬁrst benchmark T H1 is 28% ∼

32% larger than theoptimal design; and the second benchmark T H2 is 1% ∼

2% larger. These simulation results suggest again that our proposed optimal design has the smallest risk. Moreover, based on the fact that $r(\eta_{\mathbb{T}_{H2}})$ is rather close to $r(\eta_{\mathbb{T}^*})$ and much smaller than $r(\eta_{\mathbb{T}_{H1}})$, we suggest that a slight overestimate of $m$ is more desirable than an underestimate.

Table 3. Simulation results for the risk function based on the outcome model in (15).

| $\delta^{(1)}$ | $\delta^{(2)}$ | $\delta^{(3)}$ | $\tau$ | $E[\hat{\tau}^*]$ | $E[\hat{\tau}_{H1}]$ | $E[\hat{\tau}_{H2}]$ | $r(\eta_{\mathbb{T}^*})$ | $r(\eta_{\mathbb{T}_{H1}})$ | $r(\eta_{\mathbb{T}_{H2}})$ |
| 1 | 1 | 1 | 3 | 3.016 | 3.012 | 3.002 | 7.96 | 10.22 | 8.11 |
| 1 | 1 | 2 | 4 | 4.018 | 4.013 | 4.002 | 9.57 | 12.39 | 9.74 |
| 1 | 2 | 1 | 4 | 4.018 | 4.013 | 4.002 | 9.57 | 12.39 | 9.74 |
| 2 | 1 | 1 | 4 | 4.018 | 4.013 | 4.002 | 9.57 | 12.39 | 9.74 |
| 1 | 2 | 2 | 5 | 5.020 | 5.015 | 5.003 | 11.34 | 14.81 | 11.52 |
| 2 | 1 | 2 | 5 | 5.020 | 5.015 | 5.003 | 11.34 | 14.81 | 11.52 |
| 2 | 2 | 1 | 5 | 5.020 | 5.015 | 5.003 | 11.34 | 14.81 | 11.52 |
| 2 | 2 | 2 | 6 | 6.022 | 6.016 | 6.003 | 13.28 | 17.48 | 13.47 |

For each row, the random seed that generates the simulation setup is fixed. The optimal design $\mathbb{T}^*$ as suggested in Theorem 3, though solved from a minimax program, still yields the smallest risk for the outcome model in (15).

As the magnitude of the treatment effects increases, the associated risk functions also increase. The relative difference between $r(\eta_{\mathbb{T}_{H1}})$ and $r(\eta_{\mathbb{T}^*})$ increases, while the relative difference between $r(\eta_{\mathbb{T}_{H2}})$ and $r(\eta_{\mathbb{T}^*})$ decreases. This coincides with the intuitions discussed in Section 3.

We run simulations based on the outcome model as in (15). We consider $T = 120$, $m = 2$. We consider three cases: (i) $m$ is correctly specified, so $p = 2$; (ii) $m$ is overestimated to be 3, so $p = 3$, and we estimate the lag-3 causal estimand as in (1); (iii) $m$ is underestimated to be 1, so $p = 1$, and we proceed as if we estimated the lag-1 causal estimand. However, the lag-1 causal estimand is not well defined, and we instead estimate the 2-misspecified lag-1 causal estimand as in (13).

For the outcome model, we consider $\mu = 0$, $\alpha_t = \log(t)$, and $\epsilon_t \sim N(0, 1)$ i.i.d. standard normal. For any t >

3, let δ ( t ) = 0. For simplicity, let δ (1) = δ (2) = δ (3) = δ . We vary δ ∈ { , , } and conduct experiments under 3 diﬀerent scenarios.We simulate one assignment path at a time, and conduct experiment following this assignmentpath. Since the outcome model is prescribed, we calculate the observed outcomes based on the sim-ulated assignment path. Then we calculate the Horvitz-Thompson estimator, and two conservativeestimators of the randomization variance (Corollary 1), both based on the simulated assignmentpath and the simulated observed outcomes. On the other hand, the lag- p causal estimand is easyto calculate once the outcome model is prescribed. Yet the m -misspeciﬁed lag- p causal estimandhas to be calculated in conjunction with the simulated assignment path. By repeating the aboveprocedure enough (in this simulation, 100000) times we obtain a distribution of the estimator, andwe calculate the mean value of the estimator (and the m -misspeciﬁed lag- p causal estimand). Figure 4 shows approximate normality of the randomization dis-tribution, under all 9 cases. There are three speciﬁcations of m : correctly speciﬁed when p = 2;overestimated when p = 3; underestimated when p = 1. There are three speciﬁcations of δ = 1 , , m thatleads to three diﬀerent designs. Table 4 Simulation results for the randomization distribution. τ p τ [ m ] p E [ˆ τ p ] Var (ˆ τ p ) E [ˆ σ U1 ] E [ˆ σ U2 ]correct m δ = 1 3 − δ = 2 6 − δ = 3 9 − m δ = 1 3 − δ = 2 6 − δ = 3 9 − m δ = 1 − δ = 2 − δ = 3 − The randomization distributions in all 9 cases are unbiased. The conservative estimation of the variance upper boundsfrom Corollary 1 are close to the true variance.

From Table 4, we make the following two observations.

(i) Unbiasedness of the Horvitz-Thompson estimator. When $m$ is correctly specified, $E[\hat{\tau}_p]$ is very close to $\tau_p$, verifying the unbiasedness of the estimator. When $m$ is overestimated, the estimand remains unchanged, and the estimator remains unbiased, but the variance of the estimator is larger. When $m$ is underestimated, the estimand is the $m$-misspecified estimand, and the estimator is unbiased for this $m$-misspecified estimand.

(ii) Quality of Corollary 1. As we increase $\delta$, the variance of the randomization distribution also increases. The two conservative estimators of the randomization variance are very close to the true variance, which suggests that Corollary 1 approximates the true variance quite well. Even though the second upper bound $\mathrm{Var}^{U2}(\hat{\tau}_p)$ is larger than the first one $\mathrm{Var}^{U1}(\hat{\tau}_p)$, its estimator $\hat{\sigma}^2_{U2}$ turns out to be smaller than $\hat{\sigma}^2_{U1}$ in most cases.

Figure 4. Approximate normality of the randomization distributions in all 9 cases. Panels (a)–(c): $m$ correctly specified; panels (d)–(f): $m$ overestimated; panels (g)–(i): $m$ underestimated; within each group, $\delta = 1, 2, 3$. The red vertical lines are the expected values of the randomization distributions.

We run simulations based on the outcome model as in (15). We vary T ∈ { , , , ..., }. We consider $p = m = 2$, where $m$ is correctly specified. Similar to Section 6.3, we consider the same parameterization and conduct experiments under 3 different scenarios $\delta \in \{1, 2, 3\}$.

We simulate one assignment path at a time, and conduct the experiment following this assignment path. We first calculate the observed outcomes and the Horvitz-Thompson estimator. Then we conduct the two inference methods proposed in Section 4 (for the asymptotic inference method, we plug in the second upper bound $\hat{\sigma}^2_{U2}$) and obtain two estimated $p$-values. We reject the corresponding null hypothesis when the $p$-value is smaller than 0.1. By repeating the above procedure enough times (in this simulation, 1000), we obtain the frequency of a null hypothesis being rejected, which we refer to as the rejection rate.

We calculate the rejection rates via simulation, and plot Figure 5. In all the simulations, $\delta \neq 0$ and $\tau_p \neq 0$, so ideally we would wish to reject the null hypothesis (whether it is (9) or (10)).

Figure 5. Rejection rates and their dependence on $T/m$. The blue dots are rejection rates under exact inference; the red dots are under asymptotic inference. Left: $\delta = 1$; Middle: $\delta = 2$; Right: $\delta = 3$.

From Figure 5 we make the following three observations.

(i) Dependence on $T/m$. The rejection rates increase as the length of the horizon increases; more specifically, as $T/m$, the total number of epochs, increases. In practice, when firms have to choose the length of $T$ and decide how much experimental budget to allocate, they can refer to Figure 5 to choose $T$ properly. Also see the discussion in Section 7.2.

(ii) Between two inference methods. In all three cases, the rejection rate from testing the sharp null hypothesis (9) is slightly higher than that from testing Neyman's null (10). This coincides with our intuition that a sharp null is more likely to be rejected. We discuss this in Section 6.5.2 together with the associated $p$-values.

(iii) Dependence on the signal-to-noise ratio. The rejection rates all increase as $\delta$ increases from 1 to 3 (while holding the noise from the model fixed). This suggests that when the treatment effect is relatively larger, we do not require a long experimental horizon to achieve a desired rejection rate.

We run simulations based on the outcome model as in (15). We consider $T = 120$, $m = 2$. We consider three cases: (i) $m$ is correctly specified, $p = 2$; (ii) $m$ is overestimated, $p = 3$, and we estimate the lag-3 causal estimand as in (1); (iii) $m$ is underestimated, $p = 1$, and we proceed as if we estimated the lag-1 causal estimand. However, the lag-1 causal estimand is not well defined. Instead, we estimate the 2-misspecified lag-1 causal estimand as in (13).

For the outcome model, we consider the same parameterization as in Section 6.3, and conduct experiments under 3 different scenarios $\delta \in \{1, 2, 3\}$.

We only simulate one assignment path. Since the outcome model is prescribed, we calculate the observed outcomes. There is only one time series of such observed outcomes. We calculate the Horvitz-Thompson estimator based on the simulated assignment path and the simulated observed outcomes. We calculate the lag-$p$ causal estimand directly, and also the $m$-misspecified lag-$p$ causal estimand in conjunction with the simulated assignment path.
Finally, we perform the two inferencemethods from Section 4, and report their associated estimated p -values. The conservative samplingvariance we take is ˆ σ U2 . We choose I = 100000 to be the number of samples drawn in the exactinference method as shown in Algorithm 1. Notice this is only one experiment under one simulated experi-mental setup from one simulated assignment path. So the estimators ˆ τ p we derive are diﬀerent from τ p . But they are still roughly following the true causal eﬀects which they estimate. See Table 5. Table 5 Simulation results for correctly speciﬁed m , overestimated m , and underestimated m . τ p τ [ m ] p ˆ τ p ˆ σ U2 ˆ p F ˆ p N correct m δ = 1 3 − δ = 2 6 − δ = 3 9 − m δ = 1 3 − δ = 2 6 − δ = 3 9 − m δ = 1 − δ = 2 − δ = 3 − The simulation setup for the three δ = 1 cases is the same; so are the δ = 2 cases and δ = 3 cases. The estimated p -valuesderived from the exact inference is slightly smaller than the p -values derived from the asymptotic inference. From Table 5 we see that both our estimator and the estimated variance are well deﬁned in allthe cases when m is correctly speciﬁed, is overestimated, and is underestimated. In each case, as delta increases from 1 to 3, the associated p -values exhibit decreasing trends, suggesting a strongerrejection rate against the null hypothesis. Moreover, the p -values suggested by the exact inferenceis always slightly smaller than the p -values suggested by the asymptotic inference. This coincideswith our intuition that: (i) the exact inference method possesses a stronger null hypothesis (9)which implies the null hypothesis of (10); (ii) in the asymptotic inference we replaced the truerandomization variance by its conservative upper bound, which further leads to a larger p -value.

7. Practical Implications

We recap what would a manager do to practically run a switchback experiment. First, the gran-ularity and the length of horizon are sometimes given to a manager; when the manager has moreﬂexibility to choose the granularity and the length of horizon, see discussions in Section 7.2. Sec-ond, the manager either consults domain knowledge to decide what is the order of carryover eﬀect,or when such knowledge is not perfect, run a ﬁrst phase experiment to identify such an order.See discussions in Section 7.1. Third, the manager decides on a collection of randomization points,and draws one sample from the randomization distribution to be the assignment path. We recom-mend the optimal design as we discussed in Section 3. Finally, the manager collects data from theexperiment, and draws causal conclusions using the methods in Section 4.

We borrow Theorem 5 and Corollary 2 to define a sub-routine which, combined with a search method, identifies the order of the carryover effect.

Suppose we have access to two i.i.d. experimental units. These two experimental units could be two identical units. They can also be two time spans on one single experimental unit, such that the two spans are well separated and the carryover effect from one does not affect the outcomes of the other.

On the first experimental unit, we design an optimal experiment under $p = p_1$, while on the second unit, $p = p_2$. Without loss of generality let $p_1 < p_2$. We observe the outcomes, collect the data, and find the following statistics from the two experiments. In the first one, calculate $\hat{\tau}_{p_1}$, the sampling average, and $\hat{\sigma}^2_{p_1}$, the conservative sampling variance as suggested by Corollary 1. In the second one, calculate $\hat{\tau}_{p_2}$ and $\hat{\sigma}^2_{p_2}$.

Define a sub-routine that tests the following null hypothesis:

$$H_0: m \leq p_1 \qquad (17)$$

Under the null hypothesis (17), $\tau_{p_1} = \tau_{p_2}$. Furthermore, given that the two experimental units are independent, the difference between the two sample means should be normally distributed and centered around zero, i.e.,

$$\frac{\hat{\tau}_{p_1} - \hat{\tau}_{p_2}}{\sqrt{\mathrm{Var}(\hat{\tau}_{p_1}) + \mathrm{Var}(\hat{\tau}_{p_2})}} \xrightarrow{D} N(0, 1).$$

The test statistic is $z = |\hat{\tau}_{p_1} - \hat{\tau}_{p_2}| / \sqrt{\hat{\sigma}^2_{p_1} + \hat{\sigma}^2_{p_2}}$. The estimated $p$-value is given by $\hat{p} = 2 - 2\Phi(z)$, where $\Phi$ is the CDF of a standard normal distribution.

Such a sub-routine enables us to test the null hypothesis (17). We can combine it with any search method to identify $m$.

To take an example, suppose we are running an experiment whose setup is in Section 6.5, and that $\delta = 3$. Suppose we have narrowed down the range of the order of carryover effect to be m ≤ In the first round, we consult the sub-routine to test a null hypothesis m ≤

2. Then we wouldobserve row 3 and 6 from Table 5, with ˆ τ = 7 . , ˆ σ = 23 .

88; ˆ τ = 8 . , ˆ σ = 39 .

00. So the estimated p -value for the null hypothesis m ≤ p = 0 . m ≤ τ = 1 . , ˆ σ = 9 .

47; ˆ τ = 7 . , ˆ σ = 23 . p -value for the null hypothesis m ≤ p = 0 . . T /p ≈

The selection of T depends on the following two components: first, the granularity of a single time period; and second, the total number of epochs T/p included in the time horizon.

The granularity of each time period refers to how long a single time period lasts in the physical world. As long as each time period is granular enough that its length is smaller than the length of the carryover effect, and the length of the carryover effect is a multiple of the length of one time period, the choice of granularity makes no difference to the optimal design and analysis of switchback experiments. See Example 11.

Example 11 (Two Granularity Levels).

In the ride-sharing application, suppose the firms have two options: to treat a single time period as 15 minutes, or as 30 minutes. See Figure 6. Each smallest time period stands for 15 minutes, and the carryover effect lasts for 1 hour. In the time granularity shown in blue, each time period lasts for 15 minutes, and the carryover effect lasts for m = 4 time periods. In the time granularity shown in red, each time period lasts for 30 minutes, and the carryover effect lasts for m = 2 time periods.

Figure 6 Illustration of two granularity levels. Blue: each period is 15 minutes; Red: each period is 30 minutes.

From Theorem 3, the optimal design randomizes once every m time periods (except for the first and last epochs, which last for 2m time periods each). In both cases, the optimal design would randomize once every hour. Furthermore, from Theorem 1 we know that in both cases the mean of the Horvitz-Thompson estimator remains unchanged. From Theorem 4, the variance consists of terms like $\bar{Y}_k(\mathbf{1}_{m+1}) = \sum_{t=(k+1)m+1}^{(k+2)m} Y_t(\mathbf{1}_{m+1})$, which are sums over all the outcomes within 1 hour. So in both cases the variance of the estimator remains unchanged. □

On the other hand, when each time period is longer than the carryover effect, the length of the carryover effect is unnecessarily overestimated. In an extreme case, the carryover effect lasts 1 minute while each period is chosen to be 1 hour. Then the minimum order of the carryover effect would be 1, which overestimates the length of the carryover effect as 1 hour. So we suggest that the length of each time period be smaller than the length of the carryover effect.

Second, after we have decided on the granularity of each time period, we can follow the procedure in Section 7.1 to identify m. The next choice is how long the experiment should last, n = T/m. We choose n by referring to the rejection rate curves shown in Section 6.4. We first fix which inference method to use (exact inference or asymptotic inference). Then we use domain knowledge to get a sense of the signal-to-noise ratio. Finally, we choose a desired rejection rate and read off the length of horizon required.
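Since the procedure of Section 7.1 is used above to identify m, its sub-routine for testing the null hypothesis (17) can be sketched in a few lines of Python. The sketch assumes that each conservative variance estimate is already on the scale of the variance of the corresponding estimator, and that the p-value is the two-sided normal p-value 2 − 2Φ(z). The function name and the input numbers are hypothetical; they are not taken from Table 5.

```python
from math import sqrt
from statistics import NormalDist


def carryover_order_test(tau_1, var_1, tau_2, var_2):
    """Two-sample z-test of H0: m <= p_1, given estimates from two
    independent experiments designed under p_1 < p_2.

    tau_i: point estimate from experiment i.
    var_i: conservative variance estimate for tau_i (assumed to be on
           the scale of Var(tau_i)).
    Returns the test statistic z and the two-sided p-value.
    """
    z = abs(tau_1 - tau_2) / sqrt(var_1 + var_2)
    p_value = 2 - 2 * NormalDist().cdf(z)
    return z, p_value


# Hypothetical inputs, for illustration only.
z, p_value = carryover_order_test(tau_1=3.0, var_1=1.0, tau_2=0.0, var_2=1.0)
print(round(z, 3), round(p_value, 4))
```

A small p-value is evidence against m ≤ p_1, so a search over candidate orders would move to a larger candidate; a large p-value does not prove the null, which is one reason to narrow the range with domain knowledge first.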

8. Concluding Remarks, Limitations and Future Work

We studied the design and analysis of switchback experiments. We formulated and solved a minimax problem for the design of optimal switchback experiments. We then analyzed our proposed optimal design and proposed two inferential methods. In particular, we showed asymptotic normality of the randomization distribution. We discussed cases when the order of the carryover effect, m, is misspecified, and detailed a method to identify it. We gave the empirical suggestion that a slight overestimate of m is acceptable and better than an underestimate.

We point out two limitations of our paper. First, when m, the order of the carryover effect, is comparable to T, the length of the horizon, our method, though still unbiased in theory, incurs a huge variance that typically prevents the experimental designer from making any inference. This is because our method is general and requires minimal modeling assumptions. When the outcome model has some structure, for example the equilibrium effects in Wager and Xu (2019), exploiting that structure will lead to different designs. Second, our method does not take advantage of covariate information to further reduce the randomization variance. This could lead to future work.

Finally, we encourage empirical researchers who apply our method to use domain knowledge to narrow down m first, before using the procedure in Section 7.1 to identify m. This is because, empirically, each call of the sub-routine that tests (17) consumes experimental resources at the scale of T/p ≈ 100 to reliably distinguish two candidate values, which could be costly when resources are scarce.

References

Abadie A, Athey S, Imbens GW, Wooldridge JM (2020) Sampling-based versus design-based uncertainty inregression analysis.

Econometrica

Journal of Marketing Research

Panel data econometrics (Oxford university press).Athey S, Eckles D, Imbens GW (2018) Exact p-values for network interference.

Journal of the AmericanStatistical Association

Available atSSRN 3171224 .Bai Y (2019) Optimality of matched-pair designs in randomized control trials.

Available at SSRN 3483834 .Basse G, Ding Y, Toulis P (2019) Minimax crossover designs. arXiv preprint arXiv:1908.03531 .Bell DR, Chiang J, Padmanabhan V (1999) The decomposition of promotional response: An empiricalgeneralization.

Marketing Science

Statistical decision theory and Bayesian analysis (Springer Science & Business Media).Bickel PJ, Doksum KA (2015)

Mathematical statistics: basic ideas and selected topics, volume I , volume 117(CRC Press).Bojinov I, Rambachan A, Shephard N (2020a) Panel experiments and dynamic causal eﬀects: A ﬁnite pop-ulation perspective. arXiv preprint arXiv:2003.09915 .Bojinov I, Saint-Jacques G, Tingley M (2020b) Avoid the pitfalls of a/b testing make sure your experimentsrecognize customers’ varying needs.

Harvard Business Review

Journal of the American Statistical Association

Operations Research

Journal of econometrics

International Conference on Machine Learning , 1194–1203 (PMLR).Eckles D, Karrer B, Ugander J (2016) Design and analysis of experiments in networks: Reducing bias frominterference.

Journal of Causal Inference

HarvardBusiness School Case

Manufacturing & Service Operations Management arXiv preprint arXiv:1905.07544 .Glynn P, Johari R, Rasouli M (2020) Adaptive experimental design with temporal interference: A maximumlikelihood approach. arXiv preprint arXiv:2006.05591 .Hadad V, Hirshberg DA, Zhan R, Wager S, Athey S (2019) Conﬁdence intervals for policy evaluation inadaptive experiments. arXiv preprint arXiv:1911.02768 .Hedayat A, Afsarinejad K, et al. (1978) Repeated measurements designs, ii.

The Annals of Statistics

Duke Mathe-matical Journal

American Journal of Political Science http://dx.doi.org/10.1111/ajps.12417 .Imbens GW, Rubin DB (2015)

Causal Inference for Statistics, Social, and Biomedical Sciences: An Intro-duction (Cambridge University Press), URL http://dx.doi.org/10.1017/CBO9781139025751 .Johari R, Li H, Weintraub G (2020) Experimental design in two-sided platforms: An analysis of bias. arXivpreprint arXiv:2002.05670 .Jones B, Kenward MG (2014)

Design and analysis of cross-over trials (CRC press).Kastelman D, Ramesh R (2018) Switchback tests and randomized experimentation under network eﬀects atdoordash.

URL: https://medium.com/@DoorDash/switchback-tests-and-randomized-experimentation-under-network-eﬀects-at-doordash-f1d938ab7c2a .Kohavi R, Crook T, Longbotham R, Frasca B, Henne R, Ferres JL, Melamed T (2009) Online experimentationat microsoft.

Data Mining Case Studies

Proceedings of the 13th ACM SIGKDD international conference onKnowledge discovery and data mining , 959–967.Kohavi R, Tang D, Xu Y (2020)

Trustworthy Online Controlled Experiments: A Practical Guide to A/BTesting (Cambridge University Press), URL http://dx.doi.org/10.1017/9781108653985 .Kohavi R, Thomke S (2017) The surprising power of online experiments.

Harvard Business Review

Statistics in Medicine k factorial experiments. The Annals of Statistics

Personalized medicine

Forthcoming at ManagementScience .March JG (1991) Exploration and exploitation in organizational learning.

Organization science

International Conference on Artiﬁcial Intelligence and Statistics , 1261–1269.Oman SD, Seiden E (1988) Switch-back designs.

Biometrika arXiv preprint arXiv:1910.10862 .Rambachan A, Shephard N (2019) Econometric analysis of potential outcomes time series: instruments,shocks, linearity and the causal response function. arXiv preprint arXiv:1903.01637 .Robins J (1986) A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor eﬀect.

Mathematical Modelling

Journal of the American Statistical Association

Statistics & probability letters

Academy of management Review

Statistics in Medicine

Research in organizational behavior

Journal of the American Statistical Association http://dx.doi.org/10.1080/01621459.2011.646917 .Sussman DL, Airoldi EM (2017) Elements of estimation theory for causal eﬀects in the presence of networkinterference. arXiv preprint arXiv:1702.03578 .Thomke S (2001) Enlightened experimentation. the new imperative for innovation.

Harvard Business Review

Experimentation Works: The Surprising Power of Business Experiments (Harvard Busi-ness Press).Wager S, Xu K (2019) Experimenting in equilibrium. arXiv preprint arXiv:1903.02124 .Wu CF (1981) On the robustness and eﬃciency of some randomized designs.

The Annals of Statistics -companion to

Bojinov, Simchi-Levi, and Zhao:

Switchback Experiments ec1

Online Appendix

EC.1. Recap of Notations

Within this paper, let $\mathbb{N}$ and $\mathbb{N}_0$ be the sets of positive integers and non-negative integers, respectively. For any $T \in \mathbb{N}$, let $[T] = \{1, ..., T\}$ be the set of positive integers no larger than $T$. For any $t < t' \in \mathbb{N}$, let $\{t : t'\} = \{t, t+1, ..., t'\}$ be the set of integers between (and including) $t$ and $t'$. For any $m \in \mathbb{N}$, let $\mathbf{1}_m = (1, 1, ..., 1)$ be a vector of $m$ ones; let $\mathbf{0}_m = (0, 0, ..., 0)$ be a vector of $m$ zeros. We use parentheses for probabilities, i.e., $\Pr(\cdot)$; brackets for expectations, i.e., $\mathbb{E}[\cdot]$; and curly brackets for indicators, i.e., $\mathbb{1}\{\cdot\}$. For any $a \in \mathbb{R}$, let $(a)^+ = \max\{a, 0\}$.

EC.2. Theorems Used

We summarize here the preliminaries that we have directly used in our proof.

Definition EC.1 ($\phi$-Dependent Random Variables, Hoeffding and Robbins (1948)). For any sequence $\{X_1, X_2, ...\}$, if there exists $\phi$ such that for any $s - r > \phi$, the two sets $(X_1, X_2, ..., X_r)$ and $(X_s, X_{s+1}, ..., X_n)$ are independent, then the sequence is said to be $\phi$-dependent.

Lemma EC.1 (Romano and Wolf (2000), Theorem 2.1). Let $\{X_{n,i}\}$ be a triangular array of zero-mean random variables. Let $\phi \in \mathbb{N}$ be a fixed constant. For each $n = 1, 2, ...$, let $d = d_n$, and suppose that $X_{n,1}, X_{n,2}, ..., X_{n,d}$ is a $\phi$-dependent sequence of random variables. Define

$$B^2_{n,k,a} = \mathrm{Var}\left(\sum_{i=a}^{a+k-1} X_{n,i}\right), \qquad B^2_n = B^2_{n,d,1} = \mathrm{Var}\left(\sum_{i=1}^{d} X_{n,i}\right).$$

For some $\delta > 0$ and $-1 \le \gamma \le 1$, if the following conditions hold:

1. $\mathbb{E}|X_{n,i}|^{2+\delta} \le \Delta_n$, for all $i$;
2. $B^2_{n,k,a} / k^{1+\gamma} \le K_n$, for all $a$ and $k \ge \phi$;
3. $B^2_n / (d \phi^{\gamma}) \ge L_n$;
4. $K_n / L_n = O(1)$;
5. $\Delta_n / L_n^{(2+\delta)/2} = O(1)$,

then

$$\frac{\sum_{i=1}^{d} X_{n,i}}{B_n} \xrightarrow{D} N(0, 1).$$

We explain Lemma EC.1. The $\xrightarrow{D}$ notation stands for convergence in distribution. The definition of a sequence of $\phi$-dependent random variables is given in Definition EC.1. To check whether the conditions in Lemma EC.1 hold, we will first calculate $B^2_{n,k,a}$ for any $k$ and $a$, and then construct suitable $\Delta_n$, $K_n$, and $L_n$.

EC.3. Proof of Unbiasedness of the Horvitz-Thompson Estimator

Proof of Theorem 1.

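First observe that for regular switchback experiments, both $\Pr(W_{t-p:t} = \mathbf{1}_{p+1})$ and $\Pr(W_{t-p:t} = \mathbf{0}_{p+1})$ lie strictly between 0 and 1. So Assumption 4 is naturally satisfied.

For any $t \in \{p+1 : T\}$, with probability $\Pr(W_{t-p:t} = \mathbf{1}_{p+1}) \ne 0$ we have $\mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\} = 1$ and $Y_t^{obs} = Y_t(\mathbf{1}_{p+1})$; otherwise the indicator is zero. So

$$\mathbb{E}\left[ \frac{Y_t^{obs} \, \mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{1}_{p+1})} \right] = \Pr(W_{t-p:t} = \mathbf{1}_{p+1}) \cdot \frac{Y_t(\mathbf{1}_{p+1})}{\Pr(W_{t-p:t} = \mathbf{1}_{p+1})} = Y_t(\mathbf{1}_{p+1}).$$

Similarly,

$$\mathbb{E}\left[ \frac{Y_t^{obs} \, \mathbb{1}\{W_{t-p:t} = \mathbf{0}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{0}_{p+1})} \right] = Y_t(\mathbf{0}_{p+1}).$$

Summing over all $t \in \{p+1 : T\}$ finishes the proof. □

EC.4. Proofs and Discussions from Section 3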

In Section 3 we focus on the case when p = m. Throughout this section of the appendix, we use only m instead of p.

EC.4.1. Extra Notations Used in the Proofs from Section 3

For any t ∈ { m + 1 : T } , denote t ( T , Y ) = Y t ( m +1 ) (cid:34) { W t − m : t = ( m +1 ) } · · m (cid:89) l =1 { t − l +1 ∈ T } − (cid:35) − Y t ( m +1 ) (cid:34) { W t − m : t = ( m +1 ) } · · m (cid:89) l =1 { t − l +1 ∈ T } − (cid:35) (EC.1)where we use 2 (cid:81) ml =1 { t − l +1 ∈ T } to calculate the inverse propensity score. When T and Y are clearfrom the context we omit them and use t for t ( T , Y ).Using the above notation, we could re-writeˆ τ m − τ m = 1 T − m T (cid:88) t = m +1 t ( T , Y )Note that ∀ t ∈ { m + 1 , m + 2 , ..., T } , E [ t ( T , Y )] = 0 . (EC.2)Recall that any regular switchback experiment can be represented by T = { t , t , ..., t K } ⊆ [ T ].Deﬁne f T : [ T ] → T to be the “determining randomization point of period t ”, i.e. f T ( t ) = max { j | j ∈ T , j ≤ t } such that W f T ( t ) uniquely determines the distribution of W t , i.e. W t = W f T ( t ) . When T is clear fromthe context we also omit the subscript and use f ( t ) for f T ( t ).Similarly, we deﬁne f m T ( t ) : [ T ] → { , } T , which maps a time period to a subset of the set T , tobe the “determining randomization points of periods { t − m, t − m + 1 , ..., t } ”, i.e. f m T ( t ) = { j |∃ i ∈ { t − m, ..., t } , s.t.j = f T ( i ) } -companion to Bojinov, Simchi-Levi, and Zhao:

Switchback Experiments ec3 such that f m T ( t ) ⊆ T ⊆ [ T ]. And f m T ( t ) contains all the time periods whose coin ﬂips determine thedistributions of W t − m , W t − m +1 , ..., W t . Denote | f m T ( t ) | = J . We keep in mind that J depends on m, t and T , yet they are all omitted for brevity. Since W t − m , W t − m +1 , ..., W t are determined by at leastone randomization point f ( t − m ), we know that f m T ( t ) (cid:54) = ∅ is non-empty, i.e., | f m T ( t ) | = J ≥ . (EC.3)Finally, deﬁne “overlapping randomization points of periods { t − m, t − m + 1 , ..., t } and { t (cid:48) − m, t (cid:48) − m + 1 , ..., t (cid:48) } ” to be O T ( t, t (cid:48) ) = f m T ( t ) ∩ f m T ( t (cid:48) )Denote | O T ( t, t (cid:48) ) | = J o . We keep in mind that J o depends on m, t, t (cid:48) and T , yet they are all omittedfor brevity. EC.4.2. Lemma 1: Adversarial Selection of Potential OutcomesEC.4.2.1. Preliminaries.

We first introduce two lemmas for the proof of Lemma 1.
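Both lemmas are stated in terms of the sets of determining and overlapping randomization points defined in Section EC.4.1. As a reading aid, a minimal Python sketch of these sets follows. It assumes that a design is given as a set of randomization points containing period 1 and that t ≥ m + 1; the function names are hypothetical.

```python
def determining_point(design, t):
    """The latest randomization point at or before period t."""
    return max(j for j in design if j <= t)


def determining_points(design, t, m):
    """Randomization points whose coin flips determine W_{t-m}, ..., W_t."""
    return {determining_point(design, i) for i in range(t - m, t + 1)}


def overlapping_points(design, t, t_prime, m):
    """Randomization points shared by the windows ending at t and t_prime."""
    return determining_points(design, t, m) & determining_points(design, t_prime, m)


# Hypothetical design on T = 12 periods with m = 2.
design = {1, 5, 7, 9}
print(determining_points(design, 6, 2))      # window {4, 5, 6}
print(overlapping_points(design, 6, 8, 2))   # windows {4, 5, 6} and {6, 7, 8}
print(overlapping_points(design, 4, 12, 2))  # well-separated windows
```

In this example the windows ending at periods 6 and 8 share exactly one randomization point, so J_o = 1, while the windows ending at periods 4 and 12 share none, which is the J_o = 0 case of Lemma EC.3.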

Lemma EC.2.

Under Assumptions 2–3, for any t ∈ [ T ] , let | f m T ( t ) | = J . E [ t ] =(2 J − Y t ( m +1 ) + 2 Y t ( m +1 ) Y t ( m +1 ) + (2 J − Y t ( m +1 ) . (EC.4) Proof of Lemma EC.2.

Denote | f m T ( t ) | = J . Let the elements be f m T ( t ) = { u , u , ..., u J } . Let u < u < ... < u J .Using the notations deﬁned earlier in Section EC.4, E [ t ] = Pr ( W t − m : t = m +1 ) · (cid:8) Y t ( m +1 )(2 J · − − Y t ( m +1 )(2 J · − (cid:9) + Pr ( W t − m : t = m +1 ) · (cid:8) Y t ( m +1 )(2 J · − − Y t ( m +1 )(2 J · − (cid:9) + Pr ( W t − m : t (cid:54) = J or J ) · (cid:8) Y t ( m +1 )(2 J · − − Y t ( m +1 )(2 J · − (cid:9) = Pr (( W u , ..., W u J ) = J ) · (cid:8) (2 J − Y t ( m +1 ) + Y t ( m +1 ) (cid:9) + Pr (( W u , ..., W u J ) = J ) · (cid:8) − Y t ( m +1 ) − (2 J − Y t ( m +1 ) (cid:9) + Pr (( W u , ..., W u J ) (cid:54) = J or J ) · {− Y t ( m +1 ) + Y t ( m +1 ) } =(2 J − Y t ( m +1 ) + 2 Y t ( m +1 ) Y t ( m +1 ) + (2 J − Y t ( m +1 ) which ﬁnishes the proof. (cid:3) Lemma EC.3.

Under Assumptions 2–3, for any t < t (cid:48) ∈ [ T ] , when | O T ( t, t (cid:48) ) | = J o = 0 , E [ t t (cid:48) ] =0 . (EC.5) When | O T ( t, t (cid:48) ) | = J o ≥ , E [ t t (cid:48) ] =(2 J o − Y t ( m +1 ) Y t (cid:48) ( m +1 ) + Y t ( m +1 ) Y t (cid:48) ( m +1 )+ Y t ( m +1 ) Y t (cid:48) ( m +1 ) + (2 J o − Y t ( m +1 ) Y t (cid:48) ( m +1 ) . (EC.6) c4 e-companion to Bojinov, Simchi-Levi, and Zhao:

Switchback Experiments

Proof of Lemma EC.3.

Denote | f m T ( t ) | = J , | f m T ( t (cid:48) ) | = J (cid:48) , and | O T ( t, t (cid:48) ) | = J o . Let the elementsbe f m T ( t ) = { u , u , ..., u J } , f m T ( t (cid:48) ) = { u (cid:48) , u (cid:48) , ..., u (cid:48) J (cid:48) } , and O T ( t, t (cid:48) ) = { u o , u o , ..., u o J o } . Let u < u <... < u J , u (cid:48) < u (cid:48) < ... < u (cid:48) J (cid:48) , and u o < u o < ... < u o J o .One time period could have diﬀerent numberings in f m T ( t ), f m T ( t (cid:48) ), and O T ( t, t (cid:48) ). For example, u J − J o +1 = u (cid:48) = u o , and u J = u (cid:48) J o = u o J o . See Table EC.1 for an illustrator of the determining ran-domization points and the overlapping randomization points. Table EC.1 Illustrator of the determining randomization points and the overlapping randomization points. u u ... u J − J o +1 ... u J u o ... u o J o u (cid:48) ... u (cid:48) J o u (cid:48) J o +1 ... u (cid:48) J (cid:48) Each columns stands for one time period. The ﬁrst row stands for the determining randomization points of f m T ( t ); the second row for the overlapping randomization points of O T ( t, t (cid:48) ); and the third row for the determiningrandomization points of f m T ( t (cid:48) ). First, when J o = 0, this implies that t and t (cid:48) are independent. Then E [ t t (cid:48) ] = E [ t ] E [ t (cid:48) ] = 0,where the second equality is due to (EC.2).When J o ≥

1, this implies that t and t (cid:48) are correlated. Using the notations deﬁned above, E [ t t (cid:48) ] = E W u o ,...,W u o J o (cid:104) E (cid:104) t t (cid:48) (cid:12)(cid:12)(cid:12) W u o , ..., W u o J o (cid:105)(cid:105) (EC.7)= Pr (cid:16) ( W u o , ..., W u o J o ) = J o (cid:17) E (cid:104) t t (cid:48) (cid:12)(cid:12)(cid:12) ( W u o , ..., W u o J o ) = J o (cid:105) + Pr (cid:16) ( W u o , ..., W u o J o ) = J o (cid:17) E (cid:104) t t (cid:48) (cid:12)(cid:12)(cid:12) ( W u o , ..., W u o J o ) = J o (cid:105) + Pr (cid:16) ( W u o , ..., W u o J o ) (cid:54) = J o or J o (cid:17) E (cid:104) t t (cid:48) (cid:12)(cid:12)(cid:12) ( W u o , ..., W u o J o ) (cid:54) = J o or J o (cid:105) Next we go over the three cases of ( W u o , ..., W u o J o ) as decomposed above. Note that conditionalon ( W u o , ..., W u o J o ), t and t (cid:48) are independent, i.e., E (cid:104) t t (cid:48) (cid:12)(cid:12)(cid:12) W u o , ..., W u o J o (cid:105) = E (cid:104) t (cid:12)(cid:12)(cid:12) W u o , ..., W u o J o (cid:105) E (cid:104) t (cid:48) (cid:12)(cid:12)(cid:12) W u o , ..., W u o J o (cid:105) (1) With probability 1 / J o , ( W u o , ..., W u o J o ) = J o , in which case E (cid:104) t (cid:12)(cid:12)(cid:12) W u o , ..., W u o J o (cid:105) = Pr ( W t − m : t = m +1 ) · (cid:8) Y t ( m +1 )(2 J · −

1) + Y t ( m +1 ) (cid:9) + Pr ( W t − m : t (cid:54) = m +1 ) · (cid:8) Y t ( m +1 )(2 J · −

1) + Y t ( m +1 ) (cid:9) = Pr (cid:0) ( W u , W u , ..., W u J − J o ) = J − J o (cid:1) · (cid:8) Y t ( m +1 )(2 J −

1) + Y t ( m +1 ) (cid:9) + Pr (cid:0) ( W u , W u , ..., W u J − J o ) (cid:54) = J − J o (cid:1) · {− Y t ( m +1 ) + Y t ( m +1 ) } = 12 J − J o · (cid:8) Y t ( m +1 )(2 J −

1) + Y t ( m +1 ) (cid:9) + (cid:18) − J − J o (cid:19) · {− Y t ( m +1 ) + Y t ( m +1 ) } =(2 J o − Y t ( m +1 ) + Y t ( m +1 ) -companion to Bojinov, Simchi-Levi, and Zhao:

Switchback Experiments ec5 where the third equality is due to (2). Similarly, E (cid:104) t (cid:48) (cid:12)(cid:12)(cid:12) W u o , ..., W u o J o (cid:105) = Pr ( W t (cid:48) − m : t (cid:48) = m +1 ) · (cid:110) Y t (cid:48) ( m +1 )(2 J (cid:48) · −

1) + Y t (cid:48) ( m +1 ) (cid:111) + Pr ( W t (cid:48) − m : t (cid:48) (cid:54) = m +1 ) · (cid:110) Y t (cid:48) ( m +1 )(2 J (cid:48) · −

1) + Y t (cid:48) ( m +1 ) (cid:111) = Pr (cid:16) ( W u (cid:48) J o +1 , W u (cid:48) J o +2 , ..., W u (cid:48) J (cid:48) ) = J (cid:48) − J o (cid:17) · (cid:110) Y t (cid:48) ( m +1 )(2 J (cid:48) −

1) + Y t (cid:48) ( m +1 ) (cid:111) + Pr (cid:16) ( W u (cid:48) J o +1 , W u (cid:48) J o +2 , ..., W u (cid:48) J (cid:48) ) (cid:54) = J (cid:48) − J o (cid:17) · {− Y t (cid:48) ( m +1 ) + Y t (cid:48) ( m +1 ) } = 12 J (cid:48) − J o · (cid:110) Y t (cid:48) ( m +1 )(2 J (cid:48) −

1) + Y t (cid:48) ( m +1 ) (cid:111) + (cid:18) − J (cid:48) − J o (cid:19) · {− Y t (cid:48) ( m +1 ) + Y t (cid:48) ( m +1 ) } =(2 J o − Y t (cid:48) ( m +1 ) + Y t (cid:48) ( m +1 ) (2) With probability 1 / J o , ( W u o , ..., W u o J o ) = J o , in which case E (cid:104) t (cid:12)(cid:12)(cid:12) W u o , ..., W u o J o (cid:105) = Pr ( W t − m : t = m +1 ) · (cid:8) − Y t ( m +1 ) − Y t ( m +1 )(2 J · − (cid:9) + Pr ( W t − m : t (cid:54) = m +1 ) · (cid:8) − Y t ( m +1 ) − Y t ( m +1 )(2 J · − (cid:9) = 12 J − J o · (cid:8) − Y t ( m +1 ) − Y t ( m +1 )(2 J − (cid:9) + (cid:18) − J − J o (cid:19) · {− Y t ( m +1 ) + Y t ( m +1 ) } = − Y t ( m +1 ) − (2 J o − Y t ( m +1 ) E (cid:104) t (cid:48) (cid:12)(cid:12)(cid:12) W u o , ..., W u o J o (cid:105) = Pr ( W t (cid:48) − m : t (cid:48) = m +1 ) · (cid:110) − Y t (cid:48) ( m +1 ) − Y t (cid:48) ( m +1 )(2 J (cid:48) · − (cid:111) + Pr ( W t (cid:48) − m : t (cid:48) (cid:54) = m +1 ) · (cid:110) − Y t (cid:48) ( m +1 ) − Y t (cid:48) ( m +1 )(2 J (cid:48) · − (cid:111) = 12 J (cid:48) − J o · (cid:110) − Y t (cid:48) ( m +1 ) − Y t (cid:48) ( m +1 )(2 J (cid:48) − (cid:111) + (cid:18) − J (cid:48) − J o (cid:19) · {− Y t (cid:48) ( m +1 ) + Y t (cid:48) ( m +1 ) } = − Y t (cid:48) ( m +1 ) − (2 J o − Y t (cid:48) ( m +1 ) (3) With probability 1 − · (1 / J o ), ( W u o , ..., W u o J o ) (cid:54) = J o or J o , in which case E (cid:104) t (cid:12)(cid:12)(cid:12) W u o , ..., W u o J o (cid:105) = − Y t ( m +1 ) + Y t ( m +1 ) E (cid:104) t (cid:48) (cid:12)(cid:12)(cid:12) W u o , ..., W u o J o (cid:105) = − Y t (cid:48) ( m +1 ) + Y t (cid:48) ( m +1 )Finally, putting all above together into (EC.7), we have E [ t t (cid:48) ] = 12 J o · (cid:110) (2 J o − Y t ( m +1 ) + Y t ( m +1 ) (cid:111) · (cid:110) (2 J o − Y t (cid:48) ( m +1 ) + Y t (cid:48) ( m +1 ) (cid:111) + 12 J o · (cid:110) − Y t ( m +1 ) − (2 J o − Y t ( m +1 ) (cid:111) · (cid:110) − Y t (cid:48) ( m +1 ) − (2 J o − Y t (cid:48) ( m +1 ) 
(cid:111) + (cid:26) − J o (cid:27) · {− Y t ( m +1 ) + Y t ( m +1 ) } · {− Y t (cid:48) ( m +1 ) + Y t (cid:48) ( m +1 ) } =(2 J o − Y t ( m +1 ) Y t (cid:48) ( m +1 ) + Y t ( m +1 ) Y t (cid:48) ( m +1 ) + Y t ( m +1 ) Y t (cid:48) ( m +1 ) + (2 J o − Y t ( m +1 ) Y t (cid:48) ( m +1 )which ﬁnishes the proof. (cid:3) c6 e-companion to Bojinov, Simchi-Levi, and Zhao:

Switchback Experiments

EC.4.2.2. Proof of Lemma 1.

The proof of Lemma 1 proceeds by carefully expanding the risk function, i.e., the expected squared loss.

Proof of Lemma 1.

From Lemma EC.2 and Lemma EC.3, all the terms are quadratic, andall the coeﬃcients are non-negative. That is, after multiplying the constant ( T − m ) , for any T ∈ [ T ] , Y ∈ Y we can express in canonical form the following:( T − m ) · E W T ∼ η ( · ) (cid:104) (ˆ τ m ( W T , Y ) − τ m ( Y )) (cid:105) = T (cid:88) t = m +1 (cid:8) a t (11) Y t ( m +1 ) + a t (10) Y t ( m +1 ) Y t ( m +1 ) + a t (00) Y t ( m +1 ) (cid:9) + (cid:88) m +1 ≤ t

0, so the inequality is strict. Similarly, if ∃ t ∈ { m +1 , ..., T } such that − B < Y t ( m +1 )

0, so the inequality is strict. (cid:3)

EC.4.2.3. Implications of Lemma 1.

Lemma 1 simplifies the minimax problem in (6). Instead of treating it as a minimax problem, we can now replace $Y$ by either $Y^+$ or $Y^-$ and solve only a minimization problem.

Here we state Lemma EC.4, which is a direct implication of Lemma 1. It will be used frequently later on.

Lemma EC.4.

When Y = Y + or Y = Y − , under Assumptions 1–3, for any t ∈ [ T ] , E [ t ] =2 J +1 B . For any t < t (cid:48) ∈ [ T ] , when | O T ( t, t (cid:48) ) | = J o = 0 , E [ t t (cid:48) ] =0 When | O T ( t, t (cid:48) ) | = J o ≥ , E [ t t (cid:48) ] =2 J o +1 B Proof of Lemma EC.4.

Substitute $Y_t(\mathbf{1}_{m+1}) = Y_t(\mathbf{0}_{m+1}) = B$ or $-B$ into the expressions in Lemmas EC.2 and EC.3. □
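The second moments in Lemmas EC.2 and EC.4 can be checked numerically on a small instance by enumerating all coin flips. The sketch below is an illustration under stated assumptions, not part of the proof: it assumes fair coins at every randomization point, takes the per-period term of (EC.1) to be $Y_t(\mathbf{1}_{m+1})(\mathbb{1}\{W_{t-m:t} = \mathbf{1}_{m+1}\} 2^J - 1) - Y_t(\mathbf{0}_{m+1})(\mathbb{1}\{W_{t-m:t} = \mathbf{0}_{m+1}\} 2^J - 1)$ with J the number of determining randomization points, and uses the same potential outcomes in every period. The names `delta` and `exact_moment` are hypothetical.

```python
from itertools import product


def determining_points(design, t, m):
    """Randomization points whose coin flips determine W_{t-m}, ..., W_t."""
    return {max(j for j in design if j <= i) for i in range(t - m, t + 1)}


def delta(design, coins, t, m, y1, y0):
    """Per-period term of (EC.1) for one realization of the coin flips.

    coins maps each randomization point to its flip (1 or 0);
    y1 and y0 are the potential outcomes under all-ones and all-zeros.
    """
    pts = determining_points(design, t, m)
    J = len(pts)
    all_one = all(coins[u] == 1 for u in pts)
    all_zero = all(coins[u] == 0 for u in pts)
    return y1 * (all_one * 2 ** J - 1) - y0 * (all_zero * 2 ** J - 1)


def exact_moment(design, m, t, t_prime, y1, y0):
    """Expectation of delta_t * delta_t' over all equally likely flips."""
    pts = sorted(design)
    total = 0
    for flips in product([0, 1], repeat=len(pts)):
        coins = dict(zip(pts, flips))
        total += (delta(design, coins, t, m, y1, y0)
                  * delta(design, coins, t_prime, m, y1, y0))
    return total / 2 ** len(pts)


design, m, B = {1, 5, 7, 9}, 2, 2
J = len(determining_points(design, 6, m))
# Lemma EC.4, squared term: 2^(J+1) * B^2.
print(exact_moment(design, m, 6, 6, B, B), 2 ** (J + 1) * B ** 2)
# Lemma EC.4, cross term with one overlapping point: 2^(1+1) * B^2.
print(exact_moment(design, m, 6, 8, B, B), 2 ** 2 * B ** 2)
# Lemma EC.4, cross term with no overlapping point: 0.
print(exact_moment(design, m, 4, 12, B, B))
```

Enumeration is exponential in the number of randomization points, so it is only meant for small sanity checks of the closed-form expressions.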

EC.4.3. Structural Results of the Optimal Design

First note that when we focus on the optimal design, we treat both T and m as constants. So the constant factor $1/(T-m)^2$ in the expression of the risk function does not affect the optimal design.

EC.4.3.1. Proof of Lemma 2.

Proof of Lemma 2.

We prove the two parts separately, both by contradiction. (1)

Suppose there exists an optimal design T = { t = 1 , t , t , ..., t K } such that t ≤ m + 1. Thenwe try to construct another design ˜ T , such that (cid:12)(cid:12)(cid:12) ˜ T (cid:12)(cid:12)(cid:12) = K = | T | −

1. And the K elements are ˜ T = { ˜ t = 1 , ˜ t = t , ˜ t = t , ..., ˜ t K − = t K } . Table EC.2 An example of two regular switchback experiments T and ˜ T when m = 4 and t = 3 . T (cid:88) − (cid:88) − − (cid:88) ...˜ T (cid:88) − − − − (cid:88) ... Each checkmark beneath a number indicates that this number is within that set; and each dashbeneath a number indicates that this number is not within that set. For example, the checkmark (cid:88) beneath number 3 indicates that 3 ∈ T ; and the dash − beneath number 3 indicates that 3 (cid:54) = ˜ T . Next we argue that when Y = Y + or Y = Y − , r ( T , Y ) > r (˜ T , Y ) , which suggests that T is not the optimal design.First, focus on the squared terms. For any m + 1 ≤ t ≤ t + m − t ∈ f m T ( t ) , t (cid:54) = f m ˜ T ( t ). Moreover, t − m ≤ t −

1, so that t ∈ f m ˜ T ( t ). So f m T ( t ) − { t } = f m ˜ T ( t ), and (cid:12)(cid:12) f m ˜ T ( t ) (cid:12)(cid:12) ≥

1. As a result, E [ t ( T ) ] − E [ t (˜ T ) ] ≥ (2 − ) B = 4 B . For any t ≥ t + m , either (i) f T ( t − m ) = t , in which case f ˜ T ( t − m ) = t . This is the only diﬀerencebetween f m T ( t ) and f m ˜ T ( t ), i.e., f m T ( t ) − { t } = f m ˜ T ( t ) − { t } . So | f m T ( t ) | = (cid:12)(cid:12) f m ˜ T ( t ) (cid:12)(cid:12) . The second caseis (ii) f T ( t − m ) ≥ t , in which case f m T ( t ) = f m ˜ T ( t ). Both cases suggest that E [ t ( T ) ] − E [ t (˜ T ) ] = 0 . So we have T (cid:88) t = m +1 E (cid:2) t ( T ) (cid:3) − T (cid:88) t = m +1 E (cid:104) t (˜ T ) (cid:105) = t + m − (cid:88) t = m +1 (cid:16) E (cid:2) t ( T ) (cid:3) − E (cid:104) t (˜ T ) (cid:105)(cid:17) + T (cid:88) t = t + m (cid:16) E (cid:2) t ( T ) (cid:3) − E (cid:104) t (˜ T ) (cid:105)(cid:17) ≥ t + m − (cid:88) t = m +1 (4 B ) + 0=4( t − B > c8 e-companion to Bojinov, Simchi-Levi, and Zhao:

Switchback Experiments

Second, focus on the cross product terms. For any t and t (cid:48) such that m + 1 ≤ t < t (cid:48) ≤ t + m − t ∈ O T ( t, t (cid:48) ) , t (cid:54) = O ˜ T ( t, t (cid:48) ). Moreover, t − m ≤ t −

1, so that t ∈ O T ( t, t (cid:48) ). So O T ( t, t (cid:48) ) − { t } = O ˜ T ( t, t (cid:48) ), and | O ˜ T ( t, t (cid:48) ) | ≥

1. As a result, E [ t ( T ) t (cid:48) ( T )] − E [ t (˜ T ) t (cid:48) (˜ T )] ≥ (2 − ) B = 4 B > . For any m + 1 ≤ t < t (cid:48) ≤ T such that t (cid:48) ≥ t + m , either (i) f T ( t (cid:48) − m ) = t , in which case f ˜ T ( t (cid:48) − m ) = t . So O T ( t, t (cid:48) ) − { t } = O ˜ T ( t, t (cid:48) ) − { t } . So | O T ( t, t (cid:48) ) | = | O ˜ T ( t, t (cid:48) ) | . The second case is (ii) f T ( t (cid:48) − m ) ≥ t , in which case O T ( t, t (cid:48) ) = O ˜ T ( t, t (cid:48) ). Both cases suggest that E [ t ( T ) t (cid:48) ( T )] − E [ t (˜ T ) t (cid:48) (˜ T )] = 0 . So we have (cid:88) m +1 ≤ t r (˜ T , Y ) . (2) Suppose there exists an optimal design T = { t = 1 , t , t , ..., t K } such that t K ≥ T − m + 1.Then we try to construct another design ˜ T , such that (cid:12)(cid:12)(cid:12) ˜ T (cid:12)(cid:12)(cid:12) = K = | T | −

1. And the K elements are˜ T = { ˜ t = 1 , ˜ t = t , ˜ t = t , ..., ˜ t K − = t K − } . Table EC.3 An example of two regular switchback experiments T and ˜ T when m = 4 and t K = T − . ... T − T − T − T − T − T T ... (cid:88) − (cid:88) (cid:88) − − ˜ T ... (cid:88) − (cid:88) − − − Each checkmark beneath a number indicates that this number is within that set; and each dashbeneath a number indicates that this number is not within that set. For example, the checkmark (cid:88) beneath number T − T − ∈ T ; and the dash − beneath number T − T − (cid:54) = ˜ T . Next we argue that when Y = Y + or Y = Y − , r ( T , Y ) > r (˜ T , Y ) , which suggests that T is not the optimal design. -companion to Bojinov, Simchi-Levi, and Zhao:

Switchback Experiments ec9

First focus on the squared terms. For any m + 1 ≤ t ≤ t K − f m T ( t ) = f m ˜ T ( t ) is totally unchanged. E [ t ( T ) ] − E [ t (˜ T ) ] = 0 . For any t K ≤ t ≤ T , t K / ∈ f m ˜ T ( t ) , t K ∈ f m T ( t ). And all the other determining randomization pointsare unchanged. So f m ˜ T ( t ) ⊂ f m T ( t ) and f m T ( t ) − { t K } = f m ˜ T ( t ) and (cid:12)(cid:12) f m ˜ T ( t ) (cid:12)(cid:12) ≥ E [ t ( T ) ] − E [ t (˜ T ) ] ≥ (2 − ) B = 4 B . So we have T (cid:88) t = m +1 E (cid:2) t ( T ) (cid:3) − T (cid:88) t = m +1 E (cid:104) t (˜ T ) (cid:105) = t K − (cid:88) t = m +1 (cid:16) E (cid:2) t ( T ) (cid:3) − E (cid:104) t (˜ T ) (cid:105)(cid:17) + T (cid:88) t = t K (cid:16) E (cid:2) t ( T ) (cid:3) − E (cid:104) t (˜ T ) (cid:105)(cid:17) ≥ T (cid:88) t = t K (4 B ) + 0=4( T − t K + 1) B > m + 1 ≤ t < t (cid:48) ≤ T such that t ≤ t K − O T ( t, t (cid:48) ) = O ˜ T ( t, t (cid:48) ) is totally unchanged. E [ t ( T ) t (cid:48) ( T )] − E [ t (˜ T ) t (cid:48) (˜ T )] = 0 . For any t K ≤ t < t (cid:48) ≤ T , since t (cid:48) − m ≤ T − m ≤ t K −

1, so f ˜ T ( t (cid:48) − m ) < t K and | O ˜ T ( t, t (cid:48) ) | ≥ O ˜ T ( t, t (cid:48) ) ⊂ O T ( t, t (cid:48) ). So E [ t ( T ) t (cid:48) ( T )] − E [ t (˜ T ) t (cid:48) (˜ T )] ≥ (2 − ) B ≥ B > . So we have (cid:88) m +1 ≤ t r (˜ T , Y ) . (cid:3) c10 e-companion to Bojinov, Simchi-Levi, and Zhao:

Switchback Experiments

EC.4.3.2. Proof of Lemma 3.

Proof of Lemma 3.

Recall that we denote $t_0 = 1$ and $t_{K+1} = T + 1$. First, from Lemma 2, $t_1 \ge m + 2$ and $t_K \le T - m$, so the cases $k = 1$ and $k = K$ both hold. Next, for $2 \le k \le K - 1$, we prove the claim by contradiction.

Suppose there exists some optimal design $\mathbb{T}$ such that $t_{k+1} - t_{k-1} \le m - 1$ for some $2 \le k \le K - 1$. Denote $\mathcal{K} = \{k \in \{2 : K-1\} \mid t_{k+1} - t_{k-1} \le m - 1\}$. Since $\mathcal{K} \ne \emptyset$, pick $j = \max \mathcal{K}$ to be the largest element in $\mathcal{K}$, so $j \in \{2 : K-1\}$. We also know that $t_{j+2} \ge t_j + m$, because otherwise $j + 1 \in \mathcal{K}$, which contradicts the maximality of $j$ (when $j = K - 1$, the inequality is the case $k = K$ established above).

We now construct another design $\tilde{\mathbb{T}}$ by deleting $t_j$, such that $|\tilde{\mathbb{T}}| = K = |\mathbb{T}| - 1$, and the $K$ elements are $\tilde{\mathbb{T}} = \{\tilde{t}_0 = 1, \tilde{t}_1 = t_1, ..., \tilde{t}_{j-1} = t_{j-1}, \tilde{t}_j = t_{j+1}, ..., \tilde{t}_{K-1} = t_K\}$.

Table EC.4 An example of two regular switchback experiments $\mathbb{T}$ and $\tilde{\mathbb{T}}$ when $m = 4$ and $t_j = t_{j+1} - 1 = t_{j-1} + 2$.

| | ... | $t_{j-1}$ | $t_{j-1}+1$ | $t_j$ | $t_{j+1}$ | $t_{j+1}+1$ | $t_{j+1}+2$ | $t_{j+2}$ | ... |
|---|---|---|---|---|---|---|---|---|---|
| $\mathbb{T}$ | ... | ✓ | − | ✓ | ✓ | − | − | ✓ | ... |
| $\tilde{\mathbb{T}}$ | ... | ✓ | − | − | ✓ | − | − | ✓ | ... |

Each checkmark beneath a number indicates that this number is within that set, and each dash beneath a number indicates that this number is not within that set. For example, the checkmark beneath $t_j$ in the first row indicates that $t_j \in \mathbb{T}$, and the dash beneath $t_j$ in the second row indicates that $t_j \notin \tilde{\mathbb{T}}$.

Next we argue that when $\mathbb{Y} = \mathbb{Y}^+$ or $\mathbb{Y} = \mathbb{Y}^-$, $r(\mathbb{T}, \mathbb{Y}) > r(\tilde{\mathbb{T}}, \mathbb{Y})$, which suggests that $\mathbb{T}$ is not the optimal design.

First focus on the squared terms.

When $t \le t_j - 1$, $f^m_{\mathbb{T}}(t) = f^m_{\tilde{\mathbb{T}}}(t)$ is totally unchanged, so $\mathbb{E}[\varepsilon_t(\mathbb{T})^2] - \mathbb{E}[\varepsilon_t(\tilde{\mathbb{T}})^2] = 0$.

When $t_j \le t \le t_j + m - 1$, we have $t - m \le t_j - 1$, so $t_j$ is a determining randomization point under $\mathbb{T}$ but is absent from $\tilde{\mathbb{T}}$. So $t_j \notin f^m_{\tilde{\mathbb{T}}}(t)$ and $t_j \in f^m_{\mathbb{T}}(t)$, and all the other determining randomization points are unchanged. So $f^m_{\tilde{\mathbb{T}}}(t) \subset f^m_{\mathbb{T}}(t)$, $f^m_{\mathbb{T}}(t) - \{t_j\} = f^m_{\tilde{\mathbb{T}}}(t)$, and $|f^m_{\tilde{\mathbb{T}}}(t)| \ge 1$. Hence $\mathbb{E}[\varepsilon_t(\mathbb{T})^2] - \mathbb{E}[\varepsilon_t(\tilde{\mathbb{T}})^2] \ge (2^3 - 2^2) B^2 = 4B^2$.

When $t_j + m \le t \le T$, either (i) $f_{\mathbb{T}}(t - m) = t_j$, in which case $f_{\tilde{\mathbb{T}}}(t - m) = t_{j-1}$. This is the only difference between $f^m_{\mathbb{T}}(t)$ and $f^m_{\tilde{\mathbb{T}}}(t)$, i.e., $f^m_{\mathbb{T}}(t) - \{t_j\} = f^m_{\tilde{\mathbb{T}}}(t) - \{t_{j-1}\}$, so $|f^m_{\mathbb{T}}(t)| = |f^m_{\tilde{\mathbb{T}}}(t)|$. The second case is (ii) $f_{\mathbb{T}}(t - m) \ge t_{j+1}$, in which case $f^m_{\mathbb{T}}(t) = f^m_{\tilde{\mathbb{T}}}(t)$. Both cases suggest that $\mathbb{E}[\varepsilon_t(\mathbb{T})^2] - \mathbb{E}[\varepsilon_t(\tilde{\mathbb{T}})^2] = 0$.

So we have T (cid:88) t = m +1 E (cid:2) t ( T ) (cid:3) − T (cid:88) t = m +1 E (cid:104) t (˜ T ) (cid:105) = t j − (cid:88) t = m +1 (cid:16) E (cid:2) t ( T ) (cid:3) − E (cid:104) t (˜ T ) (cid:105)(cid:17) + t j + m − (cid:88) t = t j (cid:16) E (cid:2) t ( T ) (cid:3) − E (cid:104) t (˜ T ) (cid:105)(cid:17) + T (cid:88) t = t j + m (cid:16) E (cid:2) t ( T ) (cid:3) − E (cid:104) t (˜ T ) (cid:105)(cid:17) ≥ t j + m − (cid:88) t = t j (4 B ) + 0=4( m − B > m + 1 ≤ t < t (cid:48) ≤ T . There are many cases whichwe summarize in Table EC.5 Table EC.5 Summary of the diﬀerences between cross-product terms under two regularswitchback experiments T and ˜ T . T ˜ T m + 1 ≤ t ≤ t j − , t < t (cid:48) ≤ T unchanged t j − ≤ t ≤ t j − , t < t (cid:48) ≤ t j + m − t j − ≤ t ≤ t j − , t j + m ≤ t (cid:48) ≤ t j +1 + m − B t j − ≤ t ≤ t j − , t j +1 + m ≤ t (cid:48) ≤ T unchanged t j ≤ t < t (cid:48) ≤ t j + m − | O T ( t,t (cid:48) ) | +1 B | O ˜ T ( t,t (cid:48) ) | +1 B t j ≤ t ≤ t j + m − , t j + m ≤ t (cid:48) ≤ T unchanged t j + m ≤ t < t (cid:48) ≤ T unchanged We explain Table EC.5.When m + 1 ≤ t ≤ t j − , t < t (cid:48) ≤ T , all the overlapping randomization points are earlier than t j − −

1, i.e., ∀ a ∈ O T ( t, t (cid:48) ) , a ≤ t j − − ∀ a ∈ O ˜ T ( t, t (cid:48) ) , a ≤ t j − −

1. So t j / ∈ O T ( t, t (cid:48) ), and the overlappingrandomization points are unchanged, i.e., O T ( t, t (cid:48) ) = O ˜ T ( t, t (cid:48) ).When t j − ≤ t ≤ t j − , t < t (cid:48) ≤ t j + m −

1, all the overlapping randomization points are earlierthan t j − , i.e., ∀ a ∈ O T ( t, t (cid:48) ) , a ≤ t j − ; ∀ a ∈ O ˜ T ( t, t (cid:48) ) , a ≤ t j − . So t j / ∈ O T ( t, t (cid:48) ), and the overlappingrandomization points are unchanged, i.e., O T ( t, t (cid:48) ) = O ˜ T ( t, t (cid:48) ).When t j − ≤ t ≤ t j − , t j + m ≤ t (cid:48) ≤ t j +1 + m −

1, changing from T to ˜ T increases the expectedvalues. This is because t (cid:48) − m ≥ t j > t . So ﬁrst, O T ( t, t (cid:48) ) = ∅ . But f ˜ T ( t (cid:48) − m ) = t j − and t j − ∈ f m ˜ T ( t ),which suggests that t j − ∈ O ˜ T ( t, t (cid:48) ) . Also, ∀ a ∈ f m ˜ T ( t (cid:48) ) , a ≥ t j − ; ∀ a ∈ f m T ( t ) , a ≤ t j − , which suggeststhat t j − is the only overlapping element. So, O ˜ T ( t, t (cid:48) ) = { t j − } . In this case, E [ t ( T ) t (cid:48) ( T )] − E [ t (˜ T ) t (cid:48) (˜ T )] = (0 − ) B = − B . When t j − ≤ t ≤ t j − , t j +1 + m ≤ t (cid:48) ≤ T , since t (cid:48) − m ≥ t j +1 > t j > t , O T ( t, t (cid:48) ) = O ˜ T ( t, t (cid:48) ) = ∅ . c12 e-companion to Bojinov, Simchi-Levi, and Zhao:


When t j ≤ t < t (cid:48) ≤ t j + m − t j ∈ O T ( t, t (cid:48) ) and t j / ∈ O ˜ T ( t, t (cid:48) ). And all the other overlappingrandomization points are unchanged, so O T ( t, t (cid:48) ) − { t j } = O ˜ T ( t, t (cid:48) ) and | O ˜ T ( t, t (cid:48) ) | ≥ . In this case, E [ t ( T ) t (cid:48) ( T )] − E [ t (˜ T ) t (cid:48) (˜ T )] ≥ (2 − ) B = 4 B . When t j ≤ t ≤ t j + m − , t j + m ≤ t (cid:48) ≤ T , either (i) f m T ( t (cid:48) − m ) = t j , in which case f ˜ T ( t (cid:48) − m ) = t j − .This is the only diﬀerence between O T ( t, t (cid:48) ) and O ˜ T ( t, t (cid:48) ), i.e., O T ( t, t (cid:48) ) − { t j } = O ˜ T ( t, t (cid:48) ) − { t j − } . | O T ( t, t (cid:48) ) | = | O ˜ T ( t, t (cid:48) ) | . The second case is (ii) f T ( t (cid:48) − m ) ≥ t j +1 , in which case O T ( t, t (cid:48) ) = O ˜ T ( t, t (cid:48) ) isunchanged. Both cases suggest that E [ t ( T ) t (cid:48) ( T )] − E [ t (˜ T ) t (cid:48) (˜ T )] = 0.When t j + m ≤ t < t (cid:48) ≤ T , either (i) f m T ( t (cid:48) − m ) = t j , in which case f ˜ T ( t (cid:48) − m ) = t j − . This is theonly diﬀerence between O T ( t, t (cid:48) ) and O ˜ T ( t, t (cid:48) ), i.e., O T ( t, t (cid:48) ) − { t j } = O ˜ T ( t, t (cid:48) ) − { t j − } . | O T ( t, t (cid:48) ) | = | O ˜ T ( t, t (cid:48) ) | . The second case is (ii) f T ( t (cid:48) − m ) ≥ t j +1 , in which case O T ( t, t (cid:48) ) = O ˜ T ( t, t (cid:48) ) is unchanged.Both cases suggest that E [ t ( T ) t (cid:48) ( T )] − E [ t (˜ T ) t (cid:48) (˜ T )] = 0.So we have (cid:88) m +1 ≤ t

1, so ( t j − t j − )( t j +1 − t j ) ≤ ( m − ≤ m ( m − .Combine both square terms and cross-product terms we know that r ( T , Y ) > r (˜ T , Y ) . (cid:3) EC.4.4. Proof of Theorem 2

Proof of Theorem 2.

Think of $\mathbb{E}[\varepsilon_t^2]$ as $\mathbb{E}[\varepsilon_t \varepsilon_t]$, so that $r(\eta_{\mathbb{T}}, \mathbb{Y}) = \frac{1}{(T-m)^2} \sum_{t=m+1}^{T} \sum_{t'=m+1}^{T} \mathbb{E}[\varepsilon_t \varepsilon_{t'}]$. Then we can decompose the risk function as

$$(T-m)^2 \cdot r(\eta_{\mathbb{T}}, \mathbb{Y}) = \sum_{\substack{m+1 \le t, t' \le T \\ \min\{t,t'\} \le t_1 - 1}} \mathbb{E}[\varepsilon_t \varepsilon_{t'}] + \sum_{k=1}^{K-1} \Bigg( \sum_{\substack{t_k \le t, t' \le T \\ \min\{t,t'\} \le t_{k+1} - 1}} \mathbb{E}[\varepsilon_t \varepsilon_{t'}] \Bigg) + \sum_{t_K \le t, t' \le T} \mathbb{E}[\varepsilon_t \varepsilon_{t'}] \tag{EC.8}$$

The core of this proof is to carefully count how many values each $\mathbb{E}[\varepsilon_t \varepsilon_{t'}]$, $t, t' \in \{m+1 : T\}$, can take. See Table EC.6 for an illustration.

Table EC.6 Illustration of the different values of $\mathbb{E}[\varepsilon_t \varepsilon_{t'}]$ when $T = 17$, $m = 4$, $\mathbb{T} = \{1, 6, 8, 13\}$.

| Period $t$ | (1 | 2 | 3 | 4) | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $t \in \mathbb{T}$ | (✓ | − | − | −) | − | ✓ | − | ✓ | − | − | − | − | ✓ | − | − | − | − |

In the second line, each checkmark beneath number $t$ indicates that period $t \in \mathbb{T}$, i.e., there is a randomization point at period $t$. This table illustrates the different values of $\mathbb{E}[\varepsilon_t \varepsilon_{t'}]$ when $t, t' \in \{m+1 : T\}$, where the zero values are omitted. The $B^2$ magnitudes are also omitted.

First we calculate the first block from equation (EC.8). Because $t_1 \ge m + 2$, for any $t, t'$ such that $m + 1 \le \min\{t,t'\} \le t_1 - 1$ and $m + 1 \le \max\{t,t'\} \le t_1 + m - 1$, the only overlapping randomization point is $t_0$, so $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 4B^2$. For any $t, t'$ such that $m + 1 \le \min\{t,t'\} \le t_1 - 1$ and $t_1 + m \le \max\{t,t'\} \le T$, there is no overlapping randomization point, so $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 0$. Hence

$$\sum_{\substack{m+1 \le t, t' \le T \\ \min\{t,t'\} \le t_1 - 1}} \mathbb{E}[\varepsilon_t \varepsilon_{t'}] = B^2 \big( 4 \cdot ((t_1 - 1)^2 - m^2) \big).$$

Then we calculate the second block from equation (EC.8). For any $k \in [K-1]$, there are four cases, depending on $t_k - t_{k-1}$ and $t_{k+1} - t_k$, which jointly determine the values of $\mathbb{E}[\varepsilon_t \varepsilon_{t'}]$ for any $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_k \le \max\{t,t'\} \le T$. We go over each of the four cases below.

(1) When $t_k - t_{k-1} \ge m$, $t_{k+1} - t_k \ge m$. Due to Lemma EC.4, for all $t, t' \in \{t_k : t_k + m - 1\}$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 8B^2$, because both $t_{k-1} \le t - m \le t_k - 1$ and $t_{k-1} \le t' - m \le t_k - 1$, so both $t_{k-1}$ and $t_k$ are overlapping randomization points. For all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_k + m \le \max\{t,t'\} \le t_{k+1} + m - 1$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 4B^2$, because $t_k$ is the only overlapping randomization point. For all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_{k+1} + m \le \max\{t,t'\} \le T$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 0$. In this case,

$$\sum_{\substack{t_k \le t, t' \le T \\ \min\{t,t'\} \le t_{k+1} - 1}} \mathbb{E}[\varepsilon_t \varepsilon_{t'}] = B^2 \big( 8 m^2 + 4 ((m + t_{k+1} - t_k)^2 - 2m^2) \big).$$

(2) When $t_k - t_{k-1} \ge m$, $t_{k+1} - t_k < m$. Due to Lemma EC.4, for all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_k \le \max\{t,t'\} \le t_k + m - 1$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 8B^2$, because both $t, t' \le t_k + m - 1$, so $t_{k-1} \le t - m \le t_k - 1$ and $t_{k-1} \le t' - m \le t_k - 1$, and both $t_{k-1}$ and $t_k$ are overlapping randomization points. For all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_k + m \le \max\{t,t'\} \le t_{k+1} + m - 1$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 4B^2$, because $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_k \le \max\{t,t'\} - m \le t_{k+1} - 1$, so $t_k$ is the overlapping randomization point. For all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_{k+1} + m \le \max\{t,t'\} \le T$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 0$. In this case,

$$\sum_{\substack{t_k \le t, t' \le T \\ \min\{t,t'\} \le t_{k+1} - 1}} \mathbb{E}[\varepsilon_t \varepsilon_{t'}] = B^2 \big( 8 (m^2 - (m - t_{k+1} + t_k)^2) + 4 ((m + t_{k+1} - t_k)^2 - 2m^2 + (m - t_{k+1} + t_k)^2) \big).$$

(3) When $t_k - t_{k-1} < m$, $t_{k+1} - t_k \ge m$. Due to Lemma EC.4, for all $t, t' \in \{t_k : t_{k-1} + m - 1\}$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 16B^2$, because $t - m \le t_{k-1} - 1 \le t_k \le t$ and $t' - m \le t_{k-1} - 1 \le t_k \le t'$, so $t_{k-2}, t_{k-1}, t_k$ are three determining randomization points. Also $t_k - t_{k-2} \ge m$, so $t_{k-2} \le \min\{t,t'\} - m$ and $t_{k-3}$ is not a determining randomization point. For all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_k + m - 1$ and $t_{k-1} + m \le \max\{t,t'\} \le t_k + m - 1$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 8B^2$, because $\min\{t,t'\} - m \le t_k - 1$ and $t_{k-1} \le \max\{t,t'\} - m \le t_k - 1$, so $t_{k-1}$ and $t_k$ are two determining randomization points. For all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_k + m \le \max\{t,t'\} \le t_{k+1} + m - 1$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 4B^2$, because $t_k \le \max\{t,t'\} - m$, so $t_k$ is the only determining randomization point. For all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_{k+1} + m \le \max\{t,t'\} \le T$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 0$. In this case,

$$\sum_{\substack{t_k \le t, t' \le T \\ \min\{t,t'\} \le t_{k+1} - 1}} \mathbb{E}[\varepsilon_t \varepsilon_{t'}] = B^2 \big( 16 (m - t_k + t_{k-1})^2 + 8 (m^2 - (m - t_k + t_{k-1})^2) + 4 ((m + t_{k+1} - t_k)^2 - 2m^2) \big).$$

(4) When $t_k - t_{k-1} < m$, $t_{k+1} - t_k < m$. Due to Lemma EC.4, for all $t, t' \in \{t_k : t_{k-1} + m - 1\}$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 16B^2$, because $t - m \le t_{k-1} - 1 \le t_k \le t$ and $t' - m \le t_{k-1} - 1 \le t_k \le t'$, so $t_{k-2}, t_{k-1}, t_k$ are three determining randomization points. Also $t_k - t_{k-2} \ge m$, so $t_{k-2} \le \min\{t,t'\} - m$ and $t_{k-3}$ is not a determining randomization point. For all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_{k-1} + m \le \max\{t,t'\} \le t_k + m - 1$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 8B^2$, because $\min\{t,t'\} - m < t_k$ and $t_{k-1} \le \max\{t,t'\} - m \le t_k - 1$, so $t_{k-1}$ and $t_k$ are two determining randomization points. For all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_k + m \le \max\{t,t'\} \le t_{k+1} + m - 1$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 4B^2$, because $t_k \le \max\{t,t'\} - m$, so $t_k$ is the only determining randomization point. For all $t, t'$ such that $t_k \le \min\{t,t'\} \le t_{k+1} - 1$ and $t_{k+1} + m \le \max\{t,t'\} \le T$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 0$. In this case,

$$\sum_{\substack{t_k \le t, t' \le T \\ \min\{t,t'\} \le t_{k+1} - 1}} \mathbb{E}[\varepsilon_t \varepsilon_{t'}] = B^2 \big( 16 (m - t_k + t_{k-1})^2 + 8 (m^2 - (m - t_k + t_{k-1})^2 - (m - t_{k+1} + t_k)^2) + 4 ((m + t_{k+1} - t_k)^2 - 2m^2 + (m - t_{k+1} + t_k)^2) \big).$$

Finally we calculate the third block from equation (EC.8). Observe that $T - t_K \ge m$.

(1) When $t_K - t_{K-1} \ge m$. Due to Lemma EC.4, for all $t, t' \in \{t_K : t_K + m - 1\}$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 8B^2$, because both $t_{K-1} \le t - m \le t_K - 1$ and $t_{K-1} \le t' - m \le t_K - 1$, and both $t_{K-1}$ and $t_K$ are overlapping randomization points. For all $t, t'$ such that $t_K \le \min\{t,t'\} \le T$ and $t_K + m \le \max\{t,t'\} \le T$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 4B^2$, because $t_K \le \max\{t,t'\} - m$, so $t_K$ is the only determining randomization point. In this case,

$$\sum_{t_K \le t, t' \le T} \mathbb{E}[\varepsilon_t \varepsilon_{t'}] = B^2 \big( 8 m^2 + 4 ((T + 1 - t_K)^2 - m^2) \big).$$

(2) When $t_K - t_{K-1} < m$. Due to Lemma EC.4, for all $t, t' \in \{t_K : t_{K-1} + m - 1\}$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 16B^2$, because $t - m \le t_{K-1} - 1 \le t_K \le t$ and $t' - m \le t_{K-1} - 1 \le t_K \le t'$, so $t_{K-2}, t_{K-1}, t_K$ are three determining randomization points. Also $t_K - t_{K-2} \ge m$, so $t_{K-2} \le \min\{t,t'\} - m$ and $t_{K-3}$ is not a determining randomization point. For all $t, t'$ such that $t_K \le \min\{t,t'\} \le t_K + m - 1$ and $t_{K-1} + m \le \max\{t,t'\} \le t_K + m - 1$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 8B^2$, because $\min\{t,t'\} - m \le t_K - 1$ and $t_{K-1} \le \max\{t,t'\} - m \le t_K - 1$, so $t_{K-1}$ and $t_K$ are two determining randomization points. For all $t, t'$ such that $t_K \le \min\{t,t'\} \le T$ and $t_K + m \le \max\{t,t'\} \le T$, $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 4B^2$, because $t_K \le \max\{t,t'\} - m$, so $t_K$ is the only determining randomization point. In this case,

$$\sum_{t_K \le t, t' \le T} \mathbb{E}[\varepsilon_t \varepsilon_{t'}] = B^2 \big( 16 (m - t_K + t_{K-1})^2 + 8 (m^2 - (m - t_K + t_{K-1})^2) + 4 ((T + 1 - t_K)^2 - m^2) \big).$$

Now we combine all of the above. Note that whenever there exists $k \in \{2 : K\}$ such that $t_k - t_{k-1} < m$, the block sum starting at $t_k$ contains a term $16 (m - t_k + t_{k-1})^2$, while the block sum starting at $t_{k-1}$ contains a term $8 \cdot (-(m - t_k + t_{k-1})^2)$. So when we sum them up, we break $16 (m - t_k + t_{k-1})^2$ into two copies of $8 (m - t_k + t_{k-1})^2$, and the cancellation runs across the two summations. By telescoping,

$$(T-m)^2 \cdot r(\eta_{\mathbb{T}}, \mathbb{Y}) = 4B^2 \big( (t_1 - 1)^2 - m^2 \big) + \sum_{k=1}^{K-1} B^2 \Big( 8m^2 + 4 \big( (m + t_{k+1} - t_k)^2 - 2m^2 + ((m - t_{k+1} + t_k)^+)^2 \big) \Big) + B^2 \big( 8m^2 + 4 ((T + 1 - t_K)^2 - m^2) \big)$$

$$= B^2 \Bigg\{ 4 \sum_{k=0}^{K} (t_{k+1} - t_k)^2 + 8m (t_K - t_1) + 4m^2 K - 4m^2 + 4 \sum_{k=1}^{K-1} [(m - t_{k+1} + t_k)^+]^2 \Bigg\},$$

which finishes the proof. □

EC.4.5. Optimal Solutions to the Subset Selection Problem in Theorem 3

EC.4.5.1. Proof of Theorem 3.
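Since the objective minimized in Theorem 3 is the closed form of Theorem 2, a small numerical sanity check of that closed form may be useful here. The sketch below is illustrative only: the function names are hypothetical, and it assumes the worst-case outcomes of the proof, under which $\mathbb{E}[\varepsilon_t \varepsilon_{t'}] = 2^{|O|+1} B^2$ when the set $O$ of overlapping determining randomization points is non-empty and $0$ otherwise. It compares a direct count over all pairs $(t, t')$ with the closed form, in units of $B^2$, for the design of Table EC.6.

```python
from itertools import product


def determining_points(design, t, m):
    """Randomization points that fix the assignments W_{t-m}, ..., W_t."""
    points = sorted(design)
    return {max(p for p in points if p <= s) for s in range(t - m, t + 1)}


def direct_count(design, T, m):
    """Sum of E[eps_t * eps_t'] / B^2 over all t, t' in {m+1, ..., T}.

    Assumes the worst-case outcomes used in the proof: a pair contributes
    2^(|O| + 1) when its overlap O is non-empty and 0 otherwise.
    """
    f = {t: determining_points(design, t, m) for t in range(m + 1, T + 1)}
    total = 0
    for t, s in product(f, repeat=2):
        overlap = len(f[t] & f[s])
        if overlap:
            total += 2 ** (overlap + 1)
    return total


def closed_form(design, T, m):
    """Braced expression of Theorem 2, i.e. (T - m)^2 * r / B^2."""
    t = sorted(design) + [T + 1]      # t_0 = 1, t_1, ..., t_K, t_{K+1} = T + 1
    K = len(t) - 2
    gaps = [t[k + 1] - t[k] for k in range(K + 1)]
    squares = 4 * sum(g * g for g in gaps)
    shortfall = 4 * sum(max(m - gaps[k], 0) ** 2 for k in range(1, K))
    return squares + 8 * m * (t[K] - t[1]) + 4 * m * m * K - 4 * m * m + shortfall


design = (1, 6, 8, 13)                # design of Table EC.6 (T = 17, m = 4)
print(direct_count(design, 17, 4), closed_form(design, 17, 4))
```

For this design both quantities equal 684, which agrees with summing the block formulas from the proof of Theorem 2 (36 + 128 + 356 + 164).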

Proof of Theorem 3.

Consider the problem introduced in (6). Due to Lemma 1, $\mathbb{Y}^+ = \{Y_t(\mathbf{1}_{m+1}) = Y_t(\mathbf{0}_{m+1}) = B\}_{t \in \{m+1:T\}}$ and $\mathbb{Y}^- = \{Y_t(\mathbf{1}_{m+1}) = Y_t(\mathbf{0}_{m+1}) = -B\}_{t \in \{m+1:T\}}$ are the only two dominating strategies for the adversarial selection of potential outcomes.

Then, due to Lemma 2 and Lemma 3, the optimal design of switchback experiment must satisfy the following three conditions:

$$t_1 \ge m + 2, \qquad t_K \le T - m, \qquad t_{k+1} - t_{k-1} \ge m, \ \forall k \in [K].$$

Due to Theorem 2, the risk function of the optimal design of experiment is given by

$$r(\eta_{\mathbb{T}}, \mathbb{Y}) = \frac{1}{(T-m)^2} \Bigg\{ 4 \sum_{k=1}^{K+1} (t_k - t_{k-1})^2 + 8m (t_K - t_1) + 4m^2 K - 4m^2 + 4 \sum_{k=2}^{K} [(m - t_k + t_{k-1})^+]^2 \Bigg\} B^2.$$

So if we further minimize over $\mathbb{T} \subset [T]$ in the above risk function, we find the optimal solution to the original problem introduced in (6). Note that $B$ is a constant and irrelevant to our decisions, and that $T$ and $m$ are inputs. So we solve

$$\min_{\mathbb{T} \subset [T]} \Bigg\{ 4 \sum_{k=0}^{K} (t_{k+1} - t_k)^2 + 8m (t_K - t_1) + 4m^2 K - 4m^2 + 4 \sum_{k=1}^{K-1} [(m - t_{k+1} + t_k)^+]^2 \Bigg\}$$

as stated in (8).

In particular, if there exists some constant $n \in \mathbb{N}$, $n \ge 4$, such that $T = nm$, we can explicitly find the optimal design of experiment. Take the continuous relaxation of this problem, such that for any $K$, $\{1 < t_1 < t_2 < ... < t_K < T + 1\} \in [1, T+1]^K$:

$$\min_{K \in \mathbb{N},\ 1 < t_1 < \dots < t_K < T+1} \Bigg\{ 4 \sum_{k=0}^{K} (t_{k+1} - t_k)^2 + 8m (t_K - t_1) + 4m^2 K - 4m^2 + 4 \sum_{k=1}^{K-1} [(m - t_{k+1} + t_k)^+]^2 \Bigg\}.$$

The relaxed problem provides a lower bound to the original subset selection problem as stated in (8). We will argue later that it is a lucky coincidence that the optimal solution to this relaxed problem is also an integer solution.

First we argue that $t_1 - t_0 = t_{K+1} - t_K$. Otherwise, if $t_1 - t_0 \ne t_{K+1} - t_K$, denote $a = \frac{(t_1 - t_0) + (t_{K+1} - t_K)}{2}$. We could always pick, for any $k \in [K]$, $\tilde{t}_k = t_k + a - t_1 + 1$, such that $t_{k+1} - t_k$ is unchanged for any $k \in [K-1]$. The only change in the objective value comes from

$$\big( 2a^2 \big) - \big( (t_1 - t_0)^2 + (t_{K+1} - t_K)^2 \big) < 0,$$

which suggests that $t_1 - t_0 \ne t_{K+1} - t_K$ is not optimal.

Second, similarly, we argue that for any $k' < k'' \in [K-1]$, $t_{k'+1} - t_{k'} = t_{k''+1} - t_{k''}$. Otherwise, if $t_{k'+1} - t_{k'} \ne t_{k''+1} - t_{k''}$, denote $b = \frac{(t_{k'+1} - t_{k'}) + (t_{k''+1} - t_{k''})}{2}$. We could always pick, for any $k \in \{k'+1 : k''\}$, $\tilde{t}_k = t_k + b - (t_{k'+1} - t_{k'})$, such that $t_{k+1} - t_k$ is unchanged for any $k \in \{k'+1 : k''-1\}$. The only change in the objective value comes from

$$\big( 2b^2 + 2((m - b)^+)^2 \big) - \big( (t_{k'+1} - t_{k'})^2 + (t_{k''+1} - t_{k''})^2 + ((m - t_{k'+1} + t_{k'})^+)^2 + ((m - t_{k''+1} + t_{k''})^+)^2 \big) < 0,$$

where $x^2 + ((m - x)^+)^2$ is convex and the inequality holds due to Jensen's inequality. This inequality suggests that $t_{k'+1} - t_{k'} \ne t_{k''+1} - t_{k''}$ is not optimal.

With the above two structural results, we can assume that there exist $a, b > 0$ such that $t_1 - t_0 = t_{K+1} - t_K = a$ and $t_{k+1} - t_k = b$, $\forall k \in [K-1]$. Also, it must be satisfied that $2a + (K-1) b = T$. Next we substitute $K - 1 = \frac{T - 2a}{b}$ into the relaxed problem, to have

$$\min_{a,b>0} \big\{ 8a^2 + 4(K-1) b^2 + 8m (K-1) b + 4m^2 (K-1) + 4(K-1) ((m - b)^+)^2 \big\} = \min_{a,b>0} \Big\{ 8a^2 + 4(T - 2a) b + 8m (T - 2a) + 4m^2 \frac{T - 2a}{b} + 4 \frac{T - 2a}{b} ((m - b)^+)^2 \Big\}.$$

Either $b \ge m$, in which case the above is to minimize

$$\min_{a,b>0} \Big\{ 8a^2 + 4(T - 2a) b + 8m (T - 2a) + 4m^2 \frac{T - 2a}{b} \Big\}.$$

Note that

$$8a^2 + 4(T - 2a) b + 8m (T - 2a) + 4m^2 \frac{T - 2a}{b} = 8a^2 + 8m (T - 2a) + 4(T - 2a) \Big( b + \frac{m^2}{b} \Big) \ge 8a^2 + 16m (T - 2a) = 8(a - 2m)^2 + 16mT - 32m^2 \ge 16mT - 32m^2,$$

where the first inequality holds with equality if and only if $b = \frac{m^2}{b}$, which suggests $b = m$; the second inequality holds with equality if and only if $a = 2m$.

Or $b \le m$, in which case the above is to minimize

$$\min_{a,b>0} \Big\{ 8a^2 + 4(T - 2a) b + 8m (T - 2a) + 4m^2 \frac{T - 2a}{b} + 4 \frac{T - 2a}{b} (m - b)^2 \Big\}.$$

Note that

$$8a^2 + 4(T - 2a) b + 8m (T - 2a) + 4m^2 \frac{T - 2a}{b} + 4 \frac{T - 2a}{b} (m - b)^2 = 8a^2 + 8(T - 2a) \Big( b + \frac{m^2}{b} \Big) \ge 8a^2 + 16m (T - 2a) = 8(a - 2m)^2 + 16mT - 32m^2 \ge 16mT - 32m^2,$$

where the first inequality holds with equality if and only if $b = \frac{m^2}{b}$, which suggests $b = m$; the second inequality holds with equality if and only if $a = 2m$.

Combining both cases, the optimal solution is $a = 2m$ and $b = m$, which happens to be an integer solution and is thus optimal for the subset selection problem. Translating into $t_1, ..., t_K$, this suggests that $t_1 = 2m + 1, t_2 = 3m + 1, ..., t_K = (n-2)m + 1$. □

EC.4.5.2. Solutions in the Imperfect Cases.

It is worth noting that we are taking a design-of-experiments perspective. So when in practice we have control of $T$, we can pick $T$ to be a multiple of $m$, which fits Theorem 3 perfectly. If we do not have control of $T$, we can always pick a smaller $T' = \lfloor T/m \rfloor \cdot m$, which is a multiple of $m$.

Nonetheless, from an optimization perspective, we establish the following optimal structures for the subset selection problem as in (8). Recall that $t_{K+1} = T + 1$.

Lemma EC.5. Under Assumptions 1–3, the optimal design of regular switchback experiment must satisfy the following two conditions:

$$|(t_1 - t_0) - (t_{K+1} - t_K)| \le 1, \qquad |(t_{j+1} - t_j) - (t_{j'+1} - t_{j'})| \le 1, \ \forall 1 \le j, j' \le K - 1.$$

Proof of Lemma EC.5. Prove by contradiction.

Case 1. Suppose there exists some optimal design $\mathbb{T}$ such that $(t_1 - t_0) - (t_{K+1} - t_K) \ge 2$. We now construct another design $\tilde{\mathbb{T}}$ such that $|\tilde{\mathbb{T}}| = |\mathbb{T}|$, with elements $\tilde{\mathbb{T}} = \{\tilde{t}_0 = 1, \tilde{t}_1 = t_1 - 1, \tilde{t}_2 = t_2 - 1, ..., \tilde{t}_K = t_K - 1\}$. Now check the expression in (8). Note that $\tilde{t}_{k+1} - \tilde{t}_k = t_{k+1} - t_k$ is unchanged for any $k \in [K-1]$; $\tilde{t}_K - \tilde{t}_1 = t_K - t_1$ is unchanged; and $m - \tilde{t}_{k+1} + \tilde{t}_k = m - t_{k+1} + t_k$ is unchanged for any $k \in [K-1]$. However,

$$(\tilde{t}_1 - \tilde{t}_0)^2 + (\tilde{t}_{K+1} - \tilde{t}_K)^2 = (t_1 - t_0 - 1)^2 + (t_{K+1} - t_K + 1)^2 < (t_1 - t_0)^2 + (t_{K+1} - t_K)^2,$$

because $(t_1 - t_0) - (t_{K+1} - t_K) \ge 2$. So $\mathbb{T}$ is not optimal. Similarly, if there exists some optimal design $\mathbb{T}$ such that $(t_{K+1} - t_K) - (t_1 - t_0) \ge 2$, then construct another design $\tilde{\mathbb{T}} = \{\tilde{t}_0 = 1, \tilde{t}_1 = t_1 + 1, \tilde{t}_2 = t_2 + 1, ..., \tilde{t}_K = t_K + 1\}$.

Case 2. Suppose there exists some optimal design $\mathbb{T}$, and there exist $1 \le j < j' \le K - 1$ such that $(t_{j+1} - t_j) - (t_{j'+1} - t_{j'}) \ge 2$. We now construct another design $\tilde{\mathbb{T}}$ such that $|\tilde{\mathbb{T}}| = |\mathbb{T}|$, with elements $\tilde{\mathbb{T}} = \{\tilde{t}_0 = 1, \tilde{t}_1 = t_1, ..., \tilde{t}_j = t_j, \tilde{t}_{j+1} = t_{j+1} - 1, ..., \tilde{t}_{j'} = t_{j'} - 1, \tilde{t}_{j'+1} = t_{j'+1}, ..., \tilde{t}_K = t_K\}$. Now check the expression in (8). Note that $\tilde{t}_{k+1} - \tilde{t}_k = t_{k+1} - t_k$ is unchanged for any $k \in \{0 : K\}$ except $j$ and $j'$; $\tilde{t}_K - \tilde{t}_1 = t_K - t_1$ is unchanged; and $m - \tilde{t}_{k+1} + \tilde{t}_k = m - t_{k+1} + t_k$ is unchanged for any $k \in [K-1]$ except $j$ and $j'$. Now focus on $j$ and $j'$:

$$(\tilde{t}_{j+1} - \tilde{t}_j)^2 + (\tilde{t}_{j'+1} - \tilde{t}_{j'})^2 + [(m - \tilde{t}_{j+1} + \tilde{t}_j)^+]^2 + [(m - \tilde{t}_{j'+1} + \tilde{t}_{j'})^+]^2$$

$$= (t_{j+1} - t_j - 1)^2 + (t_{j'+1} - t_{j'} + 1)^2 + [(m - t_{j+1} + t_j + 1)^+]^2 + [(m - t_{j'+1} + t_{j'} - 1)^+]^2$$

$$< (t_{j+1} - t_j)^2 + (t_{j'+1} - t_{j'})^2 + [(m - t_{j+1} + t_j)^+]^2 + [(m - t_{j'+1} + t_{j'})^+]^2.$$

To see why this inequality holds, define $g(x) = x^2 + [(m - x)^+]^2$ and note that $g(x)$ is a univariate convex function. The inequality holds due to $(t_{j+1} - t_j) - (t_{j'+1} - t_{j'}) \ge 2$. Similarly, if there exists some optimal design $\mathbb{T}$ and $1 \le j < j' \le K - 1$ such that $(t_{j'+1} - t_{j'}) - (t_{j+1} - t_j) \ge 2$, then construct another design $\tilde{\mathbb{T}} = \{\tilde{t}_0 = 1, \tilde{t}_1 = t_1, ..., \tilde{t}_j = t_j, \tilde{t}_{j+1} = t_{j+1} + 1, ..., \tilde{t}_{j'} = t_{j'} + 1, \tilde{t}_{j'+1} = t_{j'+1}, ..., \tilde{t}_K = t_K\}$.

Combining both cases finishes the proof. □

EC.5. Proofs and Discussions from Section 4

In Section 4 we focus on the case when $p = m$. Throughout this section of the appendix, we use only $m$ instead of $p$.

EC.5.1. Extra Notations Used in the Proofs from Section 4

For any $t \in \{m+1 : T\}$, we use the notation $\varepsilon_t$ as defined in (EC.1). Denote

$$\bar{\varepsilon}_0 = \sum_{t=m+1}^{2m} \varepsilon_t, \qquad \bar{\varepsilon}_k = \sum_{t=(k+1)m+1}^{(k+2)m} \varepsilon_t, \ \forall k \in [K], \qquad \bar{\varepsilon}_{K+1} = \sum_{t=(K+2)m+1}^{(K+3)m} \varepsilon_t.$$

It is worth noting that under the optimal design suggested by Theorem 3, when $T/m = n \in \mathbb{N}$ is an integer, we have $K = n - 3$, so $(K+3)m = T$. See Example EC.1 below.

Example EC.1 (An Optimal Design and Its $\bar{\varepsilon}_k$ Notations). When $T = 12$, $p = m = 2$, the optimal design of regular switchback experiment is $\mathbb{T}^* = \{1, 5, 7, 9\}$, and $K = 3$. The $\bar{\varepsilon}_k$ notations are defined below. Each $\bar{\varepsilon}_k$ spans $m = 2$ periods. See Table EC.7. □
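The design in Example EC.1 can be reproduced with a short brute-force check. The following is a minimal sketch, assuming the objective in (8) and the three necessary conditions from Lemmas 2 and 3; the function names are hypothetical and the exhaustive search is only practical for small $T$.

```python
from itertools import combinations


def theorem3_design(T, m):
    """Design {1, 2m+1, 3m+1, ..., (n-2)m+1} of Theorem 3 when T = n*m, n >= 4."""
    n, remainder = divmod(T, m)
    if remainder != 0 or n < 4:
        raise ValueError("Theorem 3 is stated for T = n*m with n >= 4")
    return (1,) + tuple(k * m + 1 for k in range(2, n - 1))


def objective(design, T, m):
    """Objective in (8), i.e. the risk up to the factor B^2 / (T - m)^2."""
    t = list(design) + [T + 1]        # t_0 = 1, ..., t_K, t_{K+1} = T + 1
    K = len(design) - 1
    gaps = [t[k + 1] - t[k] for k in range(K + 1)]
    shortfall = sum(max(m - gaps[k], 0) ** 2 for k in range(1, K))
    return (4 * sum(g * g for g in gaps) + 8 * m * (t[K] - t[1])
            + 4 * m * m * K - 4 * m * m + 4 * shortfall)


def satisfies_lemmas(design, T, m):
    """Necessary conditions from Lemmas 2 and 3 for an optimal design."""
    t = list(design) + [T + 1]
    K = len(design) - 1
    return (t[1] >= m + 2 and t[K] <= T - m
            and all(t[k + 1] - t[k - 1] >= m for k in range(1, K + 1)))


def brute_force(T, m):
    """Exhaustive search over designs containing period 1 and K >= 1 more points."""
    best = None
    for K in range(1, T):
        for rest in combinations(range(2, T + 1), K):
            design = (1,) + rest
            if not satisfies_lemmas(design, T, m):
                continue
            candidate = (objective(design, T, m), design)
            if best is None or candidate < best:
                best = candidate
    return best


print(theorem3_design(12, 2))         # (1, 5, 7, 9)
print(brute_force(12, 2))             # (256, (1, 5, 7, 9))
```

For $T = 12$ and $m = 2$ the minimum value is $256 = 16mT - 32m^2$, the lower bound obtained from the continuous relaxation in the proof of Theorem 3.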

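Table EC.7 An example of the optimal design $\mathbb{T}^*$ and its $\bar{\varepsilon}_k$ notations when $T = 12$ and $p = m = 2$.

| Period | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\mathbb{T}^*$ | ✓ | − | − | − | ✓ | − | ✓ | − | ✓ | − | − | − |
| $\{\bar{\varepsilon}_k\}_{k=0}^{K+1}$ | | | $\bar{\varepsilon}_0$ | $\bar{\varepsilon}_0$ | $\bar{\varepsilon}_1$ | $\bar{\varepsilon}_1$ | $\bar{\varepsilon}_2$ | $\bar{\varepsilon}_2$ | $\bar{\varepsilon}_3$ | $\bar{\varepsilon}_3$ | $\bar{\varepsilon}_4$ | $\bar{\varepsilon}_4$ |

Using the above notation, we can write

$$\hat{\tau}_m - \tau_m = \frac{1}{T - m} \sum_{k=0}^{K+1} \bar{\varepsilon}_k, \qquad \text{and so} \qquad \mathrm{Var}(\hat{\tau}_m) = \frac{1}{(T-m)^2} \mathrm{Var}\Bigg( \sum_{k=0}^{K+1} \bar{\varepsilon}_k \Bigg).$$

EC.5.2. Proof of Theorem 4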

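The proof of Theorem 4 resembles the proofs of Lemmas EC.2 and EC.3. The trick here is to observe that for any $k \in [K]$, the values of all the variables $\varepsilon_t$, where $(k+1)m + 1 \le t \le (k+2)m$, are all determined by the randomization at times $km + 1$ and $(k+1)m + 1$. Since they are all correlated, we use $\bar{\varepsilon}_k$ as shorthand for $\sum_{t=(k+1)m+1}^{(k+2)m} \varepsilon_t$.

Proof of Theorem 4. First observe that $\bar{\varepsilon}_k$ has zero mean for each $k \in \{0 : K+1\}$. So we can decompose the variance into squared terms and cross-product terms,

$$(T-m)^2 \mathrm{Var}(\hat{\tau}_m) = \mathrm{Var}\Bigg( \sum_{k=0}^{K+1} \bar{\varepsilon}_k \Bigg) = \sum_{k=0}^{K+1} \mathbb{E}[\bar{\varepsilon}_k^2] + \sum_{0 \le k < k' \le K+1} 2\, \mathbb{E}[\bar{\varepsilon}_k \bar{\varepsilon}_{k'}].$$

For the squared terms, when $k \in \{0, K+1\}$: with probability $1/2$, $\bar{\varepsilon}_k = \bar{Y}_k(\mathbf{1}_{m+1}) + \bar{Y}_k(\mathbf{0}_{m+1})$; with probability $1/2$, $\bar{\varepsilon}_k = -\bar{Y}_k(\mathbf{1}_{m+1}) - \bar{Y}_k(\mathbf{0}_{m+1})$. When $k \in [K]$: with probability $1/4$, $\bar{\varepsilon}_k = 3\bar{Y}_k(\mathbf{1}_{m+1}) + \bar{Y}_k(\mathbf{0}_{m+1})$; with probability $1/2$, $\bar{\varepsilon}_k = -\bar{Y}_k(\mathbf{1}_{m+1}) + \bar{Y}_k(\mathbf{0}_{m+1})$; with probability $1/4$, $\bar{\varepsilon}_k = -\bar{Y}_k(\mathbf{1}_{m+1}) - 3\bar{Y}_k(\mathbf{0}_{m+1})$.

Then for the cross-product terms, if $k' - k \ge 2$, then $\bar{\varepsilon}_k$ and $\bar{\varepsilon}_{k'}$ are independent, i.e., $\mathbb{E}[\bar{\varepsilon}_k \bar{\varepsilon}_{k'}] = 0$. If $k' - k = 1$, then

$$\mathbb{E}[\bar{\varepsilon}_k \bar{\varepsilon}_{k+1}] = \big( \bar{Y}_k(\mathbf{1}_{m+1}) + \bar{Y}_k(\mathbf{0}_{m+1}) \big) \cdot \big( \bar{Y}_{k+1}(\mathbf{1}_{m+1}) + \bar{Y}_{k+1}(\mathbf{0}_{m+1}) \big).$$

This is because the values of $\bar{\varepsilon}_k$ and $\bar{\varepsilon}_{k+1}$ are determined by the realization at 3 randomization points, $W_{km+1}, W_{(k+1)m+1}, W_{(k+2)m+1}$. Writing $\bar{Y}_k(\mathbf{1})$ and $\bar{Y}_k(\mathbf{0})$ for $\bar{Y}_k(\mathbf{1}_{m+1})$ and $\bar{Y}_k(\mathbf{0}_{m+1})$:

- with probability $1/8$, $\bar{\varepsilon}_k \bar{\varepsilon}_{k+1} = (3\bar{Y}_k(\mathbf{1}) + \bar{Y}_k(\mathbf{0})) \cdot (3\bar{Y}_{k+1}(\mathbf{1}) + \bar{Y}_{k+1}(\mathbf{0}))$;
- with probability $1/8$, $\bar{\varepsilon}_k \bar{\varepsilon}_{k+1} = (3\bar{Y}_k(\mathbf{1}) + \bar{Y}_k(\mathbf{0})) \cdot (-\bar{Y}_{k+1}(\mathbf{1}) + \bar{Y}_{k+1}(\mathbf{0}))$;
- with probability $1/8$, $\bar{\varepsilon}_k \bar{\varepsilon}_{k+1} = (-\bar{Y}_k(\mathbf{1}) + \bar{Y}_k(\mathbf{0})) \cdot (3\bar{Y}_{k+1}(\mathbf{1}) + \bar{Y}_{k+1}(\mathbf{0}))$;
- with probability $1/4$ (two of the eight equally likely realizations), $\bar{\varepsilon}_k \bar{\varepsilon}_{k+1} = (-\bar{Y}_k(\mathbf{1}) + \bar{Y}_k(\mathbf{0})) \cdot (-\bar{Y}_{k+1}(\mathbf{1}) + \bar{Y}_{k+1}(\mathbf{0}))$;
- with probability $1/8$, $\bar{\varepsilon}_k \bar{\varepsilon}_{k+1} = (-\bar{Y}_k(\mathbf{1}) + \bar{Y}_k(\mathbf{0})) \cdot (-\bar{Y}_{k+1}(\mathbf{1}) - 3\bar{Y}_{k+1}(\mathbf{0}))$;
- with probability $1/8$, $\bar{\varepsilon}_k \bar{\varepsilon}_{k+1} = (-\bar{Y}_k(\mathbf{1}) - 3\bar{Y}_k(\mathbf{0})) \cdot (-\bar{Y}_{k+1}(\mathbf{1}) + \bar{Y}_{k+1}(\mathbf{0}))$;
- with probability $1/8$, $\bar{\varepsilon}_k \bar{\varepsilon}_{k+1} = (-\bar{Y}_k(\mathbf{1}) - 3\bar{Y}_k(\mathbf{0})) \cdot (-\bar{Y}_{k+1}(\mathbf{1}) - 3\bar{Y}_{k+1}(\mathbf{0}))$.

Combining the squared terms and the cross-product terms, we finish the proof. □

EC.5.3. Discussions and proof of Corollary 1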

We ﬁrst provide the details of the two variance upper bounds here.

Var U1 (ˆ τ m ) = 1( T − m ) (cid:40) (cid:2) ¯ Y ( m +1 ) + ¯ Y ( m +1 ) (cid:3) + n − (cid:88) k =1 (cid:2) ¯ Y k ( m +1 ) + ¯ Y k ( m +1 ) (cid:3) + 4 (cid:2) ¯ Y n − ( m +1 ) + ¯ Y n − ( m +1 ) (cid:3) + n − (cid:88) k =0 (cid:2) ¯ Y k ( m +1 ) · ¯ Y k +1 ( m +1 ) + ¯ Y k ( m +1 ) · ¯ Y k +1 ( m +1 ) (cid:3)(cid:41) , and Var U2 (ˆ τ m ) = 1( T − m ) (cid:40) (cid:2) ¯ Y ( m +1 ) + ¯ Y ( m +1 ) (cid:3) + n − (cid:88) k =1 (cid:2) ¯ Y k ( m +1 ) + ¯ Y k ( m +1 ) (cid:3) + 4 (cid:2) ¯ Y n − ( m +1 ) + ¯ Y n − ( m +1 ) (cid:3) (cid:41) . We prove Corollary 1 using the basic inequality that 2 xy ≤ x + y . Such an inequality is com-monly used to ﬁnd a conservative upper bound of the variance. Proof of Corollary 1.

From Theorem 4, writing $\bar{Y}_k(\mathbf{1})$ and $\bar{Y}_k(\mathbf{0})$ for $\bar{Y}_k(\mathbf{1}_{m+1})$ and $\bar{Y}_k(\mathbf{0}_{m+1})$, the variance of the estimator satisfies

$$(T-m)^2 \mathrm{Var}(\hat{\tau}_m) \le 2 \big\{ \bar{Y}_0(\mathbf{1})^2 + \bar{Y}_0(\mathbf{0})^2 \big\} + \sum_{k=1}^{n-3} 4 \big\{ \bar{Y}_k(\mathbf{1})^2 + \bar{Y}_k(\mathbf{0})^2 \big\} + 2 \big\{ \bar{Y}_{n-2}(\mathbf{1})^2 + \bar{Y}_{n-2}(\mathbf{0})^2 \big\} + \sum_{k=0}^{n-3} 2 \big[ \bar{Y}_k(\mathbf{1}) + \bar{Y}_k(\mathbf{0}) \big] \cdot \big[ \bar{Y}_{k+1}(\mathbf{1}) + \bar{Y}_{k+1}(\mathbf{0}) \big]$$

$$\le 2 \big\{ \bar{Y}_0(\mathbf{1})^2 + \bar{Y}_0(\mathbf{0})^2 \big\} + \sum_{k=1}^{n-3} 4 \big\{ \bar{Y}_k(\mathbf{1})^2 + \bar{Y}_k(\mathbf{0})^2 \big\} + 2 \big\{ \bar{Y}_{n-2}(\mathbf{1})^2 + \bar{Y}_{n-2}(\mathbf{0})^2 \big\} + \sum_{k=0}^{n-3} \big\{ 2 \bar{Y}_k(\mathbf{1}) \bar{Y}_{k+1}(\mathbf{1}) + 2 \bar{Y}_k(\mathbf{0}) \bar{Y}_{k+1}(\mathbf{0}) + \bar{Y}_k(\mathbf{1})^2 + \bar{Y}_k(\mathbf{0})^2 + \bar{Y}_{k+1}(\mathbf{1})^2 + \bar{Y}_{k+1}(\mathbf{0})^2 \big\}$$

$$\le 3 \big\{ \bar{Y}_0(\mathbf{1})^2 + \bar{Y}_0(\mathbf{0})^2 \big\} + \sum_{k=1}^{n-3} 6 \big\{ \bar{Y}_k(\mathbf{1})^2 + \bar{Y}_k(\mathbf{0})^2 \big\} + 3 \big\{ \bar{Y}_{n-2}(\mathbf{1})^2 + \bar{Y}_{n-2}(\mathbf{0})^2 \big\} + \sum_{k=0}^{n-3} \big\{ \bar{Y}_k(\mathbf{1})^2 + \bar{Y}_k(\mathbf{0})^2 + \bar{Y}_{k+1}(\mathbf{1})^2 + \bar{Y}_{k+1}(\mathbf{0})^2 \big\}$$

$$= 4 \big\{ \bar{Y}_0(\mathbf{1})^2 + \bar{Y}_0(\mathbf{0})^2 \big\} + \sum_{k=1}^{n-3} 8 \big\{ \bar{Y}_k(\mathbf{1})^2 + \bar{Y}_k(\mathbf{0})^2 \big\} + 4 \big\{ \bar{Y}_{n-2}(\mathbf{1})^2 + \bar{Y}_{n-2}(\mathbf{0})^2 \big\},$$

where the first inequality suggests $\mathrm{Var}(\hat{\tau}_m) \le \mathrm{Var}^{U1}(\hat{\tau}_m)$, and the last inequality suggests $\mathrm{Var}^{U1}(\hat{\tau}_m) \le \mathrm{Var}^{U2}(\hat{\tau}_m)$.

The unbiasedness part is due to the estimators of the variances being Horvitz-Thompson type estimators, and that regular switchback experiments naturally satisfy Assumption 4. □

EC.5.4. Proof of Theorem 5

We prove Theorem 5 by using Lemma EC.1. In particular, we derive $B_{n,k,a}$, and then construct some proper $\Delta_n$, $K_n$, and $L_n$. Proof of Theorem 5.

In the n -replica experiment, ˆ τ m − τ m = n − m (cid:80) n − k =0 ¯ k , and Var (ˆ τ m ) = n − m Var (cid:16)(cid:80) n − k =0 ¯ k (cid:17) . To use the language from Lemma EC.1, denote d = n −

1. Denote forany i ∈ [ n − X n,i = n − m ¯ i − so we know that φ = 1, i.e., { X n, , X n, , ... } is a sequence of1-dependent random variables.First note that B n = Var (ˆ τ m ), and we calculate B n,k,a as follows. B n,k,a = 1( n − m Var (cid:32) a + k − (cid:88) i = a ¯ i − (cid:33) ≤ n − m (cid:40) a + k − (cid:88) i = a (cid:2) Y i − ( m +1 ) + 3 ¯ Y i − ( m +1 ) + 2 ¯ Y i − ( m +1 ) ¯ Y i − ( m +1 ) (cid:3) + a + k − (cid:88) i = a

2[ ¯ Y i − ( m +1 ) + ¯ Y i − ( m +1 )] · [ ¯ Y i ( m +1 ) + ¯ Y i ( m +1 )] (cid:41) ≤ km B + 8( k − m B ( n − m ≤ kB ( n − Pick γ = 0 , δ = 1, then ∆ n = B / ( n − , K n = 16 B / ( n − , and L n = Var (ˆ τ m ) / ( n − E | X n,i | ≤ ∆ n = B / ( n − , because all the potential outcomes are bounded by B , so that X n,i ≤ B/ ( n − B n,k,a /k ≤ K n = 16 B / ( n − .3. B n / ( n − ≥ L n = Var (ˆ τ m ) / ( n − K n /L n = 16 B / ( n − Var (ˆ τ m ) = O (1), where the last equality is due to Assumption 5.5. ∆ n /L / n = B / ( n − / Var (ˆ τ m ) / = O (1), where the last equality is due to Assumption 5. -companion to Bojinov, Simchi-Levi, and Zhao:

Switchback Experiments ec23

Due to Lemma EC.1, ˆ τ m − τ m (cid:112) Var (ˆ τ m ) D −→ N (0 , . (cid:3) EC.6. Proofs and Discussions from Section 5

In Section 5 we discuss the cases when $m$ is misspecified. Throughout this section of the appendix, we use both $p$ and $m$. Recall that $m$ is the order of the carryover effect, and $p$ is the experimenter's knowledge of $m$.

EC.6.1. Unbiasedness of the Horvitz-Thompson Estimator when m is Misspecified

We state here the omitted mathematics in Theorem 6.

Under Assumptions 2 and 3, for $m > p$, at each time $t \geq m+1$, the Horvitz-Thompson estimator is either unbiased for the lag-$m$ causal effect when $f(t-p) \leq t-m$, i.e.,

$$\mathbb{E}_{W_{1:T} \sim \eta_T}\left[ Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{1}_{p+1})} - Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{0}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{0}_{p+1})} \right] = Y_t(\mathbf{1}_{m+1}) - Y_t(\mathbf{0}_{m+1}),$$

or conditionally unbiased for the $m$-misspecified lag-$p$ causal effect when $f(t-p) > t-m$, i.e.,

$$\mathbb{E}_{W_{1:T} \sim \eta_T}\left[ \left\{ Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{1}_{p+1})} - Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{0}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{0}_{p+1})} \right\} - \left\{ Y_t(w^{\mathrm{obs}}_{t-m:f(t-p)-1}, \mathbf{1}_{t-f(t-p)+1}) - Y_t(w^{\mathrm{obs}}_{t-m:f(t-p)-1}, \mathbf{0}_{t-f(t-p)+1}) \right\} \,\middle|\, W_{t-m:f(t-p)-1} = w^{\mathrm{obs}}_{t-m:f(t-p)-1} \right] = 0.$$

When $p+1 \leq t \leq m$, the Horvitz-Thompson estimator is either unbiased for the lag-$t$ causal effect when $f(t-p) = 1$, i.e.,

$$\mathbb{E}_{W_{1:T} \sim \eta_T}\left[ Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{1}_{p+1})} - Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{0}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{0}_{p+1})} \right] = Y_t(\mathbf{1}_{t}) - Y_t(\mathbf{0}_{t}),$$

or conditionally unbiased for the $m$-misspecified lag-$t$ causal effect when $f(t-p) > 1$, i.e.,

$$\mathbb{E}_{W_{1:T} \sim \eta_T}\left[ \left\{ Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{1}_{p+1})} - Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{0}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{0}_{p+1})} \right\} - \left\{ Y_t(w^{\mathrm{obs}}_{1:f(t-p)-1}, \mathbf{1}_{t-f(t-p)+1}) - Y_t(w^{\mathrm{obs}}_{1:f(t-p)-1}, \mathbf{0}_{t-f(t-p)+1}) \right\} \,\middle|\, W_{1:f(t-p)-1} = w^{\mathrm{obs}}_{1:f(t-p)-1} \right] = 0.$$

To remove the conditional expectation, we can further take an outer expectation averaged over the past assignment paths, so the estimator is estimating a weighted average of lag-$p$ effects. When $t \geq m+1$, this weighted average is

$$\sum_{w_{t-m:f(t-p)-1}} \Pr(W_{t-m:f(t-p)-1} = w_{t-m:f(t-p)-1}) \left( Y_t(w_{t-m:f(t-p)-1}, \mathbf{1}_{t-f(t-p)+1}) - Y_t(w_{t-m:f(t-p)-1}, \mathbf{0}_{t-f(t-p)+1}) \right),$$

and when $p+1 \leq t \leq m$, it is

$$\sum_{w_{1:f(t-p)-1}} \Pr(W_{1:f(t-p)-1} = w_{1:f(t-p)-1}) \left( Y_t(w_{1:f(t-p)-1}, \mathbf{1}_{t-f(t-p)+1}) - Y_t(w_{1:f(t-p)-1}, \mathbf{0}_{t-f(t-p)+1}) \right).$$

We prove Theorem 6 as follows.
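The two cases for $t \geq m+1$ can be checked numerically on a toy example. The sketch below is not taken from the paper and rests on several assumptions: a regular switchback design in which a fair coin is flipped at each randomization point and the assignment is held until the next point, $f(s)$ read as the latest randomization point at or before period $s$, and a made-up potential-outcome function `outcome`. All function names are hypothetical. The code enumerates every assignment path, so the expectation of the period-$t$ Horvitz-Thompson term with lag $p$ is computed exactly while the true carryover order is $m$.

```python
from itertools import product

def latest_point(s, points):
    """Assumed meaning of f(s): latest randomization point at or before period s."""
    return max(q for q in points if q <= s)

def assignment_path(flips, points, T):
    """Hold each coin flip from its randomization point until the next one."""
    return [flips[points.index(latest_point(t, points))] for t in range(1, T + 1)]

def outcome(t, w):
    """Hypothetical potential outcome Y_t(w) for w = (w_{t-m}, ..., w_t)."""
    return 0.1 * t + sum((j + 1) * v for j, v in enumerate(w)) + 0.5 * w[0] * w[-1]

def ht_expectation(t, p, m, points, T, y=outcome):
    """Exact expectation of the period-t Horvitz-Thompson term that uses lag p."""
    if t < m + 1:
        raise ValueError("this sketch only covers t >= m + 1")
    # Fair, independent coins: every combination of flips is equally likely.
    paths = [assignment_path(f, points, T)
             for f in product((0, 1), repeat=len(points))]
    prob = 1.0 / len(paths)

    def window(path, lag):
        # Assignments W_{t-lag:t}; periods are 1-indexed, the list is 0-indexed.
        return tuple(path[t - lag - 1:t])

    pr1 = sum(prob for w in paths if all(v == 1 for v in window(w, p)))
    pr0 = sum(prob for w in paths if all(v == 0 for v in window(w, p)))
    total = 0.0
    for w in paths:
        wp = window(w, p)
        y_obs = y(t, window(w, m))  # the observed outcome depends on m + 1 assignments
        if all(v == 1 for v in wp):
            total += prob * y_obs / pr1
        elif all(v == 0 for v in wp):
            total -= prob * y_obs / pr0
    return total

# Toy configuration (illustrative values only).
points, T, m, p = [1, 4, 7, 10], 12, 5, 3
for t in range(m + 1, T + 1):
    first_case = latest_point(t - p, points) <= t - m
    label = "f(t-p) <= t-m" if first_case else "f(t-p) > t-m"
    print(t, label, round(ht_expectation(t, p, m, points, T), 6))
```

In this toy configuration, periods 6, 9 and 12 fall in the first case, where the expectation equals $Y_t(\mathbf{1}_{m+1}) - Y_t(\mathbf{0}_{m+1})$. The remaining periods fall in the second case, where the expectation is the average of lag-$p$ contrasts over the earlier assignments.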

Proof of Theorem 6.

Focus on any specific $t \in \{m+1 : T\}$.

When $f(t-p) \leq t-m$, both $0 < \Pr(W_{t-p:t} = \mathbf{1}_{p+1}), \Pr(W_{t-p:t} = \mathbf{0}_{p+1}) < 1$. With probability $\Pr(W_{t-p:t} = \mathbf{1}_{p+1}) \neq 0$, $\mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\} = 1$, and $Y_t^{\mathrm{obs}} = Y_t(\mathbf{1}_{m+1})$. So

$$\mathbb{E}\left[ Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{1}_{p+1})} \right] = Y_t(\mathbf{1}_{m+1}).$$

Similarly,

$$\mathbb{E}\left[ Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{0}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{0}_{p+1})} \right] = Y_t(\mathbf{0}_{m+1}).$$

So

$$\mathbb{E}_{W_{1:T} \sim \eta_T}\left[ Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{1}_{p+1})} - Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{0}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{0}_{p+1})} \right] = Y_t(\mathbf{1}_{m+1}) - Y_t(\mathbf{0}_{m+1}).$$

When $f(t-p) > t-m$, both

$$0 < \Pr\left(W_{t-p:t} = \mathbf{1}_{p+1} \,\middle|\, W_{t-m:f(t-p)-1} = w^{\mathrm{obs}}_{t-m:f(t-p)-1}\right) < 1 \quad \text{and} \quad 0 < \Pr\left(W_{t-p:t} = \mathbf{0}_{p+1} \,\middle|\, W_{t-m:f(t-p)-1} = w^{\mathrm{obs}}_{t-m:f(t-p)-1}\right) < 1.$$

Conditional on $W_{t-m:f(t-p)-1} = w^{\mathrm{obs}}_{t-m:f(t-p)-1}$, we know that with probability $\Pr\left(W_{t-p:t} = \mathbf{1}_{p+1} \,\middle|\, W_{t-m:f(t-p)-1} = w^{\mathrm{obs}}_{t-m:f(t-p)-1}\right) \neq 0$, $\mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\} = 1$, and $Y_t^{\mathrm{obs}} = Y_t(w^{\mathrm{obs}}_{t-m:f(t-p)-1}, \mathbf{1}_{t-f(t-p)+1})$. So

$$\mathbb{E}_{W_{1:T} \sim \eta_T}\left[ Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{1}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{1}_{p+1})} - Y_t(w^{\mathrm{obs}}_{t-m:f(t-p)-1}, \mathbf{1}_{t-f(t-p)+1}) \,\middle|\, W_{t-m:f(t-p)-1} = w^{\mathrm{obs}}_{t-m:f(t-p)-1} \right] = 0.$$

Similarly, we have

$$\mathbb{E}_{W_{1:T} \sim \eta_T}\left[ Y_t^{\mathrm{obs}} \frac{\mathbb{1}\{W_{t-p:t} = \mathbf{0}_{p+1}\}}{\Pr(W_{t-p:t} = \mathbf{0}_{p+1})} - Y_t(w^{\mathrm{obs}}_{t-m:f(t-p)-1}, \mathbf{0}_{t-f(t-p)+1}) \,\middle|\, W_{t-m:f(t-p)-1} = w^{\mathrm{obs}}_{t-m:f(t-p)-1} \right] = 0,$$

which finishes the proof. □

EC.6.2. Asymptotic Normality when m is Misspecified

The proof of Corollary 2 consists of two parts: $m < p$ and $m > p$. When $m > p$ we prove Corollary 2 by using Lemma EC.1. In particular, we derive $B^2_{n,k,a}$, and then construct some proper $\Delta_n$, $K_n$, and $L_n$.

Proof of Corollary 2.

The proof consists of two parts: $m < p$ and $m > p$.

First, when $m < p$, we know that $\hat{\tau}_p = \hat{\tau}_m$, $\tau_p = \tau_m$, and $\mathrm{Var}(\hat{\tau}_p) = \mathrm{Var}(\hat{\tau}_m)$. By Theorem 4 we obtain part (i), the expression in (11). By Theorem 5 we know that

$$\frac{\hat{\tau}_p - \tau_p}{\sqrt{\mathrm{Var}(\hat{\tau}_p)}} = \frac{\hat{\tau}_m - \tau_m}{\sqrt{\mathrm{Var}(\hat{\tau}_m)}} \xrightarrow{D} N(0,1).$$

Second, when $m > p$, we follow the same trick as in Theorem 5. In the $n$-replica experiment,

$$\hat{\tau}_p - \mathbb{E}[\tau_p^{[m]}] = \frac{1}{(n-1)p} \sum_{k=0}^{n-2} \bar{\,\cdot\,}_k, \qquad \mathrm{Var}(\hat{\tau}_p) = \frac{1}{(n-1)^2p^2} \mathrm{Var}\left( \sum_{k=0}^{n-2} \bar{\,\cdot\,}_k \right).$$

To use the language from Lemma EC.1, denote $d = n-1$. Denote for any $i \in [n-1]$, $X_{n,i} = \frac{1}{(n-1)p} \bar{\,\cdot\,}_{i-1}$. We know that $\phi = \lceil m/p \rceil$, so that $\{X_{n,1}, X_{n,2}, \ldots\}$ is a sequence of $\phi$-dependent random variables. See Table EC.8 for an illustration of $\phi$.

Table EC.8 An illustration of $\phi$ when $m = 5$, $p = 3$.

                    ...  13  14  15  16  17  18  19  20  21  22  23  24  ...
$T^*$                    ✓   −   −   ✓   −   −   ✓   −   −   ✓   −   −
carryover effect                         17 ------------------> 22
$\{\bar{\,\cdot\,}_k\}_{k=0}^{K+1}$      one term for each ✓

In this example $\phi = \lceil m/p \rceil = 2$. The arrow above numbers 17 through 22 means that the assignment on period 17 affects the outcome on period 22. So terms at most $\phi = 2$ positions apart can be correlated, while terms further apart are independent.

First note that $B^2_n = \mathrm{Var}(\hat{\tau}_p)$, and we calculate $B^2_{n,k,a}$ as follows. Note that $k \geq \phi + 1$.

$$B^2_{n,k,a} = \frac{1}{(n-1)^2p^2} \mathrm{Var}\left( \sum_{i=a}^{a+k-1} \bar{\,\cdot\,}_{i-1} \right) \leq \frac{1}{(n-1)^2p^2} \left( \sum_{i=a}^{a+k-1} \mathbb{E}[\bar{\,\cdot\,}_{i-1}^2] + \sum_{i=a}^{a+k-2} 2\mathbb{E}[\bar{\,\cdot\,}_{i-1}\bar{\,\cdot\,}_{i}] + \ldots + \sum_{i=a}^{a+k-1-\phi} 2\mathbb{E}[\bar{\,\cdot\,}_{i-1}\bar{\,\cdot\,}_{i-1+\phi}] \right) \leq \frac{Cp^2B^2}{(n-1)^2p^2} \cdot (k + (k-1) + \ldots + (k-\phi)) \leq \frac{(\phi+1)CkB^2}{(n-1)^2},$$

where $C$ is some constant bounding the number of terms in each cross-product expectation $2\mathbb{E}[\bar{\,\cdot\,}_{i-1}\bar{\,\cdot\,}_{i}], \ldots, 2\mathbb{E}[\bar{\,\cdot\,}_{i-1}\bar{\,\cdot\,}_{i-1+\phi}]$; and $\phi + 1$ is a constant as well.

Pick $\gamma = 0$, $\delta = 1$, then $\Delta_n = B^3/(n-1)^3$, $K_n = (\phi+1)CB^2/(n-1)^2$, and $L_n = \mathrm{Var}(\hat{\tau}_m)/(n-1)$.

1. $\mathbb{E}|X_{n,i}|^3 \leq \Delta_n = B^3/(n-1)^3$, because all the potential outcomes are bounded by $B$, so that $|X_{n,i}| \leq B/(n-1)$.
2. $B^2_{n,k,a}/k \leq K_n = (\phi+1)CB^2/(n-1)^2$.
3. $B^2_n/(n-1) \geq L_n = \mathrm{Var}(\hat{\tau}_m)/(n-1)$.
4. $K_n/L_n = (\phi+1)CB^2/\{(n-1)\mathrm{Var}(\hat{\tau}_m)\} = O(1)$, where the last equality is due to Assumption 5.
5. $\Delta_n/L_n^{3/2} = B^3/\{(n-1)^{3/2}\,\mathrm{Var}(\hat{\tau}_m)^{3/2}\} = O(1)$, where the last equality is due to Assumption 5.

Due to Lemma EC.1, $\frac{\hat{\tau}_p - \tau_p}{\sqrt{\mathrm{Var}(\hat{\tau}_p)}} \xrightarrow{D} N(0,1)$.
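The $\phi$-dependence used in the proof can be illustrated with a short sketch. It is not from the paper and assumes a simplified layout: block $k$ starts at the randomization point $kp+1$, covers periods $kp+1, \ldots, (k+1)p$, and its term depends only on the coin flips that can reach its outcomes through at most $m$ periods of carryover. The helper names are hypothetical.

```python
def dependence_order(m, p):
    """phi = ceil(m / p), computed with integer arithmetic."""
    return -(-m // p)

def flips_used(k, m, p):
    """Indices of the coin flips that the block-k term can depend on (assumed layout)."""
    first_period = k * p + 1               # first outcome period in block k
    earliest = max(first_period - m, 1)    # earliest assignment that can still affect it
    first_block = (earliest - 1) // p      # block containing that assignment
    return set(range(first_block, k + 1))

def can_be_correlated(k1, k2, m, p):
    """Two block terms can be correlated only if they share at least one coin flip."""
    return bool(flips_used(k1, m, p) & flips_used(k2, m, p))

m, p = 5, 3
phi = dependence_order(m, p)
# Block 7 covers periods 22-24 and block 5 covers periods 16-18, as in Table EC.8.
print(phi, flips_used(7, m, p), can_be_correlated(5, 7, m, p), can_be_correlated(4, 7, m, p))
```

Under these assumptions, the block containing period 22 shares a coin flip with the block containing period 17, which matches the arrow in Table EC.8, while blocks more than $\phi$ apart share none.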