Policy choice in experiments with unknown interference∗

Davide Viviano†

First version: November 16, 2020. This version: January 22, 2021.
Abstract
This paper discusses experimental design for inference and estimation of individualized treatment allocation rules in the presence of unknown interference. We consider a setting where units are organized into large, finitely many independent clusters and interact over unobserved dimensions within each cluster. The contribution of this paper is two-fold. First, we design a short pilot study with few clusters to test whether there exists a welfare-improving treatment configuration and hence whether it is worth learning one by conducting a larger-scale experiment. We propose a practical test that uses information on the marginal effect of the policy on welfare to compare the base-line intervention against any possible alternative. Second, we introduce a sequential randomization procedure to estimate welfare-maximizing individual treatment allocation rules valid under unobserved (and partial) interference. We propose nonparametric estimators of direct treatment effects and marginal spillover effects, which serve for hypothesis testing and policy design. We derive the estimators' asymptotic properties and small-sample regret guarantees of the policy estimated through the sequential experiment. Finally, we illustrate the method's advantage in simulations calibrated to an existing experiment on information diffusion.
Keywords:
Policy Targeting, Causal Inference, Experimental Design, Welfare Maximization, Spillovers, Individualized Treatments.
JEL Codes:
C10, C14, C31, C54.

∗I thank Graham Elliott, James Fowler, Paul Niehaus, Yixiao Sun and Kaspar Wüthrich for helpful comments and discussion. All mistakes are my own.

†Department of Economics, UC San Diego. Correspondence: [email protected].

Introduction
One of the main objectives of experiments is to identify the most effective policy. This paper addresses two questions that the decision-maker faces in practice: (i) "Does the baseline policy lead to the largest welfare compared to any possible alternative, and, hence, is a possibly large-scale experiment necessary for improving current decisions?" (ii) "How should the experiment be designed for estimating welfare-maximizing treatment allocation rules?". The presence of interference challenges these questions: treatment effects may spill over across individuals as a result of unobserved interactions.

Network effects often play a crucial role in the design of policies. However, a major challenge for the policymaker is the cost associated with observing and collecting network information (Breza et al., 2020). In a development study, for example, collecting network information in each village (Cai et al., 2015; Banerjee et al., 2013) often requires enumerating each individual in the population and collecting information on her friends. The design of experiments for estimating welfare-maximizing treatment allocation rules under unknown interference (as opposed to estimating treatment effects) has been unexplored by past literature.

This paper answers the two questions above in a setting where units are organized into large, finitely many independent clusters, such as cities, schools, villages, or districts. Within each cluster, interference occurs locally through an unobserved network. Researchers have access to an adaptive experiment over finitely many T periods, i.e., they can sequentially assign treatments and observe outcomes over each iteration. We use only two periods of experimentation and two or more clusters (i.e., a pilot study) to ascertain whether some treatment configuration will be welfare improving and hence worth learning by conducting the rest of the experiment.
We then discuss a sequential procedure to estimate welfare-maximizing policies, which requires $2(T+1)$ (finitely many) clusters.

The contribution of this paper is two-fold. (a) First, it introduces, to the best of our knowledge, the first test of whether the base-line policy is welfare-maximizing (against any possible alternative) under network interference. The test uses information on marginal effects of the policy function. (b) Second, it discusses an experimental design for estimation of welfare-maximizing treatment allocation rules (instead of average treatment effects) that allows for unobserved (and partial) interference. We introduce the first adaptive experiment under partial interference, consisting of a matched-pair two-stage adaptive design. We now discuss our contribution in detail.

Interference naturally occurs in several economic applications: information campaigns (Banerjee et al., 2013; Jones et al., 2017), health programs (Kim et al., 2015), development and public policy programs (Baird et al., 2018; Muralidharan et al., 2017; Muralidharan and Niehaus, 2017), and marketing campaigns (Zubcsek and Sarvary, 2011), among others.

In many economic applications, individuals are often organized into (large) independent clusters, while interference within each cluster is unobserved by the researcher. For example, when studying satisfaction with a cash transfer program (Alatas et al., 2012), individuals are organized into villages and connected within a village by parental or friendship ties (Alatas et al., 2016). In social marketing applications, individuals are often organized in cities or states, and they interact within each geographical region (Varian, 2016).

There has been recent work on how to construct welfare-optimal allocation rules in the absence of interference, i.e., where one person's outcome is independent of other people's treatment status. In this setting, information from the conditional average treatment effect can be used to design welfare-maximizing allocations (Kitagawa and Tetenov, 2018; Athey and Wager, 2020).
Suppose we instead knew the structure of this dependence. In that case, we can either (a) use neighbors' exposures, observed from a pilot study or constructed based on a particular network model, to construct the welfare (Viviano, 2019; Kitagawa and Wang, 2020), or (b) explicitly model global effects of the interactions on the system to guide the design of the experiment, as discussed in the context of online pricing experiments in Wager and Xu (2019). A more challenging problem is when either the global or local interference mechanism is unknown in the experiment. The first challenge we address is identifying and estimating the treatment's overall effect under interference when the network is unobserved. We show that if individuals are organized into groups between which there are no spillovers, and interference is local within each group, we can estimate the overall marginal effect of the treatment. The marginal effect (ME) defines the change in welfare from an infinitesimal change in the policy function. The ME is estimated by inducing small deviations to baseline interventions within finitely many pairs of such groups, without necessitating information on within-cluster interactions. We use the information on the marginal effects of the policy to evaluate and then estimate policies sequentially.

We consider policies consisting of individualized probabilistic treatment allocation rules. Individualized allocations imply that treatments are assigned independently based on individual-specific baseline covariates. Examples of policies include sending information to an individual (Bond et al., 2012), targeting cash transfers (Egger et al., 2019) or subsidies (Dupas, 2014), with the probability of treatment differing based, for instance, on the age or education of each individual.
The class of individualized assignment rules encompasses homogeneous assignments in two-stage randomized experiments as a special case (Baird et al., 2018), and it can be implemented (a) without requiring knowledge of the population network and (b) in an online fashion.

Identification relies on decomposing the potential outcomes as the sum of a conditional mean function and unobservable characteristics. The conditional mean function depends on the individual treatment assignment, individual baseline covariates, and the parameter β indexing the assignment mechanism (e.g., the probability of treatment for different individual types). The dependence on the parameter β captures the average spillover effect generated by the neighbors' treatments, which is averaged over the distribution of treatment assignments. Unobservables instead depend on neighbors' assignments and are, as a result of the assumption that effects spill over locally within the network, locally dependent. We construct the marginal effect as the sum of the direct effect of each individual's treatment, weighted by the marginal propensity to be treated, plus the marginal spillover effect. The marginal spillover effect defines the derivative of within-cluster average potential outcomes as functions of probabilities of treatments, also averaged over neighbors' assignments and covariates. Differently from the literature on causal inference under local interference (e.g., Li and Wager (2020); Leung (2020)), identification allows for neighbors' exposures to be unobserved to the researcher.

Estimation of marginal effects in the pilot study works as follows: we first pair clusters and, for each pair, in the first period of experimentation, we assign treatments independently based on the same target parameter across the two clusters. In the second period, we assign to each cluster locally perturbed probabilities of treatment, with perturbations in each pair having opposite signs.
We construct direct effects using the information within each cluster, and the marginal spillover effects by comparing the average outcomes on the treated and controls between the two clusters, differentiated over the two periods and appropriately reweighted by treatment probabilities. By taking a difference-in-differences of the outcomes between two clusters in a pair over two periods, we allow for cluster- and time-specific separable fixed effects. The design permits us to consistently estimate the marginal effects without necessitating infinitely many clusters: it guarantees that we always compare two clusters with opposite perturbations to the target policy, instead of taking an average across all clusters, whose concentration rate would depend on the number of clusters.

See Hudgens and Halloran (2008) for a definition of potential outcomes under partial interference.

Testing marginal effects to motivate a sequential design represents a contribution of independent interest of this paper to the literature on experimental design.

The second question we answer is how we can estimate welfare-maximizing allocation rules under unknown interference. We discuss the design of the sequential experiment to estimate welfare-maximizing individualized treatments under unknown interference. The experiment consists of sequential updates of each pair's policy based on the previous randomization period. Within each randomization period, we perform p iterations, with p indicating the number of parameters to be estimated, and we estimate marginal effects over one direction at a time.
We construct marginal effects estimators by contrasting (weighted) averages of outcomes between two clusters in a pair, differentiated by the observed outcome in the experiment's first iteration.

The sequential experiment for policy design presents one major challenge: the estimated treatment assignment rule over each iteration is data-dependent, and the time-dependence of unobservables may lead to a confounded experiment. This problem is generally not incurred in adaptive experiments, where units are assumed to be drawn without replacement (Kasy and Sautmann, 2019; Wager and Xu, 2019). We break dependence using a novel cross-fitting algorithm (Chernozhukov et al., 2018), where, in our case, the algorithm consists of "circular" updates of the policies using information from subsequent clusters. The circular approach's fundamental idea is that treatments in each pair depend on the outcomes and assignments in the next pair, in the previous period. As a result, as long as the number of pairs of clusters exceeds the number of iterations, the experiment is never confounded.

We use a gradient descent method for policy updates (Bottou et al., 2018). An important assumption is that the local optimization procedure achieves the global optimum. This condition is satisfied under decreasing marginal effects of the probability of treatments. The learning rate choice allows for strict quasi-concavity through the rescaling of the gradient's norm (Hazan et al., 2015). We discuss small-sample guarantees of the proposed design. We show that under local strong concavity of the welfare criterion and global strict quasi-concavity, the worst-case in-sample regret across all clusters converges to zero at rate log(T)/T, where T denotes the number of iterations.
We also show that the out-of-sample regret, i.e., the regret incurred after deploying the estimated policy on a new sample, scales to zero at a rate 1/T.

The proposed sequential experiment substantially differs from what an intuitive extension of a two-stage randomized experiment for estimating welfare-optimal treatment rules may be: first randomizing probabilities of treatment between clusters (Baird et al., 2018) (e.g., uniformly), then randomizing treatments within each cluster independently, and finally extrapolating the welfare function over the parameter space. We do not consider this alternative approach for two main reasons: (i) treatments are individualized and heterogeneously assigned, and the estimation error for learning allocation rules through grid search naturally incurs a curse of dimensionality; (ii) it does not necessarily control the in-sample regret, i.e., it requires substantial exploration to be able to extrapolate the entire response function, at the expense of the welfare of in-sample participants. By contrast, our procedure minimizes in-sample exploration while controlling the in-sample regret, and its in- and out-of-sample regret only scales quadratically with the dimension of the policy function.

Our results rely on two conditions: (i) while observations may exhibit time-dependence, treatments do not carry over in time; (ii) the first moments of cluster averages under the same treatment exposures converge to the same estimand across different clusters, up to separable cluster-specific fixed effects. Condition (i) is often explicitly or implicitly imposed in the study of adaptive experiments (see, e.g., Kasy and Sautmann (2019)), and (ii) representativeness of the clusters is often necessary for valid inference in two-stage randomized experiments (Baird et al., 2018). We include extensions that relax (i) to limited carry-over, and extensions that allow for non-separable time- and cluster-specific fixed effects (relaxing (ii)) under lack of spillovers on the treated units (but not the control units), in Section 5.

See also Garber (2019) for examples of possibility results of logarithmic regret rates under lack of (global) strong concavity.

We conclude our discussion with a calibrated experiment. Data from Cai et al. (2015) show that (i) marginal effects of the treatment exhibit decreasing marginal returns in applications, and (ii) the method presents substantial advantages relative to existing experimental designs.

The rest of the paper is organized as follows. We discuss the set-up and the definition of welfare in Section 2. We discuss hypothesis testing in Section 3. The adaptive experiment for policy design is introduced in Section 4. Section 5 presents extensions in the presence of dynamic effects and non-separable fixed effects. Section 6 collects the numerical experiments and Section 7 concludes.
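The gradient-based policy update with gradient-norm rescaling, described earlier in this introduction, can be sketched numerically. The welfare function, step-size rule, and trimming constant below are all illustrative assumptions — the paper's procedure estimates gradients from paired experimental clusters rather than evaluating them exactly:

```python
import numpy as np

def welfare(beta):
    # Illustrative strictly concave welfare over a 2-dimensional policy parameter.
    target = np.array([0.6, 0.3])
    return -np.sum((beta - target) ** 2)

def grad_welfare(beta):
    # Exact gradient of the toy welfare (in the experiment this would be
    # replaced by an estimated marginal effect).
    target = np.array([0.6, 0.3])
    return -2.0 * (beta - target)

def project(beta, delta=0.05):
    # Keep treatment probabilities inside (delta, 1 - delta), mimicking the
    # trimmed policy class.
    return np.clip(beta, delta, 1 - delta)

beta = np.array([0.2, 0.8])
for t in range(1, 201):
    g = grad_welfare(beta)
    # Rescale by the gradient norm: the update direction has unit length,
    # which is what allows strict (rather than strong) quasi-concavity;
    # the 1/t learning rate is an illustrative choice.
    beta = project(beta + (1.0 / t) * g / (np.linalg.norm(g) + 1e-12))

print(beta)  # approaches the maximizer (0.6, 0.3)
```

The normalized step makes progress depend only on the gradient's direction, so flat regions of a quasi-concave welfare do not stall the updates.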
This paper relates to three main strands of literature: (i) experimental design; (ii) causal inference under network interference; (iii) empirical welfare maximization and statistical treatment choice. We review the main references in the following lines.

In the context of experimental design under network interference, common designs include clustered experiments (Eckles et al., 2017; Taylor and Eckles, 2018; Ugander et al., 2013) and saturation-design experiments (Baird et al., 2018; Basse and Feller, 2018; Pouget-Abadie, 2018). However, differently from those designs, our analysis focuses on detecting welfare-maximizing policies instead of inference on treatment and spillover effects. The different target estimand motivates the sequential procedure of our experiment. Recent literature discusses alternative design mechanisms for inference on treatment effects only, often assuming knowledge of the underlying network structure. Examples include Basse and Airoldi (2018b), which only allows for dependence but not interference, Jagadeesan et al. (2020), who discuss the design of experiments for estimating direct treatment effects only in the presence of observed networks, Breza et al. (2020), who discuss inference on treatment effects with aggregated relational data, and Viviano (2020), who discusses the design of two-wave experiments under an observed network, focusing on variance reduction of treatment effect estimators. Additional references include Basse and Airoldi (2018a), who discuss limitations of design-based causal inference under interference, and Kang and Imbens (2016), who discuss encouragement designs in the presence of interference.
None of the above references addresses the problem of policy design or discusses inference on welfare-maximizing policies.

Local experimentation for experimental design relates to Wager and Xu (2019), who discuss local experimentation in the different context of estimating prices in a single two-sided market with asymptotically independent agents, through randomization of prices to individuals. However, as noted by the authors, the assumptions imposed in the above reference do not allow for unknown interference. These differences motivate our identification strategy and algorithmic procedures, which exploit two-level local randomization at the cluster and individual level instead of individual-based randomization, as well as our proposed non-parametric estimator of marginal effects based on the clustering.

Our paper also relates more broadly to the literature on adaptive experimentation through first-order approximation methods (Bubeck et al., 2017; Flaxman et al., 2004; Kleinberg, 2005), and to experimental design with strategic agents recently discussed in Munro (2020). However, these references do not allow for network interference. They focus on individual-level randomization procedures, as opposed to the cluster-based and individual-based sequential procedure proposed in the current paper. Under unknown interference, we show that consistent estimation of marginal effects with finitely many clusters requires deterministically assigning treatments based on small deviations of the policies between pairs of clusters.
The two-stage matched-pair cluster design represents a further difference from both designs based on individual randomization and from saturation experiments where probabilities of treatment are randomized between clusters.

Additional references include bandit algorithms and Thompson sampling (Cesa-Bianchi and Lugosi, 2006; Bubeck et al., 2012; Russo et al., 2017), and the recent econometric literature on adaptive and two-stage experiments (Kasy and Sautmann, 2019; Bai, 2019; Tabord-Meehan, 2018), which, however, does not allow for network interference.

We build a connection to the literature on inference under interference. Most of the literature assumes an observed network structure (Aronow et al., 2017; Manski, 2013; Leung, 2020; Ogburn et al., 2017; Li and Wager, 2020; Goldsmith-Pinkham and Imbens, 2013; Athey et al., 2018; Choi, 2017; Forastiere et al., 2020), differently from the current paper. References which discuss inference under partial interference include Hudgens and Halloran (2008) and Vazquez-Bare (2017), among others. Unlike the current paper, the above references focus on inference on treatment effects instead of inference on welfare-maximizing policies. Finally, Sävje et al. (2020) discuss conditions for valid inference on the direct effect of treatment only, under unknown interference. In contrast, estimating optimal policies requires estimating the marginal effects of the treatments.

In the context of policy design, Viviano (2019) instead discusses targeting on networks in an off-line scenario, where data are observed from an existing experiment or quasi-experiment, without therefore discussing the problem of experimental design. Kitagawa and Wang (2020) discuss allocation rules on a SIR network, in the absence of an experiment, assuming a fully observable network structure and using a model-based method. Li et al. (2019), Graham et al.
(2010), and Bhattacharya (2009) consider the problem of optimal allocation of individuals across small groups, such as dormitory rooms, using data from a single-wave experiment. However, the above procedures allow neither for the design of individualized treatment allocation rules nor for sequential experimentation.

This paper also contributes to the growing literature on statistical treatment rules by proposing a design mechanism to test and estimate treatment allocation rules. References on policy estimation include Manski (2004), Athey and Wager (2020), Kitagawa and Tetenov (2018), Kitagawa and Tetenov (2019), Elliott and Lieli (2013), Mbakop and Tabord-Meehan (2016), Bhattacharya and Dupas (2012), Dehejia (2005), Stoye (2009), Stoye (2012), Tetenov (2012), Murphy (2003), Nie et al. (2020), Kallus (2017), Lu et al. (2018), and Sasaki and Ura (2020), among others. However, none of the above references discusses testing for policy optimality, the problem of experimental design, or the case of interference.

Finally, the literature on inference on welfare-maximizing decisions has mostly focused on constructing confidence intervals around welfare estimators, which, however, do not permit comparing a target policy against any possible alternative (Kato and Kaneko, 2020; Zhang et al., 2020; Hadad et al., 2019; Andrews et al., 2019; Imai and Li, 2019; Bhattacharya et al., 2013; Luedtke and Van Der Laan, 2016). In the context of independent observations, exceptions are Armstrong and Shen (2015), Rai (2018), and Kasy (2016), which propose procedures for constructing sets of welfare-maximizing policies (or ranks of policies), whose validity, however, does not allow for dependence and interference, and which often
Observe that here we define the ME as thederivative of welfare with respect to the parameters of the policy function which should not be confusedwith the definition of MTE commonly adopted in the causal inference literature denoting the derivativerelative to the (endogenous) selection mechanism (e.g., see recent work of Sasaki and Ura (2020) of off-lineempirical welfare maximization using the MTE).
This section discusses the model, the definition of welfare, and the estimand of interest.
Preliminaries and notation
We start by introducing necessary notation. We define $Y_{i,t} \in \mathcal{Y}$ the outcome of interest of unit $i$ at time $t$, and $D_{i,t} \in \{0,1\}$ the treatment assignment of unit $i$ at time $t$. We denote $X_i \in \mathcal{X}$ individual-specific base-line covariates. We let $X_i \sim F_{X_i}$, with $f_{X_i}$ denoting the Radon-Nikodym derivative of $F_{X_i}$. Units are assumed to be organized into $K$ independent large clusters, and observed over $T$ periods. We denote $k(i) \in \{1, \dots, K\}$ the cluster of unit $i$, $N_k$ the number of units in cluster $k$, and $N = \sum_{k=1}^{K} N_k$. For notational convenience only, we assume equally sized clusters with $N_k = N/K = \check{N}$. From each cluster we sample at random each period covariates and outcomes of $n < \check{N}$ individuals. We denote $S_{k,t}$ the set of indexes of units sampled from cluster $k$ at time $t$ (these may or may not be the same indexes every period). Motivated by development studies (Cai et al., 2015; Banerjee et al., 2013), we assume that units are connected within each cluster $k$ according to a fixed adjacency matrix $A^k \in \mathbb{R}^{\check{N} \times \check{N}}$, unobserved to the researcher. All our conditions must be interpreted conditional on the adjacency matrices $(A^1, \dots, A^K)$. Interference within each cluster occurs in unknown dimensions. However, no interference between clusters is allowed. Therefore, throughout the rest of our discussion, we will implicitly assume that SUTVA (Rubin, 1990) holds at the cluster level only.

Assignment mechanism
Let
$$e(\,\cdot\,;\beta) : \mathcal{X} \mapsto \mathcal{E} \subset (0,1), \quad \beta \in \mathcal{B}, \qquad (1)$$
denote a class of individual treatment assignments, where $\beta$ denotes a vector of parameters, and $e(x;\beta)$ is a twice continuously differentiable function. We denote $\dim(\beta) = p$. We define a (conditional) Bernoulli allocation rule as follows.

Definition 2.1 (Conditional Bernoulli Allocation Rule (CBAR)). A Bernoulli allocation rule with parameters $\beta_t = \{(\beta_{k,1}, \dots, \beta_{k,t})\}_{k \in \{1,\dots,K\}}$ assigns treatments to all units $i \in \{1, \dots, N\}$ as follows:
$$D_{i,t} \mid X_i = x, \beta_t \sim \mathrm{Bern}\big(e(x; \beta_{k(i),t})\big),$$
independently across units and time.

Definition 2.1 defines an allocation where treatments are assigned independently in each cluster, with cluster-specific and time-specific conditional assignments $e(X_i; \beta_{k(i),t})$, parametrized by the vector of parameters $\beta_t$. Importantly, the above definition assumes that treatment assignments in cluster $k$ are conditionally independent of $\beta_{k' \neq k, T}$ given $\beta_{k, T}$. In addition, treatment assignments are drawn for all units in a cluster (regardless of whether their post-treatment outcome is observed or not).

Remark 1 (Why a CBAR?). The cluster-specific Bernoulli allocation is commonly used in two-stage randomized network experiments for inference on treatment effects in the presence of a single experimentation period ($t = 1$) and homogeneous treatment assignments (i.e., $e(x;\beta) = \beta$) (Baird et al., 2018). This paper considers heterogeneous assignments and multiple experimentation periods, and the different goal of welfare maximization guides the choice of the parameter $\beta$. We consider a CBAR since it is simple and easy to implement in practice, and it can be implemented in an on-line fashion. A CBAR induces a local-dependence structure which, we show, permits estimation of welfare-maximizing policies and asymptotic inference on the optimality of base-line interventions.

Example 2.1 (Targeting information).
Consider the problem of targeting information to individuals (Cai et al., 2015). Here, $D_{i,t}$ denotes whether information is sent to individual $i$ at time $t$, while $Y_{i,t}$ equals the outcome of interest of unit $i$ at time $t$ (e.g., insurance adoption at period $t$). Units are organized in villages $k \in \{1, \dots, K\}$. Suppose that insurance adoption of individual $i$ at time $t$ depends on individual $i$'s treatment assignment $D_{i,t}$, and on the treatment assignments of individual $i$'s friends and friends of friends in village $k(i)$. We say that two individuals are connected either because they are direct friends (and so the assignment of $i$ directly impacts the decision of $j$) or because they share a common friend. A simple definition of the adjacency matrix takes the following form:
$$A^{k}_{i,j} = 1\big\{ i \text{ is friend of } j \big\} + \alpha \times 1\big\{ (i,j) \text{ have a common friend} \big\}, \quad \alpha \in [0,1].$$
See Figure 1 for an illustrative example.

[Figure 1: Example of network interactions. The figure on the left draws individuals connected to friends. The figure on the right draws an adjacency matrix obtained after connecting individuals sharing a common friend (colored in green).]

The matrix of friendships, the parameter $\alpha$, and so also the matrix $A^k$ are unobserved to the researcher and policy-maker. Researchers aim to study how many individuals should be treated to maximize the welfare generated by insurance adoption, net of the costs of treatments. Namely, they consider an allocation rule of the form
$$e(x; \beta) = \beta, \quad \beta \in (\delta, 1 - \delta) \subset (0,1),$$
where $\beta$ denotes the probability of treatment, and $X_i = 1$.

Example 2.2 (Cash transfer program). Consider the problem of targeting cash transfers to individuals, to maximize satisfaction with the program (Alatas et al., 2012). Units are organized in independent villages and connected within each village based on parental ties and friendships.
Targeting cash transfers generates spillovers along these dimensions. The policy-maker observes for each individual the quality of the roof (binary), the quality of the floor (binary), and whether the individual attended secondary school (binary). The policy-maker constructs a linear probability decision rule with cutoffs at $(\delta, 1-\delta) \subset (0,1)$:
$$e\big(\text{floor}, \text{roof}, \text{educ}; \beta\big) = \beta_0 + \beta_1\,\text{floor} + \beta_2\,\text{roof} + \beta_3\,\text{educ}, \quad \beta \in \mathcal{B}, \quad \delta \le \sum_{j=0}^{3} \beta_j \le 1 - \delta. \qquad (2)$$
The set $\mathcal{B}$ encodes capacity constraints, as well as ethical and legal constraints on the parameter space. See Figure 2 for a graphical illustration with $\beta_3 = 0$.

Program effectiveness can be measured using measures of program satisfaction. Satisfaction with the program is shown to increase compliance of villages with the program and relates to unobserved measures of poverty (Alatas et al., 2012).

Capacity constraints can be imposed whenever the distribution of $X_i$ is known to the policy maker, and these can be directly incorporated in the conditions on the parameter space.

[Figure 2: Example of a probabilistic treatment assignment rule for a cash transfer program. Individuals are assigned different probabilities ($\beta_0$, $\beta_0+\beta_1$, $\beta_0+\beta_2$, $\beta_0+\beta_1+\beta_2$) based on the quality of their floor and roof. The final goal is to optimally estimate those probabilities.]
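A CBAR with the linear rule of Example 2.2 can be sketched as follows. The coefficient values, covariate distribution, and cluster size are all hypothetical choices for illustration; only the structure (independent Bernoulli draws with covariate-dependent probabilities, and the sum constraint from (2)) comes from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def e_linear(x, beta, delta=0.1):
    # Linear probability rule of Example 2.2:
    # e(floor, roof, educ; beta) = beta0 + beta1*floor + beta2*roof + beta3*educ.
    p = beta[0] + x @ beta[1:]
    # The paper imposes delta <= sum_j beta_j <= 1 - delta on the parameter
    # space; here we simply check it for the chosen beta.
    assert delta <= np.sum(beta) <= 1 - delta
    return p

def draw_cbar(X, beta):
    # Conditional Bernoulli Allocation Rule: treatments drawn independently
    # across units, given covariates and the cluster's parameter beta.
    return rng.binomial(1, e_linear(X, beta))

# Hypothetical binary covariates (floor, roof, educ) for one cluster.
X = rng.binomial(1, 0.5, size=(10_000, 3))
beta = np.array([0.2, 0.1, 0.15, 0.05])  # illustrative coefficients
D = draw_cbar(X, beta)
print(D.mean())  # roughly beta0 + 0.5 * (beta1 + beta2 + beta3) = 0.35
```

Because treatments are drawn independently given covariates, the rule requires no network information and can be applied online, one unit at a time.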
Throughout our discussion we assume partial interference: potential outcomes are independent between clusters (Baird et al., 2018). We formalize the between-cluster SUTVA by defining
$$Y_{i,t}\big(d^{k(i)}_1, \dots, d^{k(i)}_t\big), \quad d^k_s \in \{0,1\}^{\check{N}}, \quad s \in \{1, \dots, t\},$$
the potential outcome of unit $i$ at time $t$, a function of the treatments assigned to units in the same cluster only, over each period $s \le t$. We implicitly assume that potential outcomes are consistent (Imbens and Rubin, 2015) and that potential outcomes and base-line covariates are jointly mutually independent between clusters.

Three restrictions on within-cluster dependence are imposed.

Assumption 1 (No carry-over and local interference). Assume that for any $d_1, \dots, d_t$, $t \ge 1$, the following conditions hold.

(A) $Y_{i,t}\big(d^{k(i)}_1, \dots, d^{k(i)}_t\big)$ is a constant function in $(d^{k(i)}_1, \dots, d^{k(i)}_{t-1})$;

(B) $Y_{i,t}\big(\cdots, d^{k(i)}_t\big)$ is constant in each entry $d^{k(i)}_{t,j}$ with $j: A^{k(i)}_{i,j} = 0$. In addition, $\sum_j 1\{A^{k(i)}_{i,j} > 0\} \le \sqrt{\gamma_n}$, with $\gamma_n \le n^{1/2}$;

(C) $\big\{X_i, Y_{i, s \le T}(\cdots, d),\ d \in \{0,1\}^{\check{N}}\big\} \perp \big\{X_j, Y_{j, t \le T}(\cdots, d'),\ d' \in \{0,1\}^{\check{N}}\big\}_{j \notin \{v : A^{k(i)}_{i,v} > 0\},\ t \le T}$, the latter index set denoting the units not connected to individual $i$.

Condition (A) assumes that effects do not propagate in time. This condition is known as no-carry-over and is often implicitly imposed in studies on experimental design (Kasy and Sautmann, 2019). For a discussion of the no-carry-over assumption, the reader may refer to Athey and Imbens (2018). However, potential outcomes may exhibit time dependence (e.g., due to unobserved time-varying factors). In Section 5.3 we extend our framework to limited carry-over effects. Condition (B) imposes local interference: spillovers propagate within (unknown) neighborhoods. The size of a neighborhood is assumed to grow at a slower rate than the sample size. The assumption of local interference is often imposed for valid causal inference in the presence of observed network structures; see, e.g., Leung (2020), Jagadeesan et al. (2020). Condition (C) instead imposes local dependence among outcomes and covariates. Similarly to (B), researchers do not know the dependence structure within each cluster. For simplicity, throughout our discussion, we refer to potential outcomes only as functions of all other units' current treatment status in the same cluster.

Under Assumption 1, we can show that outcomes are only locally dependent, with their expectation depending on the parameter $\beta$ indexing the assignment mechanism.
This decomposition permits us to (i) identify the welfare function and (ii) derive asymptotic results by exploiting the local dependence structure. We formalize this idea in the following lemma.

Lemma 2.1.
Let Assumption 1 hold. Then for a CBAR with $\beta_{k, t \le T} \perp \{X_i, Y_{i,t}(d), d \in \{0,1\}^{\check N}\}_{i: k(i) = k}$, for all $i$,

$$Y_{i,t} = m_{i,t}\big(D_{i,t}, X_i, \beta_{k(i),t}\big) + \varepsilon_{i,t}, \qquad E\big[\varepsilon_{i,t} \mid D_{i,t}, X_i, \beta_{k(i),t}\big] = 0,$$

for some individual-specific function $m_{i,t}(\cdot)$ and unobservables $\varepsilon_{i,t}$. In addition, $(\varepsilon_{i,t}, X_i) \perp \{(\varepsilon_{j, t \le T}, X_j)\}_{j \in \mathcal J(i)} \mid \beta_{k(i),t}$, where $\mathcal J(i) \subset \{v: k(v) = k(i)\}$ and $|\mathcal J(i)| \ge \check N - \gamma_n$.

Lemma 2.1 states that observed outcomes under a Bernoulli assignment are the sum of two components: a conditional expectation function $m_{i,t}(\cdot)$, which depends on the individual assignment, baseline covariates, and the parameter $\beta_{k(i),t}$; and unobservables $\varepsilon_{i,t}$ that also depend on neighbors' covariates and treatment assignments. Unobservables depend on at most $2\gamma_n$ many other unobservables in the same cluster. Observe that the $\varepsilon_{i,t}$ are not identically distributed, since they also depend on the treatment assignments of the neighbors of individual $i$. We illustrate Lemma 2.1 in the following examples.

Example 2.1 Cont'd. Assume that each individual has at least one connection. Let

$$Y_{i,t} = \alpha_t + D_{i,t}\varphi_1 + \frac{\sum_{j \ne i} A^{k(i)}_{i,j} D_{j,t}}{\sum_{j \ne i} A^{k(i)}_{i,j}}\varphi_2 - \Big(\frac{\sum_{j \ne i} A^{k(i)}_{i,j} D_{j,t}}{\sum_{j \ne i} A^{k(i)}_{i,j}}\Big)^2 \varphi_3 + \eta_{i,t}, \qquad (3)$$

with $\eta_{i,t}$ being cross-sectionally independent unobservables. Namely, outcomes depend on their own treatment and the percentage of treated units connected to $i$. Equation (3) also states that spillovers have decreasing marginal effects.
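As a numerical sanity check of the decomposition in Lemma 2.1 for the model in Equation (3), the sketch below computes the exact expectation of $Y_{i,t}$ over i.i.d. Bernoulli($\beta$) neighbor assignments by enumeration, and compares it with the closed-form conditional mean implied by $E[\text{share}] = \beta$ and $E[\text{share}^2] = \beta(1-\beta)/|N_i| + \beta^2$. Parameter values are illustrative.

```python
import itertools

def exact_mean_outcome(d, beta, n_nbrs, alpha=0.5, phi1=2.0, phi2=1.0, phi3=0.5):
    """Exact E[Y_{i,t}] for the model in Equation (3): enumerate all 2^n_nbrs
    i.i.d. Bernoulli(beta) assignments of i's neighbors (n_nbrs = |N_i|)."""
    total = 0.0
    for nbr_d in itertools.product([0, 1], repeat=n_nbrs):
        s = sum(nbr_d)
        prob = beta ** s * (1 - beta) ** (n_nbrs - s)
        share = s / n_nbrs
        total += prob * (alpha + phi1 * d + phi2 * share - phi3 * share ** 2)
    return total

def closed_form_m(d, beta, n_nbrs, alpha=0.5, phi1=2.0, phi2=1.0, phi3=0.5):
    """m_{i,t}(d, beta): uses E[share] = beta and
    E[share^2] = beta * (1 - beta) / n_nbrs + beta**2."""
    return (alpha + phi1 * d + phi2 * beta
            - phi3 * (beta * (1 - beta) / n_nbrs + beta ** 2))
```

The two functions agree to machine precision, which is the content of the decomposition: the mean of $Y_{i,t}$ depends on neighbors' assignments only through $\beta$.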
Taking expectations over neighbors' assignments, under a CBAR we can write

$$Y_{i,t} = \underbrace{\alpha_t + D_{i,t}\varphi_1 + \beta_{k(i),t}\varphi_2 - \beta_{k(i),t}\varphi_3 \times \frac{\sum_{j \ne i} A^{k(i), 2}_{i,j}}{\big(\sum_{j \ne i} A^{k(i)}_{i,j}\big)^2} - \beta^2_{k(i),t}\varphi_3 \times \frac{\sum_{j \ne i,\, j' \ne i,\, j' \ne j} A^{k(i)}_{i,j} A^{k(i)}_{i,j'}}{\big(\sum_{j \ne i} A^{k(i)}_{i,j}\big)^2}}_{=\, m_{i,t}(D_{i,t},\, \beta_{k(i),t})} + \varepsilon_{i,t},$$

where $\varepsilon_{i,t}$ is a function of $\sum_{j \ne i} A^{k(i)}_{i,j} D_{j,t}$. The case with covariates follows similarly, where the expectation is also taken with respect to $X_{j \ne i}$ (see the following example).

Example 2.2 Cont'd
Assume that each individual has at least one connection. Let

$$Y_{i,t} = \alpha_{k(i)} + D_{i,t}\varphi_1 + (1 - D_{i,t})\frac{\sum_{j \ne i} A^{k(i)}_{i,j} D_{j,t}}{\sum_{j \ne i} A^{k(i)}_{i,j}}\varphi_2 + \eta_{i,t}, \qquad (4)$$

with $\eta_{i,t}$ being cross-sectionally independent unobservables, and $A_{i,j} = A_{j,i} \in \{0,1\}$. That is, spillovers only occur on those individuals that are not treated. The model is equivalent to

$$Y_{i,t} = \alpha_{k(i)} + D_{i,t}\varphi_1 + (1 - D_{i,t})\frac{\sum_{j \ne i} A^{k(i)}_{i,j}\big[E[\mathrm{floor}_j]\beta_1 + E[\mathrm{roof}_j]\beta_2 + E[\mathrm{educ}_j]\beta_3\big]\varphi_2}{\sum_{j \ne i} A^{k(i)}_{i,j}} + \varepsilon_{i,t},$$

where

$$\varepsilon_{i,t} = (1 - D_{i,t})\Big[\frac{\sum_{j \ne i} A^{k(i)}_{i,j} D_{j,t}\varphi_2}{\sum_{j \ne i} A^{k(i)}_{i,j}} - \frac{\sum_{j \ne i} A^{k(i)}_{i,j}\big[E[\mathrm{floor}_j]\beta_1 + E[\mathrm{roof}_j]\beta_2 + E[\mathrm{educ}_j]\beta_3\big]\varphi_2}{\sum_{j \ne i} A^{k(i)}_{i,j}}\Big] + \eta_{i,t}.$$

Example 2.3 (Educational program). Consider the problem of designing educational programs in schools (Opper, 2016) for test-score improvements. Students are clustered in $k \in \{1, \cdots, K\}$ schools. Each student is assigned to equally sized classes $c(i)$ of fixed size $C$, unobserved to the researcher. Test scores depend on assignments as follows:

$$Y_{i,t} = P\Big(D_{i,t}, \sum_{j \ne i: c(j) = c(i)} D_{j,t}, X_i, \eta_{i,t}\Big), \qquad (\eta_{i,t}, X_i) \sim_{i.i.d.} F_{\eta, X},$$

for some arbitrary polynomial function $P(\cdot)$ and independent stationary unobservables $\eta_{i,t}$. Then under a CBAR,

$$m_{i,t}(d, x, \beta) = E_\beta\Big[P\Big(d, \sum_{j \ne i: c(j) = c(i)} D_{j,t}, X_i, \eta_{i,t}\Big)\Big| X_i = x\Big], \qquad (5)$$

where $E_\beta$ denotes the expectation over neighbors' assignments under a CBAR with cluster-level parameter $\beta$.

[Footnote] The expression below uses independence of the treatment assignments under a CBAR.
For each individual, $\varepsilon_{i,t}$ depends on the observables and unobservables of at most $C$ other units.

The second condition that we impose requires clusters to be representative of the underlying population.

Assumption 2 (Representative clusters and fixed effects). For any $d \in \{0,1\}$, $\beta \in \mathcal B$, $x \in \mathcal X$, any random sample $\mathcal S_{k,t}$ of size $n$ from cluster $k$ is such that

$$\frac{1}{n}\sum_{i \in \mathcal S_{k,t}} m_{i,t}\big(d, x, \beta\big) f_{X_i}(x) = \alpha_t(x) + \tau_k(x) + m(d, x, \beta) f_X(x) + J_n, \qquad \frac{1}{n}\sum_{i \in \mathcal S_{k,t}} f_{X_i}(x) = \check f_X(x) + J_n,$$

for some possibly unknown and uniformly bounded functions $\alpha_t(\cdot), \tau_k(x), m(\cdot), f_X(\cdot), \check f_X(\cdot)$, and $J_n \in [-b_n, b_n]$ for some positive $b_n \to 0$ as $n \to \infty$.

The functions $\alpha_t(x), \tau_k(x)$ capture the time-specific and cluster-specific fixed effects for the sub-population with covariates $\{X_i = x\}$ at time $t$, multiplied by the average density within each cluster. In Example 2.1, $\alpha_t(x) = \alpha_t$, while in Example 2.3, $\alpha_t(x) = 0$ due to the stationarity assumption. In the presence of identically distributed covariates, the function $m(\cdot)$ defines the within-cluster expectation, conditional on $X_i = x$, net of fixed effects. Whenever covariates are not identically distributed, $m(d, x, \beta) f_X(x)$ defines the limiting average of the product between the conditional mean function and the individual-specific density $f_{X_i}$, evaluated at $x$. The function $\check f_X(x)$ denotes the within-cluster average density function of the covariates. The component $J_n$ captures imbalance across clusters. In Example 2.1, Assumption 2 holds if the average inverse degree is asymptotically the same across different clusters, while it fails otherwise. In Example 2.3, instead, the assumption always holds with $J_n = 0$.

Assuming the representativeness of clusters is a common assumption for causal inference. For instance, Baird et al.
(2018) assume that cluster-level expectations are not cluster-specific, and Vazquez-Bare (2017) assumes that the joint distribution of outcomes from each cluster is the same across different clusters. Here we allow for separable cluster-specific fixed effects. The assumption implicitly imposes that fixed effects are additive and separable, and that expectations within each cluster concentrate around the same target estimand (after subtracting the fixed effects). In the following remark, we discuss a relaxation of Assumption 2 that allows for non-separable time- and cluster-specific fixed effects.

Remark 2 (Non-separable time/cluster fixed effects). In Section 5 we discuss the different scenario where

$$\frac{1}{n}\sum_{i \in \mathcal S_{k,t}} m_{i,t}\big(d, x, \beta\big) f_{X_i}(x) = \alpha_{t,k}(x) + m(d, x, \beta) f_X(x) + J_n, \qquad (6)$$

i.e., the fixed effects are not separable in time and cluster identity. We discuss this scenario and provide a set of results under the alternative condition that $\frac{\partial m(1, x, \beta)}{\partial \beta} = 0$, i.e., that spillovers do not occur on treated units, but only on controls. For example, in the presence of an information campaign, we may assume that spillovers do not occur on those individuals who have already received information, but only on those who were not exposed to it.

A further relaxation of Assumption 2 may consist of also indexing the function $m(\cdot)$ by the cluster type, as in Park and Kang (2020), and conducting separate analyses within different clusters. This is omitted for the sake of brevity.

The scope of this paper is to estimate the conditional Bernoulli assignment that maximizes social welfare. We introduce the notion of (utilitarian) welfare (Manski, 2004).
Definition 2.2 (Welfare). For a given conditional Bernoulli assignment with parameters $\beta_{k,t} = \beta$, define the (utilitarian) welfare as follows:

$$W(\beta) = \int \Big[e(x; \beta)\big(m(1, x, \beta) - m(0, x, \beta)\big) + m(0, x, \beta)\Big] f_X(x)\, dx - \int c(x)\, e(x; \beta)\, \check f_X(x)\, dx, \qquad (7)$$

where $c(x) < \infty$ denotes the cost of treatment for units with $X_i = x$.

Welfare is defined as the average effect under the treatment assignment $e(\cdot; \beta)$, net of its implementation cost $c(x)$, assumed to be known to the policy-maker. Observe that welfare does not depend on fixed effects, since those do not depend on the policy $\beta$. We can now introduce our main estimand.

Definition 2.3 (Estimand). Define the welfare-maximizing policy as

$$\beta^* \in \arg\sup_{\beta \in \mathcal B} W(\beta), \qquad (8)$$

where $\mathcal B = [\underline B, \overline B]^p$ denotes a pre-specified compact set.

Equation (8) defines the vector of parameters that maximizes social welfare. In our setting, policy-makers choose $\beta^*$ based on an experiment conducted over a pre-specified time window. Once the experiment is terminated, the policy cannot be updated, and no additional information is collected.

Remark 3 (Carry-over effects). In Section 5.3, we consider the extension where $Y_{i,t}(\cdots, d_{t-1}, d_t)$ also depends on the past treatment assignments $d_{t-1}$, allowing for carry-over effects in time. We consider both stationary and time-variant decisions, and discuss estimation in these scenarios, at the expense of more data-intensive experimentation for detecting welfare-maximizing policies.

Estimation of and inference on welfare-maximizing decisions rely on identifying and estimating the marginal effects of the treatment. For expositional convenience, we implicitly assume differentiability and defer formal assumptions to Section 3. We discuss definitions of marginal effects in the following lines.
Definition 2.4 (Marginal effects). The marginal effect of the treatment is defined as follows:

$$V(\beta) = \frac{\partial W(\beta)}{\partial \beta}.$$

Under the above regularity condition, the marginal effect takes an intuitive form. Define

$$\Delta(x, \beta) = m(1, x, \beta) - m(0, x, \beta),$$

the average direct effect, averaged over the spillovers, for a given level of covariate $x$. Then marginal effects are defined as

$$\int \Big[\underbrace{e(x; \beta)\frac{\partial m(1, x, \beta)}{\partial \beta} + (1 - e(x; \beta))\frac{\partial m(0, x, \beta)}{\partial \beta}}_{(S)} + \underbrace{\frac{\partial e(x; \beta)}{\partial \beta}\Delta(x, \beta)}_{(D)}\Big] f_X(x)\, dx - \int c(x)\frac{\partial e(x, \beta)}{\partial \beta}\check f_X(x)\, dx. \qquad (9)$$

The above expression shows that the effect depends on (a) the direct effect of changing $\beta$, captured by the component (D); and (b) the indirect effect of changing $\beta$ due to marginal spillover effects, captured by the component (S).

Example 2.1 Cont'd
Consider the model in Equation (3), with an adjacency matrix such that spillovers only occur within first-degree neighbors. The direct effect of the treatment denotes the effect of informing individual $i$ on her insurance take-up; this effect equals $\varphi_1$. The marginal spillover effect denotes the effect of a small change in the probability that other individuals (including $i$'s friends) are invited to the information session. In our example, the marginal spillover effect equals

$$\frac{\partial m(d, \beta)}{\partial \beta} = \varphi_2 - \varphi_3\kappa - 2\beta\varphi_3(1 - \kappa), \qquad \kappa = \lim_{\check N \to \infty}\frac{1}{\check N}\sum_{i=1}^{\check N}\frac{1}{|N_i|},$$

where $\kappa$ denotes the asymptotic limit of the average inverse degree. The optimal policy sets the marginal effect equal to zero. As a result, we obtain

$$\frac{\partial W(\beta)}{\partial \beta} = \varphi_1 + \varphi_2 - \varphi_3\kappa - 2\beta\varphi_3(1 - \kappa) = 0 \ \Rightarrow\ \beta^* = \frac{\varphi_1 + \varphi_2 - \varphi_3\kappa}{2\varphi_3(1 - \kappa)}.$$

Intuitively, more individuals should be treated if either (i) the direct effect is larger ($\varphi_1 \uparrow$), or (ii) the spillover effect is larger ($\varphi_2 \uparrow$). Observe that the marginal effect can be used for (a) testing whether baseline interventions are optimal, i.e., testing whether marginal effects equal zero, and (b) estimating welfare-maximizing policies.

[Footnote] The identity below follows from the dominated convergence theorem under Assumption 3. See Section 3 for details.
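The closed form for $\beta^*$ above is easy to verify numerically. The sketch below maximizes $W(\beta) = \beta(\varphi_1 + \varphi_2) - (\beta\kappa + \beta^2(1-\kappa))\varphi_3$ (Example 2.1 with zero treatment costs) over a fine grid and compares the grid maximizer with the closed form; all parameter values are illustrative.

```python
def welfare(beta, phi1, phi2, phi3, kappa):
    """W(beta) for Example 2.1 with zero treatment costs: direct effect
    beta * phi1, linear spillover beta * phi2, concave spillover
    -(beta * kappa + beta**2 * (1 - kappa)) * phi3."""
    return beta * (phi1 + phi2) - (beta * kappa + beta ** 2 * (1 - kappa)) * phi3

def beta_star(phi1, phi2, phi3, kappa):
    """Closed form obtained by setting dW/dbeta = 0."""
    return (phi1 + phi2 - phi3 * kappa) / (2 * phi3 * (1 - kappa))

phi1, phi2, phi3, kappa = 0.2, 0.5, 1.0, 0.25   # illustrative values
grid = [i / 10000 for i in range(10001)]
b_grid = max(grid, key=lambda b: welfare(b, phi1, phi2, phi3, kappa))
```

With these values $\beta^* = (0.2 + 0.5 - 0.25)/1.5 = 0.3$, and the grid search returns the same point up to grid resolution.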
Example 2.2 Cont’d
Consider the model in Equation (4). Then the objective function reads as follows:

$$W(\beta) = \check\kappa^\top \beta - \varphi_2\, \beta^\top \check M \beta,$$

for a vector $\check\kappa$ and a matrix $\check M$ depending on the asymptotic limits of the (weighted) within-cluster expectations, assumed to converge to the same limits across different clusters.¹⁰ The function has decreasing marginal effects whenever spillovers have a positive effect ($\varphi_2 > 0$).

Before discussing the sequential experiment for estimating $\beta^*$, we ask whether the baseline policy is welfare-maximizing. Namely, this section answers the following question:

"given a baseline policy $e(\cdot; \iota)$, $\iota \in \mathcal B$, is $\iota = \beta^*$, i.e., does it maximize welfare?" (11)

The question is equivalent to testing the hypothesis

$$W(\iota) \ge W(\beta), \quad \text{for all } \beta \in \mathcal B. \qquad (12)$$

Observe that we do not compare $\iota$ to a specific alternative, but instead ask whether $\iota$ outperforms all other policies. The above equation represents a natural null hypothesis whenever its rejection motivates possibly expensive (because of either its accounting or opportunity cost) larger-scale experimentation.

¹⁰To see why the claim holds, observe that the objective function reads as follows:

$$W(\beta) = \lim_{\check N \to \infty}\frac{1}{\check N}\sum_{i=1}^{\check N}\Big\{\big[E[\mathrm{floor}_i]\beta_1 + E[\mathrm{roof}_i]\beta_2 + E[\mathrm{educ}_i]\beta_3\big]\varphi_1 + \frac{\sum_{j \ne i} A^{k(i)}_{i,j}\big[E[\mathrm{floor}_j]\beta_1 + E[\mathrm{roof}_j]\beta_2 + E[\mathrm{educ}_j]\beta_3\big]\varphi_2}{\sum_{j \ne i} A^{k(i)}_{i,j}}\Big\}$$
$$- \lim_{\check N \to \infty}\frac{1}{\check N}\sum_{i=1}^{\check N}\Big\{\frac{\sum_{j \ne i} A^{k(i)}_{i,j}\big[E[\mathrm{floor}_j]\beta_1 + E[\mathrm{roof}_j]\beta_2 + E[\mathrm{educ}_j]\beta_3\big]\varphi_2}{\sum_{j \ne i} A^{k(i)}_{i,j}} \times \big[E[\mathrm{floor}_i]\beta_1 + E[\mathrm{roof}_i]\beta_2 + E[\mathrm{educ}_i]\beta_3\big]\Big\}. \qquad (10)$$

Assuming that the weighted within-cluster expectations converge to the same limit across different clusters leads to the above expression for $W(\beta)$. Since $A_{i,j} \in \{0,1\}$ and the covariates are either zero or one, the marginal effect has a negative derivative whenever $\varphi_2 > 0$.
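The hypothesis in (12) can be probed through the welfare derivative at $\iota$, which is the route the testable implication below takes. As an illustration, under the Example 2.1 welfare function (illustrative parameter values, zero treatment costs), a finite-difference derivative vanishes at the welfare-maximizing baseline and is nonzero at a suboptimal one.

```python
def welfare(beta, phi1=0.2, phi2=0.5, phi3=1.0, kappa=0.25):
    # W(beta) for Example 2.1, zero treatment costs (illustrative values)
    return beta * (phi1 + phi2) - (beta * kappa + beta ** 2 * (1 - kappa)) * phi3

def marginal_effect(iota, h=1e-6):
    """Central finite-difference approximation of V(iota) = dW/dbeta at iota."""
    return (welfare(iota + h) - welfare(iota - h)) / (2 * h)

# beta* = (phi1 + phi2 - phi3 * kappa) / (2 * phi3 * (1 - kappa)) = 0.30 here
v_at_optimum = marginal_effect(0.30)   # ~0: consistent with iota = beta*
v_elsewhere = marginal_effect(0.45)    # nonzero: the baseline is improvable
```

In practice $V(\iota)$ is of course not computable from the model and must be estimated from the pilot experiment, which is the subject of the next sections.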
The following testable implication is considered.

Testable implication
Let $\iota$ be an interior point of $\mathcal B$, and let $W(\beta)$ be continuously differentiable. Then

$$V^{(j)}(\iota) = 0 \ \ \forall j \in \{1, \cdots, p\} \quad \text{if} \quad W(\iota) \ge W(\beta) \ \text{for all } \beta \in \mathcal B.$$

The above implication follows from standard properties of continuously differentiable functions, and it allows us to perform the test without comparing $\iota$ to all possible alternatives. Instead, we can test the following hypothesis:

$$H_0: V^{(j)}(\iota) = 0, \quad j \in \{1, \cdots, \tilde p\}, \qquad (13)$$

where we test $1 \le \tilde p \le p$ arbitrarily many coordinates of the vector $V(\beta)$. Observe that the implication does not require concavity; it relies solely on differentiability of the objective function and on $\iota$ being an interior point. We formalize our intuition in the following lines, where we discuss estimation and inference on marginal effects. Testing marginal effects in the context of experimental design has not been discussed in previous literature. We assume possibly finitely many clusters $K \ge p$, and two experimentation periods only.

Organization
We organize this section as follows: we start by introducing local two-stage experimentation; we then introduce the estimators constructed from the randomization procedure; we discuss the full algorithm for the design of the pilot study; finally, we discuss inference on marginal effects using the observations from the pilot study.
In this section, we discuss the intuition and motivation behind our procedure for testing policy optimality at the parameter value $\iota$. We start with some preliminary notation.

Preliminaries. Consider two clusters, indexed by $\{k, k+1\}$, and two periods $\{1, t\}$, with $k$ being an odd number (e.g., $k = 1$). The key idea for non-parametrically estimating marginal effects consists of inducing local deviations in the parameter at the cluster level, and alternating deviations over pairs of clusters observed over two consecutive periods. For expositional convenience, here we discuss the problem of estimating one single entry $V^{(j)}(\iota)$, for a given parameter $\iota$. In this section, the parameter $\iota$ is assumed to be exogenous. We define the vector

$$e_j = \big[0, \cdots, 0, 1, 0, \cdots, 0\big], \qquad e_j \in \{0,1\}^p, \quad e^{(j)}_j = 1,$$

with $e_j$ equal to zero in all entries except entry $j$. Define $(-j)$ as all the indexes of a vector except index $(j)$.

Local experimentation
For a given set of parameters $\beta_t$, the key idea for estimating marginal effects consists of assigning treatments independently across units as follows:

$$D_{i,1} \mid X_i = x, \beta_{k(i),1} \sim \mathrm{Bern}\big(e(x; \beta_{k(i),1})\big),$$
$$D_{i,t} \mid X_i = x, \beta_{k(i),t} \sim \begin{cases} \mathrm{Bern}\big(e(x; \beta_{k(i),t} + \eta_n e_j)\big) & \text{if } k(i) = k, \\ \mathrm{Bern}\big(e(x; \beta_{k(i),t} - \eta_n e_j)\big) & \text{if } k(i) = k + 1, \end{cases} \qquad n^{-1/2} < \eta_n < n^{-1/4}, \ t > 1. \qquad (14)$$

The parameter $\eta_n$ captures small deviations from the target parameter. Intuitively, in the first period, each cluster's treatment assignment depends on the parameter $\beta_{k(i),1}$. In the second period, instead, we induce a small deviation in the parameter $\beta_{k(i),t}$, with opposite signs within the pair. The two-period randomization aims to control for cluster-specific fixed effects; the between-cluster randomization instead aims to control for time-specific fixed effects. A crucial aspect for identification is that the deviations $\eta_n$ are deterministically assigned with opposite signs within each pair. Finally, recall that treatments are assigned to all individuals in the population.

Example 2.1 Cont'd
Let $e(x; \iota) = \iota \in (0,1)$, with $\iota = 40\%$. At time $t = 0$, researchers invite to information sessions each individual in village $k = 1$ and village $k = 2$ with equal probability $40\%$. At time $t = 1$, researchers treat individuals in village $k = 1$ with the lower probability $40\% - \eta_n$, and individuals in village $k = 2$ with the higher probability $40\% + \eta_n$. This randomization is illustrated in Figures 3 and 4.

Figure 3: Example of two-stage local randomization with time- and cluster-specific fixed effects. Each node is a different individual, and squares denote clusters. Units are connected within each cluster. Pink nodes denote control units, and gray nodes denote treated units. In the first period, units are assigned to treatment with the same probability in clusters $k \in \{1, 2\}$. In the second period, the probability of being treated is slightly larger in cluster $k = 1$ and smaller in cluster $k = 2$.

Figure 4: Example of two-stage local randomization with time- and cluster-specific fixed effects. In the first period, a draw from the blue dot for each cluster is performed. In the second period, in the first cluster we assign the policy colored in green, and in the second cluster the one colored in brown. We test for policy optimality by testing whether the estimated derivative equals zero. The sequential randomization procedure repeats the process sequentially, using a circular estimation procedure for the marginal effect to guarantee unconfounded experimentation (see Section 4.1).

Next, we discuss the estimators of interest. We estimate separately the direct effect of the treatment and the marginal spillover effect of the treatment. Separate estimation of these two effects has two motivations: (i) it exploits knowledge of the propensity score function $e(x; \beta)$; (ii) it permits identification of marginal effects also when fixed effects are not separable in time and cluster identity, but the spillover effects on the treated are zero, as discussed in Section 5. The proposed estimator can be interpreted as a difference-in-differences estimator, where we take differences between outcomes once reweighted by the marginal probability of treatment and the inverse probability of treatment. Figure 5 provides the basic intuition behind the proposed estimator in the presence of time- and cluster-specific fixed effects.

It will be convenient to define

$$v_h = \begin{cases} -1 & \text{if } h \text{ is odd;} \\ 1 & \text{otherwise,} \end{cases} \qquad e_{i,j,t}(\beta) = e\big(X_i, \beta + \eta_n \times v_{k(i)} e_j 1\{t > 1\}\big), \qquad (15)$$

respectively the sign indicator corresponding to the cluster identity $v_h$ and the assigned propensity score of individual $i$ at time $t$ for a given target parameter $\beta$. We can now discuss estimation of the marginal effects.

Estimation of direct effects
We estimate the direct effects using a Horvitz-Thompson estimator (Horvitz and Thompson, 1952), reweighted by the marginal effect on the propensity score. Namely, we define

$$\widehat\Delta^{(j)}_{k,t}(\beta) = \frac{1}{2n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}} \frac{\partial e(X_i; \beta)}{\partial \beta^{(j)}}\Big[\frac{Y_{i,t} D_{i,t}}{e_{i,j,t}(\beta)} - \frac{Y_{i,t}(1 - D_{i,t})}{1 - e_{i,j,t}(\beta)}\Big]. \qquad (16)$$

Figure 5: The intuition behind the estimator with additive and separable time- and cluster-specific fixed effects. We consider two clusters $k \in \{1, 2\}$, with $(\tau_1, \tau_2)$ denoting respectively the cluster-specific fixed effects of the first and second cluster. The brown cross corresponds to the second cluster's welfare value, and the green cross to the one in the first cluster. In the first period, a one-period experiment in each cluster is performed. The corresponding welfare depends on the cluster-specific fixed effect ($\tau_k$) and the time-specific fixed effect ($\alpha_t$). In the second period, a positive (negative) small deviation is applied to the policy $\beta$ in the first (second) cluster. For the first (second) cluster, the difference within the same period over the two clusters equals the derivative (respectively, minus the derivative) $\frac{\partial W(\beta)}{\partial \beta}$ multiplied by the deviation parameter $\eta_n$, plus the difference of the cluster-specific fixed effects $\tau_1 - \tau_2$. The difference of the difference between the two clusters equals approximately two times the derivative $\frac{\partial W(\beta)}{\partial \beta}$ times the deviation parameter $\eta_n$.

The above expression estimates the average effect of treating an individual sampled from cluster $k$ and cluster $k+1$, once we reweight the expression by the marginal effect on the treatment assignment. Observe that each individual's outcome is weighted by the inverse probability of the assigned treatment.
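A minimal sketch of the Horvitz-Thompson direct-effect estimator in Equation (16), for the simple covariate-free propensity $e(x; \beta) = \beta$ (so $\partial e/\partial \beta^{(j)} = 1$ and $e_{i,j,t}(\beta) = \beta + \eta_n v_{k(i)}$); the inputs are illustrative, not from the paper.

```python
def direct_effect_hat(y, d, dev_sign, beta, eta):
    """Equation (16) with e(x; beta) = beta, so de/dbeta = 1 and
    e_{i,j,t}(beta) = beta + eta * v_{k(i)}.
    y, d, dev_sign: outcomes, treatments, and deviation signs v_{k(i)}
    for the 2n units of the cluster pair at period t."""
    total = 0.0
    for yi, di, vi in zip(y, d, dev_sign):
        e_dev = beta + eta * vi
        total += yi * di / e_dev - yi * (1 - di) / (1 - e_dev)
    return total / len(y)   # 1/(2n) times the sum over the 2n units
```

For instance, with all outcomes and treatments equal to one, $\beta = 0.5$, and $\eta_n = 0.1$, the estimator averages $1/0.4$ and $1/0.6$ across the two clusters.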
However, the derivative $\frac{\partial e(X_i; \beta)}{\partial \beta^{(j)}}$ is evaluated at the target parameter $\beta$, before introducing the perturbation.

Estimation of marginal spillover effects
Next, we discuss estimation of the marginal spillover effects, i.e., the component defined as (S) in Equation (9), averaged over the distribution of covariates. The estimators on the treated and control units, respectively, take the following form:

$$\hat S^{(j)}_{k,t}(1, \beta) = \frac{1}{2n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}}\Big[\frac{v_{k(i)}\, e(X_i; \beta)}{\eta_n} \times \frac{Y_{i,t} D_{i,t}}{e_{i,j,t}(\beta)}\Big] - \frac{1}{2n}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}}\Big[\frac{v_{k(i)}}{\eta_n} \times Y_{i,1} D_{i,1}\Big],$$

$$\hat S^{(j)}_{k,t}(0, \beta) = \frac{1}{2n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}}\Big[\frac{v_{k(i)}\,(1 - e(X_i; \beta))}{\eta_n} \times \frac{Y_{i,t}(1 - D_{i,t})}{1 - e_{i,j,t}(\beta)}\Big] - \frac{1}{2n}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}}\Big[\frac{v_{k(i)}}{\eta_n} \times Y_{i,1}(1 - D_{i,1})\Big].$$

$\hat S^{(j)}_{k,t}(1, \beta)$ (and similarly $\hat S^{(j)}_{k,t}(0, \beta)$) depends on several components. First, (i) it depends on the weighted outcomes of treated individuals in each cluster of the pair. Second, (ii) it reweights observations by the propensity score evaluated at the coefficient $\beta$. Finally, (iii) it takes the difference of differences (i.e., it weights observations by $v_k$) of the weighted outcomes between the two clusters and between the two periods. The overall expression is then divided by the deviation parameter $\eta_n$.

Marginal effect estimator
The final estimator of the marginal effect defined in Equation (9) is the sum of the direct and marginal spillover effects, taking the following form:

$$\widehat Z^{(j)}_{k,t}(\beta) = \hat S^{(j)}_{k,t}(1, \beta) + \hat S^{(j)}_{k,t}(0, \beta) + \widehat\Delta^{(j)}_{k,t}(\beta) - \frac{1}{2n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}} c(X_i)\frac{\partial e(X_i, \beta)}{\partial \beta^{(j)}}, \qquad (17)$$

where the last component captures the average marginal cost. We now discuss the theoretical properties of the estimator.

Assumption 3 (Regularity 1). Let the following conditions hold.

(A) $||m(\cdot)||_\infty < \infty$, and $m$ is twice continuously differentiable in $\beta$ with uniformly bounded derivatives;

(B) $\varepsilon_{i,t}$ is a sub-Gaussian random variable with parameter $\sigma^2 < \infty$, and $m_{i,t}(\cdot)$ is uniformly bounded for all $(i, t)$;

(C) $\beta \mapsto e(X; \beta)$ is twice continuously differentiable in $\beta$, with uniformly bounded first and second order derivatives almost surely;

(D) $\mathcal X$ is a compact space.

Assumption 3(A) is a regularity assumption, which imposes a bounded conditional mean with bounded derivatives. (B) holds whenever, for instance, $\varepsilon_{i,t}$ is uniformly bounded. (C) assumes bounded derivatives of the propensity score, which holds for common functional forms such as logistic or probit assignments whenever covariates have compact support. We now introduce the first theorem.

Theorem 3.1.
Let Assumptions 1, 2, and 3 hold, and consider a randomization as in Equation (14) with an exogenous parameter $\iota$. Then

$$\Big|E\big[\widehat Z^{(j)}_{k,2}(\iota)\big] - V^{(j)}(\iota)\Big| = O\big(J_n/\eta_n + \eta_n\big).$$

The proof is contained in the Appendix. The above theorem shows that the estimator's expectation converges to the target estimand for a fixed, exogenous coefficient.
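Assembling the pieces of Equation (17), the sketch below computes the pair-level marginal-effect estimator for the covariate-free case $e(x; \beta) = \beta$ with zero costs; the inputs are illustrative. A useful sanity check: with outcomes constant across units and periods, welfare does not vary with $\beta$, and indeed the direct and spillover pieces cancel, so the estimated marginal effect is zero.

```python
def z_hat(y_t, d_t, y_1, d_1, dev_sign, beta, eta):
    """Pair-level marginal-effect estimator of Equation (17) for the
    covariate-free case e(x; beta) = beta (so de/dbeta = 1) and zero costs.
    Inputs are aligned lists over the 2n units of a cluster pair:
    period-t and period-1 outcomes/treatments, plus deviation signs v_{k(i)}."""
    m = len(y_t)                                  # 2n units in the pair
    delta = s1 = s0 = 0.0
    for yt, dt, v in zip(y_t, d_t, dev_sign):
        e_dev = beta + eta * v                    # e_{i,j,t}(beta)
        delta += yt * dt / e_dev - yt * (1 - dt) / (1 - e_dev)
        s1 += v * beta / eta * yt * dt / e_dev                    # treated
        s0 += v * (1 - beta) / eta * yt * (1 - dt) / (1 - e_dev)  # control
    for y1, d1, v in zip(y_1, d_1, dev_sign):     # period-1 corrections
        s1 -= v / eta * y1 * d1
        s0 -= v / eta * y1 * (1 - d1)
    return (delta + s1 + s0) / m
```

Note the $1/\eta_n$ scaling of the spillover terms: small deviations make each summand large, which is why the bias-variance trade-off in the choice of $\eta_n$ appears in Theorem 3.1.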
Remark 4 (Non-separable time- and cluster-specific fixed effects). Consider the model in Equation (6), with non-separable time- and cluster-specific fixed effects. Suppose, however, that spillovers only occur on control units and not on the treated. Then identification of the marginal effect can be performed using a single experimentation period and two clusters, each exposed to deviations with opposite signs. Identification is described in Figure 6. See Section 5 for a formal discussion.
Remark 5 (Pairing clusters). In the presence of more than two clusters, we estimate marginal effects by first pairing clusters and then estimating the effects within each pair. In the absence of pairing, the resulting first-order bias would not be equal to zero; instead, after averaging across all clusters, it would only be of the undesirable order $1/(\sqrt K \eta_n)$. This follows from the fact that, in the absence of pairing, $v_k$ would be a Rademacher random variable whose average concentrates around zero only at rate $1/\sqrt K$. The first-order bias occurring from estimation within each cluster would cancel out only as the average of the $v_k$ converges to zero, which occurs at rate $1/\sqrt K$, rescaled by the denominator $\eta_n$. This is an additional and important difference from saturation experiments, where probabilities of treatment are randomly allocated across clusters.

Figure 6: Illustration of the intuition behind the identification of marginal spillover effects with non-separable time- and cluster-specific fixed effects (Equation 6), and spillovers only on the control units. The x-axis corresponds to treated and control units. Two clusters $k \in \{1, 2\}$ are considered. The difference over control units between the two clusters (the line between the brown and green crosses) corresponds to the marginal spillover effect on the controls times the deviation $\eta_n$, plus the between-cluster difference of the time- and cluster-specific fixed effects $\alpha_{k,t}$. The difference between the treated units instead corresponds to the difference between the non-separable fixed effects, assuming that spillovers only occur on the control units (and not on the treated). The difference of the difference equals the marginal spillover effect on the controls, times $\eta_n$.

Remark 6 (Sequential randomization). In the presence of sequential randomization with $t >$
1, the choice of the parameter may depend on past information. In this case, the exogeneity condition on the parameter does not necessarily hold. We propose estimators that address this issue in Section 4.

Throughout the rest of our discussion, it will be convenient to refer to $\widehat Z$ as an average of random variables. It can be easily shown that the estimator in Equation (17) reads as

$$\widehat Z^{(j)}_{k,t}(\beta) = \frac{1}{n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}} W^{(j)}_{i,t}(\beta) - \frac{1}{n}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}} \frac{v_{k(i)}}{2\eta_n} Y_{i,1}, \qquad (18)$$

where

$$W^{(j)}_{i,t}(\beta) = \frac{1}{2}\Big[\frac{Y_{i,t} D_{i,t}}{e_{i,j,t}(\beta)} - \frac{Y_{i,t}(1 - D_{i,t})}{1 - e_{i,j,t}(\beta)} - c(X_i)\Big]\frac{\partial e(X_i; \beta)}{\partial \beta^{(j)}} + \frac{v_{k(i)}}{2\eta_n}\Big[\frac{Y_{i,t} D_{i,t}}{e_{i,j,t}(\beta)} e(X_i; \beta) + \frac{Y_{i,t}(1 - D_{i,t})}{1 - e_{i,j,t}(\beta)}\big(1 - e(X_i; \beta)\big)\Big]. \qquad (19)$$

Remark 7 (Randomization in the absence of time-specific fixed effects). Suppose that $\alpha_t = 0$ (e.g., sequential randomization occurs over a short time period). Then we construct the estimator of the marginal effect by pairing each cluster with itself over two consecutive periods $\{t-1, t\}$. This approach requires half the number of clusters for its implementation, at the expense of increasing the overall number of randomization periods.

We now discuss the pilot study, consisting of two periods of experimentation $t \in \{1, 2\}$, for inference on marginal effects.

Pairing clusters
First, we pair clusters. Without loss of generality, we assume that pairs consist of two consecutive clusters $(k, k+1)$ for each odd $k$. We assign $v_k$ as in Equation (15).

Assigning coordinates to different pairs
We assign every element of the set of odd cluster indexes $\{1, 3, \cdots, K-1\}$ to a set $\mathcal K_j \subseteq \{1, 3, \cdots, K-1\}$, for each coordinate $j \in \{1, \cdots, \tilde p\}$, with $|\mathcal K_j| = \tilde K \ge 2$. The set $\mathcal K_j$ denotes the set of clusters used to test coordinate $j$ of the marginal effect.

Small deviations
The experimenter assigns treatments according to the allocation rule in Definition 2.1. Each pair estimates a single coordinate $(j)$: for all $k \in \big\{\mathcal K_j \cup \{h + 1, h \in \mathcal K_j\}\big\}$, i.e., for all clusters assigned to test coordinate $j$, we randomize treatments as in Equation (14) with $\beta_{k,1} = \beta_{k,2} = \iota$ for all $k$.

Estimation of marginal effects
We estimate marginal effects similarly to what was discussed in Equation (17). For any pair of clusters $(k, k+1)$, $k \in \mathcal K_j$, the estimator of the marginal effect at $\iota$ reads as $\widehat Z^{(j)}_{k,2}(\iota)$. Define, for each pair of clusters $(k, k+1)$, $k \in \mathcal K_j$,

$$\widehat Z_k = \widehat Z^{(j)}_{k,2}(\iota).$$

We observe that in many circumstances we may be interested in testing a specific coordinate of the vector, in which case $\mathcal K_j = \{1, 3, \cdots, K-1\}$.

Example 2.1 Cont'd. Consider conducting a pilot study to test whether treating individuals with probability $\iota = 40\%$ is welfare-optimal. Researchers run the experiment on at least two clusters (say, six clusters). In the first period $t = 1$, in all six clusters, researchers select individuals for treatment with probability $40\%$. In the second period, researchers first pair the clusters. In each pair, they assign treatments with $39\%$ probability in the first cluster and $41\%$ probability in the second cluster. They then estimate the marginal effect within each pair.

Example 2.2 Cont'd
Consider testing the first coordinate ($\beta_1$) and the second coordinate ($\beta_2$) of the policy function in Equation (2), using eight clusters. The baseline parameter vector $\iota$ is the same in every cluster. The marginal effect on the first coordinate $\beta_1$ is estimated using the first and third pairs of clusters, and the marginal effect on the second coordinate $\beta_2$ is estimated using the second and fourth pairs of clusters. See Figure 7 for a graphical illustration.

In the following lines, we discuss the proposed estimator's asymptotic properties, which allow us to test Equation (13). Before discussing the next theorem, we introduce regularity conditions. Observe first that, under Assumption 3, the summands $W^{(j)}_{i,t}(\beta)$ in Equation (19) and $\frac{v_{k(i)}}{2\eta_n} Y_{i,1}$ in Equation (18) are of order $1/\eta_n$. In the following assumption, we impose that the within-cluster variance is bounded away from zero after appropriately rescaling.
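The pilot's bookkeeping (pair the clusters, assign a coordinate to each pair, apply opposite deviations within the pair, as in Example 2.2) can be sketched as below. The alternating assignment of coordinates to pairs matches the example's pattern (first and third pairs test coordinate 1, second and fourth test coordinate 2), but it is only one simple choice; the function name and inputs are illustrative.

```python
def pilot_design(K, p_tilde, iota, eta):
    """Period-2 parameter vector for each cluster 1..K: baseline iota plus a
    -/+ eta deviation on the coordinate tested by the cluster's pair.
    Pairs are (1,2), (3,4), ...; pair m tests coordinate m mod p_tilde
    (alternating), one simple choice not prescribed by the text."""
    assert K % 2 == 0
    design = {}
    for pair, k in enumerate(range(1, K, 2)):   # odd cluster indexes
        j = pair % p_tilde                      # coordinate tested by this pair
        minus, plus = list(iota), list(iota)
        minus[j] -= eta                         # odd cluster: v_k = -1
        plus[j] += eta                          # even cluster: v_k = +1
        design[k], design[k + 1] = minus, plus
    return design
```

With $K = 8$ and $\tilde p = 2$, pairs (1,2) and (5,6) receive deviations on the first coordinate, and pairs (3,4) and (7,8) on the second, matching Example 2.2.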
Figure 7: Example 2.2 continued.

Assumption 4 (Regularity 2). Assume that for any exogenous vector $\beta \in \mathcal B$, under a CBAR, for each $k \in \{1, \cdots, K\}$, for $t = 1$,

$$\mathrm{Var}\Big(\frac{1}{n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}} W^{(j)}_{i,t}(\beta) - \frac{1}{n}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}}\frac{v_{k(i)}}{2\eta_n} Y_{i,1}\Big) = \bar C_k \rho_n,$$

where $\rho_n \ge 1/(n\eta_n^2)$, for a constant $\bar C_k > 0$, after appropriately rescaling by $\eta_n$. (See Lemma B.2.)

Under standard moment assumptions, observe that Assumption 4 is satisfied as long as

$$\mathrm{Var}\Big(\frac{1}{n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}} W^{(j)}_{i,t}(\beta) - \frac{1}{n}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}}\frac{v_{k(i)}}{2\eta_n} Y_{i,1}\Big) \ge \frac{1}{n^2}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}}\mathrm{Var}\big(W^{(j)}_{i,t}(\beta)\big) + \frac{1}{n^2}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}}\mathrm{Var}\Big(\frac{v_{k(i)}}{2\eta_n} Y_{i,1}\Big).$$

Define

$$\widetilde Z_n = \big[\widehat Z_1, \widehat Z_3, \cdots, \widehat Z_{K-1}\big],$$

the vector of estimators of the marginal effect for each pair of clusters.

Theorem 3.2.
Let Assumptions 1, 2, 3, 4 hold. Then
$$\Sigma_n^{-1/2}\big(\widetilde Z_n - \mu\big) + B_n \to_d \mathcal{N}(0, I), \quad \text{where } B_n = O\Big(\eta_n^2 \times \sqrt{n} + J_n \times \sqrt{1/(\eta_n^2 \rho_n)}\Big), \quad \Sigma_n = \mathrm{diag}\Big(\mathrm{Var}(\widehat Z_1), \mathrm{Var}(\widehat Z_3), \cdots, \mathrm{Var}(\widehat Z_{K-1})\Big), \quad (20)$$
and for $k \in \mathcal{K}_j$, $\mu^{(k)} = V^{(j)}(\iota)$.

Theorem 3.2 showcases that the estimated gradient converges in distribution to a Gaussian distribution after appropriately rescaling by its variance. The asymptotic distribution is centered around the true marginal effect and a bias component $B_n$, which captures the discrepancy between the expectations across different clusters (i.e., clusters being drawn from different distributions). The theorem allows for $J_n = O(1/\sqrt{n})$ whenever $1/(\eta_n^2 n \rho_n) = o(1)$, i.e., whenever the variance of the estimator is of an order larger than $1/n$ after appropriate rescaling by $\eta_n$. This occurs in the presence of positive dependence, with an average degree growing with the sample size. In the presence of independent observations, the bias is vanishing if $J_n = o(1/\sqrt{n})$. Finally, the expression of the bias also shows that $\eta_n$ should be selected such that $\eta_n = o(n^{-1/4})$.

Given Theorem 3.2, we construct a scale-invariant test statistic without necessitating estimation of the (unknown) variance (Ibragimov and Müller, 2010). Define
$$P^{(j)}_n = \frac{1}{\tilde K}\sum_{k \in \mathcal{K}_j} \widehat Z^{(j)}_k,$$
the average marginal effect for coordinate $j$, estimated from those clusters used to estimate the effect of the $j$th coordinate. We construct
$$Q_{j,n} = \frac{\sqrt{\tilde K}\, P^{(j)}_n}{\sqrt{(\tilde K - 1)^{-1}\sum_{k \in \mathcal{K}_j}\big(\widehat Z^{(j)}_k - P^{(j)}_n\big)^2}}, \qquad T_n = \max_{j \in \{1, \cdots, \tilde p\}} |Q_{j,n}|, \quad (21)$$
where $T_n$ denotes the test statistic employed to test the null hypothesis in Equation (13). The choice of the $l$-infinity norm as above is often employed in statistics for testing global null hypotheses (Chernozhukov et al., 2014).
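The statistic in Equation (21) can be sketched in a few lines (a minimal stdlib-Python illustration; the pair-level estimates passed in are toy inputs, and we read the critical value as the $(1-h)$-th quantile; for $\tilde K = 2$ the $t$-distribution with one degree of freedom is standard Cauchy, so the quantile has a closed form, while general $\tilde K$ would use `scipy.stats.t.ppf`):

```python
import math

def tn_statistic(z_by_coord):
    """Scale-invariant statistic of Equation (21).

    z_by_coord: dict mapping a coordinate j to the list of pair-level
    estimates Z_k^{(j)} (one entry per pair of clusters in K_j).
    Returns (Q, T_n), with Q the per-coordinate t-statistics."""
    q = {}
    for j, z in z_by_coord.items():
        k_tilde = len(z)
        p = sum(z) / k_tilde                                  # P_n^{(j)}
        s2 = sum((zk - p) ** 2 for zk in z) / (k_tilde - 1)   # sample variance
        q[j] = math.sqrt(k_tilde) * p / math.sqrt(s2)
    return q, max(abs(v) for v in q.values())                 # T_n = max_j |Q_{j,n}|

def q_alpha_two_pairs(alpha, p_tilde):
    """Critical value q_alpha for K_tilde = 2: one degree of freedom,
    so the (1 - h)-th t quantile equals tan(pi * (1/2 - h)) (Cauchy)."""
    h = 1.0 - (1.0 - alpha) ** (1.0 / p_tilde)
    return math.tan(math.pi * (0.5 - h))
```

For instance, with estimates `{1: [2.0, 0.0, 2.0, 0.0]}` the statistic equals $\sqrt{3}$, and the rejection rule compares $T_n$ with `q_alpha_two_pairs(alpha, p_tilde)` when only two pairs are available.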
In our application, the $l$-infinity norm is motivated by its theoretical properties: the statistic $Q_{j,n}$ follows an unknown distribution as a result of possibly heteroskedastic variances of $\widehat Z_k$ across different clusters. However, an upper bound on the critical quantiles under unknown variances attains a simple expression for the proposed test statistic. From a conceptual standpoint, the proposed test statistic is particularly suited when a large deviation occurs over one dimension of the vector.

Theorem 3.3 (Nominal coverage). Let Assumptions 1, 2, 3, 4 hold. Let $\tilde K \ge 2$, $H_0$ be as defined in Equation (13), and $B_n = o(1)$. For any $\alpha \le 0.08$,
$$\lim_{n \to \infty} P\Big(T_n \le q_\alpha \,\Big|\, H_0\Big) \ge 1 - \alpha, \quad \text{where } q_\alpha = cv_{\tilde K - 1}\Big(1 - (1 - \alpha)^{1/\tilde p}\Big), \quad (22)$$
where $cv_{\tilde K - 1}(h)$ denotes the critical value of a t-test with level $h$ whose test statistic has $\tilde K - 1$ degrees of freedom.

Theorem 3.3 allows for inference on marginal effects, and ultimately for testing policy optimality, using few clusters and two consecutive experimentation periods. The derivation exploits properties of the t-statistic discussed in Ibragimov and Müller (2010, 2016), combined with Theorem 3.2 and properties of the proposed test statistic $T_n$ used to test the global null hypothesis $H_0$. To our knowledge, Theorem 3.3 is the first result that allows for testing optimality of treatment allocation rules under network interference.

In this section, we discuss the experimental design to estimate $\beta^*$ as defined in Equation (8) through sequential randomization. See also Chernozhukov et al. (2018) for a discussion of pivotal inference in the different context of synthetic controls.
Figure 8: Illustration of the dynamic experiment with $p = 2$. The experiment has $T$ periods in total ($t \in \{1, \cdots, T\}$), $\check T$ many waves ($w \in \{1, \cdots, \check T\}$), and $p$ many iterations ($j \in \{1, 2\}$). In total the experiment has at least $K \ge 2(\check T + 1)$ many clusters.

Preliminaries and time structure
First, researchers pair clusters as discussed in Section 3. Each pair consists of consecutive clusters $\{k, k+1\}$, with $k$ being odd. Over each period $t$ and cluster $k$, they draw at random $n$ units from each cluster, whose covariates and post-treatment outcomes are observed. The indexes of these units are collected in the set $S_{k,t}$. We consider $\check T$ experimentation waves and $K \ge 2(\check T + 1)$ clusters paired into $K/2$ pairs. Each wave $w \in \{1, \cdots, \check T\}$ has $j \in \{1, \cdots, p\}$ iterations over which a gradient-descent algorithm is implemented, with in total $T = \check T \times p + 1$ periods of randomization, where at period $t = 0$ treatments are randomized based on the baseline policy $\iota$. Each iteration $j$ is used to estimate a different coordinate of the vector of marginal effects. In Example 2.1, $p = 1$, and therefore $\check T = T - 1$, while in Example 2.2, $p = 4$, and therefore $\check T = (T - 1)/p = 2$.

Initialization
Each experimentation wave corresponds to a vector $\check\beta^w \in \mathcal{B}^K$, with $\check\beta^w_k$ corresponding to the target parameter for cluster $k$. The set of parameters is estimated over the previous iteration (i.e., parameters are data-dependent), with initialization
$$\check\beta^1 = (\iota, \cdots, \iota),$$
with $\iota$ chosen exogenously. In the first period $t = 0$ (before any experimentation wave $w$ is performed), treatments are randomized independently as
$$D_{i,0} \mid X_i = x \sim \mathrm{Bern}\big(e(x; \iota)\big), \quad \forall i.$$

Remark 8 (Experiment, conditional on rejection). Whenever the larger-scale experiment is conducted conditional on the rejection of the null hypothesis in Equation (12), the larger-scale experiment must be performed on a set of clusters different from the ones used to test the above null hypothesis.

4.1 Description of one experimentation wave
The algorithm consists of $\check T$ experimentation waves. We first introduce the procedure for a single experimentation wave $w$. An experimentation wave consists of $p$ iterations, each involving two experimentation periods (the current period and the baseline period $t = 0$), with in total $p$ periods of randomization per wave. An experimentation wave $w$ starts at time $t = (w - 1) \times p + 1$.

Randomization
The randomization procedure iterates over each dimension $j \in \{1, \cdots, p\}$ of the vector of parameters. Formally, the following loop is considered. For each $j \in \{1, \cdots, p\}$,
$$D_{i,t} \mid X_i = x, \check\beta^w_{k(i)} \sim \begin{cases} \mathrm{Bern}\big(e(x; \check\beta^w_{k(i)} + \eta_n e_j)\big) & \text{if } k(i) \text{ is odd} \\ \mathrm{Bern}\big(e(x; \check\beta^w_{k(i)} - \eta_n e_j)\big) & \text{if } k(i) \text{ is even,} \end{cases}$$
$$\check Z^{(j)}_{k,w} = \begin{cases} \dfrac{1}{n}\displaystyle\sum_{i \in S_{k,t} \cup S_{k+1,t}} W^{(j)}_{i,t}(\check\beta^w_k) - \dfrac{1}{n}\displaystyle\sum_{i \in S_{k,0} \cup S_{k+1,0}} \dfrac{v_{k(i)}}{2\eta_n} Y_{i,0} & \text{if } k \text{ is odd;} \\ \check Z^{(j)}_{k-1,w} & \text{otherwise,} \end{cases} \qquad t \leftarrow t + 1, \quad (23)$$
where $W_{i,t}$ is defined in Equation (19). The loop works as follows: for each coordinate $j$, we randomize treatments with parameters having a positive (negative) deviation in the first (second) cluster in each pair. We then compute the $j$th coordinate of the experimentation-wave specific marginal effect $\check Z^{(j)}_{k,w}$ in cluster $k$, corresponding to the target parameter $\check\beta^w_k$. We subtract from the estimator the difference of the cluster-specific fixed effects.

Circular cross-fitting
We are left to discuss the choice of $\check\beta^{w+1}$. To do so, we use a circular cross-fitting procedure, which estimates the gradient using the marginal effect obtained in the subsequent pair:
$$\check V^{(j)}_{k,w} = \begin{cases} \check Z^{(j)}_{k+2,w} & \text{if } k \le K - 2 \\ \check Z^{(j)}_{1,w} & \text{otherwise} \end{cases} \ \text{ if } k \text{ is odd}, \qquad \check V^{(j)}_{k,w} = \check V^{(j)}_{k-1,w} \ \text{ if } k \text{ is even.}$$
We update each policy using a gradient-descent step as follows:
$$\check\beta^{w+1}_k = \Pi_{[\underline B, \bar B - \eta_n]}\Big[\check\beta^w_k + \alpha_{k,w} \check V_{k,w}\Big].$$
Here, $\Pi_{[\underline B, \bar B - \eta_n]}$ denotes the projection operator onto the set $[\underline B, \bar B - \eta_n]^p$. Intuitively, for each policy we perform a gradient update, using the gradient estimated on the subsequent pair of policies.
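The circular updates above can be sketched as follows (a minimal stdlib-Python illustration; the pair-level gradient estimates, the bounds, and the step size are toy assumptions, not values from the paper):

```python
def circular_cross_fit(z, K):
    """Map pair-level gradient estimates Z_k (odd k) to cross-fitted
    gradients V_k: each odd cluster k uses the estimate from the
    subsequent pair (k + 2), wrapping around to pair 1; even clusters
    copy their odd partner."""
    v = {}
    for k in range(1, K + 1, 2):
        v[k] = z[k + 2] if k <= K - 2 else z[1]
        v[k + 1] = v[k]
    return v

def gradient_update(beta, v, alpha, lo, hi, eta):
    """Projected step beta_k <- Proj_[lo, hi - eta](beta_k + alpha * V_k)."""
    return {k: min(max(beta[k] + alpha * v[k], lo), hi - eta) for k in beta}
```

For instance, with $K = 6$ and estimates `{1: 0.1, 3: -0.2, 5: 0.3}`, cluster 1 is updated with the gradient from pair $(3, 4)$, cluster 3 with that from pair $(5, 6)$, and cluster 5 wraps around to pair $(1, 2)$.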
The complete algorithm performs $\check T$ experimentation waves as described in the previous subsection in a sequential manner. The algorithm returns the average coefficients in each pair,
$$\hat\beta^* = \frac{1}{K}\sum_{k=1}^K \check\beta^{\check T + 1}_k.$$
Dependence plays an important role in our setting, where some or all of the units in a cluster may participate in the experiment in several periods. We break dependence using a novel cross-fitting algorithm, consisting of “circular” updates of the policies using information from subsequent clusters, as shown in Figure 10. We use a local optimization procedure for policy updates, with the gradient being estimated nonparametrically. We devise an adaptive gradient-descent algorithm to trade off the error of the method and the estimation error of the gradient.
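To fix ideas, the full loop (perturb the policy by $\pm\eta_n$, estimate the gradient by a paired contrast, update with the normalized learning rate of Remark 9 below) can be sketched on a noiseless toy problem with welfare $W(\beta) = \beta(1 - \beta)$, whose exact marginal effect is $1 - 2\beta$ and optimum $\beta^* = 0.5$; the values of $\gamma$, $\eta_n$, $\iota$, and the thresholds are illustrative assumptions:

```python
import math

def marginal_effect(beta, eta):
    """Finite-difference estimate of dW/dbeta for W(b) = b * (1 - b),
    mimicking the paired-cluster contrast at beta +/- eta (noiseless toy)."""
    w = lambda b: b * (1.0 - b)
    return (w(beta + eta) - w(beta - eta)) / (2.0 * eta)

def run_waves(iota=0.4, waves=2, gamma=0.05, eta=0.01, v=0.01, eps=1e-4):
    beta = iota
    for w in range(1, waves + 1):
        z = marginal_effect(beta, eta)          # wave-specific gradient
        norm = abs(z)
        alpha = gamma / (math.sqrt(w) * norm) if norm > v else eps
        beta = min(max(beta + alpha * z, 0.0), 1.0 - eta)  # projected update
    return beta
```

Starting from $\iota = 0.4$, two waves give $0.45$ and then $0.45 + 0.05/\sqrt{2} \approx 0.4854$, moving toward $\beta^* = 0.5$ with steps shrinking at rate $1/\sqrt{w}$.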
Remark 9 (Learning rate, quasi-concavity and local strong concavity). We choose a learning rate that accommodates strictly quasi-concave functions, taking
$$\alpha_{k,w} = \begin{cases} \dfrac{\gamma}{\sqrt{w}\,\lVert \check V_{k,w} \rVert} & \text{if } \lVert \check V_{k,w} \rVert > \dfrac{v}{\sqrt{\check T}} \\ \epsilon_n & \text{otherwise,} \end{cases}$$
for a positive $\epsilon_n$, $\epsilon_n \to 0$, and a small constant $1 \ge v > 0$. The reader may refer to Lemma B.8 in the Appendix for further details. The choice of the learning rate allows for strict quasi-concavity through the gradient's norm rescaling (Hazan et al., 2015), while it controls the estimation error after rescaling by $1/\sqrt{w}$.

For example, in a one-dimensional setting, we have $\Pi_{a,b}(c) = c$ if $c \in [a, b]$, $\Pi_{a,b}(c) = a$ if $c \le a$, and $\Pi_{a,b}(c) = b$ if $c \ge b$. The algorithm performs full gradient updates instead of coordinate-wise gradient updates due to the dependence structure, since otherwise, for large $p$, the circular cross-fitting may not guarantee unconfoundedness.

Figure 9: Example of a sequential experiment with two waves $w \in \{1, 2\}$. Each square corresponds to a given cluster. Clusters are first paired. Over each wave, a local variation at the cluster level is induced (positive on those clusters with odd $k$ and negative otherwise). The gradient is estimated within each pair. The policy in the next wave $w = 2$ is updated based on the gradient of the next pair.

Example 2.1 Cont'd
In this example $p = 1$, since only the probability of treatment is the parameter of interest, with a learning rate inducing a gradient-norm rescaling, $\alpha_{k,w} \propto 1/(\lVert \check V_{k,w} \rVert \sqrt{w})$ (see Remark 9). Let $K = 6$, $\check T = 2$. Each wave consists of one period.

1. Initialization: $\check\beta^1 = (\iota, \cdots, \iota)$, with $\iota = 0.4$ and $\eta_n = 0.01$. Formally, we let $\epsilon_n \propto \sqrt{\gamma_n/(\eta_n^2 n)} + J_n/\eta_n + \eta_n$.

2. First wave $w = 1$: Individuals in the first, third, and fifth clusters are assigned to treatment with probability 41% and those in the remaining clusters with probability 39%.
Estimates: the wave-specific marginal effects $(\check Z_{1,1}, \check Z_{3,1}, \check Z_{5,1})$ are computed as in Equation (23).

Update:
Set $\check\beta^2 = (0.5, \cdots, 0.5)$.

3. Second wave $w = 2$: the first, third, and fifth clusters assign treatment with probability 51% and the remaining clusters with probability 49%.

Estimates: the wave-specific marginal effects $(\check Z_{1,2}, \check Z_{3,2}, \check Z_{5,2})$ are computed as in Equation (23).

Update:
Set $\check\beta^3_k = \Pi_{[\underline B, \bar B - \eta_n]}\big[\check\beta^2_k + \alpha_{k,2}\check V_{k,2}\big]$, where the second-wave step is rescaled by $1/\sqrt{2}$.

A graphical illustration is depicted in Figure 9.

Next, we discuss the theoretical properties of the algorithm. The following assumption is imposed on the number of clusters.
Assumption 5 (Number of clusters). Suppose that $K \ge 2(\check T + 1)$.

Assumption 5 imposes that the number of clusters exceeds twice the number of experimentation waves.

Lemma 4.1 (Unconfoundedness). Let Assumptions 1, 5 hold. Consider $\check\beta^w_k$ estimated through the circular cross-fitting. Then for any $k \in \{1, \cdots, K\}$, $t \in \{1, \cdots, T\}$,
$$\big(\check\beta^1_k, \cdots, \check\beta^{\check T}_k\big) \perp \Big\{Y_{i,t}(d), X_i, d \in \{0, 1\}^{\check N}\Big\}_{i : k(i) \in \{k, k+1\},\ t \le T}.$$

The proof is contained in the Appendix. The proof is a consequence of the fact that the coefficients are estimated using information from all clusters except clusters $\{k, k+1\}$. Lemma 4.1 guarantees that the experimentation is not confounded due to time dependence between unobservables. In the following lines, we motivate the gradient-descent method as a valid optimization procedure also under lack of concavity, only imposing that the function is quasi-concave.

Assumption 6 (Strict quasi-concavity and local strong concavity). Assume that the following conditions hold.

(A) For every $\beta, \beta' \in \mathcal{B}$ such that $W(\beta') - W(\beta) \ge 0$, $V(\beta)^\top(\beta' - \beta) \ge 0$.

(B) For every $\beta \in \mathcal{B}$, $\lVert V(\beta) \rVert \ge \mu \lVert \beta - \beta^* \rVert$, for a positive constant $\mu > 0$.

(C) The Hessian $\frac{\partial^2 W(\beta)}{\partial \beta \partial \beta^\top}\big|_{\beta = \beta^*}$ has negative eigenvalues bounded away from zero at $\beta^*$.

Condition (A) imposes quasi-concavity of the objective function. The condition is equivalent to assuming that any $\alpha$-sub-level set of $-W(\beta)$ is convex, matching common definitions of quasi-concavity (Boyd et al., 2004). Condition (B) assumes that the gradient only vanishes at the optimum, allowing for saddle points, but ruling out regions over which marginal effects remain constant at zero. A simple sufficient condition for (B) is decreasing marginal effects (see the next example). A similar notion of strict quasi-concavity can be found in Hazan et al. (2015). Condition (C) imposes that the function has a negative definite Hessian at $\beta^*$ only, but not necessarily globally.
Intuitively, (C) imposes strong concavity only at the optimum.

Example 2.1 Cont'd
Let Equation (3) hold and suppose that $\phi > 0$, i.e., the marginal effects of treating one additional neighbor are decreasing. Then the function is strongly concave in $\beta$.

The above example discusses the case in the absence of covariates. The reader may refer to Equation (4) for an example in the presence of covariates. We can now state the following theorem.

Theorem 4.2 (Guarantees under quasi-concavity). Let Assumptions 1, 2, 3, 5, 6 hold. Take a small $\xi > 0$, and let $n^{1/2 - \xi} \ge \bar C \sqrt{\log(n)\, p\, \gamma_n}\, T e^{B\sqrt{pT}} \log(KT)$, $J_n \le 1/\sqrt{n}$, $\eta_n = 1/n^{1/4 + \xi}$, for finite constants $\infty > B, \bar C > 0$. Let $T \ge \zeta$, for a finite constant $\zeta < \infty$. Then with probability at least $1 - 1/n$,
$$\lVert \beta^* - \hat\beta^* \rVert^2 \le \frac{p \bar C}{\check T}.$$
The proof is in the Appendix. Theorem 4.2 provides a small-sample upper bound on the out-of-sample regret of the algorithm. The upper bound only depends on $T$ (and not $n$), since $n$ is assumed to be sufficiently larger than $T$. The following corollary holds.

Corollary.
Let the conditions in Theorem 4.2 hold. Then with probability at least $1 - 1/n$,
$$\tau(\beta^*) - \tau(\hat\beta^*) \le \frac{p \bar C'}{\check T}$$
for a finite constant $\bar C' < \infty$.

The above corollary formalizes an “out-of-sample” regret bound that scales linearly with the dimension of the parameter space and inversely with the number of experimentation waves. Theorem 4.2 provides guarantees on the estimated policy and the resulting welfare. The above theorem guarantees that the estimated policy, once implemented in future periods, leads to the largest welfare up to an error factor that grows linearly in the dimension of the parameter space and vanishes linearly in the number of waves. However, researchers may wonder whether the procedure is “harmless” also for the in-sample units, i.e., whether the procedure has guarantees on the in-sample regret (Bubeck et al., 2012). We provide guarantees in the following theorem.
Theorem 4.3 (In-sample regret). Let the conditions in Theorem 4.2 hold. Then with probability at least $1 - 1/n$,
$$\max_{k \in \{1, \cdots, K\}} \frac{1}{\check T}\sum_{w=1}^{\check T} \Big[\tau(\beta^*) - \tau(\check\beta^w_k)\Big] \le \bar C\, \frac{p \log(\check T)}{\check T}$$
for a finite constant $\bar C < \infty$.

The proof is contained in the Appendix. Theorem 4.3 guarantees that the cumulative welfare in each cluster $k$, incurred by deploying the current policy $\check\beta^w_k$ at wave $w$, converges to the largest achievable welfare at rate $\log(\check T)/\check T$, also for those units participating in the experiment. Observe that, by a first-order Taylor expansion under Assumption 3, a direct conclusion is that the bound also holds for the policies $\check\beta^w_k \pm \eta_n$, up to an additional factor which scales to zero at rate $\eta_n$ (and is therefore negligible under the conditions imposed on $n$). This result guarantees that the proposed design is not harmful to experimental participants in each cluster.

In the following theorem, we discuss similar guarantees, imposing weaker conditions on the sample size, at the expense of assuming global strong concavity of the objective function (Boyd et al., 2004). In this case, the learning rate is chosen as $\alpha_w = \gamma/w$, without necessitating rescaling by the size of the gradient. We formalize our result in the following theorem.

Theorem 4.4 (Guarantees under strong concavity). Let Assumptions 1, 2, 3, 5 hold. Let $\alpha_{k,w} = \gamma/w$ for a small $\gamma > 0$. Take a small $\xi > 0$. Let $n^{1/2 - \xi} \ge \bar C \sqrt{p \log(n)\, \gamma_n}\, T B \log(KT)$, $J_n \le 1/\sqrt{n}$, $\eta_n = 1/n^{1/4 + \xi}$, for finite constants $B, \bar C > 0$. Assume that $W(\beta)$ is strongly concave in $\beta$. Then with probability at least $1 - 1/n$,
$$\lVert \beta^* - \hat\beta^* \rVert^2 \le \frac{p \bar C}{T}$$
for a finite constant $\bar C < \infty$.

We now contrast the result with past literature. Regret guarantees are often the object of interest in analyzing policy assignments (Kitagawa and Tetenov, 2018; Mbakop and Tabord-Meehan, 2018; Athey and Wager, 2020; Kasy and Sautmann, 2019; Bubeck et al., 2012; Viviano, 2019).
However, the above references either assume a lack of interference or consider partially observable network structures. In online optimization, the rate $1/T$ is common for stochastic gradient-descent methods under concavity (Bottou et al., 2018). In particular, using a local-optimization method, Wager and Xu (2019) derive regret guarantees of the same order in the different setting of market pricing, under mean-field asymptotics (i.e., $n \to \infty$), with units and samples over each wave being independent. Differently, our results provide small-sample guarantees, without imposing independence or modeling assumptions other than partial interference. This requires a different proof technique. The proof of the theorem (i) uses concentration arguments for locally dependent graphs (Janson, 2004) to derive an exponential rate of convergence, adjusted by the dependence component $\gamma_n$; (ii) uses the within-cluster and between-cluster variation for consistent estimation of the marginal effect, together with the matching design, to guarantee that there is no non-vanishing bias when estimating marginal spillover effects; (iii) exploits in-sample regret bounds for the adaptive gradient-descent method with norm rescaling; and (iv) uses a recursive argument to bound the cumulative error obtained through the estimation and circular cross-fitting, where the cumulative error depends on the sample size and the number of iterations. Finally, we observe that our results only require local strong concavity, as opposed to global strong concavity. This is possible by first showing that the estimator lies within a ball close to the optimum as $T$ exceeds a certain finite threshold, which depends on the eigenvalues of the Hessian at $\beta^*$, and then establishing convergence within a local neighborhood of $\beta^*$.

In this section we discuss three extensions.
The first extension discusses nonseparable time- and cluster-specific fixed effects as in Equation (6). For the sake of brevity, we only discuss the estimator of the marginal effect and its consistency in this section. Asymptotic inference and regret bounds follow similarly to previous sections. The key assumption for estimating marginal effects under nonseparable fixed effects is that spillovers only occur on the control units but not on the treated. We formalize the assumption in the following lines.
Assumption 7 (Fixed effects and constant spillovers on the treated). For any $d \in \{0, 1\}$, $\beta \in \mathcal{B}$, $x \in \mathcal{X}$, any random sample $S_{k,t}$ of size $n$ from cluster $k$ is such that
$$\frac{1}{n}\sum_{i \in S_{k,t}} m_{i,t}\big(d, x, \beta\big) f_{X_i}(x) = \alpha_{k,t}(x) + m(d, x, \beta) f_X(x) + J_n, \qquad \frac{1}{n}\sum_{i \in S_{k,t}} f_{X_i}(x) = \check f_X(x) + J_n,$$
for some possibly unknown functions $\alpha_{k,t}(\cdot), m(\cdot), f_X(\cdot), \check f_X(\cdot)$ and $J_n \in [-b_n, b_n]$, for some positive $b_n \to 0$ as $n \to \infty$. Assume in addition that $m(1, x, \beta)$ is constant in $\beta$.

Assumption 7 states that time- and cluster-specific fixed effects are nonseparable, and spillovers only occur on control units. Under Assumption 7, local experimentation only occurs over a single period (instead of two periods). Namely, for each pair of clusters, over each period $t$, we randomize treatments as follows:
$$D_{i,t} \mid X_i = x, \beta_{k(i),t} \sim \begin{cases} \mathrm{Bern}\big(e(x; \beta_{k(i),t} + \eta_n e_j)\big) & \text{if } k(i) \text{ is odd} \\ \mathrm{Bern}\big(e(x; \beta_{k(i),t} - \eta_n e_j)\big) & \text{if } k(i) \text{ is even,} \end{cases} \qquad n^{-1/2} < \eta_n < n^{-1/4}. \quad (24)$$
The parameter $\eta_n$ captures small deviations from the target parameter. Observe that, differently from Equation (23), under Assumption 7 we do not necessitate two consecutive randomizations for estimation of the marginal effects. We now discuss the estimation of the marginal effects.

Estimation of direct effects
Similarly to Equation (16), we estimate the direct effect of the treatments taking a weighted difference between the control and treated units of the following form:
$$\widehat\Delta^{(j)}_{k,t}(\beta) = \frac{1}{2n}\sum_{i \in S_{k,t} \cup S_{k+1,t}} \frac{\partial e(X_i; \beta)}{\partial \beta^{(j)}}\Big[\frac{Y_{i,t} D_{i,t}}{e(X_i; \beta + v_{k(i)}\eta_n e_j)} - \frac{Y_{i,t}(1 - D_{i,t})}{1 - e(X_i; \beta + v_{k(i)}\eta_n e_j)}\Big]. \quad (25)$$
Since randomizations are implemented only over a single period, the expression sums effects on the treated and control units (reweighted by the assigned probability of exposure) only at time $t$.

Estimation of marginal spillover effects
By assumption, the marginal spillover effect on the treated is zero. Therefore, we only need to estimate the marginal spillover effect on the control units. The estimator takes the following form:
$$\widehat S^{(j)}_{k,t}(0, \beta) = \frac{1}{2n}\sum_{i \in S_{k,t} \cup S_{k+1,t}} \Big[\frac{v_{k(i)}\big(1 - e(X_i; \beta)\big)}{\eta_n} \times \frac{Y_{i,t}(1 - D_{i,t})}{1 - e(X_i; \beta + v_{k(i)}\eta_n e_j)}\Big].$$
Similarly to before, the estimator takes the difference in the outcomes on the control units between the two clusters, and rescales it by the factor $1/\eta_n$.

Bias estimation
Finally, observe that, due to nonseparable effects, the estimator of the marginal spillover effect presents a bias of the form $(\alpha_{k,t} - \alpha_{k+1,t})/\eta_n$. The bias is estimated by differencing the outcomes on the treated units between the two clusters (spillovers on the treated are constant by Assumption 7). Namely, the estimated bias is obtained as follows:
$$\widehat B^{(j)}_{k,t}(\beta) = \frac{1}{2n}\sum_{i \in S_{k,t} \cup S_{k+1,t}} \Big[\frac{v_{k(i)}\big(1 - e(X_i; \beta)\big)}{\eta_n} \times \frac{Y_{i,t} D_{i,t}}{e(X_i; \beta + v_{k(i)}\eta_n e_j)}\Big].$$

Marginal effect estimator
The final estimator of the marginal effect defined in Equation (9) is the sum of the direct and marginal spillover effects, taking the following form:
$$\widehat Z^{(j)}_{k,t}(\beta) = \widehat S^{(j)}_{k,t}(0, \beta) - \widehat B^{(j)}_{k,t}(\beta) + \widehat\Delta^{(j)}_{k,t}(\beta) - \frac{1}{2n}\sum_{i \in S_{k,t} \cup S_{k+1,t}} c(X_i)\frac{\partial e(X_i, \beta)}{\partial \beta^{(j)}}, \quad (26)$$
where the last component captures the average marginal cost. We now discuss the theoretical properties of the estimator. A graphical illustration is provided in Figure 6. We can now state the following theorem.

Theorem 5.1.
Let Assumptions 1, 3, 7 hold, and consider a randomization as in Equation (24) with an exogenous parameter $\iota$. Then
$$\Big| E\big[\widehat Z^{(j)}_{k,0}(\iota)\big] - V^{(j)}(\iota) \Big| = O\big(J_n/\eta_n + \eta_n\big).$$

Theorem 5.1 guarantees consistency of the estimator for a suitable choice of $\eta_n$. All the remaining results directly extend to this setting as well.

In this section we discuss inference on the estimated parameter $\check\beta^w_k$ ex post the sequential randomization procedure. To conduct inference, we let $K, \check T$ be finite, while imposing the stronger assumption that $K \ge 4(\check T + 1)$, i.e., a twice as large pool of clusters compared to Assumption 5 is available. In this scenario, for each group of pairs $\{k, k+1, k+2, k+3\}$, we run the same algorithm as in Section 4.1, with a small modification: we group clusters into groups of four and, over each wave, we let
$$\check\beta^w_k = \check\beta^w_{k+1} = \check\beta^w_{k+2} = \check\beta^w_{k+3}. \quad (27)$$
The definition of the marginal effects $\check Z_{k,w}$ remains the same as in Equation (23). Given the policy $\check\beta^{\check T + 1}_k$, we test the hypothesis
$$H_{post,k} : V\big(\check\beta^{\check T + 1}_k\big) = 0 \ \text{ given } \check\beta^{\check T + 1}_k, \qquad k \in \{1, 5, 9, \cdots\}.$$
We can then construct the following test statistic for $H_{post,k}$. For $k \in \{1, 5, 9, \cdots\}$, we define (recall that $\check Z_{k,w}$ contains information from clusters $k$ and $k+1$)
$$Q^{post}_{k,j} = \frac{\sqrt{2}\,\big(\check Z^{(j)}_{k,\check T} + \check Z^{(j)}_{k+2,\check T}\big)/2}{\sqrt{\big(\check Z^{(j)}_{k,\check T} - \check Z^{(j)}_{k+2,\check T}\big)^2/2}} = \frac{\check Z^{(j)}_{k,\check T} + \check Z^{(j)}_{k+2,\check T}}{\big|\check Z^{(j)}_{k,\check T} - \check Z^{(j)}_{k+2,\check T}\big|}, \qquad T^{post,k}_n = \max_j \big|Q^{post}_{k,j}\big|,$$
with $T^{post,k}_n$ denoting the test statistic for the $k$th hypothesis. We now introduce the following theorem.
Let Assumptions 1, 2, 3, 5 hold, and let Assumption 4 hold for $t = T$. Let $K \ge 4(\check T + 1)$, and consider a design mechanism as in Section 4.1 with policies as in Equation (27). Let $\eta_n = n^{-1/4 - \xi}$, for a small $\xi > 0$, and $J_n = 0$. Let $\alpha/p \le 0.08$. Then
$$\lim_{n \to \infty} P\Big(T^{post,k}_n \le \mathrm{cv}(\alpha/p) \,\Big|\, \check\beta_{k,t}, H_{post,k}\Big) \ge 1 - \alpha,$$
where $\mathrm{cv}(h)$ denotes the $(1 - h)$-th quantile of a standard Cauchy random variable.

The above theorem allows for separate testing. In the presence of multiple testing, size adjustments to control the compound error rate should be considered. Observe that we may also increase the size of the groups of clusters (e.g., $K \ge 8(\check T + 1)$) to obtain the same expression as above for the test statistic, averaged over more clusters, thus increasing power.

In this section we discuss extensions of the model that allow for carry-over effects. For expositional convenience, we allow carry-overs only through two consecutive periods.
We start our discussion by introducing the dynamic model.
Assumption 8 (Dynamic model). For a conditional Bernoulli allocation with exogenous parameters as in Definition 2.1, let the following hold:
$$Y_{i,t} = m_i\big(D_{i,t}, X_i, \beta_{k(i),t}, \beta_{k(i),t-1}\big) + \varepsilon_{i,t}, \qquad E_{\beta_{k(i),t}}\big[\varepsilon_{i,t} \mid D_{i,t}, X_i\big] = 0,$$
for some unknown $m_i(\cdot)$.

Assumption 8 defines the outcomes as functions of the present treatment assignment, covariates, and the policy decision $\beta$ implemented in the current and past period. The component $\beta_{k,t-1}$ captures carry-over effects that result from neighbors' treatments in the past. Similarly to Assumption 2, we assume that clusters are representative of the underlying population of interest. For simplicity, we omit time- and cluster-specific fixed effects, and we also assume that covariates are identically distributed.

Assumption 9 (Representative clusters). Let the following hold: for any random sample $S_{k,t}$ from cluster $k$, with size $|S_{k,t}| = n$,
$$\frac{1}{n}\sum_{i \in S_{k,t}} m_i(d, x, \beta_t, \beta_{t-1}) = m(d, x, \beta_t, \beta_{t-1}) + O(J_n), \qquad J_n \to 0.$$
Assume in addition that $X_i \sim F_X$ for all $i$.

Discussion of the above condition can be found in Section 2. Given the above definitions, we can introduce the notion of welfare.

Definition 5.1 (Instantaneous welfare). Define
$$\Gamma(\beta, \phi) = \int \Big\{ e(x; \beta)\big[m(1, x, \beta, \phi) - m(0, x, \beta, \phi)\big] + m(0, x, \beta, \phi) - c(x)\, e(x; \beta) \Big\}\, dF_X(x),$$
the instantaneous welfare.

Definition 5.1 defines welfare as a function of the parameters $\beta$ of the current policy and of the past policy $\phi$. It captures the notion of welfare at a given point in time. We now introduce our estimand of interest.

Definition 5.2 (Estimand). Define the estimand as follows:
$$\beta^{**} \in \arg\sup_{\beta \in \mathcal{B}} \Gamma(\beta, \beta).$$
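As a toy illustration of the estimand (every functional form below is our own assumption, not the paper's), with scalar policies, no covariates, $e(x; \beta) = \beta$, $m(1, x, \beta, \phi) = 1$, a hypothetical spillover $m(0, x, \beta, \phi) = 0.5\phi$ from the past treatment share, and cost $c(x) = 0.8$, the map $\beta \mapsto \Gamma(\beta, \beta)$ can be maximized on a grid:

```python
def gamma(beta, phi):
    """Toy instantaneous welfare Gamma(beta, phi): e(x; beta) = beta,
    m(1, ., ., .) = 1, m(0, ., beta, phi) = 0.5 * phi, cost c(x) = 0.8.
    All functional forms are illustrative assumptions."""
    m1, m0, cost = 1.0, 0.5 * phi, 0.8
    return beta * (m1 - m0) + m0 - cost * beta

def argmax_stationary(grid_size=10001):
    """Grid search for beta** in arg sup_beta Gamma(beta, beta)."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return max(grid, key=lambda b: gamma(b, b))
```

Here $\Gamma(\beta, \beta) = 0.7\beta - 0.5\beta^2$, so the grid search recovers $\beta^{**} = 0.7$.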
Definition 5.2 defines the estimand of interest as the vector of parameters that maximizes welfare, under the constraint that the decision remains invariant over time. The case with covariates not being identically distributed follows similarly to what is discussed in the current section.
Carry-over effects introduce challenges for optimization due to dynamics. A simple gradient descent may not converge since, at every iteration, the function $\Gamma(\beta_t, \beta_{t-1})$ also depends on past decisions. Motivated by this observation, we propose patient gradient-descent updates.

“Patient” gradient descent

First, we introduce the optimization algorithm in full generality. We begin our iteration from the starting value $\iota$, we evaluate $\Gamma(\iota, \iota)$, and compute its total derivative $\nabla(\iota)$. We then update the current policy choice in the direction of the total derivative and wait for one more iteration before making the next update. Formally, the first three iterations consist of the following updates:
$$\Gamma(\iota, \iota) \;\Rightarrow\; \Gamma\big(\iota + \nabla(\iota), \iota\big) \;\Rightarrow\; \Gamma\big(\iota + \nabla(\iota), \iota + \nabla(\iota)\big).$$
We name the iterations “patient” since, in the third step, the algorithm keeps the policy choice $\iota + \nabla(\iota)$, even if this choice may decrease utility in the third iteration compared to the utility in the previous step. However, the overall utility from the first to the third iteration is increasing.

Estimation and updates
The estimation procedure follows similarly to Section 4.1, with a small modification: for every period $t$, the policy stays enforced for one more period $t + 1$, without necessitating that data are collected over the period $t + 1$. This modification is a direct extension of the gradient descent that allows for dynamics, at the expense of requiring a longer overall experimentation period. That is, for example, in Equation (23), $D_{i,t}$ is randomized as a Bernoulli with parameter $\check\beta^w_{k(i)}$ over two consecutive periods, and over the following two periods it is randomized with a local deviation $\eta_n$. Let the estimated policy be denoted by $\hat\beta^{**}_T$, the analogue of $\hat\beta^*$ for the estimand $\beta^{**}$. Observe that, although Assumption 9 assumes that there are no cluster- and time-specific fixed effects, the randomization and estimation procedure directly allows for those, similarly to what is discussed in Section 6.

Next, we discuss the theoretical guarantees of the proposed algorithm. The proof is included in the Appendix.

Theorem 5.3.
Let Assumptions 3, 5, 8, 9 hold. Take a small $\xi > 0$. Let $n^{1/2 - \xi} \ge \bar C \sqrt{\log(n)\, \gamma_n}\, T e^{B\sqrt{pT}} \log(KT)$, $J_n \le 1/\sqrt{n}$, $\eta_n = 1/n^{1/4 + \xi}$, for finite constants $\infty > B, \bar C > 0$. Let $T \ge \zeta$, for a finite constant $\zeta < \infty$. Let $\beta \mapsto \Gamma(\beta, \beta)$ satisfy strict quasi-concavity and local strong concavity as in Assumption 6. Then with probability at least $1 - 1/n$,
$$\lVert \beta^{**} - \hat\beta^{**}_T \rVert^2 \le \frac{p \bar C}{T}$$
for a finite constant $\bar C < \infty$.

In this section, we discuss the case where the policy can be updated over each iteration. The objective of the policymaker is to estimate the optimal path of policies. An essential condition in this section is that there are no cluster-specific fixed effects, as discussed in Assumption 9.
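A minimal sketch of the patient updates analyzed in Theorem 5.3 (the quadratic $\Gamma$, the step size, and the starting value are toy assumptions; the total derivative $\nabla$ is computed in closed form for this $\Gamma$):

```python
def gamma_quad(b, phi):
    """Toy instantaneous welfare, concave in both arguments (assumption)."""
    return -(b - 0.5) ** 2 - 0.1 * (phi - 0.5) ** 2

def total_derivative(b, step=0.5):
    """Scaled total derivative of b -> gamma_quad(b, b) for the toy:
    d/db [gamma_quad(b, b)] = -2(b - 0.5) - 0.2(b - 0.5)."""
    return step * (-2.2) * (b - 0.5)

def patient_updates(iota):
    """First three 'patient' iterations: update once, then hold the new
    policy for one more period before the next update."""
    new = iota + total_derivative(iota)
    return [gamma_quad(iota, iota), gamma_quad(new, iota), gamma_quad(new, new)]
```

Starting from $\iota = 0.3$, the welfare path is $(-0.044, -0.0044, -0.00044)$: even when the third step does not improve on the second, the overall utility from the first to the third iteration increases, which is the defining feature of the patient updates.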
First order conditions
A natural question is whether $\beta^{**}$ maximizes the long-run welfare, defined as follows:
$$\sum_{t=1}^{T^*} q^t\, \Gamma(\beta_t, \beta_{t-1}),$$
where $q \in (0, 1)$ denotes a discounting factor. In the presence of concave $\Gamma(\cdot)$, linearity in carry-over effects, and lack of interactions of carry-overs with present assignments, the welfare-maximizing policy is stationary. To observe why, note that the first-order conditions read as follows:
$$\underbrace{\frac{\partial \Gamma(\beta_t, \beta_{t-1})}{\partial \beta_t}}_{(A)} + q\, \underbrace{\frac{\partial \Gamma(\beta_{t+1}, \beta_t)}{\partial \beta_t}}_{(B)} = 0, \quad \forall t. \quad (28)$$
Assuming that $(B)$ is a constant and $(A)$ does not depend on $\beta_{t-1}$, the solution to all the above equations is the same $\beta_t$ in each equation. Whenever these conditions are not met, $\beta^{**}$ finds a practical motivation instead: once the study is concluded, the policymaker may prefer to adopt a single policy decision instead of a sequence of non-stationary decisions. However, in the following lines, we also discuss non-stationary decisions, whenever those are of interest to the policymaker.

Policy parametrization
The design of non-stationary decisions requires instead a more data-intensive scenario. We sketch the main ideas in the following lines. From Equation (28), we observe that the welfare-maximizing $\beta_{t+1}$ only depends on $(\beta_t, \beta_{t-1})$. Using ideas from reinforcement learning and welfare maximization (Sutton and Barto, 2018; Adusumilli et al., 2019), we parametrize the policy function by parameters $\theta \in \Theta$, with $\pi_\theta : \mathcal{B} \times \mathcal{B} \mapsto \mathcal{B}$. For any two past decisions, $\pi_\theta(\beta_t, \beta_{t-1})$ prescribes the welfare-maximizing policy $\beta_{t+1}$ in the subsequent iteration. The objective function takes the following form:
$$\widetilde W(\theta) = \sum_{t=1}^{T^*} q^t\, \Gamma\Big(\pi_\theta(\beta_{t-1}, \beta_{t-2}),\, \pi_\theta(\beta_{t-2}, \beta_{t-3})\Big), \quad \text{such that } \beta_t = \pi_\theta(\beta_{t-1}, \beta_{t-2}) \ \forall t \ge 1, \ \beta_0 = \beta_{-1} = \iota. \quad (29)$$
Here $\widetilde W(\theta)$ denotes the long-run welfare indexed by a given policy parameter $\theta$. By taking first-order conditions, we have
$$\frac{\partial \widetilde W(\theta)}{\partial \theta} = \sum_{t=1}^{T^*} q^t \Big[\underbrace{\frac{\partial \Gamma\big(\pi_\theta(\beta_{t-1}, \beta_{t-2}), \pi_\theta(\beta_{t-2}, \beta_{t-3})\big)}{\partial \pi_\theta(\beta_{t-1}, \beta_{t-2})}}_{(i)} \times f_{\theta,t-1}(\iota) + \underbrace{\frac{\partial \Gamma\big(\pi_\theta(\beta_{t-1}, \beta_{t-2}), \pi_\theta(\beta_{t-2}, \beta_{t-3})\big)}{\partial \pi_\theta(\beta_{t-2}, \beta_{t-3})}}_{(ii)} \times f_{\theta,t-2}(\iota)\Big], \quad (30)$$
where
$$f_{\theta,t}(\iota) = \frac{\partial \pi_\theta(\beta_t, \beta_{t-1})}{\partial \theta}, \quad \text{such that the constraint in Eq. (29) holds.}$$
Observe that the function $f_{\theta,t}(\iota)$ is known to the experimenter and can be obtained through the chain rule. However, $(i)$ and $(ii)$ are unknown and must be estimated. The key idea consists of constructing triads of clusters and alternating perturbations over subsequent sub-iterations. Observe that, differently from Adusumilli et al. (2019), here we also allow the outcome to depend dynamically on treatment assignments.
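The chain-rule computation of $f_{\theta,t}$ can be sketched for a hypothetical linear policy rule $\pi_\theta(b_1, b_2) = \theta_0 + \theta_1 b_1 + \theta_2 b_2$ (an illustrative parametrization of our own, not the paper's): the path derivative obeys the recursion $g_t = \partial \pi_\theta/\partial \theta + \theta_1 g_{t-1} + \theta_2 g_{t-2}$ with $g_0 = g_{-1} = 0$.

```python
def policy_path_and_grads(theta, iota, T):
    """Roll out beta_t = pi_theta(beta_{t-1}, beta_{t-2}) for a linear
    policy pi_theta(b1, b2) = theta0 + theta1*b1 + theta2*b2, and compute
    f_{theta,t} = d beta_t / d theta via the chain-rule recursion."""
    t0, t1, t2 = theta
    betas = {-1: iota, 0: iota}
    grads = {-1: (0.0, 0.0, 0.0), 0: (0.0, 0.0, 0.0)}
    for t in range(1, T + 1):
        b1, b2 = betas[t - 1], betas[t - 2]
        betas[t] = t0 + t1 * b1 + t2 * b2
        g1, g2 = grads[t - 1], grads[t - 2]
        # partials of pi_theta at fixed (b1, b2) are (1, b1, b2)
        grads[t] = tuple(
            d + t1 * a + t2 * b
            for d, a, b in zip((1.0, b1, b2), g1, g2)
        )
    return betas, grads
```

The recursion can be sanity-checked against a finite difference in $\theta_0$, since the rolled-out path is linear in $\theta_0$ for this toy rule.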
Grouping clusters
Create groups of three clusters $\{k, k+1, k+2\}$.

Iterations
The experiment consists of $\check{T}$ waves. Differently from Section 4.1, each wave consists of $j \in \{1, \cdots, \dim(\theta)\}$ iterations and $s \in \{1, \cdots, T^*\}$ sub-iterations. Over each wave $w$, a policy's parameter $\check\theta_{w,k}$ is chosen for each triad of clusters $\{k, k+1, k+2\}$. Each wave corresponds to a path of policies. That is, for wave $w$, cluster $k$, and a starting value $\iota$, the path of policies is
$$\Big[ \pi_{\check\theta_{w,k}}(\iota, \iota),\ \pi_{\check\theta_{w,k}}\big(\pi_{\check\theta_{w,k}}(\iota, \iota), \iota\big),\ \pi_{\check\theta_{w,k}}\big(\pi_{\check\theta_{w,k}}(\pi_{\check\theta_{w,k}}(\iota, \iota), \iota),\ \pi_{\check\theta_{w,k}}(\iota, \iota)\big), \cdots \Big].$$
That is, given the parameter value, each policy is chosen based on the policy choice in the previous periods, with in total $T^*$ many periods. We denote by $\check\theta_{w,k}(s)$ the policy recommendation on the path under parameter $\check\theta_{w,k}$ after $s$ sub-iterations.

Policy randomization
Over each wave $w$, iteration $j$, and sub-iteration $s$, denoted $(w, j, s)$, and group of clusters $\{k, k+1, k+2\}$, we randomize treatments as follows:
$$D_{i,wjs} \mid X_i, \check\theta_{w,k} \sim \begin{cases} \mathrm{Bern}\big(e(X_i; \check\theta_{w,k}(s))\big), & \text{if } k(i) = k; \\ \mathrm{Bern}\big(e(X_i; \check\theta_{w,k}(s) + \eta_n e_j)\big), & \text{if } k(i) = k+1 \text{ and } s \text{ is odd}; \\ \mathrm{Bern}\big(e(X_i; \check\theta_{w,k}(s))\big), & \text{if } k(i) = k+1 \text{ and } s \text{ is even}; \\ \mathrm{Bern}\big(e(X_i; \check\theta_{w,k}(s) + \eta_n e_j)\big), & \text{if } k(i) = k+2 \text{ and } s \text{ is even}; \\ \mathrm{Bern}\big(e(X_i; \check\theta_{w,k}(s))\big), & \text{if } k(i) = k+2 \text{ and } s \text{ is odd}. \end{cases} \tag{31}$$
Intuitively, one of the three clusters is always assigned the unperturbed policy $\check\theta_{w,k}$. The remaining two clusters alternate over each sub-iteration $s \in \{1, \cdots, T^*\}$ on whether a small deviation is applied to the policy.

Marginal effect estimator
The estimator consists in taking the difference of the weighted outcomes between cluster $k$ and cluster $k+1$ over odd iterations for estimating $(i)$, and between $k$ and $k+2$ over odd iterations for estimating $(ii)$, and vice versa over even iterations. Formally, $\check\theta_{w,k}(s) = \pi_{\check\theta_{w,k}}(\beta_s, \beta_{s-1})$, where $\beta_s = \pi_{\check\theta_{w,k}}(\beta_{s-1}, \beta_{s-2})$ if $s \ge 1$ and $\beta_s = \iota$ otherwise. The $j$th entry of the gradient is computed at the end of the $T^*$ iterations and is defined as $\check F^{(j)}_{k,w}$. A formal discussion is included in Appendix E.

Gradient update
Similarly to Section 4.1, over each wave $w$ we perform gradient updates, where the policy for the triad $\{k, k+1, k+2\}$ is updated using the gradient $\check F_{k+3,w}$, with $\{k+3, k+4, k+5\}$ being the subsequent triad. The above procedure estimates the policy $\pi_\theta$ for out-of-sample implementation via a gradient descent method, requiring, however, a large number of iterations on the in-sample units. The estimated policy is then deployed on the target population, having a much larger size than the in-sample population. We defer to Appendix E a formal discussion of the method. We conclude this section with an example.

Table 1: One wave $w$ with three clusters and $T^* = 4$. Over each period, $\check\theta(s)$ denotes the policy assignment along the path corresponding to policy $\pi_{\check\theta}$ at time $s$. Here, $\Gamma$ denotes the policy's instantaneous effect, as a function of the present and past assignment rules. By differencing the effect between the first cluster and the second cluster at time $t = 2$, we estimate the partial derivative of the effect of the policy assigned at time $t = 1$ on the welfare at time $t = 2$. By comparing the instantaneous welfare on the first and third clusters, we estimate the policy's partial effect at time $t = 2$ in the current period. The reverse reasoning applies at time $t = 3$.

            k = 1                       k = 2                               k = 3
t = 1   θ(1): Γ(θ(1), θ(1))        θ(1)+η_n: Γ(θ(1)+η_n, θ(1))        θ(1): Γ(θ(1), θ(1))
t = 2   θ(2): Γ(θ(2), θ(1))        θ(2): Γ(θ(2), θ(1)+η_n)            θ(2)+η_n: Γ(θ(2)+η_n, θ(1))
t = 3   θ(3): Γ(θ(3), θ(2))        θ(3)+η_n: Γ(θ(3)+η_n, θ(2))        θ(3): Γ(θ(3), θ(2)+η_n)
t = 4   θ(4): Γ(θ(4), θ(3))        θ(4): Γ(θ(4), θ(3)+η_n)            θ(4)+η_n: Γ(θ(4)+η_n, θ(3))

Here θ(s) abbreviates $\check\theta(s)$.

Example 2.1 Cont'd
Let $\check T = 10$ and $T^* = 4$. Then experimentation is conducted over 40 iterations. Clusters are first grouped into triads. Consider the triad $\{k, k+1, k+2\}$ and a policy $\pi_\theta(\beta_{t-1}, \beta_{t-2}) = \theta_0 + \beta_{t-1}\theta_1 + \beta_{t-2}\theta_2$, so that the probability of treatment at time $t$ depends linearly on the probability of treatment in the previous two periods. The policymaker's objective is to find the optimal path of probabilities, which corresponds to estimating the parameters $(\theta_0, \theta_1, \theta_2)$ that maximize the long-run welfare. The local optimization procedure starts from a starting value $\iota = 40\%$ and an initialization value for the parameters $\check\theta = (0, 1, 0)$, i.e., the probability of treatment today equals the one from yesterday. Then the first wave of experimentation $w = 1$ aims to study the long-run marginal effect at $(0, 1, 0)$. The sequence of policies over each $s \in \{1, \cdots, T^*\}$ is
$$\big(\check\theta(1), \cdots, \check\theta(4)\big) = \Big[\underbrace{\check\theta_0 + \iota \check\theta_1 + \iota \check\theta_2}_{(A)},\ \underbrace{\check\theta_0 + \check\theta_1 \times (A) + \check\theta_2 \iota}_{(B)},\ \underbrace{\check\theta_0 + \check\theta_1 \times (B) + \check\theta_2 (A)}_{(C)},\ \check\theta_0 + \check\theta_1 \times (C) + \check\theta_2 \times (B)\Big] = (40\%, 40\%, 40\%, 40\%),$$
where the last equality follows from the choice of our starting point ($(0, 1, 0)$ and $\iota = 40\%$ at $w = 1$). In the first period $t = 1$, treatments are randomized as follows: individuals are treated in clusters $k$ and $k+2$ with probability 40%, while treatments in the second cluster are assigned with probability 41%. In the second period $t = 2$, treatments are assigned with probability 40% in the first and second cluster and with probability 41% in the third cluster. The sequence repeats once more before the first wave ends. Table 1 reports the instantaneous welfare over each period $t$ of a sequence within a wave $w$.
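The 40% and 41% figures in the example can be checked numerically. The sketch below (our own illustration; the perturbation size $\eta_n = 0.01$ is inferred from the 41% figure in the text) computes the policy path and the alternating perturbation schedule of Eq. (31) for one wave:

```python
def path(theta, iota, T_star):
    """Policy path beta_s = theta_0 + theta_1 * beta_{s-1} + theta_2 * beta_{s-2}."""
    b = [iota, iota]
    for _ in range(T_star):
        b.append(theta[0] + theta[1] * b[-1] + theta[2] * b[-2])
    return b[2:]

theta_check, iota, eta_n, T_star = (0.0, 1.0, 0.0), 0.40, 0.01, 4
base = path(theta_check, iota, T_star)  # the unperturbed path
# Perturbation pattern of Eq. (31): cluster k is never perturbed,
# cluster k+1 is perturbed on odd s, cluster k+2 on even s.
schedule = {
    "k":   list(base),
    "k+1": [p + eta_n if s % 2 == 1 else p for s, p in enumerate(base, start=1)],
    "k+2": [p + eta_n if s % 2 == 0 else p for s, p in enumerate(base, start=1)],
}
```

The resulting schedule matches Table 1: cluster $k$ stays at 40% throughout, while clusters $k+1$ and $k+2$ alternate between 40% and 41%.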
Observe that, by alternating the perturbation over the second and the third cluster over each period, we can identify and estimate both the marginal effect of the policy in the current period and the marginal effect of the policy in the previous period. Once the first iteration is concluded, we estimate the gradient using Equation (30), we choose the new value of $\check\theta$ based on the gradient update, and then we keep iterating.

In this section, we study the numerical properties of the proposed estimator. We calibrate our experiments to data from Cai et al. (2015), and we consider as target estimand the percentage of individuals to be treated within each cluster.

Set up
The data contains network information for each individual over 47 villages in China, together with additional individual-specific characteristics. The outcome of interest is binary, and it consists of insurance adoption. Let $A^k$ denote the adjacency matrix in cluster $k$ observed from sampled data. We calibrate our simulations to the estimated linear-probability model
$$Y_{i,t} = \phi_0 + \phi_1 X_i + \phi_2 D_{i,t} + \phi_3 X_i \times D_{i,t} + S_i \phi_4 + S_i \times D_{i,t} \phi_5 + S_i^2 \phi_6 + \eta_{i,t}, \qquad S_i = \frac{\sum_{j \neq i} A^{k(i)}_{i,j} D_{j,t}}{\sum_{j \neq i} A^{k(i)}_{i,j}},$$
where $S_i$ denotes the percentage of treated friends. The above equation captures direct effects through the coefficients $\phi_2$ and $\phi_3$, where the latter also captures heterogeneity in effects; it captures spillover effects through the coefficients $\phi_4$ and $\phi_6$, as well as interactions between spillover and direct effects through the coefficient $\phi_5$. We estimate those coefficients using a linear regression with a small penalization to improve stability. The covariate matrix contains available individual information such as gender, age, rice area, literacy, risk aversion, the probability of disaster in a given region, and the number of friends. We simulate
$$\eta_{i,t} \mid \eta_{i,t-1} \sim \mathcal{N}(\rho \eta_{i,t-1}, \sigma^2), \qquad X_i \sim_{i.i.d.} F_{X,n},$$
with $\rho = 0.8$ and $F_{X,n}$ denoting the empirical distribution of the covariates observed in the data. We calibrate the variance $\sigma^2$ to the estimated residual variance. To study the performance under different levels of density of the network, we consider two alternative graphs: (i) two individuals are connected if they had reciprocally indicated the other as a connection (sparser network); (ii) two individuals are connected if either had indicated the other as a connection (denser network). Data is accessible at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CXDJM0&widget=dataverse@harvard. The adjacency matrix is constructed by considering each starting node of the edge list as a separate observation, oversampling individuals with a larger degree and under-sampling individuals with a smaller degree.
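A minimal simulation of the calibrated outcome model might look as follows; the coefficients `phi`, the network, and the error standard deviation below are placeholders for illustration, not the estimates obtained from the Cai et al. (2015) data:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_period(X, D, A, phi, eta_prev, rho=0.8, sigma=0.4):
    """One period of the linear-probability outcome model with AR(1) errors.
    A is a symmetric 0/1 adjacency matrix with zero diagonal."""
    deg = A.sum(axis=1)
    S = np.where(deg > 0, (A @ D) / np.maximum(deg, 1.0), 0.0)  # share of treated friends
    eta = rho * eta_prev + rng.normal(0.0, sigma, size=len(X))   # eta_t | eta_{t-1}
    Y = (phi[0] + phi[1] * X + phi[2] * D + phi[3] * X * D
         + phi[4] * S + phi[5] * S * D + phi[6] * S ** 2 + eta)
    return Y, eta

# Toy inputs: a small symmetric network and hypothetical coefficients.
n = 8
upper = np.triu((rng.random((n, n)) < 0.4).astype(float), k=1)
A = upper + upper.T
X = rng.normal(size=n)
D = rng.binomial(1, 0.4, size=n)
phi = np.array([0.1, 0.02, 0.05, 0.01, 0.10, 0.05, -0.02])
Y, eta = simulate_period(X, D, A, phi, eta_prev=np.zeros(n))
```

Iterating `simulate_period` over $t$, feeding back `eta`, reproduces the persistent idiosyncratic errors used in the scenarios below.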
We consider the problem of maximizing welfare over the probability of treatment assignments, with the parameter space being a strict subset of (0, 1). We run the experiment for two different choices of the number of iterations $T$, sampling from the first $K = T$ clusters. We omit time-specific fixed effects for simplicity, and, following Remark 7, over each iteration we randomize treatments twice in the same cluster, with the second randomization inducing a small perturbation. We consider two scenarios corresponding to two different within-cluster sample sizes:

(A) Researchers sample once over each experimental wave from each cluster (i.e., $\bar n \approx 400$, where $\bar n$ denotes the median sample size);

(B) Researchers sample five times the same participants from each cluster over each experimental wave (i.e., $\bar n \approx 2000$).

Scenario (B) features strong dependence due to the persistency of the idiosyncratic errors and the fact that individuals observed over multiple periods have the same covariates. As a result, (B) reduces the variability occurring from the treatment assignments, but not from the covariates. We choose $\eta_n = \bar n^{-1/2}$, with $\eta_n = 0.05$ for Scenario (A) and $\eta_n = 0.022$ for Scenario (B). Given the heterogeneity in the sample sizes, $\bar n^{-1/2}$ does not affect consistency for the larger clusters, while controlling the bias across all clusters. We consider the adaptive learning rate of the gradient descent with $\gamma = 0.1$, and random initializations drawn uniformly between 0.1 and 0.9. We compare against two saturation designs: (i) the first uses equally spaced saturation probabilities and assigns treatment saturations to clusters deterministically; (ii) the second randomizes probabilities of treatment across clusters uniformly between 0.1 and 0.9. Competitors run over $2 \times T$ consecutive periods in Scenario (A) and over $10 \times T$ consecutive periods in Scenario (B). The competitors estimate the welfare-maximizing probability using a correctly specified quadratic regression of the average outcomes onto the saturation probabilities, which is expected to have a small out-of-sample regret.

A comprehensive set of results is in Figure 11, where each column in the panel reports the average in-sample regret, the out-of-sample regret, and the worst-case regret across all clusters. The top panels show that, under a denser network structure, the proposed design's in-sample regret is significantly smaller across all $T$ under consideration. The out-of-sample regret of the proposed estimator is comparable to the regret of the saturation experiment for $\bar n = 1200$ and slightly larger for $\bar n = 400$. We observe similar behavior in the bottom panels, where the proposed method achieves a significantly smaller in-sample regret. As $\bar n$ increases, the in- and out-of-sample regret of the algorithm decreases. The out-of-sample regret of the competitors also decreases, while their in-sample regret increases by design. We also observe that as $T$ increases, the error of the sequential experiment may either increase or decrease. These mixed results document the trade-off between the number of iterations and the small sample size. Whenever the estimation error dominates the gradient descent's optimization error, the number of waves increases the estimation error faster than the linear rate. As a result, longer experiments require much larger samples for better accuracy. In practice, we recommend that practitioners carefully select the number of iterations by considering the overall sample size. Our results show that a small number of waves suffices to achieve the global optimum while controlling the in-sample regret.
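Two of the tuning choices above can be illustrated with a short sketch: the bandwidth rate $\eta_n = \bar n^{-1/2}$ evaluated at the two scenarios' median sample sizes, and a normalized projected gradient update with $\gamma = 0.1$. The welfare function and its maximizer below are our own toy illustration, not quantities from the simulation.

```python
import numpy as np

# eta_n = n_bar**(-1/2): Scenario (A) with n_bar = 400, Scenario (B) with n_bar = 2000.
eta_A = 400 ** -0.5    # 0.05
eta_B = 2000 ** -0.5   # about 0.022

def projected_gradient_ascent(V, iota, gamma=0.1, waves=200, lo=0.01, hi=0.99):
    """Normalized projected gradient update: each wave moves gamma / sqrt(w)
    in the ascent direction of the (here, exactly known) welfare gradient V."""
    beta = iota
    for w in range(1, waves + 1):
        g = V(beta)
        if abs(g) < 1e-12:
            break
        beta += (gamma / (np.sqrt(w) * abs(g))) * g   # step size alpha_w = gamma / (sqrt(w) ||V||)
        beta = min(max(beta, lo), hi)                 # projection onto [lo, hi]
    return beta

# Toy concave welfare with an interior maximum at beta = 0.6.
beta_hat = projected_gradient_ascent(lambda b: -2.0 * (b - 0.6), iota=0.4)
```

The iterate oscillates around the maximizer with shrinking steps of order $\gamma/\sqrt{w}$, mirroring the adaptive learning rate used in the experiments.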
In this section, we discuss hypothesis testing. Similarly as before, we let $\eta_n = \bar n^{-1/2}$ and we consider scenarios with varying $\bar n$ and number of clusters. Namely, we consider "iteration = 1" ($\bar n = 400$), "iteration = 3" ($\bar n = 1200$), and "iteration = 5" ($\bar n = 2000$), respectively corresponding to inference after one, three, and five consecutive samplings from the participants in the cluster. We consider $K \in \{2, 4, 6, 10, 20, 40\}$ clusters. We match clusters with themselves over two consecutive iterations (see Remark 7). In Table 3, we report the coverage probability under the null hypothesis of welfare optimality for a test with size 5%. The results show that the coverage probability is approximately 95% across all designs. In Figure 12, we plot the power, i.e., the probability of rejection, as the coefficient moves away from the welfare-maximizing policy.

Figure 11: Results from the adaptive experiment with 200 replications, for the two values of $T$ under consideration. The top panels (weak ties) correspond to the denser network and the bottom panels (strong ties) to the sparser network. Saturation 1 corresponds to a saturation experiment with equally spaced saturation probabilities; Saturation 2 is a design with saturation probabilities drawn randomly from a uniform distribution; and Saturation 3 is as Saturation 2, but with half of the clusters, excluding those with less than four hundred observations. Matching is performed with the same cluster over two consecutive periods.

Table 3: Coverage probability of testing the null hypothesis of optimality over 500 replications with a test with size 5%. Here $K$ denotes the number of clusters, with the first two, four, etc., clusters being considered. The median cluster size across all clusters is $\bar n \approx 400$.

K =        2     4     6     10    20    40    2     4     6     10    20    40
iter = 1   0.95  0.95  0.96  0.96  0.96  0.94  0.95  0.96  0.95  0.95  0.94  0.91
iter = 3   0.96  0.95  0.95  0.93  0.96  0.96  0.96  0.95  0.94  0.93  0.96  0.95
iter = 5   0.95  0.94  0.95  0.95  0.94  0.96  0.94  0.94  0.94  0.94  0.91  0.95

Conclusion
This paper has introduced a novel method for experimental design under unobserved interference to test and estimate welfare-maximizing policies. The proposed methodology exploits between- and within-cluster local variation to estimate nonparametrically marginal spillover and direct effects. It uses the marginal effects of the treatment for hypothesis testing and policy design. We discuss the method's theoretical properties, showcase valid coverage of the hypothesis testing procedure in the presence of finitely many clusters, and provide guarantees on the in- and out-of-sample regret of the design.

We outlined the importance of allowing for general unknown interactions without imposing a particular exposure mapping. We make two assumptions: within-cluster interactions are local, and clusters are representative of the underlying population. We leave for future research addressing experimental design in the presence of heterogeneous clusters and global interaction mechanisms.

The hypothesis testing mechanism allows us to test for policy optimality. Future extensions may be considered: (i) with low-cost experimentation, researchers may prefer null hypotheses of no policy optimality; (ii) the testing may be extended to continuous treatments or observational studies. Finally, we introduced experimental designs for non-stationary policy decisions, discussing marginal effects under limited carry-overs. The design under infinitely long carry-over effects remains an open research question.
References
Adusumilli, K., F. Geiecke, and C. Schilter (2019). Dynamically optimal treatment alloca-tion using reinforcement learning. arXiv preprint arXiv:1904.01047 .Alatas, V., A. Banerjee, A. G. Chandrasekhar, R. Hanna, and B. A. Olken (2016). Networkstructure and the aggregation of information: Theory and evidence from indonesia.
American Economic Review 106 (7), 1663–1704.Alatas, V., A. Banerjee, R. Hanna, B. A. Olken, and J. Tobias (2012). Targeting the poor:evidence from a field experiment in indonesia.
American Economic Review 102 (4), 1206–40.
Andrews, I., T. Kitagawa, and A. McCloskey (2019). Inference on winners. Technical report, National Bureau of Economic Research.
Armstrong, T. and S. Shen (2015). Inference on optimal treatment assignments.
Aronow, P. M., C. Samii, et al. (2017). Estimating average causal effects under general interference, with application to a social network experiment.
The Annals of AppliedStatistics 11 (4), 1912–1947.Athey, S., D. Eckles, and G. W. Imbens (2018). Exact p-values for network interference.
Journal of the American Statistical Association 113 (521), 230–240.Athey, S. and G. W. Imbens (2018). Design-based analysis in difference-in-differences set-tings with staggered adoption. Technical report, National Bureau of Economic Research.Athey, S. and S. Wager (2020). Policy learning with observational data.
Econometrica,Forthcoming .Bai, Y. (2019). Optimality of matched-pair designs in randomized controlled trials.
Avail-able at SSRN 3483834 .Baird, S., J. A. Bohren, C. McIntosh, and B. ¨Ozler (2018). Optimal design of experimentsin the presence of interference.
Review of Economics and Statistics 100 (5), 844–860.Bakirov, N. K. and G. J. Szekely (2006). Student’s t-test for gaussian scale mixtures.
Journal of Mathematical Sciences 139 (3), 6497–6505.Banerjee, A., A. G. Chandrasekhar, E. Duflo, and M. O. Jackson (2013). The diffusion ofmicrofinance.
Science 341 (6144), 1236498.Basse, G. and A. Feller (2018). Analyzing two-stage experiments in the presence of inter-ference.
Journal of the American Statistical Association 113 (521), 41–55.Basse, G. W. and E. M. Airoldi (2018a). Limitations of design-based causal inference anda/b testing under arbitrary and network interference.
Sociological Methodology 48 (1),136–151.Basse, G. W. and E. M. Airoldi (2018b). Model-assisted design of experiments in thepresence of network-correlated outcomes.
Biometrika 105 (4), 849–858.Bhattacharya, D. (2009). Inferring optimal peer assignment from experimental data.
Jour-nal of the American Statistical Association 104 (486), 486–500.Bhattacharya, D. and P. Dupas (2012). Inferring welfare maximizing treatment assignmentunder budget constraints.
Journal of Econometrics 167 (1), 168–196.Bhattacharya, D., P. Dupas, and S. Kanaya (2013). Estimating the impact of means-tested subsidies under treatment externalities with application to anti-malarial bednets.Technical report, National Bureau of Economic Research.Bond, R. M., C. J. Fariss, J. J. Jones, A. D. Kramer, C. Marlow, J. E. Settle, and J. H.Fowler (2012). A 61-million-person experiment in social influence and political mobiliza-tion.
Nature 489 (7415), 295.
Bottou, L., F. E. Curtis, and J. Nocedal (2018). Optimization methods for large-scale machine learning.
Siam Review 60 (2), 223–311.Boyd, S., S. P. Boyd, and L. Vandenberghe (2004).
Convex optimization . Cambridgeuniversity press.Breza, E., A. G. Chandrasekhar, T. H. McCormick, and M. Pan (2020). Using aggregatedrelational data to feasibly identify network structure without network data.
AmericanEconomic Review 110 (8), 2454–84.Brooks, R. L. (1941). On colouring the nodes of a network. In
Mathematical Proceedingsof the Cambridge Philosophical Society , Volume 37, pp. 194–197. Cambridge UniversityPress.Bubeck, S., N. Cesa-Bianchi, et al. (2012). Regret analysis of stochastic and nonstochasticmulti-armed bandit problems.
Foundations and Trends ® in Machine Learning 5 (1),1–122.Bubeck, S., Y. T. Lee, and R. Eldan (2017). Kernel-based methods for bandit convexoptimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theoryof Computing , pp. 72–85.Cai, J., A. De Janvry, and E. Sadoulet (2015). Social networks and the decision to insure.
American Economic Journal: Applied Economics 7 (2), 81–108.Cesa-Bianchi, N. and G. Lugosi (2006).
Prediction, learning, and games . Cambridgeuniversity press.Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, andJ. Robins (2018). Double/debiased machine learning for treatment and structural pa-rameters.Chernozhukov, V., D. Chetverikov, K. Kato, et al. (2014). Gaussian approximation ofsuprema of empirical processes.
The Annals of Statistics 42 (4), 1564–1597.Chernozhukov, V., K. Wuthrich, and Y. Zhu (2018). Practical and robust t -test basedinference for synthetic control and related methods. arXiv preprint arXiv:1812.10820 .Chin, A., D. Eckles, and J. Ugander (2018). Evaluating stochastic seeding strategies innetworks. arXiv preprint arXiv:1809.09561 .Choi, D. (2017). Estimation of monotone treatment effects in network experiments. Journalof the American Statistical Association 112 (519), 1147–1155.Dehejia, R. H. (2005). Program evaluation as a decision problem.
Journal of Econometrics 125 (1-2), 141–173.
Dupas, P. (2014). Short-run subsidies and long-run adoption of new health products: Evidence from a field experiment.
Econometrica 82 (1), 197–228.Eckles, D., B. Karrer, and J. Ugander (2017). Design and analysis of experiments innetworks: Reducing bias from interference.
Journal of Causal Inference 5 (1).Egger, D., J. Haushofer, E. Miguel, P. Niehaus, and M. W. Walker (2019). General equi-librium effects of cash transfers: experimental evidence from kenya. Technical report,National Bureau of Economic Research.Elliott, G. and R. P. Lieli (2013). Predicting binary outcomes.
Journal of Economet-rics 174 (1), 15–26.Flaxman, A. D., A. T. Kalai, and H. B. McMahan (2004). Online convex optimization inthe bandit setting: gradient descent without a gradient. arXiv preprint cs/0408007 .Forastiere, L., E. M. Airoldi, and F. Mealli (2020). Identification and estimation of treat-ment and interference effects in observational studies on networks.
Journal of the Amer-ican Statistical Association , 1–18.Garber, D. (2019). Logarithmic regret for online gradient descent beyond strong convexity.In
The 22nd International Conference on Artificial Intelligence and Statistics , pp. 295–303. PMLR.Goldsmith-Pinkham, P. and G. W. Imbens (2013). Social networks and the identificationof peer effects.
Journal of Business & Economic Statistics 31 (3), 253–264.Graham, B. S., G. W. Imbens, and G. Ridder (2010). Measuring the effects of segregation inthe presence of social spillovers: A nonparametric approach. Technical report, NationalBureau of Economic Research.Guo, R., J. Li, and H. Liu (2020). Counterfactual evaluation of treatment assignment func-tions with networked observational data. In
Proceedings of the 2020 SIAM InternationalConference on Data Mining , pp. 271–279. SIAM.Hadad, V., D. A. Hirshberg, R. Zhan, S. Wager, and S. Athey (2019). Confidence intervalsfor policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768 .Hazan, E., K. Levy, and S. Shalev-Shwartz (2015). Beyond convexity: Stochastic quasi-convex optimization. In
Advances in Neural Information Processing Systems , pp. 1594–1602.Hirano, K. and J. R. Porter (2020). Asymptotic analysis of statistical decision rules ineconometrics.
Handbook of Econometrics.
Horvitz, D. G. and D. J. Thompson (1952). A generalization of sampling without replacement from a finite universe.
Journal of the American statistical Association 47 (260),663–685.Hudgens, M. G. and M. E. Halloran (2008). Toward causal inference with interference.
Journal of the American Statistical Association 103 (482), 832–842.Ibragimov, R. and U. K. M¨uller (2010). t-statistic based correlation and heterogeneityrobust inference.
Journal of Business & Economic Statistics 28 (4), 453–468.Ibragimov, R. and U. K. M¨uller (2016). Inference with few heterogeneous clusters.
Reviewof Economics and Statistics 98 (1), 83–96.Imai, K. and M. L. Li (2019). Experimental evaluation of individualized treatment rules. arXiv preprint arXiv:1905.05389 .Imbens, G. W. and D. B. Rubin (2015).
Causal inference in statistics, social, and biomedicalsciences . Cambridge University Press.Jagadeesan, R., N. S. Pillai, A. Volfovsky, et al. (2020). Designs for estimating the treat-ment effect in networks with interference.
Annals of Statistics 48 (2), 679–712.Janson, S. (2004). Large deviations for sums of partly dependent random variables.
RandomStructures & Algorithms 24 (3), 234–248.Jones, J. J., R. M. Bond, E. Bakshy, D. Eckles, and J. H. Fowler (2017). Social influenceand political mobilization: Further evidence from a randomized experiment in the 2012us presidential election.
PloS one 12 (4), e0173851.Kallus, N. (2017). Recursive partitioning for personalization using observational data. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pp.1789–1798. JMLR. org.Kang, H. and G. Imbens (2016). Peer encouragement designs in causal inference withpartial interference and identification of local average network effects. arXiv preprintarXiv:1609.04464 .Kasy, M. (2016). Partial identification, distributional preferences, and the welfare rankingof policies.
Review of Economics and Statistics 98 (1), 111–131.Kasy, M. (2017). Who wins, who loses? identification of the welfare impact of changingwages.Kasy, M. (2018). Optimal taxation and insurance using machine learning—sufficient statis-tics and beyond.
Journal of Public Economics 167 , 205–219.Kasy, M. and A. Sautmann (2019). Adaptive treatment assignment in experiments forpolicy choice.
Econometrica.
Kato, M. and Y. Kaneko (2020). Off-policy evaluation of bandit algorithm from dependent samples under batch update policy. arXiv preprint arXiv:2010.13554.
The Lancet 386 (9989), 145–153.Kitagawa, T. and A. Tetenov (2018). Who should be treated? Empirical welfare maxi-mization methods for treatment choice.
Econometrica 86 (2), 591–616.Kitagawa, T. and A. Tetenov (2019). Equality-minded treatment choice.
Journal of Busi-ness & Economic Statistics , 1–14.Kitagawa, T. and G. Wang (2020). Who should get vaccinated? individualized allocationof vaccines over sir network. arXiv preprint arXiv:2012.04055 .Kleinberg, R. D. (2005). Nearly tight bounds for the continuum-armed bandit problem. In
Advances in Neural Information Processing Systems , pp. 697–704.Leung, M. P. (2020). Treatment and spillover effects under network interference.
Reviewof Economics and Statistics 102 (2), 368–380.Li, S. and S. Wager (2020). Random graph asymptotics for treatment effect estimationunder network interference. arXiv preprint arXiv:2007.13302 .Li, X., P. Ding, Q. Lin, D. Yang, and J. S. Liu (2019). Randomization inference for peereffects.
Journal of the American Statistical Association , 1–31.Lu, C., B. Sch¨olkopf, and J. M. Hern´andez-Lobato (2018). Deconfounding reinforcementlearning in observational settings. arXiv preprint arXiv:1812.10576 .Luedtke, A. R. and M. J. Van Der Laan (2016). Statistical inference for the mean outcomeunder a possibly non-unique optimal treatment strategy.
Annals of statistics 44 (2), 713.Manski (2004). Statistical treatment rules for heterogeneous populations.
Economet-rica 72 (4), 1221–1246.Manski, C. F. (2013). Identification of treatment response with social interactions.
The Econometrics Journal 16 (1), S1–S23.
Mbakop, E. and M. Tabord-Meehan (2016). Model selection for treatment choice: Penalized welfare maximization. arXiv preprint arXiv:1609.03167.
Mbakop, E. and M. Tabord-Meehan (2018). Model selection for treatment choice: Penalized welfare maximization. arXiv.org 1609.03167.
Munro, E. (2020). Learning to personalize treatments when agents are strategic. arXiv preprint arXiv:2011.06528.
Muralidharan, K. and P. Niehaus (2017). Experimentation at scale.
Journal of EconomicPerspectives 31 (4), 103–24.Muralidharan, K., P. Niehaus, and S. Sukhtankar (2017). General equilibrium effects of(improving) public employment programs: Experimental evidence from india. Technicalreport, National Bureau of Economic Research.Murphy, S. A. (2003). Optimal dynamic treatment regimes.
Journal of the Royal StatisticalSociety: Series B (Statistical Methodology) 65 (2), 331–355.Nie, X., E. Brunskill, and S. Wager (2020). Learning when-to-treat policies.
Journal of theAmerican Statistical Association (just-accepted), 1–58.Ogburn, E. L., O. Sofrygin, I. Diaz, and M. J. van der Laan (2017). Causal inference forsocial network data. arXiv preprint arXiv:1705.08527 .Opper, I. M. (2016). Does helping john help sue? evidence of spillovers in education.
American Economic Review 109 (3), 1080–1115.Park, C. and H. Kang (2020). Efficient semiparametric estimation of network treatmenteffects under partial interference. arXiv preprint arXiv:2004.08950 .Pouget-Abadie, J. (2018).
Dealing with Interference on Experimentation Platforms . Ph.D. thesis.Rai, Y. (2018). Statistical inference for treatment assignment policies.
UnpublishedManuscript .Ross, N. et al. (2011). Fundamentals of stein’s method.
Probability Surveys 8 , 210–293.Rubin, D. B. (1990). Formal mode of statistical inference for causal effects.
Journal ofstatistical planning and inference 25 (3), 279–292.Russo, D., B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen (2017). A tutorial onthompson sampling. arXiv preprint arXiv:1707.02038 .Saez, E. (2001). Using elasticities to derive optimal income tax rates.
The review ofeconomic studies 68 (1), 205–229.Sasaki, Y. and T. Ura (2020). Welfare analysis via marginal treatment effects. arXivpreprint arXiv:2012.07624 .S¨avje, F., P. M. Aronow, and M. G. Hudgens (2020). Average treatment effects in thepresence of unknown interference.
The Annals of Statistics, Forthcoming .Stoye, J. (2009). Minimax regret treatment choice with finite samples.
Journal of Econo-metrics 151 (1), 70–81.Stoye, J. (2012). Minimax regret treatment choice with covariates or with limited validityof experiments.
Journal of Econometrics 166 (1), 138–156.
Sutton, R. S. and A. G. Barto (2018).
Reinforcement learning: An introduction . MITpress.Tabord-Meehan, M. (2018). Stratification trees for adaptive randomization in randomizedcontrolled trials. arXiv preprint arXiv:1806.05127 .Taylor, S. J. and D. Eckles (2018). Randomized experiments to detect and estimate socialinfluence in networks. In
Complex Spreading Phenomena in Social Systems , pp. 289–322.Springer.Tetenov, A. (2012). Statistical treatment choice based on asymmetric minimax regretcriteria.
Journal of Econometrics 166 (1), 157–165.Ugander, J., B. Karrer, L. Backstrom, and J. Kleinberg (2013). Graph cluster randomiza-tion: Network exposure to multiple universes. In
Proceedings of the 19th ACM SIGKDDinternational conference on Knowledge discovery and data mining , pp. 329–337. ACM.Varian, H. R. (2016). Causal inference in economics and marketing.
Proceedings of theNational Academy of Sciences 113 (27), 7310–7315.Vazquez-Bare, G. (2017). Identification and estimation of spillover effects in randomizedexperiments. arXiv preprint arXiv:1711.02745 .Viviano, D. (2019). Policy targeting under network interference. arXiv preprintarXiv:1906.10258 .Viviano, D. (2020). Experimental design under network interference. arXiv preprintarXiv:2003.08421 .Wager, S. and K. Xu (2019). Experimenting in equilibrium. arXiv preprintarXiv:1903.02124 .Wainwright, M. J. (2019).
High-dimensional statistics: A non-asymptotic viewpoint , Vol-ume 48. Cambridge University Press.Zhang, K. W., L. Janson, and S. A. Murphy (2020). Inference for batched bandits. arXivpreprint arXiv:2002.03217 .Zubcsek, P. P. and M. Sarvary (2011). Advertising to a social network.
Quantitative Marketing and Economics 9 (1), 71–107.
Preliminaries and notation
First, we introduce conventions and notation. Whenever we take summations, we sum over experimental participants unless otherwise specified. We define $x \lesssim y$ if $x$ is less than or equal to $y$ times a universal constant. We refer to the clusters by $k \in \{1, \cdots, K\}$, with the circular convention that cluster index $k = K + 1$ coincides with cluster 1. Define $t(j, w)$ the time $t$ corresponding to wave $w$ and iteration $j$, as discussed in Section 4.1. We define
$$\beta_{k,j,w} = b_{j,k}(\check\beta_{w,k}), \qquad b_{j,k}(\beta) = \begin{cases} \beta + \eta_n e_j & \text{if } k \text{ is odd}; \\ \beta - \eta_n e_j & \text{otherwise}. \end{cases}$$
Throughout our proofs, we will implicitly condition on $v_1, \cdots, v_K$. Finally, observe that $\beta_{k,j,w}$ is a measurable function of $\check\beta_{w,k}$, and therefore conditioning on $\check\beta_{w,k}$ will implicitly result in conditioning also on $\beta_{k,j,w}$.

Oracle gradient descent
We define
$$\beta^*_w = \Pi_{\mathcal{B}_0, \mathcal{B}}\Big[ \beta^*_{w-1} + \alpha_{w-1} V(\beta^*_{w-1}) \Big], \qquad \beta^*_0 = \iota, \qquad (32)$$
the oracle solution of the local optimization procedure, for known welfare function, where $\alpha_w = \gamma / \big( \sqrt{w}\, \|V(\beta^*_{w-1})\| \big)$ unless otherwise specified. Take $\check{T} > 0$. The algorithm terminates if $\|V(\beta^*_w)\| \le \mu/\sqrt{\check{T}}$.

We now discuss definitions of dependency graphs.

Definition A.1 (Adjacency matrix and dependency graph). Given $n$ random variables $R_i$, we denote by $A_n$ an adjacency matrix with $A^{(i,j)}_n = 1$ if and only if $R_i$ and $R_j$ are dependent. The variables connected under $A_n$ form a dependency graph (Janson, 2004), i.e., units that are not connected are mutually independent.

Lemma A.1. (Ross et al., 2011) Let $X_1, \ldots, X_n$ be random variables such that $E[X_i^4] < \infty$, $E[X_i] = 0$, $\sigma^2 = \mathrm{Var}\big( \sum_{i=1}^n X_i \big)$, and define $W = \sum_{i=1}^n X_i / \sigma$. Let the collection $(X_1, \ldots, X_n)$ have dependency neighborhoods $N_i$, $i = 1, \ldots, n$, and also define $D = \max_{1 \le i \le n} |N_i|$. Then, for $Z$ a standard normal random variable,
$$d_W(W, Z) \le \frac{D^2}{\sigma^3} \sum_{i=1}^n E|X_i|^3 + \frac{\sqrt{28}\, D^{3/2}}{\sqrt{\pi}\, \sigma^2} \sqrt{\sum_{i=1}^n E[X_i^4]}, \qquad (33)$$
where $d_W$ denotes the Wasserstein metric.

Definition A.2. (Proper Cover)
Given an adjacency matrix $A_n$ with $n$ rows and columns, a family $\mathcal{C}_n = \{\mathcal{C}_n(j)\}$ of disjoint subsets of $[n]$ is a proper cover of $A_n$ if $\cup_j \mathcal{C}_n(j) = [n]$ and each $\mathcal{C}_n(j)$ contains units such that, for any pair of elements $\{(i,k) \in \mathcal{C}_n(j),\ k \ne i\}$, $A^{(i,k)}_n = 0$.

Definition A.3. (Chromatic Number) The chromatic number $\chi(A_n)$ denotes the size of the smallest proper cover of $A_n$.

Lemma A.2. (Brooks' Theorem, Brooks (1941)) For any connected undirected graph $G$ with maximum degree $\Delta$, the chromatic number of $G$ is at most $\Delta$, unless $G$ is a complete graph or an odd cycle, in which case the chromatic number is $\Delta + 1$.

B Lemmas
Proof of Lemma 2.1.
Under Assumption 1 (A), we can write the potential outcome only as a function of the current treatment assignments in the same cluster, namely we write $Y_{i,t}(\mathbf{d}^{k(i)}_t)$. Define $\mathbf{D}^k_t$ the vector of treatment assignments in cluster $k$ at time $t$. Under consistency of potential outcomes,
$$Y_{i,t}(\mathbf{D}^{k(i)}_t) = Y_{i,t} = E\big[ Y_{i,t}(\mathbf{D}^{k(i)}_t) \mid D_{i,t}, X_i, \beta_{k(i),t} \big] + \varepsilon_{i,t}, \qquad E\big[ \varepsilon_{i,t} \mid D_{i,t}, X_i, \beta_{k(i),t} \big] = 0,$$
where the above equation follows from the fact that the distribution of $\mathbf{D}^{k(i)}_t$ is fully characterized by the (exogenous) parameter $\beta_{k(i),t}$. By definition,
$$\varepsilon_{i,t} = Y_{i,t}(\mathbf{D}^{k(i)}_t) - m_{i,t}(D_{i,t}, X_i, \beta_{k(i),t}) = Y_{i,t}\big( g_1(\beta_{k(i),t}, X_1), \cdots, g_{\check{N}}(\beta_{k(i),t}, X_{\check{N}}) \big) - m_{i,t}(D_{i,t}, X_i, \beta_{k(i),t}),$$
for some random functions $g_i(\cdot)$. By definition of a CBAR, these functions are independent across individuals (i.e., treatment assignments are conditionally independent given covariates). Observe that under Assumption 1 (B), $Y_{i,t}(\mathbf{D}^{k(i)}_t)$ depends on at most $\sqrt{\gamma_n}$ many entries. As a result, $Y_{i,t}(\cdot)$ shares at least one entry with at most $\gamma_n$ many other potential outcomes, namely those of units sharing at least one common neighbor. Observe now that $g_i(\beta_{k(i),t}, X_i)$ and $Y_{i,t}(\cdot)$ are locally dependent under Assumption 1 (C), with those same units being neighbors of individual $i$ or neighbors of the neighbors. As a result, $\varepsilon_{i,t}$ and $\varepsilon_{j,t'}$, $t' \le T$, can be dependent only if $(i,j)$ are neighbors or share a common neighbor. Therefore $\varepsilon_{i,t}$ depends on at most $\sqrt{\gamma_n} + \gamma_n$ many other $\varepsilon_{j,t'}$, $t' \le T$, completing the proof.

In the following lemma, we extend results from Janson (2004) to the concentration of unbounded sub-Gaussian random variables. We state the lemma for general random variables $R_i$ forming a dependency graph with adjacency matrix $A_n$.

Lemma B.1.
Define $\{R_i\}_{i=1}^n$ sub-Gaussian random variables forming a dependency graph with adjacency matrix $A_n$, with maximum degree bounded by $\gamma_n$. Then with probability at least $1 - \delta$,
$$\Big| \frac{1}{n} \sum_{i=1}^n \big( R_i - E[R_i] \big) \Big| \le \bar{C} \sqrt{\frac{\gamma_n \log(\gamma_n/\delta)}{n}},$$
for a finite constant $\bar{C} < \infty$.

Proof. First, we construct a proper cover $\mathcal{C}_n$ as in Definition A.2, with minimal chromatic number $\chi(A_n)$. We can write
$$\Big| \frac{1}{n} \sum_{i=1}^n \big( R_i - E[R_i] \big) \Big| \le \frac{1}{n} \sum_{\mathcal{C}_n(j) \in \mathcal{C}_n} \underbrace{\Big| \sum_{i \in \mathcal{C}_n(j)} \big( R_i - E[R_i] \big) \Big|}_{(A)}.$$
Observe now that, by definition of the dependency graph, the components in $(A)$ are sums of mutually independent variables. Using the Chernoff bound (Wainwright, 2019), we have that with probability at least $1 - \delta$,
$$\Big| \sum_{i \in \mathcal{C}_n(j)} \big( R_i - E[R_i] \big) \Big| \le \bar{C} \sqrt{|\mathcal{C}_n(j)| \log(1/\delta)},$$
for a finite constant $\bar{C} < \infty$, where $|\mathcal{C}_n(j)|$ denotes the number of elements in $\mathcal{C}_n(j)$. As a result, using the union bound, we obtain that with probability at least $1 - \delta$,
$$\Big| \frac{1}{n} \sum_{i=1}^n \big( R_i - E[R_i] \big) \Big| \le \underbrace{\frac{\bar{C}}{n} \sum_{\mathcal{C}_n(j) \in \mathcal{C}_n} \sqrt{|\mathcal{C}_n(j)| \log(\chi(A_n)/\delta)}}_{(B)}.$$
Using concavity of the square-root function, after multiplying and dividing $(B)$ by $\chi(A_n)$, we have
$$(B) \le \frac{\bar{C}}{n} \chi(A_n) \sqrt{\frac{1}{\chi(A_n)} \sum_{\mathcal{C}_n(j) \in \mathcal{C}_n} |\mathcal{C}_n(j)| \log(\chi(A_n)/\delta)} = \frac{\bar{C}}{n} \sqrt{\chi(A_n)\, n \log(\chi(A_n)/\delta)}.$$
The last equality follows by the definition of a proper cover. The final result follows by Lemma A.2.
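The proper-cover construction in the proof above is a greedy coloring of the dependency graph. As a concrete illustration (our own sketch; `greedy_proper_cover` is a hypothetical helper, not from the paper), the snippet below colors a small dependency graph and checks that the resulting classes partition $[n]$, contain no dependent pair, and number at most $\Delta + 1$, consistent with Lemma A.2.

```python
import numpy as np

def greedy_proper_cover(A):
    """Greedy coloring of the dependency graph with 0/1 symmetric adjacency
    matrix A. Returns a list of color classes: disjoint index sets covering
    [n] such that no two dependent units share a class. Greedy coloring uses
    at most (max degree + 1) classes, matching the Delta + 1 bound."""
    n = A.shape[0]
    color = -np.ones(n, dtype=int)
    for i in range(n):
        used = {color[j] for j in range(n) if A[i, j] and color[j] >= 0}
        c = 0
        while c in used:
            c += 1
        color[i] = c
    return [np.flatnonzero(color == c) for c in range(color.max() + 1)]

# Example dependency graph: a cycle of length 6 (maximum degree 2).
n = 6
A = np.zeros((n, n), dtype=int)
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

cover = greedy_proper_cover(A)
# The cover partitions [n] ...
assert sorted(i for cls in cover for i in cls) == list(range(n))
# ... each class is an independent set of the dependency graph ...
assert all(A[i, j] == 0 for cls in cover for i in cls for j in cls if i != j)
# ... and it uses at most (max degree + 1) classes.
assert len(cover) <= A.sum(axis=1).max() + 1
```

Within each class the variables are mutually independent, which is exactly what lets the proof apply an independent-sum concentration bound class by class.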
Lemma B.2.
Under Assumption 3, $\eta_n W^{(j)}_{i,t}(\beta)$ is sub-Gaussian with some parameter $\tilde{\sigma} < \infty$, for any $\beta \in \mathcal{B}$.

Proof. Observe that we can write
$$\eta_n W^{(j)}_{i,t}(\beta) = \underbrace{Y_{i,t}\, \eta_n \frac{\partial e(X_i;\beta)}{\partial \beta} \Big[ \frac{D_{i,t}}{e_{i,j,t}(\beta)} - \frac{1 - D_{i,t}}{1 - e_{i,j,t}(\beta)} \Big]}_{(A)} + \underbrace{Y_{i,t}\, v_{k(i)} \Big[ \frac{e(X_i;\beta)\, D_{i,t}}{e_{i,j,t}(\beta)} - \frac{(1 - e(X_i;\beta))(1 - D_{i,t})}{1 - e_{i,j,t}(\beta)} \Big]}_{(B)} - \underbrace{\eta_n\, c(X_i) \frac{\partial e(X_i;\beta)}{\partial \beta}}_{(C)}.$$
By the definition of $\mathcal{E}$ and Assumption 3, the term multiplying $Y_{i,t}$ in $(A)$ is bounded by $\bar{C} \eta_n$ for a finite constant $\bar{C}$. Similarly, the term multiplying $Y_{i,t}$ in $(B)$ is bounded by a finite constant $\bar{C}$, while $(C)$ is uniformly bounded by Assumption 3. Since $Y_{i,t}$ is sub-Gaussian by Assumption 3 (bounded $m_{i,t}$ and sub-Gaussian $\varepsilon_{i,t}$), and $\eta_n \le 1$, each component is sub-Gaussian, completing the proof.

Lemma B.3.
Let Assumptions 1 and 5 hold. Consider the experimental design in Equation (23), with $\check{\beta}^w_k$ estimated as in Section 4.1. Then for any pair of clusters $\{k, k+1\}$, with $k$ being odd,
$$\big( \check{\beta}^1_k, \cdots, \check{\beta}^{\check{T}}_k \big) \perp \Big\{ Y_{i,t}(d),\ X_i,\ d \in \{0,1\}^{\check{N}} \Big\}_{i : k(i) \in \{k, k+1\},\ t \le T}.$$

Proof of Lemma B.3. To show that the claim holds, it suffices to show that $\check{\beta}^w_k$ is a function of observables and unobservables only of those units in clusters $k' \not\in \{k, k+1\}$. We start by studying $\check{\beta}^{\check{T}}_k$. Observe that $\check{\beta}^{\check{T}}_k$ is chosen based on the gradient $\check{Z}_{k+2, \check{T}-1}$ estimated in the previous period in clusters $\{k+2, k+3\}$. The estimated gradient $\check{Z}_{k+2, \check{T}-1}$ is a function of the unobservables and observables at any time $t \le T$ in clusters $\{k+2, k+3\}$ and of the policy $\check{\beta}^{\check{T}-1}_{k+2}$. The policy $\check{\beta}^{\check{T}-1}_{k+2}$ is in turn a function of the gradient $\check{Z}_{k+4, \check{T}-2}$ estimated in the subsequent two clusters $\{k+4, k+5\}$ over the previous wave of experimentation $\check{T}-2$. Proceeding recursively around the circle of clusters, this chain of dependencies never reaches clusters $\{k, k+1\}$, since $K \ge 2(\check{T}+1)$. The same reasoning applies to the remaining coefficients.

Lemma B.4.
Let Assumptions 1 and 5 hold. Consider the experimental design in Section 4.1. Then, for $t \ge 1$, the following holds:
$$E\Big[ \frac{Y_{i,t(j,w)}\, D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] = m_{i,t(j,w)}(1, X_i, \beta_{k(i),j,w}), \qquad E\Big[ \frac{Y_{i,t(j,w)}\, (1 - D_{i,t(j,w)})}{1 - e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] = m_{i,t(j,w)}(0, X_i, \beta_{k(i),j,w}).$$

Proof of Lemma B.4. We prove the first statement; the second statement follows similarly. Under Assumption 1,
$$E\Big[ \frac{Y_{i,t(j,w)}\, D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] = E\Big[ m_{i,t(j,w)}(1, X_i, \beta_{k(i),j,w})\, \frac{D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] + E\Big[ \varepsilon_{i,t(j,w)}\, \frac{D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big].$$
Observe that by design,
$$E\Big[ m_{i,t(j,w)}(1, X_i, \beta_{k(i),j,w})\, \frac{D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] = m_{i,t(j,w)}(1, X_i, \beta_{k(i),j,w}).$$
In addition, by Lemma B.3,
$$E\Big[ \varepsilon_{i,t(j,w)}\, \frac{D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] = 0,$$
completing the proof.

Lemma B.5.
Let Assumptions 1, 2, 3, 5 hold. Let $W^{(j)}_{i,t}$ be defined as in Equation (19). Then for any odd $k$,
$$\frac{1}{2n} \sum_{i \in S_{k,t(j,w)} \cup S_{k+1,t(j,w)}} E\Big[ W^{(j)}_{i,t(j,w)}\big(\check{\beta}^w_{k(i)}\big) \,\Big|\, \check{\beta}^w_{k(i)} \Big] - \frac{1}{2n} \sum_{i \in S_{k,0} \cup S_{k+1,0}} E\Big[ \frac{v_{k(i)}}{\eta_n} Y_{i,0} \,\Big|\, \check{\beta}^w_{k(i)} \Big] = V^{(j)}(\check{\beta}^w_k) + O(\eta_n) + O(J_n/\eta_n).$$

Proof of Lemma B.5. Recall the definition of $V(\beta)$ in Definition 2.4. In addition, recall that for $k$ odd, $\check{\beta}^w_k = \check{\beta}^w_{k+1}$. For shortness of notation, we define $t = t(j,w)$. Observe that by Lemma B.4, since $v_k$ is deterministic, we can write
$$\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} E\big[ W^{(j)}_{i,t}(\check{\beta}^w_{k(i)}) \mid \check{\beta}^w_{k(i)} \big] = \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} E\Big[ \big( m_{i,t}(1, X_i, \beta_{k(i),j,w}) - m_{i,t}(0, X_i, \beta_{k(i),j,w}) - c(X_i) \big) \frac{\partial e(X_i;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} \,\Big|\, \check{\beta}^w_k \Big]}_{(A)} + \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \frac{v_{k(i)}}{\eta_n} E\Big[ m_{i,t}(1, X_i, \beta_{k(i),j,w})\, e(X_i; \check{\beta}^w_{k(i)}) + \big(1 - e(X_i; \check{\beta}^w_{k(i)})\big)\, m_{i,t}(0, X_i, \beta_{k(i),j,w}) \,\Big|\, \check{\beta}^w_k \Big]}_{(B)}. \qquad (34)$$
The above expression follows since $\beta_{k(i),j,w}$ is a deterministic function of $\check{\beta}^w_{k(i)}$. We study $(A)$ and $(B)$ separately. We start from $(A)$, which we decompose into the following components:
$$(A) = \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \int \big[ m_{i,t}(1, x, \beta_{k(i),j,w}) - m_{i,t}(0, x, \beta_{k(i),j,w}) \big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} f_{X_i}(x)\, dx}_{(I)} - \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \int c(x) \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} f_{X_i}(x)\, dx}_{(II)}. \qquad (35)$$
First observe that we can write
$$(II) = \frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \int c(x) \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} f_{X_i}(x)\, dx = \frac{1}{2} \int c(x) \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \frac{1}{n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} f_{X_i}(x)\, dx = \int c(x) \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \check{f}_X(x)\, dx + O(J_n),$$
where the last equality follows from the dominated convergence theorem and the fact that $|S_{k,t} \cup S_{k+1,t}| = 2n$. Using the dominated convergence theorem combined with Assumption 3, we have
$$\int c(x) \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \check{f}_X(x)\, dx = \frac{\partial \int c(x)\, e(x;\beta)\, \check{f}_X(x)\, dx}{\partial \beta^{(j)}}.$$
We now study $(I)$. We can write
$$(I) = \frac{1}{2n} \int \sum_{i \in S_{k,t} \cup S_{k+1,t}} \big[ \big( m_{i,t}(1, x, \beta_{k(i),j,w}) - m_{i,t}(0, x, \beta_{k(i),j,w}) \big) f_{X_i}(x) \big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} dx = \frac{1}{2} \int \Big[ \sum_{b \in \{k, k+1\}} \big( m(1, x, \beta_{b,j,w}) - m(0, x, \beta_{b,j,w}) \big) f_X(x) \Big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} dx + O(J_n),$$
where the second equality follows from Assumption 2. We now use a first-order Taylor expansion of $m(1, x, \beta_{b,j,w})$ and $m(0, x, \beta_{b,j,w})$ around $\check{\beta}^w_k$. Observe that since $\beta_{b,j,w}$ deviates from $\check{\beta}^w_k$ by at most $\eta_n$ over one coordinate and zero over the remaining coordinates, under Assumption 3 we can write
$$\frac{1}{2} \int \Big[ \sum_{b \in \{k, k+1\}} \big( m(1, x, \beta_{b,j,w}) - m(0, x, \beta_{b,j,w}) \big) f_X(x) \Big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} dx = \int \big[ \big( m(1, x, \check{\beta}^w_k) - m(0, x, \check{\beta}^w_k) \big) f_X(x) \big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} dx + O(\eta_n).$$
We now study $(B)$. Observe that, differently from $(A)$, for $(B)$ we also need to account for time- and cluster-specific fixed effects. We start by studying the components
$$(B) = \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \frac{v_{k(i)}}{\eta_n} \int m_{i,t}(1, x, \beta_{k(i),j,w})\, e(x; \check{\beta}^w_{k(i)})\, f_{X_i}(x)\, dx}_{(a)} + \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \frac{v_{k(i)}}{\eta_n} \int m_{i,t}(0, x, \beta_{k(i),j,w}) \big(1 - e(x; \check{\beta}^w_{k(i)})\big) f_{X_i}(x)\, dx}_{(b)}.$$
We study $(a)$, while $(b)$ follows similarly. Under Assumption 2, since $\check{\beta}^w_k = \check{\beta}^w_{k+1}$ for $k$ being odd, we write
$$(a) = \frac{1}{2} \sum_{b \in \{k, k+1\}} \frac{v_b}{\eta_n} \int m(1, x, \beta_{b,j,w})\, e(x; \check{\beta}^w_k)\, f_X(x) + e(x; \check{\beta}^w_k)\, \alpha_t(x) + e(x; \check{\beta}^w_k)\, \tau_b(x)\, dx + O(J_n/\eta_n) = \frac{1}{2\eta_n} \int \big( m(1, x, \beta_{k,j,w}) - m(1, x, \beta_{k+1,j,w}) \big) e(x; \check{\beta}^w_k)\, f_X(x) + e(x; \check{\beta}^w_k) \big( \tau_k(x) - \tau_{k+1}(x) \big)\, dx + O(J_n/\eta_n).$$
We bound $(b)$ as follows:
$$(b) = \frac{1}{2\eta_n} \int \big( m(0, x, \beta_{k,j,w}) - m(0, x, \beta_{k+1,j,w}) \big) \big(1 - e(x; \check{\beta}^w_k)\big) f_X(x) + \big(1 - e(x; \check{\beta}^w_k)\big) \big( \tau_k(x) - \tau_{k+1}(x) \big)\, dx + O(J_n/\eta_n).$$
Combining the expressions, we write
$$\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} E\big[ W^{(j)}_{i,t}(\check{\beta}^w_{k(i)}) \mid \check{\beta}^w_{k(i)} \big] = \frac{1}{2\eta_n} \int \big( \tau_k(x) - \tau_{k+1}(x) \big)\, dx + \underbrace{\frac{1}{2\eta_n} \int \big( m(1, x, \beta_{k,j,w}) - m(1, x, \beta_{k+1,j,w}) \big) e(x; \check{\beta}^w_k)\, f_X(x)\, dx}_{(c)} + \underbrace{\frac{1}{2\eta_n} \int \big( m(0, x, \beta_{k,j,w}) - m(0, x, \beta_{k+1,j,w}) \big) \big(1 - e(x; \check{\beta}^w_k)\big) f_X(x)\, dx}_{(d)} + \int \big[ \big( m(1, x, \check{\beta}^w_k) - m(0, x, \check{\beta}^w_k) \big) f_X(x) \big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} dx - \frac{\partial \int c(x)\, e(x;\beta)\, \check{f}_X(x)\, dx}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} + O\big( \eta_n + J_n/\eta_n \big).$$
We now study $(c)$, while $(d)$ follows similarly. We use a second-order Taylor expansion of $m(d, x, \cdot)$ at $\check{\beta}^w_k$. Using the randomization scheme in Equation (23), we obtain under Assumption 3
$$(c) = \frac{1}{2\eta_n} \int_{\mathcal{X}} \frac{\partial m(1, x, \check{\beta}^w_k)}{\partial \beta^{(j)}}\, 2\eta_n\, e(x; \check{\beta}^w_k)\, f_X(x)\, dx + O(\eta_n),$$
where the component $O(\eta_n)$ is bounded using the compact support assumption on $X$, the fact that $\|f_X\|_\infty < \infty$, and the boundedness assumption on the second-order derivative. Similarly, we write
$$(d) = \frac{1}{2\eta_n} \int_{\mathcal{X}} \frac{\partial m(0, x, \check{\beta}^w_k)}{\partial \beta^{(j)}}\, 2\eta_n\, \big(1 - e(x; \check{\beta}^w_k)\big) f_X(x)\, dx + O(\eta_n).$$
Finally, by the circular cross-fitting algorithm and Assumption 5, $\check{\beta}^w_k$ is independent of observables and unobservables in clusters $\{k, k+1\}$ at time $s = 0$, since $\check{\beta}^w_k$ is a measurable function of the gradients estimated in all clusters but clusters $\{k, k+1\}$. As a result, we can take (recall that every expression is conditional on the initialization value $\iota$, assumed to be exogenous)
$$\frac{1}{2n} \sum_{i \in S_{k,0} \cup S_{k+1,0}} \frac{v_{k(i)}}{\eta_n} E\big[ Y_{i,0} \mid \check{\beta}^w_{k(i)} \big] = \frac{1}{2n} \sum_{i \in S_{k,0} \cup S_{k+1,0}} \frac{v_{k(i)}}{\eta_n} \int m_{i,0}(1, x, \iota)\, e(x; \iota)\, f_{X_i}(x)\, dx + \frac{1}{2n} \sum_{i \in S_{k,0} \cup S_{k+1,0}} \frac{v_{k(i)}}{\eta_n} \int m_{i,0}(0, x, \iota)\, \big(1 - e(x; \iota)\big) f_{X_i}(x)\, dx.$$
Under Assumption 2, the right-hand side equals
$$\frac{1}{2\eta_n} \int \big( \tau_k(x) - \tau_{k+1}(x) \big)\, dx + O(J_n/\eta_n).$$
Combining the equations, the proof completes.
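The inverse-propensity identities of Lemma B.4, which drive the first display of the proof above, can be checked by simulation. The sketch below is our own illustration with hypothetical primitives (`m1`, `m0`, a logistic propensity), none taken from the paper; it verifies that Horvitz-Thompson reweighting recovers the conditional means on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical primitives: covariates, a logistic propensity e(x),
# and conditional means m(1, x), m(0, x).
X = rng.uniform(-1, 1, size=n)
e = 1.0 / (1.0 + np.exp(-X))           # propensity of treatment given X
m1 = 1.0 + 2.0 * X                     # E[Y | D = 1, X]
m0 = 0.5 - 1.0 * X                     # E[Y | D = 0, X]

D = rng.binomial(1, e)                 # randomized treatment
Y = np.where(D == 1, m1, m0) + rng.normal(0, 1, size=n)  # noisy outcome

# Horvitz-Thompson reweighting, as in Lemma B.4:
# E[Y D / e(X)] = E[m(1, X)]  and  E[Y (1 - D) / (1 - e(X))] = E[m(0, X)].
ht1 = np.mean(Y * D / e)
ht0 = np.mean(Y * (1 - D) / (1 - e))

assert abs(ht1 - m1.mean()) < 0.1
assert abs(ht0 - m0.mean()) < 0.1
```

The same identity underlies the estimator $\check{Z}_{k,w}$: the reweighting removes the dependence on the realized assignments, leaving only the conditional means evaluated at the perturbed policies.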
Lemma B.6.
Consider the experimental design in Section 4.1. Let Assumptions 1, 2, 3, 5 hold. Then with probability at least $1 - \delta$, for every odd $k$ and every $w \in \{1, \cdots, \check{T}\}$,
$$\check{Z}^{(j)}_{k,w} = V^{(j)}(\check{\beta}^w_k) + O\Big( \sqrt{\frac{\gamma_n \log(\gamma_n K \check{T}/\delta)}{\eta_n^2 n}} + \eta_n + J_n/\eta_n \Big).$$

Proof. Observe that by Lemma B.2, $\eta_n W^{(j)}_{i,t}(\beta)$ is sub-Gaussian with parameter $\tilde{\sigma}$. Similarly, under Assumption 3 (B), $v_{k(i)} Y_{i,0}$ is sub-Gaussian. In addition, under Assumption 1, by Lemma B.4 and Lemma 2.1, since the assignment mechanism is a measurable function of $\check{\beta}^w_k$ in clusters $k, k+1$,
$$\Big\{ \big( W^{(j)}_{i,t}(\check{\beta}^w_k),\ Y_{i,0} \big) \Big\}_{i \in S_{k,t} \cup S_{k+1,t} \cup S_{k,0} \cup S_{k+1,0}} \,\Big|\, \check{\beta}^w_k$$
form a dependency graph (e.g., see Ross et al. (2011)) with maximum degree bounded by $O(\gamma_n)$, since each observation $(W^{(j)}_{i,t}(\check{\beta}^w_k), Y_{i,0})$ depends on at most $\gamma_n$ units in the set $\{W^{(j)}_{j',t}, Y_{j',0}\}_{j' \ne i,\ j' : k(j') = k(i)}$. This follows from Assumption 1 (B), (C), and the fact that $\check{\beta}^w_k$ is estimated using information from all clusters except $\{k, k+1\}$, under the circular cross-fitting and Assumption 5. By Lemma B.1, with probability at least $1 - \delta$,
$$\Big| \check{Z}^{(j)}_{k,w} - E\big[ \check{Z}^{(j)}_{k,w} \mid \check{\beta}^w_k \big] \Big| \le \bar{C}' \sqrt{\frac{\gamma_n \log(\gamma_n/\delta)}{\eta_n^2 n}}, \qquad (36)$$
for a universal finite constant $\bar{C}' < \infty$. Using the triangular inequality, we obtain
$$\Big| \check{Z}^{(j)}_{k,w} - V^{(j)}(\check{\beta}^w_k) \Big| \le \Big| \check{Z}^{(j)}_{k,w} - E\big[ \check{Z}^{(j)}_{k,w} \mid \check{\beta}^w_k \big] \Big| + \Big| V^{(j)}(\check{\beta}^w_k) - E\big[ \check{Z}^{(j)}_{k,w} \mid \check{\beta}^w_k \big] \Big|.$$
The first term is bounded as in Equation (36) and the second term by Lemma B.5. The final result follows by the union bound over $K, \check{T}$.

Lemma B.7 (Adaptive gradient descent for quasi-concave and locally strongly concave functions). Let $\mathcal{B}$ be compact. Define $G = \max\{\sup_{\beta \in \mathcal{B}} \|\beta\|^2, 1\}$. Let Assumptions 3, 6 hold. Let $\kappa$ be a positive finite constant, defined as in Equation (37). Then for any $w \le \check{T}$ with $w \ge \gamma^{-1}(\kappa + 2)\, e^{(G+1)/\gamma}$, the following holds:
$$\|\beta^*_w - \beta^*\|^2 \le \kappa\, w^{-1}.$$

Proof.
To prove the statement, we use properties of gradient descent methods (Hazan et al., 2015), with a key difference from that reference: instead of fixing the estimation error over all iterations, we let the estimation error decrease with $w$.

Preliminaries
Clearly, if the algorithm terminates at $w$, then under Assumption 6 (B) this implies that $\|\beta^*_w - \beta^*\|^2 \le \kappa/\check{T} \le \kappa/w$, proving the claim. Therefore, assume that the algorithm did not terminate at time $w$. Define $\epsilon_w = 1/(w-1)$ and let $\nabla_w$ be the gradient evaluated at $\beta^*_{w-1}$. For every $\beta \in \mathcal{B}$, define $H(\beta)\big|_{[\beta^*, \beta]}$ the Hessian evaluated at some point in the segment $[\beta^*, \beta]$, such that
$$W(\beta) = W(\beta^*) + \frac{1}{2}(\beta - \beta^*)^\top H(\beta)\big|_{[\beta^*, \beta]} (\beta - \beta^*),$$
and define
$$f(\beta) = \frac{1}{2}(\beta - \beta^*)^\top H(\beta)\big|_{[\beta^*, \beta]} (\beta - \beta^*) \le 0,$$
where the inequality follows by definition of $\beta^*$.

Claim
We claim that
$$-|\lambda_{\max}|\, \|\beta - \beta^*\|^2 \le f(\beta) \le -|\lambda_{\min}|\, \|\beta - \beta^*\|^2$$
for constants $\lambda_{\max}, \lambda_{\min} > 0$. The lower bound follows directly by Assumption 3, while the upper bound follows from Assumption 6 (iii) and compactness of $\mathcal{B}$. We provide details for the upper bound in the following paragraph.

Proof of the claim on the upper bound
We now use a contradiction argument. Suppose that the upper bound does not hold. Then there must exist a sequence $\beta_s \in \mathcal{B}$ such that $f(\beta_s) \ge -o(\|\beta_s - \beta^*\|^2)$. Observe first of all that, since the parameter space $\mathcal{B}$ is compact, any sequence such that $\beta_s \to \beta \ne \beta^*$ would contradict the statement, due to global optimality of $\beta^*$ and the fact that $\|\beta - \beta^*\| < \infty$. As a result, we only have to discuss sequences $\beta_s \to \beta^*$. By twice continuous differentiability of $W(\beta)$, we have that $H(\beta_s) \to H(\beta^*)$. As a result, we can find, for $s \ge S$ with $S$ large enough, a point in the sequence such that (since $p$ is finite)
$$2 f(\beta_s) \le (\beta_s - \beta^*)^\top H(\beta^*) (\beta_s - \beta^*) + \delta(s)\, \|\beta_s - \beta^*\|^2, \qquad \delta(s) = p\, \|H(\beta_s) - H(\beta^*)\|_\infty.$$
Since $H(\beta^*)$ is negative definite, the above expression is bounded as follows:
$$2 f(\beta_s) \le -\big( |\tilde{\lambda}_{\min}| - \delta(s) \big)\, \|\beta_s - \beta^*\|^2,$$
where $|\tilde{\lambda}_{\min}| > 0$ denotes the smallest eigenvalue of $H(\beta^*)$ (in absolute value), bounded away from zero by Assumption 6 (iii). Since $\delta(s) \to 0$, we reach a contradiction.
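The two-sided bound in the claim above is the usual eigenvalue bound on quadratic forms of a negative definite Hessian (with the factor $1/2$ absorbed into the constants). A minimal numerical check, using a random hypothetical Hessian of our own construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random negative definite "Hessian" H (hypothetical, 3 x 3).
M = rng.normal(size=(3, 3))
H = -(M @ M.T) - 0.1 * np.eye(3)

eigs = np.linalg.eigvalsh(H)          # all eigenvalues are negative
lam_min = np.abs(eigs).min()          # smallest eigenvalue in absolute value
lam_max = np.abs(eigs).max()          # largest eigenvalue in absolute value

# For f(x) = 0.5 x^T H x, check  -0.5*lam_max*||x||^2 <= f(x) <= -0.5*lam_min*||x||^2.
for _ in range(100):
    x = rng.normal(size=3)
    f = 0.5 * x @ H @ x
    assert -0.5 * lam_max * (x @ x) - 1e-9 <= f <= -0.5 * lam_min * (x @ x) + 1e-9
```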
Cases
Define
$$\kappa = \frac{|\lambda_{\max}|}{|\lambda_{\min}|} \ge 1. \qquad (37)$$
Observe now that if $\|\beta^*_w - \beta^*\|^2 \le \epsilon_w \kappa$, the claim trivially holds. Therefore, consider the case where $\|\beta^*_w - \beta^*\|^2 > \epsilon_w \kappa$.

Comparisons within the neighborhood
Take $\tilde{\beta} = \beta^* - \sqrt{\epsilon_w}\, \nabla_w / \|\nabla_w\|$. Observe that
$$W(\tilde{\beta}) - W(\beta^*_w) = \frac{1}{2}(\tilde{\beta} - \beta^*)^\top H(\tilde{\beta})\big|_{[\beta^*, \tilde{\beta}]} (\tilde{\beta} - \beta^*) - \frac{1}{2}(\beta^*_w - \beta^*)^\top H(\beta^*_w)\big|_{[\beta^*, \beta^*_w]} (\beta^*_w - \beta^*) \ge -|\lambda_{\max}|\, \epsilon_w + |\lambda_{\min}|\, \epsilon_w \kappa = 0.$$
As a result, for all $\beta^*_w$ with $\|\beta^*_w - \beta^*\|^2 > \epsilon_w \kappa$, using quasi-concavity,
$$\nabla_w^\top (\tilde{\beta} - \beta^*_w) \ge 0 \ \Rightarrow\ \nabla_w^\top (\beta^* - \beta^*_w) \ge \sqrt{\epsilon_w}\, \|\nabla_w\|. \qquad (38)$$
Plugging the above expression into the update defining $\beta^*_w$, by construction of the algorithm we write
$$\|\beta^* - \beta^*_w\|^2 \le \|\beta^* - \beta^*_{w-1}\|^2 - 2\alpha_{w-1}\, \nabla_w^\top (\beta^* - \beta^*_w) + \alpha_{w-1}^2\, \|\nabla_w\|^2.$$
By Equation (38), we can write
$$\|\beta^* - \beta^*_w\|^2 \le \|\beta^* - \beta^*_{w-1}\|^2 - 2\alpha_{w-1} \sqrt{\epsilon_w}\, \|\nabla_w\| + \alpha_{w-1}^2\, \|\nabla_w\|^2.$$
Plugging in the expression for $\alpha_w$, and using the fact that $\gamma \le 1$, we have
$$\epsilon_w \kappa \le \|\beta^* - \beta^*_w\|^2 \le \|\beta^* - \beta^*_{w-1}\|^2 - \gamma \epsilon_w.$$

Recursive argument
Observe that if $\|\beta^* - \beta^*_{w-1}\|^2 \le \epsilon_{w-1} \kappa$, then we have
$$\epsilon_w \kappa \le \|\beta^* - \beta^*_w\|^2 \le \frac{\kappa}{w-2} - \frac{\gamma}{w-1},$$
which, after rearranging, implies $w \le (\kappa + 2\gamma)/\gamma$, a contradiction for $w$ above this threshold. As a result, we can assume that $\|\beta^* - \beta^*_{w-1}\|^2 > \kappa\, \epsilon_{w-1}$. Observe that now $\beta^*_{w-1}$ satisfies the same conditions discussed above. Using the recursion for all $s \ge (\kappa+2)/\gamma$, we have
$$\|\beta^* - \beta^*_w\|^2 \le \big\| \beta^* - \beta^*_{(\kappa+2)/\gamma} \big\|^2 - \gamma \sum_{s = (\kappa+2)/\gamma}^{w} \epsilon_s \le G + 1 - \gamma \log(w) + \gamma \log(\kappa/\gamma + 2/\gamma).$$
Whenever $w > \gamma^{-1}(\kappa + 2)\, e^{(G+1)/\gamma}$, the right-hand side is negative, and we have a contradiction. The proof completes.

Lemma B.8.
Let Assumptions 1, 2, 3, 5, 6 hold. Assume that
$$\epsilon_n \ge \sqrt{p}\, \Big[ \bar{C} \sqrt{\frac{\gamma_n \log(\gamma_n \check{T} K/\delta)}{\eta_n^2 n}} + \eta_n + J_n/\eta_n \Big], \qquad \frac{\mu}{\sqrt{\check{T}}} - 2\epsilon_n > 0,$$
for a universal constant $\bar{C} < \infty$. Then with probability at least $1 - \delta$, for any $w \le \check{T}$, either
$$(i)\ \big\| \check{\beta}^w_k - \beta^*_w \big\|_\infty = O\big( P_w(\delta) + p \eta_n \big), \qquad \text{or} \qquad (ii)\ \big\| \check{\beta}^w_k - \beta^* \big\|^2 \le \frac{p \kappa}{\check{T}},$$
where $P_1(\delta) = \mathrm{err}(\delta)$ and
$$P_w(\delta) = \frac{2 B \sqrt{p}}{\nu_n \sqrt{w}}\, P_{w-1}(\delta) + P_{w-1}(\delta) + \frac{2\sqrt{p}}{\nu_n \sqrt{w}}\, \mathrm{err}(\delta),$$
for a finite constant $B < \infty$, with
$$\mathrm{err}(\delta) = O\Big( \sqrt{\frac{\gamma_n \log(\gamma_n p \check{T} K/\delta)}{\eta_n^2 n}} + p \eta_n + J_n/\eta_n \Big), \qquad \nu_n = \frac{\mu}{\sqrt{\check{T}}} - 2\epsilon_n.$$

Proof. First, recall that by Lemma B.6 we can write, for every $k$ and $w$,
$$\check{V}^{(j)}_{k,w} = V^{(j)}(\check{\beta}^w_{k+2}) + O\Big( \sqrt{\frac{\gamma_n \log(\gamma_n K \check{T}/\delta)}{\eta_n^2 n}} + \eta_n + J_n/\eta_n \Big).$$
We now proceed by induction. We first prove the statement assuming that the constraint is never attained; we then discuss the case of the constrained solution. Define $B = p \sup_\beta \big\| \partial^2 W(\beta)/\partial \beta \partial \beta^\top \big\|_\infty$.

Unconstrained case. Consider $w = 1$. Then, since all clusters start from the same starting point $\iota$, we can write, with probability $1 - \delta$, by the union bound and Lemma B.6,
$$\big\| \check{V}_{k,1} - V(\beta^*_1) \big\|_\infty \le \mathrm{err}(\delta). \qquad (39)$$
Consider now the case where the algorithm stops, i.e., $\|\check{V}_{k,1}\| \le \mu/\sqrt{\check{T}} - \epsilon_n$. By Lemma B.6,
$$\|V(\beta^*_1)\| \le \|\check{V}_{k,1}\| + \sqrt{p}\, \mathrm{err}(\delta) \le \frac{\mu}{\sqrt{\check{T}}} - \epsilon_n + \sqrt{p}\, \mathrm{err}(\delta) \le \frac{\mu}{\sqrt{\check{T}}}, \qquad (40)$$
since $\epsilon_n \ge \sqrt{p}\, \mathrm{err}(\delta)$. As a result, the oracle algorithm also stops at $\beta^*_1$, by construction of $\epsilon_n$. Suppose instead that the algorithm does not stop. Then it must be that $\|\check{V}_{k,1}\| \ge \mu/\sqrt{\check{T}} - \epsilon_n$ and
$$\|V(\beta^*_1)\| \ge \frac{\mu}{\sqrt{\check{T}}} - \epsilon_n - \sqrt{p}\, \mathrm{err}(\delta) \ge \frac{\mu}{\sqrt{\check{T}}} - 2\epsilon_n := \nu_n > 0.$$
Observe now that
$$\Big\| \frac{\check{V}_{k,1}}{\|\check{V}_{k,1}\|} - \frac{V(\beta^*_1)}{\|V(\beta^*_1)\|} \Big\|_\infty \le \Big\| \frac{\check{V}_{k,1} - V(\beta^*_1)}{\|V(\beta^*_1)\|} \Big\|_\infty + \Big\| \frac{\check{V}_{k,1} \big( \|\check{V}_{k,1}\| - \|V(\beta^*_1)\| \big)}{\|V(\beta^*_1)\|\, \|\check{V}_{k,1}\|} \Big\|_\infty \le \Big\| \frac{\check{V}_{k,1} - V(\beta^*_1)}{\|V(\beta^*_1)\|} \Big\|_\infty + \sqrt{p}\, \Big\| \frac{\check{V}_{k,1} - V(\beta^*_1)}{\|V(\beta^*_1)\|} \Big\|_\infty. \qquad (41)$$
Then with probability at least $1 - \delta$,
$$(41) \le \frac{2\sqrt{p}}{\nu_n}\, \mathrm{err}(\delta),$$
completing the claim for $w = 1$.

Consider now a general $w$. Define $P_{w-1}$ the error until time $w - 1$. Then for every $j \in \{1, \cdots, p\}$, by Assumption 3, we have, with probability at least $1 - w\delta$ (using the union bound),
$$\check{V}^{(j)}_{k,w} = V^{(j)}(\check{\beta}^w_{k+2}) + \mathrm{err}(\delta) = V^{(j)}\big( \beta^*_w + O(P_{w-1}(\delta)) \big) + \mathrm{err}(\delta) \ \Rightarrow\ \big\| \check{V}_{k,w} - V(\beta^*_w) \big\|_\infty \le B\, P_{w-1}(\delta) + \mathrm{err}(\delta),$$
where the above inequality follows by the mean value theorem and Assumption 3. Suppose now that $\|\check{V}_{k,w}\| \le \mu/\sqrt{\check{T}} - \epsilon_n$. Then, by the same argument as in Equation (40), we have $\|V(\check{\beta}^w_k)\| \le \mu/\sqrt{\check{T}}$, and hence $\|\check{\beta}^w_k - \beta^*\|^2 \le p\kappa/\check{T}$, which proves the statement. Suppose instead that the algorithm does not stop. Then we can write, by the induction argument,
$$\Big\| \check{\beta}^w_k + \frac{1}{\sqrt{w}} \frac{\check{V}_{k,w}}{\|\check{V}_{k,w}\|} - \beta^*_w - \frac{1}{\sqrt{w}} \frac{V(\beta^*_w)}{\|V(\beta^*_w)\|} \Big\|_\infty \le P_{w-1}(\delta) + \underbrace{\frac{1}{\sqrt{w}} \Big\| \frac{\check{V}_{k,w}}{\|\check{V}_{k,w}\|} - \frac{V(\beta^*_w)}{\|V(\beta^*_w)\|} \Big\|_\infty}_{(B)}. \qquad (42)$$
Using the same argument as in Equation (41), we have, with probability at least $1 - \delta$,
$$(B) \le \frac{2\sqrt{p}}{\nu_n \sqrt{w}} \big[ \mathrm{err}(\delta) + B\, P_{w-1}(\delta) \big],$$
which completes the proof for the unconstrained case. The $\check{T}$ component in the error expression follows from the union bound across all $\check{T}$ events.

Constrained case. Since the statement is true for $w = 1$, we can assume that it is true for all $s \le w - 1$. Using the fact that $\mathcal{B}$ is a compact space, we can write
$$\Big\| \Pi_{\mathcal{B}_0, \mathcal{B} - \eta_n}\Big[ \sum_{s=1}^w \alpha_{k,s} \check{V}_{k,s} \Big] - \Pi_{\mathcal{B}_0, \mathcal{B}}\Big[ \sum_{s=1}^w \alpha_s V(\beta^*_s) \Big] \Big\|_\infty \le \Big\| \Pi_{\mathcal{B}_0, \mathcal{B} - \eta_n}\Big[ \sum_{s=1}^w \alpha_{k,s} \check{V}_{k,s} \Big] - \Pi_{\mathcal{B}_0, \mathcal{B} - \eta_n}\Big[ \sum_{s=1}^w \alpha_s V(\beta^*_s) \Big] \Big\|_\infty + p\, O(\eta_n) \le \Big\| \sum_{s=1}^w \alpha_{k,s} \check{V}_{k,s} - \sum_{s=1}^w \alpha_s V(\beta^*_s) \Big\|_\infty + p\, O(\eta_n).$$
For the first component in the last inequality, we follow the same argument as above.
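The adaptive gradient descent analyzed in Lemmas B.7 and B.8 can be mimicked on a toy problem. The sketch below is our own illustration with a made-up strongly concave objective (not the paper's welfare function): it runs projected normalized-gradient ascent with step size $\gamma/\sqrt{w}$ and a small-gradient termination rule, and checks that the iterate approaches the maximizer.

```python
import numpy as np

# Toy strongly concave objective: W(beta) = -0.5 * ||beta - beta_star||^2,
# maximized at beta_star; its gradient is V(beta) = beta_star - beta.
beta_star = np.array([0.6, -0.3])
grad = lambda b: beta_star - b

def project(b, radius=1.0):
    """Projection onto a Euclidean ball (a stand-in for the projection Pi_B)."""
    norm = np.linalg.norm(b)
    return b if norm <= radius else b * (radius / norm)

gamma = 1.0
beta = np.zeros(2)                      # initialization iota
for w in range(1, 20_001):
    g = grad(beta)
    if np.linalg.norm(g) < 1e-8:        # terminate when the gradient is small
        break
    # normalized-gradient step of size gamma / sqrt(w), as in the oracle update
    beta = project(beta + gamma / np.sqrt(w) * g / np.linalg.norm(g))

assert np.linalg.norm(beta - beta_star) < 0.02
```

With the $1/\sqrt{w}$ step the iterate oscillates around the maximizer at a distance of roughly the current step size, which is why the termination threshold in the oracle procedure is tied to $1/\sqrt{\check{T}}$.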
C Theorems
Proof of Theorem 3.1.
The proof follows directly from Lemma B.5, where $\beta$ replaces $\check{\beta}^w_k$, since $\beta$ is exogenous.

Theorem C.1. Let the conditions in Lemma B.8 hold. Then with probability at least $1 - \delta$, for any $k \in \{1, \cdots, K\}$ and any $\check{T} \ge w \ge \zeta$, for $\zeta < \infty$ a universal constant,
$$\|\beta^* - \check{\beta}^w_k\| \le \sqrt{\frac{\kappa}{w}} + \frac{p}{\nu_n}\, e^{B \sqrt{p \check{T} w}} \times O\Big( \sqrt{\frac{\gamma_n \log(p \gamma_n \check{T} K/\delta)}{\eta_n^2 n}} + p \eta_n + J_n/\eta_n \Big),$$
with $\nu_n = \mu/\sqrt{\check{T}} - 2\epsilon_n$, $\kappa, B < \infty$ constants independent of $(p, n, \check{T})$, and $\epsilon_n$ as defined in Lemma B.8.

Proof. We invoke Lemma B.8. Observe that we only have to consider case (i), since under case (ii) the claim trivially holds. Using the triangular inequality, we can write
$$\|\beta^* - \check{\beta}^w_k\| \le \|\beta^* - \beta^*_w\| + \|\check{\beta}^w_k - \beta^*_w\|.$$
The first component on the right-hand side is bounded by Lemma B.7, with $\zeta$ defined as in that lemma. Using Lemma B.8, we bound the second component as follows:
$$\|\check{\beta}^w_k - \beta^*_w\| \le p\, \|\check{\beta}^w_k - \beta^*_w\|_\infty = p \times O(P_w(\delta)).$$
We conclude the proof by explicitly bounding
$$P_w = \Big( 1 + \frac{2 B \sqrt{p}}{\nu_n \sqrt{w}} \Big) P_{w-1} + \frac{1}{\sqrt{w}}\, \mathrm{err}_n(\delta), \qquad \mathrm{err}_n(\delta) = \frac{\sqrt{p}}{\nu_n}\, O\Big( \sqrt{\frac{\gamma_n \log(p \check{T} K/\delta)}{\eta_n^2 n}} + p \eta_n + J_n/\eta_n \Big),$$
where $B < \infty$ denotes a finite constant. Using a recursive argument, we obtain
$$P_w = \mathrm{err}_n(\delta) \sum_{s=1}^w \alpha_s \prod_{j=s}^w \Big( \frac{2 B \sqrt{p}}{\nu_n \sqrt{j}} + 1 \Big), \qquad \alpha_s = \frac{1}{\sqrt{s}}.$$
Recall now that $\nu_n \ge \mu/(4\sqrt{\check{T}})$ as in Lemma B.8. As a result, we can bound the above expression as
$$\mathrm{err}_n(\delta) \sum_{s=1}^w \alpha_s \prod_{j=s}^w \Big( \frac{2 B \sqrt{p}}{\nu_n \sqrt{j}} + 1 \Big) \le \mathrm{err}_n(\delta) \sum_{s=1}^w \alpha_s \prod_{j=s}^w \Big( \frac{8 \mu^{-1} \sqrt{\check{T}} B \sqrt{p}}{\sqrt{j}} + 1 \Big) \le \mathrm{err}_n(\delta) \sum_{s=1}^w \alpha_s \exp\Big( \sum_{j=s}^w \frac{8 \mu^{-1} \sqrt{\check{T}} B \sqrt{p}}{\sqrt{j}} \Big).$$
Moreover,
$$\exp\Big( \sum_{j=s}^w \frac{8 \mu^{-1} \sqrt{\check{T}} B \sqrt{p}}{\sqrt{j}} \Big) \le \exp\Big( 16 \mu^{-1} \sqrt{\check{T}} B \sqrt{p}\, \big( w^{1/2} - s^{1/2} + 1 \big) \Big) \lesssim \exp\big( B' \sqrt{p \check{T} w} \big)\, e^{-s^{1/2}},$$
for a finite constant $B'$. We now write
$$\sum_{s=1}^w \alpha_s \prod_{j=s}^w \Big( \frac{2 B \sqrt{p}}{\nu_n \sqrt{j}} + 1 \Big) \lesssim \sum_{s=1}^w \frac{1}{\sqrt{s}}\, e^{-s^{1/2}}\, e^{B' \sqrt{p \check{T} w}} \lesssim e^{B' \sqrt{p \check{T} w}},$$
completing the proof (constants are absorbed into $B$).

Corollary.
Theorem 4.2 holds.

Proof. Consider Lemma B.8, where we choose $\delta = 1/n$. Observe that we choose $\epsilon_n \le \mu/(4\sqrt{\check{T}})$, which is attained by the conditions in Lemma B.8 as long as $n$ is large enough that
$$\sqrt{p}\, \Big[ \bar{C} \sqrt{\frac{\log(n)\, \gamma_n \log(p \gamma_n \check{T} K)}{\eta_n^2 n}} + \eta_n + J_n/\eta_n \Big] \le \frac{\mu}{4\sqrt{\check{T}}},$$
which holds under the assumptions stated. As a result, we have $\nu_n \ge \mu/(2\sqrt{\check{T}})$. The claim directly follows from Theorem C.1.

Corollary.
Let the conditions in Theorem C.1 hold. Then with probability at least $1 - \delta$, for a finite constant $B < \infty$,
$$\tau(\beta^*) - \tau(\hat{\beta}^*) \lesssim \frac{p}{\check{T}+1} + \frac{p}{\nu_n}\, e^{B \sqrt{p}\, \check{T}} \times O\Big( \sqrt{\frac{\gamma_n \log(p \gamma_n \check{T} K/\delta)}{\eta_n^2 n}} + p \eta_n + J_n/\eta_n \Big).$$

Proof. We have
$$\Big\| \beta^* - \frac{1}{K} \sum_{k} \check{\beta}^{\check{T}+1}_k \Big\| \le \frac{1}{K} \sum_{k} \big\| \check{\beta}^{\check{T}+1}_k - \beta^* \big\|.$$
The proof concludes by Theorem C.1 and Assumption 3, after observing that $\tau(\beta^*) - \tau(\hat{\beta}^*) \lesssim p\, \|\beta^* - \hat{\beta}^*\|$.

Corollary. Theorem 4.3 holds.

Proof.
By the mean value theorem and Assumption 3, we have
$$\sum_{w=1}^{\check{T}} \big[ \tau(\beta^*) - \tau(\check{\beta}^w_k) \big] \le \bar{C} p \sum_{w=1}^{\check{T}} \big\| \beta^* - \check{\beta}^w_k \big\|,$$
for a universal constant $\bar{C} < \infty$. We now take $w \ge \zeta$, for $\zeta < \infty$ such that Lemma B.7 holds. By Theorem C.1, for $n$ satisfying the conditions in Theorem 4.2, with $\delta = 1/n$, with probability at least $1 - 1/n$, using a second-order Taylor expansion and the boundedness condition on the Hessian in Assumption 3, we have
$$\sum_{w > \zeta} \big[ \tau(\beta^*) - \tau(\check{\beta}^w_k) \big] \le \sum_{w > \zeta} \frac{p \kappa'}{w} \lesssim p \log(\check{T}),$$
for $\kappa' < \infty$ being a finite constant. Finally, using the fact that $\mathcal{B}$ is a compact space, we write
$$\sum_{w \le \zeta} \big\| \beta^* - \check{\beta}^w_k \big\| \le \zeta B_0 < \infty$$
for a universal constant $B_0$, completing the proof.

Corollary.
Theorem 5.3 holds.

Proof. The proof follows directly from Theorem C.1, after noticing that every two periods the function is evaluated at the same vector of parameters, $\Gamma(\check{\beta}^w, \check{\beta}^w)$. Therefore, we can apply all our results to the function $\beta \mapsto \Gamma(\beta, \beta)$, which satisfies the same conditions as $W(\beta)$.
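The $p \log(\check{T})$ regret term in the corollaries above comes from summing per-wave errors of order $1/w$; the underlying harmonic-sum bound $\log T \le \sum_{w=1}^{T} 1/w \le 1 + \log T$ can be checked directly (our own numerical illustration):

```python
import math

# Check log(T) <= sum_{w=1}^{T} 1/w <= 1 + log(T), the bound behind the
# p * log(T) cumulative regret term.
for T in [2, 10, 100, 10_000]:
    harmonic = sum(1.0 / w for w in range(1, T + 1))
    assert math.log(T) <= harmonic <= 1.0 + math.log(T)
```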
Let Assumption 1, 2, 3, 4, 5. Then ˇ Z ( j ) k,w − V ( j ) ( ˇ β k,t ) (cid:113) Var( ˇ Z k,w | ˇ β wk ) + B n → d N (0 , , where B n = O ( η n × √ n + J n / (cid:112) η n ρ n ) Proof.
By Lemma B.5, we have

E[ Ž_{k,w}^{(j)} | β̌_k^w ] = V^{(j)}(β̌_k^w) + O( η_n + J_n/η_n ).
We have

( Ž_{k,w}^{(j)} − E[Ž_{k,w}^{(j)} | β̌_k^w] ) / √Var(Ž_{k,w}^{(j)} | β̌_k^w) = ( Ž_{k,w}^{(j)} − V^{(j)}(β̌_k^w) ) / √Var(Ž_{k,w}^{(j)} | β̌_k^w) + O( (η_n + J_n/η_n) / √Var(Ž_{k,w}^{(j)} | β̌_k^w) ).

Observe that under Assumption 4,

O( (η_n + J_n/η_n) / √Var(Ž_{k,w}^{(j)} | β̌_k^w) ) ≤ O( η_n √n + J_n / √(η_n ρ_n) ).

We now invoke Lemma A.1. Define t = t(j, w). First, define

H_{i,t} = (1/n) W_{i,t}(β̌_k^w),   H_{i,0} = (2 v_{k(i)} / (η_n n)) Y_{i,0}.

Following the same reasoning as in Lemma B.6, we observe that

( H_{i,t}, H_{i,0} )_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k,0} ∪ S_{k+1,0}} | β̌_k^w

form a dependency graph with maximum degree of order O(γ_n). To see why, notice that H_{i,t} depends on at most γ_n + 1 elements (H_{j,t}, H_{j,0}), and similarly for H_{i,0}, conditional on β̌_k^w, since, under the cross-fitting algorithm, β̌_k^w is estimated without using information from clusters {k, k+1}.
In addition, under Assumption 3 and Lemma B.2,

E[ H_{i,t} | β̌_k^w ], E[ H_{i,0} | β̌_k^w ] ≤ c′/(n η_n) < ∞,

since 1/η_n ≤ n, for a constant c′ < ∞. Define σ² = Var( Σ_{i ∈ S_{k,t} ∪ S_{k+1,t}} H_{i,t} − Σ_{i ∈ S_{k,0} ∪ S_{k+1,0}} H_{i,0} ). Using Lemma A.1 and the triangle inequality, we write

d_W( Σ_{i ∈ S_{k,t} ∪ S_{k+1,t}} H_{i,t} + Σ_{i ∈ S_{k,0} ∪ S_{k+1,0}} H_{i,0}, G )
≤ (γ_n²/σ³) Σ_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k,0} ∪ S_{k+1,0}} [ E|H_{i,t}|³ + E|H_{i,0}|³ ]   (A)
+ ( √28 γ_n^{3/2} / (√π σ²) ) √( Σ_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k,0} ∪ S_{k+1,0}} [ E[H_{i,t}⁴] + E[H_{i,0}⁴] ] )   (B),

where G ∼ N(0, 1) and d_W denotes the Wasserstein metric. We now inspect each term on the right-hand side. Under Assumption 4, we have

(A) ≤ C′ γ_n/(n η_n) × n^{1/2} η_n = γ_n/n^{1/2} → 0.
Similarly, for (B), we have

(B) ≤ c′ γ_n^{3/2}/(n η_n) × η_n² n^{1/2} = γ_n^{3/2} η_n / n^{1/2} → 0.

The proof completes.
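As a small numerical illustration (our own sketch, not part of the paper: the m-dependent construction and all constants below are hypothetical), the dependency-graph central limit theorem used in the proof can be seen at work on a toy sequence in which each term depends only on a bounded number of neighbors:

```python
import numpy as np

# Toy illustration of the dependency-graph CLT: each h_i below depends on
# three consecutive Gaussian innovations, so the dependency graph has bounded
# maximum degree, and the standardized sum is approximately N(0, 1).
rng = np.random.default_rng(0)
n, reps = 500, 2000
sums = np.empty(reps)
for r in range(reps):
    z = rng.standard_normal(n + 2)
    h = (z[:-2] + z[1:-1] + z[2:]) / np.sqrt(3.0)  # m-dependent sequence
    sums[r] = h.sum()

# Exact variance of sum(h) for this moving-average construction.
sigma2 = 3.0 * n - 8.0 / 3.0
standardized = sums / np.sqrt(sigma2)
```

Despite the local dependence, the standardized sums have mean near 0 and variance near 1, consistent with the vanishing Wasserstein bound above.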
Corollary.
Theorem 3.2 holds.

Proof.
First observe that since Ť = 1 and K ≥ 2, Assumption 5 is satisfied. Therefore, the result follows by Theorem C.2 and between-cluster independence over the first period t = 1 (Assumption 1).

Proof of Theorem 3.3.
Take

t_{jz} = √z X̄_j / √( (z − 1)^{-1} Σ_{i=1}^{z} (X_{ji} − X̄_j)² ),   X_{ji} ∼ N(0, σ²_{ji}),   X̄_j = z^{-1} Σ_{i=1}^{z} X_{ji}.

Recall that by Theorem 1 in Ibragimov and Müller (2010) and Bakirov and Székely (2006), for α ≤ 0.08 we have

sup_{σ²_1, ···, σ²_q} P( |t_{jq}| ≥ cv_α ) = P( |T_{q−1}| ≥ cv_α ),

where cv_α is the critical value of a t-test with level α, and T_{q−1} is a Student-t random variable with q − 1 degrees of freedom. We can write

P( T_n ≥ q | H_0 ) = P( max_{j ∈ {1,···,p̃}} |Q_{j,n}| ≥ q | H_0 ) = 1 − P( |Q_{j,n}| ≤ q ∀ j | H_0 ) = 1 − Π_{j=1}^{p̃} P( |Q_{j,n}| ≤ q | H_0 ),

where the last equality follows by between-cluster independence (Assumption 1). Observe now that by Theorem 3.2 and the fact that the rate of convergence is the same for all clusters (Assumption 4), for all j, for some (σ²_1, ···, σ²_z), z = K̃,

sup_q | P( |Q_{j,n}| ≤ q | H_0 ) − P( |t_{jK̃}| ≤ q ) | = o(1).

As a result, we can write

sup_{σ²_1, ···, σ²_K} lim_{n→∞} 1 − Π_{j=1}^{p̃} P( |Q_{j,n}| ≤ q | H_0 ) = 1 − Π_{j=1}^{p̃} inf_{σ²_{j1}, ···, σ²_{jK̃}} P( |t_{jK̃}| ≤ q ).

Using the result in Bakirov and Székely (2006), we have

inf_{σ²_{j1}, ···, σ²_{jK̃}} P( |t_{jK̃}| ≤ q ) = P( |T_{K̃−1}| ≤ q ).

Therefore,

1 − Π_{j=1}^{p̃} inf_{σ²_{j1}, ···, σ²_{jK̃}} P( |t_{jK̃}| ≤ q ) = 1 − P^{p̃}( |T_{K̃−1}| ≤ q ).

Setting the expression equal to α, we obtain

1 − P^{p̃}( |T_{K̃−1}| ≤ q ) = α  ⇒  P( |T_{K̃−1}| ≥ q ) = 1 − (1 − α)^{1/p̃}.

The proof completes after solving for q.

Corollary.
Theorem 5.2 holds.

Proof.
The proof follows directly as a corollary of Theorem 3.2 and the results on t-statistics in Ibragimov and Müller (2010).
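As a hedged illustration (the function name and the scipy dependency are ours, not from the paper), the critical value q solving the final display in the proof of Theorem 3.3, P(|T_{K̃−1}| ≥ q) = 1 − (1 − α)^{1/p̃}, can be computed as:

```python
from scipy.stats import t as student_t

def adjusted_critical_value(alpha, p_tilde, K_tilde):
    """Critical value q with P(|T_{K_tilde-1}| >= q) = 1 - (1 - alpha)**(1/p_tilde),
    the condition obtained at the end of the proof of Theorem 3.3."""
    # Per-coordinate level after the adjustment for p_tilde coordinates.
    alpha_adj = 1.0 - (1.0 - alpha) ** (1.0 / p_tilde)
    # Two-sided Student-t quantile with K_tilde - 1 degrees of freedom.
    return student_t.ppf(1.0 - alpha_adj / 2.0, df=K_tilde - 1)
```

For p̃ = 1 this reduces to the usual two-sided t critical value; for p̃ > 1 it is a Šidák-type adjustment, exact here because the Q_{j,n} are independent across clusters.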
Proof of Theorem 5.1.
We follow the same proof as Lemma B.5. Recall the expression of the estimator in Equation (26). The estimator depends on three components:

∆̂_{k,1}^{(j)}(β),   Ŝ_{k,1}^{(j)}(0, β),   B̂_{k,1}^{(j)}.   (43)

The expectation of the first component follows similarly to what is discussed in the proof of Lemma B.5, component (A) in Equation (35), since the fixed effects α_{k,t}(x) cancel out once
(Footnote: here we use continuity of the Gaussian distribution and the fact that p̃ is finite.)
we differentiate the treated and the control units. As a result, it suffices to study the component Ŝ_{k,t}^{(j)}(0, β) and the bias component B̂_{k,t}^{(j)}. We start from Ŝ_{k,1}^{(j)}(0, β). Using the same argument as in Lemma B.5, by Assumption 1 and Lemma B.4, we have

E[ Ŝ_{k,1}^{(j)}(0, β) ] = (1/(2n)) Σ_{i ∈ S_{k,t} ∪ S_{k+1,t}} E[ (v_{k(i)} (1 − e(X_i; β)) / η_n) × Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β + v_{k(i)} η_n e_j)) ]
= (1/(2η_n)) ∫ α_{t,k}(x) dx + (1/(2η_n)) ∫ Σ_{b ∈ {k,k+1}} m(0, x, β + v_b η_n e_j)(1 − e(x; β)) f_X(x) dx + O(J_n/η_n)   (I),

where the O(J_n/η_n) term follows from Assumption 3. Using Assumption 3 and a second-order Taylor expansion of m(0, x, ·) around β, we have

(I) = (1/(2η_n)) ∫ ( α_{1,k}(x) − α_{1,k+1}(x) ) dx + ∫ (∂m(0, x, β)/∂β^{(j)}) (1 − e(x; β)) f_X(x) dx + O( J_n/η_n + η_n ).

Consider now the bias component. Using Assumption 9 and the fact that spillovers do not occur on the treated, we have

E[ B̂_{k,t}^{(j)}(β) ] = (1/(2η_n)) ∫ ( α_{1,k}(x) − α_{1,k+1}(x) ) dx,

completing the proof.

D Regret guarantees under global strong concavity
In this section, we discuss theoretical guarantees of the algorithm, assuming global strong concavity of the objective function W(β).

Oracle gradient descent under concavity
We define

β*_w = Π_{[B̲, B̄]}[ β*_{w−1} + α_{w−1} V(β*_{w−1}) ],   β*_0 = ι,   (44)

with α_w = η/(w + 1), equal for all clusters. In the following lemmas and theorem, we consider the concave version of gradient descent. The following lemma follows by standard properties of the gradient descent algorithm (Bottou et al., 2018).

Lemma D.1.
For learning rate α_w = η/(w + 1) and β*_w as defined in Equation (44), under Assumption 3, for η ≤ 1/l, let L = max{ p(B̄ − B̲)², η² G² }, with G being the upper bound on the gradient and l > 0 a lower bound on the eigenvalues of the negative Hessian of W(β). Let W(β) be strongly concave. Then the following holds:

‖β*_w − β*‖² ≤ L/w

for a constant L < ∞.
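The oracle recursion in Equation (44) is projected gradient ascent with a decaying learning rate. As a purely illustrative sketch (the quadratic objective, target vector, bounds, and step size below are hypothetical choices, not the paper's problem):

```python
import numpy as np

def oracle_ascent(grad, beta0, eta, lower, upper, T):
    """Projected gradient ascent: beta_{w} = Proj[ beta_{w-1} + alpha_w * V(beta_{w-1}) ]."""
    beta = np.asarray(beta0, dtype=float)
    path = [beta.copy()]
    for w in range(T):
        alpha_w = eta / (w + 1)             # decaying learning rate, as in (44)
        beta = beta + alpha_w * grad(beta)  # ascent step on the gradient V
        beta = np.clip(beta, lower, upper)  # projection onto the box [B_l, B_u]^p
        path.append(beta.copy())
    return path

# Toy strongly concave objective W(beta) = -0.5 * ||beta - target||^2,
# with gradient V(beta) = target - beta; target and bounds are made up.
target = np.array([0.3, -0.2])
path = oracle_ascent(lambda b: target - b, beta0=np.zeros(2),
                     eta=0.5, lower=-1.0, upper=1.0, T=200)
```

For this toy problem the squared error ‖β*_w − β*‖² decays at roughly the 1/w rate claimed in Lemma D.1.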
Lemma D.2.
Let Assumptions 1, 2, 3, and 5 hold. Then, with probability at least 1 − δ,

‖ Π_{[B̲, B̄ − η_n]}[ Σ_{w=1}^{Ť} α_w V̌_{k,w} ] − Π_{[B̲, B̄]}[ Σ_{w=1}^{Ť} α_w V(β*_w) ] ‖_∞ = O( P_Ť(δ) ),

where P_1(δ) = α_1 err_1(δ) and P_w(δ) = B α_w P_{w−1}(δ) + P_{w−1}(δ) + α_w err_w(δ), for a finite constant B < ∞, and

err_w(δ) = O( √( γ_n log(p Ť K/δ) / (η_n n) ) + p η_n + J_n/η_n ).

Proof. Recall that by Lemma B.6 we can write, for every k and w,

V̌_{k,w}^{(j)} = V^{(j)}(β̌_{k+2}^w) + O( √( γ_n log(K Ť/δ) / (η_n n) ) + η_n + J_n/η_n ).

We now proceed by induction. We first prove the statement assuming that the constraint is never binding; we then discuss the case of the constrained solution. Define B = √p sup_β ‖ ∂²W(β)/∂β∂β′ ‖_∞.

Unconstrained case
Consider w = 1. Then, since all clusters start from the same starting point ι, we can write, with probability 1 − δ,

‖ α_1 V̌_{k,1} − α_1 V(β*_0) ‖_∞ = α_1 err_1(δ).

For w = 2, we obtain, for every j ∈ {1, ···, p},

α_2 V̌_{k,2}^{(j)} = α_2 V^{(j)}(β̌_{k+2}^1) + α_2 err_2(δ) = α_2 V^{(j)}( β*_0 + α_1 V(β*_0) + α_1 err_1(δ) ) + α_2 err_2(δ).

Using the mean value theorem and Assumption 3, for a finite universal constant B < ∞, we obtain

‖ α_2 V̌_{k,2}^{(j)} − α_2 V^{(j)}(β*_1) ‖_∞ ≤ α_2 err_2(δ) + B α_2 α_1 err_1(δ)
⇒ ‖ Σ_{w=1}^{2} α_w V̌_{k,w}^{(j)} − Σ_{w=1}^{2} α_w V^{(j)}(β*_{w−1}) ‖_∞ ≤ α_2 err_2(δ) + B α_2 α_1 err_1(δ) + α_1 err_1(δ).

Consider now a general w. Then we can write, with probability 1 − δ,

α_w V̌_{k,w} = α_w V(β̌_{k+2}^{w−1}) + α_w err_w(δ).

Let P_w = B α_w P_{w−1} + P_{w−1} + α_w err_w(δ), with P_1 = α_1 err_1(δ). Using the induction argument, we write

α_w V̌_{k,w} ≤ α_w V( β*_{w−1} + P_{w−1} ) + α_w err_w(δ).

Using the mean value theorem and Assumption 3, we obtain

α_w V̌_{k,w} ≤ α_w V(β*_{w−1}) + α_w B P_{w−1} + α_w err_w(δ).

Taking the sum, we obtain, with probability 1 − wδ (these events hold jointly by the union bound),

‖ Σ_{s=1}^{w} α_s V̌_{k,s} − Σ_{s=1}^{w} α_s V(β*_{s−1}) ‖_∞ ≤ α_w B P_{w−1} + P_{w−1} + α_w err_w(δ).

Constrained case
Since the statement is true for w = 1, we can assume that it is true for all s ≤ w − 1. Using the fact that B is a compact space, we can write

‖ Π_{[B̲, B̄ − η_n]}[ Σ_{s=1}^{w} α_s V̌_{k,s} ] − Π_{[B̲, B̄]}[ Σ_{s=1}^{w} α_s V(β*_{s−1}) ] ‖_∞
≤ ‖ Π_{[B̲, B̄ − η_n]}[ Σ_{s=1}^{w} α_s V̌_{k,s} ] − Π_{[B̲, B̄ − η_n]}[ Σ_{s=1}^{w} α_s V(β*_{s−1}) ] ‖_∞ + p η_n
≤ ‖ Σ_{s=1}^{w} α_s V̌_{k,s} − Σ_{s=1}^{w} α_s V(β*_{s−1}) ‖_∞ + p η_n,

completing the proof.

Theorem D.3.
Let the conditions in Theorem C.1 and Lemma D.1 hold. Choose α_w = η/w. Then, with probability at least 1 − δ,

‖ β* − β̌_k^{Ť+1} ‖ ≤ √(L/Ť) + √p Ť^B × O( √( γ_n log(Ť K/δ) / (η_n n) ) + p η_n + J_n/η_n ),

for a finite constant B < ∞.

Proof. Using the triangle inequality, we can write

‖ β* − β̌_k^{Ť+1} ‖ ≤ ‖ β* − β*_{Ť+1} ‖ + ‖ β̌_k^{Ť+1} − β*_{Ť+1} ‖.

The first component on the right-hand side is bounded by Lemma D.1. Using Lemma D.2, we bound the second component as follows:

‖ β̌_k^{Ť} − β*_{Ť} ‖ ≤ √p ‖ β̌_k^{Ť} − β*_{Ť} ‖_∞ = √p × O( P_Ť(δ) ).

We conclude the proof by explicitly deriving the rate of P_Ť(δ). We can simplify P_w to the expression

P_w = (1 + B/w) P_{w−1} + (1/w) err_n,

where err_n = O( √( γ_n log(p Ť K/δ) / (η_n n) ) + p η_n + J_n/η_n ). Using a recursive argument, we obtain

P_w = err_n Σ_{s=1}^{w} α_s Π_{j=s}^{w} ( B/j + 1 ).
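As a quick numerical sanity check of this recursion (our own illustration; the constants B and err_n below are arbitrary), P_w = (1 + B/w) P_{w−1} + err_n/w grows no faster than a constant times err_n · w^B:

```python
import numpy as np

def recursion_path(B, err_n, W):
    """Iterate P_w = (1 + B/w) * P_{w-1} + err_n / w with P_0 = 0."""
    P, path = 0.0, []
    for w in range(1, W + 1):
        P = (1.0 + B / w) * P + err_n / w
        path.append(P)
    return np.array(path)

B, err_n, W = 1.5, 0.01, 2000          # illustrative constants
path = recursion_path(B, err_n, W)
ratios = path / (err_n * np.arange(1, W + 1) ** B)  # should stay bounded
```

The ratio P_w / (err_n · w^B) stabilizes rather than diverging, consistent with the bound derived next.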
We now write

Σ_{s=1}^{w} α_s Π_{j=s}^{w} ( B/j + 1 ) ≲ Σ_{s=1}^{w} (1/s^{B+1}) e^{B log(w)} ≲ w^B,

completing the proof.

E Further mathematical details
E.1 Gradient estimator for non-stationary policies
For expositional simplicity, we only consider a triad {k, k+1, k+2}. For notational convenience, we define β_{k,t} as the policy assigned to cluster k at time t according to the randomization in Equation (31). Define

∆(x, β, φ) = m(1, x, β, φ) − m(0, x, β, φ).

Observe that we can write

∂Γ(β, φ)/∂β = ∫ { (∂e(x; β)/∂β) ∆(x, β, φ) + ∂m(0, x, β, φ)/∂β + e(x; β) ∂∆(x, β, φ)/∂β + c(x) ∂e(x; β)/∂β } dF_X(x),

∂Γ(β, φ)/∂φ = ∫ { e(x; β) ∂∆(x, β, φ)/∂φ + ∂m(0, x, β, φ)/∂φ } dF_X(x).

We now discuss the estimation of each component. We take

∆̂_{k,t}(β) = (1/(3n)) Σ_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k+2,t}} (∂e(X_i; β)/∂β) [ Y_{i,t} D_{i,t} / e(X_i; β_{k(i),t}) − Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β_{k(i),t})) ].

The above estimator is centered around the target estimand up to a factor of order O(η_n + J_n), as discussed in Section 3. We now discuss the estimation of the marginal effects. Define

u_{h,t} = −1 if h = k; 1 if h = k+1 and t is odd, or h = k+2 and t is even; 0 otherwise.

Intuitively, the above indicator equals one whenever the cluster is the one in the triad assigned a perturbation in the current period, and minus one for the reference cluster k. The estimator of the marginal spillover effect in the current period is constructed by taking

Ŝ_{k,t}(β) = (1/n) Σ_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k+2,t}} (u_{k(i),t}/η_n) e(X_i; β) [ Y_{i,t} D_{i,t} / e(X_i; β_{k(i),t}) − Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β_{k(i),t})) ] + (u_{k(i),t}/η_n) Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β_{k(i),t})),

where β_{k,t} denotes the (perturbed) parameter assigned to cluster k at time t. Its justification follows similarly to what is discussed in Section 3, with the difference that here the cluster under perturbation is one of the three clusters, which alternates every other period t.
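The alternating perturbation scheme within a triad can be sketched as follows (a hypothetical reading of the indicators used in this subsection, under the assumption that cluster k is the reference with weight −1 while clusters k+1 and k+2 alternate as the perturbed cluster; the function names are ours):

```python
def u_indicator(h, k, t):
    """Current-period weight: -1 for the reference cluster k, +1 for the
    cluster in the triad perturbed at period t, 0 otherwise."""
    if h == k:
        return -1
    if (h == k + 1 and t % 2 == 1) or (h == k + 2 and t % 2 == 0):
        return 1
    return 0

def p_indicator(h, k, t):
    """Previous-period weight: +1 for the cluster that was perturbed at t - 1."""
    if h == k:
        return -1
    if (h == k + 1 and t % 2 == 0) or (h == k + 2 and t % 2 == 1):
        return 1
    return 0
```

Under this reading, at every period exactly one cluster in the triad receives weight +1 from u, and p flags the cluster that received weight +1 in the previous period.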
We can estimate the marginal effect of coordinate (j) in the current period by taking

∆̂_{k,s}^{(j)}(θ̌_k^w) + Ŝ_{k,s}^{(j)}(θ̌_k^w).

We now discuss estimating the marginal effect in the previous period. Define

p_{h,t} = −1 if h = k; 1 if h = k+1 and t is even, or h = k+2 and t is odd; 0 otherwise.

The above indicator equals one for the cluster that was subject to perturbation in the previous period. We can now use the same rationale as before and estimate the effect in the previous period as

Û_{k,t}(β) = (1/n) Σ_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k+2,t}} (p_{k(i),t}/η_n) e(X_i; β) [ Y_{i,t} D_{i,t} / e(X_i; β_{k(i),t}) − Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β_{k(i),t})) ] + (p_{k(i),t}/η_n) Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β_{k(i),t})).

The final estimator of the marginal effect, F̌_{k,w}^{(j)}, weights each component (the marginal effects from the previous and current periods) over periods s ∈ {1, ···, T*} by the functions f_{θ,t}(ι) over the path, and reads as follows:

F̌_{k,w}^{(j)} = Σ_{s=1}^{T*} f_{θ̌_k^w, s}(ι) [ ∆̂_{k,s}^{(j)}(θ̌_k^w) + Ŝ_{k,s}^{(j)}(θ̌_k^w) ] + f_{θ̌_k^w, s−1}(ι) Û_{k,s}^{(j)}(θ̌_k^w),

with f_·(·) as defined in Equation (30).

E.2 Proof of Lemma D.1

Proof.
We follow a standard argument for gradient descent. Denote by β* the estimand of interest, and recall the definition of β*_t in Equation (44). We define ∇_t as the gradient evaluated at β*_{t−1}. From strong concavity, we can write

τ(β*) − τ(β*_t) ≤ (∂τ(β*_t)/∂β)′(β* − β*_t) − (l/2) ‖β* − β*_t‖²,
τ(β*_t) − τ(β*) ≤ (∂τ(β*)/∂β)′(β*_t − β*) − (l/2) ‖β* − β*_t‖².

As a result, summing the two inequalities and using ∂τ(β*)/∂β = 0, we have

(∂τ(β*_t)/∂β)′(β* − β*_t) ≥ l ‖β*_t − β*‖².   (45)

In addition, we can write

‖β*_{t+1} − β*‖² = ‖ β* − Π_{[B̲, B̄]}(β*_t + α_t ∇_t) ‖² ≤ ‖ β* − β*_t − α_t ∇_t ‖²,

where the last inequality follows from the Pythagorean theorem (the projection onto a convex set is non-expansive). Observe that we have

‖β*_{t+1} − β*‖² ≤ ‖β* − β*_t‖² − 2 α_t ∇_t′(β* − β*_t) + α_t² ‖∇_t‖².

Using Equation (45), we can write

‖β*_{t+1} − β*‖² ≤ (1 − 2 l α_t) ‖β*_t − β*‖² + α_t² G².

We now prove the statement by induction. Clearly, at time t = 0 the statement trivially holds. Consider a general time t. Then, using the induction argument, we write

‖β*_{t+1} − β*‖² ≤ (1 − 2/(t+1)) L/t + L/(t+1)² ≤ (1 − 2/(t+1)) L/t + L/(t(t+1)) = (1 − 1/(t+1)) L/t = L/(t+1).