Policy choice in experiments with unknown interference∗

Davide Viviano†

First version: November 16, 2020. This version: January 22, 2021.
Abstract
This paper discusses experimental design for inference and estimation of individualized treatment allocation rules in the presence of unknown interference. We consider a setting where units are organized into large, finitely many independent clusters and interact over unobserved dimensions within each cluster. The contribution of this paper is two-fold. First, we design a short pilot study with few clusters to test whether there exists a welfare-improving treatment configuration and hence whether it is worth learning one by conducting a larger-scale experiment. We propose a practical test that uses information on the marginal effect of the policy on welfare to compare the base-line intervention against any possible alternative. Second, we introduce a sequential randomization procedure to estimate welfare-maximizing individual treatment allocation rules valid under unobserved (and partial) interference. We propose nonparametric estimators of direct treatment effects and marginal spillover effects, which serve for hypothesis testing and policy design. We derive the estimators' asymptotic properties and small-sample regret guarantees of the policy estimated through the sequential experiment. Finally, we illustrate the method's advantage in simulations calibrated to an existing experiment on information diffusion.
Keywords:
Policy Targeting, Causal Inference, Experimental Design, Welfare Maximization, Spillovers, Individualized Treatments.
JEL Codes:
C10, C14, C31, C54.

∗I thank Graham Elliott, James Fowler, Paul Niehaus, Yixiao Sun and Kaspar Wüthrich for helpful comments and discussion. All mistakes are my own.

†Department of Economics, UC San Diego. Correspondence: [email protected].

Introduction
One of the main objectives of experiments is to identify the most effective policy. This paper addresses two questions that the decision-maker faces in practice: (i) "Does the baseline policy lead to the largest welfare compared to any possible alternative, and, hence, is a possibly large-scale experiment necessary for improving current decisions?" (ii) "How should the experiment be designed for estimating welfare-maximizing treatment allocation rules?". The presence of interference challenges these questions: treatment effects may spill over across individuals as a result of unobserved interactions.

Network effects often play a crucial role in the design of policies. However, a major challenge for the policymaker is the cost associated with observing and collecting network information (Breza et al., 2020). In a development study, for example, collecting network information in each village (Cai et al., 2015; Banerjee et al., 2013) often requires enumerating each individual in the population and collecting information on her friends. The design of experiments for estimating welfare-maximizing treatment allocation rules under unknown interference (as opposed to estimating treatment effects) has been unexplored by past literature.

This paper answers the two questions above in a setting where units are organized into large, finitely many independent clusters, such as cities, schools, villages, or districts. Within each cluster, interference occurs locally through an unobserved network. Researchers have access to an adaptive experiment over finitely many T periods, i.e., they can sequentially assign treatments and observe outcomes over each iteration. We use only two periods of experimentation and two or more clusters (i.e., a pilot study) to ascertain whether some treatment configuration will be welfare improving and hence worth learning by conducting the rest of the experiment.
We then discuss a sequential procedure to estimate welfare-maximizing policies, which requires $2(T+1)$ (finitely many) clusters.

The contribution of this paper is two-fold. (a) First, it introduces, to the best of our knowledge, the first test of whether the base-line policy is welfare-maximizing (against any possible alternative) under network interference. The test uses information on marginal effects of the policy function. (b) Second, it discusses an experimental design for estimation of welfare-maximizing treatment allocation rules (instead of average treatment effects) that allows for unobserved (and partial) interference. We introduce the first adaptive experiment under partial interference, consisting of a matched-pair two-stage adaptive design. We now discuss our contribution in detail.

Interference naturally occurs in several economic applications: information campaigns (Banerjee et al., 2013; Jones et al., 2017), health programs (Kim et al., 2015), development and public policy programs (Baird et al., 2018; Muralidharan et al., 2017; Muralidharan and Niehaus, 2017), and marketing campaigns (Zubcsek and Sarvary, 2011), among others.

In many economic applications, individuals are often organized into (large) independent clusters, while interference within each cluster is unobserved by the researcher. For example, when studying satisfaction with a cash transfer program (Alatas et al., 2012), individuals are organized into villages and connected within a village by parental or friendship ties (Alatas et al., 2016). In social marketing applications, individuals are often organized in cities or states, and they interact within each geographical region (Varian, 2016).

There has been recent work on how to construct welfare-optimal allocation rules in the absence of interference, i.e., where one person's outcome is independent of other people's treatment status. In this setting, information from the conditional average treatment effect can be used to design welfare-maximizing allocations (Kitagawa and Tetenov, 2018; Athey and Wager, 2020).
Suppose we instead knew the structure of this dependence. In that case, we can either (a) use neighbors' exposures, observed from a pilot study or constructed based on a particular network model, to construct the welfare (Viviano, 2019; Kitagawa and Wang, 2020), or (b) explicitly model global effects of the interactions on the system to guide the design of the experiment, as discussed in the context of online pricing experiments in Wager and Xu (2019). A more challenging problem is when either the global or local interference mechanism is unknown in the experiment. The first challenge we address is identifying and estimating the treatment's overall effect under interference when the network is unobserved. We show that if individuals are organized into groups between which there are no spillovers, and interference is local within each group, we can estimate the overall marginal effect of the treatment. The marginal effect (ME) defines the change in welfare from an infinitesimal change in the policy function. The ME is estimated by inducing small deviations to baseline interventions within finitely many pairs of such groups, without necessitating information on within-cluster interactions. We use the information on the marginal effects of the policy to evaluate and then estimate policies sequentially.

We consider policies consisting of individualized probabilistic treatment allocation rules. Individualized allocations imply that treatments are assigned independently based on individual-specific baseline covariates. Examples of policies include sending information to an individual (Bond et al., 2012), targeting cash transfers (Egger et al., 2019) or subsidies (Dupas, 2014), with the probability of treatment differing based, for instance, on the age or education of each individual.
The class of individualized assignment rules encompasses homogeneous assignments in two-stage randomized experiments as a special case (Baird et al., 2018), and it can be implemented (a) without requiring knowledge of the population network and (b) in an online fashion.

Identification relies on decomposing the potential outcomes as the sum of a conditional mean function and unobservable characteristics. The conditional mean function depends on the individual treatment assignment, individual baseline covariates, and the parameter β indexing the assignment mechanism (e.g., the probability of treatment for different individual types). The dependence on the parameter β captures the average spillover effect generated by the neighbors' treatments, which is averaged over the distribution of treatment assignments. Unobservables instead depend on neighbors' assignments and are, as a result of the assumption that effects spill over locally within the network, locally dependent. We construct the marginal effect as the sum of the direct effect of each individual's treatment, weighted by the marginal propensity to be treated, plus the marginal spillover effect. The marginal spillover effect defines the derivative of within-cluster average potential outcomes as functions of probabilities of treatments, also averaged over neighbors' assignments and covariates. Differently from the literature on causal inference under local interference (e.g., Li and Wager (2020); Leung (2020)), identification allows for neighbors' exposures to be unobserved to the researcher.

Estimation of marginal effects in the pilot study works as follows: we first pair clusters and, for each pair, in the first period of experimentation, we assign treatments independently based on the same target parameter across the two clusters. In the second period, we assign to each cluster locally perturbed probabilities of treatment, with perturbations in each pair having opposite signs.
We construct direct effects using the information within each cluster, and the marginal spillover effects by comparing the average outcomes on the treated and controls between the two clusters, differentiated over the two periods and appropriately reweighted by treatment probabilities. By taking a difference-in-differences of the outcomes between two clusters in a pair over two periods, we allow for cluster- and time-specific separable fixed effects. The design permits us to consistently estimate the marginal effects without necessitating infinitely many clusters: it guarantees that we always compare two clusters with opposite perturbations to the target policy, instead of taking an average across all clusters, whose concentration rate would depend on the number of clusters.

See Hudgens and Halloran (2008) for a definition of potential outcomes under partial interference.

Testing marginal effects to motivate a sequential design represents a contribution of independent interest of this paper to the literature on experimental design.

The second question we answer is how we can estimate welfare-maximizing allocation rules under unknown interference. We discuss the design of the sequential experiment to estimate welfare-maximizing individualized treatments under unknown interference. The experiment consists of sequential updates of each pair's policy based on the previous randomization period. Within each randomization period, we perform p iterations, with p indicating the number of parameters to be estimated, and we estimate marginal effects over one direction at a time.
We construct marginal effects estimators by contrasting (weighted) averages of outcomes between two clusters in a pair, differentiated by the observed outcome in the experiment's first iteration.

The sequential experiment for policy design presents one major challenge: the estimated treatment assignment rule over each iteration is data-dependent, and the time-dependence of unobservables may lead to a confounded experiment. This problem is generally not incurred in adaptive experiments, where units are assumed to be drawn without replacement (Kasy and Sautmann, 2019; Wager and Xu, 2019). We break dependence using a novel cross-fitting algorithm (Chernozhukov et al., 2018), where, in our case, the algorithm consists of "circular" updates of the policies using information from subsequent clusters. The circular approach's fundamental idea is that treatments in each pair depend on the outcomes and assignments in the next pair, in the previous period. As a result, as long as the number of pairs of clusters exceeds the number of iterations, the experiment is never confounded.

We use a gradient descent method for policy updates (Bottou et al., 2018). An important assumption is that the local optimization procedure achieves the global optimum. This condition is satisfied under decreasing marginal effects of the probability of treatments. The learning rate choice allows for strict quasi-concavity through the rescaling of the gradient's norm (Hazan et al., 2015). We discuss small-sample guarantees of the proposed design. We show that under local strong concavity of the welfare criterion and global strict quasi-concavity, the worst-case in-sample regret across all clusters converges to zero at rate log(T)/T, where T denotes the number of iterations.
We also show that the out-of-sample regret, i.e., the regret incurred after deploying the estimated policy on a new sample, scales to zero at a rate 1/T.

The proposed sequential experiment substantially differs from what an intuitive extension of a two-stage randomized experiment for estimating welfare-optimal treatment rules may be: first randomizing probabilities of treatment between clusters (Baird et al., 2018) (e.g., uniformly), then randomizing treatments within each cluster independently, and finally extrapolating the welfare function over the parameter space. We do not consider this alternative approach for two main reasons: (i) treatments are individualized and heterogeneously assigned, and the estimation error for learning allocation rules through grid search naturally incurs a curse of dimensionality; (ii) it does not necessarily control the in-sample regret, i.e., it requires substantial exploration to be able to extrapolate the entire response function, at the expense of the welfare of in-sample participants. By contrast, our procedure minimizes in-sample exploration while controlling the in-sample regret, and its in- and out-of-sample regret only scales quadratically with the dimension of the policy function.

Our results rely on two conditions: (i) while observations may exhibit time-dependence, treatments do not carry over in time; (ii) the first moments of cluster averages under the same treatment exposures converge to the same estimand across different clusters, up to separable cluster-specific fixed effects. Condition (i) is often explicitly or implicitly imposed in the study of adaptive experiments (see, e.g., Kasy and Sautmann (2019)), and (ii) representativeness of the clusters is often necessary for valid inference in two-stage randomized experiments (Baird et al., 2018). We include extensions that relax (i) to limited carry-over, and extensions that allow for non-separable time- and cluster-specific fixed effects (relaxing (ii)) under lack of spillovers on the treated units (but not the control units), in Section 5.

See also Garber (2019) for examples of possibility results of logarithmic regret rates under lack of (global) strong concavity.

We conclude our discussion with a calibrated experiment. Data from Cai et al. (2015) show that (i) marginal effects of the treatment exhibit decreasing marginal returns in applications, and (ii) the method presents substantial advantages relative to existing experimental designs.

The rest of the paper is organized as follows. We discuss the set-up and the definition of welfare in Section 2. We discuss hypothesis testing in Section 3. The adaptive experiment for policy design is introduced in Section 4. Section 5 presents extensions in the presence of dynamic effects and non-separable fixed effects. Section 6 collects the numerical experiments and Section 7 concludes.
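The gradient-based policy update with gradient-norm rescaling, described earlier in this introduction, can be sketched numerically. The welfare function, step-size rule, and trimming constant below are all illustrative assumptions — the paper's procedure estimates gradients from paired experimental clusters rather than evaluating them exactly:

```python
import numpy as np

def welfare(beta):
    # Illustrative strictly concave welfare over a 2-dimensional policy parameter.
    target = np.array([0.6, 0.3])
    return -np.sum((beta - target) ** 2)

def grad_welfare(beta):
    # Exact gradient of the toy welfare (in the experiment this would be
    # replaced by an estimated marginal effect).
    target = np.array([0.6, 0.3])
    return -2.0 * (beta - target)

def project(beta, delta=0.05):
    # Keep treatment probabilities inside (delta, 1 - delta), mimicking the
    # trimmed policy class.
    return np.clip(beta, delta, 1 - delta)

beta = np.array([0.2, 0.8])
for t in range(1, 201):
    g = grad_welfare(beta)
    # Rescale by the gradient norm: the update direction has unit length,
    # which is what allows strict (rather than strong) quasi-concavity;
    # the 1/t learning rate is an illustrative choice.
    beta = project(beta + (1.0 / t) * g / (np.linalg.norm(g) + 1e-12))

print(beta)  # approaches the maximizer (0.6, 0.3)
```

The normalized step makes progress depend only on the gradient's direction, so flat regions of a quasi-concave welfare do not stall the updates.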
This paper relates to three main strands of literature: (i) experimental design; (ii) causal inference under network interference; (iii) empirical welfare maximization and statistical treatment choice. We review the main references in the following lines.

In the context of experimental design under network interference, common designs include clustered experiments (Eckles et al., 2017; Taylor and Eckles, 2018; Ugander et al., 2013) and saturation-design experiments (Baird et al., 2018; Basse and Feller, 2018; Pouget-Abadie, 2018). However, differently from those designs, our analysis focuses on detecting welfare-maximizing policies instead of inference on treatment and spillover effects. The different target estimand motivates the sequential procedure of our experiment. Recent literature discusses alternative design mechanisms for inference on treatment effects only, often assuming knowledge of the underlying network structure. Examples include Basse and Airoldi (2018b), which only allows for dependence but not interference, Jagadeesan et al. (2020), who discuss the design of experiments for estimating direct treatment effects only in the presence of observed networks, Breza et al. (2020), who discuss inference on treatment effects with aggregated relational data, and Viviano (2020), who discusses the design of two-wave experiments under an observed network, focusing on variance reduction of treatment effect estimators. Additional references include Basse and Airoldi (2018a), who discuss limitations of design-based causal inference under interference, and Kang and Imbens (2016), who discuss encouragement designs in the presence of interference.
None of the above references addresses the problem of policy design or discusses inference on welfare-maximizing policies.

Local experimentation for experimental design relates to Wager and Xu (2019), who discuss local experimentation in the different context of estimating prices in a single two-sided market with asymptotically independent agents, through randomization of prices to individuals. However, as noted by the authors, the assumptions imposed in the above reference do not allow for unknown interference. These differences motivate our identification strategy and algorithmic procedures, which exploit two-level local randomization at the cluster and individual level instead of individual-based randomization, as well as our proposed non-parametric estimator of marginal effects based on the clustering.

Our paper also relates more broadly to the literature on adaptive experimentation through first-order approximation methods (Bubeck et al., 2017; Flaxman et al., 2004; Kleinberg, 2005), and to experimental design with strategic agents recently discussed in Munro (2020). However, these references do not allow for network interference. They focus on individual-level randomization procedures, as opposed to the cluster-based and individual-based sequential procedure proposed in the current paper. Under unknown interference, we show that consistent estimation of marginal effects with finitely many clusters requires deterministically assigning treatments based on small deviations of the policies between pairs of clusters.
The two-stage matched-pair cluster design represents a further difference from both designs based on individual randomization and from saturation experiments where probabilities of treatment are randomized between clusters.

Additional references include bandit algorithms and Thompson sampling (Cesa-Bianchi and Lugosi, 2006; Bubeck et al., 2012; Russo et al., 2017), and the recent econometric literature on adaptive and two-stage experiments (Kasy and Sautmann, 2019; Bai, 2019; Tabord-Meehan, 2018), which, however, does not allow for network interference.

We build a connection to the literature on inference under interference. Most of the literature assumes an observed network structure (Aronow et al., 2017; Manski, 2013; Leung, 2020; Ogburn et al., 2017; Li and Wager, 2020; Goldsmith-Pinkham and Imbens, 2013; Athey et al., 2018; Choi, 2017; Forastiere et al., 2020), differently from the current paper. References which discuss inference under partial interference include Hudgens and Halloran (2008) and Vazquez-Bare (2017), among others. Unlike the current paper, the above references focus on inference on treatment effects instead of inference on welfare-maximizing policies. Finally, Sävje et al. (2020) discuss conditions for valid inference on the direct effect of treatment only, under unknown interference. In contrast, estimating optimal policies requires estimating the marginal effects of the treatments.

In the context of policy design, Viviano (2019) instead discusses targeting on networks in an off-line scenario, where data are observed from an existing experiment or quasi-experiment, without therefore discussing the problem of experimental design. Kitagawa and Wang (2020) discuss allocation rules on a SIR network, in the absence of an experiment, assuming a fully observable network structure and using a model-based method. Li et al. (2019), Graham et al.
(2010), and Bhattacharya (2009) consider the problem of optimal allocation of individuals across small groups, such as dormitory rooms, using data from a single-wave experiment. However, the above procedures allow neither for the design of individualized treatment allocation rules nor for sequential experimentation.

This paper also contributes to the growing literature on statistical treatment rules by proposing a design mechanism to test and estimate treatment allocation rules. References on policy estimation include Manski (2004), Athey and Wager (2020), Kitagawa and Tetenov (2018), Kitagawa and Tetenov (2019), Elliott and Lieli (2013), Mbakop and Tabord-Meehan (2016), Bhattacharya and Dupas (2012), Dehejia (2005), Stoye (2009), Stoye (2012), Tetenov (2012), Murphy (2003), Nie et al. (2020), Kallus (2017), Lu et al. (2018), and Sasaki and Ura (2020), among others. However, none of the above references discusses testing for policy optimality, the problem of experimental design, or the case of interference.

Finally, the literature on inference on welfare-maximizing decisions has mostly focused on constructing confidence intervals around welfare estimators, which, however, do not permit comparing a target policy against any possible alternative (Kato and Kaneko, 2020; Zhang et al., 2020; Hadad et al., 2019; Andrews et al., 2019; Imai and Li, 2019; Bhattacharya et al., 2013; Luedtke and Van Der Laan, 2016). In the context of independent observations, exceptions are Armstrong and Shen (2015), Rai (2018), and Kasy (2016), which propose procedures for constructing sets of welfare-maximizing policies (or ranks of policies), whose validity, however, does not allow for dependence and interference, and which often
Observe that here we define the ME as thederivative of welfare with respect to the parameters of the policy function which should not be confusedwith the definition of MTE commonly adopted in the causal inference literature denoting the derivativerelative to the (endogenous) selection mechanism (e.g., see recent work of Sasaki and Ura (2020) of off-lineempirical welfare maximization using the MTE).
This section discusses the model, the definition of welfare, and the estimand of interest.
Preliminaries and notation
We start by introducing necessary notation. We define $Y_{i,t} \in \mathcal{Y}$ the outcome of interest of unit $i$ at time $t$, and $D_{i,t} \in \{0,1\}$ the treatment assignment of unit $i$ at time $t$. We denote $X_i \in \mathcal{X}$ individual-specific base-line covariates. We let $X_i \sim F_{X_i}$, with $f_{X_i}$ denoting the Radon-Nikodym derivative of $F_{X_i}$. Units are assumed to be organized into $K$ independent large clusters, and observed over $T$ periods. We denote $k(i) \in \{1, \dots, K\}$ the cluster of unit $i$, $N_k$ the number of units in cluster $k$, and $N = \sum_{k=1}^{K} N_k$. For notational convenience only, we assume equally sized clusters with $N_k = N/K = \check{N}$. From each cluster we sample at random each period covariates and outcomes of $n < \check{N}$ individuals. We denote $S_{k,t}$ the set of indexes of units sampled from cluster $k$ at time $t$ (these may or may not be the same indexes every period). Motivated by development studies (Cai et al., 2015; Banerjee et al., 2013), we assume that units are connected within each cluster $k$ according to a fixed adjacency matrix $A^k \in \mathbb{R}^{\check{N} \times \check{N}}$, unobserved to the researcher. All our conditions must be interpreted conditional on the adjacency matrices $(A^1, \dots, A^K)$. Interference within each cluster occurs in unknown dimensions. However, no interference between clusters is allowed. Therefore, throughout the rest of our discussion, we will implicitly assume that SUTVA (Rubin, 1990) holds at the cluster level only.

Assignment mechanism
Let
$$e(\,\cdot\,;\beta) : \mathcal{X} \mapsto \mathcal{E} \subset (0,1), \quad \beta \in \mathcal{B}, \qquad (1)$$
denote a class of individual treatment assignments, where $\beta$ denotes a vector of parameters, and $e(x;\beta)$ is a twice continuously differentiable function. We denote $\dim(\beta) = p$. We define a (conditional) Bernoulli allocation rule as follows.

Definition 2.1 (Conditional Bernoulli Allocation Rule (CBAR)). A Bernoulli allocation rule with parameters $\beta_t = \{(\beta_{k,1}, \dots, \beta_{k,t})\}_{k \in \{1,\dots,K\}}$ assigns treatments to all units $i \in \{1, \dots, N\}$ as follows:
$$D_{i,t} \mid X_i = x, \beta_t \sim \mathrm{Bern}\big(e(x; \beta_{k(i),t})\big),$$
independently across units and time.

Definition 2.1 defines an allocation where treatments are assigned independently in each cluster, with cluster-specific and time-specific conditional assignments $e(X_i; \beta_{k(i),t})$, parametrized by the vector of parameters $\beta_t$. Importantly, the above definition assumes that treatment assignments in cluster $k$ are conditionally independent of $\beta_{k' \neq k, T}$ given $\beta_{k, T}$. In addition, treatment assignments are drawn for all units in a cluster (regardless of whether their post-treatment outcome is observed or not).

Remark 1 (Why a CBAR?). The cluster-specific Bernoulli allocation is commonly used in two-stage randomized network experiments for inference on treatment effects in the presence of a single experimentation period ($t = 1$) and homogeneous treatment assignments (i.e., $e(x;\beta) = \beta$) (Baird et al., 2018). This paper considers heterogeneous assignments and multiple experimentation periods, and the different goal of welfare maximization guides the choice of the parameter $\beta$. We consider a CBAR since it is simple and easy to implement in practice, and it can be implemented in an on-line fashion. A CBAR induces a local-dependence structure which, we show, permits estimation of welfare-maximizing policies and asymptotic inference on the optimality of base-line interventions.

Example 2.1 (Targeting information).
Consider the problem of targeting information to individuals (Cai et al., 2015). Here, $D_{i,t}$ denotes whether information is sent to individual $i$ at time $t$, while $Y_{i,t}$ equals the outcome of interest of unit $i$ at time $t$ (e.g., insurance adoption at period $t$). Units are organized in villages $k \in \{1, \dots, K\}$. Suppose that insurance adoption of individual $i$ at time $t$ depends on individual $i$'s treatment assignment $D_{i,t}$, and on the treatment assignments of individual $i$'s friends and friends of friends in village $k(i)$. We say that two individuals are connected either because they are direct friends (and so the assignment of $i$ directly impacts the decision of $j$) or because they share a common friend. A simple definition of the adjacency matrix takes the following form:
$$A^{k}_{i,j} = 1\big\{ i \text{ is friend of } j \big\} + \alpha \times 1\big\{ (i,j) \text{ have a common friend} \big\}, \quad \alpha \in [0,1].$$
See Figure 1 for an illustrative example.

[Figure 1: Example of network interactions. The figure on the left draws individuals connected to friends. The figure on the right draws an adjacency matrix obtained after connecting individuals sharing a common friend (colored in green).]

The matrix of friendships, the parameter $\alpha$, and so also the matrix $A^k$ are unobserved to the researcher and policy-maker. Researchers aim to study how many individuals should be treated to maximize the welfare generated by insurance adoption, net of the costs of treatments. Namely, they consider an allocation rule of the form
$$e(x; \beta) = \beta, \quad \beta \in (\delta, 1 - \delta) \subset (0,1),$$
where $\beta$ denotes the probability of treatment, and $X_i = 1$.

Example 2.2 (Cash transfer program). Consider the problem of targeting cash transfers to individuals, to maximize satisfaction with the program (Alatas et al., 2012). Units are organized in independent villages and connected within each village based on parental ties and friendships.
Targeting cash transfers generates spillovers along these dimensions. The policy-maker observes for each individual the quality of the roof (binary), the quality of the floor (binary), and whether the individual attended secondary school (binary). The policy-maker constructs a linear probability decision rule with cutoffs at $(\delta, 1-\delta) \subset (0,1)$:
$$e\big(\text{floor}, \text{roof}, \text{educ}; \beta\big) = \beta_0 + \beta_1\,\text{floor} + \beta_2\,\text{roof} + \beta_3\,\text{educ}, \quad \beta \in \mathcal{B}, \quad \delta \le \sum_{j=0}^{3} \beta_j \le 1 - \delta. \qquad (2)$$
The set $\mathcal{B}$ encodes capacity constraints, as well as ethical and legal constraints on the parameter space. See Figure 2 for a graphical illustration with $\beta_3 = 0$.

Program effectiveness can be measured using measures of program satisfaction. Satisfaction with the program is shown to increase compliance of villages with the program and relates to unobserved measures of poverty (Alatas et al., 2012).

Capacity constraints can be imposed whenever the distribution of $X_i$ is known to the policy maker, and these can be directly incorporated in the conditions on the parameter space.

[Figure 2: Example of a probabilistic treatment assignment rule for a cash transfer program. Individuals are assigned different probabilities ($\beta_0$, $\beta_0+\beta_1$, $\beta_0+\beta_2$, $\beta_0+\beta_1+\beta_2$) based on the quality of their floor and roof. The final goal is to optimally estimate those probabilities.]
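A CBAR with the linear rule of Example 2.2 can be sketched as follows. The coefficient values, covariate distribution, and cluster size are all hypothetical choices for illustration; only the structure (independent Bernoulli draws with covariate-dependent probabilities, and the sum constraint from (2)) comes from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def e_linear(x, beta, delta=0.1):
    # Linear probability rule of Example 2.2:
    # e(floor, roof, educ; beta) = beta0 + beta1*floor + beta2*roof + beta3*educ.
    p = beta[0] + x @ beta[1:]
    # The paper imposes delta <= sum_j beta_j <= 1 - delta on the parameter
    # space; here we simply check it for the chosen beta.
    assert delta <= np.sum(beta) <= 1 - delta
    return p

def draw_cbar(X, beta):
    # Conditional Bernoulli Allocation Rule: treatments drawn independently
    # across units, given covariates and the cluster's parameter beta.
    return rng.binomial(1, e_linear(X, beta))

# Hypothetical binary covariates (floor, roof, educ) for one cluster.
X = rng.binomial(1, 0.5, size=(10_000, 3))
beta = np.array([0.2, 0.1, 0.15, 0.05])  # illustrative coefficients
D = draw_cbar(X, beta)
print(D.mean())  # roughly beta0 + 0.5 * (beta1 + beta2 + beta3) = 0.35
```

Because treatments are drawn independently given covariates, the rule requires no network information and can be applied online, one unit at a time.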
Throughout our discussion we assume partial interference: potential outcomes are independent between clusters (Baird et al., 2018). We formalize the between-cluster SUTVA by defining
$$Y_{i,t}\big(d^{k(i)}_1, \dots, d^{k(i)}_t\big), \quad d^k_s \in \{0,1\}^{\check{N}}, \quad s \in \{1, \dots, t\},$$
the potential outcome of unit $i$ at time $t$, a function of the treatments assigned to units in the same cluster only, over each period $s \le t$. We implicitly assume that potential outcomes are consistent (Imbens and Rubin, 2015) and that potential outcomes and base-line covariates are jointly mutually independent between clusters.

Three restrictions on within-cluster dependence are imposed.

Assumption 1 (No carry-over and local interference). Assume that for any $d_1, \dots, d_t$, $t \ge 1$, the following conditions hold.

(A) $Y_{i,t}\big(d^{k(i)}_1, \dots, d^{k(i)}_t\big)$ is a constant function in $(d^{k(i)}_1, \dots, d^{k(i)}_{t-1})$;

(B) $Y_{i,t}\big(\cdots, d^{k(i)}_t\big)$ is constant in each entry $d^{k(i)}_{t,j}$ with $j: A^{k(i)}_{i,j} = 0$. In addition, $\sum_j 1\{A^{k(i)}_{i,j} > 0\} \le \sqrt{\gamma_n}$, with $\gamma_n \le n^{1/2}$;

(C) $\big\{X_i, Y_{i, s \le T}(\cdots, d),\ d \in \{0,1\}^{\check{N}}\big\} \perp \big\{X_j, Y_{j, t \le T}(\cdots, d'),\ d' \in \{0,1\}^{\check{N}}\big\}_{j \notin \{v : A^{k(i)}_{i,v} > 0\},\ t \le T}$, the latter index set denoting the units not connected to individual $i$.

Condition (A) assumes that effects do not propagate in time. This condition is known as no-carry-over and is often implicitly imposed in studies on experimental design (Kasy and Sautmann, 2019). For a discussion of the no-carry-over assumption, the reader may refer to Athey and Imbens (2018). However, potential outcomes may exhibit time dependence (e.g., due to unobserved time-varying factors). In Section 5.3 we extend our framework to limited carry-over effects. Condition (B) imposes local interference: spillovers propagate within (unknown) neighborhoods. The size of a neighborhood is assumed to grow at a slower rate than the sample size. The assumption of local interference is often imposed for valid causal inference in the presence of observed network structures; see, e.g., Leung (2020), Jagadeesan et al. (2020). Condition (C) instead imposes local dependence among outcomes and covariates. Similarly to (B), researchers do not know the dependence structure within each cluster. For simplicity, throughout our discussion, we refer to potential outcomes only as functions of all other units' current treatment status in the same cluster.

Under Assumption 1, we can show that outcomes are only locally dependent, with their expectation depending on the parameter $\beta$ indexing the assignment mechanism.
This decomposition permits us to (i) identify the welfare function and (ii) derive asymptotic results by exploiting the local dependence structure. We formalize this idea in the following lemma.

Lemma 2.1.
Let Assumption 1 hold. Then for a CBAR with $\beta_{k, t \le T} \perp \{X_i, Y_{i,t}(d), d \in \{0,1\}^{\check N}\}_{i: k(i) = k}$, for all $i$,

$$Y_{i,t} = m_{i,t}\big(D_{i,t}, X_i, \beta_{k(i),t}\big) + \varepsilon_{i,t}, \qquad E\big[\varepsilon_{i,t} \mid D_{i,t}, X_i, \beta_{k(i),t}\big] = 0,$$

for some individual-specific function $m_{i,t}(\cdot)$ and unobservables $\varepsilon_{i,t}$. In addition, $(\varepsilon_{i,t}, X_i) \perp \{(\varepsilon_{j, t \le T}, X_j)\}_{j \in \mathcal J(i)} \mid \beta_{k(i),t}$, where $\mathcal J(i) \subset \{v: k(v) = k(i)\}$ and $|\mathcal J(i)| \ge \check N - \gamma_n$.

Lemma 2.1 states that observed outcomes under a Bernoulli assignment are the sum of two components: a conditional expectation function $m_{i,t}(\cdot)$, which depends on the individual assignment, baseline covariates, and the parameter $\beta_{k(i),t}$; and unobservables $\varepsilon_{i,t}$ that also depend on neighbors' covariates and treatment assignments. Unobservables depend on at most $2\gamma_n$ many other unobservables in the same cluster. Observe that the $\varepsilon_{i,t}$ are not identically distributed, since they also depend on the treatment assignments of the neighbors of individual $i$. We illustrate Lemma 2.1 in the following examples.

Example 2.1 Cont'd. Assume that each individual has at least one connection. Let

$$Y_{i,t} = \alpha_t + D_{i,t}\varphi_1 + \frac{\sum_{j \ne i} A^{k(i)}_{i,j} D_{j,t}}{\sum_{j \ne i} A^{k(i)}_{i,j}}\varphi_2 - \Big(\frac{\sum_{j \ne i} A^{k(i)}_{i,j} D_{j,t}}{\sum_{j \ne i} A^{k(i)}_{i,j}}\Big)^2 \varphi_3 + \eta_{i,t}, \qquad (3)$$

with $\eta_{i,t}$ being cross-sectionally independent unobservables. Namely, outcomes depend on their own treatment and the percentage of treated units connected to $i$. Equation (3) also states that spillovers have decreasing marginal effects.
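As a numerical sanity check of the decomposition in Lemma 2.1 for the model in Equation (3), the sketch below computes the exact expectation of $Y_{i,t}$ over i.i.d. Bernoulli($\beta$) neighbor assignments by enumeration, and compares it with the closed-form conditional mean implied by $E[\text{share}] = \beta$ and $E[\text{share}^2] = \beta(1-\beta)/|N_i| + \beta^2$. Parameter values are illustrative.

```python
import itertools

def exact_mean_outcome(d, beta, n_nbrs, alpha=0.5, phi1=2.0, phi2=1.0, phi3=0.5):
    """Exact E[Y_{i,t}] for the model in Equation (3): enumerate all 2^n_nbrs
    i.i.d. Bernoulli(beta) assignments of i's neighbors (n_nbrs = |N_i|)."""
    total = 0.0
    for nbr_d in itertools.product([0, 1], repeat=n_nbrs):
        s = sum(nbr_d)
        prob = beta ** s * (1 - beta) ** (n_nbrs - s)
        share = s / n_nbrs
        total += prob * (alpha + phi1 * d + phi2 * share - phi3 * share ** 2)
    return total

def closed_form_m(d, beta, n_nbrs, alpha=0.5, phi1=2.0, phi2=1.0, phi3=0.5):
    """m_{i,t}(d, beta): uses E[share] = beta and
    E[share^2] = beta * (1 - beta) / n_nbrs + beta**2."""
    return (alpha + phi1 * d + phi2 * beta
            - phi3 * (beta * (1 - beta) / n_nbrs + beta ** 2))
```

The two functions agree to machine precision, which is the content of the decomposition: the mean of $Y_{i,t}$ depends on neighbors' assignments only through $\beta$.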
Taking expectations over neighbors' assignments, under a CBAR we can write

$$Y_{i,t} = \underbrace{\alpha_t + D_{i,t}\varphi_1 + \beta_{k(i),t}\varphi_2 - \beta_{k(i),t}\varphi_3 \times \frac{\sum_{j \ne i} A^{k(i), 2}_{i,j}}{\big(\sum_{j \ne i} A^{k(i)}_{i,j}\big)^2} - \beta^2_{k(i),t}\varphi_3 \times \frac{\sum_{j \ne i,\, j' \ne i,\, j' \ne j} A^{k(i)}_{i,j} A^{k(i)}_{i,j'}}{\big(\sum_{j \ne i} A^{k(i)}_{i,j}\big)^2}}_{=\, m_{i,t}(D_{i,t},\, \beta_{k(i),t})} + \varepsilon_{i,t},$$

where $\varepsilon_{i,t}$ is a function of $\sum_{j \ne i} A^{k(i)}_{i,j} D_{j,t}$. The case with covariates follows similarly, where the expectation is also taken with respect to $X_{j \ne i}$ (see the following example).

Example 2.2 Cont'd
Assume that each individual has at least one connection. Let

$$Y_{i,t} = \alpha_{k(i)} + D_{i,t}\varphi_1 + (1 - D_{i,t})\frac{\sum_{j \ne i} A^{k(i)}_{i,j} D_{j,t}}{\sum_{j \ne i} A^{k(i)}_{i,j}}\varphi_2 + \eta_{i,t}, \qquad (4)$$

with $\eta_{i,t}$ being cross-sectionally independent unobservables, and $A_{i,j} = A_{j,i} \in \{0,1\}$. That is, spillovers only occur on those individuals that are not treated. The model is equivalent to

$$Y_{i,t} = \alpha_{k(i)} + D_{i,t}\varphi_1 + (1 - D_{i,t})\frac{\sum_{j \ne i} A^{k(i)}_{i,j}\big[E[\mathrm{floor}_j]\beta_1 + E[\mathrm{roof}_j]\beta_2 + E[\mathrm{educ}_j]\beta_3\big]\varphi_2}{\sum_{j \ne i} A^{k(i)}_{i,j}} + \varepsilon_{i,t},$$

where

$$\varepsilon_{i,t} = (1 - D_{i,t})\Big[\frac{\sum_{j \ne i} A^{k(i)}_{i,j} D_{j,t}\varphi_2}{\sum_{j \ne i} A^{k(i)}_{i,j}} - \frac{\sum_{j \ne i} A^{k(i)}_{i,j}\big[E[\mathrm{floor}_j]\beta_1 + E[\mathrm{roof}_j]\beta_2 + E[\mathrm{educ}_j]\beta_3\big]\varphi_2}{\sum_{j \ne i} A^{k(i)}_{i,j}}\Big] + \eta_{i,t}.$$

Example 2.3 (Educational program). Consider the problem of designing educational programs in schools (Opper, 2016) for test-score improvements. Students are clustered in $k \in \{1, \cdots, K\}$ schools. Each student is assigned to equally sized classes $c(i)$ of fixed size $C$, unobserved to the researcher. Test scores depend on assignments as follows:

$$Y_{i,t} = P\Big(D_{i,t}, \sum_{j \ne i: c(j) = c(i)} D_{j,t}, X_i, \eta_{i,t}\Big), \qquad (\eta_{i,t}, X_i) \sim_{i.i.d.} F_{\eta, X},$$

for some arbitrary polynomial function $P(\cdot)$ and independent stationary unobservables $\eta_{i,t}$. Then under a CBAR,

$$m_{i,t}(d, x, \beta) = E_\beta\Big[P\Big(d, \sum_{j \ne i: c(j) = c(i)} D_{j,t}, X_i, \eta_{i,t}\Big)\Big| X_i = x\Big], \qquad (5)$$

where $E_\beta$ denotes the expectation over neighbors' assignments under a CBAR with cluster-level parameter $\beta$.

[Footnote] The expression below uses independence of the treatment assignments under a CBAR.
For each individual, $\varepsilon_{i,t}$ depends on the observables and unobservables of at most $C$ other units.

The second condition that we impose requires clusters to be representative of the underlying population.

Assumption 2 (Representative clusters and fixed effects). For any $d \in \{0,1\}$, $\beta \in \mathcal B$, $x \in \mathcal X$, any random sample $\mathcal S_{k,t}$ of size $n$ from cluster $k$ is such that

$$\frac{1}{n}\sum_{i \in \mathcal S_{k,t}} m_{i,t}\big(d, x, \beta\big) f_{X_i}(x) = \alpha_t(x) + \tau_k(x) + m(d, x, \beta) f_X(x) + J_n, \qquad \frac{1}{n}\sum_{i \in \mathcal S_{k,t}} f_{X_i}(x) = \check f_X(x) + J_n,$$

for some possibly unknown and uniformly bounded functions $\alpha_t(\cdot), \tau_k(x), m(\cdot), f_X(\cdot), \check f_X(\cdot)$, and $J_n \in [-b_n, b_n]$ for some positive $b_n \to 0$ as $n \to \infty$.

The functions $\alpha_t(x), \tau_k(x)$ capture the time-specific and cluster-specific fixed effects for the sub-population with covariates $\{X_i = x\}$ at time $t$, multiplied by the average density within each cluster. In Example 2.1, $\alpha_t(x) = \alpha_t$, while in Example 2.3, $\alpha_t(x) = 0$ due to the stationarity assumption. In the presence of identically distributed covariates, the function $m(\cdot)$ defines the within-cluster expectation, conditional on $X_i = x$, net of fixed effects. Whenever covariates are not identically distributed, $m(d, x, \beta) f_X(x)$ defines the limiting average of the product between the conditional mean function and the individual-specific density $f_{X_i}$, evaluated at $x$. The function $\check f_X(x)$ denotes the within-cluster average density function of the covariates. The component $J_n$ captures imbalance across clusters. In Example 2.1, Assumption 2 holds if the average inverse degree is asymptotically the same across different clusters, while it fails otherwise. In Example 2.3, instead, the assumption always holds with $J_n = 0$.

Assuming the representativeness of clusters is a common assumption for causal inference. For instance, Baird et al.
(2018) assume that cluster-level expectations are not cluster-specific, and Vazquez-Bare (2017) assumes that the joint distribution of outcomes from each cluster is the same across different clusters. Here we allow for separable cluster-specific fixed effects. The assumption implicitly imposes that fixed effects are additive and separable, and that expectations within each cluster concentrate around the same target estimand (after subtracting the fixed effects). In the following remark, we discuss a relaxation of Assumption 2 that allows for non-separable time- and cluster-specific fixed effects.

Remark 2 (Non-separable time/cluster fixed effects). In Section 5 we discuss the different scenario where

$$\frac{1}{n}\sum_{i \in \mathcal S_{k,t}} m_{i,t}\big(d, x, \beta\big) f_{X_i}(x) = \alpha_{t,k}(x) + m(d, x, \beta) f_X(x) + J_n, \qquad (6)$$

i.e., the fixed effects are not separable in time and cluster identity. We discuss this scenario and provide a set of results under the alternative condition that $\frac{\partial m(1, x, \beta)}{\partial \beta} = 0$, i.e., that spillovers do not occur on treated units, but only on controls. For example, in the presence of an information campaign, we may assume that spillovers do not occur on those individuals who have already received information, but only on those who were not exposed to it.

A further relaxation of Assumption 2 may consist of also indexing the function $m(\cdot)$ by the cluster type, as in Park and Kang (2020), and conducting separate analyses within different clusters. This is omitted for the sake of brevity.

The scope of this paper is to estimate the conditional Bernoulli assignment that maximizes social welfare. We introduce the notion of (utilitarian) welfare (Manski, 2004).
Definition 2.2 (Welfare). For a given conditional Bernoulli assignment with parameters $\beta_{k,t} = \beta$, define the (utilitarian) welfare as follows:

$$W(\beta) = \int \Big[e(x; \beta)\big(m(1, x, \beta) - m(0, x, \beta)\big) + m(0, x, \beta)\Big] f_X(x)\, dx - \int c(x)\, e(x; \beta)\, \check f_X(x)\, dx, \qquad (7)$$

where $c(x) < \infty$ denotes the cost of treatment for units with $X_i = x$.

Welfare is defined as the average effect under the treatment assignment $e(\cdot; \beta)$, net of its implementation cost $c(x)$, assumed to be known to the policy-maker. Observe that welfare does not depend on fixed effects, since those do not depend on the policy $\beta$. We can now introduce our main estimand.

Definition 2.3 (Estimand). Define the welfare-maximizing policy as

$$\beta^* \in \arg\sup_{\beta \in \mathcal B} W(\beta), \qquad (8)$$

where $\mathcal B = [\underline B, \overline B]^p$ denotes a pre-specified compact set.

Equation (8) defines the vector of parameters that maximizes social welfare. In our setting, policy-makers choose $\beta^*$ based on an experiment conducted over a pre-specified time window. Once the experiment is terminated, the policy cannot be updated, and no additional information is collected.

Remark 3 (Carry-over effects). In Section 5.3, we consider the extension where $Y_{i,t}(\cdots, d_{t-1}, d_t)$ also depends on the past treatment assignments $d_{t-1}$, allowing for carry-over effects in time. We consider both stationary and time-variant decisions, and discuss estimation in these scenarios, at the expense of more data-intensive experimentation for detecting welfare-maximizing policies.

Estimation of and inference on welfare-maximizing decisions rely on identifying and estimating the marginal effects of the treatment. For expositional convenience, we implicitly assume differentiability and defer formal assumptions to Section 3. We discuss definitions of marginal effects in the following lines.
Definition 2.4 (Marginal effects). The marginal effect of the treatment is defined as follows:

$$V(\beta) = \frac{\partial W(\beta)}{\partial \beta}.$$

Under the above regularity condition, the marginal effect takes an intuitive form. Define

$$\Delta(x, \beta) = m(1, x, \beta) - m(0, x, \beta),$$

the average direct effect, averaged over the spillovers, for a given level of covariate $x$. Then marginal effects are defined as

$$\int \Big[\underbrace{e(x; \beta)\frac{\partial m(1, x, \beta)}{\partial \beta} + (1 - e(x; \beta))\frac{\partial m(0, x, \beta)}{\partial \beta}}_{(S)} + \underbrace{\frac{\partial e(x; \beta)}{\partial \beta}\Delta(x, \beta)}_{(D)}\Big] f_X(x)\, dx - \int c(x)\frac{\partial e(x, \beta)}{\partial \beta}\check f_X(x)\, dx. \qquad (9)$$

The above expression shows that the effect depends on (a) the direct effect of changing $\beta$, captured by the component (D); and (b) the indirect effect of changing $\beta$ due to marginal spillover effects, captured by the component (S).

Example 2.1 Cont'd
Consider the model in Equation (3), with an adjacency matrix such that spillovers only occur within first-degree neighbors. The direct effect of the treatment denotes the effect of informing individual $i$ on her insurance take-up; this effect equals $\varphi_1$. The marginal spillover effect denotes the effect of a small change in the probability that other individuals (including $i$'s friends) are invited to the information session. In our example, the marginal spillover effect equals

$$\frac{\partial m(d, \beta)}{\partial \beta} = \varphi_2 - \varphi_3\kappa - 2\beta\varphi_3(1 - \kappa), \qquad \kappa = \lim_{\check N \to \infty}\frac{1}{\check N}\sum_{i=1}^{\check N}\frac{1}{|N_i|},$$

where $\kappa$ denotes the asymptotic limit of the average inverse degree. The optimal policy sets the marginal effect equal to zero. As a result, we obtain

$$\frac{\partial W(\beta)}{\partial \beta} = \varphi_1 + \varphi_2 - \varphi_3\kappa - 2\beta\varphi_3(1 - \kappa) = 0 \ \Rightarrow\ \beta^* = \frac{\varphi_1 + \varphi_2 - \varphi_3\kappa}{2\varphi_3(1 - \kappa)}.$$

Intuitively, more individuals should be treated if either (i) the direct effect is larger ($\varphi_1 \uparrow$), or (ii) the spillover effect is larger ($\varphi_2 \uparrow$). Observe that the marginal effect can be used for (a) testing whether baseline interventions are optimal, i.e., testing whether marginal effects equal zero, and (b) estimating welfare-maximizing policies.

[Footnote] The identity below follows from the dominated convergence theorem under Assumption 3. See Section 3 for details.
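The closed form for $\beta^*$ above is easy to verify numerically. The sketch below maximizes $W(\beta) = \beta(\varphi_1 + \varphi_2) - (\beta\kappa + \beta^2(1-\kappa))\varphi_3$ (Example 2.1 with zero treatment costs) over a fine grid and compares the grid maximizer with the closed form; all parameter values are illustrative.

```python
def welfare(beta, phi1, phi2, phi3, kappa):
    """W(beta) for Example 2.1 with zero treatment costs: direct effect
    beta * phi1, linear spillover beta * phi2, concave spillover
    -(beta * kappa + beta**2 * (1 - kappa)) * phi3."""
    return beta * (phi1 + phi2) - (beta * kappa + beta ** 2 * (1 - kappa)) * phi3

def beta_star(phi1, phi2, phi3, kappa):
    """Closed form obtained by setting dW/dbeta = 0."""
    return (phi1 + phi2 - phi3 * kappa) / (2 * phi3 * (1 - kappa))

phi1, phi2, phi3, kappa = 0.2, 0.5, 1.0, 0.25   # illustrative values
grid = [i / 10000 for i in range(10001)]
b_grid = max(grid, key=lambda b: welfare(b, phi1, phi2, phi3, kappa))
```

With these values $\beta^* = (0.2 + 0.5 - 0.25)/1.5 = 0.3$, and the grid search returns the same point up to grid resolution.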
Example 2.2 Cont’d
Consider the model in Equation (4). Then the objective function reads as follows:

$$W(\beta) = \check\kappa^\top \beta - \varphi_2\, \beta^\top \check M \beta,$$

for a vector $\check\kappa$ and a matrix $\check M$ depending on the asymptotic limits of the (weighted) within-cluster expectations, assumed to converge to the same limits across different clusters.¹⁰ The function has decreasing marginal effects whenever spillovers have a positive effect ($\varphi_2 > 0$).

Before discussing the sequential experiment for estimating $\beta^*$, we ask whether the baseline policy is welfare-maximizing. Namely, this section answers the following question:

"given a baseline policy $e(\cdot; \iota)$, $\iota \in \mathcal B$, is $\iota = \beta^*$, i.e., does it maximize welfare?" (11)

The question is equivalent to testing the hypothesis

$$W(\iota) \ge W(\beta), \quad \text{for all } \beta \in \mathcal B. \qquad (12)$$

Observe that we do not compare $\iota$ to a specific alternative, but instead ask whether $\iota$ outperforms all other policies. The above equation represents a natural null hypothesis whenever its rejection motivates possibly expensive (because of either its accounting or opportunity cost) larger-scale experimentation.

¹⁰To see why the claim holds, observe that the objective function reads as follows:

$$W(\beta) = \lim_{\check N \to \infty}\frac{1}{\check N}\sum_{i=1}^{\check N}\Big\{\big[E[\mathrm{floor}_i]\beta_1 + E[\mathrm{roof}_i]\beta_2 + E[\mathrm{educ}_i]\beta_3\big]\varphi_1 + \frac{\sum_{j \ne i} A^{k(i)}_{i,j}\big[E[\mathrm{floor}_j]\beta_1 + E[\mathrm{roof}_j]\beta_2 + E[\mathrm{educ}_j]\beta_3\big]\varphi_2}{\sum_{j \ne i} A^{k(i)}_{i,j}}\Big\}$$
$$- \lim_{\check N \to \infty}\frac{1}{\check N}\sum_{i=1}^{\check N}\Big\{\frac{\sum_{j \ne i} A^{k(i)}_{i,j}\big[E[\mathrm{floor}_j]\beta_1 + E[\mathrm{roof}_j]\beta_2 + E[\mathrm{educ}_j]\beta_3\big]\varphi_2}{\sum_{j \ne i} A^{k(i)}_{i,j}} \times \big[E[\mathrm{floor}_i]\beta_1 + E[\mathrm{roof}_i]\beta_2 + E[\mathrm{educ}_i]\beta_3\big]\Big\}. \qquad (10)$$

Assuming that the weighted within-cluster expectations converge to the same limit across different clusters leads to the above expression for $W(\beta)$. Since $A_{i,j} \in \{0,1\}$ and the covariates are either zero or one, the marginal effect has a negative derivative whenever $\varphi_2 > 0$.
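The hypothesis in (12) can be probed through the welfare derivative at $\iota$, which is the route the testable implication below takes. As an illustration, under the Example 2.1 welfare function (illustrative parameter values, zero treatment costs), a finite-difference derivative vanishes at the welfare-maximizing baseline and is nonzero at a suboptimal one.

```python
def welfare(beta, phi1=0.2, phi2=0.5, phi3=1.0, kappa=0.25):
    # W(beta) for Example 2.1, zero treatment costs (illustrative values)
    return beta * (phi1 + phi2) - (beta * kappa + beta ** 2 * (1 - kappa)) * phi3

def marginal_effect(iota, h=1e-6):
    """Central finite-difference approximation of V(iota) = dW/dbeta at iota."""
    return (welfare(iota + h) - welfare(iota - h)) / (2 * h)

# beta* = (phi1 + phi2 - phi3 * kappa) / (2 * phi3 * (1 - kappa)) = 0.30 here
v_at_optimum = marginal_effect(0.30)   # ~0: consistent with iota = beta*
v_elsewhere = marginal_effect(0.45)    # nonzero: the baseline is improvable
```

In practice $V(\iota)$ is of course not computable from the model and must be estimated from the pilot experiment, which is the subject of the next sections.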
The following testable implication is considered.

Testable implication
Let $\iota$ be an interior point of $\mathcal B$, and let $W(\beta)$ be continuously differentiable. Then

$$V^{(j)}(\iota) = 0 \ \ \forall j \in \{1, \cdots, p\} \quad \text{if} \quad W(\iota) \ge W(\beta) \ \text{for all } \beta \in \mathcal B.$$

The above implication follows from standard properties of continuously differentiable functions, and it allows us to perform the test without comparing $\iota$ to all possible alternatives. Instead, we can test the following hypothesis:

$$H_0: V^{(j)}(\iota) = 0, \quad j \in \{1, \cdots, \tilde p\}, \qquad (13)$$

where we test $1 \le \tilde p \le p$ arbitrarily many coordinates of the vector $V(\beta)$. Observe that the implication does not require concavity; it relies solely on differentiability of the objective function and on $\iota$ being an interior point. We formalize our intuition in the following lines, where we discuss estimation and inference on marginal effects. Testing marginal effects in the context of experimental design has not been discussed in previous literature. We assume possibly finitely many clusters $K \ge p$, and two experimentation periods only.

Organization
We organize this section as follows: we start by introducing local two-stage experimentation; we then introduce the estimators constructed from the randomization procedure; we discuss the full algorithm for the design of the pilot study; finally, we discuss inference on marginal effects using the observations from the pilot study.
In this section, we discuss the intuition and motivation behind our procedure for testing policy optimality at the parameter value $\iota$. We start with some preliminary notation.

Preliminaries. Consider two clusters, indexed by $\{k, k+1\}$, and two periods $\{1, t\}$, with $k$ being an odd number (e.g., $k = 1$). The key idea for non-parametrically estimating marginal effects consists of inducing local deviations in the parameter at the cluster level, and alternating deviations over pairs of clusters observed over two consecutive periods. For expositional convenience, here we discuss the problem of estimating one single entry $V^{(j)}(\iota)$, for a given parameter $\iota$. In this section, the parameter $\iota$ is assumed to be exogenous. We define the vector

$$e_j = \big[0, \cdots, 0, 1, 0, \cdots, 0\big], \qquad e_j \in \{0,1\}^p, \quad e^{(j)}_j = 1,$$

with $e_j$ equal to zero in all entries except entry $j$. Define $(-j)$ as all the indexes of a vector except index $(j)$.

Local experimentation
For a given set of parameters $\beta_t$, the key idea for estimating marginal effects consists of assigning treatments independently across units as follows:

$$D_{i,1} \mid X_i = x, \beta_{k(i),1} \sim \mathrm{Bern}\big(e(x; \beta_{k(i),1})\big),$$
$$D_{i,t} \mid X_i = x, \beta_{k(i),t} \sim \begin{cases} \mathrm{Bern}\big(e(x; \beta_{k(i),t} + \eta_n e_j)\big) & \text{if } k(i) = k, \\ \mathrm{Bern}\big(e(x; \beta_{k(i),t} - \eta_n e_j)\big) & \text{if } k(i) = k + 1, \end{cases} \qquad n^{-1/2} < \eta_n < n^{-1/4}, \ t > 1. \qquad (14)$$

The parameter $\eta_n$ captures small deviations from the target parameter. Intuitively, in the first period, each cluster's treatment assignment depends on the parameter $\beta_{k(i),1}$. In the second period, instead, we induce a small deviation in the parameter $\beta_{k(i),t}$, with opposite signs within the pair. The two-period randomization aims to control for cluster-specific fixed effects; the between-cluster randomization instead aims to control for time-specific fixed effects. A crucial aspect for identification is that the deviations $\eta_n$ are deterministically assigned with opposite signs within each pair. Finally, recall that treatments are assigned to all individuals in the population.

Example 2.1 Cont'd
Let $e(x; \iota) = \iota \in (0,1)$, with $\iota = 40\%$. At time $t = 0$, researchers invite to information sessions each individual in village $k = 1$ and village $k = 2$ with equal probability $40\%$. At time $t = 1$, researchers treat individuals in village $k = 1$ with the lower probability $40\% - \eta_n$, and individuals in village $k = 2$ with the higher probability $40\% + \eta_n$. This randomization is illustrated in Figures 3 and 4.

Figure 3: Example of two-stage local randomization with time- and cluster-specific fixed effects. Each node is a different individual, and squares denote clusters. Units are connected within each cluster. Pink nodes denote control units, and gray nodes denote treated units. In the first period, units are assigned to treatment with the same probability in clusters $k \in \{1, 2\}$. In the second period, the probability of being treated is slightly larger in cluster $k = 1$ and smaller in cluster $k = 2$.

Figure 4: Example of two-stage local randomization with time- and cluster-specific fixed effects. In the first period, a draw from the blue dot for each cluster is performed. In the second period, in the first cluster we assign the policy colored in green, and in the second cluster the one colored in brown. We test for policy optimality by testing whether the estimated derivative equals zero. The sequential randomization procedure repeats the process sequentially, using a circular estimation procedure for the marginal effect to guarantee unconfounded experimentation (see Section 4.1).

Next, we discuss the estimators of interest. We estimate separately the direct effect of the treatment and the marginal spillover effect of the treatment. Separate estimation of these two effects has two motivations: (i) it exploits knowledge of the propensity score function $e(x; \beta)$; (ii) it permits identification of marginal effects also when fixed effects are not separable in time and cluster identity, but the spillover effects on the treated are zero, as discussed in Section 5. The proposed estimator can be interpreted as a difference-in-differences estimator, where we take differences between outcomes once reweighted by the marginal probability of treatment and the inverse probability of treatment. Figure 5 provides the basic intuition behind the proposed estimator in the presence of time- and cluster-specific fixed effects.

It will be convenient to define

$$v_h = \begin{cases} -1 & \text{if } h \text{ is odd;} \\ 1 & \text{otherwise,} \end{cases} \qquad e_{i,j,t}(\beta) = e\big(X_i, \beta + \eta_n \times v_{k(i)} e_j 1\{t > 1\}\big), \qquad (15)$$

respectively the sign indicator corresponding to the cluster identity $v_h$ and the assigned propensity score of individual $i$ at time $t$ for a given target parameter $\beta$. We can now discuss estimation of the marginal effects.

Estimation of direct effects
We estimate the direct effects using a Horvitz-Thompson estimator (Horvitz and Thompson, 1952), reweighted by the marginal effect on the propensity score. Namely, we define

$$\widehat\Delta^{(j)}_{k,t}(\beta) = \frac{1}{2n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}} \frac{\partial e(X_i; \beta)}{\partial \beta^{(j)}}\Big[\frac{Y_{i,t} D_{i,t}}{e_{i,j,t}(\beta)} - \frac{Y_{i,t}(1 - D_{i,t})}{1 - e_{i,j,t}(\beta)}\Big]. \qquad (16)$$

Figure 5: The intuition behind the estimator with additive and separable time- and cluster-specific fixed effects. We consider two clusters $k \in \{1, 2\}$, with $(\tau_1, \tau_2)$ denoting respectively the cluster-specific fixed effects of the first and second cluster. The brown cross corresponds to the second cluster's welfare value, and the green cross to the one in the first cluster. In the first period, a one-period experiment in each cluster is performed. The corresponding welfare depends on the cluster-specific fixed effect ($\tau_k$) and the time-specific fixed effect ($\alpha_t$). In the second period, a positive (negative) small deviation is applied to the policy $\beta$ in the first (second) cluster. For the first (second) cluster, the difference within the same period over the two clusters equals the derivative (respectively, minus the derivative) $\frac{\partial W(\beta)}{\partial \beta}$ multiplied by the deviation parameter $\eta_n$, plus the difference of the cluster-specific fixed effects $\tau_1 - \tau_2$. The difference of the difference between the two clusters equals approximately two times the derivative $\frac{\partial W(\beta)}{\partial \beta}$ times the deviation parameter $\eta_n$.

The above expression estimates the average effect of treating an individual sampled from cluster $k$ and cluster $k+1$, once we reweight the expression by the marginal effect on the treatment assignment. Observe that each individual's outcome is weighted by the inverse probability of the assigned treatment.
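A minimal sketch of the Horvitz-Thompson direct-effect estimator in Equation (16), for the simple covariate-free propensity $e(x; \beta) = \beta$ (so $\partial e/\partial \beta^{(j)} = 1$ and $e_{i,j,t}(\beta) = \beta + \eta_n v_{k(i)}$); the inputs are illustrative, not from the paper.

```python
def direct_effect_hat(y, d, dev_sign, beta, eta):
    """Equation (16) with e(x; beta) = beta, so de/dbeta = 1 and
    e_{i,j,t}(beta) = beta + eta * v_{k(i)}.
    y, d, dev_sign: outcomes, treatments, and deviation signs v_{k(i)}
    for the 2n units of the cluster pair at period t."""
    total = 0.0
    for yi, di, vi in zip(y, d, dev_sign):
        e_dev = beta + eta * vi
        total += yi * di / e_dev - yi * (1 - di) / (1 - e_dev)
    return total / len(y)   # 1/(2n) times the sum over the 2n units
```

For instance, with all outcomes and treatments equal to one, $\beta = 0.5$, and $\eta_n = 0.1$, the estimator averages $1/0.4$ and $1/0.6$ across the two clusters.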
However, the derivative $\frac{\partial e(X_i; \beta)}{\partial \beta^{(j)}}$ is evaluated at the target parameter $\beta$, before introducing the perturbation.

Estimation of marginal spillover effects
Next, we discuss estimation of the marginal spillover effects, i.e., the component defined as (S) in Equation (9), averaged over the distribution of covariates. The estimators on the treated and control units, respectively, take the following form:

$$\hat S^{(j)}_{k,t}(1, \beta) = \frac{1}{2n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}}\Big[\frac{v_{k(i)}\, e(X_i; \beta)}{\eta_n} \times \frac{Y_{i,t} D_{i,t}}{e_{i,j,t}(\beta)}\Big] - \frac{1}{2n}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}}\Big[\frac{v_{k(i)}}{\eta_n} \times Y_{i,1} D_{i,1}\Big],$$

$$\hat S^{(j)}_{k,t}(0, \beta) = \frac{1}{2n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}}\Big[\frac{v_{k(i)}\,(1 - e(X_i; \beta))}{\eta_n} \times \frac{Y_{i,t}(1 - D_{i,t})}{1 - e_{i,j,t}(\beta)}\Big] - \frac{1}{2n}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}}\Big[\frac{v_{k(i)}}{\eta_n} \times Y_{i,1}(1 - D_{i,1})\Big].$$

$\hat S^{(j)}_{k,t}(1, \beta)$ (and similarly $\hat S^{(j)}_{k,t}(0, \beta)$) depends on several components. First, (i) it depends on the weighted outcomes of treated individuals in each cluster of the pair. Second, (ii) it reweights observations by the propensity score evaluated at the coefficient $\beta$. Finally, (iii) it takes the difference of differences (i.e., it weights observations by $v_k$) of the weighted outcomes between the two clusters and between the two periods. The overall expression is then divided by the deviation parameter $\eta_n$.

Marginal effect estimator
The final estimator of the marginal effect defined in Equation (9) is the sum of the direct and marginal spillover effects, taking the following form:

$$\widehat Z^{(j)}_{k,t}(\beta) = \hat S^{(j)}_{k,t}(1, \beta) + \hat S^{(j)}_{k,t}(0, \beta) + \widehat\Delta^{(j)}_{k,t}(\beta) - \frac{1}{2n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}} c(X_i)\frac{\partial e(X_i, \beta)}{\partial \beta^{(j)}}, \qquad (17)$$

where the last component captures the average marginal cost. We now discuss the theoretical properties of the estimator.

Assumption 3 (Regularity 1). Let the following conditions hold.

(A) $||m(\cdot)||_\infty < \infty$, and $m$ is twice continuously differentiable in $\beta$ with uniformly bounded derivatives;

(B) $\varepsilon_{i,t}$ is a sub-Gaussian random variable with parameter $\sigma^2 < \infty$, and $m_{i,t}(\cdot)$ is uniformly bounded for all $(i, t)$;

(C) $\beta \mapsto e(X; \beta)$ is twice continuously differentiable in $\beta$, with uniformly bounded first and second order derivatives almost surely;

(D) $\mathcal X$ is a compact space.

Assumption 3(A) is a regularity assumption, which imposes a bounded conditional mean with bounded derivatives. (B) holds whenever, for instance, $\varepsilon_{i,t}$ is uniformly bounded. (C) assumes bounded derivatives of the propensity score, which holds for common functional forms such as logistic or probit assignments whenever covariates have compact support. We now introduce the first theorem.

Theorem 3.1.
Let Assumptions 1, 2, and 3 hold, and consider a randomization as in Equation (14) with an exogenous parameter $\iota$. Then

$$\Big|E\big[\widehat Z^{(j)}_{k,2}(\iota)\big] - V^{(j)}(\iota)\Big| = O\big(J_n/\eta_n + \eta_n\big).$$

The proof is contained in the Appendix. The above theorem shows that the estimator's expectation converges to the target estimand for a fixed, exogenous coefficient.
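Assembling the pieces of Equation (17), the sketch below computes the pair-level marginal-effect estimator for the covariate-free case $e(x; \beta) = \beta$ with zero costs; the inputs are illustrative. A useful sanity check: with outcomes constant across units and periods, welfare does not vary with $\beta$, and indeed the direct and spillover pieces cancel, so the estimated marginal effect is zero.

```python
def z_hat(y_t, d_t, y_1, d_1, dev_sign, beta, eta):
    """Pair-level marginal-effect estimator of Equation (17) for the
    covariate-free case e(x; beta) = beta (so de/dbeta = 1) and zero costs.
    Inputs are aligned lists over the 2n units of a cluster pair:
    period-t and period-1 outcomes/treatments, plus deviation signs v_{k(i)}."""
    m = len(y_t)                                  # 2n units in the pair
    delta = s1 = s0 = 0.0
    for yt, dt, v in zip(y_t, d_t, dev_sign):
        e_dev = beta + eta * v                    # e_{i,j,t}(beta)
        delta += yt * dt / e_dev - yt * (1 - dt) / (1 - e_dev)
        s1 += v * beta / eta * yt * dt / e_dev                    # treated
        s0 += v * (1 - beta) / eta * yt * (1 - dt) / (1 - e_dev)  # control
    for y1, d1, v in zip(y_1, d_1, dev_sign):     # period-1 corrections
        s1 -= v / eta * y1 * d1
        s0 -= v / eta * y1 * (1 - d1)
    return (delta + s1 + s0) / m
```

Note the $1/\eta_n$ scaling of the spillover terms: small deviations make each summand large, which is why the bias-variance trade-off in the choice of $\eta_n$ appears in Theorem 3.1.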
Remark 4 (Non-separable time- and cluster-specific fixed effects). Consider the model in Equation (6), with non-separable time- and cluster-specific fixed effects. Suppose, however, that spillovers only occur on control units and not on the treated. Then identification of the marginal effect can be performed using a single experimentation period and two clusters, each exposed to deviations with opposite signs. Identification is described in Figure 6. See Section 5 for a formal discussion.
Remark 5 (Pairing clusters). In the presence of more than two clusters, we estimate marginal effects by first pairing clusters and then estimating the effects within each pair. In the absence of pairing, the resulting first-order bias would not be equal to zero; instead, after averaging across all clusters, it would only be of the undesirable order $1/(\sqrt K \eta_n)$. This follows from the fact that, in the absence of pairing, $v_k$ would be a Rademacher random variable whose average concentrates around zero only at rate $1/\sqrt K$. The first-order bias occurring from estimation within each cluster would cancel out only as the average of the $v_k$ converges to zero, which occurs at rate $1/\sqrt K$, rescaled by the denominator $\eta_n$. This is an additional and important difference from saturation experiments, where probabilities of treatment are randomly allocated across clusters.

Figure 6: Illustration of the intuition behind the identification of marginal spillover effects with non-separable time- and cluster-specific fixed effects (Equation 6), and spillovers only on the control units. The x-axis corresponds to treated and control units. Two clusters $k \in \{1, 2\}$ are considered. The difference over control units between the two clusters (the line between the brown and green crosses) corresponds to the marginal spillover effect on the controls times the deviation $\eta_n$, plus the between-cluster difference of the time- and cluster-specific fixed effects $\alpha_{k,t}$. The difference between the treated units instead corresponds to the difference between the non-separable fixed effects, assuming that spillovers only occur on the control units (and not on the treated). The difference of the difference equals the marginal spillover effect on the controls, times $\eta_n$.

Remark 6 (Sequential randomization). In the presence of sequential randomization with $t >$
1, the choice of the parameter may depend on past information. In this case, the exogeneity condition on the parameter does not necessarily hold. We propose estimators that address this issue in Section 4.

Throughout the rest of our discussion, it will be convenient to refer to $\widehat Z$ as an average of random variables. It can be easily shown that the estimator in Equation (17) reads as

$$\widehat Z^{(j)}_{k,t}(\beta) = \frac{1}{n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}} W^{(j)}_{i,t}(\beta) - \frac{1}{n}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}} \frac{v_{k(i)}}{2\eta_n} Y_{i,1}, \qquad (18)$$

where

$$W^{(j)}_{i,t}(\beta) = \frac{1}{2}\Big[\frac{Y_{i,t} D_{i,t}}{e_{i,j,t}(\beta)} - \frac{Y_{i,t}(1 - D_{i,t})}{1 - e_{i,j,t}(\beta)} - c(X_i)\Big]\frac{\partial e(X_i; \beta)}{\partial \beta^{(j)}} + \frac{v_{k(i)}}{2\eta_n}\Big[\frac{Y_{i,t} D_{i,t}}{e_{i,j,t}(\beta)} e(X_i; \beta) + \frac{Y_{i,t}(1 - D_{i,t})}{1 - e_{i,j,t}(\beta)}\big(1 - e(X_i; \beta)\big)\Big]. \qquad (19)$$

Remark 7 (Randomization in the absence of time-specific fixed effects). Suppose that $\alpha_t = 0$ (e.g., sequential randomization occurs over a short time period). Then we construct the estimator of the marginal effect by pairing each cluster with itself over two consecutive periods $\{t-1, t\}$. This approach requires half the number of clusters for its implementation, at the expense of increasing the overall number of randomization periods.

We now discuss the pilot study, consisting of two periods of experimentation $t \in \{1, 2\}$, for inference on marginal effects.

Pairing clusters
First, we pair clusters. Without loss of generality, we assume that pairs consist of two consecutive clusters $(k, k+1)$ for each odd $k$. We assign $v_k$ as in Equation (15).

Assigning coordinates to different pairs
We assign every element of the set of odd cluster indexes $\{1, 3, \cdots, K-1\}$ to a set $\mathcal K_j \subseteq \{1, 3, \cdots, K-1\}$, for each coordinate $j \in \{1, \cdots, \tilde p\}$, with $|\mathcal K_j| = \tilde K \ge 2$. The set $\mathcal K_j$ denotes the set of clusters used to test coordinate $j$ of the marginal effect.

Small deviations
The experimenter assigns treatments according to the allocation rule in Definition 2.1. Each pair estimates a single coordinate $(j)$: for all $k \in \big\{\mathcal K_j \cup \{h + 1, h \in \mathcal K_j\}\big\}$, i.e., for all clusters assigned to test coordinate $j$, we randomize treatments as in Equation (14) with $\beta_{k,1} = \beta_{k,2} = \iota$ for all $k$.

Estimation of marginal effects
We estimate marginal effects similarly to what was discussed in Equation (17). For any pair of clusters $(k, k+1)$, $k \in \mathcal K_j$, the estimator of the marginal effect at $\iota$ reads as $\widehat Z^{(j)}_{k,2}(\iota)$. Define, for each pair of clusters $(k, k+1)$, $k \in \mathcal K_j$,

$$\widehat Z_k = \widehat Z^{(j)}_{k,2}(\iota).$$

We observe that in many circumstances we may be interested in testing a specific coordinate of the vector, in which case $\mathcal K_j = \{1, 3, \cdots, K-1\}$.

Example 2.1 Cont'd. Consider conducting a pilot study to test whether treating individuals with probability $\iota = 40\%$ is welfare-optimal. Researchers run the experiment on at least two clusters (say, six clusters). In the first period $t = 1$, in all six clusters, researchers select individuals for treatment with probability $40\%$. In the second period, researchers first pair the clusters. In each pair, they assign treatments with $39\%$ probability in the first cluster and $41\%$ probability in the second cluster. They then estimate the marginal effect within each pair.

Example 2.2 Cont'd
Consider testing the first coordinate ($\beta_1$) and the second coordinate ($\beta_2$) of the policy function in Equation (2), using eight clusters. The baseline parameter vector $\iota$ is the same in every cluster. The marginal effect on the first coordinate $\beta_1$ is estimated using the first and third pairs of clusters, and the marginal effect on the second coordinate $\beta_2$ is estimated using the second and fourth pairs of clusters. See Figure 7 for a graphical illustration.

In the following lines, we discuss the proposed estimator's asymptotic properties, which allow us to test Equation (13). Before discussing the next theorem, we introduce regularity conditions. Observe first that, under Assumption 3, the summands $W^{(j)}_{i,t}(\beta)$ in Equation (19) and $\frac{v_{k(i)}}{2\eta_n} Y_{i,1}$ in Equation (18) are of order $1/\eta_n$. In the following assumption, we impose that the within-cluster variance is bounded away from zero after appropriately rescaling.
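The pilot's bookkeeping (pair the clusters, assign a coordinate to each pair, apply opposite deviations within the pair, as in Example 2.2) can be sketched as below. The alternating assignment of coordinates to pairs matches the example's pattern (first and third pairs test coordinate 1, second and fourth test coordinate 2), but it is only one simple choice; the function name and inputs are illustrative.

```python
def pilot_design(K, p_tilde, iota, eta):
    """Period-2 parameter vector for each cluster 1..K: baseline iota plus a
    -/+ eta deviation on the coordinate tested by the cluster's pair.
    Pairs are (1,2), (3,4), ...; pair m tests coordinate m mod p_tilde
    (alternating), one simple choice not prescribed by the text."""
    assert K % 2 == 0
    design = {}
    for pair, k in enumerate(range(1, K, 2)):   # odd cluster indexes
        j = pair % p_tilde                      # coordinate tested by this pair
        minus, plus = list(iota), list(iota)
        minus[j] -= eta                         # odd cluster: v_k = -1
        plus[j] += eta                          # even cluster: v_k = +1
        design[k], design[k + 1] = minus, plus
    return design
```

With $K = 8$ and $\tilde p = 2$, pairs (1,2) and (5,6) receive deviations on the first coordinate, and pairs (3,4) and (7,8) on the second, matching Example 2.2.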
Figure 7: Example 2.2 continued.

Assumption 4 (Regularity 2). Assume that for any exogenous vector $\beta \in \mathcal B$, under a CBAR, for each $k \in \{1, \cdots, K\}$, for $t = 1$,

$$\mathrm{Var}\Big(\frac{1}{n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}} W^{(j)}_{i,t}(\beta) - \frac{1}{n}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}}\frac{v_{k(i)}}{2\eta_n} Y_{i,1}\Big) = \bar C_k \rho_n,$$

where $\rho_n \ge 1/(n\eta_n^2)$, for a constant $\bar C_k > 0$, after appropriately rescaling by $\eta_n$. (See Lemma B.2.)

Under standard moment assumptions, observe that Assumption 4 is satisfied as long as

$$\mathrm{Var}\Big(\frac{1}{n}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}} W^{(j)}_{i,t}(\beta) - \frac{1}{n}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}}\frac{v_{k(i)}}{2\eta_n} Y_{i,1}\Big) \ge \frac{1}{n^2}\sum_{i \in \mathcal S_{k,t} \cup \mathcal S_{k+1,t}}\mathrm{Var}\big(W^{(j)}_{i,t}(\beta)\big) + \frac{1}{n^2}\sum_{i \in \mathcal S_{k,1} \cup \mathcal S_{k+1,1}}\mathrm{Var}\Big(\frac{v_{k(i)}}{2\eta_n} Y_{i,1}\Big).$$

Define

$$\widetilde Z_n = \big[\widehat Z_1, \widehat Z_3, \cdots, \widehat Z_{K-1}\big],$$

the vector of estimators of the marginal effect for each pair of clusters.

Theorem 3.2.
Let Assumptions 1, 2, 3, 4 hold. Then
$$\Sigma_n^{-1/2}\big(\widetilde Z_n - \mu\big) + B_n \to_d \mathcal{N}(0, I), \quad \text{where } B_n = O\Big(\eta_n^2 \times \sqrt{n} + J_n \times \sqrt{1/(\eta_n^2 \rho_n)}\Big), \quad \Sigma_n = \mathrm{diag}\Big(\mathrm{Var}(\widehat Z_1), \mathrm{Var}(\widehat Z_3), \cdots, \mathrm{Var}(\widehat Z_{K-1})\Big), \quad (20)$$
and for $k \in \mathcal{K}_j$, $\mu^{(k)} = V^{(j)}(\iota)$.

Theorem 3.2 showcases that the estimated gradient converges in distribution to a Gaussian distribution after appropriately rescaling by its variance. The asymptotic distribution is centered around the true marginal effect and a bias component $B_n$, which captures the discrepancy between the expectations across different clusters (i.e., clusters being drawn from different distributions). The theorem allows for $J_n = O(1/\sqrt{n})$ whenever $1/(\eta_n^2 n \rho_n) = o(1)$, i.e., whenever the variance of the estimator is of an order larger than $1/n$ after appropriate rescaling by $\eta_n$. This occurs in the presence of positive dependence, with an average degree growing with the sample size. In the presence of independent observations, the bias is vanishing if $J_n = o(1/\sqrt{n})$. Finally, the expression of the bias also shows that $\eta_n$ should be selected such that $\eta_n = o(n^{-1/4})$.

Given Theorem 3.2, we construct a scale-invariant test statistic without necessitating estimation of the (unknown) variance (Ibragimov and Müller, 2010). Define
$$P^{(j)}_n = \frac{1}{\tilde K}\sum_{k \in \mathcal{K}_j} \widehat Z^{(j)}_k,$$
the average marginal effect for coordinate $j$, estimated from those clusters used to estimate the effect of the $j$th coordinate. We construct
$$Q_{j,n} = \frac{\sqrt{\tilde K}\, P^{(j)}_n}{\sqrt{(\tilde K - 1)^{-1}\sum_{k \in \mathcal{K}_j}\big(\widehat Z^{(j)}_k - P^{(j)}_n\big)^2}}, \qquad T_n = \max_{j \in \{1, \cdots, \tilde p\}} |Q_{j,n}|, \quad (21)$$
where $T_n$ denotes the test statistic employed to test the null hypothesis in Equation (13). The choice of the $l$-infinity norm as above is often employed in statistics for testing global null hypotheses (Chernozhukov et al., 2014).
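The statistic in Equation (21) can be sketched in a few lines (a minimal stdlib-Python illustration; the pair-level estimates passed in are toy inputs, and we read the critical value as the $(1-h)$-th quantile; for $\tilde K = 2$ the $t$-distribution with one degree of freedom is standard Cauchy, so the quantile has a closed form, while general $\tilde K$ would use `scipy.stats.t.ppf`):

```python
import math

def tn_statistic(z_by_coord):
    """Scale-invariant statistic of Equation (21).

    z_by_coord: dict mapping a coordinate j to the list of pair-level
    estimates Z_k^{(j)} (one entry per pair of clusters in K_j).
    Returns (Q, T_n), with Q the per-coordinate t-statistics."""
    q = {}
    for j, z in z_by_coord.items():
        k_tilde = len(z)
        p = sum(z) / k_tilde                                  # P_n^{(j)}
        s2 = sum((zk - p) ** 2 for zk in z) / (k_tilde - 1)   # sample variance
        q[j] = math.sqrt(k_tilde) * p / math.sqrt(s2)
    return q, max(abs(v) for v in q.values())                 # T_n = max_j |Q_{j,n}|

def q_alpha_two_pairs(alpha, p_tilde):
    """Critical value q_alpha for K_tilde = 2: one degree of freedom,
    so the (1 - h)-th t quantile equals tan(pi * (1/2 - h)) (Cauchy)."""
    h = 1.0 - (1.0 - alpha) ** (1.0 / p_tilde)
    return math.tan(math.pi * (0.5 - h))
```

For instance, with estimates `{1: [2.0, 0.0, 2.0, 0.0]}` the statistic equals $\sqrt{3}$, and the rejection rule compares $T_n$ with `q_alpha_two_pairs(alpha, p_tilde)` when only two pairs are available.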
In our application, the $l$-infinity norm is motivated by its theoretical properties: the statistic $Q_{j,n}$ follows an unknown distribution as a result of possibly heteroskedastic variances of $\widehat Z_k$ across different clusters. However, an upper bound on the critical quantiles under unknown variances attains a simple expression for the proposed test statistic. From a conceptual standpoint, the proposed test statistic is particularly suited when a large deviation occurs over one dimension of the vector.

Theorem 3.3 (Nominal coverage). Let Assumptions 1, 2, 3, 4 hold. Let $\tilde K \ge 2$, $H_0$ be as defined in Equation (13), and $B_n = o(1)$. For any $\alpha \le 0.08$,
$$\lim_{n \to \infty} P\Big(T_n \le q_\alpha \,\Big|\, H_0\Big) \ge 1 - \alpha, \quad \text{where } q_\alpha = cv_{\tilde K - 1}\Big(1 - (1 - \alpha)^{1/\tilde p}\Big), \quad (22)$$
where $cv_{\tilde K - 1}(h)$ denotes the critical value of a t-test with level $h$ whose test statistic has $\tilde K - 1$ degrees of freedom.

Theorem 3.3 allows for inference on marginal effects, and ultimately for testing policy optimality, using few clusters and two consecutive experimentation periods. The derivation exploits properties of the t-statistic discussed in Ibragimov and Müller (2010, 2016), combined with Theorem 3.2 and properties of the proposed test statistic $T_n$ used to test the global null hypothesis $H_0$. To our knowledge, Theorem 3.3 is the first result that allows for testing optimality of treatment allocation rules under network interference.

In this section, we discuss the experimental design to estimate $\beta^*$ as defined in Equation (8) through sequential randomization. See also Chernozhukov et al. (2018) for a discussion of pivotal inference in the different context of synthetic controls.
Figure 8: Illustration of the dynamic experiment with $p = 2$. The experiment has $T$ periods in total ($t \in \{1, \cdots, T\}$), $\check T$ many waves ($w \in \{1, \cdots, \check T\}$), and $p$ many iterations ($j \in \{1, 2\}$). In total the experiment has at least $K \ge 2(\check T + 1)$ many clusters.

Preliminaries and time structure
First, researchers pair clusters as discussed in Section 3. Each pair consists of consecutive clusters $\{k, k+1\}$, with $k$ being odd. Over each period $t$ and cluster $k$, they draw at random $n$ units from each cluster, whose covariates and post-treatment outcomes are observed. The indexes of these units are collected in the set $S_{k,t}$. We consider $\check T$ experimentation waves and $K \ge 2(\check T + 1)$ clusters paired into $K/2$ pairs. Each wave $w \in \{1, \cdots, \check T\}$ has $j \in \{1, \cdots, p\}$ iterations over which a gradient-descent algorithm is implemented, with in total $T = \check T \times p + 1$ periods of randomization, where at period $t = 0$ treatments are randomized based on the baseline policy $\iota$. Each iteration $j$ is used to estimate a different coordinate of the vector of marginal effects. In Example 2.1, $p = 1$, and therefore $\check T = T - 1$, while in Example 2.2, $p = 4$, and therefore $\check T = (T - 1)/p = 2$.

Initialization
Each experimentation wave corresponds to a vector $\check\beta^w \in \mathcal{B}^K$, with $\check\beta^w_k$ corresponding to the target parameter for cluster $k$. The set of parameters is estimated over the previous iteration (i.e., parameters are data-dependent), with initialization
$$\check\beta^1 = (\iota, \cdots, \iota),$$
with $\iota$ chosen exogenously. In the first period $t = 0$ (before any experimentation wave $w$ is performed), treatments are randomized independently as
$$D_{i,0} \mid X_i = x \sim \mathrm{Bern}\big(e(x; \iota)\big), \quad \forall i.$$

Remark 8 (Experiment, conditional on rejection). Whenever the larger-scale experiment is conducted conditional on the rejection of the null hypothesis in Equation (12), the larger-scale experiment must be performed on a set of clusters different from the ones used to test the above null hypothesis.

4.1 Description of one experimentation wave
The algorithm consists of $\check T$ experimentation waves. We first introduce the procedure for a single experimentation wave $w$. An experimentation wave consists of $p$ iterations, each involving two experimentation periods (the current period and the baseline period $t = 0$), with in total $p$ periods of randomization per wave. An experimentation wave $w$ starts at time $t = (w - 1) \times p + 1$.

Randomization
The randomization procedure iterates over each dimension $j \in \{1, \cdots, p\}$ of the vector of parameters. Formally, the following loop is considered. For each $j \in \{1, \cdots, p\}$,
$$D_{i,t} \mid X_i = x, \check\beta^w_{k(i)} \sim \begin{cases} \mathrm{Bern}\big(e(x; \check\beta^w_{k(i)} + \eta_n e_j)\big) & \text{if } k(i) \text{ is odd} \\ \mathrm{Bern}\big(e(x; \check\beta^w_{k(i)} - \eta_n e_j)\big) & \text{if } k(i) \text{ is even,} \end{cases}$$
$$\check Z^{(j)}_{k,w} = \begin{cases} \dfrac{1}{n}\displaystyle\sum_{i \in S_{k,t} \cup S_{k+1,t}} W^{(j)}_{i,t}(\check\beta^w_k) - \dfrac{1}{n}\displaystyle\sum_{i \in S_{k,0} \cup S_{k+1,0}} \dfrac{v_{k(i)}}{2\eta_n} Y_{i,0} & \text{if } k \text{ is odd;} \\ \check Z^{(j)}_{k-1,w} & \text{otherwise,} \end{cases} \qquad t \leftarrow t + 1, \quad (23)$$
where $W_{i,t}$ is defined in Equation (19). The loop works as follows: for each coordinate $j$, we randomize treatments with parameters having a positive (negative) deviation in the first (second) cluster in each pair. We then compute the $j$th coordinate of the experimentation-wave specific marginal effect $\check Z^{(j)}_{k,w}$ in cluster $k$, corresponding to the target parameter $\check\beta^w_k$. We subtract from the estimator the difference of the cluster-specific fixed effects.

Circular cross-fitting
We are left to discuss the choice of $\check\beta^{w+1}$. To do so, we use a circular cross-fitting procedure, which estimates the gradient using the marginal effect obtained in the subsequent pair:
$$\check V^{(j)}_{k,w} = \begin{cases} \check Z^{(j)}_{k+2,w} & \text{if } k \le K - 2 \\ \check Z^{(j)}_{1,w} & \text{otherwise} \end{cases} \ \text{ if } k \text{ is odd}, \qquad \check V^{(j)}_{k,w} = \check V^{(j)}_{k-1,w} \ \text{ if } k \text{ is even.}$$
We update each policy using a gradient-descent step as follows:
$$\check\beta^{w+1}_k = \Pi_{[\underline B, \bar B - \eta_n]}\Big[\check\beta^w_k + \alpha_{k,w} \check V_{k,w}\Big].$$
Here, $\Pi_{[\underline B, \bar B - \eta_n]}$ denotes the projection operator onto the set $[\underline B, \bar B - \eta_n]^p$. Intuitively, for each policy we perform a gradient update, using the gradient estimated on the subsequent pair of policies.
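The circular updates above can be sketched as follows (a minimal stdlib-Python illustration; the pair-level gradient estimates, the bounds, and the step size are toy assumptions, not values from the paper):

```python
def circular_cross_fit(z, K):
    """Map pair-level gradient estimates Z_k (odd k) to cross-fitted
    gradients V_k: each odd cluster k uses the estimate from the
    subsequent pair (k + 2), wrapping around to pair 1; even clusters
    copy their odd partner."""
    v = {}
    for k in range(1, K + 1, 2):
        v[k] = z[k + 2] if k <= K - 2 else z[1]
        v[k + 1] = v[k]
    return v

def gradient_update(beta, v, alpha, lo, hi, eta):
    """Projected step beta_k <- Proj_[lo, hi - eta](beta_k + alpha * V_k)."""
    return {k: min(max(beta[k] + alpha * v[k], lo), hi - eta) for k in beta}
```

For instance, with $K = 6$ and estimates `{1: 0.1, 3: -0.2, 5: 0.3}`, cluster 1 is updated with the gradient from pair $(3, 4)$, cluster 3 with that from pair $(5, 6)$, and cluster 5 wraps around to pair $(1, 2)$.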
The complete algorithm performs $\check T$ experimentation waves as described in the previous subsection in a sequential manner. The algorithm returns the average coefficients in each pair,
$$\hat\beta^* = \frac{1}{K}\sum_{k=1}^K \check\beta^{\check T + 1}_k.$$
Dependence plays an important role in our setting, where some or all of the units in a cluster may participate in the experiment in several periods. We break dependence using a novel cross-fitting algorithm, consisting of “circular” updates of the policies using information from subsequent clusters, as shown in Figure 10. We use a local optimization procedure for policy updates, with the gradient being estimated nonparametrically. We devise an adaptive gradient-descent algorithm to trade off the error of the method and the estimation error of the gradient.
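To fix ideas, the full loop (perturb the policy by $\pm\eta_n$, estimate the gradient by a paired contrast, update with the normalized learning rate of Remark 9 below) can be sketched on a noiseless toy problem with welfare $W(\beta) = \beta(1 - \beta)$, whose exact marginal effect is $1 - 2\beta$ and optimum $\beta^* = 0.5$; the values of $\gamma$, $\eta_n$, $\iota$, and the thresholds are illustrative assumptions:

```python
import math

def marginal_effect(beta, eta):
    """Finite-difference estimate of dW/dbeta for W(b) = b * (1 - b),
    mimicking the paired-cluster contrast at beta +/- eta (noiseless toy)."""
    w = lambda b: b * (1.0 - b)
    return (w(beta + eta) - w(beta - eta)) / (2.0 * eta)

def run_waves(iota=0.4, waves=2, gamma=0.05, eta=0.01, v=0.01, eps=1e-4):
    beta = iota
    for w in range(1, waves + 1):
        z = marginal_effect(beta, eta)          # wave-specific gradient
        norm = abs(z)
        alpha = gamma / (math.sqrt(w) * norm) if norm > v else eps
        beta = min(max(beta + alpha * z, 0.0), 1.0 - eta)  # projected update
    return beta
```

Starting from $\iota = 0.4$, two waves give $0.45$ and then $0.45 + 0.05/\sqrt{2} \approx 0.4854$, moving toward $\beta^* = 0.5$ with steps shrinking at rate $1/\sqrt{w}$.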
Remark 9 (Learning rate, quasi-concavity and local strong concavity). We choose a learning rate that accommodates strictly quasi-concave functions, taking
$$\alpha_{k,w} = \begin{cases} \dfrac{\gamma}{\sqrt{w}\,\lVert \check V_{k,w} \rVert} & \text{if } \lVert \check V_{k,w} \rVert > \dfrac{v}{\sqrt{\check T}} \\ \epsilon_n & \text{otherwise,} \end{cases}$$
for a positive $\epsilon_n$, $\epsilon_n \to 0$, and a small constant $1 \ge v > 0$. The reader may refer to Lemma B.8 in the Appendix for further details. The choice of the learning rate allows for strict quasi-concavity through the gradient's norm rescaling (Hazan et al., 2015), while it controls the estimation error after rescaling by $1/\sqrt{w}$.

For example, in a one-dimensional setting, we have $\Pi_{a,b}(c) = c$ if $c \in [a, b]$, $\Pi_{a,b}(c) = a$ if $c \le a$, and $\Pi_{a,b}(c) = b$ if $c \ge b$. The algorithm performs full gradient updates instead of coordinate-wise gradient updates due to the dependence structure, since otherwise, for large $p$, the circular cross-fitting may not guarantee unconfoundedness.

Figure 9: Example of a sequential experiment with two waves $w \in \{1, 2\}$. Each square corresponds to a given cluster. Clusters are first paired. Over each wave, a local variation at the cluster level is induced (positive on those clusters with odd $k$ and negative otherwise). The gradient is estimated within each pair. The policy in the next wave $w = 2$ is updated based on the gradient of the next pair.

Example 2.1 Cont'd
In this example $p = 1$, since only the probability of treatment is the parameter of interest, with a learning rate inducing a gradient-norm rescaling, $\alpha_{k,w} \propto 1/(\lVert \check V_{k,w} \rVert \sqrt{w})$ (see Remark 9). Let $K = 6$, $\check T = 2$. Each wave consists of one period.

1. Initialization: $\check\beta^1 = (\iota, \cdots, \iota)$, with $\iota = 0.4$ and $\eta_n = 0.01$. Formally, we let $\epsilon_n \propto \sqrt{\gamma_n/(\eta_n^2 n)} + J_n/\eta_n + \eta_n$.

2. First wave $w = 1$: Individuals in the first, third, and fifth clusters are assigned to treatment with probability 41% and those in the remaining clusters with probability 39%.
Estimates: the wave-specific marginal effects $(\check Z_{1,1}, \check Z_{3,1}, \check Z_{5,1})$ are computed as in Equation (23).

Update:
Set $\check\beta^2 = (0.5, \cdots, 0.5)$.

3. Second wave $w = 2$: the first, third, and fifth clusters assign treatment with probability 51% and the remaining clusters with probability 49%.

Estimates: the wave-specific marginal effects $(\check Z_{1,2}, \check Z_{3,2}, \check Z_{5,2})$ are computed as in Equation (23).

Update:
Set $\check\beta^3_k = \Pi_{[\underline B, \bar B - \eta_n]}\big[\check\beta^2_k + \alpha_{k,2}\check V_{k,2}\big]$, where the second-wave step is rescaled by $1/\sqrt{2}$.

A graphical illustration is depicted in Figure 9.

Next, we discuss the theoretical properties of the algorithm. The following assumption is imposed on the number of clusters.
Assumption 5 (Number of clusters). Suppose that $K \ge 2(\check T + 1)$.

Assumption 5 imposes that the number of clusters exceeds twice the number of experimentation waves.

Lemma 4.1 (Unconfoundedness). Let Assumptions 1, 5 hold. Consider $\check\beta^w_k$ estimated through the circular cross-fitting. Then for any $k \in \{1, \cdots, K\}$, $t \in \{1, \cdots, T\}$,
$$\big(\check\beta^1_k, \cdots, \check\beta^{\check T}_k\big) \perp \Big\{Y_{i,t}(d), X_i, d \in \{0, 1\}^{\check N}\Big\}_{i : k(i) \in \{k, k+1\},\ t \le T}.$$

The proof is contained in the Appendix. The proof is a consequence of the fact that the coefficients are estimated using information from all clusters except clusters $\{k, k+1\}$. Lemma 4.1 guarantees that the experimentation is not confounded due to time dependence between unobservables. In the following lines, we motivate the gradient-descent method as a valid optimization procedure also under lack of concavity, only imposing that the function is quasi-concave.

Assumption 6 (Strict quasi-concavity and local strong concavity). Assume that the following conditions hold.

(A) For every $\beta, \beta' \in \mathcal{B}$ such that $W(\beta') - W(\beta) \ge 0$, $V(\beta)^\top(\beta' - \beta) \ge 0$.

(B) For every $\beta \in \mathcal{B}$, $\lVert V(\beta) \rVert \ge \mu \lVert \beta - \beta^* \rVert$, for a positive constant $\mu > 0$.

(C) The Hessian $\frac{\partial^2 W(\beta)}{\partial \beta \partial \beta^\top}\big|_{\beta = \beta^*}$ has negative eigenvalues bounded away from zero at $\beta^*$.

Condition (A) imposes quasi-concavity of the objective function. The condition is equivalent to assuming that any $\alpha$-sub-level set of $-W(\beta)$ is convex, matching common definitions of quasi-concavity (Boyd et al., 2004). Condition (B) assumes that the gradient only vanishes at the optimum, allowing for saddle points, but ruling out regions over which marginal effects remain constant at zero. A simple sufficient condition for (B) is decreasing marginal effects (see the next example). A similar notion of strict quasi-concavity can be found in Hazan et al. (2015). Condition (C) imposes that the function has a negative definite Hessian at $\beta^*$ only, but not necessarily globally.
Intuitively, (C) imposes strong concavity only at the optimum.

Example 2.1 Cont'd
Let Equation (3) hold and suppose that $\phi > 0$, i.e., the marginal effects of treating one additional neighbor are decreasing. Then the function is strongly concave in $\beta$.

The above example discusses the case in the absence of covariates. The reader may refer to Equation (4) for an example in the presence of covariates. We can now state the following theorem.

Theorem 4.2 (Guarantees under quasi-concavity). Let Assumptions 1, 2, 3, 5, 6 hold. Take a small $\xi > 0$, and let $n^{1/2 - \xi} \ge \bar C \sqrt{\log(n)\, p\, \gamma_n}\, T e^{B\sqrt{pT}} \log(KT)$, $J_n \le 1/\sqrt{n}$, $\eta_n = 1/n^{1/4 + \xi}$, for finite constants $\infty > B, \bar C > 0$. Let $T \ge \zeta$, for a finite constant $\zeta < \infty$. Then with probability at least $1 - 1/n$,
$$\lVert \beta^* - \hat\beta^* \rVert^2 \le \frac{p \bar C}{\check T}.$$
The proof is in the Appendix. Theorem 4.2 provides a small-sample upper bound on the out-of-sample regret of the algorithm. The upper bound only depends on $T$ (and not $n$), since $n$ is assumed to be sufficiently larger than $T$. The following corollary holds.

Corollary.
Let the conditions in Theorem 4.2 hold. Then with probability at least $1 - 1/n$,
$$\tau(\beta^*) - \tau(\hat\beta^*) \le \frac{p \bar C'}{\check T}$$
for a finite constant $\bar C' < \infty$.

The above corollary formalizes an “out-of-sample” regret bound that scales linearly with the dimension of the parameter space and inversely with the number of experimentation waves. Theorem 4.2 provides guarantees on the estimated policy and the resulting welfare. The above theorem guarantees that the estimated policy, once implemented in future periods, leads to the largest welfare up to an error factor that grows linearly in the dimension of the parameter space and vanishes linearly in the number of waves. However, researchers may wonder whether the procedure is “harmless” also for the in-sample units, i.e., whether the procedure has guarantees on the in-sample regret (Bubeck et al., 2012). We provide guarantees in the following theorem.
Theorem 4.3 (In-sample regret). Let the conditions in Theorem 4.2 hold. Then with probability at least $1 - 1/n$,
$$\max_{k \in \{1, \cdots, K\}} \frac{1}{\check T}\sum_{w=1}^{\check T} \Big[\tau(\beta^*) - \tau(\check\beta^w_k)\Big] \le \bar C\, \frac{p \log(\check T)}{\check T}$$
for a finite constant $\bar C < \infty$.

The proof is contained in the Appendix. Theorem 4.3 guarantees that the cumulative welfare in each cluster $k$, incurred by deploying the current policy $\check\beta^w_k$ at wave $w$, converges to the largest achievable welfare at rate $\log(\check T)/\check T$, also for those units participating in the experiment. Observe that, by a first-order Taylor expansion under Assumption 3, a direct conclusion is that the bound also holds for the policies $\check\beta^w_k \pm \eta_n$, up to an additional factor which scales to zero at rate $\eta_n$ (and is therefore negligible under the conditions imposed on $n$). This result guarantees that the proposed design is not harmful to experimental participants in each cluster.

In the following theorem, we discuss similar guarantees, imposing weaker conditions on the sample size, at the expense of assuming global strong concavity of the objective function (Boyd et al., 2004). In this case, the learning rate is chosen as $\alpha_w = \gamma/w$, without necessitating rescaling by the size of the gradient. We formalize our result in the following theorem.

Theorem 4.4 (Guarantees under strong concavity). Let Assumptions 1, 2, 3, 5 hold. Let $\alpha_{k,w} = \gamma/w$ for a small $\gamma > 0$. Take a small $\xi > 0$. Let $n^{1/2 - \xi} \ge \bar C \sqrt{p \log(n)\, \gamma_n}\, T B \log(KT)$, $J_n \le 1/\sqrt{n}$, $\eta_n = 1/n^{1/4 + \xi}$, for finite constants $B, \bar C > 0$. Assume that $W(\beta)$ is strongly concave in $\beta$. Then with probability at least $1 - 1/n$,
$$\lVert \beta^* - \hat\beta^* \rVert^2 \le \frac{p \bar C}{T}$$
for a finite constant $\bar C < \infty$.

We now contrast the result with past literature. Regret guarantees are often the object of interest in analyzing policy assignments (Kitagawa and Tetenov, 2018; Mbakop and Tabord-Meehan, 2018; Athey and Wager, 2020; Kasy and Sautmann, 2019; Bubeck et al., 2012; Viviano, 2019).
However, the above references either assume a lack of interference or consider partially observable network structures. In online optimization, the rate $1/T$ is common for stochastic gradient-descent methods under concavity (Bottou et al., 2018). In particular, using a local-optimization method, Wager and Xu (2019) derive regret guarantees of the same order in the different setting of market pricing, under mean-field asymptotics (i.e., $n \to \infty$), with units and samples over each wave being independent. Differently, our results provide small-sample guarantees, without imposing independence or modeling assumptions other than partial interference. This requires a different proof technique. The proof of the theorem (i) uses concentration arguments for locally dependent graphs (Janson, 2004) to derive an exponential rate of convergence, adjusted by the dependence component $\gamma_n$; (ii) uses the within-cluster and between-cluster variation for consistent estimation of the marginal effect, together with the matching design, to guarantee that there is no non-vanishing bias when estimating marginal spillover effects; (iii) exploits in-sample regret bounds for the adaptive gradient-descent method with norm rescaling; and (iv) uses a recursive argument to bound the cumulative error obtained through the estimation and circular cross-fitting, where the cumulative error depends on the sample size and the number of iterations. Finally, we observe that our results only require local strong concavity, as opposed to global strong concavity. This is possible by first showing that the estimator lies within a ball close to the optimum as $T$ exceeds a certain finite threshold, which depends on the eigenvalues of the Hessian at $\beta^*$, and then establishing convergence within a local neighborhood of $\beta^*$.

In this section we discuss three extensions.
The first extension discusses nonseparable time- and cluster-specific fixed effects as in Equation (6). For the sake of brevity, we only discuss the estimator of the marginal effect and its consistency in this section. Asymptotic inference and regret bounds follow similarly to previous sections. The key assumption for estimating marginal effects under nonseparable fixed effects is that spillovers only occur on the control units but not on the treated. We formalize the assumption in the following lines.
Assumption 7 (Fixed effects and constant spillovers on the treated). For any $d \in \{0, 1\}$, $\beta \in \mathcal{B}$, $x \in \mathcal{X}$, any random sample $S_{k,t}$ of size $n$ from cluster $k$ is such that
$$\frac{1}{n}\sum_{i \in S_{k,t}} m_{i,t}\big(d, x, \beta\big) f_{X_i}(x) = \alpha_{k,t}(x) + m(d, x, \beta) f_X(x) + J_n, \qquad \frac{1}{n}\sum_{i \in S_{k,t}} f_{X_i}(x) = \check f_X(x) + J_n,$$
for some possibly unknown functions $\alpha_{k,t}(\cdot), m(\cdot), f_X(\cdot), \check f_X(\cdot)$ and $J_n \in [-b_n, b_n]$, for some positive $b_n \to 0$ as $n \to \infty$. Assume in addition that $m(1, x, \beta)$ is constant in $\beta$.

Assumption 7 states that time- and cluster-specific fixed effects are nonseparable, and spillovers only occur on control units. Under Assumption 7, local experimentation only occurs over a single period (instead of two periods). Namely, for each pair of clusters, over each period $t$, we randomize treatments as follows:
$$D_{i,t} \mid X_i = x, \beta_{k(i),t} \sim \begin{cases} \mathrm{Bern}\big(e(x; \beta_{k(i),t} + \eta_n e_j)\big) & \text{if } k(i) \text{ is odd} \\ \mathrm{Bern}\big(e(x; \beta_{k(i),t} - \eta_n e_j)\big) & \text{if } k(i) \text{ is even,} \end{cases} \qquad n^{-1/2} < \eta_n < n^{-1/4}. \quad (24)$$
The parameter $\eta_n$ captures small deviations from the target parameter. Observe that, differently from Equation (23), under Assumption 7 we do not necessitate two consecutive randomizations for estimation of the marginal effects. We now discuss the estimation of the marginal effects.

Estimation of direct effects
Similarly to Equation (16), we estimate the direct effect of the treatments taking a weighted difference between the control and treated units of the following form:
$$\widehat\Delta^{(j)}_{k,t}(\beta) = \frac{1}{2n}\sum_{i \in S_{k,t} \cup S_{k+1,t}} \frac{\partial e(X_i; \beta)}{\partial \beta^{(j)}}\Big[\frac{Y_{i,t} D_{i,t}}{e(X_i; \beta + v_{k(i)}\eta_n e_j)} - \frac{Y_{i,t}(1 - D_{i,t})}{1 - e(X_i; \beta + v_{k(i)}\eta_n e_j)}\Big]. \quad (25)$$
Since randomizations are implemented only over a single period, the expression sums effects on the treated and control units (reweighted by the assigned probability of exposure) only at time $t$.

Estimation of marginal spillover effects
By assumption, the marginal spillover effect on the treated is zero. Therefore, we only need to estimate the marginal spillover effect on the control units. The estimator takes the following form:
$$\widehat S^{(j)}_{k,t}(0, \beta) = \frac{1}{2n}\sum_{i \in S_{k,t} \cup S_{k+1,t}} \Big[\frac{v_{k(i)}\big(1 - e(X_i; \beta)\big)}{\eta_n} \times \frac{Y_{i,t}(1 - D_{i,t})}{1 - e(X_i; \beta + v_{k(i)}\eta_n e_j)}\Big].$$
Similarly to before, the estimator takes the difference in the outcomes on the control units between the two clusters, and rescales it by the factor $1/\eta_n$.

Bias estimation
Finally, observe that, due to nonseparable effects, the estimator of the marginal spillover effect presents a bias of the form $(\alpha_{k,t} - \alpha_{k+1,t})/\eta_n$. The bias is estimated by differencing the outcomes on the treated units between the two clusters (spillovers on the treated are constant by Assumption 7). Namely, the estimated bias is obtained as follows:
$$\widehat B^{(j)}_{k,t}(\beta) = \frac{1}{2n}\sum_{i \in S_{k,t} \cup S_{k+1,t}} \Big[\frac{v_{k(i)}\big(1 - e(X_i; \beta)\big)}{\eta_n} \times \frac{Y_{i,t} D_{i,t}}{e(X_i; \beta + v_{k(i)}\eta_n e_j)}\Big].$$

Marginal effect estimator
The final estimator of the marginal effect defined in Equation (9) is the sum of the direct and marginal spillover effects, taking the following form:
$$\widehat Z^{(j)}_{k,t}(\beta) = \widehat S^{(j)}_{k,t}(0, \beta) - \widehat B^{(j)}_{k,t}(\beta) + \widehat\Delta^{(j)}_{k,t}(\beta) - \frac{1}{2n}\sum_{i \in S_{k,t} \cup S_{k+1,t}} c(X_i)\frac{\partial e(X_i, \beta)}{\partial \beta^{(j)}}, \quad (26)$$
where the last component captures the average marginal cost. We now discuss the theoretical properties of the estimator. A graphical illustration is provided in Figure 6. We can now state the following theorem.

Theorem 5.1.
Let Assumptions 1, 3, 7 hold, and consider a randomization as in Equation (24) with an exogenous parameter $\iota$. Then
$$\Big| E\big[\widehat Z^{(j)}_{k,0}(\iota)\big] - V^{(j)}(\iota) \Big| = O\big(J_n/\eta_n + \eta_n\big).$$

Theorem 5.1 guarantees consistency of the estimator for a suitable choice of $\eta_n$. All the remaining results directly extend to this setting as well.

In this section we discuss inference on the estimated parameter $\check\beta^w_k$ ex post the sequential randomization procedure. To conduct inference, we let $K, \check T$ be finite, while imposing the stronger assumption that $K \ge 4(\check T + 1)$, i.e., a twice as large pool of clusters compared to Assumption 5 is available. In this scenario, for each group of pairs $\{k, k+1, k+2, k+3\}$, we run the same algorithm as in Section 4.1, with a small modification: we group clusters into groups of four and, over each wave, we let
$$\check\beta^w_k = \check\beta^w_{k+1} = \check\beta^w_{k+2} = \check\beta^w_{k+3}. \quad (27)$$
The definition of the marginal effects $\check Z_{k,w}$ remains the same as in Equation (23). Given the policy $\check\beta^{\check T + 1}_k$, we test the hypothesis
$$H_{post,k} : V\big(\check\beta^{\check T + 1}_k\big) = 0 \ \text{ given } \check\beta^{\check T + 1}_k, \qquad k \in \{1, 5, 9, \cdots\}.$$
We can then construct the following test statistic for $H_{post,k}$. For $k \in \{1, 5, 9, \cdots\}$, we define (recall that $\check Z_{k,w}$ contains information from clusters $k$ and $k+1$)
$$Q^{post}_{k,j} = \frac{\sqrt{2}\,\big(\check Z^{(j)}_{k,\check T} + \check Z^{(j)}_{k+2,\check T}\big)/2}{\sqrt{\big(\check Z^{(j)}_{k,\check T} - \check Z^{(j)}_{k+2,\check T}\big)^2/2}} = \frac{\check Z^{(j)}_{k,\check T} + \check Z^{(j)}_{k+2,\check T}}{\big|\check Z^{(j)}_{k,\check T} - \check Z^{(j)}_{k+2,\check T}\big|}, \qquad T^{post,k}_n = \max_j \big|Q^{post}_{k,j}\big|,$$
with $T^{post,k}_n$ denoting the test statistic for the $k$th hypothesis. We now introduce the following theorem.
Let Assumptions 1, 2, 3, 5 hold, and let Assumption 4 hold for $t = T$. Let $K \ge 4(\check T + 1)$, and consider a design mechanism as in Section 4.1 with policies as in Equation (27). Let $\eta_n = n^{-1/4 - \xi}$, for a small $\xi > 0$, and $J_n = 0$. Let $\alpha/p \le 0.08$. Then
$$\lim_{n \to \infty} P\Big(T^{post,k}_n \le \mathrm{cv}(\alpha/p) \,\Big|\, \check\beta_{k,t}, H_{post,k}\Big) \ge 1 - \alpha,$$
where $\mathrm{cv}(h)$ denotes the $(1 - h)$-th quantile of a standard Cauchy random variable.

The above theorem allows for separate testing. In the presence of multiple testing, size adjustments to control the compound error rate should be considered. Observe that we may also increase the size of the groups of clusters (e.g., $K \ge 8(\check T + 1)$) to obtain the same expression as above for the test statistic, averaged over more clusters, thus increasing power.

In this section we discuss extensions of the model that allow for carry-over effects. For expositional convenience, we allow carry-overs only through two consecutive periods.
We start our discussion by introducing the dynamic model.
Assumption 8 (Dynamic model). For a conditional Bernoulli allocation with exogenous parameters as in Definition 2.1, let the following hold:
$$Y_{i,t} = m_i\big(D_{i,t}, X_i, \beta_{k(i),t}, \beta_{k(i),t-1}\big) + \varepsilon_{i,t}, \qquad E_{\beta_{k(i),t}}\big[\varepsilon_{i,t} \mid D_{i,t}, X_i\big] = 0,$$
for some unknown $m_i(\cdot)$.

Assumption 8 defines the outcomes as functions of the present treatment assignment, covariates, and the policy decision $\beta$ implemented in the current and past period. The component $\beta_{k,t-1}$ captures carry-over effects that result from neighbors' treatments in the past. Similarly to Assumption 2, we assume that clusters are representative of the underlying population of interest. For simplicity, we omit time- and cluster-specific fixed effects, and we also assume that covariates are identically distributed.

Assumption 9 (Representative clusters). Let the following hold: for any random sample $S_{k,t}$ from cluster $k$, with size $|S_{k,t}| = n$,
$$\frac{1}{n}\sum_{i \in S_{k,t}} m_i(d, x, \beta_t, \beta_{t-1}) = m(d, x, \beta_t, \beta_{t-1}) + O(J_n), \qquad J_n \to 0.$$
Assume in addition that $X_i \sim F_X$ for all $i$.

Discussion of the above condition can be found in Section 2. Given the above definitions, we can introduce the notion of welfare.

Definition 5.1 (Instantaneous welfare). Define
$$\Gamma(\beta, \phi) = \int \Big\{ e(x; \beta)\big[m(1, x, \beta, \phi) - m(0, x, \beta, \phi)\big] + m(0, x, \beta, \phi) - c(x)\, e(x; \beta) \Big\}\, dF_X(x),$$
the instantaneous welfare.

Definition 5.1 defines welfare as a function of the parameters $\beta$ of the current policy and of the past policy $\phi$. It captures the notion of welfare at a given point in time. We now introduce our estimand of interest.

Definition 5.2 (Estimand). Define the estimand as follows:
$$\beta^{**} \in \arg\sup_{\beta \in \mathcal{B}} \Gamma(\beta, \beta).$$
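As a toy illustration of the estimand (every functional form below is our own assumption, not the paper's), with scalar policies, no covariates, $e(x; \beta) = \beta$, $m(1, x, \beta, \phi) = 1$, a hypothetical spillover $m(0, x, \beta, \phi) = 0.5\phi$ from the past treatment share, and cost $c(x) = 0.8$, the map $\beta \mapsto \Gamma(\beta, \beta)$ can be maximized on a grid:

```python
def gamma(beta, phi):
    """Toy instantaneous welfare Gamma(beta, phi): e(x; beta) = beta,
    m(1, ., ., .) = 1, m(0, ., beta, phi) = 0.5 * phi, cost c(x) = 0.8.
    All functional forms are illustrative assumptions."""
    m1, m0, cost = 1.0, 0.5 * phi, 0.8
    return beta * (m1 - m0) + m0 - cost * beta

def argmax_stationary(grid_size=10001):
    """Grid search for beta** in arg sup_beta Gamma(beta, beta)."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return max(grid, key=lambda b: gamma(b, b))
```

Here $\Gamma(\beta, \beta) = 0.7\beta - 0.5\beta^2$, so the grid search recovers $\beta^{**} = 0.7$.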
Definition 5.2 defines the estimand of interest as the vector of parameters that maximizes welfare, under the constraint that the decision remains invariant over time. The case with covariates not being identically distributed follows similarly to what is discussed in the current section.
Carry-over effects introduce challenges for optimization due to dynamics. A simple gradient descent may not converge since, at every iteration, the function $\Gamma(\beta_t, \beta_{t-1})$ also depends on past decisions. Motivated by this observation, we propose patient gradient-descent updates.

“Patient” gradient descent

First, we introduce the optimization algorithm in full generality. We begin our iteration from the starting value $\iota$, we evaluate $\Gamma(\iota, \iota)$, and compute its total derivative $\nabla(\iota)$. We then update the current policy choice in the direction of the total derivative and wait for one more iteration before making the next update. Formally, the first three iterations consist of the following updates:
$$\Gamma(\iota, \iota) \;\Rightarrow\; \Gamma\big(\iota + \nabla(\iota), \iota\big) \;\Rightarrow\; \Gamma\big(\iota + \nabla(\iota), \iota + \nabla(\iota)\big).$$
We name the iterations “patient” since, in the third step, the algorithm keeps the policy choice $\iota + \nabla(\iota)$, even if this choice may decrease utility in the third iteration compared to the utility in the previous step. However, the overall utility from the first to the third iteration is increasing.

Estimation and updates
The estimation procedure follows similarly to Section 4.1, with a small modification: for every period $t$, the policy stays enforced for one more period $t + 1$, without necessitating that data are collected over the period $t + 1$. This modification is a direct extension of the gradient descent that allows for dynamics, at the expense of requiring a longer overall experimentation period. That is, for example, in Equation (23), $D_{i,t}$ is randomized as a Bernoulli with parameter $\check\beta^w_{k(i)}$ over two consecutive periods, and over the following two periods it is randomized with a local deviation $\eta_n$. Let the estimated policy be denoted by $\hat\beta^{**}_T$, the analogue of $\hat\beta^*$ for the estimand $\beta^{**}$. Observe that, although Assumption 9 assumes that there are no cluster- and time-specific fixed effects, the randomization and estimation procedure directly allows for those, similarly to what is discussed in Section 6.

Next, we discuss the theoretical guarantees of the proposed algorithm. The proof is included in the Appendix.

Theorem 5.3.
Let Assumptions 3, 5, 8, 9 hold. Take a small $\xi > 0$. Let $n^{1/2 - \xi} \ge \bar C \sqrt{\log(n)\, \gamma_n}\, T e^{B\sqrt{pT}} \log(KT)$, $J_n \le 1/\sqrt{n}$, $\eta_n = 1/n^{1/4 + \xi}$, for finite constants $\infty > B, \bar C > 0$. Let $T \ge \zeta$, for a finite constant $\zeta < \infty$. Let $\beta \mapsto \Gamma(\beta, \beta)$ satisfy strict quasi-concavity and local strong concavity as in Assumption 6. Then with probability at least $1 - 1/n$,
$$\lVert \beta^{**} - \hat\beta^{**}_T \rVert^2 \le \frac{p \bar C}{T}$$
for a finite constant $\bar C < \infty$.

In this section, we discuss the case where the policy can be updated over each iteration. The objective of the policymaker is to estimate the optimal path of policies. An essential condition in this section is that there are no cluster-specific fixed effects, as discussed in Assumption 9.
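A minimal sketch of the patient updates analyzed in Theorem 5.3 (the quadratic $\Gamma$, the step size, and the starting value are toy assumptions; the total derivative $\nabla$ is computed in closed form for this $\Gamma$):

```python
def gamma_quad(b, phi):
    """Toy instantaneous welfare, concave in both arguments (assumption)."""
    return -(b - 0.5) ** 2 - 0.1 * (phi - 0.5) ** 2

def total_derivative(b, step=0.5):
    """Scaled total derivative of b -> gamma_quad(b, b) for the toy:
    d/db [gamma_quad(b, b)] = -2(b - 0.5) - 0.2(b - 0.5)."""
    return step * (-2.2) * (b - 0.5)

def patient_updates(iota):
    """First three 'patient' iterations: update once, then hold the new
    policy for one more period before the next update."""
    new = iota + total_derivative(iota)
    return [gamma_quad(iota, iota), gamma_quad(new, iota), gamma_quad(new, new)]
```

Starting from $\iota = 0.3$, the welfare path is $(-0.044, -0.0044, -0.00044)$: even when the third step does not improve on the second, the overall utility from the first to the third iteration increases, which is the defining feature of the patient updates.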
First order conditions
A natural question is whether $\beta^{**}$ maximizes the long-run welfare, defined as follows:
$$\sum_{t=1}^{T^*} q^t\, \Gamma(\beta_t, \beta_{t-1}),$$
where $q \in (0, 1)$ denotes a discounting factor. In the presence of concave $\Gamma(\cdot)$, linearity in carry-over effects, and lack of interactions of carry-overs with present assignments, the welfare-maximizing policy is stationary. To observe why, note that the first-order conditions read as follows:
$$\underbrace{\frac{\partial \Gamma(\beta_t, \beta_{t-1})}{\partial \beta_t}}_{(A)} + q\, \underbrace{\frac{\partial \Gamma(\beta_{t+1}, \beta_t)}{\partial \beta_t}}_{(B)} = 0, \quad \forall t. \quad (28)$$
Assuming that $(B)$ is a constant and $(A)$ does not depend on $\beta_{t-1}$, the solution to all the above equations is the same $\beta_t$ in each equation. Whenever these conditions are not met, $\beta^{**}$ finds a practical motivation instead: once the study is concluded, the policymaker may prefer to adopt a single policy decision instead of a sequence of non-stationary decisions. However, in the following lines, we also discuss non-stationary decisions, whenever those are of interest to the policymaker.

Policy parametrization
The design of non-stationary decisions requires instead a more data-intensive scenario. We sketch the main ideas in the following lines. From Equation (28), we observe that the welfare-maximizing $\beta_{t+1}$ only depends on $(\beta_t, \beta_{t-1})$. Using ideas from reinforcement learning and welfare maximization (Sutton and Barto, 2018; Adusumilli et al., 2019), we parametrize the policy function by parameters $\theta \in \Theta$, with $\pi_\theta : \mathcal{B} \times \mathcal{B} \mapsto \mathcal{B}$. For any two past decisions, $\pi_\theta(\beta_t, \beta_{t-1})$ prescribes the welfare-maximizing policy $\beta_{t+1}$ in the subsequent iteration. The objective function takes the following form:
$$\widetilde W(\theta) = \sum_{t=1}^{T^*} q^t\, \Gamma\Big(\pi_\theta(\beta_{t-1}, \beta_{t-2}),\, \pi_\theta(\beta_{t-2}, \beta_{t-3})\Big), \quad \text{such that } \beta_t = \pi_\theta(\beta_{t-1}, \beta_{t-2}) \ \forall t \ge 1, \ \beta_0 = \beta_{-1} = \iota. \quad (29)$$
Here $\widetilde W(\theta)$ denotes the long-run welfare indexed by a given policy parameter $\theta$. By taking first-order conditions, we have
$$\frac{\partial \widetilde W(\theta)}{\partial \theta} = \sum_{t=1}^{T^*} q^t \Big[\underbrace{\frac{\partial \Gamma\big(\pi_\theta(\beta_{t-1}, \beta_{t-2}), \pi_\theta(\beta_{t-2}, \beta_{t-3})\big)}{\partial \pi_\theta(\beta_{t-1}, \beta_{t-2})}}_{(i)} \times f_{\theta,t-1}(\iota) + \underbrace{\frac{\partial \Gamma\big(\pi_\theta(\beta_{t-1}, \beta_{t-2}), \pi_\theta(\beta_{t-2}, \beta_{t-3})\big)}{\partial \pi_\theta(\beta_{t-2}, \beta_{t-3})}}_{(ii)} \times f_{\theta,t-2}(\iota)\Big], \quad (30)$$
where
$$f_{\theta,t}(\iota) = \frac{\partial \pi_\theta(\beta_t, \beta_{t-1})}{\partial \theta}, \quad \text{such that the constraint in Eq. (29) holds.}$$
Observe that the function $f_{\theta,t}(\iota)$ is known to the experimenter and can be obtained through the chain rule. However, $(i)$ and $(ii)$ are unknown and must be estimated. The key idea consists of constructing triads of clusters and alternating perturbations over subsequent sub-iterations. Observe that, differently from Adusumilli et al. (2019), here we also allow the outcome to depend dynamically on treatment assignments.
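The chain-rule computation of $f_{\theta,t}$ can be sketched for a hypothetical linear policy rule $\pi_\theta(b_1, b_2) = \theta_0 + \theta_1 b_1 + \theta_2 b_2$ (an illustrative parametrization of our own, not the paper's): the path derivative obeys the recursion $g_t = \partial \pi_\theta/\partial \theta + \theta_1 g_{t-1} + \theta_2 g_{t-2}$ with $g_0 = g_{-1} = 0$.

```python
def policy_path_and_grads(theta, iota, T):
    """Roll out beta_t = pi_theta(beta_{t-1}, beta_{t-2}) for a linear
    policy pi_theta(b1, b2) = theta0 + theta1*b1 + theta2*b2, and compute
    f_{theta,t} = d beta_t / d theta via the chain-rule recursion."""
    t0, t1, t2 = theta
    betas = {-1: iota, 0: iota}
    grads = {-1: (0.0, 0.0, 0.0), 0: (0.0, 0.0, 0.0)}
    for t in range(1, T + 1):
        b1, b2 = betas[t - 1], betas[t - 2]
        betas[t] = t0 + t1 * b1 + t2 * b2
        g1, g2 = grads[t - 1], grads[t - 2]
        # partials of pi_theta at fixed (b1, b2) are (1, b1, b2)
        grads[t] = tuple(
            d + t1 * a + t2 * b
            for d, a, b in zip((1.0, b1, b2), g1, g2)
        )
    return betas, grads
```

The recursion can be sanity-checked against a finite difference in $\theta_0$, since the rolled-out path is linear in $\theta_0$ for this toy rule.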
Grouping clusters
Create groups of three clusters $\{k, k+1, k+2\}$.

Iterations
The experiment consists of $\check{T}$ waves. Differently from Section 4.1, each wave consists of $j \in \{1, \cdots, \dim(\theta)\}$ iterations and $s \in \{1, \cdots, T^*\}$ sub-iterations. Over each wave $w$, a policy's parameter $\check\theta_{w,k}$ is chosen for each triad of clusters $\{k, k+1, k+2\}$. Each wave corresponds to a path of policies. That is, for wave $w$, cluster $k$, and a starting value $\iota$, the path of policies is
$$\Big[ \pi_{\check\theta_{w,k}}(\iota, \iota),\ \pi_{\check\theta_{w,k}}\big(\pi_{\check\theta_{w,k}}(\iota, \iota), \iota\big),\ \pi_{\check\theta_{w,k}}\big(\pi_{\check\theta_{w,k}}(\pi_{\check\theta_{w,k}}(\iota, \iota), \iota),\ \pi_{\check\theta_{w,k}}(\iota, \iota)\big), \cdots \Big].$$
That is, given the parameter value, each policy is chosen based on the policy choice in the previous periods, with in total $T^*$ many periods. We denote by $\check\theta_{w,k}(s)$ the policy recommendation on the path under parameter $\check\theta_{w,k}$ after $s$ sub-iterations.

Policy randomization
Over each wave $w$, iteration $j$, and sub-iteration $s$, denoted $(w, j, s)$, and group of clusters $\{k, k+1, k+2\}$, we randomize treatments as follows:
$$D_{i,wjs} \mid X_i, \check\theta_{w,k} \sim \begin{cases} \mathrm{Bern}\big(e(X_i; \check\theta_{w,k}(s))\big), & \text{if } k(i) = k; \\ \mathrm{Bern}\big(e(X_i; \check\theta_{w,k}(s) + \eta_n e_j)\big), & \text{if } k(i) = k+1 \text{ and } s \text{ is odd}; \\ \mathrm{Bern}\big(e(X_i; \check\theta_{w,k}(s))\big), & \text{if } k(i) = k+1 \text{ and } s \text{ is even}; \\ \mathrm{Bern}\big(e(X_i; \check\theta_{w,k}(s) + \eta_n e_j)\big), & \text{if } k(i) = k+2 \text{ and } s \text{ is even}; \\ \mathrm{Bern}\big(e(X_i; \check\theta_{w,k}(s))\big), & \text{if } k(i) = k+2 \text{ and } s \text{ is odd}. \end{cases} \tag{31}$$
Intuitively, one of the three clusters is always assigned the unperturbed policy $\check\theta_{w,k}$. The remaining two clusters alternate over each sub-iteration $s \in \{1, \cdots, T^*\}$ on whether a small deviation is applied to the policy.

Marginal effect estimator
The estimator consists in taking the difference of the weighted outcomes between cluster $k$ and cluster $k+1$ over odd iterations for estimating $(i)$, and between $k$ and $k+2$ over odd iterations for estimating $(ii)$, and vice versa over even iterations. Formally, $\check\theta_{w,k}(s) = \pi_{\check\theta_{w,k}}(\beta_s, \beta_{s-1})$, where $\beta_s = \pi_{\check\theta_{w,k}}(\beta_{s-1}, \beta_{s-2})$ if $s \ge 1$ and $\beta_s = \iota$ otherwise. The $j$th entry of the gradient is computed at the end of the $T^*$ iterations and is defined as $\check F^{(j)}_{k,w}$. A formal discussion is included in Appendix E.

Gradient update
Similarly to Section 4.1, over each wave $w$ we perform gradient updates, where the policy for the triad $\{k, k+1, k+2\}$ is updated using the gradient $\check F_{k+3,w}$, with $\{k+3, k+4, k+5\}$ being the subsequent triad. The above procedure estimates the policy $\pi_\theta$ for out-of-sample implementation via a gradient descent method, requiring, however, a large number of iterations on the in-sample units. The estimated policy is then deployed on the target population, having a much larger size than the in-sample population. We defer to Appendix E a formal discussion of the method. We conclude this section with an example.

Table 1: One wave $w$ with three clusters and $T^* = 4$. Over each period, $\check\theta(s)$ denotes the policy assignment along the path corresponding to policy $\pi_{\check\theta}$ at time $s$. Here, $\Gamma$ denotes the policy's instantaneous effect, as a function of the present and past assignment rules. By differencing the effect between the first cluster and the second cluster at time $t = 2$, we estimate the partial derivative of the effect of the policy assigned at time $t = 1$ on the welfare at time $t = 2$. By comparing the instantaneous welfare on the first and third clusters, we estimate the policy's partial effect at time $t = 2$ in the current period. The reverse reasoning applies at time $t = 3$.

            k = 1                       k = 2                               k = 3
t = 1   θ(1): Γ(θ(1), θ(1))        θ(1)+η_n: Γ(θ(1)+η_n, θ(1))        θ(1): Γ(θ(1), θ(1))
t = 2   θ(2): Γ(θ(2), θ(1))        θ(2): Γ(θ(2), θ(1)+η_n)            θ(2)+η_n: Γ(θ(2)+η_n, θ(1))
t = 3   θ(3): Γ(θ(3), θ(2))        θ(3)+η_n: Γ(θ(3)+η_n, θ(2))        θ(3): Γ(θ(3), θ(2)+η_n)
t = 4   θ(4): Γ(θ(4), θ(3))        θ(4): Γ(θ(4), θ(3)+η_n)            θ(4)+η_n: Γ(θ(4)+η_n, θ(3))

Here θ(s) abbreviates $\check\theta(s)$.

Example 2.1 Cont'd
Let $\check T = 10$ and $T^* = 4$. Then experimentation is conducted over 40 iterations. Clusters are first grouped into triads. Consider the triad $\{k, k+1, k+2\}$ and a policy $\pi_\theta(\beta_{t-1}, \beta_{t-2}) = \theta_0 + \beta_{t-1}\theta_1 + \beta_{t-2}\theta_2$, so that the probability of treatment at time $t$ depends linearly on the probability of treatment in the previous two periods. The policymaker's objective is to find the optimal path of probabilities, which corresponds to estimating the parameters $(\theta_0, \theta_1, \theta_2)$ that maximize the long-run welfare. The local optimization procedure starts from a starting value $\iota = 40\%$ and an initialization value for the parameters $\check\theta = (0, 1, 0)$, i.e., the probability of treatment today equals the one from yesterday. Then the first wave of experimentation $w = 1$ aims to study the long-run marginal effect at $(0, 1, 0)$. The sequence of policies over each $s \in \{1, \cdots, T^*\}$ is
$$\big(\check\theta(1), \cdots, \check\theta(4)\big) = \Big[\underbrace{\check\theta_0 + \iota \check\theta_1 + \iota \check\theta_2}_{(A)},\ \underbrace{\check\theta_0 + \check\theta_1 \times (A) + \check\theta_2 \iota}_{(B)},\ \underbrace{\check\theta_0 + \check\theta_1 \times (B) + \check\theta_2 (A)}_{(C)},\ \check\theta_0 + \check\theta_1 \times (C) + \check\theta_2 \times (B)\Big] = (40\%, 40\%, 40\%, 40\%),$$
where the last equality follows from the choice of our starting point ($(0, 1, 0)$ and $\iota = 40\%$ at $w = 1$). In the first period $t = 1$, treatments are randomized as follows: individuals are treated in clusters $k$ and $k+2$ with probability 40%, while treatments in the second cluster are assigned with probability 41%. In the second period $t = 2$, treatments are assigned with probability 40% in the first and second cluster and with probability 41% in the third cluster. The sequence repeats once more before the first wave ends. Table 1 reports the instantaneous welfare over each period $t$ of a sequence within a wave $w$.
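The 40% and 41% figures in the example can be checked numerically. The sketch below (our own illustration; the perturbation size $\eta_n = 0.01$ is inferred from the 41% figure in the text) computes the policy path and the alternating perturbation schedule of Eq. (31) for one wave:

```python
def path(theta, iota, T_star):
    """Policy path beta_s = theta_0 + theta_1 * beta_{s-1} + theta_2 * beta_{s-2}."""
    b = [iota, iota]
    for _ in range(T_star):
        b.append(theta[0] + theta[1] * b[-1] + theta[2] * b[-2])
    return b[2:]

theta_check, iota, eta_n, T_star = (0.0, 1.0, 0.0), 0.40, 0.01, 4
base = path(theta_check, iota, T_star)  # the unperturbed path
# Perturbation pattern of Eq. (31): cluster k is never perturbed,
# cluster k+1 is perturbed on odd s, cluster k+2 on even s.
schedule = {
    "k":   list(base),
    "k+1": [p + eta_n if s % 2 == 1 else p for s, p in enumerate(base, start=1)],
    "k+2": [p + eta_n if s % 2 == 0 else p for s, p in enumerate(base, start=1)],
}
```

The resulting schedule matches Table 1: cluster $k$ stays at 40% throughout, while clusters $k+1$ and $k+2$ alternate between 40% and 41%.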
Observe that, by alternating the perturbation over the second and the third cluster over each period, we can identify and estimate both the marginal effect of the policy in the current period and the marginal effect of the policy in the previous period. Once the first iteration is concluded, we estimate the gradient using Equation (30), we choose the new value of $\check\theta$ based on the gradient update, and then we keep iterating.

In this section, we study the numerical properties of the proposed estimator. We calibrate our experiments to data from Cai et al. (2015), and we consider as target estimand the percentage of individuals to be treated within each cluster.

Set up
The data contains network information for each individual over 47 villages in China, together with additional individual-specific characteristics. The outcome of interest is binary, and it consists of insurance adoption. Let $A^k$ denote the adjacency matrix in cluster $k$ observed from sampled data. We calibrate our simulations to the estimated linear-probability model
$$Y_{i,t} = \phi_0 + \phi_1 X_i + \phi_2 D_{i,t} + \phi_3 X_i \times D_{i,t} + S_i \phi_4 + S_i \times D_{i,t} \phi_5 + S_i^2 \phi_6 + \eta_{i,t}, \qquad S_i = \frac{\sum_{j \neq i} A^{k(i)}_{i,j} D_{j,t}}{\sum_{j \neq i} A^{k(i)}_{i,j}},$$
where $S_i$ denotes the percentage of treated friends. The above equation captures direct effects through the coefficients $\phi_2$ and $\phi_3$, where the latter also captures heterogeneity in effects; it captures spillover effects through the coefficients $\phi_4$ and $\phi_6$, as well as interactions between spillover and direct effects through the coefficient $\phi_5$. We estimate those coefficients using a linear regression with a small penalization to improve stability. The covariate matrix contains available individual information such as gender, age, rice area, literacy, risk aversion, the probability of disaster in a given region, and the number of friends. We simulate
$$\eta_{i,t} \mid \eta_{i,t-1} \sim \mathcal{N}(\rho \eta_{i,t-1}, \sigma^2), \qquad X_i \sim_{i.i.d.} F_{X,n},$$
with $\rho = 0.8$ and $F_{X,n}$ denoting the empirical distribution of the covariates observed in the data. We calibrate the variance $\sigma^2$ to the estimated residual variance. To study the performance under different levels of density of the network, we consider two alternative graphs: (i) two individuals are connected if they had reciprocally indicated the other as a connection (sparser network); (ii) two individuals are connected if either had indicated the other as a connection (denser network). Data is accessible at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CXDJM0&widget=dataverse@harvard. The adjacency matrix is constructed by considering each starting node of the edge list as a separate observation, oversampling individuals with a larger degree and under-sampling individuals with a smaller degree.
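A minimal simulation of the calibrated outcome model might look as follows; the coefficients `phi`, the network, and the error standard deviation below are placeholders for illustration, not the estimates obtained from the Cai et al. (2015) data:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_period(X, D, A, phi, eta_prev, rho=0.8, sigma=0.4):
    """One period of the linear-probability outcome model with AR(1) errors.
    A is a symmetric 0/1 adjacency matrix with zero diagonal."""
    deg = A.sum(axis=1)
    S = np.where(deg > 0, (A @ D) / np.maximum(deg, 1.0), 0.0)  # share of treated friends
    eta = rho * eta_prev + rng.normal(0.0, sigma, size=len(X))   # eta_t | eta_{t-1}
    Y = (phi[0] + phi[1] * X + phi[2] * D + phi[3] * X * D
         + phi[4] * S + phi[5] * S * D + phi[6] * S ** 2 + eta)
    return Y, eta

# Toy inputs: a small symmetric network and hypothetical coefficients.
n = 8
upper = np.triu((rng.random((n, n)) < 0.4).astype(float), k=1)
A = upper + upper.T
X = rng.normal(size=n)
D = rng.binomial(1, 0.4, size=n)
phi = np.array([0.1, 0.02, 0.05, 0.01, 0.10, 0.05, -0.02])
Y, eta = simulate_period(X, D, A, phi, eta_prev=np.zeros(n))
```

Iterating `simulate_period` over $t$, feeding back `eta`, reproduces the persistent idiosyncratic errors used in the scenarios below.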
We consider the problem of maximizing welfare over the probability of treatment assignments, with the parameter space being a strict subset of (0, 1). We run the experiment for two different choices of the number of iterations $T$, sampling from the first $K = T$ clusters. We omit time-specific fixed effects for simplicity, and, following Remark 7, over each iteration we randomize treatments twice in the same cluster, with the second randomization inducing a small perturbation. We consider two scenarios corresponding to two different within-cluster sample sizes:

(A) Researchers sample once over each experimental wave from each cluster (i.e., $\bar n \approx 400$, where $\bar n$ denotes the median sample size);

(B) Researchers sample five times the same participants from each cluster over each experimental wave (i.e., $\bar n \approx 2000$).

Scenario (B) features strong dependence due to the persistency of the idiosyncratic errors and the fact that individuals observed over multiple periods have the same covariates. As a result, (B) reduces the variability occurring from the treatment assignments, but not from the covariates. We choose $\eta_n = \bar n^{-1/2}$, with $\eta_n = 0.05$ for Scenario (A) and $\eta_n = 0.022$ for Scenario (B). Given the heterogeneity in the sample sizes, $\bar n^{-1/2}$ does not affect consistency for the larger clusters, while controlling the bias across all clusters. We consider the adaptive learning rate of the gradient descent with $\gamma = 0.1$, and random initializations drawn uniformly between 0.1 and 0.9. We compare against two saturation designs: (i) the first uses equally spaced saturation probabilities and assigns treatment saturations to clusters deterministically; (ii) the second randomizes probabilities of treatment across clusters uniformly between 0.1 and 0.9. Competitors run over $2 \times T$ consecutive periods in Scenario (A) and over $10 \times T$ consecutive periods in Scenario (B). The competitors estimate the welfare-maximizing probability using a correctly specified quadratic regression of the average outcomes onto the saturation probabilities, which is expected to have a small out-of-sample regret.

A comprehensive set of results is in Figure 11, where each column in the panel reports the average in-sample regret, the out-of-sample regret, and the worst-case regret across all clusters. The top panels show that, under a denser network structure, the proposed design's in-sample regret is significantly smaller across all $T$ under consideration. The out-of-sample regret of the proposed estimator is comparable to the regret of the saturation experiment for $\bar n = 1200$ and slightly larger for $\bar n = 400$. We observe similar behavior in the bottom panels, where the proposed method achieves a significantly smaller in-sample regret. As $\bar n$ increases, the in- and out-of-sample regret of the algorithm decreases. The out-of-sample regret of the competitors also decreases, while their in-sample regret increases by design. We also observe that as $T$ increases, the error of the sequential experiment may either increase or decrease. These mixed results document the trade-off between the number of iterations and the small sample size. Whenever the estimation error dominates the gradient descent's optimization error, the number of waves increases the estimation error faster than the linear rate. As a result, longer experiments require much larger samples for better accuracy. In practice, we recommend that practitioners carefully select the number of iterations by considering the overall sample size. Our results show that a small number of waves suffices to achieve the global optimum while controlling the in-sample regret.
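Two of the tuning choices above can be illustrated with a short sketch: the bandwidth rate $\eta_n = \bar n^{-1/2}$ evaluated at the two scenarios' median sample sizes, and a normalized projected gradient update with $\gamma = 0.1$. The welfare function and its maximizer below are our own toy illustration, not quantities from the simulation.

```python
import numpy as np

# eta_n = n_bar**(-1/2): Scenario (A) with n_bar = 400, Scenario (B) with n_bar = 2000.
eta_A = 400 ** -0.5    # 0.05
eta_B = 2000 ** -0.5   # about 0.022

def projected_gradient_ascent(V, iota, gamma=0.1, waves=200, lo=0.01, hi=0.99):
    """Normalized projected gradient update: each wave moves gamma / sqrt(w)
    in the ascent direction of the (here, exactly known) welfare gradient V."""
    beta = iota
    for w in range(1, waves + 1):
        g = V(beta)
        if abs(g) < 1e-12:
            break
        beta += (gamma / (np.sqrt(w) * abs(g))) * g   # step size alpha_w = gamma / (sqrt(w) ||V||)
        beta = min(max(beta, lo), hi)                 # projection onto [lo, hi]
    return beta

# Toy concave welfare with an interior maximum at beta = 0.6.
beta_hat = projected_gradient_ascent(lambda b: -2.0 * (b - 0.6), iota=0.4)
```

The iterate oscillates around the maximizer with shrinking steps of order $\gamma/\sqrt{w}$, mirroring the adaptive learning rate used in the experiments.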
In this section, we discuss hypothesis testing. Similarly as before, we let $\eta_n = \bar n^{-1/2}$ and we consider scenarios with varying $\bar n$ and number of clusters. Namely, we consider "iteration = 1" ($\bar n = 400$), "iteration = 3" ($\bar n = 1200$), and "iteration = 5" ($\bar n = 2000$), respectively corresponding to inference after one, three, and five consecutive samplings from the participants in the cluster. We consider $K \in \{2, 4, 6, 10, 20, 40\}$ clusters. We match clusters with themselves over two consecutive iterations (see Remark 7). In Table 3, we report the coverage probability under the null hypothesis of welfare optimality for a test with size 5%. The results show that the coverage probability is approximately 95% across all designs. In Figure 12, we plot the power, i.e., the probability of rejection, as the coefficient moves away from the welfare-maximizing policy.

Figure 11: Results from the adaptive experiment with 200 replications, for the two values of $T$ under consideration. The top panels (weak ties) correspond to the denser network and the bottom panels (strong ties) to the sparser network. Saturation 1 corresponds to a saturation experiment with equally spaced saturation probabilities; Saturation 2 is a design with saturation probabilities drawn randomly from a uniform distribution; and Saturation 3 is as Saturation 2, but with half of the clusters, excluding those with less than four hundred observations. Matching is performed with the same cluster over two consecutive periods.

Table 3: Coverage probability of testing the null hypothesis of optimality over 500 replications with a test with size 5%. Here $K$ denotes the number of clusters, with the first two, four, etc., clusters being considered. The median cluster size across all clusters is $\bar n \approx 400$.

K =        2     4     6     10    20    40    2     4     6     10    20    40
iter = 1   0.95  0.95  0.96  0.96  0.96  0.94  0.95  0.96  0.95  0.95  0.94  0.91
iter = 3   0.96  0.95  0.95  0.93  0.96  0.96  0.96  0.95  0.94  0.93  0.96  0.95
iter = 5   0.95  0.94  0.95  0.95  0.94  0.96  0.94  0.94  0.94  0.94  0.91  0.95

Conclusion
This paper has introduced a novel method for experimental design under unobserved interference to test and estimate welfare-maximizing policies. The proposed methodology exploits between- and within-cluster local variation to estimate nonparametrically marginal spillover and direct effects. It uses the marginal effects of the treatment for hypothesis testing and policy design. We discuss the method's theoretical properties, showcase valid coverage of the hypothesis testing procedure in the presence of finitely many clusters, and provide guarantees on the in- and out-of-sample regret of the design.

We outlined the importance of allowing for general unknown interactions without imposing a particular exposure mapping. We make two assumptions: within-cluster interactions are local, and clusters are representative of the underlying population. We leave for future research addressing experimental design in the presence of heterogeneous clusters and global interaction mechanisms.

The hypothesis testing mechanism allows us to test for policy optimality. Future extensions may be considered: (i) with low-cost experimentation, researchers may prefer null hypotheses of no policy optimality; (ii) the testing may be extended to continuous treatments or observational studies. Finally, we introduced experimental designs for non-stationary policy decisions, discussing marginal effects under limited carry-overs. The design under infinitely long carry-over effects remains an open research question.
References
Adusumilli, K., F. Geiecke, and C. Schilter (2019). Dynamically optimal treatment alloca-tion using reinforcement learning. arXiv preprint arXiv:1904.01047 .Alatas, V., A. Banerjee, A. G. Chandrasekhar, R. Hanna, and B. A. Olken (2016). Networkstructure and the aggregation of information: Theory and evidence from indonesia.
American Economic Review 106 (7), 1663–1704.Alatas, V., A. Banerjee, R. Hanna, B. A. Olken, and J. Tobias (2012). Targeting the poor:evidence from a field experiment in indonesia.
American Economic Review 102 (4), 1206–40.
Andrews, I., T. Kitagawa, and A. McCloskey (2019). Inference on winners. Technical report, National Bureau of Economic Research.
Armstrong, T. and S. Shen (2015). Inference on optimal treatment assignments.
Aronow, P. M., C. Samii, et al. (2017). Estimating average causal effects under general interference, with application to a social network experiment.
The Annals of AppliedStatistics 11 (4), 1912–1947.Athey, S., D. Eckles, and G. W. Imbens (2018). Exact p-values for network interference.
Journal of the American Statistical Association 113 (521), 230–240.Athey, S. and G. W. Imbens (2018). Design-based analysis in difference-in-differences set-tings with staggered adoption. Technical report, National Bureau of Economic Research.Athey, S. and S. Wager (2020). Policy learning with observational data.
Econometrica,Forthcoming .Bai, Y. (2019). Optimality of matched-pair designs in randomized controlled trials.
Avail-able at SSRN 3483834 .Baird, S., J. A. Bohren, C. McIntosh, and B. ¨Ozler (2018). Optimal design of experimentsin the presence of interference.
Review of Economics and Statistics 100 (5), 844–860.Bakirov, N. K. and G. J. Szekely (2006). Student’s t-test for gaussian scale mixtures.
Journal of Mathematical Sciences 139 (3), 6497–6505.Banerjee, A., A. G. Chandrasekhar, E. Duflo, and M. O. Jackson (2013). The diffusion ofmicrofinance.
Science 341 (6144), 1236498.Basse, G. and A. Feller (2018). Analyzing two-stage experiments in the presence of inter-ference.
Journal of the American Statistical Association 113 (521), 41–55.Basse, G. W. and E. M. Airoldi (2018a). Limitations of design-based causal inference anda/b testing under arbitrary and network interference.
Sociological Methodology 48 (1),136–151.Basse, G. W. and E. M. Airoldi (2018b). Model-assisted design of experiments in thepresence of network-correlated outcomes.
Biometrika 105 (4), 849–858.Bhattacharya, D. (2009). Inferring optimal peer assignment from experimental data.
Jour-nal of the American Statistical Association 104 (486), 486–500.Bhattacharya, D. and P. Dupas (2012). Inferring welfare maximizing treatment assignmentunder budget constraints.
Journal of Econometrics 167 (1), 168–196.Bhattacharya, D., P. Dupas, and S. Kanaya (2013). Estimating the impact of means-tested subsidies under treatment externalities with application to anti-malarial bednets.Technical report, National Bureau of Economic Research.Bond, R. M., C. J. Fariss, J. J. Jones, A. D. Kramer, C. Marlow, J. E. Settle, and J. H.Fowler (2012). A 61-million-person experiment in social influence and political mobiliza-tion.
Nature 489 (7415), 295.
Bottou, L., F. E. Curtis, and J. Nocedal (2018). Optimization methods for large-scale machine learning.
Siam Review 60 (2), 223–311.Boyd, S., S. P. Boyd, and L. Vandenberghe (2004).
Convex optimization . Cambridgeuniversity press.Breza, E., A. G. Chandrasekhar, T. H. McCormick, and M. Pan (2020). Using aggregatedrelational data to feasibly identify network structure without network data.
AmericanEconomic Review 110 (8), 2454–84.Brooks, R. L. (1941). On colouring the nodes of a network. In
Mathematical Proceedingsof the Cambridge Philosophical Society , Volume 37, pp. 194–197. Cambridge UniversityPress.Bubeck, S., N. Cesa-Bianchi, et al. (2012). Regret analysis of stochastic and nonstochasticmulti-armed bandit problems.
Foundations and Trends ® in Machine Learning 5 (1),1–122.Bubeck, S., Y. T. Lee, and R. Eldan (2017). Kernel-based methods for bandit convexoptimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theoryof Computing , pp. 72–85.Cai, J., A. De Janvry, and E. Sadoulet (2015). Social networks and the decision to insure.
American Economic Journal: Applied Economics 7 (2), 81–108.Cesa-Bianchi, N. and G. Lugosi (2006).
Prediction, learning, and games . Cambridgeuniversity press.Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, andJ. Robins (2018). Double/debiased machine learning for treatment and structural pa-rameters.Chernozhukov, V., D. Chetverikov, K. Kato, et al. (2014). Gaussian approximation ofsuprema of empirical processes.
The Annals of Statistics 42 (4), 1564–1597.Chernozhukov, V., K. Wuthrich, and Y. Zhu (2018). Practical and robust t -test basedinference for synthetic control and related methods. arXiv preprint arXiv:1812.10820 .Chin, A., D. Eckles, and J. Ugander (2018). Evaluating stochastic seeding strategies innetworks. arXiv preprint arXiv:1809.09561 .Choi, D. (2017). Estimation of monotone treatment effects in network experiments. Journalof the American Statistical Association 112 (519), 1147–1155.Dehejia, R. H. (2005). Program evaluation as a decision problem.
Journal of Econometrics 125 (1-2), 141–173.
Dupas, P. (2014). Short-run subsidies and long-run adoption of new health products: Evidence from a field experiment.
Econometrica 82 (1), 197–228.Eckles, D., B. Karrer, and J. Ugander (2017). Design and analysis of experiments innetworks: Reducing bias from interference.
Journal of Causal Inference 5 (1).Egger, D., J. Haushofer, E. Miguel, P. Niehaus, and M. W. Walker (2019). General equi-librium effects of cash transfers: experimental evidence from kenya. Technical report,National Bureau of Economic Research.Elliott, G. and R. P. Lieli (2013). Predicting binary outcomes.
Journal of Economet-rics 174 (1), 15–26.Flaxman, A. D., A. T. Kalai, and H. B. McMahan (2004). Online convex optimization inthe bandit setting: gradient descent without a gradient. arXiv preprint cs/0408007 .Forastiere, L., E. M. Airoldi, and F. Mealli (2020). Identification and estimation of treat-ment and interference effects in observational studies on networks.
Journal of the Amer-ican Statistical Association , 1–18.Garber, D. (2019). Logarithmic regret for online gradient descent beyond strong convexity.In
The 22nd International Conference on Artificial Intelligence and Statistics , pp. 295–303. PMLR.Goldsmith-Pinkham, P. and G. W. Imbens (2013). Social networks and the identificationof peer effects.
Journal of Business & Economic Statistics 31 (3), 253–264.Graham, B. S., G. W. Imbens, and G. Ridder (2010). Measuring the effects of segregation inthe presence of social spillovers: A nonparametric approach. Technical report, NationalBureau of Economic Research.Guo, R., J. Li, and H. Liu (2020). Counterfactual evaluation of treatment assignment func-tions with networked observational data. In
Proceedings of the 2020 SIAM InternationalConference on Data Mining , pp. 271–279. SIAM.Hadad, V., D. A. Hirshberg, R. Zhan, S. Wager, and S. Athey (2019). Confidence intervalsfor policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768 .Hazan, E., K. Levy, and S. Shalev-Shwartz (2015). Beyond convexity: Stochastic quasi-convex optimization. In
Advances in Neural Information Processing Systems , pp. 1594–1602.Hirano, K. and J. R. Porter (2020). Asymptotic analysis of statistical decision rules ineconometrics.
Handbook of Econometrics.
Horvitz, D. G. and D. J. Thompson (1952). A generalization of sampling without replacement from a finite universe.
Journal of the American statistical Association 47 (260),663–685.Hudgens, M. G. and M. E. Halloran (2008). Toward causal inference with interference.
Journal of the American Statistical Association 103 (482), 832–842.Ibragimov, R. and U. K. M¨uller (2010). t-statistic based correlation and heterogeneityrobust inference.
Journal of Business & Economic Statistics 28 (4), 453–468.Ibragimov, R. and U. K. M¨uller (2016). Inference with few heterogeneous clusters.
Reviewof Economics and Statistics 98 (1), 83–96.Imai, K. and M. L. Li (2019). Experimental evaluation of individualized treatment rules. arXiv preprint arXiv:1905.05389 .Imbens, G. W. and D. B. Rubin (2015).
Causal inference in statistics, social, and biomedicalsciences . Cambridge University Press.Jagadeesan, R., N. S. Pillai, A. Volfovsky, et al. (2020). Designs for estimating the treat-ment effect in networks with interference.
Annals of Statistics 48 (2), 679–712.Janson, S. (2004). Large deviations for sums of partly dependent random variables.
RandomStructures & Algorithms 24 (3), 234–248.Jones, J. J., R. M. Bond, E. Bakshy, D. Eckles, and J. H. Fowler (2017). Social influenceand political mobilization: Further evidence from a randomized experiment in the 2012us presidential election.
PloS one 12 (4), e0173851.Kallus, N. (2017). Recursive partitioning for personalization using observational data. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pp.1789–1798. JMLR. org.Kang, H. and G. Imbens (2016). Peer encouragement designs in causal inference withpartial interference and identification of local average network effects. arXiv preprintarXiv:1609.04464 .Kasy, M. (2016). Partial identification, distributional preferences, and the welfare rankingof policies.
Review of Economics and Statistics 98 (1), 111–131.Kasy, M. (2017). Who wins, who loses? identification of the welfare impact of changingwages.Kasy, M. (2018). Optimal taxation and insurance using machine learning—sufficient statis-tics and beyond.
Journal of Public Economics 167 , 205–219.Kasy, M. and A. Sautmann (2019). Adaptive treatment assignment in experiments forpolicy choice.
Econometrica.
Kato, M. and Y. Kaneko (2020). Off-policy evaluation of bandit algorithm from dependent samples under batch update policy. arXiv preprint arXiv:2010.13554.
The Lancet 386 (9989), 145–153.Kitagawa, T. and A. Tetenov (2018). Who should be treated? Empirical welfare maxi-mization methods for treatment choice.
Econometrica 86 (2), 591–616.Kitagawa, T. and A. Tetenov (2019). Equality-minded treatment choice.
Journal of Busi-ness & Economic Statistics , 1–14.Kitagawa, T. and G. Wang (2020). Who should get vaccinated? individualized allocationof vaccines over sir network. arXiv preprint arXiv:2012.04055 .Kleinberg, R. D. (2005). Nearly tight bounds for the continuum-armed bandit problem. In
Advances in Neural Information Processing Systems , pp. 697–704.Leung, M. P. (2020). Treatment and spillover effects under network interference.
Reviewof Economics and Statistics 102 (2), 368–380.Li, S. and S. Wager (2020). Random graph asymptotics for treatment effect estimationunder network interference. arXiv preprint arXiv:2007.13302 .Li, X., P. Ding, Q. Lin, D. Yang, and J. S. Liu (2019). Randomization inference for peereffects.
Journal of the American Statistical Association , 1–31.Lu, C., B. Sch¨olkopf, and J. M. Hern´andez-Lobato (2018). Deconfounding reinforcementlearning in observational settings. arXiv preprint arXiv:1812.10576 .Luedtke, A. R. and M. J. Van Der Laan (2016). Statistical inference for the mean outcomeunder a possibly non-unique optimal treatment strategy.
Annals of statistics 44 (2), 713.Manski (2004). Statistical treatment rules for heterogeneous populations.
Economet-rica 72 (4), 1221–1246.Manski, C. F. (2013). Identification of treatment response with social interactions.
The Econometrics Journal 16 (1), S1–S23.
Mbakop, E. and M. Tabord-Meehan (2016). Model selection for treatment choice: Penalized welfare maximization. arXiv preprint arXiv:1609.03167.
Mbakop, E. and M. Tabord-Meehan (2018). Model selection for treatment choice: Penalized welfare maximization. arXiv.org 1609.03167.
Munro, E. (2020). Learning to personalize treatments when agents are strategic. arXiv preprint arXiv:2011.06528.
Muralidharan, K. and P. Niehaus (2017). Experimentation at scale.
Journal of EconomicPerspectives 31 (4), 103–24.Muralidharan, K., P. Niehaus, and S. Sukhtankar (2017). General equilibrium effects of(improving) public employment programs: Experimental evidence from india. Technicalreport, National Bureau of Economic Research.Murphy, S. A. (2003). Optimal dynamic treatment regimes.
Journal of the Royal StatisticalSociety: Series B (Statistical Methodology) 65 (2), 331–355.Nie, X., E. Brunskill, and S. Wager (2020). Learning when-to-treat policies.
Journal of theAmerican Statistical Association (just-accepted), 1–58.Ogburn, E. L., O. Sofrygin, I. Diaz, and M. J. van der Laan (2017). Causal inference forsocial network data. arXiv preprint arXiv:1705.08527 .Opper, I. M. (2016). Does helping john help sue? evidence of spillovers in education.
American Economic Review 109 (3), 1080–1115.Park, C. and H. Kang (2020). Efficient semiparametric estimation of network treatmenteffects under partial interference. arXiv preprint arXiv:2004.08950 .Pouget-Abadie, J. (2018).
Dealing with Interference on Experimentation Platforms . Ph.D. thesis.Rai, Y. (2018). Statistical inference for treatment assignment policies.
UnpublishedManuscript .Ross, N. et al. (2011). Fundamentals of stein’s method.
Probability Surveys 8 , 210–293.Rubin, D. B. (1990). Formal mode of statistical inference for causal effects.
Journal ofstatistical planning and inference 25 (3), 279–292.Russo, D., B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen (2017). A tutorial onthompson sampling. arXiv preprint arXiv:1707.02038 .Saez, E. (2001). Using elasticities to derive optimal income tax rates.
The review ofeconomic studies 68 (1), 205–229.Sasaki, Y. and T. Ura (2020). Welfare analysis via marginal treatment effects. arXivpreprint arXiv:2012.07624 .S¨avje, F., P. M. Aronow, and M. G. Hudgens (2020). Average treatment effects in thepresence of unknown interference.
The Annals of Statistics, Forthcoming .Stoye, J. (2009). Minimax regret treatment choice with finite samples.
Journal of Econo-metrics 151 (1), 70–81.Stoye, J. (2012). Minimax regret treatment choice with covariates or with limited validityof experiments.
Journal of Econometrics 166 (1), 138–156.
Sutton, R. S. and A. G. Barto (2018).
Reinforcement learning: An introduction . MITpress.Tabord-Meehan, M. (2018). Stratification trees for adaptive randomization in randomizedcontrolled trials. arXiv preprint arXiv:1806.05127 .Taylor, S. J. and D. Eckles (2018). Randomized experiments to detect and estimate socialinfluence in networks. In
Complex Spreading Phenomena in Social Systems , pp. 289–322.Springer.Tetenov, A. (2012). Statistical treatment choice based on asymmetric minimax regretcriteria.
Journal of Econometrics 166 (1), 157–165.Ugander, J., B. Karrer, L. Backstrom, and J. Kleinberg (2013). Graph cluster randomiza-tion: Network exposure to multiple universes. In
Proceedings of the 19th ACM SIGKDDinternational conference on Knowledge discovery and data mining , pp. 329–337. ACM.Varian, H. R. (2016). Causal inference in economics and marketing.
Proceedings of theNational Academy of Sciences 113 (27), 7310–7315.Vazquez-Bare, G. (2017). Identification and estimation of spillover effects in randomizedexperiments. arXiv preprint arXiv:1711.02745 .Viviano, D. (2019). Policy targeting under network interference. arXiv preprintarXiv:1906.10258 .Viviano, D. (2020). Experimental design under network interference. arXiv preprintarXiv:2003.08421 .Wager, S. and K. Xu (2019). Experimenting in equilibrium. arXiv preprintarXiv:1903.02124 .Wainwright, M. J. (2019).
High-dimensional statistics: A non-asymptotic viewpoint , Vol-ume 48. Cambridge University Press.Zhang, K. W., L. Janson, and S. A. Murphy (2020). Inference for batched bandits. arXivpreprint arXiv:2002.03217 .Zubcsek, P. P. and M. Sarvary (2011). Advertising to a social network.
Quantitative Marketing and Economics 9 (1), 71–107.
Preliminaries and notation
First, we introduce conventions and notation. Whenever we take summations, we sum over experimental participants unless otherwise specified. We define $x \lesssim y$ if $x$ is less than or equal to $y$ times a universal constant. We refer to the clusters by $k \in \{1, \cdots, K\}$, with the circular convention that cluster index $k = K + 1$ coincides with cluster 1. Define $t(j, w)$ the time $t$ corresponding to wave $w$ and iteration $j$, as discussed in Section 4.1. We define
$$\beta_{k,j,w} = b_{j,k}(\check\beta_{w,k}), \qquad b_{j,k}(\beta) = \begin{cases} \beta + \eta_n e_j & \text{if } k \text{ is odd}; \\ \beta - \eta_n e_j & \text{otherwise}. \end{cases}$$
Throughout our proofs, we will implicitly condition on $v_1, \cdots, v_K$. Finally, observe that $\beta_{k,j,w}$ is a measurable function of $\check\beta_{w,k}$, and therefore conditioning on $\check\beta_{w,k}$ will implicitly result in conditioning also on $\beta_{k,j,w}$.

Oracle gradient descent
We define
$$\beta^*_w = \Pi_{\mathcal{B}_0, \mathcal{B}}\Big[ \beta^*_{w-1} + \alpha_{w-1} V(\beta^*_{w-1}) \Big], \qquad \beta^*_0 = \iota, \qquad (32)$$
the oracle solution of the local optimization procedure, for known welfare function, where $\alpha_w = \gamma / \big( \sqrt{w}\, \|V(\beta^*_{w-1})\| \big)$ unless otherwise specified. Take $\check{T} > 0$. The algorithm terminates if $\|V(\beta^*_w)\| \le \mu/\sqrt{\check{T}}$.

We now discuss definitions of dependency graphs.

Definition A.1 (Adjacency matrix and dependency graph). Given $n$ random variables $R_i$, we denote by $A_n$ an adjacency matrix with $A^{(i,j)}_n = 1$ if and only if $R_i$ and $R_j$ are dependent. The variables connected under $A_n$ form a dependency graph (Janson, 2004), i.e., units that are not connected are mutually independent.

Lemma A.1. (Ross et al., 2011) Let $X_1, \ldots, X_n$ be random variables such that $E[X_i^4] < \infty$, $E[X_i] = 0$, $\sigma^2 = \mathrm{Var}\big( \sum_{i=1}^n X_i \big)$, and define $W = \sum_{i=1}^n X_i / \sigma$. Let the collection $(X_1, \ldots, X_n)$ have dependency neighborhoods $N_i$, $i = 1, \ldots, n$, and also define $D = \max_{1 \le i \le n} |N_i|$. Then, for $Z$ a standard normal random variable,
$$d_W(W, Z) \le \frac{D^2}{\sigma^3} \sum_{i=1}^n E|X_i|^3 + \frac{\sqrt{28}\, D^{3/2}}{\sqrt{\pi}\, \sigma^2} \sqrt{\sum_{i=1}^n E[X_i^4]}, \qquad (33)$$
where $d_W$ denotes the Wasserstein metric.

Definition A.2. (Proper Cover)
Given an adjacency matrix $A_n$ with $n$ rows and columns, a family $\mathcal{C}_n = \{\mathcal{C}_n(j)\}$ of disjoint subsets of $[n]$ is a proper cover of $A_n$ if $\cup_j \mathcal{C}_n(j) = [n]$ and each $\mathcal{C}_n(j)$ contains units such that, for any pair of elements $\{(i,k) \in \mathcal{C}_n(j),\ k \ne i\}$, $A^{(i,k)}_n = 0$.

Definition A.3. (Chromatic Number) The chromatic number $\chi(A_n)$ denotes the size of the smallest proper cover of $A_n$.

Lemma A.2. (Brooks' Theorem, Brooks (1941)) For any connected undirected graph $G$ with maximum degree $\Delta$, the chromatic number of $G$ is at most $\Delta$, unless $G$ is a complete graph or an odd cycle, in which case the chromatic number is $\Delta + 1$.

B Lemmas
Proof of Lemma 2.1.
Under Assumption 1 (A), we can write the potential outcome only as a function of the current treatment assignments in the same cluster, namely we write $Y_{i,t}(\mathbf{d}^{k(i)}_t)$. Define $\mathbf{D}^k_t$ the vector of treatment assignments in cluster $k$ at time $t$. Under consistency of potential outcomes,
$$Y_{i,t}(\mathbf{D}^{k(i)}_t) = Y_{i,t} = E\big[ Y_{i,t}(\mathbf{D}^{k(i)}_t) \mid D_{i,t}, X_i, \beta_{k(i),t} \big] + \varepsilon_{i,t}, \qquad E\big[ \varepsilon_{i,t} \mid D_{i,t}, X_i, \beta_{k(i),t} \big] = 0,$$
where the above equation follows from the fact that the distribution of $\mathbf{D}^{k(i)}_t$ is fully characterized by the (exogenous) parameter $\beta_{k(i),t}$. By definition,
$$\varepsilon_{i,t} = Y_{i,t}(\mathbf{D}^{k(i)}_t) - m_{i,t}(D_{i,t}, X_i, \beta_{k(i),t}) = Y_{i,t}\big( g_1(\beta_{k(i),t}, X_1), \cdots, g_{\check{N}}(\beta_{k(i),t}, X_{\check{N}}) \big) - m_{i,t}(D_{i,t}, X_i, \beta_{k(i),t}),$$
for some random functions $g_i(\cdot)$. By definition of a CBAR, these functions are independent across individuals (i.e., treatment assignments are conditionally independent given covariates). Observe that under Assumption 1 (B), $Y_{i,t}(\mathbf{D}^{k(i)}_t)$ depends on at most $\sqrt{\gamma_n}$ many entries. As a result, $Y_{i,t}(\cdot)$ shares at least one entry with at most $\gamma_n$ many other potential outcomes, namely those of units sharing at least one common neighbor. Observe now that $g_i(\beta_{k(i),t}, X_i)$ and $Y_{i,t}(\cdot)$ are locally dependent under Assumption 1 (C), with those same units being neighbors of individual $i$ or neighbors of the neighbors. As a result, $\varepsilon_{i,t}$ and $\varepsilon_{j,t'}$, $t' \le T$, can be dependent only if $(i,j)$ are neighbors or share a common neighbor. Therefore $\varepsilon_{i,t}$ depends on at most $\sqrt{\gamma_n} + \gamma_n$ many other $\varepsilon_{j,t'}$, $t' \le T$, completing the proof.

In the following lemma, we extend results from Janson (2004) to the concentration of unbounded sub-Gaussian random variables. We state the lemma for general random variables $R_i$ forming a dependency graph with adjacency matrix $A_n$.

Lemma B.1.
Define $\{R_i\}_{i=1}^n$ sub-Gaussian random variables forming a dependency graph with adjacency matrix $A_n$, with maximum degree bounded by $\gamma_n$. Then with probability at least $1 - \delta$,
$$\Big| \frac{1}{n} \sum_{i=1}^n \big( R_i - E[R_i] \big) \Big| \le \bar{C} \sqrt{\frac{\gamma_n \log(\gamma_n/\delta)}{n}},$$
for a finite constant $\bar{C} < \infty$.

Proof. First, we construct a proper cover $\mathcal{C}_n$ as in Definition A.2, with minimal chromatic number $\chi(A_n)$. We can write
$$\Big| \frac{1}{n} \sum_{i=1}^n \big( R_i - E[R_i] \big) \Big| \le \frac{1}{n} \sum_{\mathcal{C}_n(j) \in \mathcal{C}_n} \underbrace{\Big| \sum_{i \in \mathcal{C}_n(j)} \big( R_i - E[R_i] \big) \Big|}_{(A)}.$$
Observe now that, by definition of the dependency graph, the components in $(A)$ are sums of mutually independent variables. Using the Chernoff bound (Wainwright, 2019), we have that with probability at least $1 - \delta$,
$$\Big| \sum_{i \in \mathcal{C}_n(j)} \big( R_i - E[R_i] \big) \Big| \le \bar{C} \sqrt{|\mathcal{C}_n(j)| \log(1/\delta)},$$
for a finite constant $\bar{C} < \infty$, where $|\mathcal{C}_n(j)|$ denotes the number of elements in $\mathcal{C}_n(j)$. As a result, using the union bound, we obtain that with probability at least $1 - \delta$,
$$\Big| \frac{1}{n} \sum_{i=1}^n \big( R_i - E[R_i] \big) \Big| \le \underbrace{\frac{\bar{C}}{n} \sum_{\mathcal{C}_n(j) \in \mathcal{C}_n} \sqrt{|\mathcal{C}_n(j)| \log(\chi(A_n)/\delta)}}_{(B)}.$$
Using concavity of the square-root function, after multiplying and dividing $(B)$ by $\chi(A_n)$, we have
$$(B) \le \frac{\bar{C}}{n} \chi(A_n) \sqrt{\frac{1}{\chi(A_n)} \sum_{\mathcal{C}_n(j) \in \mathcal{C}_n} |\mathcal{C}_n(j)| \log(\chi(A_n)/\delta)} = \frac{\bar{C}}{n} \sqrt{\chi(A_n)\, n \log(\chi(A_n)/\delta)}.$$
The last equality follows by the definition of a proper cover. The final result follows by Lemma A.2.
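The proper-cover construction in the proof above is a greedy coloring of the dependency graph. As a concrete illustration (our own sketch; `greedy_proper_cover` is a hypothetical helper, not from the paper), the snippet below colors a small dependency graph and checks that the resulting classes partition $[n]$, contain no dependent pair, and number at most $\Delta + 1$, consistent with Lemma A.2.

```python
import numpy as np

def greedy_proper_cover(A):
    """Greedy coloring of the dependency graph with 0/1 symmetric adjacency
    matrix A. Returns a list of color classes: disjoint index sets covering
    [n] such that no two dependent units share a class. Greedy coloring uses
    at most (max degree + 1) classes, matching the Delta + 1 bound."""
    n = A.shape[0]
    color = -np.ones(n, dtype=int)
    for i in range(n):
        used = {color[j] for j in range(n) if A[i, j] and color[j] >= 0}
        c = 0
        while c in used:
            c += 1
        color[i] = c
    return [np.flatnonzero(color == c) for c in range(color.max() + 1)]

# Example dependency graph: a cycle of length 6 (maximum degree 2).
n = 6
A = np.zeros((n, n), dtype=int)
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

cover = greedy_proper_cover(A)
# The cover partitions [n] ...
assert sorted(i for cls in cover for i in cls) == list(range(n))
# ... each class is an independent set of the dependency graph ...
assert all(A[i, j] == 0 for cls in cover for i in cls for j in cls if i != j)
# ... and it uses at most (max degree + 1) classes.
assert len(cover) <= A.sum(axis=1).max() + 1
```

Within each class the variables are mutually independent, which is exactly what lets the proof apply an independent-sum concentration bound class by class.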
Lemma B.2.
Under Assumption 3, $\eta_n W^{(j)}_{i,t}(\beta)$ is sub-Gaussian with some parameter $\tilde{\sigma} < \infty$, for any $\beta \in \mathcal{B}$.

Proof. Observe that we can write
$$\eta_n W^{(j)}_{i,t}(\beta) = \underbrace{Y_{i,t}\, \eta_n \frac{\partial e(X_i;\beta)}{\partial \beta} \Big[ \frac{D_{i,t}}{e_{i,j,t}(\beta)} - \frac{1 - D_{i,t}}{1 - e_{i,j,t}(\beta)} \Big]}_{(A)} + \underbrace{Y_{i,t}\, v_{k(i)} \Big[ \frac{e(X_i;\beta)\, D_{i,t}}{e_{i,j,t}(\beta)} - \frac{(1 - e(X_i;\beta))(1 - D_{i,t})}{1 - e_{i,j,t}(\beta)} \Big]}_{(B)} - \underbrace{\eta_n\, c(X_i) \frac{\partial e(X_i;\beta)}{\partial \beta}}_{(C)}.$$
By the definition of $\mathcal{E}$ and Assumption 3, the term multiplying $Y_{i,t}$ in $(A)$ is bounded by $\bar{C} \eta_n$ for a finite constant $\bar{C}$. Similarly, the term multiplying $Y_{i,t}$ in $(B)$ is bounded by a finite constant $\bar{C}$, while $(C)$ is uniformly bounded by Assumption 3. Since $Y_{i,t}$ is sub-Gaussian by Assumption 3 (bounded $m_{i,t}$ and sub-Gaussian $\varepsilon_{i,t}$), and $\eta_n \le 1$, each component is sub-Gaussian, completing the proof.

Lemma B.3.
Let Assumptions 1 and 5 hold. Consider the experimental design in Equation (23), with $\check{\beta}^w_k$ estimated as in Section 4.1. Then for any pair of clusters $\{k, k+1\}$, with $k$ being odd,
$$\big( \check{\beta}^1_k, \cdots, \check{\beta}^{\check{T}}_k \big) \perp \Big\{ Y_{i,t}(d),\ X_i,\ d \in \{0,1\}^{\check{N}} \Big\}_{i : k(i) \in \{k, k+1\},\ t \le T}.$$

Proof of Lemma B.3. To show that the claim holds, it suffices to show that $\check{\beta}^w_k$ is a function of observables and unobservables only of those units in clusters $k' \not\in \{k, k+1\}$. We start by studying $\check{\beta}^{\check{T}}_k$. Observe that $\check{\beta}^{\check{T}}_k$ is chosen based on the gradient $\check{Z}_{k+2, \check{T}-1}$ estimated in the previous period in clusters $\{k+2, k+3\}$. The estimated gradient $\check{Z}_{k+2, \check{T}-1}$ is a function of the unobservables and observables at any time $t \le T$ in clusters $\{k+2, k+3\}$ and of the policy $\check{\beta}^{\check{T}-1}_{k+2}$. The policy $\check{\beta}^{\check{T}-1}_{k+2}$ is in turn a function of the gradient $\check{Z}_{k+4, \check{T}-2}$ estimated in the subsequent two clusters $\{k+4, k+5\}$ over the previous wave of experimentation $\check{T}-2$. Proceeding recursively around the circle of clusters, this chain of dependencies never reaches clusters $\{k, k+1\}$, since $K \ge 2(\check{T}+1)$. The same reasoning applies to the remaining coefficients.

Lemma B.4.
Let Assumptions 1 and 5 hold. Consider the experimental design in Section 4.1. Then, for $t \ge 1$, the following holds:
$$E\Big[ \frac{Y_{i,t(j,w)}\, D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] = m_{i,t(j,w)}(1, X_i, \beta_{k(i),j,w}), \qquad E\Big[ \frac{Y_{i,t(j,w)}\, (1 - D_{i,t(j,w)})}{1 - e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] = m_{i,t(j,w)}(0, X_i, \beta_{k(i),j,w}).$$

Proof of Lemma B.4. We prove the first statement; the second statement follows similarly. Under Assumption 1,
$$E\Big[ \frac{Y_{i,t(j,w)}\, D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] = E\Big[ m_{i,t(j,w)}(1, X_i, \beta_{k(i),j,w})\, \frac{D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] + E\Big[ \varepsilon_{i,t(j,w)}\, \frac{D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big].$$
Observe that by design,
$$E\Big[ m_{i,t(j,w)}(1, X_i, \beta_{k(i),j,w})\, \frac{D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] = m_{i,t(j,w)}(1, X_i, \beta_{k(i),j,w}).$$
In addition, by Lemma B.3,
$$E\Big[ \varepsilon_{i,t(j,w)}\, \frac{D_{i,t(j,w)}}{e(X_i; \beta_{k(i),j,w})} \,\Big|\, \check{\beta}^w_{k(i)}, X_i \Big] = 0,$$
completing the proof.

Lemma B.5.
Let Assumptions 1, 2, 3, 5 hold. Let $W^{(j)}_{i,t}$ be defined as in Equation (19). Then for any odd $k$,
$$\frac{1}{2n} \sum_{i \in S_{k,t(j,w)} \cup S_{k+1,t(j,w)}} E\Big[ W^{(j)}_{i,t(j,w)}\big(\check{\beta}^w_{k(i)}\big) \,\Big|\, \check{\beta}^w_{k(i)} \Big] - \frac{1}{2n} \sum_{i \in S_{k,0} \cup S_{k+1,0}} E\Big[ \frac{v_{k(i)}}{\eta_n} Y_{i,0} \,\Big|\, \check{\beta}^w_{k(i)} \Big] = V^{(j)}(\check{\beta}^w_k) + O(\eta_n) + O(J_n/\eta_n).$$

Proof of Lemma B.5. Recall the definition of $V(\beta)$ in Definition 2.4. In addition, recall that for $k$ odd, $\check{\beta}^w_k = \check{\beta}^w_{k+1}$. For shortness of notation, we define $t = t(j,w)$. Observe that by Lemma B.4, since $v_k$ is deterministic, we can write
$$\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} E\big[ W^{(j)}_{i,t}(\check{\beta}^w_{k(i)}) \mid \check{\beta}^w_{k(i)} \big] = \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} E\Big[ \big( m_{i,t}(1, X_i, \beta_{k(i),j,w}) - m_{i,t}(0, X_i, \beta_{k(i),j,w}) - c(X_i) \big) \frac{\partial e(X_i;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} \,\Big|\, \check{\beta}^w_k \Big]}_{(A)} + \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \frac{v_{k(i)}}{\eta_n} E\Big[ m_{i,t}(1, X_i, \beta_{k(i),j,w})\, e(X_i; \check{\beta}^w_{k(i)}) + \big(1 - e(X_i; \check{\beta}^w_{k(i)})\big)\, m_{i,t}(0, X_i, \beta_{k(i),j,w}) \,\Big|\, \check{\beta}^w_k \Big]}_{(B)}. \qquad (34)$$
The above expression follows since $\beta_{k(i),j,w}$ is a deterministic function of $\check{\beta}^w_{k(i)}$. We study $(A)$ and $(B)$ separately. We start from $(A)$, which we decompose into the following components:
$$(A) = \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \int \big[ m_{i,t}(1, x, \beta_{k(i),j,w}) - m_{i,t}(0, x, \beta_{k(i),j,w}) \big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} f_{X_i}(x)\, dx}_{(I)} - \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \int c(x) \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} f_{X_i}(x)\, dx}_{(II)}. \qquad (35)$$
First observe that we can write
$$(II) = \frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \int c(x) \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} f_{X_i}(x)\, dx = \frac{1}{2} \int c(x) \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \frac{1}{n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} f_{X_i}(x)\, dx = \int c(x) \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \check{f}_X(x)\, dx + O(J_n),$$
where the last equality follows from the dominated convergence theorem and the fact that $|S_{k,t} \cup S_{k+1,t}| = 2n$. Using the dominated convergence theorem combined with Assumption 3, we have
$$\int c(x) \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \check{f}_X(x)\, dx = \frac{\partial \int c(x)\, e(x;\beta)\, \check{f}_X(x)\, dx}{\partial \beta^{(j)}}.$$
We now study $(I)$. We can write
$$(I) = \frac{1}{2n} \int \sum_{i \in S_{k,t} \cup S_{k+1,t}} \big[ \big( m_{i,t}(1, x, \beta_{k(i),j,w}) - m_{i,t}(0, x, \beta_{k(i),j,w}) \big) f_{X_i}(x) \big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} dx = \frac{1}{2} \int \Big[ \sum_{b \in \{k, k+1\}} \big( m(1, x, \beta_{b,j,w}) - m(0, x, \beta_{b,j,w}) \big) f_X(x) \Big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} dx + O(J_n),$$
where the second equality follows from Assumption 2. We now use a first-order Taylor expansion of $m(1, x, \beta_{b,j,w})$ and $m(0, x, \beta_{b,j,w})$ around $\check{\beta}^w_k$. Observe that since $\beta_{b,j,w}$ deviates from $\check{\beta}^w_k$ by at most $\eta_n$ over one coordinate and zero over the remaining coordinates, under Assumption 3 we can write
$$\frac{1}{2} \int \Big[ \sum_{b \in \{k, k+1\}} \big( m(1, x, \beta_{b,j,w}) - m(0, x, \beta_{b,j,w}) \big) f_X(x) \Big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} dx = \int \big[ \big( m(1, x, \check{\beta}^w_k) - m(0, x, \check{\beta}^w_k) \big) f_X(x) \big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} dx + O(\eta_n).$$
We now study $(B)$. Observe that, differently from $(A)$, for $(B)$ we also need to account for time- and cluster-specific fixed effects. We start by studying the components
$$(B) = \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \frac{v_{k(i)}}{\eta_n} \int m_{i,t}(1, x, \beta_{k(i),j,w})\, e(x; \check{\beta}^w_{k(i)})\, f_{X_i}(x)\, dx}_{(a)} + \underbrace{\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} \frac{v_{k(i)}}{\eta_n} \int m_{i,t}(0, x, \beta_{k(i),j,w}) \big(1 - e(x; \check{\beta}^w_{k(i)})\big) f_{X_i}(x)\, dx}_{(b)}.$$
We study $(a)$, while $(b)$ follows similarly. Under Assumption 2, since $\check{\beta}^w_k = \check{\beta}^w_{k+1}$ for $k$ being odd, we write
$$(a) = \frac{1}{2} \sum_{b \in \{k, k+1\}} \frac{v_b}{\eta_n} \int m(1, x, \beta_{b,j,w})\, e(x; \check{\beta}^w_k)\, f_X(x) + e(x; \check{\beta}^w_k)\, \alpha_t(x) + e(x; \check{\beta}^w_k)\, \tau_b(x)\, dx + O(J_n/\eta_n) = \frac{1}{2\eta_n} \int \big( m(1, x, \beta_{k,j,w}) - m(1, x, \beta_{k+1,j,w}) \big) e(x; \check{\beta}^w_k)\, f_X(x) + e(x; \check{\beta}^w_k) \big( \tau_k(x) - \tau_{k+1}(x) \big)\, dx + O(J_n/\eta_n).$$
We bound $(b)$ as follows:
$$(b) = \frac{1}{2\eta_n} \int \big( m(0, x, \beta_{k,j,w}) - m(0, x, \beta_{k+1,j,w}) \big) \big(1 - e(x; \check{\beta}^w_k)\big) f_X(x) + \big(1 - e(x; \check{\beta}^w_k)\big) \big( \tau_k(x) - \tau_{k+1}(x) \big)\, dx + O(J_n/\eta_n).$$
Combining the expressions, we write
$$\frac{1}{2n} \sum_{i \in S_{k,t} \cup S_{k+1,t}} E\big[ W^{(j)}_{i,t}(\check{\beta}^w_{k(i)}) \mid \check{\beta}^w_{k(i)} \big] = \frac{1}{2\eta_n} \int \big( \tau_k(x) - \tau_{k+1}(x) \big)\, dx + \underbrace{\frac{1}{2\eta_n} \int \big( m(1, x, \beta_{k,j,w}) - m(1, x, \beta_{k+1,j,w}) \big) e(x; \check{\beta}^w_k)\, f_X(x)\, dx}_{(c)} + \underbrace{\frac{1}{2\eta_n} \int \big( m(0, x, \beta_{k,j,w}) - m(0, x, \beta_{k+1,j,w}) \big) \big(1 - e(x; \check{\beta}^w_k)\big) f_X(x)\, dx}_{(d)} + \int \big[ \big( m(1, x, \check{\beta}^w_k) - m(0, x, \check{\beta}^w_k) \big) f_X(x) \big] \frac{\partial e(x;\beta)}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} dx - \frac{\partial \int c(x)\, e(x;\beta)\, \check{f}_X(x)\, dx}{\partial \beta^{(j)}} \Big|_{\beta = \check{\beta}^w_k} + O\big( \eta_n + J_n/\eta_n \big).$$
We now study $(c)$, while $(d)$ follows similarly. We use a second-order Taylor expansion of $m(d, x, \cdot)$ at $\check{\beta}^w_k$. Using the randomization scheme in Equation (23), we obtain under Assumption 3
$$(c) = \frac{1}{2\eta_n} \int_{\mathcal{X}} \frac{\partial m(1, x, \check{\beta}^w_k)}{\partial \beta^{(j)}}\, 2\eta_n\, e(x; \check{\beta}^w_k)\, f_X(x)\, dx + O(\eta_n),$$
where the component $O(\eta_n)$ is bounded using the compact support assumption on $X$, the fact that $\|f_X\|_\infty < \infty$, and the boundedness assumption on the second-order derivative. Similarly, we write
$$(d) = \frac{1}{2\eta_n} \int_{\mathcal{X}} \frac{\partial m(0, x, \check{\beta}^w_k)}{\partial \beta^{(j)}}\, 2\eta_n\, \big(1 - e(x; \check{\beta}^w_k)\big) f_X(x)\, dx + O(\eta_n).$$
Finally, by the circular cross-fitting algorithm and Assumption 5, $\check{\beta}^w_k$ is independent of observables and unobservables in clusters $\{k, k+1\}$ at time $s = 0$, since $\check{\beta}^w_k$ is a measurable function of the gradients estimated in all clusters but clusters $\{k, k+1\}$. As a result, we can take (recall that every expression is conditional on the initialization value $\iota$, assumed to be exogenous)
$$\frac{1}{2n} \sum_{i \in S_{k,0} \cup S_{k+1,0}} \frac{v_{k(i)}}{\eta_n} E\big[ Y_{i,0} \mid \check{\beta}^w_{k(i)} \big] = \frac{1}{2n} \sum_{i \in S_{k,0} \cup S_{k+1,0}} \frac{v_{k(i)}}{\eta_n} \int m_{i,0}(1, x, \iota)\, e(x; \iota)\, f_{X_i}(x)\, dx + \frac{1}{2n} \sum_{i \in S_{k,0} \cup S_{k+1,0}} \frac{v_{k(i)}}{\eta_n} \int m_{i,0}(0, x, \iota)\, \big(1 - e(x; \iota)\big) f_{X_i}(x)\, dx.$$
Under Assumption 2, the right-hand side equals
$$\frac{1}{2\eta_n} \int \big( \tau_k(x) - \tau_{k+1}(x) \big)\, dx + O(J_n/\eta_n).$$
Combining the equations, the proof completes.
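The inverse-propensity identities of Lemma B.4, which drive the first display of the proof above, can be checked by simulation. The sketch below is our own illustration with hypothetical primitives (`m1`, `m0`, a logistic propensity), none taken from the paper; it verifies that Horvitz-Thompson reweighting recovers the conditional means on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical primitives: covariates, a logistic propensity e(x),
# and conditional means m(1, x), m(0, x).
X = rng.uniform(-1, 1, size=n)
e = 1.0 / (1.0 + np.exp(-X))           # propensity of treatment given X
m1 = 1.0 + 2.0 * X                     # E[Y | D = 1, X]
m0 = 0.5 - 1.0 * X                     # E[Y | D = 0, X]

D = rng.binomial(1, e)                 # randomized treatment
Y = np.where(D == 1, m1, m0) + rng.normal(0, 1, size=n)  # noisy outcome

# Horvitz-Thompson reweighting, as in Lemma B.4:
# E[Y D / e(X)] = E[m(1, X)]  and  E[Y (1 - D) / (1 - e(X))] = E[m(0, X)].
ht1 = np.mean(Y * D / e)
ht0 = np.mean(Y * (1 - D) / (1 - e))

assert abs(ht1 - m1.mean()) < 0.1
assert abs(ht0 - m0.mean()) < 0.1
```

The same identity underlies the estimator $\check{Z}_{k,w}$: the reweighting removes the dependence on the realized assignments, leaving only the conditional means evaluated at the perturbed policies.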
Lemma B.6.
Consider the experimental design in Section 4.1. Let Assumptions 1, 2, 3, 5 hold. Then with probability at least $1 - \delta$, for every odd $k$ and every $w \in \{1, \cdots, \check{T}\}$,
$$\check{Z}^{(j)}_{k,w} = V^{(j)}(\check{\beta}^w_k) + O\Big( \sqrt{\frac{\gamma_n \log(\gamma_n K \check{T}/\delta)}{\eta_n^2 n}} + \eta_n + J_n/\eta_n \Big).$$

Proof. Observe that by Lemma B.2, $\eta_n W^{(j)}_{i,t}(\beta)$ is sub-Gaussian with parameter $\tilde{\sigma}$. Similarly, under Assumption 3 (B), $v_{k(i)} Y_{i,0}$ is sub-Gaussian. In addition, under Assumption 1, by Lemma B.4 and Lemma 2.1, since the assignment mechanism is a measurable function of $\check{\beta}^w_k$ in clusters $k, k+1$,
$$\Big\{ \big( W^{(j)}_{i,t}(\check{\beta}^w_k),\ Y_{i,0} \big) \Big\}_{i \in S_{k,t} \cup S_{k+1,t} \cup S_{k,0} \cup S_{k+1,0}} \,\Big|\, \check{\beta}^w_k$$
form a dependency graph (e.g., see Ross et al. (2011)) with maximum degree bounded by $O(\gamma_n)$, since each observation $(W^{(j)}_{i,t}(\check{\beta}^w_k), Y_{i,0})$ depends on at most $\gamma_n$ units in the set $\{W^{(j)}_{j',t}, Y_{j',0}\}_{j' \ne i,\ j' : k(j') = k(i)}$. This follows from Assumption 1 (B), (C), and the fact that $\check{\beta}^w_k$ is estimated using information from all clusters except $\{k, k+1\}$, under the circular cross-fitting and Assumption 5. By Lemma B.1, with probability at least $1 - \delta$,
$$\Big| \check{Z}^{(j)}_{k,w} - E\big[ \check{Z}^{(j)}_{k,w} \mid \check{\beta}^w_k \big] \Big| \le \bar{C}' \sqrt{\frac{\gamma_n \log(\gamma_n/\delta)}{\eta_n^2 n}}, \qquad (36)$$
for a universal finite constant $\bar{C}' < \infty$. Using the triangular inequality, we obtain
$$\Big| \check{Z}^{(j)}_{k,w} - V^{(j)}(\check{\beta}^w_k) \Big| \le \Big| \check{Z}^{(j)}_{k,w} - E\big[ \check{Z}^{(j)}_{k,w} \mid \check{\beta}^w_k \big] \Big| + \Big| V^{(j)}(\check{\beta}^w_k) - E\big[ \check{Z}^{(j)}_{k,w} \mid \check{\beta}^w_k \big] \Big|.$$
The first term is bounded as in Equation (36) and the second term by Lemma B.5. The final result follows by the union bound over $K, \check{T}$.

Lemma B.7 (Adaptive gradient descent for quasi-concave and locally strongly concave functions). Let $\mathcal{B}$ be compact. Define $G = \max\{\sup_{\beta \in \mathcal{B}} \|\beta\|^2, 1\}$. Let Assumptions 3, 6 hold. Let $\kappa$ be a positive finite constant, defined as in Equation (37). Then for any $w \le \check{T}$ with $w \ge \gamma^{-1}(\kappa + 2)\, e^{(G+1)/\gamma}$, the following holds:
$$\|\beta^*_w - \beta^*\|^2 \le \kappa\, w^{-1}.$$

Proof.
To prove the statement, we use properties of gradient descent methods (Hazan et al., 2015), with a key difference from that reference: instead of fixing the estimation error over all iterations, we let the estimation error decrease with $w$.

Preliminaries
Clearly, if the algorithm terminates at $w$, then under Assumption 6 (B) this implies that $\|\beta^*_w - \beta^*\|^2 \le \kappa/\check{T} \le \kappa/w$, proving the claim. Therefore, assume that the algorithm did not terminate at time $w$. Define $\epsilon_w = 1/(w-1)$ and let $\nabla_w$ be the gradient evaluated at $\beta^*_{w-1}$. For every $\beta \in \mathcal{B}$, define $H(\beta)\big|_{[\beta^*, \beta]}$ the Hessian evaluated at some point in the segment $[\beta^*, \beta]$, such that
$$W(\beta) = W(\beta^*) + \frac{1}{2}(\beta - \beta^*)^\top H(\beta)\big|_{[\beta^*, \beta]} (\beta - \beta^*),$$
and define
$$f(\beta) = \frac{1}{2}(\beta - \beta^*)^\top H(\beta)\big|_{[\beta^*, \beta]} (\beta - \beta^*) \le 0,$$
where the inequality follows by definition of $\beta^*$.

Claim
We claim that
$$-|\lambda_{\max}|\, \|\beta - \beta^*\|^2 \le f(\beta) \le -|\lambda_{\min}|\, \|\beta - \beta^*\|^2$$
for constants $\lambda_{\max}, \lambda_{\min} > 0$. The lower bound follows directly by Assumption 3, while the upper bound follows from Assumption 6 (iii) and compactness of $\mathcal{B}$. We provide details for the upper bound in the following paragraph.

Proof of the claim on the upper bound
We now use a contradiction argument. Suppose that the upper bound does not hold. Then there must exist a sequence $\beta_s \in \mathcal{B}$ such that $f(\beta_s) \ge -o(\|\beta_s - \beta^*\|^2)$. Observe first of all that, since the parameter space $\mathcal{B}$ is compact, any sequence such that $\beta_s \to \beta \ne \beta^*$ would contradict the statement, due to global optimality of $\beta^*$ and the fact that $\|\beta - \beta^*\| < \infty$. As a result, we only have to discuss sequences $\beta_s \to \beta^*$. By twice continuous differentiability of $W(\beta)$, we have that $H(\beta_s) \to H(\beta^*)$. As a result, we can find, for $s \ge S$ with $S$ large enough, a point in the sequence such that (since $p$ is finite)
$$2 f(\beta_s) \le (\beta_s - \beta^*)^\top H(\beta^*) (\beta_s - \beta^*) + \delta(s)\, \|\beta_s - \beta^*\|^2, \qquad \delta(s) = p\, \|H(\beta_s) - H(\beta^*)\|_\infty.$$
Since $H(\beta^*)$ is negative definite, the above expression is bounded as follows:
$$2 f(\beta_s) \le -\big( |\tilde{\lambda}_{\min}| - \delta(s) \big)\, \|\beta_s - \beta^*\|^2,$$
where $|\tilde{\lambda}_{\min}| > 0$ denotes the smallest eigenvalue of $H(\beta^*)$ (in absolute value), bounded away from zero by Assumption 6 (iii). Since $\delta(s) \to 0$, we reach a contradiction.
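The two-sided bound in the claim above is the usual eigenvalue bound on quadratic forms of a negative definite Hessian (with the factor $1/2$ absorbed into the constants). A minimal numerical check, using a random hypothetical Hessian of our own construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random negative definite "Hessian" H (hypothetical, 3 x 3).
M = rng.normal(size=(3, 3))
H = -(M @ M.T) - 0.1 * np.eye(3)

eigs = np.linalg.eigvalsh(H)          # all eigenvalues are negative
lam_min = np.abs(eigs).min()          # smallest eigenvalue in absolute value
lam_max = np.abs(eigs).max()          # largest eigenvalue in absolute value

# For f(x) = 0.5 x^T H x, check  -0.5*lam_max*||x||^2 <= f(x) <= -0.5*lam_min*||x||^2.
for _ in range(100):
    x = rng.normal(size=3)
    f = 0.5 * x @ H @ x
    assert -0.5 * lam_max * (x @ x) - 1e-9 <= f <= -0.5 * lam_min * (x @ x) + 1e-9
```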
Cases
Define
$$\kappa = \frac{|\lambda_{\max}|}{|\lambda_{\min}|} \ge 1. \qquad (37)$$
Observe now that if $\|\beta^*_w - \beta^*\|^2 \le \epsilon_w \kappa$, the claim trivially holds. Therefore, consider the case where $\|\beta^*_w - \beta^*\|^2 > \epsilon_w \kappa$.

Comparisons within the neighborhood
Take $\tilde{\beta} = \beta^* - \sqrt{\epsilon_w}\, \nabla_w / \|\nabla_w\|$. Observe that
$$W(\tilde{\beta}) - W(\beta^*_w) = \frac{1}{2}(\tilde{\beta} - \beta^*)^\top H(\tilde{\beta})\big|_{[\beta^*, \tilde{\beta}]} (\tilde{\beta} - \beta^*) - \frac{1}{2}(\beta^*_w - \beta^*)^\top H(\beta^*_w)\big|_{[\beta^*, \beta^*_w]} (\beta^*_w - \beta^*) \ge -|\lambda_{\max}|\, \epsilon_w + |\lambda_{\min}|\, \epsilon_w \kappa = 0.$$
As a result, for all $\beta^*_w$ with $\|\beta^*_w - \beta^*\|^2 > \epsilon_w \kappa$, using quasi-concavity,
$$\nabla_w^\top (\tilde{\beta} - \beta^*_w) \ge 0 \ \Rightarrow\ \nabla_w^\top (\beta^* - \beta^*_w) \ge \sqrt{\epsilon_w}\, \|\nabla_w\|. \qquad (38)$$
Plugging the above expression into the update defining $\beta^*_w$, by construction of the algorithm we write
$$\|\beta^* - \beta^*_w\|^2 \le \|\beta^* - \beta^*_{w-1}\|^2 - 2\alpha_{w-1}\, \nabla_w^\top (\beta^* - \beta^*_w) + \alpha_{w-1}^2\, \|\nabla_w\|^2.$$
By Equation (38), we can write
$$\|\beta^* - \beta^*_w\|^2 \le \|\beta^* - \beta^*_{w-1}\|^2 - 2\alpha_{w-1} \sqrt{\epsilon_w}\, \|\nabla_w\| + \alpha_{w-1}^2\, \|\nabla_w\|^2.$$
Plugging in the expression for $\alpha_w$, and using the fact that $\gamma \le 1$, we have
$$\epsilon_w \kappa \le \|\beta^* - \beta^*_w\|^2 \le \|\beta^* - \beta^*_{w-1}\|^2 - \gamma \epsilon_w.$$

Recursive argument
Observe that if $\|\beta^* - \beta^*_{w-1}\|^2 \le \epsilon_{w-1} \kappa$, then we have
$$\epsilon_w \kappa \le \|\beta^* - \beta^*_w\|^2 \le \frac{\kappa}{w-2} - \frac{\gamma}{w-1},$$
which, after rearranging, implies $w \le (\kappa + 2\gamma)/\gamma$, a contradiction for $w$ above this threshold. As a result, we can assume that $\|\beta^* - \beta^*_{w-1}\|^2 > \kappa\, \epsilon_{w-1}$. Observe that now $\beta^*_{w-1}$ satisfies the same conditions discussed above. Using the recursion for all $s \ge (\kappa+2)/\gamma$, we have
$$\|\beta^* - \beta^*_w\|^2 \le \big\| \beta^* - \beta^*_{(\kappa+2)/\gamma} \big\|^2 - \gamma \sum_{s = (\kappa+2)/\gamma}^{w} \epsilon_s \le G + 1 - \gamma \log(w) + \gamma \log(\kappa/\gamma + 2/\gamma).$$
Whenever $w > \gamma^{-1}(\kappa + 2)\, e^{(G+1)/\gamma}$, the right-hand side is negative, and we have a contradiction. The proof completes.

Lemma B.8.
Let Assumptions 1, 2, 3, 5, 6 hold. Assume that
$$\epsilon_n \ge \sqrt{p}\, \Big[ \bar{C} \sqrt{\frac{\gamma_n \log(\gamma_n \check{T} K/\delta)}{\eta_n^2 n}} + \eta_n + J_n/\eta_n \Big], \qquad \frac{\mu}{\sqrt{\check{T}}} - 2\epsilon_n > 0,$$
for a universal constant $\bar{C} < \infty$. Then with probability at least $1 - \delta$, for any $w \le \check{T}$, either
$$(i)\ \big\| \check{\beta}^w_k - \beta^*_w \big\|_\infty = O\big( P_w(\delta) + p \eta_n \big), \qquad \text{or} \qquad (ii)\ \big\| \check{\beta}^w_k - \beta^* \big\|^2 \le \frac{p \kappa}{\check{T}},$$
where $P_1(\delta) = \mathrm{err}(\delta)$ and
$$P_w(\delta) = \frac{2 B \sqrt{p}}{\nu_n \sqrt{w}}\, P_{w-1}(\delta) + P_{w-1}(\delta) + \frac{2\sqrt{p}}{\nu_n \sqrt{w}}\, \mathrm{err}(\delta),$$
for a finite constant $B < \infty$, with
$$\mathrm{err}(\delta) = O\Big( \sqrt{\frac{\gamma_n \log(\gamma_n p \check{T} K/\delta)}{\eta_n^2 n}} + p \eta_n + J_n/\eta_n \Big), \qquad \nu_n = \frac{\mu}{\sqrt{\check{T}}} - 2\epsilon_n.$$

Proof. First, recall that by Lemma B.6 we can write, for every $k$ and $w$,
$$\check{V}^{(j)}_{k,w} = V^{(j)}(\check{\beta}^w_{k+2}) + O\Big( \sqrt{\frac{\gamma_n \log(\gamma_n K \check{T}/\delta)}{\eta_n^2 n}} + \eta_n + J_n/\eta_n \Big).$$
We now proceed by induction. We first prove the statement assuming that the constraint is never attained; we then discuss the case of the constrained solution. Define $B = p \sup_\beta \big\| \partial^2 W(\beta)/\partial \beta \partial \beta^\top \big\|_\infty$.

Unconstrained case. Consider $w = 1$. Then, since all clusters start from the same starting point $\iota$, we can write, with probability $1 - \delta$, by the union bound and Lemma B.6,
$$\big\| \check{V}_{k,1} - V(\beta^*_1) \big\|_\infty \le \mathrm{err}(\delta). \qquad (39)$$
Consider now the case where the algorithm stops, i.e., $\|\check{V}_{k,1}\| \le \mu/\sqrt{\check{T}} - \epsilon_n$. By Lemma B.6,
$$\|V(\beta^*_1)\| \le \|\check{V}_{k,1}\| + \sqrt{p}\, \mathrm{err}(\delta) \le \frac{\mu}{\sqrt{\check{T}}} - \epsilon_n + \sqrt{p}\, \mathrm{err}(\delta) \le \frac{\mu}{\sqrt{\check{T}}}, \qquad (40)$$
since $\epsilon_n \ge \sqrt{p}\, \mathrm{err}(\delta)$. As a result, the oracle algorithm also stops at $\beta^*_1$, by construction of $\epsilon_n$. Suppose instead that the algorithm does not stop. Then it must be that $\|\check{V}_{k,1}\| \ge \mu/\sqrt{\check{T}} - \epsilon_n$ and
$$\|V(\beta^*_1)\| \ge \frac{\mu}{\sqrt{\check{T}}} - \epsilon_n - \sqrt{p}\, \mathrm{err}(\delta) \ge \frac{\mu}{\sqrt{\check{T}}} - 2\epsilon_n := \nu_n > 0.$$
Observe now that
$$\Big\| \frac{\check{V}_{k,1}}{\|\check{V}_{k,1}\|} - \frac{V(\beta^*_1)}{\|V(\beta^*_1)\|} \Big\|_\infty \le \Big\| \frac{\check{V}_{k,1} - V(\beta^*_1)}{\|V(\beta^*_1)\|} \Big\|_\infty + \Big\| \frac{\check{V}_{k,1} \big( \|\check{V}_{k,1}\| - \|V(\beta^*_1)\| \big)}{\|V(\beta^*_1)\|\, \|\check{V}_{k,1}\|} \Big\|_\infty \le \Big\| \frac{\check{V}_{k,1} - V(\beta^*_1)}{\|V(\beta^*_1)\|} \Big\|_\infty + \sqrt{p}\, \Big\| \frac{\check{V}_{k,1} - V(\beta^*_1)}{\|V(\beta^*_1)\|} \Big\|_\infty. \qquad (41)$$
Then with probability at least $1 - \delta$,
$$(41) \le \frac{2\sqrt{p}}{\nu_n}\, \mathrm{err}(\delta),$$
completing the claim for $w = 1$.

Consider now a general $w$. Define $P_{w-1}$ the error until time $w - 1$. Then for every $j \in \{1, \cdots, p\}$, by Assumption 3, we have, with probability at least $1 - w\delta$ (using the union bound),
$$\check{V}^{(j)}_{k,w} = V^{(j)}(\check{\beta}^w_{k+2}) + \mathrm{err}(\delta) = V^{(j)}\big( \beta^*_w + O(P_{w-1}(\delta)) \big) + \mathrm{err}(\delta) \ \Rightarrow\ \big\| \check{V}_{k,w} - V(\beta^*_w) \big\|_\infty \le B\, P_{w-1}(\delta) + \mathrm{err}(\delta),$$
where the above inequality follows by the mean value theorem and Assumption 3. Suppose now that $\|\check{V}_{k,w}\| \le \mu/\sqrt{\check{T}} - \epsilon_n$. Then, by the same argument as in Equation (40), we have $\|V(\check{\beta}^w_k)\| \le \mu/\sqrt{\check{T}}$, and hence $\|\check{\beta}^w_k - \beta^*\|^2 \le p\kappa/\check{T}$, which proves the statement. Suppose instead that the algorithm does not stop. Then we can write, by the induction argument,
$$\Big\| \check{\beta}^w_k + \frac{1}{\sqrt{w}} \frac{\check{V}_{k,w}}{\|\check{V}_{k,w}\|} - \beta^*_w - \frac{1}{\sqrt{w}} \frac{V(\beta^*_w)}{\|V(\beta^*_w)\|} \Big\|_\infty \le P_{w-1}(\delta) + \underbrace{\frac{1}{\sqrt{w}} \Big\| \frac{\check{V}_{k,w}}{\|\check{V}_{k,w}\|} - \frac{V(\beta^*_w)}{\|V(\beta^*_w)\|} \Big\|_\infty}_{(B)}. \qquad (42)$$
Using the same argument as in Equation (41), we have, with probability at least $1 - \delta$,
$$(B) \le \frac{2\sqrt{p}}{\nu_n \sqrt{w}} \big[ \mathrm{err}(\delta) + B\, P_{w-1}(\delta) \big],$$
which completes the proof for the unconstrained case. The $\check{T}$ component in the error expression follows from the union bound across all $\check{T}$ events.

Constrained case. Since the statement is true for $w = 1$, we can assume that it is true for all $s \le w - 1$. Using the fact that $\mathcal{B}$ is a compact space, we can write
$$\Big\| \Pi_{\mathcal{B}_0, \mathcal{B} - \eta_n}\Big[ \sum_{s=1}^w \alpha_{k,s} \check{V}_{k,s} \Big] - \Pi_{\mathcal{B}_0, \mathcal{B}}\Big[ \sum_{s=1}^w \alpha_s V(\beta^*_s) \Big] \Big\|_\infty \le \Big\| \Pi_{\mathcal{B}_0, \mathcal{B} - \eta_n}\Big[ \sum_{s=1}^w \alpha_{k,s} \check{V}_{k,s} \Big] - \Pi_{\mathcal{B}_0, \mathcal{B} - \eta_n}\Big[ \sum_{s=1}^w \alpha_s V(\beta^*_s) \Big] \Big\|_\infty + p\, O(\eta_n) \le \Big\| \sum_{s=1}^w \alpha_{k,s} \check{V}_{k,s} - \sum_{s=1}^w \alpha_s V(\beta^*_s) \Big\|_\infty + p\, O(\eta_n).$$
For the first component in the last inequality, we follow the same argument as above.
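The adaptive gradient descent analyzed in Lemmas B.7 and B.8 can be mimicked on a toy problem. The sketch below is our own illustration with a made-up strongly concave objective (not the paper's welfare function): it runs projected normalized-gradient ascent with step size $\gamma/\sqrt{w}$ and a small-gradient termination rule, and checks that the iterate approaches the maximizer.

```python
import numpy as np

# Toy strongly concave objective: W(beta) = -0.5 * ||beta - beta_star||^2,
# maximized at beta_star; its gradient is V(beta) = beta_star - beta.
beta_star = np.array([0.6, -0.3])
grad = lambda b: beta_star - b

def project(b, radius=1.0):
    """Projection onto a Euclidean ball (a stand-in for the projection Pi_B)."""
    norm = np.linalg.norm(b)
    return b if norm <= radius else b * (radius / norm)

gamma = 1.0
beta = np.zeros(2)                      # initialization iota
for w in range(1, 20_001):
    g = grad(beta)
    if np.linalg.norm(g) < 1e-8:        # terminate when the gradient is small
        break
    # normalized-gradient step of size gamma / sqrt(w), as in the oracle update
    beta = project(beta + gamma / np.sqrt(w) * g / np.linalg.norm(g))

assert np.linalg.norm(beta - beta_star) < 0.02
```

With the $1/\sqrt{w}$ step the iterate oscillates around the maximizer at a distance of roughly the current step size, which is why the termination threshold in the oracle procedure is tied to $1/\sqrt{\check{T}}$.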
C Theorems
Proof of Theorem 3.1.
The proof follows directly from Lemma B.5, where $\beta$ replaces $\check{\beta}^w_k$, since $\beta$ is exogenous.

Theorem C.1. Let the conditions in Lemma B.8 hold. Then with probability at least $1 - \delta$, for any $k \in \{1, \cdots, K\}$ and any $\check{T} \ge w \ge \zeta$, for $\zeta < \infty$ a universal constant,
$$\|\beta^* - \check{\beta}^w_k\| \le \sqrt{\frac{\kappa}{w}} + \frac{p}{\nu_n}\, e^{B \sqrt{p \check{T} w}} \times O\Big( \sqrt{\frac{\gamma_n \log(p \gamma_n \check{T} K/\delta)}{\eta_n^2 n}} + p \eta_n + J_n/\eta_n \Big),$$
with $\nu_n = \mu/\sqrt{\check{T}} - 2\epsilon_n$, $\kappa, B < \infty$ constants independent of $(p, n, \check{T})$, and $\epsilon_n$ as defined in Lemma B.8.

Proof. We invoke Lemma B.8. Observe that we only have to consider case (i), since under case (ii) the claim trivially holds. Using the triangular inequality, we can write
$$\|\beta^* - \check{\beta}^w_k\| \le \|\beta^* - \beta^*_w\| + \|\check{\beta}^w_k - \beta^*_w\|.$$
The first component on the right-hand side is bounded by Lemma B.7, with $\zeta$ defined as in that lemma. Using Lemma B.8, we bound the second component as follows:
$$\|\check{\beta}^w_k - \beta^*_w\| \le p\, \|\check{\beta}^w_k - \beta^*_w\|_\infty = p \times O(P_w(\delta)).$$
We conclude the proof by explicitly bounding
$$P_w = \Big( 1 + \frac{2 B \sqrt{p}}{\nu_n \sqrt{w}} \Big) P_{w-1} + \frac{1}{\sqrt{w}}\, \mathrm{err}_n(\delta), \qquad \mathrm{err}_n(\delta) = \frac{\sqrt{p}}{\nu_n}\, O\Big( \sqrt{\frac{\gamma_n \log(p \check{T} K/\delta)}{\eta_n^2 n}} + p \eta_n + J_n/\eta_n \Big),$$
where $B < \infty$ denotes a finite constant. Using a recursive argument, we obtain
$$P_w = \mathrm{err}_n(\delta) \sum_{s=1}^w \alpha_s \prod_{j=s}^w \Big( \frac{2 B \sqrt{p}}{\nu_n \sqrt{j}} + 1 \Big), \qquad \alpha_s = \frac{1}{\sqrt{s}}.$$
Recall now that $\nu_n \ge \mu/(4\sqrt{\check{T}})$ as in Lemma B.8. As a result, we can bound the above expression as
$$\mathrm{err}_n(\delta) \sum_{s=1}^w \alpha_s \prod_{j=s}^w \Big( \frac{2 B \sqrt{p}}{\nu_n \sqrt{j}} + 1 \Big) \le \mathrm{err}_n(\delta) \sum_{s=1}^w \alpha_s \prod_{j=s}^w \Big( \frac{8 \mu^{-1} \sqrt{\check{T}} B \sqrt{p}}{\sqrt{j}} + 1 \Big) \le \mathrm{err}_n(\delta) \sum_{s=1}^w \alpha_s \exp\Big( \sum_{j=s}^w \frac{8 \mu^{-1} \sqrt{\check{T}} B \sqrt{p}}{\sqrt{j}} \Big).$$
Moreover,
$$\exp\Big( \sum_{j=s}^w \frac{8 \mu^{-1} \sqrt{\check{T}} B \sqrt{p}}{\sqrt{j}} \Big) \le \exp\Big( 16 \mu^{-1} \sqrt{\check{T}} B \sqrt{p}\, \big( w^{1/2} - s^{1/2} + 1 \big) \Big) \lesssim \exp\big( B' \sqrt{p \check{T} w} \big)\, e^{-s^{1/2}},$$
for a finite constant $B'$. We now write
$$\sum_{s=1}^w \alpha_s \prod_{j=s}^w \Big( \frac{2 B \sqrt{p}}{\nu_n \sqrt{j}} + 1 \Big) \lesssim \sum_{s=1}^w \frac{1}{\sqrt{s}}\, e^{-s^{1/2}}\, e^{B' \sqrt{p \check{T} w}} \lesssim e^{B' \sqrt{p \check{T} w}},$$
completing the proof (constants are absorbed into $B$).

Corollary.
Theorem 4.2 holds.

Proof. Consider Lemma B.8, where we choose $\delta = 1/n$. Observe that we choose $\epsilon_n \le \mu/(4\sqrt{\check{T}})$, which is attained by the conditions in Lemma B.8 as long as $n$ is large enough that
$$\sqrt{p}\, \Big[ \bar{C} \sqrt{\frac{\log(n)\, \gamma_n \log(p \gamma_n \check{T} K)}{\eta_n^2 n}} + \eta_n + J_n/\eta_n \Big] \le \frac{\mu}{4\sqrt{\check{T}}},$$
which holds under the assumptions stated. As a result, we have $\nu_n \ge \mu/(2\sqrt{\check{T}})$. The claim directly follows from Theorem C.1.

Corollary.
Let the conditions in Theorem C.1 hold. Then with probability at least $1 - \delta$, for a finite constant $B < \infty$,
$$\tau(\beta^*) - \tau(\hat{\beta}^*) \lesssim \frac{p}{\check{T}+1} + \frac{p}{\nu_n}\, e^{B \sqrt{p}\, \check{T}} \times O\Big( \sqrt{\frac{\gamma_n \log(p \gamma_n \check{T} K/\delta)}{\eta_n^2 n}} + p \eta_n + J_n/\eta_n \Big).$$

Proof. We have
$$\Big\| \beta^* - \frac{1}{K} \sum_{k} \check{\beta}^{\check{T}+1}_k \Big\| \le \frac{1}{K} \sum_{k} \big\| \check{\beta}^{\check{T}+1}_k - \beta^* \big\|.$$
The proof concludes by Theorem C.1 and Assumption 3, after observing that $\tau(\beta^*) - \tau(\hat{\beta}^*) \lesssim p\, \|\beta^* - \hat{\beta}^*\|$.

Corollary. Theorem 4.3 holds.

Proof.
By the mean value theorem and Assumption 3, we have
$$\sum_{w=1}^{\check{T}} \big[ \tau(\beta^*) - \tau(\check{\beta}^w_k) \big] \le \bar{C} p \sum_{w=1}^{\check{T}} \big\| \beta^* - \check{\beta}^w_k \big\|,$$
for a universal constant $\bar{C} < \infty$. We now take $w \ge \zeta$, for $\zeta < \infty$ such that Lemma B.7 holds. By Theorem C.1, for $n$ satisfying the conditions in Theorem 4.2, with $\delta = 1/n$, with probability at least $1 - 1/n$, using a second-order Taylor expansion and the boundedness condition on the Hessian in Assumption 3, we have
$$\sum_{w > \zeta} \big[ \tau(\beta^*) - \tau(\check{\beta}^w_k) \big] \le \sum_{w > \zeta} \frac{p \kappa'}{w} \lesssim p \log(\check{T}),$$
for $\kappa' < \infty$ being a finite constant. Finally, using the fact that $\mathcal{B}$ is a compact space, we write
$$\sum_{w \le \zeta} \big\| \beta^* - \check{\beta}^w_k \big\| \le \zeta B_0 < \infty$$
for a universal constant $B_0$, completing the proof.

Corollary.
Theorem 5.3 holds.

Proof. The proof follows directly from Theorem C.1, after noticing that every two periods the function is evaluated at the same vector of parameters, $\Gamma(\check{\beta}^w, \check{\beta}^w)$. Therefore, we can apply all our results to the function $\beta \mapsto \Gamma(\beta, \beta)$, which satisfies the same conditions as $W(\beta)$.
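The $p \log(\check{T})$ regret term in the corollaries above comes from summing per-wave errors of order $1/w$; the underlying harmonic-sum bound $\log T \le \sum_{w=1}^{T} 1/w \le 1 + \log T$ can be checked directly (our own numerical illustration):

```python
import math

# Check log(T) <= sum_{w=1}^{T} 1/w <= 1 + log(T), the bound behind the
# p * log(T) cumulative regret term.
for T in [2, 10, 100, 10_000]:
    harmonic = sum(1.0 / w for w in range(1, T + 1))
    assert math.log(T) <= harmonic <= 1.0 + math.log(T)
```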
Let Assumption 1, 2, 3, 4, 5. Then ˇ Z ( j ) k,w − V ( j ) ( ˇ β k,t ) (cid:113) Var( ˇ Z k,w | ˇ β wk ) + B n → d N (0 , , where B n = O ( η n × √ n + J n / (cid:112) η n ρ n ) Proof.
By Lemma B.5, we have

E[ Ž_{k,w}^{(j)} | β̌_k^w ] = V^{(j)}(β̌_k^w) + O( η_n + J_n/η_n ).
We have

( Ž_{k,w}^{(j)} − E[Ž_{k,w}^{(j)} | β̌_k^w] ) / √Var(Ž_{k,w}^{(j)} | β̌_k^w) = ( Ž_{k,w}^{(j)} − V^{(j)}(β̌_k^w) ) / √Var(Ž_{k,w}^{(j)} | β̌_k^w) + O( (η_n + J_n/η_n) / √Var(Ž_{k,w}^{(j)} | β̌_k^w) ).

Observe that under Assumption 4,

O( (η_n + J_n/η_n) / √Var(Ž_{k,w}^{(j)} | β̌_k^w) ) ≤ O( η_n √n + J_n / √(η_n ρ_n) ).

We now invoke Lemma A.1. Define t = t(j, w). First, define

H_{i,t} = (1/n) W_{i,t}(β̌_k^w),   H_{i,0} = (2 v_{k(i)} / (η_n n)) Y_{i,0}.

Following the same reasoning as in Lemma B.6, we observe that

( H_{i,t}, H_{i,0} )_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k,0} ∪ S_{k+1,0}} | β̌_k^w

form a dependency graph with maximum degree of order O(γ_n). To see why, notice that H_{i,t} depends on at most γ_n + 1 elements (H_{j,t}, H_{j,0}), and similarly for H_{i,0}, conditional on β̌_k^w, since, under the cross-fitting algorithm, β̌_k^w is estimated without using information from clusters {k, k+1}.
In addition, under Assumption 3 and Lemma B.2,

E[ H_{i,t} | β̌_k^w ], E[ H_{i,0} | β̌_k^w ] ≤ c′/(n η_n) < ∞,

since 1/η_n ≤ n, for a constant c′ < ∞. Define σ² = Var( Σ_{i ∈ S_{k,t} ∪ S_{k+1,t}} H_{i,t} − Σ_{i ∈ S_{k,0} ∪ S_{k+1,0}} H_{i,0} ). Using Lemma A.1 and the triangle inequality, we write

d_W( Σ_{i ∈ S_{k,t} ∪ S_{k+1,t}} H_{i,t} + Σ_{i ∈ S_{k,0} ∪ S_{k+1,0}} H_{i,0}, G )
≤ (γ_n²/σ³) Σ_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k,0} ∪ S_{k+1,0}} [ E|H_{i,t}|³ + E|H_{i,0}|³ ]   (A)
+ ( √28 γ_n^{3/2} / (√π σ²) ) √( Σ_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k,0} ∪ S_{k+1,0}} [ E[H_{i,t}⁴] + E[H_{i,0}⁴] ] )   (B),

where G ∼ N(0, 1) and d_W denotes the Wasserstein metric. We now inspect each term on the right-hand side. Under Assumption 4, we have

(A) ≤ C′ γ_n/(n η_n) × n^{1/2} η_n = γ_n/n^{1/2} → 0.
Similarly, for (B), we have

(B) ≤ c′ γ_n^{3/2}/(n η_n) × η_n² n^{1/2} = γ_n^{3/2} η_n / n^{1/2} → 0.

The proof completes.
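As a small numerical illustration (our own sketch, not part of the paper: the m-dependent construction and all constants below are hypothetical), the dependency-graph central limit theorem used in the proof can be seen at work on a toy sequence in which each term depends only on a bounded number of neighbors:

```python
import numpy as np

# Toy illustration of the dependency-graph CLT: each h_i below depends on
# three consecutive Gaussian innovations, so the dependency graph has bounded
# maximum degree, and the standardized sum is approximately N(0, 1).
rng = np.random.default_rng(0)
n, reps = 500, 2000
sums = np.empty(reps)
for r in range(reps):
    z = rng.standard_normal(n + 2)
    h = (z[:-2] + z[1:-1] + z[2:]) / np.sqrt(3.0)  # m-dependent sequence
    sums[r] = h.sum()

# Exact variance of sum(h) for this moving-average construction.
sigma2 = 3.0 * n - 8.0 / 3.0
standardized = sums / np.sqrt(sigma2)
```

Despite the local dependence, the standardized sums have mean near 0 and variance near 1, consistent with the vanishing Wasserstein bound above.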
Corollary.
Theorem 3.2 holds.

Proof.
First observe that since Ť = 1 and K ≥ 2, Assumption 5 is satisfied. Therefore, the result follows by Theorem C.2 and between-cluster independence over the first period t = 1 (Assumption 1).

Proof of Theorem 3.3.
Take

t_{jz} = √z X̄_j / √( (z − 1)^{-1} Σ_{i=1}^{z} (X_{ji} − X̄_j)² ),   X_{ji} ∼ N(0, σ²_{ji}),   X̄_j = z^{-1} Σ_{i=1}^{z} X_{ji}.

Recall that by Theorem 1 in Ibragimov and Müller (2010) and Bakirov and Székely (2006), for α ≤ 0.08 we have

sup_{σ²_1, ···, σ²_q} P( |t_{jq}| ≥ cv_α ) = P( |T_{q−1}| ≥ cv_α ),

where cv_α is the critical value of a t-test with level α, and T_{q−1} is a Student-t random variable with q − 1 degrees of freedom. We can write

P( T_n ≥ q | H_0 ) = P( max_{j ∈ {1,···,p̃}} |Q_{j,n}| ≥ q | H_0 ) = 1 − P( |Q_{j,n}| ≤ q ∀ j | H_0 ) = 1 − Π_{j=1}^{p̃} P( |Q_{j,n}| ≤ q | H_0 ),

where the last equality follows by between-cluster independence (Assumption 1). Observe now that by Theorem 3.2 and the fact that the rate of convergence is the same for all clusters (Assumption 4), for all j, for some (σ²_1, ···, σ²_z), z = K̃,

sup_q | P( |Q_{j,n}| ≤ q | H_0 ) − P( |t_{jK̃}| ≤ q ) | = o(1).

As a result, we can write

sup_{σ²_1, ···, σ²_K} lim_{n→∞} 1 − Π_{j=1}^{p̃} P( |Q_{j,n}| ≤ q | H_0 ) = 1 − Π_{j=1}^{p̃} inf_{σ²_{j1}, ···, σ²_{jK̃}} P( |t_{jK̃}| ≤ q ).

Using the result in Bakirov and Székely (2006), we have

inf_{σ²_{j1}, ···, σ²_{jK̃}} P( |t_{jK̃}| ≤ q ) = P( |T_{K̃−1}| ≤ q ).

Therefore,

1 − Π_{j=1}^{p̃} inf_{σ²_{j1}, ···, σ²_{jK̃}} P( |t_{jK̃}| ≤ q ) = 1 − P^{p̃}( |T_{K̃−1}| ≤ q ).

Setting the expression equal to α, we obtain

1 − P^{p̃}( |T_{K̃−1}| ≤ q ) = α  ⇒  P( |T_{K̃−1}| ≥ q ) = 1 − (1 − α)^{1/p̃}.

The proof completes after solving for q.

Corollary.
Theorem 5.2 holds.

Proof.
The proof follows directly as a corollary of Theorem 3.2 and the results on t-statistics in Ibragimov and Müller (2010).
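As a hedged illustration (the function name and the scipy dependency are ours, not from the paper), the critical value q solving the final display in the proof of Theorem 3.3, P(|T_{K̃−1}| ≥ q) = 1 − (1 − α)^{1/p̃}, can be computed as:

```python
from scipy.stats import t as student_t

def adjusted_critical_value(alpha, p_tilde, K_tilde):
    """Critical value q with P(|T_{K_tilde-1}| >= q) = 1 - (1 - alpha)**(1/p_tilde),
    the condition obtained at the end of the proof of Theorem 3.3."""
    # Per-coordinate level after the adjustment for p_tilde coordinates.
    alpha_adj = 1.0 - (1.0 - alpha) ** (1.0 / p_tilde)
    # Two-sided Student-t quantile with K_tilde - 1 degrees of freedom.
    return student_t.ppf(1.0 - alpha_adj / 2.0, df=K_tilde - 1)
```

For p̃ = 1 this reduces to the usual two-sided t critical value; for p̃ > 1 it is a Šidák-type adjustment, exact here because the Q_{j,n} are independent across clusters.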
Proof of Theorem 5.1.
We follow the same proof as Lemma B.5. Recall the expression of the estimator in Equation (26). The estimator depends on three components:

∆̂_{k,1}^{(j)}(β),   Ŝ_{k,1}^{(j)}(0, β),   B̂_{k,1}^{(j)}.   (43)

The expectation of the first component follows similarly to what is discussed in the proof of Lemma B.5, component (A) in Equation (35), since the fixed effects α_{k,t}(x) cancel out once
(Footnote: here we use continuity of the Gaussian distribution and the fact that p̃ is finite.)
we differentiate the treated and the control units. As a result, it suffices to study the component Ŝ_{k,t}^{(j)}(0, β) and the bias component B̂_{k,t}^{(j)}. We start from Ŝ_{k,1}^{(j)}(0, β). Using the same argument as in Lemma B.5, by Assumption 1 and Lemma B.4, we have

E[ Ŝ_{k,1}^{(j)}(0, β) ] = (1/(2n)) Σ_{i ∈ S_{k,t} ∪ S_{k+1,t}} E[ (v_{k(i)} (1 − e(X_i; β)) / η_n) × Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β + v_{k(i)} η_n e_j)) ]
= (1/(2η_n)) ∫ α_{t,k}(x) dx + (1/(2η_n)) ∫ Σ_{b ∈ {k,k+1}} m(0, x, β + v_b η_n e_j)(1 − e(x; β)) f_X(x) dx + O(J_n/η_n)   (I),

where the O(J_n/η_n) term follows from Assumption 3. Using Assumption 3 and a second-order Taylor expansion of m(0, x, ·) around β, we have

(I) = (1/(2η_n)) ∫ ( α_{1,k}(x) − α_{1,k+1}(x) ) dx + ∫ (∂m(0, x, β)/∂β^{(j)}) (1 − e(x; β)) f_X(x) dx + O( J_n/η_n + η_n ).

Consider now the bias component. Using Assumption 9 and the fact that spillovers do not occur on the treated, we have

E[ B̂_{k,t}^{(j)}(β) ] = (1/(2η_n)) ∫ ( α_{1,k}(x) − α_{1,k+1}(x) ) dx,

completing the proof.

D Regret guarantees under global strong concavity
In this section, we discuss theoretical guarantees of the algorithm, assuming global strong concavity of the objective function W(β).

Oracle gradient descent under concavity
We define

β*_w = Π_{[B̲, B̄]}[ β*_{w−1} + α_{w−1} V(β*_{w−1}) ],   β*_0 = ι,   (44)

with α_w = η/(w + 1), equal for all clusters. In the following lemmas and theorem, we consider the concave version of gradient descent. The following lemma follows by standard properties of the gradient descent algorithm (Bottou et al., 2018).

Lemma D.1.
For learning rate α_w = η/(w + 1) and β*_w as defined in Equation (44), under Assumption 3, for η ≤ 1/l, let L = max{ p(B̄ − B̲)², η² G² }, with G being the upper bound on the gradient and l > 0 a lower bound on the eigenvalues of the negative Hessian of W(β). Let W(β) be strongly concave. Then the following holds:

‖β*_w − β*‖² ≤ L/w

for a constant L < ∞.
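The oracle recursion in Equation (44) is projected gradient ascent with a decaying learning rate. As a purely illustrative sketch (the quadratic objective, target vector, bounds, and step size below are hypothetical choices, not the paper's problem):

```python
import numpy as np

def oracle_ascent(grad, beta0, eta, lower, upper, T):
    """Projected gradient ascent: beta_{w} = Proj[ beta_{w-1} + alpha_w * V(beta_{w-1}) ]."""
    beta = np.asarray(beta0, dtype=float)
    path = [beta.copy()]
    for w in range(T):
        alpha_w = eta / (w + 1)             # decaying learning rate, as in (44)
        beta = beta + alpha_w * grad(beta)  # ascent step on the gradient V
        beta = np.clip(beta, lower, upper)  # projection onto the box [B_l, B_u]^p
        path.append(beta.copy())
    return path

# Toy strongly concave objective W(beta) = -0.5 * ||beta - target||^2,
# with gradient V(beta) = target - beta; target and bounds are made up.
target = np.array([0.3, -0.2])
path = oracle_ascent(lambda b: target - b, beta0=np.zeros(2),
                     eta=0.5, lower=-1.0, upper=1.0, T=200)
```

For this toy problem the squared error ‖β*_w − β*‖² decays at roughly the 1/w rate claimed in Lemma D.1.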
Lemma D.2.
Let Assumptions 1, 2, 3, and 5 hold. Then, with probability at least 1 − δ,

‖ Π_{[B̲, B̄ − η_n]}[ Σ_{w=1}^{Ť} α_w V̌_{k,w} ] − Π_{[B̲, B̄]}[ Σ_{w=1}^{Ť} α_w V(β*_w) ] ‖_∞ = O( P_Ť(δ) ),

where P_1(δ) = α_1 err_1(δ) and P_w(δ) = B α_w P_{w−1}(δ) + P_{w−1}(δ) + α_w err_w(δ), for a finite constant B < ∞, and

err_w(δ) = O( √( γ_n log(p Ť K/δ) / (η_n n) ) + p η_n + J_n/η_n ).

Proof. Recall that by Lemma B.6 we can write, for every k and w,

V̌_{k,w}^{(j)} = V^{(j)}(β̌_{k+2}^w) + O( √( γ_n log(K Ť/δ) / (η_n n) ) + η_n + J_n/η_n ).

We now proceed by induction. We first prove the statement assuming that the constraint is never binding; we then discuss the case of the constrained solution. Define B = √p sup_β ‖ ∂²W(β)/∂β∂β′ ‖_∞.

Unconstrained case
Consider w = 1. Then, since all clusters start from the same starting point ι, we can write, with probability 1 − δ,

‖ α_1 V̌_{k,1} − α_1 V(β*_0) ‖_∞ = α_1 err_1(δ).

For w = 2, we obtain, for every j ∈ {1, ···, p},

α_2 V̌_{k,2}^{(j)} = α_2 V^{(j)}(β̌_{k+2}^1) + α_2 err_2(δ) = α_2 V^{(j)}( β*_0 + α_1 V(β*_0) + α_1 err_1(δ) ) + α_2 err_2(δ).

Using the mean value theorem and Assumption 3, for a finite universal constant B < ∞, we obtain

‖ α_2 V̌_{k,2}^{(j)} − α_2 V^{(j)}(β*_1) ‖_∞ ≤ α_2 err_2(δ) + B α_2 α_1 err_1(δ)
⇒ ‖ Σ_{w=1}^{2} α_w V̌_{k,w}^{(j)} − Σ_{w=1}^{2} α_w V^{(j)}(β*_{w−1}) ‖_∞ ≤ α_2 err_2(δ) + B α_2 α_1 err_1(δ) + α_1 err_1(δ).

Consider now a general w. Then we can write, with probability 1 − δ,

α_w V̌_{k,w} = α_w V(β̌_{k+2}^{w−1}) + α_w err_w(δ).

Let P_w = B α_w P_{w−1} + P_{w−1} + α_w err_w(δ), with P_1 = α_1 err_1(δ). Using the induction argument, we write

α_w V̌_{k,w} ≤ α_w V( β*_{w−1} + P_{w−1} ) + α_w err_w(δ).

Using the mean value theorem and Assumption 3, we obtain

α_w V̌_{k,w} ≤ α_w V(β*_{w−1}) + α_w B P_{w−1} + α_w err_w(δ).

Taking the sum, we obtain, with probability 1 − wδ (these events hold jointly by the union bound),

‖ Σ_{s=1}^{w} α_s V̌_{k,s} − Σ_{s=1}^{w} α_s V(β*_{s−1}) ‖_∞ ≤ α_w B P_{w−1} + P_{w−1} + α_w err_w(δ).

Constrained case
Since the statement is true for w = 1, we can assume that it is true for all s ≤ w − 1. Using the fact that B is a compact space, we can write

‖ Π_{[B̲, B̄ − η_n]}[ Σ_{s=1}^{w} α_s V̌_{k,s} ] − Π_{[B̲, B̄]}[ Σ_{s=1}^{w} α_s V(β*_{s−1}) ] ‖_∞
≤ ‖ Π_{[B̲, B̄ − η_n]}[ Σ_{s=1}^{w} α_s V̌_{k,s} ] − Π_{[B̲, B̄ − η_n]}[ Σ_{s=1}^{w} α_s V(β*_{s−1}) ] ‖_∞ + p η_n
≤ ‖ Σ_{s=1}^{w} α_s V̌_{k,s} − Σ_{s=1}^{w} α_s V(β*_{s−1}) ‖_∞ + p η_n,

completing the proof.

Theorem D.3.
Let the conditions in Theorem C.1 and Lemma D.1 hold. Choose α_w = η/w. Then, with probability at least 1 − δ,

‖ β* − β̌_k^{Ť+1} ‖ ≤ √(L/Ť) + √p Ť^B × O( √( γ_n log(Ť K/δ) / (η_n n) ) + p η_n + J_n/η_n ),

for a finite constant B < ∞.

Proof. Using the triangle inequality, we can write

‖ β* − β̌_k^{Ť+1} ‖ ≤ ‖ β* − β*_{Ť+1} ‖ + ‖ β̌_k^{Ť+1} − β*_{Ť+1} ‖.

The first component on the right-hand side is bounded by Lemma D.1. Using Lemma D.2, we bound the second component as follows:

‖ β̌_k^{Ť} − β*_{Ť} ‖ ≤ √p ‖ β̌_k^{Ť} − β*_{Ť} ‖_∞ = √p × O( P_Ť(δ) ).

We conclude the proof by explicitly deriving the rate of P_Ť(δ). We can simplify P_w to the expression

P_w = (1 + B/w) P_{w−1} + (1/w) err_n,

where err_n = O( √( γ_n log(p Ť K/δ) / (η_n n) ) + p η_n + J_n/η_n ). Using a recursive argument, we obtain

P_w = err_n Σ_{s=1}^{w} α_s Π_{j=s}^{w} ( B/j + 1 ).
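As a quick numerical sanity check of this recursion (our own illustration; the constants B and err_n below are arbitrary), P_w = (1 + B/w) P_{w−1} + err_n/w grows no faster than a constant times err_n · w^B:

```python
import numpy as np

def recursion_path(B, err_n, W):
    """Iterate P_w = (1 + B/w) * P_{w-1} + err_n / w with P_0 = 0."""
    P, path = 0.0, []
    for w in range(1, W + 1):
        P = (1.0 + B / w) * P + err_n / w
        path.append(P)
    return np.array(path)

B, err_n, W = 1.5, 0.01, 2000          # illustrative constants
path = recursion_path(B, err_n, W)
ratios = path / (err_n * np.arange(1, W + 1) ** B)  # should stay bounded
```

The ratio P_w / (err_n · w^B) stabilizes rather than diverging, consistent with the bound derived next.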
We now write

Σ_{s=1}^{w} α_s Π_{j=s}^{w} ( B/j + 1 ) ≲ Σ_{s=1}^{w} (1/s^{B+1}) e^{B log(w)} ≲ w^B,

completing the proof.

E Further mathematical details
E.1 Gradient estimator for non-stationary policies
For expositional simplicity, we only consider a triad {k, k+1, k+2}. For notational convenience, we define β_{k,t} as the policy assigned to cluster k at time t according to the randomization in Equation (31). Define

∆(x, β, φ) = m(1, x, β, φ) − m(0, x, β, φ).

Observe that we can write

∂Γ(β, φ)/∂β = ∫ { (∂e(x; β)/∂β) ∆(x, β, φ) + ∂m(0, x, β, φ)/∂β + e(x; β) ∂∆(x, β, φ)/∂β + c(x) ∂e(x; β)/∂β } dF_X(x),

∂Γ(β, φ)/∂φ = ∫ { e(x; β) ∂∆(x, β, φ)/∂φ + ∂m(0, x, β, φ)/∂φ } dF_X(x).

We now discuss the estimation of each component. We take

∆̂_{k,t}(β) = (1/(3n)) Σ_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k+2,t}} (∂e(X_i; β)/∂β) [ Y_{i,t} D_{i,t} / e(X_i; β_{k(i),t}) − Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β_{k(i),t})) ].

The above estimator is centered around the target estimand up to a factor of order O(η_n + J_n), as discussed in Section 3. We now discuss the estimation of the marginal effects. Define

u_{h,t} = −1 if h = k; 1 if h = k+1 and t is odd, or h = k+2 and t is even; 0 otherwise.

Intuitively, the above indicator equals one whenever the cluster is the one in the triad assigned a perturbation in the current period, and minus one for the reference cluster k. The estimator of the marginal spillover effect in the current period is constructed by taking

Ŝ_{k,t}(β) = (1/n) Σ_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k+2,t}} (u_{k(i),t}/η_n) e(X_i; β) [ Y_{i,t} D_{i,t} / e(X_i; β_{k(i),t}) − Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β_{k(i),t})) ] + (u_{k(i),t}/η_n) Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β_{k(i),t})),

where β_{k,t} denotes the (perturbed) parameter assigned to cluster k at time t. Its justification follows similarly to what is discussed in Section 3, with the difference that here the cluster under perturbation is one of the three clusters, which alternates every other period t.
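The alternating perturbation scheme within a triad can be sketched as follows (a hypothetical reading of the indicators used in this subsection, under the assumption that cluster k is the reference with weight −1 while clusters k+1 and k+2 alternate as the perturbed cluster; the function names are ours):

```python
def u_indicator(h, k, t):
    """Current-period weight: -1 for the reference cluster k, +1 for the
    cluster in the triad perturbed at period t, 0 otherwise."""
    if h == k:
        return -1
    if (h == k + 1 and t % 2 == 1) or (h == k + 2 and t % 2 == 0):
        return 1
    return 0

def p_indicator(h, k, t):
    """Previous-period weight: +1 for the cluster that was perturbed at t - 1."""
    if h == k:
        return -1
    if (h == k + 1 and t % 2 == 0) or (h == k + 2 and t % 2 == 1):
        return 1
    return 0
```

Under this reading, at every period exactly one cluster in the triad receives weight +1 from u, and p flags the cluster that received weight +1 in the previous period.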
We can estimate the marginal effect of coordinate (j) in the current period by taking

∆̂_{k,s}^{(j)}(θ̌_k^w) + Ŝ_{k,s}^{(j)}(θ̌_k^w).

We now discuss estimating the marginal effect in the previous period. Define

p_{h,t} = −1 if h = k; 1 if h = k+1 and t is even, or h = k+2 and t is odd; 0 otherwise.

The above indicator equals one for the cluster that was subject to perturbation in the previous period. We can now use the same rationale as before and estimate the effect in the previous period as

Û_{k,t}(β) = (1/n) Σ_{i ∈ S_{k,t} ∪ S_{k+1,t} ∪ S_{k+2,t}} (p_{k(i),t}/η_n) e(X_i; β) [ Y_{i,t} D_{i,t} / e(X_i; β_{k(i),t}) − Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β_{k(i),t})) ] + (p_{k(i),t}/η_n) Y_{i,t}(1 − D_{i,t}) / (1 − e(X_i; β_{k(i),t})).

The final estimator of the marginal effect, F̌_{k,w}^{(j)}, weights each component (the marginal effects from the previous and current periods) over periods s ∈ {1, ···, T*} by the functions f_{θ,t}(ι) over the path, and reads as follows:

F̌_{k,w}^{(j)} = Σ_{s=1}^{T*} f_{θ̌_k^w, s}(ι) [ ∆̂_{k,s}^{(j)}(θ̌_k^w) + Ŝ_{k,s}^{(j)}(θ̌_k^w) ] + f_{θ̌_k^w, s−1}(ι) Û_{k,s}^{(j)}(θ̌_k^w),

with f_·(·) as defined in Equation (30).

E.2 Proof of Lemma D.1

Proof.
We follow a standard argument for gradient descent. Denote by β* the estimand of interest, and recall the definition of β*_t in Equation (44). We define ∇_t as the gradient evaluated at β*_{t−1}. From strong concavity, we can write

τ(β*) − τ(β*_t) ≤ (∂τ(β*_t)/∂β)′(β* − β*_t) − (l/2) ‖β* − β*_t‖²,
τ(β*_t) − τ(β*) ≤ (∂τ(β*)/∂β)′(β*_t − β*) − (l/2) ‖β* − β*_t‖².

As a result, summing the two inequalities and using ∂τ(β*)/∂β = 0, we have

(∂τ(β*_t)/∂β)′(β* − β*_t) ≥ l ‖β*_t − β*‖².   (45)

In addition, we can write

‖β*_{t+1} − β*‖² = ‖ β* − Π_{[B̲, B̄]}(β*_t + α_t ∇_t) ‖² ≤ ‖ β* − β*_t − α_t ∇_t ‖²,

where the last inequality follows from the Pythagorean theorem (the projection onto a convex set is non-expansive). Observe that we have

‖β*_{t+1} − β*‖² ≤ ‖β* − β*_t‖² − 2 α_t ∇_t′(β* − β*_t) + α_t² ‖∇_t‖².

Using Equation (45), we can write

‖β*_{t+1} − β*‖² ≤ (1 − 2 l α_t) ‖β*_t − β*‖² + α_t² G².

We now prove the statement by induction. Clearly, at time t = 0 the statement trivially holds. Consider a general time t. Then, using the induction argument, we write

‖β*_{t+1} − β*‖² ≤ (1 − 2/(t+1)) L/t + L/(t+1)² ≤ (1 − 2/(t+1)) L/t + L/(t(t+1)) = (1 − 1/(t+1)) L/t = L/(t+1).