Assessing Time-Varying Causal Effect Moderation in the Presence of Cluster-Level Treatment Effect Heterogeneity

Abstract

The micro-randomized trial (MRT) is a sequential randomized experimental design for empirically evaluating the effectiveness of mobile health (mHealth) intervention components that may be delivered at hundreds or thousands of decision points. The MRT context has motivated a new class of causal estimands, termed "causal excursion effects", for which inference can be made by a weighted, centered least squares approach (Boruvka et al., 2017). Existing methods assume between-subject independence and non-interference. Deviations from these assumptions often occur and, if unaccounted for, may result in bias and overconfident variance estimates. In this paper, causal excursion effects are considered under potential cluster-level correlation and interference, and when the treatment effect of interest depends on cluster-level moderators. The utility of the proposed methods is shown by analyzing data from a multi-institution cohort of first-year medical residents in the United States. The approach paves the way for the construction of mHealth interventions that account for observed social network information.


Jieru Shi, Zhenke Wu, Walter Dempsey

February 3, 2021

Keywords: Causal Inference; Clustered Data; Just-In-Time Adaptive Interventions; Micro-randomized Trials; Mobile Health; Moderation Effect

In many areas of behavioral science, push notifications in mobile apps that are adapted to continuously collected information on an individual's current context have received a considerable amount of attention. These time-varying adaptive interventions are hypothesized to lead to meaningful short- and long-term behavior change. The assessment of the time-varying effect of such push notifications motivated sequential randomized designs such as the micro-randomized trial (MRT) [Nahum-Shani et al., 2017, Klasnja et al., 2015]. In an MRT, interventions are randomized to be potentially delivered over a period of time, at hundreds or thousands of decision points. The MRT design enables the estimation of the time-delayed marginal treatment effects of a push notification upon the outcome of interest at multiple time-lags, referred to as "causal excursion effects" [Boruvka et al., 2017, Qian et al., 2020]. A weighted, centered least squares (WCLS) approach has been developed to consistently estimate causal excursion effects with robust methods of uncertainty assessment. Extensions to balance the number of randomizations by observed or derived individual-level contextual information have also been developed, e.g., a stratified MRT [Dempsey et al., 2020].

Existing inferential methods for data collected in an MRT rely on two key assumptions. First, an intervention delivered to an individual is assumed to impact only the outcome of that individual, i.e., non-interference. Second, the methods assume no stochastic dependence among outcomes of different subjects. Deviations from these assumptions, however, may occur when individuals naturally form clusters. Consider, for example, an MRT in which medical interns are recruited from several hospitals and every day potentially receive push notifications designed to improve physical activity, mood, or sleep. First, interference may occur when an intern who receives a push notification to increase their steps then encourages another intern in the same specialty and institution to do the same, hence altering the other's outcomes. Second, even if there is no direct interference, interns in the same specialty and institution are likely to have correlated outcomes, induced by variation in unobserved factors shared within each cluster. For example, interns within a specialty and institution may share similar experiences during their residency, which will lead to a higher correlation in their weekly mood scores than among interns in different specialties or institutions.

The primary contribution of this paper is the significant extension of data analytic methods for MRTs to account for potential cluster-level correlation and the extension of the causal excursion effect to account for potential interference, i.e., we define both a direct and an indirect causal excursion effect. In Section 2, we review the MRT design and the existing method's implicit assumptions. In Section 3, we introduce the moderated causal excursion effect using a group-based conceptualization of potential outcomes [Hong and Raudenbush, 2006, Vanderweele et al., 2013]. We then generalize the WCLS criterion to a cluster-based WCLS (C-WCLS) criterion to estimate both direct and indirect causal excursion effects in Section 4. Finally, we demonstrate that when proximal outcomes exhibit within-cluster correlation but there is an absence of between-cluster treatment effect heterogeneity, our proposal agrees with the traditional approach both in terms of the point estimate and the asymptotic variance. In Section 6, we illustrate the proposed method by estimating the causal excursion effects of push notifications on mood, sleep, and step count for a cohort of first-year medical residents in the United States. The paper concludes with a brief discussion.

We first review the design of MRTs and the existing inferential methods based on MRT data. We focus on the assumptions made in the inferential methods to set the stage for our proposed extension.

2.1 Micro-Randomized Trials (MRT)

An MRT consists of a sequence of within-subject decision times $t = 1, \ldots, T$ at which treatment options may be randomly assigned. The resulting longitudinal data for any subject can be summarized as

$$\{O_0, O_1, A_1, O_2, A_2, \ldots, O_T, A_T, O_{T+1}\}$$

where $t$ indexes a sequence of decision points, $O_0$ is the baseline information, $O_t$ is the information collected between time $t-1$ and $t$, which may include mobile sensor and wearable data, and $A_t$ is the treatment option provided at time $t$. In an MRT, $A_t$ results from randomization at each decision point; here, we assume this randomization probability depends on the complete observed history $H_t := \{O_0, O_1, A_1, \ldots, A_{t-1}, O_t\}$ and is denoted $p_t(A_t \mid H_t)$. Treatment options are designed to impact a proximal response $Y_{t+1}$, which is a known function of $\{O_t, A_t, O_{t+1}\}$. In this section, we use HeartSteps [Klasnja et al., 2018], an MRT designed to increase physical activity among sedentary adults, as our illustrative example. In HeartSteps, $A_t = 1$ denotes sending a tailored activity message designed to increase proximal physical activity, and therefore $Y_{t+1}$ is chosen to be the total step count in the 30 minutes after the decision time.

Remark 2.1. At some decision points, it is inappropriate for scientific, ethical, or burden reasons to provide treatment. To address this, the observation vector $O_t$ also contains an indicator of availability: $I_t = 1$ if available for treatment and $I_t = 0$ otherwise. Availability at time $t$ is determined before treatment randomization. When an individual is unavailable for treatment ($I_t = 0$), no treatment is provided ($A_t = 0$). In this paper, for simplicity, individuals are assumed to be always available for treatment. All methods can be readily extended to account for availability as in Boruvka et al. [2017].

2.2 Estimand and Inferential Method

Following Boruvka et al. [2017], Qian et al. [2020], and Dempsey et al. [2020], we focus on the class of estimands referred to as "causal excursion effects", defined by

$$\beta(t; s) = E\big[\, E[Y_{t+1} \mid H_t, A_t = 1] - E[Y_{t+1} \mid H_t, A_t = 0] \,\big|\, S_t = s \,\big]$$

where $S_t$ is a time-varying potential effect moderator (or "state") and a deterministic function of the observed history $H_t$. For example, in the HeartSteps study, $S_t$ may represent subject-specific contextual information such as step count in the previous 30 minutes. See Section 3.2 for how this estimand is derived using potential outcomes. Under the assumption that $\beta(t; s) = f_t(s)^\top \beta^\star$, where $f_t(s) \in \mathbb{R}^q$ is a feature vector depending only on state $s$ and decision point $t$, a consistent estimator $\hat\beta$ can be obtained by minimizing the following weighted and centered least squares (WCLS) criterion:

$$\arg\min_{\alpha, \beta} \; P_n \left[ \sum_{t=1}^{T} W_t \big( Y_{t+1} - g_t(H_t)^\top \alpha - (A_t - \tilde p_t(1 \mid S_t)) f_t(S_t)^\top \beta \big)^2 \right] \tag{1}$$

where $P_n$ is defined as the average of a function over the sample, $W_t = \tilde p_t(A_t \mid S_t) / p_t(A_t \mid H_t)$ is a weight whose numerator is an arbitrary function with range $(0, 1)$ that depends only on $S_t$ and whose denominator is the MRT-specified randomization probability, and $g_t(H_t) \in \mathbb{R}^p$ is a set of control variables chosen to help reduce the variance of the estimators and to construct more powerful test statistics. In HeartSteps, for example, a natural control variable is the number of steps in the prior 30 minutes, as this variable is likely highly correlated with the proximal response and thus can be used to reduce variance in the estimation of $\beta$. See Boruvka et al. [2017] for additional details on how to define and estimate moderated time-varying treatment effects using data from an MRT.

3 Cluster-Level Proximal Treatment Effects

MRTs are implicitly defined with the individual as the unit of analysis. The associated analytics rely on statistical independence among individuals in deriving asymptotic normality; moreover, the treatment effects assume that treatments provided to one individual do not impact another individual. In Section 5, we show that the coverage probability of the nominal 95% confidence interval obtained from these methods decays quickly when the MRT is conducted among individuals with cluster-level treatment effect heterogeneity. Moreover, in such a setting, an interesting scientific question concerns moderation analysis where the moderator is defined at the cluster level, e.g., average values of a moderator such as mood or step count across a cluster. Here we consider a cluster-based definition of the causal excursion effect and corresponding analyses to address these issues. We start with our motivating example, which demonstrates the necessity of considering clustering in the context of MRTs.

The Intern Health Study (IHS) is a 6-month micro-randomized trial on 1,565 medical interns [NeCamp et al., 2020]. Due to high depression rates and levels of stress during the first year of physician residency training, a critical question is whether targeted notifications can be used at the right time to improve mood, increase sleep time, and/or increase physical activity. Enrolled medical interns were randomized weekly to receive either mood, activity, or sleep notifications or to receive no notifications for that week (probability 1/4 each). In a week in which notifications were provided, interns were randomized daily to receive a notification with probability 50%. Hence, in a notification week, the user received, on average, 3.5 notifications. The exploratory and MRT analyses conducted in this paper focus on weekly randomization; see NeCamp et al. [2020] for further study details.

Medical internships occur in a specific location and within a specific specialty. The 1,565 interns in the MRT represented 321 different residency locations and 42 specialties. Thus, interns are naturally clustered. Table 1 presents a standard equally-weighted analysis of variance (ANOVA) applied to the average weekly mood scores with three factors: week-in-study, and residency and specialty, each interacted with weekly treatment (Txt). Compared with the residual mean square of 1.27, there is considerable excess variation associated with week-in-study (28.44), residency by treatment (5.89), and specialty by treatment (8.41). In other words, the ANOVA suggests that proximal responses exhibit potential within-cluster correlation within residency locations and specialties that varies with treatment. In this paper, we focus on defining clusters using specialty and institution information and do not discuss multi-level clusters.

| Source | d.f. | Sum Sq. | Mean Sq. |
|---|---|---|---|
| week-in-study | 25 | 711 | 28.44 |
| Residency:Txt | 1091 | 6424 | 5.89 |
| Specialty:Txt | 48 | 404 | 8.41 |
| Residuals | 38946 | 49479 | 1.27 |

Table 1: ANOVA decomposition for average weekly mood scores in the IHS.

Three immediate questions may arise:

(1) Does the standard MRT analysis still have good statistical properties, e.g., consistency and valid coverage of nominal 95% confidence intervals, when participants exhibit within-cluster correlation?

(2) Can one assess potential cluster-level effect moderators using the standard MRT analysis? and,

(3) Can one extend the MRT analysis to assess the effect of treatment provided to one individual on the proximal response of another individual in the same cluster?

For (1), through a simulation study in Section 5, we demonstrate a potential deficiency of the standard WCLS method that results in severe under-coverage of nominal 95% confidence intervals. In particular, we use simulation settings that mimic realistic cluster-level treatment heterogeneity. For (2), in IHS, for example, a scientific question may be how group average mood impacts the proximal effect of notifications on average daily mood. Since each individual's mood may be impacted by their own treatments, if the proximal effect is moderated by cluster-level summary variables then this implies the proximal response is a function of past treatment assignment to all individuals in the group. Finally, for (3), in Section 3.2 we extend the causal excursion effect to distinguish between direct and indirect causal excursion effects.

In this paper, we extend the WCLS method [Boruvka et al., 2017] to account for clustered continuous outcomes and potential within-cluster interference of other subjects' treatments upon one subject's outcome, offering a general framework for answering the three questions above. Here we study micro-randomized trials in which there is a pre-defined cluster-level structure. The cluster structure may result in (A) within-cluster correlation among proximal responses (question 1), (B) treatment impacting individual-level moderators which then impact proximal responses of other individuals (question 2), or (C) treatment interference, i.e., treatment provided to one individual may impact the proximal response of another individual in the same cluster (question 3). We will demonstrate how our proposed method addresses these three issues.
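As a quick arithmetic check on the quantities quoted above, the following minimal sketch recomputes each mean square in Table 1 as the sum of squares divided by its degrees of freedom, and the expected number of notifications in a notification week (seven daily randomizations at probability 0.5). The numbers are copied from the text; the variable names are illustrative only.

```python
# Degrees of freedom and sums of squares copied from Table 1.
anova = {
    "week-in-study": (25, 711),
    "Residency:Txt": (1091, 6424),
    "Specialty:Txt": (48, 404),
    "Residuals": (38946, 49479),
}

# Mean square = sum of squares / degrees of freedom.
mean_sq = {source: ss / df for source, (df, ss) in anova.items()}

# Ratio of each mean square to the residual mean square ("excess variation").
ratio_to_residual = {s: ms / mean_sq["Residuals"] for s, ms in mean_sq.items()}

# Expected notifications in a notification week: 7 days, probability 0.5 per day.
expected_notifications = 7 * 0.5

for source, ms in mean_sq.items():
    print(f"{source:15s} mean sq = {ms:6.2f}  ratio = {ratio_to_residual[source]:5.1f}")
print("expected notifications per notification week:", expected_notifications)
```

The recomputed mean squares agree with the tabulated values up to rounding of the reported sums of squares.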

The primary question of interest is whether the treatment has a proximal effect. Here, potential outcomes are used to define the proximal effects. Due to potential treatment interference in the cluster MRT setting, we consider two types of moderated effects: a direct and an indirect effect, both on the linear scale.

Consider a cluster of size $G$. The overbar is used to denote a sequence through a specified treatment occasion; $\bar a_{t,j} = (a_{1,j}, \ldots, a_{t,j})$, for instance, denotes the sequence of realized treatments up to and including decision time $t$ for individual $j \in [G] := \{1, \ldots, G\}$ in the cluster. Let $\bar a_t = (\bar a_{t,1}, \ldots, \bar a_{t,G})$ denote the set of realized treatments up to and including decision time $t$ for all individuals in the cluster, and let $\bar a_{t,-j} = \bar a_t \setminus \bar a_{t,j}$ denote this set with the $j$th individual's realized treatments removed. Let $Y_{t+1,j}(\bar a_t)$ denote the potential outcome for individual $j \in [G]$ in the cluster. Note that the potential outcome for individual $j$ may depend on the realized treatments for all members in the cluster.

Direct causal effects. In standard MRTs, the individual is the unit of interest. In our current setting, the cluster is the unit of interest. At the cluster level, our interest lies in the effect of providing treatment versus not providing treatment at time $t$ on a random individual in the group. This can be expressed as a difference in potential outcomes for the proximal response and is given by

$$\frac{1}{G} \sum_{j=1}^{G} \Big[ Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a_{t,j} = 1)\big) - Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a_{t,j} = 0)\big) \Big]. \tag{2}$$

Equation (2) is a direct causal effect of treatment compared to no treatment on the proximal response for the same individual, while all previous treatments on the individual and all treatments on every other member of the cluster are fixed to specific values.

The "fundamental problem of causal inference" [Rubin, 1978, Pearl, 2009] is that these individual differences cannot be observed. Thus, similar to prior work [Dempsey et al., 2020, Boruvka et al., 2017], in this paper averages of potential outcomes are considered in defining treatment effects. Let $S_t(\bar a_{t-1})$ denote a vector of potential moderator variables chosen from $H_t(\bar a_{t-1})$, the cluster-level history up to decision point $t$. We write $S_t(\bar a_{t-1}) = (S_{t,j}(\bar a_{t-1}), S_{t,-j}(\bar a_{t-1}))$ to clarify that the potential moderator variables can contain information on the selected individual as well as on other individuals in the cluster. The moderated direct treatment effect, denoted $\beta(t; s)$, can then be defined as

$$E\Big[ Y_{t+1,J}\big(\bar A_{t,-J}, (\bar A_{t-1,J}, 1)\big) - Y_{t+1,J}\big(\bar A_{t,-J}, (\bar A_{t-1,J}, 0)\big) \,\Big|\, \big(S_{t,J}(\bar A_{t-1}), S_{t,-J}(\bar A_{t-1})\big) = (s_J, s_{-J}) \Big], \tag{3}$$

where $J$ is a uniformly distributed random index defined on $[G]$. The expectation is over the potential outcomes ($Y_{t+1,J}(\cdot)$), the set of randomized treatments ($\bar A_{t,-J}$ and $\bar A_{t-1,J}$), and the random index ($J$). Different choices of variables in $S_t(\bar A_{t-1})$ address a variety of scientific questions. A primary analysis, for instance, may focus on the marginal proximal effects and therefore set $S_t(\bar A_{t-1}) = \emptyset$. A second analysis may focus on assessing the effect conditional on variables only related to the individual indexed by $J$ and set $S_t(\bar A_{t-1}) = X_{t,J}(\bar A_{t-1,J})$, where $X_{t,J}(\bar A_{t-1,J}) \subset O_{t,J}(\bar A_{t-1,J})$ is a potential moderator of interest. A third analysis may consider moderation by the average value across all individuals, $S_t(\bar A_{t-1}) = G^{-1} \sum_j X_{t,j}(\bar A_{t-1,j})$, or the joint moderation of the individual's mood and the average of the other individuals' moods, $S_t(\bar A_{t-1}) = \big( X_{t,j}(\bar A_{t-1,j}),\; G^{-1} \sum_{j' \neq j} X_{t,j'}(\bar A_{t-1,j'}) \big)$.

Indirect causal effects. Of secondary interest is the effect of providing treatment versus not providing treatment to one individual at time $t$ on a different individual's proximal response when the latter does not receive treatment, i.e., within-cluster treatment interference. This indirect causal effect is given by

$$\frac{1}{G (G-1)} \sum_{j' \neq j} \Big[ Y_{t+1,j}\big(\bar a_{t,-\{j,j'\}}, (\bar a_{t-1,j}, a_{t,j} = 0), (\bar a_{t-1,j'}, a_{t,j'} = 1)\big) - Y_{t+1,j}\big(\bar a_{t,-\{j,j'\}}, (\bar a_{t-1,j}, a_{t,j} = 0), (\bar a_{t-1,j'}, a_{t,j'} = 0)\big) \Big].$$

Again, since the individual differences cannot be observed, we consider averages of potential outcomes in defining indirect treatment effects. The moderated indirect treatment effect, denoted $\beta^{(IE)}(t; s)$, can be defined by

$$E\Big[ Y_{t+1,J}\big(\bar A_{t,-\{J,J'\}}, (\bar A_{t-1,J}, 0), (\bar A_{t-1,J'}, 1)\big) - Y_{t+1,J}\big(\bar A_{t,-\{J,J'\}}, (\bar A_{t-1,J}, 0), (\bar A_{t-1,J'}, 0)\big) \,\Big|\, S_t(\bar A_{t-1}) \Big], \tag{4}$$

where $J'$ is a uniformly distributed random index on the set $[G] \setminus \{J\}$. The expectation is again over the potential outcomes ($Y_{t+1,J}(\cdot)$), the set of randomized treatments ($\bar A_{t,-\{J,J'\}}$, $\bar A_{t-1,J}$, and $\bar A_{t-1,J'}$), and the random indices ($J$ and $J'$). Here, the potential moderator can be written as $S_t(\bar A_{t-1}) = \big( S_{t,J}(\bar A_{t-1}), S_{t,J'}(\bar A_{t-1}), S_{t,-\{J,J'\}}(\bar A_{t-1}) \big)$ to clarify that the variables can contain information on the two selected individuals as well as on others in the cluster. Note that a similar effect can be defined when the individual does receive treatment. For now, we focus on the effect defined by (4).

It is useful to distinguish among the above estimands, the traditional MRT estimand [Boruvka et al., 2017], and estimands commonly studied in longitudinal treatment effect estimation [Robins, 1986]. In the causal inference literature, the typical estimand is an expected outcome for a particular sequence of treatments, i.e., $E[Y(a_1, \ldots, a_T)]$. Such estimands do not depend on the treatment distribution, but are often not of primary interest in the current MRT setting since many sequences of treatment may never be observed in finite samples given the large number of decision points (often hundreds or thousands).

The estimands considered in this paper are most similar to average outcomes under a particular dynamic treatment regime, $E_\pi[Y(A_1, \ldots, A_T)]$, where $\pi$ denotes the dynamic treatment regime from which the treatments are drawn [Murphy et al., 2001]. Indeed, for any $A_{u,j}$ not contained in $S_t(\bar A_{t-1})$, the direct and indirect effects depend on the distribution of { A u,j } u

Assumption 3.1.

We assume consistency, positivity, and sequential ignorability:

• Consistency: For each $t \le T$ and $j \in [G]$, $\{Y_{t+1,j}(\bar A_t), O_{t,j}(\bar A_{t-1}), A_{t,j}(\bar A_{t-1})\} = \{Y_{t+1,j}, O_{t,j}, A_{t,j}\}$. That is, the observed values equal the corresponding potential outcomes;

• Positivity: if the joint density $\{H_t = h, A_t = a\}$ is greater than zero, then $P(A_t = a_t \mid H_t = h_t) > 0$;

• Sequential ignorability: for each $t \le T$, the potential outcomes $\{Y_{2,j}(a_1), O_{2,j}(a_1), A_{2,j}(a_1), \ldots, Y_{T,j}(\bar a_{T-1})\}_{j \in [G],\, \bar a_{T-1} \in \{0,1\}^{(T-1) \times G}}$ are independent of $A_{t,j}$ conditional on the observed history $H_t$.

Sequential ignorability and, assuming all of the randomization probabilities are bounded away from 0 and 1, positivity are guaranteed for a cluster MRT by design. In a standard MRT, the randomization probabilities may depend on the individual's observed history, so $\Pr(A_t = a_t \mid H_t = h_t) = \prod_{j=1}^{G} \Pr(A_{t,j} = a_{t,j} \mid H_{t,j} = h_{t,j})$ and the positivity constraint can be placed on the individual-level randomization probabilities. Here, we allow for the possibility that the randomization probabilities depend on the joint cluster-level history. Consistency is a necessary assumption for linking the potential outcomes as defined here to the data. Since an individual's outcomes may be influenced by the treatments provided to other individuals in the same cluster, consistency holds due to our use of a group-based conceptualization of potential outcomes as seen in Hong and Raudenbush [2006] and Vanderweele et al. [2013].

Lemma 3.2. Under Assumption 3.1, the moderated direct treatment effect $\beta(t; s)$ is equal to

$$E\left[ \frac{1}{G} \sum_{j=1}^{G} \Big( E[Y_{t+1,j} \mid H_t, A_{t,j} = 1] - E[Y_{t+1,j} \mid H_t, A_{t,j} = 0] \Big) \,\Big|\, S_t \right]$$

where each expectation is with respect to the distribution of the data collected using the treatment assignment probabilities. Under Assumption 3.1, the moderated indirect treatment effect $\beta^{(IE)}(t; s)$ is equal to

$$E\left[ \frac{1}{G (G-1)} \sum_{j \neq j'} \Big( E[Y_{t+1,j} \mid H_t, A_{t,j} = 0, A_{t,j'} = 1] - E[Y_{t+1,j} \mid H_t, A_{t,j} = 0, A_{t,j'} = 0] \Big) \,\Big|\, S_t \right].$$

The proof of Lemma 3.2 can be found in Appendix A.
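To make the identification result in Lemma 3.2 concrete, the sketch below simulates a deliberately simplified setting: a single decision point, no history, and independent randomization with a constant probability, so that the conditional expectations in the lemma reduce to simple subgroup means. The outcome model, the effect sizes `DIRECT` and `INDIRECT`, and all other names are hypothetical choices made for illustration; they are not taken from the paper. With history-dependent randomization, the weighting described in Section 4 would be needed instead of raw means.

```python
import numpy as np

rng = np.random.default_rng(2021)

M, G, p = 20000, 3, 0.5          # clusters, cluster size, randomization probability
DIRECT, INDIRECT = 1.0, 0.5      # hypothetical true effects for this toy example

A = rng.binomial(1, p, size=(M, G))                 # independent treatment assignments
others_treated = A.sum(axis=1, keepdims=True) - A   # number of treated cluster-mates
cluster_effect = rng.normal(0.0, 1.0, size=(M, 1))  # shared within-cluster intercept
Y = (cluster_effect + DIRECT * A + INDIRECT * others_treated
     + rng.normal(0.0, 1.0, size=(M, G)))

# Direct effect: E[Y_j | A_j = 1] - E[Y_j | A_j = 0], pooled over individuals j.
direct_hat = Y[A == 1].mean() - Y[A == 0].mean()

# Indirect effect: over ordered pairs (j, k), compare Y_j when A_j = 0 and
# cluster-mate k is treated versus untreated.
y_mate_treated, y_mate_untreated = [], []
for j in range(G):
    for k in range(G):
        if j == k:
            continue
        own_untreated = A[:, j] == 0
        y_mate_treated.append(Y[own_untreated & (A[:, k] == 1), j])
        y_mate_untreated.append(Y[own_untreated & (A[:, k] == 0), j])
indirect_hat = (np.concatenate(y_mate_treated).mean()
                - np.concatenate(y_mate_untreated).mean())

print(f"direct estimate:   {direct_hat:.3f} (truth {DIRECT})")
print(f"indirect estimate: {indirect_hat:.3f} (truth {INDIRECT})")
```

Because the treatments of the remaining cluster members are averaged over their randomization distribution, both contrasts are marginal over the other assignments, which mirrors how the estimands are defined.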

In this section, we consider estimation of the two estimands defined in Lemma 3.2 using cluster MRT data.

We make the following assumptions regarding the direct treatment effect specification:

Assumption 4.1.

Assume the treatment effect of interest, $\beta(t; S_t)$, satisfies

$$E\big[\, E[Y_{t+1,J} \mid H_t, A_{t,J} = 1] - E[Y_{t+1,J} \mid H_t, A_{t,J} = 0] \,\big|\, S_t \,\big] = f_t(S_t)^\top \beta^\star$$

where $f_t(s) \in \mathbb{R}^p$ is a $p$-dimensional vector function of $s$ and time $t$.

We now consider inference on the unknown $p$-dimensional parameter $\beta^\star$. To do so, we define the weight $W_{t,j}$ at decision time $t$ for the $j$th individual as $W_{t,j} = \tilde p_t(A_{t,j} \mid S_t) / p_t(A_{t,j} \mid H_t)$, where $\tilde p_t(a \mid S_t) \in (0, 1)$ is arbitrary as long as it does not depend on terms in $H_t$ other than $S_t$, and $p_t(A_{t,j} \mid H_t)$ is the marginal probability that individual $j$ receives treatment $A_{t,j}$ given $H_t$. For now, we consider a pre-specified $\tilde p_t(a \mid S_t)$ (i.e., one that does not depend on the observed MRT data). Here we consider an estimator $\hat\beta$ of $\beta^\star$ which is the minimizer of the following cluster-based, weighted-centered least-squares (C-WCLS) criterion, minimized over $(\alpha, \beta)$:

$$P_M \left[ \frac{1}{G_m} \sum_{j=1}^{G_m} \sum_{t=1}^{T} W_{t,j} \big( Y_{t+1,j} - g_t(H_t)^\top \alpha - (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)^\top \beta \big)^2 \right] \tag{5}$$

where $P_M$ is defined as the average of a function over the sample, which in this context is the sample of clusters rather than the sample of individuals as in traditional MRT settings. In Appendix A, we prove the following result.

Lemma 4.2. Under Assumption 4.1, and given invertibility and moment conditions, the estimator $\hat\beta$ that minimizes (5) satisfies $\sqrt{M} (\hat\beta - \beta^\star) \to N(0, Q^{-1} W Q^{-1})$, where

$$Q = E\left[ \sum_{t=1}^{T} \tilde p_t(1 \mid S_t) (1 - \tilde p_t(1 \mid S_t)) f_t(S_t) f_t(S_t)^\top \right]$$

and

$$W = E\left[ \sum_{t=1}^{T} W_{t,J}\, \epsilon_{t,J}\, (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,\tilde J}\, \epsilon_{t,\tilde J}\, (A_{t,\tilde J} - \tilde p_t(1 \mid S_t)) f_t(S_t)^\top \right],$$

where $\epsilon_{t,j} = Y_{t+1,j} - g_t(H_t)^\top \alpha^\star - (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)^\top \beta^\star$, $\alpha^\star$ minimizes the least-squares criterion $E\big[ G^{-1} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} (Y_{t+1,j} - g_t(H_t)^\top \alpha)^2 \big]$, and $J$ and $\tilde J$ are independent randomly sampled indices from the cluster.

In practice, plug-in estimates $\hat Q$ and $\hat W$ are used to estimate the covariance structure. Appendix C presents small-sample adjustments that we use in the simulations and analysis in this paper. It is clear from Lemma 4.2 that use of standard MRT data analytic methods, i.e., equation (1), will produce biased estimates for the marginal effect of interest.

4.2 Indirect Effect Estimation

We make the following assumptions regarding the indirect treatment effect specification:

Assumption 4.3.

We assume the treatment effect of interest satisfies

$$E\big[\, E[Y_{t+1,J} \mid H_t, A_{t,J} = 0, A_{t,J'} = 1] - E[Y_{t+1,J} \mid H_t, A_{t,J} = 0, A_{t,J'} = 0] \,\big|\, S_t \,\big] = f_t(S_t)^\top \beta^{\star\star}$$

where $f_t(s) \in \mathbb{R}^p$ is a $p$-dimensional vector function of $s$ and time $t$.

We now consider inference on the unknown $p$-dimensional parameter $\beta^{\star\star}$. To do so, we define a new weight $W_{t,j,j'}$ at decision time $t$ for individuals $j$ and $j'$ as $W_{t,j,j'} = \tilde p_t(A_{t,j}, A_{t,j'} \mid S_t) / p_t(A_{t,j}, A_{t,j'} \mid H_t)$, where $\tilde p_t(a, a' \mid S_t) \in (0, 1)$ is arbitrary as long as it does not depend on terms in $H_t$ other than $S_t$, and $p_t(A_{t,j}, A_{t,j'} \mid H_t)$ is the marginal probability that individuals $j$ and $j'$ receive treatments $A_{t,j}$ and $A_{t,j'}$ respectively given $H_t$. Here we consider an estimator $\hat\beta^{(IE)}$ of $\beta^{\star\star}$ which is the minimizer of the following cluster-based weighted-centered least-squares (C-WCLS) criterion, minimized over $(\alpha, \beta)$:

$$P_M \left[ \frac{1}{G_m (G_m - 1)} \sum_{j \neq j'} \sum_{t=1}^{T} W_{t,j,j'} \big( Y_{t+1,j} - g_t(H_t)^\top \alpha - (1 - A_{t,j}) (A_{t,j'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)^\top \beta \big)^2 \right] \tag{6}$$

where

$$\tilde p^\star_t(1 \mid S_t) = \frac{\tilde p_t(0, 1 \mid S_t)}{\tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t)}.$$

If an individual's randomization probabilities only depend on their own observed history, then $\tilde p^\star_t(1 \mid S_{t,j'}) = \tilde p_t(1 \mid S_{t,j'})$. In Appendix A, we prove the following result.

Lemma 4.4. Under Assumption 4.3, and under invertibility and moment conditions, the estimator $\hat\beta^{(IE)}$ that minimizes (6) satisfies $\sqrt{M} (\hat\beta^{(IE)} - \beta^{\star\star}) \to N(0, Q^{-1} W Q^{-1})$, where

$$Q = E\left[ \sum_{t=1}^{T} \big( \tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t) \big)\, \tilde p^\star_t(1 \mid S_t) (1 - \tilde p^\star_t(1 \mid S_t)) f_t(S_t) f_t(S_t)^\top \right]$$

and

$$W = E\left[ \sum_{t=1}^{T} W_{t,J,J'}\, \epsilon_{t,J,J'}\, (1 - A_{t,J}) (A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,\tilde J,\tilde J'}\, \epsilon_{t,\tilde J,\tilde J'}\, (1 - A_{t,\tilde J}) (A_{t,\tilde J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)^\top \right],$$

where $\epsilon_{t,j,j'} = Y_{t+1,j} - g_t(H_t)^\top \alpha^{\star\star} - (1 - A_{t,j}) (A_{t,j'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)^\top \beta^{\star\star}$, $\alpha^{\star\star}$ minimizes the least-squares criterion $E\big[ \frac{1}{G_m (G_m - 1)} \sum_{j \neq j'} \sum_{t=1}^{T} W_{t,j,j'} (Y_{t+1,j} - g_t(H_t)^\top \alpha)^2 \big]$, and $(J, J')$ and $(\tilde J, \tilde J')$ are independently, randomly sampled pairs from the cluster (each pair sampled without replacement).

In Appendix E, the above variances are re-derived in the general setting where the numerators $\tilde p_t(a \mid S_t)$ and $\tilde p_t(a, a' \mid S_t)$ are estimated using the observed MRT data.

A natural question is whether there are any conditions under which the standard MRT analysis presented in Section 2.2 is equivalent to the proposed cluster MRT analysis for the direct effect presented in Section 4. It is clear from Lemma 4.2 that if the cluster size is one then we recover the estimators and asymptotic theory for standard MRT analyses. Lemma 4.5 below proves that, under certain conditions, even when cluster sizes are greater than one, an equivalence of point estimates and asymptotic variances is guaranteed.
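As a small aside on the ingredients of criterion (6), the sketch below computes the pairwise weight $W_{t,j,j'}$ and the centering probability $\tilde p^\star_t(1 \mid S_t)$ for one pair at one decision point, and checks numerically that the centering probability reduces to the second individual's own numerator probability when the two individuals are randomized independently. The helper names and the probabilities 0.5, 0.3, 0.6, and 0.4 are hypothetical values chosen for illustration.

```python
def joint_probs(p_j, p_k):
    """Joint probabilities p(a, a') for two individuals randomized independently."""
    return {(a, b): (p_j if a else 1 - p_j) * (p_k if b else 1 - p_k)
            for a in (0, 1) for b in (0, 1)}

def p_star(p_tilde):
    """Centering probability: P(A_j' = 1 | A_j = 0) under the numerator distribution."""
    return p_tilde[(0, 1)] / (p_tilde[(0, 0)] + p_tilde[(0, 1)])

def pair_weight(a_j, a_k, p_tilde, p_design):
    """Weight W_{t,j,j'} = p_tilde(a_j, a_k | S_t) / p_t(a_j, a_k | H_t)."""
    return p_tilde[(a_j, a_k)] / p_design[(a_j, a_k)]

p_tilde = joint_probs(0.5, 0.3)    # hypothetical numerator probabilities
p_design = joint_probs(0.6, 0.4)   # hypothetical design (denominator) probabilities

centering = p_star(p_tilde)                  # equals 0.3 under independence
w = pair_weight(0, 1, p_tilde, p_design)     # weight when j untreated, j' treated
print(centering, w)
```

Only pairs with $A_{t,j} = 0$ contribute to the treatment-effect term in (6), which is why the centering probability conditions on the first individual being untreated.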

Lemma 4.5.

Consider the direct effect when the moderator is defined on the individual (i.e., $S_{j,t}$) and the randomization probabilities depend only on the individual's observed history, i.e., $p(A_{j,t} \mid H_t) = p(A_{j,t} \mid H_{j,t})$. If cluster size is constant across clusters (i.e., $G_m \equiv G$), then the point estimates from (1) and (3) are equal for any sample size. Moreover, if

$$E\big[E[\epsilon_{j,t}\,\epsilon_{j',t'} \mid H_{j,t}, A_{j,t} = a, H_{j',t'}, A_{j',t'} = a'] \mid S_{t,j}, S_{t,j'}\big] = \psi(S_{t,j}, S_{t,j'}) \qquad (7)$$

for some function $\psi$, i.e., the cross-term is constant in $a$ and $a'$, where $\epsilon_{j,t}$ is the error defined in Lemma 4.2, then the estimators share the same asymptotic variance.

Proof of Lemma 4.5 can be found in Appendix B. Here, we consider the following class of random effect models to help interpret the sufficient condition (7). Specifically, for participant $j$ at decision time $t$, suppose the generative model for the proximal outcome is

$$Y_{t+1,j} = g_t(H_{t,j})'\alpha + \underbrace{Z_{t,j}' b_g}_{(I)} + \big(A_{t,j} - p_t(1 \mid H_{t,j})\big)\big(f_t(H_{t,j})'\beta + \underbrace{Z_{t,j}' \tilde b_g}_{(II)}\big) + e_{t,j},$$

where (I) and (II) are random effects with design matrix $Z_{t,j}$, $E[f_t(H_{t,j})'\beta \mid S_{t,j}] = f_t(S_{t,j})'\beta$, and $e_{t,j}$ is a participant-specific error term. The treatment effect conditional on the complete observed history and the random effects is $f_t(H_{t,j})'\beta + Z_{t,j}'\tilde b_g$, which implies the marginal causal effect is $f_t(S_{t,j})'\beta$, so Assumption 4.1 holds. Random effects in (I) allow for cluster-level variation in baseline values of the proximal outcome, while random effects in (II) allow for cluster-level variation in the fully-conditional treatment effect. Given the above generative model, sufficient condition (7) holds if $\tilde b_g \equiv 0$, i.e., when the treatment effect does not exhibit cluster-level variation. To see this, note that the errors $\epsilon_{t,j}$ and $\epsilon_{t,j'}$ as defined in Lemma 4.2 do not depend on treatment if $\tilde b_g = 0$. The cross-term is nonetheless non-zero ($\psi \not\equiv 0$) because there is within-cluster variation due to random effects in (I). For this reason, we refer to (7) as a treatment-effect heterogeneity condition: when clusters exhibit treatment-effect heterogeneity, the marginal residual correlation depends on $a$ and $a'$, which means the proposed approach, rather than the standard MRT analysis, is necessary for assessing direct effects. In Section 5, we conduct simulations to support this calculation.

To evaluate the performance of the proposed cluster-level proximal treatment effect estimator, we extend the simulation setup in Boruvka et al. [2017]. Consider a standard MRT where the randomization probability $p_t(1 \mid H_t)$ is known. The observation vector $O_t$ is a single state variable $S_t \in \{-1, 1\}$ at each decision time $t$. Assume the response, $Y_{t+1}$, follows the linear model

$$Y_{t+1} = \theta_1\{S_t - E[S_t \mid A_{t-1}, H_{t-1}]\} + \theta_2\{A_{t-1} - p_{t-1}(1 \mid H_{t-1})\} + \{A_t - p_t(1 \mid H_t)\}(\beta_{10} + \beta_{11} S_t) + e_{t+1}. \qquad (8)$$

The randomization probability is given by $p_t(1 \mid H_t) = \mathrm{expit}(\eta_1 A_{t-1} + \eta_2 S_t)$, where $\mathrm{expit}(x) = (1 + \exp(-x))^{-1}$; the state dynamics are given by $P(S_t = 1 \mid A_{t-1}, H_{t-1}) = \mathrm{expit}(\xi A_{t-1})$ (note $A_0 = 0$); and the independent error term satisfies $e_t \sim N(0, 1)$, with $\mathrm{Corr}(e_u, e_t)$ decaying geometrically in $|u - t|$. To stay close to Boruvka et al. [2017], in the scenarios below we set $\theta_2 = 0$, $\xi = 0$, $\beta_{10} = -0.2$, and $\beta_{11} = 0.2$, with $\theta_1 > 0$, $\eta_1 < 0$, and $\eta_2 > 0$. It is readily seen that because $\xi = 0$, $E[S_t] = 0.5 \times (-1) + 0.5 \times 1 = 0$, so that $E\big[E[Y_{t+1} \mid A_t = 1, H_t] - E[Y_{t+1} \mid A_t = 0, H_t]\big] = \beta_{10} + \beta_{11} E[S_t] = \beta_{10}$. The marginal treatment effect is thus constant in time and is given by $\beta^\star = \beta_{10} = -0.2$.

In each case, when the weighted and centered method is used, we set $f_t(S_t) = 1$ in (8) (i.e., $S_t = \emptyset$). We report average $\hat\beta$ point estimates, standard deviation (SE) and root mean squared error (RMSE) of $\hat\beta$, and 95% confidence interval coverage probabilities (CP) across 1000 replicates. We vary the number of clusters and cluster size. Confidence intervals are constructed based on standard errors that are corrected for the estimation of weights and/or small samples. We compare the proposed cluster-based method (C-WCLS) to the approach by Boruvka et al. [2017] (WCLS).

Simulation Scenario I. The first scenario concerns the estimation of $\beta^\star$ when an important individual-level moderator exists and proximal outcomes share a random cluster-level intercept term that does not interact with treatment. In the data generative model, we incorporate a cluster-level random intercept $e_g$ drawn from a mean-zero normal distribution:

$$Y_{t+1,j} = (-0.2 + 0.2 \cdot S_{t,j}) \times \big(A_{t,j} - p_t(1 \mid H_{t,j})\big) + \theta_1 S_{t,j} + e_g + e_{t+1,j},$$

with $\theta_1$ as in (8). Table 2 presents the results, which show that both WCLS and the proposed C-WCLS approach achieve unbiasedness and proper coverage. This is in line with our theoretical results in Section 4.3 stating asymptotic equivalence of the two procedures under no cluster-level treatment heterogeneity.
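To make the Scenario I setup concrete, the sketch below simulates clustered MRT data with a shared random intercept and then fits a weighted, centered least squares regression with $f_t(S_t) = 1$. It is an illustration under stated assumptions, not the authors' simulation code: the function names are hypothetical, the default values of `theta1`, `eta1`, `eta2`, and `sd_cluster` are illustrative placeholders, the within-person errors are drawn independently rather than autocorrelated, and the numerator probability is fixed at the constant `p_tilde = 0.5`.

```python
import numpy as np


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def simulate_cluster_mrt(n_clusters, cluster_size, T, rng,
                         beta10=-0.2, beta11=0.2, theta1=0.8,
                         eta1=-0.8, eta2=0.8, sd_cluster=0.7):
    """Return arrays (Y, A, S, P), each of shape (n_clusters, cluster_size, T)."""
    shape = (n_clusters, cluster_size, T)
    S = rng.choice([-1.0, 1.0], size=shape)        # xi = 0, so P(S_t = 1) = 0.5
    A = np.zeros(shape)
    P = np.zeros(shape)
    prev_a = np.zeros((n_clusters, cluster_size))  # A_0 = 0
    for t in range(T):
        P[..., t] = expit(eta1 * prev_a + eta2 * S[..., t])
        A[..., t] = rng.binomial(1, P[..., t])
        prev_a = A[..., t]
    e_g = rng.normal(0.0, sd_cluster, size=(n_clusters, 1, 1))  # cluster intercept
    noise = rng.normal(size=shape)                 # independent errors (simplification)
    Y = (beta10 + beta11 * S) * (A - P) + theta1 * S + e_g + noise
    return Y, A, S, P


def wcls_marginal_effect(Y, A, S, P, p_tilde=0.5):
    """WCLS point estimate with f_t(S_t) = 1 and working model g_t(H_t) = (1, S_t)."""
    y, a, s, p = (v.ravel() for v in (Y, A, S, P))
    # Weight: numerator probability over the true randomization probability.
    w = np.where(a == 1, p_tilde / p, (1.0 - p_tilde) / (1.0 - p))
    X = np.column_stack([np.ones_like(y), s, a - p_tilde])  # centered treatment
    XtW = X.T * w
    coef = np.linalg.solve(XtW @ X, XtW @ y)
    return coef[-1]  # coefficient on the centered treatment indicator


rng = np.random.default_rng(0)
Y, A, S, P = simulate_cluster_mrt(200, 5, 30, rng)
print(wcls_marginal_effect(Y, A, S, P))  # should be close to beta10
```

Because the cluster intercept does not interact with treatment here, the point estimate targets $\beta_{10}$ whether or not clusters are accounted for; only the variance calculation differs between the two approaches.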

Simulation Scenario II. In the second scenario, we extend the above generative model to include a random cluster-level intercept term that interacts with treatment by considering the linear model

$$Y_{t+1,j} = (-0.2 + b_g + 0.2 \cdot S_{t,j}) \times \big(A_{t,j} - p_t(1 \mid H_{t,j})\big) + \theta_1 \cdot S_{t,j} + e_g + e_{t+1,j}, \qquad (9)$$

where $b_g \sim N(0, 0.1)$ is a random-intercept term within the treatment effect per cluster and $\theta_1$ is the state coefficient from (8). Table 2 presents the results. This demonstrates that if cluster-level random effects interact with treatment (i.e., $b_g \not\equiv 0$), only the proposed method achieves the nominal 95% coverage probability. To further demonstrate this, Figure 1 presents coverage as a function of the ratio of the variance of the random effect $b_g$ to the variance of the random effect $e_g$, and as a function of group size. The coverage probability of the WCLS method decays rapidly, while the proposed method achieves the nominal 95% coverage probability for all choices of the variance of $b_g$ and as the group size increases. Note that even when group size is 5 (i.e., small groups), the WCLS coverage drops to 80%.

Figure 1: C-WCLS offers more valid 95% confidence intervals than WCLS in Scenario II. Observed coverage probability varies by group size ($G$) (total sample size fixed) (left), and the variance of random intercept $b_g$ as a fraction of the variance of the random error term $e_g$ (right).

Simulation Scenario III. In the third scenario, we assume the treatment effect for an individual depends on the average state of all individuals in the cluster. Therefore, we define the cluster-level moderator $\bar S_{t,g} = \frac{1}{G_g} \sum_{j=1}^{G_g} S_{t,j}$ and consider the generative model:

$$Y_{t+1,j} = (-0.2 + b_g + 0.2 \cdot \bar S_{t,g}) \times \big(A_{t,j} - p_t(1 \mid H_{t,j})\big) + \theta_1 S_{t,j} + e_g + e_{t+1,j}.$$

The proposed estimator again achieves the nominal 95% coverage probability while the WCLS method does not (see Scenario III, Table 2).
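The coverage gap in Scenarios II and III comes from the level at which residual products are aggregated in the variance estimate. The sketch below is a plain cluster-robust sandwich for weighted least squares in which the `groups` argument sets that level. It is an assumption-laden illustration rather than the paper's estimator: it omits the small-sample correction of Appendix C, and the function name `cluster_sandwich` is hypothetical.

```python
import numpy as np


def cluster_sandwich(X, y, w, groups):
    """Weighted least squares fit with a sandwich variance summed by group.

    Passing participant ids as `groups` treats participants as independent;
    passing cluster ids additionally allows within-cluster correlation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    XtW = X.T * w
    bread = np.linalg.inv(XtW @ X)
    coef = bread @ (XtW @ y)
    scores = X * (w * (y - X @ coef))[:, None]      # per-row estimating function
    labels, inverse = np.unique(np.asarray(groups).ravel(), return_inverse=True)
    summed = np.zeros((labels.size, X.shape[1]))
    np.add.at(summed, inverse, scores)              # sum scores within each group
    half = summed @ bread                           # bread is symmetric
    cov = half.T @ half                             # bread @ meat @ bread
    return coef, np.sqrt(np.diag(cov))


# Tiny worked example with made-up numbers.
X = np.column_stack([np.ones(6), [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]])
y = np.array([0.1, 0.9, 2.3, 2.8, 4.2, 4.9])
w = np.ones(6)
print(cluster_sandwich(X, y, w, groups=np.arange(6)))         # row-level grouping
print(cluster_sandwich(X, y, w, groups=[0, 0, 0, 1, 1, 1]))   # two clusters
```

With treatment-effect heterogeneity across clusters, the cross-products between members of the same cluster no longer vanish in expectation, so grouping at the participant level can understate the variance.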

Simulation Scenario IV. The fourth scenario considers the indirect effect. For individual $j$ at decision point $t$, we construct the total effect as $TE_{j,t} = \sum_{j' \neq j} \{A_{t,j'} - \tilde p_{t,j'}(1 \mid H_t)\}(\beta_{10} + \beta_{11} S_{t,j'})$, where, in this scenario, $\beta_{10} = -0.1$ and $\beta_{11} = 0.2$. The generative model is given by:

$$Y_{t+1,j} = (-0.2 + b_g + 0.2 \cdot \bar S_{t,g}) \times \{A_{t,j} - p_t(1 \mid H_{t,j})\} + \theta_1 S_{t,j} + TE_{j,t} + e_g + e_{t+1,j},$$

with $\theta_1$ as in (8). This model implies a marginal indirect effect equal to $\beta^{(IE)}_1 = \beta_{10} = -0.1$. Table 3 presents the results. We see that the proposed indirect estimator exhibits nearly no bias and achieves the nominal coverage probability.
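The total-effect term $TE_{j,t}$ in Scenario IV is a leave-one-out sum over the other members of the cluster. A minimal sketch for a single cluster follows; the function name `total_effect` and the array layout (rows are individuals, columns are decision times) are assumptions made for illustration.

```python
import numpy as np


def total_effect(A, P, S, b0=-0.1, b1=0.2):
    """Leave-one-out sum of (A_j' - p_j') * (b0 + b1 * S_j') over j' != j.

    A, P, S have shape (cluster_size, T); the result has the same shape.
    """
    A, P, S = (np.asarray(v, dtype=float) for v in (A, P, S))
    contrib = (A - P) * (b0 + b1 * S)               # each member's own contribution
    return contrib.sum(axis=0, keepdims=True) - contrib


A = np.array([[1, 0], [0, 1], [1, 1]])
P = np.full((3, 2), 0.5)
S = np.array([[1, -1], [-1, 1], [1, 1]])
print(total_effect(A, P, S))
```

Subtracting each member's own contribution from the cluster total avoids an explicit double loop over pairs.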

The Intern Health Study (IHS) was a 6-month MRT on 1,565 medical interns where four types of weekly notification (mood, activity, sleep, or none) were randomly assigned with equal probability to each subject [NeCamp et al., 2020] (see Section 3.1). In IHS, 273 institutions and 14 specialties were observed. Here, we assess the effect of the three types of notifications (mood, activity, and sleep) compared to no notifications on the weekly average of self-reported mood scores and log step-count for the population of interns. See Appendix D for an additional analysis of log sleep minutes. Due to high levels of missing data, weekly average mood scores were computed from multiply imputed daily self-reported scores. Similar imputation was performed for daily log step-counts and daily log sleep minutes. See NeCamp et al. [2020] for further details.

Let $t = 1, \ldots, T$ denote the weekly decision points at which the individual is randomized to the various types of notifications. Because of the form of the intervention, all participants were available for this intervention throughout the study; i.e., $I_t \equiv 1$. The proximal outcome $Y_{j,t+1}$ is the weekly mood score, which is reported on a Likert scale taking values from 1 to 10 (higher scores mean better mood). We collapse notifications to a binary variable, i.e., $A_{t,j} = 1$ if the individual was assigned to receive either mood, activity, or sleep notifications on week $t$; otherwise, $A_{t,j} = 0$ and no notifications were sent that week. At any occasion $t$, an individual's notification randomization probabilities depended only on their past observed history $H_{t,j}$. For simplicity, we start with clusters constructed based on medical specialty. The average cluster size was 120; the first quartile was 56, and the maximum and minimum sizes were 353 and 24. For every individual in each cluster at each decision point, we compute the average prior weekly mood score for all others in the cluster, denoted $\bar Y_{-j,t}$ for the $j$th individual in the cluster. We consider two moderation analyses that can both be expressed as

$$\beta(t; S_t) = \beta_0 + \beta_1 \cdot Y_{j,t-1} + \beta_2\, \bar Y_{-j,t-1}.$$

The first set of moderation analyses considers the standard moderation analysis where only individual-level moderators are included (i.e., $\beta_2 = 0$), with or without accounting for cluster-level moderation effect heterogeneity. Table 4 presents the results and compares our proposed approach against the WCLS approach from Boruvka et al. [2017]. In this case, the effects do not change much for the average weekly mood analysis; however, the significant effect of messages on weekly log step count under the traditional MRT analysis becomes insignificant when accounting for cluster effects.

The second moderation analysis lets $\beta_2$ be a free parameter, enabling novel moderation analyses that account for the average previous weekly mood score of other individuals. Table 4 presents the results. Here, we see that the constant term $\beta_0$ becomes negative but insignificant while the new term $\beta_2$ is positive and significant.
The results suggest the average mood of others in the cluster may moderate the effect of notifications. In particular, the impact of a notification is larger when the average mood score of others in the specialty is large while the individual's score is low. Similar results hold for the log step-count analysis.

Finally, we consider indirect moderation effect analyses. In this analysis, clusters are constructed based on medical specialty and institution. This was done because interference is only likely when interns are in close geographic proximity. Here, we consider the marginal indirect effect (i.e., no moderators) both when the individual did not receive the intervention and when the individual did receive an intervention at decision time $t$. Table 5 presents the results. In this case, the estimated indirect effects are weaker than the direct effects.

Table 4.

Outcome         Setting  Variable   Estimate  Std. Error  t-value  p-value
Mood            WCLS     $\beta_0$
                         $\beta_1$  -0.055    0.011       -4.822   0.000
Mood            C-WCLS   $\beta_0$
                         $\beta_1$  -0.053    0.014       -3.868   0.000
Mood            C-WCLS   $\beta_0$  -0.238    0.282       -0.842   0.401
                         $\beta_1$  -0.054    0.014       -3.973   0.000
                         $\beta_2$
Log step-count  WCLS     $\beta_0$
                         $\beta_1$  -0.037    0.015       -2.484   0.015
Log step-count  C-WCLS   $\beta_0$
                         $\beta_1$  -0.031    0.019       -1.580   0.117
Log step-count  C-WCLS   $\beta_0$  -2.095    1.248       -1.678   0.094
                         $\beta_1$  -0.034    0.019       -1.745   0.083
                         $\beta_2$

Table 5 (columns: Outcome, Variables, Estimate, Std. Error, t-value, p-value). One coefficient represents the indirect effect under $A_{j,t} = 0$, while the coefficient $\tilde\beta$ represents the indirect effect under $A_{j,t} = 1$.

Here we consider the causal excursion effect in the presence of a priori known clusters. We have extended the causal excursion effect to naturally account for cluster information. In particular, both direct and indirect excursion effects have been formalized in the context of MRT to account for potential interference. The effects described in this paper are most important when using MRT data to build optimized JITAIs for deployment in an mHealth package.
Specifically, the estimation procedure for the direct excursion effect accounts for within-cluster correlation in the proximal outcomes, which helps the scientific team avoid making erroneous conclusions about intervention effectiveness using standard MRT methods. Moreover, estimation of indirect effects allows the scientific team to answer questions about the impact of interventions on other members of the same cluster. Use of these methods provides empirical evidence for the scientific team to include or exclude intervention components that may have had unanticipated second-order effects, or may lead to novel ways to improve the intervention component by revising the intervention to account more explicitly for cluster-level interference. While this work represents a major step forward in the analysis of micro-randomized trial data, further work is required. Specifically, extensions to be considered in future work include accounting for overlapping communities and/or network (rather than cluster-only) structure [Ogburn and VanderWeele, 2014, Papadogeorgou et al., 2019], accounting for general non-continuous proximal outcomes such as binary or count outcomes [Qian et al., 2020], penalization of the working model to allow for high-dimensional moderators, and a method to use the proposed approach to form warm-start policies at the individual level while accounting for group-level information [Luckett et al., 2020].

References

A. Boruvka, D. Almirall, K. Witkiewitz, and S.A. Murphy. Assessing time-varying causal effect moderation in mobile health. To appear in the Journal of the American Statistical Association, 2017.

L.M. Collins. Optimization of Behavioral, Biobehavioral, and Biomedical Interventions. Springer International Publishing, 2018.

W. Dempsey, P. Liao, and S.A. Murphy. The stratified micro-randomized trial design: Sample size considerations for testing nested causal effects of time-varying treatments. Annals of Applied Statistics, 14(2):661–684, 2020.

G. Hong and S. W. Raudenbush. Evaluating kindergarten retention policy. Journal of the American Statistical Association, 101(475):901–910, 2006.

P. Klasnja, E.B. Hekler, S. Shiffman, A. Boruvka, D. Almirall, A. Tewari, and S.A. Murphy. Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychol, 34:1220–1228, 2015.

Predrag Klasnja, Shawna Smith, Nicholas J Seewald, Andy Lee, Kelly Hall, Brook Luers, Eric B Hekler, and Susan A Murphy. Efficacy of contextually tailored suggestions for physical activity: a micro-randomized optimization trial of HeartSteps. Annals of Behavioral Medicine, 53(6):573–582, 09 2018. ISSN 0883-6612. doi: 10.1093/abm/kay067. URL https://doi.org/10.1093/abm/kay067.

Daniel J. Luckett, Eric B. Laber, Anna R. Kahkoska, David M. Maahs, Elizabeth Mayer-Davis, and Michael R. Kosorok. Estimating dynamic treatment regimes in mobile health using v-learning. Journal of the American Statistical Association, 115(530):692–706, 2020. doi: 10.1080/01621459.2018.1537919. URL https://doi.org/10.1080/01621459.2018.1537919. PMID: 32952236.

L.A. Mancl and T.A. DeRouen. A covariance estimator for GEE with improved small-sample properties. Biometrics, 57(1):126–134, 2001.

S A Murphy, M J van der Laan, J M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001. doi: 10.1198/016214501753382327. PMID: 20019887.

I. Nahum-Shani, S.N. Smith, B.J. Spring, L.M. Collins, K. Witkiewitz, A. Tewari, and S.A. Murphy. Just-in-time adaptive interventions (JITAIs) in mobile health: Key components and design principles for ongoing health behavior support. Annals of Behavioral Medicine, 2017.

Timothy NeCamp, Srijan Sen, Elena Frank, Maureen A Walton, Edward L Ionides, Yu Fang, Ambuj Tewari, and Zhenke Wu. Assessing real-time moderation for developing adaptive mobile health interventions for medical interns: Micro-randomized trial. Journal of Medical Internet Research, 22(3):e15033, March 2020. ISSN 1439-4456. doi: 10.2196/15033. URL https://europepmc.org/articles/PMC7157494.

E. Ogburn and T. VanderWeele. Causal diagrams for interference. Statistical Science, 29(4):559–578, 2014.

Georgia Papadogeorgou, Fabrizia Mealli, and Corwin M. Zigler. Causal inference with interfering units for cluster and population level treatment allocation programs. Biometrics, 75(3):778–787, 2019. doi: https://doi.org/10.1111/biom.13049. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13049.

J. Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.

Tianchen Qian, Hyesun Yoo, Predrag Klasnja, Daniel Almirall, and Susan Murphy. Estimating time-varying causal excursion effect in mobile health with binary outcomes. Biometrika, 09 2020. ISSN 0006-3444. doi: 10.1093/biomet/asaa070. URL https://doi.org/10.1093/biomet/asaa070. asaa070.

J. Robins. A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9):1393–1512, 1986.

DB. Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 6(1):34–58, 1978.

T. J. Vanderweele, G. Hong, S.M. Jones, and J.L. Brown. Mediation and spillover effects in group-randomized trials: A case study of the 4Rs educational intervention. Journal of the American Statistical Association, 108(502):469–482, 2013.

Technical Details

Proof of Lemma 3.2.

We establish Lemma 3.2 for the direct effect (3). For $a_s \in \{0, 1\}^G$, we consider

$$
\begin{aligned}
& E\left[\left(\prod_{s=1}^{t-1} p_s(a_s \mid H_s(\bar a_{s-1}))\right)\left(\prod_{j' \neq j} p_t(a_{t,j'} \mid H_t(\bar a_{t-1}))\right) Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a)\big)\, 1_{S_t(\bar a_t) = s}\right] \\
&\quad = E\left[\left(\prod_{s=1}^{t-1} p_s(a_s \mid H_s(\bar a_{s-1}))\right)\left(\prod_{j' \neq j} p_t(a_{t,j'} \mid H_t(\bar a_{t-1}))\right) 1_{S_t(\bar a_t) = s}\, E\big[Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a)\big) \mid H_t(\bar a_{t-1})\big]\right],
\end{aligned}
$$

since the history $H_t$ includes the moderator variable $S_t$ at time $t$. By consistency, $H_t(\bar A_{t-1}) = H_t$, so the above is equal to

$$E\left[\left(\prod_{s=1}^{t-1} p_s(a_s \mid H_s)\right)\left(\prod_{j' \neq j} p_t(a_{t,j'} \mid H_t)\right) 1_{S_t(\bar a_t) = s}\, E\big[Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a)\big) \mid H_t\big]\right]. \qquad (10)$$

Sequential ignorability implies that

$$E\big[Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a)\big) \mid H_t\big] = E\big[Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a)\big) \mid H_t, A_{t,j} = a\big].$$

Summing over all potential outcomes and normalizing yields

$$
\begin{aligned}
& E\left[\sum_{\bar a_{t-1}} \frac{\left(\prod_{s=1}^{t-1} p_s(a_s \mid H_s)\right)\left(\prod_{j' \neq j} p_t(a_{t,j'} \mid H_t)\right) 1_{S_t(\bar a_t) = s}}{E\left[\left(\prod_{s=1}^{t-1} p_s(a_s \mid H_s)\right)\left(\prod_{j' \neq j} p_t(a_{t,j'} \mid H_t)\right) 1_{S_t(\bar a_t) = s}\right]}\, E\big[Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a)\big) \mid H_t, A_{t,j} = a\big]\right] \\
&\quad = E\left[\sum_{\bar a_{t-1}} \frac{\left(\prod_{s=1}^{t-1} p_s(a_s \mid H_s)\right)\left(\prod_{j' \neq j} p_t(a_{t,j'} \mid H_t)\right) 1_{S_t(\bar a_t) = s}}{E\left[\left(\prod_{s=1}^{t-1} p_s(a_s \mid H_s)\right)\left(\prod_{j' \neq j} p_t(a_{t,j'} \mid H_t)\right) 1_{S_t(\bar a_t) = s}\right]}\, E\big[Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a)\big) \mid H_t, A_{t,j} = a\big] \,\Big|\, S_t = s\right] \\
&\quad = E\big[E[Y_{t+1,j} \mid H_t, A_{t,j} = a] \mid S_t = s\big].
\end{aligned}
$$

In the final equation, the outer expectation is with respect to the history $H_t$ conditional on $S_t = s$; that is, over both past treatments $A_s$ and past observations $O_s$ for $s < t$, as well as over current treatments $A_{t,j'}$ for $j' \neq j$. The above shows

$$E\big[Y_{t+1}\big(\bar A_{t,-j}, (\bar A_{t-1,j}, a)\big) \mid S_t(\bar a_t) = s\big] = E\big[E[Y_{t+1,j} \mid H_t, A_{t,j} = a] \mid S_t = s\big].$$

Averaging over individuals $j \in [G]$ in the group, where $G$ is the group size, completes the proof. The proof for the indirect effect follows the exact same structure.

A.1 Lemma 4.2

We next provide a detailed proof of asymptotic normality and consistency for the weighted-centered least squares estimator.

Proof of consistency for direct and indirect effects.

The solutions $(\hat\alpha, \hat\beta)$ that minimize equation (5) are consistent estimators for the solutions that minimize

$$E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} \big(Y_{t+1,j} - g_t(H_t)'\alpha - (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\beta\big)^2\right].$$

Differentiating the above with respect to $\alpha$ yields a set of $p$ estimating equations:

$$0_p = E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} \big(Y_{t+1,j} - g_t(H_t)'\alpha - (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\beta\big)\, g_t(H_t)\right].$$

We note that $E[W_{t,J}(A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\beta \mid H_t] = 0$. Therefore, we have

$$0_p = E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} \big(g_t(H_t)\, E[W_{t,j} Y_{t+1,j}] - g_t(H_t) g_t(H_t)'\alpha\big)\right]$$

$$\Rightarrow \alpha = E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} g_t(H_t) g_t(H_t)'\right]^{-1} E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} g_t(H_t)\, E[W_{t,j} Y_{t+1,j}]\right].$$

We note that $E[W_{t,J}(A_{t,J} - \tilde p_t(1 \mid S_t))\, g_t(H_t)'\alpha] = 0$ and

$$E[W_{t,J}(A_{t,J} - \tilde p_t(1 \mid S_t))\, Y_{t+1,J}] = \tilde p_t(1 \mid S_t)(1 - \tilde p_t(1 \mid S_t))\, \beta(t; S_t).$$

Now differentiating with respect to $\beta$ yields

$$0_q = E\left[\sum_{t=1}^{T} W_{t,J} \big(Y_{t+1,J} - g_t(H_t)'\alpha - (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\beta\big)(A_{t,J} - \tilde p_t(1 \mid S_t))\, f_t(S_t)\right],$$

$$0_q = E\left[\sum_{t=1}^{T} \tilde p_t(1 \mid S_t)(1 - \tilde p_t(1 \mid S_t)) \big(\beta(t; S_t) - f_t(S_t)'\beta^\star\big) f_t(S_t)\right].$$

Then we have

$$\beta^\star = E\left[\sum_{t=1}^{T} \tilde p_t(1 \mid S_t)(1 - \tilde p_t(1 \mid S_t))\, f_t(S_t) f_t(S_t)'\right]^{-1} E\left[\sum_{t=1}^{T} \tilde p_t(1 \mid S_t)(1 - \tilde p_t(1 \mid S_t))\, f_t(S_t)\, \beta(t; S_t)\right].$$

Under Assumption 4.1, we have that $\beta = \beta^\star$, which guarantees consistency.

We next consider the indirect effect estimator. Recall that

$$\tilde p^\star_t(1 \mid S_t) = \frac{\tilde p_t(0,1 \mid S_t)}{\tilde p_t(0,0 \mid S_t) + \tilde p_t(0,1 \mid S_t)}$$

is the replacement for $\tilde p_t(1 \mid S_t)$ in the direct effect for centering. If we make the assumption that $\tilde p_t(0,1 \mid S_t) = \tilde p_t(0 \mid S_t)\, \tilde p_t(1 \mid S_t)$, then $\tilde p^\star_t(1 \mid S_t) = \tilde p_t(1 \mid S_t)$; however, we provide the proof in complete generality. The estimates that minimize equation (6) are consistent estimators for the solutions that minimize

$$E\left[\sum_{t=1}^{T} W_{t,J,J'} \big(Y_{t+1,J} - g_t(H_t)'\alpha - (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)'\beta\big)^2\right].$$

Differentiating with respect to $\alpha$ yields a set of $p$ estimating equations:

$$0_p = E\left[\sum_{t=1}^{T} W_{t,J,J'} \big(Y_{t+1,J} - g_t(H_t)'\alpha - (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)'\beta\big)\, g_t(H_t)\right].$$

We note that $E[W_{t,J,J'}(1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)'\beta \mid H_t] = 0$. Therefore, we have

$$0_p = E\left[\sum_{t=1}^{T} \big(g_t(H_t)\, E[W_{t,J,J'} Y_{t+1,J}] - g_t(H_t) g_t(H_t)'\alpha\big)\right]$$

$$\Rightarrow \alpha = E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} g_t(H_t) g_t(H_t)'\right]^{-1} E\left[\sum_{t=1}^{T} g_t(H_t)\, E[W_{t,J,J'} Y_{t+1,J}]\right].$$

First, we show that

$$
\begin{aligned}
E[W_{t,J,J'}(1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) \mid H_t]
&= \sum_{a' \in \{0,1\}} E[\tilde p_t(0, a' \mid S_t)(a' - \tilde p^\star_t(1 \mid S_t)) \mid H_t, A_{t,J} = 0, A_{t,J'} = a'] \\
&= \tilde p_t(0,1 \mid S_t)\left(1 - \frac{\tilde p_t(0,1 \mid S_t)}{\tilde p_t(0,0 \mid S_t) + \tilde p_t(0,1 \mid S_t)}\right) - \tilde p_t(0,0 \mid S_t)\, \frac{\tilde p_t(0,1 \mid S_t)}{\tilde p_t(0,0 \mid S_t) + \tilde p_t(0,1 \mid S_t)} = 0
\end{aligned}
$$

and

$$
\begin{aligned}
E\big[W_{t,J,J'}(1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t))^2 \mid H_t\big]
&= \tilde p_t(0,1 \mid S_t)\left(\frac{\tilde p_t(0,0 \mid S_t)}{\tilde p_t(0,0 \mid S_t) + \tilde p_t(0,1 \mid S_t)}\right)^2 + \tilde p_t(0,0 \mid S_t)\left(\frac{\tilde p_t(0,1 \mid S_t)}{\tilde p_t(0,0 \mid S_t) + \tilde p_t(0,1 \mid S_t)}\right)^2 \\
&= \big(\tilde p_t(0,0 \mid S_t) + \tilde p_t(0,1 \mid S_t)\big)\, \tilde p^\star_t(1 \mid S_t)(1 - \tilde p^\star_t(1 \mid S_t)).
\end{aligned}
$$

It follows that $E[W_{t,J,J'}(1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t))\, g_t(H_t)'\alpha] = 0$ and

$$\frac{E[W_{t,J,J'}(1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t))\, Y_{t+1,J}]}{\tilde p_t(0,0 \mid S_t) + \tilde p_t(0,1 \mid S_t)} = \tilde p^\star_t(1 \mid S_t)(1 - \tilde p^\star_t(1 \mid S_t))\, \beta^{(IE)}(t; S_t).$$

Now differentiating with respect to $\beta$ yields

$$0_q = E\left[\sum_{t=1}^{T} \big(\tilde p_t(0,0 \mid S_t) + \tilde p_t(0,1 \mid S_t)\big)\, \tilde p^\star_t(1 \mid S_t)(1 - \tilde p^\star_t(1 \mid S_t)) \big(\beta^{(IE)}(t; S_t) - f_t(S_t)'\beta^{\star\star}\big) f_t(S_t)\right].$$

Under Assumption 4.3, we have that $\beta = \beta^{\star\star}$, which guarantees consistency.

Proof of Asymptotic Normality.

We now consider the issue of asymptotic normality. First, let $\epsilon_{t,j} = Y_{t+1,j} - g_t(H_t)'\alpha^\star - (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\beta^\star$, $\hat\theta = (\hat\alpha, \hat\beta)$, and $\theta^\star = (\alpha^\star, \beta^\star)$. Since $S_t \subset H_t$, define $h_{t,j}(H_t)' = \big(g_t(H_t)', (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\big)$. Then

$$
\begin{aligned}
\sqrt{M}(\hat\theta - \theta^\star)
&= \sqrt{M}\left\{ P_M\left(\frac{1}{G_m} \sum_{j=1}^{G_m} \sum_{t=1}^{T} W_{t,j}\, h_{t,j}(H_t) h_{t,j}(H_t)'\right)^{-1} \left[ P_M\left(\frac{1}{G_m} \sum_{j=1}^{G_m} \sum_{t=1}^{T} W_{t,j}\, Y_{t+1,j}\, h_{t,j}(H_t)\right) - P_M\left(\frac{1}{G_m} \sum_{j=1}^{G_m} \sum_{t=1}^{T} W_{t,j}\, h_{t,j}(H_t) h_{t,j}(H_t)'\right) \theta^\star \right] \right\} \\
&= \sqrt{M}\left\{ E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j}\, h_{t,j}(H_t) h_{t,j}(H_t)'\right]^{-1} \left[ P_M\left(\frac{1}{G_m} \sum_{j=1}^{G_m} \sum_{t=1}^{T} W_{t,j}\, \epsilon_{t,j}\, h_{t,j}(H_t)\right) \right] \right\} + o_p(1).
\end{aligned}
$$

By the definitions of $\alpha^\star$ and $\beta^\star$ and the previous consistency argument,

$$E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j}\, \epsilon_{t,j}\, h_{t,j}(H_t)\right] = 0.$$

Then under moment conditions, we have asymptotic normality with variance $\Sigma_\theta$ given by

$$\Sigma_\theta = E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j}\, h_{t,j}(H_t) h_{t,j}(H_t)'\right]^{-1} E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j}\, \epsilon_{t,j}\, h_{t,j}(H_t) \times \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j}\, \epsilon_{t,j}\, h_{t,j}(H_t)'\right] E\left[\frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j}\, h_{t,j}(H_t) h_{t,j}(H_t)'\right]^{-1}.$$

Due to centering, the expectation of the matrix $W_{t,J}\, h_{t,J}(H_t) h_{t,J}(H_t)'$ is block diagonal, and the sub-covariance matrix $\Sigma_\beta$ can be extracted and is equal to

$$
\begin{aligned}
\Sigma_\beta ={}& \left[\sum_{t=1}^{T} E\big[(A_{t,J} - \tilde p_t(1 \mid S_t))^2\, W_{t,J}\, f_t(S_t) f_t(S_t)'\big]\right]^{-1} \\
&\cdot E\left[\sum_{t=1}^{T} W_{t,J}\, \epsilon_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,J}\, \epsilon_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\right] \\
&\cdot \left[\sum_{t=1}^{T} E\big[(A_{t,J} - \tilde p_t(1 \mid S_t))^2\, W_{t,J}\, f_t(S_t) f_t(S_t)'\big]\right]^{-1}
\end{aligned}
$$

as desired.

We next consider asymptotic normality in the indirect setting. First, let $\epsilon_{t,j,j'} = Y_{t+1,j} - g_t(H_t)'\alpha^{\star\star} - (1 - A_{t,j})(A_{t,j'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)'\beta^{\star\star}$, $\hat\theta = (\hat\alpha, \hat\beta)$, and $\theta^{\star\star} = (\alpha^{\star\star}, \beta^{\star\star})$. Since $S_t \subset H_t$, define $h_{t,j,j'}(H_t)' = \big(g_t(H_t)', (1 - A_{t,j})(A_{t,j'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)'\big)$. Then $\sqrt{M}(\hat\theta - \theta^{\star\star})$ equals

$$
\begin{aligned}
& \sqrt{M}\left\{ P_M\left(\frac{1}{G_m (G_m - 1)} \sum_{j=1}^{G_m} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'}\, h_{t,j,j'}(H_t) h_{t,j,j'}(H_t)'\right)^{-1} \left[ P_M\left(\frac{1}{G_m (G_m - 1)} \sum_{j=1}^{G_m} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'}\, Y_{t+1,j}\, h_{t,j,j'}(H_t)\right) \right.\right. \\
& \qquad\qquad \left.\left. - P_M\left(\frac{1}{G_m (G_m - 1)} \sum_{j=1}^{G_m} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'}\, h_{t,j,j'}(H_t) h_{t,j,j'}(H_t)'\right) \theta^{\star\star} \right] \right\} \\
&= \sqrt{M}\left\{ E\left[\frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'}\, h_{t,j,j'}(H_t) h_{t,j,j'}(H_t)'\right]^{-1} \left[ P_M\left(\frac{1}{G_m (G_m - 1)} \sum_{j=1}^{G_m} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'}\, \epsilon_{t,j,j'}\, h_{t,j,j'}(H_t)\right) \right] \right\} + o_p(1).
\end{aligned}
$$

By the definitions of $\alpha^{\star\star}$ and $\beta^{\star\star}$ and the previous consistency argument,

$$E\left[\frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'}\, \epsilon_{t,j,j'}\, h_{t,j,j'}(H_t)\right] = 0.$$

Then under moment conditions, we have asymptotic normality with variance $\Sigma_\theta$ given by

$$
\begin{aligned}
\Sigma_\theta ={}& E\left[\frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'}\, h_{t,j,j'}(H_t) h_{t,j,j'}(H_t)'\right]^{-1} \\
&\cdot E\left[\frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'}\, \epsilon_{t,j,j'}\, h_{t,j,j'}(H_t) \times \frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'}\, \epsilon_{t,j,j'}\, h_{t,j,j'}(H_t)'\right] \\
&\cdot E\left[\frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'}\, h_{t,j,j'}(H_t) h_{t,j,j'}(H_t)'\right]^{-1}.
\end{aligned}
$$

Due to centering, the expectation of the matrix $W_{t,J,J'}\, h_{t,J,J'}(H_t) h_{t,J,J'}(H_t)'$ is block diagonal, and the sub-covariance matrix $\Sigma_\beta$ can be extracted and is equal to

$$
\begin{aligned}
\Sigma_\beta ={}& \left[\sum_{t=1}^{T} E\big[(1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t))^2\, W_{t,J,J'}\, f_t(S_t) f_t(S_t)'\big]\right]^{-1} \\
&\cdot E\left[\sum_{t=1}^{T} W_{t,J,J'}\, \epsilon_{t,J,J'} (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,\tilde J,\tilde J'}\, \epsilon_{t,\tilde J,\tilde J'} (1 - A_{t,\tilde J})(A_{t,\tilde J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)'\right] \\
&\cdot \left[\sum_{t=1}^{T} E\big[(1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t))^2\, W_{t,J,J'}\, f_t(S_t) f_t(S_t)'\big]\right]^{-1}
\end{aligned}
$$

as desired.

B Proof of Lemma 4.5

Proof.

Consider the $W$-matrix for the direct effect asymptotic variance,

$$\frac{1}{G^2} \sum_{t,t'} \sum_{j,j'} E\Big[W_{t,j}\, \epsilon_{t,j} (A_{t,j} - \tilde p_t(1 \mid S_t))\, W_{t',j'}\, \epsilon_{t',j'} (A_{t',j'} - \tilde p_{t'}(1 \mid S_{t'}))\, f_t(S_t) f_{t'}(S_{t'})'\Big].$$

Consider the cross-terms with $j \neq j'$ and, without loss of generality, assume $t \geq t'$. Each such term equals

$$E\Big[\sum_{a,a'} \tilde p_t(a \mid S_t)(a - \tilde p_t(1 \mid S_t))\, \tilde p_{t'}(a' \mid S_{t'})(a' - \tilde p_{t'}(1 \mid S_{t'}))\, E\big[E[\epsilon_{t,j}\, \epsilon_{t',j'} \mid H_{t,j}, A_{t,j} = a, H_{t',j'}, A_{t',j'} = a'] \mid S_t, S_{t'}\big]\, f_t(S_t) f_{t'}(S_{t'})'\Big].$$

Under condition (7), the inner expectation does not depend on $a$ and $a'$, so we can rewrite the above as

$$
\begin{aligned}
& E\Big[\sum_{a,a'} \tilde p_t(a \mid S_t)(a - \tilde p_t(1 \mid S_t))\, \tilde p_{t'}(a' \mid S_{t'})(a' - \tilde p_{t'}(1 \mid S_{t'}))\, \psi(S_t, S_{t'})\, f_t(S_t) f_{t'}(S_{t'})'\Big] \\
&\quad = E\Big[\psi(S_t, S_{t'})\, f_t(S_t) f_{t'}(S_{t'})' \underbrace{\Big(\sum_{a,a'} \tilde p_t(a \mid S_t)(a - \tilde p_t(1 \mid S_t))\, \tilde p_{t'}(a' \mid S_{t'})(a' - \tilde p_{t'}(1 \mid S_{t'}))\Big)}_{=0}\Big] \\
&\quad = E\big[\psi(S_t, S_{t'})\, f_t(S_t) f_{t'}(S_{t'})' \cdot 0\big] = 0.
\end{aligned}
$$

Therefore, we have that the $W$-matrix simplifies to

$$
\begin{aligned}
& E\left[\sum_{t=1}^{T} W_{t,J}\, \epsilon_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,J}\, \epsilon_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\right] \\
&\quad = E\left[\frac{1}{G} \sum_{j=1}^{G} \left[\sum_{t=1}^{T} W_{t,j}\, \epsilon_{t,j} (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,j}\, \epsilon_{t,j} (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\right]\right] \\
&\quad = E\left[\sum_{t=1}^{T} W_t\, \epsilon_t (A_t - \tilde p_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_t\, \epsilon_t (A_t - \tilde p_t(1 \mid S_t)) f_t(S_t)'\right],
\end{aligned}
$$

which is the $W$ matrix as in the standard MRT analysis.

C Small sample size adjustment for covariance estimation

The robust sandwich covariance estimator [Mancl and DeRouen, 2001] for the entire variance matrix is given by $Q^{-1} \Lambda Q^{-1}$. The first term, $Q$, is given by

$$Q = \sum_{m=1}^{M} \sum_{j=1}^{G_m} D_{j,m}^\top W_{j,m} D_{j,m},$$

where $D_{j,m}$ is the model matrix for individual $j$ in group $m$ associated with equation (5), and $W_{j,m}$ is a diagonal matrix of individual weights. The middle term $\Lambda$ is given by

$$\Lambda = \sum_{m=1}^{M} \sum_{i,j=1}^{G_m} D_{i,m}' W_{i,m} (I_{i,m} - H_{i,m})^{-1} e_{i,m}\, e_{j,m}' (I_{j,m} - H_{j,m})^{-1} W_{j,m} D_{j,m},$$

where $I_{i,m}$ is an identity matrix of correct dimension, $e_{i,m}$ is the individual-specific residual vector, and

$$H_{j,m} = D_{j,m} \left(\sum_{m=1}^{M} \sum_{j=1}^{G_m} D_{j,m}' W_{j,m} D_{j,m}\right)^{-1} D_{j,m}' W_{j,m}.$$

From $Q^{-1} \Lambda Q^{-1}$ we extract $\hat\Sigma_\beta$.

D Additional analysis of IHS

Setting   Variables   Estimate   Std. Error   t-value   p-value
WCLS      β
          β           -0.068     0.017        -3.916    0.000
C-WCLS    β
          β           -0.061     0.023        -2.696    0.009
C-WCLS    β           -1.912     1.379        -1.386    0.171
          β           -0.067     0.023        -2.948    0.004
          β

E Additional details on the indirect effect

The weights used in the estimation of the indirect effect are a natural extension of Boruvka et al. [2017]. As in Section 4.2, the weight $W_{t,j,j'}$ at decision time $t$ for the $j$th individual is equal to

\[
W_{t,j,j'} = \frac{\tilde{p}_t(A_{t,j}, A_{t,j'} \mid S_t)}{p_t(A_{t,j}, A_{t,j'} \mid H_t)},
\]

where $\tilde{p}_t(a, a' \mid S_t) \in (0, 1)$ is arbitrary as long as it does not depend on terms in $H_t$ other than $S_t$, and $p_t(A_{t,j}, A_{t,j'} \mid H_t)$ is the marginal probability that individuals $j$ and $j'$ receive treatments $A_{t,j}$ and $A_{t,j'}$ respectively given $H_t$.

In the simulation, the treatments $A_{t,j}$ and $A_{t,j'}$ received by individuals $j$ and $j'$ are mutually independent conditional on the previous history. Thus, the denominator of $W_{t,j,j'}$ can be factorized into:

\[
p_t(A_{t,j}, A_{t,j'} \mid H_t) = p_t(A_{t,j} \mid H_t)\, p_t(A_{t,j'} \mid H_t).
\]

The numerator of $W_{t,j,j'}$ is defined as the empirical frequency of the treatment pair $(a, a')$, which takes values in $\{(0,0), (0,1), (1,0), (1,1)\}$. Here we denote it as

\[
\tilde{p}_t(A_{t,j}, A_{t,j'} \mid S_t) = \hat{p}_t(A_{t,j}, A_{t,j'} \mid S_t).
\]

Therefore, the weight we used in the simulation is constructed as:

\[
W_{t,j,j'} = \frac{\hat{p}_t(A_{t,j}, A_{t,j'} \mid S_t)}{p_t(A_{t,j} \mid H_t)\, p_t(A_{t,j'} \mid H_t)}.
\]

When the numerators are estimated using the observed data, the variance-covariance estimate must account for this. Throughout we allow for the setting in which individuals are not always available. For completeness we provide results for a more general estimating function which can be used with observational (non-randomized $A_t$) treatments, under the assumption of sequential ignorability and assuming the data analyst is able to correctly model and estimate the treatment probability $p_t(A_{t,j}, A_{t,j'} \mid H_t)$. We indicate how the results are simplified by use of data from an MRT.

Denote the parameterized treatment probability by $p_t(a, a' \mid H_t; \eta)$ (with parameter $\eta$); note $\eta$ is known in an MRT. Denote the parameterized numerator of the weights by $\tilde{p}_t(a, a' \mid S_t; \rho)$ (with parameter $\rho$). The proof below allows the data analyst to use a $\tilde{p}_t$ with an estimated parameter $\hat{\rho}$ or to pre-specify $\rho$ as desired.
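The simulation weight described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' code: the function name `pair_weights` and its arguments are hypothetical, the moderator $S_t$ is taken to be empty (so the empirical pair frequency is computed over all supplied observations), and the randomization probabilities $p_t(1 \mid H_t)$ are assumed known, as in an MRT.

```python
from collections import Counter


def pair_weights(a_j, a_jp, p1_j, p1_jp):
    """Illustrative weights W = p_hat(a, a') / (p(a | H) * p(a' | H)).

    a_j, a_jp   : binary treatments (0/1) for individuals j and j'
    p1_j, p1_jp : known probabilities of treatment (A = 1) given history
    Assumes S_t is empty, so p_hat is the overall pair frequency.
    """
    n = len(a_j)
    # Numerator: empirical frequency of each observed pair (a, a').
    counts = Counter(zip(a_j, a_jp))
    weights = []
    for a, ap, pj, pjp in zip(a_j, a_jp, p1_j, p1_jp):
        numer = counts[(a, ap)] / n
        # Denominator: product of the two marginal randomization
        # probabilities, valid when treatments are independent given history.
        denom = (pj if a == 1 else 1.0 - pj) * (pjp if ap == 1 else 1.0 - pjp)
        weights.append(numer / denom)
    return weights
```

With randomization probability 0.5 for both individuals and all four treatment pairs observed equally often, every weight returned by this sketch equals 1, since the empirical pair frequency (0.25) matches the product of the marginal probabilities.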
We use a superscript $\star$ to denote limiting values of estimated parameters (e.g. $\eta^\star$, $\rho^\star$). Then the more general version of the estimating equation $U_W(\alpha, \beta; \hat{\eta}, \hat{\rho})$ is:

\[
\sum_{t=1}^{T} \Bigl( Y_{t+1,J} - g_t(H_t)^\top \alpha - (1 - A_{t,J})\bigl(A_{t,J'} - \tilde{p}^\star_t(1 \mid S_t; \hat{\rho})\bigr) f_t(S_t)' \beta \Bigr)\, I_{t,J}\, I_{t,J'}\, W_{t,J,J'}(A_{t,J}, A_{t,J'}, H_t; \hat{\eta}, \hat{\rho}) \times \begin{pmatrix} g_t(H_t) \\ (1 - A_{t,J})\bigl(A_{t,J'} - \tilde{p}^\star_t(1 \mid S_t; \hat{\rho})\bigr) f_t(S_t) \end{pmatrix}
\]

Note that $W_{t,J,J'}$ in the body of the paper is replaced here by $W_{t,J,J'}(A_t, H_t; \hat{\eta}, \hat{\rho})$, and $\hat{\eta}$, $\hat{\rho}$ are estimators.

Treatment Probability Model: