Assessing Time-Varying Causal Effect Moderation in the Presence of Cluster-Level Treatment Effect Heterogeneity
Jieru Shi, Zhenke Wu, Walter Dempsey
February 3, 2021
Abstract
The micro-randomized trial (MRT) is a sequential randomized experimental design to empirically evaluate the effectiveness of mobile health (mHealth) intervention components that may be delivered at hundreds or thousands of decision points. The MRT context has motivated a new class of causal estimands, termed “causal excursion effects”, for which inference can be made by a weighted, centered least squares approach (Boruvka et al., 2017). Existing methods assume between-subject independence and non-interference. Deviations from these assumptions often occur which, if unaccounted for, may result in bias and overconfident variance estimates. In this paper, causal excursion effects are considered under potential cluster-level correlation and interference and when the treatment effect of interest depends on cluster-level moderators. The utility of our proposed methods is shown by analyzing data from a multi-institution cohort of first year medical residents in the United States. The approach paves the way for construction of mHealth interventions that account for observed social network information.

Keywords: Causal Inference; Clustered Data; Just-In-Time Adaptive Interventions; Micro-randomized Trials; Mobile Health; Moderation Effect

arXiv [stat.ME]
In many areas of behavioral science, push notifications in mobile apps that are adapted to continuously collected information on an individual’s current context have received a considerable amount of attention. These time-varying adaptive interventions are hypothesized to lead to meaningful short- and long-term behavior change. The assessment of the time-varying effect of such push notifications motivated sequential randomized designs such as the micro-randomized trial (MRT) [Nahum-Shani et al., 2017, Klasnja et al., 2015]. In an MRT, interventions are randomized to be potentially delivered over a period of time, at hundreds or thousands of decision points. The MRT design enables the estimation of the time-delayed marginal treatment effects of a push notification upon the outcome of interest at multiple time-lags, referred to as “causal excursion effects” [Boruvka et al., 2017, Qian et al., 2020]. A weighted, centered least squares (WCLS) approach has been developed to consistently estimate causal excursion effects with robust methods of uncertainty assessment. Extensions to balance the number of randomizations by observed or derived individual-level contextual information have also been developed, e.g., a stratified MRT [Dempsey et al., 2020].

Existing inferential methods for data collected in an MRT rely on two key assumptions. First, an intervention delivered to an individual is assumed to only impact the outcome of that individual, i.e., non-interference. Second, the method assumes no stochastic dependence among outcomes of different subjects. Deviations from these assumptions, however, may occur when individuals naturally form clusters. Consider, for example, an MRT in which medical interns are recruited from several hospitals and every day potentially receive push notifications designed to improve physical activity, mood, or sleep.
First, interference may occur when an intern receiving a push notification to increase their steps then encourages another intern in the same specialty and institution to do the same, hence altering the other’s outcomes. Second, even if there is no direct interference, interns in the same specialty and institution are likely to have correlated outcomes, induced by variation in unobserved factors shared within each cluster. For example, interns within a specialty and institution may share similar experiences during their residency, which will lead to a higher correlation in their weekly mood scores than interns in different specializations or institutions.

The primary contribution of this paper is the significant extension of data analytic methods for MRTs to account for potential cluster-level correlation and the extension of the causal excursion effect to account for potential interference, i.e., we define both a direct and an indirect causal excursion effect. In Section 2, we review the MRT design and the existing method’s implicit assumptions. In Section 3, we introduce the moderated causal excursion effect using a group-based conceptualization of potential outcomes [Hong and Raudenbush, 2006, Vanderweele et al., 2013]. We then generalize the WCLS criterion to a cluster-based WCLS (C-WCLS) criterion to estimate both direct and indirect causal excursion effects in Section 4. Finally, we demonstrate that when proximal outcomes exhibit within-cluster correlation but there is an absence of between-cluster treatment effect heterogeneity, our proposal agrees with the traditional approach both in terms of the point estimate and the asymptotic variance. In Section 6, we illustrate the proposed method by estimating the causal excursion effects of push notifications on mood, sleep, and step count for a cohort of first year medical residents in the United States. The paper concludes with a brief discussion.
We first review the design of MRTs and the existing inferential methods based on MRT data. We focus on the assumptions made in the inferential methods to set the stage for our proposed extension.

2.1 Micro-Randomized Trials (MRT)
An MRT consists of a sequence of within-subject decision times $t = 1, \ldots, T$ at which treatment options may be randomly assigned. The resulting longitudinal data for any subject can be summarized as
\[ \{ O_0, O_1, A_1, O_2, A_2, \ldots, O_T, A_T, O_{T+1} \} \]
where $t$ indexes a sequence of decision points, $O_0$ is the baseline information, $O_t$ is the information collected between time $t-1$ and $t$, which may include mobile sensor and wearable data, and $A_t$ is the treatment option provided at time $t$. In an MRT, $A_t$ results from randomization at each decision point; here, we assume this randomization probability depends on the complete observed history $H_t := \{ O_0, O_1, A_1, \ldots, A_{t-1}, O_t \}$ and is denoted $p_t(A_t \mid H_t)$. Treatment options are designed to impact a proximal response $Y_{t+1}$, which is a known function of $\{ O_t, A_t, O_{t+1} \}$. In this section, we will use HeartSteps [Klasnja et al., 2018], an MRT designed to increase physical activity among sedentary adults, as our illustrative example. In HeartSteps [Klasnja et al., 2018], $A_t = 1$ denotes sending a tailored activity message designed to increase proximal physical activity, and therefore $Y_{t+1}$ is chosen to be the total step count in the 30 minutes after the decision time.

Remark 2.1.
At some decision points, it is inappropriate for scientific, ethical, or burden reasons to provide treatment. To address this, the observation vector $O_t$ also contains an indicator of availability: $I_t = 1$ if available for treatment and $I_t = 0$ otherwise. Availability at time $t$ is determined before treatment randomization. When an individual is unavailable for treatment ($I_t = 0$), then no treatment is provided ($A_t = 0$). In this paper, for simplicity, individuals are assumed always available for treatment. All methods can be readily extended to account for availability as in Boruvka et al. [2017].

2.2 Estimand and Inferential Method

Following Boruvka et al. [2017], Qian et al. [2020], and Dempsey et al. [2020], we focus on the class of estimands that is referred to as “causal excursion effects”, defined by
\[ \beta(t; s) = E\big[ E[Y_{t+1} \mid H_t, A_t = 1] - E[Y_{t+1} \mid H_t, A_t = 0] \,\big|\, S_t = s \big] \]
where $S_t$ is a time-varying potential effect moderator (or “state”) and a deterministic function of the observed history $H_t$. For example, in the HeartSteps study, $S_t$ may represent subject-specific contextual information such as step count in the previous 30 minutes. See Section 3.2 for how this estimand is derived using potential outcomes. Under the assumption that $\beta(t; s) = f_t(s)^\top \beta^\star$, where $f_t(s) \in \mathbb{R}^q$ is a feature vector depending only on state $s$ and decision point $t$, a consistent estimator $\hat\beta$ can be obtained by minimizing the following weighted and centered least squares (WCLS) criterion:
\[ \arg\min_{\alpha, \beta} \; \mathbb{P}_n \left[ \sum_{t=1}^{T} W_t \left( Y_{t+1} - g_t(H_t)^\top \alpha - (A_t - \tilde p_t(1 \mid S_t)) f_t(S_t)^\top \beta \right)^2 \right] \tag{1} \]
where $\mathbb{P}_n$ is defined as the average of a function over the sample, $W_t = \tilde p_t(A_t \mid S_t) / p_t(A_t \mid H_t)$ is a weight whose numerator is an arbitrary function with range $(0, 1)$ that only depends on $S_t$ and whose denominator is the MRT-specified randomization probability, and $g_t(H_t) \in \mathbb{R}^p$ is a vector of control variables chosen to help reduce the variance of the estimators and to construct more powerful test statistics. In HeartSteps, for example, a natural control variable is the number of steps in the prior 30 minutes, as this variable is likely highly correlated with the proximal response and thus can be used to reduce variance in the estimation of $\beta$. See Boruvka et al. [2017] for additional details on how to define and estimate moderated time-varying treatment effects using data from an MRT.

3 Cluster-Level Proximal Treatment Effects
MRTs are implicitly defined with the individual as the unit of analysis. The associated analytics rely on statistical independence among individuals in deriving asymptotic normality; moreover, the treatment effects assume that treatments provided to one individual do not impact another individual. In Section 5, we show that the coverage probability of the nominal 95% confidence interval obtained from these methods decays quickly when the MRT is conducted among individuals with cluster-level treatment effect heterogeneity. Moreover, in such a setting, an interesting scientific question concerns moderation analysis where the moderator is defined at the cluster level, i.e., average values of a moderator such as mood or step count across a cluster. Here we consider a cluster-based definition of the causal excursion effect and corresponding analyses to address these issues. We start with our motivating example, which demonstrates the necessity of considering clustering in the context of MRTs.
The Intern Health Study (IHS) is a 6-month micro-randomized trial on 1,565 medical interns [NeCamp et al., 2020]. Due to high depression rates and levels of stress during the first year of physician residency training, a critical question is whether targeted notifications can be used at the right time to improve mood, increase sleep time, and/or increase physical activity. Enrolled medical interns were randomized weekly to receive either mood, activity, or sleep notifications or to receive no notifications for that week (probability 1/4 each). In a week in which notifications are provided, interns were randomized daily to receive a notification with probability 50%. Hence for a notification week, the user received, on average, 3.5 notifications. The exploratory and MRT analyses conducted in this paper focus on weekly randomization; see NeCamp et al. [2020] for further study details.

Medical internships occur in a specific location and within a specific specialty. The 1,565 interns in the MRT represented 321 different residency locations and 42 specialties. Thus, interns are naturally clustered. Table 1 presents a standard equally-weighted analysis of variance (ANOVA) applied to the average weekly mood scores with three factors: week-in-study, and residency and specialty, both interacted with weekly treatment (Txt). Compared with the residual mean square of 1.27, there is considerable excess variation associated with week-in-study (28.44), residency by treatment (5.89) and specialty by treatment (8.41). In other words, the ANOVA suggests proximal responses exhibit potential within-cluster correlation within residency locations and specialties that varies with treatment. In this paper, we focus on defining clusters using specialty and institution information and do not discuss multi-level clusters.

Source           d.f.    Sum Sq.   Mean Sq.
week-in-study      25        711      28.44
Residency:Txt    1091       6424       5.89
Specialty:Txt      48        404       8.41
Residuals       38946      49479       1.27

Table 1: ANOVA decomposition for average weekly mood scores in the IHS.

Three immediate questions may arise:
(1) Does the standard MRT analysis still have good statistical properties, e.g., consistency and valid nominal coverage probability of 95% confidence intervals, when participants exhibit within-cluster correlation?
(2) Can one assess potential cluster-level effect moderators using the standard MRT analysis? and,
(3) Can one extend the MRT analysis to assess the effect of treatment provided to one individual on the proximal response of another individual in the same cluster?

For (1), through a simulation study in Section 5, we demonstrate a potential deficiency of the standard WCLS method that results in severe under-coverage of nominal 95% confidence intervals. In particular, we use simulation settings that mimic realistic cluster-level treatment heterogeneity. For (2), in IHS, for example, a scientific question may be how group average mood impacts the proximal effect of notifications on average daily mood. Since each individual’s mood may be impacted by their own treatments, if the proximal effect is moderated by cluster-level summary variables then this implies the proximal response is a function of past treatment assignment to all individuals in the group. Finally, for (3), in Section 3.2 we extend the causal excursion effect to distinguish between direct and indirect causal excursion effects.

In this paper, we extend the WCLS method [Boruvka et al., 2017] to account for clustered continuous outcomes and potential within-cluster interference of other subjects’ treatments upon one subject’s outcome, offering a general framework for answering the three questions above. Here we study micro-randomized trials in which there is a pre-defined cluster-level structure.
The cluster structure may result in (A) within-cluster correlation among proximal responses (question 1), (B) treatment impacting individual-level moderators which then impact proximal responses of other individuals (question 2), or (C) treatment interference, i.e., treatment provided to one individual may impact the proximal response of another individual in the same cluster (question 3). We will demonstrate how our proposed method addresses these three issues.
The primary question of interest is whether the treatment has a proximal effect. Here, potential outcomes are used to define the proximal effects. Due to potential treatment interference in the cluster MRT setting, we consider two types of moderated effects: a direct and an indirect effect, both on the linear scale.

Consider a cluster of size $G$. The overbar is used to denote a sequence through a specified treatment occasion; $\bar a_{t,j} = (a_{1,j}, \ldots, a_{t,j})$, for instance, denotes the sequence of realized treatments up to and including decision time $t$ for individual $j \in [G] := \{1, \ldots, G\}$ in the cluster. Let $\bar a_t = (\bar a_{t,1}, \ldots, \bar a_{t,G})$ denote the set of realized treatments up to and including decision time $t$ for all individuals in the cluster, and let $\bar a_{t,-j} = \bar a_t \setminus \bar a_{t,j}$ denote this set with the $j$th individual’s realized treatments removed. Let $Y_{t+1,j}(\bar a_t)$ denote the potential outcome for individual $j \in [G]$ in the cluster. Note that the potential outcome for individual $j$ may depend on the realized treatments for all members in the cluster.

Direct causal effects. In standard MRTs, the individual is the unit of interest. In our current setting, the cluster is the unit of interest. At the cluster level, our interest lies in the effect of providing treatment versus not providing treatment at time $t$ on a random individual in the group. This can be expressed as a difference in potential outcomes for the proximal response and is given by
\[ \frac{1}{G} \sum_{j=1}^{G} \left[ Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a_{t,j} = 1)\big) - Y_{t+1,j}\big(\bar a_{t,-j}, (\bar a_{t-1,j}, a_{t,j} = 0)\big) \right]. \tag{2} \]
Equation (2) is a direct causal effect of treatment compared to no treatment on the proximal response for the same individual, while all previous treatments on the individual and all treatments on every other member of the cluster are fixed to specific values.

The “fundamental problem of causal inference” [Rubin, 1978, Pearl, 2009] is that these individual differences cannot be observed.
Thus, similar to prior work [Dempsey et al., 2020, Boruvka et al., 2017], in this paper averages of potential outcomes are considered in defining treatment effects. Let $S_t(\bar a_{t-1})$ denote a vector of potential moderator variables chosen from $H_t(\bar a_{t-1})$, the cluster-level history up to decision point $t$. We write $S_t(\bar a_{t-1}) = (S_{t,j}(\bar a_{t-1}), S_{t,-j}(\bar a_{t-1}))$ to clarify that the potential moderator variables can contain information on the selected individual as well as on other individuals in the cluster. The moderated direct treatment effect, denoted $\beta(t; s)$, can then be defined as
\[ E\left[ Y_{t+1,J}\big(\bar A_{t,-J}, (\bar A_{t-1,J}, 1)\big) - Y_{t+1,J}\big(\bar A_{t,-J}, (\bar A_{t-1,J}, 0)\big) \,\middle|\, \big(S_{t,J}(\bar A_{t-1}), S_{t,-J}(\bar A_{t-1})\big) = (s_J, s_{-J}) \right] \tag{3} \]
where $J$ is a uniformly distributed random index defined on $[G]$. The expectation is over the potential outcomes ($Y_{t+1,J}(\cdot)$), the set of randomized treatments ($\bar A_{t,-J}$ and $\bar A_{t-1,J}$), and the random index ($J$). Different choices of variables in $S_t(\bar A_{t-1})$ address a variety of scientific questions. A primary analysis, for instance, may focus on the marginal proximal effects and therefore set $S_t(\bar A_{t-1}) = \emptyset$. A second analysis may focus on assessing the effect conditional on variables only related to the individual indexed by $J$ and set $S_t(\bar A_{t-1}) = X_{t,J}(\bar A_{t-1,J})$, where $X_{t,J}(\bar A_{t-1,J}) \subset O_{t,J}(\bar A_{t-1,J})$ is a potential moderator of interest. A third analysis may consider moderation by the average value across all individuals, $S_t(\bar A_{t-1}) = G^{-1} \sum_j X_{t,j}(\bar A_{t-1,j})$, or the joint moderation of the individual’s mood and the average of the other individuals’ moods, $S_t(\bar A_{t-1}) = \big( X_{t,j}(\bar A_{t-1,j}), \; G^{-1} \sum_{j' \neq j} X_{t,j'}(\bar A_{t-1,j'}) \big)$.

Indirect causal effects.
Of secondary interest is the effect of providing treatment versus not providing treatment to the $j'$th individual at time $t$ on a different individual $j$’s proximal response when individual $j$ does not receive treatment, i.e., within-cluster treatment interference. This indirect causal effect is given by
\[ \frac{1}{G \cdot (G-1)} \sum_{j' \neq j} \Big[ Y_{t+1,j}\big(\bar a_{t,-\{j,j'\}}, (\bar a_{t-1,j}, a_{t,j} = 0), (\bar a_{t-1,j'}, a_{t,j'} = 1)\big) - Y_{t+1,j}\big(\bar a_{t,-\{j,j'\}}, (\bar a_{t-1,j}, a_{t,j} = 0), (\bar a_{t-1,j'}, a_{t,j'} = 0)\big) \Big]. \]
Again, since the individual differences cannot be observed, we consider averages of potential outcomes in defining indirect treatment effects. The moderated indirect treatment effect, denoted $\beta^{(IE)}(t; s)$, can be defined by
\[ E\left[ Y_{t+1,J}\big(\bar A_{t,-\{J,J'\}}, (\bar A_{t-1,J}, 0), (\bar A_{t-1,J'}, 1)\big) - Y_{t+1,J}\big(\bar A_{t,-\{J,J'\}}, (\bar A_{t-1,J}, 0), (\bar A_{t-1,J'}, 0)\big) \,\middle|\, S_t(\bar A_{t-1}) \right] \tag{4} \]
where $J'$ is a uniformly distributed random index on the set $[G] \setminus \{J\}$. The expectation is again over the potential outcomes ($Y_{t+1,J}(\cdot)$), the set of randomized treatments ($\bar A_{t,-\{J,J'\}}$, $\bar A_{t-1,J}$, and $\bar A_{t-1,J'}$), and the random indices ($J$ and $J'$). Here, the potential moderator can be written as $S_t(\bar A_{t-1}) = \big( S_{t,J}(\bar A_{t-1}), S_{t,J'}(\bar A_{t-1}), S_{t,-\{J,J'\}}(\bar A_{t-1}) \big)$ to clarify that the variables can contain information on the two selected individuals as well as on others in the cluster. Note that a similar effect can be defined when the individual does receive treatment. For now, we focus on the effect defined by (4).

It is useful to distinguish the above estimands, the traditional MRT estimand [Boruvka et al., 2017], and estimands commonly studied in longitudinal treatment effect estimation [Robins, 1986].
In the causal inference literature, the typical estimand is an expected outcome for a particular sequence of treatments, i.e., $E[Y(a_1, \ldots, a_T)]$. Such estimands do not depend on the treatment distribution, but are often not of primary interest in the current MRT setting since many sequences of treatment may never be observed in finite samples given the large number of decision points (often hundreds or thousands).

The estimands considered in this paper are most similar to average outcomes under a particular dynamic treatment regime, $E_\pi[Y(A_1, \ldots, A_T)]$, where $\pi$ denotes the dynamic treatment regime from which the treatments are drawn [Murphy et al., 2001]. Indeed, for any $A_{u,j}$ not contained in $S_t(\bar A_{t-1})$, the direct and indirect effects depend on the distribution of $\{A_{u,j}\}_{u}$ […]

Assumption 3.1. We assume consistency, positivity, and sequential ignorability:
• Consistency: For each $t \le T$ and $j \in [G]$, $\{ Y_{t+1,j}(\bar A_t), O_{t,j}(\bar A_{t-1}), A_{t,j}(\bar A_{t-1}) \} = \{ Y_{t+1,j}, O_{t,j}, A_{t,j} \}$. That is, the observed values equal the corresponding potential outcomes;
• Positivity: if the joint density of $\{ H_t = h_t, A_t = a_t \}$ is greater than zero, then $P(A_t = a_t \mid H_t = h_t) > 0$;
• Sequential ignorability: for each $t \le T$, the potential outcomes, $\{ Y_{2,j}(a_1), O_{2,j}(a_1), A_{2,j}(a_1), \ldots, Y_{T,j}(\bar a_{T-1}) \}_{j \in [G],\, \bar a_{T-1} \in \{0,1\}^{(T-1) \times G}}$, are independent of $A_{t,j}$ conditional on the observed history $H_t$.

Sequential ignorability and, assuming all of the randomization probabilities are bounded away from 0 and 1, positivity, are guaranteed for a cluster MRT by design. In a standard MRT, the randomization probabilities may depend on the individual’s observed history, so $\Pr(A_t = a_t \mid H_t = h_t) = \prod_{j=1}^{G} \Pr(A_{t,j} = a_{t,j} \mid H_{t,j} = h_{t,j})$ and the positivity constraint can be placed on the individual-level randomization probabilities.
Here, we allow for the possibility that the randomization probabilities depend on the joint cluster-level history. Consistency is a necessary assumption for linking the potential outcomes as defined here to the data. Since an individual’s outcomes may be influenced by the treatments provided to other individuals in the same cluster, consistency holds due to our use of a group-based conceptualization of potential outcomes as seen in Hong and Raudenbush [2006] and Vanderweele et al. [2013].

Lemma 3.2. Under Assumption 3.1, the moderated direct treatment effect $\beta(t; s)$ is equal to
\[ E\left[ \frac{1}{G} \sum_{j=1}^{G} \Big\{ E[Y_{t+1,j} \mid H_t, A_{t,j} = 1] - E[Y_{t+1,j} \mid H_t, A_{t,j} = 0] \Big\} \,\middle|\, S_t \right] \]
where each expectation is with respect to the distribution of the data collected using the treatment assignment probabilities. Under Assumption 3.1, the moderated indirect treatment effect $\beta^{(IE)}(t; s)$ is equal to
\[ E\left[ \frac{1}{G(G-1)} \sum_{j \neq j'} \Big\{ E[Y_{t+1,j} \mid H_t, A_{t,j} = 0, A_{t,j'} = 1] - E[Y_{t+1,j} \mid H_t, A_{t,j} = 0, A_{t,j'} = 0] \Big\} \,\middle|\, S_t \right]. \]
The proof of Lemma 3.2 can be found in Appendix A.

In this section, we consider estimation of the two estimands defined in Lemma 3.2 using cluster MRT data.

We make the following assumptions regarding the direct treatment effect specification:

Assumption 4.1. Assume the treatment effect, $\beta(t; S_t)$, of interest satisfies
\[ E\big[ E[Y_{t+1,J} \mid H_t, A_{t,J} = 1] - E[Y_{t+1,J} \mid H_t, A_{t,J} = 0] \,\big|\, S_t \big] = f_t(S_t)^\top \beta^\star \]
where $f_t(s) \in \mathbb{R}^p$ is a $p$-dimensional vector function of $s$ and time $t$.

We now consider inference on the unknown $p$-dimensional parameter $\beta^\star$.
To do so, we define the weight $W_{t,j}$ at decision time $t$ for the $j$th individual as $W_{t,j} = \tilde p_t(A_{t,j} \mid S_t) / p_t(A_{t,j} \mid H_t)$, where $\tilde p_t(a \mid S_t) \in (0, 1)$ is arbitrary as long as it does not depend on terms in $H_t$ other than $S_t$, and $p_t(A_{t,j} \mid H_t)$ is the marginal probability that individual $j$ receives treatment $A_{t,j}$ given $H_t$. For now, we consider pre-specified $\tilde p_t(a \mid S_t)$ (i.e., one that does not depend on the observed MRT data). Here we consider an estimator $\hat\beta$ of $\beta^\star$ which is the minimizer of the following cluster-based, weighted-centered least-squares (C-WCLS) criterion, minimized over $(\alpha, \beta)$:
\[ \mathbb{P}_M \left[ \frac{1}{G_m} \sum_{j=1}^{G_m} \sum_{t=1}^{T} W_{t,j} \left( Y_{t+1,j} - g_t(H_t)^\top \alpha - (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)^\top \beta \right)^2 \right] \tag{5} \]
where $\mathbb{P}_M$ is defined as the average of a function over the sample, which in this context is the sample of clusters rather than the sample of individuals as in traditional MRT settings. In Appendix A, we prove the following result.

Lemma 4.2. Under Assumption 4.1, and given invertibility and moment conditions, the estimator $\hat\beta$ that minimizes (5) satisfies $\sqrt{M} (\hat\beta - \beta^\star) \to N(0, Q^{-1} W Q^{-1})$, where
\[ Q = E\left[ \sum_{t=1}^{T} \tilde p_t(1 \mid S_t) (1 - \tilde p_t(1 \mid S_t)) f_t(S_t) f_t(S_t)^\top \right] \]
and
\[ W = E\left[ \sum_{t=1}^{T} W_{t,J} \, \epsilon_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,\tilde J} \, \epsilon_{t,\tilde J} (A_{t,\tilde J} - \tilde p_t(1 \mid S_t)) f_t(S_t)^\top \right] \]
where $\epsilon_{t,j} = Y_{t+1,j} - g_t(H_t)^\top \alpha^\star - (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)^\top \beta^\star$, $\alpha^\star$ minimizes the least-squares criterion $E\big[ G^{-1} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} (Y_{t+1,j} - g_t(H_t)^\top \alpha)^2 \big]$, and both $J$ and $\tilde J$ are independent randomly sampled indices from the cluster.

In practice, plug-in estimates $\hat Q$ and $\hat W$ are used to estimate the covariance structure.
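The estimation steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cwcls_direct` and its argument layout (one row per individual and decision time) are hypothetical, the numerator probability is assumed pre-specified, and the covariance is a generic cluster-level sandwich computed over the stacked parameter $(\alpha, \beta)$ with no small-sample correction.

```python
import numpy as np

def cwcls_direct(Y, A, g, f, p_tilde, p_rand, cluster, subject):
    """Sketch of a C-WCLS fit for the direct effect (hypothetical helper).

    Rows are (individual, decision time) pairs.
    Y: outcomes; A: binary treatments; g: (N, p) controls; f: (N, q) moderators;
    p_tilde: numerator probability of treatment; p_rand: randomization probability;
    cluster / subject: integer labels per row.
    """
    Y = np.asarray(Y, float)
    A = np.asarray(A, float)
    g = np.asarray(g, float)
    f = np.asarray(f, float)
    p_tilde = np.asarray(p_tilde, float)
    p_rand = np.asarray(p_rand, float)
    cluster = np.asarray(cluster)
    subject = np.asarray(subject)

    # W_{t,j} = p~_t(A_{t,j} | S_t) / p_t(A_{t,j} | H_t)
    W = np.where(A == 1, p_tilde, 1 - p_tilde) / np.where(A == 1, p_rand, 1 - p_rand)

    labels, cidx = np.unique(cluster, return_inverse=True)
    M = len(labels)
    # Cluster sizes G_m: number of distinct individuals in each cluster.
    G = np.array([np.unique(subject[cidx == m]).size for m in range(M)])
    w = W / G[cidx]  # the 1/G_m factor makes each cluster count equally

    # Design: controls, then centered treatment times moderator features.
    X = np.column_stack([g, (A - p_tilde)[:, None] * f])
    XtWX = (X * w[:, None]).T @ X
    theta = np.linalg.solve(XtWX, (X * w[:, None]).T @ Y)

    # Cluster-level estimating functions, then a plug-in sandwich covariance.
    resid = Y - X @ theta
    U = np.zeros((M, X.shape[1]))
    np.add.at(U, cidx, X * (w * resid)[:, None])
    bread_inv = np.linalg.inv(XtWX / M)
    cov = bread_inv @ (U.T @ U / M) @ bread_inv.T / M

    q = f.shape[1]
    return theta[-q:], cov[-q:, -q:]
```

Summing the estimating functions within a cluster before forming the outer product is what lets residual correlation between members of the same cluster enter the variance estimate; treating each individual as its own cluster recovers the usual individual-level sandwich.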
Appendix C presents small sample size adjustments that we use in simulation and analysis in this paper. It is clear from Lemma 4.2 that use of standard MRT data analytic methods, i.e., equation (1), will produce biased estimates for the marginal effect of interest.

4.2 Indirect Effect Estimation

We make the following assumptions regarding the indirect treatment effect specification:

Assumption 4.3. We assume the treatment effect of interest satisfies
\[ E\big[ E[Y_{t+1,J} \mid H_t, A_{t,J} = 0, A_{t,J'} = 1] - E[Y_{t+1,J} \mid H_t, A_{t,J} = 0, A_{t,J'} = 0] \,\big|\, S_t \big] = f_t(S_t)^\top \beta^{\star\star} \]
where $f_t(s) \in \mathbb{R}^p$ is a $p$-dimensional vector function of $s$ and time $t$.

We now consider inference on the unknown $p$-dimensional parameter $\beta^{\star\star}$. To do so, we define a new weight $W_{t,j,j'}$ at decision time $t$ for the $j$th individual as $W_{t,j,j'} = \tilde p_t(A_{t,j}, A_{t,j'} \mid S_t) / p_t(A_{t,j}, A_{t,j'} \mid H_t)$, where $\tilde p_t(a, a' \mid S_t) \in (0, 1)$ is arbitrary as long as it does not depend on terms in $H_t$ other than $S_t$, and $p_t(A_{t,j}, A_{t,j'} \mid H_t)$ is the marginal probability that individuals $j$ and $j'$ receive treatments $A_{t,j}$ and $A_{t,j'}$ respectively given $H_t$. Here we consider an estimator $\hat\beta^{(IE)}$ of $\beta^{\star\star}$ which is the minimizer of the following cluster-based weighted-centered least-squares (C-WCLS) criterion, minimized over $(\alpha, \beta)$:
\[ \mathbb{P}_M \left[ \frac{1}{G_m (G_m - 1)} \sum_{j \neq j'} \sum_{t=1}^{T} W_{t,j,j'} \left( Y_{t+1,j} - g_t(H_t)^\top \alpha - (1 - A_{t,j})(A_{t,j'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)^\top \beta \right)^2 \right] \tag{6} \]
where
\[ \tilde p^\star_t(1 \mid S_t) = \frac{\tilde p_t(0, 1 \mid S_t)}{\tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t)}. \]
If an individual’s randomization probabilities only depend on their own observed history, then $\tilde p^\star_t(1 \mid S_{t,j'}) = \tilde p_t(1 \mid S_{t,j'})$. In Appendix A, we prove the following result.

Lemma 4.4.
Under Assumption 4.3, and under invertibility and moment conditions, the estimator $\hat\beta^{(IE)}$ that minimizes (6) satisfies $\sqrt{M} (\hat\beta^{(IE)} - \beta^{\star\star}) \to N(0, Q^{-1} W Q^{-1})$, where
\[ Q = E\left[ \sum_{t=1}^{T} \big( \tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t) \big) \, \tilde p^\star_t(1 \mid S_t) (1 - \tilde p^\star_t(1 \mid S_t)) f_t(S_t) f_t(S_t)^\top \right] \]
and
\[ W = E\left[ \sum_{t=1}^{T} W_{t,J,J'} \, \epsilon_{t,J,J'} (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,\tilde J,\tilde J'} \, \epsilon_{t,\tilde J,\tilde J'} (1 - A_{t,\tilde J})(A_{t,\tilde J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)^\top \right] \]
where $\epsilon_{t,j,j'} = Y_{t+1,j} - g_t(H_t)^\top \alpha^{\star\star} - (1 - A_{t,j})(A_{t,j'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)^\top \beta^{\star\star}$, $\alpha^{\star\star}$ minimizes the least-squares criterion $E\big[ \frac{1}{G_m (G_m - 1)} \sum_{j \neq j'} \sum_{t=1}^{T} W_{t,j,j'} (Y_{t+1,j} - g_t(H_t)^\top \alpha)^2 \big]$, and both $(J, J')$ and $(\tilde J, \tilde J')$ are independently, randomly sampled pairs from the cluster (without replacement).

In Appendix E, the above variances are re-derived in the general setting where the numerators $\tilde p_t(a \mid S_t)$ and $\tilde p_t(a, a' \mid S_t)$ are estimated using the observed MRT data.

A natural question is whether there are any conditions under which the standard MRT analysis presented in Section 2.2 is equivalent to the proposed cluster MRT analysis for the direct effect presented in Section 4. It is clear from Lemma 4.2 that if the cluster size is one then we recover the estimators and asymptotic theory for standard MRT analyses. Lemma 4.5 below proves that, under certain conditions, even when cluster sizes are greater than one, an equivalence of point estimates and asymptotic variances is guaranteed.

Lemma 4.5.
Consider the direct effect when the moderator is defined on the individual (i.e., $S_{t,j}$), and the randomization probabilities only depend on the individual’s observed history, i.e., $p(A_{t,j} \mid H_t) = p(A_{t,j} \mid H_{t,j})$. If cluster size is constant across clusters (i.e., $G_m \equiv G$), then the point estimates from (1) and (5) are equal for any sample size. Moreover, if
\[ E\big[ E[\epsilon_{t,j} \, \epsilon_{t',j'} \mid H_{t,j}, A_{t,j} = a, H_{t',j'}, A_{t',j'} = a'] \,\big|\, S_{t,j}, S_{t,j'} \big] = \psi(S_{t,j}, S_{t,j'}), \tag{7} \]
for some function $\psi$, i.e., the cross-term is constant in $a$ and $a'$, where $\epsilon_{t,j}$ is the error defined in Lemma 4.2, then the estimators share the same asymptotic variance.

The proof of Lemma 4.5 can be found in Appendix B. Here, we consider the following class of random effect models to help with interpretation of the sufficient condition (7). Specifically, for participant $j$ at decision time $t$ suppose the generative model for the proximal outcome is
\[ Y_{t+1,j} = g_t(H_{t,j})^\top \alpha + \underbrace{Z_{t,j}^\top b_g}_{(I)} + (A_{t,j} - p_t(1 \mid H_{t,j})) \big( f_t(H_{t,j})^\top \beta + \underbrace{Z_{t,j}^\top \tilde b_g}_{(II)} \big) + e_{t,j} \]
where (I) and (II) are random effects with design matrix $Z_{t,j}$, $E[f_t(H_{t,j})^\top \beta \mid S_{t,j}] = f_t(S_{t,j})^\top \beta$, and $e_{t,j}$ is a participant-specific error term. The treatment effect conditional on the complete observed history and the random effects is $f_t(H_{t,j})^\top \beta + Z_{t,j}^\top \tilde b_g$, which implies the marginal causal effect is $f_t(S_{t,j})^\top \beta$, so Assumption 4.1 holds. Random effects in (I) allow for cluster-level variation in baseline values of the proximal outcome, while random effects in (II) allow for cluster-level variation in the fully-conditional treatment effect.
Given the above generative model, sufficient condition (7) holds if $\tilde b_g \equiv 0$, i.e., when the treatment effect does not exhibit cluster-level variation. To see this, note that the errors $\epsilon_{t,j}$ and $\epsilon_{t,j'}$ as defined in Lemma 4.2 do not depend on treatment if $\tilde b_g = 0$. The cross-term is non-zero ($\psi \not\equiv 0$) because there is within-cluster variation due to random effects in (I). For this reason, we refer to (7) as a treatment-effect heterogeneity condition: when clusters exhibit treatment-effect heterogeneity, the marginal residual correlation depends on $a$ and $a'$, which means the proposed approach, rather than standard MRT analyses, is necessary for assessing direct effects. In Section 5, we conduct simulations to support this calculation.

To evaluate the performance of the proposed cluster-level proximal treatment effect estimator, we extend the simulation setup in Boruvka et al. [2017]. Consider a standard MRT where the randomization probability $p_t(1 \mid H_t)$ is known. The observation vector $O_t$ is a single state variable $S_t \in \{-1, 1\}$ at each decision time $t$. Assume the response, $Y_{t+1}$, follows a linear model given by:
\[ Y_{t+1} = \theta_1 \{ S_t - E[S_t \mid A_{t-1}, H_{t-1}] \} + \theta_2 \{ A_{t-1} - p_{t-1}(1 \mid H_{t-1}) \} + \{ A_t - p_t(1 \mid H_t) \} (\beta_{10} + \beta_{11} S_t) + e_{t+1} \tag{8} \]
The randomization probability is given by $p_t(1 \mid H_t) = \mathrm{expit}(\eta_1 A_{t-1} + \eta_2 S_t)$, where $\mathrm{expit}(x) = (1 + \exp(-x))^{-1}$; the state dynamics are given by $P(S_t = 1 \mid A_{t-1}, H_{t-1}) = \mathrm{expit}(\xi A_{t-1})$ (note $A_0 = 0$), and the independent error term satisfies $e_t \sim N(0, 1)$ with $\mathrm{Corr}(e_u, e_t) = 0.[\ldots]^{|u - t| / [\ldots]}$. To keep similar to Boruvka et al. [2017], in the scenarios below we set $\theta_1 = 0.[\ldots]$, $\theta_2 = 0$, $\xi = 0$, $\eta_1 = -[\ldots]$, $\eta_2 = 0.[\ldots]$, $\beta_{10} = -0.2$, and $\beta_{11} = 0.2$. It is readily seen that because $\xi = 0$, $E[S_t] = 0.5 \times (-1) + 0.5 \times 1 = 0$ and $E\big[ E[Y_{t+1} \mid A_t = 1, H_t] - E[Y_{t+1} \mid A_t = 0, H_t] \big] = \beta_{10} + \beta_{11} E[S_t] = \beta_{10}$.
The marginal treatment effect is thus constant in time and is given by $\beta^\star = \beta_{10} = -0.2$. In each case, when the weighted and centered method is used, we set $f_t(S_t) = 1$ in (8) (i.e., $S_t = \emptyset$). We report average $\hat\beta$ point estimates, standard deviation (SE) and root mean squared error (RMSE) of $\hat\beta$, and 95% confidence interval coverage probabilities (CP) across 1000 replicates. We vary the number of clusters and cluster size. Confidence intervals are constructed based on standard errors that are corrected for the estimation of weights and/or small samples. We compare the proposed cluster-based method (C-WCLS) to the approach by Boruvka et al. [2017] (WCLS).

Simulation Scenario I. The first scenario concerns the estimation of $\beta^\star$ when an important individual-level moderator exists and proximal outcomes share a random cluster-level intercept term that does not interact with treatment. In the data generative model, we incorporate a cluster-level random intercept $e_g \sim N(0, […])$:
\[
Y_{t+1,j} = (-0.2 + 0.2 \cdot S_{t,j}) \times (A_{t,j} - p_t(1 \mid H_{t,j})) + 0.[…]\, S_{t,j} + e_g + e_{t+1,j}.
\]
Table 2 presents the results, which show that both WCLS and the proposed C-WCLS approach achieve unbiasedness and proper coverage. This is in line with our theoretical results in Section 4.3 stating asymptotic equivalence of the two procedures under no cluster-level treatment heterogeneity.

Simulation Scenario II. In the second scenario, we extend the above generative model to include a random cluster-level intercept term that interacts with treatment by considering the linear model
\[
Y_{t+1,j} = (-0.2 + b_g + 0.2 \cdot S_{t,j}) \times (A_{t,j} - p_t(1 \mid H_{t,j})) + 0.[…] \cdot S_{t,j} + e_g + e_{t+1,j} \qquad (9)
\]
where $b_g \sim N(0, 0.1)$ is a random-intercept term within the treatment effect per cluster. Table 2 presents the results. This demonstrates that if cluster-level random effects interact with treatment (i.e., $b_g \not\equiv 0$), […] $\beta$, but only the proposed method achieves the nominal 95% coverage probability.
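The contrast between individual-level and cluster-level uncertainty in Scenario II can be sketched numerically. The block below is a minimal illustration, not the authors' simulation code, and it simplifies the setup in ways that are assumptions: a constant randomization probability `p = 0.5` (so all weights equal 1), independent errors, and illustrative values for `theta`, `var_eg`, and `var_bg`. The helper names `simulate` and `wcls_fit` are hypothetical. It fits the centered working model with $f_t(S_t) = 1$ and compares a sandwich standard error that sums scores within individuals against one that sums scores within clusters; the latter is a plain cluster-robust sandwich without the small-sample correction.

```python
import numpy as np

def simulate(M=50, G=10, T=30, p=0.5, beta0=-0.2, beta1=0.2,
             theta=0.8, var_eg=0.5, var_bg=0.1, seed=0):
    """Draw one data set; every default value is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    n = M * G * T
    cluster = np.repeat(np.arange(M), G * T)   # cluster id of each row
    person = np.repeat(np.arange(M * G), T)    # individual id of each row
    S = rng.choice([-1.0, 1.0], size=n)        # binary state with mean zero
    A = rng.binomial(1, p, size=n).astype(float)
    e_g = rng.normal(0.0, np.sqrt(var_eg), M)[cluster]  # cluster intercept
    b_g = rng.normal(0.0, np.sqrt(var_bg), M)[cluster]  # cluster effect shift
    e = rng.normal(0.0, 1.0, n)
    Y = (beta0 + b_g + beta1 * S) * (A - p) + theta * S + e_g + e
    return Y, A, S, cluster, person

def wcls_fit(Y, A, S, p, groups):
    """Centered least squares with f_t = 1 and a sandwich SE grouped by `groups`."""
    X = np.column_stack([np.ones_like(S), S, A - p])
    bread = X.T @ X
    coef = np.linalg.solve(bread, X.T @ Y)
    resid = Y - X @ coef
    scores = X * resid[:, None]
    # Sum the row-level scores within each group before forming the meat.
    _, inv = np.unique(groups, return_inverse=True)
    U = np.zeros((inv.max() + 1, X.shape[1]))
    np.add.at(U, inv, scores)
    bread_inv = np.linalg.inv(bread)
    cov = bread_inv @ (U.T @ U) @ bread_inv
    return coef[-1], np.sqrt(cov[-1, -1])

Y, A, S, cluster, person = simulate()
beta_hat, se_person = wcls_fit(Y, A, S, 0.5, person)   # ignores clustering
_, se_cluster = wcls_fit(Y, A, S, 0.5, cluster)        # accounts for clustering
print(beta_hat, se_person, se_cluster)
```

Both fits return the same point estimate; only the grouping used in the middle of the sandwich changes, which is why ignoring clusters affects coverage rather than bias in this sketch.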
To further demonstrate this, Figure 1 presents observed coverage as a function of the ratio of the variance of the random effect $b_g$ to the variance of the random effect $e_g$, and as a function of group size. The coverage probability of the WCLS method decays rapidly, while the proposed method achieves the nominal 95% coverage probability for all choices of the variance of $b_g$ and as the group size increases. Note that even when group size is 5 (i.e., small groups), the coverage of the WCLS method drops to 80%.

Figure 1: C-WCLS offers more valid 95% confidence intervals than WCLS in Scenario II. Observed coverage probability varies by group size ($G$) (total sample size fixed) (left), and by the variance of the random intercept $b_g$ as a fraction of the variance of the random error term $e_g$ (right).

Simulation Scenario III. In the third scenario, we assume the treatment effect for an individual depends on the average state of all individuals in the cluster. Therefore, we define the cluster-level moderator $\bar S_{t,g} = \frac{1}{G_g} \sum_{j=1}^{G_g} S_{t,j}$ and consider the generative model:
\[
Y_{t+1,j} = (-0.2 + b_g + 0.2 \cdot \bar S_{t,g}) \times (A_{t,j} - p_t(1 \mid H_{t,j})) + 0.[…]\, S_{t,j} + e_g + e_{t+1,j}.
\]
The proposed estimator again achieves the nominal 95% coverage probability while the WCLS method does not (see Scenario III, Table 2).

Simulation Scenario IV. The fourth scenario considers the indirect effect. For individual $j$ at decision point $t$, we construct the total effect as
\[
TE_{j,t} = \sum_{j' \neq j} \{ A_{t,j'} - \tilde p_{t,j'}(1 \mid H_t) \} (-0.1 + 0.2\, S_{t,j'}).
\]
The generative model is given by:
\[
Y_{t+1,j} = (-0.2 + b_g + 0.2 \cdot \bar S_{t,g}) \times \{ A_{t,j} - p_t(1 \mid H_{t,j}) \} + 0.[…]\, S_{t,j} + TE_{j,t} + e_g + e_{t+1,j}.
\]
This model implies a marginal indirect effect equal to $\beta^{(IE)}_1 = -0.1$. Table 3 presents the results. We see that the proposed indirect estimator exhibited nearly no bias and achieved the nominal coverage probability.
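The total effect term $TE_{j,t}$ of Scenario IV sums the centered treatment contributions of every other cluster member. A minimal sketch for one cluster at one decision time is given below; the function name `total_effect` and the argument names `intercept` and `slope` are hypothetical, and the example inputs are made up for illustration.

```python
import numpy as np

def total_effect(A, p_tilde, S, intercept, slope):
    """TE_j = sum over j' != j of (A_j' - p_tilde_j') * (intercept + slope * S_j').

    A, p_tilde and S are length-G arrays for one cluster at one decision time.
    """
    contrib = (A - p_tilde) * (intercept + slope * S)
    # Leave-one-out sum: the cluster total minus each individual's own term.
    return contrib.sum() - contrib

# Illustrative inputs for a cluster of size 4 (values are assumptions).
A = np.array([1.0, 0.0, 1.0, 0.0])
p_tilde = np.full(4, 0.5)
S = np.array([1.0, -1.0, -1.0, 1.0])
te = total_effect(A, p_tilde, S, intercept=-0.1, slope=0.2)
print(te)
```

Subtracting each individual's own contribution from the cluster total avoids an explicit double loop over pairs.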
The Intern Health Study (IHS) was a 6-month MRT on 1,565 medical interns in which four types of weekly notification (mood, activity, sleep, or none) were randomly assigned with equal probability to each subject [NeCamp et al., 2020] (see Section 3.1). In IHS, 273 institutions and 14 specialties were observed. Here, we assess the effect of the three types of notifications (mood, activity, and sleep) compared to no notifications on the weekly average of self-reported mood scores and log step-count for the population of interns. See Appendix D for an additional analysis of log sleep minutes. Due to high levels of missing data, weekly average mood scores were computed from multiply imputed daily self-reported scores. Similar imputation was performed for daily log step-counts and daily log sleep minutes. See NeCamp et al. [2020] for further details.

Let $t = 1, \ldots, T$ denote the weekly decision points at which the individual is randomized to the various types of notifications. Because of the form of the intervention, all participants were available for this intervention throughout the study; i.e., $I_t \equiv 1$. The proximal outcome $Y_{j,t+1}$ is weekly mood score, which is reported on a Likert scale taking values from 1 to 10 (higher scores mean better mood). We collapse notifications to a binary variable, i.e., $A_{t,j} = 1$ if the individual was assigned to receive either mood, activity, or sleep notifications in week $t$; otherwise, $A_{t,j} = 0$ and no notifications were sent that week. At any occasion $t$, an individual's notification randomization probabilities depended only on their past observed history $H_{t,j}$. For simplicity, we start with clusters constructed based on medical specialty. The average cluster size was 120; the first and third quartiles were 56 and […]40 respectively, with maximum and minimum sizes of 353 and 24.
For every individual in each cluster at each decision point we compute the average prior weekly mood score for all others in the cluster, denoted $\bar Y_{-j,t}$ for the $j$th individual in the cluster. We consider two moderation analyses that can both be expressed as
\[
\beta(t; S_t) = \beta_0 + \beta_1 \cdot Y_{j,t-1} + \beta_2 \bar Y_{-j,t-1}.
\]
The first set of moderation analyses considers the standard moderation analysis where only individual-level moderators are included (i.e., $\beta_2 = 0$), with or without accounting for cluster-level moderation effect heterogeneity. Table 4 presents the results and compares our proposed approach against the WCLS approach from Boruvka et al. [2017]. In this case, the effects do not change much for the average weekly mood analysis; however, the significant effect of messages on weekly log step count under the traditional MRT analysis becomes insignificant when accounting for cluster effects.

The second moderation analysis lets $\beta_2$ be a free parameter, enabling novel moderation analyses that account for the average previous weekly mood score of other individuals. Table 4 presents the results. Here, we see that the constant term $\beta_0$ becomes negative but insignificant while the new term $\beta_2$ is positive and significant. The results suggest the average mood of others in the cluster may moderate the effect of notifications. Therefore, the impact of a notification is larger when the average mood score of others in the specialty is large while the individual's score is low. Similar results hold for the log step-count analysis.

Finally, we consider indirect moderation effect analyses. In this analysis, clusters are constructed based on medical specialty and institution. This was done because interference is only likely when interns are in close geographic proximity. Here, we consider the marginal indirect effect (i.e., no moderators) both when the individual did not receive the intervention and when the individual did receive an intervention at decision time $t$. Table 5 presents the results.
In this case, the estimated indirect effects are weaker than the direct effects.

Table 4
Outcome          Setting   Variable    Estimate   Std. Error   t-value   p-value
Mood             WCLS      $\beta_0$   […]        […]          […]       […]
                           $\beta_1$   -0.055     0.011        -4.822    0.000
                 C-WCLS    $\beta_0$   […]        […]          […]       […]
                           $\beta_1$   -0.053     0.014        -3.868    0.000
                 C-WCLS    $\beta_0$   -0.238     0.282        -0.842    0.401
                           $\beta_1$   -0.054     0.014        -3.973    0.000
                           $\beta_2$   […]        […]          […]       […]
Log step-count   WCLS      $\beta_0$   […]        […]          […]       […]
                           $\beta_1$   -0.037     0.015        -2.484    0.015
                 C-WCLS    $\beta_0$   […]        […]          […]       […]
                           $\beta_1$   -0.031     0.019        -1.580    0.117
                 C-WCLS    $\beta_0$   -2.095     1.248        -1.678    0.094
                           $\beta_1$   -0.034     0.019        -1.745    0.083
                           $\beta_2$   […]        […]          […]       […]

Table 5 (columns: Outcome, Variables, Estimate, Std. Error, t-value, p-value): the coefficient $\beta$ represents the indirect effect under $A_{j,t} = 0$, while the coefficient $\tilde\beta$ represents the indirect effect under $A_{j,t} = 1$.

Here we consider the causal excursion effect in the presence of a priori known clusters. We have extended the causal excursion effect to naturally account for cluster information. In particular, both direct and indirect excursion effects have been formalized in the context of MRTs to account for potential interference. The effects described in this paper are most important when using MRT data to build optimized JITAIs for deployment in an mHealth package. Specifically, the estimation procedure for the direct excursion effect accounts for within-cluster correlation in the proximal outcomes, which helps the scientific team avoid making erroneous conclusions about intervention effectiveness using standard MRT methods. Moreover, estimation of indirect effects allows the scientific team to answer questions about the impact of interventions on other members of the same cluster. Use of these methods provides empirical evidence for the scientific team to include or exclude intervention components that may have had unanticipated second-order effects, or potentially leads to novel ways to improve the intervention component by revising the intervention to more explicitly account for cluster-level interference. While this work represents a major step forward in the analysis of micro-randomized trial data, further work is required.
Specifically, extensions to be considered in future work include accounting for overlapping communities and/or network (rather than cluster-only) structure [Ogburn and VanderWeele, 2014, Papadogeorgou et al., 2019], accounting for general non-continuous proximal outcomes such as binary or count outcomes [Qian et al., 2020], penalization of the working model to allow for high-dimensional moderators, and a method to use the proposed approach to form warm-start policies at the individual level while accounting for group-level information [Luckett et al., 2020].

References

A. Boruvka, D. Almirall, K. Witkiewitz, and S.A. Murphy. Assessing time-varying causal effect moderation in mobile health. To appear in the Journal of the American Statistical Association, 2017.

L.M. Collins. Optimization of Behavioral, Biobehavioral, and Biomedical Interventions. Springer International Publishing, 2018.

W. Dempsey, P. Liao, and S.A. Murphy. The stratified micro-randomized trial design: Sample size considerations for testing nested causal effects of time-varying treatments. Annals of Applied Statistics, 14(2):661–684, 2020.

G. Hong and S.W. Raudenbush. Evaluating kindergarten retention policy. Journal of the American Statistical Association, 101(475):901–910, 2006.

P. Klasnja, E.B. Hekler, S. Shiffman, A. Boruvka, D. Almirall, A. Tewari, and S.A. Murphy. Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychol, 34:1220–1228, 2015.

Predrag Klasnja, Shawna Smith, Nicholas J Seewald, Andy Lee, Kelly Hall, Brook Luers, Eric B Hekler, and Susan A Murphy. Efficacy of contextually tailored suggestions for physical activity: a micro-randomized optimization trial of HeartSteps. Annals of Behavioral Medicine, 53(6):573–582, 09 2018. ISSN 0883-6612. doi: 10.1093/abm/kay067. URL https://doi.org/10.1093/abm/kay067.

Daniel J. Luckett, Eric B. Laber, Anna R. Kahkoska, David M. Maahs, Elizabeth Mayer-Davis, and Michael R. Kosorok.
Estimating dynamic treatment regimes in mobile health using v-learning. Journal of the American Statistical Association, 115(530):692–706, 2020. doi: 10.1080/01621459.2018.1537919. URL https://doi.org/10.1080/01621459.2018.1537919. PMID: 32952236.

L.A. Mancl and T.A. DeRouen. A covariance estimator for GEE with improved small-sample properties. Biometrics, 57(1):126–134, 2001.

S A Murphy, M J van der Laan, J M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001. doi: 10.1198/016214501753382327. PMID: 20019887.

I. Nahum-Shani, S.N. Smith, B.J. Spring, L.M. Collins, K. Witkiewitz, A. Tewari, and S.A. Murphy. Just-in-time adaptive interventions (JITAIs) in mobile health: Key components and design principles for ongoing health behavior support. Annals of Behavioral Medicine, 2017.

Timothy NeCamp, Srijan Sen, Elena Frank, Maureen A Walton, Edward L Ionides, Yu Fang, Ambuj Tewari, and Zhenke Wu. Assessing real-time moderation for developing adaptive mobile health interventions for medical interns: Micro-randomized trial. Journal of Medical Internet Research, 22(3):e15033, March 2020. ISSN 1439-4456. doi: 10.2196/15033. URL https://europepmc.org/articles/PMC7157494.

E. Ogburn and T. VanderWeele. Causal diagrams for interference. Statistical Science, 29(4):559–578, 2014.

Georgia Papadogeorgou, Fabrizia Mealli, and Corwin M. Zigler. Causal inference with interfering units for cluster and population level treatment allocation programs. Biometrics, 75(3):778–787, 2019. doi: https://doi.org/10.1111/biom.13049. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13049.

J. Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.

Tianchen Qian, Hyesun Yoo, Predrag Klasnja, Daniel Almirall, and Susan Murphy. Estimating time-varying causal excursion effect in mobile health with binary outcomes. Biometrika, 09 2020. ISSN 0006-3444.
doi: 10.1093/biomet/asaa070. URL https://doi.org/10.1093/biomet/asaa070. asaa070.

J. Robins. A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9):1393–1512, 1986.

D.B. Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 6(1):34–58, 1978.

T.J. VanderWeele, G. Hong, S.M. Jones, and J.L. Brown. Mediation and spillover effects in group-randomized trials: A case study of the 4Rs educational intervention. Journal of the American Statistical Association, 108(502):469–482, 2013.

A Technical Details

Proof of Lemma 3.2. We establish Lemma 3.2 for the direct effect (3). For $a_s \in \{0, 1\}^G$, we consider
\[
E\Bigg[ \Bigg( \prod_{s=1}^{t-1} p_s(a_s \mid H_s(\bar a_{s-1})) \Bigg) \Bigg( \prod_{j' \neq j} p_t(a_{t,j'} \mid H_t(\bar a_{t-1})) \Bigg) Y_{t+1,j}(\bar a_{t,-j}, (\bar a_{t-1,j}, a))\, \mathbf{1}_{S_t(\bar a_t) = s} \Bigg]
\]
\[
= E\Bigg[ \Bigg( \prod_{s=1}^{t-1} p_s(a_s \mid H_s(\bar a_{s-1})) \Bigg) \Bigg( \prod_{j' \neq j} p_t(a_{t,j'} \mid H_t(\bar a_{t-1})) \Bigg) \mathbf{1}_{S_t(\bar a_t) = s}\, E\big[ Y_{t+1,j}(\bar a_{t,-j}, (\bar a_{t-1,j}, a)) \mid H_t(\bar a_{t-1}) \big] \Bigg]
\]
since the history $H_t$ includes the moderator variable $S_t$ at time $t$.
By consistency, $H_t(\bar A_{t-1}) = H_t$, so the above is equal to
\[
E\Bigg[ \Bigg( \prod_{s=1}^{t-1} p_s(a_s \mid H_s) \Bigg) \Bigg( \prod_{j' \neq j} p_t(a_{t,j'} \mid H_t) \Bigg) \mathbf{1}_{S_t(\bar a_t) = s}\, E\big[ Y_{t+1,j}(\bar a_{t,-j}, (\bar a_{t-1,j}, a)) \mid H_t \big] \Bigg]. \qquad (10)
\]
Sequential ignorability implies that
\[
E\big[ Y_{t+1,j}(\bar a_{t,-j}, (\bar a_{t-1,j}, a)) \mid H_t \big] = E\big[ Y_{t+1,j}(\bar a_{t,-j}, (\bar a_{t-1,j}, a)) \mid H_t, A_{t,j} = a \big].
\]
Summing over all potential outcomes and normalizing yields
\[
E\Bigg[ \sum_{\bar a_{t-1}} \frac{ \big( \prod_{s=1}^{t-1} p_s(a_s \mid H_s) \big) \big( \prod_{j' \neq j} p_t(a_{t,j'} \mid H_t) \big) \mathbf{1}_{S_t(\bar a_t) = s} }{ E\big[ \big( \prod_{s=1}^{t-1} p_s(a_s \mid H_s) \big) \big( \prod_{j' \neq j} p_t(a_{t,j'} \mid H_t) \big) \mathbf{1}_{S_t(\bar a_t) = s} \big] }\, E\big[ Y_{t+1,j}(\bar a_{t,-j}, (\bar a_{t-1}, a)) \mid H_t, A_{t,j} = a \big] \Bigg]
\]
\[
= E\Bigg[ \sum_{\bar a_{t-1}} \frac{ \big( \prod_{s=1}^{t-1} p_s(a_s \mid H_s) \big) \big( \prod_{j' \neq j} p_t(a_{t,j'} \mid H_t) \big) \mathbf{1}_{S_t(\bar a_t) = s} }{ E\big[ \big( \prod_{s=1}^{t-1} p_s(a_s \mid H_s) \big) \big( \prod_{j' \neq j} p_t(a_{t,j'} \mid H_t) \big) \mathbf{1}_{S_t(\bar a_t) = s} \big] }\, E\big[ Y_{t+1,j}(\bar a_{t,-j}, (\bar a_{t-1}, a)) \mid H_t, A_{t,j} = a \big] \,\Bigg|\, S_t = s \Bigg]
\]
\[
= E\big[ E[ Y_{t+1,j} \mid H_t, A_{t,j} = a ] \mid S_t = s \big].
\]
In the final equation, the outer expectation is with respect to the history $H_t$ conditional on $S_t = s$; that is, over both past treatments $A_s$ and past observations $O_s$ for $s < t$, as well as over current treatments $A_{t,j'}$ for $j' \neq j$. The above shows
\[
E\big[ Y_{t+1}(\bar A_{t,-j}, (\bar A_{t-1,j}, a)) \mid S_t(\bar a_t) = s \big] = E\big[ E[ Y_{t+1,j} \mid H_t, A_{t,j} = a ] \mid S_t = s \big].
\]
Averaging over individuals in the group, $j \in [G]$, completes the proof. The proof for the indirect effect follows the exact same structure.
A.1 Lemma 4.2

We next provide a detailed proof of asymptotic normality and consistency for the weighted-centered least squares estimator.

Proof of consistency for direct and indirect effects. The solutions $(\hat\alpha, \hat\beta)$ that minimize equation (5) are consistent estimators for the solutions that minimize the following:
\[
E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} \big( Y_{t+1,j} - g_t(H_t)'\alpha - (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\beta \big)^2 \Bigg].
\]
Differentiating the above equation with respect to $\alpha$ yields a set of $p$ estimating equations:
\[
0_p = E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} \big( Y_{t+1,j} - g_t(H_t)'\alpha - (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\beta \big) g_t(H_t) \Bigg].
\]
We note that
\[
E\big[ W_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\beta \mid H_t \big] = 0.
\]
Therefore, we have
\[
0_p = E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} \big( g_t(H_t) E[W_{t,j} Y_{t+1,j}] - g_t(H_t) g_t(H_t)'\alpha \big) \Bigg]
\;\Rightarrow\;
\alpha = E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} g_t(H_t) g_t(H_t)' \Bigg]^{-1} E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} g_t(H_t) E[W_{t,j} Y_{t+1,j}] \Bigg].
\]
We note that
\[
E\big[ W_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) g_t(H_t)'\alpha \big] = 0,
\]
and
\[
E\big[ W_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) Y_{t+1,J} \big] = \tilde p_t(1 \mid S_t)(1 - \tilde p_t(1 \mid S_t))\, \beta(t; S_t).
\]
Now differentiating with respect to $\beta$ yields
\[
0_q = E\Bigg[ \sum_{t=1}^{T} W_{t,J} \big( Y_{t+1,J} - g_t(H_t)'\alpha - (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\beta \big) (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t) \Bigg],
\]
\[
0_q = E\Bigg[ \sum_{t=1}^{T} \tilde p_t(1 \mid S_t)(1 - \tilde p_t(1 \mid S_t)) \big( \beta(t; S_t) - f_t(S_t)'\beta^\star \big) f_t(S_t) \Bigg].
\]
Then we have
\[
\beta^\star = E\Bigg[ \sum_{t=1}^{T} \tilde p_t(1 \mid S_t)(1 - \tilde p_t(1 \mid S_t)) f_t(S_t) f_t(S_t)' \Bigg]^{-1} E\Bigg[ \sum_{t=1}^{T} \tilde p_t(1 \mid S_t)(1 - \tilde p_t(1 \mid S_t)) f_t(S_t)\, \beta(t; S_t) \Bigg].
\]
Under Assumption 4.1, we have that $\beta$
$= \beta^\star$, which guarantees consistency.

We next consider the indirect effect estimator. Recall that
\[
\tilde p^\star_t(1 \mid S_t) = \frac{\tilde p_t(0, 1 \mid S_t)}{\tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t)}
\]
is the replacement for $\tilde p_t(1 \mid S_t)$ in the direct effect for centering. If we make the assumption that $\tilde p_t(0, 1 \mid S_t) = \tilde p_t(0 \mid S_t)\, \tilde p_t(1 \mid S_t)$ then $\tilde p^\star_t(1 \mid S_t) = \tilde p_t(1 \mid S_t)$; however, we provide the proof in complete generality. The estimates that minimize equation (6) are consistent estimators for the solutions that minimize the following:
\[
E\Bigg[ \sum_{t=1}^{T} W_{t,J,J'} \big( Y_{t+1,J} - g_t(H_t)'\alpha - (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)'\beta \big)^2 \Bigg].
\]
Differentiating the above equation with respect to $\alpha$ yields a set of $p$ estimating equations:
\[
0_p = E\Bigg[ \sum_{t=1}^{T} W_{t,J,J'} \big( Y_{t+1,J} - g_t(H_t)'\alpha - (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)'\beta \big) g_t(H_t) \Bigg].
\]
We note that
\[
E\big[ W_{t,J,J'} (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)'\beta \mid H_t \big] = 0.
\]
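The definition of the centering probability $\tilde p^\star_t(1 \mid S_t)$ and the mean-zero property just stated can be checked numerically. The sketch below is only an illustration: the function names `centering_prob` and `centered_first_moment` are hypothetical, the probabilities are made-up values, and it uses the fact that the weight replaces the randomization probability by $\tilde p_t$, so that the conditional expectation reduces to a sum over $a' \in \{0, 1\}$ of $\tilde p_t(0, a' \mid S_t)(a' - \tilde p^\star_t(1 \mid S_t))$.

```python
import numpy as np

def centering_prob(p00, p01):
    """p_star(1 | S) = p(0,1 | S) / (p(0,0 | S) + p(0,1 | S))."""
    return p01 / (p00 + p01)

def centered_first_moment(p00, p01):
    """Sum over a' in {0, 1} of p(0, a' | S) * (a' - p_star(1 | S))."""
    p_star = centering_prob(p00, p01)
    return p00 * (0.0 - p_star) + p01 * (1.0 - p_star)

# Arbitrary joint numerator probabilities (illustrative values).
p00, p01 = 0.3, 0.2
moment = centered_first_moment(p00, p01)

# Independent special case: p(0,1) = p(0) p(1) and p(0,0) = p(0) p(0).
p1 = 0.3
p_star_indep = centering_prob((1 - p1) * (1 - p1), (1 - p1) * p1)
print(moment, p_star_indep)
```

The first quantity vanishes for any admissible pair of joint probabilities, and in the independent special case the centering probability reduces to the marginal probability `p1`.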
Therefore, we have
\[
0_p = E\Bigg[ \sum_{t=1}^{T} \big( g_t(H_t) E[W_{t,J,J'} Y_{t+1,J}] - g_t(H_t) g_t(H_t)'\alpha \big) \Bigg]
\;\Rightarrow\;
\alpha = E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} g_t(H_t) g_t(H_t)' \Bigg]^{-1} E\Bigg[ \sum_{t=1}^{T} g_t(H_t) E[W_{t,J,J'} Y_{t+1,J}] \Bigg].
\]
First, we show that
\[
E\big[ W_{t,J,J'} (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) \mid H_t \big]
= \sum_{a' \in \{0,1\}} E\big[ \tilde p_t(0, a' \mid S_t)(a' - \tilde p^\star_t(1 \mid S_t)) \mid H_t, A_{t,J} = 0, A_{t,J'} = a' \big]
\]
\[
= \tilde p_t(0, 1 \mid S_t) \Bigg( 1 - \frac{\tilde p_t(0, 1 \mid S_t)}{\tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t)} \Bigg) - \tilde p_t(0, 0 \mid S_t)\, \frac{\tilde p_t(0, 1 \mid S_t)}{\tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t)} = 0
\]
and
\[
E\big[ W_{t,J,J'} (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t))^2 \mid H_t \big]
= \tilde p_t(0, 1 \mid S_t) \Bigg( \frac{\tilde p_t(0, 0 \mid S_t)}{\tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t)} \Bigg)^2 + \tilde p_t(0, 0 \mid S_t) \Bigg( \frac{\tilde p_t(0, 1 \mid S_t)}{\tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t)} \Bigg)^2
\]
\[
= \big( \tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t) \big)\, \tilde p^\star_t(1 \mid S_t)(1 - \tilde p^\star_t(1 \mid S_t)).
\]
Moreover,
\[
E\big[ W_{t,J,J'} (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) g_t(H_t)'\alpha \big] = 0,
\]
and
\[
\frac{ E\big[ W_{t,J,J'} (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) Y_{t+1,J} \big] }{ \tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t) } = \tilde p^\star_t(1 \mid S_t)(1 - \tilde p^\star_t(1 \mid S_t))\, \beta^{(IE)}(t; S_t).
\]
Now differentiating with respect to $\beta$ yields
\[
0_q = E\Bigg[ \sum_{t=1}^{T} \big( \tilde p_t(0, 0 \mid S_t) + \tilde p_t(0, 1 \mid S_t) \big)\, \tilde p^\star_t(1 \mid S_t)(1 - \tilde p^\star_t(1 \mid S_t)) \big( \beta^{(IE)}(t; S_t) - f_t(S_t)'\beta^{\star\star} \big) f_t(S_t) \Bigg].
\]
Under Assumption 4.3, we have that $\beta = \beta^{\star\star}$, which guarantees consistency.

Proof of Asymptotic Normality. We now consider the issue of asymptotic normality.
First, let $\epsilon_{t,j} = Y_{t+1,j} - g_t(H_t)'\alpha^\star - (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)'\beta^\star$, $\hat\theta = (\hat\alpha, \hat\beta)$, and $\theta^\star = (\alpha^\star, \beta^\star)$. Since $S_t \subset H_t$, define $h_{t,j}(H_t)' = \big( g_t(H_t)', (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)' \big)$. Then
\[
\sqrt{M}(\hat\theta - \theta^\star) = \sqrt{M} \Bigg\{ \mathbb{P}_M \Bigg( \frac{1}{G_m} \sum_{j=1}^{G_m} \sum_{t=1}^{T} W_{t,j} h_{t,j}(H_t) h_{t,j}(H_t)' \Bigg)^{-1} \Bigg[ \mathbb{P}_M \Bigg( \frac{1}{G_m} \sum_{j=1}^{G_m} \sum_{t=1}^{T} W_{t,j} Y_{t+1,j} h_{t,j}(H_t) \Bigg) - \mathbb{P}_M \Bigg( \frac{1}{G_m} \sum_{j=1}^{G_m} \sum_{t=1}^{T} W_{t,j} h_{t,j}(H_t) h_{t,j}(H_t)' \Bigg) \theta^\star \Bigg] \Bigg\}
\]
\[
= \sqrt{M} \Bigg\{ E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} h_{t,j}(H_t) h_{t,j}(H_t)' \Bigg]^{-1} \Bigg[ \mathbb{P}_M \Bigg( \frac{1}{G_m} \sum_{j=1}^{G_m} \sum_{t=1}^{T} W_{t,j} \epsilon_{t,j} h_{t,j}(H_t) \Bigg) \Bigg] \Bigg\} + o_p(1).
\]
By the definitions of $\alpha^\star$ and $\beta^\star$ and the previous consistency argument,
\[
E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} \epsilon_{t,j} h_{t,j}(H_t) \Bigg] = 0.
\]
Then under moment conditions, we have asymptotic normality with variance $\Sigma_\theta$ given by
\[
\Sigma_\theta = E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} h_{t,j}(H_t) h_{t,j}(H_t)' \Bigg]^{-1} E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} \epsilon_{t,j} h_{t,j}(H_t) \times \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} \epsilon_{t,j} h_{t,j}(H_t)' \Bigg] E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \sum_{t=1}^{T} W_{t,j} h_{t,j}(H_t) h_{t,j}(H_t)' \Bigg]^{-1}.
\]
Due to centering, the expectation of the matrix $W_{t,J} h_{t,J}(H_t) h_{t,J}(H_t)'$ is block diagonal and the sub-covariance matrix $\Sigma_\beta$ can be extracted and is equal to
\[
\Sigma_\beta = \Bigg[ \sum_{t=1}^{T} E\big[ (A_{t,J} - \tilde p_t(1 \mid S_t))^2 W_{t,J} f_t(S_t) f_t(S_t)' \big] \Bigg]^{-1} \cdot E\Bigg[ \sum_{t=1}^{T} W_{t,J} \epsilon_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,J} \epsilon_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t)' \Bigg] \cdot \Bigg[ \sum_{t=1}^{T} E\big[ (A_{t,J} - \tilde p_t(1 \mid S_t))^2 W_{t,J} f_t(S_t) f_t(S_t)' \big] \Bigg]^{-1}
\]
as desired.

We next consider asymptotic normality in the indirect setting. First, let $\epsilon_{t,j,j'} = Y_{t+1,j} - g_t(H_t)'\alpha^{\star\star} - (1 - A_{t,j})(A_{t,j'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)'\beta^{\star\star}$, $\hat\theta = (\hat\alpha, \hat\beta)$, and $\theta^{\star\star} = (\alpha^{\star\star}, \beta^{\star\star})$. Since $S_t \subset H_t$, define $h_{t,j,j'}(H_t)' = \big( g_t(H_t)', (1 - A_{t,j})(A_{t,j'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)' \big)$. Then $\sqrt{M}(\hat\theta - \theta^{\star\star})$ equals
\[
\sqrt{M} \Bigg\{ \mathbb{P}_M \Bigg( \frac{1}{G_m (G_m - 1)} \sum_{j=1}^{G_m} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'} h_{t,j,j'}(H_t) h_{t,j,j'}(H_t)' \Bigg)^{-1} \Bigg[ \mathbb{P}_M \Bigg( \frac{1}{G_m (G_m - 1)} \sum_{j=1}^{G_m} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'} Y_{t+1,j} h_{t,j,j'}(H_t) \Bigg) - \mathbb{P}_M \Bigg( \frac{1}{G_m (G_m - 1)} \sum_{j=1}^{G_m} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'} h_{t,j,j'}(H_t) h_{t,j,j'}(H_t)' \Bigg) \theta^{\star\star} \Bigg] \Bigg\}
\]
\[
= \sqrt{M} \Bigg\{ E\Bigg[ \frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'} h_{t,j,j'}(H_t) h_{t,j,j'}(H_t)' \Bigg]^{-1} \Bigg[ \mathbb{P}_M \Bigg( \frac{1}{G_m (G_m - 1)} \sum_{j=1}^{G_m} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'} \epsilon_{t,j,j'} h_{t,j,j'}(H_t) \Bigg) \Bigg] \Bigg\} + o_p(1).
\]
By the definitions of $\alpha^{\star\star}$ and $\beta^{\star\star}$ and the previous consistency argument,
\[
E\Bigg[ \frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'} \epsilon_{t,j,j'} h_{t,j,j'}(H_t) \Bigg] = 0.
\]
Then under moment conditions, we have asymptotic normality with
variance $\Sigma_\theta$ given by
\[
\Sigma_\theta = E\Bigg[ \frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'} h_{t,j,j'}(H_t) h_{t,j,j'}(H_t)' \Bigg]^{-1} E\Bigg[ \frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'} \epsilon_{t,j,j'} h_{t,j,j'}(H_t) \times \frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'} \epsilon_{t,j,j'} h_{t,j,j'}(H_t)' \Bigg] E\Bigg[ \frac{1}{G (G - 1)} \sum_{j=1}^{G} \sum_{j' \neq j} \sum_{t=1}^{T} W_{t,j,j'} h_{t,j,j'}(H_t) h_{t,j,j'}(H_t)' \Bigg]^{-1}.
\]
Due to centering, the expectation of the matrix $W_{t,J,J'} h_{t,J,J'}(H_t) h_{t,J,J'}(H_t)'$ is block diagonal and the sub-covariance matrix $\Sigma_\beta$ can be extracted and is equal to
\[
\Sigma_\beta = \Bigg[ \sum_{t=1}^{T} E\big[ (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t))^2 W_{t,J,J'} f_t(S_t) f_t(S_t)' \big] \Bigg]^{-1} \cdot E\Bigg[ \sum_{t=1}^{T} W_{t,J,J'} \epsilon_{t,J,J'} (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,\tilde J,\tilde J'} \epsilon_{t,\tilde J,\tilde J'} (1 - A_{t,\tilde J})(A_{t,\tilde J'} - \tilde p^\star_t(1 \mid S_t)) f_t(S_t)' \Bigg] \cdot \Bigg[ \sum_{t=1}^{T} E\big[ (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t))^2 W_{t,J,J'} f_t(S_t) f_t(S_t)' \big] \Bigg]^{-1}
\]
as desired.

B Proof of Lemma 4.5

Proof.
Consider the $W$-matrix for the direct effect asymptotic variance,
\[
\frac{1}{G^2} \sum_{t,t'} \sum_{j,j'} E\Big[ W_{t,j} \epsilon_{t,j} (A_{t,j} - \tilde p_t(1 \mid S_t))\, W_{t',j'} \epsilon_{t',j'} (A_{t',j'} - \tilde p_{t'}(1 \mid S_{t'}))\, f_t(S_t) f_{t'}(S_{t'})' \Big].
\]
Consider the cross-terms with $j \neq j'$ and, without loss of generality, assume $t \geq t'$. Each such term equals
\[
E\Bigg[ \sum_{a,a'} \tilde p_t(a \mid S_t)(a - \tilde p_t(1 \mid S_t))\, \tilde p_{t'}(a' \mid S_{t'})(a' - \tilde p_{t'}(1 \mid S_{t'}))\, E\Big[ E\big[ \epsilon_{t,j} \epsilon_{t',j'} \mid H_{t,j}, A_{t,j} = a, H_{t',j'}, A_{t',j'} = a' \big] \mid S_t, S_{t'} \Big] f_t(S_t) f_{t'}(S_{t'})' \Bigg].
\]
Under condition (7), the inner expectation is constant in $a$ and $a'$, so we can rewrite the above as:
\[
E\Bigg[ \sum_{a,a'} \tilde p_t(a \mid S_t)(a - \tilde p_t(1 \mid S_t))\, \tilde p_{t'}(a' \mid S_{t'})(a' - \tilde p_{t'}(1 \mid S_{t'}))\, \psi(S_t, S_{t'}) f_t(S_t) f_{t'}(S_{t'})' \Bigg]
\]
\[
= E\Bigg[ \psi(S_t, S_{t'}) f_t(S_t) f_{t'}(S_{t'})' \underbrace{ \Bigg( \sum_{a,a'} \tilde p_t(a \mid S_t)(a - \tilde p_t(1 \mid S_t))\, \tilde p_{t'}(a' \mid S_{t'})(a' - \tilde p_{t'}(1 \mid S_{t'})) \Bigg) }_{=0} \Bigg]
= E\big[ \psi(S_t, S_{t'}) f_t(S_t) f_{t'}(S_{t'})' \cdot 0 \big] = 0.
\]
Therefore, we have that the $W$-matrix simplifies to
\[
E\Bigg[ \sum_{t=1}^{T} W_{t,J} \epsilon_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,J} \epsilon_{t,J} (A_{t,J} - \tilde p_t(1 \mid S_t)) f_t(S_t)' \Bigg]
\]
\[
= E\Bigg[ \frac{1}{G} \sum_{j=1}^{G} \Bigg[ \sum_{t=1}^{T} W_{t,j} \epsilon_{t,j} (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_{t,j} \epsilon_{t,j} (A_{t,j} - \tilde p_t(1 \mid S_t)) f_t(S_t)' \Bigg] \Bigg]
\]
\[
= E\Bigg[ \sum_{t=1}^{T} W_t \epsilon_t (A_t - \tilde p_t(1 \mid S_t)) f_t(S_t) \times \sum_{t=1}^{T} W_t \epsilon_t (A_t - \tilde p_t(1 \mid S_t)) f_t(S_t)' \Bigg],
\]
which is the $W$ matrix as in the standard MRT analysis.

C Small sample size adjustment for covariance estimation

The robust sandwich covariance estimator [Mancl and DeRouen, 2001] for the entire variance matrix is given by $Q^{-1} \Lambda Q^{-1}$. The first term, $Q$, is given by
\[
Q = \Bigg( \sum_{m=1}^{M} \sum_{j=1}^{G_m} D_{j,m}^{T} W_{j,m} D_{j,m} \Bigg),
\]
where $D_{j,m}$ is the model matrix for individual $j$ in group $m$ associated with equation (5), and $W_{j,m}$ is a diagonal matrix of individual weights. The middle term $\Lambda$ is given by
\[
\Lambda = \sum_{m=1}^{M} \sum_{i,j=1}^{G_m} D_{i,m}' W_{i,m} (I_{i,m} - H_{i,m})^{-1} e_{i,m} e_{j,m}' (I_{j,m} - H_{j,m})^{-1} W_{j,m} D_{j,m},
\]
where $I_i$ is an identity matrix of correct dimension, $e_i$ is the individual-specific residual vector, and
\[
H_{j,m} = D_{j,m} \Bigg( \sum_{m=1}^{M} \sum_{j=1}^{G_m} D_{j,m}' W_{j,m} D_{j,m} \Bigg)^{-1} D_{j,m}' W_{j,m}.
\]
From $Q^{-1} \Lambda Q^{-1}$ we extract $\hat\Sigma_\beta$.

D Additional analysis of IHS

Setting   Variable    Estimate   Std. Error   t-value   p-value
WCLS      $\beta_0$   […]        […]          […]       […]
          $\beta_1$   -0.068     0.017        -3.916    0.000
C-WCLS    $\beta_0$   […]        […]          […]       […]
          $\beta_1$   -0.061     0.023        -2.696    0.009
C-WCLS    $\beta_0$   -1.912     1.379        -1.386    0.171
          $\beta_1$   -0.067     0.023        -2.948    0.004
          $\beta_2$   […]        […]          […]       […]

E Additional on the indirect effect

The weights used in the estimation of the indirect effect are a natural extension of those in Boruvka et al. [2017].
As in Section 4.2, the weight $W_{t,j,j'}$ at decision time $t$ for the $j$th individual is equal to
\[
W_{t,j,j'} = \frac{\tilde p(A_{t,j}, A_{t,j'} \mid S_t)}{p_t(A_{t,j}, A_{t,j'} \mid H_t)},
\]
where $\tilde p_t(a, a' \mid S_t) \in (0, 1)$ is arbitrary as long as it does not depend on terms in $H_t$ other than $S_t$, and $p(A_{t,j}, A_{t,j'} \mid H_t)$ is the marginal probability that individuals $j$ and $j'$ receive treatments $A_{t,j}$ and $A_{t,j'}$ respectively given $H_t$.

In the simulation, the treatments $A_{t,j}$ and $A_{t,j'}$ that individuals $j$ and $j'$ receive are mutually independent conditional on the previous history. Thus, the denominator of $W_{t,j,j'}$ can be factorized into:
\[
p(A_{t,j}, A_{t,j'} \mid H_t) = p(A_{t,j} \mid H_t)\, p(A_{t,j'} \mid H_t).
\]
In addition, the numerator of $W_{t,j,j'}$ is defined as the empirical frequency of the treatment pair $(a, a')$, which takes values in $\{(0,0), (0,1), (1,0), (1,1)\}$. Here we denote it as
\[
\tilde p_t(A_{t,j}, A_{t,j'} \mid S_t) = \hat p_t(A_{t,j}, A_{t,j'} \mid S_t).
\]
Therefore, the weight we used in the simulation is constructed as:
\[
W_{t,j,j'} = \frac{\hat p_t(A_{t,j}, A_{t,j'} \mid S_t)}{p(A_{t,j} \mid H_t)\, p(A_{t,j'} \mid H_t)}.
\]
When the numerators are estimated using the observed data, the variance-covariance must account for this. Throughout we allow for the setting in which individuals are not always available. For completeness we provide results for a more general estimating function which can be used with observational (non-randomized $A_t$) treatments, under the assumption of sequential ignorability and assuming the data analyst is able to correctly model and estimate the treatment probability $p(A_{t,j}, A_{t,j'} \mid H_t)$. We indicate how the results are simplified by use of data from an MRT.

Denote the parameterized treatment probability by $p_t(a, a' \mid H_t; \eta)$ (with parameter $\eta$); note $\eta$ is known in an MRT.
Denote the parameterized numerator of the weights by $\tilde p_t(a, a' \mid S_t; \rho)$ (with parameter $\rho$). The proof below allows the data analyst to use a $\tilde p_t$ with an estimated parameter $\hat\rho$ or to pre-specify $\rho$ as desired. We use a superscript of $\star$ to denote limiting values of estimated parameters (e.g., $\eta^\star$, $\rho^\star$). Then the more general version of the estimating equation $U_W(\alpha, \beta; \hat\eta, \hat\rho)$ is:
\[
\sum_{t=1}^{T} \big( Y_{t+1,J} - g_t(H_t)^\top \alpha - (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t; \hat\rho)) f_t(S_t)'\beta \big)\, I_{t,J} I_{t,J'}\, W_{t,J,J'}(A_{t,J}, A_{t,J'}, H_t; \hat\eta, \hat\rho) \times \begin{pmatrix} g_t(H_t) \\ (1 - A_{t,J})(A_{t,J'} - \tilde p^\star_t(1 \mid S_t; \hat\rho)) f_t(S_t) \end{pmatrix}
\]
Note that $W_{t,J,J'}$ in the body of the paper is replaced here by $W_{t,J,J'}(A_t, H_t; \hat\eta, \hat\rho)$, and $\hat\eta$, $\hat\rho$ are estimators.

Treatment Probability Model: