Adaptive Sequential Design for a Single Time-Series



Ivana Malenica, Division of Biostatistics, University of California, Berkeley
Aurelien Bibaut, Division of Biostatistics, University of California, Berkeley
Mark J. van der Laan, Division of Biostatistics, University of California, Berkeley

Abstract

The current work is motivated by the need for robust statistical methods for precision medicine; we pioneer the concept of a sequential, adaptive design for a single individual. As such, we address the need for statistical methods that provide actionable inference for a single unit at any point in time. Consider the case that one observes a single time-series, where at each time $t$, one observes a data record $O(t)$ involving treatment nodes $A(t)$, an outcome node $Y(t)$, and time-varying covariates $W(t)$. We aim to learn an optimal, unknown choice of the controlled components of the design in order to optimize the expected outcome; with that, we adapt the randomization mechanism for future time-point experiments based on the data collected on the individual over time. Our results demonstrate that one can learn the optimal rule based on a single sample, and thereby adjust the design at any point $t$ with valid inference for the mean target parameter. This work provides several contributions to the field of statistical precision medicine. First, we define a general class of averages of conditional causal parameters defined by the current context (“context-specific”) for the single unit time-series data. We define a nonparametric model for the probability distribution of the time-series under few assumptions, and aim to fully utilize the sequential randomization in the estimation procedure via the double robust structure of the efficient influence curve of the proposed target parameter. We present multiple exploration-exploitation strategies for assigning treatment, and methods for estimating the optimal rule. Lastly, we present the study of the data-adaptive inference on the mean under the optimal treatment rule, where the target parameter adapts over time in response to the observed context of the individual. Our target parameter is pathwise differentiable with an efficient influence function that is doubly robust - which makes it easier to estimate than previously proposed variations. We characterize the limit distribution of our estimator under a Donsker condition expressed in terms of a notion of bracketing entropy adapted to martingale settings.

Keywords: Sequential decision making, Time-Series, Optimal Individualized Treatment, Targeted Maximum Likelihood Estimation (TMLE), Causal Inference.

arXiv [math.ST]

Introduction

There is growing scientific enthusiasm for the use and development of mobile health designs (mHealth) - broadly referring to the practice of health care mediated via mobile and wearable technologies (Steinhubl et al. 2013, Malvey & Slovensky 2014, Istepanian & Woodward 2017, Istepanian & Al-Anzi 2018). Numerous smartphones and Internet-coupled devices, connected to a plethora of mobile health applications, support continuous assembly of data-driven healthcare intervention and insight opportunities. Interest in mobile interventions spans a myriad of applications, including behavioral maintenance or change (Free et al. 2013, Muhammad et al. 2017), disease management (Heron & Smyth 2010, Ertin et al. 2011, Muessig et al. 2013, Steinhubl et al. 2013), teaching and social support (Kumar et al. 2013) and addiction management (Dulin et al. 2014, Zhang et al. 2016). In particular, Istepanian & Al-Anzi (2018) refer to mHealth as one of the most transformative drivers for healthcare delivery in modern times.

Recently, a new type of experimental design termed the micro-randomized trial (MRT) was developed in order to support just-in-time adaptive exposures - with an aim to deliver the intervention at the optimal time and location (Dempsey et al. 2015, Klasnja et al. 2015). To date, multiple trials have been completed using the MRT design, including encouraging regular physical activity (Klasnja et al. 2019) and engaging participation in the substance use data gathering process in high-risk populations (Rabbi et al. 2018). For both observational mHealth and MRT, the time-series nature of the collected data provides a unique opportunity to collect individual characteristics and context of each subject, while studying the effect of treatment on the outcome at a specified future time point.

The generalized estimating equation (GEE) and random effects models are the most commonly employed approaches for the analysis of mobile health data (Walls & Schafer 2006, Stone et al. 2007, Bolger & Laurenceau 2013, Hamaker et al. 2018). As pointed out in Boruvka et al. (2018), these methods often do not yield consistent estimates of the causal effect of interest if time-varying treatment is present. As an alternative, Boruvka et al. (2018) propose a centered and weighted least squares estimation method for GEE that provides unbiased estimation, assuming a linear model for the treatment effects. They tackle proximal and distal effects, with a focus on continuous outcomes. On the other hand, Luckett et al. (2019) propose a new reinforcement learning method applicable to perennial, frequently collected longitudinal data. While the literature on dynamic treatment regimes is vast and well-studied (Murphy 2003, Robins 2004, Chakraborty & Moodie 2013, Luedtke & van der Laan 2016c,a,b), the unique challenges posed by mHealth obstruct their direct employment; for instance, the mHealth objective typically has an infinite horizon. Luckett et al. (2019) model the data-generating distribution as a Markov decision process, and estimate the optimal policy among a class of pre-specified policies in both offline and online settings.

While mHealth, MRT designs and the corresponding methods for their analysis aim to deliver treatment tailored to each patient, they are still not optimized with complete “N-of-1” applications in mind. The usual population-based target estimands fail to capture the full, personalized nature of the time-series trajectory, often imposing strong assumptions on the dynamics model for estimation purposes. To the best of our knowledge, Robins et al. (1999) provide the first step towards describing a causal framework for a single subject with time-varying exposure and binary outcome in a time-series setting. Focusing on full potential paths, Bojinov & Shephard (2019) provide a causal framework for time-series experiments with randomization-based inference.
Other methodologies focused on single unit applications rely on strong modeling assumptions, primarily linear predictive models and stationarity; see Bojinov & Shephard (2019) for an excellent review of the few works on the topic. Alternatively, van der Laan et al. (2018) propose causal effects defined as marginal distributions of the outcome at a particular time point under a certain intervention on one or more of the treatment nodes. The efficient influence function of these estimators, however, relies on the whole mechanism in a non-double robust manner. Therefore, even when the assignment function is known, the inference still relies on consistent (at rate) estimation of the conditional distributions of the covariate and outcome nodes.

The current work is motivated by the need for robust statistical methods for precision medicine, pioneering the concept of a sequential, adaptive design for a single individual. To the best of our knowledge, this is the first work on learning the optimal individualized treatment rule in response to the current context for a single subject. A treatment rule for a patient is an individualized treatment strategy based on the history accrued, and context learned, up to the most current time point. A reward is measured on the patient at repeated time points, and optimality is meant in terms of optimization of the mean reward at a particular time $t$. We aim to learn an optimal, unknown choice of the controlled components of the design based on the data collected on the individual over time; with that, we adapt the randomization mechanism for future time-point experiments.
Our results demonstrate that one can learn the optimal, context-defined rule based on a single sample, and thereby adjust the design at any point $t$ with valid inference for the mean target parameter.

This article provides two main contributions to the field of statistical precision medicine. First, we define a general class of averages of conditional (context-specific) causal parameters for the single unit time-series data. We define models for the probability distribution of the time-series that refrain from making unrealistic parametric assumptions, and aim to fully utilize the sequential randomization in the estimation procedure. Secondly, we present the study of the data-adaptive inference on the mean under the optimal treatment rule, where the target parameter adapts over time in response to the observed context of the individual. Our estimators are double robust and easier to estimate efficiently than previously proposed variations (van der Laan et al. 2018). Finally, for inference, we rely on a martingale Central Limit Theorem under a conditional variance stabilization condition and a maximal inequality for martingales with respect to an extension of the notion of bracketing entropy for martingale settings, initially proposed by van de Geer (2000), which we refer to as sequential bracketing entropy.

This article is structured as follows. In Section 2 we formally present the general formulation of the statistical estimation problem, consisting of specifying the statistical model and notation, the target parameter defined as the average of context-specific target parameters, causal assumptions and identification results, and the corresponding efficient influence curve for the target parameter. In Section 3 we discuss different strategies for estimating the optimal treatment rule and sampling strategies for assigning treatment at each time point.
The following section, Section 4, introduces the Targeted Maximum Likelihood Estimator (TMLE), with Section 5 covering the theory behind the proposed estimator. In Section 6 we present simulation results for different dependence settings. We conclude with a short discussion in Section 7.

Let $O(t)$ be the observed data at time $t$, where we assume that we follow a patient along time steps $t = 1, \ldots, N$ such that $O^N \equiv (O(0), O(1), \ldots, O(N)) = (O(t) : t = 0, \ldots, N)$. At each time step $t$, the experimenter assigns to the patient a binary treatment $A(t) \in \mathcal{A} := \{0, 1\}$. We then observe, in this order, a post-treatment health outcome $Y(t) \in \mathcal{Y} \subset \mathbb{R}$, and then a post-outcome vector of time-varying covariates $W(t)$ lying in a Euclidean set $\mathcal{W}$. We suppose that larger values of $Y(t)$ reflect a better health outcome; without loss of generality, we also assume that $Y(t) \in (0, 1)$. […] $W(t)$ is an important part of post-exposure history to be considered for the next record, $O(t+1)$. Finally, we note that $O(0) = (W(0))$, where $O(-1) = A(0) = Y(0) = \emptyset$; as such, $O(0)$ plays the role of baseline covariates for the collected time-series, based on which exposure $A(1)$ might be allocated.

We denote by $O(t) := (A(t), Y(t), W(t))$ the observed data collected on the patient at time step $t$, with $\mathcal{O} := \mathcal{A} \times \mathcal{Y} \times \mathcal{W}$ as the domain of the observation $O(t)$. We note that $O(t)$ has a fixed dimension in time $t$, and is an element of a Euclidean set $\mathcal{O}$. Our data set is the time-indexed sequence $O^N \in \mathcal{O}^N$, or time-series, of the successive observations collected on a single patient. For any $t$, we let $\bar{O}(t) := (O(1), \ldots, O(t))$ denote the observed history of the patient up until time $t$. Unlike in more traditional statistical settings, the data points $O(1), \ldots, O(N)$ are not independent draws from the same law: here they form a dependent sequence, which is a single draw from a distribution over $\mathcal{O}^N$. In that sense, our data reduces to a single sample.

We let $O^N \sim P_0^N$, where $P_0^N$ denotes the true probability distribution of $O^N$. The subscript “0” stands for the “truth” throughout the rest of the article, denoting the true, unknown features of the distribution of the data. Realizations of a random variable $O^N$ are denoted with lower case letters, $o^N$. We suppose that $P_0^N$ admits a density $p_0^N$ w.r.t. a dominating measure $\mu$ over $\mathcal{O}^N$ that can be written as the product measure $\mu = \times_{t=1}^{N} (\mu_A \times \mu_Y \times \mu_W)$, with $\mu_A$, $\mu_Y$, and $\mu_W$ measures over $\mathcal{A}$, $\mathcal{Y}$, and $\mathcal{W}$.
From the chain rule, the likelihood under the true data distribution $P_0^N$ of a realization $\bar{o}^N$ of $\bar{O}^N$ can be factorized according to the time ordering of observation nodes as:

$$p_0^N(\bar{o}^N) = \prod_{t=1}^{N} p_{0,a(t)}(a(t) \mid \bar{o}(t-1)) \times \prod_{t=1}^{N} p_{0,y(t)}(y(t) \mid \bar{o}(t-1), a(t)) \times \prod_{t=0}^{N} p_{0,w(t)}(w(t) \mid \bar{o}(t-1), a(t), y(t)),$$

where $a(t) \mapsto p_{0,a(t)}(a(t) \mid \bar{o}(t-1))$, $y(t) \mapsto p_{0,y(t)}(y(t) \mid \bar{o}(t-1), a(t))$, and $w(t) \mapsto p_{0,w(t)}(w(t) \mid \bar{o}(t-1), a(t), y(t))$ are conditional densities w.r.t. the dominating measures $\mu_A$, $\mu_Y$, and $\mu_W$.

2.2 Statistical Model

Since $O^N$ represents a single time-series, a dependent process, we observe only a single draw from $P_0^N$. As a result, we are unable to estimate any part of $P_0^N$ without additional assumptions. In particular, we assume that the conditional distribution of $O(t)$ given $\bar{O}(t-1)$, $P_{O(t) \mid \bar{O}(t-1)}$, depends on $\bar{O}(t-1)$ through a summary measure $C_o(t) = C_o(\bar{O}(t-1)) \in \mathcal{C}$ of fixed dimension; each $C_o(t)$ might contain a $t$-specific summary of previous measurements of the context, or be of a particular Markov order. For later notational convenience, we denote this conditional distribution $P_{O(t) \mid \bar{O}(t-1)}$ by $P_{C_o(t)}$. Then, the density $p_{C_o(t)}$ of $P_{C_o(t)}$ with respect to a dominating measure $\mu_{C_o(t)}$ is a conditional density $(o, C_o) \mapsto p_{C_o(t)}(o \mid C_o)$, so that for each value of $C_o(t)$, $\int p_{C_o(t)}(o \mid C_o(t)) \, d\mu_{C_o(t)}(o) = 1$. We extend this notion to all parts of the likelihood as described in subsection (2.1), defining $q_{y(t)}$ as the density for node $Y(t)$ conditional on a fixed-dimensional summary $C_y(t)$, with $C_w(t)$ and $C_a(t)$ corresponding to fixed-dimensional summaries for $q_{w(t)} = p_{0,w(t)}(w(t) \mid C_w(t))$ and $g_t = p_{0,a(t)}(a(t) \mid C_a(t))$, respectively.

Additionally, we assume that $p_{C_o(t)}$ is parameterized by a common (in time $t$) function $\theta \in \Theta$, with inputs $(c, o) \mapsto \theta(c, o)$. The conditional distribution $p_{C_o(t)}$ depends on $\theta$ only through $\theta(C_o(t), \cdot)$. We write $p_{C_o(t)} = p_{\theta, C_o(t)}$ interchangeably. Let $q_y$ be the common conditional density of $Y(t)$, given $(A(t), C_o(t))$; we make no such assumption on $q_{w(t)}$. Additionally, we make no conditional stationarity assumptions on $g_t$ if randomization probabilities are known, as is the case for an adaptive sequential trial. We define $\bar{Q}(C_o(t), A(t)) = E_{P_{C_o(t)}}(Y(t) \mid C_o(t), A(t))$ to be the conditional mean of $Y(t)$ given $C_o(t)$ and $A(t)$. As such, we have that $\bar{Q}(C_y(t)) = \bar{Q}(C_o(t), A(t)) = \int y \, q_y(y \mid C_o(t), A(t)) \, d\mu_y(y)$, and $\bar{Q}$ is a common function across time $t$; we put no restrictions on $\bar{Q}$. We suppress dependence on the conditional density $q_{w(t)}$ in future reference, as this factor plays no role in estimation.
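To make the fixed-dimension requirement on $C_o(t)$ concrete, the following is a minimal sketch of one possible summary measure. It assumes a scalar covariate, a Markov order of two, and a running mean of past outcomes; the function name `context_summary`, the record layout, and the zero-padding convention are hypothetical choices made for illustration and do not come from the article.

```python
from typing import List, Tuple

# Hypothetical record layout: (a, y, w) with a scalar covariate w.
Obs = Tuple[int, float, float]


def context_summary(history: List[Obs], w0: float, order: int = 2) -> Tuple[float, ...]:
    """Illustrative fixed-dimensional summary C_o(t) of the past O-bar(t-1).

    Keeps the last `order` records (padded with (0, 0.0, w0) early on) plus the
    running mean of past outcomes, so the output length never depends on t.
    Assumes order >= 1.
    """
    recent = history[-order:]
    # Pad on the left so that early time points still yield `order` records.
    padded = [(0, 0.0, w0)] * (order - len(recent)) + list(recent)
    flat = [float(v) for rec in padded for v in rec]
    ys = [y for _, y, _ in history]
    running_mean = sum(ys) / len(ys) if ys else 0.0
    return tuple(flat + [running_mean])
```

Whatever summary is used in practice, the point of the sketch is only that its dimension stays constant as $t$ grows, which is what allows a common function $\theta$ to be learned across time points.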
In particular, $q_{w(t)}$ does not affect the efficient influence curve of the target parameter, allowing us to act as if $q_{w(t)}$ is known. Finally, we define $\theta = (g, \bar{Q})$ and let $\Theta = \mathcal{G} \times \bar{\mathcal{Q}}$ be the Cartesian product of the two nonparametric parameter spaces for $g$ and $\bar{Q}$.

Let $p_{\theta, C_o(t)}$ and $p_\theta^N$ be the densities of $O(t)$ given $C_o(t)$ and of $O^N$, respectively, implied by $\theta$. This defines a statistical model $\mathcal{M}^N = \{P_\theta^N : \theta\}$, where $P_\theta^N$ is the probability measure for the time-series implied by $p_{\theta, C_o(t)}$. Additionally, we define a statistical model of distributions of $O(t)$ at time $t$, conditional on the realized summary $C_o(t)$. In particular, let $\mathcal{M}(C_o(t)) = \{P_{\theta, C_o(t)} : \theta\}$ be the model for $P_{C_o(t)}$ for a given $C_o(t)$ implied by $\mathcal{M}^N$. Note that, by setup, both $\mathcal{M}^N$ and $\mathcal{M}(C_o(t))$ contain their truth, $P_0^N$ and $P_{O(t) \mid C_o(t)}$, respectively. Similarly to the likelihood expression in subsection (2.1), we can factorize the likelihood under the above defined statistical model according to time ordering as:

$$p_\theta(o^N) = \prod_{t=1}^{N} g_t(a(t) \mid C_a(t)) \prod_{t=1}^{N} q_y(y(t) \mid C_y(t)) \prod_{t=0}^{N} q_{w(t)}(w(t) \mid C_w(t)).$$

By specifying a non-parametric structural equations model (NPSEM; equivalently, structural causal model), we assume that each component of the observed time-specific data structure is a function of an observed, fixed-dimensional history and an unmeasured exogenous error term (Pearl 2009). We encode the time-ordering of the variables using the following NPSEM:

$$\begin{aligned}
W(0) &= f_{W(0)}(U_W(0)), \\
A(t) &= f_{A(t)}(C_A(t), U_A(t)), \quad t = 1, \ldots, N, \\
Y(t) &= f_{Y(t)}(C_Y(t), U_Y(t)), \quad t = 1, \ldots, N, \\
W(t) &= f_{W(t)}(C_W(t), U_W(t)), \quad t = 1, \ldots, N,
\end{aligned} \qquad (1)$$

where $(f_{A(t)} : t = 1, \ldots, N)$, $(f_{Y(t)} : t = 1, \ldots, N)$ and $(f_{W(t)} : t = 0, \ldots, N)$ are unspecified, deterministic functions and $U = (U_W(0), \ldots, U_W(N), U_A(1), \ldots, U_A(N), U_Y(1), \ldots, U_Y(N))$ is a vector of exogenous errors.

We denote by $\mathcal{M}^F$ the set of all probability distributions $P^F$ over the domain of $(O, U)$ that are compatible with the NPSEM defined above. Let $P_0^F$ be the true probability distribution of $(O, U)$, which we assume to belong to $\mathcal{M}^F$; we refer to $\mathcal{M}^F$ as the causal model. The causal model $\mathcal{M}^F$ encodes all the knowledge about the data-generating process, and implies a model for the distribution of the counterfactual random variables; as such, causal effects are defined in terms of hypothetical interventions on the NPSEM.

Consider a treatment rule $C_o(t) \mapsto d(C_o(t)) \in \{0, 1\}$ that maps the observed, fixed-dimensional history $C_o(t)$ into a treatment decision for $A(t)$. We introduce a counterfactual random variable $O^{N,d}$, defined by substituting the equation for node $A$ at time $t$ in the NPSEM with the intervention $d$:

$$\begin{aligned}
W^d(0) &= f_{W(0)}(U_W(0)), \\
A^d(t) &= d(C_A(t)), \quad t = 1, \ldots, N, \\
Y^d(t) &= f_{Y(t)}(C_Y(t), U_Y(t)), \quad t = 1, \ldots, N, \\
W^d(t) &= f_{W(t)}(C_W(t), U_W(t)), \quad t = 1, \ldots, N.
\end{aligned}$$

We gather all of the nodes of the above modified NPSEM in the random vector $O^{N,d} := (O^d(t) : t = 1, \ldots, N)$, where $O^d(t) := (A^d(t), Y^d(t), W^d(t))$. The random vector $O^{N,d}$ represents the counterfactual time-series, or counterfactual trajectory, that the subject of interest would have had if each treatment assignment $A(t)$, for $t = 1, \ldots, N$, had been carried out following the treatment rule $d$.

We now formally define time-series causal parameters. First, we introduce a time- and context-specific causal model.
Let $\mathcal{M}^F_t(c)$ be the set of conditional probability distributions $P^F_c$ over the domain of $(O(t), U_A(t), U_Y(t), U_W(t))$ compatible with the non-parametric structural equation model (1), imposing that $C_A(t) = C_o(t) = c$:

$$\begin{aligned}
A_c(t) &= f_{A(t)}(c_o, U_A(t)), \\
Y_c(t) &= f_{Y(t)}(c_A(c_o, A(t)), U_Y(t)), \\
W_c(t) &= f_{W(t)}(c_W(c_o, A(t), Y(t)), U_W(t)).
\end{aligned}$$

Let $O^d_c(t)$ be the counterfactual observation at time $t$, obtained by substituting the $A(t)$ equation in the above set of equations with the deterministic intervention $d$:

$$\begin{aligned}
A^d_c(t) &= d(c), \\
Y^d_c(t) &= f_{Y(t)}(c_A(c_o, A(t)), U_Y(t)), \\
W^d_c(t) &= f_{W(t)}(c_W(c_o, A(t), Y(t)), U_W(t)).
\end{aligned}$$

We define our causal parameter of interest as

$$\Psi^{F,d}_{C_o(t)}(P^F_{C_o(t)}) := E[Y^d_{C_o(t)}],$$

which is the expectation of the counterfactual random variable $Y^d$, generated by the above modified NPSEM. It corresponds to starting at $c = C_o(t)$, the current context, and assigning treatment following $d$. Our causal target parameter is the mean outcome we would have obtained after one time step if, starting at time $t$ from the observed context $C_o(t)$, we had carried out intervention $d$.

Once we have defined our causal target parameter, the natural question that arises is how to identify it from the observed data distribution. We can identify the distribution of the $d$-specific time series $O^{N,d}$, and also of the $(d, C_o(t))$-specific observation $O^d_{C_o(t)}$, from the observed data via the G-computation formula - under the sequential randomization and positivity assumptions, which we state below.

Assumption 1 (Sequential randomization). For every $t$, $Y^d(t) \perp\!\!\!\perp A(t) \mid C_o(t)$ (and $Y^d_{C_o(t)}(t) \perp\!\!\!\perp A(t) \mid C_o(t)$).

Assumption 2 (Positivity). It holds that under the treatment mechanism $g_{0,t}$, each treatment value $a \in \{0, 1\}$ has a positive probability of being assigned, under every possible treatment history: $g_{0,t}(a \mid c) > 0$, for every $t \geq 1$, $a \in \{0, 1\}$ and every $c \in \mathcal{C}$ such that $P_0[C_o(t) = c] > 0$.

Note that in the setting of the present article, as we suppose that $A(t)$ is assigned at random conditional on $C_o(t)$ by the experimenter, Assumption 1 concerning the sequential randomization automatically holds. Under identification Assumptions 1 and 2, we can write our causal parameter $\Psi^{F,d}_{C_o(t)}(P_0^F)$ as a feature of the data-generating distribution:

$$\Psi^{F,d}_{C_o(t)}(P_0^F) = \Psi^d_{C_o(t)}(P_0) := E_{P_0}[Y(t) \mid A(t) = d(C_o(t)), C_o(t)].$$

Observe that in the above definition $\Psi^d_{C_o(t)}(P_0)$, written $\Psi_{C_o(t)}(P_0)$ for short, depends on $P_0$ only through the true conditional distribution of $O(t)$ given $C_o(t)$.
For every $P$, we recall that $P_{C_o(t)}$ denotes the distribution of $O(t)$ given $C_o(t)$, and let $\mathcal{M}(C_o(t))$ be the set of such distributions corresponding to $P \in \mathcal{M}$. At each time point $t$, given a $C_o(t)$, we define a target parameter $\Psi_{C_o(t)} : \mathcal{M}(C_o(t)) \to \mathbb{R}$ that is pathwise differentiable with canonical gradient $D^*_{C_o(t)}(P_{C_o(t)})(o)$ at $P_{C_o(t)}$ in $\mathcal{M}(C_o(t))$. As described in Section 2.2, we have that $\Psi_{C_o(t)}(P_{C_o(t)}) = \Psi_{C_o(t)}(\theta)$, where $\Psi_{C_o(t)}(\theta)$ depends on $\theta$ only through its section $\theta(C_o(t), \cdot)$. We denote the collection of $C_o(t)$-specific canonical gradients as $(c, o) \mapsto D^*(P_{C_o(t)})(c, o)$, so that we can write them uniformly as a function of the observed components; with that, we have that $D^*_{C_o(t)}(P_{C_o(t)})(o) = D^*_{C_o(t)}(\theta)(o) = D^*(\theta)(c_o(t), o)$. As is customary for canonical gradients, for a given $C_o(t)$, $D^*(\theta)$ is a function of the observed data with conditional mean zero with respect to $P_{C_o(t)}$.

Finally, we propose a class of statistical target parameters $\bar{\Psi}(\theta)$ defined as the average over time of $C_o(t)$-specific counterfactual means under the treatment rule. In particular, the target parameter on $\mathcal{M}^N$, $\Psi^N : \mathcal{M}^N \to \mathbb{R}$, of the data distribution $P^N \in \mathcal{M}^N$ is defined as:

$$\bar{\Psi}(\theta) = \frac{1}{N} \sum_{t=1}^{N} \Psi_{C_o(t)}(\theta).$$

The statistical target parameter $\bar{\Psi}(\theta)$ is data-dependent, as it is defined as an average over time of parameters of the conditional distribution of $O(t)$ given the observed realization of $C_o(t)$; as such, it depends on $(C_o(1), \ldots, C_o(N))$. In practice, $\bar{\Psi}(\theta)$ is an average of the means under optimal treatment decisions over all observed contexts over time. As an average of $C_o(t)$-specific causal effects with a double robust efficient influence curve $D^*_{C_o(t)}(\theta)(o)$, it follows that we can estimate $\bar{\Psi}(\theta)$ in a double robust manner as well, as we further emphasize in the following section.
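As an illustration of how the data-dependent parameter $\bar{\Psi}(\theta)$ averages context-specific means, the sketch below evaluates the plug-in quantity $\frac{1}{N}\sum_t \bar{Q}(C_o(t), d(C_o(t)))$, which is what the identification result gives for each $\Psi_{C_o(t)}$. The function name and the callables `q_bar` and `rule` are hypothetical stand-ins for an outcome regression and a treatment rule supplied by the user; this is a plain plug-in average, not the TMLE the article develops later.

```python
from typing import Callable, Sequence, Tuple

Context = Tuple[float, ...]


def average_context_specific_mean(
    contexts: Sequence[Context],
    q_bar: Callable[[Context, int], float],
    rule: Callable[[Context], int],
) -> float:
    """Plug-in value of (1/N) * sum_t Q-bar(C_o(t), d(C_o(t))).

    `contexts` holds the observed summaries C_o(1), ..., C_o(N);
    `q_bar(c, a)` is a (hypothetical) conditional mean of Y given context c
    and treatment a; `rule(c)` returns the treatment in {0, 1}.
    """
    values = [q_bar(c, rule(c)) for c in contexts]
    return sum(values) / len(values)
```

Because the average runs over the realized contexts, two trajectories with different observed contexts target different numerical values of $\bar{\Psi}(\theta)$, which is the sense in which the parameter is data-adaptive.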
In the following theorem we provide the canonical gradient of our target parameter, which admits a first order expansion with a double-robust second order term.

Theorem 1 (Canonical gradient and first order expansion). Under the strong positivity assumption (Assumption 3 in subsection 5.2), the target parameter mapping $\Psi_{C_o(t)} : \mathcal{M}(C_o(t)) \to \mathbb{R}$ is pathwise differentiable w.r.t. $\mathcal{M}(C_o(t))$, with a canonical gradient w.r.t. $\mathcal{M}(C_o(t))$ given by

$$D^*_{C_o(t)}(\bar{Q}, g)(o) = \frac{g^*(a \mid C_o(t))}{g(a \mid C_o(t))} \left( y - \bar{Q}(a, C_o(t)) \right).$$

Furthermore, $\Psi_{C_o(t)}(\bar{Q})$ admits the following first order expansion:

$$\Psi_{C_o(t)}(\bar{Q}) - \Psi_{C_o(t)}(\bar{Q}_0) = -P_{0,C_o(t)} D^*_{C_o(t)}(\bar{Q}, g) + R(\bar{Q}, \bar{Q}_0, g, g_{0,t}),$$

where $R$ is a second order remainder that is doubly robust, with $R(\bar{Q}, \bar{Q}_0, g, g_{0,t}) = 0$ if either $\bar{Q} = \bar{Q}_0$ or $g = g_{0,t}$.

[…] marginal parameters. Unlike the conditional parameters we consider here, the efficient influence function of marginal parameters is not double-robust in the usual sense; that is, robust w.r.t. a pair of variation-independent nuisance parameters. More importantly, knowing or consistently estimating the treatment mechanism does not guarantee consistency of the causal effect for parameters described by van der Laan et al. (2018) and Kallus & Uehara (2019). We pursue the discussion on marginal parameters in more detail in subsection 8.1 in the appendix.

Now that we have identified the context-specific counterfactual outcome under $d$ as a parameter of the observed data distribution $P_0^N$, we can identify the optimal treatment rule. The optimal treatment rule is a priori a causal object defined as a function of $P_0^F$, and a parameter of the observed data generating distribution $P_0^N$. Under the identification assumptions, we can identify the optimal rule from the observed data distribution as follows. Fix arbitrarily $\bar{Q} \in \bar{\mathcal{Q}}$. To alleviate notation, we further introduce the blip function defined as:

$$B(C_o(t)) \equiv \bar{Q}(C_o(t), A(t) = 1) - \bar{Q}(C_o(t), A(t) = 0).$$

Intuitively, if $B(C_o(t)) > 0$, assigning treatment $A(t) = 1$ is more beneficial (in terms of optimizing $Y(t)$) than $A(t) = 0$ for time point $t$ under the current context $C_o(t)$. If $B(C_o(t)) < 0$, we can optimize the $t$-specific outcome by assigning the subject treatment $A(t) = 0$ instead. The true optimal rule for the purpose of optimizing the mean of the next (short-term) outcome $Y(t)$, for binary treatment, is then given by:

$$d_0(C_o(t)) \equiv I(B_0(C_o(t)) > 0). \qquad (2)$$

As defined in Equation (2), $d_0(C_o(t))$ is a typical treatment rule that maps the observed fixed-dimensional summary deterministically into one treatment; a stochastic treatment rule does so randomly (Luedtke & van der Laan 2016c,a, Chambaz et al. 2017).

In an adaptive sequential trial, the process of generating $A(t)$ is controlled by the experimenter. As such, one can simultaneously learn and start assigning treatment according to the best current estimate of the optimal treatment rule, with varying exploration-exploitation objectives. In this section we describe different strategies for estimating the optimal treatment rule, and propose different sampling schemes for assigning treatment.

First, we consider estimating the optimal treatment rule based on a parametric working model. As described previously, consider a treatment rule $C_o(t) \mapsto d(C_o(t)) \in \{0, 1\}$ that maps the history $C_o(t)$ into a treatment decision for $A(t)$. We define a parametric working model $\{q_{y,\theta} : \theta\}$ for $q_y$, indexed by parameter $\theta$. Notice that under the specified working model, we have that:

$$\bar{Q}_\theta(C_o(t), a) = E_\theta(Y(t) \mid C_o(t), A(t) = a) = \int y \, q_{y,\theta}(y \mid C_o(t), a) \, d\mu_y(y).$$

The true conditional treatment effect, $B_0(C_o(t))$, can then be expressed as $B_\theta(C_o(t)) = \bar{Q}_\theta(C_o(t), 1) - \bar{Q}_\theta(C_o(t), 0)$. The optimal treatment rule for $A(t)$ for the purpose of maximizing $Y(t)$ is given by:

$$d_0(C_o(t)) = I(B_0(C_o(t)) > 0).$$

Under the parametric working model, we note that the optimal treatment rule can be represented as:

$$d_\theta(C_o(t)) = I(B_\theta(C_o(t)) > 0).$$
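The blip function, the indicator rule of Equation (2), and the canonical gradient of Theorem 1 can be sketched in a few lines. The sketch below assumes that $g^*$ is the degenerate mechanism that puts mass one on the rule implied by the supplied $\bar{Q}$, which is one reading of the theorem rather than something the text states; the names `blip`, `rule_from_blip`, and `canonical_gradient` are hypothetical.

```python
from typing import Callable, Tuple

Context = Tuple[float, ...]
QBar = Callable[[Context, int], float]        # (c, a) -> E[Y | C_o = c, A = a]
Mechanism = Callable[[int, Context], float]   # (a, c) -> g(a | c)


def blip(q_bar: QBar, c: Context) -> float:
    """B(c) = Q-bar(c, 1) - Q-bar(c, 0)."""
    return q_bar(c, 1) - q_bar(c, 0)


def rule_from_blip(q_bar: QBar, c: Context) -> int:
    """d(c) = I(B(c) > 0): treat only when the blip is strictly positive."""
    return int(blip(q_bar, c) > 0)


def canonical_gradient(a: int, y: float, c: Context, q_bar: QBar, g: Mechanism) -> float:
    """D*(Q-bar, g)(o) = g*(a | c) / g(a | c) * (y - Q-bar(c, a)).

    Assumption: g*(a | c) = I(a == d(c)), with d the rule implied by q_bar.
    Positivity (g(a | c) > 0) is required for the ratio to be defined.
    """
    g_star = 1.0 if a == rule_from_blip(q_bar, c) else 0.0
    return g_star / g(a, c) * (y - q_bar(c, a))
```

Observations whose treatment disagrees with the rule contribute zero to the gradient, while those that agree are inverse-weighted by the probability with which that treatment was assigned.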
Let $\theta_{t-1}$ be the maximum likelihood estimate of the true $\theta_0$ based on the most current history, $\bar{O}(t-1)$, under the working model $q_{y,\theta}$. We could define the fixed-dimensional history $C_o(t)$ such that for each time point $t$, $\theta_{t-1}$ is included in the relevant history $C_o(t)$ for $O(t)$. The current estimate of the rule is then defined as:

$$d_{\theta_{t-1}}(C_o(t)) = I(B_{\theta_{t-1}}(C_o(t)) > 0).$$

If the parametric model is very flexible, $B_{\theta_{t-1}}$ might be a good approximation of the true conditional treatment effect $B_0(C_o(t))$. In that case, $d_{\theta_{t-1}}(C_o(t))$ is a good approximation of the optimal rule $d_0(C_o(t))$. Nevertheless, we argue that $\theta_{t-1}$ will converge to $\theta_0$ defined by a Kullback-Leibler projection of the true $q_{y,0}$ onto the working model $\{q_{y,\theta} : \theta\}$. Consequently, the rule $d_{\theta_{t-1}}(C_o(t))$ will converge to a fixed $I(B(C_o(t)) > 0)$ as $t$ converges to infinity.

Instead of considering a parametric working model, we explore estimation of the optimal treatment rule based on more flexible, possibly nonparametric approaches drawn from the machine learning literature. As in the previous subsection, we define $B_{\bar{Q}_{t-1}}(C_o(t))$ to be an estimator of the true blip function, $B_0(C_o(t))$, based on the most recent observations up to time $t$, $\bar{O}(t-1)$. […] $\bar{Q}$, which provides convenient computational and statistical properties for dense time-series data described elsewhere (van der Laan & Lendle 2014, Benkeser et al. 2018). Additionally, we might consider ensemble machine learning methods that target $B_0$ directly (Luedtke & van der Laan 2016c). As mentioned in the previous section, we can view $B_{\bar{Q}_{t-1}}(C_o(t))$ as just another univariate covariate extracted from the past, and include it in our definition of $C_o(t)$. If $B_{\bar{Q}_{t-1}}$ is consistent for $B_0$, then the rule $d_{\bar{Q}_{t-1}}(C_o(t))$ based on $B_{\bar{Q}_{t-1}}$ will converge to the optimal rule $I(B_0(C_o(t)) > 0)$ (…c, Chambaz et al. 2017).

In the following section, we describe two sampling schemes that define $g^N = \{g_t : t = 1, \ldots, N\}$ precisely. Both rely on estimating parts of the likelihood based on the time points collected so far for the single subject studied. The $t$-dependent current estimates of $\bar{Q}_0$ and $B_0$ are then further utilized to assign the next treatment, collect the next corresponding block of data, and estimate the target parameter of interest. Following the empirical process literature, we define $P_N f$ to be the empirical average of function $f$, and $P f = E_P f(O)$.

In the following, we follow closely the argument given in Chambaz et al. (2017) in order to define $g^N$. Let $\bar{Q}_{t-1}$ denote the time $t$ estimate of $\bar{Q}_0$ based on the time-series points collected so far, $\bar{O}(t-1)$. […] $d_{\bar{Q}_{t-1}}(C_o(t))$ might not be a good estimate of $d_0(C_o(t))$.
As such, assigning the current conditional probability of treatment deterministically based on the estimated rule could be ill-advised. In addition, without exploration (as is the case under a deterministic rule), we cannot guarantee consistency of the optimal rule estimator.

In light of that, we define $\{c_t\}_{t \geq 1}$ and $\{e_t\}_{t \geq 1}$ as user-defined, non-increasing sequences such that $c_1 \leq 1/2$, $\lim_t c_t \equiv c_\infty > 0$ and $\lim_t e_t \equiv e_\infty > 0$. More specifically, we let $\{e_t\}_{t \geq 1}$ define the level of random perturbation around the current estimate $d_{\bar{Q}_{t-1}}(C_o(t))$ of the optimal rule. We define $\{c_t\}_{t \geq 1}$ as the probability of failure, so choosing $c_1 = \cdots = c_t = 0.5$ [...] $e_t$, we pick the treatment uniformly at random. This positive probability $e_t$ is what is often referred to as the exploration rate in the bandit and reinforcement learning literature (Sutton & Barto 1998). For every $t \geq 1$, we could have the following function $G_t$ over $[-1, 1]$ as defined in Chambaz et al. (2017):

$$G_t(x) = c_t\, I[x \leq -e_t] + (1 - c_t)\, I[x \geq e_t] + \left( -\frac{1/2 - c_t}{2 e_t^3}\, x^3 + \frac{1/2 - c_t}{2 e_t / 3}\, x + \frac{1}{2} \right) I[-e_t \leq x \leq e_t],$$

where $G_t(x)$ is used to derive a stochastic treatment rule from an estimated blip function, such that

$$g_t(1 \mid C_o(t)) = G_t(B_{\bar{Q}_{t-1}}(C_o(t))).$$

Note that $G_t$ is a smooth approximation to $x \mapsto I[x \geq 0]$ bounded away from 0 and 1, mimicking the optimal treatment rule as an indicator of the true blip function. With that in mind, any other non-decreasing $k_n$-Lipschitz function $F_t$ with $F_t(x) = c_t$ for $x \leq -e_t$ and $F_t(x) = 1 - c_t$ for $x \geq e_t$ would approximate the optimal treatment rule as well. The definitions of $G_t$ and $g_t$ prompt the following lemma, which illustrates the ability of the sampling scheme to learn from the collected data, while still exploring.

Lemma 1. Let $t \geq 1$. Then we have that

$$\inf_{c_o(t)} g_t(d(c_o(t)) \mid c_o(t)) \geq \frac{1}{2}, \qquad \inf_{c_o(t)} g_t(1 - d(c_o(t)) \mid c_o(t)) \geq c_t.$$

Moreover, $g_t(1 \mid C_o(t))$ approximates $d(C_o(t))$ in the following sense:

$$|g_t(1 \mid C_o(t)) - d(C_o(t))| \leq c_\infty\, I[|B(C_o(t))| \geq e_\infty] + \frac{1}{2}\, I[|B(C_o(t))| < e_\infty].$$

If $c_\infty$ and $e_\infty$ are small and $|B(C_o(t))| \geq e_\infty$, then drawing the treatment assignment from a smooth approximation of $d(C_o(t))$ is not much different from following $d(C_o(t))$ itself, with little impact on the mean value of the reward.

Alternatively, one could allocate randomization probabilities based on the tails of an estimate of the blip function, $B(C_o(t))$. In particular, we present a sampling scheme that utilizes the Highly Adaptive Lasso (HAL) estimator for obtaining the bounds around the estimate of the true blip function. The Highly Adaptive Lasso is a nonparametric regression estimator that does not rely on local smoothness assumptions (Benkeser & van der Laan 2016, van der Laan 2017). Briefly, for the class of functions that are right-hand continuous with left-hand limits and have a finite variation norm, HAL is an MLE which can be computed based on $L_1$-penalized regression. As such, it is similar to standard lasso regression in its implementation, except that the relationship between the predictors and the outcome is described by data-dependent basis functions instead of a parametric model. For a thorough description of the Highly Adaptive Lasso estimator, we refer the reader to Benkeser & van der Laan (2016) and van der Laan (2017); we provide more details on the Highly Adaptive Lasso in appendix subsection 8.4.

We propose to use HAL to estimate $B(C_o)$, which implies an estimator for the optimal rule $d(C_o) = I(B(C_o) > 0)$. The estimator is based on the loss function $L_B(\theta)(o, C_o) = (D(\theta)(O) - B(C_o))^2$, where the nuisance parameter $\theta = (g, \bar{Q})$ is required to evaluate $D(\theta)(O)$.
This influence function has the property that $E_0(D(\theta) \mid C_o) = B_0(C_o)$ if either $\bar{Q} = \bar{Q}_0$ or $g = g_0$, under positivity. As such, $L_B(\theta)$ is a double robust and efficient loss function for the true risk, in the sense that $P_n L_B(\theta)$ is a double robust, locally efficient estimator of the true risk under regularity conditions. As a double robust and efficient loss, the true risk of the loss function $L_B(\theta)(o, C_o)$ equals $P_0 (B - B_0)^2(C_o)$ up to a constant if either $D(\theta) = D(\bar{Q}_0, g)$ or $D(\theta) = D(\bar{Q}, g_0)$.

Let $E(D(\theta) \mid C_o) = \psi^{blip}$, with $\psi^{blip} \in D[0, \tau]$, the Banach space of $d$-variate cadlag functions. Define $C_{o,s} = \{C_{o,j} : j \in s\}$ for a given subset $s \subset \{1, \ldots, d\}$. For $\psi^{blip} \in D[0, \tau]$, we define the $s$-th section of $\psi^{blip}$ as $\psi^{blip}_s(c_o) = \psi^{blip}(c_{o,1} I(1 \in s), \ldots, c_{o,d} I(d \in s))$, where $c_o$ ranges over the possible values of $C_o$. We assume the variation norm of $\psi^{blip}$ is finite:

$$\|\psi^{blip}\|_v = \psi^{blip}(0) + \sum_{s \subset \{1, \ldots, d\}} \int_{0_s}^{\tau_s} |\psi^{blip}_s(du)| < M.$$

The HAL estimator represents $\psi^{blip}$ as

$$\psi^{blip}(c_o) = \psi^{blip}(0) + \sum_{s \subset \{1, \ldots, d\}} \int_{0_s}^{c_{o,s}} \psi^{blip}_s(du) = \psi^{blip}(0) + \sum_{s \subset \{1, \ldots, d\}} \int_{0_s}^{\tau_s} I(u \leq c_{o,s})\, \psi^{blip}_s(du),$$

and uses a discrete measure $\psi^{blip}_m$ with $m$ support points to approximate this representation. For each subset $s$, at time $t = N$, we select as support points the $N$ observed values $\tilde{c}_{o,s}(t)$, $t = 1, \ldots, N$, of the context $C_{o,s}(t)$. Then, for each subset $s$, we have a discrete approximation of $\psi^{blip}_s$ with support defined by the actual $N$ observations and point masses $d\psi^{blip}_{m,s,t}$, the point mass assigned by $\psi^{blip}_m$ to the point $\tilde{c}_{o,s}(t)$, $t = 1, \ldots, N$. This approximation consists of a linear combination of basis functions $c_o \mapsto \phi_{s,t}(c_o) = I(c_{o,s} \geq \tilde{c}_{o,s}(t))$ with corresponding coefficients $d\psi^{blip}_{m,s,t}$, summed over $s$ and $t = 1, \ldots, N$.

The minimization of the empirical risk $P_n L_B(\theta)(o, C_o)$ of this estimator, $\psi^{blip}_n$, corresponds to lasso regression with predictors $\phi_{s,t}$ across all subsets $s \subset \{1, \ldots, d\}$ and for $t = 1, \ldots, N$. That is, for

$$\psi^{blip}_\beta = \beta_0 + \sum_{s \subset \{1, \ldots, d\}} \sum_{t=1}^{N} \beta_{s,t}\, \phi_{s,t}$$

and corresponding subspace $\Psi_{n,M} = \{\psi_\beta : \beta,\ \beta_0 + \sum_{s \subset \{1, \ldots, d\}} \sum_{t=1}^{N} |\beta_{s,t}| < M\}$,

$$\beta_n = \arg\min_{\beta,\ \beta_0 + \sum_{s \subset \{1, \ldots, d\}} \sum_{t=1}^{N} |\beta_{s,t}| < M} P_n L_B(\theta)(\psi_\beta).$$
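Returning to the first, stochastic sampling scheme of this section, the map from an estimated blip value to a treatment probability can be sketched in a few lines. The following is only an illustrative sketch, not the authors' implementation: the function names `smooth_rule`, `treatment_probability` and `draw_treatment` are hypothetical, the blip estimate is taken as a given number, and the default values $c_t = 0.10$ and $e_t = 0.05$ simply mirror the constants reported in the simulation section.

```python
import random


def smooth_rule(x, c_t, e_t):
    """G_t(x): smooth, non-decreasing approximation of 1{x >= 0} with values in [c_t, 1 - c_t]."""
    if not (0.0 < c_t <= 0.5) or e_t <= 0.0:
        raise ValueError("need 0 < c_t <= 1/2 and e_t > 0")
    if x <= -e_t:
        return c_t
    if x >= e_t:
        return 1.0 - c_t
    a = 0.5 - c_t
    # Cubic piece on [-e_t, e_t]; it matches the two plateaus with zero slope at +/- e_t.
    return -a / (2.0 * e_t ** 3) * x ** 3 + a / (2.0 * e_t / 3.0) * x + 0.5


def treatment_probability(blip_estimate, c_t=0.10, e_t=0.05):
    """g_t(1 | C_o(t)) = G_t(estimated blip at the current context)."""
    return smooth_rule(blip_estimate, c_t, e_t)


def draw_treatment(blip_estimate, rng, c_t=0.10, e_t=0.05):
    """Draw A(t) in {0, 1} from the stochastic rule; rng is a random.Random instance."""
    return 1 if rng.random() < treatment_probability(blip_estimate, c_t, e_t) else 0


if __name__ == "__main__":
    rng = random.Random(2021)
    for blip in (-0.2, -0.02, 0.0, 0.02, 0.2):
        print(blip, treatment_probability(blip), draw_treatment(blip, rng))
```

With these choices, an estimated blip that is clearly positive leads to treatment with probability $1 - c_t$, a clearly negative one with probability $c_t$, and values inside $[-e_t, e_t]$ are interpolated smoothly, which is how the scheme keeps exploring while mostly following the estimated rule.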

Theorem 2. For any $\bar{Q}_1 \in \bar{\mathcal{Q}}$, the difference between the TMLE and its target decomposes as

$$\bar{\Psi}(\bar{Q}^*_N) - \bar{\Psi}(\bar{Q}_0) = M_{1,N}(\bar{Q}_1) + M_{2,N}(\bar{Q}^*_N, \bar{Q}_1),$$

with

$$M_{1,N}(\bar{Q}_1) = \frac{1}{N} \sum_{t=1}^{N} D^*(\bar{Q}_1)(C_o(t), O(t)) - P_{0,C_o(t)} D^*(\bar{Q}_1),$$

$$M_{2,N}(\bar{Q}^*_N, \bar{Q}_1) = \frac{1}{N} \sum_{t=1}^{N} (\delta_{C_o(t),O(t)} - P_{0,C_o(t)})(D^*(\bar{Q}^*_N) - D^*(\bar{Q}_1)).$$

The first term, $M_{1,N}(\bar{Q}_1)$, is the average of a martingale difference sequence, and we will analyze it with a classical martingale central limit theorem. The second term is a martingale process indexed by $\bar{Q} \in \bar{\mathcal{Q}}$, evaluated at $\bar{Q} = \bar{Q}^*_N$. We will prove an equicontinuity result under a complexity condition for a process derived from the function class $\{D^*(\bar{Q}) : \bar{Q} \in \bar{\mathcal{Q}}\}$, which will imply that if $\bar{Q}^*_N \xrightarrow{P} \bar{Q}_1 \in \bar{\mathcal{Q}}$ then $M_{2,N}(\bar{Q}^*_N, \bar{Q}_1) = o_P(N^{-1/2})$.

A set of sufficient conditions for the asymptotic normality of the term $M_{1,N}(\bar{Q}_1)$ is that (a) the terms $D^*(\bar{Q}_1)(C_o(t), O(t))$ remain bounded, and (b) the average of the conditional variances of $D^*(\bar{Q}_1)(C_o(t), O(t))$ stabilizes. A sufficient condition for condition (a) to hold is the following strong version of the positivity assumption.

Assumption 3 (Strong positivity). There exists $\delta > 0$ such that, for every $t \geq 1$, $g_{0,t}(A(t) \mid C_a(t)) \geq \delta$, $P_0$-a.s.

Assumption 4 (Stabilization of the mean of conditional variances). There exists $\sigma^2(\bar{Q}_1) \in (0, \infty)$ such that

$$\frac{1}{N} \sum_{t=1}^{N} \mathrm{Var}_{Q_0}\left( D^*(\bar{Q}_1)(C_o(t), O(t)) \mid C_o(t) \right) \xrightarrow{d} \sigma^2(\bar{Q}_1).$$

We formally state below our asymptotic normality result for $M_{1,N}(\bar{Q}_1)$.

Theorem 3. Suppose that assumption 3 and assumption 4 hold. Then

$$\sqrt{N}\, M_{1,N}(\bar{Q}_1) \xrightarrow{d} N(0, \sigma^2(\bar{Q}_1)).$$

Proof.

The result follows directly from various versions of martingale central limit theorems (e.g. theorem 2 in Brown (1971)).

We show in section 8.2 in the appendix that the conditional variances stabilize under (1) mixing and ergodicity conditions for the sequence $(C_o(t))$ of contexts, and (2) asymptotic stabilization of the design $g_{0,t}$. We discuss special cases in which these mixing and ergodicity conditions can be checked explicitly in appendix section 8.2. We rely on the empirical variance estimator

$$\hat{\sigma}^2_N := \frac{1}{N} \sum_{t=1}^{N} D^*(\bar{Q}^*_N, g_{0,t})^2 (C_o(t), O(t)),$$

which converges to the asymptotic variance $\sigma^2(\bar{Q}_1)$ of $M_{1,N}(\bar{Q}_1)$.

In this subsection, we give a brief overview of the analysis of the term $M_{2,N}(\bar{Q}^*_N, \bar{Q}_1)$, which we carry out in detail in appendix section 8.3. We show that $M_{2,N}(\bar{Q}^*_N, \bar{Q}_1) = o_P(N^{-1/2})$ by proving an equicontinuity result for the process $\{M_{2,N}(\bar{Q}, \bar{Q}_1) : \bar{Q} \in \bar{\mathcal{Q}}\}$. Our equicontinuity result relies on a measure of complexity for the process

$$\Xi_N := \left\{ \left( D^*(\bar{Q}, g_{0,t})(C_o(t), O(t)) - D^*(\bar{Q}_1, g_{0,t})(C_o(t), O(t)) \right)_{t=1}^{N} : \bar{Q} \in \bar{\mathcal{Q}} \right\}, \qquad (3)$$

which we refer to as sequential bracketing entropy, introduced by van de Geer (2000) for the analysis of martingale processes. We relegate the formal definition of the sequential bracketing entropy to appendix section 8.3. In particular, we denote by $N_{[\,]}(\epsilon, b, \Xi_N, \bar{O}(N))$ the sequential bracketing number of $\Xi_N$ corresponding to brackets of size $\epsilon$. Our equicontinuity result is a sequential equivalent of similar results for i.i.d. settings (e.g. van der Vaart & Wellner (2013)) and similarly relies on a Donsker-like condition.

Assumption 5 (Sequential Donsker condition). Define the sequential bracketing entropy integral as

$$J_{[\,]}(\epsilon, b, \Xi_N, \bar{O}(N)) := \int_0^{\epsilon} \sqrt{\log\left(1 + N_{[\,]}(u, b, \Xi_N, \bar{O}(N))\right)}\, du.$$

Suppose that there exists a function $a : \mathbb{R}_+ \to \mathbb{R}_+$ that converges to 0 as $\epsilon \to 0$, such that $J_{[\,]}(\epsilon, b, \Xi_N, \bar{O}(N)) \leq a(\epsilon)$.

Note that a sufficient condition for assumption 5 to hold is that $\log(1 + N_{[\,]}(\epsilon, b, \Xi_N, \bar{O}(N))) \leq C \epsilon^{-p}$, with $p \in (0, 2)$ and a constant $C > 0$ that does not depend on $N$.

Assumption 6 ($L_2$ convergence of the outcome model). It holds that $\|\bar{Q}^*_N - \bar{Q}_1\|_{2, g^*, h_N} = o_P(1)$, where $h_N$ is the empirical measure $h_N := N^{-1} \sum_{t=1}^{N} \delta_{C_o(t)}$.

Theorem 4 (Equicontinuity of the martingale process term). Consider the process $\Xi_N$ defined in equation (3). Suppose that assumptions 3, 5 and 6 hold. Then $M_{2,N}(\bar{Q}^*_N, \bar{Q}_1) = o_P(N^{-1/2})$.

As an immediate corollary of theorems 3 and 4, we have the following asymptotic normality result for our TML estimator.

Theorem 5 (Asymptotic normality of the TMLE). Suppose that assumptions 3, 4, 5 and 6 hold. Then

$$\sqrt{N} \left( \bar{\Psi}(\bar{Q}^*_N) - \bar{\Psi}(\bar{Q}_0) \right) \xrightarrow{d} N(0, \sigma^2(\bar{Q}_1)).$$

$\hat{\sigma}_N$ converges in probability to $\sigma(\bar{Q}_1)$, which implies that

$$\hat{\sigma}^{-1}_N \sqrt{N} \left( \bar{\Psi}(\bar{Q}^*_N) - \bar{\Psi}(\bar{Q}_0) \right) \xrightarrow{d} N(0, 1).$$

Therefore, denoting by $q_{1-\alpha/2}$ the $1 - \alpha/2$ quantile of the standard normal distribution, the confidence interval

$$\left[ \bar{\Psi}(\bar{Q}^*_N) - \frac{q_{1-\alpha/2}\, \hat{\sigma}_N}{\sqrt{N}},\ \bar{\Psi}(\bar{Q}^*_N) + \frac{q_{1-\alpha/2}\, \hat{\sigma}_N}{\sqrt{N}} \right]$$

has asymptotic coverage $1 - \alpha$ for the target $\bar{\Psi}(\bar{Q}_0)$.

In this section we present simulation results concerning the adaptive learning of the optimal individualized treatment rule estimated using machine learning methods for a single time-series. We focus on the stochastic sampling scheme described in subsection 3.2.1, and explore the performance of our estimator with different initial sample sizes and subsequent sequential updates. We consider a binary outcome and treatment, but note that the results will be comparable for a continuous bounded outcome. Finally, unless specified otherwise, we present coverage of the mean under the current estimate of the optimal individualized treatment rule at each update based on 500 Monte Carlo draws. We set the reference treatment mechanism to a balanced design, assigning treatment with probability 0.5.

We explore a simple dependence setting first, emphasizing the connection with i.i.d. sequential settings. The data consist of a binary treatment ($A(t) \in \{0, 1\}$) and outcome ($Y(t) \in \{0, 1\}$). The time-varying covariate $W(t)$ decomposes as $W(t) \equiv (W_1(t), W_2(t))$ with binary $W_1$ and continuous $W_2$. The outcome $Y$ at time $t$ is conditionally drawn, given $A(t)$ and lagged values of $Y$ and $W$, from a Bernoulli distribution whose success probability depends on a linear combination of these variables; the coefficients are listed with the data-generating distribution below. We generate initial samples of size $t = 1000$ and $t = 500$ by first drawing a set of four $O(t)$ samples randomly from binomial and normal distributions in order to have a starting point to initiate time dependence.
After the first 4 draws, we draw $A(t)$ from a binomial distribution with success probability 0.5, $Y(t)$ from a Bernoulli distribution with success probability dependent on $A(t)$ and on past values of $A$, $Y$ and $W$, followed by $W_1(t)$ conditional on past values of $Y$ and $W$, and $W_2(t)$ conditional on past values of $A$, $Y$ and $W$. After $t = 1000$ or $t = 500$, we continue to draw $O(t)$ as above, but with $A(t)$ drawn from a stochastic intervention approximating the current estimate $d_{\bar{Q}_{t-1}}$ of the optimal rule $d_{\bar{Q}_0}$. This procedure is repeated until reaching a specified final time point indicating the end of a trial. Our estimator of $\bar{Q}_0$, and thereby of the optimal rule $d_0$, is based on an online super-learner with an ensemble consisting of multiple algorithms, including simple generalized linear models, penalized regressions, HAL and extreme gradient boosting (Coyle et al. 2018). For cross-validation, we relied on the online cross-validation scheme, also known as the recursive scheme in the time-series literature. The sequences $\{c_t\}_{t \geq 1}$ and $\{e_t\}_{t \geq 1}$ are chosen constant, with $c_\infty = 10\%$ and $e_\infty = 5\%$. The TMLEs are computed at sample sizes that are a multiple of 200, and no more than 1800 (for initial $t = 1000$) or 1300 (for initial $t = 500$), at which point sampling is stopped. We use the coverage of asymptotic 95% confidence intervals to evaluate the performance of the TMLE in estimating the average across time $t$ of the $d_{\bar{Q}_{t-1}}$-specific mean outcome. The exact data-generating distribution used is as follows:

A(0 : 4) ∼ Bern(0.
Y(0 : 4) ∼ Bern(0.
W(0 : 4) ∼ Bern(0.
W(0 : 4) ∼ Normal(0,
A(4 : t) ∼ Bern(0.
Y(4 : t) ∼ Bern(expit(1. ∗ A(i) + 0. ∗ Y(i − − . ∗ W(i −
W(4 : t) ∼ Bern(expit(0. ∗ W(i − − . ∗ Y(i − 1) + 0. ∗ W(i −
W(4 : t) ∼ Normal(0. ∗ A(i − 1) + Y(i − − W(i − , sd = 1)
A(t : 1800) ∼ $d_{\bar{Q}_{t-1}}$
Y(t : 1800) ∼ Bern(expit(1. ∗ A(i) + 0. ∗ Y(i − − . ∗ W(i −
W(t : 1800) ∼ Bern(expit(0. ∗ W(i − − . ∗ Y(i − 1) + 0. ∗ W(i −
W(t : 1800) ∼ Normal(0. ∗ A(i − 1) + Y(i − − W(i − , sd = 1).

From Table 1, we can see that the 95% coverage for the average across time of the counterfactual mean outcome under the current estimate of the optimal dynamic treatment approaches nominal coverage with increasing time-steps, for both $t = 500$ and $t = 1000$ length of the initial time-series. The mean conditional variance stabilizes with increasing time-steps, as illustrated in Table 2 and Figure 1A, thus satisfying assumption 4, which is necessary for showing asymptotic normality of the TML estimator.

In Simulation 1b, we explore the behavior of our estimator in the case of more elaborate dependence. As in Simulation 1a, we only consider binary treatment ($A(t) \in \{0, 1\}$) and outcome ($Y(t) \in \{0, 1\}$), with binary and continuous time-varying covariates. We set the reference treatment mechanism to a balanced treatment mechanism assigning treatment with probability $P(A(t) = 1) = 0.5$, and generate the initial sample of size $t = (1000, 500)$ by drawing $W_1(t), W_2(t), A(t), Y(t)$. As before, after the first $t = 1000$ or $t = 500$ time-points, we continue to draw $O(t)$ with $A(t)$ sampled from a stochastic intervention approximating the current estimate $d_{\bar{Q}_{t-1}}$ of the optimal rule $d_{\bar{Q}_0}$. The estimator of the optimal rule $d_{\bar{Q}_0}$ was based on an ensemble of machine learning algorithms and regression-based algorithms, with an honest risk estimate achieved by utilizing an online cross-validation scheme with a validation set size of 30. The sequences $\{c_t\}_{t \geq 1}$ and $\{e_t\}_{t \geq 1}$ were set to 10% and 5%, respectively. The TMLEs are computed at the initial $t = 1000$ or $t = 500$, and subsequently at sample sizes that are a multiple of 200, and no more than 1800 (or 1300), at which point sampling is stopped. The exact data-generating distribution used is as follows:

A(0 : 4), Y(0 : 4), W(0 : 4) ∼ Bern(0.
W(0 : 4) ∼ Normal(0,
A(4 : t) ∼ Bern(0.
Y(4 : t) ∼ Bern(expit(1. ∗ A(i) + 0. ∗ Y(i − − . ∗ W(i −
W(4 : t) ∼ Bern(expit(0. ∗ W(i − − . ∗ Y(i − 1) + 0. ∗ W(i −
W(4 : t) ∼ Normal(0. ∗ A(i − 1) + 0. ∗ W(i −
W(t : 1800) ∼ Normal(0. ∗ A(i − 1) + Y(i − − W(i − , sd = 1).

As demonstrated in Table 1, the TML estimator approaches 95% coverage with an increasing number of time points under the more elaborate dependence structure as well. The assumption of stabilization of the mean of conditional variances is supported by Table 2 and Figure 1B, allowing for the asymptotic coverage $1 - \alpha$ for the target $\bar{\Psi}(\bar{Q}_0)$.

Simulation   t_0    Cov t_0   Cov t_1   Cov t_2   Cov t_3   Cov t_4
1a           500    90.00     93.20     93.80     94.80     94.60
1b           500    89.60     90.20     89.90     90.80     91.40

Table 1: The 95% coverage for the average across time of the counterfactual mean outcome under the current estimate of the optimal dynamic treatment at time points $t_0$, $t_1 = t_0 + 200$, $t_2 = t_0 + 400$, $t_3 = t_0 + 600$ and $t_4 = t_0 + 800$. The first $t_0$ time points sample treatment with probability 0.5. The sequences $\{c_t\}_{t \geq 1}$ and $\{e_t\}_{t \geq 1}$ are chosen constant, with $c_\infty = 10\%$ and $e_\infty = 5\%$. TMLEs are computed at $t_0 = \{1000, 500\}$, $t_1$, $t_2$, $t_3$ and $t_4$, with sequential updates being of size 200. The results are reported over 500 Monte-Carlo draws for Simulations 1a and 1b with initial sample sizes 1000 and 500.

Simulation   t_0    Var t_0   Var t_1   Var t_2   Var t_3   Var t_4
1a           500    0.0011    0.0024    0.0035    0.0014    0.0011
1b           500    0.0199    0.0171    0.0187    0.0152    0.0087

Table 2: Variance for the average across time of the counterfactual mean outcome under the current estimate of the optimal dynamic treatment at time points $t_0$, $t_1 = t_0 + 200$, $t_2 = t_0 + 400$, $t_3 = t_0 + 600$ and $t_4 = t_0 + 800$, over 500 Monte-Carlo draws for Simulations 1a and 1b with initial sample sizes 1000 and 500.

Figure 1: Illustration of the data-adaptive inference of the mean reward under the optimal treatment rule with initial sample size n = 1000 and n = 500 for Simulation 1a and 1b. The red crosses reflect successive values of the data-adaptive true parameter, with stars representing the estimated parameter with the corresponding 95% confidence interval for the data-adaptive parameter.

Conclusions

In this manuscript, we consider causal parameters based on observing a single time series with asymptotic results derived over time $t$. The data setup constitutes a typical longitudinal data structure, where within each $t$-specific time-block one observes treatment and outcome nodes, and possibly time-dependent covariates in between treatment nodes. Each $t$-specific data record $O(t)$ is viewed as its own experiment in the context of the observed history $C_o(t)$, carrying information about a causal effect of the treatment nodes on the next outcome node. While in this work we concentrate on single time point interventions, we emphasize that our setup can be easily generalized to context-specific causal effects of multiple time point interventions, therefore estimating the causal effect of $A(t : t + k)$ on future $Y(t + k)$.

A key assumption necessary in order to obtain the presented results is that the relevant history for generating $O(t)$, given the past $\bar{O}(t-1)$, can be summarized by a fixed dimensional summary $C_o(t)$. We note that our conditions allow for $C_o(t)$ to be a function of the whole observed past, allowing us to avoid Markov-order type assumptions that limit dependence to the recent, or a specifically predefined, past. Components of $C_o(t)$ that depend on the whole past, such as an estimate of the optimal treatment rule based on $(O(1), \ldots, O(t-1))$, [...]. Each $t$-specific experiment corresponds to drawing from a conditional distribution of $O(t)$ given $C_o(t)$. We assume that this conditional distribution is either constant in time or is parametrized by a constant function. As such, we can learn the true mechanism that generates the time-series, even when the model for the mechanism is nonparametric. With the exception of parametric models allowing for maximum likelihood estimation, we emphasize that statistical inference for the proposed target parameters of the time-series data generating mechanism is a challenging problem which requires targeted machine learning.

The work of van der Laan et al. (2018) and Kallus & Uehara (2019) studies marginal causal parameters, marginalizing over the distribution of $C_o(t)$, defined on the same statistical model as the parameter we consider in this article. In particular, van der Laan et al. (2018) define target parameters and estimation of the counterfactual mean of a future (e.g., long term) outcome under a stochastic intervention on a subset of the treatment nodes, allowing for extensions to single unit causal effects. As such, the target parameter proposed by van der Laan et al. (2018) addresses the important question regarding the distribution of the outcome at time $t$, had we intervened on some of the past treatment nodes in a (possibly single) time-series. While important, the TMLEs of such target parameters are challenging to implement due to their reliance on estimation of the marginal density of $C_o(t)$ averaged across time $t$. Additionally, we remark that such marginal causal parameters cannot be robustly estimated if treatment is sequentially randomized, due to the lack of double robustness of the second order remainder.

In this work, we focus on a context-specific target parameter in order to explore robust statistical inference for causal questions based on observing a single time series of a particular unit. We note that for each given $C_o(t)$, any intervention-specific mean outcome $E Y_{g^*}(t)$, with $g^*$ being a stochastic intervention w.r.t. the conditional distribution of $P_{C_o(t)}$ (with a deterministic rule being a special case), represents a well studied statistical estimation problem based on observing many i.i.d. copies. Even though we do not have repeated observations from the $C_o(t)$-specific distribution at time $t$, the collection $(C_o(t), O(t))$ across all time points represents the analogue of an i.i.d. data set $(C_o(t), O(t)) \sim_{iid} P_0$, where $C_o(t)$ can be viewed as a baseline covariate for the longitudinal causal inference data structure; we make the connection with the i.i.d. sequential design in one of our simulations. The initial estimation step of the TMLE should still respect the known dependence in the construction of the initial estimator, by relying on appropriate estimation techniques developed for dependent data. Similarly, variance estimation can proceed as in the i.i.d. case using the relevant i.i.d. efficient influence curve. This insight relies on the fact that the TMLE in this case allows for the same linear approximation as the TMLE for i.i.d. data, with the martingale central limit theorem applied to the linear approximation instead. Since the linear expansion of the time-series TMLE for the context-specific parameter is an element of the tangent space of the statistical model, our derived TMLE is asymptotically efficient.

Our motivation for studying the proposed context-specific parameter stems from its important role in precision medicine, in which one wants to tailor the treatment rule to the individual observed over time. In particular, we derive a TMLE which uses only the past data $\bar{O}(t-1)$ of a single unit in order to learn the optimal treatment rule for assigning $A(t)$ to maximize the mean outcome $Y(t)$. Here, we assign the treatment at the next time point $t + 1$ according to the current estimate of the optimal rule, allowing for the time-series to learn and apply the optimal treatment rule at the same time. The time-series generated by the described adaptive design within a single unit can be used to estimate, and most importantly provide inference for, the average across all time-points $t$ of the counterfactual mean outcome of $Y(t)$ under the estimate $d(C_o(t))$ of the optimal rule at a relevant time point $t$. Assuming that the estimate of the optimal rule is consistent, as the number of time-points increases, our target parameter converges to the mean outcome one would have obtained had they carried out the optimal rule from the start. As such, we can effectively learn the optimal rule and simultaneously obtain valid inference for its performance. Interestingly, this does not provide inference relative to, for example, the control that always assigns $A(t) = 0$. This is due to the fact that by assigning treatment $A(t)$ according to a rule, the positivity assumption needed to learn $\frac{1}{N} \sum_t E(Y_{A(t)=0}(t) \mid C_o(t))$ is violated. However, we note that one can safely conclude that one will not be worse than this control rule, even when the control rule is equal to the optimal rule. If one is interested in inference for a contrast based on a single time-series, then we advocate for random assignment between the control and the estimate of the optimal rule. As such, our proposed methodology still allows one to learn the desired contrast.

Finally, we note that while the context-specific parameter enjoys many important statistical and computational advantages as opposed to the marginal target parameter based on a single time-series, the formulation employed in this article is only sensible if one is interested in the causal effect of treatment on a short-term outcome. In particular, if the amount of time necessary to collect outcome $Y(t)$ in $O(t)$ is long, then generating a long time series would take too much time to be practically useful. If one is interested in causal effects on a long term outcome and is willing to forgo utilizing known randomization probabilities for treatment, we advocate for the marginal target parameters as described in previous work by van der Laan et al. (2018) or Kallus & Uehara (2019).

Appendix

We present below two alternative statistical parameters defined on the same statistical model as the parameter we consider in this article, and which were considered in previous works (van der Laan et al. 2018, Kallus & Uehara 2019). The parameters are marginal, as opposed to the context-specific parameters we consider in the present article. The definition of the marginal parameters entails integrating against certain marginal distributions of contexts, as we make explicit below.

Consider the distribution $P_{Q,g^*}$ over infinite sequences taking values in the infinite cartesian product space $\times_{t=1}^{\infty} \mathcal{O}$, defined from the factors of $P \in \mathcal{M}$ by the following G-computation formula:

$$P_{Q,g^*}((o(t))_{t=1}^{\infty}) := P_{C_o(1)}(c_o(1)) \prod_{t=1}^{\infty} g^*(a(t) \mid c_o(t))\, Q(y(t) \mid c_o(t))\, Q_w(w(t) \mid c_o(t)).$$

Let $(O^*(t))_{t=1}^{\infty} \sim P_{Q,g^*}$, with $O^*(t) = (A^*(t), Y^*(t), W^*(t))$.

As a first example of a marginal parameter, van der Laan et al. (2018) consider a class of parameters which includes

$$\Psi_{1,\tau}(P) := E_{Q,g^*}[Y^*(\tau)],$$

for $\tau \geq 1$. Under the causal identifiability assumptions 1 and 2, $\Psi_{1,\tau}(P)$ equals the mean outcome we would obtain at time $\tau$, under a counterfactual time series with initial context distribution $P_{0,C_o(1)}$ and intervention $g^*$ (instead of the observed intervention $g_0$) at every time point. We note that $P_{0,C_o(1)}$ is the initial, observed data-generating distribution. The canonical gradient of $\Psi_{1,\tau}$ w.r.t. our model $\mathcal{M}$ (where $\mathcal{M}$ assumes $P_{C_o(1)}$ known) is

$$D^*(P)(o^N) := \frac{1}{N} \sum_{t=1}^{N} \bar{D}(Q, \omega, g)(c_o(t), o(t)),$$

with

$$\bar{D}(Q, \omega, g)(c_o, o) := \sum_{s=1}^{\tau} \omega_s(c_o)\, \frac{g^*(a \mid c_o)}{g(a \mid c_o)} \left\{ E_{Q,g^*}[Y^*(\tau) \mid O^*(s) = o, C^*_o(s) = c_o] - E_{Q,g^*}[Y^*(\tau) \mid A^*(s) = a, C^*_o(s) = c_o] \right\},$$

with $\omega_s(c_o) = h_{C^*_o(s)}(c_o) / \bar{h}_N(c_o)$, where

$$h_{C_o(s)}(c_o) = P_{Q,g}[C_o(s) = c_o], \qquad \bar{h}_N(c_o) = \frac{1}{N} \sum_{t=1}^{N} h_{C_o(t)}(c_o), \qquad h_{C^*_o(s)}(c_o) = P_{Q,g^*}[C^*_o(s) = c_o]$$

are the marginal density of context $C_o(s)$ under $P$, the average thereof over observed time points $t = 1, \ldots, N$, and the marginal density of context $C^*_o(s)$ under $P_{Q,g^*}$. We note that $\Psi_{1,1}$ is the marginal equivalent of our parameter $\Psi_{C_o(1)}$. Specifically, $\Psi_{1,1}(P) = \int dP_{C_o(1)}(c_o(1))\, \Psi_{c_o(1)}(P)$.

If we instead supposed that $P_{C_o(1)}$ is unknown and lies in a certain model $\mathcal{M}_{P_{C_o(1)}}$, the canonical gradient would have one additional component, which would lie in the tangent space of $\mathcal{M}_{P_{C_o(1)}}$. As far as the conditional parameters of the main text are concerned, this distinction has no effect, as these do not depend on the marginal distribution of contexts and therefore their canonical gradients have no components in the tangent spaces corresponding to the context distributions.

8.1.2 Marginal parameter by Kallus and Uehara (2019)

Let $\gamma \in (0, 1)$. Kallus & Uehara (2019) consider the parameter

$$\Psi_2(P) := E_{Q,g^*}\left[ \sum_{\tau=1}^{\infty} \gamma^{\tau} Y^*(\tau) \right] = \sum_{\tau \geq 1} \gamma^{\tau}\, \Psi_{1,\tau}(P).$$
Under the causal identifiability assumptions 1 and 2, $\Psi_2(P)$ is the expected total discounted outcome from time point 1 until $\infty$ that we would get if we carried out intervention $g^*$ forever, starting from the initial context distribution $P_{0,C_o(1)}$ as in the observed data generating distribution. The canonical gradient of $\Psi_2$ w.r.t. $\mathcal{M}$ (again, supposing that $\mathcal{M}$ considers $P_{0,C_o(1)}$ known) is

$$D^*(P)(o^N) := \frac{1}{N} \sum_{t=1}^{N} \bar{D}(Q, \omega, g)(c_o(t), o(t)),$$

with

$$\bar{D}(Q, \omega, g)(c_o, o) := \sum_{s=1}^{\infty} \omega_s(c_o)\, \frac{g^*(a \mid c_o)}{g(a \mid c_o)} \left\{ y + \gamma V_{1,Q,g^*}(c_o, o) - V_{2,Q,g^*}(c_o, a) \right\},$$

where $\omega_s$ is defined as in the previous example, and

$$V_{1,Q,g^*}(c_o, o) := E_{Q,g^*}\left[ \sum_{\tau \geq 1} \gamma^{\tau} Y^*(\tau) \,\Big|\, C^*_o(1) = c_o, O^*(1) = o \right], \qquad V_{2,Q,g^*}(c_o, a) := E_{Q,g^*}\left[ \sum_{\tau \geq 1} \gamma^{\tau} Y^*(\tau) \,\Big|\, C^*_o(1) = c_o, A^*(1) = a \right].$$

In this article we are concerned with adaptive trials where the intervention is controlled by the experimenter, hence $g_0$ is known; we therefore only consider the case $g = g_0$. Under $g = g_0$, both parameters $\Psi' \in \{\Psi_{1,\tau}, \Psi_2\}$ defined above admit a first order expansion of the form

$$\Psi'(P) - \Psi'(P_0) = -P_0 D^*(P) + R'(Q, Q_0, \omega, \omega_0),$$

where $R'$ is a second-order remainder term such that $R'(Q, Q_0, \omega, \omega_0) = 0$ if either $Q = Q_0$ or $\omega = \omega_0$. While this resembles a traditional double-robustness property, such as that which holds in the i.i.d. setting for the ATE or in the time series setting for our conditional parameter (as opposed to arbitrary time-series dependence or a Markov decision process), it is important to note the following:

1. For $\Psi' \in \{\Psi_{1,\tau}, \Psi_2\}$, knowledge of the treatment mechanism is not sufficient to guarantee that the remainder term is zero; we direct the interested reader to van der Laan et al. (2018) for the exact form of $R'$.

2. The parameters $\omega$ and $Q$ are not variation independent, as appears explicitly from the definition of $\omega_s$.
In fact, when estimating $\omega_s$ from a single time series, one must a priori rely on an estimator of $Q$ to obtain estimates of $\omega_s$ (see van der Laan et al. (2018)). Therefore, if the estimator of $Q$ is inconsistent, the corresponding estimator of $\omega_s$ will be inconsistent as well.

Assumption 4 on the stabilization of the conditional variance of the canonical gradient can be checked under mixing conditions on the sequence of contexts $(C_o(t))$, and under the condition that the design $g_{0,t}$ converges to a fixed design. We state formally below such a set of conditions.

Assumption 7 (Convergence of the marginal law of contexts). Suppose that the marginal law of contexts converges to a limit law, that is $C_o(t) \xrightarrow{d} C_\infty$, for some random variable $C_\infty$.

The next assumption is a mixing condition in terms of $\rho$-mixing. We first recall the notion of $\rho$-mixing.

Definition 1 ($\rho$-mixing). Consider a couple of random variables $(Z_1, Z_2) \sim P$. The $\rho$-mixing coefficient, or maximum correlation coefficient, of $Z_1$ and $Z_2$ is defined as

$$\rho_P(Z_1, Z_2) := \sup\{ \mathrm{Corr}(f_1(Z_1), f_2(Z_2)) : f_1 \in L_2(P_{Z_1}),\ f_2 \in L_2(P_{Z_2}) \}.$$

Assumption 8 ($\rho$-mixing condition). Suppose that

$$\sup_{t \geq 1} \sum_{s=1}^{N} \rho(C_o(t), C_o(t + s)) = o(N).$$

Observe that if $g$ is common across time points, the process $(C_o(t))$ is a homogeneous Markov chain. Conditions under which homogeneous Markov chains have marginal laws converging to a fixed law and are mixing have been extensively studied. A textbook example, albeit perhaps a bit too contrived for many specifications of the setting of our current article, is when the Markov chain has a finite state space and the probability of transitioning between any two states from one time point to the next is non-zero. In this case, ergodic theory shows that the transition kernel of the Markov chain admits a so-called invariant law, the marginal laws converge exponentially fast (in total variation distance) to the invariant law, and the mixing coefficients have finite sum.
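The textbook example above can be checked numerically. The sketch below is purely illustrative and relies on assumptions that are not in the text: the three-state transition matrix `P` is an arbitrary choice with strictly positive entries, and the helper names are hypothetical. It only illustrates the convergence of the marginal laws in total variation, not the statement about mixing coefficients.

```python
def step(dist, P):
    """Push a marginal law one step forward: returns dist @ P for a row-stochastic P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def total_variation(p, q):
    """Total variation distance between two laws on the same finite state space."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def invariant_law(P, n_iter=10_000):
    """Approximate the invariant law by iterating the chain from the uniform law."""
    dist = [1.0 / len(P)] * len(P)
    for _ in range(n_iter):
        dist = step(dist, P)
    return dist


# Hypothetical transition matrix: every entry is positive and each row sums to one.
P = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]

pi = invariant_law(P)
dist = [1.0, 0.0, 0.0]          # start the chain in state 0 with probability one
tv_path = []
for _ in range(20):
    tv_path.append(total_variation(dist, pi))
    dist = step(dist, P)

if __name__ == "__main__":
    for k, tv in enumerate(tv_path):
        print(k, tv)
```

For this particular matrix, any two rows are within total variation distance 0.5 of each other, so the distance to the invariant law at least halves at every step, which is the geometric decay referred to in the text.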
We refer the interested reader to the survey paper by Bradley (2005) for more general conditions under which Markov chains have convergent marginal laws and are strongly mixing (for various types of mixing coefficients, one of them being $\rho$-mixing).

Assumption 9 (Design stabilization). There is a design $g_\infty$ such that $\| g_{0,t} - g_\infty \|_{1, P_{g^*, h_{0,t}}} = o(1)$, and $g_\infty \geq \delta$, for some $\delta > 0$.

We note that, as we will always use assumption 9 along with assumption 3, we will suppose that the constant $\delta$ in the statement of both assumptions is the same.

Lemma 2 (Conditional variance stabilization under mixing). Suppose that assumptions 3, 7 and 8 hold. Then assumption 4 holds.

We dedicate the appendix subsection 8.5 to the proof of lemma 2.
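As a hedged illustration of the finite-state Markov chain example used above to motivate assumptions 7 and 8, the following sketch considers a hypothetical two-state context chain with strictly positive transition probabilities. The values `p01` and `p10` are made up for illustration and are not taken from the paper. For a binary chain every non-constant function of the state is affine in its indicator, so the maximal correlation between $C_o(t)$ and $C_o(t+s)$ under the invariant law reduces to the absolute lag-$s$ correlation of the indicators. The sketch checks numerically that the marginal law converges geometrically in total variation and that the summed mixing coefficients stay bounded, which is stronger than the $o(N)$ growth required in assumption 8.

```python
# Toy two-state chain; p01 and p10 are illustrative values, not from the paper.
p01, p10 = 0.3, 0.2          # P(0 -> 1) and P(1 -> 0), both strictly positive
P = [[1 - p01, p01], [p10, 1 - p10]]
pi = [p10 / (p01 + p10), p01 / (p01 + p10)]   # invariant law of the chain
lam = 1 - p01 - p10                            # second eigenvalue of P


def step(law):
    """Push a marginal law one time step forward through the kernel P."""
    return [sum(law[i] * P[i][j] for i in range(2)) for j in range(2)]


def tv(law_a, law_b):
    """Total variation distance between two laws on {0, 1}."""
    return 0.5 * sum(abs(a - b) for a, b in zip(law_a, law_b))


def lag_corr(s):
    """Correlation of 1{C(t)=1} and 1{C(t+s)=1} when C(t) follows pi."""
    Ps = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(s):                         # build the s-step kernel P^s
        Ps = [[sum(Ps[i][k] * P[k][j] for k in range(2)) for j in range(2)]
              for i in range(2)]
    joint11 = pi[1] * Ps[1][1]                 # P(C(t)=1, C(t+s)=1)
    return (joint11 - pi[1] ** 2) / (pi[1] * (1 - pi[1]))


law = [1.0, 0.0]                               # start deterministically in state 0
tv_path = []
for _ in range(20):
    tv_path.append(tv(law, pi))                # distance to the invariant law
    law = step(law)

rho = [abs(lag_corr(s)) for s in range(1, 51)] # maximal correlations at lags 1..50
rho_sum = sum(rho)                             # partial sum of mixing coefficients
```

In this toy chain both the total variation distance and the lag-$s$ maximal correlation shrink by the factor `abs(lam)` at every step, so `rho_sum` is bounded by a geometric series regardless of how many lags are included.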

We analyze the martingale process $\{ M_{2,N}(\bar{Q}, \bar{Q}_1) : \bar{Q} \in \bar{\mathcal{Q}} \}$ under a measure of complexity introduced by van de Geer (2000), which we will refer to in the present work as sequential bracketing entropy. We state below the definition of sequential bracketing entropy particularized to our setting.

Definition 2 (Sequential bracketing entropy). Consider a stochastic process of the form $\Xi_N := \{ (\xi_t(f))_{t=1}^N : f \in \mathcal{F} \}$ where $\mathcal{F}$ is an index set such that, for every $f \in \mathcal{F}$, $t \in [N]$, $\xi_t(f)$ is an $\bar{O}(t)$-measurable real valued random variable. We say that a collection of random variables of the form $\mathcal{B} := \{ (\Lambda_t^j, \Upsilon_t^j)_{t=1}^N : j \in [J] \}$ is an $(\epsilon, b, \bar{O}(N))$ bracketing of $\Xi_N$ if

1. for every $t \in [N]$ and $j \in [J]$, $(\Lambda_t^j, \Upsilon_t^j)$ is $\bar{O}(t)$-measurable,
2. for every $f \in \mathcal{F}$, there exists $j \in [J]$ such that, for every $t \in [N]$, $\Lambda_t^j \leq \xi_t(f) \leq \Upsilon_t^j$,
3. for every $t \in [N]$, $j \in [J]$, $|\Lambda_t^j - \Upsilon_t^j| \leq b$ a.s.,
4. for every $j \in [J]$, $\frac{1}{N} \sum_{t=1}^N E\left[ (\Upsilon_t^j - \Lambda_t^j)^2 \mid \bar{O}(t-1) \right] \leq \epsilon^2$.

We denote $N_{[\,]}(\epsilon, b, \Xi_N, \bar{O}(N))$ the minimal cardinality of an $(\epsilon, b, \bar{O}(N))$-bracketing of $\Xi_N$.

The martingale process $\{ M_{2,N}(\bar{Q}, \bar{Q}_1) : \bar{Q} \in \bar{\mathcal{Q}} \}$ is derived from the process

$$\Xi_N := \left\{ \left( (D^*(\bar{Q}) - D^*(\bar{Q}_1))(C_o(t), O(t)) \right)_{t=1}^N : \bar{Q} \in \bar{\mathcal{Q}} \right\}.$$

Natural questions that arise are (1) how to connect the sequential bracketing entropy of the process $\Xi_N$ to a traditional bracketing entropy measure for the outcome model $\bar{\mathcal{Q}}$, and (2) how to obtain consistency of an estimator $\bar{Q}^*_N$ fitted from sequentially collected data. Answers to both of these questions entail bracketing entropy preservation results that we present in the upcoming subsection, 8.3.1.

We emphasize that the notion of sequential covering numbers, and the corresponding sequential covering entropy introduced by Rakhlin et al. (2014), represent a measure of complexity under which one can control martingale processes and obtain equicontinuity results. One motivation for the development of the notion of sequential covering numbers is that results that hold for i.i.d. empirical processes under traditional covering entropy conditions do not hold for martingale processes. Interestingly, while classical covering number conditions cannot be used to control martingale processes, classical bracketing number bounds can usually be turned into sequential bracketing number bounds. Our choice to state results in terms of one measure of sequential complexity rather than the other (or both) is motivated by concision, and also by the fact that we know how to bound the bracketing entropy of a certain class of statistical models we find realistic in many applications, as we describe in later subsections.

We formalize the connection between the sequential bracketing entropy of the process $\Xi_N$ and a traditional bracketing entropy measure for the outcome model $\bar{\mathcal{Q}}$ in lemma 3 below. In particular, lemma 3 bounds the sequential bracketing entropy of the canonical gradient process $\Xi_N$ in terms of the bracketing entropy of the outcome model $\bar{\mathcal{Q}}$ w.r.t. a norm defined below.

Lemma 3.

Suppose that assumption 3 holds. Then N [ ] ( (cid:15), Ξ N , ¯ O ( N )) (cid:46) N [ ] ( (cid:15), ¯ Q , L ( P g ∗ ,h N )) , where P g ∗ ,h N ( a, c ) = g ∗ ( a | c ) h N ( c ) , with h N being the empirical measure h N := N − (cid:80) Nt =1 δ C o ( t ) .Proof. Suppose B = { ( λ j , υ j ) : j ∈ [ J ] } is an (cid:15) -bracketing in L ( P g ∗ ,h N ) norm of ¯ Q . Let¯ Q ∈ Q . There exists j ∈ [ J ] such that λ j ≤ ¯ Q ≤ υ j . Without loss of generality, we cansuppose that 0 ≤ λ j ≤ υ j ≤

1, since the bracket ( λ j ∨ , υ j ∧

1) brackets the same functionsof ¯ Q as ( λ j , υ j ), as every element of ¯ Q has range in [0 , D ∗ ( ¯ Q ) − D ∗ ( ¯ Q ) = g ∗ g ,t ( ¯ Q − ¯ Q ) + (cid:88) a =1 g ∗ ( a | · )( ¯ Q − ¯ Q ))( a, · ) . Denoting Λ jt := g ∗ g ,t ( ¯ Q − υ j ) + (cid:88) a =1 g ∗ ( a | · )( λ j − ¯ Q ))( a, · ) , and Υ jt := g ∗ g ,t ( ¯ Q − λ j ) + (cid:88) a =1 g ∗ ( a | · )( υ j − ¯ Q ))( a, · ) , we have that Λ jt ≤ ( D ∗ ( ¯ Q, g ,t ) − D ∗ ( ¯ Q, g ,t )( C o ( t ) , O ( t )) ≤ Υ jt .

44e now check the size of the sequential bracket (Λ jt , Υ jt ) Nt =1 . We have that1 N N (cid:88) t =1 E Q ,g (cid:2) (Υ jt − Λ jt ) | ¯ O ( t − (cid:3) = 1 N N (cid:88) t =1 E Q ,g (cid:40) g ∗ g ,t ( υ j − λ j )( A ( t ) , C o ( t )) + (cid:88) a =1 ( g ∗ ( υ j − λ j ))( a, C o ( t )) (cid:41) | C o ( t )  ≤ N N (cid:88) t =1 E Q ,g (cid:34)(cid:18) g ∗ g ,t (cid:19) ( υ j − λ j ) ( A ( t ) , C o ( t )) | C o ( t ) (cid:35) + E Q ,g ∗ [( υ j − λ j )( A ( t ) , C o ( t ))] ≤ δ − N N (cid:88) t =1 E Q ,g ∗ (cid:2) ( υ j − λ j ) ( A ( t ) , C o ( t )) | C o ( t ) (cid:3) = 4 δ − (cid:107) υ j − λ j (cid:107) ,P g ∗ ,hN ≤ δ − (cid:15) , where we have used assumption 3 and Jensen’s inequality in the fourth line above. Fromassumption 3, it is also immediate to check that | Υ jt − Λ jt | ≤ δ − .So far, we have proven that one can construct a (2 δ − / (cid:15), δ − , ¯ O ( N )) bracketing of Ξ N from an (cid:15) -bracketing in L ( P g ∗ ,h N ) norm of ¯ Q . Treating δ as a constant, this implies thatlog N [ ] ( (cid:15), δ − , Ξ N , ¯ O ( N )) (cid:46) log N [ ] ( (cid:15), ¯ Q , L ( P g ∗ ,h N )).When proving consistency and convergence rate results for the outcome model estimator¯ Q ∗ N , we need bounds on the sequential bracketing entropy of the following martingaleprocess: L N := (cid:110)(cid:0) (cid:96) t ( ¯ Q )( C o ( t ) , O ( t )) (cid:1) Nt =1 : ¯ Q ∈ ¯ Q (cid:111) , where (cid:96) t ( ¯ Q )( c, o ) := ( g ∗ ( a | c ) /g ,t ( a | c ))( (cid:96) ( ¯ Q )( o ) − (cid:96) ( ¯ Q )( o )), with (cid:96) ( f ) denoting a lossfunction. We refer to L N as an inverse propensity weighted loss process . Lemma 4 in Bibaut& van der Laan (2019) provides conditions, which hold for most common loss functions,45nder which the bracketing entropy of the loss class { (cid:96) ( f )( ¯ Q ) : ¯ Q ∈ ¯ Q} is dominated up toa constant by the bracketing entropy of ¯ Q . 
As a direct corollary of this lemma, we state the following result on the sequential bracketing entropy of the process $\mathcal{L}_N$; we refer to Bibaut & van der Laan (2019) for examples of common settings where assumption 10 is satisfied.

Assumption 10.

The loss function can be written as (cid:96) ( ¯ Q )( c, a, y ) = (cid:101) (cid:96) ( ¯ Q ( c, a ) , y ) , where (cid:101) (cid:96) satisﬁes the following conditions: • for all f , c , a , y (cid:55)→ (cid:101) (cid:96) ( ¯ Q ( c, a ) , y ) is unimodal, • for any y , u (cid:55)→ (cid:101) (cid:96) ( u, y ) is L -Lispchitz, for L = O (1) . Lemma 4 (Sequential bracketing entropy of loss process) . Suppose that assumptions 10and 3 hold. Then N [ ] ( (cid:15), L N , ¯ O ( N )) (cid:46) N [ ] ( (cid:15), ¯ Q , L ( P g ∗ ,h N )) . In this subsection, we give convergence guarantees for outcome model estimators ¯ Q N , andtheir targeted counterpart ¯ Q ∗ N , ﬁtted on sequentially collected data. We ﬁrst give con-vergence rate guarantees for empirical risk minizers ¯ Q N over a class ¯ Q , in terms of thebracketing entropy in L ( P g ∗ ,h N )-norm of ¯ Q .As brieﬂy deﬁned in section 4, let (cid:96) = L be a loss function for the outcome regressionsuch that, for every ¯ Q : C × A → [0 , Q ∈ arg min ¯ Q -measurable P Q ,g ∗ ,h N (cid:96) ( ¯ Q ) . We denote R ,N ( ¯ Q ) := P Q ,g ∗ ,h N (cid:96) ( ¯ Q ) as the population risk; we note that this populationrisk is equal to the average across t of the conditional risks P Q g ∗ ,h N (cid:96) ( ¯ Q ) given C o ( t ). Let46 Q ∗ be a minimizer of R ,N ( ¯ Q ) over ¯ Q . We further deﬁne the empirical risk as (cid:98) R N ( ¯ Q ) := 1 N N (cid:88) t =1 g ∗ g ,t ( A ( t ) | C o ( t )) (cid:96) ( ¯ Q )( C o ( t ) , O ( t )) . Note that the empirical risk minimizer over ¯ Q is any minimizer over ¯ Q of (cid:98) R N ( ¯ Q ); as such,we use importance sampling weighting factor g ∗ /g ,t in front of each term (cid:96) ( ¯ Q )( C o ( t ) , O ( t )).This choice is motivated by the fact that we want convergence rates guarantees for ¯ Q N in L ( P g ∗ ,h N ), as is natural to control the size of the sequential brackets of the canonicalgradient process Ξ N in terms of the size of brackets of ¯ Q in L ( P g ∗ ,h N ) norm (see lemma3). 
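To make the role of the importance sampling weight $g^*/g_{0,t}$ concrete, here is a minimal sketch of the weighted empirical risk $\widehat{R}_N(\bar{Q})$, assuming a binary treatment and a squared-error loss. The function and variable names (`ipw_empirical_risk`, `g_star`, `g_uniform`, `q_bar`) and the four toy observations are hypothetical and are not part of the paper; the sketch only illustrates the formula, not the authors' implementation.

```python
def ipw_empirical_risk(data, q_bar, g_star, loss):
    """Average of (g*/g_{0,t})(A(t) | C(t)) * loss(q_bar(C(t), A(t)), Y(t)).

    data: list of tuples (c, a, y, g_t), where g_t(a, c) is the known
    randomization probability used at time t (an assumed representation).
    """
    total = 0.0
    for c, a, y, g_t in data:
        weight = g_star(a, c) / g_t(a, c)      # importance sampling weight
        total += weight * loss(q_bar(c, a), y)
    return total / len(data)


def squared_error(u, y):
    """Squared-error loss between a prediction u and an outcome y."""
    return (u - y) ** 2


# Hypothetical ingredients, for illustration only.
def g_star(a, c):
    """Deterministic target rule: treat if and only if c > 0."""
    return 1.0 if a == int(c > 0) else 0.0


def g_uniform(a, c):
    """A design that randomizes treatment with probability 1/2."""
    return 0.5


def q_bar(c, a):
    """Candidate outcome regression with range in [0, 1]."""
    return 0.5 + 0.25 * a


# Each observation: (context, treatment, outcome, design used at that time).
data = [(0.4, 1, 1.0, g_uniform), (-0.2, 1, 0.0, g_uniform),
        (0.1, 0, 1.0, g_uniform), (-0.7, 0, 0.0, g_uniform)]
risk = ipw_empirical_risk(data, q_bar, g_star, squared_error)
```

Observations whose treatment disagrees with the deterministic target rule receive weight zero, while the others are up-weighted by the inverse of the design probability. Because $g^* \leq 1$, a lower bound $\delta$ on the design keeps every weight below $\delta^{-1}$.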
In the following, we state the entropy condition and additional assumptions on the lossfunction. Assumption 11 (Entropy of the outcome model) . Suppose that there exists p > suchthat log(1 + N [ ] ( (cid:15), ¯ Q , L ( P g ∗ ,h N ))) ≤ (cid:15) − p . Assumption 12 (Variance bound for the loss) . Suppose that (cid:107) (cid:96) ( ¯ Q ) − (cid:96) ( ¯ Q ∗ ) (cid:107) , ¯ Q ,g ∗ ,h N (cid:46) R ,N ( ¯ Q ) − R ,N ( ¯ Q ∗ ) for all ¯ Q ∈ ¯ Q . Assumption 13 (Excess risk dominates L norm) . Suppose that (cid:107) ¯ Q − ¯ Q ∗ (cid:107) ,g ∗ ,h N (cid:46) R ,N ( ¯ Q ) − R ,N ( ¯ Q ∗ ) . Theorem 6.

Consider an empirical risk minimizer ¯ Q N over ¯ Q , and a population minimizer ¯ Q ∗ , as deﬁned above. Suppose that assumptions 11, 12, 13, and assumption 10 hold. Then, (cid:107) ¯ Q N − ¯ Q ∗ (cid:107) ,g ∗ ,h N =  O P ( N − p/ ) if p < ,O P ( N − p ) if p > . roof. Consider the process L N deﬁned in subsection 8.3.1. We deﬁne M ,N ( ¯ Q, ¯ Q ∗ ) and (cid:99) M N ( ¯ Q, ¯ Q ∗ ) as population and empirical risk diﬀerences M ,N ( ¯ Q, ¯ Q ∗ ) := R ,N ( ¯ Q ) − R ,N ( ¯ Q ∗ ) and (cid:99) M N ( ¯ Q, ¯ Q ∗ ) := (cid:98) R N ( ¯ Q ) − (cid:98) R N ( ¯ Q ∗ ) . Let σ N ( ¯ Q, ¯ Q ∗ ) := 1 N N (cid:88) t =1 E (cid:34)(cid:18) g ∗ g ,t ( A ( t ) | C o ( t ))( (cid:96) ( ¯ Q ) − (cid:96) ( ¯ Q ∗ ))( C o ( t ) , O ( t )) (cid:19) | C o ( t ) (cid:35) . The quantity σ N ( ¯ Q, ¯ Q ∗ ) can be seen as a sequential equivalent of an L norm for the process { ( g ∗ /g ,t )( A ( t ) | C o ( t ))( (cid:96) ( ¯ Q ) − (cid:96) ( ¯ Q ∗ ))( C o ( t ) , O ( t )) } Nt =1 . From assumption 3, we have thatsup t ≥ (cid:107) ( g ∗ /g ,t )( (cid:96) ( ¯ Q ) − (cid:96) ( ¯ Q ∗ )) (cid:107) ∞ = O (1). From theorem A.4 in van Handel (2010), withprobability at least 1 − e − x , we have thatsup (cid:110) M ,N ( ¯ Q, ¯ Q ∗ ) − (cid:99) M N ( ¯ Q, ¯ Q ∗ ) : ¯ Q ∈ ¯ Q , σ N ( ¯ Q ) ≤ r (cid:111) (cid:46) r − + 1 √ N (cid:90) rr − (cid:113) log(1 + N [ ] ( (cid:15), , L N , ¯ O ( N ))) d(cid:15) + 1 N log(1 + N [ ] ( r, , L N , ¯ O ( N ))) + r (cid:114) xN + xN . From assumption 3, we have that σ N ( ¯ Q ) (cid:46) (cid:107) (cid:96) ( ¯ Q ) − (cid:96) ( ¯ Q ∗ ) (cid:107) ,g ∗ ,h N (cid:46) M ,N ( ¯ Q, ¯ Q ∗ ) . Combined with lemma 4, we have thatsup (cid:110) M ,N ( ¯ Q, ¯ Q ∗ ) − (cid:99) M N ( ¯ Q, ¯ Q ∗ ) : ¯ Q ∈ ¯ Q , M ,N ( ¯ Q, ¯ Q ∗ ) ≤ r (cid:111) (cid:46) r − + 1 √ N (cid:90) rr − (cid:113) log(1 + N [ ] ( (cid:15), ¯ Q , L ( P g ∗ ,h N )) d(cid:15) + 1 N log(1 + N [ ] ( r, ¯ Q , L ( P g ∗ ,h N )) + r (cid:114) xN + xN (4)with probability at least 1 − e − x . 
In the following, we treat the cases p < p > ase p > Observe that (cid:107) ¯ Q N − ¯ Q ∗ (cid:107) ,g ∗ ,h N (cid:46) M ,N ( ¯ Q N , ¯ Q ∗ )= M ,N ( ¯ Q N , ¯ Q ∗ ) − (cid:99) M N ( ¯ Q N , ¯ Q ∗ ) + (cid:99) M N ( ¯ Q N , ¯ Q ∗ ) ≤ M ,N ( ¯ Q N , ¯ Q ∗ ) − (cid:99) M N ( ¯ Q N , ¯ Q ∗ ) ≤ sup (cid:110) M ,N ( ¯ Q, ¯ Q ∗ ) − (cid:99) M N ( ¯ Q, ¯ Q ∗ ) : ¯ Q ∈ ¯ Q , M ,N ( ¯ Q, ¯ Q ∗ ) ≤ r (cid:111) where r := sup ¯ Q ∈ ¯ Q M ,N ( ¯ Q, ¯ Q ∗ ). The third line follows from the fact that Q N minimizes (cid:98) R N ( ¯ Q ) over ¯ Q , wich implies that (cid:99) M N ( ¯ Q N , ¯ Q ∗ ) ≤

0. We now use equation (4) to bound thelast line of the inequality. From assumption 3, we know that r = O (1). Using the entropybound from assumption 11 and minimizing the right hand side of (4) w.r.t. r − , we obtainthat, with probability at least 1 − e − x , (cid:107) ¯ Q N − ¯ Q ∗ (cid:107) ,g ∗ ,h N (cid:46) N − /p + x √ N + xN , which, by picking x appropriately, then implies that (cid:107) ¯ Q N − ¯ Q ∗ (cid:107) ,g ∗ ,h N = O P ( N − /p ). Case p < Starting from the bound (4), via some algebra and by taking an integral,we obtain E P (cid:104) sup (cid:110) M ,N ( ¯ Q, ¯ Q ∗ ) − (cid:99) M N ( ¯ Q, ¯ Q ∗ ) : ¯ Q ∈ ¯ Q , M ,N ( ¯ Q, ¯ Q ∗ ) ≤ r (cid:111)(cid:105) (cid:46) r − + 1 √ N (cid:18) r + (cid:90) rr − (cid:113) log(1 + N [ ] ( (cid:15), ¯ Q , L ( P g ∗ ,h N )) d(cid:15) (cid:19) + 1 N (cid:0) r + log(1 + N [ ] ( r, ¯ Q , L ( P g ∗ ,h N )) (cid:1) . Let r − = 0. By using the entropy bound from assumption 11, we obtain that E P (cid:104) sup (cid:110) M ,N ( ¯ Q, ¯ Q ∗ ) − (cid:99) M N ( ¯ Q, ¯ Q ∗ ) : ¯ Q ∈ ¯ Q , M ,N ( ¯ Q, ¯ Q ∗ ) ≤ r (cid:111)(cid:105) (cid:46) √ N r − p/ (cid:18) r − p/ r √ N (cid:19) . M ,N ( ¯ Q N , ¯ Q ∗ ) = O P ( N − p/ ) , and therefore (cid:107) ¯ Q N − ¯ Q ∗ (cid:107) ,g ∗ ,h N = O P ( N − p/ ) . Now that we know how to characterize the sequential bracketing entropy of Ξ N and L N interms of the bracketing entropy w.r.t. the norm L ( P Q ,h C,N ) of the outcome model Q , welook at speciﬁc function classes Q for which we know how to bound the latter. Consider functions over a certain domain X ; in our setting we note that X = C × O .Suppose that dim( X ) = d . We denote H ( β, M ) the class of functions over a certaindomain X , such that, for any x, y ∈ X , and any non-negative integers β , . . . , β d such that β + . . . + β d = (cid:98) β (cid:99) , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∂ (cid:98) β (cid:99) f∂x β . . . ∂x β d d ( x ) − ∂ (cid:98) β (cid:99) f∂x β . . . 
∂x β d d ( y ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ M (cid:107) x − y (cid:107) . The bracketing entropy w.r.t. the uniform norm (cid:107) · (cid:107) of such a class satisﬁeslog N [ ] ( (cid:15), H ( β, M ) , (cid:107) · (cid:107) ) (cid:46) (cid:15) − d/β . For more detail, we refer the interested reader to, for example, chapter 2.7 in van der Vaart& Wellner (2013). As such, our Donsker condition 5 is satisﬁed for β > d/

2. Nevertheless,we caution that assuming that the outcome model lies in a Holder class of diﬀerentiabilityorder β > d/ .4.2 HAL class

A class of functions that is much richer that the previous Holder classes is the class ofcadlag functions with bounded sectional variation norm - also referred to as Hardy-Krausevariation. We refer to this class as the Highly Adaptive Lasso class (HAL class), as it isthe class in which the estimator, introduced in van der Laan (2017), takes values. TheHighly Adaptive Lasso class is particularly attractive in i.i.d. settings for various reasons,which we enumerate next. (1) Unlike Holder classes, it doesn’t make local smoothnessassumptions. Rather it only restricts a global measure of irregularity, the sectional variationnorm, thereby allowing for functions to be diﬀerentially smooth/variable depending onthe area of the domain. (2) Emprical risk minimzers over the HAL class were shown tobe competitive with the best supervised machine learning algorithms, including GradientBoosting Machines and Random Forests. (3) We know how to bound both the uniformmetric entropy and the bracketing entropy of these classes of functions. These boundsshow that the corresponding entropy integrals are bounded, which imply that the HALclass is Donsker. In particular, Bibaut & van der Laan (2019) provide a bound on thebracketing entropy w.r.t. L r ( P ), for r ∈ [1 , ∞ ), for probability distribution that havebounded Radon-Nikodym derivative w.r.t. the Lebesgue measure, that is dP/dµ ≤ C .Bibaut & van der Laan (2019) use this bracketing entropy bound to prove the rate ofconvergence O ( N − / (log N ) d − ).Unfortunately, to bound the sequential bracketing entropies of Ξ N and of L N we wouldneed a bracketing entropy bound w.r.t. L ( P Q ,h C,N ), which, owing to the fact that h C,N isa discrete measure, does not have bounded Radon-Nikodym derivative w.r.t. the Lebesguemeasure over

C × O . Under the assumption 7 on the convergence of the marginals of ( C o ( t ))to a limit law (we shall denote it h ∞ ), we have that h C,N d −→ h ∞ , which can reasonably bea continuous measure dominated by the Lebesgue measure. By convergence in distributionof h C,N t to h ∞ , we have at least that the size of brackets w.r.t. h C,N converges to the size51f brackets under h ∞ . If this convergence were uniform over bracketings of ¯ Q , and that dh ∞ /dµ ≤ C , then we would have that N [ ] ( (cid:15), ¯ Q , L ( P Q ,h C,N )) (cid:46) N [ ] ( (cid:15), ¯ Q , L ( µ ). Provingthe uniformity over bracket seems to be a relatively tough theoretical endeavor, and weleave it to future research. Given the diﬃculty in bounding N [ ] ( (cid:15), ¯ Q , L ( P Q ,g ∗ ,h N )) for the HAL, class, we consider amodiﬁed HAL class in the case where C is discrete, that is C = { c , . . . , c J } . We deﬁne themodiﬁed class as the set of functions f : C × O → R such that, for every c ∈ C , o (cid:55)→ f ( c, o )is cadlag with sectional variation norm smaller than M . It is straightforward to show thatthe bracketing entropy of such a class F is bounded as follows:log N [ ] ( (cid:15), F , L ( P Q ,g ∗ ,h N )) (cid:46) |C| (cid:15) − (log(1 /(cid:15) )) O ) − . The proof of lemma 2 relies on lemma 5, which we present and prove in the following.

Lemma 5.

Denote, for any ﬁxed g , f ( g )( c ) := Var Q ,g ( D ∗ ( Q , g )( C o ( t ) , O ( t )) | C o ( t ) = c ) . Suppose that assumption 3 holds, and let g be ﬁxed given C o ( t ) . Suppose that the strongpositivity assumption holds for g , that is g ≥ δ , for the same δ an in assumption 3. Then, (cid:12)(cid:12) E h ,t [ f ( g )( C o ( t ))] − E h ,t [ f ( g ,t )( C o ( t ))] (cid:12)(cid:12) ≤ δ − (cid:107) g − g ,t (cid:107) ,P g ∗ ,h ,t . Proof.

Observe that D ∗ ( Q , g )( c, o ) can be decomposed as D ∗ ( Q , g )( c, o ) = D ∗ ( ¯ Q , g ) + D ∗ ( Q , c ) , D ∗ ( ¯ Q , g )( c, o ) := g ∗ ( a | c ) g ( a | c ) (cid:0) y − ¯ Q ( a, c ) (cid:1) , and D ∗ ( Q )( c ) := (cid:88) a =1 g ∗ ( a (cid:48) | c ) ¯ Q ( a (cid:48) , c ) − Ψ( Q ) . As D ∗ ( Q )( C o ( t )) is constant given C o ( t ), we have that f ( g )( c ) = Var Q ,g (cid:0) D ∗ ( ¯ Q , g )( C o ( t ) , O ( t )) | C o ( t ) (cid:1) . In the following, the dependence of the canonical gradient on ( C o ( t ) , O ( t )) is implied, butsuppressed in the notation. For any g , (cid:12)(cid:12) E h ,t [ f ( g )( C o ( t ))] − E h ,t [ f ( g ,t )( C o ( t ))] (cid:12)(cid:12) = (cid:12)(cid:12) E h ,t (cid:2) E Q ,g (cid:2) ( D ∗ ( ¯ Q , g )) | C o ( t )) (cid:3)(cid:3) − (cid:2) E Q ,g (cid:2) ( D ∗ ( ¯ Q , g ,t )) | C o ( t )) (cid:3)(cid:3) − E h ,t (cid:104) E Q ,g (cid:2) D ∗ ( ¯ Q , g ) | C o ( t ) (cid:3) − E Q ,g (cid:2) D ∗ ( ¯ Q , g ,t ) | C o ( t ) (cid:3) (cid:105)(cid:12)(cid:12)(cid:12) ≤ E Q ,g (cid:2)(cid:12)(cid:12) ( D ∗ ( ¯ Q , g )) − ( D ∗ ( ¯ Q , g ,t )) (cid:12)(cid:12)(cid:3) + (cid:0) (cid:107) D ∗ ( ¯ Q , g ) (cid:107) ∞ + (cid:107) D ∗ ( ¯ Q , g ,t ) (cid:107) ∞ (cid:1) × E Q ,g (cid:2)(cid:12)(cid:12) D ∗ ( ¯ Q , g ) − D ∗ ( ¯ Q , g ,t ) (cid:12)(cid:12) ( C o ( t ) , O ( t ))) (cid:3) .

53e start with the analysis of the ﬁrst term. In particular, we have that E Q ,g (cid:2)(cid:12)(cid:12) ( D ∗ ( ¯ Q , g )) − ( D ∗ ( ¯ Q , g ,t )) ( C o ( t ) , O ( t )) (cid:12)(cid:12)(cid:3) = E Q ,g (cid:34)(cid:0) Y ( t ) − ¯ Q ( A ( t ) , C o ( t )) (cid:1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:18) g ∗ g (cid:19) ( A ( t ) | C o ( t )) − (cid:18) g ∗ g ,t (cid:19) ( A ( t ) | C o ( t )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) = E Q ,g (cid:34)(cid:0) Y ( t ) − ¯ Q ( A ( t ) | C o ( t )) (cid:1) (cid:18) g ∗ g ,t (cid:19) ( A ( t ) | C o ( t )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:18) gg ,t (cid:19) ( A ( t ) | C o ( t )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) ≤ δ − E Q ,g ∗ (cid:20)(cid:0) Y ( t ) − ¯ Q ( A ( t ) , C o ( t )) (cid:1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:18) g ,t − g g (cid:19) ( A ( t ) , C o ( t )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:21) ≤ δ − E Q ,g ∗ (cid:2)(cid:12)(cid:12) ( g ,t − g )( A ( t ) | C o ( t )) (cid:12)(cid:12)(cid:3) ≤ δ − E Q ,g ∗ [ | ( g ,t − g )( A ( t ) | C o ( t )) | ]=2 δ − (cid:107) g ,t − g (cid:107) ,P g ∗ ,h ,t . We now turn to the second term. It follows that E Q ,g (cid:2)(cid:12)(cid:12) D ∗ ( ¯ Q, g )( C o ( t ) , O ( t )) − D ∗ ( C o ( t ) , O ( t )) (cid:12)(cid:12)(cid:3) = E Q ,g (cid:20)(cid:18) ¯ Q (cid:12)(cid:12)(cid:12)(cid:12) g ∗ g − g ∗ g ,t (cid:12)(cid:12)(cid:12)(cid:12)(cid:19) ( C o ( t ) , O ( t )) (cid:21) ≤ δ − (cid:107) g ,t − g (cid:107) ,P g ∗ ,h ,t ≤ δ − E Q ,g ∗ [ | ( g ,t − g )( A ( t ) | C o ( t )) | ]= δ − (cid:107) g ,t − g (cid:107) ,P g ∗ ,h ,t . As (cid:107) D ∗ ( ¯ Q , g ) (cid:107) ∞ ≤ δ − and (cid:107) D ∗ ( ¯ Q , g ,t (cid:107) ∞ δ − , we therefore have that E Q ,g [ | f ( g )( C o ( t )) − f ( g ,t )( C o ( t )) | ] ≤ δ − (cid:107) g − g ,t (cid:107) ,P g ∗ ,h ,t . Lemma 2.

Suppose that assumptions 3, 7 and 8 hold. Then assumption 4 holds. roof. We use the notation of lemma 5 in this proof. We have that1 N N (cid:88) t =1 f ( g ,t )( C o ( t )) − E [ f ( g ∞ )( C ∞ )]= 1 N N (cid:88) t =1 f ( g ,t )( C o ( t )) − E [ f ( g ,t ( C o ( t ))] (5)+ 1 N N (cid:88) t =1 E [ f ( g ,t )( C o ( t ))] − E [ f ( g ∞ )( C o ( t ))] (6)+ 1 N N (cid:88) t =1 E [ f ( g ∞ )( C o ( t ))] − E [ f ( g ∞ )( C ∞ )] . (7)Denote A N , B N and C N the quantities in lines (5), (6) and (7) above; we start withbounding A N . In particular, we have thatVar ( A N ) = 1 N N (cid:88) t =1 Var( f ( g ,t )( C o ( t )))+ 2 N N (cid:88) t =1 N − t (cid:88) s =1 Cov( f ( g ,t )( C o ( t )) , f ( g ,t + s )( C o ( t + s )))= 1 N N (cid:88) t =1 Var( f ( g ,t )( C o ( t )))+ 2 N N (cid:88) t =1 N − t (cid:88) s =1 (cid:113) Var( f ( g ,t )( C o ( t )))Var( f ( g ,t + s )( C o ( t + s ))) ρ ( C o ( t ) , C o ( t + s )) . From assumption 3, (cid:107) f ( g ,t ) (cid:107) ∞ ≤ δ − for every t . Therefore,Var (cid:32) N N (cid:88) t =1 f ( g ,t )( C o ( t )) (cid:33) ≤ δ − N (cid:32) N + 2 N sup t ≥ N (cid:88) s =1 ρ ( C o ( t ) , C o ( t + s )) (cid:33) ≤ δ − N (cid:0) N + o ( N )) (cid:1) = o (1) , N N (cid:88) t =1 f ( g ,t )( C o ( t )) − N N (cid:88) t =1 E [ f ( g ,t )( C o ( t ))] = o P (1) . (8)We now turn to B N . From lemma 5, we have that B N ≤ δ − N N (cid:88) t =1 (cid:107) g ,t − g ∞ (cid:107) ,P g ∗ ,h ,t . From assumption 9 and Cesaro’s lemma for deterministic sequences of real numbers, B N = o (1). Finally, from assumption 7 and Cesaro’s lemma, C N = o (1). Denoting σ := E [ f ( C ∞ )], the above inequality and (8) yield the wished claim. Theorem 2.

For any $\bar{Q}_1 \in \bar{\mathcal{Q}}$, the difference between the TMLE and its target decomposes as

$$\bar{\Psi}(\bar{Q}^*_N) - \bar{\Psi}(\bar{Q}_0) = M_{1,N}(\bar{Q}_1) + M_{2,N}(\bar{Q}^*_N, \bar{Q}_1),$$

with

$$M_{1,N}(\bar{Q}_1) = \frac{1}{N} \sum_{t=1}^N \left\{ D^*(\bar{Q}_1)(C_o(t), O(t)) - P_{0,C_o(t)} D^*(\bar{Q}_1) \right\},$$

$$M_{2,N}(\bar{Q}^*_N, \bar{Q}_1) = \frac{1}{N} \sum_{t=1}^N (\delta_{C_o(t),O(t)} - P_{0,C_o(t)}) (D^*(\bar{Q}^*_N) - D^*(\bar{Q}_1)).$$

Proof. We recall the first order expansion of $\Psi_{C_o(t)}$ given by Theorem 1:

$$\Psi_{C_o(t)}(\bar{Q}) - \Psi_{C_o(t)}(\bar{Q}_0) = - P_{0,C_o(t)} D^*(\bar{Q}, g)(C_o(t), O(t)) + R(\bar{Q}, \bar{Q}_0, g, g_{0,t}).$$

We note that, since in an adaptive trial the treatment mechanism is controlled, we have that $g = g_0$, so that the remainder term vanishes. Additionally, by definition, the TMLE procedure yields $\bar{Q}^*_N$ such that

$$\frac{1}{N} \sum_{t=1}^N D^*(\bar{Q}^*_N)(C_o(t), O(t)) = 0.$$

Combined, it follows that

$$\bar{\Psi}(\bar{Q}^*_N) - \bar{\Psi}(\bar{Q}_0) = \frac{1}{N} \sum_{t=1}^N \left\{ D^*(\bar{Q}^*_N)(C_o(t), O(t)) - P_{0,C_o(t)} D^*(\bar{Q}^*_N)(C_o(t), O(t)) \right\},$$

which, by adding and subtracting $N^{-1} \sum_{t=1}^N (\delta_{C_o(t),O(t)} - P_{0,C_o(t)}) D^*(\bar{Q}_1)$, implies the wished decomposition.

Theorem 4.

Consider the process Ξ N deﬁned in equation (3) . Suppose that assumptions3, 5 and 6 hold. Then M ,N ( ¯ Q ∗ N , ¯ Q ) = o P ( N − / ) .Proof. We want to show that for any (cid:15), δ >

0, there exist N such that for any N ≥ N P (cid:104) √ N M ,N ( ¯ Q ∗ N , ¯ Q ) ≥ (cid:15) (cid:105) ≤ δ. Let (cid:15) > δ >

0. Deﬁne, for any ¯ Q , σ N ( ¯ Q, ¯ Q ) := 1 N N (cid:88) t =1 E P (cid:2)(cid:0) D ∗ ( ¯ Q, g ,t ) − D ∗ ( ¯ Q , g ,t ) (cid:1) ( C o ( t ) , O ( t )) | C o ( t ) (cid:3) . Under assumption 3, sup t ≥ sup ¯ Q ∈ ¯ Q (cid:107) D ∗ ( ¯ Q, g ,t ) − D ∗ ( ¯ Q , g ,t ) (cid:107) ∞ = O (1). Theorem A.4 invan Handel (2010) yields that, with probability at least 1 − δ/ √ N sup (cid:8) M ,N ( ¯ Q, ¯ Q ) : ¯ Q ∈ ¯ Q , σ N ( ¯ Q, ¯ Q ) ≤ r (cid:9) (cid:46) J [ ] ( r, , Ξ N , ¯ O ( N )) + 1 √ N log(1 + N [ ] ( r, , Ξ N , ¯ O ( N ))) + r (cid:112) log(2 /δ ) + log(2 /δ ) √ N . (9)57s (cid:15) (cid:55)→ (cid:113) log(1 + N [ ] ( (cid:15), , Ξ N , ¯ O ( N ))) is non-increasing, we have thatlog(1 + N [ ] ( r, , Ξ N , ¯ O ( N ))) ≤ (cid:0) J [ ] ( r, , Ξ N , ¯ O ( N )) (cid:1) r , and therefore we can bound the right-hand side in (9) with J [ ] ( r, , Ξ N , ¯ O ( N )) (cid:18) J [ ] ( r, , Ξ N , ¯ O ( N )) √ N r (cid:19) + r (cid:112) log(2 /δ ) + log(2 /δ ) √ N .

From assumption 5, there exists r > r (cid:112) log(2 /δ ≤ (cid:15)/ J [ ] ( r , , Ξ N , ¯ O ( N )) ≤ (cid:15)/ . We choose N such that, for every N ≥ N , J [ ] ( r , , Ξ N , ¯ O ( N )) √ N r ≤ /δ ) √ N ≤ (cid:15)/ . We then have that, for any N ≥ N , with probability at least 1 − δ/ √ N sup (cid:8) M ,N ( ¯ Q, ¯ Q ) : ¯ Q ∈ ¯ Q , σ N ( ¯ Q, ¯ Q ) ≤ r (cid:9) ≤ (cid:15)/ . (10)Denote E ( N, r ) the event under which (10) holds. We further ntroduce the event E ( N, r ) := (cid:8) σ N ( ¯ Q ∗ N , ¯ Q ) ≤ r (cid:9) . Under assumption 3, σ N ( ¯ Q ∗ N , ¯ Q ) (cid:46) (cid:107) ¯ Q ∗ N − ¯ Q (cid:107) ,g ∗ ,h N , and from assumption 6, we havethat (cid:107) ¯ Q ∗ N − ¯ Q (cid:107) ,g ∗ ,h N = o P (1) . Therefore, there exists N such that for every N ≥ N , E ( N, r ) holds with probability atleast 1 − δ/

2. We further conclude that for any $N \geq N_0 := \max(N_1, N_2)$, $P[\mathcal{E}_1(N, r_0) \cap \mathcal{E}_2(N, r_0)] \geq 1 - \delta$, and under $\mathcal{E}_1(N, r_0) \cap \mathcal{E}_2(N, r_0)$, it holds that $\sqrt{N} M_{2,N}(\bar{Q}^*_N, \bar{Q}_1) \leq \epsilon$, which is the wished claim.

References

Benkeser, D., Ju, C., Lendle, S. & van der Laan, M. (2018), ‘Online cross-validation-based ensemble learning’, Statistics in Medicine (2), 249–260.

Benkeser, D. & van der Laan, M. (2016), ‘The Highly Adaptive Lasso Estimator’, Proc Int Conf Data Sci Adv Anal, 689–696.

Bergmeir, C. & Benítez, J. M. (2012), ‘On the use of cross-validation for time series predictor evaluation’, Information Sciences, 192–213. Data Mining for Software Trustworthiness.

Bibaut, A. F. & van der Laan, M. J. (2019), ‘Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm’.

Bojinov, I. & Shephard, N. (2019), ‘Time series experiments and causal estimands: Exact randomization tests and trading’, Journal of the American Statistical Association (0), 1–36.

Bolger, N. & Laurenceau, J.-P. (2013), Intensive Longitudinal Methods: An Introduction to Diary and Experience Sampling Research, Methodology in the social sciences, Guilford Publications.

Boruvka, A., Almirall, D., Witkiewitz, K. & Murphy, S. A. (2018), ‘Assessing time-varying causal effect moderation in mobile health’, Journal of the American Statistical Association (523), 1112–1121.

Bradley, R. C. (2005), ‘Basic properties of strong mixing conditions. a survey and some open questions’, Probab. Surveys, 107–144.

Brown, B. M. (1971), ‘Martingale central limit theorems’, Ann. Math. Statist. (1), 59–66.

Chakraborty, B. & Moodie, E. (2013), Statistical Methods for Dynamic Treatment Regimes, Springer Publishing Company, Incorporated.

Chambaz, A., Zheng, W. & van der Laan, M. (2017), ‘Targeted sequential design for targeted learning inference of the optimal treatment rule and its mean reward’, Ann Stat (6), 2537–2564.

Coyle, J., Hejazi, N., Malenica, I. & Sofrygin, O. (2018), ‘sl3: modern super learning with pipelines’. R package version 0.1.0. URL: https://github.com/tlverse/sl3

Dempsey, W., Liao, P., Klasnja, P., Nahum-Shani, I. & Murphy, S. (2015), ‘Randomised trials for the Fitbit generation’, Signif (Oxf) (6), 20–23.

Dulin, P., Gonzalez, V. & Campbell, K. (2014), ‘Results of a pilot test of a self-administered smartphone-based treatment system for alcohol use disorders: usability and early outcomes’, Subst Abus (2), 168–175.

Ertin, E., Stohs, N., Kumar, S., Raij, A., al’Absi, M. & Shah, S. (2011), Autosense: unobtrusively wearable sensor suite for inferring the onset, causality, and consequences of stress in the field, in ‘SenSys’.

Free, C., Phillips, G., Galli, L., Watson, L., Felix, L., Edwards, P., Patel, V. & Haines, A. (2013), ‘The effectiveness of mobile-health technology-based health behaviour change or disease management interventions for health care consumers: a systematic review’, PLoS Med. (1), e1001362.

Hamaker, E., Asparouhov, T., Brose, A., Schmiedek, F. & Muthén, B. (2018), ‘At the frontiers of modeling intensive longitudinal data: Dynamic structural equation models for the affective measurements from the cogito study’, Multivariate Behavioral Research (6), 820–841. PMID: 29624092.

Heron, K. & Smyth, J. (2010), ‘Ecological momentary interventions: incorporating mobile technology into psychosocial and health behaviour treatments’, Br J Health Psychol (Pt 1), 1–39.

Istepanian, R. & Al-Anzi, T. (2018), ‘m-Health 2.0: New perspectives on mobile health, machine learning and big data analytics’, Methods, 34–40.

Istepanian, R. & Woodward, B. (2017), M-Health: Fundamentals and Applications: Fundamentals and Applications, John Wiley-IEEE.

Kallus, N. & Uehara, M. (2019), ‘Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning’.

Klasnja, P., Hekler, E., Shiffman, S., Boruvka, A., Almirall, D., Tewari, A. & Murphy, S. (2015), ‘Microrandomized trials: An experimental design for developing just-in-time adaptive interventions’, Health Psychol, 1220–1228.

Klasnja, P., Smith, S., Seewald, N., Lee, A., Hall, K., Luers, B., Hekler, E. & Murphy, S. (2019), ‘Efficacy of Contextually Tailored Suggestions for Physical Activity: A Micro-randomized Optimization Trial of HeartSteps’, Ann Behav Med (6), 573–582.

Kumar, S., Nilsen, W., Abernethy, A., Atienza, A., Patrick, K., Pavel, M., Riley, W., Shar, A., Spring, B., Spruijt-Metz, D., Hedeker, D., Honavar, V., Kravitz, R., Lefebvre, R., Mohr, D., Murphy, S., Quinn, C., Shusterman, V. & Swendeman, D. (2013), ‘Mobile health technology evaluation: the mHealth evidence workshop’, Am J Prev Med (2), 228–236.

Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E. & Kosorok, M. R. (2019), ‘Estimating dynamic treatment regimes in mobile health using v-learning’, Journal of the American Statistical Association (0), 1–34.

Luedtke, A. & van der Laan, M. (2016a), ‘Optimal individualized treatments in resource-limited settings’, The International Journal of Biostatistics (1), 283–303.

Luedtke, A. & van der Laan, M. (2016b), ‘Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy’, Ann Stat (2), 713–742.

Luedtke, A. & van der Laan, M. (2016c), ‘Super-learning of an optimal dynamic treatment rule’, The International Journal of Biostatistics (1), 305–332.

Malvey, D. & Slovensky, D. J. (2014), mHealth: Transforming Healthcare, Springer Publishing Company, Incorporated.

Muessig, K., Pike, E., Legrand, S. & Hightow-Weidman, L. (2013), ‘Mobile phone applications for the care and prevention of HIV and other sexually transmitted diseases: a review’, J. Med. Internet Res. (1), e1.

Muhammad, G., Alsulaiman, M., Amin, S., Ghoneim, A. & Alhamid, M. (2017), ‘A facial-expression monitoring system for improved healthcare in smart cities’, IEEE Access, 10871–10881.

Murphy, S. A. (2003), ‘Optimal dynamic treatment regimes’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2), 331–355.

Pearl, J. (2009), Causality: Models, Reasoning and Inference, 2nd edn, Cambridge University Press, New York, NY, USA.

Rabbi, M., Philyaw Kotov, M., Cunningham, R., Bonar, E., Nahum-Shani, I., Klasnja, P., Walton, M. & Murphy, S. (2018), ‘Toward Increasing Engagement in Substance Use Data Collection: Development of the Substance Abuse Research Assistant App and Protocol for a Microrandomized Trial Using Adolescents and Emerging Adults’, JMIR Res Protoc (7), e166.

Rakhlin, A., Sridharan, K. & Tewari, A. (2014), ‘Sequential complexities and uniform martingale laws of large numbers’, Probability Theory and Related Fields, 111–153.

Robins, J. M. (2004), Optimal Structural Nested Models for Optimal Sequential Decisions, Springer New York, New York, NY, pp. 189–326.

Robins, J. M., Greenland, S. & Hu, F.-C. (1999), ‘Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome’, Journal of the American Statistical Association (447), 687–700.

Steinhubl, S. R., Muse, E. D. & Topol, E. J. (2013), ‘Can Mobile Health Technologies Transform Health Care?’, JAMA (22), 2395–2396.

Stone, A., Shiffman, S., Atienza, A. & Nebeling, L. (2007), The Science of Real-Time Data Capture: Self-Reports in Health Research, Oxford University Press.

Sutton, R. S. & Barto, A. G. (1998), Introduction to Reinforcement Learning, 1st edn, MIT Press, Cambridge, MA, USA.

van de Geer, S. (2000), Empirical Processes in M-Estimation, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.

van der Laan, M. (2017), ‘A Generally Efficient Targeted Minimum Loss Based Estimator based on the Highly Adaptive Lasso’, Int J Biostat (2).

van der Laan, M., A., C. & S., L. (2018), Online Targeted Learning for Time Series, Springer International Publishing, Cham, pp. 317–346.

van der Laan, M. & Gruber, S. (2016), One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels., Technical Report Working Paper 347., U.C. Berkeley Division of Biostatistics Working Paper Series.

van der Laan, M. & Lendle, S. (2014), Online Targeted Learning, Technical Report Working Paper 330, U.C. Berkeley Division of Biostatistics Working Paper Series.

van der Laan, M., Polley, E. & Hubbard, A. (2007), Super learner, Technical Report Working Paper 222., U.C. Berkeley Division of Biostatistics Working Paper Series.

van der Laan, M. & Rose, S. (2011),

Targeted Learning: Causal Inference for Observationaland Experimental Data (Springer Series in Statistics) , Springer.van der Laan, M. & Rose, S. (2018),

Targeted Learning in Data Science: Causal Inferencefor Complex Longitudinal Studies , Springer Science and Business Media.van der Laan, M. & Rubin, D. (2006), Targeted maximum likelihood learning, TechnicalReport Working Paper 213, U.C. Berkeley Division of Biostatistics Working Paper Series.van der Vaart, A. & Wellner, J. (2013),

Weak Convergence and Empirical Processes ,Springer-Verlag New York.van Handel, R. (2010), ‘On the minimal penalty for markov order estimation’,

ProbabilityTheory and Related Fields (3-4), 709–738.Walls, T. & Schafer, J. (2006),

Models for intensive longitudinal data , Methodology in thesocial sciences, Oxford University Press. 64hang, M., Ward, J., Ying, J., Pan, F. & Ho, R. (2016), ‘The alcohol tracker application:an initial evaluation of user preferences’,

BMJ Innov2