Double/Debiased Machine Learning for Dynamic Treatment Effects via g-Estimation
Greg Lewis
glewis@microsoft.com
Microsoft Research
Vasilis Syrgkanis
vasy@microsoft.com
Microsoft Research
Abstract
We consider the estimation of treatment effects in settings where multiple treatments are assigned over time and treatments can have a causal effect on future outcomes. We formulate the problem as a linear state space Markov process with a high-dimensional state and propose an extension of the double/debiased machine learning framework to estimate the dynamic effects of treatments. Our method allows the use of arbitrary machine learning methods to control for the high-dimensional state, subject to a mean square error guarantee, while still allowing parametric estimation and construction of confidence intervals for the dynamic treatment effect parameters of interest. Our method is based on a sequential regression peeling process, which we show can be equivalently interpreted as a Neyman orthogonal moment estimator. This allows us to show root-n asymptotic normality of the estimated causal effects.
Keywords: dynamic treatment regime, high-dimensional, treatment effects, double machine learning
1. Introduction
Improving outcomes often requires multiple treatments: patients may need a course of drugs to manage or cure a disease, soil may need multiple additives to improve fertility, companies may need multiple marketing efforts to close the sale. To make data-driven decisions, policy-makers need precise estimates of what will happen when a new policy is pursued. Because of its importance, this topic has been studied by many communities and under multiple regimes and formulations; examples include the field of reinforcement learning Sutton and Barto (1998), longitudinal analysis in biostatistics Bang and Robins (2005), and the dynamic treatment regime in causal inference and adaptive clinical trials Lei et al. (2012).

This paper offers a new method for estimating and making inferences about the counterfactual effect of a new treatment policy. Our method is designed to work with observational data, in an environment with multiple treatments (either discrete or continuous) and a high-dimensional state. Valid causal inference is necessary to correctly attribute changes in outcomes to the different treatments applied. But it is more challenging than in a static context, since there are multiple causal pathways from treatments to subsequent outcomes (e.g. directly, or by changing future states, or by affecting intermediate outcomes, or by influencing future treatments).

Our contribution is to extend existing results in the field of semi-parametric inference Neyman (1979); Robinson (1988); Ai and Chen (2003); Chernozhukov et al. (2016); Chernozhukov et al. (2018) to a semi-parametric Markovian model with a flexible high-dimensional state. We propose an estimation algorithm that estimates the dynamic effects of interest, from observational data, at parametric root-$n$ rates. We prove asymptotic normality of the estimates, even when the state is high-dimensional and machine learning algorithms are used to control for indirect effects of the state, as opposed to effects stemming from the treatments. Our formulation can be viewed as a dynamic version of Robinson's classic partial linear model in semi-parametric inference Robinson (1988), where the controls evolve over time and are affected by prior treatments. Our estimation algorithm builds upon and generalizes the recent work of Chernozhukov et al. (2018, 2017) to estimate not only contemporaneous effects, but also dynamic effects over time. In particular, we propose a sequential residualization approach, where the effects at every period are estimated in a Neyman orthogonal manner and then peeled off from the outcome, so as to define a new "calibrated" outcome, which is used to estimate the effect of the treatment in the previous period.

Our work is also closely related to the work on doubly robust estimation in longitudinal data analysis and in offline policy evaluation in reinforcement learning Nie et al. (2019); Thomas and Brunskill (2016); Petersen et al. (2014); Kallus and Uehara (2019b,a). However, one crucial point of departure from all of these works is that we formulate minimal parametric assumptions that allow us to avoid non-parametric rates or high-variance estimation. Typical fully non-parametric approaches to off-policy evaluation in dynamic settings require the repeated estimation of inverse propensity weights at each step, which leads to a dependence on quantities that can be very ill-posed in practice, such as the product of the ratios of states under the observational and the target policy.
Moreover, these approaches are typically restricted to discrete states and actions. Our goal is to capture settings where the treatment can also be continuous (investment level, price discount level, drug dosage), and where the state contains many continuous variables and is potentially high-dimensional. Inverse propensity weighting approaches, though more robust in terms of the assumptions they make on the model, can be quite prohibitive in these settings, even in moderately large data sets.

Our work is very closely related to the bio-statistics literature on estimating causal effects in the dynamic treatment regime or in structural nested models (see Vansteelandt and Sjolander (2016) for a recent overview). In particular, our identification strategy for dynamic treatment effects is a variant of the well-studied g-formula and g-estimation for structural nested models (SNMs) Robins (1986); Robins et al. (1992); Robins (1994); Robins and Ritov (1997); Robins et al. (2000); Lok and DeGruttola (2012); Vansteelandt et al. (2014). Our model can best be viewed as a variant of linear structural nested mean models Robins (1994) that allows for both continuous and discrete treatments applied at each time period. Our identification strategy is analogous to the identification strategy proposed for such models (see e.g. Section 3.2 of Vansteelandt and Sjolander (2016)). The aforementioned works have provided doubly robust estimators for this setting. Our work proposes an orthogonal moment estimator for this setup that reduces to simple regression and classification oracles, which can be solved with arbitrary machine learning methods. This enables the use of machine learning for the estimation of these models in high-dimensional state space/control variable settings. Moreover, under an appropriate sample-splitting scheme, we provide finite sample mean-squared-error rates and prove root-$n$ asymptotic normality under assumptions only on the mean squared error achieved by these machine learning oracles. Further, we provide extensions to the case of a single time-series dataset, as opposed to cross-sectional time-series, as well as extensions to linear structural nested models with heterogeneous coefficients.
2. Model and Preliminaries
We consider a partially linear state space Markov decision process $\{X_t, T_t, Y_t\}_{t=1}^{m}$, where $X_t \in \mathbb{R}^p$ is the state at time $t$, $T_t \in \mathbb{R}^d$ is the action or treatment at time $t$, and $Y_t \in \mathbb{R}$ is an observed outcome of interest at time $t$. We assume that these variables are related via a linear Markovian process:

$$\forall t \in \{1, \ldots, m\}: \qquad X_t = A\, T_{t-1} + B\, X_{t-1} + \eta_t, \qquad T_t = p(T_{t-1}, X_t) + \zeta_t, \qquad Y_t = \theta_0' T_t + \mu' X_t + \epsilon_t \qquad (1)$$

where $\eta_t$, $\zeta_t$ and $\epsilon_t$ are exogenous mean-zero random shocks, independent of all contemporaneous and lagged treatments and states, which for simplicity we assume are each drawn i.i.d. across time, and with $T_0 = X_0 = 0$. In Section 7 we show that our results generalize to more complex Markovian models, but for simplicity of exposition we focus on this simplified version, as it captures all the complexities needed to highlight our main contributions.

Our goal is to estimate the effect of a change in the treatment policy on the final outcome $Y_m$. More concretely, suppose that we were to make an intervention and set each of the treatments to some sequence of values $\{\tau_1, \ldots, \tau_m\}$; what would then be the expected difference in the final outcome $Y_m$ as compared to some baseline policy? For simplicity, and without loss of generality, we will consider the baseline policy to be setting all treatments to zero. We will denote this expected difference by $V(\tau_1, \ldots, \tau_m)$. Equivalently, we can express the quantity of interest using do-calculus: if we denote $R(\tau) = E[Y_m \mid do(T_1 = \tau_1, \ldots, T_m = \tau_m)]$, then $V(\tau) = R(\tau) - R(0)$. In Appendix A, we will also analyze the estimation of the effect of adaptive counterfactual treatment policies, where the treatment at each step can be a function of the state.

Our first observation is that we can decompose the quantity $V(\tau)$ into the estimation of the dynamic treatment effects: if we were to make an intervention and increase the treatment at period $m - \kappa$ by one unit, what is the change $\theta_\kappa$ in the outcome $Y_m$, for $\kappa \in \{0, \ldots, m-1\}$, assuming that we set all subsequent treatments to zero (or, equivalently, to some constant value)? This quantity is the part of the effect on the final outcome that does not go through changes in the subsequent treatments (due to our observational Markovian treatment policy), but only the part of the effect that goes through changes in the state space $X_t$, which is not part of our decision process. This effect can also be expressed in terms of the constants of our Markov process as:

$$\forall \kappa \in \{1, \ldots, m-1\}: \qquad \theta_\kappa = \mu' B^{\kappa-1} A \qquad (2)$$

Then we have the following lemma:

Lemma 1
The counterfactual value function $V: \mathbb{R}^{d \cdot m} \to \mathbb{R}$ can be expressed in terms of the dynamic treatment effects as:

$$V(\tau_1, \ldots, \tau_m) = \sum_{\kappa=0}^{m-1} \theta_\kappa'\, \tau_{m-\kappa} \qquad (3)$$
Proof
Observe that by repeatedly expanding the state in terms of the previous state and action pairs, we can write that, under any non-state-dependent sequence of interventions $\tau_1, \ldots, \tau_m$, the counterfactual value of the state at time $m$, $X^{(\tau)}_m$, can be written as:

$$X^{(\tau)}_m = \sum_{\kappa=1}^{m-1} B^{\kappa-1} A\, \tau_{m-\kappa} + \delta_m$$

where $\delta_m$ is a linear function of the mean-zero exogenous random shocks at all periods. Thus, under intervention $\tau$, the final outcome is distributed according to the random variable:

$$Y^{(\tau)}_m = \theta_0' \tau_m + \sum_{\kappa=1}^{m-1} \theta_\kappa' \tau_{m-\kappa} + \mu' \delta_m + \epsilon_m$$

Since $V(\tau)$ is the expected difference of this counterfactual final outcome and the counterfactual final outcome when $\tau = 0$, and since $\delta_m$ is mean-zero conditional on any exogenous sequence of interventions that do not depend on the state or prior actions, we get the desired result.

Thus, to estimate the function $V$, it suffices to estimate the dynamic treatment effects $\theta_0, \ldots, \theta_{m-1}$.
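To make the model of Equation (1) and the dynamic effects of Equation (2) concrete, the following minimal sketch simulates a panel of independent series from a linear Markovian process of this form and computes the implied dynamic effects. All numerical choices (the matrices $A$, $B$, $\mu$ and the observational policy) are illustrative assumptions, not values used in the paper.

```python
# Simulate n independent series of length m from Equation (1), with T_0 = X_0 = 0,
# and compute the implied dynamic treatment effects via Equation (2).
import numpy as np

def simulate_panel(n, m, p, d, A, B, mu, theta0, policy, seed=0):
    """Returns X of shape (n, m, p), T of shape (n, m, d), Y of shape (n, m)."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n, m + 1, p))
    T = np.zeros((n, m + 1, d))
    Y = np.zeros((n, m + 1))
    for t in range(1, m + 1):
        X[:, t] = T[:, t - 1] @ A.T + X[:, t - 1] @ B.T + rng.normal(size=(n, p))
        T[:, t] = policy(T[:, t - 1], X[:, t]) + rng.normal(size=(n, d))
        Y[:, t] = T[:, t] @ theta0 + X[:, t] @ mu + rng.normal(size=n)
    return X[:, 1:], T[:, 1:], Y[:, 1:]   # drop the zero initial period

def true_dynamic_effects(A, B, mu, theta0, m):
    """theta_kappa' = mu' B^{kappa-1} A for kappa >= 1, i.e. theta_kappa = A'(B')^{kappa-1} mu."""
    effects = [theta0]
    Bk = np.eye(B.shape[0])          # (B')^{kappa-1}, starting at the identity
    for _ in range(1, m):
        effects.append(A.T @ Bk @ mu)
        Bk = B.T @ Bk
    return effects

# Example instance: p = 20 state variables, d = 1 treatment, m = 3 periods.
p, d, m = 20, 1, 3
A = 0.5 * np.ones((p, d))
B = 0.3 * np.eye(p)
mu = np.zeros(p); mu[:2] = 0.5
theta0 = np.array([1.0])
policy = lambda T_prev, X_cur: 0.2 * T_prev + 0.1 * X_cur[:, :d]
X, T, Y = simulate_panel(2000, m, p, d, A, B, mu, theta0, policy)
print(true_dynamic_effects(A, B, mu, theta0, m))   # theta_0, theta_1, theta_2
```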
3. Identification of Dynamic Effects
We first start by showing that the parameters $\theta_0, \ldots, \theta_{m-1}$ are identifiable from the observational data. Identification is not immediately obvious. Write the final outcome as a linear function of all the treatments and the initial state:

$$Y_m = \sum_{\kappa=0}^{m-1} \theta_\kappa' T_{m-\kappa} + \mu' \delta_m + \epsilon_m \qquad (4)$$

The shock $\delta_m$ contains components that are heavily correlated with the treatments, since the shocks at period $t$ affect the state at period $t+1$, which in turn affects the observed treatment at period $t+1$. In other words, if we view the problem as a simultaneous treatment problem, where the final outcome is the outcome, then we essentially have a problem of unmeasured confounding (implicitly, because we ignored the intermediate states).

However, we show that these dynamic effects can be identified from the observational data via a sequential peeling process, which, as we show in the next section, leads to an estimation strategy that achieves parametric rates. This is one of the main contributions of the paper.

Theorem 2
The dynamic treatment effects are identified from the observational data via the following set of conditional moment restrictions: for all $q \in \{0, \ldots, m-1\}$,

$$E\left[ \bar{Y}_{m,m-q} - \theta_q' T_{m-q} - \mu' B^{q} X_{m-q} \,\middle|\, T_{m-q}, X_{m-q} \right] = 0$$

where $\bar{Y}_{m,m-q} = Y_m - \sum_{\kappa=0}^{q-1} \theta_\kappa' T_{m-\kappa}$.
Proof
By repeatedly unrolling the state $X_m$, $q$ times, we have that:

$$Y_m = \theta_0' T_m + \sum_{\kappa=1}^{q} \mu' B^{\kappa-1} A\, T_{m-\kappa} + \mu' B^{q} X_{m-q} + \sum_{\kappa=0}^{q-1} \mu' B^{\kappa} \eta_{m-\kappa} + \epsilon_m = \theta_0' T_m + \sum_{\kappa=1}^{q} \theta_\kappa' T_{m-\kappa} + \mu' B^{q} X_{m-q} + \sum_{\kappa=0}^{q-1} \mu' B^{\kappa} \eta_{m-\kappa} + \epsilon_m$$

Thus, by re-arranging the equation, we have:

$$\bar{Y}_{m,m-q} = \theta_q' T_{m-q} + \mu' B^{q} X_{m-q} + \sum_{\kappa=0}^{q-1} \mu' B^{\kappa} \eta_{m-\kappa} + \epsilon_m$$

Since, for all $t > m-q$, $\eta_t$ and $\epsilon_t$ are random shocks realized after the variables $T_{m-q}, X_{m-q}$, they are mean-zero conditional on $T_{m-q}, X_{m-q}$. Thus:

$$E\left[ \bar{Y}_{m,m-q} - \theta_q' T_{m-q} - \mu' B^{q} X_{m-q} \,\middle|\, T_{m-q}, X_{m-q} \right] = 0$$

which concludes the proof of the theorem.
4. Sequential DML Estimation
We now address the estimation problem. We assume that we are given access to $n$ i.i.d. samples from the Markovian process, i.e. we are given $n$ independent time series, and we denote sample $i$ by $\{X^i_t, T^i_t, Y^i_t\}_t$.

Our goal is to develop an estimator of the function $V$, or equivalently of the parameter vector $\theta = (\theta_0, \ldots, \theta_{m-1})$. We will consider the case of a high-dimensional state space, i.e. $p \gg n$, but a low-dimensional treatment space and a low-dimensional number of periods $m$, i.e. $d, m \ll n$ are constants independent of $n$.

Our goal is to estimate the parameters $\theta$ at $\sqrt{n}$-rates and in a way that our estimator is asymptotically normal, so that we can construct asymptotically valid confidence intervals around our dynamic treatment effects and our estimate of the function $V$. The latter is a non-trivial task due to the high dimensionality of the state space. For instance, it would be statistically impossible if we were to take the direct route of estimating the whole Markov process (i.e. the high-dimensional quantities $A, B, \mu$): if these quantities have a number of non-zero coefficients that grows with $n$ at any polynomial rate, then known results on sparse linear regression preclude their estimation at root-$n$ rates (see e.g. Wainwright (2015)). However, we are not really interested in these low-level parameters of the dynamic process, but solely in the low-dimensional parameter vector $\theta$. We will treat this problem as a semi-parametric inference problem and develop a Neyman orthogonal estimator for the parameter vector Neyman (1979); Robinson (1988); Ai and Chen (2003); Chernozhukov et al. (2016); Chernozhukov et al. (2018).

In particular, we consider a sequential version of the double machine learning algorithm proposed in Chernozhukov et al. (2018). In the static case of a single time period, Chernozhukov et al. (2018) recommend the following estimator for $\theta_0$: using half of your data, fit a model $\hat{q}(X)$ of $E[Y \mid X]$, i.e. a model that predicts the outcome $Y$ from the controls $X$, and a model $\hat{p}(X)$ for $E[T \mid X]$. Then estimate $\theta_0$ on the other half of the data, based on the estimating equation:

$$m(\theta_0; \hat{p}, \hat{q}) = E\left[ (\tilde{Y} - \theta_0' \tilde{T})\, \tilde{T} \right] = 0 \qquad (5)$$

where $\tilde{Y} = Y - \hat{q}(X)$ and $\tilde{T} = T - \hat{p}(X)$ are the residual outcome and treatment.

We propose a sequential version of this process that we call sequential DML, for sequential double/debiased machine learning. Intuitively, our algorithm proceeds as follows:

1. We construct an estimate $\hat{\theta}_0$ of $\theta_0$ in a robust (Neyman orthogonal) manner, by applying the approach of Chernozhukov et al. (2018) on the final step of the process, i.e. on time step $m$; this estimates all the contemporaneous effects of the treatments.

2. Subsequently, we remove the effect of the observed final-step treatment from the observed final-step outcome, i.e. by re-defining the random variable $Y^i_{m,m-1} = Y^i_m - \hat{\theta}_0' T^i_m$; doing this, we have removed any effects on $Y^i_m$ caused by the final treatment $T^i_m$.

3. We can then estimate the one-step dynamic effect $\theta_1$ by performing the residual-on-residual estimation approach with target outcome the "calibrated" outcome $Y^i_{m,m-1}$, treatment $T^i_{m-1}$ and controls $X^i_{m-1}$. Theorem 2 tells us that the conditional exogeneity moment required to apply the residualization process is valid for these random variables.
We can continue in a similar manner, by removing the estimated effect of $T_{m-1}$ from $Y^i_{m,m-1}$ and repeating the above process.

We provide a formal statement of the sequential DML process in Algorithm 1, which also describes more formally the sample splitting and cross-fitting approach that we follow in order to estimate the nuisance models $p$ and $q$ required for calculating the estimated residuals.
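The following is a minimal sketch of the sequential residualization just described (Algorithm 1 gives the formal statement), for a scalar treatment and with a generic sklearn-style regressor as the nuisance learner; the array layout and all names are illustrative assumptions, not the paper's implementation.

```python
# Sequential DML sketch: X has shape (n, m, p), T and Y have shape (n, m), with
# periods 1..m stored in positions 0..m-1; make_model is any factory returning an
# sklearn-style regressor (the paper allows arbitrary ML nuisance learners).
import numpy as np

def crossfit_residual(target, features, make_model, folds):
    """Out-of-fold residual target - E[target | features], with 2-fold cross-fitting."""
    res = np.zeros_like(target, dtype=float)
    for train, test in folds:
        model = make_model().fit(features[train], target[train])
        res[test] = target[test] - model.predict(features[test])
    return res

def sequential_dml(X, T, Y, make_model):
    n, m, _ = X.shape
    idx = np.arange(n)
    folds = [(idx[: n // 2], idx[n // 2:]), (idx[n // 2:], idx[: n // 2])]
    theta = np.zeros(m)
    residuals = []   # per-step residuals, kept for variance estimation later
    for kappa in range(m):
        ctrl = X[:, m - 1 - kappa, :]                          # controls X_{m-kappa}
        y_res = crossfit_residual(Y[:, m - 1], ctrl, make_model, folds)
        t_res = np.column_stack([
            crossfit_residual(T[:, m - 1 - tau], ctrl, make_model, folds)
            for tau in range(kappa + 1)                        # residuals of T_{m-tau}
        ])
        y_bar = y_res - t_res[:, :kappa] @ theta[:kappa]       # peel off more recent effects
        theta[kappa] = np.sum(y_bar * t_res[:, kappa]) / np.sum(t_res[:, kappa] ** 2)
        residuals.append((y_bar, t_res))
    return theta, residuals   # theta[kappa] estimates the effect of T_{m-kappa} on Y_m
```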
5. Estimation Rates and Normality
Our main theorems show that, subject to the first-stage models of the conditional expectations achieving a small (but relatively slow) estimation error, the recovered parameters are root-$n$-consistent and asymptotically normal. Our asymptotic normality proof relies on showing that one can re-interpret our
Sequential DML estimator as a Z-estimator based on a set of moments that satisfy the property of Neyman orthogonality.

Let $p, q$ denote the vector of all nuisance functions and $p^*, q^*$ their true values. Moreover, let $\theta$ denote the vector of dynamic effect parameters, with $\theta^*$ its true value. We provide both finite-sample mean-squared-error rates and asymptotic normality for our estimates (proofs in Appendices C and D).

Theorem 3 (Finite Sample)
Suppose that $E[\zeta_t \zeta_t'] \succeq \lambda I$ and that each coordinate $\hat{h}$ of each nuisance model in $\{q_\kappa, p_{\tau,\kappa}\}_{\tau,\kappa}$ satisfies, w.p. $1-\delta$:

$$\|\hat{h} - h^*\|_2 = \sqrt{E_X\left[ (\hat{h}(X) - h^*(X))^2 \right]} \leq \epsilon_{n,\delta}$$
Algorithm 1
Sequential DML

Randomly split the $n$ samples into $S, S'$.
for each $\kappa \in \{0, \ldots, m-1\}$ do
    Regress $Y_m$ on $X_{m-\kappa}$ using $S$ to learn an estimate $\hat{q}_\kappa$ of the model $q_\kappa(x) = E[Y_m \mid X_{m-\kappa} = x]$ and calculate residuals $\tilde{Y}^i_{m,m-\kappa} = Y^i_m - \hat{q}_\kappa(X^i_{m-\kappa})$ on the other half; vice versa, use $S'$ to learn the model and evaluate on $S$.
    for each $\tau \in \{0, \ldots, \kappa\}$ do
        Regress $T_{m-\tau}$ on $X_{m-\kappa}$ using $S$ to learn an estimate $\hat{p}_{\tau,\kappa}$ of the model $p_{\tau,\kappa}(x) = E[T_{m-\tau} \mid X_{m-\kappa} = x]$ on the first half and calculate residuals $\tilde{T}^i_{m-\tau,m-\kappa} = T^i_{m-\tau} - \hat{p}_{\tau,\kappa}(X^i_{m-\kappa})$ on the other half, and vice versa.
    end for
end for
Using all the data $S \cup S'$:
for $\kappa = 0$ to $m-1$ do
    Regress $\bar{Y}_{m,\kappa} = \tilde{Y}_{m,m-\kappa} - \sum_{\tau<\kappa} \hat{\theta}_\tau' \tilde{T}_{m-\tau,m-\kappa}$ on $\tilde{T}_{m-\kappa,m-\kappa}$, i.e. find a solution $\hat{\theta}_\kappa$ to:

    $$\frac{1}{n} \sum_{i=1}^{n} \left( \bar{Y}^i_{m,\kappa} - \theta_\kappa' \tilde{T}^i_{m-\kappa,m-\kappa} \right) \tilde{T}^i_{m-\kappa,m-\kappa} = 0 \qquad (6)$$

end for

where the expectation is with respect to the corresponding input of each model. Moreover, suppose that (with the absolute value taken coordinate-wise):

$$\left\| \sum_{\tau=0}^{m-1} \left| E\left[ \tilde{T}^*_{m-\tau,m-\kappa}\, \zeta_{m-\kappa}' \right] \right| \right\|_\infty \leq c_m$$

Then w.p. $1-\delta$:

$$\|\hat{\theta} - \theta^*\|_\infty \leq O\left( d\, C_m \left( \sqrt{\frac{\log(d\, m/\delta)}{n}} + \epsilon_{n,\delta} \right) \right)$$

where $C_m := \max\left\{ 1, \left( \frac{c_m}{\lambda - O(m\, d\, \epsilon_{n,\delta})} \right)^{m+1} \right\}$. If each coordinate $\hat{h}$ of each nuisance model satisfies $E_{\hat{h}}\left[ \|\hat{h} - h^*\|_2^2 \right]^{1/2} \leq \epsilon_n$, then:

$$\sqrt{E\left[ \|\hat{\theta} - \theta^*\|_\infty^2 \right]} \leq O\left( d\, C_m \left( \sqrt{\frac{\log(d\, m)}{n}} + \epsilon_n \right) \right).$$
Suppose that the nuisance models $\{q_\kappa, p_{\tau,\kappa}\}_{\tau,\kappa}$ satisfy:

$$\forall \kappa,\ \tau \leq \kappa: \qquad \|\hat{q}_\kappa - q^*_\kappa\|_2,\ \|\hat{p}_{\tau,\kappa} - p^*_{\tau,\kappa}\|_2 \leq o(n^{-1/4}) \qquad (7)$$

Moreover, suppose that all random variables are bounded and that $E[\zeta_t \zeta_t'] \succeq \lambda I$. Then:

$$\sqrt{n}\, (\hat{\theta} - \theta^*) \to_d N(0, V) \qquad (8)$$
where $V = M^{-1} \Sigma (M^{-1})'$ and $M$ is an $m \times m$ block lower-triangular matrix consisting of blocks of size $d \times d$, whose block entries are of the form:

$$\forall \tau \leq \kappa: \qquad M_{\kappa,\tau} = E\left[ (T_{m-\tau} - E[T_{m-\tau} \mid X_{m-\kappa}])\, \zeta_{m-\kappa}' \right]$$

and $\Sigma$ is an $m \times m$ block diagonal matrix whose diagonal block entries take the form $\Sigma_{tt} = E\left[ \epsilon_{m-t}^2\, \zeta_{m-t} \zeta_{m-t}' \right]$.
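Under the assumptions of Theorem 4, normal-based confidence intervals can be built from a plug-in sandwich estimate of $V$. A sketch for a scalar treatment, assuming the per-step residuals returned by the sequential DML sketch of Section 4; treating the estimated residuals as if they were the true ones is the usual plug-in step and is an assumption of this sketch, not a statement of the paper.

```python
# Plug-in sandwich estimate of V = M^{-1} Sigma (M^{-1})' and the implied standard
# errors, using the (y_bar, t_res) pairs returned by sequential_dml above.
import numpy as np

def dynamic_dml_stderr(theta_hat, residuals):
    m = len(theta_hat)
    n = residuals[0][0].shape[0]
    M = np.zeros((m, m))
    Sigma = np.zeros((m, m))
    for kappa, (y_bar, t_res) in enumerate(residuals):
        eps = y_bar - theta_hat[kappa] * t_res[:, kappa]          # moment residual
        for tau in range(kappa + 1):                               # lower-triangular blocks
            M[kappa, tau] = np.mean(t_res[:, tau] * t_res[:, kappa])
        Sigma[kappa, kappa] = np.mean((eps * t_res[:, kappa]) ** 2)
    Minv = np.linalg.inv(M)
    V = Minv @ Sigma @ Minv.T
    return np.sqrt(np.diag(V) / n)    # standard errors for theta_0, ..., theta_{m-1}

# 95% normal-based intervals: theta_hat +/- 1.96 * dynamic_dml_stderr(theta_hat, residuals)
```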
Concrete Rates for Lasso Nuisance Estimates. Suppose that the observational policy $p$ is also a linear function of its input. Then all the models $q_\kappa$ and $p_{\tau,\kappa}$ are high-dimensional linear functions of their input arguments, i.e. $q_\kappa(x) = \phi_\kappa' x$ and $p_{\tau,\kappa,j}(x) = \pi_{\tau,\kappa,j}' x$. If these linear functions satisfy a sparsity constraint, then, under standard regularity assumptions, we can guarantee that, if we use Lasso regression to estimate each of these functions, w.p. $1-\delta$ the estimation error of all nuisance models is $O\left( s \sqrt{\frac{\log(p/\delta)}{n}} \right)$, where $s$ is an upper bound on the number of non-zero coefficients. One sufficient regularity condition is that the expected covariance matrix of every period's state has full rank, i.e. $E[X_t X_t'] \succeq \lambda I$. Thus the requirements of our main theorems of this section are satisfied as long as the sparsity grows as $s = o(n^{1/4})$. These sparsity conditions are satisfied, for instance, if only $s$ coordinates of the high-dimensional state have any effect on the final outcome (i.e. are outcome-relevant) and only $s$ coordinates of the high-dimensional state enter the observational policy.
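Under this sparsity scenario, the Lasso is a natural choice of nuisance learner. A usage sketch, assuming the simulated arrays and the `sequential_dml` function from the earlier sketches; the cross-validated penalty choice is an illustrative default, not the paper's.

```python
# Plug a cross-validated Lasso in as the nuisance learner for the sequential DML sketch.
from sklearn.linear_model import LassoCV

theta_hat, residuals = sequential_dml(X, T[:, :, 0], Y, make_model=lambda: LassoCV(cv=5))
```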
6. Dependent Single Time-Series Samples
Thus far we have assumed that we are working with $n$ independent time series, each of duration $m$. Though this is applicable to many settings where we have panel data with many units over time, in some other settings it is unreasonable to assume that we have many units over time, but rather that we have the same unit over a long period. In this case, we would want to do asymptotics as the number of periods grows. Our goal is still to estimate the dynamic treatment effects (i.e. the effect of a treatment at period $t$ on an outcome in period $t+\kappa$, for $\kappa \in \{0, \ldots, m-1\}$) for some fixed look-ahead horizon $m$.

These quantities can allow us to evaluate the effect of counterfactual treatment policies on the discounted sum of the outcomes, i.e. $\sum_{t=0}^{\infty} \gamma^t Y_t$ for $\gamma < 1$. We can write the counterfactual value function for any non-adaptive policy as:

$$V(\tau) = \sum_{t=0}^{\infty} \gamma^t \sum_{q \leq t} \theta_{t-q}' \tau_q \qquad (9)$$

Assuming outcomes are bounded, the effect $\sum_{q \leq t} \theta_{t-q}' \tau_q$ on any period $t$ can be at most some constant. Thus taking $m$ to be roughly $\log_{1/\gamma}(n)$ suffices to achieve a good approximation of the effect function $V(\pi)$, since the rewards vanish after that many periods, i.e. if we let:

$$V_m(\tau) = \sum_{t=0}^{m} \gamma^t \sum_{q \leq t} \theta_{t-q}' \tau_q$$

then observe that $\|V_m(\tau) - V(\tau)\| \leq O(\gamma^m)$. Thus after $m = \log_{1/\gamma}(n)$ periods the approximation error is smaller than $1/\sqrt{n}$. Thus it suffices to learn the dynamic treatment effect parameters for a small number of steps. To account for this logarithmic growth, we will make the dependence on $m$ explicit in our theorems below.

For any $m$, we will estimate these parameters by splitting the time series into $B = n/m$ sequential blocks of size $m$. Then we will treat each of these blocks roughly as independent observations. The main challenge in our proofs is dealing with the fact that these blocks are not independent but serially correlated. However, we can still apply techniques such as martingale Bernstein concentration inequalities and martingale Central Limit Theorems to achieve the desired estimation rates.

The other important change that we need to make is in the way that we fit our nuisance estimates. To avoid using future samples to train models that will be used on prior samples (which would ruin the martingale structure), we instead propose a progressive nuisance estimation fitting approach, where at every period, all prior blocks are used to train the nuisance models, which are then evaluated on the next block.

Algorithm 2
Block Sequential DML with Progressive Nuisance Estimation

Partition the data into $B = n/m$ blocks of $m$ periods. Denote with $X^b_t, T^b_t, Y^b_t$ the state, action and outcome in the $t$-th period of block $b$.
for each block $b \in \{B/2, \ldots, B\}$ do
    for each $\kappa \in \{0, \ldots, m-1\}$ do
        Regress $Y^{b'}_m$ on $X^{b'}_{m-\kappa}$ using all blocks $b' < b$ to learn the model $q_\kappa(x) = E[Y_m \mid X_{m-\kappa} = x]$ and calculate the residual $\tilde{Y}^b_{m,m-\kappa} = Y^b_m - \hat{q}_\kappa(X^b_{m-\kappa})$.
        for each $\tau \in \{0, \ldots, \kappa\}$ do
            Regress each $T^{b'}_{m-\tau}$ on $X^{b'}_{m-\kappa}$ using blocks $b' < b$ to learn the model $p_{\tau,\kappa}$ and calculate the residual $\tilde{T}^b_{m-\tau,m-\kappa} = T^b_{m-\tau} - \hat{p}_{\tau,\kappa}(X^b_{m-\kappa})$.
        end for
    end for
end for
Using all the blocks $b \in \{B/2, \ldots, B\}$:
for $\kappa = 0$ to $m-1$ do
    Regress $\bar{Y}_{m,\kappa} = \tilde{Y}_{m,m-\kappa} - \sum_{\tau<\kappa} \hat{\theta}_\tau' \tilde{T}_{m-\tau,m-\kappa}$ on $\tilde{T}_{m-\kappa,m-\kappa}$, i.e. find a solution $\hat{\theta}_\kappa$ to the estimating equation:

    $$\frac{2}{B} \sum_{b = B/2}^{B} \left( \bar{Y}^b_{m,\kappa} - \theta_\kappa' \tilde{T}^b_{m-\kappa,m-\kappa} \right) \tilde{T}^b_{m-\kappa,m-\kappa} = 0 \qquad (10)$$

end for

Theorem 5  Let $\mathcal{F}_b$ denote the filtration up until block $b$. Suppose that $E[\zeta_t \zeta_t' \mid \mathcal{F}_b] \succeq \lambda I$ and that each coordinate $\hat{h}$ of each nuisance model in $\{q_\kappa, p_{\tau,\kappa}\}_{\tau,\kappa}$ satisfies, w.p. $1-\delta$, $\forall b \geq B/2$:

$$\|\hat{h} - h^*\|_{b,2} = \sqrt{E_X\left[ (\hat{h}(X^b) - h^*(X^b))^2 \mid \mathcal{F}_b \right]} \leq \epsilon_{B,\delta}$$

where the expectation is with respect to the corresponding input of each model from block $b$. Moreover, suppose that $\left\| \sum_{\tau=0}^{m-1} \left| E\left[ \tilde{T}^*_{m-\tau,m-\kappa}\, \zeta_{m-\kappa}' \mid \mathcal{F}_b \right] \right| \right\|_\infty \leq c_m$. Then w.p. $1-\delta$:

$$\|\hat{\theta} - \theta^*\|_\infty \leq O\left( d\, C_m \left( \sqrt{\frac{\log(d\, m\, B/\delta)}{B}} + \epsilon_{B,\delta} \right) \right)$$

where $C_m := \max\left\{ 1, \left( \frac{c_m}{\lambda - O(m\, d\, \epsilon_{B,\delta})} \right)^{m+1} \right\} m$. If each coordinate $\hat{h}$ of each nuisance model satisfies:

$$\forall b \geq B/2: \qquad E_{\hat{h}}\left[ \|\hat{h} - h^*\|_{b,2}^2 \right]^{1/2} \leq \epsilon_B \qquad (11)$$

then:

$$\sqrt{E\left[ \|\hat{\theta} - \theta^*\|_\infty^2 \right]} \leq O\left( d\, C_m \left( \sqrt{\frac{\log(d\, m\, B)}{B}} + \epsilon_B \right) \right).$$

Theorem 6
Suppose that all random variables are bounded and that $E[\zeta_t \zeta_t' \mid \mathcal{F}_b] \succeq \lambda I$. Moreover, suppose that each coordinate $\hat{h}$ of each nuisance model satisfies:

$$\forall b \geq B/2: \qquad E_{\hat{h}}\left[ \|\hat{h} - h^*\|_{b,2}^2 \right]^{1/2} \leq o(B^{-1/4}) \qquad (12)$$

and suppose that $m$ satisfies $m\, C_m = o(\sqrt{B})$, where $C_m$ is as defined in Theorem 5. Then:

$$\sqrt{B}\, \Sigma^{-1/2} M (\hat{\theta} - \theta^*) \to_d N(0, I) \qquad (13)$$

where $M, \Sigma$ are the same as in Theorem 4.

Concrete rates under sparsity.
Again we note that the requirements on the nuisance estimation models are satisfied under sparsity constraints if the Lasso is used for their estimation. One complication is that in this setup the data used to fit these models are auto-regressive, since the treatment has an effect on subsequent states. However, recent work shows that the same type of error rates as in the independent-sample case can be achieved by the Lasso for such auto-regressive problems (see Melnyk and Banerjee (2016)). Moreover, for more general statistical learning algorithms, one can use the results in Agarwal and Duchi (2013) to get generalization bounds, and hence also MSE bounds, for the types of martingale nuisance mean squared error quantities that we are interested in.
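A minimal sketch of the progressive (expanding-window) nuisance fitting of Algorithm 2 on a single long series, for a scalar treatment and any sklearn-style learner; the block handling, the re-fitting at every block, and all names are illustrative assumptions.

```python
# Block sequential DML with progressive nuisance estimation: nuisances for block b
# are trained on blocks b' < b and evaluated on block b; only the second half of
# the blocks is scored, as in Algorithm 2. Arrays: x (n_periods, p), t (n_periods,),
# y (n_periods,). Models are refit for every block, which is fine for a sketch.
import numpy as np

def block_sequential_dml(x, t, y, m, make_model):
    n_blocks = len(y) // m
    Xb = x[: n_blocks * m].reshape(n_blocks, m, -1)
    Tb = t[: n_blocks * m].reshape(n_blocks, m)
    Yb = y[: n_blocks * m].reshape(n_blocks, m)
    start = n_blocks // 2
    theta = np.zeros(m)
    for kappa in range(m):
        y_bar_res, t_self = [], []
        for b in range(start, n_blocks):
            ctrl_tr = Xb[:b, m - 1 - kappa, :]          # all earlier blocks: training
            ctrl_te = Xb[b:b + 1, m - 1 - kappa, :]     # current block: evaluation
            q = make_model().fit(ctrl_tr, Yb[:b, m - 1])
            y_res = Yb[b, m - 1] - q.predict(ctrl_te)[0]
            t_res = np.zeros(kappa + 1)
            for tau in range(kappa + 1):
                p_mod = make_model().fit(ctrl_tr, Tb[:b, m - 1 - tau])
                t_res[tau] = Tb[b, m - 1 - tau] - p_mod.predict(ctrl_te)[0]
            y_bar_res.append(y_res - t_res[:kappa] @ theta[:kappa])   # peel
            t_self.append(t_res[kappa])
        y_bar_res, t_self = np.array(y_bar_res), np.array(t_self)
        theta[kappa] = np.sum(y_bar_res * t_self) / np.sum(t_self ** 2)
    return theta
```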
7. Generalizations and Extensions
Due to space constraints we defer several non-trivial extensions of our main theory to the appendix. In Appendix A we offer a generalization where the coefficients $\theta$ can be heterogeneous, and this heterogeneity can depend on both endogenous and exogenous parts of the state and even on past actions. Moreover, we characterize the counterfactual value of adaptive treatment policies. Combining these two developments, we offer an estimation method for evaluating adaptive counterfactual policies.
8. Experimental Results
We consider data drawn from the DGP presented in Equation (1), with a linear observational policy $p(T_{t-1}, X_t) = C\, T_{t-1} + D\, X_t$, with $X_0, T_0 = 0$ and $\epsilon_t, \zeta_t, \eta_t$ standard normal random variables (recall that $d$ is the number of treatments and $p$ the number of state variables). We consider an instance where all entries $A_{ij}$, for $i \in [p], j \in [d]$, are equal to a common constant, $B$ and $C$ are scalar multiples of $I_p$ and $I_d$ respectively, all but a few columns of $D$ are zero, and only the first two coordinates of $\mu$ are non-zero. We consider settings where the effect is constant, i.e. $\theta_0 \in \mathbb{R}^d$, or heterogeneous, where $\theta(X) = \theta_0 + \langle \beta, X[S] \rangle$ for some known low-dimensional subset $S$ of the states.

We compare the dynamic DML to several benchmarks. The results are presented in Figure 1, comparing the estimates of the dynamic DML algorithm to a number of other alternatives on a single instance. They fall into two categories. In the "static" set of approaches, each of the contemporaneous and lag effects is estimated one at a time, either by direct regression or by (static) DML. So, for example, to estimate the one-period lag effect $\theta_1$, we would regress $Y_t$ on $T_{t-1}$, possibly with controls. We consider direct regression with no controls ("no-ctrls"), direct regression and DML with controls from the initial period (i.e. $X_{t-1}$; "init-ctrls" and "init-ctrls-dml"), and direct regression and DML with controls from the same period as the outcome (i.e. $X_t$; "fin-ctrls" and "fin-ctrls-dml"). As an alternative to all of these, we try a "direct" dynamic approach, where initially we estimate $\theta_0$ using a Lasso regression of $Y_t$ on all the controls and past treatments and return the coefficient on $T_t$, and then "peel off" the estimated effect as in the main text before running another Lasso regression of $Y_t - \hat{\theta}_0' T_t$ on $T_{t-1}$ to get the first lag effect, and so on. This approach incorporates the peeling step but does not do any orthogonalization.

The point estimates for $\theta_0$, $\theta_1$ and $\theta_2$ are depicted in the three panels of Figure 1, and the error bars correspond to the constructed confidence intervals. For all three, the dynamic DML is relatively close to the truth and the confidence interval contains the truth. The remaining approaches are not, although for the contemporaneous effect the approaches with final-period controls have similar performance; it is really in the lagged effects that the differences become most apparent.

Subsequently we run multiple experiments to evaluate the performance of Dynamic DML. In each setting, we run Monte Carlo experiments, where each experiment draws $N = 500$ samples from the above DGP and then estimates the effects and lag effects based on our Sequential DML algorithm. Figure 2 considers the case of two treatments, and shows that the algorithm performs well in terms of giving reasonable coverage guarantees: for a nominal 95% coverage, actual coverage varies from 91% to 94.5%. The right panel shows that the average estimates are close to the truth. In Figure 6 in Appendix H we repeat the experiments with $N = 2000$: actual coverage is now tightly in the range 94% to 95%, and the average estimates remain relatively unbiased. Finally, Figure 3 explores a scenario with heterogeneous effects that depend on the value of a discrete control, allowing for five different treatment effects at different lags. Performance further improves when $N = 2000$; we obtain precise estimates and coverage (see Figure 8 in Appendix H, where we also present further experimental results).
Figure 1: Comparison of Dynamic DML (with confidence intervals) with benchmarks on a single instance. $n = 400$, $n_t = 2$, $n_x = 100$, $s = 10$, $C = 0$.

Figure 2: (a) Coverage, (b) Point estimates. $n = 500$, $n_t = 2$, $n_x = 450$, $s = 2$, $\sigma(\epsilon_t) = 1$, $\sigma(\eta_t) = 1$.

Figure 3: Heterogeneous effects: (a) Coverage, (b) Point estimates. $n = 500$, $n_t = 2$, $n_x = 450$, $s = 2$, $\sigma(\epsilon_t) = 1$, $\sigma(\eta_t) = 1$.
References
A. Agarwal and J. C. Duchi. The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59(1):573-587, Jan 2013.

C. Ai and X. Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795-1843, 2003.

H. Bang and J. M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962-973, 2005.

V. Chernozhukov, J. C. Escanciano, H. Ichimura, W. K. Newey, and J. M. Robins. Locally robust semiparametric estimation. arXiv e-prints, arXiv:1608.00033, July 2016.

V. Chernozhukov, M. Goldman, V. Semenova, and M. Taddy. Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels. arXiv preprint arXiv:1712.09988, 2017.

V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1-C68, 2018.

D. J. Foster and V. Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.

D. A. Freedman. On tail probabilities for martingales. Annals of Probability, 3(1):100-118, 1975.

P. Hall and C. Heyde. The central limit theorem. In P. Hall and C. Heyde, editors, Martingale Limit Theory and its Application, Probability and Mathematical Statistics: A Series of Monographs and Textbooks, pages 51-96. Academic Press, 1980.

N. Kallus and M. Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes, 2019a.

N. Kallus and M. Uehara. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning, 2019b.

H. Lei, I. Nahum-Shani, K. Lynch, D. Oslin, and S. A. Murphy. A "SMART" design for building individualized treatment sequences. Annual Review of Clinical Psychology, 8:21-48, 2012.

J. J. Lok and V. DeGruttola. Impact of time to start treatment following infection with application to initiating HAART in HIV-positive patients. Biometrics, 68(3):745-754, 2012.

I. Melnyk and A. Banerjee. Estimating structured vector autoregressive models. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 830-839. JMLR.org, 2016.

J. Neyman. C(alpha) tests and their use. Sankhya, pages 1-21, 1979.

X. Nie, E. Brunskill, and S. Wager. Learning when-to-treat policies, 2019.

T. Peel, S. Anthoine, and L. Ralaivola. Empirical Bernstein inequality for martingales: Application to online learning. 2013.

M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. van der Laan. Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. Journal of Causal Inference, 2(2):147-185, 2014.

J. Robins. A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393-1512, 1986.

J. M. Robins. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics - Theory and Methods, 23(8):2379-2412, 1994.

J. M. Robins and Y. Ritov. Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine, 16(3):285-319, 1997.

J. M. Robins, D. Blevins, G. Ritter, and M. Wulfsohn. G-estimation of the effect of prophylaxis therapy for Pneumocystis carinii pneumonia on the survival of AIDS patients. Epidemiology, pages 319-336, 1992.

J. M. Robins, M. A. Hernan, and B. Brumback. Marginal structural models and causal inference in epidemiology, 2000.

P. M. Robinson. Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, pages 931-954, 1988.

R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

P. S. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 2139-2148. JMLR.org, 2016.

S. Vansteelandt and A. Sjolander. Revisiting g-estimation of the effect of a time-varying exposure subject to time-varying confounding. Epidemiologic Methods, 5(1):37-56, 2016.

S. Vansteelandt, M. Joffe, et al. Structural nested models and g-estimation: the partially realized promise. Statistical Science, 29(4):707-731, 2014.

M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. In preparation, University of California, Berkeley, 2015.
Contents
A  Heterogeneous Effects and Adaptive Policies
   A.1  Estimation of Heterogeneous Dynamic Effects
   A.2  Policy Evaluation with Heterogeneity in Exogenous States and Past Actions
B  More Complex Non-Linear Markov Processes
C  Proof of Finite Sample Rates
D  Proof of Asymptotic Normality
E  Proof of Lemma 7
F  Proof of Theorem 5
G  Proof of Theorem 6
H  Further Experimental Results
   H.1  Constant Treatment Effects
   H.2  Heterogeneous Treatment Effects
   H.3  Counterfactual Policy Values
Appendix A. Heterogeneous Effects and Adaptive Policies
We can also account for heterogeneity of the effects in observable characteristics $Z_t$. For instance, suppose that we modified the structural equations as:

$$X_t = A(Z_{t-1})\, T_{t-1} + B\, X_{t-1} + \eta_t, \qquad Y_t = \theta(Z_t)' T_t + \mu' X_t + \epsilon_t$$

In this case, we can also easily adapt our algorithm to estimate heterogeneous models over particular hypothesis spaces by replacing the final-stage estimation with a repeated square loss minimization. Observe that if treatments are discrete and finite, and state variables take finitely many values, then the latter functional form is without loss of generality, since we can consider the high-dimensional state $X_t$ to be a one-hot encoding of the finite number of state values (i.e. $p$-dimensional if the state takes $p$ distinct values) and the treatment vector to be the one-hot encoding of the finitely many treatment values. Then the state evolution equation describes a generic Markov chain over the finitely many state values, whose transition probabilities can be arbitrarily affected by the value of the treatment. However, this representation does not allow us to impose any natural sparsity or regularity assumptions.
Another interpretation, which achieves both desirable properties, is the following: suppose that the underlying base state $S_t$ lies in $\{0,1\}^r$, i.e. it corresponds to $r$ binary features. Then we can consider $X_t$ to contain all possible product features of the $r$ binary base states. Then the structural equations are again without loss of generality. However, we have now encoded the $p = 2^r$ values of the state in a hierarchical encoding, which could very well satisfy the sparsity constraints, i.e. that a small number of products of base state features enter the outcome equation and a small number of products of base state features enter the observational policy. In such settings the nuisance functions are estimable at rates of order $\sqrt{s \log(p)/n} = \sqrt{s\, r/n}$, i.e. linear in the number of underlying binary states. Similar reasoning can be extended to continuous features, by considering sieve expansions of the non-linear dynamics.

We first start with a generic characterization of the counterfactual value function for any adaptive policy. This generalizes our Lemma 1 to account for the fact that the target policy depends on past states and hence is correlated with past random shocks.

Lemma 7
Let $\Pi$ be a space of adaptive policies, i.e. a vector of mappings $\pi_t: \mathbb{R}^p \to \mathbb{R}^d$ that map the state at time $t$ to an action at time $t$. Let $V: \Pi \to \mathbb{R}$ be the counterfactual value function that, for each policy $\pi$, calculates the difference between the counterfactual final outcome under policy $\pi$ as compared to an always-zero treatment policy. Moreover, let:

$$\forall \kappa \in \{1, \ldots, m-1\}: \qquad \theta_\kappa(Z) = \mu' B^{\kappa-1} A(Z) \qquad (14)$$

Then we can write:

$$V(\pi) = \sum_{\kappa=0}^{m-1} E\left[ \theta_\kappa\left( Z^{(\pi)}_{m-\kappa} \right)' T^{(\pi)}_{m-\kappa} \right] \qquad (15)$$

where $Z^{(\pi)}_t, T^{(\pi)}_t$ denote the random variables of the corresponding component of the state vector and of the action at time $t$, when policy $\pi$ is deployed.

In Section A.1, we show that we can estimate the functions $\theta_t$ from the observational data using a heterogeneous version of the Sequential DML algorithm (with a potential change in the nuisance calculation to incorporate preliminary estimates of the effects $\theta_\kappa$). However, even if we have $\theta_\kappa$, it is still not clear what the counterfactual policy value will be, because this counterfactual policy value also requires knowing how the distribution of states changes when deploying $\pi$ instead of the observational policy $\pi^*$. If these distributional shifts can be estimated accurately, then the counterfactual policy value can be estimated in a plug-in manner. We identify two settings where this is the case: i) $Z_t$ are exogenous parts of the state that can affect the treatment and outcome, but are not affected by the treatment; then $Z^{(\pi)}_t = Z^{(\pi^*)}_t$ and we can replace the expectations with empirical expectations over the observational data; ii) the variable $Z_t$ contains exogenous states and past treatments, since under the new policy past treatments are known.
When heterogeneity is solely based on exogenous states, then a simple adaptation of our SequentialDML algorithm accomplishes the task. OUBLE /D EBIASED M ACHINE L EARNING FOR D YNAMIC T REATMENT E FFECTS
Algorithm 3
Sequential DML with Heterogeneity in Exogenous States
Input: data $x^i$, horizon $m$. Randomly split the samples $i$ into $S, S'$.
Calculate nuisance models with cross-fitting in exactly the same manner as in Sequential DML (see Algorithm 1).
Using all the data $S \cup S'$:
for $\kappa = 0$ to $m-1$ do
    Let $\bar{Y}_{m,\kappa} = \tilde{Y}_{m,m-\kappa} - \sum_{\tau<\kappa} \hat{\theta}_\tau(Z_{m-\tau})' \tilde{T}_{m-\tau,m-\kappa}$.
    Regress $\bar{Y}_{m,\kappa}$ on $\tilde{T}_{m-\kappa,m-\kappa}$, with a heterogeneous coefficient in $Z_{m-\kappa}$, i.e. minimize, over some hypothesis space $\Theta_\kappa$, the square loss:

    $$\min_{\theta_\kappa(\cdot) \in \Theta_\kappa} \frac{1}{n} \sum_{i=1}^{n} \left( \bar{Y}^i_{m,\kappa} - \theta_\kappa(Z^i_{m-\kappa})' \tilde{T}^i_{m-\kappa,m-\kappa} \right)^2$$

end for

The proof that this estimate converges at a fast MSE rate follows along similar lines as the proof of Theorem 3, invoking results of Foster and Syrgkanis (2019) on orthogonal statistical learning. Moreover, when effect heterogeneity is linear, i.e. $\theta_\kappa(Z_{m-\kappa}) = \langle \alpha_\kappa, Z_{m-\kappa} \rangle$, then observe that the above estimator can also be viewed as an orthogonal moment estimator, following an identical set of steps as in the proof of Theorem 4. Thus the estimated coefficients $\alpha_\kappa$ are also asymptotically normal and we can use normal-based confidence intervals with an estimate of the variance.

When heterogeneity is with respect to endogenous states (e.g. past actions), then we need to alter slightly the nuisance estimation component of the sequential DML, to also construct preliminary estimates of the dynamic effects to be used in estimating the calibrated target outcomes. The reason is that now we cannot write, for $\tau \leq \kappa$:

$$E\left[ \theta_\tau(Z_{m-\tau})' T_{m-\tau} \mid T_{m-\kappa} \right] = \theta_\tau(Z_{m-\tau})' E\left[ T_{m-\tau} \mid T_{m-\kappa} \right] \qquad (16)$$

since $Z_{m-\tau}$ is correlated with $T_{m-\kappa}$. Thus, in order to estimate $\bar{Y}_{m,\kappa}$, we need to first estimate the target models $\theta_\tau$ for $\tau < \kappa$, in order to calculate the residual outcome $Y_m - \sum_{\tau<\kappa} \hat{\theta}_\tau(Z_{m-\tau})' T_{m-\tau}$, and then regress this residual outcome on $T_{m-\kappa}$, subsequently setting $\bar{Y}_{m,\kappa}$ as the residual of this final regression. This requires interchanging the estimation of nuisance and target models. To preserve statistical validity, one would need to estimate all these regressions on half of the data. Thus a conservative approach would be to estimate preliminary versions $\bar{\theta}_\tau$, without orthogonality, on half of the data, so as to create targets $\bar{Y}$ and residual treatments $\tilde{T}$, and then run the final regression to estimate "de-biased" versions of these preliminary estimates, using all of the data in a cross-fitting manner. With this conservative method, the analysis of asymptotic normality (in the case of linear models of heterogeneity) and the fast MSE rates via the orthogonal statistical learning framework can be argued in an identical manner as in the case of heterogeneity with respect to exogenous states.
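When the heterogeneity is linear in exogenous features, the square-loss minimization in Algorithm 3 reduces to an ordinary least squares regression on interaction features. A minimal sketch for one step $\kappa$, assuming the calibrated outcome, the own-period treatment residual and the exogenous feature matrix (including a constant column) come from the cross-fitting step; all names are illustrative.

```python
# Heterogeneous final stage for theta_kappa(z) = <alpha_kappa, z>: OLS of the
# calibrated outcome on the columns z_j * Ttilde.
import numpy as np

def heterogeneous_final_stage(y_bar, t_res_self, Z):
    features = Z * t_res_self[:, None]                 # interactions Z_j * Ttilde
    alpha, *_ = np.linalg.lstsq(features, y_bar, rcond=None)
    return alpha                                       # theta_kappa(z) = z @ alpha
```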
A.2 Policy Evaluation with Heterogeneity in Exogenous States and Past Actions

When heterogeneity is with respect to exogenous states and the target policy is also heterogeneous only with respect to exogenous states, then observe that the distribution of the $Z$'s does not change between the observational and the target policy. Thus we can estimate the policy value using the empirical distribution of the exogenous states $Z$, i.e.:

$$\hat{V}(\pi) = \frac{1}{n} \sum_{i} \sum_{\kappa=1}^{m-1} \theta_\kappa(Z^i_{m-\kappa})'\, \pi_{m-\kappa}(Z^i_{m-\kappa}) \qquad (17)$$

where $\pi_t(Z_t)$ is the treatment that the target policy $\pi$ recommends at time step $t$, when the exogenous states are $Z_t$. When the heterogeneous model $\theta_t(Z_t)$ is linear in $Z_t$, i.e. $\theta_t(Z_t) = \langle \alpha_t, Z_t \rangle$, then observe that the latter estimate is a linear function of the estimates of $\alpha$. Thus the policy estimate will also be asymptotically normal and the confidence interval can be easily constructed via the delta method.

Similarly, if the heterogeneous model and the target policy $\pi$ depend on exogenous states and past treatments, then we can roll out the policy on the exogenous states of every sample and get the treatment path $T^{\pi,i}_t$ that the target policy would follow on sample $i$, based on the observed exogenous states. Thus we can estimate the value of the policy again via an empirical average:

$$\hat{V}(\pi) = \frac{1}{n} \sum_{i} \sum_{\kappa=1}^{m-1} \theta_\kappa\left( Z^i_{m-\kappa},\, T^{\pi,i}_{1:m-\kappa-1} \right)'\, T^{\pi,i}_{m-\kappa}$$

Appendix B. More Complex Non-Linear Markov Processes

Our approach easily extends to allow for non-linear components in the Markov process. The sole requirement that we need for our method to work is that the dynamic effect of a treatment on any subsequent outcome be constant (not heterogeneous in the observable variables) and linear. This, for instance, would be the case for the following more complex Markovian process:

$$W_t = h(W_{t-1}) + \xi_t, \qquad X_t = A\, T_{t-1} + B\, X_{t-1} + D\, Y_{t-1} + g(W_t) + \eta_t, \qquad T_t = p(T_{t-1}, Y_{t-1}, X_t, W_t) + \zeta_t, \qquad Y_t = \lambda' T_t + \mu' X_t + f(W_t) + \epsilon_t$$

In other words, we can have an exogenous part of the state $W_t$ that is not affected by the treatments but which enters the treatment decision. This exogenous component could affect the endogenous state $X_t$, the treatment decision $T_t$ and the outcome $Y_t$ in a non-linear manner. Furthermore, the outcome $Y_{t-1}$ could also affect the next period's state in a linear manner. Moreover, the treatment policy in the observational data could be arbitrarily non-linear. The only real requirement is that the endogenous part of the state $X_t$ that is affected by the treatments evolves in a linear manner and has a linear effect on the outcomes. Assuming that the nuisance models in our algorithm are estimable at an $n^{-1/4}$ rate, our two main theorems continue to hold.
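A short sketch of the plug-in policy-value estimate in Equation (17), assuming the estimated effect functions and the target policy are available as callables returning $(n, d)$ arrays; all names are illustrative.

```python
# Plug-in policy value: average theta_kappa(Z)' * pi(Z) over the empirical
# distribution of exogenous states Z of shape (n, m, k).
import numpy as np

def policy_value(theta_fns, pi_fns, Z, m):
    """theta_fns[kappa](z) and pi_fns[t](z) must return arrays of shape (n, d)."""
    total = 0.0
    for kappa in range(1, m):
        z = Z[:, m - 1 - kappa, :]                     # Z_{m-kappa} for each sample
        total += np.mean(np.sum(theta_fns[kappa](z) * pi_fns[m - 1 - kappa](z), axis=-1))
    return total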
Appendix C. Proof of Finite Sample Rates

Proof  Let $Z$ denote the random variables associated with a single sample series. Observe that if we let:

$$\ell_\kappa(Z; \theta, p, q) = \left( \bar{Y}_{m,\kappa} - \theta_\kappa' \tilde{T}_{m-\kappa,m-\kappa} \right)^2$$

then $\nabla_{\theta_\kappa} \ell_\kappa(Z; \theta, p, q) = -2\, \psi_\kappa(Z; \theta, p, q)$ (where $\psi$ and $m_\kappa$ are as defined in the proof of Theorem 4).

Moreover, observe that, by consistency of the nuisance functions, we have that for any $\nu = \theta_\kappa - \theta_\kappa^*$:

$$\frac{\nu' \nabla_{\theta_\kappa \theta_\kappa} E[\ell_\kappa(Z; \theta, \hat{p}, \hat{q})]\, \nu}{\|\nu\|^2} \geq \frac{\nu' \nabla_{\theta_\kappa \theta_\kappa} E[\ell_\kappa(Z; \theta, p^*, q^*)]\, \nu}{\|\nu\|^2} - O(\epsilon_{n,\delta}) = \frac{\nu' E[\zeta_{m-\kappa} \zeta_{m-\kappa}']\, \nu}{\|\nu\|^2} - O(\epsilon_{n,\delta}) \geq \lambda - O(m\, d\, \epsilon_{n,\delta})$$

where the first inequality follows from the smoothness of the second derivatives of $\ell_\kappa$ with respect to the nuisances and the mean-squared-error consistency of the nuisance functions. Thus, for some $\lambda_n \geq \lambda - O(m\, d\, \epsilon_{n,\delta})$, the function $E[\ell_\kappa(Z; \theta, \hat{p}, \hat{q})]$ is $\lambda_n$-strongly convex in $\theta_\kappa$. By strong convexity we have:

$$E[\ell_\kappa(Z; \theta, \hat{p}, \hat{q}) - \ell_\kappa(Z; \theta_{-\kappa}, \theta_\kappa^*, \hat{p}, \hat{q})] \geq m_\kappa(\theta_{-\kappa}, \theta_\kappa^*; \hat{p}, \hat{q})' (\theta_\kappa - \theta_\kappa^*) + \lambda_n \|\theta_\kappa - \theta_\kappa^*\|^2$$

Moreover, by convexity we have:

$$E[\ell_\kappa(Z; \theta_{-\kappa}, \theta_\kappa^*, \hat{p}, \hat{q}) - \ell_\kappa(Z; \theta, \hat{p}, \hat{q})] \geq m_\kappa(\theta; \hat{p}, \hat{q})' (\theta_\kappa^* - \theta_\kappa)$$

Combining the two yields:

$$\lambda_n \|\theta_\kappa - \theta_\kappa^*\|^2 \leq \left( m_\kappa(\theta; \hat{p}, \hat{q}) + m_\kappa(\theta_{-\kappa}, \theta_\kappa^*; \hat{p}, \hat{q}) \right)' (\theta_\kappa^* - \theta_\kappa) \leq \left\| m_\kappa(\theta; \hat{p}, \hat{q}) + m_\kappa(\theta_{-\kappa}, \theta_\kappa^*; \hat{p}, \hat{q}) \right\| \|\theta_\kappa^* - \theta_\kappa\|$$

Thus we have:

$$\|\theta_\kappa - \theta_\kappa^*\| \leq \frac{1}{\lambda_n} \left\| m_\kappa(\theta; \hat{p}, \hat{q}) + m_\kappa(\theta_{-\kappa}, \theta_\kappa^*; \hat{p}, \hat{q}) \right\| \qquad (19)$$

Observe that, by the definition of our estimate $\hat{\theta}$, we have that $E_n[\psi_\kappa(Z; \hat{\theta}, \hat{p}, \hat{q})] = 0$. Moreover, on each side of the fold we have that, for a fixed $\theta$, $E[\psi_\kappa(Z; \theta, \hat{p}, \hat{q})] = m_\kappa(\theta; \hat{p}, \hat{q})$. Thus, by the Lipschitzness of the moment $\psi$ with respect to $\theta$ and standard empirical process theory, w.p. $1-\delta$:

$$\sup_{\theta \in [-H, H]^d} \left\| E_n[\psi_\kappa(Z; \theta, \hat{p}, \hat{q})] - m_\kappa(\theta; \hat{p}, \hat{q}) \right\| \leq O\left( H\, d\, \sqrt{\frac{\log(d/\delta)}{n}} \right)$$

Thus we have that w.p. $1-\delta$:

$$\|m_\kappa(\hat{\theta}; \hat{p}, \hat{q})\| \leq O\left( H\, d\, \sqrt{\frac{\log(d/\delta)}{n}} \right)$$

Moreover, by the identifying moment equation we also have that $m_\kappa(\theta^*; p^*, q^*) = 0$. Thus it suffices to understand the deviation of $m_\kappa(\theta_{-\kappa}, \theta_\kappa^*; \hat{p}, \hat{q})$ from the above. For this we perform a second-order Taylor expansion around all arguments $\theta, p, q$. We observe that $m_\kappa$ is independent of $\theta_t$ for $t \geq \kappa$, and also, by Neyman orthogonality, the first-order term with respect to the nuisance models is zero. Thus we have:

$$m_\kappa(\theta_{-\kappa}, \theta_\kappa^*; \hat{p}, \hat{q}) - m_\kappa(\theta^*; p^*, q^*) \leq \nabla_{\theta_{<\kappa}} m_\kappa(\theta^*; p^*, q^*)' (\theta_{<\kappa} - \theta_{<\kappa}^*) + O\left( \|\hat{p} - p^*\|^2 + \|\hat{q} - q^*\|^2 \right) \leq c_m \sup_{t < \kappa} \|\theta_t - \theta_t^*\| + O(\epsilon_{n,\delta}^2)$$

where we used the fact that:

$$\nabla_{\theta_\tau} m_\kappa(\theta^*; p^*, q^*) = E\left[ \tilde{T}^*_{m-\tau,m-\kappa}\, \zeta_{m-\kappa}' \right] \qquad (20)$$

together with our assumption $\left\| \sum_{\tau=0}^{m-1} \left| E\left[ \tilde{T}^*_{m-\tau,m-\kappa}\, \zeta_{m-\kappa}' \right] \right| \right\|_\infty \leq c_m$, and subsequently applied an $\ell_1$-$\ell_\infty$ Cauchy-Schwarz inequality coordinate-wise for each coordinate of the treatment.
Thus, overall, we have, for some absolute constant $C$:

$$\|\theta_\kappa - \theta_\kappa^*\| \leq \frac{1}{\lambda_n} \left( C\left( H\, d\, \sqrt{\frac{\log(d/\delta)}{n}} + \epsilon_{n,\delta} \right) + c_m \sup_{t < \kappa} \|\theta_t - \theta_t^*\| \right) \qquad (21)$$

Let $\mu_{n,\delta} = \frac{C}{\lambda_n}\left( H\, d\, \sqrt{\frac{\log(d/\delta)}{n}} + \epsilon_{n,\delta} \right)$. Then observe that:

$$\|\theta_\kappa - \theta_\kappa^*\| \leq \mu_{n,\delta} + (c_m/\lambda_n) \sup_{t < \kappa} \|\theta_t - \theta_t^*\|$$

Then we have that $\|\theta_0 - \theta_0^*\| \leq \mu_{n,\delta}$, and, by induction, if we assume that $\|\theta_t - \theta_t^*\| \leq \mu_{n,\delta} \sum_{\tau=0}^{t} (c_m/\lambda_n)^\tau$ for all $t < \kappa$, then:

$$\|\theta_\kappa - \theta_\kappa^*\| \leq \mu_{n,\delta} + (c_m/\lambda_n)\, \mu_{n,\delta} \sum_{\tau=0}^{\kappa-1} (c_m/\lambda_n)^\tau = \mu_{n,\delta} \sum_{\tau=0}^{\kappa} (c_m/\lambda_n)^\tau$$

Thus, overall, we have:

$$\forall \kappa \in \{0, \ldots, m-1\}: \quad \|\theta_\kappa - \theta_\kappa^*\| \leq \mu_{n,\delta} \sum_{\tau=0}^{m-1} (c_m/\lambda_n)^\tau \leq O\left( \mu_{n,\delta} \max\{1, (c_m/\lambda_n)^{m+1}\} \right) = O\left( \mu_{n,\delta} \max\left\{ 1, \left( \frac{c_m}{\lambda - O(m\, d\, \epsilon_{n,\delta})} \right)^{m+1} \right\} \right)$$

which concludes the proof of the theorem.

Appendix D. Proof of Asymptotic Normality

Proof  Observe that our estimator can be equivalently viewed as the solution to an empirical version of the following vector of moment conditions:

$$\forall \kappa \in \{0, \ldots, m-1\}: \quad m_\kappa(\theta; p^*, q^*) = E[\psi_\kappa(Z; \theta, p^*, q^*)] = 0$$

where $Z$ is a random vector denoting the variables of a single sample series and:

$$\psi_\kappa(Z; \theta, p^*, q^*) = \left( \tilde{Y}^*_{m,m-\kappa} - \sum_{\tau=0}^{\kappa} \theta_\tau' \tilde{T}^*_{m-\tau,m-\kappa} \right) \tilde{T}^*_{m-\kappa,m-\kappa}$$

with

$$\tilde{Y}^*_{m,m-\kappa} := Y_m - q^*_\kappa(X_{m-\kappa}), \qquad \tilde{T}^*_{m-\tau,m-\kappa} := T_{m-\tau} - p^*_{\tau,\kappa}(X_{m-\kappa})$$

If we denote with $\hat{p}$ and $\hat{q}$ the estimates of the nuisance models, then our estimate $\hat{\theta}$ is equivalent to the solution of the system of equations:

$$E_n\left[ \psi(Z; \hat{\theta}, \hat{p}, \hat{q}) \right] = \frac{1}{n} \sum_{i=1}^{n} \psi(Z^i; \hat{\theta}, \hat{p}, \hat{q}) = 0$$

We now show that the moment vector $\psi$ satisfies the property of Neyman orthogonality with respect to all the nuisance models $p, q$. For this it suffices to show that, for all $\kappa \in \{0, \ldots, m-1\}$:

$$E\left[ \nabla_{\pi_{\tau,\kappa}, \gamma_\kappa} \psi_\kappa(Z; \theta^*, p^*, q^*) \mid X_{m-\kappa} \right] = 0 \qquad (22)$$

where $\nabla_{\pi_{\tau,\kappa}, \gamma_\kappa}$ denotes the derivative of $\psi_\kappa$ with respect to the outputs of the nuisance models $p^*_{\tau,\kappa}, q^*_\kappa$, evaluated at $X_{m-\kappa}$.

For any $\kappa$, the derivative with respect to $\pi_{\tau,\kappa}$, for $\tau < \kappa$, is proportional to:

$$E\left[ \tilde{T}^*_{m-\kappa,m-\kappa} \mid X_{m-\kappa} \right] = 0 \qquad (23)$$

while the derivative with respect to $\pi_{\kappa,\kappa}$ is:

$$-E\left[ \tilde{Y}^*_{m,m-\kappa} - \sum_{\tau=0}^{\kappa} \theta_\tau' \tilde{T}^*_{m-\tau,m-\kappa} \,\middle|\, X_{m-\kappa} \right] + \theta_\kappa' E\left[ \tilde{T}^*_{m-\kappa,m-\kappa} \mid X_{m-\kappa} \right] = 0$$

Similarly, the derivative with respect to $\gamma_\kappa$ is:

$$-E\left[ \tilde{T}^*_{m-\kappa,m-\kappa} \mid X_{m-\kappa} \right] = 0 \qquad (24)$$

Moreover, the Jacobian $M$ of the moment vector $m$ at the true values $\theta^*, p^*, q^*$ is a block lower-triangular matrix whose block entries are of the form:

$$\forall\, 0 \leq \tau \leq \kappa \leq m-1: \quad M_{\kappa,\tau} = E\left[ \tilde{T}^*_{m-\tau,m-\kappa}\, \zeta_{m-\kappa}' \right]$$

Thus its diagonal blocks take the form $M_{\kappa,\kappa} = E[\zeta_{m-\kappa} \zeta_{m-\kappa}'] \succeq \lambda I$. Hence, the minimum eigenvalue of $M$ is at least $\lambda$.

Thus our setting and our estimator fall exactly into the framework of orthogonal moment estimation of Chernozhukov et al. (2018) and we can directly apply Theorem 3.1 of Chernozhukov et al. (2018) to get the result (the exact form of the matrix $\Sigma$ follows from the observation that $\psi_\kappa(Z; \theta^*, p^*, q^*) = \epsilon_{m-\kappa} \zeta_{m-\kappa}$ and the fact that $\Sigma = E[\psi(Z; \theta^*, p^*, q^*) \psi(Z; \theta^*, p^*, q^*)']$).

Appendix E.
Proof of Lemma 7

Let $X^\pi_t$ denote the random variable of the state at time $t$ under policy $\pi$. Let $\pi_{\leq t}$ denote the policy that follows policy $\pi$ up until time $t$ and then offers a treatment of zero in all subsequent steps. Moreover, let $T^{(\pi_{\leq t})}_t$ denote the random vector of treatments under policy $\pi_{\leq t}$. Let $Y^{(\pi)}_m$ denote the counterfactual final outcome under the adaptive policy $\pi$.

Observe that we can write:

$$Y^{(\pi)}_m - Y^{(0)}_m = \sum_{t=0}^{m} \left( Y^{(\pi_{\leq m-t})}_m - Y^{(\pi_{\leq m-t-1})}_m \right)$$

where $Y^{(0)}_m$ is the counterfactual final outcome under zero treatment at each period. Observe that, by repeatedly expanding the linear system on all periods where the treatment is zero, we can write:

$$Y^{(\pi_{\leq m-t})}_m = \mu' B^{t-1} A\, T^{(\pi_{\leq m-t})}_{m-t} + \mu' B^{t} X^{(\pi_{\leq m-t})}_{m-t} + \sum_{s > m-t} \mu' B^{m-s} \eta_s + \epsilon_m$$

Since $\eta_s$ and $\epsilon_m$ are independent and mean-zero conditional on $T^{(\pi_{\leq m-t})}_{m-t}, X^{(\pi_{\leq m-t})}_{m-t}$, we have:

$$E\left[ Y^{(\pi_{\leq m-t})}_m \right] = E\left[ \mu' B^{t-1} A\, T^{(\pi_{\leq m-t})}_{m-t} + \mu' B^{t} X^{(\pi_{\leq m-t})}_{m-t} \right] = E\left[ \theta_t' T^{(\pi_{\leq m-t})}_{m-t} + \mu' B^{t} X^{(\pi_{\leq m-t})}_{m-t} \right]$$

Since $\pi_{\leq m-t}$ and $\pi_{\leq m-t-1}$ use the same exact treatment policy up until time $m-t-1$, we have that $X^{(\pi_{\leq m-t})}_{m-t}$ and $X^{(\pi_{\leq m-t-1})}_{m-t}$ are identically distributed. Thus:

$$E\left[ Y^{(\pi_{\leq m-t})}_m - Y^{(\pi_{\leq m-t-1})}_m \right] = \theta_t' E\left[ T^{(\pi_{\leq m-t})}_{m-t} - T^{(\pi_{\leq m-t-1})}_{m-t} \right]$$

Finally, observe that $T^{(\pi_{\leq m-t})}_{m-t}$ is identically distributed as $T^{(\pi)}_{m-t}$ (since both policies are identical up until and including time $m-t$). Moreover, $T^{(\pi_{\leq m-t-1})}_{m-t} = 0$. Thus:

$$E\left[ Y^{(\pi_{\leq m-t})}_m - Y^{(\pi_{\leq m-t-1})}_m \right] = \theta_t' E\left[ T^{(\pi)}_{m-t} \right]$$

Appendix F. Proof of Theorem 5

Consider blocks of size $m$ and let $Z^b$ denote the set of all random variables in block $b$, starting from the first state $X_{b,1}$ of that block and ending in the final outcome $Y_{b,m}$. Moreover, let $\mathcal{F}_b$ denote the filtration before each block $b$ (i.e. all past variables before $b$) and let:

$$m_{b,\kappa}(\theta; p, q) = E[\psi_\kappa(Z^b; \theta, p, q) \mid \mathcal{F}_b]$$

Since, by assumption, $E[\zeta_t \zeta_t' \mid \mathcal{F}_b] \succeq \lambda I$,
Appendix F. Proof of Theorem 5

Consider blocks of size $m$ and let $Z_b$ denote the set of all random variables in block $b$, starting from the first state $X_{b,0}$ of that block and ending in the final outcome $Y_{b,m}$. Moreover, let $\mathcal{F}_b$ denote the filtration before each block $b$ (i.e. all past variables before $b$) and let:
\[
m_{b,\kappa}(\theta; p, q) = \mathbb{E}\big[ \psi_\kappa(Z_b; \theta, p, q) \mid \mathcal{F}_b \big]
\]
Since by assumption $\mathbb{E}[\zeta_t \zeta_t' \mid \mathcal{F}_b] \succeq \lambda I$, we have:
\[
\mathbb{E}\big[ \ell_\kappa(Z_b; \theta, p, q) - \ell_\kappa(Z_b; \theta_{-\kappa}, \theta^*_\kappa, p, q) \mid \mathcal{F}_b \big] \ge m_{b,\kappa}(\theta_{-\kappa}, \theta^*_\kappa; p, q)'\, (\theta_\kappa - \theta^*_\kappa) + \lambda \|\theta_\kappa - \theta^*_\kappa\|^2
\]
\[
\mathbb{E}\big[ \ell_\kappa(Z_b; \theta_{-\kappa}, \theta^*_\kappa, p, q) - \ell_\kappa(Z_b; \theta, p, q) \mid \mathcal{F}_b \big] \ge m_{b,\kappa}(\theta; p, q)'\, (\theta^*_\kappa - \theta_\kappa)
\]
Combining the two yields:
\[
\lambda \|\theta_\kappa - \theta^*_\kappa\|^2 \le \big( m_{b,\kappa}(\theta; p, q) + m_{b,\kappa}(\theta_{-\kappa}, \theta^*_\kappa; p, q) \big)'\, (\theta^*_\kappa - \theta_\kappa)
\]
Averaging the latter inequality over the $B$ blocks, applying the Cauchy–Schwarz inequality and dividing by $\|\theta_\kappa - \theta^*_\kappa\|$:
\[
\|\theta_\kappa - \theta^*_\kappa\| \le \frac{1}{\lambda} \left\| \frac{1}{B} \sum_{b} \big( m_{b,\kappa}(\theta; p, q) + m_{b,\kappa}(\theta_{-\kappa}, \theta^*_\kappa; p, q) \big) \right\|
\]
Observe that by the definition of our estimate $\hat{\theta}$, we have that:
\[
\frac{1}{B} \sum_{b} \psi_\kappa(Z_b; \hat{\theta}, \hat{p}, \hat{q}) = 0
\]
Moreover, conditioning on the first fold, we have that for any fixed $\theta$:
\[
\mathbb{E}\big[ \psi_\kappa(Z_b; \theta, \hat{p}, \hat{q}) \mid \mathcal{F}_b \big] = m_{b,\kappa}(\theta; \hat{p}, \hat{q})
\]
Thus, by the Lipschitzness of the moment $\psi$ with respect to $\theta$, the martingale Azuma inequality and standard covering arguments, we have:
\[
\sup_{\theta \in [-H,H]^d} \left\| \frac{1}{B} \sum_{b} \big( \psi_\kappa(Z_b; \theta, \hat{p}, \hat{q}) - m_{b,\kappa}(\theta; \hat{p}, \hat{q}) \big) \right\| \le O\!\left( \sqrt{\frac{d \log(d\,B/\delta)}{B}} \right)
\]
Thus we have that, with probability $1 - \delta$:
\[
\left\| \frac{1}{B} \sum_{b} m_{b,\kappa}(\hat{\theta}; \hat{p}, \hat{q}) \right\| \le O\!\left( \sqrt{\frac{d \log(d\,B/\delta)}{B}} \right)
\]
Moreover, by the identifying equation assumption we also have that:
\[
\forall b:\quad m_{b,\kappa}(\theta_0; p^*, q^*) = 0
\]
Thus it suffices to understand the deviation of $m_\kappa(\theta_{-\kappa}, \theta^*_\kappa; \hat{p}, \hat{q})$ from the above. For this we perform a second order Taylor expansion around all arguments $\theta, p, q$. We observe that $m_\kappa$ is independent of $\theta_t$ for $t \ge \kappa$ and also, by Neyman orthogonality, the first order term with respect to the nuisance models is zero. The remainder of the proof is identical to the final inductive part of the proof of Theorem 3.

Appendix G. Proof of Theorem 6

Consider blocks of size $m$ and let $Z_b$ denote the set of all random variables in block $b$, starting from the first state $X_{b,0}$ of that block and ending in the final outcome $Y_{b,m}$. Moreover, let $\mathcal{F}_b$ denote the filtration before each block $b$ (i.e. all past variables before $b$) and let:
\[
m_{b,\kappa}(\theta; p, q) = \mathbb{E}\big[ \psi_\kappa(Z_b; \theta, p, q) \mid \mathcal{F}_b \big]
\]
Let
\[
m_\kappa(\theta; p, q) = \frac{1}{B} \sum_{b} m_{b,\kappa}(\theta; p, q), \qquad \Psi_{B,\kappa}(\theta; p, q) = \frac{1}{B} \sum_{b} \psi_\kappa(Z_b; \theta, p, q)
\]
By a second order Taylor expansion of $m(\theta; p, q)$ around $\theta_0, p^*, q^*$, and due to the orthogonality of the moment with respect to $p, q$, the linearity of the moment with respect to $\theta$, and the fact that $m(\theta_0; p^*, q^*) = 0$, we have:
\[
m(\theta; p, q) = \nabla_\theta m(\theta_0; p^*, q^*)\, (\theta - \theta_0) + \xi
\]
where $\xi$ satisfies $\|\xi\| = m\, d\, O\big( \|p - p^*\|_\infty^2, \|q - q^*\|_\infty^2 \big)$. If we let $M = \nabla_\theta m(\theta_0; p^*, q^*)$, we have that:
\[
M (\theta - \theta_0) = m(\theta; p, q) + \rho
\]
where $\rho$ also satisfies $\|\rho\| = m\, d\, O\big( \|p - p^*\|_\infty^2, \|q - q^*\|_\infty^2 \big)$. Moreover, observe that by the definition of our estimator, we have:
\[
\Psi_B(\hat{\theta}; \hat{p}, \hat{q}) = 0
\]
Thus we conclude that:
\[
M (\hat{\theta} - \theta_0) = \big( m(\hat{\theta}; \hat{p}, \hat{q}) - \Psi_B(\hat{\theta}; \hat{p}, \hat{q}) \big) + \rho
\]
Moreover, observe that $m(\theta_0; p^*, q^*) = 0$.
Thus we can further expand the right hand side as:
\[
m(\hat{\theta}; \hat{p}, \hat{q}) - \Psi_B(\hat{\theta}; \hat{p}, \hat{q}) = -\Psi_B(\theta_0; p^*, q^*) + m(\hat{\theta}; \hat{p}, \hat{q}) - \Psi_B(\hat{\theta}; \hat{p}, \hat{q}) - \big( m(\theta_0; p^*, q^*) - \Psi_B(\theta_0; p^*, q^*) \big)
\]
\[
= \underbrace{-\Psi_B(\theta_0; p^*, q^*)}_{\text{asymptotically normal term}} + \underbrace{m(\hat{\theta}; \hat{p}, \hat{q}) - m(\theta_0; p^*, q^*) - \big( \Psi_B(\hat{\theta}; \hat{p}, \hat{q}) - \Psi_B(\theta_0; p^*, q^*) \big)}_{A \,=\, \text{stochastic equicontinuity term}}
\]
First we note that the term $A$ decays to zero faster than a root-$B$ rate. By a martingale Bernstein inequality (see e.g. Freedman (1975) or Peel et al. (2013)), we have that for any $\theta$:
\[
\|A\|_\infty \le O\!\left( \sup_{\kappa \in \{0,\ldots,m-1\},\, i \in \{1,\ldots,d\}} \sqrt{ \sup_b \mathbb{E}\big[ (\psi_{\kappa,i}(Z_b; \theta, \hat{p}, \hat{q}) - \psi_{\kappa,i}(Z_b; \theta_0, \hat{p}, \hat{q}))^2 \mid \mathcal{F}_b \big]\, \frac{\log(m\,d/\delta)}{B} } + \frac{\log(m\,d/\delta)}{B} \right)
\]
Moreover, by the Lipschitzness of $\psi$, we have that:
\[
\sqrt{ \mathbb{E}\big[ (\psi_{\kappa,i}(Z_b; \theta, \hat{p}, \hat{q}) - \psi_{\kappa,i}(Z_b; \theta_0, \hat{p}, \hat{q}))^2 \mid \mathcal{F}_b \big] } \le m\, d\, O\!\left( \sqrt{\mathbb{E}\big[\|\hat{\theta} - \theta_0\|_\infty^2\big]},\; \sqrt{\mathbb{E}\big[\|\hat{p} - p^*\|_{b,\infty}^2\big]},\; \sqrt{\mathbb{E}\big[\|\hat{q} - q^*\|_{b,\infty}^2\big]} \right)
\]
By our estimation rate theorem for $\hat{\theta}$ and our assumption on the estimation rate of $\hat{p}$ and $\hat{q}$ we have that:
\[
\forall \kappa \in \{0,\ldots,m-1\}:\quad \sqrt{\mathbb{E}\big[\|\hat{\theta} - \theta_0\|_\infty^2\big]} \le O\!\left( d\, C^m \left( \sqrt{\frac{\log(d\,B)}{B}} + \epsilon_B \right) \right)
\]
and $\sqrt{\mathbb{E}[\|\hat{p} - p^*\|_{b,\infty}^2]},\; \sqrt{\mathbb{E}[\|\hat{q} - q^*\|_{b,\infty}^2]} \le \epsilon_B = o(B^{-1/4})$. Since by assumption $m\, C^m = o(\sqrt{B})$, we have:
\[
\sqrt{B}\; m\, d\, O\!\left( \sqrt{\mathbb{E}\big[\|\hat{\theta} - \theta_0\|_\infty^2\big]},\; \sqrt{\mathbb{E}\big[\|\hat{p} - p^*\|_{b,\infty}^2\big]},\; \sqrt{\mathbb{E}\big[\|\hat{q} - q^*\|_{b,\infty}^2\big]} \right) = o_p(1)
\]
Thus:
\[
\sqrt{B}\, \|A\| = o_p(1) \tag{25}
\]
Moreover, for the exact same reason:
\[
\sqrt{B}\, \|\rho\| = o_p(1) \tag{26}
\]
Thus it suffices to show that:
\[
-\sqrt{B}\, \Psi_B(\theta_0; p^*, q^*) \to_d N(0, \Sigma) \tag{27}
\]
Observe that under the true values of $\theta_0, p^*, q^*$, the moment $\psi_\kappa(Z_b; \theta_0, p^*, q^*) = \epsilon_{b,\kappa}\, \zeta_{b,\kappa}$ is only a function of the residual random noises $\zeta_{b,\kappa}, \epsilon_{b,\kappa}$. Since these variables are independent and identically distributed across all blocks,\footnote{Our theorem would easily extend if these variables form a martingale with a non-zero variance at each step.} we can conclude that the conditional variance satisfies:
\[
\mathrm{Var}\big( \psi_\kappa(Z_b; \theta_0, p^*, q^*) \mid \mathcal{F}_b \big) = \mathbb{E}\big[ \epsilon_{b,\kappa}^2\, \zeta_{b,\kappa} \zeta_{b,\kappa}' \mid \mathcal{F}_b \big] = \Sigma_{\kappa\kappa} \succ \sigma^2 \lambda I \tag{28}
\]
where $\sigma^2$ is the variance of the shocks $\epsilon_{b,\kappa}$ and $\lambda$ is the minimum eigenvalue of the covariance of $\zeta_{b,\kappa}$. Thus $\Sigma$ is a positive definite symmetric matrix with minimum eigenvalue independent of $m$, and it is a constant positive definite matrix, independent of $b$. We can thus apply a Martingale Central Limit Theorem (see e.g. Hall and Heyde (1980)) to get the desired statement:
\[
\sqrt{B}\, \Sigma^{-1/2} M (\hat{\theta} - \theta_0) \to_d N(0, I) \tag{29}
\]
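As an illustration of how the limit in (29) translates into inference in practice, the following is a minimal sketch (our own naming and simplifications, assuming scalar treatments per period and that the cross-fitted residuals have already been computed) of a plug-in sandwich construction for confidence intervals on the dynamic effects; it is not the authors' implementation.

```python
# Minimal sketch: plug-in estimates of the Jacobian M and moment covariance
# Sigma from residualized data, combined into normal-based intervals via the
# sandwich formula implied by sqrt(B) Sigma^{-1/2} M (theta_hat - theta_0) -> N(0, I).
import numpy as np
from scipy import stats

def dynamic_dml_confidence_intervals(T_res, Y_res, theta_hat, alpha=0.05):
    """
    T_res     : (B, m+1, m+1) array; T_res[b, tau, kappa] is the residual
                T_{m-tau} - p_{tau,kappa}(X_{m-kappa}) for tau <= kappa
                (entries with tau > kappa should be zero).
    Y_res     : (B, m+1) array; Y_res[b, kappa] is Y_m - q_kappa(X_{m-kappa}).
    theta_hat : (m+1,) array of estimated dynamic effects (scalar treatments).
    """
    B, K, _ = T_res.shape
    # zeta_{b,kappa}: residualized treatment on the "diagonal" (tau = kappa).
    zeta = T_res[:, np.arange(K), np.arange(K)]                      # (B, K)
    # Moment residuals psi_{b,kappa} = eps_{b,kappa} * zeta_{b,kappa}.
    eps = Y_res - np.einsum('t,btk->bk', theta_hat, T_res)           # (B, K)
    psi = eps * zeta                                                  # (B, K)
    # Plug-in Jacobian (block lower-triangular) and moment covariance.
    M_hat = np.einsum('btk,bk->kt', T_res, zeta) / B                  # (K, K)
    Sigma_hat = psi.T @ psi / B                                       # (K, K)
    # Sandwich variance of theta_hat and two-sided normal intervals.
    M_inv = np.linalg.inv(M_hat)
    V = M_inv @ Sigma_hat @ M_inv.T / B
    half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(np.diag(V))
    return theta_hat - half, theta_hat + half
```

In the notation of the proof, `zeta` plays the role of $\zeta_{b,\kappa}$ and `eps` of $\epsilon_{b,\kappa}$, so `Sigma_hat` estimates $\Sigma$ and `M_hat` the block lower-triangular Jacobian $M$.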
Appendix H. Further Experimental Results

H.1 Constant Treatment Effects

Figure 4: Comparison of DynamicDML (with confidence intervals) with benchmarks on a single instance. $n = 400$, $n_t = 2$, $n_x = 100$, $s = 10$, $\sigma(\epsilon_t) = .$, $\sigma(\zeta_t) = .$, $\sigma(\eta_t) = .$, $C = 0$.

Figure 5 (panels: (a) Coverage, (b) Point estimates): $n = 500$, $n_t = 2$, $n_x = 450$, $s = 2$, $\sigma(\epsilon_t) = 1$, $\sigma(\zeta_t) = .$, $\sigma(\eta_t) = 1$.

Figure 6 (panels: (a) Coverage, (b) Point estimates): $n = 2000$, $n_t = 2$, $n_x = 450$, $s = 2$, $\sigma(\epsilon_t) = 1$, $\sigma(\zeta_t) = .$, $\sigma(\eta_t) = 1$.

H.2 Heterogeneous Treatment Effects

Figure 7 (panels: (a) Coverage, (b) Point estimates): Heterogeneous effects: $n = 500$, $n_t = 2$, $n_x = 450$, $s = 2$, $\sigma(\epsilon_t) = 1$, $\sigma(\zeta_t) = .$, $\sigma(\eta_t) = 1$.

Figure 8 (panels: (a) Coverage, (b) Point estimates): Heterogeneous effects: $n = 2000$, $n_t = 2$, $n_x = 450$, $s = 2$, $\sigma(\epsilon_t) = 1$, $\sigma(\zeta_t) = .$, $\sigma(\eta_t) = 1$.

H.3 Counterfactual Policy Values

Figure 9 (panels: (a) Coverage, (b) Point estimates): Counterfactual policy values for randomly chosen binary policies based on exogenous features: $n = 500$, $n_t = 2$, $n_x = 450$, $s = 2$, $\sigma(\epsilon_t) = 1$, $\sigma(\zeta_t) = .$, $\sigma(\eta_t) = 1$.

Figure 10 (panels: (a) Coverage, (b) Point estimates): Counterfactual policy values for randomly chosen binary policies based on exogenous features: $n = 2000$, $n_t = 2$, $n_x = 450$, $s = 2$, $\sigma(\epsilon_t) = 1$, $\sigma(\zeta_t) = .$, $\sigma(\eta_t) = 1$.
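The counterfactual policy values reported above follow the identity of Lemma 7: the value of a policy $\pi$ relative to the all-zero baseline is the sum over periods of the dynamic effect times the average treatment the policy induces. The following is a minimal sketch of that computation under our own simplified interface (a policy is assumed to be a function of exogenous features only, returning one treatment vector per period); it is not the authors' implementation.

```python
# Minimal sketch: estimate E[Y^(pi)_m - Y^(0)_m] = sum_t theta_t' E[T^(pi)_{m-t}]
# by applying the candidate policy to each unit's exogenous features and
# averaging the treatments it assigns. Names are our own assumptions.
import numpy as np

def counterfactual_policy_value(theta_hat, policy, exogenous_features):
    """
    theta_hat          : (m+1, n_t) array; theta_hat[t] is the effect of the
                         treatment applied t periods before the final outcome.
    policy             : callable mapping one unit's exogenous features to an
                         (m+1, n_t) array of treatments, one row per period.
    exogenous_features : iterable of per-unit feature vectors.
    """
    treatments = np.stack([policy(x) for x in exogenous_features])  # (n, m+1, n_t)
    mean_treatments = treatments.mean(axis=0)                       # (m+1, n_t)
    # Row s of mean_treatments is the average treatment at period s, which is
    # m - s periods before the outcome, so pair it with theta_hat[m - s].
    m_plus_1 = theta_hat.shape[0]
    return sum(theta_hat[m_plus_1 - 1 - s] @ mean_treatments[s]
               for s in range(m_plus_1))
```

For the binary policies in Figures 9 and 10, `policy` would map each unit's exogenous features to a $\{0,1\}$ treatment vector in every period.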