Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings
Chengchun Shi
London School of Economics and Political Science, London, U.K.
Sheng Zhang, Wenbin Lu and Rui Song
North Carolina State University, Raleigh, USA.
Summary. Reinforcement learning is a general technique that allows an agent to learn an optimal policy and interact with an environment in sequential decision making problems. The goodness of a policy is measured by its value function starting from some initial state. The focus of this paper is to construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity. We propose to model the state-action value function (Q-function) associated with a policy based on the series/sieve method to derive its confidence interval. When the target policy depends on the observed data as well, we propose a SequentiAl Value Evaluation (SAVE) method to recursively update the estimated policy and its value estimator. As long as either the number of trajectories or the number of decision points diverges to infinity, we show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique. Simulation studies are conducted to back up our theoretical findings. We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status.

Keywords: Confidence interval; Value function; Reinforcement learning; Infinite horizons
1. Introduction
Reinforcement learning (RL) is a general technique that allows an agent to learn and interact with an environment. A policy defines the agent's way of behaving. It maps the states of the environment to a set of actions to be chosen from. RL algorithms have made tremendous achievements and found extensive applications in video games (Silver et al., 2016), robotics (Kormushev et al., 2013), bidding (Jin et al., 2018), ridesharing (Xu et al., 2018), etc. In particular, a number of RL methods have been proposed in precision medicine, to derive an optimal policy as a set of sequential treatment decision rules that optimize patients' clinical outcomes over a fixed period of time (finite horizon). References include Murphy (2003); Zhang et al. (2013); Zhao et al. (2015); Shi et al. (2018); Zhang et al. (2018), to name a few.

Mobile health (or mHealth) technology has recently emerged due to the use of mobile devices such as mobile phones, tablet computers or wearable devices in health care. It allows health-care providers to communicate with patients and manage their illness in real time. It also collects rich longitudinal data (e.g., through mobile health apps) that can be used to estimate the optimal policy. Data from mHealth applications differ from those in finite horizon settings in that the number of treatment decision points for each patient is not necessarily fixed (infinite horizon) while the total number of patients could be limited. Take the OhioT1DM dataset (Marling and Bunescu, 2018) as an example. It contains data for six patients with type 1 diabetes. For each of the patients, their continuous glucose monitoring (CGM) blood glucose levels, insulin doses including bolus and basal rates, and self-reported times of meals and exercises are continually measured and recorded for eight weeks. Developing an optimal policy as a function of these time-varying covariates could potentially assist these patients in improving their health status.

In this paper, we focus on the infinite horizon setting where the data generating process is modeled by a Markov decision process (Puterman, 1994). This includes applications from mHealth, games, robotics, ridesharing, etc. After a policy is proposed, it is important to examine its benefit prior to recommending it for practical use. The goodness of a policy is quantified by its value function, corresponding to the discounted cumulative reward that the agent receives on average, starting from some initial state. The inference of the value function helps a decision maker to evaluate the impact of implementing a policy when the environment is in a certain state. In some applications, it is also important to evaluate the integrated value of a policy aggregated over different initial states. For example, in medical studies, one might wish to know the mean outcome of patients in the population. The integrated value could thus be used as a criterion for comparing different policies.

In the statistics literature, a few methods have been proposed to estimate the optimal policy in infinite horizons. Ertefaie and Strawderman (2018) proposed a variant of the gradient Q-learning method to develop an optimal policy by estimating the optimal Q-function. Luckett et al. (2019) proposed V-learning to directly search for the optimal policy among a restricted class of policies. Inference of the value function under a generic (data-dependent) policy has not been studied in these papers.
In the computer science literature, Thomas et al. (2015) and Jiang and Li (2016) proposed (augmented) inverse propensity-score weighted ((A)IPW) estimators for the value function in infinite horizons and derived their associated CIs. However, these methods are not suitable for settings where only a limited number of trajectories (e.g., plays of a game or patients in medical studies) are available, since (A)IPW estimators become increasingly unstable as the number of decision points diverges to infinity. Moreover, they cannot be used to derive CIs for the value function at a given initial state.

The focus of this paper is to construct confidence intervals (CIs) for a (possibly data-dependent) policy's value function at a given state as well as its integrated value with respect to a given reference distribution. Our proposed CI is derived by estimating the state-action value function (Q-function) under the target policy. Specifically, we use the series/sieve method to approximate the Q-function based on $L$ basis functions, where $L$ grows with the total number of observations. The advantages of our proposed method are summarized as follows. First, the proposed inference method is generally applicable. Specifically, it can be applied to any fixed policy (either deterministic or random) and any data-dependent policy whose value converges at a certain rate. The latter includes policies estimated by gradient Q-learning (Maei et al., 2010; Ertefaie and Strawderman, 2018), fitted Q-iteration (see for example, Ernst et al., 2005; Riedmiller, 2005), etc. See Section 3.2.4 for detailed illustrations.

Second, when applied to data-dependent policies, our method is valid in nonregular cases where the optimal policy is not uniquely defined. Inference without requiring the uniqueness of the optimal policy is extremely challenging even in the simpler finite-horizon settings (see the related discussions in Luedtke and van der Laan, 2016). The major challenge lies in the fact that the estimated policy may not stabilize as the sample size grows, making the variance of the value estimator difficult to estimate (see Section 3.2.1 for details). We achieve valid inference by proposing a SequentiAl Value Evaluation (SAVE) method that splits the data into several blocks and recursively updates the estimated policy and its value estimator. It is worth mentioning that the data-splitting rule cannot be arbitrarily determined since the observations are time dependent in infinite horizon settings (see Section 3.2.2 for details).

Third, our CI is valid as long as either the number of trajectories $n$ in the data, or the number of decision points $T$ in each trajectory, diverges to infinity. It can thus be applied to a wide variety of real applications in infinite horizons, ranging from the Framingham heart study (Tsao and Vasan, 2015) with over two thousand patients to the OhioT1DM dataset that contains eight weeks' worth of data for six people. We also allow both $n$ and $T$ to approach infinity, which is the case in applications from video games. In contrast, CIs proposed by Thomas et al. (2015) and Jiang and Li (2016) require $n$ to grow to infinity to achieve nominal coverage.

Lastly, we consider both off-policy and on-policy learning methods. In off-policy settings, CIs are derived based on historical data collected by a different behavior policy. Off-policy evaluation is critical in situations where running the target policy could be expensive, risky or unethical.
In on-policy settings, the estimated policy is recursively updated as batches of new observations arrive. To the best of our knowledge, this is the first work on statistical inference of a data-dependent policy in on-policy settings.

To study the asymptotic properties of our proposed CI, we focus on tensor-product spline and wavelet series estimators. Our technical contributions are described as follows. First, we introduce a bidirectional-asymptotic framework that allows either $n$ or $T$ to approach infinity. Our major technical contribution is to derive a nonasymptotic error bound for the spectral norm of the random matrix $\widehat{\Sigma}_\pi - E\widehat{\Sigma}_\pi$ (see the explicit form of $\widehat{\Sigma}_\pi$ in Section 3.1) as a function of $n$, $T$ and $L$ (see Lemma 3). This result is important in studying the limiting distribution of series estimators under such a theoretical framework.

Second, for policies that are estimated by Q-learning type algorithms such as greedy gradient Q-learning, fitted Q-iteration and deep Q-networks (Mnih et al., 2015), we relate the convergence rate of their values to the prediction error of the corresponding estimated Q-functions. We show in Theorems 3 and 4 that the values can converge at faster rates than the estimated Q-functions under certain margin-type conditions on the optimal Q-function. To the best of our knowledge, these findings have not been discovered in the reinforcement learning literature. Our theorems form a basis for researchers to study the value properties of Q-learning type algorithms. Moreover, our theoretical results are consistent with findings in point treatment studies where there is only one single decision point (see e.g., Qian and Murphy, 2011; Luedtke and van der Laan, 2016). However, the derivation of Theorems 3 and 4 is more involved since the value function in our settings is an infinite series involving both immediate and future rewards.

Third, when these basis functions are used, we mathematically characterize the approximation error of the Q-function as a function of $L$, the covariate dimension and the smoothness of the Markov transition function and of the conditional mean of the immediate reward as a function of the state-action pair. This offers some guidance to practitioners on the choice of the number of basis functions $L$, when some prior knowledge on the degree of smoothness of the aforementioned functions is available.

The rest of the paper is organized as follows. We introduce the model setup in Section 2. In Sections 3 and 4, we present the proposed off-policy and on-policy evaluation methods, respectively. Simulation studies are conducted to evaluate the empirical performance of the proposed inference methods in Section 5. We apply the proposed inference method to the OhioT1DM dataset in Section 6. All proofs are given in the supplementary article.
2. Optimal policy in infinite-horizon settings
We begin by introducing the notions of the optimal policy, the Q-function and the value function in infinite-horizon settings. Let $X_{0,t}\in\mathcal{X}$ be the time-varying covariates collected at time point $t$, $A_{0,t}\in\mathcal{A}$ denote the action taken at time $t$, and $Y_{0,t}$ stand for the immediate reward observed. Suppose the system satisfies the following Markov assumption (MA),
$$\Pr(X_{0,t+1}\in\mathcal{B}\,|\,X_{0,t}=x, A_{0,t}=a, \{X_{0,j},A_{0,j},Y_{0,j}\}_{0\le j<t})=\mathcal{P}(\mathcal{B}\,|\,x,a),$$
for some transition kernel $\mathcal{P}$, and the conditional mean independence assumption (CMIA) that $E(Y_{0,t}\,|\,X_{0,t}=x,A_{0,t}=a,\{X_{0,j},A_{0,j},Y_{0,j}\}_{0\le j<t})=r(x,a)$ for some reward function $r$. A policy $\pi(\cdot\,|\,\cdot)$ maps a state $x$ to a probability mass function $\pi(\cdot\,|\,x)$ on $\mathcal{A}$. For a given discount factor $0<\gamma<1$, the value function and the Q-function of $\pi$ are defined as
$$V(\pi;x)=\sum_{t\ge 0}\gamma^t E^{\pi}(Y_{0,t}\,|\,X_{0,0}=x), \qquad Q(\pi;x,a)=\sum_{t\ge 0}\gamma^t E^{\pi}(Y_{0,t}\,|\,X_{0,0}=x,A_{0,0}=a),$$
where $E^{\pi}$ denotes the expectation assuming the actions are selected according to $\pi$. The optimal policy is the greedy policy with respect to the optimal Q-function $Q^{opt}(x,a)=\sup_{\pi}Q(\pi;x,a)$, that is,
$$\pi^{opt}(a\,|\,x)=\mathbb{I}\Big\{a=\arg\max_{a'\in\mathcal{A}}Q^{opt}(x,a')\Big\}. \quad (2.2)$$

3. Off-policy evaluation

Let $n$ denote the number of trajectories in the dataset. For the $i$-th trajectory, let $\{A_{i,t}\}_{t\ge 0}$, $\{X_{i,t}\}_{t\ge 0}$ and $\{Y_{i,t}\}_{t\ge 0}$ denote the sequence of actions, states and rewards, respectively. It is worth mentioning that the time points are not necessarily homogeneous across different trajectories. Suppose the data are generated according to a fixed policy $b(\cdot\,|\,\cdot)$, better known as the behavior policy, such that $\{(X_{1,t},A_{1,t},Y_{1,t})\}_{t\ge 0},\{(X_{2,t},A_{2,t},Y_{2,t})\}_{t\ge 0},\cdots,\{(X_{n,t},A_{n,t},Y_{n,t})\}_{t\ge 0}$ are i.i.d. copies of $\{(X_{0,t},A_{0,t},Y_{0,t})\}_{t\ge 0}$. The observed data can thus be summarized as $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{0\le t<T_i,1\le i\le n}$, where $T_i$ denotes the length of the $i$-th trajectory.

Following the procedure of Luckett et al. (2019), for a fixed policy $\pi$, one might estimate $V(\pi;\cdot)$ nonparametrically and construct the CI using the resulting estimates. However, such an approach might not be appropriate for policies that are discontinuous functions of the covariates. To better illustrate this, notice that $V(\pi;\cdot)$ satisfies the following Bellman equation
$$V(\pi;x)=\sum_{a\in\mathcal{A}}\pi(a\,|\,x)\underbrace{\Big\{r(x,a)+\gamma\int_{x'}V(\pi;x')\mathcal{P}(dx'\,|\,x,a)\Big\}}_{C(\pi;x,a)}. \quad (3.4)$$
When $\mathcal{P}$ satisfies certain smoothness conditions (see Condition A1 below), we have
$$\Big|\int_{x'}V(\pi;x')\mathcal{P}(dx'\,|\,x_1,a)-\int_{x'}V(\pi;x')\mathcal{P}(dx'\,|\,x_2,a)\Big|\le\int_{x'}|V(\pi;x')|\,|\mathcal{P}(dx'\,|\,x_1,a)-\mathcal{P}(dx'\,|\,x_2,a)|\to 0, \quad (3.5)$$
as $\|x_1-x_2\|_2\to 0$, for any $\pi$. Suppose $r(\cdot,a)$ is continuous for any $a\in\mathcal{A}$. Then $C(\pi;x,a)$ is continuous in $x$ for any $\pi$ and $a$. When $\pi$ is a discontinuous function of $x$, it follows from (3.4) that $V(\pi;\cdot)$ is not continuous either. However, many nonparametric methods, such as kernel smoothers, series estimation and neural networks, require the underlying function to possess a certain degree of smoothness in order to achieve estimation consistency. Notice that any non-constant deterministic policy has jumps and is not continuous at certain points (such as the optimal policy $\pi^{opt}$ given in (2.2)). This poses significant challenges in performing inference for these policies.

To allow valid inference for both deterministic and random policies, we consider modelling the Q-function. Under CMIA, we have $Q(\pi;x,a)=\sum_{t\ge 0}\gamma^t E^{\pi}\{r(X_{0,t},A_{0,t})\,|\,X_{0,0}=x,A_{0,0}=a\}$. This together with MA yields
$$Q(\pi;x,a)=r(x,a)+\gamma E^{\pi}\Big[\sum_{t\ge 1}\gamma^{t-1}E^{\pi}\{r(X_{0,t},A_{0,t})\,|\,X_{0,1},A_{0,1}\}\,\Big|\,X_{0,0}=x,A_{0,0}=a\Big]=r(x,a)+\gamma E^{\pi}\{Q(\pi;X_{0,1},A_{0,1})\,|\,X_{0,0}=x,A_{0,0}=a\}.$$
As a result, the Q-function satisfies the following Bellman equation
$$Q(\pi;x,a)=r(x,a)+\gamma\sum_{a'\in\mathcal{A}}\int_{x'}Q(\pi;x',a')\pi(a'\,|\,x')\mathcal{P}(dx'\,|\,x,a). \quad (3.6)$$
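To make the Bellman equation (3.6) concrete, the following minimal sketch (not from the paper) solves it exactly on a toy finite-state MDP, where the equation reduces to a linear system; the state space size `S`, the kernel `P`, the reward `r` and the policy `pi` are all made-up placeholders.

```python
import numpy as np

# Toy illustration of the Bellman equation (3.6) on a finite state space.
# All quantities (S, m, P, r, pi) are placeholders, not from the paper.
rng = np.random.default_rng(0)
S, m, gamma = 5, 2, 0.5
P = rng.dirichlet(np.ones(S), size=(S, m))   # P[s, a] ~ transition law P(.|s, a)
r = rng.uniform(0, 1, size=(S, m))           # r[s, a]: mean immediate reward
pi = rng.dirichlet(np.ones(m), size=S)       # pi[s, a]: target policy pi(a|s)

# (3.6): Q(s,a) = r(s,a) + gamma * sum_{s',a'} P(s'|s,a) pi(a'|s') Q(s',a').
# Stacking Q into a vector q, this reads (I - gamma * M) q = r.
M = np.zeros((S * m, S * m))
for s in range(S):
    for a in range(m):
        for s2 in range(S):
            for a2 in range(m):
                M[s * m + a, s2 * m + a2] = P[s, a, s2] * pi[s2, a2]
Q = np.linalg.solve(np.eye(S * m) - gamma * M, r.reshape(-1)).reshape(S, m)

V = (pi * Q).sum(axis=1)   # V(pi; s) = sum_a pi(a|s) Q(pi; s, a)
print(np.round(V, 3))
```

Because $\gamma<1$ and $M$ has unit row sums, $I-\gamma M$ is invertible, which mirrors why the Bellman equation admits a unique solution.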
Similar to (3.5), we can show that the second term on the right-hand-side (RHS) of (3.6) is a smooth function of $x$ for any $\pi$ and $a$. When $r(\cdot,a)$ is smooth, so is $Q(\pi;\cdot,a)$. To formally establish these results, we introduce the notion of $p$-smoothness below.

Let $h(\cdot)$ be an arbitrary function on $\mathcal{X}$. For a $d$-tuple $\alpha=(\alpha_1,\ldots,\alpha_d)^T$ of nonnegative integers, let $D^{\alpha}$ denote the differential operator:
$$D^{\alpha}h(x)=\frac{\partial^{\|\alpha\|_1}h(x)}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}.$$
Here, $x_j$ denotes the $j$-th element of $x$. For any $p>0$, let $\lfloor p\rfloor$ denote the largest integer that is smaller than $p$. Define the class of $p$-smooth functions as follows:
$$\Lambda(p,c)=\Big\{h: \sup_{\|\alpha\|_1\le\lfloor p\rfloor}\sup_{x\in\mathcal{X}}|D^{\alpha}h(x)|\le c,\ \sup_{\|\alpha\|_1=\lfloor p\rfloor}\sup_{\substack{x,y\in\mathcal{X}, x\ne y}}\frac{|D^{\alpha}h(x)-D^{\alpha}h(y)|}{\|x-y\|_2^{p-\lfloor p\rfloor}}\le c\Big\}.$$
For any $x\in\mathcal{X}$, $a\in\mathcal{A}$, suppose the transition kernel $\mathcal{P}(\cdot\,|\,x,a)$ is absolutely continuous with respect to the Lebesgue measure. Then there exists some transition density function $q$ such that $\mathcal{P}(dx'\,|\,x,a)=q(x'\,|\,x,a)dx'$. We impose the following condition.

(A1.) There exist some $p,c>0$ such that $r(\cdot,a), q(x'\,|\,\cdot,a)\in\Lambda(p,c)$ for any $a\in\mathcal{A}$, $x'\in\mathcal{X}$.

Lemma 1. Under A1, there exists some constant $c'>0$ such that $Q(\pi;\cdot,a)\in\Lambda(p,c')$ for any policy $\pi$ and $a\in\mathcal{A}$.

Lemma 1 implies the Q-function has bounded derivatives up to order $\lfloor p\rfloor$. This motivates us to first estimate the Q-function and then derive the corresponding value estimators based on the relation $V(\pi;x)=\sum_{a\in\mathcal{A}}\pi(a\,|\,x)Q(\pi;x,a)$. By the Bellman equation (3.6), we can show the Q-function satisfies
$$E\Big[\Big\{Y_{i,t}+\gamma\sum_{a\in\mathcal{A}}Q(\pi;X_{i,t+1},a)\pi(a\,|\,X_{i,t+1})-Q(\pi;X_{i,t},A_{i,t})\Big\}\,\Big|\,X_{i,t},A_{i,t}\Big]=0. \quad (3.7)$$
The above equation forms the basis of our method for learning $Q(\pi;\cdot,\cdot)$ (see details in the next section). In contrast to Equation (3.3), the sampling ratio $\pi(a\,|\,x)/b(a\,|\,x)$ does not appear in (3.7). This is because $A_{i,t}$ is the only sampled action and no further actions are involved in (3.7). As a result, our method does not require correct specification of the behavior policy, nor do we need to estimate it from the observed dataset. This is another advantage of modelling the Q-function over the value.

We now describe our estimating procedure. We propose to approximate $Q(\pi;\cdot,\cdot)$ based on linear sieves, which takes the form $Q(\pi;x,a)\approx\Phi_L^T(x)\beta_{\pi,a}$ for all $x\in\mathcal{X}$, $a\in\mathcal{A}$, where $\Phi_L(\cdot)=\{\phi_{L,1}(\cdot),\cdots,\phi_{L,L}(\cdot)\}^T$ is a vector consisting of $L$ sieve basis functions, such as splines or wavelet bases (see for example, Huang, 1998, for choices of basis functions). We allow $L$ to grow with the sample size to reduce the bias of the resulting estimates. Under certain mild conditions, there exist some $\{\beta^*_{\pi,a}\}_{a\in\mathcal{A}}$ that satisfy
$$E\Big[\Big\{Y_{i,t}+\gamma\sum_{a\in\mathcal{A}}\Phi_L^T(X_{i,t+1})\beta^*_{\pi,a}\pi(a\,|\,X_{i,t+1})-\Phi_L^T(X_{i,t})\beta^*_{\pi,a'}\Big\}\Phi_L(X_{i,t})\mathbb{I}(A_{i,t}=a')\Big]=0,$$
for any $a'\in\mathcal{A}$. Suppose $\mathcal{A}=\{1,2,\ldots,m\}$.
Define $\beta^*_\pi=(\beta^{*T}_{\pi,1},\cdots,\beta^{*T}_{\pi,m})^T$,
$$\xi(x,a)=\{\Phi_L^T(x)\mathbb{I}(a=1),\Phi_L^T(x)\mathbb{I}(a=2),\cdots,\Phi_L^T(x)\mathbb{I}(a=m)\}^T,$$
$$U_\pi(x)=\{\Phi_L^T(x)\pi(1\,|\,x),\Phi_L^T(x)\pi(2\,|\,x),\cdots,\Phi_L^T(x)\pi(m\,|\,x)\}^T,$$
$\xi_{i,t}=\xi(X_{i,t},A_{i,t})$ and $U_{\pi,i,t}=U_\pi(X_{i,t})$. The above equation can be rewritten as $E\xi_{i,t}(\xi_{i,t}-\gamma U_{\pi,i,t+1})^T\beta^*_\pi=E\xi_{i,t}Y_{i,t}$. Based on the observed data, we propose to estimate $\beta^*_\pi$ by
$$\widehat{\beta}_\pi=\Bigg\{\underbrace{\frac{1}{\sum_i T_i}\sum_{i=1}^n\sum_{t=0}^{T_i-1}\xi_{i,t}(\xi_{i,t}-\gamma U_{\pi,i,t+1})^T}_{\widehat{\Sigma}_\pi}\Bigg\}^{-1}\Bigg(\frac{1}{\sum_i T_i}\sum_{i=1}^n\sum_{t=0}^{T_i-1}\xi_{i,t}Y_{i,t}\Bigg).$$
Writing $\widehat{\beta}_\pi=(\widehat{\beta}^T_{\pi,1},\cdots,\widehat{\beta}^T_{\pi,m})^T$, we propose to estimate $V(\pi;x)$ by
$$\widehat{V}(\pi;x)=\sum_{a\in\mathcal{A}}\Phi_L^T(x)\widehat{\beta}_{\pi,a}\pi(a\,|\,x)=U_\pi^T(x)\widehat{\beta}_\pi.$$
A two-sided CI is given by
$$\Big[\widehat{V}(\pi;x)-z_{\alpha/2}\Big(\sum_i T_i\Big)^{-1/2}\widehat{\sigma}(\pi;x),\ \widehat{V}(\pi;x)+z_{\alpha/2}\Big(\sum_i T_i\Big)^{-1/2}\widehat{\sigma}(\pi;x)\Big], \quad (3.8)$$
where $z_\alpha$ denotes the upper $\alpha$-th quantile of a standard normal distribution,
$$\widehat{\sigma}^2(\pi;x)=U_\pi^T(x)\widehat{\Sigma}_\pi^{-1}\widehat{\Omega}_\pi(\widehat{\Sigma}_\pi^T)^{-1}U_\pi(x),$$
and
$$\widehat{\Omega}_\pi=\frac{1}{\sum_i T_i}\sum_{i=1}^n\sum_{t=0}^{T_i-1}\xi_{i,t}\xi_{i,t}^T\Big\{Y_{i,t}+\gamma\sum_{a\in\mathcal{A}}\Phi_L^T(X_{i,t+1})\widehat{\beta}_{\pi,a}\pi(a\,|\,X_{i,t+1})-\Phi_L^T(X_{i,t})\widehat{\beta}_{\pi,A_{i,t}}\Big\}^2.$$

Let $G$ be a reference distribution on the covariate space $\mathcal{X}$. Define the following integrated value function
$$V(\pi;G)=\int_{x\in\mathcal{X}}V(\pi;x)G(dx).$$
By setting $G(\cdot)$ to be a Dirac measure $\delta_x(\cdot)$, i.e., $G(\mathcal{X}_0)=\mathbb{I}(x\in\mathcal{X}_0)$ for all $\mathcal{X}_0\subseteq\mathcal{X}$, $V(\pi;G)$ is reduced to $V(\pi;x)$. Let $\nu(\cdot)$ be the probability density function of $X_{0,0}$. By setting $G(dx)=\nu(x)dx$, we obtain $V(\pi;G)=\int_{x\in\mathcal{X}}V(\pi;x)\nu(x)dx$.

Based on $\widehat{\beta}_\pi$, a two-sided CI for $V(\pi;G)$ is given by
$$\Big[\widehat{V}(\pi;G)-z_{\alpha/2}\Big(\sum_i T_i\Big)^{-1/2}\widehat{\sigma}(\pi;G),\ \widehat{V}(\pi;G)+z_{\alpha/2}\Big(\sum_i T_i\Big)^{-1/2}\widehat{\sigma}(\pi;G)\Big], \quad (3.9)$$
where
$$\widehat{V}(\pi;G)=\int_{x\in\mathcal{X}}\widehat{V}(\pi;x)G(dx), \quad (3.10)$$
$$\widehat{\sigma}^2(\pi;G)=\Big\{\int_{x\in\mathcal{X}}U_\pi(x)G(dx)\Big\}^T\widehat{\Sigma}_\pi^{-1}\widehat{\Omega}_\pi(\widehat{\Sigma}_\pi^T)^{-1}\Big\{\int_{x\in\mathcal{X}}U_\pi(x)G(dx)\Big\}. \quad (3.11)$$
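As a concrete (and deliberately simplified) illustration of the estimator $\widehat{\beta}_\pi$ and the CI (3.8) above, the sketch below uses a one-dimensional toy MDP and a polynomial sieve in place of tensor-product B-splines; the data-generating model, the target policy `pi_probs` and all constants are our own placeholders, not the paper's.

```python
import numpy as np

# Simplified sketch (not the paper's code) of the sieve estimator and CI (3.8)
# on a 1-d toy MDP with binary actions {0, 1}.
rng = np.random.default_rng(1)
n, T, L, gamma = 50, 60, 4, 0.5

def phi(x):                        # Phi_L(x): L polynomial basis functions
    return np.array([x ** k for k in range(L)])

def pi_probs(x):                   # target policy pi(.|x); placeholder rule
    p1 = 0.25 + 0.5 * (x > 0)
    return np.array([1.0 - p1, p1])

# Generate data under a uniform behavior policy b(a|x) = 1/2.
X = np.zeros((n, T + 1)); A = np.zeros((n, T), dtype=int); Y = np.zeros((n, T))
for i in range(n):
    for t in range(T):
        A[i, t] = rng.integers(2)
        X[i, t + 1] = 0.5 * (2 * A[i, t] - 1) * X[i, t] + rng.normal(0, 0.5)
        Y[i, t] = X[i, t + 1] - 0.25

def xi(x, a):                      # xi(x, a): Phi_L(x) placed in the a-th block
    out = np.zeros(2 * L); out[a * L:(a + 1) * L] = phi(x); return out

def U(x):                          # U_pi(x): blocks Phi_L(x) * pi(a|x)
    p = pi_probs(x)
    return np.concatenate([phi(x) * p[0], phi(x) * p[1]])

N = n * T
Sigma, bvec = np.zeros((2 * L, 2 * L)), np.zeros(2 * L)
for i in range(n):
    for t in range(T):
        x_it = xi(X[i, t], A[i, t])
        Sigma += np.outer(x_it, x_it - gamma * U(X[i, t + 1])) / N
        bvec += x_it * Y[i, t] / N
beta = np.linalg.solve(Sigma, bvec)          # beta_hat_pi

Omega = np.zeros((2 * L, 2 * L))             # middle term of the sandwich variance
for i in range(n):
    for t in range(T):
        x_it = xi(X[i, t], A[i, t])
        eps = Y[i, t] + gamma * U(X[i, t + 1]) @ beta - x_it @ beta
        Omega += np.outer(x_it, x_it) * eps ** 2 / N

x0 = 0.5                                     # CI (3.8) at a fixed state x0
u0 = U(x0)
V_hat = u0 @ beta
se = np.sqrt(u0 @ np.linalg.solve(Sigma, Omega @ np.linalg.solve(Sigma.T, u0)) / N)
print(V_hat, (V_hat - 1.96 * se, V_hat + 1.96 * se))
```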
In this section, we focus on proving the validity of the proposed CIs in (3.9). By setting $G(\cdot)=\delta_x(\cdot)$, this implies that the CI in (3.8) achieves nominal coverage as well. To simplify the presentation, we assume $T_1=T_2=\cdots=T_n=T$, all the covariates are continuous and $\mathcal{X}=[0,1]^d$. Following the behavior policy $b(\cdot\,|\,\cdot)$, the set of variables $\{X_{0,t}\}_{t\ge 0}$ forms a time-homogeneous Markov chain. Its transition kernel $\mathcal{P}_X$ is given by
$$\mathcal{P}_X(\mathcal{B}\,|\,x)=\Pr(X_{0,1}\in\mathcal{B}\,|\,X_{0,0}=x)=\sum_{a\in\mathcal{A}}\mathcal{P}(\mathcal{B}\,|\,x,a)b(a\,|\,x), \quad \forall\mathcal{B}\subseteq\mathcal{X}.$$
We assume such a Markov chain has a unique invariant distribution with some density function $\mu(\cdot)$ on $\mathcal{X}$. Notice that we do not require $\nu=\mu$. To study asymptotics in the case where $T\to\infty$, we assume the chain $\{X_{0,t}\}_{t\ge 0}$ is geometrically ergodic (see Condition A4 in Appendix A). This condition is not required when $T$ is bounded. It is worth mentioning that the boundedness of $T$ does not mean we work in a finite-horizon setting, since $T$ is the termination time of the study, not the final time step of each trajectory.

To establish the limiting distribution of the estimated value function, we restrict our attention to two particular types of sieve basis functions, corresponding to tensor products of B-splines or wavelets (see Condition A2 in Appendix A). See Section 6 of Chen and Christensen (2015) for a brief review of these sieve bases. This together with A1 implies that there exists a set of vectors $\{\beta^*_{\pi,a}\}_{a\in\mathcal{A}}$ that satisfy $\sup_{x\in\mathcal{X},a\in\mathcal{A}}|Q(\pi;x,a)-\Phi_L^T(x)\beta^*_{\pi,a}|=O(L^{-p/d})$. See Section 2.2 of Huang (1998) for detailed discussions on the approximation power of these sieve bases. To guarantee that the bias of our value estimator is asymptotically negligible relative to its variance, we require $\sup_{x\in\mathcal{X},a\in\mathcal{A}}|Q(\pi;x,a)-\Phi_L^T(x)\beta^*_{\pi,a}|=o\{(nT)^{-1/2}\}$. Thus, $L$ shall be chosen to satisfy $L\gg (nT)^{d/(2p)}$.

For any $x\in\mathcal{X}$, $a\in\mathcal{A}$, define
$$\omega_\pi(x,a)=E\Big[\Big\{Y_{0,0}+\gamma\sum_{a'\in\mathcal{A}}\pi(a'\,|\,X_{0,1})Q(\pi;X_{0,1},a')-Q(\pi;X_{0,0},A_{0,0})\Big\}^2\,\Big|\,X_{0,0}=x,A_{0,0}=a\Big].$$

Theorem 1 (bidirectional asymptotics). Assume A1-A4 hold. Suppose $L$ satisfies $L=o\{\sqrt{nT}/\log(nT)\}$ and $L^{2p/d}\gg nT\|\int_x\Phi_L(x)G(dx)\|_2^{-2}$, and there exists some constant $c\ge 1$ such that $\omega_\pi(x,a)\ge c^{-1}$ for any $x\in\mathcal{X}$, $a\in\mathcal{A}$ and $\Pr(\max_{0\le t\le T-1}|Y_{0,t}|\le c)=1$. Then as either $n\to\infty$ or $T\to\infty$, we have
$$\sqrt{nT}\,\widehat{\sigma}^{-1}(\pi;G)\{\widehat{V}(\pi;G)-V(\pi;G)\}\stackrel{d}{\to}N(0,1).$$

To save space, we put conditions A2-A4 and their discussions in Appendix A. A sketch of the proof is given in Appendix B.1. Under the conditions in Theorem 1, we can show that $\widehat{\sigma}^2(\pi;G)$ converges almost surely to some $\sigma^2(\pi;G)$. The form of $\sigma^2(\pi;G)$ is given in Section E.2. In addition, we have
$$\frac{\widehat{V}(\pi;G)-V(\pi;G)}{(nT)^{-1/2}\widehat{\sigma}(\pi;G)}=\frac{(nT)^{-1/2}}{\sigma(\pi;G)}\sum_{i=1}^n\sum_{t=0}^{T-1}\Big\{\int_{x\in\mathcal{X}}U_\pi(x)G(dx)\Big\}^T\Sigma_\pi^{-1}\xi_{i,t}\varepsilon_{i,t}+o_p(1), \quad (3.12)$$
where $\Sigma_\pi=E\widehat{\Sigma}_\pi$ and
$$\varepsilon_{i,t}=Y_{i,t}+\gamma\sum_{a\in\mathcal{A}}Q(\pi;X_{i,t+1},a)\pi(a\,|\,X_{i,t+1})-Q(\pi;X_{i,t},A_{i,t}).$$
By MA, CMIA and (3.7), the leading term on the RHS of (3.12) forms a mean-zero martingale (details can be found in Section E.2). As either $n$ or $T$ grows to infinity, the asymptotic normality follows from the martingale central limit theorem.

When $\sigma(\pi;G)$ is bounded away from zero, it can be seen from (3.12) that $\widehat{V}(\pi;G)-V(\pi;G)=O_p(n^{-1/2}T^{-1/2})$. That is, the proposed value estimator converges at a rate of $(nT)^{-1/2}$. In contrast, AIPW-type estimators typically converge at a rate of $n^{-1/2}$ and are thus not suitable for settings with only a few trajectories.

For simplicity, we assume $T_1=\cdots=T_n=T$ throughout this section. Consider an estimated policy $\widehat{\pi}$, computed based on the data $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{0\le t<T,1\le i\le n}$. To perform valid inference on $V(\widehat{\pi};G)$, the SAVE method splits the trajectories and time points into blocks, recursively computes an estimated policy $\widehat{\pi}_{\bar{\mathcal{I}}_k}$ from the first $k$ blocks and evaluates its value on the $(k+1)$-th block as in Section 3.1, and then aggregates the resulting estimators into the value estimate $\widetilde{V}(G)$ with standard error $\widetilde{\sigma}(G)$, yielding the CI in (3.17); see Appendix C for the explicit forms of the block-level estimators. The validity of this CI requires two additional conditions, A4* and A5. We present these two conditions and their discussions in Appendix A.
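The block structure underlying SAVE can be illustrated schematically as follows; `fit_policy` and `evaluate_block` are hypothetical stand-ins for an RL algorithm and for the block-level estimator of Section 3.1, and the chronological splitting reflects our reading of the construction sketched above.

```python
import numpy as np

# Schematic SAVE-style splitting: fit the policy only on past blocks, evaluate
# its value on the next block, then aggregate block-level estimates.
# fit_policy and evaluate_block are hypothetical placeholders.
rng = np.random.default_rng(2)
n, T, K = 50, 120, 4
block = T // K
data = rng.normal(size=(n, T))               # stand-in for the observed tuples

def fit_policy(past_data):                   # e.g., fitted Q-iteration
    return None

def evaluate_block(policy, block_data):      # (V_hat, sigma_hat), as in Section 3.1
    return 1.0 + rng.normal(0, 0.05), 1.0    # placeholder values

V_hats, sig_hats = [], []
for k in range(K - 1):
    pol = fit_policy(data[:, : (k + 1) * block])               # past blocks only
    V_k, s_k = evaluate_block(pol, data[:, (k + 1) * block : (k + 2) * block])
    V_hats.append(V_k); sig_hats.append(s_k)

# Inverse-standard-error weighting of the K - 1 block-level estimates.
w = 1.0 / np.array(sig_hats)
V_tilde = np.sum(w * np.array(V_hats)) / np.sum(w)
print(V_tilde)
```

Because each block-level estimate conditions on a policy fitted from strictly earlier data, the estimation errors retain a martingale structure even when the estimated policy does not stabilize, which is what drives Theorem 2 below.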
Theorem 2 (bidirectional asymptotics). Assume A1-A3, A4* and A5 hold. Suppose $K=O(1)$ and $L$ satisfies $L=o\{\sqrt{nT}/\log(nT)\}$ and $L^{2p/d}\gg nT\|\int_x\Phi_L(x)G(dx)\|_2^{-2}$. Suppose $T_{\min}=T$ if $T$ is bounded. Assume there exists some constant $c\ge 1$ such that $\omega_\pi(x,a)\ge c^{-1}$ for any $x$, $a$, $\pi$ and $\Pr(\max_{0\le t\le T-1}|Y_{0,t}|\le c)=1$. Then as either $n\to\infty$ or $T\to\infty$, we have
$$\sqrt{nT(K-1)/K}\,\widetilde{\sigma}^{-1}(G)\{\widetilde{V}(G)-V(\widehat{\pi};G)\}\stackrel{d}{\to}N(0,1),$$
$$\sqrt{nT(K-1)/K}\,\widetilde{\sigma}^{-1}(G)\{\widetilde{V}(G)-V(\pi^*;G)\}\stackrel{d}{\to}N(0,1).$$
We provide a sketch of the proof of Theorem 2 in Appendix B.2.

For any subset $\mathcal{I}$ of the observation indices $\{(i,t):1\le i\le n, 0\le t<T\}$, we use $\widehat{\pi}_{\mathcal{I}}$ to denote an estimated optimal policy based on the observations in $\mathcal{I}$. Let $\widehat{Q}_{\mathcal{I}}(\cdot,\cdot)$ denote some consistent estimator of $Q^{opt}(\cdot,\cdot)$ and $\widehat{\pi}_{\mathcal{I}}$ the greedy policy with respect to $\widehat{Q}_{\mathcal{I}}(\cdot,\cdot)$ (see Equation (3.14)).

In the following, we focus on relating $|V(\pi^{opt};G)-V(\widehat{\pi}_{\mathcal{I}};G)|$ to the prediction loss of $\widehat{Q}_{\mathcal{I}}-Q^{opt}$. By definition, $V(\pi^{opt};x)\ge V(\widehat{\pi}_{\mathcal{I}};x)$ for all $x\in\mathcal{X}$. Hence, $V(\pi^{opt};G)\ge V(\widehat{\pi}_{\mathcal{I}};G)$. It suffices to provide an upper bound for $V(\pi^{opt};G)-V(\widehat{\pi}_{\mathcal{I}};G)$. We introduce a margin-type condition A6 in Appendix A. The following theorems summarize our results.

Theorem 3. Assume A1 and A6 hold. Suppose the following event occurs with probability at least $1-O(|\mathcal{I}|^{-\kappa})$ for any finite $\kappa>0$:
$$\sup_{x\in\mathcal{X},a\in\mathcal{A}}|\widehat{Q}_{\mathcal{I}}(x,a)-Q^{opt}(x,a)|=O(|\mathcal{I}|^{-b^*}),$$
for some $b^*>0$. Then $E|V(\pi^{opt};G)-V(\widehat{\pi}_{\mathcal{I}};G)|=O(|\mathcal{I}|^{-b^*(1+\alpha)})$.

In Theorem 3, we require the estimated Q-function to satisfy a certain uniform convergence rate. In Theorem 4 below, we relax this condition by assuming that the integrated loss converges to zero at a certain rate.

Theorem 4. Assume A1 and A6 hold. Suppose
$$\Bigg(E\int_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}}|\widehat{Q}_{\mathcal{I}}(x,a)-Q^{opt}(x,a)|^2 dx\Bigg)^{1/2}=O(|\mathcal{I}|^{-b^*}),$$
for some $b^*>0$. Then $E|V(\pi^{opt};G)-V(\widehat{\pi}_{\mathcal{I}};G)|=O(|\mathcal{I}|^{-b^*(2+2\alpha)/(2+\alpha)})$.

It can be seen from Theorems 3 and 4 that the integrated value converges faster than the Q-function. We provide a sketch of the proofs of both theorems in Appendix B.3.

We next provide several examples to illustrate the convergence rate of $\widehat{Q}_{\mathcal{I}}$. The proposed methods can be applied to evaluate the values under these estimated policies.

Example 1 (Greedy gradient Q-learning). The optimal Q-function satisfies
$$Q^{opt}(x,a)=r(x,a)+\gamma\int_{x'}\max_{a'\in\mathcal{A}}Q^{opt}(x',a')\mathcal{P}(dx'\,|\,x,a),$$
for any $a$ and $x$, and hence
$$E\Big[\Big\{Y_{i,t}+\gamma\max_{a'\in\mathcal{A}}Q^{opt}(X_{i,t+1},a')-Q^{opt}(X_{i,t},A_{i,t})\Big\}\,\Big|\,X_{i,t},A_{i,t}\Big]=0.$$
Suppose we model $Q^{opt}(x,a)$ by the linear sieve $\Phi_L^T(x)\theta_a$. Then we can compute $\{\widehat{\theta}_{a,\mathcal{I}}\}_{a\in\mathcal{A}}$ by minimizing the following projected Bellman error:
$$\arg\min_{\{\theta_a\}_{a\in\mathcal{A}}}\Bigg\{\sum_{(i,t)\in\mathcal{I}}\delta_{i,t}(\{\theta_a\}_{a\in\mathcal{A}})\xi_{i,t}\Bigg\}^T\Bigg\{\sum_{(i,t)\in\mathcal{I}}\xi_{i,t}\xi_{i,t}^T\Bigg\}^{-1}\Bigg\{\sum_{(i,t)\in\mathcal{I}}\delta_{i,t}(\{\theta_a\}_{a\in\mathcal{A}})\xi_{i,t}\Bigg\},$$
where $\delta_{i,t}(\{\theta_a\}_{a\in\mathcal{A}})=Y_{i,t}+\gamma\max_{a'\in\mathcal{A}}\Phi_L^T(X_{i,t+1})\theta_{a'}-\Phi_L^T(X_{i,t})\theta_{A_{i,t}}$. The above loss is non-smooth and non-convex as a function of $\{\theta_a\}_{a\in\mathcal{A}}$. The estimator $\{\widehat{\theta}_{a,\mathcal{I}}\}_{a\in\mathcal{A}}$ can be computed by the greedy gradient Q-learning algorithm. Assuming the optimal Q-function is correctly specified, Ertefaie and Strawderman (2018) established the consistency and asymptotic normality of the parameter estimates under the scenario where both $L$ and $T$ are fixed. Set $\widehat{Q}_{\mathcal{I}}(x,a)=\Phi_L^T(x)\widehat{\theta}_{a,\mathcal{I}}$. Using similar arguments as in the proof of Theorem 1, we can show that with a proper choice of $L$, $\sup_{x\in\mathcal{X},a\in\mathcal{A}}|\widehat{Q}_{\mathcal{I}}(x,a)-Q^{opt}(x,a)|$ converges at a rate of $O(|\mathcal{I}|^{-p/(2p+d)})$ up to some logarithmic factors, with probability at least $1-O(n^{-1}T^{-1})$. The condition in Theorem 3 thus holds for any $b^*<p/(2p+d)$.
Example 2 (Fitted Q-iteration). In fitted Q-iteration (FQI), the optimal Q-function is approximated by some nonparametric model $Q(\cdot,\cdot;\theta)$ indexed by $\theta$. The parameter $\theta$ is iteratively updated by
$$\widehat{\theta}_{k+1}=\arg\min_{\theta}\sum_{(i,t)\in\mathcal{I}_k}\Big\{Y_{i,t}+\gamma\max_a Q(X_{i,t+1},a;\widehat{\theta}_k)-Q(X_{i,t},A_{i,t};\theta)\Big\}^2,$$
for $k=0,1,2,\ldots,K-1$, where the $\mathcal{I}_k$'s are some subsets of $\mathcal{I}$. When $\mathcal{I}_1=\cdots=\mathcal{I}_K=\mathcal{I}$ and $Q(\cdot,\cdot;\theta)$ is the family of neural networks, this algorithm is the neural FQI proposed by Riedmiller (2005). Yang et al. (2019) studied a variant of neural FQI by assuming the $\mathcal{I}_k$'s are disjoint and the training samples in $\cup_{k=1}^K\mathcal{I}_k$ are independent. Using similar arguments as in the proof of Theorem 4.4 in Yang et al. (2019), we can show that $E\int_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}}|\widehat{Q}_{\mathcal{I}}(x,a)-Q^{opt}(x,a)|^2 dx$ converges at a rate of $O(|\mathcal{I}|^{-(2p)/(2p+d)})$ up to some logarithmic factors. The conditions in Theorem 4 thus hold for any $b^*<p/(2p+d)$.
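A minimal sketch of FQI with a linear (rather than neural) model for $Q^{opt}$ is given below; each iteration reduces to one least-squares fit per action. The toy environment, the polynomial features and all constants are placeholders of our own.

```python
import numpy as np

# Minimal fitted Q-iteration (Example 2) with a linear Q-model; toy data.
rng = np.random.default_rng(3)
n, T, gamma, n_iter = 50, 60, 0.5, 30

# Toy data generated under a uniform behavior policy (binary actions).
X = np.zeros((n, T + 1)); A = rng.integers(0, 2, size=(n, T)); Y = np.zeros((n, T))
for t in range(T):
    X[:, t + 1] = 0.5 * (2 * A[:, t] - 1) * X[:, t] + rng.normal(0, 0.5, n)
    Y[:, t] = X[:, t + 1] - 0.25

def features(x):                       # Phi(x): simple polynomial features
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

L = 3
theta = np.zeros((2, L))               # theta[a]: coefficients of Q(., a)
Phi, Phi_next = features(X[:, :T]), features(X[:, 1:])
for _ in range(n_iter):
    # Regression target: Y + gamma * max_a Phi(X')^T theta_a  (theta frozen).
    target = Y + gamma * np.maximum(Phi_next @ theta[0], Phi_next @ theta[1])
    for a in (0, 1):                   # least squares, separately per action
        mask = (A == a)
        theta[a] = np.linalg.lstsq(Phi[mask], target[mask], rcond=None)[0]

Q_hat = lambda x, a: features(np.asarray(x)) @ theta[a]
pi_hat = lambda x: int(Q_hat(x, 1) > Q_hat(x, 0))   # greedy estimated policy
print(pi_hat(0.3), pi_hat(-0.3))
```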
4. Extensions to on-policy evaluation

We now extend our methodology in Section 3 to on-policy settings. The proposed CI is similar to that presented in Section 3.2.2 and applies to any reinforcement learning algorithm that iteratively updates the estimated policy based on batches of observations. Let $\{T(k)\}_{k\ge 1}$ be a monotonically increasing sequence that diverges to infinity. At the $k$-th iteration, define $\bar{\mathcal{I}}_k=\{(i,t): 1\le i\le n, 0\le t<\sum_{j=1}^k T(j)\}$. The data observed so far can be summarized as $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{1\le i\le n, 0\le t<\sum_{j=1}^k T(j)}$. We compute the estimated policy $\widehat{\pi}_{\bar{\mathcal{I}}_k}$ based on these data. Then we determine the behavior policy $\widehat{b}_{\bar{\mathcal{I}}_k}$ as a function of $\widehat{\pi}_{\bar{\mathcal{I}}_k}$ and generate new observations
$$\{(A_{i,t},Y_{i,t},X_{i,t+1})\}_{1\le i\le n,\ \sum_{j=1}^k T(j)\le t<\sum_{j=1}^{k+1}T(j)}, \quad (4.18)$$
according to $\widehat{b}_{\bar{\mathcal{I}}_k}$. To balance the exploration-exploitation trade-off, a common choice of $\widehat{b}_{\bar{\mathcal{I}}_k}$ is the $\epsilon$-greedy policy with respect to $\widehat{\pi}_{\bar{\mathcal{I}}_k}$.

Let $\mathcal{I}_{k+1}=\{(i,t): 1\le i\le n,\ \sum_{j=1}^k T(j)\le t<\sum_{j=1}^{k+1}T(j)\}$. The new observations in (4.18) are conditionally independent of $\widehat{\pi}_{\bar{\mathcal{I}}_k}$ given those in $\bar{\mathcal{I}}_k$. So the Bellman equation in (3.7) is valid with $\pi=\widehat{\pi}_{\bar{\mathcal{I}}_k}$ for any $(i,t)\in\bar{\mathcal{I}}_{k+1}$. We compute $\widehat{V}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)$ and $\widehat{\sigma}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)$ as in Section C of the supplementary article, where the number of basis functions $L(k+1)$ depends on both $n$ and $T(k+1)$. We iterate this procedure for $k=1,2,\ldots,K-1$. The estimated value and CI for $V(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)$ are given by
$$\widetilde{V}(G)=\Bigg\{\sum_{k=1}^{K-1}\frac{\sqrt{T(k+1)}}{\widehat{\sigma}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)}\Bigg\}^{-1}\Bigg\{\sum_{k=1}^{K-1}\frac{\sqrt{T(k+1)}\,\widehat{V}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)}{\widehat{\sigma}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)}\Bigg\},$$
and
$$\Bigg[\widetilde{V}(G)-z_{\alpha/2}\Bigg\{\sum_{k=2}^K\sqrt{\frac{nT(k)}{K-1}}\Bigg\}^{-1}\widetilde{\sigma}(G),\ \widetilde{V}(G)+z_{\alpha/2}\Bigg\{\sum_{k=2}^K\sqrt{\frac{nT(k)}{K-1}}\Bigg\}^{-1}\widetilde{\sigma}(G)\Bigg],$$
where $\widetilde{\sigma}(G)=\{\sum_{k=2}^K\sqrt{T(k)}\}\{\sum_{k=2}^K\sqrt{T(k)}\,\widehat{\sigma}^{-1}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)\}^{-1}$. Similar to Theorem 2, we can show such a CI achieves nominal coverage under certain conditions. To save space, we provide our technical results in Section D of the supplementary article.
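To illustrate how a new batch in (4.18) might be generated under the $\epsilon$-greedy behavior policy, here is a small sketch; the environment `step` and the current policy estimate `pi_hat` are hypothetical placeholders.

```python
import numpy as np

# Sketch of one on-policy iteration: given the current estimated policy pi_hat,
# generate the next batch of T_next transitions per trajectory under the
# epsilon-greedy behavior policy (1 - eps) * pi_hat + eps * uniform.
rng = np.random.default_rng(5)
n, T_next, eps, m = 50, 40, 0.1, 2

def pi_hat(x):                       # placeholder estimated (deterministic) policy
    return int(x > 0)

def step(x, a):                      # placeholder environment transition/reward
    x_next = 0.5 * (2 * a - 1) * x + rng.normal(0, 0.5)
    return x_next, x_next - 0.25

batch = []
x = rng.normal(size=n)               # states carried over from the last block
for i in range(n):
    for t in range(T_next):
        # epsilon-greedy: explore uniformly w.p. eps, else follow pi_hat.
        a = rng.integers(m) if rng.random() < eps else pi_hat(x[i])
        x_next, y = step(x[i], a)
        batch.append((x[i], a, y, x_next))
        x[i] = x_next
# `batch` plays the role of the new observations in (4.18); the value of pi_hat
# on this batch is then estimated as in Section 3.1 and aggregated as above.
```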
In the last scenario, G has a continuous densityfunction and we use Monte Carlo methods to compute the integrals in (3.10) and (3.11).For each scenario, we further consider 9 cases by setting n = 25 , , 100 and T = 30 , , V ( π ; G ) with G equal to the initial distribution of X , . So we compare it withour method in Scenario (C) only. In addition, DR requires the calculation of the Q-function Q ( π ; · , · ) and the behavior policy b ( ·|· ). Here, we treat b ( ·|· ) as known and estimate Q ( π ; · , · )based on tensor product B-spline basis functions. To implement DR, we randomly split alltrajectories into two halves of equal sizes, use the first half to estimate Q ( π ; · , · ) and use nference of the Value for Reinforcement Learning Fig. 1. Empirical coverage probabilities and average lengths of CIs constructed by the proposedmethod under Scenarios (A) and (B), with different choices of n and T . . . . (A) n*T E C P n=25n=50n=100 . . . . (B) n*T E C P n=25n=50n=100 . . . . . . . (A) n*T A L n=25n=50n=100 . . . . . . . (B) n*T A L n=25n=50n=100 Fig. 2. Empirical coverage probabilities and average lengths of CIs constructed by the proposedmethod (denoted by the square symbol) and the DR method (denoted by the snow symbol) underScenario (C), with different choices of n and T . . . . (C) n*T E C P n=25n=50n=100 n=25n=50n=100 . . . . (C) n*T A L n=25n=50n=100 n=25n=50n=100 C. Shi, S. Zhang, W. Lu and R. Song the remaining second half to evaluate V ( π ; G ) based on the estimated Q-function. Then weswap the two sub-datasets that are split apart to compute another estimator for V ( π ; G ).The final estimator is defined as an average of the two value estimators. Its variance isestimated based on the sampling variance estimator.In Figure 1, we plot the empirical coverage probabilities (ECPs) and average lengths(ALs) of CIs constructed by the proposed method, with different choices of n and T . It canbe seen that our CI achieves nominal coverage in all cases. In addition, the length of our CIdecreases as nT increases. This is consistent with our theoretical findings where we showthe proposed value estimator converges at a rate of n − / T − / under certain conditions(see the discussions below Theorem 1).In Figure 2, we plot ECPs and ALs of CIs constructed by the proposed method andthe DR method. DR performs poorly in this setting. On one hand, ECPs of DR arebelow 85% in all cases. On the other hand, our CIs are much narrower compared to thoseconstructed by DR. As commented in the introduction, this is because DR requires a largenumber of trajectories in order to achieve nominal coverage. Because they are constructedby AIPW-type estimators, their ALs decay at a rate of n − / instead of n − / T − / . In this section, we focus on constructing CIs for values under an optimal policy. Specifically,we use a version of fitted Q -iteration (double FQI) to compute the estimated optimal policy.Detailed algorithm can be found in Section H of the supplementary article. To implementthe proposed CI in Section 3.2, we set K n = 2 and K T = 2. To evaluate our CI, we generatea very large sample to compute an estimated optimal policy (cid:98) π ∗ based on double FQI and usethe Monte Carlo methods described in Section 5.1 to evaluate its value V ( (cid:98) π ∗ ; G ). Then wetreat V ( (cid:98) π ∗ ; G ) as the true optimal value V ( π ∗ ; G ). We consider the same three choices of thereference function G as in Section 5.1. 
In this section, we focus on constructing CIs for values under an optimal policy. Specifically, we use a version of fitted Q-iteration (double FQI) to compute the estimated optimal policy. The detailed algorithm can be found in Section H of the supplementary article. To implement the proposed CI in Section 3.2, we set $K_n=2$ and $K_T=2$. To evaluate our CI, we generate a very large sample to compute an estimated optimal policy $\widehat{\pi}^*$ based on double FQI and use the Monte Carlo methods described in Section 5.1 to evaluate its value $V(\widehat{\pi}^*;G)$. Then we treat $V(\widehat{\pi}^*;G)$ as the true optimal value $V(\pi^*;G)$. We consider the same three choices of the reference function $G$ as in Section 5.1. We further consider 6 cases by setting $n=100,200$ and three choices of $T$ (the smallest being $T=60$). ECPs and ALs of the proposed CIs are plotted in Figure 3. Our CIs achieve nominal coverage in this regular setting, and their lengths decrease as $n$ or $T$ increases.

Fig. 3. Empirical coverage probabilities and average lengths of CIs constructed by the proposed method under Scenarios (A), (B) and (C), plotted against $nT$ for $n=100,200$.

Fig. 4. Empirical coverage probabilities and average lengths of CIs constructed by the proposed method in the nonregular setting under Scenarios (A), (B) and (C), plotted against $nT$ for $n=100,200$.

In addition, we design a non-regular setting where the actions have no effect on the transition dynamics or the immediate rewards. Specifically, for any $t\ge 0$, we set
$$X_{0,t+1}=-\frac{1}{2}X_{0,t}+z_t, \quad\text{and}\quad Y_{0,t}=2X^{(1)}_{0,t+1}+X^{(2)}_{0,t+1},$$
where $\{z_t\}_{t\ge 0}\stackrel{iid}{\sim}N(0,I_2/4)$ and $\{A_{0,t}\}_{t\ge 0}$ are i.i.d. Bernoulli random variables with expectation $0.5$. Under this setup, any policy achieves the same value function. As a result, the optimal policy is not unique. We consider the same combinations of $G$, $n$ and $T$ as in the regular setting. ECPs and ALs of the proposed CIs are plotted in Figure 4. It can be seen that our CIs achieve nominal coverage in the non-regular setting as well.

Table 1. Empirical coverage probabilities and average lengths of CIs constructed in on-policy settings with different choices of $T$ and $G$

                                 ECPs                        ALs
                        T = 120    200    280      T = 120    200    280
G = δ_(0.5, 0.5)
G = δ_(−0.5, −0.5)
G = N(0, I_2)             0.914  0.926  0.948        0.29   0.21   0.17

We consider a setting where the transition dynamics and immediate rewards are defined by (5.19) and (5.20), respectively. Actions in the first block of data $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{0\le t<T(1),1\le i\le n}$ are generated by an initial behavior policy; the estimated policy and the behavior policy are then updated as new batches of observations arrive, following the procedure described in Section 4. ECPs and ALs of the resulting CIs, for $T=120,200,280$ and the three choices of $G$, are reported in Table 1. The proposed CIs achieve coverage close to the nominal level, and their lengths decrease as $T$ increases.

6. Application to the OhioT1DM dataset

As commented in the introduction, this dataset contains eight weeks' records of CGM blood glucose levels, insulin doses and self-reported life-event data for each of six patients with type 1 diabetes. To analyze these data, we divide the eight weeks into three-hour intervals. The state variable $X_{i,t}$ is set to be a three-dimensional vector. Specifically, its first element $X^{(1)}_{i,t}$ is the average CGM blood glucose level during the three-hour interval $[t-1,t)$. The second covariate $X^{(2)}_{i,t}$ is constructed based on the $i$-th patient's self-reported time and carbohydrate estimate for each meal. Suppose the patient has meals at times $t_1,t_2,\ldots,t_N\in[t-1,t)$ with the carbohydrate estimates $\mathrm{CE}_1,\mathrm{CE}_2,\ldots,\mathrm{CE}_N$. Define
$$X^{(2)}_{i,t}=\sum_{j=1}^N\mathrm{CE}_j\,\gamma_c^{t-t_j},$$
where $\gamma_c$ corresponds to the decay rate every five minutes, with the exponent $t-t_j$ measured in five-minute units. Here, we set $\gamma_c=0.5$.
The third covariate $X^{(3)}_{i,t}$ is defined as the average of the basal rate during the three-hour interval.

We discretize the action according to the amount of insulin injected in the three-hour interval. Specifically, $A_{i,t}=1$ when the total amount of insulin delivered to the $i$-th patient is greater than one unit. Otherwise, we set $A_{i,t}=0$. The immediate reward $Y_{i,t}$ is defined according to the Index of Glycemic Control (IGC; Rodbard, 2009), which is a non-linear function of the blood glucose levels. Specifically, we set
$$Y_{i,t}=\begin{cases}-(80-X^{(1)}_{i,t+1})^2, & X^{(1)}_{i,t+1}<80,\\ 0, & 80\le X^{(1)}_{i,t+1}<140,\\ -(X^{(1)}_{i,t+1}-140)^{1.35}, & 140\le X^{(1)}_{i,t+1}.\end{cases}$$
A large IGC indicates that the patient is in good health status. We set the discount factor $\gamma=0.5$, as in the simulations.

For the $i$-th patient, we apply the double FQI algorithm to the data $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{0\le t<T_i}$ to compute an estimated optimal policy and construct the proposed CI for its value. Figure 5 depicts the value estimates, the associated CIs and the observed discounted cumulative rewards for each of the six patients, when the initial starting time is either 8:00 am or 2:00 pm in Day 1. It can be seen that the observed discounted cumulative reward is smaller than the upper bound of our CI in all cases. In some cases, it is well below the lower bound of our CI. This suggests that applying reinforcement learning algorithms could potentially improve some patients' health status.

Fig. 5. Value estimates under an estimated optimal policy (denoted by the circle symbol) as well as the associated CI and the observed discounted cumulative reward for each of the six patients (denoted by the snow symbol).
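A small sketch of the IGC-type reward above; the thresholds 80 and 140 follow the text, while the squared hypoglycemic penalty and the exponent 1.35 on the hyperglycemic side are common IGC conventions that we adopt here as assumptions.

```python
# Sketch of the IGC-based reward; exponents follow common IGC conventions and
# should be treated as illustrative.
def igc_reward(glucose: float) -> float:
    if glucose < 80.0:                     # hypoglycemic penalty
        return -((80.0 - glucose) ** 2)
    if glucose < 140.0:                    # target range: no penalty
        return 0.0
    return -((glucose - 140.0) ** 1.35)    # hyperglycemic penalty

print([round(igc_reward(g), 1) for g in (60.0, 100.0, 200.0)])
```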
References

Audibert, J.-Y. and Tsybakov, A. B. (2007) Fast learning rates for plug-in classifiers. Ann. Statist., 35, 608–633.

Bradley, R. C. (2005) Basic properties of strong mixing conditions. A survey and some open questions. Probab. Surv., 2, 107–144. Update of, and a supplement to, the 1986 original.

Burman, P. and Chen, K.-W. (1989) Nonparametric estimation of a regression function. Ann. Statist., 17, 1567–1596.

Chen, X. and Christensen, T. M. (2015) Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions. J. Econometrics, 188, 447–465.

Davydov, Y. A. (1973) Mixing conditions for Markov chains. Teoriya Veroyatnostei i ee Primeneniya, 18, 321–338.

Ernst, D., Geurts, P. and Wehenkel, L. (2005) Tree-based batch mode reinforcement learning. J. Mach. Learn. Res., 6, 503–556.

Ertefaie, A. and Strawderman, R. L. (2018) Constructing dynamic treatment regimes over indefinite time horizons. Biometrika, 105, 963–977.

Hasselt, H. V. (2010) Double Q-learning. In Advances in Neural Information Processing Systems, 2613–2621.

Huang, J. Z. (1998) Projection estimation in multiple regression with application to functional ANOVA models. Ann. Statist., 26, 242–272.

Jiang, N. and Li, L. (2016) Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, 652–661.

Jin, J., Song, C., Li, H., Gai, K., Wang, J. and Zhang, W. (2018) Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2193–2201. ACM.

Kormushev, P., Calinon, S. and Caldwell, D. (2013) Reinforcement learning in robotics: Applications and real-world challenges. Robotics, 2, 122–148.

Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E. and Kosorok, M. R. (2019) Estimating dynamic treatment regimes in mobile health using V-learning. J. Amer. Statist. Assoc.

Luedtke, A. R. and van der Laan, M. J. (2016) Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Statist., 44, 713–742.

Maei, H. R., Szepesvári, C., Bhatnagar, S. and Sutton, R. S. (2010) Toward off-policy learning control with function approximation. In ICML, 719–726.

Marling, C. and Bunescu, R. C. (2018) The OhioT1DM dataset for blood glucose level prediction. In KHD@IJCAI, 60–63.

McLeish, D. L. (1974) Dependent central limit theorems and invariance principles. Ann. Probability, 2, 620–628.

Meyer, Y. (1992) Wavelets and Operators, vol. 37 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge. Translated from the 1990 French original by D. H. Salinger.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G. et al. (2015) Human-level control through deep reinforcement learning. Nature, 518, 529–533.

Murphy, S. A. (2003) Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol., 65, 331–366.

Puterman, M. L. (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons, Inc., New York.

Qian, M. and Murphy, S. A. (2011) Performance guarantees for individualized treatment rules. Ann. Statist., 39, 1180–1210.

Riedmiller, M. (2005) Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, 317–328. Springer.

Rodbard, D. (2009) Interpretation of continuous glucose monitoring data: glycemic variability and quality of glycemic control. Diabetes Technology & Therapeutics, 11, S55–S67.

Saikkonen, P. (2001) Stability results for nonlinear vector autoregressions with an application to a nonlinear error correction model. Tech. rep., Discussion Papers, Interdisciplinary Research Project 373: Quantification and Simulation of Economic Processes.

Schumaker, L. L. (1981) Spline Functions: Basic Theory. John Wiley & Sons, Inc., New York. Pure and Applied Mathematics, A Wiley-Interscience Publication.

Shi, C., Fan, A., Song, R. and Lu, W. (2018) High-dimensional A-learning for optimal dynamic treatment regimes. Ann. Statist., 46, 925–957.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489.

Sutton, R. S. and Barto, A. G. (2018) Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, second edn.

Thomas, P. S., Theocharous, G. and Ghavamzadeh, M. (2015) High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Tropp, J. A. (2011) Freedman's inequality for matrix martingales. Electron. Commun. Probab., 16, 262–270.

— (2012) User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12, 389–434.

Tsao, C. W. and Vasan, R. S. (2015) Cohort profile: The Framingham Heart Study (FHS): overview of milestones in cardiovascular epidemiology. International Journal of Epidemiology, 44, 1800–1813.

Tsybakov, A. B. (2004) Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32, 135–166.

Xu, Z., Li, Z., Guan, Q., Zhang, D., Li, Q., Nan, J., Liu, C., Bian, W. and Ye, J. (2018) Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 905–913. ACM.
Yang, Z., Xie, Y. and Wang, Z. (2019) A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137.

Zhang, B., Tsiatis, A. A., Laber, E. B. and Davidian, M. (2013) Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika, 100, 681–694.

Zhang, Y., Laber, E. B., Davidian, M. and Tsiatis, A. A. (2018) Estimation of optimal treatment regimes using lists. J. Amer. Statist. Assoc., 113, 1541–1549.

Zhao, Y.-Q., Zeng, D., Laber, E. B. and Kosorok, M. R. (2015) New statistical learning methods for estimating optimal dynamic treatment regimes. J. Amer. Statist. Assoc., 110, 583–598.

A. Some technical conditions

A.1. Conditions A2-A4

Define $\mathcal{P}_X^t$ as the $t$-step transition kernel, i.e., $\mathcal{P}_X^t(\mathcal{B}\,|\,x)=\Pr(X_{0,t}\in\mathcal{B}\,|\,X_{0,0}=x)$. Let $\mathrm{BSpl}(L,r)$ denote a tensor-product B-spline basis of degree $r$ and dimension $L$ on $[0,1]^d$, and let $\mathrm{Wav}(L,r)$ denote a tensor-product wavelet basis of regularity $r$ and dimension $L$ on $[0,1]^d$.

(A2.) The sieve $\Phi_L$ is either $\mathrm{BSpl}(L,r)$ or $\mathrm{Wav}(L,r)$ with $r>\max(p,1)$.

(A3.) $\mu$ and $\nu$ are uniformly bounded away from $0$ and $\infty$ on $\mathcal{X}$.

(A4.) Suppose (i) and (ii) hold when $T\to\infty$ and (iii) holds when $T$ is bounded.

(i) $\lambda_{\min}[\int_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}}\{\xi(x,a)\xi^T(x,a)-\gamma^2 u_\pi(x,a)u_\pi^T(x,a)\}b(a\,|\,x)\mu(x)dx]\ge\bar{c}$ for some constant $\bar{c}>0$, where $u_\pi(x,a)=E\{U_\pi(X_{0,1})\,|\,X_{0,0}=x,A_{0,0}=a\}$ and $\lambda_{\min}(K)$ denotes the minimum eigenvalue of a matrix $K$.

(ii) The Markov chain $\{X_{0,t}\}_{t\ge 0}$ is geometrically ergodic, i.e., there exist some function $M(\cdot)$ on $\mathcal{X}$ and some constant $\rho<1$ such that $\int_{x\in\mathcal{X}}M(x)\mu(x)dx<+\infty$ and $\|\mathcal{P}_X^t(\cdot\,|\,x)-\mu(\cdot)\|_{TV}\le M(x)\rho^t$ for all $t\ge 0$, where $\|\cdot\|_{TV}$ denotes the total variation norm.

(iii) $\lambda_{\min}[\sum_{t=0}^{T-1}E\{\xi_{0,t}\xi_{0,t}^T-\gamma^2 u_\pi(X_{0,t},A_{0,t})u_\pi^T(X_{0,t},A_{0,t})\}]\ge\bar{c}T$ for some constant $\bar{c}>0$.

Condition A4(i) guarantees that $E\widehat{\Sigma}_\pi$ is invertible when $T$ diverges to infinity. In Section F of the supplementary article, we show A4(i) is automatically satisfied when the target policy $\pi$ is deterministic and $b$ is the $\epsilon$-greedy policy with respect to $\pi$ that satisfies $\epsilon\le 1-\gamma$.

Suppose the Markov chain $\{X_{0,t}\}_{t\ge 0}$ has a finite state space and let $P_X$ denote its transition matrix. Assume $P_X$ is diagonalizable. Then A4(ii) holds when the second largest eigenvalue of $P_X$ is strictly smaller than 1. When the $X_{0,t}$'s are generated by the vector autoregressive process $E\{X_{0,t}\,|\,X_{0,t-1}\}=f(X_{0,t-1})$ for some function $f$, Saikkonen (2001) provided sufficient conditions that ensure the geometric ergodicity of the Markov chain.

When $\nu=\mu$, $\{X_{0,t}\}_{t\ge 0}$ is stationary. Under Condition A4(ii), it follows from Theorem 3.7 of Bradley (2005) that $\{X_{0,t}\}_{t\ge 0}$ is exponentially $\beta$-mixing (see the proof of Lemma 3 for details). When $T\to\infty$, A4(ii) enables us to derive matrix concentration inequalities for $\widehat{\Sigma}_\pi$. This together with A4(i) implies that $\widehat{\Sigma}_\pi$ is invertible, with probability approaching 1 (wpa1). When $\nu=\mu$, A4(iii) is reduced to A4(i). This condition guarantees that $\widehat{\Sigma}_\pi$ is invertible wpa1 under the scenario where $n$ diverges to infinity while $T$ remains bounded.

A.2. Conditions A4* and A5

We present the technical conditions (A4*) and (A5) below. We assume the estimated policy satisfies $\widehat{\pi}_{\mathcal{I}}\in\Pi$ with probability 1, for any $\mathcal{I}$.
For example, suppose Q-learning type algorithms are used and we approximate the optimal Q-function based on a linear model $\Phi^T(x,a)\beta$ with some basis function $\Phi(x,a)\in\mathbb{R}^M$. Then for any $\beta$, we can define a policy $\pi_\beta$ as follows:
$$\pi_\beta(a\,|\,x)=\begin{cases}1, & \text{if } a=\mathrm{sargmax}_{a'\in\mathcal{A}}\,\Phi^T(x,a')\beta,\\ 0, & \text{otherwise}.\end{cases}$$
Then we have $\Pi=\{\pi_\beta:\beta\in\mathbb{R}^M\}$.

(A4*.) Assume (i) and (ii) hold if $T\to\infty$ and (iii) holds if $T$ is bounded.

(i) $\inf_{\pi\in\Pi}\lambda_{\min}[\int_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}}\{\xi(x,a)\xi^T(x,a)-\gamma^2 u_\pi(x,a)u_\pi^T(x,a)\}b(a\,|\,x)\mu(x)dx]\ge\bar{c}$ for some constant $\bar{c}>0$.

(ii) The Markov chain $\{X_{0,t}\}_{t\ge 0}$ is geometrically ergodic, i.e., there exist some function $M(\cdot)$ on $\mathcal{X}$ and some constant $\rho<1$ such that $\int_{x\in\mathcal{X}}M(x)\mu(x)dx<+\infty$ and $\|\mathcal{P}_X^t(\cdot\,|\,x)-\mu(\cdot)\|_{TV}\le M(x)\rho^t$ for all $t\ge 0$.

(iii) $\inf_{\pi\in\Pi}\lambda_{\min}[\sum_{t=0}^{T-1}E\{\xi_{0,t}\xi_{0,t}^T-\gamma^2 u_\pi(X_{0,t},A_{0,t})u_\pi^T(X_{0,t},A_{0,t})\}]\ge\bar{c}T$ for some constant $\bar{c}>0$. When $\nu=\mu$, (iii) is reduced to (i).

(A5) For any $\mathcal{I}$ that takes the form of $\mathcal{I}=\{(i,t): 0\le t<t_i, 1\le i\le n\}$ for some $0\le t_1,t_2,\ldots,t_n\le T$, we have $E|V(\widehat{\pi}_{\mathcal{I}};G)-V(\pi^*;G)|=O(|\mathcal{I}|^{-b})$ for some $b>1/2$ such that $(nT)^{b-1/2}\gg\|\int_x\Phi_L(x)G(dx)\|_2^{-1}$, where the big-$O$ term is uniform in the training samples $\mathcal{I}$.

Setting $\mathcal{I}$ to be the entire dataset, it is immediate from Markov's inequality that A5 implies Condition (3.13). When the tensor-product B-splines are used, we have $\liminf_L\|\int_x\Phi_L(x)G(dx)\|_2>0$. Thus, it is equivalent to require $E|V(\widehat{\pi}_{\mathcal{I}};G)-V(\pi^*;G)|=O(|\mathcal{I}|^{-b})$ for some $b>1/2$.

A.3. Condition A6

(A6.) Assume there exist some constants $\alpha,\delta>0$ such that
$$\lambda\Big\{x\in\mathcal{X}: \max_a Q^{opt}(x,a)-\max_{a'\in\mathcal{A}-\arg\max_a Q^{opt}(x,a)}Q^{opt}(x,a')\le\varepsilon\Big\}=O(\varepsilon^{\alpha}), \quad (A.21)$$
$$G\Big\{x\in\mathcal{X}: \max_a Q^{opt}(x,a)-\max_{a'\in\mathcal{A}-\arg\max_a Q^{opt}(x,a)}Q^{opt}(x,a')\le\varepsilon\Big\}=O(\varepsilon^{\alpha}), \quad (A.22)$$
where $\lambda$ denotes the Lebesgue measure, the big-$O$ terms are uniform in $0<\varepsilon\le\delta$, and we set $\max_{a'\in\mathcal{A}-\arg\max_a Q^{opt}(x,a)}Q^{opt}(x,a')=-\infty$ if the set $\mathcal{A}-\arg\max_a Q^{opt}(x,a)=\emptyset$.

For each $x$, the quantity $\max_a Q^{opt}(x,a)-\max_{a'\in\mathcal{A}-\arg\max_a Q^{opt}(x,a)}Q^{opt}(x,a')$ measures the difference in value between $\pi^{opt}$ and the policy that assigns the best suboptimal treatment(s) at the first decision point and follows $\pi^{opt}$ subsequently. In point treatment studies, Qian and Murphy (2011) imposed a similar condition (see Equation (3.3) of Qian and Murphy, 2011) to derive sharp convergence rates for the value under an estimated optimal individualized treatment regime. Here, we generalize their condition to infinite-horizon settings. A6 is also closely related to the margin condition commonly used to bound the excess misclassification error (Tsybakov, 2004; Audibert and Tsybakov, 2007).

To better understand Condition A6, we consider a simple scenario where $\mathcal{A}=\{0,1\}$. Define $\tau(x)=Q^{opt}(x,1)-Q^{opt}(x,0)$. Then
$$\max_a Q^{opt}(x,a)-\max_{a'\in\mathcal{A}-\arg\max_a Q^{opt}(x,a)}Q^{opt}(x,a')=\begin{cases}|\tau(x)|, & \text{if }\tau(x)\ne 0,\\ +\infty, & \text{otherwise}.\end{cases}$$
As a result, (A.21) and (A.22) are equivalent to the following:
$$\lambda\{x\in\mathcal{X}: 0<|\tau(x)|\le\varepsilon\}=O(\varepsilon^{\alpha}), \quad (A.23)$$
$$G\{x\in\mathcal{X}: 0<|\tau(x)|\le\varepsilon\}=O(\varepsilon^{\alpha}). \quad (A.24)$$
Apparently, these two conditions hold when $\inf_{x\in\mathcal{X}}|\tau(x)|>0$. They are satisfied in many other cases as well. For example, let $d=1$.
Consider
$$\tau(x)=\begin{cases}x^{1/\alpha}, & \text{if } x>0,\\ 0, & \text{otherwise},\end{cases}$$
for some $\alpha>0$. Then, with some calculations, we can show
$$\lambda\{x\in\mathcal{X}: 0<|\tau(x)|\le\varepsilon\}\le\lambda\{x: 0<x<\varepsilon^{\alpha}\}=\varepsilon^{\alpha}.$$
This verifies (A.23). When $G$ has a bounded density function on $\mathcal{X}$, (A.24) is reduced to (A.23). If $G(\cdot)$ equals the Dirac measure $\delta_x(\cdot)$, then (A.24) automatically holds for any $\alpha>0$.

B. Proof sketches

B.1. A sketch of the proof of Theorem 1

We provide an outline of the proof in this section. The detailed proof can be found in Section E.2 of the supplementary article. We break the proof into three steps. In the first step, we show the estimator $\widehat{\beta}_\pi$ satisfies
$$\widehat{\beta}_\pi-\beta^*_\pi=\Sigma_\pi^{-1}\Bigg(\frac{1}{nT}\sum_{i=1}^n\sum_{t=0}^{T-1}\xi_{i,t}\varepsilon_{i,t}\Bigg)+O_p(L^{-p/d})+O_p\{L(nT)^{-1}\log(nT)\}, \quad (B.25)$$
where $\Sigma_\pi=E\widehat{\Sigma}_\pi$. The proof of (B.25) relies on some random matrix inequalities established in Lemma 3 of the supplementary article.

In the second step, we show the linear representation in (3.12) holds. The proof of (3.12) relies on the convergence rate of $\widehat{\beta}_\pi$ established in the first step and some additional random matrix inequalities in Lemma 4.

In the last step, we show the leading term on the RHS of (3.12) is asymptotically normal, based on the martingale central limit theorem. This completes the proof of Theorem 1.

B.2. A sketch of the proof of Theorem 2

Similar to (3.15), for each $1\le k\le K-1$, we have
$$\frac{\widehat{V}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)-V(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)}{(nT/K)^{-1/2}\widehat{\sigma}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)}=\frac{(nT/K)^{-1/2}}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)}\sum_{(i,t)\in\mathcal{I}_{k+1}}\Big\{\int_{x\in\mathcal{X}}U_{\widehat{\pi}_{\bar{\mathcal{I}}_k}}(x)G(dx)\Big\}^T\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_k}}^{-1}\xi_{i,t}\varepsilon_{i,t}(\widehat{\pi}_{\bar{\mathcal{I}}_k})+R_k^{(1)}, \quad (B.26)$$
where $R_k^{(1)}$ denotes the remainder term and
$$\varepsilon_{i,t}(\widehat{\pi}_{\bar{\mathcal{I}}_k})=Y_{i,t}+\gamma\sum_{a\in\mathcal{A}}Q(\widehat{\pi}_{\bar{\mathcal{I}}_k};X_{i,t+1},a)\widehat{\pi}_{\bar{\mathcal{I}}_k}(a\,|\,X_{i,t+1})-Q(\widehat{\pi}_{\bar{\mathcal{I}}_k};X_{i,t},A_{i,t}).$$
Since (3.16) is satisfied, we have $E\{\varepsilon_{i,t}(\widehat{\pi}_{\bar{\mathcal{I}}_k})\,|\,\mathcal{O}_k\}=0$. Conditional on the data in $\mathcal{O}_k$, $\widehat{\pi}_{\bar{\mathcal{I}}_k}$ is a deterministic rule. The RHS of (B.26) is thus equivalent to a mean-zero martingale.

When (A5) is satisfied, we have
$$\frac{\widehat{V}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)-V(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)}{(nT/K)^{-1/2}\widehat{\sigma}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)}=\frac{\widehat{V}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)-V(\pi^*;G)}{(nT/K)^{-1/2}\widehat{\sigma}_{\mathcal{I}_{k+1}}(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)}+R_k^{(2)}, \quad (B.27)$$
for some remainder term $R_k^{(2)}$. Suppose $R_k^{(1)}$ and $R_k^{(2)}$ satisfy certain convergence rates. Combining (B.27) with (B.26) yields
$$\frac{\widetilde{V}(G)-V(\pi^*;G)}{\{nT(K-1)/K\}^{-1/2}\widetilde{\sigma}(G)}=\sqrt{\frac{K}{nT(K-1)}}\sum_{k=1}^{K-1}\sum_{(i,t)\in\mathcal{I}_{k+1}}\frac{1}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_k};G)}\Big\{\int_{x\in\mathcal{X}}U_{\widehat{\pi}_{\bar{\mathcal{I}}_k}}(x)G(dx)\Big\}^T\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_k}}^{-1}\xi_{i,t}\varepsilon_{i,t}(\widehat{\pi}_{\bar{\mathcal{I}}_k})+o_p(1), \quad (B.28)$$
due to our use of the inverse weighting trick. Theorem 2 thus follows from the martingale central limit theorem.

B.3. A sketch of the proofs of Theorems 3 and 4

The proofs of Theorems 3 and 4 are divided into two steps. In the first step, we decompose the value difference $V(\pi^{opt};G)-V(\widehat{\pi}_{\mathcal{I}};G)$ into the sum of an infinite series and provide upper bounds for all the terms in the series. In the second step, we use the margin-type condition A6 to further characterize these upper bounds.
To save space, we only present the first stepin this section.For j = 0 , , , . . . , define a time-dependent policy (cid:98) π ( j ) I that executes (cid:98) π I at the first j time points and then follows π opt . By definition, we have π opt = (cid:98) π (0) I and (cid:98) π I = (cid:98) π ( ∞ ) I . Noticethat V ( π opt ; G ) − V ( (cid:98) π (1) I ; G ) = (cid:90) x (cid:88) a ∈A Q opt ( x, a ) { π opt ( a | x ) − (cid:98) π I ( a | x ) } G ( dx ) . (B.29)Moreover, for any j ≥ V ( (cid:98) π ( j ) I ; G ) − V ( (cid:98) π ( j +1) I ; G ) = (cid:90) x (cid:88) t ≥ γ t { E (cid:98) π ( j ) I ( Y ,t | X , = x ) − E (cid:98) π ( j +1) I ( Y ,t | X , = x ) } G ( dx )= (cid:90) x (cid:88) t ≥ j γ t { E (cid:98) π ( j ) I ( Y ,t | X , = x ) − E (cid:98) π ( j +1) I ( Y ,t | X , = x ) } G ( dx ) . Let q ( j ) X ( ·| x ) be the density function of X ,j conditional on X , = x , following the estimated C. Shi, S. Zhang, W. Lu and R. Song policy (cid:98) π I at the first j time points, we have (cid:88) t ≥ j γ t E (cid:98) π ( j ) I ( Y ,t | X ,j ) = γ j (cid:88) a ∈A Q opt ( X ,j , a ) π opt ( a | X ,j ) , (cid:88) t ≥ j γ t E (cid:98) π ( j +1) I ( Y ,t | X ,j ) = γ j (cid:88) a ∈A Q opt ( X ,j , a ) (cid:98) π I ( a | X ,j ) . It follows that V ( (cid:98) π ( j ) I ; G ) − V ( (cid:98) π ( j +1) I ; G ) = γ j (cid:90) x,x (cid:48) ∈ X (cid:88) a ∈A Q opt ( x (cid:48) , a ) { π opt ( a | x (cid:48) ) − (cid:98) π I ( a | x (cid:48) ) } q ( j ) X ( x (cid:48) | x ) dx (cid:48) G ( dx ) . By A1, we have sup x,x (cid:48) ,a q ( x (cid:48) | x, a ) ≤ c . Under the Markovian assumption, q ( j ) X ( x (cid:48) | x ) = (cid:90) y (cid:88) a ∈A q ( x (cid:48) | y, a ) (cid:98) π I ( a | y ) q ( j − X ( y | x ) dy ≤ c (cid:90) y q ( j − X ( y | x ) dy = c. In addition, (cid:80) a ∈A Q opt ( x (cid:48) , a ) { π opt ( a | x (cid:48) ) − (cid:98) π I ( a | x (cid:48) ) } ≥ x (cid:48) , by the definition of π opt .Therefore, we obtain V ( (cid:98) π ( j ) I ; G ) − V ( (cid:98) π ( j +1) I ; G ) ≤ cγ j (cid:90) x (cid:88) a ∈A Q opt ( x, a ) { π opt ( a | x ) − (cid:98) π I ( a | x ) } dx. (B.30)By A1, the reward function r ( · , · ) is uniformly bounded. This further implies that Q opt ( · , · )is uniformly bounded. Therefore, (cid:80) j ≥ t V ( (cid:98) π ( j ) I ; G ) − V ( (cid:98) π ( j +1) I ; G ) → t → ∞ . It followsfrom (B.29) and (B.30) that V ( π opt ; G ) − V ( (cid:98) π I ; G ) = + ∞ (cid:88) j =0 { V ( (cid:98) π ( j ) I ; G ) − V ( (cid:98) π ( j +1) I ; G ) }≤ (cid:90) x (cid:88) a ∈A Q opt ( x, a ) { π opt ( a | x ) − (cid:98) π I ( a | x ) } G ( dx ) (B.31)+ cγ − γ (cid:90) x (cid:88) a ∈A Q opt ( x, a ) { π opt ( a | x ) − (cid:98) π I ( a | x ) } dx. (B.32)Let λ denote the Lebesgue measure on R d . In Sections E.8 and E.9, we use A6 to furtherbound (B.31) and (B.32). C. More on the CI in (3.17)We begin by providing more details on the estimators (cid:98) V I k +1 ( (cid:98) π ¯ I k ; G ) and its standard error (cid:98) σ I k +1 ( (cid:98) π ¯ I k ; G ). 
In general, for a given I ⊆ I and any policy π , we define (cid:98) V I ( π ; G ) and nference of the Value for Reinforcement Learning (cid:98) σ I ( π ; G ) as (cid:98) V I ( π ; G ) = (cid:26)(cid:90) x ∈ X U π ( x ) G ( dx ) (cid:27) T (cid:98) β I ,π , (cid:98) σ I ( π ; G ) = (cid:26)(cid:90) x ∈ X U π ( x ) G ( dx ) (cid:27) T (cid:98) Σ − I ,π (cid:98) Ω I ,π ( (cid:98) Σ T I ,π ) − (cid:26)(cid:90) x ∈ X U π ( x ) G ( dx ) (cid:27) , where (cid:98) β I ,π = ( (cid:98) β T I ,π, , · · · , (cid:98) β T I ,π,m ) T = 1 |I| (cid:88) ( i,t ) ∈I (cid:98) Σ − I ,π ξ i,t Y i,t , (cid:98) Σ I ,π = 1 |I| (cid:88) ( i,t ) ∈I ξ i,t ( ξ i,t − γ U π,i,t +1 ) T , (cid:98) Ω I ,π = 1 |I| (cid:88) ( i,t ) ∈I ξ i,t ξ Ti,t { Y i,t + γ (cid:88) a ∈A Φ TL ( X i,t +1 ) (cid:98) β I ,π,a π ( a | X i,t +1 ) − Φ TL ( X i,t ) (cid:98) β π,A i,t } , and |I| stands for the number of elements in I . D. More on on-policy evaluation In this section, we show our proposed CI in Section 4 achieves nominal converge. Tosimplify the analysis, we focus on the setting where K is finite, T (1) = · · · = T ( K ) = T and L (1) = · · · = L ( K ) = L . When K diverges, the sequences { T ( k ) } k ≥ and { L ( k ) } k ≥ shall be properly chosen to reduce the bias of the value estimates. We leave this for futureresearch.Similar to Appendix A.2, we assume the estimated policy (cid:98) π I ∈ Π with probability 1,for any I . In on-policy settings, the behavior policy b π is a function of the estimatedpolicy π ∈ Π. For instance, when an (cid:15) -greedy policy is used to determine the behaviorpolicy, then we have b π = (1 − (cid:15) ) π + (cid:15)π ∗ where π ∗ denotes a uniform random policy. Let B = { b π : π ∈ Π } .For any behavior policy b ∈ B , consider a Markov chain { ( X ,t,b , A ,t,b ) } t ≥ generated bythis behavior policy. Let Y ,t,b be the realization of the immediate reward at time t . Let µ b denote the limiting distribution of the Markov chain { X ,t,b } t ≥ , and P t,bX ( ·| x ) be its t -steptransition kernel. For any x ∈ X , a ∈ A , define ω π,b ( x, a ) as E (cid:40) Y , ,b + γ (cid:88) a ∈A π ( a | X , ,b ) Q ( π ; X , ,b , a ) − Q ( π ; X , ,b , A , ,b ) (cid:41) | X , ,b = x, A , ,b = a . We introduce the following conditions.(A3’.) Assume ν and q are uniformly bounded away from 0 and ∞ on their supports. C. Shi, S. Zhang, W. Lu and R. Song (A4’.) Assume (i) and (ii) hold if T → ∞ and (iii) holds if T is bounded.(i) inf π ∈ Π ,b ∈B λ min [ (cid:82) x ∈ X (cid:80) a ∈A { ξ ( x, a ) ξ T ( x, a ) − γ u π ( x, a ) u Tπ ( x, a ) } b ( a | x ) µ b ( x ) dx ] ≥ ¯ c forsome constant ¯ c > M ( · ) on X and some constant ρ < π ∈ Π (cid:90) x ∈ X M ( x ) µ b π ( x ) dx < + ∞ , and sup π ∈ Π (cid:107)P t,b π X ( ·| x ) − µ b π ( · ) (cid:107) T V ≤ M ( x ) ρ t , ∀ t ≥ . (iii) There exists some constant ¯ c > π ∈ Π ,b ∈B λ min [ T − (cid:88) t =0 E { ξ ( X ,t,b , A ,t,b ) ξ ( X ,t,b , A ,t,b ) T − γ u π ( X ,t,b , A ,t,b ) u Tπ ( X ,t,b , A ,t,b ) } ] ≥ ¯ cT. (A5’) For any I that takes the form of I = { ( i, kT ) : 0 ≤ t < t , ≤ i ≤ n } for some1 ≤ k ≤ K , we have E | V ( (cid:98) π I ; G ) − V ( π ∗ ; G ) | = O ( |I| − b ), for some b > / nT ) b − / (cid:29) (cid:107) (cid:82) x Φ L ( x ) G ( dx ) (cid:107) − , where the big- O term is uniform in I . Theorem 5. Assume A1, A2, A3’-A5’ hold. Suppose L = o {√ nT / log( nT ) } and L p/d (cid:29) nT { (cid:107) (cid:82) x Φ L ( x ) G ( dx ) (cid:107) − } . 
Assume there exists some constant c ≥ such that ω π,b ( x, a ) ≥ c − for any x, a, π, b and Pr (max ≤ t ≤ T − | Y ,t | ≤ c ) = 1 . Then as either n → ∞ or T → ∞ , (cid:112) nT ( K − (cid:101) σ − ( G ) { (cid:101) V ( G ) − V ( (cid:98) π ; G ) } d → N (0 , , (cid:112) nT ( K − (cid:101) σ − ( G ) { (cid:101) V ( G ) − V ( π ∗ ; G ) } d → N (0 , . Proof of Theorem 5 is omitted for brevity. E. Technical proofs For any two positive sequences { a t } t ≥ and { b t } t ≥ , we write a t (cid:22) b t if there exists someconstant C > a t ≤ Cb t for any t . The notation a t (cid:22) a t = O (1). Wewill use C, ¯ C > q X ( ·| x ) denote the density function of P X ( ·| x ). Define S mL − asthe unit sphere { v ∈ R mL : (cid:107) v (cid:107) = 1 } . When splines are used to estimate the Q-function,we assume the internal knots are equally spaced. nference of the Value for Reinforcement Learning E.1. Proof of Lemma 1 Since X is compact, Condition (A1) implies that sup x ∈ X ,a ∈A | r ( x, a ) | ≤ R for some 0 < R < + ∞ . Under CMIA, we have Q ( π ; x, a ) = (cid:88) t ≥ γ t E π ( Y ,t , X , = x, A , = a ) = (cid:88) t ≥ γ t E π { r ( X ,t , A ,t ) | X , = x, A , = a } . As a result, we obtain sup π,x,a | Q ( π ; x, a ) | ≤ R − γ . (E.33)By the Bellman equation, we obtain Q ( π ; x, a ) = r ( x, a ) + γ (cid:90) x (cid:48) (cid:88) a (cid:48) ∈A Q ( π ; x (cid:48) , a (cid:48) ) π ( a (cid:48) | x (cid:48) ) q ( x (cid:48) | x, a ) dx (cid:48) . Since r ( · , a ) is p -smooth for any a ∈ A , it suffices to show T ( π ; x, a ) = (cid:90) x (cid:48) (cid:88) a (cid:48) ∈A Q ( π ; x (cid:48) , a (cid:48) ) π ( a (cid:48) | x (cid:48) ) q ( x (cid:48) | x, a ) dx (cid:48) , is p -smooth for any a ∈ A and any policy π .For any function h ( · ) defined on X , let ∂ j h ( x ) denote the partial derivative ∂h ( x ) /∂x j .Without loss of generality, suppose p > ∂ j p ( x (cid:48) | x, a ) exists for any j . In thefollowing, we show ∂ j T ( π ; x, a ) exists for any j . Let e j = (0 , . . . , (cid:124) (cid:123)(cid:122) (cid:125) j − , , , . . . , (cid:124) (cid:123)(cid:122) (cid:125) d − j ) T , ∀ j = 1 , . . . , d. For any δ ∈ R , consider the limitRe( j, δ ) = T ( π ; x + e j δ, a ) − T ( π ; x, a ) δ − (cid:90) x (cid:48) (cid:88) a (cid:48) ∈A Q ( π ; x (cid:48) , a (cid:48) ) π ( a (cid:48) | x (cid:48) ) ∂ j q ( x (cid:48) | x, a ) dx (cid:48) , as j → ∞ . By the mean value theorem, we have | Re( j, δ ) | ≤ (cid:90) x (cid:48) (cid:88) a (cid:48) ∈A | Q ( π ; x (cid:48) , a (cid:48) ) | π ( a (cid:48) | x (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12) q ( x (cid:48) | x + e j δ, a ) − q ( x (cid:48) | x, a ) δ − ∂ j q ( x (cid:48) | x, a ) (cid:12)(cid:12)(cid:12)(cid:12) dx (cid:48) ≤ (cid:90) x (cid:48) (cid:88) a (cid:48) ∈A | Q ( π ; x (cid:48) , a (cid:48) ) | π ( a (cid:48) | x (cid:48) ) | ∂ j q ( x (cid:48) | x + e j θ x δ, a ) − ∂ j q ( x (cid:48) | x, a ) | dx (cid:48) , where 0 ≤ θ x ≤ x . When 1 < p ≤ 2, we have (cid:98) p (cid:99) = 1. It follows from ConditionA1 that | ∂ j q ( x (cid:48) | x + e j θ x δ, a ) − ∂ j q ( x (cid:48) | x, a ) | ≤ cδ p − . C. Shi, S. Zhang, W. Lu and R. Song When p > | ∂ j ∂ j q ( x (cid:48) | x, a ) | exists and is bounded by c for any x (cid:48) , x and a . It follows fromthe mean value theorem that | ∂ j q ( x (cid:48) | x + e j θ x δ, a ) − ∂ j q ( x (cid:48) | x, a ) | ≤ cδ. In either case, we have that | Re( j, δ ) | ≤ c max( δ, δ p − ) (cid:90) x (cid:48) (cid:88) a (cid:48) ∈A | Q ( π ; x (cid:48) , a (cid:48) ) | π ( a (cid:48) | x (cid:48) ) dx (cid:48) . 
By (E.33) and that X is compact, we obtain Re( j, δ ) → δ → 0. This implies that ∂ j T ( π ; x, a ) exists for any x ∈ X , a ∈ A and equals (cid:90) x (cid:48) (cid:88) a (cid:48) ∈A Q ( π ; x (cid:48) , a (cid:48) ) π ( a (cid:48) | x (cid:48) ) ∂ j q ( x (cid:48) | x, a ) dx (cid:48) . In addition, it follows from A1 and (E.33) that | ∂ j T ( π ; x, a ) | ≤ Rc − γ λ ( X ) , ∀ j, x, a, π, where λ ( · ) denotes the Lebesgue measure. Using the same arguments, we can show for any d -tuple α = ( α , . . . , α d ) T of nonnegative integers that satisfies (cid:107) α (cid:107) ≤ (cid:98) p (cid:99) , D α T ( π ; · , a ) = (cid:90) x (cid:48) (cid:88) a (cid:48) ∈A Q ( π ; x (cid:48) , a (cid:48) ) π ( a (cid:48) | x (cid:48) ) D α q ( x (cid:48) |· , a ) dx (cid:48) , (E.34)and sup π sup x ∈ X ,a ∈A | D α T ( π ; x, a ) | ≤ Rc − γ λ ( X ) . (E.35)Moreover, by A1, (E.33) and (E.34), we have for any d -tuple α with (cid:107) α (cid:107) = (cid:98) p (cid:99) that (cid:107) D α T ( π ; x, a ) − D α T ( π ; y, a ) (cid:107) ≤ (cid:90) x (cid:48) (cid:88) a (cid:48) ∈A | Q ( π ; x (cid:48) , a (cid:48) ) | π ( a (cid:48) | x (cid:48) ) | D α q ( x (cid:48) | x, a ) − D α q ( x (cid:48) | y, a ) | dx (cid:48) ≤ Rc − γ λ ( X ) (cid:107) x − y (cid:107) p −(cid:98) p (cid:99) . This together with (E.34) implies that T ( π ; · , a ) ∈ Λ( p, Rcλ ( X )(1 − γ ) − ) for any π and a .The proof is thus completed. E.2. Proof of Theorem 1 We introduce the following lemmas before proving Theorem 1. In the proof of Theorem 1and Lemma 2-5, we will omit the subscript π in U π ( · ), u π , Σ π , (cid:98) Σ π , (cid:98) β π , β ∗ π , ω π , etc, forbrevity. nference of the Value for Reinforcement Learning Lemma 2. Under A2, there exists some constant c ∗ ≥ such that ( c ∗ ) − ≤ λ min (cid:26)(cid:90) x ∈ X Φ L ( x )Φ TL ( x ) dx (cid:27) ≤ λ max (cid:26)(cid:90) x ∈ X Φ L ( x )Φ TL ( x ) dx (cid:27) ≤ c ∗ , (E.36) and sup x ∈ X (cid:107) Φ L ( x ) (cid:107) ≤ c ∗ √ L . Lemma 3. Suppose the conditions in Theorem 1 hold. We have as either n → ∞ or T → ∞ that (cid:107) Σ − (cid:107) ≤ c − , (cid:107) Σ (cid:107) = O (1) , (cid:107) (cid:98) Σ − Σ (cid:107) = O p { L / ( nT ) − / log( nT ) } , (cid:107) (cid:98) Σ − − Σ − (cid:107) = O p { L / ( nT ) − / log( nT ) } and (cid:107) (cid:98) Σ − (cid:107) ≤ c − wpa1. Lemma 4. Suppose the conditions in Theorem 1 hold. We have as either n → ∞ or T → ∞ that λ max ( T − (cid:80) T − t =0 E ξ ,t ξ T ,t ) = O (1) , λ max { ( nT ) − (cid:80) ni =1 (cid:80) T − t =0 ξ i,t ξ Ti,t } = O p (1) , λ min ( T − (cid:80) T − t =0 E ξ ,t ξ T ,t ) ≥ ¯ c/ and λ min { ( nT ) − (cid:80) ni =1 (cid:80) T − t =0 ξ i,t ξ Ti,t } ≥ ¯ c/ wpa1. Lemma 5. (cid:107) (cid:82) x U ( x ) G ( dx ) (cid:107) ≥ m − / (cid:107) (cid:82) x Φ L ( x ) G ( dx ) (cid:107) .Step 1. Since L p/d (cid:29) nT { (cid:107) (cid:82) x Φ L ( x ) G ( dx ) (cid:107) − } , it follows from Lemma 5 that L p/d (cid:29) nT { (cid:107) (cid:82) x U ( x ) G ( dx ) (cid:107) − } . By Lemma 1 and Condition A2, there exist a set ofvector { β ∗ a } that satisfy (see Section 2.2 of Huang, 1998, for details)sup x ∈ X ,a ∈A | Q ( π ; x, a ) − Φ TL ( x ) β ∗ a | ≤ CL − p/d , (E.37)for some constant C > 0. Let β ∗ = ( β ∗ T , . . . , β ∗ Tm ) T , and r i,t = γ (cid:88) a ∈A { Φ TL ( X i,t +1 ) β ∗ a − Q ( π ; X i,t +1 , a ) } π ( a | X i,t +1 ) − { Φ TL ( X i,t ) β ∗ A i,t − Q ( π ; X i,t , A i,t ) } . 
The condition Pr(max ≤ t ≤ T − | Y i,t | ≤ c ) = 1 implies that | Y i,t | ≤ c , ∀ i, t , almost surely.By Lemma 1 and the definition of p -smooth functions, we obtain that | Q ( π ; x, a ) | ≤ c (cid:48) forany π, x, a . It follows thatmax ≤ t ≤ T − , ≤ i ≤ n | ε i,t | ≤ c + ( γ + 1) c (cid:48) ≤ c + 2 c (cid:48) , (E.38)almost surely. In addition, it follows from (E.37) thatmax ≤ t ≤ T − , ≤ i ≤ n | r i,t | ≤ x ∈ X ,a ∈A | Q ( π ; x, a ) − Φ TL ( x ) β ∗ a | ≤ CL − p/d . (E.39) C. Shi, S. Zhang, W. Lu and R. Song By definition, we have (cid:98) β − β ∗ = (cid:98) Σ − (cid:34) nT n (cid:88) i =1 T − (cid:88) t =0 ξ i,t { Y i,t − ( ξ i,t − γ U i,t +1 ) T β ∗ } (cid:35) = (cid:98) Σ − (cid:34) nT n (cid:88) i =1 T − (cid:88) t =0 ξ i,t (cid:40) Y i,t − Φ TL ( X i,t ) β ∗ A i,t + γ (cid:88) a ∈A Φ TL ( X i,t +1 ) β ∗ a π ( a | X i,t +1 ) (cid:41)(cid:35) = (cid:98) Σ − (cid:40) nT n (cid:88) i =1 T − (cid:88) t =0 ξ i,t ( ε i,t − r i,t ) (cid:41) = Σ − (cid:32) nT n (cid:88) i =1 T − (cid:88) t =0 ξ i,t ε i,t (cid:33)(cid:124) (cid:123)(cid:122) (cid:125) ζ + ( (cid:98) Σ − − Σ − ) (cid:32) nT n (cid:88) i =1 T − (cid:88) t =0 ξ i,t ε i,t (cid:33)(cid:124) (cid:123)(cid:122) (cid:125) ζ − (cid:98) Σ − (cid:32) nT n (cid:88) i =1 T − (cid:88) t =0 ξ i,t r i,t (cid:33)(cid:124) (cid:123)(cid:122) (cid:125) ζ . In the following, we show ζ = O p { L ( nT ) − log( nT ) } and ζ = O p ( L − p/d ) as either n → ∞ ,or T → ∞ . Error bound for (cid:107) ζ (cid:107) : Let F i,t denote the sub-dataset { X i,t , A i,t } ∪ { ( Y i,j , A i,j , X i,j ) } ≤ j Using similar arguments in bounding (cid:107) ζ (cid:107) in Step 1, we can show that (cid:107) ζ (cid:107) = O p { L / ( nT ) − / } and thus (cid:107) (cid:98) β − β ∗ (cid:107) = O p ( L − p/d ) + O p { L / ( nT ) − / } , (E.42)under the condition that L (cid:28) √ nT / log( nT ). Notice that (cid:98) V ( π ; G ) can be presented as { (cid:82) x U ( x ) G ( dx ) } T (cid:98) β . As a result, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:98) V ( π ; G ) − (cid:26)(cid:90) x U ( x ) G ( dx ) (cid:27) T ( β ∗ + ζ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:90) x U ( x ) G ( dx ) (cid:13)(cid:13)(cid:13)(cid:13) (cid:107) (cid:98) β − β ∗ − ζ (cid:107) . (E.43)By (E.37), we have | U T ( x ) β ∗ − V ( π ; x ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) U T ( x ) β ∗ − (cid:88) a ∈A Q ( π ; x, a ) π ( a | x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:88) a ∈A | Q ( π ; x, a ) − Φ TL ( x ) β ∗ a | π ( a | x ) ≤ CL − p/d , C. Shi, S. Zhang, W. Lu and R. Song and hence (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:26)(cid:90) x U ( x ) G ( dx ) (cid:27) T β ∗ − (cid:90) x V ( π ; x ) G ( dx ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ CL − p/d . This together with (E.43) yields that | (cid:98) V ( π ; G ) − V ( π ; G ) − U T ( x ) ζ | ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:90) x U ( x ) G ( dx ) (cid:13)(cid:13)(cid:13)(cid:13) (cid:107) (cid:98) β − β ∗ − ζ (cid:107) + CL − p/d . (E.44)Let Ω = T − (cid:80) T − t =0 E ω ( X ,t , A ,t ) ξ ,t ξ T ,t and σ ( π ; G ) = (cid:26)(cid:90) x U ( x ) G ( dx ) (cid:27) T Σ − Ω ( Σ T ) − (cid:26)(cid:90) x U ( x ) G ( dx ) (cid:27) . Since inf x,a ω ( x, a ) ≥ c − , it follows from Lemma 4 that λ min ( Ω ) ≥ c − λ min (cid:32) T T − (cid:88) t =0 E ξ ,t ξ T ,t (cid:33) ≥ − c − ¯ c. Hence, we obtain σ ( π ; G ) ≥ ¯ c c (cid:26)(cid:90) x U ( x ) G ( dx ) (cid:27) T Σ − ( Σ T ) − (cid:26)(cid:90) x U ( x ) G ( dx ) (cid:27) . 
(E.45)By Lemma 3, we have (cid:107) Σ (cid:107) = O (1), or equivalently, λ max ( Σ T Σ ) = O (1). This implies that λ min { Σ − ( Σ T ) − } ≥ ¯ C for some constant ¯ C > σ ( π ; G ) ≥ (3 c ) − ¯ c ¯ C (cid:13)(cid:13)(cid:13)(cid:13)(cid:90) x U ( x ) G ( dx ) (cid:13)(cid:13)(cid:13)(cid:13) , (E.46)by (E.45). Combining (E.46) together with (E.44) yields that1 σ ( π ; G ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:98) V ( π ; G ) − V ( π ; G ) − (cid:26)(cid:90) x U ( x ) G ( dx ) (cid:27) T ζ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ c √ ¯ c ¯ C (cid:107) (cid:98) β − β ∗ − ζ (cid:107) + √ c CL − p/d √ ¯ c ¯ C (cid:107) (cid:82) x U ( x ) G ( dx ) (cid:107) . By (E.41) and that L (cid:28) √ nT / log( nT ), L p/d (cid:29) nT { (cid:107) (cid:82) x U ( x ) G ( dx ) (cid:107) − } , we obtain √ nT { (cid:98) V ( π ; G ) − V ( π ; G ) } σ ( π ; G ) = √ nT { (cid:82) x U ( x ) G ( dx ) } T ζ σ ( π ; x ) + o p (1) . (E.47)This completes the second step of the proof. Step 3: In the following, we show √ nT σ − ( π ; G ) { (cid:82) x U ( x ) G ( dx ) } T ζ d → N (0 , ≤ g ≤ nT , let i ( g ) and t ( g ) be the quotient and the remainder of g + T − T that satisfy g = { i ( g ) − } T + t ( g ) + 1 and 0 ≤ t ( g ) < T. nference of the Value for Reinforcement Learning Let F (0) = { X , , A , } . Then we iteratively define {F ( g ) } ≤ g ≤ nT as follows: F ( g ) = F ( g − ∪ { Y i ( g ) ,t ( g ) , X i ( g ) ,t ( g )+1 , A i ( g ) ,t ( g )+1 } , if t ( g ) < T − , F ( g ) = F ( g − ∪ { Y i ( g ) ,T − , X i ( g ) ,T , X i ( g )+1 , , A i ( g )+1 , } , otherwise . Let ξ ( g ) = ξ i ( g ) ,t ( g ) and ε ( g ) = ε i ( g ) ,t ( g ) . It follows that √ nT { (cid:82) x U ( x ) G ( dx ) } T ζ σ ( π ; x ) = nT (cid:88) g =1 { (cid:82) x U ( x ) G ( dx ) } T Σ − ξ ( g ) ε ( g ) √ nT σ ( π ; x ) . (E.48)By MA, CMIA and the Bellman equation in (3.7), we obtain that E { ε ( g ) |F ( g − } = E { ε ( g ) | X i ( g ) ,t ( g ) , A i ( g ) ,t ( g ) } = 0 . Hence, the RHS of (E.48) forms a martingale with respect to the filtration { σ ( F ( g ) ) } g ≥ ,where σ ( F ( g ) ) stands for the σ -algebra generated by F ( g ) . To show the asymptotic nor-mality, we use a martingale central limit theorem for triangular arrays (Corollary 2.8 ofMcLeish, 1974). This requires to verify the following two conditions:(a) max ≤ g ≤ nT |{ (cid:82) x U ( x ) G ( dx ) } T Σ − ξ ( g ) ε ( g ) | / {√ nT σ ( π ; x ) } p → nT ) − (cid:80) nTg =1 |{ (cid:82) x U ( x ) G ( dx ) } T Σ ξ ( g ) ε ( g ) | / { σ ( π ; x ) } p → (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) { (cid:82) x U ( x ) G ( dx ) } T Σ − ξ ( g ) ε ( g ) √ nT σ ( π ; x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:107){ (cid:82) x U ( x ) G ( dx ) } T Σ − (cid:107) (cid:107) ξ ( g ) (cid:107) | ε ( g ) |√ nT σ ( π ; x ) ≤ ( c + 2 c (cid:48) ) (cid:107){ (cid:82) x U ( x ) G ( dx ) } T Σ − (cid:107) (cid:107) ξ ( g ) (cid:107) √ nT σ ( π ; x ) ≤ ( c + 2 c (cid:48) ) c ∗ √ L (cid:107){ (cid:82) x U ( x ) G ( dx ) } T Σ − (cid:107) √ nT σ ( π ; x ) ≤ √ c ( c + 2 c (cid:48) ) c ∗ √ ¯ c √ L √ nT , where the first inequality follows from Cauchy-Schwarz inequality, the second inequalityis due to (E.38), the third inequality is due to Lemma 2 and the fact that (cid:107) ξ ( g ) (cid:107) ≤ sup x (cid:107) Φ L ( x ) (cid:107) , and the last inequality follows from (E.45). Since L (cid:28) √ nT / log( nT ), (a) C. Shi, S. Zhang, W. Lu and R. Song is proven. 
To verify (b), notice that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) nT nT (cid:88) g =1 |{ (cid:82) x U ( x ) G ( dx ) } T Σ − ξ ( g ) ε ( g ) | σ ( π ; x ) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = 1 σ ( π ; x ) × (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:26)(cid:90) x U ( x ) G ( dx ) (cid:27) T Σ − nT nT (cid:88) g =1 ( ε ( g ) ) ξ ( g ) ( ξ ( g ) ) T − Ω ( Σ T ) − (cid:26)(cid:90) x U ( x ) G ( dx ) (cid:27)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:107){ (cid:82) x U ( x ) G ( dx ) } T Σ − (cid:107) σ ( π ; x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) nT nT (cid:88) g =1 ( ε ( g ) ) ξ ( g ) ( ξ ( g ) ) T − Ω (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . In view of (E.45), it suffices to show (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) nT nT (cid:88) g =1 ( ε ( g ) ) ξ ( g ) ( ξ ( g ) ) T − Ω (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = o p (1) . (E.49)This can be proven using similar arguments in bounding (cid:107) (cid:98) Σ − Σ (cid:107) in the proof of Lemma3. In view of (E.47) and (E.48), we have by Slutsky’s theorem that √ nT { (cid:98) V ( π ; G ) − V ( π ; G ) } σ ( π ; G ) d → N (0 , . To complete the proof, it remains to show (cid:98) σ ( π ; G ) /σ ( π ; G ) p → 1. Using similar argumentsin verifying (b), it suffices to show (cid:107) (cid:98) Σ − (cid:98) Ω ( (cid:98) Σ T ) − − Σ − Ω ( Σ T ) − (cid:107) = o p (1). By (E.38)and Lemma 4, we have λ max ( Ω ) ≤ ( c + 2 c (cid:48) ) λ max (cid:32) T T − (cid:88) t =0 E ξ ,t ξ T ,t (cid:33) = O (1) , and hence (cid:107) Ω (cid:107) = O (1). This together with Lemma 3 and the condition L (cid:28) √ nT / log( nT )yields that (cid:107) (cid:98) Σ − Ω ( (cid:98) Σ T ) − − Σ − Ω ( Σ T ) − (cid:107) ≤ (cid:107) (cid:98) Σ − − Σ (cid:107) (cid:107) Ω (cid:107) (cid:107) ( (cid:98) Σ T ) − (cid:107) + (cid:107) Σ − (cid:107) (cid:107) Ω (cid:107) (cid:107) (cid:98) Σ − − Σ (cid:107) = O p { L / ( nT ) − / log( nT ) } = o p (1) . Thus, it remains to show (cid:107) (cid:98) Σ − (cid:98) Ω ( (cid:98) Σ T ) − − (cid:98) Σ − Ω ( (cid:98) Σ T ) − (cid:107) = o p (1), or (cid:107) (cid:98) Ω − Ω (cid:107) = o p (1) , by Lemma 3. In view of (E.49), it suffices to show (cid:107) ( nT ) − (cid:80) nTg =1 ( ε ( g ) ) ξ ( g ) ( ξ ( g ) ) T − (cid:98) Ω (cid:107) = o p (1), or equivalently,sup a ∈ S mL − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) nT nT (cid:88) g =1 a T ξ ( g ) ( ξ ( g ) ) T a { ( ε ( g ) ) − ( (cid:98) ε ( g ) ) } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = o p (1) , nference of the Value for Reinforcement Learning where (cid:98) ε ( g ) = Y i ( g ) ,t ( g ) + γ (cid:88) a ∈A Φ TL ( X i ( g ) ,t ( g )+1 ) (cid:98) β a π ( a | X i ( g ) ,t ( g )+1 ) − Φ TL ( X i ( g ) ,t ( g ) ) (cid:98) β A i ( g ) ,t ( g ) . By Lemma 4, we have sup a ∈ S mL − ( nT ) − (cid:80) nTg =1 a T ξ ( g ) ( ξ ( g ) ) T a = O p (1). Hence, it suffices toshow max ≤ g ≤ nT | ( ε ( g ) ) − ( (cid:98) ε ( g ) ) | = o p (1). Suppose we have shown that max ≤ g ≤ nT | ε ( g ) − (cid:98) ε ( g ) | = o p (1). By (E.38), ε ( g ) s are uniformly bounded with probability 1 and thus we havemax ≤ g ≤ nT | ε ( g ) + (cid:98) ε ( g ) | = O p (1). It follows thatmax ≤ g ≤ nT | ( ε ( g ) ) − ( (cid:98) ε ( g ) ) | ≤ max ≤ g ≤ nT | ε ( g ) − (cid:98) ε ( g ) | max ≤ g ≤ nT | ε ( g ) + (cid:98) ε ( g ) | = o p (1) . 
Therefore, it remains to show max ≤ g ≤ nT | ε ( g ) − (cid:98) ε ( g ) | = o p (1), or equivalently,max ≤ i ≤ n, ≤ t ≤ T − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) γ (cid:88) a ∈A Q ( π ; X i,t +1 , a ) π ( a | X i,t +1 ) − Q ( π ; X i,t , A i,t ) (E.50) − γ (cid:88) a ∈A Φ TL ( X i,t +1 ) (cid:98) β a π ( a | X i,t +1 ) − Φ TL ( X i,t ) (cid:98) β A i,t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = o p (1) . The LHS of (E.50) is upper bound by(1 + γ ) sup x ∈ X ,a ∈A | Q ( π ; x, a ) − Φ TL ( x ) (cid:98) β a | . By (E.37), (E.42) and Lemma 2, we havesup x ∈ X ,a ∈A | Q ( π ; x, a ) − Φ TL ( x ) (cid:98) β a | ≤ sup x ∈ X ,a ∈A | Q ( π ; x, a ) − Φ TL ( x ) β ∗ a | + sup x ∈ X | Φ TL ( x ) (cid:98) β a − Φ TL ( x ) β ∗ a |≤ CL − p/d + sup x ∈ X (cid:107) Φ L ( x ) (cid:107) sup a ∈A (cid:107) (cid:98) β a − β ∗ a (cid:107) = O ( L − p/d ) + O ( L / ) sup a ∈A (cid:107) (cid:98) β a − β ∗ a (cid:107) = O p ( L / − p/d ) + O p ( Ln − / T − / ) . Under the given conditions, we have L p/d (cid:29) √ nT and L (cid:28) √ nT / log( nT ). This implies O p ( L / − p/d ) = o p (1), and O p ( Ln − / T − / ) = o p (1). Therefore, we havesup x ∈ X ,a ∈A | Q ( π ; x, a ) − Φ TL ( x ) (cid:98) β a | = o p (1) . The proof is hence completed. E.3. Proof of Lemma 2 For B-spline basis, the assertion in (E.36) follows from the arguments used in the proof ofTheorem 3.3 of Burman and Chen (1989). For wavelet basis, the assertion in (E.36) followsfrom the arguments used in the proof of Theorem 5.1 of Chen and Christensen (2015). C. Shi, S. Zhang, W. Lu and R. Song For either B-spline or wavelet sieve and any L ≥ x ∈ X , the number of nonzeroelements in the vector Φ L ( x ) is bounded by some constant. Moreover, each of the basisfunction is uniformly bounded by O ( √ L ). This proves that the second assertion. E.4. Proof of Lemma 3 Consider the following two scenarios: (i) T grows to infinity; (ii) T is bounded. In Scenario(i), define Σ ∗ = (cid:90) x ∈ X (cid:88) a ∈A ξ ( x, a ) { ξ ( x, a ) − γ u ( x, a ) } T b ( a | x ) µ ( x ) dx, where u ( x, a ) = E { U ( X , ) | X , = x, A , = a } . We introduce the following lemma beforeproving Lemma 3. Lemma 6. Suppose T → ∞ . Under the given conditions in Lemma 3, we have (cid:107) Σ − Σ ∗ (cid:107) (cid:22) T − / . The proof is divided into four parts. In the first part, we show a T Σ a ≥ ¯ c (cid:107) a (cid:107) / a ∈ R mL and (cid:107) Σ − (cid:107) ≤ / ¯ c , as either n → ∞ , or T → ∞ . In the second part, we bound (cid:107) (cid:98) Σ − Σ (cid:107) . In the third part, we bound (cid:107) (cid:98) Σ − − Σ − (cid:107) . Finally, we show (cid:107) Σ (cid:107) = O (1). Part 1: Consider Scenario (i) first. We first show a T Σ ∗ a ≥ ¯ c (cid:107) a (cid:107) / a ∈ R mL . Itfollows from Cauchy-Schwarz inequality that (cid:90) x ∈ X (cid:88) a ∈A { a T ξ ( x, a ) } [ E { U ( X , ) | X , = x, A , = a } T a ] b ( a | x ) µ ( x ) dx ≤ η / ( a ) η / ( a ) , where η ( a ) = (cid:90) x ∈ X (cid:88) a ∈A { a T u ( x, a ) } b ( a | x ) µ ( x ) dx, η ( a ) = (cid:90) x ∈ X (cid:88) a ∈A { a T ξ ( x, a ) } b ( a | x ) µ ( x ) dx. Therefore, a T Σ ∗ a ≥ η ( a ) − (cid:112) η ( a ) (cid:112) γ η ( a ) = (cid:112) η ( a ) η ( a ) − γ η ( a ) (cid:112) η ( a ) + γ (cid:112) η ( a ) . Under A4(i), we obtain η ( a ) − γ η ( a ) ≥ ¯ c (cid:107) a (cid:107) , ∀ a ∈ R mL , and hence a T Σ ∗ a ≥ { η ( a ) − γ η ( a ) } ≥ ¯ c (cid:107) a (cid:107) , ∀ a ∈ R mL . 
(E.51) nference of the Value for Reinforcement Learning We now show (cid:107) Σ ∗ a (cid:107) ≥ ¯ c (cid:107) a (cid:107) , ∀ a ∈ R mL . (E.52)Otherwise, there exists some a ∈ R mL such that (cid:107) Σ ∗ a (cid:107) < − ¯ c (cid:107) a (cid:107) . By Cauchy-Schwarz inequality, we obtain a T Σ ∗ a ≤ (cid:107) a (cid:107) (cid:107) Σ ∗ a (cid:107) < − ¯ c (cid:107) a (cid:107) . However, this vi-olates the assertion in (E.51). (E.52) is thus proven. By Lemma 6 and Cauchy-Schwarzinequality, we obtain (cid:107) Σ a (cid:107) ≥ (cid:107) Σ ∗ a (cid:107) − (cid:107) Σ − Σ ∗ (cid:107) (cid:107) a (cid:107) ≥ { − ¯ c − O ( T − / ) }(cid:107) a (cid:107) ≥ − ¯ c (cid:107) a (cid:107) , (E.53)as T → ∞ . According to the singular value decomposition, we have Σ = V T Λ V forsome orthogonal matrices V , V and some diagonal matrix Λ . By orthogonality, weobtain (cid:107) Σ a (cid:107) = (cid:107) Λ V a (cid:107) and (cid:107) a (cid:107) = (cid:107) V a (cid:107) . In view of (E.53), we have (cid:107) Λ a (cid:107) ≥ − ¯ c (cid:107) a (cid:107) , ∀ a ∈ R mL . This implies that the absolute value of each diagonal element in Λ isat least ¯ c . Thus, we obtain (cid:107) Λ − (cid:107) ≤ c − and hence (cid:107) Σ − (cid:107) ≤ c − .Under Scenario (ii), using similar arguments, we can show a T Σ a ≥ − ¯ c (cid:107) a (cid:107) , for any a ∈ R mL and (cid:107) Σ − (cid:107) ≤ c − , when A4(iii) holds. We omit the proof for brevity. Part 2: We first consider Scenario (ii). Define the random matrix R i = 1 T T − (cid:88) t =0 ξ ( X i,t , A i,t ) { ξ ( X i,t , A i,t ) − γ U ( X i,t +1 ) } T . By Lemma 2, we have max ≤ i ≤ n, ≤ t ≤ T − (cid:107) ξ ( X i,t , A i,t ) (cid:107) ≤ sup x (cid:107) Φ L ( x ) (cid:107) ≤ c ∗ √ L andmax ≤ i ≤ n, ≤ t ≤ T (cid:107) U ( X i,t +1 ) (cid:107) ≤ sup x (cid:107) Φ L ( x ) (cid:107) ≤ c ∗ √ L . It follows thatmax ≤ i ≤ n (cid:107) R i − E R i (cid:107) ≤ T T − (cid:88) t =0 c ∗ √ L ( c ∗ √ L + γc ∗ √ L ) ≤ L ( c ∗ ) . (E.54)Let σ n = max (cid:40)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 E ( R i − E R i )( R i − E R i ) T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 E ( R i − E R i ) T ( R i − E R i ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:41) = n max (cid:8)(cid:13)(cid:13) E ( R − E R )( R − E R ) T (cid:13)(cid:13) , (cid:13)(cid:13) E ( R − E R ) T ( R − E R ) (cid:13)(cid:13) (cid:9) . For any v ∈ R mL , we have v T E ( R − E R )( R − E R ) T v = v T E R R T v − v T ( E R )( E R ) T v ≤ v T E R R T v . C. Shi, S. Zhang, W. Lu and R. Song Moreover, using similar arguments in proving (E.54), we can show v T E R R T v ≤ E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T T − (cid:88) t =0 v T ξ ( X ,t , A ,t ) { ξ ( X ,t , A ,t ) − γ U ( X ,t +1 ) } T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ L ( c ∗ ) E (cid:40) T T − (cid:88) t =0 v T ξ ( X ,t , A ,t ) (cid:41) . By Cauchy-Schwarz inequality, we obtain v T E R R T v ≤ L ( c ∗ ) T T − (cid:88) t =0 E { v T ξ ( X ,t , A ,t ) } ≤ L ( c ∗ ) (cid:107) v (cid:107) T T − (cid:88) t =0 λ max (cid:8) E ξ ( X ,t , A ,t ) ξ ( X ,t , A ,t ) T (cid:9) . Similarly, we can show v T E R T R v ≤ L ( c ∗ ) (cid:107) v (cid:107) T T − (cid:88) t =0 λ max { E ξ ( X ,t , A ,t ) ξ ( X ,t , A ,t ) T } , and hence σ n ≤ Ln ( c ∗ ) T T − (cid:88) t =0 λ max { E ξ ( X ,t , A ,t ) ξ ( X ,t , A ,t ) T } . (E.55)Consider λ max { E ξ ( X , , A , ) ξ ( X , , A , ) T } first. Notice that E ξ ( X , , A , ) ξ ( X , , A , ) T is a block diagonal matrix. For any v ∈ R mL , let v = ( a T , . 
. . , a Tm ) T where all the sub-vectors a j s have the same length. With some calculations, we have v T E ξ ( X , , A , ) ξ ( X , , A , ) T v = m (cid:88) j =1 E { a Tj Φ L ( X , ) } b ( j | X , ) ≤ λ max { E Φ L ( X , )Φ TL ( X , ) }(cid:107) v (cid:107) ≤ λ max (cid:26)(cid:90) x ∈ X Φ L ( x )Φ TL ( x ) ν ( x ) dx (cid:27) (cid:107) v (cid:107) . By Condition A3 and Lemma 2, we obtain λ max (cid:26)(cid:90) x ∈ X Φ L ( x )Φ TL ( x ) ν ( x ) dx (cid:27) ≤ sup x ∈ X ν ( x ) λ max (cid:26)(cid:90) x ∈ X Φ L ( x )Φ TL ( x ) dx (cid:27) (cid:22) . This yields λ max { E ξ ( X , , A , ) ξ ( X , , A , ) T } (cid:22) . (E.56)For any t > 0, the marginal density function of X ,t is given by µ t ( x ) = (cid:90) x ,...,x t − ∈ X ν ( x ) q X ( x | x ) · · · q X ( x t − | x t − ) q X ( x | x t − ) dx dx . . . d x t − . (E.57) nference of the Value for Reinforcement Learning Thus, we have µ t ( x ) ≤ sup x (cid:48) ,x (cid:48)(cid:48) q X ( x (cid:48) | x (cid:48)(cid:48) ) for any t ≥ x ∈ X . Under Condition A1, wecan show the density function q X ( x | x (cid:48) ) is uniformly bounded for any x and x (cid:48) . It followsthat µ t ( x ) is uniformly bounded for any t ≥ x . Using similar arguments in proving(E.56), we can show λ max { E ξ ( X ,t , A ,t ) ξ ( X ,t , A ,t ) T } (cid:22) , ∀ ≤ t ≤ T. (E.58)This together with (E.55) and (E.56) yields σ n ≤ CnL, (E.59)for some constant C > 0. Combining this together with (E.54), an application of the matrixconcentration inequality (see Theorem 1.6 in Tropp, 2012) yields thatPr (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 R i − n E R (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ τ (cid:33) ≤ mL exp (cid:18) − τ CnL + 8 L ( c ∗ ) τ / (cid:19) , ∀ τ > . Set τ = 3 √ CnL log n . Since T is bounded, under the given conditions, n will grow toinfinity. For sufficiently large n , we have 8 L ( c ∗ ) τ / (cid:28) τ and hencePr (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 R i − n E R (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ (cid:112) CnL log n (cid:33) ≤ mLn . Since L (cid:28) n and T is bounded, we obtain 2 mL/n (cid:28) / ( n T ). Thus, we can show thatthe following event occurs with probability at least 1 − O ( n − T − ), (cid:13)(cid:13)(cid:13) (cid:98) Σ − Σ (cid:13)(cid:13)(cid:13) = 1 n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 R i − n E R (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) (cid:112) ( nT ) − L log( nT ) , (E.60)since T is bounded.Now let’s consider Scenario (i). We define a new Markov chain { ( X ∗ ,t , A ∗ ,t ) } t ≥ thathas the same probability transition function as { ( X ,t , A ,t ) } t ≥ . The density function ofthe initial variable X ∗ , is set to be the invariant distribution µ ( · ). The chain { X ∗ ,t } t ≥ isstationary. Let R i,t = ξ ( X i,t , A i,t ) { ξ ( X i,t , A i,t ) − γ U ( X i,t +1 ) } T , ∀ i, t, R ∗ ,t = ξ ( X ∗ ,t , A ∗ ,t ) { ξ ( X ∗ ,t , A ∗ ,t ) − γ U ( X ∗ ,t +1 ) } T , ∀ t. We aim to apply the matrix concentration inequality to the sum of independent randommatrix (regardless of whether n is bounded or not),1 n n (cid:88) i =1 (cid:40) T T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:41) . C. Shi, S. Zhang, W. Lu and R. Song We begin by providing an upper error bound for max ≤ i ≤ n (cid:107) T − (cid:80) T − t =0 ( R i,t − Σ ∗ ) (cid:107) . 
Forany τ > 0, notice thatPr (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > τ (cid:33) = (cid:90) x ∈ X Pr (cid:32) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > τ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X , = x (cid:33) ν ( x ) dx = (cid:90) x ∈ X Pr (cid:32) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > τ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X , = x (cid:33) ν ( x ) µ ( x ) µ ( x ) dx. By A3, the ratio µ − ( x ) ν ( x ) is uniformly bounded for any x . It follows thatPr (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > τ (cid:33) (cid:22) (cid:90) x ∈ X Pr (cid:32) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > τ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X , = x (cid:33) µ ( x ) dx = Pr (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T T − (cid:88) t =0 ( R ∗ ,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > τ (cid:33) . (E.61)Let F t − = { ( X ∗ ,j , A ∗ ,j ) } ≤ j ≤ t , for all t ≥ 0, and σ ( F t ) be the σ -algebra generated by F t .Define R ∗∗ ,t = ξ ( X ∗ ,t , A ∗ ,t ) { ξ ( X ∗ ,t , A ∗ ,t ) − γ u ( X ∗ ,t , A ∗ ,t ) } T . The sum (cid:80) T − t =0 ( R ∗ ,t − R ∗∗ ,t ) forms a mean zero matrix martingale with respect to thefiltration { σ ( F t ) : t ≥ − } . Similar to (E.54) and (E.59), we can showmax ≤ t ≤ T − (cid:107) R ∗ ,t − R ∗∗ ,t (cid:107) ≤ L ( c ∗ ) , max (cid:40)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 E { ( R ∗ ,t − R ∗∗ ,t ) T ( R ∗ ,t − R ∗∗ ,t ) |F t − } (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 E { ( R ∗ ,t − R ∗∗ ,t )( R ∗ ,t − R ∗∗ ,t ) T |F t − } (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:41) (cid:22) LT. By the matrix martingale concentration inequality (Corollary 1.3, Tropp, 2011), we obtainthe following occurs with probability at least 1 − O ( n − T − ), (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 ( R ∗ ,t − R ∗∗ ,t ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) (cid:112) LT log( nT ) . (E.62)Define R ∗∗∗ ,t = (cid:88) a ∈A ξ ( X ∗ ,t , a ) { ξ ( X ∗ ,t , a ) − γ u ( X ∗ ,t , a ) } T b ( a | X ∗ ,t ) . (E.63) nference of the Value for Reinforcement Learning Conditional on { X ∗ ,t } t ≥ , { R ∗∗ ,t − R ∗∗∗ ,t } t ≥ are independent mean zero random variables.Using similar arguments in proving (E.60), we can show thatPr (cid:32) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 ( R ∗∗ ,t − R ∗∗∗ ,t ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ C (cid:112) LT log( nT ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) { X ∗ ,t } t ≥ (cid:33) = O ( n − T − ) , for some constant C > 0, where the big- O term is independent of { X ∗ ,t } t ≥ . Thus, weobtain Pr (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 ( R ∗∗ ,t − R ∗∗∗ ,t ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ C (cid:112) LT log( nT ) (cid:33) = O ( n − T − ) , This together with (E.62) implies that the following event occurs with probability at least1 − O ( n − T − ), (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 ( R ∗ ,t − R ∗∗∗ ,t ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) (cid:112) LT log( nT ) . (E.64)Notice that each R ∗∗∗ ,t is a function of X ∗ ,t only, with mean Σ . 
The Markov chain { X ∗ ,t } t ≥ is stationary. Following Davydov (1973), define the β -mixing coefficient of the stationaryMarkov chain { X ∗ ,t } t ≥ as β ( q ) = (cid:90) x ∈ X sup ≤ ϕ ≤ (cid:12)(cid:12) E { ϕ ( X ∗ ,q ) | X ∗ , = x } − E ϕ ( X ∗ , ) (cid:12)(cid:12) µ ( x ) dx. Under the geometric ergodicity assumption in A4(ii), it follows from Theorem 3.7 of Bradley(2005) that { X ∗ ,t } t ≥ is exponentially β -mixing. That is, β ( t ) = O ( ρ t ) for some ρ < t ≥ 0. Using similar arguments in proving (E.54), we can showmax ≤ t ≤ T − (cid:107) R ∗∗∗ ,t − Σ ∗ (cid:107) ≤ L ( c ∗ ) . (E.65)Moreover, for any 0 ≤ t ≤ t ≤ T − v , v ∈ R mL , we have by Cauchy-Schwarzinequality that | v T E ( R ∗∗∗ ,t − Σ ∗ )( R ∗∗∗ ,t − Σ ∗ ) T v | ≤ (cid:113) E (cid:107) v T ( R ∗∗∗ ,t − Σ ∗ ) (cid:107) (cid:113) E (cid:107) v T ( R ∗∗∗ ,t − Σ ∗ ) (cid:107) ≤ (cid:113) λ max { E ( R ∗∗∗ ,t − Σ ∗ )( R ∗∗∗ ,t − Σ ∗ ) T } (cid:113) λ max { E ( R ∗∗∗ ,t − Σ ∗ )( R ∗∗∗ ,t − Σ ∗ ) T }(cid:107) v (cid:107) (cid:107) v (cid:107) . Using similar arguments in proving (E.59), we can showmax ≤ t ≤ T − λ max { E ( R ∗∗∗ ,t − Σ ∗ )( R ∗∗∗ ,t − Σ ∗ ) T } (cid:22) L. C. Shi, S. Zhang, W. Lu and R. Song This implies max ≤ t ,t ≤ T − sup v (cid:54) =0 , v (cid:54) =0 | v T E ( R ∗∗∗ ,t − Σ ∗ )( R ∗∗∗ ,t − Σ ∗ ) T v |(cid:107) v (cid:107) (cid:107) v (cid:107) (cid:22) L, and hence, max ≤ t ,t ≤ T − sup v (cid:54) =0 (cid:107) E ( R ∗∗∗ ,t − Σ ∗ )( R ∗∗∗ ,t − Σ ∗ ) T v (cid:107) (cid:107) v (cid:107) (cid:22) L, or equivalently, max ≤ t ,t ≤ T − (cid:107) E ( R ∗∗∗ ,t − Σ ∗ )( R ∗∗∗ ,t − Σ ∗ ) T (cid:107) (cid:22) L. Similarly, we can showmax ≤ t ,t ≤ T − (cid:107) E ( R ∗∗∗ ,t − Σ ∗ ) T ( R ∗∗∗ ,t − Σ ∗ ) (cid:107) (cid:22) L. (E.66)By Theorem 4.2 of Chen and Christensen (2015), there exists some constant C > τ ≥ < q < T ,Pr (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 ( R ∗∗∗ ,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ τ (cid:33) ≤ Tq β ( q ) + Pr (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) t ∈I r ( R ∗∗∗ ,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ τ +4 mL exp (cid:18) − τ / CT qL + 4 qLτ ( c ∗ ) / (cid:19) , (E.67)where I r = { q (cid:98) ( T + 1) /q (cid:99) , q (cid:98) ( T + 1) /q (cid:99) + 1 , · · · , T − } . Suppose τ ≥ qL ( c ∗ ) . Notice that |I r | ≤ q . It follows from (E.65) thatPr (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) t ∈I r ( R ∗∗∗ ,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ τ = 0 . (E.68)Since β ( q ) = O ( ρ q ), set q = − nT ) / log ρ , we obtain T β ( q ) /q = O ( n − T − ). Set τ = max { (cid:112) CT qL log( T n ) , qL ( c ∗ ) log( nT ) } , we obtain that τ ≥ CT qL log( T n ) and τ ≥ qLτ ( c ∗ ) / T n ) and τ ≥ qL ( c ∗ ) , as either n → ∞ or T → ∞ . It follows from (E.67), (E.68) and the condition L (cid:28) nT thatthe following event occurs with probability at least 1 − O ( n − T − ), (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 ( R ∗∗∗ ,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) max {√ T L log( T n ) , L log ( T n ) } . nference of the Value for Reinforcement Learning Combining this together with (E.61) and (E.64) yields that the following event occurs withprobability at least 1 − O ( n − T − ), (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) max {√ T L log( T n ) , L log ( T n ) } . 
(E.69)By Bonferroni’s inequality, we obtain with probability at least 1 − O ( n − T − ) thatmax i ∈{ ,...,n } (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ ¯ C max {√ T L log( T n ) , L log ( T n ) } , (E.70)for some constant ¯ C > 0. For i = 0 , , . . . , n , let A i denote the event A i = (cid:40)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ ¯ C max {√ T L log( T n ) , L log ( T n ) } (cid:41) . It follows from (E.70) that the following event occurs with probability at least 1 − O ( n − T − ), n (cid:88) i =1 T − (cid:88) t =0 ( R i,t − Σ ∗ ) = n (cid:88) i =1 (cid:40) T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:41) I ( A i ) . (E.71)Now we provide an upper error bound for (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) T (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = sup v ∈ S mL − E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . Using similar arguments in proving (E.61), we can show thatsup v ∈ S mL − E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = sup v ∈ S mL − (cid:90) x E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X , = x ν ( x ) dx (cid:22) sup v ∈ S mL − (cid:90) x E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X , = x µ ( x ) dx = sup v ∈ S mL − E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:40) T − (cid:88) t =0 ( R ∗ ,t − Σ ∗ ) (cid:41) v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . As a result, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) T (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:40) T − (cid:88) t =0 ( R ∗ ,t − Σ ∗ ) (cid:41) T (cid:40) T − (cid:88) t =0 ( R ∗ ,t − Σ ∗ ) (cid:41)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . (E.72) C. Shi, S. Zhang, W. Lu and R. Song Notice that sup v ∈ S mL − E v T (cid:40) T − (cid:88) t =0 ( R ∗ ,t − Σ ∗ ) (cid:41) T (cid:40) T − (cid:88) t =0 ( R ∗ ,t − Σ ∗ ) (cid:41) v ≤ T − (cid:88) t =0 sup v ∈ S mL − E v T ( R ∗ ,t − Σ ∗ ) T ( R ∗ ,t − Σ ∗ ) v (cid:124) (cid:123)(cid:122) (cid:125) η (E.73)+ (cid:88) ≤ t ,t ≤ T − ≤| t − t |≤ sup v ∈ S mL − E v T ( R ∗ ,t − Σ ∗ ) T ( R ∗ ,t − Σ ∗ ) v (cid:124) (cid:123)(cid:122) (cid:125) η + (cid:88) ≤ t ,t ≤ T − | t − t |≥ sup v ∈ S mL − E v T ( R ∗ ,t − Σ ∗ ) T ( R ∗ ,t − Σ ∗ ) v (cid:124) (cid:123)(cid:122) (cid:125) η . By (E.66), we obtain that η (cid:22) LT and η (cid:22) LT. (E.74)For any 0 ≤ t < t ≤ T − t − t ≥ 4, it follows from MA that ( X ∗ ,t , A ∗ ,t , X ∗ ,t +1 )is independent of ( X ∗ ,t , A ∗ ,t , X ∗ ,t +1 ) given X ∗ ,t − . Thus, we have E ( R ∗ ,t − Σ ∗ ) T ( R ∗ ,t − Σ ∗ ) = E ( R ∗ ,t − Σ ∗ ) T E { ( R ∗ ,t − Σ ∗ ) | X ∗ ,t − } . Similarly, conditional on X ∗ ,t +2 , ( X ∗ ,t , A ∗ ,t , X ∗ ,t +1 ) and X ∗ ,t − are independent. 
Itfollows that E ( R ∗ ,t − Σ ∗ ) T E { ( R ∗ ,t − Σ ∗ ) | X ∗ ,t − } = E [ E { ( R ∗ ,t − Σ ∗ ) | X ∗ ,t +2 } ] T [ E { ( R ∗ ,t − Σ ∗ ) | X ∗ ,t − } ] , and hence E ( R ∗ ,t − Σ ∗ ) T ( R ∗ ,t − Σ ∗ ) = E [ E { ( R ∗ ,t − Σ ∗ ) | X ∗ ,t +2 } ] T [ E { ( R ∗ ,t − Σ ∗ ) | X ∗ ,t − } ] . (E.75)Define Θ ( x ) = E ( R ∗ ,t | X ,t +2 = x ) and Θ ( x ) = E ( R ∗ ,t | X ,t − = x ). Let E X ∗ ,t denotethe conditional expectation given X ∗ ,t With some calculations, we can show Θ ( x ) = E (cid:104) E X ∗ ,t ξ ( X ∗ ,t , A ∗ ,t ) { ξ ( X ∗ ,t , A ∗ ,t ) − γ U ( X ∗ ,t +1 ) } T (cid:12)(cid:12)(cid:12) X ∗ ,t +2 = x (cid:105) = E (cid:34) (cid:90) y ∈ X (cid:88) a ∈A ξ ( y, a ) { ξ ( y, a ) − γ U ( X ∗ ,t +1 ) } T b ( a | y ) q ( X ∗ ,t +1 | y, a ) µ ( y ) µ ( X ∗ ,t +1 ) dy (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ∗ ,t +2 = x (cid:35) = (cid:90) y ,y ∈ X (cid:88) a ∈A ξ ( y , a ) { ξ ( y , a ) − γ U ( y ) } T b ( a | y ) q ( y | y , a ) µ ( y ) q X ( x | y ) µ ( x ) dy dy , nference of the Value for Reinforcement Learning and Θ ( x ) = E (cid:34) (cid:88) a ∈A ξ ( X ∗ ,t , a ) { ξ ( X ∗ ,t , a ) − γ u ( X ∗ ,t , a ) } T b ( a | X ∗ ,t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ∗ ,t − = x (cid:35) = (cid:90) y ∈ X (cid:88) a ∈A ξ ( y, a ) { ξ ( y, a ) − γ u ( y, a ) } T b ( a | y ) q X ( y | x ) dy. It follows from (E.75) that v T E ( R ∗ ,t − Σ ∗ ) T ( R ∗ ,t − Σ ∗ ) v = v T E Θ T ( X ∗ ,t +2 ) Θ ( X ∗ ,t − ) v − v T ( Σ ∗ ) T Σ ∗ v = v T E Θ T ( X ∗ ,t +2 ) Θ ( X ∗ ,t − ) v − v T { E Θ T ( X ∗ ,t +2 ) } T E Θ ( X ∗ ,t − ) v = mL (cid:88) j =1 [ E { v T Θ , · ,j ( X ∗ ,t +2 ) }{ Θ ,j, · ( X ∗ ,t − ) v } − { E v T Θ , · ,j ( X ∗ ,t +2 ) }{ E Θ ,j, · ( X ∗ ,t − ) v } ] , where Θ l,j, · ( x ) and Θ l, · ,j ( x ) denote the j -th row and j -column of Θ l ( x ), respectively. Let ξ j ( · , · ) be the j -th element of ξ ( · , · ). By Lemma 2 and the definitions of ξ and U , we havesup v ∈ S mL − ,x ∈ X a ∈A | ξ T ( x, a ) v | (cid:22) L / and sup v ∈ S mL − ,x ∈ X | U T ( x ) v | (cid:22) L / . It follows from and A1 and A3 thatmax j ∈{ ,...,mL } sup v ∈ S mL − ,x ∈ X | v T Θ , · ,j ( x ) |(cid:22) (cid:90) y ,y ∈ X (cid:88) a ∈A | ξ j ( y , a ) || v T { ξ ( y , a ) − γ U ( y ) }| b ( a | y ) dy dy (cid:22) L / (cid:90) y ∈ X (cid:88) a ∈A | ξ j ( y , a ) | b ( a | y ) dy ≤ L / max j ∈{ ,...,L } (cid:90) y ∈ X | φ L,j ( y ) | dy. Similarly to Lemma 2, we can show max j ∈{ ,...,L } (cid:82) y ∈ X | φ L,j ( y ) | dy (cid:22) L − / , and hencemax j ∈{ ,...,mL } sup v ∈ S mL − ,x ∈ X | v T Θ , · ,j ( x ) | (cid:22) . (E.76)It follows thatsup v ∈ S mL − | v T E ( R ∗ ,t − Σ ∗ ) T ( R ∗ ,t − Σ ∗ ) v |≤ sup v ∈ S mL − mL (cid:88) j =1 (cid:90) x ∈ X | E { Θ ,j, · ( X ∗ ,t − ) v | X ∗ ,t +2 } − E { Θ ,j, · ( X ∗ ,t − ) v || v T Θ , · ,j ( x ) | µ ( x ) dx (cid:22) sup v ∈ S mL − mL (cid:88) j =1 (cid:90) x ∈ X | E { Θ ,j, · ( X ∗ ,t − ) v | X ∗ ,t +2 } − E { Θ ,j, · ( X ∗ ,t − ) v | µ ( x ) dx. C. Shi, S. Zhang, W. Lu and R. Song Similar to (E.76), we can showmax j ∈{ ,...,mL } sup v ∈ S mL − ,x ∈ X | Θ ,j, · ( x ) v | (cid:22) . 
According to the definition of β -mixing coefficients and the geometric ergodicity assumption,we havesup v ∈ S mL − | v T E ( R ∗ ,t − Σ ∗ ) T ( R ∗ ,t − Σ ∗ ) v | (cid:22) mL (cid:88) j =1 β ( t − t − (cid:22) Lρ t − t − , where the above bound is uniform for any pair ( t , t ) that satisfies 0 ≤ t ≤ t − (cid:88) ≤ t Similarly, we can show (cid:88) ≤ t This together with (E.72) yields that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) T (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) LT, or sup v ∈ S mL − E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) LT. (E.77)By Cauchy-Schwarz inequality, we obtainsup v ∈ S mL − E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) I ( A ) − E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) I ( A ) (cid:41) v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) LT, nference of the Value for Reinforcement Learning and hence (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:34)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) I ( A ) − E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) I ( A ) (cid:41)(cid:35) T (E.78) × (cid:34)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) I ( A ) − E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) I ( A ) (cid:41)(cid:35)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) LT. Similarly, we can show (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:34)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) I ( A ) − E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) I ( A ) (cid:41)(cid:35) × (cid:34)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) I ( A ) − E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) I ( A ) (cid:41)(cid:35) T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:22) LT. Combining this with (E.70) and (E.78), an application of the matrix Bernstein inequality(Theorem 1.6 in Tropp, 2012) yields that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:40) T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:41) I ( A i ) − n E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) I ( A ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = O p {√ nT L log( nT ) } , under the assumption that L = o { nT / log ( nT ) } . This together with (E.70) yields that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:40) T − (cid:88) t =0 ( R i,t − Σ ∗ ) (cid:41) − n E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) I ( A ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = O p {√ nT L log( nT ) } . (E.79)By Cauchy-Schwarz inequality, we have for any v , v ∈ R mL with (cid:107) v (cid:107) = (cid:107) v (cid:107) = 1 that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) v T E (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) I ( A c ) v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:118)(cid:117)(cid:117)(cid:116) E (cid:34) v T (cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) v (cid:35) (cid:113) Pr( A c ) ≤ (cid:118)(cid:117)(cid:117)(cid:116) E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:40) T − (cid:88) t =0 ( R ,t − Σ ∗ ) (cid:41) v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:113) Pr( A c ) (cid:22) √ LT n − / T − = O ( n − ) , by (E.69), (E.77) and the condition that L (cid:28) T n/ log ( T n ). 
This together with (E.79)yields that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:40) T − (cid:88) t =0 ( R i,t − E R ,t ) (cid:41)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = O p {√ nT L log( nT ) } , and hence (cid:107) (cid:98) Σ − Σ (cid:107) = O p { L / ( nT ) − / log( nT ) } . Part 3 : In either scenario, we have shown (cid:107) (cid:98) Σ − Σ (cid:107) = O p { L / ( nT ) − / log( nT ) } . Underthe condition that L = o { nT / log( nT ) } , we have (cid:107) (cid:98) Σ − Σ (cid:107) = o p (1). By definition, this C. Shi, S. Zhang, W. Lu and R. Song implies that (cid:107) (cid:98) Σ − Σ (cid:107) ≤ ¯ c/ 6, with probability approaching 1. In Part 1 , we have shownthat v T Σ v ≥ ¯ c (cid:107) v (cid:107) / 3, for any v ∈ R mL . By Cauchy-Schwarz inequality, the followingevent occurs with probability approaching 1, v T (cid:98) Σ v ≥ ¯ c (cid:107) v (cid:107) / , ∀ v ∈ R mL . Using similar arguments in Part 1 , this implies (cid:98) Σ is invertible and satisfies (cid:107) (cid:98) Σ − (cid:107) ≤ c − ,with probability tending to 1. Therefore (cid:107) (cid:98) Σ − − Σ − (cid:107) = (cid:107) (cid:98) Σ − ( (cid:98) Σ − Σ ) Σ − (cid:107) ≤ (cid:107) (cid:98) Σ − (cid:107) (cid:107) (cid:98) Σ − Σ (cid:107) (cid:107) Σ − (cid:107) ≤ c − (cid:107) (cid:98) Σ − Σ (cid:107) , with probability tending to 1. Since (cid:107) (cid:98) Σ − Σ (cid:107) = O p { L / ( nT ) − / log( nT ) } , we obtain (cid:107) (cid:98) Σ − − Σ − (cid:107) = O p { L / ( nT ) − / log( nT ) } . The proof is hence completed. Part 4: Consider Scenario (i) first. By Lemma 6, it suffices to show (cid:107) Σ ∗ (cid:107) = O (1). This isequivalent to show η ≡ sup v , v ∈ S mL − (cid:12)(cid:12) v T ( E R ∗∗∗ , ) v (cid:12)(cid:12) (cid:22) . With some calculations, we have η ≤ sup v , v ∈ S mL − (cid:88) a ∈A E | v T ξ ( X ∗ , , a ) || v T ξ ( X ∗ , , a ) | b ( a | X ∗ , )+ sup v , v ∈ S mL − (cid:88) a ∈A E | v T ξ ( X ∗ , , a ) || v T u ( X ∗ , , a ) | b ( a | X ∗ , ) ≤ η (1)4 + 12 η (2)4 , by Cauchy-Schwarz inequality, where η (1)4 = sup v ∈ S mL − (cid:88) a ∈A E | v T ξ ( X ∗ , , a ) | b ( a | X ∗ ,t ) ,η (2)4 = sup v ∈ S mL − (cid:88) a ∈A E | v T u ( X ∗ , , a ) | b ( a | X ∗ ,t ) . By definition, we have η (1)4 = λ max (cid:40)(cid:88) a ∈A E ξ ( X ∗ , , a ) ξ T ( X ∗ , , a ) b ( a | X ∗ , ) (cid:41) . Since the matrix (cid:80) a ∈A ξ ( X ∗ , , a ) ξ T ( X ∗ , , a ) b ( a | X ∗ , ) is block diagonal with the main-diagonalblocks { Φ L ( X ∗ , )Φ L ( X ∗ , ) T b ( j | X ∗ , ) } j =1 ,...,m . By Lemma 2 and Condition A3, we can show nference of the Value for Reinforcement Learning η (1)4 (cid:22) 1. As for η (2)4 , we have η (2)4 ≤ sup v ∈ S mL − E | v T U ( X ∗ , ) | = λ max (cid:8) E U ( X ∗ , ) U T ( X ∗ , ) (cid:9) ≤ sup v ,...,v m ∈ R L (cid:107) v (cid:107) + ··· + (cid:107) v m (cid:107) =1 (cid:88) ≤ j ,j ≤ m a Tj E Φ L ( X ∗ , )Φ L ( X ∗ , ) T a j π ( j | X ∗ , ) π ( j | X ∗ , ) ≤ (cid:88) ≤ j ,j ≤ m λ max (cid:8) E Φ L ( X ∗ , )Φ L ( X ∗ , ) T (cid:9) π ( j | X ∗ , ) π ( j | X ∗ , ) = λ max (cid:8) E Φ L ( X ∗ , )Φ L ( X ∗ , ) T (cid:9) , where the first inequality follows from Jensen’s inequality. By Lemma 2 and Condition A3,we can similarly show that η (2)4 (cid:22) 1. Thus, we obtain (cid:107) Σ ∗ (cid:107) = η (cid:22) X ,t isuniformly bounded. Using similar arguments in proving η (cid:22) 1, we can show (cid:13)(cid:13) E ξ ( X ,t , A ,t ) { ξ ( X ,t , A ,t ) − U ( X ,t +1 ) } T (cid:13)(cid:13) = O (1) , where the big- O term is uniform for any 0 ≤ t ≤ T − 1. 
Since T is bounded, we obtain that Σ ≤ T − (cid:88) t =0 (cid:13)(cid:13) E ξ ( X ,t , A ,t ) { ξ ( X ,t , A ,t ) − U ( X ,t +1 ) } T (cid:13)(cid:13) (cid:22) . The proof is hence completed. E.4.1. Proof of Lemma 6 By definition, we have Σ ∗ = E R ∗∗∗ , = T − (cid:80) T − t =0 E R ∗∗∗ ,t and Σ = T − (cid:80) T − t =0 E R (1)0 ,t where R ∗∗∗ ,t is defined in (E.63) and R (1)0 ,t = (cid:88) a ∈A ξ ( X ,t , a ) { ξ ( X ,t , a ) − γ u ( X ,t , a ) } T b ( a | X ,t ) . It suffices to show (cid:107) (cid:80) T − t =0 E ( R ∗∗∗ ,t − R (1)0 ,t ) (cid:107) (cid:22) √ T , or equivalently,sup v , v ∈ S mL − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) v T (cid:40) T − (cid:88) t =0 E ( R ∗∗∗ ,t − R (1)0 ,t ) (cid:41) v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:22) √ T . Notice that sup v , v ∈ S mL − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) v T (cid:40) T − (cid:88) t =0 E ( R ∗∗∗ ,t − R (1)0 ,t ) (cid:41) v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ η + sup v , v ∈ S mL − (cid:12)(cid:12)(cid:12) v T ( E R (1)0 , ) v (cid:12)(cid:12)(cid:12)(cid:124) (cid:123)(cid:122) (cid:125) η + sup v , v ∈ S mL − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) v T (cid:40) T − (cid:88) t =1 E ( R ∗∗∗ ,t − R (1)0 ,t ) (cid:41) v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:124) (cid:123)(cid:122) (cid:125) η . C. Shi, S. Zhang, W. Lu and R. Song In Part 4, we have shown that η (cid:22) 1. Using similar arguments, we can show η (cid:22) η (cid:22) √ T .Define g( v , v , y ) = (cid:80) a ∈A (cid:82) x ∈ X v T ξ ( x, a ) { ξ ( x, a ) − γ u ( x, a ) } T v b ( a | x ) q X ( x | y ) dx . Usingsimilar arguments in proving η (1)4 , η (2)4 (cid:22) 1, we can show there exists some constant N > | g( v , v , y ) | ≤ N for any y ∈ X and any v , v ∈ S mL − . For any variable Z thatsatisfies Pr( | Z | ≤ N ) = 1 and any integer J , we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EZ − J (cid:88) j = − J N jJ Pr (cid:18) N jJ ≤ Z < N ( j + 1) J (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (E.80) ≤ J (cid:88) j = − J E (cid:12)(cid:12)(cid:12)(cid:12) Z − N jJ (cid:12)(cid:12)(cid:12)(cid:12) I (cid:18) N jJ ≤ Z < N ( j + 1) J (cid:19) ≤ N J . For any t ≥ 1, notice that E v T R ∗∗∗ ,t v = E g ( v , v , X ∗ , ) and E v T R (3)0 ,t v = E g ( v , v , X ,t − ).It follows from (E.80) that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E v T R ∗∗∗ ,t v − J (cid:88) j = − J N jJ Pr (cid:18) N jJ ≤ g ( v , v , X ∗ , ) < N ( j + 1) J (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ N J , (E.81) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E v T R (1)0 ,t v − J (cid:88) j = − J N jJ Pr (cid:18) N jJ ≤ g ( v , v , X ,t − ) < N ( j + 1) J (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ N J , (E.82)for any v , v with (cid:107) v (cid:107) = (cid:107) v (cid:107) = 1. Under the assumption in A4(ii), we have for any − J ≤ j ≤ J , t ≥ v , v with (cid:107) v (cid:107) = (cid:107) v (cid:107) = 1 that (cid:12)(cid:12)(cid:12)(cid:12) Pr (cid:18) N jJ ≤ g ( v , v , X ,t − ) < N ( j + 1) J (cid:12)(cid:12)(cid:12)(cid:12) X , = x (cid:19) − Pr (cid:18) N jJ ≤ g ( v , v , X ∗ , ) < N ( j + 1) J (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ M ( x ) ρ t − . Since (cid:82) x M ( x ) µ ( x ) dx < + ∞ , under A3, we have (cid:82) x M ( x ) ν ( x ) dx < + ∞ . 
E.5. Proof of Lemma 4

The proof of Lemma 4 is very similar to that of Lemma 3. By Condition A4(i), we obtain $\lambda_{\min}(T^{-1}\sum_{t=0}^{T-1}\mathbb{E}\,\xi_{0,t}\xi_{0,t}^T)\ge\bar{c}$ when $T$ is bounded. A4(iii) further implies that

$$\lambda_{\min}\Big[\int_{x\in\mathbb{X}}\Big\{\sum_{a\in\mathcal{A}}\xi(x,a)\xi^T(x,a)b(a|x)\Big\}\mu(x)dx\Big]\ge\bar{c},$$

as $T\to\infty$. Using similar arguments as in the first part of the proof of Lemma 3, we can show that

$$\lambda_{\min}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\xi_{0,t}\xi_{0,t}^T\Big)\ge \frac{\bar{c}}{2}, \tag{E.83}$$

as either $n\to\infty$ or $T\to\infty$. Using similar arguments in the third part of the proof of Lemma 3, we can show that

$$\Big\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\xi_{i,t}\xi_{i,t}^T-\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\xi_{0,t}\xi_{0,t}^T\Big\|_2=O_p\{L^{1/2}(nT)^{-1/2}\log(nT)\}. \tag{E.84}$$

Since $L=o\{\sqrt{nT}/\log(nT)\}$, it follows from (E.83) and (E.84) that the following event occurs with probability tending to 1:

$$\lambda_{\min}\Big(\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\xi_{i,t}\xi_{i,t}^T\Big)\ge \lambda_{\min}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\xi_{0,t}\xi_{0,t}^T\Big)-\Big\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\xi_{i,t}\xi_{i,t}^T-\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\xi_{0,t}\xi_{0,t}^T\Big\|_2\ge \frac{\bar{c}}{2}-\frac{\bar{c}}{4}=\frac{\bar{c}}{4}.$$

It remains to show $\lambda_{\max}\{(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\xi_{i,t}\xi_{i,t}^T\}=O_p(1)$ and

$$\lambda_{\max}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\xi_{0,t}\xi_{0,t}^T\Big)=O(1). \tag{E.85}$$

Suppose (E.85) holds. By (E.84) and the condition that $L=o\{\sqrt{nT}/\log(nT)\}$, we have $\lambda_{\max}\{(nT)^{-1}\sum_{i}\sum_{t}\xi_{i,t}\xi_{i,t}^T\}=O_p(1)$. Thus, it suffices to show (E.85). When $T$ is bounded, (E.85) can be proven using similar arguments in (E.56) and (E.58). When $T$ diverges, using similar arguments in the proof of Lemma 6, we can show

$$\Big\|\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\{\xi_{0,t}\xi_{0,t}^T-\xi(X^*_{0,0},A^*_{0,0})\xi^T(X^*_{0,0},A^*_{0,0})\}\Big\|_2\lesssim T^{-1/2}. \tag{E.86}$$

Similar to (E.56), we can show $\lambda_{\max}\{\mathbb{E}\,\xi(X^*_{0,0},A^*_{0,0})\xi^T(X^*_{0,0},A^*_{0,0})\}=O(1)$. This together with (E.86) yields (E.85) in the scenario where $T\to\infty$. The proof is hence completed.
E.6. Proof of Lemma 5

It follows from the Cauchy-Schwarz inequality and the triangle inequality that

$$\Big\|\int_{x\in\mathbb{X}}U(x)G(dx)\Big\|_2^2=\sum_{a=1}^{m}\Big\|\int_{x\in\mathbb{X}}\Phi_L(x)\pi(a|x)G(dx)\Big\|_2^2\ge \frac{1}{m}\Big\{\sum_{a=1}^{m}\Big\|\int_{x\in\mathbb{X}}\Phi_L(x)\pi(a|x)G(dx)\Big\|_2\Big\}^2\ge \frac{1}{m}\Big\|\int_{x\in\mathbb{X}}\sum_{a=1}^{m}\Phi_L(x)\pi(a|x)G(dx)\Big\|_2^2=\frac{1}{m}\Big\|\int_{x\in\mathbb{X}}\Phi_L(x)G(dx)\Big\|_2^2.$$

The proof is hence completed.

E.7. Proof of Theorem 2

Recall that $n=K_nn_{\min}$, $T=K_TT_{\min}$ and $|\mathcal{I}_k|=n_{\min}T_{\min}$ for any $k$. Consider first the scenario where $T$ is bounded. Since $T_{\min}=T$, the data are divided according to the trajectories they belong to. Thus, for any $k=2,\ldots,K$, the variables $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{(i,t)\in\mathcal{I}_k}$ are independent of $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{(i,t)\in\bar{\mathcal{I}}_{k-1}}$. Let

$$\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}=\mathbb{E}\big[\widehat{\Sigma}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\,\big|\,\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{(i,t)\in\bar{\mathcal{I}}_{k-1}}\big],$$

for $k=2,\ldots,K$. Under the given conditions, $K$ is bounded. Similar to Lemma 3, we can show

$$\max_{2\le k\le K}\big\|\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\le c^{-1}, \tag{E.87}$$

for some constant $c>0$, under A5*, and

$$\max_{2\le k\le K}\big\|\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\widehat{\Sigma}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\lesssim \sqrt{(nT)^{-1}L\log(nT)}, \tag{E.88}$$

with probability at least $1-O(n^{-1}T^{-1})=1-o(1)$.

Now let us consider the scenario where $T\to\infty$. For any policy $\pi$, define

$$\Sigma^*_{\pi}=\int_{x\in\mathbb{X}}\sum_{a\in\mathcal{A}}\xi(x,a)\{\xi(x,a)-\gamma u_{\pi}(x,a)\}^Tb(a|x)\mu(x)dx.$$

Using similar arguments in the proof of Lemma 3, we can show that

$$\big\|\Sigma^*_{\widehat{\pi}_{\bar{\mathcal{I}}^{(k)}}}a\big\|_2\ge \bar{c}\|a\|_2,\qquad\forall a\in\mathbb{R}^{mL}, \tag{E.89}$$

under A4*(i).

For any $(i,t)\in\mathcal{I}_k$, $X_{i,t+1}$ is independent of $\{(X_{j,t},A_{j,t},Y_{j,t},X_{j,t+1})\}_{(j,t)\in\bar{\mathcal{I}}_{k-1}}$ given $X_{i,t}$ and $A_{i,t}$. It follows that

$$\mathbb{E}\big[U_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(X_{i,t+1})\,\big|\,X_{i,t},A_{i,t},\{(X_{j,t},A_{j,t},Y_{j,t},X_{j,t+1})\}_{(j,t)\in\bar{\mathcal{I}}_{k-1}}\big]=u_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(X_{i,t},A_{i,t}),$$

and hence $n_{\min}T_{\min}\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}$ equals

$$\sum_{(i,t)\in\mathcal{I}_k}\mathbb{E}\big[\xi_{i,t}\{\xi_{i,t}-\gamma u_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(X_{i,t},A_{i,t})\}^T\,\big|\,\{(X_{j,t},A_{j,t},Y_{j,t},X_{j,t+1})\}_{(j,t)\in\bar{\mathcal{I}}_{k-1}}\big].$$

For $k=1,\ldots,K$, define $(i(k),t(k))$ to be the tuple in $\mathcal{I}_k$ such that $i\ge i(k)$ and $t\ge t(k)$ for any $(i,t)\in\mathcal{I}_k$. Then we have

$$\mathcal{I}_k=\{(i,t): i(k)\le i<i(k)+n_{\min},\ t(k)\le t<t(k)+T_{\min}\}.$$

For any $k\in\{2,\ldots,K\}$ with $t(k)=0$, $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{(i,t)\in\mathcal{I}_k}$ are independent of $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{(i,t)\in\bar{\mathcal{I}}_{k-1}}$. Similar to Lemma 6, we can show that for any such $k$, there exists some constant $c_1>0$ such that

$$\big\|\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma^*_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\le c_1T^{-1/2}.$$

This together with (E.89) implies that

$$\big\|\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}a\big\|_2\ge \frac{\bar{c}}{2}\|a\|_2,\qquad\forall a\in\mathbb{R}^{mL}, \tag{E.90}$$

for any $k$ such that $t(k)=0$.
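To make the index bookkeeping concrete, the snippet below (a hypothetical helper, not part of the paper's code; the ordering of the blocks is one arbitrary choice) constructs the rectangular blocks $\mathcal{I}_k$ of $n_{\min}$ trajectories by $T_{\min}$ consecutive time points from the definition above.

def split_blocks(n, T, K_n, K_T):
    # Partition {1,...,n} x {0,...,T-1} into K = K_n * K_T rectangles I_k,
    # each containing n_min trajectories and T_min consecutive time points.
    n_min, T_min = n // K_n, T // K_T
    blocks = []
    for bi in range(K_n):
        for bt in range(K_T):
            i0, t0 = bi * n_min + 1, bt * T_min      # the corner (i(k), t(k))
            blocks.append([(i, t)
                           for i in range(i0, i0 + n_min)
                           for t in range(t0, t0 + T_min)])
    return blocks

blocks = split_blocks(n=4, T=6, K_n=2, K_T=3)
assert len(blocks) == 6 and all(len(b) == 2 * 2 for b in blocks)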
For any $k\in\{2,\ldots,K\}$ with $t(k)>0$, we have $X_{i,t(k)}\in\{X_{j,t},A_{j,t},Y_{j,t},X_{j,t+1}\}_{(j,t)\in\bar{\mathcal{I}}_{k-1}}$ for all $i(k)\le i<i(k)+n_{\min}$. Given $X_{i,t(k)}$, $A_{i,t(k)}$ is conditionally independent of $\{X_{j,t},A_{j,t},Y_{j,t},X_{j,t+1}\}_{(j,t)\in\bar{\mathcal{I}}_{k-1}}$. Thus, we decompose $\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}$ as

$$\underbrace{\frac{1}{n_{\min}T_{\min}}\sum_{i=i(k)}^{i(k)+n_{\min}-1}\sum_{a\in\mathcal{A}}\xi(X_{i,t(k)},a)\{\xi(X_{i,t(k)},a)-\gamma u_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(X_{i,t(k)},a)\}^Tb(a|X_{i,t(k)})}_{\Sigma^{(1)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}}+\underbrace{\frac{1}{n_{\min}T_{\min}}\sum_{\substack{(i,t)\in\mathcal{I}_k\\ t>t(k)}}\mathbb{E}\big[\xi_{i,t}\{\xi_{i,t}-\gamma u_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(X_{i,t},A_{i,t})\}^T\,\big|\,\{(X_{j,t},A_{j,t},Y_{j,t},X_{j,t+1})\}_{(j,t)\in\bar{\mathcal{I}}_{k-1}}\big]}_{\Sigma^{(2)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}}.$$

Error bound for $\max_k\|\Sigma^{(1)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\|_2$: By the Cauchy-Schwarz inequality, we have

$$\big\|\Sigma^{(1)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\le \frac{1}{n_{\min}T_{\min}}\sup_{v_1,v_2\in\mathbb{S}^{mL-1}}\sum_{a\in\mathcal{A}}\sum_{i=i(k)}^{i(k)+n_{\min}-1}|v_1^T\xi(X_{i,t(k)},a)||v_2^T\xi(X_{i,t(k)},a)|b(a|X_{i,t(k)})+\frac{1}{n_{\min}T_{\min}}\sup_{v_1,v_2\in\mathbb{S}^{mL-1}}\sum_{a\in\mathcal{A}}\sum_{i=i(k)}^{i(k)+n_{\min}-1}|v_1^T\xi(X_{i,t(k)},a)||v_2^Tu_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(X_{i,t(k)},a)|b(a|X_{i,t(k)})$$

$$\le \frac{3}{2n_{\min}T_{\min}}\underbrace{\sup_{v\in\mathbb{S}^{mL-1}}\sum_{a\in\mathcal{A}}\sum_{i=i(k)}^{i(k)+n_{\min}-1}\{v^T\xi(X_{i,t(k)},a)\}^2b(a|X_{i,t(k)})}_{\eta_{7,k}}+\frac{1}{2n_{\min}T_{\min}}\underbrace{\sup_{v\in\mathbb{S}^{mL-1}}\sum_{a\in\mathcal{A}}\sum_{i=i(k)}^{i(k)+n_{\min}-1}\{v^Tu_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(X_{i,t(k)},a)\}^2b(a|X_{i,t(k)})}_{\eta_{8,k}}.$$

Notice that

$$\eta_{7,k}\le \lambda_{\max}\Big(\sum_{a\in\mathcal{A}}\sum_{i=i(k)}^{i(k)+n_{\min}-1}\xi(X_{i,t(k)},a)\xi^T(X_{i,t(k)},a)b(a|X_{i,t(k)})\Big). \tag{E.91}$$

Similar to (E.60), we can show that

$$\lambda_{\max}\Big(\sum_{a\in\mathcal{A}}\sum_{i=i(k)}^{i(k)+n_{\min}-1}\xi(X_{i,t(k)},a)\xi^T(X_{i,t(k)},a)b(a|X_{i,t(k)})\Big)\le c_2n_{\min}\lambda_{\max}\Big\{\sum_{a\in\mathcal{A}}\mathbb{E}\,\xi(X_{0,t(k)},a)\xi^T(X_{0,t(k)},a)b(a|X_{0,t(k)})\Big\}+c_2\sqrt{n_{\min}\log(nT)},$$

for some constant $c_2>0$, with probability at least $1-O(n^{-2}T^{-2})$, where the big-$O$ term is uniform in $k$. In the proof of Lemma 3, we have shown that the marginal density of $X_{0,t}$ is uniformly bounded. Similar to (E.58), we can show

$$\lambda_{\max}\Big\{\sum_{a\in\mathcal{A}}\mathbb{E}\,\xi(X_{0,t(k)},a)\xi^T(X_{0,t(k)},a)b(a|X_{0,t(k)})\Big\}\lesssim 1, \tag{E.92}$$

for any $k$. In view of (E.91), we have shown that

$$\eta_{7,k}\le c_3n_{\min}+c_3\sqrt{n_{\min}\log(nT)},$$

for some constant $c_3>0$, with probability at least $1-O(n^{-2}T^{-2})$. By Bonferroni's inequality, we obtain that

$$\max_{k\in\{2,\ldots,K\},t(k)>0}\eta_{7,k}\le c_3n_{\min}+c_3\sqrt{n_{\min}\log(nT)}, \tag{E.93}$$

with probability at least $1-O(n^{-1}T^{-1})=1-o(1)$. By Jensen's inequality, we have

$$\eta_{8,k}\le \sup_{v\in\mathbb{S}^{mL-1}}\sum_{i=i(k)}^{i(k)+n_{\min}-1}\mathbb{E}\big[\{v^TU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(X_{i,t(k)+1})\}^2\,\big|\,X_{i,t(k)},\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}\big].$$

Under A3, the density of $X_{i,t(k)+1}$ given $X_{i,t(k)}$ is uniformly bounded. Using similar arguments in bounding $\eta_4^{(2)}$ in the proof of Lemma 3, we can show there exists some constant $c_4>0$ such that

$$\sup_{v\in\mathbb{S}^{mL-1}}\mathbb{E}\big[\{v^TU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(X_{i,t(k)+1})\}^2\,\big|\,X_{i,t(k)},\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}\big]\le c_4.$$

Thus, we obtain $\max_{k\in\{2,\ldots,K\},t(k)>0}\eta_{8,k}\le c_4n_{\min}$. Combining this together with (E.93), we have wpa1 that

$$\max_{k\in\{2,\ldots,K\},t(k)>0}\big\|\Sigma^{(1)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\le \frac{3c_3+c_4}{2T_{\min}}+\frac{3c_3\sqrt{\log(nT)}}{2\sqrt{n_{\min}}\,T_{\min}}. \tag{E.94}$$
Error bound for $\max_k\|\Sigma^{(2)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-T_{\min}^{-1}(T_{\min}-1)\Sigma^*_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\|_2$: Conditional on $X_{i,t(k)}$, $X_{i,t(k)+1}$ is independent of $\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}$, for any $i$ such that $(i,t(k))\in\mathcal{I}_k$. Moreover, its conditional density function given $X_{i,t(k)}$ and $\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}$ is uniformly bounded, by A3. Similar to Lemma 6, we can show for any $i$ that satisfies $(i,t(k))\in\mathcal{I}_k$ that

$$\Big\|\frac{1}{T_{\min}-1}\sum_{t=t(k)+1}^{t(k)+T_{\min}-1}\mathbb{E}\big[\xi_{i,t}\{\xi_{i,t}-\gamma u_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(X_{i,t},A_{i,t})\}^T\,\big|\,\{(X_{j,t},A_{j,t},Y_{j,t},X_{j,t+1})\}_{(j,t)\in\bar{\mathcal{I}}_{k-1}}\big]-\Sigma^*_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\Big\|_2\le c_5T_{\min}^{-1/2},$$

for some constant $c_5>0$. This in turn yields that

$$\max_{k\in\{2,\ldots,K\},t(k)>0}\Big\|\Sigma^{(2)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\frac{T_{\min}-1}{T_{\min}}\Sigma^*_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\Big\|_2\le c_5T_{\min}^{-1/2}. \tag{E.95}$$

Combining (E.94) with (E.95), we have wpa1 that

$$\max_{k\in\{2,\ldots,K\},t(k)>0}\Big\|\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\frac{T_{\min}-1}{T_{\min}}\Sigma^*_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\Big\|_2\le \max_{k\in\{2,\ldots,K\},t(k)>0}\Big\|\Sigma^{(2)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\frac{T_{\min}-1}{T_{\min}}\Sigma^*_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\Big\|_2+\max_{k\in\{2,\ldots,K\},t(k)>0}\big\|\Sigma^{(1)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\le \frac{3c_3+c_4}{2T_{\min}}+\frac{3c_3\sqrt{\log(nT)}}{2\sqrt{n_{\min}}\,T_{\min}}+c_5T_{\min}^{-1/2}.$$

Since $K=O(1)$, we have $n_{\min}T_{\min}\gg \log(nT)$. Thus, we have wpa1 that

$$\max_{k\in\{2,\ldots,K\},t(k)>0}\Big\|\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\frac{T_{\min}-1}{T_{\min}}\Sigma^*_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\Big\|_2=o(1),$$

as $T_{\min}\to\infty$. By (E.89), we can show wpa1 that

$$\min_{k\in\{2,\ldots,K\},t(k)>0}\big\|\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}a\big\|_2\ge \frac{\bar{c}}{2}\|a\|_2,\qquad\forall a\in\mathbb{R}^{mL}.$$

This together with (E.90) yields that

$$\min_{k\in\{2,\ldots,K\}}\big\|\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}a\big\|_2\ge \frac{\bar{c}}{2}\|a\|_2,\qquad\forall a\in\mathbb{R}^{mL},$$

wpa1. Using similar arguments in the proof of Lemma 3, we obtain wpa1 that

$$\max_{k\in\{2,\ldots,K\}}\big\|\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\le 2\bar{c}^{-1}. \tag{E.96}$$

Now we provide an error bound for $\max_{k\in\{2,\ldots,K\}}\|\widehat{\Sigma}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\|_2$. Consider any $k\in\{2,\ldots,K\}$ with $t(k)=0$; then $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{(i,t)\in\mathcal{I}_k}$ are independent of $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{(i,t)\in\bar{\mathcal{I}}_{k-1}}$. Using similar arguments in Part 2 of the proof of Lemma 3, we can show wpa1 that

$$\max_{\substack{k\in\{2,\ldots,K\}\\ t(k)=0}}\big\|\widehat{\Sigma}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\lesssim \sqrt{\frac{L}{nT}\log(nT)}. \tag{E.97}$$

Consider $k\in\{2,\ldots,K\}$ with $t(k)>0$. We decompose $\widehat{\Sigma}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}$ as

$$\underbrace{\frac{1}{n_{\min}T_{\min}}\sum_{i=i(k)}^{i(k)+n_{\min}-1}\xi_{i,t(k)}\{\xi_{i,t(k)}-\gamma U_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}},i,t(k)+1}\}^T}_{\widehat{\Sigma}^{(1)}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}}+\underbrace{\frac{1}{n_{\min}T_{\min}}\sum_{\substack{(i,t)\in\mathcal{I}_k\\ t>t(k)}}\xi_{i,t}\{\xi_{i,t}-\gamma U_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}},i,t+1}\}^T}_{\widehat{\Sigma}^{(2)}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}}.$$

It follows that

$$\max_{\substack{k\in\{2,\ldots,K\}\\ t(k)>0}}\big\|\widehat{\Sigma}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\le \max_{\substack{k\in\{2,\ldots,K\}\\ t(k)>0}}\big\|\widehat{\Sigma}^{(1)}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma^{(1)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2+\max_{\substack{k\in\{2,\ldots,K\}\\ t(k)>0}}\big\|\widehat{\Sigma}^{(2)}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma^{(2)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2.$$
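The bounds (E.98) and (E.99) derived next both rest on a Bernstein-type concentration inequality for sums of independent random matrices. For the reader's convenience, one standard form (see, e.g., Tropp, 2015; this restatement is ours and is not part of the original text) reads

$$\Pr\Big(\Big\|\sum_{i=1}^{N}Z_i\Big\|_2\ge s\Big)\le 2d\exp\Big(\frac{-s^2/2}{\sigma^2+Bs/3}\Big),$$

for independent mean-zero symmetric $d\times d$ random matrices $Z_1,\ldots,Z_N$ with $\|Z_i\|_2\le B$ almost surely and $\sigma^2=\|\sum_{i=1}^{N}\mathbb{E}Z_i^2\|_2$. Taking $s\asymp\sqrt{\sigma^2\log(nT)}$, so that the tail probability is polynomially small in $nT$, with $d\asymp mL$ and $N\asymp nT/K$ is what produces deviations of order $\sqrt{(nT)^{-1}L\log(nT)}$ after normalization.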
Error bound for $\max_k\|\widehat{\Sigma}^{(1)}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma^{(1)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\|_2$: Given $\{(X_{j,t},A_{j,t},Y_{j,t},X_{j,t+1})\}_{(j,t)\in\bar{\mathcal{I}}_{k-1}}$, the pairs $(A_{i(k),t(k)},X_{i(k),t(k)+1}),\ldots,(A_{i(k)+n_{\min}-1,t(k)},X_{i(k)+n_{\min}-1,t(k)+1})$ are conditionally independent. Using the matrix concentration inequality, we can show wpa1 that

$$\max_{\substack{k\in\{2,\ldots,K\}\\ t(k)>0}}\big\|\widehat{\Sigma}^{(1)}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma^{(1)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\lesssim \sqrt{\frac{L}{nT}\log(nT)}. \tag{E.98}$$

Error bound for $\max_k\|\widehat{\Sigma}^{(2)}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma^{(2)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\|_2$: Given $\{(X_{j,t},A_{j,t},Y_{j,t},X_{j,t+1})\}_{(j,t)\in\bar{\mathcal{I}}_{k-1}}$, the trajectories

$$\{(X_{i(k),t},A_{i(k),t},Y_{i(k),t},X_{i(k),t+1}): t(k)+1\le t<t(k)+T_{\min}\},\ \ldots,\ \{(X_{i(k)+n_{\min}-1,t},A_{i(k)+n_{\min}-1,t},Y_{i(k)+n_{\min}-1,t},X_{i(k)+n_{\min}-1,t+1}): t(k)+1\le t<t(k)+T_{\min}\}$$

are conditionally independent. Moreover, for any $i$ such that $(i,t(k))\in\mathcal{I}_k$, the density function of $X_{i,t(k)+1}$ conditional on $\{(X_{j,t},A_{j,t},Y_{j,t},X_{j,t+1})\}_{(j,t)\in\bar{\mathcal{I}}_{k-1}}$ is uniformly bounded under A3. Using similar arguments in bounding $\|\widehat{\Sigma}-\Sigma\|_2$ in Part 2 of the proof of Lemma 3, we can show wpa1 that

$$\max_{\substack{k\in\{2,\ldots,K\}\\ t(k)>0}}\big\|\widehat{\Sigma}^{(2)}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma^{(2)}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\lesssim \sqrt{\frac{L}{nT}\log(nT)}. \tag{E.99}$$

Combining (E.98) with (E.99) and (E.97), we obtain wpa1 that

$$\max_{k\in\{2,\ldots,K\}}\big\|\widehat{\Sigma}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\lesssim \sqrt{\frac{L}{nT}\log(nT)}. \tag{E.100}$$

In view of (E.87), (E.88) and (E.96), we have shown that

$$\max_{k\in\{2,\ldots,K\}}\big\|\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\le c^{-1}\quad\text{and}\quad \max_{k\in\{2,\ldots,K\}}\big\|\widehat{\Sigma}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\lesssim \sqrt{\frac{L}{nT}\log(nT)},$$

for some constant $c>0$, wpa1, regardless of whether $T$ is bounded or not. Under the given conditions, we have $\sqrt{(nT)^{-1}L\log(nT)}=o(1)$. Using similar arguments in the proof of Lemma 3, we can show wpa1 that

$$\max_{k\in\{2,\ldots,K\}}\big\|\widehat{\Sigma}^{-1}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\le c_6,\qquad \max_{k\in\{2,\ldots,K\}}\big\|\widehat{\Sigma}^{-1}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2\lesssim \sqrt{\frac{L}{nT}\log(nT)}, \tag{E.101}$$

for some constant $c_6>0$. Notice that $\widehat{\pi}=\widehat{\pi}_{\bar{\mathcal{I}}^{(K)}}$. By Lemma 1, we have $Q(\widehat{\pi}_{\bar{\mathcal{I}}^{(k)}};\cdot,a)\in\Lambda(p,c')$ for any $k\in\{1,\ldots,K\}$. Using similar arguments in the proof of Theorem 12.8 of Schumaker (1981) and the proof of Proposition 5 of Meyer (1992), there exist some vectors $\{\beta^*_{\widehat{\pi}_{\bar{\mathcal{I}}^{(k)}},a}\}_{a\in\mathcal{A},1\le k\le K}$ that satisfy

$$\max_{1\le k\le K}\sup_{x\in\mathbb{X},a\in\mathcal{A}}\big|Q(\widehat{\pi}_{\bar{\mathcal{I}}^{(k)}};x,a)-\Phi_L^T(x)\beta^*_{\widehat{\pi}_{\bar{\mathcal{I}}^{(k)}},a}\big|\le CL^{-p/d}, \tag{E.102}$$

for some constant $C>0$. Similar to (E.39), we have by (E.102) that

$$\max_{(i,t)\in\mathcal{I}_k}|r_{i,t}|\le 2CL^{-p/d},\qquad\forall k=2,\ldots,K, \tag{E.103}$$

where

$$r_{i,t}=\gamma\sum_{a\in\mathcal{A}}\{\Phi_L^T(X_{i,t+1})\beta^*_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}},a}-Q(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};X_{i,t+1},a)\}\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}(a|X_{i,t+1})-\{\Phi_L^T(X_{i,t})\beta^*_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}},A_{i,t}}-Q(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};X_{i,t},A_{i,t})\}.$$
Similar to the proof of Theorem 1, we have by (E.101) and (E.103) that

$$\widehat{\beta}_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\beta^*_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}=\underbrace{\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\frac{K}{nT}\sum_{(i,t)\in\mathcal{I}_k}\xi_{i,t}\varepsilon_{i,t}}_{\zeta_{1,k}}+O(L^{-p/d})+O_p\{L(nT)^{-1}\log(nT)\},$$

where

$$\varepsilon_{i,t}=Y_{i,t}+\gamma\sum_{a\in\mathcal{A}}Q(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};X_{i,t+1},a)\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}(a|X_{i,t+1})-Q(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};X_{i,t},A_{i,t}),$$

for any $(i,t)\in\mathcal{I}_k$. To prove the asymptotic normality of $\sqrt{nT(K-1)/K}\,\widetilde{\sigma}^{-1}(G)\{\widetilde{V}(G)-V(\widehat{\pi};G)\}$, it suffices to show

$$\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\frac{\widehat{V}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)-V(\widehat{\pi};G)}{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}\xrightarrow{d}N(0,1). \tag{E.104}$$

Using similar arguments in the proof of Theorem 1, we can show

$$\Bigg|\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\frac{\widehat{V}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)-V(\widehat{\pi};G)}{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}-\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\frac{V(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)+\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\}^T\zeta_{1,k}-V(\widehat{\pi};G)}{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}\Bigg|=o_p(1). \tag{E.105}$$

Notice that $\widehat{\pi}=\widehat{\pi}_{\bar{\mathcal{I}}_K}$. Under A5, we have

$$\frac{1}{K-1}\sum_{k=2}^{K}\mathbb{E}\big|V(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)-V(\widehat{\pi};G)\big|\le \frac{1}{K-1}\sum_{k=2}^{K}\mathbb{E}\big|V(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)-V(\pi^*;G)\big|+\mathbb{E}\big|V(\widehat{\pi};G)-V(\pi^*;G)\big|\le O(1)\Big\{\frac{1}{K-1}\sum_{k=1}^{K-1}(nT)^{-b}k^{-b}K^{b}+(nT)^{-b}\Big\},$$

where $O(1)$ denotes some positive constant. Since $\sum_{k=1}^{K-1}k^{-b}\le \int_0^{K}x^{-b}dx\lesssim K^{1-b}$, we obtain that

$$\frac{1}{K-1}\sum_{k=2}^{K}\mathbb{E}\big|V(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)-V(\widehat{\pi};G)\big|=O\{(nT)^{-b}\},$$

and hence

$$\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\mathbb{E}\big|V(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)-V(\widehat{\pi};G)\big|=O\{(nT)^{1/2-b}\}=o\Big\{\Big\|\int_x\Phi_L(x)G(dx)\Big\|_2\Big\},$$

by A6. By Markov's inequality, we obtain that

$$\Big\|\int_x\Phi_L(x)G(dx)\Big\|_2^{-1}\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\big|V(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)-V(\widehat{\pi};G)\big|=o_p(1).$$

Similar to Lemma 5, we can show that for any $k\in\{2,\ldots,K\}$,

$$\Big\|\int_x\Phi_L(x)G(dx)\Big\|_2^{-1}\ge m^{-1/2}\Big\|\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\Big\|_2^{-1}.$$

It follows that

$$\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\Big\|\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\Big\|_2^{-1}\big|V(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)-V(\widehat{\pi};G)\big|=o_p(1). \tag{E.106}$$

Using similar arguments in the proof of Theorem 1, we can show

$$\max_{k\in\{2,\ldots,K\}}\Bigg|\frac{\widehat{\sigma}^2_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}{\sigma^2(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}-1\Bigg|=o_p(1).$$
Since

$$\Bigg|\frac{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}-1\Bigg|=\Bigg|\frac{\widehat{\sigma}^2_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}{\sigma^2(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}-1\Bigg|\,\Bigg|\frac{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}+1\Bigg|^{-1},$$

we obtain

$$\max_{k\in\{2,\ldots,K\}}\Bigg|\frac{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}-1\Bigg|=o_p(1). \tag{E.107}$$

Similar to (E.46), we can show there exists some constant $c_7>0$ such that the following occurs wpa1:

$$\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)\ge c_7\Big\|\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\Big\|_2,\qquad\forall k\in\{2,\ldots,K\}.$$

This together with (E.107) yields that

$$\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)\ge \frac{c_7}{2}\Big\|\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\Big\|_2,\qquad\forall k\in\{2,\ldots,K\},$$

wpa1. In view of (E.106), we obtain

$$\Bigg|\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\frac{V(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)-V(\widehat{\pi};G)}{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}\Bigg|=o_p(1).$$

Combining this together with (E.105), we obtain that

$$\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\frac{\widehat{V}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)-V(\widehat{\pi};G)}{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}-\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\frac{\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\}^T\zeta_{1,k}}{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}=o_p(1).$$

To prove (E.104), it suffices to show

$$\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\frac{\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\}^T\zeta_{1,k}}{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}\xrightarrow{d}N(0,1). \tag{E.108}$$

The LHS of (E.108) can be further decomposed as

$$\underbrace{\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\frac{\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\}^T\zeta_{1,k}}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}}_{\eta_9}+\underbrace{\sqrt{\frac{nT}{K(K-1)}}\sum_{k=2}^{K}\frac{\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\}^T\zeta_{1,k}}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}\Bigg\{\frac{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}{\widehat{\sigma}_{\mathcal{I}_k}(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}-1\Bigg\}}_{\eta_{10}}.$$

In the following, we show $\eta_9\xrightarrow{d}N(0,1)$ and $\eta_{10}\xrightarrow{P}0$. Assertion (E.108) thus follows from Slutsky's theorem.

Notice that $\eta_9$ equals

$$\eta_9=\sqrt{\frac{K}{nT(K-1)}}\sum_{k=2}^{K}\sum_{(i,t)\in\mathcal{I}_k}\frac{\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\}^T\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\xi_{i,t}\varepsilon_{i,t}}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}.$$

For any $1\le g\le nT$, there exists some integer $k(g)$ that satisfies $\{k(g)-1\}n_{\min}T_{\min}+1\le g\le k(g)n_{\min}T_{\min}$. Let $t(g)$ and $i(g)$ be the integers that satisfy

$$g-\{k(g)-1\}n_{\min}T_{\min}=\{i(g)-i(k(g))\}T_{\min}+t(g)-t(k(g))+1.$$
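A hypothetical helper (not from the paper) makes this correspondence concrete: it maps the running index $g$ to the block $k(g)$ and the pair $(i(g),t(g))$, given the block corners $(i(k),t(k))$.

def locate(g, n_min, T_min, corners):
    # corners[k] = (i(k), t(k)), the smallest indices in block I_k (k = 1, 2, ...).
    size = n_min * T_min
    k = (g - 1) // size + 1                      # the block index k(g)
    r = g - (k - 1) * size - 1                   # 0-based offset within block k(g)
    i0, t0 = corners[k]
    return k, i0 + r // T_min, t0 + r % T_min    # (k(g), i(g), t(g))

# Example with n = 6 trajectories, T = 6 time points, n_min = T_min = 3,
# and one (arbitrary) ordering of the four blocks.
corners = {1: (1, 0), 2: (1, 3), 3: (4, 0), 4: (4, 3)}
assert locate(1, 3, 3, corners) == (1, 1, 0)
assert locate(9, 3, 3, corners) == (1, 3, 2)
assert locate(10, 3, 3, corners) == (2, 1, 3)

The arithmetic inside locate is exactly the displayed relation: the within-block offset $r+1$ equals $\{i(g)-i(k(g))\}T_{\min}+t(g)-t(k(g))+1$.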
Let $\mathcal{F}^{(0)}=\{X_{1,0},A_{1,0}\}$. Then we iteratively define $\{\mathcal{F}^{(g)}\}_{1\le g\le nT}$ as follows:

$$\mathcal{F}^{(g)}=\begin{cases}\mathcal{F}^{(g-1)}\cup\{Y_{i(g),T-1},X_{i(g),T}\}, & \text{if }t(g+1)>0\text{ and }t(g)=T-1,\\ \mathcal{F}^{(g-1)}\cup\{Y_{i(g),t(g)},X_{i(g),t(g)+1},A_{i(g),t(g)+1}\}, & \text{if }t(g+1)>0\text{ and }t(g)<T-1,\\ \mathcal{F}^{(g-1)}\cup\{Y_{i(g),T-1},X_{i(g),T},X_{i(g+1),0},A_{i(g+1),0}\}, & \text{if }t(g+1)=0\text{ and }t(g)=T-1,\\ \mathcal{F}^{(g-1)}\cup\{Y_{i(g),t(g)},X_{i(g),t(g)+1},A_{i(g),t(g)+1},X_{i(g+1),0},A_{i(g+1),0}\}, & \text{otherwise}.\end{cases}$$

Let $\xi^{(g)}=\xi_{i(g),t(g)}$ and $\varepsilon^{(g)}=\varepsilon_{i(g),t(g)}$. We rewrite $\eta_9$ as

$$\eta_9=\sqrt{\frac{K}{nT(K-1)}}\sum_{g=nT/K+1}^{nT}\frac{\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)}}(x)G(dx)\}^T\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)}}\xi^{(g)}\varepsilon^{(g)}}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)};G)}.$$

One can show that $\eta_9$ forms a mean-zero martingale with respect to the filtration $\{\sigma(\mathcal{F}^{(g)})\}_{g\ge nT/K}$. Using similar arguments in the proof of Theorem 1, we can show that

$$\sqrt{\frac{K}{nT(K-1)}}\max_{g\in\{nT/K+1,\ldots,nT\}}\Bigg|\frac{\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)}}(x)G(dx)\}^T\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)}}\xi^{(g)}\varepsilon^{(g)}}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)};G)}\Bigg|\xrightarrow{P}0. \tag{E.109}$$

Moreover, we have

$$\Bigg|\frac{K}{nT(K-1)}\sum_{g=nT/K+1}^{nT}\frac{\big[\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)}}(x)G(dx)\}^T\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)}}\xi^{(g)}\varepsilon^{(g)}\big]^2}{\sigma^2(\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)};G)}-1\Bigg|\le \max_{k\in\{2,\ldots,K\}}\Bigg|\frac{K}{nT}\sum_{(i,t)\in\mathcal{I}_k}\frac{\big[\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\}^T\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\xi_{i,t}\varepsilon_{i,t}\big]^2}{\sigma^2(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}-1\Bigg|\le \max_{k\in\{2,\ldots,K\}}\frac{\big\|\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\}^T\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2^2\,\big\|\widehat{\Omega}^*_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Omega_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2}{\sigma^2(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)},$$

where $\widehat{\Omega}^*_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}=|\mathcal{I}_k|^{-1}\sum_{(i,t)\in\mathcal{I}_k}\xi_{i,t}\xi_{i,t}^{\top}\varepsilon_{i,t}^2$. Similar to the proof of Theorem 1, we can show $\max_{k\in\{2,\ldots,K\}}\|\widehat{\Omega}^*_{\mathcal{I}_k,\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}-\Omega_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\|_2=o_p(1)$. Similar to (E.45), we can show there exists some constant $c_8>0$ such that

$$\max_{k\in\{2,\ldots,K\}}\frac{\big\|\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}(x)G(dx)\}^T\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}}}\big\|_2}{\sigma(\widehat{\pi}_{\bar{\mathcal{I}}_{k-1}};G)}\le c_8. \tag{E.110}$$

It follows that

$$\Bigg|\frac{K}{nT(K-1)}\sum_{g=nT/K+1}^{nT}\frac{\big[\{\int_xU_{\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)}}(x)G(dx)\}^T\Sigma^{-1}_{\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)}}\xi^{(g)}\varepsilon^{(g)}\big]^2}{\sigma^2(\widehat{\pi}_{\bar{\mathcal{I}}(k(g)-1)};G)}-1\Bigg|\xrightarrow{P}0. \tag{E.111}$$

Using a martingale central limit theorem for triangular arrays (Corollary 2.8 of McLeish, 1974), we have by (E.109) and (E.111) that $\eta_9\xrightarrow{d}N(0,1)$.

E.8. Proof of Theorem 3

Based on the discussions in Section 3.2.3, it suffices to show

$$\mathbb{E}\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}G(dx)=O(|\mathcal{I}|^{-b^*(1+\alpha)}),$$

and

$$\mathbb{E}\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}dx=O(|\mathcal{I}|^{-b^*(1+\alpha)}). \tag{E.112}$$

We only prove (E.112) for brevity. Under the given conditions, we have $\Pr(\mathcal{A}_0)\ge 1-O(|\mathcal{I}|^{-\kappa})$, where

$$\mathcal{A}_0=\Big\{\sup_{x\in\mathbb{X},a\in\mathcal{A}}|\widehat{Q}_{\mathcal{I}}(x,a)-Q^{opt}(x,a)|\le C|\mathcal{I}|^{-b^*}\Big\}.$$
Notice that

$$\mathbb{E}\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}dx=\mathbb{E}\Big[\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}dx\,\mathbb{I}(\mathcal{A}_0^c)\Big]+\mathbb{E}\Big[\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}dx\,\mathbb{I}(\mathcal{A}_0)\Big]. \tag{E.113}$$

Under A1, $Q^{opt}$ is uniformly bounded. Therefore, the first term on the RHS of (E.113) is upper bounded by $O\{\Pr(\mathcal{A}_0^c)\}=O(|\mathcal{I}|^{-\kappa})$. Since $\kappa$ can be chosen arbitrarily large, it suffices to show

$$\mathbb{E}\Big[\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}dx\,\mathbb{I}(\mathcal{A}_0)\Big]=O\{|\mathcal{I}|^{-b^*(1+\alpha)}\}. \tag{E.114}$$

For any $x\in\mathbb{X}$, suppose

$$\max_aQ^{opt}(x,a)-\max_{a\in\mathcal{A}-\arg\max_{a'}Q^{opt}(x,a')}Q^{opt}(x,a)>2C|\mathcal{I}|^{-b^*}. \tag{E.115}$$

On the event $\mathcal{A}_0$, we then have

$$\max_a\widehat{Q}_{\mathcal{I}}(x,a)-\max_{a\in\mathcal{A}-\arg\max_{a'}Q^{opt}(x,a')}\widehat{Q}_{\mathcal{I}}(x,a)>0,$$

and hence $\{a\in\mathcal{A}:\widehat{\pi}_{\mathcal{I}}(a|x)=1\}\subseteq\arg\max_{a\in\mathcal{A}}Q^{opt}(x,a)$. Thus, we have

$$\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}=0,$$

when (E.115) holds. Let $\mathbb{X}_*$ denote the set of $x$ that satisfies (E.115). It follows that

$$\mathbb{E}\Big[\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}dx\,\mathbb{I}(\mathcal{A}_0)\Big]=\mathbb{E}\Big[\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}\mathbb{I}(x\in\mathbb{X}_*^c)dx\,\mathbb{I}(\mathcal{A}_0)\Big]. \tag{E.116}$$

Let $\widehat{a}_{\mathcal{I}}(x)=\mathrm{sargmax}_a\,\widehat{Q}_{\mathcal{I}}(x,a)$. Similarly, we can show the event $\widehat{a}_{\mathcal{I}}(x)\notin\arg\max_{a\in\mathcal{A}}Q^{opt}(x,a)$ occurs only when

$$\max_aQ^{opt}(x,a)-Q^{opt}(x,\widehat{a}_{\mathcal{I}}(x))\le 2C|\mathcal{I}|^{-b^*}. \tag{E.117}$$

Since $\max_aQ^{opt}(x,a)-Q^{opt}(x,\widehat{a}_{\mathcal{I}}(x))=\sum_aQ^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}$, we obtain

$$\mathbb{E}\Big[\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}\mathbb{I}(x\in\mathbb{X}_*^c)dx\,\mathbb{I}(\mathcal{A}_0)\Big]\le 2C|\mathcal{I}|^{-b^*}\int_x\mathbb{I}(x\in\mathbb{X}_*^c)dx=2C|\mathcal{I}|^{-b^*}O(|\mathcal{I}|^{-b^*\alpha})=O(|\mathcal{I}|^{-b^*(1+\alpha)}),$$

where the first equality follows from A6. Combining this together with (E.116) yields (E.114). The proof is hence completed.

E.9. Proof of Theorem 4

For a given $\varepsilon>0$, let $A_*=\{x:\max_aQ^{opt}(x,a)-Q^{opt}(x,\widehat{a}_{\mathcal{I}}(x))\le\varepsilon\}$. Notice that

$$\mathbb{E}\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}dx=\mathbb{E}\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}\mathbb{I}(x\in A_*)dx+\mathbb{E}\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}\mathbb{I}(x\in A_*^c)dx. \tag{E.118}$$

Using similar arguments in the proof of Theorem 3, we can show

$$\mathbb{E}\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}\mathbb{I}(x\in A_*)dx\le \varepsilon\,\mathbb{E}\int_x\mathbb{I}(x\in A_*)dx\le \varepsilon\,\lambda\Big\{x:\max_aQ^{opt}(x,a)-\max_{a\in\mathcal{A}-\arg\max_{a'}Q^{opt}(x,a')}Q^{opt}(x,a)\le\varepsilon\Big\}=O(\varepsilon^{1+\alpha}). \tag{E.119}$$

Moreover, similar to (E.117), we can show the event $\widehat{a}_{\mathcal{I}}(x)\notin\arg\max_{a\in\mathcal{A}}Q^{opt}(x,a)$ occurs only when

$$\max_aQ^{opt}(x,a)-Q^{opt}(x,\widehat{a}_{\mathcal{I}}(x))\le 2\max_{a\in\mathcal{A}}|\widehat{Q}_{\mathcal{I}}(x,a)-Q^{opt}(x,a)|.$$

On $A_*^c$ the left-hand side exceeds $\varepsilon$, while it is bounded by $2\max_{a\in\mathcal{A}}|\widehat{Q}_{\mathcal{I}}(x,a)-Q^{opt}(x,a)|$; it is therefore at most $\{2\max_{a\in\mathcal{A}}|\widehat{Q}_{\mathcal{I}}(x,a)-Q^{opt}(x,a)|\}^2/\varepsilon$. It follows that

$$\mathbb{E}\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}\mathbb{I}(x\in A_*^c)dx\le \frac{4}{\varepsilon}\,\mathbb{E}\int_x\max_{a\in\mathcal{A}}|\widehat{Q}_{\mathcal{I}}(x,a)-Q^{opt}(x,a)|^2dx\le \frac{4}{\varepsilon}\,\mathbb{E}\int_x\sum_{a\in\mathcal{A}}|\widehat{Q}_{\mathcal{I}}(x,a)-Q^{opt}(x,a)|^2dx=O(\varepsilon^{-1}|\mathcal{I}|^{-2b^*}).$$

Combining this together with (E.118) and (E.119) yields that

$$\mathbb{E}\int_x\sum_{a\in\mathcal{A}}Q^{opt}(x,a)\{\pi^{opt}(a|x)-\widehat{\pi}_{\mathcal{I}}(a|x)\}dx=O(\varepsilon^{1+\alpha})+O(\varepsilon^{-1}|\mathcal{I}|^{-2b^*}).$$

The proof is hence completed by setting $\varepsilon=|\mathcal{I}|^{-2b^*/(2+\alpha)}$.
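The exponent in the final choice of $\varepsilon$ comes from equating the two error terms; the one-line calculation below (a routine verification added here for clarity) gives the resulting rate:

$$\varepsilon^{1+\alpha}=\varepsilon^{-1}|\mathcal{I}|^{-2b^*}\iff \varepsilon^{2+\alpha}=|\mathcal{I}|^{-2b^*}\iff \varepsilon=|\mathcal{I}|^{-2b^*/(2+\alpha)},$$

under which both terms, and hence the regret bound in Theorem 4, are of order $|\mathcal{I}|^{-2b^*(1+\alpha)/(2+\alpha)}$.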
F. Additional details regarding Condition A4

When $\nu=\mu$, the density function of $X_{0,1}$ equals $\mu$ as well. By Jensen's inequality, we have for any $v\in\mathbb{R}^{mL}$ that

$$\{v^Tu_{\pi}(x,a)\}^2\le \mathbb{E}\big[\{v^TU_{\pi}(X_{0,1})\}^2\,\big|\,X_{0,0}=x,A_{0,0}=a\big],$$

and hence

$$\int_{x\in\mathbb{X}}\sum_{a\in\mathcal{A}}\{v^Tu_{\pi}(x,a)\}^2b(a|x)\mu(x)dx\le \int_{x\in\mathbb{X}}\{v^TU_{\pi}(x)\}^2\mu(x)dx.$$

It follows that the matrix

$$\int_{x\in\mathbb{X}}\sum_{a\in\mathcal{A}}\{U_{\pi}(x)U_{\pi}^T(x)-u_{\pi}(x,a)u_{\pi}^T(x,a)\}b(a|x)\mu(x)dx$$

is positive semidefinite. Consequently,

$$\lambda_{\min}\Big[\int_{x\in\mathbb{X}}\sum_{a\in\mathcal{A}}\{\xi(x,a)\xi^T(x,a)-\gamma u_{\pi}(x,a)u_{\pi}^T(x,a)\}b(a|x)\mu(x)dx\Big]\ge \lambda_{\min}\Big[\int_{x\in\mathbb{X}}\sum_{a\in\mathcal{A}}\{\xi(x,a)\xi^T(x,a)-\gamma U_{\pi}(x)U_{\pi}^T(x)\}b(a|x)\mu(x)dx\Big].$$

When $\pi$ is a deterministic policy, $\sum_{a\in\mathcal{A}}\xi(x,a)\xi^T(x,a)b(a|x)-\gamma U_{\pi}(x)U_{\pi}^T(x)$ is a block diagonal matrix. To show A4(i) holds, it thus suffices to show

$$\lambda_{\min}\Big[\int_{x\in\mathbb{X}}\Phi_L(x)\Phi_L^T(x)\{b(a|x)-\gamma\pi(a|x)\}\mu(x)dx\Big]>0,\qquad\forall a\in\mathcal{A}.$$

Suppose $b$ is the $\epsilon$-greedy policy with respect to $\pi$, i.e., $b(a|x)=\epsilon m^{-1}+(1-\epsilon)\pi(a|x)$ for any $a\in\{1,\ldots,m\}$. If $\epsilon$ satisfies $\epsilon\le 1-\gamma$, we have

$$\lambda_{\min}\Big[\int_{x\in\mathbb{X}}\Phi_L(x)\Phi_L^T(x)\{b(a|x)-\gamma\pi(a|x)\}\mu(x)dx\Big]\ge \frac{\epsilon}{m}\lambda_{\min}\Big\{\int_{x\in\mathbb{X}}\Phi_L(x)\Phi_L^T(x)\mu(x)dx\Big\}.$$

Suppose A3 holds. It then suffices to require

$$\lambda_{\min}\Big\{\int_{x\in\mathbb{X}}\Phi_L(x)\Phi_L^T(x)dx\Big\}>0. \tag{F.120}$$

The condition in (F.120) is automatically satisfied when A2 holds (see, e.g., Burman and Chen, 1989; Chen and Christensen, 2015).

G. Sensitivity test for the parameter η

In this section, we conduct a sensitivity test for the parameter $\eta$ in the number of basis functions $L=\lfloor(nT)^{\eta}\rfloor$. We consider the simulation of the off-policy evaluation with a fixed target policy in Section 5.1. For Scenarios (A), (B) and (C), we set $n=100$ and $T=100$, and vary $\eta$ over a grid of six values. The results are displayed in Fig. 6.
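The parameter $\eta$ enters only through the integer $L=\lfloor(nT)^{\eta}\rfloor$; the toy computation below (the grid shown is illustrative, not the exact grid of the experiment) indicates how quickly the number of basis functions grows with $\eta$ at $n=T=100$.

n, T = 100, 100
for eta in (0.1, 0.2, 1 / 3, 0.4):           # an illustrative grid of eta values
    L = int((n * T) ** eta)                   # number of basis functions floor((nT)^eta)
    print(f"eta = {eta:.3f} -> L = {L}")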
H. Double fitted Q-iteration

In this section, we introduce our algorithm for computing the estimated optimal policy in our numerical studies. The proposed algorithm is based on FQI, which recursively updates the estimated optimal Q-function by some supervised learning method (see Example 2 in Section 3.2.3). In FQI, at each iteration, a maximization over the estimated Q-function is used as an estimate of the maximum of the true Q-function. This can lead to a significant positive bias (Sutton and Barto, 2018). Hasselt (2010) proposed a double Q-learning method to reduce the maximization bias. Here, we apply similar ideas to FQI to compute the estimated optimal policy. We use a pseudocode to summarize our algorithm below.

Algorithm 1: Double Fitted Q-iteration Algorithm

Input: the observed tuples $\{(X_{i,t},A_{i,t},Y_{i,t},X_{i,t+1})\}_{1\le i\le n,\,0\le t<T}$.
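To complement the pseudocode, the sketch below gives one possible Python implementation of double fitted Q-iteration with a linear sieve basis. It is a minimal illustration of the idea described above, where the greedy action in the Bellman target is selected by one Q-estimate and evaluated by the other, which mitigates the maximization bias of vanilla FQI; all names (double_fqi, basis, n_iter, ridge, the random split) are our own choices rather than the paper's implementation.

import numpy as np

def double_fqi(X, A, Y, Xnext, num_actions, basis, gamma=0.9,
               n_iter=100, ridge=1e-3, seed=0):
    # X, Xnext: (N, d) arrays of pooled states; A: (N,) actions in {0,...,m-1};
    # Y: (N,) rewards; basis: callable mapping an (N, d) array to an (N, L) matrix.
    rng = np.random.default_rng(seed)
    Phi, PhiNext = basis(X), basis(Xnext)
    N, L = Phi.shape
    beta = [np.zeros((num_actions, L)) for _ in range(2)]   # two Q-estimates
    half = rng.integers(0, 2, size=N)                       # random 50/50 data split
    for _ in range(n_iter):
        for j in (0, 1):
            sel, ev = beta[j], beta[1 - j]
            # Greedy action selected by estimate j, evaluated by estimate 1 - j.
            a_star = np.argmax(PhiNext @ sel.T, axis=1)
            q_next = np.einsum("nl,nl->n", PhiNext, ev[a_star])
            target = Y + gamma * q_next
            new = np.zeros_like(sel)
            for a in range(num_actions):
                idx = (A == a) & (half == j)
                if not idx.any():
                    continue
                P = Phi[idx]
                # Per-action ridge regression of the Bellman target on the basis.
                new[a] = np.linalg.solve(P.T @ P + ridge * np.eye(L),
                                         P.T @ target[idx])
            beta[j] = new
    beta_bar = (beta[0] + beta[1]) / 2
    # The estimated optimal policy is greedy with respect to the averaged estimate.
    return lambda x: int(np.argmax(basis(np.atleast_2d(x)) @ beta_bar.T))

Averaging the two estimates at the end is one simple way to report a single policy; keeping both estimates and selecting with one while evaluating with the other inside the loop is the step that addresses the maximization bias.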