Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
Philip S. Thomas
philipt@cs.cmu.edu
Carnegie Mellon University
Emma Brunskill
ebrun@cs.cmu.edu
Carnegie Mellon University
Abstract
In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have orders of magnitude lower mean squared error than existing methods—it makes more efficient use of the available data. Our new estimator is based on two advances: an extension of the doubly robust estimator (Jiang & Li, 2015), and a new way to mix between model-based estimates and importance sampling based estimates.
1. Introduction
The ability to predict the performance of a policy without actually having to use it is crucial to the responsible use of reinforcement learning algorithms. Consider the setting where the user of a reinforcement learning algorithm has already deployed some policy, e.g., for determining which advertisement to show a user visiting a website (Theocharous et al., 2015), for determining which medical treatment to suggest for a patient (Thapa et al., 2005), or for suggesting a personalized curriculum for a student (Mandel et al., 2014). In these examples, using a bad policy can be costly or dangerous, so it is important that the user of a reinforcement learning algorithm be able to accurately predict how well a new policy will perform without having to deploy it.

In this paper we propose a new algorithm for tackling this performance prediction problem, which is called the off-policy policy evaluation (OPE) problem. The primary objective in OPE problems is to produce estimates that minimize some notion of error. We select mean squared error, a popular notion of error for estimators, as our loss function. This is in line with previous works that all use (root) mean squared error when empirically validating their methods (Precup et al., 2000; Dudík et al., 2011; Mahmood et al., 2014; Thomas, 2015b; Jiang & Li, 2015).

Given this goal, an estimator should be strongly consistent—its mean squared error should converge almost surely to zero as the amount of available data increases. (A formal definition of what it means for an estimator to be strongly consistent is provided in Appendix A, where, in Lemma 3, we also elucidate the relationship between strong consistency and mean squared error.) In this paper we introduce a new strongly consistent estimator, MAGIC, that directly optimizes mean squared error. Our empirical results show that MAGIC can produce estimates with orders of magnitude lower mean squared error than the estimates produced by existing algorithms.

Our new algorithm comes from the synthesis of two new contributions. The first contribution is an extension of the recently proposed doubly robust (DR) OPE algorithm (Jiang & Li, 2015). We present a novel derivation of their algorithm that removes the assumption that the horizon is finite and known. We also give conditions under which the DR estimator is strongly consistent. We then show how we can significantly reduce the variance of the DR estimator by introducing a small amount of bias—an effective trade-off when attempting to minimize the mean squared error of the estimates. We call our extension of the DR estimator the weighted doubly robust (WDR) estimator.

Our second major contribution is a new estimator, which we call the blending IS and model (BIM) estimator, that combines two different OPE estimators not just by selecting between them, but by blending them together in a way that minimizes the mean squared error. The combination of these two contributions results in a particularly powerful new OPE algorithm that we call the model and guided importance sampling combining (MAGIC) estimator, which uses BIM to combine a purely model-based estimator with WDR. In our simulations, MAGIC has the best general performance, often exhibiting orders of magnitude lower mean squared error than prior state-of-the-art estimators.
2. Notation
We assume that the reader is familiar with reinforcement learning (Sutton & Barto, 1998) and adopt notational standard MDPNv1 for
Markov decision processes (Thomas, 2015a, MDPs). For simplicity, our notation assumes that the state, action, and reward sets are finite, although our results carry over to more general settings. (We have verified that our results carry over to the setting where the states, actions, and rewards are continuous random variables with density functions. This result is relatively straightforward—summations are replaced with integrals, probability mass functions with probability density functions, and probabilities with probability densities, where appropriate. We have included the corresponding assumption, which is implied when the setting is restricted to finite state, action, and reward sets, since it is necessary for the continuous setting. We have not verified our results for the setting where the states, actions, and rewards come from distributions that do not have probability mass or density functions, e.g., if the rewards come from the Cantor distribution.)

Let $H := (S_0, A_0, R_0, S_1, \ldots)$ be a trajectory, and let $g(H) := \sum_{t=0}^{\infty} \gamma^t R_t$ denote the return of a trajectory. (For alliteration, one might think of $H$ as denoting the history of an episode.) We assume that the (possibly unknown) minimum and maximum rewards, $r_{\min}$ and $r_{\max}$, are finite and that $\gamma \in [0,1]$ for the finite-horizon setting and $\gamma \in [0,1)$ for the indefinite and infinite horizon settings, so that $g(H)$ is bounded. We use the discounted objective function, $v(\pi) := \mathbf{E}[g(H) \mid H \sim \pi]$, where $H \sim \pi$ denotes that $H$ was generated using the policy $\pi$. When dealing with multiple trajectories, we use superscripts to denote which trajectory a term comes from. For example, we write $S_t^H$ to denote the state at time $t$ during trajectory $H$. Let $v^\pi$ and $q^\pi$ be the state value function and state-action value function for policy $\pi$—for all $(\pi, s, a) \in \Pi \times \mathcal{S} \times \mathcal{A}$, let $v^\pi(s) := \mathbf{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \pi]$ and $q^\pi(s, a) := \mathbf{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, A_0 = a, \pi]$. Notice that $v$ without a superscript denotes the objective function, while $v^\pi$ denotes a value function.

We will assume that historical data, $\mathcal{D}$, is provided. Formally, this historical data is a set of $n \in \mathbb{N}_{>0}$ trajectories and the known policies, called behavior policies, that were used to generate them. That is, $\mathcal{D} := \{(H_i, \pi_i)\}_{i=1}^{n}$, where $H_i \sim \pi_i$. Importantly, when we write $H_i$, we always mean that $H_i \sim \pi_i$. Let $\rho_t(H, \pi_e, \pi_b) := \prod_{i=0}^{t} \pi_e(A_i^H \mid S_i^H) / \pi_b(A_i^H \mid S_i^H)$ be an importance weight, which is the probability of the first $t$ steps of $H$ under the evaluation policy, $\pi_e$, divided by its probability under the behavior policy, $\pi_b$ (Precup et al., 2000, Section 2). For brevity, we write $\rho_t^i$ as shorthand for $\rho_t(H_i, \pi_e, \pi_i)$ and $\rho_t$ as shorthand for $\rho_t(H, \pi_e, \pi_b)$. To simplify later expressions, let $\rho_{-1}^i := 1$ for all $i$. One of the primary challenges will be to combat the high variance and large range of the importance weights, $\rho_t$.

Some of the methods that we describe use an approximate model of an MDP. Let $\hat{r}^{\pi}(s, a, t)$ denote the model's prediction of the reward $t$ steps later, $R_t$, if $S_0 = s$, $A_0 = a$, and the policy $\pi$ is used to generate the subsequent actions, $A_1, A_2, \ldots$. For example, $\hat{r}^{\pi}(s, a, 0)$ is a prediction of the immediate reward after taking action $a$ in state $s$ and is thus the same for all policies, $\pi$.
We assume that these predictions are bounded by finite (possibly unknown) constants $r^{\text{model}}_{\min}$ and $r^{\text{model}}_{\max}$, i.e., $\hat{r}^{\pi}(s, a, t) \in [r^{\text{model}}_{\min}, r^{\text{model}}_{\max}]$. Let
$$\hat{r}^{\pi}(s, t) := \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \hat{r}^{\pi}(s, a, t), \qquad (1)$$
be a prediction of $R_t$ if $S_0 = s$ and the policy $\pi$ is used to generate actions $A_0, A_1, \ldots$, for all $(s, t, \pi) \in \mathcal{S} \times \mathbb{N}_{\geq 0} \times \Pi$. Let $\hat{v}^{\pi}(s) := \sum_{t=0}^{\infty} \gamma^t \hat{r}^{\pi}(s, t)$ and $\hat{q}^{\pi}(s, a) := \sum_{t=0}^{\infty} \gamma^t \hat{r}^{\pi}(s, a, t)$ be the model's estimates of $v^{\pi}(s)$ and $q^{\pi}(s, a)$. We assume that if a terminal absorbing state, $s_\infty$, is reached, the model's predictions of rewards that occur thereafter are always zero: $\hat{r}^{\pi}(s_\infty, a, t) = 0$ for all $(\pi, a, t) \in \Pi \times \mathcal{A} \times \mathbb{N}_{\geq 0}$. Although better models will tend to improve our estimates, we make no assumptions about the veracity of the approximate model's predictions, $\hat{r}^{\pi}(s, a, t)$.
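To make this notation concrete, here is a minimal sketch (not code from the paper) of how the return $g(H)$ and the per-decision importance weights $\rho_t$ might be computed for a single trajectory. The representation of a trajectory as parallel lists of states, actions, and rewards, and the callables `pi_e(s, a)` and `pi_b(s, a)` returning action probabilities, are assumptions made only for this example.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Return g(H) = sum_t gamma^t R_t for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def importance_weights(states, actions, pi_e, pi_b):
    """Per-decision importance weights rho_t = prod_{i<=t} pi_e(a_i|s_i) / pi_b(a_i|s_i).

    pi_e and pi_b are assumed to be callables returning the probability of
    taking action a in state s under the evaluation and behavior policies.
    """
    rhos = []
    rho = 1.0
    for s, a in zip(states, actions):
        rho *= pi_e(s, a) / pi_b(s, a)
        rhos.append(rho)
    return np.array(rhos)
```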
3. Off-Policy Policy Evaluation (OPE)
The problem of off-policy policy evaluation (OPE) is defined as follows. We are given an evaluation policy, $\pi_e$, historical data, $\mathcal{D}$, and an approximate model. Our goal is to produce an estimator, $\hat{v}(\mathcal{D})$, of $v(\pi_e)$ that has low mean squared error (MSE): $\operatorname{MSE}(\hat{v}(\mathcal{D}), v(\pi_e)) := \mathbf{E}\big[(\hat{v}(\mathcal{D}) - v(\pi_e))^2\big]$. We use capital letters to denote random variables, and so the random terms in expected values are always the capitalized letters (e.g., $\mathcal{D}$ is a random variable). We assume that the process producing states, actions, and rewards is an MDP with an unknown initial state distribution, transition function, and reward function. We assume that the evaluation policy, $\pi_e$, the behavior policies, $\pi_i$, $i \in \{1, \ldots, n\}$, and the discount parameter, $\gamma$, are known. For a review of OPE methods, see the works of Precup et al. (2000) or Thomas (2015b, Chapter 3). More recent methods can be found in the works of Jiang & Li (2015) and Mandel et al. (2016).
4. Doubly Robust (DR) Estimator
The doubly robust (DR) estimator (Jiang & Li, 2015) is a new unbiased estimator of $v(\pi_e)$ that achieves promising empirical and theoretical results by leveraging an approximate model of an MDP to decrease the variance of the unbiased estimates produced by ordinary importance sampling (Precup et al., 2000). It is doubly robust in that it will provide "good" estimates if either the model is accurate or the behavior policies are known. By "good" it is meant that if the former does not hold then the estimator will remain unbiased (although it might have high variance and thus high mean squared error), and if the latter does not hold then if the model has low error the doubly robust estimator will also tend to have low error. Doubly robust estimators were introduced and remain popular in the statistics community (Rotnitzky & Robins, 1995; Heejung & Robins, 2005).

The work that introduced the DR estimator for MDPs (Jiang & Li, 2015) derived it as a generalization of a doubly robust estimator for bandits (Dudík et al., 2011). This may be why the DR estimator was derived only for the finite horizon setting where the horizon is known (every trajectory must terminate within $L < \infty$ time steps, and $L$ must be known). It also resulted in a recursive definition of the DR estimator that can be difficult to interpret. In Appendix B we instead derive the DR estimator for MDPs as an application of control variates. Our new derivation holds without assumptions on the horizon and gives the intuitive non-recursive definition, where $w_t^i = \rho_t^i / n$:
$$\operatorname{DR}(\mathcal{D}) := \sum_{i=1}^{n} \sum_{t=0}^{\infty} \gamma^t w_t^i R_t^{H_i} - \sum_{i=1}^{n} \sum_{t=0}^{\infty} \gamma^t \Big( w_t^i\, \hat{q}^{\pi_e}\big(S_t^{H_i}, A_t^{H_i}\big) - w_{t-1}^i\, \hat{v}^{\pi_e}\big(S_t^{H_i}\big) \Big). \qquad (2)$$

In Appendix B we show that this definition is equivalent to that of Jiang & Li (2015) when the horizon is finite and known, and we provide several new theoretical results pertaining to the DR estimator. Specifically, we give conditions for DR to be an unbiased estimator without assumptions on the horizon, and we give the first proofs that it is a strongly consistent estimator. Although these are important properties to establish, we relegate them to an appendix due to space limitations.

The non-recursive definition of the DR estimator presented in (2) also reveals the close relationship of the DR estimator to advantage sum estimators. Advantage sum estimators were introduced as a way to lower the variance of on-policy Monte Carlo performance estimates for a setting that is a generalization of the (partially observable) MDP setting (Zinkevich et al., 2006; White & Bowling, 2009). The DR estimator for the on-policy setting can be found in the work of Zinkevich et al. (2006, Equation 8). One may therefore view the DR estimator (Jiang & Li, 2015) as the extension of the advantage sum estimator (Zinkevich et al., 2006) to the off-policy setting, or as the extension of the doubly robust estimator for bandits (Dudík et al., 2011) to the sequential setting. We are therefore not the first to show that the DR estimator can be viewed as an application of control variates, since Veness et al. (2011, Section 3.1) point out that the advantage sum estimator is an application of control variates. Still, our derivation in Appendix B of the DR estimator is novel.

The DR estimator is not purely model based, since it uses importance weights.
However, it is also not a model-free importance sampling method, since it uses an approximate model to decrease the variance of its estimates. We therefore refer to it as a guided importance sampling method, since the approximate model is used to guide, but not completely replace, the importance sampling estimates.
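As a concrete illustration of the non-recursive form in (2), the sketch below computes the DR estimate. It is an assumption-laden example rather than the authors' code: each trajectory is assumed to be stored as a dictionary with keys 'states', 'actions', 'rewards', and precomputed per-decision weights 'rhos', and q_hat and v_hat stand in for the approximate model's $\hat{q}^{\pi_e}$ and $\hat{v}^{\pi_e}$.

```python
def dr_estimate(trajectories, gamma, q_hat, v_hat):
    """Doubly robust estimate of v(pi_e), following the non-recursive form in (2).

    Each trajectory is a dict with 'states', 'actions', 'rewards', and 'rhos'
    (the per-decision importance weights rho_t for that trajectory).
    """
    n = len(trajectories)
    total = 0.0
    for traj in trajectories:
        rho_prev = 1.0  # rho_{-1} := 1, so w_{-1} = 1 / n
        for t, (s, a, r) in enumerate(zip(traj['states'], traj['actions'], traj['rewards'])):
            w = traj['rhos'][t] / n       # DR uses w_t = rho_t / n
            w_prev = rho_prev / n
            # gamma^t * (w_t * R_t - (w_t * q_hat(s, a) - w_{t-1} * v_hat(s)))
            total += (gamma ** t) * (w * r - (w * q_hat(s, a) - w_prev * v_hat(s)))
            rho_prev = traj['rhos'][t]
    return total
```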
5. Weighted Doubly Robust (WDR) Estimator
Empirical and theoretical results show that the DR estimator developed, analyzed, and tested by Jiang & Li (2015) can significantly reduce the variance of ordinary importance sampling without introducing bias. The fact that it does not introduce bias can be particularly important when the estimator is used to produce confidence bounds on $v(\pi_e)$ (Thomas, 2015b). However, in practice these confidence bounds often require an impractical amount of data before they are tight enough to be useful, and so approximate confidence bounds (e.g., bootstrap confidence bounds) are used instead (Theocharous et al., 2015; Thomas, 2015b). When using these approximate confidence bounds, the strict requirement that an OPE estimator be an unbiased estimator of $v(\pi_e)$ is not necessary. Furthermore, often the goal of OPE is not to produce confidence bounds, but to produce the best estimate of $v(\pi_e)$ possible, in order to determine whether $\pi_e$ should be used instead of the current behavior policy, or as an internal mechanism in a policy search algorithm (Levine & Koltun, 2013). In these cases, the "best" estimator is typically defined as the one that has the lowest mean squared error (MSE). For example, in their experiments, Precup et al. (2000), Dudík et al. (2011), Mahmood et al. (2014), Thomas (2015b), and Jiang & Li (2015) all use the (root) MSE when evaluating OPE methods.

Although unbiasedness might seem like a desirable property of an estimator, when the goal is to minimize MSE, it often is not. In general, the MSE of an estimator, $\hat{\theta}$, of a statistic, $\theta$, can be decomposed into its variance and its squared bias: $\operatorname{MSE}(\hat{\theta}, \theta) = \mathbf{E}[(\theta - \hat{\theta})^2] = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2$, where $\operatorname{Bias}(\hat{\theta}) := \mathbf{E}[\hat{\theta}] - \theta$. The optimal estimator in terms of MSE is typically one that balances this bias-variance trade-off, not one with zero bias. Therefore, in the context of minimizing MSE, strong asymptotic consistency, which requires the MSE of an estimator to almost surely converge to zero as the amount of available data increases, is a more desirable property than unbiasedness.

In this section we propose a new OPE estimator that we call the weighted doubly robust (WDR) estimator. The WDR estimator comes from applying to the DR estimator a simple, well-known extension of importance sampling estimators, producing a new guided importance sampling method. This extension does not directly optimize the bias-variance trade-off, but it does tend to balance it significantly better while maintaining asymptotic consistency. More specifically, WDR is based on weighted importance sampling (Powell & Swann, 1966) as opposed to ordinary importance sampling (Hammersley & Handscomb, 1964). For further discussion of the benefits of weighted importance sampling over ordinary importance sampling, see the work of Thomas (2015b, Section 3.8). Weighted importance sampling has been used before for OPE (Precup et al., 2000), but not in conjunction with the DR estimator.

Our WDR estimator is defined as the DR estimator in (2), except where $w_t^i := \rho_t^i / \sum_{j=1}^{n} \rho_t^j$. Intuitively it is clear that this estimator is asymptotically correct because $\mathbf{E}[\rho_t^j] = 1$, and so by the law of large numbers the denominator of $w_t^i$ will converge to $n$. Although WDR is not an unbiased estimator of $v(\pi_e)$, its bias follows a pattern that is both predictable and also sometimes desirable.
When there is only a single trajectory, i.e., $n = 1$, $\operatorname{WDR}(\mathcal{D})$ is an unbiased estimator of the performance of the behavior policy, since $w_t = 1$ for all $t$. If there is a single behavior policy, $\pi_b$, then as the number of trajectories increases, the expected value of $\operatorname{WDR}(\mathcal{D})$ shifts away from $v(\pi_b)$ towards $v(\pi_e)$. (Just as DR-v2 extends the DR estimator (Jiang & Li, 2015, Section 4.4), one can create the WDR-v2 estimator by replacing $\hat{q}^{\pi_e}(S_t, A_t)$ with $\hat{r}^{\pi_e}(S_t, A_t, 0) + \gamma\, \hat{v}^{\pi_e}(S_{t+1})$ in (2). Empirical results for DR-v2 and WDR-v2 are included in the spreadsheet in the supplemental documents. For the domains presented here, these variants did not outperform the original DR and WDR estimators.)

Before presenting theoretical results about the WDR estimator, we introduce the assumptions that they will require. These assumptions are only included if they are explicitly mentioned in a theorem—most theorems rely on only a few of these assumptions. Even when these assumptions are not satisfied, it does not mean that the result does not hold or that the WDR estimator will perform poorly—it merely means that the theoretical results that we provide are not guaranteed by our proofs.

Assumption 1 ensures that all trajectories of interest when evaluating $\pi_e$ will be produced by all of the behavior policies. This is a standard assumption in OPE and typically precludes the use of deterministic behavior policies. (Assumption 1 could be replaced with a less-restrictive assumption like that used by Thomas (2015b, Section 3.5). We use Assumption 1 because it allows for simplified proofs.)

Assumption 1 (Absolute continuity). For all $(s, a, i) \in \mathcal{S} \times \mathcal{A} \times \{1, \ldots, n\}$, if $\pi_i(a \mid s) = 0$ then $\pi_e(a \mid s) = 0$.

Assumption 2 requires all of the behavior policies to be identical. This assumption is trivially satisfied if data is collected from one behavior policy. Also, this assumption is often satisfied for applications where data is abundant, so that evaluation can be performed using only the data from the most recent behavior policy. Also, Assumption 3 requires the horizon, $L$, to be finite.

Assumption 2 (Single behavior policy). For all $(i, j) \in \{1, \ldots, n\}^2$, $\pi_i = \pi_j$.

Assumption 3. $L$ is finite.

Assumption 4 requires the importance weights, $\rho_t^i$, to be bounded above by a finite constant, $\beta \in \mathbb{R}$ (they are always bounded below by zero). It is trivially satisfied in the common setting where the horizon is finite and the state and action sets are finite. Although Assumption 4 requires $\beta$ to exist, none of our results depend on how large $\beta$ is. So, in the non-finite state, action, and horizon settings one may ensure that evaluation policies are only considered if they satisfy Assumption 4 for some arbitrarily large $\beta$.

Assumption 4 (Bounded importance weight). There exists a constant $\beta < \infty$ such that for all $(t, i) \in \mathbb{N}_{\geq 0} \times \{1, \ldots, n\}$, $\rho_t^i \leq \beta$ surely.

In the following theorems, Theorems 1 and 2, we give two different sets of assumptions that are sufficient to show that WDR is a strongly consistent estimator of $v(\pi_e)$. Notice that if the sets of states and actions are finite and the horizon is finite, then Assumption 4 holds, and so Theorem 2 means that WDR will be strongly consistent given only Assumption 1.

Theorem 1 (WDR – strongly consistent estimator for one behavior policy, finite horizon). If Assumptions 1, 2, and 3 hold then $\operatorname{WDR}(\mathcal{D}) \stackrel{a.s.}{\longrightarrow} v(\pi_e)$.

Proof.
See Appendix C.1.
Theorem 2 (WDR – strongly consistent estimator for many behavior policies). If Assumptions 1 and 4 hold then $\operatorname{WDR}(\mathcal{D}) \stackrel{a.s.}{\longrightarrow} v(\pi_e)$.

Proof.
See Appendix C.2.
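The only change from DR to WDR is the normalization of the weights. Below is a minimal sketch, assuming that the $n$ trajectories share a common horizon $L$ and are stored as $(n, L)$ NumPy arrays (a simplification; variable-length episodes would need padding using the absorbing-state convention):

```python
import numpy as np

def wdr_estimate(rhos, rewards, q_vals, v_vals, gamma):
    """Weighted doubly robust estimate of v(pi_e).

    rhos, rewards, q_vals, v_vals are assumed to be (n, L) arrays where
    rhos[i, t] = rho_t for trajectory i, q_vals[i, t] = q_hat(S_t, A_t), and
    v_vals[i, t] = v_hat(S_t). Identical to DR except that
    w_t^i = rho_t^i / sum_j rho_t^j instead of rho_t^i / n.
    """
    n, L = rhos.shape
    w = rhos / rhos.sum(axis=0, keepdims=True)            # w[i, t]
    w_prev = np.hstack([np.full((n, 1), 1.0 / n),          # w_{-1}^i = 1/n, since rho_{-1} = 1
                        w[:, :-1]])
    discounts = gamma ** np.arange(L)
    correction = w * q_vals - w_prev * v_vals
    return float(np.sum(discounts * (w * rewards - correction)))
```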
6. Empirical Studies (WDR)
In order to both show the empirical benefits of WDR over existing importance sampling estimators and better motivate our second major contribution, in this section we present an empirical comparison of different OPE methods. (The raw data for all of the plots provided in this paper, as well as additional plots, are available in the spreadsheet included in the supplemental material.) We compare to a broad sampling of model-free importance sampling estimators, definitions of which can be found in the work of Thomas (2015b, Chapter 3): importance sampling (IS), per-decision importance sampling
(PDIS), weighted importance sampling (WIS), and consistent weighted per-decision importance sampling (CWPDIS). We also compare to the guided importance sampling doubly robust (DR) estimator (Jiang & Li, 2015). Lastly, we compare to the approximate model (AM) estimator, which uses all of the available data to construct an approximate model of the MDP. (This model-based estimator has been called the direct method in previous work (Dudík et al., 2011); however, in other previous work direct methods are model-free while indirect methods are model-based (Sutton & Barto, 1998, Section 9.2).) The performance of the evaluation policy on the approximate model is typically easy to compute and can be used as an estimate of $v(\pi_e)$. For example, in our experiments the approximate model maintains an estimate, $\hat{d}$, of the initial state distribution, and so we define $\operatorname{AM} := \sum_{s \in \mathcal{S}} \hat{d}(s)\, \hat{v}^{\pi_e}(s)$. Notice that unlike the importance sampling based methods, AM does not include any importance weights ($\rho_t$ terms).

Figure 1: Empirical results for three different experimental setups: (a) Gridworld, (b) ModelFail, and (c) ModelWin. All plots in this paper have the same format: they show the mean squared error of different estimators as $n$, the number of episodes in $\mathcal{D}$, increases. Both axes always use a logarithmic scale and standard error bars are included over repeated trials. All plots use the following legend: IS, PDIS, WIS, CWPDIS, DR, AM, WDR.

In Appendix D we provide detailed descriptions of the experimental setup and results. Here we provide only an overview. We used three domains: a gridworld previously constructed specifically for evaluating OPE methods (Thomas, 2015b, Section 2.5), as well as two simple domains that we developed to exemplify the settings where different methods excel and fail. In our simulations, WDR dominated the other importance sampling and guided importance sampling estimators (but not AM). Not only did WDR always achieve the lowest mean squared error of these estimators, but no other single (guided) importance sampling estimator was able to always achieve mean squared errors within an order of magnitude of WDR's. Figure 1a is an example using the gridworld where no other method achieved mean squared errors within an order of magnitude of WDR's.

The second notable trend is that WDR often significantly outperformed AM. We constructed a simple MDP and experimental setup that we call ModelFail, which exemplifies this. In the ModelFail experiments, the approximate model uses function approximation, which causes it to converge to the wrong MDP. This results in AM's mean squared error plateauing at a non-zero value, as shown in Figure 1b. Since WDR remains strongly consistent in this setting, its MSE converges almost surely to zero.

The third notable trend is that when an accurate approximate model is available, WDR does not always outperform AM. We constructed a simple MDP that we call
ModelWin, which exemplifies this. In the ModelWin MDP, the approximate model quickly converges to the correct MDP, and so AM outperforms WDR. This is depicted in Figure 1c.

One might wonder why DR and WDR can do worse than AM even though they incorporate the approximate model. Notice that we can write the DR and WDR estimators as:
$$\operatorname{WDR}(\mathcal{D}) := \underbrace{\frac{1}{n}\sum_{i=1}^{n} \hat{v}^{\pi_e}\big(S_0^{H_i}\big)}_{(a)} + \sum_{i=1}^{n} \sum_{t=0}^{\infty} \gamma^t w_t^i \underbrace{\Big[ R_t^{H_i} - \hat{q}^{\pi_e}\big(S_t^{H_i}, A_t^{H_i}\big) + \gamma\, \hat{v}^{\pi_e}\big(S_{t+1}^{H_i}\big) \Big]}_{(b)}. \qquad (3)$$

If the approximate model is perfect, then (a) is both a low variance and unbiased estimator of $v(\pi_e)$. If the approximate model is perfect and $R_t$ and $S_{t+1}$ are deterministic functions of $S_t$ and $A_t$, then (b) is zero, and so the second term is always zero and WDR is an excellent estimator. However, if $R_t$ or $S_{t+1}$ is not a deterministic function of $S_t$ and $A_t$—if the state transitions or rewards are stochastic—then (b) is not necessarily zero. If the importance weights, $w_t^i$, have high variance, then even slightly non-zero values of (b) can cause DR and WDR to have high variance.

In summary, in our experiments, WDR dominated the other (guided) importance sampling estimators, sometimes achieving orders of magnitude lower MSE. However, the experiments also show that WDR is not always the best estimator—sometimes AM can produce estimates with an order of magnitude lower MSE. This trend is also visible in the results of Jiang & Li (2015), where AM performs better than DR (although they did not compare to WDR, since it had not yet been introduced). Ideally we would like an estimator that combines WDR and AM or switches between them automatically, to always achieve the performance of the better estimator. In the following sections we show how this can be done.
7. Blending IS and Model (BIM) Estimator
In this section we show how two OPE estimators can be merged into a single estimator that exhibits the desirable properties of both. Before doing so, we establish some terminology. We divide OPE estimators into three classes. The first class we call importance sampling estimators. We define this class to include all estimators that, when $L$ is finite, are defined using all of the importance weights $\rho_0, \rho_1, \ldots, \rho_{L-1}$. Notice that this includes IS, PDIS, WIS, and CWPDIS, as well as the guided importance sampling methods, DR and WDR.

The second class we call purely model-based estimators. We define this class to include all estimators that do not contain any $\rho_t$ terms for $t \geq 0$. The only purely model-based estimator in this paper is AM. Finally, we call the third class partial importance sampling estimators. These estimators are those that do not fall into either of the other two classes—estimators that use importance weights, $\rho_t$, but only for $t < L - 1$. We will introduce one such estimator later in this section.

We contend that importance sampling estimators and purely model-based estimators are two extremes on a spectrum of estimators. Importance sampling estimators tend to be strongly consistent. That is, as more historical data becomes available, their estimates become increasingly accurate. However, their use of importance weights means that they all (including DR and WDR) also can have high variance relative to purely model-based estimators. This is evident in the results on the ModelWin domain.

On the other end of the spectrum, purely model-based estimators like AM are often not strongly consistent. If the approximate model uses function approximation or if there is some partial observability, then the approximate model may not converge to the true MDP. So, as more historical data becomes available, the estimates of AM may converge to a value other than $v(\pi_e)$. Thus, purely model-based estimators tend to have high bias, even asymptotically, as evidenced by the AM curve in Figure 1b. However, purely model-based methods also tend to have low variance because they do not contain any $\rho_t$ terms.

Between these two extremes lies a range of partial importance sampling estimators. Estimators that are close to the purely model-based estimators use $\rho_t$ terms only for small $t$, while estimators that are close to importance sampling estimators use $\rho_t$ terms with large $t$ approaching $L - 1$. Before formally defining one such partial importance sampling estimator, we present a few additional definitions. First, let $\operatorname{IS}^{(j)}(\mathcal{D})$ denote an estimate of $\mathbf{E}[\sum_{t=0}^{j} \gamma^t R_t \mid H \sim \pi_e]$, constructed from $\mathcal{D}$ using an importance sampling method like PDIS or WDR, which uses importance weights up to and including $\rho_j$. Similarly, let $\operatorname{AM}^{(j)}(\mathcal{D})$ denote a primarily model-based prediction from $\mathcal{D}$ of $\mathbf{E}[\sum_{t=j}^{\infty} \gamma^t R_t \mid H \sim \pi_e]$ that may not use $\rho_t$ terms with $t \geq j$.

We can now define a partial importance sampling estimator that we call the off-policy $j$-step return, $g^{(j)}(\mathcal{D})$, which uses an importance sampling based method to predict the outcome of using $\pi_e$ up until $R_j$ is generated, and the approximate model estimator to predict the outcomes thereafter. That is, for all $j \in \mathbb{N}_{\geq -1}$, let
$$g^{(j)}(\mathcal{D}) := \operatorname{IS}^{(j)}(\mathcal{D}) + \operatorname{AM}^{(j+1)}(\mathcal{D}), \qquad g^{(\infty)}(\mathcal{D}) := \lim_{j \to \infty} g^{(j)}(\mathcal{D}). \qquad (4)$$
We refer to $j$ as the length of the $j$-step return. Notice that $g^{(-1)}(\mathcal{D})$ is a purely model-based estimator, $g^{(\infty)}(\mathcal{D})$ is an importance sampling estimator, and the other off-policy $j$-step returns are partial importance sampling estimators that blend between these two extremes. When $j$ is small, the off-policy $j$-step return is similar to AM, using importance sampling to predict only a few early rewards. When $j$ is large, it uses importance sampling to predict most of the rewards and the model only for a few rewards at the end of a trajectory. So, as $j$ increases, we expect the variance of the return to increase, but the bias to decrease.

We propose a new estimator, which we call the blending IS and model (BIM) estimator, that leverages this spectrum of estimators to blend together the IS and AM estimators in a way that minimizes MSE. It does this by computing a weighted average of the different length returns: $\operatorname{BIM}(\mathcal{D}) := x^\intercal g(\mathcal{D})$, where $x := (x_{-1}, x_0, x_1, \ldots)^\intercal$ is an infinite-dimensional weight vector and $g(\mathcal{D})$ is an infinite-dimensional vector of different length returns, $g(\mathcal{D}) := (g^{(-1)}(\mathcal{D}), g^{(0)}(\mathcal{D}), \ldots)^\intercal$. (If prior knowledge about the initial state distribution is available, then one might consider adding an additional return that denotes the model's prediction of $v(\pi_e)$ under that distribution, which might differ from $g^{(-1)}(\mathcal{D})$.) Although in theory $x$ can be infinite, in practice there is always finite data, so $x$ is finite. The remaining question is then: how should we select the weights, $x$?

A similar question has been studied before in reinforcement learning research when deciding how to weight $j$-step returns (not off-policy), as reviewed by Sutton & Barto (1998, Section 7.2). The most common solution, a complex return called the $\lambda$-return, uses $x_{-1} = 0$ and $x_j = (1 - \lambda)\lambda^j$ for all other $j$. The $\lambda$-return is the foundation of the entire TD($\lambda$) family of algorithms, which includes the original linear-time algorithm (Sutton, 1988), least-squares formulations (Bradtke & Barto, 1996; Mahmood et al., 2014), methods for adapting $\lambda$ (Downey & Sanner, 2010), true-online methods (van Hasselt et al., 2014), and the recent emphatic methods (Mahmood et al., 2015).

Recent work has suggested that the $\lambda$-return could be replaced by more statistically principled complex returns like the $\gamma$-return (Konidaris et al., 2011) or $\Omega$-return (Thomas et al., 2015). For the finite-horizon setting and for $j \in \{0, \ldots, L-1\}$, the $\gamma$-return uses $x_j := \big(\sum_{i=0}^{j} \gamma^{2i}\big)^{-1} \big/ \sum_{\hat{j}=0}^{L-1} \big(\sum_{i=0}^{\hat{j}} \gamma^{2i}\big)^{-1}$, and the $\Omega$-return uses $x_j = \sum_{i=0}^{L-1} \Omega_n^{-1}(j, i) \big/ \sum_{\hat{j}, i=0}^{L-1} \Omega_n^{-1}(\hat{j}, i)$, where $\Omega_n$ is the $L \times L$ covariance matrix with $\Omega_n(i, j) = \operatorname{Cov}(g^{(i)}(\mathcal{D}), g^{(j)}(\mathcal{D}))$, and where both the $\gamma$- and $\Omega$-returns use $x_j = 0$ for $j$
$\notin \{0, \ldots, L-1\}$.

The advantage of the $\gamma$-return over the $\lambda$-return is that it uses a more accurate model of how variance increases with the length of a return, which also eliminates the $\lambda$ hyperparameter used by the $\lambda$-return. The advantages of the $\Omega$-return over the $\gamma$-return are that it both uses a yet more accurate estimate of how variance grows with the length of the return, which is computed from historical data, and that it better accounts for the fact that different length returns are not independent, i.e., $g^{(i)}(\mathcal{D})$ and $g^{(j)}(\mathcal{D})$ are not independent even if $i \neq j$.

However, none of these weighting schemes is sufficient for our needs because they do not cause BIM to necessarily be a strongly consistent estimator. (The $\lambda$-return with $\lambda = 1$ is defined to be $g^{(\infty)}(\mathcal{D})$ and is consistent, but it does not mix the two OPE methods at all.) This is likely because they were all designed for the setting where only one trajectory is available, i.e., $n = 1$, while strong consistency is a property that deals with performance as $n \to \infty$. Furthermore, they were designed for on-policy policy evaluation.

We therefore propose a new weighting scheme (a new complex return for multiple trajectories) that directly optimizes our primary objective: the mean squared error. This new weighting scheme is $x^\star := \arg\min_{x \in \mathbb{R}^\infty} \operatorname{MSE}(x^\intercal g(\mathcal{D}), v(\pi_e))$. Unfortunately, we typically cannot compute $x^\star$, because we do not know $\operatorname{MSE}(x^\intercal g(\mathcal{D}), v(\pi_e))$ for any $x$. Instead, we propose estimating $x^\star$ by minimizing an approximation of $\operatorname{MSE}(x^\intercal g(\mathcal{D}), v(\pi_e))$. First, dealing with an infinite number of different return lengths is challenging. To avoid this, we propose only using a subset of the returns, $\{g^{(j)}(\mathcal{D})\}$, for $j \in \mathcal{J}$, where $|\mathcal{J}| < \infty$. For all $j$
$\notin \mathcal{J}$, we assign $x_j = 0$. We suggest including $-1$ and $\infty$ in $\mathcal{J}$. To simplify later notation, let $g_{\mathcal{J}}(\mathcal{D}) \in \mathbb{R}^{|\mathcal{J}|}$ be the elements of $g(\mathcal{D})$ whose indexes are in $\mathcal{J}$—the returns that will not necessarily be given weights of zero. Also let $\mathcal{J}_j$ denote the $j$th element of $\mathcal{J}$. We can then estimate $x^\star$ by $\hat{x}^\star \in \arg\min_{x \in \mathbb{R}^{|\mathcal{J}|}} \operatorname{MSE}(x^\intercal g_{\mathcal{J}}(\mathcal{D}), v(\pi_e))$, where our estimate of $x^\star_j$ is zero if $j$
$\notin \mathcal{J}$ and our estimate of $x^\star_{\mathcal{J}_j}$ is $\hat{x}^\star_j$ for $j \in \{1, \ldots, |\mathcal{J}|\}$.

Next, to avoid searching all of $\mathbb{R}^{|\mathcal{J}|}$ and also to serve as a form of regularization on $\hat{x}^\star$, we limit the set of $x$ that we consider to the $|\mathcal{J}|$-simplex, i.e., we require $x_j \geq 0$ for all $j \in \{1, \ldots, |\mathcal{J}|\}$ and $\sum_{j=1}^{|\mathcal{J}|} x_j = 1$. We write $\Delta^{|\mathcal{J}|}$ to denote this set of weight vectors—the $|\mathcal{J}|$-simplex.

Using the bias-variance decomposition of MSE, we therefore have that
$$\hat{x}^\star \in \arg\min_{x \in \Delta^{|\mathcal{J}|}} \operatorname{Bias}\big(x^\intercal g_{\mathcal{J}}(\mathcal{D})\big)^2 + \operatorname{Var}\big(x^\intercal g_{\mathcal{J}}(\mathcal{D})\big) = \arg\min_{x \in \Delta^{|\mathcal{J}|}} x^\intercal \left[\Omega_n + b_n b_n^\intercal\right] x,$$
where $n$ remains the number of trajectories in $\mathcal{D}$, $\Omega_n$ is the $|\mathcal{J}| \times |\mathcal{J}|$ covariance matrix with $\Omega_n(i, j) = \operatorname{Cov}(g^{(\mathcal{J}_i)}(\mathcal{D}), g^{(\mathcal{J}_j)}(\mathcal{D}))$, and $b_n$ is the $|\mathcal{J}|$-dimensional vector with $b_n(j) = \mathbf{E}[g^{(\mathcal{J}_j)}(\mathcal{D})] - v(\pi_e)$ for all $j \in \{1, \ldots, |\mathcal{J}|\}$. (Since $b_n$ (similarly, $\Omega_n$) already has a subscript, we write $b_n(j)$ to denote the $j$th element of $b_n$.) This simplifies the problem of estimating the MSE for all possible $x$ into estimating two terms: the bias vector, $b_n$, and the covariance matrix, $\Omega_n$.

Let $\hat{b}_n$ and $\hat{\Omega}_n$ be the estimates of $b_n$ and $\Omega_n$ when there are $n$ trajectories in $\mathcal{D}$. The exact scheme used to estimate $b_n$ and $\Omega_n$ depends on the definitions of $\operatorname{IS}^{(j)}(\mathcal{D})$ and $\operatorname{AM}^{(j)}(\mathcal{D})$. In general, both terms are easier to estimate for unweighted importance sampling estimators like PDIS and DR than for weighted estimators like CWPDIS or WDR.

To make the dependence of BIM on the estimates of $\Omega_n$ and $b_n$ explicit, and to summarize the approximations we have made, we redefine the BIM estimator as $\operatorname{BIM}(\mathcal{D}, \hat{\Omega}_n, \hat{b}_n) := (\hat{x}^\star)^\intercal g_{\mathcal{J}}(\mathcal{D})$, where
$$\hat{x}^\star \in \arg\min_{x \in \Delta^{|\mathcal{J}|}} x^\intercal \left[\hat{\Omega}_n + \hat{b}_n \hat{b}_n^\intercal\right] x.$$

In the next section we propose using WDR as the importance sampling method, IS, and show how $b_n$ and $\Omega_n$ can be approximated in this setting. First we show that if at least one of the returns included in $\mathcal{J}$ is a strongly consistent estimator of $v(\pi_e)$, and if the estimates of $b_n$ and $\Omega_n$ are themselves strongly consistent, then BIM is a strongly consistent estimator of $v(\pi_e)$:

Theorem 3.
If Assumption 4 holds, there exists at least one $j \in \mathcal{J}$ such that $g^{(j)}(\mathcal{D})$ is a strongly consistent estimator of $v(\pi_e)$, $\hat{b}_n - b_n \stackrel{a.s.}{\longrightarrow} 0$, and $\hat{\Omega}_n - \Omega_n \stackrel{a.s.}{\longrightarrow} 0$, then $\operatorname{BIM}(\mathcal{D}, \hat{\Omega}_n, \hat{b}_n) \stackrel{a.s.}{\longrightarrow} v(\pi_e)$.

Proof.
See Appendix E.
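The optimization over the simplex that defines $\hat{x}^\star$ can be carried out with any constrained quadratic solver. The sketch below is one possible implementation, not the paper's: it assumes SciPy's SLSQP solver is available and that $\hat{\Omega}_n$ and $\hat{b}_n$ have already been estimated.

```python
import numpy as np
from scipy.optimize import minimize

def bim_weights(omega_hat, b_hat):
    """Estimate x* by minimizing x^T [Omega_hat + b_hat b_hat^T] x over the simplex.

    omega_hat: |J| x |J| estimated covariance of the returns in J.
    b_hat: |J|-vector of estimated biases.
    """
    k = len(b_hat)
    A = omega_hat + np.outer(b_hat, b_hat)
    result = minimize(
        lambda x: float(x @ A @ x),
        x0=np.full(k, 1.0 / k),                  # start from the uniform weighting
        method='SLSQP',
        bounds=[(0.0, 1.0)] * k,                 # x_j >= 0
        constraints=[{'type': 'eq', 'fun': lambda x: np.sum(x) - 1.0}],  # sum to 1
    )
    return result.x

# The BIM estimate is then the weighted combination of the returns in g_J(D):
# estimate = bim_weights(omega_hat, b_hat) @ g_J
```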
8. Model and Guided Importance Sampling Combining (MAGIC) Estimator
In this section we propose using the BIM estimator with WDR as the importance sampling estimator. The resulting estimator combines purely model-based estimates with the estimates of the guided importance sampling algorithm WDR, and so we call it the model and guided importance sampling combining (MAGIC) estimator.

Although the derivation of how to properly define $\operatorname{IS}^{(j)}(\mathcal{D})$ and $\operatorname{AM}^{(j)}(\mathcal{D})$ in order to blend WDR with the approximate model is less obvious than one might expect, and therefore an important technical detail, we relegate it to Appendix F due to space restrictions. The resulting definition of an off-policy $j$-step return is, for all $j \in \mathbb{N}_{\geq -1}$:
$$g^{(j)}(\mathcal{D}) := \underbrace{\sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t w_t^i R_t^{H_i}}_{(a)} + \underbrace{\sum_{i=1}^{n} \gamma^{j+1} w_j^i\, \hat{v}^{\pi_e}\big(S_{j+1}^{H_i}\big)}_{(b)} - \underbrace{\sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t \Big( w_t^i\, \hat{q}^{\pi_e}\big(S_t^{H_i}, A_t^{H_i}\big) - w_{t-1}^i\, \hat{v}^{\pi_e}\big(S_t^{H_i}\big) \Big)}_{(c)},$$
where (c) is the combined control variate for both the importance sampling based term, (a), and the model-based term, (b), and where we use WDR's definition of $w_t^i$.

We estimate $\Omega_n$ from the $n$ trajectories in $\mathcal{D}$ using a sample covariance matrix, $\hat{\Omega}_n$. See Appendix G for details and pseudocode for MAGIC.

Estimating the bias vector, $b_n$, is challenging because it has a strong dependence on the value that we wish we knew, $v(\pi_e)$. We cannot use AM's estimate as a stand-in for $v(\pi_e)$ because it would cause us to assume that AM's greatest weakness—its high bias—is negligible. We cannot use WDR's estimate (or any other importance sampling estimator's estimate) because our estimate of $b_n$ would then conflate the high variance of importance sampling estimates with the bias that we wish to estimate.

When $n$, the number of trajectories in $\mathcal{D}$, is small, variance tends to be the root cause of high MSE. We therefore propose using an estimate of $b_n$ that is initially conservative—initially it underestimates the bias—but which becomes correct as $n$ increases. Let $\operatorname{CI}(g^{(\infty)}(\mathcal{D}), \delta)$ be a $1 - \delta$ confidence interval on the expected value of the random variable $g^{(\infty)}(\mathcal{D}) = \operatorname{WDR}(\mathcal{D})$. Intuitively, as $n$ increases we expect that this confidence interval will converge to $g^{(\infty)}(\mathcal{D})$, which in turn converges to $v(\pi_e)$. So, we estimate $b_n(j)$, the bias of the off-policy $j$-step return, by its distance from the confidence interval. That is, we estimate $b_n(j)$ as
$$\hat{b}_n(j) := \operatorname{dist}\Big( g^{(\mathcal{J}_j)}(\mathcal{D}),\, \operatorname{CI}\big(g^{(\infty)}(\mathcal{D}), \delta\big) \Big),$$
where $\operatorname{dist}(y, \mathcal{Z})$ is the distance between $y \in \mathbb{R}$ and the set $\mathcal{Z} \subseteq \mathbb{R}$: $\operatorname{dist}(y, \mathcal{Z}) = \min_{z \in \mathcal{Z}} |y - z|$. We use both the percentile bootstrap confidence interval (Efron & Tibshirani, 1993) and Chernoff-Hoeffding's inequality—whichever is tighter—for CI in our experiments.

In Theorem 4 we show that the MAGIC estimator is a strongly consistent estimator of $v(\pi_e)$ given one set of assumptions that we used to show that WDR is strongly consistent and that WDR is included as one of the off-policy $j$-step returns.

Theorem 4 (MAGIC – strongly consistent). If Assumptions 1 and 4 hold and $\infty \in \mathcal{J}$ then $\operatorname{MAGIC}(\mathcal{D}) \stackrel{a.s.}{\longrightarrow} v(\pi_e)$.

Proof.
See Appendix H.
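To illustrate the bias estimate, the following sketch uses hypothetical helper functions (not the pseudocode from Appendix G): a percentile-bootstrap confidence interval obtained by resampling trajectories and recomputing the estimate, and the distance-based bias estimates $\hat{b}_n(j)$. The `estimator` argument could, for example, wrap the WDR computation sketched earlier.

```python
import numpy as np

def percentile_bootstrap_ci(trajectories, estimator, delta=0.1, n_boot=1000, seed=0):
    """(1 - delta) percentile-bootstrap confidence interval for an OPE estimate:
    resample trajectories with replacement and recompute the estimate each time."""
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        estimates.append(estimator([trajectories[i] for i in idx]))
    lo, hi = np.percentile(estimates, [100 * delta / 2, 100 * (1 - delta / 2)])
    return lo, hi

def bias_estimates(j_step_returns, ci):
    """b_hat_n(j) := dist(g^(J_j)(D), CI): distance from each return to the interval."""
    lo, hi = ci
    g = np.asarray(j_step_returns)
    return np.maximum(lo - g, 0.0) + np.maximum(g - hi, 0.0)
```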
9. Empirical Studies (MAGIC)
Appendix I provides detailed experiments using MAGIC. In this section we provide an overview of these results. The first three plots in Figure 2 correspond to those in Figure 1, but include MAGIC. In general MAGIC does very well, tracking or exceeding the best performance of WDR and AM. However, in Figure 2c MAGIC does not perfectly track AM. The scale is logarithmic, so the difference between MAGIC and AM is small in comparison to the benefit of MAGIC over WDR. We hypothesize that the reason MAGIC does not match AM may be due to error in our estimates of $\Omega_n$ and $b_n$.

Figure 2d is for an experimental setup that we call Hybrid, where early in trajectories there is partial observability (e.g., initial uncertainty about a student's knowledge in an intelligent tutoring system, or uncertainty about the state of the world in a robotic application). In these settings MAGIC outperforms all other estimators, even AM and WDR, by automatically leveraging WDR for the parts of trajectories where partial observability causes the model to be inaccurate, and AM for the parts of the trajectories where the model is accurate. To emphasize this, we include MAGIC-B (B for binary), where $\mathcal{J} = \{-1, \infty\}$, so that BIM can only blend AM and WDR by placing weights on them. The relatively poor performance of MAGIC-B supports our use of off-policy $j$-step returns.

Figure 2: Empirical comparison of MAGIC to other estimators using the legend from Figure 1. The four panels are (a) Gridworld, (b) ModelFail, (c) ModelWin, and (d) Hybrid; each shows mean squared error as a function of the number of episodes, $n$. All plots use the following legend (although only Figure 2d includes MAGIC-B): DR, AM, WDR, MAGIC, MAGIC-B.
10. Conclusion
We have proposed several new OPE estimators and shown empirically that they outperform existing estimators. While previous OPE estimators that use importance sampling often failed to outperform the approximate model estimator (which does not use importance sampling), our new estimators often do, frequently by orders of magnitude. In cases where the approximate model estimator remains the best estimator, one of our new estimators, MAGIC, performs similarly. In other cases, MAGIC meets or exceeds the performance of state-of-the-art prior estimators.
References
Bartle, R. G. The Elements of Integration and Lebesgue Measure. John Wiley & Sons, 2014.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.

Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3):33–57, March 1996.

Davison, A. C. and Hinkley, D. V. Bootstrap Methods and their Application. Cambridge University Press, Cambridge, 1997.

Downey, C. and Sanner, S. Temporal difference Bayesian model averaging: A Bayesian perspective on adapting lambda. In Proceedings of the 27th International Conference on Machine Learning, pp. 311–318, 2010.

Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, pp. 1097–1104, 2011.

Efron, B. and Tibshirani, R. J. An Introduction to the Bootstrap. Chapman and Hall, London, 1993.

Hammersley, J. M. and Handscomb, D. C. Monte Carlo Methods. Methuen & Co. Ltd., London, 1964.

Heejung, H. and Robins, J. M. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.

Jiang, N. and Li, L. Doubly robust off-policy evaluation for reinforcement learning. ArXiv, arXiv:1511.03722v1, 2015.

Konidaris, G. D., Niekum, S., and Thomas, P. S. TDγ: Re-evaluating complex backups in temporal difference learning. In Advances in Neural Information Processing Systems 24, pp. 2402–2410, 2011.

Levine, S. and Koltun, V. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning, pp. 1–9, 2013.

Mahmood, A. R., Hasselt, H., and Sutton, R. S. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems 27, 2014.

Mahmood, A. R., Yu, H., White, M., and Sutton, R. S. Emphatic temporal-difference learning. ArXiv, arXiv:1507.01569, 2015.

Mandel, T., Liu, Y., Levine, S., Brunskill, E., and Popović, Z. Offline policy evaluation across representations with applications to educational games. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems, 2014.

Mandel, T., Liu, Y., Brunskill, E., and Popović, Z. Offline evaluation of online reinforcement learning algorithms. In Proceedings of the Thirtieth Conference on Artificial Intelligence, 2016.

Mittelhammer, R. C. Mathematical Statistics for Economics and Business, volume 78. Springer, 1996.

Powell, M. J. D. and Swann, J. Weighted uniform sampling: a Monte Carlo technique for reducing variance. Journal of the Institute of Mathematics and its Applications, 2(3):228–236, 1966.

Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766, 2000.

Rotnitzky, A. and Robins, J. M. Semiparametric regression estimation in the presence of dependent censoring. Biometrika, 82(4):805–820, 1995.

Sen, P. K. and Singer, J. M. Large Sample Methods in Statistics: An Introduction with Applications. Chapman & Hall, 1993.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Thapa, D., Jung, I., and Wang, G. Agent based decision support system using reinforcement learning under emergency circumstances. Advances in Natural Computation, 3610:888–892, 2005.

Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. Personalized ad recommendation systems for life-time value optimization with guarantees. In Proceedings of the International Joint Conference on Artificial Intelligence, 2015.

Thomas, P. S. A notation for Markov decision processes. ArXiv, arXiv:1512.09075v1, 2015a.

Thomas, P. S. Safe Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2015b.

Thomas, P. S., Niekum, S., Theocharous, G., and Konidaris, G. D. Policy evaluation using the Ω-return. In Advances in Neural Information Processing Systems, 2015.

van Hasselt, H., Mahmood, A. R., and Sutton, R. S. Off-policy TD(λ) with true online equivalence. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 2014.

Veness, J., Lanctot, M., and Bowling, M. Variance reduction in Monte-Carlo tree search. In Advances in Neural Information Processing Systems, pp. 1836–1844, 2011.

White, M. and Bowling, M. Learning a value analysis tool for agent evaluation. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1976–1981, 2009.

Zinkevich, M., Bowling, M., Bard, N., Kan, M., and Billings, D. Optimal unbiased estimators for evaluating agent performance. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI), pp. 573–578, 2006.
A. Preliminaries
In this section we present additional notation, definitions, properties, (known) theorems, corollaries, and lemmas that are useful when we prove theorems later.

Let $H^t := (S_0, A_0, R_0, S_1, \ldots, S_{t-1}, A_{t-1}, R_{t-1}, S_t)$ be the first $t$ transitions in the episode $H$. We call $H^t$ a partial trajectory of length $t$. Notice that we use subscripts on trajectories to denote the trajectory's index in $\mathcal{D}$ and superscripts to denote partial trajectories—$H_i^t$ is the first $t$ transitions of the $i$th trajectory in $\mathcal{D}$. Let $\mathcal{H}^t$ be the set of all possible partial trajectories of length $t$.

For all $(\pi, s) \in \Pi \times \mathcal{S}$, let $\operatorname{supp}_s(\pi)$ be the set of actions that have non-zero probability when the policy $\pi$ is used to select an action in state $s$, i.e., $\operatorname{supp}_s(\pi) := \{a \in \mathcal{A} : \pi(a \mid s) \neq 0\}$. Similarly, let $\operatorname{supp}(\pi, t) := \{h^t \in \mathcal{H}^t : \Pr(H^t = h^t \mid \pi) \neq 0\}$.

Later we will need to bound terms like $\rho_t^i R_t^i$ for some $t$ and $i$. Notice that even if $\rho_t^i < \beta$, it is possible for $\rho_t^i R_t^i > \beta r_{\max}$ if $r_{\max}$ is negative, since $\rho_t^i$ could be zero. Additionally, sometimes we may deal with $r_{\max}$ terms and other times $r^{\text{model}}_{\max}$. To avoid explicitly handling these cases, we will bound terms using loose bounds that depend on a new term: $r^\star_{\max} := \max\{|r_{\min}|, |r_{\max}|, |r^{\text{model}}_{\min}|, |r^{\text{model}}_{\max}|\}$.

Definition 1 (Almost Sure Convergence). A sequence of random variables, $(X_n)_{n=1}^{\infty}$, converges almost surely to the random variable $X$ if $\Pr\big(\lim_{n \to \infty} X_n = X\big) = 1$. We write $X_n \stackrel{a.s.}{\longrightarrow} X$ to denote that the sequence $(X_n)_{n=1}^{\infty}$ converges almost surely to $X$.

Definition 2.
Let $\theta$ be a real number and $(\hat{\theta}_n)_{n=1}^{\infty}$ be an infinite sequence of random variables. We call $\hat{\theta}_n$ a (strongly) consistent estimator of $\theta$ if and only if $\hat{\theta}_n \stackrel{a.s.}{\longrightarrow} \theta$.

Notice that an estimator being unbiased does not mean that it is also strongly consistent—estimators can be any combination of biased/unbiased and consistent/inconsistent. Next we present several known properties of almost sure convergence (Mittelhammer, 1996, Section 5.5).
Property 1 (Continuous mapping theorem). $X_n \stackrel{a.s.}{\longrightarrow} X$ implies that $f(X_n) \stackrel{a.s.}{\longrightarrow} f(X)$ for every continuous function $f$.

Property 2.
Let $X_n$ and $Y_n$ be sequences of random variables and $X$ and $Y$ be random variables. If $X_n \stackrel{a.s.}{\longrightarrow} X$, $Y_n \stackrel{a.s.}{\longrightarrow} Y$, and if $\Pr(Y = 0) = 0$, then $X_n / Y_n \stackrel{a.s.}{\longrightarrow} X / Y$.

Property 3. If $\{X_{i,n}\}_{i=1}^{m}$ are $m < \infty$ sequences of random variables such that $X_{i,n} \stackrel{a.s.}{\longrightarrow} X_i$ for all $i \in \{1, \ldots, m\}$, then $\sum_{i=1}^{m} X_{i,n} \stackrel{a.s.}{\longrightarrow} \sum_{i=1}^{m} X_i$.

We will require an additional property of almost sure convergence that is similar to Property 3, but which allows for the sum over a countably infinite number of sequences of random variables, i.e., $m = \infty$. In order to establish this property we begin with Lebesgue's dominated convergence theorem:

Theorem 5 (Lebesgue's Dominated Convergence Theorem). Let $(f_n)_{n=1}^{\infty}$ be a sequence of integrable functions that converges almost everywhere to a real-valued measurable function $f$. If there exists an integrable function $g$ such that $|f_n| \leq g$ for all $n$, then
$$\lim_{n \to \infty} \int f_n\, d\mu = \int f\, d\mu.$$

Proof.
See the work of Bartle (2014, Theorem 5.6).

Next we use Lebesgue's dominated convergence theorem to show conditions under which we can reverse the order of a limit and an infinite summation:
Lemma 1.
Let $\{x_{i,n}\}_{i=0}^{\infty}$ be a countably infinite number of real-valued sequences indexed by $i$, such that $\lim_{n \to \infty} x_{i,n} = x_i$ for all $i \in \mathbb{N}_{\geq 0}$. If there exists a function $g : \mathbb{N}_{\geq 0} \to \mathbb{R}$ such that $|x_{i,n}| \leq g(i)$ for all $n \in \mathbb{N}_{>0}$ and $i \in \mathbb{N}_{\geq 0}$, and $\sum_{i=0}^{\infty} g(i) < \infty$, then
$$\lim_{n \to \infty} \sum_{i=0}^{\infty} x_{i,n} = \sum_{i=0}^{\infty} \lim_{n \to \infty} x_{i,n}.$$

Proof.
We apply Lebesgue's dominated convergence theorem (Theorem 5), where, for all $(n, i) \in \mathbb{N}_{>0} \times \mathbb{N}_{\geq 0}$, $f_n(i) = x_{i,n}$, $f(i) = x_i$, and $\mu$ is the counting measure on the measure space $(\mathbb{N}_{\geq 0}, \mathcal{P}(\mathbb{N}_{\geq 0}))$, where $\mathcal{P}(\mathbb{N}_{\geq 0})$ is the power set of $\mathbb{N}_{\geq 0}$.

We can now establish our desired property about almost sure convergence:

Property 4.
Let $\{X_{i,n}\}_{i=0}^{\infty}$ be a countably infinite number of sequences of random variables such that $X_{i,n} \stackrel{a.s.}{\longrightarrow} X_i$ for all $i \in \mathbb{N}_{\geq 0}$. If there exists a function $g : \mathbb{N}_{\geq 0} \to \mathbb{R}$ such that $|X_{i,n}| \leq g(i)$ surely for all $(n, i) \in \mathbb{N}_{>0} \times \mathbb{N}_{\geq 0}$, and $\sum_{i=0}^{\infty} g(i) < \infty$, then $\sum_{i=0}^{\infty} X_{i,n} \stackrel{a.s.}{\longrightarrow} \sum_{i=0}^{\infty} X_i$. (To conform to standard notation elsewhere, here we reuse the symbol $g$, which was previously used to denote the return of a trajectory, $g(H)$. The two uses of $g$ are sufficiently dissimilar that this reuse should not cause confusion.)

Proof.
$$\begin{aligned}
\Pr\Big( \lim_{n \to \infty} \sum_{i=0}^{\infty} X_{i,n} = \sum_{i=0}^{\infty} X_i \Big)
&\stackrel{(a)}{\geq} \Pr\Big( \bigcap_{i=0}^{\infty} \big( \lim_{n \to \infty} X_{i,n} = X_i \big) \;\cap\; \underbrace{\Big( \sum_{i=0}^{\infty} \lim_{n \to \infty} X_{i,n} = \sum_{i=0}^{\infty} X_i \Big)}_{(b)} \Big) \\
&\stackrel{(c)}{\geq} \Pr\Big( \bigcap_{i=0}^{\infty} \big( \lim_{n \to \infty} X_{i,n} = X_i \big) \;\cap\; \underbrace{\bigcap_{i=0}^{\infty} \big( \lim_{n \to \infty} X_{i,n} = X_i \big)}_{(d)} \Big) \\
&= \Pr\Big( \bigcap_{i=0}^{\infty} \big( \lim_{n \to \infty} X_{i,n} = X_i \big) \Big) \\
&= 1 - \Pr\Big( \underbrace{\bigcup_{i=0}^{\infty} \big( \lim_{n \to \infty} X_{i,n} \neq X_i \big)}_{(e)} \Big) \\
&= 1,
\end{aligned}$$
where (a) comes from Lemma 1, which ensures that
$$\bigcap_{i=0}^{\infty} \big( \lim_{n \to \infty} X_{i,n} = X_i \big) \implies \Big( \lim_{n \to \infty} \sum_{i=0}^{\infty} X_{i,n} = \sum_{i=0}^{\infty} \lim_{n \to \infty} X_{i,n} \Big),$$
(c) holds because $(d) \implies (b)$, and (e) has zero measure because it is the countable union of zero-measure sets by the assumption that $X_{i,n} \stackrel{a.s.}{\longrightarrow} X_i$ for all $i \in \mathbb{N}_{\geq 0}$.

Next we show that if a sequence of random variables, $X_n$, converges almost surely to a random variable, $X$, then the expected value of $X_n$ converges to the expected value of $X$.

Lemma 2. If $(X_i)_{i=1}^{\infty}$ is a sequence of uniformly bounded real-valued random variables and if $X_n \stackrel{a.s.}{\longrightarrow} X$, then $\lim_{n \to \infty} \mathbf{E}[X_n] = \mathbf{E}[X]$.

Proof.
Let $X_n$ (for all $n$) and $X$ be random variables on the probability space $(\Omega, \Sigma, P)$ and let $\mathcal{A} = \{\omega \in \Omega : \lim_{n \to \infty} X_n = X\}$. Then:
$$\lim_{n \to \infty} \mathbf{E}[X_n] = \lim_{n \to \infty} \int_\Omega X_n\, dP \stackrel{(a)}{=} \int_\Omega \lim_{n \to \infty} X_n\, dP = \underbrace{\int_{\mathcal{A}} \lim_{n \to \infty} X_n\, dP}_{(b)} + \underbrace{\int_{\Omega \setminus \mathcal{A}} \lim_{n \to \infty} X_n\, dP}_{(c)},$$
where (a) comes from the bounded convergence theorem. For term (b), notice that for all $\omega \in \mathcal{A}$, $\lim_{n \to \infty} X_n = X$. For term (c), notice that by the assumption that $X_n \stackrel{a.s.}{\longrightarrow} X$, we have that $\Omega \setminus \mathcal{A}$ has measure zero. So:
$$\lim_{n \to \infty} \mathbf{E}[X_n] = \int_{\mathcal{A}} X\, dP = \int_{\mathcal{A}} X\, dP + \int_{\Omega \setminus \mathcal{A}} X\, dP = \mathbf{E}[X].$$

Next we present a lemma that relates almost sure convergence of estimators to mean squared error. Let $\hat{\theta}$ be an estimator of $\theta$. Recall that $\operatorname{MSE}(\hat{\theta}, \theta) := \mathbf{E}\big[(\hat{\theta} - \theta)^2\big]$. We show that a sequence, $(X_n)_{n=1}^{\infty}$, converges almost surely to $X$ if and only if $\lim_{n \to \infty} \operatorname{MSE}(X_n, X) = 0$.

Lemma 3. If $(X_i)_{i=1}^{\infty}$ is a sequence of uniformly bounded real-valued random variables, then $X_n \stackrel{a.s.}{\longrightarrow} X$ if and only if $\lim_{n \to \infty} \operatorname{MSE}(X_n, X) = 0$.

Proof. We show each direction separately. First we show that $X_n \stackrel{a.s.}{\longrightarrow} X$ implies $\lim_{n \to \infty} \operatorname{MSE}(X_n, X) = 0$. We have $\operatorname{MSE}(X_n, X) = \mathbf{E}[(X_n - X)^2] = \mathbf{E}[Y_n]$, where $Y_n := (X_n - X)^2$. By the continuous mapping theorem we have that $Y_n \stackrel{a.s.}{\longrightarrow} (X - X)^2 = 0$. So, by Lemma 2 (applied to $\mathbf{E}[Y_n]$) we have that $\lim_{n \to \infty} \operatorname{MSE}(X_n, X) = \mathbf{E}[0] = 0$.

Next we show the other direction: that $\lim_{n \to \infty} \operatorname{MSE}(X_n, X) = 0$ implies $X_n \stackrel{a.s.}{\longrightarrow} X$. Let $X$ and all $X_n$ be random variables on the probability space $(\Omega, \Sigma, P)$, $\mathcal{A} = \{\omega \in \Omega : \lim_{n \to \infty} \operatorname{MSE}(X_n, X) = 0\}$, and $\mathcal{B} = \{\omega \in \mathcal{A} : \lim_{n \to \infty} X_n \neq X\}$. If $\lim_{n \to \infty} \operatorname{MSE}(X_n, X) = 0$, then by the definition of MSE we have that:
$$0 = \lim_{n \to \infty} \int_\Omega (X_n - X)^2\, dP \stackrel{(a)}{=} \int_\Omega \big(\lim_{n \to \infty} X_n - X\big)^2\, dP = \underbrace{\int_{\mathcal{B}} \big(\lim_{n \to \infty} X_n - X\big)^2\, dP}_{(b)} + \underbrace{\int_{\mathcal{A} \setminus \mathcal{B}} \big(\lim_{n \to \infty} X_n - X\big)^2\, dP}_{(c)} + \underbrace{\int_{\Omega \setminus \mathcal{A}} \big(\lim_{n \to \infty} X_n - X\big)^2\, dP}_{(d)},$$
where we get (a) by using the bounded convergence theorem to pass the limit inside the integral and the fact that $(X_n - X)^2$ is a continuous function of $X_n$ to then move the limit to the $X_n$ term. Notice that (b), (c), and (d) are all non-negative, and so they must all be zero for the equality with zero to hold. We have that (d) is necessarily zero due to the definition of $\mathcal{A}$ and our assumption that $\lim_{n \to \infty} \operatorname{MSE}(X_n, X) = 0$. Similarly, (c) is zero because, from the definition of $\mathcal{B}$, $\mathcal{A} \setminus \mathcal{B}$ causes $\lim_{n \to \infty} X_n = X$. However, in (b), by the definition of $\mathcal{B}$, $\lim_{n \to \infty} X_n - X$ is non-zero, and so for the equality with zero to hold, $\mathcal{B}$ must have measure zero. That is, $\Pr(\lim_{n \to \infty} X_n \neq X) = 0$, and thus $\Pr(\lim_{n \to \infty} X_n = X) = 1$.

Next we show that if two sequences of random variables converge to the same random variable, then any sequence of random variables bounded between the two sequences must also converge to the same random variable.

Lemma 4. If $X_n \stackrel{a.s.}{\longrightarrow} X$, $Z_n \stackrel{a.s.}{\longrightarrow} X$, and for all $n$, $X_n \leq Y_n \leq Z_n$, then $Y_n \stackrel{a.s.}{\longrightarrow} X$.

Proof.
$$\Pr\big( \lim_{n \to \infty} Y_n = X \big) = \Pr\Big( \big( \lim_{n \to \infty} Y_n \leq X \big) \cap \big( \lim_{n \to \infty} Y_n \geq X \big) \Big). \qquad (5)$$
Since
$$\Pr\big( \lim_{n \to \infty} Y_n \geq X \big) \geq \Pr\big( \lim_{n \to \infty} X_n \geq X \big) \geq \Pr\big( \lim_{n \to \infty} X_n = X \big) = 1,$$
and
$$\Pr\big( \lim_{n \to \infty} Y_n \leq X \big) \geq \Pr\big( \lim_{n \to \infty} Z_n \leq X \big) \geq \Pr\big( \lim_{n \to \infty} Z_n = X \big) = 1,$$
we have that (5) is the probability of the joint occurrence of two probability-one events, and so $\Pr\big( \lim_{n \to \infty} Y_n = X \big) = 1$.

Next we show that if the difference between two sequences converges almost surely to zero, then we can substitute one sequence for the other as an input to a continuous function without changing the almost sure convergence properties of the function:
Lemma 5. If f is a continuous function, f ( X n ) a.s. −→ X ,and Y n − X n a.s. −→ , then f ( Y n ) a.s. −→ X .Proof. Pr (cid:16) lim n →∞ f ( Y n ) = X (cid:17) = Pr (cid:16) lim n →∞ f ( Y n − X n + X n ) = X (cid:17) (a) = Pr (cid:16) f (cid:16) lim n →∞ Y n − X n + X n (cid:17) = X (cid:17) (b) ≥ Pr (cid:16) lim n →∞ Y n − X n = 0 (cid:17)\ (cid:16) f (cid:16) lim n →∞ X n (cid:17) = X (cid:17) ! = Pr (cid:16) lim n →∞ Y n − X n = 0 (cid:17)\ (cid:16) lim n →∞ f ( X n ) = X (cid:17) ! (c) =1 , where (a) holds because f is a continuous function, andwhere (b) holds because it gives sufficient conditions forthe event in the line above to hold, and (c) holds becauseunder our assumptions the two events both occur with prob-ability one. So we can conclude that f ( Y n ) a.s. −→ X .Next we review two standard forms of the strong law oflarge numbers. Theorem 6 (Khintchine Strong Law of Large Numbers) . Let { X i } ∞ i =1 be independent and identically distributedrandom variables. Then ( n P ni =1 X i ) ∞ n =1 is a sequence ofrandom variables that converges almost surely to E [ X ] . ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning Proof.
See the work of Sen & Singer (1993, Theorem 2.3.13).
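As a quick numerical illustration of the strong law of large numbers, and of the link between almost sure convergence and mean squared error established in Lemma 3, the following Python sketch shows the sample mean of bounded i.i.d. draws concentrating at the true mean while its empirical MSE shrinks. The uniform distribution, the number of repetitions, and all variable names are illustrative choices, not part of our experiments.

import numpy as np

rng = np.random.default_rng(0)
true_mean = 0.5  # mean of Uniform(0, 1), a bounded distribution

for n in [10, 100, 10_000]:
    # 5,000 independent repetitions of the sample mean of n i.i.d. draws.
    sample_means = rng.random((5_000, n)).mean(axis=1)
    mse = np.mean((sample_means - true_mean) ** 2)
    # The sample means concentrate at 0.5 as n grows, and the empirical MSE
    # shrinks toward zero, in line with Theorem 6 and Lemma 3.
    print(n, sample_means[0], mse)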
Theorem 7 (Kolmogorov Strong Law of Large Numbers). Let $\{X_i\}_{i=1}^{\infty}$ be independent (not necessarily identically distributed) random variables. If all $X_i$ have the same mean and bounded variance (i.e., there is a finite constant $b$ such that $\operatorname{Var}(X_i) \le b$ for all $i \ge 1$), then $\big(\tfrac{1}{n}\sum_{i=1}^{n} X_i\big)_{n=1}^{\infty}$ is a sequence of random variables that converges almost surely to $E[X_1]$.

Proof. See the work of Sen & Singer (1993, Theorem 2.3.10 with Proposition 2.3.10).

In Corollary 1 we present a simple extension of Kolmogorov's strong law of large numbers that we often still refer to as Kolmogorov's strong law of large numbers:

Corollary 1.
Let $\{X_i\}_{i=1}^{\infty}$ be independent (not necessarily identically distributed) random variables. If all $X_i$ have the same mean and are uniformly bounded by a finite constant $b$, then $\big(\tfrac{1}{n}\sum_{i=1}^{n} X_i\big)_{n=1}^{\infty}$ is a sequence of random variables that converges almost surely to $E[X_1]$.

Proof. For all $i \in \mathbb{N}_{>0}$ we have that $|X_i| \le b$ surely, so from Popoviciu's inequality, $\operatorname{Var}(X_i) \le b^2$, and so we can apply Theorem 7.

We now turn to results that are more specific to reinforcement learning and off-policy policy evaluation. Lemma 6 establishes a relationship between the expected values of $\hat r^{\pi_e}(s, i)$ and $\hat r^{\pi_e}(s, A, i)$ for all $i$ if $A$ is generated by some policy $\pi$.

Lemma 6.
Let $(\pi_e, \pi) \in \Pi^2$, where $(\pi(a|s) = 0) \implies (\pi_e(a|s) = 0)$ for all $(a, s) \in \mathcal{A} \times \mathcal{S}$. Then for all $(s, i) \in \mathcal{S} \times \mathbb{N}_{\ge 0}$,
\[
\hat r^{\pi_e}(s, i) = E\!\left[\frac{\pi_e(A|s)}{\pi(A|s)}\, \hat r^{\pi_e}(s, A, i) \,\middle|\, A \sim \pi\right].
\]

Proof.
First, recall from (1) that for all $(s, i) \in \mathcal{S} \times \mathbb{N}_{\ge 0}$:
\begin{align*}
\hat r^{\pi_e}(s, i) :={}& \sum_{a \in \mathcal{A}} \pi_e(a|s)\, \hat r^{\pi_e}(s, a, i)\\
={}& \sum_{a \in \operatorname{supp}_s(\pi_e)} \pi_e(a|s)\, \hat r^{\pi_e}(s, a, i)\\
\overset{(a)}{=}{}& \sum_{a \in \operatorname{supp}_s(\pi)} \pi_e(a|s)\, \hat r^{\pi_e}(s, a, i)\\
={}& \sum_{a \in \operatorname{supp}_s(\pi)} \frac{\pi(a|s)}{\pi(a|s)}\, \pi_e(a|s)\, \hat r^{\pi_e}(s, a, i)\\
={}& \sum_{a \in \operatorname{supp}_s(\pi)} \pi(a|s)\, \frac{\pi_e(a|s)}{\pi(a|s)}\, \hat r^{\pi_e}(s, a, i)\\
={}& E\!\left[\frac{\pi_e(A|s)}{\pi(A|s)}\, \hat r^{\pi_e}(s, A, i) \,\middle|\, A \sim \pi\right],
\end{align*}
where (a) holds by the assumption that $(\pi(a|s) = 0) \implies (\pi_e(a|s) = 0)$ for all $(a, s) \in \mathcal{A} \times \mathcal{S}$.
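To make the importance-sampling identity in Lemma 6 concrete, the following small Python sketch checks numerically that reweighting samples drawn from a behavior policy $\pi$ by $\pi_e(A|s)/\pi(A|s)$ recovers an expectation under $\pi_e$. The two-action setup, the policies, the reward predictions, and all variable names are illustrative stand-ins, not the policies or models used in our experiments.

import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.7, 0.3])      # behavior policy pi(.|s) (illustrative)
pi_e = np.array([0.2, 0.8])    # evaluation policy pi_e(.|s) (illustrative)
r_hat = np.array([1.0, -2.0])  # model predictions playing the role of r_hat(s, a, i)

direct = float(pi_e @ r_hat)   # left-hand side of Lemma 6

actions = rng.choice(2, size=1_000_000, p=pi)          # A ~ pi
weights = pi_e[actions] / pi[actions]                  # pi_e(A|s) / pi(A|s)
reweighted = float(np.mean(weights * r_hat[actions]))  # Monte Carlo right-hand side

print(direct, reweighted)  # the two values agree up to Monte Carlo error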
Corollary 2 extends Lemma 6 to show a relationship between $\hat v^{\pi_e}(s)$ and the expected value of $\hat q^{\pi_e}(s, A)$ if $A$ is generated by some policy $\pi$:

Corollary 2. Let $(\pi_e, \pi) \in \Pi^2$, where $(\pi(a|s) = 0) \implies (\pi_e(a|s) = 0)$ for all $(a, s) \in \mathcal{A} \times \mathcal{S}$. Then for all $s \in \mathcal{S}$,
\[
\hat v^{\pi_e}(s) = E\!\left[\frac{\pi_e(A|s)}{\pi(A|s)}\, \hat q^{\pi_e}(s, A) \,\middle|\, A \sim \pi\right].
\]

Proof.
We have from Lemma that for all i ∈ N ≥ , ˆ r π e ( s, i ) = E (cid:20) π e ( A | s ) π ( A | s ) ˆ r π e ( s, A, i ) (cid:12)(cid:12)(cid:12)(cid:12) A ∼ π (cid:21) . Summing both sides over t and multiplying by γ t we havethat: ∞ X t =0 γ t ˆ r π e ( s, t ) | {z } =ˆ v πe ( s ) = ∞ X t =0 γ t E (cid:20) π e ( A | s ) π ( A | s ) ˆ r π e ( s, A, t ) (cid:12)(cid:12)(cid:12)(cid:12) A ∼ π (cid:21) ˆ v π e ( s ) = E " π e ( A | s ) π ( A | s ) ∞ X t =0 γ t ˆ r π e ( s, A, t ) | {z } =ˆ q πe ( s,A ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) A ∼ π = E (cid:20) π e ( A | s ) π ( A | s ) ˆ q π e ( s, A ) (cid:12)(cid:12)(cid:12)(cid:12) A ∼ π (cid:21) . Before presenting the next theorem, notice that we can ex-press the DR estimator, (2), as
\[
\operatorname{DR}(D) = \frac{1}{n} \sum_{i=1}^{n} \operatorname{DR}_i(D)
\]
If Assumption holds then E [DR i ( D )] = v ( π e ) for all i ∈ { , . . . , n } .Proof. Recall that DR i ( D ) := ∞ X t =0 γ t ρ it R H i t − ∞ X t =0 γ t (cid:16) ρ it ˆ q π e (cid:16) S H i t , A H i t (cid:17) − ρ it − ˆ v π e (cid:16) S H i t (cid:17)(cid:17) . First, notice that P ∞ t =0 γ t ρ H i t R H i t is the per-decision im-portance sampling (PDIS) estimator, which is known tobe an unbiased estimator of v ( π e ) (Precup et al., 2000;Thomas, 2015b). So, we need only show that the remain-ing terms in the definition of DR i ( D ) have expected valuezero, i.e., that E " ∞ X t =0 γ t ρ it ˆ q π e (cid:16) S H i t , A H i t (cid:17) = E " ∞ X t =0 γ t ρ it − ˆ v π e (cid:16) S H i t (cid:17) . By Corollary (which requires Assumption ) we have that E " ∞ X t =0 γ t ρ it − ˆ v π e (cid:16) S H i t (cid:17) = E " ∞ X t =0 γ t ρ it − π e (cid:16) A H i t | S H i t (cid:17) π i (cid:16) A H i t | S H i t (cid:17) ˆ q π e (cid:16) S H i t , A H i t (cid:17) = E " ∞ X t =0 γ t ρ it ˆ q π e (cid:16) S H i t , A H i t (cid:17) . For completeness, next we show formally the obvious re-sult that Assumption implies that partial trajectories thatoccur under the evaluation policy must occur under the be-havior policy. Lemma 8.
Assumption implies that if Pr( H t = h t | π i ) =0 , then Pr( H t = h t | π e ) = 0 for all i ∈ { , . . . , n } , h t := ( s , a , r , s , . . . , s t − , a t − , r t − , s t ) ∈ H t , and ≤ t < ∞ .Proof. If t = 0 then h t = ( s ) , which does not dependon the policy, so clearly if Pr( H = h | π i ) = 0 then Pr( H = h | π e ) = 0 . Hereafter we assume ≤ t < ∞ .Notice that for any π ∈ Π , Pr( H t = h t | π ) (a) = Pr( S = s ) Pr( A = a | S = s , π ) × (cid:16) t − Y i =1 Pr( S i = s i | S i − = s i − , A i − = a i − ) × Pr( R i − = r i − | S i − = s i − , A i − = a i − , S i = s i ) × Pr( A i = a i | S i = s i , π ) (cid:17) × Pr( S t = s t | S t − = s t − , A t − = a t − ) × Pr( R t − = r t − | S t − = s t − , A t − = a t − , S t = s t ) (b) = d ( s ) π ( a | s ) P ( s t | s t − , a t − ) R ( r t − | s t − , a t − , s t ) × t − Y i =1 P ( s i | s i − a i − ) R ( r i − | s i − , a i − , s i ) π ( a i | s i ) . where (a) comes from repeated application of the rule that,for any random variables X and Y , Pr( X = x, Y = y ) =Pr( X = x ) Pr( Y = y | X = x ) and the Markov property forstate transitions, actions, and rewards, and (b) comes fromthe definitions of d , π, R and P in MDPNv1.So, if Pr( H t = h t | π i ) = 0 , then one of the terms in theproduct above (using π i for π ) must be zero. If that termis not a π i term, then it also shows up in Pr( H t = h t | π e ) ,and so Pr( H t = h t | π e ) = 0 . If the term is a π i term, thenby Assumption , the corresponding π e term must also bezero, and so Pr( H t = h t | π i ) = 0 .Next, recall the known result that the ratio of partial trajec-tory probabilities under two different policies can be writ-ten in terms of the two policies: Lemma 9.
Let $\pi_e$ and $\pi_b$ be any two policies and $t \in \mathbb{N}_{>0}$. Let $h_t$ be any history of length $t$ that has non-zero probability under $\pi_b$, i.e., $\Pr(H_t = h_t | \pi_b) \neq 0$. Then
\[
\frac{\Pr(H_t = h_t | \pi_e)}{\Pr(H_t = h_t | \pi_b)} = \prod_{i=0}^{t-1} \frac{\pi_e(a_i | s_i)}{\pi_b(a_i | s_i)}.
\]

Proof.
See the works of Precup et al. (2000) and Thomas (2015b, Lemma 1).

Next we establish Lemma 10, which states that we can use importance sampling to generate unbiased estimates of any function of partial trajectories in $D$. Recall that whenever we write $H_i$ (or $H_i^t$) we always mean a trajectory generated by $\pi_i$, so $H_i \sim \pi_i$.

Lemma 10.
If Assumption holds, then for all ( t, i ) ∈ N ≥− × { , . . . , n } : E [ ρ it f ( H t +1 i )] = E [ f ( H t +1 ) | H t +1 ∼ π e ] , for any real-valued function f .Proof. If t = − then H t − = ( S ) , which does not de-pend on the policy, so the result is immediate. If t ≥ : E [ ρ it f ( H t +1 i )] = E t Y j =0 π e (cid:16) A H i j | S H i j (cid:17) π i (cid:16) A H i j | S H i j (cid:17) f ( H t +1 i ) (a) = E " Pr (cid:0) H t +1 i = h t +1 i (cid:12)(cid:12) π e (cid:1) Pr (cid:0) H t +1 i = h t +1 i (cid:12)(cid:12) π i (cid:1) f ( H t +1 i ) = X supp( π i ,t +1) Pr (cid:0) H t +1 = h t +1 (cid:12)(cid:12) π i (cid:1) × Pr (cid:0) H t +1 = h t +1 (cid:12)(cid:12) π e (cid:1) Pr ( H t +1 = h t +1 | π i ) f ( H t +1 )= X supp( π i ,t +1) Pr (cid:0) H t +1 = h t +1 (cid:12)(cid:12) π e (cid:1) f ( H t +1 ) (b) = X supp( π e ,t +1) Pr (cid:0) H t +1 = h t +1 (cid:12)(cid:12) π e (cid:1) f ( H t +1 )= E [ f ( H t +1 ) | H t +1 ∼ π e ] , where (a) comes from Lemma and (b) comes fromLemma , which requires Assumption .We can use Lemma to show the well-known result thatthe expected value of an importance weight is one: Lemma 11.
For all $\pi_i$ and $t \in \mathbb{N}_{\ge -1}$, if Assumption holds, then $E[\rho^i_t] = 1$.

Proof. This follows from Lemma 10 with $f(H_{t+1}) := 1$.
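As a quick numerical illustration of Lemma 11, and of why the weighted estimators below normalize by the sum of the importance weights, the following Python sketch draws actions from a behavior policy and shows the average importance weight concentrating near one. The bandit-style setup, the policies, and all variable names are illustrative only.

import numpy as np

rng = np.random.default_rng(1)

pi_b = np.array([0.9, 0.1])  # behavior policy (illustrative)
pi_e = np.array([0.5, 0.5])  # evaluation policy (illustrative)

for n in [10, 100, 10_000]:
    actions = rng.choice(2, size=n, p=pi_b)
    rho = pi_e[actions] / pi_b[actions]  # importance weights with E[rho] = 1
    # By Lemma 11 and the strong law of large numbers, the sample mean of the
    # weights converges to one, which is why normalizing by sum(rho) (as the
    # weighted estimators considered next do) preserves strong consistency.
    print(n, rho.mean())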
Next we establish a lemma that will be crucial to showing that the WDR estimator is strongly consistent:

Lemma 12. For all $t \in \mathbb{N}_{\ge 0}$, let $f_t : \mathcal{H}_{t+1} \to \mathbb{R}$. If Assumption holds, $f_t = 0$ for all $t \in \mathbb{N}_{\ge L}$, and either:

• Case 1:
Assumptions and hold, or

• Case 2:
Assumption holds and there is a finite f max such that for all t ∈ N ≥ and h t +1 ∈ H t +1 , | f t ( h t +1 ) | < f max . then ∞ X t =0 γ t n X i =1 ρ it P nj =1 ρ jt f t ( H t +1 i ) (6) a.s. −→ E " ∞ X t =0 γ t f t ( H t +1 ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) H ∼ π e . Proof.
Let X tn := n X i =1 ρ it P nj =1 ρ jt γ t f t ( H t +1 i ) , so that the left side of (6) can be written as P ∞ t =0 X tn . Firstwe multiply the numerator and denominator of X tn by n toget: X tn = n P ni =1 γ t ρ it f t ( H t +1 i ) n P ni =1 ρ it . (7)We will show that the numerator of (7) converges almostsurely to the desired value: n n X i =1 ρ it γ t f t ( H t +1 i ) a.s. −→ E [ γ t f t ( H t +1 ) | H t +1 ∼ π e ] . (8)By Lemma , which relies on Assumption , we have that E [ ρ it γ t f t ( H t +1 i )] = E [ γ t f t ( H t +1 ) | H t +1 ∼ π e ] . Con-sider the two cases from the statement of the lemma:1. Case 1: H t +1 i is independent and identically dis-tributed for all i , so ρ it γ t f t ( H t +1 i ) is also indepen-dent and identically distributed for all i . Therefore byKhintchine’s strong law of large numbers, Theorem ,we have (8).2. Case 2: H t +1 i are not necessarily identically dis-tributed since there may be multiple behavior policies,so we cannot directly apply Khintchine’s strong lawof large numbers. Instead notice that ρ it is boundedby β due to Assumption , and so | ρ it γ t f t ( H t +1 i ) | ≤ βγ t f max . So, we can apply Kolmogorov’s strong lawof large numbers, Corollary , to get (8).Next we show that the denominator of (7) converges almostsurely to one: n n X i =1 ρ it a.s. −→ . (9)By Lemma , which relies on Assumption , we have that E [ ρ it ] = 1 . Again consider the two possible settings:1. Case 1: H t +1 i is independent and identically dis-tributed for all i , so ρ it is also independent and iden-tically distributed for all i . Therefore by Khintchine’sstrong law of large numbers we have (9).2. Case 2:
Since ρ it ≤ β , we can apply Kolmogorov’sstrong law of large numbers to get (9). ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning By applying Property to (8) and (9) we have that for all t , X tn a.s. −→ E (cid:2) γ t f t ( H t +1 ) (cid:12)(cid:12) H t +1 ∼ π e (cid:3) . So,1. Case 1:
Since $X_{tn} = 0$ for $t \ge L$ and by Property ,
\[
\sum_{t=0}^{\infty} X_{tn} = \sum_{t=0}^{L-1} X_{tn} \xrightarrow{\text{a.s.}} E\!\left[\sum_{t=0}^{L-1} \gamma^t f_t(H_{t+1}) \,\middle|\, H_{t+1} \sim \pi_e\right] = E\!\left[\sum_{t=0}^{\infty} \gamma^t f_t(H_{t+1}) \,\middle|\, H \sim \pi_e\right].
\]

2. Case 2:
In order to apply Property we must showthat there exists a function g : N ≥ → R such that P ∞ t =0 g ( t ) < ∞ and for all n ∈ N > and t ∈ N ≥ , | X tn | ≤ g ( t ) . The following definition of g satisfiesthese requirements: g ( t ) := ( γ t f max if t < L, otherwise.That is, ∞ X t =0 g ( t ) ≤ ( f max − γ if γ < ,Lf max otherwise, < ∞ , since we have assumed that γ can only be in thefinite-horizon setting, where L = ∞ . Also, | X tn | =0 = g ( t ) by definition if t ≥ L and if t < L then: | X tn | := (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ρ it P nj =1 ρ jt γ t f t ( H t +1 i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ γ t f max n X i =1 ρ it P nj =1 ρ jt = γ t f max = g ( t ) . So, by Property , we have (6).Finally, we establish an extension of Lemma that willfacilitate its use with sequences that are not quite in theform that it is defined for: Lemma 13.
For all t ∈ N ≥ , let f t : H t → R . If Assump-tion holds, f t = 0 for all t ∈ N ≥ L , and either: • Case 1:
Assumptions and hold, or

• Case 2:
Assumption holds and there is a finite f max such that for all t ∈ N ≥ and h t ∈ H t , | f t ( h t ) | By removing the first term of the sum and shiftingthe variable that the sum uses by one, we can rewrite theleft side of (10) as n n X i =1 f ( H i ) + ∞ X t =0 γ t n X i =1 ρ it P nj =1 ρ jt γf t +1 ( H t +1 i ) . We have that n n X i =1 f ( H i ) a.s. −→ E [ f ( H )] , (11)by Khintchine’s strong law of large numbers in Case 1, andKolmogorov’s strong law of large numbers in Case 2 (since f is bounded). Also, by Lemma (where the definitionof f t +1 in this lemma is used for f t in our application ofLemma ) we have that ∞ X t =0 γ t n X i =1 ρ it P nj =1 ρ jt γf t +1 ( H t +1 i ) a.s. −→ E " ∞ X t =0 γ t +1 f t +1 ( H t +1 ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) H ∼ π e . (12)So by applying Property to (11) and (12) we have: n n X i =1 f ( H i ) + ∞ X t =0 γ t n X i =1 ρ it P nj =1 ρ jt γf t +1 ( H t +1 i ) a.s. −→ E [ f ( H )] + E " ∞ X t =0 γ t +1 f t +1 ( H t +1 ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) H ∼ π e = X t =0 E [ γ t f t ( H t )] + E " ∞ X t =1 γ t f t ( H t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) H ∼ π e = E " ∞ X t =0 γ t f t ( H t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) H ∼ π e . B. Doubly Robust Derivation and Proofs In this appendix we provide an alternate derivation of theDR estimator using control variates. The idea behind con-trol variates is as follows. Suppose that we would like toestimate θ := E [ X ] given a sample of X . The obvious es-timator would be ˆ θ := X . However, if we have a sampleof another random variable, Y , with known expected value, E [ Y ] , then the estimator ˆ θ := X − Y + E [ Y ] may have ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning lower variance. Specifically, while Var(ˆ θ ) = Var( X ) , wehave that Var(ˆ θ ) = Var( X )+Var( Y ) − X, Y ) . So, ˆ θ has lower variance than ˆ θ if X, Y ) > Var( Y ) .Often Y is referred to as the control variate . Notice that theoptimal control variate is Y := X , since then Var(ˆ θ ) = 0 .Furthermore, notice that ˆ θ remains an unbiased estimatorof θ as long as the expected value of Y exists— E [ˆ θ ] = E [ X − Y + E [ Y ]] = E [ X ] − E [ Y ] + E [ Y ] = E [ X ] = θ .Control variates have been used before in reinforcementlearning to reduce the variance of policy gradient estimates(Bhatnagar et al., 2009), where the control variate was re-ferred to as a baseline .Recall that we have defined the DR estimator in (2) as DR( D ) := n X i =1 ∞ X t =0 γ t w it R H i t | {z } X − n X i =1 ∞ X t =0 γ t (cid:16) w it ˆ q π e (cid:16) S H i t , A H i t (cid:17) − w it − ˆ v π e (cid:16) S H i t (cid:17)(cid:17)| {z } Y . In this definition the X term is the per-decision importancesampling (PDIS) estimator, which is known to be an un-biased and strongly consistent estimator of v ( π e ) (Precupet al., 2000; Thomas, 2015b). Also, the control variate, Y , is mean zero, i.e., E [ Y ] = 0 . To see why this controlvariate is reasonable, notice that all of the terms that aremultiplied by γ t w it approximately cancel: ˆ q π e (cid:16) S H i t , A H i t (cid:17) ≈ R H i t + γ ˆ v π e (cid:16) S H i t +1 (cid:17) . 
So, Y is a decent approximation of X , and therefore DR( D ) will have low variance.Our derivation of the control variate used by the DR es-timator is based on an alternate view of control variates.If we do not know the expected value of the control vari-ate, Y , but we have another random variable, Z , such that E [ Z ] = E [ Y ] , then we can use the unbiased estimator ˆ θ = X − Y + Z . The variance of this estimator is givenby Var(ˆ θ ) = Var( X ) + Var( Y − Z ) − X, Y − Z ) .So, if Y ≈ X and Z has low variance, then this estimatormay have lower variance than θ . Technically, this is anordinary application of control variates using Y − Z as themean-zero control variate. We derive DR using this alter-nate view.We begin with the per-decision importance sampling (PDIS) estimator, which is known to be an unbiased andstrongly consistent estimator of v ( π e ) (Precup et al., 2000; Thomas, 2015b). The PDIS estimator is given by: PDIS( D ) := 1 n n X i =1 ∞ X t =0 ρ it γ t R H i t . In order to reduce the variance of this estimator we willsubtract a control variate that we expect to be highly cor-related with the PDIS estimator, and then add back in theexpected value of the control variate: n n X i =1 ∞ X t =0 ρ it γ t R H i t | {z } PDIS estimator, X − n n X i =1 ∞ X t =0 ρ it γ t ˆ r π e ( S H i t , A H i t , | {z } control variate, Y + E " n n X i =1 ∞ X t =0 ρ it γ t ˆ r π e ( S H i t , A H i t , E [ control variate ]= E [ Y ] . (13)Here we expect the control variate to be similar to the PDISestimator if the model’s reward predictions are accurate,i.e., if R H i t ≈ ˆ r π e ( S H i t , A H i t , .If it could be used, (13) would be an extremely low-variance estimator of v ( π e ) since X − Y would usually benear-zero and E [ Y ] is a constant that is near v ( π e ) . How-ever, E [ control variate ] is not known, and so we cannot use(13) directly. Although estimating E [ Y ] is nearly as hardas estimating v ( π e ) , it is marginally easier. It is easier be-cause v ( π e ) uses the unknown transition and reward func-tions of the MDP to produce the distribution of rewards ateach time step, while E [ Y ] uses the known approximatemodel’s transition and reward function for the last transi-tion before each reward occurs. We can therefore estimate E [ Y ] using an unbiased estimator that typically has lowervariance than the control variate. In the alternate view ofcontrol variates this new term will be Z : n n X i =1 ∞ X t =0 ρ it γ t R H i t | {z } PDIS estimator, X − n n X i =1 ∞ X t =0 ρ it γ t ˆ r π e (cid:16) S H i t , A H i t , (cid:17)| {z } control variate, Y + 1 n n X i =1 ∞ X t =0 ρ it − γ t ˆ r π e (cid:16) S H i t , (cid:17)| {z } Z . (14)Here we expect the Z term to have lower variance than the Y term because for each i and t it only depends on actions A H i , . . . , A H i t − and not A H i t . This is reflected in its use of ρ it − rather than ρ it . 
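To make the control-variate structure of the estimator concrete, the following Python sketch computes, for a single trajectory, the per-decision importance sampling sum together with the model-based correction built from q-hat and v-hat, i.e., the per-trajectory quantity whose average over trajectories gives DR(D). The tabular policies, the callables q_hat and v_hat, and all variable names are illustrative stand-ins rather than the code used in our experiments.

import numpy as np

def dr_single_trajectory(states, actions, rewards, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    # rho is the cumulative importance weight rho_t = prod_{j<=t} pi_e(a_j|s_j)/pi_b(a_j|s_j),
    # and rho_prev is rho_{t-1}, with rho_{-1} := 1.
    rho, rho_prev, total = 1.0, 1.0, 0.0
    for t, (s, a, r) in enumerate(zip(states, actions, rewards)):
        rho_prev = rho
        rho = rho * pi_e(a, s) / pi_b(a, s)
        total += (gamma ** t) * rho * r                                     # per-decision IS term
        total -= (gamma ** t) * (rho * q_hat(s, a) - rho_prev * v_hat(s))   # control variate
    return total

# Illustrative single-state, two-action example (all values are placeholders).
pi_b = lambda a, s: 0.5
pi_e = lambda a, s: 0.8 if a == 1 else 0.2
q_hat = lambda s, a: 1.0 if a == 1 else -1.0
v_hat = lambda s: 0.8 * 1.0 + 0.2 * (-1.0)
print(dr_single_trajectory([0, 0], [1, 0], [1.0, -1.0], pi_e, pi_b, q_hat, v_hat))

If the model's predictions are accurate, the bracketed correction nearly cancels the reward term, which is exactly the variance-reduction mechanism described above.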
Before continuing our derivation we ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning verify that E [ Y ] = E [ Z ] if Assumption holds: E [ Z ] = E " n n X i =1 ∞ X t =0 ρ it − γ t ˆ r π e (cid:16) S H i t , (cid:17) (a) = E n n X i =1 ∞ X t =0 ρ it − γ t π e (cid:16) A H i t | S H i t (cid:17) π i (cid:16) A H i t | S H i t (cid:17) ˆ r π e (cid:16) S H i t , A H i t , (cid:17) = E " n n X i =1 ∞ X t =0 ρ it γ t ˆ r π e (cid:16) S H i t , A H i t , (cid:17) = E [ Y ] , where (a) comes from Lemma .So far, in (14), we have introduced a control variate intoPDIS that we expect might reduce the variance of the es-timator a little without introducing bias. However, it willstill have high variance because Z is a high-variance esti-mator of E [ Y ] . To overcome this, we can introduce anothercontrol variate into Z to make it a lower-variance estimatorof E [ Y ] . So, we introduce another control variate: n n X i =1 ∞ X t =0 ρ it γ t R H i t | {z } X − n n X i =1 ∞ X t =0 ρ it γ t ˆ r π e (cid:16) S H i t , A H i t , (cid:17)| {z } Y + 1 n n X i =1 ∞ X t =0 ρ it − γ t ˆ r π e (cid:16) S H i t , (cid:17)| {z } Z − n n X i =1 ∞ X t =0 ρ it − γ t ˆ r π e (cid:16) S H i t − , A H i t − , (cid:17)| {z } new control variate ,Y ′ + 1 n n X i =1 ∞ X t =0 ρ it − γ t ˆ r π e (cid:16) S H i t − , (cid:17)| {z } Z ′ . Here E [ Z ′ ] = E [ Y ′ ] (although we omit to proof of thisclaim), Y ′ is similar to Z and so it serves as a good controlvariate therefor, and Z ′ will usually have lower variancethan Y ′ because it uses ρ it − rather than ρ it − . However,now Z ′ is a high-variance estimator of E [ Y ′ ] . We thereforeintroduce a control variate for Z ′ , and this process repeats.This process of introducing control variates eventually ter-minates when the new control variate is not random. Theresulting estimator is (we call this estimator DR( D ) be- cause we will show that it is equivalent to (2)): DR( D ) = 1 n n X i =1 ∞ X t =0 ρ it γ t R H i t (15) − n n X i =1 ∞ X t =0 γ t t X τ =0 ρ iτ ˆ r π e (cid:0) S H i τ , A H i τ , t − τ (cid:1) + 1 n n X i =1 ∞ X t =0 γ t t X τ =0 ρ iτ − ˆ r π e (cid:0) S H i τ , t − τ (cid:1) . Next we will combine the ˆ r terms into ˆ v and ˆ q terms to geta more succinct expression. To this end, we will use theproperty that P ∞ i =0 P ij =0 f ( i, j ) = P ∞ j =0 P ∞ i = j f ( i, j ) tochange the order of the sums over t and τ . We also split γ t into γ τ γ t − τ : DR( D ) = 1 n n X i =1 ∞ X t =0 ρ it γ t R H i t − n n X i =1 ∞ X τ =0 ρ iτ γ τ ∞ X t = τ γ t − τ ˆ r π e (cid:0) S H i τ , A H i τ , t − τ (cid:1) + 1 n n X i =1 ∞ X τ =0 ρ iτ − γ τ ∞ X t = τ γ t − τ ˆ r π e (cid:0) S H i τ , t − τ (cid:1) . Next we perform a change of variable using j = t − τ toreplace t : DR( D ) = 1 n n X i =1 ∞ X t =0 ρ it γ t R H i t − n n X i =1 ∞ X τ =0 ρ iτ γ τ ∞ X j =0 γ j ˆ r π e (cid:0) S H i τ , A H i τ , j (cid:1) + 1 n n X i =1 ∞ X τ =0 ρ iτ − γ τ ∞ X j =0 γ j ˆ r π e (cid:0) S H i τ , j (cid:1) = 1 n n X i =1 ∞ X t =0 ρ it γ t R H i t − n n X i =1 ∞ X τ =0 ρ iτ γ τ ˆ q π e (cid:0) S H i τ , A H i τ (cid:1) + 1 n n X i =1 ∞ X τ =0 ρ iτ − γ τ ˆ v π e (cid:0) S H i τ (cid:1) . 
Replacing the variable τ with t and using w it = ρ it n we get ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning that: DR( D ) = n X i =1 ∞ X t =0 γ t w it R H i t − n X i =1 ∞ X t =0 γ t (cid:16) w it ˆ q π e (cid:16) S H i t , A H i t (cid:17) − w it − ˆ v π e (cid:16) S H i t (cid:17)(cid:17) , which is (2).The original derivation of the DR estimator (Jiang & Li,2015) required the horizon to be finite and known. Ourderivation makes neither of these assumptions. That is, itallows for infinite or indefinite horizons and for finite hori-zons where the horizon is not known. If the horizon, L ,is finite and known, then one should ensure that the modeluses all of the available information, including the knownhorizon and time step. In the next section we show that if L is finite and known, then our non-recursive definition of theDR estimator is equivalent to the recursive form of (Jiang& Li, 2015). B.1. Equivalence of DR Definitions In this section we show that our non-recursive definition ofthe DR estimator is equivalent to the recursive definitionprovided by Jiang & Li (2015) when the horizon is finiteand known. Theorem 8. (2) is equivalent to the DR estimator pre-sented by Jiang & Li (2015) if the finite horizon, L , of theMDP is known.Proof. Jiang & Li (2015) define the DR estimator for a sin-gle trajectory (i.e., n = 1 ) as the last element, X L , of a se-quence, ( X i ) Li =0 . This sequence is defined by the followingrecurrence relation. Let X := 0 and for all k ∈ { , . . . , L } let X k :=ˆ v π e ( S L − k ) + π e ( A L − k | S L − k ) π ( A L − k | S L − k ) R L − k + γX k − − ˆ q π e ( S L − k , A L − k ) ! . As in the definition of DR( D ) in (2), Jiang & Li (2015)define the DR estimator for multiple trajectories to be theaverage of the estimator for each trajectory individually.So, to show that their recursive definition and our definitionare equivalent, we need only show that they are equivalentwhen there is a single trajectory.Since hereafter in this proof we deal with only a single tra-jectory, we drop the superscripts that we use to specify thetrajectory, i.e., we write ρ t rather than ρ t . Also let π b := π denote the single behavior policy. For further brevity, let π eb ( t ) := π e ( A t | S t ) π b ( A t | S t ) . First, notice that we can rewrite (2) for the single-trajectoryfinite-horizon setting as: DR( D ) = L − X t =0 γ t ρ t R t − L − X t =0 γ t ρ t ˆ q π e ( S t , A t )+ L − X t =0 γ t ρ t − ˆ v π e ( S t )) , (16)since S L is surely the absorbing state and so R t , ˆ q π e ( S t , A t ) , and ˆ v π e ( S t ) are all zero for t ≥ L . Toverify that this definition is equivalent to X L , we will de-fine another sequence, ( Y i ) Li =1 , such that X i = Y i for all i ∈ { , . . . , L } and such that Y L = DR( D ) trivially.Let Y k := P L − t = L − k γ t " ρ t ( R t − ˆ q π e ( S t , A t )) + ρ t − ˆ v π e ( S t ) γ L − k ρ L − k − . Notice that Y L is identical to (16) since γ L − L ρ L − L − = 1 .So, all that remains is to show that Y k = X k for all k ∈{ , . . . , L } . We will show this using a proof by induction.For the base case, k = 1 , it is straightforward to verify that X = Y . For the inductive step we assume the inductivehypothesis that X k − = Y k − and show that then X k = Y k : X k :=ˆ v π e ( S L − k ) + π eb ( L − k ) R L − k + γX k − − ˆ q π e ( S L − k , A L − k ) ! =ˆ v π e ( S L − k ) + π eb ( L − k ) R L − k + γY k − − ˆ q π e ( S L − k , A L − k ) ! . 
Substituting in the definition of Y k − and performing alge-braic manipulations we have that: X k =ˆ v π e ( S L − k ) + π eb ( L − k ) R L − k + π eb ( L − k ) γ L − k ρ L − k × L − X t = L − k +1 γ t " ρ t ( R t − ˆ q π e ( S t , A t )) + ρ t − ˆ v π e ( S t ) − π eb ( L − k )ˆ q π e ( S L − k , A L − k ) , where × denotes that a line was split into multiple lines (wedo not use cross-products anywhere in this paper). Since π eb ( L − k ) ρ L − k = 1 ρ L − k − , ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning and by reordering terms, we have that X k = π eb ( L − k )( R L − k − ˆ q π e ( S L − k , A L − k )) + ˆ v π e ( S L − k )+ P L − t = L − k +1 γ t " ρ t ( R t − ˆ q π e ( S t , A t )) + ρ t − ˆ v π e ( S t ) γ L − k ρ L − k − . Adding one more element to the summation so that it startsat t = L − k , and then explicitly subtracting off this addi-tional term we have that: X k = π eb ( L − k )( R L − k − ˆ q π e ( S L − k , A L − k )) + ˆ v π e ( S L − k )+ P L − t = L − k γ t " ρ t ( R t − ˆ q π e ( S t , A t )) + ρ t − ˆ v π e ( S t ) γ L − k ρ L − k − − γ L − k γ L − k ρ L − k − " ρ L − k ( R L − k − ˆ q π e ( S L − k , A L − k ))+ ρ L − k − ˆ v π e ( S L − k ) . Canceling several γ and ρ terms, we have that: X k = P L − t = L − k γ t " ρ t ( R t − ˆ q π e ( S t , A t )) + ρ t − ˆ v π e ( S t ) γ L − k ρ L − k − = Y k . B.2. DR is Unbiased While Jiang & Li (2015) showed that the DR estimator(with finite horizon) is an unbiased estimator of v ( π e ) , inthis section we show that the DR estimator (without as-sumptions about the horizon) is an unbiased estimator of v ( π e ) . Theorem 9 (DR – unbiased estimator) . If Assumption holds, then E [DR( D )] = v ( π e ) .Proof. This result was shown previously for the known fi-nite horizon setting (Jiang & Li, 2015), but has not beenshown before for the other settings. Because we will usesome steps of this proof in later proofs, the majority of thisproof is relegated to a lemma. E [DR( D )] = E " n n X i =1 DR i ( D ) (a) = 1 n n X i =1 v ( π e )= v ( π e ) , where (a) comes from Lemma . B.3. Conditions for Consistency of DR In this section we show that the DR estimator is a stronglyconsistent estimator of v ( π e ) given mild technical assump-tions and that there is only one behavior policy (Theo-rem ) or that the importance weights are bounded (The-orem ). Theorem 10 (DR – strongly consistent estimator forone behavior policy) . If Assumptions and hold then DR( D ) a.s. −→ v ( π e ) . Proof. This proof is a relatively straightforward applica-tion of the law of large numbers.We have from Lemma that E [DR i ( D )] = v ( π e ) for all i ∈ { , . . . , n } . By Assumption , { DR i ( D ) } ni =1 is a set of n independent and identically distributed random variables(since H i ∼ π for all i , and DR i ( D ) only depends on H i ).We can therefore conclude by Khintchine’s strong law oflarge numbers, Theorem , that DR( D ) a.s. −→ v ( π e ) . Theorem 11 (DR – strongly consistent estimator for manybehavior policies) . If Assumptions and hold then DR( D ) a.s. −→ v ( π e ) . Proof. We have from Lemma that E [DR i ( D )] = v ( π e ) for all i ∈ { , . . . , n } . However, { DR i ( D ) } ni =1 is a setof n independent but not necessarily identically distributedrandom variables, so we cannot apply Khintchine’s stronglaw of large numbers. 
Instead, we will apply Kolmogorov’sstrong law of large numbers, which requires each randomvariable, DR i ( D ) , to be bounded.We have that: DR i ( D ) = ∞ X t =0 γ t ρ it R H i t − ∞ X t =0 γ t ρ it ˆ q π e (cid:16) S H i t , A H i t (cid:17) + ∞ X t =0 γ t ρ it − ˆ v π e (cid:16) S H i t (cid:17) = ∞ X t =0 γ t ρ it R H i t − ∞ X t =0 γ t ρ it ∞ X τ =0 γ τ ˆ r π e (cid:16) S H i t , A H i t , τ (cid:17)| {z } =:ˆ q πe (cid:16) S Hit ,A Hit (cid:17) + ∞ X t =0 γ t ρ it − ∞ X τ =0 γ τ ˆ r π e (cid:16) S H i t , τ (cid:17)| {z } =:ˆ v πe (cid:16) S Hit (cid:17) . ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning So, | DR i ( D ) | ≤ βr ⋆ max L X t =0 γ t L X τ =0 γ τ < ∞ , since either L < ∞ or γ ∈ [0 , . So, DR i ( D ) is boundedabove and below and thus we can apply Kolmogorov’sstrong law of large numbers (Corollary ) to conclude that DR( D ) a.s. −→ v ( π e ) . C. Weighted Doubly Robust Proofs C.1. Proof of Theorem In this section we prove Theorem , which states that WDR( D ) is a strongly consistent estimator of v ( π e ) if As-sumptions , , and hold.First, notice that we can rewrite the WDR estimator as: WDR( D ) := ∞ X t =0 γ t n X i =1 ρ it P nj =1 ρ jt R H i t | {z } =:CWPDIS( D ) (17) − ∞ X t =0 γ t n X i =1 ρ it P nj =1 ρ jt ˆ q π e (cid:16) S H i t , A H i t (cid:17)| {z } =: X n + ∞ X t =0 γ t n X i =1 ρ it − P nj =1 ρ jt − ˆ v π e (cid:16) S H i t (cid:17)| {z } =: Y n . We have from Lemma that CWPDIS( D ) a.s. −→ E " ∞ X t =0 γ t R Ht | H ∼ π e = v ( π e ) , (18)which has been shown before (Thomas, 2015b, Theorem13). Also by Lemma we have that X n a.s. −→ E h ∞ X t =0 γ t ˆ q π e (cid:0) S Ht , A Ht (cid:1) (cid:12)(cid:12)(cid:12) H ∼ π e i , (19) and by Lemma we have that Y n a.s. −→ E h ∞ X t =0 γ t ˆ v π e (cid:16) S Ht (cid:17) (cid:12)(cid:12)(cid:12) H ∼ π e i = E h ∞ X t =0 γ t ∞ X j =0 γ j ˆ r π e (cid:16) S Ht , j (cid:17)| {z } =ˆ v πe ( S Ht ) (cid:12)(cid:12)(cid:12) H ∼ π e i = E h ∞ X t =0 γ t ∞ X j =0 γ j X a ∈A π e (cid:16) a | S Ht (cid:17) ˆ r π e (cid:16) S Ht , a, j (cid:17)| {z } =ˆ r πe ( S Ht ,j ) (cid:12)(cid:12)(cid:12) H ∼ π e i = E h ∞ X t =0 γ t ∞ X j =0 γ j ˆ r π e (cid:16) S Ht , A Ht , j (cid:17)| {z } =ˆ q πe ( S Ht ,A Ht ) (cid:12)(cid:12)(cid:12) H ∼ π e i = E h ∞ X t =0 γ t ˆ q π e (cid:16) S Ht , A Ht (cid:17) (cid:12)(cid:12)(cid:12) H ∼ π e i . (20) So, by applying Property to (18), (19), and (20) we havethat WDR( D ) a.s. −→ v ( π e ) . C.2. Proof of Theorem In this section we prove Theorem , which states that ifAssumptions and hold then WDR( D ) a.s. −→ v ( π e ) . Recall that WDR can be defined as in (17). First weapply Lemma to the CWPDIS( D ) term, which uses f t ( H t +1 i ) = R H i t , which is bounded since | R H i t | ≤ r ⋆ max .The result of Lemma is that CWPDIS( D ) a.s. −→ E " ∞ X t =0 γ t R Ht | H ∼ π e = v ( π e ) . (21)Next we apply Lemma to the X n term, which uses f t ( H t +1 i ) = ˆ q π e (cid:16) S H i t , A H i t (cid:17) , which is bounded since (cid:12)(cid:12)(cid:12) ˆ q π e (cid:16) S H i t , A H i t (cid:17)(cid:12)(cid:12)(cid:12) ≤ ( r ⋆ max − γ if L = ∞ Lr max otherwise.The result of applying Lemma to X n is that X n a.s. −→ E h ∞ X t =0 γ t ˆ q π e (cid:0) S Ht , A Ht (cid:1) (cid:12)(cid:12)(cid:12) H ∼ π e i . (22)Lastly, we apply Lemma to the Y n term, which uses f t ( H ti ) = ˆ v π e (cid:16) S H i t (cid:17) , which is bounded since (cid:12)(cid:12)(cid:12) ˆ v π e (cid:16) S H i t (cid:17)(cid:12)(cid:12)(cid:12) ≤ ( r ⋆ max (1 − γ ) if L = ∞ Lr ⋆ max otherwise. 
ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning The result of applying Lemma to Y n is that Y n a.s. −→ E h ∞ X t =0 γ t ˆ v π e (cid:16) S Ht (cid:17) (cid:12)(cid:12)(cid:12) H ∼ π e i (a) = E h ∞ X t =0 γ t ˆ q π e (cid:16) S Ht , A Ht (cid:17) (cid:12)(cid:12)(cid:12) H ∼ π e i , (23) where (a) comes from the same derivation that was used in(20). So, by applying Property to (21), (22), and (23) wehave that WDR( D ) a.s. −→ v ( π e ) . D. Extended Empirical Studies (WDR) In this section we provide a detailed description of our ex-periments comparing the WDR estimator to various im-portance sampling estimators (IS, PDIS, WIS, CWPDIS),as well as DR and AM. We performed experiments usingthree domains: ModelFail, ModelWin, and a gridworld.We will describe each domain, then describe the experi-mental setup, and then present empirical results. All threedomains have a finite horizon and use γ = 1 . . D.1. The ModelFail Domain The ModelFail domain was constructed so that the modelwould fail to converge to the true MDP. One way that thiscan happen is if the model uses function approximation, sothat it cannot represent the true MDP. Another way that thiscan happen is if there is some partial observability, which iscommon in real applications. We therefore construct a do-main where the true underlying MDP has three states (plusthe terminal absorbing state), but where the agent cannottell the difference between any of the states.The MDP used by ModelFail is depicted in Figure 3. Al-though the MDP has three states (denoted by circles) plusthe terminal absorbing state (denoted by the double-circle),the agent does not observe which state it is in—it only seesa single state. The agent begins in the left-most state, whereit has two actions available. The first action always takes itto the upper state, while the second always takes in to thelower state. In both cases, the agent receives no reward.At time t = 1 , the agent is always in the upper or lowerstate (although it cannot tell the difference between themand the initial state), and it must select between two possi-ble actions. Both actions always have the same effect—theagent transitions to the terminal absorbing state. However,if the agent was in the upper state, R = 1 , while R = − if the agent was in the lower state. The horizon is L = 2 since S = ∞ s always.The behavior policy selects a with probability ap-proximately . and a with probability approximately . (these probabilities were chosen arbitrarily by us-ing weights of and − with softmax action selection, ✞☛✞(cid:0) ✁ ✂ ✌✄✁ ✂ ✄☎✆✝✟✆ Figure 3: ModelFail MDP.and were not optimized). The evaluation policy does theopposite—it selects a with probability approximately . and a with probability approximately . .Consider what happens when we try to model this MDPbased on the observations produced by running the be-havior policy to produce an infinite number of trajectories(without trying to infer anything about the true underlyingstructure of the MDP). Recall that we observe only a sin-gle state. First consider the transition dynamics: half ofthe time either action causes a transition back to the singlestate, while half of the time the agent transitions to the ab-sorbing state. Next consider the rewards: half of the timethe agent receives no reward, with probability . / it re-ceives a reward of , and with probability . 
/ it receivesa reward of − , and these rewards appear completely un-correlated with the action that was selected (since non-zerorewards occur at time t = 1 and A has no bearing on re-wards or state transitions). So, from the model’s point ofview, the actions have no impact on state transitions or re-wards, and so every policy is equally good and will producean expected return of . , while in reality an optimal pol-icy will produce an expected return of . and a pessimalpolicy will produce an expected return of − . .We provided the model with the true horizon, L = 2 , sothat its predictions of R t are zero for t ≥ . D.2. The ModelWin Domain This domain was constructed so that the approximatemodel of the MDP would quickly converge to the trueMDP, while importance sampling based approaches likeDR and WDR would continue to have high variance. Re-call from our discussion in Section 6 that DR and WDRwill be equal to a simple model-based approach if the ap-proximate MDP is perfect and state transition and rewardsare deterministic. To avoid this, the ModelWin domain hasstochastic state transitions that cause the (b) term in (3) tonot necessarily be zero.The ModelWin MDP is depicted in Figure 4. Unlike theModelFail domain, the agent observes the true underlyingstates of the ModelWin MDP, of which there are three, plusa terminal absorbing state (not pictured). The agent alwaysbegins in s , where it must select between two actions. The ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning first action, a , causes the agent to transition to s withprobability . and s with probability . . The second ac-tion, a , does the opposite: the agent transitions to s withprobability . and s with probability . . If the agenttransitions to s , then it receives a reward of , and if ittransitions to s it receives a reward of − . In states s and s , the agent has two possible actions, but both always pro-duce a reward of zero and a deterministic transition back to s . The horizon is set to L = 20 , so, S = ∞ s always. ✞☛ ✞(cid:0)✞✁ ✂✁✄ ☎✆✝✂☛✄ ☎✆✟✠ ✡ ✌☞✂✁✄ ☎✆✟✂☛✄ ☎✆✝✠ ✡ ☞ ✍✎✏✑✎ Figure 4: ModelWin MDP.To see why DR and WDR struggle on this domain, con-sider what happens if the approximate model is perfect andthe agent takes action a in state s . In our discussion of(3) we concluded that DR and WDR will perform well if R = q π e ( s , a ) − γ ˆ v π e ( S ′ ) , where S ′ is the state thatthe agent transitions to after taking action a in state s ,which is a random variable. Consider the two values thatthe right side can take, depending on whether S ′ = s or S ′ = s . It can be either ˆ q π e ( s , a ) − γ ˆ v π e ( s ) or ˆ q π e ( s , a ) − γ ˆ v π e ( s ) . Since ˆ v π e ( s ) = ˆ v π e ( s ) , thesetwo statements are equal—the prediction of R will be thesame regardless of whether the agent transitions to s or s , and so its prediction must sometimes be wrong (sincethe rewards differ depending on whether the agent transi-tions to s or s ). So, term (b) in (3) will not be zero—thecontrol variate used by DR and WDR does not perfectlycancel with the PDIS (or CWPDIS) term. If w it is large,then this will produce high variance. In order to make w it large, we need only make the horizon long and the behaviorand evaluation policies dissimilar.The behavior and evaluation policies both select actionsuniformly randomly in states s and s . However, in s thebehavior policy takes action a with probability approxi-mately . and action a with probability approximately . 
, while the evaluation policy does the opposite—ittakes action a with probability approximately . andaction a with probability approximately . (these prob-abilities come from using softmax action selection withweights of and ). Technically, implementing the horizon of L = 20 requiresthe states to be augmented to include the current time step so thatstate transitions are Markovian. The approximate model is pro-vided with the time step and the horizon. As in the ModelFail domain, for the ModelWin domain weprovided the approximate model with the true horizon ofthe MDP, L = 20 , so that its predictions of R t were zerofor t ≥ . D.3. The Gridworld Domain The third domain that we used was the gridworld domaindeveloped by Thomas (2015b, Section 2.5) for evaluatingOPE algorithms. It is a × gridworld with four actions, L = 100 , and deterministic transition and reward func-tions. This domain was developed specifically for evalu-ating different OPE methods. Thomas (2015b) proposedfive policies, π , . . . , π , that can serve as the behavior andevaluation policies.Although this setup was developed for evaluating OPEmethods, it was not developed with DR and WDR in mind(since they were introduced later). Specifically, its use ofdeterministic state-transition and reward functions meansthat when the model is accurate, AM, DR, and WDR willall perform similarly (due the the (b) term in (3) being near-zero).We therefore performed experiments with two variants ofthis gridworld. In the first variant the approximate modelwas provided with the horizon, L = 100 . However, in thesecond variant we introduced some partial observability byproviding the model with the incorrect horizon: L = 101 .This has a significant impact for value predictions close tothe end of a trajectory because the model incorrectly pre-dicts when the rewards will necessarily be zero. We write Gridworld-TH and Gridworld-FH to denote the gridworldwhere the agent is provided with the true horizon and falsehorizon, respectively. D.4. Experimental Setup For each domain we generated n trajectories (for various n ) and computed the sample mean squared error betweenthe predictions of the various OPE methods and the trueperformance of the evaluation policy (estimated using alarge number of on-policy Monte-Carlo rollouts). For eachvalue of n and each OPE algorithm, we performed this ex-periment times and report the average sample meansquared error over these trials. All plots include stan-dard error bars and use logarithmic scales for both the hor-izontal and vertical axes.Perhaps surprisingly, it is not obvious how to fairly com-pare the different OPE algorithms. Clearly IS, PDIS, WIS,and CWPDIS should use all of the trajectories in D , sincethey do not require an approximate model. Similarly, AMshould use all of the data to construct an approximatemodel. However, how should the available data be split forDR, WDR, and the MAGIC estimators? We believe that ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning there are at least three reasonable answers:1. DR, WDR, and MAGIC should be provided with ad-ditional trajectories not available to IS, PDIS, WIS,and CWPDIS, and these trajectories should be used toconstruct an approximate model. This setup wouldemulate the setting where prior domain knowledge(not necessarily trajectories) can be used to constructan approximate model, which IS, PDIS, WIS, andCWPDIS ignore.2. DR, WDR, and MAGIC should use all of the availabledata, D , to construct an approximate model. 
They should then reuse this same data to compute their estimates. This approach is reasonable, but the reuse of data invalidates our theoretical guarantees. Still, empirically we find that this approach causes DR, WDR, and MAGIC to perform at their best.

3. DR, WDR, and MAGIC should partition D into two sets. The first set should be used to construct the approximate model, and the second set should be used to compute the DR, WDR, and MAGIC estimates using the approximate model.

Since there is not necessarily a "correct" answer to which way of performing experiments is best, we show our results using both the second and third approach. For each domain, the "full-data" variant uses the second approach while the "half-data" variant uses the third approach, where D is partitioned into two sets of equal size.

Since all of the domains that we use have finite state and action sets, we use a simple maximum-likelihood approximate model. That is, we predict that the probability of transitioning from s to s′ given action a is the number of times this transition was observed divided by the number of times action a was taken in state s. If D contains no examples of action a being taken in state s, then we assume that taking action a in state s always causes a transition to the terminal absorbing state. A brief sketch of this model construction is given at the end of this subsection.

In this appendix, we present empirical results from four previous importance sampling methods, definitions of which can be found in the work of Thomas (2015b, Chapter 3): importance sampling (IS), per-decision importance sampling (PDIS), weighted importance sampling (WIS), and consistent weighted per-decision importance sampling (CWPDIS). We also show results for the guided importance sampling methods DR and WDR and the purely model-based method, AM. The legend used by all of the plots in this appendix is provided in Figure 5.

Figure 5: The legend used by all plots in Appendix D.
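The following Python sketch shows one way to build the count-based maximum-likelihood model described above, with unobserved state-action pairs sent to the terminal absorbing state. The data format, the inclusion of a mean-reward table keyed by (s, a, s′), and all function and variable names are illustrative assumptions, not the exact code used in our experiments.

import numpy as np
from collections import defaultdict

def build_mle_model(trajectories, n_states, n_actions, absorbing_state):
    # trajectories: list of episodes, each a list of (s, a, r, s_next) tuples
    # (an illustrative data format); n_states includes the absorbing state.
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = defaultdict(float)
    for episode in trajectories:
        for (s, a, r, s_next) in episode:
            counts[s, a, s_next] += 1.0
            reward_sum[(s, a, s_next)] += r

    transition = np.zeros_like(counts)
    for s in range(n_states):
        for a in range(n_actions):
            n_sa = counts[s, a].sum()
            if n_sa == 0.0:
                # Never observed (s, a): assume it always leads to the
                # terminal absorbing state, as described above.
                transition[s, a, absorbing_state] = 1.0
            else:
                # Empirical transition frequencies.
                transition[s, a] = counts[s, a] / n_sa

    # Mean observed reward for each observed (s, a, s') triple.
    mean_reward = {key: total / counts[key] for key, total in reward_sum.items()}
    return transition, mean_reward

The convention for unseen state-action pairs matches the rule described above; other choices, such as smoothing the counts, would also be reasonable.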
D.5. ModelFail Results

Figure 1b in Section 6 depicts the result on the ModelFail domain in the full-data setting. We reproduce this plot in Figure 6. Here the weighted importance sampling methods, WIS and CWPDIS, are obscured by the curve for WDR, while the unweighted importance sampling methods, IS and PDIS, are obscured by the curve for DR. Notice that WDR outperforms AM by orders of magnitude and DR by approximately an order of magnitude. Also notice that even though the approximate model is not accurate, which means that the control variates used by DR and WDR may be poor, the DR and WDR estimators do not perform worse than PDIS and CWPDIS, respectively.

In Figure 7 we reproduce this experiment in the half-data setting. Since AM does not use any data for importance sampling, in both settings (half-data and full-data) it is identical. Similarly, IS, PDIS, WIS, and CWPDIS do not use an approximate model, so they always use all of the data and are therefore also identical in both settings. However, DR and WDR are not the same—they use half of the data to construct the approximate model and the other half to compute their estimates. This means that, for DR and WDR, the approximate model tends to be worse, and the importance sampling estimate also tends to be worse. As a result, the DR and WDR curves are shifted up slightly. Still, the same general trends are evident—WDR outperforms AM by orders of magnitude and DR by an order of magnitude.

D.6. ModelWin Results

Figure 1c in Section 6 depicts the result of running importance sampling and guided importance sampling methods as well as the approximate model estimator on the ModelWin experimental setup in the full-data setting. We reproduce this plot in Figure 8. Here AM has approximately an order of magnitude lower MSE than all of the other methods, including WDR, and was our motivation for combining AM and WDR using BIM.

In Figure 9 we reproduce this experiment in the half-data setting. As with the ModelFail setup, this only hurts DR and WDR. When there are few trajectories, it appears to impact DR more than WDR, although this may be due to noise (notice the large standard error bars on the DR curve when n is small).

D.7. Gridworld Results

Figure 1a in Section 6 depicts the results of using the fourth gridworld policy, π_4, as the behavior policy and the fifth, π_5, as the evaluation policy for the Gridworld-FH domain in the full-data setting. We reproduce it in Figure 10. Notice that WDR outperforms all other methods by at least an order of magnitude.

Figure 6: ModelFail, full-data.
Figure 7: ModelFail, half-data.
Figure 8: ModelWin, full-data.
Figure 9: ModelWin, half-data.
(Each of these plots shows mean squared error versus the number of episodes, n.)

In Figure 11 we reproduce this experiment in the half-data setting. As before there is little change, except that the DR and WDR curves shift up. WDR remains the best-performing estimator, by approximately an order of magnitude.

Next we reproduced Figures 10 and 11 for Gridworld-TH as opposed to Gridworld-FH. The results are in Figures 12 and 13 respectively. Notice that, when given the true horizon, AM excels. In the full-data setting DR and WDR both lie directly on top of the curve for AM. This makes sense because the transition function and reward function are deterministic, and so, given the way that we constructed our approximate model, both methods degenerate to exactly AM. In the half-data setting DR and WDR lag slightly behind the curve for AM since they can only use half as much data.

Next we reproduced these four figures using the first gridworld policy, π_1, as the behavior policy and the second, π_2, as the evaluation policy. Whereas π_4 and π_5 are nearly deterministic and produce long trajectories, π_1 and π_2 are far from deterministic and tend to produce shorter trajectories. Notably, the behavior policy, π_1, selects actions uniformly randomly, and so this presents a very different setting for OPE. The results are provided in Figures 14–17. In this example, DR and WDR perform similarly—significantly better than the importance sampling algorithms IS, PDIS, WIS, and CWPDIS, and marginally better than AM given enough data. Also, when the true horizon is provided to the model, DR and WDR again degenerate to AM.
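For concreteness, each point in the plots discussed above can be reproduced with a simple evaluation loop like the Python sketch below, which averages the squared error of an estimator over many independently generated data sets, as described in Appendix D.4. The estimator and data-generation interfaces shown here are illustrative placeholders, not the code used in our experiments.

import numpy as np

def average_squared_error(generate_data, estimator, true_value, n_episodes, n_trials, seed=0):
    # Sample MSE of an OPE estimator: the average of (estimate - v(pi_e))^2
    # over independently generated data sets, plus its standard error.
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        data = generate_data(n_episodes, rng)   # n trajectories from the behavior policy
        errors.append((estimator(data) - true_value) ** 2)
    errors = np.asarray(errors)
    return errors.mean(), errors.std(ddof=1) / np.sqrt(n_trials)

# Illustrative usage with a dummy "estimator" (the sample mean of returns).
generate_data = lambda n, rng: rng.normal(loc=1.0, scale=1.0, size=n)
estimator = lambda data: float(np.mean(data))
print(average_squared_error(generate_data, estimator, true_value=1.0, n_episodes=100, n_trials=50))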
D.8. Summary

The key takeaways from these experiments are that WDR tends to outperform the other importance sampling estimators, IS, PDIS, WIS, and CWPDIS, as well as the guided importance sampling method, DR. None of these methods achieved mean squared errors within an order of magnitude of WDR's across all of our experiments. This shows the power of WDR as a guided importance sampling method. However, WDR did not always win—in the ModelWin setting, AM outperformed WDR by an order of magnitude. Similar results have been observed by others. For example, in the experiments of Jiang & Li (2015), AM tended to outperform DR (although they did not compare to WDR, since it had not yet been introduced). This motivated our introduction of the BIM estimator as a way to blend together WDR and AM.

Notice that, if the transition function and reward function are deterministic and there is no partial observability (as in the gridworld experiments using the true horizon), then, given the way that we constructed our approximate model, DR and WDR degenerate to AM. This degeneration (which is not bad, but suggests that importance sampling methods are not necessary) would also not occur if the approximate model used function approximation.

Figure 10: Gridworld-FH, full-data, π_4 behavior policy, π_5 evaluation policy.
Figure 11: Gridworld-FH, half-data, π_4 behavior policy, π_5 evaluation policy.
Figure 12: Gridworld-TH, full-data, π_4 behavior policy, π_5 evaluation policy.
Figure 13: Gridworld-TH, half-data, π_4 behavior policy, π_5 evaluation policy.
Figure 14: Gridworld-FH, full-data, π_1 behavior policy, π_2 evaluation policy.
Figure 15: Gridworld-FH, half-data, π_1 behavior policy, π_2 evaluation policy.
Figure 16: Gridworld-TH, full-data, π_1 behavior policy, π_2 evaluation policy.
Figure 17: Gridworld-TH, half-data, π_1 behavior policy, π_2 evaluation policy.
(Each of these plots shows mean squared error versus the number of episodes, n.)

Lastly, notice that DR and WDR performed better in the full-data setting than in the half-data setting. This suggests that, in practice, one should use all of the available data both to produce an approximate model and to compute the DR and WDR estimates. Even though this violates the assumptions used by our theoretical guarantees, this does not mean, for example, that MAGIC will not still be a strongly consistent estimator for the application at hand.

E. Consistency of BIM

In this appendix we prove Theorem , which states that if Assumption holds, there exists at least one $j \in J$ such that $g^{(j)}(D)$ is a strongly consistent estimator of $v(\pi_e)$, $\hat b_n - b_n \xrightarrow{\text{a.s.}} 0$, and $\hat\Omega_n - \Omega_n \xrightarrow{\text{a.s.}} 0$, then $\operatorname{BIM}(D, \hat\Omega_n, \hat b_n) \xrightarrow{\text{a.s.}} v(\pi_e)$.

We begin by showing that BIM converges almost surely to $v(\pi_e)$ if it were to use the true $\Omega_n$ and $b_n$, rather than estimates thereof. Let $j^\star \in J$ be an index such that $g^{(j^\star)}(D) \xrightarrow{\text{a.s.}} v(\pi_e)$, which exists by assumption. Let $y \in \Delta^{|J|}$ be the weight vector that places a weight of one on $g^{(j^\star)}(D)$ and a weight of zero on the other returns, such that $y^\intercal g_J(D) = g^{(j^\star)}(D) \xrightarrow{\text{a.s.}} v(\pi_e)$. So, by Lemma (which requires that $g^{(j)}(D)$ is uniformly bounded for all $j \in J$, which holds by Assumption and the fact that rewards and reward predictions are bounded), we have that $\lim_{n\to\infty} \operatorname{MSE}(y^\intercal g_J(D), v(\pi_e)) = 0$.

Recall that $\operatorname{BIM}(D, \Omega_n, b_n)$ uses the weight vector, $x^\star$, that minimizes the MSE:
\[
x^\star \in \arg\min_{x \in \Delta^{|J|}} \operatorname{MSE}(x^\intercal g_J(D), \Omega_n, b_n).
\]
Since $y \in \Delta^{|J|}$, we have that for all $n$,
\[
\operatorname{MSE}\big((x^\star)^\intercal g_J(D), v(\pi_e)\big) \le \operatorname{MSE}\big(y^\intercal g_J(D), v(\pi_e)\big).
\]
Since lim n →∞ MSE( y ⊺ g J ( D ) , v ( π e )) = 0 we have that lim n →∞ MSE(( x ⋆ ) ⊺ g J ( D ) , v ( π e )) ≤ , and since MSE is always greater than or equal to zero,we can replace the ≤ above with an equality. Since ( x ⋆ ) ⊺ g J ( D ) = BIM( D, Ω n , b n ) this can be rewritten as lim n →∞ MSE(BIM( D, Ω n , b n ) , v ( π e )) = 0 . By Lemma we have that this implies that BIM( D, Ω n b n ) a.s. −→ v ( π e ) . So far we have shown that BIM, when using the true co-variance matrix and bias vector, converges almost surely to ata-Efficient Off-Policy Policy Evaluation for Reinforcement Learning v ( π e ) . By Lemma we can therefore conclude that if b b n − b n a.s. −→ and b Ω n − Ω n a.s. −→ , then BIM( D, b Ω n b b n ) a.s. −→ v ( π e ) . F. Derivation of g ( j ) ( D ) using WDR In this appendix we derive a reasonable definition for g ( j ) ( D ) , the off-policy j -step return, when using WDR forthe importance sampling estimator. We assume that thereader is familiar with our use of control variates in Ap-pendix B. First, consider what control variate should beadded to the j -step PDIS or CWPDIS estimator: n X i =1 j X t =0 γ t w it R H i t , where the definition of w it determines whether this is PDISor CWPDIS. Reproducing our arguments from AppendixB, we find that a reasonable definition for IS ( j ) ( D ) is sim-ilar to (15), but with the time index, t , summing only to t = j and using w it terms rather than ρ it terms for general-ity: IS ( j ) ( D ) := n X i =1 j X t =0 w it γ t R H i t − n X i =1 j X t =0 γ t t X τ =0 w iτ ˆ r π e (cid:0) S H i τ , A H i τ , t − τ (cid:1) + n X i =1 j X t =0 γ t t X τ =0 w iτ − ˆ r π e (cid:0) S H i τ , t − τ (cid:1) . Notice that this definition is not equivalent to what onewould get if (2) were modified only so that the sum goesfrom time t = 0 to t = j , since that definition would in-clude reward predictions beyond R j in ˆ v and ˆ q terms. In-stead, this definition is equivalent to the definition of (2)if it were applied to a modified MDP where every episodeterminates after R j is produced.Next, consider the definition of AM ( j ) ( D ) . We might useimportance sampling to correct for the distribution of S j ,and the model to predict the remaining rewards: AM ( j ) ( D ) = γ j n X i =1 w ij − ˆ v π e ( S H i j )= γ j n X i =1 w ij − ∞ X τ =0 γ τ ˆ r π e ( S H i j , τ ) . This is just one possible definition of AM ( j ) . We alsoexperimented with a definition that is purely model based: AM ( j ) ( D ) := P s ∈S b d ( s ) P ∞ t = j γ t ˆ r π e ( s, t ) . Since this defini-tion does not include any importance weights, it does not requirean additional control variate. We found that this variant performedsimilarly to the definition that we present. Notice that AM ( j ) is not a purely model-based estimator if j ≥ since it uses importance weights. Furthermore, thisuse of importance sampling can result in high variance. Topartially mitigate this variance, we can introduce a controlvariate to get a new definition: AM ( j ) ( D ) = γ j n X i =1 w ij − ∞ X τ =0 γ τ ˆ r π e ( S H i j , τ ) − γ j n X i =1 w ij − ∞ X τ =0 γ τ ˆ r π e ( S H i j − , A H i j − , τ + 1)+ γ j n X i =1 w ij − ∞ X τ =0 γ τ ˆ r π e ( S H i j − , τ + 1) . 
As in our derivation of the DR estimator in Appendix B, we can repeat this process by continuing to add control variates until the control variate is not random, to get our final definition of $\operatorname{AM}^{(j)}(D)$:
\[
\operatorname{AM}^{(j)}(D) := \gamma^j \sum_{i=1}^{n} w_{i,j-1} \sum_{\tau=0}^{\infty} \gamma^\tau \hat{r}^{\pi_e}\!\big(S^{H_i}_j, \tau\big) - \gamma^j \sum_{k=1}^{j} \sum_{i=1}^{n} w_{i,j-k} \sum_{\tau=0}^{\infty} \gamma^\tau \hat{r}^{\pi_e}\!\big(S^{H_i}_{j-k}, A^{H_i}_{j-k}, \tau+k\big) + \gamma^j \sum_{k=1}^{j} \sum_{i=1}^{n} w_{i,j-k-1} \sum_{\tau=0}^{\infty} \gamma^\tau \hat{r}^{\pi_e}\!\big(S^{H_i}_{j-k}, \tau+k\big).
\]
Combining the IS and AM definitions to produce an off-policy $j$-step return as defined in (4), we have:
\begin{align*}
g^{(j)}(D) := {}& \operatorname{IS}^{(j)}(D) + \operatorname{AM}^{(j+1)}(D)\\
= {}& \sum_{i=1}^{n} \sum_{t=0}^{j} w_{it} \gamma^t R^{H_i}_t + \gamma^{j+1} \sum_{i=1}^{n} w_{ij} \sum_{\tau=0}^{\infty} \gamma^\tau \hat{r}^{\pi_e}\!\big(S^{H_i}_{j+1}, \tau\big)\\
& - \underbrace{\sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t \sum_{\tau=0}^{t} w_{i\tau}\, \hat{r}^{\pi_e}\!\big(S^{H_i}_\tau, A^{H_i}_\tau, t-\tau\big)}_{(a)} + \underbrace{\sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t \sum_{\tau=0}^{t} w_{i,\tau-1}\, \hat{r}^{\pi_e}\!\big(S^{H_i}_\tau, t-\tau\big)}_{(b)}\\
& - \underbrace{\gamma^{j+1} \sum_{k=1}^{j+1} \sum_{i=1}^{n} w_{i,j+1-k} \sum_{\tau=0}^{\infty} \gamma^\tau \hat{r}^{\pi_e}\!\big(S^{H_i}_{j+1-k}, A^{H_i}_{j+1-k}, \tau+k\big)}_{(c)} + \underbrace{\gamma^{j+1} \sum_{k=1}^{j+1} \sum_{i=1}^{n} w_{i,j-k} \sum_{\tau=0}^{\infty} \gamma^\tau \hat{r}^{\pi_e}\!\big(S^{H_i}_{j+1-k}, \tau+k\big)}_{(d)}.
\end{align*}
Notice that the terms (a) and (b) use predictions of rewards up until and including $R_j$, while the terms (c) and (d) use predictions of rewards beginning with $R_{j+1}$ and going to infinity. So, with algebraic manipulations we can combine (a) and (c) to get $\sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t w_{it}\, \hat{q}^{\pi_e}\big(S^{H_i}_t, A^{H_i}_t\big)$, and we can combine (b) and (d) to get $\sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t w_{i,t-1}\, \hat{v}^{\pi_e}\big(S^{H_i}_t\big)$. So, we have that
\[
g^{(j)}(D) := \sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t w_{it} R^{H_i}_t + \sum_{i=1}^{n} \gamma^{j+1} w_{ij}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_{j+1}\big) - \sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t \Big( w_{it}\, \hat{q}^{\pi_e}\!\big(S^{H_i}_t, A^{H_i}_t\big) - w_{i,t-1}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_t\big) \Big).
\]

G. MAGIC Details

In this section we provide additional details about the MAGIC algorithm. Specifically, we describe exactly how we estimate $\Omega_n$ and $b_n$ before presenting pseudocode for MAGIC.

G.1. Estimating $\Omega_n$

We can write $g^{(j)}(D)$ as the sum of $n$ terms:
\[
g^{(j)}(D) = \sum_{i=1}^{n} g^{(j)}_i(D), \tag{24}
\]
where
\[
g^{(j)}_i(D) := \left( \sum_{t=0}^{j} \gamma^t w_{it} R^{H_i}_t \right) + \gamma^{j+1} w_{ij}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_{j+1}\big) - \sum_{t=0}^{j} \gamma^t \Big( w_{it}\, \hat{q}^{\pi_e}\!\big(S^{H_i}_t, A^{H_i}_t\big) - w_{i,t-1}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_t\big) \Big).
\]
So,
\[
\operatorname{Cov}\big(g^{(i)}(D), g^{(j)}(D)\big) = \operatorname{Cov}\left( \sum_{k=1}^{n} g^{(i)}_k(D),\ \sum_{k=1}^{n} g^{(j)}_k(D) \right).
\]
Notice that $g^{(j)}_i(D)$ really is a function of all of $D$, not just $H_i$, since $w_{it} = \rho_{it} / \sum_{j=1}^{n} \rho_{jt}$. This means that, although the terms in the sum, $\sum_{k=1}^{n} g^{(i)}_k(D)$, are identically distributed, they are not independent, due to their shared reliance on $D$.
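For concreteness, the following is a minimal Python sketch of these per-trajectory components, $g^{(j)}_i(D)$. It is a sketch under stated assumptions, not part of the algorithm's specification: it assumes trajectories of a common length $L$ with $j \le L - 2$ (so that $\hat{v}^{\pi_e}(S_{j+1})$ is available), illustrative array names, and the convention $w_{i,-1} = 1/n$ (equivalently, $\rho_{i,-1} = 1$); handling of variable-length episodes is omitted.

```python
import numpy as np

def j_step_return_components(rewards, rho, q_hat, v_hat, gamma, j):
    """Per-trajectory components g^(j)_i(D) of the off-policy j-step return, as in (24).

    Assumed shapes (n trajectories, common horizon L, with j <= L - 2):
      rewards[i, t] : observed reward R_t of trajectory i
      rho[i, t]     : cumulative importance ratio rho_{it}
      q_hat[i, t]   : model prediction of q^{pi_e}(S_t, A_t) along trajectory i
      v_hat[i, t]   : model prediction of v^{pi_e}(S_t) along trajectory i
    Returns an array g of shape (n,); g[i] = g^(j)_i(D), and g.sum() = g^(j)(D).
    """
    n, L = rewards.shape
    # Weighted importance weights w_{it} = rho_{it} / sum_k rho_{kt},
    # with the (assumed) convention w_{i,-1} = 1/n.
    w = rho / rho.sum(axis=0, keepdims=True)
    w_prev = np.hstack([np.full((n, 1), 1.0 / n), w[:, :-1]])   # w_{i,t-1}
    disc = gamma ** np.arange(j + 1)                            # gamma^t for t = 0..j
    g = (disc * w[:, : j + 1] * rewards[:, : j + 1]).sum(axis=1)
    g += gamma ** (j + 1) * w[:, j] * v_hat[:, j + 1]
    g -= (disc * (w[:, : j + 1] * q_hat[:, : j + 1]
                  - w_prev[:, : j + 1] * v_hat[:, : j + 1])).sum(axis=1)
    return g
```

Summing the returned components over trajectories gives $g^{(j)}(D)$, and stacking them for each $j \in \mathcal{J}$ provides the inputs used in the covariance estimate developed next.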
However, notice that the $g^{(i)}_k(D)$ terms, for various $k$, become less dependent as $n \to \infty$, because the only dependence of $g^{(i)}_k(D)$ on trajectories other than $H_k$ comes from the denominator of $w_{it}$, $\sum_{k=1}^{n}\rho_{kt}$, whose average, $\frac{1}{n}\sum_{k=1}^{n}\rho_{kt}$, converges almost surely to one (we established this in our proofs that WDR is strongly consistent). We therefore propose an approximation of $\Omega_n$ that comes from the assumption that the $g^{(i)}_k(D)$ terms, for various $k$, are independent:
\begin{align*}
\operatorname{Cov}\big(g^{(i)}(D), g^{(j)}(D)\big) &= \sum_{k \in \{1,\ldots,n\}} \sum_{l \in \{1,\ldots,n\}} \operatorname{Cov}\big(g^{(i)}_k(D), g^{(j)}_l(D)\big)\\
&\overset{(a)}{\approx} \sum_{k \in \{1,\ldots,n\}} \operatorname{Cov}\big(g^{(i)}_k(D), g^{(j)}_k(D)\big)\\
&\overset{(b)}{=} n \operatorname{Cov}\big(g^{(i)}_{(\cdot)}(D), g^{(j)}_{(\cdot)}(D)\big),
\end{align*}
where (a) comes from the assumption that $g^{(i)}_k(D)$ and $g^{(j)}_l(D)$ are independent for all $i, j, k,$ and $l$ where $k \ne l$, (b) comes from the assumption that they are identically distributed, and where $g^{(i)}_{(\cdot)}(D)$ uses $(\cdot)$ to denote that any subscript in $\{1, \ldots, n\}$ could be used since the random variables are independent and identically distributed.

We therefore approximate $\Omega_n$ using the sample covariance:
\[
\hat{\Omega}_n(i, j) := \frac{n}{n-1} \sum_{k=1}^{n} \Big( g^{(\mathcal{J}_i)}_k(D) - \bar{g}^{(\mathcal{J}_i)}(D) \Big)\Big( g^{(\mathcal{J}_j)}_k(D) - \bar{g}^{(\mathcal{J}_j)}(D) \Big), \tag{25}
\]
where $\bar{g}^{(\mathcal{J}_i)}(D) := \frac{1}{n} \sum_{k=1}^{n} g^{(\mathcal{J}_i)}_k(D)$. The above scheme for estimating $\Omega_n$ is the one that we use in our pseudocode and experiments. However, we also experimented with bootstrap estimates of $\Omega_n$. They yielded similar performance at significantly higher computational cost.

G.2. Estimating $b_n$

As described previously, we use a confidence interval, $\operatorname{CI}(g^{(\infty)}(D), \delta)$, when computing $b_n$. We stated that the confidence interval that we use is a combination of the percentile bootstrap and the Chernoff-Hoeffding inequality. Specifically, we compute the confidence interval produced by both methods, and return the tighter of the two. In practice, this is nearly always the confidence interval produced by the percentile bootstrap, and so practical implementations of MAGIC may just use the percentile bootstrap. We include the loose Chernoff-Hoeffding bound because it allows for easier theoretical analysis of the MAGIC algorithm.

G.3. Pseudocode

Pseudocode for the MAGIC algorithm is provided in Algorithm 1. It takes as input $D$, $\pi_e$, and an approximate model, all of which are defined in Section 2. It also takes as input $\mathcal{J}$, which is defined in Section 7, and a positive integer $\kappa$, which we have not defined previously. We use $\kappa$ to denote the number of times the bootstrap algorithm should resample the trajectories. In our experiments we used $\kappa = 200$. In general, it should be made as large as possible given any runtime constraints. Other literature has suggested that it should be chosen to be approximately $\kappa = 2000$ (Efron & Tibshirani, 1993; Davison & Hinkley, 1997).

Line 2 calls for the $|\mathcal{J}| \times |\mathcal{J}|$ matrix, $\hat{\Omega}_n$, to be computed according to (25). Line 3 specifies that a structure, $D_{(\cdot)}$, should be created. This structure will be used to store the bootstrap resamplings, such that $D_i$ is the $i$th resampling of $D$.
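Before walking through the remaining lines of Algorithm 1, the following minimal Python sketch ties together the estimates just described: the sample covariance (25), a percentile-bootstrap interval for $g^{(\infty)}(D) = \operatorname{WDR}(D)$, the resulting bias vector, and the final weight optimization. The names, default values, and percentile indexing are illustrative assumptions; the exponentiated-gradient loop stands in for the Gurobi quadratic program used in our experiments; the Chernoff-Hoeffding truncation of (26)-(27) is omitted; and wdr_resample is assumed to be a user-supplied function that recomputes the WDR estimate on the trajectories selected by an index array.

```python
import numpy as np

def magic_estimate(g_components, wdr_resample, kappa=200, delta=0.1,
                   eta=0.01, n_steps=5000, rng=None):
    """Sketch of Appendices G.1-G.3: covariance estimate, bias estimate, and weights.

    g_components[k, j] holds g^(J_j)_k(D), the contribution of trajectory k to the
    j-th off-policy return in J (e.g., from the earlier sketch); the last column is
    assumed to correspond to J_|J| = infinity, i.e., the WDR estimator.
    wdr_resample(idx) must return WDR recomputed on the resampled trajectories idx.
    """
    rng = np.random.default_rng(rng)
    n, m = g_components.shape
    g = g_components.sum(axis=0)          # the point estimates g^(J_j)(D)
    wdr = g[-1]                           # g^(inf)(D) = WDR(D)

    # (25): scaled sample covariance of the per-trajectory components.
    centered = g_components - g_components.mean(axis=0, keepdims=True)
    omega_hat = (n / (n - 1.0)) * centered.T @ centered

    # Percentile-bootstrap interval [l, u] for WDR (one standard indexing rule;
    # the exact indices used by Algorithm 1 may differ).
    boot = np.sort([wdr_resample(rng.integers(0, n, size=n)) for _ in range(kappa)])
    l = min(wdr, boot[int(np.floor(0.5 * delta * kappa))])
    u = max(wdr, boot[min(kappa - 1, int(np.ceil((1.0 - 0.5 * delta) * kappa)))])

    # Bias vector (lines 10-12): distance from each return to the interval [l, u].
    b_hat = np.where(g > u, g - u, np.where(g < l, g - l, 0.0))

    # Line 13: minimize x^T (Omega_hat + b_hat b_hat^T) x over the simplex.
    # A simple exponentiated-gradient loop stands in for the quadratic program;
    # the step size eta may need tuning to the scale of the objective.
    A = omega_hat + np.outer(b_hat, b_hat)
    x = np.full(m, 1.0 / m)
    for _ in range(n_steps):
        x = x * np.exp(-eta * (2.0 * A @ x))
        x /= x.sum()
    return float(x @ g)                   # line 14: weighted combination
```

In practice the inner loop would be replaced by an off-the-shelf quadratic-program solver, as in Algorithm 1.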
Each resampling $D_i$ is a set of $n$ trajectories and the behavior policies that generated them, sampled with replacement from $D$ (this resampling is done on lines 4-6).

Line 7 calls for the creation of a vector, $v$, to store the off-policy $j$-step return for $j = \infty$ (recall that this is just the WDR estimator) for each bootstrap sample, sorted into ascending order. Lines 8 and 9 then compute the percentile bootstrap confidence interval, $[l, u]$, for the mean of $g^{(\infty)}(D)$, which we ensure includes $\operatorname{WDR}(D)$. For our theoretical analysis, we add a line after this that sets
\[
l \leftarrow \max\!\left\{ l,\ \operatorname{WDR}(D) - \xi\sqrt{\frac{\ln(2/\delta)}{2n}} \right\} \tag{26}
\]
and
\[
u \leftarrow \min\!\left\{ u,\ \operatorname{WDR}(D) + \xi\sqrt{\frac{\ln(2/\delta)}{2n}} \right\}, \tag{27}
\]
where $\xi$ is a bound on the range of $g^{(i)}(D)$. In practice, these lines almost never change the values of $l$ and $u$ and can be ignored.

Lines 10-12 then show how the bias vector can be computed from the already defined terms. Notice that the order of $g^{(\mathcal{J}_j)}(D)$ and $l$ or $u$ does not matter since the bias term in the decomposition of mean squared error is squared. The order that we use facilitates a simple consistency proof for MAGIC. Given that the covariance matrix and bias vector have been approximated, line 13 sets $x$ to be the solution of a constrained quadratic program (in our experiments we solved this quadratic program using the Gurobi library). Finally, line 14 returns the weighted combination of the different off-policy $j$-step returns (recall that $g_{\mathcal{J}}(D)$ is defined in Section 7).

Algorithm 1 MAGIC($D$)
Input:
• $D$: Historical data.
• $\pi_e$: Evaluation policy.
• Approximate model that allows for computation of $\hat{r}^{\pi_e}(s, a, t)$.
• $\mathcal{J}$: The set of return lengths to consider. The first element, $\mathcal{J}_1$, should be $-1$ and the last, $\mathcal{J}_{|\mathcal{J}|}$, should be $\infty$.
• $\kappa$: The number of bootstrap resamplings.
2: Compute $\hat{\Omega}_n$ according to (25).
3: Allocate $D_{(\cdot)}$ so that for all $i \in \{1, \ldots, \kappa\}$, $D_i$ can hold $n$ trajectories.
4: for $i = 1$ to $\kappa$ do
5:   Load $D_i$ with $n$ uniform random samples drawn from $D$ with replacement.
6: end for
7: $v = \operatorname{sort}\big(g^{(\infty)}(D_{(\cdot)})\big)$
8: $l \leftarrow \min\{\operatorname{WDR}(D),\ \text{the lower endpoint of the percentile bootstrap interval from } v\}$
9: $u \leftarrow \max\{\operatorname{WDR}(D),\ \text{the upper endpoint of the percentile bootstrap interval from } v\}$
10: for $j = 1$ to $|\mathcal{J}|$ do
11:   $\hat{b}_n(j) \leftarrow \begin{cases} g^{(\mathcal{J}_j)}(D) - u & \text{if } g^{(\mathcal{J}_j)}(D) > u \\ g^{(\mathcal{J}_j)}(D) - l & \text{if } g^{(\mathcal{J}_j)}(D) < l \\ 0 & \text{otherwise} \end{cases}$
12: end for
13: $x \leftarrow \arg\min_{x \in \Delta^{|\mathcal{J}|}} x^\top \big[\hat{\Omega}_n + \hat{b}_n \hat{b}_n^\top\big] x$
14: return $x^\top g_{\mathcal{J}}(D)$

H. Consistency of MAGIC

In this section we prove Theorem , which states that if Assumptions  and  hold and $\infty \in \mathcal{J}$, then $\operatorname{MAGIC}(D) \xrightarrow{\text{a.s.}} v(\pi_e)$. This result follows immediately from Theorem  if $\hat{\Omega}_n - \Omega_n \xrightarrow{\text{a.s.}} 0$ and $\hat{b}_n - b_n \xrightarrow{\text{a.s.}} 0$, since Assumptions  and  are sufficient to ensure that $g^{(\infty)}(D) = \operatorname{WDR}(D) \xrightarrow{\text{a.s.}} v(\pi_e)$. In Appendix H.3 we show that $\hat{\Omega}_n - \Omega_n \xrightarrow{\text{a.s.}} 0$, and then in Appendix H.4 we show that $\hat{b}_n - b_n \xrightarrow{\text{a.s.}} 0$. However, first we establish two useful properties of the off-policy $j$-step returns.

H.1. Convergence of Off-Policy $j$-Step Return

Recall that the off-policy $j$-step return used by MAGIC is given by:
\[
g^{(j)}(D) := \sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t w_{it} R^{H_i}_t + \sum_{i=1}^{n} \gamma^{j+1} w_{ij}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_{j+1}\big) - \sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t \Big( w_{it}\, \hat{q}^{\pi_e}\!\big(S^{H_i}_t, A^{H_i}_t\big) - w_{i,t-1}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_t\big) \Big),
\]
which can be written as:
\[
g^{(j)}(D) = \sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t w_{it} X_{it} + \frac{1}{n} \sum_{i=1}^{n} \hat{v}^{\pi_e}\!\big(S^{H_i}_0\big), \quad \text{where } X_{it} = R^{H_i}_t - \hat{q}^{\pi_e}\!\big(S^{H_i}_t, A^{H_i}_t\big) + \gamma\, \hat{v}^{\pi_e}\!\big(S^{H_i}_{t+1}\big). \tag{28}
\]
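This rewriting follows from a shift of the time index, assuming the convention $w_{i,-1} = 1/n$ (equivalently $\rho_{i,-1} = 1$), which is consistent with the $\frac{1}{n}\sum_{i=1}^{n}\hat{v}^{\pi_e}(S^{H_i}_0)$ term above. For completeness, the step is:
\begin{align*}
\sum_{t=0}^{j} \gamma^t w_{it} X_{it} &= \sum_{t=0}^{j} \gamma^t w_{it} R^{H_i}_t - \sum_{t=0}^{j} \gamma^t w_{it}\, \hat{q}^{\pi_e}\!\big(S^{H_i}_t, A^{H_i}_t\big) + \sum_{t=1}^{j+1} \gamma^t w_{i,t-1}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_t\big)\\
&= \sum_{t=0}^{j} \gamma^t w_{it} R^{H_i}_t + \gamma^{j+1} w_{ij}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_{j+1}\big) - \sum_{t=0}^{j} \gamma^t \Big( w_{it}\, \hat{q}^{\pi_e}\!\big(S^{H_i}_t, A^{H_i}_t\big) - w_{i,t-1}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_t\big) \Big) - w_{i,-1}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_0\big),
\end{align*}
so summing over $i$ and adding $\sum_{i=1}^{n} w_{i,-1}\, \hat{v}^{\pi_e}(S^{H_i}_0) = \frac{1}{n}\sum_{i=1}^{n}\hat{v}^{\pi_e}(S^{H_i}_0)$ to both sides recovers (28).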
Notice that $X_{it}$ is a bounded random variable since rewards and reward predictions are bounded. So, by Lemma  we have that
\[
\sum_{i=1}^{n} \sum_{t=0}^{j} \gamma^t w_{it} X_{it} \xrightarrow{\text{a.s.}} \mathbf{E}\!\left[ \sum_{t=0}^{j} \gamma^t X_t \,\middle|\, H \sim \pi_e \right]. \tag{29}
\]
Also, since $\hat{v}^{\pi_e}(S^{H_i}_0)$ is bounded, we have from the Kolmogorov strong law of large numbers that
\[
\frac{1}{n} \sum_{i=1}^{n} \hat{v}^{\pi_e}\!\big(S^{H_i}_0\big) \xrightarrow{\text{a.s.}} \mathbf{E}\big[\hat{v}^{\pi_e}(S_0)\big]. \tag{30}
\]
So, combining (29) and (30), we have from Property  that
\[
g^{(j)}(D) \xrightarrow{\text{a.s.}} \mathbf{E}\!\left[ \hat{v}^{\pi_e}(S^{H}_0) + \sum_{t=0}^{j} \gamma^t X_t \,\middle|\, H \sim \pi_e \right].
\]
Let $c_j := \mathbf{E}\big[ \hat{v}^{\pi_e}(S^{H}_0) + \sum_{t=0}^{j} \gamma^t X_t \,\big|\, H \sim \pi_e \big]$ denote this constant value that $g^{(j)}(D)$ converges to.

H.2. Convergence of Component of Off-Policy $j$-Step Return

Recall from (24) that the off-policy $j$-step return can be written as $g^{(j)}(D) = \sum_{i=1}^{n} g^{(j)}_i(D)$, where
\[
g^{(j)}_i(D) := \left( \sum_{t=0}^{j} \gamma^t w_{it} R^{H_i}_t \right) + \gamma^{j+1} w_{ij}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_{j+1}\big) - \sum_{t=0}^{j} \gamma^t \Big( w_{it}\, \hat{q}^{\pi_e}\!\big(S^{H_i}_t, A^{H_i}_t\big) - w_{i,t-1}\, \hat{v}^{\pi_e}\!\big(S^{H_i}_t\big) \Big).
\]
Here we will show that for any $i$ and $j$, $g^{(j)}_i(D) \xrightarrow{\text{a.s.}} 0$. Notice that $g^{(j)}_i(D)$ can be written as:
\[
g^{(j)}_i(D) = \sum_{t=0}^{j} \gamma^t \frac{\rho_{it} X_{it}}{\sum_{k=1}^{n} \rho_{kt}} = \sum_{t=0}^{j} \gamma^t Y_{it},
\]
where $X_{it}$ is as defined in (28), and
\[
Y_{it} := \frac{\rho_{it} X_{it}}{\sum_{k=1}^{n} \rho_{kt}} = \frac{\frac{1}{n}\rho_{it} X_{it}}{\frac{1}{n}\sum_{k=1}^{n} \rho_{kt}}.
\]
Since $X_{it}$ and $\rho_{it}$ are bounded, we have that $\lim_{n\to\infty} \frac{1}{n}\rho_{it} X_{it} = 0$. Also, by Lemma  and Kolmogorov's strong law of large numbers, we have that $\frac{1}{n}\sum_{k=1}^{n}\rho_{kt} \xrightarrow{\text{a.s.}} 1$. So, $Y_{it} \xrightarrow{\text{a.s.}} 0$ for all $t$ and $i$. Furthermore, $Y_{it}$ is bounded since $0 \le \rho_{it}/\sum_{k=1}^{n}\rho_{kt} \le 1$ and $X_{it}$ is bounded. So, by Property , we have that $g^{(j)}_i(D) \xrightarrow{\text{a.s.}} 0$.

H.3. Consistency of $\hat{\Omega}_n$

Here we establish that $\hat{\Omega}_n - \Omega_n \xrightarrow{\text{a.s.}} 0$. There are two steps to this result. First we will show that $\lim_{n\to\infty} \Omega_n = 0$, that is, the true covariance matrix converges to the zero matrix. We then show that $\hat{\Omega}_n \xrightarrow{\text{a.s.}} 0$ as well, which means that $\hat{\Omega}_n - \Omega_n \xrightarrow{\text{a.s.}} 0$.

Recall from Appendix H.1 that $g^{(j)}(D) \xrightarrow{\text{a.s.}} c_j$. We can write
\[
\Omega_n(i, j) = \mathbf{E}\Big[ \big(g^{(i)}(D) - \mathbf{E}[g^{(i)}(D)]\big)\big(g^{(j)}(D) - \mathbf{E}[g^{(j)}(D)]\big) \Big] = \mathbf{E}[Y_n], \tag{31}
\]
where $Y_n := \big(g^{(i)}(D) - \mathbf{E}[g^{(i)}(D)]\big)\big(g^{(j)}(D) - \mathbf{E}[g^{(j)}(D)]\big)$. Recall that $g^{(j)}(D) \xrightarrow{\text{a.s.}} c_j$. By Lemma  we therefore have that for all $j$, $\lim_{n\to\infty} \mathbf{E}[g^{(j)}(D)] = c_j$. So, by the continuous mapping theorem, $Y_n \xrightarrow{\text{a.s.}} (c_i - c_i)(c_j - c_j) = 0$. So, by applying Lemma  to (31), we have that $\lim_{n\to\infty} \Omega_n(i, j) = \lim_{n\to\infty} \mathbf{E}[Y_n] = 0$.

Next we show that $\hat{\Omega}_n \xrightarrow{\text{a.s.}} 0$. First, recall from Appendix H.2 that for all $j \in \mathcal{J}$ and $k \in \{1, \ldots, n\}$, $g^{(j)}_k(D) \xrightarrow{\text{a.s.}} 0$. So, by Property  we have that $\bar{g}^{(j)}(D) \xrightarrow{\text{a.s.}} 0$ as well. So, $g^{(j)}_k(D) - \bar{g}^{(j)}(D) \xrightarrow{\text{a.s.}} 0$, and so by Property  and the definition of $\hat{\Omega}_n$, we have that $\hat{\Omega}_n(i, j) \xrightarrow{\text{a.s.}} 0$ for all $i, j$.

H.4. Consistency of $\hat{b}_n$

Here we show that $\hat{b}_n - b_n \xrightarrow{\text{a.s.}} 0$. We have from the definitions of $\hat{b}_n$, $l$, and $u$ that:
\[
\hat{b}_n(j) - b_n(j) \le g^{(\mathcal{J}_j)}(D) - l - \mathbf{E}\big[g^{(\mathcal{J}_j)}(D)\big] + v(\pi_e) \tag{32}
\]
and
\[
\hat{b}_n(j) - b_n(j) \ge g^{(\mathcal{J}_j)}(D) - u - \mathbf{E}\big[g^{(\mathcal{J}_j)}(D)\big] + v(\pi_e). \tag{33}
\]
We will show that both of the right hand sides above converge almost surely to zero, which, by Lemma , implies that $\hat{b}_n(j) - b_n(j)$ converges almost surely to zero as well. First consider (32).
We have from Appendix H.1 that $g^{(\mathcal{J}_j)}(D) \xrightarrow{\text{a.s.}} c_{\mathcal{J}_j}$. So, by Lemma  we have that $\lim_{n\to\infty} \mathbf{E}\big[g^{(\mathcal{J}_j)}(D)\big] = \mathbf{E}[c_{\mathcal{J}_j}] = c_{\mathcal{J}_j}$. We also have that
\[
u - l \le 2\xi\sqrt{\frac{\ln(2/\delta)}{2n}},
\]
by (26) and (27). Since $\operatorname{WDR}(D) \in [l, u]$, we have that
\[
|\operatorname{WDR}(D) - l| \le 2\xi\sqrt{\frac{\ln(2/\delta)}{2n}}.
\]
Since $\xi$ is a constant, the right side is a sequence of constants (not random variables) that converges to zero. The left side is positive and less than the right, and so it too must converge (surely, not just almost surely) to zero: $\lim_{n\to\infty} |\operatorname{WDR}(D) - l| = 0$. So,
\begin{align*}
\Pr\Big( \lim_{n\to\infty} l = v(\pi_e) \Big) &= \Pr\Big( \lim_{n\to\infty} \big( l + \operatorname{WDR}(D) - l \big) = v(\pi_e) \Big)\\
&= \Pr\Big( \lim_{n\to\infty} \operatorname{WDR}(D) = v(\pi_e) \Big) = 1,
\end{align*}
where the last step comes from Theorem . This means that $l \xrightarrow{\text{a.s.}} v(\pi_e)$.

Combining these three facts (the almost sure convergence of $g^{(\mathcal{J}_j)}(D)$ to $c_{\mathcal{J}_j}$, the convergence of $\mathbf{E}[g^{(\mathcal{J}_j)}(D)]$ to $c_{\mathcal{J}_j}$, and the almost sure convergence of $l$ to $v(\pi_e)$), we have that the right side of (32) converges almost surely to zero. This same argument, using the upper bound, $u$, rather than the lower bound, $l$, shows that the right side of (33) converges almost surely to zero as well, and so we can conclude that $\hat{b}_n(j) - b_n(j) \xrightarrow{\text{a.s.}} 0$.

I. Extended Empirical Studies (MAGIC)

Here we present detailed results concerning the MAGIC estimator. These results use the same three domains and two experimental setups (full-data and half-data) that were introduced in Appendix D, as well as one additional domain, which we call the Hybrid domain. We begin by introducing the Hybrid domain, then discuss minor changes to the experimental setup, and then present results.

I.1. The Hybrid Domain

The purpose of this domain is to showcase a common problem type: domains where early in a trajectory there is partial observability, but as time passes within each trajectory, the partial observability decays. This happens, for example, in robotics applications where there may be some uncertainty about the position or pose of a robot. However, as the trajectory progresses the robot may be able to better localize itself, removing or diminishing the uncertainty. We emulate this setting by concatenating the ModelFail and ModelWin domains. That is, the agent begins in the ModelFail domain. Whenever it would transition to the absorbing state, it instead transitions to the initial state of the ModelWin domain.

I.2. Experimental Setup

We performed these experiments in the same way as those in Appendix D, except that we compared different estimators. Specifically, we introduce curves for the MAGIC estimator, but remove the curves for the poorly-performing importance sampling estimators, IS, PDIS, WIS, and CWPDIS. So, the plots contain curves for DR, WDR, AM, and MAGIC. The legend used by all of the plots in this appendix is provided in Figure 18.

[Figure 18: The legend used by all plots in Appendix I (DR, AM, WDR, MAGIC, MAGIC-B).]

Also, for the Hybrid domain we included a curve for binary MAGIC (MAGIC-B), which uses $\mathcal{J} = \{-1, \infty\}$. Whereas MAGIC blends between AM and WDR using off-policy $j$-step returns of various lengths, binary MAGIC only places weights on AM and WDR. Our comparison to MAGIC-B shows the importance of including the off-policy $j$-step returns, rather than merely trying to switch between, or directly weight, AM and WDR.

Lastly, since all of the domains have finite horizons, we used $\mathcal{J} = \{-1, \ldots, L\}$ for MAGIC. This means that it uses all of the possible off-policy $j$-step returns.

I.3. ModelFail Results

Figure 2b in Section 9 depicts the results for the ModelFail domain in the full-data setting. We reproduce this plot in Figure 19.
In Figure 20 we show the results for ModelFail in the half-data setting. There is little difference between the plots: in both cases MAGIC properly tracks WDR, so that both WDR and MAGIC outperform AM and DR by at least an order of magnitude for most $n$.

I.4. ModelWin Results

Figure 2c in Section 9 depicts the results for the ModelWin domain in the full-data setting. We reproduce this plot in Figure 21. In Figure 22 we show the results for ModelWin in the half-data setting. In both cases MAGIC tracks AM, although it drifts away a little as $n$ increases. This suggests that there may be room for improvement in our estimates of $\Omega_n$ and $b_n$. However, also notice that, due to the logarithmic scale, the difference between MAGIC and AM is small in comparison to the distance between MAGIC and DR.

I.5. Gridworld Results

Figures 23 through 30 depict the results for the Gridworld-FH and Gridworld-TH domains in both the full-data and half-data settings. The same general trends are visible. First, WDR tends to outperform DR, sometimes by an order of magnitude. Also, MAGIC tends to track WDR, since in these experiments it is usually the best-performing algorithm. Lastly, for the Gridworld-TH, full-data setting, DR, WDR, and MAGIC all degenerate to AM, while in the Gridworld-TH, half-data setting they degenerate to approximately AM using half as much data.

I.6. Hybrid Results

Last, but not least, Figures 31 and 32 show the results on the Hybrid domain in the full-data and half-data settings, respectively. Notice that MAGIC significantly outperforms all other methods, including WDR and AM. MAGIC also outperforms MAGIC-B, which shows the importance of using off-policy $j$-step returns for various values of $j$.

[Figures 19–31 plot mean squared error (vertical axis) against the number of episodes, n (horizontal axis). Figure 19: ModelFail, full-data. Figure 20: ModelFail, half-data. Figure 21: ModelWin, full-data. Figure 22: ModelWin, half-data. Figure 23: Gridworld-FH, full-data. Figure 24: Gridworld-FH, half-data. Figure 25: Gridworld-TH, full-data. Figure 26: Gridworld-TH, half-data. Figure 27: Gridworld-FH, full-data. Figure 28: Gridworld-FH, half-data. Figure 29: Gridworld-TH, full-data. Figure 30: Gridworld-TH, half-data. Figure 31: Hybrid, full-data.]

I.7. Summary

Overall, MAGIC acts as desired: it tracks WDR or AM, whichever is better for the application at hand. However, notice that it does not do this perfectly, particularly when there is little data available. This is likely because when there is little data it is difficult to estimate $\Omega_n$, and the confidence interval used when estimating $b_n$ will be loose. In some cases, even when there is a large amount of data, MAGIC struggles to properly track AM. However, this tends to be when both methods perform well, and may be
due to an increased difficulty of determining which method to favor when they both are improving rapidly with $n$.

[Figure 32: Hybrid, half-data; mean squared error versus the number of episodes, n.]

We also showed in Figures 31 and 32 an example where MAGIC outperformed MAGIC-B by an order of magnitude, and all previous methods (including DR) by 2–3 orders of magnitude. This exemplifies the importance of blending between importance sampling methods and purely model-based estimators using off-policy $j$-step returns, as opposed to selecting between, or directly weighting, WDR and AM, and it demonstrates the power of MAGIC relative to existing estimators.

J. Future Work

Several avenues of future work remain. Good performance of MAGIC is contingent on our ability to efficiently estimate $\Omega_n$ and $b_n$