CoinDICE: Off-Policy Confidence Interval Estimation
Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, Dale Schuurmans
Google Research, University of Alberta, DeepMind
∗ Equal contribution. Email: {bodai, ofirnachum}@google.com. Open-source code for CoinDICE is available at https://github.com/google-research/dice_rl.
Abstract
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the $Q$-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.

1 Introduction

One of the major barriers that hinders the application of reinforcement learning (RL) is the ability to evaluate new policies reliably before deployment, a problem generally known as off-policy evaluation (OPE). In many real-world domains, e.g., healthcare (Murphy et al., 2001; Gottesman et al., 2018), recommendation (Li et al., 2011; Chen et al., 2019), and education (Mandel et al., 2014), deploying a new policy can be expensive, risky or unsafe. Accordingly, OPE has seen a recent resurgence of research interest, with many methods proposed to estimate the value of a policy (Precup et al., 2000; Dudík et al., 2011; Bottou et al., 2013; Jiang and Li, 2016; Thomas and Brunskill, 2016; Liu et al., 2018; Nachum et al., 2019a; Kallus and Uehara, 2019a,b; Zhang et al., 2020b).

However, the very settings where OPE is necessary usually entail limited data access. In these cases, obtaining knowledge of the uncertainty of the estimate is as important as having a consistent estimator. That is, rather than a point estimate, many applications would benefit significantly from having confidence intervals on the value of a policy. The problem of estimating these confidence intervals, known as high-confidence off-policy evaluation (HCOPE) (Thomas et al., 2015b), is imperative in real-world decision making, where deploying a policy without high-probability safety guarantees can have catastrophic consequences (Thomas, 2015). Most existing high-confidence off-policy evaluation algorithms in RL (Bottou et al., 2013; Thomas et al., 2015a,b; Hanna et al., 2017) construct such intervals using statistical techniques such as concentration inequalities and the bootstrap applied to importance-corrected estimates of policy value. The primary challenge with these correction-based approaches is the high variance resulting from multiplying per-step importance ratios in long-horizon problems. Moreover, they typically require full knowledge (or a good estimate) of the behavior policy, which is not easily available in behavior-agnostic OPE settings (Nachum et al., 2019a).

In this work, we propose an algorithm for behavior-agnostic HCOPE. We start from a linear programming formulation of the state-action value function. We show that the value of the policy may be obtained from a Lagrangian optimization problem for generalized estimating equations over data sampled from off-policy distributions. This observation inspires a generalized empirical likelihood approach (Owen, 2001; Broniatowski and Keziou, 2012; Duchi et al., 2016) to confidence interval estimation. These derivations enable us to express high-confidence lower and upper bounds for the policy value as results of minimax optimizations over an arbitrary offline dataset, with the appropriate distribution corrections being implicitly estimated during the optimization. We translate this understanding into a practical estimator, Confidence Interval DIstribution Correction Estimation (CoinDICE), and design an efficient algorithm for implementing it. We then justify the asymptotic coverage of these bounds and present non-asymptotic guarantees to characterize finite-sample effects. Notably, CoinDICE is behavior-agnostic and its objective function does not involve any per-step importance ratios, so the estimator is less susceptible to high-variance gradient updates. We evaluate CoinDICE in a number of settings and show that it both provides tighter confidence interval estimates and more correctly matches the desired statistical coverage compared to existing methods.
2 Preliminaries

For a set $W$, the set of probability measures over $W$ is denoted by $\mathcal{P}(W)$. We consider a Markov Decision Process (MDP) (Puterman, 2014), $M = (S, A, T, R, \gamma, \mu)$, where $S$ denotes the state space, $A$ denotes the action space, $T: S\times A \to \mathcal{P}(S)$ is the transition probability kernel, $R: S\times A \to \mathcal{P}([0, R_{\max}])$ is a bounded reward kernel, $\gamma \in (0,1)$ is the discount factor, and $\mu$ is the initial state distribution.

A policy, $\pi: S \to \mathcal{P}(A)$, can be used to generate a random trajectory by starting from $s_0 \sim \mu$, then following $a_t \sim \pi(s_t)$, $r_t \sim R(s_t, a_t)$ and $s_{t+1} \sim T(s_t, a_t)$ for $t \ge 0$. The state- and action-value functions of $\pi$ are denoted $V^\pi$ and $Q^\pi$, respectively. The policy also induces an occupancy measure, $d^\pi(s,a) := (1-\gamma)\,\mathbb{E}_\pi\big[\sum_{t\ge 0} \gamma^t \mathbb{1}\{s_t = s, a_t = a\}\big]$, the normalized discounted probability of visiting $(s,a)$ in a trajectory generated by $\pi$, where $\mathbb{1}\{\cdot\}$ is the indicator function. Finally, the policy value is defined as the normalized expected reward accumulated along a trajectory:

$$\rho^\pi := (1-\gamma)\,\mathbb{E}\Big[\textstyle\sum_{t=0}^\infty \gamma^t r_t \,\Big|\, s_0 \sim \mu,\ a_t \sim \pi(s_t),\ r_t \sim R(s_t,a_t),\ s_{t+1} \sim T(s_t,a_t)\Big]. \quad (1)$$

We are interested in estimating the policy value and its confidence interval (CI) in the behavior-agnostic off-policy setting (Nachum et al., 2019a; Zhang et al., 2020a), where interaction with the environment is limited to a static dataset of experience $D := \{(s,a,s',r)_i\}_{i=1}^n$. Each tuple in $D$ is generated according to $(s,a)\sim d^D$, $r\sim R(s,a)$ and $s'\sim T(s,a)$, where $d^D$ is an unknown distribution over $S\times A$, perhaps induced by one or more unknown behavior policies. The initial distribution $\mu(s)$ is assumed to be easy to sample from, as is typical in practice. Abusing notation, we denote by $d^D$ both the distribution over $(s,a,s',r)$ and its marginal on $(s,a)$. We use $\mathbb{E}_d[\cdot]$ for the expectation over a given distribution $d$, and $\mathbb{E}_D[\cdot]$ for its empirical approximation using $D$.

Following previous work (Sutton et al., 2012; Uehara et al., 2019; Zhang et al., 2020a), for ease of exposition we assume the transitions in $D$ are i.i.d. However, our results may be extended to fast-mixing, ergodic MDPs, where the empirical distribution of states along a long trajectory is close to being i.i.d. (Antos et al., 2008; Lazaric et al., 2012; Dai et al., 2017; Duchi et al., 2016).

Under mild regularity assumptions, the OPE problem may be formulated as a linear program, referred to as the $Q$-LP (Nachum et al., 2019b; Nachum and Dai, 2020), with the following primal and dual forms:

$$\min_{Q: S\times A\to\mathbb{R}}\ (1-\gamma)\,\mathbb{E}_{\mu\pi}[Q(s_0,a_0)] \quad \text{s.t.}\ Q(s,a) \ge R(s,a) + \gamma\cdot\mathcal{P}^\pi Q(s,a),\ \forall (s,a)\in S\times A, \quad (2)$$

and

$$\max_{d: S\times A\to\mathbb{R}_+}\ \mathbb{E}_d[r(s,a)] \quad \text{s.t.}\ d(s,a) = (1-\gamma)\,\mu\pi(s,a) + \gamma\cdot\mathcal{P}^\pi_* d(s,a),\ \forall (s,a)\in S\times A, \quad (3)$$

where $\mu\pi(s,a) := \mu(s)\pi(a|s)$, and $\mathcal{P}^\pi$ and its adjoint, $\mathcal{P}^\pi_*$, are defined as

$$\mathcal{P}^\pi Q(s,a) := \mathbb{E}_{s'\sim T(\cdot|s,a),\,a'\sim\pi(\cdot|s')}\big[Q(s',a')\big], \qquad \mathcal{P}^\pi_* d(s,a) := \pi(a|s)\sum_{\tilde s,\tilde a} T(s\,|\,\tilde s,\tilde a)\, d(\tilde s,\tilde a).$$

All sets and maps are assumed to satisfy appropriate measurability conditions, which we omit below to reduce clutter. The optimal solutions of (2) and (3) are the $Q$-function, $Q^\pi$, and the stationary state-action occupancy, $d^\pi$, respectively, for policy $\pi$; see Nachum et al. (2019b, Theorems 3 & 5) for details as well as extensions to the undiscounted case.
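To make the definition in (1) concrete, the following minimal Monte Carlo sketch estimates $\rho^\pi$ by rolling out the target policy; `env` and `policy` are hypothetical stand-ins for a simulator and the target policy (in the OPE setting such rollouts are exactly what is unavailable, which is why the estimators below work from $D$ instead).

```python
import numpy as np

def estimate_policy_value(env, policy, gamma, num_trajectories=1000, horizon=200):
    """Monte Carlo estimate of the normalized policy value rho^pi in Eq. (1).

    `env.reset()` returns an initial state s0 ~ mu, `env.step(a)` returns
    (next_state, reward), and `policy(s)` samples an action a ~ pi(.|s).
    """
    returns = []
    for _ in range(num_trajectories):
        s = env.reset()
        discounted_sum, discount = 0.0, 1.0
        for _ in range(horizon):  # horizon truncates the infinite sum
            a = policy(s)
            s, r = env.step(a)
            discounted_sum += discount * r
            discount *= gamma
        # The (1 - gamma) factor normalizes the discounted sum so that
        # rho^pi is a convex combination of per-step rewards.
        returns.append((1.0 - gamma) * discounted_sum)
    return float(np.mean(returns))
```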
Using the Lagrangian of (2) or (3), we have

$$\rho^\pi = \min_{Q}\max_{\tau\ge 0}\ (1-\gamma)\,\mathbb{E}_{\mu\pi}[Q(s_0,a_0)] + \mathbb{E}_{d^D}\big[\tau(s,a)\,\big(R(s,a) + \gamma\,Q(s',a') - Q(s,a)\big)\big], \quad (4)$$

where $\tau(s,a) := \frac{d(s,a)}{d^D(s,a)}$ is the stationary distribution corrector. One of the key benefits of the minimax optimization (4) is that both expectations can be immediately approximated by sample averages. In fact, this formulation allows the derivation of several recent behavior-agnostic OPE estimators in a unified manner (Nachum et al., 2019a; Uehara et al., 2019; Zhang et al., 2020a; Nachum and Dai, 2020).
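Both expectations in (4) are straightforward sample averages. The following is a minimal tabular sketch (illustrative names only, not the released implementation); `q` and `tau` are arrays indexed by discrete (state, action) pairs.

```python
import numpy as np

def empirical_lagrangian(q, tau, init_sa, batch, gamma):
    """Sample-average of the Lagrangian in Eq. (4) for fixed Q and tau.

    `init_sa` is a list of (s0, a0) pairs with s0 ~ mu and a0 ~ pi(s0);
    `batch` is a list of (s, a, r, s_next, a_next) tuples with
    (s, a, r, s_next) ~ d^D and a_next ~ pi(s_next).
    """
    # First term: (1 - gamma) * E_{mu, pi}[Q(s0, a0)].
    init_term = (1.0 - gamma) * np.mean([q[s0, a0] for s0, a0 in init_sa])
    # Second term: E_{d^D}[tau(s, a) * (r + gamma * Q(s', a') - Q(s, a))].
    bellman_term = np.mean([
        tau[s, a] * (r + gamma * q[sn, an] - q[s, a])
        for s, a, r, sn, an in batch
    ])
    return init_term + bellman_term
```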
3 CoinDICE

We now develop a new approach to obtaining confidence intervals for OPE. The algorithm, COnfidence INterval stationary DIstribution Correction Estimation (CoinDICE), is derived by combining function space embedding with the previously described $Q$-LP.

3.1 Function Space Embedding of the $Q$-LP

Both the primal and dual forms of the $Q$-LP contain $|S||A|$ constraints that involve expectations over state transition probabilities. Working directly with these constraints quickly becomes computationally and statistically prohibitive when $|S||A|$ is large or infinite, as with standard LP approaches (De Farias and Van Roy, 2003). Instead, we consider a relaxation that embeds the constraints in a function space:

$$\tilde\rho^\pi := \max_{d: S\times A\to\mathbb{R}_+}\ \mathbb{E}_d[r(s,a)] \quad \text{s.t.}\ \langle\phi, d\rangle = \langle\phi,\ (1-\gamma)\mu\pi + \gamma\cdot\mathcal{P}^\pi_* d\rangle, \quad (5)$$

where $\phi: S\times A\to\Omega_p\subset\mathbb{R}^p$ is a feature map, and $\langle\phi, d\rangle := \int\phi(s,a)\,d(s,a)\,ds\,da$. By projecting the constraints onto a function space with feature mapping $\phi$, we reduce the number of constraints from $|S||A|$ to $p$. Note that $p$ may still be infinite. The constraint in (5) can be written as generalized estimating equations (Qin and Lawless, 1994; Lam and Zhou, 2017) for the correction ratio $\tau(s,a)$ over augmented samples $x := (s_0, a_0, s, a, r, s', a')$ with $(s_0,a_0)\sim\mu\pi$, $(s,a,r,s')\sim d^D$, and $a'\sim\pi(\cdot|s')$. (We assume one can sample initial states from $\mu$, an assumption that often holds in practice; the data in $D$ can then be treated as augmented to $(s_0, a_0, s, a, r, s', a')$ with $a_0\sim\pi(\cdot|s_0)$ and $a'\sim\pi(\cdot|s')$.) That is,

$$\langle\phi, d\rangle = \langle\phi,\ (1-\gamma)\mu\pi + \gamma\cdot\mathcal{P}^\pi_* d\rangle \ \Leftrightarrow\ \mathbb{E}_x\big[\Delta(x;\tau,\phi)\big] = 0, \quad (6)$$

where $\Delta(x;\tau,\phi) := (1-\gamma)\,\phi(s_0,a_0) + \tau(s,a)\,\big(\gamma\,\phi(s',a') - \phi(s,a)\big)$. The corresponding Lagrangian is

$$\tilde\rho^\pi = \max_{\tau: S\times A\to\mathbb{R}_+}\ \min_{\beta\in\mathbb{R}^p}\ \mathbb{E}_{d^D}[\tau\cdot r(s,a)] + \big\langle\beta,\ \mathbb{E}_{d^D}[\Delta(x;\tau,\phi)]\big\rangle. \quad (7)$$

This embedding approach for the dual $Q$-LP is closely related to approximation methods for the standard state-value LP (De Farias and Van Roy, 2003; Pazis and Parr, 2011; Lakshminarayanan et al., 2017). The gap between the solutions to (5) and the original dual LP (3) depends on the expressiveness of the feature mapping $\phi$. Before stating a theorem that quantifies the error, we first offer a few examples to provide intuition for the role played by $\phi$.

Example (Indicator functions): Suppose $p = |S||A|$ is finite and $\phi = [\delta_{s,a}]_{(s,a)\in S\times A}$, where $\delta_{s,a}\in\{0,1\}^p$ with $\delta_{s,a} = 1$ at position $(s,a)$ and $0$ otherwise. Plugging this feature mapping into (5) recovers the original dual $Q$-LP (3).

Example (Full-rank basis): Suppose $\Phi\in\mathbb{R}^{p\times p}$ is a full-rank matrix with $p = |S||A|$, and $\phi(s,a) = \Phi((s,a),\cdot)^\top$. Although the constraints in (5) and (3) are different, their solutions are identical. This can be verified by the Lagrangian in Appendix A.

Example (RKHS function mappings): Suppose $\phi(s,a) := k((s,a),\cdot)\in\mathbb{R}^p$ with $p = \infty$, which forms a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$. The LHS and RHS in the constraint of (5) are then the kernel embeddings of $d(s,a)$ and $(1-\gamma)\mu\pi(s,a) + \gamma\cdot\mathcal{P}^\pi_* d(s,a)$, respectively. The constraint in (5) can thus be understood as a form of distribution matching by comparing kernel embeddings, rather than the element-wise matching in (3). If the kernel function $k(\cdot,\cdot)$ is characteristic, the embeddings of two distributions match if and only if the distributions are identical almost surely (Sriperumbudur et al., 2011).
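The estimating equation in (6) is likewise estimated by a sample average. A minimal sketch under the same illustrative conventions as above, with `phi(s, a)` returning a length-$p$ feature vector and `tau(s, a)` the scalar correction ratio:

```python
import numpy as np

def delta_moments(phi, tau, init_sa, batch, gamma):
    """Empirical estimate of E_x[Delta(x; tau, phi)] from Eq. (6)."""
    init_part = (1.0 - gamma) * np.mean(
        [phi(s0, a0) for s0, a0 in init_sa], axis=0)
    corr_part = np.mean(
        [tau(s, a) * (gamma * phi(sn, an) - phi(s, a))
         for s, a, _, sn, an in batch], axis=0)
    # At a solution of the embedded constraint this p-vector is ~ 0.
    return init_part + corr_part
```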
Theorem 1 (Approximation error) Suppose the constant function $1\in\mathcal{F}_\phi := \mathrm{span}\{\phi\}$. Then,
$$0 \ \le\ \tilde\rho^\pi - \rho^\pi \ \le\ 2\,\min_{\beta}\,\big\|Q^\pi - \langle\beta,\phi\rangle\big\|_\infty,$$
where $Q^\pi$ is the fixed-point solution to the Bellman equation $Q(s,a) = R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a)$.

Please refer to Appendix A for the proof. The condition $1\in\mathcal{F}_\phi$ is standard and trivial to satisfy. Although the approximation error relies on $\|\cdot\|_\infty$, a sharper bound that relies on a norm taking the state-action distribution into account can also be obtained (De Farias and Van Roy, 2003). In this paper we focus on characterizing the uncertainty due to sampling, so for ease of exposition we consider a setting where $\phi$ is sufficiently expressive to make the approximation error zero. If desired, the approximation error in Theorem 1 can be included in the analysis.

Note that, compared to using a characteristic kernel to ensure injectivity of the RKHS embeddings over all distributions (and thus guaranteeing arbitrarily small approximation error), Theorem 1 only requires that $Q^\pi$ be representable in $\mathcal{F}_\phi$, which is a much weaker condition. In practice, one may also learn the feature mapping $\phi$ for the projection jointly.

3.2 Confidence Intervals via Generalized Empirical Likelihood

By introducing the function space embedding of the constraints in (5), we have transformed the original point-wise constraints in the $Q$-LP into generalized estimating equations. This paves the way to applying the generalized empirical likelihood (EL) method (Owen, 2001; Broniatowski and Keziou, 2012; Bertail et al., 2014; Duchi et al., 2016) to estimate a confidence interval on the policy value.

Recall that, given a convex, lower-semicontinuous function $f:\mathbb{R}_+\to\mathbb{R}$ satisfying $f(1) = 0$, the $f$-divergence between distributions $P$ and $Q$ is defined as $D_f(P\|Q) := \int f\big(\tfrac{dP}{dQ}(x)\big)\,Q(dx)$.

Given an $f$-divergence, we propose our main confidence interval estimate based on the following confidence set $C^f_{n,\xi}\subset\mathbb{R}$:

$$C^f_{n,\xi} := \Big\{\tilde\rho^\pi(w) = \max_{\tau\ge 0}\ \mathbb{E}_w[\tau\cdot r]\ \Big|\ w\in K_f,\ \mathbb{E}_w[\Delta(x;\tau,\phi)] = 0\Big\},\quad\text{with}\quad K_f := \Big\{w\in\mathcal{P}^{n-1}(\hat p_n),\ D_f(w\,\|\,\hat p_n)\le\tfrac{\xi}{n}\Big\}, \quad (8)$$

where $\mathcal{P}^{n-1}(\hat p_n)$ denotes the $n$-simplex on the support of $\hat p_n$, the empirical distribution over $D$. It is easy to verify that the set $C^f_{n,\xi}\subset\mathbb{R}$ is convex, since $\tilde\rho^\pi(w)$ is a convex function over a convex feasible set; thus $C^f_{n,\xi}$ is an interval. In fact, $C^f_{n,\xi}$ is the image of the policy value $\tilde\rho^\pi$ under bounded (in $f$-divergence) perturbations of $w$ in the neighborhood of the empirical distribution $\hat p_n$.

Intuitively, the confidence interval $C^f_{n,\xi}$ has a close relationship to bootstrap estimators. In the vanilla bootstrap, one constructs a set of empirical distributions $\{w^i\}_{i=1}^m$ by resampling from the dataset $D$. These subsamples are used to form the empirical distribution of $\{\tilde\rho(w^i)\}_{i=1}^m$, which provides population statistics for confidence interval estimation. However, this procedure is computationally very expensive, involving $m$ separate optimizations.
By contrast, our proposed estimator $C^f_{n,\xi}$ exploits the asymptotic properties of the statistic $\tilde\rho^\pi(w)$ to derive a target confidence interval by solving only two optimization problems (Section 3.3), a dramatic savings in computational cost.

Before introducing the algorithm for computing $C^f_{n,\xi}$, we establish the first key result: by choosing $\xi = \chi^2_{1-\alpha}(1)$, $C^f_{n,\xi}$ is asymptotically a $(1-\alpha)$-confidence interval on the policy value, where $\chi^2_{1-\alpha}(1)$ is the $(1-\alpha)$-quantile of the $\chi^2$-distribution with 1 degree of freedom.
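Concretely, the divergence radius is just a chi-squared quantile; a minimal sketch using SciPy:

```python
from scipy.stats import chi2

def divergence_radius(alpha):
    """xi = the (1 - alpha)-quantile of chi^2 with one degree of freedom,
    which sets the f-divergence ball radius xi / n in Eq. (8)."""
    return chi2.ppf(1.0 - alpha, df=1)

# divergence_radius(0.05) ~= 3.841 for a 95% confidence interval.
```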
Theorem 2 (Informal asymptotic coverage) Under some mild conditions, if $D$ contains i.i.d. samples and the optimal solution to the Lagrangian of (5) is unique, we have
$$\lim_{n\to\infty} P\big(\rho^\pi\in C^f_{n,\xi}\big) = P\big(\chi^2(1)\le\xi\big). \quad (9)$$
Thus, $C^f_{n,\chi^2_{1-\alpha}(1)}$ is an asymptotic $(1-\alpha)$-confidence interval of the value of the policy $\pi$.

Please refer to Appendix E.1 for the precise statement and proof of Theorem 2.

Theorem 2 generalizes the result in Duchi et al. (2016) to statistics with generalized estimating equations, maintaining the single degree of freedom in the asymptotic $\chi^2$-distribution. One may also apply existing results for EL with generalized estimating equations (e.g., Lam and Zhou, 2017), but these would lead to a limiting distribution of $\chi^2(m)$ with $m\gg 1$ degrees of freedom, resulting in a much looser confidence interval estimate than Theorem 2.

Note that Theorem 2 can also be specialized to multi-armed contextual bandits to achieve a tighter confidence interval estimate in this special case. In particular, for contextual bandits, the stationary distribution constraint in (5), $\mathbb{E}_w[\Delta(x;\tau,\phi)] = 0$, is no longer needed and can be replaced by $\mathbb{E}_w[\tau - 1] = 0$. By the same technique used for MDPs, we then obtain a confidence interval estimate for offline contextual bandits; see details in Appendix C. Interestingly, the resulting confidence interval estimate not only has the same asymptotic coverage as previous work (Karampatziakis et al., 2019), but is also simpler and computationally more efficient.
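To make the $\chi^2(1)$ calibration concrete, the following self-contained sketch implements classical empirical likelihood for a scalar mean (Owen, 2001), the simplest instance of the machinery that Theorem 2 generalizes — not CoinDICE itself. A candidate value $\mu_0$ lies in the $(1-\alpha)$ interval iff the $-2\log$ EL ratio is below the $\chi^2_{1-\alpha}(1)$ quantile.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def el_statistic(x, mu0):
    """-2 log empirical-likelihood ratio for H0: E[X] = mu0.

    Solves the scalar dual problem: the optimal weights are
    w_i = 1 / (n * (1 + lam * (x_i - mu0))), with lam chosen so that the
    weighted mean constraint holds.
    """
    z = x - mu0
    if z.min() >= 0 or z.max() <= 0:
        return np.inf  # mu0 lies outside the convex hull of the data

    def grad(lam):  # stationarity: sum_i z_i / (1 + lam * z_i) = 0
        return np.sum(z / (1.0 + lam * z))

    eps = 1e-10  # keep all 1 + lam * z_i strictly positive
    lam = brentq(grad, (-1.0 + eps) / z.max(), (-1.0 + eps) / z.min())
    return 2.0 * np.sum(np.log1p(lam * z))

x = np.random.default_rng(0).normal(loc=1.0, size=200)
covered = el_statistic(x, 1.0) <= chi2.ppf(0.95, df=1)  # True in ~95% of runs
```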
3.3 Computing the Confidence Bounds

We now provide a distributionally robust optimization view of the upper and lower bounds of $C^f_{n,\xi}$.
Theorem 3 (Upper and lower confidence bounds) Denote the lower and upper confidence bounds of $C^f_{n,\xi}$ by $l_n$ and $u_n$, respectively:

$$[l_n, u_n] = \Big[\min_{w\in K_f}\min_{\beta\in\mathbb{R}^p}\max_{\tau\ge 0}\ \mathbb{E}_w[\ell(x;\tau,\beta)],\ \ \max_{w\in K_f}\max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p}\ \mathbb{E}_w[\ell(x;\tau,\beta)]\Big] \quad (10)$$
$$= \Big[\min_{\beta\in\mathbb{R}^p}\max_{\tau\ge 0}\min_{w\in K_f}\ \mathbb{E}_w[\ell(x;\tau,\beta)],\ \ \max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p}\max_{w\in K_f}\ \mathbb{E}_w[\ell(x;\tau,\beta)]\Big], \quad (11)$$

where $\ell(x;\tau,\beta) := \tau\cdot r + \beta^\top\Delta(x;\tau,\phi)$. For any $(\tau,\beta,\lambda,\eta)$ that satisfies the constraints in (11), the optimal weights for the lower and upper confidence bounds are

$$w_l = f'_*\Big(\frac{\eta - \ell(x;\tau,\beta)}{\lambda}\Big) \quad\text{and}\quad w_u = f'_*\Big(\frac{\ell(x;\tau,\beta) - \eta}{\lambda}\Big), \quad (12)$$

respectively. Therefore, the confidence bounds can be simplified as

$$l_n = \min_{\beta}\max_{\tau\ge 0,\,\lambda\ge 0,\,\eta}\ \mathbb{E}_D\Big[-\lambda f^*\Big(\frac{\eta - \ell(x;\tau,\beta)}{\lambda}\Big) + \eta - \lambda\frac{\xi}{n}\Big],\qquad u_n = \max_{\tau\ge 0}\min_{\beta,\,\lambda\ge 0,\,\eta}\ \mathbb{E}_D\Big[\lambda f^*\Big(\frac{\ell(x;\tau,\beta) - \eta}{\lambda}\Big) + \eta + \lambda\frac{\xi}{n}\Big]. \quad (13)$$

The proof of this result relies on Lagrangian duality and the convexity and concavity of the optimization; it may be found in full detail in Appendix D.1.

As Theorem 3 shows, by exploiting strong duality to move $w$ into the innermost optimizations in (11), the resulting problem (11) is a distributionally robust extension of the original saddle-point problem. The closed-form reweighting scheme is given in (12). For particular $f$-divergences, such as the KL- and 2-power divergences, for a fixed $(\beta,\tau)$ the optimal $\eta$ can be easily computed and the weights $w$ recovered in closed form. For example, using $KL(w\,\|\,\hat p_n)$, (12) yields the updates

$$w_l(x) = \exp\Big(\frac{\eta_l - \ell(x;\tau,\beta)}{\lambda}\Big),\qquad w_u(x) = \exp\Big(\frac{\ell(x;\tau,\beta) - \eta_u}{\lambda}\Big), \quad (14)$$

where $\eta_l$ and $\eta_u$ provide the normalizing constants. (For closed-form updates of $w$ w.r.t. other $f$-divergences, please refer to Appendix D.2.) Plugging the closed form of the optimal weights into (11) greatly simplifies the optimization over the data perturbations, yielding (13), and establishes a connection to prioritized experience replay (Schaul et al., 2016): both reweight the experience data according to their loss, but with different reweighting schemes.

It is straightforward to check that the estimator for $u_n$ in (13) is nonconvex-concave and the estimator for $l_n$ in (13) is nonconcave-convex. Therefore, one can apply stochastic gradient descent-ascent (SGDA) to solve (13) and benefit from attractive finite-step convergence guarantees (Lin et al., 2019).
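For the KL case in (14), the normalizing constants $\eta_l, \eta_u$ amount to normalizing the exponentiated losses over the dataset, which is numerically stable in log space. A minimal sketch (illustrative names; `losses` holds the per-sample values of $\ell(x;\tau,\beta)$ at the current iterate):

```python
import numpy as np
from scipy.special import logsumexp

def kl_weights(losses, lam, upper=True):
    """Closed-form KL perturbation weights of Eq. (14).

    An exponential tilt of the empirical distribution with temperature
    lam > 0: high-loss samples are up-weighted for the upper bound and
    down-weighted for the lower bound, echoing prioritized replay.
    """
    logits = (losses if upper else -losses) / lam
    return np.exp(logits - logsumexp(logits))  # a normalized distribution
```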
Remark (Practical considerations): As also observed in Namkoong and Duchi (2016), SGDA for (13) can suffer from high variance in both the objective and gradients when $\lambda$ approaches 0. In Appendix D.3, we exploit several properties of (11) to obtain a computationally efficient algorithm that overcomes this numerical issue. Please refer to Appendix D.3 for the details of Algorithm 1 and the practical considerations.
Remark (Joint learning of feature embeddings): The proposed framework also allows the features used for constraint projection to be learned. In particular, consider $\zeta(\cdot,\cdot) := \beta^\top\phi(\cdot,\cdot): S\times A\to\mathbb{R}$. Since the combination $\beta^\top\phi(s,a)$ can be treated as the Lagrange multiplier function for the original $Q$-LP with infinitely many constraints, both $\beta$ and $\phi(\cdot,\cdot)$ can be updated jointly. Although the conditions for asymptotic coverage no longer hold, the finite-sample correction results of the next section are still applicable. This offers an interesting way to reduce the approximation error introduced by inappropriate feature embeddings of the constraints, while still maintaining calibrated confidence intervals.

3.4 Finite-Sample Analysis

Theorem 2 establishes the asymptotic $(1-\alpha)$-coverage of the confidence interval estimates produced by CoinDICE, ignoring higher-order error terms that vanish as the sample size $n\to\infty$. In practice, however, $n$ is always finite, so it is important to quantify these higher-order terms. This section addresses this problem and presents a finite-sample bound for the estimate of CoinDICE. In the following, we let $\mathcal{F}_\tau$ and $\mathcal{F}_\beta$ be the function classes of $\tau$ and $\beta$ used by CoinDICE.
Theorem 4 (Informal finite-sample correction) Denote by $d_{\mathcal{F}_\tau}$ and $d_{\mathcal{F}_\beta}$ the finite VC-dimensions of $\mathcal{F}_\tau$ and $\mathcal{F}_\beta$, respectively. Under some mild conditions, when $D_f$ is the $\chi^2$-divergence, we have
$$P\big(\rho^\pi\in[l_n - \kappa_n,\ u_n + \kappa_n]\big)\ \ge\ 1 - 12\exp\Big(c + 2\big(d_{\mathcal{F}_\tau} + d_{\mathcal{F}_\beta} - 1\big)\log n - \frac{\xi}{18}\Big),$$
where $c = 2c_0 + \log d_{\mathcal{F}_\tau} + \log d_{\mathcal{F}_\beta} + \big(d_{\mathcal{F}_\tau} + d_{\mathcal{F}_\beta} - 1\big)$, $\kappa_n = \frac{M\xi}{n} + \frac{2C_\ell M}{n}\Big(1 + \sqrt{\frac{\xi}{n}}\Big)$, and $(c_0, M, C_\ell)$ are universal constants.

Compared with the standard $O(1/\sqrt{n})$ correction, we achieve a faster rate of $O(n^{-1})$ without any additional assumptions on the noise or curvature conditions. The tight sample complexity in Theorem 4 implies that one can construct the $(1-\alpha)$ finite-sample confidence interval by optimizing (11) with $\xi = 18\big(\log\frac{12}{\alpha} + c + 2(d_{\mathcal{F}_\tau} + d_{\mathcal{F}_\beta} - 1)\log n\big)$ and composing with $\kappa_n$. However, we observe that this bound can be conservative compared to the asymptotic confidence interval in Theorem 2. Therefore, we evaluate the asymptotic version of CoinDICE, based on Theorem 2, in the experiments.

The conservativeness arises from the use of a union bound; however, we conjecture that the rate is optimal up to a constant. We use the VC dimension due to its generality. In fact, the bound can be improved by considering a data-dependent measure, e.g., Rademacher complexity, or a function-class-dependent measure, e.g., the function norm in an RKHS, for specific function approximators.

4 Policy Optimization with CoinDICE

CoinDICE provides both upper and lower bounds on the target policy's estimated value, which paves the way for applying the principle of optimism (Lattimore and Szepesvári, 2020) or pessimism (Swaminathan and Joachims, 2015) in the face of uncertainty for policy optimization in different learning settings.
Optimism in the face of uncertainty. Optimism in the face of uncertainty leads to risk-seeking algorithms, which can be used to balance the exploration/exploitation trade-off. Conceptually, they always treat the environment as the best plausibly possible. This principle has been successfully applied to stochastic bandit problems, leading to many instantiations of UCB algorithms (Lattimore and Szepesvári, 2020). In each round, an action is selected according to the upper confidence bound, and the obtained reward is used to refine the confidence bound iteratively. When applied to MDPs, this principle has inspired many optimistic model-based (Bartlett and Mendelson, 2002; Auer et al., 2009; Strehl et al., 2009; Szita and Szepesvári, 2010; Dann et al., 2017), value-based (Jin et al., 2018), and policy-based algorithms (Cai et al., 2019). Most of these algorithms are not compatible with function approximators.

We can also implement the optimism principle by optimizing the upper bound in CoinDICE iteratively, i.e., $\max_\pi u_D(\pi)$. In the $t$-th iteration, we calculate the gradient of $u_D(\pi_t)$, i.e., $\nabla_\pi u_D(\pi_t)$, based on the existing dataset $D_t$; the policy $\pi_t$ is then updated by (natural) policy gradient, and samples are collected with the updated policy $\pi_{t+1}$, as sketched below. Please refer to Appendix F for the gradient computation and algorithm details.
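A skeleton of this loop is sketched below. All helper names are hypothetical; in particular, `upper_bound_and_grad` is assumed to solve the CoinDICE upper-bound optimization $u_D(\pi)$ and return its policy gradient, and Appendix F contains the actual algorithm.

```python
def optimistic_policy_optimization(policy, dataset, num_iterations,
                                   collect, upper_bound_and_grad):
    """Illustrative optimism-in-the-face-of-uncertainty loop."""
    for _ in range(num_iterations):
        _, grad = upper_bound_and_grad(policy, dataset)
        policy = policy.apply_gradient(grad)      # (natural) policy gradient step
        dataset = dataset.union(collect(policy))  # fresh samples from pi_{t+1}
    return policy
```

The pessimistic variant for the offline setting (next paragraph) replaces the upper bound with the lower bound $l_D(\pi)$ and drops the data-collection step.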
Pessimism in the face of uncertainty. In offline reinforcement learning (Lange et al., 2012; Fujimoto et al., 2019; Wu et al., 2019; Nachum et al., 2019b), where only a fixed set of data from behavior policies is given, a safe optimization criterion is to maximize the worst-case performance among a set of statistically plausible models (Laroche et al., 2019; Kumar et al., 2019; Yu et al., 2020). In contrast to the previous case of online exploration, this is a pessimism principle (Cohen and Hutter, 2020; Buckman et al., 2020) or counterfactual risk minimization (Swaminathan and Joachims, 2015), and is closely related to robust MDPs (Iyengar, 2005; Nilim and El Ghaoui, 2005; Tamar et al., 2013; Chow et al., 2015).

Unlike most existing methods, where the worst-case performance is characterized by model-based perturbations or ensembles, CoinDICE provides a lower bound with which to implement the pessimism principle, i.e., $\max_\pi l_D(\pi)$. Conceptually, we apply the (natural) policy gradient w.r.t. $l_D(\pi_t)$ to update the policy iteratively. Since we are dealing with policy optimization in the offline setting, the dataset $D$ remains unchanged. Please refer to Appendix F for the algorithm details.

5 Related Work

Off-policy estimation has been extensively studied in the literature, given its practical importance. Most existing methods are based on the core idea of importance reweighting to correct for distribution mismatches between the target policy and the off-policy data (Precup et al., 2000; Bottou et al., 2013; Li et al., 2015; Xie et al., 2019). Unfortunately, when applied naively, importance reweighting can result in excessively high variance, which is known as the "curse of horizon" (Liu et al., 2018). To avoid this drawback, there has been rapidly growing interest in estimating the correction ratio of the stationary distribution (e.g., Liu et al., 2018; Nachum et al., 2019a; Uehara et al., 2019; Liu et al., 2019; Zhang et al., 2020a,b). This work is along the same line and thus applicable in long-horizon problems. Other off-policy approaches are also possible, notably model-based (e.g., Fonteneau et al., 2013) and doubly robust methods (Jiang and Li, 2016; Thomas and Brunskill, 2016; Tang et al., 2020; Uehara et al., 2019). These techniques can potentially be combined with our algorithm, which we leave for future investigation.

While most OPE work focuses on obtaining accurate point estimates, several authors provide ways to quantify the amount of uncertainty in OPE estimates. In particular, confidence bounds have been developed using the central limit theorem (Bottou et al., 2013), concentration inequalities (Thomas et al., 2015b; Kuzborskij et al., 2020), and nonparametric methods such as the bootstrap (Thomas et al., 2015a; Hanna et al., 2017). In contrast to these works, CoinDICE is asymptotically pivotal, meaning there are no hidden quantities we need to estimate; it corrects for the stationary distribution in the behavior-agnostic setting, thus avoiding the curse of horizon and broadening the applicability of the uncertainty estimator. Recently, Jiang and Huang (2020) provided confidence intervals for OPE, but focused on intervals determined by the approximation error induced by a function approximator, whereas our confidence intervals quantify statistical error.

Empirical likelihood (Owen, 2001) is a powerful tool with many applications in statistical inference, such as econometrics (Chen et al., 2018), and more recently in distributionally robust optimization (Duchi et al., 2016; Lam and Zhou, 2017).
EL-based confidence intervals have been used to guide exploration in multi-armed bandits (Honda and Takemura, 2010; Cappé et al., 2013), and for OPE (Karampatziakis et al., 2019; Kallus and Uehara, 2019b). While the work of Kallus and Uehara (2019b) is also based on EL, it differs from the present work in two important ways. First, their focus is on developing an asymptotically efficient OPE point estimate, not confidence intervals. Second, they solve for timestep-dependent weights, whereas we only need to solve for timestep-independent weights from a system of moment-matching equations induced by an underlying ergodic Markov chain.
6 Experiments

We now evaluate the empirical performance of CoinDICE, comparing it to a number of existing confidence interval estimators for OPE based on concentration inequalities. Specifically, given a dataset of logged trajectories, we first use weighted step-wise importance sampling (Precup et al., 2000) to calculate a separate estimate of the target policy value for each trajectory. Given this finite sample of estimates, we then use the empirical Bernstein inequality (Thomas et al., 2015b) to derive high-confidence lower and upper bounds for the true value. Alternatively, one may also use Student's $t$-test or Efron's bias-corrected and accelerated bootstrap (Thomas et al., 2015a).

We begin with a simple bandit setting, devising a two-armed bandit problem with stochastic payoffs. We define the target policy as a near-optimal policy that chooses the optimal arm with high probability, and we collect off-policy data using a behavior policy that chooses the optimal arm with considerably lower probability. Our results are presented in Figure 2. We plot the empirical coverage and width of the estimated intervals across different confidence levels. More specifically, each data point in Figure 2 is the result of 200 experiments. In each experiment, we randomly sample a dataset and then compute a confidence interval. The interval coverage is then computed as the proportion of the 200 intervals that contain the true value of the target policy. The interval log-width is the median of the log of the widths of the computed intervals.

Figure 2 shows that the intervals produced by CoinDICE achieve an empirical coverage close to the intended coverage. In this simple bandit setting, the coverages of Student's $t$ and bootstrapping are also close to correct, although they suffer more in the low-data regime. Notably, the widths of the intervals produced by CoinDICE are especially narrow while maintaining accurate coverage.
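For reference, the Bernstein baseline can be sketched as follows. This is a minimal version based on the empirical Bernstein bound of Maurer and Pontil (2009); the exact HCOPE construction of Thomas et al. (2015b) differs in details such as the handling of heavy-tailed importance weights.

```python
import numpy as np

def empirical_bernstein_interval(values, delta, value_range):
    """Two-sided empirical Bernstein interval around the mean of the
    per-trajectory importance-sampling estimates, which lie in
    [0, value_range]. (The split of delta across the two tails is
    omitted for simplicity.)"""
    n = len(values)
    mean = float(np.mean(values))
    var = float(np.var(values, ddof=1))  # sample variance
    dev = (np.sqrt(2.0 * var * np.log(2.0 / delta) / n)
           + 7.0 * value_range * np.log(2.0 / delta) / (3.0 * (n - 1)))
    return mean - dev, mean + dev
```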
[Figure 1: Results of CoinDICE and baseline methods (Bootstrapping, Bernstein, Student's $t$) on infinite-horizon versions of FrozenLake and Taxi for several dataset sizes, plotting interval coverage and interval log-width against the confidence level $(1-\alpha)$, with the expected coverage shown for reference.]

[Figure 2: Results of CoinDICE and baseline methods on a simple two-armed bandit. We plot empirical coverage and median log-width ($y$-axes) of intervals evaluated at a number of desired confidence levels ($x$-axis), as measured over 200 random trials. We find that CoinDICE achieves more accurate coverage and narrower intervals compared to the baseline confidence interval estimation methods.]

We now turn to more complicated MDP environments. We use FrozenLake (Brockman et al., 2016), a highly stochastic gridworld environment, and Taxi (Dietterich, 1998), an environment with a moderately sized state space. As in Liu et al. (2018), we modify these environments to be infinite-horizon by randomly resetting the state upon termination, and use a fixed discount factor $\gamma$. The target policy is taken to be near-optimal, while the behavior policy is highly suboptimal. The behavior policy in FrozenLake is the optimal policy with 0.2 white noise, which reduces the policy value dramatically, from 0.74 to 0.24. For the behavior policies in Taxi and Reacher, we follow the same experimental setting for constructing the behavior policies to collect data as in Nachum et al. (2019a) and Liu et al. (2018).

We follow the same evaluation protocol as in the bandit setting, measuring empirical interval coverage and log-width over experimental trials for various dataset sizes and confidence levels. Results are shown in Figure 1. We reach a similar conclusion: CoinDICE consistently achieves more accurate coverage and smaller widths than the baselines. Notably, the baseline methods' accuracy suffers more significantly compared to the simpler bandit setting described earlier.

Lastly, we evaluate CoinDICE on Reacher (Brockman et al., 2016; Todorov et al., 2012), a continuous control environment. In this setting, we use a one-hidden-layer neural network with ReLU activations. Results are shown in Figure 3.

[Figure 3: Results of CoinDICE and baseline methods on Reacher (Brockman et al., 2016; Todorov et al., 2012), plotting interval coverage and interval log-width against the confidence level $(1-\alpha)$. Colors and markers are as defined in the legends of previous figures.]

To account for the approximation error of the neural network, we measure the coverage of CoinDICE with respect to a true value computed as the median of a large ensemble of neural networks trained on the off-policy data. To keep the comparison fair, we measure the coverage of the IS-based baselines with respect to a true value computed as the median of a large number of IS-based point estimates. The results show similar conclusions as before: CoinDICE achieves more accurate coverage than the IS-based methods. Still, we see that CoinDICE coverage suffers in this regime, likely due to optimization difficulties: if the optimum of the Lagrangian is only approximately found, the empirical coverage will inevitably be inexact.

7 Conclusion

In this paper, we have developed CoinDICE, a novel and efficient confidence interval estimator applicable to the behavior-agnostic offline setting. The algorithm builds on a few technical components, including a new feature-embedded $Q$-LP and a generalized empirical likelihood approach to confidence interval estimation. We analyzed the asymptotic coverage of CoinDICE's estimate and provided a finite-sample bound. On a variety of off-policy benchmarks we empirically compared the new algorithm with several strong baselines and found it to be superior to them.

Acknowledgements
We thank Hanjun Dai, Mengjiao Yang and other members of the Google Brain team for helpful discussions. Csaba Szepesvári gratefully acknowledges funding from the Canada CIFAR AI Chairs Program, Amii and NSERC.
References
András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R. Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8139–8148, 2019.

P. Auer. Using upper confidence bounds for online learning. In Proc. 41st Annual Symposium on Foundations of Computer Science, pages 270–279. IEEE Computer Society Press, Los Alamitos, CA, 2000.

Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96, 2009.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Patrice Bertail, Emmanuelle Gautherat, and Hugo Harari-Kermadec. Empirical φ*-divergence minimizers for Hadamard differentiable functionals. In Topics in Nonparametric Statistics, pages 21–32. Springer, 2014.

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Michel Broniatowski and Amor Keziou. Divergences and duality for estimation and test under moment condition models. Journal of Statistical Planning and Inference, 142(9):2554–2573, 2012.

Jacob Buckman, Carles Gelada, and Marc G. Bellemare. The importance of pessimism in fixed-dataset policy optimization. arXiv preprint arXiv:2009.06799, 2020.

Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456–464, 2019.

X. Chen, T. M. Christensen, and E. Tamer. Monte Carlo confidence sets for identified sets. Econometrica, 86(6):1965–2018, 2018.

Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.

Michael K. Cohen and Marcus Hutter. Pessimism about unknown unknowns inspires conservatism. In Conference on Learning Theory, pages 1344–1373. PMLR, 2020.

Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. CoRR, abs/1712.10285, 2017.

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.

Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In Proc. Intl. Conf. Machine Learning, pages 118–126. Morgan Kaufmann, San Francisco, CA, 1998.

John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.

Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011. CoRR abs/1103.4601.

Ivar Ekeland and Roger Temam. Convex Analysis and Variational Problems, volume 28. SIAM, 1999.

Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019.

Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li-wei H. Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings, 2018. arXiv:1805.12298.

Josiah P. Hanna, Peter Stone, and Scott Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 4933–4934, 2017.

Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67–79, 2010.

Garud N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

Nan Jiang and Jiawei Huang. Minimax confidence interval for off-policy evaluation and policy optimization, 2020. arXiv:2002.02081.

Nan Jiang and Lihong Li. Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526, 2019a.

Nathan Kallus and Masatoshi Uehara. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems 32, pages 3320–3329, 2019b.

Nikos Karampatziakis, John Langford, and Paul Mineiro. Empirical likelihood for contextual bandits. arXiv preprint arXiv:1906.03323, 2019.

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11784–11794, 2019.

Ilja Kuzborskij, Claire Vernade, András György, and Csaba Szepesvári. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460, 2020.

Chandrashekar Lakshminarayanan, Shalabh Bhatnagar, and Csaba Szepesvári. A linearly relaxed approximate linear program for Markov decision processes. arXiv preprint arXiv:1704.02544, 2017.

Henry Lam and Enlu Zhou. The empirical likelihood approach to quantifying uncertainty in sample average approximation. Operations Research Letters, 45(4):301–307, 2017.

Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, 2012.

Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pages 3652–3661. PMLR, 2019.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(Oct):3041–3074, 2012.

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.

Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 608–616, 2015.

Tianyi Lin, Chi Jin, and Michael I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. CoRR, abs/1906.00331, 2019.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems 31, pages 5356–5366. Curran Associates, Inc., 2018.

Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling, 2019. arXiv:1910.06508.

Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, 2014.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

Susan A. Murphy, Mark J. van der Laan, James M. Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.

Ofir Nachum and Bo Dai. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32, pages 2315–2325, 2019a.

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.

Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pages 2208–2216, 2016.

Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.

Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

Art B. Owen. Empirical Likelihood. Chapman and Hall/CRC, 2001.

Jason Pazis and Ronald Parr. Non-parametric approximate linear programming for MDPs. In AAAI, 2011.

Doina Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proc. Intl. Conf. Machine Learning, pages 759–766. Morgan Kaufmann, San Francisco, CA, 2000.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Jin Qin and Jerry Lawless. Empirical likelihood and general estimating equations. The Annals of Statistics, pages 300–325, 1994.

R. Tyrrell Rockafellar. Augmented Lagrange multiplier functions and duality in nonconvex programming. SIAM Journal on Control, 12(2):268–285, 1974.

Werner Römisch. Delta method, infinite dimensional. Wiley StatsRef: Statistics Reference Online, 2014.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proceedings of the 4th International Conference on Learning Representations, 2016.

B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.

Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

Richard S. Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael P. Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285, 2012.

Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.

Istvan Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038. Omnipress, 2010.

Aviv Tamar, Huan Xu, and Shie Mannor. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189, 2013.

Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. In Proceedings of the 8th International Conference on Learning Representations, 2020.

Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015a.

Philip S. Thomas. Safe Reinforcement Learning. PhD thesis, University of Massachusetts Libraries, 2015.

Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 2139–2148, 2016.

Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015b.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.

Weiran Wang and Miguel A. Carreira-Perpinán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems 32, pages 9665–9675, 2019.

Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020a.

Shangtong Zhang, Bo Liu, and Shimon Whiteson. GradientDICE: Rethinking generalized offline estimation of stationary values, 2020b. arXiv:2001.11113.

Appendix
A Approximation Error Analysis
In this section, we provide a complete proof of Theorem 1, quantifying the effect of the function embedding of the constraints in the dual $Q$-LP. The proof is an adaptation of the standard LP argument for state-value functions to the case of the $Q$-LP (De Farias and Van Roy, 2003). We first provide an equivalent reformulation of the primal of the feature-embedded LP.
Lemma 5 The solution defined by
$$\beta^* = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{(1-\gamma)\,\mathbb{E}_{\mu\pi}\big[\beta^\top\phi(s_0,a_0)\big]\ \Big|\ \beta^\top\phi(s,a)\ge\mathcal{B}^\pi\big(\beta^\top\phi\big)(s,a),\ \forall (s,a)\in S\times A\Big\},$$
with $(\mathcal{B}^\pi Q)(s,a) := R(s,a) + \gamma\cdot\mathcal{P}^\pi Q(s,a)$, is also the solution to
$$\min_{\beta\in\mathbb{R}^p}\ \big\|Q^\pi - \beta^\top\phi\big\|_{1,\mu\pi} \quad \text{s.t.}\ \beta^\top\phi(s,a)\ge\mathcal{B}^\pi\big(\beta^\top\phi\big)(s,a),\ \forall (s,a)\in S\times A, \quad (15)$$
where $\|f\|_{1,\mu\pi} := \int |f(s,a)|\,\mu(s)\pi(a|s)\,ds\,da$.
Proof Recall that $\mathcal{B}^\pi$ is monotonic: given two bounded functions, $\nu_1\ge\nu_2$ implies $\mathcal{B}^\pi\nu_1\ge\mathcal{B}^\pi\nu_2$. Therefore, for any feasible $\nu$, we have $\nu\ge\mathcal{B}^\pi\nu\ge(\mathcal{B}^\pi)^2\nu\ge\cdots\ge(\mathcal{B}^\pi)^\infty\nu = Q^\pi$, where the convergence to $Q^\pi$ is due to the contraction property of $\mathcal{B}^\pi$. In particular, any feasible $\beta^\top\phi$ dominates $Q^\pi$ pointwise, so the absolute value in the norm can be dropped:
$$\big\|Q^\pi - \beta^\top\phi\big\|_{1,\mu\pi} = \int\big(\beta^\top\phi(s,a) - Q^\pi(s,a)\big)\,\mu(s)\pi(a|s)\,ds\,da, \quad (16)$$
which implies that minimizing $\mathbb{E}_{\mu\pi}\big[\beta^\top\phi\big]$ is equivalent to minimizing $\big\|Q^\pi - \beta^\top\phi\big\|_{1,\mu\pi}$.
Theorem 1 Suppose the constant function $1\in\mathcal{F}_\phi := \mathrm{span}\{\phi\}$. Then,
$$0\ \le\ \tilde\rho^\pi - \rho^\pi\ \le\ 2\,\min_{\beta}\,\big\|Q^\pi - \langle\beta,\phi\rangle\big\|_\infty,$$
where $Q^\pi$ is the fixed-point solution to the Bellman equation $Q(s,a) = R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a)$.
Proof We first show the equivalence between the function space embedding of the dual $Q$-LP and the linear approximation of the primal $Q$-LP, which can be derived by checking their Lagrangians. Denote
$$l(d,\beta) := \mathbb{E}_d[r(s,a)] + \beta^\top\big\langle\phi,\ (1-\gamma)\mu\pi + \gamma\cdot\mathcal{P}^\pi_* d - d\big\rangle \quad (17)$$
$$= (1-\gamma)\,\mathbb{E}_{\mu\pi}\big[\beta^\top\phi(s_0,a_0)\big] + \mathbb{E}_d\big[r(s,a) + \gamma\cdot\mathcal{P}^\pi\beta^\top\phi(s,a) - \beta^\top\phi(s,a)\big]$$
$$= (1-\gamma)\,\mathbb{E}_{\mu\pi}[Q_\beta(s_0,a_0)] + \mathbb{E}_d\big[r(s,a) + \gamma\cdot\mathcal{P}^\pi Q_\beta(s,a) - Q_\beta(s,a)\big],$$
where $\beta\in\mathbb{R}^p$ and $Q_\beta(s,a) := \beta^\top\phi(s,a)$. Since $l(d,\beta)$ is convex-concave w.r.t. $(\beta, d)$, it is also the Lagrangian of the primal $Q$-LP with linear parametrization, i.e.,
$$\min_{\beta\in\mathbb{R}^p}\ (1-\gamma)\,\mathbb{E}_{\mu\pi}\big[\beta^\top\phi(s_0,a_0)\big] \quad \text{s.t.}\ \beta^\top\phi(s,a)\ge R(s,a) + \gamma\cdot\mathcal{P}^\pi\beta^\top\phi(s,a),\ \forall (s,a)\in S\times A. \quad (18)$$
By Lemma 5, this is equivalent to solving
$$\min_{\beta\in\mathbb{R}^p}\ \big\|Q^\pi - \beta^\top\phi\big\|_{1,\mu\pi} \quad \text{s.t.}\ \beta^\top\phi(s,a)\ge\mathcal{B}^\pi\big(\beta^\top\phi\big)(s,a),\ \forall (s,a)\in S\times A. \quad (19)$$
We now define
$$(d^*,\beta^*) := \operatorname*{argmax}_{d\ge 0}\ \operatorname*{argmin}_{\beta}\ l(d,\beta),\qquad \tilde\beta := \operatorname*{argmin}_{\beta}\ \big\|Q^\pi - \beta^\top\phi\big\|_\infty,\qquad \epsilon := \big\|Q^\pi - \tilde\beta^\top\phi\big\|_\infty,$$
and obtain from strong duality that $\mathbb{E}_{d^*}[r(s,a)] = (1-\gamma)\,\mathbb{E}_{\mu\pi}\big[(\beta^*)^\top\phi\big]$.

Recall that $\mathcal{B}^\pi$ is a $\gamma$-contraction operator in the norm $\|\cdot\|_\infty$, so
$$\big\|\mathcal{B}^\pi\big(\tilde\beta^\top\phi\big) - Q^\pi\big\|_\infty \le \gamma\,\big\|\tilde\beta^\top\phi - Q^\pi\big\|_\infty,$$
which implies $\mathcal{B}^\pi\big(\tilde\beta^\top\phi\big)\le Q^\pi + \gamma\epsilon$. Now consider a new solution $\big(\tilde\beta^\top\phi - c\big)$, which must lie in $\mathrm{span}\{\phi\}$ since $1\in\mathrm{span}\{\phi\}$. Then,
$$\mathcal{B}^\pi\big(\tilde\beta^\top\phi - c\big) = \mathcal{B}^\pi\big(\tilde\beta^\top\phi\big) - \gamma c \le Q^\pi + \gamma\epsilon - \gamma c \le \tilde\beta^\top\phi + (1+\gamma)\epsilon - \gamma c = \tilde\beta^\top\phi - c + \big((1-\gamma)c + (1+\gamma)\epsilon\big).$$
Choosing $c = -(1+\gamma)\epsilon/(1-\gamma)$, the above implies $\mathcal{B}^\pi\big(\tilde\beta^\top\phi - c\big)\le\tilde\beta^\top\phi - c$. Therefore, there exists some feasible $\bar\beta$ such that
$$\bar\beta^\top\phi = \tilde\beta^\top\phi + \frac{1+\gamma}{1-\gamma}\,\epsilon.$$
Then, we can bound the approximation error:
$$\mathbb{E}_{d^*}[r(s,a)] - \rho^\pi = \mathbb{E}_{d^*}[r(s,a)] - (1-\gamma)\,\mathbb{E}_{\mu\pi}[Q^\pi] = (1-\gamma)\,\mathbb{E}_{\mu\pi}\big[(\beta^*)^\top\phi\big] - (1-\gamma)\,\mathbb{E}_{\mu\pi}[Q^\pi] \ge 0,$$
where the last inequality comes from the fact that $(1-\gamma)\,\mathbb{E}_{\mu\pi}\big[(\beta^*)^\top\phi\big]$ is the optimal value over a feasible set restricted to linearly representable $Q_\beta$.

On the other hand, we bound
$$(1-\gamma)\,\mathbb{E}_{\mu\pi}\big[(\beta^*)^\top\phi\big] - (1-\gamma)\,\mathbb{E}_{\mu\pi}[Q^\pi] = (1-\gamma)\,\big\|(\beta^*)^\top\phi - Q^\pi\big\|_{1,\mu\pi} \le (1-\gamma)\,\big\|\bar\beta^\top\phi - Q^\pi\big\|_{1,\mu\pi} \le (1-\gamma)\,\big\|\bar\beta^\top\phi - Q^\pi\big\|_\infty$$
$$\le (1-\gamma)\,\Big(\big\|\bar\beta^\top\phi - \tilde\beta^\top\phi\big\|_\infty + \big\|Q^\pi - \tilde\beta^\top\phi\big\|_\infty\Big) \le (1-\gamma)\,\Big(1 + \frac{1+\gamma}{1-\gamma}\Big)\,\epsilon = 2\epsilon,$$
where the first inequality uses the optimality of (19) (as $\bar\beta$ is feasible).

Justification of the full-rank basis embedding. The effect of the full-rank basis embedding in the example in Section 3.1 can be justified straightforwardly by considering the Lagrangian (17). If $\Phi\in\mathbb{R}^{|S||A|\times|S||A|}$ is full-rank, then $\Phi^{-1}$ exists. For arbitrary $Q\in\mathbb{R}^{|S||A|}$, there exists $\beta = \Phi^{-1}Q$ with $Q = \beta^\top\phi$, which means there is a one-to-one correspondence between $Q$ and $\beta$ in the Lagrangian. Therefore, in a finite state and action MDP, the Lagrangian is not affected by a full-rank basis embedding, and the solution under the full-rank basis embedding is the same as that of the original LP.
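As a quick numerical sanity check of this argument (a toy sketch, not part of the paper's experiments): with a random, almost surely full-rank $\Phi$, the embedded constraint pins down the solution uniquely.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6                          # |S||A| in a toy problem
Phi = rng.normal(size=(m, m))  # full rank with probability one
b = rng.random(m)              # right-hand side of the embedded constraint
# Any d satisfying Phi^T d = Phi^T b must equal b, so no constraint
# information is lost by the full-rank embedding:
d = np.linalg.solve(Phi.T, Phi.T @ b)
assert np.allclose(d, b)
```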
B CoinDICE for Undiscounted and Finite-Horizon MDPs

In the main text, we consider CoinDICE for infinite-horizon MDPs with discount factor $\gamma\in(0,1)$. The proposed CoinDICE can be easily generalized to undiscounted MDPs with $\gamma = 1$ and to finite-horizon MDPs.

Undiscounted MDP.
Specifically, we have the dual form of the $Q$-LP as
$$\tilde\rho^\pi := \max_{d:S\times A\to\mathbb{R}_+}\ \mathbb{E}_d[r(s,a)] \quad\text{s.t.}\quad \int d(s,a)\,ds\,da = 1,\quad d(s,a) = \mathcal{P}^\pi_* d(s,a),\ \forall(s,a)\in S\times A. \quad (20)$$
Compared with (3), there is an extra normalization constraint, which avoids scaling issues. Specifically, if $d(s,a)$ is feasible without the normalization constraint, then $c\cdot d(s,a)$ is also feasible for any $c > 0$; therefore, the optimization could be unbounded. By the change of variables $\tau(s,a) = \frac{d^\pi(s,a)}{d^D(s,a)}$ and feature embeddings of the stationarity constraint in (20), we obtain
$$\tilde\rho^\pi := \max_{\tau:S\times A\to\mathbb{R}_+}\ \mathbb{E}_{d^D}[\tau\cdot r(s,a)] \quad\text{s.t.}\quad \mathbb{E}_{d^D}[\tau(s,a)] = 1,\quad \mathbb{E}_{d^D}[\phi(s',a')(\tau(s',a') - \tau(s,a))] = 0. \quad (21)$$
Then, the CoinDICE confidence interval is obtained by applying the generalized empirical likelihood to (21), i.e.,
$$C^f_{n,\xi} := \left\{\tilde\rho^\pi(w) = \max_{\tau\ge 0}\mathbb{E}_w[\tau\cdot r] \ \middle|\ w\in K_f,\ \mathbb{E}_w[\tau - 1] = 0,\ \mathbb{E}_w\left[\bar\Delta(x;\tau,\phi)\right] = 0\right\},\quad\text{with}\quad K_f := \left\{w\in\mathcal{P}_{n-1}(\hat p_n),\ D_f(w\|\hat p_n)\le\frac{\xi}{n}\right\}, \quad (22)$$
where $\bar\Delta(x;\tau,\phi) := \phi(s',a')(\tau(s',a') - \tau(s,a))$. As discussed in Section 3.3 for discounted MDPs, a similar argument applies to (22), and thus one obtains the confidence interval in undiscounted MDPs as $C^f_{n,\xi} = [l_n,u_n]$ with
$$[l_n,u_n] = \left[\min_{\beta\in\mathbb{R}^p,\nu}\max_{\tau\ge 0}\min_{w\in K_f}\mathbb{E}_w[\ell(x;\tau,\beta,\nu)],\ \max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p,\nu}\max_{w\in K_f}\mathbb{E}_w[\ell(x;\tau,\beta,\nu)]\right], \quad (23)$$
where $\ell(x;\tau,\beta,\nu) := \tau\cdot r + \beta^\top\bar\Delta(x;\tau,\phi) + \nu - \nu\cdot\tau$.
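As an illustration of the estimating equations in (21)-(22), the following minimal sketch (illustrative, not the paper's released code) assembles the empirical normalization and stationarity constraints from transition samples with one-hot features; the sampled index arrays and the tau table are assumed placeholders.

import numpy as np

def empirical_constraints(s, a, s_next, a_next, tau, n_states, n_actions):
    """Empirical versions of the constraints in (21) under uniform weights w = 1/n.

    tau: array of shape (n_states, n_actions) with the ratio estimates.
    Returns (norm_gap, stat_gap): E_w[tau - 1] and E_w[phi(s',a')(tau(s',a') - tau(s,a))].
    """
    p = n_states * n_actions
    phi = np.eye(p)                          # one-hot feature map over (s, a) pairs
    idx = s * n_actions + a                  # flatten (s, a) into a feature index
    idx_next = s_next * n_actions + a_next
    tau_flat = tau.reshape(p)
    norm_gap = tau_flat[idx].mean() - 1.0
    stat_gap = (phi[idx_next] * (tau_flat[idx_next] - tau_flat[idx])[:, None]).mean(axis=0)
    return norm_gap, stat_gap

At the true stationary ratio, both gaps vanish as the sample size grows.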
Remark (Normalization constraint): Although in discounted MDPs there is no scaling issue, and thus the normalization constraint is redundant, we still prefer to add the constraint in practice. It not only brings benefits in optimization, but also enforces the normalization explicitly and reduces the feasible set, leading to better statistical properties.

Finite-horizon MDP.
While we mainly focus on infinite-horizon MDPs with a discount factor, the dual method can be adapted to finite-horizon settings straightforwardly. For example, we have the finite-horizon $d$-LP as
$$\max_{d_h:S\times A\to\mathbb{R}_+}\ \sum_{h=1}^H \mathbb{E}_{d_h}[r_h(s,a)] \quad (24)$$
$$\text{s.t.}\quad d_1(s,a) = \mu_0(s)\pi(a|s), \quad (25)$$
$$d_{h+1}(s,a) = \mathcal{P}^\pi_* d_h(s,a),\ \forall h\in\{1,\dots,H\}. \quad (26)$$
Upon this finite-horizon formulation, we can derive the finite-step CoinDICE following the same technique, i.e.,
$$[l_n,u_n] = \left[\min_{w\in K_f}\min_{\{\beta_h\}_{h=1}^H\in\mathbb{R}^p}\max_{\{\tau_h\}_{h=1}^H\ge 0}\mathbb{E}_w\left[\ell_H\left(x;\{\tau_h\}_{h=1}^H,\{\beta_h\}_{h=1}^H\right)\right],\ \max_{w\in K_f}\max_{\{\tau_h\}_{h=1}^H\ge 0}\min_{\{\beta_h\}_{h=1}^H\in\mathbb{R}^p}\mathbb{E}_w\left[\ell_H\left(x;\{\tau_h\}_{h=1}^H,\{\beta_h\}_{h=1}^H\right)\right]\right],$$
where $x := \{(s,a,r,s',a',h)\}_{h=1}^H$, $\ell_H(x;\{\tau_h\},\{\beta_h\}) := \sum_{h=1}^H\tau_h r_h + \sum_{h=1}^H\beta_h^\top\Delta_h(x;\tau_h,\phi)$, and $\Delta_h(x;\tau_h,\phi) := \tau_h(s,a)\phi(s',a') - \tau_{h+1}(s',a')\phi(s',a')$.

C CoinBandit
MDPs are strictly more general than multi-armed and contextual bandits. Therefore, our estimator can also be specialized for confidence interval estimation in bandit problems with slight modifications. Without loss of generality, we consider the contextual bandit setting; the multi-armed bandit can be further reduced from the contextual bandit. Specifically, in the behavior-agnostic contextual bandit setting, the stationary distribution constraint in (5) is no longer applicable. We rewrite the policy value as
$$\tilde\rho^\pi := \mathbb{E}_{s\sim\mu^D,\,a\sim\pi(a|s)}[r(s,a)] = \max_{\tau:S\times A\to\mathbb{R}_+}\left\{\mathbb{E}_{d^D}[\tau\cdot r(s,a)]\ \middle|\ d^D\cdot\tau = \mu^D\pi,\ \mathbb{E}_{d^D}[\tau] = 1\right\}, \quad (27)$$
where we overload $\mu^D$ as the context distribution, which is the same for all policies, $d^D(s,a) = \mu^D(s)\pi_b(a|s)$, $\tau(s,a) := \frac{\mu^D(s)\pi(a|s)}{\mu^D(s)\pi_b(a|s)}$, and $\phi(s,a)$ denotes the feature mapping. We keep the normalization constraint to ensure the validity of the density ratio empirically. We apply the same technique to (27), leading to the CoinBandit confidence interval estimator
$$C^f_{n,\xi} := \left\{\tilde\rho^\pi(w) = \max_{\tau\ge 0}\mathbb{E}_w[\tau\cdot r]\ \middle|\ w\in K_f,\ \mathbb{E}_w[\tau - 1] = 0,\ \mathbb{E}_w[\Delta_b(x;\tau,\phi)] = 0\right\},\quad\text{with}\quad K_f := \left\{w\in\mathcal{P}_{n-1}(\hat p_n),\ D_f(w\|\hat p_n)\le\frac{\xi}{n}\right\}, \quad (28)$$
where $x := (s,a,s',a')$ is constructed by $s\sim\mu^D(s)$, $a\sim\pi(a|s)$ and $(s',a')\sim d^D$, and $\Delta_b(x;\tau,\phi) := \phi(s,a) - \phi(s',a')\cdot\tau(s',a')$. Similarly, the interval estimator in CoinBandit (28) can be calculated by solving a minimax optimization.

Remark (Behavior-known contextual bandit):
When the behavior policy $\pi_b(a|s)$ is known, the solution to (27) can be computed in closed form as $\tau(s,a) = \frac{\pi(a|s)}{\pi_b(a|s)}$. Then, CoinBandit reduces to
$$C^f_{n,\xi} := \left\{\tilde\rho^\pi(w) = \mathbb{E}_w[\tau\cdot r]\ \middle|\ w\in K_f,\ \mathbb{E}_w[\tau - 1] = 0\right\},\quad\text{with}\quad K_f := \left\{w\in\mathcal{P}_{n-1}(\hat p_n),\ D_f(w\|\hat p_n)\le\frac{\xi}{n}\right\}. \quad (29)$$

Remark (Multi-armed bandit): Furthermore, the estimators (28) and (29) can be further reduced for the multi-armed bandit. Specifically, we set all $s$ equivalent; then $s$ becomes a dummy variable, and the CoinBandit estimators (28) and (29) reduce to off-policy evaluation in the multi-armed bandit. If the number of actions is finite, we can use a tabular representation for $\tau(a)$, eliminating the approximation error.
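For the behavior-known estimator (29) with the χ²-divergence, the inner problem over $w$ admits the variance-based characterization of Lemmas 11 and 12 in Appendix E.2, so the interval can be computed directly from the sample mean and variance of $Z = \tau\cdot r$. A minimal sketch, with illustrative names and ignoring the $O(\xi/n)$ corrections:

import numpy as np
from scipy.stats import chi2

def coinbandit_known_behavior(r, pi_probs, pib_probs, alpha=0.05):
    """Approximate interval (29) for E[tau * r] with tau = pi / pi_b known,
    using the chi^2 ball: [mean - sqrt(xi s^2 / n), mean + sqrt(xi s^2 / n)]."""
    z = (pi_probs / pib_probs) * r        # Z = tau * r per sample
    n = len(z)
    xi = chi2.ppf(1.0 - alpha, df=1)      # xi = chi^2_{1, 1 - alpha}
    mean, s2 = z.mean(), z.var()
    half = np.sqrt(xi * s2 / n)
    return mean - half, mean + half

Here r holds per-sample rewards, and pi_probs, pib_probs hold $\pi(a_i|s_i)$ and $\pi_b(a_i|s_i)$, respectively.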
Remark (Comparison to Karampatziakis et al. (2019)): Karampatziakis et al. (2019) consider off-policy confidence interval estimation for contextual bandits. Although both CoinBandit and the estimator in Karampatziakis et al. (2019) share the same asymptotic coverage, there are significant differences:
• The estimator in Karampatziakis et al. (2019) is derived from empirical likelihood with the reverse KL-divergence, while our CoinBandit is based on generalized empirical likelihood with an arbitrary $f$-divergence.
• More importantly, compared to our CoinBandit, which is applicable in both the behavior-agnostic and behavior-known off-policy settings, the estimator in Karampatziakis et al. (2019) is only valid in the behavior-known setting.
• Computationally, the estimator in Karampatziakis et al. (2019) requires an extra statistic, i.e.,
$$\left\{\max_w\ \sum_{i=1}^n\log(nw_i)\ \middle|\ \mathbb{E}_w[\tau - 1] = 0,\ w\in K_{-\log(\cdot)}\right\},$$
while such a quantity is not required in CoinBandit, thus saving computational cost.
• Statistically, we provide a finite-sample complexity for CoinBandit in Theorem 4, while such a sample complexity is not clear for Karampatziakis et al. (2019).

D Stochastic Confidence Interval Estimation
In this section, we analyze the properties of the optimization problems for the upper and lower bounds and derive the practical algorithm.
D.1 Upper and Lower Confidence Bounds
We first establish the distributionally robust optimization representation of the confidence region:
Lemma 6
Let $\hat\rho^\pi(w) = \max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right]$. The confidence region $C^f_{n,\xi}$ can be represented equivalently as
$$C^f_{n,\xi} = \left\{\hat\rho^\pi(w)\ \middle|\ w\in K_f\right\}. \quad (30)$$

Proof
For any $w\in K_f$, we rewrite the optimization (8) by its Lagrangian, which gives an estimate of the policy value:
$$\hat\rho^\pi(w) = \max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right]. \quad (31)$$
Based on Lemma 6, we can formulate the upper and lower bounds:

Theorem 3
Denote the lower and upper confidence bounds of $C^f_{n,\xi}$ by $l_n$ and $u_n$, respectively:
$$[l_n,u_n] = \left[\min_{w\in K_f}\min_{\beta\in\mathbb{R}^p}\max_{\tau\ge 0}\mathbb{E}_w[\ell(x;\tau,\beta)],\ \max_{w\in K_f}\max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p}\mathbb{E}_w[\ell(x;\tau,\beta)]\right]$$
$$= \left[\min_{\beta\in\mathbb{R}^p}\max_{\tau\ge 0}\min_{w\in K_f}\mathbb{E}_w[\ell(x;\tau,\beta)],\ \max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p}\max_{w\in K_f}\mathbb{E}_w[\ell(x;\tau,\beta)]\right],$$
where $\ell(x;\tau,\beta) := \tau\cdot r + \beta^\top\Delta(x;\tau,\phi)$. For any $(\tau,\beta,\lambda,\eta)$ that satisfies the constraints in (11), the optimal weights for the lower and upper confidence bounds are
$$w_l = f'_*\left(\frac{\eta - \ell(x;\tau,\beta)}{\lambda}\right) \quad\text{and}\quad w_u = f'_*\left(\frac{\ell(x;\tau,\beta) - \eta}{\lambda}\right),$$
respectively. Therefore, the confidence bounds can be simplified as:
$$\begin{bmatrix} l_n \\ u_n \end{bmatrix} = \begin{bmatrix} \min_{\beta}\max_{\tau\ge 0,\lambda\ge 0,\eta}\ \mathbb{E}_D\left[-\lambda f_*\left(\frac{\eta - \ell(x;\tau,\beta)}{\lambda}\right) + \eta - \lambda\frac{\xi}{n}\right] \\ \max_{\tau\ge 0}\min_{\beta,\lambda\ge 0,\eta}\ \mathbb{E}_D\left[\lambda f_*\left(\frac{\ell(x;\tau,\beta) - \eta}{\lambda}\right) + \eta + \lambda\frac{\xi}{n}\right] \end{bmatrix}.$$
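To make the simplified form concrete, the sketch below evaluates and minimizes the upper-bound dual objective for the χ²-divergence $f(x) = (x-1)^2$, whose conjugate over $x\ge 0$ is $f_*(y) = y + y^2/4$ for $y\ge -2$ and $-1$ otherwise. Here ell is a fixed vector of per-sample losses $\ell(x_i;\tau,\beta)$, and all names are illustrative assumptions rather than the paper's released implementation.

import numpy as np
from scipy.optimize import minimize

def f_star(y):
    # Conjugate of f(x) = (x - 1)^2 restricted to x >= 0.
    return np.where(y >= -2.0, y + 0.25 * y ** 2, -1.0)

def upper_dual(params, ell, xi):
    # E_D[ lambda f_*((ell - eta) / lambda) + eta + lambda xi / n ]; lambda > 0 via softplus.
    lam = np.log1p(np.exp(params[0])) + 1e-8
    eta = params[1]
    n = len(ell)
    return np.mean(lam * f_star((ell - eta) / lam)) + eta + lam * xi / n

ell = np.random.default_rng(1).uniform(size=200)
res = minimize(upper_dual, x0=np.array([0.0, ell.mean()]), args=(ell, 3.84),
               method='Nelder-Mead')
print(res.fun)  # approximates u_n for this fixed (tau, beta)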
Proof

We first calculate the upper bound $u_n$ using Lemma 6:
$$u_n = \max_{w\in K_f}\hat\rho^\pi(w) = \max_{w\in K_f}\max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right]$$
$$= \max_{\tau\ge 0}\max_{w\in K_f}\min_{\beta\in\mathbb{R}^p}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right] \quad (32)$$
$$= \max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p}\max_{w\in K_f}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right], \quad (33)$$
where the switch between $\max_{w\in K_f}$ and $\max_{\tau\ge 0}$ in (32) is immediate, and (33) holds because the objective is linear (hence concave) w.r.t. $\beta$ and linear (hence convex) w.r.t. $w$, separately, so the inner min and max can be swapped. We apply the Lagrangian to the inner constrained optimization over $w$, leading to
$$u_n = \max_{\tau\ge 0}\min_{\beta,\lambda\ge 0,\eta}\max_{w\ge 0}\ \mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right] - \lambda\left(D_f(w\|\hat p_n) - \frac{\xi}{n}\right) + \eta\left(1 - w^\top\mathbf{1}\right)$$
$$= \max_{\tau\ge 0}\min_{\beta,\lambda\ge 0,\eta}\ \mathbb{E}_D\left[\lambda f_*\left(\frac{\tau\cdot r + \beta^\top\Delta(x;\tau,\phi) - \eta}{\lambda}\right) + \eta + \lambda\frac{\xi}{n}\right], \quad (34)$$
where the last equation comes from the conjugate of $f$, and for any given $(\tau,\beta,\lambda,\eta)$, the optimal $w^*$ is
$$w^*_u = f'_*\left(\frac{\tau\cdot r + \beta^\top\Delta(x;\tau,\phi) - \eta}{\lambda}\right).$$
The lower bound $l_n$ may be obtained in a similar fashion:
$$l_n = \min_{w\in K_f}\hat\rho^\pi(w) = \min_{w\in K_f}\max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right] = \min_{w\in K_f}\min_{\beta\in\mathbb{R}^p}\max_{\tau\ge 0}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right]$$
$$= \min_{\beta\in\mathbb{R}^p}\min_{w\in K_f}\max_{\tau\ge 0}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right] = \min_{\beta\in\mathbb{R}^p}\max_{\tau\ge 0}\min_{w\in K_f}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right].$$
Again, we consider the Lagrangian:
$$l_n = \min_{\beta\in\mathbb{R}^p}\max_{\tau\ge 0,\lambda\ge 0,\eta}\min_{w\ge 0}\ \mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right] + \lambda\left(D_f(w\|\hat p_n) - \frac{\xi}{n}\right) + \eta\left(1 - w^\top\mathbf{1}\right)$$
$$= \min_{\beta}\max_{\tau\ge 0,\lambda\ge 0,\eta}\ \mathbb{E}_D\left[-\lambda f_*\left(\frac{\eta - (\tau\cdot r + \beta^\top\Delta(x;\tau,\phi))}{\lambda}\right) + \eta - \lambda\frac{\xi}{n}\right],$$
and the optimal weight is
$$w^*_l = f'_*\left(\frac{\eta - (\tau\cdot r + \beta^\top\Delta(x;\tau,\phi))}{\lambda}\right).$$

D.2 Closed-form Solution for Reweighting

We consider a few examples of $f$-divergences in Theorem 3, and show how the weights can be efficiently computed for a given $\tau$ and $\beta$.
• KL-divergence. To satisfy the conditions in Assumption 1, we select $f(x) = 2x\log x$. Recall the property that for any convex function $f$ and any $\alpha > 0$, the conjugate function of $g(x) = \alpha f(x)$ is $g_*(y) = \alpha f_*(y/\alpha)$. Let $f$ be the standard $f$-divergence function of the KL-divergence $KL(w\|\hat p_n)$, i.e., $f(x) = 2x\log x$. With $g'_*(y) = f'_*(y/\alpha)$, equation (12) implies the following lower and upper bound weights:
$$w_l(x) = \exp\left(\frac{\eta_l - \ell(x;\tau,\beta)}{2\lambda}\right),\quad \eta_l = -2\lambda\log\sum_{i=1}^n\exp\left(-\frac{\ell(x_i;\tau,\beta)}{2\lambda}\right),$$
$$w_u(x) = \exp\left(\frac{\ell(x;\tau,\beta) - \eta_u}{2\lambda}\right),\quad \eta_u = 2\lambda\log\sum_{i=1}^n\exp\left(\frac{\ell(x_i;\tau,\beta)}{2\lambda}\right).$$
This can also be verified by plugging $f(x) = 2x\log x$ into (12) and enforcing $w^\top\mathbf{1} = 1$.
• Reverse KL-divergence.
With the $f$-divergence function $f(x) = -\log x$ for the reverse KL-divergence, one has the following lower and upper bound weights:
$$w_l(x) = \lambda\,\delta(\ell(x;\tau,\beta) > \eta_l)\,(\ell(x;\tau,\beta) - \eta_l)^{-1},\quad \sum_{i=1}^n\delta(\ell(x_i;\tau,\beta) > \eta_l)\,(\ell(x_i;\tau,\beta) - \eta_l)^{-1} = \frac{1}{\lambda},$$
$$w_u(x) = \lambda\,\delta(\eta_u > \ell(x;\tau,\beta))\,(\eta_u - \ell(x;\tau,\beta))^{-1},\quad \sum_{i=1}^n\delta(\eta_u > \ell(x_i;\tau,\beta))\,(\eta_u - \ell(x_i;\tau,\beta))^{-1} = \frac{1}{\lambda},$$
where $\delta(a > b) = 1$ if $a > b$ and $0$ otherwise. This is obtained by plugging $f(x) = -\log x$ into (12) and enforcing $w^\top\mathbf{1} = 1$, $w\ge 0$, and the KKT conditions on the dual variables for $w\ge 0$. Unfortunately, the reverse KL-divergence does not satisfy the conditions in Assumption 1. Since this is the standard $f$-divergence function of the vanilla empirical likelihood maximization problem, we include it here for the sake of completeness.
• χ²-divergence. Notice that the standard $f$-divergence function $f(x) = (x-1)^2$ of the χ²-divergence $\chi^2(w\|\hat p_n) := \mathbb{E}_{\hat p_n}\left[\left(\frac{w}{\hat p_n} - 1\right)^2\right]$ satisfies the conditions in Assumption 1. Consider the lower bound calculation. We leverage the closed-form solution of the following $\ell_2$ projection problem onto the simplex $\{w : w^\top\mathbf{1} = 1, w\ge 0\}$ (Wang and Carreira-Perpinán, 2013):
$$\arg\min_{w:w^\top\mathbf{1}=1,\,w\ge 0}\ \sum_{i=1}^n\frac{w_i\ell(x_i;\tau,\beta)}{\lambda} + \sum_{i=1}^n\frac{(w_i - \hat p_{n,i})^2}{\hat p_{n,i}} = \sqrt{\hat p_{n,i}}\cdot\arg\min_{v:v^\top\sqrt{\hat p_n}=1,\,v\ge 0}\ \sum_{i=1}^n\left(v_i - \left(1 - \frac{\ell(x_i;\tau,\beta)}{2\lambda}\right)\sqrt{\hat p_{n,i}}\right)^2$$
(here we let $v_i = w_i/\sqrt{\hat p_{n,i}}$); the lower-bound weights $w_l(x)$ are given by (for any $i\in\{1,2,\dots,n\}$)
$$w_l(x_i) = \sqrt{\hat p_{n,i}}\cdot v^*(x_i) = \sqrt{\hat p_{n,i}}\cdot\left(\left(1 - \frac{\ell(x_i;\tau,\beta)}{2\lambda}\right)\sqrt{\hat p_{n,i}} + G_{\hat p_n}\left(\left(1 - \frac{\ell(x;\tau,\beta)}{2\lambda}\right)\sqrt{\hat p_n}\right)\right)_+,$$
$$G_{\hat p_n}(y) = \frac{1 - \sum_{i=1}^{|S_{\hat p_n}|} y_{(i)}\sqrt{\hat p_{n,i}}}{\sum_{i=1}^{|S_{\hat p_n}|}\hat p_{n,i}},$$
where $S_{\hat p_n}$ is the set of indices in $\{1,\dots,n\}$ in which any element $j$ satisfies $y_{(j)} + \frac{1}{\sum_{i=1}^j\hat p_{n,i}}\left(1 - \sum_{i=1}^j y_{(i)}\sqrt{\hat p_{n,i}}\right) > 0$. Here $y_{(i)}$ denotes the $i$-th largest element of $y$. Using analogous arguments, by replacing $\ell$ with $-\ell$, one can define a similar solution for the upper-bound weights $w_u(x)$. Now suppose $\hat p_{n,i} = \frac{1}{n}$ for all $i$. Then we have
$$w_l(x_i) = \sqrt{\tfrac{1}{n}}\cdot\left(\left(1 - \frac{\ell(x_i;\tau,\beta)}{2\lambda}\right)\sqrt{\tfrac{1}{n}} + G_n\left(\left(1 - \frac{\ell(x;\tau,\beta)}{2\lambda}\right)\sqrt{\tfrac{1}{n}}\right)\right)_+,$$
$$w_u(x_i) = \sqrt{\tfrac{1}{n}}\cdot\left(\left(1 + \frac{\ell(x_i;\tau,\beta)}{2\lambda}\right)\sqrt{\tfrac{1}{n}} + G_n\left(\left(1 + \frac{\ell(x;\tau,\beta)}{2\lambda}\right)\sqrt{\tfrac{1}{n}}\right)\right)_+,$$
where $G_n(y) = \frac{n - \sqrt{n}\sum_{i=1}^{|S_{1/n}|} y_{(i)}}{|S_{1/n}|}$, and $S_{1/n}$ is the set of indices in $\{1,\dots,n\}$ in which any element $j$ satisfies $y_{(j)} + \frac{1}{j}\left(n - \sqrt{n}\sum_{i=1}^j y_{(i)}\right) > 0$. Here $y_{(i)}$ again denotes the $i$-th largest element of $y$. This can also be verified by plugging $f(x) = (x-1)^2$ into (12) and enforcing $w^\top\mathbf{1} = 1$ and $w\ge 0$; the uniform-weight case is implemented in the sketch after this list. In fact, the above can be generalized to the Cressie-Read family with $f(x) = \frac{x^k - 1 - k(x-1)}{k(k-1)}$.
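The uniform-weight case above is a Euclidean projection onto the probability simplex, so the weights can be computed in $O(n\log n)$ with the standard sort-and-threshold routine. A minimal sketch, assuming $\hat p_n = 1/n$ and a fixed $\lambda > 0$; names are illustrative.

import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : sum(w) = 1, w >= 0}
    (Wang and Carreira-Perpinan, 2013)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.max(j[u + (1.0 - css) / j > 0])
    theta = (1.0 - css[rho - 1]) / rho
    return np.maximum(v + theta, 0.0)

def chi2_lower_weights(ell, lam):
    """Lower-bound weights for the chi^2 penalty with uniform p_hat = 1/n:
    w_l = Proj_simplex((1 - ell / (2 lam)) / n)."""
    n = len(ell)
    return project_simplex((1.0 - ell / (2.0 * lam)) / n)

The upper-bound weights follow by replacing ell with -ell, mirroring the display above.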
D.3 Practical Algorithm
In (13), we eliminate one level of optimization, and thus reduce the computational difficulty. Meanwhile, SGDA for (13) benefits from attractive finite-step convergence. However, as observed in Namkoong and Duchi (2016), when $\lambda$ approaches $0$, SGDA for (13) may suffer from high variance. In this section, we consider two optional strategies to bypass this difficulty. We take the upper bound as an example; the same strategies apply to the lower bound:
• Instead of using the optimal weights (12), Namkoong and Duchi (2016) suggest keeping $(w,\lambda)$ in the optimization and updating them simultaneously via gradients, i.e., targeting the Lagrangian (33) with SGDA directly. For example, with the KL-divergence, this leads to the update of $w_u$ at the $t$-th iteration as
$$\tilde w^{(j)} = \exp\left(\eta_t\ell^{(j)}\right)\left(w^{(j)}\right)^{1-\eta_t\lambda}\left(\frac{1}{n}\right)^{\eta_t\lambda} \quad\text{and}\quad w_u = \frac{\tilde w^{(j)}}{\sum_j\tilde w^{(j)}}, \quad (35)$$
with $\eta_t$ as the stepsize.

Algorithm 1 CoinDICE: estimating the upper confidence bound using the KL-divergence and function approximation.
Inputs: a target policy $\pi$, a desired confidence level $1-\alpha$, a finite-sample dataset $D := \{(s_0^{(j)}, s^{(j)}, a^{(j)}, r^{(j)}, s'^{(j)})\}_{j=1}^n$, an optimizer $\mathrm{OPT}_\theta$, and numbers of iterations $K$, $T$.
Set the divergence limit $\xi := \chi^2_{1,1-\alpha}$.
Initialize $\lambda\in\mathbb{R}$, $Q_{\theta_1}: S\times A\to\mathbb{R}$, $\zeta_{\theta_2}: S\times A\to\mathbb{R}$, and uniform weights $w$.
for $k = 1,\dots,K$ do
  for $t = 1,\dots,T$ do
    Sample actions from the target policy: $a_0^{(j)}\sim\pi(s_0^{(j)})$, $a'^{(j)}\sim\pi(s'^{(j)})$ for $j = 1,\dots,n$.
    Compute the loss terms: $\ell^{(j)} := (1-\gamma)Q_{\theta_1}(s_0^{(j)},a_0^{(j)}) + \zeta_{\theta_2}(s^{(j)},a^{(j)})\cdot\left(-Q_{\theta_1}(s^{(j)},a^{(j)}) + r^{(j)} + \gamma Q_{\theta_1}(s'^{(j)},a'^{(j)})\right)$.
    Compute the loss $L := \sum_{j=1}^n w^{(j)}\cdot\ell^{(j)}$.
    Update $(\theta_1,\theta_2)\leftarrow\mathrm{OPT}_\theta(L,\theta_1,\theta_2)$.
  end for
  Update $(w,\lambda)$ by (35) or (37).
end for
Return $L$.

• The instability and high variance of solving (13) come from the unboundedness of $w$ induced by an arbitrary $\lambda$ during the optimization procedure. In other words, given a fixed $(\tau,\beta)$, if we keep $w\in K_f$ satisfied, i.e. (implemented in the sketch below),
$$w_u = \arg\max_{KL(w\|\hat p_n)\le\frac{\xi}{n}}\langle w,\ell\rangle \ \Rightarrow\ (w_u,\lambda^*) = \arg\max_{w^\top\mathbf{1}=1,\,w\ge 0}\arg\min_{\lambda\ge 0}\ \langle w,\ell\rangle - \lambda\left(KL(w\|\hat p_n) - \frac{\xi}{n}\right) \quad (36)$$
$$\Rightarrow\ \tilde w^{(j)}_{\lambda^*} := \exp\{\ell^{(j)}/\lambda^*\};\quad w^{(j)}_{\lambda^*} := \tilde w^{(j)}_{\lambda^*}\Big/\sum_j\tilde w^{(j)}_{\lambda^*}\quad\text{with}\quad \sum_{j=1}^n w^{(j)}_{\lambda^*}\log\left(n\,w^{(j)}_{\lambda^*}\right) = \frac{\xi}{n}, \quad (37)$$
then the optimization will be stable.
Moreover, regarding computational cost, the major bottleneck in the optimization is updating $w$, which is an $O(n)$ operation. Therefore, we update $w$ less frequently, which corresponds to optimizing the equivalent form (10). Combining these techniques with SGDA, we illustrate the algorithm in Algorithm 1.
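The $\lambda^*$ in (37) can be found by a one-dimensional root search, since the KL-divergence of the softmax weights $w_\lambda = \mathrm{softmax}(\ell/\lambda)$ from the uniform distribution decreases monotonically in $\lambda$. A minimal sketch of this update, assuming $0 < \xi/n < \log n$ and an adequate bracket; names are illustrative.

import numpy as np
from scipy.optimize import brentq
from scipy.special import softmax

def kl_weights(ell, xi):
    """Solve (37): w_lambda = softmax(ell / lambda) with
    KL(w_lambda || 1/n) = sum_j w_j log(n w_j) = xi / n."""
    n = len(ell)

    def kl_gap(lam):
        w = softmax(ell / lam)
        return np.sum(w * np.log(n * w + 1e-300)) - xi / n

    lam_star = brentq(kl_gap, 1e-6, 1e6)   # bracket assumed wide enough
    return softmax(ell / lam_star), lam_star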
Remark (More regularization for stability): Directly solving a Lagrangian of an LP may induce some instability due to the lack of curvature. To overcome this difficulty, the augmented Lagrangian method (ALM) (Rockafellar, 1974) is a natural choice. Directly applying the ALM introduces the regularization $h\left(\mathbb{E}_{\hat p_n}[\Delta(x;\tau,\phi)]\right)$, where $h$ denotes some convex function with minimum at zero. Such regularization changes neither the optimal solution $(\tau,\beta)$ in (11) nor the value $[l_n,u_n]$.

The ALM introduces extra computational cost in optimization, since the regularization involves empirical expectations inside a nonlinear function. We exploit alternative regularizations following the spirit of the ALM, while circumventing the computational difficulty. Recall that regularization of the dual variable does not change the optimal solution (Nachum et al., 2019b, Theorem 4), i.e.,
$$\tau^*(s,a) = \arg\max_{\tau\ge 0}\left\{\mathbb{E}_{d^D}[\tau\cdot r(s,a)]\ \middle|\ \mathbb{E}_{d^D}[\Delta(x;\tau,\phi)] = 0\right\} \quad (38)$$
$$= \arg\max_{\tau\ge 0}\left\{\mathbb{E}_{d^D}[\tau\cdot r(s,a)] - \alpha\mathbb{E}_p[h(\tau)]\ \middle|\ \mathbb{E}_{d^D}[\Delta(x;\tau,\phi)] = 0\right\}, \quad (39)$$
where $p$ is some distribution over $S\times A$. We show the upper bound as an example; the lower bound can be treated similarly. We have
$$(w_u,\tau^*) = \arg\max_{w\in K_f}\left\{\arg\max_{\tau\ge 0}\mathbb{E}_w[\tau\cdot r(s,a)]\ \middle|\ \mathbb{E}_w[\Delta(x;\tau,\phi)] = 0\right\} = \arg\max_{w\in K_f}\left\{\arg\max_{\tau\ge 0}\mathbb{E}_w[\tau\cdot r(s,a)] - \alpha\mathbb{E}_p[h(\tau)]\ \middle|\ \mathbb{E}_w[\Delta(x;\tau,\phi)] = 0\right\}, \quad (40)$$
where the equality comes from Nachum et al. (2019b, Theorem 4) and the fact that the regularization $\mathbb{E}_p[h(\tau)]$ does not depend on $w$. Then, we can solve (40) for $(w_u,\tau^*)$ via the Lagrangian
$$\max_{\tau\ge 0}\min_\beta\max_{w\in K_f}\ \mathbb{E}_w\left[\tau\cdot r(s,a) + \beta^\top\Delta(x;\tau,\phi)\right] - \alpha\mathbb{E}_p[h(\tau)]. \quad (41)$$
Although the optimal $\tilde\beta^*$ of (41) differs from $\beta^*$, $(w_u,\tau^*)$ are the same. Once we have $(w_u,\tau^*)$, we can recover the original Lagrangian value $\tilde\rho^\pi(w_u) = \mathbb{E}_{w_u}[\tau^*\cdot r(s,a)]$, since $\mathbb{E}_{w_u}\left[\beta^{*\top}\Delta(x;\tau^*,\phi)\right] = 0$ in the original Lagrangian $\mathbb{E}_w[\ell(x;\tau^*,\beta^*)]$ in (11), due to the KKT conditions. Compared to the original ALM, the new regularization enjoys the advantages of the ALM while keeping the original computational efficiency.
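In code, this alternative regularization is a one-line change to the Lagrangian loss. A minimal sketch, where the choice $h(\tau) = (\tau - 1)^2$ and the value of $\alpha$ are illustrative assumptions (the paper only requires $h$ convex with minimum value zero), and $p$ is taken as the empirical distribution:

import numpy as np

def regularized_lagrangian(tau, r, delta, beta, w, alpha=0.1):
    """E_w[tau * r + beta^T Delta(x; tau, phi)] - alpha * E_p[h(tau)], as in (41).

    tau: (n,) ratio values; r: (n,) rewards; delta: (n, p) features Delta(x; tau, phi);
    beta: (p,) multipliers; w: (n,) nonnegative weights summing to one.
    """
    h = (tau - 1.0) ** 2   # illustrative convex h, minimized (at zero) when tau = 1
    return np.sum(w * (tau * r + delta @ beta)) - alpha * np.mean(h)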
E Proofs for Statistical Properties

In this section, we provide the detailed proofs of the asymptotic coverage (Theorem 2) and the finite-sample correction (Theorem 4). For notational simplicity, we use $\sup,\max$ and $\inf,\min$ interchangeably. With a slight abuse of notation, we use $\int$ for $\sum$ on discrete domains.

E.1 Asymptotic Coverage
Theorem 2 follows from a result in Duchi et al. (2016). The following notation will be needed:
• $\ell(x;\tau,\beta) = (1-\gamma)\beta^\top\phi(s_0,a_0) + \tau(s,a)\left(r(s,a) + \gamma\beta^\top\phi(s',a') - \beta^\top\phi(s,a)\right)$;
• $\|f\|_1 := \int|f(s,a)|\,d^D(s,a)\,ds\,da$, and $\|\phi(s,a)\|_2 := \sqrt{\langle\phi,\phi\rangle}$;
• $\|f(s,a)\|^2_{L^2(d^D)} := \mathbb{E}_{d^D}[f(s,a)^2]$; for $\mathcal{H}\subset L^2(d^D)$, we define $L^\infty(\mathcal{H})$ to be the space of bounded linear functionals on $\mathcal{H}$, with $\|L_1 - L_2\|_{\mathcal{H}} := \sup_{h\in\mathcal{H}}|L_1 h - L_2 h|$ for $L_1,L_2\in L^\infty(\mathcal{H})$;
• $p = \frac{dP}{d\mu}$, with $\mu$ a Lebesgue measure, is the Radon-Nikodym derivative. Abusing notation a bit, we use $(D_f(P\|Q), D_f(p\|q))$ and $(\mathbb{E}_P[\cdot], \mathbb{E}_p[\cdot])$ interchangeably.

Definition 7 (Duchi et al., 2016,
Hadamard directional differentiability)
Let $\mathcal{Q}$ be the space of signed measures bounded in the norm $\|\cdot\|_{\mathcal{H}}$. The functional $T:\mathcal{P}\to\mathbb{R}$ is Hadamard directionally differentiable at $P\in\mathcal{P}$ tangentially to $\mathcal{B}\subset\mathcal{Q}$ if for all $H\in\mathcal{B}$, there exists $dT_P(H)\in\mathbb{R}$ such that for all convergent sequences $t_n\to 0$ and $\|H_n - H\|_{\mathcal{H}}\to 0$ satisfying $P + t_nH_n\in\mathcal{P}$, the following holds:
$$\frac{T(P + t_nH_n) - T(P)}{t_n}\to dT_P(H),\quad\text{as } n\to\infty.$$
We say $T:\mathcal{P}\to\mathbb{R}$ has an influence function $T^{(1)}(x;P)\in\mathbb{R}$ if
$$dT_P(Q - P) := \int T^{(1)}(x;P)\,d(Q-P)(x),\quad\text{and}\quad \mathbb{E}_P\left[T^{(1)}(x;P)\right] = 0.$$
We consider $f$ in $D_f$ satisfying the following assumption (Duchi et al., 2016):

Assumption 1 (Smoothness of f-divergence) The function $f:\mathbb{R}_+\to\mathbb{R}$ is convex, three times differentiable in a neighborhood of $1$, $f(1) = f'(1) = 0$, and $f''(1) = 2$.

Then, the following theorem, which slightly simplifies Duchi et al. (2016, Theorem 10), characterizes the asymptotic coverage of the general uncertainty estimation.
Theorem 8 (General asymptotic coverage)
Let Assumption 1 hold and $\mathcal{H} = \{h(x;\tau,\beta)\}$, where $h(x;\tau,\beta)$ is Lipschitz and the space of $(\tau,\beta)$ is compact. Let $\mathcal{B}\subset\mathcal{Q}$ be such that $\left\|\sqrt{n}\left(\hat P_n - P\right) - G\right\|_{\mathcal{H}}\to 0$ with $G\in\mathcal{B}$. Assume $T:\mathcal{P}\to\mathbb{R}$ is Hadamard differentiable at $P$ tangentially to $\mathcal{B}$ with influence function $T^{(1)}(\cdot;P)$, and $dT_P$ is defined and continuous on the whole $\mathcal{Q}$. Then,
$$\lim_{n\to\infty}\mathbb{P}\left(T(P)\in\left\{T(P')\,:\,D_f(P'\|\hat P_n)\le\frac{\xi}{n}\right\}\right) = \mathbb{P}\left(\chi^2_1\le\xi\right).$$
Denoting $T(P) = \max_{\tau\ge 0}\min_{\beta\in\mathbb{R}^p}\mathbb{E}_P[\ell(x;\tau,\beta)]$, which is convex-concave, our proof of Theorem 2 mainly consists of checking the conditions required by Theorem 8: i) the Lipschitz continuity of the functions in $\mathcal{H}$, and ii) the Hadamard differentiability of $T(P)$. We first specify the regularity assumption for the stationary distribution ratio:

Assumption 2 (Stationary ratio regularity)
The target stationary state-action correction ratio is bounded, $\|\tau^*\|_\infty\le C_\tau<\infty$, and $\tau^*\in\mathcal{F}_\tau$, where $\mathcal{F}_\tau$ is a convex, compact, and bounded RKHS with bounded kernel function $\|k((\cdot,\cdot),(s,a))\|_{\mathcal{F}_\tau}\le K$.

The bounded-ratio part of Assumption 2 is a standard assumption used in Nachum et al. (2019a), Zhang et al. (2020a), and Uehara et al. (2019). The latter part regarding $\mathcal{F}_\tau$ is required for the existence of solutions. In fact, the RKHS assumption on $\mathcal{F}_\tau$ is already quite flexible, and it includes deep neural networks by adopting neural tangent kernels (Arora et al., 2019). With Assumption 2, we immediately obtain
$$T(P) = \max_{\tau\in\mathcal{F}_\tau}\min_{\beta\in\mathbb{R}^p}\mathbb{E}_P[\ell(x;\tau,\beta)] = \min_{\beta\in\mathbb{R}^p}\max_{\tau\in\mathcal{F}_\tau}\mathbb{E}_P[\ell(x;\tau,\beta)]$$
by the minimax theorem (Ekeland and Temam, 1999, Proposition 2.1). By this equivalence, we will focus on the min-max form. Since $r\in[0,R_{\max}]$, one has $Q^\pi\le R_{\max}/(1-\gamma)$ for every $\pi$. Therefore, it is reasonable to assume the following regularity conditions on $\phi$:

Assumption 3 (Embedding feature regularity)
There exist finite constants $C_\beta$ and $C_\phi$ such that $\|\beta\|_2\le C_\beta$ and $\|\phi\|_2\le C_\phi$. Moreover, $\phi(s,a)$ is $L_\phi$-Lipschitz continuous.

This assumption implies $\|\beta^\top\phi\|_\infty\le\|\beta\|_2\|\phi\|_2\le C_\beta C_\phi$ and the Lipschitz continuity of $\beta^\top\phi(s,a)$. We define $\mathcal{F}_\beta := \{\beta \mid \|\beta\|_2\le C_\beta\}$.

Lemma 9 (Lipschitz continuity)
Under Assumptions 2 and 3, the function $\ell$ satisfies $\|\ell(x;\tau,\beta)\|_\infty\le M$ and is $C_\ell$-Lipschitz in $(\tau,\beta)$, for the finite constants $M$ and $C_\ell$ given in the proof.

Proof
We first show the boundedness claim. By the definition of $\ell(x;\tau,\beta)$, one has
$$\|\ell(x;\tau,\beta)\|_\infty \le (1-\gamma)\left\|\beta^\top\phi\right\|_\infty + \left\|\tau(s,a)\left(r(s,a) + \gamma\beta^\top\phi(s',a') - \beta^\top\phi(s,a)\right)\right\|_\infty$$
$$\le (1-\gamma)\left\|\beta^\top\phi\right\|_\infty + \|\tau(s,a)\|_\infty\left\|r(s,a) + \gamma\beta^\top\phi(s',a') - \beta^\top\phi(s,a)\right\|_\infty$$
$$\le (1-\gamma)C_\beta C_\phi + C_\tau\left(R_{\max} + (1+\gamma)C_\beta C_\phi\right) := M.$$

(Note: that $f(1) = 0$ is required in the definition of an $f$-divergence. If $f'(1)\ne 0$, one can "lift" it by $\bar f(t) = f(t) - f'(1)(t-1)$ so that the new function satisfies $\bar f'(1) = 0$. $f''(1) = 2$ is assumed for easier calculation without loss of generality, as discussed in Duchi et al. (2016). For example, one can use $f(t) = 2t\log t - 2(t-1)$ for the modified KL-divergence, $f(t) = (t-1)^2$ for the χ²-divergence, and $f(t) = -2\log t + 2(t-1)$ for the reverse KL-divergence.)
We equip $\mathcal{F}_\tau\times\mathcal{F}_\beta$ with the norm
$$\|(\tau,\beta)\| := \|\tau\|_{\mathcal{F}_\tau} + \|\beta\|_2. \quad (42)$$
Then, we show the Lipschitz continuity of $\ell(x;\tau,\beta)$ in $(\tau,\beta)$:
$$|\ell(x;\tau_1,\beta_1) - \ell(x;\tau_2,\beta_2)| \le (1-\gamma)\left|\phi(s_0,a_0)^\top(\beta_1-\beta_2)\right| + \left|\tau_1(s,a)(\beta_1-\beta_2)^\top(\gamma\phi(s',a') - \phi(s,a))\right|$$
$$\qquad + \left|(\tau_1(s,a) - \tau_2(s,a))\left(r(s,a) + \gamma\beta_2^\top\phi(s',a') - \beta_2^\top\phi(s,a)\right)\right|$$
$$\le \left((1-\gamma)C_\phi + (1+\gamma)C_\tau C_\phi\right)\|\beta_1-\beta_2\|_2 + \left(R_{\max} + (1+\gamma)C_\phi C_\beta\right)|\tau_1(s,a) - \tau_2(s,a)|$$
$$\le \left((1-\gamma)C_\phi + (1+\gamma)C_\tau C_\phi\right)\|\beta_1-\beta_2\|_2 + \left(R_{\max} + (1+\gamma)C_\phi C_\beta\right)K\|\tau_1-\tau_2\|_{\mathcal{F}_\tau}$$
$$\le C_\ell\left(\|\beta_1-\beta_2\|_2 + \|\tau_1-\tau_2\|_{\mathcal{F}_\tau}\right),$$
which implies that $\ell(x;\tau,\beta)$ is $C_\ell$-Lipschitz continuous with $C_\ell := \max\{(1-\gamma)C_\phi + (1+\gamma)C_\tau C_\phi,\ (R_{\max} + (1+\gamma)C_\phi C_\beta)K\}$.

We now check the Hadamard directional differentiability of $T(P)$. The following proof largely follows Duchi et al. (2016) and Römisch (2014).

Lemma 10 (Hadamard Differentiability)
Under Assumptions 2 and 3, the functional $T(P) = \min_{\beta\in\mathcal{F}_\beta}\max_{\tau\in\mathcal{F}_\tau}\mathbb{E}_P[\ell(x;\tau,\beta)]$ is Hadamard directionally differentiable on $\mathcal{P}$ tangentially to $\mathcal{B}(\mathcal{H},P)\subset L^\infty(\mathcal{H})$, with derivative
$$dT_P(H) := \int\ell(x;\tau^*,\beta^*)\,dH(x),$$
where $(\beta^*,\tau^*) = \arg\min_{\beta\in\mathcal{F}_\beta}\arg\max_{\tau\in\mathcal{F}_\tau}\mathbb{E}_P[\ell(x;\tau,\beta)]$.

Proof
For convenience, we define
$$\tilde H(\tau,\beta) := \int\ell(x;\tau,\beta)\,dH(x),$$
where $H$ is associated with a measure in $\mathcal{Q}$. We first show the upper bound on the directional difference quotient. For $H_n\in\mathcal{B}(\mathcal{H},P)$ with $\|H_n - H\|_{\mathcal{H}}\to 0$ and any sequence $t_n\to 0$, we have
$$T(P + t_nH_n) - T(P) = \min_{\beta\in\mathcal{F}_\beta}\max_{\tau\in\mathcal{F}_\tau}\left(\mathbb{E}_P[\ell(x;\tau,\beta)] + t_n\tilde H_n(\tau,\beta)\right) - \min_{\beta\in\mathcal{F}_\beta}\max_{\tau\in\mathcal{F}_\tau}\mathbb{E}_P[\ell(x;\tau,\beta)]$$
$$\le \max_{\tau\in\mathcal{F}_\tau}\left(\mathbb{E}_P[\ell(x;\tau,\beta^*)] + t_n\tilde H_n(\tau,\beta^*)\right) - \max_{\tau\in\mathcal{F}_\tau}\mathbb{E}_P[\ell(x;\tau,\beta^*)] \le \max_{\tau\in\mathcal{F}_\tau} t_n\tilde H_n(\tau,\beta^*).$$
Denote $\tau^*_n = \arg\max_{\tau\in\mathcal{F}_\tau}\tilde H_n(\tau,\beta^*)$. By definition, we have
$$\max_{\tau\in\mathcal{F}_\tau}\tilde H_n(\tau,\beta^*) - \max_{\tau\in\mathcal{F}_\tau}\tilde H(\tau,\beta^*) \le \tilde H_n(\tau^*_n,\beta^*) - \tilde H(\tau^*_n,\beta^*) \le \left\|\tilde H_n - \tilde H\right\|_{\mathcal{H}}\to 0.$$
Therefore, we obtain
$$\limsup_n\ \frac{1}{t_n}\left(T(P + t_nH_n) - T(P)\right) \le \tilde H(\tau^*,\beta^*).$$
For the matching lower bound, note that
$$T(P + t_nH_n) = \min_{\beta\in\mathcal{F}_\beta}\left\{\max_{\tau\in\mathcal{F}_\tau}\left(\mathbb{E}_P[\ell(x;\tau,\beta)] + t_n\tilde H_n(\tau,\beta)\right)\right\}$$
$$= \min_{\beta\in\mathcal{F}_\beta}\left\{\mathbb{E}_P[\ell(x;\tau_n(\beta),\beta)] + t_n\left(\tilde H_n(\tau_n(\beta),\beta) - \tilde H(\tau_n(\beta),\beta)\right) + t_n\tilde H(\tau_n(\beta),\beta)\right\}$$
$$\le \min_{\beta\in\mathcal{F}_\beta}\left\{\mathbb{E}_P[\ell(x;\tau_n(\beta),\beta)] + t_n\left\|\tilde H_n - \tilde H\right\|_{\mathcal{H}} + t_n\left\|\tilde H\right\|_{\mathcal{H}}\right\} \le \min_{\beta\in\mathcal{F}_\beta}\mathbb{E}_P[\ell(x;\tau_n(\beta),\beta)] + O(1)\cdot t_n,$$
where $\tau_n(\beta) = \arg\max_{\tau\in\mathcal{F}_\tau}\left(\mathbb{E}_P[\ell(x;\tau,\beta)] + t_n\tilde H_n(\tau,\beta)\right)$. Denote the set of $\epsilon$-optimal solutions w.r.t. $P$ as
$$S_P(\epsilon) := \left\{\beta\in\mathcal{F}_\beta : \max_{\tau\in\mathcal{F}_\tau}\mathbb{E}_P[\ell(x;\tau,\beta)] \le \min_{\beta\in\mathcal{F}_\beta}\max_{\tau\in\mathcal{F}_\tau}\mathbb{E}_P[\ell(x;\tau,\beta)] + \epsilon\right\}.$$
Then, $\beta^*_n\in S_{P+t_nH_n}(0)$ implies $\beta^*_n\in S_P(ct_n)$, which in turn implies that the sequence $\beta^*_n$ has a subsequence $\tilde\beta^*_m$ converging to $\beta^*\in S_P(0)$. It is straightforward to check the Lipschitz continuity of $\bar\ell(\beta) := \max_{\tau\in\mathcal{F}_\tau}\mathbb{E}[\ell(x;\tau,\beta)]$:
$$|\bar\ell(\beta_1) - \bar\ell(\beta_2)| \le (1-\gamma)\|\beta_1-\beta_2\|_2\,\mathbb{E}_{\mu_0\pi}[\|\phi(s_0,a_0)\|_2] + \left|\max_{\tau\in\mathcal{F}_\tau}\mathbb{E}\left[\tau\cdot r + \beta_1^\top\Delta\right] - \max_{\tau\in\mathcal{F}_\tau}\mathbb{E}\left[\tau\cdot r + \beta_2^\top\Delta\right]\right|$$
$$\le (1-\gamma)\|\beta_1-\beta_2\|_2\,\mathbb{E}_{\mu_0\pi}[\|\phi(s_0,a_0)\|_2] + \max_{\tau\in\mathcal{F}_\tau}\left|\mathbb{E}\left[(\beta_1-\beta_2)^\top\Delta\right]\right| \le \left((1-\gamma)C_\phi + C_\tau(1+\gamma)C_\phi\right)\|\beta_1-\beta_2\|_2.$$
Therefore, with $\tilde\beta^*_m\to\beta^*$, we have $\lim_m\bar\ell(\tilde\beta^*_m) = \min_\beta\bar\ell(\beta) = T(P)$, and, due to optimality, $\bar\ell(\tilde\beta^*_m)\ge\min_\beta\bar\ell(\beta)$ for any $m$. Then,
$$T(P + t_mH_m) - T(P) \ge \max_{\tau\in\mathcal{F}_\tau}\left\{\mathbb{E}_P\left[\ell(x;\tau,\tilde\beta^*_m)\right] + t_m\tilde H_m(\tau,\tilde\beta^*_m)\right\} - \max_{\tau\in\mathcal{F}_\tau}\mathbb{E}_P\left[\ell(x;\tau,\tilde\beta^*_m)\right]$$
$$\ge \mathbb{E}_P\left[\ell\left(x;\tau_m(\tilde\beta^*_m),\tilde\beta^*_m\right)\right] + t_m\tilde H_m\left(\tau_m(\tilde\beta^*_m),\tilde\beta^*_m\right) - \mathbb{E}_P\left[\ell\left(x;\tau_m(\tilde\beta^*_m),\tilde\beta^*_m\right)\right] = t_m\tilde H_m\left(\tau_m(\tilde\beta^*_m),\tilde\beta^*_m\right),$$
where $\tau_m(\tilde\beta^*_m) = \arg\max_{\tau\in\mathcal{F}_\tau}\mathbb{E}_P\left[\ell(x;\tau,\tilde\beta^*_m)\right]$. Since $\tilde\beta^*_m\to\beta^*$, we have $\tau_m(\tilde\beta^*_m)\to\tau^*$, and thus
$$\left|\tilde H_m\left(\tau_m(\tilde\beta^*_m),\tilde\beta^*_m\right) - \tilde H(\tau^*,\beta^*)\right| \le \left|\tilde H_m\left(\tau_m(\tilde\beta^*_m),\tilde\beta^*_m\right) - \tilde H\left(\tau_m(\tilde\beta^*_m),\tilde\beta^*_m\right)\right| + \left|\tilde H\left(\tau_m(\tilde\beta^*_m),\tilde\beta^*_m\right) - \tilde H(\tau^*,\beta^*)\right|$$
$$\le \left\|\tilde H_m - \tilde H\right\|_{\mathcal{H}} + \left|\tilde H\left(\tau_m(\tilde\beta^*_m),\tilde\beta^*_m\right) - \tilde H(\tau^*,\beta^*)\right| \to 0,$$
since $\ell(x;\tau,\beta)$ is Lipschitz continuous. Therefore, we obtain
$$\liminf_n\ \frac{1}{t_n}\left(T(P + t_nH_n) - T(P)\right) \ge \tilde H(\tau^*,\beta^*),$$
which, combined with the matching upper bound above, completes the proof.

Theorem 2 (Asymptotic coverage)
Under Assumptions 1, 2, and 3, if $D$ contains i.i.d. samples and the optimal solution to the Lagrangian of (5) is unique, we have
$$\lim_{n\to\infty}\mathbb{P}\left(\rho^\pi\in C^f_{n,\xi}\right) = \mathbb{P}\left(\chi^2_1\le\xi\right). \quad (43)$$
Therefore, $C^f_{n,\chi^2_{1,1-\alpha}}$ is an asymptotic $(1-\alpha)$-confidence interval for the value of the policy $\pi$.

Proof
The proof verifies that the conditions in Theorem 8 hold. By Lemma 6, we can rewrite
$$\mathbb{P}\left(\rho^\pi\in C^f_{n,\xi}\right) = \mathbb{P}\left(\rho^\pi\in\left\{\hat\rho^\pi(w)\ \middle|\ w\in K_f\right\}\right),$$
where, according to the boundedness assumption on $\beta$ in Assumption 3,
$$\hat\rho^\pi(w) = \max_{\tau\ge 0}\min_{\beta\in\mathcal{F}_\beta}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right] = \min_{\beta\in\mathcal{F}_\beta}\max_{\tau\ge 0}\mathbb{E}_w\left[\tau\cdot r + \beta^\top\Delta(x;\tau,\phi)\right].$$
With Lemma 9 and Lemma 10, the conditions in Theorem 8 are satisfied. We apply Theorem 8 at the unique optimal solution $(\tau^*,\beta^*) = \arg\min_{\beta\in\mathcal{F}_\beta}\arg\max_{\tau\ge 0}\mathbb{E}_P[\ell(x;\tau,\beta)]$. Then $dT_P$ is a linear functional on the space of bounded measures,
$$dT_P(H) = \int\ell(x;\tau^*,\beta^*)\,dH(x),$$
with the canonical gradient given by $T^{(1)}(\cdot;P) = \ell(x;\tau^*,\beta^*) - \mathbb{E}_P[\ell(x;\tau^*,\beta^*)]$.
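In practice, the divergence limit prescribed by Theorem 2 is simply a χ² quantile. A minimal sketch of how $\xi$ is set (illustrative name):

from scipy.stats import chi2

def divergence_limit(alpha):
    """xi = chi^2_{1, 1 - alpha}: the (1 - alpha)-quantile of a chi-squared
    distribution with one degree of freedom, as in Theorem 2."""
    return chi2.ppf(1.0 - alpha, df=1)

print(divergence_limit(0.05))  # ~3.841 for a 95% confidence interval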
E.2 Finite-Sample Correction

The previous section considered the asymptotic coverage of CoinDICE. We now analyze the finite-sample behavior of the estimator, for the special case $f(x) = (x-1)^2$, so that $D_f$ is the χ²-divergence. Consider the optimization problem
$$\max_{w\in\mathbb{R}^n}\ w^\top z,\quad\text{s.t.}\quad D_f(w\|\hat p_n)\le\frac{\xi}{n},\quad w\in\mathcal{P}_{n-1}(\hat p_n). \quad (44)$$
The following result will be needed.

Lemma 11 (Namkoong and Duchi, 2017,
Theorem 1)
Let $Z\in[0,M]$ be a random variable, and let $\sigma^2 = \mathrm{Var}(Z)$ and $s_n^2 = \mathbb{E}_{\hat p_n}[Z^2] - \mathbb{E}_{\hat p_n}[Z]^2$ be the population and sample variances of $Z$, respectively. For $\xi\ge 0$, we have
$$\left[\sqrt{\frac{\xi}{n}s_n^2} - \frac{M\xi}{n}\right]_+ \le \max_w\left\{\mathbb{E}_w[Z]\ \middle|\ D_f(w\|\hat p_n)\le\frac{\xi}{n},\ w\in\mathcal{P}_{n-1}(\hat p_n)\right\} - \mathbb{E}_{\hat p_n}[Z] \le \sqrt{\frac{\xi}{n}s_n^2}.$$
Moreover, for $n\ge 2\xi M^2/\sigma^2$, with probability at least $1 - \exp\left(-\frac{n\sigma^2}{24M^2}\right)$,
$$\max_w\left\{\mathbb{E}_w[Z]\ \middle|\ D_f(w\|\hat p_n)\le\frac{\xi}{n},\ w\in\mathcal{P}_{n-1}(\hat p_n)\right\} = \mathbb{E}_{\hat p_n}[Z] + \sqrt{\frac{\xi}{n}s_n^2}.$$

An analogous variance representation holds for the lower bound of $\mathbb{E}_w[Z]$. For completeness, we give the proof below, which is adapted from Namkoong and Duchi (2017). Recall that the lower bound is obtained by solving the following:
$$\min_{w\in\mathbb{R}^n}\ w^\top z,\quad\text{s.t.}\quad D_f(w\|\hat p_n)\le\frac{\xi}{n},\quad w\in\mathcal{P}_{n-1}(\hat p_n). \quad (45)$$

Lemma 12 (Lower bound variance representation)
Under the same conditions as in Lemma 11, for $\xi\ge 0$, we have
$$\left[\sqrt{\frac{\xi}{n}s_n^2} - \frac{M\xi}{n}\right]_+ \le \mathbb{E}_{\hat p_n}[Z] - \min_w\left\{\mathbb{E}_w[Z]\ \middle|\ D_f(w\|\hat p_n)\le\frac{\xi}{n},\ w\in\mathcal{P}_{n-1}(\hat p_n)\right\} \le \sqrt{\frac{\xi}{n}s_n^2}.$$
Moreover, for $n\ge 2\xi M^2/\sigma^2$, with probability at least $1 - \exp\left(-\frac{n\sigma^2}{24M^2}\right)$,
$$\min_w\left\{\mathbb{E}_w[Z]\ \middle|\ D_f(w\|\hat p_n)\le\frac{\xi}{n},\ w\in\mathcal{P}_{n-1}(\hat p_n)\right\} = \mathbb{E}_{\hat p_n}[Z] - \sqrt{\frac{\xi}{n}s_n^2}.$$

Proof
Denote $u = \frac{1}{n}\mathbf{1} - w$, so that $u^\top\mathbf{1} = 0$, and the optimization (45) can be written as
$$\bar z - \max_u\ u^\top(z - \bar z\mathbf{1}),\quad\text{s.t.}\quad \|u\|_2^2\le\frac{\xi}{n^2},\quad u^\top\mathbf{1} = 0,\quad u\le\frac{1}{n}\mathbf{1}, \quad (46)$$
with $\bar z = \frac{1}{n}\sum_{i=1}^n z_i$. By the Cauchy-Schwarz inequality,
$$u^\top(z - \bar z\mathbf{1}) \le \frac{\sqrt{\xi}}{n}\|z - \bar z\mathbf{1}\|_2 = \sqrt{\frac{\xi}{n}s_n^2},$$
and the equality holds if and only if
$$u_i = \frac{\sqrt{\xi}(z_i - \bar z)}{n\|z - \bar z\mathbf{1}\|_2} = \frac{\sqrt{\xi}(z_i - \bar z)}{n\sqrt{ns_n^2}}.$$
Given the constraint $u\le\frac{1}{n}\mathbf{1}$, to achieve this maximum one needs $\max_i\frac{\sqrt{\xi}(z_i - \bar z)}{n\sqrt{ns_n^2}}\le\frac{1}{n}$. If this condition is satisfied, we have
$$\mathbb{E}_{\hat p_n}[Z] - \min_w\left\{\mathbb{E}_w[Z]\ \middle|\ D_f(w\|\hat p_n)\le\frac{\xi}{n}\right\} = \sqrt{\frac{\xi}{n}s_n^2}.$$
Since $z\in[0,M]$, we have $|z_i - \bar z|\le M$; to ensure the condition, we need $\frac{\xi M^2}{ns_n^2}\le 1 \Leftrightarrow s_n^2\ge\frac{\xi M^2}{n}$. Otherwise, suppose $s_n^2 < \frac{\xi M^2}{n}$, or equivalently $\sqrt{\frac{\xi}{n}s_n^2} < \frac{\xi M}{n}$; then
$$\min_w w^\top z \le \mathbb{E}_{\hat p_n}[z] - \left[\sqrt{\frac{\xi}{n}s_n^2} - \frac{M\xi}{n}\right]_+$$
holds trivially. For the high-probability statement, when $n\ge 2\xi M^2/\sigma^2$ and the event $s_n\ge\sigma/\sqrt{2}$ holds, we have $s_n^2\ge\frac{\xi M^2}{n}$. Following Maurer and Pontil (2009, Theorem 10), one can bound
$$\mathbb{P}(|s_n - \sigma|\ge a)\le\exp\left(-\frac{na^2}{2M^2}\right).$$
Setting $a = \left(1 - \frac{1}{\sqrt{2}}\right)\sigma$ completes the proof.

With Lemma 11 and Lemma 12, we can represent the confidence bounds via the sample variance. We then resort to an empirical Bernstein bound applied to a function space $\mathcal{F}$ of bounded functions $h:\mathcal{X}\to[0,M]$, using the empirical $\ell_\infty$-covering number $N_\infty(\mathcal{F},\epsilon,n)$.
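The variance representation is easy to check numerically: solving the χ²-constrained problem (44) with a generic solver should match $\hat{\mathbb{E}}[Z] + \sqrt{\xi s_n^2/n}$ whenever $s_n^2\ge\xi M^2/n$. A minimal sketch (illustrative, using scipy's SLSQP):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, xi = 200, 3.84
z = rng.uniform(size=n)
p_hat = np.full(n, 1.0 / n)

# chi^2-divergence: D_f(w || 1/n) = sum_i (1/n)(n w_i - 1)^2 = n * ||w - 1/n||^2.
cons = [{'type': 'eq', 'fun': lambda w: w.sum() - 1.0},
        {'type': 'ineq', 'fun': lambda w: xi / n - n * np.sum((w - p_hat) ** 2)}]
res = minimize(lambda w: -w @ z, x0=p_hat, bounds=[(0, 1)] * n,
               constraints=cons, method='SLSQP')

s2 = z.var()
print(-res.fun, z.mean() + np.sqrt(xi * s2 / n))  # the two should nearly coincide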
Lemma 13 (Maurer and Pontil, 2009, Theorem 6)
Let $n\ge Mt$ and $t\ge\log 12$. Then, with probability at least $1 - 12N_\infty(\mathcal{F},\epsilon,n)e^{-t}$, for any $h\in\mathcal{F}$,
$$\mathbb{E}[h] - \mathbb{E}_{\hat p_n}[h] \le 3\sqrt{\frac{\mathrm{Var}_{\hat p_n}(h)\,t}{n}} + \frac{15Mt}{n} + 2\left(1 + \sqrt{\frac{t}{n}}\right)\epsilon.$$

Theorem 4 (Finite-sample correction)
Denote by $N_\infty(\mathcal{F}_\tau,\epsilon,n)$ and $N_\infty(\mathcal{F}_\beta,\epsilon,n)$ the $\ell_\infty$-covering numbers of $\mathcal{F}_\tau$ and $\mathcal{F}_\beta$ with $\epsilon$-balls on $n$ empirical samples, respectively, and let $D_f$ be the χ²-divergence. Under Assumptions 2 and 3, let $M := (1-\gamma)C_\beta C_\phi + C_\tau(R_{\max} + (1+\gamma)C_\beta C_\phi)$ and $C_\ell := \max\{(1-\gamma)C_\phi + (1+\gamma)C_\tau C_\phi,\ (R_{\max} + (1+\gamma)C_\phi C_\beta)K\}$ as in Lemma 9. Then, we have
$$\mathbb{P}\left(\rho^\pi\in[l_n - \zeta_n,\ u_n + \zeta_n]\right) \ge 1 - 24N_\infty(\mathcal{F}_\tau,\epsilon,n)N_\infty(\mathcal{F}_\beta,\epsilon,n)e^{-\xi},$$
where $(l_n,u_n)$ are the solutions to (11), $\zeta_n = \frac{16M\xi}{n} + 2\left(1+\sqrt{\frac{\xi}{n}}\right)C_\ell\epsilon$, and $\xi = \chi^2_{1,1-\alpha}$. When the VC-dimensions of $\mathcal{F}_\tau$ and $\mathcal{F}_\beta$ (denoted by $d_{\mathcal{F}_\tau}$ and $d_{\mathcal{F}_\beta}$, respectively) are finite, we have
$$\mathbb{P}\left(\rho^\pi\in[l_n - \kappa_n,\ u_n + \kappa_n]\right) \ge 1 - 12\exp\left(c + 2\left(d_{\mathcal{F}_\tau} + d_{\mathcal{F}_\beta} - 2\right)\log n - \xi\right),$$
where $c = \log 2 + 2\log c_0 + \log d_{\mathcal{F}_\tau} + \log d_{\mathcal{F}_\beta} + \left(d_{\mathcal{F}_\tau} + d_{\mathcal{F}_\beta} - 2\right)$, with $c_0$ the universal constant in the covering number bound of van der Vaart and Wellner (1996, Theorem 2.6.7), and $\kappa_n = \frac{16M\xi}{n} + \frac{2C_\ell M}{n}\left(1+\sqrt{\frac{\xi}{n}}\right)$.

Proof
We focus on the upper bound; the lower bound can be handled in a similar way. Define
$$(\tau^*,\beta^*) := \arg\max_{\tau\in\mathcal{F}_\tau}\arg\min_\beta\mathbb{E}_{d^D}[\ell(x;\tau,\beta)],\qquad \left(\hat w^*,\hat\tau^*,\hat\beta^*\right) := \arg\max_w\arg\max_{\tau\in\mathcal{F}_\tau}\arg\min_\beta\mathbb{E}_w[\ell(x;\tau,\beta)].$$
By definition and the optimality of $\beta^*$, we have
$$\rho^\pi = \mathbb{E}_{d^D}[\ell(x;\tau^*,\beta^*)] \le \mathbb{E}_{d^D}\left[\ell(x;\tau^*,\hat\beta^*)\right]. \quad (47)$$
Applying Lemma 13 and the Lipschitz continuity of $\ell(x;\tau,\beta)$ on $\mathcal{F}_\tau\times\mathcal{F}_\beta$, with probability at least $1 - 12N_\infty(\mathcal{F}_\tau,\epsilon,n)N_\infty(\mathcal{F}_\beta,\epsilon,n)e^{-t}$, we have
$$\mathbb{E}_{d^D}\left[\ell(x;\tau^*,\hat\beta^*)\right] \le \mathbb{E}_{\hat p_n}\left[\ell(x;\tau^*,\hat\beta^*)\right] + 3\sqrt{\frac{\mathrm{Var}_{\hat p_n}\left(\ell(x;\tau^*,\hat\beta^*)\right)t}{n}} + \frac{15Mt}{n} + 2\left(1+\sqrt{\frac{t}{n}}\right)C_\ell\epsilon$$
$$\le \max_{D_f(w\|\hat p_n)\le\frac{\xi}{n}}\mathbb{E}_w\left[\ell(x;\tau^*,\hat\beta^*)\right] - \left[\sqrt{\frac{\xi\,\mathrm{Var}_{\hat p_n}\left(\ell(x;\tau^*,\hat\beta^*)\right)}{n}} - \frac{M\xi}{n}\right]_+ + 3\sqrt{\frac{\mathrm{Var}_{\hat p_n}\left(\ell(x;\tau^*,\hat\beta^*)\right)t}{n}} + \frac{15Mt}{n} + 2\left(1+\sqrt{\frac{t}{n}}\right)C_\ell\epsilon$$
$$\le \max_{D_f(w\|\hat p_n)\le\frac{\xi}{n}}\max_{\tau\in\mathcal{F}_\tau}\min_{\beta\in\mathcal{F}_\beta}\mathbb{E}_w[\ell(x;\tau,\beta)] + \frac{16M\xi}{n} + 2\left(1+\sqrt{\frac{\xi}{n}}\right)C_\ell\epsilon,$$
where the second inequality comes from Lemma 11, and the third from setting $t\le\xi$ and the definition of $\hat\beta^*$. Combining this with (47), we conclude that with probability at least $1 - 12N_\infty(\mathcal{F}_\tau,\epsilon,n)N_\infty(\mathcal{F}_\beta,\epsilon,n)e^{-\xi}$,
$$\rho^\pi \le \max_{D_f(w\|\hat p_n)\le\frac{\xi}{n}}\max_{\tau\in\mathcal{F}_\tau}\min_{\beta\in\mathcal{F}_\beta}\mathbb{E}_w[\ell(x;\tau,\beta)] + \frac{16M\xi}{n} + 2\left(1+\sqrt{\frac{\xi}{n}}\right)C_\ell\epsilon.$$
With the same strategy, based on Lemma 12 and Lemma 13, we can also bound the finite-sample lower-bound correction: with probability at least $1 - 12N_\infty(\mathcal{F}_\tau,\epsilon,n)N_\infty(\mathcal{F}_\beta,\epsilon,n)e^{-\xi}$,
$$\rho^\pi \ge \min_{D_f(w\|\hat p_n)\le\frac{\xi}{n}}\max_{\tau\in\mathcal{F}_\tau}\min_{\beta\in\mathcal{F}_\beta}\mathbb{E}_w[\ell(x;\tau,\beta)] - \frac{16M\xi}{n} - 2\left(1+\sqrt{\frac{\xi}{n}}\right)C_\ell\epsilon.$$
The first part of the theorem then follows from a union bound. For the second part, by van der Vaart and Wellner (1996, Theorem 2.6.7), one can bound
$$N_\infty(\mathcal{F},\epsilon,n) \le c_0\,\mathrm{VC}(\mathcal{F})\left(\frac{Mne}{\epsilon}\right)^{\mathrm{VC}(\mathcal{F})-1}$$
for a universal constant $c_0$. We set $\epsilon = \frac{M}{n}$ and denote $d_{\mathcal{F}} = \mathrm{VC}(\mathcal{F})$. Plugging this into the bound, we obtain
$$\mathbb{P}\left(\rho^\pi\in[l_n - \kappa_n,\ u_n + \kappa_n]\right) \ge 1 - 12\exp\left(c + 2\left(d_{\mathcal{F}_\tau} + d_{\mathcal{F}_\beta} - 2\right)\log n - \xi\right),$$
where $c$ and $\kappa_n$ are as given in the theorem statement.

F Implementing Principles of Optimism and Pessimism
Based on the discussion in Section 5, the optimism and pessimism principles can be implemented by maximizing $u_D(\pi)$ and $l_D(\pi)$, respectively. In this section, we first calculate the gradients $\nabla_\pi u_D(\pi)$ and $\nabla_\pi l_D(\pi)$, and then elaborate on the algorithmic details. Since we optimize the policy $\pi$, we modify the confidence interval estimator in CoinDICE slightly, so that $\pi$ appears as an explicitly parameterized distribution. Concretely, we consider samples $x := (s_0,s,a,r,s')$ with $s_0\sim\mu_0(s)$ and $(s,a,r,s')\sim d^D$, which leads to the corresponding upper and lower bounds with
$$\tilde\ell(x;\tau,\beta,\pi) := \tau\cdot r + \beta^\top\tilde\Delta(x;\tau,\phi,\pi),$$
$$\tilde\Delta(x;\tau,\phi,\pi) = (1-\gamma)\mathbb{E}_{\pi(a_0|s_0)}[\phi(s_0,a_0)] + \gamma\mathbb{E}_{\pi(a'|s')}[\phi(s',a')\tau(s,a)] - \phi(s,a)\tau(s,a).$$

Theorem 14
Given the optimal $(\beta^*_l,\tau^*_l,w^*_l)$ and $(\beta^*_u,\tau^*_u,w^*_u)$ for the lower and upper bounds, respectively, the gradients of $l_D(\pi)$ and $u_D(\pi)$ can be computed as
$$\nabla_\pi l_D(\pi) = \mathbb{E}_{w^*_l}\left[(1-\gamma)\mathbb{E}_{a_0\sim\pi}\left[\nabla_\pi\log\pi(a_0|s_0)\,\beta_l^{*\top}\phi(s_0,a_0)\right] + \gamma\mathbb{E}_{a'\sim\pi(a'|s')}\left[\tau^*_l(s,a)\nabla_\pi\log\pi(a'|s')\,\beta_l^{*\top}\phi(s',a')\right]\right],$$
$$\nabla_\pi u_D(\pi) = \mathbb{E}_{w^*_u}\left[(1-\gamma)\mathbb{E}_{a_0\sim\pi}\left[\nabla_\pi\log\pi(a_0|s_0)\,\beta_u^{*\top}\phi(s_0,a_0)\right] + \gamma\mathbb{E}_{a'\sim\pi(a'|s')}\left[\tau^*_u(s,a)\nabla_\pi\log\pi(a'|s')\,\beta_u^{*\top}\phi(s',a')\right]\right]. \quad (48)$$

Proof
We focus on the computation of $\nabla_\pi u_D(\pi)$ with the optimal $(\beta^*_u,\tau^*_u,w^*_u)$:
$$\nabla_\pi u_D(\pi) = \mathbb{E}_{w^*_u}\left[\nabla_\pi\tilde\ell(x;\tau,\beta)\right] = (1-\gamma)\mathbb{E}_{w^*_u}\left[\nabla_\pi\mathbb{E}_{a_0\sim\pi}\left[\beta_u^{*\top}\phi(s_0,a_0)\right]\right] + \gamma\mathbb{E}_{w^*_u}\left[\tau^*_u(s,a)\nabla_\pi\mathbb{E}_{a'\sim\pi(a'|s')}\left[\beta_u^{*\top}\phi(s',a')\right]\right]$$
$$= (1-\gamma)\mathbb{E}_{w^*_u}\mathbb{E}_{a_0\sim\pi}\left[\nabla_\pi\log\pi(a_0|s_0)\,\beta_u^{*\top}\phi(s_0,a_0)\right] \quad (49)$$
$$\quad + \gamma\mathbb{E}_{w^*_u}\mathbb{E}_{a'\sim\pi(a'|s')}\left[\tau^*_u(s,a)\nabla_\pi\log\pi(a'|s')\,\beta_u^{*\top}\phi(s',a')\right]. \quad (50)$$
The case for the lower bound is obtained similarly:
$$\nabla_\pi l_D(\pi) = (1-\gamma)\mathbb{E}_{w^*_l}\mathbb{E}_{a_0\sim\pi}\left[\nabla_\pi\log\pi(a_0|s_0)\,\beta_l^{*\top}\phi(s_0,a_0)\right] \quad (51)$$
$$\quad + \gamma\mathbb{E}_{w^*_l}\mathbb{E}_{a'\sim\pi(a'|s')}\left[\tau^*_l(s,a)\nabla_\pi\log\pi(a'|s')\,\beta_l^{*\top}\phi(s',a')\right].$$
Now, we are ready to apply the policy gradient to $u_D(\pi)$ or $l_D(\pi)$ to implement optimism for exploration or pessimism for safe policy improvement, respectively. We illustrate the details in Algorithm 2.
Algorithm 2 CoinDICE-OPT: implementation of the optimism/pessimism principle.
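A minimal sketch of the score-function gradient estimate (48) for a tabular softmax policy, given fixed optimal $(w^*,\tau^*,\beta^*)$ arrays; the softmax parameterization and all names are illustrative assumptions rather than the paper's released implementation.

import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grad_upper_bound(logits, w, tau, beta, phi, s0, a0, s, a, s1, a1, gamma):
    """Monte-Carlo estimate of (48) w.r.t. softmax policy logits.

    logits: (nS, nA); w: (n,) optimal weights; tau: (nS, nA); beta: (p,);
    phi: (nS, nA, p) features; s0, a0, s, a, s1, a1: (n,) sample indices,
    with a0 ~ pi(.|s0) and a1 ~ pi(.|s1) drawn from the current policy."""
    pi = softmax(logits)
    grad = np.zeros_like(logits)
    for j in range(len(s0)):
        # grad of log pi(a|s) w.r.t. logits[s, :] for a softmax: one-hot(a) - pi(.|s)
        g0 = -pi[s0[j]]; g0[a0[j]] += 1.0
        g1 = -pi[s1[j]]; g1[a1[j]] += 1.0
        grad[s0[j]] += w[j] * (1 - gamma) * (phi[s0[j], a0[j]] @ beta) * g0
        grad[s1[j]] += w[j] * gamma * tau[s[j], a[j]] * (phi[s1[j], a1[j]] @ beta) * g1
    return grad

Ascending this gradient implements the optimism principle; descending along the analogous estimate of (51) implements pessimism.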