Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature
Kefan Dong
Stanford University
[email protected]

Jiaqi Yang
Tsinghua University
[email protected]

Tengyu Ma
Stanford University
[email protected]
February 9, 2021
Abstract
This paper studies model-based bandit and reinforcement learning (RL) with nonlinear function approximations. We propose to study convergence to approximate local maxima, because we show that global convergence is statistically intractable even for a one-layer neural net bandit with a deterministic reward. For both nonlinear bandit and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOL), which provably converges to a local maximum with sample complexity that only depends on the sequential Rademacher complexity of the model class. Our results imply novel global or local regret bounds on several concrete settings such as linear bandit with finite or sparse model class, and two-layer neural net bandit. A key algorithmic insight is that optimism may lead to over-exploration even for a two-layer neural net model class. On the other hand, for convergence to local maxima, it suffices to maximize the virtual return if the model can also reasonably predict the size of the gradient and Hessian of the real return.
1 Introduction

Recent progress has demonstrated many successful applications of deep reinforcement learning (RL) in robotics [Levine et al., 2016], games [Berner et al., 2019, Silver et al., 2017], computational biology [Mahmud et al., 2018], etc. However, the theoretical understanding of deep RL algorithms is limited. The last few years witnessed a plethora of results on linear function approximation in RL [Zanette et al., 2020, Shariff and Szepesvári, 2020, Jin et al., 2020, Wang et al., 2019, 2020a, Du et al., 2019, Agarwal et al., 2020a], but the analysis techniques appear to strongly rely on (approximate) linearity and are hard to generalize to neural networks. (Specifically, Zanette et al. [2020] and Shariff and Szepesvári [2020] rely on closure of the value function class under bootstrapping, and Jin et al. [2020], Wang et al. [2019], Du et al. [2019] use uncertainty quantification for linear regression.)

The goal of this paper is to theoretically analyze model-based nonlinear bandit and RL with neural net approximation, which achieves remarkable sample-efficiency in practice (see, e.g., [Janner et al., 2019, Clavera et al., 2019, Hafner et al., 2019a,b, Dong et al., 2020]). We focus on the realistic setting where the state and action spaces are continuous.

Past theoretical work on model-based RL studies families of dynamics with restricted complexity measures such as Eluder dimension [Osband and Roy, 2014], witness rank [Sun et al., 2019], linear dimensionality [Yang and Wang, 2020], and others [Modi et al., 2020, Kakade et al., 2020]. Implications of these complexity measures have been studied, e.g., for finite mixtures of dynamics [Ayoub et al., 2020]. However, even with known neural net parameters, finding the best parameterized policy still involves optimizing a complex non-concave function, which is in general computationally intractable. More fundamentally, we find that it is also statistically intractable to solve the one-hidden-layer neural net bandit problem (which is a strict sub-case of deep RL): it requires exponentially many (in the input dimension) samples to find the global maximum (see Theorem 5.1). This also shows that the conditions in past work that guarantee global convergence cannot apply to neural nets.

Given these strong impossibility results, we propose to reformulate the problem as finding an approximate local maximum policy with guarantees. This is in the same vein as the recent fruitful paradigm in non-convex optimization where researchers disentangle the problem into showing that all local minima are good and establishing fast convergence to local minima (e.g., see [Ge et al., 2016, 2015, 2017, Ge and Ma, 2020, Lee et al., 2016]). In RL, local maxima can often be global as well in many cases [Agarwal et al., 2020b].

This paper focuses on sample-efficient convergence to an approximate local maximum. We consider the notion of local regret, which is measured against the worst ε-approximate local maximum of the reward function (see Eq. (1)).

Zero-order optimization or policy gradient algorithms can also converge to local maxima and are therefore natural potential competitors. They are widely believed to be less sample-efficient than the model-based approach because the latter can leverage the extrapolation power of the parameterized models.
Theoretically, our formulation aims to characterize this phenomenon with results showing that the model-based approach's sample complexity mostly depends (linearly) on the complexity of the model class, whereas the sample complexity of policy gradient algorithms depends polynomially on the dimensionality of the policy parameters (in RL) or of the actions (in bandit). Our technical goal is to answer the following question:

Can we design algorithms that converge to approximate local maxima with sample complexities that depend only and polynomially on the complexity measure of the dynamics/reward class?

We note that this question is open even if the dynamics hypothesis class is finite and the complexity measure is the logarithm of its size. The question is also open for nonlinear bandit problems (where the dynamics class is replaced by a reward function class), with which we start our study. We first consider nonlinear bandit with deterministic reward, where the reward function is given by η(θ, a) for action a ∈ A under instance θ ∈ Θ. We use sequential Rademacher complexity [Rakhlin et al., 2015a,b] to capture the complexity of the reward function class.

Theorem 1.1 (Informal version of Theorem 3.1). There exists a model-based algorithm (ViOL, Alg. 1) whose local regret, compared to Ω(ε)-approximate local maxima, is bounded by O(√(T·R_T) · poly(1/ε)), where R_T is the sequential Rademacher complexity of a bounded loss function class induced by the reward function class {η(θ, ·) : θ ∈ Θ}.

The sequential Rademacher complexity R_T is often bounded by Õ(R√T) for some parameter R that measures the complexity of the hypothesis class. When this happens, we have O(√(T·R_T)) = Õ(T^{3/4}) = o(T) local regret. We also remark that the all-local-maxima-are-global condition only needs to hold for the ground-truth total expected reward function; this potentially allows disentangled assumptions on the ground-truth instance and the hypothesis class.
In contrast to zero-order optimization, which does not use the parameterization of η and has regret bounds depending on the action dimension, our regret depends only on the complexity of the reward function class. This suggests that our algorithm exploits the extrapolation power of the reward function class. To the best of our knowledge, this is the first action-dimension-free result for both linear and nonlinear bandit problems. More concretely, we instantiate our theorem in the following settings and obtain new results that leverage the model complexity (more in Section 3.1).

1. Linear bandit with finite parameter space Θ. Because η is concave in the action a, our result leads to a standard regret bound of O(T^{3/4}(log|Θ|)^{1/4}). In this case both zero-order optimization and the SquareCB algorithm of Foster and Rakhlin [2020] have regrets that depend on the dimension d_A of the action space.

2. Linear bandit with s-sparse instance parameters. Our algorithm achieves an Õ(T^{3/4} s^{1/4}) regret bound. The regret bound of zero-order optimization depends polynomially on d_A, and so do Eluder-dimension-based bounds, because the Eluder dimension of this class is Ω(d_A). We give the first algorithm for sparse linear bandits whose regret does not depend on the dimension but only on the sparsity, under only two mild assumptions, namely boundedness and continuity of the action set. In contrast, previous results either leverage a rather strong anti-concentration assumption on the action set [Wang et al., 2020b] or have an implicit dimension dependency [Hao et al., 2020, Remark 4.3].

3. Two-layer neural net bandit. The local regret of our algorithm is bounded by Õ(ε^{-2} T^{3/4}). Zero-order optimization can also find a local maximum, but with Ω(d_A) samples. Optimistic algorithms in this case have an exponential sample complexity (see Theorem 5.2).

The results for bandit can be extended to model-based RL with deterministic nonlinear dynamics and deterministic reward. Our algorithm can find an approximately locally maximal stochastic policy (under additional Lipschitz assumptions):

Theorem 1.2 (Informal version of Theorem 4.3). For RL problems with a deterministic dynamics class and a stochastic policy class, the local regret of Alg. 2, compared to Ω(ε)-approximate local maxima, is bounded by O(√(T·R_T) · poly(1/ε)), where R_T is the sequential Rademacher complexity of the ℓ₂ losses of the dynamics class.

To the best of our knowledge, this is the first model-based RL algorithm with a provable finite sample complexity guarantee (for local convergence) for general nonlinear dynamics. The work of Luo et al. [2019] is the closest prior work that also shows local convergence, but its conditions likely cannot be satisfied by any parameterized models (including linear models). As discussed, other prior works on model-based RL do not apply to one-hidden-layer neural nets because they conclude global convergence, which is not possible for one-hidden-layer neural nets in the worst case.
Optimism vs. Exploring by Model-based Curvature Estimate.
The key algorithmic idea is to avoid exploration based on the optimism-in-the-face-of-uncertainty principle, because we show that optimism over a barely nonlinear model class is already statistically too aggressive, even if the ground-truth model is linear (see Theorem 5.2 in Section 5). Indeed, empirical model-based deep RL research has also not found optimism to be useful, partly because with neural net dynamics, optimism leads to huge virtual returns on the optimistic dynamics [Luo et al., 2019]. The work of Foster and Rakhlin [2020] also proposes algorithms that do not rely on UCB—their exploration strategy either relies on a discrete action space, or leverages the linear structure in the action space and has an action-dimension dependency. In contrast, our algorithms' exploration relies more on the learning of the model (or the model's capability of predicting the curvature of the reward, as discussed below). Consequently, our regret bounds can be action-dimension-free.

Our algorithm is conceptually very simple—it alternates between maximizing the virtual return (over actions or policies) and learning the model parameters with an online learner. The key insight is that, in order to ensure sufficient exploration for converging to local maxima, it suffices for the model to predict the gradient and Hessian of the return reasonably accurately, and then to follow the virtual return. We achieve reasonable curvature prediction by modifying the loss function of the online learner. Because we leverage model extrapolation, the sample complexity of model-based curvature prediction depends on the model complexity, instead of the action dimension as in the zero-order optimization approach for bandit.
2 Preliminaries

In this section, we first introduce our problem setup for nonlinear bandit and reinforcement learning, and then the preliminaries on online learning and sequential Rademacher complexity.

2.1 Nonlinear Bandit
We consider the deterministic nonlinear bandit problem with continuous actions. Let θ ∈ Θ be the parameter that specifies the bandit instance, a ∈ R^{d_A} the action, and η(θ, a) ∈ [0, 1] the reward function. Let θ⋆ denote the unknown ground-truth parameter; throughout the paper, we work under the realizability assumption that θ⋆ ∈ Θ. A bandit algorithm aims to maximize the reward under θ⋆, that is, η(θ⋆, a). Let a⋆ = argmax_a η(θ⋆, a) denote the optimal action (breaking ties arbitrarily). Let ‖H‖_sp be the spectral norm of a symmetric matrix H. We also assume that the reward function, its gradient, and its Hessian are Lipschitz, which are somewhat standard assumptions in the optimization literature (e.g., Johnson and Zhang [2013], Ge et al. [2015]).

Assumption 2.1. We assume that for all θ ∈ Θ, sup_a ‖∇_a η(θ, a)‖ ≤ ζ_g and sup_a ‖∇²_a η(θ, a)‖_sp ≤ ζ_h. Moreover, for every θ ∈ Θ and a_1, a_2 ∈ R^{d_A},

‖∇²_a η(θ, a_1) − ∇²_a η(θ, a_2)‖_sp ≤ ζ‖a_1 − a_2‖.

As a motivation to consider deterministic rewards, we prove in Theorem 5.3 that, for a special case, no algorithm can find a local maximum in fewer than √d_A steps. This result implies that an action-dimension-free regret bound is impossible under reasonably stochastic environments.
Approximate Local Maximum and Local Regret. In this paper, we aim to find a local maximum of the real reward function η(θ⋆, ·). A point x is an (ε_g, ε_h)-approximate local maximum of a twice-differentiable function f(x) if ‖∇f(x)‖ ≤ ε_g and λ_max(∇²f(x)) ≤ ε_h. As argued in Sec. 1 and proved in Sec. 5, because reaching a global maximum is computationally and statistically intractable for nonlinear problems, we only aim to reach a local maximum. We define the "local regret" by comparing against approximate local maxima. Formally, let A_{ε_g,ε_h} be the set of all (ε_g, ε_h)-approximate local maxima of η(θ⋆, ·). The (ε_g, ε_h)-local regret of a sequence of actions a_1, ..., a_T is defined as

REG_{ε_g,ε_h}(T) = Σ_{t=1}^T ( inf_{a ∈ A_{ε_g,ε_h}} η(θ⋆, a) − η(θ⋆, a_t) ).   (1)

Our goal is to achieve an (ε_g, ε_h)-local regret that is sublinear in T and inverse polynomial in ε_g and ε_h. With a sublinear regret (i.e., REG_{ε_g,ε_h}(T) = o(T)), the average performance (1/T) Σ_{t=1}^T η(θ⋆, a_t) converges to that of an approximate local maximum of η(θ⋆, ·).
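As a concrete illustration (not part of the paper's algorithms), the following sketch checks the (ε_g, ε_h)-approximate local-maximum condition by finite differences and evaluates the local regret of Eq. (1) against a given comparator set; the function names and the step size `delta` are illustrative assumptions.

```python
import numpy as np

def is_approx_local_max(f, x, eps_g, eps_h, delta=1e-5):
    """Check ||grad f(x)|| <= eps_g and lambda_max(Hessian f(x)) <= eps_h,
    with the gradient and Hessian formed by central finite differences."""
    d = x.shape[0]
    eye = np.eye(d)
    grad = np.array([(f(x + delta * eye[i]) - f(x - delta * eye[i])) / (2 * delta)
                     for i in range(d)])
    hess = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            hess[i, j] = (f(x + delta * (eye[i] + eye[j]))
                          - f(x + delta * (eye[i] - eye[j]))
                          - f(x - delta * (eye[i] - eye[j]))
                          + f(x - delta * (eye[i] + eye[j]))) / (4 * delta ** 2)
    hess = (hess + hess.T) / 2                      # symmetrize numerical noise
    lam_max = np.max(np.linalg.eigvalsh(hess))
    return np.linalg.norm(grad) <= eps_g and lam_max <= eps_h

def local_regret(f, actions, approx_local_maxima):
    """Eq. (1): sum_t ( inf over approximate local maxima of f  -  f(a_t) )."""
    benchmark = min(f(a) for a in approx_local_maxima)
    return sum(benchmark - f(a) for a in actions)
```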
2.2 Reinforcement Learning

We consider finite-horizon Markov decision processes (MDPs) with deterministic dynamics, defined by a tuple ⟨T, r, H, μ⟩, where the dynamics T maps a state-action pair (s, a) to the next state s′, r : S × A → [0, 1] is the reward function, and H and μ denote the horizon and the distribution of the initial state, respectively. Let S and A be the state and action spaces. Without loss of generality, we make the standard assumption that the state space is disjoint across time steps: there exist disjoint sets S_1, ···, S_H such that S = ∪_{h=1}^H S_h, and for any s_h ∈ S_h and a_h ∈ A, T(s_h, a_h) ∈ S_{h+1}.

In this paper we consider parameterized policies and dynamics. Formally, the policy class is Π = {π_ψ : ψ ∈ Ψ}, and the dynamics class is {T_θ : θ ∈ Θ}. The value function is defined as V^π_T(s_h) ≜ E[Σ_{h'=h}^H r(s_{h'}, a_{h'})], where a_{h'} ∼ π(·|s_{h'}) and s_{h'+1} = T(s_{h'}, a_{h'}). Sharing the notation with the bandit setting, let η(θ, ψ) = E_{s_1∼μ}[V^{π_ψ}_{T_θ}(s_1)] be the expected return of policy π_ψ under dynamics T_θ. We also use ρ^π_T to denote the distribution of state-action pairs when running policy π in dynamics T. For simplicity, we do not distinguish ψ, θ from π_ψ, T_θ when the context is clear; for example, we write V^ψ_θ = V^{π_ψ}_{T_θ}.

The approximate local regret is defined in the same way as in the bandit setting, except that the gradient and Hessian are taken with respect to the policy parameter ψ. We also assume realizability (θ⋆ ∈ Θ) and the regularity assumptions of Assumption 2.1 (with the action a replaced by the policy parameter ψ).

2.3 Online Learning and Sequential Rademacher Complexity

Consider a prediction problem where we aim to learn a function that maps X to Y, parameterized by Θ. Let ℓ((x, y); θ) be a loss function that maps (X × Y) × Θ → R_+. An online learner R aims to solve the prediction task iteratively in the presence of an adversarial nature. At time step t, the following happens:

1. The learner computes a distribution p_t = R({(x_i, y_i)}_{i=1}^{t−1}) over the parameter space Θ.
2. The adversary selects a point x̄_t ∈ X̄ (which may depend on p_t), and a sample ξ_t is generated from some fixed distribution q. Let x_t ≜ (x̄_t, ξ_t). The adversary also picks a label y_t ∈ Y.
3. The data point (x_t, y_t) is revealed to the online learner.

The online learner aims to minimize the expected regret over T rounds of interaction, defined as

REG^{OL}_T ≜ E_{ξ_t∼q, θ_t∼p_t, ∀1≤t≤T}[ Σ_{t=1}^T ℓ((x_t, y_t); θ_t) − inf_{θ∈Θ} Σ_{t=1}^T ℓ((x_t, y_t); θ) ].   (2)

The difference between this formulation and the most standard online learning setup is that the ξ_t part of the input is randomized instead of adversarially chosen (and the learner knows the distribution of ξ_t before making the prediction p_t). It was introduced by Rakhlin et al. [2011], who considered a more general setting where the distribution q in round t can depend on {x_1, ···, x_{t−1}}.

We adopt the notation of Rakhlin et al. [2011, 2015a] to define the (distribution-dependent) sequential Rademacher complexity of the loss function class L = {(x, y) ↦ ℓ((x, y); θ) : θ ∈ Θ}. For any set Z, a Z-valued tree of length T is a set of functions {z_i : {±1}^{i−1} → Z}_{i=1}^T. For a sequence of Rademacher random variables ε = (ε_1, ···, ε_T) and every 1 ≤ t ≤ T, we denote z_t(ε) ≜ z_t(ε_1, ···, ε_{t−1}). For any X̄-valued tree x and any Y-valued tree y, we define the sequential Rademacher complexity as

R_T(L; x, y) ≜ E_{ξ_1,···,ξ_T} E_ε[ sup_{ℓ∈L} Σ_{t=1}^T ε_t ℓ((x_t(ε), ξ_t), y_t(ε)) ].   (3)

We also define R_T(L) = sup_{x,y} R_T(L; x, y), where the supremum is taken over all X̄-valued and Y-valued trees. Rakhlin et al. [2011] proved the existence of an algorithm whose online learning regret satisfies REG^{OL}_T ≤ R_T(L).
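For a finite parameter class, a standard exponential-weights learner already attains the kind of regret guarantee quoted above (regret O(√(T log|Θ|)) for losses in [0, 1]). The sketch below is only a minimal illustration of the interaction protocol with an arbitrary caller-supplied loss ℓ((x, y); θ); it is not the specific learner R assumed in our theorems.

```python
import numpy as np

class ExponentialWeightsLearner:
    """Minimal online learner over a finite parameter class Theta.
    predict() returns a distribution p_t over Theta (step 1 of the protocol);
    update() reweights each parameter by exp(-lr * loss) after seeing (x_t, y_t)."""

    def __init__(self, thetas, loss_fn, lr):
        self.thetas = thetas          # finite list of candidate parameters
        self.loss_fn = loss_fn        # loss_fn(x, y, theta) in [0, 1]
        self.lr = lr
        self.log_weights = np.zeros(len(thetas))

    def predict(self):
        w = np.exp(self.log_weights - self.log_weights.max())
        return w / w.sum()            # p_t over Theta

    def update(self, x, y):
        losses = np.array([self.loss_fn(x, y, th) for th in self.thetas])
        self.log_weights -= self.lr * losses
```

With a learning rate of order √(log|Θ|/T), this learner's regret is O(√(T log|Θ|)), consistent with the sequential Rademacher bound used for the finite-class instantiation in Section 3.1.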
3 Model-based Algorithms for Nonlinear Bandit

We first study model-based algorithms for the nonlinear continuous bandit problem, which is a simplification of model-based reinforcement learning. We use the notation and setup of Section 2.1.

Abstraction of analysis for model-based algorithms.
Typically, a model-based algorithm explicitly maintains an estimated model θ̂_t, and sometimes maintains a distribution, posterior, or confidence region for θ̂_t. We call η(θ⋆, a) the real reward of action a, and η(θ̂_t, a) the virtual reward. Most analyses of model-based algorithms (including UCB and ours) can be abstracted as showing the following two properties:

(i) the virtual reward η(θ̂_t, a_t) is sufficiently high;
(ii) the virtual reward η(θ̂_t, a_t) is close to the real reward η(θ⋆, a_t) in the long run.

One can expect that a proper combination of properties (i) and (ii) shows that the real reward η(θ⋆, a_t) is high in the long run. Before describing our algorithms, we start by inspecting and summarizing the pros and cons of UCB from this viewpoint.

Pros and cons of UCB.
The UCB algorithm chooses an action a_t and an estimated model θ̂_t that maximize the virtual reward η(θ̂_t, a_t) among the models agreeing with the observed data. The pro is that it satisfies property (i) by definition—η(θ̂_t, a_t) is higher than the optimal real reward η(θ⋆, a⋆). The downside is that ensuring (ii) is challenging and often requires a strong complexity measure bound such as Eluder dimension (which is not polynomial even for barely nonlinear models, as shown in Theorem 5.1). The difficulty largely stems from our very limited control over θ̂_t beyond its consistency with the observed data. In order to bound the difference between the real and virtual rewards, we essentially require that any model that agrees with the past history extrapolates accurately to any future data (as quantitatively formulated by the Eluder dimension). Moreover, the difficulty of satisfying property (ii) is fundamentally caused by the over-exploration of UCB: as shown in Theorem 5.2, UCB suffers from bad regret even with a barely nonlinear family of models.
We deviate from UCB by readjusting the priority of the two desiderata. First, we focus on ensuring property (ii) for large model classes by leveraging strong online learners. We use an online learning algorithm to predict θ̂_t with the objective that η(θ̂_t, a_t) matches η(θ⋆, a_t). As a result, the difference between the virtual and real rewards depends on the online learnability, i.e., the sequential Rademacher complexity, of the model class. Sequential Rademacher complexity turns out to be a fundamentally more relaxed complexity measure than Eluder dimension—e.g., two-layer neural networks' sequential Rademacher complexity is polynomial in the parameter norm and dimension, whereas their Eluder dimension is at least exponential in the dimension (even with a constant parameter norm). However, an immediate consequence of using an online-learned θ̂_t is that we lose the optimism/exploration that ensured property (i). (More concretely, the algorithm could get stuck when (1) a_t is optimal for θ̂_t, (2) θ̂_t fits the action a_t (and the history) accurately, but (3) θ̂_t does not fit a⋆, because the online learner never sees a⋆. The passivity of the online learning formulation causes this issue—the online learner is only required to predict well on the points that it has seen and will see, but not on points that it never observes. This limitation, on the other hand, is what allows the more relaxed complexity measure of the model class, that is, sequential Rademacher complexity instead of Eluder dimension.)
Algorithm 1 ViOL: Virtual Ascent with Online Model Learner (for Bandit)
Set parameters κ_1 = 2ζ_g and κ_2 = 640√ζ_h. Let H_0 = ∅; choose a_0 ∈ A arbitrarily.
for t = 1, 2, ··· do
    Run R on H_{t−1} with loss function ℓ (defined in Eq. (6)) and obtain p_t = R(H_{t−1}).
    Let a_t ← argmax_a E_{θ_t∼p_t}[η(θ_t, a)].
    Sample u_t, v_t ∼ N(0, I_{d_A×d_A}) independently; let ξ_t = (u_t, v_t), x̄_t = (a_t, a_{t−1}), and x_t = (x̄_t, ξ_t).
    Compute y_t = [η(θ⋆, a_t), η(θ⋆, a_{t−1}), ⟨∇_a η(θ⋆, a_{t−1}), u_t⟩, ⟨∇²_a η(θ⋆, a_{t−1}) u_t, v_t⟩] ∈ R⁴ by applying a finite number of actions in the real environment, using Eqs. (4) and (5) with infinitesimal α_1 and α_2.
    Update H_t = H_{t−1} ∪ {(x_t, y_t)}.
end for

Our approach realizes property (i) in the sense that the virtual reward improves iteratively whenever the real reward is not yet near a local maximum. This is much weaker than what UCB offers (namely, that the virtual reward is higher than the optimal real reward), but it suffices to show convergence to a local maximum of the real reward function. We achieve this by demanding that the estimated model θ̂_t not only predict the real reward accurately, but also predict the gradient ∇_a η(θ⋆, a) and Hessian ∇²_a η(θ⋆, a) accurately. In other words, we augment the loss function of the online learner so that the estimated model satisfies η(θ̂_t, a_t) ≈ η(θ⋆, a_t), ∇_a η(θ̂_t, a_t) ≈ ∇_a η(θ⋆, a_t), and ∇²_a η(θ̂_t, a_t) ≈ ∇²_a η(θ⋆, a_t) in the long run. This implies that when a_t is not at a local maximum of the real reward function η(θ⋆, ·), it is not at a maximum of the virtual reward η(θ̂_t, ·) either, and hence the virtual reward will improve in the next round if we take the greedy action that maximizes it.
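A schematic rendering of Alg. 1 in Python follows. It is a sketch under the assumption that an online learner, a virtual-reward maximizer, and a real-environment supervision oracle are supplied as callables (none of these names come from the paper); the finite-difference queries of the next paragraph are abstracted into `query_supervision`.

```python
import numpy as np

def viol_bandit(learner, thetas, virtual_argmax, query_supervision, a0, T, d_A, rng):
    """Schematic ViOL loop (Alg. 1).
    learner:           online learner with .predict() -> p_t and .update(x, y)
    virtual_argmax:    (p_t, thetas) -> action maximizing E_{theta ~ p_t}[eta(theta, a)]
    query_supervision: (a_t, a_prev, u_t, v_t) -> the 4-dim label y_t of Alg. 1,
                       obtained from a few real-environment reward queries."""
    a_prev = a0
    history = []
    for t in range(1, T + 1):
        p_t = learner.predict()                         # distribution over Theta
        a_t = virtual_argmax(p_t, thetas)               # virtual-ascent step
        u_t = rng.standard_normal(d_A)                  # random probe directions
        v_t = rng.standard_normal(d_A)
        x_t = (a_t, a_prev, u_t, v_t)
        y_t = query_supervision(a_t, a_prev, u_t, v_t)  # reward + projected grad/Hessian
        learner.update(x_t, y_t)                        # online model learning
        history.append((x_t, y_t))
        a_prev = a_t
    return history
```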
Estimating projections of gradients and Hessians. To guide the online learner to predict ∇_a η(θ⋆, a_t) correctly, we need supervision for it. However, we only observe the reward η(θ⋆, a_t). Leveraging the deterministic-reward property, we use the rewards at a and a + α_1 u to estimate the projection of the gradient onto a random direction u:

⟨∇_a η(θ⋆, a), u⟩ = lim_{α_1→0} (η(θ⋆, a + α_1 u) − η(θ⋆, a)) / α_1.   (4)

It turns out that the number of random projections ⟨∇_a η(θ⋆, a), u⟩ needed to ensure a large virtual gradient does not depend on the dimension, because we only use these projections to estimate the norm of the gradient but not necessarily its exact direction (which may require d_A samples). Similarly, we can estimate the projection of the Hessian onto two random directions u, v ∈ R^{d_A} by

⟨∇²_a η(θ⋆, a) v, u⟩ = lim_{α_2→0} (⟨∇_a η(θ⋆, a + α_2 v), u⟩ − ⟨∇_a η(θ⋆, a), u⟩) / α_2   (5)
  = lim_{α_2→0} lim_{α_1→0} ((η(θ⋆, a + α_1 u + α_2 v) − η(θ⋆, a + α_2 v)) − (η(θ⋆, a + α_1 u) − η(θ⋆, a))) / (α_1 α_2),

where the inner step size should be taken at least an order of magnitude smaller than the outer one because the two limits are taken sequentially.

We create the following prediction task for an online learner: let θ be the parameter, x = (a, a′, u, v) the input, ŷ = [η(θ, a), η(θ, a′), ⟨∇_a η(θ, a′), u⟩, ⟨∇²_a η(θ, a′) u, v⟩] ∈ R⁴ the output, and y = [η(θ⋆, a), η(θ⋆, a′), ⟨∇_a η(θ⋆, a′), u⟩, ⟨∇²_a η(θ⋆, a′) u, v⟩] ∈ R⁴ the supervision, and let the loss function be

ℓ(((a, a′, u, v), y); θ) ≜ ([ŷ]_1 − [y]_1)² + ([ŷ]_2 − [y]_2)² + min(κ_1², ([ŷ]_3 − [y]_3)²) + min(κ_2², ([ŷ]_4 − [y]_4)²).   (6)

Here we use [y]_i to denote the i-th coordinate of y ∈ R⁴, to avoid confusion with y_t (the supervision at time t). Our model-based bandit algorithm is formally stated in Alg. 1, with its regret bound below.
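A small numerical sketch of Eqs. (4)–(6) follows, assuming a black-box reward function and small finite step sizes in place of the limits; the clipping thresholds `kappa1`, `kappa2` are passed in rather than derived, and none of the names come from the paper.

```python
import numpy as np

def grad_proj(eta, a, u, alpha=1e-6):
    # Eq. (4): <grad_a eta(a), u>  ~  (eta(a + alpha * u) - eta(a)) / alpha
    return (eta(a + alpha * u) - eta(a)) / alpha

def hess_proj(eta, a, u, v, alpha_inner=1e-6, alpha_outer=1e-3):
    # Eq. (5): nested finite difference; the inner step size is taken much smaller
    # than the outer one because the two limits are taken sequentially.
    return (grad_proj(eta, a + alpha_outer * v, u, alpha_inner)
            - grad_proj(eta, a, u, alpha_inner)) / alpha_outer

def viol_loss(eta_model, x, y, kappa1, kappa2):
    """Eq. (6): squared errors on the two reward predictions plus clipped squared
    errors on the projected gradient and Hessian predictions."""
    a, a_prev, u, v = x
    y_hat = np.array([
        eta_model(a),
        eta_model(a_prev),
        grad_proj(eta_model, a_prev, u),
        hess_proj(eta_model, a_prev, u, v),
    ])
    err = (y_hat - np.asarray(y, dtype=float)) ** 2
    return err[0] + err[1] + min(kappa1 ** 2, err[2]) + min(kappa2 ** 2, err[3])
```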
Theorem 3.1. Let R_T be the sequential Rademacher complexity of the family of losses defined in Eq. (6), and let C = 2 + ζ_g/ζ_h. Under Assumption 2.1, for any ε ≤ min(1, ζ_h²/ζ), the (ε, √(ζε))-local regret of Alg. 1 is bounded from above by

E[ REG_{ε,√(ζε)}(T) ] ≤ (C √(T·R_T)) · max(ζ_h ε^{-2}, √ζ ε^{-3/2}).   (7)

Note that when the sequential Rademacher complexity R_T is bounded by Õ(R√T) (which is typical), we have O(√(T·R_T)) = Õ(T^{3/4}) = o(T) regret. As a result, Alg. 1 achieves a poly(1/ε) sample complexity by the sample complexity–regret reduction [Jin et al., 2018, Section 3.1].

3.1 Instantiations of Theorem 3.1

We sketch some instantiations of our main theorem, whose proofs are deferred to Appendix B.4.
Linear bandit with finite model class.
Consider the problem with action set A = {a ∈ R^d : ‖a‖_2 ≤ 1} and finite model class Θ ⊂ {θ ∈ R^d : ‖θ‖_2 = 1}. Suppose the reward is linear, that is, η(θ, a) = ⟨θ, a⟩. We deal with the constrained action set by using a surrogate reward η̃(θ, a) ≜ ⟨θ, a⟩ − ‖a‖² and applying Theorem 3.1 with reward η̃. We claim that the (global) regret is bounded by O(T^{3/4}(log|Θ|)^{1/4}). Note that this regret bound is independent of the dimension d, whereas, by contrast, the SquareCB algorithm of Foster and Rakhlin [2020] depends polynomially on d (see Theorem 7 of Foster and Rakhlin [2020]), and the zero-order optimization approach [Duchi et al., 2015] in this case also gives a poly(d) regret bound. This and the examples below demonstrate that our results fully leverage the low-complexity model class to eliminate the dependency on the action dimension.

A full proof of this claim needs a few steps: (i) realizing that η̃(θ⋆, a) is concave in a with no bad local maxima, and therefore our local regret and the standard regret coincide (up to some conversion of the errors); (ii) invoking Rakhlin et al. [2015b, Lemma 3] to show that the sequential Rademacher complexity R_T is bounded by O(√(2T log|Θ|)); and (iii) verifying that η̃ satisfies the conditions (Assumption 2.1) on the actions that the algorithm will visit.
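For this instantiation the virtual-ascent step of Alg. 1 has a closed form: maximizing E_{θ∼p_t}[η̃(θ, a)] = ⟨θ̄_t, a⟩ − ‖a‖² over a gives a_t = θ̄_t/2, where θ̄_t = E_{θ∼p_t}[θ]. A minimal sketch (assuming the finite-class exponential-weights learner sketched in Section 2.3):

```python
import numpy as np

def surrogate_virtual_argmax(p_t, thetas):
    """argmax_a  E_{theta ~ p_t}[ <theta, a> - ||a||^2 ]  =  mean(theta) / 2.
    Since ||theta||_2 = 1, the resulting action stays inside the unit ball."""
    theta_bar = np.average(np.stack(thetas), axis=0, weights=p_t)
    return theta_bar / 2.0
```

This function can be plugged in as the `virtual_argmax` callable of the loop sketched after Alg. 1.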
Linear bandit with sparse or structured model vectors. We consider the deterministic linear bandit setting where the model class Θ = {θ ∈ R^d : ‖θ‖_0 ≤ s, ‖θ‖_2 = 1} consists of all s-sparse vectors on the unit sphere. Similarly to the finite hypothesis case, we claim that the global regret of Alg. 1 is REG(T) = Õ(T^{3/4} s^{1/4}). The regret of our algorithm depends only on the sparsity level s (up to logarithmic factors), whereas the Eluder dimension of the sparse linear hypothesis class is still Ω(d), and the regret in Lattimore and Szepesvári [2020] also depends on d. The proof follows from discretizing the space Θ into roughly d^{O(s)} points and applying the finite model class result above.

Moreover, we can further extend the result to other linear bandit settings where θ has additional structure. Suppose Θ = {θ = φ(z) : z ∈ R^s} for some Lipschitz function φ. Then, a similar approach gives a regret bound that depends only on s and not on d (up to logarithmic factors).
Two-layer neural nets. Here we consider reward functions given by two-layer neural networks with width m. For matrices W_1 ∈ R^{m×d} and W_2 ∈ R^{1×m}, let η((W_1, W_2), a) = W_2 σ(W_1 a) − ‖a‖² for some nonlinear link function σ : R → [0, 1] with bounded derivatives up to the third order. Recall that the (1, ∞)-norm of W is defined by ‖W‖_{1,∞} = max_{i∈[m]} Σ_{j=1}^d |[W]_{i,j}|.
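A concrete instance of this reward class is sketched below, with a sigmoid link standing in for the generic σ; the names are illustrative and not part of the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # a smooth link into [0, 1]

def two_layer_reward(W1, W2, a):
    """eta((W1, W2), a) = W2 sigmoid(W1 a) - ||a||^2.
    W1 has shape (m, d); W2 has shape (1, m) or (m,); a has shape (d,)."""
    return float(W2 @ sigmoid(W1 @ a)) - float(a @ a)
```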
Let the model hypothesis space be Θ = {(W_1, W_2) : ‖W_1‖_{1,∞} ≤ 1, ‖W_2‖_1 ≤ 1} and write θ ≜ (W_1, W_2). We claim that the local regret of Alg. 1 is bounded by Õ(ε^{-2} T^{3/4}). To the best of our knowledge, this is the first result analyzing nonlinear bandit with neural network parameterization. The result follows from analyzing the sequential Rademacher complexity of η, ⟨∇_a η, u⟩, and ⟨u, ∇²_a η · v⟩, and finally of the resulting loss function ℓ. See Theorem B.2 in Section B.4 for details. We remark that zero-order optimization in this case gives a poly(d) local regret bound.

We note that if the second layer W_2 contains all negative entries and the activation function σ is monotone, then η((W_1, W_2), a) is concave in the action. (This is a special case of input convex neural networks [Amos et al., 2017].) Therefore, in this case, the local regret coincides with the global regret, and we obtain a global regret guarantee (see Theorem B.2). We note that the loss function for learning input convex neural networks is still nonconvex, but the statistical global regret result does not rely on the convexity of the loss used for learning.

3.2 Proof Sketch of Theorem 3.1

The proof of Theorem 3.1 consists of the following parts:

(i) Because of the design of the loss function (Eq. (6)), the online learner guarantees that θ_t estimates the reward, its gradient, and its Hessian accurately; that is, for θ_t ∼ p_t, η(θ⋆, a_t) ≈ η(θ_t, a_t), ∇_a η(θ⋆, a_{t−1}) ≈ ∇_a η(θ_t, a_{t−1}), and ∇²_a η(θ⋆, a_{t−1}) ≈ ∇²_a η(θ_t, a_{t−1}).

(ii) Because of (i), maximizing the virtual reward E_{θ_t}[η(θ_t, a)] with respect to a improves the real reward η(θ⋆, a) iteratively (in the sense of finding a second-order local improvement direction).

Concretely, define the errors in the reward and its derivatives: ∆_{t,1} = |η(θ_t, a_t) − η(θ⋆, a_t)|, ∆_{t,2} = |η(θ_t, a_{t−1}) − η(θ⋆, a_{t−1})|, ∆_{t,3} = ‖∇_a η(θ_t, a_{t−1}) − ∇_a η(θ⋆, a_{t−1})‖², and ∆_{t,4} = ‖∇²_a η(θ_t, a_{t−1}) − ∇²_a η(θ⋆, a_{t−1})‖²_sp. Let ∆_t = Σ_{i=1}^4 ∆_{t,i} be the total error, which measures the closeness between θ_t and θ⋆.

Assuming that the ∆_{t,j}'s are small, to show (ii) we essentially view a_t = argmax_{a∈A} E_{θ_t}[η(θ_t, a)] as an approximate update on the real reward η(θ⋆, ·) and show that it makes local improvement whenever a_{t−1} is not a critical point of the real reward:

η(θ⋆, a_t) ≳_{∆_t} E_{θ_t}[η(θ_t, a_t)]   (8)
  ≥ sup_a E_{θ_t}[ η(θ_t, a_{t−1}) + ⟨a − a_{t−1}, ∇_a η(θ_t, a_{t−1})⟩ − (ζ_g/2)‖a − a_{t−1}‖² ]   (9)
  ≳_{∆_t} sup_a E_{θ_t}[ η(θ⋆, a_{t−1}) + ⟨a − a_{t−1}, ∇_a η(θ⋆, a_{t−1})⟩ − (ζ_g/2)‖a − a_{t−1}‖² ]   (10)
  ≥ η(θ⋆, a_{t−1}) + (1/(2ζ_g))‖∇_a η(θ⋆, a_{t−1})‖².   (11)

Here in equations (8) and (10) we use the symbol ≳_{∆_t} to denote informal inequalities that hold up to additive errors depending on ∆_t: equation (8) holds up to errors related to ∆_{t,1} = |η(θ_t, a_t) − η(θ⋆, a_t)|, and equation (10) holds up to errors related to ∆_{t,2} = |η(θ_t, a_{t−1}) − η(θ⋆, a_{t−1})| and ∆_{t,3} = ‖∇_a η(θ_t, a_{t−1}) − ∇_a η(θ⋆, a_{t−1})‖². Eq. (9) is a second-order Taylor expansion around the previous iterate a_{t−1} and uses the definition a_t = argmax_{a∈A} E_{θ_t}[η(θ_t, a)]. Eq. (11) is the standard step showing the first-order improvement of gradient descent (the so-called "descent lemma").
We also remark that a_t is the maximizer of the expected reward E_{θ_t}[η(θ_t, a)] instead of η(θ_t, a), because the adversary in online learning cannot see θ_t when choosing the adversarial point a_t. The following lemma formalizes the proof sketch above and also extends it to second-order improvement. The proof can be found in Appendix B.1.

Lemma 3.2.
In the setting of Theorem 3.1, when a_{t−1} is not an (ε, √(ζε))-approximate second-order stationary point, we have

η(θ⋆, a_t) ≥ η(θ⋆, a_{t−1}) + min( ε²/(2ζ_h), ε^{3/2}/(2√ζ) ) − C · E_{θ_t∼p_t}[∆_t].   (12)

Next, we show part (i) by linking the error ∆_t to the loss function ℓ (Eq. (6)) used by the online learner. The errors ∆_{t,1}, ∆_{t,2} are already part of the loss function. Let ∆̃_{t,3} = ⟨∇_a η(θ_t, a_{t−1}) − ∇_a η(θ⋆, a_{t−1}), u_t⟩² and ∆̃_{t,4} = ⟨(∇²_a η(θ_t, a_{t−1}) − ∇²_a η(θ⋆, a_{t−1})) u_t, v_t⟩² be the remaining two terms (without the clipping) in the loss (Eq. (6)). Note that ∆̃_{t,3} is supposed to bound ∆_{t,3} because E_{u_t}[∆̃_{t,3}] = ∆_{t,3}. Similarly, E_{u_t,v_t}[∆̃_{t,4}] = ‖∇²_a η(θ_t, a_{t−1}) − ∇²_a η(θ⋆, a_{t−1})‖²_F ≥ ∆_{t,4}. We clip ∆̃_{t,3} and ∆̃_{t,4} to make them uniformly bounded and to improve concentration with respect to the randomness of u and v (the clipping is conservative and is often not active). Let ∆̃_t = ∆_{t,1} + ∆_{t,2} + min(κ_1², ∆̃_{t,3}) + min(κ_2², ∆̃_{t,4}) be the error received by the online learner at time t. The argument above can be rigorously formalized into a lemma that upper bounds ∆_t by ∆̃_t, which in turn will be bounded by the sequential Rademacher complexity.

Lemma 3.3.
By choosing κ_1 = 2ζ_g and κ_2 = 640√ζ_h, we have

E_{u_{1:T}, v_{1:T}, θ_{1:T}}[ Σ_{t=1}^T ∆̃_t ] ≥ (1/2) E_{θ_{1:T}}[ Σ_{t=1}^T ∆_t ].   (13)

We defer the proof to Appendix B.2. With Lemma 3.2 and Lemma 3.3, we can prove Theorem 3.1 by keeping track of the performance η(θ⋆, a_t). The full proof can be found in Appendix B.3.

4 Extension to Reinforcement Learning

In this section, we extend the results in Section 3 to model-based reinforcement learning with deterministic dynamics and reward function.

We can always view a model-based reinforcement learning problem with parameterized dynamics and policy as a nonlinear bandit problem in the following way. The policy parameter ψ corresponds to the action a in bandit, and the dynamics parameter θ corresponds to the model parameter θ in bandit. The expected total return η(θ, ψ) = E_{s_1∼μ}[V^{π_ψ}_{T_θ}(s_1)] is the analogue of the reward function in bandit. We intend to make the same regularity assumptions on η as in the bandit case (that is, Assumption 2.1), with a replaced by ψ. However, when the policy is deterministic, the Lipschitz constant of the reward function η with respect to ψ can be exponential in H (even if the dynamics and policy are both deterministic with good Lipschitzness), which prohibits efficient optimization over the policy parameters. Therefore we focus on stochastic policies in this section, for which we expect η and its derivatives to be Lipschitz with respect to ψ.

Blindly treating RL as a bandit only utilizes the reward but not the state observations. In fact, one major reason why model-based methods are more sample-efficient is that they supervise the learning of the dynamics with state observations. To reason about the local improvement steps and the learning of the dynamics, we make the following additional Lipschitzness assumptions on the policies and value functions beyond those on the total reward η(θ, ψ). (This is because the Lipschitzness of η(θ, ψ) hardly implies Lipschitzness of the dynamics and the policies.)
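To make the bandit–RL correspondence concrete, the following sketch estimates the expected return η(θ, ψ) of a stochastic policy under a deterministic dynamics model by Monte-Carlo rollouts; the callables and their signatures are assumptions made purely for illustration.

```python
import numpy as np

def expected_return(dynamics, policy_sample, reward, init_states, H, n_rollouts, rng):
    """Monte-Carlo estimate of eta(theta, psi) = E_{s1 ~ mu}[ V^{pi_psi}_{T_theta}(s1) ].
    dynamics(s, a) -> next state (deterministic); policy_sample(s, rng) -> action;
    reward(s, a) -> scalar; init_states: empirical sample of the initial distribution mu."""
    total = 0.0
    for _ in range(n_rollouts):
        s = init_states[rng.integers(len(init_states))]   # s1 ~ mu (empirical)
        ret = 0.0
        for _ in range(H):
            a = policy_sample(s, rng)
            ret += reward(s, a)
            s = dynamics(s, a)
        total += ret
    return total / n_rollouts
```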
Assumption 4.1. We assume the following (analogous to Assumption 2.1) on the value function: for all ψ ∈ Ψ, θ ∈ Θ, and s, s′ ∈ S, we have

• |V^ψ_θ(s) − V^ψ_θ(s′)| ≤ L_1 ‖s − s′‖;
• ‖∇_ψ V^ψ_θ(s) − ∇_ψ V^ψ_θ(s′)‖ ≤ L_2 ‖s − s′‖;
• ‖∇²_ψ V^ψ_θ(s) − ∇²_ψ V^ψ_θ(s′)‖_sp ≤ L_3 ‖s − s′‖.
Assumption 4.2. We assume the following Lipschitzness conditions on the stochastic policy parameterization π_ψ: for all ψ ∈ Ψ and s ∈ S,

• ‖E_{a∼π_ψ(·|s)}[(∇_ψ log π_ψ(a|s))(∇_ψ log π_ψ(a|s))^⊤]‖_sp ≤ χ_g;
• ‖E_{a∼π_ψ(·|s)}[(∇_ψ log π_ψ(a|s))^{⊗3}]‖_sp ≤ χ_f;
• ‖E_{a∼π_ψ(·|s)}[(∇²_ψ log π_ψ(a|s))(∇²_ψ log π_ψ(a|s))^⊤]‖_sp ≤ χ_h.

(Recall that the injective norm of a k-th order tensor A ∈ R^{d^{⊗k}} is defined as ‖A‖_sp = sup_{u∈S^{d−1}} ⟨A, u^{⊗k}⟩.)

We will show that the differences of the gradient and Hessian of the total reward can be upper-bounded by the difference of the dynamics. Let τ_t = (s_1, a_1, ···, s_H, a_H) be a trajectory sampled from policy π_{ψ_t} under the ground-truth dynamics T_{θ⋆}. Similarly to Yu et al. [2020], using the simulation lemma and the Lipschitzness of the value function, we can easily upper bound ∆_{t,1} = |η(θ_t, ψ_t) − η(θ⋆, ψ_t)| by the one-step model prediction errors. Thanks to the policies' stochasticity, using the REINFORCE formula we can also bound the gradient error by the model error:

∆_{t,3} = ‖∇_ψ η(θ_t, ψ_{t−1}) − ∇_ψ η(θ⋆, ψ_{t−1})‖² ≲ E_{τ∼ρ^{ψ_{t−1}}_{θ⋆}}[ Σ_{h=1}^H ‖T_{θ_t}(s_h, a_h) − T_{θ⋆}(s_h, a_h)‖² ].

Similarly, we can upper bound the Hessian error by the error of the dynamics. As a result, the loss function can simply be set to

ℓ((τ_t, τ′_t); θ) = Σ_{(s_h,a_h)∈τ_t} ‖T_θ(s_h, a_h) − T_{θ⋆}(s_h, a_h)‖² + Σ_{(s′_h,a′_h)∈τ′_t} ‖T_θ(s′_h, a′_h) − T_{θ⋆}(s′_h, a′_h)‖²,   (14)

where τ_t and τ′_t are sampled from policies π_{ψ_t} and π_{ψ_{t−1}}, respectively. Compared to Alg. 1, the loss function here is simpler and does not rely on finite-difference techniques to query gradient projections. Our algorithm for RL is analogous to Alg. 1 but uses the loss function in Eq. (14); it is presented as Alg. 2 in Appendix C. The main theorem for Alg. 2 is stated below.
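A direct rendering of the loss in Eq. (14) is sketched below. We assume trajectories are stored as lists of (s, a, s′) transitions; since the real dynamics are deterministic, T_{θ⋆}(s_h, a_h) is simply the observed next state.

```python
import numpy as np

def dynamics_loss(T_model, tau_t, tau_prev):
    """Eq. (14): sum of squared one-step prediction errors of the model T_theta on the
    transitions of two real trajectories (collected under pi_{psi_t} and pi_{psi_{t-1}})."""
    loss = 0.0
    for s, a, s_next in tau_t:
        loss += float(np.sum((T_model(s, a) - s_next) ** 2))
    for s, a, s_next in tau_prev:
        loss += float(np.sum((T_model(s, a) - s_next) ** 2))
    return loss
```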
Theorem 4.3. Let c = H L_3 (4Hχ_h + 4Hχ_f + 2Hχ_g + 1) + H L_2 (8Hχ_g + 2) + 4H L_1 and C = 2 + ζ_g/ζ_h. Let R^{dyn}_T be the sequential Rademacher complexity of the loss function class defined in Eq. (14). Under Assumptions 2.1–4.2, for any ε ≤ min(1, ζ_h²/ζ), we can bound the (ε, √(ζε))-local regret of Alg. 2 by

E[ REG_{ε,√(ζε)}(T) ] ≤ (C √(c · T · R^{dyn}_T)) · max(ζ_h ε^{-2}, √ζ ε^{-3/2}).   (15)
Comparison with policy gradient. Policy gradient [Williams, 1992] also admits a sample complexity analysis by controlling its variance. Let g(s, a) = ∇_ψ log π_ψ(a|s). The variance of the REINFORCE estimator is governed by E[‖g(s, a)‖²], whereas our bound can be dimension-free as long as ‖E[g(s, a) g(s, a)^⊤]‖_op is a constant. The gap between E[‖g(s, a)‖²] and ‖E[g(s, a) g(s, a)^⊤]‖_op can be as large as a factor of d_A when g(s, a) is isotropic. More concretely, it is possible that our bound is dimension-free while the bound for policy gradient is not. Consider a typical stochastic policy in deep RL [Schulman et al., 2017, 2015]: a ∼ μ_ψ(s) + N(0, σ²I), where μ_ψ is a neural network and σ is a constant. We have g(s, a) = (∂μ_ψ(s)/∂ψ)^⊤ σ^{-2}(a − μ_ψ(s)). It follows that if ‖∂μ_ψ(s)/∂ψ‖_sp ≈ 1, then E[‖g(s, a)‖²] scales with d_A, whereas ‖E_a[g(s, a) g(s, a)^⊤]‖_op can be bounded by O(1) when g(s, a) is isotropic.
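The gap can also be checked numerically. The sketch below compares the two quantities by Monte Carlo for the Gaussian policy above, given an assumed Jacobian J = ∂μ_ψ(s)/∂ψ; it is purely illustrative and not part of the paper's analysis.

```python
import numpy as np

def reinforce_variance_comparison(J, sigma, n_samples=100_000, seed=0):
    """Compare E[||g||^2] with ||E[g g^T]||_op for the Gaussian policy
    a ~ N(mu_psi(s), sigma^2 I), where g = J^T (a - mu) / sigma^2 and
    J = d mu_psi(s) / d psi has shape (d_A, d_psi)."""
    rng = np.random.default_rng(seed)
    d_A = J.shape[0]
    noise = rng.standard_normal((n_samples, d_A)) * sigma     # samples of a - mu_psi(s)
    g = noise @ J / sigma ** 2                                # rows are g(s, a)
    mean_sq_norm = float(np.mean(np.sum(g ** 2, axis=1)))     # E[||g||^2], grows with d_A
    op_norm = float(np.linalg.norm(g.T @ g / n_samples, 2))   # ||E[g g^T]||_op, O(1) here
    return mean_sq_norm, op_norm
```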
5 Lower Bounds

We prove several lower bounds showing (a) the hardness of finding global maxima, (b) the inefficiency of optimism in nonlinear bandit, and (c) the hardness of stochastic environments.

Hardness of Global Optimality.
In the following theorem, we show that it is statistically intractable to find the globally optimal policy when the function class is chosen to be neural networks with ReLU activation, that is, when the reward function can be written in the form η((w, b), a) = ReLU(⟨w, a⟩ + b).

Theorem 5.1.
When the function class is chosen to be one-layer neural networks with ReLU activation, the minimax sample complexity is Ω(ε^{-(d−1)}). Moreover, the ε-Eluder dimension of one-layer neural networks is at least Ω(ε^{-(d−1)}).

The proof is deferred to Appendix A.1. We also note that the theorem above does require the ReLU activation: if the ReLU is replaced by a strictly monotone link function with bounded derivatives (up to the third order), then we are in the setting of deterministic generalized linear bandit, which does allow a global regret bound that depends polynomially on the dimension [Filippi et al., 2010, Dong et al., 2019, Li et al., 2017]. In this case, our Theorem 3.1 also gives a polynomial global regret result: because all local maxima of the reward function are global maxima [Hazan et al., 2015, Kakade et al., 2011] and the reward also satisfies the strict-saddle property [Ge et al., 2015], the local regret result translates into a global regret result. This shows that our framework does separate the intractable cases from the tractable ones via the notion of local and global regrets.

With two-layer neural networks, we can relax the use of the ReLU activation—Theorem 5.1 holds with two-layer neural networks and leaky-ReLU activations [Xu et al., 2015] because O(1) leaky-ReLU units can implement a ReLU activation. We conjecture that with more layers, the impossibility result also holds for a broader set of activations.

Inefficiency caused by optimism in nonlinear models.
In the following, we revisit the optimism-in-the-face-of-uncertainty principle. First we recall the UCB algorithm in deterministic environments. We formalize the UCB algorithm under deterministic environments as follows. At every time step t, the algorithm maintains an upper confidence bound C_t : A → R satisfying η(θ⋆, a) ≤ C_t(a), and the action at time step t is a_t ← argmax_a C_t(a). Let Θ_t be the set of parameters consistent with the observed rewards η(θ⋆, a_1), ···, η(θ⋆, a_{t−1}); that is, Θ_t = {θ ∈ Θ : η(θ, a_τ) = η(θ⋆, a_τ), ∀τ < t}. In a deterministic environment, the tightest upper confidence bound is C_t(a) = sup_{θ∈Θ_t} η(θ, a).

The next theorem states that the UCB algorithm, which uses the optimism-in-the-face-of-uncertainty principle, can overly explore the action space even when the ground truth is simple.
Theorem 5.2.
Consider the case where the ground-truth reward function is linear, η(θ⋆, a) = ⟨θ⋆, a⟩, and the action set is the unit sphere, a ∈ S^{d−1}. If the hypothesis class is chosen to be two-layer neural networks with width d, the UCB algorithm with the tightest upper confidence bound suffers exponential sample complexity.
Hardness of stochastic environments.
As a motivation to consider deterministic rewards, the next theo-rem proves that a poly(log | Θ | ) sample complexity is impossible for finding local optimal action even undermild stochastic environment. Theorem 5.3.
There exists a bandit problem with stochastic rewards and a hypothesis class of size log|Θ| = Õ(1) such that any algorithm requires Ω(d) samples to find an approximate second-order stationary point (with constant accuracy parameters) with constant probability.
Θ = {e_1, ···, e_d}, action space A = S^{d−1}, and i.i.d. standard Gaussian noise. Intuitively, the hardness comes from a low signal-to-noise ratio: min_i |⟨a, e_i⟩| ≤ 1/√d for any a ∈ A. In other words, in the worst case the signal-to-noise ratio is O(1/√d), which leads to a sample complexity that depends on d. We defer the formal proof to Appendix A.2.

6 Related Work

There are several provably efficient algorithms without optimism for contextual bandits. Foster and Rakhlin [2020] and Simchi-Levi and Xu [2020] exploit a particular exploration probability that is approximately the inverse of the empirical gap. The SquareCB algorithm [Foster and Rakhlin, 2020] also extends to infinite action sets, but with a polynomial dependence on the action dimension in the regret bound. The subtlety of the exploration probability in Foster and Rakhlin [2020] and Simchi-Levi and Xu [2020] makes it hard to extend to reinforcement learning. Recently, Foster et al. [2020] proved an instance-dependent regret bound for contextual bandits.

The deterministic nonlinear bandit problem can also be formulated as zero-order optimization without noise (see Duchi et al. [2015], Liu et al. [2020] and references therein), where the reward class is assumed to be all 1-Lipschitz functions. In contrast, our algorithm exploits the knowledge of the reward function parameterization and achieves an action-dimension-free regret. In the setting of stochastic nonlinear bandit, Filippi et al. [2010] consider the generalized linear model; Valko et al. [2013] and Zhou et al. [2020] focus on rewards in a reproducing kernel Hilbert space (RKHS) and neural networks (in the neural tangent kernel regime), respectively, and provide algorithms with sublinear regret.

Another line of research focuses on solving reinforcement learning by running optimization algorithms in the policy space. Agarwal et al. [2020b] prove that natural policy gradient can solve tabular MDPs efficiently. Cai et al. [2020] incorporate an exploration bonus into a proximal policy optimization algorithm and achieve polynomial regret in the linear MDP setting.
7 Conclusion

In this paper, we design new algorithms whose local regrets are bounded by the sequential Rademacher complexity of particular loss functions. By rearranging the priorities of exploration versus exploitation, our algorithms avoid the over-aggressive exploration caused by the optimism-in-the-face-of-uncertainty principle, and hence apply to nonlinear models and dynamics. We raise the following questions as future work:

1. Since we mainly focus on proving regret bounds that depend only on the complexity of the dynamics/reward class, our convergence rate in T is likely not minimax optimal. Can our algorithms (or analysis) be modified to achieve minimax-optimal regret for some of the instantiations, such as sparse linear bandit and linear bandit with a finite model class?

2. In the bandit setting, we focus on deterministic rewards because our ViOL algorithm relies on finite differences to estimate the gradient and Hessian of the reward function. In fact, Theorem 5.3 shows that an action-dimension-free regret bound for linear models is impossible under standard Gaussian noise. Can we extend our algorithm to stochastic environments with additional assumptions on the noise?

3. In the reinforcement learning setting, we use the policy gradient lemma to upper bound the gradient/Hessian loss by the dynamics loss, which inevitably requires the policies to be stochastic. Despite the success of stochastic policies in deep reinforcement learning, the optimal policy may not be stochastic. Can we extend the ViOL algorithm to reinforcement learning problems with a deterministic policy hypothesis class?
The authors would like to thank Yuanhao Wang, Daogao Liu, Zhizhou Ren, Jason D. Lee, and Colin Wei for helpful discussions. TM is also partially supported by the Google Faculty Award, Lam Research, and JD.com.
References
Alekh Agarwal, Nan Jiang, and Sham M Kakade. Reinforcement learning: Theory and algorithms.
CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep., 2019.

Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. FLAMBE: Structural complexity and representation learning of low rank MDPs. arXiv preprint arXiv:2006.10814, 2020a.

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. In
Proceedings of Thirty Third Conference onLearning Theory , volume 125 of
Proceedings of Machine Learning Research , pages 64–66. PMLR, 09–12 Jul 2020b.Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In
International Conference onMachine Learning , pages 146–155. PMLR, 2017.Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin F Yang. Model-based reinforcementlearning with value-targeted regression. In
Proceedings of the 37th International Conference on MachineLearning , 2020.Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison,David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforce-ment learning. arXiv preprint arXiv:1912.06680 , 2019.Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization.In
International Conference on Machine Learning , pages 1283–1294. PMLR, 2020.Ignasi Clavera, Yao Fu, and Pieter Abbeel. Model-augmented actor-critic: Backpropagating through paths.In
International Conference on Learning Representations , 2019.Kefan Dong, Yuping Luo, Tianhe Yu, Chelsea Finn, and Tengyu Ma. On the expressivity of neural networksfor deep reinforcement learning. In
International Conference on Machine Learning , pages 2627–2637.PMLR, 2020.Shi Dong, Tengyu Ma, and Benjamin Van Roy. On the performance of thompson sampling on logisticbandits. In
Conference on Learning Theory , pages 1158–1160, 2019.Simon S Du, Yuping Luo, Ruosong Wang, and Hanrui Zhang. Provably efficient Q-learning with functionapproximation via distribution shift error checking oracle. In
Advances in Neural Information ProcessingSystems , pages 8058–8068, 2019.John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-orderconvex optimization: The power of two function evaluations.
IEEE Transactions on Information Theory ,61(5):2788–2806, 2015.Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: the generalizedlinear case. In
Proceedings of the 23rd International Conference on Neural Information ProcessingSystems-Volume 1 , pages 586–594, 2010.Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regres-sion oracles. In
Proceedings of the 37th International Conference on Machine Learning , volume 119 of
Proceedings of Machine Learning Research , pages 3199–3210. PMLR, 13–18 Jul 2020.Dylan J Foster, Alexander Rakhlin, David Simchi-Levi, and Yunzong Xu. Instance-dependent complexityof contextual bandits and reinforcement learning: A disagreement-based perspective. arXiv preprintarXiv:2010.03104 , 2020.Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions.
Mathematical Programming, pages 1–47, 2020.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In
Conference on Learning Theory , pages 797–842, 2015.Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In
Advancesin Neural Information Processing Systems , pages 2973–2981, 2016.Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unifiedgeometric analysis. arXiv preprint arXiv:1704.00708 , 2017.Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning be-haviors by latent imagination. In
International Conference on Learning Representations , 2019a.Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James David-son. Learning latent dynamics for planning from pixels. In
International Conference on Machine Learn-ing , pages 2555–2565. PMLR, 2019b.Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. arXiv preprintarXiv:2011.04020 , 2020.Elad Hazan, Kfir Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization.In
Advances in Neural Information Processing Systems , pages 1594–1602, 2015.Daniel Hsu, Sham Kakade, Tong Zhang, et al. A tail inequality for quadratic forms of subgaussian randomvectors.
Electronic Communications in Probability , 17, 2012.Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-basedpolicy optimization. arXiv preprint arXiv:1906.08253 , 2019.Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient?In
Proceedings of the 32nd International Conference on Neural Information Processing Systems , pages4868–4878, 2018.Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learningwith linear function approximation. In
Conference on Learning Theory , pages 2137–2143, 2020.Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction.In
Advances in Neural Information Processing Systems , pages 315–323, 2013.Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theo-retic regret bounds for online nonlinear control. arXiv preprint arXiv:2006.12466 , 2020.Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linearand single index models with isotonic regression. In
Advances in Neural Information Processing Systems ,pages 927–935, 2011.Tor Lattimore and Csaba Szepesvári.
Bandit algorithms . Cambridge University Press, 2020.Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection.
Annals of Statistics , pages 1302–1338, 2000.Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges tominimizers.
University of California, Berkeley, 1050:16, 2016.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.
The Journal of Machine Learning Research , 17(1):1334–1373, 2016.Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextualbandits. In
International Conference on Machine Learning , pages 2071–2080. PMLR, 2017.Sijia Liu, Pin-Yu Chen, Bhavya Kailkhura, Gaoyuan Zhang, Alfred O Hero III, and Pramod K Varshney.A primer on zeroth-order optimization in signal processing and machine learning: Principals, recentadvances, and applications.
IEEE Signal Processing Magazine , 37(5):43–54, 2020.Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic frame-work for model-based deep reinforcement learning with theoretical guarantees. In
International Confer-ence on Learning Representations , 2019. URL https://openreview.net/forum?id=BJe1E2R5KX .Mufti Mahmud, Mohammed Shamim Kaiser, Amir Hussain, and Stefano Vassanelli. Applications of deeplearning and reinforcement learning to biological data.
IEEE transactions on neural networks and learn-ing systems , 29(6):2063–2079, 2018.Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learningusing linearly combined model ensembles. In
International Conference on Artificial Intelligence andStatistics , pages 2010–2020. PMLR, 2020.Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In
Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 1 ,pages 1466–1474, 2014.Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: stochastic, constrained, andsmoothed adversaries. In
Proceedings of the 24th International Conference on Neural Information Pro-cessing Systems , pages 1764–1772, 2011.Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities.
Journal of Machine Learning Research , 16(6):155–186, 2015a.Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingalelaws of large numbers.
Probability Theory and Related Fields , 161(1-2):111–153, 2015b.Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic explo-ration. In
Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 2 , pages 2256–2264, 2013.John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policyoptimization. In
International Conference on Machine Learning, pages 1889–1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Roshan Shariff and Csaba Szepesvári. Efficient planning in large MDPs with weak linear function approximation. arXiv preprint arXiv:2007.06184, 2020.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge.
Nature , 550(7676):354, 2017.David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm forcontextual bandits under realizability. arXiv preprint arXiv:2003.12699 , 2020.Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based rl in con-textual decision processes: Pac bounds and exponential improvements over model-free approaches. In
Conference on Learning Theory , pages 2898–2933. PMLR, 2019.Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 2011.Michal Valko, Nathan Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis ofkernelised contextual bandits. In
Proceedings of the Twenty-Ninth Conference on Uncertainty in ArtificialIntelligence , pages 654–663, 2013.Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcement learning withgeneral value function approximation.
Advances in Neural Information Processing Systems , 2020a.Yining Wang, Ruosong Wang, Simon S Du, and Akshay Krishnamurthy. Optimism in reinforcement learn-ing with generalized linear function approximation. arXiv preprint arXiv:1912.04136 , 2019.Yining Wang, Yi Chen, Ethan X Fang, Zhaoran Wang, and Runze Li. Nearly dimension-independent sparselinear bandit over small action spaces via best subset selection. arXiv preprint arXiv:2009.02003 , 2020b.Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learn-ing.
Machine learning , 8(3-4):229–256, 1992.Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolu-tional network, 2015.Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regretbound. In
International Conference on Machine Learning , pages 10746–10756. PMLR, 2020.Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, andTengyu Ma. Mopo: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239 , 2020.Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimalpolicies with low inherent Bellman error. In
International Conference on Machine Learning , 2020.Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. In
International Conference on Machine Learning , pages 11492–11502. PMLR, 2020.18
Additional Results in Section 5
In this section, we prove several negative results.
A.1 Proof of Theorem 5.1
Proof.
We consider the class I = { I θ,ε : k θ k ≤ , ε > } of infinite-armed bandit instances, where in theinstance I θ,ε , the reward of pulling action x ∈ B d (1) is deterministic and is equal to η ( I θ,ε , x ) = Ax {h x, θ i − ε, } . (16)We prove the first statement in the theorem by proving the minimax regret. The sample complexity thenfollows from the canonical sample complexity-regret reduction [Jin et al., 2018, Section 3.1]. Let A denoteany algorithm. Let R T A ,I be the T -step regret of algorithm A under instance I . Then we have inf A sup I ∈I E [ R T A ,I ] ≥ Ω( T d − d − ) . Fix ε = c · T − / ( d − . Let Θ be an ε -packing of the sphere { x ∈ R d : k x k = 1 } . Then we have | Θ | ≥ Ω( ε − ( d − ) . So we choose c > to be a numeric constant such that T ≤ | Θ | / . Let µ be thedistribution over Θ such that µ ( θ ) = Pr[ ∃ t ≤ T s.t. η ( I θ,ε , a t ) = 0 when r τ ≡ for τ = 1 , . . . , T ] . Notethat for any action a t ∈ B d (1) , there is at most one θ ∈ Θ such that η ( I θ,ε , a t ) = 0 , because Θ is a packing.Since T ≤ | Θ | / , there exists θ ∗ ∈ Θ such that µ ( θ ∗ ) ≤ / . Therefore, with probability / , the algorithm A would obtain reward r t = η ( I θ ∗ ,ε , a t ) = 0 for every time step t = 1 , . . . , T . Note that under instance I θ ∗ ,ε , the optimal action is to choose a t ≡ θ ∗ , which would give reward r ∗ t ≡ ε . Therefore, with probability / , we have E [ R T A ,I θ,ε ] ≥ εT / ≥ Ω( T d − d − ) .Next we prove the second statement. We use dim E ( F , ε ) to denote the ε -eluder dimension of thefunction class F . Let Θ be an ε -packing of the sphere { x ∈ R d : k x k = 1 } . We write Θ = { θ , . . . , θ n } . Then we have n ≥ Ω( ε − ( d − ) . Next we establish that dim E ( F , ε ) ≥ Ω( ε − ( d − ) . For each i ∈ [ n ] , wedefine the function f i ( a ) = η ( I θ i ,ε , a ) ∈ F . Then for i ≤ n − , we have f i ( θ j ) = f i +1 ( θ j ) for j ≤ i − , while ε = f i ( θ i ) = f i +1 ( θ i ) = 0 . Therefore, θ i is ε -independent of its predecessors. As a result, we have dim E ( F , ε ) ≥ n − . A.2 Proof of Theorem 5.3
Proof.
We consider a linear bandit problem with hypothesis class
Θ = { e , · · · , e d } . The action space is S d − . The stochastic reward function is given by η ( θ, a ) = h θ, a i + ξ where ξ = N (0 , is the noise.Define the set A i = { a ∈ S d − : |h a, e i i| ≥ . } . By basic algebra we get, A i ∩ A j = ∅ for all i = j. The manifold gradient of η ( θ, · ) on S d − is grad η ( θ, a ) = (cid:16) I − aa ⊤ (cid:17) θ. By triangular inequality we get k grad η ( θ, a ) k ≥ k θ k − |h a, θ i| . Consequently, k grad η ( θ i , a ) k ≥ . for a A i . In other words, (cid:0) S d − \ A i (cid:1) does not contain any (0 . , -approximate second order stationarypoint for η ( θ i , · ) . For a fixed algorithm, let a , · · · , a T be the sequence of actions chosen by the algorithm, and x t = h θ ⋆ , a t i + ξ t . Next we prove that with T . d steps, there exists i ∈ [ d ] such that Pr i [ ∃ t ∈ [ T ] : a t ∈ A i ] ≤ / , where Pr i denotes the probability space generated by θ ⋆ = θ i . Let Pr be the probability space gen-erated by θ ⋆ = 0 . Let E i,T be the event that the algorithm outputs an action a ∈ A i in T steps. By Pinskerinequality we get, E i [ E i,T ] ≤ E [ E i,T ] + r D KL (Pr i , Pr ) . (17)Using the chain rule of KL-divergence and the fact that D KL ( N (0 , , N ( a, a , we get E i [ E i,T ] ≤ E [ E i,T ] + vuut E T X t =1 h a t , θ i i . (18)Consequently, d X i =1 E i [ E i,T ] ≤ d X i =1 E [ E i,T ] + d X i =1 vuut E " T X t =1 h a t , θ i i (19) ≤ vuut d E " d X i =1 T X t =1 h a t , θ i i ≤ r dT , (20)which means that min i ∈ [ d ] E i [ E i,T ] ≤ d + r T d . (21)Therefore when T ≤ d , there exists i ∈ [ d ] such that E i [ E i,T ] ≤ . A.3 Proof of Theorem 5.2
We first provide a proof sketch to the theorem. We consider the following reward function. η (( θ ⋆ , θ ⋆ , α ) , a ) = 164 h a, θ ⋆ i + α max (cid:18) h θ ⋆ , a i − , (cid:19) . Note that the reward function η can be clearly realized by a two-layer neural network with width d .When α = 0 we have η (( θ ⋆ , θ ⋆ , α ) , a ) = h θ ⋆ , a i , which represents a linear reward. Informally, optimismbased algorithm will try to make the second term large (because optimistically the algorithm hopes α = 1 ),which leads to an action a t that is suboptimal for ground-truth reward (in which case α = 0 ). In round t ,the optimism algorithm observes h θ ⋆ , a t i = 0 , and can only eliminate an exponentially small fraction of θ ⋆ from the hypothesis. Therefore the optimism algorithm needs exponential number of steps to determine α = 0 and stops exploration. Formally, the prove is given below. Proof.
Consider a bandit problem where A = S d − and η (( θ ⋆ , θ ⋆ , α ) , a ) = 164 h a, θ ⋆ i + α max (cid:18) h θ ⋆ , a i − , (cid:19) . Θ = { θ , θ , α : k θ k ≤ , k θ k ≤ , α ∈ [0 , } . Then the reward function η can be clearly realized by a two-layer neural network with width d . Note that when α = 0 we have η (( θ ⋆ , θ ⋆ , α ) , a ) = h θ ⋆ , a i , which represents a linear reward. In the following we use θ ⋆ = ( θ ⋆ , θ ⋆ , asa shorthand.The UCB algorithm is described as follows. At every time step t , the algorithm maintains a upperconfidence bound C t : A → R . The function C t satisfies η ( θ ⋆ , a ) ≤ C t ( a ) . And then the action for timestep t is a t ← argmax C t ( a ) .Let P = { p , p , · · · , p n } be an -packing of the sphere S d − , where n = Ω(2 d ) . Let B ( p i , ) be theball with radius / centered at p i , and B i = B ( p i , ) ∪ S d − . We prove the theorem by showing that theUCB algorithm will explore every packing in P . That is, for any i ∈ [ n ] , there exists t such that a t ∈ B i .Since we have sup a j ∈ B j h p i , a j i ≤ / for all j = i, this over-exploration strategy leads to a samplecomplexity (for finding a (31 / -suboptimal action) at least Ω(2 d ) when θ ⋆ = ( p i , p i , . Let Θ t be the set of parameters that is consistent with η ( θ ⋆ , a ) , · · · , η ( θ ⋆ , a t − ) . That is, Θ t = { θ ∈ Θ : η ( θ , a τ ) = η ( θ ⋆ , a τ ) , ∀ τ < t } . Since our environment is deterministic, a tightest upper confidencebound is C ( a ) = sup θ ∈ Θ t η ( θ , a ) . Let A t = { a , · · · , a t } . It can be verified that for any θ ∈ S d − ,η (( θ ⋆ , θ , , · ) is consistent with η ( θ ⋆ , · ) on A t − if B ( θ , ) ∪ A t − = ∅ . As a result, for any θ such that B ( θ , ) ∪ A t − = ∅ we have C ( θ ) ≥ > a η ( θ ⋆ , a ) . (22)Next we prove that for any i ∈ [ n ] , there exists t such that a t ∈ B ( p i , ) . Note that η ( θ , · ) is Lipschitzfor every θ ∈ Θ . Therefore we have C t ( a τ + ξ ) ≤ + sup a η ( θ ⋆ , a ) = for any τ < t and ξ such that k ξ k ≤ . Therefore, after t = (cid:0) (cid:1) d time steps we get sup a C t ( a ) ≤ . Combining with Eq. (22), forany θ there exists t ≤ (cid:0) (cid:1) d such that a t ∈ B ( θ , ) . B Missing Proofs in Section 3
In this section, we show missing proofs in Section 3.
B.1 Proof of Lemma 3.2
Proof.
We prove the lemma by showing that algorithm 1 improves reward η ( θ ⋆ , a t ) in the following twocases:1. k∇ a η ( θ ⋆ , a t − ) k ≥ ǫ , or2. k∇ a η ( θ ⋆ , a t − ) k ≤ ǫ and λ max (cid:0) ∇ a η ( θ ⋆ , a t − ) (cid:1) ≥ √ ζ ǫ . Case 1:
For simplicity, let g t = ∇ a η ( θ ⋆ , a t − ) . In this case we assume k g t k ≥ ǫ. Define function ¯ η t ( θ, a ) = η ( θ, a t − ) + h a − a t − , ∇ a η ( θ, a t − ) i − ζ h k a − a t − k (23)to be the local first order approximation of function η ( θ, a ) . By the Lipschitz assumption (namely, As-sumption 2.1), we have η ( θ, a ) ≥ ¯ η t ( θ, a ) for all θ ∈ Θ , a ∈ A . By the definition of ∆ t, and ∆ t, , weget ¯ η t ( θ t , a ) ≥ ¯ η t ( θ ⋆ , a ) − ∆ t, − k a − a t − k ∆ t, . (24)21n this case we have η ( θ ⋆ , a t ) ≥ E θ t ∼ p t [ η ( θ t , a t ) − ∆ t, ] ≥ sup a E θ t ∼ p t [ η ( θ t , a ) − ∆ t, ] (By the optimality of a t ) ≥ sup a E θ t ∼ p t [¯ η t ( θ t , a ) − ∆ t, ] ≥ sup a E θ t ∼ p t [¯ η t ( θ ⋆ , a ) − ∆ t, − ∆ t, − k a − a t − k ∆ t, ] (By Eq. (24)) ≥ E θ t ∼ p t (cid:20) η ( θ ⋆ , a t − ) + 14 ζ h k g t k − ∆ t, − ∆ t, − k g t k ζ h ∆ t, (cid:21) (Take a = a t − + g t ζ h ) ≥ η ( θ ⋆ , a t − ) + ǫ ζ h − E θ t ∼ p t (cid:20)(cid:18) ζ g ζ h (cid:19) ∆ t (cid:21) (By Cauchy-Schwarz) Case 2:
Let H t − = ∇ a η ( θ ⋆ , a t − ) . Define v t − ∈ argmax v : k v k =1 v ⊤ H t − v . In this case we have k g t k ≤ ǫ and v ⊤ t − H t − v t − ≥ p ζ ǫ k v t − k . (25)Define function ˆ η t ( θ, a ) = η ( θ, a t − ) + h a − a t − , ∇ a η ( θ, a t − ) i + 12 (cid:10) ∇ a η ( θ t , a t − )( a − a t − ) , a − a t − (cid:11) − ζ k a − a t − k (26)to be the local second order approximation of function η ( θ, a ) . By the Lipschitz assumption (namely,Assumption 2.1), we have η ( θ, a ) ≥ ˆ η t ( θ, a ) for all θ ∈ Θ , a ∈ A .By Eq. (25), we can exploit the positive curvature by taking a ′ = a t − + 4 q ǫζ v t . Concretely, by basicalgebra we get: ˆ η t ( θ ⋆ , a ′ ) ≥ η ( θ ⋆ , a t − ) − ǫ (cid:13)(cid:13) a ′ − a t − (cid:13)(cid:13) + 3 p ζ ǫ (cid:13)(cid:13) a ′ − a t − (cid:13)(cid:13) − ζ (cid:13)(cid:13) a ′ − a t − (cid:13)(cid:13) ≥ η ( θ ⋆ , a t − ) + 12 s ǫ ζ . (27)Combining with the definition of ∆ t, , ∆ t, and ∆ t, , for any a ∈ A we get ˆ η t ( θ t , a ) ≥ ˆ η t ( θ ⋆ , a ) − ∆ t, − k a − a t − k ∆ t, − k a − a t − k ∆ t, . (28)As a result, we have η ( θ ⋆ , a t ) ≥ E θ t ∼ p t [ η ( θ t , a t ) − ∆ t, ] ≥ E θ t ∼ p t (cid:2) η ( θ t , a ′ ) − ∆ t, (cid:3) (By the optimality of a t ) ≥ E θ t ∼ p t (cid:20) ˆ η t ( θ ⋆ , a ′ ) − ∆ t, − ∆ t, − k a − a t − k ∆ t, − k a − a t − k ∆ t, (cid:21) (By Eq. (28)) ≥ η ( θ ⋆ , a t − ) + 12 s ǫ ζ − E θ t ∼ p t (cid:20) ∆ t, + ∆ t, + 4 r ǫζ ∆ t, + 8 ǫζ ∆ t, (cid:21) (By Eq. (27))22 η ( θ ⋆ , a t − ) + 12 s ǫ ζ − E θ t ∼ p t [2∆ t ] . (When ǫ ≤ ζ )Combining the two cases together, we get the desired result. B.2 Proof of Lemma 3.3
Proof.
Define F t to be the σ -field generated by random variable u t , v t , θ t . In the following, we use E t [ · ] as a shorthand for E [ · | F t ] . Let g t = ∇ a η ( θ t , a t − ) − ∇ a η ( θ ⋆ , a t − ) . Note that condition on θ t , h g t , u t i follows the distribution N (0 , k g t k ) . By Assumption 2.1, k g t k ≤ ζ g = κ . As a result, E t − h min (cid:16) κ , h g t , u t i (cid:17) | θ t i ≥ E t − h h g t , u t i | θ t i = 12 k g t k . (29)By the tower property of expectation we get E t − h min (cid:16) κ , ˜∆ t, (cid:17)i ≥ E t − (cid:2) ∆ t, (cid:3) . (30)Now we turn to the term ˜∆ t, . Let H t = ∇ a η ( θ t , a t − ) − ∇ a η ( θ ⋆ , a t − ) . Define a random variable x = (cid:0) u ⊤ t H t v t (cid:1) . Note that u t , v t are independent, we have E t − [ x | θ t ] = E t − h k H t v t k | θ t i = k H t k ≥ k H t k . (31)Since u t , v t are two Gaussian vectors, random variable x has nice concentratebility properties. Therefore wecan prove that the min operator in the definition of ˜∆ t does not change the expectation too much. Formallyspeaking, by Lemma D.6, condition on F t − and θ t , we have E (cid:2) min (cid:0) κ , x (cid:1)(cid:3) ≥ min (cid:0) ζ h , E [ x ] (cid:1) , whichleads to E t − h min( κ , ˜∆ t, ) i ≥ E t − h min( ζ h , k H t k ) i = 12 E t − h k H t k i . (32)Combining Eq. (30) and Eq. (32), we get the desired inequality. B.3 Proof of Theorem 3.1
Proof.
Let δ t = inf a ∈ A ( ǫ ) , √ ζ ǫ η ( θ ⋆ , a ) − η ( θ ⋆ , a t ) . By the definition of regret we have
REG ǫ, √ ζ ǫ ( T ) = P Tt =1 δ t . Define υ = min (cid:18) ζ h ǫ , ζ / ǫ / (cid:19) for simplicity. Recall that C = 2 + ζ g ζ h . In the following weprove by induction that for any t , E t − " T X t = t δ t ≤ E t − " υ δ t + C T X t = t +1 ∆ t ! . (33)For the base case where t = T Eq. (33) trivially holds because υ ≤ . Now suppose Eq. (33) holds for any t > t and consider time step t . When a t A ( ǫ ) , √ ζ ǫ , applyingLemma 3.2 we get η ( θ ⋆ , a t +1 ) ≥ η ( θ ⋆ , a t ) + υ − C E t [∆ t +1 ] . By basic algebra we get, δ t +1 ≤ δ t − υ + C E t [∆ t +1 ] . (34)23s a result, E t − " T X t = t δ t = E t − " δ t + T X t = t +1 δ t ≤ E t − " δ t + 1 υ δ t +1 + C T X t = t +2 ∆ t ! (By induction hypothesis) ≤ E t − " δ t − υ δ t + C ∆ t +1 + C T X t = t +2 ∆ t ! (By Eq. (34)) ≤ E t − " υ δ t + C T X t = t +1 ∆ t ! . On the other hand, when a t ∈ A ( ǫ ) , √ ζ ǫ we have η ( θ ⋆ , a t +1 ) ≥ E θ t [ η ( θ t +1 , a t +1 ) − ∆ t +1 , ] ≥ E θ t [ η ( θ t +1 , a t ) − ∆ t +1 , ] (By the optimality of a t +1 ) ≥ E θ t [ η ( θ ⋆ , a t ) − ∆ t +1 , − ∆ t +1 , ] ≥ η ( θ ⋆ , a t ) − C E θ t [∆ t +1 ] . Note that since a t ∈ A ( ǫ ) , √ ζ ǫ , we have δ t ≤ . As a result, E t − " T X t =0 δ t ≤ E t − " δ t + 1 υ δ t +1 + C T X t = t +2 ∆ t ! (By induction hypothesis) ≤ E t − " υ δ t + C ∆ t +1 + C T X t = t +2 ∆ t ! ≤ E t − " υ δ t + C T X t = t +1 ∆ t ! . Combining the two cases together we prove Eq. (33). It follows that E h REG ǫ, √ ζ ǫ ( T ) i = E " T X t =1 δ t ≤ E " υ δ + C T X t =1 ∆ t ! (35) ≤ υ C E vuut T T X t =1 ∆ t ≤ υ C vuut T E " T X t =1 ∆ t . (36)Note that when realizability holds, we have inf θ P Tt =1 ℓ (( x t , y t ); θ ) = 0 . Therefore, by Lemma 3.3 and thedefinition of online learning regret (see Eq. (2)) we have E h REG ǫ, √ ζ ǫ ( T ) i ≤ υ C vuut T E " T X t =1 ˜∆ t ≤ υ (cid:16) C p T R T (cid:17) . (37)24 .4 Instantiations of Theorem 3.1 In this section we rigorously prove the instantiations discussed in Section 3.
Linear bandit with finite model class.
Recall that the linear bandit reward is given by η ( θ, a ) = h θ, a i ,and the constrained reward is ˜ η ( θ, a ) = η ( θ, a ) − k a k . In order to deal with ℓ regularization which violates Assumption 2.1, we bound the set of actionsAlg. 1 takes. Consider the regularized reward ˜ η ( θ, a ) . When k a k > we have ˜ η ( θ, a ) < . Thereforethe set of actions taken by Alg. 1 satisfies k a t k ≤ for all t . Because we only apply Lemma 3.2 andLemma 3.3 to actions that is taken by the algorithm, Theorem 3.1 holds even if Assumption 2.1 is satisfiedlocally for k a k . . Since the gradient and Hessian of regularization term is a and I d respectively, wehave k∇ a ˜ η ( θ, a ) k . k∇ a η ( θ, a ) k + 1 and (cid:13)(cid:13) ∇ a ˜ η ( θ, a ) (cid:13)(cid:13) sp . (cid:13)(cid:13) ∇ a η ( θ, a ) (cid:13)(cid:13) sp + 1 when k a k . , whichverifies Assumption 2.1.Note that ∇ a ˜ η ( θ, a ) = θ − a. As a result, for any a ∈ A ( ǫ, we have k θ ⋆ − a k ≤ ǫ, which means that η ( θ ⋆ , a ) ≥ η ( θ ⋆ , a ⋆ ) − ǫ. It follows directly that the standard regret
REG ( T ) for linear bandit is boundedby the local regret REG ǫ, √ ζ ǫ plus an extra ǫT error term. Recall that Theorem 3.1 states REG ǫ, √ ζ ǫ ( T ) . ǫ − p T R T . (38)As a immediate corollary, we can bound the linear bandit regret by REG ( T ) . ǫ − p T R T + 2 ǫT. (39)Since the loss function ℓ is uniformly bounded by v = κ + κ + 4 . By Rakhlin et al. [2011], for finitehypothesis we have R T ≤ v p T log | Θ | . By choosing ǫ = T − / we get REG ( T ) . T / (log | Θ | ) / ,which proves our claim. Linear bandit with sparse or structured model vectors.
In this case, the reduction is exactly the sameas that in linear bandit. In the following we prove that the sparse linear hypothesis has a small coveringnumber. Note that the log | Θ | regret bound fits perfectly with the covering number technique. That is, we candiscretize the hypothesis Θ by finding a / poly( dT ) -covering of the loss function L = { ℓ ( · , θ ) : θ ∈ Θ } .And then the regret of our algorithm depends polynomially on the log-covering number. Since the log-covering number of the set of s -sparse vectors is bounded by O ( s log( dT )) , we get the desired result. Two-layer neural network.
Recall that a two-layer neural network is defined by η (( W , W ) , a ) = W σ ( W a ) ,where σ is the activation function. For a matrix W ∈ R m × d , the (1 , ∞ ) -norm is definedby Ax i ∈ [ m ] P dj =1 | [ W ] i,j | . We make the following assumptions regarding the activation function.
Assumption B.1.
For any x, y ∈ R , the activation function σ ( · ) satisfies sup x | σ ( x ) | ≤ , sup x (cid:12)(cid:12) σ ′ ( x ) (cid:12)(cid:12) ≤ , sup x (cid:12)(cid:12) σ ′′ ( x ) (cid:12)(cid:12) ≤ , (40) (cid:12)(cid:12) σ ′′ ( x ) − σ ′′ ( y ) (cid:12)(cid:12) ≤ | x − y | . (41)The following theorem summarized our result in this setting.25 heorem B.2. Let
Θ = { ( W , W ) : k W k ≤ , k W k , ∞ ≤ } be the parameter hypothesis. Under thesetting of Theorem 3.1 with Assumption B.1, the local regret of Alg. 1 running on two-layer neural networkscan be bounded by e O (cid:0) ǫ − T / (cid:1) . In addition, if the neural network is input concave, then the global regretof Alg. 1 is bounded by e O (cid:0) T / (cid:1) . Proof.
We prove the theorem by first bounding the sequential Rademacher complexity of the loss function,and then applying Theorem 3.1. Let θ = ( W , W ) . Recall that u ⊙ v denotes the element-wise product. Bybasic algebra we get, h∇ a η ( θ, a ) , u i = W (cid:0) σ ′ ( W a ) ⊙ W u (cid:1) , (42) u ⊤ ∇ a η ( θ, a ) v = W (cid:0) σ ′′ ( W a ) ⊙ W u ⊙ W v (cid:1) . (43)First of all, we verify that the regularized reward ˜ η ( θ, a ) , η ( θ, a ) − k a k satisfies Assumption 2.1. Indeedwe have k∇ a η ( θ, a ) k = sup u ∈ S d − h∇ a η ( θ, a ) , u i ≤ , (44) (cid:13)(cid:13) ∇ a η ( θ, a ) (cid:13)(cid:13) sp = sup u,v ∈ S d − u ⊤ ∇ a η ( θ, a ) v ≤ , (45) (cid:13)(cid:13) ∇ a η ( θ, a ) − ∇ a η ( θ, a ) (cid:13)(cid:13) sp = sup u,v ∈ S d − W (cid:0)(cid:0) σ ′′ ( W a ) − σ ′′ ( W a ) (cid:1) ⊙ W u ⊙ W v (cid:1) ≤ . (46)Observe that | η ( θ, a ) | ≤ k a k ∞ , we have ˜ η ( θ, a ) < when k a k > . As a result, action a t taken by Alg. 1satisfies k a t k ≤ for all t . Since the gradient and Hessian of regularization term is a and I d respectively,we have k∇ a ˜ η ( θ, a ) k . k∇ a η ( θ, a ) k + 1 and (cid:13)(cid:13) ∇ a ˜ η ( θ, a ) (cid:13)(cid:13) sp . (cid:13)(cid:13) ∇ a η ( θ, a ) (cid:13)(cid:13) sp + 1 . It follows thatAssumption 2.1 holds with constant Lipschitzness for actions a such that k a k . .In the following we bound the sequential Rademacher complexity of the loss function. By Rakhlin et al.[2015a, Proposition 15], we can bound the sequential Rademacher complexity of ∆ t, and ∆ t, by e O (cid:0) √ T log d (cid:1) . Next we turn to higher order terms.First of all, because the (1 , ∞ ) norm of W is bounded, we have k W u k ∞ ≤ k u k ∞ . It follows from theboundness of σ ′ ( x ) that k σ ′ ( W a ) ⊙ W u k ∞ ≤ k u k ∞ . Therefore we get h∇ a η ( θ, a ) , u i ≤ k W k (cid:13)(cid:13) σ ′ ( W a ) ⊙ W u (cid:13)(cid:13) ∞ ≤ k u k ∞ . (47)Similarly, we get u ⊤ ∇ a η ( θ, a ) v ≤ k u k ∞ k v k ∞ . (48)Let B = k u k ∞ (1 + k v k ∞ ) for shorthand. We consider the error term ∆ t, = ( h∇ a η ( θ, a ) , u i − [ y t ] ) . Let G be the function class { ( h∇ a η ( θ, a ) , u i − [ y t ] ) : θ ∈ Θ } , and G = {h∇ a η ( θ, a ) , u i : θ ∈ Θ } . Applying Rakhlin et al. [2015a, Lemma 4] we get R T ( G ) . B log / ( T ) R T ( G ) . Define G = { σ ′ ( w ⊤ a ) · w ⊤ u : w ∈ R d , k w k ≤ } . In the following we show that R T ( G ) . R T ( G ) . For any sequence u , · · · , u T and A -valued tree a , we have R T ( G ) = E ǫ sup W : k W k ≤ g , ··· ,g w ∈G T X t =1 ǫ t w X j =1 [ W ] j g j ( a ( ǫ )) (49)26 E ǫ sup W : k W k ≤ g , ··· ,g w ∈G k W k Ax j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T X t =1 ǫ t ( g j ( a ( ǫ ))) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (50) ≤ E ǫ " sup g ∈G (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T X t =1 ǫ t ( g j ( a ( ǫ ))) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . (51)Since we have ∈ G by taking w = 0 , by symmetricity we have E ǫ " sup g ∈G (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T X t =1 ǫ t ( g j ( a ( ǫ ))) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E ǫ " sup g ∈G T X t =1 ǫ t ( g j ( a ( ǫ ))) = 2 R T ( G ) . (52)Now we bound R T ( G ) by applying the composition lemma of sequential Rademacher complexity(namely Rakhlin et al. [2015a, Lemma 4]). First of all we define a relaxed function hypothesis G = { σ ′ (( w ′ ) ⊤ a ) · w ⊤ u : w , w ′ ∈ R d , k w k ≤ , k w ′ k ≤ } . Since G ⊂ G we have R T ( G ) ≤ R T ( G ) . Note that we have (cid:12)(cid:12) σ ′ ( w ⊤ a ) (cid:12)(cid:12) ≤ and w ⊤ u ≤ k u k ∞ . 
Let φ ( x, y ) = xy , which is (3 c ) -Lipschitz for | x | , | y | ≤ c. Define G = { σ ′ ( w ⊤ a ) : w ∈ R d , k w k ≤ } and G = { w ⊤ u : w ∈ R d , k w k ≤ } .Rakhlin et al. [2015a, Lemma 4] gives R T ( G ) . B log / ( T )( R T ( G ) + R T ( G )) . Note that G is a gen-eralized linear hypothesis and G is linear, we have R T ( G ) . B log / ( T ) p T log( d ) and R T ( G ) . B p T log( d ) .In summary, we get R T ( G ) = O (cid:16) poly( B )polylog( d, T ) √ T (cid:17) . Since the input u t ∼ N (0 , I d × d ) , wehave B . log( dT ) with probability /T. As a result, the distribution dependent Rademacher complexity of ˜∆ t, in this case is bounded by O (cid:16) poly( B )polylog( d, T ) √ T (cid:17) .Similarly, we can bound the sequential Rademacher complexity of the Hessian term ˜∆ t, by O (cid:16) poly( B )polylog( d, T ) √ T (cid:17) by applying composition lemma with Lipschitz function φ ( x, y, z ) = xyz with bounded | x | , | y | , | z | . As a result, the sequential Rademacher complexity of the loss function can bebounded by R T = O (cid:16) poly( B )polylog( d, T ) √ T (cid:17) . Applying Theorem 3.1, the local regret of Alg. 1 is bounded by e O (cid:0) ǫ − T / (cid:1) . When the neural network is input concave (see Amos et al. [2017]), the regularized reward ˆ η ( θ, a ) is Ω(1) -strongly concave. As a result, for any a ∈ A ǫ, we have η ( θ ⋆ , a ) ≥ η ( θ ⋆ , a ⋆ ) − O ( ǫ ) . It follows that,
REG ( T ) = e O (cid:16) ǫ − T / + ǫ T (cid:17) . (53)By letting ǫ = T − / we get REG ( T ) = e O (cid:0) T / (cid:1) . C Missing Proofs in Section 4
First of all, we present our algorithm in Alg. 2.In the following we present the proof sketch for Theorem 4.3. Compare to the bandit case, we onlyneed to prove an analog of Lemma 3.3, which means that we need to upper-bound the error term ∆ t by the27 lgorithm 2 Vi rtual Ascent with O nline Model L earner (ViOL) (for RL) Let H = ∅ ; choose a ∈ A arbitrarily. for t = 1 , , · · · do Run R on H t − with loss function ℓ (defined in Eq. (14)) and obtain p t = A ( H t − ) . ψ t ← argmax ψ E θ t ∼ p t [ η ( θ t , ψ )] ; Sample one trajectory τ t from policy π ψ t , and one trajectory τ ′ t from policy π ψ t − . Update H t ← H t − ∪ { ( τ, τ ′ ) } difference of dynamics, as discussed before. Formally speaking, let τ t = ( s , a , · · · , s H , a H ) be a trajectorysampled from policy π ψ t under the ground-truth dynamics T θ ⋆ . By telescope lemma (Lemma D.10) we get V ψθ ( s ) − V ψθ ⋆ ( s ) = E τ ∼ ρ ψθ⋆ " H X h =1 (cid:16) V ψθ ( T θ ( s h , a h )) − V ψθ ( T θ ⋆ ( s h , a h )) (cid:17) . (54)Lipschitz assumption (Assumption 4.1) yields, (cid:12)(cid:12)(cid:12) V ψθ ( T θ ( s h , a h )) − V ψθ ( T θ ⋆ ( s h , a h )) (cid:12)(cid:12)(cid:12) ≤ L k T θ ( s h , a h ) − T θ ⋆ ( s h , a h )) k . (55)Combining Eq. (54) and Eq. (55) and apply Cauchy-Schwartz inequality gives an upper bound for [∆ t ] and [∆ t ] . As for the gradient term, we will take gradient w.r.t. ψ to both sides of Eq. (54). The gradient insideexpectation can be dealt with easily. And the gradient w.r.t. the distribution ρ ψθ ⋆ can be computed by policygradient lemma (Lemma D.11). As a result we get ∇ ψ V ψθ ( s ) − ∇ ψ V ψθ ⋆ ( s )= E τ ∼ ρ ψθ⋆ " H X h =1 ∇ ψ log π ψ ( a h | s h ) ! H X h =1 (cid:16) V ψθ ( T θ ( s h , a h )) − V ψθ ( T θ ⋆ ( s h , a h )) (cid:17)! + E τ ∼ ρ ψθ⋆ " H X h =1 (cid:16) ∇ ψ V ψθ ( T θ ( s h , a h )) − ∇ ψ V ψθ ( T θ ⋆ ( s h , a h )) (cid:17) . (56)The first term can be bounded by vector-form Cauchy-Schwartz and Assumption 4.2, and the second termis bounded by Assumption 4.1. Similarly, this approach can be extended to second order term. As a result,we have the following lemma. Lemma C.1.
Under the setting of Theorem 4.3, we have c E τ t ,τ ′ t ,θ t (cid:2) ¯∆ t (cid:3) ≥ E θ t (cid:2) ∆ t (cid:3) . (57)Proof of Lemma C.1 is shown in Appendix C.1. Proof of Theorem 4.3 is exactly the same as that ofTheorem 3.1 except for replacing Lemma 3.3 with Lemma C.1. C.1 Proof of Lemma C.1
Proof.
The lemma is proven by combining standard telescoping lemma and policy gradient lemma. Specifi-cally, let ρ πT be the distribution of trajectories generated by policy π and dynamics T . By telescoping lemma(Lemma D.10) we have, V ψ t θ t ( s ) − V ψ t θ ⋆ ( s ) = E τ ∼ ρ ψtθ⋆ " H X h =1 (cid:16) V ψ t θ t ( T θ t ( s h , a h )) − V ψ t θ t ( T θ ⋆ ( s h , a h )) (cid:17) . (58)28y the Lipschitz assumption (Assumption 4.1), (cid:12)(cid:12)(cid:12) V ψ t θ t ( T θ t ( s h , a h )) − V ψ t θ t ( T θ ⋆ ( s h , a h )) (cid:12)(cid:12)(cid:12) ≤ L k T θ t ( s h , a h ) − T θ ⋆ ( s h , a h )) k . (59)Consequently ∆ t, = (cid:16) V ψ t θ t ( s ) − V ψ t θ ⋆ ( s ) (cid:17) ≤ HL E τ ∼ ρ ψtθ⋆ " H X h =1 k T θ t ( s h , a h ) − T θ ⋆ ( s h , a h )) k . (60)Similarly we get, ∆ t, = (cid:16) V ψ t − θ t ( s ) − V ψ t − θ ⋆ ( s ) (cid:17) ≤ HL E τ ∼ ρ ψt − θ⋆ " H X h =1 k T θ t ( s h , a h ) − T θ ⋆ ( s h , a h )) k . (61)Now we turn to higher order terms. First of all, by Hölder inequality and Assumption 4.2, we can provethe following:• (cid:13)(cid:13)(cid:13)(cid:13) E τ ∼ ρ ψθ⋆ (cid:20)(cid:16)P Hh =1 ∇ ψ log π ψ ( a h | s h ) (cid:17)(cid:16)P Hh =1 ∇ ψ log π ψ ( a h | s h ) (cid:17) ⊤ (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) sp ≤ H χ g , ∀ ψ ∈ Ψ; • (cid:13)(cid:13)(cid:13)(cid:13) E τ ∼ ρ ψθ⋆ (cid:20)(cid:16)P Hh =1 ∇ ψ log π ψ ( a h | s h ) (cid:17) ⊗ (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) sp ≤ H χ f , ∀ ψ ∈ Ψ; • (cid:13)(cid:13)(cid:13)(cid:13) E τ ∼ ρ ψθ⋆ (cid:20)(cid:16)P Hh =1 ∇ ψ log π ψ ( a h | s h ) (cid:17)(cid:16)P Hh =1 ∇ ψ log π ψ ( a | s ) (cid:17) ⊤ (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) sp ≤ H χ h , ∀ ψ ∈ Ψ . Indeed, consider the first statement. Define g h = ∇ ψ log π ψ ( a h | s h ) for shorthand. Then we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E τ ∼ ρ ψθ⋆ H X h =1 g h ! H X h =1 g h ! ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) sp = sup u ∈ S d − u ⊤ E τ ∼ ρ ψθ⋆ H X h =1 g h ! H X h =1 g h ! ⊤ u (62) = sup u ∈ S d − E τ ∼ ρ ψθ⋆ * u, H X h =1 g h !+ ≤ sup u ∈ S d − E τ ∼ ρ ψθ⋆ " H H X h =1 h u, g h i (63) ≤ E τ ∼ ρ ψθ⋆ " H H X h =1 sup u ∈ S d − h u, g h i = E τ ∼ ρ ψθ⋆ " H H X h =1 (cid:13)(cid:13)(cid:13) gg ⊤ (cid:13)(cid:13)(cid:13) sp ≤ H χ g . (64)Similarly we can get the second and third statement.For any fixed ψ and θ we have V ψθ ( s ) − V ψθ ⋆ ( s ) = E τ ∼ ρ ψθ⋆ " H X h =1 (cid:16) V ψθ ( T θ ( s h , a h )) − V ψθ ( T θ ⋆ ( s h , a h )) (cid:17) . (65)Applying policy gradient lemma (namely, Lemma D.11) to RHS of Eq. (65) we get, ∇ ψ V ψθ ( s ) − ∇ ψ V ψθ ⋆ ( s ) E τ ∼ ρ ψθ⋆ " H X h =1 ∇ ψ log π ψ ( a h | s h ) ! H X h =1 (cid:16) V ψθ ( T θ ( s h , a h )) − V ψθ ( T θ ⋆ ( s h , a h )) (cid:17)! + E τ ∼ ρ ψθ⋆ " H X h =1 (cid:16) ∇ ψ V ψθ ( T θ ( s h , a h )) − ∇ ψ V ψθ ( T θ ⋆ ( s h , a h )) (cid:17) . (66)Define the following shorthand: G ψθ ( s, a ) = V ψθ ( T θ ( s, a )) − V ψθ ( T θ ⋆ ( s, a )) , (67) f = H X h =1 ∇ ψ log π ψ ( a h | s h ) . (68)In the following we also omit the subscription in E τ ∼ ρ ψθ⋆ when the context is clear. It followed by Eq. (66)that (cid:13)(cid:13)(cid:13) ∇ ψ V ψθ ( s ) − ∇ ψ V ψθ ⋆ ( s ) (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E " f H X h =1 G ψθ ( s h , a h ) ! + 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E " H X h =1 ∇ ψ G ψθ ( s h , a h ) ≤ (cid:13)(cid:13)(cid:13) E h f f ⊤ i(cid:13)(cid:13)(cid:13) sp E H X h =1 G ψθ ( s h , a h ) ! 
+ 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E " H X h =1 ∇ ψ G ψθ ( s h , a h ) (By Lemma D.7) ≤ H (cid:13)(cid:13)(cid:13) E h f f ⊤ i(cid:13)(cid:13)(cid:13) sp E " H X h =1 G ψθ ( s h , a h ) + 2 H E " H X h =1 (cid:13)(cid:13)(cid:13) ∇ ψ G ψθ ( s h , a h ) (cid:13)(cid:13)(cid:13) . Now, plugin ψ = ψ t − , θ = θ t and apply Assumption 4.1 we get ∆ t, = (cid:13)(cid:13)(cid:13) ∇ ψ V ψ t − θ t ( s ) − ∇ ψ V ψ t − θ ⋆ ( s ) (cid:13)(cid:13)(cid:13) ≤ (2 HL + 2 H χ g L ) E τ ∼ ρ ψt − θ⋆ " H X h =1 k T θ t ( s h , a h ) − T θ ⋆ ( s h , a h ) k . For any fixed ψ, θ , define the following shorthand: g = H X h =1 (cid:16) V ψθ ( T θ ( s h , a h )) − V ψθ ( T θ ⋆ ( s h , a h )) (cid:17) . (69)Apply policy gradient lemma again to RHS of Eq. (66) we get ∇ ψ V ψθ ( s ) − ∇ ψ V ψθ ⋆ ( s )= E h ( ∇ ψ g ) f ⊤ i + E h f ( ∇ ψ g ) ⊤ i + E (cid:2) ∇ ψ g (cid:3) + E " g H X h =1 ∇ ψ log π ψ ( a h | s h ) ! + E h g (cid:16) f f ⊤ (cid:17)i . As a result of Lemma D.8 and Lemma D.9 that, (cid:13)(cid:13)(cid:13) ∇ ψ V ψθ ( s ) − ∇ ψ V ψθ ⋆ ( s ) (cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13) E h ( ∇ ψ g ) f ⊤ i + E h f ( ∇ ψ g ) ⊤ i(cid:13)(cid:13)(cid:13) + 4 (cid:13)(cid:13) E (cid:2) ∇ ψ g (cid:3)(cid:13)(cid:13) + 4 (cid:13)(cid:13)(cid:13) E h g (cid:16) f f ⊤ (cid:17)i(cid:13)(cid:13)(cid:13) + 4 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E " g H X h =1 ∇ ψ log π ψ ( a h | s h ) ! ≤ u,v ∈ S d − E [ h∇ ψ g, u i h f, v i ] + 4 E h(cid:13)(cid:13) ∇ ψ g (cid:13)(cid:13) i + 4 E (cid:2) g (cid:3) (cid:13)(cid:13) E (cid:2) f ⊗ (cid:3)(cid:13)(cid:13) sp + 4 E (cid:2) g (cid:3) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E H X h =1 ∇ ψ log π ψ ( a h | s h ) ! H X h =1 ∇ ψ log π ψ ( a h | s h ) ! ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) sp . (70)Note that by Hölder’s inequality, sup u,v ∈ S d − E [ h∇ ψ g, u i h f, v i ] ≤ sup u,v ∈ S d − E h h∇ ψ g, u i i E h h f, v i i ≤ E h k∇ ψ g k i (cid:13)(cid:13)(cid:13) E h f f ⊤ i(cid:13)(cid:13)(cid:13) sp . By Assumption 4.1 we get, E [ g ] = E H X h =1 (cid:16) V ψθ ( T θ ( s h , a h )) − V ψθ ( T θ ⋆ ( s h , a h )) (cid:17)! (71) ≤ H E " H X h =1 (cid:16) V ψθ ( T θ ( s h , a h )) − V ψθ ( T θ ⋆ ( s h , a h )) (cid:17) (72) ≤ HL E " H X h =1 k T θ ( s h , a h ) − T θ ⋆ ( s h , a h ) k . (73)Similarly, we have E [ k∇ ψ g k ] ≤ HL E " H X h =1 k T θ ( s h , a h ) − T θ ⋆ ( s h , a h ) k , (74) E [ (cid:13)(cid:13) ∇ ψ g (cid:13)(cid:13) ] ≤ HL E " H X h =1 k T θ ( s h , a h ) − T θ ⋆ ( s h , a h ) k . (75)Combining with Eq. (70) we get, ∆ t, = (cid:13)(cid:13)(cid:13) ∇ ψ V ψ t θ t ( s ) − ∇ ψ V ψ t θ ⋆ ( s ) (cid:13)(cid:13)(cid:13) ≤ (cid:0) H L χ g + 4 HL + 4 L ( H χ h + H χ f ) (cid:1) E τ ∼ ρ ψt − θ⋆ " H X h =1 k T θ t ( s h , a h ) − T θ ⋆ ( s h , a h ) k . By noting that ∆ t = P i =1 ∆ i,t , we get the desired upper bound. D Helper Lemmas
In this section, we list helper lemmas that are used in previous sections.31 .1 Helper Lemmas on Probability Analysis
The following lemma provides a concentration inequality on the norm of linear transformation of a Gaussianvector, which is used to prove Lemma D.3.
Lemma D.1 (Theorem 1 of Hsu et al. [2012]) . For v ∼ N (0 , I ) be a n dimensional Gaussian vector, and A ∈ R n × n . Let Σ = A ⊤ A , then ∀ t > , Pr h k Av k ≥ Tr(Σ) + 2 p Tr(Σ ) t + 2 k Σ k op t i ≤ exp( − t ) . (76) Corollary D.2.
Under the same settings of Lemma D.1, ∀ t > , Pr h k Av k ≥ k A k + 4 k A k t i ≤ exp( − t ) . (77) Proof.
Let λ i be the i -th eigenvalue of Σ . By the definition of Σ we have λ i ≥ . Then we have Tr(Σ) = n X i =1 λ i = k A k , Tr(Σ ) = n X i =1 λ i ≤ n X i =1 λ i ! = k A k , k Σ k op = Ax i ∈ [ n ] λ i ≤ n X i =1 λ i = k A k . Plug in Eq. (76), we get the desired equation.Next lemma proves a concentration inequality on which Lemma 3.3 relies.
Lemma D.3.
Given a symmetric matrix H , let u, v ∼ N (0 , I ) be two independent random vectors, we have ∀ t ≥ , Pr h ( u ⊤ Hv ) ≥ t k H k i ≤ −√ t/ . (78) Proof.
Condition on v , u ⊤ Hv is a Gaussian random variable with mean zero and variance k Hv k . There-fore we have, ∀ v, Pr (cid:20)(cid:16) u ⊤ Hv (cid:17) ≥ √ t k Hv k (cid:21) ≤ exp( −√ t/ . (79)By Corollary D.2 and basic algebra we get, Pr h k Hv k ≥ √ t k H k i ≤ −√ t/ . (80)Consequently, E h I h ( u ⊤ Hv ) ≥ t k H k ii ≤ E h I h ( u ⊤ Hv ) ≥ √ t k Hv k or k Hv k ≥ √ t k H k ii ≤ E h I h ( u ⊤ Hv ) ≥ √ t k Hv k i | v i + E h I h k Hv k ≥ √ t k H k ii ≤ −√ t/ . (Combining Eq. (79) and Eq. (80))32he next two lemmas are dedicated to prove anti-concentration inequalities that is used in Lemma 3.3. Lemma D.4 (Lemma 1 of Laurent and Massart [2000]) . Let ( y , · · · , y n ) be i.i.d. N (0 , Gaussian vari-ables. Let a = ( a , · · · , a n ) be non-negative coefficient. Let k a k = n X i =1 a i . Then for any positive t , Pr n X i =1 a i y i ≤ n X i =1 a i − k a k √ t ! ≤ exp( − t ) . (81) Lemma D.5.
Given a symmetric matrix H ∈ R n × n , let u, v ∼ N (0 , I ) be two independent random vectors.Then Pr (cid:20) ( u ⊤ Hv ) ≥ k H k (cid:21) ≥ . (82) Proof.
Since u, v are independent, by the isotropy of Guassian vectors we can assume that H =diag( λ , · · · , λ n ) . Note that condition on v , u ⊤ Hv is a Gaussian random variable with mean zero andvariance k Hv k . As a result, ∀ v, Pr (cid:20)(cid:16) u ⊤ Hv (cid:17) ≥ k Hv k | v (cid:21) ≥ . (83)On the other hand, k Hv k = P ni =1 λ i v i . Invoking Lemma D.4 we have Pr (cid:20) k Hv k ≥ k H k (cid:21) ≥ Pr k Hv k ≥ k H k − vuut n X i =1 λ i = Pr n X i =1 λ i v i ≥ n X i =1 λ i − vuut n X i =1 λ i (By definition) ≥ − exp( − / ≥ . (84)Combining Eq. (83) and Eq. (84) we get, Pr (cid:20) ( u ⊤ Hv ) ≥ k H k (cid:21) ≥ Pr (cid:20) ( u ⊤ Hv ) ≥ k Hv k , k Hv k ≥ k H k (cid:21) ≥ . The following lemma justifies the cap in the loss function.
Lemma D.6.
Given a symmetric matrix H , let u, v ∼ N (0 , I ) be two independent random vectors. Let κ , c ∈ R + be two numbers satisfying κ ≥ √ c , then min (cid:16) c , k H k (cid:17) ≤ E (cid:20) min (cid:18) κ , (cid:16) u ⊤ Hv (cid:17) (cid:19)(cid:21) . (85) Proof.
Let x = (cid:0) u ⊤ Hv (cid:1) for simplicity. Consider the following two cases:33 ase 1: k H k F ≤ κ / . In this case we exploit the tail bound of random variable x . Specifically, E (cid:20)(cid:16) u ⊤ Hv (cid:17) (cid:21) − E (cid:20) min (cid:18) κ , (cid:16) u ⊤ Hv (cid:17) (cid:19)(cid:21) = Z ∞ κ Pr [ x ≥ t ] dt ≤ Z ∞ κ exp − s t k H k ! dt (By Lemma D.3) = 24 exp (cid:18) − κ k H k F (cid:19) k H k F ( κ + 4 k H k F ) ≤
48 exp (cid:18) − κ k H k F (cid:19) k H k F κ ( k H k F ≤ κ in this case) ≤ · k H k F κ k H k F κ ( exp( − x ) ≤ x when x ≥ ) ≤ k H k . As a result, E (cid:20) min (cid:18) κ , (cid:16) u ⊤ Hv (cid:17) (cid:19)(cid:21) ≥ E (cid:20)(cid:16) u ⊤ Hv (cid:17) (cid:21) − k H k k H k . (86) Case 2: k H k F > κ / . In this case, we exploit the anti-concentration result of random variable x . Notethat by the choice of κ , we have k H k F > κ /
40 = ⇒ k H k ≥ c . As a result, E (cid:20) min (cid:18) κ , (cid:16) u ⊤ Hv (cid:17) (cid:19)(cid:21) ≥ c Pr (cid:20) min (cid:18) κ , (cid:16) u ⊤ Hv (cid:17) (cid:19) ≥ c (cid:21) ≥ c Pr (cid:20)(cid:16) u ⊤ Hv (cid:17) ≥ c (cid:21) (By definition of κ ) ≥ c Pr (cid:20)(cid:16) u ⊤ Hv (cid:17) ≥ k H k (cid:21) ≥ c . (By Lemma D.5)Therefore, in both cases we get E (cid:20) min (cid:18) κ , (cid:16) u ⊤ Hv (cid:17) (cid:19)(cid:21) ≥
12 min (cid:16) c , k H k (cid:17) , (87)which proofs Eq. (85). 34ollowing lemmas are analogs to Cauchy-Schwartz inequality (in vector/matrix forms), which are usedto prove Lemma C.1 for reinforcement learning case. Lemma D.7.
For a random vector x ∈ R d and random variable r , we have k E [ rx ] k ≤ (cid:13)(cid:13)(cid:13) E h xx ⊤ i(cid:13)(cid:13)(cid:13) op E (cid:2) r (cid:3) . (88) Proof.
Note that for any vector g ∈ R d , k g k = sup u ∈ S d − h u, g i . As a result, k E [ rx ] k = sup u ∈ S d − h u, E [ rx ] i = sup u ∈ S d − E [ r h u, x i ] ≤ sup u ∈ S d − E h h u, x i i E (cid:2) r (cid:3) (Hölder Ineqaulity) = (cid:13)(cid:13)(cid:13) E h xx ⊤ i(cid:13)(cid:13)(cid:13) op E (cid:2) r (cid:3) . Lemma D.8.
For a symmetric random matrix H ∈ R d × d and random variable r , we have k E [ rH ] k ≤ (cid:13)(cid:13)(cid:13) E h HH ⊤ i(cid:13)(cid:13)(cid:13) sp E (cid:2) r (cid:3) . (89) Proof.
Note that for any matrix G ∈ R d , k H k = sup u,v ∈ S d − (cid:0) u ⊤ G v (cid:1) . As a result, k E [ rH ] k = sup u,v ∈ S d − (cid:16) u ⊤ E [ rH ] v (cid:17) = sup u,v ∈ S d − E h r (cid:16) u ⊤ Hv (cid:17)i ≤ sup u,v ∈ S d − E (cid:20)(cid:16) u ⊤ Hv (cid:17) (cid:21) E (cid:2) r (cid:3) (Hölder Ineqaulity) = sup u,v ∈ S d − E h u ⊤ Hvv ⊤ H ⊤ u i E (cid:2) r (cid:3) ≤ sup u ∈ S d − E h u ⊤ HH ⊤ u i E (cid:2) r (cid:3) = (cid:13)(cid:13)(cid:13) E h HH ⊤ i(cid:13)(cid:13)(cid:13) sp E (cid:2) r (cid:3) . Lemma D.9.
For a random matrix x ∈ R d and a positive random variable r , we have (cid:13)(cid:13)(cid:13) E h rxx ⊤ i(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13) E [ x ⊗ ] (cid:13)(cid:13) sp E (cid:2) r (cid:3) . (90) Proof.
Since r is non-negative, we have E (cid:2) rxx ⊤ (cid:3) (cid:23) . As a result, (cid:13)(cid:13)(cid:13) E h rxx ⊤ i(cid:13)(cid:13)(cid:13) sp = sup u ∈ S d − u ⊤ E h rxx ⊤ i u. It follows that (cid:13)(cid:13)(cid:13) E h rxx ⊤ i(cid:13)(cid:13)(cid:13) = sup u ∈ S d − (cid:16) u ⊤ E h rxx ⊤ i u (cid:17) = sup u ∈ S d − E h r h u, x i i sup u ∈ S d − E h h u, x i i E (cid:2) r (cid:3) (Hölder Ineqaulity) = sup u ∈ S d − (cid:10) u ⊗ , E [ x ⊗ ] (cid:11) E (cid:2) r (cid:3) = (cid:13)(cid:13) E (cid:2) x ⊗ (cid:3)(cid:13)(cid:13) sp E (cid:2) r (cid:3) . D.2 Helper Lemmas on Reinforcement Learning
Lemma D.10 (Telescoping or Simulation Lemma, see Luo et al. [2019], Agarwal et al. [2019]) . For anypolicy π and deterministic dynamical model T, ˆ T , we have V π ˆ T ( s ) − V πT ( s ) = E τ ∼ ρ πT " H X h =1 (cid:16) V π ˆ T ( ˆ T ( s h , a h )) − V π ˆ T ( T ( s h , a h )) (cid:17) . (91) Lemma D.11 (Policy Gradient Lemma, see Sutton and Barto [2011]) . For any policy π ψ , deterministicdynamical model T and reward function r ( s h , a h ) , we have ∇ ψ V π ψ T = E τ ∼ ρ πψT " H X h =1 ∇ ψ log π ψ ( a h | s h ) ! H X h =1 r ( s h , a h ) ! (92) Proof.