Representation of Reinforcement Learning Policies in Reproducing Kernel Hilbert Spaces
Bogdan Mazoure, Thang Doan, Tianyu Li, Vladimir Makarenkov, Joelle Pineau, Doina Precup, Guillaume Rabusseau
McGill University; Mila - Quebec Artificial Intelligence Institute; Université du Québec à Montréal; Facebook AI Research; CIFAR AI chair; DeepMind; Université de Montréal. ∗Equal contribution (Bogdan Mazoure, Thang Doan). Correspondence to: Bogdan Mazoure.
Abstract
We propose a general framework for policy representation in reinforcement learning tasks. This framework involves finding a low-dimensional embedding of the policy on a reproducing kernel Hilbert space (RKHS). The usage of RKHS-based methods allows us to derive strong theoretical guarantees on the expected return of the reconstructed policy. Such guarantees are typically lacking in black-box models, but are very desirable in tasks requiring stability. We conduct several experiments on classic RL domains. The results confirm that the policies can be robustly embedded in a low-dimensional space while the embedded policy incurs almost no decrease in return.
In the reinforcement learning (RL) framework, the goal of a rational agent consists in maximizing the expected rewards in a dynamical system by finding a suitable conditional distribution known as a policy. This conditional distribution can be found using policy iteration and any suitable function approximator, ranging from linear models to neural networks [Sutton and Barto, 2018, Lillicrap et al., 2015]. Although neural networks are powerful function approximators, such black-box models typically lack guarantees on the behaviour of the policies they represent, which motivates the policy representation studied in this work.
We consider a problem modeled by a discounted Markov Decision Process (MDP) $M := \langle \mathcal{S}, \mathcal{A}, r, P, S_0, \gamma \rangle$, where $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; given states $s, s' \in \mathcal{S}$ and action $a \in \mathcal{A}$, $P(s'|s,a)$ is the probability of transitioning from $s$ to $s'$ under action $a$ and $r(s,a)$ is the reward collected at state $s$ after executing action $a$; $S_0 \subseteq \mathcal{S}$ is the set of initial states and $\gamma$ is the discount factor. Throughout the paper, we assume that $r$ is a bounded function.

The agent executes actions according to a policy. We define a stochastic policy $\pi: \mathcal{S} \times \mathcal{A} \to [0,1]$ such that $\pi(a|s)$ is the conditional probability that the agent takes action $a \in \mathcal{A}$ after observing state $s \in \mathcal{S}$. The objective is to find a policy $\pi$ which maximizes the expected discounted reward:
$$\eta(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]. \quad (1)$$
The state value function $V^\pi$ and the state-action value function $Q^\pi$ are defined as:
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k) \,\Big|\, s_t = s\right], \qquad Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k) \,\Big|\, s_t = s, a_t = a\right].$$
The gap between $Q^\pi$ and $V^\pi$ is known as the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$, where $a \sim \pi(\cdot|s)$. The coverage of policy $\pi$ is defined as the stationary state visitation probability $\rho_\pi(s) = P(s_t = s)$ under $\pi$.

The theory of inner product spaces has been widely used to characterize approximate learning of Markov decision processes [Puterman, 2014, Tsitsiklis and Van Roy, 1999]. In this subsection, we provide a short overview of inner product spaces of square-integrable functions on a closed interval $I$, denoted $L^2(I)$.

The $L^2(I)$ space is known to be separable for $I = [0,1]$, i.e., it admits a countable orthonormal basis of functions $\{\omega_k\}_{k=1}^{\infty}$ such that $\langle \omega_i, \omega_j \rangle = \delta_{ij}$ for all $i, j$, where $\delta$ is Kronecker's delta. This property makes possible computations with respect to the inner product
$$\langle f, g \rangle = \int_{t \in I} f(t)\,\bar{g}(t)\, dt, \quad (2)$$
and the corresponding norm $\|f\| = \sqrt{\langle f, f \rangle}$, for $f, g \in L^2(I)$. That is, for any $f \in L^2(I)$, there exist scalars $\{\xi^f_k\}_{k=1}^{\infty}$ such that its representation is
$$\hat{f}(t) = \sum_{k=1}^{\infty} \xi^f_k\, \omega_k(t), \qquad \xi^f_k = \langle f, \omega_k \rangle. \quad (3)$$
If $\hat{f}_K(t) = \sum_{k=1}^{K} \xi^f_k \omega_k(t)$, then adding $\xi_{K+1}\omega_{K+1}$ to $\hat{f}_K$ can be thought of as adding a new orthogonal basis function. Equation 3 suggests a simple embedding rule, where one would project a density function onto a fixed basis and store only the first $K$ coefficients $\{\xi^f_k\}_{k=1}^{K}$. Which components to pick depends on the nature of the basis: harmonic and wavelet bases are ranked according to amplitude, while singular vectors are sorted by the corresponding singular values.

The choice of the basis $\{\omega_k\}_{k=1}^{\infty}$ plays a crucial role in the quality of the approximation. The properties of the function $f$ (e.g., periodicity, continuity) should also be taken into account when picking $\omega$. Some examples of well-known orthonormal bases of $L^2(I)$ are the Fourier basis $\omega_k(t) = e^{2\pi i k t}$ [Stein and Shakarchi, 2003] and the Haar basis $\omega_{k,k'}(t) = 2^{k/2}\,\omega(2^k t - k')$ [Haar, 1909]. Other examples include but are not limited to the Hermite and Daubechies bases [Daubechies, 1988].
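As a concrete illustration of the projection-and-truncation rule in Equation 3, the following sketch discretizes a one-dimensional density on [0, 1], projects it onto the harmonic basis via the FFT, and keeps only the first K coefficients. The density, grid size and truncation level are illustrative choices, not values used in the paper.

```python
import numpy as np

# Discretize a 1-D density on a uniform grid over [0, 1).
t = np.linspace(0.0, 1.0, 256, endpoint=False)
f = np.exp(-0.5 * ((t - 0.3) / 0.05) ** 2) + 0.5 * np.exp(-0.5 * ((t - 0.7) / 0.1) ** 2)
f /= f.sum() * (t[1] - t[0])           # normalize so that f integrates to ~1

# Project onto the harmonic (Fourier) basis: xi_k = <f, omega_k>.
xi = np.fft.rfft(f)                    # one complex coefficient per frequency

# Keep only the first K components (Equation 3 with the tail set to zero).
K = 10
xi_trunc = np.zeros_like(xi)
xi_trunc[:K] = xi[:K]

# Reconstruct the truncated approximation \hat{f}_K.
f_hat = np.fft.irfft(xi_trunc, n=f.size)

# The L2 reconstruction error decreases as K grows.
err = np.sqrt(np.mean((f - f_hat) ** 2))
print(f"K={K}, RMSE={err:.4f}")
```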
Moreover, it is known that a mixture model with a countable number of components is a universal density approximator [McLachlan and Peel, 2004]; in such a case, the basis consists of Gaussian density functions and is not necessarily orthonormal.

In addition, matrix decomposition algorithms such as the truncated singular value decomposition (SVD) are popular projection methods due to their robustness and their extension to finite-rank bounded operators [Zhu, 2007], since they learn both the basis vectors and the corresponding linear weights. Although SVD has previously been used in embedding schemes [Lu et al., 2017, Goetschalckx et al., 2018], unlike us, most works apply it to the weights of the function rather than to its outputs. Similarly to the density approximators above, a key appeal of SVD is that the error rate of truncation can be controlled efficiently through the magnitude of the smallest truncated singular values (see the Eckart-Young-Mirsky theorem); a small numerical sketch is given at the end of this subsection.

While convergence guarantees are mostly known for the closed interval $I = [0,1]$, multiple works have studied the properties of the previously discussed basis functions on the whole real line and in multiple dimensions [Egozcue et al., 2006]. Weaker convergence results in Hilbert spaces can be stated with respect to the space's inner product. If, for every $g \in L^2(I)$ and sequence of functions $f_1, \dots, f_n \in L^2(I)$,
$$\lim_{n \to \infty} \langle f_n, g \rangle = \langle f, g \rangle, \quad (4)$$
then the sequence $f_n$ is said to converge weakly to $f$ as $n \to \infty$. Stronger theorems are known for specific bases but require stronger assumptions on $f$.

In order to allow our learning framework to have strong convergence results in the inner product sense as well as a countable set of basis functions, we restrict ourselves to separable Hilbert spaces.
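The following is a minimal numerical sketch of the truncated-SVD projection discussed above; the random tabular policy and the rank K are placeholders. It also checks the Eckart-Young-Mirsky identity numerically: the Frobenius error of the rank-K approximation equals the norm of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tabular "policy" over discretized states x actions (illustrative values only).
P = rng.random((64, 16))
P /= P.sum(axis=1, keepdims=True)      # each row is a conditional distribution

# Truncated SVD: keep the top-K singular triplets.
U, s, Vt = np.linalg.svd(P, full_matrices=False)
K = 5
P_hat = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# Eckart-Young-Mirsky: the Frobenius error equals the norm of the dropped singular values.
frob_err = np.linalg.norm(P - P_hat, "fro")
tail = np.sqrt(np.sum(s[K:] ** 2))
print(frob_err, tail)                  # the two values coincide (up to floating point)
```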
In this section, we first introduce a framework to represent policies in an RKHS with desirable error guarantees. The size of this embedding can be varied through the coefficient truncation step. We show that this truncation has the property of reducing the return variance without drastically affecting the expected return. Next, we propose discretization and pruning steps as a tractable alternative to working directly in the continuous space. Finally, we derive a practical algorithm which leverages all the aforementioned steps.

By Mercer's theorem [Mercer, 1909], any symmetric positive-definite kernel function $\kappa$ with associated integral operator $T_\kappa$ can be decomposed as
$$\kappa(x, y) = \sum_{k=1}^{\infty} \big(\sqrt{\lambda_k}\, e_k(x)\big)\big(\sqrt{\lambda_k}\, e_k(y)\big) = \sum_{k=1}^{\infty} \omega_k(x)\,\omega_k(y), \quad (5)$$
where $\lambda_k$ and $e_k$ are the eigenvalues and eigenfunctions of $T_\kappa$, respectively. It follows from the Moore-Aronszajn theorem [Aronszajn, 1950] that there exists a unique Hilbert space $\mathcal{H}$ of functions for which $\kappa$ is the kernel. Moreover, the inner product associated with $\mathcal{H}$ is
$$\langle f, g \rangle_{\mathcal{H}} := \sum_{k=1}^{\infty} \frac{\langle f, e_k \rangle_{L^2}\, \langle g, e_k \rangle_{L^2}}{\lambda_k} = \sum_{k=1}^{\infty} \frac{\xi^f_k\, \xi^g_k}{\lambda_k}. \quad (6)$$
This allows us to state performance bounds between policies embedded in an RKHS in terms of their projection weights, as shown in the next section.

Consider the conditional distribution $\pi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. For a fixed $s$, the function $\pi(\cdot|s)$ belongs to a Hilbert space $\mathcal{H}_s$ of functions $f: \mathcal{A} \to \mathbb{R}$.

Lemma 3.1.
Let $s$ be a state in $\mathcal{S}$. Let $\mathcal{H}_s$ be an RKHS whose integral operator has eigenvalues $\{\lambda_k\}_{k=1}^{\infty}$ and eigenfunctions $\{e_k\}_{k=1}^{\infty}$. Let $\xi^{\pi_i}_k = \langle \pi_i, e_k \rangle$ and $\pi_1, \pi_2 \in \mathcal{H}_s$ such that $\pi_1 = \pi_1(\cdot|s)$, $\pi_2 = \pi_2(\cdot|s)$. Then the following holds:
$$\|\pi_1 - \pi_2\|^2_{\mathcal{H}_s} = \sum_{k=1}^{\infty} \frac{\big(\xi^{\pi_1}_k - \xi^{\pi_2}_k\big)^2}{\lambda_k}. \quad (7)$$
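Assuming access to the projection weights and to the eigenvalues of the integral operator, the RKHS distance of Lemma 3.1 reduces to a weighted sum of squared coefficient differences. The sketch below illustrates this with placeholder eigenvalues and coefficients.

```python
import numpy as np

# Illustrative eigenvalues of the integral operator (polynomially decaying).
K = 50
lam = 1.0 / np.arange(1, K + 1) ** 2

# Projection weights of two policies pi_1(.|s), pi_2(.|s) onto the eigenfunctions e_k.
rng = np.random.default_rng(1)
xi_1 = rng.normal(size=K) * np.sqrt(lam)   # coefficients shrink with lambda_k
xi_2 = rng.normal(size=K) * np.sqrt(lam)

# Squared RKHS distance, Lemma 3.1: sum_k (xi1_k - xi2_k)^2 / lambda_k.
dist_sq = np.sum((xi_1 - xi_2) ** 2 / lam)
print(dist_sq)
```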
Lemma 3.2. Let $(\xi^{\pi_1}_1, \xi^{\pi_2}_1), \dots, (\xi^{\pi_1}_K, \xi^{\pi_2}_K)$ be the projection weights of $\pi_i$ onto $\mathcal{H}_s$ such that, for $p, q \in \mathbb{N}_+$, if $p > q$ then $\xi^{\pi_i}_p < \xi^{\pi_i}_q$. If there exists a $K \in \mathbb{N}_+$ such that for all $k > K$, $|\xi^{\pi_1}_k - \xi^{\pi_2}_k| < \varepsilon^k \sqrt{\lambda_k}$ for some real $0 < \varepsilon < 1$, then the following holds:
$$\sum_{k=K+1}^{\infty} \frac{\big(\xi^{\pi_1}_k - \xi^{\pi_2}_k\big)^2}{\lambda_k} \le \frac{\varepsilon^{2(K+1)}}{1 - \varepsilon^2}. \quad (8)$$

The truncated embedding of size $K$ can be formed by setting the sequence of coefficients $\{\xi^\pi_k\}_{k=K+1}^{\infty}$ to zero, obtaining
$$\hat{\pi}_K(t) = \sum_{k=1}^{K} \xi_k\, \omega_k(t). \quad (9)$$
As shown in the experiments (Appendix 7.5), truncating the embedding at a given rank can have desirable properties for the returns and for their variance.

In this subsection, we show under which conditions one can expect a reduction in the variance of the returns of the RKHS policy, thus helping in sensitive tasks such as medical interventions or autonomous vehicles [Garcıa and Fernández, 2015]. Any policy $\pi$ in the RKHS can be decomposed as a finite sum plus a residual:
$$\pi(a|s) = \underbrace{\sum_{k=1}^{K} \xi_k\, \omega_k(s,a)}_{\hat{\pi}_K} + \underbrace{\sum_{k=K+1}^{\infty} \xi_k\, \omega_k(s,a)}_{\varepsilon_K}, \qquad \forall (s,a) \in \mathcal{S} \times \mathcal{A}. \quad (10)$$
Our framework allows us to derive variance guarantees by first considering the random return variable $Z$ of the policy $\pi$ given some initial state and action:
$$Z(\pi \mid s_0, a_0) = r(S_0, A_0) + \sum_{t=1}^{\infty} \gamma^t r(S_t, A_t), \quad (11)$$
with $S_t \sim P(\cdot|s_{t-1}, a_{t-1})$, $A_t \sim \pi(\cdot|s_t)$ and $S_0, A_0 \sim \beta$ for some initial distribution $\beta$.

Suppose we have access to some entry-wise normalization function $\sigma: \mathbb{R} \to [0,1]$. We are not concerned with the exact form of $\sigma$, but, as an example, we use the softmax function $\sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{d} e^{x_j}}$.

The variance of $Z(\pi)$ over the entire trajectory is hard to compute [Tamar et al., 2016] since it requires solving a Bellman equation for $\mathbb{E}[Z(\pi)]$. Instead, we look at the variance over initial state-action pairs sampled from $\beta$. Since we do not know the distribution of $\pi(a|s)$ but only that of $(a,s)$ under $\beta$, we compute all expectations using the Law of the Unconscious Statistician [Ringnér, 2009, LOTUS].

Lemma 3.3.
Let $\mathbb{V}_\beta[X] = \mathbb{E}_\beta[X^2] - (\mathbb{E}_\beta[X])^2$ be the variance operator with respect to $\beta$, and let $\sigma$ be a positive-valued, differentiable function. If the following conditions are satisfied:

1. $\sigma'(\eta(\hat{\pi}_K)) \le \sigma'(\eta(\pi))$ (monotonically decreasing $\sigma'$ on $(0, \infty)$);
2. $\sqrt{\mathbb{E}_\beta[\varepsilon_K^2]^{3}} \ge \mathbb{E}_\beta[\varepsilon_K]$ (second moment condition);
3. $\mathbb{V}_\beta[\pi] \approx (\sigma'(\eta(\pi)))^2\, \mathbb{V}_\beta[Q^\pi]$ (the second-order Taylor residual is small),

then the following holds:
$$\mathbb{V}_\beta[Q^\pi] \ge \mathbb{V}_\beta[Q^{\hat{\pi}_K}]. \quad (12)$$

Lemma 3.3 relates the variance of the returns collected by the true and truncated policies (a detailed analysis is given in the Appendix). In practice, Condition 1 is satisfied when $\sigma'$ is decreasing on $(0, \infty)$, rewards are strictly positive, and $\eta(\hat{\pi}_K) \le \eta(\pi)$. Condition 2 is not restrictive and is, for example, satisfied by a Student's $t$ distribution with sufficiently many degrees of freedom. Condition 3 is a classical assumption in variance analysis [Benaroya et al., 2005] and holds if the expansion is done near the expectation of the $Q$ function. As we show in the next section, even after the truncation step, our framework still provides important performance guarantees.

Storing an approximation of the ground truth policy implies a trade-off between performance and the number of basis functions. Our framework allows controlling the difference in collected reward when projecting $\pi$ onto the first $K$ basis functions, via the following theorem.

Theorem 3.4.
Let $s \in \mathcal{S}$ and let $\mathcal{H}_s$ be the associated RKHS. Let $\pi_1, \pi_2 \in \mathcal{H}_s$ be represented by $\{\xi^{\pi_1}_k\}_{k=1}^{\infty}$ and $\{\xi^{\pi_2}_k\}_{k=1}^{\infty}$, and let $\varepsilon$ be such that Lemma 3.2 holds. Let $M_s > 0$ be such that $|\pi_1(a|s) - \pi_2(a|s)| \le M_s \|\pi_1 - \pi_2\|_{\mathcal{H}_s}$ for all $a \in \mathcal{A}$. Let $\Delta^K_{\mathcal{H}_s} = \sum_{k=1}^{K} \frac{(\xi^{\pi_1}_k - \xi^{\pi_2}_k)^2}{\lambda_k}$ and $\epsilon_i = \max_{s,a} |A^{\pi_i}(s,a)|$. Then
$$|\eta(\pi_1) - \eta(\pi_2)| \le \frac{|\mathcal{A}|\,\epsilon\,\gamma}{(1-\gamma)^2} \left\{ \max_{s \in \mathcal{S}} \big(M_s\, \Delta^K_{\mathcal{H}_s}\big) + O\!\Big(\varepsilon^{K} \max_{s \in \mathcal{S}} M_s\, \Delta^K_{\mathcal{H}_s}\Big) \right\} + \max(\epsilon_1, \epsilon_2),$$
where $\epsilon = \max(\epsilon_1, \epsilon_2)$.

This theorem implies that representing any policy within the top $K$ bases of the Hilbert space yields an approximation error polynomial in $\varepsilon$, due to how the linear coefficients were picked. Since $M_s = \|\kappa(\cdot, s)\|_{\mathcal{H}} \le \sup_{s \in \mathcal{S}} \|\kappa(\cdot, s)\|_{\mathcal{H}} < \infty$ when all functions in $\mathcal{H}$ are bounded (see Sun and Wu [2009]), the right-hand side of Theorem 3.4 is finite.

While Theorem 3.4 holds in the continuous case, most projection algorithms such as the Fast Fourier Transform, the wavelet transform and SVD operate on a discretized version of the function. The next section describes the error made during this discretization step.

So far, the theoretical formulation of our algorithm was in the continuous space of states and actions. However, most efficient algorithms for projecting a function onto an orthonormal basis operate in the discrete space [Stein and Shakarchi, 2003, Daubechies, 1988]. For this reason, we introduce the discretization step, which allows us to leverage these highly optimized methods without sacrificing the approximation guarantees from the continuous domain.

An important step in our algorithm is the component-wise grouping of similar states. This step is crucial, since it allows us to greatly simplify the calculations of the error bounds and to re-use existing discrete projection algorithms such as the Fast Fourier Transform. However, when the discretization is done naively, it can result in slower computation times due to a blowup in state-action space dimensions (to mitigate this, we propose the pruning step in the next section). We use an approach known as quantile binning [Naeini et al., 2015]. We assume that state components assigned to the same bin have a similar behaviour, a somewhat limiting condition which is good enough for simple environments and greatly simplifies our proposed algorithm.

Recall the cumulative distribution function $\Pi_X(x) = P[X \le x]$ and the corresponding quantile function $Q_X(p) = \inf\{x \in \mathbb{R} : p \le \Pi_X(x)\}$. In practice, we approximate $Q, \Pi$ with the empirical quantile and distribution functions $\hat{Q}, \hat{\Pi}$, respectively; a small sketch of these empirical estimators is given below.
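Below is a small sketch of the empirical estimators $\hat{\Pi}$ and $\hat{Q}$; the sample data is a placeholder standing in for states visited during rollouts.

```python
import numpy as np

def empirical_cdf(samples):
    """Return a function x -> empirical P[X <= x] built from samples."""
    xs = np.sort(np.asarray(samples))
    return lambda x: np.searchsorted(xs, x, side="right") / xs.size

def empirical_quantile(samples):
    """Return a function p -> inf{x : p <= Pi_hat(x)} (empirical quantile)."""
    xs = np.sort(np.asarray(samples))
    return lambda p: xs[min(int(np.ceil(p * xs.size)) - 1, xs.size - 1)] if p > 0 else xs[0]

rng = np.random.default_rng(0)
states = rng.normal(size=10_000)        # stand-in for visited states from rollouts

Pi_hat = empirical_cdf(states)
Q_hat = empirical_quantile(states)
print(Pi_hat(0.0), Q_hat(0.5))          # approx. 0.5 and approx. the sample median
```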
Theorem 3.5. Let $\Pi_{\mathcal{S}}, \Pi_{\mathcal{A}}$ be cumulative policy functions over $\mathcal{S}, \mathcal{A}$ and let $\lceil\Pi\rceil_{\mathcal{S}}, \lceil\Pi\rceil_{\mathcal{A}}$ be their empirical estimates discretized using quantile binning over states and actions into $b_S$ and $b_A$ clusters, respectively. Let $Q_{\mathcal{S}}, Q_{\mathcal{A}}$ be the corresponding empirical quantile functions. Then the volume of the discretization error can be written as
$$\delta_{\Pi, \lceil\Pi\rceil} = \sum_{i=1}^{b_S} \sum_{j=1}^{b_A} \left\{ \left| \int_{q^S_{i-1}}^{q^S_i} \lceil\hat{\Pi}\rceil_{\mathcal{S}}(s)\, ds - \frac{i}{b_S}\big(q^S_i - q^S_{i-1}\big) \right| \cdot \left| \int_{q^A_{i,j-1}}^{q^A_{i,j}} \lceil\hat{\Pi}\rceil_{\mathcal{A}}(a)\, da - \frac{j}{b_A}\big(q^A_{i,j} - q^A_{i,j-1}\big) \right| \right\}, \quad (13)$$
where $q^S_i = \lceil\hat{Q}_{\mathcal{S}}\rceil\!\big(\tfrac{i}{b_S}\big)$ and $q^A_{i,j} = \lceil\hat{Q}_{\mathcal{A}|i}\rceil\!\big(\tfrac{j}{b_A}\big)$.

Note that $\delta_{\Pi, \lceil\Pi\rceil}$ is similar to $\|\Pi - \lceil\Pi\rceil\|$, with the only difference that the volume of each error rectangle is considered. Therefore, an argument identical to Theorem 3.4 can be used to map Theorem 3.5 into the error space of returns.

Figure 1: Schematic view of the discretization error.

Theorem 3.5 and Figure 1 show how the discretization error induced by Algorithm 1 can be computed across all states and actions in the training set. Note that the samples used to compute both distribution functions are i.i.d. inter-, but not intra-trajectory. This means that the classical result from Dvoretzky et al. [1956] might not hold.

The discretization step enables us to use powerful discrete projection algorithms, but as a side effect it can drastically expand the state space. However, we observed empirically that most policies tend to cover a small subset of the state space. Using this information, we introduce a state-space pruning method based on the steady-state distribution, which is addressed in the following section.

The discretization step from the previous section greatly simplifies the calculation of the projection weights. However, it can potentially make the new state-action space quite large. In this section, we introduce the pruning step based on the long-run distribution of the policy, and then quantify the impact of removing unvisited states both on the performance and on the coverage of the policy.
Since policies tend to concentrate around the optimum as training progresses [Ahmed et al., 2018], pruning those states would not significantly hinder the overall performance of the embedded policy. Let $\rho_\pi(s)$ denote the number of times $\pi$ visits state $s$ in the long run and $N(s,a)$ the number of times action $a$ is taken from $s$ over $N$ trajectories. This leads us to define $\pi_{\text{pruned}}$ as a policy which acts randomly under some state visitation threshold. Precisely, if $S_0 = \{s : \rho_\pi(s) = 0\}$ and $\mathbb{1}(\cdot)$ is the indicator function, then
$$\pi_{\text{pruned}}(a|s) = \{1 - \mathbb{1}_{S_0}(s)\}\, \frac{N(s,a)}{\rho_\pi(s)} + \mathbb{1}_{S_0}(s)\, \frac{1}{b_A}. \quad (14)$$
Pruning rarely visited states can drastically reduce the number of parameters, while maintaining high-probability performance guarantees for $\pi_{\text{pruned}}$. The following theorem quantifies the effect of pruning states on the performance of the policy.

Theorem 3.6 (Policy pruning, Simão et al. [2019]). Let $N$ be the number of rollout trajectories of the policy $\pi$ and let $r_M$ be the largest reward. With probability $1 - \delta$, the following holds:
$$|\eta(\pi) - \eta(\pi_{\text{pruned}})| \le \frac{r_M}{1 - \gamma} \sqrt{\frac{|\mathcal{S}||\mathcal{A}| + 4 \log \frac{2}{\delta}}{N}}. \quad (15)$$
This theorem allows one to discretize only the subset of frequently visited states while still ensuring a strong performance guarantee with high probability. Instead of pruning based on samples from $\rho_{\lceil\pi\rceil}$, which induces a computational overhead since we would need to perform an additional set of rollouts, we prune states based on $\rho_\pi$. The approximation error induced by this switch can be understood through the scope of Theorem 3.7.

Any change made to the ground truth policy, such as embedding it into an RKHS, will affect its stationary distribution and hence the long-run coverage of the policy. In this subsection, we provide a result on the error made on the stationary distributions as a function of the error made on the original policies.

Under a fixed policy $\pi$, an MDP induces a Markov chain defined by the expected transition model $(P_\pi)_{ss'} = \sum_{a \in \mathcal{A}} \pi(a|s)\, P(s'|s,a)$. In tensor form, it can be represented as $A (T^{(2)} \Pi)$, where $\Pi_{as} = \pi(a|s)$ is the policy matrix, $A = I \odot I$ is the Khatri-Rao product and $T^{(2)}$ is the second matricization of the transition tensor $T_{sas'} = P(s'|s,a)$. If the Markov chain defined above is irreducible and homogeneous, then its stationary distribution corresponds to the state occupancy probability given by the principal left eigenvector of $P_\pi$.

The following theorem bridges the error made on the reconstruction of the long-run distribution as a function of the system's transition dynamics and the distance between policies.

Theorem 3.7 (Approximate policy coverage). Let $\|\cdot\|_{S_p}$ be the Schatten $p$-norm and $\|\cdot\|_p$ be the vector $p$-norm. If the Markov chains induced by $P_\pi$ and $P_{\pi_{\text{pruned}}}$ are irreducible and homogeneous, then the following holds:
$$\big\|\rho^\top_\pi - \rho^\top_{\pi_{\text{pruned}}}\big\|_2 \le \|Z\|_{S_\infty}\, \|\Pi - \Pi_{\text{pruned}}\|_{S_2}\, \|T^{(3)}\|_{S_2}, \quad (16)$$
where $Z = (I - P_\pi + \mathbb{1}_{|\mathcal{S}|}\rho^\top_\pi)^{-1}$ and $\mathbb{1}_{|\mathcal{S}|}$ is a vector of all ones.

A detailed proof can be found in the Appendix. Note that the upper bound depends both on the environment's structure and on the policy reconstruction quality. It is thus expected that, for MDPs with particularly large singular values of $T^{(2)}$, the bound converges less quickly than for those with smaller singular values.
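To make Equation 14 and Theorem 3.7 concrete, the sketch below builds a pruned policy from visit counts on a small synthetic MDP and compares the stationary distributions of the induced chains $P_\pi$ and $P_{\pi_{\text{pruned}}}$, obtained as principal left eigenvectors. The MDP, the counts and the random seed are all illustrative; this is not an environment used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 6, 3

# Illustrative transition tensor T[s, a, s'] and tabular policy Pi[s, a].
T = rng.random((n_s, n_a, n_s)); T /= T.sum(axis=2, keepdims=True)
Pi = rng.random((n_s, n_a));     Pi /= Pi.sum(axis=1, keepdims=True)

# Visit counts from (hypothetical) rollouts; state 5 is never visited.
N_sa = rng.integers(1, 50, size=(n_s, n_a)); N_sa[5] = 0
rho_counts = N_sa.sum(axis=1)

# Pruned policy (Eq. 14): empirical action frequencies where visited, uniform elsewhere.
Pi_pruned = np.where(rho_counts[:, None] > 0,
                     N_sa / np.maximum(rho_counts[:, None], 1),
                     1.0 / n_a)

def stationary(Pi_mat):
    # Expected transition model (P_pi)_{s s'} = sum_a Pi(a|s) T(s'|s, a).
    P = np.einsum("sa,sax->sx", Pi_mat, T)
    # Principal left eigenvector of P (stationary distribution).
    w, v = np.linalg.eig(P.T)
    rho = np.real(v[:, np.argmax(np.real(w))])
    return rho / rho.sum()

# Coverage gap between the true and pruned policies (cf. the quantity bounded in Thm. 3.7).
print(np.abs(stationary(Pi) - stationary(Pi_pruned)).sum())
```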
See the Appendix for visualizations of the bound on empirical domains.

Equipped with these tools, we are ready to state the general policy embedding bound in the RKHS setting. The difference in performance between the ground-truth policy and the truncated embedded policy can be decomposed into a discretization error, a pruning error and a projection error:
Corollary 3.7.1.
Given a ground truth policy $\pi$, discretized policy $\lceil\pi\rceil$, pruned policy $\pi_{\text{pruned}}$ and embedded policy $\hat{\pi}$, the total policy embedding bound is
$$\big|\eta(\pi) - \eta(\hat{\pi})\big| \le \underbrace{\big|\eta(\pi) - \eta(\lceil\pi\rceil)\big|}_{\text{Theorem 3.5}} + \underbrace{\big|\eta(\lceil\pi\rceil) - \eta(\pi_{\text{pruned}})\big|}_{\text{Theorem 3.6}} + \underbrace{\big|\eta(\pi_{\text{pruned}}) - \eta(\hat{\pi})\big|}_{\text{Theorem 3.4}}.$$
(The existence and uniqueness of $\rho_\pi$ follow from the Perron-Frobenius theorem.)

Tighter bounds than Theorem 3.4 can be found in the literature [Giardina and Chirlian, 1972, Rakhlin et al., 2005] for specific basis functions. In the next section, we propose a practical algorithm which integrates all three steps.
We first discuss the construction of $\lceil\pi\rceil$ via quantile discretization. The idea consists in grouping together similar state-action pairs (in terms of visitation frequency). To this end, we use the quantile state visitation function $Q_{\mathcal{S}}$ and the state visitation distribution function $\Pi_{\mathcal{S}}$, as well as their empirical counterparts $\hat{Q}_{\mathcal{S}}$ and $\hat{\Pi}_{\mathcal{S}}$. Quantile and distribution functions for the action space are defined analogously.

Algorithm 1:
Quantile discretization
Input:
Policy $\pi$, number of state bins $b_S$, number of action bins $b_A$
Result:
Discrete policy $\lceil\pi\rceil$.
1. Collect a set of states $S \subseteq \mathcal{S}$ and a set of actions $A \subseteq \mathcal{A}$ via rollouts of $\pi$;
2. Build the empirical c.d.f. $\hat{\Pi}_s$ from the set $S$;
3. Build the empirical c.d.f. $\hat{\Pi}_a$ from the set $A$;
4. Find numbers $i_1, \dots, i_{b_S}$ such that $\hat{\Pi}_s(i_l) - \hat{\Pi}_s(i_{l-1}) = \frac{1}{b_S}$ using $\hat{Q}_s$, for $l = 1, \dots, b_S$;
5. Find numbers $j_1, \dots, j_{b_A}$ such that $\hat{\Pi}_a(j_l) - \hat{\Pi}_a(j_{l-1}) = \frac{1}{b_A}$ using $\hat{Q}_a$, for $l = 1, \dots, b_A$;
6. If $i_{l-1} \le s \le i_l$ and $j_{l'-1} \le a \le j_{l'}$, assign $(s,a)$ to cell $(l, l')$;
7. Set $\lceil\pi\rceil_{l l'}$ to $\pi$ evaluated at the bin midpoints, i.e., $\pi\!\big(\tfrac{j_{l'-1} + j_{l'}}{2} \,\big|\, \tfrac{i_{l-1} + i_l}{2}\big)$.

Algorithm 1 outlines the proposed discretization process, allowing one to approximate any continuous policy (e.g., one computed by a neural network) by a 2-dimensional table indexed by discrete states and actions. We use quantile discretization in order to have maximal resolution in frequently visited areas. It also allows for slightly faster sampling during rollouts, since the probability of falling in each bin is uniform.
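The sketch below illustrates the binning step of Algorithm 1 with scikit-learn's KBinsDiscretizer (the estimator mentioned in the Appendix). Instead of evaluating $\pi$ at the bin midpoints as in the last step of Algorithm 1, it estimates the table from empirical state-action frequencies; the rollout data and bin counts are placeholders.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)

# Stand-ins for states and actions collected by rolling out the policy pi.
states = rng.normal(loc=0.0, scale=1.0, size=(5000, 1))
actions = np.tanh(states + 0.1 * rng.normal(size=(5000, 1)))   # correlated with states

b_S, b_A = 20, 10
enc_s = KBinsDiscretizer(n_bins=b_S, encode="ordinal", strategy="quantile").fit(states)
enc_a = KBinsDiscretizer(n_bins=b_A, encode="ordinal", strategy="quantile").fit(actions)

# Assign every (s, a) pair to a cell (l, l') and count visits.
s_bin = enc_s.transform(states).astype(int).ravel()
a_bin = enc_a.transform(actions).astype(int).ravel()
counts = np.zeros((b_S, b_A))
np.add.at(counts, (s_bin, a_bin), 1.0)

# Row-normalize to obtain the discrete policy table.
pi_discrete = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
```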
Algorithm 2:
RKHS policy embedding
Input:
Policy $\pi$, number of components $K$, basis $\omega = \{\omega_k\}_{k=1}^{\infty}$
Result:
A set of coefficients $\xi_1, \dots, \xi_K$.
1. Roll out $\pi$ in the environment to estimate $\hat{Q}_s, \hat{Q}_a$ (empirical quantile functions);
2. Project $\pi$ onto a lattice to obtain $\lceil\pi\rceil$ using Alg. 1;
3. Prune $\lceil\pi\rceil$ with $\hat{Q}_s, \hat{Q}_a$ using Eq. 14;
4. Project $\lceil\pi\rceil$ onto the first $K$ elements of $\omega$ using one of the projection algorithms described in Section 2.2.

Since in practice working with continuous RKHS projections is cumbersome, we use Algorithm 1 to project the continuous policy $\pi$ onto a discrete lattice, which is then projected onto the Hilbert space (Python pseudocode can be found in Appendix 7.9.1). A natural consequence of working with embedded policies is that "taking actions according to $\hat{\pi}_K$" reduces to importance sampling of actions conditional on state bins (discussed in Appendix 7.4). A minimal numerical sketch of this pipeline is given below.
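A minimal sketch of the embedding pipeline of Algorithm 2 applied to an already-discretized policy table: project onto the 2-D Fourier basis, keep the K largest-magnitude coefficients, and reconstruct. The table here is random and the final renormalization step is one possible way to map the reconstruction back to a valid conditional distribution.

```python
import numpy as np

def embed_policy(pi_table, K):
    """Project a discrete policy table onto the 2-D Fourier basis and keep K coefficients."""
    coeffs = np.fft.fft2(pi_table)
    # Rank coefficients by magnitude and zero out all but the K largest.
    flat = np.abs(coeffs).ravel()
    keep = np.argsort(flat)[::-1][:K]
    mask = np.zeros(flat.size, dtype=bool)
    mask[keep] = True
    return np.where(mask.reshape(coeffs.shape), coeffs, 0.0)

def reconstruct_policy(coeffs_trunc):
    """Invert the truncated transform and renormalize rows into distributions."""
    table = np.real(np.fft.ifft2(coeffs_trunc))
    table = np.clip(table, 0.0, None)
    return table / np.maximum(table.sum(axis=1, keepdims=True), 1e-12)

# Example: embed a (b_S x b_A) table with K = b_S * b_A / 2 coefficients.
rng = np.random.default_rng(0)
pi_table = rng.random((20, 10)); pi_table /= pi_table.sum(axis=1, keepdims=True)
pi_hat = reconstruct_policy(embed_policy(pi_table, K=100))
print(np.abs(pi_table - pi_hat).max())
```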
In this section, we consider a range of control tasks: bandit turntable, Pendulum and Continuous Mountain Car (CMC) from OpenAI Gym [Brockman et al., 2016]. All experimental details can be found in Appendix 7.9.3. In Pendulum and CMC, we omit the GMM and fixed-basis GMM methods, since the expectation-maximization algorithm runs into stability problems when dealing with a high number of components. Fixed-basis GMM serves as a baseline to benchmark optimization issues in the Expectation-Maximization (EM) algorithm for GMMs. Moreover, we use SVD for low-rank matrix factorization and the fourth-order Daubechies wavelet [Daubechies, 1988] (DB4) as an orthonormal basis alternative to Fourier (DFT).

We evaluate our framework in the multi-armed bandit [Slivkins, 2019, MAB] setting inspired from Fellows et al. [2018] (see Figure 2a). Results are reported in Figure 2b via the average Wasserstein-1 metric, defined as $\mathbb{E}_s[W_1(\pi(\cdot|s), \hat{\pi}(\cdot|s))]$. From the figure, we can observe that GMM has more trouble converging when we keep more than 60 components. This is due to EM's stability issues when dealing with a large number of components. While fixed-basis GMM struggles to accurately approximate the policy, DFT shows stable performance after a threshold of 10 components. The appendix contains additional results in the MAB setting motivated by recent advances [Dudik et al., 2011, Foster et al., 2018], and a simplified truncation lemma for average rewards.
Figure 2: Bandit turntable environment in (a) polar coordinates; the magnitude of the modes is proportional to the reward (lighter is higher), and actions are absolute angles in the interval $[-\pi, \pi]$. (b) $W_1$ distance between the ground truth and order-$K$ approximations; the dotted line indicates the number of modes in the ground truth density.

We compare all embedding methods with a maximum likelihood estimate (MLE) of $\pi$, which consists in approximating $\mathbb{E}[\pi_K(\cdot|s)] \approx \frac{1}{K}\sum_{k=1}^{K} A_k$, $A_k \sim \pi(\cdot|s)$. The MLE rate of convergence to $\pi$ is $O(1/\sqrt{K})$, which grounds the convergence rate of the other methods. As shown in Figure 3 over 10 runs, DFT, SVD and DB4 reach the same returns as the MLE method. Note that DB4 shows slightly better performance than DFT.

Figure 3: Difference in returns between embedded and ground truth policies for the Mountain Car task with respect to the number of parameters used.

Due to space constraints, Appendix Figure 15 contains insights on the pruning and discretization errors.

The same experimental setup as for Mountain Car applies to the Pendulum environment. As shown in Figure 4 over 10 runs, when the true policy has converged to a degenerate (optimal) distribution late in training, all methods show comparable performance in terms of convergence. DFT shows better convergence at early stages of training, that is, when the true policy has a large variance. Refer to Appendix Figures 9-11 for visualizations of the true policies.

Figure 4: Absolute difference in returns collected by discretized and reconstructed Gaussian policies (panels (a) and (b)).

Table 1: Variance of the collected returns under various policies over 100 trials, with $K$ set to half of its maximum value (i.e., $K = b_S b_A / 2$).

                     $\sqrt{\mathbb{V}[\eta(\pi)]}$   $\sqrt{\mathbb{V}[\eta(\hat{\pi}_K)]}$
  Pendulum (2k)      114.51                           84.75
  Pendulum (3k)      333.06                           191.83
  Pendulum (4k)      363.6                            273.1
  Pendulum (5k)      182.2                            163.3
  CMC                28.1                             11.0

Table 1 shows the standard deviation in returns collected by the true and truncated policies over 100 trials. The true policy systematically has a standard deviation at least as large as that of the truncated policy.

In this work, we introduced a general framework for representing policies for reinforcement learning tasks. It allows representing any continuous policy density by a projection onto the first $K$ basis functions of a reproducing kernel Hilbert space. We showed theoretically that the performance of this embedded policy depends on the discretization, pruning and projection algorithms, and we provided upper bounds for these errors. In addition, by projecting onto a lower-dimensional space, our framework provides a more stable and robust reconstructed policy. Finally, we conducted experiments to demonstrate the behaviour of the SVD, DFT, DB4 and GMM basis functions on a set of classical control tasks, supporting our performance guarantees. Moreover, our experiments show a reduction in the variance of the returns of the reconstructed policies, resulting in a more robust policy. A natural extension to our framework would be to operate directly in the continuous space, avoiding the discretization step.

Acknowledgements
This research is supported by the Canadian Institute for Advanced Research (CIFAR AI chair program).
References
Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. arXiv preprint arXiv:1811.11214, 2018.
Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger, and Claire J Tomlin. Reachability-based safe learning with Gaussian processes. Pages 1424-1431. IEEE, 2014.
Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.
Anil Aswani, Humberto Gonzalez, S Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216-1226, 2013.
André MS Barreto, Doina Precup, and Joelle Pineau. Practical kernel-based reinforcement learning. The Journal of Machine Learning Research, 17(1):2372-2441, 2016.
Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. Verifiable reinforcement learning via policy extraction. In Advances in Neural Information Processing Systems, pages 2494-2504, 2018.
Bernhard Baumgartner. An inequality for the trace of matrix products, using absolute values. arXiv preprint arXiv:1106.6189, 2011.
Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
Haym Benaroya, Seon Mi Han, and Mark Nagurka. Probability Models in Engineering and Science, volume 193. CRC Press, 2005.
Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908-918, 2017.
Byron Boots, Geoffrey Gordon, and Arthur Gretton. Hilbert space embeddings of predictive state representations. arXiv preprint arXiv:1309.6819, 2013.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.
Grace E Cho and Carl D Meyer. Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra and its Applications, 335(1-3):137-150, 2001.
Ingrid Daubechies. Orthonormal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics, 41(7):909-996, 1988.
Luc Devroye. An automatic method for generating random variates with a given characteristic function. SIAM Journal on Applied Mathematics, 1986.
Simon S Du, Yining Wang, and Aarti Singh. On the power of truncated SVD for general high-rank matrix estimation problems. In Advances in Neural Information Processing Systems, pages 445-455, 2017.
Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369, 2011.
Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642-669, 1956.
Juan José Egozcue, José Luis Díaz-Barrero, and Vera Pawlowsky-Glahn. Hilbert space of probability density functions based on Aitchison geometry. Acta Mathematica Sinica, 22(4):1175-1182, 2006.
Matthew Fellows, Kamil Ciosek, and Shimon Whiteson. Fourier policy gradients. In ICML, 2018.
Dylan J Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert E Schapire. Practical contextual bandits with regression oracles. arXiv preprint arXiv:1803.01088, 2018.
Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437-1480, 2015.
C Giardina and P Chirlian. Bounds on the truncation error of periodic signals. IEEE Transactions on Circuit Theory, 19(2):206-207, 1972.
Koen Goetschalckx, Bert Moons, Patrick Wambacq, and Marian Verhelst. Efficiently combining SVD, pruning, clustering and retraining for enhanced neural network compression. In Proceedings of the 2nd International Workshop on Embedded and Mobile Deep Learning, pages 1-6, 2018.
Steffen Grunewalder, Guy Lever, Luca Baldassarre, Massi Pontil, and Arthur Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. arXiv preprint arXiv:1206.4655, 2012.
Alfred Haar. Zur Theorie der orthogonalen Funktionensysteme. Georg-August-Universität, Göttingen, 1909.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290.
Dunham Jackson. The Theory of Approximation. The American Mathematical Society, 1930.
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning.
Motonobu Kanagawa and Kenji Fukumizu. Recovering distributions from Gaussian RKHS embeddings. In Artificial Intelligence and Statistics, pages 457-465, 2014.
Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pages 97-117. Springer, 2017.
Guy Lever, John Shawe-Taylor, Ronnie Stafford, and Csaba Szepesvári. Compressed conditional mean embeddings for model-based reinforcement learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
Tianyu Li, Bogdan Mazoure, Doina Precup, and Guillaume Rabusseau. Efficient planning under partial observability with unnormalized Q functions and spectral learning. In International Conference on Artificial Intelligence and Statistics, pages 2852-2862. PMLR, 2020.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Michael L Littman and Richard S Sutton. Predictive representations of state. In Advances in Neural Information Processing Systems, pages 1555-1561, 2002.
Mathias Lohne. The computational complexity of the fast Fourier transform. Technical report, 2017.
Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, and Rogerio Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5334-5343, 2017.
Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, 2004.
J Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209:415-446, 1909.
Mahdi Pakdaman Naeini, Gregory F Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 2015, page 2901. NIH Public Access, 2015.
Yu Nishiyama, Abdeslam Boularias, Arthur Gretton, and Kenji Fukumizu. Hilbert space embeddings of POMDPs. arXiv preprint arXiv:1210.4887, 2012.
Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702, 2017.
Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
Alexander Rakhlin, Dmitry Panchenko, and Sayan Mukherjee. Risk bounds for mixture density estimation. ESAIM: Probability and Statistics, 9:220-229, 2005.
Howard L Resnikoff, O Raymond Jr, et al. Wavelet Analysis: The Scalable Structure of Information. Springer Science & Business Media, 2012.
Bengt Ringnér. Law of the unconscious statistician. Unpublished note, 2009.
Dorsa Sadigh, S Shankar Sastry, Sanjit A Seshia, and Anca Dragan. Information gathering actions over human internal state. Pages 66-73. IEEE, 2016.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889-1897, Lille, France, 07-09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/schulman15.html.
Paul J Schweitzer. Perturbation theory and finite Markov chains. Journal of Applied Probability, 5(2):401-413, 1968.
Shayak Sen, Saikat Guha, Anupam Datta, Sriram K Rajamani, Janice Tsai, and Jeannette M Wing. Bootstrapping privacy compliance in big data systems. Pages 327-342. IEEE, 2014.
Thiago D. Simão, Romain Laroche, and Rémi Tachet des Combes. Safe policy improvement with an estimated baseline policy, 2019.
Aleksandrs Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.
Le Song, Xinhua Zhang, Alex Smola, Arthur Gretton, and Bernhard Schölkopf. Tailoring density estimation via reproducing kernel moment matching. In Proceedings of the 25th International Conference on Machine Learning, pages 992-999, 2008.
E.M. Stein and R. Shakarchi. Fourier Analysis: An Introduction. Princeton University Press, 2003. ISBN 9780691113845. URL https://books.google.ca/books?id=I6CJngEACAAJ.
Hongwei Sun and Qiang Wu. Application of integral operator for regularized least-square regression. Mathematical and Computer Modelling, 49(1-2):276-285, 2009.
Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. 2018.
Aviv Tamar, Dotan Di Castro, and Shie Mannor. Learning the variance of the reward-to-go. The Journal of Machine Learning Research, 17(1):361-396, 2016.
John N Tsitsiklis and Benjamin Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10):1840-1851, 1999.
Kirk Wolter. Introduction to Variance Estimation. Springer Science & Business Media, 2007.
Kehe Zhu. Operator Theory in Function Spaces. Number 138. American Mathematical Society, 2007.
Reproducibility Checklist
We follow the reproducibility checklist "The machine learning reproducibility checklist" and point to the relevant sections explaining each item here. For all algorithms presented, check if you include:•
A clear description of the algorithm, see main paper and included codebase.
The proposed approach is completely described by Alg. 2.•
An analysis of the complexity (time, space, sample size) of the algorithm.
The space complexity of our algorithm depends on the number of desired basis functions. It is $K$ for a 1-dimensional $K$-component GMM with diagonal covariance, $mK + K + nK$ for a $K$-truncated SVD of an $m \times n$ matrix, and only $K$ for the wavelet and Fourier bases. Note that a simple implementation of SVD requires finding the three matrices $U$, $D$, $V^\top$ before truncation. The time complexity depends on the reconstruction, pruning and discretization algorithms, which involve conducting rollouts w.r.t. policy $\pi$, training a quantile encoder on the state space trajectories, and finally embedding the pruned matrix using the desired algorithm. We found that the Gaussian mixture is the slowest to train, due to shortcomings of the expectation-maximization algorithm.• A link to a downloadable source code, including all dependencies.
The code is included with the Supplemental Files as a zip file; all dependencies can be installed using Python's package manager. Upon publication, the code will be made available on GitHub. Additionally, we include the model's weights as well as the discretized policy for the Pendulum-v0 environment. For all figures and tables that present empirical results, check if you include:•
A complete description of the data collection process, including sample size.
We use standard benchmarks provided in OpenAI Gym (Brockman et al., 2016).•
A link to downloadable version of the dataset or simulation environment.
Not applicable.•
An explanation of how samples were allocated for training / validation / testing.
We do not use a training-validation-test split, but instead report the mean performance (and one standard deviation) of the policy at evaluation time across 10 trials.•
An explanation of any data that were excluded.
The only data exclusion was done during policy pruning, as outlined in the main paper.•
The exact number of evaluation runs.
10 trials to obtain all figures, and 200 rollouts to determine $\rho_\pi$.• A description of how experiments were run.
See the Experimental Results section in the main paper and the didactic example details in the Appendix.•
A clear definition of the specific measure or statistics used to report results.
Undiscounted returns across the whole episode are reported, and in turn averaged across 10 seeds. Confidence intervals shown in Fig. 3 were obtained using the pooled variance formula from a difference-of-means t-test.• Clearly defined error bars.
Confidence intervals are always mean ± standard deviation over 10 trials.• A description of results with central tendency (e.g. mean) and variation (e.g. stddev). All results use the mean and standard deviation.•
A description of the computing infrastructure used.
All runs used 1 CPU for all experiments, with Gb of memory.

7.1 Bandit example

An $N$-armed bandit has a reward distribution such that $r(i) = \frac{1}{\sqrt{2\pi}} \exp\!\big(-(i - N/2)^2/2\big) + U_i$ for $i = 1, \dots, N$, and stationary i.i.d. measurement noise $U$, with the additional restriction that $\mathbb{V}(U) = \lambda$, i.e., the signal-to-noise ratio (SNR) simplifies to $\mathrm{SNR} = N/\lambda$.

Running any suitable policy optimization method [Dudik et al., 2011, Foster et al., 2018] then produces a policy $\pi$ which regresses on both the signal and the noise (blue curve in Figure 7). However, if the class of regressor functions is overparameterized, then the policy will also capture the noise, which in turn will lead to suboptimal collected rewards. If regularization is not an option (e.g., the training data was deleted due to privacy requirements as discussed in Sen et al. [2014]), then separating signal from noise becomes more challenging.

Our approach involves projecting the policy into a Hilbert space in which the noise is captured via fine-grained basis functions and can therefore be eliminated via truncation of the basis. Figure 7 illustrates the hypothetical bandit scenario and the behavior of our approach. Executing the policy with 2 components will yield a larger reward signal than executing the original policy, because the measurement noise is removed via truncation.

Let $\pi^{\text{Fourier}}_K$ denote the density approximated by the first $K$ Fourier partial sums [Stein and Shakarchi, 2003]; then a result from Jackson [1930] shows that
$$\big\|\pi - \pi^{\text{Fourier}}_K\big\|_2 = O(K^{-1}). \quad (17)$$
More recent results [Giardina and Chirlian, 1972] provide even tighter bounds for continuous and periodic functions with $m$ continuous derivatives and a bounded $m$-th derivative. In such a case,
$$\big\|\pi - \pi^{\text{Fourier}}_K\big\|_\infty = O\big(K^{-(2m+1)}\big). \quad (18)$$

Rate of convergence of mixture approximation
Let $\pi^{MM}_K$ denote a (finite) $K$-component mixture model. A result from Rakhlin et al. [2005] shows the following for $\pi^{MM}_K$ in the class of mixtures of $K$ marginally-independent scaled density functions and $\pi$ in the class of lower-bounded probability density functions:
$$KL\big(\pi \,\|\, \pi^{MM}_K\big) = O\big(K^{-1} + n^{-1/2}\big), \quad (19)$$
where $n$ is the number of i.i.d. samples used in the learning of $\pi^{MM}_K$. The constants hidden inside the big-O notation depend on the nature of the function and can become quite large, which can explain differences in empirical evaluation.

Consider the case when $f$ is a continuous, positive and decreasing function with a discrete sequence $a_n = f(n)$. For example, the reconstruction error $W_1(\Pi, \hat{\Pi})$ as well as the difference in returns $|\eta(\pi) - \eta(\hat{\pi})|$ fall under this family. Then, $\int_0^{\infty} f(t)\, dt < \infty$ implies that $\sum_{n=0}^{\infty} a_n < \infty$. Under these assumptions, convergence guarantees (e.g., on a monotonically decreasing reconstruction error) in continuous space imply convergence in the discrete (empirical) setting. Hence, we operate on discrete spaces rather than continuous ones.

All further computations of distance between two discretized policies are carried out over the corresponding discrete grid $(i,j)$, $i = 1, \dots, b_S$, $j = 1, \dots, b_A$. One can look at the average action taken by two MDP policies at state $s$:
$$\Big| \mathbb{E}_{a \sim \pi}[a|s] - \mathbb{E}_{a \sim \tilde{\pi}}[a|s] \Big| \le W_1\big(\Pi_s, \tilde{\Pi}_s\big). \quad (20)$$
This relationship is one of the motivations behind using $W_1$ to assess goodness-of-fit in the experiments section.

Naively discretizing some function $f$ over a fixed interval $[a,b]$ can be done with $n$ equally spaced bins. To pass from a continuous value $f(x)$ to a discrete one, one would find $i \in \{1, \dots, n\}$ such that $\frac{b-a}{n}\, i \le x \le \frac{b-a}{n}\,(i+1)$. However, using a uniform grid for a non-uniform $f$ wastes representation power. To re-allocate bins in low-density areas, we use the quantile binning method, which first computes the cumulative distribution function of $f$, called $F$. Then, it finds $n$ points such that the probability of falling in each bin is equal. Quantile binning can be seen as uniform binning in probability space, and exactly corresponds to uniform discretization if $f$ is constant (uniform distribution). Below, we present an example of discretizing 4,000 sample points taken from four Gaussian distributions $N(\mu_i, \cdot)$ using quantile binning.

Figure 5: Quantile binning for the four Gaussians problem, using 10 bins per dimension.

In Figure 5, we are only allowed to allocate 10 bins per dimension. Note how the grid is denser around high-probability regions, while the furthermost bins are the widest. This uneven discretization, if applied to the policy density function, allows the agent to have higher detail around high-probability regions. In Python, the function sklearn.preprocessing.KBinsDiscretizer can be used to easily discretize samples from the target density.

Figure 6: Quantile binning of unnormalized state visitation counts of a DQN policy in the discrete Mountain Car task. Bins closer to the starting states are visited more often than those at the end of the trajectory. Blue dots represent the bin limits.

Since policy densities are expected to be more complex than the previous example, we analyze quantile binning in the discrete Mountain Car environment.
To do so, we train a DQN policy until convergence (i.e. collectedrewards at least − ), and perform 500 rollouts using the greedy policy. Trajectories are used to construct a2-dimensional state visitation histogram. At the same time, all visited states are given to the quantile binningalgorithm, which is used to assign observations to bins.Figure 6 presents the state visitation histogram with n S = 30 bins, where color represents the state visitationcount. Bins are shown by dotted lines. .4 Acting with tabular policies In order to execute the learned policy, one has to be able to (1) generate samples conditioned on a given stateand (2) evaluate the log-probability of state-action pairs. Acting with a tabular density can be done via varioussampling techniques. For example, importance sampling of the atoms corresponding to each bin is possible when b A is small, while the inverse c.d.f. method is expected to be faster in larger dimensions. The process can befurther optimized on a case-per-case basis for each method. For example, sampling directly from the real part ofa characteristic function can be done with the algorithm defined in Devroye [1986]Furthermore, it is possible to use any of the above algorithms to jointly sample ( s, a ) pairs under the assumptionthat states were discretized by quantiles. First, uniformly sample a state bin and then use any of the conditionalsampling algorithms to sample action a . Optionally, one can add uniform noise (clipped to the range of the bin)to the sampled action. This naive trick would transform discrete actions into continuous approximations of thepolicy network. We observe that truncating the embedded policy can have a denoising effect, i.e. boosting the signal-to-noiseratio (SNR). As a motivation, consider a multi-armed bandit setting [Slivkins, 2019] with reward distribution r ( i ) = √ π exp( − ( i − N/ /
We observe that truncating the embedded policy can have a denoising effect, i.e., it boosts the signal-to-noise ratio (SNR). As a motivation, consider a multi-armed bandit setting [Slivkins, 2019] with rewards $r(i) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\tfrac{(i - N/2)^2}{2}\right) + U_i$ for $i = 1, \dots, N$, where $U_i$ is stationary measurement noise. By running any suitable policy optimization method [Dudik et al., 2011, Foster et al., 2018], we obtain a policy $\pi$ which regresses on both the signal and the noise. Figure 7 illustrates this hypothetical bandit scenario. Applying the DFT followed by truncation produces policies $\pi_K$, some of which are shown in Figure 7a. Sampling actions from $\pi_K$ yields a lower average regret up to K = 10, because the measurement noise is truncated away while the correct signal mode is recovered.

Figure 7: (a) Motivating bandit example with ground-truth signal, noisy signal and truncated policy, and (b) rewards collected by the greedy (i.e., the mode) truncated and true policies.
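A hedged numerical sketch of this denoising effect (the arm count, noise level and Boltzmann construction below are our illustrative choices, not the exact setup behind Figure 7):

import numpy as np

rng = np.random.default_rng(0)
N = 64
arms = np.arange(N)
signal = np.exp(-(arms - N / 2) ** 2 / 2.0)        # Gaussian-shaped reward signal
noisy = signal + 0.2 * rng.standard_normal(N)      # add stationary measurement noise

pi = np.exp(noisy) / np.exp(noisy).sum()           # Boltzmann policy fitted to the noisy rewards

def truncate_policy(pi, K):
    # Keep only the K largest-magnitude Fourier coefficients of the policy.
    coeffs = np.fft.fft(pi)
    keep = np.argsort(np.abs(coeffs))[-K:]
    mask = np.zeros_like(coeffs)
    mask[keep] = coeffs[keep]
    rec = np.clip(np.real(np.fft.ifft(mask)), 0.0, None)
    return rec / rec.sum()

pi_10 = truncate_policy(pi, K=10)
print("mode of noisy policy:", np.argmax(pi), "mode of truncated policy:", np.argmax(pi_10))

With only a handful of coefficients, the reconstruction keeps the low-frequency Gaussian bump and discards most of the high-frequency noise.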
The bandit performance bound can be stated as follows.

Lemma 7.1.
Let $\mathcal{H}$ be an RKHS, let $\pi_1, \pi_2 \in \mathcal{H}$ be represented by the coefficient sequences $\{\xi^{\pi_1}_k\}_{k=1}^{\infty}$ and $\{\xi^{\pi_2}_k\}_{k=1}^{\infty}$, and let $r : \mathcal{A} \to \mathbb{R}$ be the bandit's reward function. Let $M > 0$ be such that $|\pi_1(a) - \pi_2(a)| \le M\,\|\pi_1 - \pi_2\|_{\mathcal{H}}$ for all $a \in \mathcal{A}$, and let $p, q > 1$ be such that $1/p + 1/q = 1$. Then
$$|\eta(\pi_1) - \eta(\pi_2)| \;\le\; |\mathcal{A}|^{1/q}\, M\, \|r\|_p \sum_{k=1}^{\infty} (\xi^{\pi_1}_k - \xi^{\pi_2}_k)\,\lambda_k .$$

Lemma 7.1 guarantees that the rewards collected by $\pi_2$ depend on the distance between $\pi_1$ and $\pi_2$ in the RKHS. The proof can be found in the next section.
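For instance, instantiating the Hölder pair at $p = \infty$, $q = 1$ (a limiting case of the statement above) gives a sup-norm form of the bound:
$$|\eta(\pi_1) - \eta(\pi_2)| \;\le\; |\mathcal{A}|\, M\, \|r\|_\infty \sum_{k=1}^{\infty} (\xi^{\pi_1}_k - \xi^{\pi_2}_k)\,\lambda_k,$$
so for a bounded reward function the return gap is controlled, up to the constant $|\mathcal{A}|\, M\, \|r\|_\infty$, purely by the discrepancy between the two coefficient sequences.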
Proof. (Lemma 3.3) Throughout the proof, we use the expectation $\mathbb{E}_\beta$, which can be seen as the classical inner product weighted by the initial state-action distribution,
$$\langle f, g \rangle_\beta = \int_{s,a} f(s,a)\, g(s,a)\, \beta(s,a)\, d(s,a),$$
over the domain of $(s, a)$. If we take a uniform initialization density, i.e., $\beta(s,a) \propto (|\mathcal{S}||\mathcal{A}|)^{-1}$, then the inner product simplifies to $\langle f, g \rangle_\beta = (|\mathcal{S}||\mathcal{A}|)^{-1} \langle f, g \rangle$, which will be used further on to simplify the covariance term between orthonormal basis functions.

After a first-order Taylor expansion of the variances around $\eta(\pi)$ and $\eta(\hat{\pi}_K)$, as described in Benaroya et al. [2005], we obtain
$$\mathbb{V}_\beta[\pi(a \mid s)] = \{\sigma'(\eta(\pi))\}^2\, \mathbb{V}_\beta[Q^{\pi}(s,a)] + E_\pi, \qquad \mathbb{V}_\beta[\hat{\pi}_K(a \mid s)] = \{\sigma'(\eta(\hat{\pi}_K))\}^2\, \mathbb{V}_\beta[Q^{\hat{\pi}_K}(s,a)] + E_{\hat{\pi}_K}, \qquad (21)$$
where $E_\pi$ is the second-order Taylor residual for policy $\pi$. This identity is widely used in statistics, especially for approximating the variance of functions of statistics through the delta method [Wolter, 2007].

Expanding the variance of a sum of correlated random variables, we obtain
$$\mathbb{V}_\beta[\pi(a \mid s)] = \mathbb{V}_\beta[\hat{\pi}_K(a \mid s)] + \mathbb{V}_\beta[\varepsilon_K(a \mid s)] + 2\,\mathrm{Cov}_\beta[\hat{\pi}_K(a \mid s), \varepsilon_K(a \mid s)]. \qquad (22)$$
The covariance term $\mathrm{Cov}_\beta[\hat{\pi}_K(a \mid s), \varepsilon_K(a \mid s)]$ decomposes as $\mathbb{E}_\beta[\hat{\pi}_K(a \mid s)\,\varepsilon_K(a \mid s)] - \mathbb{E}_\beta[\hat{\pi}_K(a \mid s)]\,\mathbb{E}_\beta[\varepsilon_K(a \mid s)]$. The second-moment part of the covariance vanishes with respect to this inner product because the basis functions are orthonormal; the dominated convergence theorem lets us exchange the order of summation for $\sum_{j=K+1}^{\infty} \xi_j \omega_j(s,a)$, which greatly simplifies the calculation:
$$\mathbb{E}_\beta[\hat{\pi}_K(a \mid s)\,\varepsilon_K(a \mid s)]
= \mathbb{E}_\beta\Big[\sum_{k=1}^{K} \xi_k \omega_k(s,a) \sum_{j=K+1}^{\infty} \xi_j \omega_j(s,a)\Big]
= \sum_{k=1}^{K} \sum_{j=K+1}^{\infty} \xi_k \xi_j\, \mathbb{E}_\beta[\omega_k(s,a)\,\omega_j(s,a)]
= \sum_{k=1}^{K} \sum_{j=K+1}^{\infty} \xi_k \xi_j\, \langle \omega_k, \omega_j \rangle_\beta = 0, \qquad (23)$$
since $\omega_k$ and $\omega_j$ are orthogonal for $k \ne j$.

Combining both expressions yields the relationship between the variance of the truncated policy and that of the expected returns:
$$\mathbb{V}_\beta[Q^{\pi}] = \frac{\{\sigma'(\eta(\hat{\pi}_K))\}^2\, \mathbb{V}_\beta[Q^{\hat{\pi}_K}] - 2\,\mathbb{E}_\beta[\hat{\pi}_K]\,\mathbb{E}_\beta[\varepsilon_K] + \mathbb{V}_\beta[\varepsilon_K] + E_{\hat{\pi}_K}}{\{\sigma'(\eta(\pi))\}^2} + E_\pi. \qquad (24)$$
If the Taylor expansion is performed in a neighborhood of $\eta(\pi)$ and $\eta(\hat{\pi}_K)$, we can consider the error terms sufficiently small and neglect them, as is often done in classical statistics [Benaroya et al., 2005].
$$\mathbb{V}_\beta[Q^{\pi}] \approx \left(\frac{\sigma'(\eta(\hat{\pi}_K))}{\sigma'(\eta(\pi))}\right)^{2} \mathbb{V}_\beta[Q^{\hat{\pi}_K}] + \frac{\mathbb{V}_\beta[\varepsilon_K] - 2\,\mathbb{E}_\beta[\hat{\pi}_K]\,\mathbb{E}_\beta[\varepsilon_K]}{\{\sigma'(\eta(\pi))\}^{2}}. \qquad (25)$$
For the variance of the returns of the true policy to be no lower than the variance of the returns of the truncated policy, it suffices that the coefficient of the first term on the right-hand side is at most 1 and that the second term is non-negative. Note that these conditions are sufficient but not necessary. Precisely, the following must hold:
1. $\sigma'(\eta(\hat{\pi}_K)) \le \sigma'(\eta(\pi))$;
2. $\mathbb{V}_\beta[\varepsilon_K] \ge 2\,\mathbb{E}_\beta[\hat{\pi}_K]\,\mathbb{E}_\beta[\varepsilon_K]$, an assumption on the second-moment behavior of the policy, satisfied for example by a Student's $t$ distribution with a sufficiently large number of degrees of freedom $\nu$;
3. the Taylor error terms $E_\pi$ and $E_{\hat{\pi}_K}$ are small, since the expansion is performed around $\eta(\pi)$ and $\eta(\hat{\pi}_K)$, respectively.
That is, under various initializations, truncating the policy coefficients can be beneficial by reducing the variance.

Proof. (Theorem 3.5) We want to quantify the difference between the continuous cdf $\Pi$ and the step function $\lfloor\Pi\rfloor$ defined by the quantile binning strategy with $b$ bins. The error between $\Pi$ and $\lfloor\Pi\rfloor$ on bin $i$ can be written as
$$\delta_i = \int_{q_{i-1}}^{q_i} \Pi(x)\, dx - \frac{i\,(q_i - q_{i-1})}{b}, \qquad (26)$$
where $q_i = \hat{Q}\!\left(\tfrac{i}{b}\right)$ is the $i$-th empirical quantile. The total error across all bins is therefore
$$\delta = \sum_{i=1}^{b} \left[ \int_{q_{i-1}}^{q_i} \Pi(x)\, dx - \frac{i\,(q_i - q_{i-1})}{b} \right]. \qquad (27)$$
The error $\delta$ is computed across both states and actions: the total volume of the error between the policies can be thought of as the error in $\mathcal{A}$ made for a single state group, so the state approximation error is counted $b_S$ times, akin to conditional probabilities. The state-action total error is
$$\delta_{s,a} = \sum_{i=1}^{b_S} \left[ \int_{q^S_{i-1}}^{q^S_i} \Pi^S(s)\, ds - \frac{i\,(q^S_i - q^S_{i-1})}{b_S} \right] \sum_{j=1}^{b_A} \left[ \int_{q^A_{i,j-1}}^{q^A_{i,j}} \Pi^A(a)\, da - \frac{j\,(q^A_{i,j} - q^A_{i,j-1})}{b_A} \right], \qquad (28)$$
where the superscripts $A$ and $S$ denote dependence of the quantity on the action or the state, respectively, and $q_{i,j}$ refers to the action quantile computed conditional on the state falling into bin $i$.
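As a small numerical illustration of the per-bin quantity $\delta_i$ above (standard-normal target; the bin counts and sample size are our choices), one can check that the total quantile-binning error shrinks as the number of bins $b$ grows:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rng = np.random.default_rng(0)
samples = rng.standard_normal(10_000)

def total_binning_error(samples, b):
    # Empirical quantiles q_0, ..., q_b and the per-bin error
    # delta_i = int_{q_{i-1}}^{q_i} Pi(x) dx - (i / b) * (q_i - q_{i-1}).
    qs = np.quantile(samples, np.linspace(0.0, 1.0, b + 1))
    total = 0.0
    for i in range(1, b + 1):
        area_cdf, _ = quad(norm.cdf, qs[i - 1], qs[i])
        total += area_cdf - (i / b) * (qs[i] - qs[i - 1])
    return total

for b in (5, 10, 20, 40):
    print(b, round(total_binning_error(samples, b), 4))

In this example the magnitude of the total error decreases roughly like $1/b$.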
Proof. (Lemma 7.1)
$$\begin{aligned}
\left| \eta(\pi_1) - \eta(\pi_2) \right|
&= \Big| \sum_{a \in \mathcal{A}} r(a) \big[ \pi_1(a) - \pi_2(a) \big] \Big|
 = \Big| \sum_{a \in \mathcal{A}} r(a) \Delta(a) \Big|
 \le \sum_{a \in \mathcal{A}} \big| r(a) \Delta(a) \big| \\
&= \| r \Delta \|_1
 \le \| r \|_p \, \| \Delta \|_q
 \le |\mathcal{A}|^{1/q}\, M\, \| r \|_p \sum_{k=1}^{\infty} (\xi^{\pi_1}_k - \xi^{\pi_2}_k)\, \lambda_k,
\end{aligned}$$
where the second-to-last inequality is a direct application of Hölder's inequality and the last one holds because evaluation is a bounded linear functional, i.e., $|f(a)| \le M \|f\|_{\mathcal{H}}$ for all $f \in \mathcal{H}$.

Proof. (Theorem 3.4) Recall the following result from Kakade and Langford [2002], Schulman et al. [2015]:
$$| \eta(\pi_2) - L_{\pi_1}(\pi_2) | \le \frac{\alpha \gamma \varepsilon}{1 - \gamma}, \qquad (29)$$
where $\varepsilon = \max_{s,a} |Q^{\pi_1}(s,a) - V^{\pi_1}(s)|$ and $\alpha = \max_{s} TV(\pi_1(\cdot \mid s), \pi_2(\cdot \mid s))$. Now suppose that, for every fixed state $s \in \mathcal{S}$, there exists an RKHS $\mathcal{H}_s$ of all square-integrable conditional densities of the form $\pi(\cdot \mid s)$. We examine the difference between the two policies in this $\mathcal{H}_s$.
Since $\pi_1, \pi_2 \in \mathcal{H}_s$, we have $\delta_{1,2} = \pi_1 - \pi_2 \in \mathcal{H}_s$, and hence there exists $M_s > 0$ such that $|\delta_{1,2}(a)| \le M_s \|\delta_{1,2}\|_{\mathcal{H}_s}$ for all $a \in \mathcal{A}$. Then
$$\sum_{a \in \mathcal{A}} |\pi_1(a) - \pi_2(a)| \le \sum_{a \in \mathcal{A}} |\delta_{1,2}(a)| \le M_s \sum_{a \in \mathcal{A}} \|\delta_{1,2}\|_{\mathcal{H}_s} \le |\mathcal{A}|\, M_s \sum_{k=1}^{\infty} (\xi^{\pi_1}_k - \xi^{\pi_2}_k)\,\lambda_k = |\mathcal{A}|\, M_s\, \Delta^{\infty}_{\mathcal{H}_s}, \qquad (30)$$
where we write $\Delta^{\infty}_{\mathcal{H}_s} = \sum_{k=1}^{\infty} (\xi^{\pi_1}_k - \xi^{\pi_2}_k)\,\lambda_k$ for brevity. Then
$$| \eta(\pi_2) - L_{\pi_1}(\pi_2) | \le \frac{|\mathcal{A}|\, \gamma \varepsilon}{1 - \gamma} \big( \max_{s \in \mathcal{S}} M_s\, \Delta^{\infty}_{\mathcal{H}_s} \big). \qquad (31)$$
We can obtain a performance bound by first expanding the left-hand side,
$$L_{\pi_1}(\pi_2) = \eta(\pi_1) + \sum_{s} \rho_{\pi_1}(s) \sum_{a} \pi_2(a \mid s)\, A_{\pi_1}(s, a).$$
Using Eq. (29) and the reverse triangle inequality, we have
$$\Big| \, | \eta(\pi_2) - \eta(\pi_1) | - \Big| \sum_{s} \rho_{\pi_1}(s) \sum_{a} \pi_2(a \mid s)\, A_{\pi_1}(s, a) \Big| \, \Big|
\le \Big| \eta(\pi_2) - \Big\{ \eta(\pi_1) + \sum_{s} \rho_{\pi_1}(s) \sum_{a} \pi_2(a \mid s)\, A_{\pi_1}(s, a) \Big\} \Big|
\le \frac{\alpha \gamma \varepsilon}{1 - \gamma}.$$
Therefore,
$$| \eta(\pi_2) - \eta(\pi_1) | \le \frac{\alpha \gamma \varepsilon}{1 - \gamma} + \Big| \sum_{s} \rho_{\pi_1}(s) \sum_{a} \pi_2(a \mid s)\, A_{\pi_1}(s, a) \Big| \le \frac{\alpha \gamma \varepsilon}{1 - \gamma} + \varepsilon. \qquad (32)$$
Combining Eq. (30) with Eq. (32) gives the following RKHS form of the improvement lower bound:
$$| \eta(\pi_1) - \eta(\pi_2) | \le \frac{|\mathcal{A}|\, \varepsilon \gamma}{1 - \gamma} \big( \max_{s \in \mathcal{S}} M_s\, \Delta^{\infty}_{\mathcal{H}_s} \big) + \varepsilon. \qquad (33)$$
If we suppose that the weights after index $K$ are small enough, we can split the approximation error using Lemma 3.2, with $E_K$ denoting the contribution of the tail coefficients beyond $K$:
$$M_s\, \Delta^{\infty}_{\mathcal{H}_s} \le M_s \big( \Delta^{K}_{\mathcal{H}_s} + E_K \big). \qquad (34)$$
Using this in Eq. (33) yields
$$| \eta(\pi_1) - \eta(\pi_2) | \le \frac{|\mathcal{A}|\, \varepsilon \gamma}{1 - \gamma} \Big( \max_{s \in \mathcal{S}} \big( M_s\, \Delta^{K}_{\mathcal{H}_s} \big) + 2 E_K \max_{s \in \mathcal{S}} M_s\, \Delta^{K}_{\mathcal{H}_s} + E_K \max_{s \in \mathcal{S}} M_s \Big) + \varepsilon, \qquad (35)$$
$$| \eta(\pi_1) - \eta(\pi_2) | \le \frac{|\mathcal{A}|\, \varepsilon \gamma}{1 - \gamma} \Big\{ \max_{s \in \mathcal{S}} \big( M_s\, \Delta^{K}_{\mathcal{H}_s} \big) + O\big( \varepsilon_K \max_{s \in \mathcal{S}} M_s\, \Delta^{K}_{\mathcal{H}_s} \big) \Big\} + \varepsilon. \qquad (36)$$
Now, if $\varepsilon_K$ is close to zero (i.e., $\pi_1$ and $\pi_2$ have very similar coefficients $\xi_K, \xi_{K+1}, \dots$), the expression simplifies to
$$| \eta(\pi_1) - \eta(\pi_2) | \le \frac{|\mathcal{A}|\, \varepsilon \gamma}{1 - \gamma} \max_{s \in \mathcal{S}} \big( M_s\, \Delta^{K}_{\mathcal{H}_s} \big) + \varepsilon. \qquad (37)$$

Proof. (Theorem 3.7) We first represent the approximate transition matrix $P_{\tilde{\pi}}$ induced by the policy $\tilde{\pi}$ as a perturbation of the true transition matrix:
$$P_{\tilde{\pi}} = P_{\pi} + ( P_{\tilde{\pi}} - P_{\pi} ) = P_{\pi} + E. \qquad (38)$$
Then, the difference between the stationary distributions $\rho_{\tilde{\pi}}$ and $\rho_{\pi}$ is given by the result of Schweitzer [1968], Cho and Meyer [2001]:
$$\rho_{\pi}^{\top} - \rho_{\tilde{\pi}}^{\top} = \rho_{\tilde{\pi}}^{\top} E Z, \qquad (39)$$
where $Z = (I - P_{\pi} + \mathbf{1}_{|\mathcal{S}|}\, \rho_{\pi}^{\top})^{-1}$ is the fundamental matrix of the Markov chain induced by $P_{\pi}$ and $\mathbf{1}_{|\mathcal{S}|}$ is a vector of ones. In particular, the above result holds for Schatten norms, as shown by Baumgartner [2011]:
$$\| \rho_{\pi} - \rho_{\tilde{\pi}} \|_{S_1} \le \| \rho_{\tilde{\pi}} \|_{S_1}\, \| Z \|_{S_\infty}\, \| E \|_{S_\infty} \le \| Z \|_{S_\infty}\, \| E \|_{S_\infty}. \qquad (40)$$
So far, this result applies to irreducible, homogeneous Markov chains and has no decision component. Consider the matrix $E$, the difference between the expected transition models of the true and approximate policies. It can be expanded into products of matricized tensors:
$$E = P_{\tilde{\pi}} - P_{\pi} = A ( \tilde{\Pi} \otimes I ) T_{(3)}^{\top} - A ( \Pi \otimes I ) T_{(3)}^{\top} = A \big( ( \tilde{\Pi} - \Pi ) \otimes I \big) T_{(3)}^{\top}. \qquad (41)$$
The norm of $E$ can be upper bounded as follows:
$$\begin{aligned}
\| E \|_{S_1}
&\le \| A \|_{S_1}\, \big\| \big( ( \tilde{\Pi} - \Pi ) \otimes I \big) T_{(3)}^{\top} \big\|_{S_\infty}
 \le \| A \|_{S_1}\, \big\| ( \tilde{\Pi} - \Pi ) \otimes I \big\|_{S_1}\, \| T_{(3)}^{\top} \|_{S_\infty} \\
&\le \| A \|_{S_1}\, \| \tilde{\Pi} - \Pi \|_{S_1}\, \| I \|_{S_\infty}\, \| T_{(3)}^{\top} \|_{S_\infty}
 = \| \tilde{\Pi} - \Pi \|_{S_1}\, \| T_{(3)} \|_{S_\infty}. \qquad (42)
\end{aligned}$$
Combining this with the result of Schweitzer [1968] yields
$$\| \rho_{\pi} - \rho_{\tilde{\pi}} \|_{S_1} \;\le\; \underbrace{\| Z \|_{S_\infty}}_{\text{depends on MDP and } \pi} \; \underbrace{\| \tilde{\Pi} - \Pi \|_{S_1}}_{\text{depends on } \pi} \; \underbrace{\| T_{(3)} \|_{S_\infty}}_{\text{depends on MDP}}. \qquad (43)$$
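A hedged numerical check of the perturbation identity and the resulting norm bound, on a small synthetic chain and with ordinary matrix 1-/∞-norms rather than the Schatten norms used above:

import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(n):
    # Strictly positive rows => an irreducible, aperiodic chain.
    P = rng.random((n, n)) + 0.1
    return P / P.sum(axis=1, keepdims=True)

def stationary(P):
    w, v = np.linalg.eig(P.T)
    rho = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return rho / rho.sum()

n = 5
P, P_tilde = random_stochastic(n), random_stochastic(n)
rho, rho_t = stationary(P), stationary(P_tilde)

E = P_tilde - P
Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), rho))   # fundamental matrix of P

# Perturbation identity (up to the sign convention for E): rho_t - rho = rho_t E Z.
print("identity residual:", np.max(np.abs((rho_t - rho) - rho_t @ E @ Z)))
# Norm bound: ||rho - rho_t||_1 <= ||E||_inf * ||Z||_inf (induced norms, i.e. max row sums).
lhs = np.abs(rho - rho_t).sum()
rhs = np.abs(E).sum(axis=1).max() * np.abs(Z).sum(axis=1).max()
print("bound holds:", lhs <= rhs)

The identity residual is numerically zero and the printed bound holds for any such pair of chains, mirroring the structure of Eq. (43).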
The time complexity of a $k$-truncated SVD of a matrix $A \in \mathbb{R}^{n \times m}$ is $O(mnk)$, as discussed in Du et al. [2017]. The fast Fourier transform can be computed in $O(mn \log(mn))$ in the 2-dimensional case [Lohne, 2017]. Hence, SVD is expected to be faster whenever $k < \log(nm)$. The time complexity of the discrete wavelet transform (DWT) depends on the choice of mother wavelet: when using Mallat's algorithm, the 2-dimensional DWT has complexities ranging from $O(n^2 \log n)$ down to $O(n^2)$ for a square matrix of size $n \times n$, depending on the mother wavelet [Resnikoff et al., 2012]. Finally, we expect FFT and Daubechies wavelets to require fewer parameters than SVD, since the former use pre-defined orthonormal bases, while the latter must store the left and right singular vectors.

We validate the rate of convergence of Thm. 3.7 in the toy experiment described below. Consider a deterministic chain MDP with $N$ states. A fixed policy in state $i$ transitions to state $i+1$ with probability $\alpha$ and to state $i-1$ with probability $1-\alpha$; in the two boundary states, the agent stays put with probability $1-\alpha$ (leftmost state) and $\alpha$ (rightmost state), respectively. We consider the expected transition model $P_{\pi}$ obtained using the policy reconstructed through the discrete Fourier transform. A visualization of Theorem 3.7 is shown in Fig. 8.

Figure 8: Upper bound on $\|\rho_{\pi} - \rho_{\hat{\pi}}\|$ in the chain MDP task as a function of the number of Fourier components and of the number of states.
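A minimal sketch of the chain-MDP transition model used in this toy experiment (the values of N and alpha below are arbitrary choices):

import numpy as np

def chain_transition(N, alpha):
    # Move right with probability alpha, left with probability 1 - alpha;
    # clamping at the boundaries makes the end states sticky, as described above.
    P = np.zeros((N, N))
    for i in range(N):
        P[i, min(i + 1, N - 1)] += alpha
        P[i, max(i - 1, 0)] += 1.0 - alpha
    return P

P_pi = chain_transition(N=10, alpha=0.7)
assert np.allclose(P_pi.sum(axis=1), 1.0)   # each row is a valid distribution

The stationary distribution of P_pi, and of its counterpart built from a truncated policy, can then be compared directly, as in Figure 8.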
7.9 Experiment details

This subsection covers all aspects of the conducted experiments, including pseudocode, additional results and details about the setup.

Below is Python-like pseudocode for the case of Fourier bases (it can easily be adapted to all bases examined in this paper). The pseudocode is heavily simplified Python and skips some bookkeeping steps, such as edge cases, but conveys the overall flow of the algorithm; the helper rollout(...) is assumed to collect trajectories with the given policy.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def RKHS_policy_step_1_and_2(cts_pi, env, n_trajectories, n_episode_steps, b_S, b_A):
    """
    cts_pi: object with 2 methods:
        - sample(state): samples an action from pi(.|state);
        - log_prob(state, action): returns log pi(a|s);
    env: OpenAI Gym environment;
    n_trajectories: number of trajectories in rollouts (positive integer);
    n_episode_steps: maximum number of steps in each episode (positive integer);
    b_S: number of state bins per dimension (positive integer);
    b_A: number of action bins per dimension (positive integer).
    """
    # Collect rollouts; states and actions are arrays of shape (n_samples, dim).
    states, actions, rewards = rollout(cts_pi, env, n_trajectories, n_episode_steps)

    enc_A = KBinsDiscretizer(n_bins=b_A, encode='ordinal', strategy='quantile')
    enc_S = KBinsDiscretizer(n_bins=b_S, encode='ordinal', strategy='quantile')
    bins_A = enc_A.fit_transform(actions)
    bins_S = enc_S.fit_transform(states)

    # Evaluate the continuous policy at the bin representatives and normalize per state bin.
    dsc_pi = np.zeros(shape=(b_S, b_A))
    for i in range(b_S):
        for j in range(b_A):
            s, a = enc_S.inverse_transform([[i]]), enc_A.inverse_transform([[j]])
            dsc_pi[i, j] = np.exp(cts_pi.log_prob(s, a))
        dsc_pi[i, :] /= np.sum(dsc_pi[i, :])

    # Empirical state visitation distribution over state bins.
    rho_pi = np.zeros(shape=(b_S,))
    for b in bins_S.astype(int).ravel():
        rho_pi[b] += 1
    rho_pi /= np.sum(rho_pi)
    return dsc_pi, rho_pi

def RKHS_policy_step_3(dsc_pi, rho_pi):
    """
    dsc_pi: np.array of size b_S x b_A;
    rho_pi: np.array of length b_S.
    """
    b_A = dsc_pi.shape[1]
    # Prune state bins that are never visited.
    idx_to_keep = np.where(rho_pi > 0)[0]
    pruned_dsc_pi = dsc_pi[idx_to_keep]

    def pruned_agent(state_bin):
        actions = np.arange(b_A)
        if state_bin in idx_to_keep:
            probs = pruned_dsc_pi[np.where(idx_to_keep == state_bin)[0][0]]
        else:
            probs = np.ones(b_A) / b_A   # uniform fallback for pruned states
        return np.random.choice(actions, 1, p=probs)

    return pruned_agent

def RKHS_policy_step_4(dsc_pi, K):
    """
    dsc_pi: np.array of size b_S x b_A;
    K: number of components to keep (positive integer).
    """
    FS = np.fft.fftn(dsc_pi)
    # Work with the (shifted) real part of the spectrum and keep the K largest entries.
    fshift = np.real(np.fft.fftshift(FS))
    threshold = np.sort(fshift.flatten())[::-1][K]
    mask = (fshift >= threshold).astype(int)
    fshift = fshift * mask
    f_ishift = np.fft.ifftshift(fshift)
    f_dft = np.real(np.fft.ifftn(f_ishift))
    return f_dft
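As noted above, the pipeline is basis-agnostic. A hedged sketch of an SVD variant of step 4 (our adaptation for illustration, not code from the paper):

import numpy as np

def RKHS_policy_step_4_svd(dsc_pi, K):
    # Keep the top-K singular components of the discretized policy instead of
    # the top-K Fourier coefficients.
    U, D, Vt = np.linalg.svd(dsc_pi, full_matrices=False)
    approx = (U[:, :K] * D[:K]) @ Vt[:K, :]
    # Stored parameters: K * (b_S + b_A + 1) entries of U, D and V,
    # versus K retained (complex) coefficients for the DFT variant.
    return approx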
We consider a bandit problem in which rewards are spread along a circle, as shown in Figure 2a). Actions consist of angles in the interval $[-\pi, \pi]$. Given the corresponding Boltzmann policy, we compare the reconstruction quality of the discrete Fourier transform and of a GMM. In addition to Fourier and GMM, we also compare with a fixed-basis GMM, where the means are sampled uniformly over $[-\pi, \pi]$, the variances are drawn uniformly from a small fixed set of values, and only the mixing coefficients are learned. This provides insight into whether the reconstruction algorithms are actually learning the policy, or whether $\pi$ is so simple that a random set of Gaussians can already approximate it well.

All parameters were chosen using a memory-performance trade-off heuristic. For example, for Continuous Mountain Car, we used Figure 15 to select $b_S = 35$.

Hyperparameter        Turntable    Pendulum    Continuous Mountain Car
b_A                   100          15          10
b_S                   N/A          35          35
Optimizer             Adam
Architecture          256
Learning rate         1e-03
Hidden dimension      256
Rollouts              100
Torch seeds           0 to 9

Table 2: Hyperparameters used across experiments (N/A: not applicable).

7.9.4 Pendulum
This classic control task consists of a pendulum which needs to swing up in order to stay upright. The action is the joint effort (torque), bounded between −2 and 2 units. We train SAC until convergence and save snapshots of the actor and critic at two points during training. The reconstruction task is to recover both the Gaussian policy (actor) and the Boltzmann-Q policy (critic, with the temperature set to 1).

Figure 9: Plots of the Gaussian $\Pi$ in polar coordinates at a given training checkpoint. Each circle of a different radius represents a state in the Pendulum environment, as shown by the snapshots. Higher-intensity colors (red) represent higher density mass on the given angular action.

Figure 10: Plots of the discretized and pruned (a) Gaussian policy and (b) Boltzmann-Q policy in polar coordinates at a given number of training steps of soft actor-critic. Rewards collected by the respective continuous networks are indicated in parentheses. Each circle of radius $s$ corresponds to $\pi(\cdot \mid s)$ for a discretized state $s$. All densities are on the same scale.

Figure 11: Plots of the discretized and pruned Boltzmann policy in polar coordinates at two checkpoints of soft actor-critic training. Each circle of radius $s$ corresponds to $\pi(\cdot \mid s)$ for a discretized state $s$.

Figure 12: Absolute difference in returns collected by the discretized and reconstructed Boltzmann-Q policies for Pendulum-v0, averaged over 10 trials. Blue dots represent the number of parameters of the neural network policy. The SVD, DFT and DB4 projections need an order of magnitude fewer parameters to reconstruct the original policy.

Figure 13: $W$ distance between the true and reconstructed stationary Boltzmann-Q distributions, averaged over 10 trials for Pendulum-v0.

Figure 14: $W$ distance between the true and reconstructed stationary distributions, averaged over 10 trials. The SVD, DFT and DB4 methods converge quickly to the oracle's stationary distribution.

Figure 15: (a) Difference in returns between the embedded and ground-truth policies for the Mountain Car task as a function of the number of parameters used, and the discretization and pruning errors for (b) the Gaussian policy of
ContinuousMountainCar-v0 and (c) the Boltzmann-Q policy of
Pendulum-v0.
In this environment, the agent (a car) must reach the flag located at the top of a hill. It needs to go back and forth to build up enough momentum to reach the top. The agent applies an action (acceleration) in the interval $[-1, 1]$. We use Soft Actor-Critic (SAC, Haarnoja et al. [2018]) to train the agent until convergence.

Figure 15a shows that the error between the true and truncated policies steadily decreases as a function of the number of projection weights kept. Note that SVD's parameter count is the total number of entries in the matrices $U, D, V$, hence the smallest possible rank of $D$ (rank 1) dictates the minimal number of parameters. Figures 15b-c show how the discretization and pruning performance behaves as a function of the number of trajectories (for pruning, since it relies on trajectories to estimate $\rho_\pi$) or of the number of bins per state dimension (for discretization).

In an environment where the state space is very large, or when policies are degenerate along a narrow path in the state space, the pruning step can turn out to be very useful. Table 3 shows the proportion of parameters of the discretized policy that was pruned. Pruned states are those visited with probability at most $\varepsilon$; in all our experiments, $\varepsilon$ was set to 0, so no visited state was ever pruned.