Convergence of Q-value in case of Gaussian rewards
Konatsu Miyamoto, Masaya Suzuki, Yuma Kigami, and Kodai Satake
Dynamic Pricing Technology, MatrixFlow, Osaka University, Kindai University
March 10, 2020
Abstract
In this paper, as a study of reinforcement learning, we prove convergence of the Q-function under unbounded rewards such as Gaussian rewards. By the central limit theorem, in some real-world applications it is natural to assume that rewards follow a Gaussian distribution, but existing proofs cannot guarantee convergence of the Q-function in that case. Furthermore, in the distributional reinforcement learning and Bayesian reinforcement learning that have become popular in recent years, it is desirable to allow rewards with a Gaussian distribution. Therefore, in this paper, we prove convergence of the Q-function under the condition $E[r(s,a)^2] < \infty$, which is much weaker than the assumptions of existing work. Finally, as a bonus, a proof of the policy gradient theorem for distributional reinforcement learning is also given.

1 Introduction

In recent years, Reinforcement Learning (RL) has attracted much attention. Standard methods in ordinary reinforcement learning based on Markov decision processes use a state-action value function [1]. Agents produced by these algorithms choose strategies that maximize the expected value of the cumulative reward. In practical use, however, there are many situations where it is necessary to consider not only expected values but also risks. Therefore, Distributional Reinforcement Learning (DRL), which considers the distribution of cumulative rewards, has also been studied. DRL research includes a particle-based, risk-sensitive algorithm [2]. Related work includes [3][4], which are mathematically equivalent to [2] but use a different algorithm, and parametric methods [5]. [4] discusses the convergence of measures in discrete steps. Another way to practice DRL is the Bayesian approach. In [22], it is regarded as an estimation of the uncertainty of the expected value, but in fact Bayesian inference can approximate the distribution of uncertain objects, so it can perform distributional reinforcement learning. There are other existing papers on Bayesian reinforcement learning; here we take up [6][7]. They use Gaussian processes, so the reward can be said to follow a Gaussian distribution. [5] also supports unbounded rewards such as Gaussian rewards. We want to show that the approximation of the cumulative reward distribution converges even with unbounded rewards. In this paper, we prove the convergence of the ordinary state-action value function as a preliminary step. In addition, we give a convergence proof for Q-functions with a continuous domain, with Deep Q-learning (DQN) in mind.
The history of proofs of Q-function convergence is long. For example, there are papers such as [8], [9], [10], and [11], which builds on [10]. A paper with an unusual proof method is [12], which uses ordinary differential equations. For DQN, there is a study [13] summarizing the approximation error; the approximation error due to the neural network is analyzed there. Other related results include [14][15][16][17][18]. All of these studies assume that rewards are bounded, that is, that there is a constant $R_{\max} < \infty$ such that

|r(s,a)| \le R_{\max} \quad \text{a.e.}   (1.1)

Therefore, Gaussian rewards cannot be handled. In this paper, with the normal distribution in mind, we prove the convergence of the Q-function under the condition

\forall (s,a) \in S \times A, \quad E[r(s,a)^2] < \infty,   (1.2)

which is weaker than (1.1). Finally, we prove the convergence of the Q-function on a continuous domain under ideal conditions; this setting appears frequently in reinforcement learning.

2 Preliminaries

Let $(S, \mathcal{S})$ and $(T, \mathcal{T})$ both be measurable spaces. A transition kernel $k : S \times \mathcal{T} \to \mathbb{R}_+$ is defined to satisfy the following two conditions:

\forall B \in \mathcal{T}, \; k(\cdot, B) \text{ is measurable on } S,   (2.1)
\forall s \in S, \; k(s, \cdot) \text{ is a measure on } T.   (2.2)

This is used in situations where $s$ is fixed and a distribution on $T$ is determined. Assume that both the set of states $S$ and the set of actions $A$ are finite. A transition kernel $p$ is defined on $(S \times A, \mathcal{S} \otimes \mathcal{A})$ and $(S \times \mathbb{R}, \mathcal{S} \otimes \mathcal{B}(\mathbb{R}))$. That is, $p(\cdot, \cdot \mid s,a)$ is the probability measure that governs the distribution of the next state $s' \in S$ and the immediate reward $r \in \mathbb{R}$ when action $a \in A$ is taken in state $s \in S$. A policy $\pi : S \to \mathcal{P}(A)$ gives the action probabilities determined by the current state. A policy is deterministic if for every $s$ there is an $a$ with $\pi(a|s) = 1$. A family of random variables $s_t, a_t, r_t$ taking values in $S, A, \mathbb{R}$ is written as $(s_t, a_t, r_t)_{t=0}^{\infty}$; this stochastic process is called a Markov decision process (MDP). Let $\Pi$ be the set of all policies. The state-action value function $Q^\pi : S \times A \to \mathbb{R}$ for a policy $\pi$ is defined as

Q^\pi(s,a) := E\Bigl[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s,\; a_0 = a,\; (r_t, s_{t+1}) \sim p(\cdot,\cdot \mid s_t, a_t),\; a_t \sim \pi(\cdot \mid s_t)\Bigr].   (2.3)

Furthermore, the state value function $V^\pi(s)$ is defined as

V^\pi(s) := \sum_{a \in A} \pi(a|s) \, Q^\pi(s,a).   (2.4)

Define the optimal policy $\pi^*$ as

\pi^* := \operatorname{argmax}_{\pi \in \Pi} V^\pi(s) \quad \text{(for every } s\text{)}.   (2.5)

The state-action value function $Q^{\pi^*}$ for the optimal policy is called the optimal state-action value function and is written simply as $Q^*$. The action that attains the maximum of the optimal state-action value function gives an optimal deterministic policy:

\pi^*(a|s) = \begin{cases} 1 & a = \operatorname{argmax}_{a' \in A} Q^*(s,a') \\ 0 & \text{else} \end{cases}   (2.6)

for any $s, a$.
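As a concrete illustration of this setting, the following minimal Python sketch builds a small finite MDP whose kernel $p(dr, s' \mid s,a)$ factors into a categorical next-state distribution and a Gaussian (hence unbounded) reward distribution, together with the greedy policy extraction of (2.6). All names and numerical choices (GaussianMDP, the Dirichlet transition matrix, the reward means and scales) are illustrative assumptions, not constructions from the paper.

```python
import numpy as np

class GaussianMDP:
    """Finite MDP whose kernel p(dr, s' | s, a) has Gaussian immediate rewards.

    P[s, a]            : categorical distribution of the next state s'
    mu[s, a], sd[s, a] : mean / std of the (unbounded) Gaussian reward r(s, a)
    """

    def __init__(self, n_states=4, n_actions=2, gamma=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.nS, self.nA, self.gamma = n_states, n_actions, gamma
        self.P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        self.mu = rng.normal(0.0, 1.0, size=(n_states, n_actions))
        self.sd = rng.uniform(0.5, 1.5, size=(n_states, n_actions))
        self.rng = rng

    def step(self, s, a):
        """Sample (r, s') ~ p(., . | s, a); note E[r(s,a)^2] = mu^2 + sd^2 < inf."""
        r = self.rng.normal(self.mu[s, a], self.sd[s, a])
        s_next = self.rng.choice(self.nS, p=self.P[s, a])
        return r, s_next

def greedy_policy(Q):
    """Deterministic policy of eq. (2.6): pi(s) = argmax_a Q(s, a)."""
    return np.argmax(Q, axis=1)
```

Such an environment satisfies condition (1.2), since a Gaussian reward has a finite second moment, while it violates the boundedness assumption (1.1).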
3 Update of the state-action value function and the Robbins-Monro condition

The Q-function is updated as follows:

Q_{t+1}(s,a) = (1 - \alpha(s,a,s_t,a_t,t)) \, Q_t(s,a) + \alpha(s,a,s_t,a_t,t) \bigl[ r_t(s_t,a_t) + \gamma \max_{b \in A} Q_t(s_{t+1}, b) \bigr].   (3.1)

A sequence $\{c_t\}_{t=0}^{\infty}$ satisfies the Robbins-Monro condition if

\forall t, \; c_t \in [0, 1],   (3.2)
\sum_{t=0}^{\infty} c_t = \infty,   (3.3)
\sum_{t=0}^{\infty} c_t^2 < \infty.   (3.4)

Using such a sequence, the map $\alpha : S \times A \times S \times A \times \mathbb{N} \to [0, 1]$ is defined as

\alpha(s,a,s_t,a_t,t) = \begin{cases} c_t & s_t = s, \; a_t = a \\ 0 & \text{else.} \end{cases}   (3.5)

In addition, it is assumed that $\alpha$ also satisfies the Robbins-Monro condition almost surely, uniformly for arbitrary $(s,a)$:

\sum_{t=0}^{\infty} \alpha(s,a,s_t,a_t,t) = \infty \quad \text{a.e.},   (3.6)
\sum_{t=0}^{\infty} \alpha(s,a,s_t,a_t,t)^2 < \infty \quad \text{a.e.}   (3.7)
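The update (3.1) with the step sizes (3.5)-(3.7) can be sketched as follows. This is a minimal illustration, assuming an environment object with the interface of the GaussianMDP sketch above (fields nS, nA, gamma and a step method); the per-pair visit counts give reciprocal step sizes satisfying the Robbins-Monro conditions, and the exploration rate is an arbitrary illustrative choice.

```python
import numpy as np

def q_learning(env, n_steps=200_000, seed=0):
    """Tabular Q-learning, update rule (3.1).

    The per-pair step size c_t = 1 / (1 + #visits(s, a)) satisfies the
    Robbins-Monro conditions (3.6)-(3.7): sum c_t = inf, sum c_t^2 < inf.
    Only the visited pair (s_t, a_t) is updated, as in (3.5).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.nS, env.nA))
    visits = np.zeros((env.nS, env.nA), dtype=int)
    s = 0
    for _ in range(n_steps):
        # epsilon-greedy exploration keeps every (s, a) visited infinitely often
        a = rng.integers(env.nA) if rng.random() < 0.1 else int(np.argmax(Q[s]))
        r, s_next = env.step(s, a)
        visits[s, a] += 1
        c = 1.0 / visits[s, a]                      # Robbins-Monro step size
        target = r + env.gamma * np.max(Q[s_next])  # r_t + gamma * max_b Q_t(s_{t+1}, b)
        Q[s, a] = (1.0 - c) * Q[s, a] + c * target  # eq. (3.1)
        s = s_next
    return Q
```

Theorem 1 below states that, despite the unbounded Gaussian rewards, the resulting $Q_t$ converges to $Q^*$ in the norm $\|\cdot\|_W$.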
4 Convergence of the Q-function

Consider real-valued functions $w_t(x)$ on a finite set $\mathcal{X}$.

Theorem 1

Let $\mathcal{X} := S \times A$, a finite set, and let $r_t(x)$ denote the random rewards. Let $W$ be the set of functions $f : \mathcal{X} \to \mathbb{R}$ with the norm $\|f\|_W := \max_{x \in \mathcal{X}} |f(x)|$. Suppose $E[r(s,a)^2] < \infty$ for every $(s,a)$. Then

\| Q_t - Q^* \|_W \to 0 \quad \text{a.e.}   (4.1)

Proof. We follow the line of the proof of [9]. The assumptions are relaxed and the statement is stronger, so the argument must be carried out more carefully. Consider the stochastic process $\Delta_t(x) := Q_t(x) - Q^*(x)$. Since $Q^*(x)$ is a constant, the variance satisfies $V(\Delta_t(x)) = V(Q_t(x))$. Putting

F_t(x) := r_t(x) + \gamma \sup_b Q_t(X(s,a), b) - Q^*(x),

where $X(s,a)$ denotes the random next state, $F_t$ is an $\mathcal{F}_{t+1}$-measurable stochastic process, $(\mathcal{F}_t)$ being the natural filtration. Furthermore, putting $G_t(x) := r_t(x) + \gamma \sup_b Q_t(X(s,a), b)$, by definition $G_t - E[G_t(x)|\mathcal{F}_t] = F_t - E[F_t(x)|\mathcal{F}_t]$. Take two stochastic processes $\delta_t, w_t \in W$ with $\Delta_0(x) = \delta_0(x) + w_0(x)$ and define their time evolution as

\delta_{t+1}(x) = (1 - a_t(x)) \, \delta_t(x) + a_t(x) \, E[F_t(x)|\mathcal{F}_t],   (4.2)
w_{t+1}(x) = (1 - a_t(x)) \, w_t(x) + a_t(x) \, p_t(x),   (4.3)

where $p_t(x) := F_t(x) - E[F_t(x)|\mathcal{F}_t]$ and $a_t(x) := \alpha(s,a,s_t,a_t,t)$ for $x = (s,a)$. Then $\Delta_t(x) = w_t(x) + \delta_t(x)$ for all $t$.

First, we show that $w_t$ converges to 0 on $\mathcal{X}$ with probability 1 by using Lemma 2. By definition $E[p_t|\mathcal{F}_t] = 0$, so $\sum_t |E[p_t|\mathcal{F}_t]| = 0$ holds. From Lemma 1 and the definitions of $p_t$ and $G_t$, $E[p_t^2] \le E[G_t^2]$ holds. Putting $L_t(\omega) := \sup_x |Q_t(x)|$, this random variable is $\mathcal{F}_t$-measurable and takes a finite value with probability 1. Let $C_R$ be a constant with $E[\sup_x |r_t(x)|^2] \le C_R < \infty$; such a constant exists because $\mathcal{X}$ is finite and the rewards have finite second moments. Since $L_0$ is finite, a constant $K_0$ can be taken so that $E[L_0^2] \le K_0^2 C_R$ holds. The following holds with probability 1:

L_{t+1} \le \max\bigl( L_t, \; (1 - b_t) L_t + b_t (\sup_x |r_t(x)| + \gamma L_t) \bigr),   (4.4)

where $b_t := c_t$ is the step-size sequence of Section 3. Using this,

E[L_{t+1}^2] \le \max\bigl( E[L_t^2], \; E[((1 - b_t) L_t + b_t (\sup_x |r_t(x)| + \gamma L_t))^2] \bigr).   (4.5)

Suppose there is $K_t \in \mathbb{R}$ with $E[L_t^2] \le K_t^2 C_R$, and put $H_t := \sup_x |r_t(x)| + \gamma L_t$. Then

E[H_t^2] = E[\sup_x |r_t(x)|^2] + 2\gamma E[\sup_x |r_t(x)| \, L_t] + \gamma^2 E[L_t^2]   (4.6)
\le C_R + 2\gamma \sqrt{C_R \, K_t^2 C_R} + \gamma^2 K_t^2 C_R   (4.7)
= (1 + \gamma K_t)^2 C_R.   (4.8)

Then,

E[((1 - b_t) L_t + b_t H_t)^2] \le (1 - b_t)^2 E[L_t^2] + 2(1 - b_t) b_t \sqrt{E[L_t^2] E[H_t^2]} + b_t^2 E[H_t^2]   (4.9)
\le (1 - b_t)^2 K_t^2 C_R + 2(1 - b_t) b_t K_t (1 + \gamma K_t) C_R + b_t^2 (1 + \gamma K_t)^2 C_R   (4.10)
= \bigl( (1 - b_t) K_t + b_t (1 + \gamma K_t) \bigr)^2 C_R   (4.11)
= \bigl( K_t + b_t (1 - (1 - \gamma) K_t) \bigr)^2 C_R.   (4.12)

Putting $K_{t+1} := \max\bigl( K_t, \; K_t + b_t (1 - (1 - \gamma) K_t) \bigr)$, we obtain $E[L_{t+1}^2] \le K_{t+1}^2 C_R$. Since $K_0 \in \mathbb{R}$ exists, $K_t \in \mathbb{R}$ exists for every $t$, and $E[L_t^2] \le K_t^2 C_R$. It is clear from the recursion that $K_{t+1} = K_t$ when $K_t > \frac{1}{1-\gamma}$, and when $K_t \le \frac{1}{1-\gamma}$ we have $K_{t+1} \le \frac{1}{1-\gamma} + 1$. Therefore $K_t$ exists for every $t$, and in addition $K_t \le K^* := \max\bigl( K_0, \frac{1}{1-\gamma} + 1 \bigr)$.

Since $|G_t(x)| \le |r_t(x)| + \gamma L_t$, the following holds for all $x$:

E[p_t(x)^2] \le E[G_t(x)^2]   (4.13)
\le E[r_t(x)^2] + 2\gamma \sqrt{E[r_t(x)^2] \, E[L_t^2]} + \gamma^2 E[L_t^2]   (4.14)
\le (1 + \gamma K^*)^2 C_R.   (4.15)

Then

\sum_t E[a_t^2 p_t^2] \le \sum_t b_t^2 (1 + \gamma K^*)^2 C_R   (4.16)
\le M (1 + \gamma K^*)^2 C_R < \infty   (4.17)

holds for all $x$, where $M := \sum_t b_t^2 < \infty$ by the Robbins-Monro condition. To apply Lemma 2, put

U_t := a_t(x) \, p_t(x),   (4.18)
T(w_t, \omega) := (1 - a_t(x)) \, w_t.   (4.19)

Then $\sum_t E[U_t^2] < \infty$, and since $E[U_t|\mathcal{F}_t] = 0$, $\sum_t |E[U_t|\mathcal{F}_t]| = 0$ holds.
Then, for any $\epsilon > 0$, set $\alpha = \epsilon$, $\beta_t(x) = b_t(x)^2$ and $\gamma_t(x) = \epsilon \, (2 a_t(x) - a_t(x)^2)$, so that

T(w_t, \omega) \le \max\bigl( \alpha, \; (1 + \beta_t) w_t - \gamma_t \bigr),   (4.20)
\sum_t \gamma_t = \infty \quad \text{a.e.}   (4.21)

The latter follows from the Robbins-Monro conditions. Therefore $w_t(x) \to 0$ for every $x$.

Next, define the operator $\mathcal{T} : W \to W$ as follows: for $q \in W$,

\mathcal{T}q(s,a) = \int_{\mathbb{R}} \sum_{s'} \bigl[ r + \gamma \sup_b q(s', b) \bigr] \, p(dr, s' \mid s,a)   (4.22)
= E\bigl[ r(s,a) + \gamma \sup_b q(X(s,a), b) \bigr].   (4.23)

$Q^*$ is a fixed point of this operator. For any $q_1, q_2 \in W$,

\| \mathcal{T}q_1 - \mathcal{T}q_2 \|_W = \sup_{s,a} \Bigl| \int_{\mathbb{R}} \sum_{s'} \bigl[ r + \gamma \sup_b q_1(s', b) \bigr] p(dr, s'|s,a) - \int_{\mathbb{R}} \sum_{s'} \bigl[ r + \gamma \sup_b q_2(s', b) \bigr] p(dr, s'|s,a) \Bigr|   (4.24)
\le \sup_{s,a} \int_{\mathbb{R}} \sum_{s'} \gamma \bigl| \sup_b q_1(s', b) - \sup_b q_2(s', b) \bigr| \, p(dr, s'|s,a)   (4.25)
\le \sup_{s,a} \int_{\mathbb{R}} \sum_{s'} \gamma \sup_b | q_1(s', b) - q_2(s', b) | \, p(dr, s'|s,a)   (4.26)
= \gamma \| q_1 - q_2 \|_W.   (4.27)

Thus $\mathcal{T}$ is a contraction. Moreover, for $x = (s,a)$,

| E[F_t(x) | \mathcal{F}_t] | = \Bigl| \int_{\mathbb{R}} \sum_{s'} \bigl[ r + \gamma \sup_b Q_t(s', b) - Q^*(s,a) \bigr] \, p(dr, s'|s,a) \Bigr|   (4.28)
= | \mathcal{T}Q_t(x) - Q^*(x) |   (4.29)
= | \mathcal{T}Q_t(x) - \mathcal{T}Q^*(x) |   (4.30)
\le \gamma \| \Delta_t \|_W.   (4.31)

Then,

\| \delta_{t+1} \| \le (1 - a_t(x)) \| \delta_t \| + a_t(x) \, \gamma \| \delta_t + w_t \|   (4.32)
\le (1 - a_t(x)) \| \delta_t \| + a_t(x) \, \gamma ( \| \delta_t \| + \| w_t \| ).   (4.33)

As shown above, $\|w_t(x)\|$ converges to 0 uniformly with probability 1 for every $x$. Therefore, by Lemma 5, $\|\delta_t(x)\| \to 0$ for every $x$. That is, $\|\Delta_t\|_W \to 0$, which is the assertion of the theorem.

5 SARSA

The method of Section 3 is called Q-learning; the value is updated before performing the next action. SARSA, on the other hand, updates the value after the next action has been chosen:

Q_{t+1}(s,a) = (1 - \alpha(s,a,s_t,a_t,t)) \, Q_t(s,a) + \alpha(s,a,s_t,a_t,t) \bigl( r_t(s_t,a_t) + \gamma \, Q_t(s_{t+1}, a_{t+1}) \bigr).   (5.1)

The next action $a_{t+1}$ is often determined stochastically, for example by a softmax over $Q_t(s_{t+1}, \cdot)$.
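The following is a minimal sketch of the SARSA update (5.1) with a softmax behaviour policy, under the same illustrative environment interface as before (nS, nA, gamma, step); the temperature tau and the step sizes are again arbitrary assumptions rather than choices made in the paper.

```python
import numpy as np

def softmax_action(q_row, tau, rng):
    """Sample a_{t+1} from a softmax (Boltzmann) distribution over Q(s, .)."""
    z = q_row / tau
    p = np.exp(z - z.max())
    return rng.choice(len(q_row), p=p / p.sum())

def sarsa(env, n_steps=200_000, tau=1.0, seed=0):
    """On-policy SARSA, update rule (5.1): the target uses Q_t(s_{t+1}, a_{t+1})
    for the action actually taken next, instead of max_b Q_t(s_{t+1}, b)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.nS, env.nA))
    visits = np.zeros((env.nS, env.nA), dtype=int)
    s = 0
    a = softmax_action(Q[s], tau, rng)
    for _ in range(n_steps):
        r, s_next = env.step(s, a)
        a_next = softmax_action(Q[s_next], tau, rng)
        visits[s, a] += 1
        c = 1.0 / visits[s, a]                      # Robbins-Monro step size
        target = r + env.gamma * Q[s_next, a_next]  # eq. (5.1)
        Q[s, a] = (1.0 - c) * Q[s, a] + c * target
        s, a = s_next, a_next
    return Q
```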
Theorem 2

Suppose that the Q-function is updated by the SARSA method above. Then

\| Q_t - Q^* \|_W \to 0 \quad \text{as } t \to \infty.   (5.2)

Proof. Put $L'_t := \max_{x,y \in \mathcal{X}} |Q_t(x) - Q_t(y)|$. It is clear from the definition that $L'_t \le 2 L_t$. The rest follows the proof of Theorem 1.

6 The case of a continuous domain

In a situation such as DQN, an update for one pair $(s,a)$ affects other state-action pairs. As a simple model that takes such situations into account, we introduce a ripple function $f(x_1, x_2)$ defined on a compact set $\mathcal{X}$, satisfying

f(x, x) = 1,   (6.1)
f(x_1, x_2) \text{ is continuous.}   (6.2)

If $Q^*$ is a continuous function, the iteration can start from any continuous function and the same convergence holds on the compact set.

Theorem 3

Let $\mathcal{X} \subset \mathbb{R}^d$ be a simply connected compact set. Let $Q^*$ and $Q_0$ be continuous functions on $\mathcal{X}$, let $W$ be the set of continuous functions on $\mathcal{X}$, and let $\|f\|_W := \max_{x \in \mathcal{X}} |f(x)|$. Update

Q_{t+1}(s,a) = (1 - f(s,a,s_t,a_t) \, \alpha(s,a,s_t,a_t,t)) \, Q_t(s,a) + f(s,a,s_t,a_t) \, \alpha(s,a,s_t,a_t,t) \bigl( r_t(s,a) + \gamma \max_{b \in A} Q_t(s_{t+1}, b) \bigr).   (6.3)

Then $\| Q_t - Q^* \|_W \to 0$.

Proof (sketch). Take a finite set $K_N := \{x_0, x_1, \ldots, x_N\}$ on $\mathcal{X}$. The restriction of $Q_t$ to $K_N$ converges uniformly over $K_N$ to the correct function by Theorem 1. For any $\epsilon > 0$, since $Q^*$ is continuous and a continuous function is uniquely determined by its values on a dense subset, the convergence on $\mathcal{X}$ follows.
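The update (6.3) can be sketched on a finite grid $K_N \subset \mathcal{X}$ as follows. The Gaussian-bump choice of the ripple function and the grid discretisation are illustrative assumptions; the paper only requires (6.1)-(6.2).

```python
import numpy as np

def ripple(x1, x2, width=0.3):
    """Continuous ripple function with f(x, x) = 1, as required by (6.1)-(6.2)."""
    return np.exp(-np.sum((x1 - x2) ** 2, axis=-1) / (2.0 * width ** 2))

def ripple_update(Q, grid, x_t, td_target, alpha_t, width=0.3):
    """One step of (6.3): every grid point (s, a) is moved toward the TD target,
    weighted by how close it is to the visited point x_t = (s_t, a_t)."""
    w = ripple(grid, x_t, width) * alpha_t          # f(s, a, s_t, a_t) * alpha_t
    return (1.0 - w) * Q + w * td_target
```

Here Q is the vector of values at the grid points, grid an (N, d) array of points of $K_N$, x_t the visited point $(s_t, a_t)$, and td_target the scalar $r_t + \gamma \max_b Q_t(s_{t+1}, b)$.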
7 Future work

As mentioned earlier, we want to prove the convergence of the distribution itself. An evaluation of the order of convergence of the expected value should also be performed, and we would like to estimate the convergence order for a specific neural network as in [13]. According to [13], in a continuous-domain setting as in Theorem 3, with $R_{\max} := \sup_{\omega,s,a} r(\omega, s, a)$ and constants $C_1, C_2, \xi, \alpha$,

\| Q^* - Q_n \|_W \le C_1 \cdot (\log n)^{\xi} \, n^{-\alpha} + C_2 R_{\max}   (7.1)

holds. However, when $r$ follows a normal distribution, $R_{\max} = \infty$, so this upper bound on the error is infinite and the expression is meaningless. For unbounded rewards, stronger inequalities are needed.

References

[1] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, England, 1989.
[2] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Nonparametric return distribution approximation for reinforcement learning. In International Conference on Machine Learning, 2010.
[3] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449-458, 2017.
[4] M. Rowland, M. G. Bellemare, W. Dabney, R. Munos, and Y. W. Teh. An analysis of categorical distributional reinforcement learning. In Artificial Intelligence and Statistics (AISTATS), 2018.
[5] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Parametric return density estimation for reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, 2010.
[6] K. Azizzadenesheli, E. Brunskill, and A. Anandkumar. Efficient exploration through Bayesian deep Q-networks. arXiv preprint arXiv:1802.04412, 2018.
[7] C. E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 751-759, Cambridge, MA, USA. MIT Press, 2004.
[8] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279-292, Kluwer Academic Publishers, Boston, 1992.
[9] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 1994.
[10] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201, 1994.
[11] F. S. Melo. Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., 2001.
[12] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447-469, 2000.
[13] Zhuoran Yang, Yuchen Xie, and Zhaoran Wang. A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137v2, 2019.
[14] B. Scherrer, M. Ghavamzadeh, V. Gabillon, B. Lesner, and M. Geist. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research, 16:1629-1676, 2015.
[15] A.-m. Farahmand, C. Szepesvári, and R. Munos. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, 2010.
[16] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89-129, 2008.
[17] Rémi Munos. Performance bounds in Lp norm for approximate value iteration. SIAM Journal on Control and Optimization, 2007.
[18] Rémi Munos. Error bounds for approximate policy iteration. In ICML 2003: Proceedings of the 20th Annual International Conference on Machine Learning, 2003.
[19] J. Venter. On Dvoretzky stochastic approximation theorems. Annals of Mathematical Statistics, 37:1534-1544, 1966.
[20] P. Berti, I. Crimaldi, L. Pratelli, and P. Rigo. Rate of convergence of predictive distributions for dependent data. Bernoulli, 15:1351-1367, 2009.
[21] P. Berti, L. Pratelli, and P. Rigo. Limit theorems for predictive sequences of random variables. Technical Report 146, Dip. EPMQ, Univ. Pavia, 2002.
[22] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian reinforcement learning: a survey. Foundations and Trends in Machine Learning, 8(5-6):359-483, 2015.
[23] Chen Tessler, Guy Tennenholtz, and Shie Mannor. Distributional policy optimization: An alternative approach for continuous control. arXiv preprint arXiv:1905.09855, 2019.
[24] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, 2018.
[25] David Ha, Andrew Dai, and Quoc V. Le. HyperNetworks. In International Conference on Learning Representations, 2017.
A Lemmas and proofs
Lemma 1
Consider a random variable $Y$ and a sub-$\sigma$-algebra $\mathcal{G}$. If $Z := Y - E[Y|\mathcal{G}]$, the following holds:

E[Z^2] \le E[Y^2].   (A.1)

(This follows from the orthogonality of $Z$ and $E[Y|\mathcal{G}]$: $E[Y^2] = E[Z^2] + E[E[Y|\mathcal{G}]^2]$.)

We quote the following important theorem.

Lemma 2
Convergence theorem for stochastic processes [19]. Consider the stochastic process

X_{t+1} := T(X_0, \ldots, X_t, \omega) + U_t(\omega),   (A.2)

which satisfies, with probability 1,

| T(x_0, x_1, \ldots, x_t, \omega) | \le \max\bigl( \alpha, \; (1 + \beta_t(\omega)) x_t - \gamma_t \bigr),   (A.3)

where $\alpha > 0$ and, with probability 1, $\beta_t(\omega) < M'$, $\sum_t \beta_t < \infty$, and $\sum_t \gamma_t(\omega) = \infty$. Suppose moreover that

\sum_t E[U_t^2] < \infty,   (A.4)
\sum_t | E[U_t | \mathcal{F}_t] | < \infty.   (A.5)

Then, with probability 1,

\limsup_{t \to \infty} |X_t| < \alpha \quad \text{a.e.};   (A.6)

that is, there exists $N(\omega)$ such that $|X_n| < \alpha$ for all $n > N(\omega)$. If, for any $\alpha$, the sequences $\beta, \gamma$ can be re-chosen so that the same holds, then "convergence to 0" can be concluded, which is much stronger than convergence to within a single $\alpha$.

Lemma 3

Let $x_0 \in \mathbb{R}$ be a real number and let $\gamma \in (0,1)$ be a constant. Consider

x_{n+1} = (1 - a_n) x_n + \gamma a_n | x_n |.   (A.7)

Then $x_n \to 0$ holds with probability 1.

Proof. Fix $\omega$; then $\{a_n\}_{n=0}^{\infty}$ is a fixed sequence satisfying $\sum_{n=0}^{\infty} a_n = \infty$ and $\sum_{n=0}^{\infty} a_n^2 < \infty$. $x_n$ is nonnegative for sufficiently large $n$, so the sequence is bounded below. In addition, since $x_n \ge x_{n+1}$ is apparent from the recursion, $\{x_n\}_{n=1}^{\infty}$ is monotonically decreasing. The sequence therefore converges, being bounded below and monotonically decreasing. Putting $b_n := a_n - \gamma a_n = (1-\gamma) a_n$, this satisfies $\sum_{n=0}^{\infty} b_n = \infty$ and $\sum_{n=0}^{\infty} b_n^2 < \infty$. We can write $x_n = x_0 \prod_{i=1}^{n} (1 - b_i)$, and the limit is $x_0 \prod_{n=1}^{\infty} (1 - b_n)$. Putting $c_n := \prod_{i=1}^{n} (1 - b_i)$, since $\sum_{n=0}^{\infty} b_n = \infty$ the infinite product diverges to 0; more precisely, since $0 \le c_n \le 1$, it is known that $c_n \to 0$, and hence $x_n \to 0$.

Lemma 4
Let $\epsilon > 0$ and consider

x_{n+1} = (1 - a_n) x_n + \gamma a_n | x_n + \epsilon |.   (A.8)

Then $x_n \to \frac{\epsilon\gamma}{1-\gamma}$ holds.

Proof.

x_{n+1} - x_n = - a_n \bigl( (1 - \gamma) x_n - \epsilon\gamma \bigr)   (A.9)
= - a_n (1 - \gamma) \Bigl( x_n - \frac{\epsilon\gamma}{1-\gamma} \Bigr).   (A.10)

The distance to $\frac{\epsilon\gamma}{1-\gamma}$ is reduced at rate $a_n(1-\gamma)$. If $y_n := x_n - \frac{\epsilon\gamma}{1-\gamma}$, then by definition clearly $y_{n+1} - y_n = x_{n+1} - x_n$. Moreover,

y_{n+1} - y_n = - a_n (1 - \gamma) y_n,   (A.11)
y_{n+1} = (1 - a_n (1 - \gamma)) y_n.   (A.12)

By the same argument as in Lemma 3, $y_n \to 0$, hence $x_n \to \frac{\epsilon\gamma}{1-\gamma}$.

Lemma 5
Suppose the sequence $\{c_n\} \subset \mathbb{R}_+$ converges to 0 uniformly on a set of probability 1; that is, for any $\epsilon > 0$ there is $N_\epsilon(\omega)$ such that $|c_n| < \epsilon$ for $n > N_\epsilon(\omega)$ with probability 1. Consider

x_{n+1} = (1 - a_n) x_n + \gamma a_n | x_n + c_n |.   (A.13)

Then $x_n$ converges to 0.

Proof. Define

z_{N_\epsilon} = x_{N_\epsilon},   (A.14)
z_{n+1} = (1 - a_n) z_n + \gamma a_n | z_n + \epsilon |.   (A.15)

Then $|z_n| \ge |x_n|$ for all $n > N_\epsilon$, and $z_n \to \frac{\epsilon\gamma}{1-\gamma}$ by Lemma 4. That is, for any $\epsilon' > 0$ there is $N_{\epsilon'} > N_\epsilon$ such that $z_n < \frac{\epsilon\gamma}{1-\gamma} + \epsilon'$ for all $n > N_{\epsilon'}$. Since $\epsilon$ and $\epsilon'$ can be taken arbitrarily small, $\epsilon'' := \frac{\epsilon\gamma}{1-\gamma} + \epsilon'$ can also be made arbitrarily small. Using $z_n \ge x_n$, for any $\epsilon'' > 0$ there is $N_{\epsilon''}$ such that $x_n < \epsilon''$ for $n > N_{\epsilon''}$.
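The scalar recursions of Lemmas 3-5 are easy to check numerically. The following sketch iterates (A.7), (A.8) and (A.13) with the illustrative step-size choice $a_n = (n+1)^{-0.6}$ (which satisfies the Robbins-Monro conditions) and reproduces the limits $0$, $\frac{\epsilon\gamma}{1-\gamma}$ and $0$ respectively.

```python
import numpy as np

def iterate(x0, gamma, c_seq, n_iter=100_000):
    """x_{n+1} = (1 - a_n) x_n + gamma * a_n * |x_n + c_n|, with a_n = (n+1)**-0.6
    (sum a_n = inf, sum a_n^2 < inf) and c_n = c_seq(n)."""
    x = x0
    for n in range(n_iter):
        a_n = 1.0 / (n + 1) ** 0.6
        x = (1.0 - a_n) * x + gamma * a_n * abs(x + c_seq(n))
    return x

gamma = 0.9
print(iterate(5.0, gamma, lambda n: 0.0))            # Lemma 3: tends to 0
print(iterate(5.0, gamma, lambda n: 0.5),            # Lemma 4: tends to eps*gamma/(1-gamma)
      0.5 * gamma / (1 - gamma))
print(iterate(5.0, gamma, lambda n: 1.0 / (1 + n)))  # Lemma 5: c_n -> 0, so x_n -> 0
```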
B Strict proof of the policy gradient theorem and its distributional version

We prove the well-known policy gradient theorem using the Q-function, together with its version in distributional reinforcement learning [23].
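Before the formal statement, the following minimal numerical sketch illustrates the two gradient expressions proved below, (B.1) and (B.3), using a linear deterministic policy, a fixed quadratic critic and finite-difference action derivatives; every function and constant here is an illustrative assumption rather than part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(theta, x):
    """Deterministic policy pi_theta(x) = theta^T x (a scalar action)."""
    return float(theta @ x)

def q(x, a):
    """A fixed critic Q(x, a); it does not depend on theta (grad_theta Q = 0)."""
    return -(a - x.sum()) ** 2

def f_omega(x, a, omega):
    """Stochastic return network Z(omega)(x, a) with Q(x, a) = E_omega[f_omega(x, a)]."""
    return q(x, a) + omega * a          # zero-mean noise omega

def grad_a(fun, x, a, h=1e-5):
    """Finite-difference derivative with respect to the action a."""
    return (fun(x, a + h) - fun(x, a - h)) / (2 * h)

theta = rng.normal(size=3)
xs = rng.normal(size=(256, 3))          # states drawn from the memory distribution rho

# eq. (B.1): grad_theta J = E_rho[ grad_theta pi_theta(x) * grad_a Q(x, a)|_{a = pi(x)} ]
g_pg = np.mean([x * grad_a(q, x, pi(theta, x)) for x in xs], axis=0)

# eq. (B.3): replace grad_a Q by the sample average E_omega[ grad_a f_omega(x, a) ]
omegas = rng.normal(size=64)
g_dist = np.mean([x * np.mean([grad_a(lambda xx, aa: f_omega(xx, aa, w), x, pi(theta, x))
                               for w in omegas])
                  for x in xs], axis=0)

print("deterministic PG estimate :", g_pg)
print("distributional PG estimate:", g_dist)   # agrees with g_pg up to Monte-Carlo error
```

The two printed gradients agree up to Monte-Carlo error because $E_\omega[\nabla_a f_\omega] = \nabla_a Q$, which is the content of (B.6)-(B.7).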
Theorem 4 (Policy gradient theorem)

Consider the gradient of the policy value function $J(\theta) := E[Q(x, \pi_\theta(x))]$. Assume that $\pi_\theta$ and $Q$ are implemented by neural networks, that the activation functions are Lipschitz continuous, and that $\nabla_\theta Q(x,a) = 0$ (the critic does not share parameters with the policy). Then the following holds:

\nabla_\theta J(\theta) = E_\rho \bigl[ \nabla_\theta \pi_\theta(x) \, \nabla_a Q(x,a) |_{a = \pi_\theta(x)} \bigr],   (B.1)

where $\rho$ is the state distribution (in typical implementations, the distribution of the replay memory data). Next, consider the case of distributional reinforcement learning. If the random variable representing the cumulative reward is written $Z$, then $Q(s,a) = E_\omega[Z(s,a)]$ holds. Suppose $Z$ is a neural network with stochastic output,

Z(\omega)(s,a) = f_\omega(s,a).   (B.2)

Then

\nabla_\theta J(\theta) = E_\rho \bigl[ \nabla_\theta \pi_\theta(x) \, E_\omega[ \nabla_a Z(x,a) ] |_{a = \pi_\theta(x)} \bigr].   (B.3)

Proof. The condition for interchanging differentiation and Lebesgue integration is as follows. Suppose $f(x, \omega)$ is Lebesgue integrable over $\Omega$ and differentiable in $x$. If there is an integrable function $\varphi(\omega)$ such that $|\nabla_x f(x, \omega)| \le \varphi(\omega)$ almost everywhere on $\Omega$, then $\int_\Omega f(x, \omega) \, d\mu(\omega)$ is differentiable in $x$ and

\nabla_x \int_\Omega f(x, \omega) \, d\mu(\omega) = \int_\Omega \nabla_x f(x, \omega) \, d\mu(\omega).   (B.4)

When $\mu(\Omega) < \infty$, an example of a function class satisfying this is the class of Lipschitz continuous functions. Neural networks are in general compositions of linear maps and Lipschitz continuous activation maps. Moreover, writing the Lipschitz constant of a function $f$ as $\|f\|_L$, for two Lipschitz continuous functions $f, g$ we have $\|f \circ g\|_L \le \|f\|_L \|g\|_L$. From this, $\pi_\theta(x)$, $Q(x,a)$ and $Q(x, \pi_\theta(x))$ are Lipschitz continuous in $x$ and $a$, respectively. Although $Q(x, \pi_\theta(x))$ is not Lipschitz continuous in $\theta$ jointly, it is Lipschitz continuous in each component of $\theta$, so the definition of $\nabla_\theta$ allows the exchange of differentiation and integration. That is, the following holds by the chain rule:

\nabla_\theta J(\theta) = E_\rho \bigl[ \nabla_\theta \pi_\theta(x) \, \nabla_a Q(x,a) |_{a = \pi_\theta(x)} \bigr].   (B.5)

Similarly, $\nabla_a E_\omega[Z(s,a)] = E_\omega[\nabla_a f_\omega(s,a)]$ because $f_\omega$ is Lipschitz continuous for every $\omega$. For the distributional version,

\nabla_\theta J(\theta) = E_\rho \bigl[ \nabla_\theta \pi_\theta(x) \, E_\omega[ \nabla_a f_\omega(x,a) ] |_{a = \pi_\theta(x)} \bigr]   (B.6)
= E_\rho \bigl[ \nabla_\theta \pi_\theta(x) \, E_\omega[ \nabla_a Z(x,a) ] |_{a = \pi_\theta(x)} \bigr].   (B.7)

As described above, the policy gradient theorem holds because the policy is Lipschitz continuous in each parameter; it does not obviously hold for policy functions built from ODE-Nets [24], HyperNetworks [25], or other architectures that reuse parameters.

C Notation