Model-free Control of Chaos with Continuous Deep Q-learning
Junya Ikemoto^a) and Toshimitsu Ushio^b)
Graduate School of Engineering Science, Osaka University, Toyonaka, Osaka 560-8531, Japan
(Dated: 27 August 2019)
a) Electronic mail: [email protected]
b) Electronic mail: [email protected]
The OGY method is one of the control methods for chaotic systems. In this method, we have to calculate a stabilizing periodic orbit embedded in the chaotic attractor. Thus, we cannot use this method in the case where a precise mathematical model of the chaotic system cannot be identified. In this case, the delayed feedback control proposed by Pyragas is useful. However, even in the delayed feedback control, we need the mathematical model to determine a feedback gain that stabilizes the periodic orbit. To overcome this problem, we apply a model-free reinforcement learning algorithm to the design of a controller for the chaotic system. In recent years, model-free reinforcement learning algorithms with deep neural networks have attracted much attention, and they make it possible to control complex systems. However, it is known that model-free reinforcement learning algorithms are not efficient because learners must explore their control policies over the entire state space. Moreover, model-free reinforcement learning algorithms with deep neural networks take much time to learn their optimal control policies. Thus, we propose a data-based control method consisting of two steps: we first determine a region including the stabilizing periodic orbit, and then make the controller learn an optimal control policy for its stabilization. In the proposed method, the controller efficiently explores its control policy only in the region.
In general, periodic orbits embedded in a chaotic attractor depend on the parameters of the chaotic system, so a chaos control method that does not need the precise computation of the orbit is practically useful. Several such methods, such as delayed feedback control, have been proposed. However, these methods require the identification of the system parameters. Thus, we propose a model-free control method using continuous deep Q-learning. Continuous deep Q-learning is one of the deep reinforcement learning algorithms and has recently been applied to the control of complex tasks. We propose a reward function that evaluates stabilization by the control inputs. Moreover, since the stabilized periodic orbit is embedded in a chaotic attractor, we select a region including the orbit where we inject the control inputs so that efficient learning is achieved. As an example, we consider the stabilization of a fixed point embedded in a chaotic attractor of the Gumowski-Mira map, and it is shown by simulation that a nonlinear state feedback controller is learned by the proposed method.
I. INTRODUCTION
It is known that many unstable periodic orbits are embedded in chaotic attractors. Using this property, Ott, Grebogi, and Yorke proposed an efficient chaos control method. However, when we use this method, we have to calculate the stabilizing periodic orbit embedded in the chaotic attractor precisely. In the case where we cannot identify precise mathematical models of the chaotic systems, the delayed feedback control is known to be very useful, and many related methods have been proposed. Moreover, the prediction-based chaos control method using predicted future states was also proposed. However, it is difficult to determine a feedback gain of the controller in the absence of its mathematical model. To overcome this problem, a method of adjusting the gain parameter using the gradient method was proposed. Neural networks have also been used for model identification. Reinforcement learning (RL) has also been applied to the design of the controller. Recently, RL with deep neural networks, which is called deep reinforcement learning (DRL), has attracted much attention. DRL makes it possible to learn policies that exceed human-level performance in Atari video games and Go. DRL algorithms have been applied not only to playing games but also to controlling real-world systems such as autonomous vehicles and robot manipulators. As an application in physics, a control method for the Kuramoto-Sivashinsky equation, which exhibits one-dimensional spatiotemporal chaos, using the DDPG algorithm was proposed.

In this paper, we apply a DRL algorithm to the control of chaotic systems without identifying their mathematical models. However, in model-free RL algorithms, the learner has to explore its optimal control policy over the entire state space, which leads to inefficient learning. Moreover, when we use deep neural networks, it takes much time for the learner to optimize the many parameters in the deep neural network. In this paper, we propose an efficient model-free control method consisting of two steps. First, we determine a region including a stabilizing periodic orbit based on the uncontrolled behavior of the chaotic system. Next, we explore an optimal control policy in the region using a deep Q network, while we do not control the system outside the region. Without loss of generality, we focus on the stabilization of a fixed point embedded in the chaotic attractor.

This paper is organized as follows. In Section II, we show a method to determine a region including a stabilizing fixed point. In Section III, we propose a model-free reinforcement learning method to explore an optimal control policy in the region. In Section IV, numerical simulations of the proposed chaos control of the Gumowski-Mira map, which is an example of a discrete-time chaotic system, are performed to show the usefulness of the proposed method. Finally, in Section V, we describe the conclusion of this paper and future work.

II. ESTIMATION OF REGION
We consider the following chaotic discrete-time system:
$x_{k+1} = F(x_k, u_k), \qquad (1)$
where $x \in \mathbb{R}^n$ is the state of the chaotic system and $u \in \mathbb{R}^m$ is the control input. We assume that the function $F$ cannot be identified precisely. Thus, we cannot calculate a precise value of the stabilizing periodic orbits embedded in its chaotic attractor. On the other hand, although the state of the chaotic system does not converge to a periodic orbit, it sometimes comes close to any unstable periodic orbit embedded in the chaotic attractor. Using this property, we observe the behavior of the chaotic system without the control input and sample states that are close to the stabilizing periodic orbit. In the following, for simplicity, we focus on the stabilization of a fixed point embedded in the chaotic attractor. We observe behaviors $x_k$ $(k = 0, 1, \ldots)$ of the uncontrolled chaotic system ($u_k = 0$) and sample states $\bar{x}^{(l)} = x_{k_l}$ $(l = 1, 2, \ldots, L)$ satisfying the following condition, where $L$ is the number of the sampled states:
$\| x_{k_l + 1} - x_{k_l} \|_p < \epsilon, \qquad (2)$
where $\| \cdot \|_p$ denotes the $\ell_p$-norm over $\mathbb{R}^n$ and $\epsilon$ is a sufficiently small positive constant. We estimate the stabilizing fixed point $\hat{x}_f$ based on the $L$ sampled states $\bar{x}^{(l)}$:
$\hat{x}_f = \frac{1}{L} \sum_{l=1}^{L} \bar{x}^{(l)}. \qquad (3)$
Note that, in general, there may exist more than one fixed point in the chaotic attractor. In such a case, we calculate clusters of the sampled data corresponding to the fixed points and select the cluster close to the stabilizing fixed point.

Then, we set a region $D$ appropriately based on the estimated fixed point $\hat{x}_f$, where the center of $D$ is the estimated fixed point $\hat{x}_f$. We have to select the region large enough that the stabilizing fixed point is sufficiently far from the boundary of the region. Since we use a deep neural network, we can make a learner learn a nonlinear control policy for a large region, while both the OGY method and the delayed feedback control method are linear control methods. As an example, we show an estimation of the fixed point of the Gumowski-Mira map in Fig. 1.

FIG. 1. Example of the region $D$ for the Gumowski-Mira map. The region is an $\ell_\infty$-norm box centered at the estimated fixed point $\hat{x}_f$.

Furthermore, we transform the state $x$ into the following new state $s \in S$:
$s := \phi(x), \qquad (4)$
where $\phi : \mathbb{R}^n \to S$ is the following coordinate transformation:
$\phi(x) := \begin{cases} x - \hat{x}_f, & x \in D \\ s_{\mathrm{out}}, & x \notin D. \end{cases} \qquad (5)$
The transformed state space $S$ is $D' \cup \{ s_{\mathrm{out}} \}$, where $D' = \{ \phi(x) \mid x \in D \}$. The state $s_{\mathrm{out}}$ represents that the current state of the chaotic system lies outside the region $D$, so that the control input is set to 0 and we do not sample the state for learning. Then, the origin of the state space $D'$ coincides with the estimated fixed point $\hat{x}_f$.
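As a concrete illustration of Eqs. (2)-(5), the following Python sketch estimates the fixed point from an uncontrolled trajectory and builds the coordinate transformation $\phi$. The trajectory array, the threshold `eps`, and the half-width of the box $D$ are placeholders rather than values taken from the paper.

```python
import numpy as np

def estimate_fixed_point(trajectory, eps=0.05, p=1):
    """Collect the states x_k whose successor stays within eps (Eq. (2))
    and average them to estimate the fixed point (Eq. (3)).
    `trajectory` is a sequence of state vectors of the uncontrolled system."""
    samples = [x for x, x_next in zip(trajectory[:-1], trajectory[1:])
               if np.linalg.norm(x_next - x, ord=p) < eps]
    # assumes at least one state satisfied the condition
    return np.mean(samples, axis=0)

def make_transform(x_f_hat, half_width, s_out="s_out"):
    """Coordinate transform phi of Eqs. (4)-(5): shift the box-shaped region D
    so that its center (the estimated fixed point) becomes the origin."""
    def phi(x):
        if np.all(np.abs(x - x_f_hat) <= half_width):   # x in D (infinity-norm box)
            return x - x_f_hat
        return s_out                                     # outside D: terminal symbol
    return phi
```

In this sketch the terminal state $s_{\mathrm{out}}$ is represented by a plain string token, which is one possible encoding; any sentinel value that the learning code recognizes as "outside $D$" would do.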
III. DEEP REINFORCEMENT LEARNING FOR CHAOS CONTROL

The goal of RL is to learn a control policy that is optimal in the long run through interactions between a controller with a learner and a system. First, the controller observes the system state $x$ and computes the control input $u$ in accordance with its control policy $\mu$. Next, the controller applies the control input $u$ to the system, and the state of the system moves from $x$ to $x'$. Finally, the controller observes the next state $x'$ and receives the immediate reward $r$. The immediate reward is determined by the reward function $R$. In this paper, we make the controller learn its control policy only in the region $D$ to improve its learning efficiency. Thus, we define $s \in S$ as the state of the RL framework. The interactions between the controller and the system are shown in Fig. 2.

FIG. 2. Interactions between a system and a controller. In this paper, we regard the transformed state $s \in S$ as the state in the RL framework. The controller observes the transformed state $s$ and computes the control input $u$ in accordance with its policy $\mu$. The controller applies the control input $u$ to the system, and the state of the system moves from $s$ to $s'$. Finally, the controller observes the next transformed state $s'$ and receives the immediate reward $r$. The controller updates its control policy $\mu$ based on the transition $(s, u, s', r)$.

In this paper, the reward function $R : D' \times \mathbb{R}^m \times S \to \mathbb{R}$ is defined by
$R(s, u, s') = \begin{cases} -(s' - s)^T M_1 (s' - s) - u^T M_2 u & \text{if } s' \neq s_{\mathrm{out}} \\ -q & \text{otherwise}, \end{cases} \qquad (6)$
where $M_1$ and $M_2$ are positive definite matrices and $q$ is a sufficiently large positive constant. Since $\hat{x}_f$ is only an approximation of the fixed point, the controller must explore for the true fixed point through its learning. Thus, we define the reward function $R$ so that it takes the maximum reward when the state of the system is stabilized at the fixed point $x_f$. In the case of $s' = s_{\mathrm{out}}$, the reward function gives the sufficiently large penalty $-q$. Moreover, since the goal of RL is to learn the control policy that maximizes the long-term reward, we define the following value functions:
$V^{\mu}(s) = \mathbb{E}\left[ \sum_{n=i}^{\infty} \gamma^{n-i} r_n \,\middle|\, s_i = s \right], \qquad (7)$
$Q^{\mu}(s, u) = \mathbb{E}\left[ \sum_{n=i}^{\infty} \gamma^{n-i} r_n \,\middle|\, s_i = s, u_i = u \right], \qquad (8)$
where $\gamma \in [0, 1)$ is a discount rate that prevents divergence of the value functions. Eqs. (7) and (8) are called the state value function and the state-action value function (Q-function), respectively. These value functions represent the mean of the discounted sum of immediate rewards which the controller receives in accordance with its control policy $\mu$, where immediate rewards after the state of the system moves to $s_{\mathrm{out}}$ are not included in Eqs. (7) and (8); that is, the transformed state $s_{\mathrm{out}}$ is a termination state for a learning episode.
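As an illustration, a minimal Python sketch of the reward of Eq. (6) and of a finite-horizon estimate of the discounted sums in Eqs. (7) and (8) is given below. The weight matrices `M1` and `M2`, the penalty `q`, and the string token used for $s_{\mathrm{out}}$ are placeholders the user must supply.

```python
import numpy as np

S_OUT = "s_out"  # symbolic terminal state, a placeholder token

def reward(s, u, s_next, M1, M2, q):
    """Reward of Eq. (6): penalize state change and control effort inside D,
    and give the large penalty -q when the state leaves the region.
    s, u, s_next are 1-D numpy arrays; M1, M2 are positive definite matrices."""
    if isinstance(s_next, str) and s_next == S_OUT:
        return -q
    ds = s_next - s
    return float(-(ds @ M1 @ ds) - (u @ M2 @ u))

def discounted_return(rewards, gamma=0.99):
    """Finite-horizon estimate of the discounted sums in Eqs. (7)-(8)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```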
Furthermore, we apply DRL to the design of the controller. In DRL, the control policy function and the value functions are approximated by deep neural networks. DDPG and A3C are DRL algorithms for continuous control problems. However, these algorithms are harder to handle because the control policy function and the value functions are approximated by separate deep neural networks. On the other hand, in the continuous deep Q-learning algorithm, we can approximate the control policy function and the value functions by only one deep neural network. Thus, in this paper, we use the continuous deep Q-learning algorithm. The deep neural network used in the algorithm is illustrated in Fig. 3, where $\theta$ is the parameter vector of the deep neural network. The input to the deep neural network is the transformed state $s$, and the outputs are the approximated state value function $V(s; \theta)$, the control input $\mu(s; \theta)$, and the elements of the lower triangular matrix $P_L(s; \theta)$, whose diagonal terms are exponentiated.

FIG. 3. Illustration of the deep neural network for the continuous deep Q-learning algorithm. The input to the deep neural network is the transformed state $s$, and the outputs are the approximated state value function $V(s; \theta)$, the control input $\mu(s; \theta)$, and the elements of the lower triangular matrix $P_L(s; \theta)$. The normalized advantage function (NAF) is defined by Eq. (9), and the Q-function is approximated by adding the NAF to the approximated state value function. Note that $Q(s, u; \theta) = V(s; \theta)$ when the approximated Q-function $Q(s, u; \theta)$ is maximized with respect to the control input $u$.
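For concreteness, a PyTorch sketch of such a single network is given below. The hidden-layer sizes follow the example in Sec. IV (three fully connected layers of 32 units), but the class name, the scaled tanh output, and the way the triangular entries are packed are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class NAFNetwork(nn.Module):
    """One network that outputs V(s; theta), mu(s; theta), and the entries of
    the lower triangular matrix P_L(s; theta) with an exponentiated diagonal."""
    def __init__(self, state_dim, action_dim, hidden=32):
        super().__init__()
        self.action_dim = action_dim
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.V_head = nn.Linear(hidden, 1)
        self.mu_head = nn.Linear(hidden, action_dim)
        # action_dim diagonal entries + action_dim*(action_dim-1)/2 off-diagonal entries
        self.L_head = nn.Linear(hidden, action_dim * (action_dim + 1) // 2)

    def forward(self, s):
        h = self.body(s)
        V = self.V_head(h)
        mu = 2.0 * torch.tanh(self.mu_head(h))   # bounded control input (scaled tanh, cf. Sec. IV)
        entries = self.L_head(h)
        m = self.action_dim
        P_L = torch.zeros(s.shape[0], m, m, device=s.device)
        diag = torch.arange(m)
        P_L[:, diag, diag] = torch.exp(entries[:, :m])       # exponentiated diagonal terms
        if m > 1:
            low = torch.tril_indices(m, m, offset=-1)
            P_L[:, low[0], low[1]] = entries[:, m:]          # strictly lower triangular part
        return V, mu, P_L
```

For a scalar control input ($m = 1$), $P_L$ reduces to a single positive number, so $P_L P_L^T$ in Eq. (9) below is automatically positive definite.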
We define the normalized advantage function (NAF) as follows:
$A(s, u; \theta) = -\frac{1}{2} \left( u - \mu(s; \theta) \right)^T P_L(s; \theta) P_L(s; \theta)^T \left( u - \mu(s; \theta) \right), \qquad (9)$
where $u$ is the control input to the system at the transformed state $s$. Note that $P_L(s; \theta) P_L(s; \theta)^T$ is a positive definite matrix because $P_L(s; \theta)$ is a lower triangular matrix with positive diagonal entries. Therefore, the maximum value of the NAF with respect to the control input $u$ is 0, which is attained at $u = \mu(s; \theta)$. Eq. (9) is a quadratic approximation of the advantage function, which represents how much the control input $u$ is superior to the control input computed in accordance with the policy $\mu$. By adding the NAF to the approximated state value function, we approximate the Q-function as follows:
$Q(s, u; \theta) = V(s; \theta) + A(s, u; \theta). \qquad (10)$

We now describe the learning method. We define the following TD error to update the parameter vector of the deep neural network:
$J(\theta) = \mathbb{E}\left[ \left( Q(s, u; \theta) - \left( r + \gamma \max_{u'} Q(s', u'; \theta) \right) \right)^2 \right] = \mathbb{E}\left[ \left( Q(s, u; \theta) - \left( r + \gamma V(s'; \theta) \right) \right)^2 \right], \qquad (11)$
where $V(s_{\mathrm{out}}; \theta) = 0$. The parameter vector $\theta$ is updated in the direction that minimizes the TD error using an optimization algorithm such as Adam.

In the learning, we use a target network, which is another deep neural network, to update the parameter vector $\theta$, where the parameter vector of the target network is denoted by $\theta^-$. When we compute the approximated state value function $V(s'; \theta)$ in Eq. (11), we use the output of the target network as follows:
$J(\theta) = \mathbb{E}\left[ \left( Q(s, u; \theta) - \left( r + \gamma V(s'; \theta^-) \right) \right)^2 \right]. \qquad (12)$
The target network prevents the learning from becoming unstable. The parameter vector $\theta^-$ is updated by the following equation:
$\theta^- = \beta \theta + (1 - \beta) \theta^-, \qquad (13)$
where $\beta$ is the learning rate of the target network and is set to a sufficiently small positive constant. This update method is called a soft update.
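Building on the `NAFNetwork` sketch above, the helpers below show how Eqs. (9)-(13) might be evaluated in PyTorch. The default values of `gamma` and `beta` follow the example in Sec. IV, and the `done` flag marking transitions into $s_{\mathrm{out}}$ is an assumption about how termination is encoded.

```python
import torch

def q_value(net, s, u):
    """Q(s, u; theta) = V(s; theta) + A(s, u; theta), Eqs. (9)-(10)."""
    V, mu, P_L = net(s)
    P = P_L @ P_L.transpose(1, 2)                      # positive definite matrix P_L P_L^T
    d = (u - mu).unsqueeze(2)                          # (batch, m, 1)
    A = -0.5 * (d.transpose(1, 2) @ P @ d).squeeze(2)  # (batch, 1)
    return V + A

def td_target(target_net, r, s_next, done, gamma=0.99):
    """r + gamma * V(s'; theta^-), with V(s_out; theta^-) treated as 0, Eq. (12)."""
    with torch.no_grad():
        V_next, _, _ = target_net(s_next)
    return r + gamma * (1.0 - done) * V_next           # done = 1 when s' = s_out

def soft_update(target_net, net, beta=0.01):
    """Soft update theta^- <- beta * theta + (1 - beta) * theta^-, Eq. (13)."""
    for p_t, p in zip(target_net.parameters(), net.parameters()):
        p_t.data.copy_(beta * p.data + (1.0 - beta) * p_t.data)
```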
Moreover, we use experience replay. In experience replay, the controller does not immediately use the transition $(s, u, s', r)$ obtained by the exploration for its learning. Instead, the controller stores the transition in the replay buffer $B$ and, at the time of the update of $\theta$, randomly selects $N$ transitions from the buffer to make a minibatch. Experience replay removes the correlation between transitions. Note that, since we learn an optimal policy only in the region $D'$, we do not store all behaviors but only the transitions in the region.

In the exploration for the optimal control policy, the controller determines the control input as follows:
$u = \mu(s; \theta) + \delta, \qquad (14)$
where $\delta$ is an exploration noise generated by an exploration noise process $\mathcal{N}$ that we have to set appropriately.

The whole learning algorithm is shown in Algorithm 1, and the controlled chaotic system is illustrated in Fig. 4. $M$ is the number of behaviors, $K$ is the maximum discrete-time step of one behavior, and $I$ is the number of updates of $\theta$ performed every $k_p$ discrete-time steps.

FIG. 4. Illustration of the chaotic system controlled by the proposed learning controller. The chaotic system and the main network keep generating transitions $(s, u, s', r)$, where $s$, $u$, $s'$, and $r$ are the transformed state of the chaotic system, the control input, the next transformed state, and the immediate reward. The transition $(s, u, s', r)$ is stored in the replay buffer $B$. At the time of updating the parameter vector $\theta$ of the deep neural network, $N$ transitions $(s^{(n)}, u^{(n)}, s'^{(n)}, r^{(n)})$ $(n = 1, 2, \ldots, N)$ are randomly selected to make a minibatch, and $\theta$ is updated based on the minibatch. The parameter vector of the target network $\theta^-$ is updated by $\theta^- \leftarrow \beta \theta + (1 - \beta) \theta^-$.
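Before stating the full algorithm, a possible Python implementation of the replay buffer $B$ and of an exploration noise process is sketched below. The buffer capacity and the Ornstein-Uhlenbeck parameters are placeholders; the paper only states that an Ornstein-Uhlenbeck process is used in the example of Sec. IV.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-size buffer of transitions (s, u, s', r, done) with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise; theta, sigma, and dt are placeholders."""
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1.0):
        self.dim, self.theta, self.sigma, self.dt = dim, theta, sigma, dt
        self.reset()

    def reset(self):
        self.x = np.zeros(self.dim)

    def __call__(self):
        dx = -self.theta * self.x * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(self.dim)
        self.x = self.x + dx
        return self.x
```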
Algorithm 1: Continuous Deep Q-learning for Chaos Control

Initialize the replay buffer $B$.
Randomly initialize the main Q network with weights $\theta$.
Initialize the target network with weights $\theta^- = \theta$.
Estimate the fixed point $\hat{x}_f$ and select $D$.
for behavior $= 1, \ldots, M$ do
  Initialize the initial state $x_0$.
  Initialize a random process $\mathcal{N}$ for action exploration ($\delta \sim \mathcal{N}$).
  for $k = 0, \ldots, K$ do
    if $k \bmod k_p = 0$ then
      for iteration $= 1, \ldots, I$ do
        Sample a random minibatch of $N$ transitions $(s^{(n)}, u^{(n)}, s'^{(n)}, r^{(n)})$, $n = 1, \ldots, N$, from $B$.
        Set the targets
          $t^{(n)} = \begin{cases} r^{(n)} + \gamma V(s'^{(n)}; \theta^-) & s'^{(n)} \neq s_{\mathrm{out}} \\ r^{(n)} & \text{otherwise.} \end{cases}$
        Update $\theta$ by minimizing the TD error $J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \left( Q(s^{(n)}, u^{(n)}; \theta) - t^{(n)} \right)^2$.
        Update the target network: $\theta^- \leftarrow \beta \theta + (1 - \beta) \theta^-$.
      end for
    end if
    if $x_k \in D$ then
      Transform the observed state $x_k$ into $s = \phi(x_k)$.
      Determine the exploratory action $u = \mu(s; \theta) + \delta$.
      Input $u$ to the chaotic system so that the state moves to the next state $x_{k+1}$.
      Observe the next state $x_{k+1}$ and transform it into $s' = \phi(x_{k+1})$.
      Compute the immediate reward $r = R(s, u, s')$.
      Store the transition $(s, u, s', r)$ in $B$.
    else
      The state moves to the next state $x_{k+1}$ without the control input.
    end if
  end for
end for
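The Python sketch below condenses Algorithm 1 into a single training function. It reuses the helpers sketched earlier (`ReplayBuffer`, `OUNoise`, `q_value`, `td_target`, `soft_update`), and `step_fn`, `sample_x0`, `in_region`, `phi`, and `reward_fn` are user-supplied callables, so this is an outline of the procedure under those assumptions rather than the authors' implementation.

```python
import numpy as np
import torch

def train(step_fn, sample_x0, in_region, phi, reward_fn, net, target_net,
          optimizer, noise, buffer, num_behaviors=100, K=10800, k_p=80, I=2,
          batch_size=64, gamma=0.99, beta=0.01):
    """Sketch of Algorithm 1. step_fn(x, u) advances the chaotic system one step,
    sample_x0() draws an initial state, in_region(x) tests x in D, and phi(x) is
    the shift of Eq. (5). The loop counts follow the example in Sec. IV where
    given; num_behaviors is a placeholder."""
    for _ in range(num_behaviors):
        x = sample_x0()
        noise.reset()
        for k in range(K):
            # periodic network update from a minibatch of stored transitions
            if k % k_p == 0 and len(buffer.buffer) >= batch_size:
                for _ in range(I):
                    batch = buffer.sample(batch_size)
                    s, u, s2, r, done = (torch.as_tensor(np.array(v), dtype=torch.float32)
                                         for v in zip(*batch))
                    target = td_target(target_net, r.unsqueeze(1), s2,
                                       done.unsqueeze(1), gamma)
                    loss = ((q_value(net, s, u) - target) ** 2).mean()
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    soft_update(target_net, net, beta)
            if in_region(x):
                s_k = phi(x)
                with torch.no_grad():
                    _, mu, _ = net(torch.as_tensor(s_k, dtype=torch.float32).unsqueeze(0))
                u_k = mu.squeeze(0).numpy() + noise()        # exploratory input, Eq. (14)
                x_next = step_fn(x, u_k)
                done = 0.0 if in_region(x_next) else 1.0
                s_next = phi(x_next) if not done else np.zeros_like(s_k)
                r_k = reward_fn(s_k, u_k, s_next if not done else "s_out")
                buffer.store((s_k, u_k, s_next, r_k, done))
            else:
                x_next = step_fn(x, np.zeros_like(noise.x))  # no control outside D
            x = x_next
```

Terminal transitions are stored with a zero next state and `done = 1`, which makes the target in Eq. (12) reduce to the immediate reward, matching the target definition in Algorithm 1.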
IV. EXAMPLE

In order to show the usefulness of the proposed method, we perform a numerical simulation of the chaos control of the Gumowski-Mira map, which is an example of a discrete-time chaotic system. The controlled Gumowski-Mira map is described by
$x_{k+1} = y_k + b(1 - 0.05\, y_k^2)\, y_k + f(x_k) + 0.\, u_k, \qquad (15)$
$y_{k+1} = -x_k + f(x_{k+1}), \qquad (16)$
where $f$ is given by
$f(x) = \eta x + \frac{2(1 - \eta) x^2}{1 + x^2}. \qquad (17)$
In this paper, we assume that $b = 0.$ and $\eta = -0.$, where we cannot use these parameters to design the controller.

By simulations, we observe the uncontrolled behaviors of the chaotic system to estimate the fixed point. We set $\epsilon = 0.$ and $p = 1$ ($\ell_1$-norm) in Eq. (2). Then, the estimated fixed point is $\hat{x}_f = [0.\,,\, 0.\,]^T$. Thus, we select the region $D$ shown in Fig. 5:
$D := \{ (x, y) \mid -0. \le x \le 0.\,,\; -0. \le y \le 0. \}. \qquad (18)$
Then, if $[x, y]^T \in D$, the transformed state is $s = [s_x, s_y]^T = [x, y]^T - \hat{x}_f$. Otherwise, the transformed state is $s = s_{\mathrm{out}}$.

FIG. 5. States of the chaotic system without the control input. The orange points are the states that satisfy Eq. (2); we regard the mean of the orange points as the estimated fixed point.

We use a deep neural network with three hidden layers, where each hidden layer has 32 units and all layers are fully connected. The activation functions are ReLU except for the output layer. Regarding the activation functions of the output layer, we use a linear function at the units for the approximated state value function $V(s; \theta)$ and the elements of the matrix $P_L(s; \theta)$, while we use a hyperbolic tangent function scaled by a factor of 2 at the unit for the control input $\mu(s; \theta)$. The size of the replay buffer is $. \times$ and the minibatch size is 64. The parameter vector of the deep neural network is updated by Adam, where its step size is set to $. \times$ . The soft update rate $\beta$ for the target network is 0.01, and the discount rate $\gamma$ for the Q-function is 0.99. For the exploration noise process $\mathcal{N}$, we use an Ornstein-Uhlenbeck process.
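For reference, a Python sketch of one step of the controlled map of Eqs. (15)-(17) is given below. The parameter values `b`, `eta`, and the input gain `c` are placeholders (the numerical values used in the paper are not reproduced here), and the quadratic coefficient 0.05 follows the commonly quoted form of the map.

```python
import numpy as np

def gumowski_mira_step(x, u, b=0.7, eta=-0.8, c=0.1):
    """One step of the controlled Gumowski-Mira map, Eqs. (15)-(17).
    x is the state vector [x_k, y_k], u is the scalar control input;
    b, eta, and the input gain c are placeholder values."""
    def f(v):
        return eta * v + 2.0 * (1.0 - eta) * v ** 2 / (1.0 + v ** 2)

    x_k, y_k = x
    x_next = y_k + b * (1.0 - 0.05 * y_k ** 2) * y_k + f(x_k) + c * u
    y_next = -x_k + f(x_next)
    return np.array([x_next, y_next])
```

With $u = 0$ this function can also be used to generate the uncontrolled trajectory from which the fixed point is estimated in Sec. II, for example by iterating it from a random initial state and passing the resulting trajectory to `estimate_fixed_point`.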
Moreover, we set the parameters of the reward function (6) as follows:
$M_1 = \begin{bmatrix} 0.08 & 0 \\ 0 & 0. \end{bmatrix}, \qquad (19)$
$M_2 = 0.\,, \qquad (20)$
$q = 20. \qquad (21)$
In the simulation, we assume that state transitions of the system occur 10800 times per behavior ($K = 10800$). Moreover, we assume that the parameter vector $\theta$ of the deep neural network is updated twice ($I = 2$) every 80 state transitions ($k_p = 80$).

We now show the simulation results. The learning curve is shown in Fig. 6. The horizontal axis represents the number of episodes, and the vertical axis represents the mean value of the immediate rewards obtained within the 10800 transitions of each episode. The solid line represents the average learning performance obtained over 100 runs of learning, and the shaded area represents the 99% confidence interval. It is shown that high immediate rewards are obtained as the updates of the parameter vector of the deep neural network are repeated.

Moreover, the time response of the chaotic system controlled by the controller that has sufficiently learned its control policy is shown in Fig. 7, where the initial state is $x_0 = [0.\,,\, 0.\,]^T$. It is shown that the controller applies small control inputs when the state enters the region $D$ and stabilizes it to the fixed point. Shown in Fig. 8 is the control input at each state in the region $D$ computed by the learned controller. It is shown that the learned controller is not linear but nonlinear. Thus, the proposed method with continuous deep Q-learning can learn a nonlinear control policy for stabilizing a desired fixed point without identifying a mathematical model of the chaotic system.

V. CONCLUSION AND FUTURE WORK
In this paper, we proposed a control method to stabilize a periodic orbit embedded in a discrete-time chaotic system using DRL, where the model of the discrete-time system is not identified. Moreover, we showed the usefulness of the proposed learning algorithm by a numerical simulation of the Gumowski-Mira map. It is future work to propose a chaos control method for continuous-time chaotic systems using a Poincaré map.
ACKNOWLEDGMENTS
This work was partially supported by the JST-ERATO HASUO Project, Grant Number JPMJER1603, Japan, and the JST-Mirai Program, Grant Number JPMJMI18B4, Japan.

FIG. 6. Learning curve: the mean of the immediate rewards obtained within the 10800 transitions after each learning episode. The solid line represents the average learning performance obtained over 100 runs of learning, and the shaded area represents the 99% confidence interval.

FIG. 7. Time response of the chaotic system controlled by the controller that has sufficiently learned its control policy, where the initial state is $x_0 = [0.\,,\, 0.\,]^T$.

FIG. 8. Learned control input at each state in the region $D'$. The color represents the value of the control input; the learned controller is not linear but nonlinear. The cross mark represents the convergence state $s = [0.\,,\, -0.\,]^T$ ($x = [1.\,,\, 0.\,]^T$) in Fig. 7.

E. Ott, C. Grebogi, and J. A. Yorke, Phys. Rev. Lett. 64, 1196 (1990).
K. Pyragas, Phys. Lett. A 170, 421 (1992).
T. Ushio, IEEE Trans. Circuits and Systems I 43, no. 9 (1996).
H. Nakajima, Phys. Lett. A 232, no. 3-4 (1997).
K. Pyragas, Phys. Lett. A 206, no. 5-6 (1995).
S. Yamamoto, T. Hino, and T. Ushio, IEEE Trans. Circuits and Systems I 48, no. 6 (2001).
H. Nakajima and Y. Ueda, Phys. Rev. E 58, 1757 (1998).
T. Ushio and S. Yamamoto, Phys. Lett. A 264, no. 1 (1999).
H. Nakajima, H. Ito, and Y. Ueda, IEICE Trans. Fundamentals E80-A, no. 9 (1997).
A. Boukabou and N. Mansouri, Nonlinear Analysis: Modelling and Control 10, no. 2 (2005).
L. Shen, M. Wang, W. Liu, and G. Sun, Phys. Lett. A 372, no. 46 (2008).
R. Der and M. Herrmann, in Proceedings of the IEEE International Conference on Neural Networks, Orlando, 1994, Vol. 4, pp. 2472-2475.
R. Der and M. Herrmann, Nonlinear Theory and Applications, pp. 441-444 (1996).
M. Funke, M. Herrmann, and R. Der, International Journal of Adaptive Control and Signal Processing, pp. 489-499 (1997).
R. Der and M. Herrmann, Classification in the Information Age, pp. 302-309 (1998).
J. Randlov, A. G. Barto, and M. T. Rosenstein, Computer Science Department Faculty Publication Series (2000).
S. Gadaleta and G. Dangelmayr, Chaos: An Interdisciplinary Journal of Nonlinear Science 9, no. 3 (1999).
S. Gadaleta and G. Dangelmayr, in Proceedings of the IEEE International Joint Conference on Neural Networks, Washington, 2001, Vol. 2, pp. 996-1001.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, Nature (London) 518, 529-533 (2015).
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Nature (London) 529, 484-489 (2016).
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, arXiv preprint arXiv:1509.02971 (2015).
M. A. Bucci, O. Semeraro, A. Allauzen, G. Wisniewski, L. Cordier, and L. Mathelin, arXiv preprint arXiv:1906.07672 (2019).
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, in Proceedings of the International Conference on Machine Learning, New York, 2016, edited by M. F. Balcan and K. Q. Weinberger.
S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, in Proceedings of the International Conference on Machine Learning, New York, 2016, edited by M. F. Balcan and K. Q. Weinberger.
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, in Neural Information Processing Systems, Denver, 1999, edited by S. A. Solla, T. K. Leen, and K. Müller.
D. P. Kingma and J. Ba, arXiv preprint arXiv:1412.6980 (2014).
C. Mira, in The Chaos Avant-garde: Memories of the Early Days of Chaos Theory, edited by R. Abraham and Y. Ueda (World Scientific, 2000).
G. E. Uhlenbeck and L. S. Ornstein, Phys. Rev. 36, 823 (1930).