Deep Reinforcement Learning for Safe Landing Site Selection with Concurrent Consideration of Divert Maneuvers
Keidai Iiyama, Kento Tomita, Bhavi A. Jagatia, Tatsuwaki Nakagawa, Koki Ho
(This paper is an updated version of Paper AAS 20-583 presented at the AAS/AIAA Astrodynamics Specialist Conference, Online, in 2020.)

ABSTRACT
This research proposes a new integrated framework for identifying safe landing locations and planning in-flight divert maneuvers. State-of-the-art algorithms for landing zone selection use local terrain features such as slope and roughness to judge the safety and priority of each candidate landing point. However, when additional chances of observation and diverting remain in the future, these algorithms cannot evaluate the safety of the decision itself to target the selected landing point in the context of the overall descent trajectory. In response to this challenge, we propose a reinforcement learning framework that optimizes a landing site selection strategy concurrently with a guidance and control strategy to the target landing site. The trained agent evaluates and selects landing sites with explicit consideration of the terrain features, the quality of future observations, and the control needed to achieve a safe and efficient landing trajectory at the system level. The proposed framework achieved a 94.8% rate of successful landing in highly challenging scenarios where over 80% of the area around the initial target landing point is hazardous, by effectively updating the target landing site and the feedback control gains during descent.

INTRODUCTION
On-board hazard detection and avoidance (HDA) capabilities are essential to enable new mission concepts that involve planetary surface operations. With a quick assessment of the perceived terrain data (e.g., a DEM, a visible spectrum map, or a combination thereof) from optical and/or LIDAR sensors, HDA technology creates a map of the probability of safety for prioritizing candidate landing zones. NASA has been actively developing technology for Precise Landing and Hazard Avoidance (PL&HA), and the Safe and Precise Landing Integrated Capabilities Evolution (SPLICE) project is currently underway to develop next-generation PL&HA technologies.
The HDA algorithm being developed for SPLICE directly leverages the HDA algorithms from the previous Autonomous Landing Hazard Avoidance Technology (ALHAT) project. ALHAT is capable of quickly assessing the DEM on-board and in real time during the descent, and assigns a probability of a safe landing to each pixel on the map.

Figure 1: Landing site selection and control. The preference of the control policy is affected by the landing site selection policy: when designing a trajectory to landing site B, a straightforward trajectory to the current target B is fuel optimal, while a longer path to target B could be more robust if a divert to landing site A could become necessary in the future.

Figure 2: Observability and trajectory.
ALHAT's HDA capability, along with the terrain relative navigation and hazard detection functions, was successfully demonstrated during hardware-in-the-loop testing on the Morpheus Vertical Testbed.
However, a critical caveat of ALHAT is that it only selects a list of discrete landing points based on local static information (e.g., slopes, roughness, etc.). To consider the feasibility of the divert maneuver, a landing site selection algorithm that calculates an approximate landing footprint has been presented.
To reflect predetermined scientific values of each landing site in the landing site selection process, landing site selection methods that leverage Bayesian networks have been proposed. Cui et al. (2017) proposed a method that calculates a synthetic landing area assessment criterion based on terrain safety, fuel consumption, and touchdown performance during descent. In these previously proposed landing site assessment methods, future changes in the target landing site are not considered. However, since the field of view and the quality of the observation data change depending on the spacecraft state, the best landing site changes each time new observation data is obtained. This effect cannot be ignored, especially when observability is limited or safe landing sites are sparse. When multiple chances of observation and divert maneuvers are considered, the following two aspects should be taken into account.

The first aspect is the coupling between target landing site selection and control maneuver planning. Existing landing site selection algorithms assume that the lander is guided by a simple predetermined control law, while the control law is designed to guide the lander to the decided landing target. However, when additional divert maneuvers are available in the future, the landing site selection policy is dependent on the control policy, and vice versa. Therefore, the landing site and the control plan have to be decided simultaneously in order to achieve an overall optimal trajectory. Fig. 1 shows an example of this coupling. Suppose landing site A and landing site B both have uncertainty in the safety level that can be judged from the observation. If there is little residual fuel, it could be safer to take the fuel-optimal path to the closer landing site B (Trajectory B). However, when there is sufficient residual fuel, it could be safer to take a longer path to landing site B (Trajectory A), if an additional divert to landing site A could happen in case landing site B turns out to be unsafe as the lander approaches the ground.

The second aspect is the observability of the terrain during descent. As illustrated in Figure 2, changes in the control policy and the target landing site affect the quality and number of observations obtained during descent. Therefore, the controller has to be aware of whether sufficient information to judge a good landing site can be obtained by following the planned trajectory. In general landing trajectory design, glide slope constraints are applied as a path constraint to ensure observations from higher elevations. However, the actual quality and quantity of observation information cannot be explained by slope angles alone. To tackle this problem, Crane (2013) developed an on-line information-seeking trajectory modification method that selects a trajectory minimizing a weighted sum of the estimated entropy and field of view of the images obtained in the future, the fuel consumption, and the coverage of the entire field. However, that research focused on modifying the nominal trajectory for a fixed landing site, and updating the landing site successively during descent was not considered.
The impact of landing site selection on the observation data quality should also be considered, since changing the landing site affects the observation through changes in the trajectory and the observation target.

To tackle these problems, this paper proposes a learning-based method that selects the landing site and designs the guidance trajectory successively during the HDA phase. The agent in the proposed method seeks to maximize the total probability of landing at a safe landing site while minimizing the total fuel consumption, the landing error with respect to the target, and the final velocity. Model-free reinforcement learning techniques are leveraged to learn a policy that concurrently selects the landing site and the guidance strategy to the target, by interacting with the simulator environment and learning how to maximize the total probability of a successful landing with minimal fuel consumption. The general concept is shown in Fig. 3.

Figure 3: General concept.

The outline is as follows. First, we explain the assumptions used in modeling the HDA phase and show that the sequence of action choices can be formulated as a partially observable Markov decision process (POMDP). Next, the method to optimize the landing site selection policy and the controller is introduced. Finally, the performance of the obtained controller is analyzed, and a qualitative interpretation is given.
PROBLEM FORMULATION

Dynamics
A soft lunar landing scenario is assumed in this paper, and the 3 degrees of freedom (3-DOF) problem is considered. The equations of motion governing the dynamics of the problem are given as follows:

$\dot{\mathbf{r}} = \mathbf{v}, \qquad \dot{\mathbf{v}} = \mathbf{a} = \frac{\mathbf{T}}{m} + \mathbf{g}, \qquad \dot{m} = -\frac{\|\mathbf{T}\|}{g_{ref}\, I_{sp}}$  (1)

where $\mathbf{r} = [r_x\ r_y\ r_z]^T$ is the position in the target-centered orthonormal frame with the z-axis pointing upward, and $\mathbf{g} = [0\ 0\ -g]^T$ is the constant lunar gravity. In this paper, gravity is considered constant during the entire mission, and the effect of planetary rotation is ignored. In addition, a limitation on the thrust magnitude is applied as follows:

$0 \le \|\mathbf{T}\| \le T_{max}$  (2)

The following glide slope constraint is applied in most planetary landing problems:

$\theta_g = \arctan\left(\frac{\sqrt{r_x^2 + r_y^2}}{r_z}\right) < \theta_{g,max}$  (3)

This constraint is applied in order to prevent the lander from hitting the terrain at low altitudes, and to ensure that the inclination to the target is kept above a certain level for navigation and hazard detection purposes.
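To make the simulator dynamics concrete, the following minimal sketch propagates Eq. (1) under the thrust bound of Eq. (2) with a first-order Euler step. The lunar gravity value, the reference gravity for $I_{sp}$, the time step, and the Euler integrator itself are our own assumptions for illustration, not details specified in the paper.

```python
import numpy as np

G_REF = 9.80665                            # [m/s^2] reference gravity for Isp (assumed)
G_MOON = np.array([0.0, 0.0, -1.62])       # [m/s^2] constant lunar gravity (assumed value)

def step_dynamics(r, v, m, thrust, isp, t_max, dt):
    """Propagate the 3-DOF lander state (Eq. 1) by one Euler step of length dt.

    r, v   : position / velocity in the target-centered frame [m], [m/s]
    m      : lander mass [kg]
    thrust : commanded thrust vector [N], clipped to the bound of Eq. (2)
    """
    # Enforce the thrust-magnitude constraint of Eq. (2)
    t_norm = np.linalg.norm(thrust)
    if t_norm > t_max:
        thrust = thrust * (t_max / t_norm)
        t_norm = t_max

    a = thrust / m + G_MOON                # translational dynamics
    m_dot = -t_norm / (G_REF * isp)        # mass depletion

    r_next = r + v * dt
    v_next = v + a * dt
    m_next = m + m_dot * dt
    return r_next, v_next, m_next
```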
Terrain Generation

Observation data consist of a DEM generated by an HD LIDAR and spacecraft state information (position, velocity, mass) with noise. In order to generate LIDAR DEMs, we need a true DEM as a reference. The Lunar Reconnaissance Orbiter Camera (LROC) database provides Digital Terrain Models (DTMs) of the lunar surface. However, the resolution of those DTMs is too coarse to represent the small-scale hazards relevant to landing, so the true terrain is generated synthetically using the crater and rock hazard model proposed by Bernard. While the model was originally developed for Mars, studies of rock density on the moon show that the model can be substituted for the lunar topography by adjusting the parameters. The generated terrain has a size of 1000 m x 1000 m and a resolution of 1 m. The generation process of the terrain is described in Fig. 4. An example of the generated terrain is shown in Fig. 5.

Figure 4: Terrain generation process.

Figure 5: Example of the generated terrain.
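The paper generates terrain with the crater and rock hazard model of Bernard; the snippet below is not that model, only a toy stand-in that scatters Gaussian bumps on a 1 m grid to illustrate the kind of synthetic DEM the simulator consumes. All radii, heights, and counts here are arbitrary placeholders.

```python
import numpy as np

def toy_dem(size=1000, n_rocks=300, seed=0):
    """Toy 1 m-resolution DEM built by superposing random Gaussian bumps.

    Illustrative stand-in only; the actual terrain in the paper follows the
    crater/rock hazard model with lunar rock-density parameters.
    """
    rng = np.random.default_rng(seed)
    y, x = np.mgrid[0:size, 0:size].astype(float)
    dem = np.zeros((size, size))
    for _ in range(n_rocks):
        cx, cy = rng.uniform(0, size, 2)
        radius = rng.uniform(1.0, 6.0)      # [m] rock radius (arbitrary)
        height = rng.uniform(0.1, 1.5)      # [m] rock height (arbitrary)
        dem += height * np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * radius ** 2))
    return dem
```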
Safety Map Generation
In order to assess whether the landing has succeeded in the simulator, the safety of each landing point in the generated terrain map has to be assessed. We evaluated the safety by applying the safety assessment algorithm developed in the ALHAT project to the entire generated terrain. The algorithm for assessing the deterministic safety value $V_D \in \{0, 1\}$ is shown in Algorithm 1.

Observation Data Generation
The control agent utilizes two types of observation data for control: the estimated lander state (position, velocity, and mass) and a 2D map of the terrain. For the lander state, we assumed that we have perfect information without errors; incorporating navigation errors and hazard relative navigation algorithms into the framework is important future work. For the 2D map of the terrain, we considered two options. The first option is to directly pass the LIDAR DEM, while the second option is to pass the stochastic safety map obtained by applying Algorithm 1 to the obtained LIDAR DEM. The first option requires the control agent to assess the safety of each landing site from scratch, while in the second option the control agent has access to more direct information about the safety of each landing point based on roughness and slope values. In this paper, we chose the second approach, as we failed to design a controller using the first approach.

The simulation of the LIDAR DEM during descent requires complex calculations. To simulate a LIDAR DEM, detector pattern formulation, ray interception with the terrain, addition of the range bias, transformation to point clouds, and transformation to a digital elevation map have to be conducted, even if we assume that the LIDAR position is fixed to a known position. In lunar descent cases, errors from vehicle state knowledge, LIDAR misalignment, map assembly errors when considering vehicle motions, and system latency also have to be considered.
Algorithm 1: Deterministic and stochastic safety map generation

Input: DEM $D$ ($m \times n$)
Output: deterministic safety value map $V_D$ ($m \times n$) $\in \{0, 1\}$, stochastic safety value map $V_P$ ($m \times n$) $\in [0, 1]$
Initialize: $S \leftarrow [0, \dots, 0]$  (list of slopes for each orientation), $R \leftarrow [0, \dots, 0]$  (list of roughness values for each orientation), $P \leftarrow [0, \dots, 0]$  (list of safety probabilities for each orientation), $O \leftarrow [0, \frac{1}{n_o}\pi, \dots, \frac{n_o-1}{n_o}\pi]$  (possible lander footpad orientations)
for row = 1, ..., m do   (for each pixel in the DEM)
  for col = 1, ..., n do
    Calculate pad placement and contact
    for $o_i$ = 1, ..., $n_o$ do   (calculate worst-case slope over all orientations)
      $S[o_i]$ = GETSLOPE(row, col, D)
    end for
    $s_{max} \leftarrow$ MAX($S$)
    if $s_{max} >$ safe_slope_threshold then   (slope not safe)
      $V_D$[row][col] $\leftarrow 0$
    else
      for $o_i$ = 1, ..., $n_o$ do
        $R[o_i]$ = GETROUGHNESS(row, col, D)
        $P[o_i]$ = ROUGHNESSTOSAFETYPROBABILITY(row, col, D, R)
      end for
      $V_P$[row][col] $\leftarrow$ MIN($P$)
      $r_{max} \leftarrow$ MAX($R$)
      if $r_{max} >$ safe_roughness_threshold then   (roughness not safe)
        $V_D$[row][col] $\leftarrow 0$
      else   (slope and roughness safe)
        $V_D$[row][col] $\leftarrow 1$
      end if
    end if
  end for
end for
return $V_D$, $V_P$

Since accurate modeling of the entire procedure is extremely complicated, as an initial concept study we chose to generate a pseudo "observation safety map" at each observation timing by cutting out the corresponding portion of the stochastic safety map of the entire field and adding noise to it. The stochastic safety map can be obtained by directly applying Algorithm 1 to the entire terrain DEM. In this way, we greatly save the computation time required for simulation, since we avoid generating DEMs and running Algorithm 1 at each observation timing.

When generating the observation safety map, three main relationships between the lander's state (or trajectory) and the LIDAR DEM (and the generated safety map) were modeled. The first effect is the field of view (FOV) of the LIDAR DEM. For simplicity, we assumed the obtained DEM is square and that its two axes are always aligned with the x, y coordinates of the map. The length of one side of the square, $w_x = w_y$, is calculated using the following equation:

$w_x = r_s \tan \phi_l$  (4)

where $r_s$ is the slant range (the distance between the lander and the target position), and $\phi_l$ is the field of view of the LIDAR, which was set to a fixed value (Fig. 6). In reality, the DEM field of view is not rectangular, and the field of view differs between the horizontal and vertical directions of the lander, but we believe this assumption is fair enough for an initial concept study.

Figure 6: Illustration of the field of view of the LIDAR DEM.

The second effect is the effect of the slant range. As the slant range increases, errors in the range measurements in the LIDAR DEM increase. This effect was simulated by adding to each pixel Gaussian noise with a standard deviation proportional to the distance to that pixel. In addition, as the slant range gets longer the sampling distance increases, and small boulders in the terrain are overlooked. This effect was modeled by fixing the LIDAR DEM size to 64x64. The third effect is the effect of the slant angle, i.e., the angle between the horizontal plane and the beam direction. As the slant angle decreases, holes appear in the DEM behind surface hazards. Therefore, regions where the safety value cannot be calculated appear in the safety map, which leads to fewer safe regions in the map.
In this paper, this effect was simply modeled by assigning unsafe labels to pixels whose slant angle with respect to the lander is smaller than 70 degrees. In our settings, it is assumed that LIDAR DEMs can be obtained every 5 seconds, and the DEM size is fixed to 64x64. The interval time is based on the maximum processing time of the ALHAT algorithm to process a DEM and generate a safety map. The entire observation data generation process is described in Algorithm 2.

Algorithm 2: Observation data generation at each timestep

Input: true probabilistic safety value map of the entire terrain $V_P$ ($1000 \times 1000$), true position of the lander in a surface-fixed frame $X_L = [x_l, y_l, z_l]$
Output: noisy probabilistic safety value map $O_P$ ($64 \times 64$) $\in [0, 1]$, $w_x$, $w_y$
// calculate the FOV of the DEM and return 2D arrays of the center position of each pixel in the observed DEM
$X_D, Y_D, w_x, w_y$ = CALCFOV($X_L$, $\phi_l$)   ($\phi_l$: LIDAR FOV, Eq. 4)
for row = 1, ..., m do   (for each pixel in the DEM)
  for col = 1, ..., n do
    $x_d, y_d$ = $X_D$[row][col], $Y_D$[row][col]
    $\theta$ = SLANTANGLE($x_d$, $y_d$, $X_L$)
    if $\theta$ violates the slant-angle threshold then   (the pixel is viewed too obliquely)
      $O_P$[row][col] = 0
    else
      // safety probability of the pixel, found by nearest-neighbor lookup in the whole $V_P$ map
      $v_p$ = NEARESTNEIGHBOR($x_d$, $y_d$, $V_P$)
      $r$ = SLANTRANGE($x_d$, $y_d$, $X_L$)   ([m])
      $O_P$[row][col] = CLIP($v_p$ + $\mathcal{N}(0, \sigma(r))$, 0, 1)   (add range-proportional noise to the safety value)
    end if
  end for
end for
return $O_P$, $w_x$, $w_y$
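The sketch below illustrates the same pseudo-observation idea in Python: crop a 64x64 window of the stochastic safety map around the target using Eq. (4), mask pixels that are viewed too obliquely, and add range-proportional noise. The noise coefficient, the slant-angle threshold, and the nearest-neighbor lookup details are assumptions for illustration, not the paper's exact values.

```python
import numpy as np

def observe_safety_map(v_p, lander_pos, target, fov_rad,
                       n_pix=64, noise_coeff=1e-4, slant_thresh_deg=70.0):
    """Pseudo LIDAR observation of the stochastic safety map (sketch of Algorithm 2).

    v_p        : full-field stochastic safety map at 1 m resolution, shape (H, W)
    lander_pos : lander position [x, y, z] in the surface-fixed frame [m]
    target     : current target landing point [x, y, z] [m]
    fov_rad    : LIDAR field of view [rad]
    noise_coeff, slant_thresh_deg : placeholder values, not from the paper
    """
    slant_vec = np.asarray(target) - np.asarray(lander_pos)
    r_s = np.linalg.norm(slant_vec)          # slant range to the target
    w = r_s * np.tan(fov_rad)                # footprint width, Eq. (4)

    # Pixel centers of the observed 64x64 window, centered on the target
    xs = np.linspace(target[0] - w / 2, target[0] + w / 2, n_pix)
    ys = np.linspace(target[1] - w / 2, target[1] + w / 2, n_pix)
    obs = np.zeros((n_pix, n_pix))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            dx, dy = x - lander_pos[0], y - lander_pos[1]
            dz = lander_pos[2]               # terrain height neglected in this sketch
            rng_px = np.sqrt(dx * dx + dy * dy + dz * dz)
            slant_angle = np.degrees(np.arcsin(dz / rng_px))
            if slant_angle < slant_thresh_deg:          # too oblique: mark unsafe
                obs[i, j] = 0.0
            else:
                # nearest-neighbor lookup in the 1 m-resolution full map
                row = int(np.clip(round(y), 0, v_p.shape[0] - 1))
                col = int(np.clip(round(x), 0, v_p.shape[1] - 1))
                noisy = v_p[row, col] + np.random.normal(0.0, noise_coeff * rng_px)
                obs[i, j] = np.clip(noisy, 0.0, 1.0)
    return obs, w
```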
Modeling the Problem as a POMDP
The HDA sequence is modeled as a POMDP $\mathcal{P} = \langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O \rangle$ as follows.

• State space $s \in \mathcal{S}$: spacecraft state (position, velocity, mass), true DEM of the entire field
• Action space $a \in \mathcal{A}$: lander thrust output, target landing position (determines the next target position for the LIDAR measurement)
• Reward $r \in R$: consumed fuel, the safety of the final landing point, final velocity
• State transition function $T: \mathcal{S} \times \mathcal{A} \rightarrow \Pi(\mathcal{S})$: stochastic dynamics of the spacecraft. $T(s, a, s')$ is the probability of moving to state $s' \in \mathcal{S}$ when the agent at state $s \in \mathcal{S}$ takes action $a \in \mathcal{A}$.
• Observation space $\Omega$: LIDAR DEM, predicted spacecraft state
• Observation function $O: \mathcal{S} \times \mathcal{A} \rightarrow \Pi(\Omega)$: LIDAR DEM generation model, spacecraft state observation model

Obtaining an optimal policy (a policy that maximizes the expected reward) in a POMDP is difficult due to the partial observability of the problem. In general, the agent requires the entire history of observation and action pairs $h_t = \{a_0, r_0, o_1, \dots, a_{t-1}, r_{t-1}, o_t\}$. A POMDP can be converted into a belief Markov decision process (belief MDP) by introducing a belief state $b$, which is a conditional probability distribution over $s \in \mathcal{S}$ given the history $h_t$. When the state transition function $T$ and the observation function $O$ are known, an optimal policy of the POMDP can be obtained by approximating the belief MDP and applying a value iteration method.

In this paper, we seek to obtain the optimal policy of the above POMDP without modeling the $T$ and $O$ functions, but by learning from transition data collected by interacting with the simulator. There are two model-free approaches for POMDPs. The first is the memory-less approach, which learns a Markov policy by simply treating the most recent observation $o$ as the state. In a POMDP the Bellman equation is not strictly satisfied, so deterministic policies leveraging only current observation information are not guaranteed to be optimal. However, when the partial observability of the problem is weak, a memory-less approach may be sufficient. The second is the memory-based approach, which learns a history-dependent policy that uses the entire history data. This is usually achieved by using Recurrent Neural Networks (RNNs) that can store history information as a single state. In this paper, both memory-less and memory-based approaches are tested.

GUIDANCE AND CONTROL

Observation Data Interpretation
Since 64 x 64 map data is used as part of the observation, a policy that takes the full image data as an input has too large a dimension to optimize. Therefore, the high-dimensional map observation is transformed into a low-dimensional internal state value by leveraging an auto-encoder. The role of the auto-encoder is to extract an abstract, compressed representation of the LIDAR DEM data as a latent vector $z$, to avoid directly passing large input data down to the reinforcement learning agent. The auto-encoder was trained separately from the reinforcement learning agent, as proposed in the "World Models" paper by Ha (2018). The dataset for training was collected through 5000 random rollouts, and the auto-encoder was trained to minimize the difference between the observed safety map and the reconstructed safety map produced by the decoder. Fig. 7 shows examples of the training data and the reconstructed images.

Figure 7: Original observation data (top) and observation data reconstructed by the auto-encoder (bottom).
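A minimal sketch of such an auto-encoder is shown below. The paper does not report the exact network architecture or latent dimension, so the layer sizes and the 32-dimensional latent vector here are our own assumptions; only the overall idea (compress a 64x64 safety map to a latent vector $z$ and train on reconstruction error) follows the text.

```python
import torch
import torch.nn as nn

class SafetyMapAutoencoder(nn.Module):
    """Convolutional auto-encoder compressing a 64x64 safety map to a latent vector z."""

    def __init__(self, latent_dim=32):          # latent size is an assumption
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)                      # latent vector passed to the RL agent
        return self.decoder(z), z

# Training objective: reconstruction loss between observed and reconstructed maps, e.g.
# recon, z = model(batch); loss = nn.functional.mse_loss(recon, batch)
```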
General Concept of Control

For control, the most direct approach is to train the agent to output a three-dimensional thrust vector that guides the lander along a robust and fuel-efficient trajectory to the selected landing point. However, learning this feed-forward control policy from scratch requires a high computational cost. Several studies have tackled this problem by shaping the reward in an effective way, but we took a different approach in order to relate the control policy to the target landing points. We adopted the Zero-Effort-Miss/Zero-Effort-Velocity (ZEM-ZEV) feedback guidance algorithm as a baseline guidance law, and designed the agent so that it controls the target landing point $r_f$ and the adaptive hyper-parameters ($K_R$, $K_V$, $t_{go}$) of the ZEM-ZEV guidance algorithm. The idea of learning adaptive hyper-parameters of the ZEM-ZEV feedback guidance algorithm with reinforcement learning has already been proposed in previous research, to design a ZEM-ZEV feedback controller that avoids slant angle constraint violations during descent. The difference in our paper is that the target position also changes during the descent.
Controller
The ZEM/ZEV feedback guidance algorithm calculates the optimal acceleration using the ZEM and ZEV, which represent the differences between the target final position and velocity and the projected final position and velocity when no additional control is applied from the current time $t$. The optimal acceleration is calculated as

$\mathbf{a} = \frac{K_R}{t_{go}^2}\,\mathrm{ZEM} - \frac{K_V}{t_{go}}\,\mathrm{ZEV}$  (5)

where $K_R$, $K_V$ are control gains and $t_{go}$ is the time-to-go. When there are no limitations on the thrust magnitude and no constraints on the trajectory, it is proved that the energy-optimal trajectory is obtained by setting the control gains to $K_R = 6$, $K_V = 2$. The energy-optimal time-to-go $t_{go}$ is obtained by solving the following equation:

$\mathbf{g}^T\mathbf{g}\, t_{go}^4 - 2\left(\mathbf{v}^T\mathbf{v} + \mathbf{v}_f^T\mathbf{v} + \mathbf{v}_f^T\mathbf{v}_f\right) t_{go}^2 + 12\,(\mathbf{r}_f - \mathbf{r})^T(\mathbf{v} + \mathbf{v}_f)\, t_{go} - 18\,(\mathbf{r}_f - \mathbf{r})^T(\mathbf{r}_f - \mathbf{r}) = 0$  (6)

When there is a constraint $\|\mathbf{T}\| = m\|\mathbf{a}\| \le T_{max}$ on the thrust magnitude, the saturated acceleration is used for control as follows:

$\mathbf{a} = \begin{cases} \bar{\mathbf{a}} & \|\bar{\mathbf{a}}\| \le T_{max}/m \\ \dfrac{T_{max}}{m}\dfrac{\bar{\mathbf{a}}}{\|\bar{\mathbf{a}}\|} & \|\bar{\mathbf{a}}\| > T_{max}/m \end{cases}$  (7)

$\bar{\mathbf{a}} = \frac{K_R}{t_{go}^2}\,\mathrm{ZEM} - \frac{K_V}{t_{go}}\,\mathrm{ZEV}$  (8)

The goal of this research is to train a reinforcement learning agent that decides $K_R$, $K_V$, $t_{go}$, and the target landing position $\mathbf{r}_f$ at each time step.
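To make Eqs. (5)-(8) concrete, here is a minimal sketch of the guidance computation. The ZEM/ZEV projections assume constant gravity, consistent with the dynamics model above; the function signature and variable names are our own, and this is only an illustrative implementation of the standard formulation, not the paper's code.

```python
import numpy as np

def zem_zev_acceleration(r, v, r_f, v_f, g, t_go, k_r, k_v, t_max, m):
    """Saturated ZEM/ZEV guidance acceleration, Eqs. (5), (7), (8)."""
    # Zero-effort miss / zero-effort velocity under constant gravity
    zem = r_f - (r + v * t_go + 0.5 * g * t_go ** 2)
    zev = v_f - (v + g * t_go)

    a = (k_r / t_go ** 2) * zem - (k_v / t_go) * zev   # Eq. (5) / (8)

    # Thrust-magnitude saturation, Eq. (7)
    a_max = t_max / m
    a_norm = np.linalg.norm(a)
    if a_norm > a_max:
        a = a * (a_max / a_norm)
    return a
```

With the energy-optimal settings $K_R = 6$, $K_V = 2$ and $t_{go}$ from Eq. (6), this reduces to the classical energy-optimal feedback law; the trained agent instead supplies $K_R$, $K_V$, and the time-to-go decrement at every observation step.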
TRAINING

Reinforcement Learning

In standard reinforcement learning problems, the agent interacts with a fully observed environment $E$. At each time step $t$, the agent outputs an action $a_t$, and the environment returns the next observation $o_{t+1} = s_{t+1}$ and a step reward $r_{t+1} = g(s_t, a_t)$. The agent decides its action based on its policy $\pi(s_t)$, which maps the state to a probability distribution over possible actions. The environment follows the transition dynamics $p(s_{t+1}|s_t, a_t)$ and the initial state distribution $p_{s_0}$. The entire process can be modeled as a Markov decision process $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, p_{s_0}, p, g\}$. The goal of the agent is to learn a policy that maximizes the expectation of the discounted future reward $E[R_t|\mathcal{M}, \pi]$, where $R_t = \sum_{i=t}^{T} \gamma^{i-t} g(s_i, a_i)$. Reinforcement learning algorithms utilize the following action-value function (or Q-function), which describes the expected return after taking action $a_t$ in state $s_t$ and then following policy $\pi$:

$Q^{\pi}(s_t, a_t) = E[R_t\,|\,s_t, a_t]$  (9)

For the Q-function, it is known that the following recursive relation, called the Bellman equation, holds:

$Q^{\pi}(s_t, a_t) = E_{r_t, s_{t+1} \sim E}\big[g(s_t, a_t) + \gamma\, E_{a_{t+1} \sim \pi}[Q^{\pi}(s_{t+1}, a_{t+1})]\big]$  (10)
Deep Deterministic Policy Gradient (DDPG)

The Deep Deterministic Policy Gradient (DDPG) is an actor-critic, off-policy, model-free reinforcement learning algorithm for continuous action spaces. In actor-critic methods, the critic estimates the action-value function $Q(s, a)$, while the actor produces the action given the current state based on its policy $\pi(a|s)$. For the actor, we consider a parameterized deterministic policy $a = \pi_{\phi}(s)$ with parameter $\phi$. $\phi$ can be trained using the following deterministic policy gradient theorem and the Q-values estimated by the critic:

$\nabla_{\phi} J(\phi) = E_{s \sim \rho^{\pi}}\big[\nabla_a Q(s, a|\theta)|_{s=s_t, a=\mu(s_t)}\, \nabla_{\phi}\mu(s|\phi)|_{s=s_t}\big]$  (11)

For the critic, in order to handle continuous state and action spaces, $Q(s, a)$ is also approximated using a function approximator parameterized by $\theta$. The critic uses the collected data to learn parameters that better approximate $Q(s, a)$. This is achieved by minimizing the following loss $L(\theta)$:

$L(\theta) = E_{s_t \sim \rho^{\beta}, a_t \sim \beta, r_t \sim E}\big[(Q(s_t, a_t|\theta) - y_t)^2\big]$  (12)

where $\beta$ is an arbitrary stochastic behavior policy, $\rho^{\beta}$ is the state visitation distribution under policy $\beta$, and $y_t$ is the target value

$y_t = g(s_t, a_t) + \gamma Q(s_{t+1}, a_{t+1}|\theta)$  (13)

In DDPG, a deterministic policy is considered. For a deterministic policy $a = \mu(s)$, the inner expectation in Eq. 10 can be eliminated as follows:

$Q^{\mu}(s_t, a_t) = E_{r_t, s_{t+1} \sim E}\big[g(s_t, a_t) + \gamma Q^{\mu}(s_{t+1}, \mu(s_{t+1}))\big]$  (14)

Therefore, when calculating the loss in Eq. 12, transitions obtained from a different stochastic policy can be used. The collected transition data $(s_t, a_t, r_t, s_{t+1})$ are stored in a replay buffer, and mini-batches are sampled from the replay buffer for the loss calculation in Eq. 12. This makes the training samples more identically and independently distributed, which is the basic assumption for neural network training. When exploring the environment to collect transition data, noise is added to the actor policy:

$\bar{\mu}(s_t) = \mu(s_t|\theta_t) + \mathcal{N}$  (15)

In addition, target networks are introduced in order to stabilize the training process and achieve better convergence. When calculating Eq. 13 with collected transition data $(s_i, a_i, r_i, s_{i+1})$, a target critic with parameter $\theta'$ and a target policy with parameter $\phi'$ are used:

$y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\phi')\,|\,\theta')$  (16)

The target networks are updated so as to slowly track the learned networks, as follows:

$\theta' \leftarrow \tau\theta + (1-\tau)\theta'$  (17)

$\phi' \leftarrow \tau\phi + (1-\tau)\phi'$  (18)

where $\tau \ll 1$.

The key advantage of DDPG is that it enables off-policy training of deep reinforcement learning for continuous action spaces by assuming a deterministic policy. This greatly improves the sample efficiency because transition data obtained by different policies in the past can be re-used for training the agent.
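The following compact sketch shows one DDPG update: critic regression toward the target of Eq. (16), the deterministic policy gradient of Eq. (11) implemented as gradient ascent on $Q(s, \mu(s))$, and the soft target updates of Eqs. (17)-(18). The actor/critic call signatures, the optimizer handling, and the hyper-parameter values are our own assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG update on a mini-batch (s, a, r, s_next, done) of tensors."""
    s, a, r, s_next, done = batch

    # Critic: minimize the TD error against the target networks (Eqs. 12, 16)
    with torch.no_grad():
        a_next = actor_targ(s_next)
        y = r + gamma * (1.0 - done) * critic_targ(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient (Eq. 11) -- maximize Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (Eqs. 17-18)
    with torch.no_grad():
        for p, p_targ in zip(critic.parameters(), critic_targ.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
        for p, p_targ in zip(actor.parameters(), actor_targ.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
```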
Twin Delayed Deep Deterministic Policy Gradient (TD3)

While DDPG has achieved significant performance in continuous control problems, its critical drawback is sensitivity to hyper-parameter settings. This is because overestimation of the Q-values makes the policy fall into local optima. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm makes three major changes to the DDPG algorithm to overcome this drawback. The first change (clipped double Q-learning) is to use two separate Q-value approximators (critics) and to use the smaller of the two Q-values to calculate the target value $y_t$. The second change is target policy smoothing: clipped noise is added to the action used in the target Q-value. This prevents the Q-function from developing sharp peaks by returning similar Q-values for similar actions. With clipped double Q-learning and target policy smoothing, the target value from transition data $(s_i, a_i, r_i, s_{i+1})$ is calculated as follows:

$y_i = r_i + \gamma \min_{k=1,2} Q'(s_{i+1}, \mu'(s_{i+1}|\phi') + \epsilon\,|\,\theta'_k), \quad \epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$  (19)

The last change is the delayed update of the policy and the target networks. In order to stabilize the learning process, updates of the target networks and the policy network are carried out less frequently (for example, once every N steps) than updates of the critic networks.
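The sketch below illustrates the clipped double-Q target with target-policy smoothing of Eq. (19). The noise scale, clipping range, and action bounds are typical TD3 defaults rather than values reported in the paper; the third TD3 modification (delayed updates) would simply consist of calling the actor and target-network updates only once every N critic updates.

```python
import torch

def td3_target(r, s_next, done, actor_targ, critic1_targ, critic2_targ,
               gamma=0.99, sigma=0.2, clip_c=0.5, act_low=0.0, act_high=1.0):
    """Clipped double-Q target with target-policy smoothing (Eq. 19)."""
    with torch.no_grad():
        a_det = actor_targ(s_next)
        # Target policy smoothing: add clipped Gaussian noise to the target action
        noise = torch.clamp(torch.randn_like(a_det) * sigma, -clip_c, clip_c)
        a_next = torch.clamp(a_det + noise, act_low, act_high)
        # Clipped double Q-learning: take the smaller of the two target critics
        q_min = torch.min(critic1_targ(s_next, a_next),
                          critic2_targ(s_next, a_next))
        y = r + gamma * (1.0 - done) * q_min
    return y
```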
TD3 with memory

When the system is partially observed, the optimal policy and the action-value function both become functions of the entire observation-action history $h_t$. The Recurrent Deterministic Policy Gradient (RDPG) algorithm uses recurrent neural networks (RNNs) in the policy and action-value networks to preserve (limited) information about the history as part of the state. In the RDPG setting, $\mu(s)$ and $Q(s, a)$ are replaced with $\mu(h)$ and $Q(h, a)$. The policy update is then obtained as follows:

$\nabla_{\phi} J(\phi) = E_{\tau \sim \nu^{\mu}}\Big[\sum_t \gamma^{t-1} \nabla_a Q(h, a|\theta)|_{h=h_t, a=\mu(h_t)}\, \nabla_{\phi}\mu(h|\phi)|_{h=h_t}\Big]$  (20)

where $\tau = (s_1, o_1, a_1, s_2, o_2, a_2, \dots)$ is drawn from the trajectory distribution $\nu^{\mu}$ generated by the current policy $\mu$. In this paper, the RDPG idea was implemented by adding LSTM networks to the critic and actor networks. The LSTM is an RNN architecture composed of a cell, an input gate, an output gate, and a forget gate. The critic and actor input the observation and action history $h_t$ to their LSTM, and use the output of the output gate $\tilde{h}_t$ to calculate their outputs. The network structures of the critic and the actor are shown in Figs. 8 and 9. This architecture is based on the work of Peng (2018) on robotic control. The upper network in each figure uses the current observation (and action, for the critic), while the bottom network utilizes past information stored in the LSTM. The networks were trained with the TD3 algorithm, as shown in Algorithm 3. The difference from the original paper is that in our implementation we created multiple replay buffers to store transition sequences of different episode lengths $T$. In our environment, the episode length varied from 4 to 10 time steps. We stored transition data of different lengths in separate replay buffers so that we can create mini-batches with the same episode length during training.
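A minimal sketch of such a length-bucketed replay buffer is shown below. The bucket capacity is an assumption; the idea (one bucket per episode length so that every sampled mini-batch is rectangular and can be fed to the LSTM without padding) follows the description above.

```python
import random
from collections import defaultdict

class EpisodeReplayBuffer:
    """Replay buffer that stores whole episodes grouped by episode length."""

    def __init__(self, capacity_per_length=10_000):   # capacity is an assumption
        self.buckets = defaultdict(list)
        self.capacity = capacity_per_length

    def store(self, episode):
        """episode: list of (a, o, r, done) transition tuples for one rollout."""
        bucket = self.buckets[len(episode)]
        if len(bucket) >= self.capacity:
            bucket.pop(0)                  # drop the oldest episode of this length
        bucket.append(episode)

    def sample(self, batch_size):
        """Sample a mini-batch of episodes that all share the same length T."""
        lengths = [T for T, b in self.buckets.items() if len(b) >= batch_size]
        if not lengths:
            return None
        T = random.choice(lengths)         # pick one episode length, then sample from it
        return random.sample(self.buckets[T], batch_size)
```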
Overall Framework and Training Process

The observation $o_t$ used as the input of the agent is summarized in Table 1, and the output of the agent $a_t$ is summarized in Table 2. For the training agent, we tested two architectures: a simple TD3 algorithm for the memory-less approach, and TD3 with LSTM for the memory-based approach. For the architecture of the TD3 agent without memory, we used the same architecture as the upper networks of the policy and critic networks shown in Figs. 8 and 9.

Figure 8: Architecture of the policy network.

Figure 9: Architecture of the critic network.
Algorithm 3: TD3 with LSTM

Initialize critic networks $Q_{\theta_1}$, $Q_{\theta_2}$ and actor network $\mu_{\phi}$ with random parameters $\theta_1$, $\theta_2$, $\phi$
Initialize target networks $\theta'_1 \leftarrow \theta_1$, $\theta'_2 \leftarrow \theta_2$, $\phi' \leftarrow \phi$
Initialize the list of replay buffers $B[\,]$
for episode = 1, M do
  initialize an empty history $h_0$
  $a_0 \leftarrow$ RANDOM($\mathcal{A}$), $t \leftarrow 0$
  while not done do
    $t \leftarrow t + 1$
    receive observation $o_t$ and reward $r_{t-1}$
    $h_t \leftarrow (h_{t-1}, a_{t-1}, o_t)$   (append the observation and previous action to the history)
    select action $a_t = \mu(h_t|\phi) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma)$
    if done then $d_t \leftarrow 1$, else $d_t \leftarrow 0$
  end while
  Store the sequence $(a_0, o_1, r_1, d_1, \dots, a_{t-1}, o_t, r_t, d_t)$ in $B[t]$
  $T \leftarrow$ RANDOM($t_{max}$)   (select the sequence length to sample from)
  Sample a mini-batch of $N$ episodes of length $T$: $(a_0^i, o_1^i, r_1^i, d_1^i, \dots, a_{T-1}^i, o_T^i, r_T^i, d_T^i)$, $i = 1, \dots, N$ from $B[T]$
  Construct the $T \times N$ set of histories $h_t^i = (a_0^i, o_1^i, r_1^i, d_1^i, \dots, a_{t-1}^i, o_t^i, r_t^i, d_t^i)$
  Compute target values for each sampled episode, for $t = 0, \dots, T-1$:
    $y_t^i = r_t^i + (1 - d_{t+1}^i)\,\gamma \min_{k=1,2} Q'(h_{t+1}^i, \mu'(h_{t+1}^i|\phi') + \epsilon\,|\,\theta'_k)$, $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$
  Update the critics by minimizing the loss: $\theta_k \leftarrow \mathrm{argmin}_{\theta_k} \frac{1}{NT}\sum_i\sum_t (y_t^i - Q(h_t^i, a_t^i))^2$
  Update the actor parameter $\phi$ by the deterministic policy gradient:
    $\nabla_{\phi} J(\phi) = \frac{1}{NT}\sum_i\sum_t \nabla_a Q(h, a|\theta)|_{h=h_t^i, a=\mu(h_t^i)}\, \nabla_{\phi}\mu(h|\phi)|_{h=h_t^i}$
end for

In order to stabilize the training, all action parameters were scaled to [0, 1] during training. In the actor, the target point at time step $t$ ($= x_f^{(t)}, y_f^{(t)}$) is not output directly. Instead, two variables $\alpha_x$, $\alpha_y$ are output by the actor, and the target is updated as

$x_f^{(t)} = x_f^{(t-1)} + \alpha_x w_x, \quad y_f^{(t)} = y_f^{(t-1)} + \alpha_y w_y$  (21)

In this way, the next target landing point is always selected to be within the FOV of the latest observed DEM. This prevents control failures caused by large deviations of the target between observations. The range of $\alpha_x$ and $\alpha_y$ was restricted to a symmetric interval around zero. The newly calculated target point becomes the center of the FOV of the DEM at the next step; therefore, $\alpha_x$ and $\alpha_y$ are important outputs that affect both guidance and observation. In addition, the actor does not directly output the $t_{go}$ of the next step, but instead outputs the decrement of the time-to-go until the next observation timing, which occurs 5 seconds later ($= \delta t_{go}$). The initial $t_{go}$ is calculated from the initial lander position, the initial target landing site, and Eq. 6.

Table 1: Elements of the observed state $o_t$
content | symbol
(true) lander position | $r_t$
(true) lander velocity | $v_t$
(true) lander mass | $m_t$
FOV widths of the latest DEM | $w_x^{(t)}$, $w_y^{(t)}$
current target landing point | $x_f^{(t)}$, $y_f^{(t)}$, $z_f^{(t)}$
latent vector of the observed safety map | $z_t$

Table 2: Elements of the action $a_t$
content | symbol
gains of the ZEM-ZEV control | $K_R$, $K_V$
decrement of the time-to-go | $\delta t_{go}$
target update variables | $\alpha_x$, $\alpha_y$

Reward setting is the key in reinforcement learning. The reward at time step $t$ was implemented as follows:

$r_t = \alpha_m \dfrac{m_{t-1} - m_t}{m_0 - m_{dry}} + \alpha_f\big(U(z_{max} - z_t) + U(m_{dry} - m_t)\big) + \alpha_v d_t\,|v_t| + \alpha_r d_t\,|r_t - r_f^{(t-1)}| + \alpha_s d_t V_D(r_t)$  (22)

where

$U(x) = 1$ if $x \ge 0$, $0$ otherwise;
$d_t = 1$ if the episode is done, $0$ otherwise;
$V_D(r_t) = 1$ if $r_t$ is a safe landing point, $-1$ otherwise;
$z_{max} = 50$ (maximum height of the terrain)  (23)

The term with $\alpha_m$ is a penalty for fuel consumption. The term with $\alpha_f$ is a penalty term for breaking the two critical path constraints: hitting the ground and running out of fuel.
The term with $\alpha_v$ is a penalty on the velocity norm at the final step of the episode to ensure a soft landing, and the term with $\alpha_r$ is a penalty for not landing at the final target landing point. These two terms are required because ZEM-ZEV control with adaptive gains and $t_{go}$ is not guaranteed to achieve a pinpoint soft landing at the target. Setting the weights $\alpha_m$, $\alpha_v$, $\alpha_r$, $\alpha_s$ is a difficult problem that depends not only on the mission designer's priorities but also on the stability of the training process. It is natural to assume that $\alpha_f$, $\alpha_v$, and $\alpha_s$ should take relatively large magnitudes, because hitting the ground at a high rate of speed and landing in a danger zone can both lead to the immediate loss of the lander. In our research we used $\alpha_m = 1$ and $\alpha_s = 1$, a negative $\alpha_f$, and small values of $\alpha_v$ and $\alpha_r$. Note that slope angle constraints are not included as a penalty in the reward settings. This is because slope angle constraints are not directly related to the primary objective of planetary landing, but are rather conservative constraints to ensure favorable geometry for observation. By interacting with the simulator, which simulates both the observation process and the dynamics, we expect the agent to learn to fly a trajectory with an adequate slope angle in order to achieve a safe landing. The overall framework is summarized in Fig. 10.
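For reference, a minimal sketch of the step reward of Eqs. (22)-(23) is shown below. The extracted equation leaves the signs of the individual terms ambiguous, so here all penalty terms are written explicitly as subtractions with positive weight magnitudes, which matches the prose describing them as penalties; the numeric weight values are placeholders, not the paper's settings.

```python
import numpy as np

def step_reward(m_prev, m, m0, m_dry, z, z_max, v, r, r_f_prev, done, safe,
                a_m=1.0, a_f=100.0, a_v=0.1, a_r=0.1, a_s=1.0):
    """Step reward sketch following the structure of Eqs. (22)-(23)."""
    u = lambda x: 1.0 if x >= 0.0 else 0.0            # unit step U(x)
    d = 1.0 if done else 0.0                          # episode-termination flag d_t
    v_d = 1.0 if safe else -1.0                       # terminal landing-site safety V_D

    reward = -a_m * (m_prev - m) / (m0 - m_dry)       # fuel-consumption penalty
    reward -= a_f * (u(z_max - z) + u(m_dry - m))     # ground-hit / fuel-depletion penalty
    reward -= a_v * d * np.linalg.norm(v)             # terminal velocity penalty
    reward -= a_r * d * np.linalg.norm(r - r_f_prev)  # terminal miss-distance penalty
    reward += a_s * d * v_d                           # terminal landing-site safety term
    return reward
```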
SIMULATION

Initial Condition and Hyper-parameter Settings

The initial state and other specifications of the lander are summarized in Table 3. Parameters given as a range are randomly initialized within that range every time an episode starts. The true DTM is picked randomly from the list of DTMs generated before training.
Training Process
Fig. 11 illustrates the evolution of the training loss of the critic and actor networks, the reward, and the ratio of landings on a safe landing site during training, for both the agent without LSTM and the agent with LSTM. Values in the graph represent the ...

Figure 10: Overall framework.
Table 3: Simulation conditions
content | symbol | value
Initial altitude (downrange) [m] | $z_0$ | [900, 1000]
Initial crossrange distance [m] | $\sqrt{x_0^2 + y_0^2}$ | [200, 250]
Initial downrange velocity [m/s] | $|v_z|$ | [20, 35]
Initial crossrange velocity [m/s] | $\sqrt{v_x^2 + v_y^2}$ | [5, 10]
Initial mass of the lander [kg] | $m_0$ |
Final target altitude [m] | $z_f$ |
Dry mass [kg] | $m_{dry}$ |
Specific impulse [s] | $I_{sp}$ |
Maximum thrust [N] | $T_{max}$ |
Thrust error ratio | $|T - T_{plan}|/T_{plan}$ |
Number of DTMs (training) | $N_{D-training}$ |
Number of DTMs (variation) | $N_{D-variation}$ |

Figure 11: Training log. It should be noted that the x-axis of the upper two figures represents the number of training updates, while the x-axis of the bottom two figures represents the number of training episodes.

... to achieve higher performance than TD3 without memory. Another reason why the agent without LSTM could achieve high performance is likely that the partial observability of our problem is not strong, thanks to access to the stochastic safety map with relatively low errors. Adding larger errors to the observation data, such as navigation errors, might require the use of memory to cope with stronger partial observability.

Comparison with other strategies
In order to assess the performance of the trained agents, the two agents were compared with two other strategies. The first strategy is the "Fixed Control Policy". This policy ignores the $K_R$, $K_V$, $t_{go}$ output from the agent, and uses the agent only to select the target point. The gains are fixed to $K_R = 6$, $K_V = 2$, and $t_{go}$ is calculated by Eq. 6. This policy was introduced to see how the agent's adjustment of the gains and flight time affects the trajectory, final velocity, and landing error with respect to the target. The second strategy is the "Single Divert Policy". This strategy makes no use of trained agents. It targets the initial target point until the altitude reaches 500 m, and then selects the pixel with the maximum safety value in the stochastic safety map obtained at 500 m altitude as the final target point. The gains are fixed to $K_R = 6$, $K_V = 2$, and $t_{go}$ is calculated by Eq. 6. This policy was introduced to represent a simplified version of existing landing techniques.

The performance of the four policies was tested by landing simulations of 1000 episodes with random initial conditions. In order to assess the performance in cases where the agent needs to change its initial landing site, the initial target position was chosen so that at least 80% of the area within a 100 meter radius is hazardous. Fig. 12 shows the histograms of the reward, the landing error with respect to the final target point [m], the final velocity magnitude [m/s], the maximum thrust [N], and the minimum slant angle [deg]. The mean values are summarized in Table 4.

Table 4: Comparison of performance between methods (1000 episodes, mean values)
Criteria | Memory-Less Agent | Memory-Based | Fixed Control | Single Divert
Safe Landing Ratio [%] | 94.8 | 88.3 | 93.2 | 89.7
Distance to final target [m] | 2.492 | 4.547 | 3.786 | 0.005
Final Velocity [m/s] | 0.637 | 1.520 | 1.189 | 0.557
Propellant Consumption [%] | 69.88 | 63.70 | 67.84 | 68.89
Minimum Slant Angle [deg] | 72.429 | 72.570 | 72.795 | 70.85
Maximum Thrust [N] | 5818 | 6863 | 4553 | 4556

The trained memory-less agent achieved the highest ratio (94.8%) of landing on a safe landing site. The fixed control policy had a lower safe landing ratio than the agent policy, due to its limited ability to change the thrust magnitude flexibly when a long divert maneuver is required. Fig. 13 shows an example of such a situation. In this case, the initial target point (the center of the red square) is inside the hazardous region (black area of the figure), and the target landing point has to be shifted toward the upper side of the figure in order to achieve a safe landing. While the trained agent reduces the $z$-axis velocity immediately, as shown in Fig. 13-(b), in order to maintain altitude and a wide field of view for searching safe landing sites, the fixed controller does not thrust as aggressively, as shown in Fig. 13-(d). Therefore, the field of view of the fixed-gain controller remains limited, and the fixed controller fails to find a wide safe area to land.

The trained agent with memory had the lowest probability of landing on a safe landing site, and the largest landing position error and velocity error among the four policies. On the other hand, its average fuel consumption was the lowest among the four policies. This implies that the agent was trapped in a locally optimal policy that increases reward by reducing fuel consumption rather than improving the other metrics.

The single divert policy fails to select a safe landing site when one cannot be found in its single observation due to limitations in the field of view or slant angles.
The rate of failure might be reduced by selecting the optimal altitude for the divert decision or by implementing a better landing site selection algorithm on the safety map. However, it should be noted that in reality the risk of the single divert policy is likely to be higher than in our simulator, due to additional error sources (navigation errors, attitude constraints, etc.) that are not simulated in our research. These errors degrade the quality of the DEM (or safety map) and the accuracy of lander guidance to the target point.

Figure 12: Evaluation of performance. 'Agent' is the policy of the trained memory-less agent, 'LSTM' is the policy of the trained memory-based agent, 'Fixed' is the policy with fixed ZEM-ZEV parameters, and 'Single' is the policy that conducts a single divert maneuver and selects the landing site by directly using the stochastic safety map.

While the trained agents showed high potential in adaptively changing their target position and control outputs to find safe landing sites, they also have limitations in reducing the velocity and the landing error with respect to the final target point decided by the agent. Since convergence to the target position and velocity is not guaranteed when the control gains of ZEM-ZEV control are changed, this disadvantage is inevitable without further refinements. In addition, the fuel consumption of the memory-less trained agent was larger than that of the two comparison policies, due to the large thrust magnitude during descent and frequent changes of the target point. Addressing the trade-off between safety and fuel consumption from the standpoint of how frequently the target point is changed is an interesting direction for future work.

CONCLUSION
In this paper, we proposed a new learning-based framework for the Hazard Detection and Avoidance (HDA) phase which successively updates the target landing site and the control parameters simultaneously after each observation, in order to cope with the coupling between observation, guidance, and control. We modeled the HDA sequence as a POMDP, and developed a reinforcement learning agent that interprets the obtained map using an auto-encoder and outputs control parameters for a ZEM-ZEV feedback control law in order to find an optimal policy. The agent was trained by interacting with the simulator, and the trained agent was able to achieve over 90% probability of successful landing at difficult landing sites where over 80% of the terrain around the initial target point was hazardous, by gradually updating the target landing point toward safe regions.

In order to incorporate a more realistic and higher level of uncertainty into the environment, our future work will mainly focus on the refinement of the simulator model. This includes the incorporation of accurate LIDAR DEM generation models, the expansion of the dynamics to 6-DOF, and the incorporation of lander state measurement errors. As the partial observability of the environment increases by incorporating various error sources, a memory-based agent might be required for sufficient performance. We are also planning to test other baseline control policies and observation data interpretation architectures to improve the agent's performance and stability.

Figure 13: Comparison of the obtained agent and a ZEM-ZEV controller with fixed gain. (a) Trajectory by the agent controller (white: safe, black: unsafe); (b) control history by the agent controller (top: velocity, bottom: thrust); (c) trajectory by the fixed controller (white: safe, black: unsafe); (d) control history by the fixed controller (top: velocity, bottom: thrust). The red line in figures (a) and (c) shows the lander trajectory, while the square shows the field of view (FOV) of the LIDAR DEMs.

Figure 14: Changes in the control gains $K_R$, $K_V$ during the control shown in Fig. 13-(a),(b).

ACKNOWLEDGEMENTS
This material is partially based upon work supported by the National Aeronautics and Space Administration under Grant No. 80NSSC20K0064 through the NASA Early Career Faculty Program.
REFERENCES

[1] A. E. Johnson, A. R. Klumpp, J. B. Collier, and A. A. Wolf, "Lidar-based hazard avoidance for safe landing on Mars," Journal of Guidance, Control, and Dynamics, Vol. 25, No. 6, 2002, pp. 1091-1099.
[2] N. Serrano and H. Seraji, "Landing site selection using fuzzy rule-based reasoning," Proceedings 2007 IEEE International Conference on Robotics and Automation, 2007, pp. 4899-4904.
[3] A. Huertas, Y. Cheng, and L. H. Matthies, "Real-time hazard detection for landers," NASA Science Technology Conference, 2007.
[4] L. Matthies, A. Huertas, Y. Cheng, and A. Johnson, "Stereo vision and shadow analysis for landing hazard detection," 2008, pp. 2735-2742.
[5] Y. Cheng, A. E. Johnson, L. H. Mattheis, and A. A. Wolf, "Passive imaging based hazard avoidance for spacecraft safe landing," i-SAIRAS 2001: 6th International Symposium on Artificial Intelligence, Robotics and Automation in Space, 2001, pp. 1-14.
[6] N. Serrano, "A Bayesian framework for landing site selection during autonomous spacecraft descent," 2006, pp. 5112-5117.
[7] C. I. Restrepo, R. Lovelace, R. R. Sostaric, J. M. Carson, and N. G. Spaceflight, "NASA SPLICE Project: Developing the Next Generation Hazard Detection System," 2019, pp. 2-3.
[8] A. D. Cianciolo, S. Striepe, J. Carson, R. Sostaric, D. Woffinden, C. Karlgaard, R. Lugo, R. Powell, and J. Tynis, "Defining navigation requirements for future precision lander missions," AIAA Scitech 2019 Forum, 2019, pp. 1-18, 10.2514/6.2019-0661.
[9] J. M. Carson, M. M. Munk, R. R. Sostaric, J. N. Estes, F. Amzajerdian, J. Bryan Blair, D. K. Rutishauser, C. I. Restrepo, A. D. Cianciolo, G. T. Chen, and T. Tse, "The SPLICE project: Continuing NASA development of GN&C technologies for safe and precise landing," AIAA Scitech 2019 Forum, 2019, pp. 1-9, 10.2514/6.2019-0660.
[10] T. Brady, E. Robertson, S. P. C. App, and D. Zimpfer, "Hazard detection methods for lunar landing," 2009, pp. 1-18.
[11] C. D. Epp and T. B. Smith, "Autonomous precision landing and hazard detection and avoidance technology (ALHAT)," 2007, pp. 1-7.
[12] T. Ivanov, A. Huertas, and J. M. Carson, "Probabilistic Hazard Detection for Autonomous Safe Landing," AIAA Guidance, Navigation, and Control (GNC) Conference, 2013, p. 5019.
[13] T. Brady and J. Schwartz, "ALHAT system architecture and operational concept," 2007, pp. 1-13.
[14] S. A. Striepe, C. D. Epp, and E. A. Robertson, "Autonomous precision landing and hazard avoidance technology (ALHAT) project status as of May 2010," International Planetary Probe Workshop 2010 (IPPW-7), 2010.
[15] D. Rutishauser, C. Epp, and E. Robertson, "Free-Flight Terrestrial Rocket Lander Demonstration for NASA's Autonomous Landing and Hazard Avoidance Technology (ALHAT) System," AIAA SPACE 2012 Conference & Exposition, 2012, p. 5239.
[16] C. Epp, E. Robertson, and J. M. Carson, "Real-time hazard detection and avoidance demonstration for a planetary lander," AIAA SPACE 2014 Conference and Exposition, 2014, p. 4312.
[17] A. E. Johnson, A. Huertas, R. A. Werner, and J. F. Montgomery, "Analysis of on-board hazard detection and avoidance for safe lunar landing," 2008, pp. 1-9.
[18] A. Huertas, A. E. Johnson, R. A. Werner, and R. A. Maddock, "Performance evaluation of hazard detection and avoidance algorithms for safe Lunar landings," 2010, pp. 1-20.
[19] B. E. Cohanim and B. K. Collins, "Landing point designation algorithm for lunar landing," Journal of Spacecraft and Rockets, Vol. 46, No. 4, 2009, pp. 858-864.
[20] S. Paschall and T. Brady, "Demonstration of a safe & precise planetary landing system on-board a terrestrial rocket," 2012, pp. 1-8.
[21] N. Trawny, A. Huertas, M. E. Luna, C. Y. Villalpando, K. Martin, J. M. Carson, A. E. Johnson, C. Restrepo, and V. E. Roback, "Flight testing a Real-Time Hazard Detection System for Safe Lunar Landing on the Rocket-Powered Morpheus Vehicle," AIAA Guidance, Navigation, and Control (GNC) Conference, 2015.
[22] N. Serrano, "A Bayesian framework for landing site selection during autonomous spacecraft descent," IEEE International Conference on Intelligent Robots and Systems, 2006, pp. 5112-5117, 10.1109/IROS.2006.282603.
[23] S. R. Ploen, H. Seraji, and C. E. Kinney, "Determination of spacecraft landing footprint for safe planetary landing," IEEE Transactions on Aerospace and Electronic Systems, Vol. 45, No. 1, 2009, pp. 3-16, 10.1109/TAES.2009.4805259.
[24] P. Cui, D. Ge, and A. Gao, "Optimal landing site selection based on safety index during planetary descent," Acta Astronautica, Vol. 132, 2017, pp. 326-336, 10.1016/j.actaastro.2016.10.040.
[25] E. S. Crane and S. M. Rock, "Guidance augmentation for reducing uncertainty in vision-based hazard mapping during lunar landing," IEEE Aerospace Conference Proceedings, 2013, 10.1109/AERO.2013.6496946.
[26] M. Barker, E. Mazarico, G. Neumann, M. Zuber, J. Haruyama, and D. Smith, "A new lunar digital elevation model from the Lunar Orbiter Laser Altimeter and SELENE Terrain Camera," Icarus, Vol. 273, 2016, pp. 346-355, 10.1016/j.icarus.2015.07.039.
[27] R. J. Pike, "Size-dependence in the shape of fresh impact craters on the moon," Impact and Explosion Cratering: Planetary and Terrestrial Implications (D. J. Roddy, R. O. Pepin, and R. B. Merrill, eds.), Jan. 1977, pp. 489-509.
[28] D. E. Bernard and M. P. Golombek, "Crater and rock hazard modeling for Mars landing," AIAA Space 2001 Conference and Exposition, Albuquerque, NM, 2001, 10.2514/6.2001-4697.
[29] C. I. Restrepo, P.-T. Chen, R. R. Sostaric, and J. M. Carson, "Next-Generation NASA Hazard Detection System Development," 2020, pp. 1-10, 10.2514/6.2020-0368.
[30] B. Gaudet, R. Linares, and R. Furfaro, "Deep Reinforcement Learning for Six Degree-of-Freedom Planetary Landing," Advances in Space Research, 2020, 10.1016/j.asr.2019.12.030.
[31] B. Ebrahimi, M. Bahrami, and J. Roshanian, "Optimal sliding-mode guidance with terminal velocity constraint for fixed-interval propulsive maneuvers," Acta Astronautica, Vol. 62, No. 10, 2008, pp. 556-562, 10.1016/j.actaastro.2008.02.002.
[32] R. Furfaro, A. Scorsoglio, R. Linares, and M. Massari, "Adaptive Generalized ZEM-ZEV Feedback Guidance for Planetary Landing via a Deep Reinforcement Learning Approach," 2020, 10.1016/j.actaastro.2020.02.051.
[33] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," ICLR (Y. Bengio and Y. LeCun, eds.), 2016.
[34] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic Policy Gradient Algorithms," Proceedings of the 31st International Conference on Machine Learning (ICML'14), JMLR.org, 2014, pp. I-387-I-395.
[35] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing Function Approximation Error in Actor-Critic Methods," International Conference on Machine Learning, 2018, pp. 1582-1591.
[36] N. M. O. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver, "Memory-based control with recurrent neural networks,"