Real-Time Optimal Guidance and Control for Interplanetary Transfers Using Deep Networks
Dario Izzo,∗ Ekin Öztürk†

Abstract
We consider the Earth-Venus mass-optimal interplanetary transfer of a low-thrust spacecraft and show how the optimal guidance can be represented by deep networks in a large portion of the state space and to a high degree of accuracy. Imitation (supervised) learning of optimal examples is used as a network training paradigm. The resulting models are suitable for an on-board, real-time implementation of the optimal guidance and control system of the spacecraft and are called G&CNETs. A new general methodology called "Backward Generation of Optimal Examples" is introduced and shown to be able to efficiently create all the optimal state-action pairs necessary to train G&CNETs without solving optimal control problems. With respect to previous works, we are able to produce datasets containing a few orders of magnitude more optimal trajectories and obtain network performances compatible with real mission requirements. Several schemes able to train representations of either the optimal policy (thrust profile) or the value function (optimal mass) are proposed and tested. We find that both policy learning and value function learning successfully and accurately learn the optimal thrust, and that a spacecraft employing the learned thrust is able to reach the target orbit spending only 2‰ more propellant than in the corresponding mathematically optimal transfer. Moreover, the optimal propellant mass can be predicted (in the case of value function learning) with an error well within 1%. All G&CNETs produced are tested during simulations of interplanetary transfers with respect to their ability to reach the target conditions optimally, starting from nominal and off-nominal conditions.

∗ Scientific coordinator, Advanced Concepts Team
† Young Graduate Trainee, Advanced Concepts Team
Nomenclature

x = modified equinoctial elements
r, v = spacecraft position and velocity
a = semimajor axis, AU
e = eccentricity
i = inclination, rad
ω = argument of perigee, rad
Ω = right ascension of the ascending node, rad
ν = true anomaly, rad
p = semilatus rectum
f, g, h, k = equinoctial parameters
L = true longitude
m = spacecraft mass, kg
t = time, s
v = value function
λ = equinoctial costate vector
u = control variables
f_r = radial component of thrust force, N
f_t = tangential component of thrust force, N
f_n = normal component of thrust force, N
T = time scale, s
A = acceleration scale, m/s²
g₀ = acceleration at ground level, m/s²
T = set of all optimal trajectories for a given dynamics and cost function
D = database of optimal (augmented) state-action pairs
T_max = maximum thrust, N
I_sp = specific impulse, s
t_f = time of flight
J = cost function
H = Hamiltonian
λ = Hamiltonian scale
ε = continuation parameter
δ = perturbation scale
c₁ = maximum thrust, N
c₂ = ratio between c₁ and I_sp g₀, kg/s
c₃ = Sundman transform scale
n = Sundman transform order
N = an artificial neural network
σ = activation function of a neural network
l = loss function of a neural network
rEd = reduced Euclidean distance (distance between the first 5 equinoctial elements)

Subscripts and Superscripts
(·)_N = neural network prediction
(·)* = optimal value

1 Introduction

The use of deep networks in decision-making systems has produced increasingly interesting results over the past decade and in diverse applications ranging from computer gaming to robotics, to micro and unmanned air vehicles and spacecraft [1]. The mathematical theories powering most of these new results are the consolidated ones of reinforcement learning, dynamic programming and optimal control, coupled to emerging results in training deep networks.
It is worth noting that environments such as those encountered in video games, Earth robotics and air vehicles are characterized by high noise levels and unpredictability. In all these cases, decision making is greatly affected by environmental stochasticity, making the use of optimal control theory for deterministic systems less appealing. Spacecraft, on the other hand, operate in a rather different environment, comparatively free of major disturbances. As a consequence, deterministic optimal control methods (e.g. indirect methods based on Pontryagin's principle [2] or direct methods based on collocation [3]) are widely used to design guidance profiles for low-thrust interplanetary transfers, spacecraft landing, docking problems, etc. The works of Sanchez and Izzo [4, 5] introduced the idea of using imitation learning (also known as behavioural cloning, and essentially based on the classical supervised learning scheme) to teach a deep artificial neural network to produce on board, and in real time, the optimal guidance, and tested it on several spacecraft landing scenarios. The results, which triggered a number of other studies [6, 7, 8, 9, 10, 11], suggest that future space systems might use an artificial neural network in place of their on-board guidance and control systems, and hence these networks are called G&CNETs. An early study on the stability of a G&CNET-controlled system [10] shows how it is also possible to provide control guarantees for the resulting neurocontrolled system, a fact of great relevance for such a mission-critical component. The extension of these results to interplanetary low-thrust trajectories seems ripe. In particular, it is of interest to prove that G&CNETs can be developed to represent the highly discontinuous optimal controls arising in mass-optimal interplanetary transfers, and in a large portion of the interplanetary medium.
While not directly using the term G&CNET, a first study on deep networks for the real-time optimal control of interplanetary transfers appeared recently [7], but only considering two-dimensional dynamics and a simple solar sailing transfer with continuous controls. In following works from Li et al. [8, 9], neural networks are also trained to approximate the co-states, the optimal thrust and the value function of optimal interplanetary transfers, but only succeeding for time-optimal cases (resulting in continuous thrust profiles) and in close neighbourhoods of nominal transfers (e.g. small perturbations of the order of 0.1 m/s on the initial velocity and 100 m on the initial position were considered [9]). The time-consuming creation of optimal examples likely prevented these works from producing large enough datasets and thus from training networks able to go well beyond the simple cases considered there and to approximate the optimal guidance in a larger portion of interplanetary space with acceptable accuracies. In a previous, yet preliminary, work [12] we hinted at how to take advantage of Pontryagin's principle to create massive datasets of optimal trajectories, avoiding the time-consuming solution procedures of optimal control problems. In this work, we refine those results, studying in depth the methodologies there only sketched. Our final aim is to prove the possibility of designing G&CNETs able to produce complex mass-optimal guidance profiles in large portions of interplanetary space (i.e. also far away from a nominal transfer). We introduce a new generic methodology (the "backward generation of optimal examples"), based on Pontryagin's principle and able to create optimal training samples by numerically integrating a system of equations. We apply it to an Earth-Venus transfer, assembling seven large databases of optimal low-thrust transfers which we release publicly (see [13] and similar).
Overall, in the context of this work, we are able to compute and release 4,000,000 mass-optimal transfers containing multiple bang-off phases. For comparison, the datasets used in [8] contain roughly 12,000 trajectories, while the datasets used in [7, 9] contain 1,000 trajectories (and all considering smooth thrust profiles). After the dataset creation, four different training methodologies are studied: one based on policy learning (i.e. the original G&CNET training method developed in [5, 11]) and three new training procedures based on value function learning (i.e. learning the final optimal propellant mass and inferring the optimal thrust profile from it). Our paper is structured as follows: in Section 2 we provide the necessary mathematical definitions of the low-thrust interplanetary dynamics considered and the related optimal control problem, and we present a short discussion on the consequences of Pontryagin's and Bellman's optimality principles for our case. In Section 3 we describe the "backward generation of optimal examples", a new methodology (based on Pontryagin's principle) to generate large databases of optimal state-action pairs without solving optimal control problems. Then, in Section 4, the artificial neural network architectures and various loss functions are discussed. In the following Section 5 we describe the details of the seven different databases of optimal interplanetary transfers to Venus that we use in this paper and that we created and made publicly available (see [13]). Details on the G&CNET training are then given in Section 6, while the evaluation of their performance is discussed at length in Section 7.
We consider the motion of a spacecraft of mass m with position r and velocity v subject only to the Sun's gravitational attraction in the heliocentric International Celestial Reference Frame (ICRF). The spacecraft also has an ion thruster with a specific impulse I_sp and a maximum thrust c₁ independent of the solar distance. We describe the spacecraft state via its mass m and the modified equinoctial elements x = [p, f, g, h, k, L]ᵀ as originally defined by Walker et al. [14]. The motion of the spacecraft is described by the following set of differential equations:

$$
\begin{aligned}
\dot p &= \frac{1}{m}\sqrt{\frac{p}{\mu}}\,\frac{2p}{w}\,f_t\\
\dot f &= \frac{1}{m}\sqrt{\frac{p}{\mu}}\left\{ f_r\sin L + \left[(1+w)\cos L + f\right]\frac{f_t}{w} - (h\sin L - k\cos L)\frac{g\,f_n}{w}\right\}\\
\dot g &= \frac{1}{m}\sqrt{\frac{p}{\mu}}\left\{ -f_r\cos L + \left[(1+w)\sin L + g\right]\frac{f_t}{w} + (h\sin L - k\cos L)\frac{f\,f_n}{w}\right\}\\
\dot h &= \frac{1}{m}\sqrt{\frac{p}{\mu}}\,\frac{s^2}{2w}\,f_n\cos L\\
\dot k &= \frac{1}{m}\sqrt{\frac{p}{\mu}}\,\frac{s^2}{2w}\,f_n\sin L\\
\dot L &= \sqrt{\frac{p}{\mu}}\left\{ \mu\left(\frac{w}{p}\right)^2 + \frac{1}{w}(h\sin L - k\cos L)\frac{f_n}{m}\right\}\\
\dot m &= -\frac{\sqrt{f_r^2 + f_t^2 + f_n^2}}{I_{sp}\, g_0}
\end{aligned}
\tag{1}
$$

where w = 1 + f cos L + g sin L, s² = 1 + h² + k², and f_r, f_t, f_n are the radial, tangential and normal components of the force generated by the ion thruster. The gravity parameter is denoted by µ and the gravitational acceleration at sea level by g₀.

It is useful to rewrite these equations using matrix notation; thus we introduce the matrices B and D defined as:

$$
B(x) = \sqrt{\frac{p}{\mu}}
\begin{bmatrix}
0 & \frac{2p}{w} & 0\\
\sin L & \frac{(1+w)\cos L + f}{w} & -\frac{g}{w}(h\sin L - k\cos L)\\
-\cos L & \frac{(1+w)\sin L + g}{w} & \frac{f}{w}(h\sin L - k\cos L)\\
0 & 0 & \frac{s^2}{2w}\cos L\\
0 & 0 & \frac{s^2}{2w}\sin L\\
0 & 0 & \frac{1}{w}(h\sin L - k\cos L)
\end{bmatrix}
\tag{2}
$$

and

$$
D(x) = \left[0,\ 0,\ 0,\ 0,\ 0,\ \sqrt{\mu p}\left(\frac{w}{p}\right)^2\right]^T
\tag{3}
$$

with x = [p, f, g, h, k, L]ᵀ. Thus the equations of motion become:

$$
\dot x = \frac{c_1 u(t)}{m}\,B(x)\,\hat i_\tau + D(x),\qquad \dot m = -c_2\, u(t)
\tag{4}
$$

where the spacecraft thrust is now indicated by c₁ u î_τ = [f_r, f_t, f_n]ᵀ and is bounded by the relations |u(t)| ≤ 1 and |î_τ(t)| = 1. The control u is called the throttle and, unlike the thrust, is nondimensional.
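As an illustration, the dynamics of Eqs. (2)-(4) can be sketched in a few lines of Python in the paper's nondimensional units (µ = 1). All function and variable names below are our own choices, not taken from any released code:

```python
import numpy as np

# Illustrative sketch of the equinoctial dynamics of Eqs. (2)-(4); mu = 1
# in the paper's nondimensional units. Names are our own choices.
MU = 1.0

def B_matrix(x):
    """Thrust-related matrix B(x) of Eq. (2)."""
    p, f, g, h, k, L = x
    w = 1.0 + f * np.cos(L) + g * np.sin(L)
    s2 = 1.0 + h ** 2 + k ** 2
    q = h * np.sin(L) - k * np.cos(L)
    return np.sqrt(p / MU) * np.array([
        [0.0, 2.0 * p / w, 0.0],
        [np.sin(L), ((1.0 + w) * np.cos(L) + f) / w, -g * q / w],
        [-np.cos(L), ((1.0 + w) * np.sin(L) + g) / w, f * q / w],
        [0.0, 0.0, s2 * np.cos(L) / (2.0 * w)],
        [0.0, 0.0, s2 * np.sin(L) / (2.0 * w)],
        [0.0, 0.0, q / w],
    ])

def D_vector(x):
    """Drift term D(x) of Eq. (3): only the true longitude L drifts."""
    p, f, g, h, k, L = x
    w = 1.0 + f * np.cos(L) + g * np.sin(L)
    return np.array([0.0, 0.0, 0.0, 0.0, 0.0, np.sqrt(MU * p) * (w / p) ** 2])

def eom(x, m, u, i_tau, c1, c2):
    """Right-hand side of Eq. (4): returns (x_dot, m_dot)."""
    x_dot = c1 * u / m * (B_matrix(x) @ np.asarray(i_tau)) + D_vector(x)
    m_dot = -c2 * u
    return x_dot, m_dot
```

As a sanity check, on a circular orbit (f = g = h = k = 0, p = 1) with the engine off only the true longitude evolves, at the rate √(µp)(w/p)² = 1.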
The dynamics is controlled, at each instant, by the throttle magnitude u(t) ∈ [0, 1] and the thrust direction î_τ. We refer to these control variables also with the single symbol u(t) = [u(t), î_τ(t)] and we use the notation u ∈ U to indicate that the control u belongs to the space U of admissible controls. Recall that the constant c₁ is the maximum thrust and note that the constant c₂ = c₁/(I_sp g₀) was introduced for convenience.

We consider here a free-time orbital transfer problem, i.e. finding the controls u(t) and î_τ(t), defined in [0, t_f], and the transfer time t_f so that the functional:

$$
J(u(t), t_f) = \int_0^{t_f} \left\{ u - \epsilon \log\left[u(1-u)\right] \right\}\,\mathrm{d}t
\tag{5}
$$

is minimized, and the spacecraft is steered from its initial mass m₀ and some initial point x₀ to some final mass m_f and some final point x_f. The functional J, chosen following the work of Bertrand and Epenoy [15], is parameterised by a continuation parameter ε ∈ [0, 1] which activates a logarithmic barrier, smoothing the problem and ensuring that the constraint u ∈ [0, 1] is always satisfied. Clearly, as the continuation parameter approaches zero, ε → 0, the functional becomes J = (m₀ − m_f)/c₂ (substitute Eq. (4) into Eq. (5)), and the problem considered becomes equivalent to minimising the propellant mass.

Following the work of Pontryagin [2], we infer the necessary conditions for optimality by applying Pontryagin's minimum principle. Since we have stated a minimisation problem, the conditions are slightly different from the ones originally derived in Pontryagin's work.
First we introduce the co-states λ, λ_m as continuous functions defined in [0, t_f] and define the Hamiltonian:

$$
\mathcal H(x, m, \lambda, \lambda_m, u) = \frac{c_1 u}{m}\,\lambda^T B(x)\,\hat i_\tau + \lambda_L \sqrt{\mu p}\left(\frac{w}{p}\right)^2 - c_2 \lambda_m u + \left\{ u - \epsilon \log\left[u(1-u)\right] \right\}
\tag{6}
$$

and the system of equations:

$$
\dot x = \frac{\partial \mathcal H}{\partial \lambda} = \frac{c_1 u(t)}{m} B\,\hat i_\tau(t) + D,\qquad
\dot m = \frac{\partial \mathcal H}{\partial \lambda_m} = -c_2 u(t),\qquad
\dot\lambda = -\frac{\partial \mathcal H}{\partial x},\qquad
\dot\lambda_m = -\frac{\partial \mathcal H}{\partial m}
\tag{7}
$$

The explicit form of the various derivatives appearing in the equations above is reported in Appendix 8. Along an optimal trajectory, the Hamiltonian must be zero (free terminal time problem) and minimal with respect to the choices of u and î_τ. For the optimal thrust direction î*_τ it follows that

$$
\hat i_\tau = \hat i_\tau^*(t) = -\frac{B^T\lambda}{|B^T\lambda|}
\tag{8}
$$

where the time dependence on the right-hand side is present both in the co-states and in B, but has been omitted for brevity. For the optimal throttle u*, necessarily:

$$
u(t) = u^*(t) = \frac{2\epsilon}{2\epsilon + SF(t) + \sqrt{4\epsilon^2 + SF^2(t)}}
\tag{9}
$$

where we introduced the switching function:

$$
SF(t) = 1 - \frac{c_1}{m}\left|B^T\lambda\right| - c_2\lambda_m.
\tag{10}
$$

Substituting Eq. (6), Eq. (8) and Eq. (9) into Eq. (7), one obtains a set of ordinary differential equations in the augmented state (x, m, λ, λ_m) whose solutions represent generic optimal interplanetary transfers (under the considered dynamics and merit function). We indicate the set of all solutions to those equations with T. Typically, one is only interested in searching T for a solution that satisfies the initial conditions on the spacecraft state and some added conditions dictated by Pontryagin's theory (transversality and free-time conditions). Let us consider the case of a transfer from any x₀, m₀ to Venus orbit (not a rendezvous): the final values of the mass and the true longitude L are left free (transversality conditions λ_L|_{t=t_f} = 0, λ_m|_{t=t_f} = 0), while the value of the Hamiltonian is fixed (free-time condition H|_{t=t_f} = 0).
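In code, the control law of Eqs. (8)-(10) reduces to a few lines once the vector B(x)ᵀλ is available; the following is a hedged sketch with illustrative names:

```python
import numpy as np

# Sketch of the optimal control of Eqs. (8)-(10). `BT_lambda` is the vector
# B(x)^T lambda; c1, c2 and eps are as in the text. Names are illustrative.
def optimal_control(BT_lambda, lambda_m, m, c1, c2, eps):
    """Return the optimal throttle u* (Eq. (9)) and direction i_tau* (Eq. (8))."""
    norm = np.linalg.norm(BT_lambda)
    i_tau = -BT_lambda / norm                         # Eq. (8)
    SF = 1.0 - c1 / m * norm - c2 * lambda_m          # Eq. (10), switching function
    u = 2.0 * eps / (2.0 * eps + SF + np.sqrt(SF ** 2 + 4.0 * eps ** 2))  # Eq. (9)
    return u, i_tau
```

As ε → 0 the throttle tends to the bang-bang structure dictated by the sign of the switching function: u* → 1 where SF < 0 and u* → 0 where SF > 0.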
Hence we search T for a solution where the initial values of x and m are known, as well as the final values of p, f, g, h, k, λ_L, λ_m, H. This creates a two-point boundary value problem that can be solved by a shooting method. In other words, for a given initial state x₀ and m₀, we need to find the initial values λ₀ and λ_m0 such that solving the initial value problem (IVP) for Eq. (7) results in a final state at t_f that matches the arrival conditions at the target orbit, the transversality conditions and the free-time condition on the Hamiltonian. Formally, we introduce the shooting function:

$$
\phi(\lambda_0, \lambda_{m0}, t_f) = \left[ p_f - p_V,\ f_f - f_V,\ g_f - g_V,\ h_f - h_V,\ k_f - k_V,\ \lambda_{Lf},\ \lambda_{mf},\ \mathcal H_f \right]
\tag{11}
$$

where we indicate with the subscript V the modified equinoctial elements of Venus orbit and with the subscript f the final values resulting from numerically integrating Eq. (7) for a time t_f from the initial conditions x₀, m₀, λ₀, λ_m0. Solving an instance of the optimal control problem considered is then equivalent to solving the equation φ(λ₀, λ_m0, t_f) = 0.

Let us now apply Bellman's principle of optimality [16] to the optimal control problem we stated in Section 2.2. We indicate with v(x, m) the value function, i.e. the optimal value of the functional defined by Eq. (5) when the initial spacecraft state is x, m. Since the value function is, in the case considered, time-independent, the Hamilton-Jacobi-Bellman (HJB) equations can be written as:

$$
0 = \min_{u\in\mathcal U}\left( u + \nabla_x v \cdot f(x, m) \right)
\tag{12}
$$

$$
u = \arg\min_{u\in\mathcal U}\left( u + \nabla_x v \cdot f(x, m) \right).
\tag{13}
$$

These equations hold for all points where v(x, m) is differentiable. We use f(x, m) to denote the right-hand side of Eq. (4), including the mass equation, and, abusing the notation, we use, only in this context, the symbol λ to indicate all the co-states. Pontryagin's minimum principle can then, in general, be formulated as u = arg min_{u∈U} H = arg min_{u∈U} ( u + λ · f(x, m) ).
Comparing this last expression to Eq. (13), we may conclude that the co-states introduced by Pontryagin in his theory are the gradients of the value function introduced by Bellman in his. This fact, albeit rarely exploited in interplanetary trajectory optimization research, provides a convenient basis for the design of artificial neural network learning procedures, a fact we exploit later in Section 6.

As the main purpose of this work is to study the capability of artificial neural networks to learn the structure of optimal low-thrust interplanetary trajectories, a learning database D := { ((x_i, m_i, λ_i, λ_{m,i}), u*_i), i = 1…N } containing spacecraft (augmented) states and the associated optimal thrust vectors is needed. In this section we describe a method able to efficiently construct such a database. In order to generate D, the straightforward approach would be to solve a large number of optimal control problems, sample each resulting optimal trajectory at multiple time instants, and store the resulting (augmented) state-action pairs. Such an approach has indeed been pursued in the past, also by us [4, 5], but it comes with an intrinsic problem: finding one optimal interplanetary low-thrust trajectory is a computationally intensive task; finding thousands of them can be a barrier limiting the applicability of the resulting method. An alternative method, called "Backward Generation of Optimal Examples", is here presented that is able to create a sufficiently dense database of optimal trajectories by solving only once the Two-Point Boundary Value Problem (TPBVP) resulting from the application of Pontryagin's principle, and then exploiting principles of optimality to generate more optimal trajectories to learn from at the cost of simpler numerical integrations. A good statistical distribution of the sampled data points is then also ensured by making use of the Sundman [17] transform during such integrations.
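The brute-force alternative just mentioned, i.e. driving the shooting function φ of Eq. (11) to zero with a root finder, can be illustrated on a toy problem. Everything below is illustrative: we shoot on the test dynamics ẍ = −x rather than on Eq. (7), seeking the initial velocity v0 (with x(0) = 0) such that x(π/2) = 1, whose exact answer is v0 = 1:

```python
import numpy as np

# Toy illustration of the shooting approach: a root finder adjusts the
# unknown initial value until the terminal condition is met. Names and
# test dynamics (x'' = -x) are purely illustrative.

def propagate(v0, tf, n=400):
    """RK4 integration of x'' = -x from (x, v) = (0, v0); returns x(tf)."""
    y = np.array([0.0, v0])
    h = tf / n
    rhs = lambda y: np.array([y[1], -y[0]])
    for _ in range(n):
        k1 = rhs(y)
        k2 = rhs(y + h / 2 * k1)
        k3 = rhs(y + h / 2 * k2)
        k4 = rhs(y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y[0]

def shoot(target, tf, lo=0.0, hi=5.0, tol=1e-10):
    """Bisection on the scalar shooting function phi(v0) = x(tf; v0) - target."""
    phi = lambda v0: propagate(v0, tf) - target
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(lo) * phi(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

In the paper's setting, φ is the eight-dimensional function of Eq. (11) and the unknowns are λ₀, λ_m0 and t_f, so a multidimensional root finder (e.g. Powell's method) replaces the bisection; the computational burden of repeatedly solving such problems is precisely what the backward generation avoids.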
3.1 Backward generation of optimal examples

In this section we describe our method to efficiently construct a database D of optimal state-action pairs: the "Backward Generation of Optimal Examples". The brute-force approach to populating D requires selecting a number of meaningful initial states and then solving the corresponding optimal control problems (e.g. finding a zero of the shooting function, see Section 2). This approach scales extremely poorly because of the known computational difficulties associated with solving optimal control problems, both using direct and indirect methods. Previous work (e.g. Sanchez and Izzo [4, 5]) deployed a continuation (homotopy) approach to reduce some of the complexity involved by eliminating the need to search for a new initial guess for each optimal control problem instance: by perturbing the initial state of a nominal trajectory, the unperturbed states (and co-states) provide (most of the time) a good initial guess to solve also the newly created optimal control problem. However, assuming that no convergence issues occur, it is still necessary to solve Eq. (11) for the new initial conditions considered (if an indirect method is pursued), or the resulting transcribed nonlinear programming problem (if a direct method is pursued). In both cases a significant computational cost is incurred.

In practice, there exists a more efficient way to obtain a similar result, avoiding convergence issues entirely, guaranteeing optimality and reducing the computational costs significantly. The idea, applicable more generally to any problem formulated in the form introduced in Section 2.3.1, is to perform a backward-in-time numerical propagation of Eq. (7) starting from suitable values of the states and co-states obtained by perturbing the final values known for a nominal trajectory. This perturbation needs to be chosen such that the transversality conditions and the condition on the Hamiltonian are still satisfied.
If this is the case, the trajectory obtained by the backward integration of Eq. (7) will be the solution to the optimal control problem of reaching the target orbit from any state along the computed trajectory. Hence we can insert any point along such a trajectory into our database D.

Formally, consider a nominal optimal trajectory and indicate with x*_f, m*_f, λ*_f, λ*_{mf} the final values (i.e. at t*_f) of the states and the co-states. Consider then a new set of co-state values:

$$
\lambda_f^{new} = \lambda_f^* + \delta\lambda,\qquad \lambda_{mf}^{new} = \lambda_{mf}^* + \delta\lambda_{mf}
\tag{14}
$$

where the perturbation δλ is chosen in some ball B_ρ of size ρ. Since we want the transversality conditions on the final (free) true longitude and on the final (free) mass to be satisfied also for the new co-states, we set δλ_L = 0, δλ_{mf} = 0. The remaining values of δλ are randomly sampled within B_ρ. If the values of the new states x^new_f, m^new_f were now kept equal to the nominal ones, the trajectory resulting from propagating Eq. (7) backward in time from the new final conditions x^new_f, m^new_f, λ^new_f, λ^new_{mf} would fulfil all of Pontryagin's necessary conditions for optimality, except H_f = 0, and would arrive at the target orbit with the same mass and true longitude L as the nominal trajectory. Thus we perturb also the two states m^new_f = m*_f + δm and L^new_f = L*_f + δL, first choosing δm at random (within some bounds ρ) and then considering the Hamiltonian as a function of the sole anomaly perturbation, H_f(δL) = 0, and solving for δL. The resulting trajectory, integrated backward from the new conditions, can be sampled and inserted into D. Such a trajectory neither ends where the nominal trajectory ends, nor starts from where the nominal trajectory starts, but it is nevertheless optimal (with respect to Eq. (5)) and represents a valid interplanetary transfer to learn from (it reaches the target orbit). This technique is what we call the "Backward Generation of Optimal Examples".
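A single perturbation step of this procedure can be sketched as follows; `hamiltonian_f` stands in for the true Hamiltonian of Eq. (6), evaluated at the perturbed final augmented state as a function of the longitude perturbation δL, and all names and signatures are our own choices:

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_final_costates(lam_f, rho):
    """Perturb the final costate vector [l_p, l_f, l_g, l_h, l_k, l_L]
    inside a ball of size rho, keeping delta lambda_L = 0 (transversality).
    delta lambda_m = 0 and the random delta_m are handled analogously."""
    lam_new = np.array(lam_f, dtype=float)
    lam_new[:5] += rng.uniform(-rho, rho, size=5)  # lambda_L (last entry) untouched
    return lam_new

def solve_dL(hamiltonian_f, dL0=0.0, tol=1e-12, max_iter=100):
    """1D secant iteration for the free-time condition H_f(dL) = 0."""
    a, b = dL0, dL0 + 1e-3
    fa, fb = hamiltonian_f(a), hamiltonian_f(b)
    for _ in range(max_iter):
        if fb == fa:
            break
        c = b - fb * (b - a) / (fb - fa)
        a, fa = b, fb
        b, fb = c, hamiltonian_f(c)
        if abs(fb) < tol:
            break
    return b
```

With the perturbed final conditions in hand, the remaining cost of one new optimal trajectory is a single backward numerical integration of Eq. (7), with no TPBVP to solve.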
This technique reduces the cost of computing one more optimal trajectory to that of a backward-in-time integration of Eq. (7), plus the computational cost of a single root-finding call to solve H_f(δL) = 0.

Each optimal trajectory obtained, regardless of the method used to compute it, has to be sampled at N points and the corresponding (augmented) state-action pairs inserted into D. In order to create a database covering the interplanetary space evenly, we avoid sampling the optimal trajectories uniformly in time (which would create a strong bias towards larger distances) and use, instead, the Sundman transformation originally described by Sundman [17] and Levi-Civita [18]. The integration variable in Eq. (7) is thus transformed from t to θ_s:

$$
\mathrm{d}t = c_3\, r^n\, \mathrm{d}\theta_s
\tag{15}
$$

where c₃ is a constant that depends on n and r is the radial distance from the main body. In this work, we use n = 1 and c₃ = √(a/µ), where θ_s is, therefore, the eccentric anomaly and a is the semi-major axis. We use the notation ṗ to refer to derivatives with respect to time and p′ to refer to derivatives with respect to θ_s. Using the Sundman transformation, Eq. (7) becomes:

$$
\begin{aligned}
x' &= \dot x\, r\sqrt{a(\theta_s)/\mu} = \left[\frac{c_1 u(\theta_s)}{m} B\,\hat i_\tau(\theta_s) + D\right] r\sqrt{a(\theta_s)/\mu}\\
m' &= \dot m\, r\sqrt{a(\theta_s)/\mu} = \left[-c_2 u(\theta_s)\right] r\sqrt{a(\theta_s)/\mu}\\
\lambda' &= \dot\lambda\, r\sqrt{a(\theta_s)/\mu}\\
\lambda_m' &= \dot\lambda_m\, r\sqrt{a(\theta_s)/\mu}\\
t' &= r\sqrt{a(\theta_s)/\mu}
\end{aligned}
\tag{16}
$$

When implementing the discussed "Backward Generation of Optimal Examples" methodology, we use this system of equations, rather than Eq. (7), to perform the backward propagation. The resulting trajectory is then sampled at N equally spaced points in θ_s, which results in a sample density uniform in the eccentric anomaly.
Note that by varying the parameter n of the Sundman transformation one can bias the database D, and thus the network learning, to have different sample densities at different radial distances. The choice made in this work is motivated by the idea of not giving preference to positions far away from the central point of attraction.

Figure 1: Value function network (left) and policy network (right).

In this section we describe the neural network architectures we use and the various loss functions proposed and studied for their training. The final aim of training an artificial neural network on D is to have it learn the optimal control structure for a low-thrust transfer and thus use it on board the satellite to generate, in real time and autonomously, the guidance and control commands to be sent to the spacecraft thrusters. For this reason, although the term originated in a different context, these types of networks are referred to as Guidance and Control Networks or, briefly, G&CNETs [10].

In this study we indicate, generically, a G&CNET with the symbol N(x, m). Formally this is a function of the spacecraft state, defined by the following relations:

$$
\begin{aligned}
\mathcal L^{(0)} &= [x, m]\\
\mathcal L^{(i+1)} &= \sigma_i\!\left(W^{(i)}\mathcal L^{(i)} + b^{(i)}\right), \quad \forall i = 0..l\\
\mathcal N(x, m) &= \mathcal L^{(l+1)}
\end{aligned}
\tag{17}
$$

where L⁽⁰⁾ denotes the input layer and L⁽ⁱ⁺¹⁾ the subsequent hidden layers. The network depth (i.e. the number of layers) l, as well as the network width (i.e. the number of neurons per layer), determined by the dimensions of the weight matrices W⁽ⁱ⁾ and bias vectors b⁽ⁱ⁾, constitute the network architecture. σ_i is a non-linear function, termed activation function, that is selected for each layer. The neural network parameters (i.e. the weight matrices and the biases) are found during training, typically using some variant of the stochastic gradient descent method.
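For concreteness, a pure-numpy forward pass implementing Eq. (17) for the policy network of Figure 1 (right) might look as follows; layer widths, the random seed and all names are our own illustrative choices, and no training logic is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative forward pass of Eq. (17) for the policy network of Figure 1:
# softplus hidden layers, a sigmoid head for the throttle u and a linear
# head for the unnormalised direction (d_r, d_t, d_n).

softplus = lambda z: np.log1p(np.exp(z))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def init_layer(n_in, n_out):
    # Kaiming-normal-style initialisation, cf. [19]
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)), np.zeros(n_out)

class PolicyNet:
    def __init__(self, widths=(7, 64, 64, 4)):   # input: [p, f, g, h, k, L, m]
        self.layers = [init_layer(a, b) for a, b in zip(widths[:-1], widths[1:])]

    def __call__(self, state):
        h = np.asarray(state, dtype=float)
        for W, b in self.layers[:-1]:
            h = softplus(W @ h + b)
        W, b = self.layers[-1]
        out = W @ h + b
        u = sigmoid(out[0])                        # throttle in (0, 1)
        i_tau = out[1:] / np.linalg.norm(out[1:])  # unit thrust direction
        return u, i_tau
```

The value function network of Figure 1 (left) differs only in its head: a single linear output predicting v.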
Note that we do not make use of any data pre-processing or post-processing, as is instead typical in a machine learning pipeline. Nondimensional units are used, though, for both the equinoctial elements and the mass.

4.1 Architectures

The two fundamental architectures used in this paper are shown in Figure 1. The first one predicts the value function v, which in our case is the final optimal mass m*_f, and hence is called the value function network; the second one predicts the throttle u and the quantities d_r, d_t, d_n, which relate to the thrust direction via the relation î_τ = [d_r, d_t, d_n]ᵀ / |[d_r, d_t, d_n]|. This second architecture is called the policy network. We found that the initialisation of the weights and biases is very important in the case studied and that a bad initialisation leads to early convergence and poor performance in general, both on the training and the validation sets. We used Kaiming normal initialisation [19] as this was found to be better suited to our network architectures. While such an initialisation was originally developed for ReLU units, their functional similarity to the softplus units used here makes it appropriate for our architectures.

We consider 6 different approaches to learning the state-feedback optimal control. These can be classified into the two categories of
Policy Imitation and
Value Function Approximation. Policy Imitation is straightforward in that the neural network learns the mapping between the spacecraft state and the optimal controls. On the other hand, Value Function Approximation is a more interesting strategy in that the neural network learns the mapping between the spacecraft state and the cost required to reach Venus' orbit: ideally, the gradients of the neural network with respect to the spacecraft state will then make it possible to compute the optimal controls as well. Note that learning the value function, even when the subsequent step of deducing the optimal controls fails, has merit by itself as it enables a number of applications, for example in preliminary mission design, where many transfer options need to be evaluated without going into the detail of the exact guidance law.

In order to describe the loss functions used for our networks, it is convenient to introduce the following components:

• Policy Learning Loss Component - the error in the estimated controls: l_policy = ⟨(u_N − u*)²⟩ + ⟨1 − î_N · î*⟩
• Value Function Loss Component - the error in the estimated value function: l_vf = ⟨(J_opt − J_N)²⟩.
• Costate Loss Component - the error in the gradients of the network with respect to the true costates: l_λ = ⟨((λ_x)_opt − ∇_x J_N)²⟩.
• Hamiltonian Loss Component - the error in the Hamiltonian computed from the network-approximated costates: l_H = ⟨(H(x, m, ∇_x J_N, u*))²⟩
• Control Loss Component - the error in the controls computed from the network-approximated costates: l_u = ⟨(u(x, m, ∇_x J_N) − u*)²⟩ + ⟨1 − î(x, m, ∇_x J_N) · î*⟩

where we used the notation ⟨·⟩ = (1/N) Σᵢ₌₁ᴺ (·) to indicate a mean across the entire database D. These loss components encapsulate a large variety of possible network training procedures, which we here seek to study.
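The loss components above translate directly into batch operations; the following is a numpy sketch in which `grad_JN` stands in for the gradient of the network output with respect to the state (in a real pipeline it would come from automatic differentiation), and all names are our own:

```python
import numpy as np

# Numpy sketch of the loss components above, evaluated on a batch from D.
# Angle brackets <.> become batch means; names are illustrative.

def l_vf(J_opt, J_N):
    """Value function loss: mean squared error on the predicted cost."""
    return np.mean((J_opt - J_N) ** 2)

def l_lambda(lam_opt, grad_JN):
    """Costate loss: squared error between true costates and network gradients."""
    return np.mean(np.sum((lam_opt - grad_JN) ** 2, axis=1))

def l_policy(u_N, u_opt, i_N, i_opt):
    """Policy loss: throttle error plus 1 - cos(angle) between directions."""
    direction = 1.0 - np.sum(i_N * i_opt, axis=1)
    return np.mean((u_N - u_opt) ** 2) + np.mean(direction)
```

All three vanish on perfect predictions, and the direction term is bounded in [0, 2] for unit vectors, which keeps the two parts of l_policy on comparable scales.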
The l_policy component can be used to train a policy network on the optimal policy, whereas the remaining loss components can be used to train a value function network. Combining the latter loss components in different ways, we can conceive different training pipelines, such as training the gradients of the value function network directly on the l_λ component.

[Figure 2: Solution of the two-point boundary value problem for the throttle magnitude u(t) ∈ [0, 1] and decreasing values of ε, from 10^−1 down to 10^−6.]

As discussed in Section 3, in order to build a database of trajectories D using the "Backward Generation of Optimal Examples" we require A) one nominal trajectory and B) a strategy to perturb the final augmented state. In the following two subsections we introduce these two items as defined in our experiments.

To illustrate the concepts presented above and their potential, we focus on one single interplanetary transfer, but it is to be remarked that the same methodology can be applied in general. We consider a spacecraft with mass m = 1500 kg and a nuclear electric propulsion system specified by I_sp = 3800 s and c = 0.33 N. We compute the mass-optimal transfer from the Earth to Venus orbit starting from the 7th of May 2005. Venus orbit is assumed Keplerian and its orbital elements are computed at ….05 yr from the launch date. The planet ephemerides are computed using the JPL low-precision ephemerides [20].

We solve the optimal control problem by solving the equation φ(λ, λ_m, t_f) = 0 (see Eq. (11)). Note that this is, essentially, a system of eight nonlinear equations in eight unknowns and can be solved by root-finding methods (e.g. Powell, Levenberg-Marquardt) as well as by SQP or interior point methods (e.g. SNOPT [21] or IPOPT [22]). As is well known (see [23] for example), the convergence radius for this problem can get rather small, to the point that if we were to try to solve the mass-optimal problem directly (i.e. plugging ε = 0 into Eq. (5)) we would fail consistently, as almost any initial guess on the co-state would not converge. However, solving the problem for ε = 0.1 is reasonably simple, as convergence is frequently achieved when starting with random co-states (we sample them from a uniform distribution with a standard deviation of 10). Note that we use nondimensional units for the state, so that the astronomical unit AU is used for length and the spacecraft initial mass for mass, and the rest is set so as to obtain μ = 1. Gradually decreasing ε from 10^−1 down to 10^−6 (as visualised in Figure 2) allows us to obtain the final mass-optimal trajectory, which is visualised in Figure 4a. We refer to the final trajectory (with ε = 10^−6) as the nominal trajectory in the following. The nominal trajectory reaches the orbit of Venus after t*_f = 1.376 yr, spends m_p = 210.47 kg of propellant, and is visualized in Figure 4.
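The continuation strategy just described (solve the shooting equations at a mild ε, then reuse the converged co-states as the initial guess while ε is lowered) can be sketched in a few lines. The shooting residual below is a hypothetical stand-in, a smooth scalar equation whose root drifts as ε shrinks; the real φ(λ, λ_m, t_f) would require numerically propagating state and co-state, and is a system of eight equations in eight unknowns rather than the 1-D toy used here.

```python
import math

def newton(residual, x0, tol=1e-12, max_iter=50):
    """Newton iteration with a finite-difference Jacobian (1-D for brevity;
    the actual shooting function is eight equations in eight unknowns)."""
    x = x0
    for _ in range(max_iter):
        r = residual(x)
        if abs(r) < tol:
            return x
        h = 1e-7
        j = (residual(x + h) - r) / h  # forward-difference derivative
        x -= r / j
    return x

def shooting_residual(lam, eps):
    # Hypothetical stand-in for phi(lambda, lambda_m, t_f) = 0: a smooth
    # scalar equation whose root moves as the smoothing parameter shrinks.
    return math.tanh(lam) - 0.5 + eps * lam

# Homotopy path: start at eps = 1e-1 (easy to converge) and reuse each
# converged co-state as the warm start for the next, smaller eps.
lam = 0.0
for eps in [10.0 ** -k for k in range(1, 7)]:  # 1e-1 ... 1e-6
    lam = newton(lambda x: shooting_residual(x, eps), lam)
```

At the end of the loop the warm-started root tracks the ε → 0 limit (here tanh(λ) → 0.5), mirroring how the throttle profile in Figure 2 sharpens towards the bang-bang mass-optimal solution.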
It takes on the order of minutes on an Intel Xeon E5-2650L v4 processor at 1.70 GHz to solve one optimal control problem (completing the whole homotopy path), and this is assuming the initial guess on the co-states is within the radius of convergence of the shooting function solver. The largest database we trained on consists of about … trajectories, thus brute-forcing the database generation would require on the order of years without multi-threading. Using, instead, the "Backward Generation of Optimal Examples" (see Section 3.1), a database containing … optimal trajectories is generated in ∼…: an improvement of several orders of magnitude.

Overall, we generated a total of 7 databases around the same nominal trajectory by varying the perturbation sizes (of the final augmented state) and the number of trajectories. Table 1 shows the databases we generated and the three parameters that characterise each database: the perturbation size, the number of sample points along a trajectory, and the number of trajectories generated. For all databases except F and G, we used a fixed perturbation size for all the parameters. For databases F and G we experimented with a non-uniform perturbation size tailored to create a visually dense database around the conditions of interest. The exact perturbations of each component are listed in Table 2. Additionally, the trajectories in F and G were
[Table 1: Parameters of the different databases A–G: the perturbation size ρ (uniform for A–E, custom for F and G), the number of sample points along a trajectory (100 for all databases except D, which uses 128), the number of trajectories, and the total size of the database; the remaining numeric entries are not recoverable from this extraction.]
terminated whenever the spacecraft semimajor axis went outside [a_Venus − … × r_Venus, a_Earth + 100 × r_Earth] and the inclination went outside of [−…°, +…°]. This was done to avoid including in the database trajectory segments that would confuse the training and be well outside the region of interest. (The region of interest can be loosely defined as the torus encompassing Earth's and Venus' orbits.)

[Table 2: Perturbations of each component (m, λ_p, λ_f, λ_g, λ_h, λ_k) in the creation of databases F and G, reporting mean and standard deviation; the numeric entries are not recoverable from this extraction.]

Figure 3 shows a visual representation of each of the generated databases from above-plane and in-plane views.

[Figure 3: Visualization of the trajectories in the generated databases, in (a) above-plane and (b) in-plane views. Thrust arcs are indicated in pink. The Earth and Venus orbits are visualized to provide some sense of scale.]

We experimented with several combinations of loss functions and architectures and report here the results on 4 different networks which we found particularly significant. The first trained network, indicated N1, is a policy network with 3 hidden layers and 200 neurons per layer. The following N2 and N3 are value function networks with 9 hidden layers and 200 neurons per layer. A final network, N4, is also a value function network, but this time with 3 hidden layers and 1000 neurons per layer. The loss function used to train the various networks is constructed out of the components defined in Section 4.2, as detailed in Eq. (18):

l_N1 = l_policy    (18a)
l_N2 = l_vf    (18b)
l_N3 = l_vf + l_λ    (18c)
l_N4 = l_vf + s l_H + l_u    (18d)

where s is a scaling parameter chosen to normalise the loss components relative to each other. We find that the l_H component is generally poorly scaled relative to the other components. Depending on the network initialisation, we found that, for our particular architecture and initialisation, the l_H component started with a value of ≈ 10^−…, which was too small relative to the other components (which were of the order 10^−…). For this reason we selected a value of s = 10….

These four networks were trained using the same optimiser with similar hyper-parameters. Specifically, we used the Amsgrad [24] optimiser with β₁ = 0.…, β₂ = 0.…, ε = 10^−… and weight decay 0.…. The networks N1, N2 and N3 were trained with lr = 10^−…, whereas N4 was trained with lr = 10^−…. Based on the improvements in the validation loss, we fixed the number of training epochs for each network type for all databases. N1 was trained for … epochs, N2 and N3 were trained for … epochs, and N4 was trained for … epochs. The minibatch size was also fixed at … examples per GPU, and multi-GPU training with up to 4 GPUs was utilised where possible.

We split the databases into training, validation and test sets following an 80-10-10 split. The nominal trajectory was excluded from the training, as we wanted to measure whether the networks generalised to the underlying, unseen reference trajectory. Furthermore, using the validation split, we reduced the learning rate whenever the validation loss plateaued.

There are numerous ways to evaluate the performance of the trained G&CNETs, each with their pros and cons. Here we look at two main elements: A) the G&CNET performance when controlling the spacecraft starting from the nominal trajectory initial conditions, and B) the G&CNET performance when controlling the spacecraft starting from "any" initial conditions.

[Figure 4 panels: (a) Nominal trajectory to Venus orbit; (b) Policy Imitation; (c) Value Function without Gradients; (d) Value Function with Gradients; (e) Value Function with Hamiltonian and Controls.]
[Figure 4: Nominal trajectories generated by G&CNETs trained on G. Thrust arcs are indicated in a lighter colour.]

The first performance criterion is to look at how closely and how optimally the spacecraft reaches the orbit of Venus when starting from the nominal conditions used to generate the different databases via our "backward generation of optimal examples" technique. In Figure 4 the interplanetary transfer resulting from using the networks is shown in comparison to the nominal transfer (i.e. the optimal transfer), revealing different levels of performance. In order to quantify the comparison, we first look at the final Euclidean distance between the first 5 equinoctial elements (p, f, g, h, k) and the target equinoctial elements defining Venus orbit. We refer to this measure as the "reduced Euclidean distance" (rEd) and denote it with ‖Δ(x)‖. The rEd is measured between Venus orbit and the orbit achieved after numerically integrating Eq. (4) for the optimal time t*_f, where f_r, f_t, f_n are computed from the G&CNET. Table 3 shows the rEd measure associated with each network controller trained on each of the databases. For scale, the rEd measure between the Earth and Venus orbit is ….

Note that the rEd distance alone does not return the full picture of the network performance, since it does not include any information on the propellant used. In order to evaluate the mass optimality resulting from the various networks, we solve the optimal control problem described in Section 2.3.1 from the spacecraft state at t*_f to Venus'
[Table 3: Reduced Euclidean distance (rEd) for the trained networks (policy network N1 and value function networks N2, N3, N4) across training databases A–G; the numeric entries are not recoverable from this extraction.]
orbit at t*_f + Δt, and compare it to the fixed-time optimal transfer from Earth to Venus, the fixed time being t*_f + Δt. The resulting difference in propellant used is called the propellant discrepancy and is the indicator we use to quantify the mass optimality of a network controller. Table 4 shows the propellant discrepancy of each controller trained on each database.

[Table 4: Propellant discrepancy (kg) for the various networks and across databases; the numeric entries are not recoverable from this extraction. For scale, the nominal optimal propellant is 210.47 kg.]

A second performance criterion is to look at the behaviour of the network controllers when the initial conditions considered are far away from the nominal trajectory used to generate the databases via our "backward generation of optimal examples" technique. For this criterion we consider two indicators that we think cover the most important characteristics of a controller: first we look at the mean errors of the optimal policy predictions as computed from a G&CNET, and then we look at how close to Venus' orbit each spacecraft eventually gets when perturbing the initial nominal conditions by increasing factors. Additionally, for the value function networks, we evaluate the ability of the networks to predict the optimal value function (i.e. to compute the optimal propellant to reach Venus orbit from any spacecraft state).

Table 5 shows the mean errors of the controls computed from each neural network on its test set. We chose to use the mean absolute error for the throttle and the mean angular error for the thrust direction. The mean angular error is useful as it foregoes issues of vector scaling and focuses on the important aspect of the thrust vector, namely the direction, whereas the throttle error focuses on the throttle magnitude. The throttle error is thus defined as Δu = |u_N − u*| and the angular error is defined as ψ_iτ = arccos(î_τ,N · î*_τ).
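These two error measures amount to simple bookkeeping over the test set, and can be sketched as follows. The predicted and optimal controls below are invented values, not data from the paper; ⟨Δu⟩ is in throttle units and ⟨ψ⟩ in degrees.

```python
import math

def throttle_error(u_pred, u_opt):
    # Mean absolute throttle error <Delta u> = <|u_N - u*|>
    return sum(abs(a - b) for a, b in zip(u_pred, u_opt)) / len(u_pred)

def angular_error_deg(dir_pred, dir_opt):
    # Mean angular error <psi> between unit thrust directions,
    # psi = arccos(i_N . i*); the dot product is clamped against round-off.
    total = 0.0
    for a, b in zip(dir_pred, dir_opt):
        dot = sum(x * y for x, y in zip(a, b))
        total += math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    return total / len(dir_pred)

# Hypothetical predictions vs. optimal values on a tiny test set
u_pred = [0.95, 0.10, 0.50]
u_opt  = [1.00, 0.00, 0.50]
d_pred = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
d_opt  = [(1.0, 0.0, 0.0),
          (math.sin(math.radians(5)), math.cos(math.radians(5)), 0.0)]

mean_du  = throttle_error(u_pred, u_opt)       # 0.05 on this toy set
mean_psi = angular_error_deg(d_pred, d_opt)    # 2.5 degrees on this toy set
```

Working on unit direction vectors, as here, is what makes the angular error insensitive to any scaling of the predicted thrust vector.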
We denote the mean using the notation ⟨·⟩ = (1/N) Σ_N (·) and the standard deviation using the notation σ(·) = √((1/(N−1)) Σ_N (· − ⟨·⟩)²); e.g. ⟨Δu⟩ is the mean throttle error and σ(Δu) is the standard deviation of the throttle error. We will use this notation in the rest of this paper. In Table 5 we see that the policy network N1 and the value function network N3 have consistently low errors, with some variability across databases and a few exceptions. We also note that the value function network N2 has a high error in general. This network is trained using a loss function l_N2 that does not penalise errors in the value function gradients; this results in a
[Table 5: Mean absolute error of the controls computed by the controllers on the test sets of their respective training databases: mean throttle error ⟨Δu⟩ and mean angular error ⟨ψ_iτ⟩ (degrees) for each network (N1–N4) and database (A–G); the numeric entries are not recoverable from this extraction.]
[Table 6: Mean absolute error of the value function (final propellant mass in kilograms) computed by the value function networks on the test sets of their respective training databases; the numeric entries are not recoverable from this extraction.]

network that is unusable for the purpose of reconstructing the optimal policy. By adding such a contribution to the loss function, we get l_N3, which has, instead, a low prediction error for the optimal policy. To a lesser extent, we observe the same improvement in N4, which uses a loss l_N4 that, while not enforcing the value function gradient directly, does penalise violations of the Bellman equation Eq. (13) and of the transversality condition on the Hamiltonian (free-time transfer). The advantage of such a loss is that it does not require the co-states and can, in principle, be applied to learn from optimal examples generated by direct methods too.

Table 6 shows the performance of the value function networks in predicting the optimal value function for the initial conditions in the test sets of their training databases. We show the mean absolute error in terms of the propellant required to reach Venus' orbit optimally from the given initial conditions, and we denote this by ⟨ΔJ⟩ = ⟨|J_N − J*|⟩. In this case we see, unsurprisingly, that the value function network N2, using a loss function that only cares about the value function value, outperforms the others, predicting the propellant required with an accuracy on the order of … for the case of training on database G.

In Table 7 we show the mean rEd and the success rates of the G&CNETs when the spacecraft starts from initial conditions sampled at random in four different regions of increasing size, A_x, centred around the nominal initial state. A transfer is considered successful when the minimum rEd reached along the transfer is below a threshold of ….
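The success-rate and mean-rEd bookkeeping can be sketched as follows. The target elements and the threshold are assumed illustrative values (the actual threshold value is not stated above), and the two "transfers" are invented element histories rather than propagated trajectories.

```python
import math

TARGET = (0.72, 0.004, 0.012, 0.01, 0.03)  # hypothetical Venus (p, f, g, h, k)
THRESHOLD = 0.05  # assumed illustrative success threshold, not the paper's

def red(elements, target=TARGET):
    # Reduced Euclidean distance over the first five equinoctial elements.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(elements, target)))

def min_red(trajectory):
    # Minimum rEd reached along one propagated transfer.
    return min(red(state) for state in trajectory)

def success_rate_and_mean(trajectories):
    mins = [min_red(t) for t in trajectories]
    rate = sum(m < THRESHOLD for m in mins) / len(mins)
    return rate, sum(mins) / len(mins)  # <rEd> = mean of per-transfer minima

# Two toy "transfers": one acquiring the target orbit, one missing it
good = [(0.9, 0.1, 0.1, 0.0, 0.0), (0.721, 0.004, 0.012, 0.01, 0.03)]
bad  = [(0.9, 0.1, 0.1, 0.0, 0.0), (0.8, 0.05, 0.05, 0.02, 0.02)]
rate, mean_red = success_rate_and_mean([good, bad])
```

Taking the minimum along each transfer before averaging is what makes ⟨rEd⟩ reward controllers that at least pass close to the target orbit, even if they later drift away.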
The regions are defined by perturbing the equinoctial elements of the initial nominal state (Earth's orbit) by x% for A_x; e.g. for A_2 the initial states are perturbed to lie between 98% and 102% of their original values. The regions of these perturbations can be seen in Figure 5, where we also visualize the corresponding final orbits for the case of the value function network trained on database D. The mean rEd reported in Table 7 is computed considering the minimum rEd achieved along N = 100 transfers starting from different initial conditions randomly sampled in a given region; in formal terms, ⟨rEd⟩ = (1/N) Σ_i min_t ‖Δ(x_i(t))‖.
[Table 7: Mean rEd and success rates of the policy network and value function networks trained on databases A, D and G, for initial conditions sampled in the four perturbation regions; the numeric entries are not recoverable from this extraction.]

The results show a high success rate and precision also far away from the nominal transfer. The mean rEd values have a small standard deviation and are below … even in the worst-case scenarios. We note that increasing the database size and density (i.e. changing database from A to D) translates into improved performance in all the perturbation regions, at the cost of an increase in the mean rEd of the policy networks for perturbations smaller than …. In terms of the policy networks, if your goal is to fly to Venus' orbit successfully within a large region, the best performing policy network is the one trained on database D. Conversely, if your goal is to acquire the target Venus orbit with high precision starting from initial conditions close to Earth, then the policy network trained on database A would be more suitable, given its smaller mean rEd. The value function networks, on the other hand, show a significant improvement in the mean rEd and success rate when moving from database A to D. In the case of the largest region, we see a deterioration of the performance of the value function network when comparing database D to database G. It is difficult to pick one value function network as being suitable for the case of a large perturbation, given that the mean rEd of the network trained on database G is half that of the network trained on database D, but it fails to achieve Venus' orbit (within the rEd threshold) a larger fraction of the time than the network trained on database D.
For a small perturbation region, it is much easier to select the appropriate network, given that the value function network trained on database G has the smallest rEd and the smallest spread in rEd while achieving a high success rate.

It is also curious to note that all the networks in Table 7 are successful in achieving Venus' orbit within the rEd threshold for the two smaller regions. Furthermore, excluding the value function network trained on database A and the policy network trained on database G, the networks are able to achieve Venus' orbit with a very high success rate also in the larger region. This is very surprising considering that, as seen in Figure 5, this region is very large in terms of coverage.

Conclusion
In this work we have introduced a new methodology, called "backward generation of optimal examples", to generate large databases of optimal trajectories while bypassing entirely the difficulties associated with optimal control solvers. In the case of a mass-optimal interplanetary transfer between the Earth and Venus orbit, we demonstrate the method by building several databases of mass-optimal transfers containing 4,000,000 optimal trajectories, largely surpassing any other previous related attempt.

We find that deep artificial neural networks can learn the optimal policy (optimal thrust magnitude and direction) from these large databases, both when predicting the optimal policy directly (policy networks) and when predicting the value function (value function networks). In the latter case, we find it necessary to include an additional component in the loss function to inform its gradients, in order for the approximated model to be usable to recover the optimal policy. The best performing networks (G&CNETs) trained in this work are able to predict the optimal thrust magnitude and thrust direction to within errors of … and …°. The value function networks are also able to predict the optimal propellant mass to within 1% of the true optimal mass. When starting from the nominal initial conditions, the developed G&CNET is able to complete the interplanetary transfer using only 2‰ more propellant than the corresponding mathematically optimal solution. Furthermore, we find that the trained G&CNETs are able to steer the spacecraft optimally to Venus' orbit also when deviating considerably from the planned nominal conditions. Our results constitute a step forward towards the realization of a purely onboard system able to perform the guidance and control functions of a low-thrust spacecraft simultaneously and in real time.
[Figure 5: Initial and final orbits of successful optimal transfers driven by N trained on database D. Each panel shows the initial orbits, final orbits, Earth's orbit, Venus' orbit and initial positions: (a) perturbation of 2%; (b) perturbation of 8%; (c) perturbation of 16%.]

Appendix

This appendix contains the explicit forms of all the derivatives necessary to write Eq. (7) explicitly.
Throughout, w = 1 + f cos L + g sin L, s² = 1 + h² + k², and i_τ is the thrust direction unit vector.

The λ̇_p equation. We get:

λ̇_p = −(c u / m) λᵀ (∂B(x)/∂p) i_τ − w² λ_L ∂/∂p √(μ/p³)
    = −(c u / m) λᵀ (∂B(x)/∂p) i_τ + (3/2) (w² λ_L / p) √(μ/p³)    (19)

2p √(μ/p) ∂B(x)/∂p =
[ 0         6p/w                       0
  sin L     ((1 + w) cos L + f)/w      −(g/w)(h sin L − k cos L)
  −cos L    ((1 + w) sin L + g)/w      (f/w)(h sin L − k cos L)
  0         0                          (s²/(2w)) cos L
  0         0                          (s²/(2w)) sin L
  0         0                          (1/w)(h sin L − k cos L) ]    (20)

The λ̇_f equation. We get:

λ̇_f = −(c u / m) λᵀ (∂B(x)/∂f) i_τ − 2 λ_L w √(μ/p³) ∂w/∂f
    = −(c u / m) λᵀ (∂B(x)/∂f) i_τ − 2 λ_L w √(μ/p³) cos L    (21)

w² √(μ/p) ∂B(x)/∂f =
[ 0   −2p cos L                  0
  0   w − (cos L + f) cos L      g cos L (h sin L − k cos L)
  0   −(sin L + g) cos L         (w − f cos L)(h sin L − k cos L)
  0   0                          −(s²/2) cos² L
  0   0                          −(s²/2) sin L cos L
  0   0                          −(h sin L − k cos L) cos L ]    (22)

The λ̇_g equation. We get:

λ̇_g = −(c u / m) λᵀ (∂B(x)/∂g) i_τ − 2 λ_L w √(μ/p³) ∂w/∂g
    = −(c u / m) λᵀ (∂B(x)/∂g) i_τ − 2 λ_L w √(μ/p³) sin L    (23)

w² √(μ/p) ∂B(x)/∂g =
[ 0   −2p sin L                  0
  0   −(cos L + f) sin L         −(w − g sin L)(h sin L − k cos L)
  0   w − (sin L + g) sin L      −f sin L (h sin L − k cos L)
  0   0                          −(s²/2) cos L sin L
  0   0                          −(s²/2) sin² L
  0   0                          −(h sin L − k cos L) sin L ]    (24)

The λ̇_h equation. We get:

λ̇_h = −(c u / m) λᵀ (∂B(x)/∂h) i_τ    (25)

∂B(x)/∂h = √(p/μ)
[ 0   0   0
  0   0   −(g/w) sin L
  0   0   (f/w) sin L
  0   0   (h/w) cos L
  0   0   (h/w) sin L
  0   0   (1/w) sin L ]

The λ̇_k equation. We get:

λ̇_k = −(c u / m) λᵀ (∂B(x)/∂k) i_τ    (26)

∂B(x)/∂k = √(p/μ)
[ 0   0   0
  0   0   (g/w) cos L
  0   0   −(f/w) cos L
  0   0   (k/w) cos L
  0   0   (k/w) sin L
  0   0   −(1/w) cos L ]    (27)

The λ̇_L equation. We get:

λ̇_L = −(c u / m) λᵀ (∂B(x)/∂L) i_τ − 2w √(μ/p³) λ_L w_L    (28)

where,

w² √(μ/p) ∂B(x)/∂L =
[ 0          −2p(g cos L − f sin L)               0
  w² cos L   −(1 + w) w sin L − w_L (cos L + f)   −g[(wh + w_L k) cos L + (wk − w_L h) sin L]
  w² sin L   (1 + w) w cos L − w_L (sin L + g)    f[(wh + w_L k) cos L + (wk − w_L h) sin L]
  0          0                                    −(s²/2)(w sin L + w_L cos L)
  0          0                                    (s²/2)(w cos L − w_L sin L)
  0          0                                    (wh + w_L k) cos L + (wk − w_L h) sin L ]    (29)

where,

w_L = ∂w/∂L = g cos L − f sin L    (30)

The λ̇_m equation. We get:

λ̇_m = −(c u / m²) |λᵀ B(x)|    (31)

References

[1] Izzo, D., Märtens, M., and Pan, B., "A survey on artificial
intelligence trends in spacecraft guidance dynamics and control," Astrodynamics, doi:10.1007/s42064-018-0053-6.
[2] Pontryagin, L., Boltyanskii, V., Gamkrelidze, R., and Mishchenko, E., The Mathematical Theory of Optimal Processes, Wiley & Sons, 1962.
[3] Betts, J. T., Practical Methods for Optimal Control and Estimation Using Nonlinear Programming, Vol. 19, SIAM, 2010.
[4] Sánchez-Sánchez, C., Izzo, D., and Hennes, D., "Learning the optimal state-feedback using deep networks," in "2016 IEEE Symposium Series on Computational Intelligence (SSCI)," IEEE, 2016, pp. 1–8.
[5] Sánchez-Sánchez, C. and Izzo, D., "Real-time optimal control via Deep Neural Networks: study on landing problems," Journal of Guidance, Control, and Dynamics, Vol. 41, No. 5, 2018, pp. 1122–1135.
[6] Cheng, L., Wang, Z., Song, Y., and Jiang, F., "Real-time optimal control for irregular asteroid landings using deep neural networks," Acta Astronautica.
[7] Cheng, L., Wang, Z., Jiang, F., and Zhou, C., "Real-time optimal control for spacecraft orbit transfer via multiscale deep neural networks," IEEE Transactions on Aerospace and Electronic Systems, Vol. 55, No. 5, 2018, pp. 2436–2450.
[8] Li, H., Baoyin, H., and Topputo, F., "Neural Networks in Time-Optimal Low-Thrust Interplanetary Transfers," IEEE Access, Vol. 7, 2019, pp. 156413–156419.
[9] Li, H., Topputo, F., and Baoyin, H., "Autonomous Time-Optimal Many-Revolution Orbit Raising for Electric Propulsion GEO Satellites via Neural Networks," arXiv preprint arXiv:1909.08768.
[10] Izzo, D., Tailor, D., and Vasileiou, T., "On the stability analysis of optimal state feedbacks as represented by deep neural models," arXiv preprint arXiv:1812.02532.
[11] Tailor, D. and Izzo, D., "Learning the optimal state-feedback via supervised imitation learning," Astrodynamics, doi:10.1007/s42064-019-0054-0.
[12] Izzo, D., Öztürk, E., and Märtens, M., "Interplanetary transfers via deep representations of the optimal policy and/or of the value function," in "Proceedings of the Genetic and Evolutionary Computation Conference Companion - GECCO '19," ACM Press, 2019, doi:10.1145/3319619.3326834.
[13] Öztürk, E. and Izzo, D., "Earth - Venus Low-Thrust Optimal Transfers / Database A," 2020, doi:10.5281/zenodo.3613772.
[14] Walker, M. J. H., Ireland, B., and Owens, J., "A set of modified equinoctial orbit elements," Celestial Mechanics, Vol. 36, 1985, pp. 409–419, doi:10.1007/BF01227493.
[15] Bertrand, R. and Epenoy, R., "New smoothing techniques for solving bang–bang optimal control problems: numerical results and statistical interpretation," Optimal Control Applications and Methods, Vol. 23, No. 4, 2002, pp. 171–197.
[16] Bellman, R., "Dynamic programming," Science, Vol. 153, No. 3731, 1966, pp. 34–37.
[17] Sundman, K. F., "Mémoire sur le problème des trois corps," Acta Mathematica, Vol. 36, 1913, pp. 105–179, doi:10.1007/bf02422379.
[18] Levi-Civita, T., "Sur la résolution qualitative du problème restreint des trois corps," Acta Mathematica, Vol. 30, 1906, pp. 305–327, doi:10.1007/bf02418577.
[19] He, K., Zhang, X., Ren, S., and Sun, J., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in "2015 IEEE International Conference on Computer Vision (ICCV)," IEEE, 2015, doi:10.1109/iccv.2015.123.
[20] Standish, E. M., "Keplerian elements for approximate positions of the major planets," available from the JPL Solar System Dynamics web site (http://ssd.jpl.nasa.gov/).
[21] Gill, P. E., Murray, W., and Saunders, M. A., "SNOPT: An SQP algorithm for large-scale constrained optimization," SIAM Review, Vol. 47, No. 1, 2005, pp. 99–131.
[22] Wächter, A. and Biegler, L. T., "On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming," Mathematical Programming, Vol. 106, No. 1, 2006, pp. 25–57.
[23] Haberkorn, T., Martinon, P., and Gergaud, J., "Low thrust minimum-fuel orbital transfer: a homotopic approach,"