Scheduling the NASA Deep Space Network with Deep Reinforcement Learning
Edwin Goh, Hamsa Shwetha Venkataram, Mark Hoffmann, Mark Johnston, Brian Wilson
Jet Propulsion Laboratory, California Institute of Technology
4800 Oak Grove Dr., Pasadena, CA
[email protected]
Abstract—With three complexes spread evenly across the Earth, NASA's Deep Space Network (DSN) is the primary means of communications as well as a significant scientific instrument for dozens of active missions around the world. A rapidly rising number of spacecraft and increasingly complex scientific instruments with higher bandwidth requirements have resulted in demand that exceeds the network's capacity across its 12 antennae. The existing DSN scheduling process operates on a rolling weekly basis and is time-consuming; for a given week, generation of the final baseline schedule of spacecraft tracking passes takes roughly 5 months from the initial requirements submission deadline, with several weeks of peer-to-peer negotiations in between. This paper proposes a deep reinforcement learning (RL) approach to generate candidate DSN schedules from mission requests and spacecraft ephemeris data with demonstrated capability to address real-world operational constraints. A deep RL agent is developed that takes mission requests for a given week as input, and interacts with a DSN scheduling environment to allocate tracks such that its reward signal is maximized. A comparison is made between an agent trained using Proximal Policy Optimization and its random, untrained counterpart. The results represent a proof-of-concept that, given a well-shaped reward signal, a deep RL agent can learn the complex heuristics used by experts to schedule the DSN. A trained agent can potentially be used to generate candidate schedules to bootstrap the scheduling process and thus reduce the turnaround cycle for DSN scheduling.

TABLE OF CONTENTS
1. Introduction
2. Related Work
3. Problem Formulation and Design
4. Results and Discussion
5. Conclusions and Future Work
References
1. INTRODUCTION
As humankind progresses towards groundbreaking space explorations ranging from searching for signs of extraterrestrial life by roving on the red planet [1] to understanding the composition of interstellar space [2], communicating with the spacecraft to exchange engineering and scientific data becomes increasingly critical. The resurgence of manned exploration efforts to the Moon and Mars further elevates the importance of communications to one of guaranteeing the safety of humanity's explorers. The NASA Deep Space Network (DSN), managed and operated by the Jet Propulsion Laboratory (JPL), is an international network of three facilities strategically located around the world to support constant observation of spacecraft launched as part of various interplanetary (and indeed, interstellar) missions. As one of the largest and most sensitive telecommunications systems in the world, the DSN also supports Earth-orbiting missions along with radio astronomy, radar astronomy, and related solar system observations.

With 12 operational antennas (as of 2019) spread across three locations — Goldstone, USA; Madrid, Spain; and Canberra, Australia — the DSN has served roughly 150 missions for spacecraft communications, and at the time of writing is very near its full capacity. For some weeks, the system is already oversubscribed by the various missions, especially those that cluster in the same portion of the sky [3]. In addition, the recently launched Mars 2020 mission adds additional requirements with eight cameras and sophisticated instruments to search for biological evidence of life on the surface. The combined factors of more frequent missions and increased demand for higher-fidelity data operations are expected to significantly increase the load on the DSN. To address the issues of oversubscription, budget constraints and system downtimes, there is urgent need (and thus much ongoing research) to improve the DSN scheduling process such that "better" candidate schedules (e.g., with more tracks placed, fewer conflicts, fairer distribution of tracking time across missions, etc.) can be generated in a much shorter turnaround time. This implies a need to search the solution space for good candidate schedules with an expediency and completeness that exceeds human capabilities, as well as a need to alleviate the bottleneck imposed by the peer-to-peer negotiation process.

Real-world optimization tasks typically have a large number of operational or physical constraints. When solving such problems with conventional operations research techniques, great care is taken to formulate the problem so as to avoid the "curse of dimensionality" in which the problem becomes exponentially complex and computationally intractable. The DSN scheduling problem indeed imposes numerous resource allocation constraints due to the wide range of spacecraft orbits, mission requirements, and operational considerations (e.g., hand-off between DSN complexes as the Earth rotates relative to the spacecraft).
Deep reinforcement learning (deep RL) is a recent alternative to these conventional approaches that has shown promise in solving complex tasks that are typically considered to rely heavily upon intuition or creativity [4–6].

Figure 1. Deep RL Canonical Diagram
Recent work on the application of deep RL to scheduling problems in cloud computing resources [7] and wireless networks [8] has demonstrated the capability of these algorithms to learn the complex rules and strategies required to accomplish such tasks. Deep RL has also been shown to perform comparably if not better than conventional metaheuristic optimization and search methods on classical operations research problems [9]. There have also been multiple instances where deep RL was successfully applied to NASA use cases [10–12]. With regards to JPL, the appealing aspects of this approach are as follows:

• The upfront investment of training an agent, however high in terms of initial resource requirements, is amortized over future problem sets (weeks) with near real-time inference, which can be performed using consumer hardware. This avoids the need to run classical optimization solvers or train agents from scratch for every scheduling cycle.

• There is potential for infusion into various other ongoing areas of research such as job shop scheduling [10] and similar use cases at JPL, as well as applications in the public domain.
Contributions
In this paper, we propose a policy optimization based scheduling approach to effectively generate de-conflicted candidate schedules for a given week using mission requests, antenna availability and other constraints as inputs. The purpose of this solution is threefold:

• Reduce the scheduling turnaround time from a few months to a few days.

• Increase antenna utilization and thus accommodate more missions.

• Minimize the unsatisfied time fraction experienced by each user, i.e., improve the "fairness" of track allocations across missions.
2. RELATED WORK
Deep Space Network schedules are typically generated a year into the future with allocations to the minute, and are performed manually, one week at a time [13]. Requested tracks are 1 to 8 hours long and are to be allocated in a view period (VP), defined as the period of time in which the spacecraft is visible to one or more antennas. In addition to the set of legal view periods for a given mission, some of the major constraints in DSN scheduling include quantization (whether scheduled activities are to occur on 1-minute or 5-minute boundaries), sufficient separation of contacts (so that onboard data capacity is not exceeded), duration flexibility (reduction or extension of tracking time) and splitting of requests into multiple tracks [14]. Ongoing work seeks to further incorporate the notion of user preferences and mission priorities into the scheduling algorithm such that lower-preference requests can be omitted under oversubscription, thereby reducing the amount of peer-to-peer negotiation that occurs when potentially high-priority tracks are omitted instead [15].

The complexity of the DSN scheduling problem is well known to the DSN user community, and a large body of literature exists around its solution. Guillaume et al. [16] explored a formulation of the problem in terms of evolutionary techniques, and leveraged that formulation to generate a population of Pareto-optimal schedules under varying conflict conditions. More recently, Rueda Oller [17] and Alimo et al. [18] formulated the task as Mixed Integer Linear Programming (MILP) problems to develop scheduling systems (for the long-range and mid-range scheduling problems, respectively) that incorporate many of the DSN's operational and physical constraints. Hackett et al. [19] investigated a beacon-tone demand access scheduling approach, whereby spacecraft, rovers and landers themselves submit ad-hoc requests for tracking time, which are then scheduled in real time. The authors found that this paradigm decreased the number of required tracks compared to the conventional "pre-allocated" approach. On the other hand, [20] proposes a multi-objective reinforcement learning cognitive engine using deep neural networks that gives orbit planning and optimization designers the capability to request resources on demand, and describes a "demand access" mode wherein spacecraft or rovers request track time on the network themselves using a beacon-tone system and obtain "on-the-fly" track time on shared-user block tracks.
3. PROBLEM FORMULATION AND DESIGN
Input Datasets
The main dataset used in this work is a set of User Loading Profiles (ULPs) for Week 44 of 2016 (an oversubscribed week), which provides the following information for a given mission:

1) The number of tracks requested for that week
2) The set of requested antenna combinations for these tracks
3) The requested duration for these tracks
4) The minimum valid duration for each track (used for splitting tracks into multiple periods)

In order to assign requested tracks to a particular antenna combination during a given week, one needs a set of view periods during which the spacecraft is visible to the requested antenna(s). We use ephemeris data downloaded from JPL's Service Preparation Subsystem (SPS) to assemble, for a given spacecraft and the requested antennas, this set of view periods. This task is a challenge in and of itself because of the potential for multiple-antenna requests that require tracks to be placed on antenna arrays. Such requests necessitate, in addition to the need to identify view periods that overlap across all requested antennas, the need for practical constraints to be taken into account, e.g., minimum duration for the requests, additional setup and teardown times, etc.

Finally, scheduled maintenance is also taken into account to further constrain the problem. Maintenance data for each antenna is downloaded from SPS and used in the view period identification step to filter out view periods that overlap with maintenance periods for a given antenna.

The aforementioned input datasets and the overall steps taken to obtain the final problem set used in the formulation are shown in Fig. 2 below.
Figure 2. Flow chart illustrating the main steps used to generate the problem set (Week 44, 2016) used in this paper.
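To make the view-period assembly concrete, the sketch below shows one way the per-antenna view periods could be intersected for arrayed (multi-antenna) requests and filtered against maintenance windows. This is a minimal illustration under our own assumptions rather than the actual SPS-based pipeline; the function and variable names are our own, and view periods are represented simply as (start, end) tuples sorted by start time.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end), e.g., seconds since start of week

def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersection of two sorted, non-overlapping interval lists (two-pointer sweep)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def drop_maintenance(vps: List[Interval], maintenance: List[Interval]) -> List[Interval]:
    """Filter out any view period that overlaps a maintenance window."""
    return [vp for vp in vps if not any(vp[0] < m[1] and m[0] < vp[1] for m in maintenance)]

def arrayed_view_periods(per_antenna_vps, maintenance_by_antenna, antennas, d_min):
    """View periods usable by a multi-antenna (arrayed) request: the time ranges
    common to all requested antennas that are long enough to be worth keeping."""
    common = drop_maintenance(per_antenna_vps[antennas[0]], maintenance_by_antenna[antennas[0]])
    for ant in antennas[1:]:
        common = intersect(common, drop_maintenance(per_antenna_vps[ant], maintenance_by_antenna[ant]))
    return [vp for vp in common if vp[1] - vp[0] >= d_min]
```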
Model/Environment
This section provides details about the environment used to simulate/represent the DSN scheduling problem. The environment is implemented according to the OpenAI Gym [21] API in order to maintain compatibility with widely used reinforcement learning libraries such as RLlib and stable-baselines.

The simulation is instantiated with the problem set generated using the pipeline shown in Fig. 2, as well as a dictionary of DSN antennas. Therefore, episodes in this simulation are centered around week problems. Such a formulation is well aligned with the DSN scheduling process described in Sec. 2, which generates schedules on a per-week basis.

Each Antenna object, initialized with start and end bounds for a given week, maintains a list of tracks placed as well as a list of time periods (represented as tuples) that are still available. Algorithm 1 details the general algorithm used in this environment to satisfy requests in the problem set.
Algorithm 1: DSN Scheduling Simulation

Data: week problem set (see Fig. 2)
while n_rem > 0 or n_steps < n_requests do
    choose a request to allocate;
    for antenna in requested antenna combinations do
        find and keep only valid VPs;
    end
    allocate track on antenna with longest valid VP;
    if duration of VP > requested duration then
        randomly shorten VP to match requested duration;
    end
    calculate seconds allocated;
    return reward and observation;
end
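The following Python sketch illustrates how the greedy placement step of Algorithm 1 could look. It is a simplified rendering under our own assumptions, not the environment's actual implementation: the minimal Antenna class and the allocate_request helper are illustrative names, and view periods are plain (start, end) tuples.

```python
import random

class Antenna:
    """Minimal bookkeeping sketch: tracks placed so far and time periods still free."""

    def __init__(self, week_start, week_end):
        self.tracks = []                      # placed tracks, as (start, end)
        self.free = [(week_start, week_end)]  # available time periods

    def place_track(self, start, end):
        """Record a track and split whichever free period contained it."""
        self.tracks.append((start, end))
        new_free = []
        for s, e in self.free:
            if end <= s or start >= e:        # no overlap with this free period
                new_free.append((s, e))
                continue
            if s < start:
                new_free.append((s, start))
            if end < e:
                new_free.append((end, e))
        self.free = new_free

def allocate_request(requested_duration, valid_vps_by_antenna, antennas):
    """Greedy step of Algorithm 1: place the track on the antenna whose valid
    view period is longest, shortening it randomly if it exceeds the request."""
    best = None
    for name, vps in valid_vps_by_antenna.items():
        for start, end in vps:
            if best is None or (end - start) > (best[2] - best[1]):
                best = (name, start, end)
    if best is None:
        return 0.0  # nothing could be allocated at this step

    name, start, end = best
    if end - start > requested_duration:
        offset = random.uniform(0.0, (end - start) - requested_duration)
        start, end = start + offset, start + offset + requested_duration
    antennas[name].place_track(start, end)
    return end - start  # seconds allocated, later used to compute the reward
```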
As seen in the simulation steps detailed in Algorithm 1, Antenna objects provide the capability to process the set of valid view periods identified in Fig. 2 according to the antenna's availability and output a set of view periods that do not overlap with existing tracks already placed on that antenna. For multi-antenna requests, these available view periods for each antenna in the array are then passed through an overlap checker to find the overlapping ranges.

For the view periods that are available, the antenna provides utilities to check whether a view period is valid based on DSN-specific heuristics and rules. For the present work, a view period $(t_1, t_2)$ with an associated setup/calibration duration $d_s$ and teardown duration $d_t$ is considered valid if all of the following conditions return true:

1) $(t_1 - d_s, t_1)$ is available, or $(t_2 - t_1) \geq d_{min} + d_s + d_t$
2) $(t_2, t_2 + d_t)$ is available, or $(t_2 - t_1) \geq d_{min} + d_s + d_t$
3) $(t_2 - t_1) \geq d_{min}$, where $d_{min}$ is the minimum requested duration for this track

As we will discuss in the following sections, the present environment handles most of the "heavy lifting" involved in actually placing tracks on a valid antenna, leaving the agent with only one responsibility — to choose the "best" request at any given time step. The simulation described thus far is a preliminary implementation. Constraints such as the splitting of a single request into tracks on multiple days or Multiple Spacecraft Per Antenna (MSPA) are important aspects of the DSN scheduling problem that require experience-guided human intuition and insight to fulfill. Being cognizant of this limitation, we intentionally implement this environment in a modular fashion such that subclasses with additional constraints can be easily defined in the future.
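As a concrete (hypothetical) rendering of these three conditions, the check could be written as follows, where is_available is assumed to be an Antenna utility that reports whether a time span is free of already-placed tracks:

```python
def is_valid_view_period(t1, t2, d_setup, d_teardown, d_min, is_available):
    """Return True if the view period (t1, t2) satisfies the three validity
    conditions above. `is_available(start, end)` is an assumed Antenna helper
    that returns True when the span contains no already-placed track."""
    span = t2 - t1
    long_enough_for_overhead = span >= d_min + d_setup + d_teardown
    setup_ok = is_available(t1 - d_setup, t1) or long_enough_for_overhead
    teardown_ok = is_available(t2, t2 + d_teardown) or long_enough_for_overhead
    return setup_ok and teardown_ok and span >= d_min
```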
State Space/Observation

At any given point in the simulation, the environment keeps track of:

• the distribution of remaining requested durations,
• the total outstanding requested hours for that week,
• the number of unique missions with outstanding requests,
• the remaining number of requested tracks, and
• the number of remaining free hours on each antenna.

In order to use the same observation space over multiple weeks, we specify a bound on the maximum number of requests (i.e., requested tracks) that are valid in any given week. For requests in the year 2016, a bound of 500 provided sufficient margin across all weeks. Thus 500 entries are defined for the distribution of remaining requested durations.

This state space of the environment is represented as a 1-D array that indicates the number of remaining unique missions, the number of remaining requests, the total remaining duration requested, as well as the remaining unallocated duration in each request.
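A minimal sketch of how such an observation vector could be assembled is shown below. The 518-entry layout (3 summary values, 500 per-request durations, 15 per-antenna free hours) follows the description given later in the Training Algorithm subsection; the field names and data structures are our own illustrative assumptions.

```python
import numpy as np

MAX_REQUESTS = 500   # bound on requests per week (sufficient for all 2016 weeks)
NUM_ANTENNAS = 15    # length of the per-antenna free-hours block

def build_observation(requests, free_hours_per_antenna):
    """Assemble the 1-D observation (3 + 500 + 15 = 518 entries).

    `requests` is an illustrative list of at most MAX_REQUESTS dicts with keys
    "mission" and "remaining_hours"; `free_hours_per_antenna` has one entry per antenna."""
    per_request = np.zeros(MAX_REQUESTS, dtype=np.float32)
    for i, req in enumerate(requests):
        per_request[i] = req["remaining_hours"]

    outstanding = [r for r in requests if r["remaining_hours"] > 0]
    summary = np.array([
        per_request.sum(),                          # total outstanding requested hours
        len({r["mission"] for r in outstanding}),   # unique missions with outstanding requests
        len(outstanding),                           # remaining number of requested tracks
    ], dtype=np.float32)

    antenna_hours = np.asarray(free_hours_per_antenna, dtype=np.float32)
    return np.concatenate([summary, per_request, antenna_hours])
```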
Action Space

There are multiple ways to enumerate the actions a reinforcement learning agent can take at each time step. An initial attempt specified the action space as a 2-D binary grid whose rows represented the individual DSN antennas and whose columns represented discretized time periods. When flattened/reshaped into a 1-D array, this resulted in a formidable action space of size $M \times K$, where $M$ is the number of DSN antennas and $K$ is the number of time steps resulting from the discretization of the entire week by a given time step. Since such a large action space precludes efficient learning and makes the addition of DSN-defined constraints difficult, the current iteration of the action space for the DSN scheduling environment is intentionally simple — a single integer that defines which item in a request set the environment should allocate. Action masking is used in order to prevent the agent from choosing requests that have already been satisfied.

This implementation was developed with future enhancements in mind, eventually adding more responsibility to the agent such as choosing the resource combination to use for a particular request, and ultimately the specific time periods in which to schedule a given request. These decisions are hierarchical in nature and resemble the possible actions for each Dota agent in OpenAI Five [22], whereby an agent would, for instance, decide to attack, select a target to attack, and decide whether to offset the action in anticipation of the target unit's future position.
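One common way to realize such action masking (a sketch under our own assumptions, not necessarily how the environment implements it) is to suppress the logits of invalid requests before sampling an action:

```python
import numpy as np

def apply_action_mask(logits, remaining_hours):
    """Suppress requests that are already satisfied (or unused padding slots).

    `logits` and `remaining_hours` are both length-500 arrays; masked entries
    receive an effectively zero selection probability."""
    return np.where(remaining_hours > 0, logits, -1e9)

def sample_action(logits, remaining_hours, rng=None):
    """Sample a request index from the masked action distribution."""
    rng = rng or np.random.default_rng()
    masked = apply_action_mask(logits, remaining_hours)
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```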
Rewards

In the DSN scheduling environment, an agent is rewarded for an action if the chosen request index resulted in a track being scheduled. Here, the reward is given by

$$r_t(s, a) = \frac{T_{allocated}}{T_{requested}} \qquad (1)$$

where $T_{allocated}$ is the total time scheduled across all antennas for this request and $T_{requested}$ is the requested time allocation for the entire week. At each time step, the reward signal is a scalar ranging from 0 (if the selected request index did not result in the allocation of any new tracking time) to 1 (if the environment was able to allocate the entire requested duration). As one can surmise, the theoretical maximum reward that can be achieved in an episode is the number of requests in that week.

Training Algorithm
For this preliminary exploration, we use the Proximal Policy Optimization (PPO) algorithm [23] implemented in the RLlib reinforcement learning library [24]. While Schulman et al. demonstrated state-of-the-art performance with PPO on robotic locomotion/optimal control and Atari game playing, the algorithm has also been shown to be feasible on stochastic optimization problems in operations research [9]. Furthermore, an RL agent trained with REINFORCE — another policy gradient algorithm similar to PPO — was shown to perform similarly to and sometimes better than existing heuristics-based approaches for scheduling multi-resource clusters [25].

RLlib implements PPO in an actor-critic fashion. The actor is a typical policy network that maps states to actions, whereas the critic is a value network that predicts the state's value, i.e., the expected return for following a given trajectory starting from that state. For a batch of observations from the environment, the actor network predicts a distribution over the set of available actions. The training algorithm then samples a specific action from this distribution based on a given exploration strategy.

After an action is selected, the critic estimates the advantage $A_t(s, a)$ as a function of the (temporal-difference) error $\delta_t$ between the value function predicted by the network and the actual rewards returned by the environment. The error term is defined as

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \qquad (2)$$

where $V$ is the critic's current model of the value function and $r_t$ is the reward at time step $t$. Thus, for a given policy defined by the parameters $\theta$, the objective used in PPO is

$$L^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta) A_t,\; \mathrm{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) A_t\right)\right] \qquad (3)$$

where $r_t(\theta)$ is the ratio of the action probabilities for the current state $s_t$ under the current policy to the action probabilities for $s_t$ under the old policy, and $\epsilon$ is a hyperparameter proposed in [23] to clip $r_t(\theta)$ and thus prevent large policy updates that result in irrecoverable decreases in agent performance. Since the gradient of Eq. 3 is an estimator for the policy gradient, using this loss function as the objective of a stochastic gradient ascent problem is a surrogate for updating the policy to encourage good actions and weaken the tendency toward actions that perform worse than expected.

While the results in [23] are obtained using an actor and critic that share the same layers, the neural architecture used in this work is one that has separate layers (and thus parameters) for the policy and the value function. Throughout all experiments, we use a fully-connected neural network architecture with 2 hidden layers of 256 neurons each. Based on the observation/state space defined above, the input layer is of size 518; the first three entries are the remaining number of hours, missions, and requests, the following set of 500 entries are the remaining number of hours to be scheduled for each request, and the final 15 entries are the remaining free hours on each antenna. We use a maximum number of requests of 500 to ensure that the same observation space can be used across multiple weeks.

Figure 3. Actor-critic network architecture used in this work. The left branch represents the actor network, which maps observations to actions, whereas the right branch depicts the critic, which learns to estimate the value of a given state.
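To make Eqs. 2 and 3 concrete, the following short sketch evaluates both quantities for a batch of samples. It is written in plain NumPy as a reading aid rather than as RLlib's internal implementation, and the clipping value is only illustrative.

```python
import numpy as np

def td_error(rewards, values, next_values, gamma=0.99):
    """Temporal-difference error from Eq. (2)."""
    return rewards + gamma * next_values - values

def ppo_clipped_objective(prob_ratio, advantages, epsilon=0.3):
    """Clipped surrogate objective from Eq. (3), averaged over the batch (to be maximized).

    `prob_ratio` is r_t(theta), the ratio of new to old action probabilities;
    epsilon=0.3 is an illustrative clipping value."""
    unclipped = prob_ratio * advantages
    clipped = np.clip(prob_ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.minimum(unclipped, clipped).mean()
```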
4. RESULTS AND DISCUSSION
In this section, we first present details of the training process as well as the hyperparameters used. We then present preliminary solutions obtained using the formulation described above and compare those solutions with those of an agent taking random actions. Solutions are presented for Week 44 of 2016. Excluding maintenance requests on the individual antennas, the DSN received a total of 286 requests for that week, which amounted to 1,770 hours to be allocated across the DSN's antennas.
Experimental Setup
Training was performed on a single Amazon EC2 instance with 4 GPUs and 32 CPUs, and the agent was trained for roughly 10M time steps using the RLlib framework. RLlib provides trainer and worker processes — the trainer is responsible for policy optimization by performing gradient ascent, while workers run simulations on copies of the environment to collect experiences that are then returned to the trainer. RLlib is built on the Ray backend, which handles scaling and allocation of available resources to each worker.

PPO uses the Stochastic Gradient Descent (SGD) algorithm, and in this experiment we set the minibatch size to 128 and the number of epochs to 30 for optimizing the surrogate objective given in Eq. 3. While learning rate schedules can be defined in RLlib, the results presented here were trained using a constant learning rate of 5e-5. The target Kullback–Leibler (KL) divergence [26] is set to 0.01 and the Generalized Advantage Estimator (GAE) parameter λ is set to 1.0; λ is a bias-variance tradeoff parameter, with higher values implying higher variance [27]. The discount factor γ is set to 0.99, which places more weight on long-term rewards rather than immediate rewards. The clipping parameters for the PPO policy and value function losses are set to the RLlib defaults, and the critic baseline is enabled to make use of GAE.

Fig. 4 shows the evolution of several key metrics from the training process. In Fig. 4a, the mean and maximum rewards achieved by the policy across 20 evaluation episodes are shown to increase in a stepwise fashion as the number of training episodes increases. One would expect the distribution of rewards to shift rightwards as the policy is progressively updated. Decreases in reward indicate periods where the agent does not exploit the best-available policy at the time, but instead explores other policies to prevent itself from being trapped in local extrema. Furthermore, the average number of steps taken in each episode (Fig. 4b) is shown to decrease with training, indicating that the agent is capable of achieving better-performing schedules without spending additional steps to select requests that cannot be allocated. In other words, this may be an indication that the agent is learning to prioritize requests that can be allocated by the environment based on the availability of the antennas. Finally, Fig. 4c shows the evolution of entropy as training progresses. Entropy is an important indicator of whether there is variance in the actions taken by the policies being trained. The gradually decreasing entropy in Fig. 4c indicates that the PPO algorithm is converging on an optimal policy while maintaining its exploration policy.
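For reference, a configuration along the following lines would express these hyperparameters in RLlib's (pre-2.0) Python API. This is a sketch under stated assumptions: the environment name DSNSchedulingEnv is hypothetical, exact configuration key names vary between RLlib versions, and the worker/GPU counts are illustrative.

```python
from ray import tune

config = {
    "env": "DSNSchedulingEnv",      # hypothetical registered environment name
    "num_workers": 30,              # rollout workers collecting experience
    "num_gpus": 1,                  # GPUs used by the trainer process
    "lr": 5e-5,                     # constant learning rate
    "sgd_minibatch_size": 128,      # minibatch size for the surrogate objective
    "num_sgd_iter": 30,             # SGD epochs per training batch
    "kl_target": 0.01,              # target KL divergence
    "lambda": 1.0,                  # GAE parameter
    "gamma": 0.99,                  # discount factor
    "use_gae": True,                # critic baseline with GAE
    "model": {"fcnet_hiddens": [256, 256], "vf_share_layers": False},
}

tune.run("PPO", config=config, stop={"timesteps_total": 10_000_000})
```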
Random Agent Baseline

Due to complexities in the DSN scheduling process described in Section 1, the current iteration of the environment has yet to incorporate all necessary constraints and actions to allow for an "apples-to-apples" comparison between the present results and the actual schedule for Week 44 of 2016. For example, the splitting of a single request into multiple tracks is a common outcome of the discussions that occur between mission planners and DSN schedulers. This allows tracks to be fit into gaps that full requests otherwise would not, at the cost of increased overhead time due to setup and teardown.

Instead of comparing to historical data, we define the performance of a random agent to be the baseline result. This baseline is obtained by taking random actions (recall that actions in this case are integers that represent the index of the request to schedule) and passing them into the environment. Here, policies are deep neural networks parameterized by $\theta$, and a random agent is one that uniformly samples the action space at every time step of the environment. As seen in Fig. 5, a random agent without action masking chooses uniformly across the entire range of possible request indices (0–500).

Figure 4. Evolution of key metrics during PPO training of the DSN scheduling agent: (a) mean and maximum rewards, (b) average number of steps taken, (c) entropy. Rewards and episode length statistics were calculated across 20 evaluation episodes.

Figure 5. Kernel density estimate of actions taken over 100 episodes for the random agent (green) and best agent (blue). Note that episodes consist of multiple steps, and results here are shown for actions selected by the agent at each step.
Comparison with Trained Agent
The agent with the best performance (mean rewards in Fig. 4a) was chosen as our preliminary benchmark against the random baseline. This was the agent with the policy that had undergone roughly 700 SGD updates, or roughly 10,000 episodes. We perform 100-episode rollouts/evaluations using the best-performing agent and the random agent to sample the stochastic policies. The action distributions across all episodes shown in Fig. 5 illustrate that action masking indeed keeps agent actions to within the 286 requests for Week 44 of 2016. Furthermore, Fig. 5 shows a distinct distribution of actions, indicating that there are requests that the agent "prefers" to allocate, as opposed to a uniform sampling of the action space.

Figure 6. Distribution of total rewards obtained over 100 episodes for the random agent (green) and best agent (blue). The reward distribution achieved by the trained agent exhibits an obvious shift to the right, indicating learning by the agent.

From the 100 episodes, we extract schedules from the episodes with total rewards closest to the mean reward (~161 for the random agent and ~184 for the trained agent). Key performance metrics for DSN schedules include the RMS of the unsatisfied time fraction across all missions, $U_{RMS}$, the maximum unsatisfied time fraction among all missions, $U_{max}$, and the antenna utilization, $A$. These are defined in Eqs. 4–7.

$$U_i = \frac{T_{R_i} - T_{S_i}}{T_{R_i}} \qquad (4)$$

$$U_{RMS} = \sqrt{\frac{1}{N}\sum_{i}^{N} U_i^2} \qquad (5)$$

$$U_{max} = \max_i (U_i) \qquad (6)$$

where $T_{R_i}$ represents the total tracking time requested by the $i$-th mission, and $T_{S_i}$ represents the total duration scheduled across all antennas for that mission. $U_{max}$ is an indication of which mission has the most requests unsatisfied, while $U_{RMS}$ provides a measure of uniformity in allocations over all missions.

$$A = \frac{\text{total time antennas not idle}}{\text{total available antenna time for the time period}} \qquad (7)$$

As seen in Table 1, the trained agent manages to satisfy 1,007 hours out of the requested 1,770 hours, whereas the random agent satisfies 944 hours. Likewise, the trained agent allocates slightly more requests than the random case. The difference in $U_{RMS}$ between the two cases is negligible. Figs. 7 and 8 show a comparison across 30 missions of the number of hours and number of tracks requested/allocated, respectively. The mission names have been omitted from these figures.
TABLE 1. Comparison of scheduled results using the mean performance of the random agent and the mean performance of the trained agent for Week 44, 2016.

Agent (mean performance from Fig. 6)           Random    Trained
Hours satisfied                                 944       1,007
Mean satisfied time fraction (%)                60.5      59.4
Number of satisfied requests                    180       188
Mean satisfied request fraction (%)             62.9      65.7
RMS of unsatisfied time fraction, U_RMS (%)     4.3       3.9
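The schedule metrics in Eqs. 4–7 are straightforward to compute from per-mission requested and scheduled durations. The sketch below is an illustrative rendering with our own function names, not the evaluation code used in the paper.

```python
import numpy as np

def unsatisfied_fractions(requested_hours, scheduled_hours):
    """Per-mission unsatisfied time fraction U_i from Eq. (4)."""
    requested = np.asarray(requested_hours, dtype=float)
    scheduled = np.asarray(scheduled_hours, dtype=float)
    return (requested - scheduled) / requested

def schedule_metrics(requested_hours, scheduled_hours, busy_antenna_hours, total_antenna_hours):
    """U_RMS (Eq. 5), U_max (Eq. 6) and antenna utilization A (Eq. 7)."""
    u = unsatisfied_fractions(requested_hours, scheduled_hours)
    u_rms = np.sqrt(np.mean(u ** 2))
    u_max = np.max(u)
    utilization = busy_antenna_hours / total_antenna_hours
    return u_rms, u_max, utilization
```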
The results presented above indicate that, while the agent is definitely learning to choose specific requests for the environment to allocate, the final output schedules exhibit only a modest improvement over randomly chosen actions. This is not surprising considering the simplicity of the agent's action space and the greedy fashion in which the environment allocates requests after receiving an index from the agent. In addition to demonstrating the feasibility of deep RL for scheduling spacecraft communications, the main accomplishment of this work is the implementation of a simple yet modular representation of the DSN scheduling problem within the deep RL framework that can be augmented with increasingly realistic constraints and more complex RL agents. We discuss promising avenues of research in the next section.
5. CONCLUSIONS AND FUTURE WORK
In this paper, we presented a formulation of the DSN scheduling process as a reinforcement learning problem. An environment that encapsulates the dynamics of the scheduling problem was implemented, with the observation space being a series of quantities that represent the state of the remaining problems and the DSN antennas' availability. The agent's action space was simplified for this preliminary task — a single integer that represents the index into a list of requests for the week. Given this index, the environment then attempts to allocate the request in a greedy fashion, i.e., on the requested antenna combination with the most available time remaining.

Using the aforementioned deep RL formulation with the proximal policy optimization algorithm, an agent was trained on user loading profiles from 2016 for roughly 10M steps. Preliminary results demonstrate observable improvement in agent performance as the underlying policy converges on an optimal policy. Due to the preliminary nature of this implementation and the complex human-in-the-loop nature of the scheduling process, comparisons could only be performed against a random agent baseline rather than the actual scheduling outcomes. These comparisons indicate that the trained agent exhibits demonstrably more reliable performance than a random agent due to the improved policy, although the absolute gains in schedule-related metrics such as unsatisfied time fraction are small.

The low performance observed in the trained agent is, perhaps unsurprisingly, due to the simplicity with which the environment and agent were designed. Indeed, it is this intentional simplicity that allows us to leverage the explainability of the agent's progress and learnings rather than its performance at this juncture. This cognizance led to very careful planning of the system's implementation such that additional improvements can be made with minimal effort. Thus, in ongoing research, we plan to incorporate realistic constraints elicited from requirements discussions while also scaling the datasets to represent the complexity of real-world requests. We intend to improve the formulation of action spaces such that the agent also learns to split, shorten and drop tracks wherever necessary, and to learn action space representations using action embeddings [28]. Currently, the complexity of the input datasets that the agent is trained on has remained fairly high since we consider oversubscribed weeks. Though the results demonstrate the agent's learning capabilities, neural networks, like humans, benefit from a gradual increase in the difficulty of the concepts they learn [29]. To that end, we plan to integrate curriculum learning [30] and scale the training examples gradually using curriculum-based training strategies.

ACKNOWLEDGMENTS
This effort was supported by JPL, managed by the California Institute of Technology on behalf of NASA. The authors would like to thank the JPL Interplanetary Network Directorate and Deep Space Network team, and internal DSN Scheduling Strategic Initiative team members Alex Guillaume, Shahrouz Alimo, Alex Sabol and Sami Sahnoune. U.S. Government sponsorship acknowledged.

Figure 7. Comparison of the number of hours allocated across all missions using the random and trained agents: (a) random agent, (b) trained agent.

Figure 8. Comparison of the number of requests scheduled across all missions using the random and trained agents: (a) random agent, (b) trained agent.

REFERENCES

[1] NASA, "Mars 2020 Perseverance rover." [Online]. Available: mars.nasa.gov/mars2020/
[2] "Voyager." [Online]. Available: https://voyager.jpl.nasa.gov/
[3] M. D. Johnston, "Deep space network scheduling using multi-objective optimization with uncertainty," in SpaceOps 2008 Conference, 2008. [Online]. Available: http://arc.aiaa.org
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," CoRR, vol. abs/1312.5602, 2013.
[5] D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–489, 2016.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[7] Y. Wang, H. Liu, W. Zheng, Y. Xia, Y. Li, P. Chen, K. Guo, and H. Xie, "Multi-objective workflow scheduling with deep-Q-network-based multi-agent reinforcement learning," IEEE Access, vol. 7, pp. 39 974–39 982, 2019.
[8] J. Wang, C. Xu, Y. Huangfu, R. Li, Y. Ge, and J. Wang, "Deep reinforcement learning for scheduling in cellular networks," May 2019. [Online]. Available: https://arxiv.org/abs/1905.05914
[9] B. Balaji, J. Bell-Masterson, E. Bilgin, A. Damianou, P. M. Garcia, A. Jain, R. Luo, A. Maggiar, B. Narayanaswamy, and C. Ye, "ORL: Reinforcement learning benchmarks for online stochastic optimization problems," arXiv preprint.
[10] Proceedings of the 14th International Joint Conference on Artificial Intelligence – Volume 2, ser. IJCAI'95. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1995, pp. 1114–1120.
[11] A. Rubinsztejn, R. Sood, and F. E. Laipert, "Neural network optimal control in astrodynamics: Application to the missed thrust problem," Acta Astronautica, vol. 176, pp. 192–203, Nov. 2020.
[12] P. V. R. Ferreira, R. Paffenroth, A. M. Wyglinski, T. M. Hackett, S. G. Bilén, R. C. Reinhart, and D. J. Mortensen, "Multi-objective reinforcement learning-based deep neural networks for cognitive space communications," in Cognitive Communications for Aerospace Applications Workshop (CCAA) 2017. Institute of Electrical and Electronics Engineers Inc., Aug. 2017.
[13] B. J. Clement and M. D. Johnston, "The deep space network scheduling problem," in Proceedings of the National Conference on Artificial Intelligence, vol. 3. Pasadena, CA: Jet Propulsion Laboratory, National Aeronautics and Space Administration, 2005, pp. 1514–1520.
[14] M. D. Johnston, D. Tran, B. Arroyo, S. Sorensen, P. Tay, B. Carruth, A. Coffman, and M. Wallace, "Automated scheduling for NASA's Deep Space Network," AI Magazine, vol. 35, no. 4, pp. 7–25, Dec. 2014.
[15] M. D. Johnston, "User preference optimization for oversubscribed scheduling of NASA's Deep Space Network," Berkeley, California, USA, July 2019, pp. 86–92.
[16] A. Guillaume, S. Lee, Y. F. Wang, H. Zheng, R. Hovden, S. Chau, Y. W. Tung, and R. J. Terrile, "Deep space network scheduling using evolutionary computational methods," in IEEE Aerospace Conference Proceedings, 2007.
[17] G. Rueda Oller, "Space Mission Scheduling Toolkit for Long-Term Deep Space Network Loading Analyses and Strategic Planning," 2019.
[18] J. A. Sabol, R. Alimo, M. Hoffmann, E. Goh, B. Wilson, and M. Johnston, Towards Automated Scheduling of NASA's Deep Space Network: A Mixed Integer Linear Programming Approach. [Online]. Available: https://arc.aiaa.org/doi/abs/10.2514/6.2021-0667
[19] T. Hackett, S. Bilén, and M. D. Johnston, "Investigating a demand access scheduling paradigm for NASA's deep space network," in Proc. 11th Int. Workshop Plan. Scheduling, 2019, pp. 51–60.
[20] T. M. Hackett, "Applying artificial intelligence to space communications networks: Cognitive real-time link layer adaptations through rapid orbit planning," Ph.D. dissertation, The Pennsylvania State University, 2019.
[21] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," 2016.
[22] OpenAI, "OpenAI Five," https://blog.openai.com/openai-five/, 2018.
[23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," July 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
[24] E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, and M. I. Jordan, "RLlib: Abstractions for distributed reinforcement learning," Tech. Rep., July 2018. [Online]. Available: http://rllib.io
[25] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, "Resource management with deep reinforcement learning," in HotNets 2016 – Proceedings of the 15th ACM Workshop on Hot Topics in Networks. New York, NY, USA: Association for Computing Machinery, Nov. 2016, pp. 50–56. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3005745.3005750
[26] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statist., vol. 22, no. 1, pp. 79–86, 1951. [Online]. Available: https://doi.org/10.1214/aoms/1177729694
[27] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," 2018.
[28] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh, "Learning scheduling algorithms for data processing clusters," in SIGCOMM 2019 – Proceedings of the 2019 Conference of the ACM Special Interest Group on Data Communication, pp. 270–288, Oct. 2018. [Online]. Available: http://arxiv.org/abs/1810.01963
[29] J. Elman, "Learning and development in neural networks: The importance of starting small," Cognition, vol. 48, pp. 71–99, 1993.
[30] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 41–48.