Teaching UAVs to Race: End-to-End Regression of Agile Controls in Simulation
Matthias Müller, Vincent Casser, Neil Smith, Dominik L. Michels, Bernard Ghanem
{matthias.mueller.2, vincent.casser, neil.smith, dominik.michels, bernard.ghanem}@kaust.edu.sa
Visual Computing Center, King Abdullah University of Science and Technology
* equal contribution
Abstract.
Automating the navigation of unmanned aerial vehicles (UAVs) in diverse scenarios has gained much attention in recent years. However, teaching UAVs to fly in challenging environments remains an unsolved problem, mainly due to the lack of training data. In this paper, we train a deep neural network to predict UAV controls from raw image data for the task of autonomous UAV racing in a photo-realistic simulation. Training is done through imitation learning with data augmentation to allow for the correction of navigation mistakes. Extensive experiments demonstrate that our trained network (when sufficient data augmentation is used) outperforms state-of-the-art methods and flies more consistently than many human pilots. Additionally, we show that our optimized network architecture can run in real-time on embedded hardware, allowing for efficient onboard processing critical for real-world deployment. From a broader perspective, our results underline the importance of extensive data augmentation techniques to improve robustness in end-to-end learning setups.
1 Introduction

Fig. 1: Illustration of the trained racing UAV in-flight.

Unmanned aerial vehicles (UAVs) like drones and multicopters are attracting increased interest across various communities such as robotics, graphics, and computer vision. Learning to control UAVs in complex environments is a challenging task even for humans. One of the most challenging navigation tasks with respect to UAVs is competitive drone racing. It takes extensive practice to become a good pilot, frequently involving crashes. A more affordable approach to developing professional flight skills is to train many hours in a flight simulator before going to the field. Since most of the fine motor skills of flight control are developed in the simulator, the pilot is able to quickly transition to real-world flights.

Humans are able to abstract the visual differences between simulation and the real world, and are able to transfer the learned control knowledge with some finetuning to account for the small differences of the physics simulation. While transfer for trained network policies is more difficult due to the perception component, it will be easier if the simulation is as close to reality as possible. Therefore, we use the physics-based UAV racing game within Sim4CV [28], which features a photo-realistic and customizable racing area in the form of a stadium based on a three-dimensional (3D) scanned real-world location. This ensures minimal discrepancy when transitioning from the simulated to a real-world scenario in the future. The concept of generating synthetic clones of real-world data for deep learning purposes has been adopted in previous work [8]. Also, it has become popular recently to use video game engines [6, 40] to generate photo-realistic simulations for training autonomous agents.

Combining the realistic physics and graphics of a game engine with a real-world 3D scan should make the transfer much simpler, and fine-tuning on some real-world data may suffice if a sufficiently robust policy was trained in simulation. A key requirement for generalization is the DNN's ability to learn the appearance of gates and cones in the track within a complexly textured and dynamic environment. In the simulated environment, we have the opportunity to fully customize the race track, including using different textures (e.g. grass, snow, and dirt), gates (different shapes and appearance), and lighting. This will make the trained network more robust and will enable transfer to the real world via domain randomization [39].

Our autonomous racing UAV approach goes beyond simple pattern detection and instead learns a full end-to-end system to fly the UAV through a racing course. It is similar in spirit to learning an end-to-end driving policy for a car [3], but comes with additional challenges. The proposed network extends the complexity of previous work to the control of a six degrees of freedom (6-DoF) flying system which is able to traverse tight spaces and make sharp turns at very high speeds (a task that cannot be performed by a ground vehicle). Our imitation learning based approach simultaneously addresses both problems of perception and control as the UAV navigates through the course.
Contributions.
Our specific contributions are as follows.

(1) We show that the challenging task of UAV racing can be learned in an end-to-end fashion in simulation, and we both demonstrate and quantify the positive impact of using viewpoint augmentation for increased robustness. Experiments show that our trained network can outperform several baselines and fly more consistently than the pilots on whose data it was trained.

(2) To facilitate the training, parameter tuning, and evaluation of deep networks on this type of simulated data, we provide a full integration between the simulator and an end-to-end deep learning pipeline (based on TensorFlow). Similar to other deep networks trained for game play, our integration will allow the community to fully explore many scenarios and tasks that go far beyond UAV racing in a rich and diverse photo-realistic gaming environment (e.g. obstacle avoidance and trajectory planning).

(3) We integrate a photo-realistic UAV racing simulation environment based on a real-world counterpart, which can be easily customized to build increasingly challenging racing courses and enables realistic UAV physical behavior. Logging video data from the UAV's point-of-view and pilot controls is seamless and can be used to effortlessly generate large-scale training data for AI systems targeting UAV flying in particular and autonomous vehicles in general (e.g. self-driving cars).
2 Related Work

In this section, we put our proposed methodology into context, focusing on the most related previous work.
Learning to Navigate.
Navigation has traditionally been approached by either employing supervised learning (SL) methods [3, 4, 17, 29, 35, 38, 42] or reinforcement learning (RL) methods [21, 24, 25, 32, 33, 43]. Furthermore, combinations of the two have been proposed in an effort to leverage advantages of both techniques, e.g. for increasing sample efficiency for RL methods [1, 5, 9, 10, 20]. For the case of controlling physics-driven vehicles, SL can be advantageous when acquiring labeled data is not too costly or inefficient, and it has proven relatively successful in the field of autonomous driving, among other applications, in recent years [3, 4, 42]. However, the use of neural networks for SL in autonomous driving goes back to much earlier work [29, 35].

In the work of Bojarski et al. [3], a deep neural network (DNN) is trained to map recorded camera views to 3-DoF steering commands (steering wheel angle, throttle, and brake). Seventy-two hours of human-driven training data was tediously collected from a forward-facing camera and augmented with two additional views to provide data for simulated drifting and corrective maneuvering. The simulated and on-road results of this pioneering work demonstrate the ability of a DNN to learn (end-to-end) the control process of a self-driving car from raw video data.

Similar to our work but for cars, Chen et al. [4] use TORCS (The Open Racing Car Simulator) [45] to train a DNN to drive at casual speeds through a course and properly pass or follow other vehicles in its lane. This work builds off earlier work using TORCS, which focused on keeping the car on a track [17]. In contrast to our work, the vehicle controls to be predicted in the work of Chen et al. [4] are limited, since only a small discrete set of expected control outputs is available: turn-left, turn-right, throttle, and brake. Recently, TORCS has also been successfully used in several RL approaches for autonomous car driving [18, 21, 24]; however, in these cases, RL was used to teach the agent to drive specific tracks or all available tracks rather than learning to drive never before seen tracks.

Loquercio et al. [22] trained a network on autonomous car datasets and then deployed it to control a drone. For this, they used full supervision by providing image and measured steering angle pairs from pre-collected datasets, and by collecting their own dataset containing image and binary obstacle indication pairs. While they demonstrate an ability to transfer successfully to other environments, their approach does not model and exploit the full six degrees of freedom available. It also focuses on slow and safe navigation, rather than optimizing for speed as is the case for racing. Finally, with their network being fairly complex, they report an inference speed of 20 fps (CPU) for remote processing, which is more than three times lower than the estimated frame rate for our proposed method when running on-board, and more than 27 times lower compared to our method running remotely on a GPU.
In the work of Smolyanskiy et al. [42], a DNN is trained (in an SL fashion and from real data captured from a head-mounted camera) to navigate a UAV through forest trails and avoid obstacles. Similar to previous work, the expected control outputs of the network are discrete and very limited (simple yaw movements): turn-left, go-straight, or turn-right. Despite showing relatively promising results, the trained network leads to a slow, non-smooth (zig-zag) trajectory at a fixed altitude above the ground. It is worthwhile to note that indoor UAV control using DNNs has also been recently explored [1, 16, 41].
Importance of Exploration in Supervised Learning.
In imitation learning [14], the 'expert' training set used for SL is augmented and expanded, so as to combine the merits of both exploitation and exploration. In many sequential decision making tasks, of which autonomous vehicle control is one, this augmentation becomes necessary to train an AI system (e.g. a DNN) that can recover from mistakes. In this sense, imitation learning with augmentation can be crudely seen as a supervision-guided form of RL. For example, a recent imitation learning method called DAgger (Dataset Aggregation) [38] demonstrated a simple way of incrementally augmenting ground-truth sequential decisions to allow for further exploration, since the learner is trained on the aggregate dataset and not only the original expert one. This method was shown to outperform state-of-the-art AI methods on a 3D car racing game (Super Tux Kart), where the control outputs are again 3-DoF. Other imitation learning approaches [20] have reached a similar conclusion, namely that a trajectory optimizer can help guide a sub-optimal learning policy towards the optimal one. Inspired by the above work, our proposed method also exploits similar concepts for exploration. In the simulator, we are able to automatically and effortlessly generate a richly diverse set of image and control pairs that can be used to train a UAV to robustly and reliably navigate through a racing course.
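For reference, the DAgger loop [38] described above can be summarized in a few lines of Python. The names `env`, `expert_policy`, `rollout`, and `train` are illustrative placeholders standing in for components the reader would supply, not part of any released codebase:

```python
# Minimal sketch of the DAgger (Dataset Aggregation) loop [38].
# expert_policy, rollout, and train are user-supplied callables.
def dagger(env, expert_policy, rollout, train, n_iterations, horizon):
    """Iteratively aggregate expert-labeled states visited by the learner."""
    dataset = []                 # aggregated (observation, expert action) pairs
    policy = expert_policy       # iteration 0: behave like the expert
    for _ in range(n_iterations):
        # Roll out the *current* learner so it visits its own mistakes.
        observations = rollout(env, policy, horizon)
        # Label every visited state with the expert's corrective action.
        dataset += [(obs, expert_policy(obs)) for obs in observations]
        # Retrain on the aggregate dataset, not just the expert trajectories.
        policy = train(dataset)
    return policy
```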
Simulation.
As mentioned earlier, generating diverse 'natural' training data for sequential decision making through SL is tedious. Generating additional data for exploration purposes (i.e. in scenarios where both input and output pairs have to be generated) is much more so. Therefore, a lot of attention from the community is being given to simulators (or games) as a source of this data. In fact, a broad range of work has exploited them recently for these types of learning, namely in animation and motion planning [10-12, 15, 19, 21, 43], scene understanding [2, 31], pedestrian detection [23], and identification of 2D/3D objects [13, 26, 34]. For instance, the authors of [15] used Unity, a video game engine similar to Unreal Engine, to teach a bird how to fly in simulation.

Moreover, there is another line of work that uses hardware-in-the-loop (HIL) simulation. Examples include JMAVSim [36, 44], which was used to develop and evaluate controllers, and RotorS [7], which was used to study visual servoing. The visual quality of most HIL simulators is very basic and far from photo-realistic, with the exception of AirSim [40]. While there are multiple established simulators such as RealFlight, FlightGear, or X-Plane for simulating aerial platforms, they have several limitations. In contrast to Unreal Engine, advanced shading and post-processing settings are not available, and the selection of assets and textures is limited. Recent work [6, 8, 28, 37, 40] highlights how modern game engines can be used to generate photo-realistic training datasets and pixel-accurate segmentation masks. The goal of this work is to build an automated UAV flying system (based on imitation learning) that can relatively easily be transitioned from a simulated world to the real one. Therefore, we choose Sim4CV [27, 28] as our simulator, which uses the open source game engine UE4 and provides a full software-in-the-loop UAV simulation. The simulator also provides a lot of flexibility in terms of assets, textures, and communication interfaces.
3 Methodology

The fundamental modules of our proposed system are summarized in Figure 2, which represents the end-to-end dataset generation, learning, and evaluation process. In what follows, we provide details for each of these modules, namely how datasets are automatically generated within the simulator, how our proposed DNN is designed and trained, and how the learned DNN is evaluated.

Fig. 2: Description of the pipeline of our DNN imitation learning system. After recording flights of human pilots, we improve important model parameters like the network architecture, the number of augmented views, and the appropriate control compensation for them in an iterative process.
3.1 Dataset Generation

Our simulation environment allows for the automatic generation of customizable datasets that can be used for various learning tasks related to UAVs. In the following, we elaborate on our setup for building a large-scale dataset specific to UAV racing.
UAV Flight Simulation.
The core of the system is our UE4-based simulator. It is built on top of the open source UE4 project for computer vision called Sim4CV [28]. Several changes were made to adapt the simulator for training our proposed racing DNN. First, we replaced the UAV with the 3D model and specifications of a racing quadcopter (see Figure 3). We retuned the PID controller of the UAV to be more responsive and to function in a racing mode, where altitude control and stabilization are still enabled but with much higher rates and steeper pitch and roll angles. In fact, this is now a popular racing mode available on consumer UAVs, such as the DJI Mavic. The simulator frame rate is locked at 60 fps, and at every frame a log is recorded with UAV position, orientation, velocity, and stick inputs from the pilot. To accommodate realistic input, we integrated the same UAV transmitter that would be used in real-world racing scenarios.

Fig. 3: The 3D model of the racing UAV in the simulator, based on a well-known 250-class design known within the racing community as the Hornet.

Following paradigms set by UAV racing norms, each racing course/track in our simulator comprises a sequence of gates connected by uniformly spaced cones. The track has a timing system that records the time between gates, the lap time, and the completion time of the race. The gates have their own logic to detect whether the UAV has passed through the gate in the correct direction. This allows us to trigger both the start and the end of the race, as well as determine the number of gates traversed by the UAV. These metrics (time and percentage of gates passed) constitute the overall per-track performance of a pilot, be it a human or a DNN.
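As a concrete illustration of the per-frame log described above, the sketch below writes one CSV row per simulator frame. The field names and the `FlightLogger` class are our own illustrative choices, not part of Sim4CV's actual API:

```python
import csv
from dataclasses import dataclass, astuple

# Illustrative per-frame record matching the quantities logged in the paper:
# UAV position, orientation, velocity, and the four pilot stick inputs.
@dataclass
class FrameLog:
    t: float                               # simulation time [s] (frame / 60.0)
    x: float; y: float; z: float           # UAV position [m]
    roll: float; pitch: float; yaw: float  # orientation [deg]
    vx: float; vy: float; vz: float        # velocity [m/s]
    throttle: float; elevator: float; aileron: float; rudder: float  # sticks in [-1, 1]

class FlightLogger:
    """Appends one CSV row per frame at the simulator's fixed 60 fps."""
    def __init__(self, path):
        self.file = open(path, "w", newline="")
        self.writer = csv.writer(self.file)

    def log(self, record: FrameLog):
        self.writer.writerow(astuple(record))
```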
Automatic Track Generation.
We developed a graphical track editor in which a user can draw a 2D sketch of the overhead view of the track. Subsequently, the 3D track is automatically generated and integrated into the timing system. With this editor, we created eleven tracks: seven for training, and four for testing and evaluation. Each track is defined by gate positions and track lanes delineated by uniformly spaced racing cones distributed along the splines connecting adjacent gates. We design the tracks such that they are similar to what racing professionals are accustomed to, and such that they offer enough diversity to enable network generalization to unseen tracks. To avoid user bias in designing the race tracks, we use images collected from the internet and trace their contours in the editor to create uniquely stylized tracks. Please refer to Figure 4 for an overhead view of all these tracks.

Fig. 4: The seven training tracks (left) and the four evaluation tracks (right). Gates are marked in red.
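To make the cone placement step concrete, the following sketch distributes cones at uniform spacing between consecutive gates. The real editor uses splines connecting the gates, so the straight-line interpolation and the default spacing here are simplifying assumptions:

```python
import numpy as np

def place_cones(gate_positions, spacing=3.0):
    """Distribute cones at (approximately) uniform spacing along the lane
    connecting consecutive gates.

    gate_positions: sequence of (x, y) gate centers in meters (overhead view).
    spacing: desired distance between adjacent cones in meters (assumed value).
    """
    cones = []
    for a, b in zip(gate_positions[:-1], gate_positions[1:]):
        a, b = np.asarray(a, float), np.asarray(b, float)
        length = np.linalg.norm(b - a)
        n = max(int(length // spacing), 1)
        for i in range(1, n):          # skip the endpoints, where the gates sit
            cones.append(a + (b - a) * i / n)
    return np.array(cones)

# Example: cones along a simple three-gate segment.
print(place_cones([(0, 0), (30, 0), (30, 40)]).shape)
```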
Acquiring Ground-truth Pilot Data.
The simulation environment allows us to log the images rendered from the UAV camera point-of-view and the UAV flight controls from the transmitter. We record human pilot input from a Taranis flight transmitter integrated into the simulator through a joystick. This input is solicited from three pilots with different skill levels: novice (lacking any flight experience), intermediate (a moderately experienced pilot), and expert (a professional, competitive racing pilot). The pilots are given the opportunity to fly through the seven training tracks as many times as needed, until they successfully complete the tracks at their best time while passing through all gates. For the evaluation tracks, the pilots are allowed to fly the course only as many times as needed to complete the entire course without crashing. We automatically score pilot performance based on lap time and percentage of gates traversed.
Data Augmentation.
As mentioned earlier, robust imitation learning requires the augmentation of these ground-truth logs with synthetic ones, generated at a user-defined set of UAV offset positions and orientations and accompanied by the corresponding controls needed to correct for these offsets. Assigning corrective controls to the augmented data is quite complex in general, since they depend on many factors, including the current UAV velocity, its relative position on the track, its weight, and its current attitude. While it is possible to get this data in simulation, it is very difficult to obtain in the real world in real-time. Therefore, we employ a fairly simple but effective model to determine these augmented controls that also scales to real-world settings. We add or subtract a corrective value to the pilot roll and yaw stick inputs for each position or orientation offset that is applied. For rotational offsets, we do not only apply a yaw correction but also couple it to roll. This allows us to compensate for the UAV's inertia, which produces a motion component in the previous direction of travel.
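The corrective-control model just described can be written compactly as below. The gain values are illustrative assumptions for the sketch, not the constants used in the paper:

```python
def corrective_controls(roll_stick, yaw_stick,
                        roll_offset_cm, yaw_offset_deg,
                        k_roll=0.004, k_yaw=0.01, k_couple=0.005):
    """Corrective stick inputs for a synthetically offset camera view.

    roll_stick, yaw_stick: the pilot's original stick inputs in [-1, 1].
    roll_offset_cm: lateral displacement of the synthetic view (positive = right).
    yaw_offset_deg: rotational displacement of the synthetic view (positive = right).
    k_*: correction gains; these values are made up for illustration.
    """
    # Steer back toward the original trajectory: a view shifted right
    # needs a leftward roll correction, and vice versa.
    roll = roll_stick - k_roll * roll_offset_cm
    # A rotated view needs a counter-rotating yaw correction...
    yaw = yaw_stick - k_yaw * yaw_offset_deg
    # ...coupled with a roll correction, to counter the inertia-induced
    # motion component along the previous direction of travel.
    roll -= k_couple * yaw_offset_deg
    return max(-1.0, min(1.0, roll)), max(-1.0, min(1.0, yaw))
```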
track | duration (sec) | original | total
track01 | 69.8 | 4.2K | 29.3K
track02 | 100.4 | 6.0K | 42.2K
track03 | 83.1 | 5.0K | 35.0K
track04 | 97.7 | 5.9K | 41.0K
track05 | 99.8 | 6.0K | 42.0K
track06 | 115.4 | 6.9K | 48.5K
track07 | 98.3 | 5.9K | 41.2K
total | 664.5 | 39.9K | 279.1K

Table 1: Overview of the image-control dataset generated from two laps of flying (by the intermediate pilot) through each of the training tracks. The 'duration' column shows the total time taken by the pilot to successfully fly two laps through the track (i.e. passing through all the gates). We also record the number of images rendered from the pilot's trajectory in the simulator ('original'), along with the total number of images used for training when data augmentation is applied ('total'). For this augmentation, we use the following default settings: a symmetric pair of roll offsets (in cm, one to each side) and yaw offsets of ±15° and ±30°.

Training Data Set.
We summarize the details of the data generated from all training tracks in Table 1. It is clear that the augmentation increases the size of the original dataset by approximately seven times. Each pilot flight leads to a large number of image-control pairs (both original and augmented) that will be used to train the UAV to robustly recover from possible drift along each training track, as well as generalize to unseen evaluation tracks. Details of how our proposed DNN architecture is designed and trained are provided in Section 3.2. In general, more augmented data should improve UAV flight performance, assuming that the control mapping and original flight data are noise-free. However, in many scenarios this is not the case, so we find that there is a limit after which augmentation does not help (or even slightly degrades) explorative learning. Empirical results validating this observation are detailed in Section 4, where we also show the effects of training with different flying styles. For this dataset, we choose to use the intermediate pilot, who tends to follow the track most precisely, striking a good trade-off between style of flight and speed.

Since the logs can be replayed at a later time in the simulator, we can augment the dataset further by changing environmental conditions, including lighting, cone spacing or appearance, and other environmental dynamics (e.g. clouds), but we do not explore these capabilities in this work.

3.2 DNN Model and Training

As is the case for DNN-based solutions to other tasks, a careful construction of the training set is a key requirement for robust and effective DNN training. We dedicate seven racing tracks, with their corresponding image-control pairs logged from human pilot runs and appropriate augmentation, for training. Please refer to Section 3.1 for details about data collection and augmentation. In the following, we provide a detailed description of the learning strategy used to train our DNN, and of its network architecture and design. We also explore some of the inner workings of one of the trained DNNs to shed light on how this network is solving the problem of automated UAV racing.
Network Architecture.
To train a DNN to predict the stick inputs controlling the UAV from images, we choose a regression network architecture similar to the one used by Bojarski et al. [3]; however, we make changes to accommodate the complexity of the task at hand and to improve robustness in training. Our DNN architecture is shown in Figure 5. The network consists of eight layers: five convolutional and three fully-connected. Since we implicitly want to localize the track and gates, we use striding in the convolutional layers instead of (max) pooling.

Fig. 5: Our network architecture takes an image of shape 320x180 and regresses to the control outputs throttle (T), elevator (E), aileron (A), and rudder (R).
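As a rough illustration of such an architecture: the sketch below instantiates a 5-conv/3-dense regression network with strided convolutions instead of pooling. The filter counts, kernel sizes, and strides are assumptions patterned after Bojarski et al. [3], not the paper's exact configuration:

```python
import tensorflow as tf

# Hypothetical instantiation of the 5-conv / 3-dense regression network.
# Filter counts, kernel sizes, and strides are assumed (patterned after [3]).
def build_racing_net():
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(180, 320, 3)),  # H x W x RGB
        tf.keras.layers.Conv2D(24, 5, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(36, 5, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(48, 5, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dropout(0.5),                # dropout ratio from the paper
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(4),  # throttle, elevator, aileron, rudder
    ])
```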
We arrived at this compact network architecture by running extensive validation experiments. Our final architecture strikes a reasonable tradeoff between computational complexity and predictive performance. This careful design makes the proposed DNN architecture feasible for real-time applications on embedded hardware (e.g. the Nvidia TX1, or the recent Nvidia TX2), unlike previous architectures [3] if they use the same input size. In Table 2, we show the evaluation time on, and technical details of, the NVIDIA Titan X, and how it compares to an NVIDIA TX-1. Based on [30], we expect our network to still run at real-time speed, with over 60 frames per second, on this embedded hardware.
 | NVIDIA Titan X | NVIDIA TX-1
CUDA cores | 3,840 | 256
Boost clock | 1,582 MHz | 998 MHz
VRAM | 12 GB | 4 GB
Memory bandwidth | 547.7 GB/s | 25.6 GB/s
Evaluation (ours) | 556 fps (ref) | 64.6 fps
Table 2: Comparison of the NVIDIA Titan X and the NVIDIA TX-1. The performance of the TX-1 is approximated according to [30].
Implementation Details.
The DNN is given a single RGB image with a 320x180 pixel resolution as input and is trained to regress the four stick inputs that control the UAV, using a standard L2 loss and a dropout ratio of 0.5. We find that the relatively high input resolution (i.e. higher network capacity), as compared to related methods [3, 42], is useful for learning this more complicated maneuvering task and enhances the network's ability to look further ahead. This affords the network more of the robustness needed for long-term trajectory stability. On the other hand, we found no noticeable gain when training on even higher resolutions during initial experiments. At our proposed resolution, our network still shows real-time capabilities even when deployed on-board (Table 2), marking a convincing solution to the resolution-speed trade-off. For training, we exploit a standard stochastic gradient descent (SGD) optimization strategy (namely Adam) in TensorFlow. As such, one instance of our DNN can be trained to convergence on our dataset in less than two hours on a single GPU.

In contrast to other work where the frame rate is sampled down to 10 fps or lower [3, 4, 42], our racing environment is highly dynamic (with tight turns, high speed, and low inertia of the UAV), so we use a frame rate of 60 fps. This allows the UAV to be very responsive and move at high speeds, while maintaining a level of smoothness in controls. An alternative approach for temporally smooth controls is to include historic data in the training process (e.g. adding the previous controls as input to the DNN). This can make the network more complex, harder to train, and less responsive in the highly dynamic racing environment, where many time-critical decisions have to be made within a couple of frames (about 30 ms). Therefore, we find the high learning frame rate of 60 fps a good trade-off between smooth controls and responsiveness.
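A minimal training setup consistent with these details might look as follows. The batch size, learning rate, and epoch count are assumed values, and the random arrays stand in for the real image-control dataset; `build_racing_net` is the hypothetical architecture sketched above:

```python
import numpy as np
import tensorflow as tf

# Hypothetical training setup: Adam + L2 (mean squared error) regression.
model = build_racing_net()  # architecture sketch from above
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="mse")

# Stand-ins for the real dataset: frames scaled to [0, 1], sticks in [-1, 1].
images = np.random.rand(256, 180, 320, 3).astype("float32")
controls = np.random.uniform(-1, 1, (256, 4)).astype("float32")  # T, E, A, R
model.fit(images, controls, batch_size=64, epochs=2, validation_split=0.1)
```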
Network Visualization.

After training our DNN to convergence, we visualize how parts of the network behave. Figure 6 shows some feature maps in different layers for the same input image. Note how the filters have learned to extract all necessary information in the scene (i.e. gates and cones), while higher-level filters do not respond to other parts of the environment. Although the feature map resolution becomes very low in the higher DNN layers, the feature map in the fifth convolutional layer is interesting, as it marks the top, left, and right of parts of a gate with just a single activation each. This clearly demonstrates that our DNN is learning semantically intuitive features for the task of UAV racing.

Fig. 6: Visualization of feature maps at different convolutional layers in our trained network. Note the high activations in semantically meaningful image regions for the task of UAV racing, namely the gates and cones.
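Feature maps like those in Figure 6 can be pulled out of a trained Keras model with a few lines. The sketch below assumes the hypothetical architecture defined earlier and a random stand-in image instead of a rendered frame:

```python
import tensorflow as tf

# Build a probe model exposing every convolutional activation of the
# (hypothetical) network, then run one image through it.
model = build_racing_net()  # assume trained weights would be loaded here
conv_outputs = [l.output for l in model.layers
                if isinstance(l, tf.keras.layers.Conv2D)]
probe = tf.keras.Model(inputs=model.inputs, outputs=conv_outputs)

image = tf.random.uniform((1, 180, 320, 3))  # stand-in for a rendered frame
feature_maps = probe(image)
for i, fmap in enumerate(feature_maps, start=1):
    print(f"conv{i}: {fmap.shape}")  # e.g. (1, 88, 158, 24) for conv1
```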
Reinforcement vs. Imitation Learning.
Our simulation environment can also lend itself to training networks using reinforcement learning. This type of learning does not specifically require supervised pilot information, as it searches for an optimal policy that leads to the highest eventual reward (e.g. highest percentage of gates traversed or lowest lap time). Recent methods have made use of reinforcement learning to learn simpler tasks without supervision [5]; however, they require very long training times (up to several weeks) and a much faster simulator (1,000 fps is possible in simple non-photo-realistic games). For UAV racing, the required task is more involved, and since the intent is to transfer the learned network into the real world, a (slower) photo-realistic simulator is mandatory. Because of these two constraints, we decided to train our DNN using imitation learning instead of reinforcement learning.
4 Evaluation

We create four testing tracks based on well-known race tracks found in TORCS and Gran Turismo. We refer to Figure 4 for an overhead view of these tracks. Since the tracks must fit within the football stadium environment, they are scaled down, leading to much sharper turns and shorter straightaways, with the UAV reaching top speeds of over 100 km/h. Therefore, the evaluation tracks are significantly more difficult than they may have been intended in their original racing environments. We rank the four tracks in terms of difficulty, ranging from easy (track 1) and medium (track 2) to hard (track 3) and very hard (track 4). For all the following evaluations, both the trained networks and the human pilots are tasked to fly two laps in the testing tracks and are scored based on the total gates they fly through and their overall lap time. Obviously, the testing/evaluation tracks are never seen in training, neither by the human pilot nor by the DNN.
Experimental Setup.
In order to evaluate the performance of a trained DNN in real-time at 60 fps, we establish a TCP socket connection between the UE4 simulator and the Python wrapper (TensorFlow) executing the DNN. In doing so, the simulator continuously sends rendered UAV camera images across TCP to the DNN, which in turn processes each image individually to predict the next UAV stick inputs (flight controls) that are fed back to the UAV in the simulator using the same connection. Another advantage of this TCP connection is that the DNN prediction can run on a separate system from the one running the simulator. We expect that this versatile and multi-purpose interface between the simulator and the DNN framework will enable opportunities for the research community to further develop DNN solutions, not only for the task of automated UAV navigation (using imitation learning) but also for the more general task of vehicle maneuvering and obstacle avoidance (possibly using other forms of learning, including RL).
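A minimal client implementing this loop could look like the following. The wire format (a 4-byte length prefix, raw RGB bytes in, four little-endian floats out) and the port are assumptions for illustration, not the simulator's documented protocol:

```python
import socket
import struct
import numpy as np

# Hypothetical control loop: receive frames from the simulator over TCP,
# run the network, send four stick values back. Framing is assumed.
def recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("simulator closed the connection")
        buf += chunk
    return buf

model = build_racing_net()  # sketch from Section 3.2, weights assumed loaded
sock = socket.create_connection(("127.0.0.1", 9999))
while True:
    size = struct.unpack("<I", recv_exact(sock, 4))[0]
    frame = np.frombuffer(recv_exact(sock, size), np.uint8)
    image = frame.reshape(1, 180, 320, 3).astype(np.float32) / 255.0
    throttle, elevator, aileron, rudder = model(image).numpy()[0]
    sock.sendall(struct.pack("<4f", throttle, elevator, aileron, rudder))
```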
Effects of Exploration.
We find exploration to be the predominant factor influencing network performance. As mentioned earlier, we augment the pilot flight data with offsets and corresponding corrective controls. We conduct a grid search to find a suitable degree of augmentation and to analyze the effect it has on overall UAV racing performance. To do this, we define two sets of offset parameters: one that acts as a horizontal offset (roll-offset) and one that acts as a rotational offset (yaw-offset). Table 3 shows how the racing accuracy (percentage of gates traversed) varies with different sets of these augmentation offsets across the four testing tracks. It is clear that increasing the number of rendered images with yaw-offset has the greatest impact on performance. While it is possible for the DNN to complete tracks without being trained on roll-offsets, this is not the case for yaw-offsets. However, the significant gain from adding rotated camera views saturates quickly, and at a certain point the network does not benefit from more extensive augmentation. Therefore, we found four yaw-offsets to be sufficient. Including camera views with horizontal shifts is also beneficial, since the network is better equipped to recover once it is about to leave the track on straights. We found two roll-offsets to be sufficient to ensure this. In the rest of our experiments, we use the following augmentation setup in training: one horizontal roll-offset to each side of the trajectory and the rotational yaw-offset set {-30°, -15°, 15°, 30°}.

Table 3: Effect of data augmentation in training on overall UAV racing performance. By augmenting the original flight logs with data captured at more offsets (roll and yaw) from the original trajectory, along with their corresponding corrective controls, our UAV DNN learns to traverse almost all the gates of the testing tracks, since it has learned to correct for exploratory maneuvers. Yaw settings are abbreviated as [min:increment:max] intervals ([None], [-20:20:20], [-30:15:30], [-30:10:30], [-30:5:30]) and crossed with the roll offsets [cm] and the resulting number of cameras. After a sufficient amount of augmentation, no additional benefit is realized in improved racing performance.

Comparison to State-of-the-Art.
We compare our racing DNN to the two most related and recent network architectures: the first denoted as Nvidia (for self-driving cars [3]) and the second as MAV (for forest-path navigating UAVs [42]). While the domains of these works are similar, it should be noted that flying a high-speed racing UAV is a particularly challenging task, especially since the effect of inertia is much more significant and there are more degrees of freedom. For a fair comparison, we scale our dataset to the same input dimensionality and re-train each of the three networks. We then evaluate each of the trained models on the task of UAV racing on the testing tracks. It is noteworthy that both the Nvidia and MAV networks (in their original implementations) use data augmentation as well, so when training, we assume the augmentation choice to be appropriate for the given method and maintain the same strategy. The exact angular offsets of the two views used in the Nvidia network are not reported, so we employ a symmetric rotational offset pair to augment its data. As for the MAV network, we use the same augmentation parameters proposed in the paper, i.e. one rotational offset to each side. We modified the MAV network to allow for a regression output instead of its original classification (left, center, and right controls). This is necessary since our task requires fine-grained controls, and predicting discrete controls leads to very inadequate UAV racing performance.
Pilot / Network | Track 1 | Track 2 | Track 3 | Track 4
Human-Novice | 1.00 | 1.00 | 0.95 | 0.94
Human-Intermediate | 1.00 | 1.00 | 1.00 | 1.00
Human-Expert | 1.00 | 1.00 | 1.00 | 1.00
Ours-Intermediate | 1.00 | 1.00 | 1.00 | 1.00
Ours-Expert | 1.00 | 0.95 | 0.91 | 0.78
Nvidia-Intermediate | 0.17 | 1.00 | 0.82 | 0.83
Nvidia-Intermediate++ | 1.00 | 1.00 | 0.82 | 1.00
MAV-Intermediate | 0.50 | 0.75 | 0.73 | 0.83
MAV-Intermediate++ | 0.42 | 1.00 | 0.91 | 0.78
Table 4: Accuracy scores of different pilots and networks on the four test tracks, averaged over multiple runs. The accuracy score represents the percentage of completed racing gates. The networks ending with ++ are variants of the original networks trained with our augmentation strategy.

It should be noted that in the original implementation of the Nvidia network [3] (based on real-world driving data), it was realized that additional augmentation was needed for reasonable automatic driving performance after the real-world data was acquired. To avoid recapturing the data again, synthetic viewpoints (generated by interpolation) were used to augment the training dataset, which introduced undesirable distortions. By using our simulator, we are able to extract any number of camera views without distortions. Therefore, we also wanted to gauge the effect of additional augmentation on both the Nvidia and MAV networks when they are trained using our default augmentation setting: one roll-offset to each side and the yaw-offset set {-30°, -15°, 15°, 30°}. We denote these trained networks as Nvidia++ and MAV++.

Table 4 summarizes the results of these different network variants on the testing tracks. The results indicate that the performance of the original Nvidia and MAV networks suffers from insufficient data augmentation; they clearly do not make use of enough exploration. These networks improve in performance when our proposed data augmentation scheme is used. Regardless, our proposed DNN outperforms the Nvidia and MAV networks, although this improvement is less significant when more data augmentation is used or more exploratory behavior is learned. Unlike the other networks, our DNN performs consistently well on all the unseen tracks, owing to the sufficient network capacity needed to learn this complex task.

Fig. 7: Best lap times of human pilots and networks trained on different flight styles. If no lap time is displayed, the pilot was not able to complete the course because the UAV crashed. See the text for a more detailed description.
Pilot Diversity & Human vs. DNN.
In this section, we investigate how the flying style of a pilot affects the network that is being learned. To this end, we compare the performance of the different networks on the testing set when each of them is trained with flight data captured from pilots of varying flight expertise (intermediate and expert). Table 4 summarizes the lap time and accuracy of these networks. Clearly, the pilot flight style can significantly affect the performance of the learned network. Figure 7 shows that there is a high correlation, regarding both performance and flying style, between the pilot used in training and the corresponding learned network.

The trained networks clearly resemble the flying style and also the proficiency of their human trainers. Thus, our network that was trained on flights of the intermediate pilot achieves high accuracy but is quite slow, just as the expert network sometimes misses gates but achieves very good lap and overall times. Interestingly, although the networks perform similarly to their pilots, they fly more consistently, and therefore tend to outperform the human pilot with regards to overall time over multiple laps. This is especially true for our intermediate network. Both the intermediate and the expert networks clearly outperform the novice human pilot, who takes several hours of practice and several attempts to reach performance similar to the networks'. Even our expert pilots were not always able to complete the test tracks on the first attempt.

While the percentage of passed gates and the best lap time give a good indication of performance, they do not convey any information about the style of the pilot. To this end, we visualize the performance of human pilots and the trained networks by plotting their trajectories onto the track (from a 2D overhead viewpoint). We encode their speeds as a heatmap, where blue corresponds to the minimum speed and red to the maximum speed. Figure 8 shows a collection of heatmaps revealing several interesting insights.

Fig. 8: Visualization of human and automated UAV flights super-imposed onto a 2D overhead view of different tracks. The color illustrates the instantaneous speed of the UAV, from blue (slow) to red (fast).

Firstly, despite showing variation, the networks clearly imitate the style of the pilot they were trained on. This is especially true for the intermediate proficiency level, while the expert network sometimes overshoots, which causes it to lose speed and therefore not match the speed pattern as well as the intermediate one does. We also note that the performance gap between network and human increases as the expertise of the pilot increases. Note that the flight path of the expert network is less smooth and less centered than those of its human counterpart and the intermediate network, respectively. This is partly due to the fact that the networks were only trained on two laps of flying across seven training tracks. An expert pilot has a lot more training than that and is therefore able to generalize much better to unseen environments. However, the experience advantage of the intermediate pilot over the network is much smaller, and therefore the performance gap is smaller as well. We also show the performance of our novice pilot on these tracks. While the intermediate pilots accelerate on straights, the novice is not able to control speed that well, creating a very narrow velocity range. Albeit flying quite slowly, the pilot also gets off track several times. This underlines the complexity of UAV racing, especially for inexperienced pilots.
5 Conclusion

In this paper, we proposed a robust imitation learning based framework to teach an unmanned aerial vehicle (UAV) to fly through challenging racing tracks at very high speeds. To do this, we trained a deep neural network (DNN) to predict the necessary UAV controls from raw image data, grounded in a photo-realistic simulator that also allows for realistic UAV physics. Training is made possible by logging data (rendered images from the UAV and stick controls) from human pilot flights while they maneuver the UAV through racing tracks. This data is augmented with sufficient camera view offsets to teach the network how to recover from flight mistakes, which proves to be crucial during long-term flight. Extensive experiments demonstrate that our trained network (when such sufficient data augmentation is used) outperforms state-of-the-art methods and flies more consistently than many human pilots.

In the future, we aim to transfer the network we trained in our simulator to the real world, to compete against human pilots in real-world racing scenarios. Although we accurately modeled the simulated racing environment, the differences in appearance between the simulated and real world will need to be reconciled. Therefore, we will investigate deep transfer learning techniques to enable a smooth transition between the simulator and the real world. If such a transfer is successful, our simulator would be able to act as an unlimited, highly customizable, and free source of ground-truth data.

Despite our finding that temporally aware architectures were not a good choice for the low-latency UAV racing task, we expect them to be useful when approaching general UAV navigation and complex obstacle avoidance. We plan to more broadly evaluate our method and the choice of augmentation strategy on tasks with differing challenges. More generally, since our developed simulator and its seamless interface to deep learning platforms are generic in nature, we expect that this combination will open up unique opportunities for the community to develop better automated UAV flying methods, to expand its reach to other fields of autonomous navigation such as self-driving cars, and to benefit other interesting perception-based tasks such as obstacle avoidance.
Acknowledgments.
This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.

References
1. Andersson, O., Wzorek, M., Doherty, P.: Deep learning quadcopter control via risk-aware active learning. In: Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017), San Francisco, February 4-9 (2017)
2. Battaglia, P.W., Hamrick, J.B., Tenenbaum, J.B.: Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences 110(45), 18327-18332 (2013). https://doi.org/10.1073/pnas.1306572110
3. Bojarski, M., Testa, D.D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K.: End to end learning for self-driving cars. CoRR abs/1604.07316 (2016), http://arxiv.org/abs/1604.07316
4. Chen, C., Seff, A., Kornhauser, A., Xiao, J.: DeepDriving: Learning affordance for direct perception in autonomous driving. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). pp. 2722-2730. IEEE Computer Society, Washington, DC, USA (2015). https://doi.org/10.1109/ICCV.2015.312
5. Dosovitskiy, A., Koltun, V.: Learning to act by predicting the future. ICLR, abs/1611.01779 (2017), http://arxiv.org/abs/1611.01779
6. Dosovitskiy, A., Ros, G., Codevilla, F., López, A., Koltun, V.: CARLA: An open urban driving simulator. In: Conference on Robot Learning (CoRL) (2017)
7. Furrer, F., Burri, M., Achtelik, M., Siegwart, R.: RotorS: A modular Gazebo MAV simulator framework. Studies in Computational Intelligence, vol. 625. Springer, Cham (2016)
8. Gaidon, A., Wang, Q., Cabon, Y., Vig, E.: Virtual worlds as proxy for multi-object tracking analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4340-4349 (2016)
9. Guo, X., Singh, S., Lee, H., Lewis, R., Wang, X.: Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. pp. 3338-3346. NIPS'14, MIT Press, Cambridge, MA, USA (2014), http://dl.acm.org/citation.cfm?id=2969033.2969199
10. Ha, S., Liu, C.K.: Iterative training of dynamic skills inspired by human coaching techniques. ACM Trans. Graph. 34(1), 1:1-1:11 (Dec 2014). https://doi.org/10.1145/2682626
11. Hämäläinen, P., Eriksson, S., Tanskanen, E., Kyrki, V., Lehtinen, J.: Online motion synthesis using sequential Monte Carlo. ACM Trans. Graph. 33(4), 51:1-51:12 (Jul 2014). https://doi.org/10.1145/2601097.2601218
12. Hämäläinen, P., Rajamäki, J., Liu, C.K.: Online control of simulated humanoids using particle belief propagation. ACM Trans. Graph. 34(4), 81:1-81:13 (Jul 2015). https://doi.org/10.1145/2767002
13. Hejrati, M., Ramanan, D.: Analysis by synthesis: 3D object recognition by object reconstruction. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. pp. 2449-2456 (June 2014). https://doi.org/10.1109/CVPR.2014.314
14. Hussein, A., Gaber, M.M., Elyan, E., Jayne, C.: Imitation learning: A survey of learning methods. ACM Comput. Surv. 50(2), 21:1-21:35 (Apr 2017). https://doi.org/10.1145/3054912
15. Ju, E., Won, J., Lee, J., Choi, B., Noh, J., Choi, M.G.: Data-driven control of flapping flight. ACM Trans. Graph. 32(5), 151:1-151:12 (Oct 2013). https://doi.org/10.1145/2516971.2516976
16. Kim, D.K., Chen, T.: Deep neural network for real-time autonomous indoor navigation. CoRR abs/1511.04668 (2015)
17. Koutník, J., Cuccu, G., Schmidhuber, J., Gomez, F.: Evolving large-scale neural networks for vision-based reinforcement learning. In: Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation. pp. 1061-1068. GECCO '13, ACM, New York, NY, USA (2013). https://doi.org/10.1145/2463372.2463509
18. Koutník, J., Schmidhuber, J., Gomez, F.: Online evolution of deep convolutional network for vision-based reinforcement learning. pp. 260-269. Springer International Publishing, Cham (2014). https://doi.org/10.1007/978-3-319-08864-8_25
19. Lerer, A., Gross, S., Fergus, R.: Learning physical intuition of block towers by example (2016), arXiv:1603.01312v1
20. Levine, S., Koltun, V.: Guided policy search. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1-9. PMLR, Atlanta, Georgia, USA (17-19 Jun 2013), http://proceedings.mlr.press/v28/levine13.html
21. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. ICLR, abs/1509.02971 (2016), http://arxiv.org/abs/1509.02971
22. Loquercio, A., Maqueda, A.I., del Blanco, C.R., Scaramuzza, D.: DroNet: Learning to fly by driving. IEEE Robotics and Automation Letters 3(2), 1088-1095 (2018)
23. Marín, J., Vázquez, D., Gerónimo, D., López, A.M.: Learning appearance in virtual scenarios for pedestrian detection. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 137-144 (June 2010). https://doi.org/10.1109/CVPR.2010.5540218
24. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning. pp. 1928-1937 (2016)
25. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518(7540), 529-533 (Feb 2015), http://dx.doi.org/10.1038/nature14236
26. Movshovitz-Attias, Y., Sheikh, Y., Naresh Boddeti, V., Wei, Z.: 3D pose-by-detection of vehicles via discriminatively reduced ensembles of correlation filters. In: Proceedings of the British Machine Vision Conference. BMVA Press (2014). https://doi.org/10.5244/C.28.53
27. Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. pp. 445-461. Springer International Publishing, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_27
28. Müller, M., Casser, V., Lahoud, J., Smith, N., Ghanem, B.: Sim4CV: A photo-realistic simulator for computer vision applications. Int. J. Comput. Vision 126(9), 902-919 (Sep 2018). https://doi.org/10.1007/s11263-018-1073-7
29. Muller, U., Ben, J., Cosatto, E., Flepp, B., Cun, Y.L.: Off-road obstacle avoidance through end-to-end learning. In: Weiss, Y., Schölkopf, P.B., Platt, J.C. (eds.) Advances in Neural Information Processing Systems 18, pp. 739-746. MIT Press (2006)
30. Nvidia: GPU-based deep learning inference: A performance and power analysis. Whitepaper (November 2015)
31. Papon, J., Schoeler, M.: Semantic pose using deep networks trained on synthetic RGB-D. CoRR abs/1508.00835 (2015), http://arxiv.org/abs/1508.00835
32. Peng, X.B., Berseth, G., van de Panne, M.: Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2016) 35(4) (2016)
33. Peng, X.B., Berseth, G., Yin, K., van de Panne, M.: DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2017) 36(4) (2017)
34. Pepik, B., Stark, M., Gehler, P., Schiele, B.: Teaching 3D geometry to deformable part models. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. pp. 3362-3369 (June 2012). https://doi.org/10.1109/CVPR.2012.6248075
35. Pomerleau, D.A.: ALVINN: An autonomous land vehicle in a neural network. In: Advances in Neural Information Processing Systems 1, pp. 305-313. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1989), http://dl.acm.org/citation.cfm?id=89851.89891
36. Prabowo, Y.A., Trilaksono, B.R., Triputra, F.R.: Hardware-in-the-loop simulation for visual servoing of fixed wing UAV. In: Electrical Engineering and Informatics (ICEEI), 2015 International Conference on. pp. 247-252 (Aug 2015). https://doi.org/10.1109/ICEEI.2015.7352505
37. Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: ECCV (2016)
38. Ross, S., Gordon, G.J., Bagnell, J.A.: No-regret reductions for imitation learning and structured prediction. CoRR abs/1011.0686 (2010), http://arxiv.org/abs/1011.0686
39. Sadeghi, F., Levine, S.: CAD2RL: Real single-image flight without a single real image (2017)
40. Shah, S., Dey, D., Lovett, C., Kapoor, A.: AirSim: High-fidelity visual and physical simulation for autonomous vehicles (2017)
41. Shah, U., Khawad, R., Krishna, K.M.: DeepFly: Towards complete autonomous navigation of MAVs with monocular camera. In: Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing. pp. 59:1-59:8. ICVGIP '16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/3009977.3010047
42. Smolyanskiy, N., Kamenev, A., Smith, J., Birchfield, S.: Toward low-flying autonomous MAV trail navigation using deep neural networks for environmental awareness. arXiv e-prints (May 2017)
43. Tan, J., Gu, Y., Liu, C.K., Turk, G.: Learning bicycle stunts. ACM Trans. Graph. 33(4), 50:1-50:12 (Jul 2014). https://doi.org/10.1145/2601097.2601121
44. Trilaksono, B.R., Triadhitama, R., Adiprawita, W., Wibowo, A., Sreenatha, A.: Hardware-in-the-loop simulation for visual target tracking of octorotor UAV. Aircraft Engineering and Aerospace Technology 83(6), 407-419 (2011). https://doi.org/10.1108/00022661111173289
45. Wymann, B., Dimitrakakis, C., Sumner, A., Espié, E., Guionneau, C., Coulom, R.: TORCS, the open racing car simulator.