Learning to Navigate Autonomously in Outdoor Environments: MAVNet
Saumya Kumaar, Arpit Sangotra, Sudakshin Kumar, Mayank Gupta, Navaneethkrishnan B, S N Omkar
Abstract— In the modern era of automation and robotics, autonomous vehicles are currently the focus of academic and industrial research. With the ever increasing number of unmanned aerial vehicles involved in civilian and commercial activities, there is an increased need for autonomy in these systems too. Because of government guidelines on the operating ceiling of civil drones, road-tracking based navigation is garnering interest. To achieve the above mentioned tasks, we propose an imitation learning based, data-driven solution to UAV autonomy for navigating through city streets by learning to fly by imitating an expert pilot. Derived from classic image classification algorithms, our classifier has been constructed as a fast 39-layered Inception model that evaluates the presence of roads using tomographic reconstructions of the input frames. Based on the Inception-v3 architecture, our system performs better in terms of processing complexity and accuracy than many existing models for imitation learning. The data used for training the system was captured from the drone by flying it in and around urban and semi-urban streets, by experts having at least 6-8 years of flying experience. Permissions were taken from the required authorities, who made sure that minimal risk (to pedestrians) was involved in the data collection process. With the extensive amount of drone data that we collected, we have been able to navigate successfully through roads without crashing or overshooting, with an accuracy of 98.44%. The computational efficiency of MAVNet enables the drone to fly at speeds of up to 6 m/sec. We present these results in this paper and compare them with other state-of-the-art methods of vision and learning based navigation.
I. INTRODUCTION
With the advent of drone technology in the civilian market, drones have been used in a variety of applications such as agricultural crop monitoring, surveillance, emergency first-response and delivery [1], [2], [3], [4]. Autonomous navigation of such systems is of utmost importance to maximize mission efficiency and safety. The commonly used method of GPS waypoint-to-waypoint navigation [5] is not feasible in urban environments because current government regulations restrict civilian drones to altitudes lower than the average height of buildings in most cities. Navigation in such cluttered environments can be implemented with simultaneous localization and mapping (SLAM) techniques [6]. However, although SLAM techniques have shown great localization prowess, issues like inertial measurement unit (IMU) sensor noise, dynamic obstacles and sharp features due to differential lighting may cause system failure, which could prove hazardous to the general public. Further, advanced SLAM techniques for outdoor navigation require specialized equipment like LIDAR and stereo cameras, which are expensive and may not be compatible with standard off-the-shelf drones. Research on deep learning based drone navigation methods in the past decade has shown promising results. One interesting piece of research, by Kim et al. [7], proposed a deep neural network system for indoor navigation of a quadrotor drone to find a specified target. That work used a monocular camera for environment perception; the images were fed into a deep convolutional neural network and the model provided the control outputs. Another method, proposed by Gandhi et al. [8], demonstrates negative-training based navigation trained on a large crash dataset. This method essentially teaches the system how
NOT to fly. It uses a concept termed imitation learning. Imitation learning is a pedagogical approach that aims to mimic human behaviour by mapping observations to the consecutive actions performed by a human expert [9]. This method has seen numerous applications in robotics, for example robotic arm actuation for picking up items or flipping a pancake [10]. Imitation learning has also been demonstrated on quadrotor drones for navigation through forested environments by Ross et al. [11]. Their learning policy focuses on obstacle avoidance by modeling the human reactive control in the presence of tree trunks. A recent piece of research by Loquercio et al. [12], which is in line with ours, uses a convolutional neural network to navigate a drone through urban streets following basic rules like avoiding obstacles, vehicles and pedestrians. Unlike the previously stated methods, we aim to provide a high-level real-time navigational method for a drone to travel from a specified point A to point B in an urban environment by tracking the road linking the two points using the imitation learning methodology. In other words, we teach the drone how to navigate between two points by having an expert human pilot the drone between the two points. The drone estimates a mapping from an image from its monocular camera to the control command provided by an experienced human pilot. The main contributions of this paper are:
• We propose a nested convolutional network model architecture based on the Inception V3 model: Mini Aerial Vehicle Network (MAVNet), which provides five outputs, namely forward, yaw-left, yaw-right, halt and junction, the input being a feature-extracted, reconstructed tomographic video frame of size 100 × 100 pixels.
The junction flag is raised when a road junction is detected, which enables a landmark based navigational model.
• The model is benchmarked on the Udacity dataset and its performance is compared with other popularly used network architectures.
• The proposed architecture is also tested on a custom dataset of a 357 m long path. The model was used to implement the navigational system on an off-the-shelf drone, and we also show the difference between the paths followed by the expert pilot and the trained model.
The navigation system training is done with the drone's height above ground level (AGL) kept constant at 2.5 m for minimal interaction with cars or pedestrians. This is done because it has been observed that the region above roads from 2 m AGL to 10 m AGL has fewer obstacles than when flying outside the said range. A simple optical flow based collision avoidance system, such as the sketch below, can be used in conjunction with our proposed algorithm to deal with the sparsely occurring obstacles in the said range like trucks, vans, traffic lights, tree branches etc.
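The optical-flow collision check suggested above is not part of MAVNet itself; as one possible realization, the short sketch below uses OpenCV's Farneback dense flow to flag frames in which a large fraction of pixels moves quickly, which may indicate an approaching obstacle. The threshold values and the helper name are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch: a frame-to-frame optical flow check that could run
# alongside MAVNet; thresholds are arbitrary illustrative values.
import cv2
import numpy as np

def obstacle_flow_alert(prev_gray, curr_gray, mag_threshold=8.0, area_fraction=0.2):
    """Return True when a large portion of the image shows fast motion."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    magnitude = np.linalg.norm(flow, axis=2)          # per-pixel flow speed
    return float(np.mean(magnitude > mag_threshold)) > area_fraction
```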
II. METHODOLOGY
As suggested previously, in this paper we attempt to present an end-to-end solution for autonomous navigation through urban streets. The algorithm works in a way very similar to learning behaviour in human beings. When an image is captured, it is processed along with the velocity commands given; the algorithm learns a controller that maps the input key-strokes to the image frame. Once the processing is done, the MAVNet model predicts a total of 5 values per frame. The last value of the output array indicates the presence or absence of a junction. Whenever this value is 1, a counter keeps track of how many junctions have passed by counting the number of 1's. When the count reaches the required value, the drone is given a suitable yaw command (either left or right, which is not a MAVNet prediction but pre-programmed by the user), which allows it to make appropriate turns at the junctions; a minimal sketch of this loop is shown below. This landmark based method is how the human brain processes navigational information. We attempt to implement the same in unmanned aerial vehicles, a process flow of which is represented graphically in Fig. 1.
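The landmark-based loop described above can be summarized in a few lines of Python. The class ordering, the 0.5 junction threshold and the `model`/`preprocess` placeholders are assumptions for illustration; only the counting-and-turning logic follows the text.

```python
# Minimal sketch of the junction-counting navigation loop; `model` and
# `preprocess` are placeholders for the trained MAVNet and its
# Radon-based preprocessing described later in the paper.
import numpy as np

COMMANDS = ["forward", "yaw_left", "yaw_right", "halt"]   # assumed output order

def navigate(frames, model, preprocess, turn_plan):
    """turn_plan maps a junction count to a pre-programmed turn,
    e.g. {2: "yaw_left", 4: "yaw_right"}."""
    junctions_seen = 0
    was_at_junction = False
    for frame in frames:
        out = model.predict(preprocess(frame))[0]          # five values
        at_junction = out[4] > 0.5                          # last value = junction flag
        if at_junction and not was_at_junction:             # count each junction once
            junctions_seen += 1
            if junctions_seen in turn_plan:
                was_at_junction = at_junction
                yield turn_plan[junctions_seen]              # user-programmed turn
                continue
        was_at_junction = at_junction
        yield COMMANDS[int(np.argmax(out[:4]))]              # MAVNet's own command
```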
In contrast to Loquercio et al. [12], collision prediction tasks are not tackled in this research, as the height of the UAV was fixed at 2.5 m. At this height, most of the on-track vehicles and pedestrians do not interact with the drone, and any observed poles or tree branches were above the fixed height in the collected custom dataset, so the need for collision avoidance was not critical. The MAVNet model therefore predicts only the forward pitch rate, the yaw angle and the probability of the presence of a road junction in the current frame. Apart from the above mentioned outputs, we have one additional output, which we call halt, which means that the UAV has to hover in place without any movement. Although the training was not explicitly done for avoiding obstacles, the training set included no commands being given to the drone whenever there was an obstacle in the surroundings. This imparts an inherent collision avoidance capability to the drone; however, we do not report or evaluate it.
Fig. 1:
Overall Architecture of the MAVNet Prediction System.
Fig. 2:
Sample testing instances from both simulation environments. On the top, the figures indicate the GTA San Andreas environment, where our system over-fitted on the forward command. The lower portion of the figure indicates the Mario Kart environment, which performed considerably better than the previous one, primarily because more features were available in the image dataset. In both cases we collected around 400,000 data frames for training, with the logic that the more the training, the better the performance would be.
A. Simulation Environments
Prior to implementing our algorithm on a real-world interface, we created simulation environments using commercial desktop entertainment programs, namely GTA San Andreas (GTA SA) and Mario Kart. This was done because deploying the algorithm directly on a UAV without any simulation testing could pose a serious risk to passers-by, vehicles or other property. However, we did not perform any feature extraction experiments with the simulation data; we used unprocessed raw images captured by recording the screen together with the key-strokes, as sketched below. In GTA SA we collected our data on an oval shaped circuit, following a white demarcation line on the road. We trained our model on 48 video segments, each containing approximately 10,000 frames, whereas for Mario Kart we collected 120 video segments, each containing approximately 3,400 frames. In both cases the model learned to play quite similarly to the human player. With the confidence acquired from the simulations, we decided to recreate the same in real-world applications. The model simply learns how the expert flies the drone by learning the key-strokes rather than any other complex features.
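The paper does not name the tools used for recording; purely as an illustration, the sketch below captures screen frames with the mss library and the keys currently held with pynput, pairing each frame with the key state at capture time.

```python
# Illustrative screen + key-stroke recorder for the simulator experiments;
# mss and pynput are assumed stand-ins, not the authors' tooling.
import time
import numpy as np
import mss
from pynput import keyboard

pressed = set()
listener = keyboard.Listener(
    on_press=lambda k: pressed.add(getattr(k, "char", str(k))),
    on_release=lambda k: pressed.discard(getattr(k, "char", str(k))))
listener.start()

frames, labels = [], []
with mss.mss() as sct:
    monitor = sct.monitors[1]                          # primary screen
    for _ in range(300):                               # record ~10 s at 30 Hz
        frames.append(np.array(sct.grab(monitor))[:, :, :3])   # drop alpha channel
        labels.append(sorted(pressed))                 # keys held for this frame
        time.sleep(1.0 / 30)
listener.stop()
```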
Fig. 3: Training images as collected from the Udacity dataset and the Cityscapes dataset. In order to benchmark our algorithm against some state-of-the-art techniques, our MAVNet model learns the throttle and steering (as binarized left/right) commands from the Udacity dataset and the junction detection task from the Cityscapes dataset. Evaluation has been performed on both datasets with separate metrics for each.
Fig. 4:
Training images as collected in and around the campus of the Indian Institute of Science. The dataset contains a variety of junctions and marked, unmarked, structured and unstructured roads. The training is done extensively on these, and test results are presented exclusively on untrained roads. The dataset was collected during different times of the day, under different sunlight conditions.
B. Datasets and Preprocessing
There are two datasets involved in this research. One is the Udacity self-driving simulation environment. A part of this dataset (around 72,000 images) has been used for training, and testing is done on unseen roads of the test split. The learnt parameters were throttle and steering angles. The key-strokes used for turning the car in the Udacity self-driving simulator were treated as the corresponding yaw commands for the drone. This further helps in classification, as only two classes are possible for steering. However, due to the absence of road junctions in the Udacity dataset, we establish the junction metrics on the Cityscapes dataset [19] and on our own dataset, where every image has associated velocity inputs and junction tags; an illustrative label construction is sketched below. The custom database is composed of around 450,000 images, collected at three different times of the day, with varying sunlight, shadows and traffic conditions. A total of 71 flights were conducted during the course of this research.
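As an illustration of how continuous simulator controls can be collapsed into the five MAVNet classes, the sketch below reads a Udacity-style driving log; the column names, the steering dead-band and the junction flag handling are assumptions, not the authors' exact preprocessing.

```python
# Hypothetical label construction from a Udacity-style driving_log.csv;
# thresholds and column names are illustrative assumptions.
import csv

def row_to_label(steering, throttle, junction=0, steer_eps=0.05):
    """Map continuous controls to: forward, yaw_left, yaw_right, halt, junction."""
    if junction:
        return "junction"
    if steering < -steer_eps:
        return "yaw_left"
    if steering > steer_eps:
        return "yaw_right"
    return "forward" if throttle > 0 else "halt"

def load_labels(log_path="driving_log.csv"):
    with open(log_path) as f:
        return [row_to_label(float(r["steering"]), float(r["throttle"]))
                for r in csv.DictReader(f)]
```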
C. Feature Extraction
In this research we propose to extract the Radon features of the visual data. Although the sinogram so constructed could be used effectively for road segmentation, we take a step further and reconstruct the image, which helps in completely isolating the road from the rest of the environment. Fundamentally, roads have the least amount of edge energy compared to their surroundings. The Radon transformation technique has proven effective for curved segment detection in images and for reconstruction of tomographic images, as shown by Toft et al. [14]. The approach adopted for the above mentioned tasks was stated by Deans et al. [13], and is described further in this section:
1) Radon Transformation:
Let us say that $f(\mathbf{x}) = f(x, y)$ is a continuous function that is compactly supported on $\mathbb{R}^2$. The Radon transform, $R$, is then a function defined on the space of straight lines $L$ in $\mathbb{R}^2$ by the line integral along each such line:

$$R f(L) = \int_L f(\mathbf{x})\, |d\mathbf{x}| \qquad (1)$$

Significantly, any straight line $L$ can be parametrized with respect to arc length $z$ as

$$(x(z), y(z)) = \big( z \sin\alpha + s \cos\alpha,\; -z \cos\alpha + s \sin\alpha \big) \qquad (2)$$

where $s$ is the distance of the line $L$ from the origin $O$ and $\alpha$ is the angle between the normal vector to $L$ and the $x$ axis. It follows that the quantities $(\alpha, s)$ can be regarded as coordinates on the space of all such lines in $\mathbb{R}^2$, and the Radon transform can be expressed in these coordinates as

$$R f(\alpha, s) = \int_{-\infty}^{\infty} f(x(z), y(z))\, dz = \int_{-\infty}^{\infty} f\big( z \sin\alpha + s \cos\alpha,\; -z \cos\alpha + s \sin\alpha \big)\, dz$$

More generally, in the $n$-dimensional Euclidean space $\mathbb{R}^n$, the Radon transform of a compactly supported continuous function $f$ is a function $Rf$ on the space $\Sigma_n$ of all hyperplanes in $\mathbb{R}^n$. It is defined as

$$R f(\xi) = \int_{\xi} f(\mathbf{x})\, |d\mathbf{x}| \qquad (3)$$

for $\xi \in \Sigma_n$, where the integral is taken with respect to the natural hypersurface measure $d\sigma$ (generalizing the $|d\mathbf{x}|$ term from the 2-dimensional case). It is worth noting that any element of $\Sigma_n$ can be characterized as the solution locus of

$$\mathbf{x} \cdot \boldsymbol{\alpha} = s \qquad (4)$$

where $\boldsymbol{\alpha} \in S^{n-1}$ is a unit vector and $s \in \mathbb{R}$. Thus the $n$-dimensional Radon transform may be rewritten as a function on $S^{n-1} \times \mathbb{R}$ via

$$R f(\boldsymbol{\alpha}, s) = \int_{\mathbf{x} \cdot \boldsymbol{\alpha} = s} f(\mathbf{x})\, d\sigma(\mathbf{x}). \qquad (5)$$

It is also possible to generalize the Radon transform still further by integrating over the $k$-dimensional affine subspaces of $\mathbb{R}^n$. The X-ray transform is the most widely used special case of this construction, and is obtained by integrating over straight lines.
2) Reconstruction Approach – Ill-posedness: As mentioned previously, we reconstruct the image for better classification accuracy. Reconstruction produces the image (the function $f$ of the previous section) from its projection data and is fundamentally an inverse problem. The approach below is used because of its computational effectiveness for the Radon transform. The Radon transform in $n$ dimensions can be inverted by

$$c_n f = (-\Delta)^{(n-1)/2} R^{*} R f \qquad (6)$$

where

$$c_n = (4\pi)^{(n-1)/2}\, \frac{\Gamma(n/2)}{\Gamma(1/2)} \qquad (7)$$

and the power of the Laplacian $(-\Delta)^{(n-1)/2}$ is defined, if necessary, as a pseudodifferential operator through the Fourier transform:

$$\mathcal{F}\big[ (-\Delta)^{(n-1)/2} \varphi \big](\xi) = |2\pi\xi|^{\,n-1}\, \mathcal{F}\varphi(\xi) \qquad (8)$$

For computational speed and efficiency, the power of the Laplacian is commuted with the dual transform $R^{*}$ to give

$$c_n f = \begin{cases} R^{*} \dfrac{d^{\,n-1}}{ds^{\,n-1}} R f, & n \text{ odd} \\[4pt] R^{*} H_s \dfrac{d^{\,n-1}}{ds^{\,n-1}} R f, & n \text{ even} \end{cases}$$

where $H_s$ is the Hilbert transform with respect to the $s$ variable. In 2 dimensions, the operator $H_s\, d/ds$ is the familiar ramp filter of image processing. From the Fourier slice theorem and a change of variables in the integration it follows that, for a compactly supported continuous function of two variables,

$$f = \tfrac{1}{2}\, R^{*} H_s \frac{d}{ds} R f \qquad (9)$$

Therefore, in the image processing case, the original image can be regenerated from the sinogram data $Rf$ by applying a ramp filter (in the $s$ variable) and then back-projecting. As the filtering step can be performed very efficiently (for example with digital signal processing techniques) and the back-projection step is simply an accumulation of values into the individual pixels of the image, this results in a highly computationally efficient, and hence widely used, algorithm. Explicitly, the inversion formula obtained by the latter method is

$$f(\mathbf{x}) = (2\pi)^{-n} (-1)^{(n-1)/2} \int_{S^{n-1}} \frac{\partial^{\,n-1}}{\partial s^{\,n-1}} R f(\boldsymbol{\alpha}, \boldsymbol{\alpha} \cdot \mathbf{x})\, d\boldsymbol{\alpha} \qquad (10)$$

if $n$ is odd, and

$$f(\mathbf{x}) = (2\pi)^{-n} (-1)^{n/2} \int_{-\infty}^{\infty} \frac{1}{q} \int_{S^{n-1}} \frac{\partial^{\,n-1}}{\partial s^{\,n-1}} R f(\boldsymbol{\alpha}, \boldsymbol{\alpha} \cdot \mathbf{x} + q)\, d\boldsymbol{\alpha}\, dq \qquad (11)$$

if $n$ is even. When this generic Radon transform-and-reconstruction is applied to our road dataset, the road junctions appear as extensive black regions in the images, whereas the other side-paths appear as white. The tomographically reconstructed images (as shown in Fig. 6) are then fed to the neural network for classification.
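A minimal sketch of this transform-and-reconstruct preprocessing, assuming scikit-image's radon and iradon (ramp-filtered back-projection as in Eq. (9)) stand in for the authors' implementation; the angle set, normalization and 100 × 100 resizing are illustrative choices.

```python
# Hedged sketch: Radon transform followed by ramp-filtered back-projection,
# producing 100 x 100 tomographic input frames of the kind MAVNet consumes.
import numpy as np
from skimage.color import rgb2gray
from skimage.transform import radon, iradon, resize

def tomographic_features(frame_rgb, n_angles=180, out_size=(100, 100)):
    image = rgb2gray(frame_rgb)
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(image, theta=theta)                        # Eq. (1): projections
    recon = iradon(sinogram, theta=theta, filter_name="ramp")   # Eq. (9): filtered back-projection
    recon = (recon - recon.min()) / (np.ptp(recon) + 1e-8)      # normalize to [0, 1]
    return resize(recon, out_size, anti_aliasing=True)           # network-sized input
```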
D. The MAVNet Model
Originally, the Inception V3 model of Szegedy et al. [17] was tried for indoor navigation via imitation learning. However, that research was mostly focused on simulation of the setup rather than an actual real-time implementation. Based on this approach, we decided to try the same model for outdoor navigation, and interestingly we observed that Inception-v3 performs remarkably well on the custom dataset. As is commonly understood, convolutional neural networks work on the basic assumption that most low-level features of an image are local in nature, and that whatever function is applicable to one region is applicable to others as well. Before discussing the changes we made to the existing Inception-v3 architecture, consider the impact of filter size on convolutional nets. The size of the filters plays an important role: a larger filter could miss low-level features in the images and skip some important details, whereas a smaller filter can prevent that but results in more confusion due to the increased amount of information. The Inception model has already been benchmarked as a computationally fast classification architecture and has been proven to outperform many standard classification techniques. In our experiments with filter sizes in the Inception-v3 model, we observed that it relies on a multitude of 1 × 1 convolutions interleaved with the larger filters, and MAVNet retains this structure in a smaller, 39-layer form (see Fig. 5).
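Since the layer-by-layer specification of MAVNet is not reproduced here, the following Keras sketch only illustrates the general idea: an Inception-style block mixing 1 × 1 convolutions with larger filters, operating on the 100 × 100 reconstructed frames and ending in a five-way softmax. All filter counts and the number of blocks are assumptions, not the published architecture.

```python
# Toy Inception-style classifier over 100 x 100 tomographic frames with the
# five MAVNet outputs; this is an illustrative sketch, not the real MAVNet.
from tensorflow.keras import layers, models

def inception_block(x, f1, f3, f5, fp):
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(f5, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])

def build_toy_mavnet(input_shape=(100, 100, 1), n_classes=5):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, strides=2, activation="relu")(inputs)
    x = inception_block(x, 32, 48, 16, 16)
    x = layers.MaxPooling2D(2)(x)
    x = inception_block(x, 64, 96, 32, 32)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)   # forward, yaw-left, yaw-right, halt, junction
    return models.Model(inputs, outputs)
```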
III. HARDWARE IMPLEMENTATION
As a proof of real-time implementation, we performed our experiments on the
Parrot Bebop 2, an off-the-shelf quadrotor commonly available in the market. The MAVNet prediction model sends only high-level velocity and yaw commands to the machine for execution; a hedged sketch of such a command bridge is given below. Since the Bebop 2 does not support high-level on-board computing, the MAVNet model runs on a portable system with an Intel Core i3 processor at 2.1 GHz, without any GPU support.
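The paper does not state which SDK bridges MAVNet's predictions to the Bebop 2; purely as an assumption, the sketch below forwards the high-level commands through the third-party pyparrot library. Pitch and yaw values are percentages of the drone's configured limits and are illustrative.

```python
# Hypothetical command bridge from MAVNet classes to a Parrot Bebop 2 via
# pyparrot; the SDK choice and the command magnitudes are assumptions.
from pyparrot.Bebop import Bebop

def execute(bebop, command, duration=0.3):
    if command == "forward":
        bebop.fly_direct(roll=0, pitch=20, yaw=0, vertical_movement=0, duration=duration)
    elif command == "yaw_left":
        bebop.fly_direct(roll=0, pitch=0, yaw=-50, vertical_movement=0, duration=duration)
    elif command == "yaw_right":
        bebop.fly_direct(roll=0, pitch=0, yaw=50, vertical_movement=0, duration=duration)
    else:                                   # "halt": hover in place
        bebop.smart_sleep(duration)

bebop = Bebop()
if bebop.connect(10):                       # up to 10 connection retries
    bebop.safe_takeoff(10)
    execute(bebop, "forward")
    bebop.safe_land(10)
    bebop.disconnect()
```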
IV. EVALUATION AND RESULTS
All comparative metrics have first been evaluated on the Udacity dataset.
Fig. 5: Our MAVNet architecture.
Fig. 6:
The figure shows the tomographic reconstruction of the images after applying the Radon transform. Interestingly, the top-left image represents a road junction. As is clearly seen, road junctions can be classified based on the amount of road present in the frame: a widespread distribution of road pixels (minimal edge energies) usually indicates the presence of a junction. The rest of the images indicate roads that the drone learns to follow. These images support our claim of using X-ray vision for navigation.
Fig. 7:
Sample images from the testing database of our custom dataset. This is one of the untrained roads where we tested our algorithm. The Bebop tries to isolate the roads from the environment and follow them, assisted by the junction disambiguation property of MAVNet. The GPS plot for the same stretch is depicted in Fig. 9.
The images were selected from the custom dataset with T-junctions and multi-junctions to evaluate the accuracy of the junction prediction task. The evaluation metrics suggested by Loquercio et al. [12] and Ross et al. [11] have been used in this paper for comparison with the state-of-the-art techniques. We use Explained Variance (EVA), a metric that is helpful in quantifying the quality of a regressor. It is defined as

$$\mathrm{EVA} = \frac{\mathrm{Var}[Y_{\mathrm{true}} - Y_{\mathrm{pred}}]}{\mathrm{Var}[Y_{\mathrm{true}}]} \qquad (12)$$

Another metric used in this paper is the F-Measure, as suggested by Fritsch et al. [13]. It is a measure of the quality of the classifier used in the system; we use a standard value of 0.9 for $\beta$.

$$\text{F-Measure} = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R} \qquad (13)$$

Moreover, to evaluate the performance of MAVNet on untrained roads of the custom dataset, we use the sample Pearson correlation coefficient, which, for two datasets, is defined as

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (14)$$

Architectures        | F-Measure | Accuracy | RMSE  | EVA   | FPS Achieved | Layers | Parameters
Random Baseline      | 0.33      |          |       |       |              |        |
ResNet-50 [14]       | 0.925     | 97.15%   | 0.091 | 0.766 | 9            | 50     | 2.6 ×
VGG-16 [15]          | 0.852     | 93.14%   | 0.111 | 0.722 | 12           | 16     | 7.5 ×
AlexNet [16]         | 0.845     | 85.37%   | 0.353 | 0.778 | 14           | 8      | 6.0 ×
DroNet [12]          | 0.922     | 96.73%   | 0.098 | 0.721 | 20           | 8      | 3.2 ×
MAVNet               |           |          |       |       |              |        |
TABLE I:
The table compares the performances of various state-of-the-art methods for autonomous navigation. The metrics of EVA and RMSE have been evaluated on the prediction of the movement commands (pitch and yaw), whereas the F-1 and Accuracy have been evaluated on the junction prediction task. Loquercio et al. [12], in contrast, generated a probabilistic map of collision detection. In our case, junction disambiguation has been tackled as a binary classification problem, so the F-1 measures have been calculated based on the number of correct hits, correct misses and incorrect hits. The junction disambiguation problem corresponds to the Cityscapes dataset. As can be seen, the MAVNet model performs equally well, if not better. Furthermore, the results mentioned in this table are the outcomes of tests on untrained roads. The computation time for each image was observed to be 0.03225 seconds, which amounts to roughly 31 FPS if real-time evaluation is considered.
The metrics for the above evaluations are provided in the image captions of Figs. 8, 9 and 10; a numerical sketch of Eqs. (12)-(14) is given below. The various outputs are individually correlated with the respective training datasets. Moreover, the IMU data plots from expert flights and model flights have been recorded to confirm the accuracy of the MAVNet model. The model takes around 0.03031 seconds per frame, which allows the algorithm to run at a maximum rate of 33 FPS (on the custom dataset), compared to the standard camera input rate of 30 Hz, allowing the drone to achieve a maximum forward velocity of 6 m/sec on straight roads. The straight line policy adopted by Loquercio et al. [12] provided an insight into the maximum distance driven by DroNet without crashing or going off-track. MAVNet was able to perform a continuous autonomous stretch, identifying junctions and taking decisions on-the-go. The GPS plot of the same is shown in Fig. 11.
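For concreteness, the evaluation metrics of Eqs. (12)-(14) can be computed directly with NumPy as below; the toy arrays are made-up values used only to show the calls, and beta = 0.9 follows the text.

```python
# Numerical sketch of EVA (Eq. 12), the F-Measure (Eq. 13) and the sample
# Pearson correlation coefficient (Eq. 14); data values are illustrative.
import numpy as np

def explained_variance(y_true, y_pred):
    return np.var(y_true - y_pred) / np.var(y_true)                              # Eq. (12)

def f_measure(precision, recall, beta=0.9):
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)   # Eq. (13)

def pearson_r(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / (np.sqrt(np.sum(xc**2)) * np.sqrt(np.sum(yc**2)))   # Eq. (14)

yaw_true = np.array([0.10, -0.20, 0.00, 0.30])
yaw_pred = np.array([0.12, -0.18, 0.05, 0.28])
print(explained_variance(yaw_true, yaw_pred))
print(f_measure(precision=0.95, recall=0.90))
print(pearson_r(yaw_true, yaw_pred))
```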
Fig. 8:
Figure indicates the position of the drone along the X-axis in the body frame. The position of the drone is recorded by the user on an untrained road (which constitutes the ground truth). MAVNet performs its predictions on the same road, and the correlation coefficient is calculated to be 0.9977, using Eq. (14).
Fig. 9:
Figure indicates the normalized orientation of the drone along the Z-axis in the body frame. The degree of yaw to be executed to keep the drone in the middle of the road is recorded by the user on an untrained road (which constitutes the ground truth). MAVNet performs its predictions on the same road, and the correlation coefficient is calculated to be 0.8534, using Eq. (14).
V. SUPPLEMENTARY MATERIAL
The trained model and the final codes for data collection from every simulation environment have been made available at the following GitHub repository: https://github.com/sudakshin/imitation_learning
VI. DISCUSSIONS AND CONCLUSION
Although we have not integrated GPS positioning with our current architecture, GPS fusion could be done to tag the road junctions. In case they are missed by MAVNet, the program could read the tag information and execute the turn. However, the fundamental problem associated with
Fig. 10:
Figure indicates the velocity of the drone along the X-axis in the body frame. The maximum speed up to which the drone can move without overshooting the road is recorded by the user on an untrained road (which constitutes the ground truth). MAVNet performs its predictions on the same road.
Fig. 11:
Figure indicates the GPS plots of the paths taken by the UAV. Blue represents the path when the UAV was flown by the expert (which constitutes the ground truth), whereas the red path indicates the path predicted by MAVNet. An interesting thing to note here is that the path shown in the figure does not belong to the training dataset, and the total length of this stretch is approximately 357 m.