Aerial Images Processing for Car Detection using Convolutional Neural Networks: Comparison between Faster R-CNN and YoloV3
Adel Ammar, Anis Koubaa, Mohanned Ahmed, Abdulrahman Saad
Prince Sultan University, Saudi Arabia. Gaitech Robotics, China. CISTER, INESC-TEC, ISEP, Polytechnic Institute of Porto, Portugal.

Abstract
In this paper, we address the problem of car detection from aerial images using Convolutional Neural Networks (CNN). This problem presents additional challenges as compared to car (or any object) detection from ground images, because features of vehicles in aerial images are more difficult to discern. To investigate this issue, we assess the performance of two state-of-the-art CNN algorithms, namely Faster R-CNN, which is the most popular region-based algorithm, and YOLOv3, which is known to be the fastest detection algorithm. We analyze two datasets with different characteristics to check the impact of various factors, such as the UAV's altitude, camera resolution, and object size. A total of 39 training experiments were conducted to account for the effect of different hyperparameter values. The objective of this work is to conduct the most robust and exhaustive comparison between these two cutting-edge algorithms on the specific domain of aerial images. By using a variety of metrics, we show that YOLOv3 yields better performance in most configurations, except that it exhibits a lower recall and less confident detections when object sizes and scales in the testing dataset differ largely from those in the training dataset.
I. INTRODUCTION

Unmanned aerial vehicles (UAVs) are nowadays a key enabling technology for a large number of applications such as surveillance [1], tracking [2], disaster management [3], smart parking [4], and Intelligent Transport Systems, to name a few. Thanks to their versatility, UAVs offer unique capabilities to collect visual data using high-resolution cameras from different locations, angles, and altitudes. These capabilities make it possible to build rich datasets of images that can be analyzed to extract useful information serving the purpose of the underlying applications. UAVs present several advantages in the context of aerial imagery collection, including a large field of view, high spatial resolution, flexibility, and high mobility. Although satellite imagery also provides a bird's-eye view of the earth, UAV-based aerial imagery presents several advantages as compared to satellite imagery. In fact, UAV imagery has a much lower cost and provides more up-to-date views (many satellite maps are several months old and do not show recent changes). Besides, it can be used for real-time image/video stream analysis in a much more affordable manner. Aerial images also have different resolutions as compared to satellite images. For example, in our experiments, we reached a resolution of 2 cm/pixel (and can reach even finer resolutions) with aerial images using typical DJI drones, whereas satellite images have resolutions of about 15 cm/pixel, as in the dataset described in [5], and can be even coarser.

With the current hype of artificial intelligence and deep learning, there has been an increasing trend since 2012 (the birth of AlexNet) to use Convolutional Neural Networks (CNNs) to extract information from images and video streams. While CNNs have been proven to be the best approach for classification, detection, and semantic segmentation of images, the processing and analysis of aerial images is generally more challenging than for classical (ground-level) images. In fact, given that UAVs can fly at high altitudes, feature extraction and detection become more difficult. This is due to the small size of the objects and also to the angle of view.

Recently, several research works have addressed the problem of car detection from aerial images. In our previous work [1], we also compared YOLOv3 and Faster R-CNN for detecting cars in aerial images. However, we only used one small dataset of low-altitude UAV images collected at the premises of Prince Sultan University, whereas the altitude at which the image is taken plays an essential role in the accuracy of the detection. Besides, we did not profoundly analyze advanced and essential performance metrics such as Intersection over Union (IoU) and the mean Average Precision (mAP). In this paper, we address this gap, and we consider multiple datasets with different configurations. Our objective is to present a more comprehensive analysis of the comparison between these two state-of-the-art approaches.

In [6], the authors described the challenges faced with aerial images for car detection, namely the problem of small objects and complex backgrounds. They addressed the problem by proposing a Multi-task Cost-sensitive Convolutional Neural Network based on Faster R-CNN.
Some other researchers addressed the problem by applying deep learning techniques to aerial images in different contexts, such as object detection and classification [7,8], semantic segmentation [9]–[11], and generative adversarial networks (GANs) [12].

In this paper, we propose a comprehensive comparative study between two state-of-the-art deep learning algorithms, namely Faster R-CNN [13] and YOLOv3 [14], for car detection from aerial images. The contributions of this paper are twofold. First, we consider two different datasets of aerial images for the car detection problem, with different characteristics, to investigate the impact of dataset properties on the performance of the algorithms. In addition, we provide a thorough comparison between the two most sophisticated categories of CNN approaches for object detection: Faster R-CNN, which is a region-based approach proposed in 2017, and YOLOv3, which is the latest version of the You Only Look Once approach proposed by Joseph Redmon in 2018.

The remainder of the paper is organized as follows. Section II discusses the related works that dealt with car detection and aerial image analysis using CNNs, and some comparative studies applied to other object detection problems. Section III sets forth the theoretical background of the two algorithms. Section IV describes the datasets and the obtained results. Finally, Section V draws the main conclusions of this study.

II. RELATED WORKS

Various techniques have been proposed in the literature to solve the problem of car detection in aerial images and similar related issues. The main challenges are the small size and the large number of objects to detect in aerial views, which may lead to information loss when performing convolution operations, as well as the difficulty of discerning features because of the angle of view. In this scope, Chen et al. [15] applied a technique based on a hybrid deep convolutional neural network (HDNN) and a sliding window search to solve the vehicle detection problem on Google Earth images. The maps of particular layers of the CNN are split into blocks of variable field sizes, to be able to extract features at various scales. They obtained an improved detection rate compared to the traditional deep architectures at that time, but at the expense of a high execution time (7 s per image, using a GPU).

Following a different approach, Ammour et al. [16] used a pre-trained CNN coupled with a linear support vector machine (SVM) classifier to detect and count cars in high-resolution UAV images of urban areas. First, the input image is segmented into candidate regions using the mean-shift algorithm. Then, the VGG16 [17] CNN model is applied to windows that are extracted around each candidate region to
generate descriptive features. Finally, these features are classified using a linear SVM binary model. This technique achieved state-of-the-art performance on a reduced testing dataset (5 images containing 127 car instances), but it still falls short of real-time processing, mainly due to the high computational cost of the mean-shift segmentation stage.

Xi et al. [4] also addressed the problem of vehicle detection in aerial images. They proposed a multi-task approach based on the Faster R-CNN algorithm, to which they added a cost-sensitive loss. The main idea is to subdivide the object detection task into simpler subtasks with enlarged objects, thus improving the detection of the small objects that are frequent in aerial views. Besides, the cost-sensitive loss gives more importance to the objects that are difficult to detect or occluded because of complex backgrounds, and aims at improving the overall performance. Their method outperformed state-of-the-art techniques on their own specific private dataset, which was collected from surveillance cameras placed on top of buildings surrounding a parking lot. However, their approach has not been tested on other datasets, nor on UAV images.

In a similar application, Kim et al. [18] compared various implementations of YOLO, SSD, R-CNN, R-FCN and SqueezeDet on the problem of person detection, trained and tested on their own in-house dataset composed of images captured by surveillance cameras in retail stores. They found that YOLOv3 (with 416 input size) and SSD (VGG-500) [19] provide the best tradeoff between accuracy and response latency.

Some recent works have addressed the problem of domain adaptation for CNNs on aerial images. In fact, one of the weaknesses of CNNs is that they are very prone to domain changes: if a network is trained on objects from a certain domain, its accuracy is likely to decrease considerably when the same objects provided as input come from a different domain. To address this issue, the authors of [12] proposed a technique for domain adaptation based on generative adversarial networks (GANs) to improve the semantic segmentation accuracy of urban environments in aerial images. They achieved an increase in semantic segmentation accuracy of up to 52% when performing domain adaptation from the Potsdam to the Vaihingen domain.

In [20], the authors proposed another technique for domain adaptation, based on Active Learning, applied to animal detection in wildlife using aerial images. The core idea consists in using a Transfer Sampling (TS) criterion to localize animals effectively, which uses Optimal Transport to determine the regions of interest between the source and target domains.

In [21], Hardjono et al. investigated the problem of automatic vehicle counting in CCTV images collected from four datasets with various resolutions. They tested both classical image processing techniques (Background Subtraction, the Viola-Jones algorithm, and Gaussian Filters) and deep learning neural networks, namely YOLOv2 [22] and FCRN (fully convolutional regression network) [23]. Their results show that deep learning techniques yield markedly better detection results (in terms of F1 score) when applied to higher resolution datasets.

Also for the aim of car counting, Mundhenk et al. [5] built their own Cars Overhead with Context (COWC) dataset containing 32,716 unique cars and 58,247 negative targets, standardized to a resolution of 15 cm per pixel, and annotated using single pixel points. The authors used a convolutional neural network that they called ResCeption, based on Inception combined with Residual Learning. The model was able to count the number of cars in test patches with a root mean square error of 0.66, at 1.3 FPS.

Other works [24,25] focused on fine-grained car detection, where the objective is to classify cars according to vehicle models and other visual differences, with a classification accuracy that attains 87%. This classification can be used for census estimation and sociological analysis of cities and countries. Liu and Mattyus [26] also classified car orientations and types in aerial images of the city of Munich, with an accuracy of 98%, but limited the classification to two classes (car or truck).
TABLE I
COMPARISON OF OUR PAPER WITH THE RELATED WORKS. For each reference, the table lists the dataset used, the algorithms, and the main results: Mundhenk et al. [5] (COWC dataset, ResCeption), Xi et al. [6] (aerial parking lot dataset, MTCS-CNN), Chen et al. [15] (Google Earth satellite images, HDNN), Ammour et al. [16] (UAV images, pre-trained CNN + linear SVM), Hardjono et al. [21] (CCTV datasets; Background Subtraction, Viola-Jones, and YOLOv2), Benjdira et al. [1] (PSU UAV dataset, YOLOv3 and Faster R-CNN), and the present paper (Stanford and PSU UAV datasets; YOLOv3 with three input sizes and Faster R-CNN with two feature extractors).
Table I summarizes the datasets, algorithms, and results of the most similar related works on car detection, compared to the present paper. The closest work to the present study is that of Benjdira et al. [1], who presented a performance evaluation of Faster R-CNN and YOLOv3 algorithms on a reduced UAV imagery dataset of cars. The present paper is an improvement over this work in several respects:

1) We use two datasets with different characteristics for training and testing, whereas most previous works described above tested their technique on a single proprietary dataset. We show that annotation errors in the dataset have an important effect on the detection performance.

2) We tested various hyperparameter values (three different input sizes for YOLOv3, two different feature extractors for Faster R-CNN, and various values of the score threshold).

3) We conducted a more detailed comparison of the results, by showing the AP at different values of IoU thresholds, comparing the tradeoff between AP and inference speed, and calculating several new metrics that have been suggested for the COCO dataset [27].

III. THEORETICAL OVERVIEW OF FASTER R-CNN AND YOLOV3

A. Faster R-CNN
R-CNN, as coined by [28], is a convolutional neural network (CNN) combined with a region-proposal algorithm that hypothesizes object locations. It initially extracts a fixed number of regions (2000) by means of a selective search. Then it merges similar regions together, using a greedy algorithm, to obtain the candidate regions on which the object detection will be applied. Later, the same authors proposed an enhanced algorithm called Fast R-CNN [29], which uses a shared convolutional feature map that the CNN generates directly from the input image, and from which the regions of interest (RoI) are extracted. Finally, Ren et al. [13] proposed the Faster R-CNN algorithm (Figure 1), which introduced a Region Proposal Network (RPN), a dedicated fully convolutional neural network that is trained end-to-end (Figure 2) to predict both object bounding boxes and objectness scores in an almost computationally cost-free manner (around 10 ms per image). This important algorithmic change replaced the selective search algorithm, which was very computationally expensive and represented a bottleneck for previous object detection deep learning systems. As a further optimization, the RPN ultimately shares its convolutional features with the Fast R-CNN detector, after the two are first trained independently. For training the RPN, Faster R-CNN kept the multi-task loss function already used in Fast R-CNN, which is defined as follows:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \frac{\lambda}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where:
• $p_i$ is the predicted probability that an anchor $i$ in a mini-batch is an object.
• $p_i^*$ equals 1 if the anchor is positive (having either the highest IoU overlap with a ground-truth box, or an overlap higher than 0.7), and 0 if it is negative (IoU overlap lower than 0.3 with all ground-truth boxes).
• $t_i$ is the vector of coordinates of the predicted bounding box.
• $t_i^*$ is the vector of coordinates of the ground-truth box corresponding to a positive anchor.
• $L_{cls}$ is the classification log loss.
• $L_{reg}$ is the regression loss, calculated using the robust (smooth L1) loss function already used for Fast R-CNN [29].
• $N_{cls}$ and $N_{reg}$ are normalization factors.
• $\lambda$ is a balancing weight.

Fig. 1. Faster R-CNN basic architecture (from [1]).

Faster R-CNN uses three scales and three aspect ratios for every sliding position, and is translation invariant. Besides, it conserves the aspect ratio of the original image while resizing it, so that either its smallest dimension is 600 or its largest dimension is 1024.
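To make the interplay of the two terms of this loss concrete, the following minimal sketch evaluates it on a toy mini-batch of three anchors. All numerical values are illustrative, not taken from the paper's experiments, and this is not the authors' implementation; the smooth L1 form of $L_{reg}$ follows Fast R-CNN [29].

import numpy as np

def smooth_l1(x):
    """Robust (smooth L1) loss used for box regression in Fast/Faster R-CNN."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def faster_rcnn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0):
    """Multi-task loss L({p_i},{t_i}) from the equation above.

    p        : predicted objectness probabilities, shape (N,)
    p_star   : ground-truth anchor labels (1 positive, 0 negative), shape (N,)
    t, t_star: predicted / ground-truth box coordinates, shape (N, 4)
    """
    eps = 1e-12
    # Classification term: log loss over all sampled anchors.
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # Regression term: smooth L1, counted only for positive anchors (p_star == 1).
    l_reg = smooth_l1(t - t_star).sum(axis=1) * p_star
    return l_cls.sum() / n_cls + lam * l_reg.sum() / n_reg

# Toy mini-batch of 3 anchors (values are made up for illustration only).
p      = np.array([0.9, 0.2, 0.7])
p_star = np.array([1.0, 0.0, 1.0])
t      = np.array([[0.1, 0.1, 0.2, 0.2], [0, 0, 0, 0], [0.5, 0.4, 0.1, 0.0]])
t_star = np.array([[0.0, 0.1, 0.2, 0.2], [0, 0, 0, 0], [0.3, 0.4, 0.1, 0.1]])
print(faster_rcnn_loss(p, p_star, t, t_star, n_cls=3, n_reg=3))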
B. YOLOv3

Contrary to R-CNN variants, YOLO [30], an acronym for You Only Look Once, does not extract region proposals, but processes the complete input image only once, using a fully convolutional neural network that predicts the bounding boxes and their corresponding class probabilities based on the global context of the image. The first version was published in 2016 (Figure 3). Later on, in 2017, a second version, YOLOv2 [22], was proposed, which introduced batch normalization, a retuning phase for the classifier network, and dimension clusters as anchor boxes for predicting bounding boxes. Finally, in 2018, YOLOv3 [14] improved the detection further by adopting several new features:

Fig. 2. Region Proposal Network (RPN) architecture.

• Replacing the mean squared error by cross-entropy in the loss function (a numerical sketch is given at the end of this subsection). The cross-entropy loss function is calculated as follows:

$$-\sum_{c=1}^{M} \delta_{x \in c} \log(p(x \in c))$$

where $M$ is the number of classes, $c$ is the class index, $x$ is an observation, $\delta_{x \in c}$ is an indicator function that equals 1 when $c$ is the correct class for the observation $x$, and $\log(p(x \in c))$ is the natural logarithm of the predicted probability that observation $x$ belongs to class $c$.
• Using logistic regression (instead of the softmax function) for predicting an objectness score for every bounding box.
• Using a significantly larger feature extractor network with 53 convolutional layers (Darknet-53 replacing Darknet-19). It consists mainly of 3x3 and 1x1 filters, with some skip connections inspired by ResNet [31], as illustrated in Figure 4.

Contrary to Faster R-CNN's approach, each ground-truth object in YOLOv3 is assigned only one bounding box prior. These successive variants of YOLO were developed with the objective of obtaining a maximum mAP while keeping the fastest execution, which makes the algorithm suitable for real-time applications. Special emphasis has been put on execution time, so that YOLOv3 is equivalent to state-of-the-art detection algorithms like SSD [19] in terms of accuracy, but with the advantage of being three times faster [14]. Figure 5 depicts the main stages of the YOLOv3 algorithm when applied to the car detection problem.

Fig. 3. YOLO (version 1) network architecture.

Fig. 4. Darknet-53 architecture adopted by YOLOv3 (from [14]).

Variable input sizes are allowed in YOLO. We have tested the three input sizes that are usually used (as in the original YOLOv3 paper [14]): 320x320, 416x416, and 608x608. Figure 6 shows an example of the output of YOLOv3 on a sample image of our PSU dataset, which will be described in section IV-A.

To summarize, Table II compares the features and parameters of Faster R-CNN and YOLOv3. While successive optimizations and mutual inspirations have made the methodologies of the two algorithms relatively close, the main difference remains that Faster R-CNN has two separate phases for region proposals and classification (although now with shared features), whereas YOLO has always combined the classification and bounding-box regression processes.
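As referenced in the feature list above, the following minimal sketch evaluates the cross-entropy term for a single observation; the class count and probabilities are toy values, not taken from the paper.

import numpy as np

def cross_entropy(probs, true_class):
    """Cross-entropy -sum_c delta_{x in c} * log p(x in c) for one observation:
    only the predicted probability of the correct class contributes."""
    return -np.log(probs[true_class])

# Toy prediction over M = 3 classes (illustrative values only).
probs = np.array([0.7, 0.2, 0.1])
print(cross_entropy(probs, true_class=0))  # -log(0.7), approx. 0.357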
Fig. 5. Successive stages of the YOLOv3 model applied on car detection.

Fig. 6. Example of the output of the YOLOv3 algorithm, on an image of the PSU dataset.
IV. EXPERIMENTAL COMPARISON BETWEEN FASTER R-CNN AND YOLOV3

A. Datasets
In order to obtain a robust comparison, we tested the Faster R-CNN and YOLOv3 algorithms on two datasets of aerial images with completely different characteristics.

• The Stanford dataset [32] consists of a large-scale collection of aerial images and videos of a university campus containing various agents (cars, buses, bicycles, golf carts, skateboarders, and pedestrians). It was obtained using a 3DR Solo quadcopter (equipped with a 4K camera) that flew over various crowded campus scenes at an altitude of around 80 meters. It is originally composed of eight scenes, but since we are exclusively interested in car detection, we chose only the three scenes that contain the largest percentage of cars: Nexus (in which 29.51% of objects are cars), Gates (1.08%), and DeathCircle (4.71%). All other scenes contain less than 1% of cars. We used the first two scenes for training, and the third one for testing. Besides, we removed images that contain no cars. We noticed that the ground-truth bounding boxes in some images contain some mistakes (bounding boxes containing no objects) and imprecisions (many bounding boxes are much larger than the objects inside them), as can be seen in Figure 8, but we used them as they are, in order to assess the impact of annotation errors on detection performance. In fact, the Stanford Drone Dataset was not primarily designed for object detection, but for trajectory forecasting and tracking. Table III shows the number of images and instances in the training and testing datasets. The images in the selected scenes have variable sizes, as shown in Table IV, and contain cars of various sizes, as depicted in Figure 7. The average car size (calculated based on the ground-truth bounding boxes) is shown in Table VIII. The discrepancy observed between the training and testing datasets in terms of car sizes is explained by the fact that we used different scenes for the training and testing datasets, as explained above. This discrepancy will constitute an additional challenge for the considered object detection algorithms.

TABLE II
THEORETICAL COMPARISON OF FASTER R-CNN AND YOLOV3

Phases:
  YOLOv3: concurrent bounding-box regression and classification.
  Faster R-CNN: RPN + Fast R-CNN object detector.
Neural network type:
  YOLOv3: fully convolutional.
  Faster R-CNN: fully convolutional (RPN and detection network).
Backbone feature extractor:
  YOLOv3: Darknet-53 (53 convolutional layers).
  Faster R-CNN: VGG-16 or Zeiler & Fergus (ZF); other feature extractors can also be incorporated.
Location detection:
  YOLOv3: anchor-based (dimension clusters).
  Faster R-CNN: anchor-based.
Number of anchor boxes:
  YOLOv3: only one bounding-box prior for each ground-truth object.
  Faster R-CNN: 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position.
Default anchor sizes:
  YOLOv3: (10,13), (16,30), (33,23), (30,61), (62,45), (59,119), (116,90), (156,198), (373,326).
  Faster R-CNN: scales (128,128), (256,256), (512,512); aspect ratios 1:1, 1:2, 2:1.
IoU thresholds:
  YOLOv3: one (at 0.5).
  Faster R-CNN: two (at 0.3 and 0.7).
Loss function:
  YOLOv3: binary cross-entropy loss.
  Faster R-CNN: multi-task loss (log loss for classification, smooth L1 for regression).
Input size:
  YOLOv3: three possible input sizes (320x320, 416x416, and 608x608).
  Faster R-CNN: conserves the aspect ratio of the original image; either the smallest dimension is 600 or the largest dimension is 1024.
Momentum:
  YOLOv3: default value 0.9. Faster R-CNN: default value 0.9.
Weight decay:
  YOLOv3: default value 0.0005. Faster R-CNN: default value 0.0005.
Batch size:
  YOLOv3: default value 64. Faster R-CNN: default value 1.

TABLE III
STANFORD DATASET

                          Training set   Testing set   Total
Number of images          6872           1634          8506
Percentage                80.8%          19.2%         100%
Number of car instances   74,826         8,131         82,957

TABLE IV
IMAGE SIZE IN THE STANFORD DATASET

Size        Number of images
1409x1916   1634
1331x1962   1558
1330x1947   1557
1411x1980   1494
1311x1980   1490
1334x1982   295
1434x1982   142
1284x1759   138
1425x1973   128
1184x1759   70

• The PSU dataset was collected from two sources: an open dataset of aerial images available on Github [33], and our own images acquired by flying a 3DR Solo drone equipped with a GoPro Hero 4 camera in an outdoor environment at the PSU parking lot. The drone recorded videos from which frames were extracted and manually labeled. Since we are only interested in a single class, images with no cars were removed from the dataset. The training/testing split was made randomly. Figure 9 shows a sample image of the PSU dataset, and Table V shows the number of images and instances in the training and testing datasets. The dataset thus obtained contains images of different sizes, as shown in Table VI, and contains cars of various sizes, as depicted in Figure 7. The average car size (calculated based on the ground-truth bounding boxes) in the training and testing datasets is shown in Table VIII.
TABLE V
PSU DATASET

                          Training set   Testing set   Total
Number of images          218            52            270
Percentage                80.7%          19.3%         100%
Number of car instances   3,364          738           4,102

TABLE VI
IMAGE SIZE IN THE PSU DATASET

Size        Number of images
1920x1080   172
1764x430    26
684x547     21
1284x377    20
1280x720    19
4000x2250   12
TABLE VII
DETAILS OF THE MAIN EXPERIMENTS. THE DEFAULT CONFIGURATION OF FASTER R-CNN ALLOWS A VARIABLE INPUT SIZE THAT CONSERVES THE ASPECT RATIO OF THE IMAGE. IN THIS CASE, THE INPUT SIZE SHOWN IS AN AVERAGE.

TABLE VIII
AVERAGE CAR WIDTH AND LENGTH (IN PIXELS) IN THE PSU AND STANFORD DATASETS, CALCULATED BASED ON THE GROUND-TRUTH BOUNDING BOXES.

Dataset             Average car width   Average car length
PSU training        48                  36
PSU testing         55                  46
Stanford training   72                  152
Stanford testing    60                  90
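The averages in Table VIII can be reproduced from the annotation files with a few lines. The following sketch assumes boxes are stored as (xmin, ymin, xmax, ymax) tuples, which may differ from the actual annotation format of the two datasets.

import numpy as np

def average_box_size(boxes):
    """Average width and height (in pixels) of ground-truth boxes
    given as (xmin, ymin, xmax, ymax) rows."""
    boxes = np.asarray(boxes, dtype=float)
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    return widths.mean(), heights.mean()

# Illustrative boxes only; the real values come from the dataset annotations.
print(average_box_size([[10, 20, 60, 55], [100, 40, 150, 90]]))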
B. Hyperparameters
The main hyperparameter for the YOLO network is the input size, for which we tested three values (320x320, 416x416, and 608x608), as explained in section III-B. On the other hand, the main hyperparameter for Faster R-CNN is the feature extractor. We tested two different feature extractors: Inception-v2 [34] (also called BN-Inception in the literature [35]) and Resnet50 [36]. As explained in section III-A, the default setting of Faster R-CNN conserves the aspect ratio of the original image while resizing it, so that one of its dimensions is either 600 or 1024. But to be able to fairly compare its precision and speed with YOLOv3, which uses a fixed input size, we also tested Faster R-CNN with a fixed input size of 608x608, for each of the two feature extractors. These settings make a total of 7 classifiers that we trained and tested on the two datasets described above, which amounts to 14 experiments, summarized in Table VII and enumerated in the sketch below. In these experiments, we kept the default values for the momentum (0.9), weight decay (0.0005), learning rate (initial rate of 10^-3 for YOLOv3, 2*10^-4 for Faster R-CNN with Inception-v2, and 3*10^-4 with Resnet50), batch size (64 for YOLOv3 and 1 for Faster R-CNN), and anchor sizes (see Table II). Furthermore, we made additional experiments with different values of the learning rate (10^-5, 10^-4, 10^-3, and 10^-2) for each of the main algorithms (Faster R-CNN with Inception-v2, Faster R-CNN with Resnet50, and YOLOv3 with input size 416x416), on each of the two datasets. We trained each network for the number of iterations necessary for its convergence. We notice, for example, in Table VII that YOLOv3 necessitated a higher number of iterations when using the largest input size (608x608) on the Stanford dataset, while it reached convergence after far fewer iterations when using the medium input size (416x416) on the same dataset. Nevertheless, the number of steps needed to reach convergence is non-deterministic, and depends on the initialization of the weights.

Fig. 7. Histogram of car sizes in the PSU (left) and Stanford (right) training (top) and testing (bottom) datasets, expressed as the number of pixels inside the ground-truth bounding boxes (width*height).
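For reference, the seven main training configurations described above can be enumerated as follows. This is a hypothetical sketch: the field names are illustrative and do not correspond to any particular framework's configuration schema.

# Hypothetical enumeration of the 7 main training configurations.
yolo_configs = [
    {"algorithm": "YOLOv3", "input_size": s, "batch_size": 64,
     "momentum": 0.9, "weight_decay": 0.0005, "learning_rate": 1e-3}
    for s in ["320x320", "416x416", "608x608"]
]
frcnn_configs = [
    {"algorithm": "Faster R-CNN", "feature_extractor": fe, "input_size": size,
     "batch_size": 1, "momentum": 0.9, "weight_decay": 0.0005,
     "learning_rate": 2e-4 if fe == "Inception-v2" else 3e-4}
    for fe in ["Inception-v2", "Resnet50"]
    for size in ["variable (aspect-preserving)", "608x608"]
]
experiments = yolo_configs + frcnn_configs  # 3 + 4 = 7 classifiers
for cfg in experiments:
    print(cfg)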
C. Results and Discussion
For the experimental setup, we used a workstation powered by an Intel Core i7-8700K (3.7 GHz) processor, with 32 GB of RAM and an NVIDIA GeForce 1080 (8 GB) GPU, running on Linux.

TABLE IX
AVERAGE RECALL FOR A GIVEN MAXIMUM NUMBER OF DETECTIONS, AVERAGED OVER ALL VALUES OF IOU (0.5, 0.65, 0.8, ...), ON THE STANFORD DATASET

Network                       AR max=1   AR max=10   AR max=100
Faster R-CNN (Inception-v2)   15.1%      17.1%       17.1%
Faster R-CNN (Resnet50)       16.4%      -           -
YOLOv3 (320x320)              9.04%      9.06%       9.06%
YOLOv3 (416x416)              17.13%     17.32%      17.32%
YOLOv3 (608x608)              -          17.30%      17.30%

TABLE X
AVERAGE RECALL FOR A GIVEN MAXIMUM NUMBER OF DETECTIONS, AVERAGED OVER ALL VALUES OF IOU (0.5, 0.65, 0.8, ...), ON THE PSU DATASET

Network                       AR max=1   AR max=10   AR max=100
Faster R-CNN (Inception-v2)   6.2%       41.5%       70.8%
Faster R-CNN (Resnet50)       -          -           -

The following metrics have been used to assess the results:
• IoU: Intersection over Union, measuring the overlap between the predicted and ground-truth bounding boxes.
• mAP: mean average precision, or simply AP, since we are dealing with only one class. It corresponds to the area under the precision vs. recall curve. AP was measured for different values of IoU (0.5, 0.6, 0.7, 0.8 and 0.9).
• FPS: number of frames per second, measuring the inference processing speed.
• Inference time (in milliseconds per image), also measuring the processing speed:

$$\text{Inference time (ms)} = \frac{1000}{FPS}$$

• AR max=1, AR max=10, AR max=100: average recall when considering a maximum number of detections per image, averaged over all values of IoU specified above. We allow only the 1, 10, or 100 top-scoring detections for each image. This metric penalizes missing detections (false negatives) and duplicates (several bounding boxes for a single object).
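A minimal sketch of the IoU metric defined above, for axis-aligned boxes encoded as (xmin, ymin, xmax, ymax):

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, approx. 0.143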
Average Precision: When analyzing the results, it appears that both YOLOv3 and Faster R-CNN gave a much better AP on the PSU dataset than on the Stanford dataset (Figure 10). This is mainly due to the fact that, contrary to the PSU dataset, the characteristics of the Stanford dataset differ largely between the training and testing images, as detailed in section IV-A. This is the well-known problem of domain adaptation in machine learning (see section II). The Stanford dataset contains 20 times more car instances than the PSU dataset (Tables III and V), whereas the performance of Faster R-CNN and YOLOv3 was respectively 4 and 7 times better on the PSU dataset, in terms of AP. This highlights the fact that the clarity of the features, the quality of annotation, and the representativity of the training dataset are more important than the actual size of the dataset. Figure 11 confirms this observation and shows that both precision and recall are significantly lower on the Stanford dataset. However, Figure 12 shows that the number of false negatives (non-detected cars) is much higher than the number of false positives on the Stanford dataset (3 times higher for Faster R-CNN, and 42 times higher for YOLOv3), and also much higher than the number of true positives, which indicates that most cars go undetected in the Stanford dataset, most probably due to the different size and aspect ratio of cars in the testing images compared to the training images.

Fig. 8. A sample image of the Stanford dataset, with ground-truth bounding boxes showing some annotation errors and imprecisions.

Even though both algorithms performed poorly on the Stanford dataset as compared to the PSU dataset, with less than 20% AP, there is still a statistically significant difference between Faster R-CNN and YOLOv3 on this dataset. In fact, a T-test between the two sets of AP values of the two algorithms (for different IoU and score thresholds) yielded a p-value of 0.0020, which means that the null hypothesis (equality of the means of the two sets of AP values) can be rejected with a confidence of 99.8%.

Figure 13 shows examples of YOLOv3 and Faster R-CNN misclassifications on a sample image of the Stanford dataset. The false positives shown may be explained by the presence of annotation errors in the training dataset, as mentioned in section IV-A. Figure 14 and Figure 15 show examples of YOLOv3 and Faster R-CNN misclassifications (all of them false negatives) on a sample image of the PSU dataset, respectively.

Figures 16 and 17 show the effect of the score threshold on AP. While this effect is very limited for Faster R-CNN on both datasets, YOLOv3 shows a high dependency on the score threshold on the Stanford dataset only, with AP decreasing to almost 0 for a score threshold of 0.9. This reveals that YOLOv3's predictions on the Stanford dataset are much less confident. For all other figures shown here, the score threshold has been fixed to 0.5.
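The two-sample T-test reported above can be reproduced along the following lines; the AP arrays here are placeholders, not the paper's actual per-configuration measurements.

from scipy import stats

# Hypothetical AP values (one per IoU/score-threshold setting); the paper's
# actual per-configuration APs are not reproduced here.
ap_faster_rcnn = [0.21, 0.20, 0.19, 0.18, 0.15]
ap_yolov3      = [0.19, 0.17, 0.12, 0.08, 0.02]

t_stat, p_value = stats.ttest_ind(ap_faster_rcnn, ap_yolov3)
# The null hypothesis (equal mean AP) is rejected when p_value is small;
# the paper reports p = 0.0020 on the Stanford dataset.
print(t_stat, p_value)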
Fig. 9. Example of the output of the Faster R-CNN algorithm, on an image of the PSU dataset.

Fig. 10. Comparison of the average AP between YOLOv3 and Faster R-CNN.

Average Recall: Table IX shows the average recall for a given maximum number of detections (as described in the introduction of section IV-C), on the Stanford dataset. Faster R-CNN outperforms YOLOv3 on this metric except for AR max=1, with a slightly better performance for the Resnet50 feature extractor over Inception-v2, and a markedly inferior performance for YOLOv3 with an input size of 320x320, whereas the input sizes 416x416 and 608x608 give similar performance, which means that YOLOv3's medium input size is sufficient for the Stanford dataset. The fact that the columns AR max=10 and AR max=100 in this table are identical can be explained by the fact that very few images in the Stanford testing dataset contain more than 10 car instances. Nevertheless, we have kept this duplicated column to compare it to Table X, which shows the same metrics on the PSU dataset. While all tested networks yield a close performance in terms of AR max=1 and AR max=10, YOLOv3 is significantly better in terms of AR max=100, with an increasing performance for larger input sizes, which indicates that YOLOv3 is better at detecting a high number of objects in a single image.

Fig. 11. Precision and recall values for YOLOv3 and Faster R-CNN, on the two datasets.

Fig. 12. Average number of false positives (FP), false negatives (FN), and true positives (TP) for YOLOv3 and Faster R-CNN, on the two datasets.

Fig. 13. Example of (a) YOLOv3 and (b) Faster R-CNN's output on a sample image of the Stanford dataset.

Fig. 14. Example of YOLOv3's output on an image of the PSU dataset, showing a few false negatives (non-detected cars).

Effect of the dataset characteristics: YOLOv3 shows the largest performance discrepancy between the two datasets. While it has a very high recognition rate on the PSU dataset (up to 0.96 AP), its performance markedly decreases on the Stanford dataset (Figure 10). This is mainly due to the spatial constraints imposed by the algorithm. On the other hand, Faster R-CNN was designed to better deal with objects of various scales and aspect ratios [13].

Fig. 15. Example of Faster R-CNN misclassifications on an image of the PSU dataset, showing several false negatives (non-detected cars).
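The AR max=k metric discussed above keeps only the k top-scoring detections of each image before computing recall. A simplified per-image sketch follows; the full metric additionally averages this value over all images and over the IoU thresholds listed earlier.

import numpy as np

def recall_at_max_dets(scores, matched, n_ground_truth, max_dets):
    """Recall when only the max_dets top-scoring detections of an image count.

    scores  : confidence of each detection, shape (N,)
    matched : 1 if the detection matches a ground-truth box at the chosen
              IoU threshold, else 0, shape (N,)
    """
    order = np.argsort(scores)[::-1][:max_dets]
    true_positives = np.asarray(matched)[order].sum()
    return true_positives / n_ground_truth

# Illustrative single image: 5 detections, 4 ground-truth cars.
scores  = [0.9, 0.8, 0.75, 0.6, 0.3]
matched = [1, 1, 0, 1, 1]
for k in (1, 10, 100):
    print(k, recall_at_max_dets(scores, matched, n_ground_truth=4, max_dets=k))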
TABLE XI
DETAILED RESULTS OF DIFFERENT CONFIGURATIONS OF YOLOV3 AND FASTER R-CNN, ON THE PSU DATASET. THE DEFAULT CONFIGURATION OF FASTER R-CNN ALLOWS A VARIABLE INPUT SIZE THAT CONSERVES THE ASPECT RATIO OF THE IMAGE. IN THIS CASE, THE INPUT SIZE SHOWN IS AN AVERAGE.

Algorithm      Feature extractor   Input size           AP      TP    FN    FP   Precision   Recall   F1 score   FPS
Faster R-CNN   Inception v2        992*550 (variable)   0.739   548   190   11   0.980       0.743    0.845      9.5
Faster R-CNN   Inception v2        608*608 (fixed)      0.731   541   197   14   0.975       0.733    0.837      9.5
Faster R-CNN   Resnet50            992*550 (variable)   0.708   524   214   -    -           -        -          -
YOLOv3         Darknet-53          416*416 (fixed)      0.957   710   28    40   0.947       0.962    0.954      17.5
YOLOv3         Darknet-53          608*608 (fixed)      -       -     -     36   0.952       -        -          -

TABLE XII
DETAILED RESULTS OF DIFFERENT CONFIGURATIONS OF YOLOV3 AND FASTER R-CNN, ON THE STANFORD DATASET. THE DEFAULT CONFIGURATION OF FASTER R-CNN ALLOWS A VARIABLE INPUT SIZE THAT CONSERVES THE ASPECT RATIO OF THE IMAGE. IN THIS CASE, THE INPUT SIZE SHOWN IS AN AVERAGE.

Algorithm      Feature extractor   Input size           AP      TP     FN     FP     Precision   Recall   F1 score   FPS
Faster R-CNN   Inception v2        600*816 (variable)   0.202   1780   6351   1813   0.495       0.219    0.304      19.2
Faster R-CNN   Inception v2        608*608 (fixed)      -       -      -      -      -           -        -          -
YOLOv3         Darknet-53          416*416 (fixed)      0.195   1583   6548   -      -           -        -          -
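The precision, recall, and F1 columns of Tables XI and XII follow directly from the TP, FP, and FN counts; for instance, for the first row of Table XI:

def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts, as reported
    in Tables XI and XII."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Row "Faster R-CNN, Inception v2, variable input" of Table XI:
print(detection_metrics(tp=548, fp=11, fn=190))  # approx. (0.980, 0.743, 0.845)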
Fig. 16. AP for different values of score threshold, for the two algorithms on the PSU dataset (IoU threshold fixed at 0.6).
Fig. 17. AP for different values of score threshold, for the two algorithms on the Stanford dataset (IoU threshold fixed at 0.6).

Fig. 18. Average IoU value for YOLOv3 and Faster R-CNN, on the two datasets.
Nevertheless, the contrary can be observed in terms of IoU (Figure 18). While the average IoU of Faster R-CNN decreases by half between the PSU dataset and the Stanford dataset, it decreases by only 4% for YOLOv3. The imprecision of the ground-truth bounding boxes in the Stanford dataset, and the discrepancy between training and testing features, could explain the difference between the two datasets in terms of IoU. YOLOv3, however, manages to keep relatively precise predicted bounding boxes on both datasets.

Besides, Faster R-CNN shows a high disparity between the two datasets in terms of processing speed (2.7 times faster on the Stanford dataset), mainly due to the difference in image input size. In fact, we calculated that the average number of pixels in input test images (after resizing) is 544K for the PSU dataset, and 265K for the Stanford dataset.

Effect of the feature extractor: The effect of the feature extractor for Faster R-CNN on the AP is very limited, except for a high value of the IoU threshold (0.9) on the Stanford dataset, as can be seen in Figure 19 and Figure 20. Nevertheless, in terms of inference speed, the Inception-v2 feature extractor is significantly faster than Resnet50 (Figures 21 and 22), which is consistent with the findings of Bianco et al. [35], who also showed that Inception-v2 (aka BN-Inception) is less computationally complex.

Effect of the input size: Figures 21 and 22 show a significant gain in YOLOv3's AP when moving from a 320x320 input size to 416x416, but the performance stagnates when we move further to 608x608, which means that the 416x416 resolution is sufficient to detect the objects of the two datasets. On the other hand, the same figures show that the input size has a significant impact on the inference time, as expected, since a larger input size generates a greater number of network parameters, and hence a larger number of operations. In fact, the inference processing speed of YOLOv3 largely depends on the input size (from 12 FPS for 608*608 up to 23 FPS for 320*320), with little variation between the two datasets (Figure 23).
Fig. 19. Average Precision at different IoU threshold values of the tested algorithms on the PSU dataset.
Fig. 20. Average Precision at different IoU threshold values of the tested algorithms on the Stanford dataset.

Fig. 21. Comparison of the trade-off between AP and inference time for YOLOv3 (with 3 different input sizes) and Faster R-CNN (with two different feature extractors), on the PSU dataset.
Concerning Faster R-CNN, Tables XI and XII show that the default variable input size, which conserves the aspect ratio of the images, provides a better precision and recall than the fixed-size configuration in all cases except with Inception-v2 on the Stanford dataset, where the fixed size results in significantly fewer false negatives (5215 compared to 6351). This is likely due to an exceptional congruence between the fixed input size and the anchor scales for Inception-v2 on this particular dataset. This configuration also gives a slightly better performance in terms of inference speed (21.1 FPS compared to 19.2 FPS), due to the smaller average input size.

Effect of the learning rate: In order to measure the sensitivity of each algorithm to the learning rate hyperparameter, we conducted additional experiments with different values of the learning rate (10^-5, 10^-4, 10^-3, and 10^-2) for each of the main algorithms (Faster R-CNN with Inception-v2, Faster R-CNN with Resnet50, and YOLOv3 with input size 416x416), on each of the two datasets. Figure 24 shows a high sensitivity of the AP (measured on the validation dataset) to the learning rate value chosen during training. A learning rate of 10^-3 yields the best performance in most cases, except for Inception-v2 on the Stanford dataset, for which a lower learning rate (10^-4 or even 10^-5) shows better results. A learning rate of 10^-2 gives poor results in all cases except for Resnet50 on the PSU dataset. A learning rate of 10^-1 has also been tested, but it led to a divergent loss. These results highlight the importance of trying different values of the learning rate when comparing the performance of object detection algorithms. The results shown in Figure 24 confirm the better performance of YOLOv3 and Faster R-CNN on the PSU and Stanford datasets respectively, when the learning rate is well chosen.
Fig. 22. Comparison of the trade-off between AP and inference time for YOLOv3 (with 3 different input sizes) and Faster R-CNN (with two different feature extractors), on the Stanford dataset.

Effect of the anchor scales: The anchor scales used for the two algorithms are the default values specified in Table II. As we suspected that the anchor values could be the reason for the poor performance of both algorithms on the Stanford dataset, we subsequently conducted three additional experiments with a different set of anchor scales. For YOLOv3, the new anchor scales were calculated using K-means clustering on the Stanford training dataset (see the sketch after Table XIII), and yielded smaller anchor sizes (10x27, 25x16, 17x26, 18x35, 22x31, 35x23, 23x38, 27x34, and 31x42). For Faster R-CNN, we used anchor scales reduced by half (64x64, 128x128, and 256x256, instead of the default 128x128, 256x256, and 512x512). Table XIII shows the results obtained after using these anchors, compared to the previous results obtained with the default anchors. The performance was markedly lower for YOLOv3, which indicates that this algorithm is very sensitive to the change of anchor scales. As for Faster R-CNN with Resnet50 as feature extractor, the AP was slightly lower (20.7%, down from 21.9%), while the average IoU dropped noticeably (25%, down from 47.7%). In contrast, Faster R-CNN with Inception-v2 as feature extractor was the only algorithm that showed better results with the reduced anchor scales. The two rightmost columns in Table XIII show the average width and height of the predicted bounding boxes. We notice that the dependency between the anchor scales and the predicted sizes is not straightforward. The average predicted sizes are more affected by the size of the ground-truth bounding boxes in the training dataset (72x152 on average, as shown in Table VIII), and adapt poorly to the different ground-truth car sizes and aspect ratios in the testing dataset (60x90 on average), which explains the low performance of all the tested algorithms on the Stanford dataset specifically. Moreover, we can observe that despite the fact that the default anchor scales of Faster R-CNN are overall larger than those of YOLOv3, the former algorithm yields better AP on the Stanford dataset than the latter, which indicates that smaller anchor scales are not the solution to the poor performance obtained on the Stanford dataset.

Fig. 23. Inference speed measured in Frames per Second (FPS), for each of the tested algorithms.

TABLE XIII
EFFECT OF REDUCING THE ANCHOR SCALES OF YOLOV3 AND FASTER R-CNN, ON THE STANFORD DATASET.

Algorithm                                      Anchor scales                                                      AP      IoU    Avg. predicted width   Avg. predicted height
YOLOv3 416x416 (default anchors)               See Table II                                                       -       -      -                      -
YOLOv3 416x416 (reduced anchors)               10x27, 25x16, 17x26, 18x35, 22x31, 35x23, 23x38, 27x34, 31x42      -       -      -                      -
Faster R-CNN, ResNet50 (default anchors)       Scales: 128x128, 256x256, 512x512; aspect ratios: 1:1, 1:2, 2:1    0.219   0.48   91                     171
Faster R-CNN, ResNet50 (reduced anchors)       Scales: 64x64, 128x128, 256x256; aspect ratios: 1:1, 1:2, 2:1      0.207   0.25   72                     131
Faster R-CNN, Inception-v2 (default anchors)   Scales: 128x128, 256x256, 512x512; aspect ratios: 1:1, 1:2, 2:1    0.202   0.48   74                     140
Faster R-CNN, Inception-v2 (reduced anchors)   Scales: 64x64, 128x128, 256x256; aspect ratios: 1:1, 1:2, 2:1      0.255   0.50   92                     174
Fig. 24. Dependency between the AP and the learning rate, on the PSU (top) and Stanford (bottom) datasets.

Main lessons learned: Tables XI and XII present the detailed results of all tested configurations of the two algorithms on the PSU and Stanford datasets, respectively. The best performance for each metric and each dataset is highlighted. We notice that YOLOv3 with the high input size (608x608) and Faster R-CNN (with the Inception-v2 feature extractor and a fixed input size) show the best results in terms of AP and recall, on the PSU and Stanford datasets respectively, while in terms of precision, Faster R-CNN (with the Resnet50 feature extractor and a variable input size) and YOLOv3 with the medium input size (416x416) perform better on the PSU and Stanford datasets respectively. Figures 21 and 22 summarize the main results of this comparative study. They compare the trade-off between AP and inference time for YOLOv3 (with 3 different input sizes) and Faster R-CNN (with two different feature extractors), on the PSU and Stanford datasets respectively, with the default hyperparameters specified in section IV-B. It can be observed that while Faster R-CNN (with Inception v2 as feature extractor) gave the best trade-off in terms of AP and inference speed on the Stanford dataset, YOLOv3 (with input size 320*320) presented the best trade-off on the PSU dataset. This lays emphasis on the fact that neither of the two algorithms outperforms the other in all cases, and that the best trade-off between AP and inference time depends on the characteristics of the dataset (object size, resolution, quality of annotation, representativity of the training dataset, etc.). Finally, it should be noted that although the present case study was restricted to car objects only, its conclusions can easily be generalized to any similar type of object in aerial images, since we did not use any feature specific to cars.

V. CONCLUSION

In this study, we conducted a thorough experimental comparison of two leading object detection algorithms (YOLOv3 and Faster R-CNN) on two UAV imaging datasets that present specific challenges (tiny objects, high number of objects per image, and ambiguous features due to the angle of view) compared to classical datasets of ground images like ImageNet [37] or COCO [27]. The two datasets used for performance evaluation have very different characteristics, which makes the comparison more robust. To account for the effect of hyperparameters on each algorithm, a total of 39 trainings were conducted (14 experiments with default hyperparameters, 22 experiments with different learning rates, and 3 experiments with different anchor scales). Furthermore, the performance of the two algorithms was assessed using several metrics (mAP, IoU, FPS, AR max=1, AR max=10, AR max=100, ...) in order to uncover their strengths and weaknesses. One of the main conclusions that can be drawn from this comparative study is that the performance of these two algorithms largely depends on the characteristics of the dataset and the representativity of the training images. In fact, while Faster R-CNN (with Inception v2 as feature extractor) gave the best trade-off in terms of AP and inference speed on the Stanford dataset, YOLOv3 (with input size 320*320) presented the best trade-off on the PSU dataset.

ACKNOWLEDGMENT

This work is supported by the Robotics and Internet of Things Lab of Prince Sultan University. We also thank Mr. Bilel Ben Jdira and Mr. Taha Khursheed for working on the prior conference version of this paper.
REFERENCES

[1] B. Benjdira, T. Khursheed, A. Koubaa, A. Ammar, and K. Ouni, "Car detection using unmanned aerial vehicles: Comparison between Faster R-CNN and YOLOv3," pp. 1-6, IEEE, 2019.
[2] A. Koubaa and B. Qureshi, "DroneTrack: Cloud-based real-time object tracking using unmanned aerial vehicles over the internet," IEEE Access, vol. 6, pp. 13810-13824, 2018.
[3] E. T. Alotaibi, S. S. Alqefari, and A. Koubaa, "LSAR: Multi-UAV collaboration for search and rescue missions," IEEE Access, vol. 7, pp. 55817-55832, 2019.
[4] X. Xi, Z. Yu, Z. Zhan, C. Tian, and Y. Yin, "Multi-task cost-sensitive-convolutional neural network for car detection," IEEE Access, pp. 1-1, 2019.
[5] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye, "A large contextual dataset for classification, detection and counting of cars with deep learning," in European Conference on Computer Vision, pp. 785-800, Springer, 2016.
[6] X. Xi, Z. Yu, Z. Zhan, Y. Yin, and C. Tian, "Multi-task cost-sensitive-convolutional neural network for car detection," IEEE Access, vol. 7, pp. 98061-98068, 2019.
[7] I. Ševo and A. Avramović, "Convolutional neural network based automatic object detection on aerial images," IEEE Geoscience and Remote Sensing Letters, vol. 13, pp. 740-744, May 2016.
[8] K. S. Ochoa and Z. Guo, "A framework for the management of agricultural resources with automated aerial imagery detection," Computers and Electronics in Agriculture, vol. 162, pp. 53-69, 2019.
[9] M. Kampffmeyer, A. Salberg, and R. Jenssen, "Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks," pp. 680-688, June 2016.
[10] S. M. Azimi, P. Fischer, M. Körner, and P. Reinartz, "Aerial LaneNet: Lane-marking semantic segmentation in aerial imagery using wavelet-enhanced cost-sensitive symmetric fully convolutional neural networks," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, pp. 2920-2938, May 2019.
[11] L. Mou and X. X. Zhu, "Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, pp. 6699-6711, Nov 2018.
[12] B. Benjdira, Y. Bazi, A. Koubaa, and K. Ouni, "Unsupervised domain adaptation using generative adversarial networks for semantic segmentation of aerial images," Remote Sensing, vol. 11, no. 11, 2019.
[13] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[14] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," CoRR, vol. abs/1804.02767, 2018.
[15] X. Y. Chen, S. M. Xiang, C. L. Liu, and C. H. Pan, "Vehicle detection in satellite images by hybrid deep convolutional neural networks," IEEE Geoscience and Remote Sensing Letters, 2014.
[16] N. Ammour, H. Alhichri, Y. Bazi, B. Benjdira, N. Alajlan, and M. Zuair, "Deep learning approach for car detection in UAV imagery," Remote Sensing, vol. 9, p. 312, Mar 2017.
[17] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," International Conference on Learning Representations (ICLR), 2015.
[18] C. E. Kim, M. M. D. Oghaz, J. Fajtl, V. Argyriou, and P. Remagnino, "A comparison of embedded deep learning methods for person detection," arXiv preprint arXiv:1812.03451, 2018.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2016.
[20] B. Kellenberger, D. Marcos, S. Lobry, and D. Tuia, "Half a percent of labels is enough: Efficient animal detection in UAV imagery using deep CNNs and active learning," In press at IEEE Transactions on Geoscience and Remote Sensing (TGRS), pp. 1-1, 2019.
[21] B. Hardjono, H. Tjahyadi, M. G. A. Rhizma, A. E. Widjaja, R. Kondorura, and A. M. Halim, "Vehicle counting quantitative comparison using background subtraction, viola jones and deep learning methods," pp. 556-562, Nov 2018.
[22] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," pp. 6517-6525, 2017.
[23] H. Tayara, K. Gil Soo, and K. T. Chong, "Vehicle detection and counting in high-resolution aerial images using convolutional regression neural network," IEEE Access, vol. 6, pp. 2220-2230, 2018.
[24] J. Liang, X. Chen, M.-L. He, L. Chen, T. Cai, and N. Zhu, "Car detection and classification using cascade model," IET Intelligent Transport Systems, vol. 12, no. 10, pp. 1201-1209, 2018.
[25] T. Gebru, J. Krause, Y. Wang, D. Chen, J. Deng, and L. Fei-Fei, "Fine-grained car detection for visual census estimation," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[26] K. Liu and G. Mattyus, "Fast multiclass vehicle detection on aerial images," IEEE Geoscience and Remote Sensing Letters, vol. 12, pp. 1938-1942, Sep. 2015.
[27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, pp. 740-755, Springer, 2014.
[28] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 580-587, 2014.
[29] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[30] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," pp. 779-788, 2016.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv.org, 2015.
[32] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, "Learning social etiquette: Human trajectory understanding in crowded scenes," in European Conference on Computer Vision, pp. 549-565, Springer, 2016.
[33] "Aerial-car-dataset," available online: https://github.com/jekhor/aerial-cars-dataset, accessed on 16-10-2018.
[34] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015.
[35] S. Bianco, R. Cadene, L. Celona, and P. Napoletano, "Benchmark analysis of representative deep neural network architectures," IEEE Access, vol. 6, pp. 64270-64277, 2018.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[37] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition.