Convolutional Neural Networks for Aerial Vehicle Detection and Recognition
Amir Soleimani, Nasser M. Nasrabadi, Elias Griffith, Jason Ralph, Simon Maskell
Lane Department of Computer Science and Electrical Engineering, West Virginia University
Department of Electrical Engineering and Electronics, University of Liverpool
This paper has been accepted at the National Aerospace and Electronics Conference (NAECON) 2018 and will be indexed by IEEE.
Abstract—This paper investigates the problem of aerial vehicle recognition using a text-guided deep convolutional neural network classifier. The network receives an aerial image and a desired class, and produces a yes or no output by matching the image against the textual description of the desired class. We train and test our model on a synthetic aerial dataset, and our desired classes consist of combinations of vehicle types and colors. This strategy allows the model to handle more classes at test time than were seen during training.
I. INTRODUCTION
Aerial imagery, captured by drones or Unmanned Aerial Vehicles (UAVs), is a great tool for surveillance because of its wide field of view and the ability of drones to access places that would otherwise be difficult to visit. Aerial imagery also has other applications such as border security, search and rescue, and image and video understanding. It can also be used for understanding human-human, human-vehicle, and vehicle-vehicle interactions.

The wide field of view of aerial imagery, however, means that objects of interest occupy only a small number of pixels in each image. In comparison to regular or street-view images, aerial images contain less information and detail about vehicles and other objects in the image. Therefore, it is common for a vehicle in an aerial view to be missed because of the background or other objects; on the other hand, false positive predictions are also highly probable.

Another issue that makes the resolution challenge harder is the limitation in computational resources. Although it is possible to capture a high-resolution image, processing a large image incurs a large, unavoidable computational cost, especially if we are interested in implementing an online aerial vehicle detection system.

The application of aerial vehicle detection and recognition can be more specific if the goal of the system is not just to detect vehicles but to find specific vehicles. For example, a detection system can concentrate on searching for a specific car with a specific color, type, and other descriptions (e.g., a yellow taxi or a large green truck). In this scenario, the detection system can be used in applications such as finding a suspicious vehicle or a target vehicle among several other vehicles, objects, and backgrounds.
Fig. 1. (a) An aerial image. (b) Some exemplar objects of interest (vehicles).

In this specific scenario, in addition to the resolution challenge, there is an open-ended challenge. Providing a comprehensive dataset that covers all probable object variations and classes (e.g., vehicle types, colors, shapes, and other variations) is impossible. Therefore, the detection system has to respond to unseen targets; in other words, during the testing phase it sees sample variations that it has not seen in its training phase. We refer to this as the unseen or open-ended challenge.

Visual Question Answering (VQA) systems take an image and an open-ended textual question about the given image, and try to provide an answer to the question in textual form [1]. The core idea behind the VQA task is to answer unseen questions, which here could be considered as classes. Antol et al. [1], using an LSTM and MLP structure, achieve acceptable results on their large dataset consisting of about 0.25M images, 0.76M questions, and 10M answers.

In order to alleviate the challenge of objects occupying a small number of pixels, we split the problem into two sub-problems [2]. We first assume that a deep detector such as the Single Shot MultiBox Detector (SSD) [3] extracts objects or areas of interest, and second, we use a deep convolutional network to recognize which of the extracted objects of interest are the vehicles we wish to detect.

In this paper, we propose a framework that can handle the problem of open-ended classification or prediction. In a classical image classification system, an image is processed and an output label is produced. In this paper, however, we use an architecture that receives an image and a desired textual description of the class, represented by a code vector, and makes a yes or no decision about whether the input image matches the desired class label (see Figure 2).

Fig. 2. (a) A classical classifier that receives an image and predicts a label code for the image class. (b) The proposed architecture, which can consider classes that have not been seen during training.

This paper is organized as follows. In Section II, we review the literature on deep object detectors and visual question answering. In Section III, we describe our proposed framework. In Section IV, we explain the dataset, experiments, and results. Finally, conclusions and future work are presented in Section V.
II. RELATED WORKS
The combination of good hand-engineered image feature descriptors, such as the histogram of oriented gradients (HOG) [4] and the scale-invariant feature transform (SIFT) [5], with a classifier such as a support vector machine (SVM) [6] or a multilayer perceptron (MLP) was the main focus of image classification research for several years. In recent years, however, deep convolutional neural networks, which have outperformed other methods on datasets such as VOC [7] and ImageNet [8], have attracted increasing interest in the field of image classification. Deep convolutional neural networks such as LeNet [9] and AlexNet [10] have confirmed their effectiveness as classifiers that receive only raw images, without any hand-crafted image feature descriptors.

R-CNN [11] can be considered the first substantial and well-known deep architecture for object detection. This method takes an input image, a classical region-of-interest generator such as selective search [12] creates about 2000 region proposals, a deep convolutional neural network (CNN) extracts visual features, and finally an SVM classifier determines whether a specific object is present in each proposal. However, performing all of these steps separately makes R-CNN slow and suboptimal. It nevertheless achieved good detection results, because a high-capacity convolutional neural network was applied to bottom-up region proposals and because a pre-trained model was used for initialization.

Fast R-CNN [13] is the next top method in the object detection literature. In contrast to R-CNN, Fast R-CNN has its own classification layer. It takes an image and multiple region-of-interest proposals, a CNN creates feature maps for the proposals, and a softmax layer and a regression layer find the objects in the image. Another important advantage of Fast R-CNN is that it does not need disk storage for feature caching.

The next step, which led to real-time object detection with region proposal networks, was Faster R-CNN [14]. Faster R-CNN eliminates the need for an external region-of-interest proposal generator; in other words, generating proposals is also done within a convolutional framework. The method employs the anchor concept, which helps to better capture objects with different sizes and aspect ratios. It must be noted that Faster R-CNN employs Fast R-CNN for parts of its pipeline.

YOLO [15] might be considered the first work in which object detection is treated as a regression problem, while previous works relied on classifiers to perform detection. A single CNN predicts bounding boxes as well as class probabilities directly from the input image, without any need for proposals; it is an end-to-end method for detecting objects in images. While YOLO is well known for its extreme speed, its accuracy is not as good as that of the other top methods.

The Single Shot MultiBox Detector (SSD) [3] is another end-to-end single-shot object detector that, in addition to its speed, has outstanding accuracy. SSD is a convolutional network that predicts a large number of bounding boxes and scores, which in principle allows it to detect up to 8732 objects in a single image. The output of SSD, like that of the previous methods, is passed through a non-maximum suppression operator to reduce overlapping results. SSD is built on a base network with a structure similar to VGG [16]. This base network creates a preliminary representation for the subsequent steps, which are carried out by extra feature layers.
These auxiliary layers decrease in size progressively and are responsible for detecting small to large objects: smaller objects are detected in the earlier layers, while larger ones are detected in the later layers. SSD uses the idea of default boxes, a concept similar to anchors, which helps the network learn specific filters for specific scales and aspect ratios.

Equation (1) is SSD's cost function. SSD has an objective function consisting of two parts: a localization cost that codes the location of the bounding boxes and the objects, and a confidence cost that determines the degree of certainty for the presence of an object in a specific bounding box or location.

$$L(x, c, l, g) = \frac{1}{N}\big(L_{\mathrm{conf}}(x, c) + L_{\mathrm{loc}}(x, l, g)\big) \quad (1)$$

The localization cost is a smooth L1 loss and codes the error between the ground-truth bounding boxes and the predicted ones. The confidence cost, in contrast, is a softmax loss and can also be considered a classification cost. Note that x is the parameter that indicates the presence of an object in SSD's default boxes; during training, this information can be computed from the ground-truth annotations and the default box positions. c denotes the class parameters, and l and g are the predicted and ground-truth locations, respectively.
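A minimal sketch of how the objective in (1) might be computed, assuming the default boxes have already been matched to ground truth. The full SSD loss additionally uses hard negative mining and a weighting term between the two parts, which are omitted here; all names are illustrative rather than taken from the SSD reference implementation.

```python
import torch
import torch.nn.functional as F

def multibox_loss(cls_scores, loc_preds, cls_targets, loc_targets, positive_mask):
    """Simplified version of Eq. (1): L = (1/N) * (L_conf + L_loc).

    cls_scores:    (num_boxes, num_classes) raw class scores per default box
    loc_preds:     (num_boxes, 4) predicted box offsets
    cls_targets:   (num_boxes,) target class index per default box (0 = background)
    loc_targets:   (num_boxes, 4) ground-truth offsets for matched boxes
    positive_mask: (num_boxes,) bool, True where a default box is matched to an object
    """
    num_matched = positive_mask.sum().clamp(min=1)            # N in Eq. (1)

    # Confidence cost: softmax / cross-entropy over the default boxes.
    conf_loss = F.cross_entropy(cls_scores, cls_targets, reduction="sum")

    # Localization cost: smooth L1 on the matched (positive) boxes only.
    loc_loss = F.smooth_l1_loss(loc_preds[positive_mask],
                                loc_targets[positive_mask], reduction="sum")

    return (conf_loss + loc_loss) / num_matched
```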
III. PROPOSED METHOD

We propose a framework in which, first, an SSD [3], which has shown promising performance in the aerial image object detection literature [2], [17], generates a number of object-of-interest proposals for an input aerial image. These proposals might contain vehicles, background, or other objects; in other words, we obtain a set of object patches or object chips per original image.

In contrast to the classical structure of classifiers, which receive an input image and predict a label at the output, we propose an architecture that receives an image as well as the textual description of the desired class as inputs. This architecture predicts a yes or no decision that indicates whether the image has the desired class label (see Figure 2). One of the main reasons for changing the structure is that, in this new structure, the classifier is not limited to pretrained or predefined class labels; in other words, class labels are treated as open-ended, just like the images.

We use a VGG-16 architecture [16] with only one fully connected layer to extract visual descriptors. This convolutional structure consists of five convolutional blocks: the first two blocks each contain two convolutional layers, with depths of 64 and 128, respectively, and the next three blocks each contain three convolutional layers, with depths of 256, 512, and again 512, respectively. After each block there is a max pooling layer to reduce the spatial size and increase generalization. A fully connected layer is placed after the fifth block (see Figure 3), and its values are fed into the next step, a fully connected fusion layer that combines the visual features extracted by the VGG structure with the textual features described below.

Meanwhile, the textual descriptions of the desired classes are coded using a bag-of-words representation, and a fully connected layer then transforms this information into the next space. The textual descriptions of the desired classes in our experiments consist of the color and type of the vehicles, but they could be more complicated, with more details about the vehicles.

As Figure 3 shows, the visual features extracted by the VGG network and the textual features extracted by the fully connected layers are fused to form a visual-textual sub-space. This is done using a fully connected fusion layer. Finally, a two-class softmax classifier is placed on top of the last layer and trained to predict whether the image has the desired class (Figure 3 shows the details of the proposed method).

It is worth noting that all the weights of the network, including the visual feature extractor, the textual feature extractor, and the softmax classifier, are optimized together. This forces each component of our algorithm to influence the optimization of the other components.

Fig. 3. Proposed method for the task of aerial vehicle detection. Detected objects of interest are described using the features extracted by the convolutional layers. Desired classes are represented using bag of words and then fully connected layers to build a common latent space, on top of which the yes or no decision is made.
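To make the architecture in Figure 3 concrete, here is a simplified PyTorch-style sketch. Only the convolutional depths (64, 128, 256, 512, 512), the single fully connected visual layer, the bag-of-words input, the fusion layer, and the two-class softmax come from the text; the chip size and the hidden dimensions (chip_size, text_dim, fused_dim) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, num_convs):
    """A VGG-style block: num_convs 3x3 conv+ReLU layers followed by 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class VisualTextualMatcher(nn.Module):
    """Sketch of the proposed matcher: VGG-16-style visual features and a
    bag-of-words class description are fused and scored by a two-class softmax."""

    def __init__(self, vocab_size, chip_size=64, text_dim=64, fused_dim=256):
        super().__init__()
        # Five blocks with depths 64, 128, 256, 512, 512, as described in the text.
        self.visual = nn.Sequential(
            conv_block(3, 64, 2), conv_block(64, 128, 2),
            conv_block(128, 256, 3), conv_block(256, 512, 3), conv_block(512, 512, 3))
        spatial = chip_size // 32                                  # five poolings halve the size each time
        self.visual_fc = nn.Linear(512 * spatial * spatial, 512)  # single FC after the conv stack
        self.text_fc = nn.Linear(vocab_size, text_dim)             # textual feature extractor
        self.fusion = nn.Linear(512 + text_dim, fused_dim)         # visual-textual fused space
        self.classifier = nn.Linear(fused_dim, 2)                  # yes / no logits

    def forward(self, chip, class_bow):
        v = torch.relu(self.visual_fc(self.visual(chip).flatten(1)))
        t = torch.relu(self.text_fc(class_bow))
        fused = torch.relu(self.fusion(torch.cat([v, t], dim=1)))
        return self.classifier(fused)   # trained jointly with cross-entropy on yes/no labels
```

All parameters, including both feature branches and the classifier, sit in one module, so a single optimizer updates them together, matching the joint optimization described above.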
IV. EXPERIMENTS
In order to evaluate our proposed framework, we use our dataset, which contains real aerial images and synthetic cars and trucks placed on the streets (see Figure 1 and Figure 4). Vehicles can have seven different colors: black, white, gray, yellow, green, blue, and red. The two vehicle types in conjunction with these seven colors define a 14-class recognition problem.

The synthetic aerial dataset contains about 5000 vehicles with information about the vehicle types and colors; see Figure 4 for some examples. More details about the dataset are provided in the following subsection.
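As an illustration of how such color/type descriptions could be encoded, the following sketch builds binary bag-of-words vectors over the seven colors and two vehicle types; the exact vocabulary and encoding used in the paper may differ.

```python
from itertools import product

colors = ["black", "white", "gray", "yellow", "green", "blue", "red"]
types = ["car", "truck"]

vocab = colors + types                         # 9 words in total
word_index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(description):
    """Binary bag-of-words vector for a textual class description, e.g. 'red truck'."""
    vec = [0] * len(vocab)
    for word in description.split():
        if word in word_index:
            vec[word_index[word]] = 1
    return vec

# The 14 desired classes are all color/type combinations.
classes = [f"{c} {t}" for c, t in product(colors, types)]
encoded = {cls: bag_of_words(cls) for cls in classes}
print(len(classes), encoded["red truck"])      # 14 [0, 0, 0, 0, 0, 0, 1, 0, 1]
```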
A. Dataset
Whilst a limited number of datasets are available [18], capturing Wide Area Motion Imagery can be an expensive and difficult process. In addition to the obvious problems of obtaining specialist (often classified military) camera equipment, there is also the need to organize aircraft, pilots, and permission to fly. There are also broader legal implications of performing surveillance of a real city or community: a person could be identified by a house number he or she visits. Critically, there is no definitive form of ground truth (e.g., vehicle types and positions) for comparative evaluation of methods, nor a means of placing a camera in a particular position, orientation, and time, which is a useful capability for developing new algorithms.

A brief overview of the image and ground truth generation method is given here (see Figure 5); a more detailed discussion is presented in [19].

Firstly, the ground truth data is generated using a MATLAB simulation that controls an instance of the SUMO (Simulation of Urban MObility) traffic simulator [20]. The MATLAB simulation can seamlessly transfer entities, such as vehicles and people, between SUMO and itself. SUMO is used to navigate vehicles according to a basic set of rules of the road, and also navigates pedestrians along sidewalks and crosswalks. The MATLAB simulation handles the high-level goals of the people, such as when and where to visit (e.g., a shop or workplace) and the mode of transport to be used. It also manages other bespoke "micro-simulations" such as walking from the sidewalk to a building doorway, and people using bus stops.

Similarly, images are then generated from the ground truth at each time step using a MATLAB-controlled instance of the X-Plane flight simulator [21]. The MATLAB image generator can position the flight simulator's viewpoint and spawn 3D vehicles of the correct type and color, finally triggering image captures and converting them to the MATLAB matrix format (for saving or further processing). A configuration based on the DARPA/BAE Systems ARGUS-IS imager [22] is used, which contains an array of 368 5-megapixel subcameras. This generates an image of 1.8 gigapixels (or 5 gigabytes of uncompressed RGB imagery per frame) capturing a circular area of approximately 6 km diameter (at a 6 km altitude). Equations describing the configuration of the subcamera array can be found in [19].

The final output per frame (each time step) is the ground truth data saved in an XML format, plus 368 PNG subcamera images, each with an accompanying metadata file for the subcamera position and orientation. The example applications within [19] detail a tiled video playback tool and an analysis tool for interpreting the ground truth data directly without the imagery (e.g., tracing paths taken by vehicles).
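The per-frame data volumes quoted above follow directly from the stated subcamera count; a quick back-of-the-envelope check:

```python
# Rough size of one simulated ARGUS-IS frame, from the figures quoted above.
subcameras = 368
pixels_per_subcamera = 5e6              # 5-megapixel subcameras
bytes_per_pixel = 3                     # uncompressed RGB, 8 bits per channel

total_pixels = subcameras * pixels_per_subcamera
total_bytes = total_pixels * bytes_per_pixel

print(f"{total_pixels / 1e9:.2f} gigapixels per frame")      # ~1.84 gigapixels
print(f"{total_bytes / 1e9:.2f} GB uncompressed per frame")  # ~5.5 GB
```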
B. Results
In order to test our proposed method, we implemented two different experimental setups. First, we trained the proposed method on 75 percent of the dataset (all 14 classes), which contains visual and textual information about the vehicle types and colors. The remaining samples were then used for testing. Figure 6 shows the accuracy, true positive, and true negative percentages for the testing set. The three results indicate the promising power of the proposed method for recognizing the vehicles and their types and colors in the synthetic aerial images.

To check the ability of the proposed method on unseen classes (the open-ended setup), we repeated the experiment in such a way that we trained on only 13 classes and set one class aside for testing. This experimental setup was repeated 14 times, once for each of the 14 classes. Figure 7 shows the accuracy of the system for the unseen classes. Based on this experiment, we can see that the proposed method is capable of recognizing unseen vehicles that belong to similar classes.

In order to understand the underlying process in the system, we visualized the textual information of the desired class labels in the hidden layer. For this experiment only, we forced the fully connected layers for the textual features to produce a two-dimensional representation. Figure 8 shows this two-dimensional sub-space. It is clear that the two vehicle types, trucks and cars, have a similar feature pattern and lie on similar manifolds, and the colors of the vehicles within the two vehicle types appear in the same order. Therefore, if the system is not trained on yellow cars, for example, it may still respond to unseen yellow car samples during the testing phase by exploiting the similar manifolds of the trucks and cars.

Fig. 4. Some samples from our synthetic aerial dataset.
Fig. 5. (left) ARGUS field of view, illustrating the large simulated area (approximately 6 km x 6 km). (right) Six examples of various areas at the same point in time.
Fig. 6. The performance of the proposed method on the synthetic aerial dataset.
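The open-ended evaluation described above amounts to a leave-one-class-out loop. A minimal sketch follows, assuming the dataset is a list of samples with a "class" field and that train_model and evaluate are hypothetical stand-ins for the actual training and testing code.

```python
# Sketch of the unseen-class (open-ended) experiment: hold out one of the 14
# classes during training and evaluate on it; repeat for every class.

def leave_one_class_out(dataset, classes, train_model, evaluate):
    accuracies = {}
    for held_out in classes:
        train_split = [s for s in dataset if s["class"] != held_out]
        test_split = [s for s in dataset if s["class"] == held_out]
        model = train_model(train_split)              # trained on the remaining 13 classes
        accuracies[held_out] = evaluate(model, test_split)
    return accuracies
```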
V. CONCLUSION
In this paper, we investigated the problem of aerial vehicle detection. We assumed that a deep detector with fast and accurate performance, such as the Single Shot MultiBox Detector (SSD), generates a number of objects of interest for each aerial image, which can be called image proposals or image patches. In the next step, we used a VGG-16 framework to extract visual information for the generated image proposals. In parallel, a bag-of-words representation and fully connected layers are used to build a textual feature representation for the desired classes.

The visual and textual information are fused to form a common latent sub-space, which we call the visual-textual sub-space. Based on this sub-space, a softmax classifier is trained to generate yes or no outputs corresponding to whether the input image patch has the desired class label. It is important to note that all the weights of this second stage, including the convolutional layers, the fully connected layers, and the softmax classifier, are optimized simultaneously; in other words, the visual and textual features are trained together.

We tested our system on a synthetic aerial dataset that contains information about the vehicle types and vehicle colors. Results on this dataset showed that, in addition to promising performance when recognizing seen (trained) classes, our framework can recognize unseen classes as well, which is the advantage of the open-ended framework. For future work, collecting or synthesizing more complicated datasets that have both visual and rich textual information would lead to further improvements in the field.

Fig. 7. The accuracy of the proposed method in the unseen experiment.
Fig. 8. 2D visualization of the desired classes. The line separates the cars and trucks (vehicle types), and the colors correspond to the vehicle colors.
REFERENCES
[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “VQA: Visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2425–2433.
[2] A. Soleimani and N. Nasrabadi, “Convolutional neural networks for aerial multi-label pedestrian detection,” 2018.
[3] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[4] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886–893.
[5] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[6] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[12] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[13] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[17] M. Barekatain, M. Martí, H.-F. Shih, S. Murray, K. Nakayama, Y. Matsuo, and H. Prendinger, “Okutama-Action: An aerial view video dataset for concurrent human action detection,” in
Simulation Modelling Practice and Theory, vol. 84, pp. 286–308, 2018.
[20] D. Krajzewicz, E. Brockfeld, J. Mikat, J. Ringel, C. Rössel, W. Tuchscheerer, P. Wagner, and R. Wösler, “Simulation of modern traffic lights control systems using the open source traffic simulation SUMO,” in