View Independent Vehicle Make, Model and Color Recognition Using Convolutional Neural Network
Afshin Dehghan, Syed Zain Masood, Guang Shu, Enrique G. Ortiz
{afshindehghan, zainmasood, guangshu, egortiz}@sighthound.com
Computer Vision Lab, Sighthound Inc., Winter Park, FL
Abstract.
Make, model and color recognition (MMCR) of vehicles [1,2,3] is of great interest in several applications such as law enforcement, driver assistance, surveillance and traffic monitoring. This fine-grained visual classification task [4,5,6,7,8,9] has traditionally been difficult for computers. The main challenge is the subtle differences between classes (e.g., BMW 3 Series vs. BMW 5 Series) compared to traditional classification tasks such as ImageNet. Recently, there have been efforts to design more accurate algorithms for MMCR, such as the works of Sochor et al. [1] and Hsieh et al. [2]. Moreover, many researchers have focused on collecting large datasets to facilitate research in this area [4]. However, the complexity of current methods and/or the small size of current datasets lead to sub-optimal performance in real-world use cases. Thus, there are still considerable shortcomings for agencies or commercial entities looking to deploy reliable software for MMCR. In this paper, we present a system that is capable of detecting and tagging the make, model and color of vehicles irrespective of viewing angle with high accuracy. Our model is trained to recognize 59 different vehicle makes as well as 818 different models, in what we believe is the largest set available for commercial or non-commercial use.

The contributions of Sighthound's vehicle MMCR system are as follows:

- To date, we have collected what we believe to be the largest vehicle dataset, consisting of more than 3 million images labeled with the corresponding vehicle make and model. Additionally, we labeled part of this data with corresponding labels for vehicle color. Our system covers almost all popular models in North America.
- We propose a semi-automated method for annotating several million vehicle images.
- We present an end-to-end pipeline, along with a novel deep network, that is not only computationally inexpensive but also outperforms competitive methods on several benchmarks.
- We conducted a number of experiments on existing benchmarks and obtained state-of-the-art results on all of them.
The overview of our system is shown in Figure 1. Our training consists of a 3-stage processing pipeline: data collection, data pre-processing and deep training. Data collection plays an important role in our final results; thus, collecting data that requires the least labeling effort is of great importance. We collected a large dataset with two different sets of annotations. All the images are annotated with their corresponding vehicle make and model, and part of the data is annotated with vehicle color. To prepare the final training data, we further process the images to eliminate the effect of background. Finally, these images are fed into two separate deep neural networks to train the final models. Below we describe in more detail the different components of our 3-stage training procedure.
- Data Collection: Data collection plays an important role in training any deep neural network, especially for fine-grained classification tasks. To address this, we collected the largest vehicle dataset known to date, where each image is labeled with the corresponding make and model of the vehicle. We initially collected over 5 million images from various sources. We developed a semi-automated process to partially prune the data and remove undesired images, and finally used a team of human annotators to remove any remaining errors. The final dataset contains over 3 million images of vehicles with their corresponding make and model tags. Additionally, we labeled part of this data with the corresponding color of the vehicle, chosen from a set of 10 colors: blue, black, beige, red, white, yellow, orange, purple, green and gray. Note that the number of color categories is far smaller than the number of vehicle models.
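The paper does not spell out the semi-automated pruning mechanism. A plausible sketch, assuming a preliminary classifier that returns a label and a confidence score, is to auto-accept images whose prediction agrees confidently with the source tag and route everything else to human annotators; the function and parameter names below are ours, not Sighthound's:

```python
def split_for_review(images, predict, min_confidence=0.9):
    """Route images whose predicted label disagrees with the source
    tag, or whose confidence is low, to human review; auto-accept
    the rest. `images` is a list of (image, source_tag) pairs and
    `predict` returns a (label, confidence) tuple."""
    accepted, needs_review = [], []
    for img, tag in images:
        label, conf = predict(img)
        if label == tag and conf >= min_confidence:
            accepted.append((img, tag))
        else:
            needs_review.append((img, tag))
    return accepted, needs_review
```

With a 5-million-image crawl, even a modest auto-accept rate removes most of the manual workload, leaving annotators to resolve only the disputed tail.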
Fig. 1.
This figure shows the pipeline of our system. Images are collected from different sources and pruned using a semi-automated process with a team of human annotators in the loop. The images are passed through Sighthound's vehicle detector and aligned, then fed to our proprietary deep network for training.
- Data Pre-processing: An important step in our training is alignment. To align images such that all labeled vehicles are centered, we used Sighthound's vehicle detection model, available through the Sighthound Cloud API. Vehicle detection not only helps us align images based on vehicle bounding boxes but also reduces the impact of the background, which is especially important when there is more than one vehicle in the image. Finally, we add a 10% margin around the vehicle box to compensate for inaccurate (or very tight) bounding boxes. For the task of color recognition, we took pains to further eliminate any influence the background may have on the outcome. To achieve this, we additionally mask the images with an elliptical mask, as shown in Figure 1. Note that in certain cases the elliptical mask removes some boundary information of the vehicle; however, this had little effect on color classification accuracy.

- Deep Training: The final stage of our pipeline in Figure 1 involves training two deep neural networks: one classifies vehicles by make and model, the other by color. Our networks are designed to achieve high accuracy while remaining computationally inexpensive. We trained our networks for four days on four GTX TITAN X PASCAL GPUs. Once the model is trained, we can label images at 150 fps in batch processing mode.
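The two pre-processing steps described above, the 10% box margin and the elliptical masking, can be sketched roughly as follows. The function names and the NumPy-based masking are illustrative assumptions, not Sighthound's actual implementation:

```python
import numpy as np

def expand_box(x, y, w, h, img_w, img_h, margin=0.10):
    """Grow a detector box by a fractional margin on each side,
    clamped to the image bounds, to compensate for inaccurate
    (or very tight) bounding boxes."""
    dx, dy = int(w * margin), int(h * margin)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1, y1 = min(img_w, x + w + dx), min(img_h, y + h + dy)
    return x0, y0, x1 - x0, y1 - y0

def elliptical_mask(crop):
    """Zero out pixels outside the largest ellipse inscribed in the
    crop, suppressing background pixels before color classification."""
    h, w = crop.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ry, rx = h / 2.0, w / 2.0
    inside = ((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2 <= 1.0
    out = crop.copy()
    out[~inside] = 0
    return out
```

The mask trades a little boundary information of the vehicle for much less background contamination, which, as noted above, had little effect on color accuracy.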
In this section, we report experimental results on two publicly available datasets: the Stanford Cars dataset [10] and the Comprehensive Cars (CompCars) dataset [4]. The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images. The data is divided into an almost 50-50 train/test split, with 8,144 training images and 8,041 testing images. Categories are typically at the level of make, model and year, which means that several categories contain the same model of a make and differ only in production year. Our original model is not trained to classify vehicle models by production year. However, after fine-tuning our model on the Stanford Cars training data, we achieve better results than previously published methods. This is mainly due to the sophisticated design of our proprietary deep neural network as well as the sizable amount of data used to train it. The quantitative results are shown in Table 1.
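For reference, the top-1 numbers in Table 1 (and the top-5 numbers reported later) follow the standard top-k accuracy metric, which can be computed with a generic routine like the one below; this is a metric sketch, not the authors' evaluation code:

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k
    highest-scoring predicted classes. `scores` is a list of
    per-class score lists, `labels` the true class indices."""
    hits = 0
    for row, label in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```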
Table 1.
Top-1 classification accuracy on the Stanford Cars dataset.

Methods                Accuracy (top-1)
Sighthound             –
Krause et al. [10]     92.–%

We also report results on the recently published Comprehensive Cars (CompCars) dataset. The task here is to classify images into 431 classes based on vehicle make and model. The data is divided into a 70%/30% train/test split, with 36,456 training images and 15,627 test images. Top-1 and top-5 accuracies are reported in Table 2, where we compare against the popular deep network architectures reported in [4]. Our fine-tuned model outperforms the existing methods by 4.68% in top-1 accuracy. It is also worth noting that our model is an order of magnitude faster than GoogLeNet.

Lastly, we test the verification accuracy of the proposed method on the CompCars dataset, which includes three sets of verification pairs sorted by difficulty. Each set contains 20,000 pairs of images. The likelihood ratio of each image pair is obtained by computing the Euclidean distance between features extracted by our deep network; this ratio is then compared against a threshold to make the final decision. The results are shown in Table 3. As can be seen, our model, fine-tuned on the verification training data of the CompCars dataset, outperforms other methods.
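The verification rule described above (thresholding the Euclidean distance between two deep feature vectors) can be sketched as follows; the feature extractor is assumed to exist elsewhere, the threshold must be calibrated on the verification training pairs, and the function name is ours:

```python
import numpy as np

def same_vehicle_model(feat_a, feat_b, threshold):
    """Verification decision for a pair of feature vectors:
    accept (same model) when the Euclidean distance between
    them falls below the calibrated threshold."""
    dist = np.linalg.norm(np.asarray(feat_a) - np.asarray(feat_b))
    return bool(dist < threshold)
```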
Top-1 and top-5 classification accuracy on the CompCars dataset. We compare our results with the popular deep networks GoogLeNet, Overfeat and AlexNet, as reported in [4].

Methods          Accuracy (top-1)   Accuracy (top-5)
Sighthound       95.88%             –
GoogLeNet [4]    91.2%              98.1%
Overfeat [4]     87.9%              96.9%
AlexNet [4]      81.9%              94.0%

It is worth mentioning that, even without fine-tuning, our features achieve a high verification accuracy of over 92% on the easy set, with strong results on the medium and hard sets as well.
Table 3.
Verification accuracy on the three sets (easy, medium and hard) defined in [4]. Sighthound denotes our model fine-tuned on Sighthound data; it outperforms previous methods by a large margin.

Methods            Accuracy (Easy)   Accuracy (Medium)   Accuracy (Hard)
Sighthound         –                 –                   –
Yang et al. [4]    83.3%             82.4%               76.–%
–                  –.0%              82.7%               76.–%

We demonstrate some qualitative results in Figures 2 and 3, capturing different scenarios. Figure 2 shows results for images mostly taken by people, while Figure 3 shows a surveillance-like scenario where the camera is mounted farther from the ground. These images illustrate that our large training dataset, captured from diverse sources, makes the model robust in real-world scenarios.
In this paper we presented an end-to-end system for vehicle make, model and color recognition. The combination of Sighthound's novel approach to the design and implementation of deep neural networks and a sizable training dataset allows us to label vehicles in real time with a high degree of accuracy. We conducted several experiments for both classification and verification tasks on public benchmarks and showed significant improvements over previous methods.

Fig. 2.

Qualitative results of the proposed method on surveillance cameras.
References
1. Sochor, J., Herout, A., Havel, J.: BoxCars: 3D boxes as CNN input for improved fine-grained vehicle recognition. In: CVPR (2016)
2. Hsieh, J.W., Chen, L.C., Chen, D.Y.: Symmetrical SURF and its applications to vehicle detection and vehicle make and model recognition. IEEE Transactions on Intelligent Transportation Systems (2014)
3. Gu, H.Z., Lee, S.Y.: Car model recognition by utilizing symmetric property to overcome severe pose variation. Machine Vision and Applications (2013)
4. Yang, L., et al.: A large-scale car dataset for fine-grained categorization and verification. In: CVPR (2015)
5. Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: ICCV (2015)
6. Xie, S., et al.: Hyper-class augmented and regularized deep learning for fine-grained image classification. In: CVPR (2015)