Ensembling object detectors for image and video data analysis

Kateryna Chumachenko†, Jenni Raitoharju†§, Alexandros Iosifidis⋆, Moncef Gabbouj†

† Tampere University, Faculty of Information Technology and Communication Sciences, Finland
§ Finnish Environment Institute, Programme for Environmental Information, Finland
⋆ Aarhus University, Department of Electrical and Computer Engineering, Denmark
ABSTRACT
In this paper, we propose a method for ensembling the outputs of multiple object detectors for improving detection performance and the precision of bounding boxes on image data. We further extend it to video data by proposing a two-stage tracking-based scheme for detection refinement. The proposed method can be used as a standalone approach for improving object detection performance, or as a part of a framework for faster bounding box annotation in unseen datasets, assuming that the objects of interest are those present in some common public datasets.
Index Terms — object detection, bounding box annotation, ensemble models
1. INTRODUCTION
Driven by the broad availability of extensive datasets in different domains, object detection has become one of the most widely used tools within the field of computer vision in recent years, finding applications in various areas, such as video surveillance [1], medical diagnostics [2], historical image analysis [3], and industrial applications [4]. Despite the huge progress made in the field of object detection in recent years, not much attention has been paid to the generalization ability of object detection methods: the developed algorithms assume that the training and test data come from the same distribution, and the test performance is reported on data from the same dataset used for training. For this reason, it is always preferable to train the methods on data collected directly from the domain of application, even if the classes of interest are present in some public datasets. Nevertheless, models trained on public datasets are still often used in industrial applications. This is primarily due to the fact that obtaining training data from the domain of interest is generally both expensive and time-consuming, as it requires a significant amount of manual labour related to the annotation of data specific to the application.

Several methods have been proposed for speeding up the bounding box annotation process on images [5, 6, 7]. Most of them still require large amounts of manual annotations, corrections, and retraining of the models. However, it is often the case that the classes of interest belong to the set of common classes present in public datasets, such as people, vehicles, and animals. Using widely available object detection methods pre-trained on such public datasets has the potential to significantly reduce the amount of manual labour required for the annotation process. In this work, we aim to take a step towards utilizing such methods in a way that improves the object detection performance and reduces the annotation time.
This work was supported by Business Finland under the project 5G Vertical Integrated Industry for Massive Automation (5G-VIIMA).
Fig. 1. Examples of the detection results of the base detectors, the proposed method, and the ground truth bounding boxes on the PASCAL VOC dataset.

Despite the rapid development of object detection methods [8, 9, 10, 11], only a very limited number of works have focused on creating ensembles of those for improved detection performance [12, 13, 14]. Moreover, to the best of our knowledge, none of the existing approaches aims at improving the precision of the resulting bounding boxes, although precision can play a crucial role in a variety of applications, primarily including tracking and re-identification problems. In this paper, we aim to take a step towards improving object detection performance and precision in image and video data by proposing a method for ensembling multiple object detection methods. In addition, we propose a weighting scheme with regression-based weights learnt from a small number of images, as well as an extension for video data utilizing temporal information.
2. RELATED WORK
Significant progress has been made in the field of object detection in recent years [8, 9, 10, 11], dominated by deep learning-based methods. To improve the object detection performance, several approaches for combining information from multiple object detectors have been proposed recently. In [13], a cascade of two face detectors was proposed to reduce the number of false-positive detections. Several metrics for evaluating the diversity and correlation of detectors were used to select the best pair of detectors in [14]. In [12], an SVM classifier was used to select the suitable detection method for speed considerations. A learning-to-rank based approach, intended to rank the more relevant detections higher, was proposed in [15]. Notably, in all the above-mentioned approaches, the combination is aimed at the selection of the best detection. In our method, we first use contextual information of the detectors to select the relevant detections and subsequently improve the precision of the final bounding box by taking a weighted average of the coordinates, with the weights learned from a small subset of images of the target dataset.

The need for data annotation in the application domain motivates the emergence of various methods for speeding up the bounding box annotation process. A common approach to annotating large-scale datasets is using crowd-sourced annotations [5]. Since this approach exposes the data to the public, it cannot be used for applications where the data needs to be kept private. Another approach relies on self-training [6, 7]. There, the general methodology is to first annotate a set of images manually and use them for the training of an object detector. The trained detector is subsequently used for producing the bounding boxes for the rest of the dataset. The obtained detections are manually refined, and the process continues until perfect annotation is achieved.
Although minimizing the needed workload to some extent, the self-training approaches still require a high number of images to be annotated. In addition, a certain amount of time is needed for training the object detector.
3. PROPOSED METHODS
The main motivation behind the proposed methods lies in the assumption that distinct object detectors achieve different levels of performance on different data, and that extraction of useful information from each method has potential for improving the detection accuracy and the precision of the bounding boxes. The proposed approach can be used as a standalone methodology for improving object detection performance or as a part of a framework for creating bounding box annotations. We rely on three state-of-the-art object detectors: SSD [9], YOLOv3 [10], and RetinaNet [11]. The choice of these detectors is based on their fast execution and the good object detection performance reported in the literature. Here, one should note that the method can be equally well applied to any set of object detectors.
First, we consider the straightforward way of fusion by non-maximum suppression that will serve as a baseline for our work. Here, we treat all obtained detections as coming from one object detector and apply non-maximum suppression to suppress non-confident duplicates of the detections identified to be the true positive ones. First, all bounding boxes detected in the image are sorted according to their confidence scores. The most confident bounding box is selected as the true detection. Then, the Intersection over Union,
IoU = Area of overlap / Area of union, between the true detection's bounding box and every other detection's bounding box is calculated. The bounding boxes with an IoU above a certain threshold θ are identified as belonging to the same object as the true bounding box and removed from the list of detections. The process continues from the next most confident bounding box out of the remaining ones and is performed for each class separately.

Although providing a reasonable level of performance improvement, this approach suffers from several limitations. Firstly, the final bounding boxes are simply the bounding boxes detected by one of the base detectors, limiting the potential of fusing the knowledge of multiple methods. Secondly, the use of the confidence score of the detector as a metric for deciding the quality of bounding boxes is questionable. The common interpretation of the confidence score is the probability of the bounding box being a true positive. However, such an interpretation does not have a strong theoretical basis, as the confidence scores of different object detection methods, being learned parameters, can reflect significantly different true-positive rates. In other words, the confidence score scales of different object detection approaches are not calibrated, and some methods tend to produce high-confidence detections that are false positives, while others output true positive detections with low confidence scores.

Taking a step towards a more meaningful use of the knowledge provided by several object detectors, we propose the following approach that adjusts the final selection of bounding boxes based on how likely they are to correspond to true detections. First, the previously described IoU-based merging is employed to identify the bounding boxes belonging to the same object.
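The IoU computation and the greedy, class-wise non-maximum suppression baseline described above can be sketched as follows. This is a minimal illustration; the (x1, y1, x2, y2) box layout and the threshold value of 0.5 are assumptions, not the paper's implementation.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(detections, theta=0.5):
    """Greedy NMS over (box, score) pairs of a single class."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)            # most confident remaining box
        kept.append(best)
        # discard every box overlapping the kept one above the threshold
        remaining = [d for d in remaining if iou(best[0], d[0]) <= theta]
    return kept
```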
However, instead of discarding all the bounding boxes corresponding to the same object as the most confident bounding box, we keep track of the source detectors of each bounding box. From the set of bounding boxes obtained from each detector, we select only the one having the highest IoU with the most confident bounding box. As a result, we obtain a set of detections, each described by 1-3 bounding boxes. At this point, we can observe that objects detected by only one of the detectors are generally false positives, so out of the obtained set of detections, we discard the ones described by only one bounding box. In order to obtain the final detections, the selected bounding boxes for each detection need to be combined. To exploit the knowledge present in each of these bounding boxes, we propose to fuse them as a weighted combination with learned weights.
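The consensus step described above — group boxes overlapping the most confident one, keep at most one box per source detector (the one with the highest IoU to the anchor), and drop groups supported by a single detector — can be sketched as follows. The (box, score, detector_id) layout is an assumption made for this example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def group_and_filter(detections, theta=0.5):
    """detections: list of (box, score, detector_id) for one class."""
    pool = sorted(detections, key=lambda d: d[1], reverse=True)
    groups = []
    while pool:
        anchor = pool.pop(0)               # most confident remaining box
        matched = [d for d in pool if iou(anchor[0], d[0]) > theta]
        pool = [d for d in pool if d not in matched]
        best = {anchor[2]: anchor}         # one box per source detector
        for d in matched:
            prev = best.get(d[2])
            if prev is None or iou(anchor[0], d[0]) > iou(anchor[0], prev[0]):
                best[d[2]] = d
        groups.append(list(best.values()))
    # objects seen by a single detector are treated as false positives
    return [g for g in groups if len(g) >= 2]
```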
In order to improve the precision of the bounding box coordinates, we further extend the proposed approach by taking a weighted average of the coordinates of each of the vertices of the bounding boxes identified to belong to the same object. We weight each coordinate by the corresponding confidence score of the detection and normalize the resulting value by the sum of the confidence scores, so that more confident detectors put more weight on the final output. Direct use of confidence scores would suffer from the limitations caused by the scores of different detectors not being calibrated. To address this issue, we propose a scheme for re-weighting the output of each detector that results in better calibration of confidence scores and, therefore, in a more meaningful combination of the bounding boxes. Assuming that we have obtained a scalar weight w_j for the j-th detector, the new coordinates and confidence scores are calculated as

\hat{c}_i = \frac{\sum_{j=1}^{D} s_{ji} w_j c_{ji}}{\sum_{j=1}^{D} s_{ji} w_j} \quad \text{and} \quad \hat{s}_i = \frac{\sum_{j=1}^{D} w_j s_{ji}}{\sum_{j=1}^{D} w_j},   (1)

where \hat{c}_i and \hat{s}_i are the refined coordinate and the updated confidence score of the i-th detection, D is the number of detectors, w_j is the weight of the j-th detector, and s_{ji} and c_{ji} are the score and coordinate of the i-th detection of the j-th detector.

Appropriate weights w_j for each detector cannot be determined empirically. Therefore, we formulate a regression problem to learn these weights from a low number of manually annotated images. Let us denote by b_{ji} = [x^1_{ji}, y^1_{ji}, x^2_{ji}, y^2_{ji}] a vector of coordinates representing the i-th bounding box of detector j, where x^1_{ji}, y^1_{ji}, x^2_{ji}, and y^2_{ji} are the coordinates of two of the bounding box corners. Let the confidence score corresponding to this bounding box be s_{ji}, and let g_i = [g_{x1}, g_{y1}, g_{x2}, g_{y2}] be the corresponding ground truth vector of coordinates.
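The fusion of Eq. (1) can be sketched for a single object as follows: each coordinate is averaged over the detectors that saw the object, scaled by confidence score and learned weight, and the fused confidence is a weight-normalized average of the scores. The list-based data layout is an assumption made for illustration.

```python
def fuse(boxes, scores, weights):
    """Apply Eq. (1) to one object.

    boxes: per-detector [x1, y1, x2, y2]; scores, weights: per-detector.
    """
    # denominator of the coordinate average: sum of s_ji * w_j
    denom_c = sum(s * w for s, w in zip(scores, weights))
    fused_box = [
        sum(s * w * b[d] for b, s, w in zip(boxes, scores, weights)) / denom_c
        for d in range(4)
    ]
    # updated confidence score: weight-normalized average of the scores
    fused_score = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return fused_box, fused_score
```

A detector whose confidence is low contributes correspondingly little to the fused coordinates, which is the intended calibration effect of the score-and-weight scaling.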
Assuming that we are operating with D distinct object detection methods, each i-th detection X_i can be represented by a D × 4 matrix X_i = [b_{1i} s_{1i}; b_{2i} s_{2i}; ...; b_{Di} s_{Di}] of confidence-score-scaled bounding box coordinates obtained from the D detectors. Therefore, our goal is to find a D-dimensional set of weights w = [w_1, w_2, ..., w_D]^T that satisfies the following criterion for each detection i:

w^T X_i = g_i \;\Rightarrow\; [w_1, w_2, \ldots, w_D] \begin{bmatrix} x^1_{1i} s_{1i} & y^1_{1i} s_{1i} & x^2_{1i} s_{1i} & y^2_{1i} s_{1i} \\ \vdots & \vdots & \vdots & \vdots \\ x^1_{Di} s_{Di} & y^1_{Di} s_{Di} & x^2_{Di} s_{Di} & y^2_{Di} s_{Di} \end{bmatrix} = [g_{x1}, g_{y1}, g_{x2}, g_{y2}].   (2)

To obtain the solution to this problem, we optimize it iteratively by means of Stochastic Gradient Descent over the Mean Squared Error (MSE) loss, defined as

L_{MSE} = \frac{1}{4n} \sum_{i=1}^{n} \sum_{d=1}^{4} (g_{i,d} - \hat{g}_{i,d})^2,   (3)

where n is the number of training samples, g_{i,d} is the d-th coordinate of the ground truth bounding box of the i-th sample, and \hat{g}_{i,d} is the corresponding predicted coordinate.

The described process requires the creation of a training set of detection-ground truth pairs {X_i; g_i}. For this purpose, we first manually annotate a low number of images with the ground truth bounding boxes (100 in our experiments), as annotation of such a small number of images is not time-consuming, and apply the base object detection methods on these images. To reduce the amount of manual labour required for assigning each bounding box to the corresponding ground truth box, we form the {X_i; g_i} pairs with the following process: for each ground truth bounding box g_i, we find all the overlapping bounding boxes of the corresponding class obtained by the different detectors and select the ones having an IoU higher than a certain threshold θ.
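The weight learning of Eqs. (2)-(3) can be sketched with gradient descent over the MSE loss. This is an illustrative reimplementation; the use of NumPy, full-batch updates, and the optimizer settings are assumptions rather than the paper's exact configuration.

```python
import numpy as np

def learn_weights(X, G, lr=0.01, steps=2000):
    """Learn detector weights w minimizing the MSE of Eq. (3).

    X: (n, D, 4) array of confidence-score-scaled boxes b_ji * s_ji;
    G: (n, 4) array of ground truth boxes g_i.
    """
    n, D, _ = X.shape
    w = np.zeros(D)                              # zero-initialized, as in the text
    for _ in range(steps):
        pred = np.einsum('d,ndk->nk', w, X)      # w^T X_i for every sample
        # gradient of (1 / 4n) * sum (pred - g)^2 with respect to w
        grad = (2.0 / (4 * n)) * np.einsum('nk,ndk->d', pred - G, X)
        w -= lr * grad
    return w
```

Because the loss is a convex least-squares objective in w (only D parameters), plain gradient descent converges quickly; the paper additionally monitors validation MSE to pick the best iterate.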
From this set, for each of the detectors we select the bounding box having the highest IoU with the ground truth box, and the rest of the bounding boxes are discarded. The obtained bounding boxes are then used to create the matrix X_i corresponding to the ground truth coordinates g_i. In case one of the detectors did not produce a detection that would correspond to the ground truth, we simply set the corresponding bounding box in X_i to zeros, i.e., b_{ji} = [0, 0, 0, 0]. We discard redundant duplicate detections and false-positive detections that did not match with any of the ground truth bounding boxes, since the re-weighting of bounding boxes cannot fix the presence of incorrect detections as such.

An additional step can be taken to further improve the performance on video data by taking advantage of temporal information. We can safely assume that objects in a video sequence do not appear randomly at different frames, but follow certain trajectories of considerable time length. Therefore, we can enrich the set of detections obtained using the proposed ensembling approach with the ones that were likely to be missed, and reduce the number of false-positive ones, by following a two-stage tracking-based approach.

In the first stage, to obtain a set of detections that were likely missed by the object detectors, we apply a set of correlation trackers [16]. At the first frame, an object tracker is initialized from each of the detections of that frame, and the objects are tracked throughout the video. At each subsequent frame, the tracked bounding boxes are matched with the detections of the frame following the IoU-based matching process. A successful match is defined by an IoU exceeding 0.5, or 0.4 in case the tracklet was initialized in the past 3 frames, due to the assumption that the shape of the object changes significantly when entering the field of view, leading to higher differences between the bounding boxes detected by the detectors and the one tracked by the tracker.
For the detections not matched with any of the tracklets, a new tracker is initialized, and objects that were tracked successfully but for which no detections were found continue to be tracked unless one of the following holds:

1. the tracklet was only matched with detections for a small number of frames (in our experiments, we set this number to 5) and was then missed for more than 5 frames;

2. the tracklet was not matched with any of the detections for more than 50 frames.

The first rule is intended for discarding tracklets that are likely to be initialized from false-positive detections, and the second rule is for discarding tracklets corresponding to objects that are no longer present in the scene, but for which tracking continued. When an object that was tracked, but not matched with detections for a certain number of frames, is rematched with a detection again, the bounding boxes predicted by the tracker at the frames where no detection happened are added to the set of detections.

At the second stage, a multi-object tracker [17] is applied to reduce the number of false-positive detections: using the set of detections enriched with the ones obtained from the tracker at the first stage, we identify the sequences of detections belonging to the same object. The resulting sequences that consist of only a small number of consecutive frames (fewer than 5 in our work), and are thus most likely to correspond to false-positive detections, are discarded. Note that the choice of tracking methods here is dictated by their fast speed, and any other tracking methods can be employed as well.
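The tracklet bookkeeping rules above can be summarized in a short sketch. The frame thresholds are the values stated in the text; the function and constant names are illustrative, not from the paper's implementation.

```python
MIN_MATCHED = 5    # rule 1: matched frames needed before a tracklet is trusted
MAX_GAP_SHORT = 5  # rule 1: allowed miss streak for an unconfirmed tracklet
MAX_GAP_LONG = 50  # rule 2: allowed miss streak for any tracklet

def should_drop(matched_frames, missed_frames):
    """Decide whether a tracklet should stop being propagated."""
    if matched_frames < MIN_MATCHED and missed_frames > MAX_GAP_SHORT:
        return True    # likely initialized from a false-positive detection
    if missed_frames > MAX_GAP_LONG:
        return True    # the object has most likely left the scene
    return False

def match_iou_threshold(age_in_frames):
    """Relax the IoU matching threshold for recently initialized tracklets."""
    return 0.4 if age_in_frames <= 3 else 0.5
```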
4. EXPERIMENTAL SETUP AND RESULTS
To assess the applicability of the proposed approaches to real-world problems and evaluate the ability of the base detectors to generalize to previously unseen data, we evaluate the algorithms on different datasets than the ones they were trained on. We select three state-of-the-art object detectors as our base methods: SSD [9] and YOLOv3 [10], each applied to images resized to the fixed square input resolution of the respective model, and RetinaNet [11], applied to images rescaled such that the smaller side is equal to 800 pixels. All models are trained on the MS-COCO dataset [18]. To evaluate the ability of the methods to generalize to previously unseen data, we report the results on the intersection of classes with the previously unseen PASCAL VOC dataset (training + test set) [19] and three video datasets: the EPFL Campus-7 dataset, the EPFL Lab-6 dataset [20], and the Campus Auditorium dataset [21, 22]. In all video datasets, only the people class is considered, since the ground truth data is available only for this class. We use the Mean Average Precision (MAP) as a metric for evaluation of the detection performance, as defined in the PASCAL VOC 2012 challenge [23, 24]. Out of all the obtained detections, we discard the ones with a confidence score below 0.05 and report the MAP at the default IoU of 0.5 for each class separately, as well as the total MAP. Besides, to evaluate the effect of the proposed approach on the precision of the bounding boxes, we report MAP at the higher IoUs of 0.75 and 0.85 for each class in the image dataset.

In our experiments, we perform 5-fold cross-validation. In the image dataset, 100 samples are used for training the score re-weighting model in each of the folds, with the remaining 9863 images used for testing. In the video datasets, we split the videos into 5 continuous segments, in each of which the last 100 images are used for training, with the rest of the frame sequence used for testing. This is done because the proposed tracking-based approach requires an uninterrupted video sequence. The same test sets are used for reporting the results of the separate object detectors, and the mean MAP value across the 5 folds is reported. For learning the weights, the bounding box coordinates x and y are scaled by the image width and height, respectively. From the resulting pairs obtained from the 100 annotated images, 30% are taken for validation, and the regression model is trained on the remaining 70%, starting from zero-initialized weights. The MSE is calculated on the validation set at each iteration, and training proceeds until the MSE stops improving for a number of iterations, after which the weights resulting in the best performance are selected. We report separately the results of each object detection method, the proposed approach, and, in the video datasets, the proposed approach refined by the tracking-based refinement scheme. We also report the results obtained by applying solely non-maximum suppression to the detectors' output to showcase that the improvement of the detection performance is caused to a large extent by the re-weighting scheme.

Table 1. MAP results on PASCAL VOC dataset. Per-class and total MAP of RetinaNet, SSD, YOLOv3, the NMS baseline, and the proposed method (Our) at IoU thresholds of 0.5, 0.75, and 0.85.

Method          plane  bicycle bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  bike   person plant  sheep  sofa   train  monit. TOTAL
Our ([email protected])  90.97  87.08  …
Our ([email protected]) 67.43  59.32  54.43  31.65  42.69  78.32  64.41  77.37  43.70  66.41  43.55  69.74  66.24  64.43  56.60  22.84  55.21  60.05  72.39  60.04  57.84
Our ([email protected]) 37.65  30.51  27.78  10.53  15.13  63.76  40.11  50.58  19.13  37.47  21.04  44.65  43.40  35.38  27.08  6.05   29.91  37.05  46.58  23.10  32.35

Table 2. [email protected] results on video datasets (EPFL Campus-7, EPFL Lab-6, Campus Auditorium) for RetinaNet, SSD, YOLOv3, NMS, the proposed method (Our), and the proposed method with tracking-based refinement (Our-tr.).

The results of the base object detection methods as well as the proposed approaches on the image and video datasets are presented in Tables 1 and 2, respectively. On the PASCAL VOC dataset, the best overall performance among the base detectors is achieved by YOLOv3 at [email protected], and by RetinaNet at [email protected]. This allows us to conclude that YOLOv3 is better at detecting the presence of the objects of interest in general, whereas RetinaNet produces bounding boxes that match the ground truth boxes more closely. This observation also reinforces the motivation of our proposed approach: by combining the outputs of several object detectors, we can exploit the good detection ability of less precise detectors, such as YOLOv3, while compensating with the precision of the bounding boxes of more precise detectors, such as RetinaNet. We observe that the proposed approach outperforms the base detectors at all the IoU thresholds. At [email protected], the per-class improvements range from the smallest gain on the dining table class up to the largest gain on the boat class, and our proposed approach performs on par with non-maximum suppression. However, the performance differences increase with the increase in the IoU threshold: we achieve better performance than non-maximum suppression at both [email protected] and [email protected]. Besides, we outperform all base detection methods at all IoU thresholds. Some example results on the PASCAL VOC dataset can be seen in Figure 1. Note that, for clarity, only the detections with a confidence score of at least 0.4 are shown. In the video datasets, the proposed regression-based approach achieves a significant improvement over all of the detectors on all three datasets, and applying the tracking-based refinement process pushes this improvement further on the EPFL Campus-7, EPFL Lab-6, and Campus Auditorium datasets.
5. CONCLUSIONS
We proposed a method for ensembling multiple object detectors that re-weights the confidence scores and bounding box coordinates, as well as exploits contextual information. The method resulted in better MAP scores compared to the base detectors, with an especially large gap at higher IoU thresholds. The extension of the proposed method for video data, utilizing temporal information, pushes the improvement in performance even further. The proposed methods can, therefore, be utilized directly for obtaining improved object detection results or as a part of a framework for creating annotations on new datasets.

6. REFERENCES

[1] H. Jung, M. Choi, J. Jung, J. Lee, S. Kwon, and W. Young Jung, "Resnet-based vehicle classification and localization in traffic surveillance systems," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 61–67.
[2] G. Hamed, M. Marey, S. Amin, and M. Tolba, "Deep learning in breast cancer detection and classification," in Joint European-US Workshop on Applications of Invariance in Computer Vision. Springer, 2020, pp. 322–333.
[3] K. Chumachenko, A. Männistö, A. Iosifidis, and J. Raitoharju, "Machine learning based analysis of Finnish World War II photographers," IEEE Access, vol. 8, pp. 144184–144196, 2020.
[4] Y. Tian, G. Yang, Z. Wang, H. Wang, E. Li, and Z. Liang, "Apple detection during different growth stages in orchards using the improved YOLO-v3 model," Computers and Electronics in Agriculture, vol. 157, pp. 417–426, 2019.
[5] H. Su, J. Deng, and L. Fei-Fei, "Crowdsourcing annotations for visual object detection," in AAAI Conference on Artificial Intelligence Workshops, 2012.
[6] B. Adhikari, J. Peltomaki, J. Puura, and H. Huttunen, "Faster bounding box annotation for object detection in indoor scenes," in European Workshop on Visual Information Processing. IEEE, 2018, pp. 1–6.
[7] V. Wong, M. Ferguson, K. Law, and Y. Lee, "An assistive learning workflow on annotating images for object detection," in IEEE International Conference on Big Data. IEEE, 2019, pp. 1962–1970.
[8] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, "Deep learning for generic object detection: a survey," International Journal of Computer Vision, vol. 128, no. 2, pp. 261–318, 2020.
[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg, "SSD: single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[10] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[11] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[12] H. Zhou, B. Gao, and J. Wu, "Adaptive feeding: achieving fast and accurate detections by adaptively combining object detectors," in IEEE International Conference on Computer Vision, 2017, pp. 3505–3513.
[13] D. Marčetić, T. Hrkać, and S. Ribarić, "Two-stage cascade model for unconstrained face detection," in International Workshop on Sensing, Processing and Learning for Intelligent Machines. IEEE, 2016, pp. 1–4.
[14] S. Yang, A. Wiliem, and B. Lovell, "It takes two to tango: cascading off-the-shelf face detectors," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 535–543.
[15] S. Karaoglu, Y. Liu, and T. Gevers, "Detect2rank: combining object detectors using learning to rank," IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 233–248, 2015.
[16] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, "Accurate scale estimation for robust visual tracking," in British Machine Vision Conference. BMVA Press, 2014.
[17] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in IEEE International Conference on Image Processing. IEEE, 2017, pp. 3645–3649.
[18] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick, "Microsoft COCO: common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[19] M. Everingham, L. van Gool, C. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[20] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multicamera people tracking with a probabilistic occupancy map," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267–282, 2007.
[21] Y. Xu, X. Liu, L. Qin, and S. Zhu, "Cross-view people tracking by scene-centered spatio-temporal parsing," in AAAI Conference on Artificial Intelligence, 2017, pp. 4299–4305.
[22] Y. Xu, X. Liu, Y. Liu, and S. Zhu, "Multi-view people tracking via hierarchical trajectory composition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4256–4265.
[23] J. Cartucho, R. Ventura, and M. Veloso, "Robust object recognition through symbiotic deep learning in mobile robots," in