Ensembling object detectors for image and video data analysis

Kateryna Chumachenko†, Jenni Raitoharju†§, Alexandros Iosifidis⋆, Moncef Gabbouj†

† Tampere University, Faculty of Information Technology and Communication Sciences, Finland
§ Finnish Environment Institute, Programme for Environmental Information, Finland
⋆ Aarhus University, Department of Electrical and Computer Engineering, Denmark
ABSTRACT
In this paper, we propose a method for ensembling the outputs of multiple object detectors for improving detection performance and the precision of bounding boxes on image data. We further extend it to video data by proposing a two-stage tracking-based scheme for detection refinement. The proposed method can be used as a standalone approach for improving object detection performance, or as a part of a framework for faster bounding box annotation in unseen datasets, assuming that the objects of interest are those present in some common public datasets.
Index Terms — object detection, bounding box annotation, ensemble models
1. INTRODUCTION
Driven by the broad availability of extensive datasets in different domains, object detection has become one of the most widely used tools within the field of computer vision in recent years, finding applications in various areas, such as video surveillance [1], medical diagnostics [2], historical image analysis [3], and industrial applications [4]. Despite the huge progress made in the field of object detection in recent years, not much attention has been paid to the generalization ability of object detection methods: the developed algorithms assume that the training and test data come from the same distribution, and the test performance is reported on data from the same dataset used for training. For this reason, it is always preferable to train the methods on data collected directly from the domain of application, even if the classes of interest are present in some public datasets. Nevertheless, models trained on public datasets are still often used in industrial applications. This is primarily due to the fact that obtaining training data from the domain of interest is generally both expensive and time-consuming, as it requires a significant amount of manual labour related to the annotation of data specific to the application.

Several methods have been proposed for speeding up the bounding box annotation process on images [5, 6, 7]. Most of them still require large amounts of manual annotations, corrections, and retraining of the models. However, it is often the case that the classes of interest belong to the set of common classes present in public datasets, such as people, vehicles, and animals. Using widely available object detection methods pre-trained on such public datasets has the potential to significantly reduce the amount of manual labour required for the annotation process. In this work, we aim to take a step towards utilizing such methods in a way that improves the object detection performance and reduces the annotation time.
This work was supported by Business Finland under the project 5G Vertical Integrated Industry for Massive Automation (5G-VIIMA).
Fig. 1. Examples of the detection results of the base detectors, the proposed method, and the ground truth bounding boxes on the PASCAL VOC dataset.

Despite the rapid development of object detection methods [8, 9, 10, 11], only a very limited number of works have focused on creating ensembles of those for improved detection performance [12, 13, 14]. Moreover, to the best of our knowledge, none of the existing approaches aims at improving the precision of the resulting bounding boxes, although precision can play a crucial role in a variety of applications, primarily including tracking and re-identification problems. In this paper, we aim to take a step towards improving object detection performance and precision in image and video data by proposing a method for ensembling multiple object detection methods. In addition, we propose a weighting scheme with regression-based weights learnt from a small number of images, as well as an extension for video data utilizing temporal information.
2. RELATED WORK
Significant progress has been made in the field of object detection in recent years [8, 9, 10, 11], dominated by deep learning-based methods. To improve the object detection performance, several approaches for combining information from multiple object detectors have been proposed recently. In [13], a cascade of two face detectors was proposed to reduce the number of false-positive detections. Several metrics for evaluating the diversity and correlation of detectors were used to select the best pair of detectors in [14]. In [12], an SVM classifier was used to select the suitable detection method for speed considerations. A learning-to-rank based approach, intended to rank the more relevant detections higher, was proposed in [15]. Notably, in all the above-mentioned approaches, the combination is aimed at the selection of the best detection. In our method, we first use contextual information of the detectors to select the relevant detections and subsequently improve the precision of the final bounding box by taking a weighted average of the coordinates, with the weights learned from a small subset of images of the target dataset.

The need for data annotation in the application domain motivates the emergence of various methods for speeding up the bounding box annotation process. A common approach to annotating large-scale datasets is using crowd-sourced annotations [5]. Since this approach exposes the data to the public, it cannot be used for applications where the data needs to be kept private. Another approach relies on self-training [6, 7]. There, the general methodology is to first annotate a set of images manually and use them for the training of an object detector. The trained detector is subsequently used for producing the bounding boxes for the rest of the dataset. The obtained detections are manually refined, and the process continues until perfect annotation is achieved.
Although minimizing the needed workload to some extent, the self-training approaches still require a high number of images to be annotated. In addition, a certain amount of time is needed for training the object detector.
3. PROPOSED METHODS
The main motivation behind the proposed methods lies in the assumption that distinct object detectors achieve different levels of performance on different data, and that extraction of useful information from each method has potential for improving the detection accuracy and the precision of the bounding boxes. The proposed approach can be used as a standalone methodology for improving object detection performance or as a part of a framework for creating bounding box annotations. We rely on three state-of-the-art object detectors: SSD [9], YOLOv3 [10], and RetinaNet [11]. The choice of these detectors is based on their fast execution and the good object detection performance reported in the literature. Here, one should note that the method can be equally well applied to any set of object detectors.
First, we consider the straightforward way of fusion by non-maximum suppression that will serve as a baseline for our work. Here, we treat all obtained detections as coming from one object detector and apply non-maximum suppression to suppress non-confident duplicates of the detections identified to be the true positive ones. First, all bounding boxes detected in the image are sorted according to their confidence scores. The most confident bounding box is selected as the true detection. Then, the Intersection over Union,
IoU = Area of overlap / Area of union, between the true detection's bounding box and every other detection's bounding box is calculated. The bounding boxes with an IoU above a certain threshold θ are identified as belonging to the same object as the true bounding box and removed from the list of detections. The process continues from the next most confident bounding box out of the remaining ones and is performed for each class separately.

Although providing a reasonable level of performance improvement, this approach suffers from several limitations. Firstly, the final bounding boxes are simply the bounding boxes detected by one of the base detectors, limiting the potential of fusing the knowledge of multiple methods. Secondly, the use of the confidence score of the detector as a metric for deciding the quality of bounding boxes is questionable. The common interpretation of the confidence score is the probability of the bounding box being a true positive. However, such an interpretation does not have a strong theoretical basis, as the confidence scores of different object detection methods, being learned parameters, can reflect significantly different true-positive rates. In other words, the confidence score scales of different object detection approaches are not calibrated, and some methods tend to produce high-confidence detections that are false positives, while others output true positive detections with low confidence scores.

Taking a step towards a more meaningful use of the knowledge provided by several object detectors, we propose the following approach that adjusts the final selection of bounding boxes based on how likely they are to correspond to true detections. First, the previously described IoU-based merging is employed to identify the bounding boxes belonging to the same object.
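The IoU computation and the greedy, class-wise non-maximum suppression baseline described above can be sketched as follows. This is a minimal illustration; the (x1, y1, x2, y2) box layout and the threshold value of 0.5 are assumptions, not the paper's implementation.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(detections, theta=0.5):
    """Greedy NMS over (box, score) pairs of a single class."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)            # most confident remaining box
        kept.append(best)
        # discard every box overlapping the kept one above the threshold
        remaining = [d for d in remaining if iou(best[0], d[0]) <= theta]
    return kept
```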
However, instead of discarding all the bounding boxes corresponding to the same object as the most confident bounding box, we keep track of the source detectors of each bounding box. From the set of bounding boxes obtained from each detector, we select only the one having the highest IoU with the most confident bounding box. As a result, we obtain a set of detections, each described by 1-3 bounding boxes. At this point, we can observe that objects detected by only one of the detectors are generally false positives, so out of the obtained set of detections, we discard the ones described by only one bounding box. In order to obtain the final detections, the selected bounding boxes for each detection need to be combined. To exploit the knowledge present in each of these bounding boxes, we propose to fuse them as a weighted combination with learned weights.
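The consensus step described above — group boxes overlapping the most confident one, keep at most one box per source detector (the one with the highest IoU to the anchor), and drop groups supported by a single detector — can be sketched as follows. The (box, score, detector_id) layout is an assumption made for this example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def group_and_filter(detections, theta=0.5):
    """detections: list of (box, score, detector_id) for one class."""
    pool = sorted(detections, key=lambda d: d[1], reverse=True)
    groups = []
    while pool:
        anchor = pool.pop(0)               # most confident remaining box
        matched = [d for d in pool if iou(anchor[0], d[0]) > theta]
        pool = [d for d in pool if d not in matched]
        best = {anchor[2]: anchor}         # one box per source detector
        for d in matched:
            prev = best.get(d[2])
            if prev is None or iou(anchor[0], d[0]) > iou(anchor[0], prev[0]):
                best[d[2]] = d
        groups.append(list(best.values()))
    # objects seen by a single detector are treated as false positives
    return [g for g in groups if len(g) >= 2]
```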
In order to improve the precision of the bounding box coordinates, we further extend the proposed approach by taking a weighted average of the coordinates of each of the vertices of the bounding boxes identified to belong to the same object. We weight each coordinate by the corresponding confidence score of the detection and normalize the resulting value by the sum of the confidence scores, so that more confident detectors put more weight on the final output. Direct use of confidence scores would suffer from the limitations caused by the scores of different detectors not being calibrated. To address this issue, we propose a scheme for re-weighting the output of each detector that results in better calibration of confidence scores and, therefore, in a more meaningful combination of the bounding boxes. Assuming that we have obtained a scalar weight w_j for the j-th detector, the new coordinates and confidence scores are calculated as

\hat{c}_i = \frac{\sum_{j=1}^{D} s_{ji} w_j c_{ji}}{\sum_{j=1}^{D} s_{ji} w_j} \quad \text{and} \quad \hat{s}_i = \frac{\sum_{j=1}^{D} w_j s_{ji}}{\sum_{j=1}^{D} w_j},   (1)

where \hat{c}_i and \hat{s}_i are the refined coordinate and the updated confidence score of the i-th detection, D is the number of detectors, w_j is the weight of the j-th detector, and s_{ji} and c_{ji} are the score and coordinate of the i-th detection of the j-th detector.

Appropriate weights w_j for each detector cannot be determined empirically. Therefore, we formulate a regression problem to learn these weights from a low number of manually annotated images. Let us denote by b_{ji} = [x^1_{ji}, y^1_{ji}, x^2_{ji}, y^2_{ji}] a vector of coordinates representing the i-th bounding box of detector j, where x^1_{ji}, y^1_{ji}, x^2_{ji}, and y^2_{ji} are the coordinates of two of the bounding box corners. Let the confidence score corresponding to this bounding box be s_{ji}, and let g_i = [g_{x1}, g_{y1}, g_{x2}, g_{y2}] be the corresponding ground truth vector of coordinates.
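The fusion of Eq. (1) can be sketched for a single object as follows: each coordinate is averaged over the detectors that saw the object, scaled by confidence score and learned weight, and the fused confidence is a weight-normalized average of the scores. The list-based data layout is an assumption made for illustration.

```python
def fuse(boxes, scores, weights):
    """Apply Eq. (1) to one object.

    boxes: per-detector [x1, y1, x2, y2]; scores, weights: per-detector.
    """
    # denominator of the coordinate average: sum of s_ji * w_j
    denom_c = sum(s * w for s, w in zip(scores, weights))
    fused_box = [
        sum(s * w * b[d] for b, s, w in zip(boxes, scores, weights)) / denom_c
        for d in range(4)
    ]
    # updated confidence score: weight-normalized average of the scores
    fused_score = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return fused_box, fused_score
```

A detector whose confidence is low contributes correspondingly little to the fused coordinates, which is the intended calibration effect of the score-and-weight scaling.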
Assuming that we are operating with D distinct object detection methods, each i-th detection X_i can be represented by a D × 4 matrix X_i = [b_{1i} s_{1i}; b_{2i} s_{2i}; ...; b_{Di} s_{Di}] of confidence-score-scaled bounding box coordinates obtained from the D detectors. Therefore, our goal is to find a D-dimensional set of weights w = [w_1, w_2, ..., w_D]^T that satisfies the following criterion for each detection i:

w^T X_i = g_i \;\Rightarrow\; [w_1, w_2, \ldots, w_D] \begin{bmatrix} x^1_{1i} s_{1i} & y^1_{1i} s_{1i} & x^2_{1i} s_{1i} & y^2_{1i} s_{1i} \\ \vdots & \vdots & \vdots & \vdots \\ x^1_{Di} s_{Di} & y^1_{Di} s_{Di} & x^2_{Di} s_{Di} & y^2_{Di} s_{Di} \end{bmatrix} = [g_{x1}, g_{y1}, g_{x2}, g_{y2}].   (2)

To obtain the solution to this problem, we optimize it iteratively by means of Stochastic Gradient Descent over the Mean Squared Error (MSE) loss, defined as

L_{MSE} = \frac{1}{4n} \sum_{i=1}^{n} \sum_{d=1}^{4} (g_{i,d} - \hat{g}_{i,d})^2,   (3)

where n is the number of training samples, g_{i,d} is the d-th coordinate of the ground truth bounding box of the i-th sample, and \hat{g}_{i,d} is the corresponding predicted coordinate.

The described process requires the creation of a training set of detection-ground truth pairs {X_i; g_i}. For this purpose, we first manually annotate a low number of images with the ground truth bounding boxes (100 in our experiments), as annotation of such a small number of images is not time-consuming, and apply the base object detection methods on these images. To reduce the amount of manual labour required for assigning each bounding box to the corresponding ground truth box, we form the {X_i; g_i} pairs with the following process: for each ground truth bounding box g_i, we find all the overlapping bounding boxes of the corresponding class obtained by the different detectors and select the ones having an IoU higher than a certain threshold θ.
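The weight learning of Eqs. (2)-(3) can be sketched with gradient descent over the MSE loss. This is an illustrative reimplementation; the use of NumPy, full-batch updates, and the optimizer settings are assumptions rather than the paper's exact configuration.

```python
import numpy as np

def learn_weights(X, G, lr=0.01, steps=2000):
    """Learn detector weights w minimizing the MSE of Eq. (3).

    X: (n, D, 4) array of confidence-score-scaled boxes b_ji * s_ji;
    G: (n, 4) array of ground truth boxes g_i.
    """
    n, D, _ = X.shape
    w = np.zeros(D)                              # zero-initialized, as in the text
    for _ in range(steps):
        pred = np.einsum('d,ndk->nk', w, X)      # w^T X_i for every sample
        # gradient of (1 / 4n) * sum (pred - g)^2 with respect to w
        grad = (2.0 / (4 * n)) * np.einsum('nk,ndk->d', pred - G, X)
        w -= lr * grad
    return w
```

Because the loss is a convex least-squares objective in w (only D parameters), plain gradient descent converges quickly; the paper additionally monitors validation MSE to pick the best iterate.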
From this set, for each of the detectors we select the bounding box having the highest IoU with the ground truth box, and the rest of the bounding boxes are discarded. The obtained bounding boxes are then used to create the matrix X_i corresponding to the ground truth coordinates g_i. In case one of the detectors did not produce a detection that would correspond to the ground truth, we simply set the corresponding bounding box in X_i to zeros, i.e., b_{ji} = [0, 0, 0, 0]. We discard redundant duplicate detections and false-positive detections that did not match with any of the ground truth bounding boxes, since the re-weighting of bounding boxes cannot fix the presence of incorrect detections as such.

An additional step can be taken to further improve the performance on video data by taking advantage of temporal information. We can safely assume that objects in a video sequence do not appear randomly at different frames, but follow certain trajectories of considerable time length. Therefore, we can enrich the set of detections obtained using the proposed ensembling approach with the ones that were likely to be missed, and reduce the number of false-positive ones, by following a two-stage tracking-based approach.

In the first stage, to obtain a set of detections that were likely missed by the object detectors, we apply a set of correlation trackers [16]. At the first frame, an object tracker is initialized from each of the detections of that frame, and the objects are tracked throughout the video. At each subsequent frame, the tracked bounding boxes are matched with the detections of the frame following the IoU-based matching process. A successful match is defined by an IoU exceeding 0.5, or 0.4 in case the tracklet was initialized in the past 3 frames, due to the assumption that the shape of the object changes significantly when entering the field of view, leading to higher differences between the bounding boxes detected by the detectors and the one tracked by the tracker.
For the detections not matched with any of the tracklets, a new tracker is initialized, and objects that were tracked successfully but for which no detections were found continue to be tracked unless one of the following holds:

1. the tracklet was only matched with detections for a small number of frames (in our experiments, we set this number to 5) and was then missed for more than 5 frames;

2. the tracklet was not matched with any of the detections for more than 50 frames.

The first rule is intended for discarding tracklets that are likely to be initialized from false-positive detections, and the second rule is for discarding tracklets corresponding to objects that are no longer present in the scene, but for which tracking continued. When an object that was tracked, but not matched with detections for a certain number of frames, is rematched with a detection again, the bounding boxes predicted by the tracker at the frames where no detection happened are added to the set of detections.

At the second stage, a multi-object tracker [17] is applied to reduce the number of false-positive detections: using the set of detections enriched with the ones obtained from the tracker at the first stage, we identify the sequences of detections belonging to the same object. The resulting sequences that consist of only a small number of consecutive frames (fewer than 5 in our work), and are thus most likely to correspond to false-positive detections, are discarded. Note that the choice of tracking methods here is dictated by their fast speed, and any other tracking methods can be employed as well.
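The tracklet bookkeeping rules above can be summarized in a short sketch. The frame thresholds are the values stated in the text; the function and constant names are illustrative, not from the paper's implementation.

```python
MIN_MATCHED = 5    # rule 1: matched frames needed before a tracklet is trusted
MAX_GAP_SHORT = 5  # rule 1: allowed miss streak for an unconfirmed tracklet
MAX_GAP_LONG = 50  # rule 2: allowed miss streak for any tracklet

def should_drop(matched_frames, missed_frames):
    """Decide whether a tracklet should stop being propagated."""
    if matched_frames < MIN_MATCHED and missed_frames > MAX_GAP_SHORT:
        return True    # likely initialized from a false-positive detection
    if missed_frames > MAX_GAP_LONG:
        return True    # the object has most likely left the scene
    return False

def match_iou_threshold(age_in_frames):
    """Relax the IoU matching threshold for recently initialized tracklets."""
    return 0.4 if age_in_frames <= 3 else 0.5
```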
4. EXPERIMENTAL SETUP AND RESULTS
To assess the applicability of the proposed approaches to real-world problems and evaluate the ability of the base detectors to generalize to previously unseen data, we evaluate the algorithms on different datasets than the ones they were trained on. We select three state-of-the-art object detectors as our base methods: SSD [9] and YOLOv3 [10], each applied to images resized to the fixed square input resolution of the respective model, and RetinaNet [11], applied to images rescaled such that the smaller side is equal to 800 pixels. All models are trained on the MS-COCO dataset [18]. To evaluate the ability of the methods to generalize to previously unseen data, we report the results on the intersection of classes with the previously unseen PASCAL VOC dataset (training + test set) [19] and three video datasets: the EPFL Campus-7 dataset, the EPFL Lab-6 dataset [20], and the Campus Auditorium dataset [21, 22]. In all video datasets, only the people class is considered, since the ground truth data is available only for this class. We use the Mean Average Precision (MAP) as a metric for evaluation of the detection performance, as defined in the PASCAL VOC 2012 challenge [23, 24]. Out of all the obtained detections, we discard the ones with a confidence score below 0.05 and report the MAP at the default IoU of 0.5 for each class separately, as well as the total MAP. Besides, to evaluate the effect of the proposed approach on the precision of the bounding boxes, we report MAP at the higher IoUs of 0.75 and 0.85 for each class in the image dataset.

In our experiments, we perform 5-fold cross-validation. In the image dataset, 100 samples are used for training the score re-weighting model in each of the folds, with the remaining 9863 images used for testing. In the video datasets, we split the videos into 5 continuous segments, in each of which the last 100 images are used for training, with the rest of the frame sequence used for testing. This is done because the proposed tracking-based approach requires an uninterrupted video sequence. The same test sets are used for reporting the results of the separate object detectors, and the mean MAP value across the 5 folds is reported. For learning the weights, the bounding box coordinates x and y are scaled by the image width and height, respectively. From the resulting pairs obtained from the 100 annotated images, 30% are taken for validation, and the regression model is trained on the remaining 70%, starting from zero-initialized weights. The MSE is calculated on the validation set at each iteration, and training proceeds until the MSE stops improving for a number of iterations, after which the weights resulting in the best performance are selected. We report separately the results of each object detection method, the proposed approach, and, in the video datasets, the proposed approach refined by the tracking-based refinement scheme. We also report the results obtained by applying solely non-maximum suppression to the detectors' output to showcase that the improvement of the detection performance is caused to a large extent by the re-weighting scheme.

Table 1. MAP results on PASCAL VOC dataset. Per-class and total MAP of RetinaNet, SSD, YOLOv3, the NMS baseline, and the proposed method (Our) at IoU thresholds of 0.5, 0.75, and 0.85.

Method          plane  bicycle bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  bike   person plant  sheep  sofa   train  monit. TOTAL
Our ([email protected])  90.97  87.08  …
Our ([email protected]) 67.43  59.32  54.43  31.65  42.69  78.32  64.41  77.37  43.70  66.41  43.55  69.74  66.24  64.43  56.60  22.84  55.21  60.05  72.39  60.04  57.84
Our ([email protected]) 37.65  30.51  27.78  10.53  15.13  63.76  40.11  50.58  19.13  37.47  21.04  44.65  43.40  35.38  27.08  6.05   29.91  37.05  46.58  23.10  32.35

Table 2. [email protected] results on video datasets (EPFL Campus-7, EPFL Lab-6, Campus Auditorium) for RetinaNet, SSD, YOLOv3, NMS, the proposed method (Our), and the proposed method with tracking-based refinement (Our-tr.).

The results of the base object detection methods as well as the proposed approaches on the image and video datasets are presented in Tables 1 and 2, respectively. On the PASCAL VOC dataset, the best overall performance among the base detectors is achieved by YOLOv3 at [email protected], and by RetinaNet at [email protected]. This allows us to conclude that YOLOv3 is better at detecting the presence of the objects of interest in general, whereas RetinaNet produces bounding boxes that match the ground truth boxes more closely. This observation also reinforces the motivation of our proposed approach: by combining the outputs of several object detectors, we can exploit the good detection ability of less precise detectors, such as YOLOv3, while compensating with the precision of the bounding boxes of more precise detectors, such as RetinaNet. We observe that the proposed approach outperforms the base detectors at all the IoU thresholds. At [email protected], the per-class improvements range from the smallest gain on the dining table class up to the largest gain on the boat class, and our proposed approach performs on par with non-maximum suppression. However, the performance differences increase with the increase in the IoU threshold: we achieve better performance than non-maximum suppression at both [email protected] and [email protected]. Besides, we outperform all base detection methods at all IoU thresholds. Some example results on the PASCAL VOC dataset can be seen in Figure 1. Note that, for clarity, only the detections with a confidence score of at least 0.4 are shown. In the video datasets, the proposed regression-based approach achieves a significant improvement over all of the detectors on all three datasets, and applying the tracking-based refinement process pushes this improvement further on the EPFL Campus-7, EPFL Lab-6, and Campus Auditorium datasets.
5. CONCLUSIONS
We proposed a method for ensembling multiple object detectors that re-weights the confidence scores and bounding box coordinates, as well as exploits contextual information. The method resulted in better MAP scores compared to the base detectors, with an especially large gap at higher IoU thresholds. The extension of the proposed method for video data, utilizing temporal information, pushes the improvement in performance even further. The proposed methods can, therefore, be utilized directly for obtaining improved object detection results or as a part of a framework for creating annotations on new datasets.

6. REFERENCES

[1] H. Jung, M. Choi, J. Jung, J. Lee, S. Kwon, and W. Young Jung, "Resnet-based vehicle classification and localization in traffic surveillance systems," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 61–67.
[2] G. Hamed, M. Marey, S. Amin, and M. Tolba, "Deep learning in breast cancer detection and classification," in Joint European-US Workshop on Applications of Invariance in Computer Vision. Springer, 2020, pp. 322–333.
[3] K. Chumachenko, A. Männistö, A. Iosifidis, and J. Raitoharju, "Machine learning based analysis of Finnish World War II photographers," IEEE Access, vol. 8, pp. 144184–144196, 2020.
[4] Y. Tian, G. Yang, Z. Wang, H. Wang, E. Li, and Z. Liang, "Apple detection during different growth stages in orchards using the improved YOLO-v3 model," Computers and Electronics in Agriculture, vol. 157, pp. 417–426, 2019.
[5] H. Su, J. Deng, and L. Fei-Fei, "Crowdsourcing annotations for visual object detection," in AAAI Conference on Artificial Intelligence Workshops, 2012.
[6] B. Adhikari, J. Peltomaki, J. Puura, and H. Huttunen, "Faster bounding box annotation for object detection in indoor scenes," in European Workshop on Visual Information Processing. IEEE, 2018, pp. 1–6.
[7] V. Wong, M. Ferguson, K. Law, and Y. Lee, "An assistive learning workflow on annotating images for object detection," in IEEE International Conference on Big Data. IEEE, 2019, pp. 1962–1970.
[8] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, "Deep learning for generic object detection: a survey," International Journal of Computer Vision, vol. 128, no. 2, pp. 261–318, 2020.
[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg, "SSD: single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[10] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[11] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[12] H. Zhou, B. Gao, and J. Wu, "Adaptive feeding: achieving fast and accurate detections by adaptively combining object detectors," in IEEE International Conference on Computer Vision, 2017, pp. 3505–3513.
[13] D. Marčetić, T. Hrkać, and S. Ribarić, "Two-stage cascade model for unconstrained face detection," in International Workshop on Sensing, Processing and Learning for Intelligent Machines. IEEE, 2016, pp. 1–4.
[14] S. Yang, A. Wiliem, and B. Lovell, "It takes two to tango: cascading off-the-shelf face detectors," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 535–543.
[15] S. Karaoglu, Y. Liu, and T. Gevers, "Detect2rank: combining object detectors using learning to rank," IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 233–248, 2015.
[16] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, "Accurate scale estimation for robust visual tracking," in British Machine Vision Conference. BMVA Press, 2014.
[17] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in IEEE International Conference on Image Processing. IEEE, 2017, pp. 3645–3649.
[18] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick, "Microsoft COCO: common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[19] M. Everingham, L. van Gool, C. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[20] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multicamera people tracking with a probabilistic occupancy map," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267–282, 2007.
[21] Y. Xu, X. Liu, L. Qin, and S. Zhu, "Cross-view people tracking by scene-centered spatio-temporal parsing," in AAAI Conference on Artificial Intelligence, 2017, pp. 4299–4305.
[22] Y. Xu, X. Liu, Y. Liu, and S. Zhu, "Multi-view people tracking via hierarchical trajectory composition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4256–4265.
[23] J. Cartucho, R. Ventura, and M. Veloso, "Robust object recognition through symbiotic deep learning in mobile robots," in