Custom Object Detection via Multi-Camera Self-Supervised Learning
Yan Lu∗ and Yuanchao Shu
New York University, Microsoft
[email protected], [email protected]
∗Contact Author
Abstract
This paper proposes MCSSL, a self-supervised learning approach for building custom object detection models in multi-camera networks. MCSSL associates bounding boxes between cameras with overlapping fields of view by leveraging epipolar geometry and state-of-the-art tracking and reID algorithms, and prudently generates two sets of pseudo-labels to fine-tune the backbone and detection networks of an object detection model, respectively. To train effectively on pseudo-labels, a powerful reID-like pretext task with consistency loss is constructed for model customization. Our evaluation shows that, compared with legacy self-training methods, MCSSL improves average mAP by . and . on the WildTrack and CityFlow datasets, respectively.

Introduction

Object detection plays a pivotal role in video analytics. Although deep neural network (DNN)-based object detection models pre-trained on large public datasets (e.g.,
MS-COCO) exhibit decent performance in various scenarios, custom models are more desirable due to their higher accuracy and robustness [Ouyang et al., 2016; Guoa et al., 2019].

Model customization relies on context-specific (or domain-specific) training data. Unlike general-purpose training datasets, large-scale context-specific labels are far harder to collect in a sustainable manner. For most computer vision tasks, manual labeling (e.g., drawing bounding boxes) remains the major source of training data. Nonetheless, human annotation is known to be costly and time-consuming, and distributing frames outside of the camera network also raises privacy concerns. While we are witnessing advancements in semi-supervised and weakly-supervised learning, the performance of models trained with state-of-the-art semi- and weakly-supervised algorithms still falls short of supervised object detectors. In semi-supervised learning, for instance, it is hard to converge on regularization terms (e.g., consistency loss) [Athiwaratkun et al., 2019] with large-scale unlabeled data. Similarly, detection performance suffers from local minima in weakly-supervised approaches [Inoue et al., 2018].

This paper is motivated by a simple question: can we improve self-supervised object detection with a fleet of cameras that have a (partially) shared field-of-view (FOV)?
Driven by advances in computer vision and the plummeting cost of camera hardware, organizations are deploying video cameras at scale for a variety of applications. In many scenarios, cameras have overlapping areas for the spatial monitoring of physical premises. Detection results on neighboring cameras could potentially serve as pseudo-labels and allow each camera to learn its own detector continuously. To achieve this goal, however, we need to tackle the following two major challenges.
How to create pseudo-labels?
Object detection accuracy varies across cameras and changes over time due to traffic dynamics and environmental factors (e.g., lighting changes), making it hard to identify high-quality pseudo-labels. It is equally important to assign the right set of pseudo-labels to each camera, as blindly training with all pseudo-labels adversely impacts model customization.
How to learn from noisy labeled data?
A straightforward way to customize a model is to fine-tune the default object detection model on each camera using pseudo-labels. However, fine-tuning a large DNN with insufficient and noisy pseudo-labels tends to lead to accuracy drops and overfitting [Hendrycks et al., 2018].

To address these issues, we propose MCSSL, a novel self-supervised training mechanism to customize object detection models on each camera. The key idea of MCSSL is to create pseudo-labels at different confidence levels and use them to train different parts of the network separately. MCSSL divides object detection models into two parts: the backbone network, which provides discriminative low-level feature maps, and the detection network, which consists of the region proposal network (RPN), RoI feature extractor, etc. The two parts have their unique characteristics (Table 1). The backbone network has far more parameters than the RPN and RoI extractor, hence it demands more training data to fine-tune itself. On the other hand, it is less susceptible to training data noise since it is trained for low-level features [Liu et al., 2020; Li et al., 2018b; Ouyang et al., 2016]. We also found that, for most off-the-shelf object detection models, bounding boxes with a high classification score are rarely false positives. However, it is common to see false negative cases with a low classification score.
                     Training Data Demand    Noise Sensitivity
Detection Network    Low                     High
Backbone Network     High                    Low
Table 1: Object Detection Model Characteristics.

Based on these two insights, MCSSL categorizes the bounding boxes detected by the base model on each camera into confident pseudo-labels and uncertain pseudo-labels based on their classification score. Despite their relatively low volume, confident pseudo-labels serve nicely for detection network fine-tuning due to their high quality. On the contrary, the larger number of uncertain pseudo-labels fits well with the backbone network, which is more tolerant to noise.

To build a custom object detection model, the detection network and backbone network on each camera need to be trained independently with its own context-specific data. To this end, MCSSL adopts advanced video reID algorithms to associate bounding boxes across cameras. To improve the efficiency and precision of bounding box matching, a prune-and-augment approach is introduced by leveraging epipolar constraints and tracking. Akin to feature categorization in classification and reID tasks [Zhong et al., 2018; Li et al., 2018a], we treat paired bounding boxes as non-camera-specific data due to their camera-invariant features on the same object. Combined with the above finding, uncertain pseudo-labels that belong to the non-camera-specific data are used to update the backbone network on each camera. To work with pairs of images as input, we devise a novel reID-like pretext task, turning backbone network fine-tuning into consistency training. Once we obtain a more powerful backbone network, we use the traditional self-training framework to update the detection network (i.e., RPN and RoI feature extractor) with the entire set of confident pseudo-labels.

We found the results promising. Compared with the best self-training algorithm, MCSSL, on average, improves mAP by . and . for each camera on the WildTrack and CityFlow datasets, respectively. Two out of seven cameras even achieve an mAP gain as high as on the WildTrack dataset. In summary, we made the following three intellectual contributions.
i) We proposed the first self-supervised learning approach that allows object detection model customization in multi-camera networks.
ii) We devised an effective approach to generate training data and fine-tune different parts of an object detection model with a reID-like pretext task.
iii) We achieved new state-of-the-art mAP results for self-supervised learning-based custom object detection on the WildTrack and CityFlow datasets.

Related Work

Anchor-based object detection: Anchor-based deep object detection models [Joseph and Ali, 2018; He et al., 2017; Ren et al., 2015; Zhao et al., 2019] comprise three modules: 1) the backbone network, which extracts general features (i.e., edges, corners) of a given image; 2) the region proposal network (RPN) [Ren et al., 2015], which generates candidate bounding boxes based on simpler components from lower layers in the backbone network; and 3) the RoI feature extractor, which extracts fine-grained features and assigns a class probability to each RoI generated by the RPN. Models containing all three modules are called three-stage object detection models (e.g., Mask R-CNN), whereas two-stage models (e.g., YOLOv3 and M2Det) [Joseph and Ali, 2018; Zhao et al., 2019] remove the RPN and run the RoI feature extractor directly on feature blocks generated by the backbone network to improve inference speed. MCSSL builds on top of the existing anchor-based object detection architecture and fine-tunes the backbone network and detection network independently with prudently generated training data for model customization.
Semi- and weakly-supervised object detection: Despite recent advancements [Gao et al., 2019; Lee et al., 2019; Yuhua Chen et al., 2018; Inoue et al., 2018; Jeong et al., 2019], today's semi- and weakly-supervised learning algorithms still fall short on accuracy in object detection tasks. A common approach to semi-supervised object detection is mining-training; [Gao et al., 2019] is the first end-to-end semi-supervised framework for object detection. To utilize a large amount of unlabeled data and handle label noise, many works [Lee et al., 2019; Xu et al., 2019a; Xu et al., 2019b] seek to construct auxiliary tasks (a.k.a. pretext tasks) to indirectly train the network. For example, [Xu et al., 2019a; Xu et al., 2019b] add a knowledge graph mining task, and [Lee et al., 2019] constructs three new labeling tasks (closeness labeling, multi-object labeling and foreground labeling) to assist object detection model training. These kinds of approaches are also known as self-supervised learning since, in auxiliary tasks, pseudo-labels are found or mined in unlabeled data automatically. Besides self-supervised approaches, constructing consistency [Jeong et al., 2019] between different versions of a given image has become an effective tool for enhancing detection models' performance on unlabeled data. Inspired by self-supervised learning and consistency learning approaches, we devise a new reID-like pretext task trained by means of consistency learning to assist custom object detection model training on multi-camera datasets.
Multi-camera detection: To deal with occluded objects from a single view, many works [Chavdarova and Fleuret, 2017; Baque et al., 2017] utilize multi-view streams to build powerful 3D detection algorithms for all cameras. However, the majority of these methods are supervised learning-based, requiring significantly more labeled data than monocular object detection. More recently, we have seen work on multi-view human pose estimation [Kocabas et al., 2019] that does not learn a 3D model but instead seeks to train 2D models for each camera by adding a new self-supervised learning task. To the best of our knowledge, MCSSL is the first self-supervised learning-based approach to obtain custom object detection models in multi-camera environments.
Design
MCSSL works in multi-camera scenarios where at least two cameras (partially) share a field-of-view. At the beginning, cameras run off-the-shelf object detection models (i.e., base models) trained on large-scale public datasets (e.g., YOLOv3, Mask R-CNN). The goal of MCSSL is to build an accurate custom object detection model for each camera over time through cross-camera model fine-tuning. In what follows, we use two cameras as a simple example to elaborate the process of model customization on cam 1 with the help of cam 2 (Figure 1).

Phase 1: pseudo-label generation: First, object detection results are obtained from the base model on frames from both cam 1 and cam 2. MCSSL sets bounding boxes with a high classification score (e.g., above a threshold T_cls) as confident pseudo-labels and the remaining bounding boxes as uncertain pseudo-labels.

Phase 2: cross-camera pseudo-label sharing: MCSSL treats views of one object on different cameras as style-transferred images, and associates pseudo-labels across cameras using state-of-the-art reID models. Notably, epipolar geometry and tracking are leveraged, which significantly reduces compute overhead and improves accuracy. Associated pseudo-labels are categorized as non-camera-specific and camera-specific training data, where non-camera-specific training data refer to objects seen by multiple cameras whereas camera-specific training data are objects appearing on a single camera.
Phase 3: consistency learning: MCSSL constructs a reID-like pretext task that uses non-camera-specific training data to fine-tune the backbone network with a consistency loss. Camera-specific training data is used to customize the detection network as in most existing self-training algorithms.
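As a concrete illustration of Phase 1, the following is a minimal sketch (not the authors' implementation) that splits base-model detections into confident and uncertain pseudo-labels by classification score. The detection tuple format and the default threshold value are assumptions; the paper leaves T_cls as a tunable hyperparameter.

```python
from typing import List, Tuple

# A detection as (x1, y1, x2, y2, class_id, score); this format is assumed.
Detection = Tuple[float, float, float, float, int, float]

def split_pseudo_labels(detections: List[Detection], t_cls: float = 0.8):
    """Phase 1: boxes with classification score above t_cls become
    confident pseudo-labels (used later to fine-tune the detection
    network); the rest become uncertain pseudo-labels (used to
    fine-tune the noise-tolerant backbone network).
    t_cls = 0.8 is a placeholder; the paper tunes it per dataset."""
    confident = [d for d in detections if d[5] >= t_cls]
    uncertain = [d for d in detections if d[5] < t_cls]
    return confident, uncertain
```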
Pseudo-Label Generation

When cam 1 and cam 2 collect enough new images, we first use off-the-shelf object detection models to generate pseudo-labels. We denote by BB_1 and BB_2 the pseudo-labels for cam 1 and cam 2, respectively. Intuitively, bounding boxes with a higher classification score are more likely to be true positives. As reported in a large body of computer vision work [Joseph and Ali, 2018; He et al., 2017; Ren et al., 2015], using a high classification score to filter bounding boxes tends to lead to high precision but low recall across the majority of DNN object detection models. That is, bounding boxes with a high classification score are rarely false positives, whereas false negative bounding boxes due to a low classification score are commonly seen. Based on this insight, we set a high threshold T_cls, and use the high-quality confident pseudo-labels BB_c, the bounding boxes whose classification scores are larger than T_cls, to train the upper-layer detection networks (i.e., RoI feature extractor). Accordingly, uncertain pseudo-labels (BB_u) are used to train the initial CNN layers (i.e., backbone network), which are less susceptible to noise [Rodner et al., 2016; Co et al., 2019].

Cross-Camera Pseudo-Label Sharing

Style transfer is a commonly used data augmentation technique in DNN model training [Yuhua Chen et al., 2018; Inoue et al., 2018]. A key advantage of a multi-camera network is the richness of data from different vantage points. Hence, in MCSSL, we associate bounding boxes of the same object on different cameras and feed them into model fine-tuning. In a nutshell, bounding box association is achieved by reID. However, naïve reID poses two challenges. First, state-of-the-art reID models are only able to achieve an mAP of around . , which leads to a fair number of false positives and impairs model fine-tuning. Second, pairwise comparison between bounding boxes on all cameras incurs a quadratic computation overhead, which is prohibitively high for scenarios with busy traffic. To deal with these two issues, we employ a prune-and-augment approach, which first uses epipolar constraints to filter out a large number of bounding boxes that are unlikely to be confirmed by reID, and then augments the refined pseudo-label pairs through tracking.

Epipolar Constraint-based Pruning
When two cameras view the same 3D space from different viewpoints, geometric relations among 3D points and their projections onto the 2D planes lead to constraints on the image points. This intrinsic projective geometry is captured by a fundamental matrix F in epipolar geometry, which can be calculated as F = K_2^{-T} [t]_× R K_1^{-1}, where K_1 and K_2 represent the intrinsic parameters, and R and [t]_× are the relative camera rotation and translation which describe the location of the second camera relative to the first in global coordinates (a.k.a. extrinsic parameters). Given F, for a physical 3D position P in the overlapping area of cam 1 and cam 2, we have p_2^T F p_1 = 0, where p_1 and p_2 are the projections of the scene point P on cam 1 and cam 2. In essence, this equation characterizes an epipolar plane containing P and the epipoles O_1 and O_2 of both cameras.

The epipolar plane offers a unique characteristic for building associations between bounding boxes on different cameras. As can be seen from Figure 2, the intersections of the epipolar plane with the two image planes are two lines, called epipolar lines. This means that any particular point p_1 on cam 1 is always mapped to a point along the epipolar line l_2 = F p_1 in the image from cam 2.

Given the epipolar constraints, we can now map a bounding box in the image from the "teacher camera" to four epipolar lines in another camera's image, which significantly reduces the search space of potential bounding boxes. For instance, it reduces the search space by 12x on the WildTrack dataset. Note that although the explanations above assume cameras are calibrated and time-synchronized, we add a fudge factor in our spatial filtering algorithm to compensate for calibration noise and slight time shifts. Since epipolar geometry only defines an area for each bounding box, we take all bounding boxes in BB_u that fall into this area as candidate bounding boxes (i.e., coarse non-camera-specific training data). In order to fine-tune the object detection model of camera i, MCSSL applies this mapping to all cameras that share a view with i. In the example of customizing an object detection model for cam 1 with the help of cam 2 (Figure 1), we run epipolar geometry-based mapping on all bounding boxes in cam 2 to find all candidate bounding boxes on cam 1.
Figure 1: MCSSL overview.
Figure 2: Illustration of epipolar constraints.
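To make the pruning step concrete, below is a small NumPy sketch of the epipolar filter under the stated assumptions: calibrated, time-synchronized cameras and boxes in (x1, y1, x2, y2) pixel format. The center-in-band test and the margin parameter (playing the role of the fudge factor) are our assumptions; the paper does not spell out the exact geometric test.

```python
import numpy as np

def fundamental_matrix(K1, K2, R, t):
    """F = K2^{-T} [t]_x R K1^{-1}, as given in the text."""
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])
    return np.linalg.inv(K2).T @ tx @ R @ np.linalg.inv(K1)

def _norm_line(l):
    """Scale a 2D line (a, b, c) to unit normal with a fixed
    orientation, so signed distances are comparable across lines."""
    l = l / np.hypot(l[0], l[1])
    return -l if (l[1] < 0 or (l[1] == 0 and l[0] < 0)) else l

def epipolar_candidates(teacher_box, boxes, F, margin=20.0):
    """Map the four corners of a teacher-camera box to four epipolar
    lines in the other camera, then keep boxes whose centers fall
    inside (or within `margin` pixels of) the band those lines span.
    `margin` is the fudge factor compensating for calibration noise
    and slight time shift; its value here is a placeholder."""
    x1, y1, x2, y2 = teacher_box
    corners = [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]
    lines = [_norm_line(F @ np.array([x, y, 1.0])) for x, y in corners]
    kept = []
    for b in boxes:
        c = np.array([(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0, 1.0])
        d = [l @ c for l in lines]  # signed distances to each line
        if min(d) < margin and max(d) > -margin:
            kept.append(b)
    return kept
```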
Data Augmentation with Tracking
Epipolar constraints effectively reduce the search space of bounding boxes for reID. Nonetheless, they filter out pairs of bounding boxes on cameras at different times. For instance, person A on cam 1 at the i-th frame may not fall within the epipolar constraint of the bounding box of person A on cam 2 at the j-th frame, even though this is a valid pair of non-camera-specific data. To revive this large set of training data (due to its combinatorial nature), we leverage temporal correlations on each camera to find bounding boxes belonging to the same object. Specifically, once we find person A on cam 1 and cam 2 at the i-th frame, we run SiamMask-E [Chen and Tsotsos, 2019], a state-of-the-art tracking algorithm, on the subsequent four frames from both cameras to get bounding boxes of person A. This set of data is called "coarse reID training data" (Figure 1).

Data augmentation with tracking also allows us to use a video reID algorithm to finalize bounding box association. Compared with image reID, video reID [Liu et al., 2019; Wang et al., 2019] has proven to be more accurate and reliable. In MCSSL, we adopt B-BOT+Attn-CL [Pathak et al., 2020] to prune the coarse reID training data. Since it extracts an aggregated feature from four consecutive frames, we use a pre-defined aggregated feature distance threshold [Pathak et al., 2020] to determine whether two bounding boxes belong to the same person.
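The augment-then-verify step can be sketched as follows. The `tracker` and `reid` callables are hypothetical stand-ins for the interfaces of SiamMask-E and B-BOT+Attn-CL, and the cosine-distance threshold is an assumption standing in for the pre-defined aggregated feature distance threshold.

```python
import numpy as np

def augment_and_verify(box1, box2, frames1, frames2, tracker, reid,
                       dist_thresh=0.3, horizon=4):
    """Extend an epipolar-matched pair (box1 on cam 1, box2 on cam 2)
    over the next `horizon` frames with a single-object tracker, then
    confirm the pair with a video reID model on the two tracklets.
    Assumed interfaces: tracker(box, frames) -> list of boxes, and
    reid(frames, boxes) -> aggregated feature vector.
    Returns the paired tracklets if verified, else None."""
    track1 = [box1] + tracker(box1, frames1[1:1 + horizon])
    track2 = [box2] + tracker(box2, frames2[1:1 + horizon])
    f1 = reid(frames1[:1 + horizon], track1)  # aggregated features
    f2 = reid(frames2[:1 + horizon], track2)
    cos_dist = 1.0 - np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))
    return (track1, track2) if cos_dist < dist_thresh else None
```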
Consistency Learning

DNN object detection models use backbone networks (i.e., pre-trained classification networks like ResNet or GoogLeNet) to extract discriminative feature maps from an image. The most common method to retrain a DNN detector is to freeze the backbone network and fine-tune the remaining detection layers (i.e., RPN and RoI extractors) on a new dataset. In spite of fast convergence, this approach suffers from suboptimal performance due to the insufficiently trained backbone network.

To address this limitation, we use non-camera-specific training data to train the backbone network. To be able to use pairs of bounding boxes in the non-camera-specific training dataset, MCSSL creates a reID-like pretext task. Specifically, it takes pairs of images with bounding boxes belonging to the same object as input and runs them through the entire model to get feature maps. Afterwards, it calculates the classification score (cls) of each image purely based on features within the paired bounding box, and uses a consistency loss in back propagation to train the backbone network (Figure 1). The consistency loss in MCSSL is defined as L_consistency = Σ_{k=1}^{P} CE(cls_k^1, cls_k^2), where P is the total number of pairs of bounding boxes in the two images, CE represents the cross-entropy function, and cls_k^i denotes the predicted classification score of the k-th bounding box from cam i. As the consistency loss is minimized by fine-tuning, the backbone network generates more representative feature maps.

After backbone network fine-tuning, we adopt the classic self-training framework [Kocabas et al., 2019; Lee et al., 2019; Gao et al., 2019] to update the RPN and RoI feature extractor. Here camera-specific data (i.e., confident pseudo-labels on the camera itself) is used directly as ground truth for detection model fine-tuning.

In summary, the overall loss function of MCSSL can be formulated as L_overall = α · L_consistency + β · L_det. We first minimize the consistency loss by updating the backbone network (α = 1, β = 0), and then reduce the self-training loss by updating the RPN and RoI feature extractor (α = 0, β = 1).
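A minimal PyTorch rendering of the consistency loss and the two-phase α/β schedule follows. Treating one camera's softmax output as a soft target is our assumption; the text only specifies cross-entropy between the paired classification scores.

```python
import torch
import torch.nn.functional as F

def consistency_loss(cls1: torch.Tensor, cls2: torch.Tensor) -> torch.Tensor:
    """L_consistency = sum_{k=1}^{P} CE(cls_k^1, cls_k^2).
    cls1, cls2: (P, num_classes) classification logits for the same P
    paired boxes seen from cam 1 and cam 2. Using cam 2's softmax
    output as a soft target is an assumption."""
    target = F.softmax(cls2, dim=1)
    return F.cross_entropy(cls1, target, reduction='sum')

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    """Freeze/unfreeze part of the detector, implementing the alpha/beta
    schedule: (alpha=1, beta=0) trains only the backbone; (alpha=0,
    beta=1) trains only the RPN and RoI feature extractor."""
    for p in module.parameters():
        p.requires_grad = flag
```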
Evaluation

We evaluate MCSSL on real-world multi-camera datasets and present evaluation highlights in this section.

Datasets. Our experiments are conducted on two multi-camera detection datasets, namely WildTrack and CityFlow [Chavdarova et al., 2018; Tang et al., 2019] (Table 2). WildTrack is by far the largest multi-camera dataset for pedestrian detection and tracking, while CityFlow is built for multi-camera vehicle tracking. We used data collected from the first intersection of CityFlow.

Implementation. We implemented MCSSL with mmdetection [Chen et al., 2019], an open source object detection toolbox based on PyTorch, and conducted all experiments using two Nvidia GeForce RTX 2080 Ti GPUs.
Models.
YOLOv3 [Joseph and Ali, 2018] and Faster R-CNN [Ren et al., 2015], pre-trained on COCO with Darknet53 and ResNet101 backbones, are used as our default two-stage and three-stage object detection models. Evaluation results with different backbones are presented in Section 4.5. T_cls is set to . . We use SiamMask-E [Chen and Tsotsos, 2019] for tracking, and the state-of-the-art reID algorithms B-BOT+Attn-CL [Pathak et al., 2020] and VehicleNet [Zheng et al., 2020] for person and vehicle reID, respectively.

Training Settings.
We set the ratio of the training, validation and testing sets to 16:4:5, the batch size to , and choose SGD with a learning rate of . . All training lasts for 60 epochs, with the first 30 epochs on the backbone network and the subsequent 30 epochs on the detection network.

Evaluation Metrics.
We use mean Average Precision (mAP) over an Intersection over Union (IoU) threshold of 0.8 (mAP@[0.8:1.0]).
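For reference, here is a minimal sketch of the IoU computation underlying this metric; the 0.05 step between IoU thresholds follows the COCO convention and is an assumption, since the text only gives the range.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# mAP@[0.8:1.0]: AP averaged over IoU thresholds 0.80, 0.85, 0.90, 0.95
# (0.05 step assumed, following the COCO convention).
IOU_THRESHOLDS = np.arange(0.80, 1.00, 0.05)
```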
Baselines.
We compare MCSSL with three self-supervised learning approaches.
i) Self-Training (ST): the most widely used self-training mechanism with confident pseudo-labels [Gao et al., 2019; Lee et al., 2019]. It is trained in the supervised learning fashion on confident pseudo-labels.
ii) Self-Training with Gold Loss Correction (ST-GLC) [Hendrycks et al., 2018]: an improved version of ST with gold loss correction. It first estimates a corruption matrix C of conditional corruption probabilities using confident pseudo-labels, and then uses C to correct the class labels of all uncertain pseudo-labels. It uses all pseudo-labels to fine-tune the original object detection model.
iii) Self-Training with Consistency Loss (ST-CL) [Jeong et al., 2019]: the most recent work on self-training. It uses two images (the original image and a flipped image) as input, and constructs a consistency loss between the two images during training. When training on confident pseudo-labels, it uses both the supervised loss and the consistency loss; when training on uncertain pseudo-labels, it uses only the consistency loss.
In addition to the above three baselines, we report results from supervised training with ground truth (i.e., training the custom detection model on each camera with human-labeled bounding boxes from itself). It serves as an upper bound for self-training methods. To show the gain of model customization, we also include the mAP of the base models (i.e., YOLOv3 and Faster R-CNN).
Figure 3 shows the performance of customizing YOLOv3 and Faster R-CNN on WildTrack. Compared with the best known self-training approach (ST-GLC), MCSSL improves the mAP of YOLOv3 and Faster R-CNN by . and . on average for each camera on WildTrack. This shows that MCSSL is an effective framework for both two-stage and three-stage object detection models. Interestingly, MCSSL performs worse than ST-GLC on CAM-7 in both Figure 3a and Figure 3b. This is because CAM-7 has the least amount of shared FOV and hence far less non-camera-specific training data (e.g., . less than other cameras on average in Figure 3a) for backbone network fine-tuning.
Figure 3: mAP of self-training approaches on WildTrack ((a) YOLOv3; (b) Faster R-CNN).

Using the same settings, we report the performance of MCSSL on YOLOv3 and Faster R-CNN on CityFlow in Figure 4. Compared with ST-GLC, MCSSL obtains an average mAP improvement of . and . on the two models. In particular, on CAM-5, YOLOv3 and Faster R-CNN are improved by . mAP and . mAP, since CAM-5 largely overlaps with other cameras and hence gets more pseudo-labels from its neighbors.
Figure 4: mAP of self-training approaches on CityFlow ((a) YOLOv3; (b) Faster R-CNN).

           Objects      Total cameras  Resolution  Total frames  Avg. objects/frame
WildTrack  Pedestrians  7              1920x1080   29400         23
CityFlow   Vehicles     5              960x480     9775          13
Table 2: Datasets description.
To show how well uncertain bounding boxes fit the training of backbone layers, we compare three self-training processes on CAM-1 from WildTrack. MCSSL only uses uncertain bounding boxes to train backbone layers. MCSSL-C trains backbone layers with confident bounding boxes, and MCSSL-CU uses both confident and uncertain bounding boxes. In all three approaches, we train backbone layers for 30 epochs and then fine-tune detection layers following the same protocol for another 30 epochs. In Figure 5, as expected, training backbone layers with more high-confidence bounding boxes (MCSSL-CU) outperforms MCSSL in the first 30 epochs. However, MCSSL achieves a better final mAP, since training twice on confident bounding boxes (in both backbone and detection layer fine-tuning) makes the detection model more prone to overfitting and hence limits its generalization ability on testing data. Compared with MCSSL-C, MCSSL achieves a better mAP in the first 30 epochs and a better final mAP in the later 30 epochs. This is because the set of confident bounding boxes is much smaller, which is insufficient for training the data-hungry backbone network.
Figure 5: The performance comparison of different self-training processes over training epochs ((a) YOLOv3; (b) Faster R-CNN).
To analyze the impact of important settings in MCSSL, we conduct experiments on CAM-1 of WildTrack with various T_cls values (Figure 6) and backbone architectures (Figure 7). It comes as no surprise that MCSSL yields lower accuracy with a smaller value of T_cls due to the increasing noise in the training data for the detection network. However, it is worth noting that setting T_cls too high (e.g., 0.9 in our experiment) could also negatively impact model customization due to insufficient training of the detection network. We leave T_cls as a hyperparameter to be tuned during training on different datasets. As shown in Figure 7, MCSSL is also amenable to different kinds of backbone networks, as we see a steady improvement of detection accuracy with the increase of backbone capacity.

Figure 6: The performance comparison of MCSSL under different T_cls (0.6, 0.7, 0.8, 0.9) over training epochs ((a) YOLOv3; (b) Faster R-CNN).
Figure 7: The performance comparison of MCSSL with different backbones (Darknet53, ResNet50, ResNet101, ResNeXt101) over training epochs ((a) YOLOv3; (b) Faster R-CNN).
Conclusion

We propose MCSSL, a novel self-training mechanism with consistency loss, to customize object detection models in a multi-camera network. MCSSL separates object detection models into backbone layers and detection layers, and builds a reID-like pretext task to pre-train the backbone layers. To build training datasets, MCSSL associates bounding boxes between cameras using state-of-the-art tracking and reID algorithms with epipolar constraints. The large amount of non-camera-specific data is used to train the backbone network, whereas high-quality camera-specific data is leveraged for detection network fine-tuning. Our evaluation results on two real-world datasets show that MCSSL achieves new state-of-the-art results for customizing detection models.
References

[Athiwaratkun et al., 2019] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. In International Conference on Learning Representations, 2019.
[Baque et al., 2017] Pierre Baque, Francois Fleuret, and Pascal Fua. Deep occlusion reasoning for multi-camera multi-target detection. In IEEE International Conference on Computer Vision, 2017.
[Chavdarova and Fleuret, 2017] Tatjana Chavdarova and Francois Fleuret. Deep multi-camera people detection. In IEEE International Conference on Machine Learning and Applications, 2017.
[Chavdarova et al., 2018] Tatjana Chavdarova, Pierre Baque, Stephane Bouquet, Andrii Maksai, Cijo Jose, Louis Lettry, Pascal Fua, Luc Van Gool, and Francois Fleuret. The WILDTRACK multi-camera person dataset. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Chen and Tsotsos, 2019] Bao Xin Chen and John K. Tsotsos. Fast visual object tracking with rotated bounding boxes. In IEEE International Conference on Computer Vision Workshops, 2019.
[Chen et al., 2019] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[Co et al., 2019] Kenneth T. Co, Luis Muñoz González, Sixte de Maupeou, and Emil C. Lupu. Procedural noise adversarial examples for black-box attacks on deep convolutional networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, 2019.
[Gao et al., 2019] Jiyang Gao, Jiang Wang, Shengyang Dai, Li-Jia Li, and Ram Nevatia. NOTE-RCNN: Noise tolerant ensemble RCNN for semi-supervised object detection. In IEEE International Conference on Computer Vision, 2019.
[Guoa et al., 2019] Yunhui Guoa, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris. SpotTune: Transfer learning through adaptive fine-tuning. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[He et al., 2017] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, 2017.
[Hendrycks et al., 2018] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In Conference and Workshop on Neural Information Processing Systems, 2018.
[Inoue et al., 2018] Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Jeong et al., 2019] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak. Consistency-based semi-supervised learning for object detection. In Conference and Workshop on Neural Information Processing Systems, 2019.
[Joseph and Ali, 2018] Redmon Joseph and Farhadi Ali. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[Kocabas et al., 2019] Muhammed Kocabas, Salih Karagoz, and Emre Akbas. Self-supervised learning of 3D human pose using multi-view geometry. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Lee et al., 2019] Wonhee Lee, Joonil Na, and Gunhee Kim. Multi-task self-supervised object detection via recycling of bounding box annotations. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Li et al., 2018a] Yu-Jhe Li, Fu-En Yang, Yen-Cheng Liu, Yu-Ying Yeh, Xiaofei Du, and Yu-Chiang Frank Wang. Adaptation and re-identification network: An unsupervised deep transfer learning approach to person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Li et al., 2018b] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: Design backbone for object detection. In European Conference on Computer Vision, 2018.
[Liu et al., 2019] Yiheng Liu, Zhenxun Yuan, Wengang Zhou, and Houqiang Li. Spatial and temporal mutual promotion for video-based person re-identification. In Association for the Advancement of Artificial Intelligence, 2019.
[Liu et al., 2020] Yudong Liu, Yongtao Wang, Siwei Wang, TingTing Liang, Qijie Zhao, Zhi Tang, and Haibin Ling. CBNet: A novel composite backbone network architecture for object detection. In Association for the Advancement of Artificial Intelligence, 2020.
[Ouyang et al., 2016] Wanli Ouyang, Xiaogang Wang, Cong Zhang, and Xiaokang Yang. Factors in finetuning deep model for object detection with long-tail distribution. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[Pathak et al., 2020] Priyank Pathak, Amir Erfan Eshratifar, and Michael Gormish. Video person re-ID: Fantastic techniques and where to find them. In Association for the Advancement of Artificial Intelligence, 2020.
[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Conference and Workshop on Neural Information Processing Systems, 2015.
[Rodner et al., 2016] Erik Rodner, Marcel Simon, Robert B. Fisher, and Joachim Denzler. Fine-grained recognition in the noisy wild: Sensitivity analysis of convolutional neural networks approaches. In British Machine Vision Conference, 2016.
[Tang et al., 2019] Zheng Tang, Milind Naphade, Ming-Yu Liu, Xiaodong Yang, Stan Birchfield, Shuo Wang, Ratnesh Kumar, David Anastasiu, and Jenq-Neng Hwang. CityFlow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Wang et al., 2019] Guangcong Wang, Jianhuang Lai, Peigen Huang, and Xiaohua Xie. Spatial-temporal person re-identification. In Association for the Advancement of Artificial Intelligence, 2019.
[Xu et al., 2019a] Hang Xu, ChenHan Jiang, Xiaodan Liang, and Zhenguo Li. Spatial-aware graph relation network for large-scale object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Xu et al., 2019b] Hang Xu, ChenHan Jiang, Xiaodan Liang, Liang Lin, and Zhenguo Li. Reasoning-RCNN: Unifying adaptive global reasoning into large-scale object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Yuhua Chen et al., 2018] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive Faster R-CNN for object detection in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Zhao et al., 2019] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling. M2Det: A single-shot object detector based on multi-level feature pyramid network. In Association for the Advancement of Artificial Intelligence, 2019.
[Zheng et al., 2020] Zhedong Zheng, Tao Ruan, Yunchao Wei, Yi Yang, and Tao Mei. VehicleNet: Learning robust visual representation for vehicle re-identification. arXiv preprint arXiv:2004.06305, 2020.
[Zhong et al., 2018] Zhun Zhong, Liang Zheng, Zhedong Zheng, Shaozi Li, and Yi Yang. Camera style adaptation for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.