Custom Object Detection via Multi-Camera Self-Supervised Learning
Yan Lu∗ and Yuanchao Shu
New York University, Microsoft
[email protected], [email protected]
∗Contact Author
Abstract
This paper proposes MCSSL, a self-supervised learning approach for building custom object detection models in multi-camera networks. MCSSL associates bounding boxes between cameras with overlapping fields of view by leveraging epipolar geometry and state-of-the-art tracking and reID algorithms, and prudently generates two sets of pseudo-labels to fine-tune the backbone and detection networks of an object detection model, respectively. To train effectively on pseudo-labels, a powerful reID-like pretext task with consistency loss is constructed for model customization. Our evaluation shows that, compared with legacy self-training methods, MCSSL improves average mAP by . and . on the WildTrack and CityFlow datasets, respectively.

Introduction

Object detection plays a pivotal role in video analytics. Although deep neural network (DNN)-based object detection models pre-trained on large public datasets (e.g.,
MS-COCO) exhibit decent performance in various scenarios, custom models are more desirable due to their higher accuracy and robustness [Ouyang et al., 2016; Guoa et al., 2019].

Model customization relies on context-specific (or domain-specific) training data. Unlike general-purpose training datasets, large-scale context-specific labels are far harder to collect in a sustainable manner. For most computer vision tasks, manual labeling (e.g., drawing bounding boxes) remains the major source of training data. Nonetheless, human annotation is known to be costly and time-consuming, and distributing frames outside of the camera network also raises privacy concerns. While we are witnessing advancements in semi-supervised and weakly-supervised learning, the performance of models trained with state-of-the-art semi- and weakly-supervised algorithms still falls short of supervised object detectors. In semi-supervised learning, for instance, it is hard to converge on regularization terms (e.g., consistency loss) [Athiwaratkun et al., 2019] with large-scale unlabeled data. Similarly, detection performance suffers from local minima in weakly-supervised approaches [Inoue et al., 2018].

This paper is motivated by a simple question: can we improve self-supervised object detection with a fleet of cameras that have a (partially) shared field-of-view (FOV)?
Driven by advances in computer vision and the plummeting cost of camera hardware, organizations are deploying video cameras at scale for a variety of applications. In many scenarios, cameras have overlapping areas for the spatial monitoring of physical premises. Detection results on neighboring cameras could potentially serve as pseudo-labels and allow each camera to learn its own detector continuously. To achieve this goal, however, we need to tackle the following two major challenges.
How to create pseudo-labels?
Object detection accuracy varies across cameras and changes over time due to traffic dynamics and environmental factors (e.g., lighting changes), making it hard to identify high-quality pseudo-labels. It is equally important to assign the right set of pseudo-labels to each camera, as blindly training with all pseudo-labels adversely impacts model customization.
How to learn from noisy labeled data?
A straightforward way to customize a model is to fine-tune the default object detection model on each camera using pseudo-labels. However, fine-tuning a large DNN with insufficient and noisy pseudo-labels tends to lead to accuracy drops and overfitting [Hendrycks et al., 2018].

To address these issues, we propose MCSSL, a novel self-supervised training mechanism to customize object detection models on each camera. The key idea of MCSSL is to create pseudo-labels at different confidence levels and use them to train different parts of the network separately. MCSSL divides object detection models into two parts: the backbone network, which provides discriminative low-level feature maps, and the detection network, which consists of the region proposal network (RPN), RoI feature extractor, etc. The two parts have their unique characteristics (Table 1). The backbone network has far more parameters than the RPN and RoI extractor, hence it demands more training data to fine-tune itself. On the other hand, it is less susceptible to training data noise since it is trained for low-level features [Liu et al., 2020; Li et al., 2018b; Ouyang et al., 2016]. We also found that, for most off-the-shelf object detection models, bounding boxes with a high classification score are rarely false positives. However, it is common to see false negative cases with a low classification score.
                     Training Data Demand    Noise Sensitivity
Detection Network    Low                     High
Backbone Network     High                    Low
Table 1: Object Detection Model Characteristics.

Based on these two insights, MCSSL categorizes the bounding boxes detected by the base model on each camera into confident pseudo-labels and uncertain pseudo-labels based on their classification score. Despite their relatively low volume, confident pseudo-labels serve nicely for detection network fine-tuning due to their high quality. On the contrary, the larger number of uncertain pseudo-labels fits well with the backbone network, which is more tolerant to noise.

To build a custom object detection model, the detection network and backbone network on each camera need to be trained independently with its own context-specific data. To this end, MCSSL adopts advanced video reID algorithms to associate bounding boxes across cameras. To improve the efficiency and precision of bounding box matching, a prune-and-augment approach is introduced by leveraging epipolar constraints and tracking. Akin to feature categorization in classification and reID tasks [Zhong et al., 2018; Li et al., 2018a], we treat paired bounding boxes as non-camera-specific data due to their camera-invariant features on the same object. Combined with the above finding, uncertain pseudo-labels that belong to the non-camera-specific data are used to update the backbone network on each camera. To work with pairs of images as input, we devise a novel reID-like pretext task, turning backbone network fine-tuning into consistency training. Once we obtain a more powerful backbone network, we use the traditional self-training framework to update the detection network (i.e., RPN and RoI feature extractor) with the entire set of confident pseudo-labels.

We found the results promising. Compared with the best self-training algorithm, MCSSL, on average, improves mAP by . and . for each camera on the WildTrack and CityFlow datasets, respectively. Two out of seven cameras even achieve an mAP gain as high as on the WildTrack dataset. In summary, we made the following three intellectual contributions.
i) We proposed the first self-supervised learning approach that allows object detection model customization in multi-camera networks.
ii) We devised an effective approach to generate training data and fine-tune different parts of an object detection model with a reID-like pretext task.
iii) We achieved new state-of-the-art mAP results for self-supervised learning-based custom object detection on the WildTrack and CityFlow datasets.

Related Work

Anchor-based object detection: Anchor-based deep object detection models [Joseph and Ali, 2018; He et al., 2017; Ren et al., 2015; Zhao et al., 2019] comprise three modules: 1) the backbone network, which extracts general features (i.e., edges, corners) of a given image; 2) the region proposal network (RPN) [Ren et al., 2015], which generates candidate bounding boxes based on simpler components from lower layers in the backbone network; and 3) the RoI feature extractor, which extracts fine-grained features and assigns a class probability to each RoI generated by the RPN. Models containing all three modules are called three-stage object detection models (e.g., Mask R-CNN), whereas two-stage models (e.g., YOLOv3 and M2Det) [Joseph and Ali, 2018; Zhao et al., 2019] remove the RPN and run the RoI feature extractor directly on feature blocks generated by the backbone network to improve inference speed. MCSSL builds on top of the existing anchor-based object detection architecture and fine-tunes the backbone network and detection network independently with prudently generated training data for model customization.
Semi- and weakly-supervised object detection: Despite recent advancements [Gao et al., 2019; Lee et al., 2019; Yuhua Chen et al., 2018; Inoue et al., 2018; Jeong et al., 2019], today's semi- and weakly-supervised learning algorithms still fall short on accuracy in object detection tasks. A common approach to semi-supervised object detection is mining-training; [Gao et al., 2019] is the first end-to-end semi-supervised framework for object detection. To utilize a large amount of unlabeled data and handle label noise, many works [Lee et al., 2019; Xu et al., 2019a; Xu et al., 2019b] seek to construct auxiliary tasks (a.k.a. pretext tasks) to indirectly train the network. For example, [Xu et al., 2019a; Xu et al., 2019b] add a knowledge graph mining task, and [Lee et al., 2019] constructs three new labeling tasks (closeness labeling, multi-object labeling and foreground labeling) to assist object detection model training. These kinds of approaches are also known as self-supervised learning since, in auxiliary tasks, pseudo-labels are found or mined in unlabeled data automatically. Besides self-supervised approaches, constructing consistency [Jeong et al., 2019] between different versions of a given image has become an effective tool for enhancing detection models' performance on unlabeled data. Inspired by self-supervised learning and consistency learning approaches, we devise a new reID-like pretext task trained by means of consistency learning to assist custom object detection model training on multi-camera datasets.
Multi-camera detection: To deal with occluded objects from a single view, many works [Chavdarova and Fleuret, 2017; Baque et al., 2017] utilize multi-view streams to build powerful 3D detection algorithms for all cameras. However, the majority of these methods are supervised learning-based, requiring significantly more labeled data than monocular object detection. More recently, we have seen work on multi-view human pose estimation [Kocabas et al., 2019] that does not learn a 3D model but instead seeks to train 2D models for each camera by adding a new self-supervised learning task. To the best of our knowledge, MCSSL is the first self-supervised learning-based approach to obtain custom object detection models in multi-camera environments.
Design
MCSSL works in multi-camera scenarios where at least two cameras (partially) share a field-of-view. At the beginning, cameras run off-the-shelf object detection models (i.e., base models) trained on large-scale public datasets (e.g., YOLOv3, Mask R-CNN). The goal of MCSSL is to build an accurate custom object detection model for each camera over time through cross-camera model fine-tuning. In what follows, we use two cameras as a simple example to elaborate the process of model customization on cam 1 with the help of cam 2 (Figure 1).

Phase 1: pseudo-label generation: First, object detection results are obtained from the base model on frames from both cam 1 and cam 2. MCSSL sets bounding boxes with a high classification score (e.g., above a threshold T_cls) as confident pseudo-labels and the remaining bounding boxes as uncertain pseudo-labels.

Phase 2: cross-camera pseudo-label sharing: MCSSL treats views of one object on different cameras as style-transferred images, and associates pseudo-labels across cameras using state-of-the-art reID models. Notably, epipolar geometry and tracking are leveraged, which significantly reduces compute overhead and improves accuracy. Associated pseudo-labels are categorized as non-camera-specific and camera-specific training data, where non-camera-specific training data refer to objects seen by multiple cameras whereas camera-specific training data are objects appearing on a single camera.
Phase 3: consistency learning: MCSSL constructs a reID-like pretext task that uses non-camera-specific training data to fine-tune the backbone network with a consistency loss. Camera-specific training data is used to customize the detection network as in most existing self-training algorithms.
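As a concrete illustration of Phase 1, the following is a minimal sketch (not the authors' implementation) that splits base-model detections into confident and uncertain pseudo-labels by classification score. The detection tuple format and the default threshold value are assumptions; the paper leaves T_cls as a tunable hyperparameter.

```python
from typing import List, Tuple

# A detection as (x1, y1, x2, y2, class_id, score); this format is assumed.
Detection = Tuple[float, float, float, float, int, float]

def split_pseudo_labels(detections: List[Detection], t_cls: float = 0.8):
    """Phase 1: boxes with classification score above t_cls become
    confident pseudo-labels (used later to fine-tune the detection
    network); the rest become uncertain pseudo-labels (used to
    fine-tune the noise-tolerant backbone network).
    t_cls = 0.8 is a placeholder; the paper tunes it per dataset."""
    confident = [d for d in detections if d[5] >= t_cls]
    uncertain = [d for d in detections if d[5] < t_cls]
    return confident, uncertain
```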
Pseudo-Label Generation

When cam 1 and cam 2 collect enough new images, we first use off-the-shelf object detection models to generate pseudo-labels. We denote by BB_1 and BB_2 the pseudo-labels for cam 1 and cam 2, respectively. Intuitively, bounding boxes with a higher classification score are more likely to be true positives. As reported in a large body of computer vision work [Joseph and Ali, 2018; He et al., 2017; Ren et al., 2015], using a high classification score to filter bounding boxes tends to lead to high precision but low recall across the majority of DNN object detection models. That is, bounding boxes with a high classification score are rarely false positives, whereas false negative bounding boxes due to a low classification score are commonly seen. Based on this insight, we set a high threshold T_cls, and use the high-quality confident pseudo-labels BB_c, the bounding boxes whose classification scores are larger than T_cls, to train the upper-layer detection networks (i.e., RoI feature extractor). Accordingly, uncertain pseudo-labels (BB_u) are used to train the initial CNN layers (i.e., backbone network), which are less susceptible to noise [Rodner et al., 2016; Co et al., 2019].

Cross-Camera Pseudo-Label Sharing

Style transfer is a commonly used data augmentation technique in DNN model training [Yuhua Chen et al., 2018; Inoue et al., 2018]. A key advantage of a multi-camera network is the richness of data from different vantage points. Hence, in MCSSL, we associate bounding boxes of the same object on different cameras and feed them into model fine-tuning. In a nutshell, bounding box association is achieved by reID. However, naïve reID poses two challenges. First, state-of-the-art reID models are only able to achieve an mAP of around . , which leads to a fair number of false positives and impairs model fine-tuning. Second, pairwise comparison between bounding boxes on all cameras incurs a quadratic computation overhead, which is prohibitively high for scenarios with busy traffic. To deal with these two issues, we employ a prune-and-augment approach, which first uses epipolar constraints to filter out a large number of bounding boxes that are unlikely to be confirmed by reID, and then augments the refined pseudo-label pairs through tracking.

Epipolar Constraint-based Pruning
When two cameras view the same 3D space from different viewpoints, geometric relations among 3D points and their projections onto the 2D planes lead to constraints on the image points. This intrinsic projective geometry is captured by a fundamental matrix F in epipolar geometry, which can be calculated as F = K_2^{-T} [t]_× R K_1^{-1}, where K_1 and K_2 represent the intrinsic parameters, and R and [t]_× are the relative camera rotation and translation which describe the location of the second camera relative to the first in global coordinates (a.k.a. extrinsic parameters). Given F, for a physical 3D position P in the overlapping area of cam 1 and cam 2, we have p_2^T F p_1 = 0, where p_1 and p_2 are the projections of the scene point P on cam 1 and cam 2. In essence, this equation characterizes an epipolar plane containing P and the epipoles O_1 and O_2 of both cameras.

The epipolar plane offers a unique characteristic for building associations between bounding boxes on different cameras. As can be seen from Figure 2, the intersections of the epipolar plane with the two image planes are two lines, called epipolar lines. This means that any particular point p_1 on cam 1 is always mapped to a point along the epipolar line l_2 = F p_1 in the image from cam 2.

Given the epipolar constraints, we can now map a bounding box in the image from the "teacher camera" to four epipolar lines in another camera's image, which significantly reduces the search space of potential bounding boxes. For instance, it reduces the search space by 12x on the WildTrack dataset. Note that although the explanations above assume cameras are calibrated and time-synchronized, we add a fudge factor in our spatial filtering algorithm to compensate for calibration noise and slight time shifts. Since epipolar geometry only defines an area for each bounding box, we take all bounding boxes in BB_u that fall into this area as candidate bounding boxes (i.e., coarse non-camera-specific training data). In order to fine-tune the object detection model of camera i, MCSSL applies this mapping to all cameras that share a view with i. In the example of customizing an object detection model for cam 1 with the help of cam 2 (Figure 1), we run epipolar geometry-based mapping on all bounding boxes in cam 2 to find all candidate bounding boxes on cam 1.
Figure 1: MCSSL overview.
Figure 2: Illustration of epipolar constraints.
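To make the pruning step concrete, below is a small NumPy sketch of the epipolar filter under the stated assumptions: calibrated, time-synchronized cameras and boxes in (x1, y1, x2, y2) pixel format. The center-in-band test and the margin parameter (playing the role of the fudge factor) are our assumptions; the paper does not spell out the exact geometric test.

```python
import numpy as np

def fundamental_matrix(K1, K2, R, t):
    """F = K2^{-T} [t]_x R K1^{-1}, as given in the text."""
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])
    return np.linalg.inv(K2).T @ tx @ R @ np.linalg.inv(K1)

def _norm_line(l):
    """Scale a 2D line (a, b, c) to unit normal with a fixed
    orientation, so signed distances are comparable across lines."""
    l = l / np.hypot(l[0], l[1])
    return -l if (l[1] < 0 or (l[1] == 0 and l[0] < 0)) else l

def epipolar_candidates(teacher_box, boxes, F, margin=20.0):
    """Map the four corners of a teacher-camera box to four epipolar
    lines in the other camera, then keep boxes whose centers fall
    inside (or within `margin` pixels of) the band those lines span.
    `margin` is the fudge factor compensating for calibration noise
    and slight time shift; its value here is a placeholder."""
    x1, y1, x2, y2 = teacher_box
    corners = [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]
    lines = [_norm_line(F @ np.array([x, y, 1.0])) for x, y in corners]
    kept = []
    for b in boxes:
        c = np.array([(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0, 1.0])
        d = [l @ c for l in lines]  # signed distances to each line
        if min(d) < margin and max(d) > -margin:
            kept.append(b)
    return kept
```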
Data Augmentation with Tracking
Epipolar constraints effectively reduce the search space of bounding boxes for reID. Nonetheless, they filter out pairs of bounding boxes on cameras at different times. For instance, person A on cam 1 at the i-th frame may not fall within the epipolar constraint of the bounding box of person A on cam 2 at the j-th frame, even though this is a valid pair of non-camera-specific data. To revive this large set of training data (due to its combinatorial nature), we leverage temporal correlations on each camera to find bounding boxes belonging to the same object. Specifically, once we find person A on cam 1 and cam 2 at the i-th frame, we run SiamMask-E [Chen and Tsotsos, 2019], a state-of-the-art tracking algorithm, on the subsequent four frames from both cameras to get bounding boxes of person A. This set of data is called "coarse reID training data" (Figure 1).

Data augmentation with tracking also allows us to use a video reID algorithm to finalize bounding box association. Compared with image reID, video reID [Liu et al., 2019; Wang et al., 2019] has proven to be more accurate and reliable. In MCSSL, we adopt B-BOT+Attn-CL [Pathak et al., 2020] to prune the coarse reID training data. Since it extracts an aggregated feature from four consecutive frames, we use a pre-defined aggregated feature distance threshold [Pathak et al., 2020] to determine whether two bounding boxes belong to the same person.
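The augment-then-verify step can be sketched as follows. The `tracker` and `reid` callables are hypothetical stand-ins for the interfaces of SiamMask-E and B-BOT+Attn-CL, and the cosine-distance threshold is an assumption standing in for the pre-defined aggregated feature distance threshold.

```python
import numpy as np

def augment_and_verify(box1, box2, frames1, frames2, tracker, reid,
                       dist_thresh=0.3, horizon=4):
    """Extend an epipolar-matched pair (box1 on cam 1, box2 on cam 2)
    over the next `horizon` frames with a single-object tracker, then
    confirm the pair with a video reID model on the two tracklets.
    Assumed interfaces: tracker(box, frames) -> list of boxes, and
    reid(frames, boxes) -> aggregated feature vector.
    Returns the paired tracklets if verified, else None."""
    track1 = [box1] + tracker(box1, frames1[1:1 + horizon])
    track2 = [box2] + tracker(box2, frames2[1:1 + horizon])
    f1 = reid(frames1[:1 + horizon], track1)  # aggregated features
    f2 = reid(frames2[:1 + horizon], track2)
    cos_dist = 1.0 - np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))
    return (track1, track2) if cos_dist < dist_thresh else None
```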
Consistency Learning

DNN object detection models use backbone networks (i.e., pre-trained classification networks like ResNet or GoogLeNet) to extract discriminative feature maps from an image. The most common method to retrain a DNN detector is to freeze the backbone network and fine-tune the remaining detection layers (i.e., RPN and RoI extractors) on a new dataset. In spite of fast convergence, this approach suffers from suboptimal performance due to the insufficiently trained backbone network.

To address this limitation, we use non-camera-specific training data to train the backbone network. To be able to use pairs of bounding boxes in the non-camera-specific training dataset, MCSSL creates a reID-like pretext task. Specifically, it takes pairs of images with bounding boxes belonging to the same object as input and runs them through the entire model to get feature maps. Afterwards, it calculates the classification score (cls) of each image purely based on features within the paired bounding box, and uses a consistency loss in back propagation to train the backbone network (Figure 1). The consistency loss in MCSSL is defined as L_consistency = Σ_{k=1}^{P} CE(cls_k^1, cls_k^2), where P is the total number of pairs of bounding boxes in the two images, CE represents the cross-entropy function, and cls_k^i denotes the predicted classification score of the k-th bounding box from cam i. As the consistency loss is minimized by fine-tuning, the backbone network generates more representative feature maps.

After backbone network fine-tuning, we adopt the classic self-training framework [Kocabas et al., 2019; Lee et al., 2019; Gao et al., 2019] to update the RPN and RoI feature extractor. Here camera-specific data (i.e., confident pseudo-labels on the camera itself) is used directly as ground truth for detection model fine-tuning.

In summary, the overall loss function of MCSSL can be formulated as L_overall = α · L_consistency + β · L_det. We first minimize the consistency loss by updating the backbone network (α = 1, β = 0), and then reduce the self-training loss by updating the RPN and RoI feature extractor (α = 0, β = 1).
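A minimal PyTorch rendering of the consistency loss and the two-phase α/β schedule follows. Treating one camera's softmax output as a soft target is our assumption; the text only specifies cross-entropy between the paired classification scores.

```python
import torch
import torch.nn.functional as F

def consistency_loss(cls1: torch.Tensor, cls2: torch.Tensor) -> torch.Tensor:
    """L_consistency = sum_{k=1}^{P} CE(cls_k^1, cls_k^2).
    cls1, cls2: (P, num_classes) classification logits for the same P
    paired boxes seen from cam 1 and cam 2. Using cam 2's softmax
    output as a soft target is an assumption."""
    target = F.softmax(cls2, dim=1)
    return F.cross_entropy(cls1, target, reduction='sum')

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    """Freeze/unfreeze part of the detector, implementing the alpha/beta
    schedule: (alpha=1, beta=0) trains only the backbone; (alpha=0,
    beta=1) trains only the RPN and RoI feature extractor."""
    for p in module.parameters():
        p.requires_grad = flag
```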
Evaluation

We evaluate MCSSL on real-world multi-camera datasets and present evaluation highlights in this section.

Datasets. Our experiments are conducted on two multi-camera detection datasets, namely WildTrack and CityFlow [Chavdarova et al., 2018; Tang et al., 2019] (Table 2). WildTrack is by far the largest multi-camera dataset for pedestrian detection and tracking, while CityFlow is built for multi-camera vehicle tracking. We used data collected from the first intersection of CityFlow.

Implementation. We implemented MCSSL with mmdetection [Chen et al., 2019], an open source object detection toolbox based on PyTorch, and conducted all experiments using two Nvidia GeForce RTX 2080 Ti GPUs.
Models.
YOLOv3 [Joseph and Ali, 2018] and Faster R-CNN [Ren et al., 2015], pre-trained on COCO with Darknet53 and ResNet101 backbones, are used as our default two-stage and three-stage object detection models. Evaluation results with different backbones are presented in Section 4.5. T_cls is set to . . We use SiamMask-E [Chen and Tsotsos, 2019] for tracking, and the state-of-the-art reID algorithms B-BOT+Attn-CL [Pathak et al., 2020] and VehicleNet [Zheng et al., 2020] for person and vehicle reID, respectively.

Training Settings.
We set the ratio of the training, validation and testing sets to 16:4:5, the batch size to , and choose SGD with a learning rate of . . All training lasts for 60 epochs, with the first 30 epochs on the backbone network and the subsequent 30 epochs on the detection network.

Evaluation Metrics.
We use mean Average Precision (mAP) over an Intersection over Union (IoU) threshold of 0.8 (mAP@[0.8:1.0]).
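For reference, here is a minimal sketch of the IoU computation underlying this metric; the 0.05 step between IoU thresholds follows the COCO convention and is an assumption, since the text only gives the range.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# mAP@[0.8:1.0]: AP averaged over IoU thresholds 0.80, 0.85, 0.90, 0.95
# (0.05 step assumed, following the COCO convention).
IOU_THRESHOLDS = np.arange(0.80, 1.00, 0.05)
```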
Baselines.
We compare MCSSL with three self-supervised learning approaches.
i) Self-Training (ST): the most widely used self-training mechanism with confident pseudo-labels [Gao et al., 2019; Lee et al., 2019]. It is trained in the supervised learning fashion on confident pseudo-labels.
ii) Self-Training with Gold Loss Correction (ST-GLC) [Hendrycks et al., 2018]: an improved version of ST with gold loss correction. It first estimates a corruption matrix C of conditional corruption probabilities using confident pseudo-labels, and then uses C to correct the class labels of all uncertain pseudo-labels. It uses all pseudo-labels to fine-tune the original object detection model.
iii) Self-Training with Consistency Loss (ST-CL) [Jeong et al., 2019]: the most recent work on self-training. It uses two images (the original image and a flipped image) as input, and constructs a consistency loss between the two images during training. When training on confident pseudo-labels, it uses both the supervised loss and the consistency loss; when training on uncertain pseudo-labels, it uses only the consistency loss.
In addition to the above three baselines, we report results from supervised training with ground truth (i.e., training the custom detection model on each camera with human-labeled bounding boxes from itself). It serves as an upper bound for self-training methods. To show the gain of model customization, we also include the mAP of the base models (i.e., YOLOv3 and Faster R-CNN).
Figure 3 shows the performance of customizing YOLOv3 and Faster R-CNN on WildTrack. Compared with the best known self-training approach (ST-GLC), MCSSL improves the mAP of YOLOv3 and Faster R-CNN by . and . on average for each camera on WildTrack. This shows that MCSSL is an effective framework for both two-stage and three-stage object detection models. Interestingly, MCSSL performs worse than ST-GLC on CAM-7 in both Figure 3a and Figure 3b. This is because CAM-7 has the least amount of shared FOV and hence far less non-camera-specific training data (e.g., . less than other cameras on average in Figure 3a) for backbone network fine-tuning.
Figure 3: mAP of self-training approaches on WildTrack ((a) YOLOv3; (b) Faster R-CNN).

Using the same settings, we report the performance of MCSSL on YOLOv3 and Faster R-CNN on CityFlow in Figure 4. Compared with ST-GLC, MCSSL obtains an average mAP improvement of . and . on the two models. In particular, on CAM-5, YOLOv3 and Faster R-CNN are improved by . mAP and . mAP, since CAM-5 largely overlaps with other cameras and hence gets more pseudo-labels from its neighbors.
Figure 4: mAP of self-training approaches on CityFlow ((a) YOLOv3; (b) Faster R-CNN).

           Objects      Total cameras  Resolution  Total frames  Avg. objects/frame
WildTrack  Pedestrians  7              1920x1080   29400         23
CityFlow   Vehicles     5              960x480     9775          13
Table 2: Datasets description.
To show how well uncertain bounding boxes fit the training of backbone layers, we compare three self-training processes on CAM-1 from WildTrack. MCSSL only uses uncertain bounding boxes to train backbone layers. MCSSL-C trains backbone layers with confident bounding boxes, and MCSSL-CU uses both confident and uncertain bounding boxes. In all three approaches, we train backbone layers for 30 epochs and then fine-tune detection layers following the same protocol for another 30 epochs. In Figure 5, as expected, training backbone layers with more high-confidence bounding boxes (MCSSL-CU) outperforms MCSSL in the first 30 epochs. However, MCSSL achieves a better final mAP, since training twice on confident bounding boxes (in both backbone and detection layer fine-tuning) makes the detection model more prone to overfitting and hence limits its generalization ability on testing data. Compared with MCSSL-C, MCSSL achieves a better mAP in the first 30 epochs and a better final mAP in the later 30 epochs. This is because the set of confident bounding boxes is much smaller, which is insufficient for training the data-hungry backbone network.
Figure 5: The performance comparison of different self-training processes over training epochs ((a) YOLOv3; (b) Faster R-CNN).
To analyze the impact of important settings in MCSSL, we conduct experiments on CAM-1 of WildTrack with various T_cls values (Figure 6) and backbone architectures (Figure 7). It comes as no surprise that MCSSL yields lower accuracy with a smaller value of T_cls due to the increasing noise in the training data for the detection network. However, it is worth noting that setting T_cls too high (e.g., 0.9 in our experiment) could also negatively impact model customization due to insufficient training of the detection network. We leave T_cls as a hyperparameter to be tuned during training on different datasets. As shown in Figure 7, MCSSL is also amenable to different kinds of backbone networks, as we see a steady improvement of detection accuracy with the increase of backbone capacity.

Figure 6: The performance comparison of MCSSL under different T_cls (0.6, 0.7, 0.8, 0.9) over training epochs ((a) YOLOv3; (b) Faster R-CNN).
Figure 7: The performance comparison of MCSSL with different backbones (Darknet53, ResNet50, ResNet101, ResNeXt101) over training epochs ((a) YOLOv3; (b) Faster R-CNN).
Conclusion

We propose MCSSL, a novel self-training mechanism with consistency loss, to customize object detection models in a multi-camera network. MCSSL separates object detection models into backbone layers and detection layers, and builds a reID-like pretext task to pre-train the backbone layers. To build training datasets, MCSSL associates bounding boxes between cameras using state-of-the-art tracking and reID algorithms with epipolar constraints. The large amount of non-camera-specific data is used to train the backbone network, whereas high-quality camera-specific data is leveraged for detection network fine-tuning. Our evaluation results on two real-world datasets show that MCSSL achieves new state-of-the-art results for customizing detection models.
References

[Athiwaratkun et al., 2019] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. In International Conference on Learning Representations, 2019.
[Baque et al., 2017] Pierre Baque, Francois Fleuret, and Pascal Fua. Deep occlusion reasoning for multi-camera multi-target detection. In IEEE International Conference on Computer Vision, 2017.
[Chavdarova and Fleuret, 2017] Tatjana Chavdarova and Francois Fleuret. Deep multi-camera people detection. In IEEE International Conference on Machine Learning and Applications, 2017.
[Chavdarova et al., 2018] Tatjana Chavdarova, Pierre Baque, Stephane Bouquet, Andrii Maksai, Cijo Jose, Louis Lettry, Pascal Fua, Luc Van Gool, and Francois Fleuret. The WILDTRACK multi-camera person dataset. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Chen and Tsotsos, 2019] Bao Xin Chen and John K. Tsotsos. Fast visual object tracking with rotated bounding boxes. In IEEE International Conference on Computer Vision Workshops, 2019.
[Chen et al., 2019] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[Co et al., 2019] Kenneth T. Co, Luis Muñoz González, Sixte de Maupeou, and Emil C. Lupu. Procedural noise adversarial examples for black-box attacks on deep convolutional networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, 2019.
[Gao et al., 2019] Jiyang Gao, Jiang Wang, Shengyang Dai, Li-Jia Li, and Ram Nevatia. NOTE-RCNN: Noise tolerant ensemble RCNN for semi-supervised object detection. In IEEE International Conference on Computer Vision, 2019.
[Guoa et al., 2019] Yunhui Guoa, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris. SpotTune: Transfer learning through adaptive fine-tuning. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[He et al., 2017] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, 2017.
[Hendrycks et al., 2018] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In Conference and Workshop on Neural Information Processing Systems, 2018.
[Inoue et al., 2018] Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Jeong et al., 2019] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak. Consistency-based semi-supervised learning for object detection. In Conference and Workshop on Neural Information Processing Systems, 2019.
[Joseph and Ali, 2018] Redmon Joseph and Farhadi Ali. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[Kocabas et al., 2019] Muhammed Kocabas, Salih Karagoz, and Emre Akbas. Self-supervised learning of 3D human pose using multi-view geometry. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Lee et al., 2019] Wonhee Lee, Joonil Na, and Gunhee Kim. Multi-task self-supervised object detection via recycling of bounding box annotations. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Li et al., 2018a] Yu-Jhe Li, Fu-En Yang, Yen-Cheng Liu, Yu-Ying Yeh, Xiaofei Du, and Yu-Chiang Frank Wang. Adaptation and re-identification network: An unsupervised deep transfer learning approach to person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Li et al., 2018b] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: Design backbone for object detection. In European Conference on Computer Vision, 2018.
[Liu et al., 2019] Yiheng Liu, Zhenxun Yuan, Wengang Zhou, and Houqiang Li. Spatial and temporal mutual promotion for video-based person re-identification. In Association for the Advancement of Artificial Intelligence, 2019.
[Liu et al., 2020] Yudong Liu, Yongtao Wang, Siwei Wang, TingTing Liang, Qijie Zhao, Zhi Tang, and Haibin Ling. CBNet: A novel composite backbone network architecture for object detection. In Association for the Advancement of Artificial Intelligence, 2020.
[Ouyang et al., 2016] Wanli Ouyang, Xiaogang Wang, Cong Zhang, and Xiaokang Yang. Factors in finetuning deep model for object detection with long-tail distribution. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[Pathak et al., 2020] Priyank Pathak, Amir Erfan Eshratifar, and Michael Gormish. Video person re-ID: Fantastic techniques and where to find them. In Association for the Advancement of Artificial Intelligence, 2020.
[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Conference and Workshop on Neural Information Processing Systems, 2015.
[Rodner et al., 2016] Erik Rodner, Marcel Simon, Robert B. Fisher, and Joachim Denzler. Fine-grained recognition in the noisy wild: Sensitivity analysis of convolutional neural networks approaches. In British Machine Vision Conference, 2016.
[Tang et al., 2019] Zheng Tang, Milind Naphade, Ming-Yu Liu, Xiaodong Yang, Stan Birchfield, Shuo Wang, Ratnesh Kumar, David Anastasiu, and Jenq-Neng Hwang. CityFlow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Wang et al., 2019] Guangcong Wang, Jianhuang Lai, Peigen Huang, and Xiaohua Xie. Spatial-temporal person re-identification. In Association for the Advancement of Artificial Intelligence, 2019.
[Xu et al., 2019a] Hang Xu, ChenHan Jiang, Xiaodan Liang, and Zhenguo Li. Spatial-aware graph relation network for large-scale object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Xu et al., 2019b] Hang Xu, ChenHan Jiang, Xiaodan Liang, Liang Lin, and Zhenguo Li. Reasoning-RCNN: Unifying adaptive global reasoning into large-scale object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Yuhua Chen et al., 2018] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive Faster R-CNN for object detection in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Zhao et al., 2019] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling. M2Det: A single-shot object detector based on multi-level feature pyramid network. In Association for the Advancement of Artificial Intelligence, 2019.
[Zheng et al., 2020] Zhedong Zheng, Tao Ruan, Yunchao Wei, Yi Yang, and Tao Mei. VehicleNet: Learning robust visual representation for vehicle re-identification. arXiv preprint arXiv:2004.06305, 2020.
[Zhong et al., 2018] Zhun Zhong, Liang Zheng, Zhedong Zheng, Shaozi Li, and Yi Yang. Camera style adaptation for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.