Label and Sample: Efficient Training of Vehicle Object Detector from Sparsely Labeled Data
Xinlei Pan, UC Berkeley, Berkeley, CA, USA 94720, [email protected]
Sung-Li Chiang, UC Berkeley, Berkeley, CA, USA 94720, [email protected]
John Canny, UC Berkeley, Berkeley, CA, USA 94720, [email protected]
Abstract
Self-driving vehicle vision systems must deal with an extremely broad and challenging set of scenes. They can potentially exploit an enormous amount of training data collected from vehicles in the field, but the volumes are too large to train offline naively. Not all training instances are equally valuable though, and importance sampling can be used to prioritize which training images to collect. This approach assumes that objects in images are labeled with high accuracy. To generate accurate labels in the field, we exploit the spatio-temporal coherence of vehicle video. We use a near-to-far labeling strategy by first labeling large, close objects in the video, and tracking them back in time to induce labels on small distant presentations of those objects. In this paper we demonstrate the feasibility of this approach in several steps. First, we note that an optimal subset (relative to all the objects encountered and labeled) of labeled objects in images can be obtained by importance sampling using gradients of the recognition network. Next we show that these gradients can be approximated with very low error using the loss function, which is already available when the CNN is running inference. Then, we generalize these results to objects in a larger scene using an object detection system. Finally, we describe a self-labeling scheme using object tracking. Objects are tracked back in time (near-to-far), and labels of near objects are used to check the accuracy of those objects in the far field. We then evaluate the accuracy of models trained on importance-sampled data vs. models trained on complete data.
1. Introduction
Autonomous driving is receiving enormous development effort, with many companies predicting large-scale commercial deployment in 2-3 years [35]. One of the most important features of autonomous driving vehicles is the ability to interpret the surroundings and perform complex perception tasks such as the detection and recognition of lanes, roads, pedestrians, vehicles, and traffic signs [27]. Recently, the growth of Convolutional Neural Networks (CNNs) and large labeled data sets [8, 11] has led to tremendous progress in object detection and recognition [14, 15, 26, 20]. It is now possible to detect objects with high accuracy [26, 25, 20].

Figure 1. Overview of our method. Given a sequence of video frame inputs, the object detection network first detects objects in a frame (the forward pass step). Then the losses of the images are calculated. Both detection results and losses are sent to the object tracker, which keeps a list of active objects. The importance sampler determines whether to save a detection based on its loss. Furthermore, the near-to-far labeling step checks the accuracy of objects in the far field using the classification results of near-field objects, where we believe near-field objects are larger and their classification is more accurate.

Videos collected in-vehicle have great potential to improve model quality (through offline training), but at the scales achievable in a few years (billions to trillions of hours of video), training on all the data is completely impractical. Nor is it desirable: most images contain "typical" content and objects that are recognized with good accuracy. These images contribute little to the final model. It is the less common "interesting" images that are most important for training, i.e. images containing objects that are mis-classified, or classified with low confidence. To benefit from these images, it is still important to have accurate label information. Images of distant objects in isolation are not good for this purpose; by definition they contain objects which are difficult to label automatically. But we can use a particular characteristic of vehicle video: namely that most of the time (vehicle moving forward), objects gradually grow and become clearer and easier to recognize. We exploit this coherence by tracking objects back in time and using near, high-reliability labels to label more distant objects. We demonstrate this process on a hand-labeled dataset [13] which has only a small fraction of total frames labeled and also contains short video clips that can be used for object tracking. We show that we can extend the labeled data using a near-to-far tracking strategy, and that importance sampling can be used to refine the automatically labeled dataset and improve its quality.

To automate far-object tracking, we need both single-image object detection and between-image tracking. While these two modules can be used separately, we designed a strategy to combine them. Specifically, we use a Faster-RCNN object detector [26] and Kalman filtering to track objects. We use predicted object positions from tracking to augment the Faster-RCNN's region proposals, and then use the Faster-RCNN's bounding box regression to correct the object position estimates. The result is that we can track and persist object labels much further into the distance, where it might be hard for Faster-RCNN to give accurate object region proposals.

The automatically-generated labels are then used to compute the importance of each image. As shown in [2], an optimal model is obtained when images are importance sampled using the norm of the back-propagated gradient magnitude for the image.
While computing the full back-propagated gradients in a vehicle video system would be very expensive, we can use the loss function as a surrogate for the gradient, as it is much easier to obtain. We further show in our experiments that the loss function is a good approximation of the gradient norm. Importance sampling for training data filtering is also described in [7].

Contributions. Starting with a sparsely labeled video dataset, we combine object tracking and object detection to generate new labeled data, and then use importance sampling for data reduction. The contributions of this paper are: 1) We show that near-to-far object tracking can produce useful training data augmentation. 2) Empirically, the gradient norm can be approximated by the loss function and the last-layer gradient norm in deep neural networks. 3) Importance sampling produces large reductions in training data and training time with modest loss of accuracy.
2. Related Work
Semi-Supervised Data Labeling.
With the large number of sparsely labeled image datasets, some work has been done in the field of semi-supervised object detection and data labeling [34, 31, 18, 6]. These works try to learn a set of similar attributes for image classes [31, 6] to label new datasets, to cluster similar images [18], or to perform transfer learning to recognize similar types of objects [34]. However, these works are typically suited to image datasets where images are processed individually; they do not exploit the temporal continuity of video for semi-supervised learning. Semi-supervised learning for video datasets is described in [19, 24, 33]. While their performance is good, they assume the salient object in a video is the one to be labeled, and thus do not apply well to the multi-object detection or tracking case. A large body of work has also been done in the field of tracking-by-detection [3, 23, 16, 17, 5]. However, these methods either assume that negative data can be sampled from around the object, or they do not use the special characteristic of driving video that objects in the near field are easier to detect than objects in the far field. In addition, naive combinations of tracking and detection may introduce additional noise when labeling images. Tracking also tends to drift in the long run, and the related data association problem is very challenging [3]. The work of [21] proposed to use semi-supervised learning to label large video datasets by tracking multiple objects in videos. However, their application scenario is not driving video, and the objects they detect include only cars. In addition, their focus is on short-term tracking of objects, and they do not require short tracklets to be associated with each other. Therefore, the applicability of their method is limited, especially when there are multiple categories of objects in a scene, since ignoring data association is problematic if the goal is to label multiple categories of objects. In our work, we do consider the problem of data association, and we use the object tracker's predictions as region proposals for the object detector to provide more accurate bounding box annotations. However, similar to [21], we do not perform long-run tracking, to prevent the tracker from drifting away. We also use near-to-far labeling to help correct the detector's classification results.
Importance Sampling.
Importance sampling is a well-known technique for reducing variance when estimating properties of a particular distribution while only having samples generated from another distribution [22, 28]. The work of [38] studied the problem of improving traditional stochastic optimization with importance sampling, improving the convergence rate of prox-SMD [9, 10] and prox-SDCA [29] by reducing the stochastic variance. The work of [2] improves over [38] by incorporating stochastic gradient descent and deep neural networks. There is also work on using importance sampling for minibatch SGD [7], where importance sampling is used to select the data in each minibatch, improving the convergence rate of SGD. The idea of hard negative example mining is also highly related to our work: [30] presented an approach to efficient object detection training by training on bounding boxes sampled optimally according to their gradients. For training the vision systems of self-driving vehicles, we typically do not know the ground truth distribution of the data, i.e., of the images or video captured by cameras. Thus, importance sampling is very useful for estimating properties of the data with a data-driven sampling scheme. The works of [36] and [12] proposed to use importance sampling for visual tracking, but their focus was not on reducing the amount of training data or on creating labeled data using visual tracking. In our work, we use importance sampling to obtain an optimal subset of the data so that training efficiency is high, as we train on the most informative data. The information that each image carries is characterized by its detection loss, which is reasonable in our case since images with high loss are usually images that are difficult for the current detector.
3. Methods
Our approach for creating labeled data and performing data reduction by importance sampling can be divided into two parts. First, based on the sparsely labeled image frames, we initialize object trackers using the Kalman filter algorithm [37] and use each tracker to predict the bounding box of its object in the previous frame (since we track back in time). We then use the prediction as a region proposal and send it to the object detection module. Based on the region proposal received, the object detection module, trained on the sparsely labeled data, performs a bounding box regression to obtain the final bounding box and detection loss. The object tracker then matches new detections to existing trackers, or creates new trackers for detections that cannot be matched to any existing tracker. The near-to-far labeling module double-checks the object detection results within each tracker, using the classification results of objects in the near field, which are more accurate, to check the results of objects in the far field. The bounding boxes produced by the object detection module are used as labels for the unlabeled video frames, and the detection loss is used as the sampling weight for importance sampling. Second, based on the detection losses recorded in the first part, the importance sampler samples an optimal subset of the labeled images, and these selected images are used as the training data to train a new object detector. The system architecture is shown in figure 1. Below, we first describe the framework for semi-supervised data labeling, followed by data reduction using the importance sampling framework.
Object Tracking.
Starting with a few sparsely annotated video frames, we first trained an object detection network using Faster-RCNN [26]. Using a Kalman filter [37], we initialize object trackers from the ground truth labeled frames. The specific object tracking framework we use is similar to that of [1], where the state of the tracker includes 7 parameters: the center position of the bounding box (x, y); the scale s (the area of the bounding box); the aspect ratio r of the bounding box (the ratio of its width over its height); and the rates of change of the center position, $v_x$, $v_y$, and of the scale, $v_s$. We follow the assumption in [1] that the aspect ratio of the bounding box does not change over time. The measurement is the first four parameters of the state vector:

$$\text{state} = [x, y, s, r, v_x, v_y, v_s], \qquad \text{measurement} = [x, y, s, r].$$

We always use ground truth labeled bounding boxes to initialize object trackers, and the tracking is done from the near field to the far field, which means the video is played in the opposite direction to the one in which it was collected: at the very beginning the camera is close to the labeled objects, and at the very end the camera is far away from them. It is therefore reasonable to believe that the classification and detection results for objects in the near field are more reliable, while there is more noise in the detection results for objects in the far field.
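For concreteness, this tracker state can be realized with an off-the-shelf Kalman filter. The sketch below uses the filterpy library and mirrors the SORT-style layout of [1, 4]; the initial values and the inflated velocity uncertainty are illustrative assumptions, not the paper's tuned parameters.

```python
import numpy as np
from filterpy.kalman import KalmanFilter

def make_tracker(x, y, s, r):
    """Kalman filter over state [x, y, s, r, vx, vy, vs] with measurement [x, y, s, r]."""
    kf = KalmanFilter(dim_x=7, dim_z=4)
    # Constant-velocity transition: x += vx, y += vy, s += vs; r stays constant.
    kf.F = np.eye(7)
    kf.F[0, 4] = kf.F[1, 5] = kf.F[2, 6] = 1.0
    # The measurement observes the first four state components directly.
    kf.H = np.zeros((4, 7))
    kf.H[0, 0] = kf.H[1, 1] = kf.H[2, 2] = kf.H[3, 3] = 1.0
    kf.P[4:, 4:] *= 1000.0  # large prior uncertainty on the unobserved velocities
    kf.x[:4] = np.array([[x], [y], [s], [r]])
    return kf

trk = make_tracker(100.0, 80.0, 900.0, 0.5)     # initialize from a ground truth box
trk.predict()                                    # predicted box for the next (earlier) frame
trk.update(np.array([98.0, 79.0, 950.0, 0.5]))   # correct with the matched detection
```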
Prediction as Region Proposal.
After the trackers are initialized with ground truth bounding boxes, following the principle of a Kalman filter, predictions of the bounding boxes of the objects being tracked are calculated. These predictions are used as hints for the object detection network to produce new bounding boxes in the next frame. The network we use for object detection is Faster-RCNN [26], which is composed of a region proposal network (RPN) and the Fast-RCNN object detection network [15]. Usually the RPN serves as the region proposer; however, since we already have prior information about where the object might be, we can directly use this information to help the object detection network avoid uncertainty in region proposal. This part corresponds to the get_new_detection method in Algorithm 1.
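Since the Kalman state stores (x, y, s, r), the prediction must be converted to corner coordinates before it can serve as a region proposal. A minimal sketch of this conversion follows; the detector call in the comment is a hypothetical API, since the wiring into Faster-RCNN depends on the implementation.

```python
import numpy as np

def state_to_box(x, y, s, r):
    """Convert center (x, y), area s, and aspect ratio r = w/h into [x1, y1, x2, y2]."""
    w = np.sqrt(s * r)  # from s = w * h and r = w / h
    h = s / w
    return np.array([x - w / 2.0, y - h / 2.0, x + w / 2.0, y + h / 2.0])

# proposal = state_to_box(*trk.x[:4, 0])
# Hypothetical detector call: the predicted box replaces the RPN output, and the
# Fast-RCNN head refines it by bounding box regression (get_new_detection in Alg. 1).
# box, label, loss = detector.detect(frame, proposals=[proposal])
```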
Matching Tracker with Detections.
Given the predictions sent by the object tracker, the object detection network produces a set of candidate bounding boxes in the next frame, and the object tracker tries to match the existing trackers with the new detections using linear assignment. We also use intersection over union (IoU) to filter out detection-tracker pairs whose IoU values are not higher than a pre-defined threshold. After detection-tracker matching is finished, the states of the valid trackers are updated, and trackers that remain inactive (not updated) for a certain number of steps are removed from the tracker list. This completes one step of object tracking and labeling. The bounding boxes produced by the object detection network are used as labels for the unlabeled video frames. The detailed procedure for one step of tracking and labeling is given in Algorithm 1; matching trackers with detections is further described in Algorithm 2. The tracker is a class containing the state of the object currently being tracked and methods for updating the object's state given a ground truth state of the object. A detailed implementation of the tracker class can be found in [4].

Algorithm 1: Object Tracking and Labeling
Description: Update object bounding box labels for a single track-back-in-time step.
Input: 𝒯: the list of active trackers; l_ε: threshold on the detection loss (save a detection if its loss is larger than this threshold); K: Kalman filter; D: object detector; n: limit on the number of steps to retain a tracker.
Result: R, the collection of detections to be saved; a detection is a bounding box with state information (x, y, s, r, label), where label is the class of the object.

  Initialize: R = ∅, R' = ∅, P_m = ∅ (P_m is the set of matched detection-tracker pairs)
  for each T ∈ 𝒯 do                      /* tracking back in time */
      s = predict_state(T, K)
      d = get_new_detection(s, D)         /* use s (predicted RoI) as region proposal; see "Prediction as Region Proposal" */
      R' = R' ∪ {d}
  end
  for each d ∈ R', T ∈ 𝒯 do
      if d and T can match then P_m = P_m ∪ {(d, T)}
  end
  R'_um = {d : d ∈ R' and d ∉ P_m}        (unmatched detections)
  𝒯_um = {T : T ∈ 𝒯 and T ∉ P_m}          (unmatched trackers)
  for each (d, T) ∈ P_m do
      update tracker T with d using the Kalman filter;
      use the historical records of T to check the accuracy of the new detection d;
      R = R ∪ {d}
  end
  for each unmatched d ∈ R'_um do
      T = init_new_tracker(d); R = R ∪ {d}
  end
  for each unmatched T ∈ 𝒯_um do
      if T has not been updated for more than n steps then remove T from 𝒯
  end
  for each d ∈ R do
      if d has loss > l_ε then mark d to be saved
  end
  return R

Algorithm 2: Match Detections with Trackers
Description: Match detections with trackers.
Input: R': object detection bounding boxes, d = [x1, y1, x2, y2]; 𝒯: trackers.
Result: matched detection-tracker pairs P_m; unmatched detections R'_um; unmatched trackers 𝒯_um.

  if len(𝒯) == 0 then
      P_m = ∅; R'_um = R'; 𝒯_um = ∅; return P_m, R'_um, 𝒯_um
  end
  M = an all-zeros matrix of size [len(R'), len(𝒯)]
  for i from 1 to len(R') do
      for j from 1 to len(𝒯) do
          M[i, j] = IntersectionOverUnion(R'[i], 𝒯[j])
      end
  end
  M* = linear_assignment(M)
  for i from 1 to len(R') do
      if i ∉ M* then R'_um = R'_um ∪ {R'[i]}
  end
  for j from 1 to len(𝒯) do
      if j ∉ M* then 𝒯_um = 𝒯_um ∪ {𝒯[j]}
  end
  for (i, j) ∈ M* do
      P_m = P_m ∪ {(R'[i], 𝒯[j])}
  end
  return P_m, R'_um, 𝒯_um
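A minimal sketch of the matching step of Algorithm 2, assuming boxes in [x1, y1, x2, y2] format; the IoU cutoff of 0.3 is an illustrative choice rather than a value specified in the paper. Note that scipy's linear_sum_assignment minimizes total cost, so the IoU matrix is negated to maximize total overlap.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def match(detections, tracker_boxes, iou_threshold=0.3):
    """Return matched (det_idx, trk_idx) pairs, unmatched detections, unmatched trackers."""
    if len(detections) == 0:
        return [], [], list(range(len(tracker_boxes)))
    if len(tracker_boxes) == 0:
        return [], list(range(len(detections))), []
    overlap = np.array([[iou(d, t) for t in tracker_boxes] for d in detections])
    rows, cols = linear_sum_assignment(-overlap)  # maximize total IoU
    pairs = [(i, j) for i, j in zip(rows, cols) if overlap[i, j] >= iou_threshold]
    matched_d = {i for i, _ in pairs}
    matched_t = {j for _, j in pairs}
    unmatched_d = [i for i in range(len(detections)) if i not in matched_d]
    unmatched_t = [j for j in range(len(tracker_boxes)) if j not in matched_t]
    return pairs, unmatched_d, unmatched_t
```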
Near-to-Far Labeling. Another key ingredient of our approach is the near-to-far labeling scheme. Consider the case where we are tracking an object from the far field to the near field. When the object is far away from our current location, it can be very small or blurred in the image, which makes it very difficult to classify correctly. As the object approaches the vehicle, the detection network has higher confidence to classify it correctly. Since we trust object detection results in the near field, if the detection results for the same tracked object in the far field differ from those in the near field, we use the near-field detection results to correct them. To do this, we restrict object tracker initialization to ground truth bounding boxes only, so as to avoid the additional noise introduced by an imperfect object detection network. In case the classification of an object in the far field diverges, we use the detection result of the same tracker in the near field to correct it. Examples of near-to-far labeling are shown in figure 4.
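A sketch of the correction step: because tracking runs backward in time, the first detections recorded for a track come from the near field, so their class label can override divergent far-field predictions. The record layout below is an illustrative assumption about how per-track detections might be stored, not the paper's exact data structure.

```python
from collections import Counter

def near_to_far_correct(track_records, n_near=3):
    """track_records: per-track list of (frame_idx, box, predicted_class), ordered
    near field first (the order in which the backward-played video is processed).
    Far-field class predictions are replaced by the near-field consensus."""
    near = [cls for _, _, cls in track_records[:n_near]]
    consensus, _ = Counter(near).most_common(1)[0]
    return [(frame, box, consensus) for frame, box, _ in track_records]

# e.g. a track labeled [car, car, bus] far away becomes [car, car, car]
track = [(10, (50, 40, 90, 70), "car"), (9, (52, 41, 80, 62), "car"),
         (8, (54, 42, 70, 55), "bus")]
print(near_to_far_correct(track, n_near=2))
```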
Importance Sampling for Data Reduction. Inspired by the idea of importance sampling [2], we can select an optimal subset of the data by sampling according to an importance sampling probability distribution, so that the variance of the sampled data is minimized for an expected number of sampled images. Here, the sampling distribution is proportional to the object detection loss of each image. Images with higher loss receive more importance, as they provide more useful information for accurate object detection.

In our case, we are interested in estimating the expectation of f(x) under a distribution p(x), where f(x) is the detection loss of an image, p(x) denotes the image distribution, and x denotes a particular image with an object detection loss. The problem is expressed by the following equation,

$$\int p(x) f(x)\, dx = \mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n), \qquad (1)$$

where $x_n \sim p(x)$. However, we usually do not know the ground truth distribution of the data p(x), so we rely on a sampling proposal q(x) to estimate this expectation without bias, with the requirement that q(x) > 0 whenever p(x) > 0. This is commonly known as importance sampling:

$$\int p(x) f(x)\, dx = \mathbb{E}_{p(x)}[f(x)] = \mathbb{E}_{q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]. \qquad (2)$$

It was proved in [2] that the variance of this estimate is minimized when

$$q^*(x) \propto p(x)\, |f(x)|. \qquad (3)$$

Defining $\tilde{q}^*(x_i)$ as the unnormalized optimal probability weight of image $x_i$, it is clear that images with a larger detection loss should have a larger weight. Although we do not know p(x), we have access to a dataset $D = \{x_n\}_{n=1}^{N}$ sampled from p(x). Therefore, we can obtain $q^*(x)$ by associating the unnormalized probability weight $\tilde{q}^*(x_n) = |f(x_n)|$ with every $x_n \in D$; to sample from $q^*(x)$ we just need to normalize these weights:

$$q^*(x_n) = \frac{\tilde{q}^*(x_n)}{\sum_{i=1}^{N} \tilde{q}^*(x_i)} = \frac{|f(x_n)|}{\sum_{i=1}^{N} |f(x_i)|}, \qquad (4)$$

where $f(x_i)$ is the loss of input $x_i$. To reduce the total number of data instances used for estimating $\mathbb{E}_{p(x)}[f(x)]$, we draw M samples from the whole set of N data instances ($M \ll N$) according to a multinomial distribution with parameters $(q^*(x_1), \ldots, q^*(x_N))$. Based on the discussion above, we obtain an estimate of $\mathbb{E}_{p(x)}[f(x)]$ with the least variance among all ways of drawing M samples from the entire dataset of size N. We provide a proof in the appendix.

Once we have the sampling distribution $q^*(x_i)$, we perform the importance sampling. Images with a higher detection loss have a higher likelihood of being sampled. We further measure how efficiently we estimate the detection loss distribution, since the goal of the importance sampling approach here is to reduce the variance when estimating properties of the data from a subset of the data. To show that the loss expectation estimated from the sampled images has a variance close to that estimated from all images, we compute a relative variance value: the ratio of the whole dataset's detection loss variance over the sampled images' detection loss variance. Suppose the dataset is $D = \{x_n\}_{n=1}^{N}$, and we can get the detection loss $g(x_i)$ for an individual input $x_i$. To calculate the relative variance more easily, we first normalize g(x).
Then we define the sampling probability of image $x_i$, when we expect to sample M out of N images ($M < N$), as

$$q(x_i) = \min\left[1, \; \frac{M\, |g(x_i)|}{\sum_{i=1}^{N} |g(x_i)|}\right]. \qquad (5)$$

Taking the minimum with 1 ensures that the probability of sampling image $x_i$ cannot be larger than 1, which happens when $M |g(x_i)| / \sum_{i=1}^{N} |g(x_i)|$ is saturated. Note that when the sampling probability is 1, we always sample this image. With the scaled sampling weight $M |g(x_i)| / \sum_{i=1}^{N} |g(x_i)|$, we vary M to obtain different numbers of images out of the entire image set. Typically, we choose an M such that the sampled gradient norm variance is close to the whole-data gradient norm variance. Since the data live in a discrete space, the relative variance is defined as

$$R = \frac{\sum_{i=1}^{N} g(x_i)^2}{\sum_{i=1}^{N} g(x_i)^2 / q(x_i)}. \qquad (6)$$

Figure 2. Bounding boxes generated using the strategies of experiment 1 (left two) and experiment 2 (right two).

Table 1. Object detection average precision (%) on the KITTI dataset using different models. Ground Truth: model trained on ground truth labeled data (from KITTI). New Labeled (NL) & GT: model trained on both newly labeled data and ground truth data (experiment 1). Sampled NL & GT: model trained on importance-sampled newly labeled data and ground truth data (experiment 2). Only NL: model trained only on newly labeled data (experiment 3).

                       Pedestrian            Car                   Cyclist             mAP
                   Easy  Medium  Hard   Easy  Medium  Hard   Easy  Medium  Hard
Ground Truth (GT)  80.6  68.8    61.0   94.1  78.8    69.3   88.1  78.8    73.6    77.0
NL & GT            69.2  58.4    50.8   83.4  63.2    53.1   68.3  56.6    52.9    61.8
Sampled NL & GT
Only NL            69.8  60.8    52.1   80.4  60.9    50.3   70.4  57.8    54.0    61.8
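To make the sampling rule concrete, the sketch below implements equations (5) and (6): detection losses are normalized, the scaled weights are clipped at 1, images are kept by independent Bernoulli draws (so the expected sample count is about M), and R reports the sampling efficiency. The gamma-distributed losses are synthetic stand-ins for real detection losses.

```python
import numpy as np

def sample_by_loss(losses, M, seed=0):
    """Keep ~M of N images with probability prop. to |normalized loss|, clipped
    at 1 as in Eq. (5); also return the relative variance R of Eq. (6)."""
    rng = np.random.default_rng(seed)
    g = (losses - losses.mean()) / losses.std()   # normalize the losses
    w = np.abs(g)
    q = np.minimum(1.0, M * w / w.sum())          # Eq. (5)
    keep = rng.random(len(q)) < q                 # Bernoulli draws, E[#kept] <= M
    R = (g ** 2).sum() / ((g ** 2) / q).sum()     # Eq. (6): relative variance
    return np.flatnonzero(keep), R

losses = np.random.default_rng(1).gamma(2.0, 1.0, size=10000)  # synthetic losses
idx, R = sample_by_loss(losses, M=6000)
print(len(idx), round(R, 3))  # roughly 60% of images kept; R near 1 means high efficiency
```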
4. Experiments
Our framework has several major contributions. First, we propose to use the object tracker's predictions as region proposal inputs for the object detection network. Second, we propose near-to-far labeling to help correct labels that may be wrong. Third, we use importance sampling to select an optimal subset of images, removing images with less reliable labels and obtaining a smaller but more informative dataset. We designed several comparative experiments to show the impact of each contribution.
Datasets. To show that our algorithm is able to scale to a relatively large video dataset, we chose the KITTI benchmark dataset [13], which contains hundreds of autonomous driving video clips, each lasting about 10 to 30 seconds. The dataset is fairly rich, containing high-resolution color and grayscale video frames captured in many kinds of driving environments: city, residential, road, campus, person, etc. The KITTI dataset also contains a set of sparsely labeled image frames for object detection purposes. The number of images with ground truth bounding box labels used in our experiments is 7481, while the total number of images is around 40000. Categories of labeled objects include cars, pedestrians, vans, trams, cyclists, trucks, sitting persons, and so on. For simplicity, we chose 3 of these categories to detect: cars, pedestrians, and cyclists. We randomly divided the dataset into training, validation and test sets. The training set contains 4206 images, the validation set contains 1404 images, and the test set contains 1871 images.
Experiment 0. We first trained a basic object detection network on the ground truth labeled data using the Faster-RCNN [26] object detection network. As for training details, we used a pre-trained Faster-RCNN model with the VGG16 network [32] trained on the PASCAL VOC 2007 dataset [11], and then fine-tuned it on the KITTI dataset. The number of training iterations is 300k, with an initial learning rate of 0.01 decayed every 30k iterations.
Experiment 1. The first experiment is our labeling-by-tracking approach using semi-supervised learning. In this experiment, we use the ground truth labeled bounding boxes to initialize object trackers. Since images in the KITTI dataset are sparsely labeled, with unlabeled images between labeled images in the original video sequence, we use the labeled data as guidance to label the images without ground truth labels. Note that in this case the object detection network does not use the RPN to generate region proposals. Instead, it takes the object tracker's prediction of the bounding box in the next frame as the region proposal and then performs bounding box regression to generate the optimal bounding box for the object being tracked. In other words, only ground truth labeled images are used to initialize object trackers, based on our assumption that objects in the near field provide more accurate information: we only predict bounding boxes based on reliable information instead of relying on possibly noisy detections. We used the ground truth data from KITTI combined with the newly labeled data to train the object detector. The training setting is the same as in experiment 0.

Figure 3. Plot of the gradient Frobenius norm of the last layer in VGG16 versus the gradient Frobenius norm of fully-connected layer 7 (FC7), fully-connected layer 6 (FC6), convolutional layer 5 (Conv5) and convolutional layer 1 (Conv1).

Figure 4. Examples of near-to-far labeling. These images are from the KITTI benchmark dataset [13]. The labeling results are obtained from a pre-trained Faster-RCNN model. The bounding boxes show the detected objects being tracked. Near-field object detection results are used to check the accuracy of far-field detection results. First image, labeling results from left to right: motorbike, car, car (ground truth). As the vehicle approaches the object, it becomes clearer and is no longer hidden by the pole. Second image, labeling results from left to right: bus, car, train (ground truth). The object looks like a bus in the far field, but is classified as a train in the near field, considering it is on a railroad. Third image, labeling results from left to right: train, car, car (ground truth). The object looks like a train in the far field, but is classified as a car in the near field. Fourth image, labeling results from left to right: bus, car, car (ground truth). At first sight the car is blurred and hidden by other objects; it then becomes clearer that this is a car.

Figure 5. Relative variance evaluation results.
Experiment 2. In this experiment, we adopt the approach of experiment 1 and further combine it with importance sampling. As images labeled with the approach of experiment 1 may still contain redundant information, such as images that are already easy for the network to process, we use importance sampling to select an optimal set of more informative images. We chose to sample 60% of the data from experiment 1 (both ground truth and newly labeled data) using the importance sampling method of the previous section. As shown in figure 5, 60% of the data corresponds to a sampling efficiency of around 0.90, which is reasonably high. The training setting is again the same as in experiment 0.
Experiment 3. We further removed the ground truth data from KITTI and used only the newly labeled data to train an object detector, with the same training setting as in experiment 0.
Evaluation of Accuracy. We trained Faster-RCNN object detection networks using the data described in experiments 1, 2, and 3 respectively, all with the same training configuration as in experiment 0. We evaluate the performance of the models from experiments 0, 1, 2, and 3 by testing them on a held-out test dataset of 1871 images. The average precision is evaluated on the 3 categories of objects mentioned before.
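The average precision itself follows the standard PASCAL VOC protocol (area under the interpolated precision-recall curve); KITTI's official evaluation uses a closely related interpolation over fixed recall points, so the sketch below is illustrative rather than the exact benchmark code.

```python
import numpy as np

def voc_ap(recall, precision):
    """AP as the area under the monotonically decreasing precision envelope
    of the precision-recall curve (PASCAL VOC style)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])          # enforce a decreasing envelope
    changed = np.where(r[1:] != r[:-1])[0]  # recall levels where the curve steps
    return float(np.sum((r[changed + 1] - r[changed]) * p[changed + 1]))

# e.g. a detector whose precision decays as recall grows:
print(voc_ap(np.array([0.2, 0.5, 0.9]), np.array([1.0, 0.8, 0.4])))
```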
Loss as an Approximation for the Gradient
First, we show our finding that the gradients of the network we used have a linear correlation between different layers, as shown in figure 3. Therefore, we can use the last-layer gradient (as it is easier to obtain) as an approximation of the total gradient. We also show in figure 6 that the loss can be used as an approximation of the total gradient norm. Therefore, we can use the loss to approximate the gradient and use it as the sampling weight for the different object bounding box labels.
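This correlation can be checked directly by recording, for individual examples, the loss together with the gradient Frobenius norms of chosen layers. A minimal PyTorch sketch with a small stand-in classifier (the paper uses VGG16 inside Faster-RCNN; this toy model only illustrates the measurement):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()

losses, last_norms, total_norms = [], [], []
for _ in range(200):                      # one synthetic example per step
    x, y = torch.randn(1, 128), torch.randint(0, 10, (1,))
    model.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    last = model[-1].weight.grad.norm().item()  # last-layer Frobenius norm
    total = torch.sqrt(sum((p.grad ** 2).sum()
                           for p in model.parameters())).item()
    losses.append(loss.item())
    last_norms.append(last)
    total_norms.append(total)

# Pairwise correlations analogous to figures 3 and 6.
print(torch.corrcoef(torch.tensor([losses, last_norms, total_norms])))
```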
Qualitative Results for Bounding Box Generation. As mentioned in the experiment descriptions, we use two different strategies to generate bounding boxes using Faster-RCNN. The first strategy uses the region proposal network to generate bounding boxes, and the second uses the object tracker's predictions as region proposals. We show qualitative results of bounding boxes generated by the two methods in figure 2.

Figure 6. Plot of the gradient Frobenius norm of the entire network's weights (not including bias terms) versus the loss of individual data points.
Quantitative Results for Object Detection. The accuracy of the models trained in experiments 0, 1, 2, and 3 is evaluated on a test set of 1871 images. The average precision (AP) for the 3 object categories and the mAPs are reported in table 1. The results show the average precision for the different categories of objects at different degrees of difficulty. With the ground truth data, the model shows the best performance, which is not a surprise, since labels generated by tracking may introduce noise that harms the performance of the detector. However, after filtering the data by importance sampling, we obtain better detection accuracy with the same training setting, which means importance sampling helps to reduce the data volume and makes it easier to train the model to convergence.
Relative Variance Results
We use the relative variance defined in the Methods section to measure how well we estimate the image detection loss distribution. The result is shown in figure 5. From the plot, we can see that by scaling the importance sampling weight as in equation (5), we are able to keep a high sampling efficiency (0.90) with 60% of the original labeled data being sampled. This curve is useful for determining how much data to sample given a desired sampling efficiency.
5. Conclusion
We proposed a framework for automatically generating object bounding box labels for large-volume driving video datasets with sparse labels. Our work generates object bounding boxes on the newly labeled data by employing a near-to-far labeling strategy, a combination of the object tracker's predictions with the object detection network, and an importance sampling scheme. Our experiments show that with our semi-supervised learning framework, we are able to annotate a driving video dataset with bounding box labels and improve the accuracy of object detection with the newly labeled data using importance sampling.
References

[1] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464-3468, Sept 2016.
[2] G. Alain, A. Lamb, C. Sankar, A. Courville, and Y. Bengio. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.
[3] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9):1806-1819, 2011.
[4] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. CoRR, abs/1602.00763, 2016.
[5] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1515-1522. IEEE, 2009.
[6] J. Choi, M. Rastegari, A. Farhadi, and L. S. Davis. Adding unlabeled samples to categories by learned attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 875-882, 2013.
[7] D. Csiba and P. Richtárik. Importance sampling for minibatches. arXiv preprint arXiv:1602.02283, 2016.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248-255. IEEE, 2009.
[9] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899-2934, 2009.
[10] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In COLT, pages 14-26, 2010.
[11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[12] R. Farah, Q. Gan, J. P. Langlois, G.-A. Bilodeau, and Y. Savaria. A computationally efficient importance sampling tracking algorithm. Machine Vision and Applications, 25(7):1761-1777, 2014.
[13] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354-3361. IEEE, 2012.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[15] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[16] H. Grabner, C. Leistner, and H. Bischof. Semi-supervised on-line boosting for robust tracking. In European Conference on Computer Vision, pages 234-247. Springer, 2008.
[17] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409-1422, 2012.
[18] S. Lad and D. Parikh. Interactively guiding semi-supervised clustering via attribute-based explanations. In European Conference on Computer Vision, pages 333-349. Springer, 2014.
[19] D. Liu, G. Hua, and T. Chen. A hierarchical visual model for video object summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2178-2190, 2010.
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.
[21] I. Misra, A. Shrivastava, and M. Hebert. Watch and learn: Semi-supervised learning of object detectors from videos. CoRR, abs/1505.05769, 2015.
[22] A. Owen and Y. Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135-143, 2000.
[23] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1201-1208. IEEE, 2011.
[24] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3282-3289. IEEE, 2012.
[25] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
[26] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[27] E. Romera, L. M. Bergasa, and R. Arroyo. Can we unify monocular detectors for autonomous driving by using the pixel-wise semantic segmentation of CNNs? CoRR, abs/1607.00971, 2016.
[28] R. Y. Rubinstein and D. P. Kroese. Simulation and the Monte Carlo Method, volume 707. John Wiley & Sons, 2011.
[29] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv preprint arXiv:1211.2717, 2012.
[30] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761-769, 2016.
[31] A. Shrivastava, S. Singh, and A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In European Conference on Computer Vision, pages 369-383. Springer, 2012.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[33] K. Tang, R. Sukthankar, J. Yagnik, and L. Fei-Fei. Discriminative segment annotation in weakly labeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2483-2490, 2013.
[34] Y. Tang, J. Wang, B. Gao, E. Dellandréa, R. Gaizauskas, and L. Chen. Large scale semi-supervised object detection using visual and semantic knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2119-2128, 2016.
[35] A. Teichman and S. Thrun. Practical object recognition in autonomous driving and beyond. In Advanced Robotics and its Social Impacts (ARSO), 2011 IEEE Workshop on, pages 35-38, 2011.
[36] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems 26, pages 809-817, 2013.
[37] G. Welch and G. Bishop. An introduction to the Kalman filter. 1995.
[38] P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. 2014.

Appendices
A. Introduction
We provide a proof of the importance sampling framework and its optimality in this supplementary material. We also provide a detailed explanation of the relative variance measurement and its meaning.
B. Importance Sampling Framework Proof
The importance sampling algorithm is used for data reduction: it selects an optimal subset of the original labeled dataset with minimal variance. In the paper, we stated that by using the proposal distribution $q^*(x) \propto p(x)|f(x)|$ we can estimate the expectation of f(x) with the least variance. We now provide the proof.

In importance sampling, the expectation of f(x) is estimated using $\mathbb{E}_{p(x)}[f(x)] = \mathbb{E}_{q(x)}[f(x) p(x)/q(x)]$. We require that q(x) > 0 whenever $f(x) p(x) \neq 0$. It is then easy to verify that this estimate is unbiased. Suppose that $\mathbb{E}_{p(x)}[f(x)]$ is defined on $x \in A$ while $\mathbb{E}_{q(x)}[f(x)p(x)/q(x)]$ is defined on $x \in B$, where $A = \{x \mid p(x) > 0\}$ and $B = \{x \mid q(x) > 0\}$. Then for $x \in A \cap B^c$ we have $f(x) = 0$, and for $x \in A^c \cap B$ we have $p(x) = 0$. That is, for $x \in A \cap B^c$ and $x \in A^c \cap B$ we have $f(x)p(x) = 0$. So the expectation of f(x) can be written as

$$\mathbb{E}_{q(x)}\left[\frac{p(x) f(x)}{q(x)}\right] = \int_B \frac{f(x) p(x)}{q(x)}\, q(x)\, dx = \int_A f(x) p(x)\, dx + \int_{B \cap A^c} f(x) p(x)\, dx - \int_{A \cap B^c} f(x) p(x)\, dx = \int_A f(x) p(x)\, dx = \mathbb{E}_{p(x)}[f(x)]. \qquad (7)$$

Next we prove that the sampling distribution $q(x) \propto p(x)|f(x)|$ attains the minimal variance in the estimation of the expectation. Let $\mathbb{E}_{p(x)}[f(x)] = \mu$, and let

$$\hat{\mu}_q = \frac{1}{n} \sum_{i=1}^{n} \frac{f(x_i)\, p(x_i)}{q(x_i)}, \qquad (8)$$

given samples $x_i$ drawn from q(x). Then the variance of $\hat{\mu}_q$ is

$$\mathrm{Var}(\hat{\mu}_q) = \frac{1}{n} \mathrm{Var}\left(\frac{f(x) p(x)}{q(x)}\right) = \frac{1}{n}\left(\int \frac{(f(x) p(x))^2}{q(x)}\, dx - \mu^2\right). \qquad (9)$$

Choose $q^*(x) = |f(x)|\, p(x) / \mathbb{E}_p(|f(x)|)$, and let q(x) be any density function that is positive whenever $f(x) p(x) \neq 0$. We have

$$\mathrm{Var}(\hat{\mu}_{q^*}) = \frac{1}{n}\left(\int \frac{(f(x)p(x))^2}{q^*(x)}\, dx - \mu^2\right) = \frac{1}{n}\left(\int \frac{(f(x)p(x))^2}{|f(x)|p(x)/\mathbb{E}_p(|f(x)|)}\, dx - \mu^2\right) = \frac{1}{n}\left(\mathbb{E}_p(|f(x)|)^2 - \mu^2\right) = \frac{1}{n}\left(\mathbb{E}_q(|f(x)|p(x)/q(x))^2 - \mu^2\right) \le \frac{1}{n}\left(\mathbb{E}_q(f(x)^2 p(x)^2 / q(x)^2) - \mu^2\right) = \mathrm{Var}(\hat{\mu}_q). \qquad (10)$$

The last inequality is the Cauchy-Schwarz inequality. Therefore, by choosing the sampling distribution $q(x) \propto p(x)|f(x)|$ and sampling data according to the normalized weights $q_{\text{normalized}}(x_i) = q(x_i)/\sum_i q(x_i)$, we obtain the minimal-variance estimate. In the case where p(x) is not known directly, but we have a dataset sampled from p(x), we can use $q(x_i) = |f(x_i)| / \sum_i |f(x_i)|$ as the sampling weight.
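The optimality of $q^*(x) \propto p(x)|f(x)|$ can also be checked numerically: estimating $\mathbb{E}_p[f(x)]$ with samples drawn from $q^*$ yields a far smaller estimator variance than sampling from p itself. A small Monte Carlo sketch over a discrete support follows; the test function and distributions are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
support = rng.normal(size=10000)               # discrete support for x
p = np.full(len(support), 1.0 / len(support))  # p(x): uniform over the support
f = support ** 2                               # arbitrary nonnegative f(x)

q_star = np.abs(f) * p
q_star /= q_star.sum()                         # q*(x) proportional to p(x)|f(x)|

def estimate(q, n=2000):
    """Importance-sampling estimate of E_p[f(x)] using proposal q."""
    idx = rng.choice(len(support), size=n, p=q)
    return np.mean(f[idx] * p[idx] / q[idx])

plain = [estimate(p) for _ in range(300)]      # proposal = p itself
optimal = [estimate(q_star) for _ in range(300)]
print(np.var(plain), np.var(optimal))  # variance under q* collapses to ~0
```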
C. Measuring the Efficiency of Sampling

We define the efficiency as the ratio between the original data variance and the sampled data variance. To simplify, suppose we want to estimate the expectation of f(x); we first normalize f(x) and obtain $g(x) = (f(x) - \bar{f}(x))/\sigma[f(x)]$, where $\bar{f}(x)$ and $\sigma[f(x)]$ are the mean and standard deviation of f(x). Now we use importance sampling to estimate the expectation of g(x) under p(x) using the proposal distribution q(x). If we sample M images out of a total of N, the probability of a particular image $x_i$ being sampled is

$$s(x_i) = \min\left[1, \; \frac{M\, |g(x_i)|}{\sum_{i=1}^{N} |g(x_i)|}\right]. \qquad (11)$$

As mentioned in the paper, we take the minimum with 1 to ensure that the probability is never more than 1. Clearly $\sum_{i=1}^{N} s(x_i) > 1$, since $s(x_i)$ describes the probability of a particular image $x_i$ being selected. We further define $q(x_i) = s(x_i)/N$, which has an upper bound of $1/N$, so $\sum_{i=1}^{N} q(x_i) \le 1$. To get M images, we select images according to their sampling probabilities $s(x_i)$. The expectation of g(x) based on the sampled images is

$$\mathbb{E}_{q(x)}[g(x) p(x)/q(x)] = \frac{1}{N} \sum_{i=1}^{N} g(x_i)\, p(x_i)/q(x_i), \qquad (12)$$

where $x_i \sim q(x)$. On the other hand, if we sample the entire dataset and get all N images, then $s(x_i) = 1$ and $q(x_i) = 1/N$, and the expectation becomes

$$\mathbb{E}_{q(x)}[g(x) p(x)/q(x)] = \frac{1}{N} \sum_{i=1}^{N} g(x_i)\, p(x_i) \cdot N, \qquad (13)$$

which is just $\mathbb{E}_{p(x)}[g(x)]$. It does no harm to assume that p(x) is a uniform distribution, since we consider it unknown. In the case where we sample the entire dataset, $s(x_i) = 1$, $p(x_i) = 1/N$, and $\sum_{i=1}^{N} g(x_i) = 0$, so the variance of g(x) when sampling the entire dataset is

$$\mathrm{Var}_q\left[\frac{g(x)p(x)}{q(x)}\right] = \mathbb{E}_q\left[\left(\frac{g(x)p(x)}{q(x)}\right)^2\right] - \left(\mathbb{E}_q\left[\frac{g(x)p(x)}{q(x)}\right]\right)^2 = \sum_{i=1}^{N}\left[\frac{g(x_i)p(x_i)}{s(x_i)/N}\right]^2 \frac{s(x_i)}{N} - \left(\sum_{i=1}^{N}\left[\frac{g(x_i)p(x_i)}{s(x_i)/N}\right]\frac{s(x_i)}{N}\right)^2 = \sum_{i=1}^{N}\left[\frac{g(x_i)/N}{1/N}\right]^2 \frac{1}{N} - \left(\sum_{i=1}^{N} g(x_i)/N\right)^2 = \frac{1}{N}\sum_{i=1}^{N} g(x_i)^2. \qquad (14)$$

In the case where we sample M images out of N, $s(x_i) \le 1$, $q(x_i) = s(x_i)/N$, and $\sum_{i=1}^{N} g(x_i) = 0$, so the variance of g(x) when sampling M images out of N is

$$\mathrm{Var}_{q(x)}\left[\frac{g(x)p(x)}{q(x)}\right] = \sum_{i=1}^{N}\left[\frac{g(x_i)p(x_i)}{s(x_i)/N}\right]^2 \frac{s(x_i)}{N} - \left(\sum_{i=1}^{N}\left[\frac{g(x_i)p(x_i)}{s(x_i)/N}\right]\frac{s(x_i)}{N}\right)^2 = \sum_{i=1}^{N}\left[\frac{g(x_i)/N}{s(x_i)/N}\right]^2 \frac{s(x_i)}{N} - \left(\sum_{i=1}^{N} g(x_i)/N\right)^2 = \sum_{i=1}^{N} \frac{g(x_i)^2}{s(x_i) \cdot N}. \qquad (15)$$

The efficiency is defined as the ratio between (14) and (15):

$$R = \frac{\sum_{i=1}^{N} g(x_i)^2}{\sum_{i=1}^{N} g(x_i)^2 / s(x_i)}, \qquad (16)$$

which is the same as the efficiency (relative variance) defined in the main paper. Since $s(x_i) \le 1$, this ratio is never larger than 1. If we sample all the data, i.e., $s(x_i) = 1$ for all i, the sampling efficiency is 1.
To simplify the calculation of R, we can further express $\sum_{i=1}^{N} g(x_i)^2/s(x_i)$ as

$$\sum_{i=1}^{N} \frac{g(x_i)^2}{s(x_i)} = \sum_{j=1}^{k} \frac{\sum_{i=1}^{N} |g(x_i)|}{M\, |g(x_j)|}\, g(x_j)^2 + \sum_{j=k+1}^{N} g(x_j)^2 = \frac{\sum_{i=1}^{N} |g(x_i)|}{M} \left(\sum_{j=1}^{k} |g(x_j)|\right) + \sum_{j=k+1}^{N} g(x_j)^2, \qquad (17)$$

where $s(x_1), s(x_2), \cdots, s(x_k)$ are smaller than 1 and $s(x_{k+1}), \cdots, s(x_N)$ are equal to 1.