Lucid Data Dreaming for Video Object Segmentation
Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox, Bernt Schiele
Abstract
Convolutional networks reach top quality in pixel-level video object segmentation but require a large amount of training data to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using one to three orders of magnitude less annotated data than competing methods. Our approach is suitable for both single and multiple object segmentation. Instead of using large training sets and hoping to generalize across domains, we generate in-domain training data using the provided annotation of the first frame of each video to synthesize ("lucid dream"¹) plausible future video frames. In-domain per-video training data allows us to train high-quality appearance- and motion-based models, as well as to tune the post-processing stage. This approach makes it possible to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and how much general "objectness" knowledge are required for the video object segmentation task.

¹ In a lucid dream the sleeper is aware that he or she is dreaming and is sometimes able to control the course of the dream.

Anna Khoreva, Max Planck Institute for Informatics, Germany. E-mail: [email protected]
Rodrigo Benenson, Google. E-mail: [email protected]
Eddy Ilg, University of Freiburg, Germany. E-mail: [email protected]
Thomas Brox, University of Freiburg, Germany. E-mail: [email protected]
Bernt Schiele, Max Planck Institute for Informatics, Germany. E-mail: [email protected]

Figure 1: Starting from scarce annotations we synthesize in-domain data to train a specialized pixel-level video object segmenter for each dataset or even each video sequence.
1 Introduction

In recent years the field of localizing objects in videos has transitioned from bounding box tracking [33,35,34] to pixel-level segmentation [37,55,49,75]. Given a first frame labelled with the foreground object masks, one aims to find the corresponding object pixels in future frames. Segmenting objects at the pixel level enables a finer understanding of videos and is helpful for tasks such as video editing, rotoscoping, and summarisation.

Top performing results are currently obtained using convolutional networks (convnets) [29,6,31,3,22,44]. Like most deep learning techniques, convnets for video object segmentation benefit from large amounts of training data. Current state-of-the-art methods rely, for instance, on pixel-accurate foreground/background annotations of a few thousand video frames [29,6], over ten thousand images [31], or even more than a hundred thousand annotated samples for training [74]. Labelling images and videos at the pixel level is a laborious task (compared e.g. to drawing bounding boxes for detection), and creating a large training set requires significant annotation effort.

In this work we aim to reduce the necessity for such large volumes of training data. It is traditionally assumed that convnets require large training sets to perform best. We show that for video object segmentation having a larger training set is not automatically better, and that improved results can be obtained by using one to three orders of magnitude less training data than previous approaches [6,31,74]. The main insight of our work is that for video object segmentation using few training frames in the target domain is more useful than using large training volumes across domains.

To ensure a sufficient amount of training data close to the target domain, we develop a new technique for synthesizing training data particularly tailored to the pixel-level video object segmentation scenario. We call this data generation strategy "lucid dreaming", where the first frame and its annotation mask are used to generate plausible future frames of the video. The goal is to produce a large training set of reasonably realistic images which capture the expected appearance variations in future video frames, and which is thus, by design, close to the target domain.

Our approach is suitable for both single and multiple object segmentation in videos. Enabled by the proposed data generation strategy and the efficient use of optical flow, we are able to achieve high quality results while using only a handful of individually annotated training frames (one per video). Moreover, in the extreme case with only a single annotated frame and zero pre-training (i.e. without ImageNet pre-training), we still obtain competitive video object segmentation results.

In summary, our contributions are the following:
1. We propose "lucid data dreaming", an automated approach to synthesize training data for convnet-based pixel-level video object segmentation that leads to top results for both single and multiple object segmentation.
2. We conduct an extensive analysis to explore the factors contributing to our good results.
3. We show that training a convnet for video object segmentation can be done with only few annotated frames. We hope these results will temper the trend towards ever larger training sets, and popularize the design of video segmentation convnets with lighter training needs.

With the results for multiple object segmentation we took the second place in the 2017 DAVIS Challenge on Video Object Segmentation [54].
A summary of the proposed approach was presented online [30]; the lucid data dreaming synthesis implementation is publicly available. This paper significantly extends [30] with in-depth discussions of the method, more details of the formulation, its implementation, and its variants for single and multiple object segmentation in videos. It also offers a detailed ablation study and an error analysis, and explores the impact of a varying number of annotated training samples on the video segmentation quality.

2 Related work

Box tracking. Classic work on video object tracking focused on bounding box tracking. Many of the insights from these works have been re-used for video object segmentation. Traditional box tracking smoothly updates across time a linear model over hand-crafted features [23,5,35]. Since then, convnets have been used as improved features [14,40,76], and eventually to drive the tracking itself [22,3,68,43,44]. Contrary to traditional box trackers (e.g. [23]), convnet-based approaches need additional data for pre-training and learning the task.
Video object segmentation.
In this paper we focus on generating a foreground versus background pixel-wise object labelling for each video frame, starting from a first manually annotated frame. Multiple strategies have been proposed to solve this task.
Box-to-segment:
First a box-level track is built, and a space-time grabcut-like approach is used to generate per-frame segments [81].
Video saliency: this group of methods extracts the main foreground object as a pixel-level space-time tube. Both hand-crafted models [17,46] and trained convnets [70,28,62] have been considered. Because these methods ignore the first frame annotation, they fail in videos where multiple salient objects move (e.g. a flock of penguins).
Space-time proposals: these methods partition the space-time volume, and then the tube overlapping most with the first frame mask annotation is selected as tracking output [20,50,8].
Mask propagation: appearance similarity and motion smoothness across time are used to propagate the first frame annotation across the video [41,78,71]. These methods usually leverage optical flow and long term trajectories.
Convnets: following the trend in box tracking, convnets have recently been proposed for video object segmentation. [6] trains a generic object saliency network and fine-tunes it per-video (using the first frame annotation) to make the output sensitive to the specific object of interest. [31] uses a similar strategy, but also feeds the mask from the previous frame as guidance for the saliency network. [74] incorporates online adaptation of the network using the predictions from previous frames. [7] extends the Gaussian-CRF approach to videos by exploiting spatio-temporal connections for pairwise terms and relying on unary terms from [74]. Finally [29] mixes convnets with ideas of bilateral filtering. Our approach also builds upon convnets.

What makes convnets particularly suitable for the task is that they can learn the common statistics of appearance and motion patterns of objects, as well as what makes objects distinctive from the background, and exploit this knowledge when segmenting a particular object. This aspect gives convnets an edge over traditional techniques based on low-level hand-crafted features.

Our network architecture is similar to [6,31]. Other than implementation details, there are three differentiating factors. One, we use a different strategy for training: [6,29,7,74] rely on consecutive video training frames and [31] uses an external saliency dataset, while our approach focuses on using the first frame annotations provided with each targeted video benchmark, without relying on external annotations. Two, our approach exploits optical flow better than these previous methods. Three, we describe an extension to seamlessly handle segmentation of multiple objects.
Interactive video segmentation.
Interactive segmentation [42,27,63,77] considers more diverse user inputs (e.g. strokes), and requires interactive processing speed rather than maximal quality. Although our technique can be adapted to varied inputs, we focus on maximizing quality for the non-interactive case, with no additional hints along the video.
Semantic labelling.
Like other convnets in this space [29,6,31], our architecture builds upon the insights from semantic labelling networks [84,39,80,2]. Because of this, the flurry of recent developments should directly translate into better video object segmentation results. For the sake of comparison with previous work, we build upon the well established VGG DeepLab architecture [10].
Synthetic data.
Like our approach, previous works have also explored synthesizing training data. Synthetic renderings [45], video game environments [57], and mixed synthetic and real images [72,11,15] have shown promise, but require task-appropriate 3D models. Compositing real-world images provides more realistic results, and has shown promise for object detection [18,67], text localization [21], and pose estimation [51].

The closest work to ours is [47], which also generates video-specific training data using the first frame annotations. They use human skeleton annotations to improve pose estimation, while we employ pixel-level mask annotations to improve video object segmentation.
3 LucidTracker

Section 3.1 describes the network architecture used, and how RGB and optical flow information are fused to predict the next frame segmentation mask. Section 3.2 discusses different training modalities employed with the proposed video object segmentation system. In Section 4 we discuss the training data generation, and Sections 5 and 6 report results for single and multiple object segmentation in videos.

3.1 Architecture
Approach.
We model video object segmentation as a mask refinement task (mask: binary foreground/background labelling of the image) based on appearance and motion cues. From frame t−1 to frame t the estimated mask M_{t−1} is propagated to frame t, and the new mask M_t is computed as a function of the previous mask, the new image I_t, and the optical flow F_t, i.e. M_t = f(I_t, F_t, M_{t−1}). Since objects tend to move smoothly through space and time, there are only small changes from frame to frame and M_{t−1} can be seen as a rough estimate of M_t. We thus require our trained convnet to learn to refine rough masks into accurate masks. Fusing the complementary image I_t and motion flow F_t exploits the information inherent to video and enables the model to segment well both static and moving objects.

Note that this approach is incremental, does a single forward pass over the video, and keeps no explicit model of the object appearance at frame t. In some experiments we adapt the model f per video, using the annotated first frame I_0, M_0. However, in contrast to traditional techniques [23], this model is not updated while we process the video frames; the only state evolving along the video is the mask M_{t−1} itself.
First frame. In the video object segmentation task of interest the mask M_0 for the first frame is given. This is the standard protocol of the benchmarks considered in Sections 5 and 6. If only a bounding box is available on the first frame, the mask could be estimated using grabcut-like techniques [58,66].

RGB image I. Typically a semantic labeller generates pixel-wise labels based on the input image alone (e.g. M = g(I)). We use an augmented semantic labeller with an input layer modified to accept 4 channels (RGB + previous mask), so as to generate outputs based on the previous mask estimate, e.g. M_t = f_I(I_t, M_{t−1}). Our approach is general and can leverage any existing semantic labelling architecture. We select the DeepLabv2 architecture with VGG base network [10], which is comparable to [29,6,31]; FusionSeg [28] uses ResNet.

Optical flow F. We use flow in two complementary ways. First, to obtain a better initial estimate of M_t we warp M_{t−1} using the flow F_t: M_t = f_I(I_t, w(M_{t−1}, F_t)); we call this "mask warping". Second, we use flow as a direct source of information about the mask M_t. As can be seen in Figure 2, when the object is moving relative to the background, the flow magnitude ‖F_t‖ provides a very reasonable estimate of the mask M_t. We thus consider using a convnet specifically for mask estimation from flow, M_t = f_F(‖F_t‖, w(M_{t−1}, F_t)), and merge it with the image-only version by naive averaging:

M_t = 0.5 · f_I(I_t, ...) + 0.5 · f_F(‖F_t‖, ...).    (1)

Figure 2: Data flow examples. I_t, ‖F_t‖, M_{t−1} are the inputs, M_t is the resulting output. Green boundaries outline the ground truth segments. Red overlay indicates M_{t−1}, M_t.

We use the state-of-the-art optical flow estimation method FlowNet2.0 [24], which itself is a convnet that computes F_t = h(I_{t−1}, I_t) and is trained on synthetic renderings of flying objects [45]. For the optical flow magnitude computation we subtract the median motion for each frame, average the magnitude of the forward and backward flow, and scale the values per-frame to [0, 255], bringing them to the same range as the RGB channels.

The loss function is the sum of cross-entropy terms over each pixel in the output map (all pixels are equally weighted). In our experiments f_I and f_F are trained independently, via some of the modalities listed in Section 3.2. Our two-stream architecture is illustrated in Figure 3a.

We also explored expanding our network to accept 5 input channels (RGB + previous mask + flow magnitude) in one stream, M_t = f_{I+F}(I_t, ‖F_t‖, w(M_{t−1}, F_t)), but did not observe much difference in performance compared to naive averaging, see the experiments in Section 5.4.3. Our one-stream architecture is illustrated in Figure 3b. The one-stream network is more affordable to train and makes it easy to add extra input channels, e.g. to provide additional semantic information about the objects. A minimal sketch of the resulting per-frame inference loop is given below.
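To make the two-stream inference concrete, the following is a minimal sketch of the per-frame loop of equation (1). It is an illustration under stated assumptions, not the paper's actual implementation: `f_rgb` and `f_flow` stand for the trained f_I and f_F networks returning foreground probability maps, `flow_fn` for the FlowNet2.0 call, and the flow is assumed to map each pixel of frame t back to frame t−1 so that a simple backward remap performs the mask warping.

```python
import numpy as np
import cv2


def warp_mask(prev_mask, flow):
    """Warp the previous-frame mask into the current frame.

    flow[y, x] is assumed to point from pixel (x, y) in frame t to its
    correspondence in frame t-1 (backward flow), so cv2.remap performs the warp.
    """
    h, w = prev_mask.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(prev_mask.astype(np.float32), map_x, map_y,
                     interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT, borderValue=0)


def track_video(frames, first_mask, f_rgb, f_flow, flow_fn):
    """Single forward pass over the video, merging the two streams as in Eq. (1):
    M_t = 0.5 * f_I(I_t, warped mask) + 0.5 * f_F(||F_t||, warped mask)."""
    masks = [first_mask.astype(np.uint8)]
    for t in range(1, len(frames)):
        flow = flow_fn(frames[t], frames[t - 1])   # backward flow t -> t-1 (assumed)
        mag = np.linalg.norm(flow, axis=-1)
        mag = 255.0 * mag / (mag.max() + 1e-6)     # rough per-frame scaling to [0, 255]
                                                   # (median-motion subtraction omitted here)
        warped = warp_mask(masks[-1], flow)
        p_rgb = f_rgb(frames[t], warped)           # foreground probability map in [0, 1]
        p_flow = f_flow(mag, warped)
        fused = 0.5 * p_rgb + 0.5 * p_flow
        masks.append((fused > 0.5).astype(np.uint8))
    return masks
```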
Multiple objects. The proposed framework can easily be extended to segmenting multiple objects simultaneously. Instead of having one additional input channel for the previous frame mask, we provide the mask of each object instance in a separate channel, expanding the network to accept 3+N input channels (RGB + N object masks): M_t = f_I(I_t, w(M^1_{t−1}, F_t), ..., w(M^N_{t−1}, F_t)), where N is the number of objects annotated on the first frame.

Figure 3: Overview of the proposed one and two stream architectures. (a) Two-stream architecture, where the image I_t and the optical flow information ‖F_t‖ are used to update mask M_{t−1} into M_t, see equation 1. (b) One-stream architecture, where 5 input channels (image I_t, optical flow information ‖F_t‖, and mask M_{t−1}) are used to estimate mask M_t. See Section 3.1.

For multiple object segmentation we employ a one-stream architecture in the experiments, using optical flow F and semantic segmentation S as additional input channels: M_t = f_{I+F+S}(I_t, ‖F_t‖, S_t, w(M^1_{t−1}, F_t), ..., w(M^N_{t−1}, F_t)). This allows the appearance model to leverage semantic priors and motion information. See Figure 4 for an illustration.

The one-stream network is trained with a multi-class cross-entropy loss and is able to segment multiple objects simultaneously, sharing the feature computation across instances. This avoids a linear increase of the cost with the number of objects. In our preliminary results using a single architecture also provides better results than segmenting multiple objects separately, one at a time, and avoids the need to design a merging strategy amongst overlapping tracks.

Semantic labels S. To compute the pixel-level semantic labelling S_t = h(I_t) we use the state-of-the-art convnet PSPNet [84], trained on Pascal VOC12 [16]. Pascal VOC12 annotates 20 categories, yet we want to track any type of object. S_t can also provide information about unknown category instances by describing them as a spatial mixture of known ones (e.g. a sea lion might look like a dog torso and the head of a cat). As long as the predictions are consistent through time, S_t provides a useful cue for segmentation. Note that we only use S_t for the multi-object segmentation challenge, discussed in Section 6. In the same way as for the optical flow, we scale S_t to bring all the channels to the same range.

We additionally experiment with ensembles of different variants, which makes the system more robust to the challenges inherent in videos. For our main results on the multiple object segmentation task we consider an ensemble of four models, M_t = 1/4 · (f_{I+F+S} + f_{I+F} + f_{I+S} + f_I), where we merge the outputs of the models by naive averaging. See Section 6 for more details. A small sketch of how the multi-channel input is assembled follows.
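The sketch below shows one way to assemble the multi-channel input of the one-stream multi-object variant (RGB, flow magnitude, semantics, and one warped mask channel per object). The channel layout and the encoding of S_t as a single scaled map are assumptions made for illustration; the actual network may consume the semantic output differently.

```python
import numpy as np


def build_multiobject_input(image, flow_mag, semantics, warped_masks):
    """Stack the per-frame inputs of the one-stream multi-object variant:
    RGB (3) + flow magnitude (1) + semantic map (1) + one channel per object mask.

    `semantics` is assumed to be already scaled to the RGB value range, as the
    text above describes; the binary masks are scaled to the same range too.
    """
    channels = [image.astype(np.float32)]                      # H x W x 3
    channels.append(flow_mag[..., None].astype(np.float32))    # H x W x 1, in [0, 255]
    channels.append(semantics[..., None].astype(np.float32))   # H x W x 1 (assumed scaled)
    for mask in warped_masks:                                  # one channel per object
        channels.append(255.0 * mask[..., None].astype(np.float32))
    return np.concatenate(channels, axis=-1)                   # H x W x (5 + N)
```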
Temporal coherency. To improve the temporal coherency of the proposed video object segmentation framework we introduce an additional step into the system. Before providing as input the previous frame mask warped with the optical flow, w(M_{t−1}, F_t), we look at frame t−2 to remove inconsistencies between the predicted masks M_{t−1} and M_{t−2}. In particular, we split the mask M_{t−1} into connected components and remove all components of M_{t−1} which do not overlap with M_{t−2}. This way we remove possibly spurious blobs generated by our model in M_{t−1}. Afterwards we warp the "pruned" mask M̃_{t−1} with the optical flow and use w(M̃_{t−1}, F_t) as input to the network. This step is applied only during inference; it mitigates error propagation issues and helps generate more temporally coherent results. A minimal sketch of this pruning step is shown below.
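A minimal sketch of the connected-component pruning used for temporal coherency, assuming binary masks as NumPy arrays; the overlap test follows the description above.

```python
import numpy as np
from scipy import ndimage


def prune_mask(mask_prev, mask_prev2):
    """Remove connected components of M_{t-1} that do not overlap with M_{t-2},
    suppressing spurious blobs before the mask is warped to frame t."""
    labels, num = ndimage.label(mask_prev > 0)
    pruned = np.zeros_like(mask_prev)
    for cc in range(1, num + 1):
        component = labels == cc
        if np.logical_and(component, mask_prev2 > 0).any():  # seen at t-2: keep it
            pruned[component] = 1
    return pruned
```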
Post-processing. As a final stage of our pipeline, we refine the generated mask M_t per frame using DenseCRF [32]. This adjusts small image details that the network might not be able to handle. It is known by practitioners that DenseCRF is quite sensitive to its parameters and can easily worsen results. We therefore use our lucid dreams to handle per-dataset CRF tuning too, see Section 3.2.
We refer to our full f_{I+F} system as LucidTracker, and as LucidTracker− when no temporal coherency or post-processing steps are used. The usage of S_t or of a model ensemble will be explicitly stated.

3.2 Training modalities

Multiple modalities are available to train a tracker. Training-free approaches (e.g. BVS [41], SVT [78]) are fully hand-crafted systems with hand-tuned parameters, and thus do not require training data. They can be used as-is over different datasets. Supervised methods can also be trained to generate a dataset-agnostic model that can be applied over different datasets. Instead of using a fixed model for all cases, it is also possible to obtain specialized per-dataset models, either via self-supervision [79,48,82,85] or by using the first frame annotation of each video in the dataset as training/tuning set. Finally, inspired by traditional box tracking techniques, we also consider adapting the model weights to the specific video at hand, thus obtaining per-video models. Section 5 reports new results over these four training modalities (training-free, dataset-agnostic, per-dataset, and per-video).

Figure 4: Extension of LucidTracker to multiple objects. The previous frame mask of each object is provided in a separate channel. We additionally explore using optical flow F and semantic segmentation S as additional inputs. See Section 3.1.

Our LucidTracker obtains best results when first pre-trained on ImageNet, then trained per-dataset using all the first frame annotations together, and finally fine-tuned per-video for each evaluated sequence. The post-processing DenseCRF stage is automatically tuned per-dataset. The experimental Section 5 details the effect of these training stages. Surprisingly, we can obtain reasonable performance even when training from only a single annotated frame (without ImageNet pre-training, i.e. zero pre-training); this result goes against the intuition that convnets require large training data to provide good results.

Unless otherwise stated, we fine-tune per-video models relying solely on the first frame I_0 and its annotation M_0. This is in contrast to traditional techniques [23,5,35] which would update the appearance model at each frame I_t.

4 Lucid data dreaming

To train the function f one would like to use ground truth data for M_{t−1} and M_t (like [3,6,22]); however such data is expensive to annotate and rare. [6] thus trains on a set of videos (a few thousand frames) and requires the model to transfer across multiple test sets. [31] side-steps the need for consecutive frames by generating synthetic masks M_{t−1} from a saliency dataset of thousands of images with their corresponding mask M_t. We propose a new data generation strategy that reaches better results using only a handful of individually annotated training frames (one per video).

Ideally training data should be as similar as possible to the test data; even subtle differences may affect quality (e.g. training on static images for testing on videos under-performs [65]). To ensure our training data is in-domain, we propose to generate it by synthesizing samples from the provided annotated frame (the first frame) of each target video. This is akin to "lucid dreaming", as we intentionally "dream" the desired data by creating sample images that are plausible hypothetical future frames of the video. The outcome of this process is a large set of frame pairs in the target domain (thousands of pairs per annotation) with known optical flow and mask annotations, see Figure 5.
Synthesis process. The target domain for a tracker is the set of future frames of the given video. Traditional data augmentation via small image perturbations is insufficient to cover the expected variations across time, thus a task-specific strategy is needed. Across the video the tracked object might change in illumination, deform, translate, be occluded, be shown from different points of view, and evolve on top of a dynamic background. All of these aspects should be captured when synthesizing future frames. We achieve this by cutting out the foreground object, in-painting the background, perturbing both foreground and background, and finally recomposing the scene. This process is applied twice with randomly sampled transformation parameters, resulting in a pair of frames (I_{τ−1}, I_τ) with known pixel-level ground-truth mask annotations (M_{τ−1}, M_τ), optical flow F_τ, and occlusion regions. The object position in I_τ is uniformly sampled, but the changes between I_{τ−1} and I_τ are kept small to mimic the usual evolution between consecutive frames. In more detail, starting from an annotated image:
1. Illumination changes: we globally modify the image by randomly altering saturation S and value V (from the HSV colour space) via x′ = a · x^b + c, where a, b and c are sampled uniformly from small ranges around the identity transform (a = 1, b = 1, c = 0).
2. Fg/Bg split: the foreground object is removed from the image I and a background image is created by inpainting the cut-out area [13].
3. Object motion: we simulate motion and shape deformations by applying global translation as well as affine and non-rigid deformations to the foreground object. For I_{τ−1} the object is placed at any location within the image with a uniform distribution, and in I_τ with a small translation (a fraction of the object size) relative to τ−1. In both frames we apply small random rotations, scalings, and thin-plate spline deformations [4] proportional to the object size.
4. Camera motion:
We additionally transform the background using affine deformations to simulate camera view changes. We apply random translation, rotation, and scaling within the same ranges as for the foreground object.
5. Fg/Bg merge: finally (I_{τ−1}, I_τ) are composed by blending the perturbed foreground with the perturbed background using Poisson matting [64]. Using the known transformation parameters we also synthesize the ground-truth pixel-level mask annotations (M_{τ−1}, M_τ) and optical flow F_τ.

Figure 5 shows example results. Although our approach does not capture appearance changes due to point of view, occlusions, or shadows, we see that this rough modelling is already effective for training our segmentation models.

The number of synthesized images can be arbitrarily large. We generate a few thousand pairs per annotated video frame. This training data is, by design, in-domain with regard to the target video. The experimental Section 5 shows that this strategy is more effective than using thousands of manually annotated images from close-by domains.

The same strategy for data synthesis can be employed for the multiple object segmentation task. Instead of manipulating a single object we handle multiple ones at the same time, applying independent transformations to each of them. We model occlusion between objects by adding a random depth ordering, obtaining both partial and full occlusions in the training set. Including occlusions in the lucid dreams allows the model to better handle plausible interactions between objects in future frames. See Figure 6 for examples of the generated data. A simplified sketch of the single-object synthesis steps is given below.
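The sketch below illustrates, in simplified form, one synthesis pass: illumination jitter in HSV, background inpainting of the cut-out object, and a random global motion of the foreground before recomposition. The jitter and motion ranges are illustrative assumptions, and the thin-plate-spline deformation and Poisson blending of the full method are replaced here by an affine warp and a simple paste.

```python
import numpy as np
import cv2


def lucid_dream(image, mask, rng=np.random):
    """Generate one synthetic frame and mask from a single annotated frame.

    Simplified sketch: the parameter ranges below are illustrative, and the
    non-rigid deformation / Poisson blending steps of the full method are omitted.
    `image` is an 8-bit BGR image, `mask` a binary foreground mask.
    """
    h, w = mask.shape
    # 1. illumination change on the saturation and value channels (x' = a * x^b + c)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32) / 255.0
    for c in (1, 2):
        a, b, add = rng.uniform(0.95, 1.05), rng.uniform(0.9, 1.1), rng.uniform(-0.05, 0.05)
        hsv[..., c] = np.clip(a * hsv[..., c] ** b + add, 0.0, 1.0)
    img = cv2.cvtColor((hsv * 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
    # 2. fg/bg split: inpaint the cut-out foreground region to obtain a clean background
    fg_mask = (mask > 0).astype(np.uint8)
    background = cv2.inpaint(img, fg_mask, 5, cv2.INPAINT_TELEA)
    # 3./4. object and camera motion: here a single random similarity transform
    angle, scale = rng.uniform(-10, 10), rng.uniform(0.9, 1.1)
    tx, ty = rng.uniform(0, w), rng.uniform(0, h)          # uniform object placement
    T = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)
    T[:, 2] += (tx - w / 2.0, ty - h / 2.0)
    fg = cv2.warpAffine(img, T, (w, h))
    new_mask = cv2.warpAffine(fg_mask, T, (w, h), flags=cv2.INTER_NEAREST)
    # 5. fg/bg merge: simple paste instead of Poisson matting
    out = background.copy()
    out[new_mask > 0] = fg[new_mask > 0]
    return out, new_mask
```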
5 Single object segmentation results

We present here a detailed empirical evaluation on three different datasets for the single object segmentation task: given a first frame labelled with the foreground object mask, the goal is to find the corresponding object pixels in future frames. (Section 6 will discuss the multiple objects case.)

5.1 Experimental setup

Datasets.
We evaluate our method on three video object segmentation datasets: DAVIS [49], YouTubeObjects [55,26], and SegTrack v2 [37]. The goal is to track an object through all video frames given an object mask in the first frame. These three datasets provide diverse challenges, with a mix of high and low resolution web videos, single or multiple salient objects per video, videos with flocks of similar looking instances, longer and shorter sequences, as well as the usual video segmentation challenges such as occlusion, fast motion, illumination changes, view point changes, elastic deformation, etc.

The DAVIS [49] video segmentation benchmark consists of 50 full-HD videos of diverse object categories with all frames annotated with pixel-level accuracy, where one single or two connected moving objects are separated from the background. The number of frames in each video varies from 25 to 104.

YouTubeObjects [55,26] includes web videos from 10 object categories. We use the subset of 126 video sequences with mask annotations provided by [26] for evaluation, where one single object or a group of objects of the same category are separated from the background. In contrast to DAVIS, these videos contain a mix of static and moving objects. The number of frames in each video ranges from 2 to 401.

SegTrack v2 [37] consists of 14 videos with multiple object annotations for each frame. For videos with multiple objects each object is treated as a separate problem, resulting in 24 sequences. The length of each video varies from 21 to 279 frames. The images in this dataset have low resolution and some compression artefacts, making it hard to track the object based on its appearance.

The main experimental work is done on DAVIS, since it is the largest densely annotated dataset of the three and provides high quality, high resolution data. The videos for this dataset were chosen to represent diverse challenges, making it a good experimental playground. We additionally report on the two other datasets as complementary test-set results.
Evaluation metric. To measure the accuracy of video object segmentation we use the mean intersection-over-union overlap (mIoU) between the per-frame ground truth object mask and the predicted segmentation, averaged across all video sequences. We have noticed disparate evaluation procedures used in previous work, and we report here a unified evaluation across datasets. When possible, we re-evaluated certain methods using results provided by their authors. For all three datasets we follow the DAVIS evaluation protocol, excluding the first frame from evaluation and using all other frames of the video sequences, independent of object presence in the frame. The sketch below illustrates this protocol.
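As a reference for how scores are computed in this section, here is a small sketch of the per-video mIoU under the protocol just described (first frame excluded, all other frames counted). Treating a frame where both the ground truth and the prediction are empty as IoU 1 is an assumption of this sketch, not a statement of the benchmark's exact convention.

```python
import numpy as np


def video_miou(gt_masks, pred_masks):
    """Mean IoU of one video: the given first frame is excluded, and every
    other frame counts whether or not the object is visible in it."""
    ious = []
    for gt, pred in zip(gt_masks[1:], pred_masks[1:]):
        inter = np.logical_and(gt > 0, pred > 0).sum()
        union = np.logical_or(gt > 0, pred > 0).sum()
        ious.append(1.0 if union == 0 else inter / union)  # assumed convention for empty frames
    return float(np.mean(ious))

# The dataset score is the average of video_miou over all sequences.
```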
Training details. For training all the models we use SGD with mini-batches of images, a fixed learning rate policy, and a small initial learning rate; momentum and weight decay are set to their standard values.

Models using pre-training are initialized with weights trained for image classification on ImageNet [61]. We then train per-dataset for 40k iterations for the RGB+Mask branch f_I and for 20k iterations for the Flow+Mask branch f_F. When using the single-stream architecture (Section 5.4.3), we use 40k iterations.

Models without ImageNet pre-training are initialized using the Xavier (also known as Glorot) random weight initialization strategy [19] (the weights are drawn from a truncated normal distribution with zero mean and a standard deviation computed from the number of input and output units of the weight tensor, see [19] for details). The per-dataset training then needs to be longer: 100k iterations for the f_I branch and 40k iterations for the f_F branch.

For per-video fine-tuning 2k iterations are used for f_I. To keep the computational cost lower, the f_F branch is kept fixed across videos.

Figure 5: Lucid data dreaming examples. From one annotated frame we generate pairs of images (I_{τ−1}, I_τ) that are plausible future video frames, with known optical flow (F_τ) and masks (green boundaries). Note the inpainted background and the foreground/background deformations.

Figure 6: Lucid data dreaming examples with multiple objects. From one annotated frame we generate a plausible future video frame (I_τ), with known optical flow (F_τ) and mask (M_τ).

All training parameters are chosen based on DAVIS results. We use identical parameters on YouTubeObjects and SegTrack v2, showing the generalization of our approach. It takes ~3.5h to obtain each per-video model, including data generation, per-dataset training, per-video fine-tuning and per-dataset grid search of CRF parameters (averaged over DAVIS, amortising the per-dataset training time over all videos). At test time our LucidTracker runs at ~5s per frame, including the optical flow estimation with FlowNet2.0 [24] (~0.5s) and CRF post-processing [32] (~2s).

5.2 Key results

Table 1 presents our main result and compares it to previous work. Our full system, LucidTracker, provides the best video segmentation quality across the three datasets while being trained on each dataset using only one frame per video (50 frames for DAVIS, 126 for YouTubeObjects, 24 for SegTrack v2), which is one to three orders of magnitude less than the top competing methods. Ours is the first method to reach such high mIoU on all three datasets.
Oracles and baselines. The grabcut oracle computes grabcut [58] using the ground truth bounding boxes (box oracle). This oracle indicates that on the considered datasets separating foreground from background is not easy, even if a perfect box-level tracker was available. We provide three additional baselines.
Table 1: Comparison of video object segmentation results (mIoU) across the three datasets (DAVIS, YouTubeObjects, SegTrack v2). For each method the number of annotated training frames is listed (the box/grabcut oracles and the saliency, flow-saliency, mask-warping and ObjFlow [71] baselines use none; MP-Net [69] uses ~22.5k, PDB [62] ~18k, OSVOS [6] ~2.3k, OnAVOS [74] and VideoGCRF [7] ~120k), together with whether it uses the first frame annotation and optical flow. Numbers in italic are reported on subsets of DAVIS. Our LucidTracker consistently improves over previous results, see Section 5.2.

"Saliency" corresponds to using the generic (training-free) saliency method EQCut [1] over the RGB image I_t. "Flow saliency" does the same, but over the optical flow magnitude ‖F_t‖. The results indicate that the objects being tracked are not particularly salient in the image. On DAVIS motion saliency is a strong signal, but not on the other two datasets. Saliency methods ignore the first frame annotation provided for the task. We also consider the "Mask warping" baseline which uses optical flow to propagate the mask estimate from frame to frame via simple warping, M_t = w(M_{t−1}, F_t). The poor results of this baseline indicate that the high quality flow [24] that we use is by itself insufficient to solve the video object segmentation task, and that our proposed convnet indeed does the heavy lifting.

The large fluctuation of the relative baseline results across the three datasets empirically confirms that each of them presents unique challenges.
Comparison. Compared to flow propagation methods such as BVS, N15, ObjFlow, and STV, we obtain better results because we build per-video a stronger appearance model of the tracked object (embodied in the fine-tuned model). Compared to convnet learning methods such as VPN, OSVOS, MaskTrack, and OnAVOS, we require significantly less training data, yet obtain better results.
Figure 7: LucidTracker single object segmentation qualitative results. Frames are sampled along the video duration (the percentages indicate the position within the video). Our model is robust to various challenges, such as view changes, fast motion, shape deformations, and out-of-view scenarios.
Figure 7 provides qualitative results of LucidTracker across the three datasets. Our system is robust to various challenges present in videos. It handles well camera view changes, fast motion, object shape deformation, out-of-view scenarios, multiple similar looking objects, and even low quality video. We provide a detailed error analysis in Section 5.5.
Conclusion.
We show that top results can be obtained while using less training data, which indicates that our lucid dreams leverage the available training data better. We report top results for this task while using only one annotated training frame per video.

5.3 Ablation studies

In this section we explore in more detail how the different ingredients contribute to our results.
Table 2 compares the effect of the different ingredients in the LucidTracker− training. Results are obtained using RGB and flow, with warping, no CRF, and no temporal coherency: M_t = f(I_t, w(M_{t−1}, F_t)).
In the bottom row ("onlyper-video tuning"), the model is trained per-video withoutImageNet pre-training nor per-dataset training, i.e. using a single annotated training frame . Our network is based onVGG16 [10] and contains ∼ M parameters, all effect-ively learnt from a single annotated image that is augmentedto become . k training samples (see Section 4). Even withsuch minimal amount of training data, we still obtain a sur-prisingly good performance (compare . on DAVIS toothers in Table 1). This shows how effective is, by itself, theproposed training strategy based on lucid dreaming of thedata. Pre-training & fine-tuning.
Pre-training & fine-tuning. We see that ImageNet pre-training does provide a few percent points of improvement (depending on the dataset of interest). Per-video fine-tuning (after per-dataset training) provides an additional gain of a couple of percent points. Both ingredients clearly contribute to the segmentation results.

Note that training a model using only per-video tuning takes about one full GPU day per video sequence, making these results insightful but not decidedly practical.

Preliminary experiments on DAVIS evaluating the impact of the different ingredients of our lucid dreaming data generation showed, depending on the exact setup, fluctuations of a few percent mIoU points between a basic version (e.g. without non-rigid deformations nor scene re-composition) and the full synthesis process described in Section 4. Having a sophisticated data generation process directly impacts the segmentation quality.
Table 2: Ablation study of training modalities (rows: LucidTracker− with and without ImageNet pre-training, per-dataset training, and per-video fine-tuning; columns: mIoU on DAVIS, YouTubeObjects, SegTrack v2). ImageNet pre-training and per-video tuning provide additional improvement over per-dataset training. Even with one frame annotation used only for per-video tuning we obtain good performance. See Section 5.3.1.
Conclusion.
Surprisingly, we discovered that per-video training from a single annotated frame already provides much of the information needed for the video object segmentation task. Additionally using ImageNet pre-training and per-dataset training provides complementary gains.
Table 3 shows the effect of optical flow on the LucidTracker results. Comparing our full system to the "No OF" row, we see that the effect of optical flow varies across datasets, from a minor improvement on YouTubeObjects to a major difference on SegTrack v2. On this last dataset, using mask warping is particularly useful too. We additionally explored tuning the optical flow stream per-video, which resulted in a minor improvement on DAVIS.

Our "No OF" results can be compared to OSVOS [6], which does not use optical flow. However OSVOS uses a per-frame mask post-processing based on a boundary detector (trained on further external data), which provides a gain of a few percent points. Accounting for this, our "No OF" (and no CRF, no temporal coherency) result matches theirs on DAVIS and YouTubeObjects despite using significantly less training data (see Table 1).

Table 4 shows the effect of using different optical flow estimation methods. For the LucidTracker results, FlowNet2.0 [24] was employed. We also explored using EpicFlow [56], as in [31]. Table 4 indicates that an optical flow estimation that is robust across datasets is crucial to the performance (FlowNet2.0 provides a consistent gain on each dataset). We found EpicFlow to be brittle when going across different datasets, providing an improvement on DAVIS and SegTrack v2 but under-performing on YouTubeObjects.
Table 3: Ablation study of flow ingredients (rows: LucidTracker and LucidTracker− with and without the image stream I, the flow stream F, and mask warping w; columns: mIoU on DAVIS, YouTubeObjects, SegTrack v2). Flow complements the image-only results, with large fluctuations across datasets. See Section 5.3.2.
Table 4: Effect of the optical flow estimation method (mIoU).

Variant                    Optical flow   DAVIS   YoutbObjs   SegTrckv2
LucidTracker−              FlowNet2.0      83.7      76.2        76.8
                           EpicFlow        80.2      71.3        67.0
                           No flow         78.0      74.7        61.8
No ImageNet pre-training   FlowNet2.0      82.0      74.3        71.2
                           EpicFlow        80.0      72.3        68.8
                           No flow         76.7      71.4        63.0
Conclusion.
The results show that flow provides a complementary signal to the RGB image alone, and that having an optical flow estimation that is robust across datasets is crucial. Despite its simplicity our fusion strategy (f_I + f_F) provides gains on all datasets and leads to competitive results.

As a final stage of our pipeline, we refine the generated mask per frame using DenseCRF [32]. This captures small image details that the network might have missed. It is known by practitioners that DenseCRF is quite sensitive to its parameters and can easily worsen results. We use our lucid dreams to enable automatic per-dataset CRF tuning.

Following [10] we employ a grid search scheme for tuning the CRF parameters. Once the per-dataset model is trained, we apply it over a subset of its training set (5 random images from the lucid dreams per video sequence), apply DenseCRF with the given parameters over this output, and then compare to the lucid dream ground truth.

The impact of the tuned DenseCRF post-processing parameters is shown in Table 5 and Figure 8. Table 5 indicates that without per-dataset tuning DenseCRF under-performs. Our automated tuning procedure allows us to obtain consistent gains without the need for case-by-case manual tuning. A sketch of this tuning procedure is given below.
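The following sketch outlines the automatic per-dataset CRF tuning described above. `model`, `lucid_samples`, and `apply_densecrf` are hypothetical placeholders (the trained per-dataset network, a sampler of synthesized image/mask pairs, and a DenseCRF wrapper, e.g. around the pydensecrf package); the parameter grid is illustrative and not the one used in the paper.

```python
import itertools
import numpy as np


def frame_iou(gt, pred):
    inter = np.logical_and(gt > 0, pred > 0).sum()
    union = np.logical_or(gt > 0, pred > 0).sum()
    return 1.0 if union == 0 else inter / union


def tune_crf_per_dataset(videos, model, lucid_samples, apply_densecrf):
    """Grid search of DenseCRF parameters on lucid-dream samples.

    The lucid dreams act as a proxy validation set: the trained per-dataset
    model is run on synthesized frames, refined with each candidate parameter
    set, and compared against the synthesized ground truth.
    """
    grid = {"pos_w": [3, 5], "pos_xy": [1, 3],              # Gaussian pairwise term
            "bi_w": [5, 10], "bi_xy": [50, 80], "bi_rgb": [3, 5]}  # bilateral term
    best_params, best_score = None, -1.0
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid, values))
        scores = [frame_iou(gt, apply_densecrf(img, model(img), params))
                  for video in videos
                  for img, gt in lucid_samples(video, n=5)]  # 5 lucid pairs per sequence
        if np.mean(scores) > best_score:
            best_score, best_params = float(np.mean(scores)), params
    return best_params
```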
Conclusion.
Using default DenseCRF parameters would degrade performance. Our lucid dreams enable automatic per-dataset CRF tuning, which further improves the results.
Table 5: Effect of CRF tuning (LucidTracker without temporal coherency, mIoU). Without the automated per-dataset tuning DenseCRF under-performs.

Method          CRF parameters       DAVIS   YoutbObjs   SegTrckv2
LucidTracker−   -                     83.7      76.2        76.8
LucidTracker    default               84.2      75.5        72.2
LucidTracker    tuned per-dataset

5.4 Additional experiments

Other than adding or removing ingredients, as in Section 5.3, we also want to understand how the training data itself affects the obtained results.
Table 6 explores segmentation quality as a function of the number of training samples. To see the training data effects more directly we use a base model with the RGB image I_t only (no flow F, no CRF, no temporal coherency) and per-dataset training (no ImageNet pre-training, no per-video fine-tuning). We evaluate on two disjoint subsets of DAVIS videos, where the first frames for per-dataset training are taken from only one subset. The reported numbers are thus comparable within Table 6, but not to the other tables in the paper. Table 6 reports results with a varying number of training videos and with/without including the first frames of each test video for per-dataset training. When excluding the test-set first frames, the image frames used for training are separate from the test videos, and we are thus operating across (related) domains. When including the test-set first frames, we operate in the usual LucidTracker mode, where the first frame of each test video is used to build the per-dataset training set.

Comparing the top and bottom parts of the table, we see that when the annotated images from the test-set video sequences are not included, segmentation quality drops drastically. Conversely, on the subset of videos for which the first frame annotation is used for training, the quality is much higher and improves as the training samples become more and more specific (in-domain) to the target video. Adding extra videos for training does not improve the performance: it is better to have models each trained and evaluated on a single video (row top-1-1) than to have one model trained over all test videos (row top-15-1).

Training with an additional frame from each video (we added the last frame of each train video) significantly boosts the resulting within-video quality (e.g. row top-30-2), because the training samples better cover the test domain.
Figure 8: Effect of CRF tuning. The shown DAVIS videos have the highest margin between with and without CRF post-processing (based on mIoU over the video).

Table 6: Varying the number of training videos. A smaller training set closer to the target domain is better than a larger one. See Section 5.4.1.
Conclusion.
These results show that, when using RGB information (I_t), increasing the number of training videos does not improve the resulting quality of our system. Even within a dataset, properly using the training sample(s) from within each video matters more than collecting more videos to build a larger training set.

Section 5.4.1 explored the effect of changing the volume of training data within one dataset; Table 7 compares results when using different datasets for training. Results are obtained using a base model with RGB and flow (M_t = f(I_t, M_{t−1}), no warping, no CRF, no temporal coherency), ImageNet pre-training, per-dataset training, and no per-video tuning, to accentuate the effect of the training dataset.

Table 7: Generalization across datasets (mIoU on DAVIS / YouTubeObjects / SegTrack v2, plus the mean). Underlined results are the best per dataset and italic ones the second best (ignoring the all-in-one setup); the recovered rows are "Second best": 67.0 / 52.2 / 52.0 (mean 57.1) and "All-in-one": 71.9 / 70.7 / 60.8 (mean 67.8). We observe a significant quality gap between training from the target videos and training from other datasets; see Section 5.4.2.

The best performance is obtained when training on the first frames of the target set. There is a noticeable drop of several percent points when moving to the second-best choice. Interestingly, when putting all the datasets together for training (the "all-in-one" row, a dataset-agnostic model) the results degrade, reinforcing the idea that "just adding more data" does not automatically improve performance.
Best results are obtained when using training data that focuses on the test video sequences; using similar datasets or combining multiple datasets degrades the performance of our system.

Table 8: Comparing the two-stream and one-stream convnet architectures (mIoU). See Section 5.4.3.

Section 3.1 and Figure 3 described two possible architectures to handle I_t and F_t. The previous experiments are all based on the two-stream architecture. Table 8 compares two streams versus one stream, where the network accepts 5 input channels (RGB + previous mask + flow magnitude) in one stream: M_t = f_{I+F}(I_t, ‖F_t‖, w(M_{t−1}, F_t)). Results are obtained using a base model with RGB and optical flow (no warping, no CRF, no temporal coherency), ImageNet pre-training, per-dataset training, and no per-video tuning.

We observe that the one-stream and the two-stream architecture with naive averaging perform on par. Using a one-stream network makes the training more affordable and allows the architecture to be expanded more easily with additional input channels.
Conclusion. The lighter one-stream network performs as well as the two-stream network. We will thus use the one-stream architecture in Section 6.

5.5 Error analysis

Table 9 presents an expanded evaluation on DAVIS using the evaluation metrics proposed in [49]. Three measures are used: region similarity in terms of intersection over union (J), contour accuracy (F, higher is better), and temporal instability of the masks (T, lower is better). We outperform the competitive methods of [31,6] on all three measures.

Table 10 reports the per-attribute evaluation as defined in DAVIS. LucidTracker is best on all 15 video attribute categories. This shows that our LucidTracker can handle the various video challenges present in DAVIS.

We present the per-sequence and per-frame results of LucidTracker over DAVIS in Figure 10. On the whole we observe that the proposed approach is quite robust: most video sequences reach a high average performance. However, by looking at the per-frame results for each video (blue dots in Figure 10) one can see several frames where our approach fails to correctly track the object. Investigating those cases closely, we notice conditions under which LucidTracker is more likely to fail; the same behaviour was observed across all three datasets. A few representative failure cases are visualized in Figure 9.

Since we are using only the mask annotation of the first frame for training the tracker, a clear failure case is caused by dramatic view point changes of the object with respect to its first frame appearance, as in row 5 of Figure 9. Performing online adaptation at regular time steps, exploiting the previous frame segments for data synthesis and marking unsure regions as ignore regions for training, similarly to [74], might resolve the problems caused by relying only on the first frame mask. The proposed approach also under-performs when recovering from occlusions: it may take several frames for the full object mask to re-appear (rows 1-3 in Figure 9). This is mainly due to the convnet having learnt to follow the previous frame mask. Augmenting the lucid dreams with plausible occlusions might help mitigate this case. Another failure case occurs when two similar looking objects cross each other, as in row 6 of Figure 9. Here both cues, the previous frame guidance and the appearance learnt via per-video tuning, are no longer discriminative enough to correctly continue propagating the mask.

We also observe that LucidTracker struggles to track fine structures or details of the object, e.g. the wheels of the bicycle or motorcycle in rows 1-2 of Figure 9. This is an issue of the underlying convnet architecture: due to the several pooling layers the spatial resolution is lost and hence the fine details of the object are missed. This issue can be mitigated by switching to more recent semantic labelling architectures (e.g. [52,9]).
Conclusion.
LucidTracker shows robust performance across different videos. However, a few failure cases were observed, due to the underlying convnet architecture, its training, or the limited visibility of the object in the first frame.
6 Multiple object segmentation results

We present here an empirical evaluation of LucidTracker for the multiple object segmentation task: given a first frame labelled with the masks of several object instances, the aim is to find the corresponding masks of the objects in future frames.

6.1 Experimental setup
Dataset.
For the multiple object segmentation task we use the 2017 DAVIS Challenge on Video Object Segmentation [54] (DAVIS17, http://davischallenge.org/challenge2017). Compared to DAVIS this is a larger, more challenging dataset, where the video sequences have multiple objects in the scene. Videos of DAVIS that have more than one visible object have been re-annotated (the objects were divided by semantics), and the train and val sets were extended with more sequences. In addition, two other test sets (test-dev and test-challenge) were introduced. The complexity of the videos has increased, with more distractors, occlusions, fast motion, smaller objects, and fine structures. Overall, DAVIS17 consists of 150 sequences, totalling 10 474 annotated frames.

Table 9: Comparison of video object segmentation results on the DAVIS benchmark, reporting region similarity (J: mean, recall, decay), boundary accuracy (F: mean, recall, decay) and temporal stability (T) for the box and grabcut oracles, methods that ignore the first frame annotation (e.g. PDB [62], ~18k training frames), and methods that use it (mask warping, OSVOS [6] ~2.3k, MaskTrack [31] ~11k, OnAVOS [74] ~120k, VideoGCRF [7] ~120k, and LucidTracker). Numbers in italic are computed on subsets of DAVIS. Our LucidTracker improves over previous results.

Table 10: DAVIS per-attribute evaluation, comparing BVS [41], ObjFlow [71], OSVOS [6], MaskTrack [31] and LucidTracker over the 15 video attributes (appearance change, background clutter, camera shake, deformation, dynamic background, edge ambiguity, fast motion, heterogeneous object, interacting objects, low resolution, motion blur, occlusion, out-of-view, scale variation, shape complexity). LucidTracker improves across all video object segmentation challenges.
We evaluate our method on the two test sets, test-dev and test-challenge, each consisting of 30 video sequences with multiple objects per sequence. For both test sets only the masks of the first frames are made public; the evaluation is done via an evaluation server. Our experiments and ablation studies are done on the test-dev set.
Evaluation metric. The accuracy of multiple object segmentation is evaluated using the region (J) and boundary (F) measures proposed by the organisers of the challenge. The average of the J and F measures is used as the overall performance score (denoted as global mean in the tables). Please refer to [54] for more details about the evaluation protocol.
Training details.
All experiments in this section are done using the single-stream architecture discussed in Sections 3.1 and 5.4.3. For training the models we use SGD with mini-batches of images, a fixed learning rate policy, and a small initial learning rate; momentum and weight decay are set to their standard values. All models are initialized with weights trained for image classification on ImageNet [61]. We then train per-video for 40k iterations.
Figure 9: Failure cases. Frames are sampled along the video duration (the percentages indicate the position within the video). For each dataset we show 2 out of the 5 worst results (based on mIoU over the video).

Figure 10: Per-sequence results on DAVIS.

6.2 Key results

Tables 11 and 12 present the results of the 2017 DAVIS Challenge on the test-dev and test-challenge sets [53]. Our main results for the multi-object segmentation challenge are obtained via an ensemble of four different models (f_I, f_{I+F}, f_{I+S}, f_{I+F+S}), see Section 3.1.

The proposed system, LucidTracker, provides the best segmentation quality on the test-dev set and shows competitive performance on the test-challenge set, holding the second place in the competition. The full system is trained using the standard ImageNet pre-training initialization, Pascal VOC12 semantic annotations for the S_t input, and one annotated frame per test video. As discussed in Section 6.3, even without S_t LucidTracker obtains competitive results (with only a minor difference, see Table 13 for details).

The top entry lixx [38] uses a deeper convnet model (an ImageNet pre-trained ResNet) and a similar segmentation architecture, trains it over external segmentation data (pixel-level annotated images from MS-COCO and Pascal VOC for pre-training and, akin to [6], fine-tuning on the DAVIS train and val sets), and extends it with a box-level object detector (trained over MS-COCO and Pascal VOC bounding boxes) and a box-level object re-identification model trained over additional box annotations (on both images and videos). We argue that our system reaches comparable results with a significantly lower amount of training data.
Figure 11: LucidTracker qualitative results on DAVIS17, test-dev set. Frames are sampled along the video duration (the percentages indicate the position within the video). The videos shown are those with the highest mIoU.

Figure 11 provides qualitative results of LucidTracker on the test-dev set. The video results include successful handling of multiple objects, full and partial occlusions, distractors, small objects, and out-of-view scenarios.
Conclusion.
We show that top results for multiple object segmentation can be achieved with an approach that focuses on exploiting as much as possible the available annotation of the first video frame, rather than relying heavily on large external training data.

6.3 Ablation study

Table 13 explores in more detail how the different ingredients contribute to our results. We see that adding extra information (channels) to the system, either the optical flow magnitude or the semantic segmentation, or both, does provide an improvement of a few percent points. The results show that leveraging semantic priors and motion information provides a complementary signal to the RGB image, and both ingredients contribute to the segmentation results.

Combining four different models in an ensemble (f_{I+F+S} + f_{I+F} + f_{I+S} + f_I) enhances the results even further, bringing an additional gain in global mean. Excluding the models which use semantic information (f_{I+F+S} and f_{I+S}) from the ensemble results in only a minor drop in performance. This shows that competitive results can be achieved even with a system trained with only one pixel-level mask annotation per video, without employing extra annotations from Pascal VOC12.

Our lucid dreams enable automatic CRF tuning (see Section 5.3.3), which further improves the results. Employing the proposed temporal coherency step (see Section 3.1) during inference brings an additional performance gain.
The results show that both the flow and the semantic priors provide a complementary signal to the RGB image alone. Despite its simplicity, our ensemble strategy provides an additional gain and leads to competitive results. Notice that even without the semantic segmentation signal S_t our ensemble result is competitive.

6.4 Error analysis

We present the per-sequence results of LucidTracker on DAVIS17 in Figure 12 (per-frame results are not available from the evaluation server). We observe that this dataset is significantly more challenging than DAVIS16 (compare to Figure 10), with only a fraction of the test videos reaching high mIoU. This shows that multiple object segmentation is a much more challenging task than segmenting a single object.

The failure cases discussed in Section 5.5 still apply to the multiple objects case.

Method             Rank  Global mean↑   J Mean↑  J Recall↑  J Decay↓   F Mean↑  F Recall↑  F Decay↓
sidc                10       45.8         43.9      51.5       34.3      47.8      53.6       36.9
YXLKJ                9       49.6         46.1      49.1       22.7      53.0      56.5       22.3
haamooon [59]        8       51.3         48.8      56.9        –         –         –          –
Fromandtozh [83]     7       55.2         52.4      58.4       18.1      57.9      66.1       20.0
ilanv [60]           6       55.8         51.9      55.7       17.6      59.8      65.8       18.9
voigtlaender [73]    5       56.5         53.4      57.8       19.9      59.6      65.4       19.0
lalalafine123        4       57.4         54.5      61.3       24.4      60.2      68.8       24.6
wangzhe              3       57.7         55.6      63.2       31.7      59.8      66.7       37.1
lixx [38]            2       66.1          –         –          –         –         –          –
LucidTracker         1        –            –         –          –         –         –          –

Table 11: Comparison of video object segmentation results on the DAVIS17 test-dev set. Our LucidTracker shows top performance.

Method             Rank  Global mean↑   J Mean↑  J Recall↑  J Decay↓   F Mean↑  F Recall↑  F Decay↓
zwrq0               10       53.6         50.5      54.9       28.0      56.7      63.5       30.4
Fromandtozh [83]     9       53.9         50.7      54.9       32.5      57.1      63.2       33.7
wasidennis           8       54.8         51.6      56.3       26.8      57.9      64.8       28.8
YXLKJ                7       55.8         53.8      60.1       37.7      57.8      62.1       42.9
cjc [12]             6       56.9         53.6      59.5       25.3      60.2      67.9       27.6
lalalafine123        6       56.9         54.8      60.7       34.4      59.1      66.7       36.1
voigtlaender [73]    5       57.7         54.8      60.8       31.0      60.5      67.2       34.7
haamooon [59]        4       61.5         59.8      71.0       21.9      63.2      74.6       23.7
vantam299 [36]       3       63.8         61.5      68.6        –         –         –          –
LucidTracker         2        –            –         –          –         –         –          –
lixx [38]            1        –            –         –          –         –         –          –

Table 12: Comparison of video object segmentation results on the DAVIS17 test-challenge set. Our LucidTracker shows competitive performance, holding the second place in the competition.
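As a reading aid for Tables 11 and 12: J is the region (Jaccard) measure and F the boundary measure of the DAVIS benchmark [49,54], and the global mean used for ranking is, in essence, the average of the two measures taken over all annotated object instances. Under this reading, for a predicted mask M and ground-truth mask G:

\[
\mathcal{J}(M, G) = \frac{|M \cap G|}{|M \cup G|}, \qquad
\text{Global mean} = \tfrac{1}{2}\left(\overline{\mathcal{J}} + \overline{\mathcal{F}}\right),
\]

where \(\overline{\mathcal{J}}\) and \(\overline{\mathcal{F}}\) denote the region and boundary scores averaged over all object instances of the test set.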
Variant                   I  F  S  ensemble  CRF tuning  temp. coherency
LucidTracker (ensemble)   ✓  ✓  ✓     ✓          ✓             ✓
                          ✓  ✓  ✓     ✓          ✓             ✗
                          ✓  ✓  ✓     ✓          ✗             ✗
                          ✓  ✓  ✗     ✓          ✓             ✗
                          ✓  ✓  ✗     ✓          ✗             ✗
                          ✓  ✓  ✓     ✗          ✓             ✗
I+F+S                     ✓  ✓  ✓     ✗          ✗             ✗
I+F                       ✓  ✓  ✗     ✗          ✗             ✗
I+S                       ✓  ✗  ✓     ✗          ✗             ✗
I                         ✓  ✗  ✗     ✗          ✗             ✗

Table 13: Ablation study of different ingredients on the DAVIS17 test-dev and test-challenge sets (scores reported as global mean, mIoU, and mF).

Figure 12: Per-sequence results (mIoU per video sequence) on DAVIS17, test-dev set.
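Figure 12 reports one mIoU value per sequence. For concreteness, the sketch below shows how such a per-video mean IoU can be computed; it is an illustrative reimplementation only (the function name and the exact averaging over objects and frames are our assumptions, not the official DAVIS evaluation code).

```python
import numpy as np

def video_miou(pred_masks, gt_masks):
    """Mean intersection-over-union over all frames and object ids of a video.

    pred_masks, gt_masks: lists of integer label maps (H, W), one per frame,
    where 0 is background and 1..N are object ids.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        for obj_id in np.unique(gt):
            if obj_id == 0:                      # skip the background label
                continue
            p, g = (pred == obj_id), (gt == obj_id)
            union = np.logical_or(p, g).sum()
            if union > 0:
                ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```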
Figure 13: LucidTracker failure cases on DAVIS17, test-dev set. Frames are sampled at 20%, 40%, 60%, 80%, and 100% of the video duration. We show 2 results with an mIoU over the video below 50.

Additionally, on DAVIS17 we observe a clear failure case when segmenting similar-looking object instances: the object appearance is not discriminative enough to correctly track the object, resulting in label switches or in the label bleeding onto other look-alike objects. Figure 13 illustrates this case. This issue could be mitigated by using object-level instance identification modules, as in [38], or by changing the training loss of the model to more severely penalize identity switches.
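One way such a heavier penalty on identity switches could be realized is a weighted cross-entropy in which confusing one object identity for another costs more than a plain foreground/background mistake. The sketch below is purely hypothetical and not part of LucidTracker; the weighting scheme and the switch_weight parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def identity_switch_loss(logits, target, switch_weight=4.0):
    """Cross-entropy that up-weights pixels where the wrong *object id* is
    predicted (hypothetical sketch, not the loss used in this work).

    logits: (B, num_labels, H, W) raw scores, label 0 = background.
    target: (B, H, W) integer ground-truth labels.
    """
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    pred = logits.argmax(dim=1)
    # An identity switch: prediction and ground truth are both objects (>0)
    # but disagree on which object it is.
    switch = (pred != target) & (pred > 0) & (target > 0)
    weights = torch.where(switch,
                          torch.full_like(per_pixel, switch_weight),
                          torch.ones_like(per_pixel))
    return (weights * per_pixel).mean()
```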
Conclusion.
In the multiple object case, the LucidTracker results remain robust across different videos. Since the overall results are lower than for the single object segmentation case, there is more room for future improvement in the multiple object pixel-level segmentation task.
7 Conclusion

We have described a new convnet-based approach for pixel-level object segmentation in videos. In contrast to previous work, we show that top results for single and multiple object segmentation can be achieved without requiring external training datasets (neither annotated images nor videos). Even more, our experiments indicate that it is not always beneficial to use additional training data: synthesizing training samples close to the test domain is more effective than adding more training samples from related domains.

Our extensive analysis decomposed the ingredients that contribute to our improved results, indicating that our new training strategy and the way we leverage additional cues such as semantic and motion priors are key.

Showing that training a convnet for video object segmentation can be done with only a few training samples changes the mindset regarding how much general knowledge about objects is required to approach this problem [31,28], and more broadly how much training data is required to train large convnets depending on the task at hand.

We hope these new results will fuel the ongoing evolution of convnet techniques for single and multiple object segmentation in videos.

Acknowledgements
Eddy Ilg and Thomas Brox acknowledge funding by theDFG Grant BR 3815/7-1.
References
1. Ç. Aytekin, E. C. Ozan, S. Kiranyaz, and M. Gabbouj. Visual saliency by extended quantum cuts. In ICIP, 2015.
2. A. Bansal, X. Chen, B. Russell, A. Gupta, and D. Ramanan. Pixelnet: Representation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506, 2017.
3. L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. arXiv:1606.09549, 2016.
4. F. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. PAMI, 1989.
5. M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In ICCV, 2009.
6. S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixe, D. Cremers, and L. V. Gool. One-shot video object segmentation. In CVPR, 2017.
7. S. Chandra, C. Couprie, and I. Kokkinos. Deep spatio-temporal random fields for efficient video segmentation. In CVPR, 2018.
8. J. Chang, D. Wei, and J. W. Fisher. A video representation using temporal superpixels. In CVPR, 2013.
9. L. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
10. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016.
11. W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischinski, D. Cohen-Or, and B. Chen. Synthesizing training images for boosting human 3d pose estimation. In 3D Vision (3DV), 2016.
12. J. Cheng, S. Liu, Y.-H. Tsai, W.-C. Hung, S. Gupta, J. Gu, J. Kautz, S. Wang, and M.-H. Yang. Learning to segment instances in videos with spatial propagation network. CVPR Workshops, 2017.
13. A. Criminisi, P. Perez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. Trans. Img. Proc., 2004.
14. M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Convolutional features for correlation filter based visual tracking. In ICCV Workshop, 2015.
15. A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In ICCV, 2015.
16. M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015.
17. A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In BMVC, 2014.
18. G. Georgakis, A. Mousavian, A. C. Berg, and J. Kosecka. Synthesizing training data for object detection in indoor scenes. arXiv:1702.07836, 2017.
19. X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
20. M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, 2010.
21. A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016.
22. D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In ECCV, 2016.
23. J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, 2012.
24. E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
25. J. Luiten, P. Voigtlaender, and B. Leibe. PReMVOS: Proposal-generation, refinement and merging for video object segmentation. In ACCV, 2018.
26. S. D. Jain and K. Grauman. Supervoxel-consistent foreground propagation in video. In ECCV, 2014.
27. S. D. Jain and K. Grauman. Click carving: Segmenting objects in video with point clicks. In HCOMP, 2016.
28. S. D. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. arXiv:1701.05384, 2017.
29. V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. arXiv:1612.05478, 2016.
30. A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. CVPR Workshops, 2017.
31. A. Khoreva, F. Perazzi, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. arXiv:1612.02646, 2016.
32. P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with gaussian edge potentials. In NIPS, 2011.
33. M. Kristan, J. Matas, et al. The visual object tracking VOT2015 challenge results. In ICCV Workshop, 2015.
34. M. Kristan, J. Matas, et al. The visual object tracking VOT2016 challenge results. In ECCV Workshop, 2016.
35. M. Kristan, R. Pflugfelder, et al. The visual object tracking VOT2014 challenge results. In ECCV Workshop, 2014.
36. T. N. Le, K. T. Nguyen, M. H. Nguyen-Phan, T. V. Ton, T. A. Nguyen, X. S. Trinh, Q. H. Dinh, V. T. Nguyen, A. D. Duong, A. Sugimoto, T. V. Nguyen, and M. T. Tran. Instance re-identification flow for video object segmentation. CVPR Workshops, 2017.
37. F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In ICCV, 2013.
38. X. Li, Y. Qi, Z. Wang, K. Chen, Z. Liu, J. Shi, P. Luo, C. C. Loy, and X. Tang. Video object segmentation with re-identification. CVPR Workshops, 2017.
39. G. Lin, A. Milan, C. Shen, and I. D. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. arXiv:1611.06612, 2016.
40. C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In ICCV, 2015.
41. N. Maerki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bilateral space video segmentation. In CVPR, 2016.
42. N. Nagaraja, F. Schmidt, and T. Brox. Video segmentation with just a few strokes. In ICCV, 2015.
43. H. Nam, M. Baek, and B. Han. Modeling and propagating cnns in a tree structure for visual tracking. arXiv:1608.07242, 2016.
44. H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
45. N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
46. A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In ICCV, 2013.
47. D. Park and D. Ramanan. Articulated pose estimation with tiny synthetic videos. In CVPR Workshop, 2015.
48. D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. arXiv:1612.06370, 2016.
49. F. Perazzi, J. Pont-Tuset, B. McWilliams, L. V. Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
50. F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung. Fully connected object proposals for video segmentation. In ICCV, 2015.
51. L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In CVPR, 2012.
52. T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017.
53. J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. DAVIS challenge on video object segmentation 2017. http://davischallenge.org/challenge2017.
54. J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675, 2017.
55. A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012.
56. J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015.
57. S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
58. C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.
59. A. Shaban, A. Firl, A. Humayun, J. Yuan, X. Wang, P. Lei, N. Dhanda, B. Boots, J. M. Rehg, and F. Li. Multiple-instance video segmentation with sequence-specific object proposals. CVPR Workshops, 2017.
60. G. Sharir, E. Smolyansky, and I. Friedman. Video object segmentation using tracked object proposals. CVPR Workshops, 2017.
61. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
62. H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam. Pyramid dilated deeper convlstm for video salient object detection. In ECCV, 2018.
63. T. V. Spina and A. X. Falcão. Fomtrace: Interactive video segmentation by image graphs and fuzzy object models. arXiv:1606.03369, 2016.
64. J. Sun, J. Jia, C.-K. Tang, and H.-Y. Shum. Poisson matting. In SIGGRAPH, 2004.
65. K. Tang, V. Ramanathan, L. Fei-fei, and D. Koller. Shifting weights: Adapting object detectors from image to video. In NIPS, 2012.
66. M. Tang, D. Marin, I. Ben Ayed, and Y. Boykov. Normalized cut meets MRF. In ECCV, 2016.
67. S. Tang, M. Andriluka, A. Milan, K. Schindler, S. Roth, and B. Schiele. Learning people detectors for tracking in crowded scenes. In ICCV, 2013.
68. R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. arXiv:1605.05863, 2016.
69. P. Tokmakov, K. Alahari, and C. Schmid. Learning motion patterns in videos. arXiv:1612.07217, 2016.
70. P. Tokmakov, K. Alahari, and C. Schmid. Learning video object segmentation with visual memory. In ICCV, 2017.
71. Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In CVPR, 2016.
72. G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. arXiv:1701.01370.
73. P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for the 2017 DAVIS challenge on video object segmentation. CVPR Workshops, 2017.
74. P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017.
75. T. Vojir and J. Matas. Pixel-wise object segmentations for the VOT 2016 dataset. Research report, 2017.
76. L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV, 2015.
77. T. Wang, B. Han, and J. Collomosse. Touchcut: Fast image and video segmentation using single-touch interaction. CVIU, 2014.
78. W. Wang and J. Shen. Super-trajectory for video segmentation. arXiv:1702.08634, 2017.
79. X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
80. Z. Wu, C. Shen, and A. van den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. arXiv:1611.10080, 2016.
81. F. Xiao and Y. J. Lee. Track and segment: An iterative unsupervised approach for video object proposals. In CVPR, 2016.
82. J. J. Yu, A. W. Harley, and K. G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. arXiv:1608.05842, 2016.
83. H. Zhao. Some promising ideas about multi-instance video segmentation. CVPR Workshops, 2017.
84. H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
85. Y. Zhu, Z. Lan, S. Newsam, and A. G. Hauptmann. Guided optical flow learning. arXiv:1702.02295.