Lucid Data Dreaming for Video Object Segmentation
Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox, Bernt Schiele
Abstract
Convolutional networks reach top quality in pixel-level video object segmentation but require a large amount of training data to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using one to three orders of magnitude less annotated data than competing methods. Our approach is suitable for both single and multiple object segmentation. Instead of using large training sets and hoping to generalize across domains, we generate in-domain training data using the provided annotation of the first frame of each video to synthesize ("lucid dream"¹) plausible future video frames. In-domain per-video training data allows us to train high-quality appearance- and motion-based models, as well as to tune the post-processing stage. This approach makes it possible to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and how much general "objectness" knowledge are required for the video object segmentation task.

¹ In a lucid dream the sleeper is aware that he or she is dreaming and is sometimes able to control the course of the dream.

Anna Khoreva, Max Planck Institute for Informatics, Germany. E-mail: [email protected]
Rodrigo Benenson, Google. E-mail: [email protected]
Eddy Ilg, University of Freiburg, Germany. E-mail: [email protected]
Thomas Brox, University of Freiburg, Germany. E-mail: [email protected]
Bernt Schiele, Max Planck Institute for Informatics, Germany. E-mail: [email protected]

Figure 1: Starting from scarce annotations we synthesize in-domain data to train a specialized pixel-level video object segmenter for each dataset or even each video sequence.
1 Introduction

In recent years the field of localizing objects in videos has transitioned from bounding box tracking [33,35,34] to pixel-level segmentation [37,55,49,75]. Given a first frame labelled with the foreground object masks, one aims to find the corresponding object pixels in future frames. Segmenting objects at the pixel level enables a finer understanding of videos and is helpful for tasks such as video editing, rotoscoping, and summarisation.

Top performing results are currently obtained using convolutional networks (convnets) [29,6,31,3,22,44]. Like most deep learning techniques, convnets for video object segmentation benefit from large amounts of training data. Current state-of-the-art methods rely, for instance, on pixel-accurate foreground/background annotations of a few thousand video frames [29,6], over ten thousand images [31], or even more than a hundred thousand annotated samples for training [74]. Labelling images and videos at the pixel level is a laborious task (compared e.g. to drawing bounding boxes for detection), and creating a large training set requires significant annotation effort.

In this work we aim to reduce the necessity for such large volumes of training data. It is traditionally assumed that convnets require large training sets to perform best. We show that for video object segmentation having a larger training set is not automatically better, and that improved results can be obtained by using one to three orders of magnitude less training data than previous approaches [6,31,74]. The main insight of our work is that for video object segmentation using few training frames in the target domain is more useful than using large training volumes across domains.

To ensure a sufficient amount of training data close to the target domain, we develop a new technique for synthesizing training data particularly tailored to the pixel-level video object segmentation scenario. We call this data generation strategy "lucid dreaming", where the first frame and its annotation mask are used to generate plausible future frames of the video. The goal is to produce a large training set of reasonably realistic images which capture the expected appearance variations in future video frames, and which is thus, by design, close to the target domain.

Our approach is suitable for both single and multiple object segmentation in videos. Enabled by the proposed data generation strategy and the efficient use of optical flow, we are able to achieve high quality results while using only a handful of individually annotated training frames (one per video). Moreover, in the extreme case with only a single annotated frame and zero pre-training (i.e. without ImageNet pre-training), we still obtain competitive video object segmentation results.

In summary, our contributions are the following:
1. We propose "lucid data dreaming", an automated approach to synthesize training data for convnet-based pixel-level video object segmentation that leads to top results for both single and multiple object segmentation.
2. We conduct an extensive analysis to explore the factors contributing to our good results.
3. We show that training a convnet for video object segmentation can be done with only few annotated frames. We hope these results will temper the trend towards ever larger training sets, and popularize the design of video segmentation convnets with lighter training needs.

With the results for multiple object segmentation we took the second place in the 2017 DAVIS Challenge on Video Object Segmentation [54].
A summary of the proposed approach was presented online [30]; the lucid data dreaming synthesis implementation is publicly available. This paper significantly extends [30] with in-depth discussions of the method, more details of the formulation, its implementation, and its variants for single and multiple object segmentation in videos. It also offers a detailed ablation study and an error analysis, and explores the impact of a varying number of annotated training samples on the video segmentation quality.

2 Related work

Box tracking. Classic work on video object tracking focused on bounding box tracking. Many of the insights from these works have been re-used for video object segmentation. Traditional box tracking smoothly updates across time a linear model over hand-crafted features [23,5,35]. Since then, convnets have been used as improved features [14,40,76], and eventually to drive the tracking itself [22,3,68,43,44]. Contrary to traditional box trackers (e.g. [23]), convnet-based approaches need additional data for pre-training and learning the task.
Video object segmentation.
In this paper we focus on generating a foreground versus background pixel-wise object labelling for each video frame, starting from a first manually annotated frame. Multiple strategies have been proposed to solve this task.
Box-to-segment:
First a box-level track is built, and a space-time grabcut-like approach is used to generate per-frame segments [81].
Video saliency: this group of methods extracts the main foreground object as a pixel-level space-time tube. Both hand-crafted models [17,46] and trained convnets [70,28,62] have been considered. Because these methods ignore the first frame annotation, they fail in videos where multiple salient objects move (e.g. a flock of penguins).
Space-time proposals: these methods partition the space-time volume, and then the tube overlapping most with the first frame mask annotation is selected as tracking output [20,50,8].
Mask propagation: appearance similarity and motion smoothness across time are used to propagate the first frame annotation across the video [41,78,71]. These methods usually leverage optical flow and long term trajectories.
Convnets: following the trend in box tracking, convnets have recently been proposed for video object segmentation. [6] trains a generic object saliency network and fine-tunes it per-video (using the first frame annotation) to make the output sensitive to the specific object of interest. [31] uses a similar strategy, but also feeds the mask from the previous frame as guidance for the saliency network. [74] incorporates online adaptation of the network using the predictions from previous frames. [7] extends the Gaussian-CRF approach to videos by exploiting spatio-temporal connections for pairwise terms and relying on unary terms from [74]. Finally [29] mixes convnets with ideas of bilateral filtering. Our approach also builds upon convnets.

What makes convnets particularly suitable for the task is that they can learn the common statistics of appearance and motion patterns of objects, as well as what makes objects distinctive from the background, and exploit this knowledge when segmenting a particular object. This aspect gives convnets an edge over traditional techniques based on low-level hand-crafted features.

Our network architecture is similar to [6,31]. Other than implementation details, there are three differentiating factors. One, we use a different strategy for training: [6,29,7,74] rely on consecutive video training frames and [31] uses an external saliency dataset, while our approach focuses on using the first frame annotations provided with each targeted video benchmark, without relying on external annotations. Two, our approach exploits optical flow better than these previous methods. Three, we describe an extension to seamlessly handle segmentation of multiple objects.
Interactive video segmentation.
Interactive segmentation [42,27,63,77] considers more diverse user inputs (e.g. strokes), and requires interactive processing speed rather than maximal quality. Although our technique can be adapted to varied inputs, we focus on maximizing quality for the non-interactive case, with no additional hints along the video.
Semantic labelling.
Like other convnets in this space [29,6,31], our architecture builds upon the insights from semantic labelling networks [84,39,80,2]. Because of this, the flurry of recent developments should directly translate into better video object segmentation results. For the sake of comparison with previous work, we build upon the well established VGG DeepLab architecture [10].
Synthetic data.
Like our approach, previous works have also explored synthesizing training data. Synthetic renderings [45], video game environments [57], and mixed synthetic and real images [72,11,15] have shown promise, but require task-appropriate 3D models. Compositing real-world images provides more realistic results, and has shown promise for object detection [18,67], text localization [21], and pose estimation [51].

The closest work to ours is [47], which also generates video-specific training data using the first frame annotations. They use human skeleton annotations to improve pose estimation, while we employ pixel-level mask annotations to improve video object segmentation.
3 LucidTracker

Section 3.1 describes the network architecture used, and how RGB and optical flow information are fused to predict the next frame segmentation mask. Section 3.2 discusses different training modalities employed with the proposed video object segmentation system. In Section 4 we discuss the training data generation, and Sections 5 and 6 report results for single and multiple object segmentation in videos.

3.1 Architecture
Approach.
We model video object segmentation as a mask refinement task (mask: binary foreground/background labelling of the image) based on appearance and motion cues. From frame t−1 to frame t the estimated mask M_{t−1} is propagated to frame t, and the new mask M_t is computed as a function of the previous mask, the new image I_t, and the optical flow F_t, i.e. M_t = f(I_t, F_t, M_{t−1}). Since objects tend to move smoothly through space and time, there are only small changes from frame to frame and M_{t−1} can be seen as a rough estimate of M_t. We thus require our trained convnet to learn to refine rough masks into accurate masks. Fusing the complementary image I_t and motion flow F_t exploits the information inherent to video and enables the model to segment well both static and moving objects.

Note that this approach is incremental, does a single forward pass over the video, and keeps no explicit model of the object appearance at frame t. In some experiments we adapt the model f per video, using the annotated first frame I_0, M_0. However, in contrast to traditional techniques [23], this model is not updated while we process the video frames; the only state evolving along the video is the mask M_{t−1} itself.
First frame. In the video object segmentation task of interest the mask M_0 for the first frame is given. This is the standard protocol of the benchmarks considered in Sections 5 and 6. If only a bounding box is available on the first frame, the mask could be estimated using grabcut-like techniques [58,66].

RGB image I. Typically a semantic labeller generates pixel-wise labels based on the input image alone (e.g. M = g(I)). We use an augmented semantic labeller with an input layer modified to accept 4 channels (RGB + previous mask), so as to generate outputs based on the previous mask estimate, e.g. M_t = f_I(I_t, M_{t−1}). Our approach is general and can leverage any existing semantic labelling architecture. We select the DeepLabv2 architecture with VGG base network [10], which is comparable to [29,6,31]; FusionSeg [28] uses ResNet.

Optical flow F. We use flow in two complementary ways. First, to obtain a better initial estimate of M_t we warp M_{t−1} using the flow F_t: M_t = f_I(I_t, w(M_{t−1}, F_t)); we call this "mask warping". Second, we use flow as a direct source of information about the mask M_t. As can be seen in Figure 2, when the object is moving relative to the background, the flow magnitude ‖F_t‖ provides a very reasonable estimate of the mask M_t. We thus consider using a convnet specifically for mask estimation from flow, M_t = f_F(‖F_t‖, w(M_{t−1}, F_t)), and merge it with the image-only version by naive averaging:

M_t = 0.5 · f_I(I_t, ...) + 0.5 · f_F(‖F_t‖, ...).    (1)

Figure 2: Data flow examples. I_t, ‖F_t‖, M_{t−1} are the inputs, M_t is the resulting output. Green boundaries outline the ground truth segments. Red overlay indicates M_{t−1}, M_t.

We use the state-of-the-art optical flow estimation method FlowNet2.0 [24], which itself is a convnet that computes F_t = h(I_{t−1}, I_t) and is trained on synthetic renderings of flying objects [45]. For the optical flow magnitude computation we subtract the median motion for each frame, average the magnitude of the forward and backward flow, and scale the values per-frame to [0, 255], bringing them to the same range as the RGB channels.

The loss function is the sum of cross-entropy terms over each pixel in the output map (all pixels are equally weighted). In our experiments f_I and f_F are trained independently, via some of the modalities listed in Section 3.2. Our two-stream architecture is illustrated in Figure 3a.

We also explored expanding our network to accept 5 input channels (RGB + previous mask + flow magnitude) in one stream, M_t = f_{I+F}(I_t, ‖F_t‖, w(M_{t−1}, F_t)), but did not observe much difference in performance compared to naive averaging, see the experiments in Section 5.4.3. Our one-stream architecture is illustrated in Figure 3b. The one-stream network is more affordable to train and makes it easy to add extra input channels, e.g. to provide additional semantic information about the objects. A minimal sketch of the resulting per-frame inference loop is given below.
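To make the two-stream inference concrete, the following is a minimal sketch of the per-frame loop of equation (1). It is an illustration under stated assumptions, not the paper's actual implementation: `f_rgb` and `f_flow` stand for the trained f_I and f_F networks returning foreground probability maps, `flow_fn` for the FlowNet2.0 call, and the flow is assumed to map each pixel of frame t back to frame t−1 so that a simple backward remap performs the mask warping.

```python
import numpy as np
import cv2


def warp_mask(prev_mask, flow):
    """Warp the previous-frame mask into the current frame.

    flow[y, x] is assumed to point from pixel (x, y) in frame t to its
    correspondence in frame t-1 (backward flow), so cv2.remap performs the warp.
    """
    h, w = prev_mask.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(prev_mask.astype(np.float32), map_x, map_y,
                     interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT, borderValue=0)


def track_video(frames, first_mask, f_rgb, f_flow, flow_fn):
    """Single forward pass over the video, merging the two streams as in Eq. (1):
    M_t = 0.5 * f_I(I_t, warped mask) + 0.5 * f_F(||F_t||, warped mask)."""
    masks = [first_mask.astype(np.uint8)]
    for t in range(1, len(frames)):
        flow = flow_fn(frames[t], frames[t - 1])   # backward flow t -> t-1 (assumed)
        mag = np.linalg.norm(flow, axis=-1)
        mag = 255.0 * mag / (mag.max() + 1e-6)     # rough per-frame scaling to [0, 255]
                                                   # (median-motion subtraction omitted here)
        warped = warp_mask(masks[-1], flow)
        p_rgb = f_rgb(frames[t], warped)           # foreground probability map in [0, 1]
        p_flow = f_flow(mag, warped)
        fused = 0.5 * p_rgb + 0.5 * p_flow
        masks.append((fused > 0.5).astype(np.uint8))
    return masks
```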
Multiple objects. The proposed framework can easily be extended to segmenting multiple objects simultaneously. Instead of having one additional input channel for the previous frame mask, we provide the mask of each object instance in a separate channel, expanding the network to accept 3+N input channels (RGB + N object masks): M_t = f_I(I_t, w(M^1_{t−1}, F_t), ..., w(M^N_{t−1}, F_t)), where N is the number of objects annotated on the first frame.

Figure 3: Overview of the proposed one and two stream architectures. (a) Two-stream architecture, where the image I_t and the optical flow information ‖F_t‖ are used to update mask M_{t−1} into M_t, see equation 1. (b) One-stream architecture, where 5 input channels (image I_t, optical flow information ‖F_t‖, and mask M_{t−1}) are used to estimate mask M_t. See Section 3.1.

For multiple object segmentation we employ a one-stream architecture in the experiments, using optical flow F and semantic segmentation S as additional input channels: M_t = f_{I+F+S}(I_t, ‖F_t‖, S_t, w(M^1_{t−1}, F_t), ..., w(M^N_{t−1}, F_t)). This allows the appearance model to leverage semantic priors and motion information. See Figure 4 for an illustration.

The one-stream network is trained with a multi-class cross-entropy loss and is able to segment multiple objects simultaneously, sharing the feature computation across instances. This avoids a linear increase of the cost with the number of objects. In our preliminary results using a single architecture also provides better results than segmenting multiple objects separately, one at a time, and avoids the need to design a merging strategy amongst overlapping tracks.

Semantic labels S. To compute the pixel-level semantic labelling S_t = h(I_t) we use the state-of-the-art convnet PSPNet [84], trained on Pascal VOC12 [16]. Pascal VOC12 annotates 20 categories, yet we want to track any type of object. S_t can also provide information about unknown category instances by describing them as a spatial mixture of known ones (e.g. a sea lion might look like a dog torso and the head of a cat). As long as the predictions are consistent through time, S_t provides a useful cue for segmentation. Note that we only use S_t for the multi-object segmentation challenge, discussed in Section 6. In the same way as for the optical flow, we scale S_t to bring all the channels to the same range.

We additionally experiment with ensembles of different variants, which makes the system more robust to the challenges inherent in videos. For our main results on the multiple object segmentation task we consider an ensemble of four models, M_t = 1/4 · (f_{I+F+S} + f_{I+F} + f_{I+S} + f_I), where we merge the outputs of the models by naive averaging. See Section 6 for more details. A small sketch of how the multi-channel input is assembled follows.
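The sketch below shows one way to assemble the multi-channel input of the one-stream multi-object variant (RGB, flow magnitude, semantics, and one warped mask channel per object). The channel layout and the encoding of S_t as a single scaled map are assumptions made for illustration; the actual network may consume the semantic output differently.

```python
import numpy as np


def build_multiobject_input(image, flow_mag, semantics, warped_masks):
    """Stack the per-frame inputs of the one-stream multi-object variant:
    RGB (3) + flow magnitude (1) + semantic map (1) + one channel per object mask.

    `semantics` is assumed to be already scaled to the RGB value range, as the
    text above describes; the binary masks are scaled to the same range too.
    """
    channels = [image.astype(np.float32)]                      # H x W x 3
    channels.append(flow_mag[..., None].astype(np.float32))    # H x W x 1, in [0, 255]
    channels.append(semantics[..., None].astype(np.float32))   # H x W x 1 (assumed scaled)
    for mask in warped_masks:                                  # one channel per object
        channels.append(255.0 * mask[..., None].astype(np.float32))
    return np.concatenate(channels, axis=-1)                   # H x W x (5 + N)
```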
Temporal coherency. To improve the temporal coherency of the proposed video object segmentation framework we introduce an additional step into the system. Before providing as input the previous frame mask warped with the optical flow, w(M_{t−1}, F_t), we look at frame t−2 to remove inconsistencies between the predicted masks M_{t−1} and M_{t−2}. In particular, we split the mask M_{t−1} into connected components and remove all components of M_{t−1} which do not overlap with M_{t−2}. This way we remove possibly spurious blobs generated by our model in M_{t−1}. Afterwards we warp the "pruned" mask M̃_{t−1} with the optical flow and use w(M̃_{t−1}, F_t) as input to the network. This step is applied only during inference; it mitigates error propagation issues and helps generate more temporally coherent results. A minimal sketch of this pruning step is shown below.
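A minimal sketch of the connected-component pruning used for temporal coherency, assuming binary masks as NumPy arrays; the overlap test follows the description above.

```python
import numpy as np
from scipy import ndimage


def prune_mask(mask_prev, mask_prev2):
    """Remove connected components of M_{t-1} that do not overlap with M_{t-2},
    suppressing spurious blobs before the mask is warped to frame t."""
    labels, num = ndimage.label(mask_prev > 0)
    pruned = np.zeros_like(mask_prev)
    for cc in range(1, num + 1):
        component = labels == cc
        if np.logical_and(component, mask_prev2 > 0).any():  # seen at t-2: keep it
            pruned[component] = 1
    return pruned
```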
Post-processing. As a final stage of our pipeline, we refine the generated mask M_t per frame using DenseCRF [32]. This adjusts small image details that the network might not be able to handle. It is known by practitioners that DenseCRF is quite sensitive to its parameters and can easily worsen results. We therefore use our lucid dreams to handle per-dataset CRF tuning too, see Section 3.2.
We refer to our full f_{I+F} system as LucidTracker, and as LucidTracker− when no temporal coherency or post-processing steps are used. The usage of S_t or of a model ensemble will be explicitly stated.

3.2 Training modalities

Multiple modalities are available to train a tracker. Training-free approaches (e.g. BVS [41], SVT [78]) are fully hand-crafted systems with hand-tuned parameters, and thus do not require training data. They can be used as-is over different datasets. Supervised methods can also be trained to generate a dataset-agnostic model that can be applied over different datasets. Instead of using a fixed model for all cases, it is also possible to obtain specialized per-dataset models, either via self-supervision [79,48,82,85] or by using the first frame annotation of each video in the dataset as training/tuning set. Finally, inspired by traditional box tracking techniques, we also consider adapting the model weights to the specific video at hand, thus obtaining per-video models. Section 5 reports new results over these four training modalities (training-free, dataset-agnostic, per-dataset, and per-video).

Figure 4: Extension of LucidTracker to multiple objects. The previous frame mask of each object is provided in a separate channel. We additionally explore using optical flow F and semantic segmentation S as additional inputs. See Section 3.1.

Our LucidTracker obtains best results when first pre-trained on ImageNet, then trained per-dataset using all the first frame annotations together, and finally fine-tuned per-video for each evaluated sequence. The post-processing DenseCRF stage is automatically tuned per-dataset. The experimental Section 5 details the effect of these training stages. Surprisingly, we can obtain reasonable performance even when training from only a single annotated frame (without ImageNet pre-training, i.e. zero pre-training); this result goes against the intuition that convnets require large training data to provide good results.

Unless otherwise stated, we fine-tune per-video models relying solely on the first frame I_0 and its annotation M_0. This is in contrast to traditional techniques [23,5,35] which would update the appearance model at each frame I_t.

4 Lucid data dreaming

To train the function f one would like to use ground truth data for M_{t−1} and M_t (like [3,6,22]); however such data is expensive to annotate and rare. [6] thus trains on a set of videos (a few thousand frames) and requires the model to transfer across multiple test sets. [31] side-steps the need for consecutive frames by generating synthetic masks M_{t−1} from a saliency dataset of thousands of images with their corresponding mask M_t. We propose a new data generation strategy that reaches better results using only a handful of individually annotated training frames (one per video).

Ideally training data should be as similar as possible to the test data; even subtle differences may affect quality (e.g. training on static images for testing on videos under-performs [65]). To ensure our training data is in-domain, we propose to generate it by synthesizing samples from the provided annotated frame (the first frame) of each target video. This is akin to "lucid dreaming", as we intentionally "dream" the desired data by creating sample images that are plausible hypothetical future frames of the video. The outcome of this process is a large set of frame pairs in the target domain (thousands of pairs per annotation) with known optical flow and mask annotations, see Figure 5.
Synthesis process. The target domain for a tracker is the set of future frames of the given video. Traditional data augmentation via small image perturbations is insufficient to cover the expected variations across time, thus a task-specific strategy is needed. Across the video the tracked object might change in illumination, deform, translate, be occluded, be shown from different points of view, and evolve on top of a dynamic background. All of these aspects should be captured when synthesizing future frames. We achieve this by cutting out the foreground object, in-painting the background, perturbing both foreground and background, and finally recomposing the scene. This process is applied twice with randomly sampled transformation parameters, resulting in a pair of frames (I_{τ−1}, I_τ) with known pixel-level ground-truth mask annotations (M_{τ−1}, M_τ), optical flow F_τ, and occlusion regions. The object position in I_τ is uniformly sampled, but the changes between I_{τ−1} and I_τ are kept small to mimic the usual evolution between consecutive frames. In more detail, starting from an annotated image:
1. Illumination changes: we globally modify the image by randomly altering saturation S and value V (from the HSV colour space) via x′ = a · x^b + c, where a, b and c are sampled uniformly from small ranges around the identity transform (a = 1, b = 1, c = 0).
2. Fg/Bg split: the foreground object is removed from the image I and a background image is created by inpainting the cut-out area [13].
3. Object motion: we simulate motion and shape deformations by applying global translation as well as affine and non-rigid deformations to the foreground object. For I_{τ−1} the object is placed at any location within the image with a uniform distribution, and in I_τ with a small translation (a fraction of the object size) relative to τ−1. In both frames we apply small random rotations, scalings, and thin-plate spline deformations [4] proportional to the object size.
4. Camera motion:
We additionally transform the background using affine deformations to simulate camera view changes. We apply random translation, rotation, and scaling within the same ranges as for the foreground object.
5. Fg/Bg merge: finally (I_{τ−1}, I_τ) are composed by blending the perturbed foreground with the perturbed background using Poisson matting [64]. Using the known transformation parameters we also synthesize the ground-truth pixel-level mask annotations (M_{τ−1}, M_τ) and optical flow F_τ.

Figure 5 shows example results. Although our approach does not capture appearance changes due to point of view, occlusions, or shadows, we see that this rough modelling is already effective for training our segmentation models.

The number of synthesized images can be arbitrarily large. We generate a few thousand pairs per annotated video frame. This training data is, by design, in-domain with regard to the target video. The experimental Section 5 shows that this strategy is more effective than using thousands of manually annotated images from close-by domains.

The same strategy for data synthesis can be employed for the multiple object segmentation task. Instead of manipulating a single object we handle multiple ones at the same time, applying independent transformations to each of them. We model occlusion between objects by adding a random depth ordering, obtaining both partial and full occlusions in the training set. Including occlusions in the lucid dreams allows the model to better handle plausible interactions between objects in future frames. See Figure 6 for examples of the generated data. A simplified sketch of the single-object synthesis steps is given below.
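The sketch below illustrates, in simplified form, one synthesis pass: illumination jitter in HSV, background inpainting of the cut-out object, and a random global motion of the foreground before recomposition. The jitter and motion ranges are illustrative assumptions, and the thin-plate-spline deformation and Poisson blending of the full method are replaced here by an affine warp and a simple paste.

```python
import numpy as np
import cv2


def lucid_dream(image, mask, rng=np.random):
    """Generate one synthetic frame and mask from a single annotated frame.

    Simplified sketch: the parameter ranges below are illustrative, and the
    non-rigid deformation / Poisson blending steps of the full method are omitted.
    `image` is an 8-bit BGR image, `mask` a binary foreground mask.
    """
    h, w = mask.shape
    # 1. illumination change on the saturation and value channels (x' = a * x^b + c)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32) / 255.0
    for c in (1, 2):
        a, b, add = rng.uniform(0.95, 1.05), rng.uniform(0.9, 1.1), rng.uniform(-0.05, 0.05)
        hsv[..., c] = np.clip(a * hsv[..., c] ** b + add, 0.0, 1.0)
    img = cv2.cvtColor((hsv * 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
    # 2. fg/bg split: inpaint the cut-out foreground region to obtain a clean background
    fg_mask = (mask > 0).astype(np.uint8)
    background = cv2.inpaint(img, fg_mask, 5, cv2.INPAINT_TELEA)
    # 3./4. object and camera motion: here a single random similarity transform
    angle, scale = rng.uniform(-10, 10), rng.uniform(0.9, 1.1)
    tx, ty = rng.uniform(0, w), rng.uniform(0, h)          # uniform object placement
    T = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)
    T[:, 2] += (tx - w / 2.0, ty - h / 2.0)
    fg = cv2.warpAffine(img, T, (w, h))
    new_mask = cv2.warpAffine(fg_mask, T, (w, h), flags=cv2.INTER_NEAREST)
    # 5. fg/bg merge: simple paste instead of Poisson matting
    out = background.copy()
    out[new_mask > 0] = fg[new_mask > 0]
    return out, new_mask
```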
5 Single object segmentation results

We present here a detailed empirical evaluation on three different datasets for the single object segmentation task: given a first frame labelled with the foreground object mask, the goal is to find the corresponding object pixels in future frames. (Section 6 will discuss the multiple objects case.)

5.1 Experimental setup

Datasets.
We evaluate our method on three video object segmentation datasets: DAVIS [49], YouTubeObjects [55,26], and SegTrack v2 [37]. The goal is to track an object through all video frames given an object mask in the first frame. These three datasets provide diverse challenges, with a mix of high and low resolution web videos, single or multiple salient objects per video, videos with flocks of similar looking instances, longer and shorter sequences, as well as the usual video segmentation challenges such as occlusion, fast motion, illumination changes, view point changes, elastic deformation, etc.

The DAVIS [49] video segmentation benchmark consists of 50 full-HD videos of diverse object categories with all frames annotated with pixel-level accuracy, where one single or two connected moving objects are separated from the background. The number of frames in each video varies from 25 to 104.

YouTubeObjects [55,26] includes web videos from 10 object categories. We use the subset of 126 video sequences with mask annotations provided by [26] for evaluation, where one single object or a group of objects of the same category are separated from the background. In contrast to DAVIS, these videos contain a mix of static and moving objects. The number of frames in each video ranges from 2 to 401.

SegTrack v2 [37] consists of 14 videos with multiple object annotations for each frame. For videos with multiple objects each object is treated as a separate problem, resulting in 24 sequences. The length of each video varies from 21 to 279 frames. The images in this dataset have low resolution and some compression artefacts, making it hard to track the object based on its appearance.

The main experimental work is done on DAVIS, since it is the largest densely annotated dataset of the three and provides high quality, high resolution data. The videos for this dataset were chosen to represent diverse challenges, making it a good experimental playground. We additionally report on the two other datasets as complementary test-set results.
Evaluation metric. To measure the accuracy of video object segmentation we use the mean intersection-over-union overlap (mIoU) between the per-frame ground truth object mask and the predicted segmentation, averaged across all video sequences. We have noticed disparate evaluation procedures used in previous work, and we report here a unified evaluation across datasets. When possible, we re-evaluated certain methods using results provided by their authors. For all three datasets we follow the DAVIS evaluation protocol, excluding the first frame from evaluation and using all other frames of the video sequences, independent of object presence in the frame. The sketch below illustrates this protocol.
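As a reference for how scores are computed in this section, here is a small sketch of the per-video mIoU under the protocol just described (first frame excluded, all other frames counted). Treating a frame where both the ground truth and the prediction are empty as IoU 1 is an assumption of this sketch, not a statement of the benchmark's exact convention.

```python
import numpy as np


def video_miou(gt_masks, pred_masks):
    """Mean IoU of one video: the given first frame is excluded, and every
    other frame counts whether or not the object is visible in it."""
    ious = []
    for gt, pred in zip(gt_masks[1:], pred_masks[1:]):
        inter = np.logical_and(gt > 0, pred > 0).sum()
        union = np.logical_or(gt > 0, pred > 0).sum()
        ious.append(1.0 if union == 0 else inter / union)  # assumed convention for empty frames
    return float(np.mean(ious))

# The dataset score is the average of video_miou over all sequences.
```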
Training details. For training all the models we use SGD with mini-batches of images, a fixed learning rate policy, and a small initial learning rate; momentum and weight decay are set to their standard values.

Models using pre-training are initialized with weights trained for image classification on ImageNet [61]. We then train per-dataset for 40k iterations for the RGB+Mask branch f_I and for 20k iterations for the Flow+Mask branch f_F. When using the single-stream architecture (Section 5.4.3), we use 40k iterations.

Models without ImageNet pre-training are initialized using the Xavier (also known as Glorot) random weight initialization strategy [19] (the weights are drawn from a truncated normal distribution with zero mean and a standard deviation computed from the number of input and output units of the weight tensor, see [19] for details). The per-dataset training then needs to be longer: 100k iterations for the f_I branch and 40k iterations for the f_F branch.

For per-video fine-tuning 2k iterations are used for f_I. To keep the computational cost lower, the f_F branch is kept fixed across videos.

Figure 5: Lucid data dreaming examples. From one annotated frame we generate pairs of images (I_{τ−1}, I_τ) that are plausible future video frames, with known optical flow (F_τ) and masks (green boundaries). Note the inpainted background and the foreground/background deformations.

Figure 6: Lucid data dreaming examples with multiple objects. From one annotated frame we generate a plausible future video frame (I_τ), with known optical flow (F_τ) and mask (M_τ).

All training parameters are chosen based on DAVIS results. We use identical parameters on YouTubeObjects and SegTrack v2, showing the generalization of our approach. It takes ~3.5h to obtain each per-video model, including data generation, per-dataset training, per-video fine-tuning and per-dataset grid search of CRF parameters (averaged over DAVIS, amortising the per-dataset training time over all videos). At test time our LucidTracker runs at ~5s per frame, including the optical flow estimation with FlowNet2.0 [24] (~0.5s) and CRF post-processing [32] (~2s).

5.2 Key results

Table 1 presents our main result and compares it to previous work. Our full system, LucidTracker, provides the best video segmentation quality across the three datasets while being trained on each dataset using only one frame per video (50 frames for DAVIS, 126 for YouTubeObjects, 24 for SegTrack v2), which is one to three orders of magnitude less than the top competing methods. Ours is the first method to reach such high mIoU on all three datasets.
Oracles and baselines. The grabcut oracle computes grabcut [58] using the ground truth bounding boxes (box oracle). This oracle indicates that on the considered datasets separating foreground from background is not easy, even if a perfect box-level tracker was available. We provide three additional baselines.
Table 1: Comparison of video object segmentation results (mIoU) across the three datasets (DAVIS, YouTubeObjects, SegTrack v2). For each method the number of annotated training frames is listed (the box/grabcut oracles and the saliency, flow-saliency, mask-warping and ObjFlow [71] baselines use none; MP-Net [69] uses ~22.5k, PDB [62] ~18k, OSVOS [6] ~2.3k, OnAVOS [74] and VideoGCRF [7] ~120k), together with whether it uses the first frame annotation and optical flow. Numbers in italic are reported on subsets of DAVIS. Our LucidTracker consistently improves over previous results, see Section 5.2.

"Saliency" corresponds to using the generic (training-free) saliency method EQCut [1] over the RGB image I_t. "Flow saliency" does the same, but over the optical flow magnitude ‖F_t‖. The results indicate that the objects being tracked are not particularly salient in the image. On DAVIS motion saliency is a strong signal, but not on the other two datasets. Saliency methods ignore the first frame annotation provided for the task. We also consider the "Mask warping" baseline which uses optical flow to propagate the mask estimate from frame to frame via simple warping, M_t = w(M_{t−1}, F_t). The poor results of this baseline indicate that the high quality flow [24] that we use is by itself insufficient to solve the video object segmentation task, and that our proposed convnet indeed does the heavy lifting.

The large fluctuation of the relative baseline results across the three datasets empirically confirms that each of them presents unique challenges.
Comparison. Compared to flow propagation methods such as BVS, N15, ObjFlow, and STV, we obtain better results because we build per-video a stronger appearance model of the tracked object (embodied in the fine-tuned model). Compared to convnet learning methods such as VPN, OSVOS, MaskTrack, and OnAVOS, we require significantly less training data, yet obtain better results.
Figure 7: LucidTracker single object segmentation qualitative results. Frames are sampled along the video duration (the percentages indicate the position within the video). Our model is robust to various challenges, such as view changes, fast motion, shape deformations, and out-of-view scenarios.
Figure 7 provides qualitative results of LucidTracker across the three datasets. Our system is robust to various challenges present in videos. It handles well camera view changes, fast motion, object shape deformation, out-of-view scenarios, multiple similar looking objects, and even low quality video. We provide a detailed error analysis in Section 5.5.
Conclusion.
We show that top results can be obtained while using less training data, which indicates that our lucid dreams leverage the available training data better. We report top results for this task while using only one annotated training frame per video.

5.3 Ablation studies

In this section we explore in more detail how the different ingredients contribute to our results.
Table 2 compares the effect of the different ingredients in the LucidTracker− training. Results are obtained using RGB and flow, with warping, no CRF, and no temporal coherency: M_t = f(I_t, w(M_{t−1}, F_t)).
In the bottom row ("onlyper-video tuning"), the model is trained per-video withoutImageNet pre-training nor per-dataset training, i.e. using a single annotated training frame . Our network is based onVGG16 [10] and contains ∼ M parameters, all effect-ively learnt from a single annotated image that is augmentedto become . k training samples (see Section 4). Even withsuch minimal amount of training data, we still obtain a sur-prisingly good performance (compare . on DAVIS toothers in Table 1). This shows how effective is, by itself, theproposed training strategy based on lucid dreaming of thedata. Pre-training & fine-tuning.
Pre-training & fine-tuning. We see that ImageNet pre-training does provide a few percent points of improvement (depending on the dataset of interest). Per-video fine-tuning (after per-dataset training) provides an additional gain of a couple of percent points. Both ingredients clearly contribute to the segmentation results.

Note that training a model using only per-video tuning takes about one full GPU day per video sequence, making these results insightful but not decidedly practical.

Preliminary experiments on DAVIS evaluating the impact of the different ingredients of our lucid dreaming data generation showed, depending on the exact setup, fluctuations of a few percent mIoU points between a basic version (e.g. without non-rigid deformations nor scene re-composition) and the full synthesis process described in Section 4. Having a sophisticated data generation process directly impacts the segmentation quality.
Table 2: Ablation study of training modalities (rows: LucidTracker− with and without ImageNet pre-training, per-dataset training, and per-video fine-tuning; columns: mIoU on DAVIS, YouTubeObjects, SegTrack v2). ImageNet pre-training and per-video tuning provide additional improvement over per-dataset training. Even with one frame annotation used only for per-video tuning we obtain good performance. See Section 5.3.1.
Conclusion.
Surprisingly, we discovered that per-video training from a single annotated frame already provides much of the information needed for the video object segmentation task. Additionally using ImageNet pre-training and per-dataset training provides complementary gains.
Table 3 shows the effect of optical flow on the LucidTracker results. Comparing our full system to the "No OF" row, we see that the effect of optical flow varies across datasets, from a minor improvement on YouTubeObjects to a major difference on SegTrack v2. On this last dataset, using mask warping is particularly useful too. We additionally explored tuning the optical flow stream per-video, which resulted in a minor improvement on DAVIS.

Our "No OF" results can be compared to OSVOS [6], which does not use optical flow. However OSVOS uses a per-frame mask post-processing based on a boundary detector (trained on further external data), which provides a gain of a few percent points. Accounting for this, our "No OF" (and no CRF, no temporal coherency) result matches theirs on DAVIS and YouTubeObjects despite using significantly less training data (see Table 1).

Table 4 shows the effect of using different optical flow estimation methods. For the LucidTracker results, FlowNet2.0 [24] was employed. We also explored using EpicFlow [56], as in [31]. Table 4 indicates that an optical flow estimation that is robust across datasets is crucial to the performance (FlowNet2.0 provides a consistent gain on each dataset). We found EpicFlow to be brittle when going across different datasets, providing an improvement on DAVIS and SegTrack v2 but under-performing on YouTubeObjects.
Table 3: Ablation study of flow ingredients (rows: LucidTracker and LucidTracker− with and without the image stream I, the flow stream F, and mask warping w; columns: mIoU on DAVIS, YouTubeObjects, SegTrack v2). Flow complements the image-only results, with large fluctuations across datasets. See Section 5.3.2.
Table 4: Effect of the optical flow estimation method (mIoU).

Variant                    Optical flow   DAVIS   YoutbObjs   SegTrckv2
LucidTracker−              FlowNet2.0      83.7      76.2        76.8
                           EpicFlow        80.2      71.3        67.0
                           No flow         78.0      74.7        61.8
No ImageNet pre-training   FlowNet2.0      82.0      74.3        71.2
                           EpicFlow        80.0      72.3        68.8
                           No flow         76.7      71.4        63.0
Conclusion.
The results show that flow provides a complementary signal to the RGB image alone, and that having an optical flow estimation that is robust across datasets is crucial. Despite its simplicity our fusion strategy (f_I + f_F) provides gains on all datasets and leads to competitive results.

As a final stage of our pipeline, we refine the generated mask per frame using DenseCRF [32]. This captures small image details that the network might have missed. It is known by practitioners that DenseCRF is quite sensitive to its parameters and can easily worsen results. We use our lucid dreams to enable automatic per-dataset CRF tuning.

Following [10] we employ a grid search scheme for tuning the CRF parameters. Once the per-dataset model is trained, we apply it over a subset of its training set (5 random images from the lucid dreams per video sequence), apply DenseCRF with the given parameters over this output, and then compare to the lucid dream ground truth.

The impact of the tuned DenseCRF post-processing parameters is shown in Table 5 and Figure 8. Table 5 indicates that without per-dataset tuning DenseCRF under-performs. Our automated tuning procedure allows us to obtain consistent gains without the need for case-by-case manual tuning. A sketch of this tuning procedure is given below.
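The following sketch outlines the automatic per-dataset CRF tuning described above. `model`, `lucid_samples`, and `apply_densecrf` are hypothetical placeholders (the trained per-dataset network, a sampler of synthesized image/mask pairs, and a DenseCRF wrapper, e.g. around the pydensecrf package); the parameter grid is illustrative and not the one used in the paper.

```python
import itertools
import numpy as np


def frame_iou(gt, pred):
    inter = np.logical_and(gt > 0, pred > 0).sum()
    union = np.logical_or(gt > 0, pred > 0).sum()
    return 1.0 if union == 0 else inter / union


def tune_crf_per_dataset(videos, model, lucid_samples, apply_densecrf):
    """Grid search of DenseCRF parameters on lucid-dream samples.

    The lucid dreams act as a proxy validation set: the trained per-dataset
    model is run on synthesized frames, refined with each candidate parameter
    set, and compared against the synthesized ground truth.
    """
    grid = {"pos_w": [3, 5], "pos_xy": [1, 3],              # Gaussian pairwise term
            "bi_w": [5, 10], "bi_xy": [50, 80], "bi_rgb": [3, 5]}  # bilateral term
    best_params, best_score = None, -1.0
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid, values))
        scores = [frame_iou(gt, apply_densecrf(img, model(img), params))
                  for video in videos
                  for img, gt in lucid_samples(video, n=5)]  # 5 lucid pairs per sequence
        if np.mean(scores) > best_score:
            best_score, best_params = float(np.mean(scores)), params
    return best_params
```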
Conclusion.
Using default DenseCRF parameters would degrade performance. Our lucid dreams enable automatic per-dataset CRF tuning, which further improves the results.
Table 5: Effect of CRF tuning (LucidTracker without temporal coherency, mIoU). Without the automated per-dataset tuning DenseCRF under-performs.

Method          CRF parameters       DAVIS   YoutbObjs   SegTrckv2
LucidTracker−   -                     83.7      76.2        76.8
LucidTracker    default               84.2      75.5        72.2
LucidTracker    tuned per-dataset

5.4 Additional experiments

Other than adding or removing ingredients, as in Section 5.3, we also want to understand how the training data itself affects the obtained results.
Table 6 explores segmentation quality as a function of the number of training samples. To see the training data effects more directly we use a base model with the RGB image I_t only (no flow F, no CRF, no temporal coherency) and per-dataset training (no ImageNet pre-training, no per-video fine-tuning). We evaluate on two disjoint subsets of DAVIS videos, where the first frames for per-dataset training are taken from only one subset. The reported numbers are thus comparable within Table 6, but not to the other tables in the paper. Table 6 reports results with a varying number of training videos and with/without including the first frames of each test video for per-dataset training. When excluding the test-set first frames, the image frames used for training are separate from the test videos, and we are thus operating across (related) domains. When including the test-set first frames, we operate in the usual LucidTracker mode, where the first frame of each test video is used to build the per-dataset training set.

Comparing the top and bottom parts of the table, we see that when the annotated images from the test-set video sequences are not included, segmentation quality drops drastically. Conversely, on the subset of videos for which the first frame annotation is used for training, the quality is much higher and improves as the training samples become more and more specific (in-domain) to the target video. Adding extra videos for training does not improve the performance: it is better to have models each trained and evaluated on a single video (row top-1-1) than to have one model trained over all test videos (row top-15-1).

Training with an additional frame from each video (we added the last frame of each train video) significantly boosts the resulting within-video quality (e.g. row top-30-2), because the training samples better cover the test domain.
Figure 8: Effect of CRF tuning. The shown DAVIS videos have the highest margin between with and without CRF post-processing (based on mIoU over the video).

Table 6: Varying the number of training videos. A smaller training set closer to the target domain is better than a larger one. See Section 5.4.1.
Conclusion.
These results show that, when using RGB information (I_t), increasing the number of training videos does not improve the resulting quality of our system. Even within a dataset, properly using the training sample(s) from within each video matters more than collecting more videos to build a larger training set.

Section 5.4.1 explored the effect of changing the volume of training data within one dataset; Table 7 compares results when using different datasets for training. Results are obtained using a base model with RGB and flow (M_t = f(I_t, M_{t−1}), no warping, no CRF, no temporal coherency), ImageNet pre-training, per-dataset training, and no per-video tuning, to accentuate the effect of the training dataset.

Table 7: Generalization across datasets (mIoU on DAVIS / YouTubeObjects / SegTrack v2, plus the mean). Underlined results are the best per dataset and italic ones the second best (ignoring the all-in-one setup); the recovered rows are "Second best": 67.0 / 52.2 / 52.0 (mean 57.1) and "All-in-one": 71.9 / 70.7 / 60.8 (mean 67.8). We observe a significant quality gap between training from the target videos and training from other datasets; see Section 5.4.2.

The best performance is obtained when training on the first frames of the target set. There is a noticeable drop of several percent points when moving to the second-best choice. Interestingly, when putting all the datasets together for training (the "all-in-one" row, a dataset-agnostic model) the results degrade, reinforcing the idea that "just adding more data" does not automatically improve performance.
Best results are obtained when using training data that focuses on the test video sequences; using similar datasets or combining multiple datasets degrades the performance of our system.

Table 8: Comparing the two-stream and one-stream convnet architectures (mIoU). See Section 5.4.3.

Section 3.1 and Figure 3 described two possible architectures to handle I_t and F_t. The previous experiments are all based on the two-stream architecture. Table 8 compares two streams versus one stream, where the network accepts 5 input channels (RGB + previous mask + flow magnitude) in one stream: M_t = f_{I+F}(I_t, ‖F_t‖, w(M_{t−1}, F_t)). Results are obtained using a base model with RGB and optical flow (no warping, no CRF, no temporal coherency), ImageNet pre-training, per-dataset training, and no per-video tuning.

We observe that the one-stream and the two-stream architecture with naive averaging perform on par. Using a one-stream network makes the training more affordable and allows the architecture to be expanded more easily with additional input channels.
Conclusion. The lighter one-stream network performs as well as the two-stream network. We will thus use the one-stream architecture in Section 6.

5.5 Error analysis

Table 9 presents an expanded evaluation on DAVIS using the evaluation metrics proposed in [49]. Three measures are used: region similarity in terms of intersection over union (J), contour accuracy (F, higher is better), and temporal instability of the masks (T, lower is better). We outperform the competitive methods of [31,6] on all three measures.

Table 10 reports the per-attribute evaluation as defined in DAVIS. LucidTracker is best on all 15 video attribute categories. This shows that our LucidTracker can handle the various video challenges present in DAVIS.

We present the per-sequence and per-frame results of LucidTracker over DAVIS in Figure 10. On the whole we observe that the proposed approach is quite robust: most video sequences reach a high average performance. However, by looking at the per-frame results for each video (blue dots in Figure 10) one can see several frames where our approach fails to correctly track the object. Investigating those cases closely, we notice conditions under which LucidTracker is more likely to fail; the same behaviour was observed across all three datasets. A few representative failure cases are visualized in Figure 9.

Since we are using only the mask annotation of the first frame for training the tracker, a clear failure case is caused by dramatic view point changes of the object with respect to its first frame appearance, as in row 5 of Figure 9. Performing online adaptation at regular time steps, exploiting the previous frame segments for data synthesis and marking unsure regions as ignore regions for training, similarly to [74], might resolve the problems caused by relying only on the first frame mask. The proposed approach also under-performs when recovering from occlusions: it may take several frames for the full object mask to re-appear (rows 1-3 in Figure 9). This is mainly due to the convnet having learnt to follow the previous frame mask. Augmenting the lucid dreams with plausible occlusions might help mitigate this case. Another failure case occurs when two similar looking objects cross each other, as in row 6 of Figure 9. Here both cues, the previous frame guidance and the appearance learnt via per-video tuning, are no longer discriminative enough to correctly continue propagating the mask.

We also observe that LucidTracker struggles to track fine structures or details of the object, e.g. the wheels of the bicycle or motorcycle in rows 1-2 of Figure 9. This is an issue of the underlying convnet architecture: due to the several pooling layers the spatial resolution is lost and hence the fine details of the object are missed. This issue can be mitigated by switching to more recent semantic labelling architectures (e.g. [52,9]).
Conclusion.
LucidTracker shows robust performance across different videos. However, a few failure cases were observed, due to the underlying convnet architecture, its training, or the limited visibility of the object in the first frame.
6 Multiple object segmentation results

We present here an empirical evaluation of LucidTracker for the multiple object segmentation task: given a first frame labelled with the masks of several object instances, the aim is to find the corresponding masks of the objects in future frames.

6.1 Experimental setup
Dataset.
For the multiple object segmentation task we use the 2017 DAVIS Challenge on Video Object Segmentation [54] (DAVIS17, http://davischallenge.org/challenge2017). Compared to DAVIS this is a larger, more challenging dataset, where the video sequences have multiple objects in the scene. Videos of DAVIS that have more than one visible object have been re-annotated (the objects were divided by semantics), and the train and val sets were extended with more sequences. In addition, two other test sets (test-dev and test-challenge) were introduced. The complexity of the videos has increased, with more distractors, occlusions, fast motion, smaller objects, and fine structures. Overall, DAVIS17 consists of 150 sequences, totalling 10 474 annotated frames.

Table 9: Comparison of video object segmentation results on the DAVIS benchmark, reporting region similarity (J: mean, recall, decay), boundary accuracy (F: mean, recall, decay) and temporal stability (T) for the box and grabcut oracles, methods that ignore the first frame annotation (e.g. PDB [62], ~18k training frames), and methods that use it (mask warping, OSVOS [6] ~2.3k, MaskTrack [31] ~11k, OnAVOS [74] ~120k, VideoGCRF [7] ~120k, and LucidTracker). Numbers in italic are computed on subsets of DAVIS. Our LucidTracker improves over previous results.

Table 10: DAVIS per-attribute evaluation, comparing BVS [41], ObjFlow [71], OSVOS [6], MaskTrack [31] and LucidTracker over the 15 video attributes (appearance change, background clutter, camera shake, deformation, dynamic background, edge ambiguity, fast motion, heterogeneous object, interacting objects, low resolution, motion blur, occlusion, out-of-view, scale variation, shape complexity). LucidTracker improves across all video object segmentation challenges.
We evaluate our method on the two test sets, test-dev and test-challenge, each consisting of 30 video sequences with multiple objects per sequence. For both test sets only the masks of the first frames are made public; the evaluation is done via an evaluation server. Our experiments and ablation studies are done on the test-dev set.
Evaluation metric. The accuracy of multiple object segmentation is evaluated using the region (J) and boundary (F) measures proposed by the organisers of the challenge. The average of the J and F measures is used as the overall performance score (denoted as global mean in the tables). Please refer to [54] for more details about the evaluation protocol.
Training details.
All experiments in this section are done using the single-stream architecture discussed in Sections 3.1 and 5.4.3. For training the models we use SGD with mini-batches of images, a fixed learning rate policy, and a small initial learning rate; momentum and weight decay are set to their standard values. All models are initialized with weights trained for image classification on ImageNet [61]. We then train per-video for 40k iterations.
Figure 9: Failure cases. Frames are sampled along the video duration (the percentages indicate the position within the video). For each dataset we show 2 out of the 5 worst results (based on mIoU over the video).

Figure 10: Per-sequence results on DAVIS.

6.2 Key results

Tables 11 and 12 present the results of the 2017 DAVIS Challenge on the test-dev and test-challenge sets [53]. Our main results for the multi-object segmentation challenge are obtained via an ensemble of four different models (f_I, f_{I+F}, f_{I+S}, f_{I+F+S}), see Section 3.1.

The proposed system, LucidTracker, provides the best segmentation quality on the test-dev set and shows competitive performance on the test-challenge set, holding the second place in the competition. The full system is trained using the standard ImageNet pre-training initialization, Pascal VOC12 semantic annotations for the S_t input, and one annotated frame per test video. As discussed in Section 6.3, even without S_t LucidTracker obtains competitive results (with only a minor difference, see Table 13 for details).

The top entry lixx [38] uses a deeper convnet model (an ImageNet pre-trained ResNet) and a similar segmentation architecture, trains it over external segmentation data (pixel-level annotated images from MS-COCO and Pascal VOC for pre-training and, akin to [6], fine-tuning on the DAVIS train and val sets), and extends it with a box-level object detector (trained over MS-COCO and Pascal VOC bounding boxes) and a box-level object re-identification model trained over additional box annotations (on both images and videos). We argue that our system reaches comparable results with a significantly lower amount of training data.
Figure 11: LucidTracker qualitative results on DAVIS17, test-dev set. Frames are sampled along the video duration (the percentages indicate the position within the video). The videos shown are those with the highest mIoU.

Figure 11 provides qualitative results of LucidTracker on the test-dev set. The video results include successful handling of multiple objects, full and partial occlusions, distractors, small objects, and out-of-view scenarios.
Conclusion.
We show that top results for multiple object segmentation can be achieved with an approach that focuses on exploiting as much as possible the available annotation of the first video frame, rather than relying heavily on large external training data.

6.3 Ablation study

Table 13 explores in more detail how the different ingredients contribute to our results. We see that adding extra information (channels) to the system, either the optical flow magnitude or the semantic segmentation, or both, does provide an improvement of a few percent points. The results show that leveraging semantic priors and motion information provides a complementary signal to the RGB image, and both ingredients contribute to the segmentation results.

Combining four different models in an ensemble (f_{I+F+S} + f_{I+F} + f_{I+S} + f_I) enhances the results even further, bringing an additional gain in global mean. Excluding the models which use semantic information (f_{I+F+S} and f_{I+S}) from the ensemble results in only a minor drop in performance. This shows that competitive results can be achieved even with a system trained with only one pixel-level mask annotation per video, without employing extra annotations from Pascal VOC12.

Our lucid dreams enable automatic CRF tuning (see Section 5.3.3), which further improves the results. Employing the proposed temporal coherency step (see Section 3.1) during inference brings an additional performance gain.
The results show that both the flow and the semantic priors provide a complementary signal to the RGB image alone. Despite its simplicity, our ensemble strategy provides an additional gain and leads to competitive results. Notice that even without the semantic segmentation signal S_t our ensemble result is competitive.

6.4 Error analysis

We present the per-sequence results of LucidTracker on DAVIS17 in Figure 12 (per-frame results are not available from the evaluation server). We observe that this dataset is significantly more challenging than DAVIS16 (compare to Figure 10), with only a fraction of the test videos reaching high mIoU. This shows that multiple object segmentation is a much more challenging task than segmenting a single object.

The failure cases discussed in Section 5.5 still apply to the multiple objects case.

Method             Rank  Global mean↑   J Mean↑  J Recall↑  J Decay↓   F Mean↑  F Recall↑  F Decay↓
sidc                10       45.8         43.9      51.5       34.3      47.8      53.6       36.9
YXLKJ                9       49.6         46.1      49.1       22.7      53.0      56.5       22.3
haamooon [59]        8       51.3         48.8      56.9        –         –         –          –
Fromandtozh [83]     7       55.2         52.4      58.4       18.1      57.9      66.1       20.0
ilanv [60]           6       55.8         51.9      55.7       17.6      59.8      65.8       18.9
voigtlaender [73]    5       56.5         53.4      57.8       19.9      59.6      65.4       19.0
lalalafine123        4       57.4         54.5      61.3       24.4      60.2      68.8       24.6
wangzhe              3       57.7         55.6      63.2       31.7      59.8      66.7       37.1
lixx [38]            2       66.1          –         –          –         –         –          –
LucidTracker         1        –            –         –          –         –         –          –

Table 11: Comparison of video object segmentation results on the DAVIS17 test-dev set. Our LucidTracker shows top performance.

Method             Rank  Global mean↑   J Mean↑  J Recall↑  J Decay↓   F Mean↑  F Recall↑  F Decay↓
zwrq0               10       53.6         50.5      54.9       28.0      56.7      63.5       30.4
Fromandtozh [83]     9       53.9         50.7      54.9       32.5      57.1      63.2       33.7
wasidennis           8       54.8         51.6      56.3       26.8      57.9      64.8       28.8
YXLKJ                7       55.8         53.8      60.1       37.7      57.8      62.1       42.9
cjc [12]             6       56.9         53.6      59.5       25.3      60.2      67.9       27.6
lalalafine123        6       56.9         54.8      60.7       34.4      59.1      66.7       36.1
voigtlaender [73]    5       57.7         54.8      60.8       31.0      60.5      67.2       34.7
haamooon [59]        4       61.5         59.8      71.0       21.9      63.2      74.6       23.7
vantam299 [36]       3       63.8         61.5      68.6        –         –         –          –
LucidTracker         2        –            –         –          –         –         –          –
lixx [38]            1        –            –         –          –         –         –          –

Table 12: Comparison of video object segmentation results on the DAVIS17 test-challenge set. Our LucidTracker shows competitive performance, holding the second place in the competition.
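As a reading aid for Tables 11 and 12: J is the region (Jaccard) measure and F the boundary measure of the DAVIS benchmark [49,54], and the global mean used for ranking is, in essence, the average of the two measures taken over all annotated object instances. Under this reading, for a predicted mask M and ground-truth mask G:

\[
\mathcal{J}(M, G) = \frac{|M \cap G|}{|M \cup G|}, \qquad
\text{Global mean} = \tfrac{1}{2}\left(\overline{\mathcal{J}} + \overline{\mathcal{F}}\right),
\]

where \(\overline{\mathcal{J}}\) and \(\overline{\mathcal{F}}\) denote the region and boundary scores averaged over all object instances of the test set.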
Variant                   I  F  S  ensemble  CRF tuning  temp. coherency
LucidTracker (ensemble)   ✓  ✓  ✓     ✓          ✓             ✓
                          ✓  ✓  ✓     ✓          ✓             ✗
                          ✓  ✓  ✓     ✓          ✗             ✗
                          ✓  ✓  ✗     ✓          ✓             ✗
                          ✓  ✓  ✗     ✓          ✗             ✗
                          ✓  ✓  ✓     ✗          ✓             ✗
I+F+S                     ✓  ✓  ✓     ✗          ✗             ✗
I+F                       ✓  ✓  ✗     ✗          ✗             ✗
I+S                       ✓  ✗  ✓     ✗          ✗             ✗
I                         ✓  ✗  ✗     ✗          ✗             ✗

Table 13: Ablation study of different ingredients on the DAVIS17 test-dev and test-challenge sets (scores reported as global mean, mIoU, and mF).

Figure 12: Per-sequence results (mIoU per video sequence) on DAVIS17, test-dev set.
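Figure 12 reports one mIoU value per sequence. For concreteness, the sketch below shows how such a per-video mean IoU can be computed; it is an illustrative reimplementation only (the function name and the exact averaging over objects and frames are our assumptions, not the official DAVIS evaluation code).

```python
import numpy as np

def video_miou(pred_masks, gt_masks):
    """Mean intersection-over-union over all frames and object ids of a video.

    pred_masks, gt_masks: lists of integer label maps (H, W), one per frame,
    where 0 is background and 1..N are object ids.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        for obj_id in np.unique(gt):
            if obj_id == 0:                      # skip the background label
                continue
            p, g = (pred == obj_id), (gt == obj_id)
            union = np.logical_or(p, g).sum()
            if union > 0:
                ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```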
Figure 13: LucidTracker failure cases on DAVIS17, test-dev set. Frames are sampled at 20%, 40%, 60%, 80%, and 100% of the video duration. We show 2 results with an mIoU over the video below 50.

Additionally, on DAVIS17 we observe a clear failure case when segmenting similar-looking object instances: the object appearance is not discriminative enough to correctly track the object, resulting in label switches or in the label bleeding onto other look-alike objects. Figure 13 illustrates this case. This issue could be mitigated by using object-level instance identification modules, as in [38], or by changing the training loss of the model to more severely penalize identity switches.
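One way such a heavier penalty on identity switches could be realized is a weighted cross-entropy in which confusing one object identity for another costs more than a plain foreground/background mistake. The sketch below is purely hypothetical and not part of LucidTracker; the weighting scheme and the switch_weight parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def identity_switch_loss(logits, target, switch_weight=4.0):
    """Cross-entropy that up-weights pixels where the wrong *object id* is
    predicted (hypothetical sketch, not the loss used in this work).

    logits: (B, num_labels, H, W) raw scores, label 0 = background.
    target: (B, H, W) integer ground-truth labels.
    """
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    pred = logits.argmax(dim=1)
    # An identity switch: prediction and ground truth are both objects (>0)
    # but disagree on which object it is.
    switch = (pred != target) & (pred > 0) & (target > 0)
    weights = torch.where(switch,
                          torch.full_like(per_pixel, switch_weight),
                          torch.ones_like(per_pixel))
    return (weights * per_pixel).mean()
```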
Conclusion.
In the multiple object case, the LucidTracker results remain robust across different videos. Since the overall results are lower than for the single object segmentation case, there is more room for future improvement in the multiple object pixel-level segmentation task.
7 Conclusion

We have described a new convnet-based approach for pixel-level object segmentation in videos. In contrast to previous work, we show that top results for single and multiple object segmentation can be achieved without requiring external training datasets (neither annotated images nor videos). Even more, our experiments indicate that it is not always beneficial to use additional training data: synthesizing training samples close to the test domain is more effective than adding more training samples from related domains.

Our extensive analysis decomposed the ingredients that contribute to our improved results, indicating that our new training strategy and the way we leverage additional cues such as semantic and motion priors are key.

Showing that training a convnet for video object segmentation can be done with only a few training samples changes the mindset regarding how much general knowledge about objects is required to approach this problem [31,28], and more broadly how much training data is required to train large convnets depending on the task at hand.

We hope these new results will fuel the ongoing evolution of convnet techniques for single and multiple object segmentation in videos.

Acknowledgements
Eddy Ilg and Thomas Brox acknowledge funding by theDFG Grant BR 3815/7-1.
References
1. Ç. Aytekin, E. C. Ozan, S. Kiranyaz, and M. Gabbouj. Visual saliency by extended quantum cuts. In ICIP, 2015.
2. A. Bansal, X. Chen, B. Russell, A. Gupta, and D. Ramanan. Pixelnet: Representation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506, 2017.
3. L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. arXiv:1606.09549, 2016.
4. F. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. PAMI, 1989.
5. M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In ICCV, 2009.
6. S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixe, D. Cremers, and L. V. Gool. One-shot video object segmentation. In CVPR, 2017.
7. S. Chandra, C. Couprie, and I. Kokkinos. Deep spatio-temporal random fields for efficient video segmentation. In CVPR, 2018.
8. J. Chang, D. Wei, and J. W. Fisher. A video representation using temporal superpixels. In CVPR, 2013.
9. L. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
10. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016.
11. W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischinski, D. Cohen-Or, and B. Chen. Synthesizing training images for boosting human 3d pose estimation. In 3D Vision (3DV), 2016.
12. J. Cheng, S. Liu, Y.-H. Tsai, W.-C. Hung, S. Gupta, J. Gu, J. Kautz, S. Wang, and M.-H. Yang. Learning to segment instances in videos with spatial propagation network. CVPR Workshops, 2017.
13. A. Criminisi, P. Perez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. Trans. Img. Proc., 2004.
14. M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Convolutional features for correlation filter based visual tracking. In ICCV Workshop, 2015.
15. A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In ICCV, 2015.
16. M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015.
17. A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In BMVC, 2014.
18. G. Georgakis, A. Mousavian, A. C. Berg, and J. Kosecka. Synthesizing training data for object detection in indoor scenes. arXiv:1702.07836, 2017.
19. X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
20. M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, 2010.
21. A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016.
22. D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In ECCV, 2016.
23. J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, 2012.
24. E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
25. J. Luiten, P. Voigtlaender, and B. Leibe. PReMVOS: Proposal-generation, refinement and merging for video object segmentation. In ACCV, 2018.
26. S. D. Jain and K. Grauman. Supervoxel-consistent foreground propagation in video. In ECCV, 2014.
27. S. D. Jain and K. Grauman. Click carving: Segmenting objects in video with point clicks. In HCOMP, 2016.
28. S. D. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. arXiv:1701.05384, 2017.
29. V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. arXiv:1612.05478, 2016.
30. A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. CVPR Workshops, 2017.
31. A. Khoreva, F. Perazzi, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. arXiv:1612.02646, 2016.
32. P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with gaussian edge potentials. In NIPS, 2011.
33. M. Kristan, J. Matas, et al. The visual object tracking VOT2015 challenge results. In ICCV Workshop, 2015.
34. M. Kristan, J. Matas, et al. The visual object tracking VOT2016 challenge results. In ECCV Workshop, 2016.
35. M. Kristan, R. Pflugfelder, et al. The visual object tracking VOT2014 challenge results. In ECCV Workshop, 2014.
36. T. N. Le, K. T. Nguyen, M. H. Nguyen-Phan, T. V. Ton, T. A. Nguyen, X. S. Trinh, Q. H. Dinh, V. T. Nguyen, A. D. Duong, A. Sugimoto, T. V. Nguyen, and M. T. Tran. Instance re-identification flow for video object segmentation. CVPR Workshops, 2017.
37. F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In ICCV, 2013.
38. X. Li, Y. Qi, Z. Wang, K. Chen, Z. Liu, J. Shi, P. Luo, C. C. Loy, and X. Tang. Video object segmentation with re-identification. CVPR Workshops, 2017.
39. G. Lin, A. Milan, C. Shen, and I. D. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. arXiv:1611.06612, 2016.
40. C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In ICCV, 2015.
41. N. Maerki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bilateral space video segmentation. In CVPR, 2016.
42. N. Nagaraja, F. Schmidt, and T. Brox. Video segmentation with just a few strokes. In ICCV, 2015.
43. H. Nam, M. Baek, and B. Han. Modeling and propagating cnns in a tree structure for visual tracking. arXiv:1608.07242, 2016.
44. H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
45. N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
46. A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In ICCV, 2013.
47. D. Park and D. Ramanan. Articulated pose estimation with tiny synthetic videos. In CVPR Workshop, 2015.
48. D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. arXiv:1612.06370, 2016.
49. F. Perazzi, J. Pont-Tuset, B. McWilliams, L. V. Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
50. F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung. Fully connected object proposals for video segmentation. In ICCV, 2015.
51. L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In CVPR, 2012.
52. T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017.
53. J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. DAVIS challenge on video object segmentation 2017. http://davischallenge.org/challenge2017.
54. J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675, 2017.
55. A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012.
56. J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015.
57. S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
58. C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.
59. A. Shaban, A. Firl, A. Humayun, J. Yuan, X. Wang, P. Lei, N. Dhanda, B. Boots, J. M. Rehg, and F. Li. Multiple-instance video segmentation with sequence-specific object proposals. CVPR Workshops, 2017.
60. G. Sharir, E. Smolyansky, and I. Friedman. Video object segmentation using tracked object proposals. CVPR Workshops, 2017.
61. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
62. H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam. Pyramid dilated deeper convlstm for video salient object detection. In ECCV, 2018.
63. T. V. Spina and A. X. Falcão. Fomtrace: Interactive video segmentation by image graphs and fuzzy object models. arXiv:1606.03369, 2016.
64. J. Sun, J. Jia, C.-K. Tang, and H.-Y. Shum. Poisson matting. In SIGGRAPH, 2004.
65. K. Tang, V. Ramanathan, L. Fei-fei, and D. Koller. Shifting weights: Adapting object detectors from image to video. In NIPS, 2012.
66. M. Tang, D. Marin, I. Ben Ayed, and Y. Boykov. Normalized cut meets MRF. In ECCV, 2016.
67. S. Tang, M. Andriluka, A. Milan, K. Schindler, S. Roth, and B. Schiele. Learning people detectors for tracking in crowded scenes. In ICCV, 2013.
68. R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. arXiv:1605.05863, 2016.
69. P. Tokmakov, K. Alahari, and C. Schmid. Learning motion patterns in videos. arXiv:1612.07217, 2016.
70. P. Tokmakov, K. Alahari, and C. Schmid. Learning video object segmentation with visual memory. In ICCV, 2017.
71. Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In CVPR, 2016.
72. G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. arXiv:1701.01370.
73. P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for the 2017 DAVIS challenge on video object segmentation. CVPR Workshops, 2017.
74. P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017.
75. T. Vojir and J. Matas. Pixel-wise object segmentations for the VOT 2016 dataset. Research report, 2017.
76. L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV, 2015.
77. T. Wang, B. Han, and J. Collomosse. Touchcut: Fast image and video segmentation using single-touch interaction. CVIU, 2014.
78. W. Wang and J. Shen. Super-trajectory for video segmentation. arXiv:1702.08634, 2017.
79. X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
80. Z. Wu, C. Shen, and A. van den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. arXiv:1611.10080, 2016.
81. F. Xiao and Y. J. Lee. Track and segment: An iterative unsupervised approach for video object proposals. In CVPR, 2016.
82. J. J. Yu, A. W. Harley, and K. G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. arXiv:1608.05842, 2016.
83. H. Zhao. Some promising ideas about multi-instance video segmentation. CVPR Workshops, 2017.
84. H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
85. Y. Zhu, Z. Lan, S. Newsam, and A. G. Hauptmann. Guided optical flow learning. arXiv:1702.02295.