YouTube-VOS: Sequence-to-Sequence Video Object Segmentation
Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, Thomas Huang
Adobe Research, USA: {nxu, bprice, scohen}@adobe.com
Snapchat Research, USA: {linjie.yang, jianchao.yang}@snap.com
University of Illinois at Urbana-Champaign, USA: {yuchenf4, dyue2, yliang35, t-huang1}@illinois.edu

Abstract.
Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods that capture temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions for the problem. End-to-end sequential learning to explore spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets, i.e., even the largest video segmentation dataset only contains 90 short video clips. To solve this problem, we build a new large-scale video object segmentation dataset called the YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 3,252 YouTube video clips and 78 categories including common objects and human activities. To our knowledge this is by far the largest video object segmentation dataset, and we have released it at https://youtube-vos.org. Based on this dataset, we propose a novel sequence-to-sequence network to fully exploit long-term spatial-temporal information in videos for segmentation. We demonstrate that our method achieves the best results on our YouTube-VOS test set and results comparable to the current state-of-the-art methods on DAVIS 2016. Experiments show that the large-scale dataset is indeed a key factor in the success of our model.

Keywords: Video Object Segmentation, Large-scale Dataset, Spatial-Temporal Information.
Learning effective spatial-temporal features has been demonstrated to be very important for many video analysis tasks. For example, Donahue et al. [10] propose a long-term recurrent convolutional network for activity recognition and video captioning. Srivastava et al. [38] propose unsupervised learning of video representations with an LSTM autoencoder. Tran et al. [42] develop a 3D convolutional network to extract spatial and temporal information jointly from a video. Other works include learning spatial-temporal information for precipitation prediction [46], physical interaction [14], and autonomous driving [47].

Video segmentation plays an important role in video understanding, which fosters many applications, such as accurate object segmentation and tracking, interactive video editing and augmented reality. Video object segmentation, which targets segmenting a particular object instance throughout the entire video sequence given only the object mask on the first frame, has attracted much attention from the vision community recently [6,32,50,8,11,22,41,19,44]. However, existing state-of-the-art video object segmentation approaches primarily rely on single-image segmentation frameworks [6,32,50,44]. For example, Caelles et al. [6] propose to train an object segmentation network on static images and then fine-tune the model on the first frame of a test video over hundreds of iterations, so that it remembers the object appearance. The fine-tuned model is then applied to all following individual frames to segment the object without using any temporal information. Even though simple, such an online learning or one-shot learning scheme achieves top performance on video object segmentation benchmarks [33,21]. Although some recent approaches [11,8,41] have been proposed to leverage temporal consistency, they depend on models pretrained on other tasks such as optical flow [20,35] or motion segmentation [40] to extract temporal information. These pretrained models are learned on separate tasks and are therefore suboptimal for the video segmentation problem.

Learning long-term spatial-temporal features directly for the video object segmentation task is, however, largely limited by the scale of existing video object segmentation datasets. For example, the popular benchmark dataset DAVIS [34] has only 90 short video clips, which is barely sufficient to learn an end-to-end model from scratch as in other video analysis tasks. Even if we combine all the videos from available datasets [21,13,26,4,30,15], their scale is still far smaller than that of other video analysis datasets such as YouTube-8M [1] and ActivityNet [17]. To solve this problem, we present the first large-scale video object segmentation dataset, called YouTube-VOS (YouTube Video Object Segmentation dataset), in this work. Our dataset contains 3,252 YouTube video clips featuring 78 categories covering common animals, vehicles, accessories and human activities (these are the statistics at the time of submission; see our website for updated statistics). Each video clip is about 3~6 seconds long.

Table 1: Scale comparison between YouTube-VOS and existing datasets. "Annotations" denotes the total number of object annotations. "Duration" denotes the total duration (in minutes) of the annotated videos.
Scale         JC [13]   ST [26]   YTO [21]   FBMS [30]   DAVIS [33]   DAVIS [34]   YouTube-VOS (Ours)
Videos        22        14        96         59          50           90           3,252
Categories    14        11        10         16          -            -            78
Objects       22        24        96         139         50           205
Annotations   6,331     1,475     1,692      1,465       3,440        13,543       133,886
Duration      3.52      0.59      9.01       7.70        2.88         5.17
Based on YouTube-VOS, we propose a new sequence-to-sequence learning algorithm to explore spatial-temporal modeling for video object segmentation. We utilize a convolutional LSTM [46] to learn long-term spatial-temporal information for segmentation. At each time step, the convolutional LSTM accepts the last hidden states and an encoded image frame, and outputs encoded spatial-temporal features which are decoded into a segmentation mask. Our algorithm is different from existing approaches in that it fully exploits long-term spatial-temporal information in an end-to-end manner and does not depend on existing optical flow or motion segmentation models. We evaluate our algorithm on both YouTube-VOS and DAVIS 2016, and it achieves better or comparable results compared to the current state of the art.

The rest of our paper is organized as follows. In Section 2 we briefly introduce the related work. In Sections 3 and 4 we describe our YouTube-VOS dataset and the proposed algorithm in detail. Experimental results are presented in Section 5. Finally we conclude the paper in Section 6.
In the past decades, several datasets [21,13,26,4,30,15] have been created for video object segmentation. All of them are small in scale, usually containing only dozens of videos. In addition, their video content is relatively simple (e.g. no heavy occlusion, camera motion or illumination change) and sometimes the video resolution is low. Recently, a new dataset called DAVIS [33,34] was published and has become the benchmark dataset in this area. Its 2016 version contains 50 videos with a single foreground object per video, while the 2017 version has 90 videos with multiple objects per video. In comparison to previous datasets [21,13,26,4,30,15], DAVIS has higher-quality video resolutions and annotations. In addition, its video content is more complicated, with multi-object interactions, camera motion, and occlusions.

Early methods [21,28,12,31,5] for video object segmentation often solve spatial-temporal graph structures with hand-crafted energy terms, which are usually associated with features including appearance, boundary, motion and optical flow. Recently, deep-learning based methods have been proposed owing to their great success in image segmentation tasks [36,7,49,48].
Most of these methods [6,32,8,11,50,44] build their models on an image segmentation network and do not involve sequential modeling. Online learning [6] is commonly used to improve their performance. To make the model temporally consistent, the predicted mask of the previous frame is used as guidance in [32,50,19]. Other methods have been proposed to leverage spatial-temporal information. Jampani et al. [22] use spatial-temporal consistency to propagate object masks over time. Tokmakov et al. [41] use a two-stream network to model objects' appearance and motion, and use a recurrent layer to capture the evolution. However, due to the lack of training videos, they use a pretrained motion segmentation model [40] and optical-flow model [20], which leads to suboptimal results since the model is not trained end-to-end to best capture spatial-temporal features.

To create our dataset, we first carefully select a set of object categories including animals (e.g. ant, eagle, goldfish, person), vehicles (e.g. airplane, bicycle, boat, sedan), accessories (e.g. eyeglass, hat, bag), common objects (e.g. potted plant, knife, sign, umbrella), and humans in various activities (e.g. tennis, skateboarding, motorcycling, surfing). The videos containing human activities have diversified appearance and motion, so instead of treating human videos as one class, we divide different activities into different categories. Most of these videos contain interactions between a person and a corresponding object, such as a tennis racket, skateboard, motorcycle, etc. The entire category set includes 78 categories that cover diverse objects and motions, and should be representative of everyday scenarios.

We then collect many high-resolution videos with the selected category labels from the large-scale video classification dataset YouTube-8M [1]. This dataset consists of millions of YouTube videos associated with more than 4,700 visual entities. We utilize its category annotations to retrieve candidate videos that we are interested in. Specifically, up to 100 videos are retrieved for each category in our segmentation category set. There are several advantages to using YouTube videos to create our segmentation dataset. First, YouTube videos have very diverse object appearances and motions. Challenging cases for video object segmentation, such as occlusions, fast object motions and changes of appearance, commonly exist in YouTube videos. Second, YouTube videos are taken by both professionals and amateurs, so different levels of camera motion appear in the crawled videos. Algorithms trained on such data could potentially handle camera motion better and thus are more practical. Last but not least, many YouTube videos are taken by today's smart phone devices, and there is a demanding need to segment objects in those videos for applications such as video editing and augmented reality.

Since the retrieved videos are usually long (several minutes) and have shot transitions, we use an off-the-shelf video shot detection algorithm (http://johmathe.name/shotdetect.html) to automatically partition each video into multiple video clips. We first remove the clips from the first and last 10% of the video, since these clips have a high chance of containing introductory subtitles and credit lists.

Fig. 1: The ground-truth annotations of sample video clips in our dataset. Different objects are highlighted with different colors.
We then sample up to five clips of appropriate lengths (3~6 seconds) per video, keeping only clips with good visual quality (e.g. no scene transition, and not too dark, shaky, or blurry). After the video clips are collected, we ask human annotators to select up to five objects of proper sizes and categories per video clip and carefully annotate them (by tracing their boundaries instead of drawing rough polygons) every five frames of a 30fps frame rate, which results in a 6fps sampling rate. Given a video and its category, annotators are first required to annotate objects belonging to that category. If the video contains other objects that belong to our 78 categories, we ask the annotators to label them as well, so that each video has multiple objects annotated. In human activity videos, both the human subject and the object he/she interacts with are labeled, e.g., both the person and the skateboard are required to be labeled in a "skateboarding" video. Some annotation examples are shown in Figure 1. Unlike the dense per-frame annotation in previous datasets [13,33,34], we believe that the temporal correlation between five consecutive frames is sufficiently strong that annotations can be omitted for intermediate frames to reduce the annotation effort. Such a skip-frame annotation strategy allows us to scale up the number of videos and objects under the same annotation budget, which are important factors for better performance. We find empirically that our dataset is effective in training different segmentation algorithms.

As a result, our dataset YouTube-VOS consists of 3,252 YouTube video clips and 133,886 object annotations, 33 and 10 times more than the best of the existing video object segmentation datasets, respectively (see Table 1). YouTube-VOS is the largest dataset for video object segmentation to date.
Based on our new dataset, we propose a new sequence-to-sequence video object segmentation algorithm. Different from existing approaches, our algorithm learns long-term spatial-temporal features directly from training data in an end-to-end manner, and the offline-trained model is capable of propagating an initial object segmentation mask accurately by memorizing and updating the object characteristics, including appearance, location and scale, and temporal movements, automatically over the entire video sequence.

Let us denote a video sequence with $T$ frames as $\{x_t \mid t \in [0, T-1]\}$, where $x_t \in \mathbb{R}^{H \times W \times 3}$ is the RGB frame at time step $t$, and denote the initial binary object mask at time step 0 as $y_0 \in \mathbb{R}^{H \times W}$. The target of video object segmentation is to predict the object mask automatically for the remaining frames from time step 1 to $T-1$, i.e. $\{\hat{y}_t \mid t \in [1, T-1]\}$. To obtain a predicted mask $\hat{y}_t$ for $x_t$, many existing deep learning methods only leverage information at time step 0 (e.g. online learning or one-shot learning [6]) or at time step $t-1$ (e.g. optical flow [32]), while the long-term history information is totally dismissed. Their frameworks can be formulated as $\hat{y}_t = \arg\max_{\bar{y}_t} P(\bar{y}_t \mid x_0, y_0, x_t)$ or $\hat{y}_t = \arg\max_{\bar{y}_t} P(\bar{y}_t \mid x_0, y_0, x_t, x_{t-1})$. They are effective when the object appearance is similar between time 0 and time $t$, or when the object motion from time $t-1$ to $t$ can be accurately measured. However, these assumptions are violated when the object has drastic appearance variation and rapid motion, which is often the case in many real-world videos. In such cases, the history information of the object in all previous frames becomes critical and should be leveraged in an effective way. Therefore, we propose to solve a different objective function, i.e. $\hat{y}_t = \arg\max_{\bar{y}_t} P(\bar{y}_t \mid x_0, x_1, \ldots, x_t, y_0)$, which can be transformed into a sequence-to-sequence learning problem.

Recurrent Neural Networks (RNNs) have been adopted for many sequence-to-sequence learning problems because they are capable of learning long-term dependency from sequential data. LSTM [18], a special RNN structure, alleviates the vanishing and exploding gradient issues [3]. A convolutional variant of LSTM (convolutional LSTM) [46] was later proposed to preserve the spatial information of the data in the hidden states of the model. Our algorithm is inspired by the convolutional encoder-decoder LSTM structure [9,39], which has achieved much success in machine translation, where an input sentence in language A is first encoded by an encoder LSTM and its outputs are fed into a decoder LSTM which generates the desired output sentence in language B. In video object segmentation, it is essential to capture the object characteristics over time.
To generate the initial states for our convolutional LSTM (ConvLSTM), we use a feed-forward neural network to encode both the first image frame and the segmentation mask. Specifically, we concatenate the initial frame $x_0$ and segmentation mask $y_0$ and feed them into a trainable network, denoted as Initializer, which outputs the initial memory state $c_0$ and hidden state $h_0$. These initial states capture object appearance, location and scale, and they are fed into ConvLSTM for sequence learning.
Fig. 2: The framework of our algorithm. The initial information at time 0 is encoded by Initializer to initialize ConvLSTM. The new frame at each time step is processed by Encoder and the segmentation result is decoded by Decoder. ConvLSTM is automatically updated over the entire video sequence.

At time step $t$, frame $x_t$ is first processed by a convolutional encoder, denoted as Encoder, to extract feature maps $\tilde{x}_t$. Then $\tilde{x}_t$ is sent as the input of ConvLSTM. The internal states $c_t$ and $h_t$ are automatically updated given the new observation $\tilde{x}_t$, and capture the new characteristics of the object. The output $h_t$ is passed into a convolutional decoder, denoted as Decoder, to get the full-resolution segmentation result $\hat{y}_t$. A binary cross-entropy loss is computed between $\hat{y}_t$ and $y_t$ during training. The entire model is trained end-to-end using back-propagation to learn the parameters of the Initializer, Encoder, Decoder and ConvLSTM networks. Figure 2 illustrates our sequence learning algorithm for video object segmentation. The learning process can be formulated as follows:

$c_0, h_0 = \mathrm{Initializer}(x_0, y_0)$  (1)
$\tilde{x}_t = \mathrm{Encoder}(x_t)$  (2)
$c_t, h_t = \mathrm{ConvLSTM}(\tilde{x}_t, c_{t-1}, h_{t-1})$  (3)
$\hat{y}_t = \mathrm{Decoder}(h_t)$  (4)
$\mathcal{L} = -\big(y_t \log(\hat{y}_t) + (1 - y_t)\log(1 - \hat{y}_t)\big)$  (5)

Model structures
Both our Initializer and Encoder use VGG-16 [37] network structures. In particular, all the convolution layers and the first fully connected layer of VGG-16 are used as the backbone of the two networks, with the fully connected layer transformed into a 1x1 convolution layer. Initializer has two additional convolution layers with ReLU [29] activation to produce $c_0$ and $h_0$ respectively; each of these convolution layers has 512 1x1 kernels. Encoder has one additional convolution layer with ReLU activation which also has 512 1x1 kernels. Initializer and Encoder are initialized with pre-trained VGG-16 parameters, while the other layers are randomly initialized with Xavier initialization [16]. All the convolution operations of the ConvLSTM layer use 512 3x3 kernels. Decoder has five upsampling layers with 5x5 kernels, and its last layer produces the segmentation result with one 5x5 kernel.
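For concreteness, the following is a minimal PyTorch-style sketch of the four components and the recurrence in Eqs. (1)-(4). It is an illustration only, not the authors' implementation: the small strided-convolution backbone stands in for the VGG-16 encoders, and all layer sizes other than those stated in the text are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell with 3x3 kernels on 512-channel feature maps (Eq. 3)."""
    def __init__(self, channels=512):
        super().__init__()
        # input, forget, output gates and candidate state are computed jointly
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size=3, padding=1)

    def forward(self, x, c, h):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return c, h

def tiny_backbone(in_ch, out_ch=512):
    """Stand-in for the VGG-16 backbone: five stride-2 stages (overall stride 32)."""
    layers, ch = [], in_ch
    for c in (64, 128, 256, 512, out_ch):
        layers += [nn.Conv2d(ch, c, 3, stride=2, padding=1), nn.ReLU()]
        ch = c
    return nn.Sequential(*layers)

class Seq2SeqVOS(nn.Module):
    def __init__(self):
        super().__init__()
        # Initializer sees frame + mask (4 channels) and emits c_0 and h_0 (Eq. 1)
        self.initializer = nn.Sequential(tiny_backbone(4), nn.Conv2d(512, 1024, 1), nn.ReLU())
        self.encoder = tiny_backbone(3)                   # Eq. (2)
        self.convlstm = ConvLSTMCell(512)                 # Eq. (3)
        up, ch = [], 512
        for _ in range(5):                                # five upsampling stages (Eq. 4)
            up += [nn.Upsample(scale_factor=2), nn.Conv2d(ch, 64, 5, padding=2), nn.ReLU()]
            ch = 64
        self.decoder = nn.Sequential(*up, nn.Conv2d(64, 1, 5, padding=2))

    def forward(self, frames, y0):
        """frames: (B, T, 3, H, W); y0: (B, 1, H, W) initial mask. Returns (B, T-1, 1, H, W)."""
        c, h = torch.chunk(self.initializer(torch.cat([frames[:, 0], y0], dim=1)), 2, dim=1)
        masks = []
        for t in range(1, frames.size(1)):
            x_t = self.encoder(frames[:, t])
            c, h = self.convlstm(x_t, c, h)
            logits = self.decoder(h)
            masks.append(torch.sigmoid(F.interpolate(logits, size=frames.shape[-2:])))
        return torch.stack(masks, dim=1)
```

The loss of Eq. (5) can then be computed with `torch.nn.functional.binary_cross_entropy` between the predicted and ground-truth masks.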
Training
Our algorithm is trained on the YouTube-VOS training set. At each training iteration, we first randomly sample an object and T (5 or more) consecutive frames from a training video. Frames are resized to a fixed resolution with a width of 448 pixels for memory and speed reasons. At the early stage of training, we only select frames with ground-truth annotations as training samples so that the training loss can be computed and back-propagated at each time step. When the training loss becomes stable, we add frames without annotations to the training data; for those frames the loss is set to 0. Adam [24] is used to train our network, and our model converges in 80 epochs.
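A sketch of one training iteration as described above, reusing the hypothetical Seq2SeqVOS module from the previous sketch; the per-frame flag `has_gt` zeroes out the loss on frames without annotations, and the learning rate shown is a placeholder, not a value from the paper.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, frames, y0, targets, has_gt):
    """frames: (B, T, 3, H, W); y0: (B, 1, H, W); targets: (B, T-1, 1, H, W);
    has_gt: (B, T-1) float tensor, 1 for annotated frames and 0 otherwise."""
    preds = model(frames, y0)                                   # mask probabilities
    # per-frame binary cross-entropy (Eq. 5), zeroed for unannotated frames
    bce = F.binary_cross_entropy(preds, targets, reduction="none").mean(dim=(2, 3, 4))
    loss = (bce * has_gt).sum() / has_gt.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# model = Seq2SeqVOS()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # learning rate is illustrative
```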
Inference
Our offline-trained model learns features of general object characteristics effectively, and produces good segmentation results when directly applied to a new test video with unseen categories. This is in contrast to recent state-of-the-art approaches, which have to fine-tune their models on each new test video over hundreds of iterations. In our experiments, we show that our algorithm without online learning achieves comparable or better results than previous state-of-the-art methods with online learning, which implies much faster inference for practical applications. Nevertheless, we find that the performance of our model can be further improved with online learning.
Online Learning
Given a test video, we generate random pairs of online training examples $\{(x'_0, y'_0), (x'_1, y'_1)\}$ through affine transformations of $(x_0, y_0)$. We treat $(x'_0, y'_0)$ as the initial frame and mask, and $(x'_1, y'_1)$ as the first frame and its ground-truth mask. We then fine-tune our Initializer, Encoder and Decoder networks on such randomly generated pairs. The parameters of ConvLSTM are fixed, as it models long-term spatial-temporal dependency that should be independent of object categories.
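The following sketch illustrates this online fine-tuning step under the same assumptions as the earlier sketches: pairs are synthesized from the test video's first frame and mask with random affine warps (torchvision here, purely for illustration), and only the Initializer, Encoder and Decoder are updated while ConvLSTM stays frozen. The warp ranges, step count and learning rate are placeholders.

```python
import random
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def random_affine_pair(x0, y0):
    """Apply one random affine warp to a frame and its mask (nearest interpolation for the mask)."""
    angle = random.uniform(-15.0, 15.0)
    translate = [random.randint(-20, 20), random.randint(-20, 20)]
    scale = random.uniform(0.9, 1.1)
    x = TF.affine(x0, angle=angle, translate=translate, scale=scale, shear=0.0)
    y = TF.affine(y0, angle=angle, translate=translate, scale=scale, shear=0.0,
                  interpolation=InterpolationMode.NEAREST)
    return x, y

def online_finetune(model, x0, y0, steps=100, lr=1e-5):
    """x0: (B, 3, H, W) first frame; y0: (B, 1, H, W) first-frame mask."""
    for p in model.convlstm.parameters():       # ConvLSTM is kept fixed
        p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    for _ in range(steps):
        xa, ya = random_affine_pair(x0, y0)     # plays the role of the initial frame/mask
        xb, yb = random_affine_pair(x0, y0)     # plays the role of the frame to segment
        frames = torch.stack([xa, xb], dim=1)   # (B, 2, 3, H, W)
        pred = model(frames, ya)[:, 0]          # prediction for the second frame
        loss = F.binary_cross_entropy(pred, yb)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```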
In this section, we first evaluate our algorithm and recent state-of-the-art algorithms on our YouTube-VOS dataset. Then we compare our results on the DAVIS 2016 validation dataset [33], an existing benchmark dataset for video object segmentation. Finally, we perform an ablation study to explore the effect of data scale and model variants on our method.
We split the YouTube-VOS dataset of 3,252 videos into training (2,796), validation (134) and test (322) sets. To evaluate the generalization ability of existing approaches on unseen categories, the test set is further split into test-seen and test-unseen subsets. We first select 10 categories (i.e. ant, bull riding, butterfly, chameleon, flag, jellyfish, kangaroo, penguin, slopestyle, snail) as unseen categories during training and treat their videos as the test-unseen set. The validation and test-seen subsets are created by sampling two and four videos per category, respectively. The rest of the videos form the training set. We use the region similarity J and the contour accuracy F as the evaluation metrics, as in [33].

For fair comparison, we re-train previous methods (i.e. SegFlow [8], OSMN [50], MaskTrack [32], OSVOS [6] and OnAVOS [44]) on our training set with the same settings as our algorithm. One difference is that the other methods leverage post-processing steps to achieve additional gains while our models do not.

The results are presented in Table 2. All the comparison methods use static image segmentation models and four of them (i.e. SegFlow, MaskTrack, OSVOS and OnAVOS) require online learning. Our algorithm leverages long-term spatial-temporal characteristics and achieves better performance even without online learning (the second-to-last row in Table 2), which effectively demonstrates the importance of long-term spatial-temporal information for video object segmentation. With online learning, our model is further improved and achieves around 8% absolute improvement over the best previous method, OSVOS, on J mean. Our method also outperforms previous methods on contour accuracy and decay rate by a large margin. Surprisingly, OnAVOS, which is the best performing method on DAVIS, does not achieve good results on our dataset. We believe the drastic appearance changes and complex motion patterns in our dataset make the online adaptation fail in many cases. Figure 3 visualizes the changes of J mean over the duration of the video sequences. Without online learning, our method is worse than online learning methods such as OSVOS for the first few frames, since the object appearance usually has not changed much from the initial frame and online learning is effective in such a scenario. However, our method degrades more slowly than the other methods and starts to outperform OSVOS at around 25% of the video duration, which demonstrates that our method indeed propagates object segmentations more accurately over time than previous methods. With the help of online learning, our method outperforms previous methods over most of the video duration, while maintaining a small decay rate.

Table 2: Comparisons of our approach and other methods on the YouTube-VOS test set. The results in each cell show the test results for seen/unseen categories. "OL" denotes online learning. The best results are highlighted in bold.
Method           J mean ↑     J recall ↑   J decay ↓    F mean ↑     F recall ↑   F decay ↓
SegFlow [8]      40.4/38.5    45.4/41.7
OSVOS [6]        59.1/58.8    66.2/64.5    17.9/19.5    63.7/63.9    69.0/67.9    20.6/23.0
MaskTrack [32]   56.9/60.7    64.4/69.6    13.4/16.4    59.3/63.7    66.4/73.4    16.8/19.8
OSMN [50]        54.9/52.9    59.7/57.6    10.2/14.6    57.3/55.2    60.8/58.0    10.4/13.8
OnAVOS [44]      55.7/56.8    61.6/61.5    10.3/9.4     61.3/62.3    66.0/67.3    13.1/12.8
Ours (w/o OL)    60.9/60.1    70.3/71.2    7.9/12.9     64.2/62.3    73.0/71.4    9.3/14.5
Ours (with OL)
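For reference, the region similarity J reported above is the Jaccard index (intersection over union) between predicted and ground-truth masks, averaged over annotated frames; a minimal sketch is shown below (the contour accuracy F, a boundary F-measure, is omitted).

```python
import numpy as np

def region_similarity(pred, gt):
    """J metric: IoU between two binary masks given as (H, W) arrays."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: count as a perfect match
    return np.logical_and(pred, gt).sum() / union

# J mean over a sequence is the average of the per-frame scores:
# j_mean = np.mean([region_similarity(p, g) for p, g in zip(pred_masks, gt_masks)])
```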
Fig. 3: The changes of J mean values over the length of the video sequences.

Next we compare the generalization ability of existing methods on unseen categories in Table 2. Most methods perform better on seen categories than on unseen categories, which is expected. But the differences are not large, e.g. usually within 2% absolute on each metric. On one hand, this suggests that existing methods are able to alleviate the mismatch between training and test categories through approaches such as online learning. On the other hand, it also demonstrates that the diverse training categories in YouTube-VOS help different methods generalize to new categories. Experiments on dataset scale in Section 5.4 further suggest the power of data scale for our model. Compared to other single-frame based methods, OSMN shows a more obvious degradation on unseen categories since it does not use online learning. Our method without online learning does not have this issue since it leverages spatial-temporal information, which is more robust to unseen categories. MaskTrack and OnAVOS perform better on unseen than on seen categories. We believe they benefit from the guidance of the previous segmentation or from online adaptation, which are advantageous for videos with slow motion; there are indeed several objects with slow motion in the unseen categories, such as snail and chameleon.

Some test results produced by our model without online learning are visualized in Figure 4. The first two rows are from seen categories while the last two rows are from unseen categories. In addition, each example represents some challenging cases in video object segmentation. For example, the person in the first example undergoes large changes in appearance and illumination. The second and third examples both contain multiple similar objects and heavy occlusions. The last example has strong camera motion and the penguin changes its pose frequently. Our model obtains accurate results on all the examples, which demonstrates the effectiveness of spatial-temporal features learned from large-scale training data.

Fig. 4: Some visual results produced by our model without online learning on the YouTube-VOS test set. The first column shows the initial ground-truth object segmentation (green) while the second to last columns are predictions.
DAVIS 2016 is a popular prior benchmark dataset for video object segmentation. To evaluate our algorithm, we first fine-tune our pretrained model for 200 epochs on the DAVIS training set, which contains 30 videos. The comparison results between our fine-tuned models and previous methods are shown in Table 3. BVS and OFL are based on hand-crafted features and graphical models, while the rest are all deep-learning based methods. Among the methods [32,6,44,50] using image segmentation frameworks, OnAVOS achieves the best performance.
However, its online adaptation process makes inference quite slow (~13 s per frame). Our model without online learning (the second-to-last row) achieves comparable results to other online learning methods without post-processing (e.g. MaskTrack 69.8% and OSVOS 77.4%), but with a significant speed-up (60 times faster). Previous methods using spatial-temporal information, including SegFlow, VPN and ConvGRU, get inferior results compared to ours. Among them, ConvGRU is most related to ours since it also incorporates RNN memory cells in its model. However, it is an unsupervised method that only segments the moving foreground, while our method can segment arbitrary objects given the mask supervision.

Table 3: Comparisons of our approach and previous methods on the DAVIS 2016 dataset. Different components used in each algorithm are marked. "OL" denotes online learning. "PP" denotes post-processing by CRF [25] or Boundary Snapping [6]. "OF" denotes optical flow. "RNN" denotes RNN and its variants.
Method           OL   PP   OF   RNN   mean IoU (%)   Speed (s)
BVS [27]         -    ✗    ✗    -     60.0           0.37
OFL [43]         -    ✓    ✓    -     68.0           42.2
SegFlow [8]      ✓    ✓    ✓    ✗
VPN [22]         ✗    ✗    ✗    ✗
ConvGRU [41]     ✗    ✓    ✓    ✓
Ours (w/o OL)    ✗    ✗    ✗    ✓
Ours (with OL)   ✓    ✗    ✗    ✓
Fig. 5: The comparison results between our model without online learning (upper row) and with online learning (bottom row). Each column shows the predictions of the two models at the same frame.

Finally, online learning helps our model segment object boundaries more accurately. Figure 5 shows such an example.

To demonstrate the scale limitation of existing datasets, we train our models under three different settings and evaluate on DAVIS 2016.
– Setting 1: We train our model from scratch on the 30 training videos.
– Setting 2: We train our model from scratch on the 30 training videos, plus all the videos from the SegTrackv2, JumpCut and YoutubeObjects datasets, which results in a total of 192 training videos.
– Setting 3: Following the idea of ConvGRU, we use a pretrained object segmentation model, DeepLab [7], as our Encoder and train the other parts of our model on the 30 training videos.

Our models trained under settings 1 and 2 only get 51.3% and 51.9% mean IoU respectively, which suggests that existing video object segmentation datasets do not have sufficient data to train our models. Therefore, our YouTube-VOS dataset is one of the key elements for the success of our algorithm.
Table 4: The effect of data scale on our algorithm. We use different portions of the training data to train our models and evaluate on the YouTube-VOS test set.

Scale   J mean ↑     J recall ↑   J decay ↓    F mean ↑     F recall ↑   F decay ↓
25%     46.7/40.1    53.5/45.6    8.3/13.6     46.7/40.0    52.2/41.6    8.5/13.2
50%     51.5/50.3    59.2/58.8    10.3/13.1    51.8/50.2    59.5/55.8    11.1/13.3
75%     56.8/56.0    65.7/67.1    7.6/10.0     59.6/56.3    68.8/64.1    8.5/11.1
100%    60.9/60.1    70.3/71.2    7.9/12.9     64.2/62.3    73.0/71.4    9.3/14.5

In addition, there is only a little improvement from adding videos from the SegTrackv2, JumpCut and YoutubeObjects datasets, which suggests that small scale is not the only problem of previous datasets. For example, videos in those three datasets usually have only one main foreground object, SegTrackv2 has low-resolution videos, the annotation of YoutubeObjects videos is not accurate along object boundaries, etc. Our YouTube-VOS dataset is carefully created to avoid all these problems. Setting 3 is a common detour for existing methods to bypass the data-insufficiency issue, i.e. using models pre-trained on other large-scale datasets to reduce the parameters to be learned. However, our model using this strategy gets even worse results (around 45% mean IoU).

In this subsection, we perform an ablation study on the YouTube-VOS dataset to evaluate different variants of our algorithm.
Dataset scale.
Since the dataset scale is very important to our models, we train several models on different portions of the YouTube-VOS training set to explore the effect of data. Specifically, we randomly select 25%, 50% and 75% of the training set and retrain our models from scratch. The results are listed in Table 4. It can be seen that using only 25% of the training videos (about 700 videos) leads to a large performance drop, and the performance improves consistently as more training data is used.

Initializer variants.
The Initializer in our original model is a VGG-16 network which encodes an RGB frame and an object mask and outputs the initial hidden states of ConvLSTM. We would like to explore using the object mask directly as the initial hidden states of ConvLSTM. We therefore train an alternative model by removing the Initializer and directly using the object mask as the hidden states, i.e. the object mask is reshaped to match the size of the hidden states. The J mean of the adapted model is 45.1% on the seen categories and 38.6% on the unseen categories. This suggests that the object mask alone does not carry enough information for localizing the object.
Encoder variants.
The Encoder in our original model receives an RGB frame as input at each time step. Alternatively, we can use the segmentation mask of the previous step as an additional input to explicitly provide extra information to the model, similar to MaskTrack [32]. In this way, our Initializer and Encoder can be replaced with a single VGG-16 network, since the inputs at every time step have the same dimensions and similar meaning. However, such a framework potentially has an error-drifting issue, since segmentation mistakes made at previous steps are propagated to the current step.

In the early stage of training, the model is unable to predict good segmentation results, so we use the ground-truth annotation of the previous step as the input. This strategy is known as teacher forcing [45], which makes training faster. After the training loss becomes stable, we replace the ground-truth annotation with the model's prediction of the previous step, so that the model is forced to correct its own mistakes. This strategy is related to curriculum learning [2]. Empirically we find that both strategies are important for the model to work well. The J mean of this model is 59.4% on the seen categories and 60.7% on the unseen categories, which is similar to our original model.
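A sketch of how this mask-fed Encoder variant can be unrolled with a switch between teacher forcing and feeding back the model's own predictions; the helper names (`init_states`, `encode_with_mask`, `decode`) are hypothetical and only illustrate the data flow.

```python
import random
import torch

def run_mask_fed_clip(model, frames, y0, gt_masks, teacher_forcing_prob):
    """frames: (B, T, 3, H, W); y0 and gt_masks are binary masks. At each step the encoder
    also sees the previous mask; with probability teacher_forcing_prob the ground-truth
    previous mask is used (teacher forcing), otherwise the model's own prediction is fed back."""
    c, h = model.init_states(frames[:, 0], y0)                  # hypothetical helper
    prev_mask, preds = y0, []
    for t in range(1, frames.size(1)):
        x_t = model.encode_with_mask(frames[:, t], prev_mask)   # hypothetical helper
        c, h = model.convlstm(x_t, c, h)
        pred = torch.sigmoid(model.decode(h))                   # hypothetical helper
        preds.append(pred)
        use_gt = random.random() < teacher_forcing_prob         # 1.0 early in training, lower later
        prev_mask = gt_masks[:, t] if use_gt else pred.detach()
    return torch.stack(preds, dim=1)
```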
In this work, we introduce the largest video object segmentation dataset (YouTube-VOS) to date. The new dataset, much larger than existing datasets in terms of the number of videos and annotations, allows us to design a new deep learning algorithm that explicitly models long-term spatial-temporal dependency from videos for segmentation in an end-to-end learning framework. Thanks to the large-scale dataset, our new algorithm achieves better or comparable results compared to existing state-of-the-art approaches. We believe the new dataset will foster research on video-based computer vision in general.
This research was partially supported by gift funding from Snap Inc. and the UIUC Andrew T. Yang Research and Entrepreneurship Award to the Beckman Institute for Advanced Science & Technology, UIUC.
References
1. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016)
2. Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems. pp. 1171–1179 (2015)
3. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2), 157–166 (1994)
4. Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: ECCV. pp. 282–295 (2010)
5. Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: European Conference on Computer Vision. pp. 282–295. Springer (2010)
6. Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: CVPR (2017)
7. Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE T-PAMI 40, 834–848 (2018)
8. Cheng, J., Tsai, Y.H., Wang, S., Yang, M.H.: Segflow: Joint learning for video object segmentation and optical flow. In: ICCV (2017)
9. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
10. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)
11. Dutt Jain, S., Xiong, B., Grauman, K.: Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In: CVPR (2017)
12. Faktor, A., Irani, M.: Video segmentation by non-local consensus voting. In: BMVC (2014)
13. Fan, Q., Zhong, F., Lischinski, D., Cohen-Or, D., Chen, B.: Jumpcut: Non-successive mask transfer and interpolation for video cutout. ACM Trans. Graph. 34(6) (2015)
14. Finn, C., Goodfellow, I., Levine, S.: Unsupervised learning for physical interaction through video prediction. In: Advances in Neural Information Processing Systems (2016)
15. Galasso, F., Nagaraja, N.S., Cárdenas, T.J., Brox, T., Schiele, B.: A unified video segmentation benchmark: Annotation, metrics and analysis. In: ICCV. IEEE (2013)
16. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 249–256 (2010)
17. Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: Activitynet: A large-scale video benchmark for human activity understanding. In: CVPR. pp. 961–970. IEEE (2015)
18. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
19. Hu, Y.T., Huang, J.B., Schwing, A.: Maskrnn: Instance level video object segmentation. In: NIPS (2017)
20. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR (2017)
21. Jain, S.D., Grauman, K.: Supervoxel-consistent foreground propagation in video. In: ECCV (2014)
22. Jampani, V., Gadde, R., Gehler, P.V.: Video propagation networks. In: CVPR (2017)
23. Jozefowicz, R., Zaremba, W., Sutskever, I.: An empirical exploration of recurrent network architectures. In: International Conference on Machine Learning. pp. 2342–2350 (2015)
24. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
25. Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with gaussian edge potentials. In: NIPS. pp. 109–117 (2011)
26. Li, F., Kim, T., Humayun, A., Tsai, D., Rehg, J.M.: Video segmentation by tracking many figure-ground segments. In: ICCV (2013)
27. Märki, N., Perazzi, F., Wang, O., Sorkine-Hornung, A.: Bilateral space video segmentation. In: CVPR (2016)
28. Nagaraja, N.S., Schmidt, F.R., Brox, T.: Video segmentation with just a few strokes. In: ICCV. pp. 3235–3243 (2015)
29. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 807–814 (2010)
30. Ochs, P., Malik, J., Brox, T.: Segmentation of moving objects by long term video analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(6), 1187–1200 (2014)
31. Papazoglou, A., Ferrari, V.: Fast object segmentation in unconstrained video. In: ICCV. pp. 1777–1784. IEEE (2013)
32. Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: CVPR (2017)
33. Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR (2016)
34. Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 davis challenge on video object segmentation. arXiv:1704.00675 (2017)
35. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpolation of correspondences for optical flow. In: CVPR (2015)
36. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 640–651 (2017)
37. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
38. Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using lstms. In: International Conference on Machine Learning (2015)
39. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems. pp. 3104–3112 (2014)
40. Tokmakov, P., Alahari, K., Schmid, C.: Learning motion patterns in videos. In: CVPR (2017)
41. Tokmakov, P., Alahari, K., Schmid, C.: Learning video object segmentation with visual memory. In: ICCV (2017)
42. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV (2015)
43. Tsai, Y.H., Yang, M.H., Black, M.J.: Video segmentation via object flow. In: CVPR (2016)
44. Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. arXiv preprint arXiv:1706.09364 (2017)
45. Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2), 270–280 (1989)