Self-supervised Sparse to Dense Motion Segmentation
Amirhossein Kardoost, Kalun Ho, Peter Ochs, and Margret Keuper
Data and Web Science Group, University of Mannheim, Germany
Fraunhofer Center Machine Learning, Germany
Fraunhofer ITWM, Competence Center HPC, Kaiserslautern, Germany
Mathematical Optimization Group, Saarland University, Germany
Abstract.
Observable motion in videos can give rise to the definition of objects moving with respect to the scene. The task of segmenting such moving objects is referred to as motion segmentation and is usually tackled either by aggregating motion information in long, sparse point trajectories, or by directly producing per-frame dense segmentations relying on large amounts of training data. In this paper, we propose a self-supervised method to learn the densification of sparse motion segmentations from single video frames. While previous approaches towards motion segmentation build upon pre-training on large surrogate datasets and use dense motion information as an essential cue for the pixel-wise segmentation, our model does not require pre-training and operates at test time on single frames. It can be trained in a sequence-specific way to produce high quality dense segmentations from sparse and noisy input. We evaluate our method on the well-known motion segmentation datasets FBMS59 and DAVIS2016.

1 Introduction

The importance of motion for visual learning has been emphasized in recent years. Following the Gestalt principle of common fate [1], motion patterns within an object are often more homogeneous than its appearance, and therefore provide reliable cues for segmenting (moving) objects in a video. Motion segmentation is the task of segmenting such motion patterns. This is in contrast to semantic segmentation, where one seeks to assign pixel-wise class labels in an image. Thus, for motion segmentation, we need motion information and at least two frames in order to distinguish between motion segments. Ideally, such motion patterns separate meaningful objects, e.g. an object moving w.r.t. the scene (refer to Fig. 1). To illustrate the importance of motion segmentation, consider an autonomous driving scenario. A first step to classify the potential danger caused by the presence of a possibly static object, e.g. a parking car, is the knowledge about its mobility. Unknowingly waiting for the observation that the door of the parking car suddenly opens may be too late to avoid an accident. The speed of the autonomously driving vehicle must be adjusted based on the mobility and danger of other objects.
Fig. 1: Motion segmentation example for frames 1 and 160 of the "horses01" sequence of the FBMS59 [2] dataset (with their ground-truth motion segmentation). Since the person and the horse move together and share the same motion pattern, they are assigned the same motion label (blue).

While models for the direct prediction of pixel-wise motion segmentations are highly accurate [3, 4], they can only take very limited account of an object's motion history. In this paper, we propose a model to produce high quality dense segmentations from sparse and noisy input (i.e. densification). It can be trained in a sequence-specific way using sparse motion segmentations as training data, i.e. the densification model can be trained in a self-supervised way. Our approach is based on sparse (semi-dense) motion trajectories that are extracted from videos via optical flow (Fig. 2, center). Point trajectory based motion segmentation algorithms have proven to be robust and fast [2, 5-10]. By a long-term analysis of a whole video shot at once by means of such trajectories, even objects that are static for most of the time and only move for a few frames can be identified, i.e. the model would not "forget" that a car has been moving, even after it has been static for a while. The same argument allows articulated motion to be assigned to a common moving object.

In our approach we use object annotations that are generated using established, sparse motion segmentation techniques [5]. We also propose an alternative, a GRU-based multicut model, which allows to learn the similarity between the motion of two trajectories and potentially allows for a more flexible application. In both cases, the result is a sparse segmentation of video sequences, providing labels only for points lying on the original trajectories, e.g. every 8 pixels (compare Fig. 2). From such sparse segmentations, pixel-wise segmentations can be generated by variational methods [2]. In order to better leverage the consistency within the video sequences, we propose to train sequence-specific densification models using only the sparse motion segmentations as labels.

Specifically, we train a U-Net like model [11] to predict dense segmentations from given images (Fig. 2), while we only have sparse and potentially noisy labels given by the trajectories. The training task can thus be interpreted as a label densification. Yet, the resulting model does not need any sparse labels at test time but can generalize to unseen frames.

In contrast to end-to-end motion segmentation methods such as [4], we are not restricted to binary labels but can distinguish between different motion patterns belonging to different objects and instances per image, and we only require single images to be given at test time.
Fig. 2: Exemplary multi-label motion segmentation results showing (left) the image and its sparse (middle) and dense (right) segmentation. The sparse segmentation is produced by [5] and the dense segmentation is the result of the proposed model.

Also, in contrast to such approaches, our model does not rely on the availability of large surrogate datasets such as ImageNet [12] or FlyingChairs [13] for pre-training but can be trained directly on the sequences from, for example, the FBMS59 [2] and DAVIS2016 [14] datasets. To summarize, we make the following contributions:

– We provide an improved affinity graph for motion segmentation in the minimum cost multicut framework using a GRU-model. Our model generates a sparse segmentation of motion patterns.
– We utilize the sparse and potentially noisy motion segmentation labels to train a U-Net model to produce class-agnostic and dense motion segmentation. Sparse motion labels are not required during prediction.
– We provide competitive video object segmentation and motion segmentation results on the FBMS59 [2] and DAVIS2016 [14] datasets.

2 Related Work

Our goal is to learn to segment moving objects based on their motion pattern. For efficiency and robustness, we focus on point trajectory based techniques as initial cues. Trained in a sequence-specific way, our model can be used for label densification. We therefore consider in the following related work in the areas of motion segmentation and sparse to dense labeling.

Motion Segmentation. Common end-to-end trained CNN based approaches to motion segmentation are based on single frame segmentations from optical flow [3, 4, 15-17]. Tokmakov et al. [3, 4] make use of large amounts of synthetic training data [13] to learn the concept of object motion. Further, [4] combine these cues with an ImageNet [12] pre-trained appearance stream and achieve long-term temporal consistency by using a GRU optimized on top. They learn a binary video segmentation and distinguish between static and moving elements. Siam et al. [17] use a single convolutional network to model motion and appearance cues jointly
for autonomous driving. A frame-wise classification problem is formulated in [16] to detect motion saliency in videos. In [15], multiple figure-ground segmentations are produced per frame based on motion boundaries, and a moving objectness detector trained on image and motion fields is used to rank the segment candidates. Methods like [18, 19] approach the problem in a probabilistic way. In [18], the camera motion is subtracted from each frame to improve the training. A variational formulation based on optical flow is used in [19]. Motion segmentation differs from object segmentation: in motion segmentation, different motion patterns are segmented, so that connected objects appear as one object if they move together with the same motion pattern. As an example, consider a person riding a horse (Fig. 1). As long as they move together, they are considered one object, whereas in object segmentation we deal with two separate objects. This distinguishes our method from object segmentation approaches [20-22].

A different line of work relies on point trajectories for motion segmentation. Here, long-term motion information is first used to establish sparse but coherent motion segments over time. Dense segmentations are usually generated in a post-processing step such as [2]. However, in contrast to end-to-end CNN motion segmentation approaches, trajectory based methods have the desirable property to directly handle multiple motion pattern segmentations. While in [5, 6] the partitioning of trajectories into motion segments uses the minimum cost multicut problem, other approaches employ sparse subspace clustering [23] or spectral clustering and normalized cuts [24-26]. In this setting, a model selection is needed to determine the final number of segments. In contrast, the minimum cost multicut formulation allows for a direct optimization of the number of components via repulsive costs.

Our GRU approach uses the same graph construction policy as [5] for motion segmentation, while the edge costs are assigned using a Siamese (also known as twin) gated recurrent network. Siamese networks [27] are a metric learning approach used to compare two different inputs. Similar trajectory embeddings have been used in [28] to predict a pedestrian's future walking direction. For motion segmentation, we stick to the formulation as a minimum cost multicut problem [5].
Sparse to Dense Labeling. While trajectory based motion segmentation methods can propagate object information through frames, they produce sparse results. Therefore, specific methods, e.g. [2, 29], are needed to produce dense results. A commonly used densification approach is the variational framework of Ochs et al. [2]. In this approach, the underlying color and boundary information of the images is used for the diffusion of the sparsely segmented trajectory points, which sometimes leaks the pixel labels to unwanted regions, e.g. loosely textured areas of the image. Furthermore, [30] address the densification problem using Gabriel graphs as per-frame superpixel maps. Gabriel edges bridge the gaps between contours
using geometric reasoning. However, superpixel maps are prone to neglect the fine structure of the underlying image, which leads to low segmentation quality. Our method benefits from trajectory based methods for producing a sparse multi-label segmentation. A sparsely trained U-Net [11] produces dense results for each frame purely from appearance cues, potentially specific for a scene or sequence.

Fig. 3: Sparsely segmented trajectories are produced by minimum cost multicuts (MC) either with our Siamese-GRU model or simple motion cues as in [5] (top). The sparsely labeled points are used to train the U-Net model (bottom). At test time, the U-Net model can produce dense segmentations without requiring any sparse labels as input.
3 Method

We propose a self-supervised learning framework for sparse-to-dense segmentation of sparsely segmented point trajectories. In other words, a U-Net model is trained from sparse annotations to estimate dense segmentation maps (Section 3.2). The sparse annotations can be provided either by a potentially unsupervised state-of-the-art trajectory segmentation method [5] or by our proposed Siamese-GRU model (Section 3.1).
3.1 Sparse Motion Segmentation

Point trajectories are generated from optical flow [7]. Each point trajectory corresponds to a set of sub-pixel positions connected through consecutive frames using the optical flow information. Such point trajectories are clustered by the minimum cost multicut approach [31] (also known as correlation clustering) with respect to their underlying motion model, estimated (i) from a translational motion model or (ii) by the proposed Siamese GRU network.
Point Trajectories are spatio-temporal curves represented by their frame-wise sub-pixel-accurate (x, y)-coordinates. They can be generated by tracking points using optical flow with the method of Brox et al. [7]. The resulting trajectory set aims for a minimal target density (e.g. one trajectory in every 8 pixels). Trajectories are initialized in some video frame and end when the point cannot be tracked reliably anymore, e.g. due to occlusion. In order to achieve the desired density, new trajectories are initialized throughout the video where needed. Using trajectories brings the benefit of accessing the object motion in prior frames.
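As an illustration of this tracking step (a minimal sketch, not the exact implementation of [7]), the following Python snippet propagates points with dense optical flow and uses a forward-backward consistency check to terminate unreliable trajectories; the flow fields and the consistency threshold are assumed inputs.

```python
# Sketch of flow-based point tracking; `fwd` and `bwd` are assumed to be
# (H, W, 2) forward/backward flow fields between two consecutive frames.
import numpy as np

def bilinear_sample(flow, pts):
    """Sample a (H, W, 2) flow field at sub-pixel (x, y) points of shape (N, 2)."""
    h, w = flow.shape[:2]
    x, y = pts[:, 0], pts[:, 1]
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    ax, ay = (x - x0)[:, None], (y - y0)[:, None]
    return ((1 - ax) * (1 - ay) * flow[y0, x0] + ax * (1 - ay) * flow[y0, x0 + 1]
            + (1 - ax) * ay * flow[y0 + 1, x0] + ax * ay * flow[y0 + 1, x0 + 1])

def step_trajectories(pts, fwd, bwd, thresh=1.0):
    """Move points one frame ahead; flag points failing the consistency check."""
    moved = pts + bilinear_sample(fwd, pts)
    # Forward-backward check: a reliably tracked point maps back near itself.
    back = moved + bilinear_sample(bwd, moved)
    alive = np.linalg.norm(back - pts, axis=1) < thresh
    return moved, alive
```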
Translational Motion Affinities can be assigned based on motion distances of each trajectory pair with some temporal overlap [5]. For trajectories $A$ and $B$, the frame-wise motion distance at time $t$ is computed as

$$ d_t(A, B) = \frac{\lVert \partial_t A - \partial_t B \rVert}{\sigma_t}. \qquad (1) $$

It solely measures in-plane translational motion differences, normalized by the variation of the optical flow $\sigma_t$ (refer to [2] for more information). Here, $\partial_t A$ and $\partial_t B$ denote the partial derivatives of $A$ and $B$ with respect to the time dimension. The overall motion distance of a pair of trajectories $A$ and $B$ is computed by maximum pooling over their joint life-time,

$$ d_{\text{motion}}(A, B) = \max_t \; d_t(A, B). \qquad (2) $$

Color and spatial cues are added in [5] for robustness. Instead of using translational motion affinities, we propose a Siamese Gated Recurrent Unit (GRU) based model to provide affinities between the trajectory pairs.
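A direct transcription of equations (1) and (2), assuming the trajectories are given as coordinate arrays restricted to their joint life-time and that the per-frame flow variation $\sigma_t$ is available:

```python
import numpy as np

def motion_distance(traj_a, traj_b, sigma):
    """traj_a, traj_b: (T, 2) positions on the joint lifetime; sigma: (T-1,)."""
    da = np.diff(traj_a, axis=0)                     # discrete derivative d_t A
    db = np.diff(traj_b, axis=0)                     # discrete derivative d_t B
    d_t = np.linalg.norm(da - db, axis=1) / sigma    # Eq. (1), per frame
    return d_t.max()                                 # Eq. (2): max over t
```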
Siamese Gated Recurrent Units (GRUs) can be used to learn trajectory affinities by taking trajectory pairs as input. The proposed network consists of two legs with shared weights, where each leg consists of a GRU model which takes a trajectory as input. Specifically, for two trajectories $A$ and $B$, the partial derivatives $\partial A$ and $\partial B$ with respect to the time dimension, computed on the joint life-time of the trajectories, are given to the two legs of the Siamese-GRU model. The partial derivatives represent the motion information, while no information about the location in image coordinates is provided. The motion cues are embedded by the GRU network, i.e. the hidden units are gathered for each time step. Afterwards, the difference between the two embedded motion vectors $\text{embed}_{\partial A}$ and $\text{embed}_{\partial B}$ is computed as

$$ d(\text{embed}_{\partial A}, \text{embed}_{\partial B}) = \sum_{i=1}^{h} \left( \text{embed}_{\partial A, i} - \text{embed}_{\partial B, i} \right)^2, \qquad (3) $$

where $h$ denotes the number of hidden units for each time step. The result of equation (3) is a vector which is subsequently given to two fully connected layers, and the final similarity value is computed by a sigmoid function. Therefore, for each pair of trajectories given to the Siamese [27] GRU network, it provides a measure of their likelihood to belong to the same motion pattern.

The joint life-time of two trajectories varies from pair to pair, while the GRU network requires a fixed number of time steps $N$ as input. This is dealt with as follows: if the joint life-time of the two trajectories is shorter than $N$, each trajectory is padded with its respective final partial derivative value in the intersection part. Otherwise, when the joint life-time has more than $N$ time steps, the time step with maximum motion distance, similar to equation (2), is determined over the entire joint life-time,

$$ t_{A,B} = \arg\max_t \; d_t(A, B). \qquad (4) $$

The trajectory values are extracted before and after $t_{A,B}$ so that the required number of time steps $N$ is reached. This ensures that the important part of the trajectories is not lost: consider a case where an object does not move in the first $x$ frames and starts moving from frame $x+1$; the most important information will be available around frames $x$ and $x+1$, and it is better not to lose it by cutting this part out.

In our approach, the frame-wise Euclidean distance of the trajectory embeddings (extracted from the hidden units) of the GRU model is fed to a fully connected layer for dimensionality reduction and passed to a sigmoid function for classification into the classes 0 (same label: the pair of trajectories belongs to the same motion pattern) and 1 (different label: the trajectories belong to different motion patterns), trained using a mean squared error (MSE) loss.

To train the Siamese-GRU model, two labels are considered for each pair of trajectories: label 0 where the trajectory pair corresponds to the same motion pattern and label 1 otherwise. To produce the ground-truth labeling, a subset of the edges of the graph $G = (V, E)$ produced by the method of Keuper et al. [5] is sampled (details on the graph are provided in the next paragraph). Each edge corresponds to a pair of trajectories, and for each trajectory we look up its label in the provided ground-truth segmentation. We only keep trajectories that belong to exactly one motion pattern in the provided ground-truth.
Trajectories that change their label while passing through frames are considered unreliable and are excluded. Furthermore, the same number of edges with label 0 (same motion pattern) and label 1 (different motion pattern) is sampled to obtain a balanced training signal. At test time, the cost of each edge $e \in E$ of the graph $G = (V, E)$ is generated by the trained Siamese-GRU network.
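The following PyTorch sketch mirrors the Siamese-GRU affinity model described above, including the fixed-length preprocessing of equation (4). The hidden size $h = 2$ and the $N = 25$ time steps follow the implementation details in Section 4; the intermediate fully connected width (16) and the ReLU between the two layers are assumptions, not taken from the paper.

```python
import numpy as np
import torch
import torch.nn as nn

def fix_length(d_a, d_b, d_t, n_steps=25):
    """Crop or pad the (T, 2) joint-lifetime derivatives to exactly n_steps.
    d_t holds the frame-wise motion distances of Eq. (1) on the joint lifetime."""
    t_len = d_a.shape[0]
    if t_len >= n_steps:
        # Keep a window centred on the most discriminative step (Eq. (4)),
        # so that e.g. the onset of a motion is never cut away.
        t_star = int(np.argmax(d_t))
        start = min(max(t_star - n_steps // 2, 0), t_len - n_steps)
        return d_a[start:start + n_steps], d_b[start:start + n_steps]
    pad = n_steps - t_len
    # Pad short pairs with their respective final derivative value.
    return (np.concatenate([d_a, np.repeat(d_a[-1:], pad, axis=0)]),
            np.concatenate([d_b, np.repeat(d_b[-1:], pad, axis=0)]))

class SiameseGRU(nn.Module):
    def __init__(self, hidden=2, n_steps=25):
        super().__init__()
        # Both legs share this GRU (shared weights).
        self.gru = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        # Two fully connected layers map the per-step distances to one score.
        self.fc = nn.Sequential(nn.Linear(n_steps, 16), nn.ReLU(), nn.Linear(16, 1))

    def embed(self, d_traj):
        out, _ = self.gru(d_traj)   # hidden state at every time step
        return out                  # (B, N, h)

    def forward(self, d_a, d_b):
        # Eq. (3): squared embedding differences, summed over the h hidden
        # units, yield one distance value per time step.
        dist = ((self.embed(d_a) - self.embed(d_b)) ** 2).sum(dim=2)  # (B, N)
        return torch.sigmoid(self.fc(dist)).squeeze(1)  # 0: same, 1: different

# Training follows the text: sampled edges with balanced 0/1 labels, MSE loss,
# e.g. loss = nn.MSELoss()(model(d_a, d_b), labels).
```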
Trajectory Clustering yields a grouping of trajectories according to their modeled or learned motion affinities. We formalize the motion segmentation problem as a minimum cost multicut problem [5]. It aims to optimally decompose a graph $G = (V, E)$, where trajectories are represented by nodes in $V$ and their affinities define costs on the edges $E$. In our approach, the costs are assigned using the Siamese-GRU model.
While the multicut problem is APX-hard [32], heuristic solvers [33-37] are expected to provide practical solutions. We use the Kernighan-Lin [33] implementation of [38]. This solver has proven practically successful in motion segmentation [5, 6], image decomposition and mesh segmentation [34] scenarios.
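As a self-contained illustration only (the paper uses the Kernighan-Lin solver of [38]), the sketch below converts the GRU similarities into signed edge costs and decomposes the graph with a simple greedy edge-contraction heuristic in the spirit of GAEC [34, 35]. Both the logit cost mapping and the heuristic are stand-ins, not the exact method of the paper.

```python
import math

def edge_cost(p_different, eps=1e-6):
    """Map the GRU output (probability that two trajectories move differently)
    to a signed multicut cost; positive costs attract. This mapping is an
    assumption for illustration."""
    p = min(max(p_different, eps), 1.0 - eps)
    return math.log((1.0 - p) / p)

def greedy_multicut(n_nodes, edge_costs):
    """edge_costs: dict {(u, v): cost}. Greedily contracts the currently most
    attractive inter-cluster connection while one with positive total cost
    exists; returns one cluster id per node."""
    parent = list(range(n_nodes))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    while True:
        # Aggregate costs between current clusters (parallel edges sum up).
        agg = {}
        for (u, v), c in edge_costs.items():
            ru, rv = find(u), find(v)
            if ru != rv:
                key = (min(ru, rv), max(ru, rv))
                agg[key] = agg.get(key, 0.0) + c
        if not agg:
            break
        (u, v), best = max(agg.items(), key=lambda kv: kv[1])
        if best <= 0.0:      # only repulsive connections remain: stop
            break
        parent[v] = u        # contract the two clusters
    return [find(u) for u in range(n_nodes)]
```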
3.2 Sparse-to-Dense Segmentation

We use the sparse motion segmentation annotated video data described in Section 3.1 to train our deep learning based sparse-to-dense motion segmentation model. Specifically, the training data consist of input images (video frames) or edge maps and their sparse motion segmentations, which we use as annotations. Although the loss function only applies at the sparse labels, the network learns to predict dense segmentations.
Deep Learning Model
We use a U-Net [11] type architecture for dense segmentation, which is known for its high quality predictions in tasks such as semantic segmentation [39-41]. A U-Net is an encoder-decoder network with skip connections. During encoding, characteristic appearance properties of the input are extracted and learnt to be associated with objectness. In the decoding phase, the extracted properties are traced back to the locations causing the observed effect, while details from the downsampling phase are taken into account to ease the localisation (see Fig. 3 for details). The output is a dense (pixel-wise) segmentation of objects, i.e., a function $u: \Omega \rightarrow \{1, \ldots, K\}$, where $\Omega$ is the image domain and $K$ is the number of classes, which corresponds to the number of trajectory labels. This means that, after clustering the trajectories, each cluster takes a label, and overall we obtain a class-agnostic motion segmentation of sparse trajectories. Such labels are only used during training.

Loss Function
The U-Net is trained via the Binary Cross Entropy (BCE) and Cross Entropy (CE) loss functions for the single and multiple object case, respectively. As labels are only present at a sparse subset of pixels, the loss function is restricted to those pixels. Intuitively, since the label locations where the loss is evaluated are unknown to the network, it is forced to predict a label at every location. (A more detailed discussion is provided below.)
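For instance, restricting the loss to the labeled trajectory points can be realized with a masked cross entropy. The following PyTorch sketch (with the ignore value -1 as an assumed convention for unlabeled pixels) illustrates the multi-label case; the binary case works analogously with BCE.

```python
import torch.nn.functional as F

def sparse_ce_loss(logits, labels):
    """logits: (B, K, H, W) U-Net output; labels: (B, H, W) long tensor that
    carries a class index at trajectory points and -1 everywhere else."""
    # ignore_index drops all unlabeled pixels from the loss, so the gradient
    # only flows through the sparse trajectory locations.
    return F.cross_entropy(logits, labels, ignore_index=-1)
```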
Dense Predictions with Sparse Loss Functions
At first glance, a sparse loss function may not force the network to produce a meaningful dense segmentation. Since trajectories are generated according to a simple deterministic criterion, namely extreme points of the local structure tensor [7], the network could internally reproduce this generation criterion and focus on labeling such points only. We indeed observed exactly this problematic behaviour, and therefore suggest variants of the learning process employing either RGB images or (deterministic) Sobel edge maps [42] as input. One remedy is to alleviate the local structure by smoothing the input RGB image, making it harder for the network to pick up on locally dominant texture and stimulating the usage of globally extracted features that can be associated with movable object properties.
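A small sketch of the two input variants discussed above, assuming OpenCV; the blur kernel size and sigma are illustrative choices, not the values used in the paper.

```python
import cv2
import numpy as np

def make_inputs(bgr):
    """bgr: (H, W, 3) uint8 frame. Returns a blurred color input and a
    Sobel edge map [42] as alternative network inputs."""
    blurred = cv2.GaussianBlur(bgr, (9, 9), sigmaX=3.0)   # suppress local texture
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    edges = np.sqrt(gx ** 2 + gy ** 2)                    # gradient magnitude
    return blurred, edges
```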
Conditional Random Field (CRF) Segmentation Refinement
To build fine-grained segmentation maps from the blurred images, we employ Conditional Random Fields (CRFs). We compare
– the fully connected pre-trained CRF layer (dense-CRF) [43], with parameters learned from pixel-level segmentation [44], and
– a CRFasRNN [45] model which we train, using the output of our U-Net model as unaries, on the same sparse annotations. To generate a training signal even in case the U-Net output perfectly fits the sparse labels, we add Gaussian noise to the unaries.
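For the pre-trained dense-CRF variant, inference over the U-Net probabilities can be run, for example, with the pydensecrf package; the pairwise parameters below are illustrative defaults, not the learned parameters of [44].

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(probs, rgb, n_iters=5):
    """probs: (K, H, W) float32 softmax output; rgb: (H, W, 3) uint8 image."""
    k, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, k)
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)        # location-based smoothness
    d.addPairwiseBilateral(sxy=60, srgb=10,       # appearance-based affinity
                           rgbim=np.ascontiguousarray(rgb), compat=5)
    q = np.array(d.inference(n_iters)).reshape(k, h, w)
    return q.argmax(axis=0)                       # refined label map
```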
Discussion: Sparse Trajectories vs. Dense Segmentation

The handling of sparse labels could be avoided if dense unsupervised motion segmentations were given. Although, in principle, dense trajectories can be generated by the motion segmentation algorithm, the clustering algorithm does not scale linearly with the number of trajectories and the computational cost explodes. Instead of densely tracking pixels throughout the video, a frame-wise, computationally affordable densification step, for example based on variational or inpainting strategies [2], could be used. However, some sparse labels might be erroneous, an issue that can be amplified by the densification. Although some erroneous labels can also be corrected by [2], especially close to object boundaries, we observed that the negative effect prevails when learning from such unsupervised annotations. Moreover, variational methods often rely on local information to steer the propagation of label information in the neighborhood. In contrast, the U-Net architecture can incorporate global information and possibly objectness properties to construct its implicit regularization.
4 Experiments

We evaluate the performance of the proposed models on the two datasets DAVIS2016 [14] and FBMS59 [2], which contain challenging video sequences with high quality segmentation annotations of moving objects.

The DAVIS2016 dataset [14] is produced for high-precision binary object segmentation tracking of rigidly and non-rigidly moving objects. It contains 30 train and 20 validation sequences. The pixel-wise binary ground truth segmentation is provided per frame for each sequence. The evaluation metrics are the Jaccard index (also known as Intersection over Union) and the F-measure. Even though this dataset is produced for object segmentation, it is commonly used to evaluate motion segmentation, because only one object is moving in each sequence, which makes the motion pattern of the foreground object different from the background motion.

The FBMS59 [2] dataset is specifically designed for motion segmentation and consists of 29 train and 30 test sequences. The sequences cover camera shaking, rigid/non-rigid motion as well as occlusion/dis-occlusion of single and multiple objects, with ground-truth segmentations given for a subset of the frames. We evaluate precision, recall and F-measure.

Table 1: The trajectories are segmented by 1. the method of Keuper et al. [5] and 2. our Siamese-GRU model. The densified results are generated based on 1. the method of Ochs et al. [2] and 2. the proposed U-Net model. The results are provided for the validation set of DAVIS2016.

Traj. Seg. Method  | Densification Method | Jaccard Index
Keuper et al. [5]  | Ochs et al. [2]      | 55.3
Keuper et al. [5]  | U-Net model (ours)   | 58.5
Siamese-GRU (ours) | Ochs et al. [2]      | 57.7
Siamese-GRU (ours) | U-Net model (ours)   |
Implementation Details
Our Siamese-GRU model with 2 hidden units ($h = 2$, equation (3)) and an experimentally selected 25 time steps ($N = 25$; for more information refer to Section 3.1) is trained for 3 epochs with a batch size of 256 and a learning rate of 0.001, where the trajectories are produced by large displacement optical flow (LDOF) [46] at 8 pixel sampling on the training set of DAVIS2016 [14].

We employ two different strategies of using the sparse motion segmentations of the resulting trajectories as labels, depending on whether we face binary (DAVIS2016) or multi-label (FBMS59) problems. In the single-label case, the most frequent trajectory label over all frames and the second most frequent label per frame are considered as background and foreground, respectively. For multi-label cases, the most frequent class-agnostic labels are selected, i.e. we take only those labels which are frequent enough compared to the other labels; a sketch of the binary variant follows below.
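A minimal sketch of the single-label strategy, with hypothetical helper and variable names: the globally most frequent trajectory label becomes background and, per frame, the second most frequent one becomes foreground; remaining points are ignored in the loss.

```python
from collections import Counter

def binary_point_labels(frame_labels):
    """frame_labels: per frame, the list of trajectory labels of its points.
    Returns per-point training labels: 0 background, 1 foreground, -1 ignored."""
    # Most frequent label over all frames -> background.
    bg = Counter(l for frame in frame_labels for l in frame).most_common(1)[0][0]
    out = []
    for frame in frame_labels:
        # Second most frequent label of this frame -> foreground.
        ranked = [l for l, _ in Counter(frame).most_common() if l != bg]
        fg = ranked[0] if ranked else None
        out.append([0 if l == bg else (1 if l == fg else -1) for l in frame])
    return out
```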
Our U-Net model is trained in a sequence-specific way. Such a model can be used, for example, for label densification, and is trained using color and edge-map data with a learning rate of 0.01 and a batch size of 1 for 15 epochs. The overall training and prediction process takes (at most) around 30 minutes per sequence on an NVIDIA Tesla V100 GPU. The CRFasRNN [45] is trained with a learning rate of 0.

Sparse Motion Segmentation. We first evaluate our GRU model for sparse motion segmentation on the validation set of DAVIS2016 [14]. Therefore, we produce, in a first iteration, densified
segmentations using the variational approach from [2]. The results are given in Tab. 1 (line 3) and show an improvement over the motion model from Keuper et al. [5] by 2% in Jaccard index. In the following experiments on DAVIS2016, we thus use sparse labels from our Siamese GRU approach.

Table 2: Sparse motion segmentation trained on DAVIS2016 (all sequences) and evaluated on FBMS59 (train set). We compare to [5] and their variant only using motion cues.

Method                       | Precision | Recall | F-measure
Siamese-GRU (ours, transfer) | 81.01     | 70.07  | 75.14

Knowledge Transfer
Next, we investigate the generalization ability of this motion model. We evaluate the DAVIS2016-trained GRU model on the train set of FBMS59 [2]. Results are given in Tab. 2. While this model performs below the hand-tuned model of [5] on this dataset, the results are comparable, especially considering that our GRU model does not use color information. In further experiments on FBMS59, we use sparse labels from [5].

Densification on DAVIS2016. Next, we evaluate our self-supervised dense segmentation framework with sequence-specific training on the color images as well as edge maps on the validation set of DAVIS2016 [14]. Tab. 1 shows that this model, trained on the GRU based labels, outperforms the model trained on the motion models from [5] as well as the densification of Ochs et al. [2] by a large margin. Tab. 3 (top) provides a comparison between different versions of our model, the densification model of Ochs et al. [2], and the per-frame evaluation of Tokmakov et al. [3] on DAVIS2016. [3] use large amounts of data for pre-training, and their better performing model variants require optical flow and full sequence information to be given at test time. Our results based on RGB and edge maps are better than those solely using edge maps. We also compare the different CRF versions:
– the pre-trained dense-CRF [43],
– our CRFasRNN [45] model trained per sequence (CRF-per-seq),
– a CRFasRNN [45] model trained on all frames of the train set of DAVIS2016 with sparse labels (CRF-general).
All CRF versions improve the F-measure. The CRF-general performs on par with dense-CRF while only using our sparse labels for training. See Fig. 4 and Fig. 5 for qualitative results of our model on the DAVIS2016 [14] and FBMS59 [2] datasets, respectively.

Table 3: Evaluation of self-supervised training on sequences from the DAVIS2016 validation set and comparison with other methods. The effect of adding color information (RGB) to the edge maps (sobel) is studied (ours), and a comparison between (pre-trained) dense-CRF (dense), CRF-per-seq (per-seq) and CRF-general (general) is provided (for the different CRF versions refer to Section 4.3). The last three rows study the effect of training our best model on only 50%, 70% and 90% of the frames.

Method               | % of frames | Jaccard Index | F-measure
variational [2]      | 100         | 57.7          | 57.1
appearance + GRU [3] | 100         | 59.6          | -
sobel + dense        | 100         | 62.6          | 54.0
sobel + RGB (ours)   | 100         | 61.3          | 49.0
ours + dense         | 100         |               |
ours + per-seq       | 100         | 66.2          | 60.3
ours + general       | 100         | 66.2          |
ours + general       | 50          | 59.6          | 50.4
ours + general       | 70          | 62.3          | 53.5
ours + general       | 90          | 63.4          | 55.4
Table 4: We evaluate our densification method on FBMS59 (train) using sparse motion segmentations from Keuper et al. [5]. The sparse trajectories are produced with different flow estimation methods (LDOF [46] and FlowNet2 [47]) and densified with our proposed U-Net model using edge maps (sobel) and color information (RGB) (ours). Further, we study different CRF methods: (pre-trained) dense-CRF (dense) and CRF-general (general). For more details about the different CRF versions refer to Section 4.3.

Method                    | Precision | Recall | F-measure
Ochs et al. [2]           | 85.31     | 68.70  | 76.11
LDOF + ours + dense       |           |        |
ours + dense              |           |        |
FlowNet2 + ours + general |           |        |

Partly trained Model
We further evaluate how well our model works for video frames for which no sparse labels are given during training. Please note that, throughout the sequences, the appearance of an object can change drastically. In Tab. 3 (bottom), we report results for the sequence-specific U-Net model + CRF-general trained on the first 50%, 70% and 90% of the frames and evaluated on the remaining frames. While there is some loss in Jaccard index compared to the model evaluated on seen frames (above), the performance only drops slightly as smaller portions of the data are used for training.
Fig. 4: Exemplary single-label motion segmentation results showing five frames and their sparse and dense segmentations for two different sequences, generated using the proposed U-Net model. The images are from sequences of the validation set of the DAVIS2016 [14] dataset.

Densification on FBMS59
Next, we evaluate our sequence-specific model for label densification on FBMS59 [2]. We study two different variants of optical flow (FlowNet2 [47] and Large Displacement Optical Flow (LDOF) [46]) for trajectory generation and sparse motion segmentation [5]. The results in Tab. 4 show that the proposed approach outperforms the approach of Ochs et al. [2]. Improved optical flow leads to improved results overall. The different CRF versions do not provide significantly different results.
5 Conclusion

In this paper, we have addressed the segmentation of moving objects from single frames. To that end, we proposed a GRU-based trajectory embedding to produce high quality sparse segmentations automatically. Furthermore, we closed the gap between sparse and dense results by providing a self-supervised U-Net model trained on sparse labels and relying only on edge maps and color information. The model, trained on sparse points, provides single- and multi-label dense segmentations. The proposed approach generalizes to unseen sequences from FBMS59 and DAVIS2016 and provides competitive and appealing results.

Fig. 5: Exemplary single- and multi-label motion segmentation results showing the image and its sparse as well as dense segmentation for five frames in three different sequences, generated using the proposed U-Net model. The images are from the FBMS59 [2] dataset. Segmentations with fine details are produced even when training labels were scarce; notice how scarce the labels are for the "rabbit" images in the 8th row. White areas are parts without any label.
Acknowledgment
We acknowledge funding by the DFG project KE 2264/1-1. We also acknowledge the NVIDIA Corporation for the donation of a Titan Xp GPU.
References
1. Koffka, K.: Principles of Gestalt Psychology. Harcourt Brace Jovanovich, New York (1935)
2. Ochs, P., Malik, J., Brox, T.: Segmentation of moving objects by long term video analysis. IEEE TPAMI (2014) 1187-1200
3. Tokmakov, P., Alahari, K., Schmid, C.: Learning video object segmentation with visual memory. In: ICCV (2017)
4. Tokmakov, P., Alahari, K., Schmid, C.: Learning motion patterns in videos. In: CVPR (2017)
5. Keuper, M., Andres, B., Brox, T.: Motion trajectory segmentation via minimum cost multicuts. In: ICCV (2015)
6. Keuper, M.: Higher-order minimum cost lifted multicuts for motion segmentation. In: ICCV (2017)
7. Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: ECCV. LNCS, Springer (2010)
8. Fragkiadaki, K., Zhang, W., Zhang, G., Shi, J.: Two-granularity tracking: Mediating trajectory and detection graphs for tracking under occlusions. In: ECCV (2012)
9. Shi, F., Zhou, Z., Xiao, J., Wu, W.: Robust trajectory clustering for motion segmentation. In: ICCV (2013)
10. Rao, S.R., Tron, R., Vidal, R., Ma, Y.: Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In: CVPR (2008) 1-8
11. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI. Volume 9351 of LNCS, Springer (2015) 234-241
12. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
13. Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., v.d. Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: ICCV (2015)
14. Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR (2016) 724-732
15. Maczyta, L., Bouthemy, P., Le Meur, O.: CNN-based temporal detection of motion saliency in videos. Pattern Recognition Letters (2019)
16. Fragkiadaki, K., Arbelaez, P., Felsen, P., Malik, J.: Learning to segment moving objects in videos. In: CVPR (2015) 4083-4090
17. Siam, M., Mahgoub, H., Zahran, M., Yogamani, S., Jagersand, M., El-Sallab, A.: MODNet: Motion and appearance based moving object detection network for autonomous driving. In: ITSC (2018) 2859-2864
18. Bideau, P., Learned-Miller, E.: It's moving! A probabilistic model for causal motion segmentation in moving camera videos. In: ECCV. Volume 9912 of LNCS (2016) 433-449
19. Cremers, D.: A variational framework for image segmentation combining motion estimation and shape regularization. In: CVPR (2003) 53-58
20. Hu, Y.T., Huang, J.B., Schwing, A.: Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In: ECCV (2018) 813-830
21. Tsai, Y.H., Yang, M.H., Black, M.J.: Video segmentation via object flow. In: CVPR (2016)
22. Papazoglou, A., Ferrari, V.: Fast object segmentation in unconstrained video. In: ICCV (2013) 1777-1784
23. Elhamifar, E., Vidal, R.: Sparse subspace clustering. In: ICCV (2013)
24. Ochs, P., Brox, T.: Higher order motion models and spectral clustering. In: CVPR (2012)
25. Shi, J., Malik, J.: Motion segmentation and tracking using normalized cuts. In: ICCV (1998) 1154-1160
26. Fragkiadaki, K., Shi, J.: Detection free tracking: Exploiting motion and topology for segmenting and tracking under entanglement. In: CVPR (2011)
27. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR. Volume 1 (2005) 539-546
28. Bhattacharyya, A., Fritz, M., Schiele, B.: Long-term on-board prediction of people in traffic scenes under uncertainty. (2018)
29. Müller, S., Ochs, P., Weickert, J., Graf, N.: Robust interactive multi-label segmentation with an advanced edge detector. In: GCPR. Volume 9796 of LNCS, Springer (2016) 117-128
30. Shi, J.: Video segmentation by tracing discontinuities in a trajectory embedding. In: CVPR (2012) 1846-1853
31. Andres, B., Kröger, T., Briggman, K.L., Denk, W., Korogod, N., Knott, G., Köthe, U., Hamprecht, F.A.: Globally optimal closed-surface segmentation for connectomics. In: ECCV (2012)
32. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Machine Learning (2004) 89-113
33. Kernighan, B.W., Lin, S.: An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal (1970) 291-307
34. Keuper, M., Levinkov, E., Bonneel, N., Lavoué, G., Brox, T., Andres, B.: Efficient decomposition of image and mesh graphs by lifted multicuts. In: ICCV (2015)
35. Beier, T., Andres, B., Köthe, U., Hamprecht, F.A.: An efficient fusion move algorithm for the minimum cost lifted multicut problem. In: ECCV (2016)
36. Kardoost, A., Keuper, M.: Solving minimum cost lifted multicut problems by node agglomeration. In: ACCV (2018)
37. Bailoni, A., Pape, C., Wolf, S., Beier, T., Kreshuk, A., Hamprecht, F.: A generalized framework for agglomerative clustering of signed graphs applied to instance segmentation. (2019)
38. Keuper, M., Andres, B., Brox, T.: Motion trajectory segmentation via minimum cost multicuts. In: ICCV (2015)
39. Siam, M., Gamal, M., Abdel-Razek, M., Yogamani, S., Jagersand, M.: RTSeg: Real-time semantic segmentation comparative study. In: ICIP (2018) 1603-1607
40. Siam, M., Elkerdawy, S., Jagersand, M., Yogamani, S.: Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges. In: ITSC (2017) 1-8
41. Fu, J., Liu, J., Wang, Y., Zhou, J., Wang, C., Lu, H.: Stacked deconvolutional network for semantic segmentation. IEEE Transactions on Image Processing (2019) 1-1
42. Kanopoulos, N., Vasanthavada, N., Baker, R.L.: Design of an image edge detection filter using the Sobel operator. IEEE Journal of Solid-State Circuits (1988) 358-367
43. Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. CoRR abs/1210.5644