TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss
Hyojin Park,
Ganesh Venkatesh, Nojun Kwak
Seoul National University
[email protected], [email protected], [email protected]
Abstract
Semi-supervised video object segmentation (semi-VOS) is widely used in many applications. The task is to track a class-agnostic object specified by a given segmentation mask. Various approaches have been developed for this purpose based on optical flow, online-learning, and memory networks. These methods show high accuracy but are hard to utilize in real-world applications due to slow inference time and tremendous complexity. To resolve this problem, template matching methods have been devised for fast processing speed, at the cost of a large drop in accuracy. We introduce a novel semi-VOS model based on a template matching method and a novel temporal consistency loss to reduce the performance gap from heavy models while greatly expediting inference. Our template matching method consists of short-term and long-term matching. The short-term matching enhances target object localization, while the long-term matching improves fine details and handles object shape changes through the newly proposed adaptive template attention module. However, the long-term matching causes error propagation due to the inflow of past estimated results when updating the template. To mitigate this problem, we also propose a temporal consistency loss for better temporal coherence between neighboring frames by adopting the concept of a transition matrix. Our model obtains a 79.5 J&F score at a speed of 73.8 FPS on the DAVIS16 benchmark.

Introduction

Video object segmentation (VOS) is essential in many applications such as autonomous driving, video editing, and surveillance systems. In this paper, we focus on the semi-supervised video object segmentation (semi-VOS) task, which is to track a target at pixel-wise resolution from an annotated mask given for the first frame.

For accurate tracking, many approaches have been applied, such as optical flow, online-learning, and memory networks. Optical flow is one of the popular methods in low-level vision and has been applied in diverse video applications. In video segmentation, it propagates a given mask or features by computing pixel-wise trajectories or movements of objects (Lin, Chou, and Martinez 2020; Wang et al. 2018a; Hu et al. 2018; Cheng et al. 2017).

Figure 1: Speed (FPS) vs. accuracy (J&F score) on the DAVIS2016 validation set. Our proposed TTVOS achieves high accuracy with small complexity. HR/RN respectively denote HRNet/ResNet50 as the backbone network.

However, it is too demanding to compute exact flow vectors, which contain excessive information for the segmentation task. For example, if we know the binary information of whether a pixel has changed into the foreground or background, we do not need an exact flow vector for each pixel. Another popular method is online-learning, which fine-tunes model parameters using the first frame image and the corresponding ground truth mask (Robinson et al. 2020; Maninis et al. 2018; Perazzi et al. 2017; Caelles et al. 2017). This strategy makes the model more specialized to each video input. However, it requires additional time and memory for fine-tuning. Finally, the memory network approach adopts the concept of key, value, and query components from the QA task in the NLP domain. These methods maintain a target memory and match the current frame with the entries in the memory. STM (Oh et al. 2019) stacked multiple memories for handling shape changes and occlusions. However, the inference time and the required memory increase in proportion to the number of frames.
To solve these problems, GC (Li, Shen, and Shan 2020) accumulates the memories at each time frame using the global context module. However, it needs an additional feature extraction step to update the memory from the current estimated mask and image. Also, this module can be considered a kind of channel attention method, and we believe it is not enough to directly comprehend spatial information, since the global context does not make an (hw × hw) memory like (Zhu et al. 2019; Wang et al. 2018b) but a (c_key × c_val) memory*.

* h and w are the height and width of the input feature map for constructing memory, and c_key and c_val are the numbers of channels of the key and value feature maps.

The aforementioned methods have increased accuracy a lot, but they are difficult to apply in a real environment due to heavy inference time and memory. The template matching approach resolves this problem by designing a target template from the given image and annotation. It calculates a similarity between the template and the feature of the current frame for tracking (Voigtlaender et al. 2019; Johnander et al. 2019; Wang et al. 2019c). This approach does not need extra computation for generating memories or fine-tuning. SiamMask (Wang et al. 2019b) crops the target object in an image by a bounding box derived from the given mask to create a template. The template is used as the weights of a depth-wise convolution to compute similarity with the current image feature. This work shows an inference time suitable for a real environment. However, the accuracy is lower than that of other models because the matching method is too simple and the template is never updated. Thus, SiamMask can hardly handle object shape variation.

In this paper, we propose an adaptive template matching method and a novel temporal consistency loss for semi-VOS. Our contributions can be summarized as follows: 1) We propose a new lightweight VOS model based on a template matching method that combines short-term and long-term matching to achieve fast inference time and to reduce the accuracy gap from heavy and complex models. More specifically, short-term matching compares the current frame's feature with the information of the previous frame for localization, and long-term matching devises an adaptive template for generating an accurate mask. 2) We introduce a novel adaptive template, motivated by GC, for managing shape variation of target objects. Our adaptive template is updated from the current estimated mask without re-extracting features or occupying additional memory. 3) To train the model, we propose a new temporal consistency loss for mitigating the error propagation problem, one of the main reasons for performance degradation, caused by adopting past estimated results. To the best of our knowledge, this work is the first to apply the concept of a consistency loss to the semi-VOS task. Our model generates a transition matrix to encourage the correction of incorrectly estimated pixels from the previous frame and to prevent their propagation to future frames. Our model achieves a 79.5 J&F score at a speed of 73.8 FPS on the DAVIS16 benchmark (see Fig. 1). We also verified the efficacy of the temporal consistency loss by applying it to other models and showing increased performance.
Related Work

Optical flow:
Optical flow, which estimates flow vectors of moving objects, is widely used in many video applications (Khoreva et al. 2017; Dutt Jain, Xiong, and Grauman 2017; Tsai, Yang, and Black 2016a; Sevilla-Lara et al. 2016). In the semi-VOS task, it aligns the given mask or features with the estimated flow vectors. SegFlow (Cheng et al. 2017) designed two branches, one for image segmentation and one for optical flow, whose outputs are combined to estimate the target masks. Similarly, FAVOS (Lin, Chou, and Martinez 2020) and CRN (Hu et al. 2018) refine a rough segmentation mask using optical flow.

Online-learning:
The online-learning method trains the model with new data at each inference iteration (Sahoo et al. 2018; Zhou, Sohn, and Lee 2012; Kivinen, Smola, and Williamson 2004). In the semi-VOS task, model parameters are fine-tuned in the inference stage with the given input image and corresponding mask. Therefore, the model is specialized to the given condition of the clip (Maninis et al. 2018; Perazzi et al. 2017; Caelles et al. 2017). However, fine-tuning causes additional latency at inference time. FRTM-VOS (Robinson et al. 2020) resolves this issue by dividing the model into two sub-networks: a lightweight network that is fine-tuned in the inference stage to make a coarse score map, and a heavy segmentation network that requires no fine-tuning. This design enables fast optimization and relieves the burden of online-learning.
Memory network:
The memory network constructs an external memory representing various properties of the target. It was devised for handling long-term sequential tasks in the NLP domain, such as the QA task (Kim et al. 2019; Sukhbaatar et al. 2015; Weston, Chopra, and Bordes 2014). STM (Oh et al. 2019) adopted this idea for the semi-VOS task with a new definition of key and value: the key encodes visual semantic clues for matching, and the value stores detailed information for making the mask. However, it requires lots of resources because the amount of memory increases over time. Furthermore, the size of the memory matching is the square of the resolution of the input feature map. To lower this huge complexity, GC (Li, Shen, and Shan 2020) does not stack memory at each time frame but accumulates it into a single memory, which is also smaller than a unit memory of STM.
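To make the key-value matching described above concrete, the following is a minimal, generic sketch of a memory read in PyTorch. It illustrates the general mechanism rather than STM's or GC's actual implementation; all tensor names, channel sizes, and shapes are our assumptions.

```python
import torch
import torch.nn.functional as F

def memory_read(key_mem, val_mem, key_query):
    """Generic key/value memory read as used by memory-network VOS methods.

    key_mem:   (C_key, T*H*W)  keys stored from past frames
    val_mem:   (C_val, T*H*W)  values stored from past frames
    key_query: (C_key, H*W)    key of the current frame
    Returns a read-out value map of shape (C_val, H*W).
    """
    # Affinity between every query location and every memory location.
    affinity = key_mem.t() @ key_query          # (T*H*W, H*W)
    weights = F.softmax(affinity, dim=0)        # normalize over memory entries
    return val_mem @ weights                    # (C_val, H*W)

# Example: a memory built from 3 past frames of a 24x24 feature map.
key_mem = torch.randn(64, 3 * 24 * 24)
val_mem = torch.randn(256, 3 * 24 * 24)
key_query = torch.randn(64, 24 * 24)
readout = memory_read(key_mem, val_mem, key_query)   # (256, 576)
```

The affinity matrix here has (T·H·W) × (H·W) entries, which is why stacking memories over many frames quickly becomes expensive, as discussed above.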
Template matching:
Template matching is one of the traditional methods in the tracking task. It generates a template and calculates its similarity with the input as a matching operation. Most works match a feature map from the given image with a template following the siamese network (Bertinetto et al. 2016), but A-GAME (Johnander et al. 2019) designed a target distribution by a mixture of Gaussians in an embedding space and predicted posterior class probabilities for matching. RANet (Wang et al. 2019c) applied a ranking system to the matching process between multiple templates and the input to extract reliable results. FEELVOS (Voigtlaender et al. 2019) calculated distance maps by local and global matching for better robustness. SiamMask (Wang et al. 2019b) uses a depth-wise correlation for fast matching and makes its template from a bounding box initialization.
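As an illustration of the depth-wise correlation matching mentioned above, here is a hedged PyTorch sketch; the function name, shapes, and channel counts are placeholders of ours rather than the SiamMask code.

```python
import torch
import torch.nn.functional as F

def depthwise_correlation(search_feat, template_feat):
    """Match a target template against the current frame feature map.

    search_feat:   (1, C, H, W)  feature of the current (search) frame
    template_feat: (1, C, h, w)  feature cropped around the target object
    Returns a (1, C, H-h+1, W-w+1) response map; high values indicate
    locations that look like the template.
    """
    c = search_feat.size(1)
    # Use the template as a per-channel convolution kernel (groups=C).
    kernel = template_feat.view(c, 1, *template_feat.shape[-2:])
    return F.conv2d(search_feat, kernel, groups=c)

search = torch.randn(1, 256, 31, 31)
template = torch.randn(1, 256, 7, 7)
response = depthwise_correlation(search, template)   # (1, 256, 25, 25)
```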
Consistency Loss:
Consistency losses are widely used for improving performance in semi-supervised learning, enhancing robustness to input perturbations, enabling stable training under specific constraints, and so on (Jeong et al. 2019; Miyato et al. 2018; Zhu et al. 2017). In VOS, consistency usually means temporal coherence between neighboring frames, obtained with additional clues from optical flow (Tsai, Yang, and Black 2016b; Volz et al. 2011; Weickert and Schnörr 2001).

Figure 2: The overall architecture of TTVOS. A backbone feature is shared across all processes of TTVOS for efficiency. Our model contains two types of template matching (long-term and short-term), a decoding stage, and a template update stage. The transition matrix π̂_t is computed only in the training phase for enhancing temporal coherence.

Proposed Method

In this section, we present our semi-VOS model. Section 3.1 introduces the whole model architecture and how we manage multi-object VOS. Section 3.2 explains the details of the template attention module for long-term matching; we also describe how to update the template and how to produce a similarity map. Finally, Section 3.3 presents our temporal consistency loss and how to define a new ground truth for mitigating error propagation between neighboring frames.
We propose a new architecture for VOS, as shown in Fig. 2. TTVOS consists of feature extraction, template matching, decoding, and template update stages. The template matching is composed of short-term matching and long-term matching. The short-term matching enhances the localization property by using previous-frame information and operates on a small feature map to produce a coarse segmentation map. However, this incurs two problems: 1) utilizing only the information of the previous frame makes the output masks overly dependent on previous results, and 2) the small feature map can neither handle shape changes nor capture the detailed target shape. To resolve these problems, we propose long-term matching as an adaptive template matching method. This template is initialized from the given first-frame condition and updated at each frame. Therefore, it considers all previous frames and can track gradually changing objects. This module uses a larger feature map to obtain more detailed information for generating accurate masks. Afterwards, our model executes decoding and updates each template step by step.

A backbone extracts feature maps f_t^{1/N} from the current frame, where f_t^{1/N} denotes a feature map at frame t whose width and height are 1/N of the input size. Short-term matching uses a small feature map and the previous frame information for target localization: the previous small feature map is concatenated with the previous mask heatmap Ĥ_{t−1}, which consists of two channels containing the probabilities of background and foreground, respectively. This concatenated feature map is then forwarded through several convolution layers to embed the localization information from the previous frame, and the result is blended with the current feature map to obtain an enhanced localization property. In the long-term template matching stage, a larger feature map is concatenated with the previous mask heatmap and compared with the adaptive template to produce a similarity map in the template attention module; the details are in Section 3.2. Only at training time, the similarity map is also used to estimate a transition matrix that encourages temporal consistency between neighboring frames, as detailed in Section 3.3. The resultant similarity map is concatenated with the short-term matching result. Finally, a higher-resolution backbone feature is added for a more accurate mask. We use ConvTranspose for upsampling and PixelShuffle (Shi et al. 2016) in the final upsampling stage to prevent the grid effect. After target mask estimation, the corresponding feature maps and Ĥ_t are used to update the short-term and long-term templates for the next frame. All the backbone features are also shared in the multi-object case, but the two template matching stages and the decoding stage are conducted separately for each object. Therefore, each object's heatmap always has two channels for the probabilities of background and foreground. At inference time, all the heatmaps are combined by the soft aggregation method (Cho et al. 2020; Johnander et al. 2019).
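As a concrete illustration of the short-term matching path described above (concatenate the previous feature with the two-channel heatmap, embed it with a few convolutions, then blend with the current feature), here is a minimal PyTorch-style sketch. Layer widths, kernel sizes, and the blending convolution are illustrative assumptions, not the exact TTVOS configuration.

```python
import torch
import torch.nn as nn

class ShortTermMatching(nn.Module):
    """Embed the previous frame's feature and mask heatmap, then blend the
    result with the current frame's feature for coarse target localization."""
    def __init__(self, feat_ch=128, embed_ch=128):
        super().__init__()
        # Previous feature + 2-channel heatmap (background / foreground).
        self.embed = nn.Sequential(
            nn.Conv2d(feat_ch + 2, embed_ch, 3, padding=1),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(embed_ch, embed_ch, 3, padding=1),
            nn.LeakyReLU(inplace=True),
        )
        self.blend = nn.Conv2d(feat_ch + embed_ch, embed_ch, 3, padding=1)

    def forward(self, feat_cur, feat_prev, heatmap_prev):
        loc = self.embed(torch.cat([feat_prev, heatmap_prev], dim=1))
        return self.blend(torch.cat([feat_cur, loc], dim=1))

stm = ShortTermMatching()
f_prev, f_cur = torch.randn(1, 128, 30, 30), torch.randn(1, 128, 30, 30)
h_prev = torch.softmax(torch.randn(1, 2, 30, 30), dim=1)
coarse = stm(f_cur, f_prev, h_prev)   # (1, 128, 30, 30)
```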
We conjecture that the pixels inside a target object have a distinct embedding vector that distinguishes them from non-target pixels. Our module is designed to find this vector by self-attention while suppressing information irrelevant to the target object. Each current embedding vector updates the previous long-term template by a weighted average at each frame. The proposed module then generates a similarity map by template matching to enhance the detailed regions, as shown in Fig. 3.

Figure 3: (a) Process in a template attention module. Here, a red (blue) color means a high (low) similarity between two pieces of information. The sizes of f(X'_{t−1}) and g(X'_{t−1}) are c_tp × HW, but we draw the feature maps as c_tp × H × W for the sake of convenient understanding. (b) The detailed structure of a template attention module and a template update. An operation (a, b, c) denotes the input channel, output channel, and kernel size of a convolution operation, respectively.

For constructing the current embedding vectors, the backbone feature f_{t−1} and the previous estimated mask heatmap Ĥ_{t−1} are concatenated to suppress information far from the target object. In Fig. 3, the concatenated feature map is denoted as X'_{t−1}. X'_{t−1} is forwarded to two separate branches f(·) and g(·), making f(X'_{t−1}), g(X'_{t−1}) ∈ R^(c_tp × H × W). The feature maps are then reshaped to c_tp × HW and multiplied to generate an embedding matrix I as follows:

I = σ( f(X'_{t−1}) × g(X'_{t−1})^T ) ∈ R^(c_tp × c_tp).    (1)

Here, σ is a softmax function applied row-wise. I_{i,j}, the (i, j) element of I, corresponds to the i-th channel's view of the j-th channel's information, obtained by dot-producting along the HW direction. X'_{t−1} hampers the inflow of information far from the target object through Ĥ_{t−1}. Thus I_{i,j} considers only pixels inside or near the target object, and this operation is similar to global pooling and the region-based operation (Caesar, Uijlings, and Ferrari 2016) in terms of making one representative value from the whole HW-sized channel and concentrating on a certain region. For example, if the hexagon in Fig. 3(a) indicates the estimated location of the target from the previous mask, the information outside of the hexagon is suppressed. Then f(X'_{t−1}) and g(X'_{t−1}) are compared with each other along the whole HW plane. If two channels are similar, the resultant value of I will be high (red pixel in Fig. 3(a)); otherwise, it will be low (blue pixel). Finally, we have c_tp embedding vectors of size 1 × c_tp containing information about the target object. The final long-term template TP_t is updated by a weighted average of the embedding matrix I and the previous template TP_{t−1} as below:

TP_t = ((t−1)/t) TP_{t−1} + (1/t) I.    (2)

The template attention module generates a similarity map S_t ∈ R^(c_tp × H × W) by attending to each channel of the query feature map q(X_t) ∈ R^(c_tp × H × W) through the template TP_t as follows:

S_t = TP_t × q(X_t).    (3)

In doing so, the previous estimated mask heatmap Ĥ_{t−1} enhances the backbone feature map f_t around the previous target object location: the concatenated feature is forwarded to a convolution layer, resulting in a feature map X_t, and X_t is then forwarded through several convolution layers to generate the query feature map q(X_t), as shown in Fig. 3. In Eq. (3), the similarity is measured between each row of TP_t (a template vector) and each spatial feature from q(X_t), both of which are of length c_tp. When a template vector is similar to the spatial feature, the resultant S_t value will be high (red pixel in Fig. 3(a)); otherwise, it will be low (blue in Fig. 3(a)). The global similarity feature S_t and the modified feature map f'_t are then concatenated to make the final feature map by blending both results, as shown at the bottom of Fig. 3(b).
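A minimal sketch of Eqs. (1)-(3) follows: the embedding matrix I is a row-wise softmax over f(X'_{t−1}) g(X'_{t−1})^T, the template TP_t is a running average over frames, and the similarity map is TP_t q(X_t). The channel size c_tp and the toy tensors are assumptions, and the convolution stacks producing f, g, and q are omitted.

```python
import torch
import torch.nn.functional as F

def update_template(tp_prev, f_x, g_x, t):
    """Eq. (1)-(2): build the embedding matrix I and update the template.

    f_x, g_x: (c_tp, H*W) features from the two branches on X'_{t-1}
    tp_prev:  (c_tp, c_tp) template from frame t-1 (None at t=1)
    """
    I = F.softmax(f_x @ g_x.t(), dim=1)          # (c_tp, c_tp), row-wise softmax
    if tp_prev is None:
        return I
    return (t - 1) / t * tp_prev + I / t         # running average over frames

def similarity_map(tp, q_x, h, w):
    """Eq. (3): compare each template vector with every spatial query feature."""
    s = tp @ q_x                                  # (c_tp, H*W)
    return s.view(-1, h, w)                       # (c_tp, H, W)

c_tp, h, w = 64, 30, 30
f_x = torch.randn(c_tp, h * w)
g_x = torch.randn(c_tp, h * w)
q_x = torch.randn(c_tp, h * w)

tp = update_template(None, f_x, g_x, t=1)         # initialize from the first frame
tp = update_template(tp, f_x, g_x, t=2)           # update at the next frame
s = similarity_map(tp, q_x, h, w)                 # (64, 30, 30)
```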
To reduce computational cost while retaining a large receptive field, we use group convolution (group size of 4) with a large kernel size for generating f(·), g(·), and q(·). Although depth-wise convolutions cost less than group convolutions, we do not use them because their larger group count adversely impacts the model execution time (Ma et al. 2018). We select LeakyReLU as the non-linearity to avoid the dying-ReLU problem. We empirically determined that applying a point-wise convolution first and then the group convolution achieves better accuracy (shown in Fig. 3(b)).

Our template attention module has some similarity to GC but is conceptually very different and computationally much cheaper, as shown in Table 1. Unlike GC, which is a memory network approach, our method is a kind of template matching approach. Specifically, GC extracts backbone features again from a new input combining the image and the mask to generate new memory, and then produces a global context matrix from different-sized key and value maps. In contrast, our template method simply combines the current estimated mask with the already-computed backbone feature and uses same-sized feature maps for self-attention to construct multiple embedding vectors representing various characteristics of the target.

Model | Read | Seg | Update | Param | J&F
GC    | 1.05 G | 36.8 G | 37.1 G | 38 M | 86.6
Ours  | 0.08 G | 5.29 G | 0.06 G | 1.6 M | 79.5

Table 1: Complexity and accuracy comparison between GC and ours for the same input image size. Read, Seg, and Update denote the FLOPs required for reading a memory or a template, making a segmentation mask without the decoding stage, and updating a memory or a template, respectively. Our method greatly reduces the computation for updating the template.

Figure 4: ((a)-(d)) Frames t−1 and t from top to bottom. (a) Input image. (b) Ground truth. (c) Our result. (d) Estimated mask with color marking: blue means a wrong segmentation result, and the blue region in frame t is corrected from frame t−1. (e) Visualization of π_{t,2}. Top: H_t − H_{t−1}; bottom: H_t − Ĥ_{t−1}. H_t − H_{t−1} cannot remove the false positive region at the top of (c).

Our adaptive template deals with the target shape-changing problem by analyzing a backbone feature and an estimated mask over all executed frames. However, using previous estimations incurs an innate error propagation issue. For example, when the template is updated with a wrong result, it will gradually lead to incorrect tracking. If the model learns how to correct the wrong estimation of the previous frame, it can mitigate this error propagation problem. For this reason, we calculate a transition matrix π̂_t from the output feature map of the template attention module, as shown in Fig. 2. We design a novel temporal consistency loss L_tc based on π̂_t, and this loss encourages the model to gain correction power and to attain consistency between neighboring frames:

π_t = H_t − Ĥ_{t−1},    L_tc = ||π̂_t − π_t||.    (4)

As a new learning target, we build a target transition matrix from the ground truth heatmap H_t and the previous estimated mask heatmap Ĥ_{t−1} as in Eq. (4). Note that the first and the second channels of H_t are the probabilities of background and foreground from the ground truth mask of frame t, respectively. By Eq. (4), the range of π_t becomes (−1, 1), and π_t is a two-channel feature map indicating the transition tendency from t−1 to t. In detail, the first channel contains the transition tendency of the background while the second is for the foreground.
For example, if the value of π_{t,2}^{i,j}, the (i, j) element of π_t in the second channel, is close to 1, it encourages the estimated class at position (i, j) to change into foreground from frame t−1 to t. On the other hand, if it is close to −1, it prevents the estimated class from turning into foreground. Finally, when the value is close to 0, it keeps the estimated class of frame t−1 for frame t.

The reason why we use Ĥ_{t−1} instead of H_{t−1} is illustrated in Fig. 4. Fig. 4(b) shows the ground truth masks, and (c) shows the estimated masks at frame t−1 (top) and t (bottom). The first row of Fig. 4(e) visualizes (H_t − H_{t−1}), which guides the estimation to maintain the false positive region from frame t−1 to t. The second row of Fig. 4(e) visualizes (H_t − Ĥ_{t−1}), which guides the estimation to remove the false positive region of frame t−1. Fig. 4(d) marks the false estimation results in blue by comparing (b) and (c). As shown in Fig. 4(d), the transition matrix π_t helps reduce the false positive region from frame t−1 to t. With L_tc, the overall loss becomes:

Loss = CE(ŷ_t, y_t) + λ L_tc,    (5)

where λ is a hyper-parameter that controls the balance between the loss terms, and we set λ = 5. CE denotes the cross entropy between the pixel-wise ground truth y_t at frame t and its prediction ŷ_t.
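The training objective of Eqs. (4)-(5) can be sketched as below. The heatmap layout (two channels for background/foreground) and λ = 5 follow the text; the norm reduction and the toy tensors are our assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(pi_hat, gt_heatmap_t, est_heatmap_prev):
    """Eq. (4): pi_t = H_t - H^hat_{t-1};  L_tc = ||pi^hat_t - pi_t||."""
    pi_target = gt_heatmap_t - est_heatmap_prev          # values in (-1, 1)
    return torch.norm(pi_hat - pi_target)

def total_loss(logits_t, labels_t, pi_hat, gt_heatmap_t, est_heatmap_prev, lam=5.0):
    """Eq. (5): cross entropy on the current frame plus lambda * L_tc."""
    ce = F.cross_entropy(logits_t, labels_t)
    return ce + lam * temporal_consistency_loss(pi_hat, gt_heatmap_t, est_heatmap_prev)

# Toy example: 2-channel (background / foreground) heatmaps of a 30x30 frame.
logits = torch.randn(1, 2, 30, 30)                         # raw predictions at frame t
labels = torch.randint(0, 2, (1, 30, 30))                  # ground-truth mask at frame t
h_gt = F.one_hot(labels, 2).permute(0, 3, 1, 2).float()    # H_t
h_prev = torch.softmax(torch.randn(1, 2, 30, 30), 1)       # estimated H^hat_{t-1}
pi_hat = torch.tanh(torch.randn(1, 2, 30, 30))             # predicted transition matrix
loss = total_loss(logits, labels, pi_hat, h_gt, h_prev)
```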
Experiments

We show various evaluations using the DAVIS benchmarks (Pont-Tuset et al. 2017; Perazzi et al. 2016). DAVIS16 is a single-object task consisting of 30 training videos and 20 validation videos, while DAVIS17 is a multi-object task with 60 training videos and 30 validation videos. We evaluated our model using the official benchmark code†. The DAVIS benchmark reports model accuracy as the average of the mean Jaccard index J and the mean boundary score F. The J index measures overall accuracy by comparing the estimated mask with the ground truth mask, while the F score focuses more on contour accuracy by delimiting the spatial extent of the mask. Further experimental results on the Youtube-VOS dataset (Xu et al. 2018) are reported in the supplementary material.

Implementation Detail:
We used HRNetV2-W18-Small-v1 (Wang et al. 2019a) as a lightweight backbone network and initialized it with the pre-trained parameters from the official code‡. We froze every backbone layer except the last block. We upsampled the smallest feature map and concatenated it with the second smallest feature map. We used the ADAM optimizer for training. We first pre-trained with synthetic video clips generated from image datasets and then trained with video datasets on a single GPU, following (Oh et al. 2019; Voigtlaender et al. 2019; Wang et al. 2019b; Johnander et al. 2019).

† https://github.com/davisvideochallenge/davis2017-evaluation
‡ https://github.com/HRNet/HRNet-Semantic-Segmentation

Model | Backbone | OL | Memory | YTB | Seg | Synth | DV17 | DV16 | FPS
OnAVOS (Voigtlaender and Leibe 2017) | VGG16 | o | - | - | o | - | 67.9 | 85.5 | 0.08
OSVOS-S (Maninis et al. 2018) | VGG16 | o | - | - | o | - | 68.0 | 86.5 | 0.22
FRTM-VOS (Robinson et al. 2020) | ResNet101 | o | - | o | - | - | 76.7 | 83.5 | 21.9
STM (Oh et al. 2019) | ResNet50 | - | o | o | - | o | 81.8 | 89.3 | 6.25
GC (Li, Shen, and Shan 2020) | ResNet50 | - | o | o | - | o | 71.4 | 86.6 | 25.0
OSMN (Yang et al. 2018) | VGG16 | - | - | - | o | - | 54.8 | 73.5 | 7.69
RANet (Wang et al. 2019c) | ResNet101 | - | - | - | - | o | 65.7 | 85.5 | 30.3
A-GAME (Johnander et al. 2019) | ResNet101 | - | - | o | - | o | 70 | 82.1 | 14.3
FEELVOS (Voigtlaender et al. 2019) | Xception 65 | - | - | o | o | - | 71.5 | 81.7 | 2.22
SiamMask (Wang et al. 2019b) | ResNet50 | - | - | o | o | - | 56.4 | 69.8 | 55.0
TTVOS (Ours) | HRNet | - | - | o | - | o | 58.7 | 79.5 | 73.8
TTVOS-RN (Ours) | ResNet50 | - | - | o | - | o | 67.8 | 83.8 | 39.6
Table 2: Quantitative comparison on the DAVIS benchmark validation sets. OL and Memory denote the online-learning approach and the memory network approach. YTB means using Youtube-VOS for training. Seg means pre-training on a segmentation dataset such as Pascal (Everingham et al. 2015) or COCO (Lin et al. 2014). Synth means using a saliency dataset to make synthetic video clips by affine transformation. We report detailed numbers of parameters and FLOPs in the supplementary material.

Figure 5: Example of parkour for frames 1, 34, and 84 from top to bottom. Column (a) shows input images overlapped with the ground truth masks. RM-LongM denotes estimated results with the long-term matching information removed by replacing it with zeros.

Pre-train with images:
We followed the pre-training method of (Li, Shen, and Shan 2020; Oh et al. 2019; Wang et al. 2019c), which applies random affine transformations (rotation, scaling in [0.75, 1.25], and thin-plate warping (Perazzi et al. 2017)) to a static image to generate a synthetic video clip. We used the saliency detection datasets MSRA10K (Cheng et al. 2014), ECSSD (Yan et al. 2013), and HKU-IS (Li and Yu 2015) to obtain various static images. Synthetic video clips consisting of three frames were generated, and we pre-trained for 100 epochs.
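To illustrate this synthetic-clip generation, the sketch below applies one random affine transform jointly to a static image and its mask with torchvision; the rotation, scaling, and shift ranges are placeholders standing in for the paper's settings, and thin-plate warping is omitted.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def synth_frame(image, mask, max_rot=20.0, scale_range=(0.75, 1.25), max_shift=20):
    """Apply one random affine transform jointly to a PIL image and its mask.

    The rotation limit, scale range, and shift limit are illustrative placeholders.
    """
    angle = random.uniform(-max_rot, max_rot)
    scale = random.uniform(*scale_range)
    translate = [random.randint(-max_shift, max_shift), random.randint(-max_shift, max_shift)]
    img_t = TF.affine(image, angle=angle, translate=translate, scale=scale, shear=0.0,
                      interpolation=InterpolationMode.BILINEAR)
    # Use nearest interpolation so the mask stays a hard label map.
    msk_t = TF.affine(mask, angle=angle, translate=translate, scale=scale, shear=0.0,
                      interpolation=InterpolationMode.NEAREST)
    return img_t, msk_t

def synth_clip(image, mask, length=3):
    """Build a short synthetic 'video' from one static image and its annotation."""
    return [(image, mask)] + [synth_frame(image, mask) for _ in range(length - 1)]
```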
Main-train with videos:
We initialized the whole network with the best parameters from the previous step and trained the model on video datasets. We used a two-stage training method: in the first stage we used only Youtube-VOS, and we then trained on the DAVIS16 dataset for additional epochs. In both stages, we used short clips of consecutive frames.

Comparison to state-of-the-art:
We compared our method with other recent models, as shown in Table 2. We report backbone models and training datasets for clarity because each model has a different setting. Furthermore, we show additional results with ResNet50 because some recent models utilize ResNet50 for feature extraction. Our model shows the best accuracy among models with similar speed. Specifically, SiamMask is one of the popular fast template matching methods, and our model has better accuracy and speed than SiamMask on both the DAVIS16 and DAVIS17 benchmarks. When we used ResNet50, our model obtained better or competitive results compared with FRTM-VOS, A-GAME, RANet, and FEELVOS. Also, this ResNet50-based model decreases DAVIS16 accuracy by 2.8 compared to GC but is 1.6 times faster. Therefore, our method achieves favorable performance among fast VOS models and reduces the performance gap from the online-learning and memory network based models.
Exp | SM | LM | Lup | TC | M | DV17 | DV16
1   | o  | -  | -   | -  | o | 57.0 | 75.9
2   | -  | o  | o   | -  | o | 54.5 | 78.8
3   | o  | o  | o   | -  | o | 57.5 | 77.1
4   | o  | o  | o   | o  | - | 58.6 | 77.6
5   | o  | o  | -   | o  | o | 57.2 | 77.4
6   | o  | o  | o   | o  | o | 58.7 | 79.5

Table 3: Ablation study on DAVIS16 and DAVIS17. SM, LM, and TC denote short-term matching, long-term matching, and the temporal consistency loss. Lup represents updating the long-term template at every frame, and M means using the original ground truth mask as the initial condition.
Figure 6: Horsejump-high example of the ablation study for two frames from top to bottom. (a) Ground truth. (b) Using only short-term matching. (c) Using only long-term matching. (d) Our proposed method (Exp6).

Ablation Study:
To validate our proposed methods, we performed an ablative analysis on the DAVIS16 and DAVIS17 benchmarks, as shown in Table 3. SM and LM mean short-term matching and long-term matching, respectively. When we do not use short-term or long-term matching, we replace the corresponding matching method with a concatenation of the previous mask heatmap and the current feature map, which is then forwarded through several convolution layers. Lup represents updating the long-term template at every frame; if it is not used, the model never updates the template. TC denotes using the temporal consistency loss; without it, the model uses only the cross-entropy loss. M denotes using the original ground truth mask as the initial condition; if M is not checked, a box-shaped mask is used for the initial condition as in SiamMask. Exp1 uses only short-term matching and Exp2 uses only long-term matching, while Exp3-6 use both matching methods. Table 3 reports the corresponding accuracy for each ablation experiment, and Fig. 6 visualizes the efficacy of each template matching.

We found that short-term matching helps maintain object IDs through the localization clue, and long-term matching improves mask quality by enhancing the detailed regions. For example, Exp1 keeps the object ID but fails to make an accurate mask for the horse legs, as shown in Fig. 6(b). On the contrary, Exp2 makes an accurate shape but loses the green object (rider) ID, as shown in Fig. 6(c). Exp2 shows performance degradation on the multi-object tracking task (DAVIS17) due to the failure to maintain object IDs, even though it generates more accurate masks than Exp1. Therefore, Exp1 achieves better performance on DAVIS17, while Exp2 shows higher accuracy on DAVIS16. Exp3 obtains the advantages of both template matching methods, and Fig. 6(d) shows the results of our proposed method (Exp6), which does not lose object IDs and generates delicate masks with high performance on both benchmarks. Exp4-6 explain why our model shows better performance than SiamMask even with a more lightweight backbone. The initial condition of a box-shaped mask does not degrade performance much compared with Exp6. However, when the model does not update the long-term template, the accuracy degrades considerably from our proposed method.

Model | Backbone | DV17 | DV16
FRTM-VOS (Robinson et al. 2020) | ResNet101 | 76.7 | 83.5
FRTM-VOS (Robinson et al. 2020) | ResNet18  | 70.2 | 78.5
with TC Loss | ResNet101 | 76.6 | 85.2
with TC Loss | ResNet18  | 71.8 | 82.0

Table 4: DAVIS17 and DAVIS16 results when additionally applying the temporal consistency loss (TC Loss).
Temporal Consistency Loss:
We conducted further experiments to demonstrate the efficacy of our temporal consistency loss with FRTM-VOS, one of the fast online-learning methods, using ResNet101 and ResNet18 as the backbone networks. We implemented our proposed loss function based on the official FRTM-VOS code§ and followed their training strategy. Our proposed loss is more useful with the lightweight backbone (ResNet18), as shown in Table 4. When we applied our loss to the ResNet101 model, the accuracy on DAVIS17 decreased slightly by 0.1%, but it increased by 1.7% on DAVIS16. With the ResNet18 model, we improved the accuracy considerably on both DAVIS17 and DAVIS16. We conjecture that using our loss not only improves mask quality but also alleviates the overfitting caused by fine-tuning on the given condition.

§ https://github.com/andr345/frtm-vos

Conclusion

Many semi-VOS methods have improved accuracy, but they are hard to utilize in real-world applications due to their tremendous complexity. To resolve this problem, we proposed a novel lightweight semi-VOS model consisting of short-term and long-term matching modules. The short-term matching enhances localization, while the long-term matching improves mask quality through an adaptive template. However, using past estimated results incurs an error propagation problem. To mitigate this problem, we also devised a new temporal consistency loss that corrects falsely estimated regions using the concept of a transition matrix. Our model achieves fast inference time while reducing the performance gap from heavy models. We also showed that the proposed temporal consistency loss can improve the accuracy of other models.

References

Bertinetto, L.; Valmadre, J.; Henriques, J. F.; Vedaldi, A.; and Torr, P. H. S. 2016. Fully-Convolutional Siamese Networks for Object Tracking. In ECCV 2016 Workshops, 850–865.
Caelles, S.; Maninis, K.; Pont-Tuset, J.; Leal-Taixé, L.; Cremers, D.; and Van Gool, L. 2017. One-Shot Video Object Segmentation. In Computer Vision and Pattern Recognition (CVPR).
Caesar, H.; Uijlings, J.; and Ferrari, V. 2016. Region-based semantic segmentation with end-to-end training. In European Conference on Computer Vision, 381–397. Springer.
Cheng, J.; Tsai, Y.-H.; Wang, S.; and Yang, M.-H. 2017. SegFlow: Joint learning for video object segmentation and optical flow. In Proceedings of the IEEE International Conference on Computer Vision, 686–695.
Cheng, M.-M.; Mitra, N. J.; Huang, X.; Torr, P. H.; and Hu, S.-M. 2014. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Cho, S.; Cho, M.; Chung, T.-y.; Lee, H.; and Lee, S. 2020. CRVOS: Clue Refining Network for Video Object Segmentation. arXiv preprint arXiv:2002.03651.
Dutt Jain, S.; Xiong, B.; and Grauman, K. 2017. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3664–3673.
Everingham, M.; Eslami, S. A.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2015. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision.
Hu, P.; Wang, G.; Kong, X.; Kuen, J.; and Tan, Y.-P. 2018. Motion-Guided Cascaded Refinement Network for Video Object Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1400–1409.
Jeong, J.; Lee, S.; Kim, J.; and Kwak, N. 2019. Consistency-based semi-supervised learning for object detection. In Advances in Neural Information Processing Systems, 10759–10768.
Johnander, J.; Danelljan, M.; Brissman, E.; Khan, F. S.; and Felsberg, M. 2019. A generative appearance model for end-to-end video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8953–8962.
Khoreva, A.; Benenson, R.; Ilg, E.; Brox, T.; and Schiele, B. 2017. Lucid data dreaming for object tracking. In The DAVIS Challenge on Video Object Segmentation.
Kim, J.; Ma, M.; Kim, K.; Kim, S.; and Yoo, C. D. 2019. Progressive attention memory network for movie story question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8337–8346.
Kivinen, J.; Smola, A. J.; and Williamson, R. C. 2004. Online learning with kernels. IEEE Transactions on Signal Processing.
Li, G.; and Yu, Y. 2015. Visual saliency based on multiscale deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5455–5463.
Li, Y.; Shen, Z.; and Shan, Y. 2020. Fast Video Object Segmentation using the Global Context Module. In The European Conference on Computer Vision (ECCV).
Lin, F.; Chou, Y.; and Martinez, T. 2020. Flow Adaptive Video Object Segmentation. Image and Vision Computing 94: 103864.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740–755. Springer.
Ma, N.; Zhang, X.; Zheng, H.-T.; and Sun, J. 2018. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), 116–131.
Maninis, K.-K.; Caelles, S.; Chen, Y.; Pont-Tuset, J.; Leal-Taixé, L.; Cremers, D.; and Van Gool, L. 2018. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Miyato, T.; Maeda, S.; Koyama, M.; and Ishii, S. 2018. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Oh, S. W.; Lee, J.-Y.; Xu, N.; and Kim, S. J. 2019. Video object segmentation using space-time memory networks. In Proceedings of the IEEE International Conference on Computer Vision, 9226–9235.
Perazzi, F.; Khoreva, A.; Benenson, R.; Schiele, B.; and Sorkine-Hornung, A. 2017. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2663–2672.
Perazzi, F.; Pont-Tuset, J.; McWilliams, B.; Van Gool, L.; Gross, M.; and Sorkine-Hornung, A. 2016. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 724–732.
Pont-Tuset, J.; Perazzi, F.; Caelles, S.; Arbeláez, P.; Sorkine-Hornung, A.; and Van Gool, L. 2017. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv preprint arXiv:1704.00675.
Robinson, A.; Lawin, F. J.; Danelljan, M.; Khan, F. S.; and Felsberg, M. 2020. Learning Fast and Robust Target Models for Video Object Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7406–7415.
Sahoo, D.; Pham, Q.; Lu, J.; and Hoi, S. C. H. 2018. Online Deep Learning: Learning Deep Neural Networks on the Fly. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 2660–2666. International Joint Conferences on Artificial Intelligence Organization. doi:10.24963/ijcai.2018/369.
Sevilla-Lara, L.; Sun, D.; Jampani, V.; and Black, M. J. 2016. Optical flow with semantic segmentation and localized layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3889–3898.
Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A. P.; Bishop, R.; Rueckert, D.; and Wang, Z. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1874–1883.
Sukhbaatar, S.; Weston, J.; Fergus, R.; et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, 2440–2448.
Tsai, Y.-H.; Yang, M.-H.; and Black, M. J. 2016a. Video segmentation via object flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3899–3908.
Tsai, Y.-H.; Yang, M.-H.; and Black, M. J. 2016b. Video Segmentation via Object Flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Voigtlaender, P.; Chai, Y.; Schroff, F.; Adam, H.; Leibe, B.; and Chen, L.-C. 2019. FEELVOS: Fast end-to-end embedding learning for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9481–9490.
Voigtlaender, P.; and Leibe, B. 2017. Online Adaptation of Convolutional Neural Networks for Video Object Segmentation. In British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4-7, 2017. BMVA Press.
Volz, S.; Bruhn, A.; Valgaerts, L.; and Zimmer, H. 2011. Modeling temporal coherence for optical flow. In Proceedings of the IEEE International Conference on Computer Vision, 1116–1123. IEEE.
Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; Liu, W.; and Xiao, B. 2019a. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; and Torr, P. H. 2019b. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1328–1338.
Wang, W.; Shen, J.; Porikli, F.; and Yang, R. 2018a. Semi-supervised video object segmentation with super-trajectories. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018b. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
Wang, Z.; Xu, J.; Liu, L.; Zhu, F.; and Shao, L. 2019c. RANet: Ranking attention network for fast video object segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 3978–3987.
Weickert, J.; and Schnörr, C. 2001. Variational optic flow computation with a spatio-temporal smoothness constraint. Journal of Mathematical Imaging and Vision.
Weston, J.; Chopra, S.; and Bordes, A. 2014. Memory networks. arXiv preprint arXiv:1410.3916.
Xu, N.; Yang, L.; Fan, Y.; Yue, D.; Liang, Y.; Yang, J.; and Huang, T. 2018. Youtube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327.
Yan, Q.; Xu, L.; Shi, J.; and Jia, J. 2013. Hierarchical saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1155–1162.
Yang, L.; Wang, Y.; Xiong, X.; Yang, J.; and Katsaggelos, A. K. 2018. Efficient video object segmentation via network modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6499–6507.
Zhou, G.; Sohn, K.; and Lee, H. 2012. Online incremental feature learning with denoising autoencoders. In Artificial Intelligence and Statistics, 1453–1461.
Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2223–2232.
Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; and Bai, X. 2019. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision.