A Coarse-To-Fine Framework For Video Object Segmentation
Chi Zhang, Rochester Institute of Technology, Rochester, NY 14623, USA
Alexander Loui, Kodak Alaris Imaging R&D, Rochester, NY 14615, USA
(Work performed at Kodak Alaris during an internship from Rochester Institute of Technology.)

Abstract
In this study, we develop an unsupervised coarse-to-fine video analysis framework and prototype system to extract a salient object in a video sequence. The framework starts by tracking grid-sampled points across temporal frames, typically using the KLT tracking method. The tracked points are divided into several groups according to their inconsistent movements. At the same time, the SLIC algorithm is extended into 3D space to generate supervoxels. Coarse segmentation is achieved by combining the categorized tracking points with the supervoxels of the corresponding frames in the video sequence. Finally, a graph-based fine segmentation algorithm is used to extract the moving object in the scene. Experimental results reveal that this method outperforms previous approaches in terms of accuracy and robustness.
Introduction
Object-level video segments are semantically meaningful spatiotemporal units such as moving persons, moving vehicles, a flowing river, etc. Segmenting a video sequence into a number of component regions would benefit many higher-level vision-based applications such as scene analysis, object localization, and content understanding. However, single-target object extraction is a more demanding task considering consumer needs. In many cases, a consumer video sequence simply aims at capturing a single object's movement in a specific environment, such as dancing, skiing, or running. In general, moving object detection and extraction for a static video camera is relatively straightforward, since the background barely changes and simple frame differencing can extract a moving foreground object. However, it remains challenging for an object moving against a cluttered and/or dynamic background.

The goal of background modeling and foreground object extraction is to build a model of the background/foreground in an offline manner and extract the object of interest by comparing the estimated model with the frames. The model must be robust enough to cope with background changes of different kinds. In recent years, a trend towards modeling spatio-temporally uniform (in terms of either appearance or motion) regions instead of single pixels has been observed [1]. These works rely on superpixels/supervoxels for object segmentation in videos. However, these methods are computationally expensive and group superpixels together according to pure spatio-temporal similarity without exploiting real-world object features. As an improvement, Giordano et al. [2] proposed an approach that makes no specific assumptions about the videos and relies on how objects are perceived by humans according to Gestalt laws. Khoreva et al. [3] proposed an empirical approach to learn both the edge topology and the weights of the graph: the most confident edges are selected by the graph structure, while classifiers are learned to combine features and are integrated according to their accuracy. In [4] and [5], FPFH and HoG have been used as features to represent superpixels. The high-dimensional feature space slows down the computation, although some improvements (e.g., [6]) were proposed to provide a better trade-off between segmentation quality and runtime.

Moreover, much research has been devoted to graph models for segmentation, such as [3] and [7]. Fan and Loui [8] proposed a graph-based approach that models the data in a feature space, which emphasizes the correlation between similar pixels while reducing the inter-class connectivity between different objects. In [9], a reduced superpixel graph was reweighted such that the resulting segmentation was equivalent to that of the full graph under certain assumptions.

In this work, we develop a novel coarse-to-fine framework and prototype system for automatically segmenting a video sequence and extracting a salient moving object from it. The proposed framework comprises point tracking and motion clustering of pixels into groups. In parallel, a pixel grouping method is used to generate supervoxels for the corresponding frames of the video sequence. Coarse segmentation is achieved by combining the results of the previous steps. Subsequently, a graph-based technique is used to perform fine segmentation and extraction of the salient object.
The following section presents the proposed coarse-to-fine video segmentation framework and the details of the key component algorithms. Then the performance evaluations and experimental results are discussed in an individual section. Finally, some concluding remarks are presented in the last section.
System Framework and Algorithms
The proposed framework is shown in Fig. 1 and consists of several stages: 1) a point tracking algorithm is applied to the consecutive frames of the input video, and then 2) the tracked points are clustered into groups; in parallel, 3) a pixel grouping method is used to generate supervoxels for the corresponding frames of the video sequence; 4) coarse segmentation is achieved by combining the results of the previous steps; finally, 5) a graph-based segmentation technique is used to perform fine segmentation and generate a mask of the most salient object.

Figure 1. The overall framework of the proposed algorithm (Input Video Sequence → A. Point Tracking → B. Motion Clustering, in parallel with C. Supervoxel Clustering, followed by D. Coarse Segmentation → E. Graph-based Fine Segmentation → Create Effects).

This video segmentation scheme exhibits state-of-the-art boundary adherence and improves the performance of segmentation algorithms with reduced memory consumption. The new approach is a major enhancement of the previous graph-based framework [8], with the following distinctions and advantages:

• We deal with video sequences of any resolution and any length, i.e., there is no restriction on the size of the video.
For a long video sequence, the input is divided into small clips that are processed by the system one by one.
• The parallel approach combines spatial and temporal information and takes advantage of both graph-based algorithms and pixel grouping methods. Consequently, it provides a marked improvement in accuracy and speed.
• It is an unsupervised scheme, i.e., no user interaction is required to generate an accurate object mask.
A. Point Tracking
There are many widely used point tracking algorithms, such as particle filtering [10] and mean shift tracking [11], each with its own characteristics. A popular and well-performing video object tracking algorithm is the Kanade-Lucas-Tomasi (KLT) point tracker [12, 13]. The algorithm provides the trajectories of a bundle of points. In our work, the points to be tracked are selected in a grid-based manner so that the initial points are distributed uniformly over the entire frame, as shown by the red dots in Fig. 2(a). As the point tracking algorithm progresses over time, points can be lost due to lighting variation, out-of-plane rotation, or articulated motion, as shown in Fig. 2(b) and Fig. 2(c). To track an object over a long period of time, we may need to reacquire points periodically.
Figure 2.
KLT point tracking. (a) Selected tracking points in the 1st frame; (b) and (c) tracking points in the 3rd and 5th frames. (Please print in color.)
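To make the tracking step concrete, the following Python sketch tracks grid-sampled points with the pyramidal KLT tracker and periodically reacquires the grid. Using OpenCV is an implementation assumption (the paper does not name a library); the 10-pixel grid spacing and 5-frame reset interval follow the parameter settings reported later, while the window size and pyramid depth are illustrative.

```python
import cv2
import numpy as np

def track_grid_points(frames, step=10, reset_every=5):
    """Track grid-sampled points with pyramidal KLT, reacquiring them periodically."""
    trajectories = []
    prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    h, w = prev_gray.shape
    # Grid-sampled seed points, one every `step` pixels across the whole frame.
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    grid = np.float32(np.stack([xs.ravel(), ys.ravel()], axis=-1)).reshape(-1, 1, 2)
    pts = grid.copy()
    for i, frame in enumerate(frames[1:], start=1):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(
            prev_gray, gray, pts, None, winSize=(21, 21), maxLevel=3)
        good = status.ravel() == 1  # points lost to lighting change, rotation, etc.
        trajectories.append((pts[good], nxt[good]))
        pts = nxt[good].reshape(-1, 1, 2)
        if i % reset_every == 0 or len(pts) == 0:
            pts = grid.copy()       # reacquire the full grid periodically
        prev_gray = gray
    return trajectories
```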
Some algorithms have been proposed to improve the accuracy of KLT point tracking; one such proposal is the TLD algorithm proposed by Kalal [14].

The KLT point tracker rests on some premises: 1) the luminance between two adjacent frames should be constant; 2) the object moves continuously in the time domain, or otherwise the movement should be "small" enough; 3) a point and its neighborhood have similar motion vectors, i.e., they are spatially consistent. Intuitively, if a window $w$ in frame $I$ is the same as that in the adjacent frame $J$, we have $I(x, y, t) = J(x', y', t + \tau)$. The constant-luminance hypothesis upholds this equality and removes the effect of luminance changes. The second premise ensures the existence of the tracking points. That points in the same window share the same offset is guaranteed by the third premise.

B. Motion Clustering
In a video sequence, the collection of points located in a high-dimensional space often lies close to low-dimensional structures corresponding to the several classes the data belongs to. The Sparse Subspace Clustering (SSC) algorithm proposed by Elhamifar and Vidal [15] clusters tracking points that lie in a union of low-dimensional subspaces. The point trajectories acquired by the KLT point tracker are grouped into two clusters using the SSC algorithm. Among the infinitely many possible representations of a data point in terms of other points, a sparse representation corresponds to selecting a few points from the same subspace. This motivates solving a sparse optimization problem whose solution is used in a spectral clustering framework to infer the clustering of the data into subspaces.

Fig. 3 shows the clustering results on two frames. Because the object moves differently from the background, the tracking points on the object are separated from the points on the background. This algorithm can be solved efficiently and can handle data points near the intersections of subspaces. Another key advantage of this algorithm with respect to the state-of-the-art is that it can deal with data nuisances, such as noise, sparse outlying entries, and missing entries, directly by incorporating a model of the data into the sparse optimization program.
Figure 3.
Demonstrations of point trajectory clustering using SSC on two frames. The yellow and red markers represent the two clusters, foreground and background respectively. (Please print in color.)
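A minimal sketch of the SSC idea is given below, assuming the trajectories are stacked column-wise into a 2F × N matrix (F frames, N points). It uses a plain ℓ1 regression for the sparse self-representation and omits the noise and outlier terms of the full SSC program; the `alpha` value and the Lasso solver are illustrative choices, not from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc_cluster(X, n_clusters=2, alpha=0.01):
    """Sparse Subspace Clustering sketch.

    X: (2F, N) matrix whose columns are trajectories (x, y stacked per frame).
    Each trajectory is expressed as a sparse combination of the others; the
    coefficients define an affinity graph that is cut by spectral clustering.
    """
    N = X.shape[1]
    C = np.zeros((N, N))
    for i in range(N):
        idx = np.arange(N) != i              # exclude the point itself (c_ii = 0)
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(X[:, idx], X[:, i])
        C[idx, i] = lasso.coef_
    W = np.abs(C) + np.abs(C).T              # symmetric, nonnegative affinity
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(W)
```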
C. Supervoxel Clustering
In our work, Simple Linear Iterative Clustering (SLIC) [16] is extended into 3D space to deal with the 3D data clustering problem.

For computational efficiency, the entire video sequence is divided into clips, each containing a fixed number of frames determined by the computing ability of the processor. Each clip can then be processed individually. The resolution of consumer videos is sometimes comparable to or higher than 720p HD video, which contains too much detail in each frame and causes undesired effects and redundant computation in 3D SLIC. Bilateral filtering [17] can be applied to each frame to solve this problem, so that the edges around the objects are preserved while the other regions are smoothed. Bilateral filtering also reduces the noise in each channel.

Suppose that the desired number of supervoxels on each frame is $n$ and the thickness of each supervoxel is $D$ along the temporal axis. Assume that the supervoxels are initially square in each frame and approximately equal-sized. All cluster centers are initialized by sampling the clip on a regular grid spaced $S$ pixels apart inside each frame and $t$ pixels between frames (along the temporal axis). Setting aside its accuracy for small color differences, the video sequence is converted into CIELAB space, since the nonlinear relations for $L^*$, $a^*$, and $b^*$ are a good model of the nonlinear response of the eye. Furthermore, uniform changes of components in the CIELAB color space aim to correspond to uniform changes in perceived color, so the relative perceptual difference between any two colors can be approximated by treating each color as a point in a three-dimensional space and taking the Euclidean distance between them. In addition, the motion information can be represented by the motion vectors obtained from optical flow. Consequently, each cluster is represented by the vector

C = [x\; y\; z\; L^*\; a^*\; b^*\; u\; v] \quad (1)

where $x$ and $y$ represent the spatial location and $z$ carries the temporal information; $L^*$, $a^*$, and $b^*$ represent the spectral information; and $u$, $v$ are the motion information extracted by optical flow.

In the assignment step, the cluster of each pixel is determined by calculating the distance between the pixel itself and the cluster centers in the 2S × 2S × 2D search region, as shown in Fig. 4.
Figure 4.
Initialization and the search region of a supervoxel. The red box shows the initialized supervoxel spanning D consecutive frames. The blue box is the search region for this cluster. Each pixel is evaluated eight times, since it is enclosed by eight cluster search regions. (Please print in color.)

The question is how the distance should be measured. In our case, the distances in each domain are calculated separately and then combined after multiplying by the appropriate weights, i.e., the distance $d$ defined over the pixel location, the CIELAB color space, and the motion vector is

d = \sqrt{\frac{d_l^2}{2S^2 + D^2} + \frac{d_c^2}{m^2} + w_m \cdot \frac{d_m^2}{(RS)^2}} \quad (2)

where $m$ is the regularity parameter that controls the compactness of the supervoxels, $w_m$ is a weight on the motion information, $R$ is the frame rate, and

d_l = \sqrt{\Delta x^2 + \Delta y^2 + w_z \cdot \Delta z^2} \quad (3)

d_c = \sqrt{w_{L^*} \cdot \Delta L^{*2} + \Delta a^{*2} + \Delta b^{*2}} \quad (4)

d_m = \sqrt{\Delta u^2 + \Delta v^2} = \sqrt{\Delta \dot{x}^2 + \Delta \dot{y}^2} \quad (5)

where $w_z$ and $w_{L^*}$ are the weights for the temporal distance and the $L^*$ channel. In the distance measure, the location is normalized by the maximum distance in the 3D lattice, $\sqrt{2S^2 + D^2}$, according to Fig. 4. The weight $w_z$ for the depth component is introduced since the inter-frame (lateral) positional distance should be treated differently from the in-frame (transverse) distance. Considering two adjacent supervoxels with depth $D$ along the temporal axis, these two supervoxels would shrink transversely and expand up to $2D$ in the lateral direction during the iterations if the surrounding region is relatively uniform and the weight $w_z$ is small. This increases the number of clusters on a single frame, which is undesirable for some applications.

Note that 3D SLIC does not explicitly enforce connectivity. The adjacency matrix is generated, and clusters with a number of pixels below a threshold are reassigned to the nearest neighboring cluster using connected component analysis. Fig. 5 shows the results of the 3D SLIC algorithm after the connected component analysis.

Figure 5.
Results of 3D SLIC voxel grouping on three consecutive frames. The boundaries of each supervoxel are shown in yellow. Blocks enclosed by the yellow boundaries in corresponding positions across frames share the same label. (Please print in color.)
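The combined distance of Eqs. (2)-(5) can be sketched as follows, operating on the eight-dimensional feature vectors of Eq. (1). The default weights echo the parameter settings reported in the experiments; the frame rate `R = 30` and the motion weight `w_m = 1.0` are illustrative assumptions.

```python
import numpy as np

def slic3d_distance(pixel, center, S, D, m=22.0, w_m=1.0, w_z=50.0, w_L=1.0, R=30.0):
    """Combined 3D SLIC distance of Eqs. (2)-(5).

    pixel, center: length-8 arrays [x, y, z, L*, a*, b*, u, v] as in Eq. (1).
    """
    dx, dy, dz = pixel[:3] - center[:3]      # spatial (x, y) and temporal (z) offsets
    dL, da, db = pixel[3:6] - center[3:6]    # CIELAB color differences
    du, dv = pixel[6:] - center[6:]          # optical-flow (motion) differences
    d_l2 = dx**2 + dy**2 + w_z * dz**2       # Eq. (3), squared
    d_c2 = w_L * dL**2 + da**2 + db**2       # Eq. (4), squared
    d_m2 = du**2 + dv**2                     # Eq. (5), squared
    # Eq. (2): normalize each term, weight the motion term, then fuse.
    return np.sqrt(d_l2 / (2 * S**2 + D**2) + d_c2 / m**2 + w_m * d_m2 / (R * S)**2)
```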
Note that for some HD videos that contain too much redundant detail in the background, SLIC voxel grouping generates tiny clusters that are too fine and increase the computation and processing time. To solve this problem, it is recommended to cluster videos of this kind after bilateral filtering. The fine edges are removed while the main boundaries of the object and background are retained.
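With OpenCV, for instance, this pre-filtering step can be a single call per frame; the file name and filter parameters below are illustrative, as the paper does not report them.

```python
import cv2

frame_bgr = cv2.imread("frame_0001.png")  # hypothetical frame file
# Edge-preserving smoothing applied per frame before 3D SLIC;
# d, sigmaColor, and sigmaSpace are illustrative values.
smoothed = cv2.bilateralFilter(frame_bgr, d=9, sigmaColor=75, sigmaSpace=75)
```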
D. Coarse Segmentation
Coarse segmentation is performed by combining the SSC output with the supervoxels. As shown in Fig. 6(a), the SSC algorithm provides an approximate region containing the object of interest. Based on that, we propose a strategy with the following rules: for each supervoxel in the video clip (as shown in Fig. 6(b)), if all the tracking points in it are marked red, the supervoxel is considered background (the black region in Fig. 6(c)); similarly, if all the tracking points in a supervoxel are marked yellow, the supervoxel is labelled foreground (the white region in Fig. 6(c)); otherwise, supervoxels containing markers of both colors are considered undetermined regions, as shown by the gray region in Fig. 6(c).
Figure 6.
Coarse segmentation by combining the results of the SSC and 3D SLIC algorithms. (a) Tracking points generated by KLT and SSC; the yellow and red markers represent the foreground and background regions respectively. (b) The 3D SLIC supervoxels on the same frame. (c) The mask generated by combining (a) and (b); the black, gray, and white regions denote determined background, undetermined region, and determined foreground respectively. (Please print in color.)
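A sketch of this labeling rule is shown below, under the added assumption that a supervoxel containing no tracking points at all is also left undetermined (a case the rules above do not cover explicitly).

```python
import numpy as np

BG, FG, UNKNOWN = 0, 1, 2  # mask values: background, foreground, undetermined

def coarse_mask(supervoxel_labels, points, point_labels):
    """Label each supervoxel by the SSC labels of the tracked points inside it.

    supervoxel_labels: (H, W) integer map of supervoxel ids for one frame.
    points: list of (x, y) integer coordinates; point_labels: FG or BG per point.
    """
    mask = np.full(supervoxel_labels.shape, UNKNOWN, dtype=np.uint8)
    for sv in np.unique(supervoxel_labels):
        region = supervoxel_labels == sv
        inside = [lbl for (x, y), lbl in zip(points, point_labels) if region[y, x]]
        if inside and all(l == FG for l in inside):
            mask[region] = FG      # white region in Fig. 6(c)
        elif inside and all(l == BG for l in inside):
            mask[region] = BG      # black region in Fig. 6(c)
        # mixed or empty supervoxels stay UNKNOWN (gray region in Fig. 6(c))
    return mask
```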
E. Graph-based Fine Segmentation
For fine segmentation, we propose to use the GrabCut [18] algorithm, since it accepts incomplete labeling, i.e., it only requires a set of pixels marked as background. Also, GrabCut looks for the minimum iteratively rather than in a one-shot manner; each iteration improves the parameters of the GMMs to generate a better segmentation.

For video frames in RGB color space, the object and background are each modeled by a full-covariance Gaussian mixture with $K$ components (typically $K = 5$). A vector $k = [k_1, k_2, \cdots, k_n, \cdots, k_N]$ is introduced, with $k_n \in \{1, \cdots, K\}$, assigning to each pixel a unique GMM component taken from either the background or the foreground model. Using the mask generated by the coarse segmentation, the black, white, and gray regions are flagged as background, foreground, and undetermined, or simply marked as 0, 1, or 2 for the image. Applying $k$-means clustering, the pixels belonging to the object and to the background are each clustered into $K$ groups (GMM components). The mean and covariance of each GMM component can be estimated from the RGB values of the pixels in its cluster, and its weight can be determined by the ratio of the number of pixels in the cluster to the overall number of pixels. Finally, texture (color) and boundary (contrast) information is used to obtain a reliable segmentation result within a few iterations, as illustrated in Fig. 7.

Figure 7.
Result of fine segmentation using the GrabCut method. (a) The algorithm segments the undetermined region into light and dark gray regions; and (b) the light and dark gray regions are merged into the background and foreground respectively to form the final mask.
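With OpenCV's GrabCut implementation (an implementation assumption; the paper describes the algorithm but not a library), the coarse mask can seed the fine segmentation roughly as follows. Mapping undetermined pixels to "probable foreground" is one reasonable choice, not prescribed by the paper.

```python
import cv2
import numpy as np

def fine_segment(frame_bgr, coarse, iters=5):
    """GrabCut refinement seeded by the coarse mask (0 = bg, 1 = fg, 2 = undetermined)."""
    gc_mask = np.where(coarse == 1, cv2.GC_FGD,
               np.where(coarse == 0, cv2.GC_BGD, cv2.GC_PR_FGD)).astype(np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)  # internal GMM parameter buffers
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame_bgr, gc_mask, None, bgd_model, fgd_model,
                iters, mode=cv2.GC_INIT_WITH_MASK)
    # Definite and probable foreground together form the final object mask.
    return np.where((gc_mask == cv2.GC_FGD) | (gc_mask == cv2.GC_PR_FGD), 1, 0)
```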
Experimental Results
We conduct experiments on a variety of video content. We run the proposed algorithm on multiple types of data and generate a mask of the extracted object for each frame. We also compare our segmentation results to those produced by other state-of-the-art methods [19, 20, 21, 22, 23]. Both qualitative and quantitative results are presented to support the effectiveness and robustness of the proposed method.
A. Parameter Settings
The parameters used in the experiments are as follows. In the point tracking and clustering process, we set the initial point sampling interval to 10 pixels, and the tracking points are reset every 5 frames. The number of clustering groups depends on the application; typically, we set it to 5. To group pixels, the 3D SLIC algorithm is performed every 30 frames (the clip size). For demonstration, the desired number of supervoxels in one frame is set to 100, the compactness parameter to $m = 22$, and the depth of each supervoxel to $D = 5$; the weight for the temporal distance is $w_z = 50$, with a fixed weight $w_{L^*}$ applied to the $L^*$ channel.

B. Evaluation on SegTrack Dataset
We first consider video sequences from the SegTrack [23] dataset, since a pixel-level segmentation ground truth is available for each video. To quantitatively evaluate the segmentation performance, we use the ground truth provided with the original data. We compare our method with five state-of-the-art methods, as shown in Table 1. The "penguin" video sequence is not applicable to our segmentation evaluation, since its ground truth is designed for object tracking in a weakly supervised setting, in which only one penguin is manually annotated in each frame. Note that our method is unsupervised, whereas [20] and [23] are supervised methods that need an initial annotation of the first frame. One can see that our algorithm outperforms the other unsupervised methods except on the "parachute" and "birdfall2" videos, where it is still comparable to the best one. For the "parachute" video sequence, our result is based on the premise that the person under the parachute should be part of the object and extracted. However, the person was removed from the ground truth in the original dataset, which leads to a slightly inflated error. Due to the small size of the moving object and the complex background of the scene, the pre-defined density of tracking points may not be high enough to extract the foreground in the "birdfall2" video sequence, which leads to a pixel error slightly higher than the best one. This could be improved by making the density of the tracking points self-adjustable. The results in Table 1 are the per-frame average of the pixel-wise difference between our result and the ground truth, i.e.,

error = \frac{\mathrm{xor}(\text{our result}, \text{ground truth})}{\text{number of frames}} \quad (6)

where xor is the exclusive OR operation.

Table 1: Quantitative pixel-level errors and comparison with the state-of-the-art methods on the SegTrack dataset.

Sequence     [19]   [20]   [21]   [22]   [23]   Ours
parachute     220    502    201    221    235    219
girl         1488   1755   1785   1698   1304   1471
monkeydog     365    683    521    472    563    345
birdfall2     155    454    288    189    252    232
cheetah       633   1217    905    806   1142    621
penguin*       NA     NA     NA     NA     NA     NA
* The video sequence "penguin" is not applicable to this evaluation.

Fig. 8 shows an example of the qualitative results for the "parachute" video sequence in the SegTrack dataset. In this video sequence, the foreground and background regions move in different ways. Comparing the last columns of Fig. 8(b) and (c), the person under the parachute is segmented into the foreground in our results instead of being merged into the background as in the ground truth. This makes the segmentation result more reasonable, although it leads to the slight error increase in Table 1. Fig. 9 compares our results with the ground truth on the "girl" video sequence, which suffers from low resolution and severe motion blur, increasing the difficulty of segmentation. Both the point tracking and the supervoxel generation are affected by the motion blur, which becomes the main source of the pixel-level error.
Figure 8.
Qualitative results of the SegTrack "parachute" video sequence: (a) original frames; (b) ground truth; (c) our results.
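The error metric of Eq. (6) amounts to a few lines of code; a minimal sketch, assuming binary NumPy masks of equal shape:

```python
import numpy as np

def pixel_error(masks, gts):
    """Average per-frame pixel error of Eq. (6): XOR between predicted and
    ground-truth binary masks, summed per frame and averaged over the sequence."""
    frames = len(masks)
    return sum(np.logical_xor(m, g).sum() for m, g in zip(masks, gts)) / frames
```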
All experiments are performed on an Intel® Core™ i5-4590 CPU at 3.30 GHz with 16 GB of memory. Before extensive code and data structure optimization, the processing time per frame is around 0.52 s, 15.86 s, and 7.62 s for point clustering, supervoxel generation, and final segmentation respectively.

Figure 9. Qualitative results of the SegTrack "girl" video sequence: (a) original frames; (b) ground truth; (c) our results.

C. Evaluation on Kodak Alaris Consumer Video Dataset
With the rapid development and lower cost of smartphones and new digital capture devices, consumer videos are becoming ever more popular, as is evident from the large volume of YouTube video uploads as well as video viewing on the Facebook social network. This large amount of video also poses a challenge for organizing and retrieving videos for consumers. Besides the SegTrack dataset, we have evaluated our proposed approach on some of the videos from the Kodak Alaris consumer video dataset. The videos in this dataset are mostly captured in standard HD format with high frame resolution. Fig. 10 shows the qualitative results for the "gymnast1" video sequence in this dataset. Because of the high resolution of the video, we apply bilateral filtering to the original frames to remove some fine details of the background while keeping the main edges. The bilateral filtering does not affect the performance of either the SSC or the 3D SLIC algorithm, but rather saves computation. Another example is shown in Fig. 11. In this video, some parts of the moving object (a dog) are similar in color to the background trees, and the other parts are as white as the background sky. It turns out that our algorithm produces reasonably good results for this difficult task.
Figure 10.
Object segmentation results on the "gymnast1" video sequence in the Kodak Alaris consumer video dataset: (a) original frames in the video sequence; (b) mask representing the extracted object in the sequence.
Figure 11.
Object segmentation results on the "dog" video sequence in the Kodak Alaris consumer video dataset: (a) original frames in the video sequence; (b) mask representing the extracted object in the sequence.
Conclusion
We have proposed a novel and accurate coarse-to-fine approach to segment the salient object in video sequences. The approach involves a parallel scheme, consisting of the KLT, SSC, and 3D SLIC algorithms, to identify the approximate location of the most salient object. Subsequently, an unsupervised graph-based method is used for fine segmentation. Since the coarse segmentation determines the location of the moving object rather than its exact boundaries, the robustness of the approach can be guaranteed. It is also worth mentioning that the algorithm can be easily extended to multiple-object segmentation by controlling the number of classes in the SSC stage. Compared to other state-of-the-art approaches, it has a stronger ability to segment video sequences of any resolution and length accurately and within a shorter time. The experimental results also validate the effectiveness and performance of the proposed method.
References

[1] J. Lim and B. Han, "Generalized background subtraction using superpixels with label integrated motion estimation," in ECCV, Zurich, Switzerland, 2014.
[2] D. Giordano, F. Murabito, S. Palazzo, and C. Spampinato, "Superpixel-based video object segmentation using perceptual organization and location prior," in IEEE Conf. on CVPR, Boston, MA, 2015.
[3] A. Khoreva, F. Galasso, M. Hein, and B. Schiele, "Classifier based graph construction for video segmentation," in IEEE Conf. on CVPR, Boston, MA, 2015.
[4] J. Papon, A. Abramov, and M. Schoeler, "Voxel cloud connectivity segmentation - supervoxels for point clouds," in IEEE Conf. on CVPR, Portland, OR, 2013.
[5] E. Trulls, S. Tsogkas, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer, "Segmentation-aware deformable part models," in IEEE Conf. on CVPR, Columbus, OH, 2014.
[6] P. Neubert and P. Protzel, "Compact watershed and preemptive SLIC: On improving trade-offs of superpixel segmentation algorithms," in ICPR, Stockholm, Sweden, 2014.
[7] F. Galasso, M. Keuper, T. Brox, and B. Schiele, "Spectral graph reduction for efficient image and streaming video segmentation," in IEEE Conf. on CVPR, Columbus, OH, 2014.
[8] L. Fan and A. Loui, "A graph-based framework for video object segmentation and extraction in feature space," in IEEE Int. Symp. on Multimedia, Miami, FL, 2015.
[9] C. Li, L. Lin, W. Zuo, and S. Yan, "SOLD: Sub-optimal low-rank decomposition for efficient video segmentation," in IEEE Conf. on CVPR, Boston, MA, 2015.
[10] A. Khoreva, F. Galasso, M. Hein, and B. Schiele, "Learning must-link constraints for video segmentation based on spectral clustering," in GCPR, Muenster, Germany, 2014.
[11] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. on PAMI, vol. 24, no. 5, 2002.
[12] H. Fu, D. Xu, B. Zhang, and S. Lin, "Object-based multiple foreground video co-segmentation via multi-state selection graph," IEEE Trans. on Image Processing, vol. 24, no. 11, 2015.
[13] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Int. Joint Conf. on AI, Vancouver, Canada, 1981.
[14] Z. Kalal, K. Mikolajczyk, and J. Matas, "Forward-backward error: Automatic detection of tracking failures," in ICPR, Istanbul, Turkey, 2010.
[15] Y. Zhang, X. Chen, J. Li, C. Wang, and C. Xia, "Semantic object segmentation via detection in weakly labeled video," in IEEE Conf. on CVPR, Boston, MA, 2015.
[16] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. on PAMI, vol. 34, no. 11, 2012.
[17] K. He, J. Sun, and X. Tang, "Guided image filtering," IEEE Trans. on PAMI, vol. 35, no. 6, 2012.
[18] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut - interactive foreground extraction using iterated graph cuts," Proc. ACM SIGGRAPH, vol. 23, no. 3, 2004.
[19] D. Zhang, O. Javed, and M. Shah, "Video object segmentation through spatially accurate and temporally dense extraction of primary object regions," in IEEE Conf. on CVPR, Portland, OR, 2013.
[20] P. Chockalingam, N. Pradeep, and S. Birchfield, "Adaptive fragments-based tracking of non-rigid objects using level sets," in IEEE ICCV, Kyoto, Japan, 2009.
[21] Y. J. Lee, J. Kim, and K. Grauman, "Key-segments for video object segmentation," in IEEE ICCV, Barcelona, Spain, 2011.
[22] T. Ma and L. J. Latecki, "Maximum weight cliques with mutex constraints for video object segmentation," in IEEE Conf. on CVPR, Providence, RI, 2012.
[23] D. Tsai, M. Flagg, and J. M. Rehg, "Motion coherent tracking with multi-label MRF optimization," IJCV, vol. 100, no. 2, 2012.
Author Biography
Chi Zhang received his MS in electrical engineering from Rochester Institute of Technology (2013) and is currently a Ph.D. student in imaging science at Rochester Institute of Technology. He worked as a software intern at Kodak Alaris Inc. in Rochester, NY. In recent years his professional interests have focused on the area of computer vision, including image analysis, video processing, and convolutional neural networks for visual recognition. He is a student member of IEEE.