International Journal of Pattern Recognition and Artificial Intelligence
© World Scientific Publishing Company
Region-Based Multiscale Spatiotemporal Saliency for Video
Trung-Nghia Le
Department of Informatics, SOKENDAI (Graduate University for Advanced Studies), 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
Akihiro Sugimoto
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
Detecting salient objects from a video requires exploiting both spatial and temporal knowledge included in the video. We propose a novel region-based multiscale spatiotemporal saliency detection method for videos, where static features and dynamic features computed at the low and middle levels are combined. Our method utilizes such combined features spatially over each frame and, at the same time, temporally across frames using consistency between consecutive frames. Saliency cues in our method are analyzed through a multiscale segmentation model and fused across scale levels, allowing regions to be explored efficiently. An adaptive temporal window using motion information is also developed to combine saliency values of consecutive frames in order to keep temporal consistency across frames. Performance evaluation on several popular benchmark datasets validates that our method outperforms existing state-of-the-art methods.
Keywords: Spatiotemporal saliency; multiscale segmentation; low-level feature; middle-level feature; adaptive temporal window.
1. Introduction
Visual saliency, which reflects the sensitivity of human vision, aims at locating informative and interesting regions in a scene. It was originally developed to predict human eye fixations on images, and has recently been extended to detect salient objects. Computational methods developed for salient object detection are useful for high-level tasks in computer vision and computer graphics. For instance, these methods have been successfully applied in many areas such as object detection, scene classification, image and video compression, image editing and manipulation, and video re-targeting.

Pixel-based computational methods for saliency have been mainly developed for, and achieve good results on, static objects; they are thus popular for detecting salient objects in images. However, these approaches are at a disadvantage in the context of dynamic scenes in videos. Videos usually have lower quality than images due to lossy compression, which makes every pixel value change over time regardless of whether it belongs to a static object or a dynamic motion. Accordingly, pixel-based saliency detection methods can be misled by this pixel fluctuation. In contrast, region-based saliency detection methods are more effective on videos because they suffer less from fluctuation and can capture dynamics in videos better than pixel-based methods. Therefore, in this work, we focus on analyzing videos at the regional level through superpixel segmentation.

Fig. 1: Multiscale image analysis. From left to right: pixel-wise analysis and region-wise analysis using SLIC superpixels, followed by small-scale superpixels fully covering the baby bird and large-scale superpixels fully covering the mother bird.

Superpixels, which are used to detect regions in an image, can serve as basic materials for capturing salient objects at the regional level. Since objects in scenes generally contain salient patterns at various scales, superpixel-based regions of a pre-defined size cannot fully explore objects (c.f. Fig. 1). As a result, the generated region-based saliency can be misled by the complexity of patterns in natural images. This problem can be solved by gradually extending the size of superpixels through a multiscale segmentation approach (c.f. Fig. 1). Multiscale segmentation enables us to analyze saliency cues at multiple scale levels of structure, allowing us to deal with complex salient structures.

The majority of existing methods for saliency, on the other hand, are based on low-level features in scenes such as color, orientation, and intensity. These bottom-up cues localize objects that present distinct characteristics from their surroundings. However, they carry no explicit information about scene structure, context, or task-related factors. They therefore cannot effectively highlight objects in videos because they do not always reflect regions corresponding to moving parts. In contrast, middle-level features such as objectness and background prior are suitable for exploiting moving objects because they focus only on properties of distinct regions. Combining middle-level features with low-level features can therefore boost current methods and improve performance. In terms of saliency cues, our method exploits both low-level and middle-level features based not on pixels but on regions.

A video is usually composed of dynamic entities caused by egocentric movements or dynamics of the real world. In particular, in a dynamic scene the background always changes, and different parts corresponding to different elements or objects can move independently in different directions at different speeds. Saliency models should be able to fuse current static information with accumulated knowledge on dynamics from the past to deal with the dynamic nature of scenes, which includes two properties: dynamic background and the entities' independent motion. Several spatiotemporal saliency detection methods based on motion analysis have been proposed for videos. Some of them can capture scene regions that are important in a spatiotemporal manner. However, most existing methods do not fully exploit the nature of dynamics in a scene.
Temporal features presenting the motion dynamics of objects between consecutive frames are not utilized in the saliency detection process, either.

Fig. 2: Examples of our spatiotemporal saliency detection method. Top row: original images. Bottom row: the corresponding saliency maps produced by our method.

In order to effectively use knowledge on the dynamics of background and objects in a video, we propose a salient object detection method in which low-level features and middle-level features are fused. In this framework, static features and dynamic features computed at the low and middle levels are combined to utilize both the spatial features of each frame and the consistency between consecutive frames (c.f. Table 1). The features are exploited in a region-based multiscale saliency model, where saliency cues from multiple scale levels of structure are analyzed and integrated to take advantage of each level. Using region-based features and multiscale analysis, our method is able to deal with complex scale structures in dynamic scenes, so that salient objects are labeled more accurately. We also present a novel metric for motion information that estimates the number of referenced frames for each object in order to keep temporal consistency across frames. Our method overcomes the limitation of the existing method, which uses a fixed number of referenced frames and does not consider the motion of objects within a scene. Examples of saliency maps generated by our method are shown in Fig. 2.

Our key contributions are twofold:

• We propose a region-based multiscale framework, which explicitly integrates low-level features together with middle-level features. Our regional features are exploited to analyze saliency cues at multiple scale levels of structure. Using region-based features and multiscale analysis, our method is able to deal with complex scale structures in dynamic scenes, so that salient objects
are labeled more accurately. Although the proposed saliency model builds on the framework presented by Zhou et al., it significantly improves the performance of the original work.

• We introduce a novel metric, called the adaptive temporal window, that uses motion information to keep the temporal consistency of each entity between consecutive frames of a video. Our method also exploits the dynamic nature of the scene in terms of the independent motion of entities.

The rest of this paper is organized as follows. In Section 2, we briefly present and analyze the related work on saliency models for videos as well as region segmentation in videos. The proposed method is presented in Section 3. Experimental settings are described in Section 4, and experimental results are reported in Sections 5 and 6. Finally, Section 7 presents conclusions and ideas for future work. We remark that a part of this work has been reported in our earlier conference paper.

Table 1: Classification of the features used in our method.

                   low-level feature                  middle-level feature
static feature     color, intensity, orientation      center bias, objectness, background prior
dynamic feature    flow magnitude, flow orientation   movement
2. Related Work
In this section, we briefly review related work on saliency models for videos and region segmentation in videos.
2.1. Dynamics in saliency detection for videos
The spatial saliency models for still images can be applied frame-wise to videos to detect salient objects. Techniques such as the fast minimum barrier distance transform and the minimum spanning tree have been used to develop real-time salient object detection systems, which can be applied to videos. Other methods formulate the problem based on Boolean map theory, absorbing Markov chains, and a weighted sparse coding framework. Multiple saliency maps can be aggregated by a conditional random field or a cellular automata dynamic evolution model. However, this approach does not achieve high effectiveness because it cannot exploit the temporal knowledge of dynamic scenes in videos.

Dynamic cues have usually been employed for saliency detection to deal with dynamics in videos. Some video saliency models rely on center-surround differences between a local spatiotemporal cube and its neighboring cubes in space-time coordinates to find salient regions in dynamic image streams. L. Itti et al. developed a method which computes instantaneous low-level surprise at every location in complex video clips. Flicker has been added into saliency detection as a new cue to build a neurobiological model of visual attention for automatic realistic generation of eye and head movements in video scenes. In the method proposed by M. Mancas et al., only dynamic features such as speed and direction are used to quantify rare and abnormal motion.

Motion cues, considered as a low-level feature channel, have recently been employed in spatiotemporal frameworks together with spatial features such as illumination, color, and so on. Several spatiotemporal saliency detection methods based on motion analysis have been proposed for videos. Motion between a pair of frames, represented as optical flow, is used to compute the local discrimination of the flow in a spatial neighborhood. Motion features such as optical flow or SIFT flow are incorporated into conditional random fields to extend still-image models to detect salient objects from videos. In the spatiotemporal attention detection framework proposed by Y. Zhai et al., motion contrast between consecutive frames is estimated by applying the RANSAC algorithm to point correspondences in the scene. W. Wang et al. estimated salient regions in videos based on the gradient flow field, which consists of intra-frame boundaries and inter-frame motion, and energy optimization. In another framework proposed by W. Wang et al., a spatiotemporal saliency map computed from temporal motion boundaries and spatial edges is combined with an appearance model and a dynamic location model to segment salient video objects. C. Feichtenhofer et al. defined a space-time saliency model, which relies on two general observations regarding actions (motion contrast and motion variance), for capturing foreground action motion. Y. Luo et al. measured relationships within and between trajectories of superpixels over time to capture sudden or onset movements in a scene. H. Kim et al. incorporated a spatial transition matrix and a temporal restarting distribution, computed from motion distinctiveness, temporal consistency, and abrupt change, into the random walk with restart framework to detect spatiotemporal saliency.
However, most existing methods do not fully exploit the nature of dynamics in a scene. Temporal features presenting the motion dynamics of objects between consecutive frames are not utilized in the saliency detection process, either. Differently from existing methods, our spatiotemporal saliency detection method uses motion information to keep temporal consistency across frames. We propose an adaptive temporal sliding window that relates saliency values across frame sequences by exploiting the motion information of each entity in each frame.

2.2. Region segmentation in videos
Most video analysis methods are extended from image analysis methods. Although frame-by-frame processing is efficient and can achieve high performance in spatial respects, its temporal stability is limited. Therefore, many video segmentation methods have been developed to exploit the temporal knowledge in videos.

Superpixel-based algorithms are widely used as a pre-processing step on both still images and videos to generate supporting regions and to speed up further computations. Existing superpixel methods can be divided into two categories: one grows superpixels starting from an initial set, and the other is based on graphs.

For the first approach, one of the most efficient superpixel segmentation algorithms for images, called Simple Linear Iterative Clustering (SLIC), was introduced by R. Achanta et al. In this algorithm, k-means clustering is first performed to group pixels that exhibit similar appearance into superpixels, and then single-pixel superpixels are merged into larger superpixels via a single 4-connected region. Many segmentation methods have recently been proposed based on the SLIC superpixel. J. Chang et al. extended SLIC to propose a probabilistic model for temporally consistent superpixels in video sequences. The SEEDS superpixel method also shares with SLIC the idea of growing superpixels from an initial set. However, SEEDS directly exchanges pixels between superpixels by moving boundaries instead of growing superpixels by clustering pixels around their centers. Based on the SEEDS superpixel, M. Van den Bergh et al. proposed an online, real-time video segmentation algorithm that exploits temporal information in videos.

In addition, there are segmentation methods based on graphs. C. Xu et al. developed a framework for streaming video sequences based on hierarchical graph-based segmentation. Using a spatiotemporal extension of the GraphCut method, A. Papazoglou et al. introduced a fast segmentation method for unconstrained videos, which include rapidly moving background, arbitrary object motion and appearance, as well as non-rigid deformations and articulations.

In this work, we employ the temporal superpixel method proposed by J. Chang et al. in our framework, because it overcomes the limitations of other video segmentation methods and is widely employed in the computer vision literature.
3. Multiscale Spatiotemporal Saliency Detection Method
3.1. Overview
The goal of this work is to detect salient regions in videos by combining static features with dynamic features, where the features are computed from regions rather than pixels. Figure 3 illustrates the process of our multiscale spatiotemporal saliency detection method.

First of all, the temporal superpixel model is executed to segment a video into spatiotemporal regions at various scale levels. Motion information, as well as features for each frame, is extracted at each scale level. From these features, we build feature maps, including both low-level feature maps presenting contrasts between regions and middle-level feature maps presenting properties inside regions. These two kinds of feature maps are combined to generate spatial saliency entities for regions at each scale level. Temporal consistency is incorporated into the spatial saliency entities to form spatiotemporal saliency entities, using an Adaptive Temporal Window (ATW) for each region individually to smooth saliency values across frames. Finally, a spatiotemporal saliency map is generated for each frame by fusing its multiscale spatiotemporal saliency entities.

Fig. 3: Pipeline of the proposed spatiotemporal saliency detection method. Low-level features (color, intensity, orientation, flow orientation, flow magnitude) and middle-level features (location, objectness, background probability, movement) are extracted from the multiscale segmentation of each video frame, combined into spatial saliency entities, smoothed into spatiotemporal saliency entities, and fused into the multiscale spatiotemporal saliency map.
3.2. Multiscale video segmentation
To support the intuition that objects in a video generally contain salient patterns at various scales, and that an object at a coarser scale may be composed of multiple parts at a finer scale, the video is segmented at multiple scales. Multiscale segmentation enables us to analyze saliency cues at multiple scale levels of structure, allowing us to deal with complex salient structures (c.f. Fig. 1). In this work, we segment a video at three scale levels. We remark that each segmentation level has a different number of superpixels, which are defined as non-overlapping regions.

To segment a video, we employed the temporal superpixel method, which is based on multiple-frame superpixel segmentation and extends SLIC. Differently from SLIC, the temporal superpixel method utilizes a spatial intensity Gaussian Mixture Model (GMM) combined with a motion model, which serves as a prior for the next frame. Motion information is used to propagate superpixels over frames, reducing the number of superpixels newly generated in a single frame.

After the segmentation process, we obtain temporal superpixels at multiple scales. At each scale, superpixels are connected across frames, so we can track the motion of a superpixel over frames through its positions in each frame.
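To make this step concrete, the following sketch builds a three-level region hierarchy for a single frame with SLIC from scikit-image. It is a simplified, per-frame stand-in for the temporal superpixel method used in our pipeline, which additionally propagates superpixels across frames; the superpixel counts per level are illustrative assumptions rather than the settings used in our experiments.

```python
from skimage.segmentation import slic

def multiscale_segmentation(frame, n_segments_per_level=(600, 300, 150)):
    """Segment one RGB frame (H x W x 3) at several superpixel scales,
    coarser levels having fewer superpixels.

    Returns one integer label map per scale level. Plain SLIC is used
    here as a per-frame stand-in for the temporal superpixels of the
    paper, which additionally link regions across frames.
    """
    return [slic(frame, n_segments=n, compactness=10.0, start_label=0)
            for n in n_segments_per_level]
```

Each returned label map defines the non-overlapping regions $r_{i,l}$ used at scale level $l$ in the following subsections.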
3.3. Spatial saliency entity construction
3.3.1. Low-level feature map
Human vision reacts to image regions with discriminative features such as unique color, high contrast, different orientation, or complex texture. To estimate the attractiveness of regions in a video, a contrast metric is usually used to evaluate the sensitivity of elements in each frame. The contrast is usually based on low-level features, including static information such as color, intensity, or texture, and dynamic information such as the magnitude or orientation of motion. A region with high contrast against surrounding regions can attract human attention and is perceptually more important.

For the $i$-th region at the $l$-th scale of the segmentation model at a frame, denoted by $r_{i,l}$, we compute its normalized color histogram in the CIE Lab color space, denoted by $\chi^{lf_{col}}_{i,l}$, and its distribution of lightness $\chi^{lf_{lig}}_{i,l}$. We quantize the four color channels (L, A, B, and Hue) into 16 bins per channel to compute $\chi^{lf_{col}}_{i,l}$. We also uniformly quantize $\chi^{lf_{lig}}_{i,l}$ into 16 bins.

We also calculate the orientation statistic $\chi^{lf_{ori}}_{i,l}$ of the region $r_{i,l}$. We use the following 2-D Gabor function to model the image texture at every pixel $(x, y)$:

$$g_{\lambda,\varphi,\gamma,\sigma,\theta}(x, y) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\left(2\pi\frac{x'}{\lambda} + \varphi\right), \quad x' = x\cos\theta + y\sin\theta, \quad y' = -x\sin\theta + y\cos\theta, \tag{1}$$

where $\gamma$, $\lambda$, $\sigma$, $\varphi$, and $\theta$ are parameters, as follows. $\gamma$ is the spatial aspect ratio. $\lambda = 8$ denotes the wavelength of the cosine factor of the Gabor filter kernel and herewith the preferred wavelength of the Gabor function. $\sigma$ is the standard deviation of the Gaussian factor of the Gabor function, where the ratio $\sigma/\lambda$ determines the spatial frequency bandwidth. In this paper, we fix $\sigma = 0.56\lambda$, corresponding to a bandwidth of one octave at half-response, i.e., $bw = 1$ in $\sigma = \frac{1}{\pi}\sqrt{\frac{\ln 2}{2}} \cdot \frac{2^{bw}+1}{2^{bw}-1}\,\lambda$. $\varphi$ is a phase offset that determines the symmetry of the Gabor filter. We use quadrature pairs of two filter banks, including an odd filter with $\varphi = \pi/2$ and an even filter with $\varphi = 0$. The angle parameter $\theta$ specifies the orientation of the normal to the parallel stripes of the Gabor function. In this work, we use 8 orientations $\theta = k\pi/8$ with $k \in \{0, 1, \ldots, 7\}$. We then quantize $\chi^{lf_{ori}}_{i,l}$ into 16 bins.

Since the human visual system is more sensitive to moving objects than to still objects, dynamic features are also compared between regions at the same segmentation level. Pixel-wise optical flow is used to analyze motion between consecutive frames. The regional motion features of a region are obtained by computing the distribution of this flow information within the region. The motion distribution of region $r_{i,l}$ is encoded in two descriptors: $\chi^{lf_{fmag}}_{i,l}$ is a normalized distribution of flow magnitude, and $\chi^{lf_{fori}}_{i,l}$ is a normalized histogram of flow orientation. We uniformly quantize $\chi^{lf_{fmag}}_{i,l}$ and $\chi^{lf_{fori}}_{i,l}$ into 16 and 9 bins, respectively.

The low-level feature map of each region is the sum of its feature distances to the other neighboring regions at the same scale level of the segmentation model, with different weight factors:

$$S^{lf}_{i,l} = \sum_{lf} w_{lf} \sum_{j \neq i} |r_{j,l}| \, \omega(r_{i,l}, r_{j,l}) \left\| \chi^{lf}_{i,l} - \chi^{lf}_{j,l} \right\|, \tag{2}$$

where $\left\| \chi^{lf}_{i,l} - \chi^{lf}_{j,l} \right\|$ is the Chi-Square distance between two histograms, and $lf \in \{lf_{col}, lf_{lig}, lf_{ori}, lf_{fmag}, lf_{fori}\}$ denotes one of the five features with corresponding weight $w_{lf}$. $|r_{j,l}|$ denotes the contrast weight of region $r_{j,l}$, which is the ratio of its size to the frame size. Regions with more pixels contribute higher contrast weight factors than those containing a smaller number of pixels. $\omega(r_{i,l}, r_{j,l})$ controls the influence of the spatial distance between two regions $r_{i,l}$ and $r_{j,l}$:

$$\omega(r_{i,l}, r_{j,l}) = e^{-\frac{D(r_{i,l}, r_{j,l})^2}{\sigma^2_{sp\text{-}dst}}}, \tag{3}$$

where $D(r_{i,l}, r_{j,l})$ is the Euclidean distance between the region centers and $\sigma_{sp\text{-}dst}$ is a parameter. $S^{lf}_{i,l}$ is normalized to the range $[0, 1]$.
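As an illustration of Eqs. (2) and (3), the sketch below computes the low-level contrast of every region at one scale level from pre-computed regional histograms. The uniform feature weights and the value of sigma_sp are placeholder assumptions; only the structure of the computation follows the equations above.

```python
import numpy as np

def chi_square_distance(h1, h2, eps=1e-10):
    # Chi-square distance between two normalized histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def low_level_contrast(histograms, centers, sizes, sigma_sp=0.25):
    """Low-level feature map for all N regions at one scale, per Eqs. (2)-(3).

    histograms: dict of feature name -> (N, bins) array of normalized
                regional histograms (color, lightness, orientation,
                flow magnitude, flow orientation).
    centers:    (N, 2) region centers, normalized to [0, 1].
    sizes:      (N,) region areas as fractions of the frame area.
    The uniform feature weights and sigma_sp are placeholders, not the
    paper's tuned values.
    """
    n = len(sizes)
    weights = {name: 1.0 / len(histograms) for name in histograms}
    saliency = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = np.linalg.norm(centers[i] - centers[j])
            omega = np.exp(-d ** 2 / sigma_sp ** 2)        # Eq. (3)
            for name, hists in histograms.items():         # Eq. (2)
                saliency[i] += (weights[name] * sizes[j] * omega *
                                chi_square_distance(hists[i], hists[j]))
    rng = saliency.max() - saliency.min()                   # normalize to [0, 1]
    return (saliency - saliency.min()) / rng if rng > 0 else saliency
```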
3.3.2. Middle-level feature map

In addition to contrasts between regions, we also compute properties of each region based on middle-level features. We observe that human vision is biased toward specific spatial information in a video, such as the center of the frame, foreground objects, or background, as well as the movements of objects over time. Therefore, our middle-level features are based on center bias, objectness, background prior, and movement metrics.

Human eye-tracking studies show that human attention favors the center of natural scenes when watching videos, so pixels close to the screen center can be salient in many cases. Our center bias is defined as:

$$\chi^{mf_{cen}}_{i,l} = \frac{1}{|r_{i,l}|} \sum_{j \in r_{i,l}} e^{-\frac{D(p_j, \bar{p})^2}{\sigma^2_{cen}}}, \tag{4}$$

where $|r_{i,l}|$ denotes the number of pixels in region $r_{i,l}$, $D(p_j, \bar{p})$ is the Euclidean distance from each pixel $p_j$ in the region to the image center $\bar{p}$, and $\sigma_{cen}$ is a parameter.

Objectness quantifies how likely it is for an image window to contain an object of any class. Our objectness feature is defined as:

$$\chi^{mf_{obj}}_{i,l} = \frac{1}{|r_{i,l}|} \sum_{j \in r_{i,l}} o_j, \tag{5}$$

where $o_j$ is the pixel-wise objectness map in region $r_{i,l}$. The objectness map provides meaningful distributions over object locations, giving the probability that each pixel belongs to an object. Four object cues are utilized: multi-scale saliency $MS$ measuring the uniqueness characteristic of objects, color contrast $CC$ measuring appearance differences from the surroundings, and edge density $ED$ and superpixel straddling $SS$ measuring the closed-boundary characteristic of objects. These object cues are combined in a Bayesian framework:

$$o_j = p^{(j)}(obj \mid A) = \frac{p^{(j)}(A \mid obj)\, p^{(j)}(obj)}{p^{(j)}(A)} = \frac{p^{(j)}(obj) \prod_{cue \in A} p^{(j)}(cue \mid obj)}{\sum_{c \in \{obj, bg\}} p^{(j)}(c) \prod_{cue \in A} p^{(j)}(cue \mid c)}, \tag{6}$$

where $A = \{MS, CC, ED, SS\}$ denotes the object cues and $p^{(j)}(\cdot)$ is the probability at pixel $j$. The priors $p(obj)$ and $p(bg)$ and the individual cue likelihoods $p(cue \mid c)$, $c \in \{obj, bg\}$, are estimated from the training dataset. This posterior constitutes the objectness score of the overlapping regions.

The regional background prior is based on differences in the spatial layout of image regions. Object regions are much less connected to image boundaries than background ones. In contrast, a region corresponding to background tends to be heavily connected to the image boundary. In order to compute the background probability of each region, each segmented image is built as an undirected weighted graph by connecting all adjacent regions and assigning their weights $w_{edge}$ as the Euclidean distance between their average colors in the CIE Lab space. The background feature of region $r_{i,l}$ is written as:

$$\chi^{mf_{bgr}}_{i,l} = \exp\left(-\frac{BndCon(r_{i,l})^2}{2\sigma^2_{bgr}}\right), \tag{7}$$

where $BndCon(r_{i,l})$ is the boundary connectivity of region $r_{i,l}$. Similarly to prior work, we set the parameter $\sigma_{bgr} = 1$. The boundary connectivity of region $r_{i,l}$ is calculated as the ratio of its length along the boundary to the square root of its spanning area:

$$BndCon(r_{i,l}) = \frac{len_{bnd}(r_{i,l})}{\sqrt{SpanArea(r_{i,l})}}, \tag{8}$$

where $len_{bnd}(r_{i,l})$ is the length of region $r_{i,l}$ along the boundary, calculated as the sum of the Geodesic distances from it to the regions on the boundary at the same scale level. $SpanArea(r_{i,l})$ is the spanning area of region $r_{i,l}$, calculated as the sum of the Geodesic distances from it to all regions in the frame at the same scale level. The Geodesic distance between any two regions is defined as the accumulated edge weight along their shortest path on the graph:

$$d_{geo}(r_{i,l}, r_{j,l}) = \min_{r_{1,l} = r_{i,l},\, r_{2,l},\, \ldots,\, r_{n,l} = r_{j,l}} \sum_{k=1}^{n-1} w_{edge}(r_{k,l}, r_{k+1,l}). \tag{9}$$

Moreover, to encode the movement of objects, we capture any sudden speed change in the motion of regions. The movement of a region is calculated as its average motion magnitude computed from the optical flow:

$$\chi^{mf_{mov}}_{i,l} = \frac{1}{|r_{i,l}|} \sum_{j \in r_{i,l}} m_j, \tag{10}$$

where $m_j$ is the motion magnitude at pixel $j$ in region $r_{i,l}$.

The middle-level feature map of region $r_{i,l}$ is computed as the sum of its attribute values with different weight factors:

$$S^{mf}_{i,l} = \sum_{mf} w_{mf}\, \chi^{mf}_{i,l}, \tag{11}$$

where $mf \in \{mf_{cen}, mf_{obj}, mf_{bgr}, mf_{mov}\}$ denotes one of the four features with corresponding weight factor $w_{mf}$. Finally, $S^{mf}_{i,l}$ is normalized to the range $[0, 1]$.
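The background prior of Eqs. (7)-(9) can be sketched as follows, using an all-pairs shortest-path computation over the region adjacency graph. The extraction of adjacencies and boundary regions from the label map is assumed to be done elsewhere; the code follows the definitions stated above with $\sigma_{bgr} = 1$.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def background_prior(mean_lab, edges, boundary_ids, sigma_bgr=1.0):
    """Background prior of Eqs. (7)-(9) for all N regions at one scale.

    mean_lab:     (N, 3) mean CIE-Lab color per region.
    edges:        iterable of (i, j) pairs of adjacent regions.
    boundary_ids: indices of regions touching the image border.
    """
    n = len(mean_lab)
    rows, cols, vals = [], [], []
    for i, j in edges:
        w = np.linalg.norm(mean_lab[i] - mean_lab[j])   # w_edge: Lab color distance
        rows += [i, j]; cols += [j, i]; vals += [w, w]
    graph = csr_matrix((vals, (rows, cols)), shape=(n, n))

    d_geo = dijkstra(graph, directed=False)             # all-pairs geodesic distance, Eq. (9)
    len_bnd = d_geo[:, boundary_ids].sum(axis=1)        # distances to boundary regions
    span_area = d_geo.sum(axis=1)                       # distances to all regions
    bnd_con = len_bnd / np.sqrt(span_area + 1e-10)      # boundary connectivity, Eq. (8)
    return np.exp(-bnd_con ** 2 / (2.0 * sigma_bgr ** 2))  # Eq. (7)
```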
3.3.3. Feature map combination

Combining the low-level feature maps and the middle-level feature maps allows us to obtain initial spatial saliency entities for all segmentation levels separately by a weighted multiplicative integration (c.f. Fig. 4):

$$S_{i,l} = \left(S^{lf}_{i,l}\right)^{\alpha} \left(S^{mf}_{i,l}\right)^{1-\alpha}, \tag{12}$$

where the parameter $\alpha$ controls the trade-off between the low-level feature maps and the middle-level feature maps. To weight the low-level feature map and the middle-level feature map equally, we set $\alpha = 0.5$. Finally, the spatial saliency entities are linearly normalized to the fixed range $[0, 1]$ in order to guarantee that regions with value 1 are the maxima of saliency. We demonstrate the effects of using the low-level feature maps and the middle-level feature maps on saliency results in Section 6.2.
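A direct transcription of Eq. (12), together with the linear normalization described above, might look as follows; alpha = 0.5 matches the equal weighting we use.

```python
import numpy as np

def combine_feature_maps(s_lf, s_mf, alpha=0.5):
    """Weighted multiplicative integration of Eq. (12), followed by the
    linear normalization to [0, 1] described above."""
    s = (np.asarray(s_lf) ** alpha) * (np.asarray(s_mf) ** (1.0 - alpha))
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else s
```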
Fig. 4: Saliency entity construction.
3.4. Incorporating temporal consistency
In a video, it is sometimes hard to distinguish objects from background because every pixel value changes over time regardless of whether it belongs to an object or to the background. Moreover, motion analysis shows that different parts of objects move at various speeds and, furthermore, background motion also changes with different speed and direction (c.f. the flow information in Fig. 5). This causes fluctuation of object appearances between frames. To reduce this negative effect, each spatial saliency entity at the current frame is combined with those at neighboring frames, smoothing saliency values over time. After this operation, saliency values on contiguous frames become similar, which generates robust spatiotemporal saliency entities.

Fig. 5: Regional motion calculation, showing a source image, its flow information, the segmented image, and the per-region motion distributions.

We propose to adaptively use a sliding window in the temporal domain, the Adaptive Temporal Window (ATW), for each region at each frame to capture speed variation by exploiting the motion information in the region. A spatial saliency entity at each scale level at the current frame is combined with the spatial saliency entities at neighboring frames using Gaussian combination weights, where nearer frames have larger weights:

$$\tilde{S}^t_{i,l} = \frac{1}{\Psi} \sum_{t' = t - \Phi^t_{i,l}}^{t} e^{-\frac{D(t, t')^2}{\sigma^2_{tp\text{-}dst}}}\, S^{t'}_{i,l}, \tag{13}$$

where $\Psi$ is the normalization factor and $S^{t}_{i,l}$ measures the spatial saliency entity of region $r_{i,l}$ at frame $t$, with corresponding weight $e^{-D(t,t')^2 / \sigma^2_{tp\text{-}dst}}$. $D(t, t')$ denotes the time difference between two frames, and the parameter $\sigma_{tp\text{-}dst} = 10$ controls the influence of previous frames. $\Phi^t_{i,l}$ controls the number of frames participating in the operation, expressed as:

$$\Phi^t_{i,l} = M e^{-\frac{\mu^t_{i,l}}{\lambda \beta^t_{i,l}}}, \tag{14}$$

where $M = 10$ and $\lambda = 2$ are parameters. $\beta^t_{i,l} = \sigma^t_{i,l} / \mu^t_{i,l}$ is the coefficient of variation measuring the dispersion of the motion distribution of each region, where $\mu^t_{i,l}$ and $\sigma^t_{i,l}$ are the mean value and the standard deviation of the motion distribution of region $r_{i,l}$ at frame $t$. To calculate the regional motion distribution, we first use pixel-wise optical flow to compute the motion magnitude of each pixel in a frame, and then exploit the distribution of motion magnitude in each region (c.f. Fig. 5).
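The ATW computation can be sketched as follows. The form Phi = M * exp(-mu / (lambda * beta)) follows our reconstruction of Eq. (14) from the definition of beta given above, and the helper names are hypothetical.

```python
import numpy as np

def window_length(motion_mags, M=10, lam=2.0):
    """Adaptive window length of Eq. (14) for one region at one frame.

    motion_mags holds the optical-flow magnitudes of the region's pixels.
    Phi = M * exp(-mu / (lam * beta)) follows our reading of Eq. (14);
    beta = sigma / mu is the coefficient of variation of the motion
    distribution.
    """
    mu = float(np.mean(motion_mags))
    beta = float(np.std(motion_mags)) / (mu + 1e-10)
    return int(round(M * np.exp(-mu / (lam * beta + 1e-10))))

def smooth_saliency(history, sigma_tp=10.0):
    """Gaussian-weighted temporal smoothing of Eq. (13).

    history: saliency values of one region at frames t - Phi .. t,
    oldest first; nearer frames get larger weights.
    """
    history = np.asarray(history, dtype=float)
    offsets = np.arange(len(history) - 1, -1, -1)   # D(t, t') per entry
    w = np.exp(-offsets ** 2 / sigma_tp ** 2)
    return float(np.sum(w * history) / np.sum(w))
```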
3.5. Spatiotemporal saliency generation

A multiscale saliency map is usually generated by averaging the saliency maps over all segmentation levels. However, this kind of processing does not achieve good results because the saliency values at each scale level have their own advantages and disadvantages. Compared with naive averaging of all levels, the Multi-layer Cellular Automata (MCA) integration algorithm aggregates confident saliency values from these levels, yielding better performance. Therefore, we implement the MCA algorithm to take advantage of the multiple scale levels. The generated multiscale spatiotemporal saliency value $SM^t_p$ at pixel $p$ at frame $t$ is given by:

$$SM^t_p = MCA\left(\tilde{S}^t_{\Omega_l(p)}\right), \quad \text{with } l \in \{1, 2, \ldots, L\}, \tag{15}$$

where $\Omega_l(\cdot)$ is a function that maps a pixel to the region at scale level $l$ to which it belongs. We note that all operations are processed pixel-wise. $\tilde{S}^t_{\Omega_l(\cdot)}$ measures the multiscale saliency values at each pixel generated at the $l$-th scale of the segmentation model, which has $L$ scale levels, at frame $t$. $MCA(\cdot)$ is the Multi-layer Cellular Automata integration, which exploits the intrinsic relevance of similar regions through interactions with neighbors. The MCA algorithm is described in Table 2. Figure 6 depicts examples of the MCA algorithm.

Table 2: Multi-layer Cellular Automata integration algorithm.

Input: $S^t_l$ with $l \in \{1, \ldots, L\}$
Output: $SM^t$
▷ Saliency map binarization by the Otsu algorithm:
  $\gamma^t_l \leftarrow \log\left(\frac{\Theta_{otsu}(S^t_l)}{1 - \Theta_{otsu}(S^t_l)}\right)$
▷ Saliency refinement:
  for $k = 1 \to K - 1$ do
    $\Lambda\left(S^{t,k+1}_l\right) \leftarrow \Lambda\left(S^{t,k}_l\right) + \ln\left(\frac{\lambda}{1-\lambda}\right) \sum_{i=1, i \neq l}^{L} \mathrm{sign}\left(\Lambda\left(S^{t,k}_i\right) - \gamma^t_i\right)$, where $\Lambda\left(S^t_l\right) = \ln\left(\frac{S^t_l}{1 - S^t_l}\right)$
  end for
▷ Saliency map integration:
  $SM^t \leftarrow \frac{1}{L} \sum_{l=1}^{L} S^{t,K}_l$
Parameter settings: $K = 5$ and a fixed value of $\ln\left(\frac{\lambda}{1-\lambda}\right)$.

Fig. 6: Saliency integration using Multi-layer Cellular Automata.
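A minimal sketch of the MCA integration of Table 2 is given below. The constant log_lambda stands for ln(lambda / (1 - lambda)); its value here is a placeholder, as the paper's exact constant is not reproduced, and the Otsu threshold is recomputed from a 256-bin histogram.

```python
import numpy as np

def otsu_threshold(img, bins=256):
    """Otsu's threshold for a saliency map with values in [0, 1]."""
    hist, edges = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)
    m = np.cumsum(p * centers)
    w1, mu_t = 1.0 - w0, m[-1]
    valid = (w0 > 0) & (w1 > 0)
    score = np.zeros(bins)
    score[valid] = (mu_t * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    return float(np.clip(centers[np.argmax(score)], 1e-6, 1.0 - 1e-6))

def mca_integration(maps, K=5, log_lambda=0.5):
    """Multi-layer Cellular Automata fusion following Table 2.

    maps: (L, H, W) stack of per-level saliency maps in (0, 1).
    log_lambda plays the role of ln(lambda / (1 - lambda)); the value
    used here is a placeholder assumption.
    """
    eps = 1e-6
    s = np.clip(np.asarray(maps, dtype=float), eps, 1.0 - eps)
    L = s.shape[0]
    gammas = [np.log(t / (1.0 - t)) for t in (otsu_threshold(s[l]) for l in range(L))]

    for _ in range(K - 1):                       # synchronous refinement steps
        logit = np.log(s / (1.0 - s))            # Lambda(S) = ln(S / (1 - S))
        new_s = np.empty_like(s)
        for l in range(L):
            votes = sum(np.sign(logit[i] - gammas[i]) for i in range(L) if i != l)
            new_s[l] = 1.0 / (1.0 + np.exp(-(logit[l] + log_lambda * votes)))
        s = np.clip(new_s, eps, 1.0 - eps)
    return s.mean(axis=0)                        # final integration over levels
```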
4. Experimental settings
4.1. Datasets
We used four datasets for all experiments: the Weizmann human action dataset, the MCL2014 dataset, the SegTrack2 dataset, and the DAVIS dataset.

The Weizmann dataset contains 93 video sequences of nine people performing ten natural actions such as running, walking, jacking, and waving. This dataset was created for human action recognition and has a simple static background, so it is easy to distinguish objects from scenes. Each sequence in the dataset has a spatial resolution of 180 × 144 and consists of about 60 frames, with a ground-truth foreground mask for every frame.

The MCL2014 dataset contains 8 video sequences with various backgrounds such as streets, roads, and halls. In this dataset, multiple objects such as crowds move in different directions and at different speeds. Each sequence in the dataset has a spatial resolution of 480 × 270 and consists of around 800 frames. Binary ground-truth maps are manually annotated for every 8th frame. We remark that another version of the MCL dataset, called the MCL2015 dataset, was published with 9 video sequences. Because it offers double the frames per video, the MCL2014 dataset (800 frames per video) was employed instead of the MCL2015 dataset (400 frames per video).

The SegTrack2 dataset, which extends the SegTrack dataset, contains 14 challenging video sequences and was originally designed for video object segmentation. Half of the videos in this dataset have multiple salient objects. The dataset is designed to be challenging in that it exhibits background-foreground color similarity, fast motion, and complex shape deformation. Dynamics in this dataset are caused by moving cameras, which track objects moving from far to near, from the borders to the center, or vice versa. Each sequence in the dataset has a spatial resolution of 352 × 288 and consists of about 75 frames with binary ground-truth masks.

The DAVIS dataset consists of 50 high-quality video sequences, available at 854 × 480 spatial resolution and Full HD 1080p, with about 70 frames per video. Each video has one single salient object or two spatially connected objects, either with low contrast or overlapping the image boundary. It is a challenging dataset because of frequent occlusions, motion blur, and appearance changes. In this work, we used only the 854 × 480 resolution video sequences.

We note that these datasets have different characteristics: a simple background for the Weizmann dataset, multiple moving objects for the MCL2014 dataset, and diverse complex dynamic scenes for the SegTrack2 and DAVIS datasets. All the datasets contain manually annotated pixel-wise ground-truth.
4.2. Evaluation metrics
The Precision-Recall and F-measure metrics are used to evaluate the performance of object location detection at a binary threshold. The precision value corresponds to the ratio of correctly assigned salient pixels to all the pixels of the extracted regions. The recall value is defined as the percentage of detected salient pixels relative to the number of salient pixels in the ground-truth. Given a ground-truth $GT$ and the binarized map $BM$ of a saliency map, we have:

$$Precision = \frac{|BM \cap GT|}{|BM|} \quad \text{and} \quad Recall = \frac{|BM \cap GT|}{|GT|}. \tag{16}$$

The F-measure is the overall performance measure computed as the weighted harmonic mean of precision and recall:

$$F_\beta = \frac{(1 + \beta^2)\, Precision \times Recall}{\beta^2 \times Precision + Recall}. \tag{17}$$

Similarly to previous work, we chose $\beta^2 = 0.3$. $F_\beta$ reflects the overall prediction accuracy.

There are different ways of binarizing a saliency map to compute the F-measure. We used F-Adap, an adaptive threshold for generating a binary saliency map. As suggested in the literature, we employed an adaptive threshold for each image, determined as the sum of the mean value and the standard deviation of the given saliency map: $\theta = \mu + \sigma$, where $\mu$ and $\sigma$ are the mean value and the standard deviation of the given saliency map, respectively. We then computed the average F-measure scores over the frames. We also used F-Max, the maximum F-measure score over thresholds from 0 to 255. We note that for each threshold, we binarize the saliency map to compute precision and recall at each frame of a video and then take the average over the video. After that, the mean of the averages over the videos in a dataset is computed. The F-measure is computed from the final precision and recall.

The overlap-based evaluation measures mentioned above do not consider true negative saliency assignments, i.e., the pixels correctly marked as non-salient. For a more comprehensive comparison, we thus also used the Mean Absolute Error (MAE) to compute the average absolute per-pixel difference between a saliency map $SM$ and its corresponding ground-truth $GT$:

$$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left\| SM(x, y) - GT(x, y) \right\|, \tag{18}$$

where $W$ and $H$ are the width and the height of the maps, respectively. We note that the MAE is also computed from the mean average value over the dataset, in the same way as the F-measure.
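For concreteness, F-Adap and MAE for a single frame can be computed as in the following sketch; per-video and per-dataset averaging then proceeds as described above.

```python
import numpy as np

def f_adap(saliency, gt, beta2=0.3):
    """F-measure with the adaptive threshold theta = mu + sigma (F-Adap).

    saliency: float map in [0, 1]; gt: boolean ground-truth mask.
    """
    theta = saliency.mean() + saliency.std()
    bm = saliency >= theta
    tp = np.logical_and(bm, gt).sum()
    precision = tp / max(bm.sum(), 1)   # Eq. (16)
    recall = tp / max(gt.sum(), 1)
    denom = beta2 * precision + recall
    return (1.0 + beta2) * precision * recall / denom if denom > 0 else 0.0  # Eq. (17)

def mae(saliency, gt):
    """Mean Absolute Error of Eq. (18)."""
    return float(np.abs(saliency - gt.astype(float)).mean())
```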
5. Comparison with state-of-the-art methods
We compared the performance of our method with several state-of-the-art models, namely LC, LD, RWRV, SAG, SEG, STS, BL, BSCA, MBS, MC, MST, and WSC, which are classified in Table 3. We compared our method not only with video saliency models but also with recent saliency detection methods for still images, which show good performance on image datasets. We applied the methods developed for still images frame-wise to videos. We remark that we ran the original codes provided by the authors with the recommended parameter settings to obtain results.

Table 3: Compared state-of-the-art methods and their classification.

target    image                          video
method    BL, BSCA, MBS, MC, MST, WSC    LC, LD, RWRV, SAG, SEG, STS

Some examples for visual comparison of the methods are shown in Fig. 9 and Fig. 10, suggesting that our method produces the best results on the datasets. Our method can handle complex foreground and background with different details, giving accurate and uniform saliency assignment. Our method also substantially advances the state of the art on the datasets on all evaluation criteria.

5.1. Precision-Recall Curve
We binarized each saliency map into a binary mask using a binary threshold θ (varied from 0 to 255). For each θ, the binary mask is checked against the ground-truth to evaluate the accuracy of salient object detection and compute the Precision-Recall Curve (PRC) (c.f. Fig. 7). The PRC is used to evaluate the performance of object location detection because it captures the behavior of both precision and recall under varying thresholds. The PRC therefore provides a reliable comparison of how well various saliency maps highlight salient regions in images.

Figure 7 shows that the proposed method consistently produces saliency maps closer to the ground-truth than the others. Our method achieves the highest precision over most of the recall range on almost all datasets. We observe that our method is the best on the Weizmann, SegTrack2, and DAVIS datasets, while on the MCL2014 dataset our method is second to the best (lower than STS). The proposed method thus significantly outperforms the state-of-the-art methods across the public datasets. Especially on the most challenging datasets (the SegTrack2 and DAVIS datasets), the performance gains of our method over all the other methods are more noticeable. Our method is suitable for dealing with dynamic scenes and multiple moving objects in videos.

Fig. 7: Quantitative comparison with state-of-the-art methods, using Precision-Recall Curves at different fixed thresholds, on (a) the Weizmann dataset, (b) the MCL2014 dataset, (c) the SegTrack2 dataset, and (d) the DAVIS dataset. Our method is marked with a bold curve. Note that the state-of-the-art methods are divided into two groups only for clear presentation.

From Figure 7, at one end of the PRCs, at maximum recall, all salient pixels are retained as positives, i.e., considered to be foreground, so all the methods have the same precision and recall values. At the other end of the PRCs, our method is among the methods with the highest minimum recall values on the datasets, meaning that our method generates saliency maps containing more salient pixels with the maximum saliency value of 1. Therefore, our method has more practical advantages than other methods. For example, when the best binary threshold for extracting salient objects from a saliency map cannot be determined, an adaptive threshold of the saliency map can be used while the accuracy of object extraction is still ensured (c.f. Section 5.2).
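The per-frame computation behind these curves can be sketched as follows; averaging over frames and videos, as described above, yields the plotted PRCs.

```python
import numpy as np

def pr_curve(saliency_255, gt):
    """Per-frame precision and recall at every integer threshold 0..255.

    saliency_255: uint8 saliency map; gt: boolean ground-truth mask.
    The paper averages these values over the frames of each video and
    then over the videos of a dataset before plotting.
    """
    precisions, recalls = np.zeros(256), np.zeros(256)
    for theta in range(256):
        bm = saliency_255 >= theta
        tp = np.logical_and(bm, gt).sum()
        precisions[theta] = tp / max(bm.sum(), 1)
        recalls[theta] = tp / max(gt.sum(), 1)
    return precisions, recalls
```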
5.2. F-measure
We evaluated F-Adap and F-Max to compare the proposed method with the state-of-the-art methods (c.f. Table 4). Table 4 shows that our method achieves the best performance on all the datasets. Our method achieves the highest scores on the Weizmann dataset, the SegTrack2 dataset, and the DAVIS dataset on both metrics.

Table 4: Quantitative comparison with state-of-the-art methods on the four datasets (Weizmann, MCL2014, SegTrack2, and DAVIS), using F-measure (F-Adap and F-Max) (higher is better). The best and the second best results are shown in blue and green, respectively. Our method is marked in bold.

On the most challenging dataset, the DAVIS dataset, the proposed method outperforms the second best method by a large margin on all the metrics. Our method achieves 0.627 in F-Adap and 0.645 in F-Max, while the second best method (SAG) achieves 0.494 and 0.548, respectively. On the MCL2014 dataset, the proposed method achieves 0.644 in F-Adap while the other methods score lower than 0.5. Our method is second best in F-Max, slightly below the best method (STS) (0.702 vs. 0.728).

5.3. Mean Absolute Error
To further demonstrate the effectiveness of the proposed method, we compare our method with the state-of-the-art methods using the MAE metric. Table 5 shows the results of our method and the other methods on the four benchmark datasets. Our method outperforms the other methods on almost all datasets. The MAE of our method is lower than that of the others, which suggests that our method not only highlights the overall salient objects but also preserves details better. It can be seen from Table 5 that our method shows the lowest MAE on the Weizmann dataset, the MCL2014 dataset, and the DAVIS dataset. Our method achieves 0.009, 0.051, and 0.077 in MAE on these datasets, while the second best methods achieve 0.012 (MBS), 0.107 (STS), and 0.103 (SAG), respectively. Our method thus outperforms the second best methods by a large margin on these datasets. On the SegTrack2 dataset, our method is second best at 0.125, while the best method, SAG, achieves 0.106.
Table 5: Quantitative comparison with state-of-the-art methods on the four datasets, using Mean Absolute Error (MAE) (smaller is better). The best and the second best results are shown in blue and green, respectively. Our method is marked in bold.
6. Evaluation of the proposed method
In this section, we analyze and discuss each component of the proposed method to evaluate its actual contribution. For this evaluation, we use F-Adap, F-Max, and MAE.
6.1. Multiple level processing evaluation
To verify the effectiveness of our multiscale analysis, we performed experiments comparing saliency cues analyzed in the multiscale segmentation model, combining saliency maps at three levels (denoted by Multi-level), against single scales used separately (denoted by 1st Level, 2nd Level, and 3rd Level).

The results in Table 6 show that multiscale analysis outperforms saliency computation at a single scale level on all datasets. Processing at multiple scales yields a large improvement over single-layer saliency computation. Our method therefore benefits from utilizing information from multiple image layers.

Table 6: Multiple level processing evaluation, using F-measure scores (higher is better) and Mean Absolute Error (smaller is better). The best results are shown in blue.

Dataset     Method        F-Adap ⇑   F-Max ⇑   MAE ⇓
Weizmann    Multi-level   0.909      0.912     0.009
            1st Level     0.773      0.904     0.040
            2nd Level     0.748      0.902     0.039
            3rd Level     0.727      0.897     0.040
MCL2014     Multi-level   0.644      0.702     0.051
            1st Level     0.557      0.632     0.114
            2nd Level     0.565      0.688     0.108
            3rd Level     0.565      0.702     0.106
SegTrack2   Multi-level   0.510      0.677     0.125
            1st Level     0.463      0.667     0.175
            2nd Level     0.460      0.669     0.175
            3rd Level     0.465      0.667     0.175
DAVIS       Multi-level   0.627      0.645     0.077
            1st Level     0.554      0.640     0.142
            2nd Level     0.556      0.642     0.143
            3rd Level     0.550      0.638     0.146
Average     Multi-level   0.673      0.734     0.066
            1st Level     0.587      0.711     0.118
            2nd Level     0.582      0.725     0.116
            3rd Level     0.577      0.726     0.117
6.2. Evaluation of the combination of low-level and middle-level feature maps
In order to verify the effectiveness of combining low-level and middle-level features to generate a saliency map, we conducted experiments comparing our method (denoted by Feature combination) with the methods using a single kind of feature map (denoted by Low-level feature, with integration weight α = 1, and Middle-level feature, with α = 0). Table 7 illustrates the results, showing that combining the two kinds of feature maps significantly outperforms using either single feature map on all datasets. Utilizing multiple features therefore yields a large improvement over using only a single feature.

Table 7: Feature map combination evaluation, using F-measure scores (higher is better) and Mean Absolute Error (smaller is better). The best results are shown in blue.

Dataset     Method                 F-Adap ⇑   F-Max ⇑   MAE ⇓
Weizmann    Feature combination    0.909      0.912     0.009
            Low-level feature      0.412      0.876     0.153
            Middle-level feature   0.740      0.894     0.030
MCL2014     Feature combination    0.644      0.702     0.051
            Low-level feature      0.496      0.588     0.121
            Middle-level feature   0.440      0.611     0.146
SegTrack2   Feature combination    0.510      0.677     0.125
            Low-level feature      0.360      0.491     0.238
            Middle-level feature   0.359      0.549     0.197
DAVIS       Feature combination    0.627      0.645     0.077
            Low-level feature      0.526      0.560     0.119
            Middle-level feature   0.303      0.392     0.249
Average     Feature combination    0.673      0.734     0.066
            Low-level feature      0.449      0.629     0.158
            Middle-level feature   0.461      0.612     0.156

6.3. Temporal consistency evaluation
We evaluated the effectiveness of introducing the ATW by comparing our method (denoted by with Temporal consistency) with the method not incorporating temporal consistency (denoted by w/o Temporal consistency). Table 8 illustrates the results. We observe that introducing the ATW slightly improves the results on the Weizmann dataset and the DAVIS dataset, while it slightly degrades the results on the MCL2014 dataset and the SegTrack2 dataset. However, the averaged results still show that introducing the ATW improves the performance of the proposed method.

Table 8: Temporal consistency evaluation, using F-measure scores (higher is better) and Mean Absolute Error (smaller is better). The best results are shown in blue.

Dataset     Method                       F-Adap ⇑   F-Max ⇑   MAE ⇓
Weizmann    with Temporal consistency    0.909      0.912     0.009
            w/o Temporal consistency     0.899      0.907     0.010
MCL2014     with Temporal consistency    0.644      0.702     0.051
            w/o Temporal consistency     0.645      0.703     0.051
SegTrack2   with Temporal consistency    0.510      0.677     0.125
            w/o Temporal consistency     0.515      0.680     0.122
DAVIS       with Temporal consistency    0.627      0.645     0.077
            w/o Temporal consistency     0.624      0.643     0.079
Average     with Temporal consistency    0.673      0.734     0.066
            w/o Temporal consistency     0.671      0.733     0.066

The limitation of the ATW incorporation seems to depend on the accuracy of the superpixel segmentation. Ineffective propagation of superpixels across time causes saliency values to be combined between incorrect regions, which propagates errors through the temporal consistency incorporation process. In Fig. 8, we present some failure cases. In a video sequence with extremely difficult scenes containing small, low-contrast salient objects (see the examples in Fig. 8), temporal superpixel methods can neither effectively segment the scene nor correctly track regions over frames. This results in poor salient object detection. Firstly, spatial saliency entities fail to distinguish foreground regions from the background when the two are similar in appearance in almost all frames. Secondly, errors can be propagated when temporal consistency is combined from incorrect regions, resulting in spatiotemporal saliency maps that are worse than those from single-frame processing.

Fig. 8: Failure cases of our method. (a) Input frames, (b) ground-truth, (c) results of our method, (d) results without incorporating temporal consistency.
7. Conclusion
Differently from images, videos include dynamics, i.e., temporal position changes of entities in each frame and temporal changes of the background. Such dynamics should be carefully considered when detecting salient objects in videos. To capture the dynamics, regions corresponding to each entity in the video frame are more suitable than pixels. Motivated by this, we presented a region-based multiscale spatiotemporal saliency detection method for videos. In our method, we first segment each frame into regions and, for each region, utilize static features and dynamic features computed at the low and middle levels to obtain saliency cues. By changing the scale of segmentation, we explore saliency cues at multiple scales. To keep temporal consistency across consecutive frames, we introduce adaptive temporal windows computed from the motion of segmented regions. Fusing the saliency cues obtained at multiple scales using adaptive temporal windows allows us to obtain our spatiotemporal saliency map. Our intensive experiments using publicly available datasets demonstrate that our method outperforms the state of the art.

The proposed method can detect salient objects in dynamic scenes, but it still does not work well for complex dynamic backgrounds. This problem could be tackled by incorporating high-level knowledge to further improve the saliency maps. Our proposed method has the flexibility to utilize more features. Although the current method utilizes low-level and middle-level features, we plan to utilize high-level features such as human faces or semantic knowledge exploited from videos. We believe the saliency maps generated by our proposed method can be further used for efficient object detection and action recognition.

Fig. 9: Visual comparison of our method to the state-of-the-art methods. From top-left to bottom-right, the original image and ground-truth are followed by the outputs of our method, LC, LD, RWRV, SAG, SEG, STS, BL, BSCA, MBS, MC, MST, and WSC. Our method, surrounded by red rectangles, achieves the best results.

Fig. 10: Visual comparison of our method to the state-of-the-art methods. From top-left to bottom-right, the original image and ground-truth are followed by the outputs of our method, LC, LD, RWRV, SAG, SEG, STS, BL, BSCA, MBS, MC, MST, and WSC. Our method, surrounded by red rectangles, achieves the best results.

References
References
1. R. Achanta, S. Hemami, F. Estrada and S. Susstrunk, Frequency-tuned salient region detection, in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (June 2009) pp. 1597–1604.
2. R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua and S. Susstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, Pattern Analysis and Machine Intelligence, IEEE Transactions on (Nov 2012) 2274–2282.
3. B. Alexe, T. Deselaers and V. Ferrari, What is an object?, in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (June 2010) pp. 73–80.
4. M. Blank, L. Gorelick, E. Shechtman, M. Irani and R. Basri, Actions as space-time shapes, in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, Vol. 2 (Oct 2005) pp. 1395–1402.
5. A. Borji, M. M. Cheng, H. Jiang and J. Li, Salient object detection: A benchmark, IEEE Transactions on Image Processing (Dec 2015) 5706–5722.
6. A. Borji and L. Itti, Scene classification with a sparse set of salient regions, in Robotics and Automation (ICRA), 2011 IEEE International Conference on (May 2011) pp. 1902–1908.
7. A. Borji, D. Sihite and L. Itti, Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study, Image Processing, IEEE Transactions on (Jan 2013) 55–69.
8. J. Chang, D. Wei and J. Fisher, A video representation using temporal superpixels, in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (June 2013) pp. 2051–2058.
9. M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet and N. Crook, Efficient salient region detection with soft image abstraction, in Computer Vision (ICCV), 2013 IEEE International Conference on (Dec 2013) pp. 1529–1536.
10. J. G. Daugman, Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters, JOSA A (7) (1985) 1160–1169.
11. M. V. den Bergh, X. Boix, G. Roig, B. de Capitani and L. J. V. Gool, SEEDS: superpixels extracted via energy-driven sampling, in Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VII (Springer, 2012) pp. 13–26.
12. C. Feichtenhofer, A. Pinz and R. Wildes, Dynamically encoded actions based on spacetime saliency, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (June 2015) pp. 2755–2764.
13. C. Guo and L. Zhang, A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, Image Processing, IEEE Transactions on (Jan 2010) 185–198.
14. M. Guo, Y. Zhao, C. Zhang and Z. Chen, Fast object detection based on selective visual attention, Neurocomputing (November 2014) 184–197.
15. A. Hagiwara, A. Sugimoto and K. Kawamoto, Saliency-based image editing for guiding visual attention, in Proceedings of the 1st International Workshop on Pervasive Eye Tracking and Mobile Eye-based Interaction (ACM, New York, NY, USA, 2011) pp. 43–48.
16. A. P. Hillstrom and S. Yantis, Visual motion and attentional capture, Perception & Psychophysics (4) (1994) 399–411.
17. L. Itti and P. Baldi, A principled approach to detecting surprising events in video, in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1 (June 2005) pp. 631–637.
18. L. Itti, N. Dhavale and F. Pighin, Realistic avatar eye and head animation using a neurobiological model of visual attention, in B. Bosacchi, D. B. Fogel and J. C. Bezdek (eds.), Proc. SPIE 48th Annual International Symposium on Optical Science and Technology, Vol. 5200 (SPIE Press, Bellingham, WA, Aug 2003) pp. 64–78.
19. Y. Jia and M. Han, Category-independent object-level saliency detection, in Computer Vision (ICCV), 2013 IEEE International Conference on (Dec 2013) pp. 1761–1768.
20. B. Jiang, L. Zhang, H. Lu, C. Yang and M.-H. Yang, Saliency detection via absorbing Markov chain, in Proceedings of the 2013 IEEE International Conference on Computer Vision (IEEE Computer Society, Washington, DC, USA, 2013) pp. 1665–1672.
21. A. Kae, B. Marlin and E. Learned-Miller, The shape-time random field for semantic video labeling, in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (June 2014) pp. 272–279.
22. H. Kim, Y. Kim, J.-Y. Sim and C.-S. Kim, Spatiotemporal saliency detection for video sequences based on random walk with restart, Image Processing, IEEE Transactions on (Aug 2015) 2552–2564.
23. T.-N. Le and A. Sugimoto, Contrast based hierarchical spatial-temporal saliency for video, in Image and Video Technology - 7th Pacific-Rim Symposium, PSIVT 2015, Auckland, New Zealand, November 25-27, 2015, Revised Selected Papers, Lecture Notes in Computer Science Vol. 9431 (Springer International Publishing Switzerland, 2015) pp. 734–748.
24. S.-H. Lee, J.-H. Kim, K. P. Choi, J.-Y. Sim and C.-S. Kim, Video saliency detection based on spatiotemporal feature learning, in Image Processing (ICIP), 2014 IEEE International Conference on (Oct 2014) pp. 1120–1124.
25. Y. J. Lee, J. Kim and K. Grauman, Key-segments for video object segmentation, in Computer Vision (ICCV), 2011 IEEE International Conference on (Nov 2011) pp. 1995–2002.
26. F. Li, T. Kim, A. Humayun, D. Tsai and J. M. Rehg, Video segmentation by tracking many figure-ground segments, in Computer Vision (ICCV), 2013 IEEE International Conference on (Dec 2013) pp. 2192–2199.
27. N. Li, B. Sun and J. Yu, A weighted sparse coding framework for saliency detection, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (June 2015) pp. 5216–5223.
28. C. Liu, Beyond pixels: exploring new representations and applications for motion analysis (MIT, 2009).
29. T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang and H.-Y. Shum, Learning to detect a salient object, Pattern Analysis and Machine Intelligence, IEEE Transactions on (Feb 2011) 353–367.
30. T. Lu, Z. Yuan, Y. Huang, D. Wu and H. Yu, Video retargeting with nonlinear spatial-temporal saliency fusion, in Image Processing (ICIP), 2010 17th IEEE International Conference on (Sept 2010) pp. 1801–1804.
31. Y. Luo, L. Cheong and J. Cabibihan, Modeling the temporality of saliency, in Computer Vision - ACCV 2014 - 12th Asian Conference on Computer Vision, Singapore, Singapore, November 1-5, 2014, Revised Selected Papers, Part III (Springer, 2014) pp. 205–220.
32. M. Mancas, N. Riche, J. Leroy and B. Gosselin, Abnormal motion selection in crowds using bottom-up saliency, in Image Processing (ICIP), 2011 18th IEEE International Conference on (Sept 2011) pp. 229–232.
33. R. Margolin, A. Tal and L. Zelnik-Manor, What makes a patch distinct?, in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (June 2013) pp. 1139–1146.
34. S. Nataraju, V. Balasubramanian and S. Panchanathan, An integrated approach to visual attention modeling for saliency detection in videos, in L. Wang, G. Zhao, L. Cheng and M. Pietikäinen (eds.), Machine Learning for Vision-Based Motion Analysis: Theory and Techniques (Springer London, London, 2011) pp. 181–214.
35. N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics (Jan 1979) 62–66.
36. A. Papazoglou and V. Ferrari, Fast object segmentation in unconstrained video, in Computer Vision (ICCV), 2013 IEEE International Conference on (Dec 2013) pp. 1777–1784.
37. O. Pele and M. Werman, The quadratic-chi histogram distance family, in Proceedings of the 11th European Conference on Computer Vision: Part II (Springer-Verlag, Berlin, Heidelberg, 2010) pp. 749–762.
38. F. Perazzi, J. Pont-Tuset, B. McWilliams, L. V. Gool, M. Gross and A. Sorkine-Hornung, A benchmark dataset and evaluation methodology for video object segmentation, in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on (June 2016) pp. 724–732.
39. Y. Qin, H. Lu, Y. Xu and H. Wang, Saliency detection via cellular automata, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (June 2015) pp. 110–119.
40. E. Rahtu, J. Kannala, M. Salo and J. Heikkilä, Segmenting salient objects from images and videos, in Computer Vision - ECCV 2010 - 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part V (Springer, 2010) pp. 366–379.
41. G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez and A. M. Lopez, Vision-based offline-online perception paradigm for autonomous driving, in Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on (Jan 2015) pp. 231–238.
42. H. J. Seo and P. Milanfar, Static and space-time visual saliency detection by self-resemblance, Journal of Vision (12) (2009) p. 15.
43. N. Tong, H. Lu, X. Ruan and M.-H. Yang, Salient object detection via bootstrap learning, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (June 2015) pp. 1884–1892.
44. D. Tsai, M. Flagg, A. Nakazawa and J. M. Rehg, Motion coherent tracking using multi-label MRF optimization, International Journal of Computer Vision (2) (2012) 190–202.
45. P.-H. Tseng, R. Carmi, I. G. M. Cameron, D. P. Munoz and L. Itti, Quantifying center bias of observers in free viewing of dynamic natural scenes, Journal of Vision (7) (2009) p. 4.
46. W.-C. Tu, S. He, Q. Yang and S.-Y. Chien, Real-time salient object detection with a minimum spanning tree, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016).
47. M. Van den Bergh, G. Roig, X. Boix, S. Manen and L. Van Gool, Online video SEEDS for temporal window objectness, in Computer Vision (ICCV), 2013 IEEE International Conference on (Dec 2013) pp. 377–384.
48. W. Wang, J. Shen and F. Porikli, Saliency-aware geodesic video object segmentation, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (June 2015) pp. 3395–3402.
49. W. Wang, J. Shen and L. Shao, Consistent video saliency using local gradient flow optimization and global refinement, Image Processing, IEEE Transactions on (Nov 2015) 4185–4196.
50. C. Xu, C. Xiong and J. Corso, Streaming hierarchical video segmentation, in A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato and C. Schmid (eds.), Computer Vision - ECCV 2012, Lecture Notes in Computer Science Vol. 7577 (Springer Berlin Heidelberg, 2012) pp. 626–639.
51. Q. Yan, L. Xu, J. Shi and J. Jia, Hierarchical saliency detection, in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (June 2013) pp. 1155–1162.
52. Spatiotemporal salience via centre-surround comparison of visual spacetime orientations, in K. M. Lee, Y. Matsushita, J. M. Rehg and Z. Hu (eds.), Computer Vision - ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part III (Springer Berlin Heidelberg, Berlin, Heidelberg, 2013) pp. 533–546.
53. Y. Zhai and M. Shah, Visual attention detection in video sequences using spatiotemporal cues, in Proceedings of the 14th ACM International Conference on Multimedia (ACM, New York, NY, USA, 2006) pp. 815–824.
54. J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price and R. Mech, Minimum barrier salient object detection at 80 fps, in Computer Vision (ICCV), 2015 IEEE International Conference on (Dec 2015) pp. 1404–1412.
55. J. Zhang and S. Sclaroff, Saliency detection: A Boolean map approach, in Computer Vision (ICCV), 2013 IEEE International Conference on (IEEE, Dec 2013) pp. 153–160.
56. L. Zhang, Z. Gu and H. Li, SDSP: A novel saliency detection method by combining simple priors, in Image Processing (ICIP), 2013 IEEE International Conference on (Sept 2013) pp. 171–175.
57. F. Zhou, S. B. Kang and M. Cohen, Time-mapping using space-time saliency, in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (June 2014) pp. 3358–3365.
58. W. Zhu, S. Liang, Y. Wei and J. Sun, Saliency optimization from robust background detection, in