International Journal of Pattern Recognition and Artificial Intelligence
© World Scientific Publishing Company
Region-Based Multiscale Spatiotemporal Saliency for Video
Trung-Nghia Le
Department of Informatics, SOKENDAI (Graduate University for Advanced Studies), 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
Akihiro Sugimoto
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
Detecting salient objects from a video requires exploiting both spatial and temporal knowledge included in the video. We propose a novel region-based multiscale spatiotemporal saliency detection method for videos, where static features and dynamic features computed at the low and middle levels are combined. Our method utilizes such combined features spatially over each frame and, at the same time, temporally across frames using consistency between consecutive frames. Saliency cues in our method are analyzed through a multiscale segmentation model and fused across scale levels, allowing regions to be explored efficiently. An adaptive temporal window using motion information is also developed to combine saliency values of consecutive frames in order to keep temporal consistency across frames. Performance evaluation on several popular benchmark datasets validates that our method outperforms existing state-of-the-art methods.
Keywords: Spatiotemporal saliency; multiscale segmentation; low-level feature; middle-level feature; adaptive temporal window.
1. Introduction
Visual saliency, which reflects the sensitivity of human vision, aims at locating informative and interesting regions in a scene. It was originally developed to predict human eye fixations on images, and has recently been extended to detect salient objects. Computational methods developed for salient object detection are useful for high-level tasks in computer vision and computer graphics. For instance, these methods have been successfully applied in many areas such as object detection, scene classification, image and video compression, image editing and manipulation, and video re-targeting.

Pixel-based computational methods for saliency have been mainly developed for, and achieve good results on, static objects; they are thus popular for detecting salient objects in images. However, these approaches are at a disadvantage in the context of dynamic scenes in videos. Videos usually have lower quality than images due to lossy compression, which makes every pixel value change over time regardless of whether it belongs to a static object or a dynamic motion. Accordingly, pixel-based saliency detection methods can be misled by this pixel fluctuation. In contrast, region-based saliency detection methods are more effective on videos because they suffer less from fluctuation and can capture dynamics in videos better than pixel-based methods. Therefore, in this work, we focus on analyzing videos at the regional level through superpixel segmentation.

Fig. 1: Multiscale image analysis. From left to right: pixel-wise analysis and region-wise analysis using SLIC superpixels, followed by small-scale superpixels fully covering the baby bird and large-scale superpixels fully covering the mother bird.

Superpixels, which are used to detect regions in an image, can serve as basic materials for capturing salient objects at the regional level. Since objects in scenes generally contain salient patterns at various scales, superpixel-based regions of a pre-defined size cannot fully explore objects (c.f. Fig. 1). As a result, the generated region-based saliency can be misled by the complexity of patterns in natural images. This problem can be solved by gradually extending the size of superpixels through a multiscale segmentation approach (c.f. Fig. 1). Multiscale segmentation enables us to analyze saliency cues at multiple scale levels of structure, allowing us to deal with complex salient structures.

The majority of existing methods for saliency, on the other hand, are based on low-level features in scenes such as color, orientation, and intensity. These bottom-up cues localize objects that present distinct characteristics from their surroundings. However, they carry no explicit information about scene structure, context, or task-related factors. They therefore cannot effectively highlight objects in videos because they do not always reflect regions corresponding to moving parts. In contrast, middle-level features such as objectness and background prior are suitable for exploiting moving objects because they focus only on properties of distinct regions. Combining middle-level features with low-level features can therefore boost current methods and improve performance. In terms of saliency cues, our method exploits both low-level and middle-level features based not on pixels but on regions.

A video is usually composed of dynamic entities caused by egocentric movements or dynamics of the real world. In particular, in a dynamic scene the background always changes, and different parts corresponding to different elements or objects can move independently in different directions at different speeds. Saliency models should be able to fuse current static information with accumulated knowledge on dynamics from the past to deal with the dynamic nature of scenes, which includes two properties: dynamic background and the entities' independent motion. Several spatiotemporal saliency detection methods based on motion analysis have been proposed for videos. Some of them can capture scene regions that are important in a spatiotemporal manner. However, most existing methods do not fully exploit the nature of dynamics in a scene.
Temporal features presenting the motion dynamics of objects between consecutive frames are not utilized in the saliency detection process, either.

Fig. 2: Examples of our spatiotemporal saliency detection method. Top row: original images. Bottom row: the corresponding saliency maps produced by our method.

In order to effectively use knowledge on the dynamics of background and objects in a video, we propose a salient object detection method in which low-level features and middle-level features are fused. In this framework, static features and dynamic features computed at the low and middle levels are combined to utilize both the spatial features of each frame and the consistency between consecutive frames (c.f. Table 1). The features are exploited in a region-based multiscale saliency model, where saliency cues from multiple scale levels of structure are analyzed and integrated to take advantage of each level. Using region-based features and multiscale analysis, our method is able to deal with complex scale structures in dynamic scenes, so that salient objects are labeled more accurately. We also present a novel metric for motion information that estimates the number of referenced frames for each object in order to keep temporal consistency across frames. Our method overcomes the limitation of the existing method, which uses a fixed number of referenced frames and does not consider the motion of objects within a scene. Examples of saliency maps generated by our method are shown in Fig. 2.

Our key contributions are twofold:

• We propose a region-based multiscale framework, which explicitly integrates low-level features together with middle-level features. Our regional features are exploited to analyze saliency cues at multiple scale levels of structure. Using region-based features and multiscale analysis, our method is able to deal with complex scale structures in dynamic scenes, so that salient objects
are labeled more accurately. Although the proposed saliency model builds on the framework presented by Zhou et al., it significantly improves the performance of the original work.

• We introduce a novel metric, called the adaptive temporal window, that uses motion information to keep the temporal consistency of each entity between consecutive frames of a video. Our method also exploits the dynamic nature of the scene in terms of the independent motion of entities.

The rest of this paper is organized as follows. In Section 2, we briefly present and analyze the related work on saliency models for videos as well as region segmentation in videos. The proposed method is presented in Section 3. Experimental settings are described in Section 4, and experimental results are reported in Sections 5 and 6. Finally, Section 7 presents conclusions and ideas for future work. We remark that a part of this work has been reported in our earlier conference paper.

Table 1: Classification of the features used in our method.

                   low-level feature                  middle-level feature
static feature     color, intensity, orientation      center bias, objectness, background prior
dynamic feature    flow magnitude, flow orientation   movement
2. Related Work
In this section, we briefly review related work on saliency models for videos and region segmentation in videos.
2.1. Dynamics in saliency detection for videos
The spatial saliency models for still images can be applied frame-wise to videos to detect salient objects. Techniques such as the fast minimum barrier distance transform and the minimum spanning tree have been used to develop real-time salient object detection systems, which can be applied to videos. Other methods formulate the problem based on Boolean map theory, absorbing Markov chains, and a weighted sparse coding framework. Multiple saliency maps can be aggregated by a conditional random field or a cellular automata dynamic evolution model. However, this approach does not achieve high effectiveness because it cannot exploit the temporal knowledge of dynamic scenes in videos.

Dynamic cues have usually been employed for saliency detection to deal with dynamics in videos. Some video saliency models rely on center-surround differences between a local spatiotemporal cube and its neighboring cubes in space-time coordinates to find salient regions in dynamic image streams. L. Itti et al. developed a method which computes instantaneous low-level surprise at every location in complex video clips. Flicker has been added into saliency detection as a new cue to build a neurobiological model of visual attention for automatic realistic generation of eye and head movements in video scenes. In the method proposed by M. Mancas et al., only dynamic features such as speed and direction are used to quantify rare and abnormal motion.

Motion cues, considered as a low-level feature channel, have recently been employed in spatiotemporal frameworks together with spatial features such as illumination, color, and so on. Several spatiotemporal saliency detection methods based on motion analysis have been proposed for videos. Motion between a pair of frames, represented as optical flow, is used to compute the local discrimination of the flow in a spatial neighborhood. Motion features such as optical flow or SIFT flow are incorporated into conditional random fields to extend still-image models to detect salient objects from videos. In the spatiotemporal attention detection framework proposed by Y. Zhai et al., motion contrast between consecutive frames is estimated by applying the RANSAC algorithm to point correspondences in the scene. W. Wang et al. estimated salient regions in videos based on the gradient flow field, which consists of intra-frame boundaries and inter-frame motion, and energy optimization. In another framework proposed by W. Wang et al., a spatiotemporal saliency map computed from temporal motion boundaries and spatial edges is combined with an appearance model and a dynamic location model to segment salient video objects. C. Feichtenhofer et al. defined a space-time saliency model, which relies on two general observations regarding actions (motion contrast and motion variance), for capturing foreground action motion. Y. Luo et al. measured relationships within and between trajectories of superpixels over time to capture sudden or onset movements in a scene. H. Kim et al. incorporated a spatial transition matrix and a temporal restarting distribution, computed from motion distinctiveness, temporal consistency, and abrupt change, into the random walk with restart framework to detect spatiotemporal saliency.
However, most existing methods do not fully exploit the nature of dynamics in a scene. Temporal features presenting the motion dynamics of objects between consecutive frames are not utilized in the saliency detection process, either. Differently from existing methods, our spatiotemporal saliency detection method uses motion information to keep temporal consistency across frames. We propose an adaptive temporal sliding window that relates saliency values across frame sequences by exploiting the motion information of each entity in each frame.

2.2. Region segmentation in videos
Most video analysis methods are extended from image analysis methods. Although frame-by-frame processing is efficient and can achieve high performance in spatial respects, its temporal stability is limited. Therefore, many video segmentation methods have been developed to exploit the temporal knowledge in videos.

Superpixel-based algorithms are widely used as a pre-processing step on both still images and videos to generate supporting regions and to speed up further computations. Existing superpixel methods can be divided into two categories: one grows superpixels starting from an initial set, and the other is based on graphs.

For the first approach, one of the most efficient superpixel segmentation algorithms for images, called Simple Linear Iterative Clustering (SLIC), was introduced by R. Achanta et al. In this algorithm, k-means clustering is first performed to group pixels that exhibit similar appearance into superpixels, and then single-pixel superpixels are merged into larger superpixels via a single 4-connected region. Many segmentation methods have recently been proposed based on the SLIC superpixel. J. Chang et al. extended SLIC to propose a probabilistic model for temporally consistent superpixels in video sequences. The SEEDS superpixel method also shares with SLIC the idea of growing superpixels from an initial set. However, SEEDS directly exchanges pixels between superpixels by moving boundaries instead of growing superpixels by clustering pixels around their centers. Based on the SEEDS superpixel, M. Van den Bergh et al. proposed an online, real-time video segmentation algorithm that exploits temporal information in videos.

In addition, there are segmentation methods based on graphs. C. Xu et al. developed a framework for streaming video sequences based on hierarchical graph-based segmentation. Using a spatiotemporal extension of the GraphCut method, A. Papazoglou et al. introduced a fast segmentation method for unconstrained videos, which include rapidly moving background, arbitrary object motion and appearance, as well as non-rigid deformations and articulations.

In this work, we employ the temporal superpixel method proposed by J. Chang et al. in our framework, because it overcomes the limitations of other video segmentation methods and is widely employed in the computer vision literature.
3. Multiscale Spatiotemporal Saliency Detection Method
3.1. Overview
The goal of this work is to detect salient regions in videos by combining static features with dynamic features, where the features are computed from regions rather than pixels. Figure 3 illustrates the process of our multiscale spatiotemporal saliency detection method.

First of all, the temporal superpixel model is executed to segment a video into spatiotemporal regions at various scale levels. Motion information, as well as features for each frame, is extracted at each scale level. From these features, we build feature maps, including both low-level feature maps presenting contrasts between regions and middle-level feature maps presenting properties inside regions. These two kinds of feature maps are combined to generate spatial saliency entities for regions at each scale level. Temporal consistency is incorporated into the spatial saliency entities to form spatiotemporal saliency entities, using an Adaptive Temporal Window (ATW) for each region individually to smooth saliency values across frames. Finally, a spatiotemporal saliency map is generated for each frame by fusing its multiscale spatiotemporal saliency entities.

Fig. 3: Pipeline of the proposed spatiotemporal saliency detection method. Low-level features (color, intensity, orientation, flow orientation, flow magnitude) and middle-level features (location, objectness, background probability, movement) are extracted from the multiscale segmentation of each video frame, combined into spatial saliency entities, smoothed into spatiotemporal saliency entities, and fused into the multiscale spatiotemporal saliency map.
3.2. Multiscale video segmentation
To support the intuition that objects in a video generally contain salient patterns at various scales, and that an object at a coarser scale may be composed of multiple parts at a finer scale, the video is segmented at multiple scales. Multiscale segmentation enables us to analyze saliency cues at multiple scale levels of structure, allowing us to deal with complex salient structures (c.f. Fig. 1). In this work, we segment a video at three scale levels. We remark that each segmentation level has a different number of superpixels, which are defined as non-overlapping regions.

To segment a video, we employed the temporal superpixel method, which is based on multiple-frame superpixel segmentation and extends SLIC. Differently from SLIC, the temporal superpixel method utilizes a spatial intensity Gaussian Mixture Model (GMM) combined with a motion model, which serves as a prior for the next frame. Motion information is used to propagate superpixels over frames, reducing the number of superpixels newly generated in a single frame.

After the segmentation process, we obtain temporal superpixels at multiple scales. At each scale, superpixels are connected across frames, so we can track the motion of a superpixel over frames through its positions in each frame.
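To make this step concrete, the following sketch builds a three-level region hierarchy for a single frame with SLIC from scikit-image. It is a simplified, per-frame stand-in for the temporal superpixel method used in our pipeline, which additionally propagates superpixels across frames; the superpixel counts per level are illustrative assumptions rather than the settings used in our experiments.

```python
from skimage.segmentation import slic

def multiscale_segmentation(frame, n_segments_per_level=(600, 300, 150)):
    """Segment one RGB frame (H x W x 3) at several superpixel scales,
    coarser levels having fewer superpixels.

    Returns one integer label map per scale level. Plain SLIC is used
    here as a per-frame stand-in for the temporal superpixels of the
    paper, which additionally link regions across frames.
    """
    return [slic(frame, n_segments=n, compactness=10.0, start_label=0)
            for n in n_segments_per_level]
```

Each returned label map defines the non-overlapping regions $r_{i,l}$ used at scale level $l$ in the following subsections.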
3.3. Spatial saliency entity construction
3.3.1. Low-level feature map
Human vision reacts to image regions with discriminative features such as unique color, high contrast, different orientation, or complex texture. To estimate the attractiveness of regions in a video, a contrast metric is usually used to evaluate the sensitivity of elements in each frame. The contrast is usually based on low-level features, including static information such as color, intensity, or texture, and dynamic information such as the magnitude or orientation of motion. A region with high contrast against surrounding regions can attract human attention and is perceptually more important.

For the $i$-th region at the $l$-th scale of the segmentation model at a frame, denoted by $r_{i,l}$, we compute its normalized color histogram in the CIE Lab color space, denoted by $\chi^{lf_{col}}_{i,l}$, and its distribution of lightness $\chi^{lf_{lig}}_{i,l}$. We quantize the four color channels (L, A, B, and Hue) into 16 bins per channel to compute $\chi^{lf_{col}}_{i,l}$. We also uniformly quantize $\chi^{lf_{lig}}_{i,l}$ into 16 bins.

We also calculate the orientation statistic $\chi^{lf_{ori}}_{i,l}$ of the region $r_{i,l}$. We use the following 2-D Gabor function to model the image texture at every pixel $(x, y)$:

$$g_{\lambda,\varphi,\gamma,\sigma,\theta}(x, y) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\left(2\pi\frac{x'}{\lambda} + \varphi\right), \quad x' = x\cos\theta + y\sin\theta, \quad y' = -x\sin\theta + y\cos\theta, \tag{1}$$

where $\gamma$, $\lambda$, $\sigma$, $\varphi$, and $\theta$ are parameters, as follows. $\gamma$ is the spatial aspect ratio. $\lambda = 8$ denotes the wavelength of the cosine factor of the Gabor filter kernel and herewith the preferred wavelength of the Gabor function. $\sigma$ is the standard deviation of the Gaussian factor of the Gabor function, where the ratio $\sigma/\lambda$ determines the spatial frequency bandwidth. In this paper, we fix $\sigma = 0.56\lambda$, corresponding to a bandwidth of one octave at half-response, i.e., $bw = 1$ in $\sigma = \frac{1}{\pi}\sqrt{\frac{\ln 2}{2}} \cdot \frac{2^{bw}+1}{2^{bw}-1}\,\lambda$. $\varphi$ is a phase offset that determines the symmetry of the Gabor filter. We use quadrature pairs of two filter banks, including an odd filter with $\varphi = \pi/2$ and an even filter with $\varphi = 0$. The angle parameter $\theta$ specifies the orientation of the normal to the parallel stripes of the Gabor function. In this work, we use 8 orientations $\theta = k\pi/8$ with $k \in \{0, 1, \ldots, 7\}$. We then quantize $\chi^{lf_{ori}}_{i,l}$ into 16 bins.

Since the human visual system is more sensitive to moving objects than to still objects, dynamic features are also compared between regions at the same segmentation level. Pixel-wise optical flow is used to analyze motion between consecutive frames. The regional motion features of a region are obtained by computing the distribution of this flow information within the region. The motion distribution of region $r_{i,l}$ is encoded in two descriptors: $\chi^{lf_{fmag}}_{i,l}$ is a normalized distribution of flow magnitude, and $\chi^{lf_{fori}}_{i,l}$ is a normalized histogram of flow orientation. We uniformly quantize $\chi^{lf_{fmag}}_{i,l}$ and $\chi^{lf_{fori}}_{i,l}$ into 16 and 9 bins, respectively.

The low-level feature map of each region is the sum of its feature distances to the other neighboring regions at the same scale level of the segmentation model, with different weight factors:

$$S^{lf}_{i,l} = \sum_{lf} w_{lf} \sum_{j \neq i} |r_{j,l}| \, \omega(r_{i,l}, r_{j,l}) \left\| \chi^{lf}_{i,l} - \chi^{lf}_{j,l} \right\|, \tag{2}$$

where $\left\| \chi^{lf}_{i,l} - \chi^{lf}_{j,l} \right\|$ is the Chi-Square distance between two histograms, and $lf \in \{lf_{col}, lf_{lig}, lf_{ori}, lf_{fmag}, lf_{fori}\}$ denotes one of the five features with corresponding weight $w_{lf}$. $|r_{j,l}|$ denotes the contrast weight of region $r_{j,l}$, which is the ratio of its size to the frame size. Regions with more pixels contribute higher contrast weight factors than those containing a smaller number of pixels. $\omega(r_{i,l}, r_{j,l})$ controls the influence of the spatial distance between two regions $r_{i,l}$ and $r_{j,l}$:

$$\omega(r_{i,l}, r_{j,l}) = e^{-\frac{D(r_{i,l}, r_{j,l})^2}{\sigma^2_{sp\text{-}dst}}}, \tag{3}$$

where $D(r_{i,l}, r_{j,l})$ is the Euclidean distance between the region centers and $\sigma_{sp\text{-}dst}$ is a parameter. $S^{lf}_{i,l}$ is normalized to the range $[0, 1]$.
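As an illustration of Eqs. (2) and (3), the sketch below computes the low-level contrast of every region at one scale level from pre-computed regional histograms. The uniform feature weights and the value of sigma_sp are placeholder assumptions; only the structure of the computation follows the equations above.

```python
import numpy as np

def chi_square_distance(h1, h2, eps=1e-10):
    # Chi-square distance between two normalized histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def low_level_contrast(histograms, centers, sizes, sigma_sp=0.25):
    """Low-level feature map for all N regions at one scale, per Eqs. (2)-(3).

    histograms: dict of feature name -> (N, bins) array of normalized
                regional histograms (color, lightness, orientation,
                flow magnitude, flow orientation).
    centers:    (N, 2) region centers, normalized to [0, 1].
    sizes:      (N,) region areas as fractions of the frame area.
    The uniform feature weights and sigma_sp are placeholders, not the
    paper's tuned values.
    """
    n = len(sizes)
    weights = {name: 1.0 / len(histograms) for name in histograms}
    saliency = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = np.linalg.norm(centers[i] - centers[j])
            omega = np.exp(-d ** 2 / sigma_sp ** 2)        # Eq. (3)
            for name, hists in histograms.items():         # Eq. (2)
                saliency[i] += (weights[name] * sizes[j] * omega *
                                chi_square_distance(hists[i], hists[j]))
    rng = saliency.max() - saliency.min()                   # normalize to [0, 1]
    return (saliency - saliency.min()) / rng if rng > 0 else saliency
```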
3.3.2. Middle-level feature map

In addition to contrasts between regions, we also compute properties of each region based on middle-level features. We observe that human vision is biased toward specific spatial information in a video, such as the center of the frame, foreground objects, or background, as well as the movements of objects over time. Therefore, our middle-level features are based on center bias, objectness, background prior, and movement metrics.

Human eye-tracking studies show that human attention favors the center of natural scenes when watching videos, so pixels close to the screen center can be salient in many cases. Our center bias is defined as:

$$\chi^{mf_{cen}}_{i,l} = \frac{1}{|r_{i,l}|} \sum_{j \in r_{i,l}} e^{-\frac{D(p_j, \bar{p})^2}{\sigma^2_{cen}}}, \tag{4}$$

where $|r_{i,l}|$ denotes the number of pixels in region $r_{i,l}$, $D(p_j, \bar{p})$ is the Euclidean distance from each pixel $p_j$ in the region to the image center $\bar{p}$, and $\sigma_{cen}$ is a parameter.

Objectness quantifies how likely it is for an image window to contain an object of any class. Our objectness feature is defined as:

$$\chi^{mf_{obj}}_{i,l} = \frac{1}{|r_{i,l}|} \sum_{j \in r_{i,l}} o_j, \tag{5}$$

where $o_j$ is the pixel-wise objectness map in region $r_{i,l}$. The objectness map provides meaningful distributions over object locations, giving the probability that each pixel belongs to an object. Four object cues are utilized: multi-scale saliency $MS$ measuring the uniqueness characteristic of objects, color contrast $CC$ measuring appearance differences from the surroundings, and edge density $ED$ and superpixel straddling $SS$ measuring the closed-boundary characteristic of objects. These object cues are combined in a Bayesian framework:

$$o_j = p^{(j)}(obj \mid A) = \frac{p^{(j)}(A \mid obj)\, p^{(j)}(obj)}{p^{(j)}(A)} = \frac{p^{(j)}(obj) \prod_{cue \in A} p^{(j)}(cue \mid obj)}{\sum_{c \in \{obj, bg\}} p^{(j)}(c) \prod_{cue \in A} p^{(j)}(cue \mid c)}, \tag{6}$$

where $A = \{MS, CC, ED, SS\}$ denotes the object cues and $p^{(j)}(\cdot)$ is the probability at pixel $j$. The priors $p(obj)$ and $p(bg)$ and the individual cue likelihoods $p(cue \mid c)$, $c \in \{obj, bg\}$, are estimated from the training dataset. This posterior constitutes the objectness score of the overlapping regions.

The regional background prior is based on differences in the spatial layout of image regions. Object regions are much less connected to image boundaries than background ones. In contrast, a region corresponding to background tends to be heavily connected to the image boundary. In order to compute the background probability of each region, each segmented image is built as an undirected weighted graph by connecting all adjacent regions and assigning their weights $w_{edge}$ as the Euclidean distance between their average colors in the CIE Lab space. The background feature of region $r_{i,l}$ is written as:

$$\chi^{mf_{bgr}}_{i,l} = \exp\left(-\frac{BndCon(r_{i,l})^2}{2\sigma^2_{bgr}}\right), \tag{7}$$

where $BndCon(r_{i,l})$ is the boundary connectivity of region $r_{i,l}$. Similarly to prior work, we set the parameter $\sigma_{bgr} = 1$. The boundary connectivity of region $r_{i,l}$ is calculated as the ratio of its length along the boundary to the square root of its spanning area:

$$BndCon(r_{i,l}) = \frac{len_{bnd}(r_{i,l})}{\sqrt{SpanArea(r_{i,l})}}, \tag{8}$$

where $len_{bnd}(r_{i,l})$ is the length of region $r_{i,l}$ along the boundary, calculated as the sum of the Geodesic distances from it to the regions on the boundary at the same scale level. $SpanArea(r_{i,l})$ is the spanning area of region $r_{i,l}$, calculated as the sum of the Geodesic distances from it to all regions in the frame at the same scale level. The Geodesic distance between any two regions is defined as the accumulated edge weight along their shortest path on the graph:

$$d_{geo}(r_{i,l}, r_{j,l}) = \min_{r_{1,l} = r_{i,l},\, r_{2,l},\, \ldots,\, r_{n,l} = r_{j,l}} \sum_{k=1}^{n-1} w_{edge}(r_{k,l}, r_{k+1,l}). \tag{9}$$

Moreover, to encode the movement of objects, we capture any sudden speed change in the motion of regions. The movement of a region is calculated as its average motion magnitude computed from the optical flow:

$$\chi^{mf_{mov}}_{i,l} = \frac{1}{|r_{i,l}|} \sum_{j \in r_{i,l}} m_j, \tag{10}$$

where $m_j$ is the motion magnitude at pixel $j$ in region $r_{i,l}$.

The middle-level feature map of region $r_{i,l}$ is computed as the sum of its attribute values with different weight factors:

$$S^{mf}_{i,l} = \sum_{mf} w_{mf}\, \chi^{mf}_{i,l}, \tag{11}$$

where $mf \in \{mf_{cen}, mf_{obj}, mf_{bgr}, mf_{mov}\}$ denotes one of the four features with corresponding weight factor $w_{mf}$. Finally, $S^{mf}_{i,l}$ is normalized to the range $[0, 1]$.
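The background prior of Eqs. (7)-(9) can be sketched as follows, using an all-pairs shortest-path computation over the region adjacency graph. The extraction of adjacencies and boundary regions from the label map is assumed to be done elsewhere; the code follows the definitions stated above with $\sigma_{bgr} = 1$.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def background_prior(mean_lab, edges, boundary_ids, sigma_bgr=1.0):
    """Background prior of Eqs. (7)-(9) for all N regions at one scale.

    mean_lab:     (N, 3) mean CIE-Lab color per region.
    edges:        iterable of (i, j) pairs of adjacent regions.
    boundary_ids: indices of regions touching the image border.
    """
    n = len(mean_lab)
    rows, cols, vals = [], [], []
    for i, j in edges:
        w = np.linalg.norm(mean_lab[i] - mean_lab[j])   # w_edge: Lab color distance
        rows += [i, j]; cols += [j, i]; vals += [w, w]
    graph = csr_matrix((vals, (rows, cols)), shape=(n, n))

    d_geo = dijkstra(graph, directed=False)             # all-pairs geodesic distance, Eq. (9)
    len_bnd = d_geo[:, boundary_ids].sum(axis=1)        # distances to boundary regions
    span_area = d_geo.sum(axis=1)                       # distances to all regions
    bnd_con = len_bnd / np.sqrt(span_area + 1e-10)      # boundary connectivity, Eq. (8)
    return np.exp(-bnd_con ** 2 / (2.0 * sigma_bgr ** 2))  # Eq. (7)
```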
3.3.3. Feature map combination

Combining the low-level feature maps and the middle-level feature maps allows us to obtain initial spatial saliency entities for all segmentation levels separately by a weighted multiplicative integration (c.f. Fig. 4):

$$S_{i,l} = \left(S^{lf}_{i,l}\right)^{\alpha} \left(S^{mf}_{i,l}\right)^{1-\alpha}, \tag{12}$$

where the parameter $\alpha$ controls the trade-off between the low-level feature maps and the middle-level feature maps. To weight the low-level feature map and the middle-level feature map equally, we set $\alpha = 0.5$. Finally, the spatial saliency entities are linearly normalized to the fixed range $[0, 1]$ in order to guarantee that regions with value 1 are the maxima of saliency. We demonstrate the effects of using the low-level feature maps and the middle-level feature maps on saliency results in Section 6.2.
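A direct transcription of Eq. (12), together with the linear normalization described above, might look as follows; alpha = 0.5 matches the equal weighting we use.

```python
import numpy as np

def combine_feature_maps(s_lf, s_mf, alpha=0.5):
    """Weighted multiplicative integration of Eq. (12), followed by the
    linear normalization to [0, 1] described above."""
    s = (np.asarray(s_lf) ** alpha) * (np.asarray(s_mf) ** (1.0 - alpha))
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else s
```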
Fig. 4: Saliency entity construction.
3.4. Incorporating temporal consistency
In a video, it is sometimes hard to distinguish objects from background because every pixel value changes over time regardless of whether it belongs to an object or to the background. Moreover, motion analysis shows that different parts of objects move at various speeds and, furthermore, background motion also changes with different speed and direction (c.f. the flow information in Fig. 5). This causes fluctuation of object appearances between frames. To reduce this negative effect, each spatial saliency entity at the current frame is combined with those at neighboring frames, smoothing saliency values over time. After this operation, saliency values on contiguous frames become similar, which generates robust spatiotemporal saliency entities.

Fig. 5: Regional motion calculation, showing a source image, its flow information, the segmented image, and the per-region motion distributions.

We propose to adaptively use a sliding window in the temporal domain, the Adaptive Temporal Window (ATW), for each region at each frame to capture speed variation by exploiting the motion information in the region. A spatial saliency entity at each scale level at the current frame is combined with the spatial saliency entities at neighboring frames using Gaussian combination weights, where nearer frames have larger weights:

$$\tilde{S}^t_{i,l} = \frac{1}{\Psi} \sum_{t' = t - \Phi^t_{i,l}}^{t} e^{-\frac{D(t, t')^2}{\sigma^2_{tp\text{-}dst}}}\, S^{t'}_{i,l}, \tag{13}$$

where $\Psi$ is the normalization factor and $S^{t}_{i,l}$ measures the spatial saliency entity of region $r_{i,l}$ at frame $t$, with corresponding weight $e^{-D(t,t')^2 / \sigma^2_{tp\text{-}dst}}$. $D(t, t')$ denotes the time difference between two frames, and the parameter $\sigma_{tp\text{-}dst} = 10$ controls the influence of previous frames. $\Phi^t_{i,l}$ controls the number of frames participating in the operation, expressed as:

$$\Phi^t_{i,l} = M e^{-\frac{\mu^t_{i,l}}{\lambda \beta^t_{i,l}}}, \tag{14}$$

where $M = 10$ and $\lambda = 2$ are parameters. $\beta^t_{i,l} = \sigma^t_{i,l} / \mu^t_{i,l}$ is the coefficient of variation measuring the dispersion of the motion distribution of each region, where $\mu^t_{i,l}$ and $\sigma^t_{i,l}$ are the mean value and the standard deviation of the motion distribution of region $r_{i,l}$ at frame $t$. To calculate the regional motion distribution, we first use pixel-wise optical flow to compute the motion magnitude of each pixel in a frame, and then exploit the distribution of motion magnitude in each region (c.f. Fig. 5).
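The ATW computation can be sketched as follows. The form Phi = M * exp(-mu / (lambda * beta)) follows our reconstruction of Eq. (14) from the definition of beta given above, and the helper names are hypothetical.

```python
import numpy as np

def window_length(motion_mags, M=10, lam=2.0):
    """Adaptive window length of Eq. (14) for one region at one frame.

    motion_mags holds the optical-flow magnitudes of the region's pixels.
    Phi = M * exp(-mu / (lam * beta)) follows our reading of Eq. (14);
    beta = sigma / mu is the coefficient of variation of the motion
    distribution.
    """
    mu = float(np.mean(motion_mags))
    beta = float(np.std(motion_mags)) / (mu + 1e-10)
    return int(round(M * np.exp(-mu / (lam * beta + 1e-10))))

def smooth_saliency(history, sigma_tp=10.0):
    """Gaussian-weighted temporal smoothing of Eq. (13).

    history: saliency values of one region at frames t - Phi .. t,
    oldest first; nearer frames get larger weights.
    """
    history = np.asarray(history, dtype=float)
    offsets = np.arange(len(history) - 1, -1, -1)   # D(t, t') per entry
    w = np.exp(-offsets ** 2 / sigma_tp ** 2)
    return float(np.sum(w * history) / np.sum(w))
```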
3.5. Spatiotemporal saliency generation

A multiscale saliency map is usually generated by averaging the saliency maps over all segmentation levels. However, this kind of processing does not achieve good results because the saliency values at each scale level have their own advantages and disadvantages. Compared with naive averaging of all levels, the Multi-layer Cellular Automata (MCA) integration algorithm aggregates confident saliency values from these levels, yielding better performance. Therefore, we implement the MCA algorithm to take advantage of the multiple scale levels. The generated multiscale spatiotemporal saliency value $SM^t_p$ at pixel $p$ at frame $t$ is given by:

$$SM^t_p = MCA\left(\tilde{S}^t_{\Omega_l(p)}\right), \quad \text{with } l \in \{1, 2, \ldots, L\}, \tag{15}$$

where $\Omega_l(\cdot)$ is a function that maps a pixel to the region at scale level $l$ to which it belongs. We note that all operations are processed pixel-wise. $\tilde{S}^t_{\Omega_l(\cdot)}$ measures the multiscale saliency values at each pixel generated at the $l$-th scale of the segmentation model, which has $L$ scale levels, at frame $t$. $MCA(\cdot)$ is the Multi-layer Cellular Automata integration, which exploits the intrinsic relevance of similar regions through interactions with neighbors. The MCA algorithm is described in Table 2. Figure 6 depicts examples of the MCA algorithm.

Table 2: Multi-layer Cellular Automata integration algorithm.

Input: $S^t_l$ with $l \in \{1, \ldots, L\}$
Output: $SM^t$
▷ Saliency map binarization by the Otsu algorithm:
  $\gamma^t_l \leftarrow \log\left(\frac{\Theta_{otsu}(S^t_l)}{1 - \Theta_{otsu}(S^t_l)}\right)$
▷ Saliency refinement:
  for $k = 1 \to K - 1$ do
    $\Lambda\left(S^{t,k+1}_l\right) \leftarrow \Lambda\left(S^{t,k}_l\right) + \ln\left(\frac{\lambda}{1-\lambda}\right) \sum_{i=1, i \neq l}^{L} \mathrm{sign}\left(\Lambda\left(S^{t,k}_i\right) - \gamma^t_i\right)$, where $\Lambda\left(S^t_l\right) = \ln\left(\frac{S^t_l}{1 - S^t_l}\right)$
  end for
▷ Saliency map integration:
  $SM^t \leftarrow \frac{1}{L} \sum_{l=1}^{L} S^{t,K}_l$
Parameter settings: $K = 5$ and a fixed value of $\ln\left(\frac{\lambda}{1-\lambda}\right)$.

Fig. 6: Saliency integration using Multi-layer Cellular Automata.
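A minimal sketch of the MCA integration of Table 2 is given below. The constant log_lambda stands for ln(lambda / (1 - lambda)); its value here is a placeholder, as the paper's exact constant is not reproduced, and the Otsu threshold is recomputed from a 256-bin histogram.

```python
import numpy as np

def otsu_threshold(img, bins=256):
    """Otsu's threshold for a saliency map with values in [0, 1]."""
    hist, edges = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)
    m = np.cumsum(p * centers)
    w1, mu_t = 1.0 - w0, m[-1]
    valid = (w0 > 0) & (w1 > 0)
    score = np.zeros(bins)
    score[valid] = (mu_t * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    return float(np.clip(centers[np.argmax(score)], 1e-6, 1.0 - 1e-6))

def mca_integration(maps, K=5, log_lambda=0.5):
    """Multi-layer Cellular Automata fusion following Table 2.

    maps: (L, H, W) stack of per-level saliency maps in (0, 1).
    log_lambda plays the role of ln(lambda / (1 - lambda)); the value
    used here is a placeholder assumption.
    """
    eps = 1e-6
    s = np.clip(np.asarray(maps, dtype=float), eps, 1.0 - eps)
    L = s.shape[0]
    gammas = [np.log(t / (1.0 - t)) for t in (otsu_threshold(s[l]) for l in range(L))]

    for _ in range(K - 1):                       # synchronous refinement steps
        logit = np.log(s / (1.0 - s))            # Lambda(S) = ln(S / (1 - S))
        new_s = np.empty_like(s)
        for l in range(L):
            votes = sum(np.sign(logit[i] - gammas[i]) for i in range(L) if i != l)
            new_s[l] = 1.0 / (1.0 + np.exp(-(logit[l] + log_lambda * votes)))
        s = np.clip(new_s, eps, 1.0 - eps)
    return s.mean(axis=0)                        # final integration over levels
```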
4. Experimental settings
4.1. Datasets
We used four datasets for all experiments: the Weizmann human action dataset, the MCL2014 dataset, the SegTrack2 dataset, and the DAVIS dataset.

The Weizmann dataset contains 93 video sequences of nine people performing ten natural actions such as running, walking, jacking, and waving. This dataset was created for human action recognition and has a simple static background, so it is easy to distinguish objects from scenes. Each sequence in the dataset has a spatial resolution of 180 × 144 and consists of about 60 frames, with a ground-truth foreground mask for every frame.

The MCL2014 dataset contains 8 video sequences with various backgrounds such as streets, roads, and halls. In this dataset, multiple objects such as crowds move in different directions and at different speeds. Each sequence in the dataset has a spatial resolution of 480 × 270 and consists of around 800 frames. Binary ground-truth maps are manually annotated for every 8th frame. We remark that another version of the MCL dataset, called the MCL2015 dataset, was published with 9 video sequences. Because it offers double the frames per video, the MCL2014 dataset (800 frames per video) was employed instead of the MCL2015 dataset (400 frames per video).

The SegTrack2 dataset, which extends the SegTrack dataset, contains 14 challenging video sequences and was originally designed for video object segmentation. Half of the videos in this dataset have multiple salient objects. The dataset is designed to be challenging in that it exhibits background-foreground color similarity, fast motion, and complex shape deformation. Dynamics in this dataset are caused by moving cameras, which track objects moving from far to near, from the borders to the center, or vice versa. Each sequence in the dataset has a spatial resolution of 352 × 288 and consists of about 75 frames with binary ground-truth masks.

The DAVIS dataset consists of 50 high-quality video sequences, available at 854 × 480 spatial resolution and Full HD 1080p, with about 70 frames per video. Each video has one single salient object or two spatially connected objects, either with low contrast or overlapping the image boundary. It is a challenging dataset because of frequent occlusions, motion blur, and appearance changes. In this work, we used only the 854 × 480 resolution video sequences.

We note that these datasets have different characteristics: a simple background for the Weizmann dataset, multiple moving objects for the MCL2014 dataset, and diverse complex dynamic scenes for the SegTrack2 and DAVIS datasets. All the datasets contain manually annotated pixel-wise ground-truth.
4.2. Evaluation metrics
The Precision-Recall and F-measure metrics are used to evaluate the performance of object location detection at a binary threshold. The precision value corresponds to the ratio of correctly assigned salient pixels to all the pixels of the extracted regions. The recall value is defined as the percentage of detected salient pixels relative to the number of salient pixels in the ground-truth. Given a ground-truth $GT$ and the binarized map $BM$ of a saliency map, we have:

$$Precision = \frac{|BM \cap GT|}{|BM|} \quad \text{and} \quad Recall = \frac{|BM \cap GT|}{|GT|}. \tag{16}$$

The F-measure is the overall performance measure computed as the weighted harmonic mean of precision and recall:

$$F_\beta = \frac{(1 + \beta^2)\, Precision \times Recall}{\beta^2 \times Precision + Recall}. \tag{17}$$

Similarly to previous work, we chose $\beta^2 = 0.3$. $F_\beta$ reflects the overall prediction accuracy.

There are different ways of binarizing a saliency map to compute the F-measure. We used F-Adap, an adaptive threshold for generating a binary saliency map. As suggested in the literature, we employed an adaptive threshold for each image, determined as the sum of the mean value and the standard deviation of the given saliency map: $\theta = \mu + \sigma$, where $\mu$ and $\sigma$ are the mean value and the standard deviation of the given saliency map, respectively. We then computed the average F-measure scores over the frames. We also used F-Max, the maximum F-measure score over thresholds from 0 to 255. We note that for each threshold, we binarize the saliency map to compute precision and recall at each frame of a video and then take the average over the video. After that, the mean of the averages over the videos in a dataset is computed. The F-measure is computed from the final precision and recall.

The overlap-based evaluation measures mentioned above do not consider true negative saliency assignments, i.e., the pixels correctly marked as non-salient. For a more comprehensive comparison, we thus also used the Mean Absolute Error (MAE) to compute the average absolute per-pixel difference between a saliency map $SM$ and its corresponding ground-truth $GT$:

$$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left\| SM(x, y) - GT(x, y) \right\|, \tag{18}$$

where $W$ and $H$ are the width and the height of the maps, respectively. We note that the MAE is also computed from the mean average value over the dataset, in the same way as the F-measure.
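For concreteness, F-Adap and MAE for a single frame can be computed as in the following sketch; per-video and per-dataset averaging then proceeds as described above.

```python
import numpy as np

def f_adap(saliency, gt, beta2=0.3):
    """F-measure with the adaptive threshold theta = mu + sigma (F-Adap).

    saliency: float map in [0, 1]; gt: boolean ground-truth mask.
    """
    theta = saliency.mean() + saliency.std()
    bm = saliency >= theta
    tp = np.logical_and(bm, gt).sum()
    precision = tp / max(bm.sum(), 1)   # Eq. (16)
    recall = tp / max(gt.sum(), 1)
    denom = beta2 * precision + recall
    return (1.0 + beta2) * precision * recall / denom if denom > 0 else 0.0  # Eq. (17)

def mae(saliency, gt):
    """Mean Absolute Error of Eq. (18)."""
    return float(np.abs(saliency - gt.astype(float)).mean())
```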
5. Comparison with state-of-the-art methods
We compared the performance of our method with several state-of-the-art models, namely LC, LD, RWRV, SAG, SEG, STS, BL, BSCA, MBS, MC, MST, and WSC, which are classified in Table 3. We compared our method not only with video saliency models but also with recent saliency detection methods for still images, which show good performance on image datasets. We applied the methods developed for still images frame-wise to videos. We remark that we ran the original codes provided by the authors with the recommended parameter settings to obtain results.

Table 3: Compared state-of-the-art methods and their classification.

target    image                          video
method    BL, BSCA, MBS, MC, MST, WSC    LC, LD, RWRV, SAG, SEG, STS

Some examples for visual comparison of the methods are shown in Fig. 9 and Fig. 10, suggesting that our method produces the best results on the datasets. Our method can handle complex foreground and background with different details, giving accurate and uniform saliency assignment. Our method also substantially advances the state of the art on the datasets on all evaluation criteria.

5.1. Precision-Recall Curve
We binarized each saliency map into a binary mask using a binary threshold θ (varied from 0 to 255). For each θ, the binary mask is checked against the ground-truth to evaluate the accuracy of salient object detection and compute the Precision-Recall Curve (PRC) (c.f. Fig. 7). The PRC is used to evaluate the performance of object location detection because it captures the behavior of both precision and recall under varying thresholds. The PRC therefore provides a reliable comparison of how well various saliency maps highlight salient regions in images.

Figure 7 shows that the proposed method consistently produces saliency maps closer to the ground-truth than the others. Our method achieves the highest precision over most of the recall range on almost all datasets. We observe that our method is the best on the Weizmann, SegTrack2, and DAVIS datasets, while on the MCL2014 dataset our method is second to the best (lower than STS). The proposed method thus significantly outperforms the state-of-the-art methods across the public datasets. Especially on the most challenging datasets (the SegTrack2 and DAVIS datasets), the performance gains of our method over all the other methods are more noticeable. Our method is suitable for dealing with dynamic scenes and multiple moving objects in videos.

Fig. 7: Quantitative comparison with state-of-the-art methods, using Precision-Recall Curves at different fixed thresholds, on (a) the Weizmann dataset, (b) the MCL2014 dataset, (c) the SegTrack2 dataset, and (d) the DAVIS dataset. Our method is marked with a bold curve. Note that the state-of-the-art methods are divided into two groups only for clear presentation.

From Figure 7, at one end of the PRCs, at maximum recall, all salient pixels are retained as positives, i.e., considered to be foreground, so all the methods have the same precision and recall values. At the other end of the PRCs, our method is among the methods with the highest minimum recall values on the datasets, meaning that our method generates saliency maps containing more salient pixels with the maximum saliency value of 1. Therefore, our method has more practical advantages than other methods. For example, when the best binary threshold for extracting salient objects from a saliency map cannot be determined, an adaptive threshold of the saliency map can be used while the accuracy of object extraction is still ensured (c.f. Section 5.2).
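The per-frame computation behind these curves can be sketched as follows; averaging over frames and videos, as described above, yields the plotted PRCs.

```python
import numpy as np

def pr_curve(saliency_255, gt):
    """Per-frame precision and recall at every integer threshold 0..255.

    saliency_255: uint8 saliency map; gt: boolean ground-truth mask.
    The paper averages these values over the frames of each video and
    then over the videos of a dataset before plotting.
    """
    precisions, recalls = np.zeros(256), np.zeros(256)
    for theta in range(256):
        bm = saliency_255 >= theta
        tp = np.logical_and(bm, gt).sum()
        precisions[theta] = tp / max(bm.sum(), 1)
        recalls[theta] = tp / max(gt.sum(), 1)
    return precisions, recalls
```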
5.2. F-measure
We evaluated F-Adap and F-Max to compare the proposed method with the state-of-the-art methods (c.f. Table 4). Table 4 shows that our method achieves the best performance on all the datasets. Our method achieves the highest scores on the Weizmann dataset, the SegTrack2 dataset, and the DAVIS dataset on both metrics.

Table 4: Quantitative comparison with state-of-the-art methods on the four datasets (Weizmann, MCL2014, SegTrack2, and DAVIS), using F-measure (F-Adap and F-Max) (higher is better). The best and the second best results are shown in blue and green, respectively. Our method is marked in bold.

On the most challenging dataset, the DAVIS dataset, the proposed method outperforms the second best method by a large margin on all the metrics. Our method achieves 0.627 in F-Adap and 0.645 in F-Max, while the second best method (SAG) achieves 0.494 and 0.548, respectively. On the MCL2014 dataset, the proposed method achieves 0.644 in F-Adap while the other methods score lower than 0.5. Our method is second best in F-Max, slightly below the best method (STS) (0.702 vs. 0.728).

5.3. Mean Absolute Error
To further demonstrate the effectiveness of the proposed method, we compare our method with the state-of-the-art methods using the MAE metric. Table 5 shows the results of our method and the other methods on the four benchmark datasets. Our method outperforms the other methods on almost all datasets. The MAE of our method is lower than that of the others, which suggests that our method not only highlights the overall salient objects but also preserves details better. It can be seen from Table 5 that our method shows the lowest MAE on the Weizmann dataset, the MCL2014 dataset, and the DAVIS dataset. Our method achieves 0.009, 0.051, and 0.077 in MAE on these datasets, while the second best methods achieve 0.012 (MBS), 0.107 (STS), and 0.103 (SAG), respectively. Our method thus outperforms the second best methods by a large margin on these datasets. On the SegTrack2 dataset, our method is second best at 0.125, while the best method, SAG, achieves 0.106.
Table 5: Quantitative comparison with state-of-the-art methods on the four datasets, using Mean Absolute Error (MAE) (smaller is better). The best and the second best results are shown in blue and green, respectively. Our method is marked in bold.
6. Evaluation of the proposed method
In this section, we analyze and discuss each component of the proposed method to evaluate its actual contribution. For this evaluation, we use F-Adap, F-Max, and MAE.
6.1. Multiple level processing evaluation
To verify the effectiveness of our multiscale analysis, we performed experiments comparing saliency cues analyzed in the multiscale segmentation model, combining saliency maps at three levels (denoted by Multi-level), against single scales used separately (denoted by 1st Level, 2nd Level, and 3rd Level).

The results in Table 6 show that multiscale analysis outperforms saliency computation at a single scale level on all datasets. Processing at multiple scales yields a large improvement over single-layer saliency computation. Our method therefore benefits from utilizing information from multiple image layers.

Table 6: Multiple level processing evaluation, using F-measure scores (higher is better) and Mean Absolute Error (smaller is better). The best results are shown in blue.

Dataset     Method        F-Adap ⇑   F-Max ⇑   MAE ⇓
Weizmann    Multi-level   0.909      0.912     0.009
            1st Level     0.773      0.904     0.040
            2nd Level     0.748      0.902     0.039
            3rd Level     0.727      0.897     0.040
MCL2014     Multi-level   0.644      0.702     0.051
            1st Level     0.557      0.632     0.114
            2nd Level     0.565      0.688     0.108
            3rd Level     0.565      0.702     0.106
SegTrack2   Multi-level   0.510      0.677     0.125
            1st Level     0.463      0.667     0.175
            2nd Level     0.460      0.669     0.175
            3rd Level     0.465      0.667     0.175
DAVIS       Multi-level   0.627      0.645     0.077
            1st Level     0.554      0.640     0.142
            2nd Level     0.556      0.642     0.143
            3rd Level     0.550      0.638     0.146
Average     Multi-level   0.673      0.734     0.066
            1st Level     0.587      0.711     0.118
            2nd Level     0.582      0.725     0.116
            3rd Level     0.577      0.726     0.117
6.2. Evaluation of the combination of low-level and middle-level feature maps
In order to verify the effectiveness of combining low-level and middle-level features to generate a saliency map, we conducted experiments comparing our method (denoted by Feature combination) with the methods using a single kind of feature map (denoted by Low-level feature, with integration weight α = 1, and Middle-level feature, with α = 0). Table 7 illustrates the results, showing that combining the two kinds of feature maps significantly outperforms using either single feature map on all datasets. Utilizing multiple features therefore yields a large improvement over using only a single feature.

Table 7: Feature map combination evaluation, using F-measure scores (higher is better) and Mean Absolute Error (smaller is better). The best results are shown in blue.

Dataset     Method                 F-Adap ⇑   F-Max ⇑   MAE ⇓
Weizmann    Feature combination    0.909      0.912     0.009
            Low-level feature      0.412      0.876     0.153
            Middle-level feature   0.740      0.894     0.030
MCL2014     Feature combination    0.644      0.702     0.051
            Low-level feature      0.496      0.588     0.121
            Middle-level feature   0.440      0.611     0.146
SegTrack2   Feature combination    0.510      0.677     0.125
            Low-level feature      0.360      0.491     0.238
            Middle-level feature   0.359      0.549     0.197
DAVIS       Feature combination    0.627      0.645     0.077
            Low-level feature      0.526      0.560     0.119
            Middle-level feature   0.303      0.392     0.249
Average     Feature combination    0.673      0.734     0.066
            Low-level feature      0.449      0.629     0.158
            Middle-level feature   0.461      0.612     0.156

6.3. Temporal consistency evaluation
We evaluated the effectiveness of introducing the ATW by comparing our method (denoted by with Temporal consistency) with the method not incorporating temporal consistency (denoted by w/o Temporal consistency). Table 8 illustrates the results. We observe that introducing the ATW slightly improves the results on the Weizmann dataset and the DAVIS dataset, while it slightly degrades the results on the MCL2014 dataset and the SegTrack2 dataset. However, the averaged results still show that introducing the ATW improves the performance of the proposed method.

Table 8: Temporal consistency evaluation, using F-measure scores (higher is better) and Mean Absolute Error (smaller is better). The best results are shown in blue.

Dataset     Method                       F-Adap ⇑   F-Max ⇑   MAE ⇓
Weizmann    with Temporal consistency    0.909      0.912     0.009
            w/o Temporal consistency     0.899      0.907     0.010
MCL2014     with Temporal consistency    0.644      0.702     0.051
            w/o Temporal consistency     0.645      0.703     0.051
SegTrack2   with Temporal consistency    0.510      0.677     0.125
            w/o Temporal consistency     0.515      0.680     0.122
DAVIS       with Temporal consistency    0.627      0.645     0.077
            w/o Temporal consistency     0.624      0.643     0.079
Average     with Temporal consistency    0.673      0.734     0.066
            w/o Temporal consistency     0.671      0.733     0.066

The limitation of the ATW incorporation seems to depend on the accuracy of the superpixel segmentation. Ineffective propagation of superpixels across time causes saliency values to be combined between incorrect regions, which propagates errors through the temporal consistency incorporation process. In Fig. 8, we present some failure cases. In a video sequence with extremely difficult scenes containing small, low-contrast salient objects (see the examples in Fig. 8), temporal superpixel methods can neither effectively segment the scene nor correctly track regions over frames. This results in poor salient object detection. Firstly, spatial saliency entities fail to distinguish foreground regions from the background when the two are similar in appearance in almost all frames. Secondly, errors can be propagated when temporal consistency is combined from incorrect regions, resulting in spatiotemporal saliency maps that are worse than those from single-frame processing.

Fig. 8: Failure cases of our method. (a) Input frames, (b) ground-truth, (c) results of our method, (d) results without incorporating temporal consistency.
7. Conclusion
Differently from images, videos include dynamics, i.e., temporal position changes of entities in each frame and temporal changes of the background. Such dynamics should be carefully considered when detecting salient objects in videos. To capture the dynamics, regions corresponding to each entity in the video frame are more suitable than pixels. Motivated by this, we presented a region-based multiscale spatiotemporal saliency detection method for videos. In our method, we first segment each frame into regions and, for each region, utilize static features and dynamic features computed at the low and middle levels to obtain saliency cues. By changing the scale of segmentation, we explore saliency cues at multiple scales. To keep temporal consistency across consecutive frames, we introduce adaptive temporal windows computed from the motion of segmented regions. Fusing the saliency cues obtained at multiple scales using adaptive temporal windows allows us to obtain our spatiotemporal saliency map. Our intensive experiments using publicly available datasets demonstrate that our method outperforms the state of the art.

The proposed method can detect salient objects in dynamic scenes, but it still does not work well for complex dynamic backgrounds. This problem could be tackled by incorporating high-level knowledge to further improve the saliency maps. Our proposed method has the flexibility to utilize more features. Although the current method utilizes low-level and middle-level features, we plan to utilize high-level features such as human faces or semantic knowledge exploited from videos. We believe the saliency maps generated by our proposed method can be further used for efficient object detection and action recognition.

Fig. 9: Visual comparison of our method to the state-of-the-art methods. From top-left to bottom-right, the original image and ground-truth are followed by the outputs of our method, LC, LD, RWRV, SAG, SEG, STS, BL, BSCA, MBS, MC, MST, and WSC. Our method, surrounded by red rectangles, achieves the best results.

Fig. 10: Visual comparison of our method to the state-of-the-art methods. From top-left to bottom-right, the original image and ground-truth are followed by the outputs of our method, LC, LD, RWRV, SAG, SEG, STS, BL, BSCA, MBS, MC, MST, and WSC. Our method, surrounded by red rectangles, achieves the best results.

References
References
1. R. Achanta, S. Hemami, F. Estrada and S. Susstrunk, Frequency-tuned salient region detection, in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (June 2009) pp. 1597–1604.
2. R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua and S. Susstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, Pattern Analysis and Machine Intelligence, IEEE Transactions on (Nov 2012) 2274–2282.
3. B. Alexe, T. Deselaers and V. Ferrari, What is an object?, in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (June 2010) pp. 73–80.
4. M. Blank, L. Gorelick, E. Shechtman, M. Irani and R. Basri, Actions as space-time shapes, in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, Vol. 2 (Oct 2005) pp. 1395–1402.
5. A. Borji, M. M. Cheng, H. Jiang and J. Li, Salient object detection: A benchmark, IEEE Transactions on Image Processing (Dec 2015) 5706–5722.
6. A. Borji and L. Itti, Scene classification with a sparse set of salient regions, in Robotics and Automation (ICRA), 2011 IEEE International Conference on (May 2011) pp. 1902–1908.
7. A. Borji, D. Sihite and L. Itti, Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study, Image Processing, IEEE Transactions on (Jan 2013) 55–69.
8. J. Chang, D. Wei and J. Fisher, A video representation using temporal superpixels, in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (June 2013) pp. 2051–2058.
9. M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet and N. Crook, Efficient salient region detection with soft image abstraction, in Computer Vision (ICCV), 2013 IEEE International Conference on (Dec 2013) pp. 1529–1536.
10. J. G. Daugman, Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters, JOSA A (7) (1985) 1160–1169.
11. M. V. den Bergh, X. Boix, G. Roig, B. de Capitani and L. J. V. Gool, SEEDS: superpixels extracted via energy-driven sampling, in Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VII (Springer, 2012) pp. 13–26.
12. C. Feichtenhofer, A. Pinz and R. Wildes, Dynamically encoded actions based on spacetime saliency, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (June 2015) pp. 2755–2764.
13. C. Guo and L. Zhang, A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, Image Processing, IEEE Transactions on (Jan 2010) 185–198.
14. M. Guo, Y. Zhao, C. Zhang and Z. Chen, Fast object detection based on selective visual attention, Neurocomputing (November 2014) 184–197.
15. A. Hagiwara, A. Sugimoto and K. Kawamoto, Saliency-based image editing for guiding visual attention, in Proceedings of the 1st International Workshop on Pervasive Eye Tracking and Mobile Eye-based Interaction (ACM, New York, NY, USA, 2011) pp. 43–48.
16. A. P. Hillstrom and S. Yantis, Visual motion and attentional capture, Perception & Psychophysics (4) (1994) 399–411.
17. L. Itti and P. Baldi, A principled approach to detecting surprising events in video, in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1 (June 2005) pp. 631–637.
18. L. Itti, N. Dhavale and F. Pighin, Realistic avatar eye and head animation using a neurobiological model of visual attention, in B. Bosacchi, D. B. Fogel and J. C. Bezdek (eds.), Proc. SPIE 48th Annual International Symposium on Optical Science and Technology, Vol. 5200 (SPIE Press, Bellingham, WA, Aug 2003) pp. 64–78.
19. Y. Jia and M. Han, Category-independent object-level saliency detection, in Computer Vision (ICCV), 2013 IEEE International Conference on (Dec 2013) pp. 1761–1768.
20. B. Jiang, L. Zhang, H. Lu, C. Yang and M.-H. Yang, Saliency detection via absorbing Markov chain, in Proceedings of the 2013 IEEE International Conference on Computer Vision (IEEE Computer Society, Washington, DC, USA, 2013) pp. 1665–1672.
21. A. Kae, B. Marlin and E. Learned-Miller, The shape-time random field for semantic video labeling, in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (June 2014) pp. 272–279.
22. H. Kim, Y. Kim, J.-Y. Sim and C.-S. Kim, Spatiotemporal saliency detection for video sequences based on random walk with restart, Image Processing, IEEE Transactions on (Aug 2015) 2552–2564.
23. T.-N. Le and A. Sugimoto, Contrast based hierarchical spatial-temporal saliency for video, in Image and Video Technology - 7th Pacific-Rim Symposium, PSIVT 2015, Auckland, New Zealand, November 25-27, 2015, Revised Selected Papers, Lecture Notes in Computer Science Vol. 9431 (Springer International Publishing Switzerland, 2015) pp. 734–748.
24. S.-H. Lee, J.-H. Kim, K. P. Choi, J.-Y. Sim and C.-S. Kim, Video saliency detection based on spatiotemporal feature learning, in Image Processing (ICIP), 2014 IEEE International Conference on (Oct 2014) pp. 1120–1124.
25. Y. J. Lee, J. Kim and K. Grauman, Key-segments for video object segmentation, in Computer Vision (ICCV), 2011 IEEE International Conference on (Nov 2011) pp. 1995–2002.
26. F. Li, T. Kim, A. Humayun, D. Tsai and J. M. Rehg, Video segmentation by tracking many figure-ground segments, in Computer Vision (ICCV), 2013 IEEE International Conference on (Dec 2013) pp. 2192–2199.
27. N. Li, B. Sun and J. Yu, A weighted sparse coding framework for saliency detection, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (June 2015) pp. 5216–5223.
28. C. Liu, Beyond pixels: exploring new representations and applications for motion analysis (MIT, 2009).
29. T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang and H.-Y. Shum, Learning to detect a salient object, Pattern Analysis and Machine Intelligence, IEEE Transactions on (Feb 2011) 353–367.
30. T. Lu, Z. Yuan, Y. Huang, D. Wu and H. Yu, Video retargeting with nonlinear spatial-temporal saliency fusion, in Image Processing (ICIP), 2010 17th IEEE International Conference on (Sept 2010) pp. 1801–1804.
31. Y. Luo, L. Cheong and J. Cabibihan, Modeling the temporality of saliency, in Computer Vision - ACCV 2014 - 12th Asian Conference on Computer Vision, Singapore, Singapore, November 1-5, 2014, Revised Selected Papers, Part III (Springer, 2014) pp. 205–220.
32. M. Mancas, N. Riche, J. Leroy and B. Gosselin, Abnormal motion selection in crowds using bottom-up saliency, in Image Processing (ICIP), 2011 18th IEEE International Conference on (Sept 2011) pp. 229–232.
33. R. Margolin, A. Tal and L. Zelnik-Manor, What makes a patch distinct?, in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (June 2013) pp. 1139–1146.
34. S. Nataraju, V. Balasubramanian and S. Panchanathan, An integrated approach to visual attention modeling for saliency detection in videos, in L. Wang, G. Zhao, L. Cheng and M. Pietikäinen (eds.), Machine Learning for Vision-Based Motion Analysis: Theory and Techniques (Springer London, London, 2011) pp. 181–214.
35. N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics (Jan 1979) 62–66.
36. A. Papazoglou and V. Ferrari, Fast object segmentation in unconstrained video, in Computer Vision (ICCV), 2013 IEEE International Conference on (Dec 2013) pp. 1777–1784.
37. O. Pele and M. Werman, The quadratic-chi histogram distance family, in Proceedings of the 11th European Conference on Computer Vision: Part II (Springer-Verlag, Berlin, Heidelberg, 2010) pp. 749–762.
38. F. Perazzi, J. Pont-Tuset, B. McWilliams, L. V. Gool, M. Gross and A. Sorkine-Hornung, A benchmark dataset and evaluation methodology for video object segmentation, in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on (June 2016) pp. 724–732.
39. Y. Qin, H. Lu, Y. Xu and H. Wang, Saliency detection via cellular automata, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (June 2015) pp. 110–119.
40. E. Rahtu, J. Kannala, M. Salo and J. Heikkilä, Segmenting salient objects from images and videos, in Computer Vision - ECCV 2010 - 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part V (Springer, 2010) pp. 366–379.
41. G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez and A. M. Lopez, Vision-based offline-online perception paradigm for autonomous driving, in Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on (Jan 2015) pp. 231–238.
42. H. J. Seo and P. Milanfar, Static and space-time visual saliency detection by self-resemblance, Journal of Vision (12) (2009) p. 15.
43. N. Tong, H. Lu, X. Ruan and M.-H. Yang, Salient object detection via bootstrap learning, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (June 2015) pp. 1884–1892.
44. D. Tsai, M. Flagg, A. Nakazawa and J. M. Rehg, Motion coherent tracking using multi-label MRF optimization, International Journal of Computer Vision (2) (2012) 190–202.
45. P.-H. Tseng, R. Carmi, I. G. M. Cameron, D. P. Munoz and L. Itti, Quantifying center bias of observers in free viewing of dynamic natural scenes, Journal of Vision (7) (2009) p. 4.
46. W.-C. Tu, S. He, Q. Yang and S.-Y. Chien, Real-time salient object detection with a minimum spanning tree, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016).
47. M. Van den Bergh, G. Roig, X. Boix, S. Manen and L. Van Gool, Online video SEEDS for temporal window objectness, in Computer Vision (ICCV), 2013 IEEE International Conference on (Dec 2013) pp. 377–384.
48. W. Wang, J. Shen and F. Porikli, Saliency-aware geodesic video object segmentation, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (June 2015) pp. 3395–3402.
49. W. Wang, J. Shen and L. Shao, Consistent video saliency using local gradient flow optimization and global refinement, Image Processing, IEEE Transactions on (Nov 2015) 4185–4196.
50. C. Xu, C. Xiong and J. Corso, Streaming hierarchical video segmentation, in A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato and C. Schmid (eds.), Computer Vision - ECCV 2012, Lecture Notes in Computer Science Vol. 7577 (Springer Berlin Heidelberg, 2012) pp. 626–639.
51. Q. Yan, L. Xu, J. Shi and J. Jia, Hierarchical saliency detection, in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (June 2013) pp. 1155–1162.
52. Spatiotemporal salience via centre-surround comparison of visual spacetime orientations, in K. M. Lee, Y. Matsushita, J. M. Rehg and Z. Hu (eds.), Computer Vision - ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part III (Springer Berlin Heidelberg, Berlin, Heidelberg, 2013) pp. 533–546.
53. Y. Zhai and M. Shah, Visual attention detection in video sequences using spatiotemporal cues, in Proceedings of the 14th ACM International Conference on Multimedia (ACM, New York, NY, USA, 2006) pp. 815–824.
54. J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price and R. Mech, Minimum barrier salient object detection at 80 fps, in Computer Vision (ICCV), 2015 IEEE International Conference on (Dec 2015) pp. 1404–1412.
55. J. Zhang and S. Sclaroff, Saliency detection: A Boolean map approach, in Computer Vision (ICCV), 2013 IEEE International Conference on (IEEE, Dec 2013) pp. 153–160.
56. L. Zhang, Z. Gu and H. Li, SDSP: A novel saliency detection method by combining simple priors, in Image Processing (ICIP), 2013 IEEE International Conference on (Sept 2013) pp. 171–175.
57. F. Zhou, S. B. Kang and M. Cohen, Time-mapping using space-time saliency, in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (June 2014) pp. 3358–3365.
58. W. Zhu, S. Liang, Y. Wei and J. Sun, Saliency optimization from robust background detection, in