Edge Preserving and Multi-Scale Contextual Neural Network for Salient Object Detection
Xiang Wang, Huimin Ma, Xiaozhi Chen, and Shaodi You
Abstract—In this paper, we propose a novel edge preserving and multi-scale contextual neural network for salient object detection. The proposed framework aims to address two limitations of existing CNN based methods. First, region-based CNN methods lack sufficient context to accurately locate salient objects since they deal with each region independently. Second, pixel-based CNN methods suffer from blurry boundaries due to the presence of convolutional and pooling layers. Motivated by these observations, we first propose an end-to-end edge-preserved neural network based on the Fast R-CNN framework (named RegionNet) to efficiently generate saliency maps with sharp object boundaries. To further improve it, multi-scale spatial context is attached to RegionNet to model the relationship between regions and the global scene. Furthermore, our method can be generally applied to RGB-D saliency detection by depth refinement. The proposed framework achieves both clear detection boundaries and multi-scale contextual robustness simultaneously for the first time, and thus achieves an optimized performance. Experiments on six RGB and two RGB-D benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance.
Index Terms—Salient object detection, edge preserving, multi-scale context, RGB-D saliency detection, object mask.
I. INTRODUCTION

Salient object detection, which aims to detect the object that most attracts people's attention throughout an image, has been widely exploited in recent years. It has also been widely utilized in many computer vision tasks, such as semantic segmentation [1], object tracking [2], [3] and image classification [4], [5].

Traditional saliency methods aim to generate a heat map which gives each pixel a relative value of its level of saliency [6], [7], [8]. In recent years, the focus has moved to salient object detection, which generates pixel-wise binary labels for salient and non-salient objects [9], [10], [11]. Compared with the heat map, the binary label further benefits segmentation based applications such as semantic segmentation [1], and thus attracts more attention.

To achieve high accuracy for binary labeling, there are mainly two requirements: first, multi-scale contextual reliability; and second, a sharp boundary between salient and non-salient objects. The contextual reliability aims to model the relationship between regions and global scenes to determine which object is salient, and the clear boundary aims to separate the salient object and the background clearly and to highlight the whole object uniformly.
X. Wang, H. Ma and X. Chen are with Tsinghua National Laboratory for Information Science and Technology (TNList) and Department of Electronic Engineering, Tsinghua University, Beijing 100084, China. E-mail: [email protected], [email protected], [email protected]. S. You is with Data61, CSIRO and Australian National University, Australia. E-mail: [email protected]. The corresponding author is H. Ma.
Fig. 1. Saliency maps of an image with low contrast. Previous methods fail to distinguish the object from the confusing background. Our method detects the salient object with fine boundaries by taking advantage of regions and multi-scale context. (a) Image, (b) ground truth, (c) our proposed RexNet, (d, e) traditional methods: RC [10] and HDCT [17], (f, g) region-based CNN methods: LEGS [18] and MC [19], (h, i) pixel-based CNN methods: DISC [20] and DS [21].

Unfortunately, none of the existing methods achieve both requirements simultaneously. Traditional bottom-up methods mainly rely on priors or assumptions and hand-crafted features, for example, center-surround difference [6], [12], the uniqueness prior [13], [14] and the backgroundness prior [15], [16]. These methods cannot consider high-level semantic contextual relations and do not achieve satisfying accuracy.

Recently, the deep Convolutional Neural Network (CNN) has attracted wide attention for its superior performance. CNN based methods can be divided into region-based networks and pixel-based networks. Region-based methods extract features of each region (or patch) and then predict its saliency score. However, existing region-based methods lack context information to model the relationship between regions and the global scene. Because of this, they may produce false detections when the scene is complex or the object is composed of several different parts, which limits their performance (Fig. 1). On the other hand, existing pixel-based CNN methods lack the ability to produce a clear boundary between salient and non-salient objects, due to the presence of convolutional and pooling layers, and they only achieve partial contextual reliability. This limits the performance of pixel-based methods (Fig. 1).
Fig. 2. Architecture of the proposed RexNet. The network is composed of two components: RegionNet and ContextNet. The image is first segmented into regions using superpixels and edges. RegionNet predicts the saliency scores of regions and forms saliency maps $S_S$ and $S_E$. At the same time, ContextNet extracts multi-scale spatial context and fuses it to get saliency map $S_C$. These three saliency maps are fused to get the final saliency map.

In this paper, we propose a novel edge preserving and multi-scale contextual network for salient object detection. The proposed framework achieves both clear boundaries and multi-scale contextual robustness simultaneously for the first time. As illustrated in Fig. 2, the proposed structure, named RexNet, is mainly composed of two parts, the REgionNet and the conteXtNet.

First, the RegionNet is inspired by the Fast R-CNN framework [22]. Fast R-CNN was recently proposed for object detection and achieves superior performance because the convolutional features of the entire image are shared and the features of each patch (or RoI) are extracted via the RoI pooling layer. We extend Fast R-CNN to salient object detection by introducing mask-based RoI pooling and formulating salient object detection as a binary region classification task. The image is first segmented into regions which are used as input of RegionNet; the RegionNet then predicts the saliency score of each region end-to-end to form the saliency map of the entire image. Since the regions are segmented by edge-preserving methods, the saliency map generated by our network naturally has sharp boundaries.

Second, the ContextNet aims to provide strongly reliable multi-scale contextual information. Different from most previous works, which consider context by expanding the region window at a certain layer, in this paper we model context via multiple spatial scales. This is based on the observation that different layers of a CNN represent different levels of semantics [23], [24], so considering context of different levels may be more sufficient. We achieve this by taking advantage of dense image prediction. To all max-pooling layers of RegionNet, we attach multiple convolutional layers to predict saliency maps of different levels. Then all levels of saliency maps are fused with RegionNet to generate the final saliency map. Our method generates saliency maps with accurate location while keeping fine object boundaries.

Other than its effectiveness, our proposed framework is efficient, since we take advantage of regions by extending the efficient Fast R-CNN framework, which predicts the saliency scores of all regions with only one forward pass. We also extend our method to RGB-D saliency by applying depth refinement. Experiments on two RGB-D benchmark datasets demonstrate that the proposed RexNet outperforms other methods by a large margin.

The main contributions of this paper are three-fold. First, we propose RegionNet, which generates the saliency scores of regions efficiently and preserves object boundaries. Second, multi-scale spatial context is considered and attached to RegionNet to boost salient object detection performance. Third, we extend our method to RGB-D saliency datasets and use depth information to further refine the saliency maps.

The rest of this paper is organized as follows. Section II discusses related work. Section III and Section IV introduce the details of the proposed RegionNet and ContextNet, respectively. Section V describes the training details of the proposed network. Section VI introduces our extension to RGB-D salient object detection. Section VII presents the experimental results and comparison with state-of-the-art methods. The conclusion is drawn in Section VIII.

II. RELATED WORK
In this section, we introduce traditional salient object detection methods and the recent CNN based methods. In addition, we also introduce related works that integrate multi-scale context information and some topics related to salient object detection.
A. Traditional Methods
Salient object detection was first exploited by Itti et al. [6], and later attracted wide attention in the computer vision community. Traditional methods mostly rely on prior assumptions and most are unsupervised. Center-surround difference, which assumes that salient regions differ from their surrounding regions, is an important prior in early research. Itti et al. [6] first proposed center-surround difference at different scales to compute saliency. Liu et al. [12] propose the center-surround histogram, which defines saliency as the difference between a center region and its surrounding region. Li et al. [25] propose a cost-sensitive SVM to learn and discover salient regions that are different from their surrounding regions. These methods cannot provide sharp boundaries for salient regions because they are based on rectangular regions, which only generate coarse and blurry boundaries.

While center-surround difference considers local contrast, it does not take global contrast into consideration. Global contrast based methods were later proposed, e.g., Cheng et al. [10] and Yan et al. [26]. In [10], the image is first segmented into superpixels. Then the saliency value of each region is defined as its contrast with all other regions. The contrast is weighted by spatial distance so that nearby regions have greater impact. To deal with objects with complex structures, Yan et al. [26] propose a hierarchical model which analyzes saliency cues at multiple scales based on local contrast and then infers the final saliency values of regions by optimizing them in a tree model. Following these, many methods utilizing bottom-up priors have been proposed; readers are encouraged to find more details in a recent survey paper by Borji et al. [11].

B. CNN based Methods
Deep Convolutional Neural Networks (CNNs) have attracted a lot of attention for their outstanding performance in representing high-level semantics. Here, we mention a few representative works. These works can be divided into two categories according to their treatment of input images: region-based methods and pixel-based methods. Region-based methods formulate salient object detection as a region classification task, namely, extracting features of regions and predicting their saliency scores, while pixel-based methods directly predict the saliency map pixels-to-pixels with a CNN.
Region-based methods. Wang et al. [18] propose to detect salient objects by integrating both local estimation and global search with two trained networks, DNN-L and DNN-G. Zhao et al. [19] consider global and local context by putting a global and a closer-focused superpixel-centered window to extract features of each superpixel, respectively, and then combine them to predict the saliency score. Li et al. [27] propose multi-scale deep features by extracting features of each region at three scales and then fusing them to generate its saliency score. These works are region-based and focus on extracting features of regions and fusing larger scales of regions as context to predict the saliency score of each region. These fusions are mostly applied at only one layer and do not achieve an optimal performance. In addition, the networks extract the features of one region per forward pass, which is very time-consuming.
Pixel-based methods. Recently, CNNs have also been applied to pixels-to-pixels dense image prediction, such as semantic segmentation and saliency prediction. Long et al. [28] propose fully convolutional networks, which are trained end-to-end and pixels-to-pixels by introducing fully convolutional layers and a skip architecture. Chen et al. [20] propose a coarse-to-fine manner in which the first CNN generates a coarse map using the entire image as input and the second CNN then takes the coarse map and a local patch as input to generate a fine-grained saliency map. Li et al. [21] propose a multi-task model based on a fully convolutional network. In [21], the saliency detection task is trained in conjunction with an object segmentation task, which is helpful for perceiving objects; a Laplacian regularized regression is then applied to refine the saliency map. However, while end-to-end dense saliency prediction is efficient, the resulting saliency maps are coarse and have blurry object boundaries due to the presence of convolutional layers with large receptive fields and pooling layers.
C. RGB-D Salient Object Detection
RGB-D saliency is an emerging topic and most RGB-D saliency methods are based on fusing depth priors with RGB saliency priors. Ju et al. [29] propose an RGB-D saliency method based on anisotropic center-surround difference, in which saliency is measured as how much a region stands out from its surroundings. Peng et al. [30] propose depth saliency with multi-contextual contrast and then fuse it with appearance cues via a multi-stage model. Ren et al. [31] propose a normalized depth prior and a global-context surface orientation prior based on depth information and then fuse them with RGB region contrast priors. Depth contrast may cause false positives in background regions; to address this, Feng et al. [32] propose a local background enclosure feature based on the observation that salient objects tend to be locally in front of surrounding regions. To the best of our knowledge, existing RGB-D salient object detection methods all use hand-crafted features and their performance is not optimized.
D. Multi-scale Context
Multi-scale context has been proved to be useful for image segmentation tasks [33], [19], [27], [34]. Hariharan et al. [33] proposed hypercolumns for object segmentation and fine-grained localization, in which the hypercolumn at a given input location is defined as the outputs of all layers at that location. Features of different layers are combined and then used for classification. Zhao et al. [19] proposed a multi-context network which extracts features of a given superpixel at global and local scales, and then predicts the saliency value of that superpixel. Li et al. [27] proposed to extract features at three scales: the bounding box, a neighbourhood rectangle and the entire image. Liu et al. [34] proposed to use recurrent convolutional layers (RCLs) [35] iteratively to integrate context information and to refine saliency maps. At each step, the RCL takes the coarse saliency map from the last step and the feature map at a lower layer as input to predict a finer saliency map. In this way, context information is integrated iteratively and the final saliency map is more accurate than one predicted from global context alone.

The proposed ContextNet differs from those in two aspects. First, the ContextNet is a holistically-nested architecture [36] which predicts a saliency map at each branch and fuses them at the end. Second, we propose an Edge Loss as a supervision which makes the boundary of the segmentation result clearer.
E. Fixation Prediction and Semantic Segmentation
Fixation prediction [6], [7], [8], [37] aims to predict the regions people may pay attention to, and semantic segmentation [28], [38] aims to segment objects of certain classes in images.
Fig. 3. Pipeline of RegionNet. We extend the Fast R-CNN framework for saliency detection. (a) The image is first segmented into regions and a region mask which records the index of the regions is also generated. For each region, we use its external rectangle as the RoI. Note that, for clarity, we only show the RoIs of salient objects; the background regions are omitted. (b) All RoIs are fed into the convolutional network, and (c) at the RoI pooling layer, mask-based RoI pooling is applied to extract features inside the region mask. In this way, the features of irregular regions can be extracted. (d) With this mask-based pooling, the framework predicts the saliency scores of regions end-to-end, and (e) forms the saliency map of the entire image.

They are topics related to salient object detection, but they also have significant differences. Fixation prediction aims to predict the regions which most attract people's attention, while salient object detection focuses on segmenting the most attractive objects. Compared with semantic segmentation, saliency detection is a class-agnostic task: whether an object is salient or not largely depends on its surroundings, while semantic segmentation mainly focuses on segmenting objects of certain classes (e.g., the 20 classes in the PASCAL VOC dataset). So, compared with semantic segmentation, context information is more important for saliency detection, and this is the main motivation of our ContextNet.

III. REGIONNET: EDGE PRESERVING NEURAL NETWORK FOR SALIENT OBJECT DETECTION
A. Motivation
In this paper, we aim to propose a unified framework which can preserve object boundaries and take multi-scale spatial context into consideration. To preserve object boundaries, we propose an effective network, named RegionNet, which generates the saliency score of each region end-to-end (Fig. 3). Different from previous region-based methods [18], [19], [27], we extend the efficient Fast R-CNN framework [22] to salient object detection for the first time. On the other hand, previous works consider context mainly by expanding the window of a region or by using the entire image at a certain data or feature layer. In this paper, we consider context at multiple layers and use a dense saliency prediction framework to generate saliency maps that complement RegionNet. The architecture of the proposed framework is shown in Fig. 2.

In this section, we first introduce the idea of edge-preserving saliency detection based on a CNN. This idea previously appeared in our conference paper [39]. In Section IV, we extend this idea with the consideration of multi-scale spatial context.
B. RegionNet
In this section, we introduce RegionNet, which takes advantage of CNNs for high effectiveness and high efficiency. More importantly, it takes advantage of region segmentation, which enables clear detection boundaries and further improves the accuracy.
Network architecture. We extend the original Fast R-CNN [22] structure for end-to-end saliency detection. Fast R-CNN is an efficient and general framework in which the convolutional layers are shared over the entire image and the feature of each region is extracted by the RoI pooling layer. However, to the best of our knowledge, Fast R-CNN has only been used for object detection and classification, not for saliency; namely, the result of Fast R-CNN is a bounding box and not pixel-wise. In this paper, we modify it to enable edge preserving saliency by introducing mask-based RoI pooling. Different from previous region-based methods, which deal with each region of an image independently, our proposed Fast R-CNN structure processes all regions end-to-end with the entire image considered.
Detection pipeline. As illustrated in Fig. 3, first, given an image, we segment it into regions using superpixels and edges. For each region, we use its external rectangle as a proposal (or RoI) and use it as input of the Fast R-CNN framework, similar to object detection tasks. We also generate a region mask with the same size as the image to record the region index of each pixel, downsample it by 16 times, and feed it into the RoI pooling layer.

Then, at the RoI pooling stage, the features inside each RoI ($h \times w$) are pooled into a fixed scale $H \times W$ ($7 \times 7$ in our work), so each sub-window with scale $h/H \times w/W$ is converted to one value with max-pooling. To extract the feature of an irregular pixel-wise RoI region, we only pool features inside its region mask while leaving the others as 0. The process of the proposed mask-based RoI pooling is formulated as follows. For a region with index $i$ and a certain sub-window $SW_j$, we denote the region mask as $M$, the features before pooling as $F$, and the pooled feature at sub-window $SW_j$ as $P_j$. Then

$$P_j = \begin{cases} \max_{\{k \,|\, k \in SW_j,\, M_k = i\}} F_k, & i \in M(SW_j), \\ 0, & i \notin M(SW_j). \end{cases} \tag{1}$$

With this mask-based pooling, the features of each region are extracted and the edge information is also preserved.
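To make the pooling operation concrete, the following is a minimal NumPy sketch of the mask-based RoI pooling in Eq. (1). The function name and arguments are illustrative assumptions rather than the actual implementation, and it assumes the region mask has already been downsampled to the resolution of the shared convolutional feature map.

```python
import numpy as np

def mask_based_roi_pooling(features, region_mask, region_id, roi, out_size=(7, 7)):
    """Sketch of Eq. (1): max-pool features only over positions whose
    region-mask index equals region_id; other positions stay 0.

    features    : (C, H, W) shared conv feature map of the whole image
    region_mask : (H, W) integer map at feature resolution, giving the
                  region index of each position
    region_id   : index i of the region being pooled
    roi         : (x0, y0, x1, y1) external rectangle of the region
    out_size    : fixed output grid (H_out, W_out), 7 x 7 here
    """
    C = features.shape[0]
    x0, y0, x1, y1 = roi
    H_out, W_out = out_size
    pooled = np.zeros((C, H_out, W_out), dtype=features.dtype)

    # Divide the RoI into H_out x W_out sub-windows.
    ys = np.linspace(y0, y1 + 1, H_out + 1).astype(int)
    xs = np.linspace(x0, x1 + 1, W_out + 1).astype(int)
    for i in range(H_out):
        for j in range(W_out):
            sub_feat = features[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            in_region = region_mask[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] == region_id
            if in_region.any():
                # Max over positions belonging to the region (first case of
                # Eq. 1); sub-windows that miss the region keep the 0 value.
                masked = np.where(in_region[None, :, :], sub_feat, -np.inf)
                pooled[:, i, j] = masked.reshape(C, -1).max(axis=1)
    return pooled
```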
Last, by considering salient object detection as a binary classification problem, the network generates the saliency scores of regions to form the saliency map of the entire image end-to-end.

Fig. 4. (a) Images, (b) and (c) superpixel regions and edge regions, where pixels in each region are replaced with their mean color, (d) masks generated by MNC [40]. (i) Edges divide images into fewer regions than superpixels and thus preserve more of the compactness of objects, which is helpful for saliency prediction. (ii) The superpixel and edge regions achieve higher boundary accuracy than the masks generated by MNC [40]. Best viewed in color.

Note that, in our work, to segment the image into regions, besides superpixels, we also consider larger-scale regions which are segmented by edges (denoted as edge regions). This is based on the observation that when an object is segmented into dozens of superpixels, it is difficult to uniformly highlight the whole object. Edge regions preserve more of the compactness of objects and thus may be more effective. Recent advances in edge detection have achieved highly satisfactory performance, which makes it practical to use edge information to help better detect salient objects. In our work, we use the HED method of Xie et al. [36] to get object edges and then thin them using the method of Dollar et al. [41]. The superpixels are segmented using the SLIC algorithm [42].

Some examples of superpixel regions and edge regions are shown in Fig. 4. We can see that edges segment the image into fewer regions and better preserve the compactness of objects. For region-based methods, this helps improve the final performance, and since the number of regions is smaller, it also reduces the computation cost. Considering fault tolerance, namely that misclassification of edge regions may decrease performance largely, the superpixel regions are also used in our method. These two scales of regions are complementary, since superpixel regions can generate results with high resolution and edge regions can preserve more of the compactness of objects.

Note that a similar idea of mask-based RoI pooling has also been applied in MNC [40] for semantic segmentation. However, there are substantial differences. In [40], the masks are generated by a multi-task network and are continuous values in [0, 1]; the masked feature is the element-wise product of features and masks. In our work, the masks are obtained by segmenting images into regions with superpixels [42] and edges [36], they are binary, and the mask-based RoI pooling extracts the features inside the masks. The SLIC algorithm [42] for generating superpixels has a strong ability to adhere to image boundaries, so its boundary accuracy is quite good. The HED network [36] is designed for edge detection, so its boundary accuracy is much better than that of the multi-task network in [40]. Therefore, the masks of our method have higher boundary accuracy than those of MNC [40]. Some examples are shown in Fig. 4.

We denote the saliency maps generated by RegionNet with superpixel regions and edge regions as $S_S$ and $S_E$, respectively. We have shown in our previous conference paper [39] that $S_E$ outperforms most previous works, and that the combination of $S_E$ and $S_S$ achieves better performance, which shows the effectiveness of edge regions and their combination with superpixel regions. More detailed experimental results are shown in Section VII.
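As a concrete illustration of how the two kinds of regions could be produced, below is a hedged sketch using scikit-image. The function name, the use of skimage, and the simple connected-component treatment of the thresholded edge map are illustrative assumptions and not the exact procedure of the paper, which uses HED [36] with edge thinning [41] and SLIC [42].

```python
import numpy as np
from skimage.segmentation import slic
from skimage.measure import label

def build_regions(image, edge_prob, edge_thresh=0.5, n_superpixels=200):
    """Illustrative construction of the two complementary region maps.

    image     : (H, W, 3) RGB image, float in [0, 1]
    edge_prob : (H, W) edge probability map from an HED-style detector
    Returns an integer label map of superpixel regions and an integer
    label map of larger edge regions.
    """
    # Superpixel regions: SLIC adheres well to image boundaries.
    superpixel_regions = slic(image, n_segments=n_superpixels, compactness=10)

    # Edge regions: threshold the edge map and take connected components of
    # the non-edge pixels, giving fewer, larger regions that better preserve
    # the compactness of objects.
    non_edge = edge_prob < edge_thresh
    edge_regions = label(non_edge, connectivity=1)
    return superpixel_regions, edge_regions
```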
IV. CONTEXTNET: MULTI-SCALE CONTEXTUAL NEURAL NETWORK FOR SALIENT OBJECT DETECTION

In this section, we introduce the extension of the proposed method by utilizing multi-scale context. In Section IV-A, we first introduce the motivation for multi-scale context; after that, in Section IV-B, we introduce the architecture of the proposed multi-scale contextual network. In Section IV-C, we introduce the loss functions for supervising the ContextNet, and in Section IV-D, we introduce deep supervision to accelerate convergence and improve prediction performance.
A. Motivation
Salient object detection is a class-agnostic task; whether a region is salient or not largely depends on its surroundings, i.e., its context. While the RegionNet we propose can generate saliency maps with well preserved boundaries, it lacks context information. In addition, region-based CNN methods [18], [19], [27] suffer from some common drawbacks. First, region-based methods are based on binary region classification, and misclassification of regions causes large false detections. Second, solving a binary classification problem over a huge amount of images with a CNN causes the classification results to be extremely separated towards either 0 or 1, so the saliency map is not smooth. These two issues limit the precision at high recall. Fig. 5 shows some results of previous region-based CNN methods and our $S_S$ and $S_E$.

Fig. 5. Results of previous region-based methods and our $S_S$ and $S_E$ (image, GT, LEGS [18], MC [19], MDF [27], Ours ($S_S$), Ours ($S_E$)). We can see that misclassification of regions has a great impact on the final performance and that most regions are assigned values near either 0 or 1, with few intermediate values. These limit the precision at high recall when thresholding.

As explored in previous works [23], [24], features in different layers of a CNN have different properties and represent different levels of semantics, so fusing context from multiple layers may be more sufficient. Fig. 6 shows a visualization example of the features in the first four pooling layers of RegionNet. We can see that shallow layers mainly focus on bottom features, such as contours, while deep layers focus on more abstract high-level features. Based on these observations, in this paper we consider context information by introducing multi-scale contextual layers, named ContextNet, to address the issues mentioned above and to complement RegionNet.

Fig. 6. Visualization of features in different layers of RegionNet. For a test image, we forward it through our trained RegionNet, and then we extract the features of the first four pooling layers and show each channel of them. Different layers represent different levels of semantics. Best viewed in color.

B. Network Architecture
The architecture of our proposed network is shown in Fig. 2. Based on the RegionNet, we propose to use a multi-scale dense image prediction method to model the relationship between regions and the global scene at multiple levels. To every max pooling layer (except the RoI pooling layer) of RegionNet, we attach five convolutional layers (called a branch) to predict saliency maps of different levels. The first three layers of each branch are convolutional layers with 64, 64 and 128 channels, and dilated convolution [38] is also applied to increase the receptive field. The last two layers are fully convolutional layers with 128 and 1 channels.

Experimental results in [28] have demonstrated that a denser prediction map yields better performance. Following that, we propose to generate saliency maps at one eighth of the scale of the original input images, so we set the strides of the branches to 4, 2, 1 and 1, respectively. Note that the last branch is connected to the convolutional layer before the fourth max-pooling layer, i.e., conv4_3 in VGG16 [43], so the outputs of all branches have the same dimensions. The outputs of all branches are then fed into fully convolutional layers which learn the combination weights to generate saliency map $S_C$. The final saliency map $S$ is then obtained by fusing $S_S$, $S_E$ and $S_C$ via a fully convolutional layer:

$$S = Fusion(S_S, S_E, S_C). \tag{2}$$
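To make the branch design concrete, here is a hedged PyTorch sketch of one ContextNet branch and the two fusion steps. The kernel sizes, dilation rates, activation functions and class names are illustrative assumptions, since the text only specifies the channel counts (64, 64, 128, 128, 1) and the per-branch strides (4, 2, 1, 1).

```python
import torch
import torch.nn as nn

class ContextBranch(nn.Module):
    """One ContextNet branch (sketch): three dilated conv layers with 64, 64
    and 128 channels followed by two 1x1 'fully convolutional' layers (128
    and 1 channels). `stride` downsamples the branch so that all branches
    end up at 1/8 of the input resolution."""

    def __init__(self, in_channels, stride, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=stride, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1),  # per-branch saliency prediction
        )

    def forward(self, feat):
        return self.body(feat)


class ContextFusion(nn.Module):
    """Fuse the per-branch maps into S_C, then fuse S_S, S_E and S_C into the
    final map S (Eq. 2), each with a learned 1x1 convolution."""

    def __init__(self, num_branches=4):
        super().__init__()
        self.fuse_branches = nn.Conv2d(num_branches, 1, 1)
        self.fuse_all = nn.Conv2d(3, 1, 1)

    def forward(self, branch_maps, s_s, s_e):
        s_c = self.fuse_branches(torch.cat(branch_maps, dim=1))
        s = self.fuse_all(torch.cat([s_s, s_e, s_c], dim=1))
        return torch.sigmoid(s), torch.sigmoid(s_c)
```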
C. Loss

We assume that the training data $D = \{(X_i, T_i)\}_{i=1}^{N}$ consists of $N$ training images and their ground truth. Our goal is to train a convolutional network $f(X; \theta)$ to predict the saliency map of a given image. We define two kinds of loss for ContextNet to generate saliency maps with high accuracy and clear object boundaries.

The first loss is the commonly used Cross Entropy Loss $L_C$, which aims to make the output saliency map $f(X; \theta)$ consistent with the ground truth $T$:

$$L_C = -\frac{1}{N}\sum_{i=1}^{N}\left[T_i \log\big(f(X_i; \theta)\big) + (1 - T_i)\log\big(1 - f(X_i; \theta)\big)\right]. \tag{3}$$

The second loss is the Edge Loss $L_E$, which aims to preserve edges and make the saliency map more uniform. Since we have segmented the image into regions with edge-preserving methods, our assumption is that the saliency values within the same region should be similar, so that the final saliency map also preserves edges and is more uniform. We average the saliency map $f(X; \theta)$ within each region and denote the averaged map as $\bar{f}(X; \theta)$. The Edge Loss is defined as the $\ell_2$ distance between the saliency map $f(X; \theta)$ and the averaged map $\bar{f}(X; \theta)$:

$$L_E = \frac{1}{2N}\sum_{i=1}^{N}\big\|f(X_i; \theta) - \bar{f}(X_i; \theta)\big\|_2^2. \tag{4}$$
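The Edge Loss only needs the region segmentation, which is fixed per image. A minimal PyTorch sketch of Eq. (4) is given below; the function name and the scatter-based region averaging are illustrative choices, not the paper's implementation, and the predicted map and region label map are assumed to share the same spatial size.

```python
import torch

def edge_loss(pred, regions):
    """Eq. (4) sketch: penalise deviation of each pixel's saliency from the
    mean saliency of its (superpixel or edge) region.

    pred    : (N, H, W) predicted saliency maps in [0, 1]
    regions : (N, H, W) integer region index map for each image
    """
    n = pred.shape[0]
    loss = 0.0
    for b in range(n):
        p = pred[b].reshape(-1)
        r = regions[b].reshape(-1).long()
        num_regions = int(r.max().item()) + 1
        # Per-region sums and counts give the region-averaged map \bar{f}.
        sums = torch.zeros(num_regions, device=p.device).scatter_add_(0, r, p)
        counts = torch.zeros(num_regions, device=p.device).scatter_add_(0, r, torch.ones_like(p))
        region_mean = sums / counts.clamp(min=1.0)
        p_bar = region_mean[r]
        loss = loss + torch.sum((p - p_bar) ** 2)
    return loss / (2.0 * n)
```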
D. Deep Supervision

Fig. 7. Effect of deep supervision. From left to right: image and ground truth, results of the 4 branches, and the fusion of all branches. The first row shows results without deep supervision and the second row shows results with deep supervision. Without deep supervision, the first and second branches learn almost nothing in our network due to the heavy bias.

The proposed ContextNet comprises a fusion layer which fuses the outputs of the four branches. Supervision only at the last fusion layer may cause a heavy bias, namely, some layers may not be optimized adequately. To address this issue, we utilize deep supervision [44], [36]: the outputs of all branches as well as their fusion result are supervised. Fig. 7 shows a comparison of results with and without deep supervision. Without deep supervision, the network is heavily biased towards some maps, and in extreme cases some branches learn nothing, e.g., Fig. 7 (b) and (c). With deep supervision, each branch learns to predict a saliency map from features at a different scale, which accelerates convergence of the network and makes the final saliency map more precise.
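Under this reading, deep supervision amounts to summing the supervision loss over every branch output in addition to the fused output. A small sketch follows, reusing the edge_loss sketch from Section IV-C; the function name and the edge-loss weight `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def deeply_supervised_loss(branch_maps, fused_map, target, regions, lam=1.0):
    """Apply the supervision (cross entropy plus Edge Loss) to the fused map
    and to every branch map, so shallow branches receive a direct gradient
    signal. `lam` is an assumed weighting between the two losses."""
    def supervise(pred):
        return F.binary_cross_entropy(pred, target) + lam * edge_loss(pred, regions)

    total = supervise(fused_map)
    for pred in branch_maps:
        total = total + supervise(pred)
    return total
```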
V. NETWORK TRAINING

We implement our method using the Caffe framework [45]. The training process consists of two stages. In the first stage, we fine-tune the RegionNet using weights pre-trained on ImageNet [46]. In the second stage, we fix the weights of RegionNet and then optimize the weights of the ContextNet with SGD.

For the training of RegionNet, a region is considered salient/background if more than a threshold fraction of its pixels is located inside/outside the ground truth. The RegionNet formulates salient object detection as a binary classification problem and the loss function we use is the softmax loss. Following previous works, we fine-tune our RegionNet based on VGG16 [43], which is pre-trained on ImageNet [46].

For the training of ContextNet, deep supervision is applied to accelerate convergence and to improve the final performance.

VI. EXTENSION TO RGB-D SALIENT OBJECT DETECTION
Depth information is an important cue for salient object detection, especially for images with complex scenes. In this paper, we apply depth information to further improve the performance by extending our framework to RGB-D saliency datasets.

For RGB-D datasets, a simple idea is to train our network using RGB-D data directly. However, this suffers from two problems. First, our network is pre-trained on ImageNet [46], so it is unreasonable to fine-tune it using RGB-D data. Second, the number of images in existing RGB-D saliency datasets is too small to train a network well. So in this paper, we propose to first generate the saliency map using RGB data, and then refine it with depth information.

Fig. 8. The process of depth refinement. (a) Image, (b) depth, (c) saliency map of our method using RGB data ($S$), (d) with the position prior, the background noise is strongly suppressed ($S_1$), (e) with the local compactness prior, the background is further suppressed and the result map is more uniform ($S_2$), (f) ground truth.

We propose two efficient priors based on our observations: a position prior and a local compactness prior. For the position prior, in most scenes the salient object is located at the most front position. For the local compactness prior, regions with similar depth, appearance and position should share similar saliency values.

We denote the saliency map generated by our network as $S$. For the position prior, we directly multiply $S$ by the depth $D$ through a sigmoid function and denote the result as $S_1$:

$$S_1 = S \times \frac{1}{1 + \exp(-\sigma \times D)}, \tag{5}$$

in which the parameter $\sigma$ is set to 5 empirically in our work. Note that we have transformed the depth similarly to [29], in which the depth is rescaled to $[0, 1]$ and pixels with shorter distance are given larger intensity.

For the local compactness prior, the saliency value of each region, $S_1(i)$, is refined with its neighboring regions $\mathcal{N}(i)$, weighted by depth and appearance similarity:

$$S_2(i) = \sum_{j \in \mathcal{N}(i)} W(i, j)\, S_1(j), \tag{6}$$

with

$$W(i, j) = \exp\Big(-\frac{D(i, j)}{\sigma_{dep}}\Big)\exp\Big(-\frac{Col(i, j)}{\sigma_{col}}\Big), \tag{7}$$

in which $Col(i, j)$ denotes the Euclidean distance of RGB color. We set $\sigma_{dep}$ to a small value below 1 and $\sigma_{col} = 5$, both chosen empirically in our work. Fig. 8 shows some examples of the depth refinement.
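The two priors are cheap post-processing steps on top of the RGB saliency map. A hedged NumPy sketch of Eqs. (5)-(7) follows; the per-region mean depth/color computation, the weight normalisation and the default sigma values are illustrative assumptions.

```python
import numpy as np

def position_prior(sal, depth, sigma=5.0):
    """Eq. (5): re-weight the saliency map with a sigmoid of the depth.
    `depth` is rescaled to [0, 1] with closer pixels having larger values."""
    return sal * (1.0 / (1.0 + np.exp(-sigma * depth)))

def local_compactness(sal_regions, depth_regions, color_regions, neighbors,
                      sigma_dep=0.5, sigma_col=5.0):
    """Eqs. (6)-(7): refine each region's saliency with its neighbors,
    weighted by depth and RGB similarity. The sigma values are placeholders.

    sal_regions   : (R,) saliency value per region (e.g. from Eq. 5)
    depth_regions : (R,) mean depth per region
    color_regions : (R, 3) mean RGB color per region
    neighbors     : list of index arrays, neighbors[i] = regions adjacent to i
    """
    refined = np.zeros_like(sal_regions)
    for i in range(len(sal_regions)):
        nb = neighbors[i]
        d = np.abs(depth_regions[nb] - depth_regions[i])
        c = np.linalg.norm(color_regions[nb] - color_regions[i], axis=1)
        w = np.exp(-d / sigma_dep) * np.exp(-c / sigma_col)
        w = w / w.sum()  # normalise so the refined value stays in [0, 1] (assumption)
        refined[i] = np.sum(w * sal_regions[nb])
    return refined
```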
VII. EXPERIMENTS

To evaluate the effectiveness of each component and study the performance of the proposed method, we conduct experiments on six RGB and two RGB-D benchmark datasets and compare our method with state-of-the-art methods quantitatively and qualitatively.
A. Setup
We randomly sample 4000 images from the DUT-OMRON [47] dataset and 5000 images from the MSRA10K [12], [48], [10] dataset as the training set and then evaluate our method on the following six benchmark datasets: ECSSD [26], DUT-OMRON [47], JuddDB [49], SED2 [50], THUR15K [51] and Pascal-S [52]. Note that DUT-OMRON has 5168 images and we only evaluate on the remaining 1168 images that are not included in the training set. We also evaluate our method on two benchmark RGB-D saliency datasets: RGBD1000 [30] and NJU2000 [29]. All results are taken from the benchmark of Borji et al. [53] or generated using the authors' code.

We evaluate the performance using precision-recall (PR) curves, F-measure and mean absolute error (MAE). The saliency maps are first normalized to $[0, 1]$, and then the precision and recall are computed by binarizing them with 256 thresholds and comparing them with the ground truth. The PR curves are computed by averaging them over each dataset. The F-measure considers both precision and recall and is computed as

$$F_\beta = \frac{(1 + \beta^2)\, Precision \times Recall}{\beta^2\, Precision + Recall}, \tag{8}$$

where we set $\beta^2 = 0.3$ as in most previous works [48], [10] to emphasize the precision. The final F-measure is the maximal $F_\beta$ computed over the 256 precision-recall pairs of the PR curve [53]. The MAE directly measures the mean absolute difference between the saliency map and the ground truth:

$$MAE = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\big|S(x, y) - GT(x, y)\big|. \tag{9}$$
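For reference, a small NumPy sketch of the three criteria (PR points over 256 thresholds, $F_\beta$ with $\beta^2 = 0.3$, and MAE) as described above; this is an illustrative re-implementation, not the benchmark code of [53].

```python
import numpy as np

def pr_points(sal, gt, num_thresholds=256):
    """Precision/recall pairs obtained by binarizing a normalized saliency
    map at `num_thresholds` thresholds and comparing with the binary GT."""
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = sal >= t
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / max(pred.sum(), 1))
        recalls.append(tp / max(gt.sum(), 1))
    return np.array(precisions), np.array(recalls)

def max_f_measure(precisions, recalls, beta2=0.3):
    """Maximal F_beta (Eq. 8) over the PR pairs, with beta^2 = 0.3."""
    f = (1 + beta2) * precisions * recalls / np.maximum(beta2 * precisions + recalls, 1e-8)
    return f.max()

def mae(sal, gt):
    """Mean absolute error (Eq. 9) between saliency map and ground truth."""
    return np.abs(sal - gt.astype(np.float64)).mean()
```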
B. Comparison with State-of-the-art Methods

We compare our method with state-of-the-art methods, including the traditional methods LC [9], RC [10], SF [54], FT [48], GS [15], DRFI [55], MR [47], HDCT [17], ST [56], RBD [16], LPS [57] and MB+ [58], and the CNN based methods MDF [27], DISC [20], MC [19], LEGS [18], DS [21], DHSNet [34] and our preliminary conference method FL [39]. For the CNN based methods, we also list the training data they used in Table I. MDF [27] uses less training data, DS [21] uses much more training data, and for the other methods, we use comparable training data.

TABLE I
Training data of state-of-the-art methods.

Method        Training Data
MDF [27]      2,500 images from MSRA-5000
DISC [20]     9,000 images from MSRA10K
MC [19]       8,000 images from MSRA10K
LEGS [18]     3,000 images from the MSRA-5000 dataset and 340 images from the Pascal-S dataset; both horizontal reflection and rescaling (5%) are applied
DS [21]       leave-one-out strategy, using the other 7 datasets for training
DHSNet [34]   6,000 from MSRA10K and 3,500 from DUT-OMRON
Ours          4,000 from DUT-OMRON and 5,000 from MSRA10K

Fig. 9 shows the PR curves, F-measure and MAE on the six benchmark datasets. We can see that our method outperforms the other methods, including our preliminary conference method, by a large margin. Compared with the state-of-the-art multi-scale method DHSNet [34], we achieve comparable performance: for PR curves, our method outperforms DHSNet on all datasets on average; for F-measure, our method outperforms DHSNet on the JuddDB, THUR15K and SED2 datasets, but falls behind on the ECSSD and Pascal-S datasets; for MAE, we are inferior to DHSNet on average.

Note that DS [21] is a multi-task framework which detects salient objects and object boundaries simultaneously; our method outperforms DS [21] on all six datasets, especially on datasets with complex scenes, such as DUT-OMRON, JuddDB and Pascal-S, which shows that our method takes better advantage of edges. Note also that our network is trained on parts of the DUT-OMRON and MSRA10K datasets, and we apply the trained network to the other five datasets without fine-tuning; the results still outperform the others by a large margin, which shows that our method has strong generalization ability. Fig. 10 shows the qualitative comparison with state-of-the-art methods; we can see that our method preserves edges well and suppresses most background noise.

Fig. 9. Comparison with state-of-the-art methods on six benchmark datasets (JuddDB, DUT-OMRON, THUR15K, SED2, ECSSD and Pascal-S). For each dataset, the first row shows the PR curves and the second row shows the F-measure and MAE. The numbers in the PR curves denote the AUC. Best viewed in color.

Fig. 10. Qualitative comparison with state-of-the-art methods. We can see that our method locates salient objects more accurately and preserves object boundaries better. Background noise is strongly suppressed and the objects are highlighted uniformly.

Fig. 11. Comparison with state-of-the-art methods on two benchmark RGB-D saliency datasets (RGBD1000 and NJU2000). Best viewed in color.

C. Evaluation on RGB-D Saliency Datasets
We compare our method with the state-of-the-art RGB-D saliency methods ACSD [29], GP [31], LMH [30] and LBE [32]. Fig. 11 shows the comparison of PR curves. Our method significantly outperforms the other methods, especially in the region of high recall. The main reason is that our method can not only locate the salient object accurately but also preserve edges, so the saliency maps of our method have both high precision and high recall. Fig. 12 also shows the qualitative comparison with state-of-the-art RGB-D methods.

Fig. 12. Qualitative comparison with state-of-the-art methods on RGB-D datasets ((a) Image, (b) Depth, (c) ACSD, (d) GP, (e) LMH, (f) LBE, (g) RexNet [Ours], (h) GT). Our method can not only locate the salient object accurately, but also preserve edges, thus highlighting the whole object uniformly and suppressing background noise.
D. Ablation Studies
In this subsection, we conduct experiments to verify the effectiveness of each component of our method.
Network Components. We first evaluate the components of the proposed network by outputting the intermediate results of our network and analyzing their performance. Table II shows the comparison of all components, $S_S$, $S_E$, $S_C$ and the final saliency map $S$, on the six benchmark datasets. To better demonstrate the comparison with numerical results, we use the Area Under Curve (AUC), which measures the area under the PR curve, to represent the PR curve criterion. We can see that the final result $S$ outperforms all individual components, which shows that the components are complementary and our method is effective.

Branches of ContextNet. We evaluate the effectiveness of the branches of ContextNet. Table III shows the results of each branch and the fused result on the six benchmark datasets. We can see that, in general, the branches of deeper layers achieve better performance, and the final fused result is the best, which demonstrates that our method makes full use of the features at each branch.
Edge Loss. We evaluate the effectiveness of the Edge Loss by comparing with a network trained without the Edge Loss. Table IV shows the results of ContextNet on the six benchmark datasets. With the Edge Loss, the performance is better, since the Edge Loss preserves edges better and the saliency maps of ContextNet are more uniform.
Comparison with fusing features. The proposed ContextNet fuses the saliency maps of each branch to get the final result. To evaluate its effectiveness, we compare with a variant which fuses features to predict the saliency map, i.e., we concatenate the features of each branch to predict the saliency map. Table V shows the results of ContextNet with fused features and with fused maps. We can see that our method outperforms the variant which fuses features. This benefit comes from the deep supervision of each branch, which makes full use of the features at different levels.
TABLE II
Evaluation of all components ($S_S$, $S_E$, $S_C$ and the final result $S$) on the six benchmark datasets (JuddDB, DUT-OMRON, THUR15K, SED2, ECSSD, Pascal-S) with F-measure and AUC. The final result $S$ always performs better than all components, which shows that all the components are complementary and our method is effective. (The numerical values of this table could not be recovered.)

TABLE III
Result of each branch and their fusion in ContextNet.

           JuddDB        DUT-OMRON     THUR15K       SED2          ECSSD         Pascal-S
           F_β    AUC    F_β    AUC    F_β    AUC    F_β    AUC    F_β    AUC    F_β    AUC
Branch 1   0.402  0.366  0.529  0.510  0.533  0.510  0.749  0.780  0.639  0.643  0.599  0.596
Branch 2   0.416  0.381  0.525  0.507  0.557  0.540  0.691  0.728  0.692  0.719  0.622  0.622
Branch 3   0.447  0.423  0.564  0.563  0.600  0.601  0.705  0.737  0.751  0.801  0.678  0.713
Branch 4   0.490  0.457  0.692  0.710  0.686  0.695  0.802  0.854  0.836  0.891  0.756  0.798
Fusion

TABLE IV
Results of ContextNet with and without Edge Loss. With the Edge Loss, the performance is better.

                JuddDB        DUT-OMRON     THUR15K       SED2          ECSSD         Pascal-S
                F_β    AUC    F_β    AUC    F_β    AUC    F_β    AUC    F_β    AUC    F_β    AUC
w/o Edge Loss   0.524  0.494  0.750  0.744  0.715  0.703  0.873  0.865  0.865  0.903  0.789  0.822
w/ Edge Loss
Fig. 13. Evaluation of the effectiveness of depth refinement on RGBD1000 and NJU2000. Our depth refinement improves the performance mainly in the region with high recall, which is essentially important for the final performance. Best viewed in color.
Depth Refinement. For the RGB-D saliency datasets, we evaluate the effectiveness of the depth refinement. We show the comparison of PR curves with and without depth refinement in Fig. 13. The experimental results show that the depth refinement improves the performance significantly, especially in the region with high precision and high recall.
Speed. We compare the speed with other region-based CNN methods. Our method is much faster since we deal with regions in the end-to-end Fast R-CNN framework, while other region-based CNN methods forward the network once for each region. Table VI shows the comparison of performance and running time. The experiment is conducted on the ECSSD dataset [26], which contains 1000 test images; we test on this dataset with a single NVIDIA GeForce GTX TITAN GPU and report the average time per image. We compare with MC [19] and LEGS [18] using the authors' public code. Our method takes 0.75 s for each image, including 0.40 s for segmenting the image into regions using superpixels and edges and only 0.35 s for network forwarding. Our method takes less time while achieving better performance.

Fig. 14. Some failure cases of our method. These images contain extremely low-contrast scenes, which makes it difficult to segment them into correct regions, thus influencing the final results. (a, b) Both superpixel and edge segmentation fail, and the result is bad. (c, d) The boundary between object and background is a bit clearer, so the result is much better than in (a) and (b).
E. Failure Cases
Our proposed framework achieves state-of-the-art performance. However, as the RegionNet is based on the segmentation of images, when an image has extremely low contrast and the boundary between object and background is blurry, the segmentation may fail and thus influence the final performance. Fig. 14 shows some failure examples. These images are all low-contrast scenes; when both superpixel and edge segmentation fail, the performance decreases considerably. Note that in Fig. 14 (c) and (d), though the scene is low-contrast, the boundary between object and background is a bit clearer, so the result is much better than in Fig. 14 (a) and (b).
TABLE V
Comparison with fusing features. Our proposed fusing maps method outperforms the method which fuses features.

                  JuddDB        DUT-OMRON     THUR15K       SED2          ECSSD         Pascal-S
                  F_β    AUC    F_β    AUC    F_β    AUC    F_β    AUC    F_β    AUC    F_β    AUC
Fusing Features   0.520  0.486  0.734  0.724  0.704  0.686  0.873  0.871  0.855  0.887  0.776  0.805
Fusing Maps

TABLE VI
Performance and speed comparison with other region-based CNN methods. Our method takes 0.40 s for segmenting the image into regions, and only 0.35 s for network forwarding. Our method takes less time while achieving better performance.

               F_β     AUC     Time (s)
RexNet [Ours]  0.893   0.937   0.40 + 0.35
MC [19]        0.822   0.852   1.63
LEGS [18]      0.827   0.855   2.27
VIII. CONCLUSION

In this paper, we propose RexNet, which generates saliency maps end-to-end and with sharp object boundaries. In the proposed framework, the image is first segmented into two scales of complementary regions: superpixel regions and edge regions. The network then generates the saliency scores of regions end-to-end, and context at multiple layers is considered and fused with the region saliency scores. The proposed RexNet achieves both clear detection boundaries and multi-scale contextual robustness simultaneously for the first time, and thus achieves an optimized performance. We also extend the proposed framework to RGB-D saliency detection by depth refinement. Experiments on benchmark RGB and RGB-D datasets demonstrate that the proposed method achieves state-of-the-art performance.

ACKNOWLEDGEMENT

This work was supported by the National Natural Science Foundation of China (No. 61171113) and the National Key Basic Research Program of China (No. 2016YFB0100900).
REFERENCES

[1] Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, Y. Zhao, and S. Yan, "STC: A simple to complex framework for weakly-supervised semantic segmentation," arXiv preprint arXiv:1509.03150, 2015.
[2] V. Mahadevan and N. Vasconcelos, "Saliency-based discriminant tracking," in CVPR, 2009, pp. 1007-1013.
[3] S. Hong, T. You, S. Kwak, and B. Han, "Online tracking by learning discriminative saliency map with convolutional neural network," in ICML, 2015, pp. 597-606.
[4] B. Lei, E.-L. Tan, S. Chen, D. Ni, and T. Wang, "Saliency-driven image classification method based on histogram mining and image score," Pattern Recognition, vol. 48, no. 8, pp. 2567-2580, 2015.
[5] B. Li, W. Xiong, O. Wu, W. Hu, S. Maybank, and S. Yan, "Horror image recognition based on context-aware multi-instance learning," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5193-5205, 2015.
[6] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE TPAMI, 1998.
[7] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell, "SUN: A Bayesian framework for saliency using natural statistics," Journal of Vision, vol. 8, no. 7, pp. 32-32, 2008.
[8] N. Murray, M. Vanrell, X. Otazu, and C. A. Parraga, "Saliency estimation using a non-parametric low-level vision model," in CVPR, 2011, pp. 433-440.
[9] Y. Zhai and M. Shah, "Visual attention detection in video sequences using spatiotemporal cues," in ACM MM, 2006.
[10] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, "Global contrast based salient region detection," in CVPR, 2011.
[11] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, "Salient object detection: A survey," arXiv preprint arXiv:1411.5878, 2014.
[12] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum, "Learning to detect a salient object," in CVPR, 2007.
[13] K. Shi, K. Wang, J. Lu, and L. Lin, "PISA: Pixelwise image saliency by aggregating complementary appearance contrast measures with spatial priors," in CVPR, 2013.
[14] P. Jiang, H. Ling, J. Yu, and J. Peng, "Salient region detection by UFO: Uniqueness, focusness and objectness," in ICCV, 2013.
[15] Y. Wei, F. Wen, W. Zhu, and J. Sun, "Geodesic saliency using background priors," in ECCV, 2012.
[16] W. Zhu, S. Liang, Y. Wei, and J. Sun, "Saliency optimization from robust background detection," in CVPR, 2014.
[17] J. Kim, D. Han, Y.-W. Tai, and J. Kim, "Salient region detection via high-dimensional color transform," in CVPR, 2014.
[18] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, "Deep networks for saliency detection via local estimation and global search," in CVPR, 2015.
[19] R. Zhao, W. Ouyang, H. Li, and X. Wang, "Saliency detection by multi-context deep learning," in CVPR, 2015.
[20] T. Chen, L. Lin, L. Liu, X. Luo, and X. Li, "DISC: Deep image saliency computing via progressive representation learning," IEEE TNNLS, 2016.
[21] X. Li, L. Zhao, L. Wei, M. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang, "DeepSaliency: Multi-task deep neural network model for salient object detection," IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3919-3930, 2016.
[22] R. Girshick, "Fast R-CNN," in ICCV, 2015.
[23] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in ECCV, 2014.
[24] Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft, "Convergent learning: Do different neural networks learn the same representations?" in ICLR, 2016.
[25] X. Li, Y. Li, C. Shen, A. Dick, and A. Van Den Hengel, "Contextual hypergraph modeling for salient object detection," in ICCV, 2013, pp. 3328-3335.
[26] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in CVPR, 2013.
[27] G. Li and Y. Yu, "Visual saliency based on multiscale deep features," in CVPR, 2015, pp. 5455-5463.
[28] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015, pp. 3431-3440.
[29] R. Ju, L. Ge, W. Geng, T. Ren, and G. Wu, "Depth saliency based on anisotropic center-surround difference," in ICIP, 2014, pp. 1115-1119.
[30] H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji, "RGBD salient object detection: A benchmark and algorithms," in ECCV, 2014, pp. 92-109.
[31] J. Ren, X. Gong, L. Yu, W. Zhou, and M. Y. Yang, "Exploiting global priors for RGB-D saliency detection," in CVPR Workshops, 2015, pp. 25-32.
[32] D. Feng, N. Barnes, S. You, and C. McCarthy, "Local background enclosure for RGB-D salient object detection," in CVPR, 2016.
[33] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in CVPR, 2015, pp. 447-456.
[34] N. Liu and J. Han, "DHSNet: Deep hierarchical saliency network for salient object detection," in CVPR, 2016, pp. 678-686.
[35] M. Liang and X. Hu, "Recurrent convolutional neural network for object recognition," in CVPR, 2015, pp. 3367-3375.
[36] S. Xie and Z. Tu, "Holistically-nested edge detection," in ICCV, 2015.
[37] J. Zhang and S. Sclaroff, "Saliency detection: A Boolean map approach," in ICCV, 2013, pp. 153-160.
[38] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," in ICLR, 2015.
[39] X. Wang, H. Ma, and X. Chen, "Salient object detection via Fast R-CNN and low-level cues," in IEEE ICIP, 2016.
[40] J. Dai, K. He, and J. Sun, "Instance-aware semantic segmentation via multi-task network cascades," in CVPR, 2016, pp. 3150-3158.
[41] P. Dollár and C. L. Zitnick, "Structured forests for fast edge detection," in ICCV, 2013.
[42] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE TPAMI, 2012.
[43] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[44] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, "Deeply-supervised nets," in AISTATS, vol. 2, no. 3, 2015, p. 6.
[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in ACM MM, 2014.
[46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[47] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in CVPR, 2013.
[48] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, "Frequency-tuned salient region detection," in CVPR, 2009.
[49] A. Borji, "What is a salient object? A dataset and a baseline model for salient object detection," IEEE TIP, 2015.
[50] S. Alpert, M. Galun, A. Brandt, and R. Basri, "Image segmentation by probabilistic bottom-up aggregation and cue integration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 315-327, 2012.
[51] M.-M. Cheng, N. J. Mitra, X. Huang, and S.-M. Hu, "SalientShape: Group saliency in image collections," The Visual Computer, vol. 30, no. 4, pp. 443-453, 2014.
[52] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, "The secrets of salient object segmentation," in CVPR, 2014, pp. 280-287.
[53] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, "Salient object detection: A benchmark," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5706-5722, 2015.
[54] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in CVPR, 2012.
[55] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: A discriminative regional feature integration approach," in CVPR, 2013.
[56] Z. Liu, W. Zou, and O. Le Meur, "Saliency tree: A novel saliency detection framework," IEEE TIP, 2014.
[57] H. Li, H. Lu, Z. Lin, X. Shen, and B. Price, "Inner and inter label propagation: Salient object detection in the wild," IEEE TIP, 2015.
[58] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, "Minimum barrier salient object detection at 80 fps," in ICCV, 2015.