Co-salient Object Detection Based on Deep Saliency Networks and Seed Propagation over an Integrated Graph
Dong-ju Jeong, Insung Hwang, and Nam Ik Cho, Senior Member, IEEE
Abstract—This paper presents a co-salient object detection method to find common salient regions in a set of images. We utilize deep saliency networks to transfer co-saliency prior knowledge and better capture high-level semantic information, and the resulting initial co-saliency maps are enhanced by seed propagation steps over an integrated graph. The deep saliency networks are trained in a supervised manner to avoid online weakly supervised learning, and are exploited not only to extract high-level features but also to produce both intra- and inter-image saliency maps. Through a refinement step, the initial co-saliency maps can uniformly highlight co-salient regions and locate accurate object boundaries. To handle input image groups inconsistent in size, we propose to pool multi-regional descriptors including both within-segment and within-group information. In addition, the integrated multilayer graph is constructed to find the regions that the previous steps may not detect, by seed propagation with low-level descriptors. In this work, we utilize the useful complementary components of high- and low-level information and several learning-based steps. Our experiments have demonstrated that the proposed approach outperforms comparable co-saliency detection methods on widely used public databases and can also be directly applied to co-segmentation tasks.
Index Terms—Co-saliency, saliency, deep saliency networks, seed propagation model, foreground probability.
I. INTRODUCTION

The objective of saliency detection is to find the most informative and attention-drawing regions in an image [1], and it has been one of the most popular computer vision tasks for the past few decades [2]. Saliency detection falls into two categories: salient object detection and eye fixation prediction. The former aims at identifying precise salient object regions with relative saliency values [1], [3]–[5], while the latter estimates eye gaze fixation points, resulting in saliency maps in the form of heat maps [6]–[9]. Recently, co-saliency detection has emerged as an important subtopic of salient object detection, which is to find visually distinct regions and/or objects that commonly appear in a set of images. In other words, the goal of co-saliency detection is to find common salient objects while suppressing salient objects/regions that appear only in part of the image group. Thus, visual coherency among the images needs to be considered in addition to the cues used in single-image saliency detection, such as contrast [10]–[12] and/or boundary priors [1], [12], [13].
D. Jeong, I. Hwang, and N. I. Cho are with the Dept. of Electrical and Computer Engineering, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul 151-742, Korea, and also affiliated with INMC (e-mail: [email protected], [email protected], and [email protected]).

Co-saliency detection can be applied to other computer vision tasks, such as co-segmentation [2], video foreground detection [14], image retrieval [15], and weakly supervised localization [16]. It can be utilized to enhance single-image saliency detection as well [17].

Many researchers have recently proposed to utilize convolutional neural networks (CNNs), which will be called deep saliency networks in this paper, to produce pixel- or segment-level saliency maps that better capture high-level semantic information and are robust to complex backgrounds [4], [5], [18], [19]. These methods also detect salient regions more uniformly and outperform conventional algorithms in terms of accuracy. Meanwhile, until recently, the majority of co-saliency detection methods used low-level handcrafted features such as color cues, because color information usually plays an important role in distinguishing between co-salient and non-co-salient regions [15], [20]–[22]. However, recent advances in deep learning have also contributed to the state-of-the-art methods for co-saliency detection [2], [23], [24], which exploit high-level CNN features to represent image patches/segments or encode low-level features with deep autoencoders. One of the most challenging issues in co-saliency detection is its dependency on input image groups: whether low- or high-level features become a prior factor differs from case to case, and the same goes for contrast and consistency [16]. To handle this, those learning-based methods perform weakly supervised learning given an image group, where similar images from external groups are also exploited to identify consistent background. On the other hand, graph-based processing has proved effective for spatial refinement of each image [19], [23], but it has rarely been used considering a whole image group and its consistency factor.

In order to tackle the issues mentioned above, we propose a supervised learning-based method that is complemented by graph-based manifold ranking with an integrated graph including all the intra-image nodes of the input images. The intra-image saliency (IrIS) maps of the images are produced by a fully convolutional network, part of which generates high-level semantic features. These features are associated with low-level features to cope with the various cases and improve the performance of our system [25], and fed into fully-connected layers to obtain the inter-image saliency (IeIS) value of each segment. We choose to train these deep saliency networks in a supervised manner to avoid using any learning models trained given an input image group (with similar external images) and thereby reduce computation time.
Fig. 1. A flowchart of the proposed co-saliency detection method with the blocks showing its steps and produced items.

As a result, initial co-saliency maps are generated by combining the IrIS and IeIS maps. In addition, we propose to construct an integrated graph over which auxiliary co-saliency values are obtained by propagating seeds extracted from the initial co-saliency maps. The segments of the input images are treated as intra-image nodes, and inter-image nodes connect them to form the integrated graph with a sparse affinity matrix computed from color similarities. While the deep saliency networks are expected to detect precise (co-)salient regions, the graph-based method helps to find parts of co-salient objects showing color consistency and/or located on image boundaries. These two types of co-saliency values are combined to produce final co-saliency maps with simple spatial refinement. The unified framework is illustrated in Fig. 1.

The rest of this paper is organized as follows. The second section introduces related work on co-saliency detection, the third section describes co-saliency detection using the deep saliency networks, and the fourth section describes the seed propagation over the integrated graph. The experimental results and the conclusions are presented in the last two sections.
II. RELATED WORK
Co-salient object detection began with analyzing multi-image information and finding common objects within image pairs [26]–[29]. For example, Li et al. [26] performed pyramid decomposition of images and then extracted color and texture features from each region to compute the maximum SimRank scores of region pairs, which are defined as multi-image saliency values. To obtain the final co-saliency maps, they linearly combined the single- and multi-image saliency maps. Tan et al. [27] proposed to calculate the affinities of superpixel pairs with color and position similarities, and then perform bipartite graph matching to discover the most relevant pairs for affinity propagation. The resulting superpixel affinities between two images are converted into foreground cohesiveness and locality compactness measures to obtain the final co-saliency maps.

Because such pairwise schemes lack scalability, other co-saliency detection methods have aimed at treating larger groups with more than two images. Fu et al. [15] proposed a two-layer cluster-based approach, where pixel-level intra- and inter-image clustering steps are performed to calculate the contrast, spatial, and corresponding cues of each cluster. They employed multiplicative fusion of the cluster-wise cues and converted them into the final pixel-wise co-saliency values. The algorithm by Li et al. [30] generates intra-saliency maps with multi-scale segmentation and pixel-wise voting, and inter-saliency maps by matching image regions with a minimum spanning tree. It also linearly combines the intra- and inter-saliency maps into the final co-saliency maps. Liu et al. [20] proposed to perform hierarchical segmentation and compute intra-saliency, object prior, and global similarity values of the fine/coarse segments to obtain co-saliency values. In [21], Li et al. adopted their previous work to obtain single-image saliency maps, and used the two-stage manifold ranking method to estimate co-salient regions. They let each image of a group take turns producing queries for the manifold ranking of all the images, and fused multiple co-saliency values by averaging or multiplication. In addition, Cao et al. [22] proposed a fusion-based algorithm, which adopts several existing (co-)saliency detection schemes and combines their results with self-adaptive weights produced by low-rank analysis.

The above methods utilize handcrafted features to represent pixels, segments, or clusters, and some of them focus only on color cues to cope with the situation where co-salient objects are quite consistent in color; hence they cannot capture abstract semantic information or effectively detect co-salient objects that consist of multiple components. Thus, learning-based methods using high-level features [2], [23], [24] have recently been proposed to tackle this problem. Zhang et al. [2] proposed to find several similar neighbors from external groups for negative image patches and analyze intra-image contrast, intra-group consistency, and inter-group separability measures. They combined these measures through a Bayesian framework to obtain patch-wise co-saliency values and then converted them into pixel-wise ones. In [23], a self-paced multi-instance learning method is used to update positive and negative training samples and their weights, and thereby train an SVM model for co-saliency estimation, where similar neighbors give the negative samples as in [2]. The approach proposed in [24] exploits stacked denoising autoencoders (SDAEs) for two objectives: intra-saliency prior transfer and deep inter-saliency mining.
First, several SDAEs for intra-saliency detection are trained in supervised and unsupervised manners with single-image saliency detection data, and second, another SDAE is trained in an unsupervised manner with the input images to exploit its reconstruction errors as co-saliency cues. These three approaches also utilize CNN models or SDAEs to represent patches/superpixels with high-level features or to convert low-level descriptors into higher-level ones. In addition, they perform their learning steps provided with the input image groups, because the criteria for differentiating between co-salient and non-co-salient objects depend on the given target image group; however, this may result in high computational complexity at test time.

III. CO-SALIENCY DETECTION USING DEEP SALIENCY NETWORKS
According to [16], the co-saliency detection methods in the literature explicitly or implicitly use the contrast cue and the corresponding cue, which are also called intra- and inter-image saliency, respectively. This is because co-salient regions are salient in each image and have correspondence across the whole image group, and thus a co-saliency detection algorithm should not ignore either one. To be precise, the definition of inter-image saliency in the conventional methods is similar to that of co-saliency but places more emphasis on the correspondence factor. As the bottom-up methods for co-saliency detection generally design explicit intra- and inter-image saliency maps, we compute them with deep saliency networks trained in a supervised manner. Then, they are refined and combined to produce initial co-saliency maps for the next step.
A. Intra-Image Saliency Detection
Given an image group $\{I_m\}_{m=1}^{M}$, each image $I_m$ is independently represented by its $n_m$ superpixels $\{s_i^m\}_{i=1}^{n_m}$, which are over-segmented regions obtained by the SLIC algorithm [31]. The goal of this step is to produce pixel-wise intra-image saliency (IrIS) values and convert them into segment-wise ones $\{rs_i^m\}_{i=1}^{n_m}$ for each image $I_m$. To this end, we use the multi-scale fully convolutional network [19], which produces pixel-level saliency maps by combining several stacked feature maps extracted with different sizes of receptive fields. We call this single-image saliency detection network an IrISnet in this paper. It is based on the original structure that utilizes the pre-trained VGG16 network [32] and is implemented following the DeepLab system [33].

Specifically, it replaces the fully-connected layers of the original VGG16 network with 1x1 convolutional layers to obtain a fully convolutional structure, and four branches consisting of 3x3 and 1x1 convolutional layers are attached to its pooling layers to obtain the multi-scale feature maps. The main stream and four branches compute five multi-scale single-channel feature maps, which are input to the last 1x1 convolutional layer and a sigmoid activation function to obtain an output saliency map ranging between 0 and 1. This network also exploits the hole (à trous) algorithm [34] for two purposes: first, it helps to compute denser feature maps while maintaining the original sizes of receptive fields, and second, it can adjust the sizes of the multi-scale feature maps to be identical. As a result, we obtain the output maps for $\{I_m\}$ and use bicubic interpolation to resize them to the original input image sizes so that we can estimate pixel-level saliency maps. Lastly, we set the median of the saliency values within each superpixel as $rs_i^m$ to obtain the segment-wise IrIS values, where we use medians instead of means to reduce halos around salient objects, as shown in Fig. 2.

Fig. 2. Examples of the intra-image saliency maps. From left to right: an input image, its pixel-level IrIS map, and two segment-level IrIS maps. The third and fourth images show the mean and median of the pixel-wise saliency values within each segment, respectively.
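For concreteness, the segment-wise pooling can be sketched in a few lines of Python/NumPy; this is a minimal sketch assuming a pixel-level IrIS map and a SLIC label map as inputs, not the exact implementation used in our experiments.

import numpy as np

def segment_iris(pixel_saliency, sp_labels):
    # pixel_saliency: (H, W) IrIS map in [0, 1]; sp_labels: (H, W) SLIC labels 0..n_m-1.
    # The median (rather than the mean) within each superpixel reduces halos
    # around salient objects caused by segments straddling object boundaries.
    n_m = int(sp_labels.max()) + 1
    rs = np.empty(n_m)
    for i in range(n_m):
        rs[i] = np.median(pixel_saliency[sp_labels == i])
    return rs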
B. Inter-Image Saliency Detection
For both single-image salient object detection and co-salient object detection, many existing methods (over-)segment input images and produce segment-wise saliency values. In graph-based models [1], [3], superpixels instead of raw pixels are treated as the nodes of a graph; since their number is limited, this makes it feasible to apply graph models such as manifold ranking. Meanwhile, deep convolutional networks make it possible to produce pixel-level saliency maps, and some CNN-based methods efficiently operate in that manner [4], [5]. However, others operate entirely at the segment level or use segment-wise saliency values to complement pixel-wise ones. In [18], [25], each segment (with its relevant regions), irrespective of the number of segments, can be fed into deep neural networks with a fixed number of parameters, and low-level features can also be exploited as additional inputs. The method of [19] utilizes both the pixel- and segment-wise saliency values, where the latter better represent saliency discontinuities along object boundaries.

To treat multiple images, we take advantage of the segment-wise processing mentioned above. For single-image saliency detection with a CNN-based method, a whole image can be fed into the CNN model. For co-saliency detection, however, the size of an image group is not fixed and the information of the whole image group has to be exploited, so it is appropriate to predict segment-wise co-saliency values. In addition, there are cases where color cues are more important than other cues such as high-level semantic information, and other low-level features (e.g., position) might also be helpful and need to be added to the higher-level features extracted from convolutional layers. Considering these aspects, we compute for each segment a descriptor that includes the information of the whole image group, and produce segment-level inter-image saliency (IeIS) maps.
The CNN features extracted from the conv5_3 layer of the IrISnet are used as higher-level features for each segment. To adapt the superpixels defined on the image domain to the domain of the feature maps of the IrISnet, we use convolutional feature masking (CFM) [35], and then perform spatial pooling [36] over a fixed grid to obtain fixed-length descriptors as in [19]. In addition to the higher-level features, low-level ones such as Lab color vectors, color histograms, and positions are computed to complement them, because either of the two types can be more important than the other depending on the input image group [16]. With these components, we propose to compute each segment $s_i^m$'s multi-regional descriptor $x_i^m = [x_{i,\mathrm{seg}}^m, x_{i,\mathrm{nbh}}^m, x_{\mathrm{sfg}}^m, x_{\mathrm{gfg}}]$, each element of which is pooled within four different regions: i) the target segment $s_i^m$, ii) its immediate neighborhood, iii) foreground regions in the image that it belongs to, and iv) foreground regions in the whole image group. Each image $I_m$ can initially have one or multiple foreground regions, which are set by thresholding its IrIS map with $\max(\frac{1}{n_m}\sum_{i=1}^{n_m} rs_i^m,\ 0.5)$ and finding connected components. Then we form the power set of these components, excluding the empty set, and all its elements are treated as the foreground regions of the image. The Lab color vectors and positions are normalized to $[0, 1]$ and averaged to represent each region. Also, each foreground region carries the variance of the $(x, y)$-positions within the region. We L1-normalize and then take the square root of the 256-bin Lab color histograms, as in RootSIFT [37] and VLAD [38], to moderately suppress the few color components that are bursty in the image group. Given the descriptors of all the foreground regions, we perform sum-pooling to obtain the fixed-length $x_{\mathrm{sfg}}^m$ and $x_{\mathrm{gfg}}$. In particular, the sum-pooling of regional max-pooled CNN features has been shown to be effective in [39], [40]; the difference from R-MAC [40] is that we perform the spatial pooling over a fixed grid in each region. We compute the covariance matrices of the high- and low-level descriptors within the foreground regions, and their traces are included in $x_{\mathrm{gfg}}$. At last, each of $x_{i,\mathrm{seg}}^m$, $x_{i,\mathrm{nbh}}^m$, $x_{\mathrm{sfg}}^m$, and $x_{\mathrm{gfg}}$ is L2-normalized, and they are concatenated to form $x_i^m$.

Given the segment descriptors $\{x_i^m\}_{i=1}^{n_m}$ for each image $I_m$, they are fed into three fully-connected layers ending with a two-way softmax. We call this network model an IeISnet. To train the IeISnet, labels $cs_i^{GT}$ are obtained from the ground-truth co-saliency maps of the training datasets by thresholding the averaged label within each superpixel at 0.5, and the cross-entropy loss is used with pointwise weights as below:

$$L = -\frac{1}{N}\sum_{i=1}^{N} \lambda_i \log \frac{e^{z_{y_i i}}}{e^{z_{0i}} + e^{z_{1i}}}, \quad \lambda_i = \{(1-\rho)[y_i = 0] + \rho[y_i = 1]\} \cdot \gamma^{|rs_i - cs_i^{GT}|}, \ \forall i \quad (1)$$

where $N$ is the number of training samples, $z_{0i}$ and $z_{1i}$ are the two last activation values, $0$ and $1$ are the labels for non-co-salient and co-salient regions respectively, and $y_i$ is the ground-truth label of the $i$-th sample. The weight $\rho$ balances the numbers of $0$ and $1$ labels in the training sets. Through $\gamma$, the pointwise weights $\lambda_i$ are designed to place more emphasis on the regions that have high IrIS and low IeIS values, and vice versa. As shown in Fig. 3, the former case corresponds to what are called "single saliency residuals" in [41], while the latter represents situations where some regions may not seem salient in their own image but are certainly co-salient in their image group.
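A minimal NumPy sketch of the loss in (1) is given below for clarity; the batch layout is an assumption of this sketch, and the default ρ here is an illustrative placeholder (γ = 3 in our experiments).

import numpy as np

def ieis_loss(z, y, rs, cs_gt, rho=0.5, gamma=3.0):
    # z: (N, 2) last-layer activations; y: (N,) int labels (0 non-co-salient, 1 co-salient)
    # rs: (N,) IrIS values; cs_gt: (N,) ground-truth co-saliency labels of the segments
    balance = np.where(y == 1, rho, 1.0 - rho)      # class-balancing term of lambda_i
    lam = balance * gamma ** np.abs(rs - cs_gt)     # emphasis on IrIS/ground-truth disagreement
    z = z - z.max(axis=1, keepdims=True)            # numerically stable softmax
    log_p = z[np.arange(len(y)), y] - np.log(np.exp(z).sum(axis=1))
    return -np.mean(lam * log_p)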
Fig. 3. Examples of two different cases where the IrIS and IeIS maps differ from each other. From left to right: image groups, input images from these groups, and their IrIS, IeIS, and initial co-saliency maps. The first row shows regions that are salient in their image but not co-salient in the group, while the second row represents the opposite case.
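The burstiness-suppressing histogram normalization used in the multi-regional descriptors above follows the RootSIFT recipe [37]; a minimal sketch:

import numpy as np

def rootsift_hist(hist, eps=1e-12):
    # L1-normalize, then take the square root: large (bursty) color bins are
    # compressed relative to small ones, and the result has unit L2 norm.
    hist = hist / (hist.sum() + eps)
    return np.sqrt(hist)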
Because the IeIS value of each segment $s_i^m$ is independently estimated through the above process, we refine each IeIS map so that neighboring regions have smooth IeIS values. For this, we perform seed propagation over a simple graph model. The segments $s_i^m$ are treated as intra-image nodes $v_i^m$ in each image $I_m$, and the edge $e_{ij}^m$ between two neighboring nodes $v_i^m$ and $v_j^m$ that share a common segment boundary connects them with a weight $w_{ij}^m$, which represents the affinity between them and is calculated using color similarity [42]. Even though recently proposed saliency detection methods using graph-based manifold ranking [1], [3] utilize more sophisticated graphs for difficult cases, such as when parts of salient objects are located on image boundaries, we tackle that problem in Section IV and use the simple graph model for this step. Letting $x_{i,co}$ denote (omitting $m$ for now) the averaged Lab color vector of $v_i$ in an image, the weight $w_{ij}$ is computed as:

$$w_{ij} = \exp\left(-(x_{i,co} - x_{j,co})^T \Sigma^{-1} (x_{i,co} - x_{j,co})\right), \quad \Sigma = \frac{1}{N(E)} \sum_{e_{ij} \in E} (x_{i,co} - x_{j,co})(x_{i,co} - x_{j,co})^T \quad (2)$$

where $E$ is the set of all the edges in the image and $N(E)$ is the size of $E$. Then the affinity matrix for the intra-image graph of $I_m$ is constructed, whose $(i, j)$-th element is the weight between $v_i^m$ and $v_j^m$:

$$(W^m)_{i,j} = \begin{cases} w_{ij}^m, & \text{if } j \in Q_i^m, \\ 0, & \text{otherwise}, \end{cases} \quad m = 1, \ldots, M \quad (3)$$

where $Q_i^m$ is the index set of neighbors of the $i$-th node.

To propagate seeds over these graphs, we need to extract foreground and background seeds. We set the segments whose initial IeIS values are larger than 0.5 and, at the same time, in the top 10 percent as the foreground seeds; they have 1s and all the others have 0s in $y_f^m$. The segments on image boundaries are simply selected as the background seeds, and we obtain $y_b^m$ likewise. To obtain the refined IeIS maps, the graph-based learning method is adopted for effective propagation [1], [43]. Given the weight matrix $W^m$ and its degree matrix $D^m = \mathrm{diag}(d_1^m, \ldots, d_{n_m}^m)$, where $d_i^m = \sum_j w_{ij}^m$, the newly ranked values $f^m = [f_1^m, \ldots, f_{n_m}^m]^T$ for either type of seeds can be optimized with the following problem:

$$\min_{f^m} \sum_{i,j=1}^{n_m} w_{ij}^m \left| \frac{f_i^m}{\sqrt{d_i^m}} - \frac{f_j^m}{\sqrt{d_j^m}} \right|^2 + \nu \sum_{i=1}^{n_m} |f_i^m - y_i^m|^2 \quad (4)$$

where $\nu$ is the controlling parameter that balances the smoothness constraint and the fitting constraint. The solution of (4) is given by:

$$f^m = (D^m - \alpha W^m)^{-1} y^m = W_L^m y^m \quad (5)$$

where $\alpha = 1/(\nu + 1)$ and, for implementation, the diagonal elements of $W_L^m$ are set to 0 so that each query receives values propagated from the other nodes rather than from itself. Using both the foreground and background seeds, $f_f^m$ and $f_b^m$ are obtained, where each node receives newly propagated values from the seeds through the learned affinity matrix. The final refined IeIS values are computed as:

$$es^m = (f_f^m - \eta f_b^m) \ ./ \ (f_f^m + \eta f_b^m) = [es_1^m, \ldots, es_{n_m}^m]^T \quad (6)$$

where $./$ is the element-wise division of two vectors and $\eta$ is a controlling parameter. The numerator represents the IeIS while the denominator maintains the balance among the nodes, and lastly $es^m$ is normalized to $[0, 1]$.
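The refinement of (4)–(6) reduces to a linear solve per image; below is a minimal dense NumPy sketch (the actual graphs are sparse, and the values of α and η here are illustrative):

import numpy as np

def propagate(W, y, alpha=0.5):
    # Eq. (5): f = (D - alpha W)^{-1} y, with the diagonal of the learned
    # matrix zeroed so that a query does not rank itself.
    D = np.diag(W.sum(axis=1))
    W_L = np.linalg.inv(D - alpha * W)
    np.fill_diagonal(W_L, 0.0)
    return W_L @ y

def refine_ieis(W, y_f, y_b, alpha=0.5, eta=2.0):
    # Eq. (6): combine foreground/background propagation, then rescale to [0, 1].
    f_f, f_b = propagate(W, y_f, alpha), propagate(W, y_b, alpha)
    es = (f_f - eta * f_b) / (f_f + eta * f_b + 1e-12)
    es = es - es.min()
    return es / (es.max() + 1e-12)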
C. Initial Co-saliency Maps

As mentioned above, there are occasions where $rs_i^m$ is considerably larger than $es_i^m$, and vice versa. If $rs_i^m > es_i^m$, $s_i^m$ is considered to show a single saliency residual, and thus the co-saliency value of $s_i^m$ should be as small as $es_i^m$, which prohibits a linear combination of the two values [41]. If $es_i^m > rs_i^m$, on the other hand, this is the specific case where some regions may not seem salient in their own image but are certainly co-salient in their image group. Both types of cases encourage us to put more emphasis on $es_i^m$ when computing co-saliency maps. Considering this aspect, we obtain the initial co-saliency (IC) value for each segment $s_i^m$ as below:

$$IC_i^m = \begin{cases} rs_i^m \cdot es_i^m, & \delta_i^m \geq \tau \\ (1 - |\delta_i^m|) rs_i^m + |\delta_i^m| es_i^m, & \text{otherwise} \end{cases}, \quad \delta_i^m = rs_i^m - es_i^m \quad (7)$$

where the threshold $\tau$ draws a boundary between the "single saliency residual" case and the other one. Fig. 3 shows several saliency maps resulting from the deep saliency networks.
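In code, the combination rule of (7) is a simple element-wise switch; a sketch with an illustrative value of τ:

import numpy as np

def initial_cosaliency(rs, es, tau=0.5):
    # delta >= tau: 'single saliency residual' -> multiplicative suppression;
    # otherwise: blend IrIS and IeIS in proportion to their disagreement.
    delta = rs - es
    blended = (1.0 - np.abs(delta)) * rs + np.abs(delta) * es
    return np.where(delta >= tau, rs * es, blended)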
IV. SEED PROPAGATION OVER AN INTEGRATED GRAPH

The (co-)saliency detection methods usually perform latter-stage tasks to obtain final (co-)saliency maps, leveraging color and pixel position information. For example, the ranking with foreground queries in [1] is a second stage that locates accurate object boundaries and eliminates background noise, and a fully-connected conditional random field (CRF) [44] is used for post-processing in [19]. Many co-saliency detection algorithms also refine their resulting maps [2], [23] or combine several cues [20] using color features within an image group, because co-salient objects probably share similar color features in (part of) the image group.
Fig. 4. Examples of initial and auxiliary co-saliency maps. From left to right: image groups, input images from these groups, and their initial and auxiliary co-saliency maps. The initial co-saliency maps do not fully detect regions that are homogeneous and/or close to image boundaries, while the second row shows that the auxiliary co-saliency maps may miss parts of objects with multiple components. Thus, the two types of co-saliency maps can complement each other.

However, the graph-based procedures among these latter tasks are performed separately within each image, so they address only the refinement of each (co-)saliency map, toward accurate boundaries and smooth saliency values, without considering the correspondence within the image group.

In contrast to those methods, we propose to consider the whole image group and refine the input images all together. In addition, this step has another important role, which is to detect (parts of) co-salient objects located on image boundaries, as shown in Fig. 4. The preceding procedures in our work might miss the homogeneous parts of co-salient objects on image boundaries and strongly suppress the regions close to the boundaries. To this end, we construct an integrated graph that connects all the intra-image nodes of the images in the group for sharing co-saliency information.
A. The Integrated Graph with a Cluster Layer
In [27], the bipartite graph matching method finds pairs of the most relevant superpixels between two images, each of which is connected with its matching score. Though it ensures good matched pairs for similar scenes, such as sequential frames of a video that do not differ severely from each other, this approach easily fails to find good pairs of superpixels between images that have various backgrounds and/or different sizes of objects. Hence, an indirect approach is introduced in this paper to overcome this problem. We basically ignore the connectivity between images, which means that there are no edges that directly connect the intra-image nodes of any two different images. Thus the intra-image graphs are represented in the form of a sparse block-diagonal matrix:

$$W_I = \begin{bmatrix} W^1 & & 0 \\ & \ddots & \\ 0 & & W^M \end{bmatrix} \in \mathbb{R}^{n \times n} \quad (8)$$

where $n = \sum_m n_m$ and each $W^m$ is computed by (2) and (3). Instead, the proposed method introduces an additional cluster layer to consider the interactions between images and indirectly connect the intra-image nodes via the inter-image nodes on it, as shown in Fig. 5.

To define the inter-image nodes, we perform $K$-means clustering with the descriptor of every intra-image node, reusing its averaged Lab color vector $x_{i,co}$.
Fig. 5. Visualization of the integrated graph and the interactions of intra-image nodes therein, focusing on one image layer. Red arrows represent the paths where the intra-image nodes interact between images via inter-image nodes, and small navy arrows indicate the interactions between the intra-image nodes in a single image.

Through this procedure, $K$ clusters $\{C_i\}_{i=1}^{K}$ and their centroids $\{c_i\}_{i=1}^{K}$ are generated, where $c_i$ is the representative descriptor for $C_i$ and is also defined as an inter-image node. The goal of this step is to construct the affinity matrix of the unified graph including all the intra- and inter-image nodes, so we first connect each $c_i$ to its elements and compute the weights of the edges using descriptor similarities as:

$$w_{ij}^{IC} = \exp\left(-\frac{\|x_{i,co} - c_j\|^2}{\sigma^2}\right), \quad (W_{IC})_{i,j} = \begin{cases} w_{ij}^{IC}, & \text{if } x_{i,co} \in C_j \\ 0, & \text{otherwise} \end{cases} \quad (9)$$

where $\sigma$ is a control parameter for the descriptor similarity. In addition, the inter-image nodes are also connected to each other, specifically to their $k$-nearest neighbors ($k$-NN), which means that the graph of the cluster layer is as sparse as the intra-image graphs, and its affinity matrix is written as:

$$w_{ij}^{C} = \exp\left(-\frac{\|c_i - c_j\|^2}{\sigma^2}\right), \quad (W_C)_{i,j} = \begin{cases} w_{ij}^{C}, & \text{if } i \in k\text{-NN}(j) \text{ or } j \in k\text{-NN}(i) \\ 0, & \text{otherwise}. \end{cases} \quad (10)$$

Finally, the affinity matrix of the unified graph is constructed from $W_I$, $W_{IC}$, and $W_C$, expressed in block matrix form:

$$W = \begin{bmatrix} W_I & W_{IC} \\ W_{IC}^T & W_C \end{bmatrix} \in \mathbb{R}^{(n+K) \times (n+K)}. \quad (11)$$
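Assembling the unified affinity matrix of (8)–(11) can be sketched as follows, assuming the per-image matrices $W^m$ of (3) are given; scikit-learn's KMeans and SciPy's block_diag are used for brevity, and σ is an illustrative value.

import numpy as np
from scipy.linalg import block_diag
from sklearn.cluster import KMeans

def integrated_graph(X, W_blocks, K=100, k=5, sigma=0.25):
    # X: (n, d) descriptors (averaged Lab colors) of all intra-image nodes.
    W_I = block_diag(*W_blocks)                    # Eq. (8): no direct inter-image edges
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    C = km.cluster_centers_                        # inter-image nodes
    n = X.shape[0]
    W_IC = np.zeros((n, K))                        # Eq. (9): node <-> its own cluster center
    d2 = ((X - C[km.labels_]) ** 2).sum(axis=1)
    W_IC[np.arange(n), km.labels_] = np.exp(-d2 / sigma ** 2)
    D2 = ((C[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    W_C = np.zeros((K, K))                         # Eq. (10): k-NN among cluster centers
    for j in range(K):
        nn = np.argsort(D2[j])[1:k + 1]            # skip the center itself
        W_C[nn, j] = W_C[j, nn] = np.exp(-D2[j, nn] / sigma ** 2)
    return np.block([[W_I, W_IC], [W_IC.T, W_C]])  # Eq. (11)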
B. Seed Propagation

To assign newly propagated co-saliency values to all the segments, we need to extract foreground seeds (called co-saliency seeds in this section) and background seeds, which are selected similarly to the process for the IeIS refinement. The top 10 percent of co-salient regions with respect to the IC values in each image are extracted as the co-saliency seeds, and the boundary nodes of each image are selected as the background seeds, based on the boundary prior. In addition, nodes selected as both co-saliency and background seeds simultaneously are precluded from both seed sets, because such seeds are not reliable. In summary, the co-saliency and background seeds are defined as:

• Co-saliency seeds ($y_{I,s}$): high-IC nodes that are not on any image boundaries.
• Background seeds ($y_{I,b}$): low-IC nodes on image boundaries.

From the co-saliency and background seeds, co-saliency values are computed by propagating them to all the (intra-image) nodes in the image group. For this, we use the graph-based learning scheme again with the integrated graph, which yields a fully pairwise graph as:

$$W_L = (D - \alpha W)^{-1} = [w_L^1, \ldots, w_L^{n+K}], \quad (12)$$

where $D = \mathrm{diag}(d_1, \ldots, d_{n+K})$ is the degree matrix of $W$. As mentioned above, there are no direct inter-image connections between any two intra-image nodes in the graph with the affinity matrix $W$; the inter-image nodes indirectly connect such pairs instead. However, the learned graph with $W_L$ has full pairwise relations among all the nodes. In other words, this graph has direct inter-image connections, ensuring straightforward propagation between images.

To obtain auxiliary co-saliency maps to combine with the IC maps, the overall affinities to the co-saliency and background seeds are computed respectively, which is written as:

$$f_s = W_L y_s = \sum_{i \in S_s} w_L^i, \quad f_b = W_L y_b = \sum_{i \in S_b} w_L^i \quad (13)$$

where $y_s = [y_{I,s}; \mathbf{0}]$ and $y_b = [y_{I,b}; \mathbf{0}]$ are the co-saliency and background seed vectors respectively, each of which is concatenated with a zero vector for the inter-image nodes, and $S_s$ and $S_b$ represent the co-saliency and background seed sets respectively. $f_s$ and $f_b$ are decomposed into the vectors for each image and the cluster layer, i.e., $f_s = [f_s^1; \ldots; f_s^M; f_s^C]$ and $f_b = [f_b^1; \ldots; f_b^M; f_b^C]$, and thus the auxiliary co-saliency map for $I_m$ is computed as:

$$AC^m = (f_s^m - \eta f_b^m) \ ./ \ (f_s^m + \eta f_b^m). \quad (14)$$

Lastly, $AC^m = [AC_1^m, \ldots, AC_{n_m}^m]$ is also normalized to $[0, 1]$ and combined with $IC^m = [IC_1^m, \ldots, IC_{n_m}^m]$.
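A sketch of (12)–(14) over the unified graph, reusing the dense solve shown earlier; the seed vectors are padded with zeros for the K inter-image nodes, and the result is split back into per-image maps.

import numpy as np

def auxiliary_cosaliency(W, seeds_s, seeds_b, n_per_image, K, alpha=0.5, eta=2.0):
    # W: (n+K, n+K) unified affinity; seeds_s/seeds_b: (n,) 0/1 intra-image seed vectors.
    D = np.diag(W.sum(axis=1))
    W_L = np.linalg.inv(D - alpha * W)                   # Eq. (12): full pairwise relations
    y_s = np.concatenate([seeds_s, np.zeros(K)])
    y_b = np.concatenate([seeds_b, np.zeros(K)])
    f_s, f_b = W_L @ y_s, W_L @ y_b                      # Eq. (13)
    AC = (f_s - eta * f_b) / (f_s + eta * f_b + 1e-12)   # Eq. (14)
    maps, start = [], 0
    for n_m in n_per_image:                              # split the intra-image part per image
        a = AC[start:start + n_m]
        maps.append((a - a.min()) / (a.max() - a.min() + 1e-12))
        start += n_m
    return maps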
C. Final Co-saliency Maps

Given the initial and auxiliary co-saliency maps, the former might not fully detect regions that are homogeneous and/or close to image boundaries, while the latter might miss parts of objects with multiple components due to relying solely on color and position cues. Therefore, the two are complementary and are simply combined to produce the final co-saliency maps $CS^m = [CS_1^m, \ldots, CS_{n_m}^m]$ as

$$CS_i^m = \max(IC_i^m, AC_i^m). \quad (15)$$

Because the auxiliary co-saliency maps are likely to be vulnerable to background noise, we perform a simple post-processing scheme for $CS^m$ in which the outputs never exceed the inputs [22]. This step needs spatial positional distance maps, which can be computed with shrunk input images to reduce processing time.
Fig. 6. Visual comparison on the Alaskan bear, Red Sox player, and Statue of liberty sets in iCoseg (from left to right: input images, CB, HS, MG, LDW, MIL, DIM, the proposed method, and ground truth images).
V. EXPERIMENTAL RESULTS
A. Experimental Settings
In our experiments, two widely used datasets, iCoseg [45] and MSRC [46], are used to evaluate the performance of our algorithm and compare it with others. The iCoseg dataset consists of 38 groups, each of which includes 4–42 images, for a total of 643 images along with pixel-wise ground truth annotations. It is the largest among widely used co-saliency detection datasets, and its image groups contain multiple objects and complex backgrounds. The MSRC dataset is composed of 8 groups, each of which has exactly 30 images, but the grass group is not used for the evaluation since it has no co-salient objects. This dataset can be used to evaluate the ability to treat co-salient objects that are not consistent in color, and it also contains complex co-salient objects and diverse, cluttered backgrounds.

For each of the evaluation datasets, the performance is measured with five widely used criteria: the precision-recall (PR) curve, the average precision (AP), the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and the F-measure. When there are more true negatives than true positives, the PR curve shows the differences between algorithms more clearly than the ROC curve does, and the same goes for the corresponding areas under the curves, AP and AUC. For the PR and ROC curves, the co-saliency maps are normalized to $[0, 1]$ and binarized with thresholds varying from 0 to 255.
Fig. 7. Visual comparison on the Building and Face sets in MSRC (from left to right: input images, CB, HS, MG, LDW, MIL, the proposed method, and ground truth images).
The precision, recall, and false positive rates are calculated at each threshold and averaged over all samples, following the standard practice in the literature [47]. Meanwhile, we use a self-adaptive threshold $T = \mu + \epsilon$ [48] to obtain the F-measure, where $\mu$ and $\epsilon$ are the mean and standard deviation within each co-saliency map respectively, and the precision and recall rates averaged over all samples are combined as defined below:

$$F_\beta = \frac{(1 + \beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}} \quad (16)$$

where $\beta^2 = 0.3$ as typically used in the literature.

The deep saliency networks are implemented with the Caffe package [49], following the publicly available model of the DeepLab system. For the IrISnet, we used a large single-image saliency detection dataset, MSRA10K [50], and the input image size and hyper-parameters for training are set as suggested in [19]. The IeISnet consists of sequential fully-connected, batch normalization [51], and rectified linear unit (ReLU) layers, which are trained with several co-saliency detection datasets, i.e., Cosal2015 [2] and CPD [26], including whichever of iCoseg or MSRC is not used for testing, to exploit as much training data as possible. Even though parts of the Cosal2015 dataset (e.g., baseball) tend to put far more emphasis on the correspondence than on the intra-image saliency, this is acceptable for our IeISnet training because the definition of IeIS also focuses more on the correspondence. We set the learning rate and momentum parameter to 0.001 and 0.9 respectively, and the weight decay is 0.0005. As in [1], [2], [19], we set $n_m$ for each $I_m$ to 200, where 150 and 50 superpixels at different scales are additionally used for the IeIS detection, and the precise value of $n_m$ is determined by the SLIC algorithm. We consistently set $K = 100$ and $k = 5$ irrespective of the number of images $M$. For the IeISnet learning, we set $\rho$ considering the numbers of true positives and negatives in the training co-saliency detection datasets, and $\gamma$ is empirically set to 3. The parameter $\alpha$ is usually set to 0.99 in the literature, but we use a smaller value since the seed propagation steps are performed with more reliable foreground seeds, and we set $\eta = 2$ for the same reason. Lastly, we conduct grid search experiments for $\alpha$, $\eta$, $\sigma$, and $\tau$ to ensure that we select appropriate values for these parameters.
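For reference, the adaptive-threshold F-measure of (16) can be computed as below; a minimal sketch for a single map and its binary ground truth.

import numpy as np

def f_measure(sal, gt, beta2=0.3):
    # Self-adaptive threshold T = mu + eps (mean plus standard deviation).
    T = sal.mean() + sal.std()
    pred = sal >= T
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-12)
    recall = tp / (gt.sum() + 1e-12)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-12)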
TABLE I
AVERAGE EXECUTION TIME PER IMAGE

Method    CB     HS       MG     LDW    MIL     DIM*   Ours
Time (s)  1.02   103.36   1.12   6.52   12.25   19.6   1.79

*The running times of DIM and HS, cited from [24] and [20], were measured only on iCoseg.
B. Run Time Comparison
We conduct the experiments with our unoptimized code run on a PC with an Intel i7-6700 CPU, 32GB RAM, and a GTX Titan X GPU. The code is implemented in MATLAB except for the SLIC algorithm in C++, and GPU acceleration is applied only for the Caffe framework. Table I lists the average execution time per image for several different methods, where the execution times of LDW, MIL, DIM, and HS are cited from their papers. The first two values were measured on a PC with two 2.8GHz 6-core CPUs, 64GB RAM, and a GTX Titan Black GPU in [2], [23], the third on an Intel i3-2130 CPU with 8GB RAM in [24], and the last on an Intel i7-3770 with 4GB RAM in [20]. As can be seen, the proposed method has moderate computational complexity with state-of-the-art performance, as evaluated below. In particular, our method runs faster than the methods based on online weakly supervised learning [2], [23], [24] by leveraging supervised learning, and shows an execution time similar to those of the efficient CB [15] and MG [21] methods.
C. Comparison with the State-of-the-Art
With the evaluation criteria stated above, we compare the proposed co-saliency detection method with other major algorithms, ranging from bottom-up ones based on handcrafted features to learning-based ones using high-level features: CB [15], HS [20], MG [21], LDW [2], MIL [23], and DIM [24] (only for the iCoseg dataset). First, Figs. 6 and 7 show several visual examples of the resulting co-saliency maps for qualitative comparison, where it can be seen that the proposed method detects co-salient objects more uniformly and suppresses background regions better than the others. In particular, the Alaskan bear set contains background regions similar to the co-salient objects in color. Even though our auxiliary co-saliency maps focus on color similarity, emphasizing the background seeds moderately suppresses the background noise that this may bring about. The MG method effectively finds co-salient objects consistent in color, e.g., Red Sox player, but it has weaknesses in suppressing noisy backgrounds and in detecting co-salient regions inconsistent in color, as shown in Fig. 7. The Statue of liberty set includes many co-salient regions that are not salient in terms of single-image saliency. The most representative case is the first image, where only the torch probably looks salient, yet every part of the statue is co-salient in the group. Because each image in the MSRC dataset probably contains a single co-salient object, it is effective to first find salient regions in terms of single-image saliency and then analyze the correspondence in each group. Thus, the results of the proposed method show well-suppressed common backgrounds.

For the quantitative comparison, Fig. 8 shows the PR and ROC curves, and Table II contains the AP, AUC, and F-measure values of our method and the compared ones. On the iCoseg dataset, the proposed method outperforms the others on all the evaluation criteria. Both the PR and ROC curves show that our co-saliency maps yield the highest precision/recall rates over the widest ranges of recall/false positive rates, especially at decent recall/false positive rates.

D. Parameter Analysis
We conduct grid search experiments to find appropriate values of $\alpha$, $\eta$, $\sigma$, and $\tau$. Fig. 9 shows the AP, AUC, and F-measure scores over certain ranges of these parameters. As can be seen, the most effective value of $\alpha$ is lower on MSRC than on iCoseg, which reflects the tendency that correspondence cues are of lower importance in MSRC, so the fitting constraint is more emphasized for the seed propagation over the integrated graph than in iCoseg. Likewise, the variation of the parameter $\eta$ also slightly influences the performance depending on the target dataset.
Fig. 8. Quantitative comparison on the iCoseg and MSRC datasets with the PR and ROC curves.

Fig. 9. Grid search analysis on the parameters $\alpha$, $\eta$, $\sigma$, and $\tau$. The performance under variations of $\alpha$ and $\eta$ slightly depends on the target dataset (i.e., iCoseg or MSRC), while the proposed system is not sensitive to $\sigma$ and $\tau$.

Because the foreground and background seeds are basically of equal importance, one could set $\eta$ to 1, but a value of $\eta$ larger than 1 is more effective with reliable foreground seeds, especially with respect to the F-measures. On the other hand, the proposed method is not sensitive to the parameters $\sigma$ and $\tau$, so we can select decent values for these parameters and obtain stable results with them. Even though, in (7), two different operations are applied according to $\tau$, both sides emphasize the IeIS when the difference between the IrIS and IeIS values is large; when they are similar to each other, both give almost equal contributions to the resulting initial co-saliency value. Thus, slight variations of the parameter $\tau$ do not bring about big differences in the performance of our method. The control parameter $\sigma$ for the construction of the integrated graph behaves similarly to $\tau$, where a larger $\sigma$ further facilitates the seed propagation between the cluster centers and intra-image nodes with similar colors, and vice versa. This is because the various colors of co-salient objects can be reflected in the cluster layer, where different inter-image nodes represent diverse colors given a sufficiently large $K$, and the regions within an image group that are similar in color can share their co-saliency information through the seed propagation.
TABLE II
QUANTITATIVE COMPARISON ON THE ICOSEG AND MSRC DATASETS WITH AP, AUC, AND F-MEASURES

Dataset  Method  AP     AUC    F-measure  σ_F*
iCoseg   CB      0.806  0.937  0.741      0.145
         HS      0.839  0.955  0.755      0.189
         MG      0.854  0.957  0.794      0.114
         LDW     0.875  0.957  0.799      0.168
         MIL     0.866  0.965  0.814      0.141
         DIM     0.877  0.969  0.792      0.212
         Ours    –      –      –          –
MSRC     CB      0.689  0.798  0.577      0.170
         HS      0.785  0.882  0.709      0.197
         MG      0.688  0.827  0.635      0.133
         LDW     0.842  0.908  0.767      0.178
         MIL     –      –      –          –
         Ours    –      –      –          –

*σ_F denotes the standard deviation of F-measures under variation of the binarization threshold.

E. Co-segmentation Experiments
Co-segmentation is a direct higher-level application of co-salient object detection, where co-saliency can replace user interaction and provide useful prior knowledge of target objects. For example, Quan et al. [52] proposed to construct a graph including the input images and generate two types of probability maps using low- and high-level features through graph-based optimization. For each image, the resulting two probability maps are combined by multiplication, and then a graph cut approach produces the final co-segmentation results. These probability maps could be replaced with co-saliency maps; in fact, Fu et al. [15] applied their co-saliency detection method to co-segmentation through Markov random field optimization. The approach of Chen et al. [53] groups input images into aligned homogeneous clusters and then merges them into visual subcategories, where a discriminative detector for each subcategory is trained to find target objects within a test dataset. For each cluster, a co-segmentation method is applied to segment out the aligned objects, and this step could also be performed with co-saliency detection.

Thus, we conduct co-segmentation experiments to compare our results with those of other approaches. Two datasets, Internet-100 [54] and iCoseg, are used for the evaluation with the Jaccard index (J, intersection-over-union for the foreground regions) and Precision (P, the proportion of correctly labeled pixels). We convert our co-saliency maps into co-segmentation results by simply thresholding at 0.5. Because, for the Internet-100 dataset, there are several noisy images in each class that do not contain the target objects, we normalize each auxiliary co-saliency map by the operation $x \rightarrow (x + 1)/2$ instead of normalizing it to $[0, 1]$, which would force the maximum in it to be 1. Table III and Fig. 10 show the quantitative comparison and several visual examples of our results on the Internet-100 dataset, respectively. The proposed method with simple thresholding outperforms other state-of-the-art co-segmentation algorithms or produces results competitive with them. Fig. 10 shows several quality co-segmentation results, but the noisy objects are not perfectly suppressed in the last images of the first and second rows.
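A sketch of this conversion; the remap $(x + 1)/2$ maps the raw auxiliary values of (14) from $[-1, 1]$ to $[0, 1]$ without per-image min-max stretching, so an image that lacks the target object is not forced to contain values close to 1.

import numpy as np

def remap_auxiliary(ac_raw):
    # ac_raw in [-1, 1] from Eq. (14); avoid forcing the per-image maximum to 1.
    return (ac_raw + 1.0) / 2.0

def to_cosegmentation(cs_map):
    # Binary co-segmentation mask by simple thresholding at 0.5.
    return cs_map >= 0.5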
TABLE III
QUANTITATIVE COMPARISON OF CO-SEGMENTATION METHODS

Internet-100   Airplane        Car             Horse
               P (%)   J (%)   P (%)   J (%)   P (%)   J (%)
[55]           47.5    11.7    59.2    35.2    64.2    29.5
[54]           88.0    55.8    85.3    64.4    82.8    51.3
[53]           90.3    40.3    87.7    64.9    86.2    33.4
[52]           91.0    56.3    88.5    66.8    89.3    58.1
Ours           –       –       –       –       –       –

iCoseg         [56]    [57]    [52]    Ours
P (%)          91.4    92.8    93.3    –
J              –       0.73    –       –
VI. CONCLUSIONS
We have proposed a co-saliency detection method, which finds regions of high initial co-saliency values with the deep saliency networks and complementary regions through seed propagation over the integrated graph. Given salient regions within each image in terms of single-image saliency, the features extracted from these foregrounds in a group are concatenated with the descriptor of each segment and fed into the inter-image saliency network. The resulting IrIS and IeIS values are combined to produce initial co-saliency maps,
which then provide foreground and background seeds for the seed propagation steps. The unified graph is constructed with the affinity matrices using color similarities, and the newly propagated co-saliency values become complementary components of the final co-saliency maps. The experimental results indicate that the proposed method achieves state-of-the-art performance with modest requirements on computational complexity and input images/groups.
REFERENCES

[1] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 3166–3173.
[2] D. Zhang, J. Han, C. Li, J. Wang, and X. Li, "Detection of co-salient objects by looking deep and wide," Int. J. Comput. Vision, vol. 120, no. 2, pp. 215–232, Nov. 2016.
[3] Q. Wang, W. Zheng, and R. Piramuthu, "GraB: Visual saliency via novel graph model and background priors," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 535–543.
[4] N. Liu and J. Han, "DHSNet: Deep hierarchical saliency network for salient object detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 678–686.
[5] J. Kuen, Z. Wang, and G. Wang, "Recurrent attentional networks for saliency detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 3668–3677.
[6] J. Pan, E. Sayrol, X. Giro-i-Nieto, K. McGuinness, and N. E. O'Connor, "Shallow and deep convolutional networks for saliency prediction," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 598–606.
[7] S. Jetley, N. Murray, and E. Vig, "End-to-end saliency mapping via probability distribution prediction," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 5753–5761.
[8] H. Cholakkal, J. Johnson, and D. Rajan, "Backtracking ScSPM image classifier for weakly supervised top-down saliency," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 5278–5287.
[9] S. S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, and R. V. Babu, "Saliency unified: A deep architecture for simultaneous eye fixation prediction and salient object segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 5781–5790.
[10] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, "Global contrast based salient region detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 409–416.
[11] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu, "Global contrast based salient region detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, Mar. 2015.
[12] W. Zhu, S. Liang, Y. Wei, and J. Sun, "Saliency optimization from robust background detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2014, pp. 2814–2821.
[13] Y. Wei, F. Wen, W. Zhu, and J. Sun, "Geodesic saliency using background priors," in Proc. European Conf. Computer Vision, 2012, pp. 29–42.
[14] H. Fu, D. Xu, B. Zhang, S. Lin, and R. K. Ward, "Object-based multiple foreground video co-segmentation via multi-state selection graph," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3415–3424, Jun. 2015.
[15] H. Fu, X. Cao, and Z. Tu, "Cluster-based co-saliency detection," IEEE Trans. Image Process., vol. 22, no. 10, pp. 3766–3778, Oct. 2013.
[16] D. Zhang, H. Fu, J. Han, and F. Wu, "A review of co-saliency detection technique: Fundamentals, applications, and challenges," arXiv preprint arXiv:1604.07090, Apr. 2016.
[17] M.-M. Cheng, N. J. Mitra, X. Huang, and S.-M. Hu, "SalientShape: Group saliency in image collections," Vis. Comput., vol. 30, no. 4, pp. 443–453, Apr. 2014.
[18] G. Li and Y. Yu, "Visual saliency based on multiscale deep features," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 5455–5463.
[19] ——, "Deep contrast learning for salient object detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 478–487.
[20] Z. Liu, W. Zou, L. Li, L. Shen, and O. Le Meur, "Co-saliency detection based on hierarchical segmentation," IEEE Signal Process. Lett., vol. 21, no. 1, pp. 88–92, Jan. 2014.
[21] Y. Li, K. Fu, Z. Liu, and J. Yang, "Efficient saliency-model-guided visual co-saliency detection," IEEE Signal Process. Lett., vol. 22, no. 5, pp. 588–592, May 2015.
[22] X. Cao, Z. Tao, B. Zhang, H. Fu, and W. Feng, "Self-adaptively weighted co-saliency detection via rank constraint," IEEE Trans. Image Process., vol. 23, no. 9, pp. 4175–4186, Sep. 2014.
[23] D. Zhang, D. Meng, and J. Han, "Co-saliency detection via a self-paced multiple-instance learning framework," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 5, pp. 865–878, May 2017.
[24] D. Zhang, J. Han, J. Han, and L. Shao, "Cosaliency detection based on intrasaliency prior transfer and deep intersaliency mining," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 6, pp. 1163–1176, Jun. 2016.
[25] G. Lee, Y. W. Tai, and J. Kim, "Deep saliency with encoded low level distance map and high level features," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 660–668.
[26] H. Li and K. N. Ngan, "A co-saliency model of image pairs," IEEE Trans. Image Process., vol. 20, no. 12, pp. 3365–3375, Dec. 2011.
[27] Z. Tan, L. Wan, W. Feng, and C.-M. Pun, "Image co-saliency detection by propagating superpixel affinities," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2013, pp. 2114–2118.
[28] D. E. Jacobs, D. B. Goldman, and E. Shechtman, "Cosaliency: Where people look when comparing images," in Proc. ACM Symp. User Interface Software and Technology, 2010, pp. 219–228.
[29] H. T. Chen, "Preattentive co-saliency detection," in Proc. IEEE Int. Conf. Image Processing, 2010, pp. 1117–1120.
[30] H. Li, F. Meng, and K. N. Ngan, "Co-salient object detection from multiple images," IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1896–1909, Dec. 2013.
[31] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.
[32] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learning Representations, 2015.
[33] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. PP, no. 99, pp. 1–1, 2017.
[34] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian, "A real-time algorithm for signal analysis with the help of the wavelet transform," in Wavelets: Time-Frequency Methods and Phase Space, 1990.
[35] J. Dai, K. He, and J. Sun, "Convolutional feature masking for joint object and stuff segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3992–4000.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[37] R. Arandjelovic and A. Zisserman, "Three things everyone should know to improve object retrieval," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 2911–2918.
[38] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, "Aggregating local image descriptors into compact codes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 9, pp. 1704–1716, Sep. 2012.
[39] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, "Deep image retrieval: Learning global representations for image search," in Proc. European Conf. Computer Vision, 2016, pp. 241–257.
[40] G. Tolias, R. Sicre, and H. Jégou, "Particular object retrieval with integral max-pooling of CNN activations," in Proc. Int. Conf. Learning Representations, 2016.
[41] R. Huang, W. Feng, and J. Sun, "Color feature reinforcement for cosaliency detection without single saliency residuals," IEEE Signal Process. Lett., vol. 24, no. 5, pp. 569–573, May 2017.
[42] I. Hwang, S. H. Lee, J. S. Park, and N. I. Cho, "Saliency detection based on seed propagation in a multilayer graph," Multimedia Tools and Applications, vol. 76, no. 2, pp. 2111–2129, 2017.
[43] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Proc. Advances in Neural Information Processing Systems, 2004, pp. 321–328.
[44] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Proc. Advances in Neural Information Processing Systems, 2011, pp. 109–117.
[45] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, "iCoseg: Interactive co-segmentation with intelligent scribble guidance," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010, pp. 3169–3176.
[46] J. Winn, A. Criminisi, and T. Minka, "Object categorization by learned universal visual dictionary," in Proc. IEEE Int. Conf. Computer Vision, 2005, pp. 1800–1807.
[47] X. Li, Y. Li, C. Shen, A. Dick, and A. V. D. Hengel, "Contextual hypergraph modeling for salient object detection," in Proc. IEEE Int. Conf. Computer Vision, 2013, pp. 3328–3335.
[48] Y. Jia and M. Han, "Category-independent object-level saliency detection," in Proc. IEEE Int. Conf. Computer Vision, 2013, pp. 1761–1768.
[49] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[50] T. Liu, J. Sun, N. N. Zheng, X. Tang, and H. Y. Shum, "Learning to detect a salient object," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[51] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Machine Learning, 2015, pp. 448–456.
[52] R. Quan, J. Han, D. Zhang, and F. Nie, "Object co-segmentation via graph optimized-flexible manifold ranking," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 687–695.
[53] X. Chen, A. Shrivastava, and A. Gupta, "Enriching visual knowledge bases via object discovery and segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2014, pp. 2035–2042.
[54] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu, "Unsupervised joint object discovery and segmentation in internet images," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 1939–1946.
[55] A. Joulin, F. Bach, and J. Ponce, "Multi-class cosegmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 542–549.
[56] D. Kuettel, M. Guillaumin, and V. Ferrari, "Segmentation propagation in ImageNet," in Proc. European Conf. Computer Vision, 2012, pp. 459–473.
[57] A. Faktor and M. Irani, "Co-segmentation by composition," in Proc. IEEE Int. Conf. Computer Vision, 2013.