Object cosegmentation using deep Siamese network
Prerana Mukherjee*, Brejesh Lall* and Snehith Lattupally*
*Dept. of EE, IIT Delhi, India. Email: {eez138300, brejesh, eet152695}@ee.iitd.ac.in

Abstract — Object cosegmentation addresses the problem of discovering similar objects from multiple images and segmenting them as foreground simultaneously. In this paper, we propose a novel end-to-end pipeline to segment the similar objects simultaneously from a relevant set of images using supervised learning via a deep-learning framework. We experiment with multiple object proposal generation techniques and perform extensive numerical evaluations by training the Siamese network with the generated object proposals. Similar object proposals for the test images are retrieved using the ANNOY (Approximate Nearest Neighbor) library, and deep semantic segmentation is performed on them. Finally, we form a collage from the segmented similar objects based on the relative importance of the objects.
Keywords — Cosegmentation, Siamese Network, Multiscale Combinatorial Grouping, Nearest Neighbor
I. INTRODUCTION
Automated foreground segregation and localization of objects constitute a fundamental problem in computer vision. The lack of sufficient prior information about the foreground objects makes the problem highly complex. Exploiting a commonness prior and jointly processing similar images (containing objects of the same category) can aid such object-related tasks. Cosegmentation refers to the class of problems that deals with segmenting the common objects from a given set of images without any a priori knowledge about the foreground. It was first hypothesized in [1] that in most cases the common objects for cosegmentation represent the 'objects of interest' which appear in the images, rather than common background details. These objects exhibit huge variations in scale, viewpoint, rotation, illumination, location and affine changes. In other cases, an object may be heavily occluded by other objects. Even objects of the same class may differ drastically in appearance, resulting in high intra-class variation.

The works in [2]–[4] solve generic object cosegmentation by applying the localization constraint that in all the images the common object always belongs to the salient region. Some methods are confined to cosegmentation between image pairs [5]–[7], while others require some user intervention [8], [9]. Further, [5], [10] pose it as segmenting only those objects that are exactly similar. These approaches are unable to handle the intra-class variations or other synthetic changes or noise that may be present in images downloaded from the Internet. In recent years, with the increase in computational power and the widespread availability of semantic annotations for object classes, deep learning has achieved dramatic breakthroughs in various applications. The Siamese network has also been used extensively for many vision applications.
They have been used to learn similarity metrics by pulling similar objects together and pushing dissimilar objects far apart. This motivates us to solve the cosegmentation problem using high-level features extracted with deep networks. We propose to couple the similarity-based clustering and cosegmentation tasks so that they can coexist and benefit from each other synergistically.

In this work, we pose cosegmentation as a clustering problem using the Siamese network. For a given set of images, we train the Siamese twin architecture to assess the similarity of two equally sized patches. These patches are the object proposals of an image. Co-segmenting the objects with the trained model is done using high-level features from fully convolutional networks [11], rather than low-level features like SIFT or HOG. Finally, we create a visual summary from the segmented images based on their similarity score in the respective class. In view of the above discussion, the major contributions of this paper are:
1) Cosegmentation is posed as a clustering problem, using a Siamese network to align the similar objects, which are then segmented. We also train the Siamese network on non-target classes with no to little fine-tuning and test its generalization capability to target classes.
2) Generation of a visual summary of similar images based on relative relevance.
The rest of the paper is organized as follows. In Sec. II, we describe the proposed approach in detail. In Sec. III, we present the results and discussions. Finally, we conclude the paper in Sec. IV.

II. METHODOLOGY
In the following subsections, we describe the components of the proposed method. Fig. 1 shows the overall pipeline of the proposed method.
A. Siamese network
For a given set of images, we first generate object proposals using different object proposal techniques, as described in Sec. III. The generated object proposals are given to the Siamese network for training. Siamese networks are useful in finding similarities and relationships between different structures. The Siamese configuration consists of two convolutional neural networks (CNNs) with shared weights, followed by a contrastive loss layer. The input to the Siamese network is two patches (object proposals) along with a similarity label. Similarity label '1' indicates that the patches are similar, while '0' indicates dissimilar patches. The two CNNs generate N-dimensional feature vectors in the forward pass. The N-dimensional vectors are fed to the contrastive loss layer, which helps adjust the weights such that positive samples are closer and negative samples are farther from each other: the contrastive loss function penalizes positive samples that are far apart and negative samples that are close together.

Fig. 1. Overall architecture (training and testing phases).

Let us consider two patches (x1, x2) that are fed to the Siamese network, and let the N-dimensional vectors generated by the convnets be f(x1) and f(x2). Let Y be the binary label, Y ∈ {0, 1}, with Y = 1 for similar pairs and 0 otherwise. A margin m is defined for the contrastive layer such that positive samples are at a distance less than the margin and negative samples are at a distance greater than the margin. Thus, the contrastive loss function is given as

L(W, Y, x1, x2) = (1/2) Y D_W^2 + (1/2) (1 - Y) {max(0, m - D_W)}^2    (1)

where D_W = ||f(x1) - f(x2)||_2 is the Euclidean distance between the two feature vectors of the input patches. The outputs from the fully-connected layers are fed to the contrastive layer, which measures the distance between the two features. The weights W are adjusted such that the loss function is minimized.

After training the Siamese network, we deploy the trained model on test images. First, we extract the object proposals for the test images. An N-dimensional feature vector is generated for each of the proposals; in our experiments, we used a 256-dimensional feature vector. The features generated for the test image proposals by the trained Siamese network are fed to the Annoy (Approximate Nearest Neighbor) library. It measures the Euclidean distance or cosine distance between vectors. It works by building up a forest of trees using random projections: at every intermediate node of a tree, a hyperplane is generated that divides the space into two parts, chosen such that it is equidistant from two sampled points. The Annoy library allows tuning two parameters: the number of trees and the number of nodes to be checked while searching. The features extracted from the test image proposals are given to the Annoy library, which assigns indices to each of the features.
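The contrastive loss of Eq. (1) can be sketched in plain Python. This is an illustrative implementation of the formula only (not the Caffe layer used in the paper); the embeddings f(x1) and f(x2) are represented as plain lists:

```python
import math

def contrastive_loss(f1, f2, y, margin=1.0):
    """Contrastive loss of Eq. (1): pulls similar pairs (y=1) together,
    pushes dissimilar pairs (y=0) at least `margin` apart."""
    # Euclidean distance D_W between the two embeddings
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
    if y == 1:
        return 0.5 * d ** 2                      # similar: penalize any separation
    return 0.5 * max(0.0, margin - d) ** 2       # dissimilar: penalize only inside the margin

print(contrastive_loss([1.0, 0.0], [1.0, 0.0], y=1))  # 0.0 (identical similar pair)
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0))  # 0.0 (d = 5 > margin, no penalty)
```

Note how the margin only affects dissimilar pairs: once they are farther apart than m, the loss stops pushing them.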
To retrieve the nearest neighbors of any feature, Annoy (https://github.com/spotify/annoy) measures the Euclidean distance to all other features, and the indices of the neighbors are assigned in increasing order of their Euclidean distance. It has several advantages over other nearest neighbor search algorithms: (i) a small memory footprint, and (ii) Annoy creates read-only data structures that can be shared among multiple processes.
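The ordering Annoy approximates with its random-projection trees is simply an exhaustive ranking by Euclidean distance. A brute-force sketch of that retrieval step (the 2-D features here are made-up stand-ins for the 256-D Siamese embeddings):

```python
def rank_neighbors(query, features):
    """Return feature indices sorted by increasing Euclidean distance
    to `query`, i.e. the exact ordering that Annoy approximates."""
    def dist2(v):
        # squared distance preserves the ordering, so the sqrt is skipped
        return sum((a - b) ** 2 for a, b in zip(query, v))
    return sorted(range(len(features)), key=lambda i: dist2(features[i]))

feats = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0]]
print(rank_neighbors([0.0, 0.0], feats))  # [0, 2, 1, 3]
```

Annoy trades the O(n) scan above for a forest of trees whose hyperplane splits prune most candidates, which is what makes it practical at scale.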
B. Segmentation

Segmentation is performed on the retrieved similar object proposals. We use the fully convolutional networks (FCN) for semantic segmentation proposed by Long et al. [11]. Convolutional networks are powerful tools for extracting a hierarchy of features, and FCN was the first approach to generate pixel-wise predictions using supervised learning. Contemporary classification networks are adapted to the segmentation task by transferring their learned representations. FCN uses a skip architecture that combines semantic information from deep layers (coarse information) with shallow layers (fine appearance information). The fully connected layers can be viewed as convolutions with kernels covering the entire image; transforming the FC layers into convolutional layers converts the classification network into one that generates a heat map. However, the generated output maps are of reduced size compared to the input, so dense predictions are made from the coarse maps by upsampling. Upsampling is performed by backward convolution (also called deconvolution) with stride f. Skip layers are added to fuse semantic and appearance information.
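The deconvolution filters used for this upsampling are conventionally initialized with bilinear interpolation weights (Sec. III notes that the deconvolutional layers are fixed to bilinear interpolation). A sketch of the standard kernel construction, assuming the usual FCN-style initialization:

```python
def bilinear_kernel(size):
    """2-D bilinear interpolation kernel of shape size x size, the
    standard initialization for FCN deconvolution (upsampling) layers."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    def w(i):
        # 1-D triangular (hat) weight centered on the kernel
        return 1 - abs(i - center) / factor
    # the 2-D kernel is the outer product of the 1-D weights
    return [[w(y) * w(x) for x in range(size)] for y in range(size)]

# stride-2 upsampling typically uses a 4x4 kernel
for row in bilinear_kernel(4):
    print([round(v, 4) for v in row])  # first row: [0.0625, 0.1875, 0.1875, 0.0625]
```

Because these weights implement plain bilinear interpolation, freezing them (as done here) turns the deconvolution into a fixed, parameter-free upsampler.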
C. Visual Summary based on relative importance

A visual summary is created from the segmented proposals. While retrieving the similar object proposals with the Annoy library, we preserve the Euclidean distance corresponding to each proposal. A basic collage is formed with 10 slots, with the most similar proposal (least Euclidean distance) getting the larger block. The remaining segmented objects are placed in the other slots, and a background is added to the image.

III. EXPERIMENTAL RESULTS
In this section, we discuss the empirical results on two publicly available benchmark co-segmentation datasets. We describe the datasets used, followed by the implementation details and baselines. Caffe [12] is used for constructing the Siamese network.
Datasets. The MSRC dataset [13] consists of 14 categories; each category consists of 30 images of dimension 213x320. The iCoseg dataset [8] consists of 38 categories; each category consists of about 20 to 30 images of size 300x500.
Baselines and Parameter setting. We report results with two baselines. The first baseline involves training the Siamese network with pretrained ILSVRC [14] models; the weights are fine-tuned for the target classes in the datasets, and then segmentation is performed on the clustered test set data. In the second baseline, we train the network on non-target classes and test its generalization ability on target classes. We evaluate on two objective measures: precision (P̄) and Jaccard similarity (J̄). P̄ indicates the fraction of the pixels in the segmented image common with the ground truth; J̄ is the intersection-over-union measure with the ground truth images.

Fig. 2. Performance analysis of various object proposal generation methods with the proposed architecture (iCoseg and MSRC).
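As one concrete reading of these definitions (the paper does not spell out its per-class averaging), the two measures can be computed from binary masks encoded as nested 0/1 lists, an assumed representation for illustration:

```python
def precision_and_jaccard(pred, gt):
    """P: fraction of predicted foreground pixels also foreground in the
    ground truth; J: intersection over union of the two foreground masks."""
    inter = union = pred_fg = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += p and g      # pixel foreground in both masks
            union += p or g       # pixel foreground in either mask
            pred_fg += p          # pixel predicted as foreground
    p_bar = inter / pred_fg if pred_fg else 0.0
    j_bar = inter / union if union else 0.0
    return p_bar, j_bar

pred = [[1, 1], [0, 0]]
gt   = [[1, 0], [0, 0]]
print(precision_and_jaccard(pred, gt))  # (0.5, 0.5)
```

The reported P̄ and J̄ would then be these values averaged over images and classes.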
We generated object proposals using different methods and evaluated their performance on these metrics. The techniques used are Multiscale Combinatorial Grouping (MCG) [15], Selective Search (SS) [16], Objectness (Obj) [17], SalProp [18] and Edgeboxes [19]. We further perform non-maximal suppression and near-duplicate rejection on the proposal set, preserving the top-10 object proposals so that all the object instances in the images are covered. We used the GoogLeNet architecture [20] for training the Siamese network in our experiments, with transfer learning: the weights are initialized with pre-trained model weights and then fine-tuned using back propagation. The Siamese network is trained and N-dimensional (N=256) features are extracted for the test images. The N-dimensional features are fed to ANNOY, and similar object proposals are retrieved. The parameters used for the Annoy library are the number of trees, n_trees = 350, and the number of nodes to inspect during searching, search_k = 50. Similar object proposals are segmented using FCN-based semantic segmentation as discussed in Sec. II-B. We trained the Siamese architecture by employing standard backpropagation on feed-forward nets, using stochastic gradient descent with momentum to adjust the weights. The mini-batch size was set to 128, with an equal learning rate of 0.01 for all layers. The number of iterations was set to 100,000 and the contrastive loss margin to 1.

Fig. 3. Visualization of the iCoseg and MSRC training sets using t-SNE, before and after fine-tuning.

We also trained the Siamese network on datasets which contain classes similar (but not identical) to those of the iCoseg and MSRC datasets: we used the Pascal [21], Animals [22] and Coseg-Rep [23] datasets to train the Siamese model and tested on iCoseg and MSRC. Initially, we randomly selected positive and negative pairs for training the Siamese network. However, once most of the pairs are correctly learned, the Siamese network cannot learn anything more from them. To address this issue, we used the aggressive mining strategy of [24] to prepare hard negative and positive pairs.
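The aggressive-mining idea of [24] can be sketched as selecting, from a candidate pool, the pairs the current network handles worst: hard positives are similar pairs whose embeddings are still far apart, hard negatives are dissimilar pairs that are still close. The selection logic below is illustrative only; the distances are made up:

```python
def mine_hard_pairs(pairs, k):
    """pairs: list of (distance, label), label 1 = similar pair.
    Returns the k hardest positives (largest embedding distance)
    and the k hardest negatives (smallest embedding distance)."""
    pos = sorted([p for p in pairs if p[1] == 1], key=lambda p: -p[0])
    neg = sorted([p for p in pairs if p[1] == 0], key=lambda p: p[0])
    return pos[:k], neg[:k]

pairs = [(0.1, 1), (0.9, 1), (0.2, 0), (1.5, 0)]
print(mine_hard_pairs(pairs, 1))  # ([(0.9, 1)], [(0.2, 0)])
```

Training batches built from such pairs keep the contrastive loss informative after the easy pairs have been learned.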
Results. We divided the iCoseg dataset into 80% training samples and 20% test samples for each class. For the MSRC dataset the split was 70%-30% (training-test). The P̄ and J̄ results are shown in Fig. 2. It can be observed that the Siamese network fed with MCG proposals outperforms all other object proposal generation techniques, the closest being SalProp, followed by SS, Obj and Edgeboxes. For both datasets, the average precision and Jaccard index over all classes with MCG proposals are higher than with the SalProp technique, the gap being on average 2.48% and 1.84% in P̄ and J̄ respectively. The 256-D feature vectors of the training set are visualized using t-SNE (t-Distributed Stochastic Neighbor Embedding), as shown in Fig. 3. First, for the high-dimensional data, a probability distribution is built such that similar objects are selected with high probability and dissimilar points have very low probability of being selected. In the second step, analogously to the high-dimensional map, a probability distribution over the points in the low-dimensional map is constructed. The color coding in the t-SNE plots corresponds to the object classes in the respective datasets. The Siamese network with post-processing separates the classes better than before fine-tuning. As can be seen, the clusters of classes are well separated, with only a few clusters of confusion.

We computed the average precision and Jaccard similarity and compared with other state-of-the-art methods in Tables I-II. On testing with the complete iCoseg dataset, we achieve a gain in P̄ of 27.27% over Joulin et al. [25] and 23.52% over Kim et al. [26]; Quan et al. [27] outperform the proposed technique (Siamese (MCG) + FCN segmentation) by a margin of 9.67% in P̄ and 13.15% in J̄. Similarly, on the MSRC dataset, we achieve a gain in P̄ of 20% over Joulin et al. [25], 9.09% over Jian et al. [28] and 44.82% over Kim et al. [26], and a gain in J̄ of 15.51% over Yong Li [3]; Rubinstein et al. [29] outperform the proposed technique (Siamese (MCG) + FCN segmentation) by a margin of 8.6% in P̄ and 1.47% in J̄. In the FCN segmentation, we used the VGGNet architecture with the FC layers replaced by convolutional layers; the deconvolutional layers are fixed using bilinear interpolation. We abstain from using any auxiliary training and use the pretrained weights to avoid over-fitting the FCN segmentation network. The object proposals that are similar and clustered together are fed as input to FCN segmentation to obtain the co-segmentation results. We consider for the cosegmentation task only those object proposals whose intersection-over-union (IoU) score meets a fixed threshold. Since the segmentation is performed on tight object proposals, it segments the regions specific to the object class only, and thus refrains from performing semantic segmentation over the entire image. Owing to the performance boost from aggressive mining, we achieve an average gain of 4% and 2.52% in P̄ and J̄ respectively on both datasets over training with MCG proposals alone (Siamese (MCG) + FCN segmentation). Fig. 4 shows qualitative results on a few example classes from the datasets. However, it is important to note that, since we perform cosegmentation on a subset of the images owing to the retrieval results, we observe a drop in performance with respect to a few reported techniques. The advantages of the proposed technique over the other compared techniques ([3], [25]–[27], [29]–[31]) are: (i) co-segmenting without explicit knowledge of a localization prior in the form of a saliency map; (ii) the co-segmentation pipeline is formulated as clustering followed by segmentation of the similar object classes, thus eliminating the need to provide as input the relevant set of class-specific images, as required in graph-based co-segmentation techniques. The proposed method takes less than 100 ms to generate the 256-dimensional features from the trained Siamese network.
Fig. 4. Visual segmentation results on the iCoseg and MSRC datasets. The first three rows are classes of iCoseg (Cheetah, Panda, Taj-Mahal) and the next two rows are from MSRC (Car, Cow).

Fig. 5. Example of collage results for the Chair class (MSRC).

TABLE I. Comparison of average precision and Jaccard similarity with state-of-the-art methods ('-' indicates that the metric has not been provided in the respective paper) on the iCoseg dataset.
Method              P̄      J̄
Rubinstein [29]     0.88    0.674
Joulin [25]         0.66    -
Kim [26]            0.68    -
Keuttel [32]        0.91    -
Quan [27]           0.93    0.76
Fanman Meng [30]    -       0.71
Faktor [31]         0.92    0.70

Trained on 80% and tested with 20%:
Siamese (Edgeboxes) + FCN segmentation
Siamese (Obj) + FCN segmentation
Siamese (SS) + FCN segmentation
Siamese (SalProp) + FCN segmentation
Siamese (MCG) + FCN segmentation

Trained on 80% and tested with 100%:
Siamese (Edgeboxes) + FCN segmentation
Siamese (Obj) + FCN segmentation
Siamese (SS) + FCN segmentation
Siamese (SalProp) + FCN segmentation
Siamese (MCG) + FCN segmentation

Trained on Pascal+Animals+Coseg-Rep and tested on iCoseg:
Siamese (MCG) + FCN segmentation
Siamese (MCG) + FCN segmentation - aggressive mining
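The IoU gate on proposals discussed in the results above can be sketched for axis-aligned boxes (x1, y1, x2, y2); the specific threshold value did not survive in the source, so it is left as a parameter here:

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if disjoint)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def keep_proposals(proposals, ref, thresh):
    """Keep only the proposals whose IoU with `ref` meets the threshold."""
    return [p for p in proposals if box_iou(p, ref) >= thresh]

print(box_iou((0, 0, 2, 2), (1, 0, 3, 2)))  # 0.333... (overlap 2, union 6)
```

Gating on IoU in this way keeps only tight proposals, which is why the FCN segments object-specific regions rather than the whole image.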
We create a visual summary of the co-segmented similar objects. We preserve the Euclidean distances while retrieving the similar objects. The image is divided into different blocks, and objects are placed such that the object with the least Euclidean distance is at the center. A suitable background is added to improve the visual appearance. Fig. 5 shows a sample collage formed from the Chair class in MSRC: a 512x512 image is divided into 10 blocks, with a blue sky background, to form the collage. Future work will aim to further improve the segmentation results and to utilize more cues for the relevance ranking in collage generation.
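The slot assignment described above can be sketched as: sort the segments by their stored Euclidean distance and give the closest match the large central slot. The 10-slot geometry is only suggested here by the slot indices; names and distances are made up for illustration:

```python
def assign_slots(segments, n_slots=10):
    """segments: list of (name, distance). The most similar segment
    (smallest distance) takes slot 0, the large central block; the
    rest fill the remaining slots in order of similarity."""
    ranked = sorted(segments, key=lambda s: s[1])
    return {slot: name for slot, (name, _) in enumerate(ranked[:n_slots])}

segs = [("chair_b", 0.7), ("chair_a", 0.2), ("chair_c", 1.1)]
print(assign_slots(segs))  # {0: 'chair_a', 1: 'chair_b', 2: 'chair_c'}
```

Compositing each segment into its block over the background image would complete the collage.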
IV. CONCLUSION
We addressed object cosegmentation, posing it as a clustering problem using a deep Siamese network to align the similar images, which are then segmented using semantic segmentation. Additionally, we compared the performance of various object proposal generation schemes within the Siamese architecture.
TABLE II. Comparison of average precision and Jaccard similarity with state-of-the-art methods ('-' indicates that the metric has not been provided in the respective paper) on the MSRC dataset.
Method              P̄      J̄
Rubinstein [29]     0.92    0.68
Joulin [25]         0.70    -
Jian Sun [28]       0.77    0.54
Faktor [31]         0.89    0.73
Kim [26]            0.58    -
Yong Li [3]         -       0.58

Trained on 70% and tested with 30%:
Siamese (Edgeboxes) + FCN segmentation
Siamese (Obj) + FCN segmentation
Siamese (SS) + FCN segmentation
Siamese (SalProp) + FCN segmentation
Siamese (MCG) + FCN segmentation

Trained on 70% and tested with 100%:
Siamese (Edgeboxes) + FCN segmentation
Siamese (Obj) + FCN segmentation
Siamese (SS) + FCN segmentation
Siamese (SalProp) + FCN segmentation
Siamese (MCG) + FCN segmentation

Trained on Pascal+Animals+Coseg-Rep and tested on MSRC:
Siamese (MCG) + FCN segmentation
Siamese (MCG) + FCN segmentation - aggressive mining
We performed extensive evaluation on the iCoseg and MSRC datasets and demonstrated that deep features can encode the commonness prior and thus provide a more discriminative representation.
REFERENCES
[1] S. Vicente, C. Rother, and V. Kolmogorov, "Object cosegmentation," in IEEE CVPR, 2011, pp. 2217–2224.
[2] W. Wang, J. Shen, X. Li, and F. Porikli, "Robust video object cosegmentation," IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3137–3148, 2015.
[3] Y. Li, J. Liu, Z. Li, H. Lu, and S. Ma, "Object co-segmentation via salient and common regions discovery," Neurocomputing, vol. 172, pp. 225–234, 2016.
[4] L. Huang, R. Gan, and G. Zeng, "Object cosegmentation by similarity propagation with saliency information and objectness frequency map," pp. 906–911, 2016.
[5] C. Rother, T. Minka, A. Blake, and V. Kolmogorov, "Cosegmentation of image pairs by histogram matching-incorporating a global constraint into MRFs," in IEEE CVPR, vol. 1, 2006, pp. 993–1000.
[6] F. Wang, Q. Huang, and L. Guibas, "Image co-segmentation via consistent functional maps," in IEEE ICCV, 2013, pp. 849–856.
[7] A. Joulin, F. Bach, and J. Ponce, "Multi-class cosegmentation," in IEEE CVPR, 2012, pp. 542–549.
[8] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, "iCoseg: Interactive co-segmentation with intelligent scribble guidance," in IEEE CVPR, 2010, pp. 3169–3176.
[9] A. Kowdle, D. Batra, W.-C. Chen, and T. Chen, "iModel: Interactive co-segmentation for object of interest 3D modeling," in Trends and Topics in Computer Vision. Springer, 2010, pp. 211–224.
[10] L. Mukherjee, V. Singh, and C. R. Dyer, "Half-integrality based algorithms for cosegmentation of images," in IEEE CVPR, 2009, pp. 2028–2035.
[11] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in IEEE CVPR, 2015, pp. 3431–3440.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[13] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation," in ECCV. Springer, 2006, pp. 1–15.
[14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[15] J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping for image segmentation and object proposal generation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 128–140, 2017.
[16] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[17] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2189–2202, 2012.
[18] P. Mukherjee, B. Lall, and S. Tandon, "SalProp: Salient object proposals via aggregated edge cues," arXiv preprint arXiv:1706.04472, 2017.
[19] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in ECCV, 2014, pp. 391–405.
[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE CVPR, 2015, pp. 1–9.
[21] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[22] H. M. Afkham, A. T. Targhi, J.-O. Eklundh, and A. Pronobis, "Joint visual vocabulary for animal classification," in ICPR, 2008, pp. 1–4.
[23] J. Dai, Y. Nian Wu, J. Zhou, and S.-C. Zhu, "Cosegmentation and cosketch by unsupervised learning," in IEEE ICCV, 2013, pp. 1305–1312.
[24] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer, "Discriminative learning of deep convolutional feature point descriptors," in IEEE ICCV, 2015, pp. 118–126.
[25] A. Joulin, F. Bach, and J. Ponce, "Discriminative clustering for image co-segmentation," in IEEE CVPR, 2010, pp. 1943–1950.
[26] G. Kim and E. P. Xing, "On multiple foreground cosegmentation," in IEEE CVPR, 2012, pp. 837–844.
[27] R. Quan, J. Han, D. Zhang, and F. Nie, "Object co-segmentation via graph optimized-flexible manifold ranking," in IEEE CVPR, 2016, pp. 687–695.
[28] J. Sun and J. Ponce, "Learning dictionary of discriminative part detectors for image categorization and cosegmentation," International Journal of Computer Vision, vol. 120, no. 2, pp. 111–133, 2016.
[29] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu, "Unsupervised joint object discovery and segmentation in internet images," in IEEE CVPR, June 2013.
[30] F. Meng, J. Cai, and H. Li, "Cosegmentation of multiple image groups," Computer Vision and Image Understanding, vol. 146, pp. 67–76, 2016.
[31] A. Faktor and M. Irani, "Co-segmentation by composition," in IEEE ICCV, 2013, pp. 1297–1304.
[32] D. Kuettel, M. Guillaumin, and V. Ferrari, "Segmentation propagation in ImageNet," in ECCV, 2012.