Toward unsupervised, multi-object discovery in large-scale image collections
Huy V. Vo, Patrick Pérez, and Jean Ponce
INRIA, Paris, France; Département d'informatique de l'ENS, ENS, CNRS, PSL University, Paris, France; Valeo.ai
Abstract.
This paper addresses the problem of discovering the objects present in a collection of images without any supervision. We build on the optimization approach of Vo et al. [34] with several key novelties: (1) We propose a novel saliency-based region proposal algorithm that achieves significantly higher overlap with ground-truth objects than other competitive methods. This procedure leverages off-the-shelf CNN features trained on classification tasks without any bounding box information, but is otherwise unsupervised. (2) We exploit the inherent hierarchical structure of proposals as an effective regularizer for the approach to object discovery of [34], boosting its performance to significantly improve over the state of the art on several standard benchmarks. (3) We adopt a two-stage strategy to select promising proposals using small random sets of images before using the whole image collection to discover the objects it depicts, allowing us to tackle, for the first time (to the best of our knowledge), the discovery of multiple objects in each one of the pictures making up datasets with up to 20,000 images, an over five-fold increase compared to existing methods, and a first step toward true large-scale unsupervised image interpretation.
Keywords:
Object discovery, large-scale, optimization, region proposals, unsupervised learning.
Object discovery, that is, finding the location of salient objects in images without using any source of supervision, is a fundamental scientific problem in computer vision. It is also potentially an important practical one, since any effective solution would serve as a reliable free source of supervision for other tasks such as object categorization, object detection and the like. While many of these tasks can be tackled using massive amounts of annotated data, the manual annotation process is complex and expensive at large scales. Combining the discovery results with a limited amount of annotated data in a semi-supervised setting is a promising alternative to current data-hungry supervised approaches [35].

Vo et al. [34] posit that image collections possess an implicit graph structure. The pictures themselves are the nodes, and an edge links two images when they share similar visual content. They propose the object and structure discovery framework (OSD) to localize objects and find the graph structure simultaneously by solving an optimization problem. Though it demonstrates promising results, [34] has several shortcomings, e.g., the use of supervised region proposals and limitations in addressing large image collections (see Section 2). Our work builds on OSD, aims to alleviate its limitations and improves it to effectively discover multiple objects in large image collections. Our contributions are:

• We propose a simple but effective method for generating region proposals directly from CNN features (themselves trained beforehand on some auxiliary task [29] without bounding boxes) in an unsupervised way (Section 3.1). Our algorithm gives on average half the number of region proposals per image compared to selective search [33], edgeboxes [40] or randomized Prim [23], yet significantly outperforms these off-the-shelf region proposals in object discovery (Table 3).
• Leveraging the intrinsic structure of region proposals generated by our method allows us to add an additional constraint to the OSD formulation that acts as a regularizer on its behavior (Section 3.2). This new formulation (rOSD) significantly outperforms the original algorithm and allows us to effectively perform multi-object discovery, a setting never studied before (to the best of our knowledge) in the literature.

• We propose a two-stage algorithm to make rOSD applicable to large image collections (Section 3.3). In the first stage, rOSD is used to choose a small set of good region proposals for each image. In the second stage, these proposals and the full image collection are fed to rOSD to find the objects and the image graph structure.

• We demonstrate that our approach yields significant improvements over the state of the art in object discovery (Tables 4 and 5). We also run our two-stage algorithm on a new and much larger dataset with 20,000 images and show that it significantly outperforms plain OSD in this setting (Table 7).

The only supervisory signal used in our setting are the image labels used to train CNN features in an auxiliary classification task (see [21,35] for similar approaches in the related colocalization domain). We use CNN features trained on ImageNet classification [29], without any bounding box information. Our region proposal and object discovery algorithms are otherwise fully unsupervised.
Region proposals have been used in object detection/discovery to serve as object priors and reduce the search space. In most cases, they are found either by a bottom-up approach in which low-level cues are aggregated to rank a large set of boxes obtained with sliding window approaches [1,33,40] and return the top windows as proposals, or by training a model to classify them (as in randomized Prim [23], see also [26]), with bounding box supervision. Edgeboxes [40] and selective search [33] are popular off-the-shelf algorithms used to generate region proposals in object detection [13,14], weakly supervised object detection [7,31] or image colocalization [21]. Note, however, that the features used to generate proposals in these algorithms and those representing them in the downstream tasks are generally different in nature: Typically, region proposals are generated from low-level features such as color and texture [33] or edge density [40], but CNN features are used to represent them in downstream tasks. However, the Region Proposal Network in Faster-RCNN [26] shows that proposals generated directly from the features used in the object detection task itself give a great boost in performance. In the object discovery setting, we therefore propose a novel approach for generating region proposals in an unsupervised way from CNN features trained on an auxiliary classification task without bounding box information. Features from CNNs trained on large-scale image classification have also been used to localize objects in the weakly supervised setting. Zhou et al. [39] and Selvaraju et al. [28] fine-tune a pre-trained CNN to classify images and construct class activation maps, as weighted sums of convolutional feature maps or their gradients with respect to the classification loss, for localizing objects in these images. Tang et al.
[32] generate region proposals to perform weakly supervised object detection on a set of labelled images by training a proposal network using the images' labels as supervision. Contrary to these works, we generate region proposals using only pre-trained CNN features without fine-tuning the feature extractor. Moreover, our region proposals come with a nice intrinsic structure which can be exploited to improve object discovery performance.

Early work on object discovery [12,15,20,27,30] focused on a restricted setting where images come from only a few distinctive object classes. Cho et al. [6] propose an approach for object and structure discovery combining a part-based matching technique and an iterative match-then-localize algorithm, using off-the-shelf region proposals as primitives for matching. Vo et al. [34] reformulate [6] in an optimization framework and obtain significantly better performance. Image colocalization can be seen as a narrow setting of object discovery where all images in the collection contain objects from the same class. Observing that supervised object detectors often assign high scores to only a small number of region proposals, Li et al. [21] propose to mimic this behavior by training a classifier to minimize the entropy of the scores it gives to region proposals. Wei et al. [35] localize objects by clustering pixels with high activations in feature maps from CNNs pre-trained on ImageNet. All of the above works, however, focus on discovering only the main object in the images and target small-to-medium-scale datasets. Our approach, which is based on a modified version of the OSD formulation of Vo et al. [34] and on pre-trained CNN features, offers an effective and efficient solution for discovering multiple objects in images in large-scale datasets. The recent work of Hsu et al. [18] on instance co-segmentation can also be adapted for localizing multiple objects in images.
However, it requires input images to contain an object of a single dominant class, while in our setting images may instead contain several objects from different categories.
Object and structure discovery (OSD) [34].
Since our work is built on [34], we give a short recap of this work in this section. Given a collection of n images, possibly containing objects from different categories, each equipped with p region proposals (which can be obtained using selective search [33], edgeboxes [40], randomized Prim [23], etc.) and a set of potential neighbors, the unsupervised object and structure discovery problem (OSD) is formalized in [34] as follows: Let us define the variable e as an element of {0,1}^{n×n} with a zero diagonal, such that e_{ij} = 1 when images i and j are linked by a (directional) edge, and e_{ij} = 0 otherwise, and the variable x as an element of {0,1}^{n×p}, with x_{ki} = 1 when region proposal number k corresponds to visual content shared with neighbors of image i in the graph. This leads to the following optimization problem:

$$\max_{x,e}\ S(x,e)=\sum_{i=1}^{n}\sum_{j\in N(i)} e_{ij}\, x_i^{\top} S_{ij}\, x_j, \quad \text{s.t.} \quad \sum_{k=1}^{p} x_{ki}\le\nu \ \text{ and } \ \sum_{j\ne i} e_{ij}\le\tau \quad \forall i, \tag{1}$$

where N(i) is the set of potential neighbors of image i, S_{ij} is a p × p matrix whose entry S^{kl}_{ij} measures the similarity between regions k and l of images i and j, and ν and τ are predefined constants corresponding respectively to the maximum number of objects present in an image and to the maximum number of neighbors an image may have. This is however a hard combinatorial optimization problem. As shown in [34], an approximate solution can be found by (a) a dual gradient ascent algorithm for a continuous relaxation of Eq. (1) with exact updates obtained by maximizing a supermodular cubic pseudo-Boolean function [4,24], (b) a simple greedy scheme, or (c) a combination thereof. Since solving the continuous relaxation of Eq.
(1) is computationally expensive and may be less effective for large datasets [34], we only consider version (b) of OSD in our analysis.

OSD has some limitations: (1) Although the algorithm itself is fully unsupervised, it gives by far its best results with region proposals from randomized Prim [23], a region proposal algorithm trained with bounding box supervision. (2) Vo et al. use whitened HOG (WHO) [16] to represent region proposals in their implementation, although CNN features work better on the similar image colocalization problem [21,35]. In our experiments, naively switching to CNN features does not give consistent improvement on common benchmarks (OSD with CNN features gives CorLoc values of 82.9 and 71.1, for instance).

We address the limitation of using off-the-shelf region proposals of [34] with insights gained from the remarkably effective method for image colocalization proposed by Wei et al. [35]: CNN features pre-trained for an auxiliary task, such as ImageNet classification, give a strong, category-independent signal for unsupervised tasks. In retrospect, this insight is not particularly surprising, and it is implicit in several successful approaches to image retrieval [38] or co-saliency detection [2,3,19,36]. Wei et al. [35] use it to great effect in the image colocalization task. Feeding an image to a pre-trained convolutional neural network yields a set of feature maps represented as a 3D tensor (e.g., a convolutional layer of VGG16 [29] or ResNet [17]). Wei et al. [35] observe that the "image" obtained by simply adding the feature maps gives hints to the locations of the objects it contains, and identify objects by clustering pixels with high activation. Similar but different from them, we observe that local maxima in the above "images" correspond to salient parts of objects in the original image, and we propose to exploit this observation for generating region proposals directly from CNN features. As we do not make use of any annotated bounding boxes, our region proposal algorithm itself is indeed unsupervised. Our method consists of the following steps. First, we feed the image to a pre-trained convolutional neural network to obtain a 3D tensor of size (H × W × D), noted F. Adding elements of the tensor along its depth dimension yields an (H × W) 2D saliency map, noted s_g (global saliency map), showing salient locations in the image, with each location in s_g being represented by the corresponding D-dimensional feature vector from F.
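As a concrete sketch of these saliency computations, the snippet below computes the global saliency map by summing a feature tensor over depth, a local saliency map from normalized dot products with a peak's feature vector, and a box around a thresholded connected component, as in the rest of this section. This is our own illustration, assuming NumPy and a feature tensor F of shape (H, W, D); all function names are ours, not the authors' implementation.

```python
import numpy as np

def global_saliency(F):
    """Sum an (H, W, D) CNN feature tensor over depth -> (H, W) saliency map s_g."""
    return F.sum(axis=2)

def local_saliency(F, peak):
    """Dot products between L2-normalized features and the peak's feature -> s_y."""
    H, W, D = F.shape
    feats = F.reshape(-1, D).astype(float)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12
    py, px = peak
    return (feats @ feats[py * W + px]).reshape(H, W)

def box_from_threshold(s_local, peak, thresh):
    """Bounding box of the 4-connected component of `peak` among locations
    whose local saliency is at least `thresh` (one proposal per threshold)."""
    H, W = s_local.shape
    mask = s_local >= thresh
    if not mask[peak]:
        return None
    stack, seen = [peak], {peak}
    while stack:  # flood fill from the peak
        y, x = stack.pop()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and (ny, nx) not in seen:
                seen.add((ny, nx))
                stack.append((ny, nx))
    ys = [p[0] for p in seen]
    xs = [p[1] for p in seen]
    return (min(xs), min(ys), max(xs), max(ys))  # box on the feature grid
```

Sweeping `thresh` over linearly spaced values then yields the tens of proposals per local maximum described below.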
Next, we find robust local maxima in this saliency map using persistence, a measure used in topological data analysis [5,8,9,25,41] to find critical points of a function (see Section 4.2 for details). We find regions around each local maximum y using a local saliency map s_y of the same size as the global one. The value at any location in s_y is the dot product between the normalized feature vectors at that location and at the local maximum. By construction, the local saliency map highlights locations that are likely to belong to the same object as the corresponding local maximum. Finally, for each local saliency map, we discard all locations with scores below some threshold, and the bounding box around the connected component containing the corresponding local maximum is returned as a region proposal. By varying the threshold, we can obtain tens of region proposals per local saliency map. An example illustrating the whole process is shown in Fig. 1.

Fig. 1: Illustration of the unsupervised region proposal generation process. The top row shows the original image, the global saliency map s_g, local maxima of s_g and three local saliency maps s_y from three local maxima (marked by red stars). The next three rows illustrate the proposal generation process on the local saliency maps: From left to right, we show in green the connected component formed by pixels with saliency above decreasing thresholds and, in red, the corresponding region proposals.

Due to the greedy nature of OSD [34], its block-coordinate ascent iterations are prone to bad local maxima. Vo et al. [34] attempt to resolve this problem by using a larger value of ν in the optimization than the actual number of objects they intend to retrieve (which is one in their case) to diversify the set of retained regions in each iteration. The final region in each image is then chosen amongst its retained regions in a post-processing step by ranking these using a new score based solely on their similarity to the retained regions in the image's neighbors. Increasing ν in fact gives limited help in diversifying the set of retained regions. Since there is redundancy in object proposals, with many highly overlapping regions, the ν retained regions are often nearly identical (see the supplementary document for a visual illustration). This phenomenon also prevents OSD from retrieving multiple objects in images. One can use the ranking in OSD's post-processing step with non-maximum suppression to return more than one region from the ν retained regions, but since these regions are often highly overlapping, this fails to localize multiple objects.

By construction, proposals produced by our approach also contain many highly overlapping regions, especially those generated from the same local maximum in the saliency map. However, they come with a nice intrinsic structure: Proposals in an image can be partitioned into groups labelled by the local maximum from which they are generated. Naturally, it makes sense to impose that at most one region in a group is retained in OSD, since they are supposed to correspond to the same object. This additional constraint also conveniently helps to diversify the set of proposals returned by the block-coordinate ascent procedure by avoiding retaining highly overlapping regions. Concretely, let G_{ig} be the set of region proposals in image i generated from the g-th local maximum in its global saliency map s_g, with 1 ≤ g ≤ L_i, where L_i is the number of local maxima in s_g. We propose to add the constraints $\sum_{k \in G_{ig}} x_{ki} \le 1 \ \ \forall i, g$ to Eq. (1). We coin the new formulation regularized OSD (rOSD). Similar to OSD, a solution to rOSD can be obtained by a greedy block-coordinate ascent algorithm whose iterations are illustrated in the supplementary document. We will demonstrate the effectiveness of rOSD compared to OSD and the state of the art in Section 4.

The optimization algorithm of Vo et al.
[34] requires loading all score matrices S_{ij} into memory (they can also be computed on the fly, but at an unacceptable computational cost). The corresponding memory cost is M = (Σ_{i=1}^{n} |N(i)|) × K, determined by two main factors: the number of image pairs considered, Σ_{i=1}^{n} |N(i)|, and the number K of positive entries in the matrices S_{ij}. To reduce the cost on larger datasets, Vo et al. [34] pre-filter the neighborhood of each image (|N(i)| ≤ 100 for classes with more than 1000 images) and limit K to 1000. This value of K is approximately the average number of proposals in each image, and it is intentionally chosen to make sure that S_{ij} is not too sparse, in the sense that approximately every proposal in image i should have a positive match with some proposal in image j. Further reducing the number of positive entries in score matrices is likely to hurt the performance (Table 7), while a number of 100 potential neighbors is already small and cannot be significantly lowered. Effectively scaling up OSD therefore requires lowering considerably the number of proposals it uses. To this end, we propose two different interpretations of the image graph and exploit both to scale up OSD.

Two different interpretations of the image graph.
The image graph G = (x, e) obtained by solving Eq. (1) can be interpreted as capturing the "true" structure of the input image collection. In this case, ν is typically small (say, 1 to 5) and the discovered "objects" correspond to maximal cliques of G, with instances given by active regions (x_{ki} = 1) associated with nodes in the clique. But it can also be interpreted as a proxy for that structure. In this case, we typically take ν larger (say, 50). The active regions found for each node x_i of G are interpreted as the most promising regions in the corresponding image, and the active edges e_{ij} link it to other images supporting that choice. We dub this variant proxy OSD. For small image collections, it makes sense to run OSD only. For large ones, we propose instead to split the data into random groups of fixed size, run proxy OSD on each group to select the most promising region proposals in the corresponding images, then run OSD using these proposals. Using this two-stage algorithm, we significantly reduce the number of image pairs in each run of the first stage, thus permitting the use of denser score matrices in these runs. In the second stage, since only a very small number of region proposals are considered in each image, we need to keep only a few positive entries in each score matrix and are able to run OSD on the entire image collection. Our approach for large-scale object discovery is summarized in the supplementary material.
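The two-stage procedure can be sketched as follows. Here `run_osd` is a stand-in for a full (r)OSD optimization run returning, for each image, the indices of its retained regions; it and the other names are ours, for illustration only, not the authors' implementation.

```python
import random

def two_stage_discovery(images, proposals, run_osd, group_size=1000,
                        nu_proxy=50, nu_final=5, tau=10):
    """Two-stage large-scale discovery sketch.

    images:    list of image ids.
    proposals: dict image id -> list of region proposals.
    run_osd:   callable (images, proposals, nu, tau) -> dict image id -> list
               of retained proposal indices (placeholder for (r)OSD itself).
    """
    # Stage 1: proxy OSD on random groups with a large nu, to shortlist the
    # most promising proposals per image while keeping score matrices dense.
    order = list(images)
    random.shuffle(order)
    shortlist = {}
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        shortlist.update(run_osd(group, proposals, nu=nu_proxy, tau=tau))

    # Stage 2: OSD on the whole collection, restricted to the shortlisted
    # proposals; few regions per image, so sparse score matrices suffice.
    pruned = {i: [proposals[i][k] for k in shortlist[i]] for i in images}
    return run_osd(list(images), pruned, nu=nu_final, tau=tau)
```

With a dummy `run_osd` that keeps the first nu proposals, this reduces each image's proposal list before the final full-collection run, mirroring the memory argument above.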
(Since the analysis in the previous section applies to both OSD and rOSD, we refer to both as OSD for ease of notation.)

Similar to previous works on object discovery [6,34] and image colocalization [21,35], we evaluate object discovery performance with our proposals on four datasets: Object Discovery (OD), VOC_6x2, VOC_all and VOC12. OD is a small dataset with three classes, airplane, car and horse, and 100 images per class, among which 18, 11 and 7 images respectively are outliers (images not including an object of the corresponding class). VOC_all is a subset of the PASCAL VOC 2007 dataset [11] obtained by eliminating all images containing only difficult or truncated objects, as well as difficult or truncated objects in retained images. It has 3550 images and 6661 objects. VOC_6x2 is a subset of VOC_all which contains images of 6 classes (aeroplane, bicycle, boat, bus, horse and motorbike) divided into 2 views, left and right. In total, VOC_6x2 contains 463 images of 12 classes. VOC12 is a subset of the PASCAL VOC 2012 dataset [10] obtained in the same way as VOC_all. It contains 7838 images and 13957 objects. For large-scale experiments, we randomly choose 20000 images from the training set of COCO [22] and eliminate those containing only crowd bounding boxes, as well as bounding boxes marked as crowd in retained images. The resulting dataset, which we call COCO_20k, has 19817 images and 143951 objects.

As the single-object discovery and colocalization performance measure, we use correct localization (CorLoc), defined as the percentage of images correctly localized. In our context, this means that the intersection over union (IoU) between one of the ground-truth regions and one of the predicted regions in the image is greater than 0.5. Since CorLoc does not take into account multiple detections per image, for multi-object discovery we use instead the detection rate at a given IoU threshold
ζ: the detection rate at IoU = ζ is the percentage of ground-truth bounding boxes that have an IoU with one of the retained proposals greater than ζ. We run the experiments both in the colocalization setting, where the algorithm is run separately on each class of the dataset and the average CorLoc/detection rate over all classes is computed as the overall performance measure on the dataset, and in the true discovery setting, where the whole dataset is considered as a single class.

We test our methods with pre-trained CNN features from VGG16 and VGG19 [29]. For generating region proposals, we apply the algorithm described in Section 3.1 separately to the layers right before the last two max pooling layers of the networks (relu4_3 and relu5_3 in VGG16, relu4_4 and relu5_4 in VGG19), then fuse the proposals generated from the two layers as our final set of proposals. Note that using CNN features at multiple layers is important, as different layers capture different visual patterns in images [37]. One could also use more layers from VGG16 (e.g., layers relu3_3, relu4_2 or relu5_2), but we only use two for the sake of efficiency. In experiments with OSD, we extract features for the region proposals by applying the RoI pooling operator introduced in Fast-RCNN [13] to layer relu5_3 of VGG16.

Region Proposal Generation Process.
For finding robust local maxima ofthe global saliency maps s g , we rank its locations using persistence [5,8,9,25,41].Concretely, we consider s g as a 2D image and each location in it as a pixel. Weassociate with each pixel a cluster (the 4-neighborhood connected component ofpixels that contains it), together with a “birth” (its own saliency) and “death”time (the highest value for which one of the pixels in its cluster also belongs tothe cluster of a pixel with higher saliency, or, if no such location exists, the lowestsaliency value in the map). The persistence of a pixel is defined as the differencebetween its birth and death times. A sorted list of pixels in decreasing persistenceorder is computed, and the local maxima are chosen as the top pixels in the list.For additional robustness, we also apply non maximum suppression on the list oward unsupervised, multi-object discovery in large-scale image collections 9(a) IoU = 0 .
5. (b)
IoU = 0 .
7. (c)
IoU = 0 .
9. (d) positive regions.
Fig. 2:
Quality of proposals by different methods. (a-c): Detection rate by numberof proposals at different
IoU thresholds of randomized Prim (RP) [23], edgeboxes(EB) [40], selective search (SS) [33] and ours; (d): Percentage of positive proposals forthe four methods. over a 3 × s g below α max s g beforecomputing the persistence to obtain only good local maxima. We also eliminatelocations with score smaller than the average score in s y and whose score in s g is smaller than β times the average score in s g . We choose the value of thepair ( α, β ) in { . , . } × { . , } by conducting small-scale object discovery onVOC 6x2. We find that ( α, β ) = (0 . , .
5) yields the best performance and giveslocal saliency maps that are not fragmented while eliminating well irrelevantlocations across settings and datasets. We take up to 20 local maxima (afternon-maximum suppression) and use 50 linearly spaced thresholds between thelowest and the highest scores in each local saliency map to generate proposals.
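The persistence ranking described above can be sketched with a union-find pass over pixels visited in order of decreasing saliency: a component is "born" at a new peak, and it "dies" when it merges into a component with a higher peak. This is our own simplified illustration (no NMS or α/β filtering), not the paper's implementation.

```python
import numpy as np

def persistence_peaks(s):
    """Rank local maxima of a 2D saliency map by topological persistence.

    Returns a list of ((y, x), persistence) pairs sorted by decreasing
    persistence; the global maximum gets infinite persistence by convention.
    """
    H, W = s.shape
    order = sorted(((y, x) for y in range(H) for x in range(W)),
                   key=lambda p: -s[p])          # visit pixels high-to-low
    parent = {}   # union-find over already-visited pixels
    peak = {}     # component root -> coordinates of its highest pixel
    pers = {}     # peak -> persistence (birth saliency minus death saliency)

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]        # path compression
            p = parent[p]
        return p

    for y, x in order:
        roots = set()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (ny, nx) in parent:
                roots.add(find((ny, nx)))
        parent[(y, x)] = (y, x)
        if not roots:                            # a new peak is born here
            peak[(y, x)] = (y, x)
            continue
        # merge into the component with the highest peak; the others die now
        best = max(roots, key=lambda r: s[peak[r]])
        for r in roots:
            if r != best:
                pers[peak[r]] = s[peak[r]] - s[(y, x)]   # birth - death
            parent[r] = best
        parent[(y, x)] = best

    for r in set(find(p) for p in parent):       # the surviving component
        pers[peak[r]] = float("inf")
    return sorted(pers.items(), key=lambda kv: -kv[1])
```

Taking the top entries of the returned list then gives the robust local maxima around which local saliency maps are built.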
Object Discovery Experiments.
For single-object colocalization and discovery, following [34], we use ν = 5, τ = 10 and apply OSD's post-processing to obtain the final localization result. For the multi-object setting, we use ν = 50, τ = 10 and apply the post-processing with non-maximum suppression. Potential neighbors of each image are found by comparing features from layer fc6 of the pre-trained network, following [2]. The number of potential neighbors of each image is fixed to 50 in all experiments where pre-filtering is necessary.

Following other works on region proposals [23,33,40], we evaluate the quality of our proposals on PASCAL VOC 2007 using the detection rate at various
IoU thresholds. But since we intend to later use our proposals for object discovery, unlike other works, we evaluate our proposals directly on VOC_all instead of the test set of VOC 2007, to reveal the link between the quality of proposals and the object discovery performance. Figure 2(a-c) shows the performance of different proposals on VOC_all. It can be seen that our method performs better than the others at a very high overlap threshold (0.9), regardless of the number of proposals allowed. At medium thresholds, it remains competitive with the other methods. We also measure the percentage of positive proposals, that is, proposals having an IoU greater than some threshold with object bounding boxes; it is easier to localize the objects if the percentage of positive region proposals is larger. As shown by Fig. 2(d), our method performs very well according to this criterion: Over 8% of our proposals are positive at an IoU threshold of 0.5, and over 3% are still positive for an IoU of 0.7. Also, randomized Prim and our method are by far better than selective search and edgeboxes, which explains the superior object discovery performance of the former over the latter (cf. [34] and Table 3). Note that region proposals with a high percentage of positive ones could also be used in other tasks, e.g., weakly supervised object detection, but this is left for future work.

An important component of OSD is the similarity model used to compute the score matrices S_{ij}, which, in [34], is the Probabilistic Hough Matching (PHM) algorithm [6]. Vo et al. [34] introduce two scores, a confidence score and a standout score, but use only the latter, as it gives better performance. Since our new proposals come with different statistics, we test both scores in our experiments. Table 1 compares the colocalization performance on OD, VOC_6x2 and VOC_all of OSD using the confidence and standout scores with our proposals. It can be seen that on VOC_6x2 and VOC_all, the confidence score does better than the standout score, while on OD, the latter does better. This is in fact not particularly surprising, since images in OD generally contain bigger objects (relative to image size) than those in the other datasets. In fact, although the standout score is used on all datasets in [6] and [34], the authors adjust the parameter γ (see [6]) used in computing their standout score to favor larger regions when running their models on OD. In all of our experiments from now on, we use the standout score on OD and the confidence score on the other datasets (VOC_6x2, VOC_all, VOC12 and COCO_20k).

Our proposal generation process introduces a few hyper-parameters. Apart from α and β, two other important hyper-parameters are the number of local maxima u and the number of thresholds v, which together control the number of proposals p per image returned by the process. We study their influence on the colocalization performance by conducting experiments on VOC_6x2 and report the results in Table 2.
It shows that the colocalization performance does not depend much on the values of these parameters. Using (u = 50, v = 100) actually gives the best performance, but with twice as many proposals as (u = 20, v = 50). For efficiency, we use u = 20 and v = 50 in all of our experiments.

Table 1: Colocalization performance with our proposals in different configurations of OSD

Config.    Confidence  Standout
OD         83.7        …
VOC 6x2    …           …
VOC all    …           …

Table 2: Colocalization performance for different values of the hyper-parameters (u, v)

(u, v)    (20,50)  (20,100)  (50,50)  (50,100)
CorLoc    73.6     …         …        …
p         760      882       1294     1507
Table 3:
Single-object colocalization and discovery performance of OSD with different types of proposals. We use VGG16 features to represent regions in these experiments
Region proposals        Colocalization (OD / VOC 6x2 / VOC all)    Discovery (OD / VOC 6x2 / VOC all)
Edgeboxes [40]          81.6 / … / …                               … / … / …
Selective search [33]   … / … / …                                  … / … / …
Randomized Prim [23]    … / … / …                                  … / … / …
Ours                    … / … / …                                  … / … / …

We report in Table 3 the performance of OSD and rOSD on OD, VOC_6x2 and VOC_all with different types of proposals. It can be seen that our proposals give the best results on all datasets among all types of proposals, with significant margins: 6.1%, 2.1% and 3.0% in colocalization and 5.3%, 0.5% and 4.7% in discovery, respectively. It is also noticeable that our proposals not only fare much better than the unsupervised ones (selective search and edgeboxes) but also outperform those generated by randomized Prim, an algorithm trained with bounding box annotation.

We compare OSD and rOSD using our region proposals to the state of the art in Table 4 (colocalization) and Table 5 (discovery). In their experiments, Wei et al. [35] only use features from VGG19. We have conducted experiments with features from both VGG16 and VGG19 but only present results with VGG19 features in comparisons with [35], due to the space limit. A more comprehensive comparison with features from VGG16 is included in the supplementary material. It can be seen that our use of CNN features (both for creating proposals and for representing them in OSD) consistently improves the performance compared to the original OSD [34]. It is also noticeable that rOSD performs significantly better than OSD on the two large datasets (VOC_all and VOC12), while on the two smaller ones (OD and VOC_6x2) their performances are comparable. This is due to the fact that images in OD and VOC_6x2 mostly contain only one well-positioned object, so bad local maxima are not a big problem in the optimization, while images in VOC_all and VOC12 contain much more complex scenes, and the optimization works better with more regularization.
In overall,we obtain the best results on the two smaller datasets, fare better than [21] butare behind [35] on VOC all and VOC12 in the colocalization setting. It shouldbe noticed that while methods for image colocalization [21,35] suppose that im-ages in the collection come from the same category and explicitly exploit this et al.
Table 4:
Single-object colocalization performance of our approach compared to thestate of the art. Note that Wei et al. [35] outperform our method on VOC all andVOC12 in this case, but the situation is clearly reversed in the much more difficultdiscovery setting, as demonstrated in Table 5
Method           Features  OD    VOC 6x2  VOC all  VOC12
Cho et al. [6]   WHO       84.2  67.6     37.6     -
Vo et al. [34]   WHO       87.1  …        …        …
Li et al. [21]   VGG19     -     -        41.9     45.6
Wei et al. [35]  VGG19     87.9  67.7     …        …
Ours (OSD)       VGG19     …     …        …        …
Ours (rOSD)      VGG19     …     …        …        …

Table 5:
Single-object discovery performance on the datasets with our proposals compared to the state of the art.
Method           Features  OD    VOC 6x2  VOC all  VOC12
Cho et al. [6]   WHO       82.2  55.9     37.6     -
Vo et al. [34]   WHO       82.3  …        …        …
Wei et al. [35]  VGG19     75.0  54.0     43.4     46.3
Ours (OSD)       VGG19     89.1  …        …        …
Ours (rOSD)      VGG19     …     …        …        …

assumption, rOSD is intended to deal with the much more difficult and general object discovery task. Indeed, in the discovery setting, rOSD outperforms [35] by a large margin: 5.9% and 4.9% on VOC all and VOC12, respectively.

Multi-Object Colocalization and Discovery.
We demonstrate the effectiveness of rOSD in multi-object colocalization and discovery on the VOC all and VOC12 datasets, which contain images with multiple objects. We compare the performance of OSD and rOSD to Wei et al. [35] in Table 6. Although [35] tackles only the single-object colocalization problem, we modify their method to obtain a reasonable baseline for the multi-object colocalization and discovery problems. Concretely, we take the bounding boxes around the 5 largest connected components of positive locations in the image's indicator matrix [35] as the localization results. Our method obtains the best performance, with significant margins over the closest competitor across all datasets and settings. It is also noticeable that rOSD again significantly outperforms OSD in this task. An illustration of multi-object discovery is shown in Fig. 3. For a fair comparison, we use high values of ν (50) and of the IoU threshold (0.7) in the multi-object experiments to make sure that both OSD and rOSD return approximately 5 regions per image. Images may of course contain fewer than 5 objects. In such cases, OSD and rOSD usually return overlapping boxes around the actual objects. We can often eliminate these overlapping boxes and obtain better qualitative results by using smaller ν and IoU threshold values. As Fig. 3 shows, with ν = 25 and IoU = 0.3, rOSD is able to return bounding boxes around objects without many overlapping regions. Note however that the quantitative results may worsen due to the reduced number of regions returned and the fact that many images contain objects that highly overlap, e.g., the last two columns of Fig. 3.
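The overlap-based elimination used here (rank regions by their score, then discard any region whose IoU with a higher-ranked kept region exceeds the threshold, and return the top survivors) is essentially greedy non-maximum suppression. A minimal sketch, assuming boxes in (x0, y0, x1, y1) format; the function names are ours, not the paper's:

```python
# Hedged sketch of the post-processing: greedy, IoU-thresholded selection
# of the highest-scoring regions. Box format is (x0, y0, x1, y1).
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter) if inter else 0.0

def select_objects(boxes, scores, iou_thresh=0.7, max_objects=5):
    """Return the indices of up to `max_objects` kept regions."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        # Keep region i only if it does not overlap a kept region too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
        if len(kept) == max_objects:
            break
    return kept
```

With a low threshold, near-duplicate boxes around the same object are removed, which matches the qualitative behavior described above for ν = 25 and IoU = 0.3.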
Table 6:
Multi-object colocalization and discovery performance of rOSD compared to competitors on the VOC all and VOC12 datasets.
Method           Features  Colocalization       Discovery
                           VOC all  VOC12       VOC all  VOC12
Vo et al. [34]   WHO       40.7     …           …        …
Wei et al. [35]  VGG19     43.3     45.5        28.1     30.3
Ours (OSD)       VGG19     46.8     …           …        …
Ours (rOSD)      VGG19     …        …           …        …

In such cases, a small IoU threshold prevents discovering all of these objects. See the supplementary document for more visualizations and details.

Fig. 3:
Qualitative multi-object discovery results obtained with rOSD. White boxes are ground-truth objects and red ones are our predictions. Original images are in the first row; results with ν = 50 and IoU = 0.7 in the second row, and with ν = 25 and IoU = 0.3 in the third row.

Large-Scale Object Discovery.
We apply our large-scale algorithm in the discovery setting on VOC all, VOC12 and COCO 20k, which are randomly partitioned into 5, 10 and 20 parts of roughly equal sizes, respectively. In the first stage of all experiments, we prefilter the initial neighborhood of images and keep only 50 potential neighbors. We choose ν = 50 and keep K (respectively 250, 500 and 1000 on VOC all, VOC12 and COCO 20k) positive entries in each score matrix. In the second stage, we run rOSD (OSD) on the entire datasets with ν = 5, limit the number of potential neighbors to 50 and use score matrices with only 50 positive entries. We choose K such that each run in the first stage and the OSD run in the second stage have the same memory cost, hence the values of K above. As baselines, we have applied rOSD (OSD) directly to the datasets, keeping 50 positive entries (baseline 1) or 1000 positive entries (baseline 2) in the score matrices. Table 7 shows the object discovery performance on VOC all, VOC12 and COCO 20k for our large-scale algorithm compared to these baselines. Our large-scale two-stage rOSD algorithm yields significant performance gains over baseline 1, with improvements of 6.6%, 9.3% and 4.0% in single-object discovery, and improvements in multi-object discovery as well (Table 7).

Table 7:
Performance of our large-scale algorithm compared to the baselines. Our method and baseline 1 have the same memory cost, which is much smaller than that of baseline 2. Also, due to memory limits, we cannot run baseline 2 on COCO 20k.
[Table 7: single-object and multi-object discovery performance on VOC all, VOC12 and COCO 20k for baseline 1, baseline 2 and our two-stage algorithm, each with OSD and rOSD; baseline 1 (OSD) reaches 41.1 on VOC all in single-object discovery.]

Execution time.
Similar to [34], our method requires computing similarity scores for a large number of image pairs, which makes it computationally costly. It takes in total 478 parallelizable CPU hours, 300 unparallelizable CPU seconds and 1 GPU hour to run single-object discovery on VOC all with 3550 images. This is more costly than the 812 GPU seconds needed by DDT+ [35], but less costly than [34] with CNN features, which requires 546 parallelizable CPU hours, 250 unparallelizable CPU seconds and 4 GPU hours. Note that the unparallelizable computational cost, which comes from the main OSD algorithm, grows very fast with the dataset's size (at least linearly in theory; it takes 2.3 hours on COCO 20k in practice) and is the time bottleneck at large scale.
We have presented an unsupervised algorithm for generating region proposals from CNN features trained on an auxiliary and unrelated task. Our proposals come with an intrinsic structure which can be leveraged as an additional regularization in the OSD framework of Vo et al. [34]. The combination of our proposals and regularized OSD gives results comparable to the current state of the art in image colocalization, sets a new state of the art in single-object discovery, and has proven effective in multi-object discovery. We have also successfully extended OSD to the large-scale case and shown that our method yields significantly better performance than plain OSD. Future work will be dedicated to investigating other applications of our region proposals.
Acknowledgments.
This work was supported in part by the Inria/NYU collaboration, the Louis Vuitton/ENS chair on artificial intelligence and the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). Huy V. Vo was supported in part by a Valeo/Prairie CIFRE PhD Fellowship.
References
1. Alexe, B., Deselaers, T., Ferrari, V.: Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2189–2202 (2012)
2. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: Proceedings of the European Conference on Computer Vision (ECCV) (2014)
3. Babenko, A., Lempitsky, V.: Aggregating deep convolutional features for image retrieval. In: Proceedings of the International Conference on Computer Vision (ICCV) (2015)
4. Bach, F.: Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning (2-3), 145–373 (2013)
5. Chazal, F., Guibas, L.J., Oudot, S.Y., Skraba, P.: Persistence-based clustering in Riemannian manifolds. Journal of the ACM (6), 41:1–41:38 (2013)
6. Cho, M., Kwak, S., Schmid, C., Ponce, J.: Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
7. Cinbis, R., Verbeek, J., Schmid, C.: Weakly supervised object localization with multi-fold multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
18. Hsu, K.J., Lin, Y.Y., Chuang, Y.Y.: DeepCO3: Deep instance co-segmentation by co-peak search and co-saliency detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
19. Hsu, K.J., Tsai, C.C., Lin, Y.Y., Qian, X., Chuang, Y.Y.: Unsupervised CNN-based co-saliency detection with graphical optimization. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
20. Kim, G., Torralba, A.: Unsupervised detection of regions of interest using iterative link analysis. In: Proceedings of the Conference on Neural Information Processing Systems (NeurIPS) (2009)
21. Li, Y., Liu, L., Shen, C., Hengel, A.: Image co-localization by mimicking a good detector's confidence score distribution. In: Proceedings of the European Conference on Computer Vision (ECCV) (2016)
22. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: Proceedings of the European Conference on Computer Vision (ECCV) (2014)
23. Manen, S., Guillaumin, M., Van Gool, L.: Prime object proposals with randomized Prim's algorithm. In: Proceedings of the International Conference on Computer Vision (ICCV) (2013)
24. Nedić, A., Ozdaglar, A.: Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization (4), 1757–1780 (2009)
25. Oudot, S.: Persistence Theory: From Quiver Representations to Data Analysis. AMS Surveys and Monographs (2015)
26. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proceedings of the Conference on Neural Information Processing Systems (NeurIPS) (2015)
27. Russell, B., Freeman, W., Efros, A., Sivic, J., Zisserman, A.: Using multiple segmentations to discover objects and their extent in image collections. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2006)
28. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the International Conference on Computer Vision (ICCV) (2017)
29. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations (ICLR) (2015)
30. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering object categories in image collections. In: Proceedings of the International Conference on Computer Vision (ICCV) (2005)
31. Tang, P., Wang, X., Bai, S., Shen, W., Bai, X., Liu, W., Yuille, A.: PCL: Proposal cluster learning for weakly supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (1), 176–191 (2020)
32. Tang, P., Wang, X., Wang, A., Yan, Y., Liu, W., Huang, J., Yuille, A.: Weakly supervised region proposal network and object detection. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
33. Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. International Journal of Computer Vision (2), 154–171 (2013)
34. Vo, H.V., Han, K., Cho, M., Pérez, P., Bach, F., LeCun, Y., Ponce, J.: Unsupervised image matching and object discovery as optimization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
35. Wei, X.S., Zhang, C.L., Wu, J., Shen, C., Zhou, Z.H.: Unsupervised object discovery and co-localization by deep descriptor transforming. Pattern Recognition (PR) (2019)
36. Wei, X.S., Luo, J.H., Wu, J., Zhou, Z.H.: Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Transactions on Image Processing (6), 2868–2881 (2017)
37. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Proceedings of the European Conference on Computer Vision (ECCV) (2014)
38. Zhang, D., Meng, D., Li, C., Jiang, L., Zhao, Q., Han, J.: A self-paced multiple-instance learning framework for co-saliency detection. In: Proceedings of the International Conference on Computer Vision (ICCV) (2015)
39. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
40. Zitnick, C.L., Dollár, P.: Edge boxes: Locating object proposals from edges. In: Proceedings of the European Conference on Computer Vision (ECCV) (2014)
41. Zomorodian, A., Carlsson, G.: Computing persistent homology. Discrete and Computational Geometry (2005)

Supplementary materials: Toward unsupervised, multi-object discovery in large-scale image collections

1 Regularized OSD (rOSD)
We have presented in the paper a new version of the OSD formulation [34] with added constraints based on the structure of our region proposals. Concretely, we propose to solve the optimization problem

    \max_{x,e} S(x,e) = \sum_{i=1}^{n} \sum_{j \in N(i)} e_{ij} \, x_i^T S_{ij} x_j,
    s.t.  \forall i:  \sum_{k=1}^{p} x_i^k \le \nu,
          \sum_{k \in G_i^g} x_i^k \le 1  for all groups g,
          \sum_{j \ne i} e_{ij} \le \tau.                                (2)

We solve this problem with an iterative block-coordinate ascent algorithm similar to OSD. Its iterations are illustrated in Algorithm 1.
Block coordinate ascent algorithm for rOSD.
Result:
A solution to rOSD.
Input: G_i, ν, τ, S_ij, number n of images.
Initialization: x_i = 1_p ∀i, e_ij = 1 ∀i ≠ j.
for i = 1 to n do
    Compute the vector R containing the scores of regions in image i:
        R ← Σ_{j≠i} (e_ij S_ij + e_ji S_ji^T) x_j.
    I ← ∅.
    for g = 1 to L_i do
        Find the region g* with the highest score R(g*) in the group G_i^g; I ← I ∪ {g*}.
    end
    Choose the ν regions in I with the highest scores in R and set their corresponding variables to 1; set the variables of all other regions to 0.
end
for i = 1 to n do
    Compute the indices j_1, ..., j_τ of the τ largest scalars x_i^T S_ij x_j (1 ≤ j ≤ n).
    e_i ← 0.
    for t = 1 to τ do e_{i j_t} ← 1 end
end

Note that the output of Algorithm 1 depends on the order in which the variables x_i are processed in its first for loop. In our implementation, we use a different random permutation of (1, ..., n) in each iteration of the optimization. For each experiment, we run rOSD several times and report the average performance of all runs as the final performance.

We summarize in Algorithm 2 our proposed large-scale algorithm for object discovery.
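As a concrete illustration of the x-update in Algorithm 1, the following sketch selects at most one region per group and then keeps the ν best group winners; the neighbor-aggregated score vector R is assumed to be already computed, and the function name is ours:

```python
# Hedged sketch of the per-image x-update of Algorithm 1 (rOSD), assuming
# the aggregated region-score vector R for image i is given.
import numpy as np

def update_x(R, groups, nu):
    """Binary selection of regions for one image.

    R      : (p,) array of aggregated region scores.
    groups : list of index arrays, one per group G_i^g.
    nu     : maximum number of selected regions.
    Returns a binary (p,) indicator vector x_i."""
    x = np.zeros_like(R)
    # Best-scoring region inside each group (at most one per group).
    winners = [g[np.argmax(R[g])] for g in groups if len(g) > 0]
    # Among the group winners, keep the nu highest-scoring regions.
    winners = sorted(set(winners), key=lambda k: -R[k])[:nu]
    x[winners] = 1.0
    return x
```

The group constraint is what prevents several near-identical proposals from the same hierarchy branch from being selected together.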
Algorithm 2:
Large-scale object discovery algorithm.
Input:
Dataset D of n images, memory limit M, number k of parts, image neighborhood size N, ν*, τ.
Partition D into k random parts D_1, ..., D_k, each with roughly ⌊n/k⌋ images.
Compute the maximum number of positive entries in the score matrices within each part: K_1 ← M / (N · ⌊n/k⌋).
Compute the maximum number of positive entries in the score matrices over the whole dataset: K_2 ← M / (n · N).
for i = 1 to k do
    Compute score matrices for image pairs in D_i with K_1 positive entries.
    Run proxy OSD on D_i with ν = K_2.
    Each image in D_i has a new set of region proposals, which are those retained by OSD.
end
Compute score matrices between pairs of images in D with K_2 positive entries.
Run OSD on the whole dataset D with ν = ν*.

Vo et al. [34] use an ensemble method (EM) to combine several solutions before post-processing, to stabilize and improve the final performance of OSD. We investigate the influence of this procedure on the performance of OSD and rOSD with our proposals, and present the results in Tables 1 and 2. We use VGG16 features in these experiments. The effect of EM is mixed on the tested datasets: it generally harms the performance on VOC all and VOC12, improves it on VOC 6x2, and its effect on OD is unclear. We have therefore chosen to omit EM in the experiments of the main body of the paper.

We present in Tables 3, 4 and 5 our full results in colocalization and object discovery with features from both VGG16 and VGG19. With VGG16 features, rOSD still significantly outperforms OSD on the two large datasets and fares comparably to OSD on the smaller two. It is also noticeable that rOSD significantly outperforms Wei et al. [35] in both colocalization and single-object discovery on all datasets when VGG16 features are used.
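The partition and budget bookkeeping of Algorithm 2 can be sketched as follows; the OSD solver itself is external and not shown, and the function name is ours:

```python
# Hedged sketch of the bookkeeping in Algorithm 2: split the collection
# into k random parts and size the number of positive score-matrix entries
# so that both stages fit the same memory budget M (with N neighbors per
# image). Integer division stands in for the floors in the paper.
import random

def large_scale_plan(n, k, N, M, seed=0):
    """Return (parts, K1, K2): k random parts of the n image indices,
    the per-matrix budget K1 for stage 1 and K2 for stage 2."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::k] for i in range(k)]   # roughly equal sizes
    K1 = M // (N * (n // k))                # positives per matrix, per part
    K2 = M // (N * n)                       # positives per matrix, whole set
    return parts, K1, K2
```

For instance, with n = 100 images, k = 5 parts, N = 50 neighbors and M = 250000, this gives K1 = 250 and K2 = 50, mirroring the ratio between the first- and second-stage budgets used in the paper's VOC all experiments.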
Table 1: Influence of the ensemble method of Vo et al. on the colocalization performance of OSD and rOSD with our proposals.
Method               OD    VOC 6x2  VOC all  VOC12
Ours (OSD) w/o EM    89.0  …        …        …
Ours (OSD) w/ EM     …     …        …        …
Ours (rOSD) w/o EM   …     …        …        …
Ours (rOSD) w/ EM    …     …        …        …

Table 2: Influence of the ensemble method of Vo et al. on the single-object discovery performance of OSD and rOSD with our proposals.
Method               OD    VOC 6x2  VOC all  VOC12
Ours (OSD) w/o EM    87.8  …        …        …
Ours (OSD) w/ EM     …     …        …        …
Ours (rOSD) w/o EM   …     …        …        …
Ours (rOSD) w/ EM    …     …        …        …

For a fair comparison to OSD and Wei et al. [35] in multi-object discovery, we have fixed the number of objects retained in each image by all methods to 5 in the paper. We have also modified the method of Wei et al. such that 5 bounding boxes around the 5 largest clusters of positive pixels in their indicator matrix are returned as objects. For OSD and rOSD, we run the corresponding optimization and then apply the following post-processing in each image: all ν retained regions are ranked in descending order using the score proposed in [34] (Eq. 12 in Sec. 2.6 therein), which is solely based on their similarity to the retained regions in the image's neighbors; we then iteratively discard all proposals having an IoU score greater than some threshold with higher-ranked regions; among the remaining regions, we return the 5 highest ranked as retrieved objects. Since this procedure can eliminate all but a few regions when the regions highly overlap, we choose a large value of ν (50) and a large IoU threshold (0.7) in our experiments to guarantee that we have exactly 5 regions per image. We have also conducted preliminary experiments with smaller values, ν = 25 in the optimization of OSD and rOSD and IoU = 0.3.

Table 3:
Single-object colocalization performance of our approach compared to the state of the art. Note that Wei et al. [35] outperform our method on VOC all and VOC12 with VGG19 features in this case, but the situation is clearly reversed in the much more difficult single-object discovery setting, as demonstrated in Table 4.
Method           Features  OD    VOC 6x2  VOC all  VOC12
Cho et al. [6]   WHO       84.2  67.6     37.6     -
Vo et al. [34]   WHO       87.1  …        …        …
Li et al. [21]   VGG16     -     -        40.0     41.9
Wei et al. [35]  VGG16     86.9  66.2     44.7     47.6
Ours (OSD)       VGG16     89.0  …        …        …
Ours (rOSD)      VGG16     …     …        …        …
Li et al. [21]   VGG19     -     -        41.9     45.6
Wei et al. [35]  VGG19     87.9  67.7     …        …
Ours (OSD)       VGG19     …     …        …        …
Ours (rOSD)      VGG19     …     …        …        …

Table 4:
Single-object discovery performance in the mixed setting on the datasets with our proposals compared to the state of the art.
Method           Features  OD    VOC 6x2  VOC all  VOC12
Cho et al. [6]   WHO       82.2  55.9     37.6     -
Vo et al. [34]   WHO       82.3  …        …        …
Wei et al. [35]  VGG16     73.5  66.2     41.9     45.0
Ours (OSD)       VGG16     87.8  …        …        …
Ours (rOSD)      VGG16     …     …        …        …
Wei et al. [35]  VGG19     75.0  54.0     43.4     46.3
Ours (OSD)       VGG19     89.1  …        …        …
Ours (rOSD)      VGG19     …     …        …        …

With these settings, rOSD returns bounding boxes around objects without many overlapping regions. It is also observed that rOSD fares much better than OSD in localizing multiple objects. We also compare the quantitative performance of rOSD, OSD and [35] in Table 6. For [35], we take as before the bounding boxes around the largest clusters of pixels in the indicator matrix of each image, with the number of clusters chosen to be the number of objects returned by rOSD in the same image. The results show that rOSD again yields by far the best performance. It is also noticeable that while using smaller values of ν and the IoU threshold slightly deteriorates the performance of rOSD, it makes the performance of OSD drop significantly (compare Tables 5 and 6). This is due to the fact that OSD returns many highly overlapping regions, most of which are eliminated by our procedure, whereas rOSD returns more diverse regions, so more of them are retained. In practice, we observe that OSD returns on average 1.47 (respectively 1.52) regions while rOSD returns 3.62 (respectively 3.63) on VOC all (respectively VOC12). Note, however, that rOSD still outperforms OSD and [35] even when the latter are allowed to retain exactly 5 regions.

(a) VOC all (b) VOC12
Fig. 1:
Multi-object discovery performance of rOSD compared to OSD and [35] when varying the maximum number of returned objects.
Fig. 2:
Multi-object discovery results. In each column, from top to bottom: original image, image with predictions of OSD, image with predictions of rOSD. White boxes are ground-truth objects and red ones are our predictions. There are at most 5 predictions per image.
Table 5:
Multi-object colocalization and discovery performance of rOSD compared to competitors on the VOC all and VOC12 datasets.
Method           Features  Colocalization       Discovery
                           VOC all  VOC12       VOC all  VOC12
Vo et al. [34]   WHO       40.7     …           …        …
Wei et al. [35]  VGG16     38.3     40.4        25.8     28.2
Ours (OSD)       VGG16     45.9     …           …        …
Ours (rOSD)      VGG16     …        …           …        …
Wei et al. [35]  VGG19     43.3     45.5        28.1     30.3
Ours (OSD)       VGG19     46.8     …           …        …
Ours (rOSD)      VGG19     …        …           …        …

Table 6:
Multi-object colocalization and discovery performance of rOSD compared to competitors on the VOC all and VOC12 datasets when using smaller values of ν (25) and IoU threshold (0.3).
Method           Features  Colocalization       Discovery
                           VOC all  VOC12       VOC all  VOC12
Wei et al. [35]  VGG19     43.1     45.3        27.8     30.0
Ours (OSD)       VGG19     39.6     …           …        …
Ours (rOSD)      VGG19     …        …           …        …

Following [6], we evaluate the local graph structure obtained by rOSD using the CorRet measure, defined as the average percentage of returned image neighbors that belong to the same (ground-truth) class as the image itself. As a baseline, we consider the local graph induced by the sets of nearest neighbors N(i) computed from the fully connected layer fc6 of the CNN used in the same experiment. Table 7 shows the CorRet of the local graphs obtained when running rOSD (OSD) on VOC all and VOC12 and large-scale rOSD (OSD) on COCO 20k in the mixed setting. The local image graphs returned by our methods have higher CorRet than the baseline.

Table 7: Quality of the returned local image graph as measured by CorRet.
Dataset      VOC all  VOC12  COCO 20k
Baseline     50.7     56.4   36.8
Ours (OSD)   …        …      …
Ours (rOSD)  59.8     …      …

Table 8:
Colocalization and single-object discovery performance of rOSD compared to OSD, Li et al. [21] and Wei et al. [35] on 6 held-out ImageNet classes.
Method           Features  Colocalization  Discovery
Li et al. [21]   VGG16     48.3            -
Wei et al. [35]  VGG16     74.3            61.2
Ours (OSD)       VGG16     61.5            …
Ours (rOSD)      VGG16     …               …
Li et al. [21]   VGG19     51.6            -
Wei et al. [35]  VGG19     …               …
Ours (OSD)       VGG19     61.3            …
Ours (rOSD)      VGG19     …               …

Though trained to classify the 1000 object classes of ImageNet, features from the convolutional layers of VGG networks have proven generic: they have been used for various tasks, including unsupervised object discovery. Li et al. [21] and Wei et al. [35] have shown that CNN features generalize well beyond the classes in ILSVRC2012 by testing on 6 held-out ImageNet classes (chipmunk, racoon, rhinoceros, rake, stoat and wheelchair). We have also tested our method on these classes. Since ImageNet has been under maintenance, we could not download all the official images in the six classes. For preliminary experiments, we have instead downloaded the images using their public URLs (provided on the ImageNet website), eliminated corrupted images, randomly chosen up to 200 images per class and run our experiments on these images. We compare rOSD, OSD, [21] and [35] in this setting (Table 8; numbers for [21] are taken from [35]). Although rOSD performs significantly better than [21] in colocalization, it is, as before, significantly outperformed by [35] there. In object discovery, rOSD performs slightly better than [35] with VGG16 features, but significantly worse with VGG19 features. Understanding this discrepancy observed in preliminary experiments is part of our plans for future work.

The most important advantage of rOSD over OSD is that the former returns more diverse regions than the latter. We visualize the regions returned by OSD and rOSD in colocalization experiments with ν = 5 in Fig. 3.

We use persistence [5,8,9,25,41] to find robust local maxima of the global saliency map s_g in our work. Considering s_g as a 2D image and each location in it as
Fig. 3:
Regions returned by OSD and rOSD. In each column, from top to bottom: original image, image with regions returned by OSD, image with regions returned by rOSD.

a pixel, we associate with each pixel a cluster (the 4-neighborhood connected component of pixels that contains it), together with both a "birth time" (its own saliency) and a "death time" (the highest value for which one of the pixels in its cluster also belongs to the cluster of a pixel with higher saliency or, if no such location exists, the lowest saliency value in the map). The persistence of a pixel is defined as the difference between its birth and death times. Figure 4 illustrates persistence in the 1D case.

Fig. 4:
An illustration of persistence in the 1D case. Left: a 1D function. Right: its persistence diagram. Points above the diagonal correspond to its local maxima, and the vertical distance from these points to the diagonal is their persistence. Local maxima with higher persistence are more robust: B is more robust than A although f(A) > f(B). Given a chosen persistence threshold (shown by the dashed blue lines), points with persistence higher than the threshold are selected as robust local maxima. The black horizontal dotted lines show the birth and death times of the local maxima of f.
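For illustration, a minimal 1D version of this persistence computation can be written as a union-find sweep over locations in decreasing order of f; the paper applies the 2D analogue with 4-neighborhoods to the saliency map. This simplified sketch assumes the interesting values are distinct, and all names are ours:

```python
# Hedged sketch of 0-dimensional persistence for a 1D function, matching
# the birth/death convention of Fig. 4: each component is born at its
# peak's value and dies when it merges into a component with a higher peak.
def persistence_1d(f):
    """Return {index: persistence}; only local maxima get positive values,
    and the global maximum gets f(max) - min(f) by convention."""
    order = sorted(range(len(f)), key=lambda i: -f[i])
    parent = {}   # union-find over activated indices
    peak = {}     # root -> index of the component's highest value

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pers = {}
    for i in order:                        # sweep from high to low
        parent[i] = i
        peak[i] = i
        for nb in (i - 1, i + 1):          # 1D neighborhood
            if nb in parent:
                r, ri = find(nb), find(i)
                if r == ri:
                    continue
                # The component with the lower peak dies at level f[i].
                lo, hi = (r, ri) if f[peak[r]] < f[peak[ri]] else (ri, r)
                pers[peak[lo]] = f[peak[lo]] - f[i]
                parent[lo] = hi
    g = max(range(len(f)), key=lambda i: f[i])
    pers[g] = f[g] - min(f)                # global max never dies
    return pers
```

Thresholding the returned persistences then selects the robust local maxima, exactly as the dashed line in the persistence diagram of Fig. 4 does.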