Weakly Supervised Object Detection with Segmentation Collaboration
Xiaoyan Li, Meina Kan, Shiguang Shan, Xilin Chen
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China
CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai 200031, China
[email protected]  {kanmeina, sgshan, xlchen}@ict.ac.cn

Abstract
Weakly supervised object detection aims at learning precise object detectors given only image category labels. In recent prevailing works, this problem is generally formulated as a multiple instance learning module guided by an image classification loss. The object bounding box is assumed to be the one contributing most to the classification among all proposals. However, the region contributing most is also likely to be a crucial part or the supporting context of an object. To obtain a more accurate detector, in this work we propose a novel end-to-end weakly supervised detection approach, where a newly introduced generative adversarial segmentation module interacts with the conventional detection module in a collaborative loop. The collaboration mechanism takes full advantage of the complementary interpretations of the weakly supervised localization task, namely the detection and segmentation tasks, forming a more comprehensive solution. Consequently, our method obtains more precise object bounding boxes, rather than parts or irrelevant surroundings. The proposed method achieves an accuracy of 51.0% on the PASCAL VOC 2007 dataset, outperforming the state-of-the-arts and demonstrating its superiority for weakly supervised object detection.
1. Introduction
As the data-driven approaches prevail on the object detection task in both academia and industry, the amount of data in an object detection benchmark is expected to grow larger and larger. However, annotating object bounding boxes is both costly and time-consuming. In order to reduce the labeling workload, researchers hope to make object detectors work in a weakly supervised fashion, e.g., learning a detector with only category labels rather than bounding boxes.

Recently, the most high-profile works on weakly supervised object detection all exploit the multiple instance learn-
ing (MIL) paradigm [3, 6, 23, 22, 19, 15, 1, 24, 25, 2, 8]. Based on the assumption that the object bounding box should be the one contributing most to image classification among all proposals, the MIL based approaches work in an attention-like mechanism: automatically assigning larger weights to the proposals consistent with the classification labels. Several promising works combining MIL with deep learning [2, 26, 31] have greatly pushed the boundaries of weakly supervised object detection. However, as noted in [26, 31], these methods easily over-fit to object parts, because the most discriminative classification evidence may derive from the entire object region, but may also come from crucial parts. The attention mechanism is effective in selecting the discriminative boxes, but does not guarantee the completeness of a detected object. For a more reasonable inference, a further elaborative mechanism is necessary.

Figure 1: The schematic diagram of the previous works with segmentation utilization [8, 31] and the proposed collaboration approach. In [8, 31], a two-stage paradigm is used, in which proposals are first filtered and then detection is performed on the remaining boxes ([8] shares the backbone between the two modules). In our approach, the detection and segmentation modules instruct each other in a dynamic collaboration loop during training.

Meanwhile, the completeness of a detected region is easier to ensure in weakly supervised segmentation. One common way to outline whole class-related segmentation regions is to recurrently discover and mask these regions in several forward passes [30]. These segmentation maps can potentially constrain weakly supervised object detection, given that a proposal having a low intersection over union (IoU) with the corresponding segmentation map is not likely to be an object bounding box. In [8, 31], weakly supervised segmentation maps are used to filter object proposals and reduce the difficulty of detection, as shown in Fig. 1a. However, these approaches adopt cascaded or independent models with relatively coarse segmentations to perform a "hard" delete on the proposals, inevitably causing a drop of the proposal recall. In a word, these methods underutilize the segmentation and limit the improvements of weakly supervised object detection.

The MIL based object detection approaches and semantic segmentation approaches focus on restraining different aspects of weakly supervised localization and have opposite strengths and shortcomings. The MIL based object detection approaches are precise in distinguishing object-related regions from irrelevant surroundings, but incline to confuse entire objects with parts due to their excessive attention to the most significant regions. Meanwhile, weakly supervised segmentation is able to cover entire instances, but tends to mix irrelevant surroundings with real objects. This complementary property is verified by Table 1: the segmentation can achieve a higher pixel-wise recall but lower precision, while the detection can achieve a higher pixel-level precision but lower recall. Rather than working independently, the two are naturally cooperative and can work together to overcome their intrinsic weaknesses.

In this work, we propose a segmentation-detection collaborative network (SDCN) for more precise object detection under weak supervision, as shown in Fig. 1b. In the proposed SDCN, the detection and segmentation branches work in a collaborative manner to boost each other.

Table 1: Pixel-wise recall and precision of detection and segmentation results on the VOC 2007 test set, following the same setting as in Sec. 4.2. For a comparable pixel-level metric, the detection results are converted to the equivalent segmentation maps in a similar way as described in Sec. 3.3.

Task | Recall | Precision
Weakly supervised detection | 62.9% | %
Weakly supervised segmentation | % | 35.4%
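The pixel-level comparison in Table 1 can be reproduced in miniature. The sketch below (an illustrative numpy toy, not the paper's evaluation code) rasterizes boxes into binary maps and computes pixel-wise recall and precision, showing why a part-like detection scores high precision but low recall, while an over-grown segmentation shows the opposite behavior:

```python
import numpy as np

def boxes_to_mask(boxes, h, w):
    """Rasterize (x1, y1, x2, y2) boxes into a binary map, the conversion
    used to compare detections and segmentations at the pixel level."""
    mask = np.zeros((h, w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = True
    return mask

def pixel_recall_precision(pred, gt):
    """Pixel-wise recall and precision between two binary maps."""
    tp = np.logical_and(pred, gt).sum()
    return tp / max(gt.sum(), 1), tp / max(pred.sum(), 1)

gt  = boxes_to_mask([(2, 2, 8, 8)], 10, 10)    # ground-truth object
det = boxes_to_mask([(2, 2, 6, 6)], 10, 10)    # a "part"-like detection
seg = boxes_to_mask([(0, 0, 10, 10)], 10, 10)  # an over-grown segmentation

r_det, p_det = pixel_recall_precision(det, gt)  # high precision, low recall
r_seg, p_seg = pixel_recall_precision(seg, gt)  # high recall, low precision
```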
Specifically, the segmentation branch is designed as a generative adversarial localization structure to sketch the object region. The detection module is optimized in an MIL manner, with the obtained segmentation map serving as spatial prior probabilities of the object proposals. Besides, the object detection branch also provides supervision back to the segmentation branch through a synthetic heatmap generated from all proposal boxes and their classification scores. Therefore, these two branches tightly interact with each other and form a dynamic cooperating loop. Overall, the entire network is optimized under the weak supervision of the classification loss in an end-to-end manner, which is superior to the cascaded or independent architectures in previous works [8, 31].

In summary, we make three contributions in this paper: 1) the segmentation-detection collaborative mechanism enforces deep cooperation between two complementary tasks and provides valuable supervision to each branch under the weakly supervised setting; 2) for the segmentation branch, the novel generative adversarial localization strategy enables our approach to produce more complete segmentation maps, which is crucial for improving both the segmentation and the detection branches; 3) as demonstrated in Section 4, we achieve the best performance on the PASCAL VOC 2007 and 2012 datasets, surpassing the previous state-of-the-arts.
2. Related works
Multiple Instance Learning (MIL).
MIL [9] is a concept in machine learning illustrating the essence of the inexact supervision problem, in which only coarse-grained labels are available [34]. Formally, given a training image I, all instances in some specific form constitute a "bag". E.g., object proposals (in the detection task) or image pixels (in the segmentation task) can be different forms of instances. If the image I is labeled with class c, then the "bag" of I is positive with regard to c, meaning that there is at least one positive instance of class c in this bag. If I is not labeled with class c, the corresponding "bag" is negative to c and there is no instance of class c in this image. MIL models aim at predicting the label of an input bag, and more importantly, finding positive instances in positive bags.

Weakly Supervised Object Detection.
Recently, the incorporation of deep neural networks and MIL has significantly improved the previous state-of-the-arts. Bilen et al. [2] proposed a Weakly Supervised Deep Detection Network (WSDDN) composed of two branches acting as a proposal selector and a proposal classifier, respectively. The idea of detecting objects by attention-based selection proved so effective that most later works follow it.
E.g., WSDDN is further improved by adding recursive refinement branches in [26]. Besides these single-stage approaches, researchers have also considered multiple-stage methods, in which fully supervised detectors are trained with the boxes detected by the single-stage methods as pseudo-labels. Zhang et al. [33] proposed a metric to estimate image difficulty with the proposal classification scores of WSDDN, and progressively trained a Fast R-CNN with a curriculum learning strategy. To speed up weakly supervised object detectors, Shen et al. [20] used WSDDN as an instructor which guides a fast generator to produce similar detection results.
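The attention-based selection at the heart of these MIL detectors can be sketched in a few lines (an illustrative toy, not WSDDN itself): under the bag assumption, the image-level score for a class is driven by its most confident proposal, whose index identifies the presumed object instance.

```python
import numpy as np

def select_instance(proposal_scores):
    """Max-style MIL aggregation: the bag (image) score for a class is taken
    from its most confident instance (proposal), which is also the instance
    selected as the detection."""
    idx = int(np.argmax(proposal_scores))
    return float(proposal_scores[idx]), idx

# per-proposal scores for one class; the bag is predicted positive,
# and proposal 1 is selected as the positive instance
score, idx = select_instance(np.array([0.05, 0.91, 0.12]))
```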
Weakly Supervised Object Segmentation.
Another route for localizing objects is semantic segmentation. To obtain a weakly supervised segmentation map, in [18],
Kolesnikov took the segmentation map as an output of the network and then aggregated it into a global classification prediction to learn with category labels. In [10], the aggregation function is improved to incorporate both negative and positive evidence, representing both the absence and presence of the target class. In [30], a recurrent adversarial erasing strategy is proposed to mask the response regions of the previous forward passes and force the network to generate responses on other undetected parts during the current forward pass.

Figure 2: The overall architecture. The SDCN is composed of three modules: the feature extractor, the segmentation branch, and the detection branch. The segmentation branch is instructed by a classification network in a generative adversarial learning manner, while the detection branch employs a conventional weakly supervised detector OICR [26], guided by an MIL objective. These two branches further supervise each other in a collaboration loop. The solid ellipses denote the cost functions. The operations are denoted as blue arrows, while the collaboration loop is shown with orange ones.
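The recurrent erasing strategy of [30] can be sketched as below. This is a schematic toy: `response_map` stands in for a CAM-style class response recomputed after erasing, and all names and thresholds are illustrative. Each pass adds the currently most discriminative region to the accumulated mask and erases it, so the next pass must respond on the remaining parts.

```python
import numpy as np

def recurrent_erasing(response_map, shape, steps=3, thresh=0.8):
    """Accumulate an object mask over several forward passes;
    `response_map(erased)` returns the class response computed with the
    `erased` region masked out of the input."""
    erased = np.zeros(shape, dtype=bool)
    for _ in range(steps):
        resp = response_map(erased)
        if resp.max() <= 0:                      # nothing left to discover
            break
        region = resp >= thresh * resp.max()     # current discriminative region
        if not (region & ~erased).any():
            break
        erased |= region
    return erased

# toy "classifier": responds on the head (top-left) until erased, then on the body
def toy_response(erased):
    head = np.zeros((6, 6), dtype=bool); head[0:2, 0:2] = True
    body = np.zeros((6, 6), dtype=bool); body[3:5, 3:5] = True
    for part in (head, body):
        if (part & ~erased).any():
            return part.astype(float)
    return np.zeros((6, 6))

mask = recurrent_erasing(toy_response, (6, 6))   # ends up covering head and body
```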
Utilization of Segmentation in Weakly Supervised Detection.
Researchers have found that there are inherent relations between the weakly supervised segmentation and detection tasks. In [8], a segmentation branch generating coarse response maps is used to eliminate proposals unlikely to cover any objects. In [31], the proposal filtering step is based on a new objectness rating, TS2C, defined with the weakly supervised segmentation map. Ge et al. [13] proposed a complex framework for both weakly supervised segmentation and detection, where results from segmentation models are used as both an object proposal generator and a filter for the later detection models. These methods incorporate segmentation to overcome the limitations of weakly supervised object detection, which is reasonable and promising considering their superiority over their baseline models. However, they ignore the mentioned complementarity of these tasks and only exploit one-way cooperation, as shown in Fig. 1a. This suboptimal use of the segmentation information limits the performance of their methods.
3. Method
The overall architecture of the proposed segmentation-detection collaborative network (SDCN) is shown in Fig. 2. The network is mainly composed of three components: a backbone feature extractor f_E, a segmentation branch f_S, and a detection branch f_D. For an input image I, its feature x = f_E(I) is extracted by the extractor f_E and then fed into f_S and f_D for segmentation and detection, respectively. The entire network is guided by the classification labels y = [y_1, y_2, ..., y_N] ∈ {0, 1}^N (where N is the number of object classes), which are formatted as an adversarial classification loss and an MIL objective. An additional collaboration loss is designed to improve the accuracy of both branches in the manner of a collaborative loop.

In 3.1, we first briefly introduce our detection branch, which follows the Online Instance Classifier Refinement (OICR) [26]. The proposed segmentation branch and collaboration mechanism are described in detail in 3.2 and 3.3.

3.1. Detection Branch

The detection branch f_D aims at detecting object instances in an input image, given only image category labels. The design of f_D follows the OICR [26], which works in a similar fashion to the Fast R-CNN [14]. Specifically, f_D takes the feature x from the backbone f_E and object proposals B = {b_1, b_2, ..., b_B} (where B is the number of proposals) from Selective Search [28] as input, and detects by classifying each proposal, formulated as below:

D = f_D(x, B),  D ∈ [0, 1]^{B×(N+1)},    (1)

where N denotes the number of classes, with the (N+1)-th class as the background. Each element D(i, j) indicates the probability of the i-th proposal b_i belonging to the j-th class.

The detection branch f_D consists of two sub-modules, a multiple instance detection network (MIDN) f_{D_m} and an online instance classifier refinement module f_{D_r}.
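The core operations of these two sub-modules, summing proposal probabilities into an image-level prediction (Eq. (3) below) and converting the soft matrix D_m into discrete labels via κ (Eq. (4) below), can be sketched as follows. This is an illustrative numpy toy; `box_iou` and the 0.5 overlap threshold are assumptions for the sketch.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def image_scores(D_m):
    """Eq. (3) aggregation: image-level class probability as the sum of the
    (proposal-normalized) probabilities over all proposals."""
    return D_m.sum(axis=0)

def kappa(D_m, boxes, y, iou_thr=0.5):
    """Eq. (4) sketch: for each positive class, label the top-scoring proposal
    and its highly overlapped neighbours with that class; the rest are background."""
    B, N = D_m.shape
    Y = np.zeros((B, N + 1), dtype=int)
    Y[:, N] = 1                                   # default: background
    for j in range(N):
        if y[j]:
            top = int(np.argmax(D_m[:, j]))
            for i in range(B):
                if box_iou(boxes[i], boxes[top]) >= iou_thr:
                    Y[i, N] = 0
                    Y[i, j] = 1
    return Y

D_m = np.array([[0.7], [0.2], [0.1]])             # 3 proposals, 1 class
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
Y_r = kappa(D_m, boxes, y=[1])                    # proposals 0, 1 -> class; 2 -> background
```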
The MIDN f_{D_m} serves as an instructor of the refinement module f_{D_r}, while f_{D_r} produces the final detection output. The MIDN is the same as the mentioned WSDDN [2], which computes the probability of each proposal belonging to each class under the supervision of the category label, with an MIL objective (Eq. (1) of [26]) formulated as follows:

D_m = f_{D_m}(x, B),  D_m ∈ [0, 1]^{B×N},    (2)

L^D_mil = Σ_{j=1}^{N} L_BCE( Σ_{i=1}^{B} D_m(i, j), y(j) ),    (3)

where Σ_{i=1}^{B} D_m(i, j) (denoted as φ_c in [26]) gives the probability of an input image belonging to the j-th category by summing that of all proposals, and L_BCE denotes the standard multi-class binary cross entropy loss.

Then, the resulting probability D_m from minimizing Eq. (3) is used to generate pseudo instance classification labels for the refinement module. This process is denoted as:

Y_r = κ(D_m),  Y_r ∈ {0, 1}^{B×(N+1)}.    (4)

Each binary element Y_r(i, j) indicates whether the i-th proposal is labeled as the j-th class. κ denotes the conversion from the soft probability matrix D_m to the discrete instance labels Y_r, where the top-scoring proposal and its highly overlapped ones are labeled with the image label and the rest are labeled as the background. For details, refer to Sec. 3.2 of [26].

The online instance classifier refinement module f_{D_r} performs detection proposal by proposal and further constrains the spatial consistency of the detection results with the generated labels Y_r, which is formulated as below:

D_r(i, :) = f_{D_r}(x, b_i),  D_r ∈ [0, 1]^{B×(N+1)},    (5)

L^D_ref = Σ_{j=1}^{N+1} Σ_{i=1}^{B} L_CE( D_r(i, j), Y_r(i, j) ),    (6)

where D_r(i, :) ∈ [0, 1]^{N+1} is a row of D_r, indicating the classification scores for proposal b_i. L_CE denotes the weighted cross entropy (CE) loss function in Eq.
(4) of [26]. Here, L_CE is employed instead of L_BCE considering that each proposal has one and only one positive category label. Eventually, the detection results are given by the refinement module, i.e., D = D_r, and the overall objective for the detection module is a combination of Eq. (3) and Eq. (6):

L^D = λ^D_mil L^D_mil + λ^D_ref L^D_ref,    (7)

where λ^D_mil and λ^D_ref are balancing factors for the losses. After optimization according to Eq. (7), the refinement module f_{D_r} can perform object detection independently by discarding the MIDN at test time.

3.2. Segmentation Branch

Generally, the MIL weakly supervised object detection module is subject to over-fitting on discriminative parts, since smaller regions with less variation are more likely to have high consistency across the whole training set. To overcome this issue, the completeness of a detected object needs to be measured and adjusted, e.g., by comparing with a segmentation map. Therefore, a weakly supervised segmentation branch is proposed to cover the complete object regions with a generative adversarial localization strategy.

In detail, the segmentation branch f_S takes the feature x as input and predicts a segmentation map, as below:

S = f_S(x),  S ∈ [0, 1]^{(N+1)×h×w},    (8)

s_k ≜ S(k, :, :),  k ∈ {1, ..., N+1},  s_k ∈ [0, 1]^{h×w},    (9)

where S has N+1 channels. Each channel s_k corresponds to a segmentation map for the k-th class with a size of h × w.

To ensure that the segmentation map S covers the complete object regions precisely, a novel generative adversarial localization strategy is designed as adversarial training between the segmentation predictor f_S and an independent image classifier f_C, serving as generator and discriminator respectively, as shown in Fig. 2. The training target of the generator f_S is to fool f_C into misclassifying by masking out the object regions, while the discriminator f_C aims to eliminate the effect of the erased regions and correctly predict the category labels.
The f_S and f_C are optimized alternately, each with the other one fixed.

Here, we first introduce the optimization of the segmentation branch f_S, given the classifier f_C fixed. Overall, the objective of the segmentation branch f_S can be formulated as a sum of losses over the classes:

L^S(S) = L^S(s_1) + L^S(s_2) + ... + L^S(s_{N+1}).    (10)

Here, L^S(s_k) is the loss for the k-th channel of the segmentation map, consisting of an adversarial loss L^S_adv and a classification loss L^S_cls, described in detail in the following.

If the k-th class is a positive foreground class (i.e., a foreground class that presents in the current image; a negative one does not appear), the segmentation map s_k should fully cover the region of the k-th class, but should not overlap with the regions of the other classes. In other words, for an accurate s_k, only the object region masked in by s_k should be classified as the k-th class, while its complementary region should not. Formally, this expectation can be satisfied by minimizing the function

L^S_adv(s_k) = L_BCE( f_C(I ∗ s_k), ỹ ) + L_BCE( f_C(I ∗ (1 − s_k)), ŷ ),    (11)

where ∗ denotes the pixel-wise product. The first term represents that the object region covered by the generated segmentation map, i.e., I ∗ s_k, should be recognized as the k-th class by the classifier f_C, but should not respond to any other class, with the label ỹ ∈ {0, 1}^N, where ỹ(k) = 1 and ỹ(i ≠ k) = 0. The second term means that when the region related to the k-th class is masked out from the input, i.e., I ∗ (1 − s_k), the classifier f_C should no longer recognize the k-th class, without influence on the other classes, with the label ŷ ∈ {0, 1}^N, where ŷ(k) = 0 and ŷ(i ≠ k) = y(i ≠ k).
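The two terms of Eq. (11) can be sketched as follows. This is an illustrative toy: `classifier` stands in for the fixed discriminator f_C (here a hypothetical one-class scorer based on mean intensity), and the BCE is the elementwise binary cross entropy.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Multi-label binary cross entropy, summed over classes."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).sum())

def adv_seg_loss(image, s_k, k, y, classifier):
    """Eq. (11): the masked-in region I * s_k must be classified as class k
    only (label y_tilde), while the masked-out image I * (1 - s_k) must lose
    class k but keep the other labels (label y_hat)."""
    y_tilde = np.zeros_like(y); y_tilde[k] = 1.0   # only class k present
    y_hat = y.copy();           y_hat[k] = 0.0     # class k erased
    return (bce(classifier(image * s_k), y_tilde) +
            bce(classifier(image * (1.0 - s_k)), y_hat))

# toy one-class "classifier": mean intensity as the class probability
clf = lambda img: np.array([img.mean()])
I = np.ones((4, 4))
y = np.array([1.0])
good = adv_seg_loss(I, np.ones((4, 4)), 0, y, clf)   # mask covers the object: tiny loss
bad  = adv_seg_loss(I, np.zeros((4, 4)), 0, y, clf)  # mask misses it: large loss
```

Minimizing this loss over s_k pushes the mask toward exactly the region the fixed classifier relies on for class k.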
Here, we note that in general the mask can be applied to the image I or to the input of any layer of the classifier f_C, and since f_C is fixed, the loss function in Eq. (11) only penalizes the segmentation branch f_S.

If the k-th class is a negative foreground class, s_k should be all-zero, as no instance of this foreground class presents. This is restrained with a response constraint term. In this term, the top 20% response pixels of each map s_k are pooled and averaged for a classification prediction optimized with a binary cross entropy loss, as below:

L^S_cls(s_k) = L_BCE( avgpool_{20%}(s_k), y(k) ).    (12)

If the k-th class is labeled as negative, avgpool_{20%}(s_k) is enforced to be close to 0, i.e., all elements of the map s_k should be approximately 0. The above loss is also applicable when the k-th class is positive: avgpool_{20%}(s_k) should then be close to 1, agreeing with the constraint in Eq. (11).

The background is taken as a special case. In Eq. (11), though the labels ỹ and ŷ do not involve the background class, the background segmentation map s_{N+1} is applicable in the same way as the other classes. When s_{N+1} is multiplied as in the first term of Eq. (11), the target label should be all-zero, ỹ = 0; when 1 − s_{N+1} is used as the mask in the second term of Eq. (11), the target label should be exactly the same as the original label, ŷ = y. For Eq. (12), we assume that a background region always appears in any input image, i.e., y(N+1) = 1 for all images.

Overall, the total loss of the segmentation branch in Eq. (10) can be summarized and rewritten as follows:

L^S = λ^S_adv Σ_{k: y(k)=1} L^S_adv(s_k) + λ^S_cls Σ_{k=1}^{N+1} L^S_cls(s_k),    (13)

where λ^S_adv and λ^S_cls denote balancing weights. After optimizing Eq.
(13), following the adversarial manner, the segmentation branch f_S is fixed and the classifier f_C is further optimized with the following objective:

L^C_adv(s_k) = L_BCE( f_C(I ∗ (1 − s_k)), y ),    (14)

L^C = L_BCE( f_C(I), y ) + Σ_{k: y(k)=1} L^C_adv(s_k).    (15)

The objective L^C consists of a classification loss and an adversarial loss L^C_adv. The target of the classifier f_C should always be y, since it aims at digging out the remaining object regions even if s_k is masked out.

Our design of the segmentation branch shares the same adversarial spirit as [30], but it is more efficient than [30], which recurrently performs several forward passes for one segmentation map. Besides, we do not face the problem of deciding the number of recurrent steps as in [30], which may vary with different objects.

3.3. Collaboration Loop

A dynamic collaboration loop is designed to complement both detection and segmentation for more accurate predictions, namely predictions that are neither so large that they cover the background nor so small that they degenerate to object parts.
Segmentation instructs Detection.
As mentioned, the detection branch easily over-fits to discriminative parts, while the segmentation can cover the whole object region. So naturally, the segmentation map can be used to refine the detection results by making a proposal with a larger IoU with the corresponding segmentation map receive a higher score. This is achieved by re-weighting the instance classification probability matrix D_m in Eq. (2) of the detection branch with a prior probability matrix D_seg stemming from the segmentation map, as follows:

D̂_m = D_m ⊙ D_seg,    (16)

where D_seg(i, k) denotes the overlap degree between the i-th object proposal and the connected regions of the k-th segmentation map. D_seg is generated as below:

D_seg(i, k) = max_j IoU(ŝ_{kj}, b_i) + τ.    (17)

Here, ŝ_{kj} denotes the j-th connected component in the segmentation map s_k, and IoU(ŝ_{kj}, b_i) denotes the intersection over union between ŝ_{kj} and the object proposal b_i. The constant τ adds a fault tolerance for the segmentation branch. Each column of D_seg is normalized by its maximum value to make it range within [0, 1].

With the re-weighting in Eq. (16), the object proposals only focusing on local parts are assigned lower weights, while the proposals precisely covering the object stand out. The connected components are employed to alleviate the issue of multiple instance occurrences, which is a hard case for weakly supervised object detection. The recent TS2C [31] objectness rating designed for this issue was also tested in place of the IoU with connected components, but showed no superiority in our case.

The re-weighted probability matrix D̂_m replaces D_m in Eq. (3) and further instructs the MIDN as in Eq. (18) and the refinement module as in Eq.
(19):

L^{D←S}_mil = Σ_j L_BCE( Σ_i D̂_m(i, j), y(j) ),    (18)

L^{D←S}_ref = Σ_j Σ_i L_CE( D_r(i, j), Ŷ_r(i, j) ),    (19)

where Ŷ_r denotes the pseudo labels derived from D̂_m as in Eq. (4). Finally, the overall objective of the detection branch in Eq. (7) is reformulated as below:

L^{D←S} = λ^D_mil L^{D←S}_mil + λ^D_ref L^{D←S}_ref.    (20)

Detection instructs Segmentation.

Though the detection boxes may not cover the whole object, they are effective for distinguishing an object from the background. To guide the segmentation branch, a detection heatmap S_det ∈ [0, 1]^{(N+1)×h×w} is generated, which can be seen as an analog of the segmentation map. Each channel s^det_k ≜ S_det(k, :, :) corresponds to a heatmap for the k-th class. Specifically, for a positive class k, each proposal box contributes its classification score to all pixels within this proposal, generating s^det_k by

s^det_k(p, q) = Σ_{i: (p,q) ∈ b_i} D(i, k),    (21)

while the channels s^det_k corresponding to negative classes are set to zero. Then, s^det_k is normalized by its maximum response, and the background heatmap s^det_{N+1} can simply be calculated as the complement of the foreground, i.e.,

s^det_{N+1} = 1 − max_{k ∈ {1,...,N}} s^det_k.    (22)

To generate a pseudo category label for each pixel, the soft map S_det is first discretized by taking the argument of the maximum at each pixel, and then the top 10% pixels for each class are kept, while the other, ambiguous ones are ignored. The generated label is denoted by ψ(S_det), and the instructive loss is formulated as below:

L^{S←D}_seg = L_CE( S, ψ(S_det) ).    (23)

Therefore, the loss function of the whole segmentation branch in Eq. (13) is now updated to

L^{S←D} = L^S + λ^S_seg L^{S←D}_seg.    (24)

Overall Objective.
With the updates in Eq. (20) and Eq. (24), the final objective for the entire network is

argmin_{f_E, f_S, f_D} L = L^{S←D} + L^{D←S}.    (25)

Briefly, the above objective is optimized in an end-to-end manner. The image classifier f_C is alternately optimized with the loss L^C, as in most adversarial methods. The optimization can easily be conducted using gradient descent. For clarity, the training and the testing of our SDCN are summarized in Algorithm 1.

In the testing stage, as shown in Algorithm 1, only the feature extractor f_E and the refinement module f_{D_r} are needed, which makes our method as efficient as [26].
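Both directions of the collaboration loop can be condensed into a small sketch (an illustrative numpy toy assuming axis-aligned boxes and a single foreground class; `connected_components` is a simple 4-connected flood fill standing in for the components ŝ_kj): `reweight` implements Eqs. (16)-(17), and `detection_heatmap` implements Eqs. (21)-(22).

```python
import numpy as np
from collections import deque

def connected_components(mask):
    """4-connected components of a binary map."""
    lab = np.zeros(mask.shape, dtype=int)
    cur = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if lab[sy, sx]:
            continue
        cur += 1
        lab[sy, sx] = cur
        q = deque([(sy, sx)])
        while q:
            y, x = q.popleft()
            for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not lab[ny, nx]):
                    lab[ny, nx] = cur
                    q.append((ny, nx))
    return [lab == c for c in range(1, cur + 1)]

def comp_box_iou(comp, box):
    """IoU between a component mask and an (x1, y1, x2, y2) box."""
    bm = np.zeros_like(comp)
    bm[box[1]:box[3], box[0]:box[2]] = True
    union = np.logical_or(comp, bm).sum()
    return np.logical_and(comp, bm).sum() / union if union else 0.0

def reweight(d_m, boxes, seg_mask, tau=0.5):
    """Eqs. (16)-(17) for one class: D_seg(i) = max_j IoU(comp_j, b_i) + tau,
    normalized by its maximum, then multiplied elementwise into D_m."""
    comps = connected_components(seg_mask)
    d_seg = np.array([max((comp_box_iou(c, b) for c in comps), default=0.0) + tau
                      for b in boxes])
    return d_m * (d_seg / d_seg.max())

def detection_heatmap(boxes, scores, h, w):
    """Eq. (21): each proposal adds its score to all pixels it covers;
    Eq. (22): background = 1 - max over the normalized foreground channels."""
    fg = np.zeros((scores.shape[1], h, w))
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        fg[:, y1:y2, x1:x2] += s[:, None, None]
    m = fg.reshape(fg.shape[0], -1).max(axis=1)
    fg /= np.where(m > 0, m, 1.0)[:, None, None]   # per-class max normalization
    return np.concatenate([fg, (1.0 - fg.max(axis=0))[None]], axis=0)

# a part proposal vs. a full-object proposal against a 6x6 object region
seg = np.zeros((10, 10), dtype=bool); seg[2:8, 2:8] = True
boxes = [(2, 2, 5, 5), (2, 2, 8, 8)]
d_hat = reweight(np.array([0.6, 0.4]), boxes, seg)   # full box now outranks the part
S_det = detection_heatmap(boxes, np.array([[0.5], [0.5]]), 10, 10)
```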
4. Experiments
We evaluate the proposed segmentation-detection collaborative network (SDCN) for weakly supervised object detection to prove its advantages over the state-of-the-arts.
Datasets.
The evaluation is conducted on two commonlyused datasets for weakly supervised detection, including the
Algorithm 1
Training and Testing SDCN
Input: training set with category labels T = {(I, y)}.
procedure TRAINING
  1: forward the SDCN: f_E(I) → x, f_D(x) → D, f_S(x) → S,
  2: forward the classifier f_C(s_k ∗ I) and f_C((1 − s_k) ∗ I),
  3: generate the variables D_seg and S_det from S and D,
  4: compute L^{D←S} in Eq. (20) and L^{S←D} in Eq. (24),
  5: backward the loss L = L^{D←S} + L^{S←D} for the SDCN,
  6: compute and backward the loss L^C for f_C,
  7: continue until convergence.
Output: the optimized SDCN (f_E and f_D) for detection.

Input: test set T = {I}.
procedure TESTING
  1: forward the SDCN: f_E(I) → x, f_{D_r}(x) → D,
  2: post-process for detected bounding boxes with D.
Output: the detected object bounding boxes for T.

PASCAL VOC 2007 [12] and 2012 [11]. The VOC 2007 dataset includes 9,963 images with 24,640 objects in 20 classes. It is divided into a trainval set with 5,011 images and a test set with 4,952 images. The more challenging VOC 2012 dataset consists of 11,540 images with 27,450 objects in its trainval set and 10,991 images for test. In our experiments, the trainval split is used for training and the test set is used for testing. The performance is reported in terms of two metrics: 1) correct localization (CorLoc) [7] on the trainval split and 2) average precision (AP) on the test set.

Implementation.
For the backbone network f_E, we use VGG-16 [21]. For f_D, the same architecture as that in OICR [26] is employed. For f_S, a segmentation header similar to that of CPN [4] is adopted. For the adversarial classifier f_C, ResNet-101 [16] is used, and the segmentation masking operation is applied after the res4b22 layer. The detailed architecture is shown in Appendix A.

We follow a three-step training strategy: 1) the classifier f_C is trained with a fixed learning rate until its convergence; 2) the segmentation branch f_S and the detection branch f_D are pre-trained without collaboration; 3) the entire architecture is trained in the end-to-end manner. The SDCN runs for 40k iterations, followed by 30k iterations with a decayed learning rate. The same multi-scale training and testing strategies as in OICR [26] are adopted. To achieve balanced impacts between the detection and segmentation branches, the weights of the losses are simply set so that the gradients have similar scales, i.e., λ^S_adv = 1, λ^D_mil = 1, and λ^D_ref = 1, with smaller fractional weights for λ^S_cls and λ^S_seg. The constant τ in Eq. (17) is empirically set to 0.5.

Ablation Study. Our ablation study is conducted on the VOC 2007 dataset. Four weakly supervised strategies are compared and the
results are shown in Table 2. The baseline detection method without the segmentation branch is the same as OICR [26]. Another naive baseline directly includes the detection and segmentation modules in a multi-task manner without any collaboration between them. The model where only the segmentation branch instructs the detection branch is also tested. Its mAP is the lowest, since the mean intersection over union (mIoU) between the segmentation results and the ground truth drops from 37% to 25.1% without the guidance of the detection branch, which proves that these two branches should not collaborate one-way. Our method with the segmentation-detection collaboration achieves the highest mAP. It can be observed that the proposed method improves all baseline models by large margins, demonstrating the effectiveness and necessity of the collaboration loop between detection and segmentation.

The segmentation masks and detection results without and with the collaboration are visualized in Fig. 3. As observed in Fig. 3a, with the instruction from the detection branch, the segmentation map becomes much more precise, with fewer confusions between the background and the class-related region. Similarly, as shown in Fig. 3b, the baseline approach inclines to mix discriminative parts with target object bounding boxes, while with the guidance from the segmentation, more complete objects are detected. The visualization clearly illustrates the benefits to each other.

Figure 3: Visualization of the segmentation and the detection results without and with collaboration. In (a), the columns from left to right are the original images and the segmentation maps obtained without and with the collaboration loop. In (b), the detection results of OICR [26] without collaboration and of the proposed method with the collaboration loop are shown with red and green boxes, respectively. (Absence of boxes means no detected object given the detection threshold.)
Det. branch   Seg. branch   Seg. → Det.   Det. → Seg.   mAP
    √
    √             √
    √             √             √
    √             √             √             √

(mAP values lost in extraction.)

Table 2: mAP (in %) of different weakly supervised strategies with the same backbone on the VOC 2007 dataset.

For the validation of hyper-parameters and detailed error analysis, please refer to Appendix B.
All comparison methods are first evaluated on VOC 2007, as shown in Table 3 and Table 4, in terms of mAP and CorLoc. Among single-stage methods, our method outperforms the others on most categories, leading to a notable improvement on average. In particular, our method performs much better than the state-of-the-arts on "boat", "cat", and "dog", as our approach tends to detect more complete objects, even though in most cases instances of these categories can be identified from parts alone. Moreover, our method produces significant improvements over OICR [26] with exactly the same architecture. The most competitive method [27] is designed for weakly supervised object proposal generation, which is not really competing but complementary to our method; replacing the fixed object proposals in our method with [27] could potentially improve the performance further. Besides, the performance of our single-stage method is even comparable with the multiple-stage methods [26, 31, 33, 29], illustrating the effectiveness of the proposed dynamic collaboration loop.

Furthermore, all methods can be enhanced by training with multiple stages, as shown at the bottom of Table 3. Following [26, 31], the top-scoring detection bounding boxes from SDCN are used as the labels for training a Fast R-CNN [14] with the VGG16 backbone, denoted as SDCN+FRCNN. With this simple multi-stage training strategy, the performance is further boosted to 51%, surpassing all the state-of-the-art multiple-stage methods, even though [26, 27] use more complex ensemble models. It is noted that some approaches, e.g. HCP+DSD+OSSH3 [17] and ZLDN-L [33], attempt to design more elaborate training mechanisms using self-paced or curriculum learning.
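The multi-stage step above (mining SDCN's top-scoring boxes as labels for Fast R-CNN) can be sketched as follows. This is a minimal illustration, not the authors' exact implementation; the function name and the score threshold are ours:

```python
import numpy as np

def pseudo_labels(boxes, scores, image_labels, score_thresh=0.0):
    """Mine pseudo ground-truth annotations for Fast R-CNN training.

    For each class present in the image-level labels, keep the top-scoring
    detected box as a pseudo ground-truth annotation.

    boxes:        (N, 4) box coordinates from the weak detector
    scores:       (N, C) per-class detection scores
    image_labels: iterable of class indices present in the image
    """
    gt = []
    for c in image_labels:
        best = int(np.argmax(scores[:, c]))
        if scores[best, c] > score_thresh:
            gt.append((c, boxes[best]))  # (class, box) pair for the next stage
    return gt
```

A Fast R-CNN is then trained on these (class, box) pairs exactly as if they were human annotations.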
[Table 3 body: per-class AP on VOC 2007 test for single-stage methods (WSDDN-VGG16 [2], OICR-VGG16 [26], SDCN (ours)) and multiple-stage methods (WSDDN-Ens. [2], HCP+DSD+OSSH3 [17], OICR-Ens.+FRCNN [26], SDCN+FRCNN (ours)); most entries were lost in extraction. Recoverable mAPs: WSDDN-VGG16 34.8, WSDDN-Ens. 39.3, HCP+DSD+OSSH3 41.7.]
Table 3: Average precision (in %) for our method and the state-of-the-arts on VOC 2007 test split.
[Table 4 body: per-class CorLoc on VOC 2007 trainval for single-stage (WSDDN-VGG16 [2], ours) and multiple-stage (HCP+DSD+OSSH3 [17], WSDDN-Ens. [2], ours) methods; most entries were lost in extraction. Recoverable CorLoc: HCP+DSD+OSSH3 56.1.]

Table 4: CorLoc (in %) for our method and the state-of-the-arts on VOC 2007 trainval split.
Methods                       mAP    CorLoc
Single-stage
  OICR-VGG16 [26]             37.9   62.1
  TS2C [31]                   40.0   64.4
  [27]                        40.8   64.9
  SDCN (ours)                 [values lost in extraction]
Multiple-stage
  MELM-L2+ARL [29]            42.4   –
  OICR-Ens.+FRCNN [26]        42.5   65.6
  ZLDN-L [33]                 42.9   61.5
  TS2C+FRCNN [31]             44.4   –
  Ens.+FRCNN [27]             45.7   69.3
  SDCN+FRCNN (ours)           [values lost in extraction]

Table 5: mAP and CorLoc (in %) for our method and the state-of-the-arts on VOC 2012 trainval split.

We believe that the performance of our model SDCN+FRCNN can be further improved by adopting such algorithms.

The comparison methods are further evaluated on the more challenging VOC 2012 dataset, as shown in Table 5. As expected, the proposed method achieves significant improvements with the same architecture as [26, 31], demonstrating its superiority again.

Overall, our SDCN significantly improves the performance of weakly supervised object detection on average, benefiting from the deep collaboration of segmentation and detection. However, there are still several classes on which the performance is hardly improved, as shown in Table 3, e.g. "chair" and "person". The main reason is the large portion of occluded and overlapped samples for these classes, which leads to incomplete or connected responses on the segmentation map and poor interaction with the detection branch, leaving room for further improvement.
Time cost.
Our training speed is roughly 2× slower than that of the baseline OICR [26], but the testing time costs of our method and OICR are the same, since they share exactly the same architecture in the detection branch.
5. Conclusions and Future Work
In this paper, we present a novel segmentation-detection collaborative network (SDCN) for weakly supervised object detection. Different from previous works, our method exploits a collaboration loop between the segmentation and detection tasks to combine the merits of both. Extensive experimental results show that our method exceeds the previous state-of-the-arts while remaining efficient in the inference stage. The design of SDCN could be made more elaborate for densely overlapped or partially occluded objects, which is more challenging and left as future work.

A. Appendix: Network Architecture
The network architectures of the proposed method are shown in Fig. 4. The feature extractor f_E and the detection branch f_D are exactly the same as in OICR [26], while the segmentation branch f_S follows the design of the RefineNet in CPN [4]. The classification network f_C for generative adversarial localization is omitted, considering that it has exactly the same architecture as the well-known ResNet [16].

The feature extractor f_E in Fig. 4a is basically the VGG16 [21] network. The max-pooling layer after "conv4" and its subsequent convolutional layers are replaced by dilated convolutional layers in order to increase the resolution of the last output feature map.

The detection branch f_D is composed of a multiple instance detection network (MIDN) f_Dm and an online instance classifier refinement module f_Dr, shown in green and blue in Fig. 4b, respectively. In the MIDN, two branches are in charge of computing the instance classification weights for each proposal and classifying each proposal, respectively, by performing softmax along different dimensions. For the refinement module, although the instance classifier is refined only once in the manuscript for clarity of illustration, it can in fact be refined multiple times. We follow OICR [26], which performs the refinement 3 times, as shown in Fig. 4b. The k-th (k = 1, 2, 3) refinement is instructed by the (k−1)-th detection results (with D_m as the 0-th detection result). During testing, the outputs from all refinement branches are averaged for the final detection result, D = (1/3) Σ_{i=1}^{3} D_{ri}.

The segmentation branch f_S is shown in Fig. 4c and is similar to the RefineNet in CPN [4], which is effective in integrating multi-scale information for the accurate localization problem.
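The "softmax along different dimensions" in the MIDN is the standard WSDDN-style two-stream combination: one stream is normalized over classes (classifying each proposal), the other over proposals (weighting each proposal), and their element-wise product summed over proposals yields image-level scores for the MIL loss. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def midn_scores(x_cls, x_det):
    """Combine the two MIDN streams into per-proposal and image-level scores.

    x_cls, x_det: (num_proposals, num_classes) logits from the two
    parallel fully connected branches.
    """
    sigma_cls = softmax(x_cls, axis=1)  # classify each proposal (over classes)
    sigma_det = softmax(x_det, axis=0)  # weight each proposal (over proposals)
    proposal_scores = sigma_cls * sigma_det     # element-wise product
    image_scores = proposal_scores.sum(axis=0)  # per-class scores in (0, 1]
    return proposal_scores, image_scores
```

Since the detection-stream column sums to 1 over proposals, each image-level score lies in (0, 1] and can be trained against the image category labels.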
As illustrated in [4], the architecture, mainly consisting of several stacked bottleneck blocks, can transmit information across different scales and integrate all of them. The normalization layers in the bottleneck blocks are changed from batch normalization to group normalization [32] in our experiments, given that the batch size is too small to train a good batch normalization layer.

Figure 4: Network architectures for (a) the feature extractor f_E, (b) the detection branch f_D, and (c) the segmentation branch f_S. (Layer diagrams omitted; the segmentation branch applies 1×1 and 3×3 convolutions to the "conv3"–"conv5" features, concatenates and upsamples them, and ends with a 3×3 convolution to 21 channels followed by a softmax.)
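Group normalization [32] computes the mean and variance per sample per channel group rather than across the batch, so its statistics are independent of batch size. A minimal NumPy sketch of the operation (the function name and scalar gamma/beta are simplifications; in practice they are learned per-channel parameters):

```python
import numpy as np

def group_norm(x, num_groups, gamma=1.0, beta=0.0, eps=1e-5):
    """Group normalization for a (N, C, H, W) feature map.

    Channels are split into num_groups groups; mean/variance are computed
    per sample per group, so the statistics do not depend on batch size.
    """
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return gamma * g.reshape(n, c, h, w) + beta
```

With a batch size of 1 or 2, batch-normalization statistics become noisy, whereas the per-group statistics above remain stable, which motivates the swap.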
B. Appendix: Further Ablation Study
B.1. Investigation of Hyper-parameters
The influences of the balance weights λ^S_adv, λ^S_cls, λ^S_seg, λ^D_mil, and λ^D_ref are shown in Fig. 5. As can be seen, the detection performance is not sensitive to these parameters when they are larger than 0.1, demonstrating the robustness of the proposed method.

Figure 5: The curves of the mAP varying with the balance weights for each loss on the PASCAL VOC test set.
B.2. Error Analysis
We investigate the detailed sources of errors following [5], where detected boxes are categorized into five cases: 1) correct localization (overlap with the ground-truth ≥ 0.5), 2) hypothesis completely inside the ground-truth, 3) ground-truth completely inside the hypothesis, 4) low overlap, and 5) no overlap. Even for difficult cases, e.g. occluded or distorted objects and multiple instances in one image, the proposed method still detects these objects.
Figure 6: Per-class frequencies of error modes, and averaged across all classes, for the baseline OICR [26] and our proposed method on the PASCAL VOC 2007 trainval set.

Figure 7: Visualization of the proposed SDCN on the PASCAL VOC 2007 test set.
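The five error modes used in the analysis above can be computed from box coordinates alone. A minimal sketch (function names are ours; the 0.5 IoU threshold follows [5], and the convention for boxes is (x1, y1, x2, y2)):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def contains(outer, inner):
    """True if box `inner` lies entirely inside box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def error_mode(hypo, gt):
    """Categorize a detected box against a ground-truth box, following [5]."""
    o = iou(hypo, gt)
    if o >= 0.5:
        return "correct"
    if contains(gt, hypo):
        return "hypo in GT"   # the detector fired on a discriminative part
    if contains(hypo, gt):
        return "GT in hypo"   # the detector included surrounding context
    return "low overlap" if o > 0 else "no overlap"
```

The "hypo in GT" mode is exactly the part-domination failure that the collaboration loop is designed to reduce.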
References

[1] H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with posterior regularization. In British Machine Vision Conference (BMVC), pages 1–12, 2014.
[2] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2846–2854, 2016.
[3] M. Blaschko, A. Vedaldi, and A. Zisserman. Simultaneous object detection and ranking with weak supervision. In Advances in Neural Information Processing Systems (NIPS), pages 235–243, 2010.
[4] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[5] R. G. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. IEEE Transactions on Pattern Analysis & Machine Intelligence (TPAMI), 39(1):189–203, 2017.
[6] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In European Conference on Computer Vision (ECCV), pages 452–466, 2010.
[7] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. International Journal of Computer Vision (IJCV), 100(3):275–293, 2012.
[8] A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and L. Van Gool. Weakly supervised cascaded convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.
[10] T. Durand, T. Mordan, N. Thome, and M. Cord. WILDCAT: Weakly supervised learning of deep ConvNets for image classification, pointwise localization and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1):98–136, 2015.
[12] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.
[13] W. Ge, S. Yang, and Y. Yu. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[14] R. Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
[15] R. Gokberk Cinbis, J. Verbeek, and C. Schmid. Multi-fold MIL training for weakly supervised object localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2409–2416, 2014.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[17] Z. Jie, Y. Wei, X. Jin, J. Feng, and W. Liu. Deep self-taught learning for weakly supervised object localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[18] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision (ECCV), 2016.
[19] O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei. Object-centric spatial pooling for image classification. In European Conference on Computer Vision (ECCV), pages 1–15, 2012.
[20] Y. Shen, R. Ji, S. Zhang, W. Zuo, and Y. Wang. Generative adversarial learning towards fast weakly supervised detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2014.
[22] P. Siva, C. Russell, and T. Xiang. In defence of negative mining for annotating weakly labelled data. In European Conference on Computer Vision (ECCV), pages 594–608, 2012.
[23] P. Siva and T. Xiang. Weakly supervised object detector learning with model drift detection. In IEEE International Conference on Computer Vision (ICCV), pages 343–350, 2011.
[24] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In International Conference on Machine Learning (ICML), pages 1611–1619, 2014.
[25] H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell. Weakly-supervised discovery of visual pattern configurations. In Advances in Neural Information Processing Systems (NIPS), pages 1637–1645, 2014.
[26] P. Tang, X. Wang, X. Bai, and W. Liu. Multiple instance detection network with online instance classifier refinement. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[27] P. Tang, X. Wang, A. Wang, Y. Yan, W. Liu, J. Huang, and A. Yuille. Weakly supervised region proposal network and object detection. In European Conference on Computer Vision (ECCV), 2018.
[28] J. R. R. Uijlings, K. E. A. V. De Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171, 2013.
[29] F. Wan, P. Wei, J. Jiao, Z. Han, and Q. Ye. Min-entropy latent model for weakly supervised object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1297–1306, 2018.
[30] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[31] Y. Wei, Z. Shen, B. Cheng, H. Shi, J. Xiong, J. Feng, and T. Huang. TS2C: Tight box mining with surrounding segmentation context for weakly supervised object detection. In European Conference on Computer Vision (ECCV), 2018.
[32] Y. Wu and K. He. Group normalization. In European Conference on Computer Vision (ECCV), 2018.
[33] X. Zhang, J. Feng, H. Xiong, and Q. Tian. Zigzag learning for weakly supervised object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[34] Z.-H. Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2018.