Detector Discovery in the Wild: Joint Multiple Instance and Representation Learning
Judy Hoffman, Deepak Pathak, Trevor Darrell
UC Berkeley
{jhoffman, pathak, trevor}@eecs.berkeley.edu

Kate Saenko
UMass Lowell
[email protected]
Abstract
We develop methods for detector learning which exploit joint training over both weak and strong labels and which transfer learned perceptual representations from strongly-labeled auxiliary tasks. Previous methods for weak-label learning often learn detector models independently using latent variable optimization, but fail to share deep representation knowledge across classes and usually require strong initialization. Other previous methods transfer deep representations from domains with strong labels to those with only weak labels, but do not optimize over individual latent boxes, and thus may miss specific salient structures for a particular category. We propose a model that subsumes these previous approaches, and simultaneously trains a representation and detectors for categories with either weak or strong labels present. We provide a novel formulation of a joint multiple instance learning method that includes examples from classification-style data when available, and also performs domain transfer learning to improve the underlying detector representation. Our model outperforms known methods on ImageNet-200 detection with weak labels.
1. Introduction
It is well known that contemporary visual models thrive on large amounts of training data, especially those that directly include labels for desired tasks. Many real world settings contain labels with varying specificity, e.g., "strong" bounding box detection labels, and "weak" labels indicating presence somewhere in the image. We tackle the problem of joint detector and representation learning, and develop models which cooperatively exploit heterogeneous sources of training data, where some classes have no "strong" annotations. Our model optimizes a latent variable multiple instance learning model over image regions while simultaneously transferring a shared representation from detection-domain models to classification-domain models. The latter provides a key source of automatic and accurate initialization for latent variable optimization, which has heretofore been unavailable in such methods.

Figure 1: We learn detectors for categories with only weak labels (bottom row), by jointly transferring a representation from auxiliary categories with available strong annotations (top row) and solving an MIL problem on the weakly annotated data (green box).

Previous methods employ varying combinations of weak and strong labels of the same object category to learn a detector. Such methods seldom exploit available strong-labeled data of different, auxiliary categories, despite the fact that such data is very often available in many practical scenarios. Deselaers et al. [10] use auxiliary data to learn generic objectness information just as an initial step, but do not optimize jointly for weakly labeled data.

We introduce a new model for large-scale learning of detectors that can jointly exploit weak and strong labels, perform inference over latent regions in weakly labeled training examples, and can transfer representations learned from related tasks (see Figure 1). In practical settings, such as learning visual detector models for all available ImageNet categories, or for learning detector versions of other defined categories such as Sentibank's adjective-noun-phrase models [7], our model makes greater use of available data and labels than previous approaches. Our method takes advantage of such data by using the auxiliary strong labels to improve the feature representation for detection tasks, and uses the improved representation to learn a stronger detector from weak labels in a deep architecture.

To learn detectors, we exploit weakly labeled data for a concept, including both "easy" images (e.g., from ImageNet classification training data), and "hard" weakly labeled imagery (e.g., from PASCAL or ImageNet detection training data with bounding box metadata removed). We define a novel multiple instance learning (MIL) framework that includes bags defined on both types of data, and also jointly optimizes an underlying perceptual representation using strong detection labels from related categories. The latter takes advantage of the empirical results in [19], which demonstrated that knowledge of what makes a good perceptual representation for detection tasks could be learned from a set of paired weak and strong labeled examples, and the resulting adaptation could be transferred to new categories, even those for which no strong labels were available.

We evaluate our model empirically on the largest set of available ground-truth visual detection data, the ImageNet-200 category challenge.
Our method outperforms the previous best MIL-based approaches for held-out detector learning on ImageNet-200 [27] by 200%, and outperforms the previous best domain-adaptation based approach [19] by 12%. Our model is directly applicable to learning improved "detectors in the wild", including categories in ImageNet but not in ImageNet-200, or categories defined ad-hoc for a particular user or task with just a few training examples to fine-tune a new classification model. Such models can be promoted to detectors with no (or few) labeled bounding boxes. Upon acceptance we will release an open-source implementation of our model and all network and detector weights for an improved set of detectors for the ImageNet-7.5K dataset of [19].
2. Related Work
CNNs for Visual Recognition
Within the last few years, convolutional neural networks (CNNs) have emerged as the clear winners for many visual recognition tasks. A breakthrough was made when the positive performance demonstrated for digit recognition [25] began to translate to the ImageNet [27] classification challenge winner [22]. Shortly thereafter, the feature space learned through these architectures was shown to be generic and effective for a large variety of visual recognition tasks [12, 39]. These results were followed by state-of-the-art results for object detection [16, 29]. Most recently, it was shown that CNN architectures can be used to transfer generic information between the classification and detection tasks [19], improving detection performance for tasks which lack bounding box training data.
Training with Auxiliary Data Sources
There has been a large amount of prior work on training models using auxiliary data sources. The problem of visual domain adaptation precisely seeks to use data from a large auxiliary source domain to improve recognition performance on a target domain which has little or no labeled data available. Techniques to solve this problem consist of learning a new feature representation that minimizes the distance between source and target distributions [28, 23, 17, 15], regularizing the learning of a target classifier against the source model [36, 4, 9], or doing both simultaneously [20, 13].
Multiple Instance Learning
Since its inception, the MIL [11] problem has been approached in several frameworks, including Noisy-OR [18] and boosting [2, 40]. But most commonly, it was framed as a max-margin classification problem [3] with latent parameters optimized using alternating optimization [14, 37]. Overall, MIL is tackled in two stages: first finding a better initialization, and then using better heuristics for optimization. A number of methods have been proposed for initialization, which include using a large image region excluding the boundary [26], using a candidate set which covers the training data space [33, 34], using unsupervised patch discovery [32, 30], learning generic objectness knowledge from auxiliary categories [1, 10], learning latent categories from background to suppress it [35], or using class-specific similarity [31]. Approaches to better optimize the non-convex problem involve using multi-fold learning as a means of regularizing overfitting [8], optimizing the Latent SVM for the area under the ROC curve (AUC) [6], and training with easy examples first to avoid bad local optima [5, 24]. Most of these approaches perform reasonably only when the object covers most of the image, or when most of the candidate regions contain an object. The major challenges faced by MIL in general are the fixed feature representation and poor initialization, particularly in non-object-centric images. Our algorithm provides solutions to both of these issues.
3. Background: MI-SVM
We begin by briefly reviewing a standard solution to the multiple instance learning problem, Multiple Instance SVMs (MI-SVMs) [3] or Latent SVMs [14, 37]. In this setting, each weakly labeled image is considered a collection of regions which form a positive 'bag'. For a binary classification problem, the task is to maximize the bag margin, which is defined by the instance with highest confidence. For each weakly labeled image I ∈ W, we collect a set of regions of interest and define the index set of those regions as R_I. We next define a bag as B_I = {x_i | i ∈ R_I}, with label Y_I, and let the i-th instance in the bag be (x_i, y_i) ∈ R^p × {−1, +1}.

Figure 2: Our method jointly optimizes a representation and detectors for categories with only weakly annotated data. We first learn a feature representation conducive to MIL by initializing all parameters with classification style data. We then collectively refine the feature space with strongly annotated data from auxiliary tasks, and perform MIL in our detection feature space. The discovered positive patches are further used to refine the representation and detection weights.

For an image with a negative image-level label, Y_I = −1, we label all regions in the image as negative. For an image with a positive image-level label, Y_I = 1, we create a constraint that at least one positive instance occurs in the image bag.

In a typical detection scenario, R_I corresponds to the set of possible bounding boxes inside the image, and maximizing over R_I is equivalent to discovering the bounding box that contains the positive object. We define a representation φ(x_i) ∈ R^d for each instance, which is the feature descriptor for the corresponding bounding box, and formulate the MI-SVM objective as follows:

\min_{w \in \mathbb{R}^d} \|w\|^2 + \alpha \sum_{I} \ell\Big(Y_I, \max_{i \in R_I} w^T \phi(x_i)\Big) \qquad (1)

where α is a hyper-parameter and ℓ(y, ŷ) is the hinge loss. Interestingly, for negative bags, i.e. Y_I = −1, the knowledge that all instances are negative allows us to unfold the max operation into a sum over each instance. Thus, Equation (1) reduces to a standard QP with respect to w. For the case of positive bags, this formulation reduces to a standard SVM if the maximum scoring instance is known.

Based on this idea, Equation (1) is optimized using a classic concave-convex procedure [38], which decreases the objective value monotonically with a guarantee to converge to a local minimum or saddle point. For this reason, these methods are extremely susceptible to the feature representation and detector initialization [8, 33]. We address both these issues using annotated auxiliary data available to learn a better feature representation and reasonable initialization for MIL based methods.
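For concreteness, the alternation just described can be sketched in a few lines. This is a minimal illustration, assuming region features φ(x_i) are precomputed, and using scikit-learn's LinearSVC as a stand-in for the convex fixed-assignment step; mi_svm and its initialization heuristic are hypothetical, not the paper's implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC  # stand-in solver for the convex step


def mi_svm(pos_bags, neg_bags, n_iters=10, C=1.0):
    """Sketch of MI-SVM alternating optimization.

    pos_bags, neg_bags: lists of (n_regions, d) arrays of phi(x_i), one per
    image. Negative bags unfold into per-instance negatives, since every
    region of a negatively labeled image is known to be negative.
    """
    X_neg = np.vstack(neg_bags)
    # Crude initialization: mean feature of each positive bag. (The paper's
    # point is precisely that a transferred representation gives a much
    # better starting point than heuristics like this.)
    pos_inst = np.array([bag.mean(axis=0) for bag in pos_bags])
    clf = None
    for _ in range(n_iters):
        X = np.vstack([pos_inst, X_neg])
        y = np.hstack([np.ones(len(pos_inst)), -np.ones(len(X_neg))])
        clf = LinearSVC(C=C).fit(X, y)  # convex step: standard SVM / QP
        # Latent step: re-select the max-scoring region in each positive bag.
        pos_inst = np.array([bag[np.argmax(clf.decision_function(bag))]
                             for bag in pos_bags])
    return clf
```

Each iteration monotonically decreases the objective of Equation (1), but the fixed point reached depends heavily on the initial instance assignments, which motivates the initialization strategy of the next section.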
4. Large Scale Detection Learning
We propose a detection learning algorithm that uses a heterogeneous data source, containing only weak labels for some tasks, to produce strong detectors for all. Let the set of images with only weak labels be denoted as W and the set of images with strong labels (bounding box annotations) from auxiliary tasks be denoted as S. We assume that the set of object categories that appear in the weakly labeled set, C_W, does not overlap with the set of object categories that appear in the strongly labeled set, C_S. For each image in the weakly labeled set, I ∈ W, we have an image-level label per category k: Y^k_I ∈ {1, −1}. For each image in the strongly labeled set, I ∈ S, we have a label per category k, per region in the image, i ∈ R_I: y^k_i ∈ {1, −1}. We seek to learn a representation, φ(·), that can be used to train detectors for all object categories, C = {C_W ∪ C_S}. For a category k ∈ C, we denote the category specific detection parameter as w_k and compute our final detection scores per region, x, as score_k(x) = w_k^T φ(x).

We propose a joint optimization algorithm which learns a feature representation, φ(·), and detectors, w_k, using the combination of strongly labeled detection data, S, with weakly labeled data, W. For a fixed representation, one can directly train detectors for all categories represented in the strongly labeled set, k ∈ C_S. Additionally, for the same fixed representation, we reviewed in the previous section techniques to train detectors for the categories in the weakly labeled data set, k ∈ C_W. Our insight is that the knowledge from the strong label set can be used to help guide the optimization for the weak labeled set, and we can explicitly adapt our representation for the categories of interest and for the generic detection task. Below, we state our overall objective:

\min_{w_k, \phi} \sum_{k \in \mathcal{C}} \Gamma(w_k) + \alpha \sum_{I \in \mathcal{W}} \sum_{p \in \mathcal{C}_W} \ell\Big(Y^p_I, \max_{i \in R_I} w_p^T \phi(x_i)\Big) + \alpha \sum_{I \in \mathcal{S}} \sum_{i \in R_I} \sum_{q \in \mathcal{C}_S} \ell\Big(y^q_i, w_q^T \phi(x_i)\Big) \qquad (2)

where α is a scalar hyper-parameter, ℓ(·) is the loss function and Γ(·) is a regularization over the detector weights. This formulation is non-convex in nature due to the presence of instance level ambiguity. It is difficult to optimize directly, so we choose a specific alternating minimization approach (see Figure 2).

We begin by initializing a feature representation and initial CNN classification weights using auxiliary weakly labeled data (blue boxes, Figure 2). These weights can be used to compute scores per region proposal to produce initial detection scores. We next use available strongly annotated data from auxiliary tasks to transfer category invariant information about the detection problem. We accomplish this through further optimizing our feature representation and learning generic background detection weights (red boxes, Figure 2). We then use the well tuned detection feature space to perform MIL on our weakly labeled data to find positive instances (yellow box, Figure 2). Finally, we use our discovered positive instances together with the strongly annotated data from auxiliary tasks to jointly optimize all parameters corresponding to the feature representation and detection weights; this loop is sketched below, and each stage is detailed in the remainder of this section.
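As a roadmap, the alternating minimization of Equation (2) can be outlined in code. This is an illustrative sketch only, assuming each stage of Figure 2 is available as a callable; the function names are hypothetical stand-ins, not part of the released implementation (in the paper each stage is a CNN fine-tuning run or a Latent SVM solve).

```python
from typing import Callable


def train_joint_detector(
    init_classification: Callable[[], object],       # Eq. (3): fine-tune CNN on image-level labels
    transfer_detection: Callable[[object], object],  # Eq. (4): adapt phi and strong-category detectors on S
    mine_regions: Callable[[object], object],        # Eq. (5): latent positive-region discovery on W
    refine_all: Callable[[object, object], object],  # Eq. (6): joint re-finetuning on W and S
    n_outer_iters: int = 1,
) -> object:
    """Outline of the alternating minimization of Equation (2)."""
    model = init_classification()         # blue boxes in Figure 2
    model = transfer_detection(model)     # red boxes: strong auxiliary labels
    for _ in range(n_outer_iters):
        mined = mine_regions(model)       # yellow box: MIL over weak bags
        model = refine_all(model, mined)  # joint refinement; optionally iterate
    return model
```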
We now discuss our procedure for initializing the feature representation and detection weights. We want to use a representation which makes it possible to separate objects of interest from background and makes it easy to distinguish different object categories. Convolutional neural networks (CNNs) have proved effective at providing the desired semantically discriminative feature representation [12, 16, 29]. We use the architecture which won the ILSVRC2012 classification challenge [22], since it is one of the best performing and most studied models. The network contains roughly 60 million parameters, and so must be pre-trained on a large labeled corpus. Following the standard protocol, we use auxiliary weakly labeled data that was collected for training a classification task for this initial training of the network parameters (Figure 2: blue boxes). This data is usually object centric and is therefore effective for training a network that is able to discriminate between different categories. We remove the classification layer of the network and use the output of the fully connected layer, fc, as our initial feature representation, φ(·).

We next learn initial values for all of the detection parameters, w_k, ∀k ∈ C. To solve this, we begin by solving the simplified learning problem of image-level classification. The image, I ∈ S, is labeled as positive for a category k if any of the regions in the image are labeled as positive for k, and is labeled as negative otherwise; we denote the image-level label as in the weakly labeled case: Y^k_I. Now, we can optimize over all images to refine the representation and learn category specific parameters that can be used per region proposal to produce detection scores:

\min_{w_k, \phi} \sum_{k \in \mathcal{C}} \Gamma(w_k) + \alpha \sum_{I \in \{\mathcal{W} \cup \mathcal{S}\}} \ell\Big(Y^k_I, w_k^T \phi(I)\Big) \qquad (3)

We optimize Equation (3) through fine-tuning our CNN architecture with a new K-way last fully connected layer, where K = |C|.

Motivated by the recent representation transfer result of Hoffman et al. [19] (LSDA), we learn to generically transform our classification feature representation into a detection representation by using the strongly labeled detection data to modify the representation, φ(·), as well as the detectors, w_k, k ∈ C_S (Figure 2: red boxes). In addition, we use the strongly annotated detection data to initialize a new "background" detector, w_b. This detector explicitly attempts to recognize all data labeled as negative in our bags. However, since we initialize this detector with the strongly annotated data, we know precisely which regions correspond to background. The intermediate objective is:

\min_{w_q, \phi} \sum_{q \in \{\mathcal{C}_S, b\}} \Big[ \Gamma(w_q) + \alpha \sum_{I \in \mathcal{S}} \sum_{i \in R_I} \ell\Big(y^q_i, w_q^T \phi(x_i)\Big) \Big] \qquad (4)

Again, this is accomplished by fine-tuning our CNN architecture with the strongly labeled data, while keeping the detection weights for the categories with only weakly labeled data fixed. Note, we do not include the last layer adaptation part of LSDA, since it would not be easy to include in the joint optimization. Moreover, it has been shown that the adaptation step does not contribute significantly to the accuracy [19].

With a representation that has now been directly tuned for detection, we fix the representation, φ(·), and consider solving for the regions of interest in each weakly labeled image. This corresponds to solving the second term in Equation (2), i.e.:

\min_{w_p} \sum_{p \in \{\mathcal{C}_W, b\}} \Big[ \Gamma(w_p) + \alpha \sum_{I \in \mathcal{W}} \ell\Big(Y^p_I, \max_{i \in R_I} w_p^T \phi(x_i)\Big) \Big] \qquad (5)

Note, we can decouple this optimization problem and independently solve for each category in our weakly labeled data set, p ∈ C_W. Let us consider a single category p. Our goal is to minimize the loss for category p over images I ∈ W.
We will do this by considering two cases. First, if p is not in the weak label set of an image (Y^p_I = −1), then all regions in that image should be considered negative for category p. Second, if Y^p_I = 1, then we positively label a region x_i if it has the highest confidence of containing the object, and negatively label all other regions (this labeling rule is sketched at the end of this section). We perform the discovery of this top region in two steps. First, we narrow down the set of candidate bounding boxes using the score, w_p^T φ(x_i), from our fixed representation and detectors from the previous optimization step. This set is then refined to estimate the region most likely to contain the positive instance in a Latent SVM formulation. The implementation details are discussed in Section 5.2.

Our final optimization step is to use the discovered annotations from our weak data-set to refine our detectors and feature representation from the previous optimization step. This amounts to the subsequent step for alternating minimization of the joint objective described in Equation (2). We collectively utilize the strong annotations of images in S and the estimated annotations for the weakly labeled set, W, to optimize the detector weights and feature representation, as follows:

\min_{w_k, \phi} \sum_{k \in \{\mathcal{C}, b\}} \Big[ \Gamma(w_k) + \alpha \sum_{I \in \{\mathcal{W} \cup \mathcal{S}\}} \sum_{i \in R_I} \ell\Big(y^k_i, w_k^T \phi(x_i)\Big) \Big] \qquad (6)

This is achieved by re-finetuning the CNN architecture. The refined detector weights and representation can be used to mine the bounding box annotations for weakly labeled data again, and this process can be iterated (see Figure 2). We discuss re-training strategies and evaluate the contribution of this final optimization step in Section 5.3.
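For concreteness, the two-case labeling rule used in the mining step of Equation (5) can be written as a short function. This is an illustrative sketch assuming region features are precomputed; label_regions is a hypothetical helper, not part of the released implementation.

```python
import numpy as np


def label_regions(bag_feats, image_label, w_p):
    """Per-region labels for one weakly labeled image and category p (sketch).

    bag_feats: (n_regions, d) array of phi(x_i); image_label: +1 or -1,
    the image-level label Y^p_I; w_p: current detector weights for p.
    """
    labels = -np.ones(len(bag_feats))  # case 1: all regions negative
    if image_label == 1:
        # case 2: the single max-scoring region becomes the positive instance
        # (in practice the candidate set is first narrowed by score, then
        # refined with a Latent SVM, as described above).
        scores = bag_feats @ w_p
        labels[np.argmax(scores)] = 1
    return labels
```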
5. Experiments
We now study the effectiveness of our algorithm by applying it to a standard detection task.

Table 1: Statistics of the ILSVRC13 detection dataset. The training set has fewer objects per image than the validation set.

          Num images   Num objects
  Train   395905       345854
  Val     20121        55502
We use the ILSVRC13 detection dataset [27] for our experiments. This dataset provides bounding box annotations for 200 categories. The dataset is separated into three pieces: train, val, test (see Table 1). The training images have fewer objects per image on average than validation set images, so they constitute classification style data [19]. Following prior work [16], we use the further separation of the validation set into val1 and val2. Overall, we use the train and val1 sets as our training data source and evaluate our performance on the data in val2, across all 200 categories.

We use the open source deep learning framework Caffe [21] for the implementation, training and fine-tuning of our CNN architecture.

One of the key components of our system is using strong annotations from auxiliary tasks to learn a representation where it is possible to discover patches that correspond to the objects of interest in our weakly labeled data source. We begin our analysis by studying the patch discovery that our feature space enables. We optimize the patch discovery (Equation (5)) using a one-vs-all Latent SVM formulation and optimize the formulation for the AUC criterion [6]. The feature descriptor used is the output of the fully connected layer, fc, of the CNN, which is produced after fine-tuning the feature representation with strongly annotated data from auxiliary tasks. Following our alternating minimization approach, these discovered top boxes are then used to re-estimate the weights and feature representations of our CNN architecture.

Figure 3: Example mined bounding boxes learned using our method. The left side shows the mined boxes after fine-tuning with images in classification settings only, and the right side shows the mined boxes after fine-tuning with the auxiliary strongly annotated dataset. We show the top 5 mined boxes across the dataset for the corresponding category. Examples with a green outline are categories for which our algorithm was able to correctly mine patches of the object, while the feature space with only weak label training was not able to produce correct patches. In yellow we highlight the specific example of "tennis racket". None of the discovered patches from the original feature space correctly located the tennis racket and instead included the person as well. After incorporating the strong annotations from auxiliary tasks, our method starts discovering tennis rackets, though it still has some confusion with the person playing tennis.

To evaluate the quality of mined boxes, we perform a precision analysis with respect to their overlap with ground truth, which is measured using the standard intersection over union (IoU) metric. Table 2 reports the precision for varying overlap thresholds. Our optimization approach produces one positive patch per image with a weak label, and a discovered patch is considered a true positive if it overlaps sufficiently with the ground truth box that corresponds to that label. Since each patch, once discovered, is considered an equivalent positive (regardless of score) for the purpose of retraining, this simple precision metric is a good indication of the usefulness of our mined patches. It is interesting that a significant fraction of mined boxes have high overlap with the ground truth regions.
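The overlap criterion used here is the standard IoU metric. A minimal sketch of the metric and the resulting precision computation, assuming boxes are given as [x1, y1, x2, y2] lists (helper names are hypothetical):

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def mined_precision(mined_boxes, gt_boxes, thresh=0.5):
    """Fraction of mined boxes whose IoU with their ground truth >= thresh."""
    hits = sum(iou(m, g) >= thresh for m, g in zip(mined_boxes, gt_boxes))
    return hits / len(mined_boxes)
```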
For reference, we also computed the standard mean average precision over the discovered patches and report these results.

It is important to understand not only that our new feature space improves the quality of the resulting patches, but also what type of errors our method reduces. In Figure 3, we show the top 5 scoring discovered patches before and after modifying the feature space with strong annotations from auxiliary tasks. We find that in many cases the improvement comes from better localization. For example, without auxiliary strong annotations we mostly discover the face of a lion rather than the body that we discover after our algorithm. Interestingly, there is also an issue with co-occurring classes. In the bottom row of Figure 3, we show the top 5 discovered patches for "tennis racket". Once we incorporate strong annotations from auxiliary tasks we begin to be able to distinguish the person playing tennis from the racket itself. Finally, there are some example mined patches where we reduce quality after incorporating the strong annotations from auxiliary tasks. For example, one of our strongly annotated categories is "computer keyboard". Due to the strong training with keyboard images, some of our mined patches for "laptop" start to have higher scores on the keyboard rather than the whole laptop (see Figure 4).

Table 2: Precision analysis and mAP performance of discovered patches in our weakly labeled training set (val1) of the ILSVRC13 detection dataset, compared at varying amounts of overlap with the ground truth box. About 25% of our mined boxes have an overlap of at least 0.9. Our method is able to significantly improve the quality of mined boxes after incorporating strong annotations from auxiliary tasks.

                                     Precision                          mAP
                                     ov=0.3   ov=0.5   ov=0.7   ov=0.9  ov=0.5
  Without auxiliary strong dataset   29.63    26.10    24.28    23.43   13.13
  Ours                               32.69    28.81    26.27    24.78   22.81

Figure 4: Example mined boxes of the category "laptop" where using auxiliary strongly annotated data causes patch discovery to diverge. Top row: the mined boxes obtained after fine-tuning with images in classification settings only. Bottom row: the mined boxes obtained after fine-tuning with the auxiliary strongly annotated dataset that contains the category "computer keyboard". These patches were low scoring examples, but we show them here to demonstrate a potential failure case – specifically, when one of the strongly annotated classes is a part of one of the weakly labeled classes.
Now that we have analyzed the intermediate result of our algorithm, we next study the full performance of our system. Figure 5 shows the mean average precision (mAP) percentage computed over the categories in val2 of ILSVRC13 for which we only have weakly annotated training data (categories 101-200). We compare to two state-of-the-art methods for this scenario and show that our algorithm significantly outperforms both of the previous state-of-the-art techniques. The first, LCL [35], detects in the standard weakly supervised setting, having no bounding box annotations for any of the 200 categories. This method also only reports results across all 200 categories. Our experiments indicate that the first 100 categories are easier on average than the second 100 categories, therefore the 6.0% mAP may actually be an upper bound on the performance of this approach. The second algorithm we compare against is LSDA [19], which does utilize the bounding box information from the first 100 categories.

Figure 5: Comparison of mAP (%) for categories without any bounding box annotations (101-200 of val2) of ILSVRC13. Our method significantly outperforms both previous state-of-the-art algorithms: LCL [35] and LSDA [19]. *The value for LCL was computed across all 200 categories. Our experiments show that this is an easier task, resulting in higher numbers overall.

We next consider different re-training strategies for learning new features and detection weights after discovering the positive patches in the weakly labeled data. Table 3 reports the mean average precision (mAP) percentage for no re-training (directly using the feature space learned after incorporating the strong labels), re-training only the category detection parameters, and re-training feature representations jointly with detection weights. In our experiments the improved performance is due to the first iteration of the overall algorithm. We find that the best approach is to jointly learn to refine the feature representation and the detection weights. More specifically, we learn a new feature representation by fine-tuning all fully connected layers in the CNN architecture.

We finally analyze examples where our full algorithm outperforms the previous state-of-the-art, LSDA [19]. Figure 6 shows a sample of the types of errors our algorithm improves on. These include localization errors, confusion with other categories, and interestingly, confusion with co-occurring categories. In particular, our algorithm provides improvement when searching for a small object (ball or helmet) in a sports scene. Training only with weak labels causes the previous state-of-the-art to confuse the player and the object, resulting in a detection that includes both. Our algorithm is able to localize only the small object and recognize that the player is a separate object of interest.

Figure 6: Examples where our algorithm outperforms the previous state-of-the-art. We show the top scoring detection from the baseline detector, LSDA [19], with a red box and label, and the top scoring detection from our method, LSDL, with a green box and label. Our algorithm improves localization (ex: rabbit, lion), confusion with other categories (ex: miniskirt vs maillot), and confusion with co-occurring classes (ex: volleyball vs volleyball player).

Table 3: mAP (%) for different re-training strategies, per category set.

                      Category Set
  Re-train Strategy   Weakly Labeled   Strongly Labeled   All
  -                   15.85            27.81              21.83
  detectors           17.01            27.85              22.43
  rep+detectors
6. Conclusion
We have presented a method which jointly trains a feature representation and detectors for categories with only weakly labeled data. We use the insight that strongly annotated detection data from auxiliary tasks can be used to train a feature representation that is conducive to discovering object patches in weakly labeled data. We demonstrate, using a standard detection dataset (ImageNet-200 detection), that our method of incorporating the strongly annotated data from auxiliary tasks is very effective at improving the quality of the discovered patches. We then use all strong annotations along with our discovered object patches to further refine our feature representation and produce our final detectors. We show that our full detection algorithm significantly outperforms both the previous state-of-the-art method which uses only weakly annotated data, and the algorithm which uses strongly annotated data from auxiliary tasks but does not incorporate any MIL for the weak tasks.

Upon acceptance of this paper, we will release all final weights and hyper-parameters learned using our algorithm to improve the performance of the recently released 7.5K category detectors [19].
References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In Proc. CVPR, 2010.
[2] K. Ali and K. Saenko. Confidence-rated multiple instance boosting for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[3] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Proc. NIPS, pages 561–568, 2002.
[4] Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In IEEE International Conference on Computer Vision, 2011.
[5] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proc. ICML, 2009.
[6] H. Bilen, V. P. Namboodiri, and L. J. Van Gool. Object and action classification with latent window parameters. IJCV, 106(3):237–251, 2014.
[7] D. Borth, R. Ji, T. Chen, T. Breuel, and S. F. Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In ACM Multimedia Conference, 2013.
[8] R. G. Cinbis, J. Verbeek, and C. Schmid. Multi-fold MIL training for weakly supervised object localization. In CVPR, 2014.
[9] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.
[10] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. IJCV, 2012.
[11] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 1997.
[12] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proc. ICML, 2014.
[13] L. Duan, D. Xu, and I. W. Tsang. Learning with augmented features for heterogeneous domain adaptation. In Proc. ICML, 2012.
[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. PAMI, 32(9):1627–1645, 2010.
[15] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proc. ICCV, 2013.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. CVPR, 2014.
[17] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Proc. CVPR, 2012.
[18] D. Heckerman. A tractable inference algorithm for diagnosing multiple diseases. arXiv preprint arXiv:1304.1511, 2013.
[19] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In Neural Information Processing Systems (NIPS), 2014.
[20] J. Hoffman, E. Rodner, J. Donahue, K. Saenko, and T. Darrell. Efficient learning of domain-invariant image representations. In Proc. ICLR, 2013.
[21] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
[23] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Proc. CVPR, 2011.
[24] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Proc. NIPS, 2010.
[25] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[26] M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In Proc. ICCV, 2011.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.
[28] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In Proc. ECCV, 2010.
[29] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[30] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV, 2012.
[31] P. Siva, C. Russell, and T. Xiang. In defence of negative mining for annotating weakly labelled data. In ECCV, 2012.
[32] P. Siva, C. Russell, T. Xiang, and L. Agapito. Looking beyond the image: Unsupervised learning for object saliency and detection. In Proc. CVPR, 2013.
[33] H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In Proceedings of the International Conference on Machine Learning (ICML), 2014.
[34] H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell. Weakly-supervised discovery of visual pattern configurations. 2014.
[35] C. Wang, W. Ren, K. Huang, and T. Tan. Weakly supervised object localization with latent category learning. In European Conference on Computer Vision (ECCV), 2014.
[36] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. ACM Multimedia, 2007.
[37] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In Proc. ICML, pages 1169–1176, 2009.
[38] A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.
[39] M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. ArXiv e-prints, 2013.
[40] C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance boosting for object detection. In Proc. NIPS, 2005.