Deep Patch Learning for Weakly Supervised Object Classification and Discovery
Peng Tang, Xinggang Wang, Zilong Huang, Xiang Bai, Wenyu Liu
School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China
Abstract
Patch-level image representation is very important for object classification and detection, since it is robust to spatial transformation, scale variation, and cluttered background. Many existing methods require fine-grained supervision (e.g., bounding-box annotations) to learn patch features; such annotation demands great labeling effort and may limit their potential applications. In this paper, we propose to learn patch features under weak supervision, i.e., only image-level supervision. To achieve this goal, we treat images as bags and patches as instances, and integrate the weakly supervised multiple instance learning constraints into deep neural networks. Moreover, our method integrates the traditionally separate stages of weakly supervised object classification and discovery into a unified deep convolutional neural network and optimizes the network end-to-end. The network processes the two tasks, object classification and discovery, jointly and shares hierarchical deep features. Through this joint learning strategy, weakly supervised object classification and discovery benefit from each other. We test the proposed method on the challenging PASCAL VOC datasets. The results show that our method obtains state-of-the-art performance on object classification and very competitive results on object discovery, with faster testing speed than its competitors.
Keywords:
Patch feature learning, Multiple instance learning, Weakly supervised learning, Convolutional neural network, End-to-end, Object classification, Object discovery

∗ Corresponding author.
Email addresses: [email protected] (Peng Tang), [email protected] (Xinggang Wang), [email protected] (Zilong Huang), [email protected] (Xiang Bai), [email protected] (Wenyu Liu)
1. Introduction
In this paper, we study the problems of weakly supervised object classification and discovery, which are of great importance to the computer vision community. As shown in the top and middle of Fig. 1, given input images and their category labels (i.e., image-level annotations), object classification learns object classifiers to predict which object classes (e.g., person) appear in testing images. Similar to object detection, object discovery learns object detectors to locate objects in input images, as shown in the bottom of Fig. 1. Different from the fully supervised object detection task, which requires exhaustive patch-level/bounding-box annotations for training, object discovery is weakly supervised, i.e., only image-level annotations are necessary to train object discovery models, as shown in the top of Fig. 1. Nowadays, large scale datasets with patch-level annotations are available [1, 2, 3], and many object classification and detection methods benefit from such fine-grained annotations [4, 5, 6, 7]. However, compared with the great number of images with only image-level annotations (e.g., obtained by image search queries on the Internet), the amount of exhaustively annotated images is still relatively small. This inspires us to explore methods that can deal with only image-level annotations.

A popular solution for weakly supervised learning is Multiple Instance Learning (MIL) [8]. In MIL, a set of bags and bag labels are given, and each bag consists of a collection of instances, whose labels are unknown during training. MIL imposes two constraints: 1) if a bag is positive, at least one instance in the bag should be positive; 2) if a bag is negative, all instances in the bag should be negative. It is natural to treat images as bags and patches as instances. In addition, patch-level features have wide applications in the computer vision community, such as image classification [9, 10, 11], object detection [12, 4], and object discovery [13, 14]. We can therefore combine patch-level features with MIL for object classification and discovery. We refer to this task as weakly supervised object classification since it does not require patch-level annotations for training. Object discovery is also called weakly supervised object detection, common object detection, etc., in other papers.
Figure 1: Illustration of weakly supervised object classification and discovery. Only image-level annotations are given for training (top). For classification, the object classifier predicts which object classes appear in images (middle). For discovery, the object detector locates objects in images (bottom).
Specifically, the connections between MIL methods and weakly supervised object classification and discovery are as follows. As defined in [15], there are three paradigms for MIL: instance space based MIL methods learn an instance classifier, bag space based MIL methods learn the similarity among bags, and embedded space based MIL methods map bags to representations. For object classification, many methods aggregate extracted patch features into a vector for each image as the image representation and use this representation to train a classifier [9, 10, 16, 17], which is similar to embedded space based MIL methods, as shown in the top-right of Fig. 2. Meanwhile, for object discovery, instance space based MIL methods are directly applied to patch features to find objects of interest [13, 14], as shown in the bottom of Fig. 2.
Figure 2: The relationship between MIL and object classification/discovery. Images/patches can be viewed as bags/instances. Embedded space based MIL methods map the instances (patches) in each bag (image) to a bag (image) representation for training a bag (image) classifier. Instance space based MIL methods learn an instance (patch) classifier to discover the most representative instances (patches). “+”/“-” indicates a positive/negative bag representation. Note that in the top-right part, only two points are drawn, denoting the positive and the negative image respectively.

Recently, deep Convolutional Neural Networks (CNNs) [18] have obtained great success on image classification [19]. However, conventional CNNs without patch-level image features are unsuitable for recognizing complex images and unable to obtain state-of-the-art performance on challenging datasets, e.g., the PASCAL VOC datasets [2]. There are several reasons: unlike ImageNet, which has millions of object-centered images, in PASCAL VOC (1) there is a limited number of training images; (2) the images have complex structure, and objects undergo large spatial transformation and scale variation; (3) the images have multiple labels.

Currently, the state-of-the-art object classification methods for complex datasets are based on local image patches and CNNs [20, 21, 16, 17]. And as shown in Fig. 2, it is natural to treat object classification in complex images as a MIL problem. Thus, it is important to combine deep CNNs with MIL. There are a few early attempts. For example, similar to embedded space based MIL methods, Cimpoi et al. [16] combine CNN-based patch features with the Fisher Vector [22] to learn image representations. The Hypotheses CNN Pooling (HCP) [20] and Deep Multiple Instance Learning (DMIL) [23] find the most representative patches in images. These examples show that patch-based CNNs have an advantage over plain CNNs. Also, for object discovery, many methods use CNNs to extract patch features and discover objects via instance space based MIL methods [13, 24]. All these methods are patch-based, and they are preferable to plain CNNs on complex datasets.

However, these methods have some limitations. First, they feed each patch into CNN models separately for feature extraction, ignoring the fact that computation on convolutional layers for overlapping patches can be shared, and thus reduced, in both training and testing. Second, they treat patch feature extraction, image representation learning, and object classification or discovery as separate stages. During training, every stage requires its own training data, taking up a lot of disk space for storage. At the same time, treating these stages separately may harm performance, as the stages may not be independent; it is therefore better to integrate them into a unified framework. Third, patch features are extracted using pre-trained models, i.e., these methods cannot learn dataset- or task-specific patch features. Last, they treat object classification and discovery as independent tasks, which our experiments demonstrate to be complementary.
Inspired by these facts, we propose a novel framework, called Deep Patch Learning (DPL), which integrates patch feature learning, image representation learning, and object classification and discovery into a unified framework. Inspired by the fully supervised object detection methods SPPnet [5] and Fast R-CNN [6], our DPL reduces training and testing time by sharing the computation on convolutional layers among different patches. Meanwhile, it combines the different stages of object classification and discovery to form an end-to-end framework for both tasks. That is, DPL jointly optimizes patch feature learning, image representation learning, and image classification by backpropagation, which mainly serves object classification. In the meantime, it applies a MIL loss to each patch feature and trains a deep MIL network end-to-end, which can discover the most representative patches in images. These two blocks (the object classification block and the MIL based discovery block) are combined via a multi-task learning framework, which boosts the performance of each task. Moreover, as images may have multiple labels, the MIL loss is adapted to the multi-class case. Notice that for both object classification and discovery, only image-level annotations are utilized for training, which makes our method quite different from the fully supervised methods [5, 6] that require detailed patch-level supervision.

To demonstrate the effectiveness of our method, we perform elaborate experiments on the PASCAL VOC 2007 and 2012 datasets. DPL achieves state-of-the-art performance on object classification and very competitive results on object discovery. Moreover, it takes only 1.85s and 2.8s per image during testing using the AlexNet [19] and VGG16 [17] CNN backends, respectively, which is much faster than the previously best performing method HCP [20].

To summarize, the main contributions of our work are as follows.

• We propose a weakly supervised learning framework to integrate the different stages of object classification into a single deep CNN, in order to learn patch features for object classification in an end-to-end manner. The proposed object classification network is much more effective and efficient than previous patch-based deep CNNs.

• We integrate the two MIL constraints into the loss of our deep CNN framework to train instance classifiers, which can be applied to object discovery.

• We embed the two tasks, object classification and discovery, into a single network and perform classification and discovery simultaneously. We also demonstrate that the two tasks are complementary to some extent. To the best of our knowledge, this is the first demonstration that classification and discovery can be complementary to each other in an end-to-end neural network; we believe this reveals a new phenomenon of interest to the community.

• Our method achieves state-of-the-art performance on object classification, and very competitive results on discovery, with faster testing speed, on the PASCAL VOC datasets.

The rest of this paper is organized as follows. In Section 2, related work is reviewed. In Section 3, the detailed architecture of our DPL is described. In Section 4, we present experiments on several object classification and discovery benchmarks. In Section 5, some discussions of the experimental setups are presented. Section 6 concludes the paper.
2. Related Work
MIL was first proposed by Dietterich et al. [8] for drug activity prediction. Since then, many methods have emerged in the MIL community [13, 25, 26, 27]. Our method can be regarded as a MIL based method, as we treat images as bags and patches as instances. Meanwhile, learning image representations can be viewed as an embedded space based MIL method, and learning an instance classifier can be viewed as an instance space based MIL method. However, traditional MIL methods mainly focus on the setting where bags have only a single label, while in real-world tasks each bag may be associated with more than one class label; e.g., an image may contain multiple objects from different classes. One solution for the multi-class problem is to adapt MIL by training a binary classifier for each class through the one-vs.-all strategy [28]. The Multi-Instance Multi-Label (MIML) problem [29, 30, 31] has also been proposed as an alternative to single-label MIL. As many images in the PASCAL VOC datasets contain multiple objects from different classes, our method is also based on MIML. Similar to the one-vs.-all strategy, we train binary classifiers using the multi-class sigmoid cross entropy loss. But instead of training these binary classifiers separately, we train all classifiers jointly and share features among them, as in multi-task learning [32]. Moreover, different from previous MIL methods, we integrate the MIL constraints into the popular deep CNN and apply our method to object classification and discovery.

There are also many other computer vision methods that benefit from MIL. Wei et al. [20] and Wu et al. [23] combine CNNs and MIL for end-to-end object classification. Their methods are also end-to-end trainable and can learn patch features. However, they have to resize patches to a specific size and feed all patches into the CNN models separately, as shown in Figure 3 (b). This results in huge time consumption for training and testing, because it ignores the fact that computation on convolutional layers for overlapping patches can be shared. Meanwhile, [20, 23] adopt instance space based MIL methods, which means they train an instance classifier under the MIL constraints and aggregate instance scores by max-pooling into bag scores; they then classify bags (images) by these pooled bag scores. Different from their methods, as shown in Figure 3 (d), we share the computation of convolutional layers among different patches and combine both embedded space and instance space based MIL methods into a single network, which achieves much more promising results.

MIL is also a prevalent approach for object discovery. Cinbis et al. [14] and Wang et al. [13] use MIL for object discovery and achieve state-of-the-art performance. But their methods separate patch feature extraction and MIL into two stages, which may limit their performance.
Patch-based methods are popular for image classification due to their robustness to spatial transformation, scale variation, and cluttered background. BoF [9] is a very popular pipeline for image classification. It extracts a set of local features like SIFT [33] or HOG [34] from patches, and then uses unsupervised [10, 22, 35] or weakly supervised methods [36, 37, 38, 39, 40, 41, 42, 43, 44, 45] to aggregate patch features into an image representation. These image representations are used for image classification. To consider the spatial layout of images, Spatial Pyramid Matching (SPM) [46] is employed to enhance performance. But this pipeline treats patch feature extraction, image representation, and classification as independent stages, whereas our method integrates them into a single network and trains the network end-to-end.

Recently, Lobel et al. [47] and Parizi et al. [48] have proposed methods to combine the last two stages, i.e., image representation and classification. They learn patterns of patches and the image classifier jointly, and their results show significant performance improvements. Sydorov et al. [49] have proposed a method to learn the parameters of the Fisher Vector and the image classifier end-to-end. But as a matter of fact, they do not perform truly end-to-end classification: although they learn the image representation and classifier jointly, they still treat patch feature extraction as an independent part, which leads to a large consumption of time and space for computing and storing the patch features. Different from their methods, our method achieves real end-to-end learning.
Figure 3: Illustration of different network architectures: (a) plain deep CNN; (b) DMIL [23]/HCP [20]; (c) Fast R-CNN [6]; (d) our DPL. IA, PA, and NMS denote image-level annotations, patch-level annotations, and non-maximum suppression, respectively. For (a), a whole image (resized to a fixed size) is fed into the network. For (b), a set of patches (resized to a fixed size) from one image is fed into the network, and each patch passes through a CNN separately. (a) and (b) produce a score vector per image for classification and only require image-level annotations for training. For (c) and (d), a whole image (with its original aspect ratio) as well as a set of patch regions are fed into the network, where all patches from one image share the computation of the convolutional layers. (c) and (d) produce a score vector per patch, and NMS is then used to filter highly overlapped patches and produce the detected boxes. But (c) requires patch-level/bounding-box annotations for training, whereas (d) takes only image-level annotations as supervision. (d) also produces a score vector per image for classification. For simplicity, backpropagation arrows are not plotted. Best viewed in color.
Yang et al. [7] also propose to learn local patch-level information for object classification. They propose a multi-view MIL framework and choose the Fisher Vector [22] to aggregate patch features. But their method is not end-to-end either, and it requires fine-grained bounding-box annotations for training, whereas our method is end-to-end and weakly supervised.
Inspired by SPPnet [5] and the great success of CNNs [18] for image classification [19], Girshick [6] proposed Fast R-CNN, a fast proposal classification method in the fully supervised setting. Fast R-CNN can also learn patch features. Our method follows this work in sharing computation on convolutional layers among all patches. But as shown in Figure 3 (c) and (d), the differences between our method and [6] are multi-fold: 1) Fast R-CNN focuses on supervised object detection, whereas the proposed DPL focuses on weakly supervised image classification and object discovery. 2) Fast R-CNN requires bounding-box annotations, whereas DPL only requires image-level annotations; annotating object bounding-boxes is labor- and time-consuming, whereas image-level annotations are easier to obtain. 3) In summary, Fast R-CNN is a fully supervised object detection framework, while DPL is a weakly supervised deep learning framework for joint image classification and object discovery.
3. The Architecture of Deep Patch Learning
The architecture of Deep Patch Learning (DPL) is shown in Figure 4. Given an image and a set of patches, DPL first passes the image through several convolutional (conv) layers to generate conv feature maps for the whole image; the size of the feature maps is determined by the size of the input image. After that, a Spatial Pyramid Pooling (SPP) layer is employed for each patch to produce fixed-size feature maps. Each feature map is then fed into several fully connected (fc) layers, which output a set of patch features. At last, these patch features are branched into two streams with two different tasks: one jointly learns the image representation and classifier, focusing on object classification (the classification block), and the other finds the most positive patches, focusing on object discovery (the discovery block). Only image-level annotations are used as supervision to train the two streams. In this section, we introduce each of these steps.
Our method is patch-based, so it is necessary to generate patches first. The simplest and fastest way is the sliding window, i.e., sliding a set of fixed-size windows over the image. But objects cover only a small portion of images and may have various shapes, so patches from fixed-size sliding windows always have low recall. Some methods propose to generate patches based on visual cues, like segmentation [50] and edges [51]. Here we choose the "fast" mode of Selective Search (SS) [50] to generate patches, due to its fast speed and high recall.
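To make this step concrete, here is a minimal sketch using the Selective Search implementation in OpenCV's contrib module; the function name generate_patches and the max_patches cap are illustrative assumptions, not part of our released code.

```python
import cv2  # requires the opencv-contrib-python package for the ximgproc module

def generate_patches(image_path, max_patches=3000):
    """Generate candidate patches with the "fast" mode of Selective Search."""
    img = cv2.imread(image_path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()      # the "fast" mode
    rects = ss.process()                  # proposals as (x, y, w, h)
    # Convert to (x1, y1, x2, y2) corners and cap the number of proposals.
    return [(x, y, x + w, y + h) for (x, y, w, h) in rects[:max_patches]]
```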
Figure 4: The architecture of DPL. An image and multiple patches are first input to a fully convolutional network. Each patch is projected to a fixed-size feature map and then fed into several fc layers, which generate a feature vector for each patch. At last, these patch features are branched into two streams: one learns the image representation and classifier jointly for object classification, and the other finds the most representative patches for object discovery. Both streams require only image-level supervision for training.
Fine-tuning CNN models that were pre-trained on large scale datasets like ImageNet [1] on a target dataset has achieved marvelous performance. We fine-tune our model from two widely used models, AlexNet [19] and VGG16 [17].
As we stated in Section 3.2, we choose the two CNN models AlexNet and VGG16. Both models have conv layers interleaved with max-pooling layers, followed by three fc layers. Conv and max-pooling layers are implemented in a sliding window manner. In fact, all conv and max-pooling layers can deal with inputs of arbitrary sizes, and their outputs maintain roughly the same aspect ratio as the input images; moreover, conv and pooling operations do not change the relative spatial distribution of the inputs. The outputs of the conv layers are known as conv feature maps [18]. Although conv and max-pooling layers can handle arbitrarily sized input images, the two CNN models require fixed-size input images, because the fc layers demand fixed-length input vectors.
As fc layers take fixed-length input vectors, the pre-trained CNN models require fixed-size input images; changing the image size and aspect ratio may therefore cause a loss in performance. To handle this problem, Fast R-CNN [6] uses an SPP layer [5] to realize fast proposal classification, and our work follows this path. Specifically, we replace the last max-pooling layer by the SPP layer. That is, given the $i$-th patch $R_i$ with coordinates $(l_i^x, l_i^y, r_i^x, r_i^y)$, which indicate the horizontal/vertical ordinates of its top-left and bottom-right points, and supposing the feature map size is $1/n$ of the original image size (e.g., $1/16$ for VGG16), we project the coordinates of $R_i$ to $(l_i^x/n, l_i^y/n, r_i^x/n, r_i^y/n)$, which correspond to the coordinates of patch $i$ on the feature maps. We then obtain the feature maps of patch $i$ by cropping the portion of the whole-image feature maps inside $R_i$ and resizing it to a fixed size, where the fixed size depends on the pre-trained CNN model (e.g., $6 \times 6$ for AlexNet and $7 \times 7$ for VGG16). Suppose the $j$-th cropped feature map of $R_i$ is $x_{ij}$; we divide $x_{ij}$ into a fixed grid of sub-regions, and the pooled output $o_{ij}^k$ from the $k$-th grid cell $R_i^k$ is given by Eq. (1):

$$o_{ij}^k = \max_{R_i^k} x_{ij}. \qquad (1)$$

This procedure produces fixed-size feature maps for each patch, which can be passed to the following fc layers. More details can be found in [6, 5].
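The projection and pooling above can be summarized in a short sketch. The following NumPy code is an illustrative approximation (e.g., it rounds the projected coordinates and uses a single pyramid level); the actual layer follows the SPP/Fast R-CNN implementations [5, 6].

```python
import numpy as np

def spp_pool(conv_maps, box, n=16, out_size=7):
    """Project a patch onto the conv feature maps and max-pool to a fixed grid.

    conv_maps: (D, H, W) conv feature maps of the whole image.
    box:       (x1, y1, x2, y2) patch coordinates in the original image.
    n:         feature stride, i.e., maps are 1/n of the image size (16 for VGG16).
    out_size:  side of the fixed output grid (7 for VGG16).
    """
    x1, y1, x2, y2 = [int(round(c / n)) for c in box]   # coordinate projection
    crop = conv_maps[:, y1:y2 + 1, x1:x2 + 1]
    D, h, w = crop.shape
    out = np.zeros((D, out_size, out_size))
    ys = np.linspace(0, h, out_size + 1).astype(int)    # grid cell boundaries
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            # Max over the grid cell R_i^k, as in Eq. (1); cells stay non-empty.
            cell = crop[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out  # (D, out_size, out_size), passed on to the fc layers
```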
As shown in Figure 4 and stated above, our DPL produces two different scores for two different tasks: one for object classification and the other for object discovery. Therefore, we replace the last fc layer and the softmax layer of the pre-trained models by our multi-task loss. Denoting the classification loss by $L_{cls}$ and the discovery loss by $L_{dis}$, the total loss is as follows:

$$L(X_i, Y_i) = L_{cls}(X_i, Y_i) + L_{dis}(X_i, Y_i), \qquad (2)$$

where $X_i$ and $Y_i$ are the input image and its image-level label, respectively. We now introduce these two losses in detail.

For object classification, we choose the embedded space based MIL approach; that is, we learn an image (bag) representation for each image (bag) for classification. As shown in the classification block of Figure 4, after computing the feature vector of each patch, we first encode these patch features. An object is composed of a set of parts (e.g., the face and body of a person), and patch encoding projects patch features to more semantic vectors whose elements correspond to patterns (i.e., parts). This is done by a weight matrix $W = [w_1, w_2, ..., w_N] \in \mathbb{R}^{K \times N}$, where the $i$-th column $w_i \in \mathbb{R}^{K \times 1}$ is the $i$-th weight filter (i.e., the $i$-th part filter), $K$ is the dimension of the patch features and part filters, and $N$ is the number of filters. Suppose the $j$-th patch feature of image $X_i$ is $f_{ij} \in \mathbb{R}^{K \times 1}$; the encoded patch is represented as $E_{ij} = [E_{ij1}, E_{ij2}, ..., E_{ijN}]^T = W^T f_{ij} \in \mathbb{R}^{N \times 1}$.

It is then necessary to aggregate these encoded patches into an image representation, which is performed by SPM [46] with max-pooling. Suppose the SPM splits an image into $M$ different grids (i.e., $M$ different sub-regions); we aggregate the patches in grid $m \in \{1, 2, ..., M\}$ using max-pooling as $F_i^m = [F_{i1}^m, F_{i2}^m, ..., F_{iN}^m]^T \in \mathbb{R}^{N \times 1}$, where $F_{in}^m = \max_{j \in m} E_{ijn}$. The final image representation is the concatenation of these vectors, $F_i = [F_i^{1T}, F_i^{2T}, ..., F_i^{MT}]^T \in \mathbb{R}^{NM \times 1}$.

To classify an image, we compute the predicted score vector $s_i^{cls} \in \mathbb{R}^{C \times 1}$ of image $X_i$ over the different classes, where $C$ is the number of classes. Let the classifier be $U_{cls} = [u_1^{cls}, u_2^{cls}, ..., u_C^{cls}] \in \mathbb{R}^{NM \times C}$; the score is computed by $s_i^{cls} = [s_{i1}^{cls}, s_{i2}^{cls}, ..., s_{iC}^{cls}]^T = U_{cls}^T F_i \in \mathbb{R}^{C \times 1}$. The loss then takes the form $L_{cls}(X_i, Y_i) = G(s_i^{cls}, Y_i)$, where $G(\cdot, \cdot)$ is the loss function.
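The forward pass of the classification block thus reduces to two matrix multiplications with a max-pooling in between. The following NumPy sketch assumes a single SPM grid covering the whole image (M = 1) for brevity; with M grids, the pooled vectors are concatenated as described above.

```python
import numpy as np

def classification_scores(f, W, U_cls):
    """Classification block: patch encoding -> max-pooling -> image scores.

    f:     (J, K) patch features f_ij of one image (J patches, dimension K).
    W:     (K, N) part filters for patch encoding.
    U_cls: (N, C) image classifier (N*M rows in general; M = 1 here).
    """
    E = f @ W               # (J, N) encoded patches, E_ij = W^T f_ij
    F = E.max(axis=0)       # (N,) max-pooling over patches (single SPM grid)
    # With M grids, F is the concatenation of M pooled vectors of length N.
    return U_cls.T @ F      # (C,) image score vector s_i^cls
```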
Backpropagation. To learn the parameters of the part filters $W$ and the image classifier $U_{cls}$, the derivatives $\partial L_{cls}/\partial U_{cls}$ and $\partial L_{cls}/\partial W$ are required. They can be obtained by standard backpropagation, as shown in Eq. (3) and Eq. (4):

$$\frac{\partial L_{cls}}{\partial U_{cls}} = \frac{1}{I} \sum_{i=1}^{I} \sum_{c=1}^{C} \frac{\partial G(s_i^{cls}, Y_i)}{\partial s_{ic}^{cls}} \frac{\partial s_{ic}^{cls}}{\partial U_{cls}}, \qquad (3)$$

$$\frac{\partial L_{cls}}{\partial W} = \frac{1}{I} \sum_{i=1}^{I} \sum_{c=1}^{C} \sum_{r=1}^{NM} \sum_{j=1}^{J_i} \sum_{n=1}^{N} \frac{\partial G(s_i^{cls}, Y_i)}{\partial s_{ic}^{cls}} \frac{\partial s_{ic}^{cls}}{\partial F_{ir}} \frac{\partial F_{ir}}{\partial E_{ijn}} \frac{\partial E_{ijn}}{\partial W}, \qquad (4)$$

where $I$ is the batch size per iteration and $J_i$ is the number of patches of image $X_i$. The connections between patch features and encoded patches, and between the image representation and the predicted scores, are matrix multiplications, which are performed by fc layers and are standard layers in CNNs, so we do not give the detailed derivatives of $\partial s_{ic}^{cls}/\partial U_{cls}$, $\partial s_{ic}^{cls}/\partial F_{ir}$, and $\partial E_{ijn}/\partial W$. The derivative of the SPM with max-pooling layer is computed by

$$\frac{\partial F_{ir}}{\partial E_{ijn}} = \begin{cases} 1 & \text{if } (r \bmod N) = n \text{ and } j = \arg\max_{j' \in m} E_{ij'n}, \\ 0 & \text{otherwise}, \end{cases} \qquad (5)$$

where $\bmod$ computes the remainder and $m$ is the grid satisfying $m = \lceil r/N \rceil$. Through backpropagation, an end-to-end system for patch feature learning, image representation, and classification is obtained.

Different from object classification, which aims at finding the important parts that compose an object, object discovery aims to find the patch that locates the object exactly. That is, object classification tends to learn the local information of an object, while object discovery tends to learn the global information of an object. The two tasks are complementary to some degree, so we also perform object discovery, as shown in the discovery block of Figure 4.

Object discovery and instance space based MIL methods have similar targets: the former wants to find the most representative patches of an object in an image, and the latter wants to find the positive instances in a positive bag. If we treat images as bags and patches as instances, these two concepts become equivalent. Other work also utilizes instance space based MIL methods to realize object discovery [13, 14]. Our object discovery method likewise finds the most positive patch of the object, just as MI-SVM [26] does.

Specifically, we define the patch classifier $U_{dis} = [u_1^{dis}, u_2^{dis}, ..., u_C^{dis}] \in \mathbb{R}^{K \times C}$. The scores of patch feature $f_{ij}$ are computed by $s_{ij}^{pat} = [s_{ij1}^{pat}, s_{ij2}^{pat}, ..., s_{ijC}^{pat}]^T = U_{dis}^T f_{ij} \in \mathbb{R}^{C \times 1}$. By the MIL constraints, there must be at least one positive instance in a positive bag, and all instances should be negative in negative bags. So if an image contains an object, the maximum score of the patches for that object should be large, and if an image does not contain the object, the maximum score should be small. This is realized by max-pooling over all patches, that is, $s_{ic}^{dis} = \max_j s_{ijc}^{pat}$, where $s_{ic}^{dis}$ indicates whether image $X_i$ contains the $c$-th object class. We thus define $s_i^{dis} = [s_{i1}^{dis}, s_{i2}^{dis}, ..., s_{iC}^{dis}]^T \in \mathbb{R}^{C \times 1}$ as the predicted score vector of the image for object discovery. Since $s_i^{dis}$ and $s_i^{cls}$ are similar, both representing predicted score vectors of an image, we use the same loss function, $L_{dis}(X_i, Y_i) = G(s_i^{dis}, Y_i)$.
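The discovery block is even simpler: score every patch, then max-pool over patches per class. A minimal NumPy sketch, with variable names following the notation above:

```python
import numpy as np

def discovery_scores(f, U_dis):
    """Discovery block: per-patch scores, then MIL max-pooling over patches.

    f:     (J, K) patch features of one image.
    U_dis: (K, C) patch classifier.
    """
    s_pat = f @ U_dis                 # (J, C) patch scores s_ij^pat = U_dis^T f_ij
    s_dis = s_pat.max(axis=0)         # (C,) s_ic^dis = max_j s_ijc^pat
    top_patch = s_pat.argmax(axis=0)  # per class, the most representative patch
    return s_dis, top_patch
```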
Backpropagation. To learn the parameters of the patch classifier $U_{dis}$, the derivative $\partial L_{dis}/\partial U_{dis}$ is required, which is obtained by backpropagation as shown in Eq. (6):

$$\frac{\partial L_{dis}}{\partial U_{dis}} = \frac{1}{I} \sum_{i=1}^{I} \sum_{c=1}^{C} \sum_{j=1}^{J_i} \frac{\partial G(s_i^{dis}, Y_i)}{\partial s_{ic}^{dis}} \frac{\partial s_{ic}^{dis}}{\partial s_{ijc}^{pat}} \frac{\partial s_{ijc}^{pat}}{\partial U_{dis}}. \qquad (6)$$

The connection between patch features and patch scores is again an fc layer, so we only give the derivative of the max-pooling layer:

$$\frac{\partial s_{ic}^{dis}}{\partial s_{ijc}^{pat}} = \begin{cases} 1 & \text{if } j = \arg\max_{j'} s_{ij'c}^{pat}, \\ 0 & \text{otherwise}. \end{cases} \qquad (7)$$

Through backpropagation, end-to-end object discovery can thus be performed.

In the above two subsections, we have derived the backpropagation for object classification and discovery. Here we explain $G(s_i, Y_i)$ and its derivative. As one image may contain multiple objects from different classes, it has multiple labels, and its label becomes a binary label vector $Y_i = [y_{i1}, y_{i2}, ..., y_{iC}]^T \in \mathbb{R}^{C \times 1}$, where $y_{ic} = 1$ if $X_i$ contains object $c$ and $y_{ic} = 0$ otherwise. The popular softmax loss function is not suitable for this case, so we choose the multi-class sigmoid cross entropy loss:

$$G(s_i, Y_i) = -\sum_{c=1}^{C} \left\{ y_{ic} \log \sigma(s_{ic}) + (1 - y_{ic}) \log (1 - \sigma(s_{ic})) \right\}, \qquad (8)$$

where $\sigma(x)$ is the sigmoid function $\sigma(x) = 1/(1 + \exp(-x))$. Using Eq. (8), we train $C$ binary classifiers, each of which distinguishes images with/without one object class, similar to the one-vs.-all strategy [28] for multi-class classification. The derivative of Eq. (8) is

$$\frac{\partial G(s_i, Y_i)}{\partial s_{ic}} = \sigma(s_{ic}) - y_{ic}. \qquad (9)$$

All the derivatives of the parameters can then be derived. We can observe that only the image-level labels $Y_i$ are necessary to optimize the loss in Eq. (8), which confirms that our method is fully weakly supervised.
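For completeness, the loss of Eq. (8) and its gradient of Eq. (9) can be written in a few lines; this NumPy sketch omits the numerical-stability tricks a production implementation would use.

```python
import numpy as np

def sigmoid_cross_entropy(s, y):
    """Multi-class sigmoid cross entropy G(s_i, Y_i), Eq. (8), and its
    gradient with respect to the scores, Eq. (9).

    s: (C,) predicted image score vector (from either block).
    y: (C,) binary label vector, y_c = 1 if class c is present.
    """
    sig = 1.0 / (1.0 + np.exp(-s))                               # sigma(s_c)
    loss = -np.sum(y * np.log(sig) + (1 - y) * np.log(1 - sig))  # Eq. (8)
    grad = sig - y                                               # Eq. (9)
    return loss, grad
```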
4. Experiments
In this section, we present experiments with our DPL method for object classification and discovery.
As stated in Section 3.2, we choose two popular CNN architectures, AlexNet [19] and VGG16 [17], which are pre-trained on ImageNet [1]. The pre-trained models can be downloaded from the Caffe model zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo). We replace the last max-pooling layer, the final fc layer, and the softmax loss layer by the layers defined in Section 3. The dimension of the encoded patch is set to 256 (i.e., N = 256), and we choose three different SPM scales for the SPM with max-pooling layer after the patch encoding layer. The fc layers for patch encoding and for image and patch score prediction are initialized from Gaussian distributions with zero mean and standard deviation 0.01. Biases are initialized to 0. The mini-batch size is set to 2. For AlexNet, the learning rates of all layers are set to 0.001 for the first 30K mini-batch iterations and lowered afterwards.

Table 1: Object classification results (AP in %) for different methods on the PASCAL VOC 2007 test set, comparing DMIL [23], CNNaug-SVM [52], Oquab et al. [53], Chatfield et al. [54], Barat et al. [55], HCP-AlexNet [20], Cimpoi et al. [16], VGG16-SVM, VGG19-SVM, and VGG16-19-SVM [17], FeV+LV-20 [7], HCP-VGG16 [20], and our DPL-AlexNet and DPL-VGG16, with per-class AP and mAP.
We test our DPL method on two well-known object classification and discovery benchmarks, PASCAL VOC 2007 and PASCAL VOC 2012 [2], which contain 9,962 and 22,531 images, respectively, covering 20 object categories. The datasets are split into the standard train, val, and test sets. We use the trainval set (5,011 images for VOC 2007 and 11,540 images for VOC 2012) with only image-level labels to train our models. During testing, for object classification we compute the Average Precision (AP) and the mean of AP (mAP) as the evaluation metrics on the test set (4,952 images for VOC 2007 and 10,991 images for VOC 2012); for VOC 2012, the evaluation is performed online via the PASCAL VOC evaluation server (http://host.robots.ox.ac.uk:8080/). For object discovery, we report CorLoc on the trainval set as in [56], which computes the percentage of correctly located objects under the PASCAL criterion (Intersection over Union (IoU) > 0.5).
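For reference, a minimal sketch of the IoU test and the CorLoc metric as defined above; the (x1, y1, x2, y2) box format and the per-image bookkeeping are simplifying assumptions (the official evaluation handles classes and difficult objects separately).

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def corloc(predicted, ground_truths):
    """Fraction of images whose top box hits a ground-truth box with IoU > 0.5.

    predicted:     one box per image (the max-scoring patch for the class).
    ground_truths: list of ground-truth boxes per image for the same class.
    """
    hits = [any(iou(p, g) > 0.5 for g in gts)
            for p, gts in zip(predicted, ground_truths)]
    return float(np.mean(hits))
```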
There are many different methods to generate patches, such as the proposal-based methods Selective Search (SS) [50] and EdgeBoxes [51], or the sliding window, which differ in their running times. We choose the "fast" mode of SS [50] to produce 1-3K patches per image. For data augmentation, we use five image scales (resizing the longest side to one of the scales while maintaining the aspect ratio) together with their horizontal flips to train the model. For testing, we use the same five scales without flips and compute the mean score over these scales.

Our code is written in C++ and Python, based on Caffe [59] and the publicly available implementation of Fast R-CNN [6]. All of our experiments run on an NVIDIA GTX TitanX GPU with 12GB memory. Code for reproducing the results is available at https://github.com/ppengtang/dpl.

We first report our results for object classification. Even though the discovery block in Figure 4 mainly focuses on object discovery, it also produces image-level scores, so for object classification we compute the mean score of the two tasks. The results on VOC 2007 and VOC 2012 are shown in Table 1 and Table 2. Our results on VOC 2012 are also available at http://host.robots.ox.ac.uk:8080/anonymous/PRKWXL.html and http://host.robots.ox.ac.uk:8080/anonymous/PWADSM.html.

Table 2: Object classification results (AP in %) for different methods on the PASCAL VOC 2012 test set, comparing DDSFL [57], HCP-AlexNet [20], Oquab et al. [53], Chatfield et al. [54], Oquab et al. [58], VGG16-SVM, VGG19-SVM, and VGG16-19-SVM [17], FeV+LV-20 [7], HCP-VGG16 [20], and our DPL-AlexNet and DPL-VGG16, with per-class AP and mAP.

Figure 5: Visualization of the patterns learned by our patch encoding method. Each row corresponds to one pattern, where blue rectangles show the patches with the strongest responses to that pattern.

From the results, we observe that our method outperforms the other CNN-based single-model methods. In particular, our method is considerably better than other patch-based methods. For example, the method in [16] extracts patch features at ten different scales from a pre-trained VGG19 model and aggregates them with the Fisher Vector; in [17], patch features are extracted at five different scales with mean-pooling; and HCP [20] combines the MIL constraints with CNN models to find the most representative patches in images. Our method even outperforms FeV+LV-20 [7], which utilizes bounding-box annotations during training, which shows the potential of combining CNNs with weakly supervised methods (e.g., MIL).
As shown in Table 1 and Table 2, our method achieves improvements of 1.8% and 2.0% on VOC 2007 and VOC 2012, respectively. These results show that our DPL method achieves state-of-the-art performance on object classification. On VOC 2012, the best result previously reported in the literature is the combination of HCP-VGG16 [20] and NUS-PSL [62], which achieves 93.2% mAP, but it simply averages the scores predicted by the two methods.

Some patterns from our patch encoding method are also visualized in Figure 5.

Table 3: Object discovery results (CorLoc in %) for different methods on the PASCAL VOC 2007 trainval set, comparing Shi et al. [60], Multi-fold MIL [14], RMI-SVM [13], Bilen et al. [61], Wang et al. [24], and our DPL-AlexNet and DPL-VGG16, with per-class CorLoc and the average.
We can observe that, although only image-level annotations are available during training, our method learns patterns with rich semantic information. For example, "Pattern 5" corresponds to the head of a person, "Pattern 7" corresponds to the wheel of a bicycle, "Pattern 141" corresponds to the screen of a tvmonitor, and so on.
Training our DPL model takes 6 hours with AlexNet and 28 hours with VGG16. During testing, DPL takes only 1.85s and 2.8s per image with AlexNet and VGG16, respectively, which is much faster than HCP [20] (3s and 10s per image with AlexNet and VGG16, respectively), the previous state-of-the-art method for object classification.
We also perform object discovery experiments. For object discovery, we use only the predicted patch scores from the discovery block in Figure 4 and choose the patch with the maximum score. The results on VOC 2007 and VOC 2012 are shown in Table 3 and Table 4.

Table 4: Object discovery results (CorLoc in %) on the PASCAL VOC 2012 trainval set, comparing the unsupervised object co-localization methods of Cho et al. [63] and Li et al. [64] with our DPL-AlexNet and DPL-VGG16, with per-class CorLoc and the average.

From the results, we observe that our method achieves quite competitive performance on object discovery. It outperforms other MIL-based methods such as [13, 14], but is a little weaker than the method in [24]. The method in [24] finds a compact cluster for the object and several clusters for the background; besides being sensitive to the number of clusters, it must tediously tune parameters for each class. As other weakly supervised methods do not report CorLoc on VOC 2012, we only compare our method with the unsupervised object co-localization methods [63, 64] in Table 4, which it outperforms. This is not surprising, as our method benefits from image-level annotations, whereas [63, 64] are unsupervised (no image-level labels during training).

Figure 6 shows some success and failure discovery cases on VOC 2007. Although the failure cases do not locate objects exactly, they still find a representative part of the whole object (e.g., the face for a person) or a box that contains not only the object but also adjacent similar objects.
5. Discussion
In this section, we discuss the factors influencing multi-task learning, the image scales, and the patch generation method. Without loss of generality, we choose AlexNet to perform experiments on the PASCAL VOC 2007 dataset. Unless otherwise specified, the reported testing times in this section do not include the patch generation procedure.

Figure 6: Some discovery results for several classes on VOC 2007. Each row denotes one class; from top to bottom: aeroplane, bird, car, person, and tvmonitor. Green rectangles denote success cases, and red rectangles denote failure cases.
Multi-task learning may improve the performance of each task, as different tasks can influence each other through the shared representation [32]. Here we test the influence of the different tasks; the results are shown in Table 5. As we can see, multi-task learning improves both the classification mAP and the discovery CorLoc over training either task alone.

Table 5: Results on PASCAL VOC 2007 for different tasks (mAP for object classification and CorLoc for object discovery), comparing multi-task training with classification-only and discovery-only training.

To evaluate the influence of image scales, we conduct a single-scale experiment that uses only the scale 600, compared with the five-scale experiment. The results are shown in Table 6. We observe that multi-scale training and testing improves the classification and discovery results evidently.

In the previous experiments, we chose SS [50] to extract patches. Here we compare three different methods to generate patches: SS [50], EdgeBoxes [51], and the Sliding Window (SW). For EdgeBoxes, we generate 256 patches for each image to accelerate testing (we also tested increasing the patch number, but the performance improved only a little while the speed slowed down a lot). For the SW method, we extract patches at 7 different window scales that are multiples of 32, with step size 32; this generates 500 to 1000 patches per image. The results are shown in Table 7.

Table 7: Results on PASCAL VOC 2007 for different patch generation methods (SS, EdgeBoxes, SW) and HCP [20] (mAP for object classification and CorLoc for object discovery). + denotes the time cost including the patch generation procedure.

From the results, we observe that the patch generation method greatly affects performance, especially for object discovery. Moreover, SS is the best method for both object classification and discovery. Interestingly, the SW method achieves classification mAP similar to SS with less testing time. Notice that the time to generate SW patches is negligible, so during testing the SW method is about 2×, 8×, and 13× faster than EdgeBoxes, SS, and HCP, respectively. For systems focusing only on object classification, the SW method is preferable, as it reduces the testing time significantly at no cost in performance.
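Since the exact window scales are not spelled out above, the following sketch uses seven assumed square scales that are multiples of 32 (as the 32 × {...} construction suggests) with the stated step size of 32; it is an illustration, not the exact configuration.

```python
def sliding_window_boxes(width, height,
                         scales=(64, 96, 128, 160, 192, 224, 256),  # assumed
                         step=32):
    """Generate sliding-window patches at several square scales.

    The seven scales above are illustrative placeholders (multiples of 32);
    only the step size of 32 is stated in the text.
    """
    boxes = []
    for s in scales:
        for y in range(0, max(height - s, 0) + 1, step):
            for x in range(0, max(width - s, 0) + 1, step):
                boxes.append((x, y, x + s, y + s))
    return boxes
```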
6. Conclusions
In this paper, a novel DPL method is proposed, which integrates patch feature learning, image representation learning, and object classification and discovery into a unified framework. DPL explicitly optimizes patch-level image representations, which is quite different from conventional CNNs. It also combines CNN based patch-level feature learning with MIL methods, and thus can be trained in a weakly supervised manner. The excellent performance of DPL on object classification and discovery confirms its effectiveness. These inspiring results show that learning good patch-level image representations and combining CNNs with MIL are very promising directions to explore in various vision problems. In the future, we will study how to apply DPL to other visual recognition problems, including very large scale problems.
Acknowledgements
This work was primarily supported by the National Natural Science Foundation of China (NSFC) (No. 61503145, No. 61572207, and No. 61573160) and the CAST Young Talent Supporting Program.

References

[1] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: CVPR, 2009, pp. 248–255.
[2] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, A. Zisserman, The pascal visual object classes (VOC) challenge, IJCV 88 (2) (2010) 303–338.
[3] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: ECCV, 2014, pp. 740–755.
[4] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: CVPR, 2014, pp. 580–587.
[5] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, TPAMI 37 (9) (2015) 1904–1916.
[6] R. Girshick, Fast R-CNN, in: ICCV, 2015, pp. 1440–1448.
[7] H. Yang, J. T. Zhou, Y. Zhang, B. Gao, J. Wu, J. J. Cai, Exploit bounding box annotations for multi-label object recognition, in: CVPR, 2016, pp. 280–288.
[8] T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence 89 (1) (1997) 31–71.
[9] G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: ECCV Workshop on Statistical Learning in Computer Vision, 2004, pp. 1–22.
[10] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: CVPR, 2009, pp. 1794–1801.
[11] Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in: ECCV, 2014, pp. 392–407.
[12] P. F. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part based models, TPAMI 32 (9) (2010) 1627–1645.
[13] X. Wang, Z. Zhu, C. Yao, X. Bai, Relaxed multiple-instance SVM with application to object discovery, in: ICCV, 2015, pp. 1224–1232.
[14] R. G. Cinbis, J. J. Verbeek, C. Schmid, Multi-fold MIL training for weakly supervised object localization, in: CVPR, 2014, pp. 2409–2416.
[15] J. Amores, Multiple instance classification: Review, taxonomy and comparative study, Artificial Intelligence 201 (2013) 81–105.
[16] M. Cimpoi, S. Maji, I. Kokkinos, A. Vedaldi, Deep filter banks for texture recognition, description, and segmentation, IJCV (2015) 1–30.
[17] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: ICLR, 2015.
[18] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation 1 (4) (1989) 541–551.
[19] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1097–1105.
[20] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, S. Yan, HCP: A flexible CNN framework for multi-label image classification, TPAMI 38 (9) (2016) 1901–1907.
[21] L. Liu, C. Shen, L. Wang, A. van den Hengel, C. Wang, Encoding high dimensional local features by sparse coding based fisher vectors, in: NIPS, 2014, pp. 1143–1151.
[22] J. Sánchez, F. Perronnin, T. Mensink, J. J. Verbeek, Image classification with the fisher vector: Theory and practice, IJCV 105 (3) (2013) 222–245.
[23] J. Wu, Y. Yu, C. Huang, K. Yu, Deep multiple instance learning for image classification and auto-annotation, in: CVPR, 2015, pp. 3460–3469.
[24] C. Wang, W. Ren, K. Huang, T. Tan, Weakly supervised object localization with latent category learning, in: ECCV, 2014, pp. 431–445.
[25] Q. Zhang, S. A. Goldman, Em-dd: An improved multiple-instance learning technique, in: NIPS, Vol. 1, 2001, pp. 1073–1080.
[26] S. Andrews, I. Tsochantaridis, T. Hofmann, Support vector machines for multiple-instance learning, in: NIPS, 2002, pp. 561–568.
[27] X. Wei, J. Wu, Z. Zhou, Scalable algorithms for multi-instance learning, TNNLS PP (99) (2016) 1–13.
[28] C. M. Bishop, Pattern recognition and machine learning, Springer, 2006.
[29] Z. Zhou, M. Zhang, Multi-instance multi-label learning with application to scene classification, NIPS 19 (2007) 1609.
[30] M. Zhang, Z. Zhou, M3miml: A maximum margin method for multi-instance multi-label learning, in: ICDM, 2008, pp. 688–697.
[31] N. Nguyen, A new svm approach to multi-instance multi-label learning, in: ICDM, 2010, pp. 384–392.
[32] R. Caruana, Multitask learning, Machine Learning 28 (1) (1997) 41–75.
[33] D. G. Lowe, Distinctive image features from scale-invariant keypoints, IJCV 60 (2) (2004) 91–110.
[34] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, 2005, pp. 886–893.
[35] X. Bai, S. Bai, Z. Zhu, L. J. Latecki, 3D shape matching via two layer coding, TPAMI 37 (12) (2015) 2361–2373.
[36] M. Pandey, S. Lazebnik, Scene recognition and weakly supervised object localization with deformable part-based models, in: ICCV, 2011, pp. 1307–1314.
[37] S. Singh, A. Gupta, A. Efros, Unsupervised discovery of mid-level discriminative patches, in: ECCV, 2012, pp. 73–86.
[38] X. Wang, B. Wang, X. Bai, W. Liu, Z. Tu, Max-margin multiple-instance dictionary learning, in: ICML, 2013, pp. 846–854.
[39] M. Juneja, A. Vedaldi, C. V. Jawahar, A. Zisserman, Blocks that shout: Distinctive parts for scene classification, in: CVPR, 2013, pp. 923–930.
[40] C. Doersch, A. Gupta, A. A. Efros, Mid-level visual element discovery as discriminative mode seeking, in: NIPS, 2013, pp. 494–502.
[41] J. Sun, J. Ponce, Learning discriminative part detectors for image classification and cosegmentation, in: ICCV, 2013, pp. 3400–3407.
[42] B. Shi, X. Bai, C. Yao, Script identification in the wild via discriminative convolutional neural network, Pattern Recognition 52 (2016) 448–458.
[43] Y. Zhou, X. Bai, W. Liu, L. J. Latecki, Similarity fusion for visual tracking, IJCV (2016) 1–27.
[44] P. Tang, X. Wang, B. Feng, W. Liu, Learning multi-instance deep discriminative patterns for image classification, TIP PP (99) (2016) 1–1.
[45] P. Tang, J. Zhang, X. Wang, B. Feng, F. Roli, W. Liu, Learning extremely shared middle-level image representation for scene classification, Knowledge and Information Systems (2016) 1–22.
[46] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: CVPR, 2006, pp. 2169–2178.
[47] H. Lobel, R. Vidal, A. Soto, Hierarchical joint max-margin learning of mid and top level representations for visual recognition, in: ICCV, 2013, pp. 1697–1704.
[48] S. N. Parizi, A. Vedaldi, A. Zisserman, P. Felzenszwalb, Automatic discovery and optimization of parts for image classification, in: ICLR, 2015.
[49] V. Sydorov, M. Sakurada, C. H. Lampert, Deep fisher kernels - end to end learning of the fisher kernel GMM parameters, in: CVPR, 2014, pp. 1402–1409.
[50] J. R. Uijlings, K. E. van de Sande, T. Gevers, A. W. Smeulders, Selective search for object recognition, IJCV 104 (2) (2013) 154–171.
[51] C. L. Zitnick, P. Dollár, Edge boxes: Locating object proposals from edges, in: ECCV, 2014, pp. 391–405.
[52] A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: CVPR Workshops, 2014, pp. 806–813.
[53] M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, in: CVPR, 2014, pp. 1717–1724.
[54] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, in: BMVC, 2014.
[55] C. Barat, C. Ducottet, String representations and distances in deep convolutional neural networks for image classification, Pattern Recognition 54 (2016) 104–115.
[56] T. Deselaers, B. Alexe, V. Ferrari, Weakly supervised localization and learning with generic knowledge, IJCV 100 (3) (2012) 275–293.
[57] Z. Zuo, G. Wang, B. Shuai, L. Zhao, Q. Yang, Exemplar based deep discriminative and shareable feature learning for scene image classification, Pattern Recognition 48 (10) (2015) 3004–3015.
[58] M. Oquab, L. Bottou, I. Laptev, J. Sivic, Is object localization for free? - weakly-supervised learning with convolutional neural networks, in: CVPR, 2015, pp. 685–694.
[59] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: ACM MM, 2014, pp. 675–678.
[60] Z. Shi, T. Hospedales, T. Xiang, Bayesian joint topic modelling for weakly supervised object localisation, in: ICCV, 2013, pp. 2984–2991.
[61] H. Bilen, M. Pedersoli, T. Tuytelaars, Weakly supervised object detection with convex clustering, in: CVPR, 2015, pp. 1081–1089.
[62] S. Yan, J. Dong, Q. Chen, Z. Song, Y. Pan, W. Xia, Z. Huang, Y. Hua, S. Shen, Generalized hierarchical matching for sub-category aware object classification, in: ECCV Workshop on Visual Recognition Challenge, 2012.
[63] M. Cho, S. Kwak, C. Schmid, J. Ponce, Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals, in: CVPR, 2015, pp. 1201–1210.
[64] Y. Li, L. Liu, C. Shen, A. van den Hengel, Image co-localization by mimicking a good detector's confidence score distribution, in: ECCV, 2016, pp. 19–34.