Diverse Sampling for Self-Supervised Learning of Semantic Segmentation
Mohammadreza Mostajabi*  Nicholas Kolkin*  Gregory Shakhnarovich
Toyota Technological Institute at Chicago
{mostajabi, nick.kolkin, greg}@ttic.edu
*Authors contributed equally.

Abstract
We propose an approach for learning category-level semantic segmentation purely from image-level classification tags indicating the presence of categories. It exploits localization cues that emerge from training classification-tasked convolutional networks to drive a "self-supervision" process that automatically labels a sparse, diverse training set of points likely to belong to classes of interest. Our approach has almost no hyperparameters, is modular, and allows for very fast training of segmentation in less than 3 minutes. It obtains competitive results on the VOC 2012 segmentation benchmark. More significantly, the modularity and fast training of our framework allow new classes to be efficiently added for inference.
1. Introduction
The problem of semantic segmentation (category-level labeling of pixels) has attracted significant attention. Most recent progress can be attributed to advances in deep learning and to the availability of large, manually labeled data sets. However, the cost and complexity of annotating segmentation are significantly higher than those for classification; consequently, we have orders of magnitude more images and categories in classification data sets such as ImageNet [24] or Places2 [32] than in segmentation data sets such as VOC [8] or MS-COCO [15].

Given this gap, and the objective difficulty in rapidly closing it, many researchers have considered weakly supervised segmentation, where the goal is still pixel-level labeling at test time, but only spatially coarser annotations are available at training time. Common examples of such annotations include partial annotations, in which only a subset of pixels is labeled; bounding boxes, where a box with an associated label is drawn around objects of interest; and image tags, where labels provide no spatial information and simply indicate whether or not a particular class is present somewhere in the image. We focus on this last, arguably weakest, level of per-image supervision.

There is mounting evidence that this task, while difficult, is not hopeless. Units sensitive to object localization have been shown to emerge as part of the representations learned by convolutional neural networks (CNNs) trained for image classification [31]. Furthermore, some localization methods demonstrate the utility of features learned by classification CNNs by using them to achieve competitive results [18, 31].

Our method is inspired by recent work [23] demonstrating that reasonable segmentation accuracy can be achieved with very few point-wise labels provided by human annotators. In this paper we propose an automatic version of this idea, replacing human annotators with an automatic labeling procedure. Our approach starts by learning noisy localization networks separately for each foreground class, trained solely with image-level classification tags, using a novel multiple instance learning loss (global softmax, Section 3) adapted for the segmentation task. By combining the localization evidence provided by these networks with a novel diversity sampling procedure, we obtain a sparse, informative, and accurate set of labeled pixels. We can use these samples to rapidly train a fully convolutional multi-class pixel-wise label predictor operating on a hypercolumn/zoomout representation of image features [9, 17] in less than 3 minutes.

In contrast to much previous work, our approach is simple and modular. It almost entirely lacks hyperparameters like thresholds and weighting coefficients. It also allows for easy addition of new foreground classes or incorporation of more image examples for some classes, without the need to retrain the entire system. We also avoid complex integration with externally trained components, other than the basic ImageNet-trained neural network we use to extract pixel features. Consequently, while competitive models take hours to train, our framework takes less than 3 minutes. Despite this simplicity, we obtain results on the VOC 2012 data set that improve upon most previous work on image-level weak supervision of segmentation.
Figure 1: Our diverse sampling procedure. Left: input image, labeled as containing class dog. Middle: first four steps of the diverse sampling procedure for dog, starting with the map S(·, dog). Right: sample of 20 diverse points.
2. Background
Semantic segmentation has seen major progress in recent years, at least as measured by benchmark performance, nearly doubling in accuracy from 2012 [3, 4] to today's leading methods [16, 5, 30]. Much of this progress can be attributed to the (re)introduction of convolutional neural networks. The availability of training data with manually annotated segmentation masks, in particular VOC [8] and recently MS-COCO [15], has been instrumental in these developments. However, recent work has shown that training on weaker annotations, such as partial labeling [23], object bounding boxes [19, 7, 10], or other cues such as object sizes [20], can produce results very close to those using strongly-supervised training. Closing this gap with strongly-supervised methods when training exclusively on image-level tags, which is the regime we consider in this paper, remains more challenging.

Our work was in part inspired by the experiments in [23] showing that very sparse point-wise supervision allows training reasonable CNN-based segmentation models. Our approach aims to replace manual annotation with "self-supervision" obtained automatically from image-level tags. Similarly to other recent work, we obtain self-supervision by leveraging a recently established observation: CNNs trained for image classification tasks appear to contain internal representations tuned to object localization [2, 31, 18, 6]. These representations have been combined with a pooling strategy to obtain self-supervision, which can be used to train a segmentation model, often with a variant of the Expectation-Maximization algorithm [19, 28] or with multiple instance learning (MIL) [21, 22, 11, 13].

Some recent methods combine image-level supervision with additional components, such as object proposals [22], saliency [26], or objectness measures [23]. Most of these components require localization annotations, such as bounding boxes and/or segmentation masks, which introduces a dependency on additional, often expensive annotations beyond image-level tags. In contrast, our approach is simple and modular, and does not require any external systems other than the initial CNN [25] pretrained on the ImageNet classification task [24], making the entire pipeline independent of any requirements beyond image-level annotations.

The most established benchmark for this task remains the VOC 2012 data set, with the standard performance measure being intersection over union averaged over 21 classes (mIoU), which can be reported on the val or the test set. The latter is arguably more rigorous, since the labels are withheld and evaluation frequency is limited. (MS-COCO is larger in categories and images, but at the moment does not allow for a proper category-level semantic segmentation evaluation, due to its focus on instance-level detection.) In Section 4 we show that despite its simplicity and efficiency, our approach outperforms most competing methods on this benchmark.

Finally, there is a body of work related to exploring and exploiting diversity in learning and vision. Most relevant to our work is the DivMBest algorithm [1, 29], which somewhat resembles our procedure for greedy diverse sampling of points described in Section 3.2. The form of the diversity-infused objective and the context are quite different, however: DivMBest is used to sample assignments in a structured prediction setting for a single input example, whereas we aim to sample examples (image points); in applications of DivMBest the diverse samples are typically fed to a post-processing stage like reranking, whereas in our case the diverse sample is used directly as a training set for a segmentation algorithm.
3. Automatic pointwise self-supervision for segmentation
The basis of our self-supervision method is the localization maps obtained for each of the foreground classes. These maps are sampled for each class, and the resulting sparse set of point-wise labels on the training images is used to train the final segmentation network. We describe these steps in detail below.
3.1. Learning localization with image-level tags

We start by extracting an image feature map using a pretrained fully convolutional network. Then, for each foreground class $c$, we construct a per-location localization network on top of these features, which outputs two scores per location $i$: $S(c,i)$ for the foreground and $\bar{S}(c,i)$ for the background (which in this case means anything other than the foreground class at hand).

The obvious next step is to convert these scores into an image-level foreground probability, using some sort of pooling scheme; this can then be used to compute an image-level classification log-loss and backpropagate it to the localization network. We consider two such schemes.
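Before detailing the two pooling schemes, here is a minimal sketch of the per-class scoring head just described. It is written in PyTorch as an illustration (the paper's experiments used Torch7), and all names are ours, not the authors'; the feature dimensionality matches the zoomout features of Section 4.

```python
# Sketch (PyTorch, not the authors' Torch7 code) of the per-class localization
# head: a 1x1 convolution over a precomputed feature map that outputs two
# scores per location, S(c, i) and S_bar(c, i).
import torch
import torch.nn as nn

class LocalizationHead(nn.Module):
    def __init__(self, feat_dim=4224):  # 4224 = zoomout feature dim (Sec. 4)
        super().__init__()
        # channel 0: foreground score S, channel 1: background score S_bar
        self.score = nn.Conv2d(feat_dim, 2, kernel_size=1)

    def forward(self, features):          # features: (B, feat_dim, H, W)
        s = self.score(features)
        return s[:, 0], s[:, 1]           # S: (B, H, W), S_bar: (B, H, W)
```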
The per-pixel softmax model.  We can convert the scores into per-location posterior probabilities using the standard (over-parameterized) softmax model, and apply max pooling over the resulting probability map:

\[ p(c) = \max_i \frac{\exp S(c,i)}{\exp S(c,i) + \exp \bar{S}(c,i)}. \tag{1} \]

This can be interpreted as requiring that, for images containing the foreground, the network assign at least one location high probability, while for images without the foreground, all locations must have low probability. The background scores $\bar{S}$ have no direct meaning other than to normalize the probabilities.
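A minimal sketch of this pooling scheme and the resulting image-level log-loss, again in PyTorch with hypothetical names; it assumes the score maps of a single class's localization head as input.

```python
# Sketch (our illustration, not the authors' code) of the per-pixel softmax
# pooling of Eq. (1): per-location softmax over (S, S_bar), then a spatial max
# to obtain the image-level foreground probability p(c).
import torch
import torch.nn.functional as F

def per_pixel_softmax_pool(S, S_bar):
    # S, S_bar: (B, H, W) foreground / background score maps for one class
    prob_fg = torch.softmax(torch.stack([S, S_bar], dim=-1), dim=-1)[..., 0]
    return prob_fg.flatten(1).max(dim=1).values      # (B,) image-level p(c)

def image_level_loss(S, S_bar, present):
    # present: (B,) float tensor, 1 if the class is tagged in the image
    p = per_pixel_softmax_pool(S, S_bar).clamp(1e-6, 1 - 1e-6)
    return F.binary_cross_entropy(p, present)        # classification log-loss
```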
The global softmax model.  An alternative is to apply max-pooling separately to the two score maps, and convert the maxima to an image-level probability:

\[ p(c) = \frac{\max_i \exp S(c,i)}{\max_j \exp S(c,j) + \max_l \exp \bar{S}(c,l)}. \tag{2} \]

This model is no longer equivalent to the per-location softmax, and in fact does not provide a per-location probability map. It specifically encourages the background scores to be high for images without the foreground. It also routes the gradient of the loss through two locations in each image, instead of one as with (1), and therefore may facilitate faster training.

It is worth noting that this approach does not include an explicit "background localization" model. Background is defined separately for each foreground class as its complement, and jointly as the complement of all foreground classes. Adding another foreground class would require training only one new localization model for that class; the definition of segmentation background would then be automatically updated, and reflected in the sampling process described below.

3.2. Greedy diverse sampling of points

We now consider the goal of translating class-specific score maps into a supervisory signal for semantic segmentation. Our general framework is to select a sparse set of locations from the training images, to which we assign class labels. The segmentation predictor is then trained by learning to map the image features at these locations to the assigned class labels.

Let $S(i,c)$ be the score at spatial index $i$ for a class $c$ produced by our image-level classification model. One approach would be to densely label $y_i = \arg\max_c S(i,c)$. Background requires separate treatment: it is not a "real" class; rather, it is defined by not being one of the foreground classes. Hence, we do not have a separate model for it, and instead can assign background labels to pixels in which no foreground class attains a sufficient score: $y_i = \mathrm{bg}$ if $\max_c S(i,c) < \tau$. This simplistic strategy has two problems: (1) some classes may have systematically lower scores than others, and (2) it is unclear how to optimally set the value of $\tau$.

However, our hypothesis is that while the score maps provide only coarse localization, and an inconsistent level of confidence across images and classes, the maximum activations of a class score map when that class is present appear to reliably correspond to pixels containing the correct class. (We verified this qualitatively, on a few classes and a number of training images.) So, an alternative approach is to label as $c$ the $k$ locations with the highest scores for class $c$. However, the size of objects varies widely across images, and it is not clear what $k$ should be. If $k$ is too high, the labels will be very noisy. If $k$ is too low, most of the pixels will be tightly clustered portions of each class, e.g., wheels of cars, or faces of people; training on such examples is much less effective because many of the samples will be highly correlated.

The method we propose here alleviates these problems by relying on diversity sampling. Let $z_i$ be the image feature vector at spatial index $i$, normalized to unit norm, and let $F$ be the set of foreground classes present in the image. For each class $c \in F$ we define the $k$-th sampled location $x^c_k$ from that image by induction:

\[ x^c_1 = \arg\max_i S(i,c), \tag{3} \]
\[ x^c_k = \arg\max_i \Big\{ S(i,c) \Big[\, 1 - \max_{k' < k} \langle z_i, z_{x^c_{k'}} \rangle \Big] \Big\}, \quad k > 1. \tag{4} \]

That is, each new point must have a high score for $c$ while being visually dissimilar, in feature space, from all previously selected points for that class; this is the procedure illustrated in Figure 1. Background points are sampled analogously, driven purely by diversity with respect to the selected foreground points and the previously selected background points.
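Under the reconstruction of Eqs. (3)-(4) above, the greedy selection for one class in one image can be sketched as follows (NumPy; our own illustration, not the authors' code):

```python
# Greedy diverse sampling: pick k locations with high class score and low
# feature similarity to the points already selected (Eqs. (3)-(4)).
import numpy as np

def diverse_sample(scores, features, k=20):
    """scores: (N,) class score per location; features: (N, D), rows unit-norm.
    Returns indices of k locations with high score and high mutual diversity."""
    selected = [int(np.argmax(scores))]            # Eq. (3): highest-scoring point
    # max_sim[i] tracks max_{k'} <z_i, z_{x_k'}> over the points selected so far
    max_sim = features @ features[selected[0]]
    for _ in range(k - 1):
        gain = scores * (1.0 - max_sim)            # Eq. (4): score * (1 - max sim)
        gain[selected] = -np.inf                   # never re-select a point
        i = int(np.argmax(gain))
        selected.append(i)
        max_sim = np.maximum(max_sim, features @ features[i])
    return selected
```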
The final segmentation network is then trained on the sampled, labeled points. This can be done using the standard convnet training machinery, with zero-masking applied to the loss in locations where no labels are available; whether to fine-tune the underlying network that extracts the visual features per location is a design choice.

4. Experiments

In order to compare our method to other work on segmentation, we conduct all of our experiments on the VOC 2012 data set. For training images (10,582 images in the train-aug set) we discard all annotations except for image-level labels indicating which of the 20 foreground classes is present in each image. We evaluate various versions of our method, as well as its components individually, on the val set, and finally use the models chosen based on these experiments to obtain results on the test set from the VOC evaluation server. All experiments were done in Torch7, using the Adam [12] update rule when training networks.

Pixel features.  As the base CNN we use the VGG-16 network trained on ImageNet and made publicly available by the authors of [25], from which we remove all layers above pool5. This network is run in fully convolutional mode on input images resized to 256 × 336 pixels. Each of the 13 feature maps (outputs of all convolutional layers, with pooling applied when available) is then resized to 1/4 of the input resolution and concatenated along the feature dimension. This produces a tensor in which each location on the coarse grid has a 4,224-dimensional feature vector. This closely follows the hypercolumn extraction protocol of [9] (but using all layers) and [17], but without superpixel pooling.

When computing dot products in diverse sampling (3), (4), we normalize zoomout feature vectors to unit norm in two stages: each feature dimension is normalized to be zero-mean and unit-variance over the entire training set, then each feature vector is scaled to unit Euclidean norm.

Localization models.  For each class, the fully convolutional localization network consists of a 1 × 1 convolutional layer producing the two score maps, $S$ for the foreground and $\bar{S}$ for the background. At training time, for the global softmax model (2) this is further followed by a global max pooling layer and the softmax layer, while for the per-pixel softmax (1) the order of softmax and max pooling is reversed.

For each class, we train the network on all positive examples for that class (images that contain it) and an equal number of randomly sampled negative examples, with a batch size of 1 image and momentum 0.9; after 2 epochs the learning rate is decreased for one additional epoch.

We experimented with adding higher-layer features (fc7 from VGG-16) to the input of the localization networks, but found that this makes localization worse: it is too easy for the network to determine the presence of objects from these complex, translation-invariant features. We do, however, bring these features back when training the final segmentation model, described next.
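Before turning to the segmentation model, here is a sketch of the two-stage feature normalization described under Pixel features above, assuming the zoomout features have been gathered into a single (N, D) matrix over the training set (our illustration, not the authors' code):

```python
# Two-stage normalization of zoomout features: per-dimension standardization
# over the training set, then per-vector scaling to unit L2 norm.
import numpy as np

def normalize_features(train_feats):
    """train_feats: (N, D) zoomout feature vectors pooled over all locations."""
    mean = train_feats.mean(axis=0)
    std = train_feats.std(axis=0) + 1e-8                  # avoid division by zero
    z = (train_feats - mean) / std                        # stage 1: zero-mean, unit-variance
    z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-8  # stage 2: unit L2 norm
    return z, (mean, std)                                 # stats reused at test time
```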
Segmentation model.  To provide image-level priors, which have been reported to improve segmentation results in both fully supervised [17] and weakly supervised [22] settings, we augment the zoomout feature map with global features (layer fc7 of VGG-16, pooled over the entire image and replicated for all locations). The combined feature map (8,320 features per location) is fed to a conv. layer with 512 units, followed by ReLU and the 21-channel prediction layer, followed by the softmax loss layer. We train this network on the selected set of points pooled over all training images, using batch size 100 (note: this means 100 sample points, not 100 images!), a fixed learning rate, and momentum 0.9, for two epochs. With these settings, the typical time to train the final segmentation model is less than 3 minutes on a single Titan X GPU.

Figure 2: Example segmentations on extra classes from the MS-COCO dataset, added to the segmentation model trained on the VOC dataset.

Table 1: Comparison of localization models on VOC 2012 val.
  Model            mIoU
  Pixel softmax    38.0
  Global softmax   40.6

Table 2: Comparison of sampling strategies on VOC 2012 val.
  Sampling           mIoU
  Dense              15.0
  Spatial (k = 20)   33.4
  Top-k (k = 20)
  Diverse (k = 20)   40.6

Table 4: Comparison of competitive segmentation methods, supervised with image-level tags. For reference, the top of the table includes representative numbers for methods trained with stronger supervision regimes on VOC 2012 data, or on additional data.
  Method            val   test  comments
  DeepLab-CRF [5]   68.7  71.6  fully supervised
  FCRN [27]         74.8  77.3  fully supervised
  BoxSup [7]        62.0  64.6  bounding box-supervised
  Bbox-Seg [19]     60.6  62.2  bounding box-supervised
  1 point [23]      35.1        manual annotation, 1 pt/class
  1 point+Obj [23]  42.7        + objectness prior
  STC [26]          49.8  51.2  externally trained saliency model and 40K extra images
  MIL-sppx [22]     36.6  35.8  superpixel smoothing
  CCNN [20]         35.3  35.6
  EM-Adapt [19]     38.2  39.6
  Ours              40.6  41.2  no post-processing
  Ours              45.2  46    CRF post-processing, less than 3 minutes training time
  SEC [13]          50.7  51.5  7-8 hours training time

We start by evaluating the components of our approach on the val set.

Localization model.  As shown in Table 1, the global softmax model (2) obtains significantly better results than the per-pixel softmax (1). We therefore choose it for all subsequent experiments.

Self-supervision by localization maps.  We could attempt to use the score maps produced by the localization networks directly as the predicted segmentation maps. Specifically, we considered assigning each pixel to the highest-scoring class (after normalizing the scores so that for each class, on average, images with the foreground present have a highest score of 1), or to the background if the highest foreground score is below a threshold. This results in poor segmentation accuracy: the highest val mIoU after sweeping threshold values was 25.66, at a threshold of 0.2.

We also attempted to use these score-based segmentation maps as the source of self-supervision directly, without sampling. That is, we can train the segmentation network in the usual fully-supervised way, giving it the score map-based dense segmentation labels as if they were ground truth. The results were very poor: in a large fraction of the pixels the localization models are uncertain, and while our sampling focuses on high-score points, dense self-supervision is forced to make a decision at those uncertain points as well, leading to a very noisy labeling.

Effect of sampling strategy.  For our diverse sampling method, we need to set the value of k. Table 3 shows the effect this value has on val accuracy. The optimal k among those tested is 20, but the behavior is stable across a large range of values.

Table 3: mIoU on val as a function of the number of points k in diverse sampling, with the global softmax model.

We also compared alternative sampling strategies, namely selecting the top k scoring points for each class, or using diversity in the spatial domain instead of the feature domain. Table 2 shows that diverse sampling using feature similarity is indeed superior to both. Figure 3 shows a few qualitative examples of diverse sampling outputs. Notably, sampling for background is usually quite accurate, even though it is oblivious to the actual class scores and is driven entirely by diversity with respect to the foreground points and within the background.
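All quantitative results in Tables 1, 2, and 4 are mean intersection over union (mIoU) over the 21 VOC classes. As a reference point, here is a minimal sketch of this metric; it is our simplification (the official VOC evaluation additionally ignores pixels marked as void):

```python
# Minimal sketch of the mIoU metric used throughout; not the official VOC code.
import numpy as np

def mean_iou(pred, gt, num_classes=21):
    """pred, gt: integer label maps of the same shape (images may be
    flattened together). Returns IoU averaged over classes."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                       # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```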
Based on the preliminary experiments on val above, we identify the optimal setting for our approach: the global softmax localization model, with diverse sampling and k = 20. We report the results of this method in Table 4. We also report the results of applying a fully-connected CRF [14] with the authors' default parameters on top of our predictions.

The top portion of the table contains representative results for stronger supervision scenarios, for reference; these are not directly comparable to our results. Among the methods trained on the same data and in the same regime as ours, our results are the highest. It is interesting that we obtain results similar to those with manual annotation of a single point per class per image [23] (better than their results without the objectness prior), although our point selection is fully automatic.

We are aware of additional results from concurrent work appearing on arXiv. STC [26] reports a test mIoU of 51.2; it is trained on 40,000 additional images, collected in a carefully designed procedure to make them easy to learn from. STC also uses an externally trained saliency mechanism, which requires mask annotations to train. The only other method trained solely on VOC data which has higher accuracy is SEC [13], achieving 51.5 mIoU on test. However, their approach is considerably more complex than ours, employing hyperparameters that determine various thresholds, and the time to train the final segmentation system with SEC is almost two orders of magnitude higher than ours (7-8 hours vs. 3 minutes for us). SEC also uses a significantly larger field of view for the underlying segmentation network than in our experiments (378 vs. 224 for us), and the results reported in [13] suggest that this may be very important. We plan to investigate whether increasing the field of view of the segmentation network improves our results as well.

Not only does our method obtain competitive results, but due to its very fast training of the segmentation model, it is practical to add new classes on the fly, unlike with other approaches. We show some qualitative results obtained by our model on val images in Figure 4.

Adding new object classes on the fly.  One of the key characteristics of our model is modularity. Suppose we want to add new classes such as Giraffe and Elephant, which are not part of the VOC dataset. We train localization models for the new classes using only image-level tags from the MS-COCO dataset for Giraffe and Elephant. Since we have trained the localization model for each class separately, there is no need to re-train the localization models of the other classes. The segmentation training data for the new classes is the sparse set of diverse points extracted from the localization model outputs. Sparsity significantly speeds up segmentation training, and diversity leads to high-quality segmentation output. It takes less than 3 minutes to re-train the segmentation model with the additional classes, without hurting its accuracy on the Pascal segmentation benchmark. Hence it is practical to add new classes on the fly. Qualitative examples of segmentation results for the new classes are shown in Figure 2.
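The workflow just described can be summarized in a short sketch. The three callables stand in for the components sketched earlier in this section, and every name here is a hypothetical placeholder rather than the authors' API:

```python
# Sketch (illustration only) of the class-addition workflow described above.
from typing import Callable

def add_class(name: str,
              tagged_images: list,
              train_localizer: Callable,
              sample_points: Callable,
              existing_points: list,
              retrain_head: Callable):
    # 1. Train one new localization model from image-level tags only;
    #    localizers for existing classes are untouched (modularity).
    localizer = train_localizer(name, tagged_images)
    # 2. Self-label a sparse, diverse point set for the new class.
    new_points = sample_points(localizer, tagged_images)
    # 3. Retrain only the small segmentation head on old + new points (< 3 min).
    return retrain_head(existing_points + new_points)
```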
5. Conclusions

We have proposed an approach to learning category-level semantic segmentation when the only annotation available is tags indicating the presence (or absence) of each foreground class in each image. Our approach is based on a form of self-supervision, in which a sparse, visually diverse set of points in training images is labeled based on class-specific localization maps predicted from the image tags.

Among the appealing properties of our method are its simplicity, near absence of hyperparameters (and insensitivity to the only hyperparameter, the self-supervision sample size), modularity (easy to update the model with new classes and/or examples), lack of reliance on complex external components requiring strong supervision, and, last but not least, competitive empirical performance and speed.

Acknowledgments

We gratefully acknowledge a gift of GPUs from NVIDIA Corporation. GS was partially supported by NSF award 1409837.

Figure 3: Examples of diverse sampling outputs (shown for bottle and tvmonitor; person and bicycle; person and horse). For each foreground class, we show the localization score map from the global softmax model and the selected 20 points. For background, the map shows the max over dot products with any selected foreground points.

Figure 4: Examples of segmentations learned through our self-supervision approach. From left: input image, ground truth, thresholded localization score maps, segmentation learned with diverse k = 20 sampling, CRF post-processing.

References

[1] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-best solutions in Markov random fields. In ECCV, 2012.
[2] A. Bergamo, L. Bazzani, D. Anguelov, and L. Torresani. Self-taught object localization with deep networks. arXiv:1409.3964, 2014.
[3] X. Boix, J. M. Gonfaus, J. van de Weijer, A. D. Bagdanov, J. S. Gual, and J. Gonzàlez. Harmony potentials - fusing global and local scale for semantic image segmentation. IJCV, 2012.
[4] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[6] R. G. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. arXiv:1503.00949, 2015.
[7] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
[8] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes challenge: A retrospective. IJCV, 111(1), 2015.
[9] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[10] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele. Weakly supervised semantic labelling and instance segmentation. arXiv:1603.07485, 2016.
[11] H.-E. Kim and S. Hwang. Deconvolutional feature stacking for weakly-supervised semantic segmentation. arXiv:1602.04984, 2016.
[12] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[13] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. arXiv:1603.06098, 2016.
[14] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[17] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.
[18] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, 2015.
[19] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, 2015.
[20] D. Pathak, P. Krähenbühl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.
[21] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. In ICLR Workshop, 2015.
[22] P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
[23] O. Russakovsky, A. L. Bearman, V. Ferrari, and F.-F. Li. What's the point: Semantic segmentation with point supervision. arXiv:1506.02106, 2015.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[26] Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, Y. Zhao, and S. Yan. STC: A simple to complex framework for weakly-supervised semantic segmentation. arXiv:1509.03150, 2015.
[27] Z. Wu, C. Shen, and A. van den Hengel. High-performance semantic segmentation using very deep fully convolutional networks. arXiv:1604.04339, 2016.
[28] J. Xu, A. G. Schwing, and R. Urtasun. Learning to segment under various forms of weak supervision. In CVPR, 2015.
[29] P. Yadollahpour, D. Batra, and G. Shakhnarovich. Discriminative re-ranking of diverse segmentations. In CVPR, 2013.
[30] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. arXiv:1502.03240, 2015.
[31] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. In ICLR, 2015.
[32] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.