NoPeopleAllowed: The Three-Step Approach to Weakly Supervised Semantic Segmentation
Mariia Dobko [email protected], Ostap Viniavskyi [email protected], Oles Dobosevych [email protected]
The Machine Learning Lab, Ukrainian Catholic University, Lviv, Ukraine; SoftServe, Lviv, Ukraine
Figure 1: The results of the proposed approach on the unseen data.
Abstract
We propose a novel approach to weakly supervised semantic segmentation, which consists of three consecutive steps. The first two steps extract high-quality pseudo-masks from image-level annotated data, which are then used to train a segmentation model in the third step. The presented approach also addresses two problems in the data: class imbalance and missing labels. Using only image-level annotations as supervision, our method is capable of segmenting various classes and complex objects. It achieves 37.34 mean IoU on the test set, placing 3rd in the LID Challenge in the task of weakly supervised semantic segmentation.
1. Introduction
Deep learning methods have proven their efficiency in a variety of computer vision tasks, including semantic segmentation. However, their application to semantic segmentation typically requires large amounts of data with pixel-level annotations. We can overcome this issue by developing weakly supervised methods that rely only on image-level labels. Omitting pixel-wise annotations also provides considerable advantages: it is less expensive and less time-consuming. Lin et al. [10] calculated that collecting bounding boxes for each class is about 15 times faster than producing a ground-truth pixel-wise segmentation mask; collecting image-level labels is even more time-efficient. Working with image-level annotations also decreases the probability of disagreement between experts, since pixel-wise annotations tend to be less accurate and have a higher variance among labelers.
Data:
The dataset used was proposed in the LID Challenge [11]. It consists of 456,567 training images of objects from 201 classes, including background. The validation and test sets have pixel-wise annotations, which are publicly available only for the validation set. The images in the train set are provided exclusively with image-level annotations. Moreover, the data have a lot of missing labels and are also highly imbalanced towards three classes: ‘dog’, ‘bird’, and ‘person’.

The class ‘person’ has a large impact on other classes in the data; it usually appears in combination with others and often overlaps with such classes as ‘microphone’, ‘sunglasses’, ‘unicycle’, etc. It is thus crucial to have correct labels for the class ‘person’; however, the opposite is observed in the data: the image-level labels for this class are often missing, creating an additional challenge for the task. So, not only are the data biased towards a certain class, but they also suffer from imperfect labelling. These problems are present in many datasets, so a solution overcoming them will make an essential contribution to the larger field.

We propose a novel weakly-supervised approach to semantic segmentation that uses only image-level annotations and deals with data that have severe class imbalance. It scores 37.34 mean Intersection over Union (IoU) on the test set, placing third in the LID Challenge.

2. Related Work

In this paper, we follow the self-supervised paradigm of weakly supervised semantic segmentation, which suggests training a fully supervised segmentation model on the pseudo-labels generated from a classifier network. The image-level annotations are used to train a classifier; Class Activation Maps (CAM) [12] are extracted afterward. Assessment of quantitative performance on the PASCAL VOC 2012 [4] validation set shows that the top five methods of weakly-supervised segmentation use the self-supervised learning approach [2].
The nature of the PASCAL VOC 2012 [4] dataset is similar to the LID Challenge data; thus, we use the lessons learned on PASCAL VOC 2012 [4] when developing a solution for the challenge.

Many methods of self-supervised learning for weakly supervised semantic segmentation have been suggested recently. Kolesnikov et al. [9] propose the Seed, Expand and Constrain (SEC) method, which trains a Convolutional Neural Network (CNN), applies CAM to produce pseudo-ground-truth segments, and then trains a Fully Convolutional Network (FCN) optimizing three losses: one for the generated seeds, another for the image-level label, and, finally, a constraint loss against the maps processed by Conditional Random Fields (CRF). Huang et al. [6] introduce Deep Seeded Region Growing (DSRG), which propagates class activations from high-confidence regions to adjacent regions with a similar visual appearance by applying a region-growing algorithm on the generated CAM. Another approach, proposed by Ahn et al. [1], suggests using the Inter-pixel Relation Network (IRNet), which takes the random walk from low-displacement-field centroids in the CAM up until the class boundaries as the pseudo-ground-truths for training an FCN. Ahn et al. [1] focus on the segmentation of individual instances, estimating two types of features in addition to CAM: a class-agnostic instance map and pairwise semantic affinities. We incorporate IRNet [1] into one of the steps of our approach.
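The CAM extraction underlying this paradigm can be sketched as follows (a minimal NumPy version; `features` and `fc_weights` are hypothetical stand-ins for the last-layer activations and classifier weights of a trained CNN):

```python
import numpy as np

def compute_cam(features, fc_weights, class_idx):
    """Class Activation Map: a weighted sum of the final conv feature maps.

    features:   (C_feat, H, W) activations from the last convolutional layer
    fc_weights: (num_classes, C_feat) weights of the global-pooling classifier
    """
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # -> (H, W)
    cam = np.maximum(cam, 0.0)        # keep only positive class evidence
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1]
    return cam

# Toy example: 4 feature channels on an 8x8 grid, 3 classes
rng = np.random.default_rng(0)
features = rng.random((4, 8, 8))
fc_weights = rng.random((3, 4))
cam = compute_cam(features, fc_weights, class_idx=1)
```

The normalized map is then thresholded or refined (e.g. with CRF) to obtain pseudo-segments.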
3. Method
The proposed approach consists of three consecutive steps: classification followed by CAM generation, IRNet [1] for activation-map improvement, and segmentation. Each step is followed by post-processing and improves the results of the previous one. All the experiments were performed on three Nvidia GeForce RTX 2080 Ti GPUs.
In the first step, we train fully-supervised classification models with image-level labels.
Input:
We remove the ‘person’ class labels and balance the other 199 classes (without background) using a downsampling technique. The obtained data are split into train and validation parts with 72,946 and 12,873 samples, respectively.
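The downsampling used for balancing can be sketched as follows (a hypothetical helper that caps the number of kept samples per class; the single-label pairs are a simplification of the multi-label data):

```python
import random
from collections import defaultdict

def downsample_by_class(samples, cap, seed=42):
    """Keep at most `cap` samples per class.

    samples: list of (image_id, class_label) pairs.
    Returns a class-balanced subset of the input pairs.
    """
    random.seed(seed)
    by_class = defaultdict(list)
    for image_id, label in samples:
        by_class[label].append(image_id)
    balanced = []
    for label, ids in by_class.items():
        random.shuffle(ids)                       # drop a random subset
        balanced.extend((i, label) for i in ids[:cap])
    return balanced

# Toy example: a heavily over-represented class gets capped
samples = [(f"img{i}", "dog") for i in range(100)] \
        + [(f"img{i}", "cat") for i in range(100, 110)]
balanced = downsample_by_class(samples, cap=20)
n_dog = sum(1 for _, label in balanced if label == "dog")
n_cat = sum(1 for _, label in balanced if label == "cat")
```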
Neural network architecture and loss: For this step, we choose the VGG16 architecture with 4 additional convolutional layers at the end, as proposed by Jiang et al. [7]. We use binary cross-entropy loss for each output.
Training procedure:
The model is trained with the Adam optimizer [8], using separate learning rates for the pretrained part and for the 4 extra convolutional layers. The input images are augmented using strong augmentation (horizontal flip, shift, scale, rotate, Gauss noise, random brightness and contrast, median blur, RGB shift).

For the second step, we choose IRNet [1] with the Class Boundary Map and Displacement Field branches. IRNet allows us to improve the boundaries between different object classes. It is trained on the maps generated in the first step and does not require extra supervision. This step allows us to obtain better pseudo-labels before proceeding to segmentation.
Input:
As input for IRNet [1], we choose only images from the train dataset with a classification confidence score above 0.8 and increase their number by rescaling with factors 0.5, 1, 1.5, and 2. All CAMs are also post-processed with CRF.
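The multi-scale part of this input preparation can be sketched as follows (a minimal NumPy version; `cam_at_scale` is a hypothetical callback standing in for the classifier's CAM output at a given input scale, `resize_nn` stands in for a proper interpolation routine such as cv2.resize, and the CRF post-processing is omitted):

```python
import numpy as np

def resize_nn(arr, out_h, out_w):
    """Nearest-neighbour resize for a 2-D map (interpolation stand-in)."""
    rows = np.arange(out_h) * arr.shape[0] // out_h
    cols = np.arange(out_w) * arr.shape[1] // out_w
    return arr[rows[:, None], cols]

def multiscale_cam(cam_at_scale, base_hw, scales=(0.5, 1.0, 1.5, 2.0)):
    """Average the CAMs produced at several input scales into one map."""
    h, w = base_hw
    merged = np.mean([resize_nn(cam_at_scale(s), h, w) for s in scales],
                     axis=0)
    return merged / max(merged.max(), 1e-8)   # renormalize to [0, 1]

# Toy callback: the CAM resolution simply follows the input scale
cam_at_scale = lambda s: np.ones((int(16 * s), int(16 * s)))
merged = multiscale_cam(cam_at_scale, base_hw=(16, 16))
```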
Neural network architecture and loss:
As in the original paper, we use ResNet50 [5] with concatenated activations from different layers as the architecture, and the sum of the Affinity loss and the Displacement loss as the loss function.
Training procedure:
The model is trained with a frozen backbone and the Stochastic Gradient Descent (SGD) optimizer, with learning rate 0.05 for the Displacement Field part and 0.005 for the Class Boundary Map. The same strong data augmentation as in the classification step is used.
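In practice the strong augmentation shared by the classification and IRNet steps would come from a library such as albumentations; the sketch below (pure NumPy, with hypothetical parameter ranges) illustrates just two of the listed transforms, horizontal flip and random brightness/contrast:

```python
import numpy as np

def augment(image, rng):
    """Apply a random horizontal flip and a random brightness/contrast
    jitter to an (H, W, 3) float image with values in [0, 1]."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]               # horizontal flip
    alpha = 1.0 + rng.uniform(-0.2, 0.2)        # contrast factor
    beta = rng.uniform(-0.1, 0.1)               # brightness shift
    return np.clip(alpha * image + beta, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
out = augment(img, rng)
```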
The segmentation step is done in a classic manner with the masks obtained in the previous step.
Neural network architecture and loss:
We use DeepLabv3+ [3] with a ResNet50 [5] encoder that was pretrained on ImageNet and has stride replaced with dilation to increase the receptive field. For the output, we use binary cross-entropy loss.
Training procedure:
The model is trained with the SGD optimizer with learning rate 0.001, momentum 0.9, and weight decay. The final prediction is made by averaging the predictions after horizontal flip and scaling (factors: 0.5, 1, and 2). We refer to this technique as Test Time Augmentation (TTA).
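This TTA scheme can be sketched as follows (a NumPy sketch; `predict` is a hypothetical single-channel model call and `resize_nn` stands in for a proper interpolation routine):

```python
import numpy as np

def resize_nn(arr, out_h, out_w):
    """Nearest-neighbour resize for a 2-D map (interpolation stand-in)."""
    rows = np.arange(out_h) * arr.shape[0] // out_h
    cols = np.arange(out_w) * arr.shape[1] // out_w
    return arr[rows[:, None], cols]

def predict_tta(predict, image, scales=(0.5, 1.0, 2.0)):
    """Average predictions over horizontal flips and rescaled inputs.

    predict: (H', W') image -> (H', W') per-pixel score map.
    Every score map is mapped back to the original resolution, and the
    prediction on the flipped input is flipped back before averaging.
    """
    h, w = image.shape
    outputs = []
    for s in scales:
        scaled = resize_nn(image, int(h * s), int(w * s))
        outputs.append(resize_nn(predict(scaled), h, w))
        flipped = scaled[:, ::-1]
        outputs.append(resize_nn(predict(flipped)[:, ::-1], h, w))
    return np.mean(outputs, axis=0)

# Toy model: the "score" is just the pixel intensity
img = np.arange(64, dtype=float).reshape(8, 8)
avg = predict_tta(lambda x: x, img)
```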
Figure 2: Object localization maps for (a) an image at each consecutive step of our method: (b) map after CAM extraction, (c) improved map by IRNet trained on the outcomes of step 1, (d) prediction of DeepLabV3+ trained on step 2 results, all compared to (e) the ground-truth mask.
4. Experiments and Results
For the classification step, we validate our models by calculating the F1 score on image-level labels. This allows us to select the model that performs best in classifying our extremely imbalanced data. The best segmentation model is selected based on the mean IoU achieved on the validation set during training.
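Mean IoU, the selection and evaluation metric used throughout, can be computed as follows (a minimal NumPy sketch for integer-labelled masks):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection over Union over classes present in either mask.

    pred, gt: integer class-label masks of the same shape.
    Classes absent from both masks are skipped to avoid 0/0.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example: one mislabelled pixel out of four
pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 0], [1, 0]])
miou = mean_iou(pred, gt, num_classes=2)  # (2/3 + 1/2) / 2 = 7/12
```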
We evaluate the performance of our method on the competition server on both the validation and test sets using the mean IoU metric. Our method achieves 37.34 mean IoU on the test set, which positions us in third place. Two other metrics are calculated on the competition server: mean accuracy and mean pixel accuracy. A comparison of the top-3 solutions in the LID Challenge is presented in Table 1.

Solution     mean IoU   mean accuracy   pixel accuracy
1st          —          —               —
2nd          —          —               —
3rd (ours)   37.34      —               —

Table 1: Top 3 solutions in the LID Challenge compared using three different metrics.

In Table 2, we show how the results improve on validation with each step of our approach. We also experiment by testing two encoder architectures for the DeepLabv3+ [3] model, different thresholds after the IRNet [1] model, including or excluding the ‘person’ class, and applying various postprocessing techniques. All these experiments are reported in Table 3.

Step                           mean IoU
Step 1. Classification + CRF   31.06
Step 2. IRNet                  31.87
Step 3. Segmentation + TTA     39.64

Table 2: Results on validation at each step of our approach.

Encoder     IRNet thr.   TTA   Person   mean IoU
ResNet50    0.3          No    No       36.65
ResNet50    0.3          Yes   No       —
ResNet50    0.3          Yes   Yes      —
ResNet50    0.5          No    No       37.11
ResNet50    0.5          Yes   No       39.58
ResNet101   0.5          No    No       36.14
ResNet101   0.5          Yes   No       37.15

Table 3: Experiment results on the validation set: testing different encoders for DeepLabv3+, two thresholds after the IRNet step, using TTA as postprocessing, and including CAM for the class ‘person’ from a binary classifier.

We provide qualitative results of segmentation on several validation images in Figure 2. We show the resulting maps at each step of our method; the figure demonstrates how the performance improves after each step.
5. Conclusions
We present a novel method of weakly-supervised semantic segmentation that consists of three consecutive steps: classification, CAM improvement via IRNet, and segmentation. The presented approach generates pseudo-labels from a classifier network, rectifies the class boundaries with IRNet, and uses a supervised segmentation model as a final end-to-end method. This allows us to solve a semantic segmentation task using only image-level annotations.
In the proposed approach, a downsampling technique was used to balance the dataset, which was dictated by resource limitations. However, it would be interesting to test upsampling as a class-balancing method, or a combination of both. We believe this could give an increase in performance.

Also, we did not include CAM for the class ‘person’ extracted from a binary classifier in the third step (training the segmentation model). We think this could be a worthy experiment.

There is also space to experiment with different regularization and optimization techniques at all steps.
Acknowledgements
This research was supported by the Faculty of Applied Sciences at Ukrainian Catholic University and SoftServe. The authors thank Rostyslav Hryniv for helpful insights, and Tetiana Martyniuk for computational resources.
References

[1] J. Ahn, S. Cho, and S. Kwak. Weakly supervised learning of instance segmentation with inter-pixel relations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2209–2218, 2019.
[2] L. Chan, M. S. Hosseini, and K. N. Plataniotis. A comprehensive analysis of weakly-supervised semantic segmentation in different image domains. arXiv preprint arXiv:1912.11186, 2019.
[3] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[4] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[6] Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7014–7023, 2018.
[7] P.-T. Jiang, Q. Hou, Y. Cao, M.-M. Cheng, Y. Wei, and H.-K. Xiong. Integral object mining via online attention accumulation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2070–2079, 2019.
[8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision, pages 695–711. Springer, 2016.
[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[11] Y. Wei, S. Zheng, M.-M. Cheng, H. Zhao, et al. LID 2020: The learning from imperfect data challenge results. 2020.
[12] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.