NoPeopleAllowed: The Three-Step Approach to Weakly Supervised Semantic Segmentation
Mariia Dobko [email protected], Ostap Viniavskyi [email protected], Oles Dobosevych [email protected]
The Machine Learning Lab, Ukrainian Catholic University, Lviv, Ukraine; SoftServe, Lviv, Ukraine
Figure 1: The results of the proposed approach on the unseen data.
Abstract
We propose a novel approach to weakly supervised semantic segmentation, which consists of three consecutive steps. The first two steps extract high-quality pseudo-masks from image-level annotated data, which are then used to train a segmentation model in the third step. The presented approach also addresses two problems in the data: class imbalance and missing labels. Using only image-level annotations as supervision, our method is capable of segmenting various classes and complex objects. It achieves 37.34 mean IoU on the test set, placing 3rd in the LID Challenge in the task of weakly supervised semantic segmentation.
1. Introduction
Deep learning methods have proven their efficiency in a variety of computer vision tasks, including semantic segmentation. However, their application to semantic segmentation typically requires large amounts of data with pixel-level annotations. We can overcome this issue by developing weakly supervised methods that rely only on image-level labels. Omitting pixel-wise annotations also provides considerable advantages: it is less expensive and less time-consuming. Lin et al. [10] calculated that collecting bounding boxes for each class is about 15 times faster than producing a ground-truth pixel-wise segmentation mask; collecting image-level labels is even more time-efficient. Working with image-level annotations also decreases the probability of disagreement between experts, since pixel-wise annotations tend to be less accurate and have a higher variance among labelers.
Data:
The dataset used was proposed in the LID Challenge [11]. It consists of 456,567 training images of objects from 201 classes, including background. The validation and test sets have pixel-wise annotations, which are publicly available only for the validation set. The images in the train set are provided exclusively with image-level annotations. Moreover, the data have a lot of missing labels and are also highly imbalanced towards three classes: ‘dog’, ‘bird’, and ‘person’.

The class ‘person’ has a large impact on other classes in the data; it usually appears in combination with others and often overlaps with such classes as ‘microphone’, ‘sunglasses’, ‘unicycle’, etc. It is thus crucial to have correct labels for the class ‘person’; however, the opposite is observed in the data: the image-level labels for this class are often missing, creating an additional challenge for the task. So, not only are the data biased towards a certain class, but they also suffer from imperfect labelling. These problems are present in many datasets, so a solution overcoming them will make an essential contribution to the larger field.

We propose a novel weakly-supervised approach to semantic segmentation that uses only image-level annotations and deals with data that have severe class imbalance. It scores 37.34 mean Intersection over Union (IoU) on the test set, placing third in the LID Challenge.

2. Related Work

In this paper, we follow the self-supervised paradigm of weakly supervised semantic segmentation, which suggests training a fully supervised segmentation model on the pseudo-labels generated from a classifier network. The image-level annotations are used to train a classifier; Class Activation Maps (CAM) [12] are extracted afterward. Assessment of quantitative performance on the PASCAL VOC 2012 [4] validation set shows that the top five methods of weakly-supervised segmentation use the self-supervised learning approach [2].
The nature of the PASCAL VOC 2012 [4] dataset is similar to the LID Challenge data; thus, we use the lessons learned on PASCAL VOC 2012 [4] when developing a solution for the challenge.

Many methods of self-supervised learning for weakly supervised semantic segmentation have been suggested recently. Kolesnikov et al. [9] propose the Seed, Expand and Constrain (SEC) method, which trains a Convolutional Neural Network (CNN), applies CAM to produce pseudo-ground-truth segments, and then trains a Fully Convolutional Network (FCN) optimizing three losses: one for the generated seeds, another for the image-level label, and, finally, a constraint loss against the maps processed by Conditional Random Fields (CRF). Huang et al. [6] introduce Deep Seeded Region Growing (DSRG), which propagates class activations from high-confidence regions to adjacent regions with a similar visual appearance by applying a region-growing algorithm on the generated CAM. Another approach, proposed by Ahn et al. [1], suggests using the Inter-pixel Relation Network (IRNet), which takes the random walk from low-displacement-field centroids in the CAM up until the class boundaries as the pseudo-ground-truths for training an FCN. Ahn et al. [1] focus on the segmentation of individual instances, estimating two types of features in addition to CAM: a class-agnostic instance map and pairwise semantic affinities. We incorporate IRNet [1] into one of the steps of our approach.
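The CAM extraction underlying this paradigm can be sketched as follows (a minimal NumPy version; `features` and `fc_weights` are hypothetical stand-ins for the last-layer activations and classifier weights of a trained CNN):

```python
import numpy as np

def compute_cam(features, fc_weights, class_idx):
    """Class Activation Map: a weighted sum of the final conv feature maps.

    features:   (C_feat, H, W) activations from the last convolutional layer
    fc_weights: (num_classes, C_feat) weights of the global-pooling classifier
    """
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # -> (H, W)
    cam = np.maximum(cam, 0.0)        # keep only positive class evidence
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1]
    return cam

# Toy example: 4 feature channels on an 8x8 grid, 3 classes
rng = np.random.default_rng(0)
features = rng.random((4, 8, 8))
fc_weights = rng.random((3, 4))
cam = compute_cam(features, fc_weights, class_idx=1)
```

The normalized map is then thresholded or refined (e.g. with CRF) to obtain pseudo-segments.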
3. Method
The proposed approach consists of three consecutive steps: classification followed by CAM generation, IRNet [1] for activation-map improvement, and segmentation. Each step is followed by post-processing and improves the results of the previous one. All the experiments were performed on three Nvidia GeForce RTX 2080 Ti GPUs.
In the first step, we train fully-supervised classification models with image-level labels.
Input:
We remove the ‘person’ class labels and balance the other 199 classes (without background) using a downsampling technique. The obtained data are split into train and validation parts with 72,946 and 12,873 samples, respectively.
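The downsampling used for balancing can be sketched as follows (a hypothetical helper that caps the number of kept samples per class; the single-label pairs are a simplification of the multi-label data):

```python
import random
from collections import defaultdict

def downsample_by_class(samples, cap, seed=42):
    """Keep at most `cap` samples per class.

    samples: list of (image_id, class_label) pairs.
    Returns a class-balanced subset of the input pairs.
    """
    random.seed(seed)
    by_class = defaultdict(list)
    for image_id, label in samples:
        by_class[label].append(image_id)
    balanced = []
    for label, ids in by_class.items():
        random.shuffle(ids)                       # drop a random subset
        balanced.extend((i, label) for i in ids[:cap])
    return balanced

# Toy example: a heavily over-represented class gets capped
samples = [(f"img{i}", "dog") for i in range(100)] \
        + [(f"img{i}", "cat") for i in range(100, 110)]
balanced = downsample_by_class(samples, cap=20)
n_dog = sum(1 for _, label in balanced if label == "dog")
n_cat = sum(1 for _, label in balanced if label == "cat")
```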
Neural network architecture and loss: For this step, we choose the VGG16 architecture with 4 additional convolutional layers at the end, as proposed by Jiang et al. [7]. We use binary cross-entropy loss for each output.
Training procedure:
The model is trained with the Adam optimizer [8], using separate learning rates for the pretrained part and for the 4 extra convolutional layers. The input images are augmented using strong augmentation (horizontal flip, shift, scale, rotate, Gauss noise, random brightness and contrast, median blur, RGB shift).

For the second step, we choose IRNet [1] with the Class Boundary Map and Displacement Field branches. IRNet allows us to improve the boundaries between different object classes. It is trained on the maps generated in the first step and does not require extra supervision. This step allows us to obtain better pseudo-labels before proceeding to segmentation.
Input:
As input for IRNet [1], we choose only images from the train dataset with a classification confidence score above 0.8 and increase their number by rescaling with factors 0.5, 1, 1.5, and 2. All CAMs are also post-processed with CRF.
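The multi-scale part of this input preparation can be sketched as follows (a minimal NumPy version; `cam_at_scale` is a hypothetical callback standing in for the classifier's CAM output at a given input scale, `resize_nn` stands in for a proper interpolation routine such as cv2.resize, and the CRF post-processing is omitted):

```python
import numpy as np

def resize_nn(arr, out_h, out_w):
    """Nearest-neighbour resize for a 2-D map (interpolation stand-in)."""
    rows = np.arange(out_h) * arr.shape[0] // out_h
    cols = np.arange(out_w) * arr.shape[1] // out_w
    return arr[rows[:, None], cols]

def multiscale_cam(cam_at_scale, base_hw, scales=(0.5, 1.0, 1.5, 2.0)):
    """Average the CAMs produced at several input scales into one map."""
    h, w = base_hw
    merged = np.mean([resize_nn(cam_at_scale(s), h, w) for s in scales],
                     axis=0)
    return merged / max(merged.max(), 1e-8)   # renormalize to [0, 1]

# Toy callback: the CAM resolution simply follows the input scale
cam_at_scale = lambda s: np.ones((int(16 * s), int(16 * s)))
merged = multiscale_cam(cam_at_scale, base_hw=(16, 16))
```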
Neural network architecture and loss:
As in the original paper, we use ResNet50 [5] with concatenated activations from different layers as the architecture, and the sum of the Affinity loss and the Displacement loss as the loss function.
Training procedure:
The model is trained with a frozen backbone and the Stochastic Gradient Descent (SGD) optimizer, with learning rate 0.05 for the Displacement Field part and 0.005 for the Class Boundary Map. The same strong data augmentation as in the classification step is used.
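In practice the strong augmentation shared by the classification and IRNet steps would come from a library such as albumentations; the sketch below (pure NumPy, with hypothetical parameter ranges) illustrates just two of the listed transforms, horizontal flip and random brightness/contrast:

```python
import numpy as np

def augment(image, rng):
    """Apply a random horizontal flip and a random brightness/contrast
    jitter to an (H, W, 3) float image with values in [0, 1]."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]               # horizontal flip
    alpha = 1.0 + rng.uniform(-0.2, 0.2)        # contrast factor
    beta = rng.uniform(-0.1, 0.1)               # brightness shift
    return np.clip(alpha * image + beta, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
out = augment(img, rng)
```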
The segmentation step is done in a classic manner with the masks obtained in the previous step.
Neural network architecture and loss:
We use DeepLabv3+ [3] with a ResNet50 [5] encoder that was pretrained on ImageNet and has stride replaced with dilation to increase the receptive field. For the output, we use binary cross-entropy loss.
Training procedure:
The model is trained with the SGD optimizer with learning rate 0.001, momentum 0.9, and weight decay. The final prediction is made by averaging the predictions after horizontal flip and scaling (factors: 0.5, 1, and 2). We refer to this technique as Test Time Augmentation (TTA).
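This TTA scheme can be sketched as follows (a NumPy sketch; `predict` is a hypothetical single-channel model call and `resize_nn` stands in for a proper interpolation routine):

```python
import numpy as np

def resize_nn(arr, out_h, out_w):
    """Nearest-neighbour resize for a 2-D map (interpolation stand-in)."""
    rows = np.arange(out_h) * arr.shape[0] // out_h
    cols = np.arange(out_w) * arr.shape[1] // out_w
    return arr[rows[:, None], cols]

def predict_tta(predict, image, scales=(0.5, 1.0, 2.0)):
    """Average predictions over horizontal flips and rescaled inputs.

    predict: (H', W') image -> (H', W') per-pixel score map.
    Every score map is mapped back to the original resolution, and the
    prediction on the flipped input is flipped back before averaging.
    """
    h, w = image.shape
    outputs = []
    for s in scales:
        scaled = resize_nn(image, int(h * s), int(w * s))
        outputs.append(resize_nn(predict(scaled), h, w))
        flipped = scaled[:, ::-1]
        outputs.append(resize_nn(predict(flipped)[:, ::-1], h, w))
    return np.mean(outputs, axis=0)

# Toy model: the "score" is just the pixel intensity
img = np.arange(64, dtype=float).reshape(8, 8)
avg = predict_tta(lambda x: x, img)
```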
Figure 2: Object localization maps for (a) an image at each consecutive step of our method: (b) map after CAM extraction, (c) improved map by IRNet trained on the outcomes of step 1, (d) prediction of DeepLabV3+ trained on step 2 results, all compared to (e) the ground-truth mask.
4. Experiments and Results
For the classification step, we validate our models by calculating the F1 score on image-level labels. This allows us to select the model that performs best in classifying our extremely imbalanced data. The best segmentation model is selected based on the mean IoU achieved on the validation set during training.
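Mean IoU, the selection and evaluation metric used throughout, can be computed as follows (a minimal NumPy sketch for integer-labelled masks):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection over Union over classes present in either mask.

    pred, gt: integer class-label masks of the same shape.
    Classes absent from both masks are skipped to avoid 0/0.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example: one mislabelled pixel out of four
pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 0], [1, 0]])
miou = mean_iou(pred, gt, num_classes=2)  # (2/3 + 1/2) / 2 = 7/12
```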
We evaluate the performance of our method on the competition server on both the validation and test sets using the mean IoU metric. Our method achieves 37.34 mean IoU on the test set, which positions us in third place. Two other metrics are calculated on the competition server: mean accuracy and mean pixel accuracy. A comparison of the top-3 solutions in the LID Challenge is presented in Table 1.

Solution     mean IoU   mean accuracy   pixel accuracy
1st          —          —               —
2nd          —          —               —
3rd (ours)   37.34      —               —

Table 1: Top 3 solutions in the LID Challenge compared using three different metrics.

In Table 2, we show how the results improve on validation with each step of our approach. We also experiment by testing two encoder architectures for the DeepLabv3+ [3] model, different thresholds after the IRNet [1] model, including or excluding the ‘person’ class, and applying various postprocessing techniques. All these experiments are reported in Table 3.

Step                           mean IoU
Step 1. Classification + CRF   31.06
Step 2. IRNet                  31.87
Step 3. Segmentation + TTA     39.64

Table 2: Results on validation at each step of our approach.

Encoder     IRNet thr.   TTA   Person   mean IoU
ResNet50    0.3          No    No       36.65
ResNet50    0.3          Yes   No       —
ResNet50    0.3          Yes   Yes      —
ResNet50    0.5          No    No       37.11
ResNet50    0.5          Yes   No       39.58
ResNet101   0.5          No    No       36.14
ResNet101   0.5          Yes   No       37.15

Table 3: Experiment results on the validation set: testing different encoders for DeepLabv3+, two thresholds after the IRNet step, using TTA as postprocessing, and including CAM for the class ‘person’ from a binary classifier.

We provide qualitative results of segmentation on several validation images in Figure 2. We show the resulting maps at each step of our method; the figure demonstrates how the performance improves after each step.
5. Conclusions
We present a novel method of weakly-supervised semantic segmentation that consists of three consecutive steps: classification, CAM improvement via IRNet, and segmentation. The presented approach generates pseudo-labels from a classifier network, rectifies the class boundaries with IRNet, and uses a supervised segmentation model as a final end-to-end method. This allows us to solve a semantic segmentation task using only image-level annotations.
In the proposed approach, a downsampling technique was used to balance the dataset, which was dictated by resource limitations. However, it would be interesting to test upsampling as a class-balancing method, or a combination of both. We believe this could give an increase in performance.

Also, we did not include CAM for the class ‘person’ extracted from a binary classifier in the third step (training the segmentation model). We think this could be a worthy experiment.

There is also space to experiment with different regularization and optimization techniques at all steps.
Acknowledgements
This research was supported by the Faculty of Applied Sciences at Ukrainian Catholic University and SoftServe. The authors thank Rostyslav Hryniv for helpful insights, and Tetiana Martyniuk for computational resources.
References

[1] J. Ahn, S. Cho, and S. Kwak. Weakly supervised learning of instance segmentation with inter-pixel relations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2209–2218, 2019.
[2] L. Chan, M. S. Hosseini, and K. N. Plataniotis. A comprehensive analysis of weakly-supervised semantic segmentation in different image domains. arXiv preprint arXiv:1912.11186, 2019.
[3] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[4] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[6] Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7014–7023, 2018.
[7] P.-T. Jiang, Q. Hou, Y. Cao, M.-M. Cheng, Y. Wei, and H.-K. Xiong. Integral object mining via online attention accumulation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2070–2079, 2019.
[8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision, pages 695–711. Springer, 2016.
[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[11] Y. Wei, S. Zheng, M.-M. Cheng, H. Zhao, et al. LID 2020: The learning from imperfect data challenge results. 2020.
[12] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.