Classification Calibration for Long-tail Instance Segmentation
Tao Wang, Yu Li, Bingyi Kang, Junnan Li, Jun Hao Liew, Sheng Tang, Steven Hoi, Jiashi Feng
Joint COCO and Mapillary Workshop at ICCV 2019: LVIS Challenge Track

Department of Electrical and Computer Engineering, National University of Singapore, Singapore
Institute of Computing Technology, Chinese Academy of Sciences, China
Salesforce Research Asia, Singapore
Abstract
This report presents our winning solution to the LVIS 2019 challenge. Remarkable progress has been made in object instance detection and segmentation in recent years. However, existing state-of-the-art methods are mostly evaluated with fairly balanced and class-limited benchmarks, such as the Microsoft COCO dataset [8]. In this report, we investigate the performance drop of state-of-the-art two-stage instance segmentation models when trained on the extremely long-tailed LVIS [5] dataset, and find a major cause is the inaccurate classification of object proposals. Based on this observation, we propose to calibrate the prediction of the classification head to improve recognition performance on the tail classes. Without much additional cost or modification of the detection model architecture, our calibration method improves the performance of the baseline by a large margin on the tail classes. Importantly, after the submission, we found that significant further improvement can be achieved by modifying the calibration head [9]. Code and models are available at https://github.com/twangnh/SimCal.
1. Experimental Details
Dataset statistics
Different from [5], we divide all the 1,230 categories of the LVIS v0.5 dataset into 4 sets, which respectively contain <10, 10-100, 100-1,000, and >1,000 training instances, in order to see the effect of training instance number and analyze long-tail object instance detection models. We claim that the improvement on the tail bin, i.e. subset (0, 10), of the validation set does not contribute much to the overall AP as it contains only 67 classes, though the category distribution of the test set is unknown.

*Tao Wang and Yu Li contributed equally to this work.

Sets          (0, 10)  [10, 100)  [100, 1000)  [1000, -]  total
Train           294       453         302         181     1230
Train-on-val     67       298         284         181      830

Table 1: Category division based on training instance number. Train-on-val means the subset of categories that appear in the validation set.
Training and Evaluation
Our implementation is based on the mmdetection toolkit [4]. Unless otherwise stated, the models are trained on the LVIS-v0.5 training set and evaluated on the LVIS-v0.5 validation set for mask prediction tasks. The external data used in the experiments are introduced in Sec. 4. All the models are trained with SGD, 0.9 momentum, and 8 images per minibatch. Unless otherwise stated, the training schedule is 12 epochs, with the learning rate starting at 0.01 and decayed to 0.001 and 0.0001 after the 8th and 11th epoch respectively.
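A minimal sketch of the step schedule above (illustrative only, not the actual mmdetection config; we assume 1-indexed epochs and a 10x decay after the 8th and 11th epochs):

```python
def step_lr(epoch: int, base_lr: float = 0.01) -> float:
    """Step learning-rate schedule over 12 epochs:
    0.01 for epochs 1-8, 0.001 for epochs 9-11, 0.0001 for epoch 12."""
    if epoch <= 8:
        return base_lr
    if epoch <= 11:
        return base_lr * 0.1
    return base_lr * 0.01
```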
2. Classification Calibration
We first investigate the performance degradation of the baseline Mask-RCNN [6] on tail classes. Then, based on our observations of the possible causes of this phenomenon, we propose a classification calibration method for improving the model performance over tail classes.

Model          AP (0, 10)  AP [10, 100)  AP [100, 1000)  AP [1000, -]  AP
mrcnn-r50-thr     0.0          5.4           16.6            25.1      13.1
mrcnn-r50         0.0         13.3           21.4            27.0      18.0

Table 2: Performance of baseline Mask-RCNN with class agnostic box and mask heads on the validation set. mrcnn-r50-thr means testing with a 0.05 detection threshold and mrcnn-r50 denotes testing with a 0.0 threshold.
Dataset  AP    AR_1k
LVIS     18.0  51.0
COCO     32.8  55.9

Table 3: Comparison of baseline models trained on COCO and LVIS. Models are all evaluated with the 5k validation set. AR_1k denotes average recall at 1000 proposals. COCO results are measured on the minival set.

For simplicity of analysis, we train a baseline Mask R-CNN with ResNet50-FPN backbone and class agnostic box and mask heads. As shown in Table 2, the model performs poorly, especially on the tail sets (0, 10) and [10, 100). Even when we lower the detection threshold to 0, which improves mAP by 5%, the mAP on the subset (0, 10) is still 0. This result reveals that the Mask-RCNN model trained with the normal setting is heavily biased toward the many-shot classes (i.e. those with more training instances).

We then calculate the proposal recall of the model and compare it with that of the same model trained on the COCO dataset. As shown in Table 3, the baseline model trained on LVIS suffers a relative drop of 8.8% in proposal recall compared with that on COCO, but notably a 45.1% relative drop in overall AP. This indicates the degradation of proposal classification accuracy is the major cause of the final performance drop on long-tail training data.

To verify this observation, we assign the RPN-generated proposals their ground truth class labels and evaluate the AP, instead of using the predicted labels. As shown in Table 4, AP on tail classes is increased by a large margin, especially on the (0, 10) and [10, 100) bins. This confirms that the low performance over tail classes is mainly caused by the inability of the model to recognize the correct categories of the generated proposal candidates.
Recently, Kang et al. [7] reveal that for long-tail classification, representation and classifier learning should be decoupled. Inspired by this work, to improve the performance of the second-stage classification over tail classes, we develop a strategy to retrain the classification head with data obtained by class-balanced sampling and to combine the predictions of the new classification head with the original one. Our approach, though simple, can effectively improve the recognition accuracy on tail classes while maintaining good performance on many-shot classes. We name it classification calibration.

Model      AP (0, 10)  AP [10, 100)  AP [100, 1000)  AP [1000, -]  AP
mrcnn-r50     0.0         13.3           21.4            27.0      18.0
props-gt     39.7         45.1           31.4            29.3      36.6

Table 4: Test with ground truth labels of proposals.

Concretely, we sample a fixed number of classes for each step, and sample one image for each of the sampled classes. In our current implementation, 16 classes and 1 image per class are sampled. The sampled images are fed to the trained model, and the obtained proposals are matched with ground truth boxes using the same IoU threshold as in the original detection model training. Only the proposals corresponding to the sampled classes are selected, together with the ground truth boxes of these classes, for training the new head; the other proposals are ignored. During training, we keep the parameters of the backbone network and RPN frozen.

As shown in Table 6, with the newly trained head as the proposal classifier, AP on the tail-class bins (0, 10) and [10, 100) is boosted by a large margin. However, due to insufficient training on many-shot classes, AP on [100, 1000) and [1000, -] drops significantly. To enjoy the advantages of both the new and original heads, we have tried many different ways of combining their predictions; refer to Table 6 for details. We find that simply concatenating the predictions of the new head on tail classes ((0, 10) and [10, 100)) with those of the original head on many-shot classes ([100, 1000) and [1000, -]) yields the best results overall.
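The class-balanced sampling step can be sketched as below; `images_of_class` (a mapping from class id to the images containing that class) is a hypothetical index, not a structure from the actual code:

```python
import random

def sample_balanced_batch(images_of_class, classes_per_step=16):
    """One sampling step for retraining the classification head:
    draw `classes_per_step` distinct classes uniformly at random,
    then one image containing each sampled class, so tail classes
    are visited as often as many-shot ones."""
    classes = random.sample(sorted(images_of_class), classes_per_step)
    return [(c, random.choice(images_of_class[c])) for c in classes]
```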
To further improve the overall performance, we apply the proposed calibration method to multi-stage cascaded models with more complex architectures. The state-of-the-art cascaded model Hybrid Task Cascade [3] (HTC) is utilized here. We find that HTC brings a large improvement over vanilla Cascaded Mask-RCNN [2] on the LVIS dataset; see Table 5 for details.

Models          COCO box  COCO mask  LVIS box  LVIS mask
cascaded-mrcnn    45.4      39.1       28.6      25.9
htc               46.9      40.8       31.3      29.3

Table 5: Comparison of cascaded Mask-RCNN and Hybrid Task Cascade (HTC) on the COCO and LVIS validation sets. The two models use the same ResNeXt-101-64x4d backbone. They are trained for 20 epochs with learning rate decay at the 16th and 19th epoch.

All three classification heads in the three stages of the HTC framework are retrained with our proposed sampling strategy, and we average the predictions of the three new heads during inference, following the original setting. Then, the predictions on tail classes are concatenated with the original classification results. Table 7 shows the results of our calibration method applied to HTC with the ResNeXt-101-64x4d backbone. The scores of categories in the (0, 10) bin predicted by the new head are concatenated with the scores of the other categories predicted by the original head.

We think it is more reasonable to take the number of classes in each bin into consideration in long-tail detection evaluation, rather than just averaging the AP of all classes. This is because the number of classes per bin may vary largely, and the bins with fewer classes tend to be down-weighted in overall mAP. In this sense, the importance of solving the tail problem is not directly demonstrated by the current evaluation metric mAP. For example, the validation set of LVIS v0.5 contains only 67 classes with fewer than 10 training instances, while the numbers are much larger for the [10, 100) and [100, 1000) bins, which contain 298 and 284 classes respectively. The improvement on the (0, 10) bin is thus down-weighted in mAP.
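To make the down-weighting concrete: overall mAP is the mean of per-class APs, so a gain confined to one bin is scaled by that bin's share of the classes. With the 67 tail classes out of 830 validation classes (Table 1):

```python
def overall_map_gain(bin_gain, bin_classes=67, total_classes=830):
    """mAP averages per-class AP, so an AP gain confined to one bin
    contributes to overall mAP in proportion to the bin's size."""
    return bin_gain * bin_classes / total_classes

# a +10 AP gain on the (0, 10) bin shifts overall mAP by only about +0.81
```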
As shown in Table 8, we compare our classification calibration approach with image-level repeat sampling of the whole network, which is reported as the best baseline in [5]. Although our calibration method has lower overall mAP on the validation set than image-level repeat sampling, it has higher performance on the most-tail bin (0, 10). When generalized to the more complex multi-stage model HTC, our method performs better: the calibrated model suffers a smaller drop on many-shot classes and enjoys a larger improvement on tail classes than the image-level repeat sampling method.
3. Final Models and Test Set Submission
As shown in Table 9, our final submitted results on the test set are from the ensemble of 4 models with different backbones. However, due to the time limit, only our best single model (31.9 AP on val) among the final models is calibrated. We believe the final ensemble results would be stronger on tail classes if all models were calibrated.
Model            AP (0, 10)  AP [10, 100)  AP [100, 1000)  AP [1000, -]  AP
mrcnn-r50           0.0         13.3           21.4            27.0      18.0
rhead-only          8.5         20.8           17.6            19.3      18.4
rhead-avg           8.5         20.9           19.6            24.6      20.3
rhead-det           8.6         22.0           16.7            25.2      19.8
rhead-cat           8.6         22.0           19.6            26.6      21.1
rhead-cat-thr       8.5         20.8           20.1            26.7      20.9
rhead-cat-scale     8.5         21.3           19.9            26.7      21.0

Table 6: Different ways of calibrating the predictions of the original classification head with the newly trained head: rhead-only (using only the newly trained head's predictions), rhead-avg (averaging predictions of the original and new heads), rhead-det (using the two heads separately for detection outputs and combining them afterward, i.e., two expert models), rhead-cat (simply concatenating tail-class predictions of the new head with many-shot-class predictions of the original head, with (0, 10) and [10, 100) from the new head and [100, 1000), [1000, -] from the original head), rhead-cat-thr (filtering new-head predictions with a 0.05 threshold before concatenating), and rhead-cat-scale (scaling new-head predictions by the ratio of the average background score between new- and original-head predictions).
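In score space, rhead-cat amounts to a per-class selection between the two heads. A minimal sketch, assuming each head outputs one score per category and `tail_ids` holds the category ids of the (0, 10) and [10, 100) bins (names are illustrative):

```python
def rhead_cat(orig_scores, new_scores, tail_ids):
    """Combine calibrated and original predictions: tail-class scores
    come from the newly trained (class-balanced) head, many-shot-class
    scores from the original head."""
    combined = list(orig_scores)
    for c in tail_ids:
        combined[c] = new_scores[c]
    return combined
```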
Model        AP (0, 10)  AP [10, 100)  AP [100, 1000)  AP [1000, -]  AP
htc-x101        7.1         30.5           30.7            33.9      29.4
calibration      –            –              –               –        –

Table 7: Results of applying our calibration to the state-of-the-art multi-stage cascaded instance segmentation model Hybrid Task Cascade (HTC).
Model        AP (0, 10)  AP [10, 100)  AP [100, 1000)  AP [1000, -]  AP
mrcnn-r50       0.0         13.3           21.4            27.0      18.0
img-sample      7.7         23.2           21.4            26.2      22.0
calibration     8.6         22.0           19.6            26.6      21.1
htc-x101        5.6         33.0           33.7            37.0      31.9
img-sample     10.3         32.4           33.4            36.6      31.9
calibration    12.7         32.1           33.6            37.0      32.1

Table 8: Comparison with image-level repeat sampling on the baseline Mask R-CNN and HTC models.
4. External Data
The Microsoft COCO dataset [8] (2017 version) and the COCO-stuff dataset [1] are used as external data for our submitted results. COCO, COCO-stuff, and LVIS all share the same training images but carry different annotations (LVIS only uses part of the training images in COCO train2017). We only use the training sets of COCO and COCO-stuff, which contain 118K images. COCO covers 80 thing classes, the same as COCO-stuff, but the latter also contains 91 stuff classes. For COCO, instance-level boxes and polygons are used to pre-train the HTC models; we initialize our model with one pre-trained on COCO. For COCO-stuff, pixel-level semantic segmentation labels are used for training the semantic head of HTC. Table 11 shows the results of using COCO pre-training and the semantic head.

Model                      val-set AP
htc x101 64d ms dcn          31.9
htc x101 32d ms dcn          31.4
htc x101 64d ms dcn cos      30.7
htc r101 ms dcn              30.0
ensemble-with-calibration    34.2
add-multiscale-testing       35.2

Table 9: Final model performance and ensemble results on the validation set. ms denotes multi-scale training, dcn means deformable convolution, and cos means cosine learning rate schedule.

Model                      AP     AP_r   AP_c   AP_f
best-baseline              20.5    9.8   21.1   30.0
w/o                        22.89   5.90  25.65  35.26
with-calibration           23.94  10.31  25.26  35.16
ensemble-with-calibration  26.11  11.94  27.98  37.05
add-multiscale-testing     26.67  10.59  28.70  39.21

Table 10: Comparison of the baseline model without and with our proposed calibration method on the LVIS test set. best-baseline denotes the best baseline performance reported in [5]; w/o denotes our best single model (31.9 AP on the validation set) without calibration; with-calibration adds calibration to that model; ensemble-with-calibration uses the ensemble of all models with calibration; add-multiscale-testing further adds multi-scale testing.

Model        P  S  AP (0, 10)  AP [10, 100)  AP [100, 1000)  AP [1000, -]  AP
HTC-x50-fpn           1.4         23.9           25.3            30.4      24.0
HTC-x50-fpn  X  X      –            –              –               –        –

Table 11: Effect of external data. P stands for using COCO boxes and polygons for pre-training, and S for using COCO-stuff pixel-level semantic segmentation labels for the semantic head of HTC.
5. Conclusion
We propose a classification calibration method for improving the performance of current state-of-the-art proposal-based object instance detection and segmentation models on long-tail distribution data. It effectively improves performance on the long-tail distribution by enhancing proposal classification accuracy. Currently, our retraining strategy for the proposal classification head is not optimized, which we will investigate in future work. For example, we may combine our method with image-level sampling, or explore new head designs and new head-training sampling methods, to further boost model performance on long-tail data distributions.
References

[1] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In CVPR, 2018.
[2] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, pages 6154–6162, 2018.
[3] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, pages 4974–4983, 2019.
[4] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[5] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
[6] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
[7] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217, 2019.
[8] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[9] Tao Wang, Yu Li, Bingyi Kang, Junnan Li, Junhao Liew, Sheng Tang, Steven Hoi, and Jiashi Feng. The devil is in classification: A simple framework for long-tail instance segmentation. arXiv preprint arXiv:2007.11978, 2020.