IoU-balanced Loss Functions for Single-stage Object Detection
Shengkai Wu, Xiaoping Li
School of Mechanical Science and Engineering, Huazhong University of Science and Technology
{ShengkaiWu, lixiaoping}@hust.edu.cn
Abstract
Single-stage detectors are efficient. However, we find that the loss functions adopted by single-stage detectors are sub-optimal for accurate localization. The standard cross entropy loss for classification is independent of the localization task and drives all positive examples to learn as high a classification score as possible regardless of localization accuracy during training. As a result, there will be detections that have a high classification score but low IoU, or a low classification score but high IoU. The detections with a low classification score but high IoU will be suppressed by those with a high classification score but low IoU during NMS, hurting localization accuracy. For the standard smooth L1 loss, the gradient is dominated by outliers that have poor localization accuracy, which is harmful for accurate localization. In this work, we propose IoU-balanced loss functions, consisting of an IoU-balanced classification loss and an IoU-balanced localization loss, to solve the above problems. The IoU-balanced classification loss focuses more attention on positive examples with high IoU and can enhance the correlation between the classification and localization tasks. The IoU-balanced localization loss decreases the gradient of examples with low IoU and increases the gradient of examples with high IoU, which can improve the localization accuracy of models. Extensive studies on MS COCO demonstrate that both the IoU-balanced classification loss and the IoU-balanced localization loss bring substantial improvement for single-stage detectors. Without bells and whistles, the proposed methods improve AP by 1.1% for single-stage detectors, and the improvement for AP at higher IoU thresholds is especially large, reaching 2.3%. The source code will be made available.
Introduction
Along with the advances in deep convolutional networks, many object detection models have been developed. These models can be classified into single-stage detectors such as YOLO [1], SSD [2] and RetinaNet [3], and multi-stage detectors such as R-CNN [4], Fast R-CNN [5], Faster R-CNN [6] and Cascade R-CNN [7]. In multi-stage detectors, proposals are first generated and then RoIPool or RoIAlign [8] is used to extract features for these proposals. The extracted features are then used for further proposal regression and classification. Because of the multi-stage box regression and classification, multi-stage detectors have achieved state-of-the-art performance. Single-stage detectors rely directly on regular, densely sampled anchors at different scales and aspect ratios for classification and box regression. This makes single-stage detectors highly efficient. However, the accuracy of single-stage detectors usually lags behind that of multi-stage detectors. One of the main reasons is the extreme class imbalance problem [3]. RetinaNet [3] proposes focal loss to solve this problem. In addition, the localization accuracy of single-stage detectors is inferior because of the low localization accuracy of predefined anchors and the single regression stage. For this reason, RefineDet [9] proposes two-step regression to improve the localization accuracy of single-stage detectors.
In this work, we demonstrate that the classification and localization loss functions adopted by single-stage detectors are suboptimal for accurate localization, and that the localization ability can be substantially improved by designing better loss functions that make no changes to the model architecture. There are two problems with the loss functions adopted by most single-stage detectors.
Firstly, the correlation between the classification task and the localization task is weak. Most single-stage detectors adopt the standard cross entropy loss for classification, which is independent of the localization task; this kind of classification loss drives the model to learn as high a classification score as possible for all positive examples regardless of localization accuracy during training. As a result, the predicted classification scores are independent of the localization accuracy, and there will be detections that have a high classification score but low IoU, or a low classification score but high IoU. When traditional non-maximum suppression (NMS) is applied, the detections with a high classification score but low IoU will suppress the detections with a low classification score but high IoU, which is unreasonable and hurts object localization accuracy. In addition, the detections with a high classification score but low IoU rank before the detections with a low classification score but high IoU when computing the COCO metric, which reduces the precision of models. We therefore claim that enhancing the correlation between the classification and localization tasks is important for accurate localization.
Secondly, the gradients of the localization loss for single-stage detectors are dominated by outliers, which are the examples with poor localization accuracy. These examples prevent the models from obtaining high localization accuracy during training. Fast R-CNN [5] proposes smooth L1 loss to suppress the gradients of outliers to a bounded value, which effectively prevents exploding gradients during training. However, we claim that it is still important to further suppress the gradients of outliers while increasing the gradients of inliers.
Inspired by these ideas, the IoU-balanced classification loss and the IoU-balanced localization loss are designed in our work. The IoU-balanced classification loss focuses more attention on positive examples with high IoU. The higher the IoU of a positive example, the more it contributes to the classification loss. Thus, positive examples with higher IoU generate larger gradients during training and are more likely to learn higher classification scores, which enhances the correlation between the classification and localization tasks. The IoU-balanced localization loss up-weights the gradients of examples with high IoU while suppressing the gradients of examples with low IoU.
Extensive experiments on the challenging COCO benchmark demonstrate that the IoU-balanced classification loss and the IoU-balanced localization loss can substantially improve the localization accuracy of single-stage detectors. Our main contributions are as follows: (1) We demonstrate that the standard cross entropy loss for classification and the smooth L1 loss for localization are suboptimal for accurate object localization, and that the localization ability can be improved by designing better loss functions for single-stage detectors. (2) We propose the IoU-balanced classification loss to enhance the correlation between the classification and localization tasks, which can substantially improve the performance of single-stage detectors. (3) We introduce the IoU-balanced localization loss to up-weight the gradients of inliers while suppressing the gradients of outliers, which makes models more powerful for accurate object localization.
Related Work
Accurate object localization.
Accurate object localization is a challenging and important topic, and many methods to improve localization accuracy have been proposed in recent years. Multi-region detector [10] argues that a single regression step is insufficient for accurate localization and thus proposes iterative bounding box regression to refine the coordinates of detections, followed by NMS and box voting. Cascade R-CNN [7] trains multi-stage R-CNNs with IoU thresholds that increase stage by stage, so that each successive stage is more powerful for accurate localization. As a result, the last-stage R-CNN produces the most accurately localized detections. RefineDet [9] improves the one-stage detector with a two-step cascade regression: the ARM first refines the human-designed anchors, and the ODM then accepts these refined anchors as inputs for the second regression stage, which is beneficial for improving localization. All these methods add new modules to the detection models and thus hurt efficiency. In contrast, IoU-balanced loss functions improve localization accuracy without changing model architectures and do not affect the efficiency of models.
Hard example mining.
To improve the ability of models to handle hard examples, many hard example mining strategies have been developed for object detection. RPN [6] defines the anchors whose IoU with ground truth boxes is not larger than 0.3 as hard negative examples. Fast R-CNN [5] defines the proposals that have a maximum IoU with ground truth boxes in the interval [0.1, 0.5) as hard negative examples. OHEM [11] computes losses for all the examples, then ranks the examples based on their losses, followed by NMS. Finally, the top-B/N examples are selected as hard examples to train the model. SSD [2] defines anchors whose IoU is lower than 0.5 as negative examples and ranks negative examples based on their losses. The top-ranked negative examples are selected as hard negative examples. RetinaNet [3] designs focal loss to solve the extreme imbalance between easy examples and hard examples; it reduces the losses of easy examples whose predicted classification score is high and focuses more attention on hard examples whose predicted classification score is low. Libra R-CNN [12] constructs an IoU-based histogram for negative examples and selects examples uniformly from each bin of the histogram as hard negative examples. Different from these strategies, IoU-balanced loss functions do not change the sampling process and only assign different weights to the positive examples based on their IoU.
Correlation between classification and localization task.
Most detection models adopt parallel classification and localization subnetworks for the classification and localization tasks, and they rely on independent classification and localization losses to train the models. This architecture makes the classification and localization tasks independent of each other, which is suboptimal. Fitness NMS [13] divides localization accuracy into 5 levels based on the IoU of regressed boxes and designs subnetworks to predict, for every detection, the probability of each localization level, either independently of or dependent on the classes. A fitness value is then computed from these probabilities and combined with the classification score to compute the final detection score, which enhances the correlation between the classification and localization tasks. The enhanced detection score is used as the input for NMS, denoted as Fitness NMS. Similarly, IoU-Net [14] adds an IoU prediction branch in parallel with the classification and localization branches to predict the IoU of every detection, and the predicted IoU is highly correlated with the localization accuracy. Different from Fitness NMS, the predicted IoU is directly used as the input for NMS, denoted as IoU-guided NMS. MS R-CNN [15] designs a MaskIoU head to predict the IoU of the predicted masks, aiming to solve the problem of the weak correlation between classification score and mask quality. During inference, the predicted mask IoU is multiplied by the classification score to give the final mask confidence, which is highly correlated with the mask quality. Unlike IoU-Net, the enhanced mask confidence is only used to rank the predicted masks when computing COCO AP. PISA [16] proposes IoU-HLR to rank the importance of positive examples based on IoU and computes the weight of every positive example in the classification loss from this importance, such that the more important examples contribute more to the classification loss. In this way, the correlation between classification and localization is enhanced. Compared with PISA, the IoU-balanced classification loss does not need the IoU hierarchical local ranking process and uses the IoU of positive examples directly to compute the weights assigned to them, which is simpler, more efficient and more elegant.
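The weak correlation discussed in this paragraph can be made concrete with a small greedy NMS sketch. The boxes, scores and the 0.5 threshold below are hypothetical values chosen for illustration, and the rescoring step multiplies each score by the IoU with the ground truth, which is an oracle stand-in for any confidence that tracks localization quality (for example a predicted IoU); it is not a method taken from the works cited above.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep


# Hypothetical example: one ground truth and two overlapping detections.
gt = (0, 0, 100, 100)
boxes = [(20, 0, 120, 100),  # poorly localized detection
         (2, 0, 102, 100)]   # well localized detection
scores = [0.9, 0.7]          # the poorly localized box has the higher score

kept_plain = nms(boxes, scores)

# Oracle rescoring (illustration only): score * IoU with the ground truth.
rescored = [s * iou(b, gt) for b, s in zip(boxes, scores)]
kept_rescored = nms(boxes, rescored)
```

With the plain scores, the poorly localized box survives and suppresses the better one; once the score reflects localization quality, the well localized box is kept instead.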
Outliers during training localization subnetwork.
Compared with R-CNN [4] and SPPnet [17], Fast R-CNN [5] adopts smooth L1 loss to constrain the gradients of outliers to a constant, which prevents gradient explosion. GHM [18] analyzes the example imbalance in one-stage detectors in terms of the gradient norm distribution. The analysis demonstrates that, for the localization subnetwork of a converged model, there are still a large number of outliers and the gradients can be dominated by these outliers, which hurts the training process for accurate object localization. GHM-R is proposed to up-weight easy examples and down-weight outliers based on the gradient density of every example. However, the gradient density computation is time-consuming and can substantially slow down training. Libra R-CNN [12] claims that the overall gradient of smooth L1 loss is dominated by the outliers when the classification and localization tasks are balanced directly. As a result, balanced L1 loss is proposed to increase the gradient of easy examples and keep the gradient of outliers unchanged. Different from these methods, the IoU-balanced localization loss computes the weight of every example based on its IoU, up-weighting examples with high IoU while down-weighting examples with low IoU.
Method
In this paper, we propose the IoU-balanced classification loss to enhance the correlation between the classification and localization tasks, and the IoU-balanced localization loss to up-weight the gradients of inliers while suppressing the gradients of outliers. Both losses make single-stage detectors more powerful for accurate localization. The two losses are introduced in detail in the following subsections.
IoU-balanced Classification Loss
The classification losses adopted by most object detection models are independent of the localization task, and such loss functions drive the model to learn as high a classification score as possible for the positive examples regardless of localization accuracy. As a result, the predicted classification scores of detections are independent of the localization accuracy. This hurts the performance of models in the subsequent procedures during inference. Firstly, when NMS or one of its variants such as Soft-NMS [19] is applied, there will be cases where detections with high classification scores but low IoU suppress those with low classification scores but high IoU. Secondly, when computing COCO AP, all detections are ranked by classification score, and there will be cases where detections with high classification scores but low IoU are ranked ahead of detections with low classification scores but high IoU. Both problems hurt the localization accuracy of models. Thus, enhancing the correlation between classification score and localization accuracy is beneficial for accurate object detection. We design the IoU-balanced classification loss to enhance the correlation between the classification and localization tasks, as Equ. (1) shows. The weights assigned to positive examples are positively correlated with the IoU between the regressed bounding boxes and their corresponding ground truth boxes. As a result, examples with high IoU are up-weighted and those with low IoU are down-weighted adaptively based on their IoU after bounding box regression. During training, examples with higher IoU contribute larger gradients, so the model learns high classification scores for these examples more easily. The gradients contributed by examples with low IoU are suppressed, and thus their classification scores are suppressed.
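As a rough illustration of this weighting, the following pure-Python sketch computes an IoU-weighted cross entropy for the positives and a plain cross entropy for the negatives, in the form of Equ. (1) and (2). The function names, the exponent name `eta`, the default value 1.5 and the toy inputs are assumptions made for the sketch; a real detector would compute this on tensors inside the training framework.

```python
import math


def binary_cross_entropy(p, target):
    """Cross entropy of one predicted probability p against a 0/1 target."""
    eps = 1e-12  # numerical guard against log(0)
    return -(target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps))


def iou_balanced_cls_loss(pos_probs, pos_ious, neg_probs, eta=1.5):
    """IoU-weighted CE over positives plus unweighted CE over negatives.

    pos_ious holds the IoU of each positive's regressed box with its ground
    truth. Assumes at least one positive with non-zero IoU and non-zero loss.
    """
    pos_ce = [binary_cross_entropy(p, 1.0) for p in pos_probs]
    neg_ce = [binary_cross_entropy(p, 0.0) for p in neg_probs]
    raw = [v ** eta for v in pos_ious]
    # Normalisation: rescale so the summed positive loss stays unchanged.
    norm = sum(pos_ce) / sum(w * ce for w, ce in zip(raw, pos_ce))
    weights = [w * norm for w in raw]
    pos_loss = sum(w * ce for w, ce in zip(weights, pos_ce))
    return pos_loss + sum(neg_ce), weights


# Toy inputs (hypothetical): two positives with equal scores but different IoU.
loss, weights = iou_balanced_cls_loss([0.6, 0.6], [0.9, 0.5], [0.1])
```

In this toy case the positive with IoU 0.9 receives a weight above 1 and the one with IoU 0.5 a weight below 1, while the total positive loss is the same as with uniform weights.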
In this way, the correlation between classification scores and localization accuracy is enhanced, as demonstrated by Figure 2. In Equ. (1), the parameter $\eta$ controls the extent to which the IoU-balanced classification loss focuses on examples with high IoU and suppresses examples with low IoU. A normalization strategy is adopted to keep the sum of the classification loss for positive examples unchanged.

$$L_{cls} = \sum_{i \in Pos}^{N} w_i(iou_i) * \mathrm{CE}(p_i, \hat{p}_i) + \sum_{i \in Neg}^{M} \mathrm{CE}(p_i, \hat{p}_i) \tag{1}$$

$$w_i(iou_i) = iou_i^{\eta} * \frac{\sum_{i}^{n} \mathrm{CE}(p_i, \hat{p}_i)}{\sum_{i}^{n} iou_i^{\eta}\, \mathrm{CE}(p_i, \hat{p}_i)} \tag{2}$$

IoU-balanced Localization Loss
GHM [18] demonstrates that, even for converged models, the gradients of the localization loss are dominated by outliers whose gradient norm is large, which harms the localization accuracy of models during training. GHM designs a new localization loss called GHM-R loss, which reweights examples based on the computed gradient density such that the gradients of outliers are suppressed and the gradients of easy examples are up-weighted. However, the computation of gradient density is time-consuming, and the training time per iteration grows to nearly 1.5 times. We propose the IoU-balanced localization loss to up-weight examples with high IoU and down-weight examples with low IoU, which adds little computation and is efficient and elegant, as Equ. (3) and (4) show.
$$L_{loc} = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} w_i(iou_i) * \mathrm{smooth}_{L1}(l_i^m - \hat{g}_i^m) \tag{3}$$

$$w_i(iou_i) = w_{loc} * iou_i^{\lambda} \tag{4}$$

$$w_i(iou_i) = iou_i^{\lambda} * \frac{\sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} \mathrm{smooth}_{L1}(l_i^m - \hat{g}_i^m)}{\sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} iou_i^{\lambda} * \mathrm{smooth}_{L1}(l_i^m - \hat{g}_i^m)} \tag{5}$$

The parameter $\lambda$ controls the extent to which the IoU-balanced localization loss focuses on inliers and suppresses outliers. The localization loss weight $w_{loc}$ is manually adjusted to keep the sum of the localization loss unchanged for the first step of the training procedure. The normalization strategy of [16] can also be used to keep the localization loss sum unchanged during the whole training procedure, as Equ. (5) shows. However, experiments show that this normalization strategy is inferior to manually adjusting $w_{loc}$. This may be because the normalization factor decreases as the IoUs of positive examples become larger during training. Thus, the strategy of manually adjusting $w_{loc}$ is adopted in the subsequent experiments. Gradients are not propagated from $w(iou)$ to $l_i^m$. Denoting $d = l^m - \hat{g}^m$, the gradient of the IoU-balanced smooth L1 loss w.r.t. $l^m$ can be expressed as:

$$\frac{\partial \left( w(iou) * \mathrm{smooth}_{L1} \right)}{\partial l^m} = w(iou) * \frac{\partial\, \mathrm{smooth}_{L1}}{\partial d} = \begin{cases} w(iou) * \dfrac{d}{\beta} & \text{if } |d| < \beta \\ w(iou) * \mathrm{sign}(d) & \text{otherwise} \end{cases} \tag{6}$$

The IoU function representing the relationship between IoU and $d$ is complex, and Bounded IoU [13] simplifies this function by computing an upper bound of the IoU function. The same idea is adopted in our paper, and readers can refer to Bounded IoU for more details. Given an anchor $b_s = (x_s, y_s, w_s, h_s)$, an associated ground truth box $b_t = (x_t, y_t, w_t, h_t)$ and a predicted bounding box $b_p = (x_p, y_p, w_p, h_p)$, the upper bounds of the IoU function are as follows:

$$iou_B(x, b_t) = \frac{w_t - |\Delta x|}{w_t + |\Delta x|} \tag{7}$$

$$iou_B(w, b_t) = \min\left( \frac{w_p}{w_t}, \frac{w_t}{w_p} \right) \tag{8}$$

where $\Delta x = x_p - x_t$. Because $d_{cx} = \Delta x / w_s$ and $d_w = \log(w_p / w_t)$, we get:

$$iou_B(x, b_t) = \frac{w_t - w_s |d_{cx}|}{w_t + w_s |d_{cx}|} \tag{9}$$

$$iou_B(w, b_t) = e^{-|d_w|} \tag{10}$$

where $|d_{cx}| \le w_t / w_s$ must hold to ensure $iou_B(x, b_t) \ge 0$.
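The localization weighting of Equ. (3) and (4) and the upper bounds of Equ. (9) and (10) can be sketched in pure Python as follows. The function names, the argument names `lam`, `w_loc` and `beta`, the default values, and the clamp at zero in the centre bound are assumptions of this sketch; the IoU is treated as a constant, matching the statement that gradients are not propagated through the weight.

```python
import math


def smooth_l1(d, beta=0.111):
    """Smooth L1 loss of a residual d: quadratic below beta, linear above."""
    ad = abs(d)
    return 0.5 * ad * ad / beta if ad < beta else ad - 0.5 * beta


def iou_balanced_loc_loss(residuals, ious, lam=1.5, w_loc=1.0):
    """Sum of smooth L1 terms, each positive scaled by w_loc * iou**lam.

    residuals: one (d_cx, d_cy, d_w, d_h) tuple per positive example.
    ious: IoU of each regressed box with its ground truth (a constant here).
    """
    total = 0.0
    for res, iou_value in zip(residuals, ious):
        weight = w_loc * iou_value ** lam
        total += weight * sum(smooth_l1(d) for d in res)
    return total


def bounded_iou_center(d_cx, w_t=1.0, w_s=1.0):
    """Upper bound of IoU for a centre offset d_cx, as in Equ. (9).

    Clamped at 0 when |d_cx| exceeds w_t / w_s (an assumption of this sketch).
    """
    shift = w_s * abs(d_cx)
    return max(0.0, (w_t - shift) / (w_t + shift))


def bounded_iou_size(d_w):
    """Upper bound of IoU for a log-size offset d_w, as in Equ. (10)."""
    return math.exp(-abs(d_w))


# Toy residuals (hypothetical): the same error weighted at two IoU levels.
res = [(0.05, 0.0, 0.0, 0.0)]
loss_high_iou = iou_balanced_loc_loss(res, [0.9])
loss_low_iou = iou_balanced_loc_loss(res, [0.5])
```

Setting `lam=0` with `w_loc=1` recovers the standard smooth L1 sum, so the standard loss is a special case of the weighted one.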
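$iou_B(y, b_t)$ and $iou_B(h, b_t)$ are similar to $iou_B(x, b_t)$ and $iou_B(w, b_t)$ respectively. Denoting by $w_B(iou) = w_{loc} * iou_B^{\lambda}$ the weight computed from the upper bound, and assuming that $\beta = 0.111$ and $w_t = w_s$, we have

$$\frac{\partial \left( w_B(iou) * \mathrm{smooth}_{L1} \right)}{\partial d} = \begin{cases} w_{loc} * \left( \dfrac{1 - |d|}{1 + |d|} \right)^{\lambda} * \dfrac{d}{\beta} & \text{if } |d| < \beta \\ w_{loc} * \left( \dfrac{1 - |d|}{1 + |d|} \right)^{\lambda} * \mathrm{sign}(d) & \text{otherwise} \end{cases} \tag{11}$$

for $d_{cx}$ or $d_{cy}$, and

$$\frac{\partial \left( w_B(iou) * \mathrm{smooth}_{L1} \right)}{\partial d} = \begin{cases} w_{loc} * e^{-\lambda |d|} * \dfrac{d}{\beta} & \text{if } |d| < \beta \\ w_{loc} * e^{-\lambda |d|} * \mathrm{sign}(d) & \text{otherwise} \end{cases} \tag{12}$$

for $d_w$ and $d_h$. The gradient norm of the standard smooth L1 loss ($\lambda = 0$) and the upper bound of the gradient norm for the IoU-balanced smooth L1 loss ($\lambda = 0.5, 1.0, 1.5, 1.8$) are visualized in Figure 1. Compared with the standard smooth L1 loss, the IoU-balanced smooth L1 loss increases the gradient norm of inliers and reduces the gradient norm of outliers, making the model more powerful for accurate localization.
Figure 1 The gradient norm of the standard smooth L1 loss ($\lambda = 0$) and the upper bound of the gradient norm for the IoU-balanced smooth L1 loss ($\lambda = 0.5, 1.0, 1.5, 1.8$) with respect to $d_{cx}$, $d_{cy}$, $d_w$, $d_h$. The localization weight $w_{loc}$ is manually adjusted to keep the sum of the localization loss unchanged when $\lambda$ is changed.
Experiments
Experimental Settings
Dataset and Evaluation Metrics.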
All the experiments are conducted on the challenging MS COCO [20] dataset. It consists of 118k images for training (train-2017), 5k images for validation (val-2017) and 20k images with no disclosed labels for testing (test-dev). All models are trained on train-2017 and evaluated on val-2017 and test-dev. The standard COCO-style Average Precision (AP) metrics are adopted, which include AP (averaged over IoU thresholds from 0.5 to 0.95 with an interval of 0.05), $AP_{50}$ (AP at IoU threshold 0.5), $AP_{75}$ (AP at IoU threshold 0.75), $AP_S$ (AP for small scales), $AP_M$ (AP for medium scales) and $AP_L$ (AP for large scales).
Implementation Details.
All the experiments are implemented based on PyTorch and MMDetection [21]. As only 2 GPUs are available, the linear scaling rule [22] is adopted to adjust the learning rate during training. Specifically, the initial learning rate is divided by 4 compared with the default settings of MMDetection and is multiplied by 0.1 after 8 and after 11 epochs. All the detectors are trained for 12 epochs in total. For all ablation studies, RetinaNet with ResNet50 as backbone is trained and evaluated on val-2017 using an image scale of [600, 1000]. For the main results, the converged models provided by MMDetection [21] are evaluated as the baseline. The IoU-balanced RetinaNets with different backbones are trained with the same default settings used to train the converged models provided by MMDetection. The performance is evaluated on test-dev. All other settings are kept the same as the default settings in MMDetection unless otherwise noted.
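The step schedule described above can be written out as a small helper. This is only a sketch: the default learning rate of 0.01 is a placeholder assumption rather than a value stated in the text, epochs are counted from 0, and the function and argument names are hypothetical.

```python
def learning_rate(epoch, default_lr=0.01, lr_divisor=4, decay=0.1, milestones=(8, 11)):
    """Learning rate in effect during a given 0-based epoch.

    default_lr is an assumed placeholder for the framework's default value.
    The rate is first divided by lr_divisor (linear scaling rule for fewer
    GPUs), then multiplied by decay once 8 epochs and again once 11 epochs
    have been completed.
    """
    lr = default_lr / lr_divisor
    for milestone in milestones:
        if epoch >= milestone:
            lr *= decay
    return lr


# Learning rate for each of the 12 training epochs.
schedule = [learning_rate(e) for e in range(12)]
```

Under these assumptions, the first 8 epochs run at the scaled rate, the next 3 at one tenth of it, and the final epoch at one hundredth.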
Main Results
For the main results, the performance of RetinaNet with different backbones is reported. As Table 1 shows, the IoU-balanced loss functions improve AP by 1.1% for both RetinaNet-ResNet50 and RetinaNet-ResNet101.

Table 1: Main results. Comparison of single-stage detectors on COCO test-dev.
| Method | Backbone | AP | $AP_{50}$ | $AP_{75}$ | $AP_S$ | $AP_M$ | $AP_L$ |
|---|---|---|---|---|---|---|---|
| RetinaNet | ResNet50 | 35.9 | 55.8 | 38.4 | 19.9 | 38.8 | 45.0 |
| RetinaNet | ResNet101 | 38.1 | 58.5 | 40.8 | 21.2 | 41.5 | 48.2 |
| IoU-balanced RetinaNet | ResNet50 | 37.0 | 56.2 | 39.7 | 20.6 | 39.8 | 46.3 |
| IoU-balanced RetinaNet | ResNet101 | 39.2 | 58.7 | 42.3 | 21.5 | 42.4 | 49.4 |

Ablation Experiments
Component Analysis. The effectiveness of the different components is analyzed. Table 2 shows that the IoU-balanced classification loss and the IoU-balanced localization loss improve AP by 0.7% and 0.8% respectively, and combining them improves AP by 1.3%. Table 3 demonstrates that the IoU-balanced classification loss brings a consistent improvement for AP at different IoU thresholds. The IoU-balanced localization loss is especially beneficial for accurate object localization, improving AP at two high IoU thresholds by 2.1% and 2.3% respectively. This demonstrates that focusing more attention on inliers and suppressing outliers in the localization loss is important for accurate localization.

Table 2: Effectiveness of IoU-balanced Classification Loss and IoU-balanced Localization Loss for RetinaNet-ResNet50 on COCO val-2017. Columns: IoU-balanced Cls, IoU-balanced Loc, AP, $AP_{50}$, $AP_{75}$, $AP_S$, $AP_M$, $AP_L$.
Table 3: The impact of IoU-balanced Classification Loss and IoU-balanced Localization Loss on AP at different IoU thresholds. Columns: IoU-balanced Cls, IoU-balanced Loc, and AP at five IoU thresholds.

Ablation Studies on IoU-balanced Classification Loss. The parameter $\eta$ in the IoU-balanced classification loss controls the extent to which the model focuses on the examples with high IoU. As Table 4 shows, the model achieves the best AP of 35.1% when $\eta$ equals 1.5. As shown in Figure 2, the IoU-balanced classification loss increases the average scores for examples with high IoU and decreases the average scores for examples with low IoU, which demonstrates that the correlation between the classification and localization tasks is enhanced by the IoU-balanced classification loss.

Table 4: The effectiveness of varying $\eta$ in the IoU-balanced classification loss and $\lambda$ in the IoU-balanced localization loss respectively. Columns: $\eta$, AP for the classification loss; $\lambda$, $w_{loc}$, AP for the localization loss. Baseline row: $\eta = 0$, AP 34.4; $\lambda = 0$, $w_{loc} = 1.0$, AP 34.4.

Ablation Studies on IoU-balanced Localization Loss. As Figure 1 shows, the parameter $\lambda$ in the IoU-balanced localization loss controls the extent to which the model focuses on inliers and suppresses outliers. $w_{loc}$ is manually adjusted to keep the sum of the localization loss unchanged when changing the parameter $\lambda$. As Table 4 shows, the best AP of 35.2% is obtained when $\lambda$ equals 1.5. As shown in Figure 3, the IoU-balanced localization loss increases the percentage of detections with high IoU by 0.8% to 4.8% relative to the baseline model. This demonstrates that the IoU-balanced localization loss is beneficial for accurate object localization.
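A statistic of the kind plotted in Figure 3 (the share of detections whose IoU reaches a given threshold) could be computed along the following lines. This is a sketch under assumptions: each detection is taken to be already matched to one ground truth box, the threshold list is chosen for illustration, and the IoU values are made up rather than taken from the experiments.

```python
def fraction_above(ious, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Fraction of detections whose IoU with the matched ground truth is >= t."""
    n = len(ious)
    return {t: sum(v >= t for v in ious) / n for t in thresholds}


# Hypothetical IoUs of four matched detections.
stats = fraction_above([0.55, 0.72, 0.91, 0.40])
```

Comparing such dictionaries for a baseline model and an IoU-balanced model, threshold by threshold, gives the kind of relative change reported for Figure 3.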
Figure 2 Average scores of examples with different IoU. IoU-balanced classification loss increases the average scores for examples with high IoU while decreasing the average scores for examples with low IoU.
Figure 3 The percentage of detections at different IoU thresholds. IoU-balanced localization loss can increase the percentage of detections with high IoU by 0.8% to 4.8% relative to the baseline.
Conclusions
In this paper, we demonstrate that the classification loss and localization loss adopted by most single-stage detectors are suboptimal for accurate localization, and we therefore propose IoU-balanced loss functions, consisting of an IoU-balanced classification loss and an IoU-balanced localization loss, to improve the localization accuracy of single-stage detectors. The IoU-balanced classification loss is designed to enhance the correlation between the classification and localization tasks. The IoU-balanced localization loss is designed to decrease the gradient norm of outliers while increasing the gradient norm of inliers. Extensive experiments on MS COCO show that IoU-balanced loss functions substantially improve the localization accuracy of single-stage detectors.
References
[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016.
[2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Y. Fu, and A. C. Berg, "SSD: Single Shot MultiBox Detector," CoRR, vol. abs/1512.02325, 2015.
[3] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," arXiv preprint arXiv:1708.02002, 2017.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," 2014.
[5] R. Girshick, "Fast R-CNN," 2015.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds.: Curran Associates, Inc., 2015, pp. 91--99.
[7] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving Into High Quality Object Detection," 2018.
[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," 2017, pp. 2980--2988.
[9] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-Shot Refinement Neural Network for Object Detection," 2018.
[10] S. Gidaris and N. Komodakis, "Object detection via a multi-region & semantic segmentation-aware CNN model," CoRR, vol. abs/1505.01749, 2015.
[11] A. Shrivastava, A. Gupta, and R. Girshick, "Training Region-Based Object Detectors With Online Hard Example Mining," 2016.
[12] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, "Libra R-CNN: Towards Balanced Learning for Object Detection," arXiv e-prints, arXiv:1904.02701, 2019.
[13] L. Tychsen-Smith and L. Petersson, "Improving Object Localization with Fitness NMS and Bounded IoU Loss," arXiv e-prints, arXiv:1711.00164, 2017.
[14] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang, "Acquisition of Localization Confidence for Accurate Object Detection," arXiv e-prints, arXiv:1807.11590, 2018.
[15] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, "Mask Scoring R-CNN," arXiv e-prints, arXiv:1903.00241, 2019.
[16] Y. Cao, K. Chen, C. Change Loy, and D. Lin, "Prime Sample Attention in Object Detection," arXiv e-prints, arXiv:1904.04821, 2019.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," Cham, 2014, pp. 346--361.
[18] B. Li, Y. Liu, and X. Wang, "Gradient Harmonized Single-stage Detector," arXiv e-prints, arXiv:1811.05181, 2018.
[19] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, "Soft-NMS -- Improving Object Detection With One Line of