RMOPP: Robust Multi-Objective Post-Processing for Effective Object Detection
Mayuresh Savargaonkar, Abdallah Chehade
Department of Industrial and Manufacturing Systems Engineering, University of Michigan-Dearborn

Samir Rawashdeh
Department of Electrical and Computer Engineering, University of Michigan-Dearborn

[email protected], [email protected], [email protected]

Abstract
Over the last few decades, many architectures have been developed that harness the power of neural networks to detect objects in near real-time. Training such systems requires substantial time across multiple GPUs and massive labeled training datasets. Although the goal of these systems is generalizability, they are often impractical in real-life applications due to flexibility, robustness, or speed issues. This paper proposes RMOPP: a robust multi-objective post-processing algorithm to boost the performance of fast pre-trained object detectors with a negligible impact on their speed. Specifically, RMOPP is a statistically driven post-processing algorithm that allows for simultaneous optimization of precision and recall. A unique feature of RMOPP is the Pareto frontier that identifies dominant possible post-processed detectors to optimize for both precision and recall. RMOPP explores the full potential of a pre-trained object detector and is deployable for near real-time predictions. We also provide a compelling test case on YOLOv2 using the MS-COCO dataset.
1. Introduction
Object detectors and vision systems have greatly advanced in the last few years. The evolution of deep neural networks has unlocked new potential in complex fields such as capacity estimation of Li-ion battery cells [36, 33, 6] and Remaining-Useful-Life (RUL) prediction [35, 4, 5]. The field of object detection and classification is another significant beneficiary, especially due to the advancements in Convolutional Neural Networks (CNNs). In this paper, we refer to the broader task of detection and classification simply as object detection. Deep convolutional neural networks commonly outperform traditional methods such as those using Support Vector Machines (SVMs) or regression techniques [24]. CNNs are typically run as either a single-stage detector or a two-stage detector. More details on state-of-the-art detectors are provided in the literature review.

Research today mainly focuses on building object detection systems that offer a reasonable trade-off between speed and accuracy. For improved accuracy, two-stage detectors including ensemble neural networks like PANet [25] are shown to perform well in competitions like PASCAL VOC [27], MS-COCO [23], and ImageNet [7]. Although such systems provide competition-winning performances, they sacrifice speed and are not currently suitable for deployment in real-time systems. For improved latency, single-stage detectors like the YOLO series algorithms: YOLOv1 [30], YOLOv2 [31], and YOLOv3 [10] are shown to be faster than two-stage detectors. Although single-stage detectors are extremely fast, they often fail to reach the same level of robustness as two-stage detectors [24]. Robustness here is defined as the ability of an object detection pipeline to successfully handle perturbations. Single-stage detectors' limitations are often associated with localization errors and the identification of false positives [29].
This makes them less suitable for critical applications that require high robustness.

Practically, for many real-life applications like object detection in autonomous vehicles (Figure 1), both detection of an object and speed are deemed extremely important. A failure in detection, as well as a slow detection, can result in a fatal crash. Further, these object detection pipelines are often hard-wired to achieve high precision at lower recall or higher recall at lower precision to achieve competition-winning performances. Therefore, there is an essential need for a customizable and low-latency object detection pipeline with a robust post-processing method that can process at real-time speeds without undergoing re-training.

Figure 1: Example of object detection in modern vehicles.

Generally, it is hard to tune a robust object detection pipeline with a uni-objective post-processing algorithm. While existing methods focus on optimizing either precision or recall, they struggle to maintain an acceptable recall or precision, respectively. In this paper, we propose RMOPP: a robust multi-objective post-processing algorithm that allows for simultaneous optimization of both precision and recall. RMOPP exhibits the following advantages:

• Statistical Guarantees - Development of a heuristic based on the log-likelihood ratio provides explainable statistical reasoning.

• Increased Robustness - Intuitive post-processing hyper-parameters developed using statistical guarantees allow for robust and improved filtering.

• Pareto Frontier - Development of a multi-dimensional post-processing algorithm allows for a thorough exploration of an object detection pipeline using the concepts of the Pareto frontier. This helps in the choice of optimized post-processing hyper-parameters. More details on the Pareto frontier are discussed in Section 4.

• No Re-training - The need for re-training the object detector is eliminated when preference-based optimization for precision and/or recall is required.

• Decreased Complexity - Avoids design, development, and training of complex post-processing procedures while achieving state-of-the-art performances.

• Low Latency - Since the post-processing hyper-parameters are optimized offline, the inference time of the object detection pipeline remains unaffected.

Although we present a case study based on single-stage detectors in this paper, there is no reason to assume that a similar change in the post-processing step won't be beneficial for two-stage detectors. The remainder of the paper is organized as follows: Section 2 reviews the literature on object detection and post-processing, Section 3 introduces the proposed post-processing algorithm, Section 4 verifies the efficacy of the proposed algorithm using a case study, and Section 5 concludes the paper with directions for future work.
2. Literature Review
Modern object detectors can be categorized into two types: (i) two-stage detectors like R-CNN [13], Fast R-CNN [12], Faster R-CNN [32], and many others; (ii) single-stage detectors like OverFeat [34], YOLO [30], RetinaNet [22], and many others. The family of R-CNN-type two-stage object detectors works on selective search strategies or Region Proposal Networks (RPNs) [13]. The first stage proposes and filters locations inside the image that have a high probability of containing an object, and the second stage scores these proposed regions. Fast R-CNN reduces the computational load in the first stage by passing the image only once through a convolutional layer that generates Regions of Interest (ROIs) using the selective search strategy. Faster R-CNN improves upon Fast R-CNN by using an RPN instead of a selective search strategy, which further offers a 10x improvement in speed [32]. Cascade R-CNN [2] uses multiple R-CNN modules capable of detecting varying sizes of objects with different classification thresholds for each module. Frameworks such as Libra R-CNN [28] pursue performance improvements by addressing the imbalance in training images. To offer speed improvements over the RPN stage of Faster R-CNN, a novel method, CPNDet [9], was recently published by Duan et al. CPNDet replaces the RPN stage with an anchor-free, heat-map-based detection framework called CornerNet [20]. Although two-stage detectors show better detection performance, they offer limited improvements in speed due to several complex modules being stitched together to produce final detections. Moreover, these detectors have far too many hyper-parameters, which makes them unsuitable for real-life applications.

Single-stage detectors are fundamentally different and try to achieve the entire process of detection, classification, and bounding box regression in a single step. Most single-stage detectors are extremely fast but fail to reach the same level of robustness as two-stage detectors.
Algorithms such as DSSD [11], SSD [26], and YOLO [31], [10] use anchor-based multi-scale detections for identifying fine-grained features in the image and thus improve the accuracy of the detector. Object detectors like RetinaNet [22] try to improve on the localization errors created by single-stage detectors by introducing 'focal loss', which rewards the model under training when it correctly classifies hard-negative examples. To eliminate problems associated with detections of small-scale objects in common single-stage detectors, researchers recently introduced detectors such as CenterNet [8] and FCOS [40]. These detectors work by eliminating the need for anchor boxes. In their detailed experiments with ATSS, Zhang et al. [43] show that definitions of positive and negative annotations help bridge traditional gaps between anchor-free and anchor-based methods.

Figure 2: Object detection pipeline using a single-stage detector. *Detector output is cleaned for visualization purposes.

End-to-end object detection is another branch of single-stage detectors that try to improve performance by eliminating post-processing stages such as NMS (Non-Maximal Suppression). DETR [3] is the first such detector capable of performing end-to-end object detection. OneNet [38] is another recent end-to-end object detector that achieves state-of-the-art performance while running at real-time speeds. While end-to-end object detectors achieve unparalleled performances in speed and accuracy, they lack the customizability offered by post-processing stages. Customizations require re-training using several GPUs and hours of complex training paradigms often unavailable to the common user. The importance of this subject is also touched upon by J. Huang et al. [18], although they focus more on exploiting the aspect of speed and accuracy trade-offs for developing a better detector. While investigating the effect of speed on accuracy for a wide range of detectors, they show that single-stage detectors based on the ResNet-101 backbone perform equally well when compared to R-FCN [18] and Faster R-CNN on the same backbone for large-object detection on the MS-COCO dataset.
Most existing object detectors output four bounding box coordinates, a score, and the class associated with each detection. The commonly used 'score' function for a detection is the product of the probability of the most probable class bounded in an identified detection and the confidence of having an object inside the detected box. Next, detections with low scores are filtered out, assuming that lower scores represent poorly identified or classified objects. Once the detections are filtered, NMS is applied with a pre-defined NMS threshold (η) [1]. NMS is the process of identifying detections with an Intersection over Union (IoU) greater than the NMS threshold (η) and then eliminating those with lower 'scores'. The NMS process is effective in eliminating redundant detections.

A significant improvement in post-processing comes as 'Soft-NMS', replacing the traditional greedy NMS [1]. This work has been shown to be of limited use in the case of complex datasets like PASCAL VOC and MS-COCO [1]. To propose a suitable improvement over 'Soft-NMS', He et al. [15] introduced 'Softer-NMS', which decays the bounding box scores using a continuous function, thus achieving better results. Relation Networks [17] and Learning NMS [16] are among other works that try to improve upon the traditional NMS by designing and training a sub-network to analyze complex object-object correlations. While works such as Fitness NMS [41] try to integrate localization information into ranking scores, LTR [39] is another sub-network that improves on suppression ranking via learning procedures. NMS-based post-processing works show promising performance gains and thus help build confidence that effective post-processing is key in any suitable object detection pipeline.
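The greedy NMS procedure described above can be sketched as follows; this is a minimal illustration (not any particular detector's implementation), assuming boxes are axis-aligned (x1, y1, x2, y2) tuples paired with scores:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(detections, eta=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop any
    remaining box whose IoU with it exceeds the threshold eta.
    detections: list of (box, score) with box = (x1, y1, x2, y2)."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) <= eta]
    return kept
```

With η = 0.5 (the default used later in the paper), two heavily overlapping boxes collapse to the single higher-scoring one, while disjoint boxes survive.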
A summary of a common single-stage object detection pipeline is shown in Figure 2.

Although post-processing algorithms exist, they often fail to explicitly acknowledge the fact that there may exist multiple recalls at the same precision and vice-versa. Thus, they fail to explore the full potential of an object detection pipeline, which results in sub-optimal detection performance. RMOPP thus uses the concepts of the Pareto frontier, which identifies a set of dominant post-processing hyper-parameters that optimize both precision and recall simultaneously. More details are provided in Section 4.

Table 1: The blocks in 'bold' show where the existing post-processing algorithms are effective.

                                          Probability of most probable class
                                          Low          High
Probability of existence      Low         Block I      Block II
of bounding box               High        Block III    Block IV
3. Proposed Algorithm: RMOPP
Existing post-processing algorithms rely on a single data-driven 'score' metric that is not statistically intuitive. Several assumptions are made when filtering detections by scores and NMS using pre-defined post-processing hyper-parameters and thresholds. These values are usually chosen on an empirical basis, from the user's experience, or by trial and error. Hereafter, we focus on the most common 'score' function for existing post-processing algorithms, defined as the product of the probability of the most probable class bounded in an identified detection and the confidence of having an object inside the detected bounding box.

With this definition of the score function, Table 1 summarizes the output of commonly used object detectors. Block I comprises detections with poor scores that are typically filtered out. Block IV comprises detections with high scores that are typically retained. Blocks II and III comprise risky detections with intermediate scores. The existing post-processing algorithms are susceptible to poor performance for detections in Blocks II and III that have intermediate-level scores. Many of those detections are (i) mistakenly filtered out (a large number of false negatives), thus achieving a lower than ideal recall, or (ii) retained despite being poor detections (a large number of false positives), thus achieving a lower than ideal precision. To address this limitation, we introduce RMOPP: a robust multi-objective post-processing algorithm that allows simultaneous and optimized tuning of recall and precision.
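The traditional uni-objective filtering step described above can be sketched as follows; a minimal illustration in which each detection is a dictionary with illustrative field names ('class_probs', 'p_obj') that are assumptions for this sketch, not the output format of any specific detector:

```python
def score_filter(detections, gamma=0.3):
    """Traditional uni-objective filtering: keep a detection only if
    s_c = P(most probable class) * P(object present) exceeds gamma.
    Each detection is a dict with 'class_probs' (list of per-class
    probabilities) and 'p_obj' (objectness confidence)."""
    kept = []
    for d in detections:
        s_c = max(d["class_probs"]) * d["p_obj"]
        if s_c > gamma:
            kept.append(d)
    return kept
```

A single threshold γ collapses classification quality and objectness into one number, which is precisely why detections in Blocks II and III of Table 1 are handled poorly.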
Let $X_c$ be the binary random variable associated with the existence of some bounding box $c$ with probability $P_c$. Let $X_{ci}$ be the binary random variable indicating that the object in bounding box $c$ belongs to class $i$ with probability $P_{ci}$. Under the assumptions that $X_c \sim \mathrm{Ber}(P_c)$ and the collection $\{X_{c1}, \ldots, X_{ci}, \ldots, X_{cN}\} \sim \mathrm{MultiNom}(P_{c1}, \ldots, P_{cN})$,

$E[X_c] = P(X_c = 1) = P_c$    (1)

$E[X_{ci}] = P(X_{ci} = 1) = P_{ci}$    (2)

$\sum_{i=1}^{N} P_{ci} = 1$    (3)

To simplify the mathematical notation, let $Z_{ci}$ be the ordered statistic of $X_{ci}$ such that $P(Z_{ci} = 1) \geq P(Z_{cj} = 1)\ \forall j > i$. Under this notation, $P(Z_{c1} = 1)$ is the probability of the most probable class for the detection bounded in box $c$, and the existing post-processing technique is expressed as

$s_c = P(Z_{c1} = 1) \times P(X_c = 1) \leq \gamma$    (4)

where the detection in box $c$ is filtered out if its score $s_c$ is less than a pre-defined threshold $\gamma$, as stated in Equation 4. The threshold $\gamma$ is often chosen based on empirical analysis, user experience, or limited trials. Extensions of Equation 4 exist where the score is defined as $\max_i P(X_{ci} = 1)$ or as $P(X_c = 1)$ alone. It should be noted that if the object detector does not output the confidence $P(X_c = 1)$, then $P(X_c = 1)$ is set to 1 for all detections.

Note that during training, object detectors aim to minimize the population cross-entropy over the training dataset. Therefore, it is natural to design a post-processing algorithm that shares similar merit with the training procedure. Specifically, we consider thresholding over the log-likelihood ratios of the most probable class to all other classes for every bounding box. This brings us to our first proposition.

Proposition I: The log-likelihood ratios between the top class and the remaining classes of a bounding box $c$ are sufficient to quantify the classification accuracy of the object in bounding box $c$.

Proposition I is inspired by the likelihood ratio test and the theory of statistical hypothesis testing, which shows that every bounding box $c$ must satisfy $\frac{P(Z_{c1}=1)}{P(Z_{ck}=1)} \geq \gamma_1,\ \forall k > 1$.

Lemma I: Given the definition that $Z_{ci}$ is the ordered statistic of $X_{ci}$ such that $P(Z_{ci} = 1) \geq P(Z_{cj} = 1)\ \forall j > i$, then $\frac{P(Z_{c1}=1)}{P(Z_{c2}=1)} \geq \gamma_1$ for any bounding box $c$ guarantees $\frac{P(Z_{c1}=1)}{P(Z_{ck}=1)} \geq \gamma_1,\ \forall k > 1$.

From Lemma I and Proposition I, we identify our first score metric to be $\frac{P(Z_{c1}=1)}{P(Z_{c2}=1)}$ and its user-defined threshold to be $\gamma_1$. Therefore, any detected box $c$ must satisfy Equation 5 or it will be removed for poorly classifying its bounded object.

$\frac{P(Z_{c1} = 1)}{P(Z_{c2} = 1)} \geq \gamma_1, \quad \forall c$    (5)

While the classification task is critical in object detection, the bounding box detection is also an important task. Proposition I guarantees a good classification, but it does not guarantee a good bounding box detection. This brings us to Proposition II.

Proposition II: An effective object detector must show a relatively similar classification and detection performance, which can be quantified by the log-likelihood ratio of detection to classification for any bounding box $c$: $\frac{P(X_c=1)}{P(Z_{c1}=1)}$.

Proposition II suggests that a detected bounding box truly bounds an object if $\frac{P(X_c=1)}{P(Z_{c1}=1)}$ is large enough. It should be noted here that there exists a cross-correlation between Propositions I and II, where Proposition II mainly focuses on the relative detection performance given the classification probabilities $P\{(X_c = 1) \mid P(Z_{c1} = 1), \ldots, P(Z_{cN} = 1)\}$ for any detected bounding box $c$, thus ensuring that the boxes are well-identified and classified.

Lemma II: Given the definition that $Z_{ci}$ is the ordered statistic of $X_{ci}$ such that $P(Z_{ci} = 1) \geq P(Z_{cj} = 1)\ \forall j > i$, then $\frac{P(X_c=1)}{P(Z_{c1}=1)} \geq \gamma_2$ for any bounding box $c$ guarantees $\frac{P(X_c=1)}{P(Z_{ck}=1)} \geq \gamma_2,\ \forall k \geq 1$.
From Lemma II and Proposition II, we identify our second score metric to be $\frac{P(X_c=1)}{P(Z_{c1}=1)}$ and its user-defined threshold to be $\gamma_2$. Therefore, any detected box $c$ must satisfy Equation 6 or it will be considered a false detection and removed.

$\frac{P(X_c = 1)}{P(Z_{c1} = 1)} \geq \gamma_2, \quad \forall c$    (6)

Equations 5 and 6 together summarize the proposed post-processing algorithm, where a bounding box $c$ is a good detection if and only if

$\frac{P(Z_{c1} = 1)}{P(Z_{c2} = 1)} \geq \gamma_1 \ \cap\ \frac{P(X_c = 1)}{P(Z_{c1} = 1)} \geq \gamma_2, \quad \forall c$    (7)

Equation 5 reduces the chances of having false positives due to poor classification. This addresses the problems with Block III in Table 1 for traditional post-processing algorithms because it filters out detections with a low $\frac{P(Z_{c1}=1)}{P(Z_{c2}=1)}$.

(a) Before using RMOPP  (b) After using RMOPP
Figure 3: $\gamma_1$ and $\gamma_2$ enable RMOPP to perform effective post-processing while eliminating false positives.

Note that a lower value for $\frac{P(Z_{c1}=1)}{P(Z_{c2}=1)}$ indicates poor classification based on Lemma I and Proposition I. Thus, it is hypothesized that increasing the value of $\gamma_1$ will reduce false positives due to poor classification. Note that extreme values of $\gamma_1$ may have a detrimental effect on true positives, thus increasing the false negatives and lowering recall, and hence $\gamma_1$ should be chosen carefully. More details and results are presented in Section 4.

Further, Lemma II helps address problems due to cases following Block II in Table 1. Intuitively, $P(Z_{c1} = 1)$ should only be considered when $P(X_c = 1)$ is high enough. At random detections where $P(X_c = 1)$ is low and $P(Z_{c1} = 1)$ is very high, we can observe that the traditional algorithms fail as they eliminate too many true positives. Thus, it is hypothesized that a lower $\gamma_2$ will help increase true positives by keeping detections that were traditionally eliminated due to poor identification. The cross-correlation between $\gamma_1$ and $\gamma_2$ is particularly helpful here in keeping the false positives at bay. Figure 3 shows how results improve when RMOPP is used for filtering the detections compared to the traditional algorithms.

Choosing $\gamma_1$ and $\gamma_2$

Choosing $\gamma_1$ and $\gamma_2$ is a non-trivial task that depends on the object detector's raw output. This is further elaborated in the case study. Generally, decreasing both $\gamma_1$ and $\gamma_2$ will result in filtering out fewer detections. In extreme cases, this may result in poor precision due to extreme under-filtering. Similarly, increasing both $\gamma_1$ and $\gamma_2$ will result in filtering out more detections. In extreme cases, this may result in poor recall due to extreme over-filtering. We recommend that the user first decide on the evaluation metric of interest. The three common metrics we discuss in this paper are Precision, Recall, and F1 score. Once decided, we apply Algorithm 1 with varying values of $\gamma_1$ and $\gamma_2$ to fully understand the complex correlations between the evaluation metric and $\gamma_1$ and $\gamma_2$. The Pareto frontier is then identified from these values, which helps optimize for recall and precision simultaneously. It is worth noting that although there may exist multiple combinations of $\gamma_1$ and $\gamma_2$ that result in the same precision (or recall) but different recalls (or precisions), only the dominant combinations will be identified by the Pareto frontier. More details are provided in Section 4 (Figures 4 and 5). Unlike existing post-processing algorithms, RMOPP is multi-objective and hence provides the utmost flexibility in simultaneously optimizing for precision and recall.

Algorithm 1: Pseudocode for RMOPP

Require: $\{\arg\max_i X_{ci}, Z_{c1}, Z_{c2}, X_c, b_c\}_{c=1}^{M}$: $M$ predictions for a given image with bounding-box coordinates $b_c$; $\delta_1, \delta_2$: the increments for $\gamma_1, \gamma_2$
for $\Delta_{1,L} \leq \gamma_1 \leq \Delta_{1,U}$ do
    for $\Delta_{2,L} \leq \gamma_2 \leq \Delta_{2,U}$ do
        $I = 0_M$    ▷ an indicator of whether the detections satisfy Eq. 7
        for $c = 1, \ldots, M$ do
            if $\frac{P(Z_{c1}=1)}{P(Z_{c2}=1)} \geq \gamma_1$ and $\frac{P(X_c=1)}{P(Z_{c1}=1)} \geq \gamma_2$ then
                $I[c] = 1$
            end if
        end for
        if $\sum(I) \neq 0$ then
            apply NMS with IoU threshold $\eta$
            compute Precision, Recall, and F1
        end if
        $\gamma_2 = \gamma_2 + \delta_2$
    end for
    $\gamma_1 = \gamma_1 + \delta_1$
end for
return Precision, Recall, and F1 scores $\forall\ \gamma_1, \gamma_2$

In summary, the object detection pipeline with RMOPP performs the following steps: (1) take an input image and pass it through the pre-trained detector, which outputs some detections; (2) filter these detections using Equation 7, given some values for $\gamma_1$ and $\gamma_2$; (3) perform NMS on the detections from step 2 to further refine the results using the NMS threshold $\eta$; (4) calculate precision, recall, and F1 scores; (5) modify the values of $\gamma_1$ and $\gamma_2$ and repeat steps 2-5; (6) choose the best values of $\gamma_1$ and $\gamma_2$ based on the preference for precision, recall, and/or F1 scores. It should also be noted that since there is no formal specification of the NMS threshold $\eta$, we default it to 0.5 for all experiments unless specified otherwise.
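Algorithm 1 can be sketched in Python as follows; this is a simplified illustration in which each detection is a dictionary with illustrative field names ('class_probs', 'p_obj') and the NMS-plus-ground-truth evaluation step is abstracted behind a user-supplied `evaluate` callable — both are assumptions of this sketch, not part of the paper's reference implementation:

```python
def rmopp_filter(dets, gamma1, gamma2):
    """Keep detections satisfying Eq. 7:
    P(Z_c1)/P(Z_c2) >= gamma1  and  P(X_c)/P(Z_c1) >= gamma2."""
    kept = []
    for d in dets:
        p = sorted(d["class_probs"], reverse=True)
        z1 = p[0]                                  # most probable class
        z2 = p[1] if len(p) > 1 else 1e-12         # second most probable
        if z1 / max(z2, 1e-12) >= gamma1 and d["p_obj"] / z1 >= gamma2:
            kept.append(d)
    return kept

def rmopp_grid_search(dets, evaluate, g1_range, g2_range):
    """Offline sweep over (gamma1, gamma2), as in Algorithm 1. `evaluate`
    maps a filtered detection list to (precision, recall, f1); internally
    it would apply NMS and match detections against ground truth."""
    results = {}
    for g1 in g1_range:
        for g2 in g2_range:
            kept = rmopp_filter(dets, g1, g2)
            if kept:  # corresponds to the sum(I) != 0 check
                results[(g1, g2)] = evaluate(kept)
    return results
```

Because the sweep runs offline, the chosen $(\gamma_1, \gamma_2)$ pair adds no cost at inference time: deployment applies only `rmopp_filter` and NMS.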
4. Case Study: YOLOv2 using Darknet-19
While RMOPP is a suitable add-on for most available object detectors, we consider applying it to improve YOLOv2 in this case study. YOLOv2 is one of the fastest existing object detectors; however, it detects many erroneous boxes, which makes it an excellent case study for evaluating the efficacy of post-processing algorithms.

YOLOv2's backbone, called 'Darknet-19', consists of 19 hidden convolutional layers and was first introduced in 2017 [31]. It is one of the fastest architectures around and performs about 5.58 billion operations per image [31]. The YOLOv2 loss function is a multi-part loss function that is a weighted sum of classification, localization, and confidence losses [30]. The 'classification loss' represents the loss based on the squared error of the conditional class probabilities of the detected object given by the softmax layer of the detector. The 'localization loss' represents the loss based on the difference between the coordinates of the bounding boxes for detected objects and the coordinates of the bounding boxes in the annotations. Note that a weighting parameter $\lambda_{coord}$ is used to increase the contribution of the localization loss in the overall loss function. The 'confidence loss' represents the loss concerning the objectness score for each bounding box. Since we expect more boxes for the background class, to eliminate the class imbalance, another weighting parameter $\lambda_{noobj}$ is used to down-weight the contribution of the confidence loss in the overall loss function when a bounding box is identified as a background object. $\lambda_{coord}$ and $\lambda_{noobj}$ are set to 5 and 0.5 by Redmon et al. in [31].

The MS-COCO dataset used for training YOLOv2 consists of around 118,000 images with an average of 7 boxes per image and a total of 80 unique class labels [23]. For performance evaluation, we use the COCO minival dataset, which has 5,000 images that were not used in training YOLOv2, and set the input image resolution to 544x544 for all our experiments.
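The weighting described above can be illustrated schematically; the following is a simplified sketch of the loss composition only (the actual YOLOv2 loss also involves per-cell anchor assignment and box-coordinate encodings), with the four component losses assumed to be precomputed sums of squared errors:

```python
LAMBDA_COORD = 5.0   # up-weights the localization term (value from [31])
LAMBDA_NOOBJ = 0.5   # down-weights confidence loss on background boxes

def multipart_loss(loc_loss, conf_loss_obj, conf_loss_noobj, class_loss):
    """Schematic weighted sum of the YOLOv2 loss components described in
    the text: localization, objectness confidence (object vs. background),
    and classification."""
    return (LAMBDA_COORD * loc_loss
            + conf_loss_obj
            + LAMBDA_NOOBJ * conf_loss_noobj
            + class_loss)
```

The asymmetry of the two weights encodes the priorities discussed above: box-coordinate errors are penalized heavily, while the abundant background boxes contribute only weakly to the confidence term.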
In this case study, we use the definitions of precision, recall, and F1 score given in Equations 8, 9, and 10. A detection is considered a true positive if its IoU is greater than 0.5 with at least one ground truth annotation of the same class label. Note that if multiple objects have an IoU greater than 0.5, then only the one with the highest IoU is selected as a true positive while the others are considered false positives. All unmatched objects in the list of ground truths are considered false negatives. For this case study, $\gamma_1$ is varied between 1 and 10 with increments of 0.5, and $\gamma_2$ is varied between 0.1 and 1 with increments of 0.05. Beyond those limits, YOLOv2 showed extremely poor precisions or recalls.

$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$    (8)

$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$    (9)

$F_1\ \text{score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$    (10)

Changes in $\gamma_1$ and $\gamma_2$ have a significant effect on precision, recall, and F1 scores. Essentially, $\gamma_1$ and $\gamma_2$ can be perceived as tuning knobs that can be used to tune the object detector for a desired precision, recall, or F1 score. In Figure 4a, it is seen that as $\gamma_1$ is increased, a smooth decrease in recall is observed, with recall almost dropping to 0 for the highest values of $\gamma_1$. Similarly, in Figure 4b, it is observed that as $\gamma_2$ is increased, precision increases almost linearly to 0.9 or more when $\gamma_2$ is augmented from 0.1 to 1 for all values of $\gamma_1$ above 5.5. For values of $\gamma_1$ lower than 5.5, this increase is not as significant due to the correlation between $\gamma_1$ and $\gamma_2$.

When $\gamma_1$ and $\gamma_2$ are set to their highest values, total detections drop, and a maximum precision condition (Recall ≈ 0) is realized. Conversely, when $\gamma_1$ and $\gamma_2$ are set to their lowest values, total detections rise, and a maximum recall condition (Precision ≈ 0) is realized. This helps build a sense of confidence which suggests that by employing finer adjustments in $\gamma_1$ and $\gamma_2$, we can explore the entire possible space of precision-recall offline. In this case study, multiple post-processed detectors with different $\gamma_1$ and $\gamma_2$ resulted in the same precision but different recalls (and vice versa). The region highlighted in Figure 4a shows possible combinations of $\gamma_1$ and $\gamma_2$ where recall is fixed at 0.5. The regions highlighted in Figure 4b show possible combinations of $\gamma_1$ and $\gamma_2$ where precision is fixed at 0.5.

Figure 4: Effects of $\gamma_1$ and $\gamma_2$ on Precision, Recall, and F1 score. (a) The labels inside the figure denote recall while the color represents precision; the highlighted region encloses combinations of $\gamma_1$ and $\gamma_2$ where recall = 0.5. (b) The labels inside the figure denote precision while the color represents recall; the highlighted region encloses combinations of $\gamma_1$ and $\gamma_2$ where precision = 0.5. (c) The highlighted region encloses combinations of $\gamma_1$ and $\gamma_2$ where the F1 score exceeds the chosen minimum.

Designing a bi-objective function to simultaneously optimize precision and recall is challenging; thus, researchers have developed metrics like the F1 score to address this challenge. In Figure 4c, we plot a heat map of F1 scores obtained by varying $\gamma_1$ and $\gamma_2$. Although Figure 4c helps us understand the complex correlations between $\gamma_1$, $\gamma_2$, precision, and recall, it offers limited help in selecting values of $\gamma_1$ and $\gamma_2$ that maximize precision at a given recall and vice versa. Thus, we plot a Pareto frontier using the results obtained by varying $\gamma_1$ and $\gamma_2$. The Pareto frontier is plotted as a function of precision and recall in Figure 5.

Figure 5: Pareto frontier based on precision and recall. Here, each combination of $\gamma_1$ and $\gamma_2$ is a uniquely post-processed object detector and is represented by a 'black' dot.

Here, we use the concepts of 'Pareto Optimality' [19] and extend them to the field of object detection for the first time, to the best of our knowledge. In object detection, 'Pareto Optimality' is defined as a condition where no improvement can be made to either precision or recall without some sacrifice in the other. Such combinations of $\gamma_1$ and $\gamma_2$ which achieve an optimal trade-off between precision and recall are called 'Pareto optimal' points. A set of such Pareto optimal points is called the 'Pareto frontier'. Pareto optimal points are said to dominate non-optimal points because there exists no other combination of $\gamma_1$ and $\gamma_2$ which achieves a better precision (or recall) at a higher recall (or precision). Figure 5 shows a plot of each unique combination of $\gamma_1$ and $\gamma_2$ and their associated precision and recall as a solid 'black' dot. The Pareto frontier is highlighted in 'red' in this figure, which helps choose the best possible combinations of $\gamma_1$ and $\gamma_2$ that maximize precision at a given recall or vice versa. Figure 5, thus, can be used to thoroughly explore and select a robust choice of the hyper-parameters $\gamma_1$ and $\gamma_2$.

Although the Pareto frontier helps in choosing the best combinations of $\gamma_1$ and $\gamma_2$, setting values for $\gamma_1$ and $\gamma_2$ that maximize precision (or recall) without ensuring some minimum recall (or precision) is not recommended, as it will result in too few or too many detections, which is not practical. Thus, to avoid this, we set a minimum threshold on the F1 score of 0.5 in this case study.
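The true-positive matching protocol and the metrics of Equations 8-10 can be sketched as follows; a minimal illustration assuming detections and ground truths are given as (box, label) pairs with boxes as (x1, y1, x2, y2) tuples:

```python
def box_iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall_f1(dets, gts, iou_thr=0.5):
    """dets, gts: lists of (box, label). A detection is a true positive if
    it overlaps an unmatched same-class ground truth with IoU > iou_thr;
    each ground truth can match at most one detection. Unmatched
    detections are false positives, unmatched ground truths false
    negatives (Eqs. 8-10)."""
    matched = [False] * len(gts)
    tp = 0
    for box, label in dets:
        best, best_iou = None, iou_thr  # require IoU strictly above thr
        for j, (gbox, glabel) in enumerate(gts):
            if matched[j] or glabel != label:
                continue
            ov = box_iou(box, gbox)
            if ov > best_iou:
                best, best_iou = j, ov
        if best is not None:
            matched[best] = True
            tp += 1
    fp = len(dets) - tp
    fn = len(gts) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

In a full evaluation this routine would be applied per image after RMOPP filtering and NMS, accumulating TP/FP/FN over the whole dataset before computing the three metrics.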
The optimal hyper-parameters achieving maximum precision, recall, and F score under this condition are thus pinpointed using domi-nance principles in the Pareto frontier using Figure 5. Table2 summarizes these values. MaximizingObjective F − score Recall Precision γ γ Precision 0.50 0.35 0.88 10 0.55Recall 0.50 0.60 0.43 3.5 0.15F score 0.57 0.49 0.68 4.5 0.35Table 2: Pareto Optimal ( γ ) and ( γ ) under different max-imization objectives. For benchmark comparisons, we utilize the COCO APmetric that has been extensively used to evaluate variousstate-of-art detectors on the COCO challenge [23]. Thismetric is based on the PASCAL VOC 2012 metric whichused a similar strategy to evaluate object detectors [27].Both VOC 2012 and COCO metrics calculate the per classAverage Precision or (AP) by calculating the area under thePrecision-Recall curve. This procedure is repeated for dif-ferent ( IoU ) between 0.5 and 0.95. The individual AP forevery ( IoU ) is reported as AP [IoU] and the mean AP isreported as AP [0.5:0.95].Table 3 summarizes the comparisons between the re-sults of RMOPP and other benchmarked post-processingmethods. Using Table 3, it can be observed that RMOPPunder different optimal settings significantly improves theCOCO AP metrics for YOLOv2 without any loss in infer-ence times (FPS). Specifically, for AP [0.5:0.95], RMOPPshows a increase compared to the original benchmarkof YOLOv2 [31]. Henceforth, we mainly focus on AP[0.5] because for practical applications the detected boxesmust have at least a intersection-over-union with theground truth or it will be considered a bad detection. Us-ing Table 3, it can be observed that, the proposed methodeasily surpasses the performance by other benchmark post-processing methods such as Soft-NMS [1] and RelationNets[17] while being significantly faster on an NVIDIA Titan XGPU. Here it should be noted that although Rank-NMS [39]achieves better performance, it is significantly slower thanthe proposed method. 
Figures 6 and 7 compare the performance of YOLOv2 with traditional filtering against YOLOv2 with RMOPP. The compelling evidence in these figures demonstrates the effectiveness of RMOPP over traditional algorithms.

Table 4 compares the performance of YOLOv2 with RMOPP to other state-of-the-art object detectors [14, 22, 38]. The AP[0.5] results in Table 4 show that RMOPP improves the performance of YOLOv2 to become comparable to that of its slower but more complex successor, YOLOv3. It also shows that better post-processing can significantly improve the performance of single-stage detectors like YOLOv2 to match that of a two-stage detector like Faster R-CNN.

Method | Backbone | Post-Processing Method | FPS | γ₁ | γ₂ | AP[0.5:0.95] | AP[0.5] | AP[0.75]
YOLOv2 [31] | Darknet-19 | NMS (baseline) | | - | - | 0.216 | 0.440 | 0.192
MetaAnchor GS [42] | Darknet-19 | NMS | - | - | - | 0.212 | 0.395 | -
Faster R-CNN [17] | ResNet-50 | Soft-NMS | 9 | - | - | 0.300 | 0.523 | 0.305
Faster R-CNN [17] | ResNet-50 | RelationNets | 17 | - | - | 0.303 | 0.519 | 0.315
Faster R-CNN [39] | ResNet-50-FPN | Rank-NMS | 8 | - | - | | |
YOLOv2 | Darknet-19 | RMOPP-Best AP | | 1 | 0.1 | | |
YOLOv2 | Darknet-19 | RMOPP-Best Recall | | 3.5 | 0.15 | | |
YOLOv2 | Darknet-19 | RMOPP-Best F-score | | 4.5 | 0.35 | | |
YOLOv2 | Darknet-19 | RMOPP-Best Precision | | 10 | 0.55 | 0.216 | 0.366 | 0.229

Table 3: Comparison of RMOPP with benchmarked post-processing methods on MS-COCO.

Method | γ₁ | γ₂ | AP[0.5]

Two-Stage Detectors
Faster R-CNN+++ [14] | - | - | 0.557
Faster R-CNN w/ FPN [21] | - | - | 0.591
Faster R-CNN w/ G-RMI [18] | - | - | 0.555
Faster R-CNN w/ TDM [37] | - | - | 0.577

Single-Stage Detectors
YOLOv3 608x608 [10] | - | - | 0.579
SSD513 [11] | - | - | 0.504
DSSD513 [11] | - | - | 0.533
RetinaNet + ResNet-101-FPN [22] | - | - | 0.591
RetinaNet + ResNeXt-101-FPN [22] | - | - |
CenterNet + ResNet-18 [38] | - | - | 0.466
OneNet + ResNet-18 [38] | - | - | 0.457
YOLOv2 + RMOPP-Best AP | 1 | 0.1 |
YOLOv2 + RMOPP-Best Recall | 3.5 | 0.15 |
YOLOv2 + RMOPP-Best F-score | 4.5 | 0.35 |
YOLOv2 + RMOPP-Best Precision | 10 | 0.55 |

Table 4: Comparison of RMOPP with state-of-the-art detectors.
5. Conclusion and Future Work
This paper introduces RMOPP: a robust multi-objective post-processing algorithm that improves the performance of pre-trained object detectors with a negligible impact on speed. Unlike existing uni-objective post-processing methods, the proposed algorithm allows for simultaneous optimization of precision and recall in object detection pipelines. When applied to YOLOv2, RMOPP showed a substantial improvement in average precision, matching the performance of slower but more complex object detectors such as YOLOv3 and Faster R-CNN. To the best of our knowledge, this work presents the first known usage of Pareto frontiers for post-processing hyper-parameter optimization in the field of object detection. In future work, a Bayesian post-processing technique may be feasible to update the post-processing hyper-parameters for every input image and reduce training bias.

Figure 6: YOLOv2+RMOPP shows significant improvement in results. (a) YOLOv2 + Baseline; (b) YOLOv2 + RMOPP.

Figure 7: YOLOv2+RMOPP shows significant improvement in results. (a) YOLOv2 + Baseline; (b) YOLOv2 + RMOPP.
References

[1] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561–5569, 2017.
[2] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[4] Abdallah Chehade, Scott Bonk, and Kaibo Liu. Sensory-based failure threshold estimation for remaining useful life prediction. IEEE Transactions on Reliability, 66(3):939–949, 2017.
[5] Abdallah Chehade and Zunya Shi. Sensor fusion via statistical hypothesis testing for prognosis and degradation analysis. IEEE Transactions on Automation Science and Engineering, 16(4):1774–1787, 2019.
[6] Abdallah A. Chehade and Ala A. Hussein. Latent function decomposition for forecasting Li-ion battery cells capacity: A multi-output convolved Gaussian process approach, 2019.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database, pages 248–255, 2010.
[8] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6569–6578, 2019.
[9] Kaiwen Duan, Lingxi Xie, Honggang Qi, Song Bai, Qingming Huang, and Qi Tian. Corner proposal network for anchor-free, two-stage object detection. arXiv preprint arXiv:2007.13816, 2020.
[10] Ali Farhadi and Joseph Redmon. YOLOv3: An incremental improvement. Computer Vision and Pattern Recognition, 2018.
[11] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[12] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[13] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides, and Xiangyu Zhang. Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2888–2897, 2019.
[16] Jan Hosang, Rodrigo Benenson, and Bernt Schiele. Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4507–4515, 2017.
[17] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3588–3597, 2018.
[18] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 3296–3305, 2017.
[19] Gandhinathan Karuppusami and R. Gandhinathan. Pareto analysis of critical success factors of total quality management. The TQM Magazine, 2006.
[20] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
[21] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):318–327, 2020.
[23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Lecture Notes in Computer Science, volume 8693, pages 740–755, 2014.
[24] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. International Journal of Computer Vision, 2019.
[25] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 8759–8768, 2018.
[26] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. Lecture Notes in Computer Science, 9905:21–37, 2016.
[27] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[28] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 821–830, 2019.
[29] Steven Puttemans, Timothy Callemein, and Toon Goedeme. Building robust industrial applicable object detection models using transfer learning and single pass deep learning architectures. In VISIGRAPP 2018: Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, volume 5, pages 209–217, 2018.
[30] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[31] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 6517–6525, 2017.
[32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.
[33] Mayuresh Savargaonkar and Abdallah Chehade. An adaptive deep neural network with transfer learning for state-of-charge estimations of battery cells, pages 598–602. IEEE, 2020.
[34] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks, 2014.
[35] Zunya Shi and Abdallah Chehade. A dual-LSTM framework combining change point detection and remaining useful life prediction. Reliability Engineering & System Safety, 205:107257, 2021.
[36] Zunya Shi, Mayuresh Savargaonkar, Abdallah A. Chehade, and Ala A. Hussein. A long short-term memory network for online state-of-charge estimation of Li-ion battery cells, pages 594–597. IEEE, 2020.
[37] Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik, and Abhinav Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016.
[38] Peize Sun, Yi Jiang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. OneNet: Towards end-to-end one-stage object detection. arXiv preprint arXiv:2012.05780, 2020.
[39] Zhiyu Tan, Xuecheng Nie, Qi Qian, Nan Li, and Hao Li. Learning to rank proposals for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8273–8281, 2019.
[40] Yunong Tian, Guodong Yang, Zhe Wang, Hao Wang, En Li, and Zize Liang. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Computers and Electronics in Agriculture, 157:417–426, 2019.
[41] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness NMS and bounded IoU loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6877–6885, 2018.
[42] Tong Yang, Xiangyu Zhang, Zeming Li, Wenqiang Zhang, and Jian Sun. MetaAnchor: Learning to detect objects with customized anchors. Advances in Neural Information Processing Systems, pages 320–330, 2018.
[43] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z. Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9759–9768, 2020.