One-Click Annotation with Guided Hierarchical Object Detection
Adithya Subramanian, Anbumani Subramanian
Intel, Bangalore, India
[email protected], [email protected]
Abstract
The increase in data collection has made data annotation an interesting and valuable task in the contemporary world. This paper presents a new methodology for quickly annotating data using click supervision and hierarchical object detection. The proposed work is semi-automatic in nature, where the task of annotation is split between the human and a neural network. We show that our improved method of annotation reduces the time, cost and mental stress on a human annotator. The research also highlights how our method performs better than the current approach under different circumstances such as variation in the number of objects, object size and datasets. Our approach also proposes a new way of using object detectors, making them suitable for the data annotation task. The experiments conducted on the PASCAL VOC dataset reveal that annotations created with our approach achieve a mAP of 0.995 and a recall of 0.903. Our approach shows an overall improvement of 8.5% and 18.6% in mean average precision and recall for the KITTI dataset and of 69.6% and 36% for the CITYSCAPES dataset. The proposed framework is 3-4 times faster as compared to the standard annotation method.
1. Introduction
Annotated data is an extremely valuable asset in both academia and industry. The availability of data has given the data annotation task a greater importance in society. A lot of deep learning research has focused on improving the generalizability of deep neural networks with only a small set of training samples, which is known as few-shot learning [23, 8, 10, 24]. In contrast to this line of research, little attention has been paid to improving the annotation process itself to make more data available, allowing models to generalize well.
The current annotation strategy for object detection involves clicking on the top-left and bottom-right corners of the object, but this task puts the user under heavy mental stress, and a huge amount of time is consumed in finding an extremely tight bounding box. The same has been shown in multiple research papers [12, 17, 16]. To avoid both the mental stress and the time consumption, we propose a semi-automatic approach which combines the best of both worlds, i.e. the accuracy of the human eye and the speed of neural networks.
Current object detectors cannot be used for the purpose of annotating data as they have low recall and mean average precision scores. This leads to unreliable results, where the object detector might have misclassified an object or might have made a completely wrong prediction both in terms of classification and localization. A low recall score suffers from a more severe form of the same issue, as the detector classifies an object as background. When a new network is trained with these annotations, it will not be able to generalize well for the incorrectly classified, incorrectly localized and missed-out object categories in the annotations.
Our approach is robust to these issues. The proposed framework, being semi-automatic, only acquires partial annotations from the annotator by making them click on the object centers.
At the same time, the framework predicts intermediate detections from the object detector. These detections are then refined using the human-annotated object centers, removing incorrect classifications as well as incorrect localizations. The object centers are further used by the detector to create object proposals when it fails to predict an object at a clicked center. The created object proposals are fed back into the network as input for detecting the objects. This process is iterated hierarchically until the object is detected.
The remaining sections are ordered as follows: Section 2 highlights the related work and Section 3 describes the proposed work. Experimental results are discussed in Section 4 and the conclusion is drawn in Section 5. The references for this work are listed after the conclusion.
2. Related work
Research in the field of deep learning has focused on using unlabelled or partially labelled data by developing models in the semi-supervised, unsupervised or weakly supervised learning paradigms to reduce the dependence of the model on annotated data. In the semi-supervised learning paradigm, Tang et al. [25] proposed a novel algorithm for large-scale semi-supervised object detection by imbibing knowledge from visual and semantic cues. Rhee et al. [20] developed an object detector in the semi-supervised learning paradigm which is initially trained on a set of perfectly labelled examples and then uses active learning to batch imperfect and unlabelled samples. Weakly supervised learning has also been in the spotlight lately. Li et al. [13] worked on weakly supervised learning by making use of progressive domain adaptation to solve the problems of model initialization and local-minima convergence, which are common issues in the weakly supervised learning paradigm. Zhang et al. [27] proposed a new state-of-the-art weakly supervised model which combines saliency detection and weakly supervised object detection based on self-paced curriculum learning. There has also been work on models that operate in both the weakly supervised and the semi-supervised paradigm, such as [26].
The fatal problem with all these approaches is again the requirement of data to generalize well. The semi-supervised as well as the weakly supervised approaches barely match the performance of a fully supervised object detector such as Faster R-CNN [19], Single Shot Multi-Box Detector [15], YOLO9000 [18] or RetinaNet [14]. The availability of data thus proves to be the easiest route to state-of-the-art performance in deep learning based object detection.
Researchers realized this fact and started to develop tools as well as automated algorithms to annotate data efficiently and reduce the human effort, but progress in this direction is scant.
Bianco et al. [1] developed a tool which uses algorithms such as linear interpolation, template matching and a supervised object detector, depending on the mode of operation (manual, semi-automatic or fully automatic), aiding the annotator in speeding up annotation and allowing deep networks to learn from considerably larger annotated data. Fagot-Bouquet et al. [7] attempted to annotate videos by propagating annotations throughout the frames using an offline tracker, followed by dynamic programming and a distance transformation to penalize the displacement between frames. Konyushkova et al. [12] showed a different perspective on human-computer interaction for data annotation by choosing the best sequence of actions to annotate images in the least amount of time; this is learnt from previous experience using Q-learning to obtain an approximately optimal policy. Similar interactive annotation methods have also been explored for the semantic segmentation annotation task [2, 21, 3, 5, 11, 22].
On the other hand, recent research by Papadopoulos et al. [17] explores using object centers as supervision for Multiple Instance Learning frameworks in the visual detection task, which can make use of the data available on the Internet. Papadopoulos et al. [16] also worked on creating bounding boxes using four clicks on the extreme left, top, right and bottom points of the object to annotate data more intuitively, resulting in a 7-times-faster annotation method.
3. Proposed Work
The proposed work can be divided into two sub-sections. The first discusses the method followed to obtain the object centers, which requires a minimal amount of user interaction but captures maximum information; it also details the complete workflow of each step taken by the annotator to annotate the data. The second section explains the methodology followed to achieve state-of-the-art performance in the click-guided object detection task.

3.1. Mechanism to capture annotations
This section explains the steps followed to acquire one-click annotations, i.e. the object centers of the objects in the image. The first step is to display the input image to the annotator, who selects the class to be annotated as shown in Fig 1. Once the class is selected, the user clicks on the center point of each object belonging to the selected class as shown in Fig 2. This process stores the object centers along with the associated class information, and instant feedback is provided by a red dot, making the user aware of the clicks made. The user then changes the class to be annotated. These steps are repeated until all the object centers are captured. Once the process is finished, the annotation information is passed to the network and the annotation results are returned, which can be used by the annotator to improve their clicking accuracy as shown in Fig 3.
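The click-capture steps above can be sketched as a small state machine. This is a minimal illustration, not the paper's tool: the class names and the `ClickAnnotator` interface are our own, and in a real GUI `record_click` would be wired to a mouse callback (e.g. OpenCV's `cv2.setMouseCallback`) that also draws the red feedback dot.

```python
class ClickAnnotator:
    """Stores one-click annotations: one object center per click,
    tagged with the currently selected class."""

    def __init__(self, class_names):
        self.class_names = class_names
        self.current_class = None
        self.clicks = []          # list of (class_name, x_center, y_center)

    def select_class(self, name):
        # Step 1: the annotator picks the class to annotate next.
        if name not in self.class_names:
            raise ValueError(f"unknown class: {name}")
        self.current_class = name

    def record_click(self, x, y):
        # Step 2: each click on an object center is stored with the
        # active class; a GUI would also draw a red feedback dot here.
        if self.current_class is None:
            raise RuntimeError("select a class before clicking")
        self.clicks.append((self.current_class, x, y))

    def export(self):
        # Step 3: once all centers are captured, the clicks are handed
        # to the detection network as partial annotations.
        return list(self.clicks)


ann = ClickAnnotator(["car", "person"])
ann.select_class("car")
ann.record_click(120, 80)
ann.record_click(340, 95)
ann.select_class("person")
ann.record_click(60, 150)
print(ann.export())
# [('car', 120, 80), ('car', 340, 95), ('person', 60, 150)]
```

The exported list is exactly the partial annotation the detection stage consumes: centers plus class labels, with no box extents.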
Figure 1. The class selection page
Figure 2. Instant click feedback
3.2. Guided hierarchical object detection

Hierarchical object detection consists of a base detector trained on a dataset with the same labels as the data to be annotated; the base detector used in our experiments is YOLO9000, trained down to a fairly low loss of 1.3332. The working pipeline of the detector can be broken down into a sequence of steps as seen in Fig 4 and 5, and the pseudo-code for the same can be seen in Algorithm 1.

Figure 3. The bounding box results with probability of the object
Figure 4. Improving results from standard detector using one-click annotation
Figure 5. Improving results from standard detector using guided hierarchical object detection

The framework uses the click data collected from the annotator, consisting of object centers and the classes they belong to. These object centers and their associated class information are used to validate the results obtained from the standard object detector, and depending on the resulting accuracy of the standard detector, hierarchical object detection is performed. If the standard object detector is successful in detecting all the objects in the image, then the predicted object centers are replaced with the annotated object centers, and if an object is classified wrongly, its class information is replaced with the acquired class information. Similarly, if any background region is classified as an object, it is filtered out with the help of these human annotations. If the object detector fails to detect objects in the image, hierarchical object detection comes into play; this way the time consumption of the annotation process is reduced. Hierarchical object detection plays an important role in improving the recall score of the model, as the human annotations alone can only improve the mean average precision but not the recall.
The framework first detects the locations where the standard object detector has failed to detect an object by comparing the annotations from the human and the model.
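The click-based validation of detector output can be sketched as follows. This is an illustrative reimplementation, not the paper's code: the detection format (`cx, cy, w, h, label, prob`), the Euclidean distance test and the 30-pixel threshold are all assumptions.

```python
import math

def refine(detections, clicks, dist_thresh=30.0):
    """detections: list of dicts with cx, cy, w, h, label, prob.
    clicks: list of (label, x, y) human-annotated object centers.
    Returns (refined_detections, missed_clicks)."""
    refined, used, missed = [], set(), []
    for label, x, y in clicks:
        # Find the nearest not-yet-matched detection to this click.
        best, best_d = None, float("inf")
        for i, det in enumerate(detections):
            if i in used:
                continue
            d = math.hypot(det["cx"] - x, det["cy"] - y)
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= dist_thresh:
            det = dict(detections[best])
            det["cx"], det["cy"] = x, y      # trust the human center
            det["label"] = label             # trust the human label
            refined.append(det)
            used.add(best)
        else:
            missed.append((label, x, y))     # handled hierarchically later
    # Detections matched to no click are treated as false positives
    # and simply dropped.
    return refined, missed


dets = [
    {"cx": 100, "cy": 100, "w": 40, "h": 40, "label": "dog", "prob": 0.9},
    {"cx": 400, "cy": 50, "w": 30, "h": 30, "label": "cat", "prob": 0.4},
]
clicks = [("cat", 105, 98), ("dog", 250, 250)]
refined, missed = refine(dets, clicks)
print(refined)   # one corrected box, relabelled "cat", centered on the click
print(missed)    # [('dog', 250, 250)] -> passed to hierarchical detection
```

In one pass this covers the three correction cases in the text: re-centering, relabelling, and filtering background false positives, while the unmatched clicks are exactly the inputs to the hierarchical stage.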
Object proposals are created at these locations using the width and height information associated with the anchor boxes located in the same grid. These object proposals are fed into the object detector for detection. The process of generating object proposals is applied repeatedly until all the missed objects are detected, which makes the detection task hierarchical in nature. The detection results among all the object proposals at any level are chosen based on the probability of the detections, and these results are propagated back to the higher level of the hierarchy once the best among them is chosen. The probability score of a detection transitioning from a lower to a higher level is multiplied by the confidence value at the higher level. The intuition behind this step is to make sure that the annotator knows the difficulty faced by the detector in annotating the object, which can be used for post-processing of the coarse annotations. Any particular branch in the object proposal tree is expanded only when the resulting probability of the detection from that branch is higher than that of the neighbouring branches, as seen in Fig 6; this helps reduce computation time and memory consumption, removing the dependency of the framework on high-compute devices.

Algorithm 1: Guided hierarchical object detection

Result: Guided hierarchical object detection
  1. Initialize X, Y and class to empty lists.
  while all objects are not annotated do
      1. Click on the object o_i to be annotated in the palette window.
      2. class ← o_i
      3. Click on the selected object's center (x_c, y_c) in the image window.
      4. X ← x_c ; Y ← y_c
  end
  2. N ← number of network predictions
  3. Initialize W and H to empty lists.
  4. K ← number of anchor boxes
  5. T ← length of the object proposal tree
  6. h ← 0 ; i ← 0
  while i < number of clicks do
      Find the K nearest neighbours (the network-detected object centers) of the i-th object center.
      if the detected neighbours belong to the same class and are closer than the threshold distance then
          1. Choose the closest point; among ties, choose the one with the highest probability.
          2. W ← width of the closest neighbour ; H ← height of the closest neighbour
          3. Return the center, width, height and probability of the bounding box.
      else
          S1: if h < T then
              h ← h + 1
              for all the missed object centers do
                  1. Extract the width and height of the anchors located at these object centers.
                  2. Use the width and height along with the object centers to create object proposals.
                  3. Apply hierarchical object detection to the object proposals.
                  if there are no detections then goto S1
                  4. Choose the bounding box with the highest probability among all the object proposals and return its object center, width, height and probability to the higher level.
              end
          else
              Return an empty box.
          end
      end
  end

Figure 6. Object proposal tree pruning
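The hierarchical expansion for one missed click can be sketched as a short recursion. This is a simplified stand-in for the YOLO9000-based pipeline: the detector interface (`detect_fn`, returning `(prob, box)` or `None` per proposal), the anchor list and the 1.5x proposal growth per level are all illustrative assumptions.

```python
def hierarchical_detect(center, anchors, detect_fn,
                        depth=0, max_depth=3, parent_conf=1.0):
    """Score object proposals around `center`; if none fires, grow the
    proposals and descend one level of the proposal tree, up to
    `max_depth` levels (T in Algorithm 1)."""
    if depth >= max_depth:
        return None                          # give up: empty box
    best = None
    for w, h in anchors:
        result = detect_fn((center[0], center[1], w, h))
        if result is not None:
            prob, box = result
            # A detection bubbling up is scaled by the confidence of
            # the level above, flagging objects the detector found hard.
            scored = (prob * parent_conf, box)
            if best is None or scored[0] > best[0]:
                best = scored
    if best is not None:
        return best                          # propagate the best branch up
    grown = [(1.5 * w, 1.5 * h) for w, h in anchors]
    return hierarchical_detect(center, grown, detect_fn,
                               depth + 1, max_depth, parent_conf)


def demo_detect(proposal):
    # Toy detector: fires only once a proposal is large enough.
    x, y, w, h = proposal
    return (0.8, (x, y, w, h)) if w >= 60 else None

print(hierarchical_detect((50, 50), [(20, 20), (30, 30)], demo_detect))
# Fires at the third level, once the 30-pixel anchor has grown to 67.5.
```

A full implementation would also expand only the most promising branch, as in the Fig 6 pruning; here every anchor at a level is scored for clarity.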
4. Experimental Results
This section analyzes multiple aspects of both segments discussed in the proposed-work section, i.e. the one-click method and the hierarchical object detection.
This section briefly discusses how our annotation approach differs from the standard annotation approach in terms of computational power, object scale, number of objects and different datasets. Feedback from users with little domain knowledge of deep learning and data annotation made the following claims about our approach: • Our approach saves an incredible amount of time. • Our approach is easier to use when there are multiple objects to be annotated in the image. • Our approach causes much less mental stress when objects are placed at a far depth from the point of capture, i.e. when the objects are extremely small.
Table 1 shows the average time taken to annotate an image on GPU and on CPU using our approach, alongside the time taken using the standard approach. The results clearly show that our approach is advantageous on both CPU and GPU, as it is 3-4 times faster than the standard annotation process. Table 2 shows the time consumed by our approach as the type of GPU changes. The results show that a modest GPU is enough to annotate the data, with only a minute difference in annotation time.
Table 3 shows the comparison between our approach and the standard approach as the object size varies. The results in table 3 clearly indicate that our approach is the better option when it comes to annotating objects at a smaller scale as compared to the standard approach.

Table 1. Time comparison table (in seconds)
Method              GPU    CPU
Our Approach        –      –
Standard Approach   65.5   65.5
Table 2. Time comparison across multiple GPUs (in seconds)
Method              NVIDIA TITAN X   GTX 1080 Ti   GTX 1050 Ti
Our Approach        –                –             –
Standard Approach   –                –             –
Table 3. Time comparison: size of the object (in seconds)
Method              300+   200–300   100–200   50–100   30–50   <30
Our Approach        –      –         –         –        –       –
Standard Approach   14.1   12.2      10.03     9.07     9.06    6.55
This section analyses the effectiveness of our approach as the number of objects in the image increases; the results can be viewed in Table 4. The table shows that the time difference grows radically as the number of objects in the image increases.
Table 4. Time comparison: number of objects (in seconds)
Method              1      2       4      7      12+
Our Approach        –      –       –      –      –
Standard Approach   7.87   15.66   28.9   40.7   60.0
Our approach is applicable to datasets such as PASCAL VOC [6], KITTI [9] and CITYSCAPES [4]. Table 5 shows the consistency of the framework over multiple datasets, proving that our approach is robust to changes in the distribution of the data.
Table 5. Time comparison: different datasets (in seconds)
Method              PASCAL VOC   CITYSCAPES   KITTI
Our Approach        –            –            –
Standard Approach   34.5         52.8         66.6
Hierarchical object detection also depends on multiple parameters influencing its accuracy and computational time; this section gives a brief overview of these parameters.

4.2.1 Anchor boxes
The anchor boxes play a vital role in detecting the objects that were missed in the first iteration of detection, but at the same time they take a heavy toll on the computational time. Table 6 shows the trade-off between accuracy and time based on the number of anchor boxes. The results show that as the number of anchor boxes increases, the mean average precision score does increase, but so does the computational time of annotation.
Table 6. Time vs. accuracy comparison on varying number of anchors
Number of Anchors   Mean average precision   Recall   Time taken (seconds)
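The cost side of this trade-off can be made concrete with a back-of-the-envelope sketch (our own arithmetic, not a measurement from the paper): with K anchor shapes and a proposal tree of depth T, the worst-case number of proposals the detector must score grows geometrically, and the branch pruning of Fig 6 is what keeps the practical cost far below this bound.

```python
def worst_case_proposals(num_anchors: int, tree_depth: int) -> int:
    # One proposal per anchor shape at every node of each level:
    # K + K^2 + ... + K^T proposals in the unpruned tree.
    return sum(num_anchors ** level for level in range(1, tree_depth + 1))

for k in (3, 5, 9):
    print(k, worst_case_proposals(k, 3))
# 3 -> 39, 5 -> 155, 9 -> 819 proposals in the worst case
```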
4.2.2 Hierarchy count

The hierarchy count is the number of iterations the model runs over the anchor-box-based object proposals; Table 7 shows the accuracy vs. computational time trade-off. The results show that as the size of the hierarchy increases, the results do not improve much, while the time consumption keeps increasing.
Table 7. Time vs. accuracy comparison on hierarchy count
Hierarchy count   Mean average precision   Recall   Time taken (seconds)
–                 0.997                    0.805    19.53
–                 0.997                    0.799    20.84
–                 0.997                    0.797    22.25
–                 0.997                    0.798    24.–

Table 8. Comparison between results from standard object detector and hierarchical object detector
            Hierarchical object detector      Standard object detector
Data set    Mean average precision   Recall   Mean average precision   Recall
PASCAL VOC  0.999                    –        0.85                     –
KITTI       0.997                    –        0.912                    –
CITYSCAPES  0.924                    –        0.227                    –

Table 8 describes the performance of the hierarchical object detector on multiple datasets. It can be observed that the hierarchical object detector boosts the performance of annotation both in terms of mean average precision and recall. The detection results for different types of scenarios are listed below, where the detections from the standard detector are compared to those of our approach.
This is the case where the standard object detector performs optimally, detecting all the objects of interest in the image. The human-annotated object centers come in handy here: the precise center co-ordinates provided by the annotator are used instead of the network-predicted centers to re-localize the detected objects. The change in the bounding boxes can be viewed in Fig 7 and Fig 8. In Fig 8 we can observe that the boxes are much better centered on the objects than in Fig 7. The offset in Fig 7 arises from the loss of spatial information that occurs when the image is squeezed into a smaller grid: some precision about the exact location of the object is sacrificed in order to obtain denser features for accurate classification.
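The re-localization step described above amounts to keeping the predicted width and height while snapping the box onto the clicked center. A minimal sketch, assuming a `(cx, cy, w, h)` box format (the format is our assumption):

```python
def recenter(box, click):
    """Keep the predicted size, replace the center with the human click."""
    cx, cy, w, h = box
    return (click[0], click[1], w, h)

def to_corners(box):
    # Convert center format to (x_min, y_min, x_max, y_max) corners.
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# The predicted box is slightly off-center; the click fixes it.
fixed = recenter((118.0, 74.0, 60.0, 40.0), (120, 80))
print(to_corners(fixed))   # (90.0, 60.0, 150.0, 100.0)
```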
Figure 7. Object detection results from the standard detector
Figure 8. Improving correctly labelled and localized data using one-click annotation
There are certain cases where the network wrongly labels objects in the image, as can be seen in Fig 9. Detections of such a spurious nature are found by comparing the predicted object labels with the human-annotated object labels. The spurious detections are corrected using the label information and object center, while the rest of the predicted information remains the same. The result of object detection after this correction can be seen in Fig 10.
Figure 9. Object detection results from the standard detector
Figure 10. Improving incorrectly labelled but correctly localized data using one-click annotation

These sets of detections are popularly termed false positives. False positives play an important role in determining the mean average precision. The object detector has a low mean average precision due to incorrect classification and localization of objects in the image; thus, for any annotated data it is desirable to have a very high mean average precision. Fig 11 and 12 show the results after removing such spurious detections.
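The effect of this filtering on precision is simple arithmetic; the counts below are made up purely for illustration, not taken from the paper's experiments.

```python
def precision(tp: int, fp: int) -> float:
    # Precision = true positives / all positive predictions.
    return tp / (tp + fp)

before = precision(tp=18, fp=6)   # raw detector output
after = precision(tp=18, fp=0)    # after click-based filtering
print(before, after)              # 0.75 1.0
```

Because every surviving detection is anchored to a human click, the false-positive count of the final annotations drops toward zero, which is what drives the near-perfect mAP figures reported above.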
Figure 11. Object detection results from a standard detector
Figure 12. Improving incorrectly labelled and localized data using one-click annotation
In this subsection the case of a missed object is explained; this is where hierarchical object detection comes into action. The detection results from the standard object detector can be seen in Fig 13 and the results from hierarchical object detection in Fig 14.
Figure 13. Object detection results from a standard object detector
Figure 14. Hierarchical object detector improving unlabelled and unlocalized data

5. Conclusion and Future work

The proposed framework provides a novel solution to the problem of annotating data with the least amount of human effort, both mental and physical, by harnessing the maximum amount of information with the minimum amount of interaction with the computer. The framework acts as the current state of the art in the guided object detection task, attaining mean average precision scores of 99.95, 99.7 and 99.23 and recall scores of 90.38, 80.1 and 45.02 on PASCAL VOC, KITTI and CITYSCAPES respectively. The framework reduces annotation time, cost and mental stress. It also removes spurious human object-center annotations, reducing the time required for refining the annotations, and has proven to be 3-4 times faster than the standard annotation procedure.
There lies a lot of unexploited potential in the framework which can be taken up in future research. One limitation is that the network finds it difficult to annotate similar objects that are placed close together. In the first image in Fig 15 we can observe that although both parrots were clicked by the annotator, one of them masks its presence: a bounding box is created around the more prominent parrot, leaving the clicked center of the other outside the box, which results in only a single detection. The same trend can be observed in the other images of Fig 15, for the dogs in the second image and the cows in the third.
Figure 15. Performance of hierarchical object detector when two similar objects are close
Another direction for future work lies in improving the detection by allowing the user to click anywhere on the object, since there are many situations where an unwanted object occludes the visual and spatial characteristics of the object of interest, i.e. it may occlude the object's center point, resulting in poor-quality anchor boxes being used as object proposals. An example can be seen in Fig 16, where an object of interest is occluded by an uninteresting object.

Figure 16. Performance of hierarchical object detector when object centers are occluded
References

[1] S. Bianco, G. Ciocca, P. Napoletano, and R. Schettini. An interactive tool for manual, semi-automatic and automatic video annotation. Computer Vision and Image Understanding, 131:88–99, 2015.
[2] Y. Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 1, pages 105–112. IEEE, 2001.
[3] L. Castrejón, K. Kundu, R. Urtasun, and S. Fidler. Annotating object instances with a Polygon-RNN. In CVPR, volume 1, page 2, 2017.
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[5] S. Dutt Jain and K. Grauman. Predicting sufficient annotation strength for interactive foreground segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1313–1320, 2013.
[6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[7] L. Fagot-Bouquet, J. Rabarisoa, and Q. C. Pham. Fast and accurate video annotation using dense motion hypotheses. In Image Processing (ICIP), 2014 IEEE International Conference on, pages 3122–3126. IEEE, 2014.
[8] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
[9] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[10] N. Hilliard, L. Phillips, S. Howland, A. Yankov, C. D. Corley, and N. O. Hodas. Few-shot learning with metric-agnostic conditional embeddings. arXiv preprint arXiv:1802.04376, 2018.
[11] S. D. Jain and K. Grauman. Click carving: Segmenting objects in video with point clicks. arXiv preprint arXiv:1607.01115, 2016.
[12] K. Konyushkova, J. Uijlings, C. Lampert, and V. Ferrari. Learning intelligent dialogs for bounding box annotation. arXiv preprint arXiv:1712.08087, 2017.
[13] D. Li, J.-B. Huang, Y. Li, S. Wang, and M.-H. Yang. Weakly supervised object localization with progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3512–3520, 2016.
[14] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[16] D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari. Extreme clicking for efficient object annotation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4940–4949. IEEE, 2017.
[17] D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari. Training object class detectors with click supervision. arXiv preprint arXiv:1704.06189, 2017.
[18] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint, 2017.
[19] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[20] P. K. Rhee, E. Erdenee, S. D. Kyun, M. U. Ahmed, and S. Jin. Active and semi-supervised learning for object detection with imperfect data. Cognitive Systems Research, 45:109–123, 2017.
[21] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG), volume 23, pages 309–314. ACM, 2004.
[22] N. Shankar Nagaraja, F. R. Schmidt, and T. Brox. Video segmentation with just a few strokes. In Proceedings of the IEEE International Conference on Computer Vision, pages 3235–3243, 2015.
[23] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090, 2017.
[24] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. arXiv preprint arXiv:1711.06025, 2017.
[25] Y. Tang, J. Wang, X. Wang, B. Gao, E. Dellandrea, R. Gaizauskas, and L. Chen. Visual and semantic knowledge transfer for large scale semi-supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[26] Z. Yan, J. Liang, W. Pan, J. Li, and C. Zhang. Weakly- and semi-supervised object detection with expectation-maximization algorithm. arXiv preprint arXiv:1702.08740, 2017.
[27] D. Zhang, D. Meng, L. Zhao, and J. Han. Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning. arXiv preprint arXiv:1703.01290, 2017.