Counting and Locating High-Density Objects Using Convolutional Neural Network
Mauro dos Santos de Arruda, Lucas Prado Osco, Plabiany Rodrigo Acosta, Diogo Nunes Gonçalves, José Marcato Junior, Ana Paula Marques Ramos, Edson Takashi Matsubara, Zhipeng Luo, Jonathan Li, Jonathan de Andrade Silva, Wesley Nunes Gonçalves
Mauro dos Santos de Arruda
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

Lucas Prado Osco
University of Western São Paulo, Presidente Prudente, SP, Brazil — [email protected]

Plabiany Rodrigo Acosta
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

Diogo Nunes Gonçalves
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

José Marcato Junior
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

Ana Paula Marques Ramos
University of Western São Paulo, Presidente Prudente, SP, Brazil — [email protected]

Edson Takashi Matsubara
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

Zhipeng Luo
Xiamen University, Xiamen, FJ, China — [email protected]

Jonathan Li
University of Waterloo, Waterloo, ON, Canada — [email protected]

Jonathan de Andrade Silva
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

Wesley Nunes Gonçalves ∗
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]
February 2021

Abstract
This paper presents a Convolutional Neural Network (CNN) approach for counting and locating objects in high-density imagery. To the best of our knowledge, this is the first object counting and locating method based on a feature map enhancement and a Multi-Stage Refinement of the confidence map. The proposed method was evaluated on two counting tasks: trees and cars. For the tree dataset, our method returned a mean absolute error (MAE) of 2.05, a root-mean-squared error (RMSE) of 2.87 and a coefficient of determination (R²) of 0.986. For the car datasets (CARPK and PUCPR+), our method was superior to state-of-the-art methods. In these datasets, our approach achieved an MAE of 4.45 and 3.16, an RMSE of 6.18 and 4.39, and an R² of 0.975 and 0.999, respectively. The proposed method is suitable for dealing with high object density, returning a state-of-the-art performance for counting and locating objects.
Keywords: Deep learning · Object counting · Tree counting · Car counting

∗ Corresponding author: [email protected]
In computer vision, counting and locating objects in images have attracted the attention of several approaches [Sindagi & Patel, 2018]. These methods help to control and count people [Idrees et al., 2018], support car detection [Hsieh et al., 2017], monitor wildlife [d. S. de Arruda et al., 2018], and even count bacterial colonies [Ferrari et al., 2017]. As expected, the majority of these methods are based on the well-known object detection task, including recent methods based on convolutional neural networks (Faster R-CNN [Ren et al., 2017], Mask R-CNN [He et al., 2020], RetinaNet [Lin et al., 2020]), multi-scale variants (Multi-Scale Structures [Ohn-Bar & Trivedi, 2017], Multi-scale deep feature learning network [Ma et al., 2020], Gated CNN [Yuan et al., 2019]) and ensembles of models [Xu et al., 2020]. Many object detection methods consider a bounding box (bbox) around the targeted objects and can provide both location (center of the bbox) and counting (number of bboxes). A recent contribution to this matter is from Hsieh et al. [2017], where the authors simultaneously proposed Layout Proposal Networks (LPNs) and spatial kernels to detect objects in video. These additions helped to improve object counting and location using an object detection framework. Still, even state-of-the-art methods return bounding boxes that partially overlap multiple objects, which remains a problem since the adjacent object region is detected as a separate object [Goldman et al., 2019].

One of the biggest challenges regarding the counting and location of objects in images is high object density. Object detection methods are, in general, not adequate for high-density scenes [Goldman et al., 2019]. In this scenario, overlapping objects are difficult to analyze due to the size of the instances and the standpoint of the scene. Thus, approaches that model the counting problem with a density estimation have been established as state-of-the-art solutions and are providing interesting results for dense scenes [Goldman et al., 2019, Aich & Stavness, 2018]. In Goldman et al. [2019], the authors proposed a CNN-based detection method, using the bounding box, to cope with densely packed scenes. They considered a layer to estimate a quality score index and used a novel EM (Expectation-Maximization) merging unit to solve the overlap ambiguities with this score. However, handling high-density objects in images is still a concerning issue, both for counting and for locating objects.

Another problem regarding object counting from detection frameworks is the need for detailed ground-truth labeled data, which is hard to obtain at large scales [Russakovsky et al., 2015]. Acquiring large-scale annotated data is a time-consuming process. Because of that, approaches based on lighter-weight image labels have been previously proposed [Zhang et al., 2018, Fiaschi et al., 2012]. Still, recent studies are implementing point annotations to reduce the supervision task [Aich & Stavness, 2018, Liu et al., 2019]. Point annotations are easier to obtain than bounding boxes, and many counting and locating approaches do not need to rely on bounding boxes to identify an object [Liu et al., 2019].
These types of approaches can rely on context information and, for most problems, object instances share a similar color, texture, and shape, meaning that the method can learn to recognize them even when using only point features [Aich & Stavness, 2018].

Recently, state-of-the-art methods to count objects include the VGG-GAP and VGG-GAP-HR approaches [Aich & Stavness, 2018], Layout Proposal Networks (LPN) [Hsieh et al., 2017] and Deep IoU CNN [Goldman et al., 2019]. These methods were applied to counting and locating cars, crowds, biological cells, and products on supermarket shelves, returning impressive performances in high-density scenes. Despite the promising results, scale variations, cluttered background, occlusions, and especially high object density are still challenges that hinder methods from providing high-quality predictions. In previous work, we developed an initial model for locating and counting citrus trees in UAV multispectral images [Osco et al., 2020]. This initial model significantly surpassed object detection methods such as RetinaNet and Faster R-CNN.

In this paper, we propose a method for counting and locating objects based on a convolutional neural network [Simonyan & Zisserman, 2015]. Unlike previous research that estimated a rectangle for each object, the present method is based on a density estimation map with the confidence that an object occurs in each pixel, following [Aich & Stavness, 2018]. Differently from previous work [Osco et al., 2020], the proposed method uses a feature map enhancement with a Pyramid Pooling Module (PPM) [Zhao et al., 2017] that allows our method to incorporate global information at different scales. Consequently, the proposed method incorporates sufficient global context information for a good characterization of objects, similarly to Zhang et al. [2019] with their hierarchical context module. Thus, we hypothesize that this approach is better suited to situations of high object density.

Aside from that, another potential pitfall of previous methods is missed detections due to object occlusion and high density. To compensate for these detections, and produce the correct count label, a Multi-Stage Refinement over the ground-truth map is proposed. By implementing multiple stages, our method provides hierarchical learning of the object position, starting from a rough prediction and moving to a more refined prediction of the center of the object. Therefore, our method is divided into four main phases: 1) feature map generation using a CNN; 2) feature map enhancement with a Pyramid Pooling Module (PPM); 3) a Multi-Stage Refinement of the confidence maps; and 4) obtaining the object positions through peaks in the confidence map.
Experiments were performed on three datasets. First, we evaluated the method parameters in a tree counting dataset containing 3,370 images and approximately 232,000 objects. Second, we verified its generalization by comparing it with state-of-the-art methods on two car counting benchmarks: CARPK and PUCPR+. The proposed method outperforms state-of-the-art methods for counting purposes in these three benchmarks. To the best of our knowledge, this is the first object counting and locating method based on a feature map enhancement and a multi-stage refinement of the confidence map.
This section describes our method to count and locate objects. The method receives a three-channel image with w × h pixels as input and processes it with a CNN. Object counting and location are modeled as a 2D confidence map estimation, following the procedures presented in [Aich & Stavness, 2018]. The confidence map is a 2D representation of the likelihood of an object occurring in each pixel. We improve the confidence map estimation by including global and local information through a Pyramid Pooling Module (PPM) [Zhao et al., 2017]. We also propose a multi-stage prediction phase that refines the confidence map into a more accurate prediction of the center of the objects.

Fig. 1 illustrates the phases of the proposed method, which are detailed in the following sections. Our approach is divided into four main phases: 1) feature map generation with a CNN (Section 2.1); 2) feature map enhancement with the PPM (Section 2.2); 3) multi-stage refinement of the confidence map (Sections 2.3 and 2.4); and 4) object position extraction from peaks in the confidence map (Section 2.5).
[Figure 1 panels: (a) Input image; (b) Feature map extraction; (c) Pyramid Pooling Module (PPM); (d) Multi-Sigma Stages (MSS); (e) Object detection.]
Figure 1: Our method for the confidence map prediction using the Pyramid Pooling Module (PPM) and the multi-stage refinement approach. The initial part (b), based on VGG19 [Simonyan & Zisserman, 2015], extracts a feature map from the input image (a). This feature map is used as input for the PPM (c) [Zhao et al., 2017]. The resulting volume is then used as input to the first stage of the Multi-Sigma Stages (MSS) phase (d) [Aich & Stavness, 2018]. The concatenation of the PPM output and the prediction map of the previous stage is used as input for the remaining stages. The T stages apply a standard deviation (σ) for the confidence map peak that decreases from a maximum to a minimum in equally spaced values.
We start by extracting a feature map with a CNN from a given input image (Fig. 1 (a)). This feature map extraction module is based on the VGG19 [Simonyan & Zisserman, 2015], where the first two convolutional layers have 64 filters of 3 × 3 size and are followed by a max pooling layer with a 2 × 2 window. The last two convolutional layers have 256 filters of 3 × 3 size. All convolutional layers use the rectified linear unit (ReLU) activation function, with a stride of 1 and zero-padding, returning an output with the same resolution as the input.

We evaluated two variations of our method that differ in the input image dimensions; in both, the final layer produces a feature map whose resolution is proportionally smaller than the input. Despite the reduced resolution, this map can describe the relevant features extracted from the image.

Many CNNs cannot incorporate sufficient global context information to ensure a good performance in characterizing high-density objects. To solve this issue, our method adopts a global and subregional context module, the PPM [Zhao et al., 2017]. This module allows the CNN to be invariant to scale. Fig. 1 (c) illustrates the PPM, which combines the features of four pyramid scales. The highest level, displayed in orange, applies a global max pooling that creates a 1 × 1 feature map describing the global image context, such as the number of objects in the image. The other levels divide the input map into subregions, forming a pooled representation of the image with their subcontext information.

The levels of the PPM contain feature maps of various sizes. Because of this, we use a convolution layer with 512 filters after each level. We then upsample the feature maps to the same size as the input map with bilinear interpolation. Lastly, these feature maps are concatenated with the input map to form an improved description of the image. This step ensures that small object information is not lost in the PPM phase. Although this module was proposed for semantic segmentation, it proved to be a robust component for counting objects in our experiments, as it includes information at different scales and global context.
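To make the enhancement step concrete, the sketch below builds a PPM with Keras/TensorFlow, the framework named in the experimental setup. The bin sizes (1, 2, 3, 6) and the 1 × 1 kernel of the per-level convolution follow the PSPNet defaults and are assumptions, since the exact values did not survive extraction; the max pooling, the 512 filters per level, the bilinear upsampling, and the concatenation with the input map follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def pyramid_pooling_module(x, bin_sizes=(1, 2, 3, 6), filters=512):
    """Sketch of the PPM: pool the feature map at several scales, project each
    pooled map with a convolution, upsample back, and concatenate with the input."""
    h, w = x.shape[1], x.shape[2]   # assumes H and W are known and divisible by each bin size
    branches = [x]
    for bins in bin_sizes:
        p = layers.MaxPooling2D(pool_size=(h // bins, w // bins))(x)       # bins x bins map
        p = layers.Conv2D(filters, kernel_size=1, activation="relu")(p)    # 1x1 conv (assumed), 512 filters
        p = layers.UpSampling2D(size=(h // bins, w // bins),
                                interpolation="bilinear")(p)               # back to H x W
        branches.append(p)
    # Concatenating with the input map keeps fine-grained (small-object) information.
    return layers.Concatenate(axis=-1)(branches)
```

In the full model this module is applied to the VGG19-based feature map of Fig. 1 (b), and its output feeds the first refinement stage.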
In the multi-stage refinement phase, the improved feature map obtained by the PPM is used as input for the T stages that estimate the confidence map. The first stage (Fig. 1 (d)) receives the feature map and generates the confidence map C_1 using five convolutional layers: three layers with 128 filters, one layer with 512 filters, and one layer with a single filter, corresponding to the confidence map.

At a subsequent stage t (Fig. 1 (d)), the prediction returned by the previous stage C_{t−1} and the feature map from the PPM are concatenated and used to produce a refined confidence map C_t. The T − 1 final stages consist of seven convolutional layers: five layers with 128 filters, followed by one layer with 128 filters and one layer with a single filter. The last layer of each stage has a sigmoid activation function, so that each pixel represents the probability of the occurrence of an object (values in [0, 1]). The remaining layers have a ReLU activation function. Through multiple stages, we propose hierarchical learning of the center of the object: the first stage roughly predicts the position, while the other stages refine this prediction (Fig. 5).

To avoid the vanishing gradient problem during the training phase, we adopt a loss function (Eq. 1) applied at the end of each stage,

$$f_t = \sum_{p} \lVert \hat{C}_t(p) - C_t(p) \rVert_2^2, \qquad (1)$$

where Ĉ_t is the ground truth confidence map of stage t (Section 2.4). The overall loss function is given by

$$f = \sum_{t=1}^{T} f_t. \qquad (2)$$

As mentioned in the previous section, to train our method a confidence map Ĉ_t is generated as ground truth for each stage t by using the centers of the objects as annotations in the image. The Ĉ_t is generated by placing a 2D Gaussian kernel at each center of the labeled objects [Aich & Stavness, 2018]. The Gaussian kernel has a standard deviation (σ_t) that controls the spread of the confidence map peak, as shown in Fig. 2.

Our approach uses different values of σ_t for each stage t to refine the object center prediction during each stage. The σ of the first stage is set to a maximum value (σ_max), while the σ_T of the last stage is set to a minimum value (σ_min). The appropriate values of σ_max and σ_min are evaluated in the experiments. The σ_t of each intermediate stage is equally spaced within [σ_max, σ_min]. The early stages should return a rough prediction of the center of the object, and this prediction is refined in the subsequent stages.

Fig. 2 illustrates an example of a ground truth confidence map with three values of σ_t. Fig. 2 (a) shows the RGB image and the location of each object marked by a red dot. Fig. 2 (b, c, and d) present the ground truth confidence maps for three decreasing values of σ_t. In our experiments, the usage of different σ values helped refine the confidence map, improving its robustness.

Figure 2: Example of an RGB image and its corresponding ground-truth confidence maps with decreasing σ_t values: (a) RGB image; (b)–(d) confidence maps from the largest to the smallest σ_t.
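To illustrate the training targets and the loss of Eqs. (1)–(2), the minimal NumPy sketch below generates the per-stage ground-truth maps and sums the stage losses. Taking the maximum where neighbouring Gaussians overlap and using a squared pixel-wise error are assumptions made for the sketch; the text above states only that a 2D Gaussian kernel with a per-stage σ_t is placed at each annotated center and that the stage losses are summed.

```python
import numpy as np

def gaussian_confidence_map(centers, height, width, sigma):
    """Ground truth C_hat: a 2D Gaussian placed at each annotated object center."""
    ys, xs = np.mgrid[0:height, 0:width]
    cmap = np.zeros((height, width), dtype=np.float32)
    for cx, cy in centers:                                    # centers given as (x, y) pixels
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        cmap = np.maximum(cmap, g)                            # assumption: keep the max where peaks overlap
    return cmap

def stage_sigmas(sigma_max, sigma_min, num_stages):
    """Equally spaced sigmas, from sigma_max at stage 1 down to sigma_min at stage T."""
    return np.linspace(sigma_max, sigma_min, num_stages)

def multi_stage_loss(pred_maps, gt_maps):
    """Eqs. (1)-(2): the total loss sums the pixel-wise errors of every stage."""
    return sum(float(np.sum((p - g) ** 2)) for p, g in zip(pred_maps, gt_maps))

# Example: T = 4 stages with sigma_max = 3 and sigma_min = 1 (the best values found in Section 3).
sigmas = stage_sigmas(3.0, 1.0, 4)
targets = [gaussian_confidence_map([(10, 12), (30, 28)], 64, 64, s) for s in sigmas]
```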
Object locations are obtained from the confidence map of the last stage (C_T). We estimate the peaks (local maxima) of the confidence map by analyzing the 4-pixel neighborhood of each location p. Thus, p = (x_p, y_p) is a local maximum if C_T(p) > C_T(v) for all neighbors v, where v is given by (x_p ± 1, y_p) or (x_p, y_p ± 1). An example of the object location from the confidence map peaks is shown in Fig. 3.

To avoid noise or positions p with a low probability of occurrence, a peak in the confidence map is considered an object only if C_T(p) > τ. Besides that, we set a minimum distance δ to allow the method to detect very close objects. After a preliminary experiment, we fixed the threshold τ and used δ = 1 pixel, which allows the detection of objects that are two pixels apart.
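A minimal sketch of this peak-based localization is given below. The 4-neighborhood test, the threshold τ, and the minimum distance δ follow the description above; the numeric value used for τ is a placeholder, since the exact threshold did not survive extraction.

```python
import numpy as np

def locate_objects(conf_map, tau=0.35, delta=1):
    """Peaks of the last-stage confidence map C_T.

    A pixel is kept when it exceeds its 4-neighbourhood and its confidence is above tau;
    delta is the minimum distance (in pixels) kept between detections. tau=0.35 is a
    placeholder value, not the threshold reported by the authors.
    """
    padded = np.pad(conf_map, 1, mode="constant", constant_values=-np.inf)
    center = padded[1:-1, 1:-1]
    neighbours = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                           padded[1:-1, :-2], padded[1:-1, 2:]])
    is_peak = (center > neighbours.max(axis=0)) & (center > tau)
    # Greedy suppression: keep the strongest peaks first, discard any peak closer than delta.
    candidates = sorted(zip(*np.nonzero(is_peak)), key=lambda p: -conf_map[p])
    kept = []
    for y, x in candidates:
        if all(max(abs(y - ky), abs(x - kx)) > delta for ky, kx in kept):
            kept.append((y, x))
    return kept
```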
To test the robustness of our method, we evaluated it on a new and challenging dataset of eucalyptus tree images. We used this image dataset because it contains different tree plantation densities, ranging from extreme cases to sparser plantings (Fig. 4). This variation in density is a challenge for counting and locating objects. The trees were also at different growth stages, which allowed us to evaluate the proposed method under different scales (tree sizes) and changes in appearance.

The images were captured by an Unmanned Aerial Vehicle (UAV) in a rural property in Mato Grosso do Sul, Brazil, over four different areas of approximately 40 ha each. The eucalyptus trees were planted at different spacings, the densest being 1.25 meters apart, with an average of 1,750 trees per hectare. These trees were at different growth stages, varying in height and canopy area. The images were acquired with an RGB sensor, which produced a pixel size of 4.15 cm. A total of four orthomosaics were generated from the area of interest. Approximately 232,000 eucalyptus trees were labeled as point features by a specialist.

To evaluate the robustness and generalization of the proposed approach, we also compared the performance of our method on two well-known image datasets for counting cars: the CARPK and PUCPR+ benchmarks [Hsieh et al., 2017]. We compare the prediction metrics with state-of-the-art methods such as One-Look Regression [Mundhenk et al., 2016], IEP Counting [Stahl et al., 2019], YOLO [Redmon & Farhadi, 2017], YOLO9000 [Redmon & Farhadi, 2017], Faster R-CNN [Ren et al., 2017], RetinaNet [Lin et al., 2020, Hsieh et al., 2017], LPN [Hsieh et al., 2017], VGG-GAP [Aich & Stavness, 2018], VGG-GAP-HR [Aich & Stavness, 2018] and Deep IoU CNN [Goldman et al., 2019].

Figure 3: Example of the localization of eucalyptus trees from a refined confidence map.

Figure 4: Examples of the tree dataset. The eucalyptus trees are at different growth stages and plantation densities.
The four orthomosaics were split into 3,370 non-overlapping patches. These patches were randomly divided into training (n = 2,870), validation (n = 250) and testing (n = 250) sets. For training the CNN, we applied a Stochastic Gradient Descent optimizer with a momentum of 0.9. To reduce the risk of overfitting, we used the validation set for hyperparameter tuning of the learning rate and the number of epochs. After minimal hyperparameter tuning, the learning rate was set to 0.01 and the number of epochs to 100. Instead of training the proposed approach from scratch, we initialized the weights of the first part with weights pre-trained on ImageNet. Six metrics, the mean absolute error (MAE) [Wackerly et al., 2014, Chai & Draxler, 2014], the root mean squared error (RMSE) [Wackerly et al., 2014, Chai & Draxler, 2014], the coefficient of determination (R²) [Draper & Smith, 1998], Precision, Recall, and F-Measure, were used to measure the performance. Training and testing were performed on a desktop computer with an Intel(R) Xeon(R) E5 CPU @ 2.10 GHz, 64 GB of memory, and an NVIDIA Titan V graphics card (5,120 CUDA (Compute Unified Device Architecture) cores and 12 GB of graphics memory). The methods were implemented using Keras-TensorFlow on the Ubuntu 18.04 operating system.
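For reference, the count-level regression metrics (MAE, RMSE and R²) can be computed as in the short sketch below; Precision, Recall and F-Measure additionally require matching predicted and labeled positions, which is omitted here. This is a generic illustration, not the authors' evaluation code.

```python
import numpy as np

def counting_metrics(true_counts, pred_counts):
    """MAE, RMSE and R^2 over per-image object counts."""
    y = np.asarray(true_counts, dtype=float)
    p = np.asarray(pred_counts, dtype=float)
    err = p - y
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Example with three test images:
print(counting_metrics([52, 70, 43], [50, 73, 44]))
```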
This section presents and discusses the results obtained by the proposed method while comparing it with state-of-the-art methods. First, we demonstrate the influence of different parameters, which include the σ used to generate the ground truth confidence maps, the number of stages necessary to refine the prediction, and the usage of the PPM [Zhao et al., 2017] to include context information at multiple scales. Second, we compare the results with the baseline of the proposed method. For this, we used the tree counting dataset and the car counting datasets (CARPK and PUCPR+).

We present the results of the proposed method on the validation set of the tree counting dataset for different numbers of stages. These stages are responsible for refining the confidence map. We observed that with two stages (T = 2) the proposed method already returned satisfactory results (Table 1). When increasing to T = 4 stages, we obtained the best result, with MAE, RMSE, R², Precision, Recall, and F-Measure of 2.69, 3.57, 0.977, 0.817, 0.831, and 0.823, respectively. These results indicate that the multi-stage refinement can affect the object counting task significantly. This is because the confidence map is refined in later stages, increasing the chance of objects being detected in high-density regions.

Table 1: Evaluation of the number of stages (T) on the validation set of the tree counting dataset using σ_min = 1 and σ_max = 3.

Stages (T)   MAE    RMSE   R²      Precision   Recall   F-Measure
2            2.86   3.82   0.974   0.809       0.825    0.816
4            2.69   3.57   0.977   0.817       0.831    0.823
We evaluated the σ_min and σ_max responsible for generating the ground truth confidence maps used in the T stages. In this experiment, we adopted the T = 4 stages that achieved the best results in the previous experiment. The confidence map of the first stage is generated using σ_max, the last stage uses σ_min, and the intermediate stages are constructed from values equally spaced within [σ_max, σ_min]. A low σ, relative to the object area (e.g., tree canopy), provides a confidence map that does not correctly cover the object's area. A high σ, however, generates a confidence map that, while fully covering the object, may include nearby objects in high-density conditions. Both conditions make it difficult to spatially locate objects in the image.

The evaluation of σ_max is presented in Table 2. The best result was obtained with σ_max = 3, which best covers the tree canopies without overlapping them. Still, we observed that other values of σ_max also returned good results. Since σ_max is used only in the first stage, it has a small influence on the final result, as the confidence map is refined in the subsequent stages.

Table 2: Evaluation of σ_max on the validation set of the tree counting dataset, using σ_min = 1 and T = 4 stages.

σ_max   MAE    RMSE   R²      Precision   Recall   F-Measure
2       3.31   4.31   0.966   0.811       0.837    0.822
The results for σ_min are summarized in Table 3. The σ_min has a great influence on the final result since it is responsible for the last confidence map. The overall best result was obtained with σ_min = 1.0, which shows that this value is the best fit for the size of the tree canopies. The conducted experiments showed that, with appropriate values of σ_max = 3 and σ_min = 1, a high performance for counting trees can be obtained (Table 3).

Table 3: Evaluation of σ_min on the validation set of the tree counting dataset, using σ_max = 3 and T = 4 stages.

σ_min   MAE     RMSE    R²      Precision   Recall   F-Measure
0.5     11.01   13.77   0.658   0.868       0.721    0.783
0.75    2.93    3.89    0.972   0.820       0.831    0.824
To verify the potential of our method for real-time processing, we compared the processing time for different numbers of stages (T). Table 4 shows the processing time of the proposed method for T = 2, 4, 6, 8, and 10. For this, we used 100 images from the tree test set and computed the average processing time and its standard deviation, using the values of σ_min = 1 and σ_max = 3 that obtained the best performance in the previous tests. The results show that the proposed approach can achieve real-time processing: for the best configuration with T = 4 stages, the approach delivers the detection of an image in 1.426 seconds with a standard deviation of 0.028.

Table 4: Processing time evaluation of the proposed approach for different numbers of stages (T).

Stages (T)   Average Time (s)   Standard deviation
2            0.802              0.022
4            1.426              0.028
6            2.063              0.058
8            2.675              0.059
10           3.373              0.100

To analyze the design of the proposed architecture, we compared it with a baseline model that does not include the PPM and the multi-stage refinement on the tree counting dataset. The overall best result with just the baseline CNN was obtained with σ = 1, returning an MAE, RMSE, R², Precision, Recall, and F-Measure equal to 2.85, 3.72, 0.977, 0.814, 0.833 and 0.822, respectively.

A gain in performance is observable when analyzing the results from the inclusion of the PPM and the multi-stage refinement in the baseline (Table 5). The inclusion of the PPM alone brings no significant improvement, while the baseline with multi-stage refinement achieves better results. One explanation for this is that multiple stages provide hierarchical learning of the object position, starting from a rough prediction and moving to a more refined prediction of the center of the object. Examples of the confidence map refinement across the stages are shown in Fig. 5. Moreover, when both modules are implemented together, the method outperforms all the baseline results, which shows that the combination of these two modules is essential for object counting.

Table 5: Results of the proposed method and its baseline for the tree counting dataset.

Method                   MAE     RMSE    R²      Precision   Recall   F-Measure
Baseline (σ = 0.5)       11.97   15.10   0.62    0.861       0.709    0.772
Baseline (σ = 1.0)       2.85    3.72    0.977   0.814       0.833    0.822
Baseline (σ = 2.0)       3.07    4.37    0.968   0.822       0.805    0.812
Baseline + PPM           2.44    3.38    0.981   0.825       0.836    0.829
Baseline + multi-stage   2.78    3.64    0.978   0.808       0.833    0.819
Proposed Method          2.05    2.87    0.986   0.822       0.834    0.827
We considered a region around the labeled object position to qualitatively analyze the proximity of the prediction to the center of the object. The results using the best configuration (σ_min = 1.0, σ_max = 3.0, and T = 4) are displayed in Fig. 6. The predicted positions are represented by red dots, and the tree-canopy regions are represented by yellow circles whose centers are the labeled positions. The proposed method correctly predicts most of the tree positions.
Figure 5: Example of two images showing the confidence map refinement by our method: (a) RGB image; (b) stage 1; (c) stage 3; (d) stage 4.

Another important contribution is that planting lines are also identified without the need for annotation or additional procedures (Fig. 6 (a)). Furthermore, the proposed method can correctly identify trees even outside the planting lines, in a non-regular distribution (Fig. 6 (b)).

Figure 6: Comparison of the predicted positions (red dots) in two images with different tree densities: (a) planting lines; (b) non-regular planting.

A comparison of the proposed method, with both PPM and multi-stage refinement, against the baseline is displayed in Fig. 7. The baseline fails to detect some trees while returning some false positives. The proposed method is capable of detecting more difficult true positives, not detected by the baseline, with fewer false negatives.
Figure 7: Comparison of the predicted positions of (a) the proposed method and (b) the baseline. Predicted positions are shown by red dots, while tree canopies are represented by yellow circles. Blue circles highlight the challenges faced by the methods.

Although the proposed method returned a good performance for the tree counting dataset, it also faced some challenges (Fig. 8). Far-from-center predictions occurred in short planting lines (Fig. 8 (a)) and in dispersed vegetation. This also happened in highly dense areas (Fig. 8 (b)), although less frequently. Still, the proposed method was capable of predicting the correct position of the majority of the trees.

Figure 8: Examples of the challenges faced by the proposed method: (a) short planting lines; (b) canopy occlusion.
To verify the performance of the proposed approach under different object densities, we divided the tree test images into three density groups: low, medium and high. For this, the images were ordered according to the number of annotated trees, and the three groups were then defined so that the quantities of trees were balanced among them: the low group contains the images with the fewest plants, the medium group an intermediate range, and the high group the images with the most plants. A sketch of this grouping is given below.

Table 6 presents the results obtained by the proposed approach at the three density levels. The approach performs well at every density level, with the best results at the low level, achieving an MAE, RMSE, R², Precision, Recall, and F-Measure equal to 1.70, 2.34, 0.966, 0.818, 0.846 and 0.829, respectively.

Table 6: Results of the proposed method for different object densities.

Density Level   MAE    RMSE   R²      Precision   Recall   F-Measure
Low             1.70   2.34   0.966   0.818       0.846    0.829
Medium          2.10   2.85   0.865   0.824       0.829    0.826
High            2.38   3.36   0.843   0.823       0.826    0.824

Figure 9 shows the visual results for plant detection at the three density levels. The proposed approach is able to correctly detect the centers of the plants, even in irregular plantings (see Figure 9 (a) and (b)). In addition, as shown in Table 6, the approach detects the plant positions more easily at the low level, since there is not much overlap of the tree canopies.
Figure 9: Examples of the performance of the proposed approach at different levels of object density. Column (a) shows the results for low densities, (b) for medium densities and (c) for high densities.
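The balanced density grouping described above can be sketched as follows. The exact plant-count thresholds and group sizes used by the authors did not survive extraction, so the sketch simply orders the images by annotation count and splits them into three groups of similar size.

```python
import numpy as np

def split_by_density(counts_per_image, group_names=("low", "medium", "high")):
    """Order images by annotated object count and split them into balanced groups."""
    order = np.argsort(counts_per_image)                 # indices from fewest to most objects
    groups = np.array_split(order, len(group_names))     # roughly equal-sized groups
    return dict(zip(group_names, groups))

# Example: image indices grouped by their annotated tree counts.
groups = split_by_density([12, 87, 45, 60, 23, 98, 71, 35, 52])
```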
To assess the generalization of the proposed approach and compare its robustness against other state-of-the-art methods, we evaluated its performance on two well-known benchmarks: CARPK and PUCPR+ [Hsieh et al., 2017]. These benchmarks provide large-scale aerial datasets for counting cars in parking lots. The CARPK dataset [Hsieh et al., 2017] is composed of 989 training images (42,274 cars) and 459 test images (47,500 cars). The number of cars per image ranges from 1 to 87 in the training images and from 2 to 188 in the test images. PUCPR+ [Hsieh et al., 2017] is a subset of the PUCPR dataset [de Almeida et al., 2015] and is composed of 100 training images and 25 test images. The training and test images contain 12,995 and 3,920 car instances, respectively.

The results of the proposed method and the state-of-the-art methods on both benchmarks demonstrate how feasible our approach is for counting objects (Tables 7 and 8). We adopted the same protocols for the training and testing sets. The images were resized to a fixed resolution, since we obtained similar performance when using full-resolution images in our approach. The proposed method achieved state-of-the-art performance in counting objects. Additionally, our method also provides the object position, something not accomplished by most of the high-performance methods. The location of objects can be obtained with traditional object detection methods such as Faster R-CNN, YOLO and RetinaNet, but these returned inferior results and are not suitable for high object density.

Table 7: CARPK comparative results.

Method                                             MAE     RMSE    R²      Precision   Recall   F-Measure
One-Look Regression [Mundhenk et al., 2016]        59.46   66.84   -       -           -        -
IEP Counting [Stahl et al., 2019]                  51.83   -       -       -           -        -
YOLO [Redmon & Farhadi, 2017]                      48.89   57.55   -       -           -        -
YOLO9000 [Redmon & Farhadi, 2017]                  45.36   52.02   -       -           -        -
Faster R-CNN [Ren et al., 2017]                    24.32   37.62   -       -           -        -
RetinaNet [Lin et al., 2020, Hsieh et al., 2017]   16.62   22.30   -       -           -        -
LPN [Hsieh et al., 2017]                           13.72   21.77   -       -           -        -
VGG-GAP [Aich & Stavness, 2018]                    10.33   12.89   -       -           -        -
VGG-GAP-HR [Aich & Stavness, 2018]                 7.88    9.30    -       -           -        -
Deep IoU CNN [Goldman et al., 2019]                6.77    8.52    -       -           -        -
Proposed Method                                    4.45    6.18    0.975   0.767       0.765    0.763
Table 8: PUCPR+ comparative results.
Method                                             MAE      RMSE     R²      Precision   Recall   F-Measure
YOLO [Redmon & Farhadi, 2017]                      156.00   200.42   -       -           -        -
YOLO9000 [Redmon & Farhadi, 2017]                  130.40   172.46   -       -           -        -
Faster R-CNN [Ren et al., 2017]                    39.88    47.67    -       -           -        -
RetinaNet [Lin et al., 2020, Hsieh et al., 2017]   24.58    33.12    -       -           -        -
One-Look Regression [Mundhenk et al., 2016]        21.88    36.73    -       -           -        -
IEP Counting [Stahl et al., 2019]                  15.17    -        -       -           -        -
VGG-GAP [Aich & Stavness, 2018]                    8.24     11.38    -       -           -        -
LPN [Hsieh et al., 2017]                           8.04     12.06    -       -           -        -
Deep IoU CNN [Goldman et al., 2019]                7.16     12.00    -       -           -        -
VGG-GAP-HR [Aich & Stavness, 2018]                 5.24     6.67     -       -           -        -
Proposed Method                                    3.16     4.39     0.999   0.832       0.829    0.830
As shown in Fig. 10, the proposed method improves the results by detecting more difficult true positives. Some cars are partially covered by trees or shadows while parked next to each other, and our method was able to detect such cases. The PPM helped improve the object representation, while the multi-stage refinement provided a better position for the center of the objects, especially in highly dense areas. These features, incorporated in our approach, proved to be important additions for both datasets evaluated.
Figure 10: Car detection by the proposed method under different conditions: (a) occlusion by trees and shadows; (b) multiple distances.
In this study, we proposed a new CNN-based method that returned state-of-the-art performance for counting and locating high-density objects in images. The proposed approach is based on a density estimation map with the confidence that an object occurs in each pixel. For this, our approach produces a feature map with a CNN and then enhances it with the PPM. To improve the prediction of each object, it uses a multi-stage refinement process, and the object positions are calculated from the peaks of the refined confidence maps.

Experiments were performed on three datasets with images containing eucalyptus trees and cars. Despite the challenges, the proposed method obtained better results than previous methods. Experimental results on CARPK and PUCPR+ indicate that the proposed method improves the MAE, e.g., from 6.77 to 4.45 on CARPK and from 5.24 to 3.16 on PUCPR+. The proposed method is suitable for dealing with high object density in images, returning a state-of-the-art performance for counting and locating objects. Since this is the first object counting and locating CNN method based on feature map enhancement and a multi-stage refinement of a confidence map, other types of object detection approaches may benefit from the findings presented here.

Further research could focus on investigating the impact on object counting of distributions other than the Gaussian used to generate the ground truth confidence map. Predictions other than the confidence map can also help in separating objects at high density, such as predicting the boundaries obtained from a Voronoi diagram.
Acknowledgments
This study was supported by FUNDECT - State of Mato Grosso do Sul Foundation to Support Education, Science and Technology, CAPES - Brazilian Federal Agency for Support and Evaluation of Graduate Education, and CNPq - National Council for Scientific and Technological Development. The Titan V and XP GPUs used for this research were donated by the NVIDIA Corporation.
References
Aich, S., & Stavness, I. (2018). Improving object counting with heatmap regulation. arXiv:1803.05494.

de Almeida, P. R., Oliveira, L. S., Britto, A. S., Silva, E. J., & Koerich, A. L. (2015). PKLot - a robust dataset for parking lot classification. Expert Systems with Applications, 4937-4949. doi:10.1016/j.eswa.2015.02.009.

Chai, T., & Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 1247-1250.

Draper, N. R., & Smith, H. (1998). Applied Regression Analysis, volume 326. John Wiley & Sons.

Ferrari, A., Lombardi, S., & Signoroni, A. (2017). Bacterial colony counting with convolutional neural networks in digital microbiology imaging. Pattern Recognition, 629-640. doi:10.1016/j.patcog.2016.07.016.

Fiaschi, L., Koethe, U., Nair, R., & Hamprecht, F. A. (2012). Learning to count with regression forest and structured labels. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012) (pp. 2685-2688).

Goldman, E., Herzig, R., Eisenschtat, A., Goldberger, J., & Hassner, T. (2019). Precise detection in densely packed scenes. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 5227-5236). arXiv:1904.00853.

He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2020). Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 386-397.

Hsieh, M., Lin, Y., & Hsu, W. H. (2017). Drone-based object counting by spatially regularized regional proposal network. In IEEE International Conference on Computer Vision (ICCV) (pp. 4165-4173). doi:10.1109/ICCV.2017.446.

Idrees, H., Tayyab, M., Athrey, K., Zhang, D., Al-Máadeed, S., Rajpoot, N. M., & Shah, M. (2018). Composition loss for counting, density map estimation and localization in dense crowds. In European Conference on Computer Vision (ECCV) (pp. 544-559). doi:10.1007/978-3-030-01216-8_33.

Lin, T., Goyal, P., Girshick, R., He, K., & Dollár, P. (2020). Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 318-327. doi:10.1109/TPAMI.2018.2858826.

Liu, Y., Shi, M., Zhao, Q., & Wang, X. (2019). Point in, box out: Beyond counting persons in crowds. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1904.01333.

Ma, W., Wu, Y., Cen, F., & Wang, G. (2020). MDFN: Multi-scale deep feature learning network for object detection. Pattern Recognition, 107149. doi:10.1016/j.patcog.2019.107149.

Mundhenk, T. N., Konjevod, G., Sakla, W. A., & Boakye, K. (2016). A large contextual dataset for classification, detection and counting of cars with deep learning. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer Vision - ECCV 2016 (pp. 785-800). Cham: Springer International Publishing.

Ohn-Bar, E., & Trivedi, M. M. (2017). Multi-scale volumes for deep object detection and localization. Pattern Recognition, 557-572. doi:10.1016/j.patcog.2016.06.002.

Osco, L. P., dos Santos de Arruda, M., Junior, J. M., da Silva, N. B., Ramos, A. P. M., Moryia, E. A. S., Imai, N. N., Pereira, D. R., Creste, J. E., Matsubara, E. T., Li, J., & Gonçalves, W. N. (2020). A convolutional neural network approach for counting and geolocating citrus-trees in UAV multispectral imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 97-106. doi:10.1016/j.isprsjprs.2019.12.010.

Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6517-6525). doi:10.1109/CVPR.2017.690.

Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1137-1149. doi:10.1109/TPAMI.2016.2577031.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 211-252. doi:10.1007/s11263-015-0816-y.

d. S. de Arruda, M., Spadon, G., Rodrigues, J. F., Gonçalves, W. N., & Machado, B. B. (2018). Recognition of endangered Pantanal animal species using deep learning methods. In International Joint Conference on Neural Networks (pp. 1-8). doi:10.1109/IJCNN.2018.8489369.

Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

Sindagi, V. A., & Patel, V. M. (2018). A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recognition Letters, 3-16. doi:10.1016/j.patrec.2017.07.007.

Stahl, T., Pintea, S. L., & van Gemert, J. C. (2019). Divide and count: Generic object counting by image divisions. IEEE Transactions on Image Processing, 1035-1044. doi:10.1109/TIP.2018.2875353.

Wackerly, D., Mendenhall, W., & Scheaffer, R. L. (2014). Mathematical Statistics with Applications. Cengage Learning.

Xu, J., Wang, W., Wang, H., & Guo, J. (2020). Multi-model ensemble with rich spatial information for object detection. Pattern Recognition, 107098. doi:10.1016/j.patcog.2019.107098.

Yuan, J., Xiong, H.-C., Xiao, Y., Guan, W., Wang, M., Hong, R., & Li, Z.-Y. (2019). Gated CNN: Integrating multi-scale feature layers for object detection. Pattern Recognition, 107131. doi:10.1016/j.patcog.2019.107131.

Zhang, S., Li, H., Kong, W., Wang, L., & Niu, X. (2019). An object counting network based on hierarchical context and feature fusion. Journal of Visual Communication and Image Representation, 166-173. doi:10.1016/j.jvcir.2019.05.003.

Zhang, Y., Bai, Y., Ding, M., Li, Y., & Ghanem, B. (2018). Weakly-supervised object detection via mining pseudo ground truth bounding-boxes. Pattern Recognition, 68-81. doi:10.1016/j.patcog.2018.07.005.

Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1612.01105.