Counting from Sky: A Large-scale Dataset for Remote Sensing Object Counting and A Benchmark Method
Guangshuai Gao, Qingjie Liu*, Member, IEEE, and Yunhong Wang, Fellow, IEEE
Abstract—Object counting, which aims to estimate the number of objects in a given image, is an important and challenging computer vision task. Significant efforts have been devoted to addressing this problem and great progress has been achieved, yet counting the number of ground objects from remote sensing images is barely studied. In this paper, we are interested in counting dense objects from remote sensing images. Compared with object counting in natural scenes, this task is challenging due to the following factors: large scale variation, complex cluttered background, and orientation arbitrariness. More importantly, the scarcity of data severely limits the development of research in this field. To address these issues, we first construct a large-scale object counting dataset of remote sensing images, which contains four important geographic object categories: buildings, crowded ships in harbors, large vehicles, and small vehicles in parking lots. We then benchmark the dataset by designing a novel neural network that generates a density map of an input image. The proposed network consists of three parts, namely an attention module, a scale pyramid module and a deformable convolution module, to attack the aforementioned challenging factors. Extensive experiments are performed on the proposed dataset and one crowd counting dataset, demonstrating the challenges of the proposed dataset and the superiority and effectiveness of our method compared with state-of-the-art methods.
Index Terms—Object counting, remote sensing, attention mechanism, scale pyramid module, deformable convolution layer.
I. INTRODUCTION

Object counting, which aims to estimate the number of objects in a still image or video frame, is an important yet challenging computer vision task. It has been attracting increasing research interest because of its practical applications in various fields, such as crowd counting [1]-[13], cell microscopy [14]-[16], counting animals for ecological studies [17], vehicle counting [18]-[20] and environment survey [21], [22]. Although great progress has been made in many domains of object counting, only a few works have addressed ground object counting in the remote sensing community over the past few years, for example counting palms or olive trees from remotely sensed images [23]-[25].
This work was supported by the National Natural Science Foundation of China (Grant Nos. 41871283, U1804157 and 61601011). (Corresponding author: Qingjie Liu) Guangshuai Gao, Qingjie Liu and Yunhong Wang are with the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Xueyuan Road, Haidian District, Beijing, 100191, China, and the Hangzhou Innovation Institute, Beihang University, Hangzhou, 310051, China (email: [email protected]; [email protected]; [email protected]).
In recent years, with the growing population and the rapid development of urbanization, geographic objects such as buildings and cars have become increasingly dense and crowded. This impels an increasing number of researchers to study scene understanding from the perspective of object counting. There are emerging studies that analyze crowd scenes from an airborne platform on a helicopter [26], and that detect and count cars from drones [27] or even satellites [28]. However, the dominant ground objects in remote sensing images, such as buildings and ships, are ignored by the community, although estimating their number could benefit many practical applications, to name a few: urban planning [29], environment control and mapping [30], digital urban model construction [31], and disaster response and assessment [32].

One major reason for this is the lack of a large-scale remote sensing object counting dataset available to the community. Although aerial image understanding has gained much attention and there are many datasets built for tasks such as object detection [33], [34] or scene classification [35], [36], these datasets are not intended for counting, and the number of objects in them is too small to support the counting task. Another reason that limits the progress and applications of counting in the remote sensing field is that well-developed counting models may not work well on remote sensing images, since objects captured overhead possess distinct features from those captured by surveillance or consumer cameras. To be specific, compared with counting tasks in other fields, e.g., crowd counting, object counting in remote sensing images faces the following challenges:

• Scale variation. Objects (e.g., buildings) in remote sensing images have large scale variations, ranging from only a few pixels to thousands of pixels. This extreme scale change makes it difficult to infer an accurate number of objects.

• Complex cluttered background. The size of crowded small objects in remote sensing images is so small that it is difficult to detect and recognize them. The objects are often submerged in complex cluttered backgrounds, which distract models from the regions of interest and prevent them from predicting the true count.

• Orientation arbitrariness. Objects in remote sensing images are captured overhead, so their orientation is arbitrary, unlike objects in natural images, such as crowds, which have an upright orientation.

• Dataset scarcity. As mentioned above, object counting datasets in the remote sensing field are extremely scarce. Although some datasets for object detection in remote sensing images, such as SpaceNet (https://aws.amazon.com/public-datasets/spacenet/), attempt to alleviate this issue, they were not initially designed for the counting task, making them a poor choice for counting research. The recently released DLR-ACD [26] and CARPK [27] datasets are acquired from UAV platforms and constructed for crowd or car counting, and they lack typical geographic objects such as buildings and ships.

Intuitively, object counting can be achieved in a straightforward way with the aid of well-designed detectors such as Faster RCNN [37], YOLO [38] and SSD [39]: once the objects are identified and located, counting itself is trivial. However, this solution can succeed only when objects are large and sparsely located; it may fail in dense cases, especially for adjoining dense buildings, densely crowded ships in harbors and small vehicles in parking lots.

To overcome the aforementioned problems and facilitate future research, we proceed along two routes: one is methodology, and the other is the data source. For the methodology, we design a supervised deep network to estimate the count of different geographic objects in complex scenes. Following Lempitsky et al.'s pioneering work [16] that casts object counting as density estimation, our method also accomplishes the counting task by first predicting the density maps of input images and then integrating them to obtain the final counts. The proposed model comprises three stages. The first stage is a truncated VGG-16 [40] serving as a feature extractor, followed by an Attention module used to highlight informative features and suppress backgrounds. The subsequent stage is a Scale Pyramid module with different dilation rates, which captures the multi-scale information of the objects. The final part is a Deformable convolution layer intended to make the proposed method robust to orientation changes. For convenience, we name the proposed method ASPD-Net.

Regarding the data source, we have constructed a large-scale Remote Sensing Object Counting dataset (abbreviated as RSOC). The dataset contains a total of 3,057 images and 286,539 instances covering four widely studied and concerned object types: buildings, ships, large vehicles, and small vehicles. All instances are accurately annotated with pixel-level center points. To the best of our knowledge, this is the largest dataset designed for the counting task in remote sensing images. We hope this dataset will motivate researchers in the remote sensing community to pay attention to the counting topic and promote research on aerial scene understanding.

In summary, we make the following contributions.

1) We propose a specially designed deep neural network, ASPD-Net, to attack the challenges of object counting in remote sensing images. The proposed ASPD-Net addresses the problems of scale and rotation variation by introducing a feature pyramid module and a deformable convolution module. In addition, an attention mechanism is incorporated to jointly aggregate multi-scale features and suppress cluttered backgrounds.

2) A large-scale remote sensing object counting dataset, termed RSOC, is constructed and released to the community to boost the development of object counting in the field of remote sensing. The RSOC dataset consists of 3,057 images with 286,539 annotations and covers four categories: buildings, ships, large-vehicles, and small-vehicles. To the best of our knowledge, this is the first attempt to build and release a dataset that facilitates research on object counting in remote sensing images.

3) We set a benchmark for the RSOC dataset by conducting extensive experiments, which demonstrate the effectiveness of our proposed method. We also make comparisons with several state-of-the-art counting methods, and the proposed method achieves superior performance on the RSOC dataset. Additionally, experiments on one widely used crowd counting dataset demonstrate the robustness and generalization ability of the proposed approach.

This paper extends and improves upon its previous version [41] in the following aspects: 1) more details and descriptions of the dataset, including the data collection, annotation manner and statistical information, are added; 2) we give a brief review of related work to enable readers to gain a comprehensive understanding of our work; 3) more experiments, including ablation studies and comparisons with state-of-the-art counting methods, are included to demonstrate the effectiveness and superiority of our approach.

The remainder of this paper is organized as follows. In Section II we briefly review works related to this paper. The dataset is presented in Section IV. We give the detailed structure of the proposed method and the experiments in Section III and Section V, respectively. Finally, the paper is concluded in Section VI.

II. RELATED WORK
A. Object counting
Contemporary works on object counting can be roughly classified into three categories: 1) detection-based, 2) regression-based, and 3) density map estimation based methods. Detection-based methods are early counting techniques that estimate the number of objects from the detection of object instances. The performance of these methods relies on sophisticated detectors and was not satisfactory, owing to the fact that object detection was still in a primitive stage in the early years. In recent years, with the advent of deep learning techniques, object detection has achieved significant progress. Object detectors such as Faster RCNN [37], SSD [39] and YOLO [38] have shown remarkable performance on various detection tasks such as face [42] and pedestrian [43] detection. Counting is completed spontaneously after detection by simply summing the number of detected instances. However, these approaches only work well in easy
cases, i.e., when objects are sparsely located and at relatively large scales; they may show poor performance in densely congested scenes, particularly in remote sensing images, since ground objects such as dense buildings, small vehicles, large vehicles and ships are far beyond current detection performance.

Regression-based methods, also known as global-regression based methods, count objects by mapping high-dimensional images into the real number space through regression models such as linear regression [44] or Gaussian mixture regression [45]. Elaborately designed hand-crafted features such as SIFT [46] and GLCM [47] are usually employed to represent the images. These methods are successful when dealing with objects of homogeneous scale and uniform distribution, but may fail when objects have varying scales and clustered distributions.

Lempitsky et al. [16] set a milestone for subsequent counting research by casting the visual object counting problem as image density estimation, in which the integral of the density map over a region gives the count of objects within it. Following this work, Pham et al. [48] learn a non-linear mapping function using random forest regression. Recently, in the deep learning era, many CNN based counting methods have been proposed and achieved great success, especially in crowd counting. Zhang et al. [1] propose a multi-column neural network (MCNN) with three branches, each constructed with different kernel sizes for capturing objects of various sizes. Sindagi et al. [49] improve the MCNN by jointly learning crowd count classification and density map estimation, integrating them into an end-to-end cascaded network. Sam et al. [5] propose a switching CNN that selects the most suitable regressor by training a switch classifier. Instead of designing a wider multi-column network, Li et al. [12] devise a deeper single-column network that utilizes dilated convolutions to enlarge the receptive fields and boost counting performance. In this paper, we also design a CNN based model for estimating density maps of remote sensing images by taking advantage of recently developed techniques such as the attention mechanism.
B. Attention mechanism
The attention mechanism has been incorporated into deep neural networks to improve performance on various computer vision tasks, including image captioning [50], [51], image question answering [52], video analysis [53], image classification [54], object counting [7], [55], and countless others. Bahdanau et al. [56] are among the first to introduce the attention mechanism, successfully applying it to machine translation. Chen et al. [57] propose to encode spatial- and channel-wise attention sequentially to improve image captioning performance. Wang et al. [58] put forward a non-local neural network that computes the response at a position as a weighted sum of feature maps. Woo et al. [59] devise a convolutional block attention module (CBAM) to enrich feature representations, which can be plugged into many feed-forward convolutional networks. The attention mechanism has also been introduced into the counting field. For instance, Liu et al. [7] incorporate an attention module to adaptively decide the appropriate counting mode for different locations in the image based on the real density conditions. Zhang et al. [55] propose an attentional neural field network, in which they introduce the non-local attention module [58] to expand the receptive field to the entire image, such that the method can deal with large scale variations. Our work arranges the channel and spatial attention modules in different manners; the rationale for the arrangement will be demonstrated in Section V-C. With the attention module incorporated into our proposed remote sensing object counting framework, we can highlight the regions of interest and alleviate the interference of cluttered backgrounds.
C. Dilated convolutions
Dilated convolution has been widely used in a host of vision tasks, owing to its prominent ability to enlarge the receptive field of a network without increasing the computational complexity. Yu et al. [60] design a dilated convolution-based network to capture multi-scale contextual features for better segmenting objects. Song et al. [61] leverage pyramid dilated convolution for video salient object detection. Li et al. [62] utilize dilated convolution for image de-raining. In the counting community, CSRNet [12] also employs dilated convolutions to design a deep convolutional neural network for crowd counting in congested scenes.

Different from previous approaches, our proposed work incorporates the attention mechanism to capture more contextual features and then concatenates several parallel dilated convolutions with different dilation rates to extract multi-scale features. Furthermore, a set of deformable convolution layers is leveraged to generate high-quality density maps that accurately locate the positions of objects, addressing the orientation arbitrariness caused by the overhead perspective of remote sensing images.

III. METHODOLOGY
The architecture of the proposed ASPD-Net is illustrated in Fig. 1. It comprises three stages. The front-end is a truncated VGG-16 [40] incorporated with an attention module to extract features of an input image. The mid-end is a scale pyramid module (SPM) built with four parallel dilated convolution layers to deal with scale variation. Finally, we equip the back-end with a set of deformable convolution layers to better capture orientation information, and add a 1x1 convolution layer to generate the density map.
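For readers who prefer code, a minimal PyTorch sketch of this three-stage composition is given below. It reflects only the overall data flow described above: the attention, SPM and deformable modules are stubbed with nn.Identity (their sketches appear in the following subsections), and the channel width fed into the final 1x1 layer is an assumption, not the authors' released configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class ASPDNet(nn.Module):
    """High-level composition only; the three stage-specific modules are
    stubbed with nn.Identity so this skeleton runs end to end."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        # Front-end: first 10 conv layers of VGG-16 (up to conv4_3);
        # the three 2x2 max-poolings shrink each spatial side by 8.
        self.front_end = nn.Sequential(*list(vgg.features.children())[:23])
        self.attention = nn.Identity()  # channel + spatial attention (Sec. III-A)
        self.mid_end = nn.Identity()    # scale pyramid module (Sec. III-B)
        self.back_end = nn.Identity()   # deformable convolutions (Sec. III-C)
        self.to_density = nn.Conv2d(512, 1, kernel_size=1)  # 1x1 conv -> density map

    def forward(self, x):
        x = self.front_end(x)           # (B, 512, H/8, W/8)
        x = self.back_end(self.mid_end(self.attention(x)))
        return self.to_density(x)

# The object count is the integral of the predicted density map:
# counts = ASPDNet()(image_batch).sum(dim=(1, 2, 3))
```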
Fig. 1: Architecture of ASPD-Net. The parameters of convolutional layers in the mid-end stage are denoted as "Conv-(kernel size)-(number of filters)-(dilation rate)", while the parameters in the back-end stage are represented as "Conv-(kernel size)-(number of filters)-(stride)".

A. Feature extraction with attention (front-end)

For a given remote sensing image of arbitrary size, we first feed it into a feature extractor composed of the first 10 convolution layers of the pre-trained VGG-16 [40]. VGG-16 is widely used as a backbone for counting models because of its strong generalization ability [5], [6], [12]; thus, in this work we also employ VGG-16 as the basis of our network. VGGs were initially designed and trained for the image classification task [63]. Although it has been proved that VGGs can generalize well to remote sensing images after fine-tuning [64], additional designs should be considered for counting tasks, since there is a significant gap between counting and classification: the features should be enhanced to better represent crowded objects.

Inspired by recent developments in attention mechanisms, especially [59] and [65], we incorporate attention modules to capture more contextual and high-level semantics. We consider both channel- and spatial-attention, which are described below.
Fig. 2: The detailed architecture of (a) channel-attention and (b) spatial-attention.

Channel-attention: In extremely dense scenes, textures of the foreground are hard to distinguish from those of the background; channel-attention can alleviate this problem. The architecture of channel-attention is depicted in Fig. 2(a). Concretely, for a given intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$ represents the number of channels and $H$ and $W$ denote the height and width of the feature map, a 1x1 convolution layer is first applied, and then, through reshape and transpose operations, two feature maps $C_1 \in \mathbb{R}^{C \times HW}$ and $C_2 \in \mathbb{R}^{HW \times C}$ are obtained. To generate the channel attention map, matrix multiplication and softmax operations are applied to $C_1$ and $C_2$, yielding a channel attention map $C_a$ of size $C \times C$. Specifically, the process can be formulated as follows:

$$C_a^{ji} = \frac{\exp\left(C_1^i \cdot C_2^j\right)}{\sum_{i=1}^{C} \exp\left(C_1^i \cdot C_2^j\right)} \quad (1)$$

where $C_a^{ji}$ measures the influence of the $i$-th channel on the $j$-th channel. The final weighted feature map with channel attention, of size $C \times H \times W$, is calculated as

$$C_{final}^j = \lambda \sum_{i=1}^{C} \left(C_a^{ji} \cdot C_1^i\right) + F^j \quad (2)$$

where $\lambda$ is a learnable parameter, which can be learned through a 1x1 convolution operation.

Spatial-attention: Observing that different locations of the feature map have different densities, we further encode long-range dependencies along the spatial dimension, which benefits the modeling of spatial locations. The architecture of the spatial-attention, illustrated in Fig. 2(b), is similar to the channel-attention described above, but differs in two respects: 1) instead of the single 1x1 convolution layer adopted in channel-attention, three 1x1 convolution layers are used; 2) in contrast to the channel attention map $C_a$ of size $C \times C$, the spatial attention map $S_a$ has size $HW \times HW$. With reshaped maps $S_1 \in \mathbb{R}^{HW \times C}$ and $S_2 \in \mathbb{R}^{C \times HW}$, $S_a$ can be computed as follows:

$$S_a^{ji} = \frac{\exp\left(S_1^i \cdot S_2^j\right)}{\sum_{i=1}^{HW} \exp\left(S_1^i \cdot S_2^j\right)} \quad (3)$$

where $S_a^{ji}$ denotes the influence of the $i$-th position on the $j$-th position of the feature map; the more similar two positions are, the stronger the correlation between them. Then the final weighted feature map with spatial attention, of size $C \times H \times W$, is obtained as

$$S_{final}^j = \mu \sum_{i=1}^{HW} \left(S_a^{ji} \cdot S_1^i\right) + F^j \quad (4)$$

where $\mu$ is a learnable parameter, learned with the same operations as $\lambda$ in Eq. (2).
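To make the channel-attention branch concrete, the following PyTorch sketch implements Eqs. (1) and (2) under our reading of the text; the 1x1 convolution width and the zero initialization of $\lambda$ are assumptions (the latter is a common choice for such residual attention terms, not something the text specifies). The spatial-attention branch is analogous, with three 1x1 convolutions and an HWxHW attention map.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Minimal sketch of the channel-attention branch, Eqs. (1)-(2)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # the 1x1 conv
        self.softmax = nn.Softmax(dim=-1)
        self.lam = nn.Parameter(torch.zeros(1))  # learnable lambda of Eq. (2)

    def forward(self, f):                  # f: (B, C, H, W)
        b, c, h, w = f.shape
        x = self.conv(f)
        c1 = x.view(b, c, h * w)           # C1: (B, C, HW)
        c2 = c1.permute(0, 2, 1)           # C2: (B, HW, C)
        attn = self.softmax(torch.bmm(c1, c2))       # (B, C, C), Eq. (1)
        out = torch.bmm(attn, c1).view(b, c, h, w)   # reweighted channels
        return self.lam * out + f          # residual sum, Eq. (2)
```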
B. Scale pyramid module (mid-end)

There are multiple max-pooling operations in the front-end stage, leading to a dramatic decrease in the size of the feature map: the output map is only 1/64 the size of the input image. This brings two drawbacks. First, a density map indicates the localization of objects in the image; pooling makes the features invariant to local translations, which is good for classification but harmful for object localization, and here is a barrier to generating high-quality density maps. Second, the information of small objects is weakened as the spatial resolution of the feature map decreases, making the network blind to small objects.

To address these problems, inspired by [66], we introduce a scale pyramid module (SPM) built with four parallel dilated convolution layers. Dilated convolution is convolution with holes, as illustrated in Fig. 3. It was first introduced by Yu et al. [60] in the segmentation task to expand receptive fields without losing the spatial resolution of feature maps. In addition, it introduces no extra parameters or computations, making it an excellent choice for dense prediction tasks, e.g., scene parsing [67] and object counting [12], [66].

In this work, all four layers have the same number of channels but different dilation rates to capture information at different scales. We use dilation rates of 2, 4, 8 and 12, as suggested in [66]. In this way, we build a pyramid with different receptive fields that keeps the spatial resolution of the feature maps unchanged while being robust to scale variations. The structure of the SPM is depicted in Fig. 3.

Fig. 3: The structure of the SPM. The parameters of convolutional layers are denoted as "Conv-(kernel size)-(number of filters)-(dilation rate)".
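A possible PyTorch sketch of the SPM follows: four parallel 3x3 convolutions with dilation rates 2, 4, 8 and 12, each padded by its dilation rate so that the spatial resolution is preserved. How the four branches are merged is not spelled out above, so the concatenation followed by a 1x1 reduction is our assumption.

```python
import torch
import torch.nn as nn

class ScalePyramidModule(nn.Module):
    """Sketch of the SPM (Fig. 3): four parallel dilated 3x3 convolutions."""
    def __init__(self, channels, rates=(2, 4, 8, 12)):
        super().__init__()
        self.branches = nn.ModuleList([
            # padding = dilation keeps H and W unchanged for a 3x3 kernel
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        # Assumed fusion: concatenate branches, reduce back with a 1x1 conv.
        self.fuse = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```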
C. Deformable convolution module (back-end)
Deformable convolution [68] is an operation that adds a learnable offset to each point in the receptive field of the feature map. With the offsets, the shape of the receptive field matches the actual shape of the object rather than a square; the advantage is that no matter how deformed the object is, the region of convolution always covers it. The diagram and a visualization of deformable convolution are illustrated in Fig. 4 and Fig. 5, respectively.

Fig. 4: Diagram of the deformable convolution module [68].

For a normal 3x3 convolution with dilation 1, the sampling locations $p_m$ form the regular grid $\mathcal{M} = \{(-1,-1), (-1,0), \ldots, (1,0), (1,1)\}$. The output feature map $y(p)$ at location $p$ can then be formulated as

$$y(p) = \sum_{m=1}^{M} w(p_m) \cdot x(p + p_m) \quad (5)$$

where $w$ represents the weight parameters and $x$ is the input feature map.

In contrast to normal convolution, deformable convolution adds an adaptive learnable offset $\Delta p_m$, which is optimized during training [68]. The deformable convolution output $y(p)$ can therefore be represented as

$$y(p) = \sum_{m=1}^{M} w(p_m) \cdot x(p + p_m + \Delta p_m) \quad (6)$$

Specifically, we adopt three deformable convolution layers with 3x3 kernels, each followed by a ReLU activation, and then a 1x1 convolution to generate the density map. The final object count is computed by summing all pixel values of the density map. With the dynamic sampling scheme of deformable convolution, the orientation arbitrariness caused by the overhead perspective of remote sensing imagery can be well addressed.
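The back-end can be sketched with torchvision's DeformConv2d, which applies the per-pixel offsets $\Delta p_m$ of Eq. (6) passed as a second input and predicted here by an ordinary convolution. The intermediate channel widths below are illustrative assumptions, not the authors' exact configuration.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLayer(nn.Module):
    """One 3x3 deformable convolution (Eq. (6)): a plain conv predicts the
    2*3*3 = 18 per-pixel offsets, which DeformConv2d uses when sampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 18, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.deform(x, self.offset(x)))

# Back-end sketch: three deformable layers, then a 1x1 conv to one channel.
backend = nn.Sequential(DeformableLayer(512, 256),
                        DeformableLayer(256, 128),
                        DeformableLayer(128, 64),
                        nn.Conv2d(64, 1, kernel_size=1))
```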
D. Ground truth generation

We generate the ground truth density maps following the procedure of previous works [1], [12], [66]. Assuming there is an object instance at pixel $x_i$, it can be represented by a delta function $\delta(x - x_i)$. Therefore, an image with $N$ annotations can be represented as

$$H(x) = \sum_{i=1}^{N} \delta(x - x_i) \quad (7)$$
Fig. 5: Visualization of deformable sampling locations. (a) standard convolution; (b) deformable convolution.

Fig. 6: Visualization of ground truth density maps generated via the Gaussian convolution operation.
To generate the density map $F$, we convolve $H(x)$ with a Gaussian kernel:

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x) \quad (8)$$

where $\sigma_i$ represents the standard deviation. Empirically, we adopt a fixed kernel with $\sigma = 15$ for all experiments. Ground truth density maps generated by this Gaussian convolution are visualized in Fig. 6.
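A minimal sketch of this ground-truth generation, using scipy's Gaussian filter as the fixed-$\sigma$ kernel of Eq. (8):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gt_density_map(points, height, width, sigma=15):
    """Eq. (8) with a fixed Gaussian kernel: place a unit impulse at every
    annotated center point, then blur. The filter preserves total mass (up
    to truncation at the image border), so the map integrates to the count."""
    h_map = np.zeros((height, width), dtype=np.float32)
    for x, y in points:  # points: iterable of (col, row) center annotations
        h_map[min(int(y), height - 1), min(int(x), width - 1)] += 1.0
    return gaussian_filter(h_map, sigma)
```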
E. Loss function

We adopt the Euclidean distance as the loss function to evaluate the difference between the predicted density maps and the ground truths, as widely adopted in other counting works [1], [3], [5]. The $L_2$ loss function is defined as

$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(X_i; \Theta) - F_i^{GT} \right\|_2^2 \quad (9)$$

where $N$ is the batch size, $X_i$ represents the input image, $\Theta$ denotes the trainable parameters, and $F(X_i; \Theta)$ and $F_i^{GT}$ are an estimated density map and its corresponding ground truth, respectively.
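In PyTorch this loss is a one-liner; the sketch below assumes pred and gt are batched (N, 1, H, W) tensors.

```python
def density_loss(pred, gt):
    """Eq. (9): L(Theta) = 1/(2N) * sum_i ||F(X_i; Theta) - F_i^GT||_2^2."""
    n = pred.shape[0]  # batch size N
    return ((pred - gt) ** 2).sum() / (2 * n)
```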
IV. REMOTE SENSING OBJECT COUNTING (RSOC) DATASET
The absence of publicly available datasets, especially large-scale ones, seriously limits the progress of object counting in the remote sensing field. At present, only a few datasets are available to the community, and they are either small in scale or easy enough that counting can be achieved with an off-the-shelf detector. For example, the olive trees dataset [24] has only 10 images and 1,251 instances in total; such a small number of instances easily leads to overfitting for deep learners. Despite the 32,716 cars in the COWC dataset [28], it contains so much contextual information that it is more suitable for detection tasks. CARPK [27] is a newly collected drone-based dataset with nearly 90,000 cars annotated with bounding boxes; the objects in this dataset are scattered, again making it more suitable for detection than for counting. More recently, a large-scale aerial image dataset named DLR-ACD [26] was proposed for the counting task. This dataset is quite large, containing 226,291 instances, but it has only 33 images; more importantly, it is annotated for crowd counting and thus cannot be used for geographic object counting tasks. Brief statistics of these four datasets are given in Table I. To facilitate counting research in the remote sensing community, we construct a large-scale remote sensing object counting dataset, termed RSOC. It consists of 3,057 images and 286,539 instances in total. To the best of our knowledge, this is the largest dataset available for the remote sensing object counting task. Some representative samples are presented in Fig. 7.
Data collection.
Four types of objects, involving buildings, small vehicles, large vehicles, and ships, are included in our RSOC dataset; these four types are among the main concerns of research in the remote sensing community. The images of buildings are collected from Google Earth, while the other three categories are collected from the DOTA dataset [33], a very large dataset built for object detection in aerial images. During collection, easy cases, i.e., images containing only dispersely distributed objects, are removed from RSOC, since we focus on clustered instances and the count of disperse ones can be easily inferred with an off-the-shelf detector. As a consequence, only hard samples remain, such as crowded inshore ships and intensively packed vehicles in parking lots. In the end, 280 images of small vehicles, 172 images of large vehicles, and 137 images of ships are included in our dataset. Together with the 2,468 building images collected from Google Earth, the RSOC dataset has 3,057 images in total. Each subset is divided into training and testing sets as illustrated in Table I. The instance distribution of each subset is plotted in Fig. 9.
Annotation.
To reduce the workload and speed up the annotation, the buildings are annotated with center points.
Fig. 7: Example images from the RSOC dataset. Four widely studied and concerned object categories, including building, small-vehicle, large-vehicle and ship, are included in this dataset.
Images from DOTA [33] are labeled with rotated quadrilateral bounding boxes enclosing the objects. The labels are denoted as $\{(x_i, y_i), i = 1, 2, 3, 4\}$, where $(x_i, y_i)$ indicates the positions of the vertices of the boxes in the image. We take the centroid of the bounding box as the central location, which can be calculated as follows:

$$(x, y) = \left( \frac{1}{4} \sum_{i=1}^{4} x_i, \; \frac{1}{4} \sum_{i=1}^{4} y_i \right) \quad (10)$$

The annotation process of our constructed dataset is depicted in Fig. 8.
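The centroid computation of Eq. (10) amounts to averaging the four vertices:

```python
def quad_center(vertices):
    """Eq. (10): centroid of a rotated quadrilateral annotation
    {(x_i, y_i), i = 1..4}, used as the object's center point."""
    xs, ys = zip(*vertices)
    return sum(xs) / 4.0, sum(ys) / 4.0
```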
Statistics. More information on this dataset is given in Table I, from which we can observe that the RSOC dataset has distinct features compared with other datasets: 1) large data capacity: as mentioned before, the RSOC dataset consists of four categories, 3,057 images, and 286,539 instances; it is the largest counting dataset for remote sensing image understanding to date; 2) large scale variation: the size of objects in the dataset ranges from a dozen pixels to thousands of pixels, making it extremely challenging for counting; 3) diverse scenes and categories: the dataset covers a variety of scenes including parking lots, towns, villages, harbors and so on, each with specific annotations such as buildings, vehicles and ships.
V. EXPERIMENTS
In this section, we benchmark the RSOC dataset by conducting extensive experiments on it. In addition, ASPD-Net is evaluated on RSOC and compared with previous state-of-the-art counting methods to demonstrate its superiority. Beyond these comparisons, ablation studies are provided to verify the effectiveness of ASPD-Net. We also conduct experiments on one popular crowd counting dataset, the ShanghaiTech dataset [1], to further demonstrate the robustness and generalization ability of our proposed approach.
A. Evaluation metrics
Two widely used metrics, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), are employed to evaluate the performance of the proposed and comparison methods. MAE measures the accuracy of a model, while RMSE measures its robustness. These two metrics are defined as follows:
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{C}_i - C_i \right| \quad (11)$$

$$RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left| \hat{C}_i - C_i \right|^2 } \quad (12)$$

where $n$ is the number of test images, $\hat{C}_i$ denotes the predicted count for the $i$-th image, and $C_i$ indicates the ground truth count.
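A direct NumPy transcription of Eqs. (11) and (12):

```python
import numpy as np

def mae_rmse(pred_counts, gt_counts):
    """Eqs. (11)-(12) over n test images."""
    err = np.asarray(pred_counts, dtype=np.float64) - np.asarray(gt_counts)
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())
```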
Fig. 8: The annotation process for the dataset construction. (a) Annotation with center points for buildings; (b) annotation with oriented bounding boxes, which are then transformed into center points (the image is derived from the DOTA dataset).
Fig. 9: Distributions of the number of instances per image for the RSOC dataset: (a) RSOC_building, (b) RSOC_small-vehicle, (c) RSOC_large-vehicle, (d) RSOC_ship.

B. Implementation details
We implement the proposed ASPD-Net in PyTorch [69] and train and test it on an NVIDIA 2080Ti GPU. The first 10 convolution layers are fine-tuned from VGG-16 [40], and the other layers are initialized with Gaussian noise of 0.01 standard deviation. ASPD-Net is trained in an end-to-end manner. During training, we employ stochastic gradient descent (SGD) to optimize the network and set the learning rate to 1e-5. For the building subset we adopt a batch size of 32, and for the other three subsets a batch size of 1. For the ShanghaiTech crowd counting dataset, we inherit the training manner of [12]. All training reaches convergence within 400 epochs.

TABLE I: Brief statistical information of the proposed RSOC and four other counting datasets in the remote sensing field.

Dataset            Platform    Images   Training/test   Annotation format   Total     Min    Average   Max
Olive trees [24]   UAV         10       -               bounding box        1,251     -      -         -
COWC [28]          aerial      -        -               center point        32,716    -      -         -
CARPK [27]         drone       -        -               bounding box        89,777    1      62        188
DLR-ACD [26]       aerial      33       19/14           center point        226,291   -      -         -
Building           satellite   2,468    -               center point        76,215    15     30.88     142
Small-vehicle      satellite   280      222/58          center point        -         -      -         -
Large-vehicle      satellite   172      -               center point        -         -      -         -
Ship               satellite   137      -               center point        -         -      -         -

Fig. 10: Configurations of channel- and spatial-attention modules.

We apply data augmentation to generate more training samples. For each image in the training set, nine sub-images with 1/4 the size of the original image are cropped: four of them are the adjacent non-overlapping quadrants and the other five are cropped randomly. A mirror flip is then applied to double them. Since the ship, large-vehicle and small-vehicle images are large, which would exhaust GPU memory during training, we resize all large images to 1024x768 pixels before data augmentation.
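A sketch of this augmentation for the image alone (in practice the ground-truth density map must be cropped and flipped identically); PIL is assumed for image handling:

```python
import random
from PIL import Image

def augment(img):
    """Nine crops of 1/4 the original area (half width, half height):
    the four non-overlapping quadrants plus five random crops,
    each doubled by a horizontal mirror flip."""
    w, h = img.size
    cw, ch = w // 2, h // 2
    boxes = [(0, 0), (cw, 0), (0, ch), (cw, ch)]                   # quadrants
    boxes += [(random.randint(0, w - cw), random.randint(0, h - ch))
              for _ in range(5)]                                   # random crops
    crops = [img.crop((x, y, x + cw, y + ch)) for x, y in boxes]
    return crops + [c.transpose(Image.FLIP_LEFT_RIGHT) for c in crops]
```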
C. Arrangement of the channel- and spatial-attention
There are several possible configurations when integrating channel- and spatial-attention, as shown in Fig. 10. Here we test all the combinations on top of MCNN [1] by embedding the attention modules into it; we choose MCNN because it is easy to implement and train. An example test model is shown in Fig. 11. We conduct experiments on the RSOC_building subset, with the results reported in Table II. We find that incorporating an attention module, either spatial or channel attention, significantly improves the performance by a large margin. Channel attention performs slightly better than spatial attention. Combining the two attentions further boosts the performance, and sequential assemblies are superior to the parallel one, which is consistent with [59]. The 'channel+spatial' configuration obtains the best result; thus, in the following experiments, we employ it as the attention module embedded into the proposed ASPD-Net.
TABLE II: Impacts of different attention configurations.
Method                                      RSOC_building
                                            MAE      RMSE
MCNN [1]                                    13.65    18.56
MCNN [1] + spatial                          11.41    16.09
MCNN [1] + channel                          11.20    15.92
MCNN [1] + channel & spatial in parallel    11.21    15.83
MCNN [1] + spatial + channel                11.11    15.77
MCNN [1] + channel + spatial                -        -
Fig. 11: An example model for evaluating configurations of attention modules.
D. Ablation study
To better quantify the contributions of the different modules of our method, we conduct an ablation study on the RSOC_building subset with simplified models:

• Baseline: a method similar to CSRNet [12], composed of 10 convolution layers carved from VGG-16 and 6 dilated convolution layers with dilation rate 2.
• Baseline+Att: a sequential channel-spatial attention module (Att) added on top of the baseline.
• Baseline+Att+SPM: the Att module and the scale pyramid module (SPM) added on top of the baseline.
• Baseline+Att+SPM+DCM: the proposed ASPD-Net.

The results of the ablation experiments are tabulated in Table III. From the table, we can see that each component of our network contributes to the performance improvement. Specifically, the naive baseline does not achieve optimal performance; the attention module captures the global and local dependencies of the feature maps, and the scale pyramid module further improves performance by capturing multi-scale information. A visualization of density maps on the RSOC_building subset is depicted in Fig. 12. It can be intuitively observed that by incorporating the attention module, scale pyramid module, and deformable convolution module into a unified framework,
our proposed ASPDNet obtains superior counting performance and accurate predictions.

TABLE III: Ablation experiments on the RSOC dataset.

Method                         RSOC_building   RSOC_small-vehicle   RSOC_large-vehicle   RSOC_ship
                               MAE     RMSE    MAE      RMSE        MAE      RMSE        MAE      RMSE
Baseline                       8.00    11.78   443.72   1252.22     34.10    46.42       240.01   394.81
Baseline+Att                   7.92    11.67   439.51   1248.95     28.75    41.63       228.45   365.37
Baseline+Att+SPM               7.85    11.58   436.84   1243.73     24.38    36.54       211.58   334.53
Baseline+Att+SPM+DCM (ours)    -       -       -        -           -        -           -        -

Fig. 12: Density maps generated at each stage of our proposed method. (a) Inputs; (b)-(f) the ground truth and the density maps generated by Baseline, Baseline+Att, Baseline+Att+SPM and our full method, respectively.
E. Comparison with state-of-the-art methods on the RSOC dataset
We compare with several state-of-the-art counting methods, which were originally proposed for crowd counting but are applicable to object counting in remote sensing images. We report the results in Table IV. Our approach achieves the best performance on the RSOC dataset. Specifically, compared with the second-best method, ASPDNet improves performance by 1.94% MAE and 7.14% RMSE on the RSOC_building subset, 0.47% MAE and 0.77% RMSE on the RSOC_small-vehicle subset, 35.40% MAE and 33.09% RMSE on the RSOC_large-vehicle subset, and 3.86% MAE and 4.18% RMSE on the RSOC_ship subset. From these metrics, we can see that even though we achieve the best performance on the proposed RSOC dataset, there is still a large margin for improvement, especially for small objects such as small vehicles and ships. This is consistent with other counting tasks: the more congested the scene, the more challenging the counting. It also indicates the challenging nature of the proposed RSOC dataset, and we hope this will encourage more research effort on it.

Fig. 13 depicts visualization results for sample images from the RSOC_building, RSOC_small-vehicle, RSOC_large-vehicle and RSOC_ship subsets. Our proposed method produces high-quality density maps with small count errors. Even under the extreme conditions of scale variation, complex cluttered backgrounds, and orientation arbitrariness, our approach remains robust. Meanwhile, the generated density maps show that our model has accurate localization ability to some extent.
F. Comparison with state-of-the-art methods on a crowd counting dataset
To further demonstrate the effectiveness and generalization ability of our proposed method, we apply it to one crowd counting dataset, the ShanghaiTech dataset [1]. This dataset is composed of 1,198 annotated images with a total of 330,165 annotated people, split into two parts, Part A and Part B. Part A contains 482 highly congested images randomly crawled from the Internet, of which 300 serve for training and the remaining for testing. Part B contains 716 images of relatively sparse crowds taken from the busy streets of metropolitan areas in Shanghai, China, of which 400 images are for training and the remaining 316 for testing. We report the comparison results in Table V. Compared with the state-of-the-art methods, our proposed method obtains the best results, demonstrating strong robustness and generalization ability. Specifically, compared with the baseline method CSRNet [12], it gains 10.85% / 16.35% (MAE / RMSE) on Part A, and 32.08% / 34.38% (MAE / RMSE) on Part B. Some visualizations of density maps are depicted in Fig. 13, which shows that the proposed method still generates accurate predictions for diverse objects, from sparse to dense scenarios.
TABLE IV: Performance comparison on our constructed RSOC dataset. The top three performances are highlighted with bold, underline and wavy underline.

Methods               Building        Small vehicle      Large vehicle    Ship
                      MAE     RMSE    MAE      RMSE      MAE     RMSE     MAE      RMSE
MCNN [1]              13.65   16.56   488.65   1317.44   36.56   55.55    263.91   412.30
CMTL [49]             12.78   15.99   490.53   1321.11   61.02   78.25    251.17   403.07
CSRNet [12]           -       -       -        -         -       -        -        -
ASPDNet (proposed)    -       -       -        -         -       -        -        -
Fig. 13: Visualization of density maps for ASPDNet on various datasets. For each example, from left to right: the original image, the ground-truth density map, and the predicted density map. From left to right, top to bottom: RSOC_building, RSOC_small-vehicle, RSOC_large-vehicle, RSOC_ship, ShanghaiTech Part A [1], and ShanghaiTech Part B [1].

TABLE V: Performance comparison on the ShanghaiTech crowd counting dataset. The best performance is highlighted with bold.

Methods               ShanghaiTech Part A    ShanghaiTech Part B
                      MAE      RMSE          MAE     RMSE
MCNN [1]              110.2    173.2         26.4    41.3
CMTL [49]             101.3    152.4         20.0    31.1
Switch-CNN [5]        90.4     135.0         21.6    33.4
IGCNN [73]            72.5     118.2         13.6    21.1
CSRNet [12]           68.2     115.0         10.6    16.0
SANet [11]            67.0     104.5         8.4     13.6
SFCN [70]             64.8     107.5         7.6     13.0
SPN [66]              61.7     99.5          9.4     14.4
SCAR [65]             66.3     114.1         9.5     15.2
ADCrowdNet [74]       70.9     115.2         7.7     12.9
BL [75]               62.8     101.8         7.7     12.7
ASPDNet (proposed)    60.8     96.2          7.2     10.5

G. Standard deviation experiments
TABLE VI: Standard deviations of our method on the four categories.

Methods    Building            Small vehicle       Large vehicle       Ship
           MAE       RMSE      MAE       RMSE      MAE       RMSE      MAE       RMSE
Ours       7.59 ± -  -         -         -         -         -         -         -

To validate the stability of our proposed method, following [76], we also report the standard deviations of our method on the constructed dataset; see Table VI for details. Note that the standard deviations are computed over 5 trials.
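The text does not state how the deviations are aggregated; a straightforward NumPy computation over the five per-trial values would be:

```python
import numpy as np

def mean_and_std(per_trial_values):
    """Aggregate a metric (e.g. MAE) over the 5 independent training runs;
    np.std here is the population standard deviation (ddof=0)."""
    v = np.asarray(per_trial_values, dtype=np.float64)
    return v.mean(), v.std()
```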
VI. CONCLUSION AND FUTURE WORK
Counting object instances in remote sensing images is a remarkably significant yet scientifically challenging topic. To address it, we devise ASPD-Net, which incorporates an attention mechanism, a scale pyramid module, and a deformable convolution module into a unified framework. Moreover, considering that the development of this field has been limited mainly by the scarcity of large-scale, accurately annotated datasets, we present a new large-scale remote sensing object counting dataset encompassing 4 object categories, 3,057 images, and 286,539 instances in total. Extensive quantitative and qualitative experimental results demonstrate the effectiveness and superiority of our proposed approach compared with off-the-shelf state-of-the-art crowd counting methods. In addition, to further validate the effectiveness of each component and the generalization ability of the designed ASPD-Net, ablation studies on our constructed RSOC dataset and experiments on one widely used crowd counting dataset are also conducted. We expect that our contribution can bridge the gap and guide new developments in object counting for remote sensing imagery.

Nevertheless, there remain some drawbacks, such as the class imbalance of the proposed RSOC dataset and the suboptimal performance of the proposed method on small object counting. In the future, we therefore plan to collect more images from various platforms and devise better models to alleviate the small object counting problem. Meanwhile, we intend to design dedicated class-agnostic algorithms for remote sensing object counting.

REFERENCES
[1] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-image crowd counting via multi-column convolutional neural network," in CVPR, 2016, pp. 589-597.
[2] D. Onoro-Rubio and R. J. López-Sastre, "Towards perspective-free object counting with deep learning," in ECCV. Springer, 2016, pp. 615-629.
[3] L. Boominathan, S. S. Kruthiventi, and R. V. Babu, "CrowdNet: A deep convolutional network for dense crowd counting," in ACM MM. ACM, 2016, pp. 640-644.
[4] D. Kang and A. Chan, "Crowd counting by adaptively fusing predictions from an image pyramid," arXiv preprint arXiv:1805.06115, 2018.
[5] D. B. Sam, S. Surya, and R. V. Babu, "Switching convolutional neural network for crowd counting," in CVPR. IEEE, 2017, pp. 4031-4039.
[6] V. A. Sindagi and V. M. Patel, "Generating high-quality crowd density maps using contextual pyramid CNNs," in ICCV, 2017, pp. 1861-1870.
[7] J. Liu, C. Gao, D. Meng, and A. G. Hauptmann, "DecideNet: Counting varying density crowds through attention guided detection and density estimation," in CVPR, 2018, pp. 5197-5206.
[8] M. Hossain, M. Hosseinzadeh, O. Chanda, and Y. Wang, "Crowd counting using scale-aware attention networks," in WACV. IEEE, 2019, pp. 1280-1288.
[9] L. Zhang, M. Shi, and Q. Chen, "Crowd counting via scale-adaptive convolutional neural network," in WACV. IEEE, 2018, pp. 1113-1121.
[10] J. Sang, W. Wu, H. Luo, H. Xiang, Q. Zhang, H. Hu, and X. Xia, "Improved crowd counting method based on scale-adaptive convolutional neural network," IEEE Access, 2019.
[11] X. Cao, Z. Wang, Y. Zhao, and F. Su, "Scale aggregation network for accurate and efficient crowd counting," in ECCV, 2018, pp. 734-750.
[12] Y. Li, X. Zhang, and D. Chen, "CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes," in CVPR, 2018, pp. 1091-1100.
[13] R. R. Varior, B. Shuai, J. Tighe, and D. Modolo, "Scale-aware attention network for crowd counting," CVPR, 2019.
[14] Y. Wang and Y. Zou, "Fast visual object counting via example-based density estimation," in ICIP. IEEE, 2016, pp. 3653-3657.
[15] E. Walach and L. Wolf, "Learning to count with CNN boosting," in ECCV. Springer, 2016, pp. 660-676.
[16] V. Lempitsky and A. Zisserman, "Learning to count objects in images," in NIPS, 2010, pp. 1324-1332.
[17] C. Arteta, V. Lempitsky, and A. Zisserman, "Counting in the wild," in ECCV. Springer, 2016, pp. 483-498.
[18] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua, "Visual translation embedding network for visual relation detection," in CVPR, 2017, pp. 5532-5540.
[19] S. Zhang, G. Wu, J. P. Costeira, and J. M. Moura, "FCN-rLSTM: Deep spatio-temporal neural networks for vehicle counting in city cameras," in ICCV, 2017, pp. 3667-3676.
[20] R. Guerrero-Gómez-Olmedo, B. Torre-Jiménez, R. López-Sastre, S. Maldonado-Bascón, and D. Onoro-Rubio, "Extremely overlapping vehicle counting," in PRIA. Springer, 2015, pp. 423-431.
[21] G. French, M. Fisher, M. Mackiewicz, and C. Needle, "Convolutional neural networks for counting fish in fisheries surveillance video," MVAB, pp. 1-7, 2015.
[22] B. Zhan, D. N. Monekosso, P. Remagnino, S. A. Velastin, and L.-Q. Xu, "Crowd analysis: a survey," MVA, vol. 19, no. 5-6, pp. 345-357, 2008.
[23] Y. Bazi, H. Al-Sharari, and F. Melgani, "An automatic method for counting olive trees in very high spatial remote sensing images," in IGARSS, vol. 2. IEEE, 2009, pp. 125-128.
[24] E. Salamí, A. Gallardo, G. Skorobogatov, and C. Barrado, "On-the-fly olive trees counting using a UAS and cloud services," Remote Sens., vol. 11, no. 3, p. 316, 2019.
[25] N. A. Mubin, E. Nadarajoo, H. Z. M. Shafri, and A. Hamedianfar, "Young and mature oil palm tree detection and counting using convolutional neural network deep learning method," IJRS, vol. 40, no. 19, pp. 7500-7515, 2019.
[26] R. Bahmanyar, E. Vig, and P. Reinartz, "MRCNet: Crowd counting and density map estimation in aerial and ground imagery," in BMVC-ODRSS, 2019, pp. 1-12.
[27] M.-R. Hsieh, Y.-L. Lin, and W. H. Hsu, "Drone-based object counting by spatially regularized regional proposal network," in ICCV, 2017, pp. 4145-4153.
[28] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye, "A large contextual dataset for classification, detection and counting of cars with deep learning," in ECCV. Springer, 2016, pp. 785-800.
[29] M. M. Rathore, A. Ahmad, A. Paul, and S. Rho, "Urban planning and building smart cities based on the internet of things using big data analytics," Computer Networks, vol. 101, pp. 63-80, 2016.
[30] J.-F. Pekel, A. Cottam, N. Gorelick, and A. S. Belward, "High-resolution mapping of global surface water and its long-term changes," Nature, vol. 540, no. 7633, p. 418, 2016.
[31] L. Guan, Y. Ding, X. Feng, and H. Zhang, "Digital Beijing construction and application based on the urban three-dimensional modelling and remote sensing monitoring technology," in IGARSS. IEEE, 2016, pp. 7299-7302.
[32] Y. Fan, Q. Wen, W. Wang, P. Wang, L. Li, and P. Zhang, "Quantifying disaster physical damage using remote sensing data—a technical work flow and case study of the 2014 Ludian earthquake in China," International Journal of Disaster Risk Science, vol. 8, no. 4, pp. 471-488, 2017.
[33] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, "DOTA: A large-scale dataset for object detection in aerial images," in CVPR, 2018, pp. 3974-3983.
[34] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, "Object detection in optical remote sensing images: A survey and a new benchmark," ISPRS P&RS, vol. 159, pp. 296-307, 2020.
[35] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, "AID: A benchmark data set for performance evaluation of aerial scene classification," TGRS, vol. 55, no. 7, pp. 3965-3981, 2017.
[36] G. Cheng, J. Han, and X. Lu, "Remote sensing image scene classification: Benchmark and state of the art," Proc. of the IEEE, vol. 105, no. 10, pp. 1865-1883, 2017.
[37] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015, pp. 91-99.
[38] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in CVPR, 2016, pp. 779-788.
[39] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in ECCV. Springer, 2016, pp. 21-37.
[40] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[41] G. Gao, Q. Liu, and Y. Wang, "Counting dense objects in remote sensing images," in ICASSP, 2020.
[42] S. Yang, P. Luo, C.-C. Loy, and X. Tang, "WIDER FACE: A face detection benchmark," in CVPR, 2016, pp. 5525-5533.
[43] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," TPAMI, vol. 34, no. 4, pp. 743-761, 2012.
[44] N. Paragios and V. Ramesh, "A MRF-based approach for real-time subway monitoring," in CVPR, vol. 1. IEEE, 2001, pp. I-I.
[45] Y. Tian, L. Sigal, H. Badino, F. De la Torre, and Y. Liu, "Latent Gaussian mixture regression for human pose estimation," in ACCV. Springer, 2010, pp. 679-690.
[46] D. G. Lowe, "Object recognition from local scale-invariant features," in ICCV, vol. 2, 1999, pp. 1150-1157.
[47] R. M. Haralick, K. Shanmugam et al., "Textural features for image classification," IEEE Trans. Syst., Man, Cybern., no. 6, pp. 610-621, 1973.
[48] V.-Q. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada, "Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation," in ICCV, 2015, pp. 3253-3261.
[49] V. A. Sindagi and V. M. Patel, "CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting," in AVSS. IEEE, 2017, pp. 1-6.
[50] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015, pp. 2048-2057.
[51] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in CVPR, 2016, pp. 4651-4659.
[52] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for image question answering," in CVPR, 2016, pp. 21-29.
[53] Y. Huang, X. Cao, X. Zhen, and J. Han, "Attentive temporal pyramid network for dynamic scene classification," in AAAI, vol. 33, 2019, pp. 8497-8504.
[54] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," in CVPR, 2017, pp. 3156-3164.
[55] A. Zhang, L. Yue, J. Shen, F. Zhu, X. Zhen, X. Cao, and L. Shao, "Attentional neural fields for crowd counting," in ICCV, 2019, pp. 5714-5723.
[56] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[57] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, "SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning," in CVPR, 2017, pp. 5659-5667.
[58] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in CVPR, 2018, pp. 7794-7803.
[59] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, "CBAM: Convolutional block attention module," in ECCV, 2018, pp. 3-19.
[60] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," ICLR, 2015.
[61] H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam, "Pyramid dilated deeper ConvLSTM for video salient object detection," in ECCV, 2018, pp. 715-731.
[62] X. Li, J. Wu, Z. Lin, H. Liu, and H. Zha, "Recurrent squeeze-and-excitation context aggregation net for single image deraining," in ECCV, 2018, pp. 254-269.
[63] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," IJCV, vol. 115, no. 3, pp. 211-252, 2015.
[64] X. Liu, M. Chi, Y. Zhang, and Y. Qin, "Classifying high resolution remote sensing images by fine-tuned VGG deep networks," in IGARSS. IEEE, 2018, pp. 7137-7140.
[65] J. Gao, Q. Wang, and Y. Yuan, "SCAR: Spatial-/channel-wise attention regression networks for crowd counting," Neurocomputing, vol. 363, pp. 1-8, 2019.
[66] X. Chen, Y. Bin, N. Sang, and C. Gao, "Scale pyramid network for crowd counting," in WACV. IEEE, 2019, pp. 1941-1950.
[67] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in CVPR, 2017, pp. 2881-2890.
[68] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, "Deformable convolutional networks," in ICCV, 2017, pp. 764-773.
[69] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in NeurIPS, 2019, pp. 8024-8035.
[70] Q. Wang, J. Gao, W. Lin, and Y. Yuan, "Learning from synthetic data for crowd counting in the wild," in CVPR, 2019, pp. 8198-8207.
[71] W. Liu, M. Salzmann, and P. Fua, "Context-aware crowd counting," CVPR, 2019.
[72] L. Zhu, Z. Zhao, C. Lu, Y. Lin, Y. Peng, and T. Yao, "Dual path multi-scale fusion networks with attention for crowd counting," arXiv preprint arXiv:1902.01115, 2019.
[73] D. Babu Sam, N. N. Sajjan, R. Venkatesh Babu, and M. Srinivasan, "Divide and grow: capturing huge diversity in crowd images with incrementally growing CNN," in CVPR, 2018, pp. 3618-3626.
[74] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu, "ADCrowdNet: An attention-injective deformable convolutional network for crowd understanding," CVPR, 2019.
[75] Z. Ma, X. Wei, X. Hong, and Y. Gong, "Bayesian loss for crowd count estimation with point supervision," in ICCV, 2019, pp. 6142-6151.
[76] V. A. Sindagi, R. Yasarla, D. S. Babu, R. V. Babu, and V. M. Patel, "Learning to count in the crowd from limited labeled data," in ECCV, 2020.
Guangshuai Gao received the B.Sc. degree in applied physics from the College of Science and the M.Sc. degree in signal and information processing from the School of Electronic and Information Engineering, Zhongyuan University of Technology, Zhengzhou, China, in 2014 and 2017, respectively. He is currently pursuing the Ph.D. degree with the Laboratory of Intelligent Recognition and Image Processing, Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University. His research interests include image processing, pattern recognition, remote sensing image analysis and machine learning.
Qingjie Liu received the B.S. degree in computer science from Hunan University, Changsha, China, and the Ph.D. degree in computer science from Beihang University, Beijing, China. He is currently an Assistant Professor with the School of Computer Science and Engineering, Beihang University. He is also a Distinguished Research Fellow with the Hangzhou Institute of Innovation, Beihang University, Hangzhou. His current research interests include remote sensing image analysis, pattern recognition, and computer vision. He is a member of the IEEE.