A Histogram Thresholding Improvement to Mask R-CNN for Scalable Segmentation of New and Old Rural Buildings
Ying Li, Weipan Xu, Haohui “Caron” Chen, Junhao Jiang, Xun Li
Affiliations: Department of Urban and Regional Planning, School of Geography and Planning, Sun Yat-sen University, Guangzhou, China; Data61, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia
*Correspondence to: Xun Li ([email protected])
Abstract
Mapping new and old buildings is of great significance for understanding socio-economic development in rural areas. In recent years, deep neural networks have achieved remarkable building segmentation results in high-resolution remote sensing images. However, scarce training data and varying geographical environments have posed challenges for scalable building segmentation. This study proposes a novel framework based on Mask R-CNN, named HTMask R-CNN, to extract new and old rural buildings even when labels are scarce. The framework adopts the result of single-object instance segmentation from the orthodox Mask R-CNN. Further, it classifies the rural buildings into new and old ones based on a dynamic grayscale threshold inferred from the result of a two-object instance segmentation task where training data is scarce. We found that the framework can extract more buildings and achieve a much higher mean Average Precision (mAP) than the orthodox Mask R-CNN model. We tested the novel framework's performance with increasing training data and found that it converged even when the training samples were limited. This framework's main contribution is to allow scalable segmentation by using significantly fewer training samples than traditional machine learning practices. That makes mapping China's new and old rural buildings viable.
Introduction
Monitoring the composition of new and old buildings in rural areas is of great significance to rural development [1]. In particular, China’s rapid urbanization has tremendously transformed its rural settlements over the last decades [2]. However, unplanned and poorly documented dwellings have posed significant challenges for understanding rural settlements [3,4]. Traditionally, field surveys have been the main solution, but they require intensive labour and can be time-consuming, especially in remote areas. Recent breakthroughs in remote sensing technologies have increased the availability of high-resolution remote sensing images such as low-altitude aerial photos and Unmanned Aerial Vehicle (UAV) images. These allow manual mapping of rural settlements at a lower cost and with broader coverage, but manual mapping is still time-consuming. Therefore, to map the settlements of China's rural population of nearly 564 million [5], a scalable, intelligent and image-based solution is urgently needed.

Remote sensing-based mapping of buildings has long been a popular research topic [6,7,8,9,10]. Pixel-based methods, which group pixels of similar spectral properties into a particular class, were widely used in the era when remote sensors could only generate images in which pixels were bigger than ground features [11]. Since the launch of IKONOS, QuickBird, WorldView and, most recently, UAVs, the spatial resolution of remote sensing images has grown significantly. Pixel-based methods, which do not consider neighboring pixels that are part of the same land cover, fail to utilize the spatial variation of different land covers in those high-resolution images. Consequently, object-based image analysis (OBIA) has emerged as an effective alternative. OBIA uses image objects instead of individual pixels as the basic analysis unit [12].
It groups neighboring pixels into shapes with a meaningful representation of the target objects, considering the spatial and hierarchical relationships during the classification process [13]. OBIA involves two main phases: image segmentation and feature classification. The segmentation process divides an image into homogeneous regions, e.g., buildings, water bodies and grasslands, so its quality is critical not just for the downstream feature classification but also for the overall performance of OBIA [14,15]. Recent breakthroughs of machine learning (ML) technologies in computer vision have pushed the forefront of image segmentation. ML-based methods, such as Markov Random Fields [16], Bayesian Networks [17], Neural Networks [18], SVM [19] and Deep Convolutional Neural Networks (DCNN) [20], have achieved impressive results. These supervised methods learn the spatial and hierarchical relationships between pixels and objects, utilizing the resolution gains of remote sensing technologies in recent years. There are two kinds of image segmentation. Semantic segmentation treats multiple objects of the same class as a single entity, while instance segmentation treats multiple objects of the same class as distinct individual entities (or instances) [21,22]. Among them, Mask R-CNN has been one of the most popular image segmentation methods. It has been used for vehicle damage detection, ship labeling and building extraction [23,24,25]. Understanding rural settlements requires instance segmentation rather than semantic segmentation, because both the number and the area of settlements are needed.

Compared to urban buildings, rural ones have attracted much less attention from the remote sensing community [26,27]. Only a few rural datasets are open to the public. The Wuhan dataset (WHU) [28] extracts 220,000 independent buildings from high-resolution remote sensing images covering 450 km² in Christchurch, New Zealand.
The Massachusetts dataset consists of 151 aerial images of the Boston area, covering roughly 340 km². Trained on these datasets, ML-based models achieved impressive segmentation results. However, these data only cover a relatively small area. As a result, the models might not generalize well to other regions where the geographical environments differ significantly. Therefore, the lack of training data specifically for rural environments and the varying geographical environments across regions have posed challenges for building a robust and generalized algorithm. To achieve human-like segmentation for China's vast and varying geography, we might need to build models dedicated to different regions. In this regard, the bottleneck is the manual effort of annotating a large amount of training data. In this study, we propose a novel framework that could significantly reduce data annotation efforts while retaining the classification capability.

Humans annotate new and old buildings in high-resolution remote sensing images by the difference in pixel color. This principle can also be applied in an ML algorithm, as the grayscale histogram varies significantly between new and old buildings. When the grayscale histogram of new and old buildings exhibits a bimodal distribution, the valley point can be used as the threshold for discriminating them. This methodology is called histogram thresholding and was widely used before ML prevailed [30,31]. If all building footprints are given, a few predicted labels of new and old buildings in the same remote sensing image could be used to validate whether such a bimodal distribution exists and consequently to find the valley point. In this regard, we can reduce the number of training samples while retaining the algorithm's capability. It should be noted that the segmentation of buildings is more manageable than the segmentation of new and old ones in machine learning practice. That is partially due to the reduced effort of labeling one class instead of two.
Another reason is that binary classifiers are more accurate than multi-class classifiers in general. For example, classifying dogs is easier than classifying dog breeds. Therefore, the proposed framework uses histogram thresholding as an add-on to the state-of-the-art deep learning algorithm to achieve impressive segmentation results. In the method section, we describe the proposed framework in detail. This study uses rural areas in Xinxing County, Guangdong Province, as the study case to test the performance of our proposed framework.
2. Study Area and Data
2.1 Study Area
To test the proposed framework's performance, we collected data samples from high-resolution satellite images covering rural Xinxing County, Guangdong province, China (see Figure 1). Xinxing is a traditional mountainous agricultural county, with a large agricultural population and a relatively complete landscape, forest, and land city. Moreover, Xinxing is a rural revitalization pilot area and has made many achievements in rural development and governance. The extraction of new and old buildings is of great significance for understanding rural development in Xinxing.
Figure 1. The study area. (a) The location of Xinxing in Guangdong province, China; (b) a village of Xinxing; (c) building labels of the same region as in panel b (new and old buildings are masked in blue and red respectively).
2.2 Data collection and annotation
Table 1 shows the new and old buildings in high-resolution satellite images. Most of the new buildings are brick-concrete structures, with roofs made of cement or colored tiles. Old houses in Xinxing are mainly Cantonese-style courtyards whose roofs are made of dark tiles. Moreover, the outlines of their footprints are less clear than those of new houses. New buildings are mostly distributed along the streets, while the old buildings still retain a compact comb pattern.
Table 1. The image characteristics of new and old buildings in rural areas
Type             Image characteristics
old buildings    [image]
new buildings    [image]
We collected 60 images with a resolution of 0.26 m. Each image has a size ranging from 900 × 900 to 1024 × 1024 pixels. We used the open-source image annotation tool VIA [32] to delineate the building footprints (see Figure 2). All building samples from those 60 images were compiled as a dataset called the one-class samples. We annotated only 26 of the 60 images with new and old labels (called the two-class samples hereafter) (see Table 2). Both datasets were split into training and validation sets.
Figure 2. The construction of the image dataset. a) a village image; b) the building labels; c) the label json file; d) the mask with building categories.
Table 2. The train and val datasets
         one-class samples    two-class samples
         buildings            new buildings    old buildings    total
train    54 pic/5359 poly     1340             1081             20 pic/2421 poly
val      6 pic/892 poly       498              394              6 pic/892 poly
total    60 pic/6251 poly     1838             1475             26 pic/3313 poly
3. Methods
3.1 HTMask R-CNN
Mask R-CNN [33] has been proven to be a powerful and adaptable model in many different domains [34,35]. It operates in two phases: generation of region proposals and classification of each generated proposal. In this study, we use Mask R-CNN as our baseline model for benchmarking. As discussed before, we propose a novel segmentation framework that combines histogram thresholding with the image segmentation capability of deep learning to extract new and old rural buildings. We call the proposed framework HTMask R-CNN, short for Histogram Thresholding Mask R-CNN. The workflow of the framework is as follows (see Figure 3 for illustration):
a. We built two segmentation models (one-class and two-class models) based on the training sets of the one-class and two-class samples. The one-class model can extract rural buildings, while the two-class model can classify new and old rural buildings.
b. A satellite image (Figure 3a) is classified by the one-class and two-class models separately, leading to a map of building footprints (R1 in Figure 3d) and a map of new and old buildings (R2 in Figure 3b).
c. Grayscale histograms are built using the pixels from the new and old building footprints (R2). The average grayscale levels for new and old buildings are computed as N and O respectively. A valley point is determined by (N+O)/2.
d. The valley point is used as the threshold to determine the type of each building in R1. Finally, we get a map of the old and new buildings, R3 (Figure 3e).
The hypothesis is that R3 performs better than R2. Specifically, R3 can take advantage of the capability of R1, while utilizing the grayscale difference between the new and old buildings in R2. The two-class model’s performance depends on the number of training samples. Presumably, the more training data are added, the more robust the network training and the better the segmentation results.
This study also tests how the number of training samples affects the performance of R2 and R3, to evaluate how much annotation effort HTMask R-CNN can save while retaining the segmentation capability.
Figure 3. The workflow of HTMask R-CNN. a) a village image; b) R2: the result of the two-class model; c) the calculation of the exclusive threshold from the gray distribution; d) R1: the result of the one-class model; e) R3: the result of the HTMask R-CNN model.
We use R2, the prediction result of the two-class model, as the benchmark. R3 is the result of the proposed framework. We compare R2 and R3 to test how much the proposed framework improves accuracy.
We performed data augmentation, including rotating, mirroring, brightness enhancement and adding noise points, on the images in both training sets of the one-class and two-class samples. In the training stage for the one-class and two-class models, 50 epochs with two batches per epoch were applied, and the learning rate was set to 0.0001. Stochastic Gradient Descent (SGD) was adopted as the optimizer [36]. We set the weight decay to 0.0001, as a penalty added to the loss function to prevent the network from over-fitting. We set the momentum to 0.9, which controls to what extent the model remains in the original updating direction. We use cross-entropy as the loss function to evaluate the training performance. To test whether HTMask R-CNN can achieve a converged performance with a limited amount of training data, the training process involved an incremental number of samples (from 5 to 20 satellite images). Afterward, we compare the baseline Mask R-CNN and the HTMask R-CNN by comparing R2 and R3.
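Steps c and d of the workflow can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy pixel values, and the choice to compare each R1 building's mean grayscale against the threshold are assumptions made for illustration, since the text only specifies that the threshold is (N+O)/2.

```python
from statistics import mean


def class_mean_threshold(new_pixels, old_pixels):
    """Return (threshold, N, O), where N and O are the mean grayscale levels
    of the new and old building pixels in R2 and threshold = (N + O) / 2."""
    n = mean(new_pixels)
    o = mean(old_pixels)
    return (n + o) / 2.0, n, o


def relabel_buildings(building_pixels, threshold, new_is_brighter):
    """Label each R1 building as 'new' or 'old'.

    Assumption: a building is summarised by the mean grayscale of its pixels.
    """
    labels = []
    for pixels in building_pixels:
        brighter = mean(pixels) > threshold
        labels.append("new" if brighter == new_is_brighter else "old")
    return labels


# Toy grayscale samples standing in for the pixels under the R2 masks.
r2_new = [200, 210, 190, 205]
r2_old = [80, 90, 70, 85]
threshold, n, o = class_mean_threshold(r2_new, r2_old)

# Toy grayscale samples for three building footprints found by the one-class model (R1).
r1_buildings = [[195, 205, 200], [75, 95, 88], [160, 170, 150]]
labels = relabel_buildings(r1_buildings, threshold, new_is_brighter=n > o)

print(threshold)  # 141.25
print(labels)     # ['new', 'old', 'new']
```

Because the threshold is recomputed from R2 for every image, it adapts to the brightness of that image, which is what makes the threshold dynamic; the third toy building is labelled even though it resembles neither training cluster closely.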
We use the average precision (AP) to quantitatively evaluate our framework on the validation dataset. The AP is the area under the precision-recall (PR) curve, equation (1). The mAP represents the AP value when the threshold of intersection over union (IoU) is 0.5.
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad AP = \int_0^1 P(R)\,dR \tag{1}$$
IoU is the ratio of the intersection to the union of the prediction and the reference. When a segmentation image is obtained, the value of IoU is calculated according to equation (2).
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{2}$$
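Equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the evaluation code used in the study: masks are represented as sets of pixel coordinates, each detection is assumed to have already been marked as a true positive when its IoU with a reference building is at least 0.5, and the integral in equation (1) is approximated by a simple rectangle sum over the ranked detections rather than by an interpolated PR curve.

```python
def mask_iou(pred, ref):
    """IoU = TP / (TP + FP + FN) for two masks given as sets of pixel coordinates."""
    tp = len(pred & ref)   # pixels in both prediction and reference
    fp = len(pred - ref)   # predicted pixels not in the reference
    fn = len(ref - pred)   # reference pixels missed by the prediction
    union = tp + fp + fn
    return tp / union if union else 0.0


def average_precision(detections, n_reference):
    """Approximate the area under the PR curve.

    detections: list of (confidence, is_true_positive) pairs.
    n_reference: number of reference buildings (TP + FN).
    Each true positive raises recall by 1 / n_reference; the precision at
    that point is multiplied by this recall step and accumulated.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
            ap += (tp / (tp + fp)) / n_reference
        else:
            fp += 1
    return ap


# Two overlapping 2 x 2 masks that share one column of pixels.
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
ref = {(0, 1), (1, 1), (0, 2), (1, 2)}
iou = mask_iou(pred, ref)

# Three detections against three reference buildings; one building is missed.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], n_reference=3)

print(round(iou, 3))  # 0.333
print(round(ap, 3))   # 0.556
```

In the toy example the prediction would not count as a true positive at the 0.5 threshold, because its IoU is only one third.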
4. Results
Figure 4 shows the result for one image. In terms of building footprint mapping, the one-class model has identified most of the buildings. More importantly, it accurately outlines individual buildings and correctly separates the boundaries of adjacent buildings, which allows the texture of each building to be captured (Figure 4a).
The baseline model (two-class model) performed better and better with the growing number of training samples (Figure 4b). However, the number of buildings in R2 is still significantly smaller than in R1, which aligns with our assumption. In R3, the proposed framework uses R1 as the base map, so the number of buildings is equal between R1 and R3. That means the proposed framework extracts more buildings than the baseline model.
In terms of new and old building segmentation, R3 is significantly better than R2 at all levels of training samples. Even when the number of training samples is very limited, e.g., five, the proposed framework still produces a reasonable result, while the baseline model can barely distinguish between new and old buildings.
Figure 4. An example of the result comparison between the baseline Mask R-CNN model and the HTMask R-CNN framework. a) the first column shows the satellite image from the test set, the second column shows the annotations, and the third column shows the result of the one-class model, R1. b) the result of the two-class model, R2, with incremental training samples. c) the result of the HTMask R-CNN framework, R3, with incremental training samples.
Figure 5 shows the training process of the one-class model. We noticed that the performance of the one-class model converges at the 25th epoch. Regardless of classification, it identifies most of the buildings, and the mAP50 reaches 0.70.
Figure 5. The training process of the one-class model.
The baseline two-class model and the HTMask R-CNN also converge at the 25th epoch (Table 3, Figure 6).
When the training size is small, the mAP50 of the baseline two-class model is very low, while the HTMask R-CNN can significantly improve the recognition. With increasing training size, the performance gap between the baseline two-class model and HTMask R-CNN becomes narrower (see Figure 6d). Finally, the mAP of the baseline two-class model reached 0.35 when the training size is 20, but it is still lower than that of HTMask R-CNN. More importantly, HTMask R-CNN performs consistently at all levels of training samples, with an mAP50 between 0.42 and 0.45 at epoch 50 (Table 3).
Table 3. mAP50 of R2 (2class) and R3 (HTM) at all levels of training samples
models\epoch   1      5      10     15     20     25     30     35     40     45     50
5_2class       0.00   0.00   0.06   0.06   0.05   0.06   0.04   0.07   0.06   0.09   0.07
5_HTM          0.03   0.35   0.32   0.38   0.41   0.42   0.43   0.44   0.41   0.45   0.42
10_2class      0.00   0.01   0.09   0.12   0.15   0.17   0.16   0.17   0.20   0.22   0.22
10_HTM         0.04   0.34   0.31   0.40   0.41   0.44   0.45   0.45   0.42   0.45   0.44
15_2class      0.00   0.06   0.10   0.15   0.19   0.21   0.23   0.26   0.28   0.29   0.29
15_HTM         0.04   0.34   0.33   0.39   0.42   0.44   0.45   0.48   0.43   0.45   0.45
20_2class      0.00   0.11   0.19   0.19   0.31   0.27   0.27   0.34   0.31   0.31   0.35
20_HTM         0.05   0.34   0.32   0.39   0.41   0.44   0.44   0.47   0.43   0.45   0.45
Figure 6. The AP for each model. a) sample size=5; b) sample size=10; c) sample size=15; d) sample size=20.
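The narrowing gap between the two models can be checked with simple arithmetic on the epoch-50 column of Table 3. The short Python sketch below only restates those reported values; the variable names are illustrative.

```python
# Epoch-50 mAP50 values copied from Table 3, keyed by number of training images.
baseline = {5: 0.07, 10: 0.22, 15: 0.29, 20: 0.35}  # two-class model (R2)
htmask = {5: 0.42, 10: 0.44, 15: 0.45, 20: 0.45}    # HTMask R-CNN (R3)

# Gap R3 - R2 for each training size, rounded to the precision of the table.
sizes = sorted(baseline)
gaps = [round(htmask[n] - baseline[n], 2) for n in sizes]

for n, gap in zip(sizes, gaps):
    print(f"{n:>2} images: R3 - R2 = {gap:.2f}")
```

The gap falls from 0.35 with 5 training images to 0.10 with 20, while the HTMask R-CNN value itself moves by only 0.03.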
5. Discussions
With the advance of deep learning, the extraction of building footprints from satellite imagery has made notable progress, contributing significantly to the digital records of settlements. However, the scarcity of training data has always been the main challenge for scaling building segmentation. Therefore, this study proposes a novel framework based on the Mask R-CNN model and the histogram thresholding method to extract old and new rural buildings even when labels are scarce. We tested the framework in Xinxing County, Guangdong province, and achieved promising results. This framework provides a viable solution for mapping China's rural buildings at a significantly reduced cost.
Mask R-CNN models have been proven useful in many applications. However, this study found that the orthodox Mask R-CNN model performed poorly in extracting old-new two-category buildings. When the training samples are limited, the mAP is only 0.35. We believe the varying geographical environments lead to poor generalization of the segmentation model when the training samples cannot cover the most distinctive spatial and spectral features. For instance, the model might not classify a building with an open patio as either a new or old building if none of the training samples contains this unique shape. Meanwhile, the single-category classification task using Mask R-CNN could reach an mAP of 0.70. That means utilizing Mask R-CNN's capability in mapping building footprints could improve the recall rate for the old-new two-category classification task. Hence, we propose such a novel framework.
While testing the framework with increasing training samples, we found that it converges at a very early stage, when the number of training images is only five. That means the framework can be applied on a large scale, to map all rural buildings in China. Before then, more careful studies should be undertaken to understand the limitations of the framework. For example, more training samples might be needed for an accurate model in other areas.
6. Conclusions
Nearly half of the Chinese population lives in the rural areas of China. The lack of a digital record of new and old buildings has made it difficult for governments to understand the socio-economic state of these areas. Under the central government's current rural revitalization policy, many migrant workers will return to the villages. Therefore, a scalable, intelligent, and accurate building mapping solution is urgently needed. The framework proposed in this study achieved a promising result even when the training samples are scarce. As a result, we can scale the mapping process at a significantly reduced cost. Therefore, we believe this framework could map every settlement in the rural areas, help policymakers establish a longitudinal digital building record, and monitor socio-economic conditions across all rural regions.