Regularization of Building Boundaries in Satellite Images using Adversarial and Regularized Losses
Stefano Zorzi and Friedrich Fraundorfer
Institute of Computer Graphics and Vision, Graz University of Technology
ABSTRACT
In this paper we present a method for building boundary refinement and regularization in satellite images using a fully convolutional neural network trained with a combination of adversarial and regularized losses. Compared to a pure Mask R-CNN model, the overall algorithm achieves equivalent performance in terms of accuracy and completeness. However, unlike Mask R-CNN, which produces irregular footprints, our framework generates regularized and visually pleasing building boundaries, which are beneficial in many applications.
Index Terms — Generative adversarial networks, building segmentation, boundary refinement, satellite images.
1. INTRODUCTION
Building detection and segmentation from satellite images is still a challenging problem. Automatically detecting constructions and precisely extracting their footprints is in the interest of many engineering and cartographic applications. In recent years, multiple machine learning challenges have been proposed to encourage new building extraction methods, e.g. the DeepGlobe Challenge (http://deepglobe.org/challenge.html), the SpaceNet Challenge (https://spacenetchallenge.github.io/) and the CrowdAI Mapping Challenge.

The most common and effective way to deal with this problem is the use of powerful semantic segmentation or instance segmentation networks. However, in most cases, predicted building footprints have irregular shapes which are very different from the ones used in cartographic applications. This problem was recently addressed by Zhao et al. [1], who proposed a building segmentation and refinement pipeline as a solution for the DeepGlobe Challenge 2018. Their framework is composed of a Mask R-CNN [2] model for instance segmentation followed by a boundary refinement algorithm that exploits polygon simplification methods. The overall algorithm produces more realistic building footprints, but it does not consider the intensity image during the regularization to further improve the results.

In this paper we present a new building segmentation and regularization framework completely based on deep learning techniques. The pipeline itself is the same as in [1]: we still perform building segmentation as a first step and then apply building regularization as a second step. The difference is the use of a fully convolutional neural network as the regularization method, instead of polygon simplification algorithms. Inspired by deep style transfer techniques such as pix2pix [3] and CycleGAN [4], we train our regularization network using adversarial losses to produce more realistic footprints.
In particular, we use OpenStreetMap building footprints as the target footprint domain to train a GAN [5] architecture. We also exploit regularized losses [6, 7] to make the network aware of the real building boundaries in the intensity image and, consequently, to further refine the result. Finally, a reconstruction loss ensures that the regularized footprints look similar in size, pose and shape to the footprints originally predicted by Mask R-CNN.

The combination of these three types of loss functions enables us to learn a regularization network that not only produces better looking and more realistic building footprints, but is also capable of achieving better scores on the test dataset compared to the pure Mask R-CNN solution.
2. METHOD
Our aim is to learn a mapping function between the domain X (Mask R-CNN footprints) and the domain Y (ideal footprints) given the training samples {x_i}_{i=1}^N with x_i ∈ X and {y_i}_{i=1}^M with y_i ∈ Y. We also exploit RGB images {z_i}_{i=1}^N with z_i ∈ Z to further improve the results, training the model with an additional regularized loss.

The model performs the regularization G : {X, Z} → Y exploiting an encoder-decoder network, as shown in Figure 1. The generation of the regularized footprints is performed by the encoder E_G and the decoder F, so G can be seen as the combination of the two: G(x, z) = F(E_G(x, z)). A discriminator D is introduced in order to distinguish between regularized footprints G(x, z) and ideal ones. It is worth noting that the ideal building footprints are not directly evaluated by the discriminator model; instead, the ideal mask is encoded by E_R and decoded back by the common network F.

Fig. 1. Workflow of the proposed regularization framework. It is composed of two paths: the generator path (E_G → F) produces the regularized building footprint mask; the reconstruction path (E_R → F) encodes and decodes the ideal input mask, ensuring that the discriminator receives the same kind of real-valued masks as input.

The aim of this path is to obtain a reconstructed version of y. One concern with this design choice is that the adversarial network could potentially distinguish the two distributions trivially by detecting whether a mask consists of zeros and ones (one-hot encoding of the ideal mask) or of real values between zero and one (output of the autoencoder). This problem is solved by generating both reconstructed and regularized samples with the same network F. This architecture also ensures stability during training and avoids a winning-discriminator situation, since the two autoencoders are connected (through the common decoder) and trained together.

The encoders and the decoder are learned exploiting three types of loss functions: an adversarial loss, reconstruction losses and a regularized loss.

We use adversarial losses [5] to learn the mapping function between the domains X and Y. The objective function used to learn the discriminator D is expressed as:

L_D(G, R, D) = E_y[(1 − D(R(y)))²] + E_{x,z}[D(G(x, z))²]    (1)

where the path R(y) = F(E_R(y)) encodes and reconstructs the ideal mask and the path G(x, z) = F(E_G(x, z)) generates building footprints that look similar to ideal footprints in domain Y. The aim of D is to distinguish between regularized footprints and reconstructed footprints. Note that we use the least-squares loss in Equation 1 because it ensures better stability during training and generates higher quality results [4].
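As a concrete illustration, the least-squares discriminator objective of Equation 1 can be sketched in a few lines of NumPy. The function and variable names below are our own, and the discriminator scores are plain arrays standing in for the output of D:

```python
import numpy as np

def discriminator_loss(d_on_reconstructed, d_on_regularized):
    """Least-squares discriminator loss of Equation 1.

    d_on_reconstructed: scores D(R(y)) on reconstructed ideal masks
                        (pushed towards 1, the "real" label).
    d_on_regularized:   scores D(G(x, z)) on regularized footprints
                        (pushed towards 0, the "fake" label).
    """
    real_term = np.mean((1.0 - d_on_reconstructed) ** 2)
    fake_term = np.mean(d_on_regularized ** 2)
    return real_term + fake_term

# A perfect discriminator scores reconstructions as 1 and regularized
# footprints as 0, giving zero loss.
print(discriminator_loss(np.ones(4), np.zeros(4)))   # 0.0
```

The quadratic penalty is what distinguishes this least-squares formulation from the original log-loss GAN objective, and is the source of the training stability mentioned above.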
For the mapping path G the loss function is expressed as:

L_GAN(G, D) = E_{x,z}[(1 − D(G(x, z)))²]    (2)

This path is trained to fool the discriminator D: the adversarial loss encourages G to produce footprints similar to the samples in the domain Y.

In order to force the network to generate building footprints similar to the input masks, we simply use the binary cross-entropy loss both on the generator path G and on the reconstruction path R. The loss is computed between x and G(x, z) and between y and R(y), to produce regularized masks close to the Mask R-CNN predictions and to the ideal masks, respectively. The two losses can be expressed as:

L_BCE_G(G) = −Σ_i x_i · log G(x, z)_i
L_BCE_R(R) = −Σ_i y_i · log R(y)_i    (3)

Without regularized losses our model would not be able to exploit image information to further improve the building regularization. Alongside the adversarial loss and the reconstruction losses, the Potts loss [6] and the normalized cut loss [6, 7] are used to learn our model. These two loss functions force the generator G to produce building footprints aligned to the building boundaries observed in the intensity image. Moreover, trained with these losses, the generator is capable of correcting some artifacts produced by Mask R-CNN (Figure 2).

The Potts and normalized cut loss functions can be expressed as:

L_potts(G) = Σ_k S_k^T W (1 − S_k)    (4)

L_ncut(G) = Σ_k [S_k^T Ŵ (1 − S_k)] / [1^T Ŵ S_k]    (5)

where W and Ŵ are matrices of pairwise discontinuity costs, or affinity matrices, while S = G(x, z) is the k-way softmax segmentation mask generated by the network. S_k denotes the vectorization of the k-th channel of the segmentation image. In our case k = 2, since we have two classes.

The full objective used to train the generator G and the reconstruction model R is a linear combination of the adversarial loss, the reconstruction losses and the regularized losses:

L(G, R, D) = α L_GAN(G, D) + β L_BCE_G(G) + γ L_BCE_R(R) + δ L_potts(G) + ε L_ncut(G)    (6)

Note that the losses through the paths G and R are obtained by switching the encoders E_G and E_R. Once the total loss has been computed, the backpropagation step is performed and the weights of E_G, E_R and F are updated jointly.
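To make Equations 2–6 concrete, the NumPy sketch below evaluates the generator-side losses on toy arrays. All names are our own; the affinity matrix is a simplified intensity-only stand-in for W and Ŵ (the version used in the paper also gates on spatial distance), and the loss weights default to 1 rather than the values used in training:

```python
import numpy as np

def generator_adv_loss(d_on_regularized):
    # Equation 2: G is rewarded when D scores its output close to 1 ("real").
    return np.mean((1.0 - d_on_regularized) ** 2)

def bce_loss(target, predicted, eps=1e-7):
    # Equation 3: cross entropy between a one-hot target mask (x or y)
    # and the soft mask produced by the network (G(x, z) or R(y)).
    return -np.sum(target * np.log(np.clip(predicted, eps, 1.0)))

def intensity_affinity(intensities, sigma=0.01):
    # Simplified Gaussian affinity: large weight for pixels with similar
    # intensity, zero on the diagonal (no self-affinity).
    d2 = (intensities[:, None] - intensities[None, :]) ** 2
    w = np.exp(-d2 / sigma)
    np.fill_diagonal(w, 0.0)
    return w

def potts_loss(soft_seg, w):
    # Equation 4: sum_k S_k^T W (1 - S_k); soft_seg has shape (K, num_pixels).
    return sum(s @ w @ (1.0 - s) for s in soft_seg)

def ncut_loss(soft_seg, w):
    # Equation 5: sum_k [S_k^T W (1 - S_k)] / [1^T W S_k].
    ones = np.ones(w.shape[0])
    return sum((s @ w @ (1.0 - s)) / (ones @ w @ s) for s in soft_seg)

def total_loss(l_gan, l_bce_g, l_bce_r, l_potts, l_ncut,
               alpha=1.0, beta=1.0, gamma=1.0, delta=1.0, eps=1.0):
    # Equation 6: linear combination of the five terms.
    return (alpha * l_gan + beta * l_bce_g + gamma * l_bce_r
            + delta * l_potts + eps * l_ncut)

# Toy 4-pixel "image": two dark pixels, then two bright pixels.  A two-class
# segmentation aligned with that intensity edge incurs almost no Potts or
# normalized cut penalty; a misaligned segmentation is penalized.
w = intensity_affinity(np.array([0.0, 0.0, 1.0, 1.0]))
aligned = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
misaligned = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
print(potts_loss(aligned, w) < potts_loss(misaligned, w))   # True
print(ncut_loss(aligned, w) < ncut_loss(misaligned, w))     # True
```

This is exactly the mechanism that lets the regularized losses pull predicted boundaries toward intensity edges: crossing a high-affinity (similar-intensity) pair with a class boundary is expensive, while cutting along a low-affinity edge is nearly free.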
3. IMPLEMENTATION DETAILS

3.1. Dataset
We trained our regularization framework on a satellite image of the city of Jacksonville, Florida. The image is obtained by pansharpening the panchromatic layer with three multispectral channels (infrared, green, blue). There is no technical reason for using the infrared channel; the decision was made purely as a visualization preference, since grass and trees highlighted in red make the roofs of the buildings more visible to the naked eye. Input masks are generated by a Mask R-CNN model trained using OpenStreetMap footprints. OpenStreetMap footprints are also used as ideal masks during the training of the regularization framework. In order to achieve better results, our models are learned using single building instances instead of patches. As a test set, we manually labeled an image of a residential area in Jacksonville mainly composed of mid-sized and small-sized buildings. The size of the test area is around 360 × ….

Table 1. Scores of building extraction computed on the test area. [Only the caption and a single Mask R-CNN entry, 0.885, are legible in the extracted source; the table compares Mask R-CNN and our method.]
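The ideal masks described above come from OpenStreetMap footprint polygons drawn into raster masks. As a self-contained sketch of that step (our own even-odd-rule rasterizer, not the authors' code; vertex coordinates are hypothetical pixel positions), one could write:

```python
import numpy as np

def rasterize_polygon(vertices, size=256):
    """Draw a footprint polygon into a size x size binary mask.

    Uses ray casting (even-odd rule) on pixel centres; `vertices` is a
    list of (x, y) pixel coordinates describing the polygon outline.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    px, py = xs.ravel() + 0.5, ys.ravel() + 0.5   # pixel centres
    inside = np.zeros(size * size, dtype=bool)
    v = np.asarray(vertices, dtype=float)
    n = len(v)
    for i in range(n):
        x1, y1 = v[i]
        x2, y2 = v[(i + 1) % n]
        # Does this edge cross a horizontal ray through the pixel centre?
        crosses = (y1 > py) != (y2 > py)
        with np.errstate(divide="ignore", invalid="ignore"):
            xint = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        # Toggle membership for crossings to the right of the pixel.
        inside ^= crosses & (px < xint)
    return inside.reshape(size, size).astype(np.uint8)

# A hypothetical square footprint occupying the centre of the mask.
mask = rasterize_polygon([(64, 64), (192, 64), (192, 192), (64, 192)])
print(mask.sum())   # 16384  (a 128 x 128 square)
```

In practice a raster library would be used for this, but the sketch shows how a polygonal footprint and a per-pixel mask relate, which is the conversion both the ideal masks and the network outputs depend on.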
The network follows the design choices of a classical convolutional autoencoder, as shown in Figure 1. The encoders E_G, E_R and the discriminator D share the same architectural design: each is composed of a chain of 3 × 3 convolutional layers interleaved with 2 × 2 max pooling, while the decoder F has the dual architecture, composed of a chain of 3 × 3 convolutional layers ending in a 1 × 1 convolution with sigmoid activation. For training, every building mask and the corresponding RGB picture are resized to 256 ×
256 pixel images. The ideal masks are generated by drawing the OpenStreetMap building footprint polygons in 256 ×
256 pixel masks as well.

For all the experiments we use the Adam optimizer with a batch size of 8. The models are trained for 80000 batches in total. All networks are learned from scratch with an initial learning rate of 0.0002. We keep the same learning rate for 60000 batches and linearly decay it to zero over the last 20000 batches. We set α = 3, β = 3, γ = 1, δ = 200 and ε = 2 in Equation 6. ε and δ are linearly increased from zero to 2 and 200, respectively, during the first 30000 batches to keep the learning more stable.

The weight matrices W and Ŵ for the Potts loss and the normalized cut loss are constructed as:

w_ij = e^(−‖F(i) − F(j)‖ / σ_I) · e^(−‖X(i) − X(j)‖ / σ_X)  if ‖X(i) − X(j)‖ is below a fixed spatial threshold, and w_ij = 0 otherwise.

4. EXPERIMENTAL RESULTS

The performance of our algorithm is evaluated based on the Intersection over Union (IoU) metric. With these scores, we want to analyze the effect of building regularization on the building extraction, comparing the result of the pure Mask R-CNN model with the result of the regularization pipeline (Mask R-CNN plus regularization).

Table 1 shows the scores for Mask R-CNN and our method. Although Mask R-CNN shows slightly higher precision values, our regularization pipeline achieves higher results in recall, F1 and Jaccard index (IoU) scores. We also trained a model without Potts and normalized cut losses. The scores are higher for the complete regularization method, a sign that the regularized losses are effective and can be used to refine the segmentation results. To summarize, our method produces better representations of building footprints with more regular boundaries. Some regularization examples are shown in Figure 2, while a regularized portion of the test area is shown in Figure 3.

Fig. 3. Portion of the test area evaluated by Mask R-CNN (left) and regularized by our framework (right).

5. CONCLUSIONS

We presented a building extraction method that combines a Mask R-CNN model for instance segmentation with a network for footprint regularization. The regularization network has proved capable of effectively exploiting the information of the intensity image to further refine building boundaries, achieving equivalent or even higher results in terms of Intersection over Union compared to the pure Mask R-CNN model. Moreover, unlike Mask R-CNN, which produces irregular building masks, our method generates regularized footprints that can be used in many cartographic and engineering applications.

6. REFERENCES

[1] Kang Zhao, Jungwon Kang, Jaewook Jung, and Gunho Sohn, "Building extraction from satellite images using Mask R-CNN with building boundary regularization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 247–251.
[2] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2980–2988.
[3] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, "Image-to-image translation with conditional adversarial networks," CoRR, vol. abs/1611.07004, 2016.
[4] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," arXiv preprint, 2017.
[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[6] Meng Tang, Federico Perazzi, Abdelaziz Djelouah, Ismail Ben Ayed, Christopher Schroers, and Yuri Boykov, "On regularized losses for weakly-supervised CNN segmentation," arXiv preprint arXiv:1803.09569, 2018.
[7] Meng Tang, Abdelaziz Djelouah, Federico Perazzi, Yuri Boykov, and Christopher Schroers, "Normalized cut loss for weakly-supervised CNN segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.