End-to-End Trainable Deep Active Contour Models for Automated Image Segmentation: Delineating Buildings in Aerial Imagery

Abstract

The automated segmentation of buildings in remote sensing imagery is a challenging task that requires the accurate delineation of multiple building instances over typically large image areas. Manual methods are often laborious, and current deep-learning-based approaches fail to delineate all building instances with adequate accuracy. As a solution, we present Trainable Deep Active Contours (TDACs), an automatic image segmentation framework that intimately unites Convolutional Neural Networks (CNNs) and Active Contour Models (ACMs). The Eulerian energy functional of the ACM component includes per-pixel parameter maps that are predicted by the backbone CNN, which also initializes the ACM. Importantly, both the ACM and CNN components are fully implemented in TensorFlow, and the entire TDAC architecture is end-to-end automatically differentiable and backpropagation trainable without user intervention. TDAC yields fast, accurate, and fully automatic simultaneous delineation of arbitrarily many buildings in the image. We validate the model on two publicly available aerial image datasets for building segmentation, and our results demonstrate that TDAC establishes a new state-of-the-art performance.


Ali Hatamizadeh, Debleena Sengupta, and Demetri Terzopoulos

Computer Science Department, University of California, Los Angeles, CA 90095, USA
{ahatamiz,debleenas,dt}@cs.ucla.edu


Keywords: Computer vision · Image segmentation · Active contour models · Convolutional neural networks · Building delineation

The delineation of buildings in remote sensing imagery [24] is a crucial step in applications such as urban planning [29], land cover analysis [35], and disaster relief response [28], among others. Manual or semi-automated approaches can be very slow, laborious, and sometimes imprecise, which can be detrimental to the prompt, accurate extraction of situational information from high-resolution aerial and satellite images.

Convolutional Neural Networks (CNNs) and deep learning have been broadly applied to various computer vision tasks, including semantic and instance segmentation of natural images in general [7,6] and particularly to the segmentation of buildings in remote sensing imagery [28,3]. However, building segmentation challenges CNNs. First, since CNN architectures often include millions of trainable parameters, successful training relies on large, accurately-annotated datasets; but creating such datasets from high-resolution imagery with possibly many building instances is very tedious. Second, CNNs rely on a filter learning approach in which edge and texture features are learned together, which adversely impacts the ability to properly delineate buildings and capture the details of their boundaries [11,16].

Fig. 1: TDAC is a fully-automated, end-to-end automatically differentiable and backpropagation trainable ACM and backbone CNN framework.

One of the most influential computer vision techniques, the Active Contour Model (ACM) [19], has been successfully employed in various image analysis tasks, including segmentation. In most ACM variants, the deformable curves of interest dynamically evolve according to an iterative procedure that minimizes a corresponding energy functional. Since the ACM is a model-based formulation founded on geometric and physical principles, the segmentation process relies mainly on the content of the image itself, not on learning from large annotated image datasets with hours or days of training and extensive computational resources. However, the classic ACM relies to some degree on user input to specify the initial contour and tune the parameters of the energy functional, which undermines its usefulness in tasks requiring the automatic segmentation of numerous images.

We address the aforementioned challenges by intimately uniting CNNs and ACMs in an end-to-end trainable framework (originally proposed in [15]). Our framework (Fig. 1) leverages a novel ACM with trainable parameters that is automatically differentiable in a TensorFlow implementation, thereby enabling the backpropagation of gradients for stochastic optimization. Consequently, the ACM and an untrained, as opposed to pre-trained, backbone CNN can be trained together from scratch. Furthermore, our ACM utilizes an Eulerian energy functional that affords local control via 2D parameter maps that are directly predicted by the backbone CNN, and it is also automatically initialized by the CNN. Thus, our framework removes the biggest obstacle to exploiting the power of ACMs in the context of CNNs and deep learning approaches to image segmentation: the need for any form of user supervision or intervention.

Our specific technical contributions in this paper are as follows:

– We propose an end-to-end trainable building segmentation framework that establishes a tight merger between the ACM and any backbone CNN in order to delineate buildings and accurately capture the fine-grained details of their boundaries.
– To this end, we devise an implicit ACM formulation with pixel-wise parameter maps and differentiable contour propagation steps for each term of the associated energy functional, thus making it amenable to TensorFlow implementation.
– We present new state-of-the-art benchmarks on two popular publicly available building segmentation datasets, Vaihingen and Bing Huts, with performance surpassing the best among competing methods [25,9].

Audebert et al. [1] leveraged CNN-based models for building segmentation by applying SegNet [2] with multi-kernel convolutional layers at three different resolutions. Subsequently, Wang et al. [31] applied ResNet [17], first to identify the instances, followed by an MRF to refine the predicted masks. Some methods combine CNN-based models with classical optimization methods. Costea et al. [10] proposed a two-stage model in which they detect roads and intersections with a Dual-Hop Generative Adversarial Network (DH-GAN) at the pixel level and then apply a smoothing-based graph optimization to the pixel-wise segmentation to determine a best-covering road graph. Wu et al. [33] employed a U-Net encoder-decoder architecture with loss layers at different scales to progressively refine the segmentation masks. Xu et al. [34] proposed a cascaded approach in which pre-processed hand-crafted features are fed into a Residual U-Net to extract building locations and a guided filter refines the results.

In an effort to address the problem of poor boundary predictions by CNN models, Bischke et al. [3] proposed a cascaded multi-task loss function to simultaneously predict the semantic masks and distance classes. Recently, Rudner et al. [28] proposed a method to segment flooded buildings using multiple streams of encoder-decoder architectures that extract spatiotemporal information from medium-resolution images and spatial information from high-resolution images, along with a context aggregation module to effectively combine the learned feature map.

Hu et al. [18] proposed a model in which the network learns a level-set function for salient objects; however, the authors predefined a fixed scalar weighting parameter $\lambda$, which will not be optimal for all cases in the analyzed set of images. Hatamizadeh et al. [14] connected the output of a CNN to an implicit ACM through spatially-varying functions for the $\lambda$ parameters. Le et al. [22] proposed a framework for the task of semantic segmentation of natural images in which level-set ACMs are implemented as RNNs. There are three key differences between that effort and our TDAC: (1) TDAC does not reformulate ACMs as RNNs, which makes it more computationally efficient. (2) TDAC benefits from a novel, locally-parameterized energy functional, as opposed to constant weighted parameters. (3) TDAC has an entirely different pipeline: we employ a single CNN that is trained from scratch along with the ACM, as opposed to requiring two pre-trained CNN backbones. The dependence of [22] on pre-trained CNNs limits its applicability.

Marcos et al. [25] proposed Deep Structured Active Contours (DSAC), an integration of ACMs with CNNs in a structured prediction framework for building instance segmentation in aerial images. There are three key differences between that work and our TDAC: (1) TDAC is fully automated and runs without any external supervision, as opposed to depending heavily on the manual initialization of contours. (2) TDAC leverages the Eulerian ACM, which naturally segments multiple building instances simultaneously, as opposed to a Lagrangian formulation that can handle only a single building at a time. (3) Our approach fully automates the direct back-propagation of gradients through the entire TDAC framework due to its automatically differentiable ACM implementation.

Cheng et al. [9] proposed the Deep Active Ray Network (DarNet), which uses a polar coordinate ACM formulation to prevent the problem of self-intersection and employs a computationally expensive multiple initialization scheme to improve the performance of the proposed model. Like DSAC, DarNet can handle only single instances of buildings due to its explicit ACM formulation. Our approach is fundamentally different from DarNet, as (1) it uses an implicit ACM formulation that handles multiple building instances and (2) leverages a CNN to automatically and precisely initialize the implicit ACM.

Wang et al. [32] proposed an interactive object annotation framework for instance segmentation in which a backbone CNN and user input guide the evolution of an implicit ACM. Recently, Gur et al. [12] introduced Active Contours via Differentiable Rendering Network (ACDRNet), in which an explicit ACM is represented by a “neural renderer” and a backbone encoder-decoder U-Net predicts a shift map to evolve the contour via edge displacement.

Some efforts have also focused on deriving new loss functions that are inspired by ACM principles. Inspired by the global energy formulation of [5], Chen et al. [8] proposed a supervised loss layer that incorporated area and size information of the predicted masks during training of a CNN and tackled a medical image segmentation task. Similarly, Gur et al. [13] presented an unsupervised loss function based on morphological active contours without edges [26].

Fig. 2: Boundary $C$ represented as the zero level-set of implicit function $\phi(x, y)$.

Our ACM formulation allows us to create a differentiable and trainable active contour model. Instead of working with a parametric contour that encloses the desired area to be segmented [19], we represent the contour(s) as the zero level-set of an implicit function.
Such so-called “level-set active contours” evolve the segmentation boundary by evolving the implicit function so as to minimize an associated Eulerian energy functional.

The most well-known approaches that utilize this implicit formulation are geodesic active contours [4] and active contours without edges [5]. (These approaches numerically solve the PDE that governs the evolution of the implicit function. Interestingly, Márquez-Neila et al. [26] proposed a morphological approach that approximates the numerical solution of the PDE by successive application of morphological operators defined on the equivalent binary level set.) The latter, also known as the Chan-Vese model, relies on image intensity differences between the interior and exterior regions of the level set. Lankton and Tannenbaum [21] proposed a reformulation in which the energy functional incorporates image properties in the local region near the level set, which more accurately segments objects with heterogeneous features.

Let $I$ represent an input image and $C = \{(x, y) \mid \phi(x, y) = 0\}$ be a closed contour in $\Omega \subset \mathbb{R}^2$ represented by the zero level set of the signed distance map $\phi(x, y)$ (Fig. 2). The interior and exterior of $C$ are represented by $\phi(x, y) > 0$ and $\phi(x, y) < 0$, respectively. Following [5], we use a smoothed Heaviside function

$$H(\phi(x, y)) = \frac{1}{2} + \frac{1}{\pi} \arctan\left(\frac{\phi(x, y)}{\epsilon}\right) \tag{1}$$

to represent the interior as $H(\phi)$ and exterior as $(1 - H(\phi))$. The derivative of $H(\phi(x, y))$ is

$$\frac{\partial H(\phi(x, y))}{\partial \phi(x, y)} = \frac{1}{\pi} \frac{\epsilon}{\epsilon^2 + \phi(x, y)^2} = \delta(\phi(x, y)). \tag{2}$$

Fig. 3: The filter is divided by the contour into interior and exterior regions. The point $x$ is represented by the red dot and the interior (a) and exterior (b) regions are shaded yellow.

In TDAC, we evolve $C$ to minimize an energy function according to

$$E(\phi) = E_{\text{length}}(\phi) + E_{\text{image}}(\phi), \tag{3}$$

where

$$E_{\text{length}}(\phi) = \int_\Omega \mu\, \delta(\phi(x, y))\, |\nabla \phi(x, y)| \, dx\, dy \tag{4}$$

penalizes the length of $C$, whereas

$$E_{\text{image}}(\phi) = \int_\Omega \delta(\phi(x, y)) \left[ H(\phi(x, y)) (I(x, y) - m_1)^2 + (1 - H(\phi(x, y))) (I(x, y) - m_2)^2 \right] dx\, dy \tag{5}$$

takes into account the mean image intensities $m_1$ and $m_2$ of the regions interior and exterior to $C$ [5]. We compute these local statistics using a characteristic function $W_s$ with local window of size $f_s$ (Fig. 3), as follows:

$$W_s = \begin{cases} 1 & \text{if } x - f_s \le u \le x + f_s,\ y - f_s \le v \le y + f_s; \\ 0 & \text{otherwise}, \end{cases} \tag{6}$$

where $x, y$ and $u, v$ are the coordinates of two independent points.

To make our level-set ACM trainable, we associate parameter maps with the foreground and background energies. These maps, $\lambda_1(x, y)$ and $\lambda_2(x, y)$, are functions over the image domain $\Omega$. Therefore, our energy function may be written as

$$E(\phi) = \int_\Omega \delta(\phi(x, y)) \left[ \mu |\nabla \phi(x, y)| + \int_\Omega W_s F(\phi(u, v)) \, du\, dv \right] dx\, dy, \tag{7}$$

where

$$F(\phi) = \lambda_1(x, y) (I(u, v) - m_1(x, y))^2 H(\phi(x, y)) + \lambda_2(x, y) (I(u, v) - m_2(x, y))^2 (1 - H(\phi(x, y))). \tag{8}$$

According to the derivation in the appendix, the variational derivative of $E$ with respect to $\phi$ yields the Euler-Lagrange PDE

$$\frac{\partial \phi}{\partial t} = \delta(\phi) \left[ \mu\, \mathrm{div}\left( \frac{\nabla \phi}{|\nabla \phi|} \right) + \int_\Omega W_s \nabla_\phi F(\phi) \, dx\, dy \right] \tag{9}$$

with

$$\nabla_\phi F = \delta(\phi) \left( \lambda_1(x, y) (I(u, v) - m_1(x, y))^2 - \lambda_2(x, y) (I(u, v) - m_2(x, y))^2 \right). \tag{10}$$

To avoid numerical instabilities during the evolution and maintain a well-behaved $\phi(x, y)$, a distance regularization term [23] can be added to (9).

It is important to note that our formulation enables us to capture the fine-grained details of boundaries, and our use of pixel-wise parameter maps $\lambda_1(x, y)$ and $\lambda_2(x, y)$ allows them to be directly predicted by the backbone CNN along with an initialization map $\phi_0(x, y)$. Thus, not only does the implicit ACM propagation now become fully automated, but it can also be directly controlled by a CNN through these learnable parameter maps.

For the backbone CNN, we use a standard encoder-decoder with convolutional layers, residual blocks, and skip connections between the encoder and decoder. Each 3 × 3 convolutional layer is followed by ReLU activation and batch normalization. The backbone outputs the $\lambda_1(x, y)$ and $\lambda_2(x, y)$ parameter maps as well as the initialization map $\phi_0(x, y)$.
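The ingredients of (9) and (10) can be sketched numerically. The following is a minimal NumPy illustration, not the TensorFlow implementation described in this paper. The function names, the zero-padded box filter, the window half-width `half`, and the pointwise evaluation of (10) (taking the image intensity and the local means at the same pixel) are all assumptions made for the sake of a short example.

```python
import numpy as np

def heaviside(phi, eps=1.0):
    """Smoothed Heaviside function of Eq. (1); close to 1 where phi > 0."""
    return 0.5 + np.arctan(phi / eps) / np.pi

def dirac(phi, eps=1.0):
    """Derivative of the smoothed Heaviside function, Eq. (2)."""
    return (eps / np.pi) / (eps ** 2 + phi ** 2)

def box_sum(a, half):
    """Sum of `a` over a (2*half+1) x (2*half+1) window with zero padding."""
    h, w = a.shape
    padded = np.pad(a, half, mode="constant")
    out = np.zeros((h, w), dtype=float)
    for dy in range(2 * half + 1):
        for dx in range(2 * half + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out

def local_means(image, phi, half=2, eps=1.0, tiny=1e-8):
    """Local interior mean m1 and exterior mean m2 inside the window W_s."""
    inside = heaviside(phi, eps)
    outside = 1.0 - inside
    m1 = box_sum(image * inside, half) / (box_sum(inside, half) + tiny)
    m2 = box_sum(image * outside, half) / (box_sum(outside, half) + tiny)
    return m1, m2

def curvature(phi, tiny=1e-8):
    """div(grad(phi) / |grad(phi)|) from finite-difference partial derivatives."""
    phi_y, phi_x = np.gradient(phi)        # axis 0 is y, axis 1 is x
    phi_yy, _ = np.gradient(phi_y)
    phi_xy, phi_xx = np.gradient(phi_x)
    num = phi_xx * phi_y ** 2 - 2.0 * phi_xy * phi_x * phi_y + phi_yy * phi_x ** 2
    den = (phi_x ** 2 + phi_y ** 2) ** 1.5 + tiny
    return num / den

def grad_F(image, phi, lam1, lam2, half=2, eps=1.0):
    """Pointwise evaluation of Eq. (10); an assumed simplification of the window integral."""
    m1, m2 = local_means(image, phi, half, eps)
    return dirac(phi, eps) * (lam1 * (image - m1) ** 2 - lam2 * (image - m2) ** 2)
```

Here `lam1` and `lam2` may be scalars or arrays with the shape of the image, which mirrors the difference between constant weights and the per-pixel parameter maps $\lambda_1(x, y)$ and $\lambda_2(x, y)$.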

Fig. 4: TDAC’s CNN backbone employs a standard encoder-decoder architecture.

Table 1: Detailed information about the TDAC encoder (columns: Operations, Output size). [Table entries lost in extraction.]

Table 2: Detailed information about the TDAC decoder (columns: Operations, Output size). [Table entries lost in extraction.]

Tables 1 and 2 present the details of the encoder and decoder in the TDAC backbone CNN architecture. BN, Add, Pool, Upsample, Conv, and Conv1 denote batch normalization, addition, 2 × 2 max pooling, bilinear upsampling, 3 × 3 convolutional, and 1 × 1 convolutional layers, respectively.

The ACM is evolved according to (9) in a differentiable manner in TensorFlow. The first term is computed according to the surface curvature expression:

$$\mathrm{div}\left( \frac{\nabla \phi}{|\nabla \phi|} \right) = \frac{\phi_{xx} \phi_y^2 - 2 \phi_{xy} \phi_x \phi_y + \phi_{yy} \phi_x^2}{(\phi_x^2 + \phi_y^2)^{3/2}}, \tag{11}$$

where the subscripts denote the spatial partial derivatives of $\phi$, which are approximated using central finite differences. For the second term, convolutional operations are leveraged to efficiently compute $m_1(x, y)$ and $m_2(x, y)$ in (8) within image regions interior and exterior to $C$. Finally, $\partial \phi / \partial t$ in (9) is evaluated and $\phi(x, y)$ updated according to

$$\phi^t = \phi^{t-1} + \Delta t \frac{\partial \phi^{t-1}}{\partial t}, \tag{12}$$

where $\Delta t$ is the size of the time step.

Referring to Fig. 1, we simultaneously train the CNN and level-set components of TDAC in an end-to-end manner with no human intervention. The CNN guides the ACM by predicting the $\lambda_1(x, y)$ and $\lambda_2(x, y)$ parameter maps, as well as an initialization map $\phi_0(x, y)$ from which $\phi(x, y)$ evolves through the $L$ layers of the ACM in a differentiable manner, thus enabling training error backpropagation. The $\phi_0(x, y)$ output of the CNN is also passed into a Sigmoid activation function to produce the prediction $P$. Training optimizes a loss function that combines binary cross entropy and Dice losses:

$$\hat{\mathcal{L}}(X) = -\frac{1}{N} \sum_{j=1}^{N} \left[ G_j \log X_j + (1 - G_j) \log(1 - X_j) \right] + 1 - \frac{2 \sum_{j=1}^{N} X_j G_j}{\sum_{j=1}^{N} X_j + \sum_{j=1}^{N} G_j}, \tag{13}$$

where $X_j$ denotes the output prediction and $G_j$ the corresponding ground truth at pixel $j$, and $N$ is the total number of pixels in the image.
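The combined loss (13) can be sketched in a few lines. The NumPy function below is an illustrative assumption rather than the TensorFlow code used for training: `pred` plays the role of $X$, `target` the role of $G$, the logarithm is applied to the prediction (the usual cross-entropy convention), and the clipping constant `eps` is added only to keep the logarithms finite.

```python
import numpy as np

def bce_dice_loss(pred, target, eps=1e-7):
    """Binary cross entropy plus (1 - Dice) over all pixels of one image."""
    # Clip predictions away from 0 and 1 so that both logarithms stay finite.
    pred = np.clip(np.asarray(pred, dtype=float).ravel(), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float).ravel()
    bce = -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    dice = 2.0 * np.sum(pred * target) / (np.sum(pred) + np.sum(target) + eps)
    return bce + (1.0 - dice)
```

A prediction that matches the ground truth gives a loss close to zero, while an inverted prediction is penalized by both terms.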
The total loss of the TDAC model is

$$\mathcal{L} = \mathcal{L}_{\text{ACM}} + \mathcal{L}_{\text{CNN}}, \tag{14}$$

where $\mathcal{L}_{\text{ACM}} = \hat{\mathcal{L}}(\phi^L)$ is the loss computed for the output $\phi^L$ from the final ACM layer and $\mathcal{L}_{\text{CNN}} = \hat{\mathcal{L}}(P)$ is the loss computed over the prediction $P$ of the backbone CNN. Algorithm 1 presents the details of the TDAC training procedure.

Algorithm 1: TDAC Training Algorithm

Data: $I$: Image; $G$: Corresponding ground truth label; $g$: ACM energy function with parameter maps $\lambda_1$, $\lambda_2$; $\phi$: ACM implicit function; $L$: Number of ACM iterations; $W$: CNN with weights $w$; $P$: CNN prediction; $\mathcal{L}$: Total loss function; $\eta$: Learning rate
Result: Trained TDAC model

while not converged do
    $\lambda_1, \lambda_2, \phi_0 = W(I)$
    $P = \mathrm{Sigmoid}(\phi_0)$
    for $t = 1$ to $L$ do
        $\partial \phi^{t-1} / \partial t = g(\phi^{t-1}; \lambda_1, \lambda_2, I)$
        $\phi^t = \phi^{t-1} + \Delta t\, \partial \phi^{t-1} / \partial t$
    end
    $\mathcal{L} = \mathcal{L}_{\text{ACM}}(\phi^L) + \mathcal{L}_{\text{CNN}}(P)$
    Compute $\partial \mathcal{L} / \partial w$ and backpropagate the error
    Update the weights of $W$: $w \leftarrow w - \eta\, \partial \mathcal{L} / \partial w$
end

We have implemented the TDAC architecture and training algorithm entirely in TensorFlow. Our ACM implementation benefits from the automatic differentiation utility of TensorFlow and has been designed to enable the backpropagation of the error gradient through the $L$ layers of the ACM. We set $L = 60$ iterations in the ACM component of TDAC since, as will be discussed in Section 4.3, the performance does not seem to improve significantly with additional iterations. We set a filter size of $f = 5$, as discussed in Section 4.3. The training was performed on an Nvidia Titan RTX GPU and an Intel® Core™ i7-7700K CPU @ 4.20GHz. The size of the training minibatches for both datasets is 2. All the training sessions employ the Adam optimization algorithm [20] with an initial learning rate of $\alpha_0 = 0.001$ decreasing according to [27]

$$\alpha = \alpha_0 \left(1 - e/N_e\right)^{0.9} \tag{15}$$

with epoch counter $e$ and total number of epochs $N_e$.

Vaihingen:

The Vaihingen buildings dataset consists of 168 aerial images of size 512 × 512 pixels. Labels for each image were generated by using a semi-automated approach. We used 100 images for training and 68 for testing, following the same data partition as in [25]. In this dataset, almost all the images include multiple instances of buildings, some of which are located at image borders.

Bing Huts:

The Bing Huts dataset consists of 605 aerial images of size 64 × [...]

To evaluate TDAC’s performance, we utilized four different metrics: Dice, mean Intersection over Union (mIoU), Boundary F (BoundF) [9], and Weighted Coverage (WCov) [30].

Given the prediction $X$ and ground truth mask $G$, the Dice (F1) score is

$$\mathrm{Dice}(X, G) = \frac{2 \sum_{i=1}^{N} X_i G_i}{\sum_{i=1}^{N} X_i + \sum_{i=1}^{N} G_i}, \tag{16}$$

where $N$ is the number of image pixels and $G_i$ and $X_i$ denote pixels in $G$ and $X$.

Similarly, the IoU score measures the overlap of two objects by calculating the ratio of intersection over union, according to

$$\mathrm{IoU}(X, G) = \frac{|X \cap G|}{|X \cup G|}. \tag{17}$$

BoundF computes the average of Dice scores over 1 to 5 pixels around the boundaries of the ground truth segmentation.

In WCov, the maximum overlap output is selected and the IoU between the ground truth segmentation and best output is calculated. IoUs for all instances are summed up and weighted by the area of the ground truth instance. Assuming that $S_G = \{r_1^{S_G}, \ldots, r_{|S_G|}^{S_G}\}$ is a set of ground truth regions and $S_X = \{r_1^{S_X}, \ldots, r_{|S_X|}^{S_X}\}$ is a set of prediction regions for a single image, and $|r_j^{S_G}|$ is the number of pixels in $r_j^{S_G}$, the weighted coverage can be expressed as

$$\mathrm{WCov}(S_X, S_G) = \frac{1}{N} \sum_{j=1}^{|S_G|} |r_j^{S_G}| \max_{k = 1 \ldots |S_X|} \mathrm{IoU}(r_k^{S_X}, r_j^{S_G}). \tag{18}$$

Single-Instance Segmentation:

Although most of the images in the Vaihingen dataset depict multiple instances of buildings, the DarNet and DSAC models can deal only with a single building instance at a time. For a fair comparison against these models, we report single-instance segmentation results in the exact same manner as [25] and [9]. As reported in Table 3, our TDAC model outperforms both DarNet and DSAC in all metrics on both the Vaihingen and Bing Huts datasets. Fig. 5 shows that with the Vaihingen dataset, both the DarNet and DSAC models have difficulty coping with the topological changes of the buildings and fail to appropriately capture sharp edges, while TDAC overcomes these challenges in most cases. For the Bing Huts dataset, both the DarNet and DSAC models are able to localize the buildings, but they inaccurately delineate the buildings in many cases. This may be due to their inability to distinguish the building from the surrounding terrain because of the low contrast and small size of the image. Comparing the segmentation output of DSAC (Fig. 5b), DarNet (Fig. 5c), and TDAC (Fig. 5d), our model performs well on the low contrast dataset, delineating buildings more accurately than the earlier models. Additional comparative visualizations are presented in Fig. 6.

Fig. 5: Comparative visualization of the labeled image and the outputs of DSAC, DarNet, and our TDAC for the Vaihingen (top) and Bing Huts (bottom) datasets. (a) Image labeled with (green) ground truth segmentation. (b) DSAC output. (c) DarNet output. (d) TDAC output. (e) TDAC’s learned initialization map $\phi_0(x, y)$ and parameter maps (f) $\lambda_1(x, y)$ and (g) $\lambda_2(x, y)$.

Fig. 6: Additional comparative visualization of the labeled image and the outputs of DSAC, DarNet, and our TDAC, for the Vaihingen dataset. Panels: (a) Image, (b) DSAC, (c) DarNet, (d) TDAC, (e) $\phi_0(x, y)$, (f) $\lambda_1(x, y)$, (g) $\lambda_2(x, y)$.
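The Dice and IoU scores used in this comparison follow (16) and (17). A minimal NumPy sketch for binary masks is given below. The function names and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details taken from the evaluation code behind the reported results.

```python
import numpy as np

def dice_score(pred, gt):
    """Dice (F1) score of Eq. (16) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    total = pred.sum() + gt.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, gt).sum() / total

def iou_score(pred, gt):
    """Intersection over union of Eq. (17) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention, as above
    return np.logical_and(pred, gt).sum() / union
```

For a single pair of masks the two scores are related by Dice = 2 IoU / (1 + IoU), so they rank predictions identically and differ only in scale.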

Multiple-Instance Segmentation:

We next compare the performance of TDAC against popular models such as Mask R-CNN for multiple-instance segmentation of all buildings in the Vaihingen and Bing Huts datasets. As reported in Table 4, our extensive benchmarks confirm that the TDAC model outperforms Mask R-CNN and the other methods by a wide margin. Although Mask R-CNN seems to be able to localize the building instances well, the fine-grained details of boundaries are lost, as is attested by the BoundF metric. The performance of other CNN-based approaches follows the same trend in our benchmarks.
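For multiple instances, the weighted coverage (18) matches each ground truth region with its best-overlapping predicted region. The sketch below is a hedged illustration in NumPy: representing instances as lists of boolean masks, passing the normalizer $N$ explicitly as `n_pixels`, and the helper `_iou` are all assumptions of this example.

```python
import numpy as np

def _iou(a, b):
    """IoU of two boolean masks; defined as 0 when both are empty."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def weighted_coverage(pred_regions, gt_regions, n_pixels):
    """Weighted coverage of Eq. (18).

    pred_regions, gt_regions: lists of boolean masks, one mask per instance.
    n_pixels: the normalizer N of Eq. (18).
    """
    total = 0.0
    for gt in gt_regions:
        # Best IoU over all predicted regions (0 if there are no predictions).
        best = max((_iou(pred, gt) for pred in pred_regions), default=0.0)
        total += gt.sum() * best
    return total / n_pixels
```

In a 4 × 4 image with two 4-pixel ground truth regions, one matched exactly and one matched with IoU 0.5, the function returns (4 · 1 + 4 · 0.5) / 16 = 0.375.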

Parameter Maps:

To validate the contribution of the parameter maps $\lambda_1(x, y)$ and $\lambda_2(x, y)$ in the level-set ACM, we also trained our TDAC model on both the Vaihingen and Bing Huts datasets by allowing just two trainable scalar parameters, $\lambda_1$ and $\lambda_2$, constant over the entire image. As reported in Table 3, for both the Vaihingen and Bing Huts datasets, this “constant-$\lambda$” formulation (i.e., the Chan-Vese model [5,21]) still outperforms the baseline CNN in most evaluation metrics for both single-instance and multiple-instance buildings, thus establishing the effectiveness of the end-to-end training of the TDAC. Nevertheless, our TDAC with its full $\lambda_1(x, y)$ and $\lambda_2(x, y)$ maps outperforms this constant-$\lambda$ version by a wide margin in all experiments and metrics. A key metric of interest in this comparison is the BoundF score, which elucidates that our formulation captures the details of the boundaries more effectively by locally adjusting the inward and outward forces on the contour. Fig. 7 shows that our TDAC has well delineated the boundaries of the building instances, compared to the TDAC hobbled by the constant-$\lambda$ formulation.

Table 3: Model Evaluations: Single-Instance Segmentation.

                              Vaihingen                       Bing Huts
Method          Backbone      Dice   mIoU   WCov   BoundF     Dice   mIoU   WCov   BoundF
FCN             ResNet        84.20  75.60  77.50  38.30      79.90  68.40  76.14  39.19
FCN             Mask R-CNN    86.00  76.36  81.55  36.80      77.51  65.03  76.02  65.25
FCN             UNet          87.40  78.60  81.80  40.20      77.20  64.90  75.70  41.27
FCN             Ours          90.02  81.10  82.01  44.53      82.24  74.09  73.67  42.04
FCN             DSAC          –      81.00  81.40  64.60      –      69.80  73.60  30.30
FCN             DarNet        –      87.20  86.80  76.80      –      74.50  77.50  37.70
DSAC            DSAC          –      71.10  70.70  36.40      –      38.70  44.60  37.10
DSAC            DarNet        –      60.30  61.10  24.30      –      57.20  63.00  15.90
DarNet          DarNet        93.66  88.20  88.10  75.90      85.21  75.20  77.00  38.00
TDAC-const λs   Ours          91.18  83.79  82.70  73.21      84.53  73.02  74.21  48.25
TDAC            Ours

Table 4: Model Evaluations: Multiple-Instance Segmentation.

                              Vaihingen                       Bing Huts
Method          Backbone      Dice   mIoU   WCov   BoundF     Dice   mIoU   WCov   BoundF
FCN             UNet          81.00  69.10  72.40  34.20      71.58  58.70  65.70  40.60
FCN             ResNet        80.10  67.80  70.50  32.50      74.20  61.80  66.59  39.48
FCN             Mask R-CNN    88.35  79.42  80.26  41.92      76.12  63.40  70.51  41.97
FCN             Ours          89.30  81.00  82.70  49.80      75.23  60.31  72.41  41.12
TDAC-const λs   Ours          90.80  83.30  83.90  47.20      81.19  68.34  75.29  44.61
TDAC            Ours

Convolutional Filter Size:

The filter size of the convolutional operation is an important hyper-parameter for the accurate extraction of localized image statistics. As illustrated in Fig. 8a, we have investigated the effect of the convolutional filter size on the overall mIoU for both the Vaihingen and Bing Huts datasets. Our experiments indicate that filter sizes that are too small are sub-optimal, while excessively large sizes defeat the benefits of the localized formulation. Hence, we recommend a filter size of $f = 5$ for the TDAC.

Number of Iterations:

The direct learning of an initialization map $\phi_0(x, y)$ as well as its efficient TensorFlow implementation have enabled the TDAC to require substantially fewer iterations to converge with a better chance of avoiding undesirable local minima. As shown in Fig. 8b, we have investigated the effect of the number of iterations on the overall mIoU for both the Vaihingen and Bing Huts datasets, and our results reveal that TDAC exhibits a robust performance after a certain threshold. Therefore, we have chosen a fixed number of iterations (i.e., ACM layers) for optimal performance, $L = 60$, yielding a runtime of less than 1 sec in TensorFlow.

Fig. 7: (a) Image labeled with (green) ground truth segmentation. (b) Output of TDAC with constant $\lambda_1$ and $\lambda_2$. (c) TDAC output and learned parameter maps (d) $\lambda_1(x, y)$ and (e) $\lambda_2(x, y)$.

Fig. 8: The effects on mIoU of (a) varying the convolutional filter size and (b) varying the number $L$ of ACM iterations.

We have introduced a novel image segmentation framework, called Trainable Deep Active Contour Models (TDACs), which is a full, end-to-end merger of ACMs and CNNs. To this end, we proposed a new, locally-parameterized, Eulerian ACM energy model that includes pixel-wise learnable parameter maps that can adjust the contour to precisely delineate the boundaries of objects of interest in the image. Our model is fully automatic, as its backbone CNN learns the ACM initialization map as well as the parameter maps that guide the contour to avoid suboptimal solutions. This eliminates any reliance on manual initialization of ACMs.
Moreover, by contrast to previous approaches that have attempted to combine CNNs with ACMs that are limited to segmenting a single building at a time, our TDAC can segment any number of buildings simultaneously.

We have tackled the problem of building instance segmentation on two challenging datasets, Vaihingen and Bing Huts, and our model significantly outperforms the current state-of-the-art methods on these test cases.

Given the level of success that TDAC has achieved in the building delineation application and the fact that it features an Eulerian ACM formulation, it is readily applicable to other segmentation tasks in various domains, wherever purely CNN filter-based approaches can benefit from the versatility and precision of ACMs to accurately delineate object boundaries in images.

A Derivation of the ACM Evolution PDE

Following [21], we derive the Euler-Lagrange PDE governing the evolution of the ACM. Let $X_1 = (u, v)$ and $X_2 = (x, y)$ represent two independent spatial variables that each represent a point in $\Omega$. Using the characteristic function (6), which selects regions within a square window of size $s$, the energy functional of $C$ may be written in terms of a generic internal energy density $F$ as

$$E(\phi) = \int_{\Omega_{X_1}} \delta(\phi(X_1)) \int_{\Omega_{X_2}} W_s F(\phi, X_1, X_2) \, dX_2 \, dX_1. \tag{19}$$

To compute the first variation of the energy functional, we add to $\phi$ a perturbation function $\epsilon \psi$, where $\epsilon$ is a small number; hence,

$$E(\phi + \epsilon \psi) = \int_{\Omega_{X_1}} \delta(\phi(X_1) + \epsilon \psi) \int_{\Omega_{X_2}} W_s F(\phi + \epsilon \psi, X_1, X_2) \, dX_2 \, dX_1. \tag{20}$$

Taking the partial derivative of (20) with respect to $\epsilon$ and evaluating at $\epsilon = 0$ yields, according to the product rule,

$$\left. \frac{\partial E}{\partial \epsilon} \right|_{\epsilon = 0} = \int_{\Omega_{X_1}} \delta(\phi(X_1)) \int_{\Omega_{X_2}} \psi W_s \nabla_\phi F(\phi, X_1, X_2) \, dX_2 \, dX_1 + \psi \int_{\Omega_{X_1}} \gamma\phi(X_1) \int_{\Omega_{X_2}} W_s F(\phi, X_1, X_2) \, dX_2 \, dX_1, \tag{21}$$

where $\gamma\phi$ is the derivative of $\delta(\phi)$. Since $\gamma\phi$ is zero on the zero level set, it does not affect the movement of the curve. Thus, the second term in (21) can be ignored. Exchanging the order of integration, we obtain

$$\left. \frac{\partial E}{\partial \epsilon} \right|_{\epsilon = 0} = \int_{\Omega_{X_2}} \int_{\Omega_{X_1}} \psi\, \delta(\phi(X_1)) W_s \nabla_\phi F(\phi, X_1, X_2) \, dX_1 \, dX_2. \tag{22}$$

Invoking the Cauchy–Schwarz inequality yields

$$\frac{\partial \phi}{\partial t} = \int_{\Omega_{X_1}} \delta(\phi(X_1)) W_s \nabla_\phi F(\phi, X_1, X_2) \, dX_1. \tag{23}$$

Adding the contribution of the curvature term and expressing the spatial variables by their coordinates, we obtain the desired curve evolution PDE (9) with (10) under the assumption of a uniform internal energy model with $m_1(x, y)$ and $m_2(x, y)$ as the mean image intensities inside and outside $C$ and within $W_s$.

References

1. Audebert, N., Le Saux, B., Lefèvre, S.: Semantic segmentation of earth observation data using multimodal and multi-scale deep networks. In: Asian Conference on Computer Vision. pp. 180–196. Springer (2016)
2. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (12), 2481–2495 (2017)
3. Bischke, B., Helber, P., Folz, J., Borth, D., Dengel, A.: Multi-task learning for segmentation of building footprints with deep neural networks. In: 2019 IEEE International Conference on Image Processing (ICIP). pp. 1480–1484. IEEE (2019)
4. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. International Journal of Computer Vision (1), 61–79 (1997)
5. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Transactions on Image Processing (2), 266–277 (2001)
6. Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al.: Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4974–4983 (2019)
7. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (4), 834–848 (2018)
8. Chen, X., Williams, B.M., Vallabhaneni, S.R., Czanner, G., Williams, R., Zheng, Y.: Learning active contour models for medical image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11632–11640 (2019)
9. Cheng, D., Liao, R., Fidler, S., Urtasun, R.: Darnet: Deep active ray network for building segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7431–7439 (2019)
10. Costea, D., Marcu, A., Slusanschi, E., Leordeanu, M.: Creating roadmaps in aerial images with generative adversarial networks and smoothing-based optimization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops (Oct 2017)
11. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A., Brendel, W.: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In: International Conference on Learning Representations (ICLR) (2019)
12. Gur, S., Shaharabany, T., Wolf, L.: End to end trainable active contours via differentiable rendering. In: Proceedings of the International Conference on Learning Representations (ICLR) (2019)
13. Gur, S., Wolf, L., Golgher, L., Blinder, P.: Unsupervised microvascular image segmentation using an active contours mimicking neural network. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 10722–10731 (2019)
14. Hatamizadeh, A., Hoogi, A., Sengupta, D., Lu, W., Wilcox, B., Rubin, D., Terzopoulos, D.: Deep active lesion segmentation. In: International Workshop on Machine Learning in Medical Imaging. pp. 98–105. Springer (2019)
15. Hatamizadeh, A., Sengupta, D., Terzopoulos, D.: End-to-end deep convolutional active contours for image segmentation. arXiv preprint arXiv:1909.13359 (2019)
16. Hatamizadeh, A., Terzopoulos, D., Myronenko, A.: End-to-end boundary aware networks for medical image segmentation. In: International Workshop on Machine Learning in Medical Imaging. pp. 187–194. Springer (2019)
17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
18. Hu, P., Shuai, B., Liu, J., Wang, G.: Deep level sets for salient object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
19. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision (4), 321–331 (1988)
20. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
21. Lankton, S., Tannenbaum, A.: Localizing region-based active contours. IEEE Transactions on Image Processing (11), 2029–2039 (2008)
22. Le, T.H.N., Quach, K.G., Luu, K., Duong, C.N., Savvides, M.: Reformulating level sets as deep recurrent neural network approach to semantic segmentation. IEEE Transactions on Image Processing (5), 2393–2407 (2018)
23. Li, C., Xu, C., Gui, C., Fox, M.D.: Distance regularized level set evolution and its application to image segmentation. IEEE Transactions on Image Processing (12), 3243 (2010)
24. Lillesand, T., Kiefer, R.W., Chipman, J.: Remote Sensing and Image Interpretation. John Wiley & Sons (2015)
25. Marcos, D., Tuia, D., Kellenberger, B., Zhang, L., Bai, M., Liao, R., Urtasun, R.: Learning deep structured active contours end-to-end. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8877–8885 (2018)
26. Marquez-Neila, P., Baumela, L., Alvarez, L.: A morphological approach to curvature-based evolution of curves and surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence (1), 2–17 (2013)
27. Myronenko, A., Hatamizadeh, A.: Robust semantic segmentation of brain tumor regions from 3D MRIs. In: International MICCAI Brainlesion Workshop. pp. 82–89. Springer (2019)
28. Rudner, T.G., Rußwurm, M., Fil, J., Pelich, R., Bischke, B., Kopačková, V., Biliński, P.: Multi3Net: Segmenting flooded buildings via fusion of multiresolution, multisensor, and multitemporal satellite imagery. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 702–709 (2019)
29. Shrivastava, N., Rai, P.K.: Remote-sensing the urban area: Automatic building extraction based on multiresolution segmentation and classification. Geografia: Malaysian Journal of Society and Space (2) (2017)
30. Silberman, N., Sontag, D., Fergus, R.: Instance segmentation of indoor scenes using a coverage loss. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 616–631. Springer International Publishing, Cham (2014)
31. Wang, S., Bai, M., Mattyus, G., Chu, H., Luo, W., Yang, B., Liang, J., Cheverie, J., Fidler, S., Urtasun, R.: Torontocity: Seeing the world with a million eyes. arXiv preprint arXiv:1612.00423 (2016)
32. Wang, Z., Acuna, D., Ling, H., Kar, A., Fidler, S.: Object instance annotation with deep extreme level set evolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7500–7508 (2019)
33. Wu, G., Shao, X., Guo, Z., Chen, Q., Yuan, W., Shi, X., Xu, Y., Shibasaki, R.: Automatic building segmentation of aerial imagery using multi-constraint fully convolutional networks. Remote Sensing (3), 407 (2018)
34. Xu, Y., Wu, L., Xie, Z., Chen, Z.: Building extraction in very high resolution remote sensing imagery using deep learning and guided filters. Remote Sensing (1), 144 (2018)
35. Zhang, P., Ke, Y., Zhang, Z., Wang, M., Li, P., Zhang, S.: Urban land use and land cover classification using novel deep learning models based on high spatial resolution satellite imagery. Sensors 18