ABCNet: Attentive Bilateral Contextual Network for Efficient Semantic Segmentation of Fine-Resolution Remote Sensing Images
Rui Li and Chenxi Duan
School of Remote Sensing and Information Engineering, Wuhan University, 129 Luoyu Road, Wuhan, Hubei 430079, China.
The State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, 129 Luoyu Road, Wuhan, Hubei 430079, China.
E-mail addresses: [email protected] (R. Li), [email protected] (C. Duan). *Corresponding author.
Abstract — Semantic segmentation of remotely sensed images plays a crucial role in precision agriculture, environmental protection, and economic assessment. In recent years, a large number of fine-resolution remote sensing images have become available for semantic segmentation. However, owing to the complicated information introduced by the increased spatial resolution, state-of-the-art deep learning algorithms normally utilize complex network architectures for segmentation, which usually incur high computational complexity. Specifically, the high performance of convolutional neural networks (CNNs) relies heavily on fine-grained spatial details (fine resolution) and sufficient contextual information (large receptive fields), both of which trigger high computational costs. This crucially impedes their practicability in real-world scenarios that require real-time processing. In this paper, we propose an Attentive Bilateral Contextual Network (ABCNet), a convolutional neural network (CNN) with two branches, with prominently lower computational consumption than cutting-edge algorithms while maintaining competitive accuracy. Code is available at https://github.com/lironui/ABCNet.
Index Terms — Semantic Segmentation, Attention Mechanism, Convolutional Neural Network

INTRODUCTION
Profiting from the rapidly expanding Earth Observation techniques, a large number of remotely sensed images with fine spatial and spectral resolutions are now available for a wide range of application scenarios such as image classification (Lyons et al., 2018; Maggiori et al., 2016), object detection (Li et al., 2017; Xia et al., 2018), and semantic segmentation (Kemker et al., 2018; Zhang et al., 2019a). The revisiting property of orbital acquisitions makes continuous monitoring of the land surface, ocean, and atmosphere possible (Duan and Li, 2020). Fine-resolution remote sensing images normally contain substantial detailed spatial information about land cover and land use (Duan et al., 2020). Semantic segmentation, which assigns each pixel in an image to a definite category, has become one of the most crucial levers for ground object interpretation. Specifically, semantic segmentation of remotely sensed imagery plays a pivotal role in various scenarios including precision agriculture (Griffiths et al., 2019; Picoli et al., 2018), environmental protection (Samie et al., 2020; Yin et al., 2018), and economic assessment (Zhang et al., 2020; Zhang et al., 2019a). From a panoramic view, semantic segmentation is one of the high-level tasks that paves the way for complete scene understanding. Hence, semantic segmentation is at the forefront of a comprehensive effort towards automatic Earth monitoring by international agencies.
To identify image content across various land cover and land use categories, numerous approaches have explored the utilization of spectral and spectral-spatial features to interpret remote sensing images (Gong et al., 1992; Ma et al., 2017; Tucker, 1979; Zhong et al., 2014; Zhu et al., 2017). However, the finite ability of these methods to capture the contextual information contained in images restricts their flexibility and adaptability (Li et al., 2020c; Tong et al., 2020), especially when the detailed and structural information surges with the increased spatial resolution. By contrast, bolstered by its powerful capability to capture nonlinear and hierarchical features automatically, the deep Convolutional Neural Network (CNN) has had a significant impact on the understanding of fine-resolution remote sensing images (Li et al., 2020a; Zheng et al., 2020). For semantic segmentation, the Fully Convolutional Network (FCN) (Long et al., 2015) is the first proven and effective end-to-end CNN structure. Restricted by the oversimple design of its decoder, the results of the FCN, although very encouraging, appear coarse. Subsequently, more elaborate encoder-decoder structures (Badrinarayanan et al., 2017; Ronneberger et al., 2015) were proposed, comprising two symmetric paths: a contracting path for extracting features and an expanding path for precise localization, which together accomplish more accurate results. To guarantee segmentation accuracy, global contextual information and multiscale semantic features are supposed to be thoroughly utilized for semantic categories of varying sizes in images. Using the spatial pyramid pooling module, the pyramid scene parsing network (PSPNet) (Zhao et al., 2017) aggregates contextual information among different regions. The dual attention network (DANet) (Fu et al., 2019) applies the dot-product attention mechanism to extract abundant contextual relationships.
Subject to enormous memory and computational consumption, DANet simply attaches the dot-product attention mechanism at the lowest layer and merely captures long-range dependencies from the smallest feature maps. DeepLabV3 (Chen et al., 2017) adopts atrous convolution to mine multiscale features, while a simple yet effective decoder module is added in DeepLabV3+ (Chen et al., 2018a) to further refine the segmentation results. However, the extraction of global contextual information and the exploitation of large-scale feature maps are computationally expensive (Duan and Li, 2020; Li et al., 2020b). Therefore, a series of lightweight networks has been developed for efficient semantic segmentation (Hu et al., 2020; Oršić and Šegvić, 2021; Romera et al., 2017; Yu et al., 2018); for example, SwiftNet (Oršić and Šegvić, 2021) explores the effectiveness of pyramidal fusion in compact architectures. Due to their limited capacity for extracting global context information, there is a huge gap in accuracy between lightweight networks and state-of-the-art models, which is especially true for fine-resolution remotely sensed images. As a powerful approach to capturing long-range dependencies, the dot-product attention mechanism (Vaswani et al., 2017) is a plausibly ideal remedy for this limitation. However, the memory and computational consumption of the dot-product attention mechanism increases quadratically with the spatio-temporal size of the input, which runs counter to the original intention of lightweight networks. Encouragingly, our previous work on linear attention (Li et al., 2020a), which reduces the complexity of the dot-product attention mechanism from $O(N^2)$ to $O(N)$, alleviates this plight.

Fig. 1 Illustration of (a) the encoder-decoder structure and (b) the bilateral architecture.

In this paper, we aim to further improve segmentation accuracy while simultaneously ensuring the efficiency of semantic segmentation.
We approach this challenging problem by modeling the global contextual information using the linear attention mechanism. To be specific, we propose an Attentive Bilateral Contextual Network (ABCNet) for efficient semantic segmentation of fine-resolution remote sensing images. Following the design philosophy of BiSeNet (Yu et al., 2018), there are two branches in the proposed ABCNet: a spatial path to retain affluent spatial details and a contextual path to capture global contextual information. Compared with the encoder-decoder structure (Fig. 1(a)), the bilateral architecture (Fig. 1(b)) can retain more spatial information without slowing down the model (Yu et al., 2018). Concretely, the spatial path merely stacks three convolution layers to generate 1/8-scale feature maps, while the contextual path includes two attention enhancement modules (AEMs) to refine the features and capture contextual information. As the features generated by the two paths differ in their level of representation, we further design a feature aggregation module (FAM) to fuse them. Our main contributions are summarized as follows:
1) We propose a novel approach for efficient semantic segmentation of fine-resolution remote sensing images. Specifically, we propose an Attentive Bilateral Contextual Network (ABCNet) with a spatial path and a contextual path.
2) We design two specific modules: the attention enhancement module (AEM) for exploring long-range contextual information, and the feature aggregation module (FAM) for fusing the features obtained by the two paths.
3) We achieve competitive results on the ISPRS Vaihingen and ISPRS Potsdam datasets. More specifically, we obtain 91.095% overall accuracy on the Potsdam test set at a speed of 72.13 FPS, even on a mid-range graphics card (1660 Ti).

Related Work

Context information extraction
As the performance of semantic segmentation hinges heavily on abundant context information, a great many endeavors have been poured into tackling this issue. Dilated (atrous) convolution (Chen et al., 2014; Yu and Koltun, 2015) has been demonstrated to be an effective technique for enlarging receptive fields without shrinking the spatial resolution. The encoder-decoder architecture (Ronneberger et al., 2015), which merges high-level and low-level features using skip connections, is another valid way to extract spatial context. Based on the encoder-decoder framework or a dilation backbone, several subsequent studies focus on exploring the usage of spatial pyramid pooling (SPP) (He et al., 2015). For example, the pyramid pooling module (PPM) in PSPNet is composed of convolutions with kernels of four different sizes (Zhao et al., 2017), while DeepLab v2 (Chen et al., 2018a) is equipped with the atrous spatial pyramid pooling (ASPP) module, which groups parallel atrous convolution layers with varying dilation rates. However, certain limitations remain in SPP. SPP with standard convolutions faces a dilemma when enlarging the receptive field through a large kernel size, as such operations are normally accompanied by a huge number of parameters. SPP with small kernels (e.g., ASPP), on the other hand, lacks sufficient connection between adjacent features and suffers from the gridding problem (Wang et al., 2018a), which occurs when the receptive field is enlarged by a dilated convolutional layer. By contrast, the powerful ability to model long-range dependencies enables the dot-product attention mechanism to extract context information at the global scale.

Dot-Product Attention Mechanism
Let $H$, $W$, and $C$ denote the height, width, and number of channels of the input, respectively. The input feature is defined as $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_N] \in \mathbb{R}^{N \times C}$, where $N = H \times W$. First, the dot-product attention mechanism utilizes three projection matrices $\mathbf{W}_q \in \mathbb{R}^{D_x \times D_k}$, $\mathbf{W}_k \in \mathbb{R}^{D_x \times D_k}$, and $\mathbf{W}_v \in \mathbb{R}^{D_x \times D_v}$ to generate the corresponding query matrix $\mathbf{Q}$, key matrix $\mathbf{K}$, and value matrix $\mathbf{V}$:

$\mathbf{Q} = \mathbf{X}\mathbf{W}_q \in \mathbb{R}^{N \times D_k};\quad \mathbf{K} = \mathbf{X}\mathbf{W}_k \in \mathbb{R}^{N \times D_k};\quad \mathbf{V} = \mathbf{X}\mathbf{W}_v \in \mathbb{R}^{N \times D_v}.$ (1)

Please note that the dimensions of $\mathbf{Q}$ and $\mathbf{K}$ are supposed to be identical, and all vectors in this section are column vectors by default. Accordingly, a normalization function $\rho$ is employed to measure the similarity between the $i$-th query feature $\mathbf{q}_i \in \mathbb{R}^{D_k}$ and the $j$-th key feature $\mathbf{k}_j \in \mathbb{R}^{D_k}$ as $\rho(\mathbf{q}_i^T \mathbf{k}_j) \in \mathbb{R}$. As the query and key features are generated via different layers, the similarities $\rho(\mathbf{q}_i^T \mathbf{k}_j)$ and $\rho(\mathbf{q}_j^T \mathbf{k}_i)$ are not symmetric. By calculating the similarities between all pairs of pixels in the input feature maps and taking the similarities as weights, the dot-product attention mechanism generates the value at position $i$ by aggregating the value features from all positions using weighted summation:

$D(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \rho(\mathbf{Q}\mathbf{K}^T)\mathbf{V}.$ (2)

Normally, softmax is the frequently used normalization function:

$\rho(\mathbf{Q}\mathbf{K}^T) = \mathrm{softmax}_{row}(\mathbf{Q}\mathbf{K}^T),$ (3)

where $\mathrm{softmax}_{row}$ indicates that the softmax is applied along each row of the matrix $\mathbf{Q}\mathbf{K}^T$. By modeling the similarities between each pair of positions of the input, $\rho(\mathbf{Q}\mathbf{K}^T)$ thoroughly extracts the global dependencies in the features. The dot-product attention mechanism was first designed for machine translation (Vaswani et al., 2017), while the non-local module (Wang et al., 2018b) introduced and modified it for computer vision (Fig. 2).
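For clarity, Eqs. (2)-(3) can be sketched in a few lines of pure Python. The toy matrices below are hypothetical values chosen only for illustration; no framework is assumed.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of Q K^T (Eq. (3)).
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def matmul(A, B):
    # Plain (rows x cols) matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def dot_product_attention(Q, K, V):
    # Eq. (2): D(Q, K, V) = softmax_row(Q K^T) V.
    # The N x N score matrix is materialised explicitly, hence
    # O(N^2) time and memory in the number of positions N.
    KT = [list(col) for col in zip(*K)]
    scores = matmul(Q, KT)                 # N x N
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)              # N x D_v

# Toy example: N = 3 positions, D_k = D_v = 2 (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = dot_product_attention(Q, K, V)
# Each output row is a convex combination of the value rows,
# so every entry stays within the range of the corresponding V column.
```

The explicit N x N score matrix makes the quadratic cost visible: doubling the number of positions quadruples both the score matrix and the work needed to normalize it.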
Based on the dot-product attention mechanism and its variants, a constellation of attention-based networks has been proposed to tackle the semantic segmentation task. Inspired by the non-local module (Wang et al., 2018b), the Double Attention Network (A²-Net) (Chen et al., 2018b), Dual Attention Network (DANet) (Fu et al., 2019), Point-wise Spatial Attention Network (PSANet) (Zhao et al., 2018), Object Context Network (OCNet) (Yuan and Wang, 2018), and Co-occurrent Feature Network (CFNet) (Zhang et al., 2019b) were proposed successively for scene segmentation by exploring long-range dependencies.

Fig. 2 The diagram of the dot-product attention modified for computer vision.

Even though the introduction of attention significantly boosts segmentation performance, the huge resource demands of the dot-product critically hinder its application to large inputs. To be specific, for $\mathbf{Q} \in \mathbb{R}^{N \times D_k}$ and $\mathbf{K}^T \in \mathbb{R}^{D_k \times N}$, the product of $\mathbf{Q}$ and $\mathbf{K}^T$ belongs to $\mathbb{R}^{N \times N}$, leading to $O(N^2)$ memory and computational complexity. Consequently, it is requisite to lower the high computational demands of the dot-product attention mechanism.

Generalization and simplification of the dot-product attention mechanism
If the normalization function is set as softmax, the $i$-th row of the result matrix generated by the dot-product attention mechanism can be written as:

$D(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} e^{\mathbf{q}_i^T \mathbf{k}_j}\, \mathbf{v}_j}{\sum_{j=1}^{N} e^{\mathbf{q}_i^T \mathbf{k}_j}}.$ (4)

Equation (4) can be rewritten and generalized to any normalization function as:

$D(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} \mathrm{sim}(\mathbf{q}_i, \mathbf{k}_j)\, \mathbf{v}_j}{\sum_{j=1}^{N} \mathrm{sim}(\mathbf{q}_i, \mathbf{k}_j)}, \quad \mathrm{sim}(\mathbf{q}_i, \mathbf{k}_j) \ge 0.$ (5)

$\mathrm{sim}(\mathbf{q}_i, \mathbf{k}_j)$ can be expanded as $\phi(\mathbf{q}_i)^T \varphi(\mathbf{k}_j)$, which measures the similarity between $\mathbf{q}_i$ and $\mathbf{k}_j$, whereupon equation (5) can be rewritten as equation (6) and simplified as equation (7):

$D(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} \phi(\mathbf{q}_i)^T \varphi(\mathbf{k}_j)\, \mathbf{v}_j}{\sum_{j=1}^{N} \phi(\mathbf{q}_i)^T \varphi(\mathbf{k}_j)},$ (6)

$D(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\phi(\mathbf{q}_i)^T \sum_{j=1}^{N} \varphi(\mathbf{k}_j)\, \mathbf{v}_j^T}{\phi(\mathbf{q}_i)^T \sum_{j=1}^{N} \varphi(\mathbf{k}_j)}.$ (7)

Particularly, if $\phi(\cdot) = \varphi(\cdot) = e^{(\cdot)}$, equation (5) is equivalent to equation (4). The vectorized form of equation (7) is:

$D(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \frac{\phi(\mathbf{Q})\left(\varphi(\mathbf{K})^T \mathbf{V}\right)}{\phi(\mathbf{Q}) \sum_j \varphi(\mathbf{K})^T_{i,j}}.$ (8)

As $\mathrm{sim}(\mathbf{q}_i, \mathbf{k}_j) = \phi(\mathbf{q}_i)^T \varphi(\mathbf{k}_j)$ is substituted for the softmax function, the order of the multiplications can be altered, thereby avoiding the $N \times N$ product between the reshaped key matrix $\mathbf{K}$ and the query matrix $\mathbf{Q}$. In concrete terms, the product between $\varphi(\mathbf{K})^T$ and $\mathbf{V}$ can be computed first and the result then multiplied by $\phi(\mathbf{Q})$, leading to only $O(dN)$ time complexity and $O(dN)$ space complexity. Suitable choices of $\phi(\cdot)$ and $\varphi(\cdot)$ enable the above scheme to achieve competitive performance with finite complexity (Katharopoulos et al., 2020; Li et al., 2020b).

Linear Attention Mechanism
In our previous work (Li et al., 2020a), we proposed a linear attention mechanism from another perspective, replacing the softmax function with the first-order approximation of the Taylor expansion:

$e^{\mathbf{q}_i^T \mathbf{k}_j} \approx 1 + \mathbf{q}_i^T \mathbf{k}_j.$ (9)

To guarantee that the above approximation is nonnegative, $\mathbf{q}_i$ and $\mathbf{k}_j$ are normalized by the $l_2$ norm, thereby ensuring $\mathbf{q}_i^T \mathbf{k}_j \ge -1$:

$\mathrm{sim}(\mathbf{q}_i, \mathbf{k}_j) = 1 + \left(\frac{\mathbf{q}_i}{\lVert \mathbf{q}_i \rVert_2}\right)^T \left(\frac{\mathbf{k}_j}{\lVert \mathbf{k}_j \rVert_2}\right).$ (10)

Thus, equation (5) can be rewritten as equation (11) and simplified as equation (12):

$D(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} \left(1 + \left(\frac{\mathbf{q}_i}{\lVert \mathbf{q}_i \rVert_2}\right)^T \left(\frac{\mathbf{k}_j}{\lVert \mathbf{k}_j \rVert_2}\right)\right) \mathbf{v}_j}{\sum_{j=1}^{N} \left(1 + \left(\frac{\mathbf{q}_i}{\lVert \mathbf{q}_i \rVert_2}\right)^T \left(\frac{\mathbf{k}_j}{\lVert \mathbf{k}_j \rVert_2}\right)\right)},$ (11)

$D(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} \mathbf{v}_j + \left(\frac{\mathbf{q}_i}{\lVert \mathbf{q}_i \rVert_2}\right)^T \sum_{j=1}^{N} \left(\frac{\mathbf{k}_j}{\lVert \mathbf{k}_j \rVert_2}\right) \mathbf{v}_j^T}{N + \left(\frac{\mathbf{q}_i}{\lVert \mathbf{q}_i \rVert_2}\right)^T \sum_{j=1}^{N} \frac{\mathbf{k}_j}{\lVert \mathbf{k}_j \rVert_2}}.$ (12)

Equation (12) can be turned into the vectorized form:

$D(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \frac{\sum_j \mathbf{V}_{i,j} + \frac{\mathbf{Q}}{\lVert \mathbf{Q} \rVert_2}\left(\left(\frac{\mathbf{K}}{\lVert \mathbf{K} \rVert_2}\right)^T \mathbf{V}\right)}{N + \frac{\mathbf{Q}}{\lVert \mathbf{Q} \rVert_2} \sum_j \left(\frac{\mathbf{K}}{\lVert \mathbf{K} \rVert_2}\right)^T_{i,j}}.$ (13)

Since $\sum_{j=1}^{N} \left(\frac{\mathbf{k}_j}{\lVert \mathbf{k}_j \rVert_2}\right)\mathbf{v}_j^T$ and $\sum_{j=1}^{N} \frac{\mathbf{k}_j}{\lVert \mathbf{k}_j \rVert_2}$ can be calculated once and reused for each query, the time and memory complexity of the attention based on equation (13) is $O(dN)$.

Fig. 3 The (a) computation requirement and (b) memory requirement of the linear attention mechanism and the dot-product attention mechanism under different input sizes. The calculation assumes $C = D_v = 2D_k = 64$. Please notice that the figure is on a log scale.

The validity and efficiency of the proposed attention have been verified through extensive ablation experiments and analysis (Li et al., 2020a).

Efficient semantic segmentation
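The efficiency argument above can be checked directly. The following pure-Python sketch (toy matrices with hypothetical values) computes the linear attention both ways: via the $O(N^2)$ direct form of Eq. (11) and via the reordered $O(dN)$ form of Eq. (12), which precomputes the sums over $j$ once and reuses them for every query.

```python
import math

def _unit(v):
    # l2-normalize a vector (Eq. (10) requires unit queries and keys).
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def linear_attention_naive(Q, K, V):
    # Eq. (11): O(N^2) reference implementation.
    out = []
    for q in Q:
        qh = _unit(q)
        sims = [1.0 + sum(a * b for a, b in zip(qh, _unit(k))) for k in K]
        denom = sum(sims)
        out.append([sum(s * v[d] for s, v in zip(sims, V)) / denom
                    for d in range(len(V[0]))])
    return out

def linear_attention_fast(Q, K, V):
    # Eq. (12): the sums over j are computed once and reused for
    # every query, giving O(dN) time and memory.
    N, Dk, Dv = len(K), len(K[0]), len(V[0])
    Ks = [_unit(k) for k in K]
    sum_v = [sum(v[d] for v in V) for d in range(Dv)]          # sum_j v_j
    kv = [[sum(Ks[j][c] * V[j][d] for j in range(N))
           for d in range(Dv)] for c in range(Dk)]             # sum_j k_j v_j^T
    sum_k = [sum(Ks[j][c] for j in range(N)) for c in range(Dk)]
    out = []
    for q in Q:
        qh = _unit(q)
        denom = N + sum(a * b for a, b in zip(qh, sum_k))
        out.append([(sum_v[d] + sum(qh[c] * kv[c][d] for c in range(Dk))) / denom
                    for d in range(Dv)])
    return out

# Hypothetical toy inputs: N = 3, D_k = D_v = 2.
Q = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a = linear_attention_naive(Q, K, V)
b = linear_attention_fast(Q, K, V)
# Both orderings agree up to floating-point error.
```

Because the precomputed sums are shared across queries, each additional query costs only $O(D_k D_v)$ work in the fast variant instead of $O(N D_k)$.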
For many applications, efficiency is critical, which is especially true for real-time scenarios such as autonomous driving. Therefore, recent research has made great efforts to accelerate models for efficient semantic segmentation, either by employing lightweight models or by downsampling the input. The utilization of lightweight convolutions (e.g., asymmetric convolution and depth-wise separable convolution) is a common strategy for designing lightweight networks (Romera et al., 2017; Yu et al., 2018). Downsampling the input is a trivial way to speed up semantic segmentation by reducing the resolution of the input images, but it leads to the loss of image details. To extract spatial details at the original resolution, many methods further add a shallow branch, forming a two-path architecture (Yu et al., 2020; Yu et al., 2018).

Attentive Bilateral Contextual Network
Fig. 4 An overview of the Attentive Bilateral Contextual Network. (a) Network architecture. (b) The attention enhancement module (AEM). (c) The feature aggregation module (FAM). (d) The linear attention mechanism.

The proposed Attentive Bilateral Contextual Network (ABCNet), together with its components, is illustrated in Fig. 4.

Spatial path
Although both rich spatial details and a large receptive field are crucial for high segmentation accuracy, it is difficult to reconcile the two demands simultaneously. In particular, for efficient semantic segmentation, mainstream solutions focus on down-sampling the input image or speeding up the network by channel pruning. The former discards the majority of the spatial details, while the latter damages the spatial representation. By contrast, the proposed ABCNet adopts the bilateral architecture (Yu et al., 2018), which is equipped with a spatial path to capture spatial details and generate low-level feature maps. A rich channel capacity is therefore essential for this path to encode sufficient detailed spatial information. Meanwhile, as the spatial path focuses only on low-level details, a shallow structure with a small stride is enough for this branch. Specifically, the spatial path comprises three layers, as shown in Fig. 4(a). Each layer contains a convolution with stride 2, followed by batch normalization (Ioffe and Szegedy, 2015) and ReLU (Glorot et al., 2011). The output feature maps of this path are therefore 1/8 the size of the original image and encode abundant spatial details thanks to their large spatial resolution.

Contextual path
In parallel to the spatial path, the contextual path is designed to extract high-level global context information and provide a sufficient receptive field. To enlarge the receptive field, several networks take advantage of spatial pyramid pooling with large kernels, leading to huge computational and memory demands. To account for long-range context information and efficient computation simultaneously, we develop the contextual path with the linear attention mechanism (Li et al., 2020a). Concretely, in the contextual path, as shown in Fig. 4(a), we harness a lightweight backbone (i.e., ResNet-18) (He et al., 2016) to down-sample the feature maps and encode high-level semantic information. Thereafter, we deploy two attention enhancement modules (AEMs) on the tails of the backbone to fully extract global context information. The features obtained by the last two stages are fused and fed into the feature aggregation module (FAM).

Feature aggregation module
The feature representations of the spatial path and the contextual path are complementary but lie in different domains (i.e., the spatial path generates low-level, detailed features, while the contextual path obtains high-level, semantic features). Thus, simple fusion schemes such as summation and concatenation are not appropriate manners of fusing this information. Instead, we design a feature aggregation module (FAM) to merge the two types of feature representation with both accuracy and efficiency in mind. As shown in Fig. 4(c), we first concatenate the outputs of the spatial path and the contextual path. Thereafter, a convolution layer with batch normalization (Ioffe and Szegedy, 2015) and ReLU (Glorot et al., 2011) is attached to balance the scales of the features. Then, we capture the long-range dependencies of the generated features using the linear attention mechanism.

Loss function
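The loss design described next combines a principal cross-entropy term with two auxiliary focal terms on the contextual path. A minimal sketch for the binary, per-pixel case follows; the scalar probabilities used in the demo are hypothetical values for illustration only.

```python
import math

def principal_loss(p, y):
    # Eq. (14): binary cross-entropy on the network prediction p
    # against the ground truth y.
    return -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)

def auxiliary_loss(p, y, gamma=2.0):
    # Eq. (15): focal loss; gamma = 2 down-weights easily
    # classified examples.
    return (-y * (1.0 - p) ** gamma * math.log(p)
            - (1.0 - y) * p ** gamma * math.log(1.0 - p))

def total_loss(p_main, p_aux1, p_aux2, y):
    # Eq. (16): the principal loss plus the two auxiliary losses
    # that supervise intermediate outputs of the contextual path.
    return (principal_loss(p_main, y)
            + auxiliary_loss(p_aux1, y)
            + auxiliary_loss(p_aux2, y))

# Hypothetical predictions: for a confidently correct prediction
# (p = 0.9, y = 1), the focal term is strongly down-weighted
# relative to plain cross-entropy.
overall = total_loss(0.9, 0.8, 0.8, 1.0)
```

The `(1 - p) ** gamma` factor is what shifts the training signal toward hard examples: easy pixels contribute almost nothing to the auxiliary terms.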
As can be seen from Fig. 1(b), besides the principal loss function supervising the output of the whole network, we utilize two auxiliary loss functions at the contextual path to accelerate convergence. We select the cross-entropy loss as the principal loss:

$loss_{pri}(p, y) = -y \log(p) - (1 - y)\log(1 - p),$ (14)

where $p$ is the prediction generated by the network and $y$ is the ground truth. The auxiliary loss functions are chosen as the focal loss:

$loss_{aux1}(p, y) = loss_{aux2}(p, y) = -y(1 - p)^{\gamma} \log(p) - (1 - y)\,p^{\gamma} \log(1 - p),$ (15)

where $\gamma$ is the focusing parameter, which controls the down-weighting of easily classified examples and is set to 2 in our experiments. Hence, the overall loss of the network is:

$loss(p, y) = loss_{pri}(p, y) + loss_{aux1}(p, y) + loss_{aux2}(p, y).$ (16)

EXPERIMENTAL RESULTS AND DISCUSSION

Datasets
The effectiveness of the proposed ABCNet is verified on the ISPRS Potsdam and ISPRS Vaihingen datasets.
Potsdam: The Potsdam dataset contains 38 fine-resolution images of 6000 × 6000 pixels with a ground sampling distance (GSD) of 5 cm. The dataset provides near-infrared, red, green, and blue channels as well as a DSM and a normalized DSM (NDSM). We utilize images 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, and 7_13 for testing, image 2_10 for validation, and the remaining images (22 in total, after excluding image 7_10, which has erroneous annotations) for training. Please note that we only employ the red, green, and blue channels in our experiments.

Vaihingen: The Vaihingen dataset contains 33 images with an average size of 2494 × 2064 pixels and a GSD of 9 cm. The near-infrared, red, and green channels, together with the DSM, are provided in the dataset. We utilize images 2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35, and 38 for testing, image 30 for validation, and the remaining 15 images for training. The DSM is not used in our experiments.

Evaluation Metrics
The performance of ABCNet is evaluated using the overall accuracy (OA), the mean Intersection over Union (mIoU), and the F1 score (F1). Based on the accumulated confusion matrix, the OA, mIoU, and F1 are computed as:
$OA = \frac{\sum_{k=1}^{N} TP_k}{\sum_{k=1}^{N} \left(TP_k + FP_k + TN_k + FN_k\right)},$ (17)

$mIoU = \frac{1}{N} \sum_{k=1}^{N} \frac{TP_k}{TP_k + FP_k + FN_k},$ (18)

$F1 = 2 \times \frac{precision \times recall}{precision + recall},$ (19)

where $TP_k$, $FP_k$, $TN_k$, and $FN_k$ represent the true positives, false positives, true negatives, and false negatives, respectively, for objects indexed as class $k$. OA is computed over all categories, including the background.

Experimental Setting
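The metrics above can be computed directly from the accumulated confusion matrix. A minimal sketch, using a hypothetical two-class matrix for illustration:

```python
def metrics(conf):
    # conf[i][j]: number of pixels of true class i predicted as
    # class j (the accumulated confusion matrix), Eqs. (17)-(19).
    n = len(conf)
    total = sum(sum(row) for row in conf)
    tp = [conf[k][k] for k in range(n)]
    fp = [sum(conf[i][k] for i in range(n)) - conf[k][k] for k in range(n)]
    fn = [sum(conf[k]) - conf[k][k] for k in range(n)]
    oa = sum(tp) / total                         # correctly classified / all pixels
    miou = sum(tp[k] / (tp[k] + fp[k] + fn[k]) for k in range(n)) / n
    f1 = []
    for k in range(n):
        prec = tp[k] / (tp[k] + fp[k])
        rec = tp[k] / (tp[k] + fn[k])
        f1.append(2 * prec * rec / (prec + rec))
    return oa, miou, f1

# Hypothetical 2-class confusion matrix (rows: truth, cols: prediction).
conf = [[40, 10],
        [5, 45]]
oa, miou, f1 = metrics(conf)
# oa = 85 / 100 = 0.85
```

Accumulating one confusion matrix over the whole test set, rather than averaging per-image scores, is what makes the reported OA and mIoU robust to images with rare classes.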
All training procedures are implemented in PyTorch on a single Tesla V100 with a batch size of 32; the optimizer is AdamW with a learning rate of 0.0003. For training, the raw images are cropped into 512 × 512 patches and augmented by rotation, resizing, horizontal flipping, vertical flipping, and the addition of random noise. The comparative methods include contextual-information aggregation methods designed initially for natural images, such as the pyramid scene parsing network (PSPNet) (Zhao et al., 2017) and the dual attention network (DANet) (Fu et al., 2019); multi-scale feature aggregation models proposed for remote sensing images, such as the multi-stage attention ResU-Net (MAResU-Net) (Li et al., 2020a) and the edge-aware neural network (EaNet) (Zheng et al., 2020); and lightweight networks developed for efficient semantic segmentation, including the depth-wise asymmetric bottleneck network (DABNet) (Li et al., 2019), the efficient residual factorized ConvNet (ERFNet) (Romera et al., 2017), the bilateral segmentation network V1 (BiSeNetV1) (Yu et al., 2018) and V2 (BiSeNetV2) (Yu et al., 2020), the fast attention network (FANet) (Hu et al., 2020), ShelfNet (Zhuang et al., 2019), and SwiftNet (Oršić and Šegvić, 2021). Test-time augmentation (TTA) in the form of rotation and flipping is applied for all comparative methods.

Ablation study
To verify the effectiveness of the components of the proposed ABCNet, we conduct extensive ablation experiments; the setting details and quantitative results are listed in Table I.
Baseline: We utilize ResNet-18 as the backbone of the contextual path and select the contextual path without the AEM (denoted as Cp in Table I) as the baseline. The feature maps generated by Cp are directly up-sampled to the same shape as the original input image.

Ablation for the attention enhancement module: To capture global context information, we specifically design an attention enhancement module (AEM) in the contextual path. As presented in Table I, on both datasets, the utilization of the AEM (denoted as Cp + AEM) brings more than a 1.5% improvement in mIoU.
Ablation for the spatial path: As affluent spatial information is crucial for semantic segmentation, the spatial path is designed to preserve the spatial size and extract spatial information. Table I demonstrates that even simple fusion schemes such as summation (denoted as Cp + Sp + AEM(Sum)) and concatenation (denoted as Cp + Sp + AEM(Cat)) boost the performance.
TABLE I
ABLATION STUDY OF EACH COMPONENT IN OUR PROPOSED ABCNET

Dataset    Method                Mean F1 (%)  OA (%)  mIoU (%)
Vaihingen  Cp                    83.862       88.141  74.433
           Cp + AEM              85.746       88.780  76.268
           Cp + Sp + AEM(Sum)    86.575       89.831  77.529
           Cp + Sp + AEM(Cat)    87.059       89.715  78.779
           Cp + Sp + AEM + FAM   89.497       …       …
Potsdam    Cp                    89.716       …       …
           Cp + AEM              90.600       …       …
           Cp + Sp + AEM(Sum)    91.029       …       …
           Cp + Sp + AEM(Cat)    91.233       …       …
           Cp + Sp + AEM + FAM   92.498       …       …

Ablation for the feature aggregation module: Given that the features obtained by the spatial path and the contextual path lie in different domains, neither summation nor concatenation is the optimal fusion scheme. As can be seen from Table I, the significant performance gap confirms the validity of the feature aggregation module (denoted as Cp + Sp + AEM + FAM).

TABLE II
THE COMPLEXITY AND SPEED OF THE PROPOSED ABCNET AND COMPARATIVE METHODS

Method                        Backbone  Complexity (G)  Parameters (M)  FPS@256²  FPS@512²  FPS@1024²  FPS@2048²  FPS@4096²  mIoU (%)
DABNet (Li et al., 2019)      —         5.22            0.75            90.67     87.74     27.41      7.44       *          82.144
ERFNet (Romera et al., 2017)  —         14.75           2.06            90.51     59.04     17.59      4.87       1.25       79.152
BiSeNetV1 (Yu et al., 2018)   ResNet18  15.25           …
…                             ResNet18  12.55           24.03           151.12    105.03    34.83      10.16      2.66       77.971
BiSeNetV2 (Yu et al., 2020)   —         13.91           12.30           124.49    82.84     25.64      7.07       *          85.167
DANet (Fu et al., 2019)       ResNet18  …
SwiftNet (Oršić and Šegvić, 2021)  ResNet18  …
ABCNet                        ResNet18  …
* indicates the model runs out of memory.

The complexity and speed of the network
The complexity and speed are momentous factors in measuring the merit of an algorithm, which is especially true for practical applications. For a thorough comparison, we implement our experiments under different settings. First, the parameters and computational complexity of the different networks are reported in Table II, where 'G' indicates Giga (the unit of floating-point operations) and 'M' signifies Million (the unit of the parameter count). Meanwhile, for a fair comparison, we choose 256×256, 512×512, 1024×1024, 2048×2048, and 4096×4096 as the input resolutions and report the inference speed, measured in frames per second (FPS), on a mid-range notebook graphics card (1660 Ti). The proposed ABCNet juggles speed and accuracy simultaneously. As can be seen from the last column of Table II, the mIoU achieved by ABCNet on the Potsdam dataset is at least 1.79% higher than that of the comparative methods. Meanwhile, ABCNet maintains a speed of 72.13 FPS for a 512×512 input. Besides, the elaborate design enables ABCNet to handle massive inputs (4096×4096), while more than half of the comparative methods run out of memory for such a large input.

Results on the ISPRS Vaihingen dataset
The ISPRS Vaihingen is a relatively small dataset. Besides, there is a small covariate shift between its training and test sets (Ghassemi et al., 2019). Therefore, high performance can easily be achieved by specifically designed networks, especially those that fuse true orthophoto (TOP) images with the auxiliary DSM or NDSM. In this part, we show that our ABCNet, using only TOP images and an efficient architecture, not only transcends other lightweight networks but also achieves competitive performance against those specially designed models. As shown in Table III, the numeric scores on the ISPRS Vaihingen test set demonstrate that our ABCNet delivers robust performance and exceeds the other lightweight networks in mean F1, OA, and mIoU by a considerable margin. Significantly, the 'car' class in the Vaihingen dataset is difficult to handle as cars are relatively small objects. Nonetheless, our ABCNet acquires

TABLE III
QUANTITATIVE COMPARISON RESULTS ON THE VAIHINGEN TEST SET

Method  Backbone  Imp. surf.  Building  Low veg.  Tree  Car  Mean F1  OA (%)  mIoU (%)
DABNet (Li et al., 2019)           —         87.775  88.808  74.319  84.905  60.247  79.211  84.278  67.373
ERFNet (Romera et al., 2017)       —         88.451  90.239  76.394  85.751  53.649  78.897  85.751  67.698
BiSeNetV1 (Yu et al., 2018)        ResNet18  89.115  …
PSPNet (Zhao et al., 2017)         ResNet18  89.005  93.161  81.483  87.657  43.926  79.046  87.651  68.861
BiSeNetV2 (Yu et al., 2020)        —         89.884  91.911  82.020  88.271  71.417  84.701  87.972  75.005
DANet (Fu et al., 2019)            ResNet18  …
FANet (Hu et al., 2020)            ResNet18  …
EaNet (Zheng et al., 2020)         ResNet18  …
ShelfNet (Zhuang et al., 2019)     ResNet18  …
MAResU-Net (Li et al., 2020a)      ResNet18  …
SwiftNet (Oršić and Šegvić, 2021)  ResNet18  …
ABCNet                             ResNet18  …

an 85.299% F1 score for the 'car' class, which is at least 4% higher than that of the other methods. To further evaluate statistical significance, we report the Kappa z-test for pairs of methods, based on the Kappa coefficients of agreement and their variances:

$z = \frac{k_1 - k_2}{\sqrt{v_1 + v_2}},$ (20)

where $k$ signifies the Kappa coefficient and $v$ denotes the Kappa variance. Concretely, if the value of $z$ is greater than 1.96, the two algorithms are significantly different at the 95% confidence level.

TABLE IV
KAPPA
Z-TEST
COMPARING
THE
PERFORMANCE OF DIFFERENT
METHODS ON THE
VAIHINGEN
DATASET.
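As a minimal sketch of how these quantities relate, the reported accuracy metrics and the z-test of Eq. (20) can be computed from a confusion matrix as follows. The helper names and the toy confusion matrix are ours, and the Kappa variance uses the common large-sample approximation, since the paper does not specify its estimator:

```python
import math

def metrics_from_confusion(cm):
    """cm: square confusion matrix (rows = reference, cols = prediction).
    Returns OA, per-class F1, per-class IoU, Cohen's Kappa, and a
    large-sample approximation of the Kappa variance."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(n)]
    rows = [sum(cm[i]) for i in range(n)]                        # reference totals
    cols = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # prediction totals
    oa = sum(diag) / total                                       # overall accuracy
    f1 = [2 * diag[i] / (rows[i] + cols[i]) for i in range(n)]
    iou = [diag[i] / (rows[i] + cols[i] - diag[i]) for i in range(n)]
    pe = sum(rows[i] * cols[i] for i in range(n)) / total ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    kv = oa * (1 - oa) / (total * (1 - pe) ** 2)                 # approximate variance
    return oa, f1, iou, kappa, kv

def kappa_z(k1, v1, k2, v2):
    """Eq. (20): z = |k1 - k2| / sqrt(v1 + v2); z > 1.96 indicates a
    significant difference at the 95% confidence level."""
    return abs(k1 - k2) / math.sqrt(v1 + v2)
```

For instance, two classifiers with Kappa coefficients 0.88 and 0.80 and variances of 2e-4 each give z = 0.08 / √(4e-4) = 4.0 > 1.96, i.e., a statistically significant difference.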
Method Kappa KV ERFNet BiSeNetV1 PSPNet BiSeNetV2 DANet FANet EaNet ShelfNet MAResU-Net SwiftNet ABCNet
DABNet 0.798 2.808 6.04 19.84 21.73 22.53 26.86 27.89 33.03 34.38 35.68 35.94 39.06
ERFNet 0.812 2.643 - 13.80 15.70 16.50 20.84 21.86 27.02 28.37 29.67 29.93 33.06
BiSeNetV1 0.843 2.272 - - …
PSPNet 0.847 2.218 - - - 0.80 5.15 6.19 11.35 12.72 14.03 14.29 17.44
BiSeNetV2 0.849 2.198 - - - - 4.35 5.38 10.55 11.91 13.22 13.48 16.63
DANet 0.858 … - - - - - 1.04 6.21 7.57 8.88 9.14 12.30
FANet 0.860 2.057 - - - - - - 5.17 6.53 7.84 8.10 11.26
EaNet 0.870 1.918 - - - - - - - 1.36 2.68 2.94 6.10
ShelfNet 0.873 1.883 - - - - - - - - 1.31 1.57 4.73
MAResU-Net 0.875 1.850 - - - - - - - - - 0.26 3.42
SwiftNet 0.876 1.843 - - - - - - - - - - 3.16
ABCNet 0.882 1.762 - - - - - - - - - - -
As can be seen from Table Ⅳ, the accuracy of the proposed ABCNet is statistically higher than that of the other comparative methods. In addition, we visualize area 38 in Fig. 5 to qualitatively demonstrate the effectiveness of our ABCNet, while the enlarged results are shown in Fig. 7 (a). For a comprehensive evaluation, ABCNet is also compared with other state-of-the-art methods. As can be seen in Table Ⅴ, as a lightweight network, the proposed ABCNet achieves competitive performance even compared with those specially designed models with complex structures. It is worth noting that the speed of our ABCNet is two to seven times faster than those methods.
TABLE Ⅴ QUANTITATIVE COMPARISON RESULTS ON THE VAIHINGEN TEST SET WITH STATE-OF-THE-ART METHODS.
Method Backbone Imp. surf. Building Low veg. Tree Car Mean F1 OA (%) mIoU (%) Speed
DeepLabV3+ (Chen et al., 2018a) ResNet101 92.38 …
… ResNet101 92.79 95.46 84.51 89.94 …
… ResNet101 91.63 95.02 83.25 88.87 87.16 89.19 90.44 81.32 21.97
EaNet (Zheng et al., 2020) ResNet101 …
… ResNet50 - …
… ResegNets - - …
CASIA2 (Liu et al., 2018) ResNet101 - - …
V-FuseNet - - …
DLR_9 - - …
ABCNet ResNet18 …
- means the results are not reported in the original paper.
Fig.5 Mapping results for test images of Vaihingen tile-38.
Results on the ISPRS Potsdam dataset
We carry out experiments on the ISPRS Potsdam dataset to further evaluate the performance of ABCNet. Numerical comparisons with other lightweight methods are shown in Table Ⅵ, while the Kappa z-test is illustrated in Table Ⅶ. Remarkably, ABCNet achieves 91.095% overall accuracy and 88.561% mIoU, and the Kappa z-test clearly demonstrates its superiority over the other lightweight networks. The visualization of area 3_13 is displayed in Fig. 6, and the enlarged results are exhibited in Fig. 7 (b). As the Potsdam dataset contains sufficient images to train the network, ABCNet achieves performance on par with the state-of-the-art methods at a much faster speed. The comparisons are illustrated in Table Ⅷ.
CONCLUSIONS
In this paper, we propose a novel lightweight framework for efficient semantic segmentation in the field of remote sensing, namely the Attentive Bilateral Contextual Network (ABCNet), which adaptively captures abundant spatial details via the spatial path and global contextual information via
TABLE Ⅵ QUANTITATIVE COMPARISON RESULTS ON THE POTSDAM TEST SET.
Method Backbone Imp. surf. Building Low veg. Tree Car Mean F1 OA (%) mIoU (%)
ERFNet (Romera et al., 2017) - 88.675 92.991 81.100 75.843 90.534 85.829 84.492 79.152
DABNet (Li et al., 2019) - 89.939 93.188 83.596 82.257 92.578 88.312 86.664 82.144
PSPNet (Zhao et al., 2017) ResNet18 89.116 … … … … … … …
BiSeNetV1 (Yu et al., 2018) ResNet18 90.241 94.554 85.527 86.195 92.684 89.840 88.163 84.537
BiSeNetV2 (Yu et al., 2020) - 91.280 94.316 85.048 85.192 94.112 89.990 88.174 85.167
EaNet (Zheng et al., 2020) ResNet18 … … … … … … … …
MAResU-Net (Li et al., 2020a) ResNet18 … … … … … … … …
DANet (Fu et al., 2019) ResNet18 … … … … … … … …
SwiftNet (Oršić and Šegvić, 2021) ResNet18 … … … … … … … …
FANet (Hu et al., 2020) ResNet18 … … … … … … … …
ShelfNet (Zhuang et al., 2019) ResNet18 … … … … … … … …
ABCNet ResNet18 … … … … … … … …
ResNet18 the contextual path. In particular, we design an attention enhancement module to model long-range dependencies from extracted feature maps. Additionally, to address the feature fusion issue and improve the effectiveness, a feature aggregation module is presented to adequately merge the detailed features captured by the spatial path and semantic features generated by the contextual path. Extensive experiments on ISPRS Vaihingen and Potsdam datasets demonstrate the effectiveness and efficiency of the proposed ABCNet. TABLE Ⅶ KAPPA
Z-TEST
COMPARING
THE
PERFORMANCE OF DIFFERENT
METHODS ON THE
POTSDAM
DATASET.
Method Kappa KV DABNet PSPNet BiSeNetV1 BiSeNetV2 EaNet DANet MAResU-Net SwiftNet FANet ShelfNet ABCNet
ERFNet 0.837 4.344 9.06 11.25 17.17 17.51 19.64 20.18 21.01 22.27 23.84 24.32 29.66
DABNet 0.863 3.712 - 2.19 8.14 8.50 10.64 11.19 12.02 13.29 14.88 15.37 20.77
PSPNet 0.869 3.563 - - …
BiSeNetV1 0.884 3.187 - - - 0.37 2.50 3.06 3.90 5.16 6.76 7.26 12.69
BiSeNetV2 0.885 3.182 - - - - 2.13 2.68 3.52 4.78 6.38 6.88 12.30
EaNet 0.890 … - - - - - 0.56 1.40 2.66 4.26 4.77 10.20
DANet 0.892 3.006 - - - - - - 0.84 2.10 3.70 4.21 9.64
MAResU-Net 0.894 2.959 - - - - - - - 1.26 2.86 3.37 8.80
SwiftNet 0.897 2.870 - - - - - - - - 1.60 2.11 7.54
FANet 0.901 2.780 - - - - - - - - - 0.51 5.94
ShelfNet 0.902 2.757 - - - - - - - - - - 5.43
ABCNet 0.914 2.425 - - - - - - - - - - -
TABLE Ⅷ QUANTITATIVE COMPARISON RESULTS ON THE POTSDAM TEST SET WITH STATE-OF-THE-ART METHODS.
Method Backbone Imp. surf. Building Low veg. Tree Car Mean F1 OA (%) mIoU (%) Speed
DeepLabV3+ (Chen et al., 2018a) ResNet101 92.95 …
… ResNet101 93.36 96.97 87.75 88.50 95.42 …
… ResNet50 92.90 96.90 87.70 89.40 94.90 92.30 90.80 - 37.28
CCNet (Huang et al., 2020) ResNet101 - - - …
SWJ_2 ResNet101 - - …
HUSTW4 (Sun et al., 2019) ResegNets - - …
V-FuseNet …
DST_5 …
ABCNet ResNet18 …
- means the results are not reported in the original paper.
Fig.6 Mapping results for test images of Potsdam tile-3_13.
Fig.7 Enlarged visualization of results on (LEFT) the Vaihingen dataset and (RIGHT) Potsdam dataset.
REFERENCES
Audebert, N., Le Saux, B., Lefèvre, S., 2018. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS Journal of Photogrammetry and Remote Sensing 140, 20-32. Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39, 2481-2495. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2014. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062. Chen, L.-C., Papandreou, G., Schroff, F., Adam, H., 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018a. Encoder-decoder with atrous separable convolution for semantic image segmentation, Proceedings of the European conference on computer vision (ECCV), pp. 801-818. Chen, Y., Kalantidis, Y., Li, J., Yan, S., Feng, J., 2018b. A^ 2-nets: Double attention networks, Advances in neural information processing systems, pp. 352-361. Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions, Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251-1258. Duan, C., Li, R., 2020. Multi-Head Linear Attention Generative Adversarial Network for Thin Cloud Removal. arXiv preprint arXiv:2012.10898. Duan, C., Pan, J., Li, R., 2020. Thick Cloud Removal of Remote Sensing Images Using Temporal Smoothness and Sparsity Regularized Tensor Optimization. Remote Sensing 12, 3446. Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H., 2019. Dual attention network for scene segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146-3154. Ghassemi, S., Fiandrotti, A., Francini, G., Magli, E., 2019. 
Learning and adapting robust features for satellite image segmentation on heterogeneous data sets. IEEE Transactions on Geoscience and Remote Sensing 57, 6517-6529. Glorot, X., Bordes, A., Bengio, Y., 2011. Deep sparse rectifier neural networks, Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, pp. 315-323. Gong, P., Marceau, D.J., Howarth, P.J., 1992. A comparison of spatial feature extraction algorithms for land-use classification with SPOT HRV data. Remote sensing of environment 40, 137-151. Griffiths, P., Nendel, C., Hostert, P., 2019. Intra-annual reflectance composites from Sentinel-2 and Landsat for national-scale crop and land cover mapping. Remote sensing of environment 220, 135-151. He, K., Zhang, X., Ren, S., Sun, J., 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37, 1904-1916. He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. Hu, P., Perazzi, F., Heilbron, F.C., Wang, O., Lin, Z., Saenko, K., Sclaroff, S., 2020. Real-time semantic segmentation with fast attention. IEEE Robotics and Automation Letters 6, 263-270. Huang, Z., Wang, X., Wei, Y., Huang, L., Shi, H., Liu, W., Huang, T.S., 2020. CCNet: Criss-Cross Attention for Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, International conference on machine learning. PMLR, pp. 448-456. Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F., 2020. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv preprint arXiv:2006.16236. Kemker, R., Salvaggio, C., Kanan, C., 2018. 
Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS journal of photogrammetry and remote sensing 145, 60-77. Li, G., Yun, I., Kim, J., Kim, J., 2019. Dabnet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. arXiv preprint arXiv:1907.11357. Li, K., Cheng, G., Bu, S., You, X., 2017. Rotation-insensitive and context-augmented object detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 56, 2337-2348. Li, R., Su, J., Duan, C., Zheng, S., 2020a. Multistage Attention ResU-Net for Semantic Segmentation of Fine-Resolution Remote Sensing Images. arXiv preprint arXiv:2011.14302. Li, R., Zheng, S., Duan, C., Su, J., 2020b. Multi-Attention-Network for Semantic Segmentation of High-Resolution Remote Sensing Images. arXiv preprint arXiv:2009.02130. Li, R., Zheng, S., Duan, C., Yang, Y., Wang, X., 2020c. Classification of Hyperspectral Image Based on Double-Branch Dual-Attention Mechanism Network. Remote Sensing 12, 582. Liu, Q., Kampffmeyer, M., Jenssen, R., Salberg, A.-B., 2020. Dense dilated convolutions’ merging network for land cover classification. IEEE Transactions on Geoscience and Remote Sensing 58, 6309-6320. Liu, Y., Fan, B., Wang, L., Bai, J., Xiang, S., Pan, C., 2018. Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS journal of photogrammetry and remote sensing 145, 78-95. Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431-3440. Lyons, M.B., Keith, D.A., Phinn, S.R., Mason, T.J., Elith, J., 2018. A comparison of resampling methods for remote sensing classification and accuracy assessment. Remote Sensing of Environment 208, 145-153. Ma, L., Li, M., Ma, X., Cheng, L., Du, P., Liu, Y., 2017. A review of supervised object-based land-cover image classification. 
ISPRS Journal of Photogrammetry and Remote Sensing 130, 277-293. Maggiori, E., Tarabalka, Y., Charpiat, G., Alliez, P., 2016. Convolutional neural networks for large-scale remote-sensing image classification. IEEE Transactions on Geoscience and Remote Sensing 55, 645-657. Marmanis, D., Schindler, K., Wegner, J.D., Galliani, S., Datcu, M., Stilla, U., 2018. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS Journal of Photogrammetry and Remote Sensing 135, 158-172.
Oršić, M., Šegvić, S., 2021. Efficient semantic segmentation with pyramidal fusion. Pattern Recognition. Samie, A., et al., 2020. Impacts of future land use/land cover changes on climate in Punjab province, Pakistan: implications for environmental sustainability and economic growth. Environmental Science and Pollution Research 27, 25415-25433. Sherrah, J., 2016. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. arXiv preprint arXiv:1606.02585. Sun, Y., Tian, Y., Xu, Y., 2019. Problems of encoder-decoder frameworks for high-resolution remote sensing image segmentation: Structural stereotype and insufficient learning. Neurocomputing 330, 297-304. Tong, X.-Y., Xia, G.-S., Lu, Q., Shen, H., Li, S., You, S., Zhang, L., 2020. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sensing of Environment 237, 111322. Tucker, C.J., 1979. Red and photographic infrared linear combinations for monitoring vegetation. Remote sensing of Environment 8, 127-150. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017.
Attention is all you need, Advances in neural information processing systems, pp. 5998-6008. Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G., 2018a. Understanding convolution for semantic segmentation, 2018 IEEE winter conference on applications of computer vision (WACV). IEEE, pp. 1451-1460. Wang, X., Girshick, R., Gupta, A., He, K., 2018b. Non-local neural networks, Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794-7803. Xia, G.-S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., Zhang, L., 2018. DOTA: A large-scale dataset for object detection in aerial images, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3974-3983. Yin, H., Pflugmacher, D., Li, A., Li, Z., Hostert, P., 2018. Land use and land cover change in Inner Mongolia-understanding the effects of China's re-vegetation programs. Remote Sensing of Environment 204, 918-930. Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., Sang, N., 2020. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. arXiv preprint arXiv:2004.02147. Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N., 2018. Bisenet: Bilateral segmentation network for real-time semantic segmentation, Proceedings of the European conference on computer vision (ECCV), pp. 325-341. Yu, F., Koltun, V., 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Yuan, Y., Wang, J., 2018. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916. Zhang, C., Harrison, P.A., Pan, X., Li, H., Sargent, I., Atkinson, P.M., 2020. Scale Sequence Joint Deep Learning (SS-JDL) for land use and land cover classification. Remote Sensing of Environment 237, 111593. Zhang, C., Sargent, I., Pan, X., Li, H., Gardiner, A., Hare, J., Atkinson, P.M., 2019a. Joint Deep Learning for land cover and land use classification. 
Remote sensing of environment 221, 173-187. Zhang, H., Zhang, H., Wang, C., Xie, J., 2019b. Co-occurrent features in semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 548-557. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017. Pyramid scene parsing network, Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881-2890. Zhao, H., Zhang, Y., Liu, S., Shi, J., Change Loy, C., Lin, D., Jia, J., 2018. Psanet: Point-wise spatial attention network for scene parsing, Proceedings of the European Conference on Computer Vision (ECCV), pp. 267-283. Zheng, X., Huan, L., Xia, G.-S., Gong, J., 2020. Parsing very high resolution urban scene images by learning deep ConvNets with edge-aware loss. ISPRS Journal of Photogrammetry and Remote Sensing 170, 15-28. Zhong, Y., Zhao, J., Zhang, L., 2014. A hybrid object-oriented conditional random field classification framework for high spatial resolution remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 52, 7023-7037. Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F., 2017. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5, 8-36. Zhuang, J., Yang, J., Gu, L., Dvornek, N., 2019. Shelfnet for fast semantic segmentation, Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0-0.