Bidirectional Multi-scale Attention Networks for Semantic Segmentation of Oblique UAV Imagery
Ye Lyu a, George Vosselman a, Gui-Song Xia b, Michael Ying Yang a
a University of Twente, The Netherlands
b Wuhan University, China
KEY WORDS:
Semantic Segmentation, Multi-Scale, Attention, Oblique View, UAV, Deep Learning
ABSTRACT:
Semantic segmentation for aerial platforms has been one of the fundamental scene understanding tasks for earth observation. Most of the semantic segmentation research has focused on scenes captured in nadir view, in which objects have relatively smaller scale variation compared with scenes captured in oblique view. The huge scale variation of objects in oblique images limits the performance of deep neural networks (DNN) that process images in a single scale fashion. In order to tackle the scale variation issue, in this paper, we propose the novel bidirectional multi-scale attention networks, which fuse features from multiple scales bidirectionally for more adaptive and effective feature extraction. The experiments are conducted on the UAVid2020 dataset and have shown the effectiveness of our method. Our model achieved the state-of-the-art (SOTA) result with a mean intersection over union (mIoU) score of 70.80%.
1. INTRODUCTION
Semantic segmentation has been one of the most fundamental research tasks for scene understanding. It is to assign each pixel within an image the class label it belongs to. There have been many works on semantic segmentation of remote sensing images and aerial images (Demir et al., 2018, Rottensteiner et al., 2014), which are captured in nadir view style. The spatial resolutions in such images are approximately the same for all pixels. Oblique views have a much larger land coverage if the platforms are at the same flight height. For example, the unmanned aerial vehicle (UAV) platform has been used for urban scene observation (Lyu et al., 2020, Nigam et al., 2018). Images of the two viewing directions are shown in Figure 1. The left image in nadir view is from the Vaihingen dataset (Rottensteiner et al., 2014), while the right image in oblique view is from the UAVid2020 dataset (Lyu et al., 2020). Compared with images in nadir view style, images in oblique view have very large spatial resolution variation across the entire image.

Figure 1. Example of images in different viewing styles. The left image from the Vaihingen dataset (Rottensteiner et al., 2014) is captured in nadir view. The right image from the UAVid2020 dataset (Lyu et al., 2020) is captured in oblique view.

The state-of-the-art methods for semantic segmentation all rely on powerful deep neural networks, which can effectively extract high-level semantic information to determine the class types for all pixels. Deep neural networks serve as non-linear functions, which map an image input to a label output. Due to this non-linear property, the label output will not scale linearly as the image input scales. When designing deep neural networks, there is usually a performance trade-off for objects in different scales. For example, the semantic segmentation of a small car in a remote sensing image is better handled in higher resolution, where finer details such as wheels can be observed. For larger objects like roads and buildings, it is better to have more global context to recognize the objects, since their whole shapes can be observed for semantic segmentation.

When objects in an image dataset have very large scale variation, the semantic segmentation performance of deep neural networks will drop if this multi-scale problem is not considered in the network design. A simple strategy is to apply multi-scale inference (Zhao et al., 2017), i.e., a well-trained deep neural network predicts the score maps of the same image in multiple different scales, and the score maps are averaged to determine the final label prediction. Such a strategy generally provides better performance. However, a good prediction from a proper scale could be undermined by worse predictions from other scales, which limits the model performance. Max-pooling selects one score map prediction out of multiple scales for each pixel, but the optimal output could be an interpolation of the predictions of multiple scales. A smarter way of fusing the output score maps is to leverage an attention model (Chen et al., 2016), which determines the weights when fusing the score maps of different scales. This strategy has been extended to a hierarchical structure for better performance (Tao et al., 2020).

With respect to the design of deep neural networks, there are several strategies to relieve the multi-scale problem. The first strategy is to gradually refine features from coarse scales to fine scales (Long et al., 2015, Ronneberger et al., 2015, Lyu et al., 2020). The second strategy is to design a multi-scale feature extractor module in the middle of the deep neural networks (Zhao et al., 2017, Chen et al., 2017, Chen et al., 2018, Yuan and Wang, 2018). Self-attention (Fu et al., 2019, Huang et al., 2019, Yuan et al., 2020) and graph networks (Liang et al., 2018, Li and Gupta, 2018) have also been applied to aggregate information globally to reinforce the features for each pixel.

In this paper, we propose the bidirectional multi-scale attention networks (BiMSANet) to address the multi-scale problem in the semantic segmentation task. Our method is inspired by the multi-scale attention strategy (Chen et al., 2016, Tao et al., 2020) and the feature level fusion strategy (Chen et al., 2017, Zhao et al., 2017), and jointly fuses the features guided by the attention of different scales in bidirectional pathways, i.e., coarse-to-fine and fine-to-coarse. Our method is tested on the new UAVid2020 dataset (Lyu et al., 2020). One of its challenges is the huge inter-class and intra-class scale variance for different objects due to its oblique viewing style. Our method achieves a new state-of-the-art result with a mIoU score of 70.80%. Compared with the currently top ranked method (Tao et al., 2020), which focuses on handling the multi-scale problem, our method outperforms it by almost 0.8% mIoU.

The contributions of this paper are summarized as follows:
• We have proposed the novel bidirectional multi-scale attention networks (BiMSANet) to handle the multi-scale problem for the semantic segmentation task.
• We have visualized the bidirectional multi-scale attentions from multiple perspectives and analyzed them in detail.
• We have achieved the state-of-the-art result on the UAVid2020 benchmark, and the code will be made public.
2. RELATED WORK
In this section, we discuss other works that are related to our paper. In order to handle the multi-scale problem for semantic segmentation, a number of deep neural networks have been designed.
Multi-scale feature fusion.
The first basic type of method is to aggregate features of multiple scales from deep neural networks. FCN (Long et al., 2015) and U-Net (Ronneberger et al., 2015) have adopted skip connections between encoder and decoder to gradually fuse the information from multiple scales. MSDNet (Lyu et al., 2020) has extended the connections across scales to further increase the performance. ZigZagNet (Lin et al., 2019) uses a more complex zigzag architecture between the backbone and the decoder for intermediate multi-scale feature aggregation. HRNet (Wang et al., 2019) proposes a multi-scale backbone to exchange information between branches of coarse scale and fine scale. BiSeNet (Yu et al., 2018) proposes a dual branch structure for better performance, with one branch for higher spatial resolution and the other for richer semantic features.
Multi-scale context extraction.
Another method is to aggregate multi-scale context from the same feature maps with a dedicated module. PSPNet (Zhao et al., 2017) has adopted the pyramid pooling module, which has pooling modules of multiple scales to pool context features for object recognition. DeepLabv3 (Chen et al., 2017, Chen et al., 2018) has utilized the atrous spatial pyramid pooling module, which assembles multi-scale features with convolutions of multiple atrous rates. OCNet (Yuan and Wang, 2018) proposes the pyramid object context (Pyramid-OC) module and the atrous spatial pyramid object context (ASP-OC) module to extract object context in multiple scales.
Context by relations.
With the creation of the self-attention mechanism (Vaswani et al., 2017) for natural language processing, better semantic segmentation results have also been achieved when self-attention is applied to reason about the relations between pixels. Self-attention refines the features in a non-local style, which aggregates information for each pixel globally. DANet (Fu et al., 2019) has utilized a dual attention module, with position attention and channel attention, to extract information globally. CCNet (Huang et al., 2019) has applied the criss-cross attention module to reduce the computational complexity of the self-attention. OCRNet (Yuan et al., 2020) has used explicit class attention to reinforce the features. However, these types of methods are normally intensive in memory and computation as there are too many pixels, resulting in very dense connections between them. Graph reasoning is another way to include relations among objects. Instead of adopting dense pixel relations, the sparse graph structure makes the context relation reasoning less intensive in memory and computation. (Liang et al., 2018) propose the symbolic graph reasoning (SGR) layer for context information aggregation through a knowledge graph. (Li and Gupta, 2018) transform a 2D image into a graph structure, whose vertices are clusters of pixels. Context information is propagated across all vertices on the graph.
Inference in multi-scale.
Multi-scale inference is widely used to provide more robust predictions, and it is orthogonal to the previously discussed methods, as those networks can be regarded as a trunk for multi-scale inference. Average pooling and max pooling on score maps are mostly used, but they limit the performance. (Chen et al., 2016) propose to apply attention for fusing score maps across multiple scales. The method is more adaptive to objects in different scales as the weights for fusing score maps across multiple scales can vary. (Tao et al., 2020) further improve the multi-scale attention method by introducing a hierarchical structure, which allows different network structures during training and testing to improve the model design. Our paper also focuses on multi-scale inference. We have further improved the multi-scale attention mechanism by introducing feature level bidirectional fusion.
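To make the discussion concrete, a minimal PyTorch sketch of the two inference strategies is given below: plain averaging of per-scale score maps versus attention-weighted fusion of two adjacent scales. The function names, the scale set, and the shape conventions are illustrative assumptions rather than the exact formulation of the cited works.

import torch
import torch.nn.functional as F

def average_fusion(model, image, scales=(0.5, 1.0, 2.0)):
    """Plain multi-scale inference: average the per-scale score maps."""
    h, w = image.shape[-2:]
    logits_sum = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear',
                               align_corners=False)
        logits = model(scaled)                      # (B, n_class, s*h, s*w) assumed
        logits_sum = logits_sum + F.interpolate(
            logits, size=(h, w), mode='bilinear', align_corners=False)
    return logits_sum / len(scales)

def attention_fusion(logits_fine, logits_coarse, attn):
    """Attention-weighted fusion of two adjacent scales (in the spirit of
    Chen et al., 2016); `attn` is a per-pixel weight in [0, 1] predicted
    by the network."""
    logits_coarse = F.interpolate(logits_coarse, size=logits_fine.shape[-2:],
                                  mode='bilinear', align_corners=False)
    return attn * logits_fine + (1.0 - attn) * logits_coarse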
3. PRELIMINARY
In this section, we first go through some network architecture designs to help better understand the newly proposed bidirectional multi-scale attention networks.
The multi-scale-dilation net (Lyu et al., 2020) was proposed as the first attempt to tackle the multi-scale problem for the UAVid dataset. The basic idea shares the philosophy of multi-scale image inputs, where the input images are rescaled by a scale-to-batch operation and mapped back by a batch-to-scale operation, as sketched in the code below. The intermediate features are concatenated from coarse to fine scales and used to produce the final semantic segmentation output. The structure is shown in Figure 2. The feature extraction part is named the trunk in the following figures.
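The scale-to-batch and batch-to-scale operations can be pictured with the hedged sketch below, which tiles an up-scaled image into patches of the original size and later reassembles the per-tile features; the tiling layout and the function names are our own illustration of the idea, not the released MSDNet code.

import torch
import torch.nn.functional as F

def scale_to_batch(image, scale):
    """Upscale `image` by an integer `scale` and split it into scale*scale
    tiles of the original size, stacked along the batch dimension."""
    b, c, h, w = image.shape
    up = F.interpolate(image, scale_factor=scale, mode='bilinear',
                       align_corners=False)
    tiles = up.unfold(2, h, h).unfold(3, w, w)           # (b, c, s, s, h, w)
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, h, w)

def batch_to_scale(tiles, scale, out_size):
    """Reassemble the tiled feature maps into one map covering the upscaled
    image, then resize to `out_size` so features of all scales align."""
    n, c, h, w = tiles.shape
    b = n // (scale * scale)
    grid = tiles.reshape(b, scale, scale, c, h, w).permute(0, 3, 1, 4, 2, 5)
    merged = grid.reshape(b, c, scale * h, scale * w)
    return F.interpolate(merged, size=out_size, mode='bilinear',
                         align_corners=False)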
Figure 2. Architecture of the multi-scale dilation net. Features are aggregated from coarse to fine scales with concatenation.

The hierarchical multi-scale attention net (Tao et al., 2020) is proposed to learn to fuse semantic segmentation outputs of adjacent scales by a hierarchical attention mechanism. The deep neural networks learn to segment the images while predicting the weighting masks for fusing the score maps. This method ranks as the top method in the Cityscapes pixel-level semantic labeling task (Cordts et al., 2016), which focuses on the multi-scale problem. The hierarchical mechanism allows different network structures during training and inference, e.g., the networks have only two branches of two adjacent scales during training, while the networks could have three branches of three adjacent scales during testing, as shown in Figure 3.
Figure 3. Architecture of the hierarchical multi-scale attention networks. In addition to the predicted score maps, extra weighting masks are predicted from the attention sub-networks for fusing the score maps of adjacent scales. ⊕ and ⊗ stand for element-wise addition and multiplication, respectively.

One limitation of the hierarchical multi-scale attention networks is that the fused score maps are a linear interpolation of the score maps of adjacent scales, whereas better score maps could be acquired from the interpolated features instead. A simple solution that we propose is to move the segmentation head to the end of the fused features, as shown in Figure 4.
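The difference between score-level and feature-level fusion can be summarized with the following minimal sketch, assuming placeholder trunk, seg_head, and attn_head modules; the two-scale setting only schematically follows Figures 3 and 4.

import torch.nn.functional as F

def hierarchical_score_fusion(trunk, seg_head, attn_head, img_fine, img_coarse):
    """Score-level fusion: the attention alpha blends the score maps that
    are predicted independently at the two adjacent scales."""
    feat_f, feat_c = trunk(img_fine), trunk(img_coarse)
    seg_f, seg_c = seg_head(feat_f), seg_head(feat_c)
    alpha = attn_head(feat_f)                            # per-pixel weight in [0, 1]
    seg_c = F.interpolate(seg_c, size=seg_f.shape[-2:], mode='bilinear',
                          align_corners=False)
    return alpha * seg_f + (1 - alpha) * seg_c           # fused score maps

def hierarchical_feature_fusion(trunk, seg_head, attn_head, img_fine, img_coarse):
    """Feature-level fusion: the features are blended first, and a single
    segmentation head is applied to the fused features."""
    feat_f, feat_c = trunk(img_fine), trunk(img_coarse)
    alpha = attn_head(feat_f)
    feat_c = F.interpolate(feat_c, size=feat_f.shape[-2:], mode='bilinear',
                           align_corners=False)
    fused = alpha * feat_f + (1 - alpha) * feat_c
    return seg_head(fused)                               # score maps from fused features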
4. BIDIRECTIONAL MULTI-SCALE ATTENTION NETWORKS
In this section, the structure of the proposed bidirectional multi-scale attention networks will be introduced.
Our design also takes the hierarchical attention mechanism and the feature level fusion into account. The overall architecture is shown in Figure 5. For the input image I of size H × W, the image pyramid is built by adding two extra images, I_2× and I_0.5×, which are acquired by bi-linearly up-sampling I to a size of 2H × 2W and bi-linearly down-sampling I to 0.5H × 0.5W.

Figure 4. Architecture of the hierarchical multi-scale attention networks with feature level fusion. The segmentation head is moved to the end of the fused features. ⊕ and ⊗ stand for element-wise addition and multiplication, respectively.

The bidirectional multi-scale attention networks have two pathways for feature fusion in a hierarchical manner. For each pathway, the structure is the same as the feature level hierarchical multi-scale attention net. The design of the two pathways allows feature fusion from both directions, so the fusion weights can be determined at a more suitable scale. The reason to use feature level fusion is that we need distinct features for the two pathways. If the score maps were used for fusion, Feat1 and Feat2 in the two pathways would be the same, which limits the representation power of the two pathways. The two pathways take advantage of their own attention branches and features: the Attn1 branch and Feat1 are for the coarse-to-fine pathway, while the Attn2 branch and Feat2 are for the fine-to-coarse pathway. Feat1 and Feat2 from the two pathways are fused hierarchically across scales, and the final feature is the concatenation of the features from the two pathways.

Feat1 and Feat2 are reduced to half the number of channels of the Feat in the feature level hierarchical multi-scale attention net. This setting provides a fair comparison between these two types of networks, since it leads to features with the same number of channels before the segmentation head.

Parameter sharing is also applied in the design. The three branches corresponding to the three scales share the same network parameters for Trunk, Attn1, and Attn2. Feat1 and Feat2 in the three branches are different, as they are the outputs of different image inputs through the same trunk.

In the following, we illustrate the details of each component we apply.
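Before going into the individual components, the bidirectional fusion itself can be pictured with the much-simplified two-scale sketch below; the exact three-scale hierarchical propagation follows Figure 5, and the blending form and argument names are our schematic reading of the figure rather than the released implementation.

import torch
import torch.nn.functional as F

def bidirectional_fusion(feat1_f, feat1_c, feat2_f, feat2_c,
                         alpha, beta, seg_head):
    """Schematic two-scale version of the bidirectional fusion.
    feat1_*: Feat1 at the fine / coarse scale (coarse-to-fine pathway)
    feat2_*: Feat2 at the fine / coarse scale (fine-to-coarse pathway)
    alpha, beta: per-pixel attention weights in [0, 1]."""
    size_f, size_c = feat1_f.shape[-2:], feat1_c.shape[-2:]
    resize = lambda x, s: F.interpolate(x, size=s, mode='bilinear',
                                        align_corners=False)
    # coarse-to-fine pathway: fuse at the fine resolution
    c2f = alpha * feat1_f + (1 - alpha) * resize(feat1_c, size_f)
    # fine-to-coarse pathway: fuse at the coarse resolution, then lift back
    f2c = beta * feat2_c + (1 - beta) * resize(feat2_f, size_c)
    f2c = resize(f2c, size_f)
    # the final feature is the channel concatenation of the two pathways
    return seg_head(torch.cat([c2f, f2c], dim=1))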
Trunk.
In order to effectively extract information from each single scale, we have adopted the deeplabv3+ (Chen et al., 2018) as the trunk. We apply the wide residual networks (Zagoruyko and Komodakis, 2016) as the backbone, namely the WRN-38, which has been pre-trained on the ImageNet dataset (Deng et al., 2009). The ASPP module in the deeplabv3+ has convolutions with multiple atrous rates. The features f_b from the deeplabv3+ are further refined with a sequence of modules as follows: Conv -> BN -> ReLU -> Conv -> BN -> ReLU -> Conv(2·nc) (the number in brackets is the number of output channels), which corresponds to the feature transformation in the Seg of the hierarchical multi-scale attention net before the final classification.
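A hedged sketch of this refinement is given below; the 3×3/1×1 kernel sizes and the hidden width of 256 channels are assumptions (the exact numbers are not reproduced here), while the final 2·nc output channels follow the Feat1/Feat2 split described with Figure 5.

import torch.nn as nn

class TrunkHead(nn.Module):
    """Refines the deeplabv3+ features f_b and splits them into Feat1 / Feat2.
    Kernel sizes and the hidden width (256) are illustrative assumptions."""
    def __init__(self, in_ch, nc, hidden=256):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2 * nc, 1),   # first nc channels: Feat1, second nc: Feat2
        )

    def forward(self, f_b):
        f = self.refine(f_b)
        feat1, feat2 = f.chunk(2, dim=1)
        return feat1, feat2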
Figure 5. Architecture of the bidirectional multi-scale attention networks. The structure is the combination of two feature level hierarchical multi-scale attention nets corresponding to the two pathways, which share the same trunks. The coarse to fine pathway and the fine to coarse pathway are marked with the yellow and the blue arrows, respectively. ⊕ and ⊗ stand for element-wise addition and multiplication, respectively; ⊙ stands for concatenation in the channel dimension.

The trunk T transforms an image input I into feature maps f, i.e., f = T(I). The channels of f are split into two halves of nc channels each: the first nc channels are for Feat1, while the second nc channels are for Feat2, where nc = n_class × d, n_class is the total number of classes for the semantic segmentation task, and d is the expansion rate for the channels.

Attention head.
The Attn1 and the Attn2 heads share the same structure, but with different parameters. The attention heads map the features f_b from the deeplabv3+ to the attention weights α and β (ranging from 0 to 1, with nc channels) for the two pathways. Each attention head is comprised of a sequence of modules as follows: Conv -> BN -> ReLU -> Conv -> BN -> ReLU -> Conv(nc) -> Sigmoid (the number in brackets is the number of output channels).

Segmentation head.
The segmentation head Seg converts the fused input feature maps f_fused into score maps l (8 channels for the UAVid2020 dataset), which correspond to the class probabilities for all pixels, i.e., l = Seg(f_fused). The segmentation head is simply a 1×1 convolution, Conv1×1(n_class). An argmax operation along the channel dimension outputs the final class labels for all pixels.

Auxiliary semantic head.
As in (Tao et al., 2020), we apply auxiliary semantic segmentation heads for each branch during training, each consisting of only a 1×1 convolution, Conv1×1(n_class).

As our model follows the hierarchical inference mechanism, it can be trained with only two scales while inferring with three scales (0.5×, 1×, 2×). Such a design makes it possible for our network to adopt a large trunk, such as deeplabv3+ with a WRN-38 backbone, for better performance. We use the RMI loss (Zhao et al., 2019) for the main semantic segmentation head and the cross entropy loss for the auxiliary semantic heads.
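The heads described above can be sketched as follows; the 3×3 kernels and the 256 hidden channels of the attention head are assumptions, while the sigmoid output with nc channels and the 1×1 segmentation and auxiliary heads follow the text.

import torch.nn as nn

def attention_head(in_ch, nc, hidden=256):
    """Maps the deeplabv3+ features f_b to per-pixel weights in [0, 1]
    with nc channels (used as alpha or beta, one head per pathway)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, hidden, 3, padding=1, bias=False),
        nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
        nn.Conv2d(hidden, hidden, 3, padding=1, bias=False),
        nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
        nn.Conv2d(hidden, nc, 1),
        nn.Sigmoid(),
    )

def segmentation_head(in_ch, n_class):
    """Main and auxiliary heads: a single 1x1 convolution to class scores;
    the predicted label is the argmax over the channel dimension."""
    return nn.Conv2d(in_ch, n_class, 1)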
5. EXPERIMENTS
In this section, we illustrate the implementation details of the experiments and compare the performance of different models on the UAVid2020 dataset.
Our experiments are conducted on the public UAVid2020 dataset (Lyu et al., 2020). The UAVid2020 dataset focuses on the complex urban scene semantic segmentation task for 8 classes. The images are captured in oblique views with large spatial resolution variation. The dataset consists of high quality 4K images, split into training, validation, and testing sets. The performance of different models is evaluated on the test set of the UAVid2020 benchmark (https://uavid.nl/). The performance for the semantic segmentation task is assessed with the standard mean intersection-over-union (mIoU) metric (Everingham et al., 2015).

All the models in the experiments are implemented with pytorch (Paszke et al., 2019) and trained on a single Tesla V100 GPU. Mixed precision and synchronous batch normalization are applied. Stochastic gradient descent with momentum and weight decay is used as the optimizer. The "polynomial" learning rate policy is adopted (Liu et al., 2015). The model is trained with random image selection, and random scaling and random cropping are applied to acquire the training image patches.

Methods     mIoU(%)  Clutter  Building  Road   Tree   Vegetation  Moving Car  Static Car  Human
MSDNet      56.97    57.04    79.82     73.98  74.44  55.86       62.89       32.07       19.69
DeepLabv3+  67.36    66.68    87.61     80.04  79.49  62.00       71.69       68.58       22.76
HMSANet     70.03    69.32    88.14     82.12  79.42  61.21       77.33       72.52       30.17
FHMSANet    70.33    69.36    87.95     82.69  80.06  62.66       76.88       72.90       30.12
BiMSANet    70.80    69.94    88.63     81.60  80.38  61.64       77.22       75.62       31.34

Table 1. Performance comparisons in the intersection-over-union (IoU) metric for different models. The top ranked scores are marked in colors: red for the 1st place, green for the 2nd place, and blue for the 3rd place.
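For reference, the "polynomial" learning rate policy mentioned in the implementation details follows the usual form lr = base_lr · (1 - iter / max_iter)^p (Liu et al., 2015); a minimal scheduler sketch with placeholder hyper-parameters is given below.

import torch

def poly_lr_scheduler(optimizer, max_iter, power):
    """Polynomial decay of the learning rate, as in ParseNet (Liu et al., 2015)."""
    return torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda it: (1.0 - it / max_iter) ** power)

# usage with placeholder values (not the paper's exact settings):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9,
#                             weight_decay=1e-4)
# scheduler = poly_lr_scheduler(optimizer, max_iter=100_000, power=0.9)
# for it in range(100_000):
#     ...  # train one iteration
#     scheduler.step()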
Figure 6. Qualitative comparisons of different models on the UAVid2020 test set. The example image is from the test set (seq30, 000400). The bottom left image shows the overlapped result of the BiMSANet output and the original image. Three example regions for comparisons are marked in red, orange, and white boxes.
Testing.
As the 4K images are too large to fit into the GPU, we apply cropping during testing as well. The image is partitioned into overlapped patches for evaluation as in (Lyu et al., 2020), and the average of the score maps is used as the final output in the overlapped regions, with the same overlap in both the horizontal and vertical directions.

In this section, we present the semantic segmentation results on the test set of the UAVid2020 dataset for the multi-scale-dilation net (MSDNet) (Lyu et al., 2020), deeplabv3+ (Chen et al., 2018), the hierarchical multi-scale attention net (HMSANet) (Tao et al., 2020), the feature level hierarchical multi-scale attention net (FHMSANet), and our proposed bidirectional multi-scale attention networks (BiMSANet). MSDNet is included as a reference; it uses an older trunk, FCN-8s (Long et al., 2015), in each scale. The major comparisons are among DeepLabv3+, HMSANet, FHMSANet, and BiMSANet.

The mIoU scores and the IoU scores for each individual class are shown in Table 1. Among all the compared models, the BiMSANet performs the best regarding the mIoU metric. Our BiMSANet has a more balanced prediction ability for both large and small objects. For the evaluation of each individual class, the BiMSANet ranks first for the classes of clutter, building, tree, static car, and human. The most distinct improvement is for the static car class, which is 2.72% higher than the second best score. With only the context information, our method achieves decent scores for both the moving car and the static car classes. For the human class, the scores of HMSANet, FHMSANet, and BiMSANet are all significantly higher than that of DeepLabv3+, which shows the superiority of the multi-scale attention mechanism in handling small objects. Thanks to the bidirectional multi-scale attention design, BiMSANet achieves the best performance for the human class.

Qualitative comparisons are shown in Figure 6. The example image is selected from the test set (seq30, 000400). As the ground truth label is reserved for benchmark evaluation, the overlapped output is shown instead in Figure 6. Three example regions are marked in red, orange, and white boxes. In the red box region, it can be seen that the deeplabv3+ struggles to give coherent predictions for the cars in the middle of the road, while the other three models have better results due to the multi-scale attention. The HMSANet and the FHMSANet wrongly classify part of the sidewalks, which is outside the road, as the road class. BiMSANet handles this area better. However, part of the road near the lane-marks is wrongly classified as clutter by the BiMSANet. In the orange box region, the parking lot, which belongs to the clutter class, is predicted as road by all four models, and the BiMSANet makes the least error. In the white box region, the ground in front of the entrance door is wrongly classified as building by all models except the BiMSANet. This benefit comes from the bidirectional multi-scale attention design.

We have also shown the performance of human class segmentation in Figure 7. The example image is from the test set (seq22, 000900). The zoomed-in images in the middle and the right columns correspond to the patches in the white boxes of the overlapped output. The four patches come from different contexts, which are very complex in some local regions. Even though the humans in the image are quite small and in many different poses, such as standing, sitting, and riding, our model can still effectively detect and segment most of the humans in the image.
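The overlapped-patch testing described at the beginning of this section can be sketched as follows; the patch size, stride, and batch handling are placeholders, and the logits in overlapping regions are averaged as described.

import torch

@torch.no_grad()
def tiled_inference(model, image, patch=1024, stride=512, n_class=8):
    """Slide a window over a large image, accumulate logits and counts,
    and average the score maps in the overlapping regions.
    The model is assumed to return logits at the input resolution."""
    _, _, h, w = image.shape
    logits = image.new_zeros((1, n_class, h, w))
    counts = image.new_zeros((1, 1, h, w))
    ys = sorted(set(list(range(0, max(h - patch, 0) + 1, stride)) + [max(h - patch, 0)]))
    xs = sorted(set(list(range(0, max(w - patch, 0) + 1, stride)) + [max(w - patch, 0)]))
    for y in ys:
        for x in xs:
            y1, x1 = min(y + patch, h), min(x + patch, w)
            crop = image[:, :, y:y1, x:x1]
            logits[:, :, y:y1, x:x1] += model(crop)
            counts[:, :, y:y1, x:x1] += 1
    return logits / counts.clamp(min=1)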
Figure 7. Qualitative example of human class segmentation by the BiMSANet. The example image is from the test set (seq22, 000900). The left column shows the original full image and the overlapped output. The middle and the right columns show the image patches cropped from the overlapped output (marked by white boxes), which all focus on the human class. The red circles mark some missing segmentations.
Methods     mIoU(%)  mIoU Gains(%)  Trunk  Multi-Scale Attention  Feature Level Fusion  Bidirection
DeepLabv3+  67.36    -              ✓      -                      -                     -
HMSANet     70.03    +2.67          ✓      ✓                      -                     -
FHMSANet    70.33    +0.30          ✓      ✓                      ✓                     -
BiMSANet    70.80    +0.47          ✓      ✓                      ✓                     ✓

Table 2. Ablation study of the models. The performance gains can be observed by gradually adding components.
In this section, we compare the performance gains obtained by gradually adding the components. The corresponding results are shown in Table 2. It is easy to see that multi-scale processing is useful for the oblique view UAV images. The mIoU score increases by 2.67% by including the multi-scale attention in the networks. The feature level fusion also proves to be useful, as it helps the networks improve the mIoU score by 0.30%. By further adding the bidirectional attention mechanism, the networks improve the mIoU score by another 0.47%.

We further analyze the learned multi-scale attentions from the BiMSANet to better understand how the attentions work. We explore mainly three perspectives: attentions of different channels, of different scales, and of different directions. The example image is from the test set (seq25, 000400). Attentions from both the Attn1 branch and the Attn2 branch are used, noted as α and β in Figure 5. α is for the fine to coarse pathway, while β is for the coarse to fine pathway.

The multi-scale attentions in our BiMSANet have nc channels, which is different from the HMSANet (Tao et al., 2020), whose attention has only one single channel for all classes. The attentions guide the fusion of features across scales. Example attentions of different channels from the Attn1 branch are shown in Figure 8. Different channels have different attentions focusing on different parts of the image. It is obvious that different channels focus on different classes, e.g., some channels focus more on trees, some focus less on roads, and some focus most on moving cars.

Figure 8. Attention analysis of different channels. The image on the top left shows the image adopted. The other images are the attention maps from different channels; channel indices are presented below the images. Brighter color means higher value. Best visualized with zoom in.

In order to analyze the difference of attentions across scales, we select attentions from each of the Attn1 branch and the Attn2 branch, as shown in Figure 9. By comparing the α maps predicted in the 1× and 0.5× scales, we can see that attentions in different scales have different focus. The differences of the same channel between the two scales are the most worth comparing; the same applies for β.

From the α maps, it can be noted that the recognition of cars at closer distance is based more on context, since the attention values predicted in the coarser scale are larger than those in the finer scale. The recognition of the road that is closer to the camera also relies more on the coarser level features, which is reasonable, as the road area is large and requires more context for recognition. It is also interesting to note that the middle lane-marks are even brighter than other parts of the road in the coarse-scale attention, which means their recognition requires more context. This is reasonable, as the color and the texture of the lane-marks are quite different compared to other parts of the road. The distant buildings near the horizon rely more on the coarser level features as well. We have also noticed that the attentions predicted in the 0.5× and 2× scales have larger values on average compared with those in the 1× scale, which means that features with context information and features with fine details are both valuable for object recognition.

Figure 9. Attention analysis of different scales. We select attentions from each of the Attn1 branch and the Attn2 branch. α and β have the same meaning as in Figure 5, and their subscripts indicate the scale at which the attention is predicted. Brighter color means higher value. Best visualized with zoom in.

In our bidirectional design, both the coarse to fine pathway and the fine to coarse pathway fuse the features from three scales (0.5×, 1×, 2×). We analyze whether the feature fusion in the two pathways shows the same attention pattern. Attention examples are shown in Figure 10. The attentions α and 1 − β from the two pathways are both for the feature fusion across the 0.5× and 1× scales. Although the attention values of the same pixels cannot be directly compared, as the feature sources are different (Feat1 and Feat2), it is still evident that the attention densities on average are quite different. There is more activation in α than in 1 − β, showing that the two pathways play different roles for feature fusion across the same scales.

Figure 10. Attention analysis of different directions. The figure shows the attentions for fusing features of the 0.5× and 1× scales. α is for the fine to coarse pathway, while 1 − β is for the coarse to fine pathway. Brighter color means higher value. Best visualized with zoom in.
6. CONCLUSION
In this paper, we have proposed the bidirectional multi-scale attention networks (BiMSANet) for the semantic segmentation task. The hierarchical design adopted from (Tao et al., 2020) allows the usage of a larger trunk for better performance. The feature level fusion and the bidirectional design allow the model to more effectively fuse the features from both the adjacent coarser scale and the finer scale. We have conducted the experiments on the UAVid2020 dataset (Lyu et al., 2020), which has large variation in spatial resolution. The comparisons among different models have shown that our BiMSANet achieves better results by balancing the performance on small objects and large objects. Our BiMSANet achieves the state-of-the-art result with a mIoU score of 70.80% on the UAVid2020 benchmark.

REFERENCES
Chen, L.-C., Papandreou, G., Schroff, F., Adam, H., 2017. Rethinking Atrous Convolution for Semantic Image Segmentation. CoRR, abs/1706.05587.

Chen, L.-C., Yang, Y., Wang, J., Xu, W., Yuille, A. L., 2016. Attention to scale: Scale-aware semantic image segmentation. CVPR.

Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. ECCV.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B., 2016. The Cityscapes dataset for semantic urban scene understanding. CVPR.

Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D., Raska, R., 2018. DeepGlobe 2018: A challenge to parse the earth through satellite images. CVPRW.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L., 2009. ImageNet: A Large-Scale Hierarchical Image Database. CVPR.

Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., Zisserman, A., 2015. The Pascal Visual Object Classes Challenge: A Retrospective. IJCV, 111(1), 98-136.

Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H., 2019. Dual attention network for scene segmentation. CVPR.

Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W., 2019. CCNet: Criss-cross attention for semantic segmentation. ICCV.

Li, Y., Gupta, A., 2018. Beyond grids: Learning graph representations for visual recognition. NeurIPS.

Liang, X., Hu, Z., Zhang, H., Lin, L., Xing, E. P., 2018. Symbolic graph reasoning meets convolutions. NeurIPS.

Lin, D., Shen, D., Shen, S., Ji, Y., Lischinski, D., Cohen-Or, D., Huang, H., 2019. ZigZagNet: Fusing top-down and bottom-up context for object segmentation. CVPR.

Liu, W., Rabinovich, A., Berg, A. C., 2015. ParseNet: Looking Wider to See Better. CoRR, abs/1506.04579.

Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation. CVPR.

Lyu, Y., Vosselman, G., Xia, G.-S., Yilmaz, A., Yang, M. Y., 2020. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 165, 108-119.

Nigam, I., Huang, C., Ramanan, D., 2018. Ensemble knowledge transfer for semantic segmentation. WACV.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library. NeurIPS.

Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation. MICCAI.

Rottensteiner, F., Sohn, G., Gerke, M., Wegner, J., Breitkopf, U., Jung, J., 2014. Results of the ISPRS benchmark on urban object detection and 3D building reconstruction. ISPRS Journal of Photogrammetry and Remote Sensing, 93, 256-271.

Tao, A., Sapra, K., Catanzaro, B., 2020. Hierarchical Multi-Scale Attention for Semantic Segmentation. CoRR, abs/1910.12037.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. NeurIPS.

Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., Xiao, B., 2019. Deep High-Resolution Representation Learning for Visual Recognition. TPAMI.

Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N., 2018. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. ECCV.

Yuan, Y., Chen, X., Wang, J., 2020. Object-contextual representations for semantic segmentation. ECCV.

Yuan, Y., Wang, J., 2018. OCNet: Object context network for scene parsing. CoRR, abs/1809.00916.

Zagoruyko, S., Komodakis, N., 2016. Wide residual networks. BMVC.

Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017. Pyramid scene parsing network. CVPR.

Zhao, S., Wang, Y., Yang, Z., Cai, D., 2019. Region mutual information loss for semantic segmentation. NeurIPS.