Global-Local Propagation Network for RGB-D Semantic Segmentation
Sihan Chen, Xinxin Zhu, Wei Liu, Xingjian He, Jing Liu
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences
ABSTRACT
Depth information matters in the RGB-D semantic segmentation task because it provides additional geometric information to color images. Most existing methods exploit a multi-stage fusion strategy to propagate depth features to the RGB branch. However, at the very deep stage, propagation by simple element-wise addition cannot fully utilize the depth information. We propose the Global-Local Propagation Network (GLPNet) to solve this problem. Specifically, a local context fusion module (L-CFM) is introduced to dynamically align both modalities before element-wise fusion, and a global context fusion module (G-CFM) is introduced to propagate depth information to the RGB branch by jointly modeling the multi-modal global context features. Extensive experiments demonstrate the effectiveness and complementarity of the proposed fusion modules. Embedding the two fusion modules into a two-stream encoder-decoder structure, our GLPNet achieves new state-of-the-art performance on two challenging indoor scene segmentation datasets, i.e., the NYU-Depth v2 and SUN-RGBD datasets.
Index Terms — RGB-D, semantic segmentation
1. INTRODUCTION
With the rapid development of RGB-D sensors like Microsoft Kinect, people can access depth data more easily. Depth data naturally describe 3D geometric information and reflect the structure of objects in the scene; they can therefore serve as a complementary modality to RGB data, which capture rich color and texture information, to improve semantic segmentation results. However, how to fully utilize depth information and effectively fuse the two complementary modalities remains an open problem.

Early approaches [1] attempt to use a two-stream network to extract features from the RGB and depth modalities respectively and fuse them at the last layer to predict the final segmentation results. This kind of "late fusion" strategy fuses the two modalities too late, so the RGB branch cannot get the geometric guidance it needs at the early stages. Later, researchers tended to propagate features of the depth branch to the RGB branch in a multi-stage manner, i.e., adding depth features to the RGB branch at the end of every stage of the encoder network [2, 3]. This strategy leverages geometric cues earlier and more sufficiently, and has proved effective.

However, we argue that the feature propagation in conventional methods cannot fully utilize the depth information, for two reasons. Firstly, the consecutive use of convolution and down-sampling operations in the two-stream encoder means that the two modality features are no longer as well aligned with each other as in the early layers, so the geometric information provided by the depth features at the deep stage is not precise enough to assist the RGB features. Secondly, compared with the depth information, the RGB information reflects more semantic information, and this becomes more obvious at the higher levels of the network. Thus, element-wise addition is not an appropriate solution, and the RGB branch should receive more focus for the semantic prediction.

To address the above problems, we propose the Global-Local Propagation Network (GLPNet) to jointly utilize the complementary depth and RGB features, in which a local context fusion module (L-CFM) and a global context fusion module (G-CFM) are designed to solve the problems of spatial misalignment and semantic propagation in the feature fusion, respectively. Instead of directly adding depth features to the RGB branch, the L-CFM first dynamically aligns the features of both modalities before modality fusion. Specifically, the alignment simultaneously warps the feature maps of both modalities according to offsets predicted by a convolution layer, inspired by optical flow in the video processing field and by Semantic Flow [4]. Besides, the G-CFM is proposed to propagate depth features to the RGB branch through joint multi-modal context modeling. Concretely, we extract global context features from both modalities and aggregate them to every RGB pixel using an attention mechanism. Compared to the L-CFM, which precisely aligns the local features of the two modalities, the G-CFM aims at utilizing depth information from the view of global context. Given that the two proposed fusion modules help depth feature propagation from orthogonal perspectives (i.e., globally and locally), combining them in parallel further improves the propagation effectiveness at the deep stage.

The proposed GLPNet achieves new state-of-the-art performance on two challenging RGB-D semantic segmentation datasets, i.e., the NYU-Depth v2 and SUN-RGBD datasets.
[Figure 1 diagram: a two-stream encoder (Stages 1-4 for the RGB and depth branches, at 1/4, 1/8, 1/16, and 1/16 resolution) with element-wise addition at the early stages, the parallel L-CFM and G-CFM at the last stage, and an upsampling decoder. Insets detail the L-CFM (offset prediction and warping of both modalities) and the G-CFM (pooling masks, context features, and attention with a residual connection). Legend: W = warp operation, C = conv-bn-relu block, Up = bilinear upsampling.]
Fig. 1. An overview of our GLPNet. The fraction numbers describe the resolution ratio relative to the raw input image. We use the dilation strategy in the last stage and the overall stride is 16. Best viewed in color.
2. FRAMEWORK

2.1. Overview
The overall framework of our approach is depicted in Figure 1; it uses an encoder-decoder architecture. In the encoder part, we use a two-stream backbone network (e.g., ResNet-101) to extract features from both modalities separately, like previous methods [2, 3, 5, 6]. We take the multi-stage fusion strategy to propagate depth features to the RGB branch. Specifically, we propagate depth features by simple element-wise addition for the three early stages, and through the designed local and global fusion modules for the last stage. After applying the two fusion modules in parallel, we concatenate their outputs and add an additional convolution block to further process the fused feature, then feed it into the segment decoder to get the final prediction. The segment decoder takes an FPN [7]-like structure which gradually upsamples the feature map and merges the shallow stage features (i.e., stage-1 and stage-2) through skip connections; the channel dimensions of the different stage features are reduced to 256 before fusion. A sketch of this forward pass is given below.
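For concreteness, the multi-stage propagation just described can be sketched as follows. This is an illustrative PyTorch reconstruction under our reading of the paper, not the authors' released code; `lcfm`, `gcfm`, `fuse_conv`, and `decoder` are hypothetical placeholders (the fusion modules themselves are sketched in Sections 2.2-2.3).

```python
import torch
import torch.nn as nn

class GLPNet(nn.Module):
    """Sketch of the two-stream encoder with multi-stage depth-to-RGB
    propagation. All sub-modules are placeholders supplied by the caller."""

    def __init__(self, rgb_stages, depth_stages, lcfm, gcfm, fuse_conv, decoder):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)      # ResNet-101 stages 1-4 (RGB)
        self.depth_stages = nn.ModuleList(depth_stages)  # ResNet-101 stages 1-4 (depth)
        self.lcfm = lcfm                                 # local context fusion (Sec. 2.2)
        self.gcfm = gcfm                                 # global context fusion (Sec. 2.3)
        self.fuse_conv = fuse_conv                       # conv block after concatenation
        self.decoder = decoder                           # FPN-like segment decoder

    def forward(self, rgb, depth):
        skips = []
        for i in range(4):
            rgb = self.rgb_stages[i](rgb)
            depth = self.depth_stages[i](depth)
            if i < 3:
                rgb = rgb + depth            # element-wise propagation at stages 1-3
                if i < 2:
                    skips.append(rgb)        # stage-1/2 features for skip connections
            else:
                fused = torch.cat([self.lcfm(rgb, depth),
                                   self.gcfm(rgb, depth)], dim=1)
                rgb = self.fuse_conv(fused)  # parallel L-CFM / G-CFM at stage 4
        return self.decoder(rgb, skips)      # gradual upsampling + skip merging
```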
2.2. Local Context Fusion Module

We introduce the Local Context Fusion Module to dynamically adjust both modality features before their summation, to help depth feature propagation. As illustrated at the bottom left corner of Figure 1, we take both modality features of the last stage as the inputs to the L-CFM, denoted as $RGB_{in} \in \mathbb{R}^{C \times H \times W}$ and $D_{in} \in \mathbb{R}^{C \times H \times W}$, respectively. Intuitively, the dynamic alignment should be inferred from the spatial relations of both modalities, so we concatenate the two modality features along the channel dimension and then apply a convolution layer to predict the offset fields for each modality, denoted as $RGB_{offset} \in \mathbb{R}^{2 \times H \times W}$ and $D_{offset} \in \mathbb{R}^{2 \times H \times W}$, respectively. Then we use the warp operation to adjust the features of both modalities according to their predicted offset fields separately, and add the aligned depth feature to the aligned RGB feature to get the final output. More details about the warp operation can be found in the supplementary materials.

2.3. Global Context Fusion Module

Inspired by context research on semantic segmentation of the single RGB modality, we exploit multi-modal global context information to further help depth feature propagation. The details of the G-CFM are illustrated at the bottom right corner of Figure 1. Like the L-CFM, it takes the RGB and depth features of the last stage as input, i.e., $RGB_{in} \in \mathbb{R}^{C \times H \times W}$ and $D_{in} \in \mathbb{R}^{C \times H \times W}$. We apply two independent convolution layers and a softmax function along the spatial dimension to compute the pooling masks of both modalities, denoted as $RGB_{mask} \in \mathbb{R}^{K \times H \times W}$ and $D_{mask} \in \mathbb{R}^{K \times H \times W}$, respectively. $K$ is a hyperparameter which controls the number of global context feature vectors. We reshape the predicted $RGB_{mask}$ to $\mathbb{R}^{K \times HW}$ and $RGB_{in}$ to $\mathbb{R}^{HW \times C}$ and then perform a matrix multiplication to extract $K$ global context features $RGB_{cxt} \in \mathbb{R}^{K \times C}$. The same process is performed on the depth feature to compute $D_{cxt} \in \mathbb{R}^{K \times C}$, and then we concatenate the two groups of context features to generate the multi-modal context features $RGBD_{cxt} \in \mathbb{R}^{2K \times C}$.

After modeling the multi-modal global context features, we aggregate them back to the RGB feature using an attention mechanism. Specifically, we feed $RGB_{in}$ into a $1 \times 1$ convolution layer to generate query features $Q \in \mathbb{R}^{C' \times H \times W}$ (where $C'$ is a reduced fraction of $C$), and feed the multi-modal context features into two linear layers to generate the key features $\mathbf{K} \in \mathbb{R}^{C' \times 2K}$ and value features $V \in \mathbb{R}^{C \times 2K}$, respectively. Then we reshape $Q$ to $\mathbb{R}^{C' \times HW}$ and perform a matrix multiplication between the transpose of $Q$ and $\mathbf{K}$ to calculate the attention map $A \in \mathbb{R}^{HW \times 2K}$, applying a softmax function to normalize the contributions of the $2K$ multi-modal context features. Finally, we multiply $A$ by the transpose of $V$ to compute the attended feature, then reshape it and element-wisely add it to the original $RGB_{in}$ in a residual-connection manner, yielding $RGB_{out}$. A hedged code sketch of both modules is given below.
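The following is a minimal PyTorch sketch of both fusion modules, as we read the description above. It is an illustrative reconstruction rather than the released implementation: the 3x3 offset kernel, the pixel-to-grid flow scaling in `warp`, the offset channel ordering, and the reduced query/key dimension `c_mid` (the exact reduction ratio for $C'$ is garbled in the source) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LCFM(nn.Module):
    """Local Context Fusion Module (sketch): predict one 2-channel offset
    field per modality from the concatenated features, warp both modalities,
    then fuse the aligned features by element-wise addition."""

    def __init__(self, channels):
        super().__init__()
        # 4 output channels = (dx, dy) for RGB plus (dx, dy) for depth;
        # the 3x3 kernel is an assumption, the paper says only "a convolution layer"
        self.offset = nn.Conv2d(2 * channels, 4, kernel_size=3, padding=1)

    @staticmethod
    def warp(x, flow):
        """Bilinearly sample x at positions shifted by `flow` (in pixels)."""
        n, _, h, w = x.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h, device=x.device),
            torch.linspace(-1.0, 1.0, w, device=x.device), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        # convert pixel offsets to normalized [-1, 1] grid coordinates
        scale = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)],
                             device=x.device)
        return F.grid_sample(x, grid + flow.permute(0, 2, 3, 1) * scale,
                             align_corners=True)

    def forward(self, rgb, depth):
        off = self.offset(torch.cat([rgb, depth], dim=1))
        return self.warp(rgb, off[:, :2]) + self.warp(depth, off[:, 2:])


class GCFM(nn.Module):
    """Global Context Fusion Module (sketch): K spatially-softmaxed pooling
    masks per modality extract 2K context vectors, which are aggregated back
    to every RGB pixel with attention and a residual connection."""

    def __init__(self, channels, k=15, c_mid=None):
        super().__init__()
        c_mid = c_mid or channels // 2  # C' is garbled in the text; C/2 is our guess
        self.rgb_mask = nn.Conv2d(channels, k, kernel_size=1)
        self.d_mask = nn.Conv2d(channels, k, kernel_size=1)
        self.query = nn.Conv2d(channels, c_mid, kernel_size=1)
        self.key = nn.Linear(channels, c_mid)
        self.value = nn.Linear(channels, channels)

    @staticmethod
    def pool(feat, mask_logits):
        """Softmax the K masks over space, then pool K context vectors."""
        masks = F.softmax(mask_logits.flatten(2), dim=-1)       # N x K x HW
        return masks @ feat.flatten(2).transpose(1, 2)          # N x K x C

    def forward(self, rgb, depth):
        n, c, h, w = rgb.shape
        ctx = torch.cat([self.pool(rgb, self.rgb_mask(rgb)),
                         self.pool(depth, self.d_mask(depth))], dim=1)  # N x 2K x C
        q = self.query(rgb).flatten(2).transpose(1, 2)          # N x HW x C'
        attn = F.softmax(q @ self.key(ctx).transpose(1, 2), dim=-1)  # N x HW x 2K
        out = (attn @ self.value(ctx)).transpose(1, 2).reshape(n, c, h, w)
        return rgb + out                                        # residual connection
```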
3. EXPERIMENTS

3.1. Datasets and Implementation Details
To evaluate the proposed network, we conduct experiments on two RGB-D semantic segmentation datasets: NYU-Depth v2 [8] and SUN-RGBD [9]. The NYU-Depth v2 dataset contains 1449 RGB-D images, divided into 795 training images and 654 testing images. The SUN-RGBD dataset consists of 10335 RGB-D images, divided into 5285 training images and 5050 testing images.

We choose a two-stream dilated ResNet-101 pretrained on ImageNet as the backbone network; the overall stride is 16. We use the SGD optimizer and employ a poly learning rate schedule. The initial learning rate is set to 0.005 for the NYU-Depth v2 dataset and 0.001 for the SUN-RGBD dataset. Momentum and weight decay are set to 0.9 and 0.0005, respectively. The batch size is 8 for both datasets. We train the network for 500 epochs on the NYU-Depth v2 dataset and 200 epochs on the SUN-RGBD dataset. For data augmentation, we apply random scaling, random cropping, and random horizontal flipping. We use cross-entropy as the loss function. When assembling the segment decoder, we adopt a multi-loss strategy: we compute two auxiliary losses using the stage-2 and stage-4 outputs of the decoder, with both weights set to 0.2 (see the sketch below). We report three metrics: pixel accuracy (Acc), mean accuracy (mAcc), and mean intersection over union (mIoU).
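As an illustration of the optimization recipe above, a minimal sketch of the poly learning-rate schedule and the weighted multi-loss follows; the poly exponent of 0.9 is an assumed common default, since it is not stated in the text.

```python
import torch.nn.functional as F

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly schedule: decays base_lr to 0 over training (power=0.9 assumed)."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

def segmentation_loss(main_out, aux2_out, aux4_out, target, aux_weight=0.2):
    """Main cross-entropy plus two auxiliary losses on the decoder's
    stage-2 and stage-4 outputs, each weighted by 0.2 as described above."""
    loss = F.cross_entropy(main_out, target)
    loss = loss + aux_weight * F.cross_entropy(aux2_out, target)
    loss = loss + aux_weight * F.cross_entropy(aux4_out, target)
    return loss
```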
3.2. Ablation Study

We conduct extensive ablation experiments on the NYU-Depth v2 dataset to verify the effectiveness of the proposed modules. For the baseline model, we use the conventional multi-stage propagation strategy that directly adds depth features to the RGB branch at all four stages, without using the decoder. As shown in Table 1, compared to the baseline, the model embedded with the L-CFM achieves 48.22% mIoU, a 1.49% improvement, and the model embedded with the G-CFM achieves 50.31% mIoU, a 3.58% improvement. When the L-CFM and G-CFM are used together in parallel, the result further improves to 51.39% mIoU, demonstrating that the two fusion modules, which propagate depth features in local and global manners respectively, are complementary. After assembling the segment decoder, we achieve 52.11% mIoU. For fair comparison with the state of the art, we adopt the multi-grid and multi-scale test strategies and achieve 54.61% mIoU.

Method                            mIoU%
baseline                          46.73
+L-CFM                            48.22
+G-CFM                            50.31
+L-CFM +G-CFM                     51.39
+L-CFM +G-CFM +decoder            52.11
+L-CFM +G-CFM +decoder +MG        53.57
+L-CFM +G-CFM +decoder +MG +MS    54.61

Table 1. Ablation study of the proposed GLPNet on the NYU-Depth v2 test set. MG: multi-grid. MS: multi-scale test.

For the Local Context Fusion Module, we further conduct experiments with different settings of the stages at which the L-CFM is embedded. As shown in Table 2, using the L-CFM at the early stages can hardly improve the network performance, since the alignment error has not yet accumulated much there. We further verify this by adding the L-CFM to all stages, which yields a slight performance drop, demonstrating that our design of adding it only at the last stage effectively performs precise depth feature propagation and overcomes the accumulated alignment error.

For the Global Context Fusion Module, we further conduct two comparison experiments with different context modeling settings, denoted G-CFM var1 and G-CFM var2. In G-CFM var1, we extract K global context features from the input RGB feature only and do not use the depth feature at all. In G-CFM var2, we directly add the depth feature to the RGB branch as in the baseline network and then extract K global context features from the fused feature. The results are presented in Table 3. G-CFM var1, which models only the RGB context, yields 49.81% mIoU. G-CFM var2 cannot properly utilize the depth information because the unaligned depth feature brings noise, and the performance drops to 49.56% mIoU. Compared to these two variants, our G-CFM, which exploits the multi-modal context, achieves 50.31% mIoU, demonstrating the effectiveness and necessity of jointly modeling the multi-modal contexts. With regard to the hyperparameter K, i.e., the number of global context features extracted from each modality, we found that the network performance is sensitive to its choice, and we use K=15 for its best performance.

We choose two RGB-D image pairs from the NYU-Depth v2 test set to visualize the pooling masks predicted by the G-CFM. The result is shown in Figure 2, from which we can see that different global pooling masks concentrate on different regions of the images and differ between the two modalities.

[Table 2 body lost in extraction: each row marks with a check the stages (S1-S4) at which the L-CFM is embedded and reports the resulting mIoU.]
Table 2. Ablation study of the stages at which the L-CFM is embedded, on the NYU-Depth v2 test set. S represents stage.
Method        K    mIoU%
G-CFM var1    15   49.81
G-CFM var2    15   49.56
G-CFM         15   50.31
G-CFM         5    48.23
G-CFM         10   49.27
G-CFM         20   50.27
G-CFM         25   49.42
Table 3. Ablation study of different variants of the G-CFM and of the value of the hyperparameter K on the NYU-Depth v2 test set.
Fig. 2. Visualization of the pooling masks predicted by the G-CFM for two example RGB-D pairs from the NYU-Depth v2 test set. To save space, we present five highly representative masks out of the fifteen for each modality. Best viewed in color.
3.3. Comparison with State-of-the-Art Methods

The performance comparison on the NYU-Depth v2 dataset is shown in Table 4: our method outperforms existing approaches by a clear margin, which demonstrates the effectiveness of our GLPNet with the proposed fusion modules. It is worth noting that we use the raw depth image as the input to the depth branch and achieve better performance than the methods [5, 11, 6, 13] that take encoded HHA [18] images as input, which consume much more inference time. Compared with the previous state of the art, SA-Gate [6], which uses a bi-directional information propagation strategy and DeepLabv3+ [19] to model the fusion context, our method surpasses it by over 2.2% mIoU. We attribute this to the better depth feature propagation of our network, with the G-CFM modeling the multi-modal context globally and the L-CFM performing precise modality alignment locally.

We also conduct experiments on the SUN-RGBD dataset to further evaluate the proposed method. Quantitative results are shown in Table 5: our method boosts the RGB-D baseline from 44.0% mIoU to 51.2% mIoU and achieves the state of the art.
Method          Backbone    DE    Acc%   mAcc%   mIoU%
FCN [1]         2×VGG16     HHA   65.4   46.1    34.0
LSD-GF [10]     2×VGG16     HHA   71.9   60.7    45.9
RDF101 [5]      2×Res101    HHA   75.6   62.2    49.1
RDF152 [5]      2×Res152    HHA   76.0   62.8    50.1
CFNet [11]      2×Res152    HHA   -      -       47.7
3DGNN [12]      1×VGG16     HHA   -      55.2    42.0
D-CNN [13]      2×VGG16     HHA   -      56.3    43.9
ACNet [14]      3×Res50     raw   -      -       48.3
PADNet [15]     1×Res50     raw   75.2   62.3    50.2
PAP [16]        1×Res50     raw   76.2   62.5    50.4
SGNet-16s [17]  1×Res101    raw   76.4   62.1    50.3
SGNet-8s [17]   1×Res101    raw   76.8   63.1    51.0
SA-Gate [6]     2×Res101    HHA   77.9   -       52.4
Ours            2×Res101    raw                  54.61
Table 4. Comparison results on the NYU-Depth v2 test set. DE represents depth encoding.
Method         Backbone    DE    Acc%   mAcc%   mIoU%
LSD-GF [10]    2×VGG16     HHA   -      58.0    -
3DGNN [12]     1×VGG16     HHA   -      57.0    45.9
RDF152 [5]     2×Res152    HHA   81.5   60.1    47.7
CFNet [11]     2×Res152    HHA   -      -       48.1
D-CNN [13]     2×VGG16     HHA   -      53.5    42.0
ACNet [14]     3×Res50     raw   -      -       48.1
PAP [16]       1×Res50     raw   83.8   58.4    50.5
SGNet-8s [17]  1×Res101    raw   81.8   60.9    48.5
SA-Gate [6]    2×Res101    HHA   82.5   -       49.4
baseline       2×Res101    raw   79.3   57.9    44.0
Ours           2×Res101    raw                  51.2
Table 5. Comparison results on the SUN-RGBD test set.
4. CONCLUSION
We have proposed GLPNet for RGB-D semantic segmentation. GLPNet helps information propagation from the depth branch to the RGB branch at the deep stage. Specifically, the local context fusion module dynamically aligns both modalities before fusion, and the global context fusion module performs depth information propagation through joint multi-modal context modeling. Extensive ablation experiments verify the effectiveness of the proposed method, and GLPNet achieves new state-of-the-art performance on two indoor scene segmentation datasets, i.e., NYU-Depth v2 and SUN-RGBD.

5. REFERENCES

[1] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[2] Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers, "FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture," in Asian Conference on Computer Vision. Springer, 2016, pp. 213–228.

[3] Jindong Jiang, Lunan Zheng, Fei Luo, and Zhijun Zhang, "RedNet: Residual encoder-decoder network for indoor RGB-D semantic segmentation," arXiv preprint arXiv:1806.01054, 2018.

[4] Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, and Yunhai Tong, "Semantic flow for fast and accurate scene parsing," arXiv preprint arXiv:2002.10120, 2020.

[5] Seong-Jin Park, Ki-Sang Hong, and Seungyong Lee, "RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4980–4989.

[6] Xiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, and Gang Zeng, "Bi-directional cross-modality feature propagation with separation-and-aggregation gate for RGB-D semantic segmentation," arXiv preprint arXiv:2007.09183, 2020.

[7] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.

[8] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus, "Indoor segmentation and support inference from RGBD images," in European Conference on Computer Vision. Springer, 2012, pp. 746–760.

[9] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao, "SUN RGB-D: A RGB-D scene understanding benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 567–576.

[10] Yanhua Cheng, Rui Cai, Zhiwei Li, Xin Zhao, and Kaiqi Huang, "Locality-sensitive deconvolution networks with gated fusion for RGB-D indoor semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3029–3037.

[11] Di Lin, Guangyong Chen, Daniel Cohen-Or, Pheng-Ann Heng, and Hui Huang, "Cascaded feature network for semantic segmentation of RGB-D images," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1311–1319.

[12] Xiaojuan Qi, Renjie Liao, Jiaya Jia, Sanja Fidler, and Raquel Urtasun, "3D graph neural networks for RGBD semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5199–5208.

[13] Weiyue Wang and Ulrich Neumann, "Depth-aware CNN for RGB-D segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 135–150.

[14] Xinxin Hu, Kailun Yang, Lei Fei, and Kaiwei Wang, "ACNet: Attention based network to exploit complementary features for RGBD semantic segmentation," in IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 1440–1444.

[15] Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe, "PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 675–684.

[16] Zhenyu Zhang, Zhen Cui, Chunyan Xu, Yan Yan, Nicu Sebe, and Jian Yang, "Pattern-affinitive propagation across depth, surface normal and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4106–4115.

[17] Lin-Zhuo Chen, Zheng Lin, Ziqin Wang, Yong-Liang Yang, and Ming-Ming Cheng, "Spatial information guided convolution for real-time RGBD semantic segmentation," arXiv preprint arXiv:2004.04534, 2020.

[18] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik, "Learning rich features from RGB-D images for object detection and segmentation," in European Conference on Computer Vision. Springer, 2014, pp. 345–360.

[19] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.