Learning Fully Dense Neural Networks for Image Semantic Segmentation
Mingmin Zhen, Jinglu Wang, Lei Zhou, Tian Fang∗, Long Quan
Hong Kong University of Science and Technology, Microsoft Research Asia, Altizure.com
{mzhen, lzhouai, quan}@cse.ust.hk, [email protected], [email protected]

Abstract
Semantic segmentation is pixel-wise classification which retains critical spatial information. "Feature map reuse" has been commonly adopted in CNN based approaches to take advantage of feature maps in the early layers for the later spatial reconstruction. Along this direction, we go a step further by proposing a fully dense neural network with an encoder-decoder structure that we abbreviate as FDNet. For each stage in the decoder module, feature maps of all the previous blocks are adaptively aggregated to feed forward as input. On the one hand, it reconstructs the spatial boundaries accurately. On the other hand, it learns more efficiently thanks to more efficient gradient backpropagation. In addition, we propose a boundary-aware loss function that focuses more attention on the pixels near the boundary, which boosts the labeling of these "hard examples". We demonstrate the best performance of FDNet on two benchmark datasets, PASCAL VOC 2012 and NYUDv2, over previous works when training on other datasets is not considered.
Introduction
Recent works on semantic segmentation are mostly based on the fully convolutional network (FCN) (Long, Shelhamer, and Darrell 2015). Generally, a pretrained classification network (such as VGGNet (Simonyan and Zisserman 2015), ResNet (He et al. 2016) and DenseNet (Huang et al. 2017)) is used as an encoder to generate a series of feature maps with rich semantic information at the higher layers. In order to obtain a probability map with the same resolution as the input image, a decoder is adopted to recover the spatial resolution from the output of the encoder (Fig. 1 Top). The encoder-decoder structure is widely used for semantic segmentation (Vijay, Alex, and Roberto 2017; Long, Shelhamer, and Darrell 2015; Noh, Hong, and Han 2015; Zhao et al. 2017).

The key difficulties for the encoder-decoder structure are twofold. First, as multiple stages of spatial pooling and convolutional strides are used to reduce the final feature map size in the encoder module, much spatial information is lost. This is hard to recover in the decoder module and leads to poor semantic segmentation results, especially for boundary localization. Second, the encoder-decoder structure is much deeper than the original encoder network for image classification tasks (such as VGGNet (Simonyan and Zisserman 2015), ResNet (He et al. 2016) and DenseNet (Huang et al. 2017)). This results in the training optimization problem introduced in (He et al. 2016; Huang et al. 2017), though it has been partially solved by using batch normalization (BN) (Ioffe and Szegedy 2015).

∗ Tian Fang is with Shenzhen Zhuke Innovation Technology since 2017.
Figure 1: Different types of encoder-decoder structures for semantic segmentation. Top: a basic encoder-decoder structure (e.g. DeconvNet (Noh, Hong, and Han 2015) and SegNet (Vijay, Alex, and Roberto 2017)) using a multiple-stage decoder to predict masks, which often results in very coarse pixel masks since spatial information is largely lost in the encoder module. Middle: feature map reuse structures using previous feature maps of the encoder module achieve very good results in semantic segmentation tasks (Lin et al. 2017a; Islam et al. 2017; Ghiasi and Fowlkes 2016) and other related tasks (Pinheiro et al. 2016; Huang et al. 2018), but the potential of feature map reuse is not fully released. Bottom: the proposed fully dense network, using feature maps from all the previous blocks, is capable of capturing multi-scale information, restoring the spatial information, and benefiting the gradient backpropagation.

In order to address the spatial information loss problem, DeconvNet (Noh, Hong, and Han 2015) uses unpooling layers to restore the spatial information by recording the locations of maximum activations during the pooling operation. However, this cannot completely solve the problem since only the location of maximum activations is restored.
Figure 2: Left: (a) original images; (b) trimap example with 1 pixel; (c) trimap example with 10 pixels. Right: semantic segmentation results within a band around the object boundaries for different methods (mean IoU).

Another way to deal with this problem is to reuse the feature maps with rich spatial information of earlier layers. U-Net (Ronneberger, Fischer, and Brox 2015) exploits previous feature maps in the decoder module through a "skip connections" structure (see Fig. 1 Middle). Furthermore, RefineNet (Lin et al. 2017a) refines semantic feature maps from later layers with fine-grained feature maps from earlier layers. Similarly, G-FRNet (Islam et al. 2017) adopts multi-stage gate units to make use of previous feature maps progressively. Feature map reuse significantly improves the restoration of spatial information. Meanwhile, it helps to capture multi-scale information from the multi-scale feature maps of earlier layers in the encoder module. In addition, it also boosts information flow and gradient backpropagation, as the path from the earlier layers to the loss layer is shortened. However, the potential of feature map reuse is not completely revealed. In order to further improve the performance, we propose to reconstruct the encoder-decoder neural network to form a fully dense neural network (see Fig. 1 Bottom). We refer to our neural network as FDNet.
FDNet is a nearly symmetric encoder-decoder network and is easy to optimize. We choose DenseNet-264 (Huang et al. 2017), which achieves state-of-the-art results in image classification tasks, as the encoder. The feature maps in the encoder module are beneficial to the decoder module. The decoder module operates as an upsampling process to recover the spatial resolution, aiming for accurate boundary localization. The feature maps of different scale sizes (including feature maps in the decoder module) are fully reused through an adaptive aggregation structure, which generates a fully densely connected structure.

In general, the cross entropy loss function is used to propagate the loss in previous works (Liu, Rabinovich, and Berg 2016; Lin et al. 2017a). The weakness of this method is that it treats all pixels the same. As shown in Fig. 2, labeling for the pixels near the boundary (small band widths) is not very accurate. In other words, the pixels near the boundary are "hard examples", which need to be treated differently. Based on this observation, we propose a boundary-aware loss function, which pays more attention to the pixels near the boundary. Though attention based losses have been adopted in the object detection task (Lin et al. 2017c), our boundary-aware loss comes from the prior that pixels near the boundary are "hard examples". This is very different from focal loss, which pays more attention to the pixels with higher loss. In order to further boost training optimization, we use multiple losses for the output feature maps of the decoder module. As a result, basically each layer of FDNet has direct access to the gradients from the loss layers. This is very helpful to gradient propagation (Huang et al. 2017).

Related work
The fully convolutional network (FCN) (Long, Shelhamer, and Darrell 2015) has improved the performance of semantic segmentation significantly. In the FCN architecture, a fully convolutional structure and bilinear interpolation are used to realize pixel-wise prediction, which results in coarse boundaries as large amounts of spatial information have been lost. Following the FCN method, many works (Vijay, Alex, and Roberto 2017; Lin et al. 2017a; Zhao et al. 2017) have tried to further improve the performance of semantic segmentation.
Encoder-decoder. The encoder-decoder structure with a multi-stage decoder gradually recovers sharp object boundaries. DeconvNet (Noh, Hong, and Han 2015) and SegNet (Vijay, Alex, and Roberto 2017) employ symmetric encoder-decoder structures to restore spatial resolution by using unpooling layers. RefineNet (Lin et al. 2017a) and G-FRNet (Islam et al. 2017) also adopt a multi-stage decoder with feature map reuse in each stage of the decoder module. In LRR (Ghiasi and Fowlkes 2016), a multiplicative gating method is used to refine the feature map of each stage and a Laplacian reconstruction pyramid is used to fuse predictions. Moreover, (Fu et al. 2017) stacks many encoder-decoder architectures to capture multi-scale information. Following these works, we also use an encoder-decoder structure to generate pixel-wise prediction label maps.
Feature map reuse. The feature maps in the higher layers tend to be invariant to translation and illumination. This invariance is crucial for specific tasks such as image classification, but is not ideal for semantic segmentation, which requires precise spatial information, since important spatial relationships have been lost. Thus, the reuse of feature maps with rich spatial information of the previous layers can boost the spatial structure reconstruction process. Furthermore, feature map reuse has also been used in object detection tasks (Lin et al. 2017b) and instance segmentation tasks (Pinheiro et al. 2016; He et al. 2017) to capture multi-scale information when considering objects with different scales. In our architecture, we fully aggregate previous feature maps in the decoder module, which shows outstanding performance in the experiments.
Fully dense neural networks
In this section, we introduce the proposed fully dense neural network (FDNet), which is visualized comprehensively in Fig. 3. We first introduce the whole architecture. Next, the adaptive aggregation structure for dense feature maps is presented in detail. At last, we show the boundary-aware loss function.
Figure 3: Overview of the proposed fully dense neural network (FDNet). The feature maps (outputs of dense blocks 1, 2, 3 and 4) of the encoder module and even the feature maps (output of dense block 5) of the decoder module are fully reused. The adaptive aggregation module combines feature maps from all the previous blocks to form new feature maps as the input of subsequent blocks. After an adaptive aggregation module or a dense block, a convolution layer is used to compress the feature maps. The aggregated feature maps are upsampled to H × W × C (C is the number of classes for the labels) and the pixel-wise cross entropy loss is computed.

Encoder-decoder architecture
Our model (Fig. 3) is based on the deep encoder-decoder architecture (e.g. (Noh, Hong, and Han 2015; Vijay, Alex, and Roberto 2017)). The encoder module extracts features from an image and the decoder module produces the semantic segmentation prediction.
Encoder. Our encoder network is based on DenseNet-264 (Huang et al. 2017) with the softmax and fully connected layers of the original network removed (from the starting convolutional layer to dense block 4 in Fig. 3). The input of each convolutional layer within a dense block is the concatenation of all outputs of its previous layers at a given resolution. Given that $x_\ell$ is the output of the $\ell$-th layer in a dense block, $x_\ell$ can be computed as follows:

$$x_\ell = H_\ell([x_0, x_1, ..., x_{\ell-1}]) \quad (1)$$

where $[x_0, x_1, ..., x_{\ell-1}]$ denotes the concatenation of the feature maps $x_0, x_1, ..., x_{\ell-1}$, and $x_0$ is the input feature map of the dense block. Meanwhile, $H_\ell(\cdot)$ is defined as a composite function of operations: BN, ReLU and a 1 × 1 convolution, followed by BN, ReLU and a 3 × 3 convolution. As a result, the output of a dense block includes feature maps from all the layers in this block. Each dense block is followed by a transition layer, which compresses the number and size of the feature maps through 1 × 1 convolution and pooling layers. For an input image I, the encoder network produces 4 feature maps ($B_1$, $B_2$, $B_3$, $B_4$) with decreasing spatial resolution (1/4, 1/8, 1/16, 1/32). In order to reduce spatial information loss, we can remove the pooling layer before dense block 4 so that the output feature map of the last dense block (i.e. $B_4$) in the encoder module is of 1/16 size. Atrous convolution is also used to control the spatial density of computed feature responses in the last block, as suggested in (Chen et al. 2017). We refer to this architecture as FDNet-16s. The original architecture can be taken as FDNet-32s.

Decoder. As the encoder-decoder structure has many more layers than the original encoder network, how to boost gradient backpropagation and information flow becomes another problem we have to deal with. The decoder module progressively enlarges the feature maps while densely reusing previous feature maps by aggregating them into a new feature map. As the input feature map of each dense block has a direct connection to the output of the block, the inputs of previous blocks in the encoder module are also directly connected to the new feature map. The new feature map is then upsampled to compute the loss against the groundtruth, which leads to multiple loss computations. Thus, the inputs of all dense blocks in the FDNet have a direct connection to the loss layers, which significantly boosts gradient backpropagation.

Following the DenseNet structure, we also use dense blocks at each stage, placed after a compression layer whose convolution changes the number of feature maps coming from the adaptive aggregation structure. The compression layer is composed of BN, ReLU and a 1 × 1 convolution operation. In the two compression layers after adaptive aggregation, the filter numbers are set to 1024 and 768. In the two compression layers after block 5 and block 6, the filter numbers are set to 768 and 512. For block 5 and block 6, there are 2 convolutional layers in each of them.
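To make Eq. (1) concrete, the following is a minimal PyTorch sketch of one dense layer and one dense block, assuming the standard DenseNet bottleneck form of $H_\ell$; the growth rate, layer count and module names are illustrative and do not reproduce the exact DenseNet-264 configuration.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """H_l in Eq. (1): BN-ReLU-1x1 conv followed by BN-ReLU-3x3 conv."""
    def __init__(self, in_channels, growth_rate, bottleneck_width=4):
        super().__init__()
        inter_channels = bottleneck_width * growth_rate
        self.fn = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(inter_channels), nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, growth_rate,
                      kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.fn(x)

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all earlier outputs (Eq. (1))."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]  # x_0: the input feature map of the block
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        # the block output keeps the feature maps of all layers, incl. the input
        return torch.cat(features, dim=1)
```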
Figure 4: An example of an adaptive aggregation structure for dense feature maps. For all the input feature maps (not including the directly connected input feature map, i.e. the blue line), a compression layer with BN, ReLU and a 1 × 1 convolution is applied to adjust the number of feature maps. Then an upsampling or downsampling layer is applied so that all the feature maps are consistent in size with the output feature map. They are then concatenated to form a new feature map with 1/8 of the size of the input image.

Adaptive aggregation of dense feature maps
In previous works, e.g. U-Net (Ronneberger, Fischer, and Brox 2015) for semantic segmentation, FPN (Lin et al. 2017b) for object detection and SharpMask (Pinheiro et al. 2016) for instance segmentation, feature maps are reused directly in the corresponding decoder module by concatenating or adding the feature maps. Furthermore, RefineNet (Lin et al. 2017a), LRR (Ghiasi and Fowlkes 2016) and G-FRNet (Islam et al. 2017) refine the feature maps progressively stage by stage. Instead of just using previous feature maps as before, we introduce an adaptive aggregation structure to make better use of the feature maps from previous blocks. As shown in Fig. 4, the feature maps from previous blocks are densely concatenated together by the adaptive aggregation structure.

The adaptive aggregation structure takes all the feature maps from previous blocks ($B_1$, $B_2$, ...) as input. The feature maps from the lower layers (e.g. $B_1$, $B_2$) are of high resolution with coarse semantic information, whereas feature maps from the higher layers (e.g. $B_3$, $B_4$) are of low resolution with rich semantic information. The adaptive aggregation structure combines all previous feature maps to generate rich contextual information as well as spatial information. The incoming feature maps may differ in scale. As shown in Fig. 4, the output feature map is of 1/8 the size of the input image. To reduce memory consumption, we first use a convolutional layer to compress the incoming feature maps, except for the directly connected feature map (which has already been compressed). The compression layer is also composed of BN, ReLU and a 1 × 1 convolution operation. In order to make all feature maps consistent in size, we use convolutional layers to downsample and deconvolutional layers to upsample the feature maps. If a feature map already matches the size of the output feature map, we directly concatenate it. The convolutional layers are all composed of BN, ReLU and a convolution with different strides. The deconvolutional layers are all composed of BN, ReLU and a deconvolution with different strides. At last, all the resultant feature maps $D_{i1}, D_{i2}, ..., D_{iM}$ ($M$ input feature maps) are concatenated into a new feature map $F_i$ for the $i$-th stage, which is then fed to the subsequent loss computation or dense block. The formulation for obtaining the $i$-th dense feature map from the previous feature maps can be written as follows:

$$D_{i1} = T_{i1}(B_1), \ D_{i2} = T_{i2}(B_2), \ ..., \ D_{iM} = T_{iM}(B_M)$$
$$F_i = [D_{i1}, D_{i2}, ..., D_{iM}] \quad (2)$$

where $T(\cdot)$ denotes the transformation operation (downsample or upsample). If $B_j$ is of the same size as the output feature map, no operation is performed on $B_j$. In addition, $[\cdots]$ stands for the concatenation operation.

In the adaptive aggregation structures for the three stages of the decoder module, the filter numbers in the compression layers for the reused feature maps are set to 384, 256 and 128, respectively. The upsampling and downsampling layers do not change the number of feature maps.
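Below is a minimal sketch of the adaptive aggregation of Eq. (2), assuming hypothetical channel counts and integer scale ratios; unlike FDNet, which skips compression for the directly connected input, this sketch compresses every input for brevity.

```python
import torch
import torch.nn as nn

def bn_relu_conv(in_ch, out_ch, k=1, stride=1, padding=0):
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=padding, bias=False))

class AdaptiveAggregation(nn.Module):
    """F_i = [T_i1(B_1), ..., T_iM(B_M)] of Eq. (2), for one decoder stage.

    scale_factors[m] is the ratio of block m's resolution to the target
    resolution: >1 means downsample, <1 means upsample, 1 means identity.
    """
    def __init__(self, in_channels, out_channels, scale_factors):
        super().__init__()
        self.transforms = nn.ModuleList()
        for in_ch, s in zip(in_channels, scale_factors):
            ops = [bn_relu_conv(in_ch, out_channels)]  # compression layer
            if s > 1:    # higher-resolution input: strided conv downsampling
                ops.append(bn_relu_conv(out_channels, out_channels,
                                        k=3, stride=int(s), padding=1))
            elif s < 1:  # lower-resolution input: transposed-conv upsampling
                up = int(round(1 / s))
                ops.append(nn.Sequential(
                    nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
                    nn.ConvTranspose2d(out_channels, out_channels,
                                       kernel_size=2 * up, stride=up,
                                       padding=up // 2, bias=False)))
            self.transforms.append(nn.Sequential(*ops))

    def forward(self, blocks):
        # resample every incoming map to the target size, then concatenate
        return torch.cat([t(b) for t, b in zip(self.transforms, blocks)], dim=1)
```

For example, a stage producing output at 1/8 resolution would take inputs at 1/4, 1/8, 1/16 and 1/32 of the image with scale_factors = (2, 1, 0.5, 0.25).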
Boundary-aware loss

In previous works, the cross entropy loss function is often used, which treats all pixels equally. As shown in Fig. 2, the pixels surrounding the boundary are "hard examples", which lead to bad predictions. Based on this observation, we construct a boundary-aware loss function, which guides the network to pay more attention to the pixels near the boundary. The loss function is

$$loss(L, L^{gt}) = -\frac{1}{N} \sum_{j=1}^{K} \sum_{I_i \in S_j} \sum_{c=1}^{C} \alpha_j L^{gt}_{i,c} \, w(L_{i,c}) \log L_{i,c} \quad (3)$$

where $L$ is the result of the softmax operation on the output feature map and $L^{gt}$ is the groundtruth. $I_i$ is the $i$-th pixel in the image $I$ and $C$ is the number of categories. We split all the $N$ pixels of image $I$ into several sets $S_j$ based on the distance between the pixels and the boundary, so that $I = \{S_1, S_2, ..., S_K\}$. We apply an image dilation operation on the boundary with varying kernel sizes, referred to as band widths in Fig. 2, to obtain the different sets of pixels surrounding the boundary. $\alpha_j$ is a balancing weight and $w(L_{i,c})$ is an attention weight function. Motivated by (Lin et al. 2017c), we test two attention weight functions (poly and exp): $w(L_{i,c}) = (1 - L_{i,c})^{\lambda}$ and $w(L_{i,c}) = e^{-\lambda(1 - L_{i,c})}$. The $\lambda$ is used to control the attention weight. The ablation experiment results are shown in Table 2.
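A sketch of the boundary-aware loss of Eq. (3) for a batch of images is shown below; it assumes the band sets $S_j$ have been precomputed offline as an integer index map by dilating the groundtruth boundary with the growing kernel sizes, and the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def boundary_aware_loss(logits, target, band_index, alphas, lam, mode="exp"):
    """Eq. (3): per-pixel cross entropy weighted by the band set S_j a pixel
    falls in (alpha_j) and by the attention weight w(L_ic).

    logits:     (N, C, H, W) raw scores
    target:     (N, H, W) groundtruth class indices
    band_index: (N, H, W) index j of the band S_j each pixel belongs to,
                obtained offline by dilating the boundary with growing kernels
    alphas:     list of K balancing weights alpha_j
    """
    log_prob = F.log_softmax(logits, dim=1)
    prob = log_prob.exp()
    # probability and log-probability of the true class at every pixel
    # (this realizes the one-hot selection L^gt_{i,c} in Eq. (3))
    p_true = prob.gather(1, target.unsqueeze(1)).squeeze(1)
    logp_true = log_prob.gather(1, target.unsqueeze(1)).squeeze(1)
    # attention weight: poly (1 - L)^lambda or exp e^{-lambda (1 - L)}
    if mode == "poly":
        w = (1.0 - p_true) ** lam
    else:
        w = torch.exp(-lam * (1.0 - p_true))
    # look up the balancing weight alpha_j per pixel via the band index
    alpha = torch.tensor(alphas, device=logits.device)[band_index]
    return -(alpha * w * logp_true).mean()
```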
Figure 5: The effect of employing the proposed fully densefeature map reuse structure compared with other frame-works. Our proposed FDNet shows better results (Column ), especially on the boundary localization , compared withthe results (Column ) of encoder-decoder structure withfeature reuse method (Fig. 1 Middle ) and the results (Col-umn ) of encoder-decoder structure without feature reusemethod (Fig. 1 Top ).the feature map L i is upsampled by using bilinear interpo-lation method directly to produce feature map H × W × C ,which is used to compute pixel-wise loss with groundtruth.In terms of formula, the final loss L final is computed as fol-lows: L i = sof tmax ( U i ( F i )) L final = (cid:88) i loss ( L i , L gt ) (4)where U i ( · ) denotes a upsample module with bilinear inter-polation operation.In the encoder module, the output feature map of eachmodule is the concatenation of all the feature maps withinthis block, including the input. And the aggregated featuremap is feature maps from all the previous blocks. Thus, eachfeature map in the encoder has much shorter path to losscompared with previous encoder-decoder structure (Lin etal. 2017a; Islam et al. 2017). The gradient backpropagationand information flowing is much more efficient. This willfurther boost our network optimization. Implementation details
Implementation details

Training: The proposed FDNet is implemented with PyTorch on a single NVIDIA GTX 1080Ti. The weights of DenseNet-264 are directly employed in the encoder module of FDNet. In the training step, we adopt data augmentation similar to (Chen et al. 2016): random crops and horizontal flips are applied. We train for 30K iterations. We optimize the network with momentum and weight decay, using the "poly" learning rate policy in which the initial learning rate is multiplied by $(1 - \frac{iter}{max\_iter})^{power}$.

Table 1: The mean IoU scores (%) for encoder-decoder with different feature map reuse methods on the PASCAL VOC 2012 validation dataset.

Encoder stride | w/o feature reuse | w/ feature reuse | dense feature reuse
32 | 77.2 | 78.5 | 78.9
16 | 78.2 | 79.1 | 79.4
Table 2: The mean IoU scores (%) for boundary-aware loss on the PASCAL VOC 2012 validation dataset. The poly and exp represent different weighting methods.

loss | settings | mIoU
CE | - | 79.4
b-aware (poly) | kernel = (10, …), λ = 0, α = (5, …) | -
b-aware (poly) | kernel = (10, …), λ = 0, α = (8, …) | -
b-aware (poly) | kernel = (5, …), λ = 0, α = (8, …) | -
b-aware (poly) | kernel = (10, …), α = (8, …), λ = 1, 2, 5 | -
b-aware (exp) | kernel = (10, …), α = (8, …), λ = 0.…, 1, 2 | -

Inference: In the inference step, we pad images with the mean value before feeding full images into the network. We apply multi-scale inference, which is commonly used in semantic segmentation methods (Lin et al. 2017a; Fu et al. 2017). For multi-scale inference, we average the predictions on the same image across different scales for the final prediction, with scales ranging from 0.6 to 1.4. Horizontal flipping is also adopted at inference time. In the ablation experiments, we just use a single scale (i.e. scale = 1.0) and horizontal flipping for inference. In addition, we use the last-stage feature map of the decoder module to generate the final prediction label map.
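A sketch of the "poly" learning rate policy and the multi-scale, horizontally flipped inference described above; the power value, the scale step and the use of LambdaLR are assumptions rather than details reported in the paper.

```python
import torch
import torch.nn.functional as F

def poly_lr(optimizer, max_iter, power=0.9):
    """'poly' policy: lr = base_lr * (1 - iter/max_iter)^power.
    power=0.9 is the value commonly used with this policy (assumed here)."""
    return torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda it: (1 - it / max_iter) ** power)

@torch.no_grad()
def multi_scale_inference(model, image, scales=(0.6, 0.8, 1.0, 1.2, 1.4)):
    """Average softmax predictions over scales and horizontal flips."""
    _, _, h, w = image.shape
    avg = 0.0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear",
                          align_corners=False)
        for flip in (False, True):
            inp = torch.flip(x, dims=[3]) if flip else x
            out = model(inp).softmax(dim=1)
            if flip:  # flip the prediction back before averaging
                out = torch.flip(out, dims=[3])
            avg = avg + F.interpolate(out, size=(h, w), mode="bilinear",
                                      align_corners=False)
    return avg.argmax(dim=1)  # final prediction label map
```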
Experiments
In this section, we describe the configurations of the experimental datasets and show ablation experiments on PASCAL VOC 2012. At last, we report results on two benchmark datasets: PASCAL VOC 2012 and NYUDv2.

Table 3: GPU memory, number of parameters and some results on the VOC 2012 test dataset.

Methods | RefineNet-152 | FDNet | SDN M2
GPU Memory (MB) | 4253 | - | -
Parameters (M) | 109.2 | - | -
Table 4: The mean IoU scores (%) on the PASCAL VOC 2012 validation dataset. FDNet-16s-MS denotes evaluation on multiple scales. FDNet-16s-finetuning-MS denotes fine-tuning on the standard training data (1,464 images) of the PASCAL VOC 2012 dataset after training on the trainaug dataset.

Method | mIoU
Deeplab-MSc-CRF-LargeFOV | 68.7
DeconvNet | 67.1
DeepLabv2 | 77.7
G-FRNet | 77.8
DeepLabv3 | 79.8
SDN | 80.7
DeepLabv3+ | 81.4
FDNet-16s | 80.9
FDNet-16s-MS | 82.1
FDNet-16s-finetuning-MS | 84.1
Datasets description
To show the effectiveness of our approach, we conduct comprehensive experiments on the PASCAL VOC 2012 dataset (Everingham et al. 2010) and the NYUDv2 dataset (Silberman et al. 2012).

PASCAL VOC 2012: The dataset has 1,464 images for training, 1,449 images for validation and 1,456 images for testing, involving 20 foreground object classes and one background class. Meanwhile, we augment the training set with extra labeled PASCAL VOC images provided by the Semantic Boundaries Dataset (Hariharan et al. 2011), resulting in 10,582 images as the trainaug dataset for training.

NYUDv2: The NYUDv2 dataset (Silberman et al. 2012) consists of 1,449 RGB-D images showing indoor scenes. We use the segmentation labels provided in (Gupta, Arbelaez, and Malik 2013), in which all labels are mapped to 40 classes. We use the standard training/test split with 795 and 654 images, respectively. Only RGB images are used in our experiments.

Moreover, we perform a series of ablation evaluations on the PASCAL VOC 2012 dataset with mean IoU scores reported. We use the trainaug and validation datasets of PASCAL VOC 2012 for training and inference, respectively.
Feature map reuse
To verify the power of dense feature map reuse, we compare our method with two other baseline frameworks; cross entropy loss is used in this experiment. One is the encoder-decoder structure without feature map reuse (Fig. 1 Top) and the other is the encoder-decoder structure with naive feature map reuse (Fig. 1 Middle). We also compare the three frameworks on different encoder strides (the ratio of the input image resolution to the smallest output feature map of the encoder, i.e. 16 and 32).

Figure 6: Some visual results on the PASCAL VOC 2012 dataset. The three columns of each group are the image, groundtruth and prediction label map.

The results are shown in Table 1. It is observed that the performance increases when feature maps are reused. Specifically, the performance for the encoder-decoder (encoder stride = 32) without feature map reuse is only 77.2%. With naive feature map reuse, the performance increases to 78.5%. Furthermore, our fully dense feature map reuse improves the performance to 78.9%. In addition, when we adopt stride 16 for the encoder module, the performance is much better than the original encoder with stride 32 for all three frameworks. This is because the spatial information loss is reduced by the encoder with the smaller stride. We speculate that an encoder with stride 8 could give even better results, similar to (Chen et al. 2017; 2018). Because of memory limitations, we only test the encoder with strides 16 and 32.

We also show some predicted semantic label maps for the different feature map reuse methods in Fig. 5. For the encoder-decoder structure without feature map reuse, the result is poor, especially for boundary localization. Though the naive feature map reuse method partially improves the segmentation result, it is still hard to obtain accurate pixel-wise predictions. The fully dense feature map reuse method shows excellent results on boundary localization.
Boundary-aware loss
Table 5: Quantitative results (%) in terms of mean IoU on the PASCAL VOC 2012 test set. Only VOC data is used as training data and denseCRF (Krähenbühl and Koltun 2011) is not included.

Method | aero | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mIoU
DeconvNet (Noh, Hong, and Han 2015) | 89.9 | 39.3 | 79.7 | 63.9 | 68.2 | 87.4 | 81.2 | 86.1 | 28.5 | 77.0 | 62.0 | 79.0 | 80.3 | 83.6 | 80.2 | 58.8 | 83.4 | 54.3 | 80.7 | 65.0 | 72.5
Deeplabv2 (Chen et al. 2016) | 84.4 | 54.5 | 81.5 | 63.6 | 65.9 | 85.1 | 79.1 | 83.4 | 30.7 | 74.1 | 59.8 | 79.0 | 76.1 | 83.2 | 80.8 | 59.7 | 82.2 | 50.4 | 73.1 | 63.7 | 71.6
GCRF (Vemulapalli et al. 2016) | 85.2 | 43.9 | 83.3 | 65.2 | 68.3 | 89.0 | 82.7 | 85.3 | 31.1 | 79.5 | 63.3 | 80.5 | 79.3 | 85.5 | 81.0 | 60.5 | 85.5 | 52.0 | 77.3 | 65.1 | 73.2
Adelaide (Lin et al. 2016) | 90.6 | 37.6 | 80.0 | 67.8 | 74.4 | 92.0 | 85.2 | 86.2 | 39.1 | 81.2 | 58.9 | 83.8 | 83.9 | 84.3 | 84.8 | 62.1 | 83.2 | 58.2 | 80.8 | 72.3 | 75.3
LRR (Ghiasi and Fowlkes 2016) | 91.8 | 41.0 | 83.0 | 62.3 | 74.3 | 93.0 | 86.8 | 88.7 | 36.6 | 81.8 | 63.4 | 84.7 | 85.9 | 85.1 | 83.1 | 62.0 | 84.6 | 55.6 | 84.9 | 70.0 | 75.9
G-FRNet (Islam et al. 2017) | 91.4 | 44.6 | 91.4 | 69.2 | 78.2 | 95.4 | 88.9 | 93.3 | 37.0 | 89.7 | 61.4 | 90.0 | 91.4 | 87.9 | 87.2 | 63.8 | 89.4 | 59.9 | 87.0 | 74.1 | 79.3
PSPNet (Zhao et al. 2017) | 91.8 | 71.9 | …
FDNet-16s (ours) | … | 84.2

In order to demonstrate the effect of the proposed boundary-aware loss, we take FDNet-16s as the baseline and test the performance under different parameters. We mainly use kernel = (10, …) and kernel = (5, …), splitting the pixels into K = 5 sets (the remaining pixels are referred to as $S_K$). For the poly weighting method, the boundary-aware loss (b-aware) degrades into the cross entropy method (CE) when all $\alpha_j = 1$ and $\lambda = 0$. As shown in Table 2, simply weighting the pixels surrounding the boundary already performs better than the plain cross entropy method. Fixing $\alpha$ and kernel, we then try different values of $\lambda$ in Table 2. Comparing the poly and exp methods, we observe that exp brings an obvious improvement, whereas poly performs worse than the baseline. In addition, the network does not converge when $\lambda$ is too small. We also compare the labeling accuracy for the pixels near the boundary: as shown in Fig. 2, the FDNet with boundary-aware loss shows obviously better performance for the pixels surrounding the boundary.
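The band-limited comparison of Fig. 2 can be reproduced with a sketch like the following, which restricts the mIoU computation to pixels within a given distance of the groundtruth boundary; the dilation-based band construction and the function names are assumptions.

```python
import numpy as np
from scipy import ndimage

def boundary_band_mask(label, band_width):
    """Pixels within band_width pixels of a groundtruth boundary."""
    # boundary = pixels whose 4-neighbourhood contains another class
    pad = np.pad(label, 1, mode="edge")
    boundary = ((pad[1:-1, 1:-1] != pad[:-2, 1:-1]) |
                (pad[1:-1, 1:-1] != pad[2:, 1:-1]) |
                (pad[1:-1, 1:-1] != pad[1:-1, :-2]) |
                (pad[1:-1, 1:-1] != pad[1:-1, 2:]))
    # dilate the boundary to a band of the requested width
    return ndimage.binary_dilation(boundary, iterations=band_width)

def banded_miou(pred, label, band_width, num_classes, ignore_index=255):
    """Mean IoU restricted to the band around object boundaries."""
    mask = boundary_band_mask(label, band_width) & (label != ignore_index)
    p, g = pred[mask], label[mask]
    ious = []
    for c in range(num_classes):
        inter = np.sum((p == c) & (g == c))
        union = np.sum((p == c) | (g == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```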
Memory analysis

For the semantic segmentation task, memory consumption and parameter count are both important issues. The proposed FDNet uses a fully densely connected structure with nearly the same number of parameters as RefineNet (Lin et al. 2017a). As shown in Table 3, FDNet consumes much less GPU memory during training than RefineNet. In addition, the memory consumption of FDNet can be further reduced by sharing memory efficiently as in (Pleiss et al. 2017). Compared with SDN (Fu et al. 2017), FDNet has far fewer parameters but much better performance.
PASCAL VOC 2012
We evaluate the performance on the PASCAL VOC 2012 dataset following previous works (Lin et al. 2017a; Zhao et al. 2017). As FDNet-16s shows better performance (Table 1), we only report the performance of FDNet-16s in the following experiments, and we adopt the boundary-aware loss in the training step. As shown in Table 4, FDNet-16s achieves a very competitive result of 82.1% mean IoU compared with previous works ((Chen et al. 2017; Islam et al. 2017; Fu et al. 2017)) when evaluated on multiple scales. Moreover, after fine-tuning the model on the standard training data (1,464 images) of the PASCAL VOC 2012 dataset, we achieve a much better result of 84.1% mean IoU, which is the best result so far when pretraining on other datasets (such as MS-COCO (Lin et al. 2014)) is not considered. Some visual results with images, groundtruth and prediction label maps are shown in Fig. 6.

Table 5 shows the quantitative results of our method on the test dataset, where we only report results using the PASCAL VOC dataset. We achieve the best result of 84.2% on the test data without pretraining on other datasets, which is the highest score when considering training only on the PASCAL VOC 2012 dataset. Though the latest work DeepLabv3+ (Chen et al. 2018) achieves a mean IoU score of 89.0% on the PASCAL VOC 2012 test data, that result relies on pretraining on a much larger dataset, MS-COCO (Lin et al. 2014) or JFT (Chollet 2017). In fact, FDNet-16s shows a very comparable result to DeepLabv3+ on the validation dataset (Table 4).
Table 6: Quantitative results (%) on the NYUDv2 dataset (40 classes). The model is trained only on the provided training images.

Method | pixel acc. | mean acc. | mIoU
SegNet | 66.1 | 36.0 | 23.6
Bayesian SegNet | 68.0 | 45.8 | 32.4
FCN-HHA | 65.4 | 46.1 | 34.0
Piecewise | 70.0 | 53.6 | 40.6
RefineNet | 73.6 | 58.9 | 46.5
FDNet-16s | - | - | 47.4
NYUDv2 Dataset
We conduct experiments on the NYUDv2 dataset to compare FDNet-16s with previous works. We follow the training setup used for PASCAL VOC 2012, and multi-scale inference is also adopted. The results are reported in Table 6. Similar to (Lin et al. 2017a), pixel accuracy, mean accuracy and mean IoU are used to evaluate all the methods. Some works make use of both the depth image and the RGB image as input and obtain better results; for example, RDF (Park, Hong, and Lee 2017) achieves 50.1% mean IoU by using depth information. For a fair comparison, we only report results trained on RGB images. As shown, FDNet-16s outperforms previous works in terms of all metrics. In particular, our result is better than RefineNet (Lin et al. 2017a) by 0.9% in terms of mean IoU.
Conclusion
In this paper, we have presented the fully dense neural network (FDNet) with an encoder-decoder structure for semantic segmentation. For each layer of the FDNet in the decoder module, feature maps of almost all the previous layers are aggregated as input. Furthermore, we propose a boundary-aware loss function that pays more attention to the pixels surrounding the boundary. The proposed FDNet is very advantageous for semantic segmentation. On the one hand, the class boundaries, as spatial information, are well reconstructed by the encoder-decoder structure with the boundary-aware loss function. On the other hand, the FDNet learns more efficiently thanks to more efficient gradient backpropagation, much like the arguments already demonstrated for ResNet and DenseNet. The experiments show that our model outperforms previous works on two public benchmarks when training on other datasets is not considered.
Acknowledgments
This work is supported by GRF 16203518, Hong Kong RGC 16208614, T22-603/15N, Hong Kong ITC PSKL12EG02, and China 973 program, 2012CB316300.
References
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2016. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI.
Chen, L.-C.; Papandreou, G.; Schroff, F.; and Adam, H. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; and Adam, H. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611.
Chollet, F. 2017. Xception: Deep learning with depthwise separable convolutions. In CVPR.
Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The pascal visual object classes (voc) challenge. IJCV.
Fu, J.; Liu, J.; Wang, Y.; and Lu, H. 2017. Stacked deconvolutional network for semantic segmentation. arXiv preprint arXiv:1708.04943.
Ghiasi, G., and Fowlkes, C. C. 2016. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV.
Gupta, S.; Arbelaez, P.; and Malik, J. 2013. Perceptual organization and recognition of indoor scenes from rgb-d images. In CVPR.
Hariharan, B.; Arbeláez, P.; Bourdev, L.; Maji, S.; and Malik, J. 2011. Semantic contours from inverse detectors. In ICCV.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask r-cnn. In ICCV.
Huang, G.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In CVPR.
Huang, G.; Chen, D.; Li, T.; Wu, F.; van der Maaten, L.; and Weinberger, K. Q. 2018. Multi-scale dense convolutional networks for efficient prediction. ICLR.
Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
Islam, M. A.; Rochan, M.; Bruce, N. D.; and Wang, Y. 2017. Gated feedback refinement network for dense image labeling. In CVPR.
Krähenbühl, P., and Koltun, V. 2011. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In ECCV.
Lin, G.; Shen, C.; Van Den Hengel, A.; and Reid, I. 2016. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR.
Lin, G.; Milan, A.; Shen, C.; and Reid, I. 2017a. Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In CVPR.
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017b. Feature pyramid networks for object detection. In CVPR.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017c. Focal loss for dense object detection. In ICCV.
Liu, W.; Rabinovich, A.; and Berg, A. C. 2016. Parsenet: Looking wider to see better. ICLR.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR.
Noh, H.; Hong, S.; and Han, B. 2015. Learning deconvolution network for semantic segmentation. In ICCV.
Park, S.-J.; Hong, K.-S.; and Lee, S. 2017. Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation. In ICCV.
Pinheiro, P. O.; Lin, T.-Y.; Collobert, R.; and Dollár, P. 2016. Learning to refine object segments. In ECCV.
Pleiss, G.; Chen, D.; Huang, G.; Li, T.; van der Maaten, L.; and Weinberger, K. Q. 2017. Memory-efficient implementation of densenets. Technical report.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI.
Silberman, N.; Hoiem, D.; Kohli, P.; and Fergus, R. 2012. Indoor segmentation and support inference from rgbd images. In ECCV.
Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. ICLR.
Vemulapalli, R.; Tuzel, O.; Liu, M.-Y.; and Chellapa, R. 2016. Gaussian conditional random field network for semantic segmentation. In CVPR.
Vijay, B.; Alex, K.; and Roberto, C. 2017. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI.
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid scene parsing network. In CVPR.