PT-ResNet: Perspective Transformation-Based Residual Network for Semantic Road Image Segmentation
Rui Fan, Yuan Wang, Lei Qiao, Ruiwen Yao, Peng Han, Weidong Zhang, Ioannis Pitas, Ming Liu
Robotics Institute, the Hong Kong University of Science and Technology, Hong Kong. Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China. Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece.
Abstract—Semantic road region segmentation is a high-level task, which paves the way towards road scene understanding. This paper presents a residual network trained for semantic road segmentation. Firstly, we represent the projections of road disparities in the v-disparity map as a linear model, which can be estimated by optimizing the v-disparity map using dynamic programming. This linear model is then utilized to reduce the redundant information in the left and right road images. The right image is also transformed into the left perspective view, which greatly enhances the road surface similarity between the two images. Finally, the processed stereo images and their disparity maps are concatenated to create a set of 3D images, which are then utilized to train our neural network. The experimental results illustrate that our network achieves a maximum F1-measure of approximately . when analyzing the images from the KITTI road dataset.

I. INTRODUCTION
Autonomous driving technology has been developing rapidly since Google launched its self-driving car project in 2009 [1]. In recent years, industry titans, such as Waymo and Tesla, have raced to commercialize autonomous vehicles (AVs) [2], [3]. However, a number of high-profile experimental accidents that occurred in the last year have called into question whether autonomous driving technology is mature enough for deployment [4]. Therefore, most researchers believe that in the next few years the research on autonomous driving should focus on developing advanced driver assistance systems (ADASs) [5], such as lane marking detection [6], road surface 3D reconstruction [7], 2D/3D object detection [8], etc.

Visual environment perception (VEP) is a key component of ADAS [9], [10]. After learning from a large amount of labeled training data, VEP can extract useful road environment information, e.g., free space areas and pedestrians, from road images [11]. Semantic image region segmentation can provide useful information by partitioning an image into semantically meaningful regions and classifying them into one of the pre-defined categories [12]. State-of-the-art semantic segmentation algorithms are generally based on fully convolutional networks (FCNs) [12], which are an extension of the convolutional neural network (CNN). FCNs utilize classical CNNs to learn image feature representations, but the input images can be of any size. FCNs perform image upsampling to produce a probability mask with the same size as the input image [13].

*These two authors are joint first authors.
FCN-LC [13] is a classical FCN used for semantic road image segmentation. FCN-LC utilizes a network-in-network architecture [13] to learn road region segmentation from labeled training image data. This allows fast inference, even for large contextual image window sizes [13]. In addition, in recent years, a number of conditional random field (CRF)-based neural networks, e.g., PGM-ARS [14], Hybrid [15] and StixelNet [16], have been proposed for semantic road image segmentation. PGM-ARS [14] and StixelNet [16] were trained using monocular images, while Hybrid [15] also employed the 3D road scene information acquired using LiDAR for training. Furthermore, stereo vision [17], [18] has been used to improve road segmentation performance. For example, a so-called BM neural network [17] selects a region of interest (ROI) in an image by analyzing the v-disparity information. Such ROI information greatly minimizes the number of incorrectly segmented pixels. Furthermore, a so-called HistonBoost network [18] post-processes such ROIs using watershed transformation and morphological filtering [19]. In this paper, we draw on the success of [17], [18] and present a perspective transformation (PT)-based deep convolutional network for semantic road segmentation. It is designed using the residual network (ResNet) from DeepLab [20]. The structure of our proposed network is shown in Fig. 1.

The rest of the paper is organized as follows: Section II introduces the proposed PT-ResNet. In Section III, the experimental results are illustrated and the performance of the proposed approach is evaluated. Section IV contains the conclusion and some recommendations for future work.

II. METHODOLOGY
A. Training Data Pre-Processing and Generation
In this paper, the proposed semantic segmentation method focuses entirely on the road surface, which can be treated as a ground plane. According to the perspective transformation algorithm presented in [21], a right image can be transformed into its left view using the disparity projection model. This can greatly enhance the similarity of the road surface between the stereo images [21]. Therefore, in this paper, we first utilize PSMNet [22] to estimate dense disparity maps (see Fig. 2). A v-disparity map is then created by computing the histograms $\hat{p}(d, v)$ of each horizontal row $v$ of the disparity map [23]. To find the path corresponding to the road disparity projection in the v-disparity map, we utilize dynamic programming (DP) to search for every possible solution [24]:

$$E(d, v) = -\hat{p}(d, v) + \min_{\tau = 0}^{\tau_{\max}} \left[ E(d + 1, v - \tau) - \lambda \tau \right], \qquad (1)$$

where $\hat{p}(d, v)$ represents the histogram value at $(d, v)$ in the v-disparity map, $\lambda$ is a smoothness term, and $\tau_{\max}$ is the maximum search range [25]. $E$ represents the energy of each possible solution. The path corresponding to the road disparity projection is generally represented using a linear polynomial [21]:

$$f(v) = \alpha_0 + \alpha_1 v. \qquad (2)$$

The vertical coordinate of the vanishing point, i.e., $v_{py}$, can be estimated using (2). As the vertical coordinates of the road pixels are always larger than $v_{py}$, the image region above the vanishing point can be removed from the left and right images (see Fig. 2). Then, we utilize our previous algorithm [21] to transform the perspective view of the right image. This algorithm improves the road surface similarity in the stereo images, but also distorts obstacles, such as vehicles and trees. Finally, the processed stereo images and the left disparity maps are concatenated together to generate a set of 3D images with seven channels, which are then utilized to train the neural network.

Fig. 1. PT-ResNet structure.
Fig. 2. Training data pre-processing and generation.
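As a minimal illustration of this step, the sketch below builds a v-disparity map by histogramming each row of a disparity map and then evaluates the DP recursion of (1) over it. The function names, the bottom-up iteration order, and the base case at the largest disparity level are our own assumptions, not the authors' implementation; path extraction by backtracking over the energy table is omitted for brevity.

```python
import numpy as np

def build_v_disparity(disp, d_max):
    """Histogram each image row v of the disparity map into a
    v-disparity map p_hat[d, v] (invalid/negative disparities skipped)."""
    h, _ = disp.shape
    p_hat = np.zeros((d_max + 1, h))
    for v in range(h):
        row = disp[v][disp[v] >= 0].astype(int)
        np.add.at(p_hat[:, v], np.clip(row, 0, d_max), 1)
    return p_hat

def road_path_energy(p_hat, lam=0.5, tau_max=2):
    """Evaluate the DP recursion of (1):
    E(d, v) = -p_hat(d, v) + min_tau [E(d + 1, v - tau) - lam * tau],
    iterating d from the largest disparity level down to 0."""
    n_d, h = p_hat.shape
    E = np.full((n_d, h), np.inf)
    E[-1] = -p_hat[-1]                      # assumed base case
    for d in range(n_d - 2, -1, -1):
        for v in range(h):
            E[d, v] = -p_hat[d, v] + min(
                E[d + 1, v - tau] - lam * tau
                for tau in range(tau_max + 1) if v - tau >= 0)
    return E
```

The road disparity projection is then the minimum-energy path through `E`, whose points can be fitted with the linear model (2).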
Fig. 3. The structure of each block unit in Fig. 1.
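The linear model (2) and the row-wise perspective transformation can be sketched as follows, assuming the road path points have already been extracted from the v-disparity map. Fitting by ordinary least squares and shifting each row of the right image by round(f(v)) pixels are our own simplifications of the algorithm in [21]; all names are hypothetical.

```python
import numpy as np

def fit_road_model(vs, ds):
    """Least-squares fit of Eq. (2): f(v) = a0 + a1 * v,
    from road path points (v, d) on the v-disparity map."""
    A = np.stack([np.ones_like(vs, dtype=float), vs.astype(float)], axis=1)
    (a0, a1), *_ = np.linalg.lstsq(A, ds.astype(float), rcond=None)
    return a0, a1

def shift_right_to_left(right, a0, a1):
    """Shift each row v of the right image by round(f(v)) pixels so the
    road surface aligns with the left view; obstacles get distorted."""
    out = np.zeros_like(right)
    h, w = right.shape[:2]
    for v in range(h):
        s = int(round(a0 + a1 * v))
        if 0 <= s < w:
            out[v, s:] = right[v, :w - s]
    return out
```

Rows above the vanishing point (v smaller than v_py, where f(v) reaches zero) carry no road disparity and are the ones removed before training.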
B. PT-ResNet Structure
In recent years, the encoder-decoder structure has been prevalently used in deep neural networks for semantic segmentation [26]. The encoder allows fast high-dimensional image feature map generation, while the decoder enables the network to recover sharp object boundaries [26]. In this paper, our network is designed following the ResNet-101 used in DeepLab-v3+ [26]. The structure of our proposed network is shown in Fig. 1.
1) Encoder:
In the encoder, the spatial dimension of the feature maps reduces gradually across four blocks, as shown in Fig. 1. The structure of each block is shown in Fig. 3, where BN denotes batch normalization and ReLU denotes the rectified linear unit. The ReLU activation function is used to avoid overfitting during training. The parameter of ReLU is set to 0.5 in this paper. BN is a method used to normalize the input of each layer and overcome the internal covariate shift problem. As the stride in each block is set to 2, the output of the fourth block is 256 times smaller than the input of the first block. Furthermore, the baseline utilizes an atrous convolutional layer instead of a conventional convolutional layer. This allows us to enlarge the field-of-view of the filters when interpolating the multi-scale context in the framework of spatial pyramid pooling and cascaded modules [27]. The output of each block can be computed by adding the output of the atrous convolutional layer to the input of the block, as shown in Fig. 3. The output of the fourth block feeds into five branches, as shown in Fig. 3. The baseline uses global average pooling to obtain the global image feature representations. In addition, three atrous convolutional layers with different rates are utilized to acquire multi-scale information. The rates depend entirely on the feature map produced by block 4, and they are set to 4, 8 and 16, respectively. Finally, the outputs of the five branches are concatenated and further compressed using a 1 × 1 convolutional layer.

Fig. 4. Experimental results of road semantic segmentation (threshold is set to 0.9). The green areas in the fourth row are the segmented road surfaces. (a) Processed left images. (b) Transformed right images. (c) Disparity maps. (d) Segmentation results.
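To illustrate why an atrous convolutional layer enlarges the field-of-view without adding parameters, the 1-D sketch below spaces the kernel taps `rate` samples apart, so a kernel of size k covers (k − 1) · rate + 1 input samples. This is a toy illustration of the mechanism, not the network's actual 2-D layers.

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution: the k taps of w are applied
    `rate` samples apart, covering a span of (k - 1) * rate + 1."""
    k = len(w)
    span = (k - 1) * rate + 1
    return np.array([np.dot(w, x[t:t + span:rate])
                     for t in range(len(x) - span + 1)])
```

With rates of 4, 8 and 16, as in the three multi-scale branches, the same 3-tap kernel would cover 9, 17 and 33 samples, respectively, while the parameter count stays fixed at 3.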
2) Decoder:
In the decoder, the baseline applies a skip connection to the feature map produced by the second block. This greatly improves the details of local features in the high-level feature map. The low-level and high-level feature maps are then concatenated together. A probability map can be obtained after a feature map upsampling process. By finding the pixels whose probabilities are higher than our pre-set threshold, the semantic segmentation result can be obtained. Some examples of experimental results are shown in Fig. 4.
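The final thresholding step can be sketched as below, with nearest-neighbour upsampling standing in for the network's learned upsampling and a threshold of 0.9 as in Fig. 4; the function name and scale factor are our own assumptions.

```python
import numpy as np

def segment_road(prob, scale=4, threshold=0.9):
    """Upsample a low-resolution probability map (nearest neighbour here,
    standing in for the learned upsampling) and keep pixels whose
    probability exceeds the pre-set threshold."""
    full = np.repeat(np.repeat(prob, scale, axis=0), scale, axis=1)
    return full > threshold
```

The resulting boolean mask corresponds to the green road regions shown in the fourth row of Fig. 4.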
III. EXPERIMENTAL RESULTS
In this section, we present our experimental results and evaluate the performance of the proposed method using the KITTI road dataset [28]. The dataset contains synchronized stereo road image pairs, 3D road scenery point clouds acquired using a Velodyne HDL-64E LiDAR, calibration parameters, and semantic region segmentation ground truth. The images in this dataset are grouped into three categories: urban unmarked (UU), urban marked (UM) and urban multiple marked (UMM). To quantify the accuracy of the proposed approach, a set of indicators, including maximum F1-measure (MaxF), average precision (AP), precision (PRE), recall (REC), false positive rate (FPR) and false negative rate (FNR), are computed and are publicly available on the KITTI road benchmark. PT-ResNet training was conducted on an NVIDIA GTX 1080 Ti GPU (CUDA 9 and cuDNN v7). In the experiments, the learning rate, training step and batch size are set to 0.001, 30000 and 8, respectively. The approach was programmed in Python. The runtime for segmenting an image from the KITTI dataset is around 3 seconds. In this section, we compare our method with FCN-LC [13], PGM-ARS [14], Hybrid [15], StixelNet [16], BM [17] and HistonBoost [18]. The comparisons of MaxF, AP, PRE, REC, FPR and FNR among these methods are shown in Fig. 5, where urban reflects the overall performance of UM, UMM and UU. It can be observed in Fig. 5(a) that our PT-ResNet method outperforms the others in terms of MaxF; it achieves a MaxF of approximately . , which is slightly higher than that achieved using FCN-LC ( . ). Fig. 5(b) indicates that PT-ResNet performs better than the other networks in terms of AP, as it achieves an AP of approximately . . However, FCN-LC performs slightly better than our network in terms of PRE and FPR (see Fig. 5(c) and 5(e)). The overall PRE and FPR we achieved are . and . , respectively. Additionally, PT-ResNet achieves an intermediate performance in terms of REC and FNR (see Fig. 5(d) and 5(f)), as the REC and FNR values obtained using our method are . and . , respectively. In general, the proposed PT-ResNet achieves the best overall performance and its ranking is higher than that of the other CNNs.

IV. CONCLUSION AND FUTURE WORK
In this paper, we presented a deep neural network for semantic road image segmentation. Since the proposed network focuses entirely on the road surface, the left and right stereo images were processed using our previously published perspective transformation algorithm. This greatly enhanced the similarity of the road surface between the left and right images. The processed stereo images and their corresponding subpixel disparity maps were utilized to create 3D training data. Additionally, we developed our network based on ResNet, a state-of-the-art network with an encoder-decoder structure. According to the evaluation results provided by the KITTI road benchmark, our proposed method outperforms FCN-LC, PGM-ARS, Hybrid, StixelNet, BM and HistonBoost in terms of MaxF and AP, achieving an overall MaxF and AP of . and . , respectively. However, the ResNet from DeepLab-v3+ may not be the best network for learning road semantic segmentation from our created 3D training data. Therefore, we plan to train different state-of-the-art networks, such as VGG-16 and VGG-19, and compare the experimental results with those achieved in this paper.

Fig. 5. Performance evaluation. (a) Comparison of MaxF. (b) Comparison of AP. (c) Comparison of PRE. (d) Comparison of REC. (e) Comparison of FPR. (f) Comparison of FNR.

ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China, under grant No. U1713211, the Research Grants Council of Hong Kong SAR Government, China, under Project No. 11210017 and No. 21202816, and the Shenzhen Science, Technology and Innovation Commission (SZSTI) under grant JCYJ20160428154842603, awarded to Prof. Ming Liu.
REFERENCES
[1] J. A. Brink, R. L. Arenson, T. M. Grist, J. S. Lewin, and D. Enzmann, "Bits and bytes: the future of radiology lies in informatics and information technology," European Radiology, vol. 27, no. 9, pp. 3647–3651, 2017.
[2] R. Fan, J. Jiao, H. Ye, Y. Yu, I. Pitas, and M. Liu, "Key ingredients of self-driving cars," arXiv:1906.02939, 2019.
[3] R. Fan and N. Dahnoun, "Real-time stereo vision-based lane detection system," Measurement Science and Technology, vol. 29, no. 7, p. 074005, 2018.
[4] P. Lin, "Why ethics matters for autonomous cars," in Autonomous Driving.
IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 3, pp. 621–632, 2016.
[7] R. Fan, J. Jiao, J. Pan, H. Huang, S. Shen, and M. Liu, "Real-time dense stereo embedded in a UAV for road inspection," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[8] X. Du, M. H. Ang, S. Karaman, and D. Rus, "A general pipeline for 3D detection of vehicles." IEEE, 2018, pp. 3194–3200.
[9] C. Yan, H. Xie, D. Yang, J. Yin, Y. Zhang, and Q. Dai, "Supervised hash coding with deep neural network for environment perception of intelligent vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 284–295, 2018.
[10] R. Fan, "Real-time computer stereo vision for automotive applications," Ph.D. dissertation, University of Bristol, 2018.
[11] J. Y. Baek, I. V. Chelu, L. Iordache, V. Paunescu, H. Ryu, A. Ghiuta, A. Petreanu, Y. Soh, A. Leica, and B. Jeon, "Scene understanding networks for autonomous driving based on around view monitoring system," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), Jun. 2018, pp. 1074–10747.
[12] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 3431–3440.
[13] C. C. T. Mendes, V. Frémont, and D. F. Wolf, "Exploiting fully convolutional neural networks for fast road detection," in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), May 2016, pp. 3174–3179.
[14] M. Passani, J. J. Yebes, and L. M. Bergasa, "Fast pixelwise road inference based on uniformly reweighted belief propagation," in Proc. IEEE Intelligent Vehicles Symp. (IV), Jun. 2015, pp. 519–524.
[15] L. Xiao, R. Wang, B. Dai, Y. Fang, D. Liu, and T. Wu, "Hybrid conditional random field based camera-lidar fusion for road detection," Information Sciences, vol. 432, pp. 543–558, 2018.
[16] D. Levi, N. Garnett, E. Fetaya, and I. Herzlyia, "StixelNet: A deep convolutional network for obstacle detection and road segmentation," in BMVC, 2015, pp. 109–1.
[17] B. Wang, V. Frémont, and S. A. R. Florez, "Color-based road detection and its evaluation on the KITTI road benchmark," in IEEE Intelligent Vehicles Symposium (IV 2014), 2014, pp. 31–36.
[18] G. B. Vitor, A. C. Victorino, and J. V. Ferreira, "Comprehensive performance analysis of road detection algorithms using the common urban KITTI-road benchmark," in Proc. IEEE Intelligent Vehicles Symp., Jun. 2014, pp. 19–24.
[19] I. Pitas, Digital Image Processing Algorithms and Applications. John Wiley & Sons, 2000.
[20] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, Apr. 2018.
[21] R. Fan, X. Ai, and N. Dahnoun, "Road surface 3D reconstruction based on dense subpixel disparity map estimation," IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 3025–3035, 2018.
[22] J. Chang and Y. Chen, "Pyramid stereo matching network," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Jun. 2018, pp. 5410–5418.
[23] R. Fan and M. Liu, "Road damage detection based on unsupervised disparity map segmentation," IEEE Transactions on Intelligent Transportation Systems, 2019.
[24] R. Fan, U. Ozgunalp, B. Hosking, M. Liu, and I. Pitas, "Pothole detection based on disparity transformation and road surface modeling," IEEE Transactions on Image Processing, vol. 29, pp. 897–908, 2020.
[25] R. Fan, M. J. Bocus, and N. Dahnoun, "A novel disparity transformation algorithm for road segmentation," Information Processing Letters, vol. 140, pp. 18–24, 2018.
[26] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," arXiv preprint arXiv:1802.02611, 2018.
[27] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
[28] J. Fritsch, T. Kühnl, and A. Geiger, "A new performance measure and evaluation benchmark for road detection algorithms," in