Semantic Segmentation of 3D LiDAR Data in Dynamic Scene Using Semi-supervised Learning
Jilin Mei, Biao Gao, Donghao Xu, Wen Yao, Xijun Zhao, Huijing Zhao
Abstract—This work studies the semantic segmentation of 3D LiDAR data in dynamic scenes for autonomous driving applications. A system of semantic segmentation using 3D LiDAR data, including range image segmentation, sample generation, inter-frame data association, track-level annotation and semi-supervised learning, is developed. To reduce the considerable requirement of fine annotations, a CNN-based classifier is trained by considering both supervised samples with manually labeled object classes and pairwise constraints, where a data sample is composed of a segment as the foreground and neighborhood points as the background. A special loss function is designed to account for both annotations and constraints, where the constraint data are encouraged to be assigned to the same semantic class. A dataset containing 1838 frames of LiDAR data, 39934 pairwise constraints and 57927 human annotations is developed. The performance of the method is examined extensively. Qualitative and quantitative experiments show that the combination of a few annotations and a large amount of constraint data significantly enhances the effectiveness and scene adaptability, resulting in greater than 10% improvement.
Index Terms—3D LiDAR data, semantic segmentation, semi-supervised learning.
I. INTRODUCTION
Scene understanding is crucial for the safe and efficient navigation of autonomous vehicles in complex and dynamic environments, and semantic segmentation is a key technique. 3D LiDAR has been used as one of the main sensors in many prototyping systems for fully autonomous driving [1]. Semantic segmentation using 3D LiDAR data is illustrated in Fig. 1, where given a frame of input data (a), the problem is to find a meaningful label (i.e., an object class in this research) for each pixel, super-pixel or region of the data (b). As 3D LiDAR is a 2.5D sensing of the surroundings, it can be represented equivalently in the form of a range image (c)-(d) in the polar coordinate system, and the problem of semantic segmentation can be solved by using either 3D points or range images as the input.

Semantic segmentation using 3D LiDAR data from outdoor scenes has been studied for over a decade [1]-[3]. The traditional process in these works [4], [5] includes the following steps: (1) preprocessing to divide a whole dataset into locally consistent small units, such as voxels, segments or clusters; (2) extracting a sequence of predefined features; (3) learning a classifier via SVM, random forest, etc.; and (4) refining the results using a method such as a conditional random field by considering the spatial consistency among neighboring units. The traditional methods depend on carefully designed discriminative features, and their adaptability to different scenes remains an open challenge.
This work is supported by the National Key Research and Development Program of China (2017YFB1002601) and the NSFC Grants (61573027). J. Mei, B. Gao, D. Xu and H. Zhao are with Peking University, with the Key Laboratory of Machine Perception (MOE), and also with the School of Electronics Engineering and Computer Science. W. Yao and X. Zhao are with China North Vehicle Research Institute, Beijing, China. Correspondence: H. Zhao, [email protected].

Fig. 1. The semantic segmentation for a dynamic scene. (a) and (c) show the input data in two formats, i.e., the raw 3D point clouds and the range frame. (b) and (d) show the semantic segmentation results.

The recent success of deep learning in image semantic segmentation has provided new approaches [6]. These methods remove the dependence on handcrafted features in an end-to-end manner. However, they also have substantial demands for finely labeled data [6]. On one hand, pixel-wise annotation is extremely time consuming; on the other hand, few 3D LiDAR datasets with annotation at the point level are publicly available to support autonomous driving applications. Therefore, it is necessary to develop a semantic segmentation method using 3D LiDAR data with only a small set of supervised data, for which semi-supervised learning is adopted.

Semi-supervised learning methods, which integrate labeled and unlabeled data, have been studied extensively in the field of machine learning [7]. In [8], [9], pairwise constraints are collected from unlabeled data to describe the probability of two samples sharing the same or different labels. LiDAR sensors measure the 3D coordinates of an object directly and can be used to associate the same object in the data of subsequent frames according to their locations after ego-motion compensation. A tracking method can be used for data association for moving objects such as people, cyclists and cars. Such associated data constitute pairwise constraints, which are inexpensive due to their abundant nature and autonomous generation.

In this paper, we propose a semantic segmentation method for 3D LiDAR data from dynamic urban scenes that integrates semi-supervised learning and deep learning. A CNN-based classifier is trained by considering both supervised samples with manually labeled object classes and pairwise constraints, where a data sample is composed of a segment as the foreground and the neighborhood points as the background. A system of semantic segmentation using 3D LiDAR data, including range image segmentation, sample generation, inter-frame data association, track-level annotation and semi-supervised learning, is developed. A dataset containing 1838 frames of LiDAR data, 39934 pairwise constraints and 57927 human annotations is generated using 3D LiDAR data from a dynamic campus scene.
Qualitative and quantitative experiments show that the combination of a small amount of annotation data and a large amount of constraint data significantly improves the effectiveness and scene adaptability of the classifier.

The remainder of this paper is organized as follows. Related work is discussed in Sect. II. In Sect. III, the proposed method is presented. In Sect. IV, the algorithm details are discussed. Then, we present the experimental results in Sect. V and draw conclusions in Sect. VI.

II. RELATED WORKS
Numerous studies on semantic segmentation have been conducted; recent progress for image and RGB-D data is reviewed in [6]. In this section, we focus on methods for 3D point clouds (i.e., from LiDAR sensors) in dynamic outdoor scenes. These works can be broadly divided into three classes: feature-based methods, deep learning methods, and semi-supervised learning methods.
A. Feature-based Methods
Feature-based methods belong to traditional machine learning, and their general process consists of feature selection, classifier design and graphical model description.

A straightforward technique is to convert semantic segmentation into point-wise classification: extract features on each unit, concatenate the features as a vector and determine the label via a well-trained classifier. [10] presents a common pipeline from feature selection to classifier training. Due to the irregular arrangement of point clouds, the authors test 5 definitions of neighborhood to achieve the best representation. Similar research is conducted in [13], which demonstrates the ability to address varying densities of data. A single point cloud usually contains millions of points, so evaluating the label for each point is typically computationally expensive (on the order of minutes according to [13]). [14] proposes an efficient approach in which speed and accuracy are satisfied simultaneously; furthermore, the average classification time can be reduced to less than 1 s. [5] represents the raw point cloud as a 2D range image and proposes a framework for simultaneous segmentation and classification that considers both the 2D range image and the 3D raw data. Straightforward approaches assume that each data unit is independent and ignore the spatial and contextual relations between units. Consequently, they can produce good results based on distinctive features. However, when the features are not discriminative, the point-wise classification will be noisy and locally inconsistent [4].

Neighboring elements are taken into account to make the segmentation results spatially smooth. For this purpose, graphical models such as Markov random fields (MRF) and conditional random fields (CRF) are usually exploited to encode the spatial relationships. In [11], the node potentials and edge potentials are both formulated with a parametric linear model, and functional max-margin learning is used to find the optimal weights. [16] proposes a simplified Markov network to infer the contextual relations between points. Instead of learning all the weights for the node and edge potentials in graphical models, the node potentials are calculated from a point-wise classifier, and the edge potentials are determined by the physical distance between points.

The performance of the above methods largely depends on handcrafted features. These methods are effective in fixed or regular scenarios, but for dynamic scenes the empirically designed features lose their discriminative power and the performance decreases.
B. Deep Learning Methods
Deep learning, especially the convolutional neural network (CNN) without handcrafted features, has shown strong performance on 2D image segmentation [6]. However, the semantic segmentation of 3D point clouds (i.e., from LiDAR sensors) is still an open research problem [17] due to their irregular, non-grid-aligned structure. Therefore, recent studies either project the point clouds into 2D views or attempt to model the raw data in direct ways, for example, with volumetric/voxel representations.

Inspired by the success of CNNs in image segmentation, state-of-the-art image-based algorithms can be used directly after rendering 2D views from the 3D raw data. [20] projects point clouds into virtual 2D RGB images via Katz projection. Then, a pretrained CNN is used to semantically classify the images. However, this projection removes all the points that are not visible; for example, if a car is projected, all the points behind it are removed. [22] unwraps 360° 3D LiDAR data onto a spherical 2D plane without point loss. Spherical projection is also applied in SqueezeSeg [25], where the CNN directly outputs the point-wise label of the transformed LiDAR data and a CRF is applied to refine the outputs. [26] uses cylindrical projection to create depth and reflectivity images. In [27], the point clouds are encoded as top-view images and a simple fully convolutional neural network (FCN) is used. This method can run in real time because only elevation and density features are extracted. In [28], the input point cloud is projected into multiple views, such as color, depth and surface normal images.
TABLE I: APPROACHES FOR POINT CLOUD SEMANTIC SEGMENTATION.

Research | Learning Method | Input | Classifier | Dataset | Scene
Munoz, 2009 [4] | sup. | L | MRF | - | rural
Zhao, 2010 [5] | sup. | L | SVM | - | campus
Weinmann, 2014 [10] | sup. | L | RF | VMR-Oakland [11], Paris-rue-Madame [12] | urban
Munoz, 2009 [11] | sup. | L/V | MRF | - | -
Hackel, 2016 [13] | sup. | L | RF | Paris-rue-Madame | urban
Hu, 2013 [14] | sup. | L | KLR | VMR-Oakland, Freiburg [15] | urban
Lu, 2012 [16] | sup. | L | CRF, SVM | VMR-Oakland | urban
Engelmann, 2017 [17] | sup. | L/V | DL | S3DIS [18], vKITTI [19], KITTI | indoor, urban, urban
Tosteberg, 2017 [20] | sup. | L | DL | Semantic3D, VPS [21] | urban, indoor
Dewan, 2017 [22] | sup. | L | DL | KITTI [23] | urban
Hackel, 2017 [24] | sup. | L | DL | Semantic3D [24] | urban/rural
Wu, 2017 [25] | sup. | L | DL | KITTI | urban
Piewak, 2018 [26] | sup. | L/V | DL | - | urban/rural/highway
Caltagirone, 2017 [27] | sup. | L | DL | KITTI | urban
Lawin, 2017 [28] | sup. | L | DL | Semantic3D | urban/rural
Tchapmi, 2017 [29] | sup. | L/V | DL | NYU V2 [30], Semantic3D, S3DIS, KITTI | indoor, urban/rural, indoor, urban
Riegler, 2017 [31] | sup. | L | DL | ModelNet10 [32] | CAD model
Qi, 2017 [33] | sup. | L/V | DL | ShapeNet [34], Stanford3D [35] | CAD model, indoor
Landrieu, 2017 [36] | sup. | L/V | DL | Semantic3D, S3DIS | urban/rural, indoor
Bearman, 2016 [37] | semi-sup. | V | DL | PASCAL VOC [38] | -
Yan, 2006 [39] | semi-sup. | V | KLR, SVM | - | nursing home
Bauml, 2013 [40] | semi-sup. | V | KLR, SVM | - | TV series
Hong, 2015 [41] | semi-sup. | V | DL | PASCAL VOC | -
Papandreou, 2015 [42] | semi-sup. | V | DL | PASCAL VOC | -
Lin, 2016 [43] | semi-sup. | V | DL | PASCAL VOC | -
Cour, 2009 [44] | weakly-sup. | V | linear classifier | - | TV series
Pathak, 2015 [45] | weakly-sup. | V | DL | PASCAL VOC | -
Xu, 2015 [46] | weakly-sup. | V | linear classifier | SIFT Flow [47] | -
Dai, 2015 [48] | weakly-sup. | V | DL | PASCAL VOC | -

sup.: supervised; L: LiDAR data; V: camera data; RF: random forest; KLR: kernel linear regression; DL: deep learning.
Another type of method models the raw data in direct ways. [29] proposes SEGCloud, where the raw 3D point cloud is preprocessed into a voxelized point cloud with a fixed grid size. Although [29] is simple and effective, how to set the voxel size is a problem in large-scale scenes. Thus, the scene is voxelized at five different resolutions in [24], and each of the five scales is handled separately by the CNN. Rather than using a fixed grid size, [31] proposes OctNet, where a hybrid grid-octree data structure is applied to represent the raw 3D data, and each leaf of the octree stores a pooled feature representation. PointNet [33] is a unified architecture that directly takes raw point clouds as input and outputs the label of each point. The scene is divided into blocks. Then, the points in each block are passed through a series of multilayer perceptrons (MLPs) to extract the local and global features. Based on [33], [17] extends the method to incorporate a larger-scale spatial context, and improved results are reported in both indoor and outdoor scenarios. [36] proposes a more elegant architecture to capture contextual relations. The first step is to partition the raw point cloud into geometrically simple shapes, called super-points. The super-points are then embedded by PointNet [33].

Semantic segmentation with deep learning is usually implemented in a supervised manner, which requires detailed annotations. However, obtaining point-wise annotations for 3D point clouds is labor intensive and time consuming. Furthermore, few public datasets support this level of annotation.
C. Semi-supervised Learning Methods
Considering the large demand for detailed annotations, many researchers study semi-supervised learning methods, which integrate fewer labeled data and more unlabeled data, and weakly supervised learning methods, which use multiple ambiguous labels. Our research belongs to the semi-supervised category; please refer to TABLE I for weakly supervised methods.

The early work [8] on semi-supervised learning specifies the prior knowledge of unlabeled data via pairwise constraints.
Fig. 2. The framework of semi-supervised learning for 3D point cloud semantic segmentation.

A pairwise must-link constraint means two objects must have the same label, although the label is unknown, whereas two objects associated via a must-not-link constraint must have different labels. Both labeled and constraint data are used for model fitting, and the authors model the constraint information by the maximum entropy principle. A similar idea is presented in [9], where the pairwise constraints are incorporated in the clustering of a Gaussian mixture model (GMM). [8], [9] present the foundation of semi-supervised learning with pairwise constraints.

[39] extends the method to video object classification, where temporal relations between frames, multi-modalities, such as faces and voices, and human feedback (manual annotation) are considered and formulated in a unified framework. [40] applies constraints for person identification in multimedia data and achieves state-of-the-art performance on two diverse TV series. Recently, semi-supervised learning has also been integrated with deep neural networks (DNN). [41] decouples semantic segmentation into classification and segmentation using a large number of image-level object labels and a small number of pixel-wise annotations. The classification network specifies the class-specific activation maps, which are transferred into the segmentation network with bridge layers. Then, the segmentation network requires only a few pixel-wise annotations to train the model, e.g., 5 or 10 strong annotations per class. [42] designs an expectation-maximization (EM) training method for semantic image segmentation by combining bounding boxes, image-level labels and a few strongly labeled images.

Semi-supervised learning has been successfully applied in image segmentation [41] and video analysis [39]. However, to the best of the authors' knowledge, few studies discuss semi-supervised learning in the context of 3D point cloud semantic segmentation. In this paper, we attempt to combine semi-supervised learning and neural networks to solve the problem of how to perform 3D point cloud semantic segmentation with insufficient point-wise annotations.

III. METHODOLOGY
A. Problem Definition
Let s denote a small segment of 3D LiDAR data extracted by examining the consistency of 3D points with their neighborhood in the range image frame using, e.g., a region growing method. Without loss of generality, we assume that the 3D points of s are measurements of a single object. However, s commonly represents only a part of the object, e.g., the upper body of a pedestrian, due to the nature of LiDAR measurements. Hence, for each s, a data sample x is generated centered at s, containing s as the foreground and its neighborhood data as the background, as shown in Fig. 2.

Fig. 3. The flowchart of implementation.
The problem in this work is formulated as learning a multi-class classifier f_\theta that maps x to a label y \in \{1, \ldots, K\} and subsequently associates y with the 3D points of s:

f_\theta : x \rightarrow y \in \{1, \ldots, K\} \qquad (1)

Given a set of supervised data samples X_l = \{x_i, y_i\}_{i=1}^{M}, where \{y_i\} are one-hot vectors annotated manually by human operators for each \{x_i\}, a common way of learning a classifier f_\theta is to find the best \theta^* that minimizes a loss function L, as below:

\theta^* = \arg\min_\theta L(X_l; \theta) \qquad (2)

However, the problem of generating a large amount of supervised data is not trivial. This research learns f_\theta with a small set of costly supervised data X_l and an additional large set of autonomously generated constraints X_c = \{\langle x_i, x_j \rangle_n\}_{n=1}^{N}:

\theta^* = \arg\min_\theta L(X_l, X_c; \theta) \qquad (3)

where a constraint \langle x_i, x_j \rangle denotes that x_i and x_j have the same label, i.e., y_i = y_j; such constraints are generated autonomously in this research by associating data segments along sequential frames according to their locations after ego-motion compensation.

This semi-supervised loss function L is a combination of the loss on supervised data X_l and the loss on constraints X_c:

L(X_l, X_c; \theta) = L_l(X_l; \theta) + L_c(X_c; \theta) \qquad (4)

B. Supervised Loss
For supervised data X_l, we follow the widely used definition of cross entropy, e.g., [33], and define the loss L_l as below:

L_l(X_l; \theta) = -\frac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{K}[y_i^k = 1]\,\ln(P_\theta^k(x_i)) \qquad (5)

where [\ast] is an indicator function, and P_\theta^k(x_i) is the probability that x_i is assigned label k by a classifier with the set of parameters \theta.

C. Constraint Loss
For each constraint \langle x_i, x_j \rangle, let y_i and y_j denote the labels of x_i and x_j, respectively, as predicted by a learned classifier f_\theta. A penalty is applied if y_i \neq y_j. Subsequently, the loss is estimated for each constraint based on the probability that the constrained data samples are assigned different labels:

P(y_i \neq y_j) = \sum_{k=1}^{K}\sum_{l=1, l \neq k}^{K} P_\theta^k(x_i)\,P_\theta^l(x_j) = 1 - \sum_{k=1}^{K} P_\theta^k(x_i)\,P_\theta^k(x_j) \qquad (6)

Hence, we define the loss on constraints L_c as below:

L_c(X_c; \theta) = \frac{\gamma}{N}\sum_{n=1}^{N} P(y_i \neq y_j) = \frac{\gamma}{N}\sum_{n=1}^{N}\Big(1 - \sum_{k=1}^{K} P_\theta^k(x_i)\,P_\theta^k(x_j)\Big), \quad \gamma \in [0, 1] \qquad (7)

where \gamma is a weighting factor. Clearly, L_c describes the unsupervised learning procedure.

D. Semi-supervised Loss
Combining the losses from supervised data X_l and constraints X_c, the semi-supervised loss is defined as below:

L(X_l, X_c; \theta) = L_l(X_l; \theta) + L_c(X_c; \theta) = -\frac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{K}[y_i^k = 1]\,\ln(P_\theta^k(x_i)) + \frac{\gamma}{N}\sum_{n=1}^{N}\Big(1 - \sum_{k=1}^{K} P_\theta^k(x_i)\,P_\theta^k(x_j)\Big), \quad \gamma \in [0, 1]. \qquad (8)
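For concreteness, the loss of Equation (8) takes only a few lines in a modern autodiff framework. The paper does not name an implementation framework; the sketch below uses PyTorch, and all tensor names and the default value of gamma are placeholders:

import torch
import torch.nn.functional as F

def semi_supervised_loss(logits_l, labels_l, logits_i, logits_j, gamma=0.5):
    # Supervised term, Eq. (5): mean cross entropy over the M labeled samples.
    loss_l = F.cross_entropy(logits_l, labels_l)

    # Constraint term, Eq. (7): for each pair <x_i, x_j>, penalize the
    # probability 1 - sum_k P_k(x_i) * P_k(x_j) that the two samples
    # receive different labels; gamma in [0, 1] weights the term.
    p_i = F.softmax(logits_i, dim=1)
    p_j = F.softmax(logits_j, dim=1)
    loss_c = gamma * (1.0 - (p_i * p_j).sum(dim=1)).mean()

    return loss_l + loss_c

When either the labeled set or the constraint set is empty, the corresponding term is simply dropped, which recovers the purely supervised or purely unsupervised cases formalized later in Equation (10).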
IV. ALGORITHM DETAILS

A. Process Flow
Fig. 3 describes the major modules of the workflow. Sample generation and semi-supervised learning are detailed in the next subsections. Here, we describe the remaining modules. Although traditional methods are used in these modules, their integration and the design of the data pipelines are important in constructing a complete system.
Fig. 4. The details for segmentation, constraint generation and annotation. The first row is the range image; the second row is the segmentation result, and the black rectangle shows a car separated into three parts; the third row shows two adjacent samples constituting one constraint; the fourth row is the human annotation.

As illustrated in Fig. 4, segmentation is conducted at the frame level of a range image, which is a representation of 3D LiDAR data in the polar coordinate system. The columns and rows of a range image correspond to tessellated horizontal and vertical angles, and each pixel value is the range distance of the laser point in that direction. A Velodyne HDL-32E is used in this research. Hence, a LiDAR frame contains 32 scan lines at different vertical angles, and each scan line has 2160 laser points at different horizontal angles. These laser points are projected onto a range frame and reshaped to size (144, 1080).

Region growing is conducted to extract segments, taking unlabeled pixels as seeds and examining 4-connectivity, where the thresholds on the vertical and horizontal range differences of two connected pixels are assigned empirically with a set of test data. As detailed in the next subsection, a segment is used as the foreground data in sample generation, and segmentation is treated only as a preprocessing step. Many methods can be exploited as long as the following condition is met: the 3D points of segment s are measurements of a single object.

Constraints are generated by inter-frame data association, as sketched in the code after this subsection. For any segment s_t in frame t, the 3D points can be back-projected to the LiDAR sensor's coordinate system in the previous frame t-1 based on the vehicle's motion data and calibration parameters. A segment s_{t-1} is associated with s_t if it matches the 3D points. Let x_i and x_j denote the samples of s_{t-1} and s_t, respectively; a pairwise constraint ⟨x_i, x_j⟩ is subsequently generated.

As illustrated in Fig. 4, data association is conducted over the entire stream, and sequences of segments are extracted as the result. An operator examines each sequence. If the sequence is tracked correctly, a label is assigned to all the segments (and the samples) of the sequence, and constraints are generated for all subsequent samples. Otherwise, a sequence is manually truncated to remove the erroneous tracking part or even discarded.

The dataset with supervised labels is divided into two groups, X_l and X_g, for training and testing, respectively. The set of pairwise constraints X_c is used for training only.
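As an illustration of the association step, the sketch below back-projects a segment's points from frame t into the sensor frame at t-1 and looks for an overlapping segment there. The 4x4 sensor-to-world poses are assumed to come from the GPS/IMU and calibration data mentioned above; the centroid-distance match and its threshold are simplifications for brevity, not the paper's exact matching rule:

import numpy as np

def back_project(points_t, pose_t, pose_t_prev):
    # Map Nx3 points from the sensor frame at time t into the sensor
    # frame at time t-1 via the world frame (row-vector convention).
    pts_h = np.hstack([points_t, np.ones((len(points_t), 1))])  # Nx4 homogeneous
    world = pts_h @ pose_t.T                                    # sensor_t -> world
    prev = world @ np.linalg.inv(pose_t_prev).T                 # world -> sensor_{t-1}
    return prev[:, :3]

def associate(seg_t, segments_prev, pose_t, pose_t_prev, dist_thresh=0.3):
    # Return the previous-frame segment matched to seg_t, or None.
    # A successful match yields one pairwise constraint <x_i, x_j>.
    centroid = back_project(seg_t, pose_t, pose_t_prev).mean(axis=0)
    for seg_prev in segments_prev:
        if np.linalg.norm(seg_prev.mean(axis=0) - centroid) < dist_thresh:
            return seg_prev
    return None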
B. Sample Generation

A straightforward method is to use a segment s directly as the input of a classifier, i.e., s = x. However, the performance of such a classifier can be degraded if s represents only a part of the target object. This situation often occurs in the segmentation results of 3D LiDAR data. Due to diffusive reflection or the weak reflectance of LiDAR measures on reflective or dark objects, many failures that yield discontinuities in the data of a single object exist in 3D LiDAR measurements. As indicated by the black rectangle in the second row of Fig. 4, three segment pieces are extracted from the data of a single car, and it is difficult to recognize the car given the data of only one piece. However, by placing each piece of the segment into the background of its surrounding data, the car is easily recognized.

Inspired by the above idea, given a segment s, this work generates a sample x containing s as the foreground and its neighborhood data as the background. As illustrated in Fig. 5(b), a cuboid centered at s with a size of 2.4 m x 5 m x 5 m (height x width x length) is drawn. The LiDAR points inside the cuboid are extracted, and their pixels on the range image are projected onto a canvas with a size of 256x256 to obtain sample x, where each pixel is composed of three channels: 1) Height: distance to the ground surface mapped to [0,255]; distances in [0 m, 6 m] are mapped linearly to [0,255], and distances less than 0 m or greater than 6 m are mapped to 0 and 255, respectively. 2) Range: distance to the sensor, normalized to [0,255]. 3) Intensity: reflectance of the LiDAR point in [0,255].

Fig. 5. The procedure of sample generation. (a) The yellow segment s in the black rectangle is chosen as the candidate region. (b) The raw points inside a cuboid centered at s are cropped, where the cuboid size is 2.4 m x 5 m x 5 m (height, width, length) and the points are colored by range value; these points are then projected onto the range image to make one sample consisting of three channels. (c) The range channel of the sample; s is marked in red for better visualization. (d) The height channel of the sample. (e) The intensity channel of the sample.

Fig. 6. The classifier for offline semi-supervised learning.

As a small or distant segment may provide insufficient information for reliable classification, the following criterion is applied:

\frac{s_n}{s_d} > \rho \;\cap\; s_n > \sigma \qquad (9)

where s_n is the number of LiDAR points of segment s, s_d is the distance from the LiDAR sensor to the center point of s, and \rho and \sigma are empirically assigned thresholds. Segment s is valid for sample generation if it has more points than \sigma and its point-to-distance ratio is larger than \rho. In this research, \sigma = 8 and \rho = 30.
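The validity test of Equation (9) and the per-pixel channel mapping are simple to state in code. A minimal NumPy sketch follows; the range normalization scheme and the array names are assumptions, since the text only says the range is normalized to [0, 255]:

import numpy as np

RHO, SIGMA = 30.0, 8.0  # thresholds from the text: rho = 30, sigma = 8

def segment_is_valid(segment_xyz, sensor_origin=np.zeros(3)):
    # Eq. (9): keep segment s if s_n > sigma and s_n / s_d > rho,
    # where s_n is the point count and s_d the sensor-to-center distance.
    s_n = len(segment_xyz)
    s_d = np.linalg.norm(segment_xyz.mean(axis=0) - sensor_origin)
    return s_n > SIGMA and s_n / s_d > RHO

def encode_channels(height_m, range_m, intensity):
    # Height: [0 m, 6 m] mapped linearly to [0, 255], clipped outside.
    height = np.clip(height_m / 6.0, 0.0, 1.0) * 255.0
    # Range: normalized to [0, 255]; per-sample max normalization assumed.
    rng = range_m / max(range_m.max(), 1e-6) * 255.0
    # Intensity: reflectance already in [0, 255].
    return np.stack([height, rng, intensity], axis=-1).astype(np.uint8)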
C. Semi-supervised Learning

A CNN is used as the classifier f_\theta, as shown in Fig. 6, with a specifically designed input and loss function. We rewrite the loss function of Equation (8) as below:

L(X_l, X_c; \theta) = \begin{cases} L_c(X_c; \theta), & M = 0, N \neq 0 \\ L_l(X_l; \theta), & M \neq 0, N = 0 \\ L_l(X_l; \theta) + L_c(X_c; \theta), & M \neq 0, N \neq 0. \end{cases} \qquad (10)

If no supervised samples exist, i.e., M = 0, N \neq 0, the loss function degenerates to L_c(X_c; \theta), representing unsupervised learning. If there are no constraints, i.e., M \neq 0, N = 0, the loss function becomes L_l(X_l; \theta), representing supervised learning. Finally, if both supervised samples and constraints exist, i.e., M \neq 0, N \neq 0, the loss function takes its full form, L_l(X_l; \theta) + L_c(X_c; \theta), representing semi-supervised learning. The classifier is designed to adapt to all the above cases.
Algorithm 1: The training of the CNN classifier
Input: X_c, X_l
Output: the classifier parameters \theta
1:  Initialize \Phi_l, \Phi_c as empty.
2:  Make input pairs:
3:    \Phi_c \leftarrow \langle x_i, x_j; \xi = 1 \rangle, \forall \langle x_i, x_j \rangle \in X_c
4:    \Phi_l \leftarrow \langle x_i, y_i, x_j, y_j; \xi = 0 \rangle, \forall \langle x_i, y_i, x_j, y_j \rangle \in X_l
5:  for each step in training do
6:    \Phi_c^n \leftarrow take n items from \Phi_c, n > 0
7:    \Phi_l^m \leftarrow take m items from \Phi_l, m > 0
8:    \Phi = \Phi_l^m \cup \Phi_c^n
9:    for each item in \Phi do
10:     if \xi = 0 then
11:       L(X_l, X_c; \theta) = L_l(X_l; \theta) = L_l(x_i, y_i, x_j, y_j; \theta)
12:     else
13:       L(X_l, X_c; \theta) = L_c(X_c; \theta) = L_c(x_i, x_j; \theta)
14:     Do backward learning.
15: return \theta

To allow both supervised samples and pairwise constraints in model training, the CNN is designed to take two samples x_i and x_j as input and output their labels y_i and y_j simultaneously. An indicator \xi is used to specify whether the two samples are a constrained pair (\xi = 1) or individuals (\xi = 0). Hence, the loss functions are converted to the following.
L(X_l, X_c, \xi = 0; \theta) = L_l(X_l; \theta) = L_l(x_i, y_i, x_j, y_j; \theta) = -\frac{1}{2}\sum_{t \in \{i, j\}}\sum_{k=1}^{K}[y_t^k = 1]\,\ln(P_\theta^k(x_t));

L(X_l, X_c, \xi = 1; \theta) = L_c(X_c; \theta) = L_c(x_i, x_j; \theta) = \frac{1}{2}\Big(1 - \sum_{k=1}^{K} P_\theta^k(x_i)\,P_\theta^k(x_j)\Big). \qquad (11)

In training, the supervised samples X_l and constraints X_c are randomly fed into the CNN, and the model parameters are adjusted in the traditional back-propagation manner. Pseudocode of the training process is given in Algorithm 1. As the focus of this research is to learn a classifier using a small number of expensive supervised samples and a large number of inexpensive constraints, the unbalanced sample problem should be considered. In this study, we set the number of constraints to 5 times the number of supervised samples, which is described as n/m = 5 in lines 6-7 of Algorithm 1. In online classification, we have \xi \equiv 0, and in the case of only one sample, we have x_i = x_j.

TABLE II: THE DATA SET.

Part | Frame | Dist. (m) | Type | People | Car | Cyclist | Trunk | Bush | Building | Unknown | Total
A | 0~414 | 0~184 | cons. | 682 | 1897 | 196 | 975 | 3268 | 2450 | 0 | 9468
A | 0~414 | 0~184 | anno. | 735 | 2079 | 228 | 1048 | 4330 | 2825 | 2423 | 13668
B | 414~829 | 184~350 | cons. | 681 | 1896 | 195 | 977 | 3268 | 2450 | 0 | 9467
B | 414~829 | 184~350 | anno. | 736 | 2080 | 227 | 1048 | 4330 | 2825 | 2423 | 13669
C | 829~1373 | 350~630 | cons. | 909 | 2529 | 271 | 1301 | 4357 | 3267 | 0 | 12634
C | 829~1373 | 350~630 | anno. | 980 | 2772 | 304 | 1398 | 5773 | 3767 | 3230 | 18224
D | 1373~1838 | 630~890 | cons. | 925 | 3542 | 142 | 62 | 2440 | 1254 | 0 | 8365
D | 1373~1838 | 630~890 | anno. | 962 | 4157 | 169 | 89 | 3773 | 1549 | 1667 | 12366
Total | 1838 | 890 | cons. | | | | | | | | 39934
Total | 1838 | 890 | anno. | | | | | | | | 57927

A-D correspond to the routes in Fig. 7. Dist.: the travel distance. cons.: constraint. anno.: annotation.

Fig. 7. The routes of data collection and the platform configuration. The 30%/60% on the left means 30%/60% of the route traveled for training data.
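Algorithm 1 and Equation (11) amount to a mini-batch loop that mixes labeled pairs with constraint pairs (n/m = 5 in this work) and dispatches on the indicator xi. The following is a minimal PyTorch-style sketch, not the authors' implementation; the dataset lists, model and optimizer objects are placeholders:

import random
import torch
import torch.nn.functional as F

def train(model, optimizer, labeled, constraints, steps, m=4, n=20):
    # `labeled` holds tuples (x_i, y_i, x_j, y_j, xi=0); `constraints`
    # holds tuples (x_i, x_j, xi=1). n/m = 5 mirrors the 5:1
    # constraint-to-annotation ratio described in the text.
    for _ in range(steps):
        batch = random.sample(labeled, m) + random.sample(constraints, n)
        loss = torch.zeros(())
        for item in batch:
            if item[-1] == 0:   # xi = 0: supervised pair, top of Eq. (11)
                x_i, y_i, x_j, y_j, _ = item
                logits = model(torch.stack([x_i, x_j]))
                loss = loss + F.cross_entropy(logits, torch.tensor([y_i, y_j]))
            else:               # xi = 1: constraint pair, bottom of Eq. (11)
                x_i, x_j, _ = item
                p = torch.softmax(model(torch.stack([x_i, x_j])), dim=1)
                loss = loss + 0.5 * (1.0 - (p[0] * p[1]).sum())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Note that F.cross_entropy averages over the two stacked samples, which supplies the 1/2 factor of Equation (11).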
V. EXPERIMENTAL RESULTS

A. Data Set

The performance of the proposed method is evaluated on a dynamic campus dataset collected by an instrumented vehicle with a GPS/IMU suite and a Velodyne HDL-32E, as shown in Fig. 7. The total route is approximately 890 meters. All sensor data are collected, and each data frame is associated with a time log for synchronization. The GPS/IMU data are logged at 100 Hz. The LiDAR data are recorded at 10 Hz and include 1373 frames of training data (red line in Fig. 7) and 465 frames of testing data (black line in Fig. 7). One frame can produce multiple samples; for example, we obtain 6931 car samples from the 1373 frames of training data in TABLE II. In total, we obtain 1838 frames of LiDAR data, 39934 constraints and 57927 manual annotations.

The details of the dataset are listed in TABLE II. Six labels are used, i.e., person, car, cyclist, trunk, bush and building. These categories are important for driving applications on a campus; other labels, such as pole and cone, are marked as unknown. We do not generate constraints for the unknown label.
B. Result - Classifier Training

1) Experimental settings:
There are five experimental settings for the training data, as shown in TABLE III. The training data contain 3 parts, i.e., A, B and C in TABLE II. A total of 70% of the data are randomly selected for training, and the remaining 30% are selected for validation. Except for baseline max, the settings use only a small number of annotations. Baseline min uses 350 annotations per class without constraints, while baseline max uses all the annotations. Thus, the baseline methods work in a supervised manner. In general, baseline min sets the lower performance bound and baseline max sets the upper performance bound. Constraint30 uses the same annotations as baseline min but with additional constraints; the tail "30" means all constraints from 30% of the travel are used, i.e., route A from the start to the 184 m point (30%) in Fig. 7.
TABLE III: THE EXPERIMENTAL SETTINGS ON TRAINING DATA.

Settings | Type | People | Car | Cyclist | Trunk | Bush | Building | Unknown
baseline min | constraint | 0 | 0 | 0 | 0 | 0 | 0 | 0
baseline min | annotation | 350 | 350 | 350 | 350 | 350 | 350 | 5650
baseline max | constraint | 0 | 0 | 0 | 0 | 0 | 0 | 0
baseline max | annotation | 1715 | 4851 | 531 | 2445 | 10103 | 6591 | 5650
constraint30 | constraint | 682 | 1897 | 196 | 976 | 3268 | 2450 | 0
constraint30 | annotation | 350 | 350 | 350 | 350 | 350 | 350 | 5650
constraint60 | constraint | 1363 | 3793 | 391 | 1952 | 6536 | 4900 | 0
constraint60 | annotation | 350 | 350 | 350 | 350 | 350 | 350 | 5650
constraint100 | constraint | 2272 | 6322 | 652 | 3253 | 10893 | 8167 | 0
constraint100 | annotation | 350 | 350 | 350 | 350 | 350 | 350 | 5650
Fig. 8. The qualitative results on training data. Rectangles a and b show that constraint100 has better performance on cars, and ellipse c shows that constraint100 makes a classification error when a person is near a bush.

TABLE IV: F-MEASURE ON TRAINING DATA.

Learning Method | Classifier | People | Car | Cyclist | Trunk | Bush | Building | Unknown
sup. | baseline min | 77.2 | 68.5 | 66.1 | 81.0 | 56.2 | 78.7 | 33.9
semi-sup. | constraint30 | 77.2 | 83.0 | 69.6 | 86.6 | 61.9 | 81.8 | 36.8
semi-sup. | constraint60 | 81.5 | 85.8 | 71.3 | 90.4 | 80.7 | 86.6 | 48.3
semi-sup. | constraint100 |
sup. | baseline max |
Fig. 9. The quantitative comparison on training data.

Constraint60 and constraint100 have similar configurations to constraint30, except for the amount of constraint data. According to the settings, the five classifiers are learned separately offline. All the classifiers have the same network architecture, as detailed in Sect. IV-C. The difference lies in the loss function, where baseline min and baseline max use only L_l(X_l; \theta) in Equation (8), and the other settings use both L_l(X_l; \theta) and L_c(X_c; \theta). For offline learning, the classifier's parameters are saved at fixed training steps; then, the parameters that make the loss less than 1e-4 and achieve the best performance on the validation set are selected. For online inference, all classifiers work in the same way.
2) Qualitative results:
As illustrated in Fig. 8, as the number of constraints increases, the car in the rectangle is successfully classified by constraint100, even when occlusions occur. Our method still produces errors; for example, when a person walks near a bush, the constraint100 classifier wrongly annotates the person as a bush. The main reason for this error is that a single sample contains both the foreground and background; if the background carries more information than the foreground, the classifier is likely to assign the background label to the sample.
3) Quantitative results:
The F-measure is adopted for quantitative evaluation and is defined as:

F\text{-}Measure = \frac{2 \cdot recall \cdot precision}{recall + precision} \times 100\%. \qquad (12)

We use the five classifiers to annotate the training set and the validation set, and the quantitative results are shown in TABLE IV.

The baseline min and baseline max results show that a large number of annotations is important for supervised learning. The baseline min and constraint100 results indicate that semi-supervised learning is effective: the F-measure of constraint100 increases by 15% on average. Furthermore, as the number of constraints increases, the performance of the classifier is enhanced. A more intuitive comparison is illustrated in Fig. 9. Although constraint100 is not as good as baseline max, in which all annotations are used, it shows promising results, indicating that adding constraints improves the performance of the classifier and reduces the need for annotations. In conclusion, semi-supervised learning, where only a few annotations are used, is effective for 3D point cloud semantic segmentation.
C. Result - Classifier Testing

In a fixed scene or dataset, a classifier based on supervised learning can easily achieve high performance due to overfitting. When the classifier is applied to a new scene, additional annotations are necessary to prevent performance degradation. However, large quantities of fine annotations are difficult to obtain for each new scene in driving applications. Thus, how to enhance adaptability with few or no new annotations is crucial for practical applications. We detail three experiments in the following to demonstrate the adaptability of the proposed semi-supervised method. The F-measure in Equation (12) is used for the quantitative comparison.
1) The pretrained result:
The pretrained classifiers based on the training data are directly applied to the testing data, and the results are shown in TABLE V. The baseline min and constraint100 results show that adding constraints improves the adaptability to the new scene: the F-measure of constraint100 increases by 9% on average. The constraint100 and baseline max results show that constraint100 has higher scores for the cyclist and trunk categories, despite having lower performance in all categories on the training data (TABLE IV).
2) The unsupervised result:
For a new scene, it is interesting to attempt to improve the pretrained classifier using only constraints and no new annotations. Thus, the loss function in Equation (8) is rewritten as:

L(X_l, X_c; \theta) = L_c(X_c; \theta) = \frac{1}{N}\sum_{n=1}^{N}\Big(1 - \sum_{k=1}^{K} P_\theta^k(x_i)\,P_\theta^k(x_j)\Big). \qquad (13)

Here, the pretrained constraint100 is treated as the initial classifier, and only constraints are used to retrain the model.
TABLE V: THE PRE-TRAINED CLASSIFIERS ON TESTING DATA.

Learning Method | Classifier | People | Car | Cyclist | Trunk | Bush | Building | Unknown
sup. | baseline min | 57.6 | 64.0 | 26.3 | 37.5 | 33.2 | 49.7 | 33.0
semi-sup. | constraint100 |
sup. | baseline max |

TABLE VI: THE SETTING OF FINE-TUNING.

Type | People | Car | Cyclist | Trunk | Bush | Building | Unknown
anchor sample | 20 | 20 | 20 | 20 | 20 | 20 | 100
constraint | 925 | 3542 | 142 | 62 | 2440 | 1254 | 0

TABLE VII: F-MEASURE ON TESTING DATA.

Learning Method | Classifier | People | Car | Cyclist | Trunk | Bush | Building | Unknown
sup. | baseline min | 57.6 | 64.0 | 26.3 | 37.5 | 33.2 | 49.7 | 33.0
semi-sup. | constraint100 | 69.4 | 81.6 | 39.8 | 42.1 | 37.3 | 55.7 | 39.0
semi-sup. | constraint100 + fine tuning |
sup. | baseline max |
Fig. 10. The confusion matrix of unsupervised learning on testing data.
The results of this unsupervised learning procedure are shown in Fig. 10. Regardless of the input, the classifier erroneously assigns every sample to the trunk class. We can explain this situation with the loss function in Equation (13): the constraint loss only penalizes the classifier when it gives x_i and x_j different labels, so unsupervised learning fails.
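The collapse can be read directly from Equation (13): any constant labeling is a global minimum of the constraint loss. For instance, if the classifier assigns every sample to the trunk class with probability 1, then

\sum_{k=1}^{K} P_\theta^k(x_i)\,P_\theta^k(x_j) = 1 \cdot 1 = 1 \;\Rightarrow\; L_c = \frac{1}{N}\sum_{n=1}^{N}(1 - 1) = 0,

so the loss offers no gradient away from this degenerate solution, and a handful of labeled anchors is needed to break the symmetry.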
3) The fine tuning result:
After finding that unsupervised learning does not work for our method, we attempt to add a few new annotations. Specifically, the new annotations are generated in an interactive way. The pretrained constraint100 is used as the initial classifier to produce classification results on the testing data. Although the pretrained results show low F-measures in TABLE V, a few new annotations can be selected by human confirmation. In this way, we obtain 20 annotations for each category and 100 for the unknown label, as shown in TABLE VI, where the new annotations are termed anchor samples.
Fig. 11. The quantitative comparison on testing data.
The pretrained constraint100 model is fine-tuned by combining the anchor samples and constraints. The final quantitative results are shown in TABLE VII and Fig. 11. Comparing constraint100 and the fine-tuned version, the F-measure of the latter increases by 13% on average, which shows that fine-tuning is an effective way to improve adaptability, even in a semi-supervised manner. Comparing baseline max and the fine-tuned version, the latter has higher scores except on bush and building. The qualitative results are shown in Fig. 12. In conclusion, fine-tuning with anchor samples increases the adaptability of the classifier to a new scene.
VI. CONCLUSION AND FUTURE WORK
A semantic segmentation method for 3D point clouds (i.e., from LiDAR sensors) is developed in this research, and semi-supervised learning is utilized to reduce the considerable requirement for fine annotations. The pairwise constraints between adjacent frames are generated via inter-frame data association, and a loss function is designed to encourage the constraint data to obtain the same label. This method is examined extensively on a new dataset. The superior results indicate improvements in effectiveness and adaptability. Future work will address how to define the sample, because including both foreground and background information in the sample can confuse the classifier. In addition, the introduction of new constraints will also be studied.

Fig. 12. The qualitative results on testing data. (a) and (b) show the comparison on cars, and (c) shows that the cyclist is successfully classified after fine-tuning.
REFERENCES

[1] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer et al., "Autonomous driving in urban environments: Boss and the Urban Challenge," Journal of Field Robotics, vol. 25, no. 8, pp. 425-466, 2008.
[2] F. Moosmann, O. Pink, and C. Stiller, "Segmentation of 3D lidar data in non-flat urban environments using a local convexity criterion," in IEEE Intelligent Vehicles Symposium. IEEE, 2009, pp. 215-220.
[3] B. Douillard, J. Underwood, N. Kuntz, V. Vlaskine, A. Quadros, P. Morton, and A. Frenkel, "On the segmentation of 3D lidar point clouds," in IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 2798-2805.
[4] D. Munoz, N. Vandapel, and M. Hebert, "Onboard contextual classification of 3-D point clouds with learned high-order Markov random fields," in IEEE International Conference on Robotics and Automation. IEEE, 2009, pp. 4273-4280.
[5] H. Zhao, Y. Liu, X. Zhu, Y. Zhao, and H. Zha, "Scene understanding in a large dynamic environment through a laser-based sensing," in IEEE International Conference on Robotics and Automation. IEEE, 2010, pp. 127-133.
[6] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez, "A review on deep learning techniques applied to semantic segmentation," arXiv preprint arXiv:1704.06857, 2017.
[7] X. Zhu, "Semi-supervised learning literature survey," Computer Science, University of Wisconsin-Madison, vol. 2, no. 3, p. 4, 2006.
[8] T. Lange, M. H. Law, A. K. Jain, and J. M. Buhmann, "Learning with constrained and unlabelled data," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1. IEEE, 2005, pp. 731-738.
[9] Z. Lu and T. K. Leen, "Semi-supervised learning with penalized probabilistic clustering," in Advances in Neural Information Processing Systems, 2005, pp. 849-856.
[10] M. Weinmann, B. Jutzi, and C. Mallet, "Semantic 3D scene interpretation: A framework combining optimal neighborhood size selection with relevant features," ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. II-3, pp. 181-188, 2014.
[11] D. Munoz, J. A. Bagnell, N. Vandapel, and M. Hebert, "Contextual classification with functional max-margin Markov networks," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 975-982.
[12] A. Serna, B. Marcotegui, F. Goulette, and J.-E. Deschaud, "Paris-rue-Madame database: A 3D mobile laser scanner dataset for benchmarking urban detection, segmentation and classification methods," in International Conference on Pattern Recognition, Applications and Methods, 2014.
[13] T. Hackel, J. D. Wegner, and K. Schindler, "Fast semantic segmentation of 3D point clouds with strongly varying density," ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. III-3, pp. 177-184, 2016.
[14] H. Hu, D. Munoz, J. A. Bagnell, and M. Hebert, "Efficient 3-D scene analysis from streaming data," in IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 2297-2304.
[15] J. Behley, V. Steinhage, and A. B. Cremers, "Performance of histogram descriptors for the classification of 3D laser range data in urban environments," in IEEE International Conference on Robotics and Automation. IEEE, 2012, pp. 4391-4398.
[16] Y. Lu and C. Rasmussen, "Simplified Markov random fields for efficient semantic labeling of 3D point clouds," in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 2690-2697.
[17] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe, "Exploring spatial context for 3D semantic segmentation of point clouds," in IEEE International Conference on Computer Vision Workshops. IEEE, 2017, pp. 716-724.
[18] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, "3D semantic parsing of large-scale indoor spaces," in International Conference on Computer Vision and Pattern Recognition. IEEE, 2016.
[19] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, "Virtual worlds as proxy for multi-object tracking analysis," in International Conference on Computer Vision and Pattern Recognition. IEEE, 2016.
[20] P. Tosteberg, "Semantic segmentation of point clouds using deep learning," Master's thesis, Linköping University, 2017.
[21]
[22] A. Dewan, G. L. Oliveira, and W. Burgard, "Deep semantic classification for 3D lidar data," in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2017, pp. 3544-3549.
[23] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research (IJRR), 2013.
[24] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys, "SEMANTIC3D.NET: A new large-scale point cloud classification benchmark," ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. IV-1-W1, pp. 91-98, 2017.
[25] B. Wu, A. Wan, X. Yue, and K. Keutzer, "SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D lidar point cloud," arXiv preprint arXiv:1710.07368, 2017.
[26] F. Piewak, P. Pinggera, M. Schäfer, D. Peter, B. Schwarz, N. Schneider, D. Pfeiffer, M. Enzweiler, and M. Zöllner, "Boosting lidar-based semantic labeling by cross-modal training data generation," arXiv preprint arXiv:1804.09915, 2018.
[27] L. Caltagirone, S. Scheidegger, L. Svensson, and M. Wahde, "Fast lidar-based road detection using fully convolutional neural networks," in IEEE Intelligent Vehicles Symposium. IEEE, 2017, pp. 1019-1024.
[28] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg, "Deep projective 3D semantic segmentation," in International Conference on Computer Analysis of Images and Patterns. Springer, 2017, pp. 95-107.
[29] L. P. Tchapmi, C. B. Choy, I. Armeni, J. Gwak, and S. Savarese, "SEGCloud: Semantic segmentation of 3D point clouds," arXiv preprint arXiv:1710.07563, 2017.
[30] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in European Conference on Computer Vision. Springer, 2012, pp. 746-760.
[31] G. Riegler, A. O. Ulusoy, and A. Geiger, "OctNet: Learning deep 3D representations at high resolutions," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 3. IEEE, 2017, pp. 6620-6629.
[32] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2015, pp. 1912-1920.
[33] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017, pp. 77-85.
[34] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, L. Guibas et al., "A scalable active framework for region annotation in 3D shape collections," ACM Transactions on Graphics (TOG), vol. 35, no. 6, p. 210, 2016.
[35] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, "3D semantic parsing of large-scale indoor spaces," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016, pp. 1534-1543.
[36] L. Landrieu and M. Simonovsky, "Large-scale point cloud semantic segmentation with superpoint graphs," arXiv preprint arXiv:1711.09869, 2017.
[37] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei, "What's the point: Semantic segmentation with point supervision," in European Conference on Computer Vision. Springer, 2016, pp. 549-565.
[38] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303-338, 2010.
[39] R. Yan, J. Zhang, J. Yang, and A. G. Hauptmann, "A discriminative learning framework with pairwise constraints for video object classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 578-593, 2006.
[40] M. Bäuml, M. Tapaswi, and R. Stiefelhagen, "Semi-supervised learning with constraints for person identification in multimedia data," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2013, pp. 3602-3609.
[41] S. Hong, H. Noh, and B. Han, "Decoupled deep neural network for semi-supervised semantic segmentation," in Advances in Neural Information Processing Systems, 2015, pp. 1495-1503.
[42] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille, "Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation," in IEEE International Conference on Computer Vision. IEEE, 2015, pp. 1742-1750.
[43] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, "ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016, pp. 3159-3167.
[44] T. Cour, B. Sapp, C. Jordan, and B. Taskar, "Learning from ambiguously labeled images," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 919-926.
[45] D. Pathak, P. Krahenbuhl, and T. Darrell, "Constrained convolutional neural networks for weakly supervised segmentation," in IEEE International Conference on Computer Vision. IEEE, 2015, pp. 1796-1804.
[46] J. Xu, A. G. Schwing, and R. Urtasun, "Learning to segment under various forms of weak supervision," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2015, pp. 3781-3790.
[47] C. Liu, J. Yuen, and A. Torralba, "Nonparametric scene parsing via label transfer," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2368-2382, 2011.
[48] J. Dai, K. He, and J. Sun, "BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation," in IEEE International Conference on Computer Vision. IEEE, 2015, pp. 1635-1643.