Find my mug: Efficient object search with a mobile robot using semantic segmentation
OAGM Workshop 2014 (arXiv:1404.3538)
Daniel Wolf, Markus Bajones, Johann Prankl and Markus Vincze
Vision4Robotics Group, Automation & Control Institute, Vienna University of Technology
Abstract.
In this paper, we propose an efficient semantic segmentation framework for indoor scenes, tailored to the application on a mobile robot. Semantic segmentation can help robots to gain a reasonable understanding of their environment, but to reach this goal, the algorithms not only need to be accurate, but also fast and robust. Therefore, we developed an optimized 3D point cloud processing framework based on a Randomized Decision Forest, achieving competitive results at sufficiently high frame rates. We evaluate the capabilities of our method on the popular NYU depth dataset and our own data and demonstrate its feasibility by deploying it on a mobile service robot, for which we could optimize an object search procedure using our results.
1 Introduction

It is the ultimate goal in the field of service robotics that mobile robots autonomously navigate and get along in human-made environments. A crucial step on the way to achieve this ambitious goal is that robots are able to recognize and interpret their surroundings. Imagine a simple scenario where you ask your service robot to look for your mug. So far, in most applications the only knowledge the machine has about its environment is a simple 2D map encoding occupied and free space. That is, the only way to solve this task would be to execute a time-consuming brute-force object detection everywhere in the map. Would it not be much more intelligent if it first looked at the most probable locations for the mug to be, e.g. on tables or in the cupboard? An important cornerstone to develop more intelligent behavior like this is semantic segmentation, which enables the robot to infer more meaningful information from its perceived environment.

Especially since the emergence of cheap 3D sensor technology such as the Microsoft Kinect, semantic segmentation for indoor scenes has become a very active topic in the field of computer vision, and many different proposed methods show promising results. However, using them in an interdisciplinary scope of computer vision and mobile robotics is very challenging due to the strict limitations imposed by mobile robotic systems. Many published scene segmentation algorithms are too complex to be executed on a real-time system; consequently, there are not yet many applications actually making use of the results these methods provide.

Figure 1: Intermediate steps of our segmentation pipeline. Left to right: input image, oversegmentation, conditional label probabilities (here for label table, red = 0, blue = 1), final result after MRF. Color code given in Sec. 6.
Focused on this issue, in this paper we present an efficient and fast semantic segmentation framework developed and optimized to be deployed on a mobile service robot autonomously navigating in user apartments. As an example application, we show that our framework can be used to speed up the object search task previously described by inferring likely object locations from the segmentation results.

The remainder of this paper is structured as follows: In the next section we discuss recent developments in the area of semantic segmentation in computer vision and robotics. The proposed framework is presented in Sec. 3, and the datasets used to train and evaluate it are covered in Sec. 4. Sec. 5 briefly introduces our mobile robot and describes the object search scenario in more detail; the results are discussed in Sec. 6. We finally give an overview of possible ideas for future work and conclude in Sec. 7.
2 Related Work

The majority of proposed semantic segmentation algorithms [9, 14, 1, 17] is based on a similar architecture. In a first step, a clustering algorithm calculates an oversegmentation of the input scene and a feature vector is extracted for each cluster. The clusters are then classified according to the feature vector, and in the last step a Conditional Random Field (CRF) or Markov Random Field (MRF) incorporates more global information to obtain the final labeling.

Munoz et al. [9] proposed an outdoor scene labeling approach to label 3D points collected by a laser scanner. They learn the parameters of the CRF using a functional gradient algorithm. To label sequences of frames, Floros et al. [6] came up with a large CRF formulation connecting several subsequent frames to enforce time consistency in the labeling through higher-order potentials. Like [9], their method is intended for outdoor scenarios, using a stereo camera setup.

With the arrival of new structured light depth sensors like the Microsoft Kinect, the attention shifted more towards the labeling of indoor scenes. Silberman et al. [14, 15] presented the publicly available NYU Depth datasets, providing thousands of RGB-D frames of different indoor scenes recorded with a Microsoft Kinect, many of them with densely annotated labels. Their baseline algorithm is based on 2D data taking depth into account and uses a neural network, followed by a CRF. The first big improvement on the results of [14] was presented by Ren et al. [11], using kernel descriptors to describe patches around every pixel. They achieved the best results combining a segmentation tree and a superpixel MRF. An alternative way of oversegmenting the input scene is used by Valentin et al. [17]. They calculate a mesh representation of the scene and compute feature vectors for all faces of the mesh, using geometric and color information. Like [11], the approach presented by Couprie et al.
[3] omits the procedure of handcrafting suitable features to classify scene segments. They exhaustively train a multiscale convolutional network to learn discriminative features directly from training data. After downscaling regular Kinect frames by a factor of 2, they are able to process more than one frame per second.
3 Semantic Segmentation Framework

Our proposed point cloud processing pipeline consists of four steps, depicted in Fig. 1. First, we create an oversegmentation of the scene, clustering it into many small homogeneous patches. In the second step, we compute a versatile yet efficient-to-compute feature set for each patch. The resulting feature vector is then processed by a classifier, which yields a probability for each patch being assigned a specific label. To that end, we use a Randomized Decision Forest (RDF), a classifier which is intensively discussed in [4]. In the last stage of our processing pipeline the classification results set up a pairwise Markov Random Field (MRF), whose optimization yields the final labeling. This last step smoothes out the labeling to correct ambiguous classification results due to noisy local patch information. The final labeling then corresponds to the maximum a posteriori estimate of the output of the MRF.
3.1 Oversegmentation

Like the majority of scene segmentation approaches, we first create an oversegmentation of the input data, such that the features can capture more information and classification is more robust against noise. Furthermore, this step drastically reduces the number of nodes for the final MRF stage, which results in much shorter inference times. To perform the segmentation, we use the supervoxel clustering algorithm proposed by Papon et al. [10].
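As a rough illustration of the interface this step provides, the following sketch clusters a point cloud on a spatial voxel grid. This is a deliberately simplified stand-in, not the supervoxel algorithm of [10], which additionally respects color and normal similarity; the function name and parameters are our own illustrative choices.

```python
import numpy as np

def voxel_grid_clusters(points, voxel_size=0.1):
    """Group 3D points (N x 3 array) into axis-aligned voxel cells.

    Simplified stand-in for supervoxel clustering [10]: only spatial
    proximity is used, whereas real supervoxels also enforce color
    and surface-normal homogeneity.
    """
    keys = np.floor(points / voxel_size).astype(np.int64)
    clusters = {}
    for idx, key in enumerate(map(tuple, keys)):
        # Collect the indices of all points falling into the same cell.
        clusters.setdefault(key, []).append(idx)
    return list(clusters.values())
```

Each returned index list corresponds to one patch; in our pipeline, the patches produced by [10] additionally come with the adjacency information later used by the MRF stage.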
3.2 Feature Extraction

After the segmentation patches have been obtained, we calculate a set of features for each patch; patches containing very few points are disregarded. First, we calculate the three eigenvalues λ0 ≤ λ1 ≤ λ2 of the scatter matrix of the patch. Then, similar to [9], we define three spectral features, namely pointness (λ0), surfaceness (λ1 − λ0) and linearness (λ2 − λ1). One of the most discriminative features is the height of the centroid of the patch above the ground plane. Additionally, we use the height values of the lowest and the highest point of the patch as features. Important information can also be extracted from the surface normals of a patch. Since they are already computed for the supervoxel clustering, we can add the angle between the mean surface normal of the patch and the ground plane, as well as its circular standard deviation, as features without increasing the computational complexity of the feature calculation stage. Finally, we also make use of the color information. We first transform all color values to the CIELAB color space and then store the mean color values of the patch and their respective standard deviations as the last two features. In total, we end up with a 14-dimensional feature vector x, which is then fed into the classification stage described in the following section.

3.3 Classification

In recent years, Randomized Decision Forests and several variations of them [4] have been successfully used for many different tasks in image processing and computer vision [12, 13, 7]. They are capable of handling a large variety of different features, have a probabilistic output and are very efficient at training and at test time. To train our RDF we follow the standard approach presented in [4], such that we end up with a pre-defined number of trees recursively splitting up the data with respect to the evaluation of randomly chosen split functions.
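The spectral and height features above can be sketched as follows. This is a minimal subset of the 14-dimensional vector with illustrative names, assuming the ground plane is z = 0; it is not the authors' implementation.

```python
import numpy as np

def spectral_and_height_features(points):
    """Per-patch features in the spirit of Sec. 3.2: spectral shape
    features from the eigenvalues of the scatter matrix [9] plus
    height statistics (ground plane assumed at z = 0)."""
    centered = points - points.mean(axis=0)
    scatter = centered.T @ centered / len(points)
    # eigvalsh returns eigenvalues in ascending order: l0 <= l1 <= l2
    l0, l1, l2 = np.linalg.eigvalsh(scatter)
    z = points[:, 2]
    return {
        "pointness": l0,            # large for scattered blobs
        "surfaceness": l1 - l0,     # large for planar patches
        "linearness": l2 - l1,      # large for elongated patches
        "centroid_height": z.mean(),
        "min_height": z.min(),
        "max_height": z.max(),
    }
```

The normal- and color-based features are computed analogously and concatenated with these values before being passed to the RDF.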
Leaf nodes are created at the defined final depth level of the trees or if the data cannot be split up any further. These nodes store the distribution of the labels of the training data which has reached the respective node. At test time, a data point x (i.e. a feature vector) traverses all trees according to the learned split functions, starting at the root nodes, until it reaches a leaf node in every tree. The conditional probability p(y | x) of label y being assigned to a patch with feature vector x is then defined as the mean of all label distributions stored in the reached leaf nodes.

3.4 Markov Random Field

In the last stage of our processing pipeline we model the labeling problem with a Markov Random Field, similar to the formulation presented in [16]. An MRF is a graph-based model, where an undirected graph is defined as a set (V, E), with V denoting a set of vertices or nodes and E denoting a set of edges connecting nodes. In our case, each node i ∈ V corresponds to a patch and is assigned a label y_i ∈ L, where L is the discrete set of label categories. The set of all label assignments is defined as Y = {y_i}. We use a pairwise MRF, which means that we only consider edges connecting exactly two nodes. This allows us to directly infer E from the pairwise adjacency graph obtained by the supervoxel clustering, defining the set of nodes A_i connected to a node i ∈ V. Following the Hammersley-Clifford theorem, the posterior probability of a label assignment Y is a Gibbs distribution, which can be reformulated as an energy function:

E(Y) = Σ_{i ∈ V} φ_i(y_i) + Σ_{i ∈ V} Σ_{j ∈ A_i} φ_{i,j}(y_i, y_j)    (1)

where φ_i(y_i) is the unary term corresponding to the likelihood of label y_i being assigned to node i, and φ_{i,j}(y_i, y_j) is the pairwise term corresponding to the pairwise likelihood of labels y_i and y_j being assigned to the nodes i and j.
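Using the unary and pairwise terms defined below (Eqs. 3 and 4), the energy of Eq. (1) can be evaluated for a candidate labeling as in the following sketch. Function and parameter names are hypothetical, not the authors' implementation; probs holds the RDF outputs p(y | x_i) and adjacency the supervoxel neighbor lists A_i.

```python
import numpy as np

def mrf_energy(labels, probs, positions, adjacency, lam=1.0, sigma=0.1):
    """Energy of a labeling Y as in Eq. (1): unary costs lam*(1 - p(y_i|x_i))
    plus a Potts pairwise cost exp(-||p_i - p_j|| / sigma) whenever two
    adjacent patches carry different labels."""
    energy = 0.0
    for i, yi in enumerate(labels):
        energy += lam * (1.0 - probs[i][yi])           # unary term, Eq. (3)
        for j in adjacency[i]:
            if labels[j] != yi:                        # Potts model, Eq. (4)
                d = np.linalg.norm(positions[i] - positions[j])
                energy += np.exp(-d / sigma)
    return energy
```

As in Eq. (1), each undirected edge contributes twice, because the double sum runs over ordered pairs. Minimizing this energy over all labelings, e.g. with Loopy Belief Propagation [8], yields the final result.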
The optimal labeling Y* can be obtained by minimizing the energy function:

Y* = arg min_Y E(Y)    (2)

The unary term can directly be inferred from the conditional probabilities calculated by the classification stage by transforming them into a cost:

φ_i(y_i) = λ (1 − p(y_i | x_i))    (3)

where λ is a weighting term defining the importance of the unary term compared to the pairwise term. For the pairwise term we use the common definition of the Potts model:

φ_{i,j}(y_i, y_j) = 0 if y_i = y_j, and e^{−‖p_i − p_j‖ / σ} otherwise    (4)

where p_i and p_j are the 3D coordinates of the centroids of the patches corresponding to the nodes i and j, and σ regularizes the penalty assigned to an inconsistent labeling of adjacent patches. As proposed in [16], we approximately solve the resulting optimization problem (1) using Loopy Belief Propagation [8].

4 Datasets

We train and evaluate our classifier on the NYU Depth V2 dataset published by Silberman et al. [15]. It contains densely labeled indoor scenes recorded with a Microsoft Kinect. In particular, a collection of 1,449 frames has been labeled with more than 1,000 classes. As the dataset has been recorded with a manually held camera, we need to fit a plane to all "floor"-labeled points to retrieve the camera height and angles. If there are not enough floor points available in an image, it is discarded in the training and evaluation procedures. For our purposes, we decided the most important object classes are larger structures commonly seen in apartments, as well as a separate object class. Therefore, we narrow down the labels available in the dataset to the set floor, wall, ceiling, table, chair, cabinet, object, and unknown.

Besides evaluating our framework on the popular NYU dataset, we also measure its performance on our own small dataset, consisting of 10 typical office scenes. The difference to the NYU dataset is that the point clouds have all been recorded from the same height and angle, a setting similar to that on our mobile robot. The label set is the same as for the NYU dataset. Some example images of the used datasets and the corresponding results are shown in Sec. 6.

5 Object Search with the Robot Hobbit

In this section, we describe how we use our framework to speed up an object search task on the mobile service robot Hobbit [5]. The robot is equipped with a differential drive and two RGB-D cameras. For our experiments only the camera in the head, which is mounted on a pan-tilt unit, is used. For manipulation the robot has an IGUS Robolink arm with a 2-finger gripping system. A picture of the platform can be seen in Fig. 2 (left).

In the object search scenario the user asks the robot to search and fetch an object, e.g. a mug. The robot then sequentially navigates to a number of "search positions" defined in the map, where an object recognition algorithm is then run. So far, the list of search positions had to be pre-defined by an expert and stayed fixed.
Using our framework, this is no longer necessary, because likely object locations can directly be inferred from our segmentation results. In particular, because our robot can only grasp objects located on tables, we only consider positions next to large clusters of points labeled "table" as suitable object search positions. Consequently, after calculating the semantic labels of the current scene, we first use simple Euclidean clustering to obtain all tables in the scene. The resulting search positions are then defined by a simple heuristic, which is explained in Fig. 2 (right). For further details about the whole object search scenario we refer to [2].
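The heuristic of Fig. 2 (right) can be sketched as follows, under our own simplified reading and with hypothetical names: project the table cluster onto the ground plane, compute its principal axes, and place one position on either side of the cluster along the second principal axis, a safety distance d beyond the table edge.

```python
import numpy as np

def search_positions(table_points, safety_distance=0.4):
    """Place two candidate search positions next to a table cluster.

    table_points: N x 3 array of points labeled 'table'.
    Returns two 2D ground-plane positions on the second principal axis,
    safety_distance beyond the projected table edge."""
    pts2d = table_points[:, :2]                 # project onto ground plane
    centroid = pts2d.mean(axis=0)
    centered = pts2d - centroid
    cov = centered.T @ centered / len(pts2d)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    minor_axis = eigvecs[:, 0]                  # second (shorter) principal axis
    half_extent = np.abs(centered @ minor_axis).max()
    offset = (half_extent + safety_distance) * minor_axis
    return np.array([centroid + offset, centroid - offset])
```

A real implementation would additionally check the resulting positions against the occupancy map for reachability before sending them to the navigation stack.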
Figure 2: Left: The Hobbit robot. Right: Heuristic to define search positions for a table cluster projected on the ground plane (red points). Blue lines: principal axes of the cluster. The search positions (green dots) are placed on the second principal axis, adhering to a security distance d from the table edge.

6 Results

To measure the labeling performance of our framework, we evaluated it with respect to the common multi-label metrics class average accuracy and global accuracy. The former is the mean of the diagonal of the confusion matrix; the latter is the mean of the pointwise accuracy over the whole test set. For the NYU dataset, we performed 5-fold cross-validation; for our own dataset we evaluated on all point clouds after training on the whole NYU dataset. With a class average accuracy of 71.7% and a global accuracy of 77.2% on the NYU dataset, and 55.6% (class average) and 72.0% (global) on our own dataset, our framework achieves superior performance with respect to the current state of the art [3]. Of course, one has to keep in mind that we use a limited label set specific to our application, compared to other approaches evaluated on the NYU dataset. An overview of all quantitative results is given in Table 1; some qualitative examples are shown in Fig. 4.

We also evaluated how different parameters, namely the number of trees used in the RDF and the maximum tree depth, influence the accuracy. Fig. 3 (left) shows that the accuracy significantly increases as soon as multiple trees are used in the RDF, but saturates if more than 8 trees are used. A similar effect can be observed for the maximum tree depth parameter, plotted in Fig. 3 (right). With increased depth level, the RDF can capture the data structure better. However, due to the limited amount of training data, trees often do not grow deeper than 10 levels in the training stage, which explains why the results do not improve for larger depth values.

Setting the maximum tree depth and the number of trees to 8, our framework processes point clouds containing 640x480 points at a frame rate of about 1 fps on a 2.4 GHz 8-core Intel i7 laptop. Regarding the operation on a robot, we consider this processing time to be sufficiently fast for many potential applications, such as the object search scenario described in Sec. 5.
Figure 3: Influence of the RDF parameters number of trees and maximum tree depth on the labeling accuracies for the NYU Depth V2 dataset.
Table 1: Class-wise accuracies (floor, wall, ceiling, table, chair, cabinet, object), class average and global accuracy on the NYU Depth V2 dataset and our own data, comparing [3] and our method.
Figure 4: Example results for the NYU Depth V2 (first 3 columns) and our own (last 2 columns) dataset. Top: input image. Middle: ground truth. Bottom: results. Color code: floor, wall, ceiling, table, chair, cabinet, object, unknown.

7 Conclusion and Future Work

We presented an efficient 3D semantic segmentation framework for indoor scenes. We showed that our method, based on an RDF, achieves accurate results at a frame rate feasible for the application on a mobile robot. We demonstrated this by successfully deploying our framework on a mobile service robot, where we used our method to detect possible object locations in a room and in turn were able to dynamically infer a more efficient object search procedure.

Still, there are aspects of our framework which could potentially be improved. So far, we do not make use of the contextual relationship between scene segments. By incorporating simple pairwise features, e.g. color and height difference, we expect the accuracy to further increase, especially for similar labels such as wall and cabinet. We also plan to investigate further exploitations of semantic segmentation in the scope of mobile robotics. An interesting application could be the construction of a complete semantic map, fusing labeling results from different viewpoints. This map would not only encode more information, but should also be more robust against noisy classification results.
Acknowledgments
This work has been partially funded by the European Commission under grant agreements FP7-IST-288146 HOBBIT and FP7-IST-600623 STRANDS, and by the Austrian Science Foundation (FWF), Project I513-N23 vision@home.
References

[1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Contextually guided semantic labeling and search for three-dimensional point clouds. IJRR, 32(1):19–34, 2012.
[2] M. Bajones, D. Wolf, J. Prankl, and M. Vincze. Where to look first? Behaviour control for fetch-and-carry missions of service robots. In ARW, 2014.
[3] C. Couprie, C. Farabet, L. Najman, and Y. LeCun. Indoor semantic segmentation using depth information. In ICLR, 2013.
[4] A. Criminisi and J. Shotton, editors. Decision Forests for Computer Vision and Medical Image Analysis. Springer London, 2013.
[5] D. Fischinger, P. Einramhof, W. Wohlkinger, K. Papoutsakis, P. Mayer, P. Panek, T. Koertner, S. Hofmann, A. Argyros, M. Vincze, A. Weiss, and C. Gisinger. Hobbit - The Mutual Care Robot. In IROS, 2013.
[6] G. Floros and B. Leibe. Joint 2D-3D temporally consistent semantic segmentation of street scenes. In CVPR, 2012.
[7] O. Kähler and I. D. Reid. Efficient 3D scene labeling using Fields of Trees. In ICCV, 2013.
[8] J. H. Kim and J. Pearl. A computational model for diagnostic reasoning in inference systems. In IJCAI, 1983.
[9] D. Munoz, N. Vandapel, and M. Hebert. Onboard contextual classification of 3-D point clouds with learned high-order Markov Random Fields. In ICRA, 2009.
[10] J. Papon, A. Abramov, M. Schoeler, and F. Wörgötter. Voxel Cloud Connectivity Segmentation - Supervoxels for point clouds. In CVPR, 2013.
[11] X. Ren, L. Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In CVPR, 2012.
[12] F. Schroff, A. Criminisi, and A. Zisserman. Object class segmentation using random forests. In BMVC, 2008.
[13] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, and A. Blake. Efficient human pose estimation from single depth images. PAMI, 35(12):2821–2840, December 2013.
[14] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In ICCVW, 2011.
[15] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[16] F. Tombari and L. di Stefano. 3D data segmentation by local classification and Markov Random Fields, 2011.
[17] J. P. C. Valentin, S. Sengupta, J. Warrell, A. Shahrokni, and P. H. S. Torr. Mesh based semantic modelling for indoor and outdoor scenes.