Learning Navigation Costs from Demonstration with Semantic Observations
Proceedings of Machine Learning Research vol 120:1-11, 2020. 2nd Annual Conference on Learning for Dynamics and Control.
Tianyu Wang (TIW@ENG.UCSD.EDU)
Vikas Dhiman (VDHIMAN@ENG.UCSD.EDU)
Nikolay Atanasov (NATANASOV@ENG.UCSD.EDU)
Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093
Editors:
A. Bayen, A. Jadbabaie, G. J. Pappas, P. Parrilo, B. Recht, C. Tomlin, M. Zeilinger
Abstract
This paper focuses on inverse reinforcement learning (IRL) for autonomous robot navigation using semantic observations. The objective is to infer a cost function that explains demonstrated behavior while relying only on the expert's observations and state-control trajectory. We develop a map encoder, which infers semantic class probabilities from the observation sequence, and a cost encoder, defined as a deep neural network over the semantic features. Since the expert cost is not directly observable, the representation parameters can only be optimized by differentiating the error between demonstrated controls and a control policy computed from the cost estimate. The error is optimized using a closed-form subgradient computed only over a subset of promising states via a motion planning algorithm. We show that our approach learns to follow traffic rules in the CARLA autonomous driving simulator by relying on semantic observations of cars, sidewalks, and road lanes.
Keywords:
Inverse reinforcement learning, semantic mapping, learning from demonstration
1. Introduction
Autonomous systems operating in unstructured, partially observed, and changing real-world environments need an understanding of context to evaluate the safety, utility, and efficiency of their performance. For example, while a bipedal robot may navigate along sidewalks, an autonomous car needs to follow the road-lane structure and the traffic signs. Designing a cost function that encodes such rules by hand is cumbersome, if not infeasible, especially for complex tasks. However, it is often possible to obtain demonstrations of desirable behavior that indirectly capture the role of semantic context in the task execution. Semantic labels provide rich information about the relationship between object entities and their surroundings. In this work, we consider an inverse reinforcement learning (IRL) problem in which observations containing semantic information about the environment are available.

There has been significant progress in semantic segmentation techniques, including deep neural networks for RGB image segmentation (Papandreou et al., 2015; Badrinarayanan et al., 2017; Chen et al., 2018) and point cloud labeling via a 2D spherical depth projection (Wu et al., 2018; Dohan et al., 2015). Maps that store semantic information can be generated from segmented images (Sengupta et al., 2012; Lu et al., 2019). Gan et al. (2019) and Sun et al. (2018) generalize binary occupancy grid mapping (Hornung et al., 2013) to multi-class semantic mapping in 3D. In this work, we parameterize the navigation cost of an autonomous vehicle as a nonlinear function of such semantic features to explain the demonstrations of an expert.
Learning a cost function from demonstration requires a control policy that is differentiable with respect to the cost parameters. Computing policy derivatives has been addressed by several successful IRL approaches (Neu and Szepesvári, 2007; Ratliff et al., 2006; Ziebart et al., 2008). Early works assume that the cost is linear in the feature vector and aim at matching the feature expectations of the learned and expert policies. Ratliff et al. (2006) compute subgradients of planning algorithms so that the expected reward of an expert policy is better than that of any other policy by a margin. Value iteration networks (VIN) (Tamar et al., 2016) show that the value iteration algorithm can be approximated by a series of convolution and max-pooling layers, allowing automatic differentiation to learn the cost function end-to-end. Ziebart et al. (2008) develop a dynamic programming algorithm to maximize the likelihood of observed expert data and learn a policy of maximum entropy (MaxEnt) distribution. Many works (Levine et al., 2011; Wulfmeier et al., 2016; Song, 2019) extend MaxEnt to learn a nonlinear cost using Gaussian processes or deep neural networks. Finn et al. (2016) use a sample-based approximation of the MaxEnt objective on high-dimensional continuous systems. However, the cost in most existing work is learned offline using full observation sequences from the expert demonstrations. A major contribution of our work is to develop cost representations and planning algorithms that rely only on causal partial observations.

Achieving safe and robust navigation is directly coupled with the quality of the environment representation and the cost function specifying desirable behaviors. The traditional approach combines geometric mapping of occupancy probability or distance to the nearest obstacle (Hornung et al., 2013; Oleynikova et al., 2017) with hand-specified planning cost functions.
Recent advances in deep reinforcement learning demonstrated that control inputs may be predicted directly from sensory observations (Levine et al., 2016). However, special model designs (Khan et al., 2018) that serve as a latent map are needed in navigation tasks where simple reactive policies are not feasible. Gupta et al. (2017) decompose visual navigation into two separate stages explicitly: mapping the environment from first-person RGB images and planning through the constructed map with VIN. Our model also separates the two stages but integrates semantic information to obtain a richer map representation. In addition, Wang et al. (2020) propose a differentiable mapping and planning framework to learn the expert cost function. They parameterize the cost function as a neural network over the binary occupancy map probability, which is integrated from previous partial observations. They further propose an efficient A* (Hart et al., 1968) planning algorithm that computes the policy at the current state and backpropagates gradients in closed form to optimize the cost parameters. We extend their work by incorporating semantic observations into the map representation and evaluating the model in the CARLA autonomous driving simulator (Dosovitskiy et al., 2017).

We propose a model that learns to navigate from first-person semantic observations and make the following contributions. First, we propose a cost function representation composed of a map encoder, capturing semantic class probabilities from the streaming observations, and a cost encoder, defined as a deep neural network over the semantic features. Second, we optimize the cost parameters using a closed-form subgradient of the cost-to-go computed only over a subset of promising states, obtained by an efficient planning algorithm. Finally, we verify our model in autonomous navigation experiments in urban environments provided by the CARLA simulator (Dosovitskiy et al., 2017).
2. Problem Formulation
Consider a robot navigating in an unknown environment X with the task of reaching a goal state x_g ∈ X. Let x_t ∈ X be the robot state, capturing its pose, twist, etc., at discrete time t.
Figure 1: Architecture for cost function learning from demonstrations with semantic observations. Our main contribution is a cost representation, combining a probabilistic semantic map encoder, with recurrent dependence on semantic observations P_t, and a cost encoder, defined over the semantic features. Efficient forward policy computation and closed-form subgradient backpropagation are used to optimize the cost representation in order to explain expert behavior.

For a given control input u_t ∈ U, where U is assumed finite, the robot state evolves according to known deterministic dynamics: x_{t+1} = f(x_t, u_t). Let K := {0, 1, ..., K} be a set of class labels, where k = 0 denotes "free" space and k ≥ 1 denotes a particular semantic class such as car or tree. Let m* : X → K be a function specifying the true semantic occupancy of the environment by labeling states with semantic classes, and let M be the space of possible environment realizations m*. Let c* : X × U × M → R_{≥0} be a cost function specifying desirable robot behavior in a given environment, e.g., according to an expert user or an optimal design. We assume that the robot does not have access to either the true semantic map m* or the true cost function c*. However, the robot is able to obtain point cloud observations P_t = {(p_l, y_l)}_l at each step t, where p_l ∈ X is the measurement location and y_l is an observed semantic likelihood such that y_l = [y_l^1, ..., y_l^K]^T with y_l^k ≥ 0 and Σ_{k=1}^K y_l^k = 1, whose support is K \ {0}.
In practice, y_l can be obtained from a semantic segmentation algorithm (Papandreou et al., 2015; Badrinarayanan et al., 2017; Chen et al., 2018) that predicts the semantic class of the corresponding measurement location p_l. The observed point cloud P_t depends on the robot state x_t and the environment realization m*.

Given a training set D := {(x_{t,n}, u*_{t,n}, P_{t,n}, x_{g,n})} for t = 1, ..., T_n and n = 1, ..., N of N expert trajectories with lengths T_n that demonstrate desirable behavior, our goal is to:

• learn a cost function estimate c_t : X × U × P_t × Θ → R_{≥0} that depends on an observation sequence P_t from the true latent environment and is parameterized by θ ∈ Θ,
• design a stochastic policy π_t from c_t such that the robot behavior under π_t matches the prior experience D.

To balance exploration in partially observable environments with exploitation of promising controls, we specify π_t as a Boltzmann policy (Ramachandran and Amir, 2007; Neu and Szepesvári, 2007) associated with the cost c_t:

  π_t(u_t | x_t; P_t, θ) = exp(−Q*_t(x_t, u_t; P_t, θ)) / Σ_{u ∈ U} exp(−Q*_t(x_t, u; P_t, θ)),

where the optimal cost-to-go function Q*_t is:

  Q*_t(x_t, u_t; P_t, θ) := min_{u_{t+1:T−1}} Σ_{k=t}^{T−1} c_t(x_k, u_k; P_t, θ)   s.t. x_{k+1} = f(x_k, u_k), x_T = x_g.   (1)

Problem
Given demonstrations D, optimize the cost function parameters θ so that the log-likelihood of the demonstrated controls u*_{t,n} is maximized under the robot policies π_{t,n}:

  min_θ L(θ) := − Σ_{n=1}^N Σ_{t=1}^{T_n} log π_{t,n}(u*_{t,n} | x_{t,n}; P_{t,n}, θ).   (2)
The problem setup is illustrated in Fig. 1. While Eqn. (1) is a standard deterministic shortest path (DSP) problem, the challenge is to make it differentiable with respect to θ, which is necessary for the loss in (2) to propagate back through the DSP problem to update the cost parameters θ. Once the parameters are optimized, the robot can generalize to navigation tasks in new partially observable environments by evaluating the cost c_t based on the observations P_t iteratively and (re)computing the associated policy π_t.
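As a small illustration of how the Boltzmann policy converts cost-to-go values into control probabilities, here is a minimal NumPy sketch (the helper name `boltzmann_policy` and the example Q-values are our own, not from the paper):

```python
import numpy as np

def boltzmann_policy(q_values):
    """Boltzmann policy pi(u|x) proportional to exp(-Q*(x, u)) over a finite U.

    q_values holds the optimal cost-to-go Q*_t(x_t, u; P_t, theta) for each
    control u in U; lower cost-to-go yields higher probability.
    """
    z = -np.asarray(q_values, dtype=float)
    z -= z.max()                 # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Three candidate controls; the two cheapest ones share the highest probability.
pi = boltzmann_policy([1.0, 2.0, 1.0])
```

Shifting by the maximum leaves the distribution unchanged but avoids overflow when the cost-to-go values are large.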
3. Cost Function Representation
We propose a cost function representation with two components: a semantic map encoder with parameters ψ and a cost encoder with parameters φ. The model is differentiable by design, allowing its parameters to be optimized by the subsequent planning algorithm described in Sec. 4.

The semantic probability of different environment areas is encoded in a hidden state h_t given the trajectory and observations x_{1:t}, P_{1:t}. Specifically, we discretize the state space X into J cells and let m = [m_1, ..., m_J]^T ∈ K^J be the random vector of true semantic labels over the cells. Since m is unknown to the robot, we maintain the semantic occupancy posterior P(m = k | x_{1:t}, P_{1:t}), where k = [k_1, ..., k_J]^T with k_j ∈ K, given the history of states x_{1:t} and observations P_{1:t}. The representation complexity may be simplified significantly if one assumes independence among the map cells m_j: P(m = k | x_{1:t}, P_{1:t}) = Π_{j=1}^J P(m_j = k_j | x_{1:t}, P_{1:t}). Inspired by binary occupancy grid mapping (Thrun et al., 2005; Hornung et al., 2013), we extend the recurrent updates to the multi-class semantic probability of each cell m_j.

Definition 1
The log-odds ratio of semantic classes associated with cell m_j at time t is h_{t,j} = [h_{t,j}^1, ..., h_{t,j}^K]^T, where

  h_{t,j}^k := log [ P(m_j = k | x_{1:t}, P_{1:t}) / P(m_j = 0 | x_{1:t}, P_{1:t}) ]   for k ∈ K.   (3)

Its recurrent Bayesian update is h_{t+1,j}^k = h_{t,j}^k + log [ p(P_{t+1} | m_j = k, x_{t+1}) / p(P_{t+1} | m_j = 0, x_{t+1}) ]. Note that by definition h_{t,j}^0 = 0. The update increment is a log-odds observation model, and we assume the observation P_{t+1} given the cell m_j is independent of the previous observations P_{1:t}. The semantic class posterior can be recovered from the semantic log-odds ratio h_{t,j} via a softmax function, P(m_j = k | x_{1:t}, P_{1:t}) = σ_k(h_{t,j}), where σ : R^{K+1} → R^{K+1} satisfies the following properties:

  σ(z) = [σ_0(z), ..., σ_K(z)]^T,   σ_k(z) = exp(z_k) / Σ_{k' ∈ K} exp(z_{k'}),   log [ σ_k(z) / σ_{k'}(z) ] = z_k − z_{k'}.   (4)

We provide a simple observation model to instantiate Eq. (3). Consider all cells m_j that lie on the ray between the robot state x and a labeled point (p_l, y_l) in the point cloud P. Let d(x, m_j) be the distance between the robot position and the center of mass of the cell m_j.

Definition 2
The inverse observation model relating the label of cell m_j to the ray between the robot state x and a labeled point (p_l, y_l) is defined as a softmax function with parameters ψ_l, scaled by the distance differential δp_l = d(m_j, x) − ||p_l||, and truncated at a threshold ε:

  P(m_j = k | x, (p_l, y_l)) = σ_k(diag(ψ_l) y_l δp_l)   if δp_l ≤ ε,
  P(m_j = k | x, (p_l, y_l)) = σ_k(h_{0,j})              if δp_l > ε.   (5)
The function diag(·) returns a diagonal matrix from a vector, and the prior occupancy log-odds ratio h_{0,j} depends on the environment (e.g., h_{0,j} = 0 specifies a uniform prior over the semantic classes).

Proposition 3
Given the definitions of the log-odds ratio in Eq. (3) and the inverse observation model in Eq. (5), the log-odds update rule for the semantic probability at cell m_j is:

  h_{t+1,j} = h_{t,j} + Σ_{(p_l, y_l) ∈ P_{t+1}} [ g_j(x_{t+1}, (p_l, y_l)) − h_{0,j} ],

where the log-odds inverse observation model for cells m_j along the ray from x_{t+1} to p_l can be simplified using (4) as:

  g_j(x_{t+1}, (p_l, y_l)) = diag(ψ_l) y_l δp_l   if δp_l ≤ ε,   and   g_j(x_{t+1}, (p_l, y_l)) = h_{0,j}   if δp_l > ε.   (6)

A more expressive multi-layer neural network may be used to parameterize the inverse observation model instead of the linear transformation diag(ψ_l) y_l δp_l of the semantic probability and distance differential along the l-th ray in Eq. (5):

  g_j(x_{t+1}, (p_l, y_l); ψ_l) = NN(y_l, p_l, d(x_{t+1}, m_j); ψ_l)   if δp_l ≤ ε,   and   g_j(x_{t+1}, (p_l, y_l); ψ_l) = h_{0,j}   if δp_l > ε.   (7)

In summary, the map encoder starts with prior log-odds h_0 and updates them recurrently via h_{t+1} = h_t + g(x_{t+1}, P_{t+1}; ψ) − h_0, where the inverse sensor log-odds g_j(x_{t+1}, (p_l, y_l); ψ_l) is specified for the j-th cell along the l-th ray in (6) or (7). The posterior P(m = k | x_{1:t}, P_{1:t}) is the softmax of h_t.

The cost encoder uses the semantic occupancy grid posterior σ(h_t) to define the cost function estimate c_t(x, u) at a given state-control pair (x, u). A convolutional neural network (CNN) (Goodfellow et al., 2016) with parameters φ can extract cost features from the environment map: c_t(x, u) = CNN(h_t, x, u; φ). We implement an encoder-decoder neural network architecture (Badrinarayanan et al., 2017) to parameterize the cost function from the semantic class probabilities. The idea is to downsample and upsample at multiple scales to provide both local and global context between semantic probability and cost.
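To make the recurrent map update concrete, the following is a minimal NumPy sketch of the linear inverse observation model (6) and the posterior recovery via the softmax in (4) for a single cell. The threshold value, helper names, and toy measurements are illustrative assumptions, not values from the paper:

```python
import numpy as np

EPS = 2.0          # truncation threshold epsilon (illustrative value)
K = 2              # number of semantic classes (excluding free space k = 0)
h0 = np.zeros(K)   # uniform prior log-odds h_{0,j} = 0

def g_j(psi_l, y_l, delta_p):
    """Log-odds inverse observation model for cell m_j on the l-th ray (Eq. 6)."""
    if delta_p <= EPS:
        return np.diag(psi_l) @ y_l * delta_p
    return h0

def map_update(h_j, rays):
    """h_{t+1,j} = h_{t,j} + sum over rays [ g_j(...) - h_{0,j} ]  (Prop. 3)."""
    for psi_l, y_l, delta_p in rays:
        h_j = h_j + g_j(psi_l, y_l, delta_p) - h0
    return h_j

def posterior(h_j):
    """P(m_j = k | history) = softmax of [0, h_j]; free-space log-odds is 0."""
    z = np.concatenate(([0.0], h_j))
    e = np.exp(z - z.max())
    return e / e.sum()

# One ray with a confident class-1 label; delta_p > 0 raises the class odds
# in this toy sign convention.
h = map_update(h0.copy(), [(np.ones(K), np.array([0.9, 0.1]), 1.0)])
p = posterior(h)
```

Note how the softmax property log σ_k(z)/σ_0(z) = z_k lets the posterior and the log-odds be converted back and forth without loss.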
4. Cost Learning via Differentiable Planning
We follow the planning algorithm in Wang et al. (2020), which enables efficient cost optimization, and briefly review the steps below. The parameters θ of the cost representation c_t(x, u; P_t, θ) developed in Sec. 3 are optimized by differentiating L(θ) in (2) through the DSP problem in (1). Motion planning algorithms, such as A* (Likhachev et al., 2004), solve problem (1) efficiently and determine the optimal cost-to-go Q*_t(x, u) only over a subset of promising states. This is sufficient to obtain the subgradient of Q*_t(x_t, u_t) with respect to c_t along the optimal path by applying the subgradient method (Shor, 2012; Ratliff et al., 2006).

A backward A* search applied to problem (1) with start state x_g, goal state x ∈ X, and predecessor expansions according to the transition function f provides an upper bound to the optimal cost-to-go: Q*_t(x, u) ≤ c_t(x, u) + g(f(x, u)), where g are the values computed by A* for expanded nodes in the CLOSED list and visited nodes in the OPEN list. Strict equality is obtained only if f(x, u) belongs to the CLOSED list. A Boltzmann policy π_t(u | x) may be defined using the g-values for all x ∈ CLOSED ∪ OPEN ⊆ X and a uniform distribution over U for all other states.
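The backward search that produces the g-values can be sketched with Dijkstra's algorithm (A* with a zero heuristic); the graph, helper names, and costs below are an illustrative toy example, not the paper's planner:

```python
import heapq

def backward_g_values(goal, reverse_edges):
    """Backward Dijkstra from the goal state.

    reverse_edges(x) yields (x_prev, c) pairs, i.e. states x_prev that reach
    x in one step at stage cost c, so g(x_prev) is the optimal cost-to-go
    from x_prev to the goal. States popped from the queue form the CLOSED
    list, where g is exact; states still queued form the OPEN list.
    """
    g = {goal: 0.0}
    closed = set()
    pq = [(0.0, goal)]
    while pq:
        gx, x = heapq.heappop(pq)
        if x in closed:
            continue                      # skip stale queue entries
        closed.add(x)
        for xp, c in reverse_edges(x):
            if gx + c < g.get(xp, float("inf")):
                g[xp] = gx + c
                heapq.heappush(pq, (g[xp], xp))
    return g, closed

# Chain 0 - 1 - 2 - 3 with unit step costs and goal state 3.
def reverse_edges(x):
    return [(xp, 1.0) for xp in (x - 1, x + 1) if 0 <= xp <= 3]

g, closed = backward_g_values(3, reverse_edges)
```

With the g-values in hand, the bound Q*_t(x, u) ≤ c_t(x, u) + g(f(x, u)) supplies the quantities needed to evaluate a Boltzmann policy over the expanded states.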
Algorithm 1:
Train cost parameters θ
input: D = {(x_{t,n}, u*_{t,n}, P_{t,n}, x_{g,n})} for t = 1, ..., T_n, n = 1, ..., N
while θ not converged do
    L(θ) ← 0
    for n = 1, ..., N and t = 1, ..., T_n do
        Update c_{t,n} based on x_{t,n} and P_{t,n} as in Sec. 3
        Obtain Q*_{t,n}(x, u) from (1) with stage cost c_{t,n}
        Obtain π_{t,n}(u | x_{t,n}) from Q*_{t,n}(x_{t,n}, u)
        L(θ) ← L(θ) − log π_{t,n}(u*_{t,n} | x_{t,n})
    end
    Update θ ← θ − α ∇L(θ) via Prop. 4
end

Algorithm 2:
Test control policy π_t
input: start state x_s, goal state x_g, cost parameters θ
Current state x_t ← x_s
while x_t ≠ x_g do
    Make an observation P_t
    Update c_t based on x_t and P_t as in Sec. 3
    Obtain π_t(u | x_t) from Q*_t(x_t, u)
    Update x_t ← f(x_t, u_t) via u_t := arg max_u π_t(u | x_t)
end

We rewrite Q*_t(x_t, u_t) in a form that makes its subgradient with respect to c_t(x, u) obvious. Let T(x_t, u_t) be the set of feasible trajectories τ of horizon T that start at x_t, u_t, satisfy the transition function f, and terminate at x_g. Let τ* ∈ T(x_t, u_t) be an optimal trajectory corresponding to the optimal cost-to-go function Q*_t(x_t, u_t). Define μ_τ(x, u) := Σ_{k=t}^{T−1} 1{(x_k, u_k) = (x, u)} as a state-control visitation function indicating whether (x, u) is visited by τ. The optimal cost-to-go function Q*_t(x_t, u_t) can be viewed as a minimum over T(x_t, u_t) of the inner product of the cost function c_t and the visitation function μ_τ:

  Q*_t(x_t, u_t) = min_{τ ∈ T(x_t, u_t)} Σ_{x ∈ X, u ∈ U} c_t(x, u) μ_τ(x, u),   (8)

where X can be assumed finite because both T and U are finite. Applying the subgradient method (Shor, 2012; Ratliff et al., 2006) to (8) shows that ∂Q*_t(x_t, u_t)/∂c_t(x, u) = μ_{τ*}(x, u) is a subgradient of the optimal cost-to-go. This result and the chain rule allow us to obtain a subgradient of L(θ).

Proposition 4
A subgradient of the loss function L(θ) in (2) with respect to θ can be obtained as:

  ∂L(θ)/∂θ = − Σ_{n=1}^N Σ_{t=1}^{T_n} ∂ log π_{t,n}(u*_{t,n} | x_{t,n}) / ∂θ
           = − Σ_{n=1}^N Σ_{t=1}^{T_n} Σ_{u_{t,n} ∈ U} [ ∂ log π_{t,n}(u*_{t,n} | x_{t,n}) / ∂Q*_{t,n}(x_{t,n}, u_{t,n}) ] [ ∂Q*_{t,n}(x_{t,n}, u_{t,n}) / ∂θ ]
           = Σ_{n=1}^N Σ_{t=1}^{T_n} Σ_{u_{t,n} ∈ U} ( 1{u_{t,n} = u*_{t,n}} − π_{t,n}(u_{t,n} | x_{t,n}) ) Σ_{(x,u) ∈ τ*} [ ∂Q*_{t,n}(x_{t,n}, u_{t,n}) / ∂c_t(x, u) ] [ ∂c_t(x, u) / ∂θ ],

where the Boltzmann policy gives ∂ log π_{t,n}(u*_{t,n} | x_{t,n}) / ∂Q*_{t,n}(x_{t,n}, u_{t,n}) = −(1{u_{t,n} = u*_{t,n}} − π_{t,n}(u_{t,n} | x_{t,n})), and ∂Q*_{t,n}(x_{t,n}, u_{t,n}) / ∂c_t(x, u) = μ_{τ*}(x, u).

The computation graph implied by Prop. 4 is illustrated in Fig. 1. The graph consists of a cost representation layer and a differentiable planning layer, allowing end-to-end minimization of L(θ) via stochastic subgradient descent. The training and testing procedures are summarized in Alg. 1 and Alg. 2.
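The closed-form subgradient has a simple numerical interpretation: ∂Q*_t/∂c_t(x, u) counts how often the optimal path visits (x, u). A minimal sketch, where the small state and control sets are our own toy example:

```python
import numpy as np

def visitation_counts(tau, states, controls):
    """mu_tau(x, u): number of times trajectory tau visits state-control (x, u).

    By Eq. (8), Q*_t = <c_t, mu_{tau*}>, so mu_{tau*}(x, u) is a subgradient
    of Q*_t with respect to the cost entry c_t(x, u).
    """
    mu = np.zeros((len(states), len(controls)))
    for x, u in tau:
        mu[states.index(x), controls.index(u)] += 1.0
    return mu

states, controls = ["a", "b", "c"], ["right"]
tau_star = [("a", "right"), ("b", "right")]      # optimal path a -> b -> goal
mu = visitation_counts(tau_star, states, controls)

# Q* is linear in the cost along tau*: raising c("a", "right") by eps raises
# Q* by eps * mu("a", "right"), which is exactly the subgradient statement.
c = np.array([[1.0], [2.0], [5.0]])
q_star = float((c * mu).sum())
```

Because the A* search only touches promising states, the visitation counts, and hence the subgradient, are sparse, which is what makes the backpropagation in Prop. 4 efficient.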
5. Experiments
We evaluate our approach using the CARLA simulator (version 0.9.6) (Dosovitskiy et al., 2017), which provides high-fidelity autonomous vehicle simulation in urban environments. Demonstration data for training the cost function representation is collected from the maps Town01, Town02, Town03, and Town04, while the map Town05 is used for testing. Town05 is the largest map and includes different street layouts, junctions, and a freeway. In each map, we collect expert trajectories by running the autonomous navigation agent provided by the CARLA Python API. The expert finds the shortest path between two query points while respecting traffic rules, such as staying on the road and keeping in the current lane. Features not related to the experiment are disabled, including spawning other vehicles and pedestrians, and changing traffic signals. Each vehicle trajectory is discretized into a square grid of meter resolution. The robot state x is the grid-cell location, while the control u takes the robot to one of its neighboring grid cells. Trajectories that do not fit in the grid are discarded.

The ego vehicle is equipped with a lidar sensor of limited maximum range and a full horizontal field of view; the vertical field of view spans from facing forward to facing down at a fixed angular resolution. A fixed number of lidar rays is generated per scan P_t, and each point measurement is returned only if it hits an obstacle. The ego vehicle is also equipped with semantic segmentation cameras that display objects of 13 different classes in RGB images, including road, road line, sidewalk, vegetation, car, building, etc. The cameras face front, left, right, and rear, each capturing a 90° horizontal field of view. The semantic label of each lidar point is retrieved from the semantic segmentation image by projecting the lidar point into the camera's frame.

We compare our model with two baseline algorithms: Wulfmeier et al. (2016) and Wulfmeier et al. (2016) + semantics. Wulfmeier et al. (2016) use a neural network to learn a cost from lidar point clouds without semantics. The input to the neural network is a grid that stores the mean and variance of the points in each cell, as well as a binary indicator of cell visibility. We augment the grid features with the mode of the semantic labels in each cell to obtain the model Wulfmeier et al. (2016) + semantics, as a fair comparison with ours. Neural networks are implemented in the PyTorch library (Paszke et al., 2019) and trained with the Adam optimizer (Kingma and Ba, 2014) until convergence.
NLL ), control accuracy (
Acc ), trajec-tory success rate (
Traj. Succ. Rate ) and Modified Hausdorff Distance (
MHD ). More precisely, givena test set D test = (cid:8) ( x t,n , u ∗ t,n , P t,n , x g,n ) (cid:9) T n ,Nt =1 ,n =1 and a learned policy π with paramters θ ∗ , wedefine NLL ( D test , π ) = − (cid:80) Nn =1 T n (cid:80) N,T n n =1 ,t =1 log π t,n ( u ∗ t,n | x t,n ; P t,n , θ ∗ ) and Acc ( D test , π ) = (cid:80) Nn =1 T n (cid:80) N,T n n =1 ,t =1 { u ∗ t,n =arg max π t,n ( ·| x t,n ; P t,n , θ ∗ ) } . Traj. Succ. Rate records the success rate ofthe learned policy by iteratively rolling out its predicted controls. A trajectory is regarded as suc-cessful if it reaches the goal within twice the number of steps of the expert trajectory without hittingan obstacle.
MHD compares the rolled out trajectory τ L by the learned policy and the expert tra-jectory τ E and is defined as: MHD ( τ L , τ E ) = max (cid:110) T L (cid:80) T L t =1 d ( τ tL , τ E ) , T E (cid:80) T E t =1 d ( τ tE , τ L ) (cid:111) where d ( τ tA , τ B ) measures the minimum Euclidean distance from state τ tA to any state in τ B . Fig. 2 shows the performance of our model versus Wulfmeier et al. (2016) and Wulfmeier et al.(2016) + semantics using the metrics described above. Ours learns to generate policies closest tothe expert in new environments by scoring best in
NLL and
Acc . The predicted trajectory is alsoclosest to the expert by achieving the minimum
MHD . The results demonstrate that the semanticmap encoder captures more geometric as well as semantic information so that the cost function can EARNING N AVIGATION C OSTS FROM D EMONSTRATION WITH S EMANTIC O BSERVATIONS
Model                                  NLL     Acc (%)   Traj. Succ. Rate (%)   MHD
Wulfmeier et al. (2016)                0.595   86.1      –                      –
Wulfmeier et al. (2016) + semantics    0.613   82.7      88                     4.479
Ours                                   –       –         –                      –

Figure 2: Test results on the CARLA Town05 map. The best model for each evaluation metric is in bold.
Figure 3: Example of a predicted trajectory (red) at an intersection, with the goal in blue. The ground truth semantic map, the predicted semantic map, and the cost map at two time steps (40 and 80) are shown. Our model learns that the sidewalk is costly to traverse.

We notice that simply taking the mode of the semantic labels in each grid cell degrades the performance of Wulfmeier et al. (2016). We conjecture that taking the mode is a deterministic assignment, which can provide conflicting semantic information, while our model endorses a probabilistic semantic map encoder with Bayesian updates to avoid information loss. Fig. 3 shows an example of the predicted trajectory at an intersection. The semantic map visualizes the class of highest probability, which mostly reflects the ground truth. Sub-cell objects like road lines are captured in the semantic map distribution but not visualized in the most probable class. It is interesting to find that our model assigns low cost to the road in front of the robot, medium cost to sidewalks, and high cost to the road behind itself. This cost assignment is effective for the robot to navigate to the goal.
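As a concrete reading of the MHD metric defined above, here is a minimal NumPy sketch (the function name and toy trajectories are our own):

```python
import numpy as np

def mhd(tau_l, tau_e):
    """Modified Hausdorff Distance between two trajectories.

    tau_l, tau_e: arrays of shape (T, dim) holding trajectory states.
    MHD is the max of the two directed averages of per-state minimum
    Euclidean distances to the other trajectory.
    """
    def directed(a, b):
        # d(a_t, B): minimum Euclidean distance from each state of a to b.
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return d.min(axis=1).mean()
    return max(directed(tau_l, tau_e), directed(tau_e, tau_l))

# Two parallel straight-line trajectories separated by one unit.
tau_l = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
tau_e = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
d = mhd(tau_l, tau_e)
```

Averaging the directed distances (rather than taking their maximum, as the classical Hausdorff distance does) makes the metric less sensitive to a single outlier state along either trajectory.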
6. Conclusion
We propose an inverse reinforcement learning approach for inferring navigation costs from demonstrations with semantic observations. Our model introduces a new cost representation composed of a probabilistic semantic occupancy encoder and a cost encoder defined over the semantic features. The cost function can be optimized via backpropagation with a closed-form (sub)gradient. Experiments in the CARLA simulator show that our model outperforms methods that do not encode semantic information probabilistically over time. Our work offers a promising solution for learning semantic features in navigation and may enable efficient online learning in challenging conditions.
Acknowledgments
We gratefully acknowledge support from NSF CRII IIS-1755568, ARL DCIST CRA W911NF-17-2-0181, and ONR SAI N00014-18-1-2828.
References
Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.

David Dohan, Brian Matejek, and Thomas Funkhouser. Learning hierarchical semantic segmentations of lidar data. In International Conference on 3D Vision (3DV), pages 273–281. IEEE, 2015.

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017.

Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.

Lu Gan, Ray Zhang, Jessy W. Grizzle, Ryan M. Eustice, and Maani Ghaffari. Bayesian spatial kernel smoothing for scalable dense semantic mapping. arXiv preprint arXiv:1909.04631, 2019.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. In Computer Vision and Pattern Recognition (CVPR), 2017.

Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.

Armin Hornung, Kai M. Wurm, Maren Bennewitz, Cyrill Stachniss, and Wolfram Burgard. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 34(3):189–206, 2013.

A. Khan, C. Zhang, N. Atanasov, K. Karydis, V. Kumar, and D. D. Lee. Memory augmented control networks. In International Conference on Learning Representations (ICLR), 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.

Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334–1373, 2016.

M. Likhachev, G. Gordon, and S. Thrun. ARA*: Anytime A* with provable bounds on sub-optimality. In Advances in Neural Information Processing Systems, pages 767–774, 2004.

C. Lu, M. J. G. van de Molengraft, and G. Dubbelman. Monocular semantic occupancy grid mapping with convolutional variational encoder-decoder networks. IEEE Robotics and Automation Letters, 4(2):445–452, April 2019. doi: 10.1109/LRA.2019.2891028.

Gergely Neu and Csaba Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pages 295–302. AUAI Press, 2007.

H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto. Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1366–1373, 2017. doi: 10.1109/IROS.2017.8202315.

George Papandreou, Liang-Chieh Chen, Kevin P. Murphy, and Alan L. Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1742–1750, 2015.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, volume 7, pages 2586–2591, 2007.

Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pages 729–736. ACM, 2006.

Sunando Sengupta, Paul Sturgess, L'ubor Ladický, and Philip H. S. Torr. Automatic dense visual semantic mapping from street-level imagery. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 857–862. IEEE, 2012.

Naum Zuselevich Shor. Minimization Methods for Non-Differentiable Functions, volume 3. Springer Science & Business Media, 2012.

Yeeho Song. Inverse reinforcement learning for autonomous ground navigation using aerial and satellite observation data. Master's thesis, Pittsburgh, PA, August 2019.

L. Sun, Z. Yan, A. Zaganidis, C. Zhao, and T. Duckett. Recurrent-OctoMap: Learning state-based map refinement for long-term semantic mapping with 3-D lidar data. IEEE Robotics and Automation Letters, 3(4):3749–3756, October 2018.

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2154–2162. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6046-value-iteration-networks.pdf.

Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. The MIT Press, 2005. ISBN 0262201623.

Tianyu Wang, Vikas Dhiman, and Nikolay Atanasov. Learning navigation costs from demonstration in partially observable environments. In IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020.

Bichen Wu, Alvin Wan, Xiangyu Yue, and Kurt Keutzer. SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D lidar point cloud. In IEEE International Conference on Robotics and Automation (ICRA), pages 1887–1893. IEEE, 2018.

Markus Wulfmeier, Dominic Zeng Wang, and Ingmar Posner. Watch this: Scalable cost-function learning for path planning in urban environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2089–2095. IEEE, 2016.

Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.