Learning a Compact State Representation for Navigation Tasks by Autoencoding 2D-Lidar Scans
Christopher Gebauer and Maren Bennewitz
Abstract — In this paper, we address the problem of generating a compact representation of 2D-lidar scans for reinforcement learning in navigation tasks. So far, only little work focuses on the compactness of the provided state, which is a necessary condition to successfully and efficiently train a navigation agent. Our approach works in three stages. First, we propose a novel preprocessing of the distance measurements and compute a local, egocentric, binary grid map based on the current range measurements. We then autoencode the local map using a variational autoencoder, where the latent space serves as state representation. An important key for a compact and, at the same time, meaningful representation is the degree of disentanglement, which describes the correlation between each latent dimension. Therefore, we finally apply state-of-the-art disentangling methods to improve the representation power. Furthermore, we investigate the possibilities of incorporating time-dependent information into the latent space. In particular, we incorporate the relation of consecutive scans, especially ego-motion, by applying a memory model. We implemented our approach in Python using TensorFlow. Our datasets are simulated with pybullet as well as recorded using a Slamtec RPLIDAR A3. The experiments show the capability of our approach to highly compress lidar data, maintain a meaningful distribution of the latent space, and even incorporate time-dependent information.
I. INTRODUCTION
In recent years, a variety of mobile robots became available on the consumer market for a reasonable price and keep improving their astonishing capabilities. One of the most fundamental tasks for an operating robot is to safely navigate within the environment. Most systems today are capable of considering the nearby surroundings based on sensor data, and even react reasonably to moving obstacles if their motion is mathematically describable. In more complex situations, e.g., when close interactions with humans are involved, classical approaches are still able to successfully navigate [1]. However, to further improve the rather short-sighted and reactive behavior, deep reinforcement learning (RL) has been applied to achieve a navigation policy that results in a more socially compliant robot behavior [2].

To be able to learn a navigation policy, a precise observation of the surroundings is required. For indoor tasks a common choice is a 2D-lidar sensor, as it is robust and cheap. The drawbacks are the high dimensionality of the data and the hidden information due to its representation, as the angle θ of each range measurement is only recoverable with additional knowledge.

All authors are with the Humanoid Robots Lab, University of Bonn, Germany. This work has partly been supported by the German Research Foundation under Germany's Excellence Strategy, EXC-2070 - 390732324 (PhenoRob).
Fig. 1: The representation of the 2D-lidar scan at three different stages of our autoencoding pipeline. The top view shows a local image, which is passed as input. The robot pose is highlighted with a red circle. The bar plot in the middle shows the compressed state representation, on which we lay our interest in this work. Based on this compact state, the entire image at the bottom is reconstructed. It represents the distribution over the original image, where brighter regions show a higher probability of a pixel being occupied. Even though not directly visible for humans, a correlation between specific dimensions of the latent space and outer factors, e.g., wall orientation, exists.

In an RL setup, the usage of raw sensor measurements usually causes the agent to learn sluggishly and perform poorly in certain situations. This effect has been shown on the example of RGB images by Yarats et al. [3]. To overcome this, a powerful representation is either trained in advance [4], or concurrently but with stronger auxiliary losses [5]. Commonly, both approaches are based on an autoencoder (AE) architecture. The general idea of an AE is to compress an input into a latent space, where the target is to losslessly reconstruct the input based on this latent space. Even though Laskin et al. [6] show that contrastive learning outperforms an AE in representation power, the required data augmentation, e.g., cropping or rotating the input image, cannot be transferred to local sensor data.

The four contributions of our work, which mainly address the characteristics of the latent space as well as the incorporated information within, are the following. At first, we propose a, to our best knowledge, novel preprocessing to improve the general reconstruction performance, especially in cluttered regions and around sharp edges. We then train a variational autoencoder (VAE) to compress and reconstruct the preprocessed input into a probability distribution, see Fig. 1 for an example. In comparison to existing approaches [7], [8], [9], we focus on a compact latent space and, additionally, aim for high, but uncorrelated, activity of each latent dimension, also known as disentangled representation [10]. Finally, we investigate the capability of incorporating time-dependent knowledge into the latent space by using a memory model based on Ha et al. [11]. The neural networks are implemented and optimized using TensorFlow [12]. The simulated datasets are generated with pybullet [13] and the real-world dataset is recorded with a Slamtec RPLIDAR A3. In our experimental evaluation, we show the reconstruction power of our approach, as well as the compactness of the resulting state representation and the capability to incorporate time-dependent information.

II. RELATED WORK
A lot of the related work in the area of autoencoding lidar scans focuses, different from ours, on the reconstruction of 3D data. Caccia et al. [14] reconstruct 3D-lidar data recorded in outdoor environments. The focus lays on the preprocessing into a 2D grid and the reconstruction of corrupted data, e.g., by noise, to recover the original scan. Nicolai et al. [15] use the denoising effect to improve the estimation of motion in SE(3) for odometry refinement. However, the major purpose of the deep AE is to denoise the 3D-lidar scan, while the scan matching is performed using the reconstructed scan. While we focus on a compact state representation, both works are rather interested in the generative capabilities of the autoencoder. Similarly, Yin et al. [16] concentrate on autoencoding 3D-lidar data for odometry refinement. The authors investigate the interest point selection by two parallel autoencoding pipelines, using a 2D image and a 3D voxel grid representation of the 3D-lidar data. The received interest points are further used for odometry refinement.

Korthals et al. [17] apply a VAE for multi-modal sensor fusion. The authors encode a 2D-lidar scan and an RGB image of the identical scene to fuse them using a third encoder. This work does focus on 2D-lidar data; however, the latent space itself and its compactness is not further discussed. Lundell et al. [7] use an AE to reconstruct range measurements that are minimal along the complete height of the robot. This is required when the obstacle is below or above the height level of the range measurements but the robot itself would hit the obstacle, e.g., in the case of a table. The purpose is to support the navigation using the reconstructed data, while the latent space itself is not further considered.

Schlichting et al. [8] train an AE to reduce the dimensionality of a 2D-lidar scan from an outdoor scene. For localization, matching scans are computed with a k-means clustering algorithm directly in the latent space. The work analyzes the size of the latent space, but rather in the context of scan matching performance than actual compactness of the latent space. Wakita et al. [9] apply a VAE for map construction and self-localization. A previously stored combination of features and a global pose is used to find the closest match in feature space with the current encoded scan. The final pose is determined by decoding the stored features and matching the current scan using ICP. The authors mention a blurring on sharp edges, also called step edges. This effect appears in the reconstructed input when a high difference between neighboring range measurements due to the shape of the observed obstacle is present.

We noticed that step edges only occur when directly using the 1D array of the raw range measurements as input. Therefore, we present a novel preprocessing for autoencoding 2D-lidar scans to overcome the problem of step edges. We further investigate the capability of maximal compression with lossless reconstruction, as well as the correlation between each of the latent dimensions. As the required information for navigation tasks is not fully included within a single lidar scan, e.g., ego-motion, we also add this knowledge by applying a memory model first proposed by Ha et al. [11].

III. BACKGROUND
In this section we briefly outline the required background, including the variational autoencoder (VAE) [18] and the state-of-the-art disentangling methods.
A. Variational Autoencoder
Autoencoders (AE) in general consist of an encoder or recognition model q, which compresses the data x into a latent space z ∈ R^k. The dimension or compactness of the latent space is given by the hyperparameter k. The second part of an AE is the decoder or generative model p, which returns the reconstructed input x′ given the latent space. The target is to minimize the error between x and x′.

While the original concept of an AE is deterministic, Kingma et al. [18] applied amortized variational inference to propose a stochastic AE, known as variational autoencoder. The recognition model q(z|x) is changed to return a distribution over the latent space, which captures uncertainty. The generative model p(x′|z) returns a distribution over the reconstructed input given the latent space. Commonly, both models are parametrized by a neural network. The target is formulated by maximizing a variational lower bound on the marginalized log-likelihood of the generative model

log p(x) ≥ E_{q(z|x)}[ log p(x|z) ] − β · D_KL( q(z|x) || p(z) ).   (1)

The scaling factor β will be introduced in the next section and for now is assumed to be 1. Intuitively, the first term on the right in the above equation can be interpreted as a reconstruction loss. The second term is a penalty to prevent that the recognition model becomes too deterministic and concurrently maintain its simplicity. More precisely, the reconstruction term maximizes the log-likelihood of the generative model under the expectation of the recognition model. This can be directly computed, while the expectation can be estimated with one sample if the batch size is large enough [18]. The penalty for simplicity is a distance measure, represented by the Kullback-Leibler (KL) divergence, between the recognition model and a prior p(z). The prior is commonly defined by a diagonal standard Gaussian to emphasize the recognition model to be simple. Especially the factorized character of the prior has an important role, as discussed in the next section.

Fig. 2: Proposed autoencoding pipeline. The range measurements are preprocessed into the local image. The encoder compresses the image into a stochastic latent space z. This is the quantity of interest, as marked by the double arrow. The decoder reconstructs a distribution over the original image.
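As an illustration, the objective in Eq. (1) can be written down as a short training loss. The following is a minimal sketch in TensorFlow with TensorFlow Probability, assuming encoder and decoder are Keras models that return the distributions described above (a diagonal Gaussian and an independent Bernoulli); all names are ours and not part of the released implementation.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def beta_vae_loss(encoder, decoder, x, beta=1.0):
    # negative variational lower bound of Eq. (1) for a batch of inputs x
    q_z = encoder(x)                    # q(z|x), tfd.Normal with shape [batch, k]
    z = q_z.sample()                    # one sample suffices for large batches [18]
    p_x = decoder(z)                    # p(x'|z), tfd.Bernoulli over the pixels
    rec = tf.reduce_sum(p_x.log_prob(x), axis=[1, 2, 3])   # reconstruction term
    prior = tfd.Normal(loc=0.0, scale=1.0)                 # diagonal standard Gaussian
    kl = tf.reduce_sum(q_z.kl_divergence(prior), axis=-1)  # simplicity penalty
    return tf.reduce_mean(-rec + beta * kl)                # minimize the negative bound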
B. Disentanglement

Disentanglement describes the correlation between one key factor of a dataset, e.g., age or skin color in the context of face recognition, and one specific latent dimension [10]. Each dimension should correlate with exactly one factor, while not being affected by any of the others. VAEs naturally induce disentanglement due to the penalty given by the KL divergence. The factorized character of the prior emphasizes the latent space dimensions to be disentangled. Higgins et al. [19] amplify this effect by introducing the above mentioned hyperparameter β ≥ 1, see Eq. (1).

Kim et al. [20] further investigated the composition of the KL divergence. The term splits up into the mutual information between x and z as well as a KL divergence between the factorized prior and the marginalized recognition model q(z). The authors argue that the benefit of β-VAE originates in the second KL divergence, while the loss in reconstruction performance comes with penalizing the mutual information. Only penalizing D_KL( q(z) || p(z) ) or enforcing the factorizability of q(z) is not feasible, as the marginalization step is too expensive. However, as it is possible to sample from q(z) without knowing it, using the recognition model, one can train a discriminator network to predict whether a sample is drawn from q(z) or from a truly factorized opponent q̄(z). This is trained in an adversarial fashion and added as an additional penalty, where the VAE is emphasized to fool the discriminator, while the latter tries to precisely tell samples from q(z) and q̄(z) apart.
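The core of this penalty can be sketched in a few lines: samples from the truly factorized q̄(z) are obtained by independently permuting each latent dimension across the batch, and the density-ratio trick turns the discriminator output into a penalty for the VAE. This is only a sketch under the assumption of a two-class discriminator network (e.g., a small MLP, trained with the usual cross-entropy on samples from q(z) and q̄(z)); hyperparameters and architecture are not taken from the paper.

import tensorflow as tf

def permute_dims(z):
    # shuffle each latent dimension independently across the batch,
    # turning samples from q(z) into samples from the factorized q_bar(z)
    return tf.stack([tf.random.shuffle(z[:, j]) for j in range(z.shape[-1])],
                    axis=1)

def tc_penalty(discriminator, z):
    # density-ratio trick: the logit difference log D(z) - log(1 - D(z))
    # estimates how far q(z) is from being factorized; the VAE is trained
    # to keep this value small, i.e., to fool the discriminator
    logits = discriminator(z)           # [batch, 2] class logits
    return tf.reduce_mean(logits[:, 0] - logits[:, 1])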
IV. PROPOSED AUTOENCODING PIPELINE

In this section we describe each step of our autoencoding pipeline for 2D-lidar scans. The interest is to train a compact and meaningful state representation for further usage in an RL setup for navigation tasks. An overview is given in Fig. 2. At first we introduce a novel preprocessing for autoencoding 2D-lidar scans and then present the network architecture alongside the important hyperparameters. Finally, we focus on incorporating time-dependent information, such as ego-motion, into the latent space by applying a memory model.

Encoder                              | Decoder
Input: 320 × 320 local image         | Input: latent sample ∈ R^k
Conv., stride 2, ReLU, BN            | Dense, 256, ReLU
Max pool.                            | Dense, ReLU, reshaped
Conv., stride 2, ReLU, BN            | Trans. Conv., stride 2, ReLU
Conv., stride 1, ReLU, BN            | Trans. Conv., stride 2, ReLU
Avg. pool.                           | Trans. Conv., stride 2, ReLU
Conv., stride 2, ReLU, BN            | Trans. Conv., stride 2, ReLU
Conv., stride 2, ReLU, BN, flatten   | Trans. Conv., stride 2, ReLU
Dense, 256, ReLU                     | Trans. Conv., stride 2, ReLU
Dense, 2k                            | Conv., stride 1
Output: Diag. Gaussian               | Output: Ind. Bernoulli

TABLE I: Network architecture from top to bottom.
A. Preprocessing
Wakita et al. [9] mention that a VAE smooths sharp edges during reconstruction. We had similar observations and additionally noticed even worse behavior in cluttered areas, where the ranges of neighboring measurements underlie a lot of change. During our work we observed that this behavior originates in the incapability of the VAE to understand the geometric correlation between the scans in Cartesian space based on raw range measurements.

To overcome this, we preprocess the range measurements r into a local, egocentric map, representing the occupancy in a 20 m × 20 m area around the sensor. First, we normalize the data with a maximum range of 10 m into the interval r ∈ [0, 1] and set every measurement that is invalid or outside this range to zero. The normalized polar coordinates are then converted into Cartesian space and casted to indices after multiplying them with the desired map resolution. The indices are used to compute the desired binary image, referred to as local image. Of particular note is that our local image only requires local sensor data and no global information, in contrast to the one proposed by Regier et al. [21].
B. Network Architecture

Our network architecture for the VAE is shown in Tab. I. Each of the convolutional layers marked with BN has a preceding batch normalization layer [22]. The final layers of the encoder and decoder are distribution layers [23]. For the encoder we use a diagonal Gaussian distribution and for the decoder an independent Bernoulli distribution. Regarding the encoder, the second dense layer is used to parametrize the diagonal Gaussian distribution and therefore has 2k nodes, i.e., per latent dimension the mean value and the standard deviation.

C. Memory Model

A navigation task cannot be solved purely based on a single sensor scan, because time-dependent quantities such as direction of travel and velocity are hidden. While it is possible to pass this information in addition, our proposed state representation already inherits it. As a result, the agent does not need to learn a correlation between the state and ego-motion as it is already represented. Inspired by Lee et al. [24], we apply a memory model in combination with the VAE to incorporate such information into the latent representation.

The VAE architecture described in the previous section is not changed. Concurrently with the VAE, a memory model is trained to predict, based on a given initial state and a series of actions, a future distribution over the latent space q̂(z|h, a), which we refer to as predictive model. The predictive model q̂ is optimized to be similar to the recognition model q. Additionally, the target is to minimize the error between the actual future scan and the reconstruction based on a sample from the predictive model. We slightly modified the loss function in Eq. (1) to meet the above requirements:

L = Σ_{t=0}^{T} ( E_{q(z_t|x_t)}[ log p(x_t|z_t) ] − D_KL( q(z_t|x_t) || p(z) ) )
  + Σ_{t=1}^{T} E_{q̂(z_t|h_t, a_{t−1})}[ log p(x_t|z_t) ]
  − Σ_{t=1}^{T} D_KL( q̂(z_t|h_t, a_{t−1}) || q(z_t|x_t) ).   (2)

The first two terms are identical to the VAE setup but sum over the entire sequence. The second line is the reconstruction term for the predictive model q̂. Here we are not summing over the entire sequence, as the prediction starts with the second scan. The last term encourages the predictive model to have a distribution similar to the recognition model for the same time step by penalizing the KL divergence. In the first time step, the predictive model is initialized with a sample from the recognition model. However, afterwards it only receives its previous states and the executed action, i.e., rotational and translational velocity commands.

While Lee et al. [24] use two stochastic layers, for our memory model a GRU [25] network with one layer and 256 units works best. As we require a distribution over the predicted latent space and a GRU network is deterministic, we use a subsequent stochastic layer. The latter is represented by two dense layers and parametrizes, based on the GRU's hidden state, a diagonal Gaussian distribution, which has the same shape as the one from the recognition model.

Concluding, we have introduced a novel preprocessing as well as our network structure to encode lidar scans into a compact and meaningful state representation. To incorporate time-dependent knowledge, we proposed a memory model.
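To make the predictive model concrete, the following is a minimal sketch of one possible implementation: a single-layer GRU with 256 units whose hidden state is initialized from the encoded first scan and which is unrolled over the executed actions, followed by the stochastic layer that parametrizes the diagonal Gaussian. How the initial latent sample enters the GRU is our assumption and may differ from the actual implementation.

import tensorflow as tf
import tensorflow_probability as tfp

class MemoryModel(tf.keras.Model):
    def __init__(self, latent_dim):
        super().__init__()
        self.init = tf.keras.layers.Dense(256, activation='tanh')
        self.gru = tf.keras.layers.GRU(256, return_sequences=True)
        self.mean = tf.keras.layers.Dense(latent_dim)
        self.std = tf.keras.layers.Dense(latent_dim, activation='softplus')

    def call(self, z0, actions):
        # z0: sample from the recognition model for the first scan, [batch, k]
        # actions: rotational and translational velocities, [batch, T, 2]
        h = self.gru(actions, initial_state=self.init(z0))
        # stochastic layer: diagonal Gaussian over the predicted latent space
        return tfp.distributions.Normal(self.mean(h), self.std(h) + 1e-4)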
V. EXPERIMENTAL EVALUATION

In this section we present experiments to evaluate our autoencoding pipeline for 2D-lidar data. In particular, we first demonstrate the increased reconstruction power due to the preprocessing into our local image. Afterwards we show the impact of current state-of-the-art disentangling methods on the latent space regarding the general activity of each dimension. Finally, we show the ability of the memory model to include time-dependent information.
A. Experimental Environments
First, we shortly introduce the characteristics of our experimental environments. We simulated two different categories of environments using pybullet [13]. The simulations run at a rate of 5 Hz, and to mimic noisy sensors we randomly invalidated measurements. The first category consists of a small rectangular room. Both side lengths are uniformly drawn from l ∈ [4 m, m] and a small round pole with a diameter of 0.2 m is placed at random in the room. During simulation, the robot is set to different poses and randomly drives through the environment until a collision occurs or a maximum number of steps is reached. We collected 10 trajectories per room, with 250 randomly sampled rooms in total. This dataset is referred to as simple environment, see Fig. 3 for an example.

In a next step, we slightly increased the room size to l ∈ [5 m, m] and added additional walls as well as increased the number of poles, which now can be round or square with alternating diameter d ∈ [0. m, . m]. The likelihood of poles being placed close to each other or in corners is increased to simulate cluttered regions. We simulated 250 trajectories per room, at a total number of 100 rooms. This dataset is referred to as complex environment, see Fig. 5 for an example. For the memory model we additionally require smoother action trajectories, as these are part of the dataset to understand the robot's kinematics. We generated these by applying different action selection schemes: random, wide left/right curve, serpentine motion forward, static left/right rotation, and straight motion forward. To ensure a steady but random motion, the actions are drawn uniformly but with a minimal value. The signs of the actions are fixed and given by the selection schemes described above.

For the real-world experiments we collected scans with a Slamtec RPLIDAR A3 attached to a turtlebot. We roughly collected one hour of scans at 20 Hz within our entire building, including hallways, offices, kitchens, and labs. This dataset is referred to as real environment, see Fig. 1 for an example. We use a latent dimension of k = 16 for the real and simple environment and k = 32 for the complex environment. Both values turned out to be the minimal values without noticing a greater reconstruction loss.

B. Reconstruction Error
In this section, we show the improvement due to our preprocessing. We used a resolution of 320 pixels for the local image, which corresponds to 6.25 cm per pixel in a 20 m × 20 m area. This resolution provides enough information for cluttered regions, without being too expensive in computation. Regarding the experiments, when trained directly on the raw range measurements, the network architecture of the recognition model is strongly related to the one used by Pfeiffer et al. [26], which we refer to for further details. For the convolutional layers, we used circular padding [27]. The generative model is similar to the one proposed in Tab. I, but we adapted the filter numbers, kernel sizes, and strides to meet the encoder and data structure. Additionally, we had to take the average of neighboring measurements if the actual measurement was invalid, as setting it to zero heavily decreased the performance.

Fig. 3: Reconstruction sample from the simple environment. We compare the autoencoding pipeline based on our local image and on raw measurements. False positives are marked in color, blue ones are based on the local image and red ones on raw measurements. The robot pose is highlighted with a red circle, which is not in the center as we cropped the free space. The pole is highlighted with a yellow circle. As can be seen, the undesired blurred edge occurs only with the raw measurements (as highlighted with a blue circle).

In Fig. 3 we compare the two methods on a sample from the simple environment. As can be seen, learning on raw measurements blurs sharp edges, which increases false positives (colored pixels). The blurred edge is highlighted by the blue circle, caused by the pole (yellow circle) and the suddenly changing range measurements. This behavior does not occur when we use our local image as input.

To further compare both methods we averaged the false positives and false negatives over each of the datasets. As many errors appear close to walls, see Fig. 3, and are less critical, we additionally use a corrected version of each measure. The scores are described and listed in Tab. II. The table shows that the local image increases the performance in each environment. Only the false negatives in the simple environment are slightly worse, but in a negligible magnitude. Furthermore, even for unseen rooms or in a different environment category, the autoencoder captures the relevant parts, such as walls and greater cluttered regions, and is almost capable of keeping the level of performance. When transferring the autoencoder for the simple environment from simulation into a similar real-world environment without retraining, especially the false positives increase, which also impacts the MSE. This is actually expected, as the average number of occupied pixels dropped by a factor of roughly 2.6 from simulation to real world, and therefore the autoencoder expected more pixels to be occupied for the same shapes. However, all walls and the pole are correctly reconstructed, as can be seen in the accompanying video. As we further noticed a slight drop in performance for the real dataset in comparison to our simulation, we downsampled the dataset from 20 Hz to 5 Hz to remove almost identical scans. This increases the difficulty of the learning task and therefore forces the network to further generalize. The table shows that this actually improves the reconstruction results. Note that for the VAE based on raw measurements the downsampling made the scores worse, as it overfits.

Train        | Eval.         | Method | FP    | cFP   | FN    | cFN   | MSE
Simple       | Simple        | our    | 0.325 | 0.018 | 0.328 | 0.012 | 0.326
             |               | raw    | 0.362 | 0.043 | 0.323 | 0.015 | 0.643
Simple       | Real          | our    | 1.231 | 0.219 | 0.173 | 0.013 | 0.859
Comp.        | Simple        | our    | 0.340 | 0.033 | 0.343 | 0.022 | 0.347
Comp.        | Comp.         | our    | 0.386 | 0.063 | 0.406 | 0.042 | 0.398
             |               | raw    | 0.487 | 0.139 | 0.407 | 0.055 | 0.827
Comp.        | Unseen Comp.  | our    | 0.432 | 0.091 | 0.428 | 0.063 | 0.446
Real         | Real          | our    | 0.429 | 0.144 | 0.432 | 0.106 | 0.444
             |               | raw    | 1.684 | 1.264 | 0.594 | 0.312 | 1.440
Real, 20 Hz  | Real, 20 Hz   | our    | 0.492 | 0.201 | 0.523 | 0.158 | 0.512
             |               | raw    | 1.529 | 1.130 | 0.548 | 0.277 | 1.361

TABLE II: Normalized error for the preprocessing methods within each of the complete datasets. The local image is indicated by our, the raw measurements by raw. The first column indicates in which environment the autoencoder is trained, the second in which it is evaluated. We average false positives (FP) and false negatives (FN) per image over each dataset. The corrected false positives (cFP) are computed by removing the FPs that are next to a truly occupied pixel. False negative pixels next to a positively estimated pixel are removed to get the corrected false negatives (cFN). These quantities are computed based on samples from the generative model. To also take the mean probability into account, we computed the mean squared error (MSE) for the expectation of the generative model. Each score is normalized by the average number of occupied pixels per image in a dataset. Our pipeline based on the local image clearly outperforms the one based on raw range measurements. After Welch's t-test [28], the local image leads to a significantly smaller MSE on all datasets (for a level of 0.01). The rows Comp./Simple and Comp./Unseen Comp. show the performance of the autoencoder trained on the complex dataset, but tested in the simple environment and on unseen rooms. The row Simple/Real represents the transfer from the simple environment into a similarly rebuilt real-world environment, as can also be seen in the accompanying video.
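The corrected scores from Tab. II can be sketched compactly; for instance, the corrected false positives remove every false positive adjacent to a truly occupied pixel via a binary dilation of the ground truth. The sketch below assumes boolean NumPy arrays and normalizes per image, whereas Tab. II normalizes by the dataset average of occupied pixels.

import numpy as np
from scipy.ndimage import binary_dilation

def corrected_false_positives(pred, gt):
    # pred: sampled reconstruction, gt: local image (both boolean arrays)
    fp = pred & ~gt                        # plain false positives
    near_occupied = binary_dilation(gt)    # occupied pixels and their neighbors
    cfp = fp & ~near_occupied              # drop FPs next to occupied pixels
    return cfp.sum() / max(gt.sum(), 1)    # normalize by occupied pixels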
C. Disentanglement
In this section we investigate the activity of each latent dimension when applying current state-of-the-art disentangling methods on prerecorded sequences within the simple environment. These sequences contain 100 consecutive scans with simple motion patterns, e.g., driving forward facing a wall and stopping right in front of it. However, it is impossible to directly quantify the degree of disentanglement, as the key factors (see Sec. III) of our datasets are unknown and difficult to define at all. We still assume that each of the sequences only interferes with a limited set of key factors and therefore should have a minimal number of active latent dimensions or, in general, low activity.

This is evaluated in Fig. 4 with one network per β-value. For β-VAE [19], increasing β to emphasize disentanglement clearly reduces the general activity, while especially the least active dimensions keep getting less active. In contrast, the most active dimensions stay, except for one outlier, at the same level of activity, which is desirable as these carry the relevant information. Concurrently, the reconstruction error only slightly increases. Generally, the bigger β the better, but above β = 2 fine structures such as the pole start to disappear due to generalization. FactorVAE [20] required a lot of additional fine tuning and unfortunately did not result in the desired activity reduction.

Fig. 4: Quantification of latent space activity under different degrees of disentanglement, represented by β. The values for β-VAE start at β = 1, which is a basic VAE. For FactorVAE the penalty comes in addition to the VAE cost function and therefore starts at β = 0. We computed the mean squared error (MSE) from Tab. II and counted the number of active dimensions (AD). We used the difference in z (Dz) for consecutive scans and compute the average over the most active (Dz-most) and least active dimensions (Dz-least). At last we counted the dimensions whose difference at least once exceeded a threshold (Dz-thres). All values are normalized by the values of a basic VAE. As can be seen, all values from β-VAE drop with increasing β, besides the ones that are important for carrying the information (Dz-most), as concurrently the MSE stays almost constant. FactorVAE does not lead to this behavior.
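The activity measures of Fig. 4 can be computed directly from the encoded sequences; a minimal sketch is given below, where z holds the latent means of one sequence with shape [T, k]. The number of dimensions counted as most/least active and the threshold are our (hypothetical) choices, as the exact values are not stated.

import numpy as np

def activity_measures(z, threshold=0.1, n_top=4):
    dz = np.abs(np.diff(z, axis=0))        # difference in z between consecutive scans
    per_dim = dz.mean(axis=0)              # average activity per latent dimension
    order = np.argsort(per_dim)
    return {
        'Dz-most': per_dim[order[-n_top:]].mean(),            # most active dims
        'Dz-least': per_dim[order[:n_top]].mean(),            # least active dims
        'Dz-thres': int((dz.max(axis=0) > threshold).sum()),  # dims exceeding threshold
    }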
D. Motion Prediction

In this section we evaluate the performance of the memory model. The purpose is to incorporate all time-dependent knowledge within the latent space to be able to fully rely on this state representation without additional sources of information. This makes it unnecessary to later learn the relation between motion and the state. The purpose of the memory model is to capture time-dependent information along one sequence, which in our experiments consists of ten consecutive scans at a rate of 5 Hz. Fig. 5 shows for a given sequence the first and last scan, as well as a decoded sample from the recognition and predictive model for the last scan. It is obvious that the rotational and translational change is captured almost perfectly by the memory model (bottom right), while only the sharpness of the short wall in the free space is lost.

To further show the capability of including time-dependent information within the state representation, we trained a one-layered network to predict the current velocity. The recognition and memory model are part of the preprocessing for this experiment and the resulting latent space acts as input. When the memory model is not incorporated, the velocity predictions are basically random. Unfortunately, with the memory model the error barely drops.
Fig. 5: Future prediction of the memory model. The top left shows the initial state; on the top right is the future state that needs to be predicted. The robot pose is highlighted with a red circle and poles with a yellow circle. The image on the bottom left shows the reconstruction from the recognition model. The reconstruction from the predictive model, purely based on the initial state and the sequence of actions, is on the bottom right. All four images are cropped identically to reduce visible free space. It can be seen that the memory model almost perfectly predicts the future scan.

In comparison, when directly trained on the hidden state of the memory model, the network error is heavily reduced and leads to an almost perfect velocity prediction. We argue that the recognition model forgets the hidden state as it is not required for the scan reconstruction. We counter that by adding a penalty for reconstructing the hidden state of the memory model based on the latent space. This ensures that the information stays within the latent space, without using any target quantity, e.g., the velocity itself. Using a small weighting factor for the penalty does not change the performance when reconstructing the lidar scan, but reduces the velocity prediction error by one order of magnitude. This is almost as good as directly predicting based on the hidden state, and already perfectly captures the major direction of travel with minor offset. This shows that our autoencoding pipeline is capable of incorporating further knowledge into the latent space, as long as this information is task-relevant and captured by the objective.
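The probe and the auxiliary penalty described above could be set up as in the following sketch: a single dense layer predicts the velocity from the latent state, and a second head reconstructs the memory model's hidden state from the latent space so that this information is not forgotten. The weighting factor and layer sizes are our assumptions.

import tensorflow as tf

velocity_probe = tf.keras.layers.Dense(2)     # predicts (v, omega) from z
hidden_decoder = tf.keras.layers.Dense(256)   # reconstructs the GRU hidden state

def probe_and_penalty(z, hidden_state, velocity, aux_weight=0.01):
    # evaluation probe: velocity regression from the latent state alone
    probe_loss = tf.reduce_mean(tf.square(velocity_probe(z) - velocity))
    # small auxiliary penalty keeping the hidden-state information inside z,
    # without using the velocity itself as a target
    hidden_loss = tf.reduce_mean(tf.square(hidden_decoder(z) - hidden_state))
    return probe_loss, aux_weight * hidden_loss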
VI. CONCLUSION
In this paper, we presented an autoencoding pipeline for 2D-lidar data and proposed a novel preprocessing that improves the overall performance in comparison to autoencoding the raw sensor data. We evaluated the compactness and correlation of our state representation using simulated and real-world data. To show that further knowledge can be incorporated within the state representation, we concurrently trained a memory model to capture the ego-motion of the sensor. Based on our state representation, already a small network can predict time-dependent information. Concluding, our autoencoding pipeline is capable of incorporating highly relevant information for a navigation task within a compact state representation.
REFERENCES

[1] G. Ferrer, A. G. Zulueta, F. Cotarelo, and A. Sanfeliu, "Robot social-aware navigation framework to accompany people walking side-by-side," Autonomous Robots, 2017.
[2] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, "Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning," Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2018.
[3] D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus, "Improving sample efficiency in model-free reinforcement learning from images," arXiv preprint, 2019.
[4] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, "Deep spatial autoencoders for visuomotor learning," Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2016.
[5] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, "Reinforcement learning with unsupervised auxiliary tasks," CoRR, 2016.
[6] M. Laskin, A. Srinivas, and P. Abbeel, "CURL: Contrastive unsupervised representations for reinforcement learning," Proc. of the Intl. Conf. on Machine Learning (ICML), 2020.
[7] J. Lundell, F. Verdoja, and V. Kyrki, "Hallucinating robots: Inferring obstacle distances from partial laser measurements," Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2018.
[8] A. Schlichting and U. Feuerhake, "Global vehicle localization by sequence analysis using lidar features derived by an autoencoder," IEEE Intelligent Vehicles Symposium, 2018.
[9] S. Wakita, T. Nakamura, and H. Hachiya, "Laser variational autoencoder for map construction and self-localization," Proc. of the IEEE Intl. Conf. on Systems, Man, and Cybernetics (SMC), 2018.
[10] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 2013.
[11] D. Ha and J. Schmidhuber, "World models," CoRR, 2018.
[12] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015.
[13] E. Coumans and Y. Bai, "PyBullet, a Python module for physics simulation for games, robotics and machine learning," 2016.
[14] L. Caccia, H. van Hoof, A. Courville, and J. Pineau, "Deep generative modeling of LiDAR data," Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2019.
[15] A. Nicolai and G. A. Hollinger, "Denoising autoencoders for laser-based scan registration," IEEE Robotics and Automation Letters, 2018.
[16] D. Yin, Q. Zhang, J. Liu, X. Liang, Y. Wang, J. Maanpää, H. Ma, J. Hyyppä, and R. Chen, "CAE-LO: Lidar odometry leveraging fully unsupervised convolutional auto-encoder for interest point detection and feature description," arXiv preprint, 2020.
[17] T. Korthals, M. Hesse, J. Leitner, A. Melnik, and U. Rückert, "Jointly trained variational autoencoder for multi-modal sensor fusion," Proc. of the Intl. Conf. on Information Fusion (FUSION), 2019.
[18] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," Intl. Conf. on Learning Representations (ICLR), 2013.
[19] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," Intl. Conf. on Learning Representations (ICLR), 2017.
[20] H. Kim and A. Mnih, "Disentangling by factorising," Proc. of the Intl. Conf. on Machine Learning (ICML), 2018.
[21] P. Regier, L. Gesing, and M. Bennewitz, "Deep reinforcement learning for navigation in cluttered environments," Proc. of the Intl. Conf. on Machine Learning and Applications (CMLA), 2020.
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," Proc. of the Intl. Conf. on Machine Learning (ICML), 2015.
[23] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. D. Hoffman, and R. A. Saurous, "TensorFlow distributions," CoRR, 2017.
[24] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine, "Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model," arXiv preprint, 2019.
[25] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," CoRR, 2014.
[26] M. Pfeiffer, M. Schaeuble, J. I. Nieto, R. Siegwart, and C. Cadena, "From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots," CoRR, 2016.
[27] S. Schubert, P. Neubert, J. Pöschmann, and P. Protzel, "Circular convolutional neural networks for panoramic images and laser data," Proc. of the IEEE Intelligent Vehicles Symposium (IV), 2019.
[28] B. L. Welch, "The generalization of 'Student's' problem when several different population variances are involved," Biometrika, 1947.