Deep Generative Modeling of LiDAR Data
Lucas Caccia, Herke van Hoof, Aaron Courville, Joelle Pineau
MILA, McGill University; MILA, Université de Montréal; University of Amsterdam; CIFAR Fellow

Abstract — Building models capable of generating structured output is a key challenge for AI and robotics. While generative models have been explored on many types of data, little work has been done on synthesizing lidar scans, which play a key role in robot mapping and localization. In this work, we show that one can adapt deep generative models for this task by unravelling lidar scans into a 2D point map. Our approach can generate high quality samples, while simultaneously learning a meaningful latent representation of the data. We demonstrate significant improvements against state-of-the-art point cloud generation methods. Furthermore, we propose a novel data representation that augments the 2D signal with absolute positional information. We show that this helps robustness to noisy and imputed input; the learned model can recover the underlying lidar scan from seemingly uninformative data.
I. INTRODUCTION

One of the main challenges in mobile robotics is the development of systems capable of fully understanding their environment. This non-trivial task becomes even more complex when sensor data is noisy or missing. An intelligent system that can replicate the data generation process is much better equipped to tackle inconsistency in its sensor data. There is significant potential gain in having autonomous robots equipped with data generation capabilities, which can be leveraged for reconstruction, compression, or prediction of the data stream.

In autonomous driving, information from the environment is captured from sensors mounted on the vehicle, such as cameras, radars, and lidars. While a significant amount of research has been done on generating RGB images, relatively little work has focused on generating lidar data. These scans, represented as an array of three-dimensional coordinates, give an explicit topography of the vehicle's surroundings, potentially leading to better obstacle avoidance, path planning, and inter-vehicle spatial awareness.

To this end, we leverage recent advances in deep generative modeling, namely variational autoencoders (VAE) [1] and generative adversarial networks (GAN) [2], to produce a generative model of lidar data. While the VAE and GAN approaches have different objectives, they can be used in conjunction with Convolutional Neural Networks (CNN) [3] to extract local information from nearby sensor points. Unlike some approaches for lidar processing, we do not convert the data to voxel grids [4], [5]. Instead, we build off existing work [6] which projects the lidar scan into a 2D spherical point map. We show that this representation is fully
compatible with deep architectures previously designed for image generation. Moreover, we investigate the robustness of this approach to missing or noisy data, a crucial property for real world applications. We propose a simple, yet effective way to improve the model's performance when the input is degraded. Our approach consists of augmenting the 2D map with absolute positional information, through extra (x, y, z) coordinate channels. We validate these claims through a variety of experiments on the KITTI [7] dataset.

Our contributions are the following:
• We provide a fully unsupervised method for both conditional and unconditional lidar generation.
• We establish an evaluation framework for lidar reconstruction, allowing the comparison of methods over a spectrum of different corruption mechanisms.
• We propose a simple technique to help the model process noisy or missing data.

Fig. 1: Best viewed in color. Top: real LiDAR sample from the test set. Middle: reconstruction from our proposed model. Bottom: reconstruction from the baseline model.
II. RELATED WORK
A. Lidar processing using Deep Learning
The majority of papers applying deep learning methods to lidar data present discriminative models to extract relevant information from the vehicle's environment. Dewan et al. [8] propose a CNN for pointwise semantic segmentation to distinguish between static and moving obstacles. Caltagirone et al. [9] use a similar approach to perform pixel-wise classification for road detection. To leverage the full 3D structure of the input, Bo Li [10] uses 3D convolutions on a voxel grid for vehicle detection. However, processing voxels is computationally heavy and does not leverage the sparsity of LiDAR scans. Engelcke et al. [5] propose an efficient 3D convolutional layer to mitigate these issues.

Another popular approach [6], [11]–[13] to avoid using voxels relies on the inherent two-dimensional nature of lidars. It consists of a bijective mapping from the 3D point cloud to a 2D point map, where (x, y, z) coordinates are encoded as azimuth and elevation angles measured from the origin. This can also be seen as projecting the point cloud onto a 2D spherical plane. Using such a bijection lies at the core of our proposed approach for generative modeling of lidar data.

B. Grid-based lidar generation
An alternative approach for generative modeling of lidar data is from Ondruska et al. [14]. They train a Recurrent Neural Network for semantic segmentation and convert their input to an occupancy grid. More relevant to our task, they train their network to also predict future occupancy grids, thereby creating a generative model for lidar data. Their approach differs from ours, as the occupancy grid used assigns a constant area (400 cm²) to every slot, whereas we operate directly on projected coordinates. This not only reduces preprocessing time, but also allows us to efficiently represent data with non-uniform spatial density. We can therefore run our model at a much higher resolution, while remaining computationally efficient.

Concurrent with our work, Tomasello et al. [15] explore conditional lidar synthesis from RGB images. The authors use the same 2D spherical mapping proposed in [6]. Our approach differs on several points. First, we do not require any RGB input for generation, which may not always be available (e.g. in poorly lit environments). Second, we explore ways to augment the lidar representation to increase robustness to corrupted data. Finally, we look at generative modeling of lidar data (compared to a deterministic mapping in their case).

C. Point Cloud Generation
A recent line of work [16]–[19] considers the problem of generating point clouds as unordered sets of (x, y, z) coordinates. This approach does not define an ordering on the points, and must therefore be invariant to permutations. To achieve this, they use a variant of PointNet [20] to encode a variable-length point cloud into a fixed-length representation. This latent vector is then decoded back to a point cloud, and the whole network is trained using permutation invariant losses such as the Earth Mover's Distance or the Chamfer Distance [19]. While these approaches work well for arbitrary point clouds, we show that they give suboptimal performance on lidar, as they do not leverage the known structure of the data.
D. Improving representations through extra coordinate channels
In this work, we propose to augment the 2D spherical signal with Cartesian coordinates. This can be seen as a generalization of the CoordConv solution [21]. The authors propose to add two channels to the image input, corresponding to the (i, j) location of every pixel. They show that this enables networks to learn either complete translation invariance or varying degrees of translation dependence, leading to better performance on a variety of downstream tasks.
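For illustration, a minimal sketch of this kind of coordinate augmentation is shown below; it appends normalized (i, j) index channels to a 2D feature map before convolution. This is our own illustrative code, not the implementation from [21].

```python
import torch

def add_coord_channels(x):
    """Append normalized (i, j) index channels to a batch of 2D maps.

    x: tensor of shape (batch, channels, height, width).
    Returns a tensor of shape (batch, channels + 2, height, width).
    """
    b, _, h, w = x.shape
    # Row and column indices, normalized to [-1, 1].
    rows = torch.linspace(-1.0, 1.0, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
    cols = torch.linspace(-1.0, 1.0, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
    return torch.cat([x, rows, cols], dim=1)
```

A convolution applied to the augmented tensor simply expects two additional input channels; the (x, y, z) channels described later play an analogous role for lidar grids.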
III. TECHNICAL BACKGROUND: GENERATIVE MODELING
The underlying task of generative models is density estimation. Formally, we are given a set of d-dimensional i.i.d. samples $X = \{x_i \in \mathbb{R}^d\}_{i=1}^{m}$ from some unknown probability density function p_real. Our objective is to learn a density p_θ, where θ ∈ F represents the parameters of our estimator and F a parametric family of models. Training is done by minimizing some distance D between p_real and p_θ. The choice of both D and the training algorithm are the defining components of the density estimation procedure. Common choices for D are either f-divergences, such as the Kullback-Leibler (KL) divergence, or Integral Probability Metrics (IPMs), such as the Wasserstein metric [22]. These similarity metrics between distributions often come with specific training algorithms, as we describe next.

A. Maximum Likelihood Training
Maximum likelihood estimation (MLE) aims to find model parameters that maximize the likelihood of X. Since samples are i.i.d., the optimization criterion can be viewed as:

$\max_{\theta \in \mathcal{F}} \; \mathbb{E}_{x \sim p_{real}} \log(p_\theta(x))$.  (1)

It can be shown that training with the MLE criterion converges to a minimization of the KL-divergence as the sample size increases [23]. From Eqn. (1) we see that any model admitting a differentiable density p_θ(x) can be trained via backpropagation. Powerful generative models trained via MLE include Variational Autoencoders [1] and autoregressive models [24]. In this work, we focus on the former, as the latter have slow sampling speed, limiting their potential use for real world applications.

Fig. 2: Best viewed in color. Our proposed ordering of points from 3D space (left) into a 2D grid (right). Points sampled from the same elevation angle share the same color. The ordering of every row is obtained by unrolling points in increasing azimuth angle. The shown lidar was downsampled for visual purposes.
1) Variational Autoencoders (VAE):
The VAE [1] is a regularized version of the traditional autoencoder (AE). It consists of two parts: an inference network φ_enc ≡ q(z|x) that maps an input x to a posterior distribution of latent codes z, and a generative network ψ_dec ≡ p(x|z) that aims to reconstruct the original input conditioned on the latent encoding. By imposing a prior distribution p(z) on latent codes, it enforces the distribution over z to be smooth and well-behaved. This property enables proper sampling from the model via ancestral sampling from latent to input space. The full objective of the VAE is then:

$\mathcal{L}(\theta; x) = \mathbb{E}_{q(z|x)} \log p(x|z) - KL(q(z|x)\,\|\,p(z)) \leq \log p(x)$,  (2)

which is a valid lower bound on the true likelihood, thereby making Variational Autoencoders valid generative models. For a more in-depth analysis of VAEs, see [25].

B. Generative Adversarial Network (GAN)
The GAN [2] formulates the density estimation problem as a minimax game between two opposing networks. The generator G(z) maps noise drawn from a prior distribution p_noise to the input space, aiming to fool its adversary, the discriminator D(x). The latter then tries to distinguish between real samples x ∼ p_real and fake samples x′ ∼ G(z). In practice, both models are represented as neural networks. Formally, the objective is written as

$\min_G \max_D \; \mathbb{E}_{x \sim p_{real}} \log(D(x)) + \mathbb{E}_{z \sim p_{noise}} \log(1 - D(G(z)))$.  (3)

GANs have shown the ability to produce more realistic samples [26] than their MLE counterparts. However, the optimization process is notoriously difficult; stabilizing GAN training is still an open problem. In practice, GANs can also suffer from mode collapse [27], which happens when the generator overlooks certain modes of the target distribution.

IV. PROPOSED APPROACH FOR LIDAR GENERATION
We next describe the proposed deep learning framework used for generative modeling of lidar scans.
A. Data Representation
Our approach relies heavily on 2D convolutions; therefore we start by converting a lidar scan containing N (x, y, z) coordinates into a 2D grid. We begin by clustering together points emitted from the same elevation angle into H clusters. Second, for every cluster, we sort the points in increasing order of azimuth angle. In order to have a proper grid with a fixed amount of points per row, we divide the 360° plane into W bins. This yields an H × W grid, where for each cell we store the average (x, y, z) coordinate, such that we can store all the information in an H × W × 3 tensor. We note that the default ordering in most lidar scanners is the same as the one obtained after applying this preprocessing. Therefore, sorting is not required in practice, and the whole procedure can be executed in O(N). Figure 2 provides a visual representation of this mapping. This procedure yields the same ordering of points as the projection discussed in II-A. The latter would then return a grid of H × W × 2, where the (x, y) channels are compressed as d = √(x² + y²). We will refer to the two representations above as Cartesian and Polar respectively. While this small change in representation seems innocuous, we show that when the input is noisy or incomplete, this compression can lead to suboptimal performance.
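The following NumPy sketch illustrates this preprocessing under some simplifying assumptions of ours: the scanner is assumed to provide a per-point elevation (ring) index, and the values H = 64 and W = 512 are placeholders rather than the settings actually used.

```python
import numpy as np

def lidar_to_grid(points, ring_idx, H=64, W=512):
    """Project an (N, 3) lidar scan onto an (H, W, 3) Cartesian grid.

    points:   array of shape (N, 3) with (x, y, z) coordinates.
    ring_idx: array of shape (N,) giving the elevation cluster of each point.
    H, W:     number of elevation clusters and azimuth bins (illustrative values).
    """
    grid = np.zeros((H, W, 3))
    counts = np.zeros((H, W, 1))
    azimuth = np.arctan2(points[:, 1], points[:, 0])             # angle in [-pi, pi]
    col = ((azimuth + np.pi) / (2 * np.pi) * W).astype(int) % W  # azimuth bin
    row = ring_idx.astype(int)                                   # elevation cluster
    # Average all (x, y, z) coordinates falling into the same cell.
    np.add.at(grid, (row, col), points)
    np.add.at(counts, (row, col), 1)
    return grid / np.maximum(counts, 1)

def to_polar(grid):
    """Compress the (x, y) channels into the range d = sqrt(x^2 + y^2),
    yielding the (H, W, 2) Polar representation."""
    d = np.sqrt(grid[..., 0] ** 2 + grid[..., 1] ** 2)
    return np.stack([d, grid[..., 2]], axis=-1)
```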
B. Training Phase

1) VAEs:
In practice, both the encoder φ and the decoder ψ are represented as neural networks with parameters θ_enc and θ_dec respectively. Similar to a traditional AE, the training procedure first encodes the data x into a latent representation z = φ(x; θ_enc). The variational aspect is introduced by interpreting z not as a vector, but as the parameters of a posterior distribution. In our work we choose a Gaussian prior and posterior, and therefore z decomposes as (µ_x, σ_x). We then sample from this distribution, z̃ ∼ N(µ_x, σ_x), and pass it through the decoder to obtain x̃ = ψ(z̃; θ_dec). Using the reparametrization trick [1], the network is fully deterministic and differentiable w.r.t. its parameters θ_enc and θ_dec, which are updated via stochastic gradient descent (SGD).
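A minimal sketch of this training step is given below; the encoder and decoder modules are placeholders (assumed to return the posterior parameters and a reconstruction, respectively), and the squared-error reconstruction term is one common choice rather than necessarily the one used in our released code.

```python
import torch

def vae_loss(x, encoder, decoder):
    """One evaluation of the (negative) VAE objective of Eqn. (2).

    encoder(x) is assumed to return posterior parameters (mu, log_var);
    decoder(z) returns a reconstruction of x.
    """
    mu, log_var = encoder(x)
    # Reparametrization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps
    x_rec = decoder(z)
    # Reconstruction term (here a squared error, i.e. a Gaussian likelihood).
    rec = ((x_rec - x) ** 2).sum(dim=tuple(range(1, x.dim()))).mean()
    # KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=1).mean()
    return rec + kl
```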
2) GANs:
Training alternates between updates for the generator and discriminator, with parameters θ_gen and θ_dis. Similarly to the VAE, samples are obtained by ancestral sampling from the prior through the generator. In the original GAN, the networks are updated according to Eqn. (3). In practice, we use the Relativistic Average GAN (RaGAN) objective [28], which is easier to optimize. Again, θ_gen and θ_dis are updated using SGD. For a complete hyperparameter list, we refer the reader to our publicly available source code.
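For reference, a sketch of the relativistic average losses we refer to is shown below, following the formulation of [28]; the critic interface (raw, pre-sigmoid scores) is an assumption, and this is not a copy of our released implementation.

```python
import torch
import torch.nn.functional as F

def ragan_d_loss(critic, real, fake):
    """Relativistic average discriminator loss [28]: real samples should
    score higher than the average fake, and vice versa."""
    c_real, c_fake = critic(real), critic(fake.detach())
    ones, zeros = torch.ones_like(c_real), torch.zeros_like(c_fake)
    return (F.binary_cross_entropy_with_logits(c_real - c_fake.mean(), ones)
            + F.binary_cross_entropy_with_logits(c_fake - c_real.mean(), zeros))

def ragan_g_loss(critic, real, fake):
    """Relativistic average generator loss: the generator tries to invert
    the discriminator's preference."""
    c_real, c_fake = critic(real), critic(fake)
    ones, zeros = torch.ones_like(c_fake), torch.zeros_like(c_real)
    return (F.binary_cross_entropy_with_logits(c_fake - c_real.mean(), ones)
            + F.binary_cross_entropy_with_logits(c_real - c_fake.mean(), zeros))
```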
C. Model Architecture

Deep Convolutional GANs (DCGANs) [29] have shown great success in generating images. They use a symmetric architecture for the two networks: the generator consists of five transpose convolutions with stride two to upsample at each layer, and ReLU activations. The discriminator uses stride-two convolutions to downsample the input, and Leaky ReLU activations. In both networks, Batch Normalization [30] is interleaved between convolution layers for easier optimization. We use this architecture for all our models: the VAE encoder setup is simply the first four layers of the discriminator, and the decoder's architecture replicates the DCGAN generator. We note that for both models, more sophisticated architectures [31], [32] are fully compatible with our framework. We leave this line of exploration as future work.
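As an illustration of this recipe, a DCGAN-style generator could be sketched as below; the channel widths, the 2 × 8 seed map, the 64 × 256 output resolution, and the final Tanh are our own placeholder choices, not the exact published configuration.

```python
import torch.nn as nn

class LidarGenerator(nn.Module):
    """DCGAN-style generator sketch: a latent vector is reshaped to a small
    spatial map and upsampled by five stride-2 transpose convolutions with
    batch normalization and ReLU activations."""

    def __init__(self, latent_dim=128, out_channels=2, base=512):
        super().__init__()
        self.base = base
        self.project = nn.Linear(latent_dim, base * 2 * 8)  # -> (base, 2, 8) map
        layers, ch = [], base
        for _ in range(4):
            layers += [nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                       nn.BatchNorm2d(ch // 2), nn.ReLU(True)]
            ch //= 2
        # Fifth upsampling layer outputs the (d, z) or (x, y, z) channels.
        layers += [nn.ConvTranspose2d(ch, out_channels, 4, stride=2, padding=1),
                   nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        h = self.project(z).view(z.size(0), self.base, 2, 8)
        return self.net(h)  # shape: (batch, out_channels, 64, 256)
```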
V. EXPERIMENTS
This section provides a thorough analysis of the performance of our framework on a variety of tasks related to generative modeling. First, we explore conditional generation, where the model must compress and reconstruct a (potentially corrupted) lidar scan. We then look at unconditional generation. In this setting, we are only interested in producing realistic samples, which are not explicitly tied to a real lidar cloud.
A. Dataset
We consider the point clouds available in the KITTI dataset [7]. We use the train/validation/test split proposed by [33], which yields 40,000, 80, and 700 samples for the train, validation, and test sets. We use the preprocessing described in Section IV-A to obtain the 2D grid representation. For training we subsample from 10 Hz to 3 Hz, since temporally adjacent frames are nearly identical.

B. Baseline Models
Since, to the best of our knowledge, no work has attempted generative modeling of raw lidar clouds, we compare our method to models that operate on arbitrary point clouds. We first choose AtlasNet [17], which has shown strong modeling performance on the ShapeNet [34] dataset. This network first encodes point clouds using a shared MLP network that operates on each point individually. A max-pooling operation is performed over the point axis to obtain a fixed-length global representation of the point cloud. In other words, the encoder treats each point independently of other points, without assuming an ordering on the set of coordinates. This makes the feature extraction process invariant to permutations of points. The decoder is given the encoder output along with the (x, y) coordinates of a 2D grid, and attempts to fold this 2D grid into a three-dimensional surface. The decoder also uses an MLP network shared across all points.

Similar to AtlasNet, we compare our model with the one from Achlioptas et al. [16]. Only its decoder differs from AtlasNet; the model does not deform a 2D grid, but rather uses fully-connected layers to convert the latent vector into a point cloud, making it less parameter efficient. Both networks are trained end-to-end using the
Chamfer Loss [19], defined as

$d_{CH}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2$,  (4)

where S_1 and S_2 are two sets of (x, y, z) coordinates. We note again that this loss is invariant to the ordering of the output points. For both autoencoders, we regularize their latent space using a Gaussian prior to get a valid generative model.
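A small, non-batched NumPy sketch of Eqn. (4) is given below for reference; practical implementations compute it in batches on the GPU.

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and
    (M, 3), using squared Euclidean distances as in Eqn. (4)."""
    # Pairwise squared distances, shape (N, M).
    dist = np.sum((s1[:, None, :] - s2[None, :, :]) ** 2, axis=-1)
    # For every point, the squared distance to its nearest neighbour
    # in the other set, summed over both directions.
    return dist.min(axis=1).sum() + dist.min(axis=0).sum()
```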
C. Conditional Generation

We proceed to test our approach in a conditional generation task. In this setting, we do not evaluate the GAN, as this family of models (in their original formulation) does not have an inference mechanism. In other words, we consider four models: our approach, using either the Cartesian or the Polar representation, and the two baselines above. Since we are not sampling, but rather reconstructing an input, we consider both VAE and AE variants of every model, and report the best performing one.

Formally, given a lidar cloud, we evaluate a model's ability to reconstruct it from a compressed encoding. More relevant to real world applications, we look at how robust the model's latent representation is to input perturbation. Specifically, we consider the two following corruption mechanisms (sketched below):
• Additive Noise: we add Gaussian noise drawn from N(0, σ) to the (x, y, z) coordinates of the lidar cloud. For this process, we normalize each of the three dimensions independently prior to noise addition. We experiment with varying levels of σ.
• Data Removal: we remove random points from the input lidar scan. Specifically, the probability of removing a point is modeled as a Bernoulli distribution parametrized by p. We consider different values for p.
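The sketch below illustrates both corruption mechanisms on a raw point cloud; the per-scan normalization statistics are an assumption on our part, and whether removed points are dropped or zeroed out in the grid representation is left open here.

```python
import numpy as np

def add_noise(points, sigma, rng=np.random):
    """Additive Gaussian corruption on an (N, 3) cloud: each dimension is
    normalized independently before noise with standard deviation sigma
    is added, then mapped back to the original scale."""
    mean, std = points.mean(axis=0), points.std(axis=0)
    normalized = (points - mean) / std
    noisy = normalized + rng.normal(0.0, sigma, size=points.shape)
    return noisy * std + mean

def drop_points(points, p, rng=np.random):
    """Data removal: each point is independently removed with probability p."""
    keep = rng.uniform(size=points.shape[0]) >= p
    return points[keep]
```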
D. Unconditional Generation

For this section, we consider the GAN model introduced in Section IV-C. Our goal is to train a model that can produce realistic samples. Having access to such a generator can lead to the development of better simulators, which are heavily used to train self-driving agents [35]. In this use case, an agent operating in an environment that lacks crispness will likely result in poor skill transfer to real world navigation. Since the use of GANs has been shown to produce more realistic samples than MLE-based models on images [36], we hope to see similar results with our model in the case of LiDAR data.
Evaluation criteria: Rigorous quantitative evaluation of samples produced by GANs and generative models is an open research question. GANs trained on images have been evaluated by the Inception Score [27] and the Fréchet Inception Distance (FID) [37]. Since there exists no standardized metric for unconditional generation of lidar clouds, we rely on visual inspection of samples for quality assessment.

Fig. 3: EMD and Chamfer Distance under varying levels of added noise (left) and missing data (right). We remove models with poor performance for clarity. For both metrics lower is better.
1) Evaluation criteria:
To measure how close the reconstructed output is to the original point cloud, we use the Earth Mover's Distance [19]. It is defined as

$d_{EMD}(S_1, S_2) = \min_{\gamma: S_1 \to S_2} \sum_{x \in S_1} \|x - \gamma(x)\|_2$,  (5)

where γ is a bijection between the two sets. The EMD gives the solution to the optimal transportation problem, which attempts to transform one point cloud into the other. Recent work [16] has shown that this metric correlates well with human evaluation, and does so better than the Chamfer Distance. Moreover, the Earth Mover's Distance is sensitive to both global and local structure, and does not require points to be ordered. Additionally, training and evaluating models on the same metric can result in models overfitting to this criterion, at the expense of sample quality [38]. Nevertheless, we also provide results measured by the Chamfer Distance for completeness.
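For two sets of equal size, the optimal bijection γ in Eqn. (5) can be computed exactly with a linear assignment solver, as in the following sketch; this is practical only for small clouds, and faster approximations are typically used at the scale of full lidar scans.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def earth_movers_distance(s1, s2):
    """Exact EMD between two equally sized point sets of shape (N, 3),
    as in Eqn. (5): the optimal one-to-one assignment minimizing the
    total Euclidean transport cost."""
    diff = s1[:, None, :] - s2[None, :, :]
    cost = np.sqrt(np.sum(diff ** 2, axis=-1))   # pairwise distances, (N, N)
    rows, cols = linear_sum_assignment(cost)     # optimal bijection gamma
    return cost[rows, cols].sum()
```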
2) Training Protocol:
For every model considered, we perform the same hyperparameter search. We randomly select the learning rate, the latent dimension, and the batch size from a predetermined set of values. This set of values is the same for all models to ensure fairness. This process is repeated for 10 different configurations, from which we choose the one obtaining the best performance on the validation set. We then proceed to evaluate this configuration on the test set according to the metrics described above. All models are trained end-to-end on the same dataset.
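The search amounts to drawing configurations uniformly from fixed candidate sets, as sketched below; the candidate values shown are placeholders rather than the ones actually used.

```python
import random

# Placeholder candidate values; the same sets are used for every model.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 2e-4, 1e-3],
    "latent_dim": [64, 128, 256],
    "batch_size": [32, 64, 128],
}

def sample_configs(n_trials=10, seed=0):
    """Draw n_trials random configurations; the one with the best
    validation score is then evaluated once on the test set."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
            for _ in range(n_trials)]
```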
VI. RESULTS
In this section, we will first discuss results for conditional generation and subsequently evaluate results for unconditional generation of lidar images.
A. Conditional
In all conditional tasks, our proposed approach beats the available baselines by a significant margin, in terms of EMD, Chamfer Distance, and visual inspection.

Fig. 4: Top: corrupted lidar from the test set, where we added Gaussian noise to the preprocessed scan. Middle: reconstructed point cloud given the corrupted input. Bottom: original lidar scan.
1) Reconstructing clean data: While the baseline models are able to reconstruct the global structure of the lidar scan, they are unable to recover the more fine-grained details of the input (see Fig. 1). This suggests that leveraging the known structure of the lidar plays a key role in obtaining high quality reconstructions. Quantitative results are shown in Table I.
TABLE I: EMD and Chamfer distance measured on test set reconstructions (in both cases lower is better)

Model        EMD      Chamfer
Random       4331.9   253.6
AtlasNet     1571.2   2.85
Ach. et al.  1103.1   2.16
Ours (xyz)   137.2    1.23
Ours (pol)   127.0    1.04
2) Reconstructing corrupted data:
Next, we evaluate the proposed models on their ability to extract important information from corrupted lidar scans. As shown in Fig. 4, the proposed VAE correctly reconstructs the defining components of the original cloud, even if the given input is seemingly uninformative. We emphasize that our model was not trained with such corrupted data, therefore these results are quite surprising. Animations and additional reconstructions can be found here.

Moreover, we observe that as soon as the input is moderately noisy, the proposed Cartesian representation yields better performance. As seen in Fig. 3, this representation performs better than its Polar alternative over the majority of the graph. In addition, we observe a similar trend when points are randomly removed from the input, as shown in Fig. 3; when more than 15% of the points are missing, using (x, y, z) coordinates performs favorably according to EMD. This result suggests that in this corruption regime, having access to absolute positional information provides a better signal to the model. Interesting future work would be to leverage the best of the two representations.

We note that the suboptimal performance of the baselines is mainly due to two factors. First, since points are encoded independently, only information about the global structure is kept, and local fine-grained details are neglected. Second, the Chamfer Distance used for training assumes that the point cloud has a uniform density, which is not the case for lidar scans.

Fig. 5: We compare generated GAN samples (left) with their nearest neighbor in feature space (middle) from the test set. We show the corresponding RGB image (right). Regions of interest are highlighted in red.

B. Unconditional
We perform a visual inspection of generated samples, located in the leftmost column of Figure 5 (more samples are available here). We see that our model generates realistic samples. First, the scans have a well-defined global structure: an aerial view of the samples shows points correctly aligned to model the structure of the road. Second, the samples share local characteristics of real data: the model correctly generates road obstacles, such as cars or cyclists. This amounts to having locations with a dense aggregation of points, followed by a trailing area with almost no points, similar to the shadow of an object. Third, the model respects the point density of the data, where the density is roughly inversely proportional to the distance from the origin. Lastly, our models show good sample diversity.
1) What is the GAN generating?:
In order to better interpret samples from the unconditional generator, we try to match them to real data examples. We perform the following procedure: we encode every sample to a latent representation, given by the output of the third layer of our discriminator. We similarly encode random datapoints from the test set, and match each generated sample to the real datapoint yielding the smallest latent L2 distance. We show three examples of this matching in Figure 5. In the first row, we see the model generating a two-layer roadside to the right, consisting of a long shrub, followed by a line of trees. In the second row, we find a large tilted object to the right, which matches a bus turning right. Finally, in the last row we see a sharp enclosing, corresponding to a driveway leading to a garage door.
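A sketch of this matching procedure follows; `features` stands for the activations of the discriminator's third layer and is an assumed interface, not a function from our released code.

```python
import torch

def nearest_real_neighbors(features, generated, real_batch):
    """Match each generated sample to the real datapoint whose discriminator
    features are closest in L2 distance.

    features(x) is assumed to return a (batch, d) embedding, e.g. the
    flattened activations of the discriminator's third layer.
    """
    with torch.no_grad():
        f_gen = features(generated)           # (G, d)
        f_real = features(real_batch)         # (R, d)
        dists = torch.cdist(f_gen, f_real)    # pairwise L2 distances, (G, R)
        return dists.argmin(dim=1)            # index of the nearest real sample
```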
VII. DISCUSSION AND FUTURE WORK
In this work we introduced two generative models for raw lidar scans, a GAN and a VAE. We have shown that the proposed adversarial network can generate highly realistic data, and captures both local and global features of real lidar scans. The LiDAR-VAE successfully encodes and reconstructs lidar samples, and is highly robust to missing or imputed data. We demonstrate that even when adding enough noise to render the scan uninformative to the human eye, the proposed VAE still extracts relevant information and generates the missing data. Our work in deep generative modeling of lidar enables concrete advancements in real life applications; the former model can help reduce the discrepancy between synthetic and real lidars in driving simulators, while the latter can be leveraged in deployed vehicles for reconstruction, compression, or prediction of the data stream.

Moreover, we proposed a simple way to encode absolute positional information in the lidar representation, and showed that this leads to better reconstructions when the input is noisy or incomplete. Interesting future work would be to see if this can also lead to improvements in standard lidar processing tasks.
REFERENCES

[1] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” Proceedings of the 2nd International Conference on Learning Representations, 2013.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[4] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3D object detection,” arXiv preprint arXiv:1711.06396, 2017.
[5] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, “Vote3deep: Fast object detection in 3D point clouds using efficient convolutional neural networks,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1355–1361.
[6] B. Li, T. Zhang, and T. Xia, “Vehicle detection from 3D lidar using fully convolutional network,” arXiv preprint arXiv:1608.07916, 2016.
[7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[8] A. Dewan, G. L. Oliveira, and W. Burgard, “Deep semantic classification for 3D lidar data,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 3544–3549.
[9] L. Caltagirone, S. Scheidegger, L. Svensson, and M. Wahde, “Fast lidar-based road detection using fully convolutional neural networks,” arXiv preprint arXiv:1703.03613, 2017.
[10] B. Li, “3D fully convolutional network for vehicle detection in point cloud,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 1513–1518.
[11] M. Velas, M. Spanel, M. Hradis, and A. Herout, “CNN for very fast ground segmentation in velodyne lidar data,” in Autonomous Robot Systems and Competitions (ICARSC), 2018 IEEE International Conference on. IEEE, 2018, pp. 97–103.
[12] V. Vaquero, I. del Pino, F. Moreno-Noguer, J. Solà, A. Sanfeliu, and J. Andrade-Cetto, “Deconvolutional networks for point-cloud vehicle detection and tracking in driving scenarios,” in Mobile Robots (ECMR), 2017 European Conference on. IEEE, 2017, pp. 1–7.
[13] V. Vaquero, A. Sanfeliu, and F. Moreno-Noguer, “Deep lidar CNN to understand the dynamics of moving vehicles,” arXiv preprint arXiv:1808.09526, 2018.
[14] P. Ondruska, J. Dequaire, D. Z. Wang, and I. Posner, “End-to-end tracking and semantic segmentation using recurrent neural networks,” arXiv preprint arXiv:1604.05091, 2016.
[15] P. Tomasello, S. Sidhu, A. Shen, M. W. Moskewicz, N. Redmon, G. Joshi, R. Phadte, P. Jain, and F. Iandola, “DSCnet: Replicating lidar point clouds with deep sensor cloning,” arXiv preprint arXiv:1811.07070, 2018.
[16] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas, “Representation learning and adversarial generation of 3D point clouds,” arXiv preprint arXiv:1707.02392, 2017.
[17] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “AtlasNet: A papier-mâché approach to learning 3D surface generation,” arXiv preprint arXiv:1802.05384, 2018.
[18] Y. Yang, C. Feng, Y. Shen, and D. Tian, “FoldingNet: Point cloud auto-encoder via deep grid deformation,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2018.
[19] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3D object reconstruction from a single image,” in CVPR, vol. 2, no. 4, 2017, p. 6.
[20] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, vol. 1, no. 2, p. 4, 2017.
[21] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski, “An intriguing failing of convolutional neural networks and the CoordConv solution,” in Advances in Neural Information Processing Systems, 2018, pp. 9628–9639.
[22] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017.
[23] S. Kolouri, G. K. Rohde, and H. Hoffmann, “Sliced Wasserstein distance for learning Gaussian mixture models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3427–3436.
[24] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016.
[25] C. Doersch, “Tutorial on variational autoencoders,” arXiv preprint arXiv:1606.05908, 2016.
[26] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
[27] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[28] A. Jolicoeur-Martineau, “The relativistic discriminator: a key element missing from standard GAN,” arXiv preprint arXiv:1807.00734, 2018.
[29] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[30] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[31] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.
[32] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improved variational inference with inverse autoregressive flow,” in Advances in Neural Information Processing Systems, 2016, pp. 4743–4751.
[33] W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” arXiv preprint arXiv:1605.08104, 2016.
[34] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “ShapeNet: An information-rich 3D model repository,” arXiv preprint arXiv:1512.03012, 2015.
[35] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” arXiv preprint arXiv:1711.03938, 2017.
[36] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” arXiv preprint arXiv:1512.09300, 2015.
[37] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
[38] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google's neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.