Domain Adaptation for Vehicle Detection from Bird's Eye View LiDAR Point Cloud Data
Khaled Saleh, Ahmed Abobakr, Mohammed Attia, Julie Iskander, Darius Nahavandi, Mohammed Hossny
Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University
[email protected]
Abstract — Point cloud data from 3D LiDAR sensors is one of the most crucial sensor modalities for versatile safety-critical applications such as self-driving vehicles. Since the annotation of point cloud data is an expensive and time-consuming process, the utilisation of simulated environments and 3D LiDAR sensors for this task has recently started to gain some popularity. With simulated sensors and environments, the process of obtaining annotated synthetic point cloud data becomes much easier. However, the generated synthetic point cloud data still lack the artefacts that usually exist in point cloud data from real 3D LiDAR sensors. As a result, the performance of models trained on this data for perception tasks degrades when tested on real point cloud data, due to the domain shift between simulated and real environments. Thus, in this work, we propose a domain adaptation framework for bridging this gap between synthetic and real point cloud data. Our proposed framework is based on the deep cycle-consistent generative adversarial network (CycleGAN) architecture. We have evaluated the performance of our proposed framework on the task of vehicle detection from bird's eye view (BEV) point cloud images coming from real 3D LiDAR sensors. The framework has shown competitive results with an improvement of more than 7% in average precision score over other baseline approaches when tested on real BEV point cloud images.
I. INTRODUCTION
Recently, deep learning-based techniques such as convolutional neural networks (ConvNets) have been achieving state-of-the-art results in many computer vision tasks such as object identification [1], scene understanding [2], [3], and human action recognition [4]–[6]. However, these techniques require a large amount of labelled data for training, which is both time-consuming and cumbersome to obtain for many tasks. Thus, the utilisation of synthetic data for training such techniques has gained some momentum over the past few years [7], [8]. With synthetic data, the process of obtaining ground-truth labels becomes much easier and, most of the time, automated. However, the utilisation of synthetic data is still not entirely reliable because of its limitations when it comes to generalisation to real data.

In safety-critical applications such as self-driving vehicles, one of the main sensors that is currently crucial for their development is the 3D LiDAR (Light Detection And Ranging) sensor. 3D LiDAR sensors can reliably provide 360° point clouds of the traffic environment with a coverage distance of up to 200 meters ahead across different weather and lighting conditions. Thus, a number of deep learning-based techniques have recently been utilising their point clouds for many perception tasks for self-driving vehicles [9], [10].

Fig. 1. Sample BEV images of real point cloud data (left) from a real Velodyne 3D LiDAR from the KITTI dataset [11] and synthetic point cloud data (right) from a simulated 3D LiDAR sensor from the MDLS dataset [9].

The reason that the number of deep learning techniques that rely on point cloud data is not as large as the number relying on visual data is the scarcity of labelled point cloud data. The labelling procedure for point cloud data is more complicated than for visual data, especially for tasks such as 3D object detection and per-point semantic segmentation. Thus, the usage of synthetic data has been explored, similar to the visual data modality [7], [9]. However, the generalisation to real point cloud data was rather limited due to the perfectness of the synthetic point cloud data (shown in Fig. 1, right), which is missing the artefacts that usually exist in point cloud data from real 3D LiDAR sensors (shown in Fig. 1, left). These artefacts include the variability of the LiDAR beam intensities and the motion distortion resulting from the motion of the 3D LiDAR.

Domain adaptation (DA) is one of the machine learning (ML) techniques that have recently been explored to bridge the aforementioned gaps between synthetic and real data domains [12]. In DA, the goal is to learn, from one data distribution (referred to as the source domain), a model that performs well on a different data distribution (referred to as the target domain). In traffic environments, DA has recently shown promising results for image translation between different domain pairs such as night/day, synthetic/real images and RGB/thermal images [13]. Since most of the previous DA techniques are based on 2D deep ConvNet architectures, their application to 3D point cloud data from 3D LiDAR sensors is not a straightforward task.

On the other hand, the recent deep learning-based techniques that have been applied to perception tasks using 3D point cloud data managed to adopt the same 2D ConvNet architectures to work on 3D point cloud data. One of the most common techniques is to project a top-down bird's eye view (BEV) of the point cloud data onto a 2D plane (i.e. the ground).
The representation of 3D LiDAR point cloud data as a BEV was shown to be effective in many perception tasks for self-driving vehicles such as 3D object detection [14], road detection [15] and per-point semantic segmentation [16].

To this end, in this work, we propose a DA approach for vehicle detection in real point cloud data from 3D LiDAR sensors represented as BEV images. The proposed DA approach is a deep learning-based approach built on deep generative adversarial networks (GANs) [13]. The vehicle detection task is based on the state-of-the-art deep object detection architecture YOLOv3 [1]. The rest of the paper is organised as follows. In Section II, a brief introduction to the different DA approaches, with emphasis on deep learning-based approaches, is reviewed, in addition to a quick review of GANs. In Section III, the methodology we followed for our proposed DA approach is discussed thoroughly. Experiments and results are discussed in Section IV. Finally, Section V concludes.

II. RELATED WORK
Commonly, there are two ways to achieve DA: either by directly translating one domain to the other, or by obtaining a common-ground intermediate pseudo-domain between the two domains. In the following, we first provide a quick review of the work related to DA approaches, specifically the approaches based on direct translation between domains. Then, a brief summary of the DA work between simulated and real domains done in the context of traffic environments is discussed.
A. Adversarial Domain Adaptation
Historically, most of the work done on DA relied on transformations between source and target domains based on linear representations [17], [18], until the emergence of a recent set of techniques based on non-linear transformation representations via neural networks [19], [20], which have achieved state-of-the-art results on a number of DA benchmarks [21], [22]. One of the most common non-linear DA approaches is the adversarial domain adaptation (ADA) approach [19]. ADA was inspired by the work done by Goodfellow et al. [23] on generative adversarial networks (GANs). In GANs, two deep neural networks are trained simultaneously, namely a "generator" network and a "discriminator" network. The generator network, as the name implies, generates new data instances from a uniform distribution, while the discriminator network tries to decide whether or not each newly generated data instance follows the same distribution as the training dataset. Similarly, ADA has the same two networks, where the generator network generates instances from the source domain distribution and transforms them into the target domain distribution, whereas the discriminator network tries to differentiate between instances drawn from the actual target domain distribution and the ones produced by the generator network. Thus, this architecture is often referred to in the literature as the "conditional GAN". One of the most successful recent ADA architectures is the Cycle-Consistent GAN (CycleGAN) [13] architecture. CycleGAN is essentially comprised of two conditional GAN networks. The first network handles the transformation from the source domain (S) to the target domain (T), S → T, while the other handles the transformation in the opposite direction, T → S. The additional contribution of the CycleGAN architecture was the introduction of a new loss function called the cycle-consistency loss. This loss ensures that if the two conditional GAN networks are connected, they produce the following identity mapping: S → T → S.

B. DA Between Synthetic and Real for Perception Tasks
In the context of traffic environments, a number of perception tasks have utilised DA approaches to bridge the gap between real domains from physical sensors and synthetic domains from simulated sensors [13], [24], [25]. It is worth noting that all of these works explored only one type of sensor, namely cameras, either RGB (monocular/stereo) or thermal. For example, in [13], a number of DA tasks between different domains were introduced based on the CycleGAN architecture. For instance, they addressed the semantic segmentation task between the day and night domains on unpaired visual images from multiple road-based datasets. Similarly, in [24], Atapour et al. trained a ConvNet model on synthetic depth and RGB images from the famous game GTA in order to estimate a synthetic monocular depth image. In the testing/inference phase, they took an input real RGB image from the KITTI dataset [26] and, with the help of a CycleGAN architecture, transformed the real RGB image into a synthetic GTA-game-like RGB image. Then, they passed the synthetic RGB image to their initially trained model to estimate a synthetic depth image. Eventually, they used the same CycleGAN network again to adapt the estimated depth image from the synthetic image domain to the real RGB image domain. On the other hand, in [25] Zhang et al. proposed a deep learning-based approach for thermal infra-red object tracking. To overcome the scarcity of thermal image datasets, they utilised DA based on the CycleGAN architecture to transform images from the visual domain to the thermal infra-red domain.
III. PROPOSED METHODOLOGY
The main focus of this work is to provide a framework for bridging the gap between real and synthetic point cloud data represented as BEV images for the vehicle detection task. That being said, the same framework can still be used for other perception tasks on point cloud data such as semantic segmentation or object tracking. In this section, we first provide our formulation of the problem at hand. Subsequently, we break down the building blocks of the proposed framework.
A. Problem Formulation
ConvNet-based architectures for object detection from BEV point cloud data have been achieving state-of-the-art results in many benchmarks [14]. However, with the insufficient amount of annotated BEV point cloud data available for training such architectures, the trained models still perform poorly, especially in challenging scenarios. The utilisation of annotated synthetic BEV point cloud data from simulated traffic environments could be the key to increasing the performance of such models. However, due to the domain shift between real and synthetic BEV point cloud data, a model trained on synthetic data is not necessarily guaranteed to generalise to real data [10].

Thus, in our formulation of the vehicle detection task from real BEV point cloud data, we propose a framework for DA between synthetic BEV point cloud data and real BEV point cloud data. In the first stage of our framework, we train a CycleGAN model between unpaired synthetic BEV point cloud data and real BEV point cloud data. The trained model, in return, learns a transformation from synthetic BEV point cloud data to real BEV point cloud data and vice versa. As a result, given any annotated synthetic BEV point cloud dataset with vehicles, the trained CycleGAN model can transform that dataset into an annotated real-like BEV point cloud dataset. Finally, using the transformed dataset, we can train another ConvNet-based model for the vehicle detection task in real BEV point cloud data.
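To make the two-stage structure of the framework concrete, the following is a minimal, high-level sketch of how such a pipeline could be wired together. The function names (`train_cyclegan`, `train_detector`, `generate_s_to_r`) are illustrative placeholders and not the authors' actual implementation:

```python
# Hypothetical outline of the two-stage DA pipeline described above.
# All function names are illustrative; they do not refer to a real library.

def run_da_pipeline(synthetic_bev, synthetic_labels, real_bev_unlabelled):
    # Stage 1: learn an unpaired mapping between the synthetic and real
    # BEV domains (no one-to-one correspondence is required).
    cyclegan = train_cyclegan(source_images=synthetic_bev,
                              target_images=real_bev_unlabelled)

    # Stage 2: translate every annotated synthetic BEV image into a
    # "real-like" BEV image; the bounding-box labels carry over unchanged
    # because the translation preserves the image geometry.
    adapted_bev = [cyclegan.generate_s_to_r(img) for img in synthetic_bev]

    # Stage 3: train the vehicle detector on the domain-adapted images.
    return train_detector(images=adapted_bev, labels=synthetic_labels)
```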
B. Deep Unsupervised DA via Cycle-Consistent GANs
As mentioned earlier in Section II-B, the CycleGAN architecture has recently shown promising results in a number of DA tasks between real and synthetic visual domains. Thus, in this work, we explore the CycleGAN architecture for the task of DA between the real BEV point cloud domain and the synthetic BEV point cloud domain. One of the advantages of the CycleGAN architecture in the context of DA is that it can learn the transformation between source and target domains without any supervised one-to-one mapping between the two domains. This is beneficial for our task because it is almost impossible for us to have the same traffic scenario and environment captured in both real BEV point cloud data and synthetic BEV point cloud data. However, we can have a sufficient amount of BEV point cloud data from each domain separately that represents the distribution of that domain.

More formally, let S and R be the synthetic and the real BEV point cloud data domains. The objective of our adopted CycleGAN-based DA approach (shown in Fig. 2) is to map between the distributions s ∼ P_d(s) and r ∼ P_d(r) of the synthetic and the real BEV point cloud domains respectively. The proposed CycleGAN-based DA approach achieves this mapping via two generators, G_{S→R} and G_{R→S}, and two discriminators, D_S and D_R. The generator G_{S→R} tries to map an input source synthetic BEV point cloud image to some target real BEV point cloud image, while the generator G_{R→S} tries to map the generated BEV point cloud image from the real target domain back to its original source domain. The discriminator D_S, on the other hand, tries to differentiate between a BEV point cloud image s ∈ S and a BEV point cloud image generated by G_{R→S}. Conversely, the discriminator D_R tries to distinguish between a BEV point cloud image r ∈ R and a BEV point cloud image generated by G_{S→R}.

The two generator networks are deep ConvNet models. Their main building blocks are three stages, namely the encoder, the transformer and the decoder. The encoder's job is to progressively extract features at multiple levels by down-sampling the input BEV point cloud image from either domain. The transformer, on the other hand, takes the feature vector extracted by the encoder in the source domain and transforms it into another feature vector in the opposite target domain. The decoder finally up-samples the transformed feature vector back to the original shape and dimensionality it had before going through the encoder. The architecture we used for this combination of encoder, transformer and decoder in our generator networks is based on the architecture proposed in [27]. The encoder in this architecture consists of two convolution layers, the transformer consists of nine ResNet blocks, and the decoder consists of two de-convolution/transposed convolution layers. The two discriminators are deep ConvNet models as well. They are based on the PatchGAN architecture from [28], which consists of three consecutive convolution layers for feature extraction in patches and a final 1D convolution layer for deciding whether its input BEV point cloud image is fake or not.
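The description above maps fairly directly onto code. The following PyTorch sketch only illustrates the stated layer counts (two encoder convolutions, nine ResNet blocks, two transposed convolutions, and a PatchGAN discriminator with three convolutions plus a final single-channel layer); the channel widths, kernel sizes and normalisation layers are our own assumptions, not details reported in the paper:

```python
import torch
import torch.nn as nn

class ResnetBlock(nn.Module):
    """Residual block used in the 'transformer' stage of the generator."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # identity shortcut

class Generator(nn.Module):
    """Encoder (2 strided convs) -> transformer (9 ResNet blocks) -> decoder (2 transposed convs)."""
    def __init__(self, in_channels=3, base=64, n_blocks=9):
        super().__init__()
        # Encoder: two convolution layers that progressively down-sample the BEV image.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
        )
        # Transformer: nine residual blocks operating on the bottleneck features.
        self.transformer = nn.Sequential(*[ResnetBlock(base * 2) for _ in range(n_blocks)])
        # Decoder: two transposed convolutions that up-sample back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, in_channels, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.Tanh(),  # assumes BEV images are scaled to [-1, 1]
        )

    def forward(self, x):
        return self.decoder(self.transformer(self.encoder(x)))

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: three convolutions plus a final single-channel decision layer."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(in_channels, base, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 4, 1, kernel_size=1),  # per-patch "real or fake" score map
        )

    def forward(self, x):
        return self.model(x)
```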
In order to train the proposed CycleGAN-based DA approach for our task, we utilise the adversarial losses of the two generators along with their corresponding discriminators. The first loss, for the transformation from domain S to domain R, is as follows:

$$\mathcal{L}_{adv}^{S \rightarrow R} = \min_{G_{S \rightarrow R}} \max_{D_R} \; \mathbb{E}_{r \sim P_d(r)}\left[\log D_R(r)\right] + \mathbb{E}_{s \sim P_d(s)}\left[\log \left(1 - D_R(G_{S \rightarrow R}(s))\right)\right] \qquad (1)$$

where S is the synthetic BEV point cloud data domain and P_d(s) is its data distribution.

Fig. 2. Proposed CycleGAN-based DA framework for the vehicle detection task in BEV point cloud images. The framework has two internal cycles, namely Cycle_S and Cycle_R. In Cycle_S, the input sample s, a synthetic BEV point cloud image, first goes through the generator G_{S→R}, whose output is interrogated by the discriminator D_R. The generated sample r then goes through the other generator G_{R→S}, which reconstructs the original input sample s. The same process applies to the second cycle, Cycle_R.

Similarly, the second loss, for the transformation from domain R to domain S, is as follows:

$$\mathcal{L}_{adv}^{R \rightarrow S} = \min_{G_{R \rightarrow S}} \max_{D_S} \; \mathbb{E}_{s \sim P_d(s)}\left[\log D_S(s)\right] + \mathbb{E}_{r \sim P_d(r)}\left[\log \left(1 - D_S(G_{R \rightarrow S}(r))\right)\right] \qquad (2)$$

Additionally, in order to push the generators of the trained model to produce more realistic BEV point cloud data for each domain S and R, the following third loss is added:

$$\mathcal{L}_{cyc} = \left\| G_{R \rightarrow S}(G_{S \rightarrow R}(s)) - s \right\| + \left\| G_{S \rightarrow R}(G_{R \rightarrow S}(r)) - r \right\| \qquad (3)$$

where $\mathcal{L}_{cyc}$ is the cycle-consistency loss, which ensures the identity mapping of each transformed BEV point cloud image sample back to its original source.

Given the three losses from Eqs. 1, 2 and 3, the objective loss function for the proposed CycleGAN-based DA approach is as follows:

$$\mathcal{L} = \mathcal{L}_{adv}^{S \rightarrow R} + \mathcal{L}_{adv}^{R \rightarrow S} + \lambda \mathcal{L}_{cyc} \qquad (4)$$

where λ is equal to 10, which was chosen empirically.

Finally, since the objective of training any deep ConvNet model is to minimise a certain loss function, which in our case is the joint loss function in Eq. 4, we use the Adam optimiser for minimising our objective joint loss function with a learning rate of 0.001.
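As a rough illustration of how these losses combine during training, a minimal PyTorch-style sketch of one generator update is given below. It assumes the `Generator` and `PatchDiscriminator` modules sketched earlier; the BCE/L1 formulation is a common practical stand-in for Eqs. 1-4 and is not necessarily identical to the authors' implementation:

```python
import itertools
import torch
import torch.nn as nn

# Assumed components from the earlier architecture sketch.
G_s2r, G_r2s = Generator(), Generator()
D_s, D_r = PatchDiscriminator(), PatchDiscriminator()

adv_loss = nn.BCEWithLogitsLoss()   # adversarial terms of Eqs. (1) and (2)
cyc_loss = nn.L1Loss()              # cycle-consistency term of Eq. (3)
lam = 10.0                          # lambda in Eq. (4), chosen empirically

opt_g = torch.optim.Adam(itertools.chain(G_s2r.parameters(), G_r2s.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(itertools.chain(D_s.parameters(), D_r.parameters()), lr=1e-3)

def generator_step(s, r):
    """One generator update on a batch of synthetic (s) and real (r) BEV images."""
    fake_r, fake_s = G_s2r(s), G_r2s(r)
    pred_r, pred_s = D_r(fake_r), D_s(fake_s)
    # The generators try to make the discriminators label the fakes as real.
    loss_adv = adv_loss(pred_r, torch.ones_like(pred_r)) + \
               adv_loss(pred_s, torch.ones_like(pred_s))
    # Cycle-consistency: S -> R -> S and R -> S -> R must reconstruct the inputs (Eq. 3).
    loss_cyc = cyc_loss(G_r2s(fake_r), s) + cyc_loss(G_s2r(fake_s), r)
    loss = loss_adv + lam * loss_cyc    # joint objective of Eq. (4)
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```

A corresponding discriminator update, training D_S and D_R to separate genuine samples from the generators' outputs, would alternate with this step in the usual GAN fashion.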
C. Vehicle Detection in BEV Point Cloud Data via YOLOv3

For the vehicle detection task, we adopt the state-of-the-art single-stage deep ConvNet architecture for object detection, the You Only Look Once (YOLOv3) architecture. Internally, YOLOv3 relies on k-means clustering to obtain prior bounding boxes ("anchors") for potential regions of interest (ROIs) in the input image, which goes through a total of 53 convolution layers to extract features at 3 different scales. YOLOv3, in return, predicts the four coordinates of each bounding box, an objectness score for each bounding box, and a class score for the object that the bounding box may contain. The bounding box centre coordinates are predicted through a sigmoid function as offsets within their grid cell, while the box dimensions are predicted relative to the anchor priors. The objectness score is predicted using logistic regression and is set to 1 if the bounding box of one of the anchors overlaps a ground truth bounding box. The class score of a bounding box is predicted via multinomial logistic classifiers, which are better suited than the traditional softmax classifier for multi-label classification tasks such as object detection.

More specifically, in our vehicle detection task from BEV point cloud images, we relied on the YOLOv3-416 derivative architecture which, as the name implies, works on input images with a resolution of 416 × 416.
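As a concrete illustration of the prediction decoding described above, the following sketch shows the generic YOLOv3-style decode of one raw output into an absolute box on the BEV image. This is the standard formulation from the YOLOv3 paper [1], not the authors' own code, and the example numbers are arbitrary:

```python
import numpy as np

def decode_yolo_box(t_x, t_y, t_w, t_h, t_obj, c_x, c_y, p_w, p_h, stride):
    """Decode one raw YOLOv3-style prediction into an absolute bounding box.

    (t_x, t_y, t_w, t_h, t_obj) are raw network outputs for one anchor in one
    grid cell; (c_x, c_y) is the cell's top-left index; (p_w, p_h) is the anchor
    size in pixels; stride is the down-sampling factor of the detection scale.
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    b_x = (sigmoid(t_x) + c_x) * stride   # box centre x in input-image pixels
    b_y = (sigmoid(t_y) + c_y) * stride   # box centre y in input-image pixels
    b_w = p_w * np.exp(t_w)               # box width scaled from the anchor prior
    b_h = p_h * np.exp(t_h)               # box height scaled from the anchor prior
    objectness = sigmoid(t_obj)           # probability that the box contains an object
    return b_x, b_y, b_w, b_h, objectness

# Example: one prediction on the 13x13 scale of a 416x416 BEV input (stride 32).
print(decode_yolo_box(0.2, -0.1, 0.3, 0.1, 1.5, c_x=6, c_y=7, p_w=60, p_h=30, stride=32))
```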
IV. EXPERIMENTS

In this section, we first discuss the datasets we have used for training and validating our models. Then, the performance of our models is evaluated quantitatively and qualitatively.
A. Datasets
For the task of DA between synthetic and real BEV point cloud images, we relied on two datasets. The first dataset is the recently released Motion-Distorted LiDAR Simulation (MDLS) dataset introduced in [9]. This dataset represents the synthetic domain S of our CycleGAN-based DA approach discussed in Section III-B. The MDLS dataset was generated from high-fidelity simulated urban traffic environments from the CARLA simulator [29] using a simulated Velodyne HDL-64E sensor. The dataset was originally meant for studying the effect of the motion distortion, resulting from a moving vehicle-mounted 3D LiDAR sensor, on the generated point cloud data. The dataset consists of two sequences of point cloud data from urban traffic environments involving between 60 and 90 moving vehicles, each with an average duration of five minutes, which results in a total of 6K point cloud scans. The dataset was annotated with the positions of the vehicles in the scene. For our DA task, we first preprocessed the point cloud scans in order to get a BEV image of each scan according to the method introduced in [14] (a simplified sketch of this kind of projection is given at the end of this subsection). As a result, we get a total of 6K BEV point cloud images similar to the right image shown in Fig. 1.

The second dataset we utilised, for the real domain R of our CycleGAN-based DA approach, is the BEV benchmark data from the KITTI dataset [11]. The BEV benchmark data consists of 7481 training images and point cloud scans and 7518 test images and point cloud scans. The point cloud data was captured using a real 3D LiDAR sensor, the Velodyne HDL-64E. The dataset contains annotations for multiple objects in the traffic scene such as vehicles, pedestrians and cyclists. Similar to the pre-processing step applied to the MDLS dataset, we also processed the KITTI dataset in order to get BEV point cloud images like the one shown on the left in Fig. 1. In our experiments for training our CycleGAN-based DA approach, we used a total of 6K BEV point cloud images from the MDLS dataset and the 7481 BEV point cloud images of the training split from the KITTI dataset.

Fig. 3. Qualitative results for the proposed CycleGAN-based method for DA between synthetic and real BEV point cloud data. The first row is the input synthetic BEV point cloud image from [9]. The second row is the transformed real BEV point cloud image using the proposed method. The third row is the correlated real BEV point cloud image from the KITTI dataset [11].

Similarly, for the task of vehicle detection from BEV point cloud images, we used the same aforementioned two datasets (MDLS and KITTI), in addition to the domain-adapted BEV images (from synthetic to real), for training our YOLOv3 models. Since our ultimate goal in the vehicle detection task is to identify vehicles in real BEV point cloud images, we further split the total of 7481 real BEV images from the KITTI dataset into 4K for training our YOLOv3 model and 3481 for testing the model.
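For reference, a minimal sketch of the kind of BEV projection used in this preprocessing step is shown below. The grid extents, resolution and encoded channels are our own assumptions; the actual procedure follows [14]:

```python
import numpy as np

def pointcloud_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                      z_range=(-2.5, 1.5), cell=0.1):
    """Project an (N, 4) LiDAR scan [x, y, z, intensity] onto a BEV grid.

    Returns an (H, W, 3) image whose channels encode maximum height,
    maximum intensity and point density per cell (assumed encoding).
    """
    h = int((x_range[1] - x_range[0]) / cell)
    w = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((h, w, 3), dtype=np.float32)

    # Keep only points inside the chosen crop around the ego-vehicle.
    m = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
         (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
         (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[m]

    # Discretise x/y coordinates into grid cell indices.
    xi = ((pts[:, 0] - x_range[0]) / cell).astype(np.int32)
    yi = ((pts[:, 1] - y_range[0]) / cell).astype(np.int32)

    for i, j, z, inten in zip(xi, yi, pts[:, 2], pts[:, 3]):
        bev[i, j, 0] = max(bev[i, j, 0], (z - z_range[0]) / (z_range[1] - z_range[0]))
        bev[i, j, 1] = max(bev[i, j, 1], inten)
        bev[i, j, 2] += 1.0   # raw point count, normalised below

    bev[:, :, 2] = np.minimum(1.0, np.log1p(bev[:, :, 2]) / np.log(64.0))
    return bev
```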
B. Results and Discussion
Firstly, in order to evaluate the effectiveness of our proposed CycleGAN-based DA approach for the vehicle detection task from real BEV point cloud images, in Fig. 3 we show qualitative results of the trained CycleGAN-based DA approach between synthetic and real BEV point cloud images. The first row of the figure is the input synthetic BEV point cloud image to our model. The second row represents the output from the generator G_{S→R} of our trained CycleGAN model. The third row shows one sample of a real BEV point cloud image from the KITTI dataset. As can be noticed, the generated BEV point cloud from our CycleGAN model mimics and tries to be consistent with the structure that exists in the real BEV point cloud image from KITTI. More specifically, the generated image captures fairly well the structure of the vehicles and the distortion/noise artefacts resulting from the real Velodyne 3D LiDAR sensor.

Fig. 4. Qualitative results on the KITTI BEV point cloud dataset for the vehicle detection task. From left to right: a) the input BEV image, b) bounding box detections from the YOLO_K model, c) bounding box detections from the YOLO_KS model, d) bounding box detections from the YOLO_KR model.

TABLE I
COMPARISON BETWEEN OUR TRAINED YOLOV3 MODELS ON THE SAME TESTING SPLIT OF BEV POINT CLOUD IMAGES FROM THE KITTI DATASET [11]. HIGHER IS BETTER.

Model      Training Data    Average Precision (AP) %
YOLO_S     SYN (only)       29.93
YOLO_R     DA (only)
YOLO_K     KITTI (only)     57.26
YOLO_KS    KITTI+SYN        59.16
YOLO_KR    KITTI+DA         64.29

For a more quantitative evaluation of our proposed CycleGAN-based DA approach for the vehicle detection task, we trained two YOLOv3 models: the first one, YOLO_S, is trained using the 6K synthetic BEV point cloud images, while the other one, YOLO_R, is trained using the same 6K BEV point cloud images but in their domain-adapted versions, obtained after feeding them to our trained CycleGAN model and taking its predicted real-like BEV point cloud images. Furthermore, we trained three additional YOLOv3 models differing only in the type of training data. The first model, YOLO_K, as the name implies, is trained on the 4K training-split BEV point cloud images from the KITTI dataset. The second model, YOLO_KS, is trained using the 4K images from the KITTI dataset with an additional 6K synthetic BEV point cloud images from the MDLS dataset. The third and final model, YOLO_KR, is trained using the same amount of data as the YOLO_KS model; however, instead of the MDLS synthetic BEV images, we used the domain-adapted versions predicted by our CycleGAN model.

In Table I, we report the performance of the five YOLOv3 models mentioned above, all tested on the same 3481 testing real BEV point cloud images from the KITTI dataset. The evaluation metric we used is the average precision (AP) score, which summarises the precision-recall curve commonly used for evaluating object detectors. As can be noticed from the table, the YOLO_R model outperformed YOLO_S by more than 4% in AP score, which supports our claim that our CycleGAN-based domain-adapted BEV point cloud images are more effective than purely synthetic ones for the vehicle detection task. Additionally, the best performing model, with 64.29% in AP score, is YOLO_KR, which again proves the benefit of using domain-adapted BEV point cloud images over purely synthetic ones. This is evident from Table I in the lower AP scores of the YOLO_K and YOLO_KS models, which achieved AP scores of only 57.26% and 59.16% respectively.

For a qualitative measure of the performance of the trained YOLOv3 models, in Fig. 4 we show a) an input sample BEV point cloud image, and b), c) and d) the detected bounding boxes (in green) from models YOLO_K, YOLO_KS and YOLO_KR respectively. The ground truth annotations are highlighted in light blue, while the false or missed detections are highlighted in red. As can be seen, our model YOLO_KR gives accurate detections with the lowest false-positive rate.
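Since AP is the single summary number compared throughout Table I, the following short sketch shows one common way such a score can be computed from ranked detections (an illustrative all-point interpolation; the exact evaluation protocol behind the reported numbers follows the KITTI BEV benchmark and is not reproduced here):

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Compute AP from detection confidences and per-detection TP/FP flags.

    scores: (N,) detection confidences; is_true_positive: (N,) booleans obtained
    by matching detections to ground-truth boxes (e.g. by an IoU threshold);
    num_ground_truth: total number of annotated vehicles in the test split.
    """
    order = np.argsort(-np.asarray(scores))                 # rank detections by confidence
    tp = np.asarray(is_true_positive, dtype=np.float64)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_ground_truth, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)

    # Precision envelope: for each recall level keep the best precision to its right.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Area under the (monotonic) precision-recall curve.
    return np.sum(np.diff(np.concatenate(([0.0], recall))) * precision)
```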
V. CONCLUSION

In this work, we have introduced a framework for domain adaptation between synthetic and real BEV point cloud images for the vehicle detection task. The proposed framework utilises deep generative adversarial networks, namely CycleGAN, for the domain adaptation task. Then, given the domain-adapted BEV point cloud images, we trained a series of object detection models based on the state-of-the-art deep ConvNet model YOLOv3. The trained models have shown the effectiveness of the proposed DA approach for the vehicle detection task from real BEV point cloud images. Furthermore, we have evaluated the performance of the trained models on the testing split of real BEV point cloud images from the KITTI dataset. The best performing model was the one utilising our domain-adapted BEV point cloud images, which achieved the highest average precision score of 64.29%, an improvement of more than 7% over the compared baseline approaches.
ACKNOWLEDGEMENT
This research was fully supported by the Institute for Intelligent Systems Research and Innovation (IISRI) at Deakin University.
REFERENCES
[1] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv, 2018.
[2] K. Saleh, R. A. Zeineldin, M. Hossny, S. Nahavandi, and N. El-Fishawy, "End-to-end indoor navigation assistance for the visually impaired using monocular camera," IEEE, 2018, pp. 3504–3510.
[3] K. Saleh, M. Attia, M. Hossny, S. Hanoun, S. Salaken, and S. Nahavandi, "Local motion planning for ground mobile robots via deep imitation learning," IEEE, 2018, pp. 4077–4082.
[4] K. Saleh, M. Hossny, and S. Nahavandi, "Cyclist trajectory prediction using bidirectional recurrent neural networks," in Australasian Joint Conference on Artificial Intelligence. Springer, 2018, pp. 284–295.
[5] A. Abobakr, M. Hossny, H. Abdelkader, and S. Nahavandi, "Rgb-d fall detection via deep residual convolutional lstm networks," Dec 2018, pp. 1–7.
[6] K. Saleh, M. Hossny, and S. Nahavandi, "Long-term recurrent predictive model for intent prediction of pedestrians via inverse reinforcement learning," IEEE, 2018, pp. 1–8.
[7] K. Saleh, M. Hossny, A. Hossny, and S. Nahavandi, "Cyclist detection in lidar scans using faster r-cnn and synthetic depth images," IEEE, 2017, pp. 1–6.
[8] K. Saleh, M. Hossny, and S. Nahavandi, "Effective vehicle-based kangaroo detection for collision warning systems using region-based convolutional networks," Sensors, vol. 18, no. 6, p. 1913, 2018.
[9] D. J. Yoon, T. Y. Tang, and T. D. Barfoot, "Mapless online detection of dynamic objects in 3d lidar," arXiv preprint arXiv:1809.06972, 2018.
[10] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer, "Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud," arXiv preprint arXiv:1809.08495, 2018.
[11] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? the kitti vision benchmark suite," in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[12] M. Wang and W. Deng, "Deep visual domain adaptation: A survey," Neurocomputing, vol. 312, pp. 135–153, 2018.
[13] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[14] B. Li, "3d fully convolutional network for vehicle detection in point cloud," IEEE, 2017, pp. 1513–1518.
[15] L. Caltagirone, S. Scheidegger, L. Svensson, and M. Wahde, "Fast lidar-based road detection using fully convolutional neural networks," IEEE, 2017, pp. 1019–1024.
[16] A. Dewan, G. L. Oliveira, and W. Burgard, "Deep semantic classification for 3d lidar data," IEEE, 2017, pp. 3544–3549.
[17] J. Blitzer, R. McDonald, and F. Pereira, "Domain adaptation with structural correspondence learning," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 120–128.
[18] P. Germain, A. Habrard, F. Laviolette, and E. Morvant, "A pac-bayesian approach for domain adaptation with specialization to linear classifiers," in International Conference on Machine Learning, 2013, pp. 738–746.
[19] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
[20] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, "Adversarial discriminative domain adaptation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7167–7176.
[21] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
[22] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[24] A. Atapour-Abarghouei and T. P. Breckon, "Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2800–2810.
[25] L. Zhang, A. Gonzalez-Garcia, J. van de Weijer, M. Danelljan, and F. S. Khan, "Synthetic data generation for end-to-end thermal infrared tracking," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1837–1850, 2019.
[26] M. Menze and A. Geiger, "Object scene flow for autonomous vehicles," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3061–3070.
[27] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
[28] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[29] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in