AADS: Augmented Autonomous Driving Simulation using Data-driven Algorithms
Wei Li*†, Chengwei Pan*, Rong Zhang*, Jiaping Ren, Yuexin Ma, Jin Fang, Feilong Yan, Qichuan Geng, Xinyu Huang, Huajun Gong, Weiwei Xu, Guoping Wang, Dinesh Manocha†, Ruigang Yang†

Affiliations: Baidu Research, Beijing, China; National Engineering Laboratory of Deep Learning Technology and Application, Beijing, China; Nanjing University of Aeronautics and Astronautics, Nanjing, China; Beijing Engineering Technology Research Center of Virtual Simulation and Visualization, Peking University, Beijing, China; Deepwise AI Lab, Beijing, China; Zhejiang University, Hangzhou, China; University of Hong Kong, Hong Kong, China; Beihang University, Beijing, China; University of Maryland, College Park, MD, USA.

*These authors contributed equally to this work.
†Corresponding author. Email: [email protected] (W.L.); [email protected] (R.G.Y.); [email protected] (D.M.)
Simulation systems have become essential to the development and validation of autonomous driving (AD) technologies. The prevailing state-of-the-art approach for simulation uses game engines or high-fidelity computer graphics (CG) models to create driving scenarios. However, creating CG models and vehicle movements (the assets for simulation) remain manual tasks that can be costly and time consuming. In addition, CG images still lack the richness and authenticity of real-world images, and using CG images for training leads to degraded performance. Here, we present our augmented autonomous driving simulation (AADS). Our formulation augmented real-world pictures with a simulated traffic flow to create photorealistic simulation images and renderings. More specifically, we used LiDAR and cameras to scan street scenes. From the acquired trajectory data, we generated plausible traffic flows for cars and pedestrians and composed them into the background. The composite images could be resynthesized with different viewpoints and sensor models (camera or LiDAR). The resulting images are photorealistic, fully annotated, and ready for training and testing of AD systems from perception to planning. We explain our system design and validate our algorithms with a number of AD tasks from detection to segmentation and predictions. Compared with traditional approaches, our method offers scalability and realism. Scalability is particularly important for AD simulations, and we believe that real-world complexity and diversity cannot be realistically captured in a virtual environment. Our augmented approach combines the flexibility of a virtual environment (e.g., vehicle movements) with the richness of the real world to allow effective simulation.
INTRODUCTION
Autonomous vehicles (AVs) have attracted considerable attention in recent years from researchers, venture capitalists, and the general public. The societal benefits in terms of safety, mobility, and environmental concerns are expected to be tremendous and have captivated the attention of people across the globe. However, in light of recent accidents involving AVs, it has become clear that there is still a long way to go to meet the high standards and expectations associated with AVs. Safety is the key requirement for AVs. It has been argued that an AV has to be test-driven hundreds of millions of miles in challenging conditions to demonstrate statistical reliability in terms of reductions in fatalities and injuries (1), which could take tens of years of road tests even under the most aggressive evaluation schemes. New methods and metrics are being developed to validate the safety of AVs. One possible solution is to use simulation systems, which are common in other domains such as law enforcement, defense, and medical training. Simulations of autonomous driving (AD) can serve two purposes. The first is to test and validate the capability of AVs in terms of environmental perception, navigation, and control. The second is to generate a large amount of labeled training data to train machine learning methods, e.g., a deep neural network. The second purpose has recently been adopted in computer vision (2, 3).

The most common way to generate such a simulator is to use a combination of computer graphics (CG), physics-based modeling, and robot motion planning techniques to create a synthetic environment in which moving vehicles can be animated and rendered. A number of simulators have recently been developed, such as Intel's CARLA (4), Microsoft's AirSim (5), NVIDIA's Drive Constellation (6), and Google/Waymo's CarCraft (7). Although all of these simulators achieve state-of-the-art synthetic rendering results, these approaches are difficult to deploy in the real world. A major hurdle is the need for high-fidelity environmental models. The cost of creating life-like CG models is prohibitively high. Consequently, synthetic images from these simulators have a distinct, CG-rendered look and feel, i.e., gaming or virtual reality (VR) system quality. In addition, the animation of moving obstacles, such as cars and pedestrians, is usually scripted and lacks the flexibility and realism of real scenes. Moreover, these systems are unable to generate the different scenarios composed of vehicles, pedestrians, or bicycles that are observed in urban environments.

Here, we present a data-driven approach for end-to-end simulation for AD: augmented autonomous driving simulation (AADS). Our method augments real-world pictures with a simulated traffic flow to create photorealistic simulation scenarios that resemble real-world renderings. Figure 1 shows the pipeline of our AADS system and its major inputs and outputs. Specifically, we proposed using light detection and ranging (LiDAR) and cameras to scan street scenes. We decomposed the input data into background, scene illumination, and foreground objects. We presented a view synthesis technique to enable changing viewpoints on the static background. The foreground vehicles were fitted with three-dimensional (3D) CG models. With accurately estimated outdoor illumination, the 3D vehicle models, computer-generated pedestrians, and other movable subjects could be repositioned and rendered back into the background images to create photorealistic street view images that looked like they were captured from a dashboard camera on a vehicle. Furthermore, the simulated traffic flows, e.g., the placement and movement of synthetic objects, were based on captured real-world vehicle trajectories that looked natural and captured the complexity and diversity of real-world scenarios.

Fig. 1. The AADS system. Top: The input dataset. Middle: The pipeline of AADS is shown between the dashed lines and contains data preprocessing, novel background synthesis, trajectory synthesis, moving objects' augmentation, and LiDAR simulation. Bottom: The outputs from the AADS system, which include synthesized RGB images, a LiDAR point cloud, and trajectories with ground truth annotations.

Compared with traditional VR-based or game engine-based AV simulation systems, AADS provides more accurate end-to-end simulation capability without requiring costly CG models or tedious programming to define the traffic flow. Therefore, it can be deployed for large-scale use, including training and evaluation of new navigation strategies for the ego vehicle.

The key to AADS's success is the wide availability of 3D scene scans and vehicle trajectory data, both of which are needed for automatic generation of new traffic scenarios. We will also release part of the real-world data that we have collected for the development and evaluation of AADS. The data were fully annotated by a professional labeling service. In addition to AADS, they may also be used for many perception- and planning-related tasks to drive further research in this area.

This paper includes the following technological advances:

1) A data-driven approach for AD simulation: By using scanned street view images and real trajectories, both photorealistic images and plausible movement patterns can be synthesized automatically. This direct scan-to-simulation pipeline, with little manual intervention, enables large-scale testing of autonomous cars virtually anywhere and anytime within a closed-loop simulation.

2) A view synthesis method to enable view interpolation and extrapolation with only a few images: Compared to previous approaches, it generates better quality images with fewer artifacts.

3) A new set of datasets, including the largest set of traffic trajectories and the largest 3D street view dataset with pixel/point level annotation: All of these are captured in metropolitan areas with dense and complex traffic patterns. This kind of dense urban traffic poses notable challenges for AD.

PREVIOUS METHODS
Simulation for AD is a very large topic. Traditionally, simulation capabilities have been primarily used in the planning and control phase of AD, e.g., (8-14). More recently, simulation has been used in the entire AD pipeline, from perception and planning to control [see the survey by Pendleton et al. (15)].

Although Waymo has claimed that its AV has been tested for billions of miles in its proprietary simulation system, CarCraft (7), little technical detail has been released to the public in terms of its fidelity for training machine learning methods. Researchers have tried to use images from video games to train deep learning-based perception systems (16, 17).

Recently, a number of high-fidelity simulators dedicated to AD simulation have been developed, such as Intel's CARLA (4), Microsoft's AirSim (5), and NVIDIA's Drive Constellation (6). They allow end-to-end, closed-loop training and testing of the entire AD pipeline beyond the generation of annotated training data. All of these simulators have their basis in current gaming techniques or engines, which generate high-quality synthetic images in real time. A limitation of these systems is the fidelity of the resulting environmental model. Even with state-of-the-art rendering capabilities, the images produced by these simulators are obviously synthetic. Current state-of-the-art CG rendering may not provide enough accuracy and detail for machine learning methods.

With the availability of LiDAR devices and advances in structure from motion, it is now possible to capture large urban scenes in 3D. However, turning the large-scale point cloud (PC) into a CG-quality rendered image is still an ongoing research problem. Models reconstructed from these point clouds often lack details or complete textures (18). In addition, AD simulators have to address the problem of realistic traffic patterns and movements. Traditional traffic flow simulation algorithms mainly focus on generating trajectories for vehicles and do not take into account the realistic movements of individual cars or pedestrians. One of the challenges is to simulate realistic traffic patterns, particularly in complex situations, when traffic is dense and involves heterogeneous agents (e.g., an intersection scenario with pedestrians in a crosswalk).

Our work is related to the approach described by Alhaija et al. (19), in which 3D vehicle models were rendered onto existing captured real-world background images. However, the observation viewpoint was fixed at capture time, and the 3D models were chosen from an existing 3D repository that may or may not match those in the real-world images. Their approach can be used to augment still images for training perception applications. In contrast, with the ability to freely change the observation viewpoint, our system could not only play a role in data augmentation but also enhance a closed-loop simulator such as CARLA (4) or AirSim (5). Further enhanced by realistic traffic simulation ability, our system can also be used for path planning and driving decision applications. In those dynamic applications, our system can generate data in a loop for reinforcement learning and learning-by-demonstration algorithms. Overall, the proposed approach enables closed-loop, end-to-end simulation without the need for environmental modeling and human intervention.

RESULTS
Because AADS is data driven, we first explain the datasets that have been collected. Some of the datasets have already been released, and others will be released with this paper. We then show results for the synthesis of virtual views and the generation of traffic flows, two key components of AADS. Last, we evaluated AADS's effectiveness for AD simulation. Specifically, we show that the simulated red-green-blue (RGB) and LiDAR images were useful for improving the performance of the perception system, whereas the simulated trajectories were useful for improving predictions of obstacle movements, a critical component for the planning and control phases for autonomous cars.

The datasets
When collecting a dataset, we used a hardware system consisting of two RIEGL laser scanners, one real-time line-scanning LiDAR (Velodyne 64-line), a VMX-CS6 stereo camera system, and a high-precision inertial measurement unit (IMU)/global navigation satellite system (GNSS). With the RIEGL scanners, our system could obtain higher-density point clouds with better accuracy than widely used LiDAR scanners, whereas the VMX-CS6 system provided a wide-baseline stereo camera with high resolution (3384 by 2710). With the Velodyne LiDAR, we could obtain the shapes and positions of moving objects. To scan a scene, the hardware was calibrated, synchronized, and then mounted on the top of a mid-size sport utility vehicle (SUV) that cruised around the target scene at an average speed of 30 km/hour. Note that the RGB images were taken about once every meter.

Instead of fully annotating all 2D/RGB and 3D/point cloud data manually, we developed a labeling pipeline to make our labeling process accurate and efficient. Because 2D labeling is expensive in terms of time and labor, we combined the two stages, i.e., 3D labeling and 2D labeling.
By using easy-to-label 3D annotations, we could automatically generate high-quality 2D annotations of static backgrounds/objects in all the image frames by 3D-2D projections (see the sketch at the end of this subsection). Details of the labeling process can be found in (20).

For each image frame, we annotated 25 different classes covered by five groups in both the 2D images and the 3D point clouds. In addition to standard annotation classes, such as cars, motorcycles, and traffic cones, we added a new "tricycle" class, a popular mode of transportation in East Asian countries. We also annotated 35 different lane markings in both 2D and 3D that were not previously available in open datasets. These lane markings were defined on the basis of color (e.g., white and yellow), type (e.g., solid and broken), and usage (e.g., dividing, guiding, stopping, and parking).

The table in Fig. 2 compares our dataset with other street view datasets. Our dataset outperforms other datasets in many aspects, such as scene complexity, number of pixel-level annotations, and number of classes. We have released 143,906 video frames and corresponding pixel-level annotations. Images were assigned to three degrees of difficulty (easy, moderate, and hard) based on scene complexity, a measure of the number of movable objects in an image. Our dataset also contains challenging lighting conditions, such as high-contrast regions due to sunlight and shadows from overpasses. We named the dataset of RGB images ApolloScape-RGB. We also provide 3D point-level annotations in the ApolloScape point cloud (ApolloScape-PC) dataset, which are not available in other street view datasets.

In addition to this article, we also announce ApolloScape-TRAJ, a large-scale dataset for urban streets that includes RGB image sequences and trajectory files. It is focused on trajectories of heterogeneous traffic agents for planning, prediction, and simulation tasks. The dataset includes RGB videos with around 100,000 images at a resolution of 1920 by 1080 and 1000 km of trajectories for all kinds of moving traffic agents. We used the Apollo acquisition car to collect traffic data and generate trajectories. In Beijing, we collected a dataset of trajectories under a variety of lighting conditions and traffic densities. The dataset includes many challenging scenarios involving many vehicles, bicycles, and pedestrians moving around one another.
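The 3D-to-2D projection used to lift point-level labels into image annotations can be illustrated with a short sketch. This is a minimal pinhole-camera version under our own conventions; the function names, the z-buffer strategy, and the array layouts are our assumptions, not the pipeline's actual code.

```python
import numpy as np

def project_labels(points, labels, K, T_cam_from_world, image_shape):
    """Project labeled 3D points into an image to seed a 2D annotation map.

    points: (N, 3) world-space points; labels: (N,) integer class ids.
    K: (3, 3) camera intrinsics; T_cam_from_world: (4, 4) extrinsics.
    Returns an (H, W) label map (0 = unlabeled) and the depth buffer used to
    keep only the nearest point per pixel.
    """
    H, W = image_shape
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_from_world @ pts_h.T).T[:, :3]           # world -> camera frame
    in_front = cam[:, 2] > 0.1                            # drop points behind camera
    cam, lab = cam[in_front], labels[in_front]
    uv = (K @ cam.T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z, lab = u[ok], v[ok], cam[ok, 2], lab[ok]

    label_map = np.zeros((H, W), dtype=np.int32)
    depth = np.full((H, W), np.inf)
    # poor man's z-buffer: write far points first so nearer points,
    # written last, win (NumPy assigns duplicate indices in order)
    order = np.argsort(-z)
    label_map[v[order], u[order]] = lab[order]
    depth[v[order], u[order]] = z[order]
    return label_map, depth
```

In practice, a projection like this would be followed by hole filling and occlusion reasoning before the 2D masks are usable as training labels.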
Evaluations of augmented background synthesis

An important part of our AADS system is synthesizing background images at specific views, using images captured at fixed views, when running closed-loop simulations. This ability stems from the use of image-based rendering and avoids prerequisite modeling of the full environment.

There is a large literature on image-based rendering techniques, although relatively little has been written on capturing scenes with sparse images. We focused on wide-baseline, stereo image-based rendering for street view scenes: The overlap between left images and right images may be less than half the size of the full images. Technically, obtaining reliable depth is an important challenge for image-based rendering techniques. Thus, methods such as (21) use multiview stereo to estimate depth maps. However, most street view datasets provide laser-scanned point clouds, which can be used to generate initial depth maps by rendering the point clouds. Because point clouds tend to be sparse and noisy, initial estimates of depth maps are full of outliers and holes and need to be refined before they are passed on to downstream processing. Thus, we proposed an effective depth refinement method that includes depth filtering and completion procedures (see the sketch at the end of this subsection). To evaluate our depth refinement method, we used initial and refined depth maps (Fig. 3, B and E) to synthesize the same novel view. Results are shown in Fig. 3 (F and G, respectively). When using depth maps without refinement to run image-based rendering, the results suffered from artifacts near errors and holes in the depth maps. Specifically, in Fig. 3 (F and H), fluctuations appeared in the green rectangle as the view changed, whereas window frames were kept straight when using refined depth in the yellow rectangle.

To evaluate our image-based rendering algorithm (specifically, the novel view synthesis algorithm) with refined depth maps, we compared our method with two representative approaches: the content-preserving warping method by Liu et al. (22) and the method by Chaurasia et al. (23). Note that, in our implementation of the method by Chaurasia et al. (23), we used the similarity of superpixels (24) to complete the depth map and performed a local shape-preserving warp on each superpixel. The synthesized images in Fig. 3 were generated using four reference images. Because the images were captured by a stereo camera, the four reference images can be considered as two pairs of stereo images with close-to-parallel views, in which the angle between the two optical axes of the stereo images is small but the baseline is relatively wide (about 1 m). We compared our view interpolation and extrapolation results with these classical methods. As shown in the third row of Fig. 3, the method by Liu et al. (22) performed well for small changes in the novel view compared to the input views. When the view translation became larger, view distortion artifacts became apparent (such as the fence in the green rectangle, the shape of which is deformed inappropriately). For the method by Chaurasia et al. (23), ghost artifacts appeared when neighboring superpixels were assigned inappropriate or incorrect depths. Our method obtained correct depths and preserved the invariant shapes of objects when the view changed, handling both interpolation and extrapolation. The fourth row of Fig. 3 evaluates another scene with both a wide baseline and a large rotation angle. Because of large changes in the novel view, neither the method by Liu et al. (22) nor the method by Chaurasia et al. (23) aligned well with neighboring reference views.
As shown in the figure, curbstones in the green rectangle and the white lane marker in the yellow rectangle reveal misalignment artifacts. In addition, because of tone inconsistencies in the input images, seams are prominent in the results of the methods by Liu et al. (22) and Chaurasia et al. (23). In contrast, our method could effectively eliminate misalignment and seam artifacts.

To further illustrate the effectiveness of our view synthesis approach for closed-loop simulation, we have included a video (movie S3) that shows the synthesized front camera view from a driving car that changes lanes several times. Our view synthesis approach is sufficient for handling such lane changes because it interpolates or extrapolates the viewpoint.
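A minimal sketch of the depth filtering and completion procedure referenced above follows. It keeps only the median-consistency pruning and the first-order Poisson fill; the guided-filter stage is omitted, and the tolerance and kernel-size values are our placeholders.

```python
import numpy as np
from scipy.ndimage import median_filter
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

def refine_depth(depth, rel_tol=0.1, ksize=5):
    """Prune unreliable depths, then fill holes with a first-order Poisson solve.

    depth: (H, W) array; 0 marks missing values. Simplified sketch: the paper
    also applies a guided filter with the color image, omitted here; a small
    kernel is used so thin structures survive the pruning.
    """
    med = median_filter(depth, size=ksize)
    valid = depth > 0
    # prune pixels whose depth deviates strongly from the local median
    keep = valid & (np.abs(depth - med) <= rel_tol * np.maximum(med, 1e-6))
    out = np.where(keep, depth, 0.0)

    # fill zeros by solving Laplace(d) = 0 with known depths as Dirichlet data
    H, W = out.shape
    holes = np.argwhere(out == 0)          # assumes at least one hole exists
    idx = {tuple(p): i for i, p in enumerate(holes)}
    A = lil_matrix((len(holes), len(holes)))
    b = np.zeros(len(holes))
    for i, (y, x) in enumerate(holes):
        A[i, i] = 4.0
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = min(max(y + dy, 0), H - 1), min(max(x + dx, 0), W - 1)
            if out[ny, nx] > 0:
                b[i] += out[ny, nx]        # known neighbor -> right-hand side
            else:
                A[i, idx[(ny, nx)]] -= 1.0 # unknown neighbor -> coupled unknown
    out[out == 0] = spsolve(A.tocsr(), b)
    return out
```

The same membrane-style solve reappears later for color blending; only the boundary data differ.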
Evaluations of trajectory synthesis

Another pillar of AADS is its ability to generate plausible traffic flow, particularly when there are interactions between vehicles and pedestrians, i.e., heterogeneous agents who move at different speeds and with different dynamics. This topic is a full research area in its own right, and we developed techniques for heterogeneous agent simulations. For the sake of completeness, we briefly show the main result in Fig. 4. Readers are referred to ( ) for more technical details. Specifically, Fig. 4 shows a comparison among the ground truth from the input dataset, the results of our simulation method, and the results of the method by Chao et al. (25), a state-of-the-art multiagent simulation approach. In the evaluation, the traffic was simulated on a straight four-lane road. For our method, the number, positions, and velocities of agents were randomly initialized according to the dataset. We evaluated the comparison using velocity and minimum-distance probability distributions. Each metric is divided into 30 intervals; the probability of an interval is the number of samples in that interval divided by the total number of samples (see the sketch below). As shown in Fig. 4, our simulation results are closer to the input data in both the velocity distribution and the minimum distance distribution.
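The interval-based probability distributions used in Fig. 4 can be computed as follows; the value ranges and the sample data here are placeholders, not the measured statistics.

```python
import numpy as np

def interval_distribution(samples, lo, hi, n_bins=30):
    """Probability distribution over fixed intervals: per-bin count / total."""
    hist, _ = np.histogram(samples, bins=n_bins, range=(lo, hi))
    return hist / max(len(samples), 1)

# placeholder data standing in for measured and simulated velocities (m/s)
rng = np.random.default_rng(0)
real_velocities = rng.normal(8.0, 2.0, 5000)
sim_velocities = rng.normal(8.3, 2.2, 5000)

real_p = interval_distribution(real_velocities, 0.0, 15.0)
sim_p = interval_distribution(sim_velocities, 0.0, 15.0)
l1_gap = np.abs(real_p - sim_p).sum()   # smaller = closer to the real data
```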
AADS evaluations by AD applications

As shown in Fig. 1, our AADS can simultaneously produce the following augmented data: (i) photorealistic RGB images with annotation information such as semantic labels and 3D bounding boxes; (ii) an augmented LiDAR point cloud; and (iii) typical traffic flows. In the following evaluations, these data augmentations were synthesized on the basis of our ApolloScape dataset. We summarize the AADS synthetic data and evaluations in terms of RGB images, point clouds, and trajectories as follows:

1) AADS-RGB: For the baseline training set of ApolloScape-RGB, we augmented RGB images with AADS and generated corresponding annotations for the augmented moving agents. This dataset is named AADS-RGB and was used to evaluate our image synthesis method.

2) AADS-PC: With our AADS system, we synthesized up to 100,000 new point cloud frames by simulating the Velodyne HDL-64E S3 LiDAR sensor based on the ApolloScape-PC dataset. The simulation dataset has the same object categories as ApolloScape-PC and similar numbers of objects in each category.

3) AADS-TRAJ: Our AADS system can also produce new trajectories based on the ApolloScape-TRAJ dataset. We further evaluated such augmented data using a trajectory prediction method.

Fig. 2. The ApolloScape dataset and its extension. Top: Table comparing ApolloScape with other popular datasets. Bottom: RGB images, annotations, and a point cloud from top to bottom (left) and some labeled traffic trajectories from the dataset (right).

Fig. 3. (A and B) Raw RGB and depth images in our dataset, respectively. (C to E) Results of depth refinement after filtering and completion. (F and G) Results of view synthesis using initial and refined depths, with close views in (H). (I to K) Final results of view synthesis using the method by Liu et al. (22), the method by Chaurasia et al. (23), and our method, respectively.

Object detection with AADS-RGB
For the evaluation of AADS' capability to simulate camera images, we used two real and three virtual datasets: ApolloScape-RGB annotated images (ApolloScape-RGB), CityScapes, virtual KITTI (VKITTI), synthesized data from the popular simulator CARLA, and our synthesized data (AADS-RGB).

We used the VKITTI (2) dataset to compare our system with a fully synthetic method. The full dataset contains 21,260 images with different weather and lighting conditions. A total of 1600 images were randomly selected as a training set.

CityScapes (27) is a dataset of urban street scenes. There are 5000 annotated images with fine instance-level semantic labels. We used the validation set of 492 images as the testing dataset.

CARLA (4) is the most recent and popular VR simulator for AD. Up to now, it provides two manually built scenes with car models. Because the size of the scene is limited, we generated 1600 images distributed as evenly as possible in the simulated scene.

In this section, we show the effectiveness of the AADS-RGB data. We used the state-of-the-art object detection algorithm Mask R-CNN (28) to perform the experiments. The results were compared using the standard average precision metric at intersection-over-union (IoU) thresholds of 50% (AP50) and 70% (AP70) and a mean bounding box AP (mAP) averaged across IoU thresholds ranging from 50 to 95% in steps of 5% (see the sketch at the end of this subsection). Because we mainly augmented textured vehicles onto images, the object detection evaluation was restricted to vehicles.

Synthetic data generation is an easy way to obtain large-scale datasets and has been proven to be effective in AD. However, the data statistics and distribution limit the capabilities of virtual data: When a model trained with synthetic data is applied to real images, there is a domain gap. Because our simulation method was built on realistic background, placement, and moving-object synthesis, it effectively reduces this domain gap. Our method produced an image (Fig. 5C) that is more visually similar to a real image from CityScapes (Fig. 5D) than are images from the VR simulator CARLA (Fig. 5A) or the fully synthetic dataset VKITTI (Fig. 5B); i.e., images from our system have small domain gaps.

To quantitatively verify the effectiveness of our simulated data, we trained object detectors with our data and with data from CARLA and VKITTI. The trained detectors were tested on the CityScapes dataset, which has no overlap with any of the training sets. We trained models on CARLA-1600, VKITTI-1600, ApolloScape-RGB-1600, and AADS-RGB-2400 separately, where the suffix gives the number of training images. Then, the object detection performance of each trained model was evaluated on the CityScapes validation set. Results are shown in Fig. 5 (right). Because of the domain gap, the metrics of ApolloScape-1600 are higher than those of VKITTI-1600 or CARLA-1600. Note that images in VKITTI are smaller than images in the other datasets; we therefore applied the VKITTI-1600 model to downsampled CityScapes to make the comparison fair. Otherwise, the VKITTI-1600 model tended to miss large cars, leading to degraded detection performance. Adding 800 simulated images to ApolloScape-1600 (AADS-RGB-2400) improved the results by roughly 1%. This demonstrates that our simulation data may be closer to real-world data than data from VR.

Fig. 4. Comparison of traffic synthesis. Velocity and minimum distance distributions of traffic simulation using our method, the method by Chao et al. (25), and the ground truth.

Fig. 5. RGB image augmentation evaluations. The four images on the left were selected from CARLA (A), the VKITTI dataset (B), our AADS-RGB dataset (C), and the testing dataset CityScapes (D). The bar graph on the right shows the evaluation results using the mAP, AP50, and AP70 metrics.
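For reference, the detection metrics used above, AP at a fixed IoU threshold and mAP averaged over thresholds, can be sketched as follows. This is a simplified VOC/COCO-style computation with our own function names and greedy matching, not the exact evaluation code.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-9)

def average_precision(dets, gts, thr=0.5):
    """AP at one IoU threshold; dets = [(score, box), ...], gts = [box, ...]."""
    dets = sorted(dets, key=lambda d: -d[0])      # highest confidence first
    matched, tp, fp = set(), [], []
    for score, box in dets:
        best, best_i = 0.0, -1
        for i, g in enumerate(gts):               # greedy match to best free GT
            if i not in matched and iou(box, g) > best:
                best, best_i = iou(box, g), i
        if best >= thr:
            matched.add(best_i); tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(gts), 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):           # integrate precision over recall
        ap += (r - prev_r) * p; prev_r = r
    return ap

# mAP averaged over IoU 0.50:0.95 in steps of 0.05, per the setup above:
# m_ap = np.mean([average_precision(dets, gts, t) for t in np.arange(0.5, 1.0, 0.05)])
```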
Instance segmentation with AADS-PC
To evaluate the AADS-PC simulations, we used the KITTI-PC dataset (29), the ApolloScape-PC dataset, and our simulated point cloud (AADS-PC). The KITTI-PC dataset consists of 7481 training and 7518 testing frames. These real point cloud frames were labeled in correspondence with the captured RGB images in the front view. This dataset provides evaluation benchmarks for (i) 2D object detection and orientation estimation, (ii) 3D object detection, and (iii) bird's-eye view evaluations.

On the basis of those datasets, we evaluated our AADS system using 3D instance segmentation. This is a typical point cloud-based AD application, which simultaneously runs 3D object detection and point cloud segmentation. We used the state-of-the-art algorithm PointNet++ (30) to perform quantitative evaluation. The results were evaluated using a mean bounding box AP, denoted mAP(Bbox), and a mean mask AP, denoted mAP(mask).

We evaluated the accuracy and effectiveness of the model trained with our simulation data and compared it with models trained on manually labeled real data. The simulation and real data were randomly selected from the AADS-PC and ApolloScape-PC datasets, respectively. The mAP evaluation results of the instance segmentation models are presented in Fig. 6A. When trained with only our simulation data, the instance segmentation models produced results competitive with those obtained from precisely labeled real data. When using 100,000 data points generated by simulation, the segmentation performance was better than that of a model trained on 4000 real data points and came close to that of models trained on 16,000 and 100,000 real data points. In short, by using simulation to increase the size of the training set, performance can approach that of models trained on real-world data.

Next, we used simulation data to boost the real data (i.e., to pretrain the model), as shown in Fig. 6C; a sketch of this two-stage schedule appears at the end of this subsection. Boosting with simulation data significantly improved (by 2 to 4%) the validation accuracy of the original model trained with only the real data. On the ApolloScape-PC dataset, we found that using 100,000 simulated data points to pretrain the model and 1600 real data points for fine-tuning outperformed a model trained with 16,000 real data points in terms of the average mAP over all object types. When fine-tuned with 32,000 real data points, the model surpassed a model trained on 100,000 real data points. These results indicate that our simulation approach may remove the need for 80 to 90% of manually labeled data, greatly reducing the need to label images and thus saving time and money. More details can be found in ( ).

Last, on the basis of instance segmentation, we compared our object placement (traffic simulation) method with alternative placement strategies, e.g., placing objects randomly or under specific rules ( ). As shown in Fig. 6B, the accuracy of models trained with our simulated data outperformed (by 4 to 7%) that of models trained with the other object placement strategies. The accuracy of models trained with our simulated data is close to that of a model trained on real data (a gap of only 1 to 4%, depending on the application).
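The boosting protocol, pretraining on a large simulated set and fine-tuning on a small real set, reduces to a simple two-stage schedule. The sketch below assumes a generic train_one_epoch callback; the model, loaders, epoch counts, and learning rates are placeholders, not the paper's training configuration.

```python
def boost_with_simulation(model, sim_loader, real_loader, train_one_epoch,
                          sim_epochs=10, finetune_epochs=5, finetune_lr_scale=0.1):
    """Pretrain on simulated frames, then fine-tune on real frames.

    `train_one_epoch(model, loader, lr)` stands in for any segmentation
    trainer (e.g., a PointNet++ pipeline); only the schedule matters here.
    """
    base_lr = 1e-3
    for _ in range(sim_epochs):                  # stage 1: large simulated set
        train_one_epoch(model, sim_loader, base_lr)
    for _ in range(finetune_epochs):             # stage 2: small real set, lower lr
        train_one_epoch(model, real_loader, base_lr * finetune_lr_scale)
    return model
```

The lowered fine-tuning rate is a common choice to keep the simulation-pretrained features from being overwritten by the small real set.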
TrafficPredict with AADS-TRAJ

To evaluate the effectiveness of the synthesized traffic, i.e., trajectories of cars, cyclists, and pedestrians, we adopted the TrafficPredict method of Ma et al. ( ) for quantitative evaluation. This method takes the motion patterns of traffic agents in the first T_obs frames as input and predicts their positions in the following T_pred frames. In our evaluation, T_obs and T_pred were set to 5 and 7, respectively. We extended 20,000 real frames from the ApolloScape-TRAJ dataset with an additional 20,000 simulated frames from our AADS-TRAJ dataset to train the deep neural network proposed by Ma et al. ( ). Performance of the trained model was measured using the mean Euclidean distance between predicted positions and the ground truth. In our case, the average displacement error (the mean Euclidean distance over all predicted frames) and the final displacement error (the mean Euclidean distance at the T_pred-th predicted frame) were evaluated (see the sketch below). Prediction error was reduced sharply when training with the additional 20,000 simulated data points (Fig. 7). The error rate for cars was reduced the most because cars were well represented in the simulated trajectories.
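Both displacement errors are straightforward to compute; a sketch follows, with the array shapes as our assumption.

```python
import numpy as np

def displacement_errors(pred, gt):
    """ADE/FDE for predicted trajectories.

    pred, gt: (n_agents, T_pred, 2) arrays of 2D positions.
    ADE = mean Euclidean distance over all predicted frames;
    FDE = mean Euclidean distance at the final (T_pred-th) frame.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)    # (n_agents, T_pred)
    return dist.mean(), dist[:, -1].mean()
```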
CONCLUSIONS

In the previous section, we showed the effectiveness of AADS for various tasks in AD. All of these tasks were achieved by using captured scene data (location specific) and traffic trajectory data (general). The entire AADS system requires very little human intervention. The system may be used to generate large amounts of realistic training data with fine annotation, or it may be used in-line to simulate the entire AD system from perception to planning. The realism and scalability of AADS make it possible to use in real-world scenarios, as long as the background can be captured.

Compared to VR-based simulations, AADS's viewpoint change for RGB data is limited. Deviating too much from the original captured viewpoints leads to degraded image quality. However, we believe that the limited viewing range is acceptable for AD simulation. For the most part, a vehicle drives on flat roads, and the possible viewpoint changes are limited to rotation and 2D translation on the road plane. There is no need to support a bird's-eye view or a third-person perspective for RGB-based perception. Another major limitation of AADS is the lack of lighting/environmental changes (snow/rain) in the scene. For now, these must be the same as in the captured images, but there have been significant advances in image synthesis using generative adversarial networks ( , ). Preliminary results synthesizing seasonal changes have been demonstrated. Enabling full lighting/environmental effect synthesis within AADS is a promising direction that we are actively pursuing.

Fig. 6. LiDAR simulation evaluations. (A) Evaluation of dataset size and type (real or simulated) for real-time instance segmentation. (B) Evaluation results of different object placement methods. (C) Real data boosting evaluations (mean mask AP) using instance segmentation.

MATERIALS AND METHODS
Data preprocessing
AADS used the scanned real images for simulation. Our goal was to simulate new vehicles and pedestrians, with new trajectories, in the scanned scene. To achieve this goal, before simulating data, AADS must remove moving objects, e.g., vehicles and pedestrians, from the scanned RGB images and point clouds. Automatic detection and removal of moving objects constitute a full research topic in its own right; fortunately, most recent datasets provide semantic labels for RGB images as well as point clouds. Using the semantic information in the ApolloScape dataset, we removed objects of specific types, e.g., cars, bicycles, trucks, and pedestrians. After removing moving objects, numerous holes appear in both the RGB images and the point clouds, which must be carefully filled to generate a complete and clean background for AADS. We used a recent RGB image inpainting method (36) to close the holes in the images; this method uses the semantic label to guide a learning-based inpainting technique, which achieved acceptable levels of quality. A sketch of the mask extraction step appears at the end of this subsection. Point cloud completion is introduced with the depth processing for novel background synthesis (see the next section).

Given synthesized background images, we could place any 3D CG model on the ground and then render it into the image space to create a new, composite simulation image. However, to make the composite image photorealistic (i.e., close to the real image), we must first estimate the illumination in the background images. This enables AADS to render 3D CG models with consistent shadows on the ground and on vehicle bodies. We solved this outdoor lighting estimation problem according to the method in (37). In addition, to further improve the realism of composite images, AADS also provides an optional feature to enhance the appearance of the 3D CG models by grabbing textures from real images. Specifically, given an RGB image with unremoved vehicles, we retrieved the corresponding 3D vehicle models and aligned those models to the input image using the method in (38). Similar to (39), we then used symmetry priors to transfer and complete the appearance of the 3D CG models from the aligned real images.

Fig. 7. TrafficPredict evaluations. Comparison of trajectory prediction with 20,000 real trajectory frames and an additional 20,000 simulated trajectory frames.

Fig. 8. Novel view synthesis pipeline. (A) The four nearest reference images were used to synthesize the target view in (D). (B) The four reference images were warped into the target view via a depth proxy. (C) A stitching method was used to yield a complete image. (D) Final results in the novel view were synthesized after post-processing, e.g., hole filling and color blending.
Augmented background synthesis

Given a dense point cloud and image sequence produced by automatic scanning, the most straightforward way to build virtual assets for an AD simulator is to reconstruct the full environment. This line of work focuses on using geometry and texture reconstruction methods to produce complete, large-scale 3D models of the captured real scene. However, these methods cannot avoid hand editing during modeling, which is expensive in terms of time, computation, and storage.

Here, we directly synthesized an augmented background at a specific view as needed during simulation. Our method avoids modeling the full scene ahead of running the simulation. Technically, our method creates such a scan-and-simulate system by using a view synthesis technique.

To synthesize a target view, we first needed to obtain dense depth maps for the input reference images. Ideally, these depth maps would be extracted from the scanned point cloud. Unfortunately, such depth maps are incomplete and unreliable. In our case, these problems came with scanning: (i) The baseline of our stereo camera was too small compared with the size of street view scenes, so there were too few data points for objects far from the camera. (ii) The scenes were full of moving vehicles and pedestrians that needed to be removed; unfortunately, their removal produced holes in the scanned point clouds. (iii) The scenes were often complicated (e.g., many buildings are fully covered with glass), which led to scanning sensor failures and thus incomplete scanned point clouds. We introduced a two-step procedure to address the unreliability and incompleteness of the depth maps: depth filtering and depth completion.

With respect to depth filtering, we applied a judiciously selected combination of pruning filters. The first pruning filter is a median filter: A pixel was pruned if its depth value was sufficiently different from the median-filtered value. To avoid removing thin structures, the kernel size of the median filter was kept small (5 by 5 in our implementation). Then, a guided filter (40) was applied to keep thin structures and to enhance edge alignment between the depth map and the color image. After obtaining a much more reliable depth map, we completed it by propagating the existing depth values into the holes, solving a first-order Poisson equation similar to the one used in colorization algorithms (41).

After depth filtering and completion, we obtained reliable, dense depth maps that provide enough geometric information to render an image into virtual views. Similar to ( ), given a target virtual view, we selected the four nearest reference views to synthesize it. For each reference view, we first used forward mapping to produce a depth map with the camera parameters of the virtual view and then performed depth inpainting to close small holes. Then, backward mapping and an occlusion test were used to warp the reference color image into the target view.

A naïve way to synthesize the target image is to blend all the warped images together. However, when we blended the warped images using the view angle penalty, following previous work (43), obvious artifacts remained. Thus, we treated this as an image stitching problem rather than direct blending: Each pixel x_i of the synthesized image in the target virtual view is optimized to choose a color from exactly one of the warped images.
This can be formulated as a discrete pixel labeling energy function:

$$\{x_i\}^{*} = \operatorname*{arg\,min}_{\{x_i\}} \sum_i \left[ \lambda_1 E_d(x_i) + \lambda_2 E_o(x_i) \right] + \sum_{(i,j) \in N} \left[ \lambda_3 E_c(x_i, x_j) + \lambda_4 E_{dep}(x_i, x_j) + \lambda_5 E_g(x_i, x_j) \right] \quad (1)$$

Here, x_i is the label (the choice of warped image) for the i-th pixel of the target image, and N is the set of one-ring neighbor pairs.

E_d(x_i) is the pixel-wise data term, defined by extending the view angle penalty in (43). In contrast to the scenarios in (43), depth maps of street view scenes always contain pixels with large depth values, which make the view angle penalty too small. To address this problem, when the penalty is close to zero, we added another factor that helps choose the appropriate image by taking advantage of camera position information. Specifically, E_d(x_i) = E_angle(x_i) W_label(x_i), where E_angle(x_i) is the view angle penalty in (43); when E_angle(x_i) is too small, it is clamped to 0.01 in our implementation to balance the two factors and keep W_label(x_i) effective. W_label(x_i) is defined as

$$W_{label}(x_i) = D_{pos}(C_{x_i}, C_{syn}) \, D_{dir}(C_{x_i}, C_{syn})$$

which evaluates the difference between the reference view and the target view. Here, C_{x_i} and C_{syn} denote the camera of the reference view chosen for pixel x_i and the target view's camera, respectively; D_pos is the distance between the two camera centers, and D_dir is the angle between the optical axes of the two cameras.

E_o(x_i) is the occlusion term, used to exclude occluded areas while minimizing the pixel labeling energy. Most occlusions appear near depth edges; thus, when using backward mapping to render the warped images, we detected occlusions by depth testing. All pixels in the reference view with depth values larger than those of the source depth yield an occlusion mask, which is then used to define E_o(x_i): When the occlusion mask is invalid, i.e., the pixel is not occluded, we set E_o(x_i) = 0 to add no penalty to the energy function; when a pixel is occluded, we set E_o(x_i) = ∞ to exclude it completely.

The remaining terms in Eq. 1 are smoothness terms: the color term E_c(x_i, x_j), the depth term E_{dep}(x_i, x_j), and the color gradient term E_g(x_i, x_j). Similar to ( ), the color term is a truncated, seam-hiding pairwise cost first introduced in (44):

$$E_c(x_i, x_j) = \min\left(\lVert c_i^{x_i} - c_i^{x_j} \rVert, t_c\right) + \min\left(\lVert c_j^{x_i} - c_j^{x_j} \rVert, t_c\right)$$

where c_i^{x_i} is the RGB value at pixel i in the warped image chosen by label x_i. Similarly, the depth term is defined as

$$E_{dep}(x_i, x_j) = \min\left(\lvert d_i^{x_i} - d_i^{x_j} \rvert, t_d\right) + \min\left(\lvert d_j^{x_i} - d_j^{x_j} \rvert, t_d\right)$$

where d_i^{x_i} is the depth at pixel i in the warped image chosen by label x_i; the RGB and depth truncation thresholds are set to t_c = 0.5 and t_d = 5 m in our implementation. Because illumination differences may occur between different reference images, the color difference alone is not sufficient to ensure a good stitch, so an additional gradient difference E_g(x_i, x_j) is used. Assuming that the gradient vector should be similar on both sides of the seam, we defined

$$E_g(x_i, x_j) = \lvert g_i^{x} - g_j^{x} \rvert + \lvert g_i^{y} - g_j^{y} \rvert$$

where g_i^{x} is the x-direction color-space gradient at the i-th pixel of the image chosen by x_i.

The term weights in Eq. 1 are set to λ_1 = 200, λ_2 = 1, λ_3 = 200, λ_4 = 100, and λ_5 = 50. The labeling problem is solved using the sequential tree-reweighted message passing (TRW-S) method (45).
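For illustration, the labeling problem of Eq. 1 can be prototyped with a greedy iterated-conditional-modes (ICM) sweep in place of TRW-S; the sketch below omits the gradient term and evaluates the pairwise costs at the center pixel only, so it is a simplification rather than the paper's solver, and all names are ours.

```python
import numpy as np

def stitch_labels(data_cost, warped, warped_depth,
                  lam_color=200.0, lam_depth=100.0, t_c=0.5, t_d=5.0, n_iters=5):
    """Greedy ICM stand-in for the TRW-S solve of Eq. 1 (illustration only).

    data_cost: (L, H, W) data + occlusion costs (np.inf marks occluded pixels);
    warped: (L, H, W, 3) warped reference colors in [0, 1];
    warped_depth: (L, H, W) warped depths in meters.
    Returns an (H, W) map choosing one warped reference per pixel.
    """
    L, H, W = data_cost.shape
    labels = np.argmin(data_cost, axis=0)          # initialize from the data term
    for _ in range(n_iters):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                best_l, best_e = labels[y, x], np.inf
                for l in range(L):
                    e = data_cost[l, y, x]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        m = labels[ny, nx]
                        # truncated, seam-hiding color and depth costs
                        dc = np.abs(warped[l, y, x] - warped[m, y, x]).sum()
                        dd = abs(warped_depth[l, y, x] - warped_depth[m, y, x])
                        e += lam_color * min(dc, t_c) + lam_depth * min(dd, t_d)
                    if e < best_e:
                        best_e, best_l = e, l
                labels[y, x] = best_l
    return labels
```

ICM only finds a local minimum; TRW-S, as used in the paper, gives much stronger solutions on this kind of pairwise energy.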
Figure 8 shows the pipeline and results of augmented background synthesis. Note that, in Fig. 8C, a color difference may remain near the stitching seams after image stitching. To obtain consistent results, a modified Poisson image blending method (46) was applied: We selected the nearest reference image as the source domain and then fixed its edges to propagate color brightness to the other side of the stitch seams. After solving the Poisson equation, we obtained the fusion result shown in Fig. 8D. Note that, when the novel view is far from the input views, e.g., under large view extrapolation, there will be artifacts because of disoccluded regions that cannot be filled by stitching together pieces of the input images. We marked those regions as holes and set their gradient values to zero; these holes were then filled with plausible blurred color when solving the Poisson equation.
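A zero-Laplacian (membrane) variant of this blend can be sketched with Jacobi iterations; the exact method solves a modified Poisson system, so treat this only as an approximation of the brightness propagation described above, with our own names and the assumption that the seam mask does not touch the image border.

```python
import numpy as np

def poisson_blend(stitched, seam_region, n_iters=500):
    """Jacobi relaxation toward a membrane (zero-Laplacian) fill.

    stitched: (H, W) or (H, W, 3) float image from the stitching step.
    seam_region: boolean (H, W) mask of pixels to re-solve; pixels outside
    the mask act as fixed Dirichlet boundary data. Zero gradients inside the
    region yield the plausible blurred fill described in the text.
    """
    out = stitched.astype(np.float64).copy()
    ys, xs = np.nonzero(seam_region)
    for _ in range(n_iters):
        # average of the 4-neighbors = zero-Laplacian update inside the region
        out[ys, xs] = 0.25 * (out[ys - 1, xs] + out[ys + 1, xs] +
                              out[ys, xs - 1] + out[ys, xs + 1])
    return out
```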
Moving objects' synthesis and data augmentation

With a synthesized background image in the target view, a complete simulator should be able to synthesize realistic traffic with diverse moving objects (e.g., vehicles, bicycles, and pedestrians) and produce the corresponding semantic labels and bounding boxes in the simulated images and LiDAR point clouds.

We used the data-driven method described in ( ) to address the challenges of traffic generation and moving-object placement. Specifically, given the localization information, we first extracted lane information from an associated high-definition (HD) map. Then, we randomly initialized the moving objects' positions within lanes and ensured that the directions of the moving objects were consistent with the lanes. We used agents to simulate the objects' movements under constraints such as avoiding collisions and yielding to pedestrians. The multiagent system was iteratively advanced and optimized using previously captured traffic, following a data-driven method. Specifically, we estimated motion states from our real-world trajectory dataset, ApolloScape-TRAJ; these motion states included the position, velocity, and control direction of cars, cyclists, and pedestrians. Note that this real dataset processing was performed in advance of simulation and needed to be done only once. At simulation runtime, we used an interactive optimization algorithm to make a decision for each agent at each frame of the simulation. In particular, we solved this optimization problem by choosing a velocity from the datasets that tends to minimize our energy function; a sketch of this decision step appears at the end of this subsection. The energy function was defined on the basis of the locomotion or dynamics rules of heterogeneous agents, including continuity of velocity, collision avoidance, attraction, direction control, and other user-defined constraints.

With the generated traffic, i.e., the object placement in each simulation frame, we rendered the 3D models into the RGB image space and generated annotated data using the physically based rendering engine PBRT (47). Meanwhile, we also generated the corresponding LiDAR point cloud with annotations, using the method introduced in the next section.
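The per-agent decision step can be sketched as an energy minimization over candidate velocities sampled from the real trajectories. The weights, the safety radius, and the specific term forms below are our placeholders for the energy described in the text, not the system's actual implementation.

```python
import numpy as np

def choose_velocity(agent_v, neighbors, candidate_vs, goal_dir,
                    w_cont=1.0, w_coll=10.0, w_dir=2.0, dt=0.1):
    """Pick the next velocity from dataset candidates by minimizing an energy.

    agent_v, goal_dir: (2,) current velocity and desired heading;
    neighbors: list of (2,) relative positions of nearby agents;
    candidate_vs: iterable of (2,) velocities sampled from the real data.
    """
    best_v, best_e = agent_v, np.inf
    for v in candidate_vs:
        e = w_cont * np.linalg.norm(v - agent_v)              # velocity continuity
        e += w_dir * (1.0 - np.dot(v, goal_dir) /
                      (np.linalg.norm(v) * np.linalg.norm(goal_dir) + 1e-9))
        for p_rel in neighbors:                               # collision avoidance
            gap = np.linalg.norm(p_rel - v * dt)              # predicted separation
            if gap < 1.0:                                     # 1 m safety radius
                e += w_coll * (1.0 - gap)
        if e < best_e:
            best_e, best_v = e, v
    return best_v
```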
LiDAR synthesis

Given 3D models and their placements, it is relatively straightforward to synthesize LiDAR point clouds with popular simulators such as CARLA (4). Nevertheless, there are opportunities to take advantage of specific LiDAR sensors (e.g., the Velodyne HDL-64E S3). We proposed a realistic point cloud synthesis method that models the specific LiDAR sensor in a data-driven fashion. Technically, a real LiDAR sensor captures the surrounding scene by measuring the time of flight of the pulses of each laser beam (48). A laser beam is emitted from the LiDAR and then reflected from target surfaces; a 3D point is generated if the returned pulse energy of the beam is large enough. We modeled the behavior of the laser beams to simulate this physical process. Specifically, an emitted laser beam can be modeled by its vertical and azimuth angles and their angular noises, as well as by the distance measurement noise. For example, the Velodyne HDL-64E S3 LiDAR sensor emits 64 laser beams at different vertical angles spanning the sensor's vertical field of view; for each beam, the mean of the fitted distribution forms the real vertical angle, whereas the noise variance was determined from the deviation of lines constructed by the cone apex and points from the cone surface. The real vertical angles usually differ from the ideal angles by 1° to 3°. In our implementation, we modeled the noise with a Gaussian distribution, setting the distance noise variance to 0.5 cm and the azimuth angular noise variance to 0.05°.

To generate a point cloud, we computed intersections between the laser beams and the scene. Specifically, we proposed a cube map-based method to mix the background of the scene, in the form of points, with the meshes of the 3D CG models. Instead of computing intersections between the beams and the mixed data directly, we computed the intersections with projected maps (e.g., depth maps) of the scene, which offer the equivalent information in a much simpler form. Note that our LiDAR simulation method can easily be extended to arbitrary LiDAR sensors and to any sensor configuration with different numbers and poses of sensors. Figure S1 shows the visual results of our LiDAR simulation.
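A minimal beam-casting sketch of this sensor model follows. The depth_lookup callable stands in for the intersection with the projected depth maps of the mixed scene, and the angle table is a uniform placeholder rather than the calibrated HDL-64E S3 angles; only the Gaussian noise terms follow the values given above.

```python
import numpy as np

def simulate_lidar_frame(depth_lookup, n_beams=64, v_min=-24.9, v_max=2.0,
                         azimuth_step=0.17, sigma_az=0.05, sigma_r=0.005):
    """Generate one LiDAR sweep by casting noisy beams into the scene.

    depth_lookup(vertical_deg, azimuth_deg) -> range in meters (broadcasting
    over arrays). sigma_az in degrees, sigma_r in meters (0.5 cm); vertical
    angle bounds here are placeholders for the per-beam calibrated angles.
    """
    rng = np.random.default_rng()
    vert = np.linspace(v_min, v_max, n_beams)                   # ideal vertical angles
    azim = np.arange(0.0, 360.0, azimuth_step)
    V, A = np.meshgrid(vert, azim, indexing="ij")
    A = A + rng.normal(0.0, sigma_az, A.shape)                  # azimuth angular noise
    R = depth_lookup(V, A) + rng.normal(0.0, sigma_r, A.shape)  # range measurement noise
    v, a = np.radians(V), np.radians(A)
    x = R * np.cos(v) * np.cos(a)                               # spherical -> Cartesian
    y = R * np.cos(v) * np.sin(a)
    z = R * np.sin(v)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# usage with a trivial constant-depth scene:
# pts = simulate_lidar_frame(lambda v, a: np.full_like(v, 20.0))
```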
SUPPLEMENTARY MATERIALS

robotics.sciencemag.org/cgi/content/full/4/28/eaaw0863/DC1
Fig. S1. Visual evaluations of point cloud simulation.
Movie S1. Full movie.
Movie S2. Scan-and-simulation pipeline.
Movie S3. Synthesizing lane changes.
Movie S4. Data augmentation.
Movie S5. Novel view synthesis evaluations.

REFERENCES AND NOTES
1. N. Kalra, S. M. Paddock, Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transp. Res. A Policy Pract., 182–193 (2016).
2. A. Gaidon, Q. Wang, Y. Cabon, E. Vig, Virtual worlds as proxy for multi-object tracking analysis, in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR, 2016).
3. M. Müller, V. Casser, J. Lahoud, N. Smith, B. Ghanem, Sim4CV: A photo-realistic simulator for computer vision applications. Int. J. Comput. Vis., 902–919 (2018).
4. A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, V. Koltun, CARLA: An open urban driving simulator, in Proceedings of the 1st Annual Conference on Robot Learning (PMLR, 2017), pp. 1–16.
5. S. Shah, D. Dey, C. Lovett, A. Kapoor, AirSim: High-fidelity visual and physical simulation for autonomous vehicles. Field Serv. Rob., 621–635 (2018).
6. NVIDIA, NVIDIA Drive Constellation: Virtual Reality Autonomous Vehicle Simulator (NVIDIA, 2017).
7. A. C. Madrigal, "Inside Waymo's secret world for training self-driving cars," The Atlantic (2017).
8. Int. J. Robot. Res., 933–945 (2009).
9. S. J. Anderson, S. C. Peters, T. E. Pilutti, K. Iagnemma, Design and development of an optimal-control-based framework for trajectory planning, threat assessment, and semi-autonomous control of passenger vehicles in hazard avoidance scenarios. Robot. Res., 39–54 (2011).
10. C. Katrakazas, M. Quddus, W.-H. Chen, L. Deka, Real-time motion planning methods for autonomous on-road driving: State-of-the-art and future research directions. Transp. Res. Part C Emerg. Technol., 416–442 (2015).
11. J. Ziegler, P. Bender, M. Schreiber, H. Lategahn, T. Strauss, C. Stiller, T. Dang, U. Franke, N. Appenrodt, C. G. Keller, E. Kaus, R. G. Herrtwich, C. Rabe, D. Pfeiffer, F. Lindner, F. Stein, F. Erbs, M. Enzweiler, C. Knoppel, J. Hipp, M. Haueis, M. Trepte, C. Brenk, A. Tamke, M. Ghanaat, M. Braun, A. Joos, H. Fritz, H. Mock, M. Hein, E. Zeeb, Making Bertha drive: An autonomous journey on a historic route. IEEE Intell. Transp. Syst. Mag., 8–20 (2014).
12. A. Geiger, M. Lauer, F. Moosmann, B. Ranft, H. Rapp, C. Stiller, J. Ziegler, Team AnnieWAY's entry to the 2011 grand cooperative driving challenge. IEEE Trans. Intell. Transp. Syst., 1008–1017 (2012).
13. M. Buehler, K. Iagnemma, S. Singh, Eds., The DARPA Urban Challenge: Autonomous Vehicles in City Traffic (Springer-Verlag, 2009), vol. 56.
14. A. Best, S. Narang, D. Barber, D. Manocha, AutonoVi: Autonomous vehicle planning with dynamic maneuvers and traffic constraints, in Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS, 2017), pp. 2629–2636.
15. S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, M. H. Ang, Perception, planning, control, and coordination for autonomous vehicles. Machines, 6 (2017).
16. M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, R. Vasudevan, Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? arXiv:1610.01983 (2016).
17. S. R. Richter, V. Vineet, S. Roth, V. Koltun, Playing for data: Ground truth from computer games, in Proceedings of the 2016 European Conference on Computer Vision (ECCV, 2016), pp. 102–118.
18. ACM Trans. Graph., 66 (2013).
19. H. A. Alhaija, S. K. Mustikovela, L. Mescheder, A. Geiger, C. Rother, Augmented reality meets computer vision: Efficient data generation for urban driving scenes. Int. J. Comput. Vis., 961–972 (2018).
20. X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, R. Yang, The ApolloScape open dataset for autonomous driving and its application. arXiv:1803.06184 (2018).
21. E. Penner, L. Zhang, Soft 3D reconstruction for view synthesis. ACM Trans. Graph., 235 (2017).
22. F. Liu, M. Gleicher, H. Jin, A. Agarwala, Content-preserving warps for 3D video stabilization. ACM Trans. Graph., 44 (2009).
23. G. Chaurasia, S. Duchene, O. Sorkine-Hornung, G. Drettakis, Depth synthesis and local warps for plausible image-based navigation. ACM Trans. Graph., 30 (2013).
24. R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell., 2274–2282 (2012).
25. Q. Chao, Z. Deng, J. Ren, Q. Ye, X. Jin, Realistic data-driven traffic flow animation using texture synthesis. IEEE Trans. Vis. Comput. Graph., 1167–1178 (2018).
27. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The Cityscapes dataset for semantic urban scene understanding, in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR, 2016).
28. K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV, 2017), pp. 2980–2988.
29. A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR, 2012), pp. 3354–3361.
30. C. R. Qi, L. Yi, H. Su, L. J. Guibas, PointNet++: Deep hierarchical feature learning on point sets in a metric space, in Proceedings of the 2017 Advances in Neural Information Processing Systems (NIPS, 2017), pp. 5099–5108.
31. Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR, 2018), pp. 458–.
35. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR, 2018).
36. Y. Song, C. Yang, Y. Shen, P. Wang, Q. Huang, C.-C. Jay Kuo, SPG-Net: Segmentation prediction and guidance network for image inpainting. arXiv:1805.03356 (2018).
37. Y. Liu, X. Qin, S. Xu, E. Nakamae, Q. Peng, Light source estimation of outdoor scenes for mixed reality. Vis. Comput., 637–646 (2009).
38. M. Corsini, M. Dellepiane, F. Ponchio, R. Scopigno, Image-to-geometry registration: A mutual information method exploiting illumination-related geometric properties, in Computer Graphics Forum (Wiley Online Library, 2009), vol. 28, pp. 1755–1764.
39. N. Kholgade, T. Simon, A. Efros, Y. Sheikh, 3D object manipulation in a single photograph using stock 3D models. ACM Trans. Graph., 127 (2014).
40. K. He, J. Sun, X. Tang, Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell., 1397–1409 (2013).
41. A. Levin, D. Lischinski, Y. Weiss, Colorization using optimization. ACM Trans. Graph., 689–694 (2004).
42. C. Buehler, M. Bosse, L. McMillan, S. Gortler, M. Cohen, Unstructured lumigraph rendering, in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (2001), pp. 425–432.
43. P. Hedman, S. Alsisan, R. Szeliski, J. Kopf, Casual 3D photography. ACM Trans. Graph., 234 (2017).
44. V. Kwatra, A. Schödl, I. Essa, G. Turk, A. Bobick, Graphcut textures: Image and video synthesis using graph cuts. ACM Trans. Graph., 277–286 (2003).
45. V. Kolmogorov, Convergent tree-reweighted message passing for energy minimization. IEEE Trans. Pattern Anal. Mach. Intell., 1568–1583 (2006).
46. P. Pérez, M. Gangnet, A. Blake, Poisson image editing. ACM Trans. Graph., 313–318 (2003).
47. M. Pharr, W. Jakob, G. Humphreys, Physically Based Rendering: From Theory to Implementation (Morgan Kaufmann, 2016).
48. S. Kim, I. Lee, Y. J. Kwon, Simulation of a Geiger-mode imaging LADAR system for performance assessment. Sensors, 8461–8489 (2013).

Funding: This work was supported by NSFC grants 61732016 (to W.W.X.), 61872398 and 61632003 (to G.P.W.), and the National Key R&D Program of China, grant 2017YFB1002700 (to G.P.W.).
Author contributions:
R.G.Y. conceived the project. W.L. and C.W.P. developed the concept and systems. J.P.R. developed the trajectory synthesis framework. R.Z. and Q.C.G. performed the synthesized RGB image evaluations. X.Y.H. helped collect the RGB and point cloud datasets. J.F. and F.L.Y. performed the synthesized LiDAR point cloud evaluations. Y.X.M. helped collect the trajectory dataset and performed the simulated trajectory evaluations. G.P.W., W.W.X., and H.J.G. discussed the results and contributed to the final manuscript. W.L., D.M., and R.G.Y. wrote the paper.
Competing interests:
W.L., C.W.P., and R.Z. completed the work while interning with Baidu Research. F.L.Y., J.F., and R.G.Y. are inventors on patent application no. CN20181105574.2, "A method of LiDAR point cloud simulation for autonomous driving." The other authors declare that they have no competing interests.
Data and materials availability:
The RGB and point cloud datasets (ApolloScape-RGB and ApolloScape-PC) are hosted at http://apolloscape.auto/scene.html. The trajectory dataset (ApolloScape-TRAJ) announced along with this paper can be freely downloaded at http://apolloscape.auto/trajectory.html. Some of the data used and the code are proprietary.

Submitted 4 December 2018
Accepted 5 March 2019
Published 27 March 2019
10.1126/scirobotics.aaw0863
Citation: W. Li, C. W. Pan, R. Zhang, J. P. Ren, Y. X. Ma, J. Fang, F. L. Yan, Q. C. Geng, X. Y. Huang, H. J. Gong, W. W. Xu, G. P. Wang, D. Manocha, R. G. Yang, AADS: Augmented autonomous driving simulation using data-driven algorithms. Sci. Robot. 4, eaaw0863 (2019).