The ParallelEye Dataset: Constructing Large-Scale Artificial Scenes for Traffic Vision Research
Xuan Li, Kunfeng Wang, Member, IEEE, Yonglin Tian, Lan Yan, and Fei-Yue Wang, Fellow, IEEE

This work was partly supported by the National Natural Science Foundation of China under Grant 61533019, Grant 71232006, and Grant 91520301.
Xuan Li is with the School of Automation, Beijing Institute of Technology, Beijing 100081, China, and also with The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).
Kunfeng Wang (corresponding author) is with The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with Qingdao Academy of Intelligent Industries, Qingdao 266000, China (e-mail: [email protected]).
Yonglin Tian is with the Department of Automation, University of Science and Technology of China, Hefei 230027, China, and also with The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.
Lan Yan is with The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.
Fei-Yue Wang is with The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the Research Center for Computational Experiments and Parallel Systems Technology, National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]).
Abstract—Video image datasets play an essential role in the design and evaluation of traffic vision algorithms. Nevertheless, a longstanding inconvenience concerning image datasets is that manually collecting and annotating large-scale, diversified datasets from real scenes is time-consuming and prone to error. For that reason, virtual datasets have begun to function as a proxy for real datasets. In this paper, we propose to construct large-scale artificial scenes for traffic vision research and generate a new virtual dataset called "ParallelEye". First of all, street map data are used to build a 3D scene model of the Zhongguancun Area, Beijing. Then, computer graphics, virtual reality, and rule modeling technologies are utilized to synthesize large-scale, realistic virtual urban traffic scenes, whose fidelity and geography match the real world well. Furthermore, the Unity3D platform is used to render the artificial scenes and generate accurate ground-truth labels, e.g., semantic/instance segmentation, object bounding boxes, object tracking, optical flow, and depth. The environmental conditions in the artificial scenes can be controlled completely. As a result, we present a viable implementation pipeline for constructing large-scale artificial scenes for traffic vision research. The experimental results demonstrate that this pipeline is able to generate photorealistic virtual datasets with low modeling time and high labeling accuracy.
I. INTRODUCTION
Publicly available video image datasets have received much attention in recent years, due to their indispensability in the design and evaluation of computer vision algorithms [1]. In general, a computer vision algorithm needs a large amount of labeled images for training and evaluation. The datasets can be divided into two types: unlabeled datasets used for unsupervised learning and labeled datasets used for supervised learning. However, manually annotating the images is time-consuming and labor-intensive, and participants often lack professional knowledge, making some annotation tasks difficult to execute. Expert annotators are scarce and must be properly identified. As we know, human annotators are subjective, and their annotations should be re-examined if two or more annotators disagree about the label of one entity. By contrast, the computer is objective in processing data and particularly good at batch processing, so why not let the computer annotate the images automatically?

At present, most publicly available datasets are obtained from real scenes. As the computer vision field enters the big data era, researchers have begun to look for better ways to annotate large-scale datasets [2]. At the same time, the development of virtual datasets has a long history, starting at least from Bainbridge's work [3]. Bainbridge used Second Life and World of Warcraft as two distinct examples of virtual worlds to predict the scientific research potential of virtual worlds, and introduced virtual worlds into many research fields that scientists are now exploring, including sociology, computer science, and anthropology. In fact, synthetic data has been used for decades to benchmark the performance of computer vision algorithms. The use of synthetic data has been particularly significant in object detection [4], [5] and optical flow estimation [6]-[8], but most virtual data are not photorealistic or akin to real-world data, and lack sufficient diversity [9]. The fidelity of some virtual data is close to the real world [10]. However, the synthesized virtual worlds are seldom equivalent to the real world in geographic position, and the virtual images are seldom annotated automatically. Richter et al. [11] used a commercial game engine to extract virtual images, with no access to the source code or the content. The SYNTHIA dataset [12] provided a realistic virtual city as well as synthetic images with automatically generated pixel-level annotations, but it lacks other annotations such as object bounding boxes and object tracking. Gaidon et al. [13] proposed a virtual dataset called "Virtual KITTI" as a proxy for tracking algorithm evaluation. While this dataset was cloned from "KITTI", it cannot extend easily to arbitrary traffic networks.
Fig. 1. Examples of our generated ParallelEye dataset. From left to right: a general view of the constructed artificial scenes, its semantic labels, a sample frame with tracking bounding boxes generated automatically, and its semantic labels. Best viewed with zooming.

Due to the above limitations, new virtual datasets that match the real world and provide detailed ground-truth annotations are still desirable.

Manually annotating pixel-level semantics for images is time-consuming and not accurate enough. For example, annotating high-quality semantics with 10-20 categories in one image usually takes 30-60 minutes [14]. This is known as the "curse of dataset annotation" [15]. The more detailed the semantics, the more labor-intensive the annotation process. As a result, many datasets do not provide semantic segmentation annotations. For example, ImageNet [16], [17] has 14 million images, more than one million of which have definite class labels and are annotated with object bounding boxes for object recognition; however, ImageNet does not have semantic segmentation annotations. Some datasets provide only limited semantic segmentation annotations: NYU-Depth V2 [18] has 1449 densely labeled images, KITTI [1] has 547 images, CamVid [19], [20] has 600 images, Urban LabelMe [21] has 942 images, and Microsoft COCO [22] has three hundred thousand images. These datasets play an important role in the study of semantic segmentation. However, they cannot be used directly in intelligent transportation, especially in automobile navigation, because the number of labeled images is insufficient and the segmented semantics cover different categories. Currently, computer vision algorithms that exploit context for pattern recognition would benefit from datasets with many annotated categories embedded in images from complex scenes. Such datasets should contain a wide variety of environmental conditions, with annotated object instances co-occurring in the same scenes. However, real scenes are unrepeatable and the captured images are expensive to annotate, making it difficult to obtain large-scale, diversified datasets with precise annotations.

In order to solve these problems, this paper proposes a pipeline for constructing artificial scenes and generating virtual images. First of all, we use map data to build the 3D scene model of the Zhongguancun Area, Beijing. Then, we use computer graphics, virtual reality, and rule modeling technologies to create a realistic, large-scale virtual urban traffic scene, whose fidelity and geographic information match the real world well. Furthermore, we use the Unity3D development platform to render the scene and automatically annotate the ground-truth labels, including pixel-level semantic/instance segmentation, object bounding boxes, object tracking, optical flow, and depth. The environmental conditions in the artificial scenes can be controlled completely. In consequence, we generate a new virtual image dataset, called "ParallelEye" (see Fig. 1). We will build a website and make this dataset publicly available before the publication of this paper. The experimental results demonstrate that our proposed implementation pipeline is able to generate photorealistic
virtual images with low modeling time and high fidelity.

The rest of this paper is organized as follows. Section II introduces the significance of parallel vision and virtual datasets. Section III presents our approach to constructing artificial scenes and generating virtual images with ground-truth labels. Section IV reports the experimental results and analyzes the performance. Finally, the concluding remarks are made in Section V.

Fig. 2. Basic framework and architecture for parallel vision [23].

II. PARALLEL VISION AND VIRTUAL DATASET
Parallel vision [23]-[25] is an extension of the ACP (Artificial systems, Computational experiments, and Parallel execution) theory [26]-[30] into the computer vision field. For parallel vision, photorealistic artificial scenes are used to model and represent complex real scenes, computational experiments are utilized to learn and evaluate a variety of vision models, and parallel execution is conducted to optimize the vision system online and realize perception and understanding of complex scenes. The basic framework and architecture for parallel vision [23] is shown in Fig. 2. Based on the parallel vision theory, this paper constructs a large-scale virtual urban network and synthesizes a large number of realistic images.

The first stage of parallel vision is to construct photorealistic artificial scenes by simulating a variety of environmental conditions occurring in real scenes, and accordingly to synthesize large-scale, diversified datasets with precise annotations generated automatically. Generally speaking, the construction of artificial scenes can be regarded as "video game design", i.e., using computer animation-like techniques to model the artificial scenes. The main technologies used in this stage include computer graphics, virtual reality, and micro-simulation. Computer graphics and computer vision, on the whole, can be thought of as a pair of forward and inverse problems. The goal of computer graphics is to synthesize image measurements given a description of world parameters according to physics-based image formation principles (forward inference), while the focus of computer vision is to map pixel measurements to 3D scene parameters and semantics (inverse inference). Apparently their goals are opposite, but they can converge to a common point: parallel vision.

From the parallel vision perspective, we design the ParallelEye dataset. ParallelEye is synthesized by referring to the urban network of the Zhongguancun Area, Beijing. Using OpenStreetMap (OSM), an urban network 3 km long and 2 km wide is extracted, and the artificial scenes are constructed on this network. Unity3D is used to control the environmental conditions in the scene. There are 15 object classes in ParallelEye, reflecting the common elements of traffic scenes: sky, buildings, cars, roads, sidewalks, vegetation, fences, traffic signs, traffic lights, lamp poles, billboards, trees, cyclists, pedestrians, and chairs. These object classes can be automatically annotated to generate pixel-level semantics. For traffic vision research, we pay special attention to instance segmentation, with each object of interest segmented automatically. In addition, ParallelEye provides accurate ground truth for object detection and tracking, depth, and optical flow.
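To make the class list concrete, the sketch below pairs each ParallelEye category with a label color and converts a per-pixel class-ID map into a color-coded semantic label image. The 15 class names come from the list above; the specific RGB values, the class-ID ordering, and the function name are illustrative assumptions rather than the dataset's actual palette.

```python
import numpy as np

# Hypothetical label palette: class ID -> (name, RGB color in the label image).
# The 15 class names follow the paper; the colors are illustrative assumptions.
PARALLELEYE_CLASSES = {
    0: ("sky", (70, 130, 180)),      1: ("building", (70, 70, 70)),
    2: ("car", (0, 0, 142)),         3: ("road", (128, 64, 128)),
    4: ("sidewalk", (244, 35, 232)), 5: ("vegetation", (107, 142, 35)),
    6: ("fence", (190, 153, 153)),   7: ("traffic sign", (220, 220, 0)),
    8: ("traffic light", (250, 170, 30)), 9: ("lamp pole", (153, 153, 153)),
    10: ("billboard", (150, 100, 100)),   11: ("tree", (152, 251, 152)),
    12: ("cyclist", (255, 0, 0)),    13: ("pedestrian", (220, 20, 60)),
    14: ("chair", (111, 74, 0)),
}

def ids_to_color(label_ids: np.ndarray) -> np.ndarray:
    """Convert an HxW array of class IDs into an HxWx3 color label image."""
    color = np.zeros((*label_ids.shape, 3), dtype=np.uint8)
    for class_id, (_, rgb) in PARALLELEYE_CLASSES.items():
        color[label_ids == class_id] = rgb
    return color
```

An instance-segmentation variant would simply key the palette by per-object instance ID instead of class ID, mirroring the distinction drawn in Section III-B.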
III. APPROACH
Our pipeline for generating the ParallelEye dataset is shown in Fig. 3. Firstly, the OSM data released by OpenStreetMap are used to achieve the correspondence in geographic location between the virtual and real world. Secondly, CityEngine is used to write CGA (Computer Generated Architecture) rules and design a realistic artificial scene, including roads, buildings, cars, trees, sidewalks, etc. Thirdly, the artificial scene is imported into Unity3D and rendered using scripts and shaders. In the dataset, accurate ground-truth annotations are generated automatically, and the environmental conditions can be controlled completely and flexibly.

Fig. 3. Pipeline for generating the ParallelEye dataset with OpenStreetMap, CityEngine, and Unity3D.
A. Correspondence of Artificial and Real Scenes
In order to increase the fidelity, we choose to import geographic data from OpenStreetMap. Although Google Maps occupies an important position in geographic information, it is not open source. By contrast, OpenStreetMap is an open-source, online map editing project whose goal is to create a world map whose content is freely accessible to everyone. In OpenStreetMap, a way denotes a directional node sequence: each way strings together 2-2000 nodes, connecting one node to the next. The road information includes direction, lane number, lane width, street name, and speed limit. The ways form three combinations: non-closed ways, closed ways, and regions. The non-closed ways correspond to roads, rivers, and railways in the real world; the closed ways correspond to subway lines, bus routes, residential roads, and so on; the regions correspond to buildings, parks, lakes, and so on. Based on these properties of OSM data, it is easy to relate the real world to the geographic information of the artificial scene. Fig. 4 shows the real Automation Building of CASIA (Institute of Automation, Chinese Academy of Sciences) and its virtual proxy generated by CGA rules; they are similar in appearance.
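To make the way/region distinction above concrete, the following sketch parses a raw `.osm` (XML) extract and sorts its ways into non-closed ways, closed ways, and building regions before they would be handed to the scene-modeling stage. The file name and the decision to treat ways tagged `building` as regions are illustrative assumptions; the paper does not specify how its OSM import is implemented.

```python
import xml.etree.ElementTree as ET

def classify_osm_ways(osm_path: str):
    """Split OSM ways into non-closed ways, closed ways, and building regions."""
    tree = ET.parse(osm_path)               # standard OSM XML export
    non_closed, closed, regions = [], [], []
    for way in tree.getroot().iter("way"):
        node_refs = [nd.get("ref") for nd in way.findall("nd")]
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if len(node_refs) < 2:
            continue                         # a valid way needs at least two nodes
        if node_refs[0] != node_refs[-1]:
            non_closed.append((way.get("id"), tags))   # e.g., roads, rivers, railways
        elif "building" in tags:
            regions.append((way.get("id"), tags))      # building footprints -> 3D models
        else:
            closed.append((way.get("id"), tags))       # other closed loops (parks, routes, ...)
    return non_closed, closed, regions

# Example usage (assumed file name for the extracted 3 km x 2 km network):
# roads, loops, buildings = classify_osm_ways("zhongguancun.osm")
```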
B. Generation of Ground-Truth Annotations
As stated above, ground-truth annotations are essential for vision algorithm design and evaluation. Traditionally, the images were annotated by hand. Manual annotation is time-consuming and prone to error: taking semantic/instance segmentation as an example, it usually takes 30-60 minutes to annotate an image with 10-20 object categories. Besides, manual annotation is more or less subjective, so different annotators may produce different semantic labels for the same image, especially near object boundaries. Instead of manual annotation, this paper uses Unity3D to automatically generate accurate ground-truth labels. Fig. 5 shows some examples of ground-truth annotations, including depth, optical flow, object tracking, object detection, instance segmentation, and semantic segmentation.

Fig. 4. The real Automation Building of CASIA (top) and its virtual proxy (bottom).

Fig. 5. Examples of ground-truth annotations generated automatically by Unity3D. Top: depth (left) and optical flow (right). Middle: object tracking (left) and object detection (right). Bottom: pixel-level instance segmentation (left) and semantic segmentation (right). Best viewed with zooming.

Generating ground truth with Unity3D is accurate and efficient. Semantic segmentation ground truth can be generated directly by applying unlit shaders to the materials of the objects, with each category rendered in a unique color. Instance segmentation ground truth is generated with the same method, but assigns a unique color tag to each object of interest. The modified shaders output a color that is not affected by the lighting and shading conditions. Depth ground truth is generated from the built-in depth buffer, which provides depth data for screen coordinates; the depth ranges from 0 to 1 with a nonlinear distribution, with 1 representing "infinitely distant". Optical flow ground truth is generated by calculating the instantaneous velocity of moving objects on the imaging plane and using the pixel changes in the image sequence to find the correspondence between the previous frame and the current frame. Given a pixel point $(x, y)$ in the image, at time $t + \Delta t$ the brightness of that point is $E(x + \Delta x, y + \Delta y, t + \Delta t)$. Let $(u, v) = (\partial x / \partial t, \partial y / \partial t)$ represent the instantaneous velocity of the point in the horizontal and vertical directions; a brightness change occurs when the point moves. We use the Taylor formula to represent the pixel brightness:

$$E(x + \Delta x, y + \Delta y, t + \Delta t) = E(x, y, t) + \frac{\partial E}{\partial x}\Delta x + \frac{\partial E}{\partial y}\Delta y + \frac{\partial E}{\partial t}\Delta t + \varepsilon. \quad (1)$$

As $\Delta t \to 0$, let $\omega = (u, v)$; the optical flow constraint equation is given by

$$-\frac{\partial E}{\partial t} = \frac{\partial E}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial E}{\partial y}\frac{\partial y}{\partial t} = \nabla E \cdot \omega, \quad (2)$$

where $\omega$ is the optical flow of $E(x, y, t)$.

We generate multi-object tracking ground truth based on four rules: 1) when an object appears within the field of view of the camera, its three-dimensional bounding box is converted to a two-dimensional bounding box; 2) when an object appears at or disappears from the image boundary, we perform special handling of the bounding box; 3) we do not draw bounding boxes for objects that are less than 15 pixels wide or less than 10 pixels high; 4) when occlusion occurs and the occlusion rate is higher than a threshold, we do not draw a bounding box for the occluded object.
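A minimal sketch of rules 1), 3), and 4) is given below, assuming the 3D box corners have already been projected into pixel coordinates by the renderer. The helper names, the source of the visible fraction, and the 0.5 occlusion threshold are illustrative assumptions rather than the paper's exact implementation.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

def box_from_projected_corners(corners_px: Sequence[Tuple[float, float]],
                               img_w: int, img_h: int,
                               visible_fraction: float,
                               min_w: float = 15.0, min_h: float = 10.0,
                               occlusion_thresh: float = 0.5) -> Optional[Box]:
    """Rule 1: wrap the projected 3D-box corners in a 2D box, clipped to the image.
    Rule 3: discard boxes narrower than 15 px or shorter than 10 px.
    Rule 4: discard boxes whose occlusion rate exceeds a threshold
            (visible_fraction is assumed to come from the instance-segmentation mask)."""
    xs = [p[0] for p in corners_px]
    ys = [p[1] for p in corners_px]
    # Clip the enclosing rectangle to the image boundary (a simplified stand-in for rule 2).
    x_min, x_max = max(0.0, min(xs)), min(float(img_w), max(xs))
    y_min, y_max = max(0.0, min(ys)), min(float(img_h), max(ys))
    if x_max - x_min < min_w or y_max - y_min < min_h:
        return None
    if 1.0 - visible_fraction > occlusion_thresh:
        return None
    return (x_min, y_min, x_max, y_max)
```

Rule 2's boundary handling is reduced here to clipping the box to the image; the paper does not detail the exact special handling, so that part in particular should be read as a placeholder.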
C. Diversity of Artificial Scenes
In order to increase the diversity and fidelity of the artificial scenes, we control the parameters in the scripts, the materials, and the simulated environmental conditions. Specifically, the controllable parameters include: 1) the number, type, trajectory, speed, and direction of the vehicles; 2) the position and configuration of the camera; 3) the weather (sunny, cloudy, rainy, foggy, etc.) and illumination (daytime, dawn, dusk, etc.).

Fig. 6. Illustration of the diversity of artificial scenes. Top: virtual images with illumination at 6:00 am (left) and 12:00 pm (right) on a sunny day. Bottom: virtual images with fog (left) and rain (right).

Traditionally, video image datasets are collected by capturing in the real world or retrieving from the Internet. It is impossible to control the environmental conditions and repeat the scene layout under different environments, and thus difficult to isolate the effects of environmental conditions on the performance of computer vision algorithms. By contrast, it is easy to control the environmental conditions in artificial scenes. In this work, we are able to flexibly control the camera's location, height, and orientation to capture different contents of the artificial scene. We are also able to dynamically change the illumination (from sunrise to sunset) and weather conditions (sunny, cloudy, and foggy). Although we can change the environmental conditions in the artificial scene, the ground-truth annotations are always easy to generate, no matter how adverse the illumination and weather conditions are and how blurred the image details are. This makes it possible to quantitatively analyze the impact of each environmental condition on algorithm performance, usually called "ceteris paribus analysis". Fig. 6 illustrates the diversity of artificial scenes in terms of illumination and weather conditions.
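To illustrate how such a ceteris paribus sweep might be organized, the sketch below enumerates scene configurations that vary one environmental factor at a time while everything else stays fixed. The parameter names, value lists, and the `render_scene` hook are hypothetical; the paper only states that these factors are controllable in Unity3D.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SceneConfig:
    # Hypothetical controllable parameters, mirroring the factors listed above.
    weather: str = "sunny"       # sunny / cloudy / rainy / foggy
    time_of_day: float = 12.0    # hour of day, drives illumination from dawn to dusk
    camera_height_m: float = 2.0
    camera_yaw_deg: float = 0.0

def ceteris_paribus_sweep(base: SceneConfig, factor: str, values):
    """Yield configurations that vary a single factor, keeping all others fixed."""
    for v in values:
        yield replace(base, **{factor: v})

# Example: isolate the effect of weather while the scene layout is repeated exactly.
base = SceneConfig()
for cfg in ceteris_paribus_sweep(base, "weather", ["sunny", "cloudy", "rainy", "foggy"]):
    pass  # render_scene(cfg) would be called here in an actual pipeline (hypothetical hook)
```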
IV. EXPERIMENTS
Based on the proposed approach, we construct the artificial scene and configure virtual cameras to capture images from the scene. The virtual cameras can be moving or stationary. For automobile applications, the virtual cameras are installed on moving vehicles. For visual surveillance applications, the virtual cameras are fixed on the roadside or at intersections. The experiments are conducted to verify that the artificial scenes are repeatable and that the camera's position, height, and orientation can be configured flexibly.
A. Onboard Camera
In this experiment, an onboard camera is configured at a height of 2 meters, mimicking a camera installed on the vehicle roof. There are 67 vehicles on the road in total, including 52 vehicles parked on the roadside (3 buses, 4 trucks, and 45 cars) and another 15 vehicles in motion. We turn the camera orientation from left to right and obtain five orientations (i.e., -30, -15, 0, 15, and 30 degrees with respect to the lane direction). The distance between two cameras on adjacent lanes is 5 meters. These configurations lead to substantial changes in object appearance. Fig. 7 shows three continuous images captured by the onboard camera.
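The onboard-camera configuration above can be enumerated programmatically. In the sketch below, the 2 m mounting height, the five yaw angles, and the 5 m spacing between cameras on adjacent lanes are taken from the description, while the lateral-offset axis and the pose representation are assumed for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CameraPose:
    height_m: float     # mounting height above the road
    yaw_deg: float      # orientation with respect to the lane direction
    lateral_m: float    # offset toward the adjacent lane (assumed axis)

def onboard_poses(lane_spacing_m: float = 5.0):
    """Five orientations per lane position, as in the onboard-camera experiment."""
    poses = []
    for lateral in (0.0, lane_spacing_m):          # two cameras on adjacent lanes, 5 m apart
        for yaw in (-30.0, -15.0, 0.0, 15.0, 30.0):
            poses.append(CameraPose(height_m=2.0, yaw_deg=yaw, lateral_m=lateral))
    return poses

print(len(onboard_poses()))  # 10 poses in total
```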
Fig. 7. Continuous images captured by an onboard camera: a sample image (left), another image annotated with object bounding boxes (middle), and a third image annotated with tracking bounding boxes of different colors (right). Best viewed with zooming.

Fig. 8. Continuous images captured by a surveillance camera: images annotated with object bounding boxes (top row), original images (middle row), and images annotated with tracking bounding boxes of different colors (bottom row). Best viewed with zooming.
B. Surveillance Camera
In this experiment, a surveillance camera is installed at an intersection. We rotate the camera at a speed of 10 degrees per second over a rotation range of 180 degrees. We also change the camera height, with a lifting speed of 0.1 meters per second and a lifting range of 2-5 meters. Such settings can fully simulate the role of surveillance cameras. Based on this experiment, the artificial scene provides virtual video images for intersection monitoring. Fig. 8 shows images captured by the surveillance camera.

In order to increase the diversity of virtual images and record the ground truth, we adopt the same operations for both the onboard camera and the surveillance camera. To record the ground truth, we use a green bounding box to record the detection ground truth for each object, and assign a bounding box of unique color to record the tracking ground truth for each object instance. To increase diversity, we dynamically change the illumination (daytime, dawn, and dusk) and weather (sunny, cloudy, rainy, and foggy) conditions in the artificial scenes. These changes simulate different environmental conditions in the virtual world, which would otherwise require the expensive process of re-acquiring and re-labeling images of the real world. The advantage of this setting is that it increases the diversity of the ParallelEye dataset. In the experiments, with an image resolution of 500×375 pixels for ParallelEye, the pipeline for artificial scene construction and ground-truth generation runs at 8-12 fps (frames per second) on a workstation computer. We have collected a total of 31,000 image frames, each of which has been annotated with accurate ground truth. We will build a website and make the dataset publicly available before the publication of this paper.
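The pan-and-lift schedule above can be expressed as a simple function of time. In the sketch below, the speeds (10 deg/s, 0.1 m/s) and ranges (180 degrees, 2-5 m) come from the experiment description, while the back-and-forth (triangular) sweep and the function names are our own assumptions.

```python
def surveillance_pose(t: float):
    """Return (yaw_deg, height_m) of the surveillance camera at time t (seconds).

    Pan: 10 deg/s over a 180-degree range; lift: 0.1 m/s over a 2-5 m range.
    Both are modeled here as triangular (back-and-forth) sweeps.
    """
    def triangle(position: float, span: float) -> float:
        # Map an ever-increasing position onto a 0..span..0 back-and-forth sweep.
        cycle = position % (2 * span)
        return cycle if cycle <= span else 2 * span - cycle

    yaw_deg = triangle(10.0 * t, 180.0)       # 10 deg/s within [0, 180]
    height_m = 2.0 + triangle(0.1 * t, 3.0)   # 0.1 m/s within [2, 5]
    return yaw_deg, height_m

# Example: pose after 30 seconds of operation.
print(surveillance_pose(30.0))  # -> (60.0, 5.0)
```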
V. CONCLUDING REMARKS
In this paper, we propose a new virtual image dataset called "ParallelEye". To this end, we present a dataset generation pipeline that uses street maps, computer graphics, virtual reality, and rule modeling technologies to construct a realistic, large-scale virtual urban traffic scene. The artificial scene matches the real world well in terms of fidelity and geographic information. In the artificial scene, we flexibly configure the camera (including its position, height, and orientation) and the environmental conditions to collect diversified images. Each image has been annotated automatically with ground truth including semantic/instance segmentation, object bounding boxes, object tracking, optical flow, and depth.

In the future, we will improve the diversity of ParallelEye by introducing moving pedestrians and cyclists, which are harder to animate, and we will increase the scale of ParallelEye. In addition, we will combine ParallelEye and existing real datasets (e.g., PASCAL VOC, MS COCO, and KITTI) to learn and evaluate traffic vision models, in order to improve their accuracy and robustness when applied to complex traffic scenes.
REFERENCES
[1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research, vol. 32, no. 11, pp. 1231-1237, 2013.
[2] A. Handa, T. Whelan, J. McDonald, and A. J. Davison, "A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM," in Proceedings of the International Conference on Robotics and Automation, IEEE, pp. 1524-1531, 2014.
[3] W. S. Bainbridge, "The scientific research potential of virtual worlds," Science, vol. 317, no. 5837, pp. 472-476, 2007.
[4] J. Marin, D. Vazquez, D. Geronimo, and A. M. Lopez, "Learning appearance in virtual scenarios for pedestrian detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 137-144, 2010.
[5] J. Papon and M. Schoeler, "Semantic pose using deep networks trained on synthetic RGB-D," in Proceedings of the IEEE International Conference on Computer Vision, IEEE, pp. 774-782, 2015.
[6] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, "Performance of optical flow techniques," International Journal of Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[7] B. McCane, K. Novins, D. Crannitch, and B. Galvin, "On benchmarking optical flow," Computer Vision and Image Understanding, vol. 84, no. 1, pp. 126-143, 2001.
[8] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski, "A database and evaluation methodology for optical flow," International Journal of Computer Vision, vol. 92, no. 1, pp. 1-31, 2011.
[9] G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez, and A. M. Lopez, "Vision-based offline-online perception paradigm for autonomous driving," in Proceedings of the IEEE Conference on Applications of Computer Vision, IEEE, pp. 231-238, 2015.
[10] H. Prendinger, K. Gajananan, A. Bayoumy Zaki, A. Fares, R. Molenaar, D. Urbano, H. van Lint, and W. Gomaa, "Tokyo Virtual Living Lab: Designing smart cities based on the 3D Internet," IEEE Internet Computing, vol. 17, no. 6, pp. 30-38, 2013.
[11] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, "Playing for data: Ground truth from computer games," in Proceedings of the European Conference on Computer Vision, Springer, pp. 102-118, 2016.
[12] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 3234-3243, 2016.
[13] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, "Virtual worlds as proxy for multi-object tracking analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 4340-4349, 2016.
[14] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. M. Rehg, "Joint semantic segmentation and 3D reconstruction from monocular video," in Proceedings of the European Conference on Computer Vision, pp. 703-718, 2014.
[15] J. Xie, M. Kiefel, M.-T. Sun, and A. Geiger, "Semantic instance annotation of street scenes by 3D to 2D label transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3688-3697, 2016.
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F.-F. Li, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725-1732, 2014.
[17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[18] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in Proceedings of the European Conference on Computer Vision, pp. 746-760, 2012.
[19] G. J. Brostow, J. Fauqueur, and R. Cipolla, "Semantic object classes in video: A high-definition ground truth database," Pattern Recognition Letters, vol. 30, no. 2, pp. 88-97, 2009.
[20] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, "Segmentation and recognition using structure from motion point clouds," in Proceedings of the European Conference on Computer Vision, Springer, vol. 5302, pp. 44-57, 2008.
[21] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," International Journal of Computer Vision, vol. 77, no. 1, pp. 157-173, 2008.
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740-755, 2014.
[23] K. Wang, C. Gou, and F.-Y. Wang, "Parallel vision: An ACP-based approach to intelligent vision computing," Acta Automatica Sinica, vol. 42, no. 10, pp. 1490-1500, 2016.
[24] K. Wang, C. Gou, N. Zheng, J. M. Rehg, and F.-Y. Wang, "Parallel vision for perception and understanding of complex scenes: Methods, framework, and perspectives," Artificial Intelligence Review, vol. 48, no. 3, pp. 298-328, 2017.
[25] K. Wang, Y. Lu, Y. Wang, Z. Xiong, and F.-Y. Wang, "Parallel imaging: A new theoretical framework for image generation," Pattern Recognition and Artificial Intelligence, vol. 30, no. 7, pp. 577-587, 2017.
[26] F.-Y. Wang, "Parallel control and management for intelligent transportation systems: Concepts, architectures, and applications," IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 3, pp. 630-638, 2010.
[27] F.-Y. Wang, "Parallel control: A method for data-driven and computational control," Acta Automatica Sinica, vol. 39, no. 4, pp. 293-302, 2014.
[28] F.-Y. Wang, X. Wang, L. Li, and L. Li, "Steps toward parallel intelligence," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 4, pp. 345-348, 2016.
[29] L. Li, Y. Lin, D. Cao, N. Zheng, and F.-Y. Wang, "Parallel learning — A new framework for machine learning," Acta Automatica Sinica, vol. 43, no. 1, pp. 1-8, 2017.
[30] X. Liu, X. Wang, W. Zhang, J. Wang, and F.-Y. Wang, "Parallel data: From big data to data intelligence,"