The ParallelEye Dataset: Constructing Large-Scale Artificial Scenes for Traffic Vision Research
Xuan Li, Kunfeng Wang, Member, IEEE, Yonglin Tian, Lan Yan, and Fei-Yue Wang, Fellow, IEEE

This work was partly supported by the National Natural Science Foundation of China under Grant 61533019, Grant 71232006, and Grant 91520301.
Xuan Li is with the School of Automation, Beijing Institute of Technology, Beijing 100081, China, and also with The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).
Kunfeng Wang (corresponding author) is with The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with Qingdao Academy of Intelligent Industries, Qingdao 266000, China (e-mail: [email protected]).
Yonglin Tian is with the Department of Automation, University of Science and Technology of China, Hefei 230027, China, and also with The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.
Lan Yan is with The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.
Fei-Yue Wang is with The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the Research Center for Computational Experiments and Parallel Systems Technology, National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]).
Abstract—Video image datasets play an essential role in the design and evaluation of traffic vision algorithms. Nevertheless, a longstanding inconvenience concerning image datasets is that manually collecting and annotating large-scale, diversified datasets from real scenes is time-consuming and prone to error. For that reason, virtual datasets have begun to function as a proxy for real datasets. In this paper, we propose to construct large-scale artificial scenes for traffic vision research and generate a new virtual dataset called "ParallelEye". First of all, street map data are used to build a 3D scene model of the Zhongguancun Area, Beijing. Then, computer graphics, virtual reality, and rule modeling technologies are utilized to synthesize large-scale, realistic virtual urban traffic scenes, whose fidelity and geography match the real world well. Furthermore, the Unity3D platform is used to render the artificial scenes and generate accurate ground-truth labels, e.g., semantic/instance segmentation, object bounding boxes, object tracking, optical flow, and depth. The environmental conditions in the artificial scenes can be controlled completely. As a result, we present a viable implementation pipeline for constructing large-scale artificial scenes for traffic vision research. The experimental results demonstrate that this pipeline is able to generate photorealistic virtual datasets with low modeling time and high labeling accuracy.
I. INTRODUCTION
Publicly available video image datasets have received much attention in recent years, due to their indispensability in the design and evaluation of computer vision algorithms [1]. In general, a computer vision algorithm needs a large amount of labeled images for training and evaluation. The datasets can be divided into two types: unlabeled datasets used for unsupervised learning and labeled datasets used for supervised learning. However, manually annotating the images is time-consuming and labor-intensive, and participants often lack professional knowledge, making some annotation tasks difficult to execute. Expert annotators are scarce and must be properly identified. As we know, human annotators are subjective, and their annotations should be re-examined if two or more annotators disagree about the label of one entity. By contrast, the computer is objective in processing data and particularly good at batch processing, so why not let the computer annotate the images automatically?

At present, most publicly available datasets are obtained from real scenes. As the computer vision field enters the big data era, researchers have begun to look for better ways to annotate large-scale datasets [2]. At the same time, the development of virtual datasets has a long history, starting at least from Bainbridge's work [3]. Bainbridge used Second Life and World of Warcraft as two distinct examples of virtual worlds to predict the scientific research potential of virtual worlds, and introduced virtual worlds into many research fields that scientists are now exploring, including sociology, computer science, and anthropology. In fact, synthetic data has been used for decades to benchmark the performance of computer vision algorithms. The use of synthetic data has been particularly significant in object detection [4], [5] and optical flow estimation [6]-[8], but most virtual data are not photorealistic or akin to real-world data, and lack sufficient diversity [9]. The fidelity of some virtual data is close to the real world [10]. However, the synthesized virtual worlds are seldom equivalent to the real world in geographic position, and the virtual images are seldom annotated automatically. Richter et al. [11] used a commercial game engine to extract virtual images, with no access to the source code or the content. The SYNTHIA dataset [12] provided a realistic virtual city as well as synthetic images with automatically generated pixel-level annotations, but it lacks other annotations such as object bounding boxes and object tracking. Gaidon et al. [13] proposed a virtual dataset called "Virtual KITTI" as a proxy for tracking algorithm evaluation. While this dataset was cloned from "KITTI", it cannot extend easily to arbitrary traffic networks.
Fig. 1. Examples of our generated ParallelEye dataset. From left to right: a general view of the constructed artificial scenes, its semantic labels, a sample frame with tracking bounding boxes generated automatically, and its semantic labels. Best viewed with zooming.

Due to the above limitations, new virtual datasets that match the real world and provide detailed ground-truth annotations are still desirable.

Manually annotating pixel-level semantics for images is time-consuming and not accurate enough. For example, annotating high-quality semantics with 10-20 categories in one image usually takes 30-60 minutes [14]. This is known as the "curse of dataset annotation" [15]. The more detailed the semantics, the more labor-intensive the annotation process. As a result, many datasets do not provide semantic segmentation annotations. For example, ImageNet [16], [17] has 14 million images, more than one million of which have definite class labels and are annotated with object bounding boxes for object recognition; however, ImageNet does not have semantic segmentation annotations. Some datasets provide only limited semantic segmentation annotations: NYU-Depth V2 [18] has 1449 densely labeled images, KITTI [1] has 547 images, CamVid [19], [20] has 600 images, Urban LabelMe [21] has 942 images, and Microsoft COCO [22] has three hundred thousand images. These datasets play an important role in the study of semantic segmentation. However, they cannot be used directly in intelligent transportation, especially in automobile navigation, because the number of labeled images is insufficient and the segmented semantics cover different categories. Currently, computer vision algorithms that exploit context for pattern recognition would benefit from datasets with many annotated categories embedded in images from complex scenes. Such datasets should contain a wide variety of environmental conditions, with annotated object instances co-occurring in the same scenes. However, real scenes are unrepeatable and the captured images are expensive to annotate, making it difficult to obtain large-scale, diversified datasets with precise annotations.

In order to solve these problems, this paper proposes a pipeline for constructing artificial scenes and generating virtual images. First of all, we use map data to build the 3D scene model of the Zhongguancun Area, Beijing. Then, we use computer graphics, virtual reality, and rule modeling technologies to create a realistic, large-scale virtual urban traffic scene, whose fidelity and geographic information match the real world well. Furthermore, we use the Unity3D development platform to render the scene and automatically annotate the ground-truth labels, including pixel-level semantic/instance segmentation, object bounding boxes, object tracking, optical flow, and depth. The environmental conditions in the artificial scenes can be controlled completely. In consequence, we generate a new virtual image dataset, called "ParallelEye" (see Fig. 1). We will build a website and make this dataset publicly available before the publication of this paper. The experimental results demonstrate that our proposed implementation pipeline is able to generate photorealistic
virtual images with low modeling time and high fidelity.

The rest of this paper is organized as follows. Section II introduces the significance of parallel vision and virtual datasets. Section III presents our approach to constructing artificial scenes and generating virtual images with ground-truth labels. Section IV reports the experimental results and analyzes the performance. Finally, the concluding remarks are made in Section V.

Fig. 2. Basic framework and architecture for parallel vision [23].

II. PARALLEL VISION AND VIRTUAL DATASET
Parallel vision [23]-[25] is an extension of the ACP (Artificial systems, Computational experiments, and Parallel execution) theory [26]-[30] into the computer vision field. For parallel vision, photorealistic artificial scenes are used to model and represent complex real scenes, computational experiments are utilized to learn and evaluate a variety of vision models, and parallel execution is conducted to optimize the vision system online and realize perception and understanding of complex scenes. The basic framework and architecture for parallel vision [23] is shown in Fig. 2. Based on the parallel vision theory, this paper constructs a large-scale virtual urban network and synthesizes a large number of realistic images.

The first stage of parallel vision is to construct photorealistic artificial scenes by simulating a variety of environmental conditions occurring in real scenes, and accordingly to synthesize large-scale, diversified datasets with precise annotations generated automatically. Generally speaking, the construction of artificial scenes can be regarded as "video game design", i.e., using computer animation-like techniques to model the artificial scenes. The main technologies used in this stage include computer graphics, virtual reality, and micro-simulation. Computer graphics and computer vision, on the whole, can be thought of as a pair of forward and inverse problems. The goal of computer graphics is to synthesize image measurements given a description of world parameters according to physics-based image formation principles (forward inference), while the focus of computer vision is to map pixel measurements to 3D scene parameters and semantics (inverse inference). Apparently their goals are opposite, but they can converge to a common point: parallel vision.

From the parallel vision perspective, we design the ParallelEye dataset. ParallelEye is synthesized by referring to the urban network of the Zhongguancun Area, Beijing. Using OpenStreetMap (OSM), an urban network 3 km long and 2 km wide is extracted, and the artificial scenes are constructed on this network. Unity3D is used to control the environmental conditions in the scene. There are 15 object classes in ParallelEye, reflecting the common elements of traffic scenes: sky, buildings, cars, roads, sidewalks, vegetation, fences, traffic signs, traffic lights, lamp poles, billboards, trees, cyclists, pedestrians, and chairs. These object classes can be automatically annotated to generate pixel-level semantics. For traffic vision research, we pay special attention to instance segmentation, with each object of interest segmented automatically. In addition, ParallelEye provides accurate ground truth for object detection and tracking, depth, and optical flow.
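To make the class list concrete, the sketch below pairs each ParallelEye category with a label color and converts a per-pixel class-ID map into a color-coded semantic label image. The 15 class names come from the list above; the specific RGB values, the class-ID ordering, and the function name are illustrative assumptions rather than the dataset's actual palette.

```python
import numpy as np

# Hypothetical label palette: class ID -> (name, RGB color in the label image).
# The 15 class names follow the paper; the colors are illustrative assumptions.
PARALLELEYE_CLASSES = {
    0: ("sky", (70, 130, 180)),      1: ("building", (70, 70, 70)),
    2: ("car", (0, 0, 142)),         3: ("road", (128, 64, 128)),
    4: ("sidewalk", (244, 35, 232)), 5: ("vegetation", (107, 142, 35)),
    6: ("fence", (190, 153, 153)),   7: ("traffic sign", (220, 220, 0)),
    8: ("traffic light", (250, 170, 30)), 9: ("lamp pole", (153, 153, 153)),
    10: ("billboard", (150, 100, 100)),   11: ("tree", (152, 251, 152)),
    12: ("cyclist", (255, 0, 0)),    13: ("pedestrian", (220, 20, 60)),
    14: ("chair", (111, 74, 0)),
}

def ids_to_color(label_ids: np.ndarray) -> np.ndarray:
    """Convert an HxW array of class IDs into an HxWx3 color label image."""
    color = np.zeros((*label_ids.shape, 3), dtype=np.uint8)
    for class_id, (_, rgb) in PARALLELEYE_CLASSES.items():
        color[label_ids == class_id] = rgb
    return color
```

An instance-segmentation variant would simply key the palette by per-object instance ID instead of class ID, mirroring the distinction drawn in Section III-B.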
III. APPROACH
Our pipeline for generating the ParallelEye dataset is shown in Fig. 3. Firstly, the OSM data released by OpenStreetMap are used to achieve the correspondence in geographic location between the virtual and real world. Secondly, CityEngine is used to write CGA (Computer Generated Architecture) rules and design a realistic artificial scene, including roads, buildings, cars, trees, sidewalks, etc. Thirdly, the artificial scene is imported into Unity3D and rendered using scripts and shaders. In the dataset, accurate ground-truth annotations are generated automatically, and the environmental conditions can be controlled completely and flexibly.

Fig. 3. Pipeline for generating the ParallelEye dataset with OpenStreetMap, CityEngine, and Unity3D.
A. Correspondence of Artificial and Real Scenes
In order to increase the fidelity, we choose to import geographic data from OpenStreetMap. Although Google Maps occupies an important position in geographic information, it is not open source. By contrast, OpenStreetMap is an open-source, online map editing project whose goal is to create a world map whose content is freely accessible to everyone. In OpenStreetMap, a way denotes a directional node sequence: each way strings together 2-2000 nodes, connecting one node to the next. The road information includes direction, lane number, lane width, street name, and speed limit. The ways form three combinations: non-closed ways, closed ways, and regions. The non-closed ways correspond to roads, rivers, and railways in the real world; the closed ways correspond to subway lines, bus routes, residential roads, and so on; the regions correspond to buildings, parks, lakes, and so on. Based on these properties of OSM data, it is easy to relate the real world to the geographic information of the artificial scene. Fig. 4 shows the real Automation Building of CASIA (Institute of Automation, Chinese Academy of Sciences) and its virtual proxy generated by CGA rules; they are similar in appearance.
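To make the way/region distinction above concrete, the following sketch parses a raw `.osm` (XML) extract and sorts its ways into non-closed ways, closed ways, and building regions before they would be handed to the scene-modeling stage. The file name and the decision to treat ways tagged `building` as regions are illustrative assumptions; the paper does not specify how its OSM import is implemented.

```python
import xml.etree.ElementTree as ET

def classify_osm_ways(osm_path: str):
    """Split OSM ways into non-closed ways, closed ways, and building regions."""
    tree = ET.parse(osm_path)               # standard OSM XML export
    non_closed, closed, regions = [], [], []
    for way in tree.getroot().iter("way"):
        node_refs = [nd.get("ref") for nd in way.findall("nd")]
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if len(node_refs) < 2:
            continue                         # a valid way needs at least two nodes
        if node_refs[0] != node_refs[-1]:
            non_closed.append((way.get("id"), tags))   # e.g., roads, rivers, railways
        elif "building" in tags:
            regions.append((way.get("id"), tags))      # building footprints -> 3D models
        else:
            closed.append((way.get("id"), tags))       # other closed loops (parks, routes, ...)
    return non_closed, closed, regions

# Example usage (assumed file name for the extracted 3 km x 2 km network):
# roads, loops, buildings = classify_osm_ways("zhongguancun.osm")
```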
B. Generation of Ground-Truth Annotations
As stated above, ground-truth annotations are essential for vision algorithm design and evaluation. Traditionally, the images were annotated by hand. Manual annotation is time-consuming and prone to error: taking semantic/instance segmentation as an example, it usually takes 30-60 minutes to annotate an image with 10-20 object categories. Besides, manual annotation is more or less subjective, so different annotators may produce different semantic labels for the same image, especially near object boundaries. Instead of manual annotation, this paper uses Unity3D to automatically generate accurate ground-truth labels. Fig. 5 shows some examples of ground-truth annotations, including depth, optical flow, object tracking, object detection, instance segmentation, and semantic segmentation.

Fig. 4. The real Automation Building of CASIA (top) and its virtual proxy (bottom).

Fig. 5. Examples of ground-truth annotations generated automatically by Unity3D. Top: depth (left) and optical flow (right). Middle: object tracking (left) and object detection (right). Bottom: pixel-level instance segmentation (left) and semantic segmentation (right). Best viewed with zooming.

Generating ground truth with Unity3D is accurate and efficient. Semantic segmentation ground truth can be generated directly by applying unlit shaders to the materials of the objects, with each category rendered in a unique color. Instance segmentation ground truth is generated with the same method, but assigns a unique color tag to each object of interest. The modified shaders output a color that is not affected by the lighting and shading conditions. Depth ground truth is generated from the built-in depth buffer, which provides depth data for screen coordinates; the depth ranges from 0 to 1 with a nonlinear distribution, with 1 representing "infinitely distant". Optical flow ground truth is generated by calculating the instantaneous velocity of moving objects on the imaging plane and using the pixel changes in the image sequence to find the correspondence between the previous frame and the current frame. Given a pixel point $(x, y)$ in the image, at time $t + \Delta t$ the brightness of that point is $E(x + \Delta x, y + \Delta y, t + \Delta t)$. Let $(u, v) = (\partial x / \partial t, \partial y / \partial t)$ represent the instantaneous velocity of the point in the horizontal and vertical directions; a brightness change occurs when the point moves. We use the Taylor formula to represent the pixel brightness:

$$E(x + \Delta x, y + \Delta y, t + \Delta t) = E(x, y, t) + \frac{\partial E}{\partial x}\Delta x + \frac{\partial E}{\partial y}\Delta y + \frac{\partial E}{\partial t}\Delta t + \varepsilon. \quad (1)$$

As $\Delta t \to 0$, let $\omega = (u, v)$; the optical flow constraint equation is given by

$$-\frac{\partial E}{\partial t} = \frac{\partial E}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial E}{\partial y}\frac{\partial y}{\partial t} = \nabla E \cdot \omega, \quad (2)$$

where $\omega$ is the optical flow of $E(x, y, t)$.

We generate multi-object tracking ground truth based on four rules: 1) when an object appears within the field of view of the camera, its three-dimensional bounding box is converted to a two-dimensional bounding box; 2) when an object appears at or disappears from the image boundary, we perform special handling of the bounding box; 3) we do not draw bounding boxes for objects that are less than 15 pixels wide or less than 10 pixels high; 4) when occlusion occurs and the occlusion rate is higher than a threshold, we do not draw a bounding box for the occluded object.
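A minimal sketch of rules 1), 3), and 4) is given below, assuming the 3D box corners have already been projected into pixel coordinates by the renderer. The helper names, the source of the visible fraction, and the 0.5 occlusion threshold are illustrative assumptions rather than the paper's exact implementation.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

def box_from_projected_corners(corners_px: Sequence[Tuple[float, float]],
                               img_w: int, img_h: int,
                               visible_fraction: float,
                               min_w: float = 15.0, min_h: float = 10.0,
                               occlusion_thresh: float = 0.5) -> Optional[Box]:
    """Rule 1: wrap the projected 3D-box corners in a 2D box, clipped to the image.
    Rule 3: discard boxes narrower than 15 px or shorter than 10 px.
    Rule 4: discard boxes whose occlusion rate exceeds a threshold
            (visible_fraction is assumed to come from the instance-segmentation mask)."""
    xs = [p[0] for p in corners_px]
    ys = [p[1] for p in corners_px]
    # Clip the enclosing rectangle to the image boundary (a simplified stand-in for rule 2).
    x_min, x_max = max(0.0, min(xs)), min(float(img_w), max(xs))
    y_min, y_max = max(0.0, min(ys)), min(float(img_h), max(ys))
    if x_max - x_min < min_w or y_max - y_min < min_h:
        return None
    if 1.0 - visible_fraction > occlusion_thresh:
        return None
    return (x_min, y_min, x_max, y_max)
```

Rule 2's boundary handling is reduced here to clipping the box to the image; the paper does not detail the exact special handling, so that part in particular should be read as a placeholder.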
C. Diversity of Artificial Scenes
In order to increase the diversity and fidelity of the artificial scenes, we control the parameters in the scripts, the materials, and the simulated environmental conditions. Specifically, the controllable parameters include: 1) the number, type, trajectory, speed, and direction of the vehicles; 2) the position and configuration of the camera; 3) the weather (sunny, cloudy, rainy, foggy, etc.) and illumination (daytime, dawn, dusk, etc.).

Fig. 6. Illustration of the diversity of artificial scenes. Top: virtual images with illumination at 6:00 am (left) and 12:00 pm (right) on a sunny day. Bottom: virtual images with fog (left) and rain (right).

Traditionally, video image datasets are collected by capturing in the real world or retrieving from the Internet. It is impossible to control the environmental conditions and repeat the scene layout under different environments, and thus difficult to isolate the effects of environmental conditions on the performance of computer vision algorithms. By contrast, it is easy to control the environmental conditions in artificial scenes. In this work, we are able to flexibly control the camera's location, height, and orientation to capture different contents of the artificial scene. We are also able to dynamically change the illumination (from sunrise to sunset) and weather conditions (sunny, cloudy, and foggy). Although we can change the environmental conditions in the artificial scene, the ground-truth annotations are always easy to generate, no matter how adverse the illumination and weather conditions are and how blurred the image details are. This makes it possible to quantitatively analyze the impact of each environmental condition on algorithm performance, usually called "ceteris paribus analysis". Fig. 6 illustrates the diversity of artificial scenes in terms of illumination and weather conditions.
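To illustrate how such a ceteris paribus sweep might be organized, the sketch below enumerates scene configurations that vary one environmental factor at a time while everything else stays fixed. The parameter names, value lists, and the `render_scene` hook are hypothetical; the paper only states that these factors are controllable in Unity3D.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SceneConfig:
    # Hypothetical controllable parameters, mirroring the factors listed above.
    weather: str = "sunny"       # sunny / cloudy / rainy / foggy
    time_of_day: float = 12.0    # hour of day, drives illumination from dawn to dusk
    camera_height_m: float = 2.0
    camera_yaw_deg: float = 0.0

def ceteris_paribus_sweep(base: SceneConfig, factor: str, values):
    """Yield configurations that vary a single factor, keeping all others fixed."""
    for v in values:
        yield replace(base, **{factor: v})

# Example: isolate the effect of weather while the scene layout is repeated exactly.
base = SceneConfig()
for cfg in ceteris_paribus_sweep(base, "weather", ["sunny", "cloudy", "rainy", "foggy"]):
    pass  # render_scene(cfg) would be called here in an actual pipeline (hypothetical hook)
```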
IV. EXPERIMENTS
Based on the proposed approach, we construct the artificial scene and configure virtual cameras to capture images from the scene. The virtual cameras can be moving or stationary. For automobile applications, the virtual cameras are installed on moving vehicles. For visual surveillance applications, the virtual cameras are fixed on the roadside or at intersections. The experiments are conducted to verify that the artificial scenes are repeatable and that the camera's position, height, and orientation can be configured flexibly.
A. Onboard Camera
In this experiment, an onboard camera is configured at a height of 2 meters, mimicking a camera installed on the vehicle roof. There are 67 vehicles on the road in total, including 52 vehicles parked on the roadside (3 buses, 4 trucks, and 45 cars) and another 15 vehicles in motion. We turn the camera orientation from left to right and obtain five orientations (i.e., -30, -15, 0, 15, and 30 degrees with respect to the lane direction). The distance between two cameras on adjacent lanes is 5 meters. These configurations lead to substantial changes in object appearance. Fig. 7 shows three continuous images captured by the onboard camera.
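The onboard-camera configuration above can be enumerated programmatically. In the sketch below, the 2 m mounting height, the five yaw angles, and the 5 m spacing between cameras on adjacent lanes are taken from the description, while the lateral-offset axis and the pose representation are assumed for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CameraPose:
    height_m: float     # mounting height above the road
    yaw_deg: float      # orientation with respect to the lane direction
    lateral_m: float    # offset toward the adjacent lane (assumed axis)

def onboard_poses(lane_spacing_m: float = 5.0):
    """Five orientations per lane position, as in the onboard-camera experiment."""
    poses = []
    for lateral in (0.0, lane_spacing_m):          # two cameras on adjacent lanes, 5 m apart
        for yaw in (-30.0, -15.0, 0.0, 15.0, 30.0):
            poses.append(CameraPose(height_m=2.0, yaw_deg=yaw, lateral_m=lateral))
    return poses

print(len(onboard_poses()))  # 10 poses in total
```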
Fig. 7. Continuous images captured by an onboard camera: a sample image (left), another image annotated with object bounding boxes (middle), and a third image annotated with tracking bounding boxes of different colors (right). Best viewed with zooming.

Fig. 8. Continuous images captured by a surveillance camera: images annotated with object bounding boxes (top row), original images (middle row), and images annotated with tracking bounding boxes of different colors (bottom row). Best viewed with zooming.
B. Surveillance Camera
In this experiment, a surveillance camera is installed at an intersection. We rotate the camera at a speed of 10 degrees per second over a rotation range of 180 degrees. We also change the camera height, with a lifting speed of 0.1 meters per second and a lifting range of 2-5 meters. Such settings can fully simulate the role of surveillance cameras. Based on this experiment, the artificial scene provides virtual video images for intersection monitoring. Fig. 8 shows images captured by the surveillance camera.

In order to increase the diversity of virtual images and record the ground truth, we adopt the same operations for both the onboard camera and the surveillance camera. To record the ground truth, we use a green bounding box to record the detection ground truth for each object, and assign a bounding box of unique color to record the tracking ground truth for each object instance. To increase diversity, we dynamically change the illumination (daytime, dawn, and dusk) and weather (sunny, cloudy, rainy, and foggy) conditions in the artificial scenes. These changes simulate different environmental conditions in the virtual world, which would otherwise require the expensive process of re-acquiring and re-labeling images of the real world. The advantage of this setting is that it increases the diversity of the ParallelEye dataset. In the experiments, with an image resolution of 500×375 pixels for ParallelEye, the pipeline for artificial scene construction and ground-truth generation runs at 8-12 fps (frames per second) on a workstation computer. We have collected a total of 31,000 image frames, each of which has been annotated with accurate ground truth. We will build a website and make the dataset publicly available before the publication of this paper.
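The pan-and-lift schedule above can be expressed as a simple function of time. In the sketch below, the speeds (10 deg/s, 0.1 m/s) and ranges (180 degrees, 2-5 m) come from the experiment description, while the back-and-forth (triangular) sweep and the function names are our own assumptions.

```python
def surveillance_pose(t: float):
    """Return (yaw_deg, height_m) of the surveillance camera at time t (seconds).

    Pan: 10 deg/s over a 180-degree range; lift: 0.1 m/s over a 2-5 m range.
    Both are modeled here as triangular (back-and-forth) sweeps.
    """
    def triangle(position: float, span: float) -> float:
        # Map an ever-increasing position onto a 0..span..0 back-and-forth sweep.
        cycle = position % (2 * span)
        return cycle if cycle <= span else 2 * span - cycle

    yaw_deg = triangle(10.0 * t, 180.0)       # 10 deg/s within [0, 180]
    height_m = 2.0 + triangle(0.1 * t, 3.0)   # 0.1 m/s within [2, 5]
    return yaw_deg, height_m

# Example: pose after 30 seconds of operation.
print(surveillance_pose(30.0))  # -> (60.0, 5.0)
```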
V. CONCLUDING REMARKS
In this paper, we propose a new virtual image dataset called "ParallelEye". To this end, we present a dataset generation pipeline that uses street maps, computer graphics, virtual reality, and rule modeling technologies to construct a realistic, large-scale virtual urban traffic scene. The artificial scene matches the real world well in terms of fidelity and geographic information. In the artificial scene, we flexibly configure the camera (including its position, height, and orientation) and the environmental conditions to collect diversified images. Each image has been annotated automatically with ground truth including semantic/instance segmentation, object bounding boxes, object tracking, optical flow, and depth.

In the future, we will improve the diversity of ParallelEye by introducing moving pedestrians and cyclists, which are harder to animate, and we will increase the scale of ParallelEye. In addition, we will combine ParallelEye and existing real datasets (e.g., PASCAL VOC, MS COCO, and KITTI) to learn and evaluate traffic vision models, in order to improve their accuracy and robustness when applied to complex traffic scenes.
REFERENCES
[1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research, vol. 32, no. 11, pp. 1231-1237, 2013.
[2] A. Handa, T. Whelan, J. McDonald, and A. J. Davison, "A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM," in Proceedings of the International Conference on Robotics and Automation, IEEE, pp. 1524-1531, 2014.
[3] W. S. Bainbridge, "The scientific research potential of virtual worlds," Science, vol. 317, no. 5837, pp. 472-476, 2007.
[4] J. Marin, D. Vazquez, D. Geronimo, and A. M. Lopez, "Learning appearance in virtual scenarios for pedestrian detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 137-144, 2010.
[5] J. Papon and M. Schoeler, "Semantic pose using deep networks trained on synthetic RGB-D," in Proceedings of the IEEE International Conference on Computer Vision, IEEE, pp. 774-782, 2015.
[6] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, "Performance of optical flow techniques," International Journal of Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[7] B. McCane, K. Novins, D. Crannitch, and B. Galvin, "On benchmarking optical flow," Computer Vision and Image Understanding, vol. 84, no. 1, pp. 126-143, 2001.
[8] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski, "A database and evaluation methodology for optical flow," International Journal of Computer Vision, vol. 92, no. 1, pp. 1-31, 2011.
[9] G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez, and A. M. Lopez, "Vision-based offline-online perception paradigm for autonomous driving," in Proceedings of the IEEE Conference on Applications of Computer Vision, IEEE, pp. 231-238, 2015.
[10] H. Prendinger, K. Gajananan, A. Bayoumy Zaki, A. Fares, R. Molenaar, D. Urbano, H. van Lint, and W. Gomaa, "Tokyo Virtual Living Lab: Designing smart cities based on the 3D Internet," IEEE Internet Computing, vol. 17, no. 6, pp. 30-38, 2013.
[11] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, "Playing for data: Ground truth from computer games," in Proceedings of the European Conference on Computer Vision, Springer, pp. 102-118, 2016.
[12] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 3234-3243, 2016.
[13] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, "Virtual worlds as proxy for multi-object tracking analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 4340-4349, 2016.
[14] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. M. Rehg, "Joint semantic segmentation and 3D reconstruction from monocular video," in Proceedings of the European Conference on Computer Vision, pp. 703-718, 2014.
[15] J. Xie, M. Kiefel, M.-T. Sun, and A. Geiger, "Semantic instance annotation of street scenes by 3D to 2D label transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3688-3697, 2016.
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F.-F. Li, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725-1732, 2014.
[17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[18] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in Proceedings of the European Conference on Computer Vision, pp. 746-760, 2012.
[19] G. J. Brostow, J. Fauqueur, and R. Cipolla, "Semantic object classes in video: A high-definition ground truth database," Pattern Recognition Letters, vol. 30, no. 2, pp. 88-97, 2009.
[20] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, "Segmentation and recognition using structure from motion point clouds," in Proceedings of the European Conference on Computer Vision, Springer, vol. 5302, pp. 44-57, 2008.
[21] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," International Journal of Computer Vision, vol. 77, no. 1, pp. 157-173, 2008.
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740-755, 2014.
[23] K. Wang, C. Gou, and F.-Y. Wang, "Parallel vision: An ACP-based approach to intelligent vision computing," Acta Automatica Sinica, vol. 42, no. 10, pp. 1490-1500, 2016.
[24] K. Wang, C. Gou, N. Zheng, J. M. Rehg, and F.-Y. Wang, "Parallel vision for perception and understanding of complex scenes: Methods, framework, and perspectives," Artificial Intelligence Review, vol. 48, no. 3, pp. 298-328, 2017.
[25] K. Wang, Y. Lu, Y. Wang, Z. Xiong, and F.-Y. Wang, "Parallel imaging: A new theoretical framework for image generation," Pattern Recognition and Artificial Intelligence, vol. 30, no. 7, pp. 577-587, 2017.
[26] F.-Y. Wang, "Parallel control and management for intelligent transportation systems: Concepts, architectures, and applications," IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 3, pp. 630-638, 2010.
[27] F.-Y. Wang, "Parallel control: A method for data-driven and computational control," Acta Automatica Sinica, vol. 39, no. 4, pp. 293-302, 2014.
[28] F.-Y. Wang, X. Wang, L. Li, and L. Li, "Steps toward parallel intelligence," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 4, pp. 345-348, 2016.
[29] L. Li, Y. Lin, D. Cao, N. Zheng, and F.-Y. Wang, "Parallel learning — A new framework for machine learning," Acta Automatica Sinica, vol. 43, no. 1, pp. 1-8, 2017.
[30] X. Liu, X. Wang, W. Zhang, J. Wang, and F.-Y. Wang, "Parallel data: From big data to data intelligence,"