Learning to Map Vehicles into Bird's Eye View
Andrea Palazzi, Guido Borghi, Davide Abati, Simone Calderara, Rita Cucchiara
University of Modena and Reggio Emilia, Italy
{name.surname}@unimore.it

Abstract.
Awareness of the road scene is an essential component for both autonomous vehicles and Advanced Driver Assistance Systems (ADAS), and is gaining importance in both academia and car companies. This paper presents a way to learn a semantic-aware transformation which maps detections from a dashboard camera view onto a broader bird's eye occupancy map of the scene. To this end, a huge synthetic dataset featuring 1M couples of frames, taken from both car dashboard and bird's eye view, has been collected and automatically annotated. A deep network is then trained to warp detections from the first to the second view. We demonstrate the effectiveness of our model against several baselines and observe that it is able to generalize to real-world data despite having been trained solely on synthetic data.
Vision-based algorithms and models have been massively adopted in current-generation ADAS solutions. Moreover, recent research achievements on scene semantic segmentation [9,14], road obstacle detection [3,12] and driver's gaze, pose and attention prediction [7,22] are likely to play a major role in the rise of autonomous driving.

As suggested in [5], three major paradigms can be identified for vision-based autonomous driving systems: mediated perception approaches, based on the total understanding of the scene around the car; behavior reflex methods, in which the driving action is regressed directly from the sensory input; and direct perception techniques, which fuse elements of the previous approaches and learn a mapping between the input image and a set of interpretable indicators which summarize the driving situation.

Following this last line of work, in this paper we develop a model for mapping vehicles across different views. In particular, our aim is to warp vehicles detected from a dashboard camera view into a bird's eye occupancy map of the surroundings, which is an easily interpretable proxy of the road state. Since it is almost impossible to collect a dataset with this kind of information in the real world, we rely exclusively on synthetic data for learning this projection.

We aim to create a system close to surround vision monitoring systems, also called around view cameras, which can be useful tools for assisting drivers during maneuvers by, for example, performing trajectory analysis of vehicles outside the driver's own visual field.

Fig. 1: Simple outline of our task. Vehicle detections in the frontal view (left) are mapped onto a bird's eye view (right), accounting for their positions and size.

In this framework, our contribution is twofold:
– We make available a huge synthetic dataset ( >
– We propose a deep learning architecture for generating bird's eye occupancy maps of the surroundings in the context of autonomous and assisted driving. Our approach does not require a stereo camera, nor more sophisticated sensors like radar and lidar. Conversely, we learn how to project detections from the dashboard camera view onto a broader bird's eye view of the scene (see Fig. 1). To this aim we combine a learned geometric transformation with visual cues that preserve object size and orientation in the warping procedure.

Dataset, code and pre-trained model are publicly available and can be found at http://imagelab.ing.unimore.it/scene-awareness.

Surround view
Few works in the literature tackle the problem of the vehicle's surround view. Most of these approaches are vision and geometry based and are specifically tailored for helping drivers during parking manoeuvres. In particular, in [13] a perspective projection image is transformed into its corresponding bird's eye view through a fitting-parameter search algorithm. The authors of [16] exploit the calibration of six fisheye cameras to integrate six images into a single one by a dynamic programming approach. In [17], algorithms are described for creating, storing and viewing surround images, thanks to multiple synchronized and aligned cameras. Sung et al. [20] proposed a camera-model-based algorithm to reconstruct and view multi-camera images. In [21], a homography matrix is used to perform a coordinate transformation: visible markers are required in input images during the camera calibration process.

Recently, Zhang et al. [24] proposed a surround view camera solution designed for embedded systems, based on a geometric alignment to correct lens distortions, a photometric alignment to correct brightness and color mismatch, and a composite view synthesis.
Videogames for collecting data
The use of synthetic data has recently gained considerable importance in the computer vision community for several reasons. First, modern open-world games exhibit constantly increasing realism, which does not only mean that they feature photorealistic lights, textures etc., but also that they show plausible game dynamics and lifelike autonomous entity AI [18,19]. Furthermore, most research fields in computer vision are now tackled by means of deep networks, which are notoriously data hungry. Particularly in the context of assisted and autonomous driving, the opportunity to exploit virtual yet realistic worlds for developing new techniques has been widely embraced: indeed, this makes it possible to postpone the (very expensive) validation in the real world to the moment in which a new algorithm already performs reasonably well in the simulated environment [23,8]. Building upon this tendency, [5] relies on the TORCS simulator to learn an interpretable representation of the scene useful for the task of autonomous driving. However, while TORCS [23] is a powerful simulation tool, it is still severely limited by the fact that both its graphics and its game variety and dynamics are far from being realistic.

Many elements mark our approach as original. In principle, we want our surround view to include not only nearby elements, like commercial geometry-based systems, but also most of the elements detected in the acquired dashboard camera frame. Additionally, no specific initialization or alignment procedures are necessary: in particular, no camera calibration and no visible alignment points are required. Finally, we aim to preserve the correct dimensions of detected objects, whose shape is mapped onto the surround view consistently with their semantic class.
In order to collect data, we exploit the Script Hook V library [4], which allows the use of Grand Theft Auto V (GTAV) video game native functions [1]. We develop a framework in which the game camera automatically toggles between frontal and bird's eye view at each game time step: in this way we are able to gather information about the spatial occupancy of the vehicles in the scene from both views (i.e. bounding boxes, distances, yaw rotations). We associate vehicle information across the two views by querying the game engine for entity IDs. More formally, for each frame $t$, we compute the set of entities which appear in both views as

$$E(t) = E_{frontal}(t) \cap E_{birdeye}(t) \qquad (1)$$

where $E_{frontal}(t)$ and $E_{birdeye}(t)$ are the sets of entities that appear at time $t$ in frontal and bird's eye view, respectively. Entities $e(t) \in E(t)$ constitute the candidate set $C(t)$ for frame $t$; other entities are discarded.

Fig. 2: (a) Randomly sampled couples from our GTAV dataset, which highlight the huge variety in terms of landscape, traffic condition, vehicle models etc. Each detection is treated as a separate training example (see Sec. 3 for details). (b) Random examples rejected during the post-processing phase.

Unfortunately, we found that raw data coming from the game engine are not always accurate (Fig. 2). To deal with this problem, we implement a post-processing pipeline in order to discard noisy data from the candidate set $C(t)$. We define a discriminator function

$$f(e(t)) : C \mapsto \{0, 1\} \qquad (2)$$

which is positive when the information dumped for $e(t)$ is reliable and zero otherwise. Thus we can define the final filtered dataset as

$$\bigcup_{t=0}^{T} D(t) \quad \text{where} \quad D(t) = \{\, c_i(t) \mid f(c_i(t)) > 0 \,\} \qquad (3)$$

with $T$ being the total number of frames recorded. From an implementation standpoint, we employ a rule-based ontology which leverages entity information (e.g. vehicle model, distance etc.) to decide if the bounding box of that entity can be considered reasonable.
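The candidate selection and filtering of Eqs. (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the `Entity` fields, the `is_reliable` rules and all thresholds are hypothetical stand-ins for the rule-based ontology, whose actual rules are not given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical record for one vehicle dumped from the game engine."""
    entity_id: int
    model: str        # vehicle model name
    distance: float   # distance from the player's car, in meters
    box_area: float   # bounding box area in pixels (illustrative field)


def candidates(frontal, birdeye):
    """Eq. (1): keep the entities whose ID appears in both views."""
    shared = {e.entity_id for e in frontal} & {e.entity_id for e in birdeye}
    return [e for e in frontal if e.entity_id in shared]


def is_reliable(e, max_distance=30.0, min_area=100.0, banned_models=frozenset()):
    """Eq. (2): rule-based discriminator returning 1 or 0.

    The three rules and their thresholds are assumptions for illustration.
    """
    ok = (e.distance <= max_distance
          and e.box_area >= min_area
          and e.model not in banned_models)
    return 1 if ok else 0


def filter_dataset(frames):
    """Eq. (3): union over all frames of the candidates with f(c) > 0."""
    kept = []
    for frontal, birdeye in frames:
        kept.extend(c for c in candidates(frontal, birdeye) if is_reliable(c) > 0)
    return kept


# One toy frame: entity 3 is visible only from above, entity 2 is too far away.
frames = [(
    [Entity(1, "car", 10.0, 500.0), Entity(2, "truck", 50.0, 900.0)],
    [Entity(1, "car", 10.0, 500.0), Entity(2, "truck", 50.0, 900.0),
     Entity(3, "car", 5.0, 400.0)],
)]
kept_ids = [e.entity_id for e in filter_dataset(frames)]
```

Because the rules are plain keyword arguments in this sketch, different dataset distributions could be obtained by changing them, for instance by passing a set of truck models as `banned_models`.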
This implementation has two main advantages: first, it is lightweight and very fast in filtering massive amounts of data. Furthermore, rule parameters can be tuned to generate different dataset distributions (e.g. removing all trucks, keeping only cars closer than 10 meters, etc.).

Table 1: Overview of the statistics on the collected dataset. See text for details.

                            Total
Number of runs                300
Number of bounding boxes  1125187
Unique entity IDs           56454
Unique entity models          198

Each entry of the dataset is a tuple containing:
– frame_f, frame_b: 1920 × 1080
– ID_e, model_e: identifiers of the entity (e) in the scene and of the vehicle's type;
– frontal_coords_e, birdeye_coords_e: the coordinates of the bounding box that encloses the entity;
– distance_e, yaw_e: distance and rotation of the entity w.r.t. the player.

Fig. 3 shows the distributions of entity rotation and distance across the collected data.

Fig. 3: Unnormalized distribution of vehicle orientation (a) and distances (b) present in the collected dataset. The distribution of angles presents two prominent modes around 0°/360° and 180° respectively, since most of the vehicles encountered travel parallel to the player's car, in the same (0°/360°) or the opposite (180°) direction. Conversely, distance is almost uniformly distributed between 5 and 30 meters.

At first glance, the problem we address could be mistaken for a bare geometric warping between different views. Indeed, this is not the case, since targets are not completely visible from the dashboard camera view and their dimensions in the bird's eye map depend on both the object's visual appearance and semantic category (e.g. a truck is longer than a car). Additionally, it cannot be cast as a correspondence problem, since no bird's eye view information is available at test time. Conversely, we tackle the problem from a deep learning perspective: dashboard camera information is employed to learn a spatial occupancy map of the scene seen from above.

Fig. 4: A graphical representation of the proposed SDPN (see Sec. 4). All layers contain ReLU units, except for the top decoder layer which employs tanh activation. The number of fully connected units is (256, , , , , , , 4) for the coordinate encoder and decoder respectively.

Our proposed architecture is composed of two main branches, as depicted in Fig. 4. The first branch takes as input image crops of vehicles detected in the dashboard camera view. We extract deep representations by means of
ResNet50 deep network [10], taking advantage of pre-training for image recognition on ImageNet [6]. To this end we discard the top fully-connected layer, which is tailored for the original classification task. This part of the model is able to extract semantic features from input images, even though it is unaware of the location of the bounding box in the scene.

Conversely, the second branch consists of a deep Multi Layer Perceptron (MLP), composed of 4 fully-connected layers, which is fed with bounding box coordinates (4 for each detection), learning to encode the input into a 256-dimensional feature space. Due to its input domain, this segment of the model is not aware of objects' semantics, and can only learn a spatial transformation between the two planes.

Both appearance features and encodings of bounding box coordinates are then merged through concatenation and undergo a further fully-connected decoder which predicts vehicles' locations in the bird's eye view. Since our model combines information about the object's location with semantic hints on the content of the bounding box, we refer to it as Semantic-aware Dense Projection Network (SDPN in short).
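To make the data flow of the two branches concrete, the following pure-Python sketch traces one forward pass: coordinate encoding, concatenation with appearance features, and decoding through a tanh output layer. It is only an illustration under strong assumptions: the appearance vector is a random stand-in for the ResNet50 features, the weights are random and untrained, and all layer widths (other than the 4 input and 4 output coordinates and the 4-layer encoder) are hypothetical and kept small for readability.

```python
import math
import random


def relu(v):
    return max(0.0, v)


def dense(x, layer, activation):
    """Fully-connected layer: activation(W x + b), with layer = (W, b)."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def make_mlp(sizes, rng):
    """Randomly initialised stack of dense layers with the given widths."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        s = 1.0 / math.sqrt(n_in)
        w = [[rng.uniform(-s, s) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((w, [0.0] * n_out))
    return layers


def sdpn_forward(appearance, box, encoder, decoder):
    """Two-branch forward pass: encode box, concatenate, decode."""
    code = box
    for layer in encoder:                      # coordinate branch, ReLU units
        code = dense(code, layer, relu)
    hidden = appearance + code                 # merge by concatenation
    for layer in decoder[:-1]:                 # decoder hidden layers, ReLU
        hidden = dense(hidden, layer, relu)
    return dense(hidden, decoder[-1], math.tanh)   # top layer uses tanh


rng = random.Random(0)
APPEARANCE_DIM, CODE_DIM = 32, 16              # hypothetical, reduced sizes
encoder = make_mlp([4, 8, 8, 8, CODE_DIM], rng)                # 4 layers
decoder = make_mlp([APPEARANCE_DIM + CODE_DIM, 16, 4], rng)
appearance = [rng.random() for _ in range(APPEARANCE_DIM)]     # stand-in features
box = [0.10, 0.45, 0.30, 0.70]                 # hypothetical frontal box coords
prediction = sdpn_forward(appearance, box, encoder, decoder)
```

The tanh on the last layer bounds each of the four predicted coordinates to the interval [-1, 1], whatever the inputs are.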
Training Details: ImageNet [6] mean pixel value is subtracted from input crops, which are then resized to 224 × 224 before being fed to the network. During training, we freeze ResNet50 parameters. Ground truth coordinates in the bird's eye view are normalized in range [−1, 1]. We train using Mean Squared Error as objective function and the Adam [11] optimizer with the following parameters: lr = 0. , β1 = 0. , β2 = 0. 

We now assess our proposal by comparing its performance against some baselines. Due to the peculiar nature of the task, the choice of competitor models is not trivial.

To validate the choice of a learning perspective against a geometrical one, we introduce a first baseline model that employs a projective transformation to estimate a mapping between corresponding points in the two views. Such correspondences are collected from the bottom corners of both source and target boxes in the training set, then used to estimate a homography matrix in a least-squares fashion (i.e. minimizing reprojection error). Since correspondences mostly belong to the street, which is a planar region, the choice of the projective transformation seems reasonable. The height of the target box, however, cannot be recovered from the projection, thus it is set to the average height among training examples. We refer to this model as the homography model.

Additionally, we design a second baseline by quantizing spatial locations in both views into a regular grid, and learn point mappings in a probabilistic fashion. For each cell $G^f_i$ in the frontal view grid, a probability distribution is estimated over bird's eye grid cells $G^b_j$, encoding the probability of a pixel belonging to $G^f_i$ to fall in the cell $G^b_j$. During training, top-left and bottom-right bounding box corners in both views are used to update such densities. At prediction stage, given a test point $p_k$ which lies in cell $G^f_i$, we predict the destination point by sampling from the corresponding cell distribution. We fix the grid resolution to 108x192, meaning a 10x quantization along both axes, and refer to this baseline as the grid model.

It could be questioned whether the appearance of the bounding box content in the frontal view is needed at all in estimating the target coordinates, given sufficient training data and a powerful enough model.
In order to determine the importance of the visual input in the process of estimating the bird's eye occupancy map, we also train an additional model with approximately the same number of trainable parameters as our proposed model SDPN, but fully connected from input to output coordinates. We refer to this last baseline as MLP.

        IoU ↑   CD ↓    hE ↓   wE ↓   arE ↓
homo    0.13    191.8   0.28   0.34   0.38
grid    0.18    154.3   0.74   0.70   1.30
MLP     0.32     96.5   0.25   0.25   0.29
SDPN

Fig. 5: (a) Table summarizing results of the proposed SDPN model against the baselines; (b) Degradation of IoU performance as the distance to the detected vehicle increases.

For comparison, we rely on the following metrics:
– Intersection over Union (IoU): measure of the quality of the predicted bounding box $BB_p$ with respect to the target $BB_t$:
$$IoU(BB_p, BB_t) = \frac{A(BB_p \cap BB_t)}{A(BB_p \cup BB_t)}$$
where $A(R)$ refers to the area of the rectangle $R$;
– Centroid Distance (CD): distance in pixels between box centers, as an indicator of localization quality (recall that images are 1920x1080 pixels in size);
– Height, Width Error (hE, wE): average error on bounding box height and width respectively, expressed in percentage w.r.t. the ground truth $BB_t$ size;
– Aspect ratio mean Error (arE): absolute difference in aspect ratio between $BB_p$ and $BB_t$:
$$arE = \left| \frac{BB_p.w}{BB_p.h} - \frac{BB_t.w}{BB_t.h} \right| \qquad (4)$$

The evaluation of baselines and proposed model is reported in Fig. 5 (a). Results suggest that both homography and grid are too naive to capture the complexity of the task and fail in properly warping vehicles into the bird's eye view. In particular, the grid baseline performs poorly as it only models a point-wise transformation between bounding box corners, disregarding information about the overall input bounding box size. On the contrary, MLP processes the bounding box as a whole and provides a reasonable estimation. However, it still misses the chance to properly recover the length of the bounding box in the bird's eye view, being unaware of the entity's visual appearance. Instead, SDPN is able to capture the object's semantics, which is a primary cue for correctly inferring the vehicle's location and shape in the target view.

Fig. 6: Qualitative comparison between different models (columns: homography, grid, MLP, SDPN, ground truth). Baselines often predict reasonable locations for the bounding boxes. SDPN is also able to learn the orientation and type of the vehicle (e.g. a truck is bigger than a car etc.).

A second experiment investigates how a vehicle's distance affects the warping accuracy. Fig. 5 (b) highlights that the performance of all models degrades as the distance of target vehicles increases. Indeed, closer examples exhibit lower variance (e.g. they are mostly related to the car ahead and the ones approaching from the opposite direction) and thus are easier to model. However, it can be noticed that, moving forward along the distance axis, the gap between SDPN and MLP gets wider. This suggests that the additional visual input adds robustness in these challenging situations. We refer the reader to Fig. 6 for a qualitative comparison.
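The evaluation metrics described above can be written compactly for a single predicted/target pair. The sketch below assumes axis-aligned boxes stored as `(x_min, y_min, x_max, y_max)` tuples in pixels and interprets the height and width errors as absolute differences relative to the ground truth size; both the box layout and that interpretation are assumptions, since the text does not spell them out.

```python
import math


def width(b):
    return b[2] - b[0]


def height(b):
    return b[3] - b[1]


def area(b):
    """Area of a box; degenerate (non-overlapping) boxes count as zero."""
    return max(0.0, width(b)) * max(0.0, height(b))


def iou(p, t):
    """Intersection over Union between predicted box p and target box t."""
    inter = (max(p[0], t[0]), max(p[1], t[1]), min(p[2], t[2]), min(p[3], t[3]))
    a_inter = area(inter)
    a_union = area(p) + area(t) - a_inter
    return a_inter / a_union if a_union > 0 else 0.0


def centroid_distance(p, t):
    """Euclidean distance in pixels between the two box centers."""
    cx_p, cy_p = (p[0] + p[2]) / 2.0, (p[1] + p[3]) / 2.0
    cx_t, cy_t = (t[0] + t[2]) / 2.0, (t[1] + t[3]) / 2.0
    return math.hypot(cx_p - cx_t, cy_p - cy_t)


def size_errors(p, t):
    """(hE, wE): absolute height/width error relative to the target size."""
    h_err = abs(height(p) - height(t)) / height(t)
    w_err = abs(width(p) - width(t)) / width(t)
    return h_err, w_err


def aspect_ratio_error(p, t):
    """Eq. (4): absolute difference between width/height ratios."""
    return abs(width(p) / height(p) - width(t) / height(t))


# Toy example: the prediction is twice as wide as the target.
pred, target = (0.0, 0.0, 4.0, 2.0), (0.0, 0.0, 2.0, 2.0)
scores = {
    "IoU": iou(pred, target),
    "CD": centroid_distance(pred, target),
    "hE_wE": size_errors(pred, target),
    "arE": aspect_ratio_error(pred, target),
}
```

Dataset-level figures such as those in Fig. 5 (a) would then be averages of these per-box values over the test set.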
A real-world case study
In order to judge the capability of our model to generalize to real-world data, we test it using authentic driving videos taken from a roof-mounted camera [2]. We rely on a state-of-the-art detector [15] to get the bounding boxes of vehicles in the frontal view. As the ground truth is not available for these sequences, performance is difficult to quantify precisely. Nonetheless, we show qualitative results in Fig. 7: it can be appreciated how the network is able to correctly localize other vehicles' positions, despite having been trained exclusively on synthetic data.

Fig. 7: Qualitative results on real-world examples. Predictions look reasonable even if the whole training was conducted on synthetic data.
SDPN can perform inference at approximately 100 Hz on an NVIDIA Titan X GPU, which demonstrates the suitability of our model for integration in an actual assisted or autonomous driving pipeline.

In this paper we presented two main contributions. The first is a new high-quality synthetic dataset, featuring a huge amount of dashboard camera and bird's eye frames, in which the spatial occupancy of a variety of vehicles (i.e. bounding boxes, distance, yaw) is annotated. The second is a deep learning based model to tackle the problem of mapping detections onto a different view of the scene. We argue that these maps could be useful in an assisted driving context, in order to facilitate the driver's decisions by making available in one place a concise representation of the road state. Furthermore, in an autonomous driving scenario, inferred vehicle positions could be integrated with other sensory data such as radar or lidar by means of e.g. a Kalman filter to reduce overall uncertainty.
References
http://arxiv.org/abs/1412.6980