Vanishing point detection with convolutional neural networks
Presented at SUNw: Scene Understanding Workshop, CVPR 2016
Ali Borji
Center for Research in Computer Vision, University of Central Florida
[email protected]
1. Introduction
In graphical perspective, a vanishing point (VP) is a 2D point in the image plane at which the projections of lines that are parallel in the 3D world (but not parallel to the image plane) intersect. In other words, the vanishing point is the spot toward which receding parallel lines converge. In principle, an image can contain more than one vanishing point. VPs are commonly seen in fields, railroads, streets, tunnels, forests, buildings, and objects such as ladders (viewed from the bottom up). The vanishing point is an important visual cue in several applications (e.g., camera calibration, 3D reconstruction, autonomous driving).

Inspired by the finding that the vanishing point (road tangent) guides drivers' gaze [1, 2], in our previous work we showed that the vanishing point attracts gaze during free viewing of natural scenes as well as in visual search [3]. We have also introduced improved saliency models using vanishing point detectors [4]. Here, we aim to predict vanishing points in naturalistic environments by training convolutional neural networks in an end-to-end manner.

Traditionally, geometrical and structural features such as lines and corners (e.g., using the Hough transform [5]) have been used to detect vanishing points in images. Here, we follow a data-driven learning approach by training two popular convolutional neural networks, Alexnet and VGG, for 1) predicting whether a vanishing point exists in a scene and 2) if so, localizing it on an n × n grid map.
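To make the geometric definition above concrete, here is a minimal sketch of how a vanishing point can be computed as the intersection of two image lines using homogeneous coordinates. It illustrates the classical geometry only and is not the method proposed in this paper; the function names `line_through` and `intersect` and the sample coordinates are hypothetical.

```python
def line_through(p, q):
    """Homogeneous line (a, b, c) with a*x + b*y + c = 0 through points p and q."""
    (x1, y1), (x2, y2) = p, q
    # Cross product of the homogeneous points (x1, y1, 1) and (x2, y2, 1).
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)


def intersect(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    # Cross product of the two line vectors gives the homogeneous intersection point.
    x = b1 * c2 - b2 * c1
    y = c1 * a2 - c2 * a1
    w = a1 * b2 - a2 * b1
    if abs(w) < 1e-12:
        return None  # lines stay parallel in the image: no finite vanishing point
    return (x / w, y / w)


# Two made-up road edges in a 300 x 300 frame that converge toward the horizon.
left_edge = line_through((0, 300), (75, 200))
right_edge = line_through((300, 300), (225, 200))
vp = intersect(left_edge, right_edge)
```

Classical detectors typically have to estimate many such lines from noisy edges and then vote for a common intersection, which is the step the learning-based approach in this paper avoids.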
2. Experiments & Results
Training deep neural networks often requires a large amount of data. We resorted to YouTube to download videos of road trips across America (e.g., from sedan, bus, or truck dash cams), personal adventures (e.g., using shifters or motorbikes), and game-playing sessions (e.g., Formula One, NASCAR). These videos were captured in a variety of weather and ground conditions (e.g., freeway, race track, in city, inter-city, snowy, rainy, sunny, mountainous, forest, vegetation). Eventually, we had 37,497 frames (resized to 300 × 300 pixels). We annotated vanishing points (one per frame) in all videos (one annotator; the author). The grid cell containing the vanishing point has the label 1 on an n × n grid map (n = 10, 20, or 30). Some example frames of the 29 YouTube videos are shown in Figure 1.

We also collected images without VPs to train a binary classifier for VP existence prediction. A total of 32,419 images were sampled from these datasets: the MIT saliency benchmark [6], the CAT2000 dataset [7], Caltech 256 [8], the 15-category dataset [9] (except the street and highway categories), MS COCO [10], and Imagenet [11]. After training the networks for 20 epochs over 63,916 images (34,497 with VP and 29,419 without VP), we obtained 98.9% VP existence prediction accuracy using the Alexnet network and 99.73% using the VGG network over the test set (6,000 images; 3,000 with VP).
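The grid labelling described above can be sketched as follows. This is an illustrative reconstruction, not the author's code: the helper names are hypothetical, and the row-major ordering used to turn a grid cell into a single class index is an assumption, since the text does not specify how the grid is flattened.

```python
def vp_to_cell(x, y, n, size=300):
    """Map VP pixel coordinates to (row, col) on an n x n grid over a size x size image."""
    # Clamp so that a point on the right or bottom border falls in the last cell.
    col = min(int(x * n // size), n - 1)
    row = min(int(y * n // size), n - 1)
    return row, col


def cell_to_class(row, col, n):
    """Flatten (row, col) into one of p = n * n class indices (row-major: an assumption)."""
    return row * n + col


def class_to_cell(k, n):
    """Inverse of cell_to_class."""
    return divmod(k, n)


def grid_label(x, y, n, size=300):
    """n x n map that is 1 in the cell containing the VP and 0 everywhere else."""
    row, col = vp_to_cell(x, y, n, size)
    grid = [[0] * n for _ in range(n)]
    grid[row][col] = 1
    return grid


# A made-up annotation at pixel (150, 100) on the 20 x 20 grid (15-pixel cells).
row, col = vp_to_cell(150, 100, n=20)
label = cell_to_class(row, col, n=20)
```

With n = 10, 20, or 30 this yields 100, 400, or 900 possible classes, one per grid cell.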
Alexnet and VGG networks were trained to map a scene to the VP location, which is one of p classes (p = 100, 400, or 900; linearized n × n grids). Thus, there are p neurons in the output layer. We used 33,000 frames for training and the remaining 4,497 frames for testing. Training was stopped after 40 epochs.

Results are shown in Figure 2. Using the VGG network, we achieved the lowest top-5 error rates: 5.1% over the 10 × 10 grid and 15.9% over the 20 × 20 grid (see Figure 2 for the 30 × 30 grid). This means that the probability of hitting within 15 pixels of the true VP location in 5 guesses is about 85% (over a 20 × 20 grid on a 300 × 300 image). The results are nearly the same for both networks. Figure 2 (right) shows some success and failure cases of Alexnet for VP localization.

Since the vanishing point usually occurs near the image center (see Figure 1, bottom-right), we devised two baseline predictors to further evaluate our method. The first, denoted 'Top-1 center', is the most frequent grid location ([x y] in the training data). The second, denoted 'Top-5 center', is the five most frequent locations ([x y], [x-1 y], [x y-1], [x+1 y], [x y+1]; all set to one, the rest zero). These models perform well above chance (16.5% accuracy using Top-1 center vs. 0.25% chance over a 20 × 20 grid) but well below the deep networks (deep learning Top-1 accuracy is about 57%).

Figure 1. Left: Two sample frames from each of the 29 videos downloaded from YouTube. Top-right: Sample images without a vanishing point, used to train the vanishing point existence prediction network. Bottom-right: Average vanishing point location (distribution of vanishing points). The left panel shows all visited locations and the right panel shows the VP histogram.

Figure 2. Left: Error rates of deep models for VP detection (y-axis: error rate; x-axis: grid size, 10 × 10, 20 × 20, 30 × 30; curves: Alexnet top-1 and top-5, VGG top-1 and top-5, Center top-1 and top-5). Top-right: Sample images where our model is able to accurately locate the VP in five tries. The red circle is the top-1 prediction and the blue ones are the next top-4. Bottom-right: Failure examples of our model.

We also compare our model with two vanishing point detection algorithms from the literature. The first is the method of Košecká and Zhang [12] and the second is the classic Hough transform [5]. These two algorithms score 15.6% and 35%, respectively, in detecting the vanishing point on a 20 × 20 map (Top-1 accuracy), both much lower than our results using CNNs.

To assess the generalization power of our approach in detecting vanishing points in arbitrary natural scenes, we experimented with pictures of buildings, tunnels, sketches, and fields, shown in Figure 3. Although our model (VGG) has not been explicitly trained on these images, it successfully finds VPs in some of them. It fails on some other unseen examples (e.g., sketches). Augmenting our dataset with more images of these kinds could help overcome this shortcoming. Another way to improve performance would be data augmentation (i.e., adding jittered, cropped, noisy, and blurry versions of the input images).
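The top-k scoring and the two center baselines used in this section can be sketched as below. This is a minimal illustration under stated assumptions rather than the evaluation code behind the reported numbers: the function names and the toy cell lists are hypothetical, and the sketch assumes the most frequent cell does not lie on the grid border (neighbors are not clipped).

```python
from collections import Counter


def top_k_accuracy(ranked_guesses, true_cells, k):
    """Fraction of samples whose true cell is among the first k ranked guesses."""
    hits = sum(t in guesses[:k] for guesses, t in zip(ranked_guesses, true_cells))
    return hits / len(true_cells)


def center_baselines(train_cells):
    """Most frequent training cell (Top-1 center) and it plus its 4 neighbors (Top-5 center)."""
    (x, y), _ = Counter(train_cells).most_common(1)[0]
    top5 = [(x, y), (x - 1, y), (x, y - 1), (x + 1, y), (x, y + 1)]
    return (x, y), top5


def chance_accuracy(n, k=1):
    """Accuracy of k uniformly random distinct guesses on an n x n grid."""
    return k / (n * n)


# Toy (x, y) grid cells, made up for illustration only.
train_cells = [(10, 6), (10, 6), (9, 6), (10, 7)]
test_cells = [(10, 6), (11, 6), (3, 3), (10, 5)]

top1, top5 = center_baselines(train_cells)
# The baselines ignore the image, so every test sample gets the same ranked guesses.
guesses = [top5] * len(test_cells)
acc1 = top_k_accuracy(guesses, test_cells, k=1)
acc5 = top_k_accuracy(guesses, test_cells, k=5)
```

For the 20 × 20 grid, `chance_accuracy(20)` gives 1/400 = 0.25%, matching the chance level quoted above. The 15-pixel tolerance follows from the cell width 300 / 20 = 15, and a 15.9% top-5 error corresponds to a hit rate of about 84%, which the text rounds to 85%.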
3. Discussion
We proposed a method for vanishing point detection based on convolutional neural networks that does well on road scenes but is not very effective on arbitrary images. To improve accuracy, we will consider collecting a larger image dataset with a variety of scenes including vanishing points, as well as more recent deep learning architectures. Extension of this approach to videos is another interesting future direction. Our dataset is freely available at: http://crcv.ucf.edu/people/faculty/Borji/code.php
Acknowledgments:
We wish to thank NVIDIA for their generous donation of the GPU used in this study.

Figure 3. Performance of our vanishing point detector on arbitrary images containing vanishing points. The largest red circle is the first detection. The other four detections are shown in blue.