Neural Geometric Parser for Single Image Camera Calibration
Jinwoo Lee, Minhyuk Sung, Hyunjoon Lee, and Junho Kim
Kookmin University, Adobe Research, Intel
Abstract.
We propose a neural geometric parser learning single image camera calibration for man-made scenes. Unlike previous neural approaches that rely only on semantic cues obtained from neural networks, our approach considers both semantic and geometric cues, resulting in significant accuracy improvement. The proposed framework consists of two networks. Using line segments of an image as geometric cues, the first network estimates the zenith vanishing point and generates several candidates consisting of the camera rotation and focal length. The second network evaluates each candidate based on the given image and the geometric cues, where prior knowledge of man-made scenes is used for the evaluation. With the supervision of datasets consisting of the horizon line and focal length of the images, our networks can be trained to estimate the same camera parameters. Based on the Manhattan world assumption, we can further estimate the camera rotation and focal length in a weakly supervised manner. The experimental results reveal that the performance of our neural approach is significantly higher than that of existing state-of-the-art camera calibration techniques for single images of indoor and outdoor scenes.
Keywords:
Single image camera calibration, Neural geometric parser, Horizon line, Focal length, Vanishing points, Man-made scenes
This paper deals with the problem of inferring camera calibration parameters from a single image. Single image calibration is used in various applications of computer vision and graphics, including image rotation correction [12], perspective control [21], camera rotation estimation [38], metrology [9], and 3D vision [16,23]. Due to its importance, single image camera calibration has been revisited in various ways.

Conventional approaches focus on reasoning about vanishing points (VPs) in images by assembling geometric cues. Most methods find straight line segments in the images using classic image processing techniques [15,2] and then estimate the VPs by carefully selecting parallel or orthogonal segments in the 3D scene as geometric cues [21]. In practice, however, line segments found in images contain a large amount of noisy data, and it is therefore important to carefully select an inlier set of line segments for the robust detection of VPs [13,28].
Corresponding author: Junho Kim ([email protected])
Fig. 1.
Applications of the proposed framework: (a) image rotation correction, (b) perspective control, and virtual object insertions with respect to the (c) horizon and (d) VPs; before (top) and after (bottom).

Because the accuracy of the inlier set is an important performance indicator, the elapsed time may increase exponentially if stricter criteria are applied to draw the inlier set.

Recently, several studies have proposed estimating camera intrinsic parameters using semantic cues obtained from deep neural networks. It has been shown [31,32,18] that well-known backbone networks, such as ResNet [17] and U-Net [25], can be used to estimate the focal length or horizon line of an image without significant modifications of the networks. In these approaches, however, it is difficult to explain which geometric interpretation inside the networks infers certain camera parameters. In several studies [38,33], neural networks were designed to infer geometric structures; however, they required a new convolution operator [38] or 3D supervision datasets [33].

In this paper, we propose a novel framework for single image camera calibration that combines the advantages of both conventional and neural approaches. The basic idea is for our network to leverage line segments to reason about camera parameters. We specifically focus on calibrating camera parameters from a single image of a man-made scene. By training with image datasets annotated with horizon lines and focal lengths, our network infers pitch, roll, and focal length (3DoF) and can further estimate camera rotations and focal lengths through three VPs (4DoF).

The proposed framework consists of two networks. The first network, the Zenith Scoring Network (ZSNet), takes line segments detected from the input image and deduces reliable candidates of parallel world lines along the zenith VP. Then, from the lines directed at the zenith VP, we generate candidate pairs consisting of a camera rotation and a focal length as inputs of the following step. The second network, the Frame Scoring Network (FSNet), evaluates the score of its input in conjunction with the given image and line segment information. Here, geometric cues from the line segments are used as prior knowledge about man-made scenes in our network training. This allows us to obtain significant improvement over previous neural methods that only use semantic cues [32,18]. Furthermore, it is possible to estimate the camera rotation and focal length in a weakly supervised manner based on the Manhattan world assumption, as we reason about camera parameters with pairs consisting of a camera rotation and a focal length. It should be noted that the ground truth for our supervision is readily available with Google Street View [1] or with consumer-level devices possessing a camera and an inertial measurement unit (IMU) sensor, in contrast to the method in [33], which requires 3D supervision datasets.
(Fig. 2 diagram: ZSNet scores zenith VP candidates computed from vertical line equations using PointNet-style point and line parsers with instance and global features; camera rotation and focal length candidates (R_i, f_i) are then sampled and scored by FSNet from the 224 x 224 image together with line and activation maps.)
Fig. 2.
Overview of the proposed neural geometric parser.
Projective geometry [16,23] historically stemmed from the study of the perspective distortions perceived by the human eye when observing man-made architectural scenes [3]. In this regard, conventional methods for single image camera calibration [8,20,26,11,29,30,35,21] extract line segments from an image, infer combinations of world-parallel or world-orthogonal lines, identify two or more VPs, and finally estimate the rotation and focal length of the camera. LSD [15] and EDLines [2] are commonly used as effective line segment detectors, and RANSAC [13] or J-Linkage [28] is adopted to identify VPs that describe as many of the extracted line segments as possible. Lee et al. [21] proposed robust estimation of camera parameters and automatic adjustment of camera poses to achieve perspective control. Zhai et al. [37] analyzed the global image context with a neural network to estimate the probability field in which the horizon line is formed; VPs were then inferred with geometric optimization, in which horizontal VPs were placed on the estimated horizon line. Simon et al. [27] achieved better performance than Zhai et al. [37] by inferring the zenith VP with a geometric algorithm and carefully selecting a line segment orthogonal to the zenith VP to identify the horizon line. Li et al. [22] proposed a quasi-optimal algorithm to infer VPs from annotated line segments.

Recently, neural approaches have been actively studied to infer camera parameters from a single image using semantic cues learned by convolutional neural networks.
Workman et al. proposed DeepFocal [31] for estimating focal lengths and DeepHorizon [32] for estimating horizon lines using semantic analyses of images with neural networks. Hold-Geoffroy et al. [18] trained a neural classifier that jointly estimates focal lengths and horizons, and demonstrated that their joint estimation leads to more accurate results than those produced by independent estimations [31,32]. Although they visualized how the convolution filters react near edges (through the method proposed by Zeiler and Fergus [36]), it is difficult to intuitively understand how the horizon line is geometrically determined by the network. Therefore, [31,32,18] share a common limitation in that it is non-trivial to estimate VPs from the network inference results. Zhou et al. [38] proposed NeurVPS, which infers VPs with conic convolutions for a given image. However, NeurVPS [38] assumes normalized focal lengths and does not estimate focal lengths.

Inspired by UprightNet [33], which takes geometric cues into account, we propose a neural network that learns camera parameters by leveraging line segments. Our method can be compared to Lee et al. [21] and Zhai et al. [37], where line segments are used to infer the camera rotation and focal length; in our method, however, the entire process is designed with neural networks. Similar to Workman et al. [32] and Hold-Geoffroy et al. [18], we utilize semantic cues from neural networks, but our network training differs in that line segments are utilized as prior knowledge about man-made scenes. Unlike UprightNet [33], which requires the supervision of depth and normal maps for learning roll/pitch (2DoF), the proposed method learns the horizon and focal length (3DoF) with supervised learning and the camera rotation and focal length (4DoF) with weakly supervised learning.

The relationship between our proposed method and the latest neural RANSACs [6,7,19] is as follows. Our ZSNet is related to neural-guided RANSAC [7] in that it updates the line features with backpropagation when learning to sample zenith VP candidates. In addition, our FSNet is related to DSAC [6] in that it evaluates each input pair consisting of a camera rotation and focal length based on a hypothesis about man-made scenes. Our work differs from CONSAC [19], which requires the supervision of all VPs, as we focus on learning single image camera calibration from the supervision of horizons and focal lengths.
From a given input image, our network estimates up to four camera intrinsic and extrinsic parameters: the focal length f and three camera rotation angles ψ, θ, φ. A 3D point (P_x, P_y, P_z)^T in world coordinates is then projected onto the image plane as follows:

$$\begin{pmatrix} p_x \\ p_y \\ p_w \end{pmatrix} = (\mathbf{K}\mathbf{R}) \begin{pmatrix} P_x \\ P_y \\ P_z \end{pmatrix}, \quad \text{where} \quad \mathbf{K} = \begin{pmatrix} f & 0 & c_u \\ 0 & f & c_v \\ 0 & 0 & 1 \end{pmatrix} \ \text{and} \ \mathbf{R} = \mathbf{R}_\psi \mathbf{R}_\theta \mathbf{R}_\phi, \qquad (1)$$

where (p_x, p_y, p_w)^T represents the mapped point in the image space, and R_ψ, R_θ, R_φ represent the rotation matrices about the x-, y-, and z-axes with rotation angles ψ, θ, φ, respectively. The principal point is assumed to be at the image center, such that c_u = W/2 and c_v = H/2, where W and H represent the width and height of the image, respectively.
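A minimal NumPy sketch of Eq. (1) may help make the parameterization concrete: it builds K from the focal length and image size, composes R from the three rotation angles, and projects a world point. The function and variable names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def rotation_matrix(psi, theta, phi):
    """R = R_psi R_theta R_phi: rotations about the x-, y-, and z-axes (radians)."""
    cx, sx = np.cos(psi), np.sin(psi)
    cy, sy = np.cos(theta), np.sin(theta)
    cz, sz = np.cos(phi), np.sin(phi)
    R_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    R_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    R_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return R_x @ R_y @ R_z

def project(P, f, W, H, psi, theta, phi):
    """Eq. (1): map a 3D point P to homogeneous image coordinates (p_x, p_y, p_w)."""
    c_u, c_v = W / 2.0, H / 2.0          # principal point at the image center
    K = np.array([[f, 0, c_u],
                  [0, f, c_v],
                  [0, 0, 1.0]])
    R = rotation_matrix(psi, theta, phi)
    return K @ R @ np.asarray(P, dtype=float)

p = project([1.0, 2.0, 5.0], f=500.0, W=224, H=224, psi=0.1, theta=0.0, phi=0.05)
u, v = p[0] / p[2], p[1] / p[2]          # dehomogenize to pixel coordinates
```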
Under the Manhattan world assumption, calibration can be done once we obtain the Manhattan directions, which are the three VPs corresponding to the x-, y-, and z-directions in 3D [8]. In Sec. 3.1, we describe how to extract a set of candidate VPs along the zenith direction. Then, in Sec. 3.2, we present our weakly supervised method for estimating all three directions and calibrating the camera parameters.

We use LSD [15] as the line segment detector in our framework. A line segment is represented by a pair of points in the image space. Before estimating the focal length in Sec. 3.2, we assume that each image is transformed into a pseudo camera space as p' = K_p^{-1} (p_x, p_y, p_w)^T, where K_p represents a pseudo camera intrinsic matrix of K, built by assuming f to be min(W, H)/2.

We first explain our ZSNet, which is used to estimate the zenith VP (see Fig. 2, top-left). Instead of searching for a single zenith VP, we extract a set of candidates that are sufficiently close to the ground truth.

Similar to PointNet [24], ZSNet takes sets of unordered vectors in 2D homogeneous coordinates (line equations and VPs) as inputs. Given a line segment, its line equation l can be computed as the cross product of its two endpoints:

$$\mathbf{l} = [\mathbf{p}_1]_\times \, \mathbf{p}_2, \qquad (2)$$

where [·]_× represents the skew-symmetric matrix of a vector. A candidate VP v can then be computed as the intersection point of two lines:

$$\mathbf{v} = [\mathbf{l}_1]_\times \, \mathbf{l}_2. \qquad (3)$$

Motivated by [27], we sample a set of line equations roughly directed to the zenith, L_z = {l_1, ..., l_{|L_z|}}, from the line segments using the following condition:

$$\left| \tan^{-1}\!\left( -\frac{a}{b} \right) \right| > \delta_z, \qquad (4)$$

where l = (a, b, c)^T represents a line equation as in Eq. (2), and the angle threshold δ_z is set to 67.5° as recommended in [27]. Then, we randomly select pairs of line segments from L_z and compute their intersection points as in Eq. (3) to extract a set of zenith VP candidates Z = {z_1, ..., z_{|Z|}}. Finally, we feed L_z and Z to ZSNet. We set both numbers of samples, |L_z| and |Z|, to 256 in the experiments.
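The following NumPy sketch illustrates Eqs. (2)-(4): a line from its endpoints, a candidate VP from a pair of lines, and the zenith-direction filter. The helper names and the exact reading of Eq. (4) (the absolute inclination of the line exceeding δ_z) are assumptions for illustration.

```python
import numpy as np

def skew(p):
    """[p]_x: the skew-symmetric matrix such that [p]_x q = p x q."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def line_from_endpoints(p1, p2):
    """Eq. (2): l = [p1]_x p2 for homogeneous endpoints p1, p2."""
    return skew(p1) @ p2

def vp_from_lines(l1, l2):
    """Eq. (3): v = [l1]_x l2, the intersection point of two lines."""
    return skew(l1) @ l2

def roughly_vertical(l, delta_z_deg=67.5):
    """Eq. (4): keep lines whose inclination |atan(-a/b)| exceeds delta_z."""
    a, b = l[0], l[1]
    # arctan2(|a|, |b|) equals |atan(-a/b)| in [0, 90] degrees and handles b = 0
    angle = np.degrees(np.arctan2(abs(a), abs(b)))
    return angle > delta_z_deg

# Example: a nearly vertical segment in a pseudo camera space
p1, p2 = np.array([100.0, 10.0, 1.0]), np.array([103.0, 200.0, 1.0])
l = line_from_endpoints(p1, p2)
print(roughly_vertical(l))   # True for this near-vertical segment
```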
The goal of our ZSNet is to score each zenith candidate in Z: 1 if a candidate is sufficiently close to the ground truth zenith, and 0 otherwise. Fig. 2 (top-left) shows the architecture of our ZSNet.

In the original PointNet [24], each point is processed independently, except for the transformer blocks, to generate point-wise features. A global max-pooling layer is then applied to aggregate all the features and generate a global feature. The global feature is concatenated to each point-wise feature, followed by several neural network blocks that classify/score each point.

In our network, we also feed the set of zenith candidates Z to the network, except that we do not compute the global feature from Z. Instead, we use another network, fed with the set of line equations L_z, to extract a global feature of L_z that is then concatenated with each point-wise feature of Z (Fig. 2, top-left). Let h_z(z_i) be the point-wise feature of the point z_i in Z, where h_z(·) represents a PointNet feature extractor. Similarly, let h_l(L_z) = {h_l(l_1), ..., h_l(l_{|L_z|})} be the set of features of L_z. A global feature g_l of h_l(L_z) is computed via a global max-pooling operation (gpool) and is concatenated to h_z(z_i) as follows:

$$\mathbf{g}_l = \mathrm{gpool}\big(h_l(L_z)\big), \qquad (5)$$
$$h'_z(\mathbf{z}_i) = \mathbf{g}_l \otimes h_z(\mathbf{z}_i), \qquad (6)$$

where ⊗ represents the concatenation operation.
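A toy PyTorch sketch of Eqs. (5)-(6) is given below: a PointNet-style shared MLP embeds the vertical line equations, a global max-pool gives g_l, and g_l is concatenated to every zenith-candidate feature. The layer sizes and module names are illustrative and not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ZenithFeatureMixer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.h_l = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))  # line features
        self.h_z = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))  # candidate features

    def forward(self, lines, candidates):
        # lines: (B, |L_z|, 3) homogeneous line equations; candidates: (B, |Z|, 3) zenith VPs
        g_l = self.h_l(lines).max(dim=1).values            # Eq. (5): global max pooling -> (B, dim)
        h_z = self.h_z(candidates)                          # (B, |Z|, dim)
        g_l = g_l.unsqueeze(1).expand(-1, h_z.size(1), -1)  # broadcast g_l to every candidate
        return torch.cat([g_l, h_z], dim=-1)                # Eq. (6): h'_z(z_i) = g_l concat h_z(z_i)

mixer = ZenithFeatureMixer()
feats = mixer(torch.randn(2, 256, 3), torch.randn(2, 256, 3))   # -> (2, 256, 128)
```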
Finally, the concatenated features are fed into a scoring network that computes scores in [0, 1]:

$$p_{z_i} = \mathrm{sigmoid}\big(s_z(h'_z(\mathbf{z}_i))\big), \qquad (7)$$

where p_{z_i} represents the computed score of each zenith candidate. The network s_z(·) in Eq. (7) consists of multiple MLP layers, similar to the latter part of the PointNet [24] segmentation architecture.

To train the network, we assign a ground truth label y_i to each zenith candidate z_i as follows:

$$y_i = \begin{cases} 1 & \text{if } \mathrm{cossim}(\mathbf{z}_i, \mathbf{z}_{gt}) > \cos(\delta_p) \\ 0 & \text{if } \mathrm{cossim}(\mathbf{z}_i, \mathbf{z}_{gt}) < \cos(\delta_n) \end{cases}, \qquad (8)$$

where cossim(x, y) = |x · y| / (||x|| ||y||) and z_gt represents the ground truth zenith. The two angle thresholds δ_p and δ_n are empirically selected as 2° and 5°, respectively, from our experiments. Zenith candidates whose y_i is undefined are not used in the training. The cross entropy loss is used to train the network:

$$\mathcal{L}_{cls} = \frac{1}{N} \sum_i^N -y_i \log(p_{z_i}). \qquad (9)$$

To better train our ZSNet, we use another loss in addition to the cross entropy. Specifically, we constrain the weighted average of the zenith candidates to be close to the ground truth, where the estimated scores p_{z_i} are used as weights. To average the zenith candidates, which represent vertical directions of the scene, we use structure tensors of l2-normalized 2D homogeneous points. Given a 2D homogeneous point v = (v_x, v_y, v_w)^T, the structure tensor of the normalized point is computed as follows:

$$\mathrm{ST}(\mathbf{v}) = \frac{1}{v_x^2 + v_y^2 + v_w^2} \begin{pmatrix} v_x^2 & v_x v_y & v_x v_w \\ v_x v_y & v_y^2 & v_y v_w \\ v_x v_w & v_y v_w & v_w^2 \end{pmatrix}. \qquad (10)$$
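A small NumPy sketch of the structure tensor in Eq. (10), together with the weighted averaging used later (Eq. (11)) and the recovery of a representative direction via the dominant eigenvector, is shown below. The helper names are illustrative only.

```python
import numpy as np

def structure_tensor(v):
    """Eq. (10): ST(v) = (v v^T) / ||v||^2 for a homogeneous 2D point v."""
    v = np.asarray(v, dtype=float)
    return np.outer(v, v) / np.dot(v, v)

def average_direction(points, weights):
    """Score-weighted average of candidates in structure-tensor form; the
    representative direction is the eigenvector of the largest eigenvalue."""
    st = sum(w * structure_tensor(p) for p, w in zip(points, weights)) / np.sum(weights)
    eigvals, eigvecs = np.linalg.eigh(st)      # symmetric matrix -> eigh
    return eigvecs[:, np.argmax(eigvals)]

z = average_direction([(0.01, 1.0, 0.002), (-0.02, 1.0, 0.001)], weights=[0.9, 0.8])
```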
Fig. 3.
Sampling horizontal line segments into two groups: (a) Based on a zenith VP representative (green), we want to classify the line segments (blue) of a given image. (b) Line segments are classified as follows: vanishing lines of the zenith VP (green) and the remaining lines (red). (c) Junction points (cyan) are computed as the intersections of spatially adjacent line segments that are classified differently; line segments whose endpoints are close to junction points are selected. (d) Using a pseudo-horizon (dotted line), we divide the horizontal line segments into two groups (magenta and cyan).
The following loss is used in our network:

$$\mathcal{L}_{loc} = \big\| \mathrm{ST}(\mathbf{z}_{gt}) - \overline{\mathrm{ST}}(\mathbf{z}) \big\|_F, \quad \overline{\mathrm{ST}}(\mathbf{z}) = \frac{\sum_i p_{z_i} \mathrm{ST}(\mathbf{z}_i)}{\sum_i p_{z_i}}, \qquad (11)$$

where ||·||_F represents the Frobenius norm. Finally, we select the zenith candidates whose scores p_{z_i} are larger than δ_c as the set of zenith candidates:

$$Z_c = \{ \mathbf{z}_i \mid p_{z_i} > \delta_c \}, \qquad (12)$$

where δ_c = 0.5. Z_c is then used in our FSNet.

After we extract a set of zenith candidates, we estimate the remaining two horizontal VPs, taking into account the given set of zenith VP candidates. We first generate a set of hypotheses on all three VPs. Each hypothesis is then scored by our FSNet.

To sample horizontal VPs, we first filter the input line segments. However, we cannot simply filter line segments by their directions in this case, as there may be multiple horizontal VPs, and lines in any direction may vanish on the horizon. As a workaround, we use a heuristic based on the characteristics of most urban scenes. Many man-made structures contain a large number of rectangles (e.g., facades or windows of a building) that are useful for calibration parameter estimation, and the line segments enclosing these rectangles create junction points. Therefore, we sample horizontal-direction line segments by only using their endpoints when they are close to the endpoints of the estimated vertical vanishing lines.

Fig. 3 illustrates the process of sampling horizontal line segments into two groups; a code sketch of the grouping step follows below. Let z_est = (z_x, z_y, z_w) be a representative of the estimated zenith VPs, which is computed as the eigenvector with the largest eigenvalue of the averaged structure tensor in Eq. (11). We first draw a pseudo-horizon using z_est and then compute the intersection points between each sampled line segment and the pseudo-horizon. Finally, using the line connecting z_est and the image center as a pivot, we divide the horizontal line segments into two groups: one that intersects the pseudo-horizon on the left side of the pivot and the other that intersects the pseudo-horizon on the right side of the pivot. The set of horizontal VP candidates is composed of the intersection points of randomly sampled pairs of horizontal-direction line segments within each group. We sample an equal number of candidates for both groups.

Once the set of horizontal VP candidates is sampled, we sample candidates of the Manhattan directions. To sample each candidate, we draw two VPs: one from the zenith candidates and the other from either set of horizontal VP candidates. The calibration parameters for the candidate can then be estimated by solving Eq. (1) with the two VPs, assuming that the principal point is at the image center [20].
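The grouping step can be sketched as follows in NumPy. Each segment's line is intersected with a pseudo-horizon, and segments are split by the side of the pivot line (through z_est and the image center) on which the intersection falls. The pseudo-horizon construction used here (a line through the image center orthogonal to the zenith direction, with the principal point at the origin of the pseudo camera space) is an assumption for illustration; the left/right labels are arbitrary.

```python
import numpy as np

def group_horizontal_segments(lines, z_est, image_center=(0.0, 0.0)):
    """Split line equations into two groups by their pseudo-horizon intersections."""
    z = np.asarray(z_est, dtype=float)
    c = np.array([image_center[0], image_center[1], 1.0])
    pivot = np.cross(z, c)                     # line through z_est and the image center
    # assumed pseudo-horizon: through the center, orthogonal to the zenith direction
    horizon = np.array([z[0], z[1], -(z[0] * c[0] + z[1] * c[1])])
    left, right = [], []
    for l in lines:
        x = np.cross(np.asarray(l, float), horizon)   # intersection with the pseudo-horizon
        if abs(x[2]) < 1e-9:
            continue                                   # parallel to the pseudo-horizon; skip
        x = x / x[2]                                   # fix the homogeneous scale
        side = np.dot(pivot, x)                        # signed side of the pivot line
        (left if side < 0 else right).append(l)
    return left, right
```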
We design our FSNet to infer camera calibration parameters by utilizing all the available data, including VPs, lines, and the original raw image (Fig. 2, top-right). ResNet [17] is adapted in our FSNet to handle raw images, with all the other data appended as additional color channels. To append the information of the detected line segments, we rasterize them as a binary line segment map whose width and height are the same as those of the input image:

$$L(u, v) = \begin{cases} 1 & \text{if a line } \mathbf{l} \text{ passes through } (u, v) \\ 0 & \text{otherwise} \end{cases}, \qquad (13)$$

where (u, v) represents a pixel location of the line segment map. We also append the information of vanishing line segments (i.e., line segments whose extensions pass close to a VP) as a weighted line segment map for each of the three VPs of a candidate, where the weights are computed using the closeness between the line segments and the VPs. For a given VP v and the line equation l of a line segment, we compute the closeness between v and l using the conventional line-point distance:

$$\mathrm{closeness}(\mathbf{l}, \mathbf{v}) = 1 - \frac{|\mathbf{l} \cdot \mathbf{v}|}{\|\mathbf{l}\| \, \|\mathbf{v}\|}. \qquad (14)$$

Three activation maps are drawn for each candidate (x-, y-, and z-directions):

$$A_{\{x|y|z\}}(u, v) = \begin{cases} \mathrm{closeness}(\mathbf{l}, \mathbf{v}_{\{x|y|z\}}) & \text{if a line } \mathbf{l} \text{ passes through } (u, v) \\ 0 & \text{otherwise} \end{cases}. \qquad (15)$$

All the maps are appended to the original image as illustrated in Fig. 2.
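A compact sketch of the input maps in Eqs. (13)-(15) is given below: a binary line-segment map L and per-VP activation maps weighted by the closeness of Eq. (14). OpenCV is used only for rasterization; map sizes, names, and the data layout are illustrative assumptions.

```python
import numpy as np
import cv2

def closeness(l, v):
    """Eq. (14): 1 - |l . v| / (||l|| ||v||)."""
    l, v = np.asarray(l, float), np.asarray(v, float)
    return 1.0 - abs(np.dot(l, v)) / (np.linalg.norm(l) * np.linalg.norm(v))

def build_maps(segments, lines, vps, size=224):
    """segments: (N, 4) pixel endpoints; lines: (N, 3) line equations; vps: dict of 3 VPs."""
    L = np.zeros((size, size), np.float32)                      # Eq. (13)
    A = {k: np.zeros((size, size), np.float32) for k in vps}    # Eq. (15)
    for (x1, y1, x2, y2), l in zip(segments, lines):
        mask = np.zeros((size, size), np.uint8)
        cv2.line(mask, (int(x1), int(y1)), (int(x2), int(y2)), 1, 1)
        L[mask > 0] = 1.0                                       # binary line map
        for k, v in vps.items():
            A[k][mask > 0] = closeness(l, v)                    # closeness-weighted map per VP
    return L, A
```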
Finally, we append the Manhattan directions and the estimated focal length to each pixel of the concatenated map, so that the input to the scoring network has a size of (height, width, channels) = (224, 224, 17) (Fig. 2, top-right).

To train FSNet, we assign a GT score to each candidate by measuring the similarities between the horizon and zenith of each candidate and those of the GT. For the zenith, we measure the cosine similarity between the GT zenith and that of the candidate:

$$s^v_i = \mathrm{cossim}(\mathbf{z}_{gt}, \mathbf{z}_i), \qquad (16)$$

where z_gt and z_i represent the GT and candidate zenith, respectively. For the horizon, we adapt the distance metric proposed in [5]. We compute the intersection points between the GT and candidate horizons and the left/right image boundaries. Let h^l_i and h^r_i be the intersection points of the predicted horizon with the left/right borders of the image. Similarly, we compute g^l and g^r using the ground truth horizon. Inspired by [5], the similarity between the GT and a candidate is computed as:

$$s^h_i = \exp\!\Big( -\max\big( \|\mathbf{h}^l_i - \mathbf{g}^l\|, \|\mathbf{h}^r_i - \mathbf{g}^r\| \big) \Big). \qquad (17)$$

Our scoring network h_score is then trained with the cross entropy loss, defined as:

$$\mathcal{L}_{score} = \sum_i -h_{score}(\mathbf{R}_i) \log(c_i), \qquad (18)$$
$$c_i = \begin{cases} 1 & \text{if } s^{vh}_i \ge \delta_s \\ 0 & \text{if } s^{vh}_i < \delta_s \end{cases}, \qquad (19)$$
$$s^{vh}_i = \exp\!\left( -\frac{(s^h_i + s^v_i - 2.0)^2}{\sigma} \right), \qquad (20)$$

where σ and δ_s are empirically chosen constants.
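The two candidate similarities of Eqs. (16)-(17) can be sketched as follows. The horizon is represented here as a homogeneous line, and normalizing the border distances by the image height is an assumption made for this illustration.

```python
import numpy as np

def cossim(x, y):
    """Eq. (16): |x . y| / (||x|| ||y||)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return abs(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y))

def horizon_similarity(h_pred, h_gt, W, H):
    """Eq. (17): compare both horizons at the left/right image borders."""
    def y_at(line, x):
        a, b, c = line
        return -(a * x + c) / b            # y where the line crosses column x (b != 0 assumed)
    d_l = abs(y_at(h_pred, 0) - y_at(h_gt, 0)) / H   # normalization by H is an assumption
    d_r = abs(y_at(h_pred, W) - y_at(h_gt, W)) / H
    return np.exp(-max(d_l, d_r))

s_v = cossim((0.0, 0.99, 0.05), (0.01, 1.0, 0.04))
s_h = horizon_similarity((0.0, 1.0, -115.0), (0.0, 1.0, -112.0), W=224, H=224)
```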
Robust score estimation using the Manhattan world assumption. Although our FSNet is able to estimate camera calibration parameters accurately in general, it can sometimes be noisy and unstable. In our experiments, we found that incorporating the Manhattan world assumption increases the robustness of our network. Given a line segment map and three closeness maps (Eqs. (14) and (15)), we compute the extent to which a candidate follows the Manhattan world assumption using Eq. (21):

$$m_i = \frac{\sum_u \sum_v \max\big( A_x(u, v), A_y(u, v), A_z(u, v) \big)}{\sum_u \sum_v L(u, v)}, \qquad (21)$$

and the final score of a candidate is computed as:

$$s_i = s^{vh}_i \cdot m_i. \qquad (22)$$

Once all the candidates are scored, we estimate the final focal length and zenith by averaging those of the top-k high-score candidates:

$$f_{est} = \frac{\sum_i s_i f_i}{\sum_i s_i} \quad \text{and} \quad \mathrm{ST}_{est} = \frac{\sum_i s_i \mathrm{ST}(\mathbf{z}_i)}{\sum_i s_i}, \qquad (23)$$

where ST(z_i) represents the structure tensor of a zenith candidate (Eq. (10)). We set k = 8 in the experiments. z_est can be estimated from ST_est, and the two camera rotation angles ψ and φ in Eq. (1) can be computed from f_est and z_est. For the rotation angle θ, we simply take the value from the highest-score candidate, as there may be multiple pairs of horizontal VPs that are not close to each other, particularly when the scene does not follow the Manhattan world assumption but the Atlanta world assumption. Note that such scenes still share similar zeniths and focal lengths [5,21].
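A NumPy sketch of the Manhattan score and the top-k aggregation of Eqs. (21)-(23) is shown below. The per-candidate network scores and map tensors are placeholders; only the combination logic is illustrated.

```python
import numpy as np

def manhattan_score(L, A_x, A_y, A_z):
    """Eq. (21): fraction of the line map explained by the best-matching activation map."""
    return np.maximum.reduce([A_x, A_y, A_z]).sum() / max(L.sum(), 1e-8)

def aggregate_top_k(scores, focals, zenith_tensors, k=8):
    """Eq. (23): score-weighted averages of focal length and zenith structure tensor."""
    idx = np.argsort(scores)[-k:]                 # top-k candidates by final score
    s = np.asarray(scores, float)[idx]
    f_est = np.sum(s * np.asarray(focals, float)[idx]) / np.sum(s)
    st_est = np.tensordot(s, np.asarray(zenith_tensors, float)[idx], axes=1) / np.sum(s)
    return f_est, st_est

# Eq. (22): each candidate's score is modulated by its Manhattan score m_i
# (net_scores and m_scores are placeholder per-candidate values).
net_scores, m_scores = np.random.rand(32), np.random.rand(32)
scores = net_scores * m_scores
f_est, st_est = aggregate_top_k(scores, focals=np.random.uniform(300, 800, 32),
                                zenith_tensors=np.random.rand(32, 3, 3), k=8)
```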
In training, we train ZSNet first and then train FSNet with the outputs of ZSNet. Both networks are trained with the Adam optimizer with initial learning rates of 0.001 and 0.0004 for ZSNet and FSNet, respectively. Both learning rates are halved every 5 epochs. The mini-batch sizes of ZSNet and FSNet are set to 16 and 2, respectively. The input images are always downsampled to 224 x 224.

We provide experimental results with the Google Street View [1] and HLW [32] datasets. Refer to the supplementary material for experimental results with other datasets and more qualitative results.

Google Street View [1] dataset. It provides panoramic images of outdoor city scenes for which the Manhattan world assumption is satisfied. To generate training and test data, we first divide the scenes for each set and then rectify and crop randomly selected panoramic images by sampling the FoV, pitch, and roll within predefined ranges. 13,214 and 1,333 images are generated for the training and test sets, respectively.

HLW [32] dataset. It includes images annotated only with the horizon line and no other camera intrinsic parameters. Hence, we use this dataset only for verifying the generalization capability of the methods at test time.

Evaluation Metrics. We measure the accuracy of the output camera up vector, focal length, and horizon line with several evaluation metrics. For the camera up vector, we measure the angle, pitch, and roll differences with respect to the GT. For the focal length, we first convert the output focal length to FoV and measure the angle difference with respect to the GT. Lastly, for the horizon line, analogous to our similarity definition in Eq. (17), we measure the distances between the predicted and GT lines at the left/right boundaries of the input image (normalized by the image height) and take the maximum of the two distances. We also report the area under the curve (AUC) of the cumulative distribution, with the distance on the x-axis and the percentage of images on the y-axis, as introduced in [5]. The range of the x-axis is [0, 0.25].
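The horizon-error metric and its AUC, as described above, can be sketched as follows. The exact binning and integration scheme used for the cumulative distribution is an assumption; only the overall shape of the metric is illustrated.

```python
import numpy as np

def horizon_error(h_pred, h_gt, W, H):
    """Max of the left/right border distances between predicted and GT horizons, / H."""
    def y_at(line, x):
        a, b, c = line
        return -(a * x + c) / b
    return max(abs(y_at(h_pred, 0) - y_at(h_gt, 0)),
               abs(y_at(h_pred, W) - y_at(h_gt, W))) / H

def horizon_auc(errors, max_err=0.25, bins=100):
    """AUC of the cumulative error distribution over [0, max_err], as a percentage."""
    errors = np.asarray(errors, float)
    xs = np.linspace(0.0, max_err, bins)
    cdf = [(errors <= x).mean() for x in xs]
    return 100.0 * np.trapz(cdf, xs) / max_err

auc = horizon_auc([0.01, 0.03, 0.12, 0.30])   # example per-image error values
```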
We compare our method with six baseline methods; Table 1 presents their supervision requirements and outputs. Upright [21] and A-Contrario Detection [27] are non-neural methods based on line detection and RANSAC. We use the authors' implementations in our experiments. For Upright [21], the evaluation metrics are applied after optimizing the Manhattan direction per image, assuming the principal point is the image center. A-Contrario Detection [27] often fails to estimate the focal length when horizontal VP candidates are insufficient in the input; thus, we exclude [27] from the FoV evaluation. The AUC of [27] is measured, regardless of failures in focal length estimation, with the horizon lines estimated by the method proposed in [27]; the other metrics of [27] are evaluated only for the cases in which focal lengths are successfully estimated. DeepHorizon [32] and Perceptual Measure [18] are neural-network-based methods that directly take the global feature of an image and perform classification in discretized camera parameter spaces. For fair comparison, we use ResNet [17] as the backbone architecture in our implementations of these methods. Note that DeepHorizon [32] does not predict the focal length, and thus we use the ground-truth focal length when estimating its camera up vector. Additionally, we train Perceptual Measure [18] by feeding it both the input image and the line map used in our FSNet (Sec. 3.2), and we assess whether the extra input improves its performance. UprightNet [33] is another deep learning method that requires additional supervision in training, such as camera extrinsic parameters and per-pixel normals in 3D space. Due to the lack of such supervision in our datasets, we use the authors' model pretrained on ScanNet [10], an indoor RGB-D dataset.

The quantitative results on the Google Street View dataset [1] are presented in Table 2. The results demonstrate that our method outperforms all the baseline methods in most evaluation metrics. Upright [21] provides a slightly lower median roll than ours, although its mean roll is much greater than its median, meaning that it completely fails in some test cases. In addition, Perceptual Measure [18] gives a slightly smaller mean FoV error; however, its median FoV error is higher than ours. When Perceptual Measure [18] is trained with the additional line map input, the results indicate that the extra input does not lead to a meaningful difference in performance. As mentioned earlier, a pretrained model is used for UprightNet [33] (trained on ScanNet [10]) due to the lack of required supervision in Google Street View; thus, its results are much poorer than the others.

Fig. 5 visualizes several examples of horizon line predictions as well as our weakly supervised Manhattan directions. Recall that we do not use full supervision of the Manhattan directions in training; we only use the supervision of horizon lines and focal lengths. In each example in Fig. 5, we illustrate the Manhattan directions of the highest-score candidate.
Table 1.
Supervision and output characteristics of baseline methods and ours. The first two are unsupervised methods not leveraging neural networks, and the others are deep learning methods. Ours is the only network-based method predicting all four outputs.
| Method | Supervision (Horizon Line / Focal Length / Camera Rotation / Per-Pixel Normal) | Output (Horizon Line / Focal Length / Camera Rotation / Up Vector) |
| --- | --- | --- |
| Upright [21] | N/A | ✓ / ✓ / ✓ / ✓ |
| A-Contrario [27] | N/A | ✓ / ✓ / ✓ / ✓ |
| DeepHorizon [32] | ✓ / - / - / - | ✓ / - / - / - |
| Perceptual [18] | ✓ / ✓ / - / - | ✓ / ✓ / - / ✓ |
| UprightNet [33] | - / - / ✓ / ✓ | - / - / - / ✓ |
| Ours | ✓ / ✓ / - / - | ✓ / ✓ / ✓ / ✓ |
Table 2.
Quantitative evaluations with Google Street View dataset [1]. See
Evaluation Metrics in Sec. 4 for details. Bold is the best result, and underscore is the second-best result. Note that, for DeepHorizon [32]*, we use the GT FoV to calculate the camera up vector (angle, pitch, and roll errors) from the predicted horizon line. Also, for UprightNet [33]**, we use a pretrained model on ScanNet [10] due to the lack of required supervision in the Google Street View dataset.
| Method | Angle (°) ↓ Mean / Med. | Pitch (°) ↓ Mean / Med. | Roll (°) ↓ Mean / Med. | FoV (°) ↓ Mean / Med. | AUC (%) ↑ |
| --- | --- | --- | --- | --- | --- |
| Upright [21] | 3.05 / 1.92 | 2.90 / 1.80 | 6.19 / - | - | 77.43 |
| A-Contrario [27] | - | - | - | - | 74.25 |
| DeepHorizon [32]* | - | - | - | - | 80.29 |
| Perceptual [18] | - | - | - | - | 80.40 |
| Perceptual [18] + L | - | - | - | - | 80.40 |
| UprightNet [33]** | - | - | - | - | - |
| Ours | 2.12 / 1.61 | 1.92 / 1.38 | 0.75 / - | - | 83.12 |
(a) Google Street View AUCs: Upright 77.43%, A-Contrario 74.25%, DeepHorizon 80.29%, Perceptual 80.40%, Perceptual + L 80.40%, Ours 83.12%.
(b) HLW AUCs: DeepHorizon 45.63%, Perceptual 38.29%, Perceptual + L 42.38%, Ours 48.90%.
Fig. 4.
Comparison of the cumulative distributions of the horizon line error and their AUCs tested on (a) Google Street View and (b) HLW. Note that, in (b), the neural approaches are trained with the Google Street View training dataset to demonstrate their generalization capability. The AUCs in (a) are also reported in Table 2.

To evaluate the generalization capability of the neural-network-based methods, we also take the network models trained on the Google Street View training dataset and test them on the HLW dataset [32]. Because the HLW dataset only has GT horizon lines, Fig. 4(b) only reports the cumulative distributions of the horizon prediction errors. As shown in the figure, our method provides the largest AUC with a significant margin compared with the other baselines. Interestingly, Perceptual Measure [18] shows improvement when trained with the additional line map, meaning that the geometric interpretation helps more when parsing images unseen during network training.

We also conduct an experiment comparing the outputs of ZSNet with the outputs of Upright [21] and A-Contrario [27]. Using the weighted average of the zenith candidates (Eq. (11)), we measure the angular difference from the GT zenith (cf. cossim(z_i, z_gt) in Eq. (8)). Table 3 shows that our ZSNet computes the zenith VP more accurately than the other non-neural methods.
Fig. 5.
Examples of horizon line prediction on the Google Street View test set (top two rows) and on the HLW test set (bottom two rows). Each example also shows the Manhattan directions of the highest-score candidate.
Table 3.
Evaluation of ZSNet.
| Method | Angle (°) ↓ Mean | Med. |
| --- | --- | --- |
| ZSNet (Ours) | - | - |
| Upright [21] | 3.15 | 2.11 |
| A-Contrario [27] | 3.06 | 2.38 |
Fig. 6.
Visualizations of FSNet focus: (left) input; (right) feature highlight.
Fig. 6 visualizes the weights of the second-to-last convolution layer in FSNet (the layer in the ResNet backbone); red means high, and blue means low. It can be seen that our FSNet focuses on areas with many line segments, such as buildings, window frames, and pillars. The supplementary material contains more examples.
We conduct an ablation study using the Google Street View dataset to demonstrate the effect of each component in our framework. All results are reported in Table 4, where the last row shows the result of our final framework.

We first evaluate the effect of the entire ZSNet by ablating it in training: when sampling zenith candidates in FSNet (Sec. 3.2), the score p_{z_i} in Eq. (7) is not predicted but set uniformly.

Table 4.
Ablation study results. Bold is the best result. See Sec. 4.2 for details.
| Ablation | Angle (°) ↓ Mean / Med. | Pitch (°) ↓ Mean / Med. | Roll (°) ↓ Mean / Med. | FoV (°) ↓ Mean / Med. | AUC (%) ↑ |
| --- | --- | --- | --- | --- | --- |
| w/o ZSNet | 3.00 / 2.04 | 2.81 / 1.98 | 1.62 / 0.95 | 8.42 / 4.47 | 74.01 |
| h'_z(z_i) = h_z(z_i) (Eq. (6)) | 4.34 / 1.96 | 3.91 / 1.76 | 1.64 / 0.59 | 7.88 / 4.16 | 77.65 |
| FSNet - Image | 2.45 / 1.78 | 2.19 / 1.52 | - | - | - |
| FSNet - L - A | - | - | - | - | - |
| s_i = s^{vh}_i (Eq. (22)) | 2.32 / 1.80 | 2.09 / 1.57 | 0.72 / 0.54 | 6.06 / 4.12 | 80.85 |
| Ours | 2.12 / 1.61 | 1.92 / 1.38 | - | - | - |

The first row of Table 4 indicates that the performance decreases significantly in all evaluation metrics without ZSNet; e.g., the AUC decreases by 9%. This indicates that ZSNet plays an important role in finding inlier zenith VPs that satisfy the Manhattan/Atlanta assumption. In ZSNet, we also evaluate the effect of the global line feature g_l in Eq. (5) by not concatenating it with the point feature h_z(z_i) in Eq. (6). Without the line feature, ZSNet is still able to prune the outlier zeniths to some extent, as indicated in the second row, but the performance is far inferior to that of our final framework (the last row). This result indicates that the line equations are much more informative than the noisy coordinates of the zenith VP candidates alone.

In FSNet, we then ablate parts of the per-frame input fed to the network. When we do not provide the given image but keep the rest of the input (the third row of Table 4), the performance decreases somewhat; however, the change is less significant than when omitting the line map L (Eq. (13)) and the activation map A (Eq. (15)) from the input (the fourth row). This demonstrates that FSNet learns more from the line map and activation map, which contain explicit geometric interpretations of the input image. The combination of the two maps (our final version) produces the best performance. In addition, the results get worse when the activation map score m_i is not used in the final score of the candidates, i.e., s_i = s^{vh}_i in Eq. (22) (the fifth row).

In this paper, we introduced a neural method that predicts camera calibration parameters from a single image of a man-made scene. Our method fully exploits line segments as prior knowledge of man-made scenes, and in our experiments, it exhibited better performance than previous approaches. Furthermore, compared to previous neural approaches, our method demonstrated a higher generalization capability to unseen data. In future work, we plan to investigate neural camera calibration that considers a small number of powerful geometric cues obtained by analyzing image context, as humans do.
Acknowledgements. This research was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1D1A1B03034907).
References
1. Google Street View Images API. https://developers.google.com/maps/
2. Akinlar, C., Topal, C.: EDLines: A real-time line segment detector with a false detection control. Pattern Recognition Letters (13), 1633–1642 (2011)
3. Alberti, L.B.: Della Pittura (1435)
4. Almazán, E.J., Tal, R., Qian, Y., Elder, J.H.: MCMLSD: A Dynamic Programming Approach to Line Segment Detection. In: Proc. CVPR. pp. 2031–2039 (2017)
5. Barinova, O., Lempitsky, V., Tretiak, E., Kohli, P.: Geometric Image Parsing in Man-Made Environments. In: Proc. ECCV. pp. 57–70 (2010)
6. Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S., Rother, C.: DSAC - Differentiable RANSAC for Camera Localization. In: Proc. CVPR. pp. 6684–6692 (2017)
7. Brachmann, E., Rother, C.: Neural-Guided RANSAC: Learning Where to Sample Model Hypotheses. In: Proc. ICCV. pp. 4322–4331 (2019)
8. Coughlan, J.M., Yuille, A.L.: Manhattan World: Compass Direction from a Single Image by Bayesian Inference. In: Proc. ICCV. pp. 941–947 (1999)
9. Criminisi, A., Reid, I., Zisserman, A.: Single View Metrology. International Journal of Computer Vision (2), 123–148 (2000)
10. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In: Proc. CVPR. pp. 5828–5839 (2017)
11. Denis, P., Elder, J.H., Estrada, F.J.: Efficient Edge-Based Methods for Estimating Manhattan Frames in Urban Imagery. In: Proc. ECCV. pp. 197–210 (2008)
12. Fischer, P., Dosovitskiy, A., Brox, T.: Image Orientation Estimation with Convolutional Networks. In: Proc. GCPR. pp. 368–378 (2015)
13. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM (6), 381–395 (1981)
14. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In: Proc. CVPR. pp. 3354–3361 (2012)
15. von Gioi, R.G., Jakubowicz, J., Morel, J.M., Randall, G.: LSD: A Fast Line Segment Detector with a False Detection Control. IEEE Trans. Pattern Analysis and Machine Intelligence (4), 722–732 (2010)
16. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edn. (2003)
17. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: Proc. CVPR. pp. 770–778 (2016)
18. Hold-Geoffroy, Y., Sunkavalli, K., Eisenmann, J., Fisher, M., Gambaretto, E., Hadap, S., Lalonde, J.F.: A Perceptual Measure for Deep Single Image Camera Calibration. In: Proc. CVPR. pp. 2354–2363 (2018)
19. Kluger, F., Brachmann, E., Ackermann, H., Rother, C., Yang, M.Y., Rosenhahn, B.: CONSAC: Robust Multi-Model Fitting by Conditional Sample Consensus (2020), https://arxiv.org/pdf/2001.02643.pdf
20. Košecká, J., Zhang, W.: Video Compass. In: Proc. ECCV. pp. 476–491 (2002)
21. Lee, H., Shechtman, E., Wang, J., Lee, S.: Automatic Upright Adjustment of Photographs with Robust Camera Calibration. IEEE Trans. Pattern Analysis and Machine Intelligence (5), 833–844 (2014)
22. Li, H., Zhao, J., Bazin, J.C., Chen, W., Liu, Z., Liu, Y.H.: Quasi-globally Optimal and Efficient Vanishing Point Estimation in Manhattan World. In: Proc. ICCV. pp. 1646–1654 (2019)
23. Ma, Y., Soatto, S., Kosecka, J., Sastry, S.S.: An Invitation to 3-D Vision: From Images to Geometric Models. Springer (2004)
24. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In: Proc. CVPR. pp. 652–660 (2017)
25. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: MICCAI. pp. 234–241 (2015)
26. Schindler, G., Dellaert, F.: Atlanta World: An Expectation Maximization Framework for Simultaneous Low-level Edge Grouping and Camera Calibration in Complex Man-made Environments. In: Proc. CVPR (2004)
27. Simon, G., Fond, A., Berger, M.O.: A-Contrario Horizon-First Vanishing Point Detection Using Second-Order Grouping Laws. In: Proc. ECCV. pp. 318–333 (2018)
28. Tardif, J.P.: Non-Iterative Approach for Fast and Accurate Vanishing Point Detection. In: Proc. ICCV. pp. 1250–1257 (2009)
29. Tretyak, E., Barinova, O., Kohli, P., Lempitsky, V.: Geometric Image Parsing in Man-Made Environments. International Journal of Computer Vision (3), 305–321 (2012)
30. Wildenauer, H., Hanbury, A.: Robust Camera Self-Calibration from Monocular Images of Manhattan Worlds. In: Proc. CVPR. pp. 2831–2838 (2012)
31. Workman, S., Greenwell, C., Zhai, M., Baltenberger, R., Jacobs, N.: DeepFocal: A Method for Direct Focal Length Estimation. In: Proc. ICIP. pp. 1369–1373 (2015)
32. Workman, S., Zhai, M., Jacobs, N.: Horizon Lines in the Wild. In: Proc. BMVC. pp. 20.1–20.12 (2016)
33. Xian, W., Li, Z., Fisher, M., Eisenmann, J., Shechtman, E., Snavely, N.: UprightNet: Geometry-Aware Camera Orientation Estimation From Single Images. In: Proc. ICCV. pp. 9974–9983 (2019)
34. Xiao, J., Ehinger, K.A., Oliva, A., Torralba, A.: Recognizing Scene Viewpoint using Panoramic Place Representation. In: Proc. CVPR. pp. 2695–2702 (2012)
35. Xu, Y., Oh, S., Hoogs, A.: A Minimum Error Vanishing Point Detection Approach for Uncalibrated Monocular Images of Man-made Environments. In: Proc. CVPR. pp. 1376–1383 (2013)
36. Zeiler, M.D., Fergus, R.: Visualizing and Understanding Convolutional Networks. In: Proc. ECCV. pp. 818–833 (2014)
37. Zhai, M., Workman, S., Jacobs, N.: Detecting Vanishing Points using Global Image Context in a Non-Manhattan World. In: Proc. CVPR. pp. 5657–5665 (2016)
38. Zhou, Y., Qi, H., Huang, J., Ma, Y.: NeurVPS: Neural Vanishing Point Scanning via Conic Convolution. In: Proc. NeurIPS (2019)

Appendix
A.1 Comparisons on SUN360 [34] Dataset
Similar to the experiment in Sec. 4.1, we also compare our method with the baseline methods on the SUN360 [34] dataset. We selected indoor and outdoor scenes in the SUN360 dataset that satisfy the Manhattan/Atlanta assumptions and generated the training and test images in the same way as for the Google Street View dataset, described in Sec. 4.
30,837 and 878 images are generated for the training and test sets, respectively. Details of the evaluation metrics and baseline methods are provided in Sec. 4.1.

The quantitative results on the SUN360 dataset [34] are reported in Table A1. The trends are similar to those of the Google Street View experiment in Table 2. Our method provides the best performance for most of the evaluation metrics and the second-best for the remaining ones, such as the median roll error and mean FoV error, and its AUC differs only marginally from the best. The qualitative results are presented in Fig. A1.
Table A1.
Quantitative evaluations with the SUN360 dataset. Bold represents the best result, while an underscore represents the second-best result. Note that for DeepHorizon [32]*, we use the GT FoV to calculate the camera up vector (angle, pitch, and roll errors) from the predicted horizon line. In addition, for UprightNet [33]**, we use a pretrained model on ScanNet [10] due to the lack of required supervision in the SUN360 dataset.
| Method | Angle (°) ↓ Mean / Med. | Pitch (°) ↓ Mean / Med. | Roll (°) ↓ Mean / Med. | FoV (°) ↓ Mean / Med. | AUC (%) ↑ |
| --- | --- | --- | --- | --- | --- |
| Upright [21] | 3.43 / 1.43 | 3.03 / 1.13 | 6.85 / - | - | - |
| A-Contrario [27] | - | - | - | - | - |
| DeepHorizon [32]* | - | - | - | - | - |
| Perceptual [18] | - | - | - | - | - |
| Perceptual [18] + L | - | - | - | - | - |
| UprightNet [33]** | - | - | - | - | - |
| Ours | 2.33 / 1.27 | 1.97 / 0.96 | 0.97 / - | - | - |
A.2 Additional Results on Google Street View [1] Dataset
Fig. A2 presents additional results on the Google Street View [1] dataset, as in Fig. 5, visualizing horizon line predictions and weakly supervised Manhattan directions. In each example, we illustrate the Manhattan directions of the highest-score candidate (Eq. (22)). In most cases, our method provides better horizon prediction results than previous state-of-the-art methods. Note that we only use the supervision of horizon lines and focal lengths (3DoF), yet we can further estimate the camera rotation and focal length (4DoF) based on the Manhattan world assumption.
Fig. A1.
Examples of horizon line prediction on the SUN360 test set. Each example also displays the Manhattan directions of the highest-score candidate.
A.3 Visualization of Our Network I/Os
Fig. A3 illustrates how geometric cues are processed and utilized in the proposed method. Columns (a)-(d) of Fig. A3 show the input image, the rasterized line segment map L, the grouped horizontal line segments used for sampling candidates of the Manhattan directions (as in Fig. 3(d) in the paper), and the set of sampled candidates of the Manhattan directions, respectively.

Fig. A3(e) displays the predicted horizons and their corresponding ground truths, as well as the Manhattan directions of the highest-score candidate. Activation maps A with respect to the Manhattan directions are presented in Fig. A3(f). Notice that the activation maps A in Fig. A3(f) explain much of their respective line segment maps L in Fig. A3(b), exemplifying how our method incorporates the Manhattan world assumption.

Fig. A3(g) superimposes the eight Manhattan directions of the top-8 high-scoring candidates over the input images. As illustrated in Fig. A3(g), the zenith directions are almost the same between candidates, as man-made scenes usually satisfy either the Manhattan or the Atlanta world assumption [8,26]. For scenes satisfying the Manhattan world assumption (rows 1-4), the axes of the eight frames almost overlap. For the last two scenes (rows 5 and 6), which follow the Atlanta world assumption, all the frames have zenith directions that are very close to each other. By utilizing these frames, we can robustly and accurately estimate the horizon lines and focal lengths of given scenes.

A.4 Comparison of Manhattan Direction Prediction on YUD [11] and ECD [29]
We report the accuracy of Manhattan direction prediction using our method on the YUD [11] and ECD [29] datasets and compare the results with those of the other methods. In this experiment, we take the network model trained on the Google Street View dataset and test it on the YUD [11] and ECD [29] datasets. For the evaluation, the Manhattan directions of the highest-score candidate are used.

The YUD [11] dataset contains 102 images under the Manhattan assumption, where each image is annotated with three VPs and a focal length. The ECD [29] dataset contains 103 images under the Atlanta assumption, where each image is annotated with a zenith VP and more than two horizontal VPs on a horizon line. For the ECD [29] dataset, the direction closest to the prediction is used for comparison.

Table A2 shows the quantitative comparisons in terms of the relative rotation angle, FoV difference, and AUC. For FoV and AUC, we use the same settings as described in Sec. 4.1. As shown in Table A2, our results are comparable to those of the non-neural methods [21,27], which are highly optimized for the YUD [11] and ECD [29] datasets. We remark that our networks are trained on a different dataset and also with weak and indirect supervision (horizon lines and focal lengths).
Fig. A2.
Examples of horizon line prediction on the Google Street View test set (top four rows) and on the HLW test set (bottom four rows). Each example also shows the Manhattan directions of the highest-score candidate.
Fig. A3.
Sampled images from the Google Street View test set (top three rows) and the HLW test set (bottom three rows). For each row, we show: (a) input image; (b) rasterized line segment map L; (c) two groups of horizontal line segments (cyan & magenta) used for sampling candidates of the Manhattan directions; (d) sampled candidates of the Manhattan directions; (e) ground truth and predicted horizon lines (yellow & red dashed) as well as the estimated Manhattan directions of the highest-score candidate; (f) activation map A of the Manhattan directions shown in (e); and (g) Manhattan directions of the top-8 high-score candidates.

A.5 Parameter Sensitivity Test
We tested parameter sensitivity by varying our parameters, including the angle threshold for vertical lines (δ_z in Eq. (4)), the angle thresholds for deciding the positive and negative samples of zenith candidates (δ_p and δ_n in Eq. (8)), the score threshold for ZSNet (δ_c in Eq. (12)), the score threshold for FSNet (δ_s in Eq. (19)), and the numbers of line segments and intersection points used in the network (|L_z| and |Z|). We also tested sensitivity to the line detection algorithm by varying the LSD parameter -log(NFA), where NFA is the number of false alarms, and by replacing the line detection algorithm with MCMLSD [4].

Table A2.
Quantitative evaluations of Manhattan direction prediction with the YUD [11] and ECD [29] datasets.
| Dataset | Method | Rotation Angle (°) ↓ Mean / Med. | FoV (°) ↓ Mean / Med. | AUC (%) ↑ |
| --- | --- | --- | --- | --- |
| YUD [11] | Upright [21] | - | - | - |

We used the network model trained on the Google Street View [1] dataset with default parameters and tested it while varying the parameters, except for δ_p and δ_n in Eq. (8) and δ_s in Eq. (19); these parameters change either the ground truth labels or the loss function. For those parameters, we finetuned our network from the pretrained model. All results are reported in Table A3. The highlighted rows show the results with the default parameters. The results demonstrate that our method is robust to changes in these parameters.

In our implementation, 1,024 lines and points each was the maximum number with which we could train the network with 11 GB of GPU memory. Larger numbers of lines and points also significantly increase training time and GPU memory usage. For simplicity, all results reported in this paper were obtained with |L_z| = |Z| = 256 at both training and test time.

A.6 Visualization of FSNet Focus
In Fig. A4, we show more visualizations of the weights of the second-to-last convolution layer in FSNet, as in Fig. 6. The network mostly focuses on the lines that pass through the vanishing points.
A.7 Failure Cases
Fig. A5 shows failure cases of our framework. Failures occur when the computation of the focal length is unstable, for example when: i) the scene is far from the Manhattan assumption; ii) only short or noisy line segments are detected in the scene; or iii) the scene is almost perpendicular to the center of projection. Note that the estimated zenith directions are still reasonable in Fig. A5, thanks to the semantic information learned by ResNet, the backbone of our FSNet. Therefore, even in the cases of Fig. A5, our framework is still applicable to image rotation correction, as shown in Fig. 1(a).
A.8 Experiment on KITTI [14] Dataset
We conducted an additional experiment with the KITTI [14] dataset. The KITTI dataset contains wide images captured while driving around urban cities and rural areas.
Table A3.
Parameter sensitivity test results. The highlighted rows show the results with the default parameters. Bold is the best result, and underscore is the second-best result in each experiment.
| Parameter | Setting | Angle (°) ↓ Mean / Med. | Pitch (°) ↓ Mean / Med. | Roll (°) ↓ Mean / Med. | FoV (°) ↓ Mean / Med. | AUC (%) ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| δ_c (Eq. (12)) | 0.4 | 2.32 / 1.84 | 2.28 / 1.44 | 0.84 / 0.52 | 6.82 / 4.37 | 80.23 |
| δ_s (Eq. (19)) | 0.4 | 3.02 / 1.80 | 2.71 / 1.61 | 1.04 / - | - | - |
| k | 1 | 2.23 / 1.72 | 1.97 / 1.49 | 0.75 / 0.55 | 6.71 / 3.84 | 82.12 |
| k | 8 | 2.12 / - | - | - | - | - |
| k | 16 | 2.24 / 1.71 | 2.04 / 1.52 | - | - | - |

Table A4.
Quantitative evaluations with KITTI dataset.
| Method | Angle (°) ↓ Mean / Med. | Pitch (°) ↓ Mean / Med. | Roll (°) ↓ Mean / Med. | FoV (°) ↓ Mean / Med. | AUC (%) ↑ |
| --- | --- | --- | --- | --- | --- |
| Ours (k = 8) | - | - | - | - | - |
We sample 8,675 images of urban scenes from the KITTI dataset and use them to finetune our network from the model pretrained on the Google Street View dataset. We test our finetuned model on 481 images of urban and rural scenes from the KITTI dataset.

Fig. A6 shows examples of horizon predictions on the KITTI test set. Unfortunately, the GT horizons of the dataset are geometrically inaccurate due to the large influence of the vehicle's tilting during cornering. Nevertheless, we obtained interesting results in which the horizons estimated by our framework do not deviate significantly from the GT horizons in urban areas. We believe these results come from the characteristics of the KITTI dataset, since there is little change in the horizon line and focal length. Another reason seems to be that ResNet, the backbone of our FSNet, learned the scene context from the KITTI dataset. Table A4 reports the quantitative evaluations on the KITTI dataset.

Fig. A4.
More visualizations of FSNet focus: left is input; right is feature highlight.
Fig. A5.
Failure cases.
Fig. A6.
Examples of horizon line prediction on the KITTI test set (Ground Truth and Ours).