Single-Shot Cuboids: Geodesics-based End-to-end Manhattan Aligned Layout Estimation from Spherical Panoramas
Nikolaos Zioulis, Federico Alvarez, Dimitrios Zarpalas, Petros Daras
Nikolaos Zioulis
Centre for Research and Technology Hellas
Universidad Politécnica de Madrid
[email protected]
Federico Alvarez
Universidad Politécnica de Madrid
[email protected]
Dimitrios Zarpalas and Petros Daras
Centre for Research and Technology Hellas
{zarpalas,daras}@iti.gr

Abstract
It has been shown that global scene understanding tasks like layout estimation can benefit from wider fields of view, and specifically spherical panoramas. While much progress has been made recently, all previous approaches rely on intermediate representations and postprocessing to produce Manhattan-aligned estimates. In this work we show how to estimate full room layouts in a single shot, eliminating the need for postprocessing. Our work is the first to directly infer Manhattan-aligned outputs. To achieve this, our data-driven model exploits direct coordinate regression and is supervised end-to-end. As a result, we can explicitly add quasi-Manhattan constraints, which set the necessary conditions for a homography-based Manhattan alignment module. Finally, we introduce geodesic heatmaps and loss and a boundary-aware center of mass calculation that facilitate higher quality keypoint estimation in the spherical domain. Our models and code are publicly available at https://vcl3d.github.io/SingleShotCuboids/.
1. Introduction
Modern hardware advances have commoditized spherical cameras, which have evolved beyond elaborate optics and camera clusters. Affordable handheld 360° cameras¹ are finding widespread use in various applications, with the more prominent ones being real estate, interior design, and virtual tours, and with recently introduced datasets following the same trends. Realtor360 [60] contains panoramas acquired by a real-estate company, while Kujiale [28] and Structured3D [65] were rendered using a large corpus of computer-generated data from an interior design company. Further, datasets containing spherical panoramas like Matterport3D [3] and Stanford2D3D [1] were created using the Matterport camera, originally developed for virtual tours. This signifies the importance of spherical panoramas for indoor 3D capturing, as they are (re-)used in multiple 3D vision tasks [55, 50, 67].

¹We will be using the adjective terms spherical, omnidirectional, and 360° for cameras and images interchangeably.

Figure 1: From a single indoor scene panorama input, we estimate a Manhattan-aligned cuboid of the room's layout, in a single shot. To achieve this, we rely on spherical coordinate localization using geodesic heatmaps. This explicit reasoning about the corner positions in the image allows for the integration of vertical alignment constraints that drive a differentiable homography-based cuboid fitting module.

Spherical panoramas capture the entire scene context within their field of view (FoV), an important trait for scene understanding. While humans can infer out-of-FoV information, the same cannot be said for machines, with view extrapolation methods [44] using spherical data to address this. Certain tasks like illumination or layout estimation implicitly extrapolate outside narrow FoVs. Neural Illumination [43] estimates a scene's lighting from a single perspective image, employing a perspective-to-spherical completion intermediate task within its end-to-end model. Estimating a scene's layout involves extrapolating structural information, and thus many works now resort to spherical panoramas to exploit their holistic structural and contextual information.

The seminal work of PanoContext [64] reconstructs an entire room into a 3D cuboid, fully exploiting the large FoV of omnidirectional panoramas. Its complex formulation and weak priors resulted in high computational complexity, requiring several minutes for each panorama. While modern deep priors produce higher quality results [68, 60], increasing the accuracy of their predictions and ensuring Manhattan-aligned layouts requires postprocessing and hurts runtime efficiency.

Spherical panoramas necessitate higher resolution processing and, therefore, increased computational complexity, as evidenced by recent data-driven layout estimation models [68, 60, 47]. More efficient alternatives [15] produce irregular (i.e., non-Manhattan) outputs, require parameter-sensitive postprocessing, and increase efficiency by lowering spatial resolution, which comes at the cost of accuracy. Moreover, data-driven spherical vision needs to address the distortion of the projective omnidirectional data formats. But distortion-mitigating convolutions add a significant computational overhead, as reported in [15] and [9].

In this work, we present a single-shot spherical layout estimation model.
As presented in Figure 1, we employ spherical-aware corner coordinate estimation and thus add explicit constraints that facilitate vertically aligned corners. Capitalizing on this, we further integrate full Manhattan alignment directly into the model, allowing for end-to-end training and lifting the postprocessing requirement.
2. Related Work
While an excellent review regarding the 3D reconstruction of structured indoor environments exists [40], our discussion will provide the necessary details for positioning our work. We focus on monocular layout estimation and thus refrain from discussing works using multiple panoramas [39, 41, 37, 38], interaction [30], or other types of cameras [27, 29].

2.1. Spherical Layout Estimation

PanoContext [64] showcased the expressiveness of 360° panoramas in terms of structural and contextual information. Prior to the maturation of deep data-driven methods, PanoContext relied on edge and line detection, the Hough transform, and deformable part models to generate different room layout hypotheses. Similarly, low-level line segments were used in an energy minimization formulation to estimate a scene's structural planes [17]. In Panoramix [59], the line features were supplemented by superpixel facets and embedded as vertices in a graph for a constrained least squares problem.

Hybrid data-driven methods [16] used structural edge detection to improve the performance and runtime of [64] when using fewer hypotheses. Pano2CAD [58] used a probabilistic formulation that relied on CNN object recognition and detection. It generated a synthetic scene reconstruction but required several minutes of processing. Its computational overhead largely comes from the fusion of narrow FoV predictions from perspective 360° crops. This is common to all aforementioned methods relying on line segments, and to [61], which runs various CNNs on all narrow FoV sub-views before merging them in 360°.

PanoRoom [14] and LayoutNet [68] were the first models to be trained on spherical panoramas. They both modelled layout corner and structural edge estimation as a spatial probabilistic inference task. While it is possible to extract the layout's corners by relying on heuristically or empirically parameterized peak detection, these estimations will most likely not deliver Manhattan-aligned outputs. Consequently, joint optimization is performed using both sources of information to recover the final layout corner estimates. LayoutNet requires several seconds to infer and optimize the layout on a CPU, but PanoRoom is much faster as it uses a greedy RANSAC approach.

DuLa-Net [60] employs a novel approach for 360° layout estimation. The main insight is that spherical images can be projected in multiple ways, and different projections highlight different cues. Specifically, DuLa-Net uses a 'ceiling-view' that offers a more informative viewpoint with respect to the floor plan, which is a projection of a Manhattan 3D layout. It performs feature fusion across both the equirectangular and ceiling-view branches, using a height prediction to estimate the final 3D layout. HorizonNet [47] is yet another novel take on omnidirectional layout estimation. Instead of image-localised predictions, it encodes the boundaries and intersections in one-dimensional vectors, which are then used to reconstruct the scene's corners. This allows HorizonNet to exploit the expressiveness of recurrent models (LSTM [22]) to offer globally coherent predictions. After a postprocessing step involving peak detection and height optimization, the final Manhattan-aligned layout is computed. A recent thorough comparison between LayoutNet, DuLa-Net and HorizonNet was presented in [69]. Unified encoding models and training scripts were used to fairly evaluate these approaches.
Their findings indicate that the PanoStretch data augmentation proposed in [47], as well as its heavier encoder backbone, lead to improved performance for the other models as well. The Corners-for-Layout (CFL) [15] model is currently the most efficient approach for 360° layout estimation in terms of runtime, but at the expense of accuracy and Manhattan alignment. While an end-to-end model is discussed, an empirically or heuristically parameterized postprocessing image peak detection step is still required.

Compared to these approaches, our model is end-to-end trainable, producing Manhattan-aligned corners in a single shot. We approach the layout estimation task as a keypoint localization one and use an efficiently designed spherical model.

2.2. Learning on the Sphere

There are multiple representations for spherical images, with the more straightforward being the cube-map. Traditional CNN models can be applied to the cube faces [33] and then warped back to the sphere. This was used in [64] and [59] to detect lines on each cube's faces [53], while [58] and [61] used CNN inference on each face. Still, cube-maps suffer from distortion as well, and additionally require face-specific padding [4] to deal with the faces' discontinuities. Yet, to capture the global context, these approaches need to expand their receptive field to connect all faces continuously, which leads to inefficient models.

A novel line of research pursues model adaptation from the perspective domain to the equirectangular one [45]. The follow-up work, Kernel Transformer Networks [46], adapts traditional kernels to the spherical domain in a learned manner, also discussing two important aspects. First, the accuracy-resolution trade-off for spherical images, which necessitates the use of higher resolutions. Indeed, most aforementioned data-driven layout estimation methods for 360° images operate on 512 × 1024 images, which are unusually large for CNNs. Only [15] is the exception to this rule, which further supports this point, taking into account its reduced performance. The second point of discussion is related to the effect that non-linearities have when combined with kernel projection methods like [6] and [51]. It is shown that the assumption that needs to hold for no error to accumulate when using kernel projection only holds for the first layers of the network, and as it deepens, the accumulated error becomes even larger. Still, [15] shows that their EquiConv offers more robust predictions. A generalization of this concept, Mapped Convolutions [9], decouples the sampling operation from the filtering one and demonstrates increased performance in dense estimation tasks. Still, runtime performance is greatly reduced, as reported in both [15] and [9].

This is also the main drawback of frequency-based spherical convolutions, as presented in the concurrent works of [5] and [11]. They are also highly inefficient in terms of memory, allowing for training and inference in very low resolution images only. DeepSphere [8] and [25] present another approach to handle distortion and discontinuity by leveraging graph convolutions and lifting the sphere representation to a graph. Nonetheless, this requires a graph generation step and loses efficacy compared to traditional convolutions, whose implementations are highly optimized to exploit the memory regularity of image representations.

The most efficient way to handle the discontinuity is circular padding [54, 47, 7], which is partly our approach as well, taking into account the inefficiency of distorted kernels.
It should also be noted that model adaptation methods would not transfer well to the layout estimation task. While an object detection task parses a scene in a local manner, layout estimation requires reasoning about the global context, with perspective methods typically needing to extrapolate the scene's structure. However, as first shown by PanoContext [64], the availability of the entire scene is much more informative, and this would hinder the applicability of transferring models like RoomNet [27] to the 360° domain using such techniques [45, 46].

2.3. Coordinate Regression

Regressing coordinates in an image has been shown to be an intriguingly challenging problem [31]. The proposed solution was to offer the coordinate information explicitly. Yet, most keypoint estimation works in the literature initially used fully connected layers to regress coordinates. The counter-intuition is that convolutions are inherently spatial and should be more well-behaved in spatial prediction tasks. This is how data-driven layout estimation models have addressed this problem up to now ([68], [15]): transforming coordinates into spatial configurations, using smoothing kernels to approximate coordinates, and leveraging dense supervision. Keypoint localisation tasks with semantic inter-correlated structures typically use one heatmap per keypoint. However, an issue that has recently received attention [62] is the way the final coordinate is estimated from each dense prediction. Indeed, the spatial maxima might not always best approximate the coordinate, and thus heuristic approaches have persisted. Specifically for layout estimation, where the corners are predicted on the same map, manually-set peak detection thresholds are used.

The overlapping works of [32], [48] and [35] derive smooth operations to reduce a heatmap to a single coordinate. Using the coordinate grid and a spatial softmax function, they smoothly, and differentiably, transform a spatial probabilistic representation into a single location. As shown in [52], all the above operations treat pixels as particles with masses and estimate their center of mass.
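To make this concrete, such a center-of-mass ("soft-argmax") reduction can be sketched in a few lines; this is our own minimal illustration in the spirit of [32, 48, 35], not code from those works:

```python
import torch

def soft_argmax_2d(heatmap: torch.Tensor) -> torch.Tensor:
    """Differentiably reduce a (B, H, W) heatmap to (B, 2) pixel coordinates.

    A spatial softmax turns the raw map into per-pixel 'masses'; the
    coordinate expectation is then the center of mass of those masses.
    """
    b, h, w = heatmap.shape
    probs = torch.softmax(heatmap.view(b, -1), dim=-1).view(b, h, w)
    xs = torch.arange(w, dtype=heatmap.dtype, device=heatmap.device)
    ys = torch.arange(h, dtype=heatmap.dtype, device=heatmap.device)
    # E[x]: marginalize over rows, then take the expectation over columns.
    x = (probs.sum(dim=1) * xs).sum(dim=-1)
    # E[y]: marginalize over columns, then take the expectation over rows.
    y = (probs.sum(dim=2) * ys).sum(dim=-1)
    return torch.stack([x, y], dim=-1)
```

Because the output is a smooth weighted average of the grid, coordinate-level losses backpropagate to every heatmap pixel, and the estimate is not restricted to grid positions.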
3. Single-Shot Cuboids
Unlike previous works, we approach layout estimation as a keypoint localisation task, alleviating the need for postprocessing while simultaneously ensuring Manhattan-aligned outputs. Section 3.1 formulates our coordinate regression objective and its adaptation to the spherical domain, Section 3.2 introduces the geodesic heatmaps and loss function, and Section 3.3 provides insights into our model's design and the techniques to achieve end-to-end Manhattan alignment.

3.1. Spherical Center of Mass
The center of mass (CoM) $c_P$ for a collection of particles $P: \{p_1, \dots, p_N\}$ is defined as:

$$ c_P = \frac{\sum_i^N m_i p_i}{M}, \qquad M = \sum_i^N m_i, \tag{1} $$

with $m_i$ being the mass of particle $p_i$ and $M$ the system's total mass. The CoM $c_P$ represents a concentration of the particle system's mass and does not necessarily lie on an existing particle. This way, when considering a sparse keypoint estimation task in a structured grid, we can reformulate it as a dense prediction task by instead inferring the mass of each grid point. Using Eq. (1) we can directly supervise it with the keypoint coordinates, instead of relying on a surrogate objective as commonly done in pose estimation [62] or facial landmark detection [13].

For spherical layout estimation, the set of particles $P$, for which we seek to individually estimate each particle's mass, lies on a sphere. Each layout corner is considered as the CoM of a distinct particle system defined on the sphere. Each particle $p = (\phi, \theta)$ on the sphere is represented by its longitude $\phi$ and latitude $\theta$. While there are ways for learning directly on the 2-sphere $S^2$ manifold, as explained in Section 2.2, they are very inefficient. Consequently, we consider the equirectangular projection of the sphere, which preserves the angular parameterization of each particle. The equirectangular projection is an equidistant planar projection of the sphere, where the pixels in the image domain $\Omega: (u, v) \in [0, W] \times [0, H]$ are linearly mapped to the angular domain $A: (\phi, \theta) \in [0, 2\pi] \times [0, \pi]$.¹ Nevertheless, this format necessitates a different approach to overcome its weaknesses, namely image boundary discontinuity and planar projection distortion.

The discontinuity arises at the horizontal panorama boundary, where the particles, even though at the opposite sides of the image, are actually neighboring on the sphere. For traditional images, the (normalized) grid coordinates are typically defined in $[0, 1]$ or $[-1, 1]$, and thus the particles at the boundary would be maximally distant. However, for spherical panoramas, the longitudinal coordinate $\phi$ is periodic and wraps around, with the particles at the boundaries being proximal (i.e., minimally distant). To address this, we split the CoM calculation for the longitude and latitude coordinates and adapt the former to consider each point as lying on a circle. Therefore, for each panorama row, which represents a circle of (equal) latitude, we define new particles $r \in R$ with

$$ r(\phi) = (\lambda, \tau) = (\cos\phi, \sin\phi), \tag{2} $$

which lie on a unit circle.

¹We transition between these terms flexibly given their linear mapping.
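As a toy numeric illustration of this lifting (mirroring the example of Figure 2; the angle recovery used here is exactly Eqs. (3)-(4) below, and the snippet is ours, not the authors' code):

```python
import math

# Two equal-mass particles straddling the horizontal panorama boundary.
phi = [0.1, 2 * math.pi - 0.1]                  # nearly identical points on the sphere
naive = sum(phi) / len(phi)                     # ~pi: the grid mean lands on the far side
lam = sum(math.cos(p) for p in phi) / len(phi)  # Eq. (2)/(3): average on the unit circle
tau = sum(math.sin(p) for p in phi) / len(phi)
lifted = math.atan2(-tau, -lam) + math.pi       # Eq. (4): ~0 (mod 2*pi), the correct CoM
```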
Figure 2: Spherical Center of Mass calculation. Left: Two sets of particles distributed on two circles of latitude (blue and pink). Middle: Their equirectangular projection grid coordinates. Right: Lifting the problem to the unit circle allows for continuous CoM estimation. Darker points illustrate the CoMs calculated using our lifting approach, and white ones the erroneous estimates when directly estimating the CoM on the grid.

We can then calculate the CoM $c_R$:

$$ c_R = (\bar{\lambda}, \bar{\tau}) = \frac{\sum_i^N m_i r_i}{M}. \tag{3} $$

This estimates, exactly and continuously, the CoM of the circle. To map this back to the original domain, we extract the angle $\bar{\phi}$:

$$ \bar{\phi} = \operatorname{atan2}(-\bar{\tau}, -\bar{\lambda}) + \pi, \tag{4} $$

which represents the longitudinal CoM across the discontinuity. Figure 2 shows a toy example of CoM calculations along two circles of latitude on the sphere, with the erroneous estimates acquired on the equirectangular projection and the correct ones when considering the boundary.

Although the equirectangular projection maps circles of latitude (longitude) to horizontal (vertical) lines of constant spacing, the same does not apply to its sampling density. Indeed, while it samples the sphere with a constant density vertically, it stretches each circle of latitude to fit the same constant horizontal line. Thus, its sphere sampling density is not uniform across planar pixel locations. The sampling density is $1/\sin\theta$ [49] and it approaches infinity near the pole singularities. When calculating the CoM in the equirectangular domain, we need to compensate for it by re-weighting the contribution of each pixel $p$ by $\sigma(p) = \sin\theta$ [66].

Essentially, given a dense mass prediction $M(p), p \in A$, we calculate the spherical CoM by first estimating a three-dimensional coordinate $c_a$:

$$ c_a = (\bar{\lambda}, \bar{\tau}, \bar{\theta}) = \frac{\sum_p^A M(p)\,\sigma(p)\,a(p)}{\sum_p^A M(p)\,\sigma(p)}, \tag{5} $$

with $a(p) = (r(\phi), \theta) = (\cos\phi, \sin\phi, \theta)$, and then drop it back to two dimensions to calculate the final CoM $c_m = (\bar{\phi}, \bar{\theta}) = (\operatorname{atan2}(-\bar{\tau}, -\bar{\lambda}) + \pi, \bar{\theta})$ of $M$ in the equirectangular domain.
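For concreteness, Eqs. (2)-(5) amount to the following differentiable reduction; this is our own sketch (the released code at the project page is authoritative), assuming rows map linearly to latitude and columns to longitude:

```python
import math
import torch

def spherical_com(mass: torch.Tensor) -> torch.Tensor:
    """Spherical center of mass (Eqs. 2-5) of a (B, H, W) mass map.

    Rows map to latitude theta in (0, pi), columns to longitude phi in
    [0, 2*pi). Returns (B, 2) angular coordinates (phi, theta).
    """
    b, h, w = mass.shape
    theta = ((torch.arange(h, dtype=mass.dtype, device=mass.device) + 0.5)
             * math.pi / h).view(1, h, 1)
    phi = ((torch.arange(w, dtype=mass.dtype, device=mass.device) + 0.5)
           * 2 * math.pi / w).view(1, 1, w)
    m = mass * torch.sin(theta)                        # sigma(p): density compensation
    norm = m.sum(dim=(1, 2))
    lam = (m * torch.cos(phi)).sum(dim=(1, 2)) / norm  # Eq. (5), unit-circle lifting
    tau = (m * torch.sin(phi)).sum(dim=(1, 2)) / norm
    t_bar = (m * theta).sum(dim=(1, 2)) / norm
    p_bar = torch.atan2(-tau, -lam) + math.pi          # Eq. (4): back to a longitude
    return torch.stack([p_bar, t_bar], dim=-1)
```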
3.2. Geodesic Heatmaps and Loss

Accordingly, predicting the sparse coordinates of a corner comes down to predicting the dense mass map $M$, or otherwise heatmap, which is the terminology we will be using hereafter. Previous approaches complemented the sparse objective with a dense regularisation term [35]. The reason is that CoM regression is not constrained in any way as to the shape of its dense prediction. This was addressed by adding a distribution loss over the predicted heatmap and a Gaussian centered at the groundtruth coordinate.

Yet, while extracting the CoM as presented in Section 3.1 takes the spherical domain into account, traditional (flat) Gaussian heatmaps do not. A spatial normal distribution $N(c, s)$, centered around a coordinate $c = (u, v)$ and using a standard deviation $s = (s_x, s_y)$, would consider the equirectangular image as a flat one, with a discontinuous boundary and no distortion.

To overcome this, we construct geodesic heatmaps, which are reconstructed directly on the equirectangular domain using a shifted angular coordinate grid $A_s$ defined on the panorama ($\phi$ and $\theta$ shifted by $-\pi$ and $-\pi/2$, respectively):

$$ G(c_m, \alpha) = \frac{1}{\alpha\sqrt{2\pi}}\, e^{-\frac{g(c_m, p_s)^2}{2\alpha^2}}, \quad p_s \in A_s, \tag{6} $$

where $\alpha$ is the angular standard deviation around the distribution's center $c_m$, and $g(\cdot)$ is the geodesic distance:

$$ g(p_1, p_2) = 2 \arcsin\sqrt{\sin^2\tfrac{\Delta\theta}{2} + \cos\theta_1\cos\theta_2\sin^2\tfrac{\Delta\phi}{2}}, \tag{7} $$

where $\Delta\phi = \phi_2 - \phi_1$ and $\Delta\theta = \theta_2 - \theta_1$.

Figure 3: Geodesic heatmaps respect the horizontal boundary continuity and the equirectangular projection's distortion. Five normal distributions on the sphere, centered around different coordinates but using the same angular standard deviation, are presented on the top row. Their corresponding geodesic heatmaps are aggregated on the equirectangular image on the bottom row. In addition, the geodesic distances between the red square and the colorized diamond coordinates are also presented on the same image. The geodesic distance similarly respects the boundary and distortion of the equirectangular projection, as seen by the great circles drawn on the image that correspond to each pair's angular distance.

As illustrated in Figure 3, using the geodesic distance between two angular coordinates on the equirectangular panorama, we reconstruct geodesic heatmaps that simultaneously take into account both the continuous boundary and the projection's distortion.
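A sketch of the geodesic heatmap reconstruction (ours), assuming the shifted grid $A_s$ and the haversine form of Eq. (7); the normalization constant follows our reconstruction of Eq. (6):

```python
import math
import torch

def geodesic_heatmap(center: torch.Tensor, h: int, w: int, alpha: float) -> torch.Tensor:
    """Geodesic Gaussian of Eq. (6) around center = (phi, theta), on the
    shifted grid phi in [-pi, pi), theta in [-pi/2, pi/2). Returns (H, W)."""
    theta = (torch.arange(h) + 0.5) * math.pi / h - math.pi / 2
    phi = (torch.arange(w) + 0.5) * 2 * math.pi / w - math.pi
    theta, phi = torch.meshgrid(theta, phi, indexing="ij")
    dphi, dtheta = phi - center[0], theta - center[1]
    # Haversine form of the great-circle distance, Eq. (7).
    hav = (torch.sin(dtheta / 2) ** 2
           + torch.cos(theta) * torch.cos(center[1]) * torch.sin(dphi / 2) ** 2)
    g = 2 * torch.arcsin(torch.sqrt(hav.clamp(0.0, 1.0)))
    return torch.exp(-g ** 2 / (2 * alpha ** 2)) / (alpha * math.sqrt(2 * math.pi))
```

Reconstructing targets this way keeps the same angular spread $\alpha$ everywhere on the sphere, whereas a flat Gaussian would neither wrap across the horizontal boundary nor stretch towards the poles with the projection.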
3.3. Single-Shot Model

Our model infers a set of heatmaps $M^j$, one for each layout corner $j \in [1, J]$ (or junction, given that 3 planes intersect), with $J = 8$ for cuboid layouts. It operates in a single-shot manner, as these predictions are directly mapped into layout corners $c_m^j$. Apart from removing the postprocessing step, another advantage of our single-shot approach is the sub-pixel accuracy that it allows for, as the CoM of the particles is not necessarily one of the particles themselves. This translates to a reduction of the input and working resolution of the model.

3.3.1. Architecture

We choose a light-weight stacked hourglass (SH) architecture [34]. It is designed for multi-scale feature extraction and merging, which enables the effective capturing of spatial context. It suits spherical layout estimation very well, as the latter is a global scene understanding task that benefits from spatial context aggregation, which is achieved by lowering the spatial dimension of the features. Still, it also requires precise localisation of specific keypoints, which needs higher spatial fidelity (i.e., resolution) predictions.

Figure 4: Our model stacks N hourglasses which embed recently developed CNN modules for direct inter-hourglass information flow, spherically padded convolutions, and smoother multi-scale feature flow. The predicted geodesic heatmaps get transformed directly to panoramic layout coordinates through a spherical CoM module. Since we regress coordinates, we explicitly enforce quasi-Manhattan alignment. This sets the ground for a homography-based cuboid alignment head that ensures the Manhattan alignment of our estimates. The ⋆ symbol denotes a global multiply-accumulate operation, reducing the predicted dense representation to a set of sparse coordinates. Color-graded spheres indicate coordinate-based distance from the origin.

We made several modifications to the original SH model, stemming mainly from recent advances in the field. While we preserve the original residual block [20] in the feature preprocessing block, we replace the hourglass residual blocks with preactivated ones [21]. Essentially, this adds direct identity mappings between the stack of hourglasses, allowing for immediate information propagation from the output to the earlier hourglass modules. We also use anti-aliased max-pooling [63], which preserves shift equivariance and leads to smoother activations across downsampled layers. Finally, unlike some state-of-the-art spherical layout estimation methods [68, 60, 69], we address feature map discontinuity by using spherical padding. For the horizontal image direction, we apply circular padding, as also done in [54] and [47], and for the vertical one, at the pole singularities, we resort to replication padding.

3.3.2. Quasi-Manhattan Alignment

Since we are directly regressing coordinates, we can explicitly ensure quasi-Manhattan alignment during training and inference alike. Previous approaches either use postprocessing to ensure the Manhattan alignment of their predictions [68, 60, 47], or simply forego it and produce non-Manhattan outputs [15]. While this relaxation is sometimes presented as an advantage, most man-made environments are Manhattan-aligned, with walls being orthogonal to ceilings and floors, and therefore same-edge wall corners are vertically aligned. For each wall-to-ceiling junction, there exists a wall-to-floor junction, effectively splitting our heatmaps into two groups, the top $M_t^j$ and bottom $M_b^j$ heatmaps (i.e., ceiling and floor junctions respectively). We enforce quasi-Manhattan alignment by averaging the longitudinal coordinates of each wall's vertical edge, guaranteeing a consistent longitudinal coordinate for both the top and bottom junction.

3.3.3. Cuboid Alignment

This quasi-Manhattan alignment ensures that wall edges are vertical to the floor, but does not enforce the orthogonality of adjacent walls. To achieve this, we introduce a differentiable operation that transforms the predicted corners so as to ensure the orthogonality between adjacent walls. While the estimated corners are up-to-scale, with a single center-to-floor/ceiling measurement/assumption we can extract metric 3D coordinates for each corner as in [64], by fixing the ceiling/floor vertical distance to the corresponding average height.

We extract the $f = (x, y)$ horizontal coordinates, corresponding to an orthographic floor view projection, which comprise a general trapezoid. This is transformed to a unit square by estimating the projective transformation $H$ (planar homography) mapping the former to the latter [18]. Using the trapezoid's edge norms $\|v\|$, with $v = f_{j+1} - f_j$, we calculate the average opposite edge distances and use them to scale the unit square to a rectangle, after translating it for their centroids to align. Then, we rotate and translate the rectangle to align with the original trapezoid using orthogonal Procrustes analysis [42]. Finally, the rectangle gets lifted to a cuboid using the vertical ($z$) ceiling and floor coordinates. The resulting cuboid $C$ is used as the final block of our model to ensure full Manhattan alignment in an end-to-end manner (see the supplementary material and [56]).
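The full cuboid alignment head is described above and in Figure 5; as a condensed sketch of its core, the following fits an oriented rectangle to an ordered floor-view trapezoid using averaged opposite edge lengths and orthogonal Procrustes [42]. It is our own simplification: the homography normalization and the floor/ceiling bookkeeping are omitted, and `fit_rectangle` is a hypothetical name.

```python
import torch

def fit_rectangle(floor: torch.Tensor) -> torch.Tensor:
    """Fit an oriented rectangle to a (4, 2) floor-view trapezoid whose
    corners are assumed consistently ordered."""
    edges = floor.roll(-1, dims=0) - floor              # consecutive edge vectors
    lengths = edges.norm(dim=-1)
    w = (lengths[0] + lengths[2]) / 2                   # average opposite edge pairs
    h = (lengths[1] + lengths[3]) / 2
    # Axis-aligned rectangle with the averaged side lengths, centered at the origin.
    rect = torch.tensor([[-1.0, -1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])
    rect = rect * torch.stack([w, h]) / 2
    centroid = floor.mean(dim=0)
    # Orthogonal Procrustes: the rotation best aligning the rectangle to the trapezoid.
    u, _, vh = torch.linalg.svd(rect.t() @ (floor - centroid))
    r = u @ vh
    if torch.det(r) < 0:                                # keep a pure rotation
        u[:, -1] = -u[:, -1]
        r = u @ vh
    return rect @ r + centroid                          # aligned rectangle corners
```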
We supervise the junction angular coordinates using the geodesic distance of Eq. (7):

$$ L_G = \frac{1}{J} \sum_j g(c_m^j, \hat{c}_m^j), \tag{8} $$

with $c_m^j$ and $\hat{c}_m^j$ being the groundtruth and predicted coordinates. The geodesic distance smoothly handles the continuous boundary and provides a more appropriate distance metric on the sphere than the equirectangular projection. We additionally supervise the spatially normalized heatmaps $H^j = \operatorname{spatial\_softmax}(M^j)$ predicted by our model with the Kullback-Leibler divergence:

$$ L_D = \sum_j D_{KL}\big(H^j \,\|\, \tilde{G}(c_m^j)\big), \tag{9} $$

where $\tilde{G}(\cdot)$ is the spatially normalized geodesic heatmap $G(\cdot)$. Apart from regularizing the predicted heatmaps, this loss allows for stable end-to-end training with the cuboid alignment transform, as pure coordinate supervision destabilized the model during early training, which prevented convergence as a consequence of the double solve required in the homography and Procrustes analysis. Our final loss is defined as:

$$ L = \sum_{n=1}^{N} \frac{\lambda_G}{N} L_G^n + \frac{\lambda_D}{N} L_D^n, \tag{10} $$

with $\lambda_G$ and $\lambda_D$ being weighting factors between the geodesic distance and KL losses, applied to each of the $N$ hourglass predictions.

The higher-level SH architecture allows for global processing without relying on heavy bottlenecks [68], computationally expensive feature fusion [60], or recurrent models [47]. It also requires no postprocessing, as it can produce a Manhattan-aligned layout in a single shot with high accuracy, albeit operating at lower than typical resolutions.
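For illustration, Eqs. (7)-(10) for a single hourglass can be sketched as follows (our own code; the default weights below are placeholders, not the tuned $\lambda_G$, $\lambda_D$):

```python
import torch
import torch.nn.functional as F

def geodesic(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Haversine geodesic distance (Eq. 7) between (..., 2) angle pairs
    (phi, theta), with theta a latitude in [-pi/2, pi/2]."""
    dphi, dtheta = a[..., 0] - b[..., 0], a[..., 1] - b[..., 1]
    hav = (torch.sin(dtheta / 2) ** 2
           + torch.cos(a[..., 1]) * torch.cos(b[..., 1]) * torch.sin(dphi / 2) ** 2)
    return 2 * torch.arcsin(torch.sqrt(hav.clamp(0.0, 1.0)))

def layout_loss(pred, gt, heatmaps, gt_heatmaps, lambda_g=1.0, lambda_d=0.1):
    """Eq. (8) mean geodesic corner error plus Eq. (9) KL divergence between
    the spatially softmaxed prediction and the normalized geodesic target."""
    l_g = geodesic(pred, gt).mean()                         # Eq. (8)
    b, j, h, w = heatmaps.shape
    log_h = F.log_softmax(heatmaps.view(b, j, -1), dim=-1)  # spatial softmax
    target = gt_heatmaps.view(b, j, -1)
    target = target / target.sum(dim=-1, keepdim=True)      # normalized G~
    l_d = F.kl_div(log_h, target, reduction="batchmean")    # Eq. (9)
    return lambda_g * l_g + lambda_d * l_d                  # Eq. (10), one stack
```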
4. Results
The input to our model is a single upright (i.e., horizontal floor) spherical panorama.¹ We use the same feature width for each hourglass's residual block, predict heatmaps at a reduced working resolution, and initialize our SH model using [19]. We train with the Adam [26] optimizer and no weight decay, keeping default values for its remaining parameters.

¹Traditional [68, 23] or data-driven methods [24] can be used.
Figure 5: Starting from quasi-Manhattan corner estimates, these first get deprojected ($K^{-1}$) to 3D coordinates. Then, keeping only the horizontal coordinates ($F$), we get a floor-view trapezoid, which, depending on the measurement and coordinates (floor/ceiling) our projection operated on, is slightly different (cyan for the ceiling, and blue for the floor). Using these floor-view horizontal coordinates, we estimate a homography $H$ to transform them to an axis-aligned unit square. This gets translated and scaled ($S$) using the average opposite edge lengths and the centroid of the original untransformed floor-view coordinates. An orthogonal Procrustes analysis ($O$) is used to align the rectangle to the trapezoid, which then gets lifted to a cuboid ($Q$) using the original heights, taking into account the quasi-Manhattan alignment of our estimates. The cuboid's 3D coordinates then get projected ($K$) back to equirectangular domain corners. Apart from the ceiling and floor starting corners, we also consider a joint approach where the horizontal floor-view coordinates get averaged from both 3D estimates before proceeding to estimate the homography. For this approach to work, we rescale the ceiling coordinates so that their camera-to-floor distances align, therefore removing any scale difference caused by the camera's position deviating from the true center.

Further, after an empirical greedy search, we use a fixed angular standard deviation α = 2° for our geodesic distribution reconstructions and a fixed s for the isotropic Gaussian ones, both created using the encoding of [62], and we set the loss weight λ_G = 1.0 with a smaller λ_D. For cuboid alignment we use the joint approach and a fixed camera-to-floor distance. We implement our models using PyTorch [36, 12], setting the same seed for all random number generators; each parameter update uses gradients accumulated over multiple samples.

We apply heavy data augmentation during training, as established in prior work [69, 47, 15]. Apart from photometric augmentations (random brightness, contrast, and gamma [2]), following [15] we further apply random erasing, with a uniformly random number of blocks erased per sample. We also probabilistically apply a set of 360° panorama-specific augmentations in a cascaded manner: i) uniformly random horizontal rotations spanning the full angle range, ii) left-right flipping, and iii) PanoStretch augmentations [47] using the default stretching ratio ranges; a simplified sketch follows. Each augmentation is applied with a fixed probability.
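A simplified sketch of the first two panorama-specific augmentations (our own illustration; PanoStretch [47] is omitted, and the function name and probability are hypothetical):

```python
import math
import random
import torch

def panorama_augment(pano: torch.Tensor, corners: torch.Tensor, p: float = 0.5):
    """pano is (C, H, W); corners are (J, 2) angles (phi, theta), phi in [0, 2*pi)."""
    c, h, w = pano.shape
    if random.random() < p:  # horizontal rotation: a circular shift of columns
        shift = random.randrange(w)
        pano = torch.roll(pano, shifts=shift, dims=-1)
        corners = corners.clone()
        corners[:, 0] = (corners[:, 0] + 2 * math.pi * shift / w) % (2 * math.pi)
    if random.random() < p:  # left-right flip mirrors the longitude
        pano = torch.flip(pano, dims=[-1])
        corners = corners.clone()
        corners[:, 0] = (2 * math.pi - corners[:, 0]) % (2 * math.pi)
        # Note: flipping reverses the corner ordering, which may need re-sorting.
    return pano, corners
```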
Figure 6: Qualitative results on the PanoContext (top) and Stanford2D3D (bottom) datasets. On each panorama, we overlay the reconstructed layout from the groundtruth (red) and predicted (blue) junctions. The next row showcases the overlaid aggregated heatmap predictions, with the following one illustrating the resulting 3D mesh. Finally, two orthographic floor views are presented, showing the full Manhattan (left) and quasi-Manhattan aligned (right) estimations.

Prior work up to now has experimented with small-scale datasets. PanoContext [64] manually annotated panoramas from the Sun360 dataset [57] as cuboids. Additionally, LayoutNet manually annotated panoramas from the Stanford2D3D dataset [1], which are not complete spherical images as their vertical FoV is narrower. Similar to previous works, we use the common train, test and validation splits as used in [15] and [68] for the PanoContext and Stanford2D3D datasets respectively. Taking into account their small scale, we jointly consider them as a single real dataset and train all our models on it.

More recently, layout annotations have been provided in newer computer-generated datasets: the Kujiale dataset used in [28] and the Structured3D dataset [65]. Albeit synthetic, they offer a much more expanded data corpus than what is currently available for real datasets. Given their synthetic nature, these datasets offer different room styles for the same scene. In particular, they provide empty rooms as well as rooms filled with furniture by interior designers. For the Kujiale dataset we use both types of scenes, while for Structured3D we only use full scenes, and we follow their respective official dataset splits, training a separate model on each.

For the quantitative assessment of our approach against prior works, we use a set of standard metrics found in the literature [69], complemented by another set of accuracy metrics. The standard metrics include 2D and 3D intersection over union (IoU2D and IoU3D), normalized corner error (CE), pixel error (PE), and the depth-based RMSE and δ accuracy [10]. For all 3D calculations a fixed floor distance is used. We also use junction ($J_d$) and wireframe ($W_d$) accuracy metrics, defined as correct when the closest groundtruth junction or line segment, respectively, is within a pixel threshold $d$; we evaluate at three increasing thresholds, starting at $d = 5$ pixels. Finally, since we regress sub-pixel coordinates, all metric calculations are evaluated at a fixed panorama resolution, and the arrows next to each metric denote the direction of better performance.
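For reference, the junction accuracy $J_d$ can be sketched as below (ours); treating the horizontal wrap-around of the panorama in the pixel distance is our assumption, and $W_d$ is analogous with point-to-segment distances:

```python
import torch

def junction_accuracy(pred: torch.Tensor, gt: torch.Tensor, d: float, width: int) -> float:
    """J_d: fraction of predicted junctions whose closest groundtruth junction
    is within d pixels. pred/gt: (J, 2) pixel coordinates (u, v)."""
    du = (pred[:, None, 0] - gt[None, :, 0]).abs()
    du = torch.minimum(du, width - du)          # shorter way around the panorama
    dv = pred[:, None, 1] - gt[None, :, 1]
    dist = torch.sqrt(du ** 2 + dv ** 2)        # pairwise predicted-to-GT distances
    return (dist.min(dim=1).values <= d).float().mean().item()
```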
First, we focus on the latest results reported in [69], where three data-driven cuboid panoramic layout estimation methods ([68, 60, 47]) were adapted for a fairer comparison. Similar to [69], we train a three-stack (HG-3) single-shot cuboid (SSC) model using the real dataset. We present results tested on the real (combined and single) datasets in Table 1, where our model compares favorably with the state of the art,¹ offering robust performance and end-to-end Manhattan-aligned estimates, a trait no other state-of-the-art method currently offers. For these results, we report the same metrics as those reported in [69]. Furthermore, Figure 6 presents a set of qualitative results for our HG-3 model on these two datasets.

With the recent availability of large-scale synthetic datasets, we additionally train a model using Structured3D [65]. Since only HorizonNet offers a pretrained model using the same data, we present results on the Structured3D test set for two HorizonNet variants and our model in Table 2. Apart from the standard model that includes postprocessing, we also assess a single-shot variant of HorizonNet. For this, we only perform peak detection on the predicted wall-to-wall boundary vector and directly sample the heights at the detected peaks to reconstruct the layout. While this saves an amount of processing, the postprocessing scheme used by HorizonNet improves the results when applied to Structured3D's test set. On the other hand, our model produces accurate layout corner estimates without any postprocessing. While SSC outperforms HorizonNet in the established metrics, HorizonNet offers higher accuracy in the junction and wireframe metrics. This is also the case for the cross-validation experiment that we present in Table 3, where we test the models trained using Structured3D on the test set of Kujiale, using only the full rooms. The difference in this setting is that the single-shot variant of HorizonNet provides more accurate layout estimates than the postprocessed one. This exposes the weakness of postprocessing approaches, which require empiric or heuristic tuning. Nonetheless, this HorizonNet model is trained for general layout estimation, and the performance deviation might be related to this extra trait. Qualitative results of our end-to-end model on both synthetic datasets are presented in Figure 7.

We also perform an ablation study across all datasets. Tables 4, 2 and 5 present the results on the real and synthetic datasets.² Our baseline is the model as presented in Section 3.3 without the end-to-end Manhattan alignment homography module (Section 3.3.3), but with the quasi-Manhattan alignment (Section 3.3.2) offered by aligning the longitude of top and bottom corners. Apart from adding the end-to-end Manhattan alignment module, we also ablate the effect of the geodesic heatmaps and loss (Section 3.2), the SH model adaptation (spherical padding, preactivated residual blocks, and anti-aliased max-pooling; Section 3.3.1), and the quasi-Manhattan alignment itself, by training a model with unrestricted, traditional (i.e., not spherical as presented in Section 3.1) CoM calculation for each corner.

These offer a number of insights. While the end-to-end model provides the most robust performance across all datasets, its performance is uncontested in the IoU and depth related metrics. However, on the remaining projective metrics, the unrestricted coordinate regression approaches usually perform better. This is reasonable, as the homography fits a cuboid on the predictions, while the un-/semi-constrained approaches can freely localise the corners, even though at the expense of unnatural/non-Manhattan outputs, which manifests as an IoU3D drop. Overall, we observe that the addition of explicit Manhattan constraints (quasi and homography-based) offers increased performance compared to directly regressing the corners. The same applies to the spherical components (periodic CoM and geodesics) and the model adaptation, which consistently increase performance.

We also ablate the three approaches (floor/ceiling/joint) that use different starting coordinates for the homography estimation in Tables 4 and 5. We find that the joint approach produces higher quality results, as it enforces the top and bottom predictions to be consistent with each other. This way, the cuboid misalignment errors are backpropagated to all corner estimates through the homography.

¹Best three performances are denoted with bold red, orange and yellow.
²Our supplement offers results for each of the real datasets.
Table 1: Quantitative results on the real datasets: PanoContext (PC), Stanford2D3D (S2D3D), and their combination (Comb.).

| Name | Variant | Params ↓ | CE ↓ (PC) | IoU3D ↑ (PC) | PE ↓ (PC) | CE ↓ (S2D3D) | IoU3D ↑ (S2D3D) | PE ↓ (S2D3D) | CE ↓ (Comb.) | IoU3D ↑ (Comb.) | PE ↓ (Comb.) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LayoutNet v2 | ResNet-18 | 15.57M | 0.65% | 84.13% | 1.92% | 0.77% | 83.53% | 2.30% | 0.71% | 83.83% | 2.11% |
| LayoutNet v2 | ResNet-34 | 25.68M | – | – | – | – | – | – | – | – | – |
| LayoutNet v2 | ResNet-50 | 91.50M | 0.75% | 82.44% | 2.22% | 0.83% | 82.66% | 2.59% | 0.79% | 82.55% | 2.41% |
| DuLa-Net v2 | ResNet-18 | 25.64M | 0.83% | 82.43% | 2.55% | 0.74% | 84.93% | 2.56% | 0.79% | 83.68% | 2.56% |
| DuLa-Net v2 | ResNet-34 | 45.86M | 0.82% | 83.41% | 2.54% | 0.66% | 86.45% | 2.43% | 0.74% | 84.93% | 2.49% |
| DuLa-Net v2 | ResNet-50 | 57.38M | 0.81% | 83.77% | 2.43% | 0.67% | 86.6% | 2.48% | 0.74% | 85.19% | 2.46% |
| HorizonNet | ResNet-18 | 23.49M | 0.83% | 80.27% | 2.44% | 0.82% | 80.59% | 2.72% | 0.83% | 80.43% | 2.58% |
| HorizonNet | ResNet-34 | 33.59M | 0.76% | 81.30% | 2.22% | 0.78% | 80.44% | 2.65% | 0.77% | 80.87% | 2.44% |
| HorizonNet | ResNet-50 | 81.57M | 0.74% | 82.63% | 2.17% | 0.69% | 82.72% | 2.27% | 0.72% | 82.68% | 2.22% |
| SSC | HG-3 | – | – | – | – | – | – | – | – | – | – |
Table 2: Quantitative results and ablation on the synthetic Structured3D dataset (J and W are reported at three increasing pixel thresholds d).
| Model | Variant | CE ↓ | IoU2D ↑ | IoU3D ↑ | PE ↓ | J ↑ | J ↑ | J ↑ | W ↑ | W ↑ | W ↑ | RMSE ↓ | δ ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HNet | Single-Shot | 0.57% | 93.10% | 91.17% | 1.53% | 78.20% | 90.69% | 95.09% | – | – | – | – | – |
| SSC | w/ Homography (joint) | 0.40% | – | – | – | – | – | – | – | – | – | – | – |
Table 3: Cross-validation results on the Kujiale dataset using the Structured3D-trained model.
| Model | Variant | CE ↓ | IoU2D ↑ | IoU3D ↑ | PE ↓ | J ↑ | J ↑ | J ↑ | W ↑ | W ↑ | W ↑ | RMSE ↓ | δ ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HNet | Single-Shot | 0.61% | 91.68% | 89.53% | 1.83% | – | – | – | – | – | – | – | – |
Table 4: Ablation study on the real dataset (columns as in Table 2).

| Variant | CE ↓ | IoU2D ↑ | IoU3D ↑ | PE ↓ | J ↑ | J ↑ | J ↑ | W ↑ | W ↑ | W ↑ | RMSE ↓ | δ ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Quasi-Manhattan | – | – | – | – | – | – | – | – | – | – | – | – |
| w/ Homography (floor) | 0.68% | 88.25% | – | – | – | – | – | – | – | – | – | – |

Table 5: Ablation study on the synthetic Kujiale dataset (columns as in Table 2).

| Variant | CE ↓ | IoU2D ↑ | IoU3D ↑ | PE ↓ | J ↑ | J ↑ | J ↑ | W ↑ | W ↑ | W ↑ | RMSE ↓ | δ ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Quasi-Manhattan | – | – | – | – | – | – | – | – | – | – | – | – |
| w/ Homography (ceil) | 0.56% | 90.92% | 88.55% | 1.78% | 61.68% | 80.82% | 89.59% | 35.55% | 60.42% | 72.48% | 0.0925 | 97.13% |
| w/o Geodesics | 0.59% | 90.81% | 88.31% | 1.81% | 59.55% | 79.36% | 89.64% | 27.64% | 56.39% | 71.24% | 0.0998 | 96.92% |
| w/o Model Adaptation | 0.59% | 90.42% | 87.52% | 1.82% | 61.36% | 79.14% | 88.68% | 29.61% | 57.27% | 70.82% | 0.1026 | 96.65% |
| w/o Quasi-Manhattan | 0.54% | 90.92% | 88.42% | 1.73% | 62.59% | 80.91% | 90.36% | 33.03% | 59.33% | 73.36% | 0.0962 | 97.07% |

5. Conclusion

Our work has focused on keypoint estimation on the sphere, and in particular on layout corner estimation. Through coordinate regression we integrate explicit constraints into our model. Moreover, while we have also shown that end-to-end single-shot layout estimation is possible, our approach is rigid, as it is based on a frequent and logical assumption: that the underlying room is, or can be approximated by, a cuboid. Nonetheless, this rigidity comes from the structured predictions that CNNs enforce, with the number of heatmaps that will be predicted being strictly defined at the design phase. Future work should try to address this limitation to fully exploit the potential that single-shot approaches offer, mainly stemming from end-to-end supervision. Finally, as with all prior layout estimation works, predictions are up to a scale, which hinders applicability. Even so, structured scene layout estimation is an important task that can even be used as an intermediate task to improve other tasks, as shown in [28]. With metric scale inference, it has the potential for significant interplay with other 3D vision tasks like depth or surface estimation.
Supplement
Supplementary material, including additional ablation experiments and qualitative results, is appended after the references.
Acknowledgements
This work was supported by the EC funded H2020project ATLANTIS [GA 951900].
References

[1] Iro Armeni, Sasha Sax, Amir R. Zamir, and Silvio Savarese. Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
[2] Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A. Kalinin. Albumentations: fast and flexible image augmentations. Information, 11(2):125, 2020.
[3] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In International Conference on 3D Vision (3DV), pages 667–676. Institute of Electrical and Electronics Engineers Inc., 2018.
[4] Hsien-Tzu Cheng, Chun-Hung Chao, Jin-Dong Dong, Hao-Kai Wen, Tyng-Luh Liu, and Min Sun. Cube padding for weakly-supervised saliency prediction in 360 videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1420–1429, 2018.
[5] Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical cnns. In International Conference on Learning Representations, 2018.
[6] Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 518–533, 2018.
[7] Thiago L. T. da Silveira and Claudio R. Jung. Dense 3d scene reconstruction from multiple spherical images for 3-dof+ vr applications. In IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 9–18. IEEE, 2019.
[8] Michaël Defferrard, Nathanaël Perraudin, Tomasz Kacprzak, and Raphael Sgier. Deepsphere: towards an equivariant graph-based spherical cnn. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[9] Marc Eder, True Price, Thanh Vu, Akash Bapat, and Jan-Michael Frahm. Mapped convolutions. arXiv preprint arXiv:1906.11096, 2019.
[10] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[11] Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning so(3) equivariant representations with spherical cnns. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[12] W. A. Falcon. PyTorch Lightning. GitHub. Note: https://github.com/PyTorchLightning/pytorch-lightning, 2019.
[13] Zhen-Hua Feng, Josef Kittler, Muhammad Awais, Patrik Huber, and Xiao-Jun Wu. Wing loss for robust facial landmark localisation with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2235–2245, 2018.
[14] Clara Fernandez-Labrador, José Fácil, Alejandro Perez-Yus, Cédric Demonceaux, and Jose Guerrero. Panoroom: From the sphere to the 3d layout. In ECCV 2018 Workshops, 2018.
[15] Clara Fernandez-Labrador, Jose M. Facil, Alejandro Perez-Yus, Cédric Demonceaux, Javier Civera, and Jose J. Guerrero. Corners for layout: End-to-end layout recovery from 360 images. IEEE Robotics and Automation Letters, 5(2):1255–1262, 2020.
[16] Clara Fernandez-Labrador, Alejandro Perez-Yus, Gonzalo Lopez-Nicolas, and Jose J. Guerrero. Layouts from panoramic images with geometry and deep learning. IEEE Robotics and Automation Letters, 3(4):3153–3160, 2018.
[17] Kosuke Fukano, Yoshihiko Mochizuki, Satoshi Iizuka, Edgar Simo-Serra, Akihiro Sugimoto, and Hiroshi Ishikawa. Room reconstruction from a single spherical image by higher-order energy minimization. In International Conference on Pattern Recognition (ICPR), pages 1768–1773. IEEE, 2016.
[18] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[23] Jinwoong Jung, Beomseok Kim, Joon-Young Lee, Byungmoon Kim, and Seungyong Lee. Robust upright adjustment of 360 spherical panoramas. The Visual Computer, 33(6-8):737–747, 2017.
[24] Raehyuk Jung, Aiden Seuna Joon Lee, Amirsaman Ashtari, and Jean-Charles Bazin. Deep360up: A deep learning-based approach for automatic vr image upright adjustment. In IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 1–8. IEEE, 2019.
[25] Renata Khasanova and Pascal Frossard. Graph-based classification of omnidirectional images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 869–878, 2017.
[26] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] Chen-Yu Lee, Vijay Badrinarayanan, Tomasz Malisiewicz, and Andrew Rabinovich. Roomnet: End-to-end room layout estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4865–4874, 2017.
[28] Jin Lei, Xu Yanyu, Zheng Jia, Zhang Junfei, Tang Rui, Xu Shugong, Yu Jingyi, and Gao Shenghua. Geometric structure based and regularized depth estimation from 360 indoor imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[29] Mingyang Li, Yi Zhou, Ming Meng, Yuehua Wang, and Zhong Zhou. 3d room reconstruction from a single fisheye image, pages 1–8. IEEE, 2019.
[30] Niantao Liu, Bingxian Lin, Linwang Yuan, Guonian Lv, Zhaoyuan Yu, and Liangchen Zhou. An interactive indoor 3d reconstruction method based on conformal geometry algebra. Advances in Applied Clifford Algebras, 28(4):73, 2018.
[31] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems, pages 9605–9616, 2018.
[32] Diogo C. Luvizon, Hedi Tabia, and David Picard. Human pose regression by combining indirect part detection and contextual information. Computers & Graphics, 85:15–22, 2019.
[33] Rafael Monroy, Sebastian Lutz, Tejo Chalasani, and Aljosa Smolic. Salnet360: Saliency maps for omni-directional images with cnn. Signal Processing: Image Communication, 69:26–34, 2018.
[34] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
[35] Aiden Nibali, Zhen He, Stuart Morgan, and Luke Prendergast. Numerical coordinate regression with convolutional neural networks. arXiv preprint arXiv:1801.07372, 2018.
[36] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[37] Giovanni Pintore, Fabio Ganovelli, Ruggero Pintus, Roberto Scopigno, and Enrico Gobbetti. 3d floor plan recovery from overlapping spherical images. Computational Visual Media, 4(4):367–383, 2018.
[38] Giovanni Pintore, Fabio Ganovelli, Alberto Jaspe Villanueva, and Enrico Gobbetti. Automatic modeling of cluttered multi-room floor plans from panoramic images. In Computer Graphics Forum, volume 38, pages 347–358. Wiley Online Library, 2019.
[39] Giovanni Pintore, Valeria Garro, Fabio Ganovelli, Enrico Gobbetti, and Marco Agus. Omnidirectional image capture on mobile devices for fast automatic generation of 2.5d indoor maps. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
[40] Giovanni Pintore, Claudio Mura, Fabio Ganovelli, Lizeth Fuentes-Perez, Renato Pajarola, and Enrico Gobbetti. State-of-the-art in automatic 3d reconstruction of structured indoor environments. STAR, 39(2), 2020.
[41] Giovanni Pintore, Ruggero Pintus, Fabio Ganovelli, Roberto Scopigno, and Enrico Gobbetti. Recovering 3d existing-conditions of indoor structures from spherical images. Computers & Graphics, 77:16–29, 2018.
[42] Peter H. Schönemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31(1):1–10, 1966.
[43] Shuran Song and Thomas Funkhouser. Neural illumination: Lighting prediction for indoor environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6918–6926, 2019.
[44] Shuran Song, Andy Zeng, Angel X. Chang, Manolis Savva, Silvio Savarese, and Thomas Funkhouser. Im2pano3d: Extrapolating 360 structure and semantics beyond the field of view. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3847–3856, 2018.
[45] Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360 imagery. In Advances in Neural Information Processing Systems, pages 529–539, 2017.
[46] Yu-Chuan Su and Kristen Grauman. Kernel transformer networks for compact spherical convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9442–9451, 2019.
[47] Cheng Sun, Chi-Wei Hsiao, Min Sun, and Hwann-Tzong Chen. Horizonnet: Learning room layout with 1d representation and pano stretch data augmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1047–1056, 2019.
[48] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 529–545, 2018.
[49] Yule Sun, Ang Lu, and Lu Yu. Weighted-to-spherically-uniform quality evaluation for omnidirectional video. IEEE Signal Processing Letters, 24(9):1408–1412, 2017.
[50] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. Inloc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7199–7209, 2018.
[51] Keisuke Tateno, Nassir Navab, and Federico Tombari. Distortion-aware convolutional filters for dense prediction in panoramic images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 707–722, 2018.
[52] C. Tensmeyer and T. Martinez. Robust keypoint detection, volume 5, pages 1–7, 2019.
[53] Rafael Grompone Von Gioi, Jérémie Jakubowicz, Jean-Michel Morel, and Gregory Randall. Lsd: a line segment detector. Image Processing On Line, 2:35–55, 2012.
[54] Tsun-Hsuan Wang, Hung-Jui Huang, Juan-Ting Lin, Chan-Wei Hu, Kuo-Hao Zeng, and Min Sun. Omnidirectional cnn for visual place recognition and navigation. In IEEE International Conference on Robotics and Automation (ICRA), pages 2341–2348. IEEE, 2018.
[55] Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9068–9079, 2018.
[56] Jianxiong Xiao. 3d geometry for panorama, 2012.
[57] Jianxiong Xiao, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. Recognizing scene viewpoint using panoramic place representation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2695–2702. IEEE, 2012.
[58] Jiu Xu, Björn Stenger, Tommi Kerola, and Tony Tung. Pano2cad: Room layout from a single panorama image. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 354–362. IEEE, 2017.
[59] Hao Yang and Hui Zhang. Efficient 3d room shape recovery from a single panorama. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5422–5430, 2016.
[60] Shang-Ta Yang, Fu-En Wang, Chi-Han Peng, Peter Wonka, Min Sun, and Hung-Kuo Chu. Dula-net: A dual-projection network for estimating room layouts from a single rgb panorama. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3363–3372, 2019.
[61] Yang Yang, Shi Jin, Ruiyang Liu, Sing Bing Kang, and Jingyi Yu. Automatic 3d indoor scene modeling from single panorama. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3926–3934, 2018.
[62] Feng Zhang, Xiatian Zhu, Hanbin Dai, Mao Ye, and Ce Zhu. Distribution-aware coordinate representation for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7093–7102, 2020.
[63] Richard Zhang. Making convolutional networks shift-invariant again. In International Conference on Machine Learning, pages 7324–7334, 2019.
[64] Yinda Zhang, Shuran Song, Ping Tan, and Jianxiong Xiao. Panocontext: A whole-room 3d context model for panoramic scene understanding. In European Conference on Computer Vision, pages 668–686. Springer, 2014.
[65] Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[66] Nikolaos Zioulis, Antonis Karakottas, Dimitrios Zarpalas, Federico Alvarez, and Petros Daras. Spherical view synthesis for self-supervised 360° depth estimation. In International Conference on 3D Vision (3DV), pages 690–699. IEEE, 2019.
[67] Nikolaos Zioulis, Antonis Karakottas, Dimitrios Zarpalas, and Petros Daras. Omnidepth: Dense depth estimation for indoors spherical panoramas. In Proceedings of the European Conference on Computer Vision (ECCV), pages 448–465, 2018.
[68] Chuhang Zou, Alex Colburn, Qi Shan, and Derek Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2051–2059, 2018.
[69] Chuhang Zou, Jheng-Wei Su, Chi-Han Peng, Alex Colburn, Qi Shan, Peter Wonka, Hung-Kuo Chu, and Derek Hoiem. 3d manhattan room layout reconstruction from a single 360 image. arXiv preprint arXiv:1910.04099, 2019.

Supplementary Material

In this supplementary material we present additional information regarding runtime and floating point operations, with the data offered in Table 6 and illustrated in Figure 8. Apart from the models presented in the main document, we also add efficient CFL models for completeness. In addition, we provide evaluation results for the Stanford2D3D and PanoContext datasets separately, in Tables 7 and 8 respectively. Further, in Tables 9, 10, and 11, we offer a decomposed model ablation for the Stanford2D3D, the PanoContext, and both datasets (averaged) respectively, where each individual component is ablated (namely, preactivated bottlenecks, spherical padding, and anti-aliased max pooling). The preactivated residual blocks offer the larger gains, followed by the padding and finally the anti-aliased max pooling. Nonetheless, each component contributes to increased performance, with their combined effect being the most significant, as observed by the model without all of these components together. Figures 9, 10, 11 and 12 present additional qualitative results of our single-shot, end-to-end Manhattan-aligned layout estimation model using the joint homography head module on the Stanford2D3D, PanoContext, Structured3D and Kujiale datasets respectively. Finally, Figures 13 and 14 present the qualitative samples from the real and synthetic datasets respectively, which are included in the main manuscript, in animated 3D views (these can only be viewed in recent Adobe Acrobat Reader versions).