Flat2Layout: Flat Representation for Estimating Layout of General Room Types
Chi-Wei Hsiao, Cheng Sun, Min Sun, Hwann-Tzong Chen
National Tsing Hua University
{chiweihsiao, chengsun}@gapp.nthu.edu.tw, [email protected], [email protected]

Figure 1: Sample 3D room layouts reconstructed by our Flat2Layout.
Abstract
This paper proposes a new approach, Flat2Layout, for estimating general indoor room layout from a single-view RGB image, whereas existing methods can only produce layout topologies of box-shaped rooms. The proposed flat representation encodes the layout information into row vectors which are treated as the training target of the deep model. A dynamic-programming-based post-processing is employed to decode the estimated flat output from the deep model into the final room layout. Flat2Layout achieves state-of-the-art performance on existing room layout benchmarks. This paper also constructs a benchmark for validating the performance on general layout topologies, on which Flat2Layout also performs well. Flat2Layout is applicable to more scenarios for layout estimation and would have an impact on applications in scene modeling, robotics, and augmented reality.
1. Introduction
Estimating room layout is a fundamental indoor scene understanding problem with applications to a wide range of tasks such as scene reconstruction [16], indoor localization [1], and augmented reality. Given a single-view RGB image, the layout estimation task is to delineate the wall-ceiling, wall-floor, and wall-wall boundaries. Existing works only target special cases of room layouts that comprise at most five planes (i.e., ceiling, floor, left wall, front wall, and right wall).

Previous deep learning based methods [3, 18, 20] typically predict 2D per-pixel edge maps or segmentation maps (i.e., ceiling, floor, left, front, and right), followed by classic vanishing point/line sampling methods to produce room layouts. However, none of these methods can directly handle non-box-shaped room layout topologies. For instance, more segmentation labels would have to be defined in the framework proposed by Dasgupta et al. [3] to generate a layout for a room that contains more than three walls. In addition, these methods depend heavily on the accuracy of the extraction of the three mutually orthogonal vanishing points, which sometimes fails due to misleading texture.

We propose Flat2Layout, a layout estimation approach that works directly for general topologies of room layouts without the need to extract horizontal vanishing points. The overall pipeline includes vertical-axis image rectification, prediction of the proposed flat output representation with the proposed deep model, and an efficient post-processing procedure based on dynamic programming.

We summarize our contributions as follows:

• We design a flat target output representation that can be efficiently decoded to a layout (i.e., corners or plane segmentation) with an intuitive and effective post-processing procedure based on dynamic programming.

• Our approach achieves state-of-the-art performance on the existing datasets with processing time as low as 500ms per frame (there is still room for speedup).

• Our method applies not only to the typical 11 types defined in the LSUN [28] dataset but also to more complex room layout topologies. We quantitatively and qualitatively demonstrate that our approach is capable of recovering room layouts of general types from a single-view RGB image.
2. Related Work
Single-view room layout estimation has been an active task over the past decade. Hedau et al. [9] first defined the problem as using a cuboid-shaped box to approximate the 3D layout of indoor scenes. In their method, many box layout proposals are generated by sampling rays from the three orthogonal vanishing points. They then rank the candidates with a structured SVM trained on images annotated with surface labels {left wall, front wall, right wall, ceiling, object}. Using a similar framework, many works explored different methods from the aspects of proposal generation [23, 22, 19] and inference procedure [15, 10, 23, 19]. Recently, per-pixel 2D target output representations have been widely adopted in deep-learning based approaches. Mallya et al. [18] and Ren et al. [20] predict edge maps with an FCN-based model and infer the layout with vanishing points or lines, following the widely used proposal-ranking scheme. Dasgupta et al. [3] proposed a similar approach that estimates surface label heatmaps instead of edge maps. Zhao et al. [30] alternatively trained their model on a large-scale semantic segmentation dataset and then transferred semantic features to edge maps. RoomNet [14] encodes the ordered room layout keypoint locations into 48 keypoint heatmaps for 11 room types. Although this representation can be decoded easily, it requires data with predefined room types and annotations of ordered corners for the keypoint heatmaps. Unfortunately, none of the existing methods for single-view layout estimation is designed for layouts of general room types. We solve this problem with our proposed flat layout representation and dynamic-programming-based post-processing, which do not rely on predefined layout topologies.

Several works target indoor layout estimation for panoramic images. Zou et al. [32] predict per-pixel corner probability maps and boundary maps, as in the perspective-image case. Yang et al. [27] combine surface semantics with ceiling and floor views. Sun et al. [25] encode boundaries and wall-wall existence in their "1D representation" to recover layouts from 360° panoramas. Sun et al. [25] is the method most related to ours. However, their representation is only suitable for 360° panoramic images, in which the ceiling-wall and floor-wall boundaries exist for every column under the equirectangular projection. Their post-processing assumes the layout forms a closed loop, which is usually true in panoramic images but not the case for perspective images.
3. Approach
3.1. Pre-processing

In the context of 360° panoramas, aligning the input image by three orthogonal vanishing points in the pre-processing phase is a common practice for layout estimation [32, 6, 27, 25]. On the other hand, existing works for perspective-image layout estimation only use vanishing point information in the post-processing phase [20, 3, 30, 29]. To the best of our knowledge, we are the first to exploit vanishing point information in the pre-processing phase (before the deep model) for the single perspective image layout estimation task.

To facilitate our flat room-layout representation (Sec. 3.2), we want to rectify images such that all wall-wall boundaries are parallel to the image Y axis. This requirement can be easily achieved by detecting the vertical vanishing point (vp_Z) and constructing a homography that maps vp_Z to the point at infinity along the image Y-axis. To detect vp_Z, we extract line segments using LSD [26] and keep only line segments pointing close to vertical (within a fixed angular threshold of the vertical direction). The most voted point under RANSAC is taken as vp_Z. Fig. 2 depicts the effect of the pre-processing. We apply the same pre-processing to all training and testing images.

Figure 2: In pre-processing, we transform the source image (a) such that all wall-wall boundaries are parallel to the image Y-axis (b). The magenta line segments are the inliers of the vertical vanishing point (vp_Z). The red dotted lines connect the centers of the line segments and vp_Z.

3.2. Flat Room Layout Representation

We introduce a flat target output representation of the room layout that can be efficiently decoded to a layout. As illustrated in Fig. 3, our flat output representation comprises three row vectors, y_ceil, y_floor, and p_wall, of the same length as the image width, and two classifiers, p_ceil and p_floor. Each column of y_ceil represents the position of the ceiling-wall boundary at the corresponding column of the image as a scalar. In the same way, y_floor represents the position of the floor-wall boundary. p_wall indicates the existence of a wall-wall boundary at each column as a probability. The two classifiers p_ceil and p_floor indicate whether the input image contains a ceiling-wall boundary and a floor-wall boundary or not. We explain the motivation for designing these two classifiers in Sec. 4.5.

Figure 3: Visualization of our flat target output representation. p_wall denotes the existence of a wall-wall boundary. y_ceil and y_floor (plotted in green and blue) denote the positions of the ceiling-wall boundary and floor-wall boundary. Note that y_ceil, y_floor and p_wall are just three row vectors in effect.

The values of y_ceil and y_floor are normalized to [0, 1]. For a column i where the ceiling-wall or floor-wall boundary does not exist, we set y_ceil(i) to -0.01 or y_floor(i) to 1.01 respectively. Considering that assigning p_wall with 0/1 labels would result in a strongly class-imbalanced ground truth, we define p_wall(i) = 0.96^dx, where i indicates the i-th column and dx is the distance from the i-th column to the nearest wall-wall boundary.
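To make the encoding concrete, the sketch below builds the three row vectors and two classifier labels from annotated corners. This is our illustration, not the authors' released code: the piecewise-linear boundary interpolation, the helper names, and the handling of partially visible boundaries are assumptions layered on the description above.

```python
import numpy as np

def encode_flat_targets(W, H, ceil_pts, floor_pts, wall_xs, decay=0.96):
    """Encode layout annotations into the flat representation.

    W, H      : image width / height (after rectification).
    ceil_pts  : (x, y) ceiling-wall corners, left to right
                (empty if the ceiling-wall boundary is not visible).
    floor_pts : likewise for the floor-wall boundary.
    wall_xs   : x positions of the wall-wall boundaries.
    decay     : base of the exponential softening of p_wall.
    """
    cols = np.arange(W)

    def boundary_row(pts, fill):
        row = np.full(W, fill, dtype=np.float32)
        if len(pts) >= 2:
            xs, ys = np.array(pts, dtype=np.float64).T
            # Sample the piecewise-linear boundary at every column and
            # normalize to [0, 1]; columns outside the annotated span keep
            # the out-of-plane fill value (-0.01 or 1.01).
            inside = (cols >= xs.min()) & (cols <= xs.max())
            row[inside] = np.interp(cols[inside], xs, ys) / (H - 1)
        return row

    y_ceil = boundary_row(ceil_pts, fill=-0.01)
    y_floor = boundary_row(floor_pts, fill=1.01)

    # p_wall decays exponentially with the distance to the nearest wall-wall
    # boundary, avoiding the class imbalance of a hard 0/1 target.
    if len(wall_xs):
        dx = np.abs(cols[:, None] - np.asarray(wall_xs)[None, :]).min(axis=1)
        p_wall = (decay ** dx).astype(np.float32)
    else:
        p_wall = np.zeros(W, dtype=np.float32)

    p_ceil = float(len(ceil_pts) > 0)    # ceiling-wall boundary visible?
    p_floor = float(len(floor_pts) > 0)  # floor-wall boundary visible?
    return y_ceil, y_floor, p_wall, p_ceil, p_floor
```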
3.3. Network Architecture

An overview of the Flat2Layout network architecture is illustrated in Fig. 4. The Flat2Layout network takes a single RGB image with dimension 3 × 256 × 256 (channel, height, width) as input. Our network is built upon ResNet-50 [8], followed by five output branches, one for each of the five output targets defined in Sec. 3.2.

The two branches for the classifiers p_ceil and p_floor take the output feature maps from the last ResNet-50 block as input. Both branches consist of two convolution layers, a global average pooling layer which reduces the spatial dimension to 1×1, and one final fully-connected layer with a two-class softmax activation.

The structures of the three flat decoder branches for predicting y_ceil, y_floor, and p_wall are exactly the same, without sharing weights. We design the decoder to capture both high-level global features and low-level local features. Following the spirit of U-Net [21], which gradually fuses features from deeper layers with features from shallower layers, we adopt a contracting-expanding structure, in which ResNet-50 serves as the contracting path (upper part in Fig. 4) and the flat decoder branch is the expanding path (lower part of the figure). More specifically, ResNet-50 comprises four blocks, each outputting feature maps with half the spatial resolution of the previous block. To reduce the number of parameters, we add a sequence of convolution layers after each ResNet-50 block which reduces the number of channels and the height by factors of 4 and 8 respectively, then reshape the feature maps to height 1 to obtain flat feature maps. Every step in the expanding path comprises an upsampling which doubles only the width of the flat feature maps, followed by three convolutions. At the last step, we upsample the width of the flat feature maps by a factor of four and apply two convolution layers to reduce the channels to 1, yielding the final output with dimension 1 × W (the image width). All convolution layers except the last one are followed by ReLU and BatchNorm [11].

Figure 4: An illustration of the Flat2Layout network architecture.
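The following PyTorch sketch shows one way a flat decoder branch with the stated reduction factors could be realized. The exact kernel sizes did not survive extraction, and the fusion operator, channel widths, and class names (FlatReduce, FlatDecoder) are our assumptions; treat this as a minimal reading of the architecture description, not the released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlatReduce(nn.Module):
    """Turn a 2D ResNet feature map into a height-1 'flat' feature map:
    channels /4 and height /8 via strided convs, then fold the remaining
    height into the channel dimension."""
    def __init__(self, c_in):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_in // 4, 3, stride=(2, 1), padding=1), nn.ReLU(True),
            nn.Conv2d(c_in // 4, c_in // 4, 3, stride=(2, 1), padding=1), nn.ReLU(True),
            nn.Conv2d(c_in // 4, c_in // 4, 3, stride=(2, 1), padding=1), nn.ReLU(True))
    def forward(self, x):
        x = self.conv(x)                  # (B, C/4, H/8, W)
        b, c, h, w = x.shape
        return x.reshape(b, c * h, 1, w)  # fold height into channels -> height 1

class FlatDecoder(nn.Module):
    """Expanding path for one flat branch (y_ceil, y_floor, or p_wall),
    assuming a 256x256 input so every reduced block has 512 channels."""
    def __init__(self, c=512):
        super().__init__()
        self.reduce = nn.ModuleList([FlatReduce(ci) for ci in (256, 512, 1024, 2048)])
        def step(c_in):
            return nn.Sequential(
                nn.Conv2d(c_in, c, (1, 3), padding=(0, 1)), nn.ReLU(True), nn.BatchNorm2d(c),
                nn.Conv2d(c, c, (1, 3), padding=(0, 1)), nn.ReLU(True), nn.BatchNorm2d(c),
                nn.Conv2d(c, c, (1, 3), padding=(0, 1)), nn.ReLU(True), nn.BatchNorm2d(c))
        self.steps = nn.ModuleList([step(2 * c) for _ in range(3)])
        self.head = nn.Sequential(
            nn.Conv2d(c, c // 2, (1, 3), padding=(0, 1)), nn.ReLU(True), nn.BatchNorm2d(c // 2),
            nn.Conv2d(c // 2, 1, (1, 3), padding=(0, 1)))  # last conv: no ReLU/BN

    def forward(self, feats):             # feats: ResNet block outputs, shallow to deep
        flats = [r(f) for r, f in zip(self.reduce, feats)]
        x = flats[-1]
        for step, skip in zip(self.steps, reversed(flats[:-1])):
            x = F.interpolate(x, scale_factor=(1, 2))  # double the width only
            x = step(torch.cat([x, skip], dim=1))      # U-Net style fusion (assumed concat)
        x = F.interpolate(x, scale_factor=(1, 4))      # width -> image width
        return self.head(x).squeeze(1).squeeze(1)      # (B, W) row vector
```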
3.4. Post-processing

Based on our flat representation, we propose an intuitive and efficient post-processing algorithm capable of generating layouts for general room layout topologies, not limited to the "box shape" (namely, images taken in cuboid-shaped rooms). An overview of our post-processing algorithm: i) peak finding to detect the corners' x positions, then ii) checking for special cases, and finally iii) using dynamic programming to determine the actual positions of all corners.

As a reminder, the p′_wall predicted by our model is the probability of each image column being a wall-wall boundary. y′_ceil and y′_floor are the positions of the ceiling-wall and floor-wall boundaries at each image column, with any position outside the image plane indicating no boundary at that column. p′_ceil and p′_floor are the outputs of the two softmax classifier branches, telling us whether to ignore y′_ceil or y′_floor.

Wall-wall positions: We first extract the positions of the wall-wall boundaries from the model-estimated p′_wall. We apply a smoothing filter with a window covering a fixed fraction of the image width, then find signal peaks with the same window size (a sketch follows the special cases below). Peaks with probability lower than a threshold are removed. The remaining peaks give the column positions of all wall-wall boundaries, which we denote as a set of image x positions W_x.

For the remainder of the post-processing description, we explain using only y′_ceil and W_x. As we construct ceiling corners and floor corners independently with the same determined W_x, the same algorithm applies to y′_floor.

Two special cases: i) W_x is an empty set. If no wall-wall boundary is found in the image, we simply predict a ceiling-wall boundary by linear regression on y′_ceil. ii) More than a certain fraction of the columns of y′_ceil are predicted to be outside the image plane. Since this suggests that no ceiling-wall boundary lies in the image plane, we estimate the ceiling corners as {(x, 0) | x ∈ W_x}. (In the case of floor corners, the result is {(x, H − 1) | x ∈ W_x}, where H is the image height.)
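A minimal sketch of the wall-wall peak finding, using SciPy's smoothing and peak detection. The window fraction and probability threshold (win_frac, thresh) are placeholder values: the paper's exact constants did not survive extraction.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.signal import find_peaks

def detect_wall_positions(p_wall, win_frac=0.05, thresh=0.5):
    """Recover wall-wall boundary columns W_x from the predicted p'_wall."""
    W = len(p_wall)
    win = max(1, int(round(win_frac * W)))
    # Smooth with a window covering a fixed fraction of the image width.
    smoothed = uniform_filter1d(p_wall.astype(np.float64), size=win)
    # Peaks must be separated by at least the window size and clear the threshold.
    peaks, _ = find_peaks(smoothed, height=thresh, distance=win)
    return peaks  # sorted x positions of detected wall-wall boundaries
```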
Dynamic programming: With the detected N = |W_x| peaks, we generate N + 2 candidate point sets according to the positions x ∈ W_x (see Fig. 5). The generated candidate point sets are denoted as S_L, S_1, S_2, ..., S_N, S_R, and we select one point in each set (red dots in the figure) as the actual ceiling corners. S_L and S_R (green dots in the figure) are the sets of points located on the image border with x position less than min(W_x) or greater than max(W_x) respectively. The middle sets S_1, ..., S_N (yellow dots in the figure) are S_i = {(W_x^(i), y) | y ∈ {0, 1, ..., H − 1}}, where W_x^(i) is the i-th smallest element of W_x.

The raw estimated ceiling-wall boundary (blue dots in the upper part of Fig. 5) is also split according to W_x, resulting in N + 1 sets P_0, P_1, ..., P_N.

To select a point in each set as the final ceiling corners (red dots in Fig. 5), we define the loss as the average Euclidean distance of all raw estimated ceiling-wall boundary positions (blue dots in the figure) to the estimated layout. More specifically, a point in P_i contributes a loss given by its distance to the line connecting the two points selected in S_i and S_{i+1}. The value V_i^(j) is defined as the minimum loss from S_L to S_i given that the j-th element of S_i is selected. To find the layout with minimum loss, we exploit dynamic programming (a sketch is given after Fig. 5). The recursive relationship is given in Eq. 1:

V_1^(j) = min_k d(P_0, S_L^(k) S_1^(j)),
V_{i+1}^(j) = min_k [ V_i^(k) + d(P_i, S_i^(k) S_{i+1}^(j)) ],   (1)

where d(P, AB) denotes the average distance from the points in P to the segment connecting A and B. The layout with minimum loss is extracted by backtracking from argmin_p V_R(p), where V_R = V_{N+1}. The time complexity of the overall algorithm is O(N · H), and it takes roughly 400ms per frame. There is still much room for speedup, as our implementation is based on Python and many redundant candidate points in each set can be removed heuristically.

Figure 5: Illustration of the candidate point sets in our post-processing procedure. After finding N peaks in the network-predicted wall-wall existence, we generate N + 2 candidate point sets, which include N columns at the X coordinates of the N peaks (yellow), the left-most edge, and the right-most edge (green). The corners decoded with the post-processing are plotted as red dots. Please see Sec. 3.4 for a detailed description of our post-processing algorithm.
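The recursion of Eq. 1 translates directly into the following sketch. For brevity it measures vertical distance to the connecting segment rather than the full Euclidean point-to-segment distance, and it collapses S_L and S_R to single border points at the regressed boundary height, so it illustrates the DP structure rather than reimplementing the paper exactly. The candidate pruning mentioned above would simply shrink each cand[i] before the inner loop.

```python
import numpy as np

def dp_decode(y_bd, wall_xs, H):
    """Decode one boundary (ceiling or floor) with the DP of Eq. 1.

    y_bd    : per-column boundary position in pixels (length W).
    wall_xs : sorted wall-wall boundary columns from peak finding.
    Returns the selected corner points, including the two border points.
    """
    W = len(y_bd)
    xs = [0] + list(wall_xs) + [W - 1]          # x position of every candidate set
    cand = ([np.array([y_bd[0]])] +             # S_L (simplified to one point)
            [np.arange(H, dtype=np.float64) for _ in wall_xs] +  # S_1..S_N
            [np.array([y_bd[-1]])])             # S_R (simplified to one point)

    def seg_loss(i, ya, yb):
        # Average distance of the raw boundary P_i to the segment
        # connecting (xs[i], ya) and (xs[i+1], yb).
        col = np.arange(xs[i], xs[i + 1] + 1)
        line = ya + (yb - ya) * (col - xs[i]) / max(xs[i + 1] - xs[i], 1)
        return np.abs(y_bd[col] - line).mean()

    V, arg = [np.zeros(1)], []                  # V at S_L: single zero-cost state
    for i in range(len(xs) - 1):
        nxt = np.full(len(cand[i + 1]), np.inf)
        bk = np.zeros(len(cand[i + 1]), dtype=int)
        for j, yj in enumerate(cand[i + 1]):
            costs = [V[i][k] + seg_loss(i, yk, yj) for k, yk in enumerate(cand[i])]
            bk[j] = int(np.argmin(costs))
            nxt[j] = costs[bk[j]]
        V.append(nxt)
        arg.append(bk)

    # Backtrack from the best final state to recover one point per set.
    j = int(np.argmin(V[-1]))
    picks = [j]
    for bk in reversed(arg):
        j = int(bk[j])
        picks.append(j)
    picks.reverse()
    return [(x, float(cand[i][p])) for i, (x, p) in enumerate(zip(xs, picks))]
```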
3.5. Reconstructing the Layout in 3D

To reconstruct the recognized layout in 3D, we make a few assumptions: (i) the pre-processing correctly transforms vp_Z to infinity along the image Y axis, (ii) the floor and ceiling are planes orthogonal to the gravity direction, and (iii) the distances from the camera to the floor and from the camera to the ceiling are both 1 meter.

In the explanation below, x, y indicate the pixel position on the image plane, where the origin is the image center and right and bottom are the positive directions of x and y respectively. X, Y, Z indicate the position in 3D, where the camera center is located at (0, 0, 0) and the floor plane is Z = −1.

Before further reconstruction, we have to infer the imaginary pixel distance f between the camera center and the image plane. We use assumption (i) and three floor corners (x_1, y_1), (x_2, y_2), (x_3, y_3) in the image which are considered to form a right angle in 3D with (x_2, y_2) at the vertex. f can then be solved by Eq. 2 (see the supplementary for the detailed derivation):

f = sqrt( (x_1 x_2 y_2 y_3 + x_2 x_3 y_1 y_2 − x_1 x_3 y_2^2 − x_2^2 y_1 y_3) / (y_1 y_3 + y_2^2 − y_1 y_2 − y_2 y_3) ).   (2)

With f and a given Z, we obtain the equations X = xZ/y and Y = fZ/y for mapping from image coordinates to world coordinates under our notation (sketched below). Combined with assumptions (ii) and (iii), we can obtain the world coordinates of each corner on the floor and ceiling for texture mapping in a 3D viewer. Some reconstructed results are shown in Fig. 1 and Fig. 7.
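A small sketch of the two formulas above: solving f from three floor corners (equivalently, supplementary Eq. 4) and mapping an image point on a horizontal plane to world coordinates. Image coordinates are assumed to be centered at the principal point, as in the notation above; the axis-sign conventions follow the paper.

```python
import numpy as np

def solve_focal(p1, p2, p3):
    """Solve the imaginary pixel focal length f from three floor corners
    (image coordinates centered at the image center), with the right angle
    at p2, following Eq. 2 / supplementary Eq. 4."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    f2 = -(x1 * y2 - x2 * y1) * (x3 * y2 - x2 * y3) / ((y2 - y1) * (y2 - y3))
    return np.sqrt(f2)

def to_world(x, y, f, Z=-1.0):
    """Map an image point on a horizontal plane of height Z to 3D:
    X = xZ/y, Y = fZ/y (camera at the origin, floor at Z = -1)."""
    return np.array([x * Z / y, f * Z / y, Z])

# Example with three hypothetical floor corners (right angle at the middle one):
# f = solve_focal((-50, 40), (10, 60), (80, 45))
```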
4. Experiments
4.1. Training Details

All input RGB images and ground truth are resized to 256 × 256. We adopt a regression loss for the ceiling-wall boundary (y_ceil) and the floor-wall boundary (y_floor); the y_ceil loss is not computed if the model output is already smaller than 0 when the ground truth is -0.01 (the columns where the ceiling-wall boundary does not exist), and likewise, the y_floor loss is not counted if the model output is greater than 1 when the ground truth is 1.01. We adopt the binary cross-entropy loss for the wall-wall boundary (p_wall) and use the cross-entropy loss for both the ceiling-wall boundary classifier (p_ceil) and the floor-wall boundary classifier (p_floor). The Adam optimizer [13] is employed to train the network for 150 epochs with batch size 16 and learning rate 0.0002. Similar to [17, 2], we employ a poly learning rate policy, where the initial learning rate is multiplied by (1 − iter_now/iter_max)^0.9.

4.2. Evaluation Metrics

We use two standard quantitative evaluation metrics for room layout estimation. i) Pixel Error (PE) measures the per-pixel surface-label accuracy between ground truth and estimation. ii) Corner Error (CE) is defined by the Euclidean distance between ground truth corners and estimated corners, normalized by the image diagonal length. The official LSUN [28] evaluation code takes the minimum CE among all possible matchings and adds a fixed penalty for each extra or missing corner relative to the ground truth.

In the pre-processing phase of our pipeline, images are rectified by a homography (see Fig. 2), so the estimated corners or surface semantics cannot be evaluated directly against the original ground truth. For a fair comparison with the literature, we project the corners estimated by our model back to the original image and also rescale them to the original image resolution.

4.3. Results on Hedau Dataset

The Hedau [9] dataset consists of 209 training instances and 105 testing instances. We skip the Hedau training set and evaluate our approach, trained on the LSUN [28] dataset, directly on the Hedau testing set. As the ground truth corners are not provided, and to be consistent with the literature, we only use pixel error (PE) as the evaluation metric. The quantitative results on the Hedau testing set compared with other methods are summarized in Table 1. Our approach achieves state-of-the-art performance. Some qualitative results are provided in the supplementary.
4.4. Results on LSUN Dataset

The LSUN [28] dataset consists of 4000 training instances, 394 validation instances, and 1000 testing instances. Because the ground truth of the testing set is not available, we evaluate and compare with other approaches only on the validation set (as in [20, 14, 32], where only results on the validation set are reported). The quantitative results are shown in Table 2, where we achieve state-of-the-art performance. Some qualitative results are provided in the supplementary.

Method                          PE (%)
Hedau et al. (2009) [9]         21.20
Del Pero et al. (2012) [4]      16.30
Gupta et al. (2010) [7]         16.20
Zhao et al. (2013) [31]         14.50
Ramalingam et al. (2013) [19]   13.34
Mallya et al. (2015) [18]       12.83
Schwing et al. (2012) [23]      12.80
Del Pero et al. (2013) [5]      12.70
Izadinia et al. (2017) [12]     10.15
Dasgupta et al. (2016) [3]       9.73
Zou et al. (2018) [32]           9.69
Ren et al. (2016) [20]           8.67
Lee et al. (2017) [14]           8.36
Ours                             5.01

Table 1: Quantitative results on the Hedau [9] testing set.

Method                       CE (%)   PE (%)
Hedau et al. (2009) [9]       15.48    24.23
Mallya et al. (2015) [18]     11.02    16.71
Dasgupta et al. (2016) [3]     8.20    10.63
Ren et al. (2016) [20]         7.95     9.31
Zou et al. (2018) [32]         7.63    11.96
Lee et al. (2017) [14]         6.30     9.86
Ours                           4.92     6.68

Table 2: Quantitative results on the LSUN [28] validation set.
4.5. Ablation Study and Discussion

Data imbalance problem in LSUN:
We design the two classifier branches described in Sec. 3.2 to suppress false-positive boundary regression (in other words, cases where our model finds a boundary while the ground truth is empty). The intuition comes from observing the severe room-type imbalance in the LSUN dataset [28]. We depict the number of training instances for each room layout type defined by LSUN in Fig. 6 (the detailed definition of each room type is provided in the supplementary). The instances belonging to types 2, 3, 7, 8, and 10, which do not have a floor-wall boundary, account for only a small fraction of the total number of training instances. This data imbalance makes our learning-based model tend to always predict a floor-wall boundary.

Figure 6: Distribution of the training samples according to the room layout types.

To further show that the design of the two classifiers can ease the bias caused by type imbalance, we report the corner error with and without the classifiers in Table 3. The results show that the classifier branches help in the cases where the floor-wall boundary or ceiling-wall boundary is outside the image plane.

Type ID   No classifier CE (%)   With classifier CE (%)   Improvement CE (%)
Both floor-wall and ceiling-wall appear
0              4.54                   4.75                   -0.22
5              2.83                   3.09                   -0.27
6              8.27                   8.27                    0.00
Only floor-wall appears (group improvement +1.21)
1             12.71                  12.70                   +0.01
4              6.41                   5.18                   +1.23
Only ceiling-wall appears (group improvement +3.10)
2                -                      -                       -
3             19.28                  16.30                   +2.98
Table 3: Comparison of the corner error (CE) with and without the classifier branch, evaluated in an ablation manner on the validation set of the LSUN dataset. Types 2, 7, and 8 are left empty because these types do not exist in the validation set.
An ”alternative” methodfor room layout estimation is proposed by Zhao et al . [30]where they transferred information from larger scale seman-tic segmentation dataset (SUNRGBD [24]) by a complextraining protocol while achieving outstanding performance,even the most recent state-of-the-art layout estimation ap-proaches ( e.g . [14, 32]) did not outperform their results.Comparing to our method, they achieve better result onLSUN dataset ( . CE and . PE vs . ours . CE, . PE) while getting worse result on Hedau dataset thanours ( . PE vs . ours . PE). However, like other ex-isting methods, they have to define a prior set of room typetopologies (11 types in LSUN which also covers all pos-sible types in Hedau). Besides, time complexity and thenumber of parameters of their post optimization algorithmincrease linear to the number of room type. Our method,on the other hand, can extend to general room type withoutmodifying our model and post-processing algorithm as wewill show in Sec. 4.6.
4.6. General Room Layout Estimation

Motivation:
Existing datasets for layout estimation from perspective images, i.e., the LSUN [28] and Hedau [9] datasets, carry the prior that at most two wall-wall boundaries are present in the image. More specifically, the LSUN dataset defines 11 layout types, and both the LSUN and Hedau datasets consider only five surface categories: ceiling, floor, left wall, front wall, and right wall.

Existing methods for room layout estimation only take the layout topologies defined in the LSUN dataset into consideration. They require further modification to generate layouts not defined in LSUN. For instance, RoomNet [14] needs to define all layout topologies for its model. Hypothesis-ranking based algorithms [20, 3, 12] have to design additional rules for generating proposals that cover the possible layout types.

Our approach, on the other hand, can handle general room layout topologies directly without modifying our model. Our post-processing for decoding the predicted flat room layout representation can be applied directly to produce general room layouts.
Experiment:
To verify the idea, we build a benchmark with the 40 panoramic images annotated as general room layouts by HorizonNet [25]. We split the 40 rooms into 10 folds for 10-fold cross-validation. If a room is in the validation subsample, we capture 6 perspective images by uniformly rotating the camera within the panorama (a sketch of this rendering step is given below) and remove instances without any boundaries or corners inside the image plane (where the camera is too close to a wall). For a room in the training subsample, we capture the perspective data as in the validation case, but also apply the Pano Stretch Augmentation proposed by HorizonNet [25]. Fig. 3 shows one of the captured perspective images.

For each validation subsample, we use the model trained on the LSUN dataset and finetune on the other training subsamples for 25 epochs.
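The perspective captures can be rendered from the equirectangular panoramas roughly as follows. The field of view, output resolution, zero pitch/roll, and nearest-neighbor sampling are our assumptions, since the paper does not state its rendering parameters.

```python
import numpy as np

def pano_to_perspective(pano, fov_deg=90.0, yaw_deg=0.0, out_hw=(512, 512)):
    """Render a perspective view from an equirectangular panorama by
    sampling along camera rays (nearest-neighbor for brevity)."""
    Hp, Wp = pano.shape[:2]
    Ho, Wo = out_hw
    f = 0.5 * Wo / np.tan(0.5 * np.radians(fov_deg))
    yaw = np.radians(yaw_deg)

    # Pixel grid -> camera-space rays (x right, y down, z forward).
    u, v = np.meshgrid(np.arange(Wo) - Wo / 2 + 0.5,
                       np.arange(Ho) - Ho / 2 + 0.5)
    x, y, z = u, v, np.full_like(u, f)
    # Rotate the rays around the vertical axis by the yaw angle.
    xr = x * np.cos(yaw) + z * np.sin(yaw)
    zr = -x * np.sin(yaw) + z * np.cos(yaw)

    # Rays -> spherical coordinates -> equirectangular pixel coordinates.
    lon = np.arctan2(xr, zr)                  # [-pi, pi]
    lat = np.arctan2(y, np.hypot(xr, zr))     # [-pi/2, pi/2]
    px = ((lon / (2 * np.pi) + 0.5) * Wp).astype(int) % Wp
    py = np.clip(((lat / np.pi + 0.5) * Hp).astype(int), 0, Hp - 1)
    return pano[py, px]

# Six uniformly rotated views, as in the benchmark construction:
# views = [pano_to_perspective(pano, yaw_deg=k * 60.0) for k in range(6)]
```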
Results:
Each of the 40 rooms is validated once during the course of the 10-fold cross-validation. We cluster the corner error (CE) by the number of wall-wall boundaries visible in the perspective image and show the results in Table 4. We also show the results of the model trained only on the LSUN dataset (which contains only 0, 1, and 2 wall-wall boundaries) in the third column. We observe that the model trained only on LSUN degrades severely on cases with 3 or more wall-wall boundaries. The results show that our approach is applicable to general layout topologies given available training data. We show some qualitative 3D reconstructed layouts in Fig. 1 and Fig. 7. More qualitative results are provided in the supplementary.

Number of wall-wall   Number of instances   No finetune CE (%)   Finetune CE (%)
0                        5                     8.30                 4.69
1                       86                     5.63                 4.81
2                       59                     3.55                 3.45
3                       48                    13.59                 6.78
4                       29                    14.65                 6.26
5                        7                    16.60                10.89
6                        2                    20.98                15.25
All                    236                     8.35                 5.31

Table 4: Results on the general room layout benchmark constructed by us. The third column is tested directly with our LSUN pre-trained model, while the fourth column is the result of finetuning in a cross-validation manner. The corner errors (CE) shown in the last row are averaged across all instances. See Sec. 4.6 for more detail.
5. Conclusion
We have presented a new approach, Flat2Layout, which is able to recover the layout of general indoor room types from a single RGB image using a flat target representation. The proposed deep model is trained to estimate the layout under the flat representation. To extract the layout from the flat representation, we exploit dynamic programming as post-processing, which is fast and effective. Our approach achieves state-of-the-art performance on existing room layout datasets. Besides, we quantitatively and qualitatively show that our approach can also work directly on general room topologies by only changing the training data, which overcomes the box-shape limitation of existing datasets and methods.

Figure 7: Some qualitative results of 3D layouts reconstructed by our approach. The green dots and lines are ground truth and the red ones are estimated by our approach. Please see Sec. 3 for more detail about our method. In sample (g), our model fails to fit the true indoor layout but surprisingly recognizes the shape of the balcony, thus producing a visually acceptable result. Sample (h) is a challenging example in which our approach fails and recognizes the beam column as a wall.
References

[1] F. Boniardi, A. Valada, R. Mohan, T. Caselitz, and W. Burgard. Robot localization in floor plans using a room layout edge extraction network. arXiv preprint arXiv:1903.01804, 2019.
[2] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[3] S. Dasgupta, K. Fang, K. Chen, and S. Savarese. DeLay: Robust spatial layout estimation for cluttered indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 616–624, 2016.
[4] L. Del Pero, J. Bowdish, D. Fried, B. Kermgard, E. Hartley, and K. Barnard. Bayesian geometric modeling of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2719–2726. IEEE, 2012.
[5] L. Del Pero, J. Bowdish, B. Kermgard, E. Hartley, and K. Barnard. Understanding bayesian rooms using composite 3d object models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 153–160, 2013.
[6] C. Fernandez-Labrador, J. M. Facil, A. Perez-Yus, C. Demonceaux, and J. J. Guerrero. PanoRoom: From the sphere to the 3d layout. arXiv preprint arXiv:1808.09879, 2018.
[7] A. Gupta, M. Hebert, T. Kanade, and D. M. Blei. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In Advances in Neural Information Processing Systems, pages 1288–1296, 2010.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. In 2009 IEEE 12th International Conference on Computer Vision, pages 1849–1856. IEEE, 2009.
[10] V. Hedau, D. Hoiem, and D. Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In European Conference on Computer Vision, pages 224–237. Springer, 2010.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pages 448–456. JMLR.org, 2015.
[12] H. Izadinia, Q. Shan, and S. M. Seitz. IM2CAD. In CVPR, 2017.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[14] C.-Y. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. RoomNet: End-to-end room layout estimation. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4875–4884. IEEE, 2017.
[15] D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2136–2143. IEEE, 2009.
[16] J.-K. Lee, J. Yea, M.-G. Park, and K.-J. Yoon. Joint layout estimation and global multi-view registration for indoor reconstruction. In Proceedings of the IEEE International Conference on Computer Vision, pages 162–171, 2017.
[17] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
[18] A. Mallya and S. Lazebnik. Learning informative edge maps for indoor scene layout prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 936–944, 2015.
[19] S. Ramalingam, J. K. Pillai, A. Jain, and Y. Taguchi. Manhattan junction catalogue for spatial reasoning of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3065–3072, 2013.
[20] Y. Ren, S. Li, C. Chen, and C.-C. J. Kuo. A coarse-to-fine indoor layout estimation (CFILE) method. In Asian Conference on Computer Vision, pages 36–51. Springer, 2016.
[21] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[22] A. G. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun. Box in the box: Joint 3d layout and object reasoning from single images. In Proceedings of the IEEE International Conference on Computer Vision, pages 353–360, 2013.
[23] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Efficient structured prediction for 3d indoor scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2815–2822. IEEE, 2012.
[24] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
[25] C. Sun, C.-W. Hsiao, M. Sun, and H.-T. Chen. HorizonNet: Learning room layout with 1d representation and pano stretch data augmentation. arXiv preprint arXiv:1901.03861, 2019.
[26] R. G. Von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall. LSD: A fast line segment detector with a false detection control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4):722–732, 2010.
[27] S.-T. Yang, F.-E. Wang, C.-H. Peng, P. Wonka, M. Sun, and H.-K. Chu. DuLa-Net: A dual-projection network for estimating room layouts from a single rgb panorama. arXiv preprint arXiv:1811.11977, 2018.
[28] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[29] W. Zhang, W. Zhang, and J. Gu. Edge-semantic learning strategy for layout estimation in indoor environment. arXiv preprint arXiv:1901.00621, 2019.
[30] H. Zhao, M. Lu, A. Yao, Y. Guo, Y. Chen, and L. Zhang. Physics inspired optimization on semantic transfer features: An alternative method for room layout estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10–18, 2017.
[31] Y. Zhao and S.-C. Zhu. Scene parsing by integrating function, geometry and appearance models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3119–3126, 2013.
[32] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. LayoutNet: Reconstructing the 3d room layout from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2051–2059, 2018.

Supplemental Material
A. Definition of LSUN Room Types
Figure 8: The 11 room types defined in the LSUN dataset.
B. Derivation for 3D Reconstruction
We use Y and Z for the front-back and up-down directions respectively in the world coordinate system for convenience, so X = xZ/y and Y = fZ/y are obtained by rearranging x = fX/Y and y = fZ/Y. For the later 3D reconstruction, we first need to solve the unknown term f, which can be done using three points (x_1, y_1), (x_2, y_2), (x_3, y_3) estimated by our model on the image plane. The three points are considered to be on the floor and to form a 90° angle with (x_2, y_2) at the vertex. As Z_1 = Z_2 = Z_3 = −1 by our assumption, we have

(X_1 − X_2)(X_3 − X_2) + (Y_1 − Y_2)(Y_3 − Y_2) = 0.   (3)

By expanding (3), we obtain a linear equation in f^2:

(x_1 y_2 − x_2 y_1)(x_3 y_2 − x_2 y_3) + (y_2 − y_1)(y_2 − y_3) f^2 = 0.   (4)

Based on the solution of f, one of the wall-wall intersections is guaranteed to be orthogonal.

C. Qualitative Results on Hedau Dataset

Figure 9: Qualitative results on the Hedau test set. The results are separately sampled from four groups that comprise the predictions with the best 0–25%, 25–50%, 50–75% and 75–100% pixel errors (displayed from the first to the fourth row). The red lines depict the estimated layout.

D. Qualitative Results on LSUN Dataset

Figure 10: Qualitative results on the LSUN validation set. The results are separately sampled from four groups that comprise the predictions with the best 0–25%, 25–50%, 50–75% and 75–100% corner errors (displayed from the first to the fourth column). The green lines are the ground-truth layout while the red lines are the estimated one.

E. Qualitative Results for General Layout Topology