Learning Panoptic Segmentation from Instance Contours
Sumanth Chennupati, Venkatraman Narayanan, Ganesh Sistu, Senthil Yogamani, Samir A Rawashdeh
Abstract — Panoptic Segmentation aims to provide an understanding of background (stuff) and instances of objects (things) at a pixel level. It combines the separate tasks of semantic segmentation (pixel-level classification) and instance segmentation to build a single unified scene understanding task. Typically, panoptic segmentation is derived by combining semantic and instance segmentation tasks that are learned separately or jointly (multi-task networks). In general, instance segmentation networks are built by adding a foreground mask estimation layer on top of object detectors or by using instance clustering methods that assign a pixel to an instance center. In this work, we present a fully convolutional neural network that learns instance segmentation from semantic segmentation and instance contours (boundaries of things). Instance contours along with semantic segmentation yield a boundary-aware semantic segmentation of things. Connected component labeling on these results produces instance segmentation. We merge semantic and instance segmentation results to output panoptic segmentation. We evaluate our proposed method on the CityScapes dataset to demonstrate qualitative and quantitative performance along with several ablation studies.
I. INTRODUCTION
Panoptic segmentation [1], [2] offers ultimate understanding of a scene by providing joint semantic and instance level predictions of background and objects at a pixel level. Panoptic segmentation is usually achieved by combining outputs from semantic segmentation and instance segmentation. Examples where panoptic segmentation offers an unprecedented advantage over standalone semantic or instance segmentation solutions include collective knowledge of distinct objects and drivable area around a self-driving car [3], [4], semantic and instance level details of cancerous cells in digital pathology [5], and understanding of background and different individuals in a frame to enhance smartphone photography. Multi-task learning networks [6], [7], [8], [9] that jointly perform semantic and instance segmentation [1], [3] accelerated the progress of panoptic segmentation in terms of accuracy and computational efficiency compared to traditional methods that use naive fusion of predictions from independent semantic and instance segmentation networks to derive the panoptic segmentation output [2].

Department of Electrical and Computer Engineering, University of Michigan-Dearborn, 4901 Evergreen Rd, Dearborn, MI USA. {schenn, srawa}@umich.edu
University of Maryland-College Park, College Park, MD USA. [email protected]
Valeo Vision Systems, Tuam, Ireland. {ganesh.sistu, senthil.yogamani}@valeo.com
Overview Video - https://youtu.be/wBtcxRhG3e0

Fig. 1: (a) Semantic segmentation, (b) Instance contour segmentation, (c) Instance center regression and (d) Instance segmentation.

Instance segmentation is typically achieved in two major ways: 1) foreground mask estimation of objects detected by an object detection model [1], [10], [11], [12], or 2) clustering based instance assignment methods [13], [14]. Recently, single stage instance segmentation methods have been developed [15], [16]. These major approaches use fully convolutional networks so that they can be trained in an end-to-end fashion.

Semantic segmentation is a mature task that is well explored in the literature relative to panoptic segmentation. We make the observation that panoptic segmentation can be obtained from semantic segmentation by additionally estimating instance separating contours. Naively, the instance separating contours can be an additional class in the segmentation task. In practice, it is difficult to get good performance for this class. This is illustrated in Figure 1, where semantic segmentation (a) and instance contour segmentation (b) contain all the information needed to obtain panoptic segmentation. The minimal contours needed are those which separate two instances of the same object. However, these contours don't have sufficient information to be learnt on their own, and thus we use the entire instance contours.

In this work, we present a multi-task learning network, as shown in Figure 2, that learns semantic segmentation, instance contours and center regression. Our instance contours along with semantic segmentation guide us to derive instance segmentation and eventually panoptic segmentation. Our instance contour segmentation network is a binary segmentation network that predicts instance boundaries between objects that belong to the same category. Compared to semantic edge detection networks [17], [18], our instance contour estimation doesn't ignore boundaries between instances of the same category.
We refine low quality instances from our instance segmentation output using center regression results. We split large instances or merge small instances using 2-d offsets to an instance center predicted by instance center regression at a pixel level.

Fig. 2: We present a network that learns panoptic segmentation from semantic segmentation and instance contours (boundaries of things). We use a shared convolution neural network to predict semantic segmentation, instance contours and center regression. Instance contours along with semantic segmentation yield a boundary-aware semantic segmentation of things. Connected component labeling on these results produces instance segmentation and eventually panoptic segmentation.

We hope that our idea encourages a new direction in the research of panoptic segmentation which ultimately leads to learning of instance separating contours within the segmentation task. The main contributions of this paper include:
1) A novel method to learn panoptic segmentation and instance segmentation from semantic segmentation and instance contours.
2) An instance contour segmentation network that learns boundaries between objects of the same semantic category.

II. RELATED WORK
Scene understanding [19] has witnessed tremendous progress over the past decade with the introduction of Convolutional Neural Networks [20], [21], [22], which aided the development of semantic segmentation (pixel-wise classification) and instance segmentation (pixel-level recognition of distinct objects). Panoptic segmentation [2], a joint semantic and instance segmentation, has provided complete scene understanding by categorizing each pixel into distinct categories and instances. On the other hand, semantic edge detection [17] has been widely used to learn boundaries between semantic classes.
A. Semantic Segmentation
A few years ago, semantic segmentation [23] was considered a challenging problem. With the help of fully convolutional neural networks (FCNs) [24], development of accurate and efficient solutions was made possible. Several enhancements were made to push the performance of semantic segmentation higher by making improvements to the encoder and decoder in FCNs. Dilated residual convolutions [25], feature pyramid networks [1], [26], and spatial pyramid pooling [27] are examples of improvements made to the encoder, while U-Net [28] and densely connected CRFs [25], [29] are examples of improvements made to the decoder. We use a combination of feature pyramid networks and a light-weight asymmetric decoder presented by Kirillov et al. [1] to learn semantic segmentation.
B. Instance Segmentation
In instance segmentation, an object instance (id) is assigned to every pixel for every known object within an image. Two stage methods like Mask R-CNN [12] involve proposal generation from object detection followed by mask generation using a foreground/background binary segmentation network. These methods dominate the state of the art in instance segmentation but incur a relatively higher computational cost. Using YOLO [30], SSD [31] and other lightweight object detectors compared to Faster R-CNN [32] may seem promising, but they still possess inevitable additional compute in generating object proposals followed by mask generation. Other approaches in instance segmentation range from clustering of instance embeddings [33] to prediction of instance centers using offset regression [13], [14]. These methods appear logically straightforward but are lagging behind in terms of accuracy and computational efficiency. The major drawback of these methods is the usage of compute intensive clustering methods like OPTICS [34], DBSCAN [35], etc. In contrast to these methods, we derive instance segmentation from semantic segmentation using instance contours (boundaries of things).
C. Semantic Edge Detection
Semantic edge detection (SED) [17], [36] differs from edge detection [37] by predicting edges that belong to semantic class boundaries. In SED, edges/boundaries that separate segments of one category from another are predicted, whereas in edge detection every edge is detected based on image gradients. Holistically-nested edge detection (HED) [38] is one of the first CNN based edge detection methods. Later, several methods were proposed to address different challenges in edge detection, including prediction of crisp boundaries [18], [39], selection of intermediate feature maps and choices of supervision on these feature maps [40], [41]. It is important to note that these methods ignore the boundaries between instances of objects that belong to the same semantic category.

Fig. 3: Proposed model architecture with CNN backbone. Multi-scale features from the backbone are fed to a feature pyramid network and then to an upsampling neck followed by a prediction head. Our network has three heads for semantic segmentation, instance contour segmentation and center regression tasks. Separate necks can be used for different heads/tasks as needed.

Deep Snake [42] recently proposed to predict instance contours by learning contours from object detection. They replace foreground mask estimation for objects with contours to derive instance segmentation. Our instance contour segmentation, however, is a single stage method that directly estimates contours using a binary segmentation network.
D. Panoptic Segmentation
Panoptic segmentation [2] combines semantic segmentation and instance segmentation to provide a class category and instance id for every pixel within an image. Recent works [1], [14], [3] use a shared backbone and predict panoptic segmentation by fusing output from semantic and instance segmentation branches. Almost every work so far uses an FCN based semantic segmentation branch, with variations including usage of dilated convolutions [14] or feature pyramid networks [1]. However, choices of instance segmentation branch can vary as discussed in Section II-B. A major challenge in generating panoptic segmentation output is merging conflicting outputs from the semantic segmentation and instance branches. For example, semantic segmentation can predict that a pixel belongs to the car class while the instance segmentation branch may predict the same pixel as the person class. Several methods [3] were proposed to handle the conflicts in a better and learned fashion. Our method proposes to derive instance segmentation from semantic segmentation using instance contours. Therefore, our method doesn't require a conflict resolution policy like other existing methods.

III. PROPOSED METHOD
Our proposed method is a multi-task neural network with several shared convolution layers and multiple output heads that predict semantic segmentation, instance contours and center regression. As shown in Figure 3, a common ResNet [22] backbone outputs multi-scale feature maps that are processed by a top-down feature pyramid network [26]. These feature maps from different levels are upsampled to a common scale through a series of 1x1 convolutions and combined before making output predictions. We refer to the upsampling stages as necks and the prediction layers as heads. Outputs from the instance contour and semantic segmentation branches are combined to generate instance segmentation. We refine the instance segmentation output using center regression results. Later, we simply merge semantic and instance segmentation outputs to generate panoptic segmentation.
A. Model Architecture
We begin by introducing our shared backbone that outputs multi-scale feature maps as shown in Figure 3. Our backbone uses ResNet [22] as the encoder, which outputs feature maps at multiple scales w.r.t. the input image. Our pyramid is built using a Feature Pyramid Network (FPN) [26] which consumes feature maps (scales 1/4 to 1/32) from the backbone in a top-down fashion and outputs feature maps with 256 channels maintaining their input scale. Feature maps from the pyramid are then passed through a series of 1x1 convolutions and are upsampled to 1/4 scale using 2-d bi-linear interpolation in the neck layers as proposed in [1]. These layers have 128 dims at each level. We add these feature maps from different levels and pass them to the prediction heads. Our semantic segmentation head contains 1 × 1 convolutions.

B. Loss functions
We discuss the explicit loss functions defined for the semantic segmentation and instance contour branches. We chose cross-entropy loss for semantic segmentation. In Equation 1, L_semantic is the segmentation loss over k classes for all pixels in the image, where p_i and ŷ_i are the prediction probability and the ground truth per pixel for class i.

L_semantic = − Σ_{i=1}^{k} ŷ_i · log(p_i)   (1)

For instance contours, we chose the weighted multi-label binary cross entropy loss [17] as shown in Equation 2, where β is the ratio of non-edge pixels to total pixels in the image.

L_wBCE = − β · ŷ · log(p) − (1 − β) · (1 − ŷ) · log(1 − p)   (2)

We add a Huber loss (δ = 0.3):

L_Huber = 0.5 · (p − ŷ)²  if |p − ŷ| ≤ δ,  δ · |p − ŷ| − 0.5 · δ²  otherwise   (3)

and an NMS loss [18] term to the contour loss to predict thin and crisp boundaries:

L_NMS = − Σ_c log(h(c))   (4)

We compute the softmax response h along the normal direction of boundary pixels c as described in [18]. For center regression, we use the Huber loss to compute the error between y, the predicted offsets, and ŷ, the ground truth offsets, with δ = 1. Our total loss function is a weighted combination of the semantic loss, contour losses and center regression loss:

L_total = λ1 · L_semantic + λ2 · L_contour + λ3 · L_center   (5)

where L_contour is defined as:

L_contour = L_wBCE + L_Huber + L_NMS   (6)

We chose λ1, λ2 and λ3 as 1, 50 and 0.1 for our experiments.

Fig. 4: Illustrative flow diagram of the proposed algorithm that learns panoptic segmentation from semantic segmentation and instance contours.

C. Instance segmentation
Our instance segmentation is derived from semantic segmentation, unlike other instance segmentation methods, as shown in Figure 4. As a first step, we generate a binary mask by searching for instance classes in the semantic segmentation, which we refer to as the instance class mask. We subtract instance contours (generated from the instance contour segmentation head) from the instance class mask to derive a boundary-aware instance class mask. Using connected component labeling [43], we derive unique instances from the boundary-aware instance class mask. We map the semantic segmentation output to the instances generated. We assign the most frequent label found inside an instance as its category and average the softmax predictions over the area of an instance to generate a confidence for that instance.
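The derivation above (instance class mask, contour subtraction, connected component labeling) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the flood-fill labeler is a minimal stand-in for the labeling algorithm cited as [43], and `thing_ids` is a hypothetical parameter naming the instance ("thing") class ids.

```python
import numpy as np
from collections import deque

def label_components(mask):
    """Minimal 4-connectivity connected-component labeling
    (a stand-in for the labeling step cited as [43])."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue
        current += 1
        labels[sy, sx] = current
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    queue.append((ny, nx))
    return labels

def instances_from_contours(semantic, contours, thing_ids):
    """Instance class mask minus contours -> boundary-aware mask,
    then connected components give instance labels."""
    instance_class_mask = np.isin(semantic, list(thing_ids))
    boundary_aware = instance_class_mask & ~contours.astype(bool)
    return label_components(boundary_aware)
```

For example, a single "thing" region cut by one predicted contour line yields two separate instance labels, with contour pixels left unassigned.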
D. Refining Instance Segmentation
We refine the instance segmentation output using center regression results. Our refinement consists of two main stages: split and merge. We estimate centroids predicted by the center regression head. We cluster the centroid predictions within an instance using DBSCAN and split the instance if distinct centroids are found. If the distance between two centroids is at least 20 pixels (eps), we declare them as distinct. Our clustering stage doesn't require large computational complexity like other methods [33], [13], [14] since we perform clustering within instances, which are much smaller than the entire image. After the instances are split, we estimate mean centroids for every instance using offsets predicted by the center regression head. If the mean centroids are closer than 20 pixels in Euclidean distance, we merge those instances. Later, we remove all instances that have an area lower than a minimum area threshold. We assign these pixels to the instances whose centroids are closest to the centroids derived from the offsets predicted by the center regression head.
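The merge stage can be sketched as below. This is an illustrative numpy version (assuming integer instance labels with 0 as background, and an (H, W, 2) offset map of per-pixel (dy, dx) offsets to the predicted center), not the authors' implementation:

```python
import numpy as np

def merge_close_instances(labels, offsets, eps=20.0):
    """Merge instances whose regressed mean centroids lie within
    eps pixels of each other (the paper's merge criterion)."""
    ys, xs = np.indices(labels.shape)
    centers = {}
    for inst in np.unique(labels):
        if inst == 0:
            continue
        m = labels == inst
        # Mean centroid = mean of (pixel position + predicted offset).
        centers[inst] = np.array([(ys[m] + offsets[..., 0][m]).mean(),
                                  (xs[m] + offsets[..., 1][m]).mean()])
    remap = {i: i for i in centers}
    insts = sorted(centers)
    for i, a in enumerate(insts):
        for b in insts[i + 1:]:
            if np.linalg.norm(centers[a] - centers[b]) < eps:
                remap[b] = remap[a]  # merge b into a
    out = labels.copy()
    for b, a in remap.items():
        out[labels == b] = a
    return out
```

With eps = 20 (the threshold used in the paper), two fragments whose regressed centers are a few pixels apart collapse into one instance; with a smaller eps they stay separate.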
E. Panoptic Segmentation
Panoptic segmentation is now obtained by simply merging the output from semantic segmentation and instance segmentation. As discussed in Section II-D, we don't use a conflict resolution policy since our instance segmentation is a byproduct of our semantic segmentation. Thus, we will never have conflicting predictions.
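Because the instances are carved out of the semantic map itself, the merge is trivial. One common encoding (a sketch of the idea, not necessarily the paper's exact output format) packs class id and instance id into a single integer map:

```python
import numpy as np

def merge_panoptic(semantic, instances):
    """Merge semantic classes and instance ids into one panoptic map,
    encoded as class_id * 1000 + instance_id. Stuff pixels keep
    instance id 0; no conflict resolution is needed because every
    instance pixel already carries its semantic class."""
    return semantic.astype(np.int64) * 1000 + instances.astype(np.int64)
```

A pixel of class 2 in instance 3 becomes 2003, while a stuff pixel of class 7 becomes 7000.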
IV. EXPERIMENTS, RESULTS AND DISCUSSION
In this section, we demonstrate the performance of our proposed methods for panoptic segmentation on the Cityscapes [44] dataset. We also present the performance of our semantic segmentation and instance segmentation results that helped us generate the panoptic segmentation output.
A. Experimental Setup
Cityscapes [44] is an automotive scene understanding dataset with 2975/500 train/val images at 1024 × 2048 resolution. We initialize our ResNet encoders with pre-trained ImageNet [46] weights and train our networks for 48000 iterations. We measure the performance of semantic segmentation using mean intersection over union (mIoU), instance segmentation using mean average precision (mAP), and panoptic segmentation using panoptic quality (PQ) [2], segmentation quality (SQ) and recognition quality (RQ) metrics.

Fig. 5: Qualitative results on the Cityscapes [44] dataset obtained with a ResNet-50 [22] encoder using a separate neck architecture, the wBCE + Huber loss combination, and split and merge refinement with min instance area = 300 pixels. Instance contour ground truth is generated with dilation rate = 2. From left to right: semantic segmentation, instance contour segmentation, center regression, instance segmentation.

TABLE I: Instance and panoptic segmentation results for different loss functions (wBCE, Huber, NMS) used to represent the instance contour loss. wBCE = weighted multi-label binary cross entropy, AP = average precision, PQ = panoptic quality. PQ_Th, SQ_Th and RQ_Th represent the panoptic, segmentation and recognition qualities of instance objects (things).

B. Ablation experiments

1) Instance contour segmentation loss function:
As mentioned before, we aim to predict thin and crisp instance contours. We study the different loss functions discussed in Section III-B by evaluating the performance of instance and panoptic segmentation as shown in Table I. We used a ResNet-50 encoder as our backbone and separate heads with a common neck as discussed in Section III-A.
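For reference, the contour loss terms of Section III-B can be sketched in plain numpy. This is an illustrative sketch of Equations 2, 3 and 5, not the training code:

```python
import numpy as np

def weighted_bce(p, y, eps=1e-7):
    """Weighted binary cross-entropy (Eq. 2). beta is the ratio of
    non-edge pixels to all pixels, so rare edge pixels are up-weighted."""
    p = np.clip(p, eps, 1.0 - eps)
    beta = 1.0 - y.mean()
    return float(np.mean(-beta * y * np.log(p)
                         - (1.0 - beta) * (1.0 - y) * np.log(1.0 - p)))

def huber(p, y, delta=0.3):
    """Huber loss (Eq. 3): quadratic within delta, linear beyond it."""
    d = np.abs(p - y)
    return float(np.mean(np.where(d <= delta,
                                  0.5 * d ** 2,
                                  delta * d - 0.5 * delta ** 2)))

def total_loss(l_semantic, l_contour, l_center, lambdas=(1.0, 50.0, 0.1)):
    """Weighted combination of the three task losses (Eq. 5),
    with the lambda values used in the paper as defaults."""
    l1, l2, l3 = lambdas
    return l1 * l_semantic + l2 * l_contour + l3 * l_center
```

The large lambda on the contour term (50) reflects how heavily the contour quality drives the downstream instance extraction.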
TABLE II: Performance of instance and panoptic segmentation when different dilation rates were used to generate ground truth instance contours. Increasing the dilation rate increases the thickness of the ground truth instance contours.

We observed that the use of the Huber and NMS loss functions improved the performance of instance and panoptic segmentation. The weighted multi-label binary cross entropy combined with the Huber loss is the best combination we found. We use this combination for the rest of the experiments in the paper. Qualitative results in Figure 5 demonstrate that the contours generated are thin and crisp when this combination is used.
2) Instance contour ground truth dilation rate:
We generate our ground truth instance contours by applying a contour detection algorithm on the instance masks provided for different objects in the Cityscapes dataset. The number of edge pixels is comparatively lower than the number of non-edge pixels in our contour segmentation problem. We can alleviate this class imbalance using appropriate loss functions as discussed in Section III-B or by dilating the contours and increasing their thickness. In Table II, we evaluate the performance of instance and panoptic segmentation for different dilation rates.

Fig. 6: Panoptic segmentation results on the Cityscapes [44] dataset. Results obtained with a ResNet-50 [22] encoder using a separate neck architecture, the wBCE + Huber loss combination, and split and merge refinement with min instance area = 300 pixels. Instance contour ground truth is generated with dilation rate = 2.
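Ground truth contour generation with dilation can be sketched as below. The 4-neighbourhood boundary test and the naive shift-based dilation are illustrative simplifications of the contour detection and dilation described above, not the exact tooling used by the authors:

```python
import numpy as np

def instance_contours(instance_map):
    """Mark pixels whose right or bottom neighbour carries a different
    instance id, i.e. pixels on an instance boundary."""
    c = np.zeros(instance_map.shape, dtype=bool)
    c[:-1, :] |= instance_map[:-1, :] != instance_map[1:, :]
    c[:, :-1] |= instance_map[:, :-1] != instance_map[:, 1:]
    return c

def dilate(mask, rate=2):
    """Naive binary dilation: thicken contours by `rate` pixels to
    ease the edge/non-edge class imbalance."""
    out = mask.copy()
    for _ in range(rate):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]
        grown[:-1, :] |= out[1:, :]
        grown[:, 1:] |= out[:, :-1]
        grown[:, :-1] |= out[:, 1:]
        out = grown
    return out
```

Each dilation step grows the contour by one pixel in every direction, so rate = 2 yields contours roughly five pixels thick.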
TABLE III: Evaluation of instance and panoptic segmentation performance before and after refinement (split and merge) using offsets predicted by center regression results.

min Instance Area   AP     PQ     PQ_Th   SQ_Th   RQ_Th
500                 23.6   47.0   32.7    75.5    42.4

TABLE IV: Impact of the minimum instance area threshold during refinement of instance segmentation.

We observed that when an appropriate loss combination is used, the dilation rate doesn't have a significant impact on performance. However, increasing the dilation rate from 2 to 3 decreases the performance. We use a dilation rate of 2 to generate our ground truth contours for all other experiments.
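The panoptic quality metric reported throughout these tables can be sketched as follows, following the PQ definition of [2] for a single class, where predicted and ground truth segments match when their IoU exceeds 0.5:

```python
import numpy as np

def panoptic_quality(pred_segments, gt_segments):
    """PQ over lists of boolean masks for one class:
    PQ = sum of matched IoUs / (TP + 0.5*FP + 0.5*FN).
    Matches at IoU > 0.5 are unique by construction [2]."""
    iou_sum, tp = 0.0, 0
    matched_gt = set()
    for p in pred_segments:
        for j, g in enumerate(gt_segments):
            if j in matched_gt:
                continue
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            if union and inter / union > 0.5:
                iou_sum += inter / union
                tp += 1
                matched_gt.add(j)
                break
    fp = len(pred_segments) - tp
    fn = len(gt_segments) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0
```

Perfect predictions give PQ = 1.0, and each spurious or missed segment costs half a point in the denominator, which is why removing small artifact instances (Table IV) matters.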
3) Refining Instance Segmentation:
As discussed in Section III-D, we refine our instance segmentation output using center regression results. We evaluate the effects of the split and merge components in our refinement process in Table III and evaluate the effect of the minimum instance area in Table IV.

Neck       Backbone    mIoU   PQ_St   AP     PQ_Th   PQ
Shared     ResNet-50   67.5   57.4    24.3   33.2    47.8
Separate   ResNet-50

TABLE V: Performance of semantic, instance and panoptic segmentation using different network architecture choices.

We observed that refining the instance segmentation using offsets predicted by center regression marginally improves the performance of instance segmentation. However, the refinement is critical in cases where a broken contour misses the boundary between two instances, which can then be wrongly predicted as a single instance. Similarly, an occlusion by a pole or a low width object can mislead connected component labeling into interpreting the resulting contours as separate instances. Qualitative results in Figure 5 suggest that the offsets predicted by the center regression head are accurate for objects that are closer, while they are less accurate for objects farther away. We observed that choosing an appropriate minimum instance area threshold is critical in determining the performance of our proposed method. A lower instance area threshold allows us to remove unwanted instances generated due to artifacts in contour estimation. Such artifacts could be a result of false contours around mirrors of cars, convex hulls, occlusions, etc.
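The minimum area filtering can be sketched as below. Note this illustrative version only drops small instances, whereas the paper additionally reassigns the removed pixels to the instance with the nearest regressed centroid:

```python
import numpy as np

def filter_small_instances(labels, min_area=300):
    """Remove instances smaller than min_area pixels by setting them
    to 0 (unassigned). min_area=300 matches the setting used for the
    qualitative results in Figures 5 and 6."""
    out = labels.copy()
    ids, counts = np.unique(labels, return_counts=True)
    for inst, area in zip(ids, counts):
        if inst != 0 and area < min_area:
            out[out == inst] = 0
    return out
```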
4) Network Ablation:
We experimented with different network architecture choices as discussed in Section III-A. We studied the impact of using a shared neck vs. a separate neck layer to upsample and add features from a common feature pyramid network. We also studied how the depth of the ResNet backbone affects performance.

Method                          mIoU   PQ_St   AP     PQ_Th   PQ
Two Stage Object detection
Mask R-CNN [12]                 -      -       31.5   -       -
Weakly Supervised [11]          71.6   52.9    24.3   39.6    47.3
Panoptic-FPN [1]                74.5   62.4    32.2   51.3    57.7
Instance Clustering
Kendall et al. [13]             78.5   -       21.6   -       -
Panoptic-DeepLab [14]           78.2   -       32.7   -       60.3
Single-stage Object detection
Poly YOLO [16]*                 -      -       8.7    -       -
Others
Uhrig et al. [47]               64.3   -       8.9    -       -
Deep Watershed [48]             -      -       19.4   -       -
SGN [49]                        -      -       29.2   -       -
Ours [ResNet-50]                69.6   58.6    25.0   34.0    48.3
Ours [ResNet-101]               68.7   59.3    24.9   33.2    48.4

TABLE VI: Comparison with other state-of-the-art methods on the Cityscapes val [44] dataset. *Poly YOLO [16] is evaluated on a resized input image of size 416 × 832.

C. State of the Art Comparison
In Table VI, we compare our proposed methods againstother semantic, instance and panoptic segmentation methods.
1) Comparison with Two-stage methods:
As discussed in Section II-B, two stage object detection methods [1], [12], [50], [11] dominate the state of the art in instance and panoptic segmentation. However, they incur additional compute costs in generating object detections followed by foreground mask generation. Mask R-CNN [12] for instance segmentation runs at a low frame rate even on a high-end GPU like the Nvidia Titan X.
2) Comparison with Instance clustering:
Kendall et al. [13] was one of the early works that used multi-task learning to simultaneously learn semantic and instance segmentation. Panoptic-DeepLab [14] recently proposed a strong baseline for center regression based methods by exploiting the effectiveness of dual Atrous Spatial Pyramid Pooling (ASPP) modules. We believe that using an ASPP module in our network would improve our semantic segmentation performance and eventually lead to better instance and panoptic segmentation results. However, ASPP modules are computationally very expensive compared to feature pyramid networks [1].
3) Comparison with Single-stage object detection and Others:
Poly YOLO [16] reported ~22 fps on a 416 × 832 image with an AP score of 8.7. Other methods like Deep Watershed [48] and SGN [49] incur a huge computational complexity in their instance assignment techniques. Our methods are lightweight compared to object detection and instance clustering based methods, and better in terms of performance compared with other single-stage methods.

V. CONCLUSION

In this paper, we presented a new approach to panoptic segmentation using instance contours. Our method is one of the first approaches where instance segmentation is generated as a byproduct of a semantic segmentation network. We evaluated the performance of our semantic, instance and panoptic segmentation results on the Cityscapes dataset. We presented several ablation studies that help understand the impact of the architecture and training choices that we made. We believe that our proposed method opens a new direction in research of instance and panoptic segmentation and serves as a baseline for contour based methods.
REFERENCES

[1] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6399–6408, 2019.
[2] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019.
[3] Andra Petrovai and Sergiu Nedevschi. Multi-task network for panoptic segmentation in automated driving. Pages 2394–2401. IEEE, 2019.
[4] Daan de Geus, Panagiotis Meletis, and Gijs Dubbelman. Single network panoptic segmentation for street scene understanding. Pages 709–715. IEEE, 2019.
[5] Donghao Zhang, Yang Song, Dongnan Liu, Haozhe Jia, Siqi Liu, Yong Xia, Heng Huang, and Weidong Cai. Panoptic segmentation with an end-to-end cell r-cnn for pathology image analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 237–244. Springer, 2018.
[6] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[7] Ganesh Sistu, Isabelle Leang, Sumanth Chennupati, Senthil Yogamani, Ciarán Hughes, Stefan Milz, and Samir Rawashdeh. Neurall: Towards a unified visual perception model for automated driving. Pages 796–803. IEEE, 2019.
[8] Sumanth Chennupati, Ganesh Sistu, Senthil Yogamani, and Samir A Rawashdeh. Multinet++: Multi-stream feature aggregation and geometric loss strategy for multi-task learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[9] Sumanth Chennupati, Ganesh Sistu, Senthil Yogamani, and Samir A Rawashdeh. Auxnet: Auxiliary tasks enhanced semantic segmentation for automated driving. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, pages 645–652. INSTICC, SciTePress, 2019.
[10] L. Porzi, S. R. Bulò, A. Colovic, and P. Kontschieder. Seamless scene segmentation. Pages 8269–8278, 2019.
[11] Qizhu Li, Anurag Arnab, and Philip HS Torr. Weakly- and semi-supervised panoptic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 102–118, 2018.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[13] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.
[14] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194, 2019.
[15] Enze Xie, Peize Sun, Xiaoge Song, Wenhai Wang, Xuebo Liu, Ding Liang, Chunhua Shen, and Ping Luo. Polarmask: Single shot instance segmentation with polar representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[16] Petr Hurtik, Vojtech Molek, Jan Hula, Marek Vajgl, Pavel Vlasanek, and Tomas Nejezchleba. Poly-yolo: higher speed, more precise detection and instance segmentation for yolov3. arXiv preprint arXiv:2005.13243, 2020.
[17] Zhiding Yu, Chen Feng, Ming-Yu Liu, and Srikumar Ramalingam. Casenet: Deep category-aware semantic edge detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5964–5973, 2017.
[18] David Acuna, Amlan Kar, and Sanja Fidler. Devil is in the edges: Learning semantic boundaries from noisy annotations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[19] Derek Hoiem, James Hays, Jianxiong Xiao, and Aditya Khosla. Guest editorial: Scene understanding. International Journal of Computer Vision, 112(2):131–132, 2015.
[20] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pages 319–345. Springer, 1999.
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[23] Mennatullah Siam, Sara Elkerdawy, Martin Jagersand, and Senthil Yogamani. Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges. Pages 1–8. IEEE, 2017.
[24] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
[25] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[26] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.
[28] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
[29] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.
[30] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[31] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[33] Xiaodan Liang, Liang Lin, Yunchao Wei, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. Proposal-free network for instance-level object segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2978–2991, 2017.
[34] Mihael Ankerst, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. Optics: ordering points to identify the clustering structure. ACM Sigmod Record, 28(2):49–60, 1999.
[35] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise.
[36] Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang. Object contour detection with a fully convolutional encoder-decoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 193–202, 2016.
[37] Lucas J Van Vliet, Ian T Young, and Guus L Beckers. A nonlinear laplace operator as edge detector in noisy images. Computer Vision, Graphics, and Image Processing, 45(2):167–195, 1989.
[38] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1395–1403, 2015.
[39] Ruoxi Deng, Chunhua Shen, Shengjun Liu, Huibing Wang, and Xinru Liu. Learning to predict crisp boundaries. In
The European Conferenceon Computer Vision (ECCV) , September 2018.[40] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai.Richer convolutional features for edge detection. In
Proceedings of theIEEE conference on computer vision and pattern recognition , pages3000–3009, 2017.[41] Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. Deepedge: Amulti-scale bifurcated deep network for top-down contour detection.In
Proceedings of the IEEE conference on computer vision and patternrecognition , pages 4380–4389, 2015.[42] Sida Peng, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, and XiaoweiZhou. Deep snake for real-time instance segmentation. In
Proceed-ings of the IEEE/CVF Conference on Computer Vision and PatternRecognition (CVPR) , June 2020.[43] Hanan Samet and Markku Tamminen. Efficient component labelingof images of arbitrary dimension represented by linear bintrees.
IEEEtransactions on pattern analysis and machine intelligence , 10(4):579–586, 1988.[44] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld,Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth,and Bernt Schiele. The cityscapes dataset for semantic urban sceneunderstanding. In
Proceedings of the IEEE conference on computervision and pattern recognition , pages 3213–3223, 2016.[45] Yuxin Wu and Kaiming He. Group normalization. In
Proceedingsof the European conference on computer vision (ECCV) , pages 3–19,2018.[46] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In , pages248–255. Ieee, 2009.[47] Jonas Uhrig, Marius Cordts, Uwe Franke, and Thomas Brox. Pixel-level encoding and depth layering for instance-level semantic labeling.In
German Conference on Pattern Recognition , pages 14–25. Springer,2016.[48] Min Bai and Raquel Urtasun. Deep watershed transform for instancesegmentation. In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pages 5221–5229, 2017.[49] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sgn: Sequentialgrouping networks for instance segmentation. In
Proceedings of theIEEE International Conference on Computer Vision , pages 3496–3504,2017.[50] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang.Attention-guided unified network for panoptic segmentation. In2019IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR)