Instance and Panoptic Segmentation Using Conditional Convolutions
Zhi Tian∗, Bowen Zhang∗, Hao Chen, Chunhua Shen

Abstract – We propose a simple yet effective framework for instance and panoptic segmentation, termed CondInst (conditional convolutions for instance and panoptic segmentation). In the literature, top-performing instance segmentation methods typically follow the paradigm of Mask R-CNN and rely on ROI operations (typically ROIAlign) to attend to each instance. In contrast, we propose to attend to the instances with dynamic conditional convolutions. Instead of using instance-wise ROIs as inputs to an instance mask head of fixed weights, we design dynamic instance-aware mask heads, conditioned on the instances to be predicted. CondInst enjoys three advantages: 1) Instance and panoptic segmentation are unified into a fully convolutional network, eliminating the need for ROI cropping and feature alignment. 2) The elimination of the ROI cropping also significantly improves the output instance mask resolution. 3) Due to the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference time per instance and making the overall inference time almost constant, irrespective of the number of instances. We demonstrate a simpler method that achieves improved accuracy and inference speed on both instance and panoptic segmentation tasks. On the COCO dataset, we outperform a few state-of-the-art methods. We hope that CondInst can be a strong baseline for instance and panoptic segmentation. Code is available at: https://git.io/AdelaiDet
Index Terms – Fully convolutional networks, conditional convolutions, instance segmentation, panoptic segmentation

Fig. 1 – CondInst uses instance-aware mask heads to predict the mask for each instance. K is the number of instances to be predicted. Note that each output map only contains the mask of one instance. The filters in the mask head vary with different instances, which are dynamically generated and conditioned on the target instance. ReLU is used as the activation function (excluding the last conv. layer).

1 Introduction
Instance segmentation is a fundamental yet challenging task in computer vision, which requires an algorithm to predict a per-pixel mask with a category label for each instance of interest in an image. Panoptic segmentation further requires the algorithm to segment the stuff (e.g., sky and grass), assigning every pixel in the image a semantic label. Panoptic segmentation is often built on an instance segmentation framework with an extra semantic segmentation branch.
∗ Work was done when all authors were with The University of Adelaide, Australia. The first two authors contributed equally to this work. C. Shen is with Monash University, Australia, and is the corresponding author (e-mail: [email protected]).

Therefore, both instance and panoptic segmentation share the same key challenge: how to efficiently and effectively distinguish individual instances.

Despite a few works being proposed recently, the dominant method tackling this challenge is still the two-stage approach of Mask R-CNN [17], which casts instance segmentation into a two-stage detection-and-segmentation task. To be specific, Mask R-CNN first employs an object detector (Faster R-CNN) to predict a bounding-box for each instance. Then, for each instance, regions-of-interest (ROIs) are cropped from the network's feature maps using the ROIAlign operation. To predict the final mask for each instance, a compact fully convolutional network (FCN) (i.e., the mask head) is applied to these ROIs to perform foreground/background segmentation. However, this ROI-based method may have the following drawbacks. 1) Since ROIs are often axis-aligned bounding-boxes, for objects with irregular shapes, they may contain an excessive amount of irrelevant image content, including background and other instances. This issue may be mitigated by using rotated ROIs, but at the price of a more complex pipeline. 2) In order to distinguish between the foreground instance and the background stuff or instance(s), the mask head requires a relatively large receptive field to encode sufficiently rich context information. As a result, a stack of 3×3 convolutions is needed in the mask head (e.g., four 3×3 convolutions with 256 channels in Mask R-CNN). This considerably increases the computational complexity of the mask head, with the result that the inference time varies significantly with the number of instances. 3) ROIs are typically of different sizes. In order to use efficient batched computation in modern deep learning frameworks [1], [33], a resizing operation is often required to resize the cropped regions into patches of the same size. For instance, Mask R-CNN resizes all the cropped regions to 14×14 (upsampled to 28×28 using a deconvolution), which restricts the output resolution of instance segmentation, as large instances would require higher resolutions to retain details at the boundary.

In computer vision, the closest task to instance segmentation is semantic segmentation, for which fully convolutional networks (FCNs) have shown dramatic success [8], [19], [31], [32], [41]. FCNs have also shown excellent performance on many other per-pixel prediction tasks, ranging from low-level image processing such as denoising and super-resolution; to mid-level tasks such as optical flow estimation and contour detection; and high-level tasks including recent single-shot object detection [43], monocular depth estimation [2], [29], [53], [54] and counting [5]. However, almost all the instance segmentation methods based on FCNs¹ lag behind state-of-the-art ROI-based methods. Why do the versatile FCNs perform unsatisfactorily on instance segmentation? This is because FCNs tend to yield similar predictions for similar image appearance. As a result, vanilla FCNs are incapable of distinguishing individual instances. For example, if two persons A and B with similar appearance are in an input image, when predicting the instance mask of A, the FCN needs to predict B as background w.r.t.
A, which can be difficult as they look similar in appearance. Therefore, an ROI operation is used to crop the person of interest, i.e., A, and filter out B. Essentially, this is the core operation that makes the model attend to an instance. Instead of using ROIs, CondInst attends to each instance by using instance-sensitive convolution filters as well as relative coordinates that are appended to the feature maps.

Specifically, unlike Mask R-CNN, which uses a standard convolutional network with a fixed set of filters as the mask head for predicting all instances, here the network parameters are adapted according to the instance to be predicted. Inspired by dynamic filter networks [21] and CondConv [51], for each instance, a controller sub-network (see Fig. 3) dynamically generates the mask FCN's parameters (conditioned on the center area of the instance), which are then used to predict the mask of this instance. It is expected that the network parameters can encode the characteristics (e.g., relative position, shape and appearance) of this instance, so that the mask head only fires on the pixels of this instance, which thus bypasses the difficulty mentioned above. These conditional mask heads are applied to the whole feature maps, eliminating the need for ROI operations. At first glance, the idea may not work well, as instance-wise mask heads may incur a large number of network parameters, given that some images contain as many as dozens of instances. However, as the filters of a mask head are only asked to predict the mask of one instance, this largely eases the learning requirement and thus reduces the load on the filters. As a result, the mask head can be extremely light-weight. We will show that a very compact mask head with dynamically-generated filters can already outperform the previous ROI-based Mask R-CNN. This compact mask head also results in much reduced computational complexity per instance compared with the mask head in Mask R-CNN.
1. By FCNs, we mean the vanilla FCNs in [32] that only involve convolutions and pooling.

We summarize our main contributions as follows.

• We attempt to solve instance segmentation from a new perspective that uses dynamic mask heads. This novel solution achieves improved instance segmentation performance over existing methods such as Mask R-CNN. To our knowledge, this is the first time that a new instance segmentation framework outperforms recent state-of-the-art methods in both accuracy and speed.

• CondInst is fully convolutional and avoids the aforementioned resizing operation used in many existing methods, as CondInst does not rely on ROI operations. Not having to resize feature maps leads to high-resolution instance masks with more accurate edges, as shown in Fig. 2.

• Since the mask head in CondInst is very compact, compared with the box detector FCOS, CondInst needs only ∼10% more computational time to obtain the mask results, even when processing the maximum number of instances per image (i.e., 100 instances). The overall inference time is also stable, as it does not depend on the number of instances in the image.

• With an extra semantic segmentation branch, CondInst can be easily extended to panoptic segmentation [23], resulting in a unified fully convolutional network for both instance and panoptic segmentation tasks.

• CondInst achieves state-of-the-art performance on both instance and panoptic segmentation tasks while being fast and simple. We hope that CondInst can be a new strong alternative for instance and panoptic segmentation, as well as other instance-level recognition tasks such as keypoint detection.
2 Related Work

Here we review some work that is most relevant to ours.
Conditional Convolutions.
Unlike traditional convolutional layers, which have fixed filters once trained, the filters of conditional convolutions are conditioned on the input and are dynamically generated by another network (i.e., a controller). This idea has been explored previously in dynamic filter networks [21] and CondConv [51], mainly for the purpose of increasing the capacity of a classification network. In this work, we extend this idea to solve the significantly more challenging task of instance segmentation.
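To make the idea concrete, the following sketch (our own illustration in PyTorch; the module and variable names are hypothetical, not from any released code) shows a 1×1 convolution whose filters are produced per input by a controller, instead of being fixed learned parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    """A 1x1 conv whose filters are generated per input by a controller."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        # The controller maps globally pooled features to conv weights + biases.
        self.controller = nn.Linear(in_ch, out_ch * in_ch + out_ch)

    def forward(self, x):                            # x: (1, in_ch, H, W)
        theta = self.controller(x.mean(dim=(2, 3)))  # (1, out_ch*in_ch + out_ch)
        w, b = theta.split([self.out_ch * self.in_ch, self.out_ch], dim=1)
        w = w.view(self.out_ch, self.in_ch, 1, 1)
        # The filters differ for every input (shown here for batch size 1).
        return F.conv2d(x, w, bias=b.view(-1))

x = torch.randn(1, 16, 32, 32)
y = DynamicConv(16, 8)(x)                            # (1, 8, 32, 32)
```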
Instance Segmentation.
To date, the dominant framework for instance segmentation is still Mask R-CNN. Mask R-CNN first employs an object detector to detect the bounding-boxes of instances (i.e., ROIs). With these bounding-boxes, an ROI operation is used to crop the features of the instance from the feature maps. Finally, a compact FCN head is used to obtain the desired instance masks. Many works [7], [20], [30] with top performance are built on Mask R-CNN. Moreover, some works have explored applying FCNs to instance segmentation. InstanceFCN [13] may be the first fully convolutional instance segmentation method. InstanceFCN proposes to predict position-sensitive score maps with vanilla FCNs.
Fig. 2 – Qualitative comparisons with other methods. We compare the proposed CondInst against YOLACT [4] and Mask R-CNN [17]. Our masks are generally of higher quality (e.g., preserving finer details). Best viewed on screen.

Afterwards, these score maps are assembled to obtain the desired instance masks. Note that InstanceFCN does not work well with overlapping instances. Others [15], [35], [36] attempt to first perform segmentation, and the desired instance masks are formed by assembling the pixels of the same instance. Novotny et al. [37] propose semi-convolutional operators to make FCNs applicable to instance segmentation. To our knowledge, thus far none of these methods can outperform Mask R-CNN in both accuracy and speed on the COCO benchmark dataset.

The recent YOLACT [4] and BlendMask [6] may be viewed as reformulations of Mask R-CNN, which decouple ROI detection from the feature maps used for mask prediction. Wang et al. developed a simple FCN-based instance segmentation method, showing competitive performance [46]. PolarMask developed a new simple mask representation for instance segmentation [49], which extends the bounding-box detector FCOS [43].
Panoptic Segmentation.
There are two main approaches to solving this task. The first is the bottom-up approach, which first tackles the task as semantic segmentation and then uses clustering/grouping methods to assemble the pixels into individual instances or stuff [52]. The second is the top-down approach, which is often built on top-down instance segmentation methods. Panoptic-FPN [22] extends Mask R-CNN with an additional semantic segmentation branch and combines its results with the instance segmentation results generated by Mask R-CNN [23]. Moreover, attention-based methods have recently gained much popularity in many computer vision tasks and provide another route to panoptic segmentation. Axial-DeepLab [45] used a carefully designed module to enable attention to be applied to large-size images for panoptic segmentation. CondInst can easily be applied to panoptic segmentation following the top-down approaches. We empirically observe that the quality of the instance segmentation results may be the dominant factor for the final performance. Thus, in CondInst, without bells and whistles, by simply applying the same method used by Panoptic-FPN, the panoptic segmentation performance of CondInst is already competitive with state-of-the-art panoptic segmentation methods.

Additionally, AdaptIS [40] recently proposed to solve panoptic segmentation with FiLM [38]. The idea shares some similarity with CondInst in that information about an instance is encoded in the coefficients generated by FiLM. Since only the batch normalization coefficients are dynamically generated, AdaptIS needs a large mask head to achieve good performance. In contrast, CondInst directly encodes the instance information into the conv. filters of the mask head, which is much more straightforward and efficient. Also, as shown in the experiments, CondInst achieves much better panoptic segmentation accuracy than AdaptIS, which suggests that CondInst is much more effective.
3 Our Methods: Instance and Panoptic Segmentation with CondInst
We first present CondInst for instance segmentation. The instance segmentation framework is then extended to solve panoptic segmentation in Sec. 3.5.
Given an input image I ∈ R^{H×W×3}, the goal of instance segmentation is to predict the pixel-level mask and the category of each instance of interest in the image. The ground-truths are defined as {(M_i, c_i)}, where M_i ∈ {0, 1}^{H×W} is the mask for the i-th instance and c_i ∈ {1, 2, ..., C} is its category. C is 80 on MS-COCO [28]. In semantic segmentation, the prediction target of each pixel is well-defined: the semantic category of the pixel. In addition, the number of categories is known and fixed. Thus, the outputs of semantic segmentation can be easily represented with the output feature maps of the FCNs, and each channel of the output feature maps corresponds to a class.
Fig. 3 – The overall architecture of CondInst. C_3, C_4 and C_5 are the feature maps of the backbone network (e.g., ResNet-50). P_3 to P_7 are the FPN feature maps as in [26], [43]. F_bottom is the bottom branch's output, whose resolution is the same as that of P_3. Following [6], the bottom branch aggregates the feature maps P_3, P_4 and P_5. F̃_bottom is obtained by concatenating the relative coordinates to F_bottom. The classification head predicts the class probability p_{x,y} of the target instance at location (x, y), the same as in FCOS. The controller generates the filter parameters θ_{x,y} of the mask head for the instance. Similar to FCOS, there are also center-ness and box heads in parallel with the controller (not shown in the figure for simplicity). Note that the heads in the dashed box are repeatedly applied to P_3 ... P_7. The mask head is instance-aware, and is applied to F̃_bottom as many times as the number of instances in the image (refer to Fig. 1).

However, in instance segmentation, the prediction target of each pixel is hard to define, because instance segmentation also requires distinguishing individual instances, and the number of instances varies across images. This poses a major challenge when applying traditional FCNs [32] to instance segmentation.

In this work, our core idea is that for an image with K instances, K different mask heads are dynamically generated, and each mask head contains the characteristics of its target instance in its filters. As a result, when the mask head is applied to an input, it only fires on the pixels of that instance, thus producing the mask prediction of the instance and distinguishing individual instances. We illustrate the process in Fig. 1. The instance-aware filters are generated by modifying an object detector. Specifically, we add a new controller branch to generate the filters for the target instance of each box predicted by the detector, as shown in Fig. 3. Therefore, the number of dynamic mask heads is the same as the number of predicted boxes, which should equal the number of instances in the image if the detector works well. In this work, we build CondInst on the popular object detector FCOS [43] due to its simplicity and flexibility. The elimination of anchor-boxes in FCOS also saves parameters and computation.

As shown in Fig. 3, following FCOS [43], we make use of the feature maps {P_3, P_4, P_5, P_6, P_7} of feature pyramid networks (FPNs) [26], whose down-sampling ratios are 8, 16, 32, 64 and 128, respectively. On each feature level of the FPN, some functional layers (in the dashed box) are applied to make instance-aware predictions, for example, the class of the target instance and the dynamically-generated filters for the instance. In this sense, CondInst can be viewed in the same way as Mask R-CNN: both first attend to the instances in an image and then predict the pixel-level masks of the instances (i.e., instance-first).

Moreover, recall that Mask R-CNN employs an object detector to predict the bounding-boxes of the instances in the input image. The bounding-boxes are the way that Mask R-CNN represents instances. Similarly, CondInst employs the instance-aware filters to represent the instances. In other words, instead of encoding the instance information with bounding-boxes, CondInst implicitly encodes it with the parameters of the dynamic mask heads, which is much more flexible. For example, it can easily represent the irregular shapes that are hard to tightly enclose with a bounding-box. This is one of CondInst's advantages over previous ROI-based methods.
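As a concrete sketch of the controller branch (our own illustration with hypothetical module names; it assumes the FCOS-style tower of four 3×3 conv layers with 256 channels), the filter parameters can be produced by a conv head that outputs an N-channel map, so that every location carries the parameter vector of one candidate instance:

```python
import torch
import torch.nn as nn

NUM_PARAMS = 169  # total number of mask-head parameters (see the Controller Head)

class ControllerHead(nn.Module):
    """Predicts an N-D parameter vector theta_{x,y} at every FPN location."""
    def __init__(self, in_ch=256, num_convs=4):
        super().__init__()
        tower = []
        for _ in range(num_convs):  # same tower design as the classification head
            tower += [nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU()]
        self.tower = nn.Sequential(*tower)
        self.pred = nn.Conv2d(in_ch, NUM_PARAMS, 3, padding=1)

    def forward(self, p_level):                 # p_level: (B, 256, H, W), one FPN level
        return self.pred(self.tower(p_level))   # (B, 169, H, W): theta at each (x, y)

theta_map = ControllerHead()(torch.randn(2, 256, 50, 68))
```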
Besides the detector, as shown in Fig. 3, there is also a bottom branch, which provides the feature maps (denoted by F_bottom) that our generated mask heads take as input to predict the desired instance mask. The bottom branch aggregates the FPN feature maps P_3, P_4 and P_5. To be specific, P_4 and P_5 are upsampled to the resolution of P_3 with bilinear interpolation and added to P_3. After that, four 3×3 convolutions with 128 channels are applied. The resolution of the resulting feature maps is the same as that of P_3 (i.e., 1/8 of the input image resolution). Finally, another convolutional layer is used to reduce the number of output channels C_bottom from 128 to 8, which reduces the number of generated parameters. Surprisingly, C_bottom = 8 can already achieve good performance, and as shown in our experiments, a larger C_bottom (e.g., 16) does not improve the performance. Even more aggressively, using C_bottom = 1 only degrades the performance by a small margin in mask AP. F_bottom is shared among all the instances.

Moreover, as mentioned before, the generated filters also encode the shape and position of the target instance. Since the feature maps do not generally convey position information, a map of coordinates needs to be appended to F_bottom such that the generated filters are aware of positions. As the filters are generated with location-agnostic convolutions, they can only (implicitly) encode the shape and position with coordinates relative to the location where the filters are generated (i.e., using the coordinate system with that location as the origin). Thus, as shown in Fig. 3, F_bottom is combined with a map of relative coordinates, which are obtained by transforming all the locations on F_bottom to the coordinate system with the filter-generating location as the origin. Then, the combination is sent to the mask head to predict the instance mask in a fully convolutional fashion. The relative coordinates provide a strong cue for predicting the instance mask, as shown in our experiments. It is also interesting to note that even if the generated mask heads take only the map of relative coordinates as input, a modest performance can be obtained, as shown in the experiments. This empirically proves that the generated filters indeed encode the shape and position of the target instance. Finally, sigmoid is used as the last layer of the mask head to obtain the mask scores. The mask head only classifies the pixels as foreground or background. The class of the instance is predicted by the classification head in the detector, as shown in Fig. 3.

The resolution of the original mask prediction is the same as the resolution of F_bottom, which is 1/8 of the input image resolution. In order to improve the resolution of instance masks, we use bilinear interpolation to upsample the mask prediction by 4, resulting in 400×512 instance masks (if the input image size is 800×1024). This resolution is much higher than that of Mask R-CNN (only 28×28, as mentioned before).
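The relative-coordinate map described above can be built as in the following sketch (our own simplified code; the function name, the stride handling and the normalization constant are assumptions for illustration):

```python
import torch

def rel_coord_features(f_bottom, center_xy, stride=8):
    """Concatenate relative coordinates to F_bottom for one instance.

    f_bottom:  (C_bottom, H, W) mask-branch features (stride 8 w.r.t. the image).
    center_xy: (2,) image-space location that generated the filters.
    """
    _, h, w = f_bottom.shape
    ys = torch.arange(h, dtype=torch.float32) * stride
    xs = torch.arange(w, dtype=torch.float32) * stride
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    # Coordinates relative to the filter-generating location (origin at the instance).
    rel = torch.stack([gx - center_xy[0], gy - center_xy[1]]) / 128.0  # rough normalization
    return torch.cat([f_bottom, rel], dim=0)  # (C_bottom + 2, H, W)

f = torch.randn(8, 100, 128)
f_tilde = rel_coord_features(f, torch.tensor([320.0, 240.0]))  # (10, 100, 128)
```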
Similar to FCOS, each location on the FPN's feature maps P_i is either associated with an instance, thus being a positive sample, or is considered a negative sample. The associated instance and label for each location are determined as follows. Let us consider the feature map P_i ∈ R^{H×W×C} and let s be its down-sampling ratio. As shown in previous works [18], [39], [43], a location (x, y) on the feature map can be mapped back onto the input image as (⌊s/2⌋ + xs, ⌊s/2⌋ + ys). If the mapped location falls into the center region of an instance, the location is considered to be responsible for that instance. Any locations outside the center regions are labeled as negative samples. The center region is defined as the box (c_x − rs, c_y − rs, c_x + rs, c_y + rs), where (c_x, c_y) denotes the mass center of the instance mask, s is the down-sampling ratio of P_i, and r is a constant scalar being 1.5, as in FCOS [43].
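A sketch of this assignment rule for a single location (our own simplified code; it omits how ambiguity is resolved when a location falls into the center regions of several instances, e.g., by FPN-level ranges and minimal area as in FCOS):

```python
import torch

def assign_location(x, y, s, mask_centers, r=1.5):
    """Return the index of the instance a feature-map location is responsible
    for, or -1 for a negative sample (a sketch of the center-sampling rule).

    x, y:         location on P_i;  s: down-sampling ratio of P_i.
    mask_centers: (K, 2) mass centers (c_x, c_y) of the instance masks.
    """
    # Map the location back onto the input image.
    px, py = s // 2 + x * s, s // 2 + y * s
    for k, (cx, cy) in enumerate(mask_centers.tolist()):
        if cx - r * s <= px <= cx + r * s and cy - r * s <= py <= cy + r * s:
            return k          # falls into the center region of instance k
    return -1                 # negative sample
```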
As shown in Fig. 3, at a location (x, y) on P_i, CondInst has the following output heads.

Classification Head. The classification head predicts the class of the instance associated with the location. The ground-truth target is the instance's class c_i or 0 (i.e., background). As in FCOS, the network predicts a C-D vector p_{x,y} for classification, and each element in p_{x,y} corresponds to a binary classifier, where C is the number of categories.

Controller Head.
The controller head, which has the same architecture as the classification head, is used to predict the parameters of the conv. filters of the mask head for the instance at the location. The mask head predicts the mask for this particular instance. This is the core contribution of our work. To predict the parameters, we concatenate all the parameters of the filters (i.e., weights and biases) together as an N-D vector θ_{x,y}, where N is the total number of parameters. Accordingly, the controller head has N output channels. The mask head is a very compact FCN architecture, which has three 1×1 convolutions, each having 8 channels and using ReLU as the activation function, except for the last one. No normalization layer such as batch normalization is used here. The last layer has 1 output channel and uses sigmoid to predict the probability of being foreground. The mask head has 169 parameters in total: the weights are (8+2)×8 (conv1) + 8×8 (conv2) + 8×1 (conv3), and the biases are 8 (conv1) + 8 (conv2) + 1 (conv3). The masks predicted by the mask heads are trained with the ground-truth instance masks, which pushes the controller to generate the correct filters.
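To make this layout concrete, the following sketch (our own simplified code for a single instance; the official implementation batches all instances with grouped convolutions) splits the 169-D vector θ_{x,y} into the three conv layers and runs the mask head:

```python
import torch
import torch.nn.functional as F

# Channel layout: the input has C_bottom + 2 = 10 channels; hidden layers have 8.
# Weights: 10*8 + 8*8 + 8*1 = 152; biases: 8 + 8 + 1 = 17; total 169 parameters.
SPLITS = [10 * 8, 8, 8 * 8, 8, 8 * 1, 1]   # w1, b1, w2, b2, w3, b3

def mask_head(f_tilde, theta):
    """f_tilde: (1, 10, H, W); theta: (169,) generated by the controller."""
    w1, b1, w2, b2, w3, b3 = theta.split(SPLITS)
    x = F.relu(F.conv2d(f_tilde, w1.view(8, 10, 1, 1), b1))
    x = F.relu(F.conv2d(x, w2.view(8, 8, 1, 1), b2))
    x = F.conv2d(x, w3.view(1, 8, 1, 1), b3)
    return torch.sigmoid(x)                 # (1, 1, H, W) foreground probability

out = mask_head(torch.randn(1, 10, 64, 64), torch.randn(169))
```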
Box Head. The box head is the same as that in FCOS; it predicts a 4-D vector encoding the four distances from the location to the four boundaries of the bounding-box of the target instance. Conceptually, CondInst could eliminate the box head, since CondInst needs no ROIs. However, we find that using box-based NMS greatly reduces the inference time. Thus, we still predict boxes in CondInst. We would like to highlight that the predicted boxes are only used in NMS and do not involve any ROI operations. Moreover, as shown in Table 5, the box prediction can be removed if no box information is used (e.g., with mask NMS [47]). This is fundamentally different from previous ROI-based methods, in which the box prediction is mandatory.
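For reference, with regression targets (l∗, t∗, r∗, b∗) at an image-space location (x, y), the box decodes exactly as in FCOS:

\[
x_0 = x - l^*, \qquad y_0 = y - t^*, \qquad x_1 = x + r^*, \qquad y_1 = y + b^*.
\]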
Center-ness Head.
Like FCOS, at each location we also predict a center-ness score. The center-ness score depicts how far the location deviates from the center of the target instance. In inference, it is used to down-weight the boxes predicted by the locations far from the center, which might be unreliable. We refer readers to FCOS [43] for the details.
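Concretely, FCOS defines the center-ness of a positive location from its regression targets as

\[
\mathrm{centerness} = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \cdot \frac{\min(t^*, b^*)}{\max(t^*, b^*)}},
\]

which equals 1 at the box center and decays towards 0 near the box boundaries.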
Formally, the overall loss function of CondInst can be formulated as

\[
L_{\mathrm{overall}} = L_{\mathrm{fcos}} + \lambda L_{\mathrm{mask}}, \tag{1}
\]

where L_fcos and L_mask denote the original loss of FCOS and the loss for instance masks, respectively. λ, being 1 in this work, is used to balance the two losses. L_fcos is the same as in FCOS. Specifically, L_fcos covers the classification head, the box regression head and the center-ness head, which are trained with the focal loss [27], the GIoU loss, and the binary cross-entropy (BCE) loss, respectively. L_mask is defined as

\[
L_{\mathrm{mask}}(\{\theta_{x,y}\}) = \frac{1}{N_{\mathrm{pos}}} \sum_{x,y} \mathbf{1}_{\{c^{*}_{x,y} > 0\}} \, L_{\mathrm{dice}}\big( \mathrm{MaskHead}(\tilde{F}_{x,y}; \theta_{x,y}), \, M^{*}_{x,y} \big), \tag{2}
\]

where c*_{x,y} is the classification label of location (x, y), which is the class of the instance associated with the location, or 0 (i.e., background) if the location is not associated with any instance. N_pos is the number of locations where c*_{x,y} > 0. 1_{{c*_{x,y}>0}} is the indicator function, being 1 if c*_{x,y} > 0 and 0 otherwise. θ_{x,y} are the generated filters' parameters at location (x, y). F̃_{x,y} ∈ R^{H_bottom×W_bottom×(C_bottom+2)} is the combination of F_bottom and a map of coordinates O_{x,y} ∈ R^{H_bottom×W_bottom×2}. As described before, O_{x,y} holds the relative coordinates from all the locations on F_bottom to (x, y) (i.e., the location where the filters are generated). MaskHead denotes the mask head, which consists of a stack of convolutions with dynamic parameters θ_{x,y}. M*_{x,y} ∈ {0, 1}^{H×W} is the mask of the instance associated with location (x, y).

Here, L_dice is the Dice loss as in [34], which is used to overcome the foreground-background sample imbalance. We do not employ focal loss here, as it requires special initialization, which is not trivial if the parameters are dynamically generated. Note that, in order to compute the loss between the predicted mask and the ground-truth mask M*_{x,y}, they are required to have the same size. As mentioned before, the resolution of the predicted mask is 1/2 of that of the ground-truth mask M*_{x,y}; thus, we down-sample M*_{x,y} by 2 to make the sizes equal. These operations are omitted in Eq. (2) for clarity.

By design, all the positive locations on the feature maps should be used to compute the mask loss. For images having hundreds of positive locations, the model would consume a large amount of memory. Therefore, in our preliminary version [42], the positive locations used in computing the mask loss were limited to at most 500 per GPU (i.e., 250 per image, with two images on one GPU). If there were more than 500 positive locations, 500 locations were randomly chosen. In this version, instead of randomly choosing the 500 locations, we first rank the locations by the scores predicted by the FCOS detector, and then choose the top-scoring locations for each instance. As a result, the number of sampled locations per image can be greatly reduced. This strategy works equally well and further reduces the memory footprint. For instance, using this strategy, the ResNet-50 based CondInst can be trained with 4 1080Ti GPUs.

Moreover, as shown in YOLACT [4] and BlendMask [6], during training, the instance segmentation task can benefit from a joint semantic segmentation task. Thus, we also conduct experiments with the joint semantic segmentation task, showing improved performance. However, unless explicitly specified, all the experiments in the paper are without the semantic segmentation task. If used, the semantic segmentation loss is added to L_overall.
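For reference, the Dice loss L_dice used in Eq. (2) can be sketched as follows, following the formulation of [34]; the smoothing constant eps is our own choice for numerical stability:

```python
import torch

def dice_loss(pred, target, eps=1e-5):
    """pred, target: (N, H*W) predicted probabilities and binary GT masks."""
    inter = (pred * target).sum(dim=1)
    union = (pred ** 2).sum(dim=1) + (target ** 2).sum(dim=1)
    return 1.0 - (2.0 * inter + eps) / (union + eps)   # (N,) per-instance loss
```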
2x conv !"!"!"!" conv → …CondInst instance seg. outputs Up-sample 4x
Panoptic seg.
Fig. 4 – Illustration of CondInst for panoptic segmentation by attaching a semantic segmentation branch. The semantic segmentation branch follows [22]. Results from the instance segmentation and semantic segmentation branches are combined using the same post-processing as in [23].
Given an input image, we forward it through the network to obtain the outputs, including the classification confidence p_{x,y}, the center-ness scores, the box prediction t_{x,y} and the generated parameters θ_{x,y}. We first follow the steps in FCOS to obtain the box detections. Afterwards, box-based NMS with a threshold of 0.6 is used to remove duplicated detections, and then the top 100 boxes are used to compute masks. Note that these boxes are also associated with the filters generated by the controller. Let us assume that K boxes remain after the NMS; we thus have K groups of generated filters. The K groups of filters are used to produce K instance-specific mask heads. These instance-specific mask heads are applied, in the fashion of FCNs, to F̃_{x,y} (i.e., the combination of F_bottom and O_{x,y}) to predict the masks of the instances. Since the mask head is a very compact network (three 1×1 convolutions with 8 channels and 169 parameters in total), the overhead of computing masks is extremely small. For example, even with 100 detections (i.e., the maximum number of detections per image on MS-COCO), only a few milliseconds in total are spent on the mask heads, which adds only ∼10% computational time to the base detector FCOS. In contrast, the mask head of Mask R-CNN has four 3×3 convolutions with 256 channels, thus having more than 2.3M parameters and taking much longer computational time.

Since panoptic segmentation can be treated as a combination of instance and semantic segmentation, we keep CondInst as is to obtain the instance segmentation results. For the semantic segmentation, we use the structure from Panoptic-FPN [22]. To be specific, as shown in Fig. 4, the semantic segmentation branch takes as inputs four levels of the FPN feature maps; the coarser maps are up-sampled to the same resolution as P_3 and the four feature maps are concatenated together. The resolution of P_3 is 1/8 of the input image, which is also the same as that of the instance masks predicted by CondInst. This is followed by a 1×1 convolution and softmax to obtain the classification scores. The classification scores are trained with the cross-entropy loss, and the loss is added to L_overall with a balancing loss weight.

Inference. During inference, we follow [22] to combine the instance and semantic results to obtain the final panoptic results. The instance results from CondInst are ranked by their confidence scores generated by FCOS. The results with scores less than 0.5 are discarded. When overlaps occur between the instance masks, the overlapping areas are attributed to the instance with the higher score. Then, an instance that loses more than 40% of its total area due to its overlap with other higher-scoring instances is discarded. Finally, the semantic results are filled into the areas that are not occupied by any instance.
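These merging rules can be sketched as follows (our own simplified implementation; the function and variable names are hypothetical, and the stuff-label offset is only for illustration):

```python
import torch

def merge_panoptic(inst_masks, inst_scores, sem_seg,
                   score_thr=0.5, overlap_thr=0.4):
    """inst_masks: (K, H, W) bool masks sorted by inst_scores in descending order.
    sem_seg: (H, W) long tensor of semantic (stuff) labels.
    Returns a panoptic label map with one label per pixel."""
    h, w = sem_seg.shape
    pan = torch.zeros(h, w, dtype=torch.long)           # 0 = unassigned
    next_id = 1
    for mask, score in zip(inst_masks, inst_scores):
        if score < score_thr:
            continue                                     # discard low-confidence instances
        free = mask & (pan == 0)                         # area not taken by higher scores
        if free.sum() < (1.0 - overlap_thr) * mask.sum():
            continue                                     # lost >40% of its area: discard
        pan[free] = next_id
        next_id += 1
    # Fill stuff into the areas not occupied by any instance (offset avoids id clashes).
    pan[pan == 0] = sem_seg[pan == 0] + next_id
    return pan
```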
4 Experiments
We evaluate CondInst on the large-scale benchmark MS-COCO [28]. Following common practice [17], [27], [43], our models are trained on the train2017 split (115K images), all the ablation experiments are evaluated on the val2017 split (5K images), and our main results are reported on the test-dev split (20K images).

Unless specified, we make use of the following implementation details. Following FCOS [43], ResNet-50 is used as our backbone network, initialized with the weights pre-trained on ImageNet [14]. The newly added layers are initialized as in [43]. Our models are trained with stochastic gradient descent (SGD) over 8 V100 GPUs for 90K iterations with an initial learning rate of 0.01 and a mini-batch of 16 images. The learning rate is reduced by a factor of 10 at iteration 60K and 80K, respectively. Weight decay and momentum are set to 0.0001 and 0.9, respectively. Following Detectron2 [48], the input images are resized to have their shorter sides in [640, 800] and their longer sides less than or equal to 1333 during training. Left-right flipping data augmentation is also used during training. When testing, we do not use any data augmentation, and only the scale with the shorter side being 800 is used. The inference time in this work is measured on a single V100 GPU with 1 image per batch.

TABLE 1 – Instance segmentation results with different architectures of the mask head on the MS-COCO val2017 split: (a) varying the depth (width = 8); (b) varying the width (depth = 3). "depth": the number of layers in the mask head. "width": the number of channels of these layers. "time": the milliseconds that the mask head takes for processing 100 instances.

In this section, we discuss the design choices of the mask head in CondInst. To our surprise, the performance is not sensitive to the architecture of the mask head. Our baseline is the mask head of three 1×1 convolutions with 8 channels (i.e., width = 8), which already achieves strong mask AP, as shown in Table 1. Next, we conduct experiments by varying the depth of the mask head. As shown in Table 1a, apart from the mask head with depth 1, all other mask heads attain similar performance. The mask head with depth 1 achieves inferior performance because in this case the mask head is actually a linear mapping, which has overly weak capacity and cannot encode the complex shapes of the instances. Moreover, as shown in Table 1b, varying the width (i.e., the number of channels) does not result in a remarkable performance change either, as long as the width is in a reasonable range. We also note that our mask head is extremely light-weight, as its filters are dynamically generated. As shown in Table 1, our baseline mask head only takes a few milliseconds per 100 instances (the maximum number of instances per image on MS-COCO), which suggests that the mask head adds only a small computational overhead to the base detector. Moreover, our baseline mask head only has 169 parameters in total. In sharp contrast, the mask head of Mask R-CNN [17] has more than 2.3M parameters and takes considerably longer computational time.

TABLE 2 – Instance segmentation results by varying the number of channels of the bottom branch's output (i.e., C_bottom) on the MS-COCO val2017 split. The performance keeps almost the same if C_bottom is in a reasonable range, which suggests that CondInst is robust to this design choice.

We further investigate the impact of the mask branch. We first change C_bottom, the number of channels of the mask branch's output feature maps (i.e., F_bottom). As shown in Table 2, as long as C_bottom is in a reasonable range, the performance keeps almost the same. C_bottom = 8 is optimal, and thus we use C_bottom = 8 in all other experiments by default.

As mentioned before, before being taken as the input of the mask heads, the mask branch's output F_bottom is concatenated with a map of relative coordinates, which provides a strong cue for the mask prediction.

TABLE 3 – Ablation study of the input to the mask head on the MS-COCO val2017 split. Without the relative coordinates, the performance drops significantly in mask AP. Using absolute coordinates instead does not improve the performance remarkably. If the mask head only takes the relative coordinates as input (i.e., no appearance in this case), CondInst still achieves modest performance.

As shown in Table 3, the performance drops significantly if the relative coordinates are removed. This significant performance drop implies that the generated filters encode not only the appearance cues but also the shape and relative position of the target instance. This is also evidenced by the experiment using only the relative coordinates: as shown in Table 3, using only the relative coordinates can still obtain decent mask AP. We would like to highlight that, unlike Mask R-CNN, which is based on Faster R-CNN and represents the target instance by an axis-aligned ROI, CondInst implicitly encodes the shape into the generated filters, which can easily represent any shape, including irregular ones, and is thus much more flexible. We also experiment with absolute coordinates, but they cannot largely boost the performance, as shown in Table 3. This suggests that the generated filters mainly carry translation-invariant cues such as shape and relative position, which is preferable.

TABLE 4 – The instance segmentation results on the MS-COCO val2017 split by changing the factor used to upsample the mask predictions. "resolution" denotes the resolution ratio of the mask prediction to the input image. Without the upsampling (i.e., factor = 1), the performance drops significantly.

As mentioned before, the original mask prediction is upsampled, and this upsampling is of great importance to the final performance. We confirm this in the experiment. As shown in Table 4, without the upsampling (1st row in the table), CondInst produces the mask prediction at 1/8 of the input image resolution, which achieves clearly inferior mask AP, because most of the details (e.g., the boundaries) are lost. If the mask prediction is upsampled by factor = 4, the performance is significantly improved. In particular, the improvement on small objects is large, which suggests that the upsampling can greatly retain the details of objects. Increasing the upsampling factor further slightly worsens the performance in some metrics, probably due to the relatively low-quality annotations of MS-COCO. Therefore, we use factor = 4 in all other models.

TABLE 5 – Instance segmentation results with different NMS algorithms. Mask-based NMS can obtain the same overall performance as box-based NMS, which suggests that CondInst can totally eliminate the box detection.
Although we still keep the bounding-box detection branch in CondInst, it is conceptually feasible to eliminate it entirely if we make use of an NMS that needs no bounding-boxes. In this case, all the foreground samples (determined by the classification head) are used to compute instance masks, and the duplicated masks are removed by mask-based NMS. This is confirmed in Table 5: with mask-based NMS, the same overall mask AP can be obtained as with box-based NMS.

We compare CondInst against previous state-of-the-art methods on the MS-COCO test-dev split. As shown in Table 6, with the 1× learning rate schedule (i.e., 90K iterations), CondInst outperforms the original Mask R-CNN in mask AP while also running considerably faster per image on a single V100 GPU. To our knowledge, it is the first time that a new and simpler instance segmentation method, without any bells and whistles, outperforms Mask R-CNN in both accuracy and speed. CondInst also obtains better performance and on-par speed compared with the well-engineered Mask R-CNN in Detectron2 (i.e., Mask R-CNN∗ in Table 6). Furthermore, with a longer training schedule (e.g., 3×) or a stronger backbone (e.g., ResNet-101), a consistent improvement is achieved as well. Moreover, as shown in Table 6, with the auxiliary semantic segmentation task, the performance can be further boosted for both ResNet-50 and ResNet-101, without increasing the inference time. For fair comparisons, all the inference times here are measured by ourselves on the same hardware with the official codes.

We also compare CondInst with recently-proposed instance segmentation methods. With only half the training iterations, CondInst surpasses TensorMask [9] by a large margin for both ResNet-50 and ResNet-101, and CondInst is also several times faster than TensorMask with similar performance. Moreover, CondInst outperforms YOLACT-700 [4] by a large margin with the same backbone ResNet-101 (both with the auxiliary semantic segmentation task). Finally, as shown in Fig. 2, compared with YOLACT-700 and Mask R-CNN, CondInst preserves more details and produces higher-quality instance segmentation results.
TABLE 6 – Instance segmentation comparisons with state-of-the-art methods on MS-COCO test-dev. "Mask R-CNN" is the original Mask R-CNN [17]. "Mask R-CNN∗" and "BlendMask∗" mean that the models are improved by Detectron2 [48]. "aug.": using multi-scale data augmentation during training. "sched.": the learning rate schedule used; 1× is 90K iterations, 2× is 180K iterations, and so on. The learning rate is changed as in [16]. "w/ sem.": using the auxiliary semantic segmentation task.

method | backbone | sched. | FPS | AP | AP50 | AP75
YOLACT-550++ [3] | R-50 | – | 44 | 34.1 | 53.3 | 36.2
YOLACT-550++ [3] | R-101 | – | 36 | 34.6 | 53.8 | 36.9
CondInst-RT shtw. | R-50 | 4× | 43 | 36.0 | 57.0 | 38.0
CondInst-RT shtw. | DLA-34 | 4× | 47 | – | – | –

TABLE 7 – The mask AP and inference speed of the real-time CondInst models on the COCO test-dev data. "shtw.": sharing the conv. towers between the classification and box regression branches in FCOS. Both YOLACT++ and CondInst use the auxiliary semantic segmentation loss here. All inference times are measured with a single V100 GPU. With the same R-50 backbone, CondInst-RT outperforms YOLACT++ by 1.9 AP at almost the same speed.

We also present a real-time version of CondInst. Following FCOS [44], the 3×3 conv. layers in the classification and box regression towers of FCOS are shared in the real-time models (denoted by "shtw." in Table 7). Moreover, we reduce the size of the input image from a scale of 800 to 512 during testing, and the FPN levels P_6 and P_7 are removed, since there are not many large objects at the small input size. In order to compensate for the performance loss due to the smaller input size, we use a more aggressive training strategy here. Specifically, the real-time models are trained for 360K iterations (i.e., 4×), and the shorter side of the input image is randomly chosen from the range 256 to 608 with step 32. Synchronized BatchNorm (SyncBN) is also used during training. In the real-time models, following YOLACT, we enable the extra semantic segmentation loss by default.

The performance and inference speed of these real-time models are shown in Table 7. As shown in the table, the R-50 based CondInst-RT outperforms the R-50 based YOLACT++ [3] by 1.9 AP (36.0 vs. 34.1) at almost the same speed (43 FPS vs. 44 FPS). By further using the stronger backbone DLA-34 [55], CondInst-RT can achieve 47 FPS with similar performance. Furthermore, if we do not share the classification and box regression towers in FCOS, the performance can be improved further with slightly longer inference time.
We also conduct instance segmentation experiments on Cityscapes [12]. The Cityscapes dataset is designed for the understanding of urban street scenes. For instance segmentation, it has 8 categories: person, rider, car, truck, bus, train, motorcycle, and bicycle. It includes 2975, 500 and 1525 images with fine annotations for training, validation and testing, respectively. It also has 20K training images with coarse annotations. Following Mask R-CNN [17], we only use the images with fine annotations to train our models. All images have the same resolution of 2048×1024.

We follow the training details in Detectron2 [48] to train CondInst on Cityscapes. Specifically, the models are trained for 24K iterations with batch size 8 (1 image per GPU). The initial learning rate is 0.01, which is reduced by a factor of 10 at step 18K. Since Cityscapes has relatively few images, following Mask R-CNN, we may initialize the models with the weights pre-trained on the COCO dataset if specified. Moreover, we use multi-scale data augmentation during training, and the shorter side of the images is sampled in the range from 800 to 1024 with step 32.

method | backbone | training data | AP [val] | AP | AP50 | person | rider | car | truck | bus | train | mcycle | bicycle
Mask R-CNN | ResNet-50-FPN | train | 31.5 | 26.2 | 49.9 | 30.5 | 23.7 | 46.9 | 22.8 | 32.2 | 18.6 | 19.1 | 16.0
CondInst | ResNet-50-FPN | train | 33.3 | – | – | – | – | – | – | – | – | – | –
Mask R-CNN | ResNet-50-FPN | train+COCO | 36.4 | 32.0 | 58.1 | 34.8 | 27.0 | 49.1 | 30.1 | 40.9 | 30.9 | 24.1 | 18.7
CondInst | ResNet-50-FPN | train+COCO | 37.5 | 33.2 | 57.2 | 35.1 | 27.7 | 54.5 | 29.5 | 42.3 | 33.8 | 23.9 | –
CondInst w/ sem. | ResNet-50-FPN | train+COCO | 37.7 | 33.7 | 57.7 | – | – | – | – | – | – | – | –
CondInst w/ sem. | DCN-101-BiFPN | train+COCO | – | – | – | – | – | – | – | – | – | – | –
CondInst w/ sem. | DCN-101-BiFPN | train+val+COCO | – | – | – | – | – | – | – | – | – | – | –

TABLE 8 – Instance segmentation results on the Cityscapes val ("AP [val]" column) and test (remaining columns) splits. "DCN": using deformable convolutions in the backbones. "+COCO": fine-tuning from the models pre-trained on COCO. "train+val+COCO": using both the train and val splits to train the models evaluated on the test split. "w/ sem.": using the auxiliary semantic segmentation loss during training, as on COCO.
Fig. 5 – Qualitative results on Cityscapes. Noticeably, CondInst can preserve the details well (best viewed on screen).

In inference, we only use the original image scale of 2048×1024. The results are reported in Table 8. As shown in the table, with the same settings, CondInst consistently outperforms the previous strong baseline Mask R-CNN in mask AP in all the experiments. On Cityscapes, the auxiliary semantic segmentation loss can also improve the instance segmentation performance; the results with this loss are denoted by "w/ sem." in Table 8. By further using complementary techniques such as deformable convolutions and BiFPN, the performance can be further boosted, as expected. Some qualitative results are shown in Fig. 5.

As mentioned before, CondInst can be easily extended to panoptic segmentation [23] by attaching the semantic segmentation branch depicted in Fig. 4. Here, we conduct panoptic segmentation experiments on the COCO 2018 dataset. Unless specified, the training and testing details (e.g., image sizes, the number of iterations, etc.) are the same as in the instance segmentation task on COCO.

Although panoptic segmentation can be viewed as a combination of instance segmentation and semantic segmentation, an annotation issue is that there is a discrepancy between the targets of the original instance segmentation task and the instance segmentation task within panoptic segmentation. Panoptic segmentation requires that a pixel in the resulting mask has only one label. Therefore, if two instances overlap, the pixels in the overlapping region are assigned only to the front instance. However, in the original instance segmentation task, the pixels in the overlapping region belong to both instances, and the ground-truth masks are labeled in such a way. Therefore, when we use the instance segmentation framework for panoptic segmentation, the training targets of the instance segmentation are changed to the instance annotations in panoptic segmentation accordingly.

We compare our method with a few state-of-the-art panoptic segmentation methods in Table 9. On the challenging COCO test-dev benchmark, we outperform the previous strong baseline Panoptic-FPN [22] by a large margin in PQ with the same backbone (ResNet-101-FPN) and training schedule. Moreover, compared with AdaptIS [40], which shares some similarity with us, the ResNet-101 based CondInst achieves dramatically better performance than the ResNeXt-101 based AdaptIS. This suggests that using the dynamic filters here might be more effective than using FiLM [38]. In addition, compared with recent methods such as [24] and Panoptic-FCN [25], CondInst also outperforms them considerably. We also show some qualitative results in Fig. 6.

method | backbone | sched. | PQ | PQ^th | PQ^st
CondInst | R-50-FPN | 1× | – | – | –
CondInst | R-50-FPN | 3× | – | – | –
CondInst | R-101-FPN | 3× | – | – | –
Unifying [24] | DCN-101-FPN | – | 47.2 | 53.5 | 37.7
CondInst | DCN-101-FPN | 3× | – | – | –

TABLE 9 – Panoptic segmentation on the COCO test-dev data. All results are with single-model and single-scale testing. Here we report comparisons with state-of-the-art methods using various backbones and training schedules (1× means 90K iterations). CondInst achieves the best result among the compared methods.

Fig. 6 – Panoptic segmentation results on the COCO dataset (better viewed on screen). Color encodes categories and instances. As we can see, CondInst performs well.
5 Conclusion
We have proposed a new and simple instance segmentation framework, termed CondInst. Unlike previous methods such as Mask R-CNN, which employ a mask head with fixed weights, CondInst conditions the mask head on instances and dynamically generates the filters of the mask head. This not only reduces the parameters and computational complexity of the mask head, but also eliminates the ROI operations, resulting in a faster and simpler instance segmentation framework. To our knowledge, CondInst is the first framework that can outperform Mask R-CNN in both accuracy and speed, without needing longer training schedules. With simple modifications, CondInst can be extended to solve panoptic segmentation and achieves state-of-the-art performance on the challenging COCO dataset. We believe that CondInst can be a strong alternative for both instance and panoptic segmentation.
References

[1] A. Paszke et al. PyTorch: An imperative style, high-performance deep learning library. In Proc. Advances in Neural Inf. Process. Syst., pages 8024–8035, 2019.
[2] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid. Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Proc. Advances in Neural Inf. Process. Syst., pages 35–45, 2019.
[3] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee. YOLACT++: Better real-time instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 2019.
[4] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee. YOLACT: Real-time instance segmentation. In Proc. IEEE Int. Conf. Comp. Vis., pages 9157–9166, 2019.
[5] L. Boominathan, S. Kruthiventi, and R. V. Babu. CrowdNet: A deep convolutional network for dense crowd counting. In Proc. ACM Int. Conf. Multimedia, pages 640–644. ACM, 2016.
[6] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan. BlendMask: Top-down meets bottom-up for instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
[7] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. Hybrid task cascade for instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 4974–4983, 2019.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2017.
[9] X. Chen, R. Girshick, K. He, and P. Dollár. TensorMask: A foundation for dense object segmentation. In Proc. IEEE Int. Conf. Comp. Vis., pages 2061–2069, 2019.
[10] B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L.-C. Chen. Panoptic-DeepLab. arXiv: Comp. Res. Repository, 2019.
[11] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1251–1258, 2017.
[12] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3213–3223, 2016.
[13] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In Proc. Eur. Conf. Comp. Vis., pages 534–549. Springer, 2016.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 248–255. IEEE, 2009.
[15] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy. Semantic instance segmentation via deep metric learning. arXiv: Comp. Res. Repository, 2017.
[16] K. He, R. Girshick, and P. Dollár. Rethinking ImageNet pre-training. In Proc. IEEE Int. Conf. Comp. Vis., pages 4918–4927, 2019.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proc. IEEE Int. Conf. Comp. Vis., pages 2961–2969, 2017.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37(9):1904–1916, 2015.
[19] T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y. Yan. Knowledge adaptation for efficient semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 578–587, 2019.
[20] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang. Mask scoring R-CNN. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 6409–6418, 2019.
[21] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Proc. Advances in Neural Inf. Process. Syst., pages 667–675, 2016.
[22] A. Kirillov, R. Girshick, K. He, and P. Dollár. Panoptic feature pyramid networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 6399–6408, 2019.
[23] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. Panoptic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 9404–9413, 2019.
[24] Q. Li, X. Qi, and P. H. S. Torr. Unifying training and inference for panoptic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 13320–13328, 2020.
[25] Y. Li, H. Zhao, X. Qi, L. Wang, Z. Li, J. Sun, and J. Jia. Fully convolutional networks for panoptic segmentation. arXiv: Comp. Res. Repository, 2020.
[26] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2117–2125, 2017.
[27] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proc. IEEE Int. Conf. Comp. Vis., pages 2980–2988, 2017.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proc. Eur. Conf. Comp. Vis., pages 740–755. Springer, 2014.
[29] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell., 2016.
[30] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 8759–8768, 2018.
[31] Y. Liu, C. Shu, J. Wang, and C. Shen. Structured knowledge distillation for dense prediction. IEEE Trans. Pattern Anal. Mach. Intell., 2020.
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3431–3440, 2015.
[33] M. Abadi et al. TensorFlow: A system for large-scale machine learning. In USENIX Symp. Operating Systems Design & Implementation, pages 265–283, 2016.
[34] F. Milletari, N. Navab, and S.-A. Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Proc. Int. Conf. 3D Vision, pages 565–571, 2016.
Proc. Int. Conf. 3D Vision , pages 565–571. IEEE, 2016.[35] D. Neven, B. D. Brabandere, M. Proesmans, and L. V. Gool.Instance segmentation by jointly optimizing spatial embeddingsand clustering bandwidth. In
Proc. IEEE Conf. Comp. Vis. Patt.Recogn. , pages 8837–8845, 2019.[36] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In
Proc. Advancesin Neural Inf. Process. Syst. , pages 2277–2287, 2017.[37] D. Novotny, S. Albanie, D. Larlus, and A. Vedaldi. Semi-convolutional operators for instance segmentation. In
Proc. Eur.Conf. Comp. Vis. , pages 86–102, 2018.[38] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville.FiLM: Visual reasoning with a general conditioning layer. In
Proc.AAAI Conf. Artificial Intell. , 2018.[39] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towardsreal-time object detection with region proposal networks. In
Proc.Advances in Neural Inf. Process. Syst. , pages 91–99, 2015.[40] K. Sofiiuk, O. Barinova, and A. Konushin. Adaptis: Adaptiveinstance selection network. In
Proc. IEEE Int. Conf. Comp. Vis. ,pages 7355–7363, 2019.[41] Z. Tian, T. He, C. Shen, and Y. Yan. Decoders matter for semanticsegmentation: Data-dependent decoding enables flexible featureaggregation. In
Proc. IEEE Conf. Comp. Vis. Patt. Recogn. , pages3126–3135, 2019.[42] Z. Tian, C. Shen, and H. Chen. Conditional convolutions forinstance segmentation. In
Proc. Eur. Conf. Comp. Vis. , 2020. [43] Z. Tian, C. Shen, H. Chen, and T. He. FCOS: Fully convolutionalone-stage object detection. In
Proc. IEEE Int. Conf. Comp. Vis. , pages9627–9636, 2019.[44] Z. Tian, C. Shen, H. Chen, and T. He. FCOS: A simple and stronganchor-free object detector.
IEEE Trans. Pattern Anal. Mach. Intell. ,2021.[45] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen.Axial-deeplab: Stand-alone axial-attention for panoptic segmenta-tion. arXiv preprint arXiv:2003.07853 , 2020.[46] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li. SOLO: Segmentingobjects by locations. In
Proc. Eur. Conf. Comp. Vis. , 2020.[47] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen. SOLOv2: Dynamicand fast instance segmentation. In
Proc. Advances in Neural Inf.Process. Syst. , 2020.[48] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick. Detectron2.https://github.com/facebookresearch/detectron2, 2019.[49] E. Xie, P. Sun, X. Song, W. Wang, D. Liang, C. Shen, and P. Luo.PolarMask: Single shot instance segmentation with polar repre-sentation. In
Proc. IEEE Conf. Comp. Vis. Patt. Recogn. , 2020.[50] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urta-sun. Upsnet: A unified panoptic segmentation network. In
Proc.IEEE Conf. Comp. Vis. Patt. Recogn. , pages 8818–8826, 2019.[51] B. Yang, G. Bender, Q. V. Le, and J. Ngiam. Condconv: Condition-ally parameterized convolutions for efficient inference. In
Proc.Advances in Neural Inf. Process. Syst. , pages 1305–1316, 2019.[52] T.-J. Yang, M. D. Collins, Y. Zhu, J.-J. Hwang, T. Liu, X. Zhang,V. Sze, G. Papandreou, and L.-C. Chen. Deeperlab: Single-shotimage parser. arXiv preprint arXiv:1902.05093 , 2019.[53] W. Yin, Y. Liu, C. Shen, and Y. Yan. Enforcing geometric constraintsof virtual normal for depth prediction. In
Proc. IEEE Int. Conf.Comp. Vis. , 2019.[54] W. Yin, X. Wang, C. Shen, Y. Liu, Z. Tian, S. Xu, C. Sun, andD. Renyin. Diversedepth: Affine-invariant depth prediction usingdiverse data. arXiv preprint arXiv:2002.00569 , 2020.[55] F. Yu, D. Wang, E. Shelhamer, and T. Darrell. Deep layer aggrega-tion. In