Regularized Densely-connected Pyramid Network for Salient Instance Segmentation
Yu-Huan Wu, Yun Liu, Le Zhang, Wang Gao, and Ming-Ming Cheng
Abstract—Much of the recent effort on salient object detection (SOD) has been devoted to producing accurate saliency maps without being aware of their instance labels. To this end, we propose a new pipeline for end-to-end salient instance segmentation (SIS) that predicts a class-agnostic mask for each detected salient instance. To make better use of the rich feature hierarchies in deep networks, we propose regularized dense connections, which attentively promote informative features and suppress non-informative ones from all feature levels, to enhance the side predictions. A novel multi-level RoIAlign based decoder is introduced as well to adaptively aggregate multi-level features for better mask predictions. These strategies can be easily encapsulated into the Mask R-CNN pipeline. Extensive experiments on popular benchmarks demonstrate that our design significantly outperforms existing state-of-the-art competitors by 6.3% (58.6% vs. 52.3%) in terms of the AP metric. The code is available at https://github.com/yuhuan-wu/RDPNet.
Index Terms—Salient instance segmentation, feature pyramid
I. INTRODUCTION

As a fundamental image understanding technique, salient object detection (SOD) aims at segmenting the most eye-attracting objects in a natural image. Although recent SOD approaches [50], [41], [42], [23], [29] have achieved many success stories, their generated saliency maps cannot discriminate different salient instances, which has prevented many applications from applying SOD for instance-level image understanding [9]. Motivated by [18], in this paper, we tackle a more challenging case of SOD, called salient instance segmentation (SIS). SIS not only segments salient objects from an image but also discriminates salient instances by associating each instance with a different label. SIS can facilitate more advanced tasks than SOD, such as image captioning [10], weakly-supervised instance learning [9], and visual tracking [15].

MSRNet [18] made the first attempt to detect salient instances by adopting several isolated processing steps. However, its performance was usually limited in challenging scenarios because it was not end-to-end trainable. S4Net [8] replaced RoIAlign in Mask R-CNN [13] with the proposed RoIMasking to keep the scale of the feature maps and leverage the nearby background of objects. Although much better performance was reported, it was still far from satisfactory, because only a limited feature level was utilized to decode salient instances. One may argue that a natural solution is to employ the Feature Pyramid Network (FPN) [20] and solve this task using the feature pyramid as well. FPN builds the feature pyramid via a top-down pathway and lateral connections from the backbone. With this network, small and large objects are more likely to be detected in the low and high levels of the pyramid, respectively. Therefore, with the top-down pathway, apart from detecting the salient objects, much of the information flow is devoted to detecting small and unnoticeable objects as well. Naïvely applying the FPN architecture to SIS is suboptimal, because salient objects are often much larger and more distinctive than the noisy background and uninteresting objects.

Motivated by this, we focus on enhancing the side predictions by providing each side branch with richer feature hierarchies from deep networks to locate the object and recover its details. We achieve this by employing dense connections for each branch.

Y.-H. Wu, Y. Liu, and M.-M. Cheng are with TKLNDST, College of Computer Science, Nankai University, Tianjin 300350, China. L. Zhang is with the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore. W. Gao is with the Science and Technology on Complex System Control and Intelligent Agent Cooperation Laboratory, Beijing, China. M.-M. Cheng is the corresponding author ([email protected]).

Fig. 1. Visualizations of the feature maps after passing FPN and our proposed regularized densely-connected pyramid (RDP). (a) Source images; (b) corresponding ground truth; (c) feature maps after FPN; (d) feature maps after the proposed RDP. While the feature maps obtained by FPN look coarse and make it hard to recognize the objects in them, those produced by our RDP make it much easier to recognize the location and shape of each salient instance.
In this way, each level is able to leverage both high-level semantic and low-level fine-grained features. However, as features from different levels of the feature pyramid usually have different receptive fields, directly applying such dense connections may yield noisy predictions. To this end, we propose to regularize the dense connections by employing the attention mechanism to promote informative features and suppress non-informative ones from all levels of the feature pyramid.

Our effort starts with Mask R-CNN [13], which first detects bounding boxes and then adopts RoIAlign to predict the binary mask for each region of interest (RoI). Specifically, we propose the regularized densely-connected pyramid (RDP) network mentioned above to better enhance the feature pyramid at different scales while keeping semantic features for detecting salient instances. More specifically, each level of features is fused with not only its successive bottom features, as done in other works [20], [21], [30], [45], but also features from all the lower levels. The RDP network only costs 0.7 ms, which has a negligible effect on the speed of the whole network. Fig. 1 shows the superiority of RDP in feature learning compared with FPN. Besides, instead of only using features from a specific feature level, we propose to leverage the feature maps from all feature levels with a novel multi-level RoIAlign operation for extracting hierarchical RoIs, and then use a mask decoder to predict instance masks from them. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance and far surpasses previous competitors in terms of all metrics. With an NVIDIA TITAN Xp GPU, the proposed method runs at 45.0 fps and is thus suitable for real-time applications.

Overall, our main contributions are summarized as below:
• We propose to use regularized dense connections in the Mask R-CNN framework to provide richer bottom-up information flows by attentively promoting informative features and suppressing non-informative ones at each stage of the feature pyramid.
• We further propose a novel multi-level RoIAlign based decoder to adaptively pool multi-level features for better mask predictions.
• We empirically evaluate the proposed method on two popular SIS datasets and demonstrate its superior accuracy and better efficiency.

II. RELATED WORK
A. Salient Object Detection
SOD aims to detect salient objects or regions in natural images. Conventional methods [1], [4], [3], [41] mainly focus on designing hand-crafted features and better prior strategies for SOD. Later, some learning-based features [41] were studied as well. In recent years, such methods have been surpassed by deep learning based methods, owing to the limited representational ability of hand-crafted features. More specifically, motivated by the vast success of convolutional neural networks (CNNs) and fully convolutional networks (FCNs) [31] on segmentation-related tasks, many FCN-based SOD networks have been proposed [24], [50], [42], [16], [29], [35], [23], [32], [49]. For example, Wang et al. [42] developed a recurrent FCN architecture for saliency prediction. Liu et al. [24] presented a deep hierarchical saliency network to learn a coarse global prediction and refine it hierarchically and progressively by integrating local information. Inspired by [47], [28], Hou et al. [16] introduced short connections for side-outputs to enrich multi-scale features. Zhang et al. [50] introduced a bi-directional structure to adaptively aggregate multi-level features. Wang et al. [43] proposed to globally detect salient objects and recurrently refine the saliency maps. Liu et al. [25] proposed a pixel-wise contextual attention network to selectively attend to informative context locations for each pixel. Liu et al. [23] proposed various pooling-based modules to strengthen the feature representations at real-time speed. Although these methods can predict saliency maps accurately, they cannot discriminate different salient object instances.
B. Instance Segmentation
Similar to object detection, early instance segmentation works [11], [12], [5] focus on classifying segmented proposals generated by object proposal methods [40], [34], [2]. Li et al. [19] first proposed an end-to-end fully convolutional instance segmentation (FCIS) framework. He et al. [13] extended Faster R-CNN [36] to Mask R-CNN by replacing RoIPool with RoIAlign for more accurate RoI generation. They added a mask head in parallel with the box head of Faster R-CNN, predicting masks from the RoI features of the feature pyramid. PANet [26] proposes a bottom-up path augmentation, which has been demonstrated to be effective in shortening the information path and enhancing the feature pyramid for instance detection. Mask Scoring R-CNN [17] combines the mask confidence score and the localization score and is thus more precise in scoring the detected instances.
C. Salient Instance Segmentation
SIS is a relatively new problem that shares similar spirits with both SOD and instance segmentation. It is more challenging than SOD because it not only segments salient objects but also differentiates different salient instances. One possible solution is to derive the salient instances directly from the saliency map using some post-processing techniques. For example, Li et al. [18] proposed a two-stage solution, called MSRNet, which first produces saliency maps and salient object contours that are then integrated with MCG [34] for salient instance segmentation. Although MSRNet can learn from the saliency maps, as the two stages are optimized in isolation, its results are far from satisfactory. To overcome the difficulties of the isolated optimization, Fan et al. [8] recently introduced an end-to-end single-stage framework based on Mask R-CNN [13]. They learned to mimic the strategy of GrabCut [37] and used the so-called RoIMasking to explicitly incorporate foreground/background separation. They also designed a customized segmentation head with dilated convolutions to retrieve instance masks from the coarsest feature level. Instead of using a single specific feature level with limited semantic features as done in existing methods, we propose to use the regularized densely-connected pyramid to extract richer feature hierarchies with higher contrasts (as in Fig. 1) from all feature levels, which significantly eases the burden of accurately detecting salient instances and retrieving a binary mask for each salient instance.

III. OUR APPROACH
A. Feature Pyramid Enhancement
The feature pyramid, which is usually understood as a group of feature maps with different resolutions, has demonstrated its superiority in various computer vision tasks. One notable application is object detection, which aims to accurately detect the locations of semantic objects. As there exist large scale variations among natural objects, directly detecting the accurate locations of targets using features from a single scale is extremely challenging. Therefore, many researchers attempt to detect semantic objects with the feature pyramid. Our method naturally belongs to this family. We propose a densely-connected pyramid (DP) network and the advanced regularized densely-connected pyramid (RDP) network for feature pyramid enhancement. We elaborate the main idea below.
1) Problem Formulation:
Given an image as the input and a base network (e.g., ResNet [14]) for feature extraction, we can first derive a set of side-outputs from multiple stages of this network. Assume that we have access to multiple scales of features {C_m, C_{m+1}, ..., C_k} from the m-th to the k-th stage, corresponding to the finest and coarsest feature maps, respectively. Typically, m is 2 as defined in two-stage detectors like Faster R-CNN [36], [20], or 3 as defined in one-stage detectors like RetinaNet [21]; k is typically 5 as defined in both kinds of detectors [36], [20], [21].
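To make the formulation concrete, the sketch below builds a toy backbone whose stage outputs play the role of {C_2, ..., C_5}. The backbone itself (channel widths, single-conv stages) is a made-up stand-in for ResNet, not the paper's network; only the stride-per-stage convention above is followed.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Toy stand-in for a ResNet-like backbone: each stage halves the
    resolution, so its outputs form the multi-scale features C2..C5."""

    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        # stem reaches stride 4, matching the usual resolution of C2
        self.stem = nn.Conv2d(3, channels[0], 3, stride=4, padding=1)
        self.stages = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, x):
        x = self.stem(x)
        feats = [x]                  # C2 (stride 4)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # C3, C4, C5 (strides 8, 16, 32)
        return feats

img = torch.randn(1, 3, 320, 320)
for i, c in enumerate(ToyBackbone()(img), start=2):
    print(f"C{i}: {tuple(c.shape)}")
```

For a 320×320 input, the side-outputs have spatial sizes 80, 40, 20, and 10, i.e., the strides 4 to 32 assumed by the pyramid formulation.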
2) The Top-down Style:
In order to leverage both high-level semantics and low-level fine details as mentioned above, the well-known FPN [20] proposes a top-down architecture with lateral connections to strengthen the capacity and representability of each side-output. Such a strategy has been demonstrated to be very powerful, especially for detecting small and tiny objects, and has been extensively used in many other approaches. Suppose that the feature pyramid enhanced by FPN is P = {P_m, P_{m+1}, ..., P_k}. This enhancement operation can be formulated as

    P_k = F[φ(C_k)],                                   (1)
    P_i = F[φ(C_i) + Upsample(P_{i+1})],  m ≤ i < k,   (2)

where φ represents a 1×1 convolution layer to reduce the channels of C_i, and F represents the feature fusion module, which consists of a single 3×3 convolution layer. The upsampling factor for P_{i+1} is 2, and we use bilinear interpolation for upsampling. For the coarsest feature map C_k, this enhancement simplifies to Eq. (1), without any upsampled features.

Such a strategy, however, is suboptimal for SIS. Recall that the objective of this task is to detect salient instances and ignore other non-salient ones that usually have a relatively smaller size. In Eq. (2), each side branch only has limited bottom-up information, because it only leverages the features of two successive layers. In this way, higher levels in the pyramid have limited access to the low-level fine-grained details and thus may fail to recover the instance boundaries. In the same way, the lower levels in the pyramid lack high-level semantic information and thus may not be good at accurately locating the salient objects and identifying their instance labels. To address this problem, we provide our solution below.
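Assuming the standard FPN convention referenced above ([20]), Eqs. (1)-(2) can be sketched as follows; `lateral` plays the role of φ, `fuse` the role of F, and the channel widths are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Sketch of the top-down enhancement of Eqs. (1)-(2): a 1x1 lateral
    conv (phi) unifies channels, the upsampled coarser map is added, and
    a 3x3 conv (F in the equations) fuses the result."""

    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels
        )
        self.fuse = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels
        )

    def forward(self, C):                        # C = [C_m, ..., C_k], fine to coarse
        P = [None] * len(C)
        P[-1] = self.fuse[-1](self.lateral[-1](C[-1]))          # Eq. (1)
        for i in range(len(C) - 2, -1, -1):                     # Eq. (2)
            up = F.interpolate(P[i + 1], scale_factor=2,
                               mode="bilinear", align_corners=False)
            P[i] = self.fuse[i](self.lateral[i](C[i]) + up)
        return P
```

Note how information only flows top-down: each P_i sees exactly one coarser level, which is the limitation the dense connections below are designed to remove.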
Fig. 2. Illustration of the proposed regularized densely-connected pyramid (RDP) network for feature pyramid enhancement. (a) The densely-connected pyramid (DP) network; (b) dense connections with regularization. For simplicity, we illustrate the regularization at only one feature level; RDP is DP with the regularization applied at each feature level.
3) The Bottom-up Densely-connected Pyramid Network:
A straightforward solution to overcome the above-mentioned disadvantages of FPN, as proposed in [26], is to build a progressive bottom-up lateral connection and recreate a new feature pyramid:

    P′_m = F(P_m),                                              (3)
    P′_{j+1} = F[P_{j+1} + Downsample(P′_j)],  m ≤ j ≤ k − 1,   (4)

where P′_k is the re-generated feature map of the new feature pyramid. This solution naturally follows the progressive manner of FPN and is applied in instance segmentation [26]. We take inspiration from this architecture and make necessary amendments. For each feature level in the network, instead of only merging two successive levels, we merge features from many other levels as well. This is advantageous because each stage is given a much richer information flow from all its bottom layers. More specifically, we achieve this by adding dense connections, which can be formulated as

    P′_j = F{φ[Concat(P′_m, P′_{m+1}, ..., P′_{j−1}, P_j)]},    (5)

where m < j ≤ k and m represents the index of the first stage of the feature pyramid. In the concatenation operation, the feature maps P′_m, P′_{m+1}, ..., P′_{j−1} are all downsampled to the size of P_j. We use the 1×1 convolution operation φ to reduce the channels to that of P_j.
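The dense connections of Eq. (5) can be sketched as below. This is a minimal sketch: φ is taken as a 1×1 conv and F as a 3×3 conv, the first level is passed through unchanged for brevity, and bilinear resizing stands in for the downsampling operation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DensePyramid(nn.Module):
    """Sketch of Eq. (5): each new level P'_j concatenates all previously
    rebuilt levels (resized to P_j's resolution) with P_j, reduces the
    channels with a 1x1 conv (phi), then fuses with a 3x3 conv (F)."""

    def __init__(self, num_levels=5, channels=256):
        super().__init__()
        # at step j, the concat holds j previous maps plus P_j itself
        self.reduce = nn.ModuleList(
            nn.Conv2d((j + 1) * channels, channels, 1)
            for j in range(1, num_levels)
        )
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(1, num_levels)
        )

    def forward(self, P):                 # P = [P_m, ..., P_k], fine to coarse
        new = [P[0]]                      # P'_m (per-level fusion omitted here)
        for j in range(1, len(P)):
            size = P[j].shape[-2:]
            prev = [F.interpolate(p, size=size, mode="bilinear",
                                  align_corners=False) for p in new]
            cat = torch.cat(prev + [P[j]], dim=1)       # Concat in Eq. (5)
            new.append(self.fuse[j - 1](self.reduce[j - 1](cat)))
        return new
```

Compared with the progressive style of Eqs. (3)-(4), every level now receives a direct path from all finer levels rather than only its immediate predecessor.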
4) Regularized Densely-Connected Pyramid Network:
The bottom-up dense connections essentially expand the input space of each side branch. However, as features from different layers usually have different receptive fields, they are usually not very compatible in discovering the fine details of the object, due to the scale conflict. To this end, we further regularize the dense connections with the well-established self-attention mechanism. To compute the new feature maps P′_j, we first create a spatial regularization based on the feature map P_j of the current scale:

    R_j = σ[F(P_j)],    (6)
where R_j is the attention map for the regularization and σ denotes the sigmoid function applied to each pixel. To reduce the effect of the scale conflict during the feature concatenation, we apply this regularization to the feature maps in the feature fusion, using the identical attention map R_j for all of them except P_j:

    P^r_t = R_j ⊗ Downsample(P′_t),  m ≤ t < j,    (7)

where P^r_t is the regularized feature map from another scale. We downsample the feature maps from other scales to the same size as P_j. The symbol ⊗ denotes element-wise multiplication. Overall, the regularized dense connections for enhancing the feature pyramid can be formulated as

    P′_j = F[Concat(P^r_m, P^r_{m+1}, ..., P^r_{j−1}, P_j)],  m < j ≤ k.    (8)

We provide an illustration of the proposed RDP in Fig. 2 for better understanding.

Fig. 3. Overall pipeline of the proposed method. (a) In the feature extraction part, RDP is the regularized densely-connected pyramid network, as illustrated in Fig. 2. (b) We use the base detector [39] for box regression at each feature level. (c) The traditional design for mask prediction only uses a single layer to decode the binary masks. (d) Our design for mask prediction uses all feature levels to decode binary masks with a simple decoder.

B. Multi-level RoIAlign for Mask Prediction
Mask prediction is essential for SIS, as it directly determines the accuracy of the mask for each salient instance. As shown in Fig. 3 (c), Mask R-CNN [13] uses a specific feature level, which depends on the size of the object of interest, for mask prediction using RoIAlign. Although this way of determining which feature level is used in RoIAlign can adaptively extract masks for objects of different sizes, it is suboptimal for SIS, and a better strategy is to leverage all the feature levels. More specifically, we propose an efficient yet well-performing multi-level RoIAlign with a decoder to leverage all feature levels; Fig. 3 (d) illustrates our idea. After the multi-level RoIAlign layer, we derive a tiny feature pyramid specifically for mask prediction. The subsequent decoder progressively decodes the binary masks from this tiny feature pyramid. The decoder consists of lateral connections and some feature fusion operations. Since the strides of the top two feature maps are very large, they are RoIAligned to the same RoI size, and we perform an element-wise sum for these two RoIs. The other feature maps are RoIAligned to different RoI sizes.

With this decoder, we first use RoIAlign to adaptively align features from all levels, and then retrieve binary masks based on the aligned features. For the feature fusion between two adjacent feature maps of different sizes, we first upsample the coarser one by a factor of 2 via bilinear interpolation to match the size of the finer feature map. Then, we use an element-wise sum to fuse these two feature maps and add a 3×3 convolution layer to generate the new feature maps for the next feature fusion. Finally, we obtain the finest feature maps, on which we perform a 1×1 convolution to predict the binary masks.

C. Overall Pipeline
The regularized densely-connected pyramid and the multi-level RoIAlign layer are encapsulated into a Mask R-CNN based pipeline, as displayed in Fig. 3. The functionality of each component is presented in the following.
1) Feature Extraction:
We adopt the widely used ResNet [14] as our backbone network, which has been pretrained on the ImageNet dataset [38]. The base feature pyramid follows the architecture of FPN [20]. Since we use the one-stage detector [39] for box regression, we follow [39] to generate two extra feature maps, P_6 and P_7, by connecting two 3×3 convolutions with a stride of 2 after P_5. P_6 and P_7 are added to the feature pyramid, so the feature pyramid after passing FPN is {P_3, P_4, P_5, P_6, P_7}. All feature maps in this feature pyramid have 256 channels. Then, we build the regularized densely-connected pyramid (RDP) from P′_3 to P′_7, as introduced in Section III-A4 and Fig. 2. The number of output channels is still 256 for all feature maps in the reconstructed feature pyramid. Fig. 1 displays the visualization of feature maps after passing FPN and our proposed RDP. We find that although the feature maps derived by FPN capture the locations of salient instances, their activations are very coarse, and one cannot even recognize the number of salient instances in each image from them. In contrast, the feature maps from our proposed RDP network have more precise activations and help the base detector to better detect the bounding box of each salient instance. This further helps the mask head to obtain better masks for the detected salient instances.
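One regularized fusion step of the RDP (Eqs. (6)-(8) in Section III-A4) can be sketched as below. This is a minimal sketch: the single-channel attention map and the layer shapes are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegularizedFusion(nn.Module):
    """Sketch of one RDP fusion step: an attention map R_j = sigmoid(F(P_j))
    (Eq. (6)) gates the downsampled lower-level maps (Eq. (7)) before they
    are concatenated with P_j and fused (Eq. (8))."""

    def __init__(self, num_lower, channels=256):
        super().__init__()
        self.att = nn.Conv2d(channels, 1, 3, padding=1)  # produces R_j
        self.reduce = nn.Conv2d((num_lower + 1) * channels, channels, 1)
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, lower, P_j):        # lower = [P'_m, ..., P'_{j-1}]
        R_j = torch.sigmoid(self.att(P_j))                       # Eq. (6)
        size = P_j.shape[-2:]
        gated = [R_j * F.interpolate(p, size=size, mode="bilinear",
                                     align_corners=False)
                 for p in lower]                                 # Eq. (7)
        cat = torch.cat(gated + [P_j], dim=1)
        return self.fuse(self.reduce(cat))                       # Eq. (8)
```

The same gating map R_j is shared by all lower levels, matching the formulation in which only P_j itself is left unregularized.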
2) Box Regression:
To quickly detect the salient instances, we do not apply a heavy two-stage detector that contains an RPN [36] head to generate object proposals and then classifies these proposals with a box head, because it is too slow for SIS. Instead, we use the one-stage detector [39] as our base detector. This detector consists of four convolution-ReLU layers with 256 channels, and box regression is performed at each feature level with this shared-parameter head. The details of calculating the box proposals from the final feature map can be found in [39]. In this part, we derive many box proposals with their confidence scores at each feature level. We concatenate them and keep the top 1000 boxes with a confidence score larger than 0.05. After that, a non-maximum suppression (NMS) operation is conducted on the boxes, and at most the top 100 boxes are kept for predicting their corresponding binary masks.
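The box-filtering step described above can be sketched as follows. The score threshold (0.05), pre-NMS top-1000, and post-NMS top-100 follow the text; the NMS IoU threshold of 0.5 and the plain greedy NMS implementation are our assumptions.

```python
import numpy as np

def box_iou(a, b):
    """IoU between one box a and an array of boxes b, format (x1, y1, x2, y2)."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_boxes(boxes, scores, score_thr=0.05, pre_nms=1000,
                 iou_thr=0.5, post_nms=100):
    """Score filter -> top-k -> greedy NMS -> cap at post_nms boxes."""
    keep = scores > score_thr
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(-scores)[:pre_nms]       # top pre_nms by confidence
    boxes, scores = boxes[order], scores[order]
    kept = []
    while len(boxes) and len(kept) < post_nms:
        kept.append((boxes[0], scores[0]))      # highest remaining score
        mask = box_iou(boxes[0], boxes[1:]) <= iou_thr
        boxes, scores = boxes[1:][mask], scores[1:][mask]
    return kept
```

Only the surviving boxes are passed on to the mask branch, which keeps the mask decoder's workload bounded.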
3) Mask Prediction:
In the box regression, we detect the salient instances at the box level. Since our final goal is to predict the instance-level segmentation, mask prediction is necessary to retrieve the corresponding binary mask for each salient instance. We make a further improvement to Mask R-CNN by leveraging the feature maps of all feature levels ({P′_3, P′_4, P′_5, P′_6, P′_7}) for retrieving the binary masks of salient instances. After the multi-level RoIAlign layer, the sizes of the feature maps {D_3, D_4, D_5, D_6, D_7} are displayed in Table I. Please refer to Section III-B for the implementation of the decoder. After passing this decoder, we use a simple 1×1 convolution layer to predict the final masks for the detected salient instances.

TABLE I
Feature map size for each channel after the multi-level RoIAlign layers. Since P′_6 and P′_7 are very small, D_6 and D_7 are sampled with the same size. The size of the final mask for each salient instance is ×.

Name   D_3   D_4   D_5   D_6   D_7
Size    ×     ×     ×     ×     ×
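Under the assumptions that the RoI features {D_3, ..., D_7} are already aligned, that D_6 and D_7 have been summed into the coarsest input, and that adjacent levels differ by a factor of 2 in resolution, the decoder of Section III-B can be sketched as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskDecoder(nn.Module):
    """Sketch of the mask decoder: starting from the coarsest RoI feature,
    repeatedly upsample by 2, add the next finer level, and apply a 3x3
    conv; a final 1x1 conv predicts the binary mask logits."""

    def __init__(self, channels=128, num_levels=4):
        super().__init__()
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(num_levels - 1)
        )
        self.predict = nn.Conv2d(channels, 1, 1)   # 1x1 mask prediction

    def forward(self, D):            # D = [finest, ..., coarsest] RoI features
        x = D[-1]
        for i in range(len(D) - 2, -1, -1):
            x = F.interpolate(x, scale_factor=2, mode="bilinear",
                              align_corners=False)
            x = self.fuse[i](x + D[i])    # element-wise sum, then 3x3 conv
        return self.predict(x)            # logits at the finest RoI size
```

The 128-channel width follows the implementation details in Section IV-B; the number of levels and RoI sizes here are illustrative.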
4) The Loss Function:
Our pipeline has two key parts that need supervision: box regression and mask prediction. A foreground box classification loss L_cls and a coordinate regression loss L_reg are applied in the box regression branch. Note that L_cls is the focal loss [21] and L_reg is the IoU loss proposed in [48]. To further get rid of the bad effect of too many low-quality boxes, we apply the centerness loss L_center proposed in [39] to ignore the boxes whose centers are far away from the centers of salient instances. For mask prediction, we use the standard cross-entropy loss as the mask loss L_mask. Hence we obtain the final loss

    L = L_cls + L_reg + L_center + L_mask

to supervise the whole network.

IV. EXPERIMENTS
In this section, we first introduce the datasets and evaluation metrics used in our experiments (Section IV-A). Implementation details are described in Section IV-B. We carefully examine our proposed designs and demonstrate their effectiveness in Section IV-C. The results of our method and the comparison with previous state-of-the-art methods are provided in Section IV-D.
A. Datasets and Evaluation Metrics

1) Datasets: We adopt two popular datasets in our experiments, i.e., the ISOD and SOC datasets. The ISOD dataset was proposed by Li et al. [18]. It contains 1000 images with salient instance annotations. Here we follow the previous work [8] to use 500 images for training, 200 images for validation, and the remaining 300 images for testing. The SOC dataset was proposed by Fan et al. [7]. This dataset consists of 3000 images in cluttered scenes with salient instance annotations. Among them, 2400 images are used for training and 600 images are used for testing.
2) Evaluation Metrics:
Previous works use the mAP metric with a specific IoU threshold such as 0.5 (standard) or 0.7 (strict) to determine whether a detected instance is a true positive (TP), similar to the evaluation in the PASCAL VOC challenge [6]. However, as this metric is not enough to fully reflect the quality of detectors, the MS-COCO evaluation metric [22] has been widely adopted in mainstream object detection and instance segmentation. We follow the MS-COCO evaluation metric [22] and use mAP@{0.5:0.05:0.95} as the primary metric, since it can better reflect the detection quality. We also report [email protected] and [email protected] for reference, as done in related works [27], [20], [21], [13], [17]. For simplicity, we use "AP", "AP_50", and "AP_70" to stand for mAP@{0.5:0.05:0.95}, [email protected], and [email protected], respectively.

B. Implementation Details
In this paper, we use the popular PyTorch framework [33] to implement our method. If not specially mentioned, we apply the widely used ResNet-50 [14] as the backbone network. During network training, there may be no box satisfying the confidence-score threshold for NMS, especially in the early training stage, so we add the ground-truth boxes to the detected salient instances during training to prevent this situation. We only use horizontal flipping as the data augmentation, and each input image is resized so that the shorter side is 320 pixels; the longer side follows the initial image aspect ratio but is limited to a maximum of 480 pixels. We use a single NVIDIA TITAN Xp GPU for all experiments. We use the SGD optimizer with a weight decay of 10⁻⁴ and a momentum of 0.9. Each mini-batch contains four images. The initial learning rate is 0.0025. For the ISOD dataset [18], the learning rate is divided by 10 after 6K iterations, and we train our network for 9K iterations in total. For the SOC dataset [7], which is approximately 4× larger than ISOD, the learning rate is divided by 10 after 24K iterations, and we train our network for 36K iterations in total. Due to the small batch size, the BatchNorm layers of the backbone network are all frozen during training. The 3×3 convolution layers of the box regression head and mask prediction head use group normalization [44]. The number of output channels of each 3×3 convolution layer is 128 in the mask prediction head.

TABLE II
Evaluation on the ISOD validation set for various design choices. The first line refers to the baseline of FPN. NP is the natural progressive bottom-up style for building the new feature pyramid. DP denotes the proposed method that rebuilds the feature pyramid with dense connections. RDP means adding the proposed regularization to DP. MRA represents the proposed multi-level RoIAlign.

NP   DP   RDP   MRA   AP      AP_50   AP_70
-    -    -     -     54.2%   83.3%   69.7%
✓    -    -     -     54.6%
-    ✓    -     -     55.1%
-    -    ✓     -     56.4%   85.4%   72.0%
-    -    ✓     ✓     57.4%   86.1%   73.8%

C. Ablation Study
In this part, we evaluate the effect of various designs on the ISOD dataset. We use its training set for training and report results on its validation set. If not otherwise mentioned, we use ResNet-50 as the backbone of our network.
1) Effect of DP and RDP:
As mentioned in Section III-A4, we propose RDP to fill the vacancy of FPN. Here, we view FPN as our baseline and evaluate four design choices: i) NP, i.e., the naive progressive bottom-up style for building the new feature pyramid; ii) DP, i.e., the proposed method that rebuilds the feature pyramid with dense connections; iii) RDP, i.e., adding the proposed regularization to the dense connections in DP; iv) MRA, i.e., the proposed multi-level RoIAlign. Table II shows the evaluation results on the ISOD validation set. We can see that NP only brings a minor improvement compared with the vanilla solution of FPN, i.e., an improvement of 0.4% in terms of AP. If we replace this naive solution with DP without regularization, the AP metric is improved by 0.9% compared with FPN. When we add the regularization to DP, a further 1.3% improvement over DP is observed, indicating that the regularization is of vital importance for the proposed densely-connected pyramid. Note that the proposed RDP is very efficient and only costs 0.7 ms per input image, so it has little effect on the speed of the whole network.
2) Effect of Multi-level RoIAlign:
Existing research usually predicts object masks using the mask head proposed by Mask R-CNN [13], which predicts masks from a specific feature level. Instead, we propose a top-down progressive mask decoder that utilizes all feature levels for object mask prediction, namely multi-level RoIAlign (MRA). The comparison between MRA and the traditional RoIAlign can be found in Table II. We can see that the introduction of MRA further leads to an improvement of 1.0%, 0.7%, and 1.8% in terms of AP, AP_50, and AP_70, respectively. This demonstrates the significance of the proposed MRA for accurate mask prediction by leveraging all feature levels. Overall, the proposed method achieves 3.2% higher AP, 2.8% higher AP_50, and 4.1% higher AP_70 than the baseline of FPN.
3) Partially Applying DP and RDP: Our initial design considers all feature levels (P_3–P_7) for the reconstruction of the feature pyramid. Among them, the top two feature levels (P_6 and P_7) are generated from P_5 using only two 3×3 convolutions. In this section, we further evaluate the effectiveness of DP and RDP by applying them to only a part of the side-outputs. Specifically, we apply DP/RDP to three side-outputs, i.e., P_3, P_4, and P_5, excluding P_6 and P_7. The experimental results are shown in Table III. We can see that applying DP/RDP to only three side-outputs performs better than the baseline, but worse than applying DP/RDP to all five side-outputs, indicating that DP/RDP is effective in feature enhancement for all feature levels. The fact that RDP with only three feature levels still significantly outperforms the baseline further suggests that RDP is very useful to FPN.

TABLE III
Evaluation on the ISOD validation set for applying DP/RDP to only a part of the side-outputs. P_3∼P_5 means from P_3 to P_5; P_3∼P_7 means all side-outputs in the feature pyramid.

Side-outputs   DP   RDP   AP      AP_50   AP_70
-              -    -     54.2%   83.3%   69.7%
P_3∼P_5        ✓    -
P_3∼P_5        -    ✓
P_3∼P_7        ✓    -     55.1%
P_3∼P_7        -    ✓     56.4%   85.4%   72.0%

TABLE IV
Evaluation on the ISOD validation set for the top-down and bottom-up designs of RDP. The top-down design directly replaces the FPN of the baseline method with the top-down style of RDP. The bottom-up design is the default version of RDP, as shown in Fig. 2.

Method      AP      AP_50   AP_70
Baseline    54.2%   83.3%   69.7%
Top-down    45.6%   76.9%   57.0%
Bottom-up   56.4%   85.4%   72.0%
4) Error Analyses of the Baseline and the Proposed Designs: Salient instances are usually large because large objects are more eye-attracting and thus visually distinctive. Following the MS-COCO benchmark, we consider instances whose areas are larger than 96² as large instances. In this way, we find that over 70% of the salient instances in the ISOD dataset [18] are large. Here, we perform error analyses using all salient instances or only large instances. We view FPN [20] as the baseline and gradually add each of our designs to this baseline to analyze the changes in detection errors. Fig. 4 illustrates the results. First, let us discuss the changes in the PR curves after adding DP to the baseline. We observe that although AP is improved for almost all IoU thresholds when using all salient instances, the performance becomes worse when only large salient instances are considered, especially for large IoU thresholds (e.g., IoU = 0.9). Then, we replace DP with its regularized version, RDP. There is a significant improvement for all IoU thresholds on both all and only large salient instances, demonstrating the importance of the proposed regularization for DP. At last, we analyze the effect of the multi-level RoIAlign (MRA) by further adding it to our system. A substantial improvement is observed, especially for large salient instances. For example, MRA brings AP improvements of 7.2%, 3.6%, and 2.0% for IoU thresholds 0.9, 0.7, and 0.5, respectively. Comparing our final system (the rightmost column in Fig. 4) with the baseline (the leftmost column), the improvement is visually significant in the PR curves for all IoU thresholds.

Fig. 4. Error analyses for the baseline and the proposed designs on the ISOD validation set. The first row shows PR curves for all salient instances, while the second row is only for large salient instances whose areas are larger than 96². The PR curves are drawn in different settings following [22]. C10∼C90: PR curve at IoU = {0.1, 0.3, 0.5, 0.7, 0.9}. BG: PR curve after all background false positives (FP) are removed. FN: PR curve after all remaining errors are removed (AP = 1). Each number in the legend corresponds to the average precision for each setting. The area under each curve is drawn in a different color, corresponding to the color in the legend. Best viewed in color.

TABLE V
EVALUATION ON THE ISOD VALIDATION SET USING DIFFERENT NMS THRESHOLDS.

NMS Threshold  AP  AP50  AP70
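How a single point of such a PR curve arises at a given IoU threshold can be made concrete with a small NumPy sketch. This is a simplified, single-image illustration under our own assumptions (greedy one-to-one matching of predictions to ground truths, highest score first), not the benchmark's official evaluation code.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x0, y0, x1, y1) form."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def pr_at_iou(preds, scores, gts, t):
    """Precision/recall of one image's detections at IoU threshold t.
    Predictions are matched greedily (highest score first) to at most
    one unused ground truth."""
    used, tp = set(), 0
    for i in np.argsort(scores)[::-1]:
        for g, gt in enumerate(gts):
            if g not in used and box_iou(preds[i], gt) >= t:
                used.add(g)
                tp += 1
                break
    return tp / len(preds), tp / len(gts)

gts = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], float)
preds = np.array([[0, 0, 10, 10], [21, 21, 30, 30], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
p50, r50 = pr_at_iou(preds, scores, gts, 0.5)  # loose threshold
p90, r90 = pr_at_iou(preds, scores, gts, 0.9)  # strict threshold
```

The slightly shifted second prediction counts as correct at IoU = 0.5 but not at IoU = 0.9, which is exactly why the C90 curves in Fig. 4 sit far below the C50 curves.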
5) Bottom-up versus Top-down:
In our method, we rebuild the feature pyramid based on the outputs of FPN. Another potential solution is to directly replace FPN with the top-down style of RDP, which would have a lower computational cost than our proposed design. However, the experimental results show that this solution fails. As shown in Table IV, it leads to substantial performance degradation, i.e., over 10% lower than the default bottom-up design in terms of various metrics. Hence, we can conclude that the proposed RDP is not suitable for top-down information flow and works well only in the bottom-up direction.
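The difference between the two information flows is purely the order in which levels are visited and accumulated. The skeleton below is an illustrative NumPy sketch under our own assumptions (simple additive accumulation with nearest-neighbor resizing); the paper's RDP additionally uses dense connections and regularized attention, which are omitted here.

```python
import numpy as np

def resize_to(feat, h, w):
    """Nearest-neighbor resize of a (C, H, W) feature map."""
    c, fh, fw = feat.shape
    ys = np.arange(h) * fh // h
    xs = np.arange(w) * fw // w
    return feat[:, ys][:, :, xs]

def rebuild_pyramid(fpn_feats, direction="bottom_up"):
    """Rebuild a pyramid by accumulating features level by level.
    bottom_up: start at the finest level and propagate towards the
    coarser ones (the paper's default); top_down: the reverse order."""
    n = len(fpn_feats)  # fpn_feats[0] is the finest level
    order = range(n) if direction == "bottom_up" else range(n - 1, -1, -1)
    out, acc = [None] * n, None
    for i in order:
        f = fpn_feats[i]
        if acc is not None:
            f = f + resize_to(acc, f.shape[1], f.shape[2])
        out[i] = acc = f
    return out

rng = np.random.default_rng(3)
feats = [rng.normal(size=(4, s, s)) for s in (32, 16, 8)]
bu = rebuild_pyramid(feats, "bottom_up")
td = rebuild_pyramid(feats, "top_down")
```

In the bottom-up order, the finest level is left untouched and detail flows upward into the coarse levels; in the top-down order the coarsest level is the untouched seed, which is the variant Table IV shows to fail for RDP.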
6) Different NMS Thresholds:
The NMS post-processing step is important for eliminating overlapping detected instances. NMS with a higher IoU threshold has a greater tolerance for highly overlapping instances, and vice versa. Here, we explore how different NMS thresholds affect the performance of the proposed method; the results are summarized in Table V. We observe that our method is robust to different NMS thresholds, and thresholds larger than 0.4 yield similar evaluation results. Since the threshold of 0.6 results in slightly better performance in terms of various metrics, we adopt 0.6 as the default threshold in our experiments.

TABLE VI
EVALUATION ON THE ISOD VALIDATION SET USING DIFFERENT NUMBERS OF BOX PROPOSALS FOR MASK PREDICTION IN THE INFERENCE. REDUCING THE NUMBER OF BOX PROPOSALS YIELDS ONLY A SMALL SPEED IMPROVEMENT (IN FPS) AT THE COST OF REDUCED AP.

#Proposals  AP  Speed  ΔAP
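The effect of the NMS threshold can be seen in a minimal greedy NMS written in NumPy. This is a generic textbook sketch, not the implementation used in the paper (which relies on its detection framework's built-in NMS).

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x0, y0, x1, y1) form."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.6):
    """Greedy NMS: keep the highest-scoring box, drop every remaining
    box whose IoU with it exceeds `thresh`, and repeat. A higher
    threshold tolerates more overlap between kept instances."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        mask = np.array([box_iou(boxes[i], boxes[j]) <= thresh
                         for j in rest], dtype=bool)
        order = rest[mask]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept_06 = nms(boxes, scores, thresh=0.6)  # IoU(box0, box1) ≈ 0.68 > 0.6
kept_07 = nms(boxes, scores, thresh=0.7)  # higher tolerance keeps all boxes
```

At a threshold of 0.6 the two heavily overlapping boxes collapse into one detection, while at 0.7 both survive, which is the tolerance trade-off Table V probes.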
7) The Number of Proposals for Mask Prediction:
As in typical instance segmentation [13], [20], our method first learns to localize salient instances by predicting bounding box proposals and then predicts a mask for each box proposal. Hence, the number of box proposals used for mask prediction may affect the detection accuracy and inference speed of the whole network. As mentioned in the box regression part of Section III-C2, we select the top 100 box proposals for mask prediction, which may look somewhat large and expensive for SIS because there are rarely more than 10 salient instances in an image. To justify this setting, we first explore the trend of the number of proposals during the training stage. At the beginning, this number equals the limit (100), but it gradually declines as the number of training iterations increases and finally converges to fewer than 10 on average. On the other hand, we run experiments using different numbers of box proposals in the inference; the results are shown in Table VI. As we decrease the number of box proposals in the inference, the maximum speed gain is only 0.2fps, but the accuracy suffers significant degradation. Therefore, we keep the default setting of 100 box proposals because it barely harms the speed of our method.
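The proposal selection itself is just a top-k filter on objectness scores before the more expensive mask head runs. A minimal NumPy sketch, with hypothetical names of our own (`top_k_proposals` does not appear in the paper or its code):

```python
import numpy as np

def top_k_proposals(boxes, scores, k=100):
    """Keep at most k box proposals with the highest objectness scores.
    Only these survivors are passed to the (more expensive) mask head."""
    k = min(k, len(scores))
    idx = np.argsort(scores)[::-1][:k]
    return boxes[idx], scores[idx]

rng = np.random.default_rng(2)
boxes = rng.uniform(0, 100, size=(300, 4))   # 300 raw box proposals
scores = rng.uniform(size=300)               # objectness scores
kept_boxes, kept_scores = top_k_proposals(boxes, scores, k=100)
```

Since low-scoring proposals rarely yield valid instances anyway, the mask head's cost grows with k while the detection quality saturates, which matches the trade-off reported in Table VI.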
TABLE VII
EVALUATION RESULTS ON THE ISOD TEST SET [18].

Method       Backbone   AP     AP50   AP70   Speed
MSRNet [18]  VGG-16     -      65.3%  52.3%  <

TABLE VIII
EVALUATION OF OUR METHOD WITH DIFFERENT BACKBONE NETWORKS ON THE ISOD TEST SET [18]. OUR METHOD WITH THE MOST POWERFUL BACKBONE (i.e., RESNEXT-101 [46]) ACHIEVES A CLEAR IMPROVEMENT IN TERMS OF AP AT THE COST OF LONGER INFERENCE TIME COMPARED WITH THE SIMPLEST BACKBONE (i.e., RESNET-50 [14]). THE SPEED IS TESTED USING A SINGLE NVIDIA TITAN XP GPU.

Backbone          AP     AP50   AP70   Speed
ResNet-50 [14]    58.6%  88.6%  73.6%  45.0fps
ResNet-101 [14]   60.9%  89.7%  76.6%  34.8fps
ResNeXt-101 [46]

TABLE IX
EVALUATION RESULTS ON THE SOC TEST SET [7].

Method     Backbone   AP     AP50   AP70
S4Net [8]  ResNet-50  24.0%  51.8%  27.5%
Ours       ResNet-50
D. Comparisons with State-of-the-art Methods

1) ISOD Dataset:
Since SIS is a relatively new problem, previous works on this topic are very limited. Here, we compare our method with two well-known methods: MSRNet [18], which represents the post-processing-based methods, and S4Net [8], a representative work on end-to-end networks. Following [18], [8], all methods are tested on the ISOD test set [18]. We apply AP as the main metric and report AP50 and AP70 for reference. Higher scores represent better performance for all metrics. The quantitative results are shown in Table VII. The proposed method achieves the best results compared with the other two popular competitors. Specifically, the proposed method has 6.3% higher AP than S4Net [8]. In terms of AP70, the proposed method is 10.0% better than S4Net [8]. This demonstrates the superiority of the proposed method in accurate salient instance segmentation. In Table VIII, we try different backbone networks for our method. We can see that powerful backbones can further boost the performance significantly, indicating the good potential and extensibility of our method.
2) SOC Dataset:
The scenarios of the SOC dataset [7] are much more complex than those of the ISOD dataset [18], so SIS on the SOC dataset is more challenging. The quantitative comparison between our method and S4Net [8] on the SOC dataset is summarized in Table IX. Since other methods do not report evaluation results on this dataset, we train S4Net [8] using its official code with default settings and report its best performance over three independent trials for a fair comparison. The results suggest that our method is 13.7%, 7.6%, and 20.9% better than S4Net in terms of AP, AP50, and AP70, respectively. This demonstrates that our method can handle cluttered backgrounds much better and that our improvement for SIS is nontrivial.

Fig. 5. Statistical analyses for our method on the ISOD [18] and SOC [7] test sets. (a) Error analyses (PR curves drawn as in Fig. 4). (b) Probability distribution of AP with respect to the instance count per image.
E. Qualitative Comparisons
To visually compare our method with the previous state-of-the-art method, S4Net [8], we show qualitative comparisons on the ISOD [18] and SOC [7] datasets in Fig. 6. S4Net produces many superfluous detections (false positives) or detects only parts of salient instances. In contrast, our method consistently produces high-quality salient instance masks. Moreover, the boundaries of salient instances detected by S4Net are usually rough, while our method produces salient instances with smooth boundaries. These qualitative comparisons further validate the effectiveness of the proposed method.
F. Statistical Analyses
The statistical characteristics of the ISOD [18] and SOC [7] datasets are highly different, so it is interesting to explore the differences in the performance of our method on these two datasets. Here, we conduct statistical analyses on the test sets of both datasets. We first compare the PR curves of our method on the two datasets, as shown in Fig. 5 (a). As the background of images in the SOC dataset is more cluttered than that in the ISOD dataset, more salient instances remain undetected in the SOC dataset, while in the ISOD dataset, most salient instances can be correctly localized. Then, we explore the probability distribution of AP for different numbers of salient instances in each image. More specifically, we calculate the AP score and the number of ground-truth salient instances for each image, and illustrate the overall probability distribution in Fig. 5 (b), where the area of each closed pattern is 1 (i.e., the sum of all probabilities). AP = 1 for an image means that our method almost perfectly detects and segments the ground truths in this image without false positives. AP = 0 indicates that none of the ground truths in this image are detected. In the ISOD dataset, the AP score of an image is likely to be above the median when the image contains no more than 3 salient instances, while in the SOC dataset, the same holds only for images with a single instance. Besides, in the ISOD dataset, our method fails (AP = 0) only for a few images with 1 or 2 salient instances, but in the SOC dataset, our method fails for considerably more images. The above analyses suggest that the SOC dataset is much more difficult than the ISOD dataset owing to its cluttered background and complex scenarios, so there is still much room to strengthen the feature representation in future SIS research.

Fig. 6. Qualitative comparisons between our method and S4Net [8]. The samples are from the ISOD and SOC datasets. S4Net [8] tends to detect superfluous objects (false positives) or only parts of instances. In contrast, our proposed method detects complete instances with far fewer false positives.

V. CONCLUSION AND FUTURE WORK
In this paper, we propose a new network for salient instance segmentation (SIS). The core of our method is the regularized densely-connected pyramid (RDP), which provides each side-output with richer yet more compatible bottom-up information flows to enhance the side-output predictions. We further design a novel multi-level RoIAlign based decoder for better mask prediction. Through extensive experiments, we analyze the effect of our proposed designs and demonstrate the effectiveness of our method. With these simple designs, the proposed method achieves state-of-the-art results on popular benchmarks in terms of all evaluation metrics while keeping a real-time speed. The effectiveness and efficiency of the proposed method make it suitable for many real-world applications. Moreover, this research is expected to push forward the development of feature learning and mask prediction for SIS. In the future, we plan to apply the RDP module to other vision tasks that need powerful feature pyramids. The code and pretrained models of this paper will be released to promote future research.

REFERENCES

[1] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk. Frequency-tuned salient region detection. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1597–1604, 2009.
[2] M.-M. Cheng, Y. Liu, W.-Y. Lin, Z. Zhang, P. L. Rosin, and P. H. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. Computational Visual Media, 5(1):3–20, 2019.
[3] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(3):569–582, 2014.
[4] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook. Efficient salient region detection with soft image abstraction. In IEEE International Conference on Computer Vision (ICCV), pages 1529–1536, 2013.
[5] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3992–4000, 2015.
[6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.
[7] D.-P. Fan, M.-M. Cheng, J.-J. Liu, S.-H. Gao, Q. Hou, and A. Borji. Salient objects in clutter: Bringing salient object detection to the foreground. In European Conference on Computer Vision (ECCV), pages 186–202, 2018.
[8] R. Fan, M.-M. Cheng, Q. Hou, T.-J. Mu, J. Wang, and S.-M. Hu. S4Net: Single stage salient-instance segmentation. Computational Visual Media, 6(2):191–204, June 2020.
[9] R. Fan, Q. Hou, M.-M. Cheng, G. Yu, R. R. Martin, and S.-M. Hu. Associating inter-image salient instances for weakly supervised semantic segmentation. In European Conference on Computer Vision (ECCV), pages 367–383, 2018.
[10] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1473–1482, 2015.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
[12] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 447–456, 2015.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[15] S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. In International Conference on Machine Learning (ICML), pages 597–606, 2015.
[16] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr. Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(4):815, 2019.
[17] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang. Mask Scoring R-CNN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6409–6418, 2019.
[18] G. Li, Y. Xie, L. Lin, and Y. Yu. Instance-level salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2386–2395, 2017.
[19] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2359–2367, 2017.
[20] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017.
[21] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014.
[23] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang. A simple pooling-based design for real-time salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[24] N. Liu and J. Han. DHSNet: Deep hierarchical saliency network for salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 678–686, 2016.
[25] N. Liu, J. Han, and M.-H. Yang. PiCANet: Learning pixel-wise contextual attention for saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3089–3098, 2018.
[26] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8759–8768, 2018.
[27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), pages 21–37. Springer, 2016.
[28] Y. Liu, M.-M. Cheng, X. Hu, J.-W. Bian, L. Zhang, X. Bai, and J. Tang. Richer convolutional features for edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(8):1939–1946, 2019.
[29] Y. Liu, M.-M. Cheng, X. Zhang, G.-Y. Nie, and M. Wang. DNA: Deeply-supervised nonlinear aggregation for salient object detection. arXiv preprint arXiv:1903.12476, 2019.
[30] Y. Liu, Y.-H. Wu, Y. Ban, H. Wang, and M.-M. Cheng. Rethinking computer-aided tuberculosis diagnosis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
[32] Y. Pang, X. Zhao, L. Zhang, and H. Lu. Multi-scale interactive network for salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), pages 8026–8037, 2019.
[34] J. Pont-Tuset, P. Arbeláez, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(1):128–140, 2017.
[35] Y. Qiu, Y. Liu, H. Yang, and J. Xu. A simple saliency detection approach via automatic top-down feature fusion. Neurocomputing, 388:124–134, 2020.
[36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
[37] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309–314, 2004.
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[39] Z. Tian, C. Shen, H. Chen, and T. He. FCOS: Fully convolutional one-stage object detection. In IEEE International Conference on Computer Vision (ICCV), 2019.
[40] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171, 2013.
[41] J. Wang, H. Jiang, Z. Yuan, M.-M. Cheng, X. Hu, and N. Zheng. Salient object detection: A discriminative regional feature integration approach. International Journal of Computer Vision (IJCV), 123(2):251–268, 2017.
[42] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan. Salient object detection with recurrent fully convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(7):1734–1746, 2018.
[43] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji. Detect globally, refine locally: A novel approach to saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3127–3135, 2018.
[44] Y. Wu and K. He. Group normalization. In European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[45] Y.-H. Wu, S.-H. Gao, J. Mei, J. Xu, D.-P. Fan, C.-W. Zhao, and M.-M. Cheng. JCS: An explainable COVID-19 diagnosis system by joint classification and segmentation. arXiv preprint arXiv:2004.07054, 2020.
[46] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1492–1500, 2017.
[47] S. Xie and Z. Tu. Holistically-nested edge detection. International Journal of Computer Vision (IJCV), 125(1-3):3–18, 2017.
[48] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang. UnitBox: An advanced object detection network. In ACM International Conference on Multimedia (ACM MM), pages 516–520. ACM, 2016.
[49] P. Zhang, W. Liu, H. Lu, and C. Shen. Salient object detection with lossless feature reflection and weighted structural loss. IEEE Transactions on Image Processing (TIP), 28(6):3048–3060, 2019.
[50] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan. Amulet: Aggregating multi-level convolutional features for salient object detection. In