Accurate RGB-D Salient Object Detection via Collaborative Learning
Wei Ji⋆, Jingjing Li⋆, Miao Zhang, Yongri Piao, and Huchuan Lu
Dalian University of Technology, Dalian, China
Pengcheng Lab, Shenzhen, China
[email protected], [email protected], {miaozhang, yrpiao, lhchuan}@dlut.edu.cn
https://github.com/OIPLab-DUT/CoNet
(⋆ means equal contribution)
Abstract. Benefiting from the spatial cues embedded in depth images, recent progress on RGB-D saliency detection shows impressive ability on some challenging scenarios. However, there are still two limitations. On the one hand, the pooling and upsampling operations in FCNs might cause blurred object boundaries. On the other hand, using an additional depth network to extract depth features might lead to high computation and storage cost. The reliance on depth inputs during testing also limits the practical applications of current RGB-D models. In this paper, we propose a novel collaborative learning framework where edge, depth and saliency are leveraged in a more efficient way, which solves these problems tactfully. The explicitly extracted edge information works together with saliency to give more emphasis to the salient regions and object boundaries. Depth and saliency learning are innovatively integrated into the high-level feature learning process in a mutual-benefit manner. This strategy frees the network from using extra depth networks and depth inputs to make inference, and thus makes our model more lightweight, faster and more versatile. Experimental results on seven benchmark datasets show its superior performance.
1 Introduction

The goal of salient object detection (SOD) is to locate and segment the most attractive and noticeable regions in an image. As a fundamental pre-processing task, salient object detection plays an important role in various computer vision tasks, e.g., visual tracking [25,52], video SOD [58,20], object detection [50,14], semantic segmentation [37], and human-robot interaction [13]. Recent research on RGB-D salient object detection has gradually broken the performance bottleneck of traditional methods and RGB-based methods, especially when dealing with complex scenarios such as similar foreground and background. However, there are some limitations that come with the introduction of FCNs [40,51] and depth images.
Firstly, the emergence of FCNs enables automatic extraction of multi-level and multi-scale features.
Fig. 1. (Left) First two rows: feature maps in different layers of CNNs. Last two rows: RGB image, depth image, edge map, saliency ground truth (GT) and saliency results of several state-of-the-art methods. * means RGB-D methods. (Right) Two kinds of previous RGB-D SOD network structures. (a) Processing RGB input and depth input separately and then combining the complementary RGB and depth features through cross-modal fusion (e.g., [7,23,5,6,46]). (b) Using tailor-made depth subnetworks to compensate for RGB representations (e.g., [68,65]).

The high-level features with rich semantic information can better locate salient objects, but the pooling and upsampling operations in FCNs might result in coarse and blurred object boundaries (see Fig. 1 (left)). The low-level features contain rich local details but suffer from excessive background noise and might cause information chaos.
Secondly, the spatial layout information from depth images can better express 3D scenes and help locate salient objects. However, previous RGB-D methods either adopted two-stream architectures that process RGB and depth images separately with various cross-modal fusion strategies (see Fig. 1a) [7,23,5,6,46], or utilized subnetworks tailored for the depth image to compensate for RGB representations (see Fig. 1b) [68,65]. In those methods, the additional depth networks might lead to high computation and storage cost, and cannot work without depth input, seriously limiting their practical applications.

In this paper, we propose a novel collaborative learning framework (CoNet) to confront the aforementioned limitations. In collaborative learning, multiple group members work together to achieve learning goals through exploratory learning and timely interaction. In our framework, three mutually beneficial collaborators are well designed from different perspectives of the SOD task, namely edge detection, coarse salient object detection, and depth estimation.
On the one hand, an edge collaborator is proposed to explicitly extract edge information from the overabundant low-level features; this information then works together with saliency knowledge to jointly assign greater emphasis to salient regions and object boundaries.
On the other hand, considering the strong consistencies among global semantics and geometrical properties of image regions [54], we innovatively integrate depth and saliency learning into the high-level feature learning process in a mutual-benefit manner. Instead of directly taking the depth image as input, this learning strategy frees the network from using an extra depth network to make inference from an extra input. Compared with previous RGB-D models, which utilize additional subnetworks to extract depth features and rely on depth images as input, our network is more lightweight, faster and more versatile. To the best of our knowledge, this is the first attempt to use depth images in such a way in RGB-D SOD research.
Finally, a unified tutor named the knowledge collector is designed to accomplish knowledge transfer from individual collaborators to the group, so as to more comprehensively utilize the learned edge, saliency and depth knowledge to make accurate saliency predictions. Benefiting from this learning strategy, our framework produces accurate saliency results with sharp boundaries preserved and simultaneously avoids the reliance on depth images during testing.

In summary, our main contributions are as follows:
– We propose a novel collaborative learning framework (CoNet) where edge, depth, and saliency are leveraged in a different but more efficient way for RGB-D salient object detection. The edge exploitation makes the boundaries of saliency maps more accurate.
– This learning strategy frees our RGB-D network from using an additional depth network and depth input during testing, making it more lightweight and versatile.
– Experimental results on seven datasets show the superiority of our method over other state-of-the-art approaches. Moreover, it supports a faster frame rate, running at 34 FPS and meeting the needs of real-time prediction (a 55% FPS improvement over the current best performing method DMRA [46]).
2 Related Work

Early works [27,10,45,36,61] for saliency detection mainly rely on hand-crafted features; [2,3,55] are some comprehensive surveys. Recently, traditional methods have been gradually surpassed by deep learning ones. Among this research, 2D methods [32,33,53,67,30,35,56,57,16,41,60] based on RGB images have achieved remarkable performance and have long been the mainstream of saliency detection. However, 2D saliency detection tends to degrade when handling complex scenarios due to the lack of spatial information in a single RGB image. The introduction of depth images in RGB-D saliency research [49,31,48,23,7,5,6,68,46,65] has brought great improvements for those complex cases thanks to the rich spatial information embedded in depth images.

The first CNNs-based method [48] for RGB-D SOD uses hand-crafted features extracted from RGB and depth images for training. Then, Chen et al. propose to use two-stream models [23,7] to process RGB and depth images separately and then combine cross-modal features to jointly predict saliency. They subsequently design a progressive fusion network [5] to better fuse cross-modal multi-level features and propose a three-stream network [6] which adopts the attention mechanism to adaptively select complementary information from RGB and depth features. Afterwards, Piao et al. [46] utilize a residual structure and a depth-scale feature fusion module to fuse paired RGB and depth features. The network structures in [23,7,5,6,46] can be represented as the two-stream architecture shown in
Fig. 1a. Another kind of structure uses subnetworks tailored for depth images to extract depth features and compensate for RGB representations [68,65] (Fig. 1b). Zhu et al. [68] utilize an auxiliary network to extract depth-induced features and then use them to enhance a pre-trained RGB prior model. In [65], Zhao et al. first enhance the depth map by a contrast prior and then treat it as an attention map and integrate it with RGB features. Those methods have some limitations. Using additional depth networks to extract depth features leads to high computation and storage cost. The reliance on depth images as input during testing also severely limits the practical applications of current RGB-D models. Moreover, we found that the boundaries of the saliency maps produced by those methods are somewhat coarse and blurred. This is mainly because the pooling and upsampling operations in FCNs might lead to the loss of local details, and current RGB-D methods have not taken steps to emphasize the boundaries of salient objects.

Some RGB-based SOD methods attempt to enhance boundary accuracy by adding edge constraints or designing boundary-aware losses. An edge guidance network [66] couples saliency and edge features to better preserve accurate object boundaries. Liu et al. [38] train their pooling-based network with an edge detection task and successfully enhance the details of salient regions. A predict-refine architecture [47] equipped with a hybrid loss segments salient regions and refines the structure with clear boundaries. An attentive feedback module [21] employs a boundary-enhanced loss for learning exquisite boundaries.

In this paper, we propose a novel collaborative learning framework where edge, depth and saliency are leveraged in a different but more efficient way. Different from previous RGB methods using edge supervision [66,38] or boundary-aware losses [47,21], we further combine the learned edge knowledge with saliency knowledge to give extra emphasis to both salient regions and boundaries. For the use of depth, we innovatively integrate it into the high-level feature learning process in a mutual-benefit manner, instead of directly taking depth images as input. Being free of depth subnetworks and depth input during testing makes our network more lightweight and versatile. In Section 3, we elaborate on our collaborative learning framework.
3 The Proposed Method

In this paper, we propose a novel CoNet for RGB-D SOD. The overall architecture is shown in Fig. 2. In this framework, three mutually beneficial collaborators, namely edge detection, coarse salient object detection and depth estimation, work together to aid accurate SOD through exploratory learning and timely interaction. From different perspectives of the SOD target, knowledge from edge, depth and saliency is fully exploited in a mutual-benefit manner to enhance the detector's performance. A simplified workflow is given below.

First, a backbone network is used to extract features from original images.
Fig. 2.
The overall architecture of our collaborative learning framework. Details of the Global Guidance Module can be found in Fig. 3. Here, Att̄_* = 1 − Att_*.

Table 1.
Detailed information of the five transition layers in Fig. 2, listing the transition operators applied to each backbone side-out feature.
Transition | Transition Operators
trans1 | Upsample
trans2 | Upsample
trans3 | Upsample, Conv + BN + PReLU
trans4 | Upsample, Conv + BN + PReLU
trans5 | Upsample, Conv + BN + PReLU

Five transition layers and a global guidance module (GGM) then perform feature preprocessing and generate the integrated low-level feature f_l and high-level feature f_h (details are given in Sec. 3.2). Then an edge collaborator is assigned to f_l to extract edge information from the overabundant low-level feature. For the high-level feature f_h, a saliency collaborator and a depth collaborator work together to jointly enhance the high-level feature learning process of global semantics in a mutual-benefit manner. Finally, all learned knowledge from the three collaborators (Att_edge, Att_sal and
Att_depth), as well as the integrated low-level and high-level features (F_g), are uniformly handed to a knowledge collector (KC). Here, acting as a tutor, the KC summarizes the learned edge, depth and saliency knowledge and utilizes it to predict accurate saliency results. We elaborate on the three collaborators and the KC in Sec. 3.3.

We use the widely used ResNet [24] suggested by other deep-learning-based methods [15,39,60] as the backbone network, where the last fully connected layers are truncated to better fit the SOD task. As shown in Fig. 2, five side-out features generated from the backbone network are transferred to five transition layers to change their sizes and numbers of channels. Detailed parameters are listed in Table 1, and the five output features are denoted as {f_1, f_2, f_3, f_4, f_5}.
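To make the feature preprocessing concrete, a minimal PyTorch sketch is given below. It is an illustration only, not the released implementation: the backbone variant, channel widths, module names (e.g., TransitionLayer) and upsampling factors are assumptions; only the operator pattern of Table 1 (plain upsampling for trans1–trans2, upsampling followed by Conv + BN + PReLU for trans3–trans5) is taken from the paper.

```python
import torch
import torch.nn as nn
import torchvision

class TransitionLayer(nn.Module):
    """Resizes a backbone side-out feature (hypothetical parameters).
    trans1/trans2 only upsample; trans3-trans5 additionally apply
    Conv + BN + PReLU, mirroring the operator pattern in Table 1."""
    def __init__(self, in_ch, out_ch, scale, with_conv):
        super().__init__()
        ops = [nn.Upsample(scale_factor=scale, mode='bilinear', align_corners=False)]
        if with_conv:
            ops += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                    nn.BatchNorm2d(out_ch),
                    nn.PReLU()]
        self.ops = nn.Sequential(*ops)

    def forward(self, x):
        return self.ops(x)

class Backbone(nn.Module):
    """ResNet backbone returning five side-out features f1..f5
    (the ResNet-50 variant here is an assumption)."""
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet50()
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)
        self.pool = net.maxpool
        self.layers = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        f1 = self.stem(x)              # 1/2 resolution
        feats, f = [f1], self.pool(f1)
        for layer in self.layers:      # 1/4, 1/8, 1/16, 1/32 resolutions
            f = layer(f)
            feats.append(f)
        return feats                   # [f1, f2, f3, f4, f5]
```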
Fig. 3.
The architecture of the global guidance module (GGM).
Global Guidance Module.
In order to obtain richer global semantics and alleviate information dilution in the decoder, a Global Guidance Module (GGM) is applied on the high-level features (i.e., f_3, f_4, and f_5) (see Fig. 3). Its key component, the global perception module (GPM), takes the progressively integrated feature as input, followed by four parallel dilated convolution operations [62] (kernel size = 3, dilation rates = 1/6/12/18) and one 1 × 1 convolution. We denote the GPM operation as $\widetilde{F} = \Phi(F)$, where F denotes the input feature map and $\widetilde{F}$ the output feature. In the GGM, we take the summation of the feature in the current layer and the output features of all higher-level GPMs as input to alleviate information dilution. Finally, the three output features of the GPMs are concatenated and an integrated high-level feature f_h is produced, which is computed by:

$\widetilde{f}_i = \Phi\big(f_i + \sum_{m=i+1}^{5} \widetilde{f}_m\big), \quad i = 3, 4, 5,$   (1)

$f_h = Up\big(W_h \ast Concat(\widetilde{f}_3, \widetilde{f}_4, \widetilde{f}_5) + b_h\big),$   (2)

where $\ast$ means the convolution operation, $W_h$ and $b_h$ are convolution parameters, and $Up(\cdot)$ means the upsampling operation.
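The following PyTorch sketch illustrates Eqs. (1)–(2) under the stated configuration (four parallel 3 × 3 dilated convolutions with rates 1/6/12/18 plus a 1 × 1 fusion). The channel width, the upsampling factor of Up(·), the module names, and the assumption that f_3–f_5 already share a common spatial size are illustrative choices, not the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPM(nn.Module):
    """Global perception module Phi(.): four parallel 3x3 dilated convolutions
    (rates 1/6/12/18) whose outputs are concatenated and fused by a 1x1
    convolution. Channel widths are illustrative assumptions."""
    def __init__(self, channels=256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in (1, 6, 12, 18)
        ])
        self.fuse = nn.Conv2d(4 * channels, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class GGM(nn.Module):
    """Global guidance module over f3, f4, f5 (Eqs. (1)-(2)): each GPM also
    receives the summed outputs of all higher-level GPMs, and the three GPM
    outputs are concatenated and fused."""
    def __init__(self, channels=256):
        super().__init__()
        self.gpm3, self.gpm4, self.gpm5 = GPM(channels), GPM(channels), GPM(channels)
        self.conv_h = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, f3, f4, f5):
        # assume the transition layers already brought f3, f4, f5 to one size
        g5 = self.gpm5(f5)                      # Eq. (1), i = 5
        g4 = self.gpm4(f4 + g5)                 # Eq. (1), i = 4
        g3 = self.gpm3(f3 + g4 + g5)            # Eq. (1), i = 3
        f_h = self.conv_h(torch.cat([g3, g4, g5], dim=1))
        return F.interpolate(f_h, scale_factor=2, mode='bilinear',
                             align_corners=False)   # Up(.) in Eq. (2); factor assumed
```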
Edge Collaborator.
Existing 3D methods [48,5,6,46,65] have achieved remarkable performance in locating salient regions, but they still suffer from coarse object boundaries. In our framework, we design an edge collaborator to explicitly extract edge information from the overabundant low-level feature and use this information to give more emphasis to object boundaries. Specifically, we first formulate this problem by adding edge supervision on top of the integrated low-level feature f_l. The edge ground truths (GT) (shown in Fig. 1) are derived from the saliency GT using the Canny operator [4]. As shown in Fig. 2, f_l is processed by a 1 × 1 convolution layer and a softmax function to predict an edge map M_edge. Then, a binary cross entropy loss (denoted as Loss_e) is adopted to calculate the difference between M_edge and the edge GT. As the edge maps M_edge in Fig. 2 and Fig. 5 show, the edge detection constraint is beneficial for predicting accurate boundaries of salient objects. Additionally, we also transfer the learned edge knowledge before the softmax function (denoted as Att_edge) to the knowledge collector (KC), where the edge information is further utilized to emphasize object boundaries. The reason why we use Att_edge rather than M_edge is to alleviate the negative influence brought by the accuracy decrement of M_edge.
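A minimal sketch of this branch is given below. The Canny thresholds, the single-channel head, and the use of a sigmoid in place of the softmax mentioned above are assumptions made for brevity; function and class names are illustrative.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def edge_gt_from_saliency(sal_gt_uint8):
    """Derive an edge ground truth from a binary saliency GT (uint8 image)
    with the Canny operator [4]; the two thresholds are assumptions."""
    edge = cv2.Canny(sal_gt_uint8, 100, 200)
    return torch.from_numpy(edge.astype(np.float32) / 255.0)

class EdgeCollaborator(nn.Module):
    """1x1 convolution on the low-level feature f_l. The pre-activation map is
    the edge knowledge Att_edge sent to the KC; its normalized version is
    M_edge (sigmoid here stands in for the softmax used in the paper)."""
    def __init__(self, in_channels=64):
        super().__init__()
        self.head = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, f_l):
        att_edge = self.head(f_l)          # edge knowledge transferred to the KC
        m_edge = torch.sigmoid(att_edge)   # predicted edge map
        return att_edge, m_edge

# Loss_e: binary cross entropy between M_edge and the edge GT (logit form)
edge_loss = nn.BCEWithLogitsLoss()
```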
Saliency and Depth Collaborators.
When addressing scene understanding tasks such as semantic segmentation and salient object detection, there exist strong consistencies among the global semantics and geometric properties of image regions [54]. In our framework, a saliency collaborator and a depth collaborator work together to jointly enhance the feature learning process of high-level semantics in a mutual-benefit manner.
Stage one:
The high-level feature f_h is first processed by a 1 × 1 convolution layer and a softmax function to predict a coarse saliency map S_coarse. Here, a binary cross entropy loss (denoted as Loss_s) is used for training. Then, the learned saliency knowledge acts as a spatial attention map to refine the high-level feature in a similar way to [60]. But different from [60], which uses S_coarse as the attention map directly, we use the more informative feature map before the softmax function (denoted as Att_sal) to emphasize or suppress each pixel of f_h. Identity mapping is adopted to prevent errors in Att_sal from propagating to depth learning and to accelerate network convergence. Formally, this procedure can be defined as:
$Att_{sal} = W_s \ast f_h + b_s,$   (3)

$\widetilde{f}_h = Att_{sal} \odot f_h + f_h,$   (4)

where $\odot$ means element-wise multiplication and $\widetilde{f}_h$ denotes the output saliency-enhanced feature.
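Equations (3)–(4) amount to a residual spatial-attention refinement; a minimal sketch is given below, where the channel width, the class name, and the single-channel form of Att_sal are assumptions.

```python
import torch
import torch.nn as nn

class SaliencyCollaborator(nn.Module):
    """Stage one: a 1x1 convolution predicts the saliency knowledge Att_sal
    (Eq. (3)), which refines f_h through a residual (identity-mapping)
    spatial attention (Eq. (4))."""
    def __init__(self, channels=256):
        super().__init__()
        self.head = nn.Conv2d(channels, 1, kernel_size=1)  # W_s, b_s

    def forward(self, f_h):
        att_sal = self.head(f_h)            # Eq. (3); S_coarse = softmax(att_sal)
        f_h_tilde = att_sal * f_h + f_h     # Eq. (4): attention + identity mapping
        return att_sal, f_h_tilde
```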
Stage two:
As pointed out in previous RGB-D research [46,65], the spatial information within the depth image is helpful for better locating salient objects in a scene. In our network, we innovatively integrate depth learning into the high-level feature learning process, instead of directly taking the depth image as input. This learning strategy frees our network from using an extra depth network to make inference from an extra depth input, and thus makes it more lightweight and versatile. As shown in Fig. 2, a depth head with three convolution layers (defined as Ψ(·)) is first used to adapt the feature $\widetilde{f}_h$ to depth estimation. Then, its output Ψ($\widetilde{f}_h$) is followed by a 1 × 1 convolution layer to predict a depth map Att_depth. Here, depth images act as GTs for supervision and we use the smooth L1 loss [22] to calculate the difference between Att_depth and the depth GT, where the smooth L1 loss is a robust L1 loss proposed in [22] that is less sensitive to outliers than the L2 loss. Formally, the depth loss can be defined as:

$Loss_d = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\begin{cases} 0.5\,|\triangle(x,y)|^2, & \text{if } |\triangle(x,y)| \le 1, \\ |\triangle(x,y)| - 0.5, & \text{otherwise}, \end{cases}$   (5)

where W and H denote the width and height of the depth map, and $\triangle(x,y)$ means the error between the prediction Att_depth and the depth GT at pixel (x, y).
Since each channel of a feature map can be considered as a feature detector [59], the depth knowledge
Att_depth is further employed to learn a channel-wise attention map M_c for choosing useful semantics. An identity mapping operation is also adopted to enhance the fault-tolerance ability. This procedure can be defined as:

$Att_{depth} = W_d \ast \Psi(\widetilde{f}_h) + b_d,$   (6)

$M_c = \sigma\big(GP(W_c \ast Att_{depth} + b_c)\big),$   (7)

$f_{hc} = M_c \otimes \widetilde{f}_h + \widetilde{f}_h,$   (8)

where $W_*$ and $b_*$ are parameters to be learned, $GP(\cdot)$ means the global pooling operation, $\sigma(\cdot)$ is the softmax function, and $\otimes$ denotes channel-wise multiplication. After these two stages, the two collaborators cooperatively generate an optimal feature which contains affluent spatial cues and possesses a strong ability to distinguish salient and non-salient regions.
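A sketch of stage two under assumed channel widths is given below; the three-layer depth head Ψ(·) and the channel attention of Eqs. (6)–(8) follow the description above, while the activation placement inside Ψ(·) and the class name are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthCollaborator(nn.Module):
    """Stage two: a three-layer depth head Psi(.) adapts the saliency-enhanced
    feature to depth estimation, a 1x1 convolution predicts Att_depth (Eq. (6)),
    and Att_depth drives a channel attention M_c (Eqs. (7)-(8))."""
    def __init__(self, channels=256):
        super().__init__()
        self.depth_head = nn.Sequential(                           # Psi(.)
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.pred = nn.Conv2d(channels, 1, 1)                      # W_d, b_d in Eq. (6)
        self.ca = nn.Conv2d(1, channels, 1)                        # W_c, b_c in Eq. (7)

    def forward(self, f_h_tilde):
        att_depth = self.pred(self.depth_head(f_h_tilde))          # Eq. (6)
        m_c = F.adaptive_avg_pool2d(self.ca(att_depth), 1)         # GP(.) in Eq. (7)
        m_c = torch.softmax(m_c, dim=1)                            # sigma(.) over channels
        f_hc = m_c * f_h_tilde + f_h_tilde                         # Eq. (8) + identity
        return att_depth, f_hc

# Loss_d (Eq. (5)): the smooth L1 loss between Att_depth and the depth GT
depth_loss = nn.SmoothL1Loss()
```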
Knowledge Collector.
In our framework, the KC works as a unified tutor to complete the knowledge transfer from individual collaborators to the group. As illustrated in Fig. 2, all knowledge learned from the three collaborators (i.e., Att_edge, Att_sal, and Att_depth) and the concatenated multi-level feature F_g = Concat(f_l, f_hc) are uniformly transferred to the KC. This information is comprehensively processed in a triple-attention manner to give more emphasis to salient regions and object boundaries. In Fig. 2, we show a detailed diagram with visualized attention maps for better understanding. To be specific, Att_edge and Att_sal are first concatenated together to jointly learn a fused attention map Att_f, where the locations and boundaries of the salient objects are considered uniformly. Then, F_g is in turn multiplied with the depth attention map Att_depth and the fused attention map Att_f, which significantly enhances the contrast between salient and non-salient areas. The ablation analysis shows the ability of the KC to enhance the performance significantly.

There is a vital problem worth considering: the quality of Att_depth and
Att_f might lead to irrecoverable inhibition of salient areas. Therefore, we add several residual connection operations [24] to the KC to retain the original features. Formally, this process can be defined as:

$Att_f = \sigma\big(W_f \ast Concat(Att_{sal}, Att_{edge}) + b_f\big),$   (9)

$\widetilde{F}_g = Att_{depth} \odot F_g + F_g,$   (10)

$F = Att_f \odot \widetilde{F}_g + \widetilde{F}_g.$   (11)

In the end, F is followed by a 1 × 1 convolution layer and a softmax function to predict the final saliency map S_final. Here, a binary cross entropy loss (denoted as Loss_f) is used to calculate the difference between S_final and the saliency GT. Thus, the total loss L can be represented as:

$L = \lambda_e Loss_e + \lambda_s Loss_s + \lambda_d Loss_d + \lambda_f Loss_f,$   (12)

where Loss_e, Loss_s, and Loss_f are cross entropy losses and Loss_d is a smooth L1 loss. In this paper, we set λ_e = λ_s = λ_f = 1 and λ_d = 3.
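For reference, Eqs. (9)–(12) can be sketched as follows; the channel width of F_g, the use of a sigmoid for σ(·) on the single-channel fused attention, the logits-based loss formulation, and all names are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeCollector(nn.Module):
    """Triple-attention fusion of the learned knowledge (Eqs. (9)-(11)) with
    residual connections that retain the original features."""
    def __init__(self, channels=320):
        super().__init__()
        self.fuse_att = nn.Conv2d(2, 1, kernel_size=1)    # W_f, b_f in Eq. (9)
        self.pred = nn.Conv2d(channels, 1, kernel_size=1) # final 1x1 prediction head

    def forward(self, f_g, att_edge, att_sal, att_depth):
        att_f = torch.sigmoid(self.fuse_att(torch.cat([att_sal, att_edge], dim=1)))  # Eq. (9)
        f_g_tilde = att_depth * f_g + f_g                 # Eq. (10)
        f_final = att_f * f_g_tilde + f_g_tilde           # Eq. (11)
        return self.pred(f_final)                         # logits of S_final

def total_loss(logits_final, logits_coarse, logits_edge, pred_depth,
               sal_gt, edge_gt, depth_gt):
    """Eq. (12) with lambda_e = lambda_s = lambda_f = 1 and lambda_d = 3."""
    bce = F.binary_cross_entropy_with_logits
    loss_e = bce(logits_edge, edge_gt)
    loss_s = bce(logits_coarse, sal_gt)
    loss_f = bce(logits_final, sal_gt)
    loss_d = F.smooth_l1_loss(pred_depth, depth_gt)
    return loss_e + loss_s + 3.0 * loss_d + loss_f
```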
4 Experiments

To evaluate the performance of our network, we conduct experiments on seven widely used benchmark datasets.

DUT-D [46]: contains 1200 images with 800 indoor and 400 outdoor scenes paired with corresponding depth images. This dataset contains many complex scenarios.
NJUD [28]: contains 1985 stereo images (the latest version), gathered from the Internet, 3D movies and photographs taken by a Fuji W3 stereo camera.
NLPR [44]: includes 1000 images captured by Kinect under different illumination conditions.
SIP [19]: contains 929 salient person samples with different poses and illumination conditions.
LFSD [34]: is a relatively small dataset with 100 images captured by a Lytro camera.
STEREO [43]: contains 797 stereoscopic images downloaded from the Internet.
RGBD135 [11]: consists of seven indoor scenes and contains 135 images captured by Kinect.

For training, we split 800 samples from DUT-D, 1485 samples from NJUD, and 700 samples from NLPR as in [5,6,46]. The remaining images and the other public datasets are all used for testing to comprehensively evaluate the generalization abilities of models. To reduce overfitting, we augment the training set by randomly flipping, cropping and rotating the images.
We adopt six widely used evaluation metrics to verify the performance of various models, including the precision-recall (PR) curve, mean F-measure (F_β) [1], mean absolute error (MAE) [3], weighted F-measure (F_wβ) [42] and the recently proposed S-measure (S) [17] and E-measure (E) [18]. Saliency maps are binarized using a series of thresholds and then pairs of precision and recall are computed to plot the PR curve. The F-measure is a harmonic mean of average precision and average recall; here, we calculate the mean F-measure, which uses an adaptive threshold to generate the binary saliency map. The MAE represents the average absolute difference between the saliency map and the ground truth. The weighted F-measure intuitively generalizes the F-measure by altering the way the precision and recall are calculated. The S-measure contains two terms: object-aware and region-aware structural similarities. The E-measure jointly captures image-level statistics and local pixel matching information. Details of these evaluation metrics can be found in [55]. For MAE, a lower value is better; for the others, higher is better.
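As a concrete reference for two of these metrics, a minimal NumPy sketch of MAE and the adaptive-threshold mean F-measure is given below; β² = 0.3 and the threshold rule (twice the mean saliency value) follow common practice for [1], and the remaining protocol details are assumptions rather than the exact evaluation code.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth,
    both float arrays scaled to [0, 1]."""
    return np.mean(np.abs(pred - gt))

def adaptive_fmeasure(pred, gt, beta2=0.3):
    """F-measure with an adaptive threshold (twice the mean saliency value),
    as commonly used for the mean F-measure; details are assumptions."""
    thr = min(2.0 * pred.mean(), 1.0)
    binary = pred >= thr
    gt = gt > 0.5
    tp = np.logical_and(binary, gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```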
Implementation details.
We implement our proposed framework using the PyTorch toolbox and train it with a GTX 1080 Ti GPU. All training and test images are uniformly resized to 256 × 256. Processing a 256 × 256 image takes only 0.0290 s (34 FPS).
Ablation Analysis.
We show the quantitative and qualitative results of different modules of our proposed network in Tab. 2 and Fig. 4. The backbone network (denoted as B) is constructed by directly concatenating the low-level feature f_l and high-level feature f_h without using the GGM for prediction. Comparison of results (a) and (b) shows that adding our GGM can more effectively extract rich semantic features and prevent information dilution in the decoding stage.

Table 2.
Quantitative results of the ablation analysis on two benchmark datasets. B means the backbone network. E and S represent edge supervision and saliency supervision respectively. S_SA + D_CA means our mutual-benefit learning strategy between depth and saliency. +KC means adding our knowledge collector on (e).

Index | Modules            | NJUD F_β↑  MAE↓ | NLPR F_β↑  MAE↓
(a)   | B                  | 0.831  0.065    | 0.797  0.050
(b)   | B+GGM              | 0.839  0.060    | 0.813  0.044
(c)   | (b)+E              | 0.851  0.056    | 0.825  0.041
(d)   | (b)+E+S            | 0.857  0.054    | 0.833  0.038
(e)   | (b)+E+S_SA+D_CA    | 0.864  0.051    | 0.841  0.035
(f)   | (e)+KC             | 0.872  0.047    | 0.848  0.031
Fig. 4.
Visual saliency maps of the ablation analysis. The meaning of the indexes (a)–(f) can be found in Table 2.
After introducing edge supervision (denoted as E), the boundaries of the saliency maps are sharper (b vs. c in Fig. 4). The edge maps (M_edge) in Fig. 2 and Fig. 5 also show the ability of our network to explicitly extract object boundaries. By adding additional saliency supervision on f_h (denoted as S), the performance can be further improved. Furthermore, by comparing (d) and (e), we can see that our mutual-benefit learning style between the saliency collaborator and the depth collaborator (denoted as S_SA and D_CA) can further improve the detector's ability to locate salient objects. This also verifies the strong correlation between saliency and depth. Finally, by using our proposed knowledge collector (KC), all the learned edge, depth and saliency knowledge from the three collaborators can be effectively summarized and utilized to give more emphasis to salient regions and object boundaries, improving the average MAE performance on the two datasets by nearly 9.6%. By comparing (e) and (f) in Fig. 4, we can also see that the salient regions in (f) are more consistent with the saliency GT and the object boundaries are explicitly highlighted, benefiting from the comprehensive knowledge utilization. These advances demonstrate that our collaborative learning strategy is beneficial for accurate saliency prediction. We list some numerical results here for better understanding: the root mean squared error (RMSE) of depth prediction on the NJUD and NLPR datasets is 0.3684 and 0.4696, respectively, and the MAE scores of edge prediction are 0.053 and 0.044, respectively.

The Interactions between Collaborators.
Saliency and Edge.
To explore the correlation between saliency and edge, we gradually add edge detection supervision (denoted as E) and saliency supervision (denoted as S_l) on the low-level feature f_l. From the quantitative results in Tab. 3, we can see that adding edge supervision can explicitly extract clear boundary information and significantly enhance the detection performance, especially the F-measure scores. However, when adding saliency supervision on f_l, the performance on both datasets decreases dramatically.

Table 3.
Ablation analysis of the interactions between the three collaborators. The meaning of indexes (b)–(f) can be found in Table 2. +S_l means adding saliency supervision on the low-level feature. D means depth supervision.

Modules                              | NJUD F_β↑  MAE↓ | NLPR F_β↑  MAE↓
Saliency & Edge
(b)                                  | 0.839  0.060    | 0.813  0.044
(b)+E (c)                            | 0.851  0.056    | 0.825  0.041
(b)+E+S_l                            |                 |
Saliency & Depth
(c)                                  | 0.851  0.056    | 0.825  0.041
(c)+S (d)                            | 0.857  0.054    | 0.833  0.038
(c)+S+D                              | 0.859  0.054    | 0.835  0.037
(c)+S+D_CA                           |                 |
(c)+S_SA+D_CA (e)                    | 0.864  0.051    | 0.841  0.035
Saliency & Edge & Depth
(e)                                  | 0.864  0.051    | 0.841  0.035
(e)+Att_edge                         |                 |
(e)+Att_sal                          |                 |
(e)+Att_edge+Att_sal                 |                 |
(e)+Att_edge+Att_sal+Att_depth (f)   | 0.872  0.047    | 0.848  0.031
Fig. 5.
Internal results in the knowledge collector. The results of another sample can be seen in Fig. 2. Here, F̄ = 1 − F.

This is partly because the low-level features contain too much information and are relatively too coarse to predict saliency, and partly because the two tasks are to some extent incompatible: one highlights the boundaries while the other highlights the whole salient objects. Hence, it is optimal to only add edge detection supervision on the low-level feature.
Saliency and Depth.
In order to verify the effectiveness of the proposed mutual-benefit learning strategy on the high-level feature f_h, we gradually add the two collaborators and their mutual-benefit operations to the baseline model (c). As shown in Tab. 3, adding saliency supervision (denoted as S) and adding depth supervision (denoted as D) are both beneficial for extracting more representative high-level semantic features. In addition, by gradually introducing our proposed mutual-benefit learning strategy between the two collaborators (denoted as S_SA and D_CA), the spatial layouts and global semantics of the high-level feature can be greatly enhanced, which consequently brings additional accuracy gains on both datasets. These results further verify the effectiveness of our collaborative learning framework.

Saliency, Edge and Depth.
In our knowledge collector, all the knowledge learned from the three collaborators is summarized and utilized in a triple-attention manner. As the visualized attention maps in Fig. 2 and Fig. 5 show, the edge knowledge (Att_edge) can help highlight object boundaries, and the depth and saliency knowledge (Att_depth and Att_sal) can also be used to emphasize salient regions and suppress non-salient regions. We can see from Tab. 3 that both Att_edge and Att_sal are beneficial for enhancing the feature representation and improving the F-measure and MAE performance.
Table 4.
Quantitative comparisons on seven benchmark datasets. The best three results are shown in blue, red, and green fonts respectively.
Methods compared: DES [11], LHM [44], DCMC [12], MB [69], CDCP [70], DF [48], CTMF [23], PDNet [68], MPCI [7], TANet [6], PCA [5], CPFP [65], DMRA [46], and Ours. Datasets: DUT-D [46], NJUD [28], NLPR [44], STEREO [43], SIP [19], LFSD [34], RGBD135 [11]. Metrics per dataset: E↑, S↑, F_wβ↑, F_β↑, MAE↓.
In our framework, we adopt a better strategy in which Att_edge and Att_sal are concatenated together to jointly emphasize salient objects and their boundaries. Finally, by comparing the results in the last two lines of Tab. 3, we can see that further utilizing the learned depth knowledge improves the detector's performance again. We visualize all internal results of the KC in Fig. 5 for better understanding.
Comparison with State-of-the-Art Methods.
We compare the results of our method with various state-of-the-art approaches on seven public datasets. For fair comparisons, the results of the competing methods are generated by the authorized codes or directly provided by the authors.
Quantitative Evaluation.
Tab. 4 shows the quantitative results of our method compared with 13 other RGB-D methods on seven benchmark datasets. We can see that our proposed collaborative learning framework achieves superior performance. Note that our method avoids the reliance on depth images and only takes the RGB image as input in the testing stage. To comprehensively verify the effectiveness of our model, we additionally conduct comparisons with 9 state-of-the-art RGB methods on three public datasets.
Table 5.
Quantitative comparisons with state-of-the-art 2D methods.
Methods compared: DSS [26], Amulet [63], R³Net [15], PiCANet [39], PAGRN [64], EGNet [66], PoolNet [38], BASNet [47], CPD [60], and Ours. Datasets: NJUD [28], NLPR [44], STEREO [43]. Metrics per dataset: S↑, F_wβ↑, MAE↓.
Fig. 6.
Visual comparisons of our method with other state-of-the-art CNNs-based methods in some representative scenes. * means RGB-D methods.

Results in Tab. 5 consistently show that our method also achieves comparable results against these 2D methods. The PR curves in Fig. 7 further verify the superiority of our method.
Qualitative Evaluation.
Fig. 6 shows some representative examples comparing our method with several top-ranking CNNs-based RGB and RGB-D approaches. For complex scenes with low contrast (the 4th and 5th rows) or multiple objects (the 8th row), our method can better locate the salient objects thanks to the useful spatial information in the depth image and the sufficient extraction and utilization of edge information. Thus, our method can produce accurate saliency results with sharp boundaries preserved.

Complexity Evaluation.
We also compare the model size and run time (frames per second, FPS) of our method with 11 representative models in Tab. 6. Thanks to the well-designed depth learning strategy, our network is free of extra depth networks and depth inputs at inference time.
Fig. 7.
The PR curves of our method compared to other state-of-the-art approaches on four datasets: (a) DUT-D, (b) NJUD, (c) NLPR, and (d) STEREO.
Table 6.
Complexity comparisons of various methods. The best three results are shown in blue, red, and green fonts respectively. FPS means frames per second.
Types | Methods  | Years     | Size     | FPS | NJUD [28] F_wβ↑  MAE↓ | NLPR [44] F_wβ↑  MAE↓
2D    | DSS      | 2017'CVPR | 447.3 MB | 22  | 0.678  0.108          | 0.614  0.076
2D    | Amulet   | 2017'ICCV |          | 16  | 0.758  0.085          | 0.716  0.062
2D    | PiCANet  | 2018'CVPR | 197.2 MB | 7   | 0.768  0.071          | 0.707  0.053
2D    | PoolNet  | 2019'CVPR | 278.5 MB | 32  | 0.816  0.057          | 0.771  0.046
2D    | CPD      | 2019'CVPR | 183 MB   |     |                       |
      | Ours     |           | 167.6 MB | 34  |                       |

It can also be seen that our method achieves outstanding scores with a smaller model size and a higher FPS (improving FPS by 55% compared with the current best performing RGB-D model DMRA). These results confirm that our model is well suited as a pre-processing component in terms of model size and running speed.
5 Conclusions

In this work, we propose a novel collaborative learning framework for accurate RGB-D salient object detection. In our framework, three mutually beneficial collaborators, i.e., edge detection, coarse salient object detection and depth estimation, jointly accomplish the SOD task from different perspectives. Benefiting from the well-designed mutual-benefit learning strategy among the three collaborators, our method produces accurate saliency results with sharp boundaries preserved. Being free of extra depth subnetworks and depth inputs during testing also makes our network more lightweight and versatile. Experimental results on seven benchmark datasets show that our method achieves superior performance over 22 state-of-the-art RGB and RGB-D methods.
Acknowledgements
This work was supported by the Science and Technology Innovation Foundation of Dalian (2019J12GX034), the National Natural Science Foundation of China (61976035), and the Fundamental Research Funds for the Central Universities (DUT19JC58, DUT20JC42).
References
1. Achanta, R., Hemami, S.S., Estrada, F.J., Süsstrunk, S.: Frequency-tuned salient region detection. In: CVPR. pp. 1597–1604 (2009)
2. Borji, A., Cheng, M.M., Jiang, H., Li, J.: Salient object detection: A benchmark. TIP (12), 5706–5722 (2015)
3. Borji, A., Sihite, D.N., Itti, L.: Salient object detection: a benchmark. In: ECCV. pp. 414–429 (2012)
4. Canny, J.: A computational approach to edge detection. TPAMI (6), 679–698 (1986)
5. Chen, H., Li, Y.: Progressively complementarity-aware fusion network for rgb-d salient object detection. In: CVPR. pp. 3051–3060 (2018)
6. Chen, H., Li, Y.: Three-stream attention-aware network for rgb-d salient object detection. TIP (6), 2825–2835 (2019)
7. Chen, H., Li, Y., Su, D.: Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for rgb-d salient object detection. PR, 376–385 (2019)
8. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI (4), 834–848 (2018)
9. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV. pp. 833–851 (2018)
10. Cheng, M.M., Zhang, G.X., Mitra, N.J., Huang, X., Hu, S.M.: Global contrast based salient region detection. TPAMI (3), 409–416 (2011)
11. Cheng, Y., Fu, H., Wei, X., Xiao, J., Cao, X.: Depth enhanced saliency detection method. In: ICIMCS. pp. 23–27 (2014)
12. Cong, R., Lei, J., Zhang, C., Huang, Q., Cao, X., Hou, C.: Saliency detection for stereoscopic images based on depth confidence analysis and multiple cues fusion. SPL (6), 819–823 (2016)
13. Craye, C., Filliat, D., Goudou, J.F.: Environment exploration for object-based visual saliency learning. In: ICRA. pp. 2303–2309 (2016)
14. Dai, J., Li, Y., He, K., Sun, J.: R-fcn: object detection via region-based fully convolutional networks. In: NIPS. pp. 379–387 (2016)
15. Deng, Z., Hu, X., Zhu, L., Xu, X., Qin, J., Han, G., Heng, P.A.: R3net: Recurrent residual refinement network for saliency detection. In: IJCAI. pp. 684–690 (2018)
16. Fan, D.P., Cheng, M.M., Liu, J.J., Gao, S.H., Hou, Q., Borji, A.: Salient objects in clutter: Bringing salient object detection to the foreground. In: ECCV. pp. 196–212 (2018)
17. Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A.: Structure-measure: A new way to evaluate foreground maps. In: ICCV. pp. 4558–4567 (2017)
18. Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. In: IJCAI. pp. 698–704 (2018)
19. Fan, D.P., Lin, Z., Zhao, J., Liu, Y., Zhang, Z., Hou, Q., Zhu, M., Cheng, M.M.: Rethinking rgb-d salient object detection: Models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781 (2019)
20. Fan, D.P., Wang, W., Cheng, M.M., Shen, J.: Shifting more attention to video salient object detection. In: CVPR. pp. 8554–8564 (2019)
21. Feng, M., Lu, H., Ding, E.: Attentive feedback network for boundary-aware salient object detection. In: CVPR. pp. 1623–1632 (2019)
22. Girshick, R.: Fast r-cnn. In: ICCV. pp. 1440–1448 (2015)
23. Han, J., Chen, H., Liu, N., Yan, C., Li, X.: Cnns-based rgb-d saliency detection via cross-view transfer and multiview fusion. IEEE Transactions on Systems, Man, and Cybernetics (11), 3171–3183 (2018)
24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
25. Hong, S., You, T., Kwak, S., Han, B.: Online tracking by learning discriminative saliency map with convolutional neural network. In: ICML. pp. 597–606 (2015)
26. Hou, Q., Cheng, M.M., Hu, X., Borji, A., Tu, Z., Torr, P.H.S.: Deeply supervised salient object detection with short connections. In: CVPR. pp. 815–828 (2017)
27. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. TPAMI (11), 1254–1259 (1998)
28. Ju, R., Ge, L., Geng, W., Ren, T., Wu, G.: Depth saliency based on anisotropic center-surround difference. In: ICIP. pp. 1115–1119 (2014)
29. Krähenbühl, P., Koltun, V.: Efficient inference in fully connected crfs with gaussian edge potentials. In: NIPS. pp. 109–117 (2011)
30. Lee, G., Tai, Y.W., Kim, J.: Deep saliency with encoded low level distance map and high level features. In: CVPR. pp. 660–668 (2016)
31. Li, G., Zhu, C.: A three-pathway psychobiological framework of salient object detection using stereoscopic technology. In: ICCVW. pp. 3008–3014 (2017)
32. Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: CVPR. pp. 5455–5463 (2015)
33. Li, G., Yu, Y.: Visual saliency detection based on multiscale deep cnn features. TIP (11), 5012–5024 (2016)
34. Li, N., Ye, J., Ji, Y., Ling, H., Yu, J.: Saliency detection on light field. TPAMI (8), 1605–1616 (2017)
35. Li, X., Zhao, L., Wei, L., Yang, M.H., Wu, F., Zhuang, Y., Ling, H., Wang, J.: Deepsaliency: Multi-task deep neural network model for salient object detection. TIP (8), 3919–3930 (2016)
36. Li, Y., Hou, X., Koch, C., Rehg, J.M., Yuille, A.L.: The secrets of salient object segmentation. In: CVPR. pp. 280–287 (2014)
37. Lin, G., Milan, A., Shen, C., Reid, I.D.: Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In: CVPR. pp. 5168–5177 (2017)
38. Liu, J.J., Hou, Q., Cheng, M.M., Feng, J., Jiang, J.: A simple pooling-based design for real-time salient object detection. In: CVPR. pp. 3917–3926 (2019)
39. Liu, N., Han, J., Yang, M.H.: Picanet: Learning pixel-wise contextual attention for saliency detection. In: CVPR. pp. 3089–3098 (2018)
40. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. pp. 3431–3440 (2015)
41. Luo, Z., Mishra, A.K., Achkar, A., Eichel, J.A., Li, S., Jodoin, P.M.: Non-local deep features for salient object detection. In: CVPR. pp. 6593–6601 (2017)
42. Margolin, R., Zelnik-Manor, L., Tal, A.: How to evaluate foreground maps. In: CVPR. pp. 248–255 (2014)
43. Niu, Y., Geng, Y., Li, X., Liu, F.: Leveraging stereopsis for saliency analysis. In: CVPR. pp. 454–461 (2012)
44. Peng, H., Li, B., Xiong, W., Hu, W., Ji, R.: Rgbd salient object detection: A benchmark and algorithms. In: ECCV. pp. 92–109 (2014)
45. Perazzi, F., Krähenbühl, P., Pritch, Y., Hornung, A.: Saliency filters: Contrast based filtering for salient region detection. In: CVPR. pp. 733–740 (2012)
46. Piao, Y., Ji, W., Li, J., Zhang, M., Lu, H.: Depth-induced multi-scale recurrent attention network for saliency detection. In: ICCV (2019)
47. Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., Jagersand, M.: Basnet: Boundary-aware salient object detection. In: CVPR. pp. 7479–7489 (2019)
48. Qu, L., He, S., Zhang, J., Tian, J., Tang, Y., Yang, Q.: Rgbd salient object detection via deep fusion. TIP (5), 2274–2285 (2017)
49. Ren, J., Gong, X., Yu, L., Zhou, W., Yang, M.Y.: Exploiting global priors for rgb-d saliency detection. In: CVPRW. pp. 25–32 (2015)
50. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster r-cnn: towards real-time object detection with region proposal networks. In: NIPS. vol. 2015, pp. 91–99 (2015)
51. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
52. Smeulders, A.W.M., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.: Visual tracking: An experimental survey. TPAMI (7), 1442–1468 (2014)
53. Wang, L., Lu, H., Ruan, X., Yang, M.H.: Deep networks for saliency detection via local estimation and global search. In: CVPR. pp. 3183–3192 (2015)
54. Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A.: Towards unified depth and semantic prediction from a single image. In: CVPR. pp. 2800–2809 (2015)
55. Wang, W., Lai, Q., Fu, H., Shen, J., Ling, H.: Salient object detection in the deep learning era: An in-depth survey. arXiv preprint arXiv:1904.09146 (2019)
56. Wang, W., Shen, J.: Deep visual attention prediction. TIP 27