Towards Accurate RGB-D Saliency Detection with Complementary Attention and Adaptive Integration
Hong-Bo Bi (a), Zi-Qi Liu (a), Kang Wang (a), Bo Dong (b), Geng Chen (c) and Ji-Quan Ma (d)

(a) School of Electrical Information Engineering, Northeast Petroleum University, Daqing 163000, China
(b) Department of Optical Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
(c) Inception Institute of Artificial Intelligence, Abu Dhabi, UAE
(d) Department of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
ARTICLE INFO

Keywords: RGB-D saliency detection, Context-awareness, Complementary attention, Adaptive integration

ABSTRACT
Saliency detection based on the complementary information from RGB images and depth maps has recently gained great popularity. In this paper, we propose the Complementary Attention and Adaptive Integration Network (CAAI-Net), a novel RGB-D saliency detection model that integrates complementary-attention-based feature concentration and adaptive cross-modal feature fusion into a unified framework for accurate saliency detection. Specifically, we propose a context-aware complementary attention (CCA) module, which consists of a feature interaction component, a complementary attention component, and a global-context component. The CCA module first utilizes the feature interaction component to extract rich local context features. The resulting features are then fed into the complementary attention component, which employs the complementary attention generated from adjacent levels to guide the attention at the current layer, so that mutual background disturbances are suppressed and the network focuses more on the areas with salient objects. Finally, we utilize a specially-designed adaptive feature integration (AFI) module, which sufficiently considers the low-quality issue of depth maps, to aggregate the RGB and depth features in an adaptive manner. Extensive experiments on six challenging benchmark datasets demonstrate that CAAI-Net is an effective saliency detection model and outperforms nine state-of-the-art models in terms of four widely-used metrics. In addition, extensive ablation studies confirm the effectiveness of the proposed CCA and AFI modules.
1. Introduction
Salient object detection (SOD), which segments the most attractive objects in an image, has drawn increasing research efforts in recent years [1–10]. SOD has a large number of applications, such as object recognition [11], image/video compression [12], image retrieval [13, 14], image redirection [15], image segmentation [16, 17], image enhancement [18], quality assessment [19], etc. With the rapid progress in this field, a number of derived techniques have been developed. Typical instances include video saliency detection [20–26], co-saliency detection [27, 28], stereo saliency detection [29], etc.

The perception of depth information is the premise of human stereoscopic vision. Therefore, considering depth information in SOD can better imitate the human visual mechanism and improve the detection accuracy. In recent years, increasing research effort has been made to study RGB-D saliency detection [30–39]. Existing methods employ different schemes to handle the multi-level and multi-modal features. For the multi-level features, Liu et al. [40] utilized a pixel-wise contextual attention network to focus on the context information of each pixel and hierarchically integrate the global and local context features. Wang et al. [41] devised a pyramid attention structure to concentrate more on salient regions based on a typical bottom-up/top-down network architecture. Zhang et al. [42] developed an aggregating multi-level convolutional feature framework to extract multi-level features and integrate them into multiple resolutions. For the fusion of the multi-modal features, Liu et al. [43] took depth maps as the fourth channel of the input and employed a parallel structure to extract features through spatial/channel attention mechanisms. Piao et al. [44] exploited a multi-level cross-modal way to fuse the RGB and depth features, and proposed a depth distiller to transfer the depth information to the RGB stream. Li et al. [34] designed an information conversion module to fuse high-level RGB and depth features adaptively, and the RGB features at each level were enhanced by weighting depth information. Piao et al. [45] adopted a depth-refinement-block-based fusion method for the RGB and depth features at each level. More details can be found in the recently released RGB-D survey and benchmark papers [46–48].

Despite their advantages, most existing deep-based RGB-D saliency detection methods suffer from two major limitations. First, although attention mechanisms have been adopted, most existing methods only rely on one kind of attention mechanism, e.g., channel attention, spatial attention, etc. This results in the drawback that the network is unable to sufficiently explore and make full use of the attention for improving the performance. Second, existing methods usually overlook the noisy nature of depth maps, and directly fuse the RGB and depth features by simple concatenation or addition. A more reasonable fusion of multi-level and cross-modal features can effectively reduce the error rate caused by misidentification. This is particularly important for salient object detection in interference environments, e.g., complex, low-contrast, or similar backgrounds. As shown in Fig. 1, low-quality depth information and a locally similar scene affect the performance of existing cutting-edge models, making them unable to accurately detect the salient objects.
Figure 1:
Saliency maps of state-of-the-art deep-based RGB-D models in a complex, locally similar scene.

To address these limitations, in this paper, we propose a novel RGB-D saliency detection model, called Complementary Attention and Adaptive Integration Network
(CAAI-Net), which employs a complementary attention mechanism along with adaptive feature fusion to detect salient objects from multi-modal RGB-D images. Our CAAI-Net effectively resolves the drawbacks of existing methods with a more comprehensive attention mechanism and a novel fusion strategy, which considers the low-quality issue of depth maps and fuses multi-modal features in an adaptive manner. Specifically, we employ two backbones to extract multi-level features from RGB images and depth maps. The multi-level features are first divided into low-level and high-level features according to their locations in the backbones. For the low-level features, the semantic information of the different channels is almost indistinguishable, therefore we adopt spatial attention (SA) components to refine the features rather than using channel attention (CA) components. The attention component is employed to suppress the useless background information and locate the informative features. For the high-level features, we propose a context-aware complementary attention (CCA) module for better informative feature concentration and noisy feature reduction. The CCA module consists of a feature interaction component, a complementary attention component, and a global-context component. The feature interaction component is designed to extract the local context features using a pyramid structure, which supplements missing information from adjacent levels. The resulting features are then fed to the complementary attention component, which is a mixture of CA and SA components with effective inter-level guidance. In addition, the global-context component further supplements the details. Finally, we design an adaptive feature integration (AFI) module to adaptively fuse the cross-modal features at each level. The AFI module employs the fusion weights generated from the adjacent levels as guidance to obtain enhanced RGB features, and then fuses the enhanced RGB and depth features in an adaptive manner.

In summary, our contributions are three-fold:

• We propose the CCA module, which is able to extract the informative features highly related to accurate saliency detection. In the CCA module, the feature interaction component employs a pyramid structure along with nested connections to extract rich context features. The complementary attention component refines the features to capture highly informative features, while effectively reducing the noisy feature disturbances. The global-context component supplements the details to enrich the features.

• We propose a novel adaptive feature fusion module, AFI, which adaptively integrates the multi-modal features at each level. The AFI module is able to self-correct the ratio of the different feature branches. Moreover, the feature coefficients automatically generated from pooling and softmax layers are assigned to the enhanced RGB features and depth features to balance their contributions to the feature fusion.

• Extensive experiments on six benchmark datasets demonstrate that our CAAI-Net outperforms nine state-of-the-art (SOTA) RGB-D saliency detection methods, both qualitatively and quantitatively. In addition, the effectiveness of the proposed modules is validated by extensive ablation studies.

Our paper is organized as follows. In Section 2, we introduce related work. In Section 3, we describe our CAAI-Net in detail. In Section 4, we present the datasets, experimental settings, and results. Finally, we conclude our work in Section 5.
2. Related Works
In this section, we discuss a number of works that are closely related to ours. These works are divided into three categories, including RGB-D saliency detection, global and local context mechanisms, and the attention mechanism.
2.1. RGB-D Saliency Detection

The early RGB-D saliency detection methods are mostly based on hand-crafted features, such as color [49], brightness [50], and texture [51]. However, these methods are unable to capture the high-level semantic information of salient objects and suffer from low confidence and low recall. Afterwards, deep convolutional neural networks (CNNs) were introduced and have shown remarkable success in RGB-D saliency detection. Zhou et al. [52] utilized multi-level deep RGB features to combine attention-guided bottom-up and top-down modules, which is able to make full use of multi-modal features. Li et al. [53] proposed an attention steered interweave fusion network to fuse cross-modal information between RGB images and corresponding depth maps at each level. These methods utilize attention modules to improve the ability of acquiring local information for salient object detection.
Figure 2:
An overview of our network. We propose a Complementary Attention and Adaptive Integration Network (CAAI-Net) with two modules, i.e., the context-aware complementary attention (CCA) module and the adaptive feature integration (AFI) module.

Some of these methods consider the spatial attention mechanism, while others use the channel attention mechanism to guide RGB-D saliency detection. In our work, we take full advantage of both attention mechanisms for improved performance.

A number of RGB-D saliency detection methods focus on the fusion of cross-modal information. Xiao et al. [54] employed a CNN-based cross-modal transfer learning framework to guide the depth domain feature extraction. Wang et al. [55] designed two-streamed convolutional neural networks to extract features and employed a switch map to adaptively fuse the predicted saliency maps. Chen [56] proposed a three-stream attention-aware multi-modal fusion network to improve the performance of saliency detection. Zhang et al. [57] proposed a probabilistic RGB-D saliency detection model, which learns from the labeled data via conditional variational autoencoders. However, these methods usually employ simple concatenation or addition operations to aggregate the RGB and depth features, which leads to unsatisfactory performance. In addition, useless information is propagated, which degrades the saliency detection accuracy.

To resolve these issues, we propose a novel fusion module to integrate cross-modal features. The proposed module utilizes weight coefficients learnt from the lower level to enhance the details of the RGB features at the current level, which generates complementary RGB information to improve the model performance. The learned coefficients are then assigned to the RGB, complementary RGB, and depth feature branches, which fuses the features adaptively with self-correction and yields improved saliency maps. Moreover, our module can improve the quality of the saliency maps and suppress interferences in complex or low-contrast scenes.
2.2. Global and Local Context

A number of studies have demonstrated that global and local information plays an important role in effective salient object detection. Wang et al. [58] proposed a global recurrent localization network, which exploits weighted contextual information to improve the accuracy of saliency detection. Liu et al. [59] exploited the fusion of global and local information under multi-level cellular automata to detect saliency, where the global saliency map is obtained using a CNN-based encoder-decoder model. Ge et al. [60] obtained local information through superpixel segmentation, saliency estimation, and multi-scale linear combination; the resulting local information is fused with the CNN-based global information. Fu et al. [36, 61] proposed a joint learning and densely cooperative fusion architecture to acquire robust salient features. Chen et al. [62] proposed a global context-aware aggregation network, where a global module is designed to generate the global context information. The resulting context information is fused across different levels to compensate for the missing information and to mitigate the dilution effect in high-level features. In this paper, local context features are acquired by a feature interaction component in the CCA module and then fed into a complementary attention component with guidance from global context information to learn more meaningful features.
2.3. Attention Mechanism

The attention mechanism stems from the fact that human vision assigns more attention to the regions of interest and suppresses the useless background information.
Recently, it has been widely applied in various computer vision tasks [63, 64]. Li et al. [65] exploited asymmetric co-attention to adaptively focus on important information from different blocks at the interweaved nodes and to improve the discriminative ability of networks. Fu et al. [66] proposed a dual attention network, including a position attention module and a channel attention module, to capture long-range contextual information and to fuse local features with global features. Zhang et al. [37] introduced a bilateral attention module to capture more useful foreground and background cues and to optimize the uncertain details between foreground and background regions. Zhang et al. [67] presented a split-attention block to enhance the quality of the learned features and to apply it across vision tasks. Noori et al. [68] adopted a multi-scale attention guided module and an attention-based multi-level integrator module to obtain more discriminative feature maps and assign different weights to multi-level feature maps. In our work, we suppress useless features and improve the accuracy of salient object detection with our CCA module, which is based on both spatial attention and channel attention.
3. Method
In this section, we provide detailed descriptions of the proposed RGB-D saliency detection model in terms of the overall network architecture and its two major components, i.e., the CCA and AFI modules. Our network exploits the relationships between global and local features, high-level and low-level features, as well as features of different modalities. In addition, the features are fused effectively according to their respective characteristics.
3.1. Network Overview

Inspired by DMRANet [45], the proposed network, CAAI-Net, considers both the global and local context information. Fig. 2 shows an overview of CAAI-Net, which is based on a two-stream structure for RGB images and depth maps. As can be observed, CAAI-Net employs similar network branches to process the depth and RGB inputs. Low-level features have rich details, but the messy background information tends to affect the detection of salient objects. In contrast, high-level features have rich semantic information, which is useful for locating the salient objects, but the details are usually missing in the high-level features [69]. According to these characteristics, we divide the five convolutional blocks of VGG-19 [70] into two parts,
of which the first two convolution blocks (Conv_1, Conv_2) are regarded as providing the low-level features and the rest (Conv_3, Conv_4, Conv_5) provide the high-level features. The high-level features are fed to our CCA module, which consists of three components (i.e., a feature interaction component, a complementary attention component, and a global-context component), to obtain abundant context information and focus more on the regions with salient objects. The feature interaction component is proposed to extract sufficient features by fusing densely interweaved local context information. The output of the feature interaction component is then fed into the complementary attention component for extracting more meaningful features with the guidance of global context information. For the low-level features, we employ spatial attention components to refine them before the feature fusion. The underlying motivation is two-fold. First, the attention mechanism has been demonstrated to be effective in improving the feature representation for capturing informative features, which is able to improve the performance effectively [63, 64]. Second, as demonstrated by visualizing the feature maps of CNNs [71, 72], the low-level features contain abundant structural details (e.g., edges), indicating rich spatial information. Therefore, spatial attention components are employed to select effective features from the low-level features. We then utilize the AFI module to fuse the extracted RGB and depth features at all levels in an adaptive manner. Finally, the fused features at the different levels are added together and then fed into the depth-induced multi-scale weighting and recurrent attention module [45] for predicting the saliency map.
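To make the two-part division concrete, the following is a minimal PyTorch sketch of how the five VGG-19 stages could be separated into low-level and high-level features. The slice indices and the use of torchvision's VGG-19 implementation are assumptions made for illustration, not the authors' exact configuration.

```python
import torch.nn as nn
from torchvision.models import vgg19


class VGG19Backbone(nn.Module):
    """Splits VGG-19 into five stages: Conv_1/Conv_2 yield the low-level
    features, Conv_3/Conv_4/Conv_5 yield the high-level features."""

    def __init__(self):
        super().__init__()
        # torchvision >= 0.13 API; slice boundaries follow its layer ordering.
        features = vgg19(weights="IMAGENET1K_V1").features
        self.conv1 = features[:4]     # Conv_1 block
        self.conv2 = features[4:9]    # Conv_2 block
        self.conv3 = features[9:18]   # Conv_3 block
        self.conv4 = features[18:27]  # Conv_4 block
        self.conv5 = features[27:36]  # Conv_5 block

    def forward(self, x):
        f1 = self.conv1(x)
        f2 = self.conv2(f1)
        f3 = self.conv3(f2)
        f4 = self.conv4(f3)
        f5 = self.conv5(f4)
        return [f1, f2], [f3, f4, f5]  # low-level and high-level features
```

In CAAI-Net, two such backbones would be instantiated, one for the RGB image and one for the depth map.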
3.2. Context-Aware Complementary Attention (CCA) Module

An overview of our CCA module is shown in Fig. 3. We detail its three major components in the following.

Extracting the local context information plays an important role in the task of RGB-D saliency detection. Previous works adopt various methods to obtain the local context information for capturing the informative features related to saliency detection. Liu et al. [73] proposed a deep spatial contextual long-term recurrent convolutional network to boost the saliency detection performance by incorporating both global and local context information. Liu et al. [59] employed a locality-constrained linear coding model to generate the local saliency map by minimizing its reconstruction errors. Liu et al. [40] proposed a pixel-wise contextual attention network to selectively focus on useful local-context information at each pixel, which can strengthen the performance of RGB-D saliency detection.

A number of works have shown that combining the features of adjacent layers can more effectively supplement mutual features. Therefore, we design the feature interaction component for the high-level features to capture the local context information across levels (see Fig. 3 (a)). To suppress complex background information, we adopt a reticular pyramid to fuse multi-scale information, which yields the enhanced features $f'_i$ with $i = 3, 4, 5$. Note that we omit the superscripts h and d for clarity. Mathematically, we define the feature interaction component as

$f^{(0,1)} = CU^{(0,1)}(f_3)$,  (1)
$f^{(1,1)} = CU^{(1,1)}(f_4 + DS(f^{(0,1)}))$,  (2)
$f^{(0,2)} = CU^{(0,2)}(f^{(0,1)} + US(f^{(1,1)}))$,  (3)
$f^{(2,1)} = CU^{(2,1)}(f_5 + DS(f^{(1,1)}))$,  (4)
$f^{(1,2)} = CU^{(1,2)}(f^{(1,1)} + US(f^{(2,1)}) + DS(f^{(0,2)}))$,  (5)
$f^{(0,3)} = CU^{(0,3)}(US(f^{(1,2)}) + f^{(0,2)})$.  (6)
Figure 3:
Illustration of the context-aware complementary attention (CCA) module. The CCA module consists of a feature interaction component, a complementary attention component, and a global-context component.

Taking Eq. (5) as an example, $f^{(1,2)}$ denotes the output of the convolution unit $CU^{(1,2)}(\cdot)$, $US(\cdot)$ is the up-sampling operation via bilinear interpolation, and $DS(\cdot)$ is the down-sampling operation. $f_i$ with $i = 3, 4, 5$ denotes the input of the $i$-th layer. We then obtain the outputs of the feature interaction component as $f'_3 = f^{(0,3)}$, $f'_4 = f^{(1,2)}$, and $f'_5 = f^{(2,1)}$. Furthermore, the pyramid can be extended to more layers, following the same principle as the three-layer structure used in this paper.

As shown in Fig. 3 (b), in order to further reduce redundant background information and locate the regions of interest, the outputs $f'_i$ from the feature interaction component are fed into the channel attention (see Fig. 4 (a)) and spatial attention (see Fig. 4 (b)) components [69]. Specifically, the features obtained from the dual attention mechanism are divided into two parts: one is the original output $S_i$, and the other is a normalized and reversed one $\omega_i$, which is regarded as the weight factor learnt from the supplementary attention for exploiting the interactive features between adjacent levels. $\omega_i$ is then multiplied with the output $S_{i+1}$ of the next level to enhance the features and to supplement the details. Note that $k$ in the SA component (see Fig. 4 (b)) is taken as 5 to obtain the required size of the output features. The first two outputs of the CCA module, $\hat{f}_i$ with $i = 3, 4$, are defined as

$\hat{f}_i = \ominus(\delta(S_i)) \odot S_{i+1}$,  (7)

where $\delta(\cdot)$ represents a Sigmoid activation function, $\odot$ denotes the Hadamard product, and $\ominus(\cdot)$ represents a reverse operation [74, 75], which subtracts the input from a matrix of all ones.

For the fifth-level features, global context information (see Fig. 3 (c)) is introduced as supplementary information to combine with the attention module, which is able to correct the location and enrich the features of salient objects. Simply adding the global and local features is not an effective solution, therefore we adopt a residual component as a rough locator to generate the global context information, i.e.,

$\hat{f}_5 = \omega_5 \odot (f'_5 + \epsilon(Conv(\epsilon(Conv(f'_5)))))$,  (8)

where $Conv(\cdot)$ denotes a 3 × 3 convolutional layer and $\epsilon(\cdot)$ denotes a ReLU activation function.
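For clarity, the sketch below expresses the feature interaction pyramid of Eqs. (1)–(6) and the reverse-attention guidance of Eq. (7) in PyTorch. The convolution-unit design (3 × 3 convolution with BN and ReLU), the shared channel width, and the use of bilinear resizing for both up- and down-sampling are assumptions introduced only to keep the sketch self-contained; `s_i` stands for the output of the dual CA/SA attention at level i.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_unit(channels):
    # Convolution unit CU(.) -- an assumed 3x3 conv + BN + ReLU design.
    return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                         nn.BatchNorm2d(channels), nn.ReLU(inplace=True))


def resize(x, ref):
    # US(.)/DS(.): bilinear resizing to the reference feature's resolution.
    return F.interpolate(x, size=ref.shape[-2:], mode="bilinear",
                         align_corners=False)


class FeatureInteraction(nn.Module):
    """Three-level reticular pyramid of Eqs. (1)-(6). Inputs f3, f4, f5 are
    assumed to be already projected to a common channel width."""

    def __init__(self, channels):
        super().__init__()
        self.cu = nn.ModuleDict({k: conv_unit(channels)
                                 for k in ["01", "11", "02", "21", "12", "03"]})

    def forward(self, f3, f4, f5):
        f01 = self.cu["01"](f3)                                          # Eq. (1)
        f11 = self.cu["11"](f4 + resize(f01, f4))                        # Eq. (2)
        f02 = self.cu["02"](f01 + resize(f11, f01))                      # Eq. (3)
        f21 = self.cu["21"](f5 + resize(f11, f5))                        # Eq. (4)
        f12 = self.cu["12"](f11 + resize(f21, f11) + resize(f02, f11))   # Eq. (5)
        f03 = self.cu["03"](resize(f12, f02) + f02)                      # Eq. (6)
        return f03, f12, f21  # f3', f4', f5'


def complementary_guidance(s_i, s_next):
    """Eq. (7): the reversed attention of level i gates the next level."""
    weight = 1.0 - torch.sigmoid(s_i)      # reverse operation on the normalized map
    return weight * resize(s_next, s_i)    # Hadamard product after resizing
```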
Figure 4:
Illustration of the channel attention (CA) and spatial attention (SA) components [69].

3.3. Adaptive Feature Integration (AFI) Module

Although RGB and depth are complementary and depth can provide unique semantic information, the depth features are not abundant in terms of structural details. If the depth information is treated equally with RGB, it may result in the degradation of model performance. Therefore, we develop the AFI module, an effective fusion module that sufficiently integrates the cross-modal features and adaptively corrects the impact of the depth features, which are of low quality but carry abundant spatial information.

As illustrated in Fig. 5, the inputs $\hat{f}^h_i$ and $\hat{f}^d_i$ with $i = 1, 2, 3, 4, 5$ represent the RGB and depth features at each layer, respectively. First, the RGB features of the lower layer are fed into a 1 × 1 convolutional layer, and the weight $n_i$ is obtained using a Sigmoid layer. Moreover, taking different receptive fields into consideration, we apply a 3 × 3 convolutional layer in the same way to obtain $m_i$. Further, these two symmetric weights are multiplied separately with the convolved and up-sampled RGB features of the current level, and the results are concatenated to form the modified RGB features $h_i$. To obtain $d_i$, $\hat{f}^d_i$ is fed into two units, each of which includes a convolutional layer followed by a PReLU activation function. The depth map usually suffers from low-quality and noise issues, therefore treating the depth and RGB features equally in the fusion leads to unsatisfactory results. To resolve this issue, we add the modified RGB features $h_i$, the depth features $d_i$, and the original RGB features $\hat{f}^h_i$ proportionally with a learned coefficient $k_i$, which is obtained from the RGB feature $\hat{f}^h_i$ through a pooling layer that reduces the feature dimension. We utilize the RGB information to guide the complementary and depth information so that the fused features provide a good representation of the multi-modal features. Finally, the output is concatenated with the depth features $\hat{f}^d_i$. Mathematically, the above procedure is defined as

$n_i = \delta(Conv(DS(\hat{f}^h_{i-1})))$,  (9)
$m_i = \delta(Conv(DS(\hat{f}^h_{i-1})))$,  (10)
$h_i = Cat(n_i \odot US(Conv(\hat{f}^h_i)),\ m_i \odot US(Conv(\hat{f}^h_i)))$,  (11)
$d_i = \theta(Conv(\theta(Conv(\hat{f}^d_i))))$,  (12)
$\hat{f}'_i = (1 - k_i)\,\hat{f}^h_i + k_i\,(h_i + d_i)/2$,  (13)
$\hat{f}''_i = Cat(\hat{f}'_i, \hat{f}^d_i)$,  (14)

where $Conv(\cdot)$ denotes a 1 × 1 convolutional layer, $Cat(\cdot)$ represents the concatenation operation, and $\theta(\cdot)$ denotes a PReLU activation function.

Figure 5:
Illustration of the adaptive feature integration (AFI) module.

Furthermore, the output $\hat{f}''_i$ is fed into a traditional residual unit to obtain the cross-modal fused feature $f_{fuse}(i)$ at each layer. Finally, the features at the different layers are added to obtain the final feature $f_{fuse}$, i.e.,

$f_{fuse} = \sum_{i=1}^{5} f_{fuse}(i)$.  (15)

Our AFI module allows the RGB and depth information to be effectively fused according to their own characteristics in order to improve the saliency detection performance.
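The following PyTorch sketch mirrors Eqs. (9)–(14) for a single level. The channel widths, the global-average-pooling-plus-Sigmoid design used to produce the coefficient $k_i$, and the resizing choices are assumptions made to keep the example runnable; they are not the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AFI(nn.Module):
    """Adaptive feature integration for one level (Eqs. (9)-(14)).
    `ch` is the channel width of the RGB/depth features at this level and
    `ch_low` that of the lower-level RGB feature (both assumed)."""

    def __init__(self, ch, ch_low):
        super().__init__()
        self.w_n = nn.Conv2d(ch_low, ch // 2, 1)             # 1x1 conv for n_i, Eq. (9)
        self.w_m = nn.Conv2d(ch_low, ch // 2, 3, padding=1)  # 3x3 conv for m_i, Eq. (10)
        self.proj = nn.Conv2d(ch, ch // 2, 1)                # Conv(.) in Eq. (11)
        self.depth = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.PReLU(),
                                   nn.Conv2d(ch, ch, 1), nn.PReLU())  # Eq. (12)
        # k_i: pooled RGB feature -> scalar fusion coefficient (assumed design).
        self.k = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                               nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_depth, f_rgb_low):
        size = f_rgb.shape[-2:]
        low = F.interpolate(f_rgb_low, size=size, mode="bilinear",
                            align_corners=False)             # DS(.) of the lower level
        n = torch.sigmoid(self.w_n(low))                     # Eq. (9)
        m = torch.sigmoid(self.w_m(low))                     # Eq. (10)
        p = self.proj(f_rgb)
        h = torch.cat([n * p, m * p], dim=1)                 # Eq. (11)
        d = self.depth(f_depth)                              # Eq. (12)
        k = self.k(f_rgb)
        fused = (1.0 - k) * f_rgb + k * (h + d) / 2.0        # Eq. (13)
        return torch.cat([fused, f_depth], dim=1)            # Eq. (14)
```

Following Eq. (15), the per-level outputs, after passing through a residual unit, would then be summed over the five levels to obtain $f_{fuse}$.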
4. Experiments
In this section, we first introduce the implementation details, datasets, and evaluation metrics. We then present the experimental results to demonstrate the effectiveness of the proposed model by comparing it with the SOTA models. Finally, we perform ablation analysis to investigate the proposed components.
Table 1:
Quantitative results on six RGB-D benchmark datasets. Nine SOTA models are involved in the evaluation. The best three results are marked with red, green and blue colors, respectively. Methods with/without "∗" are trained with either the NJUD, NLPR, and DUT-RGBD training sets or the NJUD and NLPR training sets. "↑" indicates the higher the better, while "↓" indicates the lower the better. For each dataset, the four values are $S_\alpha$↑, MAE↓, maxE↑, and maxF↑.

Methods | Years | LFSD [76] | NJUD [77] | NLPR [78] | STEREO [79] | RGBD135 [80] | DUT-RGBD [45]
MMCI [81] | PR19 | 0.787 0.132 0.839 0.771 | 0.859 0.079 0.915 0.853 | 0.856 0.059 0.913 0.815 | 0.856 0.080 0.913 0.843 | 0.848 0.065 0.928 0.822 | 0.791 0.113 0.859 0.767
TAN [56] | TIP19 | 0.801 0.111 0.847 0.796 | 0.878 0.060 0.925 0.874 | 0.886 0.041 0.941 0.863 | 0.877 0.059 0.927 0.870 | 0.858 0.046 0.910 0.827 | 0.808 0.093 0.861 0.790
CPFP [82] | CVPR19 | 0.828 0.088 0.871 0.825 | 0.878 0.053 0.923 0.877 | 0.888 0.036 0.932 0.868 | 0.879 0.051 0.925 0.874 | 0.874 0.037 0.923 0.845 | 0.818 0.076 0.859 0.795
CFGA [33] | Neucom20 | 0.802 0.097 0.858 0.804 | 0.885 0.052 0.925 0.886 | 0.907 0.030 0.948 0.890 | 0.880 0.050 0.927 0.879 | 0.779 0.061 0.869 0.709 | 0.891 0.049 0.923 0.890
ASIF [53] | CVPR20 | 0.823 0.090 0.860 0.824 | 0.889 0.047 0.927 0.888 | 0.906 0.030 0.944 0.888 | 0.879 0.049 0.927 0.878 | 0.764 0.076 0.846 0.684 | 0.838 0.073 0.876 0.821
D3Net [46] | TNNLS20 | 0.832 0.099 0.864 0.819 | 0.895 0.051 0.932 0.889 | 0.906 0.034 0.946 0.886 | 0.901 0.046 0.944 0.898 | 0.906 0.030 0.939 0.882 | 0.814 0.086 0.857 0.786
∗DMRA [45] | ICCV19 | 0.823 0.087 0.886 0.841 | 0.880 0.053 0.927 0.889 | 0.890 0.035 0.940 0.883 | 0.835 0.066 0.911 0.847 | 0.878 0.035 0.933 0.869 | 0.869 0.057 0.927 0.889
∗A2dele [44] | CVPR20 | 0.837 0.074 0.880 0.836 | 0.869 0.051 0.916 0.873 | 0.896 0.028 0.945 0.880 | 0.885 0.043 0.935 0.885 | 0.885 0.028 0.923 0.867 | 0.885 0.042 0.930 0.892
∗SSF [83] | CVPR20 | 0.859 0.066 0.900 0.866 | 0.899 0.043 0.935 0.896 | 0.888 0.035 0.934 0.864 | 0.893 0.044 0.936 0.890 | 0.905 0.025 0.941 0.883 | 0.915 0.033 0.951 0.924
Ours | – | 0.866 0.066 0.906 0.867 | 0.903 0.043 0.940 0.905 | 0.912 0.027 0.949 0.897 | 0.902 0.041 0.945 0.902 | 0.909 0.026 0.946 0.900 | 0.916 0.035 0.953 0.927
4.1. Implementation Details

The proposed model is implemented using PyTorch, and the input images for training and testing are resized to 256 × 256 before feeding into the network.
The batch size is set to 2 and the training is optimized by mini-batch stochastic gradient descent. Other parameter settings are as follows: the learning rate is set to 1e-10, the momentum is set to 0.99, and the weight decay is set to 0.0005. Our model takes 61 epochs to complete the training.
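As a reference, a minimal training loop consistent with these settings could look as follows; the model interface, the data loader, and the binary cross-entropy loss are assumptions, since the loss function is not specified here.

```python
import torch


def train(model, loader, epochs=61, device="cuda"):
    """SGD with lr = 1e-10, momentum = 0.99, weight decay = 5e-4, as reported.
    `loader` is expected to yield (rgb, depth, gt) batches of size 2."""
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-10,
                                momentum=0.99, weight_decay=5e-4)
    criterion = torch.nn.BCEWithLogitsLoss()  # loss choice is an assumption
    for _ in range(epochs):
        for rgb, depth, gt in loader:
            optimizer.zero_grad()
            pred = model(rgb.to(device), depth.to(device))
            loss = criterion(pred, gt.to(device))
            loss.backward()
            optimizer.step()
```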
4.2. Datasets

We evaluate the proposed method on six public RGB-D saliency detection benchmark datasets, which are detailed as follows. LFSD [76] includes 100 RGB-D images, whose depth maps are collected by a Lytro camera. NJUD [77] is composed of 1985 RGB images and corresponding depth images estimated from stereo images, covering various objects and complex scenes. NLPR [78] consists of 1000 RGB images and corresponding depth images captured by Kinect. STEREO [79] contains 797 stereoscopic images collected from the Internet. RGBD135 [80] contains 135 RGB-D images captured by Kinect. DUT-RGBD [45] consists of 1200 paired images captured by a Lytro camera and containing more complex real scenarios.
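For completeness, a simple loader for such RGB-D triplets could be sketched as below; the folder layout, the file naming, and the grayscale handling of depth maps are assumptions rather than the benchmarks' official structure.

```python
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class RGBDSaliencyDataset(Dataset):
    """Loads (RGB, depth, GT) triplets from parallel folders and resizes
    them to 256 x 256, matching the input size used in this work."""

    def __init__(self, root, size=256):
        self.dirs = {k: os.path.join(root, k) for k in ("RGB", "depth", "GT")}
        self.names = sorted(os.listdir(self.dirs["RGB"]))
        self.tf = transforms.Compose([transforms.Resize((size, size)),
                                      transforms.ToTensor()])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        stem = os.path.splitext(name)[0]
        rgb = self.tf(Image.open(os.path.join(self.dirs["RGB"], name)).convert("RGB"))
        depth = self.tf(Image.open(os.path.join(self.dirs["depth"], stem + ".png")).convert("L"))
        gt = self.tf(Image.open(os.path.join(self.dirs["GT"], stem + ".png")).convert("L"))
        return rgb, depth, gt
```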
4.3. Evaluation Metrics

Four evaluation metrics widely used in the field of RGB-D saliency detection are adopted in our experiments. These metrics include the Structure Measure (S-measure) [84], Mean Absolute Error (MAE) [85], E-measure [86], and F-measure [87], each of which is detailed as follows.

1) Structure Measure ($S_\alpha$) [84]: This is an evaluation metric that measures the structural similarity between the predicted saliency map and the ground-truth map. According to [84], $S_\alpha$ is defined as

$S_\alpha = (1 - \alpha) S_o + \alpha S_r$,  (16)

where $S_o$ denotes the object-aware structural similarity and $S_r$ denotes the region-aware structural similarity. Following [84], we set $\alpha = 0.5$. Note that the higher the S-measure score, the better the model performs.

2) Mean Absolute Error ($MAE$) [85]: This is a metric that directly calculates the average absolute error between the predicted saliency map and the ground truth.
$MAE$ is defined as
$MAE = \frac{1}{H \times W} \sum_{x=1}^{W} \sum_{y=1}^{H} |S(x, y) - G(x, y)|$,  (17)

where $H$ and $W$ denote the height and width of the saliency map, respectively, $S$ represents the predicted saliency map, $G$ denotes the corresponding ground truth, and $x$ and $y$ denote the coordinates of each pixel. Note that the lower the $MAE$, the better the model performance.

3) F-measure ($F_\beta$) [87]: This metric represents the weighted harmonic mean of recall and precision under a non-negative weight $\beta$. In the experiments, we use the maximum F-measure ($maxF$) to evaluate the model performance. Mathematically, $F_\beta$ is defined as

$F_\beta = \frac{(1 + \beta^2)\, Precision \times Recall}{\beta^2\, Precision + Recall}$.  (18)

Following [42], we set $\beta^2 = 0.3$. Note that the higher the F-measure score, the better the model performs.

4) E-measure [86]: E-measure is a perceptually-inspired metric and is defined as

$E = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \phi_{FM}(x, y)$,  (19)

where $\phi_{FM}$ is the enhanced alignment matrix [86]. We adopt the maximum E-measure ($maxE$) to assess the model performance. Note that the higher the E-measure score, the better the model performs.
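A compact NumPy sketch of the MAE and maxF computations defined above is given below; the 255-step threshold sweep used to obtain maxF is an assumption following common practice, and both maps are expected to be normalized to [0, 1].

```python
import numpy as np


def mae(pred, gt):
    """Eq. (17): mean absolute error between saliency map and ground truth."""
    return float(np.abs(pred - gt).mean())


def max_f_measure(pred, gt, beta_sq=0.3, steps=255):
    """Eq. (18) swept over binarization thresholds; the maximum value is maxF."""
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, steps):
        binary = pred >= t
        tp = float(np.logical_and(binary, gt_bin).sum())
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall + 1e-8)
        best = max(best, f)
    return best
```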
Figure 6:
Qualitative visual comparison of the proposed model and the state-of-the-art models. Our model yields results that are closer to the ground truth maps than the other models.
4.4. Comparison with State-of-the-Art Methods

We perform extensive experiments to compare our CAAI-Net with nine state-of-the-art RGB-D saliency detection models, including DMRA [45], CPFP [82], MMCI [81], TAN [56], CFGA [33], A2dele [44], SSF [83], ASIF-Net [53], and D3Net [46]. For a fair comparison, we adopt the results provided by the authors directly or generate the results using the open-source codes with default parameters. In addition, for models without publicly available source code, we adopt the corresponding published results. Our model is trained using the same training set as [44, 45, 83], which contains 800 samples from the DUT-RGBD, 1485 samples from the NJUD, and 700 samples from the NLPR datasets. The remaining images in these datasets and the other three datasets are used for testing.
Quantitative evaluation.
The results, shown in Table 1, indicate that CAAI-Net achieves promising performance on all six datasets and outperforms the SOTA models. Specifically, CAAI-Net sets a new SOTA in terms of $S_\alpha$, $maxF$, and $maxE$ on all datasets. In addition, it provides the best $MAE$ results on four benchmark datasets and the second best $MAE$ results on RGBD135 and DUT-RGBD. On the NLPR dataset, our model outperforms the second best with a 3.8% improvement on $maxF$. It is worth noting that CAAI-Net outperforms the SOTA models on DUT-RGBD and STEREO, which are challenging datasets with complex background information. All the quantitative results demonstrate that CAAI-Net is capable of improving the performance effectively.

Qualitative evaluation.
We further show the visual comparison of the predicted saliency maps in Fig. 6. As can be observed, CAAI-Net yields saliency maps that are close to the ground truth. In contrast, the competing methods provide unsatisfactory results that show significant differences from the ground truth. In particular, for challenging cases, such as low-quality depth, background interference, low contrast, and small objects, CAAI-Net consistently provides promising results and outperforms the competing methods significantly. Specifically, the first two rows of Fig. 6 show the results for the case of low-quality depth. Although challenging, CAAI-Net overcomes the low-quality issue and accurately detects the salient objects, especially in the regions marked by red rectangles. The next two rows show the case of similar background, where the salient object shares a similar appearance (e.g., similar colors) with the background; our model consistently provides the best performance in comparison with the competing methods. The results shown in the fifth and sixth rows indicate that CAAI-Net consistently provides the best performance in the presence of complex backgrounds. Finally, the last four rows show the results regarding low contrast and small objects. The effectiveness of our method is further confirmed by these two challenging cases.
Table 2:
Ablation study of the proposed model. The best results are bold. "↑" indicates the higher the better, while "↓" indicates the lower the better. "B" denotes the backbone module. "B+CCA" denotes the model with the backbone and the CCA module. "B+CCA+AFI" represents the model with the backbone, CCA, and AFI modules. For each dataset, the four values are $S_\alpha$↑, MAE↓, maxE↑, and maxF↑.

Methods | NJUD | STEREO | DUT-RGBD
B | 0.880 0.053 0.927 0.889 | 0.835 0.066 0.911 0.847 | 0.869 0.057 0.927 0.889
B+CCA | 0.900 0.044 0.937 0.896 | 0.898 0.044 0.943 0.897 | 0.912 0.036 – –
B+CCA+AFI | 0.903 0.043 0.940 0.905 | 0.902 0.041 0.945 0.902 | 0.916 0.035 0.953 0.927
4.5. Ablation Study

In this section, ablation experiments on three testing datasets are performed to validate the effectiveness of the proposed CCA and AFI modules.
Effectiveness of CCA module.
The results, shown in Table 2, indicate that the ablated version, B+CCA, outperforms the backbone network, B, on all datasets and evaluation metrics, demonstrating that the CCA module is an effective module for improving the performance. In particular, the CCA module significantly reduces the MAE value, indicating that the predicted saliency maps are much closer to the ground truth. The advantage of the CCA module can be attributed to its ability to locate the regions of interest more accurately. In addition, the visual results shown in Fig. 7 lead to the same conclusion as Table 2: our CCA module is effective for improving the accuracy of saliency detection.

We further investigate the effectiveness of each component of the CCA module by performing ablation studies. The results, shown in Table 3, indicate that "B+(a)" outperforms the baseline module "B" across different datasets, sufficiently demonstrating the effectiveness of our feature interaction component. The results shown in the third row of Table 3 indicate that the complementary attention component effectively improves the performance on complex scenes (i.e., STEREO and DUT-RGBD). The complementary attention component enables the model to put more emphasis on informative features while suppressing background interferences. Finally, we show the results for the full version of CCA in the fourth row of Table 3. As can be observed, the global-context component improves the performance effectively, demonstrating its advantages.
Figure 7:
Visual comparison of the ablated versions of our model. "B" denotes the backbone module. "B+CCA" denotes the model with the backbone and the CCA module. "B+CCA+AFI" denotes the model with the backbone, CCA, and AFI modules.
Table 3:
Ablation study of the CCA module. "B" denotes the baseline module without the three components, i.e., $g_i = f_i$ with $i = 3, 4, 5$. "B+(a)" denotes the module with the feature interaction component. "B+(a)+(b)" represents the module with the feature interaction and complementary attention components. "B+(a)+(b)+(c)" denotes the module with the feature interaction, complementary attention, and global-context components. For each dataset, the four values are $S_\alpha$↑, MAE↓, maxE↑, and maxF↑.

Methods | NJUD | STEREO | DUT-RGBD
B | 0.880 0.053 0.927 0.889 | 0.835 0.066 0.911 0.847 | 0.869 0.057 0.927 0.889
B+(a) | 0.898 0.048 0.935 – | – – – – | – – – –
Effectiveness of AFI module.
We then investigate the effectiveness of the AFI module. The results, shown in Table 2, indicate that the full version of our model with the AFI module outperforms the ablated version, B+CCA, in terms of all evaluation metrics. This sufficiently demonstrates the effectiveness of AFI, which is capable of adaptively fusing the multi-modal features to capture the meaningful features for accurate saliency detection. In addition, the visual results shown in Fig. 7 confirm our observation in Table 2, further demonstrating the effectiveness of the AFI module. As can be observed, the full version of our model yields saliency maps that are close to the ground truth. In contrast, B+CCA fails to provide satisfactory results, especially in the regions marked by rectangles.
Figure 8:
Failure cases of CAAI-Net in extreme scenarios.
Failure cases.

Despite its various advantages, our model may yield mis-detections in some extreme scenarios. For instance, as shown in the top row of Fig. 8, an object in the image background is recognized as the salient one by mistake. In addition, as shown in the bottom row of Fig. 8, the detection accuracy decreases when the background objects share similar appearances with the target salient object. In the future, we will consider more comprehensive scenarios and explore more effective solutions to handle these challenging saliency detection tasks.
5. Conclusion
In this paper, we have proposed a novel RGB-D saliency detection network, CAAI-Net, which extracts and fuses multi-modal features effectively for accurate saliency detection. Our CAAI-Net first utilizes the CCA module to extract informative features highly related to the saliency detection. The resulting features are then fed to our AFI module, which adaptively fuses the cross-modal features according to their contributions to the saliency detection. Extensive experiments on six widely-used benchmark datasets demonstrate that CAAI-Net is an effective RGB-D saliency detection model and outperforms cutting-edge models, both qualitatively and quantitatively.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References [1] A. Borji, M.-M. Cheng, Q. Hou, H. Jiang, J. Li, Salient object de-tection: A survey, Computational Visual Media (CVM) 5 (2) (2019)117–150.[2] A. Borji, M.-M. Cheng, H. Jiang, J. Li, Salient Object Detection: ABenchmark, IEEE Transactions on Image Processing (TIP) 24 (12)(2015) 5706–5722.[3] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, S.-M. Hu, Globalcontrast based salient region detection, IEEE Transactions on PatternAnalysis and Machine Intelligence (TPAMI) 37 (3) (2015) 569–582.[4] J.-X. Zhao, J.-J. Liu, D.-P. Fan, Y. Cao, J. Yang, M.-M. Cheng, EG-Net: Edge guidance network for salient object detection, in: IEEE In-ternational Conference on Computer Vision (ICCV), 2019, pp. 8779–8788.[5] K. Fu, Q. Zhao, I. Y. Gu, J. Yang, Deepside: A general deep frame-work for salient object detection, Neurocomputing 356 (2019) 69–82.[6] J. Su, J. Li, Y. Zhang, C. Xia, Y. Tian, Selectivity or Invariance:Boundary-Aware Salient Object Detection, in: IEEE InternationalConference on Computer Vision (ICCV), IEEE, 2019, pp. 3798–3807.[7] Z. Liu, Q. Li, W. Li, Deep layer guided network for salient objectdetection, Neurocomputing 372 (2020) 55–63.[8] W. Wang, J. Shen, M.-M. Cheng, L. Shao, An iterative and coopera-tive top-down and bottom-up inference network for salient object de-tection, in: IEEE Conference on Computer Vision and Pattern Recog-nition (CVPR), 2019, pp. 5968–5977.[9] W. Wang, J. Shen, X. Dong, A. Borji, R. Yang, Inferring salient ob-jects from human fixations, IEEE Transactions on Pattern Analysisand Machine Intelligence (TPAMI) 42 (8) (2020) 1913–1927.[10] B. Dong, Y. Zhou, C. Hu, K. Fu, G. Chen, BCNet: Bidirectional col-laboration network for edge-guided salient object detection, Neuro-computing.[11] A. Serban, E. Poll, J. Visser, Adversarial Examples on Object Recog-nition: A Comprehensive Survey, ACM Computing Surveys 53 (3)(2020) 1–38.[12] N. Ibrahim, M. R. Tomari, W. N. W. Zakaria, Analysis of minimumface video duration and the effect of video compression to image-based non-contact heart rate monitoring system, Bulletin of ElectricalEngineering and Informatics 9 (1) (2020) 403–410.[13] Y. Gao, M. Wang, D. Tao, R. Ji, Q. Dai, 3-D Object Retrieval andRecognition With Hypergraph Analysis, IEEE Transactions on ImageProcessing (TIP) 21 (9) (2012) 4290–4303.[14] K. Mari, P. Anandababu, Quadhistogram with local texton XOR pat-tern based feature extraxtion for content based image retrieval system,The International journal of analytical and experimental modal anal-ysis XII (II) (2020) 1966–1986.[15] T. G. Bayrock, R. N. Hull, B. Wuest, Image redirection and opticalpath folding, uS Patent 6,353,657 (Mar. 5 2002).[16] L. Grady, Random Walks for Image Segmentation, IEEE Transactionson Pattern Analysis and Machine Intelligence (TPAMI) 28 (2006)1768–1783.[17] Z. Hu, G. Feng, J. Sun, L. Zhang, H. Lu, Bi-directional relationshipinferring network for referring image segmentation, in: IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR), 2020, pp.4424–4433.[18] S. Osher, L. I. Rudin, Feature-Oriented Image Enhancement UsingShock Filters, Siam Journal on Numerical Analysis 27 (4) (1990)919–940.[19] J. Bansiya, C. G. Davis, A hierarchical model for object-oriented de-sign quality assessment, IEEE Transactions on Software Engineering(TSE) 28 (1) (2002) p.4–17.[20] W. Wang, J. Shen, L. Shao, Video Salient Object Detection via FullyConvolutional Networks, IEEE Transactions on Image Processing(TIP) 27 (1) (2018) 38–49.[21] C. Chen, S. Li, Y. Wang, H. Qin, A. 
Hao, Video Saliency Detec-tion via Spatial-Temporal Fusion and Low-Rank Coherency Diffu-sion, IEEE Transactions on Image Processing (TIP) 26 (7) (2017)3156–3170.
Page 10 of 12ongbo Bi et al./Elsevier xxx (xxxx) xxx [22] C. Chen, Y. Li, S. Li, H. Qin, A. Hao, A novel bottom-up saliencydetection method for video with dynamic background, IEEE SignalProcessing Letters (SPL) 25 (2) (2018) 154–158.[23] W. Wang, J. Shen, L. Shao, Consistent Video Saliency Using LocalGradient Flow Optimization and Global Refinement, IEEE Transac-tions on Image Processing (TIP) 24 (11) (2015) 4185–4196.[24] C. Chen, G. Wang, C. Peng, X. Zhang, H. Qin, Improved RobustVideo Saliency Detection Based on Long-Term Spatial-Temporal In-formation, IEEE Transactions on Image Processing (TIP) 29 (2020)1090–1100.[25] W. Wang, J. Shen, F. Guo, M.-M. Cheng, A. Borji, Revisiting VideoSaliency: A Large-Scale Benchmark and a New Model, in: IEEEConference on Computer Vision and Pattern Recognition (CVPR),IEEE Computer Society, 2018, pp. 4894–4903.[26] W. Wang, J. Shen, R. Yang, F. Porikli, Saliency-aware video objectsegmentation, IEEE Transactions on Pattern Analysis and MachineIntelligence (TPAMI) 40 (1) (2018) 20–33.[27] H. Bi, K. Wang, D. Lu, C. Wu, W. Wang, L. Yang, 𝐶 Net: a comple-mentary co-saliency detection network, The Visual Computer (VC).[28] D.-P. Fan, T. Li, Z. Lin, G.-P. Ji, D. Zhang, M.-M. Cheng, H. Fu,J. Shen, Re-thinking Co-Salient Object Detection, arXiv preprintarXiv:2007.03380.[29] W. Wang, J. Shen, Y. Yu, K.-L. Ma, Stereoscopic Thumbnail Cre-ation via Efficient Stereo Saliency Detection, IEEE Transactions onVisualization & Computer Graphics 23 (8) (2017) 2014–2027.[30] G. Li, Y. Yu, Visual saliency based on multiscale deep features,in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), IEEE Computer Society, 2015, pp. 5455–5463.[31] Y. Ding, Z. Liu, M. Huang, R. Shi, X. Wang, Depth-aware saliency de-tection using convolutional neural networks, Journal of Visual Com-munication and Image Representation (VCIR) 61 (2019) 1–9.[32] C. Chen, J. Wei, C. Peng, W. Zhang, H. Qin, Improved saliency de-tection in RGB-D images using two-phase depth estimation and selec-tive deep fusion, IEEE Transactions on Image Processing 29 (2020)4296–4307.[33] Z. Liu, W. Zhang, P. Zhao, A cross-modal adaptive gated fusion gen-erative adversarial network for RGB-D salient object detection, Neu-rocomputing 387 (2020) 210–220.[34] G. Li, Z. Liu, H. Ling, ICNet: Information Conversion Network forRGB-D Based Salient Object Detection, IEEE Transactions on ImageProcessing (TIP) 29 (2020) 4873–4884.[35] Y. Zhai, D.-P. Fan, J. Yang, A. Borji, L. Shao, J. Han, L. Wang, Bi-furcated backbone strategy for rgb-d salient object detection, arXive-prints (2020) arXiv–2007.[36] K. F. Fu, D.-P. Fan, G.-P. Ji, Q. Zhao, JL-DCF: Joint Learning andDensely-Cooperative Fusion Framework for RGB-D Salient ObjectDetection, in: IEEE Conference on Computer Vision and PatternRecognition (CVPR), 2020, pp. 3052–3062.[37] Z. Zhang, Z. Lin, J. Xu, W. Jin, S.-P. Lu, D.-P. Fan, Bilateral attentionnetwork for RGB-D salient object detection, CoRR abs/2004.14582. arXiv:2004.14582 .[38] Q. Chen, K. Fu, Z. Liu, G. Chen, H. Du, B. Qiu, L. Shao, EF-Net: Anovel enhancement and fusion network for RGB-D saliency detection,Pattern Recognition.[39] Z. Huang, H.-X. Chen, T. Zhou, Y.-Z. Yang, C.-Y. Wang, Multi-levelcross-modal interaction network for RGB-D salient object detection,arXiv preprint arXiv:2007.14352.[40] N. Liu, J. Han, M.-H. Yang, PiCANet: Pixel-wise Contextual Atten-tion Learning for Accurate Saliency Detection, IEEE Transactions onImage Processing (TIP) PP (99) (2020) 1–1.[41] W. Wang, S. Zhao, J. Shen, S. C. H. 
Hoi, A. Borji, Salient objectdetection with pyramid attention and salient edges, in: IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR), 2019, pp.1448–1457.[42] P. Zhang, W. Dong, H. Lu, H. Wang, R. Xiang, Amulet: AggregatingMulti-level Convolutional Features for Salient Object Detection, in:IEEE International Conference on Computer Vision (ICCV), IEEEComputer Society, 2017, pp. 202–211. [43] Z. Liu, Q. Duan, S. Shi, P. Zhao, Multi-level progressive parallel at-tention guided salient object detection for RGB-D imfages, The VisualComputer (VC) (2020) 1–12.[44] Y. Piao, Z. Rong, M. Zhang, W. Ren, H. Lu, A2dele: Adaptive and At-tentive Depth Distiller for Efficient RGB-D Salient Object Detection,in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2020, pp. 9060–9069.[45] Y. Piao, W. Ji, J. Li, M. Zhang, H. Lu, Depth-Induced Multi-ScaleRecurrent Attention Network for Saliency Detection, in: IEEE Inter-national Conference on Computer Vision (ICCV), 2019, pp. 7253–7262.[46] D.-P. Fan, Z. Lin, Z. Zhang, M. Zhu, M.-M. Cheng, Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-ScaleBenchmarks, IEEE Transactions on Neural Networks and LearningSystems (TNNLS).[47] T. Zhou, D.-P. Fan, M.-M. Cheng, J. Shen, L. Shao, RGB-D SalientObject Detection: A Survey, in: Computational Visual Media(CVM), Springer, 2020, pp. 1–33.[48] W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, Salient Object Detec-tion in the Deep Learning Era: An In-Depth Survey, in: CoRR, Vol.abs/1904.09146, 2019, pp. 1–19.[49] Borji, Ali, What is a salient object? a dataset and a baseline modelfor salient object detection, IEEE Transactions on Image Processing(TIP) 24 (2) (2015) 742–756.[50] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, S. Li, Salient ob-ject detection: A discriminative regional feature integration approach,in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), IEEE Computer Society, 2013, pp. 2083–2090.[51] C. Yang, L. Zhang, H. Lu, X. Ruan, M. H. Yang, Saliency detec-tion via graph-based manifold ranking, in: IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR), IEEE Computer Soci-ety, 2013, pp. 3166–3173.[52] X. Zhou, G. Li, C. Gong, Z. Liu, J. Zhang, Attention-guided RGBDsaliency detection using appearance information, IEEE InternationalConference on Image, Vision and Computing (ICIVC) 95 (2020)103888.[53] C. Li, R. Cong, S. Kwong, J. Hou, Q. Huang, ASIF-Net: AttentionSteered Interweave Fusion Network for RGB-D Salient Object Detec-tion, IEEE Transactions on Cybernetics (TC) PP (99) (2020) 1–13.[54] F. Xiao, B. Li, Y. Peng, C. Cao, K. Hu, X. Gao, Multi-Modal WeightsSharing and Hierarchical Feature Fusion for RGBD Salient ObjectDetection, IEEE Access 8 (2020) 26602–26611.[55] N. Wang, X. Gong, Adaptive fusion for RGB-D salient object detec-tion, IEEE Access 7 (2019) 55277–55284.[56] H. Chen, Y. Li, Three-Stream Attention-Aware Network for RGB-D Salient Object Detection, IEEE Transactions on Image Processing(TIP) 28 (6) (2019) 2825–2835.[57] J. Zhang, D.-P. Fan, Y. Dai, S. Anwar, F. S. Saleh, T. Zhang,N. Barnes, UC-Net: Uncertainty Inspired RGB-D Saliency Detectionvia Conditional Variational Autoencoders, in: IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2020, pp. 8582–8591.[58] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, A. Borji,Detect Globally, Refine Locally: A Novel Approach to Saliency De-tection, in: IEEE Conference on Computer Vision and Pattern Recog-nition (CVPR), IEEE Computer Society, 2018, pp. 
3127–3135.[59] Y. Liu, P. Yuan, Saliency detection using global and local informa-tion under multilayer cellular automata, IEEE Access 7 (99) (2019)72736–72748.[60] M. Ge, R. Ji, Y. Wu, Saliency detection based on local and globalinformation fusion, in: IEEE International Conference on Image, Vi-sion and Computing (ICIVC), IEEE, 2019, pp. 612–616.[61] K. Fu, D.-P. Fan, G.-P. Ji, Q. Zhao, J. Shen, C. Zhu, Siamese net-work for rgb-d salient object detection and beyond, arXiv preprintarXiv:2008.12134.[62] Z. Chen, Q. Xu, R. Cong, Q. Huang, Global context-aware progres-sive aggregation network for salient object detection, Proceedings ofthe AAAI Conference on Artificial Intelligence (AAAI) 34 (7) (2020)
Page 11 of 12ongbo Bi et al./Elsevier xxx (xxxx) xxx arXiv:2004.08955 .[68] M. Noori, S. Mohammadi, S. G. Majelan, A. Bahri, M. Havaei,DFNet: Discriminative feature extraction and integration network forsalient object detection, Engineering Applications of Artificial Intel-ligence (EAAI) 89 (2020) 103419.[69] T. Zhao, X. Wu, Pyramid feature attention network for saliency detec-tion, in: IEEE Conference on Computer Vision and Pattern Recogni-tion, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019,pp. 3085–3094.[70] K. Simonyan, A. Zisserman, Very deep convolutional networks forlarge-scale image recognition, International Conference on LearningRepresentations (ICLR).[71] Z. J. Wang, R. Turko, O. Shaikh, H. Park, N. Das, F. Hohman,M. Kahng, D. H. Chau, CNN explainer: Learning convolutional neu-ral networks with interactive visualization, CoRR abs/2004.15004.[72] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutionalnetworks, in: D. J. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.),Proceedings of the European conference on computer vision (ECCV),Vol. 8689 of Lecture Notes in Computer Science, Springer, 2014, pp.818–833.[73] N. Liu, J. Han, A deep spatial contextual long-term recurrent convo-lutional network for saliency detection, IEEE Transactions on ImageProcessing (TIP) (2018) 3264.[74] S. Chen, X. Tan, B. Wang, X. Hu, Reverse attention for salient objectdetection, in: Proceedings of the European Conference on ComputerVision (ECCV), 2018, pp. 234–250.[75] D.-P. Fan, T. Zhou, G.-P. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, L. Shao,Inf-Net: Automatic COVID-19 lung infection segmentation from CTimages, IEEE Transactions on Medical Imaging 39 (8) (2020) 2626–2637.[76] J. Zhang, M. Wang, L. Lin, X. Yang, J. Gao, Y. Rui, Saliency De-tection on Light Field: A Multi-Cue Approach, Acm Transactionson Multimedia Computing Communications & Applications 13 (3)(2017) 32.1–32.22.[77] J. Ran, G. Ling, W. Geng, T. Ren, G. Wu, Depth saliency based onanisotropic center-surround difference, in: IEEE International Con-ference on Image Processing (ICIP), 2015, pp. 1115–1119.[78] H. Peng, B. Li, W. Xiong, W. Hu, R. Ji, RGBD salient object de-tection: A benchmark and algorithms, in: D. J. Fleet, T. Pajdla,B. Schiele, T. Tuytelaars (Eds.), Proceedings of the European con-ference on computer vision (ECCV), Vol. 8691 of Lecture Notes inComputer Science, Springer, 2014, pp. 92–109.[79] Y. Niu, Y. Geng, X. Li, L. Feng, Leveraging Stereopsis for SaliencyAnalysis, in: IEEE Conference on Computer Vision and PatternRecognition (CVPR), IEEE Computer Society, 2012, pp. 454–461.[80] Y. Cheng, H. Fu, X. Wei, J. Xiao, X. Cao, Depth Enhanced SaliencyDetection Method, in: H. Wang, L. Davis, W. Zhu, S. Kopf, Y. Qu,J. Yu, J. Sang, T. Mei (Eds.), International Conference on InternetMultimedia Computing and Service (ICIMCS), ACM, 2014, p. 23.[81] C. Hao, L. Youfu, S. Dan, Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient ob-ject detection, Pattern Recognition (PR) 86 (2019) 376–385. [82] J.-X. Zhao, Y. Cao, D.-P. Fan, M.-M. Cheng, X.-Y. Li, L. Zhang, Con-trast prior and fluid pyramid integration for RGBD salient object de-tection, in: IEEE Conference on Computer Vision and Pattern Recog-nition (CVPR), 2019, pp. 3927–3936.[83] M. Zhang, W. Ren, Y. Piao, Z. Rong, H. Lu, Select, Supplement andFocus for RGB-D Saliency Detection, in: IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR), 2020, pp. 3472–3481.[84] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, A. 
Borji, Structure-measure:A New Way to Evaluate Foreground Maps, in: IEEE InternationalConference on Computer Vision (ICCV), IEEE Computer Society,2017, pp. 4558–4567.[85] A. Borji, D. N. Sihite, L. Itti, Salient object detection: A benchmark,in: A. W. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid(Eds.), Proceedings of the European conference on computer vision(ECCV), Vol. 7573 of Lecture Notes in Computer Science, Springer,2012, pp. 414–429.[86] D.-P. Fan, C. Gong, Y. Cao, B. Ren, M.-M. Cheng, A. Borji,Enhanced-alignment measure for binary foreground map evaluation,in: Proceedings of the Twenty-Ninth International Joint Conferenceon Artificial Intelligence (IJCAI), 2018, pp. 698–704.[87] P. Arbelaez, M. Maire, C. C. Fowlkes, J. Malik, Contour Detectionand Hierarchical Image Segmentation, IEEE Transactions on PatternAnalysis and Machine Intelligence (TPAMI) 33 (5) (2011) 898–916.