RGBT Salient Object Detection: A Large-scale Dataset and Benchmark
Zhengzheng Tu, Yan Ma, Zhun Li, Chenglong Li, Jieming Xu, Yongtao Liu
Z. Tu, Y. Ma, Z. Li, C. Li, J. Xu, and Y. Liu are with the Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, Anhui University, Hefei 230601, China. Email: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]. C. Li is also with the Institute of Physical Science and Information Technology, Anhui University, Hefei 230601, China. (Corresponding author: Chenglong Li.) This research is jointly supported by the National Natural Science Foundation of China (No. 61602006, 61702002, 61872005, 61860206004), the Natural Science Foundation of Anhui Province (1808085QF187), and the Open Fund for Discipline Construction, Institute of Physical Science and Information Technology, Anhui University.
Abstract—Salient object detection in complex scenes and environments is a challenging research topic. Most works focus on RGB-based salient object detection, which limits performance in real-life applications under adverse conditions such as dark environments and complex backgrounds. Taking advantage of both RGB and thermal infrared images has recently become a new direction for detecting salient objects in complex scenes, as thermal infrared imaging provides complementary information and has been applied to many computer vision tasks. However, current research on RGBT salient object detection is limited by the lack of a large-scale dataset and a comprehensive benchmark. This work contributes such an RGBT image dataset named VT5000, including 5000 spatially aligned RGBT image pairs with ground truth annotations. VT5000 covers 11 challenges collected in different scenes and environments for exploring the robustness of algorithms. With this dataset, we propose a powerful baseline approach, which extracts multi-level features within each modality and aggregates the features of all modalities with an attention mechanism, for accurate RGBT salient object detection. Extensive experiments show that the proposed baseline approach outperforms the state-of-the-art methods on the VT5000 dataset and two other public datasets. In addition, we carry out a comprehensive analysis of different algorithms for RGBT salient object detection on the VT5000 dataset, draw several valuable conclusions, and point out potential research directions for RGBT salient object detection. Our new VT5000 dataset is made publicly available at https://pan.baidu.com/s/1O5TC-5sEya8N2EGm-xJ5mw (password: o57e).
Index Terms—Salient object detection, attention, VT5000 dataset.
I. INTRODUCTION
Salient object detection aims to find the objects in an image that human eyes pay the most attention to. It has been extensively studied over the past decade, but still faces many challenges in complex environments; for example, when the appearance of an object is similar to its surroundings, salient object detection algorithms operating on RGB images often perform poorly. Research on adopting additional modalities to assist salient object detection has therefore attracted more and more attention. Many works [1], [2] have achieved good results by combining RGB images with depth information. However, depth imaging has its limitations: for example, when the object is perpendicular to the line of sight of the depth camera, the depth values within the same object differ, since depth is computed from distance, which makes salient object detection harder. Integrating RGB and thermal infrared (RGBT) data has also shown its effectiveness in several computer vision tasks, such as moving object detection, person re-identification, and visual tracking [3], [4], [5]. The imaging principle of a thermal infrared camera is based on thermal radiation from the object surface, and the thermal radiation from different parts of the surface is almost the same; therefore, the imaging of a salient object in a thermal infrared camera is usually uniform. Furthermore, thermal infrared information can assist salient object detection, as objects are salient in thermal infrared images in most cases even when the background is cluttered in the RGB images. Moreover, thermal infrared imaging is not influenced by bad weather or low illumination.

Recently, RGBT salient object detection has become attractive. The first work on RGBT salient object detection [5] proposes a multi-task manifold ranking algorithm for RGBT image saliency detection and creates a unified RGBT dataset called VT821. However, this first RGBT dataset has several limitations: (1) the RGB and thermal imaging parameters are completely different, so there might be some alignment errors; (2) aligning the images of the two modalities introduces black background regions (which are actually noise) into the images; (3) most scenarios are very simple, so the dataset is not very challenging. The second important work on RGBT image saliency detection [6] contributes a more challenging dataset named VT1000 and proposes a novel collaborative graph learning algorithm. Compared with VT821, the VT1000 dataset has its advantages but also several limitations: (1) as RGB and thermal infrared imaging have different sighting distances, the thermal infrared image and the visible light image look different and need to be aligned, as shown in the left image pair of Fig. 3; (2) the RGB and thermal infrared images are still not automatically aligned, which inevitably introduces errors in the process of manually aligning them; (3) although VT1000 is larger than VT821, the complexity and diversity of the scenes have not been greatly improved.

In this paper, we construct a more comprehensive benchmark for RGBT salient object detection driven by the demands of large scale, good resolution, high diversity and low deviation. Firstly, existing RGBT datasets are not big enough for training a good deep network, so we collected 5000 pairs of RGB and thermal images in different environments; each pair of RGBT images is automatically aligned and has its ground truth annotation.
Fig. 1. The challenge distribution of VT821, VT1000 and VT5000.

Secondly, as most backgrounds or scenes in existing datasets are simple, our dataset considers different sizes, categories, surroundings, imaging qualities and spatial locations of salient objects, and we also give statistics showing the diversity of objects. For instance, we have more images with thermal crossover, which is considered a big challenge. To analyze the sensitivity of different methods to various challenges, we annotate 11 different challenges based on the above factors. Thirdly, we annotate not only challenge attributes but also the imaging quality of objects in the dataset, since annotations of object imaging quality provide labels for weakly supervised RGBT salient object detection as future work. Comparisons of our VT5000 with VT821 and VT1000 on the challenge distributions are shown in Fig. 1. In addition, some challenging RGB and thermal infrared images in our dataset and the corresponding ground truths are shown in Fig. 2.

To provide a powerful baseline for RGBT salient object detection, we design an end-to-end trainable CNN-based framework. Specifically, a two-stream CNN architecture employs VGG16 [7] as the backbone network to extract multi-scale RGB and thermal infrared features separately. To obtain task-related features, we use the channel-wise and spatial-wise attention based Convolutional Block Attention Module (CBAM) [8] to selectively collect features from the RGB and thermal infrared branches. Then we fuse the RGB and thermal infrared features by element-wise addition, and pass the merged feature from the first convolutional block of VGG16 to the next convolutional block. To obtain global guidance information, we feed the fused RGB and thermal infrared features from the last convolutional block into the Pyramid Pooling Module (PPM) [9], where adaptive average pooling captures global context information and thus yields a good localization of salient objects. To make better use of the characteristics of different layers, we upsample the features produced by each block of VGG16 with different sampling rates and combine them with the features processed by the PPM. We also utilize the Feature Aggregation Module (FAM) [9] after feature fusion to capture local context information.

As far as we know, we are the first to propose an end-to-end deep learning method for RGBT salient object detection. In summary, the main contributions of this work are as follows:

• We create a large-scale RGBT dataset containing 5000 pairs of RGB and thermal images for salient object detection, with manually labeled ground truth annotations. We hope that this dataset will promote research progress of deep learning techniques on RGBT salient object detection. The dataset with all annotated information will be released for free academic usage.

• We propose a novel deep CNN architecture to provide a powerful baseline approach for RGBT salient object detection. In particular, we utilize a Convolutional Block Attention Module (CBAM) to selectively collect features from the RGB and thermal infrared branches, which increases the receptive field of the convolution layers and focuses on important regions with multi-scale information.
• Extensive experiments show that the designed approach outperforms the state-of-the-art methods on the VT5000 dataset and two other public datasets, i.e., VT821 and VT1000. In addition, a comprehensive analysis of different algorithms for RGBT salient object detection is performed on the VT5000 dataset. Through this analysis, we draw several valuable conclusions and provide some potential research directions for RGBT salient object detection.

II. RELATED WORK
A. Multi-modal Salient Object Detection Datasets
With the emergence of multi-modal data, RGBD salient object detection (SOD) has been proposed, and related RGBD datasets have been constructed. More specifically, the GIT [1] and LFSD [10] datasets are designed for specific purposes, for example, generic object segmentation based on saliency maps, or saliency detection in the light field. Subsequently, Li et al. [5] construct the first RGBT dataset, VT821, with 821 pairs of RGBT images. Tu et al. [6] contribute a more challenging dataset, VT1000, for RGBT image saliency detection.
B. Attention Mechanism
First proposed by Bahdanau et al. [11] for neural machine translation, attention mechanisms in deep neural networks have recently been studied widely. Attention mechanisms have proven useful in many tasks, such as scene recognition [12], [13], question answering [14], caption generation [15] and pose estimation [16]. Chu et al. [16] propose a network based on a multi-context attention mechanism and apply it to an end-to-end framework for pose estimation. Zhang et al. [17] propose a progressive attention guidance network, which generates attention features successively through channel and spatial attention mechanisms for salient object detection. In PiCANet [39], Liu et al. propose a novel pixel-wise contextual attention network, which uses an attention mechanism similar to ours. Specifically, the network generates an attention map from the contextual information of each pixel. With the learned attention map, the network selectively incorporates the features of useful contextual locations, so that informative contextual features can be constructed. The pixel-wise contextual attention is then embedded into the pooling and convolution operations to bring in global or local contextual information.

Since it performs well at feature selection, the attention mechanism is well suited to salient object detection. Some methods adopt effective strategies such as progressive attention [17] and gate functions [18]. Inspired by the above, we utilize a lightweight and general attention module [8], which decomposes the learning process into channel-wise attention and spatial-wise attention. This separate attention generation process for the 3D feature map has far fewer parameters and lower computational cost.

Moreover, attention enhances the ability of feature representation in two ways. Different channels extract different semantic information, and the channel-wise attention mechanism assigns larger weights to the channels that respond strongly to the salient object. Some background details are inevitably introduced when the saliency map is generated from low-level features; taking advantage of high-level features, the spatial-wise attention mechanism removes some background and thus highlights the foreground area, which benefits salient object detection.
C. Multi-modal SOD Methods
In recent years, with the popularity of thermal sensors, integrating RGB and thermal infrared data has been applied to many computer vision tasks [19], [20], [21], [5], [22]. In addition to RGBT SOD, there are many methods that adopt different modalities to obtain multiple cues for better detection, such as RGBD SOD, which takes depth and RGB images as input. To combine multiple modalities well, many RGBD methods utilize the stronger modality to assist the other. For example, Qu et al. [23] design a novel network to automatically learn the interaction mechanism for RGBD salient object detection. Han et al. [24] design a "two-stream" architecture that combines the depth representation to make a collaborative decision through a joint fully connected layer. These works combine RGB with depth images to boost salient object detection, but not all RGBD methods are suitable for RGBT salient object detection. Compared with a thermal infrared camera, depth imaging has the limitation that objects at the same distance from the camera have the same gray level, which is an obvious weakness of RGBD SOD. In addition, RGB imaging is usually affected by various bad illuminations or weathers. To avoid these problems, more and more research focuses on adopting RGB and thermal infrared images together. For example, Wang et al. [5] propose a multi-task manifold ranking algorithm for RGBT image saliency detection and at the same time build a unified RGBT image dataset. Tu et al. [6] propose an effective RGBT saliency detection method that takes superpixels as graph nodes and uses hierarchical deep features to learn the graph affinity and node saliency in a unified optimization framework. With the benchmark of [5], Tang et al. [25] propose a novel approach based on a cooperative ranking algorithm for RGBT SOD; they introduce a weight for each modality to describe its reliability and design an efficient solver for the multiple subproblems. All of the above methods are based on traditional models or hand-crafted use of deep features, and they are time-consuming. Ours is the first end-to-end deep network for RGBT salient object detection.

III. VT5000 BENCHMARK
In this work, to promote research on RGBT salient object detection (SOD) and considering the insufficiency of existing data, we captured 5000 pairs of RGBT images. In this section, we introduce our new dataset.
A. Capture Platform
The equipment used to collect RGB and thermal infrared images consists of FLIR (Forward Looking Infrared) T640 and T610 cameras, as shown in Fig. 3, each equipped with a thermal infrared camera and a CCD camera. The two cameras have the same imaging parameters, so we do not need to manually align the RGB and thermal infrared images one by one, which reduces errors from manual alignment.
B. Data Annotation
To evaluate RGBT SOD algorithms comprehensively, after collecting more than 5500 pairs of RGB images and corresponding thermal infrared images, we first select 5500 pairs of RGBT images that are as different from each other as possible, where each pair contains one or more salient objects. Similar to many popular SOD datasets [26], [27], [28], [29], [30], we ask six people to choose, for the same image, the most salient objects they see at first sight. Because different people might look at different objects in the same image, the 5000 pairs of RGBT images on which the selections agree are finally retained. Finally, we use Adobe Photoshop to manually segment the salient objects in each image to obtain pixel-level ground truth masks.
C. Dataset Statistics
The image pairs in our dataset are recorded in different places and environmental conditions; moreover, our dataset covers different illuminations, categories, sizes, positions and quantities of objects, as well as different backgrounds, etc. Specifically, the following main aspects are considered when creating the VT5000 dataset.
Size of Object:
We define the size of a salient object as the ratio of the number of pixels in the salient object to the total number of pixels in the image. If this ratio is greater than 0.26, the object is regarded as a big salient object; otherwise it is not.
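For concreteness, this ratio can be computed directly from a binary ground-truth mask; a minimal sketch (the function name is ours, not part of the released toolkit):

```python
import numpy as np

def object_size_ratio(mask: np.ndarray) -> float:
    """Ratio of salient-object pixels to all pixels in a binary mask."""
    return float((mask > 0).sum()) / mask.size

# An object with ratio > 0.26 counts as a big salient object (BSO);
# the small-object (SSO) threshold of 0.05 is given later in this section.
```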
Illumination Condition:
We capture image pairs under different lighting conditions (e.g., low illumination, sunny or cloudy). Low illumination and illumination variation usually pose great challenges to visible light images.
Center Bias:
Previous studies on visual saliency show that center bias has been identified as one of the most significant biases in saliency datasets [31]; it refers to the phenomenon that people pay more attention to the center of the screen [32]. As described in [33], the degree of center bias cannot be described by simply overlapping all the maps in the dataset.
Fig. 2. Sample image pairs with annotated ground truths and challenges from our RGBT dataset. (a) and (b) show the RGB and thermal images; (c) shows the corresponding ground truth of the RGBT image pairs.

TABLE I. Distribution of attributes and imaging quality in the VT5000 dataset, showing the number of co-occurring attributes across all RGBT image pairs. The last two rows and columns indicate poor imaging quality in the RGB and T modalities respectively, due to low light, out of focus, thermal crossover, etc.

Challenge  BSO  CB   CIB  IC   LI   MSO  OF   SSO  SA   TC   BW   RGB  T
BSO        -    371  590  446  211  159  96   3    99   244  66   138  206
CB         371  -    388  286  113  224  62   88   65   177  45   78   151
CIB        590  388  -    292  112  114  49   13   70   148  76   70   123
IC         446  286  292  -    66   80   54   38   46   193  57   62   160
LI         211  113  112  66   -    51   89   22   71   83   23   188  70
MSO        159  224  114  80   51   -    35   43   48   83   27   42   69
OF         96   62   49   54   89   35   -    21   18   73   20   177  65
SSO        3    88   13   38   22   43   21   -    37   77   6    20   66
SA         99   65   70   46   71   48   18   37   -    76   16   74   64
TC         244  177  148  193  83   83   73   77   76   -    32   90   629
BW         66   45   76   57   23   27   20   6    16   32   -    28   40
RGB        138  78   70   62   188  42   177  20   74   90   28   -    92
T          206  151  123  160  70   69   65   66   64   629  40   92   -
Number of Salient Objects:
An image is said to contain multiple salient objects when the number of salient objects in it is greater than one. We find that images in existing RGBT SOD datasets contain few salient objects. In the VT5000 dataset, we capture 3 to 6 salient objects in an image for the challenge of multiple salient objects.
Background Factor:
We take two background-related factors into consideration. Firstly, it is a big challenge when the temperature or appearance of the background is similar to that of the salient object. Secondly, it is difficult to accurately separate salient objects from a cluttered background. Considering the above factors, together with the challenges in existing RGBT SOD datasets [5], [6], we annotate 11 challenges for testing different algorithms: big salient object (BSO), small salient object (SSO), multiple salient objects (MSO), low illumination (LI), center bias (CB), cross image boundary (CIB), similar appearance (SA), thermal crossover (TC), image clutter (IC), out of focus (OF) and bad weather (BW). Descriptions of these challenges are as follows:
BSO: the size of an object is the ratio of the number of pixels in the salient object to the total number of pixels in the image; a big salient object has a size over 0.26.
SSO: the size of a small salient object is smaller than 0.05.
LI: the images are collected on cloudy days or at night.
MSO: there is more than one salient object in an image.
CB: the salient object is far away from the center of the image.
CIB: the salient object crosses the boundaries of the image, so the image contains only part of the object.
SA: the salient object has a color similar to the surrounding background.
TC: the salient object has a temperature similar to other objects or its surroundings.
IC: the scene around the object is complex or the background is cluttered.
OF: the object in the image is out of focus, and the whole image is blurred.
BW: the images are collected on rainy or foggy days.

In addition, we also label images with good or bad imaging quality of objects in the RGB modality (RGB) or thermal modality (T) for future research. Herein, RGB: the objects are not clear in the RGB modality; T: the objects are not clear in the thermal infrared modality. The attribute distributions on the VT5000 dataset are shown in Table I.

We also give another statistic, the size distribution, in Fig. 4. It shows the distribution of salient object sizes in the training set and the test set respectively. We can see that the training set contains more big salient objects than the test set, while the test set contains more small salient objects than the training set, which benefits demonstrating the robustness of our method.
Fig. 3. The image pair on the left is a sample from the VT1000 dataset captured by a FLIR (Forward Looking Infrared) SC620; the one on the right is a sample from the VT5000 dataset captured by FLIR T640 and T610 cameras.
D. Advantages of Our Dataset
Compared with the existing RGBT datasets VT821 [5] and VT1000 [6], our VT5000 dataset has the following advantages: (1) unlike data captured by the previous thermal infrared camera shown in Fig. 3, RGBT image pairs in our dataset do not require manual alignment, so errors introduced by manual alignment are reduced; (2) the thermal infrared camera we use can focus automatically, which improves the accuracy of long-distance shooting and captures image texture information effectively; (3) since the images were captured in summer and autumn, we have more thermal infrared images with severe thermal crossover; (4) we provide a large-scale dataset with more RGBT image pairs and more complex scenes and challenges.
Fig. 4. Size distribution: the horizontal axis represents the proportion of object pixels to the total pixels of the image, and the vertical axis represents the number of corresponding images.
E. Baseline Methods
We include eight deep learning-based and two traditional state-of-the-art methods in our benchmark for evaluation: PoolNet [9], RAS [34], BASNet [35], CPD [36], R3Net [37], PFA [38], PiCANet [39], EGNet [40], MTMR [5] and SGDL [6]. It is worth mentioning that all results are obtained by running the corresponding method on the RGBT data without any post-processing and are evaluated with the same evaluation code. The results of all methods are obtained with the published codes. For a fair comparison, the deep learning methods use the same training set and test set as ours.
F. Evaluation Metrics
Similar to the RGB dataset MSRA-B [41], we use 2500 pairs of RGBT images in the VT5000 dataset as the training set, and take the rest of VT5000 together with VT821 [5] and VT1000 [6] as the test sets. We evaluate the performance of different methods with three metrics: Precision-Recall (PR) curves, F-measure and Mean Absolute Error (MAE). The PR curve is a standard metric for evaluating saliency performance; it is obtained by binarizing the saliency maps using thresholds from 0 to 255 and then comparing the binary maps with the ground truths. The F-measure evaluates the quality of the saliency map by computing the weighted harmonic mean of precision and recall,

$F_\beta = \frac{(1+\beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$   (1)

where $\beta^2$ is set to 0.3 as suggested in [26]. MAE is a complement to the PR curve and quantitatively measures the average pixel-level difference between the predicted saliency map $S$ and the ground truth $G$,

$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} |S(x,y) - G(x,y)|$   (2)

where $W$ and $H$ are the width and height of a given image.
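To make the metrics concrete, the following is a minimal sketch (our own illustration, not the official evaluation code) of the F-measure at one binarization threshold (Eq. (1)) and MAE (Eq. (2)) for a single saliency map:

```python
import numpy as np

def f_measure(sal, gt, thresh, beta2=0.3, eps=1e-8):
    """F-measure of a saliency map binarized at `thresh`, with beta^2 = 0.3 (Eq. 1)."""
    pred = sal >= thresh
    tp = np.logical_and(pred, gt > 0).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / ((gt > 0).sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)

def mae(sal, gt):
    """Mean absolute error between saliency map and ground truth (Eq. 2)."""
    return np.abs(sal.astype(np.float64) - gt.astype(np.float64)).mean()

# A PR curve is traced by sweeping thresh over 0..255 and recording
# the precision/recall pair at each threshold.
```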
IV. ATTENTION-BASED DEEP FUSION NETWORK

In this section, we introduce the architecture of the proposed Attention-based Deep Fusion Network (ADFNet) and describe the details of RGBT salient object detection.
A. Overview of ADFNet
We build our network architecture on VGG16 [7], as shown in Fig. 5, and employ a two-stream CNN architecture, which first extracts RGB and thermal infrared features separately and then performs RGBT salient object detection. To make the network focus on more informative regions, we apply a series of attention modules to extract weighted features from the RGB and thermal infrared branches before fusing them. From the second block of VGG16, the fused features of each layer are transmitted from the lower level to the higher level in turn. Although high-level semantic information helps locate salient objects [42], [43], [44], low-level and mid-level features are also essential to refine deep-level features. Therefore, we add two complementary modules (the Pyramid Pooling Module and the Feature Aggregation Module) [9] to accurately capture the exact position of a salient object while sharpening its details.
Fig. 5. The overall architecture of our method. VGG16 is our backbone network, in which blocks of different colors represent different convolution blocks of VGG16.
B. Convolutional Block Attention Module
As illustrated in Fig. 5, the RGB and thermal infrared images respectively generate five different levels of features through the five blocks of the backbone network VGG16, denoted by $X_i^R$ and $X_i^T \in \mathbb{R}^{C \times H \times W}$, where $i$ indexes the $i$-th block of VGG16. As most complex scenes contain cluttered background, which brings a lot of noise into feature extraction, we want to selectively extract features with less noise from the RGB and thermal infrared branches. Therefore, we adopt the Convolutional Block Attention Module (CBAM) [8] with channel-wise attention and spatial-wise attention, shown in Fig. 6. As shown in Fig. 7, with CBAM the proposed network can capture the spatial details around the object, especially at the shallow layers, which is conducive to saliency refinement. Without CBAM, the network retains redundant information that does not help saliency refinement.

Fig. 6. Convolutional Block Attention Module (CBAM): (a) channel attention module and (b) spatial attention module.
The channel-wise attention focuses on what is meaningful in an input image. Currently, most methods use only average pooling to aggregate spatial information. Following previous works [38], [45], we argue that max pooling collects discriminative characteristics of the object and helps infer finer channel-wise attention. Therefore, we use both average-pooled and max-pooled features. The RGB branch is described here as an example; the thermal infrared branch is processed in the same way. Firstly, we aggregate the spatial information of a feature map with average pooling and max pooling operations to generate two different spatial context descriptors, $X^R_{avg,i}$ and $X^R_{max,i}$, which represent the features after average pooling and max pooling respectively. Secondly, these features are forwarded to two convolution layers, and the output feature vectors are merged by element-wise summation. Finally, the channel attention weights are obtained by a sigmoid function and applied to the input, giving the channel attention map $M_i^{C_R}$. The process can be expressed as:

$M_i^{C_R} = \big(\sigma(Conv(AvgPool(X_i^R)) + Conv(MaxPool(X_i^R)))\big) * X_i^R$   (3)

where $\sigma$ denotes the sigmoid function, $Conv$ denotes the convolution operation and $*$ denotes element-wise multiplication.

The spatial-wise attention is complementary to the channel attention. Different from channel attention, spatial attention focuses on structural information, and its map is generated from the spatial relationships between features. Specifically, we first apply average pooling and max pooling to the features along the channel axis and concatenate them to produce an efficient descriptor. Next, we obtain a two-dimensional feature map with a standard convolution layer:

$M_i^{S_R} = \big(\sigma(f^{k \times k}([AvgPool(M_i^{C_R}), MaxPool(M_i^{C_R})]))\big) * M_i^{C_R}$   (4)

where $\sigma$ denotes the sigmoid function and $f^{k \times k}$ represents a convolution operation with filter size $k \times k$.
Fig. 7. Visualization of features from different fusion layers in the proposed network without CBAM (first row) and with CBAM (second row). From left to right: the saliency map and the fused features from layers 1 to 4.
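For illustration, a minimal PyTorch sketch of the attention module of Eqs. (3) and (4) (our re-implementation; the reduction ratio and the 7×7 spatial kernel follow the CBAM paper [8] and are assumptions here, as are all names):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (Eq. 3) followed by spatial attention (Eq. 4)."""
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Shared 1x1 convolutions applied to the avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: sigmoid(Conv(AvgPool(x)) + Conv(MaxPool(x))) * x.
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = torch.sigmoid(avg + mx) * x
        # Spatial attention: sigmoid(f^{k x k}([AvgPool(x); MaxPool(x)])) * x.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.spatial(s)) * x
```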
C. Multi-modal Multi-layer Feature Fusion
Previous works [46], [47] show that fusing multi-modal features only at the shallowest or the deepest layer may not take full advantage of the useful features of multiple modalities. To obtain rich and useful features of the RGB and thermal infrared images during downsampling, we adopt a multi-layer feature fusion strategy. Specifically, we use two VGG16 networks to extract RGB and thermal infrared features respectively, which preserves the RGB and thermal infrared features before upsampling. Each branch provides a set of feature maps from each block of VGG16. After each convolution block, the two features are processed by CBAM respectively and then added for pixel-level fusion. Here, we add the features of the two modalities directly at the first layer, while at subsequent layers we also add the features of the previous layer after a convolution operation. In this way, both low-level and high-level features are extracted. The corresponding formula is:

$F_i = \begin{cases} M_i^{S_R} + M_i^{S_T}, & i = 1 \\ Conv(F_{i-1}) + M_i^{S_R} + M_i^{S_T}, & i = 2, 3, 4, 5 \end{cases}$   (5)
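A minimal sketch of the fusion rule in Eq. (5), assuming `rgb_feats` and `t_feats` hold the five CBAM-weighted features $M_i^{S_R}$ and $M_i^{S_T}$ from the two branches; the transition convolutions `convs` (which match $F_{i-1}$ to the resolution and channels of level $i$) are our assumption:

```python
import torch.nn as nn

def fuse_multilevel(rgb_feats, t_feats, convs):
    """Element-wise multi-modal, multi-layer fusion, Eq. (5).

    rgb_feats, t_feats: lists of the 5 attention-weighted tensors, shallow to deep.
    convs: nn.ModuleList of 4 convolutions; convs[i-1] is assumed to map F_{i-1}
           to the spatial size and channel count of level i (e.g., a stride-2 3x3).
    """
    fused = [rgb_feats[0] + t_feats[0]]              # i = 1: direct addition
    for i in range(1, 5):                            # i = 2..5: Conv(F_{i-1}) + ...
        fused.append(convs[i - 1](fused[i - 1]) + rgb_feats[i] + t_feats[i])
    return fused
```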
D. Pyramid Pooling Module

A classic encoder-decoder architecture generally follows a top-down pathway built upon a bottom-up backbone. Higher-level features are gradually diluted as they are transferred to shallower layers, so loss of useful information inevitably happens. Moreover, the effective receptive field of a CNN grows slowly as the number of network layers increases [48], so the receptive field of the whole network is not large enough to capture the global information of the image. Considering that fine-level feature maps lack high-level semantic information, we use a Pyramid Pooling Module (PPM) [9] to process features and capture global information at different sampling rates. Thus we can clearly identify the position of the object at each stage.

More specifically, the PPM includes four sub-branches to capture the context information of the image. The first and fourth branches are a global average pooling layer and an identity mapping layer, respectively. For the two intermediate branches, we use adaptive average pooling to ensure that the sizes of the output feature maps are 3×3 and 5×5, respectively. The guidance information generated by the PPM is integrated with the feature maps of different levels along the top-down pathway, and the high-level semantic information is passed to the feature map of each level by a series of upsampling operations. Providing global information to the features of each level makes the localization of salient objects more accurate.
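A minimal PyTorch sketch of the four-branch pooling described above (the 1×1 projection and channel handling are our assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module: global average pooling, 3x3 and 5x5 adaptive
    average pooling, and an identity branch, merged at the input resolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.AdaptiveAvgPool2d(1),   # global average pooling
            nn.AdaptiveAvgPool2d(3),   # 3x3 output
            nn.AdaptiveAvgPool2d(5),   # 5x5 output
        ])
        self.project = nn.Conv2d(4 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        outs = [x]  # identity branch
        for pool in self.branches:
            outs.append(F.interpolate(pool(x), size=(h, w),
                                      mode='bilinear', align_corners=False))
        return self.project(torch.cat(outs, dim=1))
```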
E. Feature Aggregation Module
As shown in Fig. 5, with the help of the global guidance flow, the global guidance information can be delivered to the features at different pyramid levels. Next, we want to integrate the coarse feature map with the features at different scales along the global guidance flow. The input image first passes through the five convolution blocks of VGG16 in sequence, so the feature maps corresponding to $F = \{F_2, F_3, F_4, F_5\}$ in the pyramid have been downsampled with downsampling rates of $\{2, 4, 8, 16\}$, respectively. In the original top-down pathway, the RGB and thermal infrared features with coarser resolution are upsampled by a factor of 2. After the merging operation, we use a convolutional layer with kernel size 3 × 3.
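For illustration, a sketch of a feature aggregation step modeled on the FAM of PoolNet [9]; the local pooling rates {2, 4, 8} are taken from that design and are an assumption here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FAM(nn.Module):
    """Feature Aggregation Module: smooth an (upsampled + merged) feature map
    by averaging local context at several scales, then apply a 3x3 convolution."""
    def __init__(self, channels: int, rates=(2, 4, 8)):
        super().__init__()
        self.rates = rates
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        out = x
        for r in self.rates:
            y = F.avg_pool2d(x, kernel_size=r, stride=r)   # local context at rate r
            out = out + F.interpolate(y, size=(h, w),
                                      mode='bilinear', align_corners=False)
        return self.conv(out)
```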
F. Loss Function

1) Cross Entropy Loss: The cross entropy loss is commonly used to measure the error between the final saliency map and the ground truth in salient object detection. It is defined as:

$L_C = -\sum_{i=0}^{size(Y)} \big(Y_i \log(P_i) + (1 - Y_i) \log(1 - P_i)\big)$   (6)

where $Y$ represents the ground truth, $P$ represents the saliency map output by the network, and $size(Y)$ is the number of pixels in an image.
2) Edge Loss:
The cross entropy loss provides general guidance for the generation of the saliency map. Nevertheless, edge blur remains an unsolved problem in salient object detection. Inspired by [38], but different from it, we use a simpler strategy to sharpen the boundary around the object. Specifically, we use the Laplace operator [49] to generate the boundaries of the ground truth and of the predicted saliency map, and then use the cross entropy loss to supervise the generation of salient object boundaries:

$\Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$   (7)

$\Delta \tilde{f} = abs(tanh(conv(f, K_{laplace})))$   (8)

$L_E = -\sum_{i=0}^{size(Y)} \big(\Delta Y_i \log(\Delta P_i) + (1 - \Delta Y_i) \log(1 - \Delta P_i)\big)$   (9)

The Laplace operator is defined as the divergence of the gradient $\Delta f$. Since the second derivative can be used to detect edges, we use the Laplace operator to obtain salient object boundaries. In Eq. (7), $x$ and $y$ are the standard Cartesian coordinates of the $xy$-plane. The Laplacian uses the gradient of the image, which is computed with a convolution. We then apply a tanh activation followed by an absolute value operation in Eq. (8) to map the values into the range 0 to 1. We use the edge loss (Eq. (9)) to measure the error between the real boundaries of the salient object and the generated boundaries. The total loss can be represented as:

$L_S = L_C + L_E$   (10)
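The total loss of Eq. (10) can be sketched as follows (our own illustration; the 3×3 Laplacian kernel is a standard choice and an assumption on our part, and `pred` is assumed to be a sigmoid-activated map of shape (N, 1, H, W)):

```python
import torch
import torch.nn.functional as F

# Standard 3x3 discrete Laplacian kernel (an assumed choice).
LAPLACE_K = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def laplace_edge(x: torch.Tensor) -> torch.Tensor:
    """Eq. (8): boundary map via Laplacian convolution, tanh, then abs."""
    return torch.abs(torch.tanh(F.conv2d(x, LAPLACE_K.to(x.device), padding=1)))

def total_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Eq. (10): cross entropy loss (Eq. 6) plus edge loss (Eq. 9)."""
    l_c = F.binary_cross_entropy(pred, gt)
    l_e = F.binary_cross_entropy(laplace_edge(pred), laplace_edge(gt))
    return l_c + l_e
```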
TABLE II. List of the baseline methods with their main techniques and publication information.

Baseline      Technique                                                              Venue  Year
RAS [34]      residual learning and reverse attention                                ECCV   2018
PiCANet [39]  pixel-wise contextual attention network                                CVPR   2018
R3Net [37]    recurrent residual refinement network                                  IJCAI  2018
MTMR [5]      multi-task manifold ranking with cross-modality consistency            IGTA   2018
SGDL [6]      collaborative graph learning algorithm                                 TMM    2019
PFA [38]      context-aware pyramid feature extraction module                        CVPR   2019
CPD [36]      multi-level feature aggregation                                        CVPR   2019
PoolNet [9]   global guidance module and feature aggregation module                  CVPR   2019
BASNet [35]   predict-refine architecture and a hybrid loss                          CVPR   2019
EGNet [40]    integration of local edge information and global location information  ICCV   2019
TABLE III. The value of F-measure on each challenge for our method and ten comparison methods.

Challenge  PoolNet  BASNet  CPD    PFA    R3Net  RAS    PiCANet  EGNet  MTMR   SGDL   OUR
BSO        0.800    0.858   0.872  0.802  0.831  0.768  0.804    0.873  0.667  0.754  -
CB         0.725    0.808   0.845  0.748  0.794  0.669  0.796    0.838  0.575  0.703  -
CIB        0.740    0.822   0.860  0.742  0.822  0.688  0.790    0.854  0.582  0.694  -
IC         0.721    0.775   0.812  0.735  0.745  0.672  0.752    0.818  0.564  0.681  -
LI         0.757    0.832   0.840  0.749  0.790  0.707  0.783    0.848  0.695  0.742  -
MSO        0.706    0.794   0.826  0.729  0.774  0.655  0.777    0.815  0.620  0.710  -
OF         0.762    0.816   0.821  0.754  0.759  0.738  0.758    0.817  0.707  0.738  -
SA         0.727    0.762   0.825  0.726  0.728  0.673  0.748    0.791  0.653  0.665  -
SSO        0.658    0.718   0.767  0.695  0.663  0.535  0.676    0.701  0.698  0.753  -
TC         0.720    0.791   0.811  0.762  0.729  0.711  0.745    0.791  0.570  0.675  -
BW         0.750    0.768   0.795  0.671  0.753  0.701  0.773    0.774  0.606  0.643  -
RGB        0.733    0.785   0.804  0.731  0.736  0.690  0.743    0.785  0.670  0.671  -
T          0.719    0.787   0.802  0.755  0.719  0.699  0.736    0.776  0.564  0.664  -

V. EXPERIMENTS
In this section, we first introduce the experimental setup, including implementation details, the training and testing datasets, and the evaluation criteria. Then we conduct a series of ablation studies to validate each component of the proposed baseline method. Finally, we report the performance of our method and compare it with the state-of-the-art methods.

To provide a comparison platform, Table II lists the baseline methods with their main techniques, venues and publication years. We feed RGB and thermal images into ten state-of-the-art methods for RGBT salient object detection: PoolNet [9], RAS [34], BASNet [35], CPD [36], R3Net [37], PFA [38], PiCANet [39], EGNet [40], MTMR [5] and SGDL [6]. All of these methods utilize deep features except MTMR [5]; MTMR [5] and SGDL [6] are the only traditional models. In our method, we combine the deep features extracted from the RGB and thermal branches and compare against the above-mentioned methods.
A. Experiment Setup
Implementation Details.
In this work, the proposed network is implemented in PyTorch with the following hyper-parameters. We train our network on a single Titan Xp GPU. All experiments use the Adam optimizer [50] with a weight decay of 5e-4, and the network is trained for 25 epochs. The initial learning rate is 1e-4; after the 20th epoch, the learning rate is reduced to 1e-5. We perform data augmentation with simple random horizontal flipping. The original size of the input images is 640×480; to improve training efficiency, we resize the input images to 400×400.
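A sketch of this optimization schedule (hypothetical names: `ADFNet` stands for the proposed network, `train_loader` for a loader yielding RGB, thermal and ground-truth tensors, and `total_loss` for the loss sketch in Section IV-F):

```python
import torch
from torch.optim import Adam

model = ADFNet().cuda()                        # hypothetical two-stream network
optimizer = Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)

for epoch in range(25):
    if epoch == 20:                            # drop the learning rate to 1e-5
        for group in optimizer.param_groups:
            group['lr'] = 1e-5
    for rgb, thermal, gt in train_loader:      # inputs resized to 400x400 with
        pred = model(rgb.cuda(), thermal.cuda())  # random horizontal flipping
        loss = total_loss(pred, gt.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```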
Fig. 8. Precision-recall curves of our model compared with PoolNet [9], RAS [34], BASNet [35], CPD [36], R3Net [37], PFA [38], PiCANet [39], EGNet [40], MTMR [5] and SGDL [6] on VT821, VT1000 and VT5000. Our model delivers state-of-the-art performance on all three datasets. In each legend, the first number is the F-measure and the second is the MAE.
B. Comparison with State-of-the-Art Methods
Challenge-sensitive performance.
To display and analyze the challenge-sensitive performance and the object imaging quality of our method against other methods, we give a quantitative comparison in Table III. We evaluate our method on the eleven challenges and on bad imaging quality of objects in the two modalities (i.e., BSO, SSO, MSO, LI, CB, CIB, SA, TC, IC, OF, BW, RGB, T) on the VT5000 dataset. Our method is significantly better than the other methods, showing that it is more robust to these challenges. Compared with PoolNet [9], our method gains 10.8% and 14.8% in F-measure on the SA and SSO challenges, respectively. These results show that thermal infrared data can provide effective information to help the network distinguish the object from the background when the object is similar to the background in the RGB modality. Small objects are a challenge for both modalities; with the help of the global guidance flow derived from the PPM, our network can locate salient objects well, even small ones.
Quantitative Comparisons.
We compare the proposed method with others in terms of F-measure scores, MAE scores and PR curves, and we verify the effectiveness of our method on three datasets; quantitative results are shown in Fig. 8. We take PoolNet [9] as our baseline. Fig. 8 shows the results on VT821, VT1000 and VT5000, and our method performs best in F-measure. Compared with the baseline PoolNet, with the help of the thermal infrared branch, our model outperforms PoolNet by a large margin of 5.5%-9.9% in F-measure on the three RGBT datasets (VT821, VT1000 and VT5000). Compared with PiCANet [39], which also uses an attention mechanism, our method achieves a gain in F-measure on VT821, although our MAE is larger there; on VT1000 and VT5000, our method outperforms PiCANet [39] in both F-measure and MAE. As a high-performance method, CPD [36] proposes a new cascaded partial decoder framework that integrates features of deeper layers and discards larger-resolution features of shallower layers to achieve fast and accurate salient object detection. Our method outperforms CPD [36] in F-measure with a lower MAE on VT821 and VT1000; on VT5000, our F-measure is higher than CPD's, while our MAE is larger. EGNet [40] is the most recent of the compared methods and is composed of three parts: edge feature extraction, salient object feature extraction, and a one-to-one guidance module; the edge features help locate the object and make the object boundary more accurate. On VT1000, the F-measure of our method is higher than EGNet's, while EGNet's MAE is lower than ours. On VT5000, compared with the baseline PoolNet, our method gains in F-measure with a lower MAE, and compared with the best competing method EGNet [40], our F-measure is higher while EGNet's MAE is lower than ours. Compared with EGNet [40], our method has the following merits: (1) it sharpens the edges of the salient object without using an additional edge detection model; (2) the global guidance flow derived from the PPM makes good use of global context information and better locates the salient object; (3) with the help of the thermal infrared branch, the complementary information of the two modalities helps handle various challenges in salient object detection. This shows that our method is still the best overall for RGBT SOD.
PR Curves.
In addition to the results discussed above, we also show the PR curves on the three datasets. As shown in Fig. 8, the PR curves (red) obtained by our method stand out from those of all previous methods. When the recall score is close to 1, our precision score is much higher than that of the compared methods, which also indicates that the true-positive rate of our saliency maps is higher.
Visual Comparison.
To qualitatively evaluate the proposed method on the new RGBT dataset, we visualize and compare some results of our method with other state-of-the-art methods in Fig. 9. These examples cover various scenarios, including big salient object (BSO) (rows 1, 3, 4, 7), multiple salient objects (MSO) (rows 2, 5, 13), small salient object (SSO) (row 13), cross image boundary (CIB) (rows 3, 6, 9), cluttered background (IC) (rows 3, 7, 9), low illumination (LI) (row 12), center bias (CB) (row 8), out of focus (OF) (rows 2, 13), bad weather (BW) (rows 3, 9), similar appearance (SA) (row 8) and thermal crossover (TC) (rows 11, 13). Each row in Fig. 9 includes at least one challenge. It is easy to see that our method obtains the best results in various challenging scenes. Specifically, the proposed method not only clearly highlights the objects but also suppresses the background, and the objects have well-defined contours.

VI. ABLATION ANALYSIS
In this ablation analysis, we investigate the effects of CBAM and the edge loss on our method. As shown in Table IV, we first run the basic network without CBAM and the edge loss, and the result is not good. If we only add CBAM to the basic network PoolNet [9], the F-measure increases by 2.1% and the MAE decreases by 0.5%; if we only add the edge loss, the performance degrades. During the experiments, we find that with only the edge loss, the loss value decreases overall but fluctuates greatly during training. In addition, although adding CBAM to the basic network can effectively suppress noise, when too much redundant noise appears the extracted edges are unsatisfactory, which greatly affects training stability. These observations suggest that without the attention mechanism, the salient object cannot be located accurately. As shown in Table IV, with CBAM locating the salient object and the edge loss refining edges, our network obtains the best salient object detection performance.
TABLE IV. The impact of each component in the network on the performance.

CBAM  Edge Loss  max F_β  MAE
                 0.836    0.057
✓
      ✓
✓     ✓
VII. CONCLUDING REMARKS AND POTENTIAL DIRECTIONS
In this work, we create a new large-scale RGBT dataset for deep salient object detection, with attribute annotations for 11 challenges and quality annotations of object imaging in the RGB and thermal modalities. We also propose a novel attention-based deep fusion network for RGBT salient object detection. Our network consists of a basic feature extraction network, convolutional block attention modules, a pyramid pooling module and feature aggregation modules. The comparison experiments demonstrate that our baseline method performs best among all the state-of-the-art methods on most evaluation metrics.

From the evaluation results, taking advantage of thermal images can boost salient object detection when the salient object is big, far away from the center of the image, or crosses the image boundaries, and when the background is cluttered, the illumination is low, or the salient object has an appearance similar to the background. Cluttered backgrounds and low illumination are common scenes but bring big challenges to salient object detection, and thermal infrared images provide complementary information to RGB images that improves SOD results. However, when thermal crossover occurs, thermal data become unreliable, whereas visible spectrum imaging is not influenced by temperature.

According to the evaluation results, we draw some inspirations that are essential for boosting RGBT SOD in the future. Firstly, deep learning-based RGBT SOD methods need to be explored further. For example, how to design a suitable deep network that takes the special properties of the RGB and thermal modalities into consideration is worth studying, and how to make the best use of attention mechanisms and semantic information remains important for improving the feature representation of salient objects and preventing them from being gradually diluted. Secondly, attribute-based feature representations could be studied to handle the lack of sufficient training data. Compared with object detection and classification, the scale of annotated data for training RGBT SOD networks is very small. We annotate various attributes in our VT5000 dataset, and these attribute annotations could be used to study attribute-based feature representations that model different visual contents under certain attributes to reduce network parameters. Thirdly, unsupervised and weakly supervised RGBT SOD are valuable research directions. RGBT SOD needs pixel-level annotations, so annotating large-scale datasets incurs unacceptable manual cost; reducing the reliance on large-scale labeled datasets is therefore a future research direction for RGBT SOD. Note that we have annotated some weakly supervised labels in VT5000, i.e., the imaging quality of the different modalities, and believe they will benefit research on unsupervised and weakly supervised RGBT SOD. Finally, alignment-free methods would make RGBT SOD more popular and practical in real-world applications. We find that existing datasets contain some misaligned RGBT image pairs even though we adopted several advanced techniques to perform the alignment in the VT5000 dataset. Moreover, the images recorded by existing RGBT imaging platforms are non-aligned. Therefore, research on alignment-free RGBT SOD is also worth investigating in the future.
Fig. 9. Saliency maps produced by PoolNet [9], RAS [34], BASNet [35], CPD [36], R3Net [37], PFA [38], PiCANet [39], EGNet [40], MTMR [5] and SGDL [6]. From left to right: RGB image, thermal image, ground truth, ours, CPD, EGNet, R3Net, BASNet, RAS, PoolNet, PFA, PiCANet, SGDL and MTMR. Our model can deliver state-of-the-art performance on the three datasets.

REFERENCES

[1] A. Ciptadi, T. Hermans, and J. M. Rehg, "An in depth view of saliency," in British Machine Vision Conference, 2013.
[2] R. Ju, L. Ge, W. Geng, T. Ren, and G. Wu, "Depth saliency based on anisotropic center-surround difference," in Proceedings of IEEE International Conference on Image Processing, 2014.
[3] C. Li, H. Cheng, S. Hu, X. Liu, J. Tang, and L. Lin, "Learning collaborative sparse representation for grayscale-thermal tracking," IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5743–5756, 2016.
[4] C. Li, X. Liang, Y. Lu, N. Zhao, and J. Tang, "RGB-T object tracking: Benchmark and baseline," Pattern Recognition, vol. 96, 2019.
[5] G. Wang, C. Li, Y. Ma, A. Zheng, J. Tang, and B. Luo, "RGB-T saliency detection benchmark: Dataset, baselines, analysis and a novel approach," in Image and Graphics Technologies and Applications, 2018.
[6] Z. Tu, T. Xia, C. Li, X. Wang, Y. Ma, and J. Tang, "RGB-T image saliency detection via collaborative graph learning," IEEE Transactions on Multimedia, vol. 22, no. 1, pp. 160–173, 2019.
[7] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, 2015.
[8] S. Woo, J. Park, J. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proceedings of IEEE European Conference on Computer Vision, 2018.
[9] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang, "A simple pooling-based design for real-time salient object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[10] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, "Saliency detection on light field," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[11] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in International Conference on Learning Representations, 2015.
[12] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, D. Ramanan, and T. S. Huang, "Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks," in Proceedings of IEEE International Conference on Computer Vision, 2015.
[13] S. Hong, T. You, S. Kwak, and B. Han, "Online tracking by learning discriminative saliency map with convolutional neural network," in Proceedings of the 32nd International Conference on Machine Learning, 2015.
[14] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola, "Stacked attention networks for image question answering," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[15] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the 32nd International Conference on Machine Learning, 2015.
[16] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, "Multi-context attention for human pose estimation," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[17] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, "Progressive attention guided recurrent network for salient object detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[18] L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang, "A bi-directional message passing model for salient object detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[19] C. Li, N. Zhao, Y. Lu, C. Zhu, and J. Tang, "Weighted sparse representation regularized graph learning for RGB-T object tracking," in Proceedings of the ACM on Multimedia Conference, 2017.
[20] C. Li, C. Zhu, Y. Huang, J. Tang, and L. Wang, "Cross-modal ranking with soft consistency and noisy labels for robust RGB-T tracking," in Proceedings of IEEE European Conference on Computer Vision, 2018.
[21] H. Liu and F. Sun, "Fusion tracking in color and infrared images using joint sparse representation," SCIENCE CHINA Information Sciences, vol. 55, no. 3, pp. 590–599, 2012.
[22] S. Yang, B. Luo, C. Li, G. Wang, and J. Tang, "Fast grayscale-thermal foreground detection with collaborative low-rank decomposition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 2574–2585, 2018.
[23] L. Qu, S. He, J. Zhang, J. Tian, Y. Tang, and Q. Yang, "RGBD salient object detection via deep fusion," IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2274–2285, 2017.
[24] J. Han, H. Chen, N. Liu, C. Yan, and X. Li, "CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion," IEEE Transactions on Cybernetics, vol. 48, no. 11, pp. 3171–3183, 2018.
[25] J. Tang, D. Fan, X. Wang, Z. Tu, and C. Li, "RGBT salient object detection: Benchmark and a novel cooperative ranking approach," IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[26] R. Achanta, S. S. Hemami, F. J. Estrada, and S. Süsstrunk, "Frequency-tuned salient region detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[27] M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S. Hu, "Global contrast based salient region detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 569–582, 2015.
[28] H. Jiang, M. Cheng, S. Li, A. Borji, and J. Wang, "Joint salient object detection and existence prediction," Frontiers of Computer Science, vol. 13, no. 4, pp. 778–788, 2019.
[29] G. Li, Y. Xie, L. Lin, and Y. Yu, "Instance-level salient object segmentation," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[30] C. Xia, J. Li, X. Chen, A. Zheng, and Y. Zhang, "What is and what is not a salient object? Learning salient object detector by ensembling linear exemplar regressors," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[31] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, "The secrets of salient object segmentation," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[32] B. W. Tatler, R. J. Baddeley, and I. D. Gilchrist, "Visual correlates of fixation selection: Effects of scale and time," Vision Research, vol. 45, no. 5, pp. 643–659, 2005.
[33] X. Huang and Y. Zhang, "300-fps salient object detection via minimum directional contrast," IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4243–4254, 2017.
[34] S. Chen, X. Tan, B. Wang, and X. Hu, "Reverse attention for salient object detection," in Proceedings of IEEE European Conference on Computer Vision, 2018.
[35] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jägersand, "BASNet: Boundary-aware salient object detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[36] Z. Wu, L. Su, and Q. Huang, "Cascaded partial decoder for fast and accurate salient object detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[37] Z. Deng, X. Hu, L. Zhu, X. Xu, J. Qin, G. Han, and P.-A. Heng, "R3Net: Recurrent residual refinement network for saliency detection," in Proceedings of the International Joint Conference on Artificial Intelligence, 2018.
[38] T. Zhao and X. Wu, "Pyramid feature attention network for saliency detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[39] N. Liu, J. Han, and M. Yang, "PiCANet: Learning pixel-wise contextual attention for saliency detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[40] J.-X. Zhao, J.-J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, "EGNet: Edge guidance network for salient object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019.
[41] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H. Shum, "Learning to detect a salient object," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 353–367, 2010.
[42] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji, "Detect globally, refine locally: A novel approach to saliency detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[43] N. Liu and J. Han, "DHSNet: Deep hierarchical saliency network for salient object detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[44] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr, "Deeply supervised salient object detection with short connections," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[45] S. Zhu and L. Zhu, "OGNet: Salient object detection with output-guided attention module," arXiv preprint arXiv:1907.07449, 2019.
[46] J. Jiang, L. Zheng, F. Luo, and Z. Zhang, "RedNet: Residual encoder-decoder network for indoor RGB-D semantic segmentation," Computing Research Repository, vol. abs/1806.01054, 2018.
[47] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, "3D graph neural networks for RGBD semantic segmentation," in Proceedings of IEEE International Conference on Computer Vision, 2017.
[48] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[49] D. Gilbarg and N. S. Trudinger, Elliptic Partial Differential Equations of Second Order. Springer, 2015.
[50] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, 2015.