OGNet: Salient Object Detection with Output-guided Attention Module
Shiping Zhu, Member, IEEE, and Lanyun Zhu

This work is supported by the National Natural Science Foundation of China (NSFC) under grants No. 61375025, No. 61075011 and No. 60675018, and also by the Scientific Research Foundation for the Returned Overseas Chinese Scholars from the State Education Ministry of China.

Shiping Zhu and Lanyun Zhu are with the Department of Measurement Control and Information Technology, School of Instrumentation Science and Optoelectronics Engineering, Beihang University, XueYuan Road No. 37, HaiDian District, Beijing, 100191, China (phone: +86-13391687912; e-mail: [email protected]).
Abstract—Attention mechanisms are widely used in salient object detection models based on deep learning, and they can effectively promote the extraction and utilization of useful information by neural networks. However, most of the existing attention modules used in salient object detection take the processed feature map itself as input, which easily leads to the problem of 'blind overconfidence'. In this paper, instead of applying the widely used self-attention module, we present an output-guided attention module built with multiscale outputs to overcome the problem of 'blind overconfidence'. We also construct a new loss function, the intractable area F-measure loss function, which is based on the F-measure of the hard-to-handle areas, to improve the detection effect of the model in the edge areas and confusing areas of an image. Extensive experiments and abundant ablation studies are conducted to evaluate the effect of our methods and to explore the most suitable structure for the model. Tests on several datasets show that our model performs very well, even though it is very lightweight.
Index Terms—Salient object detection, multi-output neural network, attention mechanism
I. INTRODUCTION

Salient object detection aims to estimate the region of the most attractive object in an image, and it is an important research area in computer vision. It has great application value in the fields of scene classification [1], object detection [2], image retrieval [3], [4] and visual tracking [5]-[7]. Salient object detection is a very challenging problem because it requires both a correct identification of the salient object and an accurate display of the salient region. In recent decades, many algorithms for salient object detection have been proposed. Inspired by the human visual attention mechanism, traditional unsupervised methods [8]-[13] typically apply handcrafted features in images to determine the salient region. These methods do not perform well when the background of the image or the shape of the salient object is very complicated. Recently, deep learning has developed rapidly, and many methods based on deep learning [14]-[22] have greatly improved the accuracy of salient object detection. Models based on convolutional neural networks and recurrent neural networks have achieved remarkable performance in many tasks, such as image classification [23], [24], object detection [25]-[27] and semantic segmentation [28], [29]. A deep neural network can effectively extract and fuse different levels of features in images, which effectively solves the insufficiency of image feature extraction and fusion in traditional methods. Many salient object detection models based on deep learning adopt the encoder-decoder as the basic structure of the neural network [21], [30], [31]. This structure, represented by FCN [32], reduces the image resolution by passing the encoder and extracting the image features from different levels; then, it gradually recovers the image resolution by the decoder and finally gains the saliency map. The encoder-decoder structure is widely adopted since it can recover the contour shape of the salient objects well.

Since the encoder-decoder has a weaker extraction ability for semantic information, the simple utilization of the encoder-decoder cannot gain quite an outstanding performance. Hence, many research studies are committed to improving the original encoder-decoder structure. Two methods can upgrade the detection effects while adding only a small memory footprint and computing cost. One is to use networks with multiscale outputs [33]. Different from most models that have only one output, the multiscale output structure, represented by the deeply supervised net, obtains many outputs at various positions of the neural network. Such a structure can make the deeper parts of the neural network easier to train and lead the network to place more emphasis on the required tasks, avoiding information turbulence and mistakes. Many models based on multiscale output structures for image segmentation [34], object detection [35] and salient object detection [14], [15], [33], [36] have achieved good performance. The other method uses the attention mechanism. In recent years, the attention mechanism has become one of the most important research directions of deep learning [37] because it can significantly improve the effect of models by adding only a small amount of computation. Attention mechanisms can reinforce useful or key information and impair useless or incorrect information. Salient object detection is a task of classifying each pixel into two categories, and the introduction of an attention mechanism can enhance the confidence of the model's judgement.

In this paper, we make full use of the features of these two structures. Traditional attention modules used in segmentation and salient object detection tasks usually apply the processed feature maps themselves as the input of the attention modules. In such a structure, it is easy to realize two kinds of favorable operations, reinforcing 'true positive' information and weakening 'false negative' information, as well as to generate two kinds of faulty operations, reinforcing 'false positive' information and weakening 'true negative' information. This is a problem we call the 'blind overconfidence' of the attention module judgement. To solve this problem, we take the deeper layers' low-resolution outputs as the input of the shallower layer's attention module to establish a new output-guided attention module. Compared with the ordinary self-attention module, taking the deeper layers' outputs as the input of the attention module can integrate the advantages of each layer of the neural network, thus preventing the attention module of the decoder in one layer from enlarging the false information caused by this layer. Considering that different input feature maps have different importance in attention module processing for different input images, in order to reasonably select the multiple inputs of the attention module, we make the network learn a set of weights, and all the input feature maps are weighted before being fed into the attention module. Based on the output-guided module, we build a new salient object detection model applying the classic encoder-decoder structure. Our model gains remarkable detection effects with a small memory footprint and fast detection speed. Moreover, enlightened by the features of the outputs in different layers, we propose a method to identify regions that are difficult to estimate in the images of the training set. On this basis, we propose the intractable area F-measure loss function, which pays more attention to the areas that are difficult to judge in an image. The main contributions of this paper can be summarized as follows:

• We propose a new output-guided attention module built with outputs at various positions of a neural network, which can overcome the shortcomings of many other self-attention modules.

• We propose a new end-to-end neural network for salient object detection applying the output-guided attention module.

• We propose an intractable area loss function based on the features of the multi-output structure. The introduction of this loss function makes the model more effective when facing complicated images.

The rest of this paper is organized as follows: in Section II, we review the existing classic salient object detection models and attention modules; in Section III, we introduce the output-guided attention module, OGNet and the intractable area F-measure loss; in Section IV, we demonstrate our experimental results; finally, we conclude this paper in Section V.

Fig. 1. Some examples of outputs in different layers. (a) Input image; (b) ground truth; (c) output of layer 1; (d) output of layer 2; (e) output of layer 3; (f) output of layer 4; (g) output of layer 5. The area in the red box is where the output is misjudged. For the first line, the output of the shallower layers makes fewer mistakes; for the second line, the output of the shallower layers makes more mistakes.

II. RELATED WORK
A. Salient Object Detection
In the early stages of development, salient object detection models were usually based on low-level handcrafted features, such as color features [8] and textural features [11], [37], [38]. Although these models generated certain effects, their performance was not ideal for images with complicated backgrounds or complex salient objects. When making saliency judgments, the human eyes always confront complicated elements, while in traditional methods, fully considering and integrating various factors is difficult. Deep learning explores a new route for the research of salient object detection. Early salient object detection models based on deep learning usually select the convolutional layer - fully connected layer structure, which is the same as most image classification models. Wang et al. [16] proposed two neural networks to detect salient objects, one for learning local patch features to determine the saliency value of each pixel and the other for predicting the saliency score of each object region based on global features. Li and Yu [39] first segmented the image into several areas and then formulated a neural network with some branches to train these areas. Their method then utilizes several convolutional layers to connect the branches together to achieve information integration among different layers. Zhao et al. [40] built a multicontext deep learning framework with two branches that extract global context and local context and then integrate them together. After the appearance of fully convolutional networks, many salient object detection models based on deep learning adopted the encoder-decoder structure represented by FCN and then made some adjustments to that structure. Liu and Han [14] made use of hierarchical recurrent convolution to build up the decoder part of the neural network. Zhang et al. [17] applied a reformulated dropout to some convolutional layers on the strength of the basic encoder-decoder structure to extract the salient information more conveniently. However, due to the inadequate use of different levels of information, it is difficult to achieve very good performance with a simple encoder-decoder. Hou et al. [33] utilized a large number of short connections to join the decoders in different layers together, drawing on the idea of DenseNet [41], which worked to realize the full integration of information in different layers. Similarly, Zhang et al. [18] proposed Amulet. Wang et al. [42] proposed a multistage structure and used pyramid pooling in the joint part to obtain and merge the information from different layers. Such methods usually perform well. However, due to the demand to connect feature maps in different layers, they often consume a large amount of memory and require a large amount of computation.
B. Attention Mechanism
During the development of deep learning, the attention mechanism was first applied to the field of machine translation [43], [44] and accomplished outstanding effects. It was then applied to neural network models in computer vision. In the past few years, many models applying attention mechanisms have greatly improved the effects in image classification [45], semantic segmentation [46], [47], action recognition [48] and other fields. The core ideology of the attention mechanism is to selectively enhance or weaken the large amount of information constructed by neural networks. The attention module of SENet proposed by Hu et al. [45] includes two processes: squeeze and excitation. The squeeze process applies global average pooling to compress the feature maps, and the excitation process utilizes two fully connected layers to obtain a series of weights, which are used to weigh the feature maps from the channels. This method immensely improves the accuracy of image classification models. Furthermore, CBAM proposed by Woo et al. [49] expands the attention mechanism's treatment dimension from the channel dimension of SENet to two dimensions, channel and spatial, and selects both the average value and the maximum value to compress the feature maps, which further increases the effects of the attention module. The structures of SENet and CBAM can be extended to many other computer vision task models. In recent years, many salient object detection models have also utilized various kinds of attention modules. The module proposed by Zhang et al. [30] builds two attention modules from the channel and spatial layers, which is similar to the establishment method of CBAM. Liu et al. [31] applied a convolution and a bidirectional LSTM to formulate local pixelwise attention and global pixelwise attention, which enlarges the receptive field to reduce mistakes. All the attention modules mentioned above only use the processed feature maps themselves as the input, which is the main difference between the proposed method and other methods for salient object detection that only use self-attention [30], [31].

III. METHODOLOGY
In this section, we introduce our proposed methods. The output-guided attention module is introduced in Subsection III-A, and the complete structure of the output-guided network (OGNet) is shown in Subsection III-B. The intractable area F-measure loss and the training method are shown in Subsection III-C and Subsection III-D, respectively.

Fig. 2. The structure of channel attention. FC is the fully connected layer. MUL is a multiplication operation.

Fig. 3. The structure of spatial attention. CAT is the concatenation of some groups of feature maps from channels.
A. Output-guided Attention Module

1) Blind Overconfidence:
At present, the attention mechanism composed of both spatial attention and channel attention, represented by CBAM [49], is one of the most popular attention modules used in various kinds of computer vision models. CBAM builds up two attention modules, channel attention and spatial attention, taking the processed feature maps themselves as input. The effects of models can be greatly improved through these two attention modules. Such a module is very suitable for image classification because image classification does not concern the shape and location of objects in an image; thus, the enhancement of incorrect information caused by attention modules will not have a great impact on the final judgment. However, we think that this kind of spatial attention, which only takes processed feature maps as input, faces some problems when applied to salient object detection. Salient object detection aims to classify each pixel in an image into two categories, which means that the final saliency map is binary.
Fig. 4. The detailed structure of the output-guided model.

Assuming that a pixel in an image is salient, the value of the corresponding position in the feature maps should be large. If a layer of the neural network misjudges the pixel, the following attention module will greatly magnify the wrong information, which is the problem of 'blind overconfidence' in attention modules. An ideal attention module should magnify the correct information and avoid the enhancement of wrong information. A good resolution is to enlarge the receptive field of the attention module to capture more information; however, this requires a large amount of computation. To solve the problem of blind overconfidence, we build a new attention module. Similar to CBAM, it consists of both channel attention and spatial attention. The structures of the channel attention module and the spatial attention module are shown in Fig. 2 and Fig. 3, respectively.
2) Channel Attention:
In a layer of a neural network, not all feature maps have the same significance. The channel attention module is a feature detector that can enhance the information in useful feature maps and reduce the information in useless feature maps. If we adopt the whole feature map as the input of the attention module, the computation will be quite large, which violates the design principle that the attention module should be lightweight. Thus, we should find a method whose receptive field is large enough to express the global feature of a feature map. Similar to CBAM, we use both max pooling and average pooling to describe the global feature of the input feature map F ∈ R^{C×W×H}. The channel attention map can be calculated as follows:

W_c = Sigmoid(L_2(L_1(GMP(F))) + L_2(L_1(GAP(F))))   (1)

where GMP is global max pooling and GAP is global average pooling. L_1 and L_2 are fully connected layers: the input size of L_1 and the output size of L_2 are C, while the output size of L_1 and the input size of L_2 are C/4. This setup is designed to deepen the network to extract more information while reducing the additional memory footprint caused by these two fully connected layers. L_1 is followed by a rectified linear unit (ReLU) [50]. Note that L_1 and L_2 are shared between the feature maps after max pooling and average pooling. Sigmoid is the function used to obtain the attention map; for each element X_c^i in a feature map processed by the two fully connected layers:

W_c^i = 1 / (1 + e^{-X_c^i})   (2)
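As a concrete illustration, the following is a minimal PyTorch sketch of the channel attention of Eq. (1), assuming the reduction factor of 4 described above; the class and variable names are ours, not from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of Eq. (1): shared FC layers L1/L2 applied to the
    global max- and average-pooled descriptors of the feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # L1 reduces C -> C/4, L2 restores C/4 -> C; both layers are
        # shared between the max-pooled and average-pooled branches.
        self.l1 = nn.Linear(channels, channels // reduction)
        self.relu = nn.ReLU(inplace=True)
        self.l2 = nn.Linear(channels // reduction, channels)

    def forward(self, f):                       # f: (B, C, H, W)
        b, c, _, _ = f.shape
        gmp = f.amax(dim=(2, 3))                # global max pooling -> (B, C)
        gap = f.mean(dim=(2, 3))                # global average pooling -> (B, C)
        w = torch.sigmoid(self.l2(self.relu(self.l1(gmp))) +
                          self.l2(self.relu(self.l1(gap))))
        return f * w.view(b, c, 1, 1)           # channel-wise reweighting
```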
3) Spatial Attention:
Spatial attention is used to enhance the confidence of the model in its judgement. In salient object detection, the use of spatial attention can also make a model focus on the foreground region, which is beneficial for saliency prediction. Unlike CBAM, apart from taking the average and maximum values over the channels, the outputs of other layers are also taken as input. We think that taking the outputs of other layers into the attention module is a kind of balance and compensation, which can avoid the enhancement of wrong information caused by the attention module in one layer. Beyond that, this structure can also be regarded as a special form of short connection, which can make full use of information from different layers and make the deep neural network easier to train. For the feature maps from the decoder in layer m, we obtain two feature maps, F_m^max and F_m^avg, which express the comprehensive information of all layers, by calculating the maximum and average values over the channel dimension. For layer m, the input of the attention module is {F_m^max, F_m^avg, O_{m+1}, ..., O_M}, where O_i is the output of the i-th layer.

A straightforward idea is to input these maps directly into some convolution layers to obtain the spatial attention weight. However, this approach regards all maps as having the same importance and ignores the differences between them. As shown in Fig. 1, for the first column, the output maps in shallower layers make fewer errors in the saliency judgments for the area in the red box; thus, when fed into the attention module, the shallower outputs should be more important. However, for the second column, where the outputs in deeper layers judge better, the deeper outputs should be emphasized. Thus, weighting these maps before feeding them into the spatial attention module is necessary. The weight is also obtained from the neural network. We first concatenate F_m with the outputs of the output-guided attention modules of all deeper layers {OG_{m+1}, OG_{m+2}, ..., OG_M} to get a feature map C. C passes through two fully connected layers, which are similar to those in the channel attention module, to obtain a vector V whose dimension equals the number of weighted maps. Finally, the spatial attention map can be generated as follows:

W_s = Sigmoid(f^{7×7}(V · CAT(F_m^max, F_m^avg, O_{m+1}, ..., O_M)))   (3)

where f^{7×7} is a 7×7 convolution layer. The size of this convolution layer is bigger than the usual 3×3 because the receptive field should be large enough to fully extract the pixel relationships for the spatial attention. CAT is the concatenation of feature maps from the channels.

When utilizing the attention module, we let the processed feature maps pass the channel attention module first and then pass the spatial attention module to obtain the final output of the output-guided attention module. The output can be obtained as follows:

F_out = W_s · W_c · F_in   (4)

This arrangement, which passes the channel attention module first, is also inspired by [49], which argues that channel-first is slightly better than spatial-first.
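The PyTorch sketch below illustrates this output-guided spatial attention under the description above. The per-map weighting network that produces V is our simplification (global pooling of the concatenated maps followed by two fully connected layers), the 7×7 kernel follows Eq. (3), and all names are assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputGuidedSpatialAttention(nn.Module):
    """Sketch of Eq. (3): the inputs are the per-channel max/average
    maps of the decoder features plus the (upsampled) single-channel
    saliency outputs of all deeper layers, each weighted by a learned
    scalar before the 7x7 convolution."""
    def __init__(self, num_deeper_outputs, kernel_size=7):
        super().__init__()
        n_maps = 2 + num_deeper_outputs         # F_max, F_avg, O_{m+1..M}
        # Small FC network producing one weight per input map (vector V).
        self.fc = nn.Sequential(
            nn.Linear(n_maps, n_maps * 2),
            nn.ReLU(inplace=True),
            nn.Linear(n_maps * 2, n_maps),
            nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(n_maps, 1, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, feat, deeper_outputs):    # feat: (B, C, H, W)
        f_max = feat.amax(dim=1, keepdim=True)  # max over channels
        f_avg = feat.mean(dim=1, keepdim=True)  # average over channels
        outs = [F.interpolate(o, size=feat.shape[2:], mode='bilinear',
                              align_corners=False) for o in deeper_outputs]
        maps = torch.cat([f_max, f_avg] + outs, dim=1)  # (B, n_maps, H, W)
        v = self.fc(maps.mean(dim=(2, 3)))              # per-map weights V
        maps = maps * v.unsqueeze(-1).unsqueeze(-1)     # weight each map
        w_s = torch.sigmoid(self.conv(maps))            # spatial attention map
        return feat * w_s                               # spatial part of Eq. (4)
```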
B. OGNet

Based on the output-guided attention module introduced above, we propose a new model for salient object detection: the output-guided model (OGNet). Our model strengthens the basic encoder-decoder structure. To compare fairly with most salient object detection models, we choose the most commonly used VGG16 [51] as the backbone of the encoder. Note that, similar to most models, the backbone can be flexibly selected and can be replaced by other networks, such as ResNet [52] and Xception [53].

Our model's decoder contains five layers, so five saliency maps with different resolutions are gained. Each layer of the decoder has the same structure. The structure of the decoder is shown in Fig. 5, and the details of each convolution are shown in Table I. The i-th layer of the decoder takes the output of the encoder in the same layer and the output of the previous layer's decoder as input. First, the decoder feature maps are bilinearly upsampled by a factor of 2, and then two 3×3 convolutions are applied to the feature maps from the encoder and decoder separately.
Fig. 5. The detailed structure of a layer of the decoder. Feature maps from the encoder and decoder are input from the left and bottom, respectively. UPS is the bilinear upsampling.
TABLE I
STRUCTURE OF EACH LAYER OF THE DECODER. CONV_E AND CONV_D REPRESENT THE CONVOLUTION LAYERS PROCESSING THE INPUT FROM THE ENCODER AND DECODER. CONV_1 AND CONV_2 REFER TO THE TWO CONVOLUTION LAYERS AFTER THE CONCATENATION.

No. | conv_e   | conv_d   | conv_1   | conv_2
1   | 3×3, 128 | 3×3, 128 | 3×3, 256 | 3×3, …
2   | 3×3, 128 | 3×3, 128 | 3×3, 256 | 3×3, …
3   | 3×3, 64  | 3×3, 64  | 3×3, 128 | 3×3, …
4   | 3×3, 32  | 3×3, 32  | 3×3, 64  | 3×3, …
5   | 3×3, 32  | 3×3, 32  | 3×3, 64  | 3×3, …
Note that we do not use deconvolution directly because bilinear upsampling performs slightly better than deconvolution. Inspired by [33], we tried to use larger-sized convolutions, such as 5×5 and 7×7, to process the feature maps from the encoder but found that doing so could not improve performance and instead caused overfitting. We performed some experiments to find the most suitable convolution size, and the results are shown in Section IV-D. The encoder feature maps and decoder feature maps are concatenated, and another two 3×3 convolutions are applied to further fuse and extract information from the feature maps. Inspired by the structure of ResNet [52], a residual block is applied to construct the decoder. For each layer of the decoder, we apply a 1×1 convolution to convert the feature map that has been bilinearly upsampled to the same number of channels as the output of the decoder in this layer. Then, this feature map is added to the output of the decoder to obtain the final output. Section IV-D shows the comparison between the performance of models using residual blocks and not using residual blocks. The output of every layer of the decoder passes through an output-guided attention module, whose result is the input of the decoder in the next layer, and also passes through a 3×3 convolution and a Sigmoid function to obtain this layer's output saliency map. Note that the inputs of the output-guided attention module are the saliency maps that have not passed through the Sigmoid function. All convolutions in the decoder are followed by batch normalization and ReLU. The structure of the output-guided model is shown in Fig. 4.

Fig. 6. Some examples of the multiple outputs and the difference map. The outputs of different layers in the network differ in some areas. From the difference maps, we can find that the different areas are usually the boundaries of the objects or where the disturbing objects are located.
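To make the data flow concrete, here is a minimal PyTorch sketch of one decoder layer as described above, assuming the 3×3 kernels of Table I and the 1×1 residual projection of Fig. 5; the module and argument names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, k=3):
    # Every decoder convolution is followed by BN and ReLU (Section III-B).
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class DecoderLayer(nn.Module):
    """Sketch of one decoder layer (Fig. 5): upsample the deeper decoder
    features, process encoder/decoder features separately, concatenate,
    fuse with two more convolutions, and add a 1x1-projected residual."""
    def __init__(self, enc_ch, dec_ch, mid_ch, out_ch):
        super().__init__()
        self.conv_e = conv_bn_relu(enc_ch, mid_ch)    # encoder branch
        self.conv_d = conv_bn_relu(dec_ch, mid_ch)    # decoder branch
        self.conv_1 = conv_bn_relu(2 * mid_ch, 2 * mid_ch)
        self.conv_2 = conv_bn_relu(2 * mid_ch, out_ch)
        self.shortcut = nn.Conv2d(dec_ch, out_ch, 1)  # residual projection

    def forward(self, f_enc, f_dec):
        f_dec = F.interpolate(f_dec, scale_factor=2, mode='bilinear',
                              align_corners=False)    # bilinear upsampling
        x = torch.cat([self.conv_e(f_enc), self.conv_d(f_dec)], dim=1)
        x = self.conv_2(self.conv_1(x))
        return x + self.shortcut(f_dec)               # residual connection
```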
C. Intractable Area F-measure Loss
We observe that in multi-output encoder-decoder neural networks, outputs from different positions with different resolutions have different characteristics. Generally speaking, taking the deeply supervised multi-output network as an example, deeper outputs with low resolutions can capture semantic information better, while shallower outputs with high resolutions concentrate more on spatial features. Some examples of outputs from different positions are shown in Fig. 1. As can be seen from Fig. 1, first, high-resolution output saliency maps are more precise than low-resolution maps at the boundaries of objects; second, there are some interference objects that are easily misjudged, and different outputs make different saliency judgments on them. Both the object boundaries and the interference objects are difficult points for improving the detection accuracy. The judgement ability in these areas is always a significant factor affecting the performance of a salient object detection model.

Thus, we propose a new loss function to promote the model's performance in these areas. We need to find the intractable areas of the images in the training set. First, we apply another dataset with fewer images to train the model for fewer iterations, so the training result is rough. Then, we test the images in the training set utilizing the roughly trained model, and saliency maps with different resolutions are obtained. For an input image I, there are five output saliency maps S_i, i ∈ {1, 2, 3, 4, 5}. We apply S_1, with the largest resolution, and S_5, with the smallest resolution, to calculate the difference map, based on the observation that the difference between high-resolution maps can only show the boundary areas but fails to capture intractable areas such as the disturbing objects. First, we bilinearly upsample these two saliency maps to the resolution of the original image and obtain S'_1 and S'_5. Then, the different areas can be obtained by a pixel-level comparison between S'_1 and S'_5, and the coordinate set C of the different areas can be calculated as follows, for all coordinates (i, j) in S'_1 and S'_5:

(i, j) ∈ C if S'_1(i, j) − S'_5(i, j) ≠ 0
(i, j) ∉ C if S'_1(i, j) − S'_5(i, j) = 0   (5)

After getting the difference maps, we train the model for a second time. For the second training, the saliency score is binarized, and the intractable area F-measure loss is calculated as follows:

L_f = 1 − ((1 + β²) × P_c × R_c) / (β² × P_c + R_c)   (6)

where P_c and R_c represent the precision and recall over area C. The intractable area F-measure loss thus equals 1 minus the F-measure of area C. Its effectiveness can be understood from two aspects. On the one hand, the loss function is designed directly according to the evaluation metric, which has proved useful for improving test results in many computer vision tasks, such as object detection [54] and semantic segmentation [55]. On the other hand, the IAF loss is only calculated on the intractable areas, thus prompting the model to process these areas more effectively and enhancing the generalization ability of the model in dealing with complex images.

Note that the second training is not a fine-tuning of the model gained by the first training. The only purpose of the first training is to obtain the difference maps for the training set used in the second training.
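A minimal sketch of Eqs. (5) and (6) follows, assuming a soft precision/recall relaxation over the disagreement mask (the paper binarizes the saliency scores); the function name, the 0.5 binarization threshold for the rough outputs and the eps constant are our assumptions.

```python
import torch
import torch.nn.functional as F

def intractable_area_f_loss(pred, gt, s1, s5, beta2=0.3, eps=1e-8):
    """pred/gt: (B, 1, H, W) saliency prediction and binary ground truth.
    s1/s5: highest- and lowest-resolution outputs of the roughly trained
    model; the intractable area C is where they disagree (Eq. (5))."""
    size = gt.shape[2:]
    s1 = F.interpolate(s1, size=size, mode='bilinear', align_corners=False)
    s5 = F.interpolate(s5, size=size, mode='bilinear', align_corners=False)
    c = ((s1 > 0.5) != (s5 > 0.5)).float()      # Eq. (5): disagreement mask
    tp = (pred * gt * c).sum()                  # true positives inside C
    precision = tp / ((pred * c).sum() + eps)
    recall = tp / ((gt * c).sum() + eps)
    f = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return 1 - f                                # Eq. (6)
```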
TABLE II
QUANTITATIVE COMPARISON OF MAE, F-MEASURE AND S-MEASURE WITH 15 METHODS ON 5 DATASETS (HKU-IS, ECSSD, SOD, DUT-OMRON, DUTS-TE). A HIGHER F-MEASURE SCORE, A HIGHER S-MEASURE SCORE AND A LOWER MAE SCORE REPRESENT BETTER PERFORMANCE. THE TOP THREE RESULTS ARE HIGHLIGHTED IN RED, GREEN AND BLUE, RESPECTIVELY.
When testing, only the model from the second training is applied to obtain the saliency maps. Thus, our proposed method is end-to-end when testing, although the training involves two processes.
D. Training
Suppose that the multi-output neural network can be divided into M layers and that every layer of the decoder generates an output. Every output produces a loss term. The final loss function can be defined as:

L(I, G, W, w) = β·l_f(I, G, W, w^(1)) + Σ_{m=1}^{M} α_m·l_side^m(I, G, W, w^(m))   (7)

where α_m is the weight of the cross-entropy loss in the m-th layer and β is the weight of the intractable area F-measure loss. I and G represent the input image and its ground truth. Each output is obtained by a separate score function w^(m), and w refers to the set of all score functions:

w = (w^(1), w^(2), ..., w^(M))   (8)

Here, l_f(I, G, W, w^(1)) represents the intractable area F-measure loss function, and l_side^m(I, G, W, w^(m)) refers to the cross-entropy loss function of the m-th output, which can be calculated as follows:

l_side^m(I, G, W, w^(m)) = − Σ_{z=1}^{|I|} G(z)·log P(G(z) = 1 | I(z), W, w^(m)) − Σ_{z=1}^{|I|} (1 − G(z))·log P(G(z) = 0 | I(z), W, w^(m))   (9)

In the output-guided network, M equals 5, so 5 outputs are gained. Instead of fusing these outputs as in [33], which would add additional computation, we directly apply the output of the first layer, which has the highest resolution, as our final saliency score. Considering that the output of the first layer has the highest importance, α_1 is set higher than the weights of the other loss terms.
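The following is a sketch of the total loss of Eqs. (7)-(9), assuming the side outputs are pre-sigmoid logits and the IAF term of Eq. (6) has been computed separately; the concrete weight values are hyperparameters and are not reproduced here.

```python
import torch.nn.functional as F

def total_loss(outputs, gt, alphas, beta, iaf_loss):
    """outputs: list of M pre-sigmoid side outputs; gt: float ground truth
    in {0, 1}; alphas: the per-output weights (alpha_1 set highest);
    iaf_loss: the Eq. (6) value for the first output."""
    loss = beta * iaf_loss
    for alpha_m, out_m in zip(alphas, outputs):
        side = F.binary_cross_entropy_with_logits(
            F.interpolate(out_m, size=gt.shape[2:], mode='bilinear',
                          align_corners=False),
            gt, reduction='sum')        # summed pixel-wise BCE, Eq. (9)
        loss = loss + alpha_m * side
    return loss
```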
IV. EXPERIMENTAL RESULTS

A. Implementation Details
We use the PyTorch framework to train and test our model. All images are resized to a fixed resolution for training and testing. We select SGD with a weight decay of 0.0005 and a momentum of 0.9 as the optimizer. Inspired by [56], we use the 'poly' policy to set the learning rate: for an iteration, its learning rate equals the initial learning rate multiplied by (1 − iter/max_iter)^power.
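A minimal sketch of this 'poly' policy under the settings above; the power value of 0.9 follows the DeepLab convention [56] and is an assumption here, as is the name of the helper.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    # 'Poly' policy: lr = base_lr * (1 - iter/max_iter)^power.
    return base_lr * (1 - iteration / max_iter) ** power

# Sketch of per-iteration usage with the SGD optimizer described above:
# for group in optimizer.param_groups:
#     group['lr'] = poly_lr(base_lr, it, max_iter)
```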
Fig. 7. (a)-(e) are P-R curves on various datasets, including ECSSD, HKU-IS, DUTS-T, DUT-O and SOD. (f)-(i) are F-measure curves on various datasets, including ECSSD, DUTS-T, SOD and HKU-IS.
B. Datasets and Evaluation Metrics

1) Datasets:
Six datasets are used to train and test our model: MSRB [58], DUTS [59], ECSSD [11], DUT-OMRON [60], HKU-IS [39], and SOD [61].
MSRB: This dataset contains 5000 high-quality images with high-precision marks. These images cover abundant species, but their backgrounds are usually simple.
DUTS: This dataset includes 10553 images for training and 5019 images for testing. This dataset's images are characterized by a large quantity, abundant species and high marking quality.
ECSSD: This dataset contains 1000 images with complex backgrounds, and the ground truth of the images in the dataset usually contains very rich semantic information.
DUT-OMRON: This dataset contains 5168 high-quality images. The images in this dataset include one or more salient objects, and their backgrounds are very complicated, so it is relatively more difficult to achieve salient object detection on these images. Hence, it is a significant dataset for determining whether a salient object detection model can perform well on complex images.

Fig. 8. Visual comparison with 9 state-of-the-art methods. (a) Input image; (b) ground truth; (c) ours; (d) PAGRN [30]; (e) SRM [42]; (f) Amulet [18]; (g) UCF [17]; (h) NLDF [19]; (i) KSR [57]; (j) MDF [39]; (k) ELD [20]; (l) LEGS [16]. Our method performs best for images with various characteristics.
HKU-IS: This dataset contains 4447 high-precision labeled images. Images in the dataset often contain many salient objects, and some of these salient objects are located at the edges of the images, which brings a great challenge to salient object detection.
SOD: This dataset contains 300 images. The backgrounds of these images and the shapes of their salient objects are quite complex, making it a very challenging dataset.

We use MSRB to train our model for the first time and then use this model to test the training set of DUTS and obtain the difference maps. Then, the training set of DUTS and the difference maps are used for the second training to obtain the final model. The test set of DUTS and the other datasets are used to test the model.
2) Evaluation Metrics:
We utilize four methods that are extensively applied in the salient object detection field to test our model's performance on the test sets: precision-recall (PR) curves, F-measure, mean absolute error (MAE) and S-measure. The saliency maps are binarized by varying the threshold from 0 to 255, and pairs of precision and recall under different thresholds are computed to plot the PR curve. For the F-measure, the saliency map is binarized with a fixed threshold, which is determined as twice the mean saliency value of the saliency map. The F-measure is calculated as follows:

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)   (11)

Similar to most other methods [17], [18], [42], we set β² to 0.3, making the precision's influence factor larger than that of the recall.

Due to the binarization of the saliency map, the F-measure cannot directly measure the difference between the ground truth and the saliency map obtained by the model. Hence, we also apply MAE, which evaluates the average pixelwise absolute difference between the saliency map and the binary ground truth:

MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |S(x, y) − G(x, y)|   (12)

where W and H are the width and height of the saliency map S, respectively.

The structure measure (S-measure) [62] is a newer evaluation metric that evaluates the region-aware and object-aware structural similarity between saliency maps and ground truth maps. It can be calculated as follows:

S = α·S_o + (1 − α)·S_r   (13)

where S_o and S_r represent the object-aware and region-aware structural similarity, respectively, and α is set to 0.5. A model with a higher F-measure score, lower MAE score and higher S-measure score has better performance.
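For reference, here is a minimal NumPy sketch of the F-measure of Eq. (11) with the adaptive threshold described above and the MAE of Eq. (12); the 0.5 ground-truth binarization and the guards against empty masks are our additions.

```python
import numpy as np

def mae(sal, gt):
    # Eq. (12): mean absolute error; both maps are HxW arrays in [0, 1].
    return np.abs(sal - gt).mean()

def f_measure(sal, gt, beta2=0.3):
    # Eq. (11) with the adaptive threshold used in the paper:
    # binarize the saliency map at twice its mean value.
    binary = sal >= min(2 * sal.mean(), 1.0)
    gt_bin = gt > 0.5
    tp = np.logical_and(binary, gt_bin).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt_bin.sum(), 1)
    denom = beta2 * precision + recall
    return 0.0 if denom == 0 else (1 + beta2) * precision * recall / denom
```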
C. Performance Comparison

We compare our method with 15 state-of-the-art classic salient object detection methods, including LEGS [16], ELD [20], MDF [39], KSR [57], DCL [22], RFCN [21], NLDF [19], DHS [14], UCF [17], Amulet [18], PAGRN [30], SRM [42], C2S [63], RA [15] and PICA [31]. Most of these methods are based on deep learning.

1) Qualitative Evaluation:
Fig. 8 displays the visual comparison between our method and the others. Our method can judge the salient object better and more accurately display the area of the salient object. Our method performs much better than the other methods in the following challenging situations:

(1) When confronting multiple salient objects in an image, our method makes more accurate decisions on the multiple salient objects. As shown in the third line, our method precisely judges all four salient objects.

(2) When the shape of the salient object is complicated, our algorithm still demonstrates the shape of the salient object clearly and favorably. As shown in the fifth line, although the salient object's upper edge contour is quite complex and the rough sketch feature is quite blurry, our method precisely recovers the rough sketch of the salient object and does not generate an erroneous judgement.

(3) Thanks to the introduction of the attention mechanism, when salient objects are surrounded by interferential factors disturbing the saliency judgement, our method is able to perform better and has strong anti-jamming ability. For example, in the first line, the lower left quarter of the salient object is highly similar to the surrounding areas, so it is quite easy to cause an erroneous judgement. Our method makes a very accurate judgement, while most of the other state-of-the-art methods incorrectly judge that area as nonsalient.
2) Quantitative Evaluation:
The PR curves and F-measure curves are shown in Fig. 7. For a PR curve, a higher precision and slower attenuation represent better performance. Compared with the other methods, our method has the best performance on all the datasets.

In Table II, we also compare our method with the state-of-the-art methods in terms of MAE, F-measure and S-measure. For the MAE score, we obtained the best performance on most of the datasets. Although we did not achieve the best MAE performance on SOD and DUT-OMRON, our method remains highly competitive. For the F-measure score, our method performs the best on all the datasets. Compared with the second-ranked method, our method improves the F-measure score by 4.8%, 2.7%, 14.3%, 4.5% and 2.4% on HKU-IS, ECSSD, SOD, DUT-OMRON and DUTS-TE, respectively. For the S-measure score, our method performs best on three datasets and ranks second on the other two datasets.

Taking all the indexes together, in comparison to the other state-of-the-art methods, our method shows the best overall performance. The excellent execution on all datasets demonstrates that our method possesses stronger universality.
3) Memory Comparison:
The algorithms based on deep learning usually require a large amount of computation and a large memory footprint. In general, a deeper neural network can gain better performance, but it is also accompanied by a larger memory footprint and more computation, so it is difficult to apply the model to real-time detection and to use it on mobile terminals, which reduces its practicability. Hence, the size of the neural network model is also one of the significant factors when evaluating a salient object detection algorithm based on deep learning. Fig. 9 shows some methods' model sizes and F-measures on ECSSD. The model sizes of many methods are very large, while those with smaller model sizes usually have mediocre effects. Our method is the only one with a model size of less than 100 MB and an F-measure score higher than 0.9. Our model is lightweight but very effective.

TABLE III
ABLATION EXPERIMENTS ON DUTS AND SOD.

No. | Settings                 | SOD     | DUTS
(a) Comparison of attention modules
1   | baseline                 | 0.13082 | 0.05323
2   | +SE                      | 0.12433 | 0.05232
3   | +CBAM                    | 0.12234 | 0.05185
4   | +OGAM                    | 0.11619 | 0.04895
(b) Comparison of IAF loss
5   | baseline (No. 4 setting) | 0.11619 | 0.04895
6   | BCE loss + IAF loss      | …       | …
(c) Comparison of residual blocks
7   | baseline (No. 4 setting) | 0.11619 | 0.04895
8   | without residual blocks  | …       | …

Fig. 9. Memory comparison with some methods, including Amulet [18], DSS+ [33], MSRNet [64], UCF [17], DHS [14], NLDF [19], RFCN [21], DS [65], DCL+ [22] and ELD [20].
D. Ablation Studies

1) Evaluation of output-guided attention:
As shown in Table III, to verify the effect of the output-guided attention module, we compare the effects of models with and without the output-guided attention module. The experimental results show that the output-guided attention module can greatly improve the model's effect. In addition, we also test the effects of some other types of attention modules on model improvement. Two attention modules are tested: SE [45] and CBAM [49]. Different from the output-guided attention module, SE only uses channel attention, and both SE and CBAM only take the processed feature maps themselves as input. The experimental results show that the effect of using SE alone is not significant enough, while CBAM, utilizing both channel attention and spatial attention, can produce better effects. Our output-guided attention module performs best among these three kinds of attention modules.

Fig. 10. Comparison between the outputs obtained by models applying the IAF loss and not applying the IAF loss. (a) Input image; (b) ground truth; (c) output of the model not applying the IAF loss; (d) output of the model applying the IAF loss.
2) Evaluation of intractable area F-measure loss:
The intractable area F-measure loss is used to upgrade the model's judgement ability when encountering difficult areas. To test its effectiveness, we evaluate the performance of models trained with and without the intractable area F-measure loss on SOD and DUTS-TE, respectively. As shown in Fig. 10, comparing the test results of the two models, the model performs better in the marginal areas of the salient objects and makes a more precise judgment of the difficult areas after utilizing the intractable area F-measure loss. A quantitative analysis is shown in Table III: after utilizing the intractable area F-measure loss, the MAE scores on both datasets decline.
3) Ablation of residual block:
Residual blocks can make a very deep neural network easier to train and improve the effect of the neural network. To test the residual blocks' influence on our model, we eliminate the original residual blocks of the model and then test its performance. The experimental results are shown in Table III. Observing the training process, we find that the model with the residual blocks converges faster and that its final loss value is smaller. The application of the residual blocks slightly raises the model's effects. Thus, we deem that the utilization of residual blocks in our model is beneficial and does not cause overfitting.
4) Selection of convolution size:
The choice of convolution size has a great influence on the performance of convolutional neural networks. DSS [33] uses a large convolution to process the feature maps extracted from the encoder in every layer. Theoretically, a large convolution can increase the receptive field and extract more semantic information, so it is used to process feature maps from encoders, which do not sufficiently extract semantic information compared with those from decoders. Inspired by DSS, we first chose a larger convolution but found that the performance of the model unexpectedly became worse. To determine the most suitable convolution size, we test convolutions of four sizes: 1×1, 3×3, 5×5 and 7×7. The performance of these models is shown in Fig. 11. The lowest MAE score on all the tested datasets is achieved by the model with a 3×3 convolution. The receptive field of a 1×1 convolution is too small to integrate the information extracted from the encoder, and overfitting is caused by the 5×5 and 7×7 convolutions, which are too large. The feature maps extracted from the encoders are mainly used to better restore the shape of the salient objects, so spatial information is more important than semantic information. A large convolution may destroy the spatial information, which is harmful to the accurate display of the salient objects.

Fig. 11. Comparison of the MAE scores of four sizes of convolution on ECSSD and DUTS.
5) Application of output-guided attention in other models:
The output-guided attention module proposed in this paper is a lightweight and universal module that can be used in all multi-output models. We test the effect of the output-guided attention module on some other multi-output models. DSS [33] is a classic salient object detection model with multiple outputs. The original DSS uses two convolutional layers to process each side output, and we add an output-guided attention module after the first convolutional layer. The experimental results are shown in Fig. 12, where the MAE scores on five datasets of the original DSS and the DSS with the output-guided attention module are compared. Compared with the original model, the MAE scores on the five datasets after using the output-guided attention module decrease by 8.9%, 4.6%, 8.0%, 4.2% and 5.1%, respectively.
V. CONCLUSION
In this paper, we proposed a new output-guided attention module. Experimental results show that, compared with other attention modules, the output-guided attention module, constructed from the processed feature maps themselves and the outputs at other resolutions, can reduce errors and achieve better performance. Our proposed model, based on output-guided attention, showed outstanding performance on multiple datasets. Owing to the output-guided attention module, our model has stronger robustness. The proposed intractable area F-measure loss can effectively improve the performance of the model when facing images with complex backgrounds and salient objects with complicated shapes. The improvements brought by the output-guided attention module and the intractable area F-measure loss on other multi-output methods demonstrate that these two methods are universal. We suggest that researchers try the output-guided attention module and the intractable area F-measure loss when constructing other neural networks for salient object detection. We believe that blind overconfidence is a common problem faced by many attention modules in salient object detection and that the output-guided attention module provides a new way to solve this problem. In the future, we will further explore additional ways to solve the problem of 'blind overconfidence'.

Fig. 12. Comparison of the MAE scores on five datasets of the original DSS and DSS* (DSS applying the output-guided attention module and the IAF loss).

REFERENCES
[1] C. Siagian and L. Itti, "Rapid biologically-inspired scene classification using features shared with visual attention," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 300–312, 2007.
[2] Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang, "Region-based saliency detection and its application in object recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 5, pp. 769–779, 2014.
[3] Y. Gao, M. Wang, D. Tao, R. Ji, and Q. Dai, "3-D object retrieval and recognition with hypergraph analysis," IEEE Transactions on Image Processing, vol. 21, no. 9, pp. 4290–4303, 2012.
[4] J. He, J. Feng, X. Liu, T. Cheng, T.-H. Lin, H. Chung, and S.-F. Chang, "Mobile product search with bag of hash bits and boundary reranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3005–3012, 2012.
[5] S. Hong, T. You, S. Kwak, and B. Han, "Online tracking by learning discriminative saliency map with convolutional neural network," in International Conference on Machine Learning, pp. 597–606, 2015.
[6] V. Mahadevan and N. Vasconcelos, "Saliency-based discriminant tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1007–1013, 2009.
[7] C. Ma, Z. Miao, X.-P. Zhang, and M. Li, "A saliency prior context model for real-time object tracking," IEEE Transactions on Multimedia, vol. 19, no. 11, pp. 2415–2424, 2017.
[8] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 733–740, 2012.
[9] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Advances in Neural Information Processing Systems, pp. 545–552, 2007.
[10] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
[11] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1155–1162, 2013.
[12] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: A discriminative regional feature integration approach," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2083–2090, 2013.
[13] Y. Liu, J. Han, Q. Zhang, and L. Wang, "Salient object detection via two-stage graphs," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 4, pp. 1023–1037, 2019.
[14] N. Liu and J. Han, "DHSNet: Deep hierarchical saliency network for salient object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 678–686, 2016.
[15] S. Chen, X. Tan, B. Wang, and X. Hu, "Reverse attention for salient object detection," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 234–250, 2018.
[16] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, "Deep networks for saliency detection via local estimation and global search," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3183–3192, 2015.
[17] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin, "Learning uncertain convolutional features for accurate saliency detection," in IEEE International Conference on Computer Vision (ICCV), pp. 212–221, 2017.
[18] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan, "Amulet: Aggregating multi-level convolutional features for salient object detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 202–211, 2017.
[19] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, and P.-M. Jodoin, "Non-local deep features for salient object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[20] G. Lee, Y.-W. Tai, and J. Kim, "Deep saliency with encoded low level distance map and high level features," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 660–668, 2016.
[21] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, "Saliency detection with recurrent fully convolutional networks," in European Conference on Computer Vision (ECCV), pp. 825–841, Springer, 2016.
[22] G. Li and Y. Yu, "Deep contrast learning for salient object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 478–487, 2016.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2015.
[25] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, pp. 91–99, 2015.
[26] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, 2016.
[27] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017.
[28] G. Lin, A. Milan, C. Shen, and I. Reid, "RefineNet: Multi-path refinement networks for high-resolution semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, p. 3, 2017.
[29] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, "Large kernel matters - improve semantic segmentation by global convolutional network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1743–1751, 2017.
[30] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, "Progressive attention guided recurrent network for salient object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 714–722, 2018.
[31] N. Liu, J. Han, and M.-H. Yang, "PiCANet: Learning pixel-wise contextual attention for saliency detection," arXiv preprint arXiv:1708.06433, 2017.
[32] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440, 2015.
[33] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr, "Deeply supervised salient object detection with short connections," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3203–3212, 2017.
[34] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "ICNet for real-time semantic segmentation on high-resolution images," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 405–420, 2018.
[35] S.-W. Kim, H.-K. Kook, J.-Y. Sun, M.-C. Kang, and S.-J. Ko, "Parallel feature pyramid network for object detection," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 234–250, 2018.
[36] H. Xiao, J. Feng, Y. Wei, M. Zhang, and S. Yan, "Deep salient object detection with dense connections and distraction diagnosis," IEEE Transactions on Multimedia, vol. 20, no. 12, pp. 3239–3251, 2018.
[37] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3166–3173, 2013.
[38] A. Manno-Kovacs, "Direction selective contour detection for salient objects," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 2, pp. 375–389, 2019.
[39] G. Li and Y. Yu, "Visual saliency based on multiscale deep features," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5455–5463, 2015.
[40] R. Zhao, W. Ouyang, H. Li, and X. Wang, "Saliency detection by multi-context deep learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1265–1274, 2015.
[41] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708, 2017.
[42] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu, "A stagewise refinement model for detecting salient objects in images," in IEEE International Conference on Computer Vision (ICCV), pp. 4019–4028, 2017.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[44] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[45] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141, 2018.
[46] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "Learning a discriminative feature network for semantic segmentation," arXiv preprint arXiv:1804.09337, 2018.
[47] J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu, "Dual attention network for scene segmentation," arXiv preprint arXiv:1809.02983, 2018.
[48] Z. Fan, X. Zhao, T. Lin, and H. Su, "Attention-based multiview re-observation fusion network for skeletal action recognition," IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 363–374, 2019.
[49] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, "CBAM: Convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.
[50] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.
[51] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[52] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
[53] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv preprint arXiv:1610.02357, 2017.
[54] L. Tychsen-Smith and L. Petersson, "Improving object localization with fitness NMS and bounded IoU loss," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6877–6885, 2018.
[55] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. J. Cardoso, "Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 240–248, Springer, 2017.
[56] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[57] T. Wang, L. Zhang, H. Lu, C. Sun, and J. Qi, "Kernelized subspace ranking for saliency detection," in European Conference on Computer Vision (ECCV), pp. 450–466, Springer, 2016.
[58] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, "Learning to detect a salient object," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 353–367, 2011.
[59] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, "Learning to detect salient objects with image-level supervision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 136–145, 2017.
[60] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3166–3173, 2013.
[61] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in IEEE International Conference on Computer Vision (ICCV), vol. 2, pp. 416–423, 2001.
[62] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji, "Structure-measure: A new way to evaluate foreground maps," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4548–4557, 2017.
[63] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen, "Contour knowledge transfer for salient object detection," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 355–370, 2018.
[64] G. Li, Y. Xie, L. Lin, and Y. Yu, "Instance-level salient object segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 247–256, 2017.
[65] X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang, "DeepSaliency: Multi-task deep neural network model for salient object detection," IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3919–3930, 2016.
Shiping Zhu (M'05) received the B.Sc. and M.Sc. degrees in measuring and testing technologies and instruments from Xi'an University of Technology, Xi'an, China, in 1991 and 1994, respectively, and the Ph.D. degree in precision instrument and machinery from Harbin Institute of Technology, Harbin, China, in 1997.

From 1997 to 1999, he was a Postdoctoral Fellow with Beihang University, Beijing, China. From 2000 to 2002, he was a Postdoctoral Fellow with the Brain and Cognition Research Center, Université Paul Sabatier, Toulouse, France. From 2002 to 2004, he was a Postdoctoral Fellow with the Department of Computer Science and the Department of Electrical and Computer Engineering, Université de Sherbrooke, Sherbrooke, QC, Canada. Since 2005, he has been an Associate Professor with the Department of Measurement Control and Information Technology, School of Instrumentation Science and Optoelectronics Engineering, Beihang University. He has authored or coauthored more than 80 journal and conference papers. He received the second prize of the National Technological Invention Award in 2013. He is the holder of 50 China invention patents. His current research interests include image processing and video coding, stereo matching, saliency detection and image/video object segmentation.