Deep Texture-Aware Features for Camouflaged Object Detection
Jingjing Ren*, Xiaowei Hu*, Lei Zhu, Xuemiao Xu†, Yangyang Xu, Weiming Wang, Zijun Deng, and Pheng-Ann Heng
South China University of Technology, The Chinese University of Hong Kong, The Open University of Hong Kong
* Jingjing Ren and Xiaowei Hu are the joint first authors of this work. † Corresponding author ([email protected])
Abstract
Camouflaged object detection is a challenging task that aims to identify objects having similar texture to the surroundings. This paper proposes to amplify the subtle texture difference between camouflaged objects and the background for camouflaged object detection by formulating multiple texture-aware refinement modules to learn texture-aware features in a deep convolutional neural network. The texture-aware refinement module computes the covariance matrices of feature responses to extract the texture information, designs an affinity loss to learn a set of parameter maps that help to separate the texture between camouflaged objects and the background, and adopts a boundary-consistency loss to explore the object detail structures. We evaluate our network on the benchmark dataset for camouflaged object detection both qualitatively and quantitatively. Experimental results show that our approach outperforms various state-of-the-art methods by a large margin.
1. Introduction
In nature, animals try to conceal themselves by adapting the texture of their bodies to the texture of the surroundings, which helps them avoid being recognized by predators. This strategy easily deceives the visual perceptual system [34], and current vision algorithms may fail to distinguish camouflaged objects from the background. Hence, camouflaged object detection [7] has been a great challenge, and solving this problem could benefit many applications in computer vision, such as polyp segmentation [8], lung infection segmentation [9, 36], photo-realistic blending [11], and recreational art [4].

To solve this problem, Fan et al. [7] collect the first large-scale dataset for camouflaged object detection. The dataset includes 10,000 images with a large number of object categories in various natural scenes, which makes it possible to apply deep learning algorithms to learn to recognize camouflaged objects from large data. Camouflaged objects usually have similar texture to the surroundings, and deep learning algorithms designed for generic object detection [28, 27, 12, 2, 40, 29, 3] and salient object detection [42, 41] generally do not perform well in detecting camouflaged objects in such difficult situations. Recently, algorithms for camouflaged object detection based on deep neural networks have been proposed. SINet [7] first adopts a search module to find the candidate regions of camouflaged objects and then uses an identification module to precisely detect camouflaged objects. ANet [19] first leverages a classification network to identify whether the image contains camouflaged objects or not, and then adopts a fully convolutional network for camouflaged object segmentation. However, these algorithms may still mistake camouflaged objects for the background due to their similar texture.

Texture refers to a particular way in which visual primitives are organized spatially in natural images [?]. Essentially, there are subtle differences in the texture between camouflaged objects and the background. As shown in Figure 1 (a), the texture of the fish involves a combination of dense and small white particles and brown regions, while the texture of the background has a combination of white and brown regions. Based on this observation, we propose to amplify the texture difference between camouflaged objects and the background by learning texture-aware features from a deep neural network, thus improving the performance of camouflaged object detection; see Figure 1 (c)-(e).

To achieve this, we design the texture-aware refinement module (TARM) in a deep neural network, where we first compute the covariance matrices of feature responses to extract the texture information from the convolutional features, and then learn a set of affinity functions to amplify the texture difference between camouflaged objects and the background. Moreover, we design a boundary-consistency loss in TARM to improve the segmentation quality by revisiting the image patches across boundaries on high-resolution feature maps. After that, we adopt multiple
TARMs in a deep network (TANet) to learn deep texture-aware features in different layers and predict a detection map per layer for camouflaged object detection. Finally, we qualitatively and quantitatively compare our approach with state-of-the-art methods designed for camouflaged object detection, salient object detection, and semantic segmentation on the benchmark dataset, showing the superiority of our network.

Figure 1: Visualization results of the reconstructed images from the original convolutional features and the texture-aware features produced from our TARM. (a) input image; (b) ground truth; (c) convolutional feature; (d) texture-aware feature; (e) our result.

We summarize the contributions of this work as follows:
• First, we design a novel texture-aware refinement module (TARM) to amplify the texture difference between camouflaged objects and the background, which significantly enhances camouflaged object recognition.
• Second, we design a boundary-consistency loss to enhance detail information across boundaries without extra computation overhead in testing.
• Third, we evaluate our network on the benchmark dataset and compare it with state-of-the-art methods on camouflaged object detection, salient object detection, and semantic segmentation. Qualitative and quantitative results show that our approach outperforms previous methods by a large margin.
2. Related Work
Camouflaged object detection.
Early work for camouflaged object detection (COD) adopted various hand-crafted features, e.g., color [15, 31], convex intensity [35], edge [31], and texture [1, 17]. Recently, deep convolutional neural networks have achieved great success with the help of large-scale camouflaged object datasets [32, 19, 7]. SINet [7] finds the candidate regions of camouflaged objects with a search module and precisely detects camouflaged objects through an identification module. ANet [19] first identifies the images that contain camouflaged objects through a classification network and then adopts a fully convolutional network for camouflaged object segmentation. These methods, however, do not consider the subtle texture difference between camouflaged objects and the background, and may fail to detect camouflaged objects in complex situations.
Salient object detection and semantic segmentation.
Salient object detection (SOD) predicts a binary mask to indicate the saliency regions, while semantic segmentation (SS) aims to generate masks with category labels to identify the image regions of different classes. The deep-learning-based methods designed for SOD [21, 26, 37, 41, 42] and SS [12, 43, 40, 2, 14] can be employed for camouflaged object detection by retraining them on camouflaged object datasets, and we compare our network with these methods for camouflaged object detection on the benchmark dataset; see Section 4.

Figure 2: The schematic illustration of the overall network architecture (TANet) to learn the deep texture-aware features for camouflaged object detection. Given the input image, we use a feature extractor to produce the feature maps with multiple resolutions. At each layer, the feature map is first refined by the residual refine block (RRB) and then further enhanced by our texture-aware refinement module (TARM) in a texture-aware manner. We adopt the predicted mask with the highest resolution as the final output of our network.

Figure 3: The schematic illustration of the texture-aware refinement module (TARM). We compute the deep texture-aware features by formulating covariance matrices to extract the texture features in different aspects and learning the parameter maps through affinity and boundary-consistency losses to amplify the texture difference between the camouflaged objects and their surroundings.
3. Methodology
Figure 2 shows the overall network architecture (TANet) with texture-aware refinement modules (TARM) for camouflaged object detection. Given the input image, we adopt a feature extractor to extract feature maps with multiple resolutions, and then use the residual refine blocks (RRB) to refine the feature maps at different layers to enhance the fine details and remove the background noise. We do not refine the feature maps at the first layer due to the large memory footprint. Next, we present the texture-aware refinement module (TARM) to learn the texture-aware features, which help to improve the visibility of camouflaged objects. Lastly, we predict the binary masks that indicate the camouflaged objects at each layer by adding supervision signals at multiple layers.

In the following subsections, we elaborate on the texture-aware refinement module (Section 3.1) in detail, present the training and testing strategies in Section 3.2, and visualize the learned texture-aware features in Section 3.3.
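To make the overall data flow concrete, the following PyTorch sketch wires a ResNeXt50 backbone, RRBs, and TARMs into per-layer prediction heads. It is a minimal illustration of the pipeline described above, not our released implementation: the `make_rrb`/`make_tarm` constructors, the shared channel width, and the exact choice of refined stages are assumptions made for readability.

```python
import torch.nn as nn
import torchvision


class TANetSketch(nn.Module):
    """Illustrative skeleton: backbone -> RRB -> TARM -> per-layer prediction maps."""

    def __init__(self, make_rrb, make_tarm, mid_channels=256):
        super().__init__()
        backbone = torchvision.models.resnext50_32x4d(pretrained=True)
        # Four residual stages give feature maps at four resolutions.
        self.stages = nn.ModuleList([
            nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                          backbone.maxpool, backbone.layer1),   # 256 channels
            backbone.layer2,                                     # 512 channels
            backbone.layer3,                                     # 1024 channels
            backbone.layer4,                                     # 2048 channels
        ])
        stage_channels = [512, 1024, 2048]  # the first stage is not refined (memory footprint)
        self.rrbs = nn.ModuleList([make_rrb(c, mid_channels) for c in stage_channels])
        self.tarms = nn.ModuleList([make_tarm(mid_channels) for _ in stage_channels])
        self.heads = nn.ModuleList([nn.Conv2d(mid_channels, 1, 1) for _ in stage_channels])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        preds = []
        for f, rrb, tarm, head in zip(feats[1:], self.rrbs, self.tarms, self.heads):
            f = tarm(rrb(f))        # refine details, then amplify texture differences
            preds.append(head(f))   # one supervised detection map per refined layer
        return preds                # the highest-resolution map serves as the final output
```

All per-layer maps are supervised during training, while only the highest-resolution prediction is used at test time.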
3.1. Texture-Aware Refinement Module

As shown in Figure 1, the camouflaged objects have similar texture to the surroundings. However, there still exist subtle texture differences between the camouflaged objects and the background. Hence, we present a texture-aware refinement module to extract texture information and amplify the texture difference between camouflaged objects and the background, thus improving the performance of camouflaged object detection.
Figure 3 shows the architecture of the proposed texture-aware refinement module. First, it takes a feature map of size C × H × W as the input, and then uses multiple convolution operations to obtain multiple kinds of feature maps [33], each with the size of C₁ × H × W. Note that C₁ is smaller than C for computational efficiency, and these feature maps are used to learn multiple aspects of textures in the following operations. Next, we compute the covariance matrix among the feature channels at each position to capture the correlations between different responses on the convolutional features. The covariance matrix among features measures the co-occurrence of features, describes the combination of features, and is used to represent texture information [16, 10]. As shown in Figure 3 (b), for each pixel f^k_m on the feature map f^k (of size C₁ × H × W), we compute its covariance matrix as the product of f^k_m and its transpose, i.e., f^k_m (f^k_m)^T. Since the covariance matrix (C₁ × C₁) has the property of diagonal symmetry, we adopt only the upper triangle of this matrix to represent the texture feature, and reshape the result into a feature vector. We perform the same operations for each pixel on the feature map and concatenate the results to obtain g^k of size (C₁(C₁+1)/2) × H × W, which contains the texture information. Then, we fuse all the covariance matrices computed from different feature maps by a convolution layer. After that, we adopt two sets of convolutions on the texture features to learn two parameter maps γ and β, each of size C' × H × W (C' denotes the channel number), which are used to amplify the texture difference between camouflaged objects and their surroundings by adjusting the texture of the input features f_in [13, 24]. Finally, we obtain the output feature f_out by:

f_out = conv( γ (f'_in − µ(f'_in)) / σ(f'_in) + β ) + f_in ,   (1)

where f'_in is obtained by applying a convolution on f_in, and µ(f'_in) and σ(f'_in) are the mean and variance of f'_in, which are used to normalize the feature map. Finally, we add the original feature map to the feature map refined by the above operations as the output of our texture-aware refinement module.

Figure 4: The schematic illustration of how the boundary-consistency loss works. The affinity loss is computed within small patches across the boundary, which is used to enhance detail information.

To make the parameter maps γ and β capture the texture difference between the camouflaged objects and the background, we adopt the affinity loss [39] on γ and β to explicitly amplify the difference between their texture features. As shown in Figure 3 (c), we first use pooling operations to downsample the parameter map and then compute the affinity matrix A^h_{m,n} at the positions m, n:

A^h_{m,n} = (h_m)^T h_n / (‖h_m‖ ‖h_n‖) ,   (2)

where h_m and h_n are the parameter vectors of the downsampled map, and A^h is the resulting matrix that captures the pair-wise texture similarity. Next, we calculate the ground-truth affinity matrix A^gt by:

A^gt_{m,n} = 2 × 1{C_m = C_n} − 1 ,   (3)

where 1{·} is an indicator, which is equal to one when the labels (C_m and C_n) of positions m and n are the same, and otherwise is equal to zero.
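Before turning to how the affinity loss is balanced, the following sketch illustrates the TARM forward pass described above: per-pixel covariance (upper-triangle) texture descriptors from several low-dimensional branches, fused and mapped to the parameter maps γ and β that modulate the input feature as in Eq. (1). This is a minimal sketch under stated assumptions: the branch count and width, the use of 1×1 convolutions, and the instance-wise statistics for µ and σ are illustrative choices, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn


class TARMSketch(nn.Module):
    """Minimal sketch of the texture-aware refinement module (TARM)."""

    def __init__(self, in_channels, branch_channels=8, num_branches=4):
        super().__init__()
        c1 = branch_channels
        tex_channels = num_branches * c1 * (c1 + 1) // 2
        # Branches that project the input into several low-dimensional feature maps.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, c1, 1) for _ in range(num_branches)])
        self.fuse = nn.Conv2d(tex_channels, in_channels, 1)    # fuse covariance features
        self.to_gamma = nn.Conv2d(in_channels, in_channels, 1)
        self.to_beta = nn.Conv2d(in_channels, in_channels, 1)
        self.pre = nn.Conv2d(in_channels, in_channels, 1)      # produces f'_in
        self.post = nn.Conv2d(in_channels, in_channels, 1)

    @staticmethod
    def pixelwise_covariance(f):
        """Per-pixel product of channel responses; keep only the upper triangle."""
        b, c, h, w = f.shape
        v = f.flatten(2)                                     # (B, C1, H*W)
        outer = torch.einsum('bip,bjp->bijp', v, v)          # (B, C1, C1, H*W)
        iu = torch.triu_indices(c, c, device=f.device)
        return outer[:, iu[0], iu[1], :].view(b, -1, h, w)   # (B, C1(C1+1)/2, H, W)

    def forward(self, f_in):
        tex = torch.cat([self.pixelwise_covariance(branch(f_in))
                         for branch in self.branches], dim=1)
        tex = self.fuse(tex)
        gamma, beta = self.to_gamma(tex), self.to_beta(tex)   # texture-driven parameter maps
        f = self.pre(f_in)
        mu = f.mean(dim=(2, 3), keepdim=True)                 # assumed instance-wise statistics
        sigma = f.std(dim=(2, 3), keepdim=True) + 1e-5
        return self.post(gamma * (f - mu) / sigma + beta) + f_in   # Eq. (1), residual form
```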
Figure 5: Visual comparison results of our method with and without the boundary-consistency loss (input, ground truth, w/o edge, with edge).

In natural images, camouflaged objects usually occupy small regions, and we formulate the affinity loss as follows to alleviate the class imbalance problem:

L_aff = Σ_m Σ_n w_m w_n d(A^h_{m,n}, A^gt_{m,n}) ,   (4)

where w_m = 1 − N_{C_m}/(H'W') and w_n = 1 − N_{C_n}/(H'W'); N_{C_m} and N_{C_n} are the numbers of pixels that have the same class label as pixels m and n, respectively; and H' and W' are the height and width of the parameter map. From this loss function, we can see that the parameter maps will learn to maximize the texture difference between camouflaged objects and the background.
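A minimal sketch of the class-balanced affinity loss of Eqs. (2)-(4): the parameter map is pooled, pairwise cosine similarities are compared against the ±1 ground-truth affinities, and each position is down-weighted according to how common its class is. The pooled size, the max-pooling of labels, and the squared-difference choice for d(·,·) are assumptions made for illustration, since the text does not pin them down.

```python
import torch
import torch.nn.functional as F


def affinity_loss(param_map, gt_mask, pooled_size=32):
    """Sketch of the class-balanced affinity loss (Eqs. (2)-(4)).

    param_map: (B, C, H, W) texture parameter map (gamma or beta).
    gt_mask:   (B, 1, H, W) binary ground-truth mask of camouflaged objects.
    """
    # Downsample both the map and the labels to keep the affinity matrix small.
    h = F.adaptive_avg_pool2d(param_map, pooled_size)           # (B, C, H', W')
    labels = F.adaptive_max_pool2d(gt_mask.float(), pooled_size)
    b, c, hp, wp = h.shape
    n = hp * wp

    v = F.normalize(h.flatten(2), dim=1)                        # unit-length vectors h_m
    affinity = torch.bmm(v.transpose(1, 2), v)                  # Eq. (2): cosine similarity, (B, N, N)

    lab = (labels.flatten(2).squeeze(1) > 0.5).float()           # (B, N) class label per position
    same = (lab.unsqueeze(2) == lab.unsqueeze(1)).float()
    gt_affinity = 2.0 * same - 1.0                               # Eq. (3): +1 if same class, -1 otherwise

    # Eq. (4): weight each position by one minus the fraction of pixels sharing its label.
    num_fg = lab.sum(dim=1, keepdim=True)                        # foreground pixel count
    class_count = lab * num_fg + (1.0 - lab) * (n - num_fg)      # N_{C_m} for every position
    w = 1.0 - class_count / n                                    # (B, N)
    pair_w = w.unsqueeze(2) * w.unsqueeze(1)                     # w_m * w_n

    # d(.,.) is not specified in the text; a squared difference is used here as one choice.
    return (pair_w * (affinity - gt_affinity) ** 2).sum(dim=(1, 2)).mean()
```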
Boundary-Consistency Loss. The convolutional features contain highly semantic information but tend to produce blurry boundaries between the camouflaged objects and the background due to the small resolutions of the parameter maps. To solve this issue, we present a boundary-consistency loss L_edge to improve the boundary quality by revisiting the prediction results across boundary regions:

L_edge = Σ_{b_i ∈ B} Σ_{m,n ∈ b_i} d(A^h_{m,n}, A^gt_{m,n}) ,   (5)

where b_i is the i-th image patch in B, which contains all the image patches that cross the boundary, as shown in Figure 4. These image patches are selected when they contain pixels that belong to different categories. Note that, unlike the affinity loss, L_edge is computed on parameter maps without downsampling operations, and the parameter maps with higher resolutions help to provide more detailed information for boundary prediction. The additional memory consumption during training is reasonable since we only consider affinity relationships within small patches, and no extra computational time is introduced in the testing process. The comparison results in Figure 5 show that our method with the boundary-consistency loss better preserves the detailed structures of camouflaged objects.

3.2. Training and Testing Strategies

The overall loss function is defined as

L = λ₁ L_seg + λ₂ L_aff + λ₃ L_edge ,   (6)

where L_seg is the binary cross-entropy loss used for camouflaged object segmentation. We empirically set two of these weights to 1 and the remaining one to 10 to balance the numerical magnitudes. We implemented our network in PyTorch and used ResNeXt50 [38] as the backbone network, which was pre-trained on ImageNet [30]. We used stochastic gradient descent (SGD) with the poly learning rate strategy [22] to optimize the network, and the training process was stopped after a fixed number of epochs. It took about two hours to train the network on a 1080Ti GPU. The input images were resized to a fixed resolution. In testing, we adopted the prediction mask with the highest resolution as the final result, and processing one image took about 0.03s.

3.3. Visualization of Texture-Aware Features

To visualize the learned texture-aware features, we adopt a decoder [18] to reconstruct the images from the feature maps before and after being refined by the texture-aware refinement module (TARM); see Section 3.1. Note that during the visualization process, the weights in our TANet are fixed and we only train the newly added decoder to reconstruct the original images. Figures 1 and 6 show the visualization results: after adopting our TARM to refine the features, the texture differences between camouflaged objects and the background are clearly amplified, which helps to improve the overall performance of camouflaged object detection.
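For completeness, here is a sketch of the boundary-consistency loss of Eq. (5) and the overall objective of Eq. (6), reusing the `affinity_loss` sketch given earlier. Selecting patches that contain both classes via unfolding, the patch size, and the loss weights in `total_loss` are illustrative assumptions rather than the exact settings of the paper.

```python
import torch.nn.functional as F


def boundary_consistency_loss(param_map, gt_mask, patch_size=8):
    """Sketch of the boundary-consistency loss (Eq. (5)): the affinity loss is
    re-evaluated inside small full-resolution patches that straddle the boundary."""
    b, c, h, w = param_map.shape
    p = patch_size
    # Split the parameter map and the labels into non-overlapping patches.
    patches = F.unfold(param_map, kernel_size=p, stride=p)           # (B, C*p*p, L)
    labels = F.unfold(gt_mask.float(), kernel_size=p, stride=p)      # (B, p*p, L)
    patches = patches.view(b, c, p * p, -1).permute(0, 3, 1, 2)       # (B, L, C, p*p)
    labels = (labels > 0.5).float().permute(0, 2, 1)                  # (B, L, p*p)

    # A patch crosses the boundary iff it contains both foreground and background pixels.
    frac = labels.mean(dim=2)
    cross = (frac > 0) & (frac < 1)                                    # (B, L)
    if not cross.any():
        return param_map.sum() * 0.0

    v = F.normalize(patches[cross], dim=1)                             # (M, C, p*p)
    affinity = torch.bmm(v.transpose(1, 2), v)                         # pairwise similarity inside a patch
    lab = labels[cross]                                                # (M, p*p)
    gt_affinity = 2.0 * (lab.unsqueeze(2) == lab.unsqueeze(1)).float() - 1.0
    return ((affinity - gt_affinity) ** 2).mean()


def total_loss(pred, gamma, beta, gt_mask, w_seg=1.0, w_aff=1.0, w_edge=1.0):
    """Eq. (6): weighted sum of segmentation, affinity, and boundary-consistency terms.
    The weights here are placeholders; see the text for how they are balanced."""
    l_seg = F.binary_cross_entropy_with_logits(pred, gt_mask.float())
    l_aff = affinity_loss(gamma, gt_mask) + affinity_loss(beta, gt_mask)
    l_edge = boundary_consistency_loss(gamma, gt_mask) + boundary_consistency_loss(beta, gt_mask)
    return w_seg * l_seg + w_aff * l_aff + w_edge * l_edge
```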
Table 1: Comparison with state-of-the-art methods for camouflaged object detection on three benchmark datasets.

Method | Year | CHAMELEON (S_α↑, E_φ↑, F^w_β↑, M↓) | CAMO-Test (S_α↑, E_φ↑, F^w_β↑, M↓) | COD10K-Test (S_α↑, E_φ↑, F^w_β↑, M↓)
FPN [20] | 2017 | 0.794, 0.783, 0.590, 0.075 | 0.684, 0.677, 0.483, 0.131 | 0.697, 0.691, 0.411, 0.075
MaskRCNN [12] | 2017 | 0.643, 0.778, 0.518, 0.099 | 0.574, 0.715, 0.430, 0.151 | 0.613, 0.748, 0.402, 0.080
PSPNet [40] | 2017 | 0.773, 0.758, 0.555, 0.085 | 0.663, 0.659, 0.455, 0.139 | 0.678, 0.680, 0.377, 0.080
UNet++ [43] | 2018 | 0.695, 0.762, 0.501, 0.094 | 0.599, 0.653, 0.392, 0.149 | 0.623, 0.672, 0.350, 0.086
PiCANet [21] | 2018 | 0.769, 0.749, 0.536, 0.085 | 0.609, 0.584, 0.356, 0.156 | 0.649, 0.643, 0.322, 0.090
MSRCNN [14] | 2019 | 0.637, 0.686, 0.443, 0.091 | 0.617, 0.669, 0.454, 0.133 | 0.641, 0.706, 0.419, 0.073
BASNet [26] | 2019 | 0.687, 0.721, 0.474, 0.118 | 0.618, 0.661, 0.413, 0.159 | 0.634, 0.678, 0.365, 0.105
PFANet [42] | 2019 | 0.679, 0.648, 0.378, 0.144 | 0.659, 0.622, 0.391, 0.172 | 0.636, 0.618, 0.286, 0.128
CPD [37] | 2019 | 0.853, 0.866, 0.706, 0.052 | 0.726, 0.729, 0.550, 0.115 | 0.747, 0.770, 0.508, 0.059
HTC [2] | 2019 | 0.517, 0.489, 0.204, 0.129 | 0.476, 0.442, 0.174, 0.172 | 0.548, 0.520, 0.221, 0.088
EGNet [41] | 2019 | 0.848, 0.870, 0.702, 0.050 | 0.732, 0.768, 0.583, 0.104 | 0.737, 0.779, 0.509, 0.056
ANet-SRM [19] | 2019 | ‡ | | ‡
SINet [7] | 2020 | 0.869, 0.891, 0.740, 0.044 | 0.751, 0.771, 0.606, 0.100 | 0.771, 0.806, 0.551, 0.051
TANet v1 | - | 0.881, 0.907, 0.773, 0.039 | 0.778, 0.813, 0.659, 0.089 | 0.794, 0.838, 0.613, 0.043
TANet (ours) | - | | |
Table 2: Ablation study results. Here, we compare the quantitative results of our full pipeline and baseline networks on threebenchmark datasets.
Method | CHAMELEON (S_α↑, E_φ↑, F^w_β↑, M↓) | CAMO-Test (S_α↑, E_φ↑, F^w_β↑, M↓) | COD10K-Test (S_α↑, E_φ↑, F^w_β↑, M↓)
M_basic [20] | 0.856, 0.866, 0.710, 0.050 | 0.763, 0.783, 0.621, 0.097 | 0.772, 0.797, 0.557, 0.049
M_basic+RRB | 0.862, 0.885, 0.733, 0.046 | 0.774, 0.807, 0.649, 0.091 | 0.782, 0.813, 0.583, 0.046
M_basic+RRB+TARM w/o BCL | 0.879, 0.909, 0.767, 0.039 | 0.782, 0.820, 0.675, 0.087 | 0.796, 0.840, 0.618, 0.042
Ours (TANet) | | |
4. Experimental Results
We employ three widely-used COD benchmark datasets to test each COD method. COD10K [7] is the largest annotated dataset, with 6,000 training images and 4,000 testing images. CAMO [19] includes 1,000 training images and 250 testing images, while CHAMELEON [32] consists of 76 images. The same training dataset used by the most recent COD method [7] is employed to train our network for fair comparisons. The training set consists of the training images of COD10K (3,040 images), the training images of CPD1K, and the training images of CAMO. We test different COD methods on the testing set (COD10K-Test) of COD10K, the testing set (CAMO-Test) of CAMO, and the whole CHAMELEON dataset. We shall release our code, the trained network model, and the predicted COD maps of our method upon the publication of this work.
To conduct quantitative comparisons, we employ four common metrics to evaluate different COD methods: Structure-measure [5] (S_α), Enhanced-alignment measure [6] (E_φ), weighted F-measure [23] (F^w_β), and MAE [25]; see SINet [7] for the definitions of these four evaluation metrics. Overall, a better COD method has larger S_α, E_φ, and F^w_β scores, but a smaller MAE score.

We compare our method against 13 cutting-edge methods, including (1) FPN [20], (2) MaskRCNN [12], (3) PSPNet [40], (4) UNet++ [43], (5) PiCANet [21], (6) MSRCNN [14], (7) BASNet [26], (8) PFANet [42], (9) CPD [37], (10) HTC [2], (11) EGNet [41], (12) ANet-SRM [19], and (13) SINet [7]. Note that SINet has reported the quantitative results and released the camouflaged object maps predicted by all compared COD methods. Hence, we use these public results for conducting fair comparisons.
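Among the four metrics, MAE is straightforward to reproduce; a minimal sketch is given below, while the other three measures involve more machinery and we defer to the implementations referenced in SINet [7].

```python
import numpy as np


def mae(pred, gt):
    """Mean absolute error between a predicted map in [0, 1] and a binary ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return np.abs(pred - gt).mean()

# Example: a perfect prediction gives MAE 0; a uniformly grey map gives 0.5 on any binary mask.
```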
Quantitative comparisons.
Table 1 summarizes the quantitative results of different COD methods on three benchmark datasets. Apparently, SINet, as a dedicated COD method, has superior performance on the four metrics over the other semantic segmentation methods and saliency detectors on the three COD benchmark datasets. Compared to SINet, our method has larger S_α, E_φ, and F^w_β scores, but a smaller M score, which demonstrates that our method can more accurately identify camouflaged objects. Specifically, our method achieves a 3.98% improvement on the average S_α, a 5.21% improvement on the average E_φ, an 11.41% improvement on the average F^w_β, and an 18.26% improvement on the average M over the three benchmark datasets. We also re-train our method with ResNet-50 as the backbone network and report the results as "TANet v1" in Table 1, where our method still achieves the best performance.

Figure 7: Visual comparison of camouflaged object detection maps produced by different methods. (a) input images; (b) ground truths; camouflaged object detection maps produced by (c) our method, (d) SINet [7], (e) EGNet [41], (f) HTC [2], (g) CPD [37], (h) PFANet [42], and (i) BASNet [26]. Apparently, our method can better identify camouflaged objects than all the compared detectors, and our prediction results (c) are more consistent with the ground-truth images.

Visual comparisons.
Figure 7 visually compares the COD maps produced by our network and the compared methods. Apparently, all compared methods tend to include many non-camouflaged regions or neglect parts of camouflaged objects in their COD maps. On the contrary, our method can more accurately detect camouflaged objects, and our results (see Figure 7 (c)) are most consistent with the ground truths shown in Figure 7 (b).
More visual comparisons can be found in the supplementary material.
Ablation study. We conduct ablation study experiments to verify the effectiveness of the TARMs and the boundary-consistency loss; see Figure 3. The first baseline ("basic") removes all RRBs and all TARMs from our network. The second baseline ("basic+RRB") adds RRBs into "basic" to merge features at adjacent CNN layers; it is equivalent to removing all TARMs from our network. The third baseline ("basic+RRB+TARM w/o BCL") adds TARMs without the boundary-consistency loss into the second baseline. Compared with these baselines, our full method is equivalent to adding the boundary-consistency loss of the TARMs into the third baseline.

Figure 8: Visual comparison of camouflaged object detection maps produced by our method and baseline networks. (a) input images; (b) ground truths; camouflaged object detection maps produced by (c) our method with the full pipeline, and (d)-(g) the baseline networks (see Table 2). Apparently, our full pipeline can better identify camouflaged objects than all the compared detectors.

Figure 8 shows the visual comparison results of camouflaged object detection maps produced by our full pipeline and the baseline methods, demonstrating that our full pipeline better identifies camouflaged objects than the others. Table 2 compares the results of our network and the baseline networks on three benchmark datasets, i.e., CHAMELEON, CAMO-Test, and COD10K-Test.

Effectiveness of RRBs.
From the results in Table 2, "basic+RRB" achieves better metric results than "basic", indicating that the RRBs help our network detect camouflaged objects.
Effectiveness of the affinity loss of TARMs w/o BCL.
As shown in Table 2, "basic+RRB+TARM w/o BCL" has larger S_α, E_φ, and F^w_β scores and a smaller M score than "basic+RRB" on the three benchmark datasets, demonstrating that the affinity loss enables our network to better capture the texture difference between camouflaged objects and the background and thus benefits camouflaged object detection.

Effectiveness of the boundary-consistency loss of TARMs.
The superior metric results of our method over "basic+RRB+TARM w/o BCL" on the three benchmark datasets indicate that computing the boundary-consistency loss in TARMs can further improve the camouflaged object detection accuracy of our network by enhancing the consistency across boundaries.
5. Conclusion
This paper designs a novel deep network architecture for camouflaged object detection by learning deep texture-aware features. Our key idea is to amplify the texture difference between camouflaged objects and their surroundings, thus helping to identify camouflaged objects from the background. In our network, we design a texture-aware refinement module (TARM) that computes the covariance matrices of feature responses among feature channels to represent the texture structures and adopts the affinity loss to learn a set of parameter maps that perform a linear transformation of the convolutional features to separate the texture between camouflaged objects and the background. A boundary-consistency loss is further proposed to learn the object details. In this way, we can obtain deep texture-aware features and formulate TANet by embedding multiple TARMs in a deep neural network for the task. In the end, we evaluate our method on the benchmark dataset, compare it with various state-of-the-art methods, and show the superiority of our method both qualitatively and quantitatively. In the future, we will explore the potential of our network for generic image segmentation in more complex environments, especially for objects having similar color to the background.
References
[1] Nagappa U. Bhajantri and P. Nagabhushan. Camouflage defect identification: a novel approach. In International Conference on Information Technology, pages 145–148, 2006.
[2] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 834–848, 2017.
[4] Hung-Kuo Chu, Wei-Hsin Hsu, Niloy J. Mitra, Daniel Cohen-Or, Tien-Tsin Wong, and Tong-Yee Lee. Camouflage images. ACM Transactions on Graphics, 29(4), 2010.
[5] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In IEEE International Conference on Computer Vision, pages 4548–4557, 2017.
[6] Deng-Ping Fan, Cheng Gong, Yang Cao, Bo Ren, Ming-Ming Cheng, and Ali Borji. Enhanced-alignment measure for binary foreground map evaluation. arXiv preprint arXiv:1805.10421, 2018.
[7] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2777–2787, 2020.
[8] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. PraNet: Parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention, pages 263–273. Springer, 2020.
[9] Deng-Ping Fan, Tao Zhou, Ge-Peng Ji, Yi Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. Inf-Net: Automatic COVID-19 lung infection segmentation from CT images. IEEE Transactions on Medical Imaging, 2020.
[10] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[11] Shiming Ge, Xin Jin, Qiting Ye, Zhao Luo, and Qiang Li. Image editing by object-aware optimal boundary searching and mixed-domain composition. Computational Visual Media, 4(1):71–82, 2018.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[13] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[14] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring R-CNN. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6409–6418, 2019.
[15] Iván Huerta, Daniel Rowe, Mikhail Mozerov, and Jordi Gonzàlez. Improving background subtraction based on a casuistry of colour-motion segmentation problems. In Iberian Conference on Pattern Recognition and Image Analysis, pages 475–482, 2007.
[16] Levent Karacan, Erkut Erdem, and Aykut Erdem. Structure-preserving image smoothing via region covariances. ACM Transactions on Graphics, 32(6):1–11, 2013.
[17] Ch. Kavitha, B. Prabhakara Rao, and A. Govardhan. An efficient content based image retrieval using color and texture of image sub-blocks. International Journal of Engineering Science and Technology, 3(2):1060–1068, 2011.
[18] Junho Kim, Minjae Kim, Hyeonwoo Kang, and Kwanghee Lee. U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv preprint arXiv:1907.10830, 2019.
[19] Trung-Nghia Le, Tam V. Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto. Anabranch network for camouflaged object segmentation. Computer Vision and Image Understanding, pages 45–56, 2019.
[20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[21] Nian Liu, Junwei Han, and Ming-Hsuan Yang. PiCANet: Learning pixel-wise contextual attention for saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3089–3098, 2018.
[22] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
[23] Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal. How to evaluate foreground maps? In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2014.
[24] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
[25] Federico Perazzi, Philipp Krähenbühl, Yael Pritch, and Alexander Hornung. Saliency filters: Contrast based filtering for salient region detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 733–740, 2012.
[26] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. BASNet: Boundary-aware salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7479–7489, 2019.
[27] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Conference on Neural Information Processing Systems, pages 91–99, 2015.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention, pages 234–241, 2015.
[30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[31] P. Siricharoen, S. Aramvith, T. H. Chalidabhongse, and S. Siddhichai. Robust outdoor human segmentation based on color-based statistical approach and edge combination. In International Conference on Green Circuits and Systems, pages 463–468, 2010.
[32] P. Skurowski, H. Abdulameer, J. Błaszczyk, T. Depta, A. Kornacki, and P. Kozieł. Animal camouflage analysis: Chameleon database. Unpublished manuscript, 2018.
[33] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In ACM International Conference on Information and Knowledge Management, pages 1161–1170, 2019.
[34] Martin Stevens and Sami Merilaita. Animal camouflage: current issues and new perspectives. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1516):423–427, 2009.
[35] Ariel Tankus and Yehezkel Yeshurun. Convexity-based visual camouflage breaking. Computer Vision and Image Understanding, pages 208–237, 2001.
[36] Yu-Huan Wu, Shang-Hua Gao, Jie Mei, Jun Xu, Deng-Ping Fan, Chao-Wei Zhao, and Ming-Ming Cheng. JCS: An explainable COVID-19 diagnosis system by joint classification and segmentation. arXiv preprint arXiv:2004.07054, 2020.
[37] Zhe Wu, Li Su, and Qingming Huang. Cascaded partial decoder for fast and accurate salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2019.
[38] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[39] Changqian Yu, Jingbo Wang, Changxin Gao, Gang Yu, Chunhua Shen, and Nong Sang. Context prior for scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12416–12425, 2020.
[40] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[41] Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao, Jufeng Yang, and Ming-Ming Cheng. EGNet: Edge guidance network for salient object detection. In IEEE International Conference on Computer Vision, pages 8779–8788, 2019.
[42] Ting Zhao and Xiangqian Wu. Pyramid feature attention network for saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3085–3094, 2019.
[43] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 3–11. Springer, 2018.