Deep Texture-Aware Features for Camouflaged Object Detection
Jingjing Ren*, Xiaowei Hu*, Lei Zhu, Xuemiao Xu†, Yangyang Xu, Weiming Wang, Zijun Deng, and Pheng-Ann Heng
South China University of Technology, The Chinese University of Hong Kong, The Open University of Hong Kong
* Jingjing Ren and Xiaowei Hu are the joint first authors of this work. † Corresponding author ([email protected])
Abstract
Camouflaged object detection is a challenging task that aims to identify objects having similar texture to the surroundings. This paper proposes to amplify the subtle texture difference between camouflaged objects and the background for camouflaged object detection by formulating multiple texture-aware refinement modules to learn texture-aware features in a deep convolutional neural network. The texture-aware refinement module computes the covariance matrices of feature responses to extract the texture information, designs an affinity loss to learn a set of parameter maps that help to separate the texture between camouflaged objects and the background, and adopts a boundary-consistency loss to explore the object detail structures. We evaluate our network on the benchmark dataset for camouflaged object detection both qualitatively and quantitatively. Experimental results show that our approach outperforms various state-of-the-art methods by a large margin.
1. Introduction
In nature, animals try to conceal themselves by adapting the texture of their bodies to the texture of the surroundings, which helps them avoid being recognized by predators. This strategy easily deceives the visual perceptual system [34], and current vision algorithms may fail to distinguish camouflaged objects from the background. Hence, camouflaged object detection [7] has been a great challenge, and solving this problem could benefit many applications in computer vision, such as polyp segmentation [8], lung infection segmentation [9, 36], photo-realistic blending [11], and recreational art [4].

To solve this problem, Fan et al. [7] collect the first large-scale dataset for camouflaged object detection. The dataset includes 10,000 images with a large number of object categories in various natural scenes, which makes it possible to apply deep learning algorithms to learn to recognize camouflaged objects from large data. Camouflaged objects usually have similar texture to the surroundings, and deep learning algorithms designed for generic object detection [28, 27, 12, 2, 40, 29, 3] and salient object detection [42, 41] generally do not perform well in detecting camouflaged objects in such difficult situations. Recently, algorithms for camouflaged object detection based on deep neural networks have been proposed. SINet [7] first adopts a search module to find the candidate regions of camouflaged objects and then uses an identification module to precisely detect camouflaged objects. ANet [19] first leverages a classification network to identify whether the image contains camouflaged objects or not, and then adopts a fully convolutional network for camouflaged object segmentation. However, these algorithms may still mistake camouflaged objects for the background due to their similar texture.

Texture refers to a particular way in which visual primitives are organized spatially in natural images [?]. Essentially, there are subtle differences in the texture between camouflaged objects and the background. As shown in Figure 1 (a), the texture of the fish involves a combination of dense and small white particles and brown regions, while the texture of the background has a combination of white and brown regions. Based on this observation, we propose to amplify the texture difference between camouflaged objects and the background by learning texture-aware features from a deep neural network, thus improving the performance of camouflaged object detection; see Figure 1 (c)-(e).

To achieve this, we design the texture-aware refinement module (TARM) in a deep neural network, where we first compute the covariance matrices of feature responses to extract the texture information from the convolutional features, and then learn a set of affinity functions to amplify the texture difference between camouflaged objects and the background. Moreover, we design a boundary-consistency loss in TARM to improve the segmentation quality by revisiting the image patches across boundaries on high-resolution feature maps. After that, we adopt multiple
TARMs in a deep network (TANet) to learn deep texture-aware features in different layers and predict a detection map per layer for camouflaged object detection. Finally, we qualitatively and quantitatively compare our approach with state-of-the-art methods designed for camouflaged object detection, salient object detection, and semantic segmentation on the benchmark dataset, showing the superiority of our network.

Figure 1: Visualization results of the reconstructed images from the original convolutional features and the texture-aware features produced from our TARM. (a) input image; (b) ground truth; (c) convolutional feature; (d) texture-aware feature; (e) our result.

We summarize the contributions of this work as follows:
• First, we design a novel texture-aware refinement module (TARM) to amplify the texture difference between camouflaged objects and the background, which significantly enhances camouflaged object recognition.
• Second, we design a boundary-consistency loss to enhance detail information across boundaries without extra computation overhead in testing.
• Third, we evaluate our network on the benchmark dataset and compare it with state-of-the-art methods on camouflaged object detection, salient object detection, and semantic segmentation. Qualitative and quantitative results show that our approach outperforms previous methods by a large margin.
2. Related Work
Camouflaged object detection.
Early work for camouflaged object detection (COD) adopted various hand-crafted features, e.g., color [15, 31], convex intensity [35], edge [31], and texture [1, 17]. Recently, deep convolutional neural networks have achieved great success with the help of large-scale camouflaged object datasets [32, 19, 7]. SINet [7] finds the candidate regions of camouflaged objects with a search module and precisely detects camouflaged objects through an identification module. ANet [19] first identifies the images that contain camouflaged objects through a classification network and then adopts a fully convolutional network for camouflaged object segmentation. These methods, however, do not consider the subtle texture difference between camouflaged objects and the background, and may fail to detect camouflaged objects in complex situations.
Salient object detection and semantic segmentation.
Salient object detection (SOD) predicts a binary mask to indicate the saliency regions, while semantic segmentation (SS) aims to generate masks with category labels to identify the image regions of different classes. The deep-learning-based methods designed for SOD [21, 26, 37, 41, 42] and SS [12, 43, 40, 2, 14] can be employed for camouflaged object detection by retraining them on camouflaged object datasets, and we compare our network with these methods for camouflaged object detection on the benchmark dataset; see Section 4.

Figure 2: The schematic illustration of the overall network architecture (TANet) to learn the deep texture-aware features for camouflaged object detection. Given the input image, we use a feature extractor to produce the feature maps with multiple resolutions. At each layer, the feature map is first refined by the residual refine block (RRB) and then further enhanced by our texture-aware refinement module (TARM) in a texture-aware manner. We adopt the predicted mask with the highest resolution as the final output of our network.

Figure 3: The schematic illustration of the texture-aware refinement module (TARM). We compute the deep texture-aware features by formulating covariance matrices to extract the texture features in different aspects and learning the parameter maps through affinity and boundary-consistency losses to amplify the texture difference between the camouflaged objects and their surroundings.
3. Methodology
Figure 2 shows the overall network architecture (TANet) with texture-aware refinement modules (TARM) for camouflaged object detection. Given the input image, we adopt a feature extractor to extract feature maps with multiple resolutions, and then use the residual refine blocks (RRB) to refine the feature maps at different layers to enhance the fine details and remove the background noise. We do not refine the feature maps at the first layer due to the large memory footprint. Next, we present the texture-aware refinement module (TARM) to learn the texture-aware features, which help to improve the visibility of camouflaged objects. Lastly, we predict the binary masks that indicate the camouflaged objects at each layer by adding supervision signals at multiple layers.

In the following subsections, we elaborate on the texture-aware refinement module (Section 3.1) in detail, present the training and testing strategies in Section 3.2, and visualize the learned texture-aware features in Section 3.3.
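To make the overall data flow concrete, the following PyTorch sketch wires a ResNeXt50 backbone, RRBs, and TARMs into per-layer prediction heads. It is a minimal illustration of the pipeline described above, not our released implementation: the `make_rrb`/`make_tarm` constructors, the shared channel width, and the exact choice of refined stages are assumptions made for readability.

```python
import torch.nn as nn
import torchvision


class TANetSketch(nn.Module):
    """Illustrative skeleton: backbone -> RRB -> TARM -> per-layer prediction maps."""

    def __init__(self, make_rrb, make_tarm, mid_channels=256):
        super().__init__()
        backbone = torchvision.models.resnext50_32x4d(pretrained=True)
        # Four residual stages give feature maps at four resolutions.
        self.stages = nn.ModuleList([
            nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                          backbone.maxpool, backbone.layer1),   # 256 channels
            backbone.layer2,                                     # 512 channels
            backbone.layer3,                                     # 1024 channels
            backbone.layer4,                                     # 2048 channels
        ])
        stage_channels = [512, 1024, 2048]  # the first stage is not refined (memory footprint)
        self.rrbs = nn.ModuleList([make_rrb(c, mid_channels) for c in stage_channels])
        self.tarms = nn.ModuleList([make_tarm(mid_channels) for _ in stage_channels])
        self.heads = nn.ModuleList([nn.Conv2d(mid_channels, 1, 1) for _ in stage_channels])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        preds = []
        for f, rrb, tarm, head in zip(feats[1:], self.rrbs, self.tarms, self.heads):
            f = tarm(rrb(f))        # refine details, then amplify texture differences
            preds.append(head(f))   # one supervised detection map per refined layer
        return preds                # the highest-resolution map serves as the final output
```

All per-layer maps are supervised during training, while only the highest-resolution prediction is used at test time.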
3.1. Texture-Aware Refinement Module

As shown in Figure 1, the camouflaged objects have similar texture to the surroundings. However, there still exist subtle texture differences between the camouflaged objects and the background. Hence, we present a texture-aware refinement module to extract texture information and amplify the texture difference between camouflaged objects and the background, thus improving the performance of camouflaged object detection.
Figure 3 shows the architecture of the proposed texture-aware refinement module. First, it takes a feature map of size C × H × W as the input, and then uses multiple convolution operations to obtain multiple kinds of feature maps [33], each with the size of C₁ × H × W. Note that C₁ is smaller than C for computational efficiency, and these feature maps are used to learn multiple aspects of textures in the following operations. Next, we compute the covariance matrix among the feature channels at each position to capture the correlations between different responses on the convolutional features. The covariance matrix among features measures the co-occurrence of features, describes the combination of features, and is used to represent texture information [16, 10]. As shown in Figure 3 (b), for each pixel f^k_m on the feature map f^k (of size C₁ × H × W), we compute its covariance matrix as the product of f^k_m and its transpose, i.e., f^k_m (f^k_m)^T. Since the covariance matrix (C₁ × C₁) has the property of diagonal symmetry, we adopt only the upper triangle of this matrix to represent the texture feature, and reshape the result into a feature vector. We perform the same operations for each pixel on the feature map and concatenate the results to obtain g^k of size (C₁(C₁+1)/2) × H × W, which contains the texture information. Then, we fuse all the covariance matrices computed from different feature maps by a convolution layer. After that, we adopt two sets of convolutions on the texture features to learn two parameter maps γ and β, each of size C' × H × W (C' denotes the channel number), which are used to amplify the texture difference between camouflaged objects and their surroundings by adjusting the texture of the input features f_in [13, 24]. Finally, we obtain the output feature f_out by:

f_out = conv( γ (f'_in − µ(f'_in)) / σ(f'_in) + β ) + f_in ,   (1)

where f'_in is obtained by applying a convolution on f_in, and µ(f'_in) and σ(f'_in) are the mean and variance of f'_in, which are used to normalize the feature map. Finally, we add the original feature map to the feature map refined by the above operations as the output of our texture-aware refinement module.

Figure 4: The schematic illustration of how the boundary-consistency loss works. The affinity loss is computed within small patches across the boundary, which is used to enhance detail information.

To make the parameter maps γ and β capture the texture difference between the camouflaged objects and the background, we adopt the affinity loss [39] on γ and β to explicitly amplify the difference between their texture features. As shown in Figure 3 (c), we first use pooling operations to downsample the parameter map and then compute the affinity matrix A^h_{m,n} at the positions m, n:

A^h_{m,n} = (h_m)^T h_n / (‖h_m‖ ‖h_n‖) ,   (2)

where h_m and h_n are the parameter vectors of the downsampled map, and A^h is the resulting matrix that captures the pair-wise texture similarity. Next, we calculate the ground-truth affinity matrix A^gt by:

A^gt_{m,n} = 2 × 1{C_m = C_n} − 1 ,   (3)

where 1{·} is an indicator, which is equal to one when the labels (C_m and C_n) of positions m and n are the same, and otherwise is equal to zero.
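Before turning to how the affinity loss is balanced, the following sketch illustrates the TARM forward pass described above: per-pixel covariance (upper-triangle) texture descriptors from several low-dimensional branches, fused and mapped to the parameter maps γ and β that modulate the input feature as in Eq. (1). This is a minimal sketch under stated assumptions: the branch count and width, the use of 1×1 convolutions, and the instance-wise statistics for µ and σ are illustrative choices, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn


class TARMSketch(nn.Module):
    """Minimal sketch of the texture-aware refinement module (TARM)."""

    def __init__(self, in_channels, branch_channels=8, num_branches=4):
        super().__init__()
        c1 = branch_channels
        tex_channels = num_branches * c1 * (c1 + 1) // 2
        # Branches that project the input into several low-dimensional feature maps.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, c1, 1) for _ in range(num_branches)])
        self.fuse = nn.Conv2d(tex_channels, in_channels, 1)    # fuse covariance features
        self.to_gamma = nn.Conv2d(in_channels, in_channels, 1)
        self.to_beta = nn.Conv2d(in_channels, in_channels, 1)
        self.pre = nn.Conv2d(in_channels, in_channels, 1)      # produces f'_in
        self.post = nn.Conv2d(in_channels, in_channels, 1)

    @staticmethod
    def pixelwise_covariance(f):
        """Per-pixel product of channel responses; keep only the upper triangle."""
        b, c, h, w = f.shape
        v = f.flatten(2)                                     # (B, C1, H*W)
        outer = torch.einsum('bip,bjp->bijp', v, v)          # (B, C1, C1, H*W)
        iu = torch.triu_indices(c, c, device=f.device)
        return outer[:, iu[0], iu[1], :].view(b, -1, h, w)   # (B, C1(C1+1)/2, H, W)

    def forward(self, f_in):
        tex = torch.cat([self.pixelwise_covariance(branch(f_in))
                         for branch in self.branches], dim=1)
        tex = self.fuse(tex)
        gamma, beta = self.to_gamma(tex), self.to_beta(tex)   # texture-driven parameter maps
        f = self.pre(f_in)
        mu = f.mean(dim=(2, 3), keepdim=True)                 # assumed instance-wise statistics
        sigma = f.std(dim=(2, 3), keepdim=True) + 1e-5
        return self.post(gamma * (f - mu) / sigma + beta) + f_in   # Eq. (1), residual form
```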
Figure 5: Visual comparison results of our method with and without the boundary-consistency loss (input, ground truth, w/o edge, with edge).

In natural images, camouflaged objects usually occupy small regions, and we formulate the affinity loss as follows to alleviate the class imbalance problem:

L_aff = Σ_m Σ_n w_m w_n d(A^h_{m,n}, A^gt_{m,n}) ,   (4)

where w_m = 1 − N_{C_m}/(H'W') and w_n = 1 − N_{C_n}/(H'W'); N_{C_m} and N_{C_n} are the numbers of pixels that have the same class label as pixels m and n, respectively; and H' and W' are the height and width of the parameter map. From this loss function, we can see that the parameter maps will learn to maximize the texture difference between camouflaged objects and the background.
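A minimal sketch of the class-balanced affinity loss of Eqs. (2)-(4): the parameter map is pooled, pairwise cosine similarities are compared against the ±1 ground-truth affinities, and each position is down-weighted according to how common its class is. The pooled size, the max-pooling of labels, and the squared-difference choice for d(·,·) are assumptions made for illustration, since the text does not pin them down.

```python
import torch
import torch.nn.functional as F


def affinity_loss(param_map, gt_mask, pooled_size=32):
    """Sketch of the class-balanced affinity loss (Eqs. (2)-(4)).

    param_map: (B, C, H, W) texture parameter map (gamma or beta).
    gt_mask:   (B, 1, H, W) binary ground-truth mask of camouflaged objects.
    """
    # Downsample both the map and the labels to keep the affinity matrix small.
    h = F.adaptive_avg_pool2d(param_map, pooled_size)           # (B, C, H', W')
    labels = F.adaptive_max_pool2d(gt_mask.float(), pooled_size)
    b, c, hp, wp = h.shape
    n = hp * wp

    v = F.normalize(h.flatten(2), dim=1)                        # unit-length vectors h_m
    affinity = torch.bmm(v.transpose(1, 2), v)                  # Eq. (2): cosine similarity, (B, N, N)

    lab = (labels.flatten(2).squeeze(1) > 0.5).float()           # (B, N) class label per position
    same = (lab.unsqueeze(2) == lab.unsqueeze(1)).float()
    gt_affinity = 2.0 * same - 1.0                               # Eq. (3): +1 if same class, -1 otherwise

    # Eq. (4): weight each position by one minus the fraction of pixels sharing its label.
    num_fg = lab.sum(dim=1, keepdim=True)                        # foreground pixel count
    class_count = lab * num_fg + (1.0 - lab) * (n - num_fg)      # N_{C_m} for every position
    w = 1.0 - class_count / n                                    # (B, N)
    pair_w = w.unsqueeze(2) * w.unsqueeze(1)                     # w_m * w_n

    # d(.,.) is not specified in the text; a squared difference is used here as one choice.
    return (pair_w * (affinity - gt_affinity) ** 2).sum(dim=(1, 2)).mean()
```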
Boundary-Consistency Loss. The convolutional features contain highly semantic information but tend to produce blurry boundaries between the camouflaged objects and the background due to the small resolutions of the parameter maps. To solve this issue, we present a boundary-consistency loss L_edge to improve the boundary quality by revisiting the prediction results across boundary regions:

L_edge = Σ_{b_i ∈ B} Σ_{m,n ∈ b_i} d(A^h_{m,n}, A^gt_{m,n}) ,   (5)

where b_i is the i-th image patch in B, which contains all the image patches that cross the boundary, as shown in Figure 4. These image patches are selected when they contain pixels that belong to different categories. Note that, unlike the affinity loss, L_edge is computed on parameter maps without downsampling operations, and the parameter maps with higher resolutions help to provide more detailed information for boundary prediction. The additional memory consumption during training is reasonable since we only consider affinity relationships within small patches, and no extra computational time is introduced in the testing process. The comparison results in Figure 5 show that our method with the boundary-consistency loss better preserves the detailed structures of camouflaged objects.

3.2. Training and Testing Strategies

The overall loss function is defined as

L = λ₁ L_seg + λ₂ L_aff + λ₃ L_edge ,   (6)

where L_seg is the binary cross-entropy loss used for camouflaged object segmentation. We empirically set two of these weights to 1 and the remaining one to 10 to balance the numerical magnitudes. We implemented our network in PyTorch and used ResNeXt50 [38] as the backbone network, which was pre-trained on ImageNet [30]. We used stochastic gradient descent (SGD) with the poly learning rate strategy [22] to optimize the network, and the training process was stopped after a fixed number of epochs. It took about two hours to train the network on a 1080Ti GPU. The input images were resized to a fixed resolution. In testing, we adopted the prediction mask with the highest resolution as the final result, and processing one image took about 0.03s.

3.3. Visualization of Texture-Aware Features

To visualize the learned texture-aware features, we adopt a decoder [18] to reconstruct the images from the feature maps before and after being refined by the texture-aware refinement module (TARM); see Section 3.1. Note that during the visualization process, the weights in our TANet are fixed and we only train the newly added decoder to reconstruct the original images. Figures 1 and 6 show the visualization results: after adopting our TARM to refine the features, the texture differences between camouflaged objects and the background are clearly amplified, which helps to improve the overall performance of camouflaged object detection.
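For completeness, here is a sketch of the boundary-consistency loss of Eq. (5) and the overall objective of Eq. (6), reusing the `affinity_loss` sketch given earlier. Selecting patches that contain both classes via unfolding, the patch size, and the loss weights in `total_loss` are illustrative assumptions rather than the exact settings of the paper.

```python
import torch.nn.functional as F


def boundary_consistency_loss(param_map, gt_mask, patch_size=8):
    """Sketch of the boundary-consistency loss (Eq. (5)): the affinity loss is
    re-evaluated inside small full-resolution patches that straddle the boundary."""
    b, c, h, w = param_map.shape
    p = patch_size
    # Split the parameter map and the labels into non-overlapping patches.
    patches = F.unfold(param_map, kernel_size=p, stride=p)           # (B, C*p*p, L)
    labels = F.unfold(gt_mask.float(), kernel_size=p, stride=p)      # (B, p*p, L)
    patches = patches.view(b, c, p * p, -1).permute(0, 3, 1, 2)       # (B, L, C, p*p)
    labels = (labels > 0.5).float().permute(0, 2, 1)                  # (B, L, p*p)

    # A patch crosses the boundary iff it contains both foreground and background pixels.
    frac = labels.mean(dim=2)
    cross = (frac > 0) & (frac < 1)                                    # (B, L)
    if not cross.any():
        return param_map.sum() * 0.0

    v = F.normalize(patches[cross], dim=1)                             # (M, C, p*p)
    affinity = torch.bmm(v.transpose(1, 2), v)                         # pairwise similarity inside a patch
    lab = labels[cross]                                                # (M, p*p)
    gt_affinity = 2.0 * (lab.unsqueeze(2) == lab.unsqueeze(1)).float() - 1.0
    return ((affinity - gt_affinity) ** 2).mean()


def total_loss(pred, gamma, beta, gt_mask, w_seg=1.0, w_aff=1.0, w_edge=1.0):
    """Eq. (6): weighted sum of segmentation, affinity, and boundary-consistency terms.
    The weights here are placeholders; see the text for how they are balanced."""
    l_seg = F.binary_cross_entropy_with_logits(pred, gt_mask.float())
    l_aff = affinity_loss(gamma, gt_mask) + affinity_loss(beta, gt_mask)
    l_edge = boundary_consistency_loss(gamma, gt_mask) + boundary_consistency_loss(beta, gt_mask)
    return w_seg * l_seg + w_aff * l_aff + w_edge * l_edge
```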
Table 1: Comparison with state-of-the-art methods for camouflaged object detection on three benchmark datasets.

Method | Year | CHAMELEON (S_α↑, E_φ↑, F^w_β↑, M↓) | CAMO-Test (S_α↑, E_φ↑, F^w_β↑, M↓) | COD10K-Test (S_α↑, E_φ↑, F^w_β↑, M↓)
FPN [20] | 2017 | 0.794, 0.783, 0.590, 0.075 | 0.684, 0.677, 0.483, 0.131 | 0.697, 0.691, 0.411, 0.075
MaskRCNN [12] | 2017 | 0.643, 0.778, 0.518, 0.099 | 0.574, 0.715, 0.430, 0.151 | 0.613, 0.748, 0.402, 0.080
PSPNet [40] | 2017 | 0.773, 0.758, 0.555, 0.085 | 0.663, 0.659, 0.455, 0.139 | 0.678, 0.680, 0.377, 0.080
UNet++ [43] | 2018 | 0.695, 0.762, 0.501, 0.094 | 0.599, 0.653, 0.392, 0.149 | 0.623, 0.672, 0.350, 0.086
PiCANet [21] | 2018 | 0.769, 0.749, 0.536, 0.085 | 0.609, 0.584, 0.356, 0.156 | 0.649, 0.643, 0.322, 0.090
MSRCNN [14] | 2019 | 0.637, 0.686, 0.443, 0.091 | 0.617, 0.669, 0.454, 0.133 | 0.641, 0.706, 0.419, 0.073
BASNet [26] | 2019 | 0.687, 0.721, 0.474, 0.118 | 0.618, 0.661, 0.413, 0.159 | 0.634, 0.678, 0.365, 0.105
PFANet [42] | 2019 | 0.679, 0.648, 0.378, 0.144 | 0.659, 0.622, 0.391, 0.172 | 0.636, 0.618, 0.286, 0.128
CPD [37] | 2019 | 0.853, 0.866, 0.706, 0.052 | 0.726, 0.729, 0.550, 0.115 | 0.747, 0.770, 0.508, 0.059
HTC [2] | 2019 | 0.517, 0.489, 0.204, 0.129 | 0.476, 0.442, 0.174, 0.172 | 0.548, 0.520, 0.221, 0.088
EGNet [41] | 2019 | 0.848, 0.870, 0.702, 0.050 | 0.732, 0.768, 0.583, 0.104 | 0.737, 0.779, 0.509, 0.056
ANet-SRM [19] | 2019 | ‡ | | ‡
SINet [7] | 2020 | 0.869, 0.891, 0.740, 0.044 | 0.751, 0.771, 0.606, 0.100 | 0.771, 0.806, 0.551, 0.051
TANet v1 | - | 0.881, 0.907, 0.773, 0.039 | 0.778, 0.813, 0.659, 0.089 | 0.794, 0.838, 0.613, 0.043
TANet (ours) | - | | |
Table 2: Ablation study results. Here, we compare the quantitative results of our full pipeline and baseline networks on threebenchmark datasets.
Method | CHAMELEON (S_α↑, E_φ↑, F^w_β↑, M↓) | CAMO-Test (S_α↑, E_φ↑, F^w_β↑, M↓) | COD10K-Test (S_α↑, E_φ↑, F^w_β↑, M↓)
M_basic [20] | 0.856, 0.866, 0.710, 0.050 | 0.763, 0.783, 0.621, 0.097 | 0.772, 0.797, 0.557, 0.049
M_basic+RRB | 0.862, 0.885, 0.733, 0.046 | 0.774, 0.807, 0.649, 0.091 | 0.782, 0.813, 0.583, 0.046
M_basic+RRB+TARM w/o BCL | 0.879, 0.909, 0.767, 0.039 | 0.782, 0.820, 0.675, 0.087 | 0.796, 0.840, 0.618, 0.042
Ours (TANet) | | |
4. Experimental Results
We employ three widely-used COD benchmark datasets to test each COD method. COD10K [7] is the largest annotated dataset, with 6,000 training images and 4,000 testing images. CAMO [19] includes 1,000 training images and 250 testing images, while CHAMELEON [32] consists of 76 images. The same training dataset used by the most recent COD method [7] is employed to train our network for fair comparisons. The training set consists of the training images of COD10K (3,040 images), the training images of CPD1K, and the training images of CAMO. We test different COD methods on the testing set (COD10K-Test) of COD10K, the testing set (CAMO-Test) of CAMO, and the whole CHAMELEON dataset. We shall release our code, the trained network model, and the predicted COD maps of our method upon the publication of this work.
To conduct quantitative comparisons, we employ four common metrics to evaluate different COD methods: Structure-measure [5] (S_α), Enhanced-alignment measure [6] (E_φ), weighted F-measure [23] (F^w_β), and MAE [25]; see SINet [7] for the definitions of these four evaluation metrics. Overall, a better COD method has larger S_α, E_φ, and F^w_β scores, but a smaller MAE score.

We compare our method against 13 cutting-edge methods, including (1) FPN [20], (2) MaskRCNN [12], (3) PSPNet [40], (4) UNet++ [43], (5) PiCANet [21], (6) MSRCNN [14], (7) BASNet [26], (8) PFANet [42], (9) CPD [37], (10) HTC [2], (11) EGNet [41], (12) ANet-SRM [19], and (13) SINet [7]. Note that SINet has reported the quantitative results and released the camouflaged object maps predicted by all compared COD methods. Hence, we use these public results for conducting fair comparisons.
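Among the four metrics, MAE is straightforward to reproduce; a minimal sketch is given below, while the other three measures involve more machinery and we defer to the implementations referenced in SINet [7].

```python
import numpy as np


def mae(pred, gt):
    """Mean absolute error between a predicted map in [0, 1] and a binary ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return np.abs(pred - gt).mean()

# Example: a perfect prediction gives MAE 0; a uniformly grey map gives 0.5 on any binary mask.
```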
Quantitative comparisons.
Table 1 summarizes the quantitative results of different COD methods on three benchmark datasets. Apparently, SINet, as a dedicated COD method, has superior performance on the four metrics over the other semantic segmentation methods and saliency detectors on the three COD benchmark datasets. Compared to SINet, our method has larger S_α, E_φ, and F^w_β scores, but a smaller M score, which demonstrates that our method can more accurately identify camouflaged objects. Specifically, our method achieves a 3.98% improvement on the average S_α, a 5.21% improvement on the average E_φ, an 11.41% improvement on the average F^w_β, and an 18.26% improvement on the average M over the three benchmark datasets. We also re-train our method with ResNet-50 as the backbone network and report the results as "TANet v1" in Table 1, where our method still achieves the best performance.

Figure 7: Visual comparison of camouflaged object detection maps produced by different methods. (a) input images; (b) ground truths; camouflaged object detection maps produced by (c) our method, (d) SINet [7], (e) EGNet [41], (f) HTC [2], (g) CPD [37], (h) PFANet [42], and (i) BASNet [26]. Apparently, our method can better identify camouflaged objects than all the compared detectors, and our prediction results (c) are more consistent with the ground-truth images.

Visual comparisons.
Figure 7 visually compares the COD maps produced by our network and the compared methods. Apparently, all compared methods tend to include many non-camouflaged regions or neglect parts of camouflaged objects in their COD maps. On the contrary, our method can more accurately detect camouflaged objects, and our results (see Figure 7 (c)) are most consistent with the ground truths shown in Figure 7 (b).
More visual comparisons can be found in the supplementary material.
Ablation study. We conduct ablation study experiments to verify the effectiveness of the TARMs and the boundary-consistency loss; see Figure 3. The first baseline ("basic") removes all RRBs and all TARMs from our network. The second baseline ("basic+RRB") adds RRBs into "basic" to merge features at adjacent CNN layers; it is equivalent to removing all TARMs from our network. The third baseline ("basic+RRB+TARM w/o BCL") adds TARMs without the boundary-consistency loss into the second baseline. Compared with these baselines, our full method is equivalent to adding the boundary-consistency loss of the TARMs into the third baseline.

Figure 8: Visual comparison of camouflaged object detection maps produced by our method and baseline networks. (a) input images; (b) ground truths; camouflaged object detection maps produced by (c) our method with the full pipeline, and (d)-(g) the baseline networks (see Table 2). Apparently, our full pipeline can better identify camouflaged objects than all the compared detectors.

Figure 8 shows the visual comparison results of camouflaged object detection maps produced by our full pipeline and the baseline methods, demonstrating that our full pipeline better identifies camouflaged objects than the others. Table 2 compares the results of our network and the baseline networks on three benchmark datasets, i.e., CHAMELEON, CAMO-Test, and COD10K-Test.

Effectiveness of RRBs.
From the results in Table 2, "basic+RRB" achieves better metric results than "basic", indicating that the RRBs help our network detect camouflaged objects.
Effectiveness of the affinity loss of TARMs w/o BCL.
As shown in Table 2, "basic+RRB+TARM w/o BCL" has larger S_α, E_φ, and F^w_β scores and a smaller M score than "basic+RRB" on the three benchmark datasets, demonstrating that the affinity loss enables our network to better capture the texture difference between camouflaged objects and the background and thus benefits camouflaged object detection.

Effectiveness of the boundary-consistency loss of TARMs.
The superior metric results of our method over "basic+RRB+TARM w/o BCL" on the three benchmark datasets indicate that computing the boundary-consistency loss in TARMs can further improve the camouflaged object detection accuracy of our network by enhancing the consistency across boundaries.
5. Conclusion
This paper designs a novel deep network architecture for camouflaged object detection by learning deep texture-aware features. Our key idea is to amplify the texture difference between camouflaged objects and their surroundings, thus helping to identify camouflaged objects from the background. In our network, we design a texture-aware refinement module (TARM) that computes the covariance matrices of feature responses among feature channels to represent the texture structures and adopts the affinity loss to learn a set of parameter maps that perform a linear transformation of the convolutional features to separate the texture between camouflaged objects and the background. A boundary-consistency loss is further proposed to learn the object details. In this way, we can obtain deep texture-aware features and formulate TANet by embedding multiple TARMs in a deep neural network for the task. In the end, we evaluate our method on the benchmark dataset, compare it with various state-of-the-art methods, and show the superiority of our method both qualitatively and quantitatively. In the future, we will explore the potential of our network for generic image segmentation in more complex environments, especially for objects having similar color to the background.
References
[1] Nagappa U. Bhajantri and P. Nagabhushan. Camouflage defect identification: a novel approach. In International Conference on Information Technology, pages 145–148, 2006.
[2] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 834–848, 2017.
[4] Hung-Kuo Chu, Wei-Hsin Hsu, Niloy J. Mitra, Daniel Cohen-Or, Tien-Tsin Wong, and Tong-Yee Lee. Camouflage images. ACM Transactions on Graphics, 29(4), 2010.
[5] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In IEEE International Conference on Computer Vision, pages 4548–4557, 2017.
[6] Deng-Ping Fan, Cheng Gong, Yang Cao, Bo Ren, Ming-Ming Cheng, and Ali Borji. Enhanced-alignment measure for binary foreground map evaluation. arXiv preprint arXiv:1805.10421, 2018.
[7] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2777–2787, 2020.
[8] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. PraNet: Parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention, pages 263–273. Springer, 2020.
[9] Deng-Ping Fan, Tao Zhou, Ge-Peng Ji, Yi Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. Inf-Net: Automatic COVID-19 lung infection segmentation from CT images. IEEE Transactions on Medical Imaging, 2020.
[10] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[11] Shiming Ge, Xin Jin, Qiting Ye, Zhao Luo, and Qiang Li. Image editing by object-aware optimal boundary searching and mixed-domain composition. Computational Visual Media, 4(1):71–82, 2018.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[13] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[14] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring R-CNN. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6409–6418, 2019.
[15] Iván Huerta, Daniel Rowe, Mikhail Mozerov, and Jordi Gonzàlez. Improving background subtraction based on a casuistry of colour-motion segmentation problems. In Iberian Conference on Pattern Recognition and Image Analysis, pages 475–482, 2007.
[16] Levent Karacan, Erkut Erdem, and Aykut Erdem. Structure-preserving image smoothing via region covariances. ACM Transactions on Graphics, 32(6):1–11, 2013.
[17] Ch. Kavitha, B. Prabhakara Rao, and A. Govardhan. An efficient content based image retrieval using color and texture of image sub-blocks. International Journal of Engineering Science and Technology, 3(2):1060–1068, 2011.
[18] Junho Kim, Minjae Kim, Hyeonwoo Kang, and Kwanghee Lee. U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv preprint arXiv:1907.10830, 2019.
[19] Trung-Nghia Le, Tam V. Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto. Anabranch network for camouflaged object segmentation. Computer Vision and Image Understanding, pages 45–56, 2019.
[20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[21] Nian Liu, Junwei Han, and Ming-Hsuan Yang. PiCANet: Learning pixel-wise contextual attention for saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3089–3098, 2018.
[22] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
[23] Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal. How to evaluate foreground maps? In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2014.
[24] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
[25] Federico Perazzi, Philipp Krähenbühl, Yael Pritch, and Alexander Hornung. Saliency filters: Contrast based filtering for salient region detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 733–740, 2012.
[26] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. BASNet: Boundary-aware salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7479–7489, 2019.
[27] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Conference on Neural Information Processing Systems, pages 91–99, 2015.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention, pages 234–241, 2015.
[30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[31] P. Siricharoen, S. Aramvith, T. H. Chalidabhongse, and S. Siddhichai. Robust outdoor human segmentation based on color-based statistical approach and edge combination. In International Conference on Green Circuits and Systems, pages 463–468, 2010.
[32] P. Skurowski, H. Abdulameer, J. Błaszczyk, T. Depta, A. Kornacki, and P. Kozieł. Animal camouflage analysis: Chameleon database. Unpublished manuscript, 2018.
[33] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In ACM International Conference on Information and Knowledge Management, pages 1161–1170, 2019.
[34] Martin Stevens and Sami Merilaita. Animal camouflage: current issues and new perspectives. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1516):423–427, 2009.
[35] Ariel Tankus and Yehezkel Yeshurun. Convexity-based visual camouflage breaking. Computer Vision and Image Understanding, pages 208–237, 2001.
[36] Yu-Huan Wu, Shang-Hua Gao, Jie Mei, Jun Xu, Deng-Ping Fan, Chao-Wei Zhao, and Ming-Ming Cheng. JCS: An explainable COVID-19 diagnosis system by joint classification and segmentation. arXiv preprint arXiv:2004.07054, 2020.
[37] Zhe Wu, Li Su, and Qingming Huang. Cascaded partial decoder for fast and accurate salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2019.
[38] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[39] Changqian Yu, Jingbo Wang, Changxin Gao, Gang Yu, Chunhua Shen, and Nong Sang. Context prior for scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12416–12425, 2020.
[40] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[41] Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao, Jufeng Yang, and Ming-Ming Cheng. EGNet: Edge guidance network for salient object detection. In IEEE International Conference on Computer Vision, pages 8779–8788, 2019.
[42] Ting Zhao and Xiangqian Wu. Pyramid feature attention network for saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3085–3094, 2019.
[43] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 3–11. Springer, 2018.