Big GANs Are Watching You: Towards Unsupervised Object Segmentation with Off-the-Shelf Generative Models
Andrey Voynov
Yandex [email protected]
Stanislav Morozov
Yandex, Lomonosov Moscow State University [email protected]
Artem Babenko
Yandex, National Research University Higher School of Economics [email protected]
Abstract
Since collecting pixel-level groundtruth data is expensive, unsupervised visual understanding problems are currently an active research topic. In particular, several recent methods based on generative models have achieved promising results for object segmentation and saliency detection. However, since generative models are known to be unstable and sensitive to hyperparameters, the training of these methods can be challenging and time-consuming. In this work, we introduce an alternative, much simpler way to exploit generative models for unsupervised object segmentation. First, we explore the latent space of BigBiGAN, a state-of-the-art unsupervised GAN whose parameters are publicly available. We demonstrate that object saliency masks for GAN-produced images can be obtained automatically with BigBiGAN. These masks are then used to train a discriminative segmentation model. Being very simple and easy to reproduce, our approach provides competitive performance on common benchmarks in the unsupervised scenario. Code is available online at https://github.com/anvoynov/BigGANsAreWatching.

Deep convolutional models are a core instrument for visual understanding problems, including object localization [1, 2], saliency detection [3], segmentation [4], and others. Deep models, however, require a large amount of high-quality training data to fit a huge number of learnable parameters. In practice, obtaining groundtruth pixel-level labeling is expensive, since it requires labor-intensive human effort. Therefore, much research attention is currently focused on weakly-supervised and unsupervised approaches for challenging pixel-level tasks, such as segmentation [5, 6, 7, 8].

An emerging line of research on unsupervised segmentation exploits generative models as a tool for image decomposition. Namely, recent works [7, 8] have designed training protocols that include generative adversarial networks (GANs) to solve foreground object segmentation without human labels. Given the promising results and the fact that the GANs' performance is steadily improving, this research direction will likely develop in the future.

In practice, however, training high-quality generative models is challenging. This is especially the case for GANs, whose training can be both time-consuming and unstable. Moreover, the models in [7, 8] typically include a large number of hyperparameters that can be tricky to tune, especially in the completely unsupervised scenario, when a labeled validation set is not available.

To this end, we propose an alternative way to exploit GANs for unsupervised segmentation, which does not train a separate generative model for each task. Instead, we use a publicly available pretrained GAN to generate synthetic images equipped with segmentation masks, which can be obtained automatically. In more detail, we explore the latent space of the publicly available BigBiGAN model [9], which is an unsupervised GAN trained on the Imagenet [10]. With the recent unsupervised technique [11], we demonstrate that manipulations in the BigBiGAN latent space allow us to distinguish object/background pixels in the generated images, providing decent segmentation masks.
These masks are then used to supervise a discriminative U-Net model [12], which is much easier to train. As another advantage, our approach provides a straightforward way to tune hyperparameters: since the amount of synthetic data is unlimited, its hold-out subset can be used for validation.

Our work confirms the promise of using GANs to produce synthetic training data, which is a long-standing goal of research on generative modeling. In extensive experiments, we show that the approach often outperforms the existing unsupervised alternatives for object segmentation and saliency detection. Furthermore, our approach performs on par with weakly-supervised methods for object localization, despite being completely unsupervised.

The main contributions of our paper are the following:

1. We introduce an alternative line of research on using GANs for unsupervised object segmentation. In a nutshell, we advocate the usage of high-quality synthetic data produced by BigBiGAN, which can provide high-quality saliency masks for generated images.
2. We compare our method to existing approaches and achieve a new state-of-the-art in most operating points. Given its simplicity, the method can serve as a useful baseline in the future.
3. We demonstrate a novel unsupervised scenario, where GAN-produced imagery becomes a useful source of training data for supervised computer vision models.

In this paper, we address the binary object segmentation problem, i.e., for each image pixel we aim to predict whether it belongs to the object or to the background. In the literature this setup is typically referred to as saliency detection [3] and foreground object segmentation [7, 8]. While most prior works operate in fully-supervised or weakly-supervised regimes, we focus on the most challenging, unsupervised scenario, for which only a few approaches have been developed.
Existing unsupervised approaches.
Before the rise of deep learning models, a large number of "shallow" unsupervised techniques were developed [13, 14, 15, 16, 17, 18]. These earlier techniques were mostly based on hand-crafted features and heuristics, e.g., color contrast [17] or certain background priors [18]. Often these approaches also utilize traditional computer vision routines, such as superpixels [19, 20], object proposals [21], and CRFs [22]. These heuristics, however, are not completely learned from data, and the corresponding methods are inferior to the more recent "deep" approaches.

Regarding unsupervised deep models, several works have recently been proposed by the saliency detection community [23, 24, 25, 26]. Their main idea is to combine or fuse the predictions of several heuristic saliency methods, typically using them as a source of noisy groundtruth for deep CNN models. However, these methods are not completely unsupervised, since they typically rely on pretrained classification or segmentation networks. In contrast, in this work we focus on methods that do not require any source of external supervision.
Generative models for object segmentation.
The recent line of completely unsupervised methods [7, 8] employs generative modeling to decompose the image into the object and the background. In a nutshell, these methods exploit the idea that the object location or appearance can be perturbed without affecting the image realism. This inductive bias is formalized in the training protocols [7, 8], which include learning of GANs. Therefore, for each new segmentation task one has to perform adversarial learning, which is known to be unstable, time-consuming, and sensitive to hyperparameters. In contrast, our approach avoids these disadvantages, being much simpler and easier to reproduce. In essence, we propose to use the "inner knowledge" of an off-the-shelf large-scale GAN to produce saliency masks for synthetic images and use them as supervision for discriminative models.
Latent spaces of large-scale GANs.
Our study is partially inspired by the recent findings from [11]. This work introduces an unsupervised technique that discovers directions in the GAN latent space corresponding to interpretable image transformations. Among its findings, [11] demonstrates that the large-scale conditional GAN (BigGAN [27]) possesses a "background removal" direction that can be used to obtain saliency masks. However, this direction was discovered only for BigGAN, which was trained under supervision from the image class labels. For unconditional GANs, such a direction was not discovered in [11]; hence, it is not clear whether the supervision from class labels is necessary for the GAN latent space "to understand" which pixels belong to the object/background. In this paper, we demonstrate that this supervision is not necessary; therefore, even completely unsupervised GANs can serve as an excellent source of synthetic data for object segmentation.
The main component of our method is the recent BigBiGAN model [9]. BigBiGAN is a state-of-the-art generative adversarial network trained on the Imagenet [10] without labels, and its parameters are available online at https://tfhub.dev/deepmind/bigbigan-resnet50/1. The BigBiGAN generator G maps samples z ∼ N(0, I) from the latent space to the image space, G: z → I. BigBiGAN is also equipped with an encoder E: I → z that was trained jointly with the generator and maps images to the latent space. In this section, we explore the BigBiGAN latent space to investigate whether its properties can be useful for downstream tasks.

A very recent paper [11] has introduced an unsupervised technique that identifies interpretable directions in the latent space of a pretrained GAN. By moving a latent code z in these directions, one can achieve different image transformations, such as image zooming or translation. Formally, given an image corresponding to a latent code z, one can modify it by shifting the code in an interpretable direction h and generating the modified image G(z + h). Importantly, h operates consistently over the whole latent space, i.e., for all z, shifting results in the same type of transformation. As the first step of our study, we apply the technique from [11] to the BigBiGAN generator to explore the potential of its latent space. In a nutshell, [11] seeks K directions h_1, ..., h_K in the latent space such that the effects of the corresponding image transformations are "disentangled". More formally, the sets of pairs {G(z), G(z + h_i) | z ∼ N(0, I)} for different i = 1, ..., K should be easy to distinguish from each other by a CNN classifier, which is trained jointly with h_1, ..., h_K. We use the authors' implementation (https://github.com/anvoynov/GANLatentDiscovery) with default hyperparameters and the number of directions K = 120.

After learning converges, we inspect the directions manually and keep only those that are interpretable. Several directions revealed by the procedure are shown in Figure 1. In contrast to the results from [11] for the "supervised" conditional BigGAN, the BigBiGAN latent space does not possess any direction with a clear "background removal" effect. However, one of the directions has an effect that can be used to distinguish between object and background pixels. The corresponding transformation, "Saliency lighting", is presented in Figure 1, and we refer to this direction as h_bg. Moving in this direction makes the object pixels lighter, while the background pixels become darker. Therefore, although BigBiGAN is completely unsupervised, its latent space can be used to obtain saliency masks for generated images. Technically, we produce a binary saliency mask M for an image G(z) by comparing its intensity with that of the "shifted" image after grayscale conversion: M = [G(z + h_bg) > G(z)]. As a shift magnitude, we always use ||h_bg|| = 5. Below we describe a few tricks that increase the quality of the masks for a particular segmentation task.
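Before turning to these tricks, the sketch below shows one possible implementation of the mask-extraction step just described. It is a simplified illustration with our own naming (`generator` stands for the BigBiGAN generator, `h_bg` for the discovered direction), not the authors' released code.

```python
import torch

def saliency_masks(generator, z, h_bg, shift_norm=5.0):
    """Compute M = [gray(G(z + h_bg)) > gray(G(z))] for a batch of latents.

    generator: callable mapping latent codes to images of shape (B, 3, H, W)
               (interface assumed); h_bg: the "saliency lighting" direction.
    """
    shift = h_bg / h_bg.norm() * shift_norm         # shift of magnitude ||h_bg|| = 5
    with torch.no_grad():
        original = generator(z)                     # G(z)
        shifted = generator(z + shift)              # G(z + h_bg)
    # Grayscale conversion with standard luminance weights.
    w = torch.tensor([0.299, 0.587, 0.114], device=original.device).view(1, 3, 1, 1)
    gray_original = (original * w).sum(dim=1)
    gray_shifted = (shifted * w).sum(dim=1)
    # Object pixels get lighter after the shift, background pixels darker.
    return (gray_shifted > gray_original).float()   # binary masks, shape (B, H, W)
```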
Adaptation to the particular segmentation task.
In the scheme above the latent codes are sampledfrom the standard Gaussian distribution z ∼ N (0 , I ) . To make the distribution of generated imagescloser to the particular dataset at hand I = { I , . . . , I N } , we aim to sample z from the latent spaceregions that are close to the latent codes of I . To this end, we use the BigBiGAN encoder to compute https://tfhub.dev/deepmind/bigbigan-resnet50/1 https://github.com/anvoynov/GANLatentDiscovery - Light direction + - Saliency lighting +- Zoom + - Background green + Figure 2:
To this end, we use the BigBiGAN encoder to compute the latent representations {E(I_1), ..., E(I_N)} and sample the codes from the neighborhood of these representations. Formally, the samples have the form {E(I_i) + αξ | i ∼ U{1, N}, ξ ∼ N(0, I)}. Here α denotes the neighborhood size, and it should be larger for small I to prevent overfitting. In particular, we use α = 0 for Imagenet and a small nonzero α for all other cases.
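A minimal sketch of this sampling scheme follows; `encoder` denotes the BigBiGAN encoder E, `real_images` a batch of images from the target dataset, and the default `alpha` value is a placeholder of ours rather than the setting used in the paper.

```python
import torch

def sample_latent_codes(encoder, real_images, n_samples, alpha=0.2):
    """Draw codes of the form E(I_i) + alpha * xi, with xi ~ N(0, I).

    alpha = 0 disables z-noising (pure encoder embeddings); the nonzero
    value used in practice is dataset-dependent (0.2 here is illustrative).
    """
    with torch.no_grad():
        codes = encoder(real_images)                       # {E(I_1), ..., E(I_N)}
    idx = torch.randint(0, codes.shape[0], (n_samples,))   # i ~ U{1, ..., N}
    noise = torch.randn(n_samples, codes.shape[1],
                        device=codes.device)               # xi ~ N(0, I)
    return codes[idx] + alpha * noise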
Mask size filtering.
Since some of the BigBiGAN-produced images are low-quality and do not contain clear objects, the corresponding masks can result in very noisy supervision. To avoid this, we apply a simple filter that excludes the images where the ratio of foreground pixels exceeds a fixed threshold.
Histogram filtering.
Since G(z + h_bg) should contain mostly dark and light pixels, we filter out the images that are not contrastive enough. Formally, we compute the intensity histogram of the grayscaled G(z + h_bg), smooth it by taking a moving average, and filter out the samples that have local maxima outside the first/last buckets of the histogram.
Connected components filtering.
For each generated mask M, we group the foreground pixels into connected (by edges) groups, forming clusters M_1, ..., M_k. Assuming that M_1 is the cluster with the maximal area, we exclude all clusters M_i whose area is smaller than a fixed fraction of |M_1|. This technique removes visual artifacts from the synthetic data. A combined sketch of the three filters is given below.
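The sketch below illustrates all three filters. The concrete thresholds (foreground ratio, number of histogram bins, smoothing window, edge-bucket width, minimal relative component area) are placeholders of our own choosing, since the exact settings are not reproduced here.

```python
import numpy as np
from scipy import ndimage

def passes_size_filter(mask, max_fg_ratio=0.5):
    """Reject masks whose foreground ratio is too large (placeholder threshold)."""
    return mask.mean() <= max_fg_ratio

def passes_histogram_filter(gray_shifted, n_bins=64, window=5, edge=8):
    """Keep only contrastive images: after smoothing the intensity histogram of
    the grayscaled G(z + h_bg), every local maximum must lie near the dark or
    light end of the histogram (bin counts and window are placeholders)."""
    hist, _ = np.histogram(gray_shifted, bins=n_bins, range=(0.0, 1.0))
    smooth = np.convolve(hist, np.ones(window) / window, mode="same")
    peaks = [i for i in range(1, n_bins - 1)
             if smooth[i] > smooth[i - 1] and smooth[i] > smooth[i + 1]]
    return all(i < edge or i >= n_bins - edge for i in peaks)

def filter_connected_components(mask, min_rel_area=0.25):
    """Remove foreground components much smaller than the largest one
    (the relative-area fraction is a placeholder)."""
    labels, n = ndimage.label(mask)                 # edge-connected components
    if n == 0:
        return mask
    areas = ndimage.sum(mask, labels, index=range(1, n + 1))
    keep_labels = 1 + np.where(areas >= min_rel_area * areas.max())[0]
    return mask * np.isin(labels, keep_labels)
```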
Figure 3: Examples of mask improvement. Left: a sample rejected by the mask size filter. Middle: a sample rejected by the histogram filter. Right block: mask pixels removed by the connected components filter are shown in blue, and the remaining mask pixels are shown in red.
Given a large amount of synthetic data, one can train an existing image-to-image CNN architecture in the fully-supervised regime. The whole pipeline is schematically presented in Figure 4. In all our experiments we employ a simple U-Net architecture [12]. We train the U-Net on the synthetic dataset with the Adam optimizer and the binary cross-entropy objective applied at the pixel level, using a batch size of 95 and a fixed number of optimizer steps, with the learning rate decreased by a constant factor partway through training. During inference, we rescale an input image to have a size of 128 along its shorter side and scale the color channels to [−1, 1]. Compared to existing unsupervised alternatives, the training of our model is extremely simple and does not involve a large number of hyperparameters. The only hyperparameters in our protocol are the batch size, the learning rate schedule, and the number of optimizer steps, and we tune them on a hold-out validation set of synthetic data. Training with on-line synthetic data generation takes approximately seven hours on two Nvidia 1080Ti cards; a simplified sketch of this training stage is given below.

Figure 4: Schematic representation of our approach: a real image I_real is encoded as z = E(I_real), the images G(z) and G(z + h_bg) and the resulting masks are filtered, and the surviving pairs supervise the U-Net loss.

The goal of this section is to confirm that the usage of GAN-produced synthetic data is a promising direction for unsupervised saliency detection and object segmentation. To this end, we extensively compare our approach to the existing unsupervised counterparts on standard benchmarks.
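As referenced above, the following is a simplified sketch of the U-Net training stage on GAN-produced (image, mask) pairs. Names such as `unet` and `synthetic_loader` are our own, and the schedule shown is illustrative rather than the exact recipe.

```python
import torch
import torch.nn as nn

def train_segmenter(unet, synthetic_loader, n_steps, lr=1e-3, device="cuda"):
    """Fit a segmentation network on (image, mask) pairs produced by the GAN
    pipeline; the pixel-level objective is binary cross-entropy."""
    unet = unet.to(device).train()
    optimizer = torch.optim.Adam(unet.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()          # assumes the network outputs logits
    step = 0
    while step < n_steps:
        for images, masks in synthetic_loader:  # synthetic images and their masks
            images, masks = images.to(device), masks.to(device)
            loss = criterion(unet(images), masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= n_steps:
                break
    return unet.eval()
```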
Evaluation metrics.
All the methods are compared in terms of the three measures described below.

• F-measure is an established measure in the saliency detection literature. It is defined as F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall). Here Precision and Recall are calculated from the binarized predicted masks and the groundtruth masks as Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP, TN, FP, FN denote true positives, true negatives, false positives, and false negatives, respectively. We compute the F-measure for 255 uniformly distributed binarization thresholds and report its maximum value, max F_β. We use β² = 0.3 for consistency with existing works.

• IoU (intersection over union) is calculated on the binarized predicted masks and the groundtruth as IoU(s, m) = µ(s ∩ m) / µ(s ∪ m), where µ denotes the area; the masks are binarized with a fixed threshold.

• Accuracy measures the proportion of pixels that have been correctly assigned to the object/background; the masks are binarized with a fixed threshold.

Since the existing literature uses different benchmark datasets for saliency detection and object segmentation, we perform a separate comparison for each task below.

Object segmentation.

Datasets. We use the two following datasets from the literature on segmentation with generative models.

• Caltech-UCSD Birds 200-2011 [28] contains 11788 photographs of birds with segmentation masks. We follow [7] and use 10000 images for our training subset and 1000 for the test subset, from the splits provided by [7]. Unlike [7], we do not use any images for validation and simply omit the remaining 788 images.

• Flowers [29] contains 8189 images of flowers equipped with saliency masks generated automatically via a method developed specifically for flowers. In experiments with the Flowers dataset, we do not apply the mask area filter in our method, since it rejects most of the samples.

On these two datasets we compare the following methods:

• PerturbGAN [8] segments an image based on the idea that the object location can be perturbed without affecting the scene realism. For comparison, we use the numbers reported in [8].

• ReDO [7] produces segmentation masks based on the idea that the object appearance can be changed without affecting the image quality. For comparison, we report the numbers from [7].

• BigBiGAN is our method where the latent codes are sampled from z ∼ N(0, I).

• E-BigBiGAN (w/o z-noising) is our method where the latent codes of synthetic data are sampled from the outputs of the encoder E applied to the train images of the dataset at hand.

• E-BigBiGAN (with z-noising) is the same as above, with latent codes sampled from the vicinity of the embeddings using a nonzero neighborhood size α.

The comparison results are provided in Table 1, which demonstrates the significant advantage of our scheme. Note that, since both datasets in this comparison are small-scale, z-noising considerably improves the performance by increasing the diversity of training images.

Method | CUB-200-2011 | Flowers
PerturbGAN | — / 0.380 / — | — / — / —
ReDO | — / 0.426 / 0.845 | — / 0.764 / 0.879
BigBiGAN | 0.794 / 0.683 / 0.930 | 0.760 / 0.540 / 0.765
E-BigBiGAN (w/o z-noising) | 0.750 / 0.619 / 0.918 | 0.814 / 0.689 / 0.874
E-BigBiGAN (with z-noising), std | 0.005 / 0.007 / 0.002 | 0.001 / < / <

Table 1: Comparison on the object segmentation datasets. Each cell reports max F_β / IoU / Accuracy.

Saliency detection.

We use the following established benchmarks for saliency detection. For all the datasets, groundtruth pixel-level saliency masks are available.

• ECSSD [30] contains 1,000 images with structurally complex natural contents.
• DUTS [31] contains 10,553 train and 5,019 test images. The train images are selected from the ImageNet detection train/val sets, and the test images are selected from the ImageNet test set and the SUN dataset [32]. We always report the performance on the DUTS-test subset.

• DUT-OMRON [19] contains 5,168 images of high content variety.
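For reference, the three measures defined under "Evaluation metrics" above (max F_β, IoU, and accuracy) can be computed as in the following sketch; the binarization thresholds shown are illustrative defaults rather than the exact values used.

```python
import numpy as np

def max_f_measure(pred, gt, beta2=0.3, n_thresholds=255):
    """Maximum F_beta over uniformly spaced binarization thresholds
    (beta2 stands for beta^2 in the F-measure formula)."""
    best = 0.0
    for tau in np.linspace(0.0, 1.0, n_thresholds):
        binary = pred >= tau
        tp = np.logical_and(binary, gt > 0).sum()
        precision = tp / max(binary.sum(), 1)
        recall = tp / max((gt > 0).sum(), 1)
        if precision + recall > 0:
            best = max(best, (1 + beta2) * precision * recall
                             / (beta2 * precision + recall))
    return best

def iou(pred, gt, threshold=0.5):       # threshold value is a placeholder
    binary = pred >= threshold
    union = np.logical_or(binary, gt > 0).sum()
    return np.logical_and(binary, gt > 0).sum() / max(union, 1)

def accuracy(pred, gt, threshold=0.5):  # threshold value is a placeholder
    return ((pred >= threshold) == (gt > 0)).mean()
```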
Baselines.
While there are a large number of papers on unsupervised deep saliency detection, all of them employ pretrained supervised models in their training protocols. Therefore, we use the most recent "shallow" methods HS [33], wCtr [34], and WSC [35] as the baselines. These three methods were chosen based on their state-of-the-art performance reported in the literature and their publicly available implementations. The results of the comparison are reported in Table 2. In this table, BigBiGAN denotes the version of our method where the latent codes of synthetic images are sampled from z ∼ N(0, I). In turn, in E-BigBiGAN, z are sampled from the latent codes of Imagenet-train images for all three datasets. Since the Imagenet dataset is large enough, we do not employ z-noising in this comparison.

Method | ECSSD | DUTS | DUT-OMRON
HS | 0.673 / 0.508 / 0.847 | 0.504 / 0.369 / 0.826 | 0.561 / 0.433 / 0.843
wCtr | 0.684 / 0.517 / 0.862 | 0.522 / 0.392 / 0.835 | 0.541 / 0.416 / 0.838
WSC | 0.683 / 0.498 / 0.852 | 0.528 / 0.384 / 0.862 | 0.523 / 0.387 /
BigBiGAN | 0.782 / 0.672 / 0.899 | 0.608 / 0.498 / 0.878 | 0.549 / 0.453 / 0.856
E-BigBiGAN | | |

Table 2: Comparison to the unsupervised "shallow" saliency detection baselines. Each cell reports max F_β / IoU / Accuracy.
Figure 5: Top: images from the DUTS-test dataset. Middle: groundtruth masks. Bottom: masks produced by the E-BigBiGAN method.
A problem closely related to segmentation is object localization, where for a given image one has to provide a bounding box instead of a segmentation mask. In this section, we demonstrate that our unsupervised method performs on par with the weakly-supervised state-of-the-art. To compare with the previous literature, we use the numbers from the very recent evaluation paper [2], which reviews a large number of existing WSOL methods and reports the actual state-of-the-art. We employ exactly the same evaluation protocols as in [2] and compare the prior works with our E-BigBiGAN method, which samples z from the latent codes of Imagenet-train images, as described in Section 3.2. The comparison results are provided in Table 3.

Evaluation metrics.
For the WSOL problem we use the following metrics [2]:

• MaxBoxAcc [36, 1]. For an image I_n, let s_n be the predicted mask and B_n^{(i)}, i = 1, ..., m, the set of groundtruth bounding boxes (some datasets provide several bounding boxes per image). Let us select a threshold τ ∈ [0, 1] and denote by c_n^τ the largest (in terms of area) connected component of the mask s_n binarized with threshold τ, and by box(c_n^τ) the minimal bounding box containing the set c_n^τ. Then we define

BoxAcc(τ) = (1/N) Σ_{n=1}^{N} [ IoU(box(c_n^τ), B_n^{(j)}) ≥ 0.5 ],   (1)

where B_n^{(j)} is the groundtruth bounding box with the maximal IoU with box(c_n^τ) and N denotes the number of images. The final metric MaxBoxAcc is the maximum of BoxAcc(τ) over all thresholds τ.

• PxAP [37]. Let s_n be a predicted mask and t_n the groundtruth mask. For a threshold τ ∈ [0, 1] we define the pixel precision and recall

P_τ = (1/N) Σ_{n=1}^{N} |{s_n ≥ τ} ∩ {t_n = 1}| / |{s_n ≥ τ}|;   R_τ = (1/N) Σ_{n=1}^{N} |{s_n ≥ τ} ∩ {t_n = 1}| / |{t_n = 1}|,   (2)

i.e., both values are averaged over all images. PxAP is then defined as the area under the pixel precision-recall curve.
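To make these definitions concrete, the sketch below computes PxAP following Eq. (2); the naming is ours, and MaxBoxAcc can be computed analogously by sweeping τ and comparing the tight box around the largest connected component of the binarized mask with the groundtruth boxes.

```python
import numpy as np

def pixel_average_precision(preds, gts, n_thresholds=256):
    """PxAP: area under the pixel precision-recall curve, where precision and
    recall at each threshold are averaged over all images."""
    precisions, recalls = [], []
    for tau in np.linspace(0.0, 1.0, n_thresholds):
        p, r = [], []
        for s, t in zip(preds, gts):                 # predicted mask, binary GT mask
            hit = np.logical_and(s >= tau, t == 1).sum()
            p.append(hit / max((s >= tau).sum(), 1))
            r.append(hit / max((t == 1).sum(), 1))
        precisions.append(np.mean(p))
        recalls.append(np.mean(r))
    order = np.argsort(recalls)                      # integrate along increasing recall
    return np.trapz(np.array(precisions)[order], np.array(recalls)[order])
```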
Datasets.
We use the following benchmarks for weakly-supervised object localization.

• Imagenet [36]. For evaluation we use validation images; the dataset contains several annotated bounding boxes for each image.

• Caltech-UCSD Birds 200-2011 [28]. For evaluation we use the test images.

• OpenImages [2] is a subset of the OpenImages instance segmentation dataset [38]. For evaluation we use randomly selected images from the classes chosen in [2].

Method | Imagenet (MaxBoxAcc) | CUB (MaxBoxAcc) | OpenImages (PxAP)
Previous SOTA [2] | 0.654 | 0.781 | 0.630
E-BigBiGAN | 0.614 | 0.742 | 0.638

Table 3: Comparison of E-BigBiGAN to the WSOL state-of-the-art. For E-BigBiGAN we report mean values over 10 independent runs. Despite being completely unsupervised, E-BigBiGAN performs on par with the WSOL methods, which were trained under more supervision.

In Table 4 we demonstrate the impact of the individual components of our method. We start with a saliency detection model trained on the synthetic data pairs {G(z), M = [G(z + h_bg) > G(z)]} with z ∼ N(0, I), and then add, one by one, the components listed in Section 3.2. The most significant performance impact comes from using the latent codes of real Imagenet images.

Method | ECSSD | DUTS | DUT-OMRON
Base | 0.737 / 0.626 / 0.859 | 0.575 / 0.454 / 0.817 | 0.498 / 0.389 / 0.758
+Imagenet embeddings | 0.773 / 0.657 / 0.874 | 0.616 / 0.483 / 0.832 | 0.533 / 0.413 / 0.772
+Size filter | 0.781 / 0.670 / 0.900 | 0.620 / 0.499 / 0.871 | 0.552 / 0.443 / 0.842
+Histogram | 0.779 / 0.670 / 0.900 | 0.621 / 0.503 / 0.875 | 0.555 / 0.450 / 0.850
+Connected components | | |
Table 4: Impact of the different components of the E-BigBiGAN pipeline. Each cell reports max F_β / IoU / Accuracy.
In our paper, we continue the line of work on unsupervised object segmentation with the aid of generative models. While the existing unsupervised techniques require adversarial training, we introduce an alternative research direction based on high-quality synthetic data from an off-the-shelf GAN. Namely, we utilize the images produced by the BigBiGAN model, which is trained on the Imagenet dataset. Exploring BigBiGAN, we have discovered that its latent space semantics allows us to produce saliency masks for synthetic images automatically via latent space manipulations. As shown in the experiments, this synthetic data is an excellent source of supervision for discriminative computer vision models. The main feature of our approach is its simplicity and reproducibility, since our model does not rely on a large number of components and hyperparameters. On several common benchmarks, we demonstrate that our method achieves superior performance compared to existing unsupervised competitors.

We also highlight the fact that state-of-the-art generative models, such as BigBiGAN, can be successfully used to generate training data for yet another computer vision task. We expect that other problems, such as semantic segmentation, can also benefit from GAN-produced data in the weakly-supervised or few-shot regimes. Since the quality of GANs will likely improve in the future, we expect that the usage of synthetic data will become increasingly widespread.
References

[1] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[2] Junsuk Choe, Seong Joon Oh, Seungho Lee, Sanghyuk Chun, Zeynep Akata, and Hyunjung Shim. Evaluating weakly supervised object localization methods right. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[3] Wenguan Wang, Qiuxia Lai, Huazhu Fu, Jianbing Shen, and Haibin Ling. Salient object detection in the deep learning era: An in-depth survey. arXiv preprint arXiv:1904.09146, 2019.
[4] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[5] Xide Xia and Brian Kulis. W-Net: A deep model for fully unsupervised image segmentation. arXiv preprint arXiv:1711.08506, 2017.
[6] Xu Ji, João F. Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 9865–9874, 2019.
[7] Mickaël Chen, Thierry Artières, and Ludovic Denoyer. Unsupervised object segmentation by redrawing. In Advances in Neural Information Processing Systems, pages 12705–12716, 2019.
[8] Adam Bielski and Paolo Favaro. Emergence of object segmentation in perturbed generative models. In Advances in Neural Information Processing Systems, pages 7254–7264, 2019.
[9] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pages 10541–10551, 2019.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[11] Andrey Voynov and Artem Babenko. Unsupervised discovery of interpretable directions in the GAN latent space. arXiv preprint arXiv:2002.03754, 2020.
[12] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[13] Wangjiang Zhu, Shuang Liang, Yichen Wei, and Jian Sun. Saliency optimization from robust background detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2814–2821, 2014.
[14] Huaizu Jiang, Jingdong Wang, Zejian Yuan, Yang Wu, Nanning Zheng, and Shipeng Li. Salient object detection: A discriminative regional feature integration approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2083–2090, 2013.
[15] Houwen Peng, Bing Li, Haibin Ling, Weiming Hu, Weihua Xiong, and Stephen J. Maybank. Salient object detection via structured matrix decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):818–832, 2016.
[16] Runmin Cong, Jianjun Lei, Huazhu Fu, Qingming Huang, Xiaochun Cao, and Chunping Hou. Co-saliency detection for RGBD images based on multi-constraint feature matching and cross label propagation. IEEE Transactions on Image Processing, 27(2):568–579, 2017.
[17] Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Philip H. S. Torr, and Shi-Min Hu. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):569–582, 2014.
[18] Y. Wei, F. Wen, W. Zhu, and J. Sun. Geodesic saliency using background priors. In ICCV, 2012.
[19] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3166–3173, 2013.
[20] Wenguan Wang, Jianbing Shen, Ling Shao, and Fatih Porikli. Correspondence driven saliency transfer. IEEE Transactions on Image Processing, 25(11):5025–5034, 2016.
[21] Fang Guo, Wenguan Wang, Jianbing Shen, Ling Shao, Jian Yang, Dacheng Tao, and Yuan Yan Tang. Video saliency detection using object proposals. IEEE Transactions on Cybernetics, 48(11):3159–3170, 2017.
[22] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pages 109–117, 2011.
[23] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 136–145, 2017.
[24] Jing Zhang, Tong Zhang, Yuchao Dai, Mehrtash Harandi, and Richard Hartley. Deep unsupervised saliency detection: A multiple noisy labeling perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9029–9038, 2018.
[25] Dingwen Zhang, Junwei Han, and Yu Zhang. Supervision by fusion: Towards unsupervised learning of deep salient object detector. In Proceedings of the IEEE International Conference on Computer Vision, pages 4048–4056, 2017.
[26] Tam Nguyen, Maximilian Dax, Chaithanya Kumar Mummadi, Nhung Ngo, Thi Hoai Phuong Nguyen, Zhongyu Lou, and Thomas Brox. DeepUSPS: Deep robust unsupervised saliency prediction via self-supervision. In Advances in Neural Information Processing Systems, pages 204–214, 2019.
[27] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[28] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset, 2011.
[29] Maria-Elena Nilsback and Andrew Zisserman. Delving into the whorl of flower segmentation. In BMVC, pages 1–10, 2007.
[30] Jianping Shi, Qiong Yan, Li Xu, and Jiaya Jia. Hierarchical image saliency detection on extended CSSD. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4):717–729, 2015.
[31] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 136–145, 2017.
[32] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492. IEEE, 2010.
[33] Qiong Yan, Li Xu, Jianping Shi, and Jiaya Jia. Hierarchical saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1155–1162, 2013.
[34] Wangjiang Zhu, Shuang Liang, Yichen Wei, and Jian Sun. Saliency optimization from robust background detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2814–2821, 2014.
[35] Nianyi Li, Bilin Sun, and Jingyi Yu. A weighted sparse coding framework for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5216–5223, 2015.
[36] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[37] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Süsstrunk. Frequency-tuned salient region detection. In CVPR, pages 1597–1604. IEEE, 2009.
[38] Rodrigo Benenson, Stefan Popov, and Vittorio Ferrari. Large-scale interactive object segmentation with human annotators. In CVPR, 2019.