SalGAN: Visual Saliency Prediction with Generative Adversarial Networks
Junting Pan (a), Cristian Canton-Ferrer (b), Kevin McGuinness (c), Noel E. O'Connor (c), Jordi Torres (d), Elisa Sayrol (a), Xavier Giro-i-Nieto (a,**)

(a) Universitat Politecnica de Catalunya, Barcelona 08034, Catalonia/Spain
(b) Facebook AML, Seattle, WA, United States of America
(c) Insight Centre for Data Analytics, Dublin City University, Dublin 9, Ireland
(d) Barcelona Supercomputing Center, Barcelona 08034, Catalonia/Spain

** Corresponding author: [email protected] (Xavier Giro-i-Nieto)

Computer Vision and Image Understanding
ABSTRACT

Recent approaches for saliency prediction are generally trained with a loss function based on a single saliency metric, which can lead to poor performance when evaluated with other saliency metrics. In this paper, we propose a data-driven saliency prediction method, named SalGAN (Saliency GAN), trained with an adversarial loss function. SalGAN consists of two networks: one predicts saliency maps from the raw pixels of an input image; the other takes the output of the first and discriminates whether a saliency map is a prediction or ground truth. By trying to make the predicted saliency map indistinguishable from the ground truth, SalGAN is expected to generate saliency maps that resemble the ground truth. Our experiments show that adversarial training allows our model to obtain state-of-the-art performance across various saliency metrics.
1. Introduction
Visual saliency describes the spatial locations in an image that attract human attention. It is understood as the result of a bottom-up process in which a human observer explores the image for a few seconds with no particular task in mind. Saliency prediction is therefore valuable for many machine vision tasks such as object recognition (Walther et al., 2002).

Visual saliency data are traditionally collected with eye-trackers (Judd et al., 2009a), and more recently with mouse clicks (Jiang et al., 2015) or webcams (Krafka et al., 2016). The salient points of the image are aggregated and convolved with a Gaussian kernel to obtain a saliency map. The result is a gray-scale image or heat map that represents the probability of each pixel in the image capturing human attention.

Considerable research effort has gone into designing an optimal loss function for saliency prediction. Some state-of-the-art methods (Huang et al., 2015) adopt saliency-based metrics as losses, while others (Pan et al., 2016; Cornia et al., 2016; Jetley et al., 2016; Sun et al., 2017) use distances in saliency map space.
Fig. 1. Example of saliency map generation where the proposed system (SalGAN) outperforms a standard binary cross entropy (BCE) prediction model.

How to choose or design the best training loss is still an open problem. In addition, different saliency metrics diverge in how they define the meaning of a saliency map, and this introduces inconsistencies in model comparison. For instance, it has been pointed out that the optimal metric for model optimization may depend on the final application (Bylinskii et al., 2016a).

To this end, instead of designing a tailored loss function, we introduce adversarial training for visual saliency prediction, inspired by generative adversarial networks (GANs) (Goodfellow et al., 2014). We name the proposed method SalGAN. We focus on exploring the benefits of an adversarial loss that encourages the output saliency map to be indistinguishable from real saliency maps. In GANs, training is driven by two competing agents: the generator, which synthesizes samples that match the training data, and the discriminator, which distinguishes between a real sample drawn directly from the training data and a fake one synthesized by the generator. In our case, the data distribution corresponds to pairs of real images and their corresponding visual saliency maps.

Specifically, SalGAN estimates the saliency map of an input image using a deep convolutional neural network (DCNN). As shown in Figure 2, this network is initially trained with a binary cross entropy (BCE) loss over down-sampled versions of the saliency maps. The model is then refined with a discriminator network trained to solve a binary classification task between the saliency maps generated by SalGAN and the real ones used as ground truth. Our experiments show how adversarial training reaches state-of-the-art performance across different metrics when combined with a BCE content loss in a single-tower and single-task model.

To summarize, we investigate the introduction of an adversarial loss for visual saliency learning. By adding the adversarial loss to a BCE saliency prediction model, we achieve state-of-the-art performance on the MIT300 and SALICON datasets for almost all evaluation metrics.

The remainder of the text is organized as follows. Section 2 reviews the state-of-the-art models for visual saliency prediction, discussing the loss functions they are based upon, their relation to the different metrics, and their complexity in terms of architecture and training. Section 3 presents SalGAN, our deep convolutional neural network based on a convolutional encoder-decoder architecture, as well as the discriminator network used during its adversarial training. Section 4 describes the training process of SalGAN and the loss functions used. Section 5 includes the experiments and results of the presented techniques. Finally, Section 6 closes the paper by drawing the main conclusions.

Our results can be reproduced with the source code and trained models available at https://imatge-upc.github.io/saliency-salgan-2017/.
2. Related work
Saliency prediction has received interest from the research community for many years. The seminal work of Itti et al. (1998) proposed to predict saliency maps by considering low-level features at multiple scales and combining them to form a saliency map. Harel et al. (2006), also starting from low-level feature maps, introduced a graph-based saliency model that defines Markov chains over various image maps and treats the equilibrium distribution over map locations as activation and saliency values. Judd et al. (2009b) presented a bottom-up, top-down model of saliency based not only on low-level but also on mid- and high-level image features. Borji (2012) combined the low-level feature saliency maps of the previous best bottom-up models with top-down cognitive visual features and learned a direct mapping from those features to eye fixations.

As in many other fields in computer vision, a number of deep learning solutions have recently been proposed that significantly improve performance. For example, the Ensemble of Deep Networks (eDN) (Vig et al., 2014) was an early architecture that automatically learned representations for saliency prediction, blending feature maps from different layers. Pan et al. (2016) compared two convolutional neural networks trained end-to-end for saliency prediction: a lighter one designed and trained from scratch, and a second, deeper one pre-trained for image classification. DCNNs have shown better results even when pre-trained on datasets built for other purposes. DeepGaze (Kümmerer et al., 2015a) used the well-known AlexNet (Krizhevsky et al., 2012), pre-trained on ImageNet (Deng et al., 2009), with a readout network on top whose inputs consist of the outputs of several AlexNet layers. The output of the network is blurred, center biased, and converted to a probability distribution using a softmax. Huang et al. (2015), in the so-called SALICON net, obtained better results by using VGG rather than AlexNet or GoogLeNet (Szegedy et al., 2015). Their proposal considered two networks with fine and coarse inputs, whose output feature maps are concatenated.

Li and Yu (2015) proposed a multi-resolution convolutional neural network trained on image regions centered on fixation and non-fixation locations over multiple resolutions. Diverse top-down visual features can be learned in higher layers, and bottom-up visual saliency can also be inferred by combining information over multiple resolutions. These ideas are further developed in the more recent DSCLRCN (Liu and Han, 2018), where the proposed model learns saliency-related local features at each image location in parallel and then learns to simultaneously incorporate global context and scene context to infer saliency; it incorporates a model to effectively learn long-term spatial interactions and scene contextual modulation. DeepGaze II (Kümmerer et al., 2017) sets the state of the art on the MIT300 dataset by combining features trained for image recognition with four layers of 1×1 convolutions. Both DSCLRCN and DeepGaze II obtain excellent results on the benchmarks when combined with a center bias, which is not used in SalGAN, whose reported results are purely the network output at inference time.
MLNET (Cornia et al., 2016) proposes an architecture that combines features extracted at different levels of a DCNN. The authors introduce a loss function inspired by three objectives: to measure similarity with the ground truth, to keep the predicted maps invariant to their maximum value, and to give importance to pixels with high ground truth fixation probability.
Fig. 2. Overall architecture of the proposed saliency system. The saliency prediction network predicts an output saliency map given a natural image as input. The pair of saliency map and image is then fed into the discriminator network, whose output is a score indicating whether the input saliency map is real or fake.

In fact, choosing an appropriate loss function has become an issue in itself that can lead to improved results. Another interesting contribution of Huang et al. (2015) lies in minimizing loss functions based on metrics that are differentiable, such as NSS, CC, SIM, and KL divergence (see Riche et al. (2013) and Kümmerer et al. (2015b) for the definition of these metrics; a thorough comparison of metrics can be found in (Bylinskii et al., 2016a)). In (Huang et al., 2015), KL divergence gave the best results. Jetley et al. (2016) also tested loss functions based on probability distances, such as the χ² divergence, total variation distance, KL divergence, and Bhattacharyya distance, by considering saliency maps as generalized Bernoulli distributions. The Bhattacharyya distance was found to give the best results.

In our work we present a network architecture that takes a different approach. By incorporating a high-level adversarial loss into the conventional saliency prediction training approach, the proposed method achieves state-of-the-art performance on both the MIT300 and SALICON datasets by a clear margin.
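For reference, the following minimal Python sketch shows one common way of computing two of the metrics mentioned above, NSS and the KL divergence, between a predicted map and the ground truth. It assumes maps stored as NumPy arrays and is an illustrative approximation, not the official benchmark code.

```python
import numpy as np

def nss(saliency_map, fixation_map):
    """Normalized Scanpath Saliency: mean of the standardized saliency
    map at fixated locations (fixation_map is a binary map)."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(s[fixation_map.astype(bool)].mean())

def kl_divergence(pred_map, gt_map, eps=1e-8):
    """KL divergence between ground truth and prediction, with both maps
    interpreted as spatial probability distributions (normalized to sum to 1)."""
    p = pred_map / (pred_map.sum() + eps)
    q = gt_map / (gt_map.sum() + eps)
    return float(np.sum(q * np.log(eps + q / (p + eps))))
```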
3. Architecture
The training of SalGAN is driven by two competing convolutional neural networks: a generator of saliency maps, which is SalGAN itself, and a discriminator network, which aims at distinguishing between the real saliency maps and those generated by SalGAN. This section provides details on the structure of both modules, the considered loss functions, and the initialization before beginning adversarial training. Figure 2 shows the architecture of the system.
The generator network, SalGAN, adopts a convolutional encoder-decoder architecture, where the encoder part includes max pooling layers that decrease the size of the feature maps, while the decoder part uses upsampling layers followed by convolutional filters to construct an output with the same resolution as the input.

The encoder part of the network is identical in architecture to VGG-16 (Simonyan and Zisserman, 2015), omitting the final pooling and fully connected layers. The network is initialized with the weights of a VGG-16 model trained on the ImageNet dataset for object classification (Deng et al., 2009). Only the last two groups of convolutional layers in VGG-16 are modified during the training for saliency prediction, while the earlier layers remain fixed from the original VGG-16 model. We fix these weights to save computational resources during training, even at the possible expense of some loss in performance.

The decoder architecture is structured in the same way as the encoder, but with the ordering of layers reversed, and with pooling layers replaced by upsampling layers. Again, ReLU non-linearities are used in all convolution layers, and a final 1×1 convolution with a sigmoid non-linearity produces the saliency map. Table 1 gives the architecture and layer configuration of the generator.

Table 2 gives the architecture and layer configuration of the discriminator. In short, the network is composed of six 3×3 kernel convolutions interspersed with three pooling layers that reduce the spatial resolution, followed by fully connected layers and a final sigmoid activation.

Table 1. Architecture of the generator network (layer-by-layer configuration: depth, kernel size, stride, padding, and activation).

Table 2. Architecture of the discriminator network (layer-by-layer configuration: depth, kernel size, stride, padding, and activation).
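To make the encoder-decoder and discriminator structure concrete, the sketch below outlines a possible PyTorch implementation along the lines described in this section: a VGG-16 encoder with the earlier blocks frozen, a mirrored decoder with upsampling and a final 1×1 convolution with sigmoid, and a small convolutional discriminator over image-saliency (RGBS) pairs. Layer widths and the exact discriminator configuration are illustrative rather than copied from Tables 1 and 2.

```python
import torch
import torch.nn as nn
from torchvision import models

class Generator(nn.Module):
    """Encoder-decoder saliency predictor (illustrative layer widths)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.encoder = vgg.features[:-1]           # VGG-16 conv layers, drop final max pool
        for p in self.encoder[:17].parameters():   # freeze the earlier conv blocks
            p.requires_grad = False

        def up_block(cin, cout):                   # upsample + two convs (mirrors a VGG block)
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode='nearest'),
                nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

        self.decoder = nn.Sequential(
            up_block(512, 512), up_block(512, 256),
            up_block(256, 128), up_block(128, 64),
            nn.Conv2d(64, 1, 1), nn.Sigmoid())      # final 1x1 conv + sigmoid

    def forward(self, x):
        return self.decoder(self.encoder(x))

class Discriminator(nn.Module):
    """Classifies (image, saliency map) pairs as real or predicted."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(              # six 3x3 convs with three poolings
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(100), nn.Tanh(),
            nn.Linear(100, 1), nn.Sigmoid())         # probability that the pair is real

    def forward(self, image, saliency):
        x = torch.cat([image, saliency], dim=1)      # RGBS input: image channels + map
        return self.classifier(self.features(x))
```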
4. Training
The filter weights in SalGAN have been trained over a perceptual loss (Johnson et al., 2016) resulting from combining a content loss and an adversarial loss. The content loss follows a classic approach in which the predicted saliency map is compared pixel-wise with the corresponding ground truth map. The adversarial loss depends on the real/synthetic prediction of the discriminator over the generated saliency map.

The content loss is computed on a per-pixel basis, where each value of the predicted saliency map is compared with its corresponding peer from the ground truth map. Given an image I of dimensions N = W × H, we represent the saliency map S as a vector of probabilities, where S_j is the probability of pixel I_j being fixated. A content loss function L(S, Ŝ) is defined between the predicted saliency map Ŝ and its corresponding ground truth S.

The first considered content loss is the mean squared error (MSE), or Euclidean loss, defined as:

L_{MSE} = \frac{1}{N} \sum_{j=1}^{N} (S_j - \hat{S}_j)^2.   (1)

In our work, MSE is used as a baseline reference, as it has been adopted directly or with some variations in other state-of-the-art solutions for visual saliency prediction (Pan et al., 2016; Cornia et al., 2016). Solutions based on MSE aim at maximizing the peak signal-to-noise ratio (PSNR); they tend to filter high spatial frequencies in the output, thereby favoring blurred contours. MSE corresponds to computing the Euclidean distance between the predicted saliency and the ground truth.

Ground truth saliency maps are normalized so that each value lies in the range [0, 1] and can be interpreted as the probability of the corresponding pixel being fixated. The prediction can then be compared with the ground truth using the binary cross entropy (BCE) loss, averaged over all pixels:

L_{BCE} = -\frac{1}{N} \sum_{j=1}^{N} \left[ S_j \log \hat{S}_j + (1 - S_j) \log(1 - \hat{S}_j) \right].   (2)

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are commonly used to generate images with realistic statistical properties. The idea is to simultaneously fit two parametric functions. The first of these functions, known as the generator, is trained to transform samples from a simple distribution (e.g. Gaussian) into samples from a more complicated distribution (e.g. natural images). The second function, the discriminator, is trained to distinguish between samples from the true distribution and generated samples. Training proceeds by alternating between training the discriminator using generated and real samples, and training the generator by keeping the discriminator weights constant and backpropagating the error through the discriminator to update the generator weights.

The saliency prediction problem has some important differences from the above scenario. First, the objective is to fit a deterministic function that predicts realistic saliency values from images, rather than realistic images from random noise. As such, in our case the input to the generator (the saliency prediction network) is not random noise but an image. Second, the input image that a saliency map corresponds to is essential, since the goal is not only for the two saliency maps to become indistinguishable, but for them to do so conditioned on the same input image. We therefore include both the image and the saliency map as inputs to the discriminator network. Finally, when using generative adversarial networks to generate realistic images, there is generally no ground truth to compare against. In our case, however, the corresponding ground truth saliency map is available.
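As a minimal illustration, the two content losses of Equations (1) and (2) can be written as follows, assuming predicted and ground truth maps are PyTorch tensors with values in [0, 1]:

```python
import torch
import torch.nn.functional as F

def mse_content_loss(pred, gt):
    """Eq. (1): mean squared error between predicted and ground-truth maps."""
    return F.mse_loss(pred, gt)

def bce_content_loss(pred, gt, eps=1e-7):
    """Eq. (2): per-pixel binary cross entropy; ground-truth values in [0, 1]
    are interpreted as fixation probabilities."""
    pred = pred.clamp(eps, 1 - eps)
    return -(gt * torch.log(pred) + (1 - gt) * torch.log(1 - pred)).mean()
```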
When updating the parameters of the generator function, we found that using a loss function that combines the error from the discriminator with the cross entropy with respect to the ground truth improved the stability and convergence rate of the adversarial training. The final loss function for the saliency prediction network during adversarial training can be formulated as:

L = \alpha \cdot L_{BCE} + L(D(I, \hat{S}), 1),   (3)

where L is the binary cross entropy loss, 1 is the target category for real samples, and 0 the target category for fake (predicted) samples. Here, instead of minimizing -L(D(I, \hat{S}), 0), we minimize L(D(I, \hat{S}), 1), which provides stronger gradients, similar to (Goodfellow et al., 2014). D(I, \hat{S}) is the probability of fooling the discriminator network, so the loss associated with the saliency prediction network grows when the chances of fooling the discriminator are lower. During the training of the discriminator, no content loss is available and the loss function is:

L_D = L(D(I, S), 1) + L(D(I, \hat{S}), 0).   (4)

At training time, we first bootstrap the saliency prediction network by training for 15 epochs using only BCE, computed with respect to the down-sampled output and ground truth saliency. After this, we add the discriminator and begin adversarial training (a sketch of this alternating scheme is given below, after Table 3). The input to the discriminator network is an RGBS image of size 256 × 192 × 4, formed by concatenating the RGB channels of the input image with the (predicted or ground truth) saliency map. L2 weight regularization (weight decay) is applied when training both the generator and the discriminator. We used AdaGrad for optimization, with an initial learning rate of 3 × 10^{-4}.

Table 3. Impact of downsampled saliency maps (at 15 epochs) evaluated over the SALICON validation set. BCE/x refers to a downsampling factor of 1/x over a saliency map of 256 × 192.

       sAUC ↑   AUC-B ↑   NSS ↑   CC ↑    IG ↑
BCE    0.752    0.825     2.473   0.761   0.712
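The alternating optimization of Equations (3) and (4) can be sketched as follows. This is an illustrative PyTorch-style step, not the original training code; it assumes the generator and discriminator modules and their optimizers have been created beforehand, and that `alpha` is the weight from Equation (3).

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, g_opt, d_opt, image, gt_map, alpha):
    """One alternating update: discriminator on real/fake pairs (Eq. 4),
    then generator with combined BCE + adversarial loss (Eq. 3)."""
    pred_map = generator(image)

    # Discriminator update: real pairs -> target 1, predicted pairs -> target 0.
    d_real = discriminator(image, gt_map)
    d_fake = discriminator(image, pred_map.detach())
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: content loss plus non-saturating adversarial loss.
    # (In the paper the content loss is computed on down-sampled maps.)
    d_fake = discriminator(image, pred_map)
    g_loss = (alpha * F.binary_cross_entropy(pred_map, gt_map) +
              F.binary_cross_entropy(d_fake, torch.ones_like(d_fake)))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```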
5. Experiments
The presented SalGAN model for visual saliency prediction was assessed and compared from different perspectives. First, the impact of using BCE and of downsampling the saliency maps is assessed. Second, the gain of the adversarial loss is measured and discussed, both from a quantitative and a qualitative point of view. Finally, SalGAN is compared to published works to position its performance with respect to the current state-of-the-art.

The experiments aimed at finding the best configuration for SalGAN were run using the train and validation partitions of the SALICON dataset (Jiang et al., 2015). This is a large dataset built by collecting mouse clicks on a total of 20,000 images from the Microsoft Common Objects in Context (MS-COCO) dataset (Lin et al., 2014). We adopted this dataset for our experiments because it is the largest one available for visual saliency prediction. In addition to SALICON, we also present results on MIT300, the benchmark with the largest number of submissions.

The two content losses presented in Section 4, MSE and BCE, were compared to define a baseline upon which we later assess the impact of adversarial training. The first two rows of Table 4 show how a simple change from MSE to BCE brings a consistent improvement in all metrics. This improvement suggests that treating saliency prediction as multiple binary classification problems is more appropriate than treating it as a standard regression problem, in spite of the fact that the target values are not binary. Minimizing cross entropy is equivalent to minimizing the KL divergence between the predicted and target distributions, which is a reasonable objective if both predictions and targets are interpreted as probabilities.

Given the superior performance of the BCE-based loss compared with MSE, we also explored the impact of computing the content loss over downsampled versions of the saliency map. This technique reduces the required computational resources at both training and test times and, as shown in Table 3, not only does it not decrease performance, but it can actually improve it. Given these results, we chose to train SalGAN on saliency maps downsampled by a factor of 1/4, which in our architecture corresponds to saliency maps of 64 × 48.
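As an illustration of what training on downsampled maps means, a BCE content loss over maps reduced by a factor of 1/x can be realized, for example, by average pooling both the prediction and the ground truth before comparing them. This sketch assumes PyTorch tensors of shape (batch, 1, H, W) and is only one possible way of implementing the idea; the paper computes BCE between the down-sampled network output and ground truth.

```python
import torch.nn.functional as F

def downsampled_bce(pred, gt, factor=4):
    """BCE content loss on maps downsampled by `factor`
    (e.g. a 256x192 map becomes 64x48 for factor=4)."""
    pred_small = F.avg_pool2d(pred, kernel_size=factor)
    gt_small = F.avg_pool2d(gt, kernel_size=factor)
    return F.binary_cross_entropy(pred_small, gt_small)
```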
The adversarial loss was introduced after estimating the value of the hyperparameter α in Equation 3 by maximizing the most general metric, Information Gain (IG). As shown in Figure 3, the search was performed on a logarithmic scale over α ∈ {0.5, 0.05, 0.005}, and the value of α that maximized IG on the SALICON validation set was selected for the remaining experiments.

Fig. 3. SALICON validation set Information Gain for different values of the hyperparameter α over varying numbers of epochs.
The first row of results in Table 4 refers to a baseline defined by training SalGAN with the BCE content loss for 15 epochs only. From there, two options are considered: 1) continuing training based on BCE alone (2nd row), or 2) introducing the adversarial loss (3rd and 4th rows).

Figure 4 compares validation set accuracy metrics for training with the combined GAN and BCE loss versus BCE alone as the number of epochs increases. In the case of the AUC metrics (Judd and Borji), increasing the number of epochs does not lead to significant improvements when using BCE alone. The combined BCE/GAN loss, however, continues to improve performance with further training. After 100 and 120 epochs, the combined GAN/BCE loss shows substantial improvements over BCE for five of the six metrics.

The single metric for which adversarial training fails to improve performance is normalized scanpath saliency (NSS). The reason for this may be that GAN training tends to produce a smoother and more spread out estimate of saliency, which better matches the statistical properties of real saliency maps, but may increase the false positive rate. As noted in (Bylinskii et al., 2016a), NSS is very sensitive to such false positives. The impact of increased false positives depends on the final application. In applications where the saliency map is used as a multiplicative attention model (e.g. in retrieval applications, where spatial features are importance weighted), false positives are often less important than false negatives: while the former include more distractors, the latter remove potentially useful features. Note also that NSS is differentiable, so it could potentially be optimized directly when important for a particular application.
Fig. 4. SALICON validation set accuracy metrics (Sim, AUC Judd, AUC Borji, CC, NSS, and IG) for Adversarial + BCE vs BCE on varying numbers of epochs. AUC shuffled is omitted as the trend is identical to that of AUC Judd.

Table 4. Best results through epochs obtained with non-adversarial (MSE and BCE) and adversarial training.

       sAUC ↑   AUC-B ↑   NSS ↑   CC ↑    IG
MSE    0.728    0.820     1.680   0.708   0.628
BCE    0.753    0.825     2.562   0.772   0.824

SalGAN is compared in Table 5 to several other algorithms from the state-of-the-art. The comparison is based on the evaluations run by the organizers of the SALICON and MIT300 benchmarks on test datasets whose ground truth is not public. The two benchmarks offer complementary features: while SALICON is a much larger dataset with 5,000 test images, MIT300 has attracted the participation of many more researchers. In both cases, SalGAN was trained using the 15,000 images contained in the training (10,000) and validation (5,000) partitions of the SALICON dataset. Notice that while both datasets aim at capturing visual saliency, the data acquisition differed: SALICON ground truth was generated from crowdsourced mouse clicks, while MIT300 was built with eye trackers on a limited and controlled group of users. SalGAN presents very competitive results on both datasets, as it improves or equals the performance of all other models in at least one metric.

Table 5. Comparison of SalGAN with other state-of-the-art solutions on the SALICON (test) and MIT300 benchmarks. Values in brackets correspond to performances worse than SalGAN.

SALICON (test)                        AUC-J ↑  Sim ↑   EMD ↓   AUC-B ↑  sAUC ↑  CC ↑    NSS ↑   KL ↓
DSCLRCN (Liu and Han, 2018)           -        -       -       0.884    0.776   0.831   3.157   -
SalGAN                                -        -       -
ML-NET (Cornia et al., 2016)          -        -       -       (0.866)  (0.768) (0.743) 2.789   -
SalNet (Pan et al., 2016)             -        -       -       (0.858)  (0.724) (0.609) (1.859) -

MIT300                                AUC-J ↑  Sim ↑   EMD ↓   AUC-B ↑  sAUC ↑  CC ↑    NSS ↑   KL ↓
Humans                                0.92     1.00    0.00    0.88     0.81    1.00    3.29    0.00
Deep Gaze II (Kümmerer et al., 2017)  (0.84)   (0.43)  (4.52)  (0.83)   0.77    (0.45)  (1.16)  (1.04)
DSCLRCN (Liu and Han, 2018)           0.87     0.68    2.17    (0.79)   0.72    0.80    2.35    0.95
SALICON (Huang et al., 2015)          0.87     (0.60)  (2.62)  0.85     0.74    0.74    2.12    0.54
SalGAN                                0.86     0.63    2.29    0.81     0.72    0.73    2.04    1.07
PDP (Jetley et al., 2016)             (0.85)   (0.60)  (2.58)  (0.80)   0.73    (0.70)  2.05    0.92
ML-NET (Cornia et al., 2016)          (0.85)   (0.59)  (2.63)  (0.75)   (0.70)  (0.67)  2.05    (1.10)
Deep Gaze I (Kümmerer et al., 2015a)  (0.84)   (0.39)  (4.97)  0.83     (0.66)  (0.48)  (1.22)  (1.23)
SalNet (Pan et al., 2016)             (0.83)   (0.52)  (3.31)  0.82     (0.69)  (0.58)  (1.51)  0.81
BMS (Zhang and Sclaroff, 2013)        (0.83)   (0.51)  (3.35)  0.82     (0.65)  (0.55)  (1.41)  0.81
Fig. 5. Example image from MIT300 containing a salient region (marked in yellow) that is often missed by computational models, and the saliency map estimated by SalGAN.
The impact of adversarial training has also been explored from a qualitative perspective by observing the resulting saliency maps. Figure 5 shows one example from the MIT300 dataset, highlighted in (Bylinskii et al., 2016b) as particularly challenging for existing saliency algorithms. The areas highlighted in yellow in the images on the left are regions that are typically missed by algorithms. In this example, we see that SalGAN successfully detects the often missed hand of the magician and the face of the boy as salient.

Figure 6 illustrates the effect of adversarial training on the statistical properties of the generated saliency maps. Shown are two close-up sections of a saliency map obtained with cross entropy training (left) and adversarial training (right). Training on BCE alone produces saliency maps that, while they may be locally consistent with the ground truth, are often less smooth and have complex level sets. Adversarial training, on the other hand, produces much smoother and simpler level sets. Finally, Figure 7 shows some qualitative results comparing the outputs of training with BCE and with BCE/Adversarial against the ground truth for images from the SALICON validation set.

Fig. 6. Close-up comparison of output from training on BCE loss vs combined BCE/Adversarial loss. Left: saliency map from the network trained with BCE loss. Right: saliency map from the proposed adversarial training.
6. Conclusions
To the best of our knowledge, this is the first work that proposes an adversarial-based approach to saliency prediction, and it has shown how adversarial training over a deep convolutional neural network can achieve state-of-the-art performance with a simple encoder-decoder architecture. A BCE-based content loss was shown to be effective both for initializing the saliency prediction network and as a regularization term for stabilizing adversarial training. Our experiments showed that adversarial training improved all but one saliency metric when compared to further training on cross entropy alone.

It is worth pointing out that although we use a VGG-16 based encoder-decoder model as the saliency prediction network in this paper, the proposed adversarial training approach is generic and could be applied to improve the performance of other saliency models.
Fig. 7. Qualitative results of SalGAN on the SALICON validation set. SalGAN correctly predicts highly salient regions that are missed by the BCE model. The saliency maps of the BCE model are concentrated in a few salient regions and tend to fail when the number of salient regions increases.
Acknowledgments
The Image Processing Group at UPC is supported by the project TEC2016-75976-R, funded by the Spanish Ministerio de Economia y Competitividad and the European Regional Development Fund (ERDF). This material is based upon work supported by Science Foundation Ireland under Grant No. 15/SIRG/.

References
Borji, A., 2012. Boosting bottom-up and top-down visual features for saliency estimation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., Durand, F., 2016a. What do different evaluation metrics tell us about saliency models? ArXiv preprint:1610.01563.
Bylinskii, Z., Recasens, A., Borji, A., Oliva, A., Torralba, A., Durand, F., 2016b. Where should saliency models look next?, in: European Conference on Computer Vision (ECCV).
Cornia, M., Baraldi, L., Serra, G., Cucchiara, R., 2016. A deep multi-level network for saliency prediction, in: International Conference on Pattern Recognition (ICPR).
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2672-2680.
Harel, J., Koch, C., Perona, P., 2006. Graph-based visual saliency, in: Neural Information Processing Systems (NIPS).
Huang, X., Shen, C., Boix, X., Zhao, Q., 2015. SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks, in: IEEE International Conference on Computer Vision (ICCV).
Itti, L., Koch, C., Niebur, E., 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 1254-1259.
Jetley, S., Murray, N., Vig, E., 2016. End-to-end saliency mapping via probability distribution prediction, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Jiang, M., Huang, S., Duan, J., Zhao, Q., 2015. SALICON: Saliency in context, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Johnson, J., Alahi, A., Fei-Fei, L., 2016. Perceptual losses for real-time style transfer and super-resolution, in: European Conference on Computer Vision (ECCV).
Judd, T., Ehinger, K., Durand, F., Torralba, A., 2009a. Learning to predict where humans look, in: IEEE 12th International Conference on Computer Vision, pp. 2106-2113.
Judd, T., Ehinger, K., Durand, F., Torralba, A., 2009b. Learning to predict where humans look, in: IEEE International Conference on Computer Vision (ICCV).
Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., Torralba, A., 2016. Eye tracking for everyone, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, pp. 1097-1105.
Kümmerer, M., Theis, L., Bethge, M., 2015a. DeepGaze I: Boosting saliency prediction with feature maps trained on ImageNet, in: International Conference on Learning Representations (ICLR).
Kümmerer, M., Theis, L., Bethge, M., 2015b. Information-theoretic model comparison unifies saliency metrics. Proceedings of the National Academy of Sciences (PNAS) 112, 16054-16059.
Kümmerer, M., Wallis, T.S., Gatys, L.A., Bethge, M., 2017. Understanding low- and high-level contributions to fixation prediction, in: IEEE International Conference on Computer Vision (ICCV), pp. 4799-4808.
Li, G., Yu, Y., 2015. Visual saliency based on multiscale deep features, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, L., 2014. Microsoft COCO: Common objects in context, in: European Conference on Computer Vision (ECCV).
Liu, N., Han, J., 2018. A deep spatial contextual long-term recurrent convolutional network for saliency detection. IEEE Transactions on Image Processing 27, 3264-3274.
Pan, J., Sayrol, E., Giró-i-Nieto, X., McGuinness, K., O'Connor, N.E., 2016. Shallow and deep convolutional networks for saliency prediction, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Riche, N., Duvinage, M., Mancas, M., Gosselin, B., Dutoit, T., 2013. Saliency and human fixations: State-of-the-art and study of comparison metrics, in: IEEE International Conference on Computer Vision (ICCV).
Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations (ICLR).
Sun, X., Huang, Z., Yin, H., Shen, H.T., 2017. An integrated model for effective saliency prediction, in: AAAI, pp. 274-281.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Vig, E., Dorr, M., Cox, D., 2014. Large-scale optimization of hierarchical features for saliency prediction in natural images, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Walther, D., Itti, L., Riesenhuber, M., Poggio, T., Koch, C., 2002. Attentional selection for object recognition - a gentle way. Lecture Notes in Computer Science 2525, 472-479.
Zhang, J., Sclaroff, S., 2013. Saliency detection: A Boolean map approach, in: IEEE International Conference on Computer Vision (ICCV).