A Gated Fusion Network for Dynamic Saliency Prediction
Aysun Kocak, Erkut Erdem and Aykut Erdem
Department of Computer Engineering, Hacettepe University, Ankara, Turkey
Department of Computer Engineering, Koç University, Istanbul, Turkey
Abstract—Predicting saliency in videos is a challenging problem due to the complex modeling of interactions between spatial and temporal information, especially when the ever-changing, dynamic nature of videos is considered. Recently, researchers have proposed large-scale datasets and models that take advantage of deep learning as a way to understand what is important for video saliency. These approaches, however, learn to combine spatial and temporal features in a static manner and do not adapt themselves much to the changes in the video content. In this paper, we introduce the Gated Fusion Network for dynamic saliency (GFSalNet), the first deep saliency model capable of making predictions in a dynamic way via a gated fusion mechanism. Moreover, our model also exploits spatial and channel-wise attention within a multi-scale architecture that further allows for highly accurate predictions. We evaluate the proposed approach on a number of datasets, and our experimental analysis demonstrates that it outperforms or is highly competitive with the state of the art. Importantly, we show that it has good generalization ability, and moreover, exploits temporal information more effectively via its adaptive fusion scheme.
Index Terms—dynamic saliency estimation, gated fusion, deep saliency networks
I. INTRODUCTION
The human visual system employs visual attention mechanisms to effectively deal with huge amounts of information by focusing only on salient or attention grabbing parts of a scene, and thus filtering out irrelevant stimuli. Saliency estimation methods offer different computational models of attention to mimic this key component of our visual system. These methods generate a so-called saliency map within which a pixel value indicates the likelihood of that pixel being fixated by a human. Since the pioneering work of [1], this research area has gained a lot of interest in the last few decades (please refer to [2], [3] for an overview), and it has found practical use in a variety of computer vision tasks such as visual quality assessment [4], [5], image and video resizing [6], [7], and video summarization [8], to name a few. Early saliency prediction approaches use low-level (color, orientation, intensity) and/or high-level (pedestrians, faces, text, etc.) image features to estimate salient regions. While low-level cues are used to detect regions that are different from their surroundings, top-down cues are used to infer high-level semantics to guide the model. For example, humans tend to focus on some object classes more than others. Recently, deep learning based models have started to dominate over the traditional approaches as they can directly learn both low- and high-level features relevant for saliency prediction [9], [10].
(Fig. 1, left: a single input frame and its corresponding fixation map. Right: four consecutive overlaid frames and their overlaid fixation maps.)
Fig. 1: Predicting video saliency requires finding a harmonious interaction between appearance and temporal information. For example, while the first row shows a case in which attention is guided more by visual appearance, in the second row, motion is the most determining factor for attention. Hence, we speculate that an adaptive scheme would be better suited for this task.

Most of the literature on saliency estimation focuses on static images. Lately, predicting saliency in videos has also gained some traction, but it still remains a largely unexplored field of research. Video saliency models (also called dynamic saliency models) aim to predict attention grabbing regions in dynamically changing scenes. While static saliency estimation considers only low-level and high-level spatial cues, dynamic saliency needs to take into account temporal information too, as there is evidence that moving objects or object parts can also guide our attention. Motion and appearance play complementary roles in human attention and their significance can change over time. As we illustrate in Fig. 1, in dynamic scenes, humans tend to focus more on moving parts of the scene and the eye fixations change over time, showing the importance of motion cues (bottom row). On the other hand, when there is practically no motion in the scene, low-level appearance cues dominantly guide our attention and we focus more on the regions showing different visual characteristics than their surroundings (top row). Motivated by these observations, in this work, we develop a deep dynamic saliency model which handles spatial and temporal changes in the visual stimuli in an adaptive manner.

The first generation of dynamic saliency methods were simply extensions of the static saliency approaches, e.g. [11], [12], [13], [14], [15]. In other words, these methods adapted the strategies proposed for static scenes and mostly modified them to work on either 3D feature maps that are formed by stacking 2D spatial features over time or 2D feature maps encoding motion information like optical flow images. Several follow-up works, however, have approached the problem from a fresh perspective and developed specialized methods for dynamic saliency detection, e.g. [16], [17], [18], [19], [20], [21], [22], [23], [24]. These models either utilize novel spatio-temporal features or employ data-driven techniques to learn relevant features from data. As with the case of state-of-the-art static saliency models, approaches based on deep learning have also shown promise for dynamic saliency. These studies basically explore different neural architectures for processing temporal and spatial information in a joint manner, and they either use 3D convolutions [25], LSTMs [25], [26] or multi-stream architectures that encode temporal information separately [27], [28], [29].

In this work, we introduce the Gated Fusion Network for video saliency (GFSalNet). Our proposed network model is radically different from the previously proposed deep models in that it includes a novel content-driven fusion scheme to combine spatial and temporal streams in a more dynamic manner. In particular, our model is based on two-stream CNNs [30], [31], which have been successfully applied to various video analysis tasks. To our interest, these architectures are inspired by the ventral and dorsal pathways in the human visual cortex [32].
Although the use of two-stream CNNs in video saliency prediction has been investigated before [28], the main novelty of our work lies in the ability to fuse appearance and motion information in a spatio-temporally coordinated manner by estimating the importance of each cue based on the current video content.

The rest of the paper is organized as follows: In Section 2, we give a brief overview of the existing dynamic saliency approaches. In Section 3, we present the details of our proposed deep architecture for video saliency. In Section 4, we give the details of our experimental setup, including evaluation metrics, datasets and the competing dynamic saliency models, and discuss the results of our experiments. Finally, in the last section, we offer some concluding remarks. Our code and pretrained models, along with the saliency maps extracted with our approach, will be publicly available at the project website (https://hucvl.github.io/GFSalNet/).

II. RELATED WORK
Early visual saliency models can be dated back to the 1980s with the Feature Integration Theory of [33]. The first models of saliency, such as [34], [1], provide computational solutions to [33], and since then a notable number of saliency models have been developed, most of which deal with static scenes. For a detailed list of pre-deep learning saliency estimation approaches, please refer to [2]. After the availability of large-scale datasets, researchers proposed various deep learning based models for static saliency that outperformed previous approaches by a large margin [35], [36], [37], [38], [39], [40], [41], [42].
Early models for dynamic saliency generally depend on previously proposed static saliency models. Adaptation of these models to dynamic scenes is achieved by considering features related to motion such as optical flow information. For example, [11] proposed a saliency prediction method called PQFT that predicts the salient regions via the phase spectrum of the Fourier Transform of the given image. In particular, PQFT generates a quaternion image representation by using color, intensity, orientation and motion features and estimates the salient regions in the frequency domain by using this combined representation. [12] extracted salient parts of video frames by similarly performing a spectral analysis of the frames considering both spatial and temporal domains. [13] employed local regression kernels as features to calculate self-similarities between pixels or voxels for figure-ground segregation. [14] extended the static saliency model of [43] by including motion cues in the graph-theoretic formulation. [44] employs a two-stream approach that generates a spatial saliency map (using color and texture features) and a temporal saliency map (using optical flow features) separately and combines these maps with an entropy based adaptive method. [15] proposed a dynamic saliency model for activity recognition that works in an unsupervised manner. Their method is based on an encoding scheme that considers color along with motion cues.

Following these early approaches, researchers started to develop novel video saliency models specifically designed for dynamic stimuli. For instance, [16] proposed a sparsity based framework that generates spatial saliency maps and temporal saliency maps separately based on entropy gain and temporal consistency, respectively, and then combines them. [17] integrated several visual cues, such as static and dynamic image features based on color, texture, edge distribution, and motion boundary histograms, through learning-based fusion strategies and later employed this dynamic saliency model for action recognition. [18] suggested a learning-based model that generates a candidate set of regions with the use of existing methods and then predicts gaze transitions over subsequent video frames conditionally on these regions. [19] proposed a simple dynamic saliency model that combines spatial saliency maps with temporal saliency using a pixel-wise maximum operation. In their work, while the spatial saliency maps are extracted using multi-scale analysis of low-level features, temporal saliency maps are obtained by examining dynamic consistency of motion through an optical flow model. [20] suggested an approach that independently estimates superpixel-level and pixel-level temporal and spatial saliency maps and subsequently combines them using an adaptive fusion strategy. [21] proposed an approach that oversegments video frames by using both spatial and temporal information and estimates the saliency score for each region by computing the regional contrast values via low-level features extracted from these regions. [22] suggested to learn a filter bank from low-level features for fixations. This filter bank encodes the association between local feature patterns and probabilities of human fixations, and is used to re-weight fixation candidates.
[23] formulated another dynamic saliency model by exploiting the compressibility principle. More recently, [24] proposed a saliency model (called AWS-D) for dynamic scenes by considering the observation that high-order statistical structures carry most of the perceptually relevant information. AWS-D [24] removes the second-order information from the input sequence via a whitening process. Then, it computes bottom-up spatial saliency maps using a filter bank at multiple scales, and temporal saliency maps with the use of a 3D filter bank. Finally, it combines all these maps by considering their relative significance.

Deep learning based dynamic saliency models have received attention only recently. [25] proposed a recurrent mixture density network (RMDN) for spatio-temporal visual attention. The method uses a C3D architecture [45] as a backbone to integrate spatial and temporal information. This representation module is fed to a Long Short-Term Memory (LSTM) network, which is connected to a Mixture Density Network (MDN) whose outputs are the parameters of a Gaussian mixture model expressing the saliency map of each frame. [28] suggested a two-stream CNN model [30], [31] which considers the motion and appearance cues in videos. While optical flow images are used to feed the temporal stream, raw RGB frames are used as input for the spatial stream. [27] presented an attention network to predict where a driver is focused. In this work, the authors also proposed a dataset that consists of egocentric and car-centric driving videos and the corresponding eye tracking data. Their network consists of three independent paths, namely spatial, temporal and semantic paths. While the spatial path uses raw RGB data as input, the temporal one uses optical flow data to integrate motion information and the last one processes the segmentation prediction on the scene given by the model of [46]. In the final layer of the network, the three independent maps are summed and then normalized to obtain the final saliency map. [29] proposed a deep model called OM-CNN which consists of two subnetworks, namely an objectness subnet to highlight the regions that contain an object and a motion subnet to encode temporal information, whose outputs are then combined to generate spatio-temporal features. [26] proposed a model called ACLNet which employs a CNN-LSTM architecture to predict human gaze in dynamic scenes. The proposed approach attends to static information with an attention module and allows an LSTM to focus on learning dynamic information. Recently, [47] proposed an encoder-decoder based deep neural network called SalEMA, which employs a convolutional recurrent neural network to include temporal information. In particular, it processes a sequence of RGB video frames as input to employ spatial and temporal information, with the temporal information being inferred by the weighted average of the convolution state of the current frame and all the previous frames. [48] suggested a different model called TASED-Net, which utilizes a 3D fully-convolutional encoder-decoder network architecture where the encoded features are spatially upsampled while aggregating the temporal information. [49] recently developed another two-stream spatiotemporal saliency model called STRA-Net that considers dense residual cross connections and a composite attention module.

The aforementioned dynamic saliency models suffer from different drawbacks. The early methods employ (hand-crafted) low-level features that do not provide a high-level understanding of the video frames.
Deep models eliminate this pitfall by utilizing an end-to-end learning strategy and, hence, provide better saliency predictions. They differ from each other by how they include motion information within their respective architectures. As we reviewed, the two main alternative approaches include using recurrent connections or processing data in multiple streams. Although RNN-based models help to encode temporal information with a smaller number of parameters, the encoding procedure compresses all the relevant information into a single vector representation, which affects the robustness especially for longer sequences. In that respect, the accuracy of two-stream models does not, in general, degrade as the length of a sequence increases. Moreover, they are more interpretable as they need to perform fusion of spatial and temporal features in an explicit manner. On the other hand, their performance depends on accurate estimation of the optical flow maps used as input to the temporal stream. Hence, most of these two-stream models employ recent deep-learning based optical flow estimation models, and some of them even use additional post-processing steps such as confining the absolute values of the flow magnitudes within a certain interval to avoid noise, as in STRA-Net [49]. Our proposed model also uses a two-stream approach, but as we will show, it exploits a novel and more dynamic fusion strategy, which boosts the performance and further improves the interpretability.

III. OUR MODEL
A general overview of our proposed spatio-temporal network architecture is given in Fig. 2. We use a two-stream architecture that processes temporal and spatial information in separate streams, similar to the one in [28]. That is, we respectively feed the spatial stream and the temporal stream with RGB video frames and the corresponding optical flow images as inputs. Different than [28], however, our network combines information coming from several levels (Section III-A) and fuses both streams via a novel dynamic fusion strategy (Section III-C). We additionally utilize attention blocks (Section III-B) to select more relevant features to further boost the performance of our model. Here, we use a pre-trained ResNet-50 model [50] as the backbone of our saliency network, as commonly explored by previous saliency studies. In particular, we remove the average pooling and fully connected layers after the last residual block (ResBlock4) and then adapt it for saliency prediction by adding extra blocks. Using the ResNet-50 model allows us to encode low-, mid- and high-level cues in the visual stimuli in an efficient manner. Moreover, the number of network parameters is much smaller as compared to other alternative backbone networks.
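To make the backbone choice concrete, the following is a minimal PyTorch sketch of how a pre-trained ResNet-50 can be truncated into a fully convolutional feature extractor that exposes per-block feature maps of the kind later consumed by the multi-level information block. The class and variable names, as well as the torchvision weight specifier, are illustrative assumptions and not the authors' released code.

```python
import torch
import torchvision

class ResNet50Backbone(torch.nn.Module):
    """Truncated ResNet-50: average pooling and the fully connected
    layer are removed so that dense feature maps are preserved."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1,
                                        resnet.relu, resnet.maxpool)
        self.block1 = resnet.layer1   # ResBlock1, 256 channels
        self.block2 = resnet.layer2   # ResBlock2, 512 channels
        self.block3 = resnet.layer3   # ResBlock3, 1024 channels
        self.block4 = resnet.layer4   # ResBlock4, 2048 channels

    def forward(self, x):
        x = self.stem(x)
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        f4 = self.block4(f3)
        # Return all levels so low-, mid- and high-level cues can be fused later.
        return f1, f2, f3, f4
```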
A. Multi-level Information Block
Fig. 2: Our two-stream dynamic saliency model uses RGB frames for the spatial stream and optical flow images for the temporal stream. These streams are integrated with a dynamic fusion strategy that we refer to as gated fusion. Our architecture also employs a multi-level information block to fuse multi-scale features and attention blocks for feature selection.

Fig. 3: Multi-level information block. It is used to integrate multi-scale features extracted at different levels of the deep network for predicting salient parts of the given input video frame.

As its name implies, the purpose of the multi-level information block is to let the information extracted at different levels guide the saliency prediction process. Employing a multi-level/multi-scale structure has proven to be useful, as it almost always improves the performance for many different vision tasks such as object detection [51], segmentation [52], [53], [54], and static saliency detection [55], [56]. In our work, we also employ a multi-level information block to enhance the feature learning capability of our model. Specifically, it allows low-, mid-, and high-level information to be fused together and to be taken into account simultaneously while making predictions. Fig. 3 shows the proposed multi-level information block that we employ in our model. This block considers low-level and high-level representations of frames by processing feature maps extracted at each residual block. The aim is to combine primitive image features (e.g. edges, shared common patterns) obtained at lower levels with rich semantic information (e.g. object parts, faces, text) extracted at higher levels of the network. Here, we utilize convolution and bilinear interpolation layers to combine cues from higher and lower levels. That is, after each residual block, we expand the feature map with bilinear interpolation to make its size equal to that of the output of the previous residual block. Then, we concatenate the expanded feature map with the previous residual block's output and fuse them via convolution layers.
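As a concrete illustration of one fusion step in this block, the following PyTorch sketch upsamples a deeper feature map to the previous block's spatial size, concatenates the two, and fuses them with a convolution. The 1 × 1 kernel size, the module name and the channel widths in the example are our own assumptions, since they are not recoverable from the text.

```python
import torch
import torch.nn.functional as F

class LevelFusion(torch.nn.Module):
    """Fuses a deeper (coarser) feature map into the previous level."""

    def __init__(self, deep_channels, prev_channels, out_channels):
        super().__init__()
        # 1x1 fusion convolution is an assumption; the paper only states
        # that a convolution follows the concatenation.
        self.fuse = torch.nn.Conv2d(deep_channels + prev_channels,
                                    out_channels, kernel_size=1)

    def forward(self, deep_feat, prev_feat):
        # Upsample the deeper map to the previous block's resolution.
        up = F.interpolate(deep_feat, size=prev_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        # Concatenate along the channel dimension and fuse.
        return self.fuse(torch.cat([up, prev_feat], dim=1))


# Example: fuse ResBlock4 (2048 ch) features into ResBlock3 (1024 ch) features.
fusion = LevelFusion(2048, 1024, 256)
f3, f4 = torch.randn(1, 1024, 28, 28), torch.randn(1, 2048, 14, 14)
fused = fusion(f4, f3)   # -> [1, 256, 28, 28]
```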
B. Attention Blocks

Neural attention mechanisms allow for learning to pay attention to features that are more useful for a given task, and hence, it has been demonstrated many times that they can boost the performance of a neural network architecture proposed for nearly any computer vision problem, such as object detection [57], visual question answering [58], pose estimation [59], image captioning [60] and salient object detection [55]. Motivated by these observations, in our work, we integrate several attention blocks into our proposed deep architecture to let the model choose the most relevant features for the dynamic saliency estimation problem. Resembling the structures in [60], [55], we exploit two separate attention mechanisms: spatial and channel-wise attention, as explained below.

Fig. 4: Attention blocks: (a) spatial attention block, (b) channel-wise attention block. While the spatial attention block defines spatial importance weights for individual feature maps, the channel-wise attention block introduces feature-level weighting which allows for a better use of context information.

Fig. 5: Gated fusion block. It integrates the spatial and temporal streams to learn a weighted gating scheme to determine their contributions in predicting dynamic saliency of the current input video frame.

Fig. 4(a) shows our spatial attention block, which we introduce at the lower levels of our network model (see Fig. 2) to help filter out the irrelevant information. The block takes the output of ResBlock4, shaped [B × C × H × W] with C = 2048, as input and determines the important locations by calculating a weight tensor of shape [B × 1 × H × W]. To estimate this tensor, the input channels are fused via a convolution layer followed by a sigmoid layer. The output of this block (shaped [B × C × H × W]) is the result of a Hadamard product between the input and the spatial weight tensor.

The second type of attention block, the channel-wise attention block, is shown in Fig. 4(b); its main purpose is to utilize the context information in a more efficient way. The block consists of average pooling, fully connected and ReLU layers. In particular, it takes the concatenation of the feature maps from the main stream and the multi-level information block, shaped [B × C × H × W], as input and then downsamples it with average pooling to shape [B × C × 1 × 1]. The weight of each channel is determined after two fully connected layers, each followed by a ReLU. The output of the last ReLU, shaped [B × C × 1 × 1], contains a scalar value to weight each channel. At the end of the block, the input feature map is weighted via a Hadamard product.
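The two attention mechanisms can be summarized with the following PyTorch sketch, written under stated assumptions: the kernel size of the fusing convolution and the hidden width (reduction ratio) of the fully connected layers are our own choices, as the paper does not give them.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Collapses channels into a [B, 1, H, W] weight map and rescales the input."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # kernel size assumed

    def forward(self, x):
        weights = torch.sigmoid(self.conv(x))       # [B, 1, H, W]
        return x * weights                          # Hadamard product


class ChannelAttention(nn.Module):
    """Pools spatially, predicts one scalar weight per channel, rescales the input."""

    def __init__(self, channels, reduction=16):     # reduction ratio assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.ReLU(inplace=True))

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))        # [B, C] after global pooling
        return x * weights.view(b, c, 1, 1)          # Hadamard product
```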
C. Gated Fusion Block

One of the main contributions of our framework is to employ a dynamic fusion strategy to combine temporal and spatial information. Gated fusion has been exploited before for different problems such as image dehazing [61], image deblurring [62], and semantic segmentation [63]. The main purpose of using a gated fusion block is to combine different kinds of information with a dynamic structure that considers the characteristics of the current inputs. For example, in [63] feature maps generated from RGB information and depth information are combined for solving semantic segmentation. In our case, our aim is to come up with a fusion module that considers the content of the video at inference time. To our knowledge, we are the first to provide a truly dynamic approach for dynamic saliency. As opposed to the classical learning based approaches that learn the contributions of the temporal and spatial streams in a static manner from the training data, our gated fusion block performs the fusion process in an adaptive way. That is, it decides the contribution of each stream in a location- and time-aware manner according to the content of the video.

The structure of the proposed gated fusion block is shown in Fig. 5. It takes the feature maps of the spatial and temporal streams as inputs and produces a probability map which is used to designate the contribution of each stream with regard to their current characteristics. Let $S_A$ and $S_T$ denote the feature maps from the spatial and temporal streams, respectively. The gated fusion module first concatenates these features and then learns their correlations by applying a convolution layer. After that, it uses a sigmoid layer to regularize the feature map, which is used to estimate the weights of the gate. Let $G_A$ and $G_T$ denote how confidently we can rely on appearance and motion, respectively:

$$G_A = P, \quad G_T = 1 - P, \qquad (1)$$

where $P$ is the output of the sigmoid layer. Then, the gated fusion module estimates the weighted maps denoting the contributions of the spatial and temporal streams, as given below:

$$S'_A = S_A \odot G_A, \quad S'_T = S_T \odot G_T, \qquad (2)$$

where $\odot$ represents the Hadamard product operation. Finally, it generates the final saliency map, $S_{final}$, by weighting the appearance and temporal streams' feature maps with the estimated probability map:

$$S_{final} = S'_A + S'_T. \qquad (3)$$

Fig. 6 visualizes how the gated fusion block works. While the appearance stream computes a saliency map from the RGB frame, the temporal stream extracts a saliency map from the optical flow image obtained from successive frames. As can be seen, these intermediate maps encode different characteristics of the input dynamic stimuli. The appearance based saliency map mostly focuses on the regions that have distinct visual properties from their surroundings, whereas the motion based saliency map mainly pays attention to motion. The gated fusion scheme estimates spatially varying probability maps and employs them to integrate the appearance and temporal streams, which results in more confident predictions. The spatial stream generally gives more accurate predictions than the temporal stream, as will be presented in the Experiments section. On the other hand, as can be seen from the estimated weight maps, the gated fusion scheme in the proposed model has a tendency to pay more attention to the temporal stream. We suspect that this is because the model considers that it may carry auxiliary information.
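A minimal PyTorch sketch of Eqs. (1)-(3) is given below. The kernel size of the gating convolution is our assumption, and the example is written for single-channel stream outputs for brevity.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Content-dependent fusion of appearance (S_A) and motion (S_T) maps."""

    def __init__(self, channels):
        super().__init__()
        # Gating convolution over the concatenated streams; 1x1 kernel assumed.
        self.gate = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, s_a, s_t):
        p = torch.sigmoid(self.gate(torch.cat([s_a, s_t], dim=1)))  # Eq. (1): G_A = P
        g_a, g_t = p, 1.0 - p                                        #          G_T = 1 - P
        s_a_prime = s_a * g_a                                        # Eq. (2): Hadamard products
        s_t_prime = s_t * g_t
        return s_a_prime + s_t_prime                                 # Eq. (3): S_final


# Example with single-channel stream outputs at saliency-map resolution.
fusion = GatedFusion(channels=1)
s_a, s_t = torch.rand(1, 1, 56, 56), torch.rand(1, 1, 56, 56)
s_final = fusion(s_a, s_t)   # [1, 1, 56, 56]
```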
IV. EXPERIMENTS

In the following, we first provide a brief review of the benchmark datasets we used in our experimental analysis. Then, we give the details of our training procedure, including the loss functions and settings we use to train our proposed model. Next, we summarize the evaluation metrics and the dynamic saliency models used in our experiments. We then discuss our findings and present some qualitative and quantitative results. Finally, we present an ablation study to evaluate the effectiveness of the blocks of the proposed dynamic saliency model.

Fig. 6: Gated fusion block estimates the final saliency map by combining the appearance and the temporal maps S_A and S_T with the spatially varying weights G_A and G_T. (Columns: Appearance, Motion, S_A, G_A, S_T, G_T, Prediction, GT.)

A. Datasets
In our experiments, we employ six different datasets to evaluate the effectiveness of the proposed saliency model. The first four, namely UCF-Sports [64], Hollywood-2 [65], DHF1K [26], and DIEM [66], are the most commonly used benchmarks. Among them, we specifically utilize DIEM to test the generalization ability of our model. The last two datasets considered in our analysis, DIEM-Meta [67] and LEDOV-Meta [67], are two recently proposed datasets which are particularly designed to explore the performance of a dynamic saliency model under situations where understanding temporal effects is critical to give results more compatible with humans.
UCF-Sports dataset [64] is the smallest dataset in terms of its size, consisting of 150 videos obtained from 13 different action classes. It was originally collected for action recognition, but then enriched by [65] to include eye fixation data. The videos are annotated by 4 subjects under a free-viewing condition. In the experiments, we used the same train/test splits given in [68].
Hollywood-2 dataset [65] contains 1,707 videos from the Hollywood-2 action recognition dataset [69], among which 823 are used for training and the remaining 884 are left for testing. Since the videos are collected from 69 Hollywood movies with 12 action categories, its content is limited to human actions. In [65], the authors collected human fixation data for each sequence from 3 subjects under a free-viewing condition. In our experiments, we use all train and test frames.
DHF1K [26] is the most recent and the largest video saliency dataset, which contains a total of 1,000 videos with eye tracking data collected from 17 different human subjects. The authors split the dataset into 600 training, 100 validation and 300 test videos. The ground truth fixation data for the test split is intentionally kept hidden and the evaluation of a model on the test data is carried out by the authors themselves.
DIEM [66] includes 84 natural videos. Each video sequence has eye fixation data collected from approximately 50 different human subjects. Following the common experimental setup first considered in [18], we used all frames from 64 videos for training and the first 300 frames from the remaining 20 videos as the test set.
DIEM-Meta [67] and LEDOV-Meta [67] are two recently proposed datasets collected from the existing video saliency datasets DIEM [66] and LEDOV [29], respectively. The main difference between these and the aforementioned datasets lies in the characteristics of the video frames they consider. [67] constructed these so-called meta-datasets by eliminating the video frames from their original counterparts where spatial patterns are generally enough to predict where people look. To detect them, they employ a deep static saliency model that they developed. DIEM-Meta and LEDOV-Meta are thus better testbeds for evaluating whether or not a dynamic saliency model learns to use the temporal domain effectively. DIEM-Meta contains only 35% of the video frames from DIEM, and LEDOV-Meta includes just 20% of the original LEDOV frames.
B. Training Procedure
As we mentioned previously, our network takes RGB video frames and optical flow images as inputs. We extract the frames from the videos by considering their original frame rate. We employ these RGB frames to feed our appearance stream. For the temporal stream, we generate the optical flow images between two consecutive frames by using PWC-Net [70]. We resize all the input images to a fixed resolution and map the ground truth fixation points accordingly.

Instead of training our dynamic saliency network from scratch, we first train the subnet for the appearance stream on the SALICON dataset [71]. Then, we initialize the weights of both of our subnets for the spatial and temporal streams with this pre-trained static saliency model and finetune our whole two-stream network model using the dynamic saliency datasets described above. Pre-training on static data allows our dynamic saliency model to converge in fewer epochs when trained on dynamic stimuli. We use Kullback-Leibler (KL) divergence and Normalized Scanpath Saliency (NSS) loss functions (which we explain in detail later) with the Adam optimizer during the training process. We set the initial learning rate to 10e-5 and reduce it to one tenth every 3000 iterations. The batch size is set to 8 for UCF-Sports and 16 for the other video datasets. We train our model on NVIDIA V100 GPUs (3 GPUs), and while one epoch takes approximately 2 days for the larger datasets of DHF1K, DIEM and Hollywood-2, it takes approximately 2 hours for UCF-Sports. We train our models for 2-3 epochs. Our (unoptimized) PyTorch implementation achieves a near real-time performance of 8.2 fps for frames at this resolution on an NVIDIA Tesla K40c GPU.

For our experiments on standard benchmark datasets, we consider two different training settings for dynamic stimuli. In our first setting, we use the training split of the dataset under consideration to train our proposed model. On the other hand, in our second setting, we utilize a combined training set containing training sequences from the UCF-Sports, Hollywood-2 and DHF1K datasets. The second setting further allows us to test the generalization ability of our model on the DIEM, DIEM-Meta and LEDOV-Meta datasets.

Loss functions. In our work, we employ the combination of KL-divergence and NSS loss functions to train our proposed dynamic saliency model. As explored in previous studies [72], [26], considering more than one loss function during training, in general, improves the model performance. Moreover, empirical experiments on the analysis of the existing automatic evaluation metrics in [73] have shown that KL-divergence and NSS are good choices for evaluating saliency models.

Let P denote the predicted saliency map, F represent the ground truth (binary) fixation map collected from human subjects and S be the ground truth (continuous) fixation density map, which is generated by blurring the fixation map with a small Gaussian kernel.

KL-divergence is a widely used metric to compare two probability distributions. It has been proven effective for evaluating and training saliency models, where the ground truth fixation density map S and the predicted saliency map P are interpreted as probability distributions. Formally, the KL-divergence loss function is defined as:

$$\mathcal{L}_{KL}(P, S) = \sum_i S(i) \log\left(\frac{S(i)}{P(i)}\right). \qquad (4)$$

NSS is a location based metric which is computed as the average of the normalized predicted saliency values at the fixated locations provided with the ground truth.
By using this metric as a loss function, we force the saliency model to better detect the fixation locations and assign high likelihood scores to those pixel locations. This loss function is defined as below:

$$\mathcal{L}_{NSS}(P, F) = -\frac{1}{N}\sum_i \bar{P}(i) \times F(i), \qquad (5)$$

where $N = \sum_i F(i)$ is the total number of fixated pixels and $\bar{P} = \frac{P - \mu(P)}{\sigma(P)}$ is the normalized saliency map.

Our final loss function is then defined as:

$$\mathcal{L}(P, F, S) = \alpha \mathcal{L}_{KL}(P, S) + \beta \mathcal{L}_{NSS}(P, F), \qquad (6)$$

where $\mathcal{L}_{KL}$ is the KL loss function, $\mathcal{L}_{NSS}$ is the NSS loss function, and $\alpha$ and $\beta$ are the weights of these loss functions. We first perform a set of experiments on the SALICON dataset to empirically determine the optimal values of $\alpha$ and $\beta$, and then use $\alpha = 1$ together with the empirically determined $\beta$ for all the experiments.
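A PyTorch sketch of the combined objective in Eqs. (4)-(6) is shown below. The small epsilon constants are our own additions for numerical stability, and the default value of beta is illustrative rather than the value used in the paper.

```python
import torch

def kl_loss(pred, density, eps=1e-8):
    """Eq. (4): KL divergence between maps normalized to sum to one."""
    b = pred.shape[0]
    p = pred.reshape(b, -1)
    s = density.reshape(b, -1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)
    s = s / (s.sum(dim=1, keepdim=True) + eps)
    return (s * torch.log(s / (p + eps) + eps)).sum(dim=1).mean()

def nss_loss(pred, fixations, eps=1e-8):
    """Eq. (5): negative mean of the normalized saliency at fixated pixels."""
    b = pred.shape[0]
    p = pred.reshape(b, -1)
    f = fixations.reshape(b, -1)
    p_norm = (p - p.mean(dim=1, keepdim=True)) / (p.std(dim=1, keepdim=True) + eps)
    n_fix = f.sum(dim=1) + eps
    return -((p_norm * f).sum(dim=1) / n_fix).mean()

def total_loss(pred, fixations, density, alpha=1.0, beta=0.1):
    """Eq. (6); beta=0.1 is an illustrative default, not taken from the paper."""
    return alpha * kl_loss(pred, density) + beta * nss_loss(pred, fixations)
```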
C. Evaluation Metrics and Compared Saliency Models

In our evaluation, we employ the following five commonly reported saliency metrics: Area Under Curve (AUC-Judd), Pearson's Correlation Coefficient (CC), Normalized Scanpath Saliency (NSS), Similarity Metric (SIM) and KL-divergence (KLDiv). For a detailed analysis of these metrics and their definitions, please refer to [73]. Each metric measures a different aspect of visual saliency and none of them is superior to the others. The AUC metric treats the saliency map as a classification map. A ROC curve is constructed by measuring the true and false positive rates under different binary classifier thresholds. While a score of 1 indicates a perfect match, a score close to 0.5 indicates chance performance. NSS is another commonly used metric, which we formally defined before while describing our loss functions. CC is a distribution based metric which is used to measure the linear relationship between the saliency and fixation density maps using the following formula:

$$CC(P, S) = \frac{\sigma(P, S)}{\sigma(P) \times \sigma(S)}, \qquad (7)$$

where $\sigma(P, S)$ corresponds to the covariance of $P$ and $S$. A CC value close to +1/-1 demonstrates a perfect linear relationship. SIM is another popular metric that measures the similarity between the predicted and human saliency maps, as defined below:

$$SIM(P, S) = \sum_i \min(P_i, S_i), \;\; \text{where} \;\; \sum_i P_i = 1 \;\; \text{and} \;\; \sum_i S_i = 1. \qquad (8)$$

The KLDiv metric evaluates the dissimilarity between two distributions. Since KLDiv represents the difference between the saliency map and the density map, a small value indicates a good result. However, we note that, according to the aforementioned study, NSS and CC seem to provide fairer results. In our experiments, we report the scores obtained with the implementations provided by the MIT benchmark website (https://github.com/cvzoya/saliency/tree/master/code_forMetrics).

We compare our method with ten different models: SalGAN [74], PQFT [11], Fang et al. [44], AWS-D [24], Bak et al. [28], OM-CNN [29], ACLNet [26], SalEMA [47], STRA-Net [49], and TASED-Net [48]. Among these, SalGAN [74] is the only static saliency model, which gives state-of-the-art results on the image datasets. We evaluate this method on video datasets by considering each frame as a static image. PQFT [11], Fang et al. [44], and AWS-D [24] are non-deep learning models, whereas all the other models employ deep learning techniques to predict where people look in videos. We note that in [28], the authors tested different fusion strategies with static weighting schemes, and here we only report the results obtained with the convolutional fusion strategy, which was shown to perform better than the others. In our experiments, we use the implementations and the trained models provided by the authors and test our approach against them with the settings explained in Sec. IV-A for a fair comparison. In particular, after a careful analysis, we noticed that some methods do not report results on the whole test set of Hollywood-2 and/or they mistakenly consider task-specific gaze data collected for UCF-Sports while generating the ground truth fixation density maps. Hence, some of the results are different than those reported in the respective papers, but they give a better picture of their performances. Moreover, in our experiments, we also provide the results of the single-stream versions of our model that respectively consider either spatial or temporal information.
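For reference, a small Python sketch of the CC and SIM metrics defined in Eqs. (7)-(8) is given below; the evaluation in the paper relies on the MIT benchmark implementations, so this is only an illustrative re-statement.

```python
import numpy as np

def cc(pred, density):
    """Eq. (7): Pearson's correlation between prediction and fixation density map."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    s = (density - density.mean()) / (density.std() + 1e-8)
    return float((p * s).mean())

def sim(pred, density):
    """Eq. (8): histogram intersection of the two maps, each normalized to sum to 1."""
    p = pred / (pred.sum() + 1e-8)
    s = density / (density.sum() + 1e-8)
    return float(np.minimum(p, s).sum())
```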
D. Qualitative and Quantitative Results

Performance on UCF-Sports.
Table I reports the comparative results on the UCF-Sports test set, which contains 43 sequences. As can be seen, the single-stream versions of our proposed model give worse scores than our full model. Moreover, the spatial stream generally predicts saliency much better than the temporal stream, which is a trend that we observe on the other standard benchmark datasets too. Our model trained only on UCF-Sports outperforms all the competing models in most of the metrics. It results in a performance very close to those of SalEMA and STRA-Net in terms of SIM. We believe that weighting the predictions by the spatial and temporal streams using a gating mechanism allows the model to better handle the variations throughout a video sequence, thus resulting in more accurate saliency maps on this action-specific, relatively small dataset.

TABLE I: Performance comparison on UCF-Sports dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method                   AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
SalGAN (static)          0.869   0.389  2.074  0.258  2.169
PQFT*                    0.776   0.211  1.189  0.157  2.458
Fang et al.*             0.879   0.387  2.319  0.247  2.012
AWS-D*                   0.845   0.313  1.870  0.195  2.202
Bak et al.               0.864   0.387  2.231  0.130  2.575
OM-CNN                   0.880   0.398  2.443  0.294  1.902
ACLNet                   0.876   0.367  2.045  0.292  2.135
SalEMA                   0.895   0.470  2.979  —      —
Ours (Gated), Setting 2  0.911   0.499  2.980  0.353  1.568
* Non-deep learning model

TABLE II: Performance comparison on Hollywood-2 dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method                   AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
SalGAN (static)          0.892   0.428  2.383  0.298  1.760
PQFT*                    0.689   0.150  0.610  0.139  2.387
Fang et al.*             0.862   0.312  1.614  0.221  1.781
AWS-D*                   0.747   0.227  0.994  0.193  2.256
Bak et al.               0.840   0.310  1.439  0.158  2.339
OM-CNN                   0.893   0.430  2.625  0.330  1.896
ACLNet                   0.899   0.459  2.463  0.342  1.701
SalEMA                   0.873   0.383  2.226  0.330  3.157
STRA-Net                 0.913   0.558  3.226  0.459  2.251
TASED-Net                0.916   —      —      —      —
* Non-deep learning model
Performance on Hollywood-2.
In our experiments on the Hollywood-2 dataset, we use all the frames from the test set, which contains 884 video sequences. In that regard, it is the largest test set that we considered in our experimental evaluation. In Table II, we provide a comparison against the competing saliency models. Our results show that our model gives better saliency predictions than all the other methods in terms of the AUC-J and KLDiv metrics. The model trained with our second training setting, which includes a larger and more diverse training set, provides much better results than the one trained with the first setting. In terms of the remaining evaluation metrics, our results are highly competitive as compared to the recent state-of-the-art models, namely STRA-Net and TASED-Net, as well.
TABLE III: Performance comparison on DHF1K dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method           AUC-J↑  CC↑    NSS↑   SIM↑
SalGAN (static)  0.866   0.370  2.043  0.262
PQFT*            0.699   0.137  0.749  0.139
Fang et al.*     0.819   0.273  1.539  0.198
AWS-D*           0.703   0.174  0.940  0.157
Bak et al.       0.834   0.325  1.632  0.197
OM-CNN           0.856   0.344  1.911  0.256
ACLNet           0.890   0.434  2.354  0.315
SalEMA           0.890   0.449  2.574  —
STRA-Net         —       —      —      —
* Non-deep learning model
TABLE IV: Performance comparison on DIEM dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method           AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
SalGAN (static)  0.860   0.492  2.068  0.392  1.431
PQFT*            0.680   0.190  0.656  0.220  2.140
Fang et al.*     0.825   0.360  1.407  0.313  1.688
AWS-D*           0.768   0.313  1.228  0.272  1.825
Bak et al.       0.810   0.313  1.212  0.206  2.050
OM-CNN           0.847   0.464  2.037  0.381  1.599
ACLNet           —       —      —      —      —
* Non-deep learning model
Performance on DHF1K.
We test the performance of our model on the recently proposed DHF1K video saliency dataset, which includes 300 test videos. As mentioned before, the annotations for the test split are not publicly available and all the evaluations are carried out externally by the authors of the dataset. As Table III shows, our proposed model achieves performance on par with the state-of-the-art models. In terms of AUC-J, along with the recent STRA-Net and TASED-Net models, it outperforms all the other saliency models. In terms of CC, our model gives roughly the second best result.
Performance on DIEM.
We also evaluate our model on the DIEM test set consisting of 20 videos. Table IV summarizes these quantitative results. As can be seen, our model achieves the highest scores in the NSS and KLDiv metrics and is very competitive in the others. The second setting demonstrates the generalization capability of our proposed approach as compared to the recent models like SalEMA, STRA-Net and TASED-Net.

In Fig. 7, we show some sample saliency maps predicted by our proposed model and four other deep saliency networks: ACLNet, SalEMA, STRA-Net, and TASED-Net. As one can observe, our model generally makes better predictions than the competing approaches. For instance, for the sequence from UCF-Sports (Fig. 7(a)) most of the models fail to identify the salient region on the swimmer, and for the sequence from the Hollywood-2 dataset (Fig. 7(b)) our model is the only one that correctly predicts the soldier at the center of the background as salient. Similar kinds of observations are also valid for the sample sequences from the DHF1K (Fig. 7(c)) and DIEM (Fig. 7(d)) datasets.

TABLE V: Performance comparison on DIEM-Meta dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method      AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
ACLNet      0.845   0.437  1.627  0.391  1.473
SalEMA      0.832   0.392  1.576  0.374  1.664
STRA-Net    0.840   0.419  1.637  0.385  1.634
TASED-Net   0.857   0.455  1.810  —      —

TABLE VI: Performance comparison on LEDOV-Meta dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method      AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
ACLNet      0.879   0.384  1.750  0.342  1.837
SalEMA      0.863   0.380  1.815  0.353  1.850
STRA-Net    —       —      —      —      —
Performance on DIEM-Meta and LEDOV-Meta.
As mentioned before, [67] have recently shown that most of the current benchmarks for video saliency include many sequences in which spatial attention is more dominant than temporal effects in describing saliency. The DIEM-Meta and LEDOV-Meta datasets are curated in a special way to contain video frames in which temporal signals are found to be more influential than appearance cues. Hence, they both offer a better way to test how well a dynamic saliency model utilizes temporal information. In our experimental evaluation, we compare our proposed model with the state-of-the-art deep saliency models, which are all trained on the combined training set that includes frames from the DIEM or LEDOV datasets. As can be seen from Table V and Table VI, our model outperforms all the other models on DIEM-Meta, and is the second best model on LEDOV-Meta, achieving highly competitive performances. These results demonstrate the effectiveness of the proposed gated mechanism and its ability to use temporal information to the full extent, as compared to the state-of-the-art approaches.

Overall, the results reported on all the six datasets considered in our experimental analysis suggest that our model has a better capacity to mimic the human attention mechanism by combining the temporal and static cues in an effective way. It has a better generalization ability in that it can predict where people look in videos from unseen domains much better. Moreover, it utilizes the temporal information more successfully with its gated fusion mechanism, which adaptively integrates spatial and temporal cues depending on the video content.
E. Ablation Study
In this section, we aim to analyze the influence of each component of our proposed deep dynamic saliency model. We perform the ablation study on the UCF-Sports dataset by disabling or removing some blocks of our model and by examining how these changes affect the model performance. As we did for training our proposed model, for each version of our model under evaluation, we first train a single-stream model on the SALICON dataset and then use this model to finetune the actual two-stream version. Accordingly, Table VII reports the performance of different versions of our saliency model.

Fig. 7: Qualitative results of our proposed framework and the deep learning based ACLNet, SalEMA, STRA-Net and TASED-Net models on (a) UCF-Sports, (b) Hollywood-2, (c) DHF1K and (d) DIEM. Our approach, in general, produces more accurate saliency predictions than these state-of-the-art models. (Rows: GT, Ours, SalEMA, ACLNet, TASED-Net, STRA-Net.)

Fig. 8: Our model dynamically decides the contribution of the motion and appearance streams via gated fusion. Here, we plot the average motion probabilities (the contribution of the motion stream) for two regions having different characteristics, one containing a moving object (the gummy bear) and the other with relatively no motion, shown in red and blue, respectively. As can be seen, our model assigns higher weights to the motion stream when motion becomes the dominant visual cue, and the weights adaptively change throughout the sequence.
Effect of gated fusion.
As we emphasized before, the gated fusion block, whose role is to adaptively integrate the spatial and temporal streams, is a key component of our model. In our analysis, we replace the gated fusion block with a standard convolution layer. As can be seen from Table VII, the performance of the model decreases considerably without the gated fusion mechanism. That is, using a dynamic weighting strategy, instead of a fixed weighting scheme (learned via convolution), generates much better predictions. Fig. 8 shows a visualization of how our proposed gated fusion operates, demonstrating the behavior of the weighting scheme for both dynamic and static parts of a given scene. In particular, we plot the motion probabilities averaged within the corresponding image regions over time, which clearly shows that the motion probability (the contribution of the motion stream) for the region that contains a moving object is, in general, much higher than that of the static region. Moreover, depending on the characteristics of the regions, it shows the changes in the motion probabilities throughout the whole sequence. For example, when no motion is taking place in the region initially containing the moving object, the weight of the temporal stream starts to fall. These results support our main claim that considering the content of the video while combining temporal and spatial cues is a more appropriate way to model saliency estimation in dynamic scenes.
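For clarity, the static-fusion baseline used in this ablation can be sketched as follows: a single learned convolution, whose weights are fixed after training, replaces the input-dependent gate. The kernel size is our assumption, as it is not given in the text.

```python
import torch
import torch.nn as nn

class StaticFusion(nn.Module):
    """Ablation baseline: a learned but input-independent weighting of the two
    streams, i.e. the same convolution weights are applied to every frame."""

    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # kernel size assumed

    def forward(self, s_a, s_t):
        # Unlike the gated fusion block, no per-frame probability map is computed;
        # the mixing is fixed by the learned convolution weights.
        return self.fuse(torch.cat([s_a, s_t], dim=1))
```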
Effect of multi-level information.
Previous studies demonstrate that low- and high-level cues are equally important for saliency prediction [9], [10]. Motivated by these, we included a multi-level information block to fuse features extracted from different levels of our deep model. For this analysis, we disable this multi-level information block and train a single-scale model instead. Compared to our full model, disabling this block reduces the performance, as can be seen in Table VII. Employing a representation that contains information from low and high levels helps to improve the performance of our model. We speculate that our multi-level information block allows the network to better identify the regions semantically important for saliency.

TABLE VII: Ablation study on UCF-Sports dataset.

Method                            AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
w/o spatial attention             0.872   0.474  2.884  0.374  2.223
w/o channel-wise attention        0.892   0.489  2.923  0.319  1.707
w/o spatial & ch.-wise attention  0.875   0.447  2.885  0.364  2.646
w/o multi-level information       0.890   0.484  2.755  0.303  1.711
w/o gated fusion                  0.900   0.480  2.913  0.353  1.676
full model                        —       —      —      —      —

Effect of attention blocks.
As discussed before, the reasons we introduce the attention blocks are to eliminate the irrelevant features via the spatial attention and to choose the most informative feature channels via the channel-wise attention when processing a video frame. In this experiment, we remove the spatial and the channel-wise attention blocks from our full model and train two different models, respectively. The results given in Table VII support our assertion that both of these attention blocks improve the model performance. Disabling them results in a much lower performance as compared to that of the full model.

V. SUMMARY AND CONCLUSION
In this study, we proposed a new spatio-temporal saliency network for video saliency. It follows a two-stream network architecture that processes spatial and temporal information in separate streams, but it extends the standard structure in many ways. First, it includes a gated fusion block that performs the integration of the spatial and temporal streams in a more dynamic manner by deciding the contribution of each stream one frame at a time. Second, it utilizes a multi-level information block that allows for performing multi-scale processing of appearance and motion features. Finally, it employs spatial and channel-wise attention blocks to further increase the selectivity. Our extensive set of experiments on six different benchmark datasets shows the effectiveness of the proposed model in extracting the most salient parts of the video frames, both qualitatively and quantitatively. Moreover, our ablation study demonstrates the gains achieved by each component of our model. Our analysis reveals that the proposed model deals with videos from unseen domains much better than the existing dynamic saliency models. Additionally, it uses temporal cues more effectively via the proposed gated fusion mechanism, which allows for adaptive integration of the spatial and temporal streams.

We believe that our work highlights several important directions to pursue for better modeling of saliency in videos. As future work, we plan to explore more efficient ways to include temporal information. For instance, instead of using optical flow images, one can use features extracted from early and mid layers of an optical flow network model to encode motion information. This can reduce the memory footprint of the model and decrease the running times.

ACKNOWLEDGMENTS
This work was supported in part by the TUBA GEBIP fellowship awarded to E. Erdem.

REFERENCES
[1] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
[2] A. Borji and L. Itti, "State-of-the-art in visual attention modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2013.
[3] S. Filipe and L. A. Alexandre, "Retracted article: From the human visual system to the computational models of visual attention: a survey," Artificial Intelligence Review, vol. 43, no. 4, pp. 601–601, 2015.
[4] H. Kim and S. Lee, "Transition of visual attention assessment in stereoscopic images with evaluation of subjective visual quality and discomfort," IEEE Transactions on Multimedia, vol. 17, no. 12, pp. 2198–2209, 2015.
[5] K. Gu, S. Wang, H. Yang, W. Lin, G. Zhai, X. Yang, and W. Zhang, "Saliency-guided quality assessment of screen content images," IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 1098–1110, 2016.
[6] Y. Fang, Z. Chen, W. Lin, and C. Lin, "Saliency detection in the compressed domain for adaptive image retargeting," IEEE Transactions on Image Processing, vol. 21, no. 9, pp. 3888–3901, 2012.
[7] D. Chen and Y. Luo, "Preserving motion-tolerant contextual visual saliency for video resizing," IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1616–1627, 2013.
[8] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, and Y. Avrithis, "Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention," IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1553–1568, 2013.
[9] N. D. B. Bruce, C. Catton, and S. Janjic, "A deeper look at saliency: Feature contrast, semantics, and beyond," in Proc. CVPR, 2016, pp. 516–524.
[10] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand, "Where should saliency models look next?" in Proc. ECCV, 2016, pp. 809–824.
[11] C. Guo, Q. Ma, and L. Zhang, "Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform," in Proc. CVPR, 2008, pp. 1–8.
[12] X. Cui, Q. Liu, and D. Metaxas, "Temporal spectral residual: fast motion saliency detection," in Proc. ACM MM, 2009, pp. 617–620.
[13] H. J. Seo and P. Milanfar, "Static and space-time visual saliency detection by self-resemblance," Journal of Vision, vol. 9, no. 12, pp. 15–15, 2009.
[14] W. Sultani and I. Saleemi, "Human action recognition across datasets by foreground-weighted histogram decomposition," in Proc. CVPR, 2014, pp. 764–771.
[15] T. Mauthner, H. Possegger, G. Waltner, and H. Bischof, "Encoding based saliency detection for videos and images," in Proc. CVPR, 2015, pp. 2494–2502.
[16] Y. Luo and Q. Tian, "Spatio-temporal enhanced sparse feature selection for video saliency estimation," June 2012, pp. 33–38.
[17] S. Mathe and C. Sminchisescu, "Dynamic eye movement datasets and learnt saliency models for visual action recognition," in Proc. ECCV, 2012, pp. 842–856.
[18] D. Rudoy, D. B. Goldman, E. Shechtman, and L. Zelnik-Manor, "Learning video saliency from human gaze using candidate selection," in Proc. CVPR, 2013, pp. 1147–1154.
[19] S. Zhong, Y. Liu, F. Ren, J. Zhang, and T. Ren, "Video saliency detection via dynamic consistent spatio-temporal attention modelling," in Proc. AAAI, 2013.
[20] Z. Liu, X. Zhang, S. Luo, and O. Le Meur, "Superpixel-based spatiotemporal saliency detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 9, pp. 1522–1540, 2014.
[21] F. Zhou, S. B. Kang, and M. F. Cohen, "Time-mapping using space-time saliency," in Proc. CVPR, 2014, pp. 3358–3365.
[22] J. Zhao, C. Siagian, and L. Itti, "Fixation bank: Learning to reweight fixation candidates," in Proc. CVPR, 2015, pp. 3174–3182.
[23] S. H. Khatoonabadi, N. Vasconcelos, I. V. Bajić, and Y. Shan, "How many bits does it take for a stimulus to be salient?" in Proc. CVPR, 2015, pp. 5501–5510.
[24] V. Leborán, A. García-Díaz, X. R. Fdez-Vidal, and X. M. Pardo, "Dynamic whitening saliency," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 5, pp. 893–907, 2017.
[25] L. Bazzani, H. Larochelle, and L. Torresani, "Recurrent mixture density network for spatiotemporal visual attention," in Proc. ICLR, 2017.
[26] W. Wang, J. Shen, J. Xie, M. Cheng, H. Ling, and A. Borji, "Revisiting video saliency prediction in the deep learning era," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[27] A. Palazzi, D. Abati, S. Calderara, F. Solera, and R. Cucchiara, "Predicting the driver's focus of attention: the dr(eye)ve project," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[28] C. Bak, A. Kocak, E. Erdem, and A. Erdem, "Spatio-temporal saliency networks for dynamic saliency prediction," IEEE Transactions on Multimedia, vol. 20, no. 7, pp. 1688–1698, 2018.
[29] L. Jiang, M. Xu, T. Liu, M. Qiao, and Z. Wang, "Deepvs: A deep learning based video saliency prediction approach," in Proc. ECCV, 2018, pp. 625–642.
[30] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proc. CVPR, 2014, pp. 1725–1732.
[31] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), 2014, pp. 568–576.
[32] M. A. Goodale and A. D. Milner, "Separate visual pathways for perception and action," Trends in Neurosciences, vol. 15, no. 1, pp. 20–25, 1992.
[33] A. M. Treisman and G. Gelade, "A feature-integration theory of attention," Cognitive Psychology, vol. 12, no. 1, pp. 97–136, 1980.
[34] C. Koch and S. Ullman, "Shifts in selective visual attention: Towards the underlying neural circuitry," Human Neurobiology, vol. 4, pp. 219–227, 1985.
[35] X. Huang, C. Shen, X. Boix, and Q. Zhao, "SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks," in Proc. ICCV, 2015, pp. 262–270.
[36] S. Jetley, N. Murray, and E. Vig, "End-to-end saliency mapping via probability distribution prediction," in Proc. CVPR, 2016, pp. 5753–5761.
[37] S. S. S. Kruthiventi, K. Ayush, and R. V. Babu, "Deepfix: A fully convolutional neural network for predicting human eye fixations," IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4446–4456, 2017.
[38] N. Liu, J. Han, T. Liu, and X. Li, "Learning to predict eye fixations via multiresolution convolutional neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 2, pp. 392–404, 2018.
[39] J. Pan, E. Sayrol, X. Giró-i Nieto, K. McGuinness, and N. E. O'Connor, "Shallow and deep convolutional networks for saliency prediction," in Proc. CVPR, 2016, pp. 598–606.
[40] W. Wang and J. Shen, "Deep visual attention prediction," IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2368–2378, 2018.
[41] E. Vig, M. Dorr, and D. Cox, "Large-scale optimization of hierarchical features for saliency prediction in natural images," in Proc. CVPR, 2014, pp. 2798–2805.
[42] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, "Predicting human eye fixations via an lstm-based saliency attentive model," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5142–5154, 2018.
[43] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS), 2006, pp. 545–552.
[44] Y. Fang, Z. Wang, W. Lin, and Z. Fang, "Video saliency incorporating spatiotemporal cues and uncertainty weighting," IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 3910–3921, 2014.
[45] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proc. ICCV, 2015, pp. 4489–4497.
[46] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in Proc. ICLR, 2016.
[47] P. Linardos, E. Mohedano, J. J. Nieto, K. McGuinness, X. Giro-i Nieto, and N. E. O'Connor, "Simple vs complex temporal recurrences for video saliency prediction," in Proc. BMVC, 2019.
[48] K. Min and J. J. Corso, "Tased-net: Temporally-aggregating spatial encoder-decoder network for video saliency detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2394–2403.
[49] Q. Lai, W. Wang, H. Sun, and J. Shen, "Video saliency prediction using spatiotemporal residual attentive networks,"
IEEE Trans. on ImageProcessing , 2019.[50] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in
Proc. CVPR , 2016, pp. 770–778.[51] T. Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie,“Feature pyramid networks for object detection,” in
Proc. CVPR , 2017,pp. 936–944.[52] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networksfor biomedical image segmentation,” in
Proc. MICCAI , 2015, pp. 234–241.[53] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networksfor semantic segmentation,” in
Proc. CVPR , 2015, pp. 3431–3440.[54] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Doll´ar, “Learning to refineobject segments,” in
Proc. ECCV , 2016, pp. 75–91.[55] T. Zhao and X. Wu, “Pyramid feature attention network for saliencydetection,” in
Proc. CVPR , 2019, pp. 3080–3089.[56] S. Dong, Z. Gao, S. Sun, X. Wang, M. M. Li, H. Zhang, G. Yang, H. Liu,and S. Li, “Holistic and deep feature pyramids for saliency detection,”in
Proc. BMVC , 2018.[57] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent modelsof visual attention,” in
Proc. NIPS , 2014.[58] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov,R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image captiongeneration with visual attention,” in
Proc. ICML , 2015.[59] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for human pose estimation,” in
Proc. CVPR , 2017, pp.5669–5678.[60] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S.Chua, “SCA-CNN: Spatial and channel-wise attention in convolutionalnetworks for image captioning,” in
Proc. CVPR , 2017.[61] W. Ren, L. Ma, J. Zhang, J. Pan, X. Cao, W. Liu, and M.-H. Yang,“Gated fusion network for single image dehazing,” in
Proc. CVPR , 2018.[62] X. Zhang, H. Dong, Z. Hu, W.-S. Lai, F. Wang, and M.-H. Yang, “Gatedfusion network for joint image deblurring and super-resolution,” in
Proc.BMVC , 2018.[63] Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang, “Locality-sensitivedeconvolution networks with gated fusion for RGB-D indoor semanticsegmentation,” in
Proc. CVPR , 2017, pp. 1475–1483.[64] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action MACH a spatio-temporal maximum average correlation height filter for action recogni-tion,” in
Proc. CVPR , 2008.[65] S. Mathe and C. Sminchisescu, “Actions in the eye: Dynamic gazedatasets and learnt saliency models for visual recognition,”
IEEE Trans-actions on Pattern Analysis and Machine Intelligence , vol. 37, no. 7,pp. 1408–1424, 2015.[66] P. K. Mital, T. J. Smith, R. L. Hill, and J. M. Henderson, “Clustering ofgaze during dynamic scene viewing is predicted by motion,”
CognitiveComputation , vol. 3, no. 1, pp. 5–24, 2011.[67] M. Tangemann, M. K¨ummerer, T. S. Wallis, and M. Bethge, “Measuringthe importance of temporal features in video saliency,” 2020. [68] T. Lan, Y. Wang, and G. Mori, “Discriminative figure-centric modelsfor joint action localization and recognition,” in
Proc. ICCV , 2011, pp.2003–2010.[69] M. Marszalek, I. Laptev, and C. Schmid, “Actions in context,” in
Proc.CVPR , 2009, pp. 2929–2936.[70] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: CNNs for opticalflow using pyramid, warping, and cost volume,” in
Proc. CVPR , 2018.[71] M. Jiang, S. Huang, J. Duan, and Q. Zhao, “SALICON: Saliency incontext,” in
Proc. CVPR , 2015, pp. 1072–1080.[72] X. Huang, C. Shen, X. Boix, and Q. Zhao, “SALICON: Reducing thesemantic gap in saliency prediction by adapting deep neural networks,”in
Proc. ICCV , 2015, pp. 262–270.[73] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand, “Whatdo different evaluation metrics tell us about saliency models?”
IEEETransactions on Pattern Analysis and Machine Intelligence , vol. 41,no. 3, pp. 740–757, 2019.[74] J. Pan, C. Canton, K. McGuinness, N. E. O’Connor, J. Torres, E. Sayrol,and X. a. Giro-i Nieto, “SalGAN: Visual saliency prediction withgenerative adversarial networks,” in arXivarXiv