A Gated Fusion Network for Dynamic Saliency Prediction
Aysun Kocak, Erkut Erdem and Aykut Erdem
Department of Computer Engineering, Hacettepe University, Ankara, Turkey
Department of Computer Engineering, Koç University, Istanbul, Turkey
Abstract—Predicting saliency in videos is a challenging problem due to the complex modeling of interactions between spatial and temporal information, especially when the ever-changing, dynamic nature of videos is considered. Recently, researchers have proposed large-scale datasets and models that take advantage of deep learning as a way to understand what is important for video saliency. These approaches, however, learn to combine spatial and temporal features in a static manner and do not adapt themselves much to the changes in the video content. In this paper, we introduce the Gated Fusion Network for dynamic saliency (GFSalNet), the first deep saliency model capable of making predictions in a dynamic way via a gated fusion mechanism. Moreover, our model also exploits spatial and channel-wise attention within a multi-scale architecture that further allows for highly accurate predictions. We evaluate the proposed approach on a number of datasets, and our experimental analysis demonstrates that it outperforms or is highly competitive with the state of the art. Importantly, we show that it has good generalization ability, and moreover, exploits temporal information more effectively via its adaptive fusion scheme.
Index Terms—dynamic saliency estimation, gated fusion, deep saliency networks
I. INTRODUCTION
The human visual system employs visual attention mechanisms to effectively deal with huge amounts of information by focusing only on salient or attention grabbing parts of a scene, and thus filtering out irrelevant stimuli. Saliency estimation methods offer different computational models of attention to mimic this key component of our visual system. These methods generate a so-called saliency map within which a pixel value indicates the likelihood of that pixel being fixated by a human. Since the pioneering work of [1], this research area has gained a lot of interest in the last few decades (please refer to [2], [3] for an overview), and it has found practical use in a variety of computer vision tasks such as visual quality assessment [4], [5], image and video resizing [6], [7], and video summarization [8], to name a few. Early saliency prediction approaches use low-level (color, orientation, intensity) and/or high-level (pedestrians, faces, text, etc.) image features to estimate salient regions. While low-level cues are used to detect regions that are different from their surroundings, top-down cues are used to infer high-level semantics to guide the model. For example, humans tend to focus on some object classes more than others. Recently, deep learning based models have started to dominate over the traditional approaches as they can directly learn both low- and high-level features relevant for saliency prediction [9], [10].
(Fig. 1, left: a single input frame and its corresponding fixation map. Right: four consecutive overlaid frames and their overlaid fixation maps.)
Fig. 1: Predicting video saliency requires finding a harmonious interaction between appearance and temporal information. For example, while the first row shows a case in which attention is guided more by visual appearance, in the second row, motion is the most determining factor for attention. Hence, we speculate that an adaptive scheme would be better suited for this task.

Most of the literature on saliency estimation focuses on static images. Lately, predicting saliency in videos has also gained some traction, but it still remains a largely unexplored field of research. Video saliency models (also called dynamic saliency models) aim to predict attention grabbing regions in dynamically changing scenes. While static saliency estimation considers only low-level and high-level spatial cues, dynamic saliency needs to take into account temporal information too, as there is evidence that moving objects or object parts can also guide our attention. Motion and appearance play complementary roles in human attention and their significance can change over time. As we illustrate in Fig. 1, in dynamic scenes, humans tend to focus more on moving parts of the scene and the eye fixations change over time, showing the importance of motion cues (bottom row). On the other hand, when there is practically no motion in the scene, low-level appearance cues dominantly guide our attention and we focus more on the regions showing different visual characteristics than their surroundings (top row). Motivated by these observations, in this work, we develop a deep dynamic saliency model which handles spatial and temporal changes in the visual stimuli in an adaptive manner.

The first generation of dynamic saliency methods were simply extensions of the static saliency approaches, e.g. [11], [12], [13], [14], [15]. In other words, these methods adapted the strategies proposed for static scenes and mostly modified them to work on either 3D feature maps that are formed by stacking 2D spatial features over time or 2D feature maps encoding motion information like optical flow images. Several follow-up works, however, have approached the problem from a fresh perspective and developed specialized methods for dynamic saliency detection, e.g. [16], [17], [18], [19], [20], [21], [22], [23], [24]. These models either utilize novel spatio-temporal features or employ data-driven techniques to learn relevant features from data. As with the case of state-of-the-art static saliency models, approaches based on deep learning have also shown promise for dynamic saliency. These studies basically explore different neural architectures for processing temporal and spatial information in a joint manner, and they either use 3D convolutions [25], LSTMs [25], [26] or multi-stream architectures that encode temporal information separately [27], [28], [29].

In this work, we introduce the Gated Fusion Network for video saliency (GFSalNet). Our proposed network model is radically different from the previously proposed deep models in that it includes a novel content-driven fusion scheme to combine spatial and temporal streams in a more dynamic manner. In particular, our model is based on two-stream CNNs [30], [31], which have been successfully applied to various video analysis tasks. To our interest, these architectures are inspired by the ventral and dorsal pathways in the human visual cortex [32].
Although the use of two-stream CNNs in video saliency prediction has been investigated before [28], the main novelty of our work lies in the ability to fuse appearance and motion information in a spatio-temporally coordinated manner by estimating the importance of each cue based on the current video content.

The rest of the paper is organized as follows: In Section 2, we give a brief overview of the existing dynamic saliency approaches. In Section 3, we present the details of our proposed deep architecture for video saliency. In Section 4, we give the details of our experimental setup, including evaluation metrics, datasets and the competing dynamic saliency models, and discuss the results of our experiments. Finally, in the last section, we offer some concluding remarks. Our code and pretrained models, along with the saliency maps extracted with our approach, will be publicly available at the project website (https://hucvl.github.io/GFSalNet/).

II. RELATED WORK
Early visual saliency models can be dated back to the 1980s with the Feature Integration Theory of [33]. The first models of saliency, such as [34], [1], provide computational solutions to [33], and since then a notable number of saliency models have been developed, most of which deal with static scenes. For a detailed list of pre-deep learning saliency estimation approaches, please refer to [2]. After the availability of large-scale datasets, researchers proposed various deep learning based models for static saliency that outperformed previous approaches by a large margin [35], [36], [37], [38], [39], [40], [41], [42].
Early models for dynamic saliency generally depend on previously proposed static saliency models. Adaptation of these models to dynamic scenes is achieved by considering features related to motion such as optical flow information. For example, [11] proposed a saliency prediction method called PQFT that predicts the salient regions via the phase spectrum of the Fourier Transform of the given image. In particular, PQFT generates a quaternion image representation by using color, intensity, orientation and motion features and estimates the salient regions in the frequency domain by using this combined representation. [12] extracted salient parts of video frames by similarly performing a spectral analysis of the frames considering both spatial and temporal domains. [13] employed local regression kernels as features to calculate self-similarities between pixels or voxels for figure-ground segregation. [14] extended the static saliency model of [43] by including motion cues in the graph-theoretic formulation. [44] employs a two-stream approach that generates a spatial saliency map (using color and texture features) and a temporal saliency map (using optical flow features) separately and combines these maps with an entropy based adaptive method. [15] proposed a dynamic saliency model for activity recognition that works in an unsupervised manner. Their method is based on an encoding scheme that considers color along with motion cues.

Following these early approaches, researchers started to develop novel video saliency models specifically designed for dynamic stimuli. For instance, [16] proposed a sparsity based framework that generates spatial saliency maps and temporal saliency maps separately based on entropy gain and temporal consistency, respectively, and then combines them. [17] integrated several visual cues, such as static and dynamic image features based on color, texture, edge distribution, and motion boundary histograms, through learning-based fusion strategies and later employed this dynamic saliency model for action recognition. [18] suggested a learning-based model that generates a candidate set of regions with the use of existing methods and then predicts gaze transitions over subsequent video frames conditionally on these regions. [19] proposed a simple dynamic saliency model that combines spatial saliency maps with temporal saliency using a pixel-wise maximum operation. In their work, while the spatial saliency maps are extracted using multi-scale analysis of low-level features, temporal saliency maps are obtained by examining dynamic consistency of motion through an optical flow model. [20] suggested an approach that independently estimates superpixel-level and pixel-level temporal and spatial saliency maps and subsequently combines them using an adaptive fusion strategy. [21] proposed an approach that oversegments video frames by using both spatial and temporal information and estimates the saliency score for each region by computing the regional contrast values via low-level features extracted from these regions. [22] suggested to learn a filter bank from low-level features for fixations. This filter bank encodes the association between local feature patterns and probabilities of human fixations, and is used to re-weight fixation candidates.
[23] formulated another dynamic saliency model by exploiting the compressibility principle. More recently, [24] proposed a saliency model (called AWS-D) for dynamic scenes by considering the observation that high-order statistical structures carry most of the perceptually relevant information. AWS-D [24] removes the second-order information from the input sequence via a whitening process. Then, it computes bottom-up spatial saliency maps using a filter bank at multiple scales, and temporal saliency maps with the use of a 3D filter bank. Finally, it combines all these maps by considering their relative significance.

Deep learning based dynamic saliency models have received attention only recently. [25] proposed a recurrent mixture density network (RMDN) for spatio-temporal visual attention. The method uses a C3D architecture [45] as a backbone to integrate spatial and temporal information. This representation module is fed to a Long Short-Term Memory (LSTM) network, which is connected to a Mixture Density Network (MDN) whose outputs are the parameters of a Gaussian mixture model expressing the saliency map of each frame. [28] suggested a two-stream CNN model [30], [31] which considers the motion and appearance cues in videos. While optical flow images are used to feed the temporal stream, raw RGB frames are used as input for the spatial stream. [27] presented an attention network to predict where a driver is focused. In this work, the authors also proposed a dataset that consists of egocentric and car-centric driving videos and the corresponding eye tracking data. Their network consists of three independent paths, namely spatial, temporal and semantic paths. While the spatial path uses raw RGB data as input, the temporal one uses optical flow data to integrate motion information and the last one processes the segmentation prediction on the scene given by the model of [46]. In the final layer of the network, the three independent maps are summed and then normalized to obtain the final saliency map. [29] proposed a deep model called OM-CNN which consists of two subnetworks, namely an objectness subnet to highlight the regions that contain an object and a motion subnet to encode temporal information, whose outputs are then combined to generate spatio-temporal features. [26] proposed a model called ACLNet which employs a CNN-LSTM architecture to predict human gaze in dynamic scenes. The proposed approach attends to static information with an attention module and allows an LSTM to focus on learning dynamic information. Recently, [47] proposed an encoder-decoder based deep neural network called SalEMA, which employs a convolutional recurrent neural network to include temporal information. In particular, it processes a sequence of RGB video frames as input to employ spatial and temporal information, with the temporal information being inferred by the weighted average of the convolution state of the current frame and all the previous frames. [48] suggested a different model called TASED-Net, which utilizes a 3D fully-convolutional encoder-decoder network architecture where the encoded features are spatially upsampled while aggregating the temporal information. [49] recently developed another two-stream spatiotemporal saliency model called STRA-Net that considers dense residual cross connections and a composite attention module.

The aforementioned dynamic saliency models suffer from different drawbacks. The early methods employ (hand-crafted) low-level features that do not provide a high-level understanding of the video frames.
Deep models eliminate this pitfall by utilizing an end-to-end learning strategy and, hence, provide better saliency predictions. They differ from each other by how they include motion information within their respective architectures. As we reviewed, the two main alternative approaches include using recurrent connections or processing data in multiple streams. Although RNN-based models help to encode temporal information with a smaller number of parameters, the encoding procedure compresses all the relevant information into a single vector representation, which affects the robustness especially for longer sequences. In that respect, the accuracy of two-stream models does not, in general, degrade as the length of a sequence increases. Moreover, they are more interpretable as they need to perform fusion of spatial and temporal features in an explicit manner. On the other hand, their performance depends on accurate estimation of the optical flow maps used as input to the temporal stream. Hence, most of these two-stream models employ recent deep-learning based optical flow estimation models, and some of them even use additional post-processing steps such as confining the absolute values of the flow magnitudes within a certain interval to avoid noise, as in STRA-Net [49]. Our proposed model also uses a two-stream approach, but as we will show, it exploits a novel and more dynamic fusion strategy, which boosts the performance and further improves the interpretability.

III. OUR MODEL
A general overview of our proposed spatio-temporal network architecture is given in Fig. 2. We use a two-stream architecture that processes temporal and spatial information in separate streams, similar to the one in [28]. That is, we respectively feed the spatial stream and the temporal stream with RGB video frames and the corresponding optical flow images as inputs. Different than [28], however, our network combines information coming from several levels (Section III-A) and fuses both streams via a novel dynamic fusion strategy (Section III-C). We additionally utilize attention blocks (Section III-B) to select more relevant features to further boost the performance of our model. Here, we use a pre-trained ResNet-50 model [50] as the backbone of our saliency network, as commonly explored by previous saliency studies. In particular, we remove the average pooling and fully connected layers after the last residual block (ResBlock4) and then adapt it for saliency prediction by adding extra blocks. Using the ResNet-50 model allows us to encode low-, mid- and high-level cues in the visual stimuli in an efficient manner. Moreover, the number of network parameters is much smaller as compared to other alternative backbone networks.
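To make the backbone choice concrete, the following is a minimal PyTorch sketch of how a pre-trained ResNet-50 can be truncated into a fully convolutional feature extractor that exposes per-block feature maps of the kind later consumed by the multi-level information block. The class and variable names, as well as the torchvision weight specifier, are illustrative assumptions and not the authors' released code.

```python
import torch
import torchvision

class ResNet50Backbone(torch.nn.Module):
    """Truncated ResNet-50: average pooling and the fully connected
    layer are removed so that dense feature maps are preserved."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1,
                                        resnet.relu, resnet.maxpool)
        self.block1 = resnet.layer1   # ResBlock1, 256 channels
        self.block2 = resnet.layer2   # ResBlock2, 512 channels
        self.block3 = resnet.layer3   # ResBlock3, 1024 channels
        self.block4 = resnet.layer4   # ResBlock4, 2048 channels

    def forward(self, x):
        x = self.stem(x)
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        f4 = self.block4(f3)
        # Return all levels so low-, mid- and high-level cues can be fused later.
        return f1, f2, f3, f4
```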
A. Multi-level Information Block
Fig. 2: Our two-stream dynamic saliency model uses RGB frames for the spatial stream and optical flow images for the temporal stream. These streams are integrated with a dynamic fusion strategy that we refer to as gated fusion. Our architecture also employs a multi-level information block to fuse multi-scale features and attention blocks for feature selection.

Fig. 3: Multi-level information block. It is used to integrate multi-scale features extracted at different levels of the deep network for predicting salient parts of the given input video frame.

As its name implies, the purpose of the multi-level information block is to let the information extracted at different levels guide the saliency prediction process. Employing a multi-level/multi-scale structure has proven to be useful, as it almost always improves the performance for many different vision tasks such as object detection [51], segmentation [52], [53], [54], and static saliency detection [55], [56]. In our work, we also employ a multi-level information block to enhance the feature learning capability of our model. Specifically, it allows low-, mid-, and high-level information to be fused together and to be taken into account simultaneously while making predictions. Fig. 3 shows the proposed multi-level information block that we employ in our model. This block considers low-level and high-level representations of frames by processing feature maps extracted at each residual block. The aim is to combine primitive image features (e.g. edges, shared common patterns) obtained at lower levels with rich semantic information (e.g. object parts, faces, text) extracted at higher levels of the network. Here, we utilize convolution and bilinear interpolation layers to combine cues from higher and lower levels. That is, after each residual block, we expand the feature map with bilinear interpolation to make its size equal to that of the output of the previous residual block. Then, we concatenate the expanded feature map with the previous residual block's output and fuse them via convolution layers.
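As a concrete illustration of one fusion step in this block, the following PyTorch sketch upsamples a deeper feature map to the previous block's spatial size, concatenates the two, and fuses them with a convolution. The 1 × 1 kernel size, the module name and the channel widths in the example are our own assumptions, since they are not recoverable from the text.

```python
import torch
import torch.nn.functional as F

class LevelFusion(torch.nn.Module):
    """Fuses a deeper (coarser) feature map into the previous level."""

    def __init__(self, deep_channels, prev_channels, out_channels):
        super().__init__()
        # 1x1 fusion convolution is an assumption; the paper only states
        # that a convolution follows the concatenation.
        self.fuse = torch.nn.Conv2d(deep_channels + prev_channels,
                                    out_channels, kernel_size=1)

    def forward(self, deep_feat, prev_feat):
        # Upsample the deeper map to the previous block's resolution.
        up = F.interpolate(deep_feat, size=prev_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        # Concatenate along the channel dimension and fuse.
        return self.fuse(torch.cat([up, prev_feat], dim=1))


# Example: fuse ResBlock4 (2048 ch) features into ResBlock3 (1024 ch) features.
fusion = LevelFusion(2048, 1024, 256)
f3, f4 = torch.randn(1, 1024, 28, 28), torch.randn(1, 2048, 14, 14)
fused = fusion(f4, f3)   # -> [1, 256, 28, 28]
```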
B. Attention Blocks

Neural attention mechanisms allow for learning to pay attention to features that are more useful for a given task, and hence, it has been demonstrated many times that they can boost the performance of a neural network architecture proposed for nearly any computer vision problem, such as object detection [57], visual question answering [58], pose estimation [59], image captioning [60] and salient object detection [55]. Motivated by these observations, in our work, we integrate several attention blocks into our proposed deep architecture to let the model choose the most relevant features for the dynamic saliency estimation problem. Resembling the structures in [60], [55], we exploit two separate attention mechanisms: spatial and channel-wise attention, as explained below.

Fig. 4: Attention blocks: (a) spatial attention block, (b) channel-wise attention block. While the spatial attention block defines spatial importance weights for individual feature maps, the channel-wise attention block introduces feature-level weighting which allows for a better use of context information.

Fig. 5: Gated fusion block. It integrates the spatial and temporal streams to learn a weighted gating scheme to determine their contributions in predicting dynamic saliency of the current input video frame.

Fig. 4(a) shows our spatial attention block, which we introduce at the lower levels of our network model (see Fig. 2) to help filter out the irrelevant information. The block takes the output of ResBlock4, shaped [B × C × H × W] with C = 2048, as input and determines the important locations by calculating a weight tensor of shape [B × 1 × H × W]. To estimate this tensor, the input channels are fused via a convolution layer followed by a sigmoid layer. The output of this block (shaped [B × C × H × W]) is the result of a Hadamard product between the input and the spatial weight tensor.

The second type of attention block, the channel-wise attention block, is shown in Fig. 4(b); its main purpose is to utilize the context information in a more efficient way. The block consists of average pooling, fully connected and ReLU layers. In particular, it takes the concatenation of the feature maps from the main stream and the multi-level information block, shaped [B × C × H × W], as input and then downsamples it with average pooling to shape [B × C × 1 × 1]. The weight of each channel is determined after two fully connected layers, each followed by a ReLU. The output of the last ReLU, shaped [B × C × 1 × 1], contains a scalar value to weight each channel. At the end of the block, the input feature map is weighted via a Hadamard product.
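The two attention mechanisms can be summarized with the following PyTorch sketch, written under stated assumptions: the kernel size of the fusing convolution and the hidden width (reduction ratio) of the fully connected layers are our own choices, as the paper does not give them.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Collapses channels into a [B, 1, H, W] weight map and rescales the input."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # kernel size assumed

    def forward(self, x):
        weights = torch.sigmoid(self.conv(x))       # [B, 1, H, W]
        return x * weights                          # Hadamard product


class ChannelAttention(nn.Module):
    """Pools spatially, predicts one scalar weight per channel, rescales the input."""

    def __init__(self, channels, reduction=16):     # reduction ratio assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.ReLU(inplace=True))

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))        # [B, C] after global pooling
        return x * weights.view(b, c, 1, 1)          # Hadamard product
```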
C. Gated Fusion Block

One of the main contributions of our framework is to employ a dynamic fusion strategy to combine temporal and spatial information. Gated fusion has been exploited before for different problems such as image dehazing [61], image deblurring [62], and semantic segmentation [63]. The main purpose of using a gated fusion block is to combine different kinds of information with a dynamic structure that considers the characteristics of the current inputs. For example, in [63] feature maps generated from RGB information and depth information are combined for solving semantic segmentation. In our case, our aim is to come up with a fusion module that considers the content of the video at inference time. To our knowledge, we are the first to provide a truly dynamic approach for dynamic saliency. As opposed to the classical learning based approaches that learn the contributions of the temporal and spatial streams in a static manner from the training data, our gated fusion block performs the fusion process in an adaptive way. That is, it decides the contribution of each stream in a location- and time-aware manner according to the content of the video.

The structure of the proposed gated fusion block is shown in Fig. 5. It takes the feature maps of the spatial and temporal streams as inputs and produces a probability map which is used to designate the contribution of each stream with regard to their current characteristics. Let $S_A$ and $S_T$ denote the feature maps from the spatial and temporal streams, respectively. The gated fusion module first concatenates these features and then learns their correlations by applying a convolution layer. After that, it uses a sigmoid layer to regularize the feature map, which is used to estimate the weights of the gate. Let $G_A$ and $G_T$ denote how confidently we can rely on appearance and motion, respectively:

$$G_A = P, \quad G_T = 1 - P, \qquad (1)$$

where $P$ is the output of the sigmoid layer. Then, the gated fusion module estimates the weighted maps denoting the contributions of the spatial and temporal streams, as given below:

$$S'_A = S_A \odot G_A, \quad S'_T = S_T \odot G_T, \qquad (2)$$

where $\odot$ represents the Hadamard product operation. Finally, it generates the final saliency map, $S_{final}$, by weighting the appearance and temporal streams' feature maps with the estimated probability map:

$$S_{final} = S'_A + S'_T. \qquad (3)$$

Fig. 6 visualizes how the gated fusion block works. While the appearance stream computes a saliency map from the RGB frame, the temporal stream extracts a saliency map from the optical flow image obtained from successive frames. As can be seen, these intermediate maps encode different characteristics of the input dynamic stimuli. The appearance based saliency map mostly focuses on the regions that have distinct visual properties from their surroundings, whereas the motion based saliency map mainly pays attention to motion. The gated fusion scheme estimates spatially varying probability maps and employs them to integrate the appearance and temporal streams, which results in more confident predictions. The spatial stream generally gives more accurate predictions than the temporal stream, as will be presented in the Experiments section. On the other hand, as can be seen from the estimated weight maps, the gated fusion scheme in the proposed model has a tendency to pay more attention to the temporal stream. We suspect that this is because the model considers that it may carry auxiliary information.
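A minimal PyTorch sketch of Eqs. (1)-(3) is given below. The kernel size of the gating convolution is our assumption, and the example is written for single-channel stream outputs for brevity.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Content-dependent fusion of appearance (S_A) and motion (S_T) maps."""

    def __init__(self, channels):
        super().__init__()
        # Gating convolution over the concatenated streams; 1x1 kernel assumed.
        self.gate = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, s_a, s_t):
        p = torch.sigmoid(self.gate(torch.cat([s_a, s_t], dim=1)))  # Eq. (1): G_A = P
        g_a, g_t = p, 1.0 - p                                        #          G_T = 1 - P
        s_a_prime = s_a * g_a                                        # Eq. (2): Hadamard products
        s_t_prime = s_t * g_t
        return s_a_prime + s_t_prime                                 # Eq. (3): S_final


# Example with single-channel stream outputs at saliency-map resolution.
fusion = GatedFusion(channels=1)
s_a, s_t = torch.rand(1, 1, 56, 56), torch.rand(1, 1, 56, 56)
s_final = fusion(s_a, s_t)   # [1, 1, 56, 56]
```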
IV. EXPERIMENTS

In the following, we first provide a brief review of the benchmark datasets we used in our experimental analysis. Then, we give the details of our training procedure, including the loss functions and settings we use to train our proposed model. Next, we summarize the evaluation metrics and the dynamic saliency models used in our experiments. We then discuss our findings and present some qualitative and quantitative results. Finally, we present an ablation study to evaluate the effectiveness of the blocks of the proposed dynamic saliency model.

Fig. 6: Gated fusion block estimates the final saliency map by combining the appearance and the temporal maps S_A and S_T with the spatially varying weights G_A and G_T. (Columns: Appearance, Motion, S_A, G_A, S_T, G_T, Prediction, GT.)

A. Datasets
In our experiments, we employ six different datasets to evaluate the effectiveness of the proposed saliency model. The first four, namely UCF-Sports [64], Hollywood-2 [65], DHF1K [26], and DIEM [66], are the most commonly used benchmarks. Among them, we specifically utilize DIEM to test the generalization ability of our model. The last two datasets considered in our analysis, DIEM-Meta [67] and LEDOV-Meta [67], are two recently proposed datasets which are particularly designed to explore the performance of a dynamic saliency model under situations where understanding temporal effects is critical to give results more compatible with humans.
UCF-Sports dataset [64] is the smallest dataset in terms of its size, consisting of 150 videos obtained from 13 different action classes. It was originally collected for action recognition, but then enriched by [65] to include eye fixation data. The videos are annotated by 4 subjects under a free-viewing condition. In the experiments, we used the same train/test splits given in [68].
Hollywood-2 dataset [65] contains 1,707 videos from the Hollywood-2 action recognition dataset [69], among which 823 are used for training and the remaining 884 are left for testing. Since the videos are collected from 69 Hollywood movies with 12 action categories, its content is limited to human actions. In [65], the authors collected human fixation data for each sequence from 3 subjects under a free-viewing condition. In our experiments, we use all train and test frames.
DHF1K [26] is the most recent and the largest video saliency dataset, which contains a total of 1,000 videos with eye tracking data collected from 17 different human subjects. The authors split the dataset into 600 training, 100 validation and 300 test videos. The ground truth fixation data for the test split is intentionally kept hidden and the evaluation of a model on the test data is carried out by the authors themselves.
DIEM [66] includes 84 natural videos. Each video sequence has eye fixation data collected from approximately 50 different human subjects. Following the common experimental setup first considered in [18], we used all frames from 64 videos for training and the first 300 frames from the remaining 20 videos as the test set.
DIEM-Meta [67] and LEDOV-Meta [67] are two recently proposed datasets collected from the existing video saliency datasets DIEM [66] and LEDOV [29], respectively. The main difference between these and the aforementioned datasets lies in the characteristics of the video frames they consider. [67] constructed these so-called meta-datasets by eliminating the video frames from their original counterparts where spatial patterns are generally enough to predict where people look. To detect them, they employ a deep static saliency model that they developed. DIEM-Meta and LEDOV-Meta are thus better testbeds for evaluating whether or not a dynamic saliency model learns to use the temporal domain effectively. DIEM-Meta contains only 35% of the video frames from DIEM, and LEDOV-Meta includes just 20% of the original LEDOV frames.
B. Training Procedure
As we mentioned previously, our network takes RGB video frames and optical flow images as inputs. We extract the frames from the videos by considering their original frame rate. We employ these RGB frames to feed our appearance stream. For the temporal stream, we generate the optical flow images between two consecutive frames by using PWC-Net [70]. We resize all the input images to a fixed resolution and map the ground truth fixation points accordingly.

Instead of training our dynamic saliency network from scratch, we first train the subnet for the appearance stream on the SALICON dataset [71]. Then, we initialize the weights of both of our subnets for the spatial and temporal streams with this pre-trained static saliency model and finetune our whole two-stream network model using the dynamic saliency datasets described above. Pre-training on static data allows our dynamic saliency model to converge in fewer epochs when trained on dynamic stimuli. We use Kullback-Leibler (KL) divergence and Normalized Scanpath Saliency (NSS) loss functions (which we explain in detail later) with the Adam optimizer during the training process. We set the initial learning rate to 10e-5 and reduce it to one tenth every 3000 iterations. The batch size is set to 8 for UCF-Sports and 16 for the other video datasets. We train our model on NVIDIA V100 GPUs (3 GPUs), and while one epoch takes approximately 2 days for the larger datasets of DHF1K, DIEM and Hollywood-2, it takes approximately 2 hours for UCF-Sports. We train our models for 2-3 epochs. Our (unoptimized) PyTorch implementation achieves a near real-time performance of 8.2 fps for frames at this resolution on an NVIDIA Tesla K40c GPU.

For our experiments on standard benchmark datasets, we consider two different training settings for dynamic stimuli. In our first setting, we use the training split of the dataset under consideration to train our proposed model. On the other hand, in our second setting, we utilize a combined training set containing training sequences from the UCF-Sports, Hollywood-2 and DHF1K datasets. The second setting further allows us to test the generalization ability of our model on the DIEM, DIEM-Meta and LEDOV-Meta datasets.

Loss functions. In our work, we employ the combination of KL-divergence and NSS loss functions to train our proposed dynamic saliency model. As explored in previous studies [72], [26], considering more than one loss function during training, in general, improves the model performance. Moreover, empirical experiments on the analysis of the existing automatic evaluation metrics in [73] have shown that KL-divergence and NSS are good choices for evaluating saliency models.

Let P denote the predicted saliency map, F represent the ground truth (binary) fixation map collected from human subjects and S be the ground truth (continuous) fixation density map, which is generated by blurring the fixation map with a small Gaussian kernel.

KL-divergence is a widely used metric to compare two probability distributions. It has been proven effective for evaluating and training saliency models, where the ground truth fixation density map S and the predicted saliency map P are interpreted as probability distributions. Formally, the KL-divergence loss function is defined as:

$$\mathcal{L}_{KL}(P, S) = \sum_i S(i) \log\left(\frac{S(i)}{P(i)}\right). \qquad (4)$$

NSS is a location based metric which is computed as the average of the normalized predicted saliency values at the fixated locations provided with the ground truth.
By using this metric as a loss function, we force the saliency model to better detect the fixation locations and assign high likelihood scores to those pixel locations. This loss function is defined as below:

$$\mathcal{L}_{NSS}(P, F) = -\frac{1}{N}\sum_i \bar{P}(i) \times F(i), \qquad (5)$$

where $N = \sum_i F(i)$ is the total number of fixated pixels and $\bar{P} = \frac{P - \mu(P)}{\sigma(P)}$ is the normalized saliency map.

Our final loss function is then defined as:

$$\mathcal{L}(P, F, S) = \alpha \mathcal{L}_{KL}(P, S) + \beta \mathcal{L}_{NSS}(P, F), \qquad (6)$$

where $\mathcal{L}_{KL}$ is the KL loss function, $\mathcal{L}_{NSS}$ is the NSS loss function, and $\alpha$ and $\beta$ are the weights of these loss functions. We first perform a set of experiments on the SALICON dataset to empirically determine the optimal values of $\alpha$ and $\beta$, and then use $\alpha = 1$ together with the empirically determined $\beta$ for all the experiments.
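A PyTorch sketch of the combined objective in Eqs. (4)-(6) is shown below. The small epsilon constants are our own additions for numerical stability, and the default value of beta is illustrative rather than the value used in the paper.

```python
import torch

def kl_loss(pred, density, eps=1e-8):
    """Eq. (4): KL divergence between maps normalized to sum to one."""
    b = pred.shape[0]
    p = pred.reshape(b, -1)
    s = density.reshape(b, -1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)
    s = s / (s.sum(dim=1, keepdim=True) + eps)
    return (s * torch.log(s / (p + eps) + eps)).sum(dim=1).mean()

def nss_loss(pred, fixations, eps=1e-8):
    """Eq. (5): negative mean of the normalized saliency at fixated pixels."""
    b = pred.shape[0]
    p = pred.reshape(b, -1)
    f = fixations.reshape(b, -1)
    p_norm = (p - p.mean(dim=1, keepdim=True)) / (p.std(dim=1, keepdim=True) + eps)
    n_fix = f.sum(dim=1) + eps
    return -((p_norm * f).sum(dim=1) / n_fix).mean()

def total_loss(pred, fixations, density, alpha=1.0, beta=0.1):
    """Eq. (6); beta=0.1 is an illustrative default, not taken from the paper."""
    return alpha * kl_loss(pred, density) + beta * nss_loss(pred, fixations)
```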
C. Evaluation Metrics and Compared Saliency Models

In our evaluation, we employ the following five commonly reported saliency metrics: Area Under Curve (AUC-Judd), Pearson's Correlation Coefficient (CC), Normalized Scanpath Saliency (NSS), Similarity Metric (SIM) and KL-divergence (KLDiv). For a detailed analysis of these metrics and their definitions, please refer to [73]. Each metric measures a different aspect of visual saliency and none of them is superior to the others. The AUC metric treats the saliency map as a classification map. A ROC curve is constructed by measuring the true and false positive rates under different binary classifier thresholds. While a score of 1 indicates a perfect match, a score close to 0.5 indicates chance performance. NSS is another commonly used metric, which we formally defined before while describing our loss functions. CC is a distribution based metric which is used to measure the linear relationship between the saliency and fixation density maps using the following formula:

$$CC(P, S) = \frac{\sigma(P, S)}{\sigma(P) \times \sigma(S)}, \qquad (7)$$

where $\sigma(P, S)$ corresponds to the covariance of $P$ and $S$. A CC value close to +1/-1 demonstrates a perfect linear relationship. SIM is another popular metric that measures the similarity between the predicted and human saliency maps, as defined below:

$$SIM(P, S) = \sum_i \min(P_i, S_i), \;\; \text{where} \;\; \sum_i P_i = 1 \;\; \text{and} \;\; \sum_i S_i = 1. \qquad (8)$$

The KLDiv metric evaluates the dissimilarity between two distributions. Since KLDiv represents the difference between the saliency map and the density map, a small value indicates a good result. However, we note that, according to the aforementioned study, NSS and CC seem to provide fairer results. In our experiments, we report the scores obtained with the implementations provided by the MIT benchmark website (https://github.com/cvzoya/saliency/tree/master/code_forMetrics).

We compare our method with ten different models: SalGAN [74], PQFT [11], Fang et al. [44], AWS-D [24], Bak et al. [28], OM-CNN [29], ACLNet [26], SalEMA [47], STRA-Net [49], and TASED-Net [48]. Among these, SalGAN [74] is the only static saliency model, which gives state-of-the-art results on the image datasets. We evaluate this method on video datasets by considering each frame as a static image. PQFT [11], Fang et al. [44], and AWS-D [24] are non-deep learning models, whereas all the other models employ deep learning techniques to predict where people look in videos. We note that in [28], the authors tested different fusion strategies with static weighting schemes, and here we only report the results obtained with the convolutional fusion strategy, which was shown to perform better than the others. In our experiments, we use the implementations and the trained models provided by the authors and test our approach against them with the settings explained in Sec. IV-A for a fair comparison. In particular, after a careful analysis, we noticed that some methods do not report results on the whole test set of Hollywood-2 and/or they mistakenly consider task-specific gaze data collected for UCF-Sports while generating the ground truth fixation density maps. Hence, some of the results are different than those reported in the respective papers, but they give a better picture of their performances. Moreover, in our experiments, we also provide the results of the single-stream versions of our model that respectively consider either spatial or temporal information.
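For reference, a small Python sketch of the CC and SIM metrics defined in Eqs. (7)-(8) is given below; the evaluation in the paper relies on the MIT benchmark implementations, so this is only an illustrative re-statement.

```python
import numpy as np

def cc(pred, density):
    """Eq. (7): Pearson's correlation between prediction and fixation density map."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    s = (density - density.mean()) / (density.std() + 1e-8)
    return float((p * s).mean())

def sim(pred, density):
    """Eq. (8): histogram intersection of the two maps, each normalized to sum to 1."""
    p = pred / (pred.sum() + 1e-8)
    s = density / (density.sum() + 1e-8)
    return float(np.minimum(p, s).sum())
```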
D. Qualitative and Quantitative Results

Performance on UCF-Sports.
Table I reports the comparative results on the UCF-Sports test set, which contains 43 sequences. As can be seen, the single-stream versions of our proposed model give worse scores than our full model. Moreover, the spatial stream generally predicts saliency much better than the temporal stream, which is a trend that we observe on the other standard benchmark datasets too. Our model trained only on UCF-Sports outperforms all the competing models in most of the metrics. It results in a performance very close to those of SalEMA and STRA-Net in terms of SIM. We believe that weighting the predictions by the spatial and temporal streams using a gating mechanism allows the model to better handle the variations throughout a video sequence, thus resulting in more accurate saliency maps on this action-specific, relatively small dataset.

TABLE I: Performance comparison on UCF-Sports dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method                   AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
SalGAN (static)          0.869   0.389  2.074  0.258  2.169
PQFT*                    0.776   0.211  1.189  0.157  2.458
Fang et al.*             0.879   0.387  2.319  0.247  2.012
AWS-D*                   0.845   0.313  1.870  0.195  2.202
Bak et al.               0.864   0.387  2.231  0.130  2.575
OM-CNN                   0.880   0.398  2.443  0.294  1.902
ACLNet                   0.876   0.367  2.045  0.292  2.135
SalEMA                   0.895   0.470  2.979  —      —
Ours (Gated), Setting 2  0.911   0.499  2.980  0.353  1.568
* Non-deep learning model

TABLE II: Performance comparison on Hollywood-2 dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method                   AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
SalGAN (static)          0.892   0.428  2.383  0.298  1.760
PQFT*                    0.689   0.150  0.610  0.139  2.387
Fang et al.*             0.862   0.312  1.614  0.221  1.781
AWS-D*                   0.747   0.227  0.994  0.193  2.256
Bak et al.               0.840   0.310  1.439  0.158  2.339
OM-CNN                   0.893   0.430  2.625  0.330  1.896
ACLNet                   0.899   0.459  2.463  0.342  1.701
SalEMA                   0.873   0.383  2.226  0.330  3.157
STRA-Net                 0.913   0.558  3.226  0.459  2.251
TASED-Net                0.916   —      —      —      —
* Non-deep learning model
Performance on Hollywood-2.
In our experiments on the Hollywood-2 dataset, we use all the frames from the test set, which contains 884 video sequences. In that regard, it is the largest test set that we considered in our experimental evaluation. In Table II, we provide a comparison against the competing saliency models. Our results show that our model gives better saliency predictions than all the other methods in terms of the AUC-J and KLDiv metrics. The model trained with our second training setting, which includes a larger and more diverse training set, provides much better results than the one trained with the first setting. In terms of the remaining evaluation metrics, our results are highly competitive as compared to the recent state-of-the-art models, namely STRA-Net and TASED-Net, as well.
TABLE III: Performance comparison on DHF1K dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method           AUC-J↑  CC↑    NSS↑   SIM↑
SalGAN (static)  0.866   0.370  2.043  0.262
PQFT*            0.699   0.137  0.749  0.139
Fang et al.*     0.819   0.273  1.539  0.198
AWS-D*           0.703   0.174  0.940  0.157
Bak et al.       0.834   0.325  1.632  0.197
OM-CNN           0.856   0.344  1.911  0.256
ACLNet           0.890   0.434  2.354  0.315
SalEMA           0.890   0.449  2.574  —
STRA-Net         —       —      —      —
* Non-deep learning model
TABLE IV: Performance comparison on DIEM dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method           AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
SalGAN (static)  0.860   0.492  2.068  0.392  1.431
PQFT*            0.680   0.190  0.656  0.220  2.140
Fang et al.*     0.825   0.360  1.407  0.313  1.688
AWS-D*           0.768   0.313  1.228  0.272  1.825
Bak et al.       0.810   0.313  1.212  0.206  2.050
OM-CNN           0.847   0.464  2.037  0.381  1.599
ACLNet           —       —      —      —      —
* Non-deep learning model
Performance on DHF1K.
We test the performance of our model on the recently proposed DHF1K video saliency dataset, which includes 300 test videos. As mentioned before, the annotations for the test split are not publicly available and all the evaluations are carried out externally by the authors of the dataset. As Table III shows, our proposed model achieves performance on par with the state-of-the-art models. In terms of AUC-J, along with the recent STRA-Net and TASED-Net models, it outperforms all the other saliency models. In terms of CC, our model gives roughly the second best result.
Performance on DIEM.
We also evaluate our model on the DIEM test set consisting of 20 videos. Table IV summarizes these quantitative results. As can be seen, our model achieves the highest scores in the NSS and KLDiv metrics and is very competitive in the others. The second setting demonstrates the generalization capability of our proposed approach as compared to the recent models like SalEMA, STRA-Net and TASED-Net.

In Fig. 7, we show some sample saliency maps predicted by our proposed model and four other deep saliency networks: ACLNet, SalEMA, STRA-Net, and TASED-Net. As one can observe, our model generally makes better predictions than the competing approaches. For instance, for the sequence from UCF-Sports (Fig. 7(a)) most of the models fail to identify the salient region on the swimmer, and for the sequence from the Hollywood-2 dataset (Fig. 7(b)) our model is the only one that correctly predicts the soldier at the center of the background as salient. Similar kinds of observations are also valid for the sample sequences from the DHF1K (Fig. 7(c)) and DIEM (Fig. 7(d)) datasets.

TABLE V: Performance comparison on DIEM-Meta dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method      AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
ACLNet      0.845   0.437  1.627  0.391  1.473
SalEMA      0.832   0.392  1.576  0.374  1.664
STRA-Net    0.840   0.419  1.637  0.385  1.634
TASED-Net   0.857   0.455  1.810  —      —

TABLE VI: Performance comparison on LEDOV-Meta dataset. The best and the second best performing models are shown in bold typeface and underlined, respectively.

Method      AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
ACLNet      0.879   0.384  1.750  0.342  1.837
SalEMA      0.863   0.380  1.815  0.353  1.850
STRA-Net    —       —      —      —      —
Performance on DIEM-Meta and LEDOV-Meta.
As mentioned before, [67] have recently shown that most of the current benchmarks for video saliency include many sequences in which spatial attention is more dominant than temporal effects in describing saliency. The DIEM-Meta and LEDOV-Meta datasets are curated in a special way to contain video frames in which temporal signals are found to be more influential than appearance cues. Hence, they both offer a better way to test how well a dynamic saliency model utilizes temporal information. In our experimental evaluation, we compare our proposed model with the state-of-the-art deep saliency models, which are all trained on the combined training set that includes frames from the DIEM or LEDOV datasets. As can be seen from Table V and Table VI, our model outperforms all the other models on DIEM-Meta, and is the second best model on LEDOV-Meta, achieving highly competitive performances. These results demonstrate the effectiveness of the proposed gated mechanism and its ability to use temporal information to the full extent, as compared to the state-of-the-art approaches.

Overall, the results reported on all the six datasets considered in our experimental analysis suggest that our model has a better capacity to mimic the human attention mechanism by combining the temporal and static cues in an effective way. It has a better generalization ability in that it can predict where people look in videos from unseen domains much better. Moreover, it utilizes the temporal information more successfully with its gated fusion mechanism, which adaptively integrates spatial and temporal cues depending on the video content.
E. Ablation Study
In this section, we aim to analyze the influence of each component of our proposed deep dynamic saliency model. We perform the ablation study on the UCF-Sports dataset by disabling or removing some blocks of our model and by examining how these changes affect the model performance. As we did for training our proposed model, for each version of our model under evaluation, we first train a single-stream model on the SALICON dataset and then use this model to finetune the actual two-stream version. Accordingly, Table VII reports the performance of different versions of our saliency model.

Fig. 7: Qualitative results of our proposed framework and the deep learning based ACLNet, SalEMA, STRA-Net and TASED-Net models on (a) UCF-Sports, (b) Hollywood-2, (c) DHF1K and (d) DIEM. Our approach, in general, produces more accurate saliency predictions than these state-of-the-art models. (Rows: GT, Ours, SalEMA, ACLNet, TASED-Net, STRA-Net.)

Fig. 8: Our model dynamically decides the contribution of the motion and appearance streams via gated fusion. Here, we plot the average motion probabilities (the contribution of the motion stream) for two regions having different characteristics, one containing a moving object (the gummy bear) and the other with relatively no motion, shown in red and blue, respectively. As can be seen, our model assigns higher weights to the motion stream when motion becomes the dominant visual cue, and the weights adaptively change throughout the sequence.
Effect of gated fusion.
As we emphasized before, the gated fusion block, whose role is to adaptively integrate the spatial and temporal streams, is a key component of our model. In our analysis, we replace the gated fusion block with a standard convolution layer. As can be seen from Table VII, the performance of the model decreases considerably without the gated fusion mechanism. That is, using a dynamic weighting strategy, instead of a fixed weighting scheme (learned via convolution), generates much better predictions. Fig. 8 shows a visualization of how our proposed gated fusion operates, demonstrating the behavior of the weighting scheme for both dynamic and static parts of a given scene. In particular, we plot the motion probabilities averaged within the corresponding image regions over time, which clearly shows that the motion probability (the contribution of the motion stream) for the region that contains a moving object is, in general, much higher than that of the static region. Moreover, depending on the characteristics of the regions, it shows the changes in the motion probabilities throughout the whole sequence. For example, when no motion is taking place in the region initially containing the moving object, the weight of the temporal stream starts to fall. These results support our main claim that considering the content of the video while combining temporal and spatial cues is a more appropriate way to model saliency estimation in dynamic scenes.
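For clarity, the static-fusion baseline used in this ablation can be sketched as follows: a single learned convolution, whose weights are fixed after training, replaces the input-dependent gate. The kernel size is our assumption, as it is not given in the text.

```python
import torch
import torch.nn as nn

class StaticFusion(nn.Module):
    """Ablation baseline: a learned but input-independent weighting of the two
    streams, i.e. the same convolution weights are applied to every frame."""

    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # kernel size assumed

    def forward(self, s_a, s_t):
        # Unlike the gated fusion block, no per-frame probability map is computed;
        # the mixing is fixed by the learned convolution weights.
        return self.fuse(torch.cat([s_a, s_t], dim=1))
```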
Effect of multi-level information.
Previous studies demonstrate that low- and high-level cues are equally important for saliency prediction [9], [10]. Motivated by these, we included a multi-level information block to fuse features extracted from different levels of our deep model. For this analysis, we disable this multi-level information block and train a single-scale model instead. Compared to our full model, disabling this block reduces the performance, as can be seen in Table VII. Employing a representation that contains information from low and high levels helps to improve the performance of our model. We speculate that our multi-level information block allows the network to better identify the regions semantically important for saliency.

TABLE VII: Ablation study on UCF-Sports dataset.

Method                            AUC-J↑  CC↑    NSS↑   SIM↑   KLDiv↓
w/o spatial attention             0.872   0.474  2.884  0.374  2.223
w/o channel-wise attention        0.892   0.489  2.923  0.319  1.707
w/o spatial & ch.-wise attention  0.875   0.447  2.885  0.364  2.646
w/o multi-level information       0.890   0.484  2.755  0.303  1.711
w/o gated fusion                  0.900   0.480  2.913  0.353  1.676
full model                        —       —      —      —      —

Effect of attention blocks.
As discussed before, the reasons we introduce the attention blocks are to eliminate the irrelevant features via the spatial attention and to choose the most informative feature channels via the channel-wise attention when processing a video frame. In this experiment, we remove the spatial and the channel-wise attention blocks from our full model and train two different models, respectively. The results given in Table VII support our assertion that both of these attention blocks improve the model performance. Disabling them results in a much lower performance as compared to that of the full model.

V. SUMMARY AND CONCLUSION
In this study, we proposed a new spatio-temporal saliency network for video saliency. It follows a two-stream network architecture that processes spatial and temporal information in separate streams, but it extends the standard structure in many ways. First, it includes a gated fusion block that performs the integration of the spatial and temporal streams in a more dynamic manner by deciding the contribution of each stream one frame at a time. Second, it utilizes a multi-level information block that allows for performing multi-scale processing of appearance and motion features. Finally, it employs spatial and channel-wise attention blocks to further increase the selectivity. Our extensive set of experiments on six different benchmark datasets shows the effectiveness of the proposed model in extracting the most salient parts of the video frames, both qualitatively and quantitatively. Moreover, our ablation study demonstrates the gains achieved by each component of our model. Our analysis reveals that the proposed model deals with videos from unseen domains much better than the existing dynamic saliency models. Additionally, it uses temporal cues more effectively via the proposed gated fusion mechanism, which allows for adaptive integration of the spatial and temporal streams.

We believe that our work highlights several important directions to pursue for better modeling of saliency in videos. As future work, we plan to explore more efficient ways to include temporal information. For instance, instead of using optical flow images, one can use features extracted from early and mid layers of an optical flow network model to encode motion information. This can reduce the memory footprint of the model and decrease the running times.

ACKNOWLEDGMENTS
This work was supported in part by the TUBA GEBIP fellowship awarded to E. Erdem.

REFERENCES
[1] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
[2] A. Borji and L. Itti, "State-of-the-art in visual attention modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2013.
[3] S. Filipe and L. A. Alexandre, "Retracted article: From the human visual system to the computational models of visual attention: a survey," Artificial Intelligence Review, vol. 43, no. 4, pp. 601–601, 2015.
[4] H. Kim and S. Lee, "Transition of visual attention assessment in stereoscopic images with evaluation of subjective visual quality and discomfort," IEEE Transactions on Multimedia, vol. 17, no. 12, pp. 2198–2209, 2015.
[5] K. Gu, S. Wang, H. Yang, W. Lin, G. Zhai, X. Yang, and W. Zhang, "Saliency-guided quality assessment of screen content images," IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 1098–1110, 2016.
[6] Y. Fang, Z. Chen, W. Lin, and C. Lin, "Saliency detection in the compressed domain for adaptive image retargeting," IEEE Transactions on Image Processing, vol. 21, no. 9, pp. 3888–3901, 2012.
[7] D. Chen and Y. Luo, "Preserving motion-tolerant contextual visual saliency for video resizing," IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1616–1627, 2013.
[8] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, and Y. Avrithis, "Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention," IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1553–1568, 2013.
[9] N. D. B. Bruce, C. Catton, and S. Janjic, "A deeper look at saliency: Feature contrast, semantics, and beyond," in Proc. CVPR, 2016, pp. 516–524.
[10] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand, "Where should saliency models look next?" in Proc. ECCV, 2016, pp. 809–824.
[11] C. Guo, Q. Ma, and L. Zhang, "Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform," in Proc. CVPR, 2008, pp. 1–8.
[12] X. Cui, Q. Liu, and D. Metaxas, "Temporal spectral residual: fast motion saliency detection," in Proc. ACM MM, 2009, pp. 617–620.
[13] H. J. Seo and P. Milanfar, "Static and space-time visual saliency detection by self-resemblance," Journal of Vision, vol. 9, no. 12, pp. 15–15, 2009.
[14] W. Sultani and I. Saleemi, "Human action recognition across datasets by foreground-weighted histogram decomposition," in Proc. CVPR, 2014, pp. 764–771.
[15] T. Mauthner, H. Possegger, G. Waltner, and H. Bischof, "Encoding based saliency detection for videos and images," in Proc. CVPR, 2015, pp. 2494–2502.
[16] Y. Luo and Q. Tian, "Spatio-temporal enhanced sparse feature selection for video saliency estimation," June 2012, pp. 33–38.
[17] S. Mathe and C. Sminchisescu, "Dynamic eye movement datasets and learnt saliency models for visual action recognition," in Proc. ECCV, 2012, pp. 842–856.
[18] D. Rudoy, D. B. Goldman, E. Shechtman, and L. Zelnik-Manor, "Learning video saliency from human gaze using candidate selection," in Proc. CVPR, 2013, pp. 1147–1154.
[19] S. Zhong, Y. Liu, F. Ren, J. Zhang, and T. Ren, "Video saliency detection via dynamic consistent spatio-temporal attention modelling," in Proc. AAAI, 2013.
[20] Z. Liu, X. Zhang, S. Luo, and O. Le Meur, "Superpixel-based spatiotemporal saliency detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 9, pp. 1522–1540, 2014.
[21] F. Zhou, S. B. Kang, and M. F. Cohen, "Time-mapping using space-time saliency," in Proc. CVPR, 2014, pp. 3358–3365.
[22] J. Zhao, C. Siagian, and L. Itti, "Fixation bank: Learning to reweight fixation candidates," in Proc. CVPR, 2015, pp. 3174–3182.
[23] S. H. Khatoonabadi, N. Vasconcelos, I. V. Bajić, and Y. Shan, "How many bits does it take for a stimulus to be salient?" in Proc. CVPR, 2015, pp. 5501–5510.
[24] V. Leborán, A. García-Díaz, X. R. Fdez-Vidal, and X. M. Pardo, "Dynamic whitening saliency," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 5, pp. 893–907, 2017.
[25] L. Bazzani, H. Larochelle, and L. Torresani, "Recurrent mixture density network for spatiotemporal visual attention," in Proc. ICLR, 2017.
[26] W. Wang, J. Shen, J. Xie, M. Cheng, H. Ling, and A. Borji, "Revisiting video saliency prediction in the deep learning era," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[27] A. Palazzi, D. Abati, S. Calderara, F. Solera, and R. Cucchiara, "Predicting the driver's focus of attention: the dr(eye)ve project," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[28] C. Bak, A. Kocak, E. Erdem, and A. Erdem, "Spatio-temporal saliency networks for dynamic saliency prediction," IEEE Transactions on Multimedia, vol. 20, no. 7, pp. 1688–1698, 2018.
[29] L. Jiang, M. Xu, T. Liu, M. Qiao, and Z. Wang, "Deepvs: A deep learning based video saliency prediction approach," in Proc. ECCV, 2018, pp. 625–642.
[30] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proc. CVPR, 2014, pp. 1725–1732.
[31] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), 2014, pp. 568–576.
[32] M. A. Goodale and A. D. Milner, "Separate visual pathways for perception and action," Trends in Neurosciences, vol. 15, no. 1, pp. 20–25, 1992.
[33] A. M. Treisman and G. Gelade, "A feature-integration theory of attention," Cognitive Psychology, vol. 12, no. 1, pp. 97–136, 1980.
[34] C. Koch and S. Ullman, "Shifts in selective visual attention: Towards the underlying neural circuitry," Human Neurobiology, vol. 4, pp. 219–227, 1985.
[35] X. Huang, C. Shen, X. Boix, and Q. Zhao, "SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks," in Proc. ICCV, 2015, pp. 262–270.
[36] S. Jetley, N. Murray, and E. Vig, "End-to-end saliency mapping via probability distribution prediction," in Proc. CVPR, 2016, pp. 5753–5761.
[37] S. S. S. Kruthiventi, K. Ayush, and R. V. Babu, "Deepfix: A fully convolutional neural network for predicting human eye fixations," IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4446–4456, 2017.
[38] N. Liu, J. Han, T. Liu, and X. Li, "Learning to predict eye fixations via multiresolution convolutional neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 2, pp. 392–404, 2018.
[39] J. Pan, E. Sayrol, X. Giró-i Nieto, K. McGuinness, and N. E. O'Connor, "Shallow and deep convolutional networks for saliency prediction," in Proc. CVPR, 2016, pp. 598–606.
[40] W. Wang and J. Shen, "Deep visual attention prediction," IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2368–2378, 2018.
[41] E. Vig, M. Dorr, and D. Cox, "Large-scale optimization of hierarchical features for saliency prediction in natural images," in Proc. CVPR, 2014, pp. 2798–2805.
[42] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, "Predicting human eye fixations via an lstm-based saliency attentive model," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5142–5154, 2018.
[43] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS), 2006, pp. 545–552.
[44] Y. Fang, Z. Wang, W. Lin, and Z. Fang, "Video saliency incorporating spatiotemporal cues and uncertainty weighting," IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 3910–3921, 2014.
[45] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proc. ICCV, 2015, pp. 4489–4497.
[46] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in Proc. ICLR, 2016.
[47] P. Linardos, E. Mohedano, J. J. Nieto, K. McGuinness, X. Giro-i Nieto, and N. E. O'Connor, "Simple vs complex temporal recurrences for video saliency prediction," in Proc. BMVC, 2019.
[48] K. Min and J. J. Corso, "Tased-net: Temporally-aggregating spatial encoder-decoder network for video saliency detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2394–2403.
[49] Q. Lai, W. Wang, H. Sun, and J. Shen, "Video saliency prediction using spatiotemporal residual attentive networks,"
IEEE Trans. on ImageProcessing , 2019.[50] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in
Proc. CVPR , 2016, pp. 770–778.[51] T. Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie,“Feature pyramid networks for object detection,” in
Proc. CVPR , 2017,pp. 936–944.[52] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networksfor biomedical image segmentation,” in
Proc. MICCAI , 2015, pp. 234–241.[53] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networksfor semantic segmentation,” in
Proc. CVPR , 2015, pp. 3431–3440.[54] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Doll´ar, “Learning to refineobject segments,” in
Proc. ECCV , 2016, pp. 75–91.[55] T. Zhao and X. Wu, “Pyramid feature attention network for saliencydetection,” in
Proc. CVPR , 2019, pp. 3080–3089.[56] S. Dong, Z. Gao, S. Sun, X. Wang, M. M. Li, H. Zhang, G. Yang, H. Liu,and S. Li, “Holistic and deep feature pyramids for saliency detection,”in
Proc. BMVC , 2018.[57] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent modelsof visual attention,” in
Proc. NIPS , 2014.[58] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov,R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image captiongeneration with visual attention,” in
Proc. ICML , 2015.[59] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for human pose estimation,” in
Proc. CVPR , 2017, pp.5669–5678.[60] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S.Chua, “SCA-CNN: Spatial and channel-wise attention in convolutionalnetworks for image captioning,” in
Proc. CVPR , 2017.[61] W. Ren, L. Ma, J. Zhang, J. Pan, X. Cao, W. Liu, and M.-H. Yang,“Gated fusion network for single image dehazing,” in
Proc. CVPR , 2018.[62] X. Zhang, H. Dong, Z. Hu, W.-S. Lai, F. Wang, and M.-H. Yang, “Gatedfusion network for joint image deblurring and super-resolution,” in
Proc.BMVC , 2018.[63] Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang, “Locality-sensitivedeconvolution networks with gated fusion for RGB-D indoor semanticsegmentation,” in
Proc. CVPR , 2017, pp. 1475–1483.[64] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action MACH a spatio-temporal maximum average correlation height filter for action recogni-tion,” in
Proc. CVPR , 2008.[65] S. Mathe and C. Sminchisescu, “Actions in the eye: Dynamic gazedatasets and learnt saliency models for visual recognition,”
IEEE Trans-actions on Pattern Analysis and Machine Intelligence , vol. 37, no. 7,pp. 1408–1424, 2015.[66] P. K. Mital, T. J. Smith, R. L. Hill, and J. M. Henderson, “Clustering ofgaze during dynamic scene viewing is predicted by motion,”
CognitiveComputation , vol. 3, no. 1, pp. 5–24, 2011.[67] M. Tangemann, M. K¨ummerer, T. S. Wallis, and M. Bethge, “Measuringthe importance of temporal features in video saliency,” 2020. [68] T. Lan, Y. Wang, and G. Mori, “Discriminative figure-centric modelsfor joint action localization and recognition,” in
Proc. ICCV , 2011, pp.2003–2010.[69] M. Marszalek, I. Laptev, and C. Schmid, “Actions in context,” in
Proc.CVPR , 2009, pp. 2929–2936.[70] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: CNNs for opticalflow using pyramid, warping, and cost volume,” in
Proc. CVPR , 2018.[71] M. Jiang, S. Huang, J. Duan, and Q. Zhao, “SALICON: Saliency incontext,” in
Proc. CVPR , 2015, pp. 1072–1080.[72] X. Huang, C. Shen, X. Boix, and Q. Zhao, “SALICON: Reducing thesemantic gap in saliency prediction by adapting deep neural networks,”in
Proc. ICCV , 2015, pp. 262–270.[73] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand, “Whatdo different evaluation metrics tell us about saliency models?”
IEEETransactions on Pattern Analysis and Machine Intelligence , vol. 41,no. 3, pp. 740–757, 2019.[74] J. Pan, C. Canton, K. McGuinness, N. E. O’Connor, J. Torres, E. Sayrol,and X. a. Giro-i Nieto, “SalGAN: Visual saliency prediction withgenerative adversarial networks,” in arXivarXiv