Audiovisual Saliency Prediction in Uncategorized Video Sequences based on Audio-Video Correlation
Maryam Qamar Butt, Anis Ur Rahman

Maryam Q. Butt and Anis Ur Rahman are with the School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan. Corresponding author: Anis U. Rahman, e-mail: [email protected]
Abstract—Substantial research has been done in saliency modeling to develop intelligent machines that can perceive and interpret their surroundings. But existing models treat videos as merely image sequences excluding any audio information, unable to cope with inherently varying content. Based on the hypothesis that an audiovisual saliency model will be an improvement over traditional saliency models for natural uncategorized videos, this work aims to provide a generic audio/video saliency model augmenting a visual saliency map with an audio saliency map computed by synchronizing low-level audio and visual features. The proposed model was evaluated using different criteria against eye fixations data for a publicly available DIEM video dataset. The results show that the model outperformed two state-of-the-art visual saliency models.
Index Terms—audiovisual, saliency, video sequences
1 INTRODUCTION
Though a lot of research has been done in the general field of unimodal saliency models for both images and videos, no substantial contributions exist for bimodal models. Of more consequence is the lack of a model for computation of audiovisual saliency in complex video sequences. Existing literature for audio-video saliency modeling is scarce and often targets a specific class of videos [10], [27], [28]. Therefore, an extended saliency model to predict salient regions in complex videos with different sound classes is required.

Many existing saliency algorithms are designed for images [6], [16], [24] using visual cues such as color, intensity, orientation, etc., while other models [7], [14], [22] take social cues like faces into account, resulting in more accurate eye movement predictions. Spatiotemporal saliency models [11], [15], [21] usually incorporate temporal cues like motion but ignore the effect of audio stimuli, an integral component of video content, on human gaze. Subsequently, such models are classified as unimodal models [4] where only visual stimuli are used.

Interestingly, the effect of audio stimuli is relevant to human eye movements. In [25] the authors find eye movements to be spatially biased towards the source of audio using an eye tracking experiment on images with spatially localized sound sources in three conditions: auditory (A), visual (V) and audio-visual (AV). Moreover, another study [29] analyzed the effects of different types of sounds on human gaze involving an experiment with thirteen sound classes under audio-visual and visual conditions. The sound classes are further clustered into on-screen with one sound source, on-screen with more than one sound source, and off-screen sound source. The results show that human speech, singer(s) and human noise (on-screen sound source clusters) highly affect gaze and, more importantly, linked audio-visual stimuli have a greater effect than unsynchronized audio-visual events.

The focus of this work is to propose a generic audio-visual saliency model for complex video sequences. The work differs from previous research [10], [27], [28] in that it does not restrict input videos to a certain category. To accomplish this, an audio source localization method was used to relate an audio signal with an object in the video frames in a rank correlation space. The proposed model was evaluated against eye fixation ground truth from the DIEM dataset.

The original contribution of this study is as follows:
1) Propose an audio-visual saliency model for complex scenes that, unlike existing literature, does not restrain videos to any specific category.
2) Present and analyze the results of experimental evaluation on a publicly available dataset to examine how our proposed saliency model compares to two state-of-the-art visual saliency models.

The remainder of the paper is organized as follows: Section 1 narrates background knowledge of saliency modeling and identifies the novel contribution of this work. Section 2 provides a detailed review of state-of-the-art literature while Section 3 describes the proposed solution. Section 4 summarizes the implementation details as well as outlines the properties of the video sequences used for experimentation. This section also explains the different saliency evaluation metrics. Section 5 presents our results followed by a discussion in Section 6. Section 7 summarizes our findings and concludes with future perspectives.
2 RELATED WORK
Unimodal saliency models use only one type of sensory stimulus as input, some using visual cues including color, intensity and orientation features [1], [3], [24]. Other biologically-inspired models [20], [21] exploit spatial contrast and motion, and simulate interactions between neurons using excitation and inhibition mechanisms, while others [18], [19] propagate spatial/temporal saliency using multiscale color and motion histograms as features. In [19] pixel-level spatiotemporal saliency is computed from spatial and temporal saliencies via interaction and selection driven from superpixel-level saliency. In [18] temporal saliency is propagated forward and backward via inter-frame similarity matrices and graph-based motion saliency, whereas spatial saliency is propagated over a frame using temporal saliency and intra-frame similarity matrices. In most of these models conspicuity maps are constructed using a variety of approaches with different visual features that are later integrated together to get a final saliency map.

Based on the fact that eyes are the most important sensory organs that provide much of the information around humans, many state-of-the-art visual models [18], [19] aim at saliency computation for complex dynamic scenes. But such unimodal models tend to overlook other influential social cues like faces in social interaction scenes, and hence exhibit lower predictability [2], [30]. Moreover, social scenes involve many more sensory signals influencing eye movements spatially, such as auditory information including voice tone, music, etc., and different kinds of sounds affect eye fixations differently [25], [29]. Thus, there is a need for a bimodal saliency model incorporating both visual and audio information channels.

Rapantzikos et al. [26] proposed an audio-visual saliency model for movie summarization. The visual saliency map is constructed using traditional features such as intensity, color and motion, and simulating feature competition as energy minimization via gradient descent. This map is thresholded and averaged per frame to compute a 1D visual saliency curve. Maximum average Teager energy, mean instant amplitude and mean instant frequency are extracted as audio features by applying the Teager-Kaiser energy operator and energy separation algorithm on the audio signal. The resulting feature vector is normalized to the range [0, 1], followed by weighted fusion to get an audio saliency curve. The final audio-visual saliency curve is a weighted linear combination of the audio and visual saliency curves. The local maxima of the audio-visual saliency curve are used for key-frame selection. The experiments are conducted on the movie database of A.U.T.H. but no comparison and evaluation is given.

Coutrot and Guyader [9] proposed an audiovisual saliency model for natural conversation scenes; a linear combination of low-level saliency, a face map, and center bias. The low-level saliency map is constructed via Marat's spatiotemporal saliency model [21]. For face map construction a speaker diarization algorithm is proposed that uses motion activity of faces and 26 Mel-frequency cepstral coefficients (MFCCs) as visual and audio features respectively. Center bias is a time-independent 2D Gaussian function centered on the screen. The three maps are linearly combined into the final audiovisual saliency map using expectation maximization to determine the weight for each. The resulting model performs better compared to the same model without speaking and mute face differentiation.
However, the target video dataset belongs to a limited category: conversation scenes only.

Sidaty et al. [28] proposed an audiovisual saliency model for teleconferencing and conversational videos. The three best performing models on the target database, i.e., Itti et al. [13], Harel et al. [12] and Tavakoli et al. [31], are selected as spatial models. Acoustic energy is computed per frame and a block matching algorithm is used to construct an audio map using the face stream of the video. Then peak matching is used for audio-visual synchronization. Five fusion schemes are used to get a final map. Experiments performed on the XLIMedia database created by the authors showed that the proposed model performed better compared to the spatial models. Again the limitation of this work is that it only targets conferencing and conversational videos.

All in all, one of the major limitations of the aforementioned visual models is that they treat videos as a mute sequence of images and ignore any influence of audio stimuli. This results in inaccurate predictions where sound guides eye movement. Furthermore, another limitation of the literature is the absence of an audiovisual model for complex dynamic scenes; that is, many of the state-of-the-art models restrict the dataset used to only one specific category, for instance, conversational videos. This limits the models' performance when dealing with videos containing different sound classes.

3 PROPOSED SOLUTION
This section explains the proposed solution for audio-visual saliency computation for videos. The framework consists of five major stages as illustrated in Figure 1. The first stage is the extraction of audio energy descriptors and object motion descriptors per frame using the audio and visual stimuli as separate channels. The next stage computes an audio saliency map using these descriptors. In parallel, another stage computes a visual saliency map and a motion map; the former uses low-level features while the latter is derived from a color-coded optical flow similar to the one used for the audio map. The last stage normalizes and combines all these maps into a unified audiovisual saliency map.

Fig. 1. Architecture of the proposed solution.
3.1 Feature Extraction

In this stage, we extracted visual and acoustic features from a given input video. The stage comprised two phases of feature extraction, one for audio features and the other for visual features.
3.1.1 Audio energy descriptor

This step outputs an audio energy descriptor a(t), extracted from the audio signal, that captures the changing patterns of the audio signal strength. Note that the descriptor was obtained with the same temporal resolution as the video frames. Hence, the signal was first segmented into frames according to the frame rate of the video so that each audio frame corresponds to a video frame. Using the short-term Fourier transform (STFT), this framed signal was transformed into the time-frequency domain to get a spectrogram of the signal at each frame. The descriptor a(t) was computed by integrating the resultant spectrogram at any given frame over all frequencies using

a(t) = \int_0^{\infty} \int_{T} f(t') \, W(t' - t) \, e^{-j 2\pi f t'} \, dt' \, df

where the windowing function W(t) is defined so that neighboring windows overlap. The final descriptor was post-processed using a 1D Gaussian kernel.
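To make the computation concrete, the following minimal sketch (our illustration, not the authors' code; it assumes a mono audio track loaded as a NumPy array with SciPy available, and a 50% window overlap since the exact overlap value is not stated above) sums the STFT spectrogram over all frequencies and smooths the resulting per-frame energy curve with a 1D Gaussian kernel.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import gaussian_filter1d

def audio_energy_descriptor(audio, sr, fps, sigma=2.0):
    """Per-video-frame audio energy a(t) from an STFT spectrogram.

    audio : 1-D mono signal, sr : audio sample rate (Hz), fps : video frame rate.
    The STFT window spans roughly one video frame; the 50% overlap between
    neighboring windows and the smoothing sigma are assumptions.
    """
    win = int(round(sr / fps))                    # samples per video frame
    hop = max(win // 2, 1)                        # 50% overlap (assumption)
    _, _, Z = stft(audio, fs=sr, nperseg=win, noverlap=win - hop)
    energy = (np.abs(Z) ** 2).sum(axis=0)         # integrate spectrogram over frequency
    # resample the energy curve so there is exactly one value per video frame
    n_frames = int(np.floor(len(audio) / sr * fps))
    idx = np.linspace(0, len(energy) - 1, n_frames)
    a_t = np.interp(idx, np.arange(len(energy)), energy)
    return gaussian_filter1d(a_t, sigma)          # 1D Gaussian post-processing
```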
3.1.2 Motion descriptor

Based on the assumption that a moving object is a prime candidate to be an audio signal source, the acceleration per frame of all moving objects in a given input video was computed as the motion descriptor. First, the moving objects were segmented per frame using optical flow estimation and tracked across frames via color histograms of the regions in HSV color space. The process comprised the following steps:
1) Optical flow computation.
The method proposed in [8] was used to compute dense optical flow and the corresponding color-coded optical flow images per video frame. The method uses the apparent motion of each pixel to compute forward and backward optical flows, where the former depicts the motion of pixels of frame t with reference to frame t + 1 and the latter the motion of pixels of frame t with respect to frame t − 1. The resulting flows were averaged to get a mean optical flow per frame, later used to compute the audio saliency map.
2) Frame segmentation.
The color-coded mean optical flow per frame was used as input for the segmentation step. Mean shift, a nonparametric clustering algorithm, was applied to segment the input image in LUV color space. The oversegmented result of this step was refined by a simple region merging technique based on DeltaE, a color difference score, to merge closely similar regions. Regions smaller than 200 pixels were filtered out as noise and insignificant regions.
3) Region tracking.
Once individual frames were segmented, a number of tracks were initialized in the first frame using the segmented regions' location and appearance features. All regions in the following segmented input frames were either assigned to an existing track or initialized as a new track based on their location and appearance similarities. The location similarity d_E was computed as the Euclidean distance between the centroid of a new region C_n and that of an existing track C_e using

d_E = \sqrt{(C_n(x) - C_e(x))^2 + (C_n(y) - C_e(y))^2}

This resulted in a list of candidate tracks similar to the region under consideration for assignment within a specified search radius r. For appearance similarity, the LUV histograms of existing candidate tracks H_e were compared to the new region's histogram H_n using the cosine similarity cos θ as

\cos\theta = \frac{H_n \cdot H_e}{\|H_n\| \, \|H_e\|}

The region C_n was assigned to the track whose cos θ was maximum and greater than a specified threshold. The centroid of the track was updated to the centroid of the newly assigned region and its histogram was replaced with the mean of the existing histogram and the new region's histogram. Otherwise, if cos θ was less than the specified threshold, the region was used to initialize a new track.
4) Calculate acceleration.
In this step the objects' per-frame acceleration, used as the motion descriptor, was computed from the optical flow (a minimal sketch follows this list). Averaging the forward and backward optical flow gives the acceleration at each pixel (x, y, t) per frame as

g(x, y, t) = \frac{F^{+}(x, y, t) + F^{-}(x, y, t)}{2}

where x and y are spatial coordinates, t is the frame number, and F^+ and F^- indicate forward and backward optical flow. The acceleration of region S_i^t, where i is the region index per frame t, was computed as the average acceleration of all pixels belonging to that region:

m_i(t) = \frac{1}{|S_i^t|} \sum_{(x, y) \in S_i^t} \| g(x, y, t) \|

The resulting acceleration vector was filtered using a Gaussian kernel to remove noise. The result was a motion descriptor of the objects in a given input video.
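As referenced in step 4, the sketch below (our illustration, with hypothetical function names) computes the per-region motion descriptor m_i(t) for a single frame from precomputed forward and backward dense flow fields and a region label map produced by the segmentation and tracking steps; any dense optical flow method can stand in for [8] here.

```python
import numpy as np

def region_acceleration(flow_fwd, flow_bwd, labels):
    """Per-region motion descriptor m_i(t) for a single frame.

    flow_fwd, flow_bwd : HxWx2 forward/backward dense optical flow fields.
    labels             : HxW integer region labels from segmentation/tracking;
                         label 0 is treated as background (assumption).
    Returns a dict {region_id: mean acceleration magnitude}.
    """
    g = 0.5 * (flow_fwd + flow_bwd)      # average of forward and backward flow
    g_mag = np.linalg.norm(g, axis=2)    # ||g(x, y, t)|| per pixel
    return {int(rid): float(g_mag[labels == rid].mean())
            for rid in np.unique(labels) if rid != 0}
```

Collecting these per-region values over frames gives the descriptor m_i(t), which is then smoothed with a Gaussian kernel as described above.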
3.2 Audio Saliency Map

For the audio saliency map computation, we used the audio-video correlation method proposed in [17]. The correlation between the aforementioned audio and motion descriptors was used to localize the source of the sound signal in the input video frames to indicate audio saliency. The Winner-Take-All (WTA) hash [33], a subfamily of hashing functions controlled by the number of permutations N and the window size S, was used to transform both descriptors into a rank correlation space. Once in the common rank correlation space, the Hamming distance was used to relate the audio signal to an object.
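A minimal sketch of this hashing step is given below (our illustration; the descriptor window length and the random placeholder descriptors are assumptions, while N = 2000 and S = 5 follow Table 1). Each descriptor is permuted N times, only the first S permuted elements are inspected, and the index of the maximum is recorded; the Hamming distance between the resulting codes then measures rank correlation between the audio and motion descriptors.

```python
import numpy as np

def wta_hash(x, perms, S):
    """Winner-Take-All hash of a 1-D descriptor x (in the spirit of [33]).

    perms : N precomputed permutations of range(len(x)), shared between the
            audio and motion descriptors so their codes are comparable.
    S     : window size; only the first S permuted elements are compared.
    Returns an N-dimensional code of arg-max indices (rank-correlation space).
    """
    codes = np.empty(len(perms), dtype=np.int32)
    for k, p in enumerate(perms):
        codes[k] = int(np.argmax(x[p[:S]]))
    return codes

def hamming(code_a, code_b):
    """Number of positions where two WTA codes disagree."""
    return int(np.sum(code_a != code_b))

# Usage sketch: relate the audio descriptor a(t) to a tracked object's
# motion descriptor m_i(t) over a temporal window (length T is assumed).
rng = np.random.default_rng(0)
T, N, S = 64, 2000, 5
perms = [rng.permutation(T) for _ in range(N)]
a_t = rng.random(T)     # placeholder audio energy descriptor
m_t = rng.random(T)     # placeholder object acceleration descriptor
dist = hamming(wta_hash(a_t, perms, S), wta_hash(m_t, perms, S))
print(dist)             # smaller distance -> stronger audio-object correlation
```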
3.3 Visual Saliency Map

A classical visual saliency map was used as proposed in [12]. The model approaches the problem of saliency computation by defining Markov chains over feature maps, extracted for features of intensity, color, orientation, flicker, and motion, and treats equilibrium locations as saliency values. In detail, each value of the feature map(s) is considered a node and the connectivity between nodes is weighted by their dissimilarity. Once a Markov chain is defined on this graph, the equilibrium distribution of the chain, computed by repeated multiplication of the Markov matrix with an initially uniform vector, accumulates mass at highly dissimilar nodes, providing activation maps. A similar mass concentration process is applied to these activation maps and the output is summed into a final saliency map.

3.4 Motion Map

The motion map indicates the regions of high motion computed using the mean optical flow per frame as described in Section 3.1.2. Adaptive thresholding proposed in [5] was applied on the flows to discard any inconsequential low motion as

I_p = \begin{cases} 0 & \text{if } I_p < T \cdot I_{avg} \\ I_p & \text{otherwise} \end{cases}

where pixel I_p is set to zero if its brightness is T percent lower than the average brightness of its surrounding pixels.

3.5 Map Fusion

In this final stage, the three computed maps: a) visual saliency map, b) audio saliency map, and c) motion map, were normalized before being combined into a final audiovisual saliency map. Here the visual saliency map was the sum of normalized activation maps computed using the mass concentration algorithm. The other two maps were normalized to the range [0, 1] using simple linear transformations. The resulting normalized maps were linearly combined to get the final audiovisual saliency map.
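The sketch below illustrates the motion map thresholding and the final fusion under stated assumptions: the local-averaging window size and the equal fusion weights are ours, and the threshold condition follows the verbal "T percent lower than the surrounding average" description above.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def motion_map(flow_mag, T=10, win=15):
    """Motion map from mean optical flow magnitude with adaptive thresholding
    in the spirit of Bradley and Roth [5]; window size `win` is an assumption."""
    local_avg = uniform_filter(flow_mag, size=win)       # average of surrounding pixels
    out = flow_mag.copy()
    out[flow_mag < (100 - T) / 100.0 * local_avg] = 0    # drop pixels T% below local average
    return out

def normalize(m):
    """Linear rescaling of a map to the range [0, 1]."""
    m = m.astype(np.float64)
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

def fuse(visual, audio, motion, weights=(1.0, 1.0, 1.0)):
    """Linear combination of the normalized maps into the final audiovisual
    saliency map (equal weights are an assumption, not the authors' choice)."""
    wv, wa, wm = weights
    fused = wv * normalize(visual) + wa * normalize(audio) + wm * normalize(motion)
    return normalize(fused)
```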
TABLE 1
Parameters used for different steps.

Step                       Parameter                  Value
Region tracking            Search radius (r)          100
Audio-video correlation    No. of permutations (N)    2000
                           Window size (S)            5
Motion map computation     Threshold % (T)            10

4 IMPLEMENTATION AND EVALUATION
The proposed solution was implemented in MATLAB 2014b and Windows 10 on a 64-bit architecture machine with an Intel i5 processor. The same setup was used for evaluation purposes. The parameters used for the proposed solution are given in Table 1.
The dynamic images and eye movements (DIEM) dataset [23] was used for evaluation of the proposed approach. The dataset comprises 85 (eighty-five) videos, with or without audio, of varying genres. Eye fixation data was collected via a binocular eye tracking experiment with 250 participants in total, all with normal or corrected-to-normal vision. In this work, 25 (twenty-five) videos with audio were randomly selected for evaluation. The video sequences are listed in Table 2 along with their properties.

TABLE 2
Summary of properties of video sequences selected from the DIEM dataset. In the Audio Source column, On-screen (+)/Off-screen (−): H = human, N = non-human, M = music and A = applause. Properties in order are: Single object (S)/Multiple objects (M) (f1), Camera motion (f2), Abrupt scene change (f3), Interaction (f4), Occlusion (f5), Deformation (f6), Crowd (f7), Clutter (f8), and Motion blur (f9). In columns f2 to f9, (+) indicates presence and (−) indicates absence of the particular property.

No  Video Sequence                          Scene Type      Audio Source   f1  f2  f3  f4  f5  f6  f7  f8  f9
1   —                                       —               H+/−, M−       M   +   +   −   −   −   +   −   +
2   advert bbc4 bees                        Advertisement   M−, N+         M   −   −   −   −   −   −   −   −
3   advert bbc4 library                     Advertisement   M−             M   −   −   −   −   −   −   +   −
4   advert bravia paint                     Advertisement   M−, N+         M   −   +   −   −   +   −   −   −
5   arctic bears                            Documentary     H−, M−, N+     M   −   −   −   −   −   −   −   −
6   basketball of sorts                     Sports          M−, N+         M   −   −   +   +   −   −   −   −
7   BBC wildlife special tiger              Documentary     H−, M−, N+     S   −   −   −   +   −   −   −   −
8   DIY SOS                                 Other           H+             S   −   −   −   −   −   −   −   −
9   documentary adrenaline rush             Documentary     H−, M−         M   +   −   −   +   −   −   −   −
10  documentary coral reef adventure        Documentary     H−, M−, N+     M   +   +   +   +   −   −   −   −
11  game trailer lego indiana jones         Computer Game   H−, M−, N+     M   −   +   +   +   +   +   −   −
12  hairy bikers cabbage                    Other           H+             M   −   −   +   −   −   −   −   −
13  harry potter 6 trailer                  Movie           H+, M−, N+     M   −   +   +   +   +   −   −   −
14  home movie Charlie bit my finger again  Movie           H+             M   +   −   +   −   −   −   −   −
15  hummingbirds closeups                   Documentary     H−, N+         M   −   −   +   −   +   −   −   −
16  music trailer nine inch nails           Crowd           M+/−           M   −   −   +   +   −   −   −   −
17  news bee parasites                      News            H+/−           M   −   −   +   +   −   −   −   −
18  news sherry drinking mice               News            H−             M   −   +   +   +   −   −   −   −
19  news us election debate                 News            A−, H+         M   −   −   +   +   −   −   −   −
20  one show                                Other           H+             S   +   −   −   −   −   −   −   −
21  pingpong angle shot                     Sports          N+             M   −   −   +   −   −   −   −   −
22  planet earth jungles monkeys            Documentary     H−, N+         M   −   −   −   +   −   −   −   −
23  scottish parliament                     Other           H+/−           M   −   −   −   −   −   +   −   −
24  sport football best goals               Sports          A−, M−         M   +   −   +   +   −   −   −   −
25  stewart lee                             Other           H+             M   −   −   +   +   −   +   −   −

The proposed solution was evaluated using the following four criteria.
1) Area under the curve (AUC) is a location-based metric, where saliency pixels equal in number to the total recorded fixations are randomly extracted. The true positives (TP) and false positives (FP) are calculated for different thresholds, treating the saliency values as a classifier. The resulting values are used to plot an ROC curve and compute the AUC; the ideal score is 1.0, while a value of 0.5 indicates random classification.

2) Kullback-Leibler divergence (D_KL) is a distribution-based dissimilarity measure given as

D_{KL} = \sum_i M_f(i) \ln \left( \frac{M_f(i)}{M_s(i)} \right)

It estimates the loss of information when the saliency map M_s is used to approximate the fixation map M_f, both considered as probability distributions. The ideal D_KL score is zero, meaning the saliency and fixation maps are exactly the same; larger values indicate a poorer saliency model.

3) Normalized scanpath saliency (NSS) is computed as

NSS = \frac{1}{N} \sum_i \frac{M_s(i) - \mu_{M_s}}{\sigma_{M_s}}

where the saliency map M_s is normalized to zero mean and unit standard deviation, then averaged over the N fixation locations. A zero score means a chance prediction, whereas a high score indicates high predictability of the saliency model.

4) Linear correlation coefficient (CC) is another distribution-based metric computed as

CC = \frac{cov(M_s, M_f)}{\sigma_{M_s} \sigma_{M_f}}

Its output ranges between −1 and +1; the closer the score is to either extreme, the better the predictability of the saliency model.

Based on our literature review, we found no other audiovisual saliency model for complex dynamic scenes. For the sake of comparison, we therefore compared our proposed audiovisual saliency model against two state-of-the-art visual saliency models. The first model, proposed in [19], derives a pixel-level spatial/temporal saliency map from a superpixel-level spatial/temporal saliency map constructed using motion and color histogram features. The other spatiotemporal saliency detection model, proposed by Liu et al. [18], is based upon a superpixel-level graph and temporal propagation.
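For reference, minimal implementations of three of the four criteria are sketched below (our illustration, not the authors' evaluation code; it assumes the saliency map, fixation map, and fixation coordinates are NumPy arrays, and AUC is omitted for brevity). They follow the standard definitions given above.

```python
import numpy as np

EPS = np.finfo(np.float64).eps

def nss(sal, fix_points):
    """Normalized scanpath saliency: saliency (zero mean, unit std)
    averaged at the N recorded fixation locations given as (row, col) pairs."""
    s = (sal - sal.mean()) / (sal.std() + EPS)
    rows, cols = zip(*fix_points)
    return float(s[list(rows), list(cols)].mean())

def cc(sal, fix_map):
    """Linear correlation coefficient between saliency and fixation maps."""
    s = (sal - sal.mean()) / (sal.std() + EPS)
    f = (fix_map - fix_map.mean()) / (fix_map.std() + EPS)
    return float((s * f).mean())

def kl_div(sal, fix_map):
    """KL divergence of the saliency map from the fixation map,
    both treated as probability distributions."""
    s = sal / (sal.sum() + EPS)
    f = fix_map / (fix_map.sum() + EPS)
    return float(np.sum(f * np.log(f / (s + EPS) + EPS)))
```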
5 RESULTS
For evaluation, we computed saliency maps for the selected videos from the DIEM dataset using the two state-of-the-art models and our proposed model. Using the evaluation criteria, the average scores (Table 3) of the resulting saliency maps over the initial frames of each video were compared to assess eye movement predictability.

TABLE 3
Average scores for three different techniques on the DIEM dataset, including our proposed model.

Metric    Ours    Liu et al. 2014 [19]    Liu et al. 2016 [18]
AUC        —               —                       —
D_KL       —               —                       —
NSS        —               —                       —
CC         —               —                       —

We observe that the proposed model not only outperforms both comparison methods but also achieves a satisfactorily higher average score in terms of AUC. Moreover, a lower D_KL score indicates a better saliency model with less dissimilarity to the ground truth. For the remaining evaluation metrics, CC and NSS, the proposed method results in slightly lower scores; however, the results still suggest that the proposed model makes better eye movement predictions, and thus support the idea of incorporating audio features when computing spatiotemporal saliency for unconstrained videos. Some of the video sequences performed better, for instance stewart lee, news us election debate and one show, with an on-screen sound source, no object occlusion, and interaction.
Figure 2 illustrates the saliency maps obtained by the different methods. The visual comparison demonstrates that our proposed model performs comparatively better. For instance, for the video sequence with an on-screen audio source in the third row, the visual models failed to correspond to the ground truth (GT) as they considered both faces salient; by contrast, the proposed audiovisual model marked the talking face as salient.
6 DISCUSSION
Spatiotemporal saliency detection is a challenging problem. It is worth mentioning that existing models ignore the audio signal in the input media. However, a number of experimental studies [25], [29], [32] discuss the influence of aural stimuli on early attention when viewing complex scenes; that is, audio stimuli can provide useful information in guiding eye movements. This influence can be incorporated into existing bottom-up models by the inclusion of low-level audio properties like energy, frequency, amplitude, etc. The resulting audiovisual saliency model makes more sense in application areas like video summarization/compression, event detection, gaze prediction, and robotic vision and interaction. There exist some models in the literature based upon multiple stimuli [9], [26], [28] but they lack a generic solution by limiting the models to specific categories of videos.

A major reason for this gap in the literature is one of the foremost challenges of audiovisual saliency models: localization of the audio source in a given frame. Some methods either use microphone arrays to triangulate a single source or only target stationary sources in a scene. Such models fail to perform for dynamic videos, as they assume a single audio source. Furthermore, an approach overcoming these restrictions uses correlation analysis between audio and video segments, where the audio source is a set of relevant pixels rather than an object. The approach has been used in more recent works where object segmentation precedes audiovisual correlation, so that audio source separation maintains the source object shape. Since the audio and video signals are from different domains, reliable correlation requires feature transformation into a suitable common space. Moreover, it requires a method to relate an audio descriptor to an object descriptor in a video frame; that is, segmentation and tracking of diversified objects in a video frame is in itself a challenging task. To be precise, the literature lacks techniques for multiple objects, which is the case in our dataset with no a priori information about objects like shape, color, size, etc.

In terms of eye movement predictability, the proposed audiovisual saliency model performed better for two evaluation metrics. However, it resulted in comparable scores for the other two metrics. This result can be attributed to the difficulty in segmenting and tracking multiple interacting objects in varying conditions like motion blur, crowd, etc. Moreover, multiple and/or off-screen audio sources make it more challenging to locate an audio source, in consequence affecting the model's performance.

The proposed saliency model exhibits higher time complexity (Table 4), attributed to dense optical flow computation, which is inherently compute-intensive, being an optimization problem.
TABLE 4
Time complexity for three models including our model (per video frame).

Method                  Step                       Time per frame (s)
Ours                    Optical flow               13.406
                        Segmentation               15.752
                        Tracking                   0.718
                        Audio-video correlation    2.509
                        Video saliency comp.       0.218
                        Motion map comp.           0.078
                        Total                      32.681
Liu et al. 2014 [19]                               13.658
Liu et al. 2016 [18]                               7.797

The main advantage of using the method is that it estimates both forward and backward flows, and hence the optical flow of occluded regions is also computed correctly. Other alternative motion estimation approaches are block matching and phase correlation, which could be used instead to propose a more efficient solution. Likewise, segmentation of multiple objects is a complex task involving mean-shift segmentation, a non-parametric clustering using kernel density estimation. The approach is not scalable due to its large feature space dimensions. Alternatively, simpler histogram- or superpixel-based segmentation methods can be used to reduce computational complexity, as well as increase model predictability.

A shortcoming of the current study is the use of a subset of the available dataset for evaluation. It may be interesting to perform evaluation using the entire video dataset and/or other available datasets to reinforce our finding that aural stimuli alongside visual stimuli can provide useful information in guiding eye movements.

All in all, the proposed solution scored reasonably well; however, it can be further improved. An improvement in the segmentation and tracking techniques may contribute to a better audio saliency map, and in turn towards a better final saliency map. Furthermore, the use of a more sophisticated visual saliency model, as well as more suitable combination techniques, can improve the final result.
7 CONCLUSION
Existing bottom-up saliency models only use visual stimuli while available audio stimuli in the input media remain unused. In this paper, we proposed an audiovisual saliency model incorporating both low-level visual and audio information to produce three different saliency maps: an audio saliency map, a motion saliency map, and a visual saliency map. These maps were linearly combined to get a final saliency map. The model was evaluated on the DIEM dataset using four different criteria. The results show an overall improvement against two state-of-the-art visual saliency models and reinforce the idea that aural stimuli can provide useful information to guide eye movements.
Fig. 2. Comparison of our results with other methods against the ground truth (GT) on the DIEM dataset. Rows: advert bbc4 bees (1024x576), basketball of sorts (960x720), hairy bikers cabbage (1280x712), news us election debate (1080x600). Columns: Input, GT, Liu et al. 2014 [19], Liu et al. 2016 [18], Ours.

REFERENCES

[1] T. Avraham and M. Lindenbaum. Esaliency (extended saliency): Meaningful attention using stochastic image modeling. IEEE Trans. Pattern Anal. Mach. Intell., 32(4):693–708, 2010.
[2] E. Birmingham, W.F. Bischof, and A. Kingstone. Saliency does not account for fixations to eyes within social scenes. Vision Res., 49(24):2992–3000, 2009.
[3] A. Borji, M.N. Ahmadabadi, and B.N. Araabi. Cost-sensitive learning of top-down modulation for attentional control. Mach. Vision Appl., 22(1):61–76, 2011.
[4] A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):185–207, 2013.
[5] D. Bradley and G. Roth. Adaptive thresholding using the integral image. J. Graphics GPU Game Tools, 12(2):13–21, 2007.
[6] N. Bruce and J. Tsotsos. Saliency based on information maximization. In Adv. Neural Inf. Process. Syst., pages 155–162, 2005.
[7] M. Cerf, J. Harel, W. Einhäuser, and C. Koch. Predicting human gaze using low-level saliency combined with face detection. In Adv. Neural Inf. Process. Syst., pages 241–248, 2008.
[8] H.-S. Chang and Yu-Chiang F. Wang. Superpixel-based large displacement optical flow. In IEEE Int. Conf. Image Process. (ICIP), pages 3835–3839. IEEE, 2013.
[9] A. Coutrot and N. Guyader. An audiovisual attention model for natural conversation scenes. In IEEE Int. Conf. Image Process. (ICIP), pages 1100–1104. IEEE, 2014.
[10] A. Coutrot and N. Guyader. Multimodal saliency models for videos. In From Hum. Attention Comput. Attention, pages 291–304. Springer, 2016.
[11] C. Guo and L. Zhang. A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE Trans. Image Process., 19(1):185–198, 2010.
[12] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Adv. Neural Inf. Process. Syst., pages 545–552. MIT Press, 2006.
[13] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1254–1259, 1998.
[14] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In IEEE Int. Conf. Comput. Vision (ICCV), pages 2106–2113. IEEE, 2009.
[15] O. Le Meur, P. Le Callet, and D. Barba. Predicting visual fixations on video based on low-level visual features. Vision Res., 47(19):2483–2498, 2007.
[16] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau. A coherent computational approach to model the bottom-up visual attention. IEEE Trans. Pattern Anal. Mach. Intell., 28:802–817, 2006.
[17] K. Li, J. Ye, and K.A. Hua. What's making that sound? In Proc. 22nd ACM Int. Conf. Multimedia, pages 147–156. ACM, 2014.
[18] Z. Liu, J. Li, L. Ye, G. Sun, and L. Shen. Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation. IEEE Trans. Circuits Syst. Video Technol., PP(99):1–1, 2016.
[19] Z. Liu, X. Zhang, S. Luo, and O. Le Meur. Superpixel-based spatiotemporal saliency detection. IEEE Trans. Circuits Syst. Video Technol., 24(9):1522–1540, 2014.
[20] S. Marat, M. Guironnet, and D. Pellerin. Video summarization using a visual attention model. In Signal Process. Conf., 2007 15th Eur., pages 1784–1788. IEEE, 2007.
[21] S. Marat, T.H. Phuoc, L. Granjon, N. Guyader, D. Pellerin, and A. Guérin-Dugué. Modelling spatio-temporal saliency to predict gaze direction for short videos. Int. J. Comput. Vision, 82(3):231–243, 2009.
[22] S. Marat, A. Rahman, D. Pellerin, N. Guyader, and D. Houzet. Improving visual saliency by adding 'face feature map' and 'center bias'. Cognit. Comput., 5(1):63–75, 2013.
[23] P.K. Mital, T.J. Smith, R.L. Hill, and J.M. Henderson. Clustering of gaze during dynamic scene viewing is predicted by motion. Cognit. Comput., 3(1):5–24, 2011.
[24] N. Murray, M. Vanrell, X. Otazu, and C.A. Parraga. Saliency estimation using a non-parametric low-level vision model. In Comput. Vision Pattern Recognit. (CVPR), 2011 IEEE Conf., pages 433–440. IEEE, 2011.
[25] C. Quigley, S. Onat, S. Harding, M. Cooke, and P. König. Audio-visual integration during overt visual attention. J. Eye Mov. Res., 1(2):1–17, 2008.
[26] K. Rapantzikos, G. Evangelopoulos, P. Maragos, and Y. Avrithis. An audio-visual saliency model for movie summarization. In Multimedia Signal Process. MMSP 2007, IEEE 9th Workshop, pages 320–323. IEEE, 2007.
[27] K. Rapantzikos, N. Tsapatsoulis, Y. Avrithis, and S. Kollias. Spatiotemporal saliency for video classification. Signal Process.: Image Commun., 24(7):557–571, 2009.
[28] N.O. Sidaty, M.-C. Larabi, and A. Saadane. An audiovisual saliency model for conferencing and conversation videos. Electron. Imaging, 2016(13):1–6, 2016.
[29] G. Song, D. Pellerin, and L. Granjon. Different types of sounds influence gaze differently in videos. J. Eye Mov. Res., 6(4):1–13, 2013.
[30] B.W. Tatler, M.M. Hayhoe, M.F. Land, and D.H. Ballard. Eye guidance in natural vision: Reinterpreting salience. J. Vision, 11(5):5–5, 2011.
[31] H.R. Tavakoli, E. Rahtu, and J. Heikkilä. Fast and efficient saliency detection using sparse sampling and kernel density estimation. In Scand. Conf. Image Anal., pages 666–675. Springer, 2011.
[32] E. Van der Burg, C.N.L. Olivers, A.W. Bronkhorst, and J. Theeuwes. Pip and pop: nonspatial auditory signals improve spatial visual search. J. Exp. Psychol.: Hum. Percept. Perform., 34(5):1053, 2008.
[33] J. Yagnik, D. Strelow, D.A. Ross, and R.-S. Lin. The power of comparative reasoning. In 2011 Int. Conf. Comput. Vision (ICCV). IEEE, 2011.