Multi-scale Visual Attention & Saliency Modelling with Decision Theory
Anh Cat Le Ngo, Li-Minn Ang, Guoping Qiu, Kah Phooi Seng
Faculty of Engineering, University of Nottingham, Malaysia Campus, Malaysia
School of Computer Science, University of Nottingham, Jubilee Campus, UK
Centre for Communications Engineering Research, Edith Cowan University, Australia
Department of Computer Science & Networked System, Sunway University, Malaysia
ABSTRACT

Bottom-up saliency, an early stage of human visual processing, behaves like a binary classification between an interest hypothesis and a null hypothesis. Its discriminant power, the mutual information between image features and the class distribution, is closely related to saliency value under the well-known centre-surround theory. As classification accuracy depends strongly on window size, discriminant saliency varies with the sampling scale. Estimating discriminant power in a multi-scale framework therefore requires integration with the wavelet transform, after which the statistical discrepancy between two consecutive scales (centre and surround windows) is estimated by a Hidden Markov Tree (HMT) model. Finally, multi-scale discriminant saliency (MDIS) maps are combined by the maximum information rule to synthesize a final saliency map. All MDIS maps are evaluated with standard quantitative tools (NSS, LCC, AUC) on N. Bruce's database, whose ground truth consists of eye-tracking locations, and are assessed qualitatively by visual examination of individual cases. To evaluate MDIS against the well-known AIM saliency method, simulations are described in detail, and several conclusions are drawn for further research directions.
1. DISCRIMINANT VISUAL SALIENCY

The saliency mechanism plays a key role in perceptual organization [1]; therefore, several researchers have recently attempted to formulate general principles for visual saliency [2][3][4][5][6][7]. From the decision-theoretic point of view, saliency is regarded as the power to distinguish salient from non-salient classes; discriminant saliency (DIS) combines the classical centre-surround hypothesis with an optimal saliency architecture derived from decision theory. The saliency value at a spatial location is identified with the discriminant power of a feature set with respect to the binary classification problem between centre and surround classes. Based on decision theory, this approach generalizes to a variety of stimulus modalities, including intensity, colour, orientation and motion [2]. Moreover, DIS saliency maps have been shown to satisfy various psychophysical properties of both static and motion stimuli quantitatively [8]. Given the ubiquity of the centre-surround operator in the early stages of biological vision, bottom-up saliency is commonly defined as how reliably the stimulus at each location of the central visual field can be distinguished from the stimuli in its surround. In other words, the "centre-surround" hypothesis poses a natural binary classification problem which can be solved by well-established decision theory. The classes are defined as follows.

• Centre class: observations within a central neighborhood W_l^1 of the visual field location l.
• Surround class: observations within a window W_l^0 surrounding the central region.

Feature responses are drawn from a predefined feature set X by a random process. As there are many possible combinations and orders in which such responses can be assembled, the feature observations can be modelled as a random process X(l) = (X_1(l), ..., X_d(l)) of dimension d.
This random process is drawn conditionally on a hidden variable Y(l) of class states or labels (centre / surround). A feature vector x(j), for j in W_l^c, c in {0, 1}, is drawn from class c according to the conditional probability density P_{X(l)|Y(l)}(x|c), where Y(l) = 0, 1 denote the surround and centre labels. The saliency of location l, S(l), is equal to the discriminant power of X for the classification of the observed feature vectors. This discriminant concept is quantified by the mutual information between the feature X and the class label Y:

S(l) = I_l(X; Y) = \sum_c \int p_{X,Y}(x, c) \log \frac{p_{X,Y}(x, c)}{p_X(x) p_Y(c)} \, dx

However, mutual information estimation in a d-dimensional space suffers from the curse of dimensionality. Successfully tackling this problem would make information-based saliency algorithms more biologically plausible and computationally feasible. Dashan Gao and Nuno Vasconcelos have proposed a possible solution called DIS [9], formulated as follows:

I_l(X; Y) = H(Y) - H(Y|X) = \frac{1}{|W_l|} \sum_{j \in W_l} \Big[ H(Y) + \sum_{c=0}^{1} P_{Y|X}(c|x_j) \log P_{Y|X}(c|x_j) \Big]   (1)

where H(Y) = -\sum_{c=0}^{1} P_Y(c) \log P_Y(c) is the entropy of the classes Y and -E_{Y|X}[\log P_{Y|X}(c|x)] is the conditional entropy of Y given X. Given a location l, there are a corresponding centre window W_l^1 and surround window W_l^0, with an associated set of feature responses x(j), j in W_l = W_l^1 ∪ W_l^0.

While DIS successfully defines discriminant saliency in an information-theoretic sense, its implementation, equation (1), restricts sampled features to a single fixed-size window. Consequently, it is biased toward objects whose distinctive features fit that window size. As multi-scale processing is an implicit factor of visual attention, DIS needs to be adapted to the wavelet transform, a popular multi-resolution framework.
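Equation (1) can be made concrete with a short sketch. The function below is illustrative only: it assumes the class posteriors P(Y=c|x_j) for every sample j in the window W_l have already been estimated by some upstream model, and simply evaluates the bracketed expression of equation (1).

```python
import numpy as np

def discriminant_saliency(posteriors, priors=(0.5, 0.5)):
    """Discriminant power of eq. (1): class entropy H(Y) plus the mean
    over the window of sum_c P(c|x_j) log P(c|x_j).

    posteriors : array of shape (n_samples, 2) holding P(Y=c | x_j)
                 for the surround (c=0) and centre (c=1) classes.
    priors     : class priors P_Y(c), used for H(Y).
    """
    priors = np.asarray(priors, dtype=float)
    h_y = -np.sum(priors * np.log(priors))      # H(Y)
    p = np.clip(posteriors, 1e-12, 1.0)         # guard against log(0)
    # (1/|W_l|) sum_j [ H(Y) + sum_c P(c|x_j) log P(c|x_j) ]
    return h_y + np.mean(np.sum(p * np.log(p), axis=1))
```

With uniform posteriors the features carry no class information and the discriminant power is zero; with fully confident posteriors it reaches its maximum H(Y) = log 2 nats.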
2. MULTISCALE FRAMEWORK
A multi-scale binary image segmentation is a natural starting point for multi-scale DIS (MDIS), as it also classifies each data point into two classes, centre and surround. Note that DIS only uses the binary classification as an intermediate step toward measuring discriminant value. As segmentation accuracy depends on the size of the classification window, an appropriate choice maximizes the true-classification ratio, while a poor choice leads to a sub-optimal system. For example, a large window usually provides rich statistical information and enhances the reliability of the algorithm; however, it simultaneously risks including heterogeneous elements in the window, which in turn reduces segmentation accuracy. Processing with too small a window risks settling on local maxima while missing globally meaningful points. In brief, choosing an appropriate window size has a vital influence on the performance of binary segmentation and consequently of DIS and MDIS.
Dynamic windows with varying sizes can be employed to obtain coarse-to-fine segmented regions [10]. Adapting this approach, MDIS can produce saliency maps at varying resolutions. MDIS implements multi-scale dyadic windows because of their compact arrangement [11]; for example, given an initial square image of 2^J x 2^J = n pixels, the dyadic square structure is generated by recursively dividing the image into four equal square sub-images (left-hand side of figure 1). This structure mirrors the popular quad-tree commonly employed in wavelet transforms (right-hand side of figure 1). Each node of a quad-tree is a child of a node at the level directly above and a parent of nodes at the level directly below. Each node corresponds to a dyadic block combining wavelet coefficients across the different sub-bands (nodes τ_i in figure 1). Let us denote each block by d_i^j, where i and j index location and level respectively.

Fig. 1: Dyadic image partition over levels j = 0..3 (left) and the corresponding quad-tree structure across the wavelet sub-bands LL, LH, HL, HH (right).
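The dyadic partition of figure 1 can be enumerated with a small helper. The function name `dyadic_blocks` and its return layout are illustrative, not from the paper; it lists, for each level j, the 4^j square blocks of side 2^(J-j), where the parent of the block at grid position (r, c) on level j sits at (r//2, c//2) on level j-1.

```python
def dyadic_blocks(J, max_level):
    """Enumerate dyadic square sub-blocks of a 2^J x 2^J image.

    Returns a dict mapping (level j, block index i) to the block's
    (top, left, side) in pixels. Level j splits the image into a
    2^j x 2^j grid of blocks of side 2^(J - j), as in Fig. 1.
    """
    blocks = {}
    for j in range(max_level + 1):
        side = 2 ** (J - j)             # block side length at level j
        n = 2 ** j                      # blocks per axis at level j
        for i in range(n * n):
            r, c = divmod(i, n)         # row-major grid position
            blocks[(j, i)] = (r * side, c * side, side)
    return blocks
```

For an 8x8 image (J = 3) with three levels, this yields 1 + 4 + 16 = 21 blocks, matching the quad-tree's one-parent-four-children growth.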
Assuming image contents are generated by a random variable X, each node of the quad-tree also corresponds to a randomly generated block. Classifying a node into either the centre or the surround class requires studying its statistical properties. As a node can be represented by wavelet coefficients, a Gaussian Mixture Model (GMM) is used to estimate their likelihood as a mixture of large-variance and small-variance Gaussian distributions. Moreover, inter-scale correlation is usually found between wavelet coefficients at different levels; this statistical dependence is modelled by a Hidden Markov Tree (HMT). Essentially, the HMT estimates the likelihood of each wavelet coefficient given a hidden state; the resulting feature probability captures both novelty and persistence, whereby hidden states either change or persist from one scale to the next. The up-down algorithm [12] estimates the likelihood p(d_i^j | c_m) of every node given its hidden state c_m = 0, 1. Though binary segmentation / classification can be achieved with the maximum-likelihood principle, the results are not consistent across scales because no prior information is integrated. Choi et al. [13] propose a Bayesian maximum a posteriori (MAP) approach for p(c_m | d_i^j, v_i^{j-1}), equation (2), in which both the parents' classes and the children's features are involved in class decisions. To optimize the MAP estimate and enhance across-scale coherency, sweeping operations fuse the likelihoods f(d_i | c_i) along the quad-tree given the label-tree prior p(c_i^j | v_i):

\hat{c}_i^{MAP} = \arg\max_{c_i^j \in \{0,1\}} f(c_i^j | d^j, v^j)   (2)

The DIS method also uses MAP estimation for the scale parameter (variance) of the generalized Gaussian distribution (GGD):

\hat{\alpha}_{MAP} = \frac{K}{n} \sum_{j=1}^{n} |x(j)|^{\beta} + \nu^{\beta}   (3)

This estimate is then included in the centre / surround class decision, equation (1).
Therefore, discriminant power is strictly proportional to the difference between the MAP values of distributions with variances α_0, α_1 from the two classes. In MDIS, the posterior can be computed directly by equation (2); combining it with the mutual information principle of DIS, equation (1), yields a multiscale estimate of discriminant power:

I_i^j(C^j; D^j) = H(C^j) + \sum_{c=0}^{1} P_{C^j|D^j}(c_i^j | d^j) \log P_{C^j|D^j}(c_i^j | d^j)   (4)

Since equation (4) yields discriminant power at every scale, we can choose the maximum value, \arg\max_j ( I_i^j ), for each location.
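The maximum information rule above reduces to an elementwise maximum once every per-scale map has been brought to a common resolution. A minimal sketch, assuming the per-scale maps I^j are supplied as equally-sized arrays:

```python
import numpy as np

def fuse_max_information(scale_maps):
    """Maximum information rule: keep, at each pixel, the largest
    discriminant power over all scales j (argmax_j I_i^j per location).

    scale_maps : list of 2-D arrays of identical shape, each one a
                 discriminant saliency map already upsampled to full
                 image resolution.
    """
    stacked = np.stack(scale_maps, axis=0)   # shape (n_scales, H, W)
    return stacked.max(axis=0)               # fused saliency map
```

Upsampling each dyadic-scale map before fusion (e.g. by pixel replication of block values) is an implementation detail left to the surrounding pipeline.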
3. EXPERIMENTS & DISCUSSION
In this paper, we try MAP estimation with several HMT derivatives: Universal HMT (UHMT) [14], Trained HMT (THMT) [12], and Vector HMT (VHMT) [15]. THMT requires an on-line training stage to estimate model parameters, and processes the three wavelet orientations independently with single-variate operations; VHMT instead treats a vector of coefficients as multi-variate variables in similar operations. The multi-variate nature of VHMT favours modelling textural, especially rotation-invariant, features. While THMT and VHMT require training stages for their parameters, the parameters can be fixed by off-line training in UHMT if general image contents are known in advance; Romberg et al. [14] have proposed such a set of UHMT parameters for natural images. Each approach is evaluated against an established saliency method, AIM (Attention based on Information Maximization [16]), both quantitatively (LCC, NSS, AUC, TIME [17]) and qualitatively, by visual inspection of the generated saliency maps, on the well-known Neil Bruce database [18] with eye-tracking locations as ground truth.
In the simulation, we deploy five dyadic scales, corresponding to the (U/T/V)HMT(1-5) modes of MDIS; the integrated saliency maps are denoted (U/T/V)HMT0. Three numerical measures, linear cross correlation (LCC), normalized scanpath saliency (NSS), and area under curve (AUC), together with TIME, are reported in tables 2k, 2m, 2o for UHMT, THMT, and VHMT respectively. In these tables, TIME represents the computational requirement of each (U,T,V)HMT saliency method; the methods are listed in predictably increasing order of cost. UHMT requires the least TIME since it needs no training, while THMT and VHMT need more computational effort to learn model parameters in single-variate and multi-variate manners respectively. (T,V)HMT surpass UHMT in the evaluated LCC, NSS, and AUC scores, as shown in tables 2k, 2m, 2o; the best-performing MDIS modes also surpass AIM in all quantitative measures, as shown by the maximum and minimum values in each column of these tables. Figures 2c, 2d, 2e compare the different MDIS modes against AIM with Receiver Operating Characteristic (ROC) curves. Generally, the HMT-based MDIS modes perform better than AIM at smaller scales, (U,T,V)HMT(0,4,5), while MDIS at larger scales, HMT(1,2,3), is equivalent to or slightly worse than AIM. AUC measures increase as the processing windows shrink, HMT(1-5), regardless of the U/T/V mode. Meanwhile, LCC and NSS vary more wildly; for instance,
UHMT has the best LCC and NSS in the HMT4 mode, while (T,V)HMT are mostly best at HMT0, the integrated mode. Overall, trained HMT, especially VHMT (table 2o and figure 2j), provides more consistent numerical results across scales. Figures 2l, 2n, 2p show sample saliency maps of the (U,T,V)HMT(0-5) MDIS modes and AIM for qualitative evaluation. (T,V)HMT produce similar saliency maps, while the UHMT map highlights unlikely attentive regions; its poor performance might be due to the lack of a training step.

Figure 2 & Table 1: Quantitative and qualitative evaluation of MDIS and AIM. Panels: (a) NSS, (b) LCC, (c) UHMT ROC, (d) THMT ROC, (e) VHMT ROC, (f) AUC, (g) TIME, (h)-(j) (U,T,V)HMT MDIS maps, (k)/(m)/(o) (U,T,V)HMT MDIS data tables (LCC, NSS, AUC, TIME), (l)/(n)/(p) (U,T,V)HMT MDIS MAP maps.
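The three scores used throughout this evaluation follow standard definitions (see [17]). The sketch below assumes the saliency map and the fixation ground truth are given as NumPy arrays; the helper names are ours, not the paper's.

```python
import numpy as np

def nss(saliency, fix_rows, fix_cols):
    """Normalized scanpath saliency: mean of the z-scored map at fixations."""
    z = (saliency - saliency.mean()) / saliency.std()
    return z[fix_rows, fix_cols].mean()

def lcc(saliency, fixation_density):
    """Linear (Pearson) cross correlation with a fixation density map."""
    return np.corrcoef(saliency.ravel(), fixation_density.ravel())[0, 1]

def auc(saliency, fixation_mask, n_thresh=100):
    """Area under the ROC curve obtained by thresholding the saliency
    map, treating fixated pixels as positives."""
    s = saliency.ravel()
    pos = fixation_mask.ravel().astype(bool)
    thresholds = np.linspace(s.min(), s.max(), n_thresh)
    tpr = [(s[pos] >= t).mean() for t in thresholds]
    fpr = [(s[~pos] >= t).mean() for t in thresholds]
    # integrate TPR over FPR (curves reversed so FPR is increasing)
    return np.trapz(tpr[::-1], fpr[::-1])
```

A map that concentrates its mass exactly on the fixated pixels scores NSS > 0 and AUC close to 1, which is the behaviour the tables above rank the (U,T,V)HMT modes by.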
4. CONCLUSION
In conclusion, Multiscale Discriminant Saliency (MDIS) is developed as an extension of DIS [19] under the dyadic scale framework of the wavelet transform. MDIS uses the mutual information between classes and feature distributions to quantify discriminant classification power as saliency value over multiple dyadic-scale structures. Moreover, it fuses prior information, the class decisions from previous scales, via Bayesian MAP along the quad-tree in a coarse-to-fine manner, creating saliency maps that are consistent across scales and a final integrated map under the maximum information rule. MDIS is evaluated against AIM to demonstrate its competitiveness. A further research direction is the implementation of MDIS algorithms on embedded and mobile platforms.

REFERENCES

[1] A. M. Treisman and G. Gelade, "A feature-integration theory of attention," Cognitive Psychology, vol. 12, no. 1, pp. 97-136, 1980.
[2] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998.
[3] N. D. B. Bruce and J. K. Tsotsos, "Saliency, attention, and visual search: An information theoretic approach," Journal of Vision, vol. 9, no. 3, 2009.
[4] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," NIPS, 2007.
[5] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), 2007.
[6] Guoping Qiu, Xiaodong Gu, Zhibo Chen, Quqing Chen, and C. Wang, "An information theoretic model of spatiotemporal visual saliency," in IEEE International Conference on Multimedia and Expo, July 2007, pp. 1806-1809.
[7] A. C. Le Ngo, G. Qiu, G. Underwood, L. M. Ang, and K. P. Seng, "Visual saliency based on fast nonparametric multidimensional entropy estimation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[8] D. Gao and N. Vasconcelos, "Decision-theoretic saliency: computational principles, biological plausibility, and implications for neurophysiology and psychophysics," Neural Computation, vol. 21, no. 1, pp. 239-271.
[9] D. Gao and N. Vasconcelos, "Discriminant interest points are stable," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), 2007.
[10] Hyeokho Choi and R. Baraniuk, "Multiscale texture segmentation using wavelet-domain hidden Markov models," in Conference Record of the Thirty-Second Asilomar Conference on Signals, Systems & Computers, Nov. 1998, vol. 2, pp. 1692-.
[11] P. Burt and E. Adelson, "The Laplacian pyramid as a compact image code," IEEE Transactions on Communications, vol. 31, no. 4, pp. 532-540, Apr. 1983.
[12] M. S. Crouse and R. G. Baraniuk, "Simplified wavelet-domain hidden Markov models using contexts," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, May 1998.
[13] H. Choi and R. G. Baraniuk, "Multiscale image segmentation using wavelet-domain hidden Markov models," IEEE Transactions on Image Processing, vol. 10, no. 9, pp. 1309-1321, Sept. 2001.
[14] J. K. Romberg, Hyeokho Choi, and R. G. Baraniuk, "Bayesian tree-structured image modeling using wavelet-domain hidden Markov models," IEEE Transactions on Image Processing, vol. 10, no. 7, pp. 1056-.
[15] M. N. Do and M. Vetterli, "Rotation invariant texture characterization and retrieval using steerable wavelet-domain hidden Markov models," IEEE Transactions on Multimedia, vol. 4, no. 4, pp. 517-527, 2002.
[16] N. Bruce and J. Tsotsos, "Saliency based on information maximization," Advances in Neural Information Processing Systems, vol. 18, p. 155, 2006.
[17] A. Borji, D. N. Sihite, and L. Itti, "Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study," IEEE Transactions on Image Processing, 2012.
[18] N. D. B. Bruce, D. P. Loach, and J. K. Tsotsos, "Visual correlates of fixation selection: a look at the spatial frequency domain," in IEEE International Conference on Image Processing (ICIP 2007), 2007, vol. 3, p. III-289.
[19] D. Gao, V. Mahadevan, and N. Vasconcelos, "The discriminant center-surround hypothesis for bottom-up saliency," Advances in Neural Information Processing Systems, vol. 20, pp. 1-8, 2007.