A Fractal Approach to Characterize Emotions in Audio and Visual Domain: A Study on Cross-Modal Interaction

Sayan Nag a,e, Uddalok Sarkar a,e, Shankha Sanyal a,f*, Archi Banerjee a,b,c, Souparno Roy a,d, Samir Karmakar f, Ranjan Sengupta a & Dipak Ghosh a

a Sir C.V. Raman Centre for Physics and Music, Jadavpur University
b Department of Humanities and Social Sciences, IIT Kharagpur
c Shrutinandan School of Music, Kolkata
d Department of Physics, Jadavpur University
e Department of Electrical Engineering, Jadavpur University
f School of Languages and Linguistics, Jadavpur University
* Corresponding author email: [email protected]
Abstract:
It is already known that both auditory and visual stimuli are able to convey emotions in the human mind to different extents. The strength or intensity of the emotional arousal varies depending on the type of stimulus chosen. In this study, we investigate emotional arousal in a cross-modal scenario involving both auditory and visual stimuli while studying their source characteristics. A robust fractal analytic technique called Detrended Fluctuation Analysis (DFA) and its 2D analogue have been used to characterize three (3) standardized audio and visual stimuli each, quantifying their scaling exponents corresponding to positive and negative valence. A significant difference was found between the scaling exponents corresponding to the two modalities. Detrended Cross-Correlation Analysis (DCCA) has also been applied to decipher the degree of cross-correlation among the individual audio and visual stimuli. This is the first study of its kind to propose a novel algorithm with which emotional arousal can be classified in a cross-modal scenario using only the source audio and visual signals, while also attempting a correlation between them.
Keywords: Cross-modal valence, Emotions, Audio/visual stimuli, 2D-DFA, Hurst Exponent
INTRODUCTION
A number of studies have been carried out at the psychological level as to how the emotions conveyed by two different modalities - audio and visual - vary from one another. A few of them even try to look into how the perceptual strength of emotions expressed in the two modalities differs. It is already known that certain emotions (i.e., brief affective states triggered by the appraisal of an event in relation to current goals [4]), such as awe and wonder [5], are frequently reported in relation to the contemplation of artworks. These emotions typically occur when an object or event is appraised as highly complex and novel and creates a sense of being in the presence of something greater than oneself [6]. However, it has also been recently emphasized that affective responses to art are more diverse, and often include emotions such as sadness [7] and nostalgia [8], which are also experienced in other everyday situations that do not involve the contemplation of artworks. If we look into more basic features of a painting, i.e. the usage of the basic colors (red, green and blue) along with their offshoots, earlier works suggest that warm colors – such as red, yellow and orange – can spark a variety of emotions ranging from comfort and warmth to hostility and anger, while cool colors – such as green, blue and purple – often spark feelings of calmness as well as sadness. However, all of these are psychological studies based on human response data; to date there has been no study which computationally classifies the emotional appraisal corresponding to a group of paintings. In this work, we have tried to evaluate the long-range correlations corresponding to these three color components in paintings. In recent years, the use of musical stimuli as an important means of emotional appraisal is being developed with special focus on the cross-modal transfer of emotions.
While most of these studies look into the psychological and cognitive aspects of musical and visual stimuli, the source characteristics of these stimuli are largely neglected, mainly due to the lack of robust features to quantify them. The development of the International Affective Picture System (IAPS) was followed by a similar collection of sounds, the International Affective Digitized Sounds (IADS) – a series of naturally occurring human, non-human, animal and environmental sounds (e.g., bees buzzing, applause, explosions). In two experiments by Bradley and Lang (2000), it was shown that valence and arousal ratings of these sounds were comparable to those of affective pictures from the IAPS. On a physiological level, emotionally arousing sounds elicit large electrodermal activity, which is generally known to be sensitive to the arousal of emotional stimuli. In this paper, for the first time, we look to classify emotional sound and visual stimuli solely from their source characteristics, i.e. the time series generated from the audio signal and the 2-dimensional matrix of pixels generated from the affective picture stimulus. The sample data consist of 6 audio signals of around 10 seconds each and 6 affective pictures, of which 3 each belong to positive and negative valence respectively. The emotional ratings corresponding to the visual and audio stimuli were standardized a priori with the help of different psychological tests and corroborated with standardized measures present in the literature. The main aim of this work is to ratify the results of psychological tests in the perceptual domain with the quantitative mathematical output obtained from the source signals themselves. As a powerful mathematical tool, fractal theory, initiated by Mandelbrot in the 1960s, has been widely applied to many areas of the natural sciences.
Since a simple iterative algorithm in fractal theory can generate a variety of complex images, the fractal dimension is considered an effective measure of the complexity of the target object. We use a robust non-linear data analysis tool called Detrended Fluctuation Analysis (DFA) to calculate the long-range temporal correlations, or the Hurst exponent, corresponding to the auditory signals. Similar non-linear analyses have previously been used in the scientific community to comprehend the underlying complexities in these inherently convoluted audio signals [10-21]. On the other hand, the 2D analogue of the same DFA technique has been applied to the array of pixels corresponding to affective pictures of contrasting emotions, which essentially gives the long-range spatial correlations of the individual color components. We used the scaling exponent (or Hurst exponent) obtained from the audio clips and the visual images as a robust parameter to quantify their emotional valence. Thus we have a single unique scaling exponent corresponding to each 1D audio signal and three scaling exponents corresponding to the red/green/blue (RGB) components of each visual image. In this way we have been able to provide a quantitative classification of emotional cues in the auditory and visual domains using the source signals themselves. Further, the correlation features among the paintings as well as among the audio clips have also been computed using the 2D/1D Detrended Cross-Correlation Analysis (DCCA) technique, which essentially gives the degree of correlation within the individual domains. To conclude, for the first time we propose a novel algorithm with which emotional arousal can be classified in a cross-modal scenario using only the source audio and visual signals, while also attempting a correlation between them. The study is expected to go a long way in research on the multimodal interaction of emotional cues across domains. The results and implications are discussed in detail.
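As an illustration, the 1D DFA procedure used for the audio signals can be sketched as follows. This is a minimal NumPy sketch with illustrative scale choices and first-order (linear) detrending; the function name `dfa` and its parameters are our assumptions, not necessarily the exact settings used in the study:

```python
import numpy as np

def dfa(signal, s_min=16, n_scales=12, order=1):
    """1D DFA: estimate the scaling (Hurst-type) exponent alpha of a time series."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    # Profile: cumulative sum of the mean-subtracted series
    profile = np.cumsum(x - x.mean())
    # Logarithmically spaced window sizes between s_min and n/4
    scales = np.unique(np.geomspace(s_min, n // 4, n_scales).astype(int))
    flucts = []
    for s in scales:
        n_seg = n // s
        segments = profile[:n_seg * s].reshape(n_seg, s)
        t = np.arange(s)
        # Detrend each segment with a least-squares polynomial fit,
        # then take the mean squared residual
        sq = [np.mean((seg - np.polyval(np.polyfit(t, seg, order), t)) ** 2)
              for seg in segments]
        flucts.append(np.sqrt(np.mean(sq)))
    # alpha is the slope of log F(s) versus log s
    return np.polyfit(np.log(scales), np.log(flucts), 1)[0]
```

For uncorrelated noise this yields alpha near 0.5, while an integrated (random-walk-like) signal yields alpha near 1.5, matching the interpretation of the exponent used later in the paper.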
EXPERIMENTAL DETAILS: Choice of three pairs of audio and visual stimuli
Table 1: Psychological ratings of the audio clips chosen for analysis
Clip No.   Anger   Fear   Happy   Sad    Target
1          1.00    1.00   7.33    1.00   HAPPY
2          1.00    1.00   7.17    1.17   HAPPY
3          1.00    1.00   7.17    1.00   HAPPY
4          1.17    1.00   1.00    7.67   SAD
5          1.00    1.33   1.17    7.50   SAD
6          1.00    1.67   1.00    7.50   SAD
Table 2: Psychological ratings of the paintings chosen for analysis
Image No. (Painting Name)   Anger   Fear   Happy   Sad   Target   Painter
In this way, we have a standardized measure of each of the stimuli corresponding to both the auditory and visual domains. A correlation study is performed across the cross-modal domain to establish the degree of emotional appraisal corresponding to the stimuli used.
METHODOLOGY:
This section describes the steps for computing the Hurst exponent using the two-dimensional DFA algorithm for a grayscale image $I$. The steps are as follows:

1) The profile $x_{i,j}$ is computed using
$$x_{i,j} = \sum_{n=1}^{i}\sum_{m=1}^{j}\left(I_{n,m} - \bar{I}\right)$$
where $m = 1, 2, \cdots, M$, $n = 1, 2, \cdots, N$, $I_{n,m} \in \{0, 1, \cdots, 255\}$ is the brightness of the pixel at the coordinates $(m, n)$ of the grayscale image, and $\bar{I}$ represents the mean value of $I_{n,m}$.

2) $x_{i,j}$ is divided into small square regions of size $s \times s$, where $s$ is set as $s_{\min} \approx 5 \le s \le s_{\max} \approx \min\{M, N\}/4$.

3) An interpolating surface for $x_{i,j}$ is computed in the $l$-th small square region of size $s \times s$ by a multiple regression procedure:
$$G_{i,j}(l, s) = a_l\, i + b_l\, j + c_l$$

4) The variance in the $l$-th small square region is computed for $s = s_{\min}, s_{\min}+1, \cdots, s_{\max}$:
$$F^2(l, s) = \frac{1}{s^2}\sum_{i=1}^{s}\sum_{j=1}^{s}\left(x_{i,j} - G_{i,j}(l, s)\right)^2$$

5) The root mean square fluctuation $F(s)$ is computed as
$$F(s) = \left[\frac{1}{L_s}\sum_{l=1}^{L_s} F^2(l, s)\right]^{1/2}$$
where $L_s$ denotes the number of small square regions of size $s \times s$.

6) If $x_{i,j}$ has a long-range power-law correlation characteristic, then the fluctuation function $F(s)$ behaves as
$$F(s) \propto s^{\alpha}$$
where $\alpha$ is the two-dimensional scaling exponent, a self-affinity parameter representing the long-range power-law correlation characteristics of the surface.

For investigating power-law cross-correlations between different simultaneously recorded time series in the presence of nonstationarity, 1D Detrended Cross-Correlation Analysis (DCCA) [3] has been used in many cases. Here we generalize it to a 2-dimensional analogue to extract the degree of correlation present between different paintings. The following describes the steps for computing the cross-correlation coefficient using the two-dimensional DCCA algorithm for two grayscale images $A$ and $B$.
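The six 2D-DFA steps above can be sketched in code as follows. This is a minimal NumPy sketch (the function name `dfa2d` and the non-overlapping tiling are our assumptions; boundary pixels that do not fill a complete s × s square are discarded, one common convention):

```python
import numpy as np

def dfa2d(image, s_min=5):
    """2D DFA scaling exponent (alpha) of a grayscale image, following steps 1-6."""
    I = np.asarray(image, dtype=float)
    M, N = I.shape
    # Step 1: double cumulative sum of the mean-subtracted image gives the profile
    x = np.cumsum(np.cumsum(I - I.mean(), axis=0), axis=1)
    scales = np.arange(s_min, min(M, N) // 4 + 1)
    F = []
    for s in scales:
        ii, jj = np.meshgrid(np.arange(s), np.arange(s), indexing="ij")
        # Design matrix for the plane fit G(l, s) = a*i + b*j + c (step 3)
        D = np.column_stack([ii.ravel(), jj.ravel(), np.ones(s * s)])
        var = []
        # Step 2: non-overlapping s x s squares
        for i0 in range(0, M - s + 1, s):
            for j0 in range(0, N - s + 1, s):
                patch = x[i0:i0 + s, j0:j0 + s].ravel()
                coef, *_ = np.linalg.lstsq(D, patch, rcond=None)
                # Step 4: variance of the detrended square
                var.append(np.mean((patch - D @ coef) ** 2))
        # Step 5: root-mean-square fluctuation over all L_s squares
        F.append(np.sqrt(np.mean(var)))
    # Step 6: alpha is the slope of log F(s) against log s
    return np.polyfit(np.log(scales), np.log(F), 1)[0]
```

Applied to an RGB painting, this function would be called once per color channel to obtain the three exponents discussed in the results.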
The steps are as follows:

1) The profiles $x_{i,j}$ and $y_{i,j}$ are computed using
$$x_{i,j} = \sum_{n=1}^{i}\sum_{m=1}^{j}\left(A_{n,m} - \bar{A}\right), \qquad y_{i,j} = \sum_{n=1}^{i}\sum_{m=1}^{j}\left(B_{n,m} - \bar{B}\right)$$
where $m = 1, 2, \cdots, M$, $n = 1, 2, \cdots, N$, $A_{n,m} \in \{0, 1, \cdots, 255\}$ and $B_{n,m} \in \{0, 1, \cdots, 255\}$ are the brightness values of the pixels at the coordinates $(m, n)$ of the grayscale images, and $\bar{A}$ and $\bar{B}$ represent the mean values of $A_{n,m}$ and $B_{n,m}$ respectively.

2) Both $x_{i,j}$ and $y_{i,j}$ are individually divided into small square regions of size $s \times s$, where $s$ is set as $s_{\min} \approx 5 \le s \le s_{\max} \approx \min\{M, N\}/4$.

3) Interpolating surfaces for $x_{i,j}$ and $y_{i,j}$ are computed in the $l$-th small square region of size $s \times s$ by a multiple regression procedure:
$$G^x_{i,j}(l, s) = a^x_l\, i + b^x_l\, j + c^x_l, \qquad G^y_{i,j}(l, s) = a^y_l\, i + b^y_l\, j + c^y_l$$

4) The detrended covariance in the $l$-th small square region is computed for $s = s_{\min}, s_{\min}+1, \cdots, s_{\max}$:
$$F^2(l, s) = \frac{1}{s^2}\sum_{i=1}^{s}\sum_{j=1}^{s}\left(x_{i,j} - G^x_{i,j}(l, s)\right)\left(y_{i,j} - G^y_{i,j}(l, s)\right)$$

5) The root mean square fluctuation $F(s)$ is computed as
$$F(s) = \left[\frac{1}{L_s}\sum_{l=1}^{L_s} F^2(l, s)\right]^{1/2}$$
where $L_s$ denotes the number of small square regions of size $s \times s$.

6) If the profiles are long-range power-law cross-correlated, then the fluctuation function $F(s)$ behaves as
$$F(s) \propto s^{\lambda}$$
where $\lambda$ is the two-dimensional scaling exponent. The relation between the cross-correlation exponent $\gamma_x$ and the scaling exponent $\lambda$ is
$$\gamma_x = 2 - 2\lambda$$
For uncorrelated data, the cross-correlation exponent has a value of 1; the lower the value of the cross-correlation exponent, the more correlated the data are.
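The 2D DCCA steps can likewise be sketched in code. This is a minimal NumPy sketch under our own conventions (function name `dcca2d`, non-overlapping tiling, and taking the absolute value of the mean detrended covariance before the square root so that the logarithm in step 6 is defined, a common practical choice):

```python
import numpy as np

def dcca2d(A, B, s_min=5):
    """2D DCCA: return the scaling exponent lambda and gamma_x = 2 - 2*lambda."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    M, N = A.shape
    # Step 1: profiles of both mean-subtracted images
    x = np.cumsum(np.cumsum(A - A.mean(), axis=0), axis=1)
    y = np.cumsum(np.cumsum(B - B.mean(), axis=0), axis=1)
    scales = np.arange(s_min, min(M, N) // 4 + 1)
    F = []
    for s in scales:
        ii, jj = np.meshgrid(np.arange(s), np.arange(s), indexing="ij")
        D = np.column_stack([ii.ravel(), jj.ravel(), np.ones(s * s)])  # plane fit (step 3)
        cov = []
        for i0 in range(0, M - s + 1, s):        # step 2: s x s squares
            for j0 in range(0, N - s + 1, s):
                px = x[i0:i0 + s, j0:j0 + s].ravel()
                py = y[i0:i0 + s, j0:j0 + s].ravel()
                rx = px - D @ np.linalg.lstsq(D, px, rcond=None)[0]
                ry = py - D @ np.linalg.lstsq(D, py, rcond=None)[0]
                cov.append(np.mean(rx * ry))     # step 4: detrended covariance
        F.append(np.sqrt(abs(np.mean(cov))))     # step 5: rms over all squares
    lam = np.polyfit(np.log(scales), np.log(F), 1)[0]  # step 6: F(s) ~ s^lambda
    return lam, 2.0 - 2.0 * lam                        # gamma_x = 2 - 2*lambda
```

Note that when the two inputs are identical, the procedure reduces to 2D DFA, so lambda coincides with the DFA exponent of that image.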
RESULTS AND DISCUSSION
In the first part of our work, the DFA exponent was computed for the 6 audio clips and the 6 paintings put to analysis. In the case of the paintings, α_red, α_green and α_blue were computed corresponding to the red, green and blue color components of the painting analyzed. In the following figures, the DFA exponents corresponding to each clip and visual stimulus have been plotted. Fig. 1 shows the scaling exponents for the audio clips, which have been classified a priori as happy and sad, while Fig. 2 shows the scaling exponents for the paintings.
Fig. 1: DFA exponents of the audio clips.
Fig. 2: DFA exponents (color-wise) of the visual stimuli: Sunflower (Image 1), Japanese Vase (Image 2), Almond Trees (Image 3), Tragedy (Image 4), Starry Night (Image 5), Sailboat Sunset (Image 6).
From Fig. 1, it is evident that the scaling exponents of Clips 1 to 3 are lower than those of Clips 4 to 6, i.e. the long-range temporal correlations (LRTC) present in Clips 1 to 3 are weaker than the temporal correlations present in Clips 4 to 6. This can be attributed to various acoustic features of these clips such as tempo, rhythm etc., but the mathematical manifestation is the decrease/increase in long-range temporal correlations. In Fig. 2, it is again evident that the scaling exponents corresponding to Images 4 to 6 are in general higher than those of Images 1 to 3. An interesting observation is that α_green and α_blue, i.e. the scaling exponents corresponding to the green and blue colors, show the maximum increase for Images 4 to 6 (which were classified as evoking sad emotions). Thus, the manifestation of sad emotion can be attributed to the higher order of correlations present in the blue and green components of a painting. In the next part of our work, the degree of correlation between the auditory and visual stimuli is evaluated individually using the DCCA (1D and 2D) technique. A lower value of the cross-correlation exponent (γ_x) denotes a higher level of power-law correlation between the two signals involved, and vice versa. Figs. 3 and 4 represent the values of γ_x for different combinations of auditory and visual stimuli respectively. It is to be noted that in Fig. 4, before calculating the cross-correlation coefficient for the visual stimuli, we took the average of the three color-wise cross-correlation coefficients obtained from the previous analysis for simplification of the results.
Fig. 3: Cross-correlation coefficient for different clips
Fig. 4: Cross-correlation coefficient for different images

In Fig. 3, it is seen that the degree of correlation for audio clips belonging to the same valence is higher than that for clips belonging to opposite valence. The clips rated as "sad" are the ones which show the highest degree of correlation, while the clips rated as "happy" also show strong correlation, though lower than the "sad" ones. The inter-valence correlations are, however, much lower than these, and some pairs even show no correlation at all.
From Fig. 4, it is seen that the degree of correlation among Images 4 to 6 is the highest of all the combinations present here, while the correlation among Images 1 to 3 is the lowest. Thus, we have an indirect classification of emotional appraisal while performing DCCA as well. While the images rated as "sad" show a higher degree of correlation among themselves, the "happy" rated images show a lower degree of correlation. The inter-valence correlation coefficients (i.e. the degree of correlation between the happy and sad images) lie somewhere in between the two.
CONCLUSION
In this work, we have presented a novel algorithm to automatically classify and compare emotional appraisal from cross-modal stimuli based on the amount of long-range temporal correlations present in the auditory and visual stimuli put to use. The important findings of the study can be listed as follows:

1. For both the auditory and visual stimuli, an averaged DFA scaling exponent greater than 1.5 denotes a stimulus belonging to the "sad" category.

2. The DFA scaling exponents corresponding to the blue and green colors are the highest in the case of "sad" images, while the DFA exponent for "happy" images is high for the red color.

3. The DCCA exponent shows that the degree of correlation is strongest among the sad clips, while the amount of correlation is lowest for the inter-valence clips.

4. The averaged degree of correlation for happy images is very low (i.e. below -0.7), while that for sad images is considerably high (i.e. greater than -1.4). The correlation between happy and sad images interestingly lies between the two (ranging between -0.7 and -1.4).

5. The Pearson correlation coefficient is computed from the variation of DFA values belonging to the stimuli from the two modalities, the values of which are found to be as follows:
Happy (audio) vs. Happy (Image) Happy (audio) vs. Sad (Image) Sad (audio) vs. Happy (image) Sad (audio) vs. sad (image)
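The Pearson correlation in point 5 can be computed directly from the per-stimulus DFA exponents of the two modalities. A minimal sketch is shown below; the exponent values used are hypothetical placeholders for illustration, not the study's actual numbers:

```python
import numpy as np

def pearson_r(u, v):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    du, dv = u - u.mean(), v - v.mean()
    return float(np.sum(du * dv) / np.sqrt(np.sum(du ** 2) * np.sum(dv ** 2)))

# Hypothetical per-stimulus DFA exponents (placeholders, not the study's values)
alpha_happy_audio = [0.92, 0.88, 0.95]
alpha_happy_image = [1.10, 1.05, 1.18]
print(pearson_r(alpha_happy_audio, alpha_happy_image))
```

The same call would be repeated for each of the four modality pairings listed in the table above.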
ACKNOWLEDGEMENT:
SS acknowledges the JU RUSA 2.0 Post Doctoral Fellowship (R-11/557/19) and the Acoustical Society of America (ASA) for supporting this research. AB acknowledges the Department of Science and Technology (DST), Govt. of India for providing the DST CSRI Post Doctoral Fellowship (SR/CSRI/PDF-34/2018) to pursue this research work.
REFERENCES:
1. Bradley, M. M., & Lang, P. J. (2007). The International Affective Digitized Sounds (2nd Edition; IADS-2): Affective ratings of sounds and instruction manual. Technical report B-3. University of Florida, Gainesville, FL.
2. Peng, C. K., Buldyrev, S. V., Havlin, S., Simons, M., Stanley, H. E., & Goldberger, A. L. (1994). Mosaic organization of DNA nucleotides. Physical Review E, 49(2), 1685.
3. Podobnik, B., & Stanley, H. E. (2008). Detrended cross-correlation analysis: a new method for analyzing two nonstationary time series. Physical Review Letters, 100(8), 084102.
4. Scherer, K. R., & Zentner, M. R. (2001). Emotional effects of music: Production rules. In Music and Emotion: Theory and Research, 361-392.
5. Zentner, M., Grandjean, D., & Scherer, K. R. (2008). Emotions evoked by the sound of music: characterization, classification, and measurement. Emotion, 8(4), 494.
6. Keltner, D., & Haidt, J. (2003). Approaching awe, a moral, spiritual, and aesthetic emotion. Cognition and Emotion, 17(2), 297-314.
7. Vuoskoski, J. K., & Eerola, T. (2012). Can sad music really make you sad? Indirect measures of affective states induced by music and autobiographical memories. Psychology of Aesthetics, Creativity, and the Arts, 6(3), 204.
8. Barrett, F. S., Grimm, K. J., Robins, R. W., Wildschut, T., Sedikides, C., & Janata, P. (2010). Music-evoked nostalgia: Affect, memory, and personality. Emotion, 10(3), 390.
9. Eerola, T., Lartillot, O., & Toiviainen, P. (2009, October). Prediction of multidimensional emotional ratings in music from audio using multivariate regression models. In ISMIR (pp. 621-626).
10. Sengupta, S., et al. (2017). Emotion specification from musical stimuli: An EEG study with AFA and DFA. In 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN). IEEE.
11. Nag, S., et al. (2017). Can musical emotion be quantified with neural jitter or shimmer? A novel EEG based study with Hindustani classical music. In 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN). IEEE.
12. Sanyal, S., et al. (2019). Music of brain and music on brain: a novel EEG sonification approach. Cognitive Neurodynamics, 13(1), 13-31.
13. Sarkar, U., et al. (2020). Speaker recognition in Bengali language from nonlinear features. arXiv preprint arXiv:2004.07820.
14. Bhattacharyya, C., et al. (2020). From speech to recital - A case of phase transition? A non-linear study. arXiv preprint arXiv:2004.08248.
15. Sarkar, U., et al. (2020). A simultaneous EEG and EMG study to quantify emotions from Hindustani classical music. In Recent Developments in Acoustics (pp. 285-299). Springer, Singapore.
16. Sanyal, S., et al. (2020). Tagore and neuroscience: A non-linear multifractal study to encapsulate the evolution of Tagore songs over a century. Entertainment Computing, 100367.
17. Banerjee, A., et al. (2021). A novel study on perception-cognition scenario in music using deterministic and non-deterministic approach. Physica A: Statistical Mechanics and its Applications, 567, 125682.
18. Banerjee, A., et al. (2017). Neural (EEG) response during creation and appreciation: A novel study with Hindustani raga music. arXiv preprint arXiv:1704.05687.
19. He, J., et al. (2015). Non-linear analysis: Music and human emotions. In 2015 3rd International Conference on Education, Management, Arts, Economics and Social Science. Atlantis Press.
20.