A New Multilabel System for Automatic Music Emotion Recognition
Fabio Paolizzo, Natalia Pichierri, Daniele Casali, Daniele Giardino, Marco Matta, Giovanni Costantini
MMultilabel Automated Recognition of Emotions Induced Through Music
Fabio Paolizzo , Natalia Pichierri , Daniele Casali , Daniele Giardino , Marco Matta , Giovanni Costantini
1 1
Department of Electronic Engineering, University of Rome Tor Vergata. Italy [email protected]
Abstract : Achieving advancements in a utomatic recognition of emotions that music can induce require considering multiplicity and simultaneity of emotions. Comparison of different machine learning algorithms performing multilabel and multiclass classification is the core of our work. The study analyzes the implementation of the Geneva Emotional Music Scale 9 in the Emotify music dataset and the data distribution. The research goal is the identification of best methods towards the definition of the audio component of a new a new multimodal dataset for music emotion recognition. Keywords: multilabel, feature extraction, classification, SVM, optimization, data analysis, music emotion recognition
I. INTRODUCTION
Music has the power of inducing emotions, and human beings exploit such a phenomenon in order to empower a variety of mental states and activities, both positively and negatively. The study of emotions and music has a long and still vibrant tradition. New findings and changes of perspective in the field are not uncommon. More recent is the field investigating music emotion recognition through computational means. Music emotion recognition (MER) is an emerging and cross-disciplinary field spanning information retrieval (audio, symbolic and metadata) and machine learning, on a strong backing of music cognition (semiology of music and psychology) and music theory. Musical stimuli can be categorized according to the emotions that they can induce. As computational means have progressively increased fficiency and provided more accurate results, the contribution of MER to the general field of emotion research has become central in the study of the emotional expressiveness of music. The automatic recognition of the emotions that music can induce through listening is an important part of automated music information retrieval (MIR). Computational approaches for MER through automated retrieval of musical information have shown accuracy that is comparable or surpasses human performance [1]. Emotion induction through music is the process of emotionally affecting a subject through musical stimuli. However, the complexity of human emotions is still neither fully understood from a theoretical perspective, nor fully captured and represented using computational means. Approaches narrowing down the complexity of the phenomenon from specific perspectives are known for the biological [2], the cognitive [3] and the cultural [4]. However, such implications also suggest that perception is multimodal and music stimuli can channel one single emotion or multiple emotions simultaneously. Studies approaching MER and MIR multimodally as well as emotions in their multiplicity are rare. Research in the Musical-Moods project encompasses a cross-modal approach to machine learning for validating a MER computational model. This project investigates music from a multimodal perspective that involve motor, kinesthetic, visual and language besides auditory components, and evaluating results through creative practice with the aim to understand how we feel and attribute meaning when interacting with music technology. The Musical-Moods [5] dataset targets emotions and mental states indexed by language modelling of the participants and comprises audio excerpts, vector-based 3D animations and dance video recordings from automatic music generation through an interactive and music system [6] and professional dancers. In the present paper, we want to identify the best methods to categorize and recognize simultaneous music-induced emotions, in order to know what to expect in the Musical-Moods project before the dataset is produced and use these best methods in the definition of the MIR component of its computational model. here are multiple ways to categorize emotions and music in terms of emotions. A first and simple approach is to have a listener indicate what is the emotion expressed by a set of music stimuli. Difficulties in emotion representation can be encountered in terms of annotator’s subjectivity and language constraints. Moreover, different subjects can have a different opinion about the same emotional content. The need for a universal model able to represents real-world scenarios is paramount. An important first approach is to consider multiple annotation to describe the same music for each music piece of a dataset as different annotators can annotate a subset of the dataset through predefined tags. This known as multiclass annotation [7] [8]. A more nuanced approach capable of better modelling emotion induction is to use continuous values representing different dimensions of the perceived emotions which are attributed to a specific piece [9] [10], typically adopting the valence-arousal
Cartesian expressive space [1]. These dimensions are independent, with valence indicating how a person feels depending on positive or negative evaluations of people, things or events, and arousal indicating the degree of a person ’s activation and his/her inclination to perform actions. It is possible to extend the valence-arousal space by a third dimension as discussed in MIR-based psychology [4]. A more sophisticated approach is to build a dictionary of discernible patterns appearing in sample data on a short timescale and then expressing the contents of a music file using contents of the dictionary. We will focus our study on the latest implementation of this latter approach. II. GEMS EMOTION CATEGORIZATION AND THE EMOTIFY DATASET
The Genevan Emotional Music Scales (GEMS) consists of 45 terms for annotation of musically induced/expressed emotions. Shorter versions of 25 and 9 terms exist [11]. This can be considered as an emotional cluster model [12], investigating array of terms for music related emotions that describes the emotional and psychological response of music annotators. Each annotator could select maximally three tems from the scale in order to describe the emotions, which they felt most prominent when listening to a music track from the dataset. In the dataset, 515 terms are noted for the verbal description of the emotions induced in the listening. The different arrays of terms are then grouped on the results of a set of experiments with different annotators. Study 1 and 2 (Total Participants: 354): a list of terms related to emotions that are relevant to the music is made. Attention is also paid to perceived and induced emotions through 5 groups of listeners with distinct musical preferences. Study 3 (Total participants: 801): the structure of emotions for music induced by Cronbach's alpha factor is examined. You get a model with 9 emotions factors for induced music. Study 4 (Total Participants: 238): the model in Study 3 is replicated and it is found that it better represents the emotions needed to describe the psychology of listening to music [12]. In the present study, we investigate the adoption of GEMS from a machine-learning perspective. The aim is the automatic classification of induced emotions through music using a multilabel approach. In the present study we introduce a new method for feature extraction that integrates different frameworks and best methods for automatic feature selection and grid search for the optimization of the classification parameters. The Emotify dataset includes 400 tracks (44100 Hz, 128 kbps, one minute each) and incorporates four music genres (classical, rock, pop, electronic music), 100 tracks per genre and 8407 annotations total. Annotations in the dataset are collected using a version of the scale representing nine emotions: GEMS-9. The annotations, herein referred to as labels , verify the inducement of an emotion among the following: “ amazement ” , “ solemnity ” , “ tenderness ” , “ nostalgia ” , “ calmness ” , “ power ” , “ joyful activation ” , “ tension ” , “ sadness ” . The label identifies the class a track belongs to, through statistical evaluation. Analysis of emotion annotations.
At a first glance, it is evident that the amount of total annotations differs for each of the 400 tracks in the corpus. The simplest statistical criterion that we can use to describe the distribution of the annotations is to compute the mean positive response of the total annotations. This method provides information regarding the inducement of a specific emotion in the annotators by the music tracks. Because the dataset adopts a multiclass approach, the probability of data sparseness is high when considering the expected mean annotation of a single emotion per each track (2.335), meaning that track can have labels produced by few annotations. We consider a label to be valid if the mean positive response of annotations exceeds a specific threshold, herein defined as consensus threshold . The method produces a score value for each label that represents that emotion in the track. This score is also computed in [7], as capable of preserving information regarding the emotional expressiveness and emotion induction in listening for each track. 𝑠𝑐𝑜𝑟𝑒 𝑖𝑗 =
1𝑛 ∑ 𝑎 𝑘𝑛𝑘=1
Where: ● i: i-th emotion ● j: j-th track ● ak: presence or absence (0 or 1) of the i-th emotion expressed according to a specific annotator ● n: total number of annotations for the j-th track. The method computes the score without weighing the number of listening sessions for each annotator. . Validity of emotion labels
Advancements in MER can be achieved by generalizations capable of surpassing the intrinsic difficulties existing in producing a universal model of emotion induction. We analyze further the Emotify music corpus and the implementation of the GEMS-based multiclass annotation approach by adopting the consensus threshold technique as the criteria for considering an emotion to be relevant for a certain track. The emotion score is computed as defined in (A), in order to consider a label as valid if exceeding a defined threshold. In this way, a score represents a consistent emotional expressiveness for emotion induction in listening among the annotators regarding a single emotion. We estimate the percentage consensus threshold through a posteriori observation of the average trend of the emotions per track in the dataset. The percentage value represents the proportion of annotators who identify specific emotions as part of the average trend. This average function is monotonically decreasing, and its standard deviation has a decreasing trend that is stable to the unit value . Notably, we identify plateau zones at 25% and 30%. The estimated thresholds have an average of 3.5 ± 1 and 2.5 ± 1 annotated mood per track.
III. PATTER RECOGNITION
Pattern recognition is a branch of machine learning characterized by the analysis and identification of patterns within raw data. Such patterns are known as features. In MER, strong performance was initially reported in tasks of music emotion classification by using audio features only [13], and various algorithms are currently available (i.e., MIRtoolbox [14], Marsyas [15], Psysound [16]). Information retrieval of audio features is a process that allows to extract the characteristics of an audio signal. Features are used to represent the attributes that are present in an audio signal. The strategy which is generally pplied in most approaches to patter recognition focuses on the development of classifiers and the implementation of both selection processes of features and classification algorithms, and list of parameters neglecting the importance of the initial data. In the last few years however, other methods incorporating metadata have demonstrated the need for a multimodal approach with a strong focus on human language processing [17] [18]. Recent research has confirmed this by ranking linguistic features higher than audio features, in terms of valence, and by suggesting that symbolic features are useful only when language data cannot be used[19]. In the Emotify dataset, lyrics are not comprised in the dataset and this is certainly a limiting factor in consideration of scenarios where lyrics are available for the music. However, in approaching an emotion recognition task in music that leverages on audio features only, we challenge scenarios where language data is not available as well as aim to raise the state of the art of audio-based MER. Moreover, the use of GEMS as a means for producing annotations in Emotify provide us with the opportunity to consider audio scenarios from a perspective on language data that can also be useful in classifying the Musical-Moods dataset from the likelihood of occurrence of words used by participants to describe emotions. In the last few years, collection of ground truth for emotion labels has been increasingly carried out by using game(s)-with-a-purpose (GWAP) [20] [21] , as based on the “wisdom of the crowds” phenomenon [22] [23] [24] and thus less expensive than listener surveys [25] [26] or social tags [27]. Recently, the approach has been integrated in methods for multimodal distributed semantics, thus showing that computational models of meaning improves in efficiency when classification of meaning is grounded in perception [28]. As annotators self-report the music induction from listening sessions in accordance to their subjective experience, we can assimilate the task of music emotion labelling to a process of meaning attribution that is grounded in perception. In Emotify, annotations are also produced using a GWAP [29]. In the Musical-Moods project, computational models of meaning are grounded in perception by using a new multimodal game with a purpose. This is a multimodal GWAP for language modelling deployed as art of the Musical-Moods project. M-GWAP is based on mindprint modelling [30]. Mindprint-based classification is a state-of-the-art technique for language data capable of capturing sophisticated linguistic features that involving semantic, syntactic, and valence information. Mindprint are not limited to representations through emotion labels leveraged by domain-specific knowledge of interest [31] ( i.e. , Russell’s [32] [33] or MIREX [4]). From this perspective, the process of generating emotion labels through M-GWAP is superior to GEMS. Moreover, in Emotify, annotators are presented with terms that derive from a process of label definition that is separate from the music annotation task at hand. The emotion labels which they are presented with could simply not reflect the music files that they are required to annotate, as well as their culture and subjectivity. On the contrary, In M-GWAP, players can freely describe the emotions that they identify in the music using their own words, thus providing a model for categorization that is closer to real-world scenarios. M-GWAP incorporates a component for the annotation of music alone. In this, scores computed for emotion labels are based on popularity ranking. An emotion term needs to pass a threshold of popularity before it is added as a real annotation for a multimodal excerpt. Popularity is here assimilated to form of consensus. In the present article, we investigate distribution for expectancy of agreement (consensus) on emotion terms with the aim of gathering know-how for then best calibrating the score system of language modelling in the audio component of the M-GWAP.
IV. PREPROCESSING
We extract audio features from the audio files of the Emotify dataset by using different computational frameworks such as MIRToolbox and Marsyas for deriving statistical functions, and Psysound for extracting psychoacoustical features. . Discretization
When classifying continuous data, a variable amount of discretization error is always present. Reducing this amount to a level that can be considered as negligible in relation to the music emotion modelling, before the actual classification occurs, is important.
Kononenko’s discretization [34] is a process that provides optimal results for audio-based tasks. This is a recursive algorithm, whose stopping criterion is based on the
Minimum Description Length principle (MDL). MDL considers regularities in a given set of data to compress the data and describe it through fewer symbols than in their original representation. Discretization operates by dividing values of the continuous attribute into intervals and representing each interval by a string. Discretization reduces the helps reducing the learning complexity and modelling correlation between attributes and target class. A best number of thresholds for the discretization is selected through a supervised method. The algorithm estimates probabilities of two near instances from the same class and two near instances from different class to have an attribute value that is the same for each instance in a pair. The first pair represents the predicative power of the attribute, while the second testifies against that probability. These two probabilities are combined into a single parameter. The significance of the attribute increases proportionally with parameter value. The algorithm adopts a measure of dissimilarity between attribute and class, which is as:
𝐷(𝐴, 𝐶) = 𝐻(𝐴|𝐶) + 𝐻(𝐶|𝐴)𝐻(𝐴&𝐶)
Where A is the attribute, C is the class, and H represents the entropy. Entropy is here defined as: 𝐻(𝑆 𝑖 ) = − ∑ 𝑃(𝐶 𝑖 ; 𝑆 𝑖 )log(𝑃(𝐶 𝑗 , 𝑆 𝑖 )) 𝑘𝑗=1 here given k classes ( C ; … ; C k ), S i is the set of instances and P(C j ; S i ) is the proportion of instances in S i having class C j . B. Feature selection
While, the extraction of audio features is a major step in audio classification tasks, operating a reasoned selection of features can strongly improve the results of the classification. An excessively large number of features could lead to various disadvantages, such as increased processing time, reduced classification accuracy and information redundancy. In the training phase of the classification task, the use of too many features may expose to the risk of data overfitting. In order to select the most relevant features from a large feature set, automatic selection is highly advisable and various algorithms exist for the automatic selection of most-relevant features. In the present article, we compare results from the
CFS (Correlation-based Feature Selection) algorithm with results obtained using the t-Student distribution test.
B1. CFS
The Correlation-based Feature Selection (
CFS ) algorithm [35] automatically selects features with a strong correlation to the primary class and a weak correlation to the other features. This way, selected features have high relevance and low redundancy. The formula used is the following: 𝑀 𝑠 = kr 𝑐𝑓 √k + (k − 1)r 𝑓𝑓 This formula measures the relevance of a subset feature (S) containing k features, where r stands for the main correlation between feature and class (f ∈ S), as well as for the average inter-correlation between features.
2. T-test
T-Student test can verify if a difference between two sets of data is meaningful or not. Such difference can be considered meaningful when it can be attributed random variations of the same statistic distribution. For each extracted feature a comparison is carried out between the values of the feature when a specific emotion is present and when it is absent. The test provides a p-value indicating the probability for the difference to be due to chance. If the p-value is less than 0.05 value, variation is due to a correlation between feature and emotion, which cannot be attributed to chance. By repeating t-test for each emotion and each feature, we can finally select only features with a p-value lower than 0.05 values, in order to identify a meaningful feature, set for each emotion. Furthermore, we consider different values of p in order to analyze the level of consistency in correlation of features and emotion in the dataset. We compute the CFS algorithm and t-test for each emotion class. We define the feature set as the union of all selected features for each class. From this set we will select a best sub-set of features concurring to the best automatic classification process of music induced emotions. The aim is to improve the quality of input data by reducing information redundancy.
V. CLASSIFICATION
Extracted features are part of the raw data that is used to classify tracks into nine distinct emotions (or classes). As mentioned, emotion multiplicity as approached through GEMS and Emotify (GEMS-9) leads to a multi-labeling problem. From a classification perspective, we represent efficiently the nine emotions as a computational problem that is both multilabel and multiclass. The nine emotion labels are not exclusive (multilabel) and each can either be present or absent for a music track (multiclass). In such a lassification, each music track in the dataset is assigned to multiple emotion labels, creating multiple classification patterns that are valid at the same time [36]. In the present study, we approach the problem by adopting three different types of classifiers: Support Vector Machine (SVM), Artificial Neural Networks (ANN) and Bayesian Classifiers. SVM are pattern recognition systems mostly developed for binary class classifications. On the contrary, algorithms such as backpropagation developed for Perceptron Artificial Multilayer Neural Networks are easily adoptable for multiclass classification tasks. Finally, Naïve Bayesian Classifiers are useful when independence between features can be assumed. As we employ different consensus thresholds to the analysis of the distribution of the dataset for identifying which track is assigned to which emotion label(s), we expect each classifier to respond differently according to its specific characteristics. In the present section, we highlight these characteristics and their relevance to the present study. Support Vector Machine (SVM) is a method that represents samples as points on a hyperplane where each class is identified by a position in the space, distant as much as possible from the other class positions. New samples placed in a specific position of the space are classified in relation to their proximity to each class, identified by the nearest class. In SVM, the key of an efficient classification lies in the presence of a precise optimal separation between the different classes. This separation, known as maximum margin is computed by SVM algorithms and corresponds to the hyperplane with highest margin in relation to the support vectors, being the support vectors parallel to the hyperplane and constituted of the nearest points from both sides
Error! Reference source not found. . As the separating curves are not always linear, data are represented through a kernel function projecting the information on a larger dimensional space
Error! Reference source not found. . In this study, we adopt an SVM algorithm that uses the linear Kernel Sequential Minimal Optimization (SMO) reporting to be the most accurate option. This approach was successfully implemented in studies of speech emotion recognition [38] [39] [40]. rtificial Neural Networks (ANN) are mathematical learning models that emulate the biological neural network. In this way, the artificial neural network is constituted by several neuron layers that receive information through input connections and send information to the next layer through output connections. The basic element in data processing is called perceptron, the simplest mathematic representation of the biological neuron. In ANNs
Error! Reference source not found. , the perceptron receives input from other perceptra or from the environment, xj ∈ ℜ, j = 1, . . . , d , and links that input at the same time to both synaptic weights wj ∈ ℜ and to the output y . As w0 allows to generalize the model, y is a function of the weighted sum of the inputs. This function, so- called “activation function”, can be either linear or non-linear (e.g. a sigmoid and hyperbolic tangent). 𝑦 = ∑ 𝑤 𝑗 𝑥 𝑗 + 𝑤 In this study, we adopt an ANN using a backpropagation method. Backpropagation is a method used in combination with gradient descent as the optimization approach requiring a desired output (target) for each input value to calculate the gradient of the loss function (cost function). Bayesian classifiers [42] are based on the Bayes theorem. This theorem links the probability that one event (A) occurs in relation to another (B) to the probability that the latter occurs in relation to the former. The joint probability of these two variables persists also after an inversion of the variables, that is:
P (A,B) = P (B,A).
Also, the probability that A and B occur simultaneously is equal to the probability that A occurs multiplied by the likelihood that B occurs once A is happened, that is: P (A,B) = P (A) * P (B|A) ; and vice versa:
P (B,A) = P (B) * P (A|B) . These equations can be equalized according to the joint probability, resulting in the Bayes Theorem of the conditional probability:
𝑃(𝐴|𝐵) = P(A) ∗ P(B|A)P(B) hrough the Bayes theorem it is possible to compute the likelihood for an unknown event to occur according to the presence of specific conditions having a causal relationship with the former
Error! Reference source not found. . Therefore, Bayesian classifiers are particularly useful in applications where the relation between the values of the attributes and those of the class is not deterministic because either of data noise, of attributes not modelling some characteristics of the phenomenon or of difficulties in quantifying certain aspects of the phenomenon. In the present study, we adopt a Naïve Bayesian classifier assuming independence of the features each with the other. As the approach is a simplification of the real-world problem, we expect the classification using Naïve Bayesian classifier to perform better in term of generalizations and worse where a strong correlation between features exist.
VI. EXPERIMENTS ON EMOTIFY DATASET
The main goal of our experiments is to compare the performance of different machine learning methods in an emotion classification task. The present section reports the tools adopted, describes the classification algorithms and presents results from the different classifiers. An initial consideration can be made regarding the audio format that is used for the audio files of the Emotify dataset. This is the popular MPEG-2 Audio Layer III ( mp3 ) format, which is a lossy format based on the concept of auditory masking and sound data compression occurring without major distinguishable difference in the listening when comparing to the uncompressed sound data. In a classification task of induced emotions through music, the sound data that is lost in the compression could be relevant for the features in two ways. Features mostly related to perception will be affected by the loss of sound data, although only minorly, because of the auditory masking phenomenon on which he compression is based. In this case, compression operates similarly to a feature selection, as information that is not relevant in the listening is discarded before the actual feature selection process occurs. Differently, every feature that is not related to perception will potentially be affected by the compression, as of the discarded sound data. Also, we may question whether emotions induced through music are the result of acoustic phenomena occurring only at a conscious level. Audio files which are almost identical in the listening could still affect a listener in different ways. For example, frequencies which are outside of the maximum range of human hearing could still affect a listener, both at a physiological level through resonance of the sound waves in the listener’s body, and at a perceptual level through subliminal perception — a sound property is too weak to be perceived yet it influences the listener — resulting in a different emotional induction overall. For non-perceptual features, mp3 compression could therefore generate a disturbance in the features ’ values, reduce or even eliminate the relevance of subliminal perceptual features. As the Emotify datasets does not provide information regarding the uncompressed sound data, no comparison is possible to that. However, we should take these considerations into account when drawing our conclusions. Extracting features using MIRToolbox requires the MATLAB environment and accepts audio files using the wave format ( wav ). While lost sound data cannot be recovered by converting the files from mp3 to wave , we still need to convert them in order to carry out the feature extraction task. For this reason, a Java program is created for bulk conversion of each audio file from the original mp3 format of the dataset to wave . In order to extract features values that can be meaningful in terms of emotion induction we cannot simply analyze each 1-minute audio files in its entire length, because emotions have shorter duration than a minute and one single track could convey different emotions in different moments. Instead, we consider a chronologically moving time-window on the signal. The length of this window must be short enough hat only one emotion at a time can be captured rather than two temporally subsequent emotions expressed in the music. For this, the length of the window, so- called “frame”, must be lower than the half of the duration of the quickest possible emotion induction through music. The duration of a peak emotional response varies also according to the specific emotion that is induced. Tests report durations as low as 0.2 s under specific circumstances of music emotion induction [44]. In other studies, time-windows of 0.5 s and shifts of 0.25 s are used for short-term features, while long-term features can be meaningfully retrieved from time-windows of 3 s, shifted in 1 s steps [28]. Similarly, we characterize the duration of each window position (frame) by a length of CFS and t-test methods. Additionally, we repeat the classification by carrying out a discretization before the feature selection. We used the Waikato Environment for Knowledge Analysis (Weka 3.8) [16] in command line and standard mode for feature selection and classification. Configuration parameters include a simple estimator (search algorithm k2) for the Bayesian classifier, a logistic polykernel for SMO in SVM, and 80 neurons in a single hidden layer with 0.3 learning rate , 0.2 moment and a training gradient descent using momentum algorithm for the ANN. xtracted raw features are statistically synthesized determining a total of 476 total features. Discretization and feature selection improve the analysis. Tables 1 and 2 show results of classification. These results are reported in terms of accuracy , defined as correctly classified instances, per emotion and threshold of and threshold of 25, respectively. As mentioned, these thresholds represent the mean of annotators who identify the presence of an emotion and allows us to understand which emotions are better categorized by GEMS-9 in the Emotify dataset. Table 1: Classification results using a consensus threshold at 30%.
RAW CFS DISCR+CFS Consensus @ 30% SMO BAY MLP SMO BAY MLP SMO BAY MLP amazement 91.60 90.05 89.50 92.45 92.05 96.25 96.65 95.45 97.00 calmness 76.75 76.55 76.00 80.05 81.95 74.90 72.25 83.70 74.80 Joyful activation 84.35 79.95 83.00 87.65 87.70 87.20 76.85 87.75 83.45 nostalgia 72.10 74.30 72.00 74.85 80.35 71.45 75.10 81.80 77.95 power 86.20 87.55 88.50 91.20 91.45 89.95 91.85 95.35 92.40 sadness 79.20 74.45 80.50 81.70 80.25 83.35 85.60 89.15 86.70 solemnity 65.25 65.15 67.00 69.05 70.60 65.40 78.10 80.75 77.55 tenderness 74.50 71.75 74.50 75.40 75.60 72.65 87.90 89.65 90.50 tension 67.05 65.95 67.50 77.05 74.90 76.70 84.55 88.40 84.80
CLASS. MEAN
CLASS. STD
TTEST (p=0.05) TTEST (p=0.01) DISCR+TTEST (p=0.05) DISCR+TTEST (p=0.01) Consensus @ 30% SMO BAY MLP SMO BAY MLP SMO BAY MLP SMO BAY MLP amazement 97.85 90.6 97.3 98.25 91.90 95.75 95.85 90.95 96.50 95.70 92.45 96.00 calmness 79.75 77.7 79.7 80.55 76.55 76.25 76.75 79.25 76.00 77.60 81.20 76.75 Joyful actvation 87.05 79.95 88.4 87.20 80.80 88.25 84.55 83.35 86.00 80.80 82.10 82.75 nostalgia 76.25 74.45 75.75 74.20 72.90 66.75 72.15 72.65 71.25 68.35 73.70 70.50 powee 91.4 87.6 90.4 90.60 85.00 90.50 89.65 90.65 91.00 90.20 90.40 92.25 sadness 83.6 74.9 79.65 86.35 79.65 79.75 83.05 82.60 82.75 81.25 83.20 81.00 solemnity 76.3 66.85 73.45 71.70 61.10 70.25 71.85 63.80 66.00 68.45 64.65 65.00 tenderness 80.6 72.3 75.9 80.50 75.10 70.25 81.40 77.30 81.75 78.15 76.50 77.75 tension 80.95 69.65 78.35 79.90 75.10 76.25 80.10 82.15 82.00 71.60 76.35 72.25
CLASS. MEAN
CLASS. STD s expected, MLP using a single hidden layer provides the most accurate classification when considering all raw features, although with an accuracy (77.61%) that is only slightly superior to the results obtained on the same feature set using SVM (77.44%). We achieve an improvement using
CFS for all three classifiers and one further when running a discretization task before the feature selection. Using discretization, the highest accuracy is obtained with the Bayesian classifier and is attested at 88%. In comparison to the classification of raw features, an accuracy increase is also noticed when using t-test . For t-test , the highest value is 83% for SVM at p-value = 0.05 . Table 2: Classification results using a consensus threshold at 25%.
RAW CFS DISCR+CFS Consensus @ 25% SMO BAY MLP SMO BAY MLP SMO BAY MLP amazement 84.55 77.60 85.50 85.40 84.75 81.90 89.55 91.15 90.80 calmness 76.10 76.40 76.00 85.35 81.85 79.75 79.45 86.20 83.55 Joyful activation 82.30 81.10 84.50 88.45 88.35 85.80 77.45 87.55 78.95 nostalgia 75.90 79.45 76.00 82.75 87.80 78.30 80.80 84.60 79.40 power 79.90 82.25 80.00 87.45 87.95 81.80 86.95 90.30 87.15 sadness 73.80 72.80 72.00 80.35 79.60 77.60 81.40 87.35 83.85 solemnity 58.85 58.25 57.50 71.05 64.20 64.35 72.90 82.25 75.95 tenderness 68.55 68.05 72.00 78.85 76.95 70.40 79.75 85.50 80.95 tension 63.60 63.20 62.50 76.10 67.00 70.70 78.95 86.80 83.35
CLASS. MEAN
CLASS. STD
TTEST (p=0.05) TTEST (p=0.01) Consensus @ 25% SMO BAY MLP SMO BAY MLP amazement 85.05 76.7 83.6 93.20 84.45 92.75 calmness 80.35 77.3 80.55 84.20 79.15 81.25 Joyful activation 87.45 80.75 87.7 86.20 82.05 89.00 nostalgia 81.3 80.7 80.75 81.15 80.00 79.50 power 90.7 83.25 92.2 89.75 84.70 89.00 sadness 82.35 73.55 80.05 79.30 74.45 76.50 solemnity 70.75 58.85 66.35 70.25 62.35 57.00 tenderness 77.9 69.05 75.05 75.30 66.25 70.25 tension 80.4 65.95 76.2 80.40 66.75 78.25
CLASS. MEAN
CLASS. STD n the case of threshold = 25 , an accuracy improvement is noted when moving from raw features to both discretization and
CFS . The best performing classifier in this case is Bayesian with 86.86% accuracy compared to 74% accuracy of MLP on raw features. Improvements are also present when using t-test , SVM and p-value = 0.01, albeit these are only marginal improvements in comparison the same case but using
CFS instead . For both cases the highest accuracy value is achieved when using discretization together with
CFS and the Bayesian classifier. For t-test , the highest value is achieved for SVM although at different p-value s for the different thresholds. Although the threshold at 30% provides better results in terms of classification, this is also a consequence of having a lower number of moods per track, which translates in a lower uncertainty for the classification. On the contrary, the threshold at 25% is a better representation of real-world scenarios, as considering a greater potential multiplicity of induced emotions. Figure 2 and figure 3 show the classification results by considering mean and standard deviation using the different thresholds for each emotion. Figure 2. Classifiers mean efficiency using consensus threshold at 30%.
Figure 3. Classifiers mean efficiency using consensus threshold at 25%. Notably, the difference in accuracy regarding the recognition of the specific emotions is consistent to previous research on the Emotify dataset [7]. In that, “ all the feature sets demonstrate the same pattern of success and failure ” ; we take that consistent classification results are reported across the different feature sets. This leads the authors to the conclusion that amazement and joyful activation must be emotional categories, which “ are very different in their subjectiveness ” . Our results however do not avail these conclusions, as we report high accuracy values at around 90% for both emotions when considering both thresholds plateau (25 and 30), in our testing. On the contrary, we report the lowest accuracy for the classification of the solemnity label. Notably, a direct comparison to the authors’ methods of classification is possible by considering the root mean square error (RMSE) and its standard deviation across the multiple cross validation rounds. Table 3 compares the RMSE and its standard deviation for each of our methods using the combined of discretization and CFS. RMSE is the standard deviation of prediction errors measuring the distance of the data points from the regression line; it measures the bsolute fit of the model to the data. The comparison shows that all our features set outperform considerably those adopted by the authors’, in all classification cases and for all mood labels. For the different consensus thresholds of 30% and 25%, we respectively attest Mean Classification Accuracy at 88% and 86.2% and Root Mean Square Error (RMSE) at 0.29 ± 0.07 and 0.31 ± 0.07. In a direct comparison, our best method provides improvement for RMSE of 0.42-0.44 ± 0.13 from the best result on the same dataset from [7]. Table 3a. Comparison of RMSE from literature classification methods for the Emotify dataset and our methods using discretization and CFS at different consensus thresholds.
Literature Feature Set
MIRToolbox + PsySound OpenSmile MP + Harm Musicological
RMSE RMSE RMSE RMSE
Amazement .99 ± .16 .95 ± .13 1.05 ± .11 .85 ± .24
Calmness .83 ± .09 .89 ± .07 .78 ± .09 .70 ± .16
Joyful Activation .77 ± .11 .80 ± .08 .75 ± .11 .58 ± .15
Nostalgia .82 ± .12 .89 ± .07 .88 ± .10 .69 ± .16
Power .82 ± .13 .84 ± .09 .80 ± .16 .78 ± .26
Sadness .87 ± .11 .96 ± .18 .88 ± .12 .93 ± .20
Solemnity .80 ± .09 .95 ± .13 .89 ± .15 .84 ± .22
Tenderness .84 ± .10 .95 ± .07 .85 ± .18 .50 ± .19
Tension .87 ± .20 .94 ± .19 .85 ± .13 .71 ± .36
Best Mean RMSE .73 ± .21 able 3b. Comparison of RMSE from literature classification methods for the Emotify dataset and our methods using discretization and CFS at different consensus thresholds.
Our Feature set
Bayes @30% Bayes @25% SMO @30% SMO @25% MLP @30% MLP @25%
RMSE RMSE RMSE RMSE RMSE RMSE
Amazement .17 ± .10 .26 ± .07 .15 ± .11 .31 ± .09 .12 ± .12 .24 ± .14
Calmness .36 ± .07 .34 ± .07 .52 ± .07 .45 ± .08 .45 ± .14 .34 ± .14
Joyful Activation .31 ± .09 .32 ± .10 .47 ± .08 .47 ± .08 .35 ± .16 .41 ± .14
Nostalgia .39 ± .07 .34 ± .08 .49 ± .08 .43 ± .10 .42 ± .14 .41 ± .14
Power .19 ± .09 .27 ± .08 .27 ± .09 .35 ± .07 .19 ± .14 .30 ± .14
Sadness .28 ± .07 .30 ± .08 .37 ± .07 .42 ± .10 .29 ± .14 .34 ± .14
Solemnity .36 ± .07 .36 ± .07 .46 ± .08 .52 ± .07 .42 ± .14 .44 ± .14
Tenderness .26 ± .08 .33 ± .07 .34 ± .07 .44 ± .07 .22 ± .13 .37 ± .16
Tension .30 ± .07 .32 ± .07 .39 ± .08 .45 ± .08 .32 ± .14 .34 ± .13
Best Mean RMSE .29 ±.07 .31 ±.07
VII. CONCLUSION
Music emotion induction is a phenomenon whose complexity is simplified in most approaches to the automatic recognition of emotion in music. A listener can perceive multiple emotions simultaneously. Subjective and cultural differences can constitute a bias in annotations tasks. In the present paper, we investigate classification of multiple and simultaneous emotions that can be induced/expressed through individual music tracks, with the aim of identifying best methods for information retrieval and computational learning models of emotions in music. We test a variety of approaches on the Emotify dataset, which adopts the GEMS-9 model for the categorization of nine different labels of emotion that can be expressed/induced simultaneously in music by a same annotator. The analysis of Emotify and GEMS-9 will provide us the opportunity to design better a new multimodal dataset for MER, MIR and computational creativity. e approach the scenario of emotions expression/induction through music as a multilabel and multiclass problem, where multiple emotion labels can be adopted for the same music track by each annotator (multilabel), and each emotion can be identified or not in the music (multiclass). We consider this approach as better approximation to real-world scenarios in comparison to the use of exclusive labels for the description of emotions in music. We consider different distributions of annotations and emotion labels in a corpus by considering emotion labels as valid for a track when the mean positive response of annotations per label surpasses a specific consensus threshold . Thresholds at 25% and 30% are considered. Furthermore, we compare the efficiency of different approaches for discretization, feature selection and classification for both threshold values. Best performance of classification tasks is achieved by using a Naïve Bayesian classifier at 88% mean accuracy for threshold = 30 . Results from the different classification tasks confirm that pre-processing techniques for application (
CFS and t-test ) and joint discretization of features values to pre-processing improve performance in terms of accuracy, in comparison to classifications of unprocessed corpora. The most evident improvement is obtained when discretization of the dataset is applied before a
CFS approach to feature selection. Using t-test for the selection of features, different amount of improvement are achieved in relation to the specific classification procedure used. This could be explained by a repetition of t-test for every single feature, which does not allow us to consider the correlation between features. This constitutes a limit mostly for the Bayesian classifier because it assumes independence between features. However, this classifier remains a choice of preference both for simplicity of approach and lower risk of data overfitting. Methods using a 25% consensus threshold show a significant improvement for the Bayesian classifiers. Most consistent improvement is achieved through discretization of features values and
CFS selection (from 73.23% to 86.86%). The least noticeable improvement is achieved with data processed by t-test ( p = CFS . Best technique among tested is Bayesian classification applied on a discretized dataset preprocessed through
CFS . This is positive, as we expected the Naïve Bayesian to perform better without correlation of features, which is exactly what
CFS enables for the task in our testing, confirming that uncorrelated features are more effective than most meaningful features from t-test when adopting conditional probability as classification approach. This method is most performant when considering a 30% consensus threshold (from 76.19% to 88%). Therefore, a higher agreement between the annotations suggests that this may be better captured by a conditional approach, as on the contrary, the same annotations are less effected for 25% consensus threshold (from 73.48% to 81.73%). In a direct comparison to the existing reference study on the same Emotify dataset from [7], our best method improve RMSE of 0.42-0.44 ± 0.13 from the best result on the same dataset. A first consideration can be drawn regarding the use of consensus thresholds as a means of validation for multiple emotion labels to identify the number of available labels used for describing effectively induced/expressed emotions through music. The high value of mean performance achieved by the classifiers confirms the validity of consensus threshold as a method for the analysis of the distribution of annotations per emotion label. In the definition of a new multilabel dataset, we should therefore aim at gathering enough annotations that can represent a consensus for emotion labels proportional to the datapoints distribution that is equal or higher to threshold values adopted in the present study; at least 25% ‒
30% of the total annotators will need agreeing regarding the identification of a specific emotion in the same music excerpt, for that emotion label to be considered as valid and the classification task to achieve strong performance. In terms of future implementations in the Musical-Moods project, this will require fine tuning the score system of M-GWAP so that players will be motivated by the score system towards a total number of emotion labels that is proportional to the total number of annotations that we expect to gather by eploying this multimodal game with a purpose. As players will be allowed to enter freely any term for describing the item that the M-GWAP will present them, rather than selecting from a limited set of labels as in the use of GEMS-9, risk of data sparseness will be higher than in Emotify. To this end M-GWAP should aim at gathering a greater number of annotations than the mean of annotations per musical genre classified in Emotify. As M-GWAP is based on mindprint modelling, and this approach is superior to GEMS because of the freedom that is granted to the player for describing the emotions they identify in the music, we consider this risk as part of the problem of dealing with emotion modelling technique that attempt to induce the least bias as possible. A further consideration can be made regarding the classification methods. M-GWAP is a game with a purpose that allows gathering emotions labels about a variety of media datatypes. Because of this, we can assume that a Naïve Bayesian classifier should perform efficiently for MER in a new but similar corpus, given a character of multimodality in the variety of the data. For example, should the audio files in the Musical-Moods dataset belong to a music genre that is not represented in Emotify, correlations between features could simply not exist. On the contrary, the use of audio clips of different music genre could be particularly useful towards the definition of a universal model for the automatic recognition of emotions that can be expressed and induced through music. Similarly, we cannot assume that a similar classification method would be efficient for different datatypes. Nevertheless, annotations gathered should converge towards emotion labels identical or semantically close to those adopted in GEMS-9, and a correlation between features exists, as long as the features extracted from the multimodal corpus can be classified with strong accuracy and low RMSE. This was the case of our testing on the Emotify dataset. From this perspective, non-domain features could be useful as a starting point for adopting a cross-modal approach, also confirming that investigating emotion induction/expression from a multimodality of approaches supports classifying complex and structured, simultaneous emotions and achieving solid results when drawing from experiences observed multimodally. ur future work will include a fine tuning of the score system in M-GWAP and a classification first of the audio component of the Musical-Moods database, and then of the video, motion capture data, and language modelling of the participants by using the best methods here identified.
FUNDING ACKNOWLEDGEMENT
The research is supported by the EU through the MUSICAL-MOODS project funded by the Marie Sklodowska-Curie Actions Individual Fellowships Global Fellowships (MSCA-IF-GF) of the Horizon 2020 Programme H2020/2014-2020, REA grant agreement n° 659434.
REFERENCES [1]
U. Schimmack and R. Reisenzein, "Experiencing Activation: Energetic Arousal and Tense Trousal are Not Mixtures of Valence and Activation,"
Emotion , vol. 2, no. 4, p. 412, 2002. [2]
Sloboda, J. (2005).
Exploring the musical mind: Cognition, emotion, ability, function . Oxford University Press. [3]
Barthet, M., Fazekas, G., & Sandler, M. (2012, June). Music emotion recognition: From content-to context-based models. In
International Symposium on Computer Music Modeling and Retrieval (pp. 228-252). Springer, Berlin, Heidelberg. [4]
Zentner, M., Grandjean, D., & Scherer, K. R. (2008). Emotions evoked by the sound of music: characterization, classification, and measurement.
Emotion , (4), 494. [5] Lin, Yi, Xiaoou Chen, and Deshun Yang. “Exploration of Music Emotion Recognition Based on MIDI.” ISMIR.
Paolizzo, F. (2019) The Musical-Moods project, H2020-MCSA.GR659434. http://musicalmoods2020.org [7]
Aljanaki A., Wiering F., Veltkamp, R.C. “ Computational Modeling of Induced Emotion Using GEMS, ” in Proc. of the 15th International Society for Music Information Retrieval Conference , ISMIR 2014, Taipei, Taiwan, pp.373-378, 2014. [8]
C. Laurier, O. Lartillot, T. Eerola, and P. Toiviainen: "Exploring Relationships between Audio Features and Emotion in Music,"
Conference of European Society for the Cognitive Sciences of Music , 2009. [9]
Y. H. Yang, Y. C. Lin, Y. F. Su, and H. H. Chen, "A Regression Approach to Music Emotion Recognition,"
IEEE Transactions on Audio, Speech, and Language Processing , Vol. 16, No. 2, pp. 448-457, 2008. [10]
Y. E. Kim, E. M. Schmidt, R. Migneco, B. G. Morton, P. Richardson, J. Scott, J. A. Speck, and D. Turnbul, "Music Emotion Recognition: A State of The Art Review," in , 2010. [11]
Downie, X. H. J. S., Cyril Laurier, and M. B. A. F. Ehmann. “The 2007 MIREX audio mood classification task: Lessons learned.” Proc. 9th Int. Conf. Music Inf. Retrieval. 2008.
Cronbach, L. J. “ Coefficient Alpha and The Internal Structure of Tests ” , Psychometrik a, vol. 16, pagg 297-334, 1951. [13]
Lartillot, Olivier, Petri Toiviainen, and Tuomas Eerola. “A matlab toolbox for music information retrieval.” Data analysis, machine learning and applications. Springer Berlin Heidelberg, 2008. 261-268. [14]
Hu, Xiao, J. Stephen Downie, and Andreas F. Ehmann. “Lyric text mining in music mood classification.” American music 183.5,049 (2009): 2-209. [15]
Cabrera, D. (1999, November). PSYSOUND: A computer program for psychoacoustical analysis. In
Proceedings of the Australian Acoustical Society Conference (Vol. 24, pp. 47-54). [16]
Garner, S. R. (1995, April). Weka: The waikato environment for knowledge analysis. In
Proceedings of the New Zealand computer science research students conference (pp. 57-64). [17]
Turnbull, Douglas R., et al. “Combining audio content and social context for semantic music discovery.” Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. ACM, 2009. [18]
Aljanaki, Anna, Frans Wier ing, and Remco Veltkamp. “Collecting annotations for induced musical emotion via online game with a purpose Emotify.” Technical Report Series 2014.UU -CS-2014-015 (2014). [19]
Zwicker, E. and H. Fastl, “Psychoacoustics: Facts and Models. Berlin”,
Springer-Verlag , 1999. [20]
Šimko, Jakub, and Mária Bieliková. “State -of-the-
Art: Semantics Acquisition Games.” Semantic Acquisition Games.
Springer International Publishing, 2014. 35-50. [21] St eyvers, Mark, et al. “The wisdom of crowds in the recollection of order information.” Advances in neural information processing systems. 2009. [22] Turner, Brandon, and Mark Steyvers. “A wisdom of the crowd approach to forecasting.” 2nd NIPS workshop on
Computational Social Science and the Wisdom of Crowds. 2011 [23]
Lee, Michael D., et al. “Inferring expertise in knowledge and prediction ranking tasks.” Topics in cognitive science
Turnbull, Douglas, et al. “Semantic annotation and retrieval of music and sound effects.” IEEE Transactions on
Audio, Speech, and Language Processing 16.2 (2008): 467-476. [25]
Tsoumakas, Grigorios, Ioannis Katakis, and Ioannis Vlahavas. “Mining multi - label data.” Data mining and knowledge discovery handbook. Springer US, 2009. 667-685. [26] Miller, F., M. Stiksel, and R. Jones. “Last. fm in numbers.” Last.fm press material (2008). [27]
Bruni, Elia, Nam-
Khanh Tran, and Marco Baroni. “Multimodal Distributional Semantics.” J. Artif. Intell. Res.(JAIR)
Pearl, Lisa S., and Igii Enverga. “Can you read my mindprint?: Automatically identifying mental states from language text using deeper linguistic features.” Interaction Studies 15.3 (2014): 359 -387. [29]
Aljanaki, A., Wiering, F., & Veltkamp, R. C. (2016). Studying emotion induced by music through a crowdsourcing game. Information Processing & Management, 52(1), 115-128. [30]
Reybrouck, M., & Eerola, T. (2017). Music and its inductive power: a psychobiological and evolutionary approach to musical emotions. Frontiers in psychology, 8, 494. [31]
Russell, J. A. “A circumspect model of affect, 1980.” J Psychol Soc Psychol 39.6 (1980): 1161. [32]
Yik, Michelle, James A. Russell, and James H. Steiger. “A 12 - point circumplex structure of core affect.” Emotion [33] Aljanaki, A., Wiering, F., & Veltkamp, R. C. (2016). Studying emotion induced by music through a crowdsourcing game. Information Processing & Management, 52(1), 115-128. [34]
Mori, K., & Iwanaga, M. (2017). Two types of peak emotional responses to music: The psychophysiology of chills and tears.
Scientific reports , , 46063. [35] Ou, Guobin, and Yi Lu Murphey. "Multi-class pattern classification using neural networks."
Pattern Recognition
Li, Y. -
Zhao, Y, “Recognizing
Emotions in Speech Using Short-Term and Long-Term F eatures”, in
International Conference on Speech and Language Processing (ICSLP ), Sydney, Australia, pp. 2255 – Schuller, B. - Rigoll, G. -
Lang, M., “Speech E motion Recognition Combining Acoustic Features and Linguistic Information in a Hybrid Support Vector Machine-belief Network A rchitecture”, in
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 1, pp. 577 – Chuang, Z-J. - Wu, C-
H., “Emotion Recognition Using Acoustic Features and Textual C ont ent”, in
IEEE Proceedings of the International Conference on Multimedia and Expo (ICME ), 1, Taipei, Taiwan, pp. 53 –
56, 2004. [40]
Alpaydin, E., “Introduction to
Machine L earning”, The MIT Press, Cambridge, Massachusetts, 2010. [41]
Eibe Frank, Mark A. Hall, and Ian H. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition, 2016. [42]
Russell,
S. J., Norvig, P., “Artificial Intelligence: A Modern Approach”, Pearson Education, 1995 [43]
McEnnis, D., McKay, C., Fujinaga, I., & Depalle, P., “jAudio: A Feature Extraction Library”, 2005 [44]
Trost, W., Frühholz, S., Cochrane, T., Cojan, Y., & Vuilleumier, P. (2015). Temporal dynamics of musical emotions examined through intersubject synchrony of brain activity.
Social cognitive and affective neuroscience , (12), 1705-1721.(12), 1705-1721.