Non-linear frequency warping using constant-Q transformation for speech emotion recognition
Premjeet Singh, Goutam Saha
Dept. of Electronics and ECE, Indian Institute of Technology Kharagpur, Kharagpur, India
[email protected], [email protected]

Md Sahidullah
Université de Lorraine, CNRS, Inria, LORIA, F-54000, Nancy, France
[email protected]

Abstract—In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER). The CQT-based time-frequency analysis provides variable spectro-temporal resolution with higher frequency resolution at lower frequencies. Since lower-frequency regions of the speech signal contain more emotion-related information than higher-frequency regions, the increased low-frequency resolution of CQT makes it more promising for SER than the standard short-time Fourier transform (STFT). We present a comparative analysis of short-term acoustic features based on STFT and CQT for SER with a deep neural network (DNN) as a back-end classifier. We optimize different parameters for both features. The CQT-based features outperform the STFT-based spectral features in SER experiments. Further experiments with cross-corpora evaluation demonstrate that the CQT-based systems provide better generalization with out-of-domain training data.
Index Terms—Speech emotion recognition (SER), Constant-Q transform (CQT), Mel frequency analysis, Cross-corpora evaluation.
I. INTRODUCTION
Speech emotion recognition (SER) is the task of recognizing emotion from human speech. The potential applications of SER include human-computer interaction, sentiment analysis and health-care [1]–[4]. Humans naturally sense the emotions in speech, while machines find it difficult to characterize them [5], [6]. Techniques proposed to date have significantly increased the machine's ability to recognize speech emotions. However, the task is still challenging, mainly due to the presence of large interpersonal and intrapersonal variability and the differences in the quality of speech used to train and evaluate the system. The goal of this work is to develop an improved SER system by considering emotion-specific acoustic parameters of speech that are assumed to be more robust to unwanted variabilities.

Previous studies in SER research have shown that spectral and prosodic characteristics of speech contain emotion-related information. Spectral features include the lower formant frequencies (F1 and F2), speech amplitude and energy, zero crossing rate (ZCR) and spectral parameters such as spectral flux and spectral roll-off [7]–[9]. Prosodic features include pitch, pitch harmonics, intonation, and speaking rate [8], [9]. These acoustic front-ends are used with Gaussian mixture models (GMMs) or support vector machines (SVMs) as back-end classifiers for SER tasks [10]. Studies with prosody reveal that high arousal emotions, such as Angry, Happy and Fear, have higher average pitch values with abrupt pitch variations, whereas low arousal emotions like Sadness and Neutral have lower pitch values with consistent pitch contours [1], [11]–[14]. The authors in [15] have reported that the recognition accuracy of Anger is higher near F2 (1250-1750 Hz) and that of Neutral is higher near F1 (around 200-1000 Hz). The authors in [16] report that the center frequencies of F2 and F3 are reduced in depressed individuals. In [17], the authors report that high arousal emotions have higher mean F1 and lower F2, and that high (positive) valence emotions have high mean F2. In [18], the authors report discrimination between idle and negative emotions using temporal patterns in formants. In [19], the authors have demonstrated that non-linear frequency scales, such as logarithmic, mel and equivalent rectangular bandwidth (ERB), have considerable impact on SER performance over a linear frequency scale.

Recent works with deep learning methods, such as convolutional neural networks (CNNs) or CNNs with recurrent neural networks (CNN-RNNs), use the spectrogram or raw waveform as input and have shown impressive results [20]–[22]. These data-driven methods automatically learn the emotion-related representation; however, the role of individual speech attributes in the decision-making process is not clear due to the lack of explainability. On the other hand, the generalization of these methods remains an open problem, especially when the audio data for training and testing are substantially different in terms of language and speech quality [23].

We address this generalization issue by capturing emotion-related information from speech before processing with a neural network back-end. Given that the low and mid frequency regions of the speech spectrum contain pitch harmonics and lower formants that are relevant for emotion recognition, we propose to use a more appropriate approach for time-frequency analysis that produces an emotion-oriented speech representation in the first place. Even though processing with mel frequency warping introduces non-linearity in some sense, the power spectrum of the speech is essentially computed with a uniform frequency resolution. We propose to use a time-frequency analysis method called the constant-Q transform (CQT). This transformation offers higher frequency resolution in low-frequency regions and higher time resolution in high-frequency regions. As the pitch harmonics and lower formants reside in the low-frequency regions of the speech spectrum, we hypothesize that keeping high resolution in this region may efficiently capture emotion-related information.
Fig. 1. F-ratios of spectrograms based on CQT (top), mel-filter (middle), and standard STFT (bottom) corresponding to the frequency bins. We use the speech sentences with the fixed text 'a02' from the EmoDB database (discussed in Section III-A). We select the same text assuming the spectral characteristics of emotions to be text-dependent. The first column shows the values over the entire frequency range, while the second column focuses only on the lower-frequency regions.

The CQT was initially proposed for music signal processing [24]. It was later applied to different speech processing tasks, e.g., anti-spoofing [25], [26], speaker verification [27] and acoustic scene classification [28]. Recently, the CQT has also been studied for SER [29], but without success. This is possibly because the CQT parameters were not optimized and/or because the applied end-to-end model fails to exploit the advantages of CQT. Recent studies show that CNN-based models are suitable for SER, including cross-corpora evaluation [23], [30]. In this work, we also adopt a CNN-based approach for modeling SER systems. Our main contributions are summarized as follows: (i) we propose a new framework for CQT-based SER by optimizing the CQT extraction parameters for reduced redundancy and improved performance, (ii) we investigate a CNN architecture known as the time-delay neural network (TDNN), suitable for speech pattern classification tasks [31], [32], for SER, and (iii) we perform cross-corpora evaluation with three different speech corpora to assess the generalization ability of the proposed method. Our results demonstrate that the optimized CQT features not only outperform short-time Fourier transform (STFT) features but also provide better generalization.
II. METHODOLOGY
In this section, we discuss the CQT-based feature extraction framework and the TDNN architecture for emotion recognition.
A. Constant-Q transform
The CQT of a time-domain signal $x[n]$ is defined as
$$X[k] = \frac{1}{N[k]} \sum_{n=0}^{N[k]-1} W[k,n]\, x[n]\, e^{-j\omega_k n}, \qquad (1)$$
where $X[k]$ is the CQT coefficient of the $k$-th frequency bin, $W[k,n]$ is the time-domain window for the $k$-th bin with duration $N[k]$, $x[n]$ denotes the time samples, and $\omega_k = \frac{2\pi Q}{N[k]}$, where $Q$ is the (constant) Q-factor of the filter bank [24]. In CQT computation, the window length $N[k]$ varies with $k$. Hence, $x[n]$ is correlated with sinusoids of different lengths but with an equal number of cycles of oscillation. This leads to a constant-Q filter bank representation with geometrically spaced center frequencies over the frequency octaves. Hence, we obtain a time-frequency representation whose frequency resolution decreases with increasing frequency.

The CQT representation of an audio signal depends on the number of frequency octaves and the number of frequency bins per octave. The number of octaves depends upon the chosen minimum frequency ($F_{\min}$) and maximum frequency ($F_{\max}$) of operation, and equals $\log_2(F_{\max}/F_{\min})$ [25]. A CQT representation with a reduced total number of frequency bins over a fixed number of octaves provides detailed information for the lower-frequency region with reduced redundancy. Conversely, due to its linearly spaced frequency bins, the short-time Fourier transform (STFT) does not offer this flexibility. We fix $F_{\max}$ to the Nyquist frequency, so the chosen $F_{\min}$ determines the number of octaves. The hop length in CQT computation defines the number of time samples by which the CQT analysis window moves. The CQT also resembles the continuous wavelet transform (CWT), which provides variable time-frequency resolution and has been found helpful for SER [33]. During the CQT-based feature extraction process, the CQT coefficients are uniformly resampled and then processed with the discrete cosine transform (DCT) to compute speech features known as constant-Q cepstral coefficients (CQCCs).

We perform a class separability analysis of the time-frequency representations by computing the F-ratios [34]. Fig. 1 shows the F-ratio obtained at different frequency bins. The higher F-ratios at lower bins for CQT and STFT show the presence of more discriminative information in that region. The figure also indicates that the CQT spectrogram has, on average, a larger number of discriminative coefficients than the others due to its higher resolution in the low-frequency regions.
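For illustration, the following is a minimal Python sketch of the CQT-based feature extraction and F-ratio analysis described above, using the LibROSA library mentioned in Section III-B. The bins-per-octave and hop-length values follow the optimized settings reported in Fig. 2; the minimum frequency, the number of octaves, the log-magnitude compression, and the CQCC resampling and truncation settings are illustrative assumptions rather than the exact configuration used in this work.

    import numpy as np
    import librosa
    from scipy.fftpack import dct
    from scipy.signal import resample

    def extract_log_cqt(path, sr=16000, bins_per_octave=3, hop_length=64,
                        n_octaves=7, fmin=32.7):
        # fmin and n_octaves are assumed values; n_octaves is kept moderate so that
        # hop_length=64 stays compatible with librosa's octave-wise sub-sampling.
        y, sr = librosa.load(path, sr=sr)
        C = librosa.cqt(y, sr=sr, hop_length=hop_length, fmin=fmin,
                        n_bins=n_octaves * bins_per_octave,
                        bins_per_octave=bins_per_octave)
        # Log-magnitude CQT spectrogram, shape (n_bins, n_frames).
        return np.log(np.abs(C) + 1e-10)

    def cqcc(log_cqt, n_uniform_bins=48, n_coeffs=20):
        # Uniformly resample the log-CQT along the frequency axis, then apply the
        # DCT and keep the first coefficients (a CQCC-style cepstral feature).
        uniform = resample(log_cqt, n_uniform_bins, axis=0)
        return dct(uniform, type=2, axis=0, norm='ortho')[:n_coeffs]

    def f_ratio(class_features):
        # class_features: list of (n_bins, n_frames) arrays, one per emotion class.
        # One common form of the F-ratio: variance of the per-class bin means
        # divided by the average within-class variance, computed for each bin.
        means = np.stack([c.mean(axis=1) for c in class_features])
        within = np.stack([c.var(axis=1) for c in class_features])
        return means.var(axis=0) / (within.mean(axis=0) + 1e-10)

With 3 bins per octave over 7 octaves, this sketch yields only 21 CQT bins per frame, which reflects the reduced redundancy discussed above.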
B. CNN architecture
The time-frequency representation of a speech-like signal is suitable for use with a 1-D CNN, popularly known as a TDNN in the speech processing literature. Our method is inspired by the TDNN-based x-vector system [32] developed for the speaker verification task. It processes speech information at the frame and segment levels. At the frame level, the TDNN captures contextual information by applying a kernel over adjacent frames and by processing each speech frame in an identical manner. It also applies dilation in the temporal domain to reduce redundancy and to make the computation efficient. The frame-level information is processed with several TDNN layers having different kernel sizes and dilation parameters. Finally, temporal pooling aggregates the frame-level information into a segment-level representation, which is followed by a fully connected (FC) layer and a softmax layer for the classification objective. The standard x-vector system computes segment-level intermediate representations, referred to as embeddings, which are further processed with another system for classification. In contrast, our proposed method trains the network in an end-to-end fashion, so the emotion for a test speech segment is obtained directly from the output of the trained network.

We empirically optimize the parameters of the TDNN architecture. Finally, we use four TDNN layers, followed by statistics pooling with mean and standard deviation, and one FC layer before the softmax. Table I describes the parameters of the different layers; a minimal implementation sketch follows the table.
TABLE I
THE PARAMETERS OF CNN ARCHITECTURE FOR SER.

Layer | Size | Kernel Size | Dilation
TDNN | 32 | 5 | 1
TDNN | 32 | 3 | 2
TDNN | 32 | 3 | 3
TDNN | 64 | 1 | 1
Statistics Pooling (Mean and SD) | 128 | - | -
Fully Connected | 64 | - | -
Softmax | - | - | -
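As an illustration of Table I, the following is a minimal PyTorch sketch of the TDNN described above. Only the layer sizes, kernel sizes and dilations are taken from Table I; the input feature dimension, the ReLU and batch-normalization placement, and returning logits for a cross-entropy loss instead of an explicit softmax layer are assumptions.

    import torch
    import torch.nn as nn

    class StatsPooling(nn.Module):
        # Concatenate the mean and standard deviation over the time axis.
        def forward(self, x):                      # x: (batch, channels, frames)
            return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

    class EmotionTDNN(nn.Module):
        def __init__(self, feat_dim=21, n_classes=4):
            # feat_dim (number of CQT bins) and n_classes are example values.
            super().__init__()

            def tdnn(in_ch, out_ch, kernel, dilation):
                return nn.Sequential(
                    nn.Conv1d(in_ch, out_ch, kernel_size=kernel, dilation=dilation),
                    nn.ReLU(),
                    nn.BatchNorm1d(out_ch),
                )

            # Frame-level layers, following Table I.
            self.frame_layers = nn.Sequential(
                tdnn(feat_dim, 32, 5, 1),
                tdnn(32, 32, 3, 2),
                tdnn(32, 32, 3, 3),
                tdnn(32, 64, 1, 1),
            )
            self.pool = StatsPooling()             # 64 channels -> 128-dim segment vector
            self.classifier = nn.Sequential(
                nn.Linear(128, 64),                # fully connected layer
                nn.ReLU(),
                nn.Linear(64, n_classes),          # softmax applied via the loss function
            )

        def forward(self, x):                      # x: (batch, feat_dim, frames)
            return self.classifier(self.pool(self.frame_layers(x)))

For example, EmotionTDNN()(torch.randn(8, 21, 300)) produces one score per emotion class for each of the eight input chunks.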
III. EXPERIMENTAL SETUP
A. Speech corpora
In our experiments, we use three different speech corpora, which are described in Table II. We downsample the speech files to a sampling rate of 16 kHz when required. EmoDB is a German language corpus, while RAVDESS and IEMOCAP are in English. For the IEMOCAP database, we select only four emotions (Angry, Happy, Sad and Neutral), as some of the emotion classes have inadequate data for training neural network models [30]. We perform the cross-corpora SER experiments by selecting the same four emotions.
TABLE II
SUMMARY OF THE SPEECH CORPORA USED IN THE EXPERIMENTS. (F = FEMALE, M = MALE)

Databases | Speakers | Emotions
Berlin Emotion Database (EmoDB) [35] | 10 (5 F, 5 M) | 7 (Anger, Sad, Boredom, Fear, Happy, Disgust and Neutral)
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [36] | 24 (12 F, 12 M) | 8 (Calm, Happy, Sad, Angry, Neutral, Fearful, Surprise, and Disgust)
Interactive Emotional Dyadic Motion Capture Database (IEMOCAP) [37] | 10 (5 F, 5 M) | 4 (Happy, Angry, Sad and Neutral)
B. Experimental details & evaluation methodology
First, we optimize the parameters of the features on EmoDB. We perform experiments on this corpus using leave-one-speaker-out (LOSO) cross-validation by keeping one speaker for testing. Out of the remaining speakers, we use two for validation and seven for training. We also apply five-fold data augmentation by corrupting the training set with additive noises and room reverberation effects, following the Kaldi recipe for x-vector training [32].

We extract features from each speech utterance and discard the non-speech frames with a simple energy-based speech activity detector (SAD). We apply utterance-level cepstral mean and variance normalization (CMVN) before creating the training and validation samples with chunks of consecutive frames. We consider multiple non-overlapping chunks from each speech utterance, depending on its length. We use the LibROSA Python library for feature extraction.

We do not apply chunking for testing and consider the full utterance for computing the test accuracy. We report the final performances with accuracy as well as unweighted average recall (UAR). The accuracy is computed as the ratio between the number of correctly classified sentences and the total number of sentences in the test set. The UAR is given as [38]
$$\mathrm{UAR} = \frac{1}{K} \sum_{i=1}^{K} \frac{A_{ii}}{\sum_{j=1}^{K} A_{ij}}, \qquad (2)$$
where $A$ refers to the contingency matrix, $A_{ij}$ corresponds to the number of samples of class $i$ classified into class $j$, and $K$ is the total number of classes. As accuracy is considered

https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb/v2 (Kaldi recipe for x-vector training)
https://librosa.github.io/ (LibROSA)
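As a small illustration of equation (2), the following sketch computes both evaluation metrics from a contingency (confusion) matrix; the function name and the example matrix are hypothetical.

    import numpy as np

    def accuracy_and_uar(A):
        # A: K x K contingency matrix; A[i, j] is the number of class-i test
        # samples classified into class j.
        A = np.asarray(A, dtype=float)
        accuracy = np.trace(A) / A.sum()
        per_class_recall = np.diag(A) / A.sum(axis=1)   # A_ii / sum_j A_ij
        uar = per_class_recall.mean()                   # (1/K) * sum_i recall_i
        return accuracy, uar

For instance, np.array([[9, 1], [10, 10]]) yields an accuracy of about 0.63 but a UAR of 0.70, illustrating how UAR compensates for class imbalance.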
[Fig. 2 panels: 'Comparison of various CQT parameters over EmoDB database' — accuracy and UAR versus bins per octave (1, 2, 3, 5, 8, 12, 24, 96) for hop lengths of 64, 128 and 192; 'Comparison with Mel based Features' — accuracy and UAR scores for Optimized-CQT, Optimized-CQCC, Baseline-MFSC, Optimized-MFSC and Optimized-MFCC.]
Fig. 2. Performance comparison of CQT for different parameter values. Optimized CQT shows the response of CQT with 3 bins per octave and a hop length of 64 samples. Baseline MFSC corresponds to MFSC extraction with the standard values of 128 mel-filters and a 160-sample hop length, whereas optimized MFSC has 24 mel-filters with a 64-sample hop. Optimized CQCC and MFCC are obtained by applying DCT over the optimized CQT and MFSC, respectively. The results shown in the figure are obtained on the EmoDB database only.