Non-linear frequency warping using constant-Q transformation for speech emotion recognition
Premjeet Singh, Goutam Saha
Dept. of Electronics and ECE, Indian Institute of Technology Kharagpur, Kharagpur, India
[email protected], [email protected]

Md Sahidullah
Université de Lorraine, CNRS, Inria, LORIA, F-54000, Nancy, France
[email protected]

Abstract—In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER). The CQT-based time-frequency analysis provides variable spectro-temporal resolution with higher frequency resolution at lower frequencies. Since lower-frequency regions of the speech signal contain more emotion-related information than higher-frequency regions, the increased low-frequency resolution of CQT makes it more promising for SER than the standard short-time Fourier transform (STFT). We present a comparative analysis of short-term acoustic features based on STFT and CQT for SER with a deep neural network (DNN) as a back-end classifier. We optimize different parameters for both features. The CQT-based features outperform the STFT-based spectral features in SER experiments. Further experiments with cross-corpora evaluation demonstrate that the CQT-based systems provide better generalization with out-of-domain training data.
Index Terms—Speech emotion recognition (SER), Constant-Q transform (CQT), Mel frequency analysis, Cross-corpora evaluation.
I. INTRODUCTION
Speech emotion recognition (SER) is the task of recognizing emotion from human speech. The potential applications of SER include human-computer interaction, sentiment analysis and health-care [1]–[4]. Humans naturally sense the emotions in speech, while machines find it difficult to characterize them [5], [6]. Techniques proposed to date have significantly increased the machine's ability to recognize speech emotions. However, the task is still challenging, mainly due to the presence of large interpersonal and intrapersonal variability and the differences in the quality of speech used to train and evaluate the system. The goal of this work is to develop an improved SER system by considering emotion-specific acoustic parameters of speech that are assumed to be more robust to unwanted variabilities.

Previous studies in SER research have shown that spectral and prosodic characteristics of speech contain emotion-related information. Spectral features include the lower formant frequencies (F1 and F2), speech amplitude and energy, zero crossing rate (ZCR) and spectral parameters such as spectral flux and spectral roll-off [7]–[9]. Prosodic features include pitch, pitch harmonics, intonation, and speaking rate [8], [9]. These acoustic front-ends are used with Gaussian mixture models (GMMs) or support vector machines (SVMs) as back-end classifiers for SER tasks [10]. Studies with prosody reveal that high arousal emotions, such as Angry, Happy and Fear, have higher average pitch values with abrupt pitch variations, whereas low arousal emotions like Sadness and Neutral have lower pitch values with consistent pitch contours [1], [11]–[14]. The authors in [15] have reported that the recognition accuracy of Anger is higher near F2 (1250-1750 Hz) and that of Neutral is higher near F1 (around 200-1000 Hz). The authors in [16] report that the center frequencies of F2 and F3 are reduced in depressed individuals. In [17], the authors report that high arousal emotions have higher mean F1 and lower F2, and that high (positive) valence emotions have high mean F2. In [18], the authors report discrimination between idle and negative emotions using temporal patterns in formants. In [19], the authors have demonstrated that non-linear frequency scales, such as logarithmic, mel and equivalent rectangular bandwidth (ERB), have considerable impact on SER performance over a linear frequency scale.

Recent works with deep learning methods, such as convolutional neural networks (CNNs) or CNNs with recurrent neural networks (CNN-RNNs), use the spectrogram or raw waveform as input and have shown impressive results [20]–[22]. These data-driven methods automatically learn the emotion-related representation; however, the role of individual speech attributes in the decision-making process is not clear due to the lack of explainability. On the other hand, the generalization of these methods remains an open problem, especially when the audio data for training and testing are substantially different in terms of language and speech quality [23].

We address this generalization issue by capturing emotion-related information from speech before processing with a neural network back-end. Given that the low and mid frequency regions of the speech spectrum contain pitch harmonics and lower formants that are relevant for emotion recognition, we propose to use a more appropriate approach for time-frequency analysis that produces an emotion-oriented speech representation in the first place. Even though processing with mel frequency warping introduces non-linearity in some sense, the power spectrum of the speech is essentially computed with a uniform frequency resolution. We propose to use a time-frequency analysis method called the constant-Q transform (CQT). This transformation offers higher frequency resolution in low-frequency regions and higher time resolution in high-frequency regions. As the pitch harmonics and lower formants reside in the low-frequency regions of the speech spectrum, we hypothesize that keeping high resolution in this region may efficiently capture emotion-related information.
Fig. 1. F-ratios of spectrograms based on CQT (top), mel-filter (middle), and standard STFT (bottom) corresponding to the frequency bins. We use the speech sentences with the fixed text 'a02' from the EmoDB database (discussed in Section III-A). We select the same text assuming the spectral characteristics of emotions to be text-dependent. The first column shows the values over the entire frequency range, while the second column focuses only on the lower-frequency regions.

The CQT was initially proposed for music signal processing [24]. It was later applied to different speech processing tasks, e.g., anti-spoofing [25], [26], speaker verification [27] and acoustic scene classification [28]. Recently, the CQT has also been studied for SER [29], but without success. This is possibly because the CQT parameters were not optimized and/or because the applied end-to-end model fails to exploit the advantages of CQT. Recent studies show that CNN-based models are suitable for SER, including cross-corpora evaluation [23], [30]. In this work, we also adopt a CNN-based approach for modeling SER systems. Our main contributions are summarized as follows: (i) we propose a new framework for CQT-based SER by optimizing the CQT extraction parameters for reduced redundancy and improved performance, (ii) we investigate a CNN architecture known as the time-delay neural network (TDNN), suitable for speech pattern classification tasks [31], [32], for SER, and (iii) we perform cross-corpora evaluation with three different speech corpora to assess the generalization ability of the proposed method. Our results demonstrate that the optimized CQT features not only outperform short-time Fourier transform (STFT) features but also provide better generalization.
II. METHODOLOGY
In this section, we discuss the CQT-based feature extraction framework and the TDNN architecture for emotion recognition.
A. Constant-Q transform
The CQT of a time-domain signal $x[n]$ is defined as
$$X[k] = \frac{1}{N[k]} \sum_{n=0}^{N[k]-1} W[k,n]\, x[n]\, e^{-j\omega_k n}, \qquad (1)$$
where $X[k]$ is the CQT coefficient of the $k$-th frequency bin, $W[k,n]$ is the time-domain window for the $k$-th bin with duration $N[k]$, $x[n]$ denotes the time samples, and $\omega_k = \frac{2\pi Q}{N[k]}$, where $Q$ is the (constant) Q-factor of the filter bank [24]. In CQT computation, the window length $N[k]$ varies with $k$. Hence, $x[n]$ is correlated with sinusoids of different lengths but with an equal number of cycles of oscillation. This leads to a constant-Q filter bank representation with geometrically spaced center frequencies over the frequency octaves. Hence, we obtain a time-frequency representation whose frequency resolution decreases with increasing frequency.

The CQT representation of an audio signal depends on the number of frequency octaves and the number of frequency bins per octave. The number of octaves depends upon the chosen minimum frequency ($F_{\min}$) and maximum frequency ($F_{\max}$) of operation, and equals $\log_2(F_{\max}/F_{\min})$ [25]. A CQT representation with a reduced total number of frequency bins over a fixed number of octaves provides detailed information for the lower-frequency region with reduced redundancy. Conversely, due to its linearly spaced frequency bins, the short-time Fourier transform (STFT) does not offer this flexibility. We fix $F_{\max}$ to the Nyquist frequency, so the chosen $F_{\min}$ determines the number of octaves. The hop length in CQT computation defines the number of time samples by which the CQT analysis window moves. The CQT also resembles the continuous wavelet transform (CWT), which provides variable time-frequency resolution and has been found helpful for SER [33]. During the CQT-based feature extraction process, the CQT coefficients are uniformly resampled and then processed with the discrete cosine transform (DCT) to compute speech features known as constant-Q cepstral coefficients (CQCCs).

We perform a class separability analysis of the time-frequency representations by computing the F-ratios [34]. Fig. 1 shows the F-ratio obtained at different frequency bins. The higher F-ratios at lower bins for CQT and STFT show the presence of more discriminative information in that region. The figure also indicates that the CQT spectrogram has, on average, a larger number of discriminative coefficients than the others due to its higher resolution in the low-frequency regions.
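For illustration, the following is a minimal Python sketch of the CQT-based feature extraction and F-ratio analysis described above, using the LibROSA library mentioned in Section III-B. The bins-per-octave and hop-length values follow the optimized settings reported in Fig. 2; the minimum frequency, the number of octaves, the log-magnitude compression, and the CQCC resampling and truncation settings are illustrative assumptions rather than the exact configuration used in this work.

    import numpy as np
    import librosa
    from scipy.fftpack import dct
    from scipy.signal import resample

    def extract_log_cqt(path, sr=16000, bins_per_octave=3, hop_length=64,
                        n_octaves=7, fmin=32.7):
        # fmin and n_octaves are assumed values; n_octaves is kept moderate so that
        # hop_length=64 stays compatible with librosa's octave-wise sub-sampling.
        y, sr = librosa.load(path, sr=sr)
        C = librosa.cqt(y, sr=sr, hop_length=hop_length, fmin=fmin,
                        n_bins=n_octaves * bins_per_octave,
                        bins_per_octave=bins_per_octave)
        # Log-magnitude CQT spectrogram, shape (n_bins, n_frames).
        return np.log(np.abs(C) + 1e-10)

    def cqcc(log_cqt, n_uniform_bins=48, n_coeffs=20):
        # Uniformly resample the log-CQT along the frequency axis, then apply the
        # DCT and keep the first coefficients (a CQCC-style cepstral feature).
        uniform = resample(log_cqt, n_uniform_bins, axis=0)
        return dct(uniform, type=2, axis=0, norm='ortho')[:n_coeffs]

    def f_ratio(class_features):
        # class_features: list of (n_bins, n_frames) arrays, one per emotion class.
        # One common form of the F-ratio: variance of the per-class bin means
        # divided by the average within-class variance, computed for each bin.
        means = np.stack([c.mean(axis=1) for c in class_features])
        within = np.stack([c.var(axis=1) for c in class_features])
        return means.var(axis=0) / (within.mean(axis=0) + 1e-10)

With 3 bins per octave over 7 octaves, this sketch yields only 21 CQT bins per frame, which reflects the reduced redundancy discussed above.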
B. CNN architecture
The time-frequency representation of a speech-like signal is suitable for use with a 1-D CNN, popularly known as a TDNN in the speech processing literature. Our method is inspired by the TDNN-based x-vector system [32] developed for the speaker verification task. It processes speech information at the frame and segment levels. At the frame level, the TDNN captures contextual information by applying a kernel over adjacent frames and by processing each speech frame in an identical manner. It also applies dilation in the temporal domain to reduce redundancy and to make the computation efficient. The frame-level information is processed with several TDNN layers having different kernel sizes and dilation parameters. Finally, temporal pooling aggregates the frame-level information into a segment-level representation, which is followed by a fully connected (FC) layer and a softmax layer for the classification objective. The standard x-vector system computes segment-level intermediate representations, referred to as embeddings, which are further processed with another system for classification. In contrast, our proposed method trains the network in an end-to-end fashion, so the emotion for a test speech segment is obtained directly from the output of the trained network.

We empirically optimize the parameters of the TDNN architecture. Finally, we use four TDNN layers, followed by statistics pooling with mean and standard deviation, and one FC layer before the softmax. Table I describes the parameters of the different layers; a minimal implementation sketch follows the table.
TABLE I
THE PARAMETERS OF CNN ARCHITECTURE FOR SER.

Layer | Size | Kernel Size | Dilation
TDNN | 32 | 5 | 1
TDNN | 32 | 3 | 2
TDNN | 32 | 3 | 3
TDNN | 64 | 1 | 1
Statistics Pooling (Mean and SD) | 128 | - | -
Fully Connected | 64 | - | -
Softmax | - | - | -
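As an illustration of Table I, the following is a minimal PyTorch sketch of the TDNN described above. Only the layer sizes, kernel sizes and dilations are taken from Table I; the input feature dimension, the ReLU and batch-normalization placement, and returning logits for a cross-entropy loss instead of an explicit softmax layer are assumptions.

    import torch
    import torch.nn as nn

    class StatsPooling(nn.Module):
        # Concatenate the mean and standard deviation over the time axis.
        def forward(self, x):                      # x: (batch, channels, frames)
            return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

    class EmotionTDNN(nn.Module):
        def __init__(self, feat_dim=21, n_classes=4):
            # feat_dim (number of CQT bins) and n_classes are example values.
            super().__init__()

            def tdnn(in_ch, out_ch, kernel, dilation):
                return nn.Sequential(
                    nn.Conv1d(in_ch, out_ch, kernel_size=kernel, dilation=dilation),
                    nn.ReLU(),
                    nn.BatchNorm1d(out_ch),
                )

            # Frame-level layers, following Table I.
            self.frame_layers = nn.Sequential(
                tdnn(feat_dim, 32, 5, 1),
                tdnn(32, 32, 3, 2),
                tdnn(32, 32, 3, 3),
                tdnn(32, 64, 1, 1),
            )
            self.pool = StatsPooling()             # 64 channels -> 128-dim segment vector
            self.classifier = nn.Sequential(
                nn.Linear(128, 64),                # fully connected layer
                nn.ReLU(),
                nn.Linear(64, n_classes),          # softmax applied via the loss function
            )

        def forward(self, x):                      # x: (batch, feat_dim, frames)
            return self.classifier(self.pool(self.frame_layers(x)))

For example, EmotionTDNN()(torch.randn(8, 21, 300)) produces one score per emotion class for each of the eight input chunks.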
III. EXPERIMENTAL SETUP
A. Speech corpora
In our experiments, we use three different speech corpora, which are described in Table II. We downsample the speech files to a sampling rate of 16 kHz when required. EmoDB is a German language corpus, while RAVDESS and IEMOCAP are in English. For the IEMOCAP database, we select only four emotions (Angry, Happy, Sad and Neutral), as some of the emotion classes have inadequate data for training neural network models [30]. We perform the cross-corpora SER experiments by selecting the same four emotions.
TABLE II
SUMMARY OF THE SPEECH CORPORA USED IN THE EXPERIMENTS. (F = FEMALE, M = MALE)

Databases | Speakers | Emotions
Berlin Emotion Database (EmoDB) [35] | 10 (5 F, 5 M) | 7 (Anger, Sad, Boredom, Fear, Happy, Disgust and Neutral)
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [36] | 24 (12 F, 12 M) | 8 (Calm, Happy, Sad, Angry, Neutral, Fearful, Surprise, and Disgust)
Interactive Emotional Dyadic Motion Capture Database (IEMOCAP) [37] | 10 (5 F, 5 M) | 4 (Happy, Angry, Sad and Neutral)
B. Experimental details & evaluation methodology
First, we optimize the parameters of the features on EmoDB. We perform experiments on this corpus using leave-one-speaker-out (LOSO) cross-validation by keeping one speaker for testing. Out of the remaining speakers, we use two for validation and seven for training. We also apply five-fold data augmentation by corrupting the training set with additive noises and room reverberation effects, following the Kaldi recipe for x-vector training [32].

We extract features from each speech utterance and discard the non-speech frames with a simple energy-based speech activity detector (SAD). We apply utterance-level cepstral mean and variance normalization (CMVN) before creating the training and validation samples with chunks of consecutive frames. We consider multiple non-overlapping chunks from each speech utterance, depending on its length. We use the LibROSA Python library for feature extraction.

We do not apply chunking for testing and consider the full utterance for computing the test accuracy. We report the final performances with accuracy as well as unweighted average recall (UAR). The accuracy is computed as the ratio between the number of correctly classified sentences and the total number of sentences in the test set. The UAR is given as [38]
$$\mathrm{UAR} = \frac{1}{K} \sum_{i=1}^{K} \frac{A_{ii}}{\sum_{j=1}^{K} A_{ij}}, \qquad (2)$$
where $A$ refers to the contingency matrix, $A_{ij}$ corresponds to the number of samples of class $i$ classified into class $j$, and $K$ is the total number of classes. As accuracy is considered

https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb/v2 (Kaldi recipe for x-vector training)
https://librosa.github.io/ (LibROSA)
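As a small illustration of equation (2), the following sketch computes both evaluation metrics from a contingency (confusion) matrix; the function name and the example matrix are hypothetical.

    import numpy as np

    def accuracy_and_uar(A):
        # A: K x K contingency matrix; A[i, j] is the number of class-i test
        # samples classified into class j.
        A = np.asarray(A, dtype=float)
        accuracy = np.trace(A) / A.sum()
        per_class_recall = np.diag(A) / A.sum(axis=1)   # A_ii / sum_j A_ij
        uar = per_class_recall.mean()                   # (1/K) * sum_i recall_i
        return accuracy, uar

For instance, np.array([[9, 1], [10, 10]]) yields an accuracy of about 0.63 but a UAR of 0.70, illustrating how UAR compensates for class imbalance.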
[Fig. 2 panels: 'Comparison of various CQT parameters over EmoDB database' — accuracy and UAR versus bins per octave (1, 2, 3, 5, 8, 12, 24, 96) for hop lengths of 64, 128 and 192; 'Comparison with Mel based Features' — accuracy and UAR scores for Optimized-CQT, Optimized-CQCC, Baseline-MFSC, Optimized-MFSC and Optimized-MFCC.]
Fig. 2. Performance comparison of CQT for different parameter values. Optimized CQT shows the response of CQT with 3 bins per octave and a hop length of 64 samples. Baseline MFSC corresponds to MFSC extraction with the standard values of 128 mel-filters and a 160-sample hop length, whereas optimized MFSC has 24 mel-filters with a 64-sample hop. Optimized CQCC and MFCC are obtained by applying DCT over the optimized CQT and MFSC, respectively. The results shown in the figure are obtained on the EmoDB database only.