Applying Speech Tempo-Derived Features, BoAW and Fisher Vectors to Detect Elderly Emotion and Speech in Surgical Masks
Gábor Gosztolya, László Tóth
MTA-SZTE Research Group on Artificial Intelligence, Szeged, Hungary
University of Szeged, Institute of Informatics, Szeged, Hungary
{ggabor, tothl}@inf.u-szeged.hu

Abstract
The 2020 INTERSPEECH Computational Paralinguistics Challenge (ComParE) consists of three Sub-Challenges, where the tasks are to identify the level of arousal and valence of elderly speakers, determine whether the actual speaker is wearing a surgical mask, and estimate the breathing of the speaker. In our contribution to the Challenge, we focus on the Elderly Emotion and the Mask sub-challenges. Besides utilizing standard or close-to-standard features such as ComParE functionals, Bag-of-Audio-Words and Fisher vectors, we exploit the fact that emotion is related to the velocity of speech (i.e. speech rate). To this end, we perform phone-level recognition using an ASR system, and extract features from its output such as articulation tempo, speech tempo, and various attributes measuring the amount of pauses. We also hypothesize that wearing a surgical mask makes the speaker feel uneasy, leading to a slower speech rate and more hesitations; hence, we experiment with the same features in the Mask sub-challenge as well. Although this theory was not justified by the experimental results on the Mask Sub-Challenge, in the Elderly Emotion Sub-Challenge we got significantly improved arousal and valence values with this feature type, both on the development set and in cross-validation.
Index Terms: speech recognition, human-computer interaction, computational paralinguistics
1. Introduction
Computational paralinguistics, a subfield of speech technology, deals with extracting, locating and identifying various phenomena present in human speech. In contrast with Automatic Speech Recognition (ASR), where most such information is considered secondary to the phonetic content of the speech signal (i.e. the phonetic or word-level transcription), computational paralinguistics focuses on the huge variety of information related to the physical and mental state of the speaker, usually ignoring the actual words uttered. The Interspeech Computational Paralinguistics Challenge (ComParE), held regularly at the Interspeech conference for over a decade now, focuses on the automatic identification of this 'paralinguistic' (that is, 'beyond linguistic') aspect of human speech. The open tasks presented over the years covered dozens of different aspects of human speech, ranging from emotion detection [1] through determining speaker age and gender [2] and estimating blood alcohol level [3], to identifying specific disorders which affect the speech of the subject (e.g. autism [4] and Parkinson's Disease [5]).

Over the history of the ComParE Challenge, we can see two main types of solutions for the various tasks. The first employs general techniques that might be applied to a wide range of problems. Perhaps the most straightforward such technique is the 6373-dimensional 'ComParE functionals' attribute set, which uses means, standard deviations, percentile statistics (e.g. 1st, 99th), peak detection etc. to form utterance-level attributes from certain frame-level feature vectors. This feature set was developed over the years, taking its final form in 2013 (for the details, see the work of Schuller et al. [4]). Another such approach is the Bag-of-Audio-Words (or BoAW, [6, 7, 8]) method, which first clusters the input frame-level feature vectors, then assigns each frame of each utterance to one of these clusters, and uses the statistics of these clusters to construct an utterance-level feature vector. This technique has been incorporated into the Challenge baselines since 2017 [9]. Some other feature extraction methods, albeit of a general nature, have so far been employed only by certain participants, such as Fisher vectors [10, 11, 12].

The second type of approach seeks to employ task-specific techniques. Clearly, one might expect that a solution designed and fine-tuned for the actual problem at hand allows higher performance, leading to better accuracy scores; on the other hand, it takes more effort to develop. For example, Grèzes et al. calculated the ratio of speaker overlap to aid conflict intensity estimation [13]; Montacié and Caraty detected temporal events (e.g. speech onset latency, event starting time-codes, pause and phone segments) to detect cognitive load [14]; several authors extracted phone posterior-based attributes to determine the degree of nativeness or the native language of the speaker [15, 16, 17]; while Huckvale and Beke developed specific spectral-based attributes to detect whether the speaker has a cold [18]. Of course, some kind of fusion of the general and the task-specific attributes might also prove to be beneficial.

In our contribution to the ComParE 2020 Challenge [19], we apply specific task-dependent attributes. It is well known that the mental state of the subject affects several prosodic and temporal properties of his speech; specifically, emotion is strongly related to speech tempo [20, 21], and it affects the amount of hesitation as well.
This means that, by our hypothesis, by calculating temporal parameters such as articulation tempo (i.e. phones uttered per second), speech tempo and some pause-related attributes, we might be able to estimate the emotional state of the speaker. In other paralinguistic tasks, these attributes might be indicators of different speaker states, e.g. feeling uneasy when forced to speak in a surgical mask. Of course, competitive performance is probably achieved by combining these predictions with those obtained by standard approaches, such as ComParE functionals or Bag-of-Audio-Words (BoAW).

Following the Challenge guidelines (see [19]), we omit the description of the tasks, datasets and the method of evaluation, and focus on the techniques we applied. Since the Breathing sub-challenge is essentially a frame-level (or few-frame-level) task, which calls for entirely different techniques, we focus on the remaining two sub-challenges in this study: in the Mask Sub-Challenge (MSC), the task is to recognise whether the speaker was recorded while wearing a surgical mask or not, while in the Elderly Emotion Sub-Challenge (ESC) the task is to determine the affective state of subjects aged 60 or over. While the former is a binary classification task, in the Elderly Emotion Sub-Challenge both arousal and valence have to be classified as low, medium or high, so it essentially consists of two three-class classification tasks. Classification performance is measured via the Unweighted Average Recall (UAR) metric; for the Elderly Emotion task, the UAR values corresponding to arousal and valence are averaged.
2. Temporal Speech Features
Next we describe the temporal speech features we extracted from the utterances of the Elderly Emotion and Mask Sub-Challenges. We would like to note that this attribute set was based on our previous works focusing on detecting Mild Cognitive Impairment (MCI), Alzheimer's Disease (AD) and schizophrenia (SCH) (see e.g. [22, 23, 24]), with some straightforward changes: we removed the calculation of the utterance length, as it was meaningless for the short speech chunks provided for the particular sub-challenges. For the list of our temporal parameters, see Table 1.

These speech parameters rely on the concept of hesitations. The simpler form of hesitation is the silent pause: the absence of speech. However, filled pauses (sounds like "er", "um" etc.) also indicate hesitations, and can take up a significant amount of speech time. For example, Tóth et al. found that about 10% of the hesitations in a Hungarian speech database appear as filled pauses [25]. While our simplest attribute, speech tempo, corresponds to the average number of phones found in one second of the utterance, the articulation rate takes into account only those phones which are, in fact, not hesitations. The remaining attributes (i.e. (3)-(6)) all describe the amount of hesitation within speech, but in different ways. Furthermore, when we describe the amount of pauses, we can take into account only silent pauses, only filled pauses, or both; so the temporal parameters (3)-(6) can each be calculated in three variations, leading to a 14-sized attribute set.

Table 1: The examined temporal speech parameters, based on the work of Hoffmann et al. [30] and Tóth et al. [22].

(1) Speech tempo: the number of phones per second (including hesitations).
(2) Articulation rate: the number of phones per second during speech (excluding hesitations).
(3) Pause occurrence rate: the total number of pauses divided by the number of phonemes in the utterance.
(4) Pause duration rate: the total duration of pauses divided by the length of the utterance.
(5) Pause frequency: the number of pause occurrences divided by the length of the utterance.
(6) Average pause duration: the total duration of pauses divided by the number of pauses.

To calculate these temporal parameters, first we performed speech recognition; as we were interested only in these specific parameters, we decided to work at the level of phones. While completely discarding a word-level language model (even one as simple as a vocabulary) probably increases the number of errors in the ASR output, notice that we do not need to identify the phones accurately: all we need to do is count them. We need to identify only two phenomena: silences (including breath intakes and sighs) and filled pauses. In this approach, we treated filled pauses as a special 'phoneme' (yes, the quotes are there for a reason, Mr. Moore [26]); from our previous experience, we also expected that silent and filled pauses can be identified with high accuracy.

Although the speech material of both the Elderly Emotion and the Mask Sub-Challenges contained German speech, due to the absence of a German speech corpus we trained our DNN acoustic models on Hungarian speech samples. We were able to do this since both silent and filled pauses appear to be quite language-independent, and because these two languages are quite similar on the phonetic level. We used a roughly 44-hour subset of the BEA corpus [27], where the annotation included several non-verbal acoustic cues such as breath intakes, sighs, coughs, and most importantly, filled pauses. We used our custom DNN implementation [28], and a modified version of HTK [29] for decoding, using a (Hungarian) phone bi-gram as a language model.
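To make the computation of Table 1 concrete, the following is a minimal sketch of how the 14 attributes can be derived from a phone-level ASR segmentation. The segment structure and the silence / filled-pause labels are hypothetical placeholders, not the actual labels of our Hungarian recognizer.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Segment:
    label: str    # phone label from the ASR output
    start: float  # start time (seconds)
    end: float    # end time (seconds)

SILENT = {"sil"}      # hypothetical label(s) for silent pauses, breaths, sighs
FILLED = {"filler"}   # hypothetical label for the filled-pause 'phoneme'

def temporal_parameters(segments: List[Segment]) -> Dict[str, float]:
    """Derive the Table 1 attributes from a phone-level segmentation."""
    total_dur = segments[-1].end - segments[0].start
    phones = [s for s in segments if s.label not in SILENT]        # incl. filled pauses
    speech = [s for s in phones if s.label not in FILLED]          # real phones only
    speech_dur = sum(s.end - s.start for s in speech)

    feats = {
        "speech_tempo": len(phones) / total_dur,                   # (1) phones/sec, incl. hesitations
        "articulation_rate": len(speech) / max(speech_dur, 1e-9),  # (2) phones/sec during speech
    }
    # Attributes (3)-(6), each computed for silent, filled and any pauses
    variants = {"silent": SILENT, "filled": FILLED, "any": SILENT | FILLED}
    for name, labels in variants.items():
        pauses = [s for s in segments if s.label in labels]
        pause_dur = sum(s.end - s.start for s in pauses)
        n_pauses = len(pauses)
        feats[f"pause_occurrence_rate_{name}"] = n_pauses / max(len(phones), 1)   # (3)
        feats[f"pause_duration_rate_{name}"] = pause_dur / total_dur              # (4)
        feats[f"pause_frequency_{name}"] = n_pauses / total_dur                   # (5)
        feats[f"avg_pause_duration_{name}"] = pause_dur / max(n_pauses, 1)        # (6)
    return feats
```

Together this yields 2 + 4 × 3 = 14 utterance-level values per recording.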
3. General Feature Extraction Methods
Next, we briefly describe the three standard feature extractionapproaches we utilized in the ComParE 2020 Challenge.
3.1. ComParE Functionals

Firstly, we used the 6373 ComParE functionals (see e.g. [4]), extracted using the openSMILE tool [31]. The feature set includes energy, spectral, cepstral (MFCC) and voicing-related frame-level attributes, from which specific functionals (like the mean, standard deviation, percentiles and peak statistics) are computed to provide utterance-level feature values.
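As an illustration of the functional-extraction idea (not the actual openSMILE configuration, which produces the full 6373-dimensional set), a handful of such utterance-level statistics can be computed from frame-level descriptors as follows:

```python
import numpy as np

def simple_functionals(frames: np.ndarray) -> np.ndarray:
    """frames: (num_frames, num_llds) matrix of frame-level descriptors.
    Returns one fixed-length vector of utterance-level statistics per LLD,
    in the spirit of the ComParE functionals (the real set is far larger)."""
    stats = [
        frames.mean(axis=0),                      # arithmetic mean
        frames.std(axis=0),                       # standard deviation
        np.percentile(frames, 1, axis=0),         # 1st percentile
        np.percentile(frames, 99, axis=0),        # 99th percentile
        frames.max(axis=0) - frames.min(axis=0),  # range
    ]
    return np.concatenate(stats)
```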
3.2. Bag-of-Audio-Words

The BoAW approach also seeks to extract a fixed-length feature vector from a varying-length utterance [6]. Its input is a set of frame-level feature vectors such as MFCCs. In the first step, clustering is performed on these vectors, the number of clusters (N) being a parameter of the method. The list of the resulting cluster centroids forms the codebook. Next, each original feature vector is replaced by a single index representing the nearest entry in the codebook (vector quantization). Then the feature vector for the given utterance is calculated by generating a histogram of these indices, usually after some kind of normalization (e.g. in L1 normalization we divide each cluster count by the number of frames in the given utterance).

To calculate the BoAW representations, we utilized the openXBOW package [32]. We tested several codebook sizes, starting from N = 32. We employed random sampling instead of kmeans++ clustering for codebook generation [8], and employed 5 parallel cluster assignments; otherwise, our setup followed the ComParE 2020 baseline paper (i.e. [19]): we used the 65 ComParE frame-level attributes as the input after standardization, and a separate codebook was built for the first-order derivatives.

3.3. Fisher Vector Representation

The aim of the Fisher vector representation is to combine the generative and discriminative machine learning approaches by deriving a kernel from a generative model of the data [33]. First we describe the original version, developed for image representation; then we turn to the application of Fisher vectors to audio.

The main concept of the Fisher Vector (FV) representation, adapted to audio processing, is to take the frame-level feature vectors X of some corpus and model their distribution by a probability density function p(X | Θ), Θ being the parameter vector of the model. For example, when using Gaussian Mixture Models with a diagonal covariance matrix, Θ corresponds to the priors, and the mean and standard deviation vectors of the components. The Fisher score describes X by the gradient G_Θ^X of the log-likelihood function, i.e.

    G_Θ^X = (1/T) ∇_Θ log p(X | Θ).                                  (1)

This gradient describes the direction in which the model parameters (i.e. Θ) should be modified to best fit the data. The Fisher kernel between the frame-level feature vector sequences (i.e. utterances) X and Y is then defined as

    K(X, Y) = (G_Θ^X)^T F_Θ^{-1} G_Θ^Y,                              (2)

where F_Θ is the Fisher information matrix of p(X | Θ), defined as

    F_Θ = E_X [ ∇_Θ log p(X | Θ) ∇_Θ log p(X | Θ)^T ].               (3)

Expressing F_Θ^{-1} as F_Θ^{-1} = L_Θ^T L_Θ, we get the Fisher vectors as

    𝒢_Θ^X = L_Θ G_Θ^X = (1/T) L_Θ ∇_Θ log p(X | Θ).                  (4)

We used the open-source VLFeat library [34] to fit the GMMs and to extract the FV representation; we fitted Gaussian Mixture Models with several numbers of components, starting from N = 2. As the input frame-level feature vectors, we again employed the 65 ComParE frame-level attributes; following our previous experiments (e.g. [11, 35]), we also employed the first-order derivatives (i.e. the Δ values).
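For illustration, the sketch below encodes one utterance both as a BoAW histogram and as the mean-gradient part of a Fisher vector. It uses scikit-learn k-means and GMMs instead of openXBOW's random-sampling codebooks and VLFeat; the feature dimensionality, codebook size and component count (65, 32 and 2) are simply the smallest values mentioned above, and the data are stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def boaw_histogram(frames: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Quantise each frame to its nearest codeword; return the L1-normalised histogram."""
    idx = codebook.predict(frames)                       # vector quantisation
    hist = np.bincount(idx, minlength=codebook.n_clusters).astype(float)
    return hist / len(frames)                            # L1 normalisation

def fisher_vector_means(frames: np.ndarray, gmm: GaussianMixture) -> np.ndarray:
    """Gradient of the average log-likelihood w.r.t. the GMM means only
    (diagonal covariances); the core building block of the Fisher vector."""
    T = len(frames)
    gamma = gmm.predict_proba(frames)                    # (T, N) component posteriors
    parts = []
    for k in range(gmm.n_components):
        diff = (frames - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        parts.append((gamma[:, k:k + 1] * diff).sum(axis=0)
                     / (T * np.sqrt(gmm.weights_[k])))
    return np.concatenate(parts)

# Toy usage: fit the codebook / GMM on (stand-in) training frames, encode one utterance
train_frames = np.random.randn(5000, 65)                 # stand-in for ComParE LLDs
utt_frames = np.random.randn(300, 65)
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(train_frames)
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(train_frames)
boaw_vec = boaw_histogram(utt_frames, codebook)          # 32-dimensional
fv_vec = fisher_vector_means(utt_frames, gmm)            # 2 * 65 = 130-dimensional
```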
4. The Mask Sub-Challenge
Firstly, we present our experimental results on the Mask Sub-Challenge. For classification, we employed SVM with a linear kernel, using the libSVM implementation [36]; the value of C was optimized over a range of values. To combine the different approaches, following our previous works, we decided to take the weighted mean of the posterior estimates; the weights were set on the development set, with 0.05 increments.

Our results can be seen in Table 2. By using the 'ComParE functionals' feature set we got a slightly better UAR score (at least on the development set) than that reported in the baseline paper (62.6% [19]), which is probably due to the different SVM implementation used (libSVM instead of scikit-learn); Bag-of-Audio-Words, on the other hand, led to quite similar classification performance (64.2% vs. 64.5%). Unfortunately, the temporal parameters turned out to be much less beneficial: the 50.6% UAR score measured on the development set is only slightly higher than what is achievable by random guessing.

Table 2: Results for the Mask Sub-Challenge

Approach                         | Dev   | Test
Temporal parameters              | 50.6% | —
ComParE functionals              | 64.2% | —
Bag-of-Audio-Words               | 64.5% | —
Fisher Vectors                   | 67.7% | —
ComParE + Temporal               | 64.2% | —
ComParE + BoAW                   | 65.8% | —
ComParE + FV                     | 68.1% | 72.0%
ComParE + BoAW + FV              | 68.5% | 71.8%
All four attribute sets          | 68.6% | 71.6%
Best single method in [19] (test)| 63.4% | 70.8%
ComParE 2020 baseline [19]       | —     | 71.8%

When combining the approaches, the temporal parameters were not really useful either: fusing them with the 'ComParE functionals' predictions brought only an insignificant improvement (0.01%); furthermore, the ComParE + BoAW combination was also only slightly better than either individual method (ca. +1%). Fusing the ComParE functionals predictions with those of the Fisher vectors, however, led to an efficient machine learning model, achieving a UAR value of 68.1% on the development set. Adding the Bag-of-Audio-Words and the temporal feature sets to this combination did not help the prediction significantly on the development set; indeed, according to our submissions, it even decreased the UAR values. While the ComParE + FV variation achieved 72.0% on the test set, slightly outperforming the official baseline score (which was obtained by a combination of models based on their test set performance), we got 71.6% and 71.8% in the other two cases. This, in our opinion, indicates that the Fisher vector representation was quite robust on the Mask Sub-Challenge; the temporal attributes, on the other hand, were not really useful here. Bag-of-Audio-Words was found to be quite sensitive to its meta-parameters, and is, in general, less robust than either ComParE functionals or Fisher vectors [37].
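The fusion scheme described above (weighted mean of per-method posterior estimates, with weights tuned on the development set in 0.05 increments) can be sketched as follows; the function and variable names are ours, and the posterior matrices are assumed to come from probability-enabled SVMs.

```python
import numpy as np
from itertools import product

def uar(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int) -> float:
    """Unweighted Average Recall (assumes every class occurs in y_true)."""
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in range(n_classes)]))

def fuse(posteriors, weights):
    """Weighted mean of the per-method posterior matrices (utterances x classes)."""
    return sum(w * p for w, p in zip(weights, posteriors))

def search_fusion_weights(posteriors, y_dev, n_classes, step=0.05):
    """Grid-search fusion weights (summing to 1) on the development set."""
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_weights, best_score = None, -1.0
    for head in product(grid, repeat=len(posteriors) - 1):
        last = 1.0 - sum(head)
        if last < -1e-9:                       # weights must sum to 1
            continue
        weights = list(head) + [last]
        pred = fuse(posteriors, weights).argmax(axis=1)
        score = uar(y_dev, pred, n_classes)
        if score > best_score:
            best_score, best_weights = score, weights
    return best_weights, best_score
```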
5. The Elderly Emotion Sub-Challenge
This Sub-Challenge was quite different from both the Mask Sub-Challenge and most sub-challenges of the past years. The reason for this is that the organizers provided features based on the transcription of the utterances; since these make sense only for utterances larger than the standard few-seconds-long chunks, predictions had to be submitted for recordings several minutes long. Unfortunately, this also meant that the training, development and test sets each consisted of only 87 (albeit long) utterances. Another, although minor, difference was the presence of two subtasks (i.e. arousal and valence).

On the technical level, this affected some parts of our classification framework as well. We decided to discard the chunks provided by the organizers, and focused on the longer recordings (which we reconstructed by simply merging the 5-second-long chunks). To compensate for the significantly fewer examples, we used 10-fold (speaker-independent) cross-validation for meta-parameter setting instead of relying on the provided development set; test set predictions were made with the same SVM models. On the other hand, to meet the Challenge guidelines, we repeated all experiments with the provided train-dev setup.

Similarly to the Mask Sub-Challenge, we used the same libSVM implementation (with the same C values tested). On the other hand, as the distribution of the Low, Medium and High class labels was somewhat imbalanced, we decided to opt for downsampling. Since downsampling shrinks our already small training sets even further, we decided to repeat SVM training 100 times for each training fold; therefore, for each feature set and for each C value, we trained 1000 models. Model fusion was done by simply taking the (unweighted) mean of the predicted posterior values.

Table 3: Results for the Elderly Emotion Sub-Challenge

Approach                      | Arousal: Dev. / CV / Test | Valence: Dev. / CV / Test
Temporal parameters           | 41.8% / 39.5% / —         | 33.3% / 34.9% / —
ComParE functionals           | 35.4% / 38.4% / —         | 39.1% / 37.7% / —
BERT embeddings (A+V)         | 35.0% / 41.0% / —         | 49.1% / 64.0% / —
Fisher Vectors                | 37.8% / 44.1% / —         | — / — / —
ComParE + Temporal parameters | 46.8% / 40.5% / 42.7%     | 43.7% / 37.7% / —
ComParE + BERT                | 38.7% / 42.8% / —         | 51.6% / 64.5% / —
ComParE + Fisher Vectors      | 37.8% / 44.1% / —         | — / — / —
BERT + Temporal parameters    | 45.3% / 44.7% / —         | 49.9% / 64.5% / 47.8%
All attributes                | 49.1% / 52.8% / 53.2%     | — / — / 32.4%
ComParE 2020 baseline [19]    | — / — / 49.8%             | — / — / 49.0%

Our results can be seen in Table 3. First, notice that the CV and the development set-level UAR scores do not always display the same tendencies. In our opinion, this is due to the extremely small corpus: having only 87 utterances per set carries the risk of being insufficient even for measuring the classification performance reliably. (Of course, this follows from the attempt to provide BERT embeddings as features, which make sense only for larger utterances.) Unfortunately, this also means that setting the meta-parameters of the different methods may prove challenging, which coincides with our experience: we found that the optimal meta-parameters (and the corresponding accuracy / UAR values) set in the classic "training set + development set" set-up differed greatly from those set in ten-fold cross-validation.

Regarding the arousal values, all tested approaches proved useful even on their own, achieving a competitive performance; among them, the temporal parameters and the Fisher vectors seem to be the most effective techniques. For valence, however, the linguistic attributes (i.e. the BERT embeddings) seem to be unmatched: no other method even came close to the 49.1% (development set) and 64.0% (cross-validation) UAR scores. This is logical, though, as the other approaches are all acoustic ones, and therefore not really suitable for detecting valence [19].
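The downsampling-and-averaging scheme described above can be sketched as follows; we used libSVM, but scikit-learn's SVC (which wraps libSVM) is used here for brevity, and all names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def downsample(X, y, rng):
    """Randomly reduce every class to the size of the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([rng.choice(np.where(y == c)[0], n_min, replace=False)
                          for c in classes])
    return X[idx], y[idx]

def averaged_posteriors(X_train, y_train, X_test, C, n_repeats=100, seed=0):
    """Train n_repeats linear SVMs on different downsampled training sets
    and average their predicted class posteriors for the test utterances."""
    rng = np.random.default_rng(seed)
    total = None
    for _ in range(n_repeats):
        Xd, yd = downsample(X_train, y_train, rng)
        clf = SVC(kernel="linear", C=C, probability=True).fit(Xd, yd)
        proba = clf.predict_proba(X_test)
        total = proba if total is None else total + proba
    return total / n_repeats
```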
6. Conclusions
For our contribution to the Interspeech 2020 Computational Paralinguistics Challenge, we first experimented with features derived from speech tempo. Our motivation was that emotion is reported to be related to articulation tempo (i.e. the number of phones uttered per second), and it affects the amount of hesitation as well. To this end, we employed ASR techniques and extracted the articulation rate, the speech tempo and 12 further attributes describing the amount of hesitation in the utterance. According to our experimental results on the development set, this attribute set is not really useful for detecting whether the speaker is wearing a surgical mask, as the UAR score of 50.6% attained is only slightly above the chance level achievable via random guessing; on the other hand, in the arousal subtask of the Elderly Emotion Sub-Challenge it led to UAR values similar to those of the other methods described in the baseline paper.

Besides these custom features, we also applied standard methods like Bag-of-Audio-Words and Fisher vectors, and combined our predictions with those obtained using the standard 'ComParE functionals' attribute set and, in the case of the Elderly Emotion Sub-Challenge, the various BERT embeddings. These methods and their combinations proved to be quite useful on the development sets, and for the Mask Sub-Challenge we even managed to outperform the official baseline score, which itself is also a combination of four approaches.

For the Elderly Emotion Sub-Challenge, we found the tested temporal attributes helpful for the arousal task; for valence, however, they were less useful. In general, the low number of training, development and test instances made Elderly Emotion a particularly challenging task; in the end, we managed to obtain a mean UAR score of 50.5%, but this is probably not much higher than what could be achieved by random guessing (and saving the fifth submission to pair up the best arousal and valence "predictions").
7. Acknowledgements
This research was partially supported by grant TUDFO/47138-1/2019-ITM of the Ministry for Innovation and Technology, Hungary, and by the National Research, Development and Innovation Office of Hungary via contract NKFIH FK-124413. L. Tóth and G. Gosztolya were also funded by the János Bolyai Scholarship of the Hungarian Academy of Sciences and by the Hungarian Ministry of Innovation and Technology New National Excellence Program ÚNKP-19-4.
8. References

[1] B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 emotion challenge," in Proceedings of Interspeech, Brighton, United Kingdom, Sep 2009, pp. 312–315.
[2] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and S. S. Narayanan, "The INTERSPEECH 2010 paralinguistic challenge," in Proceedings of Interspeech, Makuhari, Chiba, Japan, Sep 2010, pp. 2794–2797.
[3] B. Schuller, S. Steidl, A. Batliner, F. Schiel, and J. Krajewski, "The INTERSPEECH 2011 speaker state challenge," in Proceedings of Interspeech, Florence, Italy, Aug 2011, pp. 3201–3204.
[4] B. W. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, H. Salamin, A. Polychroniou, F. Valente, and S. Kim, "The Interspeech 2013 Computational Paralinguistics Challenge: Social signals, Conflict, Emotion, Autism," in Proceedings of Interspeech, Lyon, France, Sep 2013, pp. 148–152.
[5] B. Schuller, S. Steidl, A. Batliner, S. Hantke, F. Hönig, J. R. Orozco-Arroyave, E. Nöth, Y. Zhang, and F. Weninger, "The INTERSPEECH 2015 computational paralinguistics challenge: Nativeness, Parkinson's & eating condition," in Proceedings of Interspeech, 2015, pp. 478–482.
[6] S. Pancoast and M. Akbacak, "Bag-of-Audio-Words approach for multimedia event classification," in Proceedings of Interspeech, Portland, OR, USA, Sep 2012, pp. 2105–2108.
[7] F. B. Pokorny, F. Graf, F. Pernkopf, and B. W. Schuller, "Detection of negative emotions in speech signals using bags-of-audio-words," in Proceedings of ACII, Sep 2015, pp. 1–5.
[8] M. Schmitt, F. Ringeval, and B. Schuller, "At the border of acoustics and linguistics: Bag-of-Audio-Words for the recognition of emotions in speech," in Proceedings of Interspeech, San Francisco, CA, USA, 2016, pp. 495–499.
[9] B. Schuller, S. Steidl, A. Batliner, S. Hantke, E. Bergelson, J. Krajewski, C. Janott, A. Amatuni, M. Casillas, A. Seidl, M. Soderstrom, A. S. Warlaumont, G. Hidalgo, S. Schnieder, C. Heiser, W. Hohenhorst, M. Herzog, M. Schmitt, K. Qian, Y. Zhang, G. Trigeorgis, P. Tzirakis, and S. Zafeiriou, "The INTERSPEECH 2017 computational paralinguistics challenge: Addressee, Cold & Snoring," in Proceedings of Interspeech, 2017, pp. 3442–3446.
[10] H. Kaya, A. A. Karpov, and A. A. Salah, "Fisher Vectors with cascaded normalization for paralinguistic analysis," in Proceedings of Interspeech, 2015, pp. 909–913.
[11] G. Gosztolya, "Using Fisher Vector and Bag-of-Audio-Words representations to identify Styrian dialects, sleepiness, baby & orca sounds," in Proceedings of Interspeech, Graz, Austria, Sep 2019, pp. 2413–2417.
[12] H. Wu, W. Wang, and M. Li, "The DKU-LENOVO systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge," in Proceedings of Interspeech, Graz, Austria, Sep 2019, pp. 2433–2437.
[13] F. Grèzes, J. Richards, and A. Rosenberg, "Let me finish: Automatic conflict detection using speaker overlap," in Proceedings of Interspeech, Lyon, France, Sep 2013, pp. 200–204.
[14] C. Montacié and M.-J. Caraty, "High-level speech event analysis for cognitive load classification," in Proceedings of Interspeech, Singapore, Sep 2014, pp. 731–735.
[15] ——, "Phrase accentuation verification and phonetic variation measurement for the degree of nativeness sub-challenge," in Proceedings of Interspeech, Dresden, Germany, Sep 2015, pp. 483–487.
[16] P. G. Shivakumar, S. N. Chakravarthula, and P. Georgiou, "Multimodal fusion of multirate acoustic, prosodic, and lexical speaker characteristics for native language identification," in Proceedings of Interspeech, San Francisco, CA, USA, Sep 2016, pp. 2408–2412.
[17] G. Gosztolya, T. Grósz, R. Busa-Fekete, and L. Tóth, "Determining native language and deception using phonetic features and classifier combination," in Proceedings of Interspeech, San Francisco, CA, USA, Sep 2016, pp. 2418–2422.
[18] M. Huckvale and A. Beke, "It sounds like you have a cold! Testing voice features for the Interspeech 2017 Computational Paralinguistics Cold Challenge," in Proceedings of Interspeech, Stockholm, Sweden, Aug 2017, pp. 3447–3451.
[19] B. W. Schuller, A. Batliner, C. Bergler, E.-M. Messner, A. Hamilton, S. Amiriparian, A. Baird, G. Rizos, M. Schmitt, L. Stappen, H. Baumeister, A. D. MacIntyre, and S. Hantke, "The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly emotion, Breathing & Masks," in Proceedings of Interspeech, Shanghai, China, Sep 2020, 5 pages, to appear.
[20] S. Arnfield, P. Roach, J. Setter, P. Greasley, and D. Horton, "Emotional stress and speech tempo variation," in Proceedings of Speech Under Stress, Lisbon, Portugal, Sep 1995, pp. 13–15.
[21] A. Braun and R. Oba, "Speaking tempo in emotional speech – a cross-cultural study using dubbed speech," in Proceedings of ParaLing, Saarbrücken, Germany, 2007, pp. 77–82.
[22] L. Tóth, I. Hoffmann, G. Gosztolya, V. Vincze, G. Szatlóczki, Z. Bánréti, M. Pákáski, and J. Kálmán, "A speech recognition-based solution for the automatic detection of mild cognitive impairment from spontaneous speech," Current Alzheimer Research, vol. 15, no. 2, pp. 130–138, 2018.
[23] G. Gosztolya, V. Vincze, L. Tóth, M. Pákáski, J. Kálmán, and I. Hoffmann, "Identifying Mild Cognitive Impairment and mild Alzheimer's disease based on spontaneous speech using ASR and linguistic features," Computer Speech & Language, vol. 53, pp. 181–197, 2019.
[24] G. Gosztolya, A. Bagi, S. Szalóki, I. Szendi, and I. Hoffmann, "Identifying schizophrenia based on temporal parameters in spontaneous speech," in Proceedings of Interspeech, Hyderabad, India, Sep 2018, pp. 3408–3412.
[25] L. Tóth, G. Gosztolya, V. Vincze, I. Hoffmann, G. Szatlóczki, E. Biró, F. Zsura, M. Pákáski, and J. Kálmán, "Automatic detection of mild cognitive impairment from spontaneous speech using ASR," in Proceedings of Interspeech, Dresden, Germany, Sep 2015, pp. 2694–2698.
[26] R. K. Moore and L. Skidmore, "On the use/misuse of the term 'phoneme'," in Proceedings of Interspeech, Graz, Austria, Sep 2019, pp. 2340–2344.
[27] T. Neuberger, D. Gyarmathy, T. E. Gráczi, V. Horváth, M. Gósy, and A. Beke, "Development of a large spontaneous speech database of agglutinative Hungarian language," in Proceedings of TSD, Brno, Czech Republic, Sep 2014, pp. 424–431.
[28] L. Tóth, "Phone recognition with hierarchical Convolutional Deep Maxout Networks," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 25, pp. 1–13, 2015.
[29] S. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland, The HTK Book. Cambridge, UK: Cambridge University Engineering Department, 2006.
[30] I. Hoffmann, D. Németh, C. D. Dye, M. Pákáski, T. Irinyi, and J. Kálmán, "Temporal parameters of spontaneous speech in Alzheimer's disease," International Journal of Speech-Language Pathology, vol. 12, no. 1, pp. 29–34, 2010.
[31] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: The Munich versatile and fast open-source audio feature extractor," in Proceedings of ACM Multimedia, 2010, pp. 1459–1462.
[32] M. Schmitt and B. Schuller, "openXBOW – introducing the Passau open-source crossmodal Bag-of-Words toolkit," The Journal of Machine Learning Research, vol. 18, pp. 1–5, 2017.
[33] T. S. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," in Proceedings of NIPS, Denver, CO, USA, Dec 1999, pp. 487–493.
[34] A. Vedaldi and B. Fulkerson, "VLFeat: an open and portable library of computer vision algorithms," in Proceedings of ACM Multimedia, 2010, pp. 1469–1472.
[35] G. Gosztolya, "Using the Fisher Vector representation for audio-based emotion recognition," Acta Polytechnica Hungarica, vol. 17, no. 6, pp. 7–23, 2020.
[36] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 1–27, 2011.
[37] G. Gosztolya, T. Grósz, and L. Tóth, "General utterance-level feature extraction for classifying crying sounds, atypical & self-assessed affect and heart beats," in Proceedings of Interspeech, Hyderabad, India, Sep 2018.