Aalto's End-to-End DNN systems for the INTERSPEECH 2020 Computational Paralinguistics Challenge
Tamás Grósz, Mittul Singh, Sudarsana Reddy Kadiri, Hemant Kathania, Mikko Kurimo
Department of Signal Processing and Acoustics, Aalto University, Finland [email protected]
Abstract
End-to-end (E2E) neural network models have shown significant performance benefits on different INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model for a task or the same E2E architecture for different tasks. However, applying a single model is unstable, and using the same architecture under-utilizes task-specific information. On the ComParE 2020 tasks, we investigate applying an ensemble of E2E models for robust performance and developing task-specific modifications for each task. ComParE 2020 introduces three sub-challenges: the breathing sub-challenge to predict the output of a respiratory belt worn by a patient while speaking, the elderly sub-challenge to estimate the elderly speaker's arousal and valence levels, and the mask sub-challenge to classify whether the speaker is wearing a mask or not. On each of these tasks, an ensemble outperforms the single E2E model. On the breathing sub-challenge, we study the impact of multi-loss strategies on task performance. On the elderly sub-challenge, predicting the valence and arousal levels prompts us to investigate multi-task training and implement data sampling strategies to handle class imbalance. On the mask sub-challenge, an E2E system without feature engineering is competitive with the feature-engineered baselines and provides substantial gains when combined with them.
Index Terms: computational paralinguistics, DNN, end-to-end, ensemble learning
1. Introduction
INTERSPEECH's Computational Paralinguistics Challenges (ComParE) have regularly introduced the speech community to exciting new challenges since 2009. These challenges, set up as prediction tasks, focus on extracting important speaker-related information from the audio signal. In more than a decade of ComParE, researchers have come up with innovative solutions to these challenges. These efforts can broadly be divided into two categories: feature-engineering-based approaches [1, 2, 3, 4, 5, 6, 7, 8, 9] and deep-learning-based end-to-end approaches [10, 11, 12, 13, 14]. Feature-engineering approaches have concentrated on extracting task-specific features to be utilized by classifiers for prediction. End-to-end approaches, on the other hand, have focused on applying complex neural network architectures to bypass feature engineering. There might not be a clear winner between the two [10], but combining them has emerged as a trend. Prior work has used Deep Neural Networks (DNNs) only to extract useful features automatically; these features are then used to train a simple classifier. Two such systems, auDeep [3] and DeepSpectrum [4], are already part of this year's baseline method.
For end-to-end (E2E) approaches, prior work has concentrated on applying single-model systems to the prediction tasks [15, 10, 12, 13]. A single neural network, being a non-linear model, can have high variance and thus produce unstable results. Prior work has also concentrated on applying the same architecture across different tasks; this one-size-fits-all policy ignores task-specific requirements that can be exploited in model design [10]. Hence, we study the application of E2E models to obtain a more robust but task-specific solution. We build ensemble-based E2E systems to obtain robust results across the different ComParE 2020 tasks; utilizing multiple models instead of one shows better performance across all the tasks. Besides, we also study task-specific requirements and explore incorporating them into the E2E solution.
ComParE 2020 poses three new challenges [16] to the community: 1) the breathing sub-challenge, to predict the output signal of a respiratory belt worn by the speaker; 2) the elderly sub-challenge, to classify the arousal (A) and valence (V) level of elderly speakers; and 3) the mask sub-challenge, to predict whether the speaker wears a mask or not while they speak.
For the breathing sub-challenge, the E2E baseline system optimizes Pearson's correlation coefficient, which is also the task metric, to solve a regression problem. However, the E2E predictions do not match the scale of the ground truth. To alleviate this scale issue, we study multi-loss strategies for our E2E model, where we optimize Pearson's coefficient along with the mean squared error. The elderly sub-challenge entails learning two closely related tasks (arousal and valence level prediction), so as a natural choice we explore multi-task learning. This sub-challenge also faces the issue of imbalanced class data, and we apply sampling schemes to augment the data and reduce the class imbalance. In the mask sub-challenge, our single E2E model performs better than the best baseline result. On further investigation, we analyze the trained models to understand which frequency bands hold the largest importance. Our findings lead us to create low-frequency band features. A fusion of the baseline with our E2E models, including these features, results in a substantial performance gain.
We also include our plans to explore and expand the presented experiments; we hope to include those results in an extended version of this article.
2. Methods
In this section, we describe how the end-to-end systems are used in an ensemble learning scheme. We also present task-specific modifications to capture task requirements in the end-to-end system. Keep in mind that this paper describes our solutions for a competition, so we broke with the tradition of using only a few techniques. Instead, we used several to get the best results. Still, we did our best to measure the impact of each modification on the development data and tested only the best ones.
2.1. End-to-end systems

End-to-end learning is an emerging paradigm within deep learning. Researchers across various fields have adopted this paradigm, supported by the availability of large data and powerful computational resources. In theory, end-to-end systems are built to replace traditional pipeline-based solutions with a single deep neural network, which allows the complete model to be trained in a single optimization step. They also hold the promise of bypassing the laborious feature-engineering step by having a single system solve every aspect of the prediction problem; in practice, however, these systems are often built on top of existing features. The advantages of this paradigm make it an attractive choice for ComParE tasks.
In our experiments, we employ the same DNN model architecture for the elderly and mask sub-challenges. For the breathing sub-challenge, we use a different DNN architecture based on the baseline system. We describe the details of these end-to-end model architectures in Section 3.1. Our models process either spectral input features or, in the case of the breathing task, raw audio signals. The DNNs can then be optimized directly to perform the given task. This single-model approach allowed us to quickly adapt the general framework to the specialties of the sub-challenges.
2.2. Ensemble learning

DNNs are known to be sensitive to their random initialization, and our experiments confirm this. The issue is especially severe when the amount of training data is limited, which is usually the case for paralinguistic tasks. A solution to this problem is ensemble learning: we train several differently initialized DNNs and then combine their predictions to get stable and even better results.
Here, we employ a specific bootstrap-aggregation method called bagging. Originally, bagging trains each model on only a random subset of the training data to produce diverse systems. As the training data is already limited, we decided to use all available data during training and rely on random initialization and data shuffling to produce a diverse set of DNNs. In the combination, we average the outputs of the differently initialized DNNs to make the final prediction, as illustrated in the sketch below.
Ensemble learning can also be performed with other approaches. For our mask sub-challenge experiments, we perform an equal-weighted soft-voting combination of baseline prediction systems, such as Support Vector Machines (SVMs), with our ensemble of DNNs.
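To make the combination concrete, the following is a minimal sketch of the training and averaging scheme described above. It assumes a hypothetical build_model() factory that returns a freshly initialized, compiled Keras model; the function names and training settings are illustrative, not our exact code.

import numpy as np

def train_ensemble(build_model, x_train, y_train, n_models=10, epochs=50):
    """Train n_models copies of the same architecture. Each call to
    build_model() yields a different random initialization, and data
    shuffling (the Keras default) adds further diversity; unlike classic
    bagging, every member sees all of the training data."""
    models = []
    for _ in range(n_models):
        model = build_model()
        model.fit(x_train, y_train, epochs=epochs, shuffle=True, verbose=0)
        models.append(model)
    return models

def ensemble_predict(models, x):
    """Bagging-style combination: average the member outputs."""
    predictions = np.stack([m.predict(x) for m in models], axis=0)
    return predictions.mean(axis=0)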
2.3. Multi-loss training

Training an end-to-end system does not have to be restricted to a single loss function. Often, multiple losses are considered to focus on multiple aspects of the prediction problem; this technique also helps regularize training.
For the breathing sub-challenge, the end-to-end baseline system is trained with a correlation-based loss. However, this loss does not bound the outputs to the same scale as the labels. To match the outputs' scale to the labels, we use a combination of the correlation loss and the mean squared error (MSE), which can also help regularize the end-to-end baseline system. A sketch of the combined loss follows.
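The sketch below is a minimal TensorFlow version of such a combined loss: a negative Pearson correlation term plus a weighted MSE term (the weight of 0.1 is the value used in our experiments, see Section 3.2). This is our own illustrative formulation, not the baseline system's code.

import tensorflow as tf

def pearson_loss(y_true, y_pred):
    """Negative Pearson correlation; minimizing it maximizes correlation."""
    yt = y_true - tf.reduce_mean(y_true)
    yp = y_pred - tf.reduce_mean(y_pred)
    cov = tf.reduce_sum(yt * yp)
    denom = tf.sqrt(tf.reduce_sum(tf.square(yt)) * tf.reduce_sum(tf.square(yp)))
    return -cov / (denom + 1e-8)  # small epsilon guards against division by zero

def corr_plus_mse(alpha=0.1):
    """Correlation loss regularized by an MSE term, which also pulls the
    predictions onto the scale of the labels."""
    def loss(y_true, y_pred):
        mse = tf.reduce_mean(tf.square(y_true - y_pred))
        return pearson_loss(y_true, y_pred) + alpha * mse
    return loss

# Usage: model.compile(optimizer="adam", loss=corr_plus_mse(alpha=0.1))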
2.4. Multi-task learning

Multi-task learning trains a single model to perform multiple tasks simultaneously. Recent work [17] has shown the benefits of this scheme for paralinguistic tasks. Intuitively, multi-task learning's unified model acts as a form of data augmentation by sharing information relevant to one task with the other. This intuition is especially relevant in the case of the elderly sub-challenge: the arousal and valence levels are two related dimensions describing the emotional experience of the speaker. Thus, we experiment with a single end-to-end model trained to predict the arousal and valence levels in a joint framework, sketched below.
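As a minimal sketch, the joint model attaches two task-specific heads to a shared trunk, here assumed to be a Keras model ending at the LSTM layer of the conv+LSTM front-end described in Section 3.1. The hidden-layer width follows Section 3.1, while the class counts are illustrative assumptions.

from tensorflow.keras import layers, Model

def add_task_heads(trunk, n_classes_a=3, n_classes_v=3):
    """Split the network after the shared recurrent trunk: each task
    gets its own hidden and output layers."""
    # Arousal head
    a = layers.Dense(100, activation="relu")(trunk.output)
    arousal = layers.Dense(n_classes_a, activation="softmax", name="arousal")(a)
    # Valence head
    v = layers.Dense(100, activation="relu")(trunk.output)
    valence = layers.Dense(n_classes_v, activation="softmax", name="valence")(v)
    model = Model(trunk.input, [arousal, valence])
    model.compile(optimizer="adam",
                  loss={"arousal": "sparse_categorical_crossentropy",
                        "valence": "sparse_categorical_crossentropy"})
    return model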
2.5. Resampling methods

In the elderly sub-challenge, we observe a class imbalance problem. Having over-represented classes in the data is a common problem for paralinguistic tasks [15]. To address the imbalance, we choose two sampling techniques: upsampling and probabilistic sampling [15]. Upsampling is a simple method that repeats the under-represented examples until the data becomes balanced. Probabilistic sampling applies a more rigorous approach: it defines a desired class distribution and, during training, selects examples in such a way that the overall distribution of the training data fits the desired one. This new distribution is a linear combination of the original distribution and a uniform one, with λ and 1 − λ being the respective coefficients.
These resampling methods are easy to use; however, we had to adapt them to work in a multi-task setup. To upsample, we created clusters of examples sharing the same label pair and resampled so that each cluster would have the same amount of training data. Although this adaptation does not ensure that the individual tasks have balanced data, in practice it works quite well, as shown in Section 3.3. A similar modification can be applied when using probabilistic sampling in a multi-task setting: first, we generate the desired distribution for each task; then, during training, we select a label pair that fits the distribution and use a training instance with those labels (a minimal sketch of this scheme is given at the end of this section).

2.6. Low-frequency features

For the mask sub-challenge, we hypothesize that wearing a mask changes the resonance conditions in the vocal tract, as the mask might reflect some of the frequencies back into the tract [18, 19]. To test this hypothesis, we look at the output gradients with respect to the inputs and plot them per input frequency band in Figure 1. We notice that the end-to-end models have large gradients for the ten lowest frequency bands. Based on this observation, we compute features from low-frequency information: specifically, we extract Mel-spectrogram features with 200 filter banks and then use the ten lowest filter banks as input features, referred to as lowest-10-features.
As a pre-processing step for extracting these features, we also examine enhancing the lower frequencies by manipulating the input audio. We apply low-frequency enhancing schemes: pre-emphasizing the audio and passing it through a fifth-order low-pass Butterworth filter [20], denoted as preemphasis+butterworth. These schemes can allow the Mel-spectrogram to better represent the information relevant to this task, which depends on the low-frequency bands.
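The multi-task probabilistic sampling of Section 2.5 can be sketched as follows. Each (arousal, valence) label pair is encoded as a single class id, mirroring the clusters used for upsampling; the helper names are illustrative, not our exact implementation.

import numpy as np

def target_distribution(pair_ids, lam):
    """Desired class distribution: lam * original + (1 - lam) * uniform."""
    classes, counts = np.unique(pair_ids, return_counts=True)
    original = counts / counts.sum()
    uniform = np.full(len(classes), 1.0 / len(classes))
    return classes, lam * original + (1.0 - lam) * uniform

def sample_indices(pair_ids, lam, n_samples, seed=0):
    """Draw training indices so that label pairs follow the target
    distribution: first pick a label pair, then a matching example."""
    rng = np.random.default_rng(seed)
    classes, target = target_distribution(pair_ids, lam)
    indices = []
    for _ in range(n_samples):
        c = rng.choice(len(classes), p=target)          # pick a label pair
        candidates = np.flatnonzero(pair_ids == classes[c])
        indices.append(rng.choice(candidates))          # pick an example of it
    return np.array(indices)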
Figure 1: The figure shows the gradient magnitudes of the DNN outputs with respect to the input spectral features on the Mask sub-challenge. The left two images show the gradients of two random models for a single audio file, and the right two images show the gradients of the same two models averaged over all training files. A bright yellow shade represents the largest gradient magnitudes, seen for the lowest frequencies.
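A minimal sketch of the lowest-10-features extraction with the preemphasis+butterworth preprocessing, using librosa and scipy. The pre-emphasis coefficient and the Butterworth cutoff frequency below are placeholder values, since the exact settings are not specified above.

import numpy as np
import librosa
from scipy.signal import butter, lfilter

def lowest10_features(path, sr=16000, preemph=0.97, cutoff_hz=500, n_mels=200):
    """Pre-emphasize, low-pass filter (fifth-order Butterworth), compute a
    200-band Mel-spectrogram, and keep the ten lowest filter banks.
    preemph and cutoff_hz are placeholder values."""
    audio, _ = librosa.load(path, sr=sr)
    audio = np.append(audio[0], audio[1:] - preemph * audio[:-1])  # pre-emphasis
    b, a = butter(5, cutoff_hz / (sr / 2), btype="low")            # fifth-order low-pass
    audio = lfilter(b, a, audio)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)[:10]  # the ten lowest filter banks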
3. Experiments and results
3.1. Experimental setup

For the elderly and mask sub-challenges, we extract Mel-spectrograms from the audio files as inputs, in a fashion similar to the auDeep [3] pipeline. For the breathing sub-challenge, in contrast, the raw audio is used directly as input, as it leads to better results than Mel-spectrograms.
In our experiments, we use two different end-to-end systems. For the elderly and mask sub-challenges, the spectral input is first processed by a 1D convolutional layer with 100 neurons, and then a recurrent layer containing 100 LSTM cells accumulates the outputs of the filters. We pass the outputs of the recurrent layer to a feedforward layer (100 rectified linear units) and then apply a classification layer; a sketch of this single-task architecture follows below. In the multi-task experiments, we split the structure after the LSTM layer, passing the recurrent layer's output to a separate set of hidden and output layers for each task. For the breathing sub-challenge, we opted for the same structure as the best baseline system; for details, see [16]. For training, we employed the Keras framework with TensorFlow as the backend.
For all tasks, we use ensemble learning. For the mask sub-challenge, we obtained the best results using 50 models, while for the other tasks, ten models were enough to reach peak performance. After training the individual models, we averaged their outputs to create the final predictions.
For evaluation on the test set, we train our models on the combined training and development sets. We note that the ComParE challenge restricts the number of submissions per team and task to five evaluations on the test set. As the competition is ongoing, we have only used a few of the available submissions to check the best systems so far. This limitation implies that we cannot test all of our methods; in the result tables, we use a question mark (?) to indicate solutions not yet evaluated on the test data.
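The single-task architecture just described can be sketched in Keras as follows. Only the layer widths are specified above; the convolution kernel size, optimizer, and loss below are illustrative assumptions.

from tensorflow.keras import layers, Model

def build_single_task_model(n_frames, n_bands, n_classes):
    """1D convolution (100 filters) -> LSTM (100 cells) ->
    feedforward layer (100 ReLUs) -> classification layer."""
    inputs = layers.Input(shape=(n_frames, n_bands))  # Mel-spectrogram input
    x = layers.Conv1D(100, kernel_size=5, activation="relu")(inputs)  # kernel size assumed
    x = layers.LSTM(100)(x)                      # accumulates the filter outputs over time
    x = layers.Dense(100, activation="relu")(x)  # feedforward layer
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")  # assumed settings
    return model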
Table 1: The table presents the breathing task's Pearson correlation scores for single end-to-end models on the development set (Dev). As we trained ten different models, we present the average and the best result.
System          Avg. per DNN (Dev)   Best DNN (Dev)
E2E-corr        .506                 .514
E2E-MSE         .467                 .481
E2E-corr+MSE    .497                 .521
Table 2: The table presents the ensemble E2E models' performance on the breathing task for different loss functions.
System (loss function)   Dev Corr   Dev MSE   Test Corr
Baseline (E2E) [16]      .507       1.682     .731
E2E-corr                 .523       .896      .759
E2E-MSE                  .480       .028      ?
E2E-corr+MSE             .514       .180      .751
3.2. Breathing sub-challenge

On this task, we used the end-to-end baseline system as the basis for further development, as it performs quite well [16]. However, it suffers from a mismatch between the scale of the end-to-end predictions and that of the output labels. To alleviate this mismatch, we apply a multi-loss scheme, using an MSE-based loss to regularize the baseline correlation loss.
In Table 1, we compare the single-loss and multi-loss strategies. The single-loss models use either Pearson's correlation (corr) or the MSE. The multi-loss strategy combines the two losses (corr+MSE) with a regularization weight of 0.1. The correlation-based E2E model (E2E-corr) performs best when averaging the correlation values of ten randomly initialized DNNs. The best single result corresponds to the corr+MSE-based E2E model; however, its averaged results are lower, suggesting that this value is unreliable. We suspect that further tuning of the regularization weight is required, and we hope to complete this analysis as part of our future work.
In Table 2, we present the results of 10-model ensembles compared with the baseline performance. Even though the baseline system produced high correlation values, it had the highest MSE. Combining the predictions of ten models (E2E-corr) reduced the MSE significantly and outperformed the baseline results. Using the MSE as the loss function performed worst in terms of correlation but, naturally, produced the lowest MSE. Lastly, the multi-loss ensemble (E2E-corr+MSE) drops in correlation compared to the E2E-corr ensemble because of the MSE regularization; in terms of MSE, however, it is much better. On the evaluation set, E2E-corr+MSE was also slightly worse than the E2E-corr ensemble in terms of overall correlation. Nevertheless, our E2E-corr ensemble outperforms the baseline result, with an absolute improvement of 2.8 correlation points.
3.3. Elderly sub-challenge

The elderly task presents a prediction problem with class imbalance: for valence prediction, 44 out of the 87 stories have a medium-valence label. Upon inspecting some of our initial models, we observed that the output predictions favour the over-represented classes. To cope with this issue, we apply the resampling methods described in Section 2.5.
Table 3 presents the ensemble E2E models evaluated on the development and test sets.
Table 3: UAR values for predicting Arousal (A) and Valence (V) levels on the elderly sub-challenge. The E2E systems combined ten DNNs to produce the predictions.
System                              Dev (A/V)    Test (A/V)
Baseline (linguistic) [16]          40.6/–       –
Baseline (acoustic) [16]            35.0/31.6    –/40.3
E2E (single task)                   35.0/39.7    ?
E2E (multitask)                     39.5/39.7    ?
E2E (single task + upsampl.)        39.8/41.5    ?
E2E (multitask + upsampl.)          –/42.4       38.0/39.5
E2E (single task + prob. sampl.)    35.6/39.6    ?
E2E (multitask + prob. sampl.)      40.0/–       ?

Probabilistic sampling with a well-chosen λ was very beneficial for the valence sub-task. The multi-task models consistently outperformed the single-task ones. For the two best systems, we also checked the performance of the individual DNNs and saw that ensemble learning is essential for good performance: a single multi-task DNN with upsampling yielded 38.4%/36.4% (A/V) on average, and with probabilistic sampling we got 36.6%/38.5%.
Unfortunately, the test results are below the official baseline. The considerable difference between the scores on the development and test data suggests that our model overfits when trained on the combined train+development set for evaluation. We also suspect a significant mismatch between the development and test data in this sub-challenge. Strong evidence for this can be found in the baseline paper [16], where the test performance does not correlate with the scores achieved on the development set. The official acoustic baseline model (DeepSpectrum+SVM) produces almost the worst results on the development set, and the difference between its development and test scores is large. This observation suggests that a model tuned on the development data might not be the best model for the evaluation set.

3.4. Mask sub-challenge

Training a 50-model ensemble, we saw that the averaged prediction significantly outperformed our single E2E model and the individual baseline system (auDeep-fused). Our individual E2E models achieved 66.0% UAR on average, while their combination reached 68.0% (E2E). The best individual baseline uses auDeep-based features in an SVM system; our E2E ensemble outperformed this model on both the development and test sets, as shown in Table 4. However, our ensemble E2E model is outperformed by the fusion of the best baseline models, an SVM based on auDeep-fused, Bag-of-Audio-Words, OpenSMILE and DeepSpectrum features [16].
In Section 2.6, we observed that the lower frequency bands of the audio hold important information for the mask sub-challenge. Based on this observation, we applied preemphasis+butterworth to the input audio and then extracted the lowest-10-features to build an E2E ensemble (E2E lowest-10-features). This ensemble outperformed the E2E ensemble built with only the preprocessed input audio (E2E preemphasis+butterworth) but was worse than our vanilla ensemble. Combining the regular ensemble (E2E) and E2E lowest-10-features fared better, resulting in a slight improvement over the vanilla ensemble.
Table 4: UAR values on the Mask Sub-Challenge. The E2E solutions fused 50 models to get the final output.
System                                         Dev     Test
auDeep-fused (baseline) [16]                   64.4    66.6
Fusion of the bests (baseline) [16]            –       71.8
E2E                                            68.0    69.9
E2E preemphasis+butterworth                    59.3    ?
E2E lowest-10-features                         62.9    ?
E2E + E2E lowest-10-features                   68.6    ?
E2E + E2E lowest-10-features + baseline        –       75.6

We combined this model with predictions from SVMs trained on Bag-of-Audio-Words features (BoAW-fused) and DeepSpectrum-resnet50 features (E2E + lowest-10-feats + baseline) via soft voting with equal weights, as sketched below. The combined system achieved our best result on the development set and improved the fusion-based baseline by 3.8% UAR.
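A minimal sketch of the equal-weighted soft voting used for this fusion; each argument is assumed to be an (n_utterances, n_classes) array of class probabilities produced by one system, and the variable names in the usage comment are hypothetical.

import numpy as np

def soft_vote(*prob_matrices):
    """Equal-weighted soft voting: average the class-probability
    matrices of the individual systems, then take the argmax."""
    avg = np.mean(np.stack(prob_matrices, axis=0), axis=0)
    return np.argmax(avg, axis=-1)

# Hypothetical usage with the systems fused in Table 4:
# labels = soft_vote(e2e_probs, lowest10_probs, boaw_probs, deepspectrum_probs)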
4. Future work
For the breathing sub-challenge, we observed that regularizing with the MSE can help alleviate the mismatch of scales between the outputs and the labels. However, it still lags behind the regular E2E model in performance. To study this effect, we will explore other regularization schemes to obtain a better balance between the scale mismatch and performance.
On the elderly sub-challenge, our current system overfits on the training and development data and obtains poor performance on the test set. We will investigate this effect further and apply regularization schemes to reduce the overfitting. Another limitation of our current system is that it is trained to classify short segments of the stories, after which the decisions made for these fragments are merged with a soft-voting method. Instead, we could concatenate the audio files of the same stories and classify them directly, as our E2E architecture allows arbitrarily long inputs.
For the mask sub-challenge, we currently combine predictions from separate models for the vanilla and lowest-10-features E2E scenarios. In contrast with this late fusion, we will explore early and intermediate fusion of the features to better exploit the information present in these spectrograms. Our lowest-10-features naïvely extracts the ten lowest frequency bands to use as features for the E2E model; instead, we will also develop specialized low-frequency features to aid learning by the E2E model. Finally, in the mask sub-challenge we observed that combining our E2E ensembles with the baseline predictions achieves the best result; we plan to explore similar combinations for both the breathing and elderly sub-challenges.
5. Conclusions
We presented Aalto's E2E ensemble solutions for the three different INTERSPEECH 2020 ComParE tasks. In our study, the ensemble E2E models achieved better performance than the individual E2E models on average. For the ComParE 2020 tasks, we also proposed task-specific modifications of the underlying E2E models: multi-task learning, resampling of the training data for multi-task scenarios, and feature engineering informed by the initial E2E ensemble models. Our best models showed absolute improvements upon the competitive baselines for the breathing and mask sub-challenges of 2.8% and 3.8%, respectively. Overall, our paper showcases the benefits of using an ensemble of E2E models and task-specific modifications for computational paralinguistic tasks.
6. Acknowledgements
We thank Antonia Hamilton and Alexis Macintyre for granting us access to a subset of the UCL Speech Breath Monitoring (UCL-SBM) database used in the Breathing Sub-Challenge. This work was supported by the Academy of Finland (grants 312490 and 329267) and the Kone Foundation. Aalto Science-IT provided the computational resources.
7. References

[1] M. Schmitt, F. Ringeval, and B. Schuller, "At the border of acoustics and linguistics: Bag-of-audio-words for the recognition of emotions in speech," in Proc. of Interspeech, 2016, pp. 495–499.
[2] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing," IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016.
[3] M. Freitag, S. Amiriparian, S. Pugachevskiy, N. Cummins, and B. Schuller, "auDeep: Unsupervised learning of representations from audio with deep recurrent neural networks," J. Mach. Learn. Res., vol. 18, no. 1, pp. 6340–6344, Jan. 2017.
[4] S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, M. Freitag, S. Pugachevskiy, A. Baird, and B. Schuller, "Snore sound classification using image-based deep spectrum features," in Proc. of Interspeech, 2017, pp. 3512–3516. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-434
[5] C. Montacié and M.-J. Caraty, "Vocalic, lexical and prosodic cues for the Interspeech 2018 self-assessed affect challenge," in Proc. of Interspeech, 2018, pp. 541–545. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1331
[6] G. Gosztolya, T. Grósz, and L. Tóth, "General utterance-level feature extraction for classifying crying sounds, atypical & self-assessed affect and heart beats," in Proc. of Interspeech, 2018, pp. 531–535. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1076
[7] G. Gosztolya, "Using Fisher vector and bag-of-audio-words representations to identify Styrian dialects, sleepiness, baby & orca sounds," in Proc. of Interspeech, 2019, pp. 2413–2417. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-1726
[8] H. Wu, W. Wang, and M. Li, "The DKU-LENOVO systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge," in Proc. of Interspeech, 2019, pp. 2433–2437. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-1386
[9] M. Carbonneau, E. Granger, Y. Attabi, and G. Gagnon, "Feature learning from spectrograms for assessment of personality traits," IEEE Transactions on Affective Computing, vol. 11, no. 1, pp. 25–31, 2020.
[10] J. Wagner, D. Schiller, A. Seiderer, and E. André, "Deep learning in paralinguistic recognition tasks: Are hand-crafted features still relevant?" in Proc. of Interspeech, 2018, pp. 147–151.
[11] Z. Zhao, Y. Zhao, Z. Bao, H. Wang, Z. Zhang, and C. Li, "Deep spectrum feature representations for speech emotion recognition," in Proc. of ASMMC-MMAC, 2018, pp. 27–33. [Online]. Available: https://doi.org/10.1145/3267935.3267948
[12] D. Elsner, S. Langer, F. Ritz, R. Müller, and S. Illium, "Deep neural baselines for computational paralinguistics," in Proc. of Interspeech, 2019, pp. 2388–2392.
[13] P. Tzirakis, S. Zafeiriou, and B. W. Schuller, "End2You - the Imperial toolkit for multimodal profiling by end-to-end learning," CoRR, vol. abs/1802.01115, 2018.
[14] J. Fritsch, S. P. Dubagunta, and M. Magimai-Doss, "Estimating the degree of sleepiness by integrating articulatory feature knowledge in raw waveform based CNNs," in Proc. of ICASSP, 2020, pp. 6534–6538.
[15] G. Gosztolya, R. Busa-Fekete, T. Grósz, and L. Tóth, "DNN-based feature extraction and classifier combination for child-directed speech, cold and snoring identification," in Proc. of Interspeech, 2017, pp. 3522–3526.
[16] B. W. Schuller, A. Batliner, C. Bergler, E.-M. Messner, A. Hamilton, S. Amiriparian, A. Baird, G. Rizos, M. Schmitt, L. Stappen, H. Baumeister, A. D. MacIntyre, and S. Hantke, "The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly emotion, Breathing & Masks," in Proc. of Interspeech, Shanghai, China, Sep. 2020, to appear.
[17] S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Epps, and B. W. Schuller, "Multi-task semi-supervised adversarial autoencoding for speech emotion recognition," IEEE Transactions on Affective Computing, 2020.
[18] S. R. Kadiri and B. Yegnanarayana, "Breathy to tense voice discrimination using zero-time windowing cepstral coefficients (ZTWCCs)," in Proc. of Interspeech, 2018, pp. 232–236.
[19] S. R. Kadiri and P. Alku, "Mel-frequency cepstral coefficients of voice source waveforms for classification of phonation types in speech," in Proc. of Interspeech, 2019, pp. 2508–2512.
[20] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd ed. Prentice Hall, 1999.