Aalto's End-to-End DNN systems for the INTERSPEECH 2020 Computational Paralinguistics Challenge
Tamás Grósz, Mittul Singh, Sudarsana Reddy Kadiri, Hemant Kathania, Mikko Kurimo
Department of Signal Processing and Acoustics, Aalto University, Finland [email protected]
Abstract
End-to-end (E2E) neural network models have shown significant performance benefits on different INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model for a task or the same E2E architecture for different tasks. However, applying a single model is unstable, and using the same architecture under-utilizes task-specific information. On the ComParE 2020 tasks, we investigate applying an ensemble of E2E models for robust performance and developing task-specific modifications for each task. ComParE 2020 introduces three sub-challenges: the breathing sub-challenge to predict the output of a respiratory belt worn by a patient while speaking, the elderly sub-challenge to estimate the elderly speaker's arousal and valence levels, and the mask sub-challenge to classify whether the speaker is wearing a mask or not. On each of these tasks, an ensemble outperforms the single E2E model. On the breathing sub-challenge, we study the impact of multi-loss strategies on task performance. On the elderly sub-challenge, predicting the valence and arousal levels prompts us to investigate multi-task training and implement data sampling strategies to handle class imbalance. On the mask sub-challenge, an E2E system without feature engineering is competitive with the feature-engineered baselines and provides substantial gains when combined with them.
Index Terms: computational paralinguistics, DNN, end-to-end, ensemble learning
1. Introduction
INTERSPEECH's Computational Paralinguistics Challenges (ComParE) have regularly introduced the speech community to exciting new challenges since 2009. These challenges, set up as prediction tasks, focus on extracting important speaker-related information from the audio signal. In more than a decade of ComParE, researchers have come up with innovative solutions to these challenges. These efforts can broadly be divided into two categories: feature-engineering-based approaches [1, 2, 3, 4, 5, 6, 7, 8, 9] and deep-learning-based end-to-end approaches [10, 11, 12, 13, 14]. Feature-engineering approaches have concentrated on extracting task-specific features to be utilized by classifiers for prediction. End-to-end approaches, on the other hand, have focused on applying complex neural network architectures to bypass feature engineering. There might not be a clear winner between the two [10], but combining them has emerged as a trend. Prior work has used Deep Neural Networks (DNNs) only to extract useful features automatically; these features are then used to train a simple classifier. Two such systems, auDeep [3] and DeepSpectrum [4], are already part of this year's baseline method.
For end-to-end (E2E) approaches, prior work has concentrated on applying single-model systems to the prediction tasks [15, 10, 12, 13]. A single neural network, being a non-linear model, can have high variance and thus produce unstable results. Prior work has also concentrated on applying the same architecture across different tasks; this one-size-fits-all policy ignores task-specific requirements that can be exploited in model design [10]. Hence, we study the application of E2E models to obtain a more robust but task-specific solution. We build ensemble-based E2E systems to obtain robust results across the different ComParE 2020 tasks; utilizing multiple models instead of one shows better performance across all the tasks. Besides, we also study task-specific requirements and explore incorporating them into the E2E solution.
ComParE 2020 poses three new challenges [16] to the community: 1) the breathing sub-challenge, to predict the output signal of a respiratory belt worn by the speaker; 2) the elderly sub-challenge, to classify the arousal (A) and valence (V) level of elderly speakers; and 3) the mask sub-challenge, to predict whether the speaker wears a mask or not while they speak.
For the breathing sub-challenge, the E2E baseline system optimizes Pearson's correlation coefficient, which is also the task metric, to solve a regression problem. However, the E2E predictions do not match the scale of the ground truth. To alleviate this scale issue, we study multi-loss strategies for our E2E model, where we optimize Pearson's coefficient along with the mean squared error. The elderly sub-challenge entails learning two closely related tasks (arousal and valence level prediction), so as a natural choice we explore multi-task learning. This sub-challenge also faces the issue of imbalanced class data, and we apply sampling schemes to augment the data and reduce the class imbalance. In the mask sub-challenge, our single E2E model performs better than the best baseline result. On further investigation, we analyze the trained models to understand which frequency bands hold the largest importance. Our findings lead us to create low-frequency band features. A fusion of the baseline with our E2E models, including these features, results in a substantial performance gain.
We also include our plans to explore and expand the presented experiments; we hope to include those results in an extended version of this article.
2. Methods
In this section, we describe how the end-to-end systems are used in an ensemble learning scheme. We also present task-specific modifications to capture task requirements in the end-to-end system. Keep in mind that this paper describes our solutions for a competition, so we broke with the tradition of using only a few techniques. Instead, we used several to get the best results. Still, we did our best to measure the impact of each modification on the development data and tested only the best ones.
2.1. End-to-end systems

End-to-end learning is an emerging paradigm within deep learning. Researchers across various fields have adopted this paradigm, supported by the availability of large data and powerful computational resources. In theory, end-to-end systems are built to replace traditional pipeline-based solutions with a single deep neural network, which allows the complete model to be trained in a single optimization step. They also hold the promise of bypassing the laborious feature-engineering step by having a single system solve every aspect of the prediction problem; in practice, however, these systems are often built on top of existing features. The advantages of this paradigm make it an attractive choice for ComParE tasks.
In our experiments, we employ the same DNN model architecture for the elderly and mask sub-challenges. For the breathing sub-challenge, we use a different DNN architecture based on the baseline system. We describe the details of these end-to-end model architectures in Section 3.1. Our models process either spectral input features or, in the case of the breathing task, raw audio signals. The DNNs can then be optimized directly to perform the given task. This single-model approach allowed us to quickly adapt the general framework to the specialties of the sub-challenges.
2.2. Ensemble learning

DNNs are known to be sensitive to their random initialization, and our experiments confirm this. The issue is especially severe when the amount of training data is limited, which is usually the case for paralinguistic tasks. A solution to this problem is ensemble learning: we train several differently initialized DNNs and then combine their predictions to get stable and even better results.
Here, we employ a specific bootstrap-aggregation method called bagging. Originally, bagging trains each model on only a random subset of the training data to produce diverse systems. As the training data is already limited, we decided to use all available data during training and rely on random initialization and data shuffling to produce a diverse set of DNNs. In the combination, we average the outputs of the differently initialized DNNs to make the final prediction, as illustrated in the sketch below.
Ensemble learning can also be performed with other approaches. For our mask sub-challenge experiments, we perform an equal-weighted soft-voting combination of baseline prediction systems, such as Support Vector Machines (SVMs), with our ensemble of DNNs.
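To make the combination concrete, the following is a minimal sketch of the training and averaging scheme described above. It assumes a hypothetical build_model() factory that returns a freshly initialized, compiled Keras model; the function names and training settings are illustrative, not our exact code.

import numpy as np

def train_ensemble(build_model, x_train, y_train, n_models=10, epochs=50):
    """Train n_models copies of the same architecture. Each call to
    build_model() yields a different random initialization, and data
    shuffling (the Keras default) adds further diversity; unlike classic
    bagging, every member sees all of the training data."""
    models = []
    for _ in range(n_models):
        model = build_model()
        model.fit(x_train, y_train, epochs=epochs, shuffle=True, verbose=0)
        models.append(model)
    return models

def ensemble_predict(models, x):
    """Bagging-style combination: average the member outputs."""
    predictions = np.stack([m.predict(x) for m in models], axis=0)
    return predictions.mean(axis=0)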
2.3. Multi-loss training

Training an end-to-end system does not have to be restricted to a single loss function. Often, multiple losses are considered to focus on multiple aspects of the prediction problem; this technique also helps regularize training.
For the breathing sub-challenge, the end-to-end baseline system is trained with a correlation-based loss. However, this loss does not bound the outputs to the same scale as the labels. To match the outputs' scale to the labels, we use a combination of the correlation loss and the mean squared error (MSE), which can also help regularize the end-to-end baseline system. A sketch of the combined loss follows.
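The sketch below is a minimal TensorFlow version of such a combined loss: a negative Pearson correlation term plus a weighted MSE term (the weight of 0.1 is the value used in our experiments, see Section 3.2). This is our own illustrative formulation, not the baseline system's code.

import tensorflow as tf

def pearson_loss(y_true, y_pred):
    """Negative Pearson correlation; minimizing it maximizes correlation."""
    yt = y_true - tf.reduce_mean(y_true)
    yp = y_pred - tf.reduce_mean(y_pred)
    cov = tf.reduce_sum(yt * yp)
    denom = tf.sqrt(tf.reduce_sum(tf.square(yt)) * tf.reduce_sum(tf.square(yp)))
    return -cov / (denom + 1e-8)  # small epsilon guards against division by zero

def corr_plus_mse(alpha=0.1):
    """Correlation loss regularized by an MSE term, which also pulls the
    predictions onto the scale of the labels."""
    def loss(y_true, y_pred):
        mse = tf.reduce_mean(tf.square(y_true - y_pred))
        return pearson_loss(y_true, y_pred) + alpha * mse
    return loss

# Usage: model.compile(optimizer="adam", loss=corr_plus_mse(alpha=0.1))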
2.4. Multi-task learning

Multi-task learning trains a single model to perform multiple tasks simultaneously. Recent work [17] has shown the benefits of this scheme for paralinguistic tasks. Intuitively, multi-task learning's unified model acts as a form of data augmentation by sharing information relevant to one task with the other. This intuition is especially relevant in the case of the elderly sub-challenge: the arousal and valence levels are two related dimensions describing the emotional experience of the speaker. Thus, we experiment with a single end-to-end model trained to predict the arousal and valence levels in a joint framework, sketched below.
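As a minimal sketch, the joint model attaches two task-specific heads to a shared trunk, here assumed to be a Keras model ending at the LSTM layer of the conv+LSTM front-end described in Section 3.1. The hidden-layer width follows Section 3.1, while the class counts are illustrative assumptions.

from tensorflow.keras import layers, Model

def add_task_heads(trunk, n_classes_a=3, n_classes_v=3):
    """Split the network after the shared recurrent trunk: each task
    gets its own hidden and output layers."""
    # Arousal head
    a = layers.Dense(100, activation="relu")(trunk.output)
    arousal = layers.Dense(n_classes_a, activation="softmax", name="arousal")(a)
    # Valence head
    v = layers.Dense(100, activation="relu")(trunk.output)
    valence = layers.Dense(n_classes_v, activation="softmax", name="valence")(v)
    model = Model(trunk.input, [arousal, valence])
    model.compile(optimizer="adam",
                  loss={"arousal": "sparse_categorical_crossentropy",
                        "valence": "sparse_categorical_crossentropy"})
    return model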
2.5. Resampling methods

In the elderly sub-challenge, we observe a class imbalance problem. Having over-represented classes in the data is a common problem for paralinguistic tasks [15]. To address the imbalance, we choose two sampling techniques: upsampling and probabilistic sampling [15]. Upsampling is a simple method that repeats the under-represented examples until the data becomes balanced. Probabilistic sampling applies a more rigorous approach: it defines a desired class distribution and, during training, selects examples in such a way that the overall distribution of the training data fits the desired one. This new distribution is a linear combination of the original distribution and a uniform one, with λ and 1 − λ being the respective coefficients.
These resampling methods are easy to use; however, we had to adapt them to work in a multi-task setup. To upsample, we created clusters of examples sharing the same label pair and resampled so that each cluster would have the same amount of training data. Although this adaptation does not ensure that the individual tasks have balanced data, in practice it works quite well, as shown in Section 3.3. A similar modification can be applied when using probabilistic sampling in a multi-task setting: first, we generate the desired distribution for each task; then, during training, we select a label pair that fits the distribution and use a training instance with those labels (a minimal sketch of this scheme is given at the end of this section).

2.6. Low-frequency features

For the mask sub-challenge, we hypothesize that wearing a mask changes the resonance conditions in the vocal tract, as the mask might reflect some of the frequencies back into the tract [18, 19]. To test this hypothesis, we look at the output gradients with respect to the inputs and plot them per input frequency band in Figure 1. We notice that the end-to-end models have large gradients for the ten lowest frequency bands. Based on this observation, we compute features from low-frequency information: specifically, we extract Mel-spectrogram features with 200 filter banks and then use the ten lowest filter banks as input features, referred to as lowest-10-features.
As a pre-processing step for extracting these features, we also examine enhancing the lower frequencies by manipulating the input audio. We apply low-frequency enhancing schemes: pre-emphasizing the audio and passing it through a fifth-order low-pass Butterworth filter [20], denoted as preemphasis+butterworth. These schemes can allow the Mel-spectrogram to better represent the information relevant to this task, which depends on the low-frequency bands.
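The multi-task probabilistic sampling of Section 2.5 can be sketched as follows. Each (arousal, valence) label pair is encoded as a single class id, mirroring the clusters used for upsampling; the helper names are illustrative, not our exact implementation.

import numpy as np

def target_distribution(pair_ids, lam):
    """Desired class distribution: lam * original + (1 - lam) * uniform."""
    classes, counts = np.unique(pair_ids, return_counts=True)
    original = counts / counts.sum()
    uniform = np.full(len(classes), 1.0 / len(classes))
    return classes, lam * original + (1.0 - lam) * uniform

def sample_indices(pair_ids, lam, n_samples, seed=0):
    """Draw training indices so that label pairs follow the target
    distribution: first pick a label pair, then a matching example."""
    rng = np.random.default_rng(seed)
    classes, target = target_distribution(pair_ids, lam)
    indices = []
    for _ in range(n_samples):
        c = rng.choice(len(classes), p=target)          # pick a label pair
        candidates = np.flatnonzero(pair_ids == classes[c])
        indices.append(rng.choice(candidates))          # pick an example of it
    return np.array(indices)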
Figure 1: The figure shows the gradient magnitudes of the DNN outputs with respect to the input spectral features on the Mask sub-challenge. The left two images show the gradients of two random models for a single audio file, and the right two images show the gradients of the same two models averaged over all training files. A bright yellow shade represents the largest gradient magnitudes, seen for the lowest frequencies.
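A minimal sketch of the lowest-10-features extraction with the preemphasis+butterworth preprocessing, using librosa and scipy. The pre-emphasis coefficient and the Butterworth cutoff frequency below are placeholder values, since the exact settings are not specified above.

import numpy as np
import librosa
from scipy.signal import butter, lfilter

def lowest10_features(path, sr=16000, preemph=0.97, cutoff_hz=500, n_mels=200):
    """Pre-emphasize, low-pass filter (fifth-order Butterworth), compute a
    200-band Mel-spectrogram, and keep the ten lowest filter banks.
    preemph and cutoff_hz are placeholder values."""
    audio, _ = librosa.load(path, sr=sr)
    audio = np.append(audio[0], audio[1:] - preemph * audio[:-1])  # pre-emphasis
    b, a = butter(5, cutoff_hz / (sr / 2), btype="low")            # fifth-order low-pass
    audio = lfilter(b, a, audio)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)[:10]  # the ten lowest filter banks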
3. Experiments and results
3.1. Experimental setup

For the elderly and mask sub-challenges, we extract Mel-spectrograms from the audio files as inputs, in a fashion similar to the auDeep [3] pipeline. For the breathing sub-challenge, in contrast, the raw audio is used directly as input, as it leads to better results than Mel-spectrograms.
In our experiments, we use two different end-to-end systems. For the elderly and mask sub-challenges, the spectral input is first processed by a 1D convolutional layer with 100 neurons, and then a recurrent layer containing 100 LSTM cells accumulates the outputs of the filters. We pass the outputs of the recurrent layer to a feedforward layer (100 rectified linear units) and then apply a classification layer; a sketch of this single-task architecture follows below. In the multi-task experiments, we split the structure after the LSTM layer, passing the recurrent layer's output to a separate set of hidden and output layers for each task. For the breathing sub-challenge, we opted for the same structure as the best baseline system; for details, see [16]. For training, we employed the Keras framework with TensorFlow as the backend.
For all tasks, we use ensemble learning. For the mask sub-challenge, we obtained the best results using 50 models, while for the other tasks, ten models were enough to reach peak performance. After training the individual models, we averaged their outputs to create the final predictions.
For evaluation on the test set, we train our models on the combined training and development sets. We note that the ComParE challenge restricts the number of submissions per team and task to five evaluations on the test set. As the competition is ongoing, we have only used a few of the available submissions to check the best systems so far. This limitation implies that we cannot test all of our methods; in the result tables, we use a question mark (?) to indicate solutions not yet evaluated on the test data.
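The single-task architecture just described can be sketched in Keras as follows. Only the layer widths are specified above; the convolution kernel size, optimizer, and loss below are illustrative assumptions.

from tensorflow.keras import layers, Model

def build_single_task_model(n_frames, n_bands, n_classes):
    """1D convolution (100 filters) -> LSTM (100 cells) ->
    feedforward layer (100 ReLUs) -> classification layer."""
    inputs = layers.Input(shape=(n_frames, n_bands))  # Mel-spectrogram input
    x = layers.Conv1D(100, kernel_size=5, activation="relu")(inputs)  # kernel size assumed
    x = layers.LSTM(100)(x)                      # accumulates the filter outputs over time
    x = layers.Dense(100, activation="relu")(x)  # feedforward layer
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")  # assumed settings
    return model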
Table 1: The table presents the breathing task's Pearson correlation scores for single end-to-end models on the development set (Dev). As we trained ten different models, we present the average and the best result.
System          Avg. per DNN (Dev)   Best DNN (Dev)
E2E-corr        .506                 .514
E2E-MSE         .467                 .481
E2E-corr+MSE    .497                 .521
Table 2: The table presents the ensemble E2E models' performance on the breathing task for different loss functions.
System (loss function)   Dev Corr   Dev MSE   Test Corr
Baseline (E2E) [16]      .507       1.682     .731
E2E-corr                 .523       .896      .759
E2E-MSE                  .480       .028      ?
E2E-corr+MSE             .514       .180      .751
3.2. Breathing sub-challenge

On this task, we used the end-to-end baseline system as the basis for further development, as it performs quite well [16]. However, it suffers from a mismatch between the scale of the end-to-end predictions and that of the output labels. To alleviate this mismatch, we apply a multi-loss scheme, using an MSE-based loss to regularize the baseline correlation loss.
In Table 1, we compare the single-loss and multi-loss strategies. The single-loss models use either Pearson's correlation (corr) or the MSE. The multi-loss strategy combines the two losses (corr+MSE) with a regularization weight of 0.1. The correlation-based E2E model (E2E-corr) performs best when averaging the correlation values of ten randomly initialized DNNs. The best single result corresponds to the corr+MSE-based E2E model; however, its averaged results are lower, suggesting that this value is unreliable. We suspect that further tuning of the regularization weight is required, and we hope to complete this analysis as part of our future work.
In Table 2, we present the results of 10-model ensembles compared with the baseline performance. Even though the baseline system produced high correlation values, it had the highest MSE. Combining the predictions of ten models (E2E-corr) reduced the MSE significantly and outperformed the baseline results. Using the MSE as the loss function performed worst in terms of correlation but, naturally, produced the lowest MSE. Lastly, the multi-loss ensemble (E2E-corr+MSE) drops in correlation compared to the E2E-corr ensemble because of the MSE regularization; in terms of MSE, however, it is much better. On the evaluation set, E2E-corr+MSE was also slightly worse than the E2E-corr ensemble in terms of overall correlation. Nevertheless, our E2E-corr ensemble outperforms the baseline result, with an absolute improvement of 2.8 correlation points.
3.3. Elderly sub-challenge

The elderly task presents a prediction problem with class imbalance: for valence prediction, 44 out of the 87 stories have a medium-valence label. Upon inspecting some of our initial models, we observed that the output predictions favour the over-represented classes. To cope with this issue, we apply the resampling methods described in Section 2.5.
Table 3 presents the ensemble E2E models evaluated on the development and test sets.
Table 3: UAR values for predicting Arousal (A) and Valence (V) levels on the elderly sub-challenge. The E2E systems combined ten DNNs to produce the predictions.
System                              Dev (A/V)    Test (A/V)
Baseline (linguistic) [16]          40.6/–       –
Baseline (acoustic) [16]            35.0/31.6    –/40.3
E2E (single task)                   35.0/39.7    ?
E2E (multitask)                     39.5/39.7    ?
E2E (single task + upsampl.)        39.8/41.5    ?
E2E (multitask + upsampl.)          –/42.4       38.0/39.5
E2E (single task + prob. sampl.)    35.6/39.6    ?
E2E (multitask + prob. sampl.)      40.0/–       ?

Probabilistic sampling with a well-chosen λ was very beneficial for the valence sub-task. The multi-task models consistently outperformed the single-task ones. For the two best systems, we also checked the performance of the individual DNNs and saw that ensemble learning is essential for good performance: a single multi-task DNN with upsampling yielded 38.4%/36.4% (A/V) on average, and with probabilistic sampling we got 36.6%/38.5%.
Unfortunately, the test results are below the official baseline. The considerable difference between the scores on the development and test data suggests that our model overfits when trained on the combined train+development set for evaluation. We also suspect a significant mismatch between the development and test data in this sub-challenge. Strong evidence for this can be found in the baseline paper [16], where the test performance does not correlate with the scores achieved on the development set. The official acoustic baseline model (DeepSpectrum+SVM) produces almost the worst results on the development set, and the difference between its development and test scores is large. This observation suggests that a model tuned on the development data might not be the best model for the evaluation set.

3.4. Mask sub-challenge

Training a 50-model ensemble, we saw that the averaged prediction significantly outperformed our single E2E model and the individual baseline system (auDeep-fused). Our individual E2E models achieved 66.0% UAR on average, while their combination reached 68.0% (E2E). The best individual baseline uses auDeep-based features in an SVM system; our E2E ensemble outperformed this model on both the development and test sets, as shown in Table 4. However, our ensemble E2E model is outperformed by the fusion of the best baseline models, an SVM based on auDeep-fused, Bag-of-Audio-Words, OpenSMILE and DeepSpectrum features [16].
In Section 2.6, we observed that the lower frequency bands of the audio hold important information for the mask sub-challenge. Based on this observation, we applied preemphasis+butterworth to the input audio and then extracted the lowest-10-features to build an E2E ensemble (E2E lowest-10-features). This ensemble outperformed the E2E ensemble built with only the preprocessed input audio (E2E preemphasis+butterworth) but was worse than our vanilla ensemble. Combining the regular ensemble (E2E) and E2E lowest-10-features fared better, resulting in a slight improvement over the vanilla ensemble.
Table 4: UAR values on the Mask Sub-Challenge. The E2E solutions fused 50 models to get the final output.
System                                         Dev     Test
auDeep-fused (baseline) [16]                   64.4    66.6
Fusion of the bests (baseline) [16]            –       71.8
E2E                                            68.0    69.9
E2E preemphasis+butterworth                    59.3    ?
E2E lowest-10-features                         62.9    ?
E2E + E2E lowest-10-features                   68.6    ?
E2E + E2E lowest-10-features + baseline        –       75.6

We combined this model with predictions from SVMs trained on Bag-of-Audio-Words features (BoAW-fused) and DeepSpectrum-resnet50 features (E2E + lowest-10-feats + baseline) via soft voting with equal weights, as sketched below. The combined system achieved our best result on the development set and improved the fusion-based baseline by 3.8% UAR.
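A minimal sketch of the equal-weighted soft voting used for this fusion; each argument is assumed to be an (n_utterances, n_classes) array of class probabilities produced by one system, and the variable names in the usage comment are hypothetical.

import numpy as np

def soft_vote(*prob_matrices):
    """Equal-weighted soft voting: average the class-probability
    matrices of the individual systems, then take the argmax."""
    avg = np.mean(np.stack(prob_matrices, axis=0), axis=0)
    return np.argmax(avg, axis=-1)

# Hypothetical usage with the systems fused in Table 4:
# labels = soft_vote(e2e_probs, lowest10_probs, boaw_probs, deepspectrum_probs)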
4. Future work
For the breathing sub-challenge, we observed that regularizing with the MSE can help alleviate the mismatch of scales between the outputs and the labels. However, it still lags behind the regular E2E model in performance. To study this effect, we will explore other regularization schemes to obtain a better balance between the scale mismatch and performance.
On the elderly sub-challenge, our current system overfits on the training and development data and obtains poor performance on the test set. We will investigate this effect further and apply regularization schemes to reduce the overfitting. Another limitation of our current system is that it is trained to classify short segments of the stories, after which the decisions made for these fragments are merged with a soft-voting method. Instead, we could concatenate the audio files of the same stories and classify them directly, as our E2E architecture allows arbitrarily long inputs.
For the mask sub-challenge, we currently combine predictions from separate models for the vanilla and lowest-10-features E2E scenarios. In contrast with this late fusion, we will explore early and intermediate fusion of the features to better exploit the information present in these spectrograms. Our lowest-10-features naïvely extracts the ten lowest frequency bands to use as features for the E2E model; instead, we will also develop specialized low-frequency features to aid learning by the E2E model. Finally, in the mask sub-challenge we observed that combining our E2E ensembles with the baseline predictions achieves the best result; we plan to explore similar combinations for both the breathing and elderly sub-challenges.
5. Conclusions
We presented Aalto's E2E ensemble solutions for the three different INTERSPEECH 2020 ComParE tasks. In our study, the ensemble E2E models achieved better performance than the individual E2E models on average. For the ComParE 2020 tasks, we also proposed task-specific modifications of the underlying E2E models: multi-task learning, resampling of the training data for multi-task scenarios, and feature engineering informed by the initial E2E ensemble models. Our best models showed absolute improvements upon the competitive baselines for the breathing and mask sub-challenges of 2.8% and 3.8%, respectively. Overall, our paper showcases the benefits of using an ensemble of E2E models and task-specific modifications for computational paralinguistic tasks.
6. Acknowledgements
We thank Antonia Hamilton and Alexis Macintyre for granting us access to a subset of the UCL Speech Breath Monitoring (UCL-SBM) database used in the Breathing Sub-Challenge. This work was supported by the Academy of Finland (grants 312490 and 329267) and the Kone Foundation. Aalto Science-IT provided the computational resources.
7. References

[1] M. Schmitt, F. Ringeval, and B. Schuller, "At the border of acoustics and linguistics: Bag-of-audio-words for the recognition of emotions in speech," in Proc. of Interspeech, 2016, pp. 495–499.
[2] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing," IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016.
[3] M. Freitag, S. Amiriparian, S. Pugachevskiy, N. Cummins, and B. Schuller, "auDeep: Unsupervised learning of representations from audio with deep recurrent neural networks," J. Mach. Learn. Res., vol. 18, no. 1, pp. 6340–6344, Jan. 2017.
[4] S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, M. Freitag, S. Pugachevskiy, A. Baird, and B. Schuller, "Snore sound classification using image-based deep spectrum features," in Proc. of Interspeech, 2017, pp. 3512–3516. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-434
[5] C. Montacié and M.-J. Caraty, "Vocalic, lexical and prosodic cues for the Interspeech 2018 self-assessed affect challenge," in Proc. of Interspeech, 2018, pp. 541–545. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1331
[6] G. Gosztolya, T. Grósz, and L. Tóth, "General utterance-level feature extraction for classifying crying sounds, atypical & self-assessed affect and heart beats," in Proc. of Interspeech, 2018, pp. 531–535. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1076
[7] G. Gosztolya, "Using Fisher vector and bag-of-audio-words representations to identify Styrian dialects, sleepiness, baby & orca sounds," in Proc. of Interspeech, 2019, pp. 2413–2417. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-1726
[8] H. Wu, W. Wang, and M. Li, "The DKU-LENOVO systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge," in Proc. of Interspeech, 2019, pp. 2433–2437. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-1386
[9] M. Carbonneau, E. Granger, Y. Attabi, and G. Gagnon, "Feature learning from spectrograms for assessment of personality traits," IEEE Transactions on Affective Computing, vol. 11, no. 1, pp. 25–31, 2020.
[10] J. Wagner, D. Schiller, A. Seiderer, and E. André, "Deep learning in paralinguistic recognition tasks: Are hand-crafted features still relevant?" in Proc. of Interspeech, 2018, pp. 147–151.
[11] Z. Zhao, Y. Zhao, Z. Bao, H. Wang, Z. Zhang, and C. Li, "Deep spectrum feature representations for speech emotion recognition," in Proc. of ASMMC-MMAC, 2018, pp. 27–33. [Online]. Available: https://doi.org/10.1145/3267935.3267948
[12] D. Elsner, S. Langer, F. Ritz, R. Müller, and S. Illium, "Deep neural baselines for computational paralinguistics," in Proc. of Interspeech, 2019, pp. 2388–2392.
[13] P. Tzirakis, S. Zafeiriou, and B. W. Schuller, "End2You - the Imperial toolkit for multimodal profiling by end-to-end learning," CoRR, vol. abs/1802.01115, 2018.
[14] J. Fritsch, S. P. Dubagunta, and M. Magimai-Doss, "Estimating the degree of sleepiness by integrating articulatory feature knowledge in raw waveform based CNNs," in Proc. of ICASSP, 2020, pp. 6534–6538.
[15] G. Gosztolya, R. Busa-Fekete, T. Grósz, and L. Tóth, "DNN-based feature extraction and classifier combination for child-directed speech, cold and snoring identification," in Proc. of Interspeech, 2017, pp. 3522–3526.
[16] B. W. Schuller, A. Batliner, C. Bergler, E.-M. Messner, A. Hamilton, S. Amiriparian, A. Baird, G. Rizos, M. Schmitt, L. Stappen, H. Baumeister, A. D. MacIntyre, and S. Hantke, "The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly emotion, Breathing & Masks," in Proc. of Interspeech, Shanghai, China, Sep. 2020, to appear.
[17] S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Epps, and B. W. Schuller, "Multi-task semi-supervised adversarial autoencoding for speech emotion recognition," IEEE Transactions on Affective Computing, 2020.
[18] S. R. Kadiri and B. Yegnanarayana, "Breathy to tense voice discrimination using zero-time windowing cepstral coefficients (ZTWCCs)," in Proc. of Interspeech, 2018, pp. 232–236.
[19] S. R. Kadiri and P. Alku, "Mel-frequency cepstral coefficients of voice source waveforms for classification of phonation types in speech," in Proc. of Interspeech, 2019, pp. 2508–2512.
[20] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd ed. Prentice Hall, 1999.