The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates
Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Leon J. M. Rothkrantz, Joeri Zwerts, Jelle Treep, Casper Kaandorp
GLAM – Group on Language, Audio & Music, Imperial College London, UK
EIHW – Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
Pattern Recognition Lab, FAU Erlangen-Nuremberg, Germany
University of Cambridge, UK
Delft University of Technology, The Netherlands
Faculty of Science, Utrecht University, The Netherlands
[email protected]
Abstract
The INTERSPEECH 2021 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the COVID-19 Cough and COVID-19 Speech Sub-Challenges, a binary classification on COVID-19 infection has to be made based on coughing sounds and speech; in the Escalation Sub-Challenge, a three-way assessment of the level of escalation in a dialogue is featured; and in the Primates Sub-Challenge, four species vs background need to be classified. We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the 'usual' ComParE and BoAW features as well as deep unsupervised representation learning using the auDeep toolkit, and deep feature extraction from pre-trained CNNs using the DeepSpectrum toolkit; in addition, we add deep end-to-end sequential modelling and, in part, linguistic analysis.

Index Terms: Computational Paralinguistics, Challenge, COVID-19, Escalation, Primates
1. Introduction
In this INTERSPEECH 2021 Computational Paralinguistics Challenge (ComParE) – the thirteenth since 2009 [1] – we address four new problems within the field of Computational Paralinguistics [2] in a challenge setting: In the COVID-19 Cough Sub-Challenge (CCS) and the COVID-19 Speech Sub-Challenge (CSS), coughing sounds or speech are used for a binary classification of COVID-19 infection (infected or not). In the present pandemic situation, great potential lies in low-cost, anywhere and anytime accessible, real-time pre-diagnosis of COVID-19 infection. To date, the feasibility has been shown [3], yet a controlled challenge test-bed is lacking. In the Escalation Sub-Challenge (ESS), participants are faced with a three-way classification of the level of escalation in human dialogues. A range of applications exists, including human-computer interaction, computer-mediated human-to-human conversation, and public security. Finally, in the Primates Sub-Challenge (PRS), we classify four species of primates versus background noise. Real-life applications include wildlife monitoring in habitats, e.g., to save species from extinction.

For all tasks, a target class has to be predicted for each case. Contributors can employ their own features and machine learning algorithms; standard feature sets and procedures are provided. Participants have to use the pre-defined partitions for each Sub-Challenge. They may report results obtained from the Train(ing)/Dev(elopment) sets – preferably with the supplied evaluation setups – but have only five trials per Sub-Challenge to upload their results on the Test set, whose labels are unknown to them. Each participation must be accompanied by a paper presenting the results, which undergoes peer review and has to be accepted for the conference. The organisers reserve the right to re-evaluate the findings, but do not take part in the Challenge themselves. As evaluation measure, we employ Unweighted Average Recall (UAR) in all Sub-Challenges, as has been used since the first Challenge in 2009 [1], especially because it is more adequate for (unbalanced) multi-class classification than Weighted Average Recall (i.e., accuracy) [2, 4]. Ethical approval for the studies has been obtained from the pertinent committees. In Section 2, we describe the challenge corpora. Section 3 details baseline experiments, metrics, and baseline results; concluding remarks are given in Section 4.
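For reference, UAR is the unweighted mean of the per-class recalls (macro-averaged recall), so chance level is 1 divided by the number of classes regardless of class sizes. A minimal scikit-learn sketch with toy label lists (y_true and y_pred are placeholders, not challenge data):

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = ["N", "N", "N", "P", "P"]   # toy labels for illustration only
y_pred = ["N", "P", "N", "P", "N"]

# UAR = macro-averaged recall (unweighted mean of per-class recalls).
uar = recall_score(y_true, y_pred, average="macro")

# Equivalent computation from the confusion matrix (rows = true classes).
cm = confusion_matrix(y_true, y_pred)
uar_cm = (cm.diagonal() / cm.sum(axis=1)).mean()

print(uar, uar_cm)
```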
2. The Four Sub-Challenges
For the CCS and CSS, we employ two subsets of the Cambridge COVID-19 Sound database [5, 6]. The database has been collected via the COVID-19 Sounds App since its launch in April 2020, aiming at collecting data to inform the diagnosis of COVID-19 based primarily on voice, breathing, and coughing. Participants were able to provide audio samples together with their COVID-19 test results via multiple platforms (a webpage, an Android app, and an iOS app). The participants also provided basic demographic and medical information, and reported symptoms. For the CCS and the CSS, only cough sounds and voice recordings with COVID-19 positive/negative test results were included, respectively, and only the audio data and the corresponding COVID-19 test labels are provided. The quality of these data was manually checked. As they were crowd-sourced, the original audio data had varying sampling rates and formats; all of them were resampled and converted to 16 kHz, mono, 16 bit, and further normalised recording-wise to eliminate varying loudness. For the CCS, 929 recordings from 397 participants were provided, in total 1.63 hours. In each cough recording, the participant provided one to three forced coughs. For the CSS, we use 893 recordings from 366 participants, in total 3.24 hours. In each speech recording, the participant recorded the given sentence ("I hope my data can help to manage the virus pandemic.") in one language (e.g., English, Italian, or German), one to three times. For each recording, a COVID-19 test result was available, which was self-reported by the participant. To create the two-class classification task, the original COVID-19 test results were mapped onto either positive (denoted as 'P') or negative ('N').
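This kind of format conversion (resampling to 16 kHz mono, 16-bit PCM, plus a recording-wise normalisation) can be sketched with librosa and soundfile; the peak normalisation shown here is only an illustrative choice, as the exact loudness normalisation applied to the challenge data is not specified in detail:

```python
import librosa
import numpy as np
import soundfile as sf

def convert_recording(in_path: str, out_path: str, sr: int = 16000) -> None:
    # Load with resampling to 16 kHz and downmixing to mono.
    audio, _ = librosa.load(in_path, sr=sr, mono=True)
    # Recording-wise peak normalisation to reduce loudness differences
    # between crowd-sourced recordings (illustrative choice).
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak * 0.95
    # Write as 16-bit PCM WAV.
    sf.write(out_path, audio, sr, subtype="PCM_16")
```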
For the ESS, the INTERSPEECH ComParE Escalation Corpus is provided, consisting of the Dataset of Aggression in Trains (TR) [7] and the Stress at Service Desk Dataset (SD) [8]. Both present unscripted interactions between actors, where friction appears as they spontaneously react to each other based on short scenario descriptions. While the datasets share the same procedure for eliciting interactions, the topics, the number of participants in a scene, the amount of overlapping speech, as well as the recording quality differ. The TR dataset consists of 21 scenarios of unwanted behaviour in trains and train stations (e.g., harassment, theft, travelling without a ticket) played by 13 subjects. It was annotated based on aggression levels on a 5-point scale by 7 raters (Krippendorff's alpha = …). […] The original labels were mapped onto three escalation classes: … onto Low, SD class 3 and TR class 2 onto Medium, and the rest of the data onto High escalation. The language spoken in the Escalation Corpus is Dutch (two scenarios from SD in which English was spoken were excluded). Manual transcriptions are provided. The corpus has been re-segmented based on linguistic information, resulting in 410 (Train/Dev) and 501 (Test) segments with an average length of 5 seconds. The challenge task is to use the SD dataset for training, and to recognise escalation levels in the TR dataset.
For the PRS, the Primate Vocalisations Corpus described in Zwerts et al. [9] is used. The global biodiversity crisis calls for effective monitoring methods to measure, manage, and conserve wildlife. Using acoustic recordings is a non-invasive and potentially cost-effective way to identify and count species in environments such as tropical forests, where opportunities for visual monitoring are limited. Several studies have applied automatic acoustic monitoring to a variety of taxa, ranging from birds [10] to forest elephants [11], and sporadically also to primates [12, 13, 14]. Zwerts et al. [9] recently collected acoustic data from a primate sanctuary in Cameroon. The recorded species were Chimpanzees (Pan troglodytes), Mandrills (Mandrillus sphinx), Red-capped mangabeys (Cercocebus torquatus), and a mixed group of Guenons (Cercopithecus spp.). The sanctuary houses primates under semi-natural conditions, making the background noise relatively comparable to natural forests, albeit less rich in biodiversity and also containing human-related noise. Recordings were made between December 2019 and January 2020 over a timespan of 32 days, using AudioMoth (v1.1.0) recorders.
Table 1: Databases: number of instances per class in the Train/Dev/Test splits. Test split distributions are blinded during the ongoing challenge and will be given in the final version.

CCS: COVID-19 Cough (C19C) corpus
                          Train     Dev      Test        Σ
  no COVID-19               215     183   blinded   blinded
  COVID-19                   71      48   blinded   blinded
  Σ                         286     231       208       725

CSS: COVID-19 Speech (C19S) corpus
  no COVID-19               243     153   blinded   blinded
  COVID-19                   72     142   blinded   blinded
  Σ                         315     295       283       893

ESS: Escalation at Service-desks and in Trains (CEST)
  Low                       156      69   blinded   blinded
  Medium                     74      33   blinded   blinded
  High                       63      15   blinded   blinded
  Σ                         293     117       501       911

PRS: Primate Vocalisations Corpus (PVC)
  Chimpanzee              2 217   2 217   blinded   blinded
  Mandrill                  874     874   blinded   blinded
  Red-capped mangabey       208     209   blinded   blinded
  Guenon                    158     159   blinded   blinded
  Background              3 458   3 459   blinded   blinded
  Σ                       6 915   6 918         …         …
3. Experiments and Results
For all corpora, the segmented audio was converted to single-channel, 16 kHz, 16-bit PCM format. Table 1 shows the number of cases for Train, Dev, and Test for the databases; the partitions for CCS, CSS, and ESS were gender-balanced.

ComParE Acoustic Feature Set: The official baseline feature set is the same as has been used in the eight previous editions of the ComParE challenges, starting from 2013 [16]. It contains 6 373 static features resulting from the computation of functionals (statistics) over low-level descriptor (LLD) contours [17, 16]. A full description of the feature set can be found in [18].
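As an illustration, the ComParE 2016 functionals can be obtained with the opensmile Python wrapper around the openSMILE toolkit; the official baseline scripts use the toolkit directly, and the file name below is a placeholder:

```python
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
# Returns a pandas DataFrame with one row of 6 373 functionals per file.
features = smile.process_file("train_0001.wav")
print(features.shape)  # (1, 6373)
```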
Table 2: Results for the four Sub-Challenges; each cell gives UAR [%] on Dev / Test, followed in brackets by the two 95 % confidence intervals (CI) on Test, see the explanation in the text. There are no official baselines for Dev; the official Test baselines are given in the text. C: complexity parameter of the SVM; only the best result over the searched range is reported. N: codebook size for Bag-of-Audio-Words (BoAW), splitting the input into two codebooks (ComParE-LLDs / ComParE-LLD-deltas) of the same given size, with 50 assignments per frame. DenseNet121: pre-trained CNN used for the extraction of DeepSpectrum features. X: threshold power level below which the auDeep (S2SAE) spectrograms were clipped. DiFE: linguistic feature extraction pipeline and SVM. End2You: end-to-end learning with a convolutional recurrent neural network with N_h hidden units. UAR: Unweighted Average Recall. CCS: COVID-19 Cough; CSS: COVID-19 Speech; ESS: Escalation Sub-Challenge; PRS: Primates Sub-Challenge.

openSMILE: ComParE functionals + SVM (C = …)
  CCS: … / …   CSS: … / …   ESS: … / …   PRS: … / …

openXBOW: ComParE BoAW + SVM (N = …)
  CCS: … / …   CSS: … / …   ESS: … / …   PRS: … / …

DeepSpectrum + SVM (DenseNet121)
  CCS: 63.3 / 64.1 [55.7–72.8 / 65.9–67.1]   CSS: 56.0 / 60.4 [55.9–64.9 / 57.8–58.7]   ESS: 64.2 / 56.4 [51.5–61.3 / 53.6–55.2]   PRS: 81.3 / 78.8 [76.9–80.6 / 76.1–76.8]

auDeep: S2SAE + SVM
  X = -30 dB:  CCS: 60.7 / 55.2 [47.6–61.9 / 51.9–53.5]   CSS: 65.8 / 59.9 [53.6–65.4 / 58.2–59.3]   ESS: 39.1 / 35.3 [30.0–40.4 / 34.8–37.3]   PRS: 70.6 / 69.7 [67.7–71.8 / 69.1–69.5]
  X = -45 dB:  CCS: 64.1 / 60.5 [51.8–69.5 / 61.0–62.0]   CSS: 66.3 / 55.2 [49.1–61.0 / 54.1–55.2]   ESS: 41.3 / 43.1 [37.8–48.6 / 38.5–42.0]   PRS: 80.3 / 82.3 [80.6–83.8 / 80.5–81.3]
  X = -60 dB:  CCS: 67.6 / 67.6 [60.3–75.4 / 64.9–65.8]   CSS: 59.4 / 53.3 [47.4–59.4 / 52.2–53.5]   ESS: 42.0 / 44.3 [39.2–49.6 / 41.7–44.1]   PRS: 81.6 / 84.1 [82.5–85.6 / 82.4–83.2]
  X = -75 dB:  CCS: 64.0 / 64.6 [56.1–72.6 / 61.0–62.3]   CSS: 58.4 / 52.2 [45.9–57.7 / 52.0–52.9]   ESS: 49.0 / 52.2 [47.2–56.9 / 50.1–52.0]   PRS: 80.7 / 83.0 [81.5–88.0 / 81.1–82.0]
  Fused:       CCS: 65.4 / 64.2 [57.0–72.2 / 62.1–63.1]   CSS: 62.2 / 64.2 [63.1–74.3 / 62.3–64.2]   ESS: 46.8 / 45.0 [39.8–50.4 / 45.1–47.5]   PRS: 84.6 / 86.6 [85.1–88.0 / 84.6–85.2]

DiFE: Transformer + SVM (ESS only)
  plain:        ESS: 51.2 / 36.8 [32.2–41.7 / 38.8–41.2]
  plain-BlAtt:  ESS: 50.3 / 45.2 [39.4–50.8 / 44.0–45.3]
  sent:         ESS: 56.5 / 44.1 [38.4–49.7 / 40.9–44.2]
  sent-BlAtt:   ESS: 47.3 / 47.2 [41.8–52.9 / 46.9–47.8]
  tuned-BlAtt:  ESS: 43.5 / 44.9 [40.0–50.3 / 43.7–45.3]

End2You: CNN + LSTM-RNN (N_h = 64)
  CCS: 61.8 / 64.7 [56.2–73.5 / –]   CSS: 70.5 / 68.7 [63.1–74.3 / –]   ESS: 64.1 / 54.0 [48.8–59.5 / –]   PRS: 72.7 / 70.8 [68.8–72.9 / –]

Fusion of Best
  CCS: … / …   CSS: … / …   ESS: … / …   PRS: … / …

Bag-of-Audio-Words (BoAWs): These have been applied successfully for, e.g., acoustic event detection [19] and speech-based emotion recognition [20]. Audio chunks are represented as histograms of acoustic LLDs, after quantisation based on a codebook. One codebook is learnt for the 65 LLDs from the ComParE feature set, and another one for the 65 deltas of these LLDs. In Table 2, results are given for different codebook sizes. Codebook generation is done by random sampling from the LLDs/deltas in the training data. Each LLD/delta is assigned to the 10 audio words from the codebooks with the lowest Euclidean distance. Both BoAW representations, one from the LLDs and one from their deltas, are concatenated. Finally, a logarithmic term frequency weighting is applied to compress the numeric range of the histograms. LLDs are extracted with the openSMILE toolkit; BoAWs are computed using openXBOW [21].
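A compact numpy sketch of the BoAW quantisation just described (random-sampling codebook, assignment of each frame to its 10 closest audio words, histogram, logarithmic term-frequency weighting); the baseline itself uses openXBOW, so the sizes and random data below are illustrative only:

```python
import numpy as np

def boaw(llds: np.ndarray, codebook: np.ndarray, n_assign: int = 10) -> np.ndarray:
    """llds: (n_frames, n_dims) LLDs of one chunk; codebook: (n_words, n_dims)."""
    hist = np.zeros(len(codebook))
    for frame in llds:
        dist = np.linalg.norm(codebook - frame, axis=1)   # Euclidean distances
        for idx in np.argsort(dist)[:n_assign]:           # 10 closest audio words
            hist[idx] += 1
    return np.log(hist + 1.0)                             # logarithmic TF weighting

rng = np.random.default_rng(0)
train_llds = rng.normal(size=(5000, 65))                  # stand-in for ComParE LLDs
codebook = train_llds[rng.choice(len(train_llds), 500, replace=False)]  # random sampling
chunk_llds = rng.normal(size=(300, 65))                   # LLDs of one audio chunk
print(boaw(chunk_llds, codebook).shape)                   # (500,); deltas handled analogously
```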
DeepSpectrum: The DeepSpectrum toolkit (https://github.com/DeepSpectrum/DeepSpectrum) is applied to obtain deep representations from the input audio data utilising pre-trained convolutional neural networks (CNNs) [22]. DeepSpectrum features have been shown to be effective, e.g., for speech processing [23]. First, the audio signals are transformed into mel-spectrogram plots using a Hanning window of 32 ms width and an overlap of 16 ms; from these, 128 mel frequency bands are computed. The spectrograms are then forwarded through DenseNet121 [24], a pre-trained CNN, and the activations of the 'avg_pool' layer of the network are extracted, resulting in a 2 048-dimensional feature vector.

auDeep: Another feature set is obtained through unsupervised representation learning with recurrent sequence-to-sequence autoencoders, using auDeep (https://github.com/auDeep/auDeep) [25, 26]. These explicitly model the inherently sequential nature of audio with Recurrent Neural Networks (RNNs) within the encoder and decoder networks [25, 26]. Here, mel-scale spectrograms are first extracted from the raw waveforms of a data set. In order to eliminate some background noise, power levels below four different given thresholds are clipped in these spectrograms, which results in four separate sets of spectrograms per data set. Subsequently, a distinct recurrent sequence-to-sequence autoencoder is trained on each of these sets of spectrograms in an unsupervised way, i.e., without any label information. The learnt representations of a spectrogram are then extracted as feature vectors for the corresponding instance. Finally, these feature vectors are concatenated to obtain the final feature vector. For the results shown in Table 2, the autoencoders' hyperparameters were not optimised.
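A rough sketch of the DeepSpectrum-style extraction described above (mel-spectrogram forwarded through a pre-trained DenseNet121, globally average-pooled activations used as features), here assembled from librosa and torchvision (>= 0.13) rather than the DeepSpectrum toolkit; the colour mapping of the spectrogram plots and the ImageNet input normalisation of the official pipeline are omitted for brevity:

```python
import librosa
import numpy as np
import torch
import torchvision.transforms.functional as TF
from torchvision import models

def deep_spectrum_features(path: str) -> np.ndarray:
    # Mel-spectrogram: 32 ms Hann window, 16 ms hop, 128 mel bands, log power.
    y, sr = librosa.load(path, sr=16000, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256,
                                         window="hann", n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Scale to [0, 1] and replicate to three channels to mimic an image input
    # (the DeepSpectrum toolkit instead renders a coloured matplotlib plot).
    img = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-9)
    tensor = torch.from_numpy(np.stack([img] * 3)).float().unsqueeze(0)
    tensor = TF.resize(tensor, [224, 224])
    # Pre-trained DenseNet121; global average pooling over the last feature maps.
    net = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1).eval()
    with torch.no_grad():
        fmap = torch.nn.functional.relu(net.features(tensor))
        feats = torch.nn.functional.adaptive_avg_pool2d(fmap, 1).flatten(1)
    return feats.squeeze(0).numpy()
```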
DiFE: Escalation is marked by an increase in arousal conveyed by acoustic rather than linguistic features; yet, semantic connotations might additionally play a role [27, 28]. To this end, we developed a lightweight Dutch LinguistIc Feature Extractor (DiFE) pipeline (https://github.com/lstappen/DiFE), similar to [29] and last year's challenge [30], to utilise linguistic features for the ESS. Transformer language embeddings have recently shown tremendous success over a wide range of Natural Language Processing tasks. For the vectorisation, DiFE utilises either a) a standard pre-trained Dutch BERT model (plain), b) a version fine-tuned on an external sentiment task (sent) [31], or c) a version fine-tuned on the escalation train and validation partitions (tuned). Next, a 768-dimensional context embedding vector is extracted for each word of a segment by summing over the last four Transformer layers [32]. The sequence of encoded words is then either summed up again across the time dimension, or fed into a feature compression block to obtain a single feature vector for the entire segment. For compression, the pipeline uses a bidirectional Long Short-Term Memory (LSTM) RNN with an attention module (BlAtt), followed by two feedforward layers. The output of this last layer is used as the feature input for the SVM evaluation.
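A hypothetical sketch of the word-embedding step (summing the hidden states of the last four Transformer layers, then pooling over tokens) using the Hugging Face transformers library; the checkpoint name is merely an example of a pre-trained Dutch BERT and not necessarily the model used by DiFE:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "GroNLP/bert-base-dutch-cased"   # example Dutch BERT; assumption, see text
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

def segment_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states        # tuple: embeddings + all layers
    word_vecs = torch.stack(hidden[-4:]).sum(dim=0)    # sum of the last four layers
    return word_vecs.sum(dim=1).squeeze(0)             # sum over tokens -> 768-dim vector

print(segment_embedding("Dit is een voorbeeldzin.").shape)  # torch.Size([768])
```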
Figure 1: Confusion matrices on Dev for CCS, CSS, ESS, and PRS. For each Sub-Challenge, the individual approach/hyperparameters yielding the best Test result (without fusion) were chosen; the corresponding configuration and its UAR on Dev are given on top of each panel: CCS, UAR = 64.7 % (openXBOW); CSS, UAR = 57.9 % (openSMILE); ESS, UAR = 70.6 % (openXBOW); PRS, UAR = 84.6 % (auDeep). In the cells, the absolute number and the percentage of 'classified as' of the class displayed in the respective row are given; the percentage is also indicated by a colour scale: the darker, the higher.

End2You: We utilise the multimodal profiling toolkit End2You (https://github.com/end2you/end2you) [33] to perform end-to-end learning. For our purposes, we utilise the Emo-18 [34] deep neural network, which uses a convolutional network to extract features from the raw time representation, followed by a recurrent network with Gated Recurrent Units (GRUs) that performs the final classification. For training the network, we split the raw waveform into chunks of 100 ms each (except for the PRS Sub-Challenge, where chunks of 70 ms are used). These are fed into a three-layer convolutional network comprising a series of convolution and pooling operations that aim to find a robust representation of the original signal. The extracted features are passed to a two-layer GRU to capture the temporal dynamics in the raw waveform.
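For illustration, a hypothetical PyTorch sketch of a raw-waveform CNN + GRU classifier of the kind described (three convolution/pooling blocks over 100 ms chunks, followed by a two-layer GRU); it is not the Emo-18 architecture itself, whose exact configuration is given in [34]:

```python
import torch
import torch.nn as nn

class RawAudioCRNN(nn.Module):
    """Illustrative CNN + GRU over 100 ms raw-waveform chunks (16 kHz -> 1 600 samples)."""
    def __init__(self, n_classes: int, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(                      # three conv/pooling blocks
            nn.Conv1d(1, 32, kernel_size=8, padding=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=6, padding=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=6, padding=3), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.gru = nn.GRU(128, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_chunks, 1600) raw-waveform chunks of 100 ms each
        b, t, n = x.shape
        feats = self.conv(x.reshape(b * t, 1, n)).mean(dim=-1)   # per-chunk features
        seq, _ = self.gru(feats.reshape(b, t, -1))               # temporal modelling
        return self.out(seq[:, -1])                              # classify the sequence

logits = RawAudioCRNN(n_classes=2)(torch.randn(4, 30, 1600))
print(logits.shape)  # torch.Size([4, 2])
```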
For the sake of transparency and reproducibility of the baseline computation, in line with previous years, we use an open-source implementation of SVMs with linear kernels. The provided scripts employ the scikit-learn toolkit with its class LinearSVC for the classification based on functionals, BoAW, auDeep, DiFE, and DeepSpectrum features. All feature representations were scaled to zero mean and unit standard deviation (MinMaxScaler of scikit-learn), using the parameters from the respective training set (when Train and Dev were fused for the final classifier, the parameters were calculated on this fusion). The complexity parameter C was always optimised during the development phase. Each Sub-Challenge package includes scripts that allow participants to reproduce the baselines and perform the testing in a reproducible and automatic way (including pre-processing, model training, model evaluation on Dev, and scoring by the competition measure and further measures). This year, we provide the six approaches outlined above. As in the last three years, we chose the highest results on Test for defining the baselines, irrespective of the corresponding results on Dev, in order to prevent participants from surpassing the official baseline by simply repeating or slightly modifying other constellations that can be found in Table 2. A fusion of the best configurations (each different approach with its best parameters) by majority voting is given in the last row of Table 2.
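The evaluation procedure can be reproduced along these lines (a sketch, not the official challenge scripts): fit the scaler and a LinearSVC on Train for each complexity value and select C by UAR on Dev. X_train, y_train, X_dev, and y_dev are assumed to hold the extracted features and labels, and the C grid below is illustrative:

```python
from sklearn.metrics import recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

best_c, best_uar = None, -1.0
for C in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0]:           # candidate complexities
    clf = make_pipeline(MinMaxScaler(), LinearSVC(C=C, max_iter=100000))
    clf.fit(X_train, y_train)                           # scaler parameters from Train only
    uar = recall_score(y_dev, clf.predict(X_dev), average="macro")
    if uar > best_uar:
        best_c, best_uar = C, uar
print(f"best C = {best_c}, Dev UAR = {best_uar:.3f}")
```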
As can be seen in Table 2, for CCS, the baseline is the Fusion of Best with a UAR of 73.… %; for CSS, the baseline is based on ComParE with a UAR of 72.… %; for ESS, BoAWs define the baseline with a UAR of 59.… %; and for PRS, the baseline is the Fusion of Best with a UAR of 87.46 %.

We provide two types of 95 % confidence intervals, see the column 'CI on Test' in Table 2: First, we performed 1000x bootstrapping on Test (random selection with replacement) and computed UARs, based on the same model that was trained on Train and Dev; the CI for these UARs is given before the slash. Then, we performed 100x bootstrapping on the corresponding combination of Train and Dev, and employed the different models obtained from these combinations to get UARs on Test and, subsequently, CIs, as displayed after the slash. (For PRS, only 10x bootstrapping was executed, because of the large number of data points; this second type of CI was not computed for End2You, as it would have required too time-consuming computation cycles.) Note that for this second type of CI, the Test results are often above the CI, sometimes within, and in a few cases below it. Obviously, reducing the variability of the samples in the training phase with bootstrapping results, on average, in somewhat lower performance.

Figure 1 displays the confusion matrices for the four Sub-Challenges on Dev, corresponding to the best result on Test; e.g., for CSS, the best Test result (without fusion) is 72.9 % UAR for N = …
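The first type of confidence interval can be sketched as follows (bootstrap over the Test predictions of one fixed model, 1000 resamples, 2.5th and 97.5th percentiles); y_true and y_pred below are toy placeholders:

```python
import numpy as np
from sklearn.metrics import recall_score

def bootstrap_ci(y_true, y_pred, n_boot: int = 1000, seed: int = 0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    uars = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        uars.append(recall_score(y_true[idx], y_pred[idx],
                                 average="macro", zero_division=0))
    return np.percentile(uars, [2.5, 97.5])               # 95 % confidence interval

print(bootstrap_ci(["N", "P", "N", "P", "N", "P"], ["N", "P", "P", "P", "N", "N"]))
```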
4. Concluding Remarks
This year's challenge features four new tasks (COVID-19 Cough and Speech, Escalation, and Primates), all of them highly relevant for applications. Besides the by now 'classic' approaches ComParE and Bag-of-Audio-Words (BoAWs), we further featured sequence-to-sequence autoencoder-based audio features from the auDeep toolkit, DeepSpectrum features, a Dutch LinguistIc Feature Extractor (DiFE), as well as end-to-end deep sequence modelling. For all computation steps, scripts are provided that can, but need not, be used by the participants. We expect participants to obtain better performance by employing novel (combinations of) procedures and features, including ones tailored to the particular tasks.
5. Acknowledgements
We acknowledge funding from the DFG's Reinhart Koselleck project No. 442218748 (AUDI0NOMOUS), the EU's HORIZON 2020 Grant No. 115902 (RADAR-CNS), and the ERC project No. 833296 (EAR).
6. References