End-2-End COVID-19 Detection from Breath & Cough Audio
Harry Coppock*, Alexander Gaskell*, Panagiotis Tzirakis, Alice Baird, Lyn Jones, Björn W. Schuller

GLAM – Group on Language, Audio, & Music, Imperial College London, London, UK
Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
Radiology Department, North Bristol NHS Trust, Bristol, UK
Keywords:
COVID-19, Computer Audition, Digital Health, Deep Learning, Audio
SUMMARY BOX
Our main contributions are as follows:
• We demonstrate the first attempt to diagnose COVID-19 using end-to-end deep learning from a crowd-sourced dataset of audio samples, achieving a ROC-AUC of 0.846.
• Our model, the COVID-19 Identification ResNet (CIdeR), has potential for rapid scalability, minimal cost, and improving performance as more data becomes available. This could enable regular COVID-19 testing at a population scale.
• We introduce a novel modelling strategy using a custom deep neural network to diagnose COVID-19 from a joint breath and cough representation.
• We release our four stratified folds for cross parameter optimisation and validation on a standard public corpus, and details on the models for reproducibility and future reference.
INTRODUCTION
The Coronavirus disease 2019 (COVID-19), caused by the severe-acute-respiratory-syndrome-coronavirus 2 (SARS-CoV-2), is the first global pandemic of the 21st century. Since its emergence in December 2019, it has led to over 75 million confirmed cases and more than 1.6 million deaths in over 200 countries (WHO). SARS-CoV-2 causes either asymptomatic infection or clinical disease, which ranges from mild to life-threatening [1]. Developing a swift and accurate test, able to identify both symptomatic and asymptomatic cases, is therefore essential for pandemic control.

Vocal biomarkers of SARS-CoV-2 infection have been described, thought to relate to the clinical and subclinical effects of the virus on the lower respiratory tract, neuro-muscular function, senses of taste and smell, and on proprioceptive feedback. Together, these produce a reduction in the complexity of the co-ordination of respiratory and laryngeal motion in both symptomatic and asymptomatic individuals [2].

Recently, several audio applications have been released that capture the breath or cough of individuals. Examples include ‘Coughvid’ [3], ‘Breath for Science’, ‘Coswara’ [4], and ‘CoughAgainstCovid’ [5]. With the release of these datasets, several studies have been published that leverage breath and/or cough signals alongside machine learning to detect the virus [6, 7, 8, 9, 10, 11]. However, these approaches compute representations of the breath and cough signals separately. In contrast, our approach computes a joint representation using a single model.

We postulate that end-to-end deep learning using convolutional neural networks (CNNs) could be successfully applied to this assessment task. This article describes a proof-of-concept study of automatic symptomatic and asymptomatic COVID-19 recognition using combined breathing and coughing information from audio recordings, using an end-to-end CNN design.

*Equal contribution
The code for our experiments and all details for reproduction of findings can be found at https://github.com/glam-imperial/cider.

METHODS
The objective is supervised-learning binary classification: diagnosing COVID-19 as positive or negative from audio signals. Our implementation, displayed in Figure 1, has two distinct stages, which are outlined below.

Figure 1: A schematic of the COVID-19 Identification ResNet (CIdeR). The figure shows a blow-up of a residual block, consisting of convolutional, batch normalisation, and Rectified Linear Unit (ReLU) layers.
1. Spectrogram extraction
As shown in Figure 1, each participant in the study carried out by the University of Cambridge [6] could submit waveform audio (WAV) files including a breath sample and a cough sample. We first compute the spectrogram of each of these WAV files to obtain a visual representation of the spectrum of audio frequencies against time. Next, we perform a log transformation, converting the spectrogram from an amplitude representation to a decibel representation. These transformations are implemented using the librosa [12] Python package.

Each WAV file lasts between one and forty-eight seconds, with a mean of ten seconds. As uniform duration is required for CNN input, we chunk the whole WAV file into s-second segments, using right padding for files shorter than s seconds. This creates an image of size {F, W}, where F ∝ fft_n and W ∝ sr × s, and fft_n and sr are parameters used when computing the spectrogram. During model training, we only process one WAV segment (sampled uniformly). At inference time, we perform majority voting, whereby each chunk is processed in parallel and the output label becomes the modal classification across the chunks.
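The chunking and voting steps above can be sketched in plain Python. This is a minimal illustration, not the authors' exact implementation: the function and parameter names (`chunk_waveform`, `majority_vote`, `sample_rate`, `s`) are ours, and in the real pipeline the chunks would be log spectrograms rather than raw sample lists.

```python
import math

def chunk_waveform(samples, sample_rate, s):
    """Split a waveform into s-second segments of sample_rate * s samples,
    right-padding the final segment with zeros so all chunks have equal length."""
    chunk_len = sample_rate * s
    n_chunks = max(1, math.ceil(len(samples) / chunk_len))
    padded = list(samples) + [0.0] * (n_chunks * chunk_len - len(samples))
    return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

def majority_vote(chunk_logits):
    """Modal classification over per-chunk model outputs; a tied vote is
    broken by the sign of the mean logit, as described in the text."""
    positives = sum(1 for logit in chunk_logits if logit > 0)
    negatives = len(chunk_logits) - positives
    if positives != negatives:
        return int(positives > negatives)
    return int(sum(chunk_logits) / len(chunk_logits) > 0)
```

For a 25-sample file at a (toy) rate of 10 samples per second with s = 1, `chunk_waveform` yields three 10-sample chunks, the last right-padded with five zeros.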
2. Convolutional Neural Network
CIdeR is based on ResNets [13], a variant of the CNN architecture which uses residual blocks. As shown in Figure 1, a residual block consists of two convolutions, batch normalisation [14], and a Rectified Linear Unit (ReLU) non-linearity. These blocks use "skip" connections, which add the output of these operations to the layer's input activations. This alleviates the vanishing gradient problem, facilitating deeper architectures with more layers and thereby permitting richer hierarchical learnt representations. The number of convolutional channels for each of CIdeR's nine layers is annotated in Figure 1.

We concatenate the log spectrograms of the breath and cough samples depth-wise, creating an {F, W, 2} matrix as the model input. The CNN outputs a single logit, which is then passed through a sigmoid layer to obtain a score in (0, 1) representing the probability of a COVID-positive sample. A weighted binary cross-entropy loss function [15] is used during training to address the class imbalance in the dataset.

In the case of a tied vote at inference, the mean of the output logits is taken.

Training strategy
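The class-imbalance handling can be illustrated with a positively-weighted binary cross-entropy. Note this is a generic sketch: the paper cites a class-balanced loss [15], and the `pos_weight` parameter here is our illustrative name for the weighting, not a value from the paper.

```python
import math

def weighted_bce(labels, probs, pos_weight):
    """Binary cross-entropy with the positive (COVID) class up-weighted,
    so that errors on the rarer positive class contribute more loss."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)
```

With `pos_weight = 1` this reduces to the standard binary cross-entropy; raising `pos_weight` penalises missed positives more heavily, which counteracts the roughly 1:5 positive-to-negative participant ratio in the dataset.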
Prior work [6] used "10-fold-like" cross-validation during training (see the paper for details). In contrast, we implement a stratified 3-fold cross-optimisation and additional validation partitioning, using 2 / 1 (rotating development + train) / 1 (always-held-out, fixed test) folds, respectively. This is to best optimise parameters independently of the test set with a small dataset, while ensuring that the test set remains a) fixed, for easier comparison with other work, and b) truly blind, eliminating the possibility of CIdeR overfitting to the test set. Our stratified sampling methodology ensures that our folds represent disjoint sets of participants and that each of the strata (next section) is approximately uniformly distributed across each fold. To enable reproducibility, the folds are fully released in the accompanying code.
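A participant-disjoint, stratified fold assignment of the kind described can be sketched as follows. This is an assumed round-robin construction for illustration only; the authors' released folds are the authoritative partitioning, and `assign_folds` is our own name.

```python
from collections import defaultdict

def assign_folds(participants, n_folds=4):
    """Map participant id -> fold index so that (a) all recordings from one
    participant land in the same fold (folds are participant-disjoint) and
    (b) each stratum is spread approximately uniformly across the folds,
    via round-robin assignment within each stratum."""
    by_stratum = defaultdict(list)
    for pid, stratum in participants:
        by_stratum[stratum].append(pid)
    fold_of = {}
    for stratum in sorted(by_stratum):
        for i, pid in enumerate(sorted(by_stratum[stratum])):
            fold_of[pid] = i % n_folds
    return fold_of
```

Because the assignment is keyed on participant id rather than on recordings, a participant who submitted both a breath and a cough file can never straddle the train/test boundary.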
Baseline
Our approach is not directly comparable with the study from [6], as they do not explicitly provide their folds and discard some audio samples. To create a performance reference for CIdeR, we therefore implement a linear-kernel Support Vector Machine (SVM) [16] baseline. We extract openSMILE features [17] for each wavefile following the Interspeech 2016 ComParE challenge format [18] and perform Principal Component Analysis (PCA) [19], selecting the top 100 components by highest explained variance. We follow the cross-optimisation procedure outlined above, using the development set to optimise the complexity parameter and reporting final results using the held-out test set.

DATASET
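The PCA step of the baseline can be sketched via a singular value decomposition of the centred feature matrix. This is a generic sketch of the technique, assuming nothing about the authors' exact tooling; `pca_reduce` is our own name, and in practice a library implementation (e.g. scikit-learn's) would be used.

```python
import numpy as np

def pca_reduce(features, k=100):
    """Project feature rows onto the top-k principal components,
    ranked by explained variance, via SVD of the centred data."""
    centred = features - features.mean(axis=0)
    # Right singular vectors of the centred matrix are the principal axes,
    # ordered by decreasing singular value (i.e. explained variance).
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:k].T
```

The projected components come out ordered by decreasing variance, so keeping the first 100 columns retains the directions that explain the most variance in the openSMILE feature space.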
The dataset used in this work consists of 517 crowdsourced coughing and breathing audio recordings from 355 participants, of whom 62 had tested positive for COVID-19 within 14 days of the recording. To be labelled COVID-negative, participants had to meet a number of stringent criteria described in [6]. These participants were then divided into 3 categories: those with no cough (healthy-no-symptoms), those with a cough (healthy-with-cough), and those who had asthma (asthma-with-cough). The COVID-positive class consists of the 62 COVID-positive participants and is further divided into the subclasses COVID-no-cough and COVID-cough, representing 39 COVID-positive participants without a cough and 23 participants with a cough, respectively.
EXPERIMENTS & RESULTS
As indicated above, we perform a 3-fold cross-optimisation using the rotating development plus train folds. Recall that the test set is fixed and always held out during optimisation. For evaluation metrics, we utilise the Area Under Curve of the Receiver Operating Characteristics curve (AUC-ROC) and Unweighted Average Recall (UAR), both of which are robust to imbalanced datasets. AUC-ROC maps the relationship between sensitivity and the false positive rate as the classification threshold is varied, and UAR computes the mean recall per class. The models' performance is sensitive to initialisation parameters, so we report the mean and standard deviation from three training runs. Table 1 details our hyperparameter search and the optimal values used for the final model.

Our model performs the three tasks described in the dataset publication [6], and an additional fourth task. Tasks 1-4 are as follows:
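Both evaluation metrics have short, standard definitions that can be written directly; this is a plain-Python sketch of the textbook formulations (the rank-based Mann-Whitney form for AUC-ROC), with function names of our choosing.

```python
def uar(y_true, y_pred):
    """Unweighted Average Recall: the mean of per-class recalls,
    so the majority class cannot dominate the score."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def auc_roc(y_true, scores):
    """AUC-ROC via its rank (Mann-Whitney) formulation: the probability
    that a randomly drawn positive outscores a randomly drawn negative,
    counting ties as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that always predicts the majority class scores 0.5 UAR on a binary task regardless of the imbalance, which is why both metrics suit this dataset.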
Task 1
Distinguishing between COVID-positive and the strata healthy-no-symptoms (62 vs 245 participants).
Task 2
Distinguishing between COVID-positive participants with a cough (COVID-cough) and the strata healthy-with-cough (23 vs 30 participants).

Learning-rate values were searched on a logarithmic scale.
The dataset used in this study is a small subset of the full dataset that has been collected by the University of Cambridge, which has yet to be made fully public. As of July 2020, the full dataset totalled 10 000 samples from roughly 7 000 participants.
Table 1
Overview of the hyperparameter search, detailing the interval, step size, and optimal parameters (used to obtain the reported figures in this article; for details cf. the above-named GitHub repository). Hyperparameters were optimised for task 4 and subsequently used on all tasks. *Interval constructed using a logarithmic scale. Adam [20] was used for optimisation.
Parameter        Min.   Max.   Step   Optimal
Learning rate*
Batch size       8      32     2*
fft_n*
sr [kHz]         24     48     2*

Task 3
Distinguishing between COVID-positive participants with a cough (COVID-cough) and the strata asthma-with-cough (23 vs 19 participants).
Task 4
Distinguishing between COVID-positive and COVID-negative (62 vs 293 participants).

Note that the number of participants deviates from [6], as we also use those audio clips shorter than two seconds, resulting in partially more participants being considered.

Results obtained for each task are shown in Table 2, alongside the baseline. CIdeR outperforms the baseline on all tasks bar task 2, with a high margin on both metrics. The results for tasks 1, 3, and 4 are statistically significant at a significance level of 0.01 in a two-sided two-sample t-test for a difference in sample means.
DISCUSSION
The results in Table 2 demonstrate two key points: 1) it is possible to diagnose COVID-19 using a CNN-based model trained on crowdsourced data; 2) CIdeR obtains a high AUC-ROC of 0.846 on task 4, the task which uses the entire sample and so represents the most pertinent task. These suggest that jointly processing breath and cough audio signals using a CNN-based classifier could act as an effective and scalable method for COVID-19 diagnosis.

Table 2: Results of the models on tasks 1-4 for 3-fold optimisation of the number of training epochs based on the rotated development sets, using the frozen optimal model parameters from Table 1. [Train+development / test] sample counts are displayed alongside the task. Testing is performed on the held-out test fold. The mean Area Under Curve of the Receiver Operating Characteristics curve (AUC(-ROC)) and the Unweighted Average Recall (UAR) are displayed. A 95% confidence interval is also shown, following [21] for AUC-ROC and the normal approximation method for UAR. Scores in bold indicate significant results with α = 0.01, using a 2-sample t-test for no difference in means between the baseline and CIdeR, based on the standard deviation from the 3-fold cross-validation.

                CIdeR                       Baseline
Task        AUC          UAR            AUC          UAR
1 [ / ]     .827 ± .051  .770 ± .053    .697 ± .066  .677 ± .059
2 [ / ]*    .570 ± .216  .535 ± .185    .628 ± .208  .583 ± .183
3 [ / ]*    .909 ± .130  .774 ± .145    .559 ± .220  .506 ± .173
4 [ / ]     .846 ± .040  .765 ± .044    .721 ± .053  .654 ± .050

*It is questionable whether the normality assumption holds at these small sample sizes. The confidence interval estimates should therefore be taken lightly.

The only task where CIdeR fails to outperform the baseline in our experiments is task 2. We posit this is jointly due to the small number of samples and the similarity of the audio patterns of healthy participants with a cough and those with COVID-19, creating a challenging task. We leave further analysis for future work.

A key limitation of this study is the size and demographics of the publicly available dataset [6]. We are limited to 62 COVID-positive participants, limiting the breadth of any conclusions we draw. Our control group, COVID-free participants, is not a random sample, as participation required that the subject lived in a country with low COVID-19 rates, among other criteria. How representative these audio biomarker features are of the wider population is therefore still an open question. Importantly, before such a technology can be deployed, evaluation on a larger, more representative dataset is necessary. As alluded to in [22], pandemics have historically led to breakthroughs in healthcare. If AI-driven screening is to be one of these breakthroughs from the 2020 COVID-19 pandemic, a more comprehensive dataset is required.

CONCLUSION
Wholesale testing of the population is a promising avenue for identifying and controlling the spread of COVID-19. A digital audio-collection and diagnostic system could be deployed to the majority of the population and performed daily at minimal cost, e.g., for pre-selection for more reliable diagnoses or for monitoring of spread, and therefore holds great potential. This study introduced the COVID-19 Identification ResNet (CIdeR), which demonstrated a strong proof of concept for applying end-to-end deep learning to jointly learning representations from breath and cough audio samples. This was despite a small dataset; given more samples, it seems likely that CIdeR's diagnostic capabilities would significantly increase.
ACKNOWLEDGEMENTS
The authors give their thanks for the help provided by their colleagues Mina A. Nessiem and Mostafa M. Mohamed. The University of Cambridge does not bear any responsibility for the analysis or interpretation of the data used herein, which represents the authors' own view.
REFERENCES

[1] Rafael Polidoro, Robert Hagan, Roberta de Santis Santiago, and Nathan Schmidt. Overview: Systemic Inflammatory Response Derived From Lung Injury Caused by SARS-CoV-2 Infection Explains Severe Outcomes in COVID-19. Frontiers in Immunology, 11:1626, 2020.
[2] Thomas Quatieri, Tanya Talkar, and Jeffrey Palmer. A Framework for Biomarkers of COVID-19 Based on Coordination of Speech-Production Subsystems. IEEE Open Journal of Engineering in Medicine and Biology, 1:203–206, 2020.
[3] Lara Orlandic, Tomás Teijeiro, and David Atienza. The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms. arXiv, (2009.11644), 2020. 11 pages.
[4] Neeraj Sharma, Prashant Krishnan, Rohit Kumar, Shreyas Ramoji, Srikanth Raj Chetupalli, Nirmala R., Prasanta Kumar Ghosh, and Sriram Ganapathy. Coswara – A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis. In Proc. Interspeech, pages 4811–4815, Shanghai, China, 2020.
[5] Piyush Bagad, Aman Dalmia, Jigar Doshi, Arsha Nagrani, Parag Bhamare, Amrita Mahale, Saurabh Rane, Neeraj Agarwal, and Rahul Panicker. Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds. arXiv, (2009.08790), 2020. 12 pages.
[6] Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Jing Han, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, and Cecilia Mascolo. Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data. In Proc. Knowledge Discovery and Data Mining, pages 3474–3484, 2020.
[7] Katrin D. Bartl-Pokorny, Florian B. Pokorny, Anton Batliner, Shahin Amiriparian, Anastasia Semertzidou, Florian Eyben, Elena Kramer, Florian Schmidt, Rainer Schönweiler, Markus Wehler, and Björn W. Schuller. The voice of COVID-19: Acoustic correlates of infection, 2020. 8 pages.
[8] Kotra Venkata Sai Ritwik, Shareef Babu Kalluri, and Deepu Vijayasenan. COVID-19 Patient Detection from Telephone Quality Speech Data. arXiv, (2011.04299), 2020. 6 pages.
[9] Jordi Laguarta, Ferran Hueto, and Brian Subirana. COVID-19 Artificial Intelligence Diagnosis using only Cough Recordings. IEEE Open Journal of Engineering in Medicine and Biology, pages 1–1, 2020.
[10] Gadi Pinkas, Yarden Karny, Aviad Malachi, Galia Barkai, Gideon Bachar, and Vered Aharonson. SARS-CoV-2 Detection From Voice. IEEE Open Journal of Engineering in Medicine and Biology, 1:268–274, 2020.
[11] Ali Imran, Iryna Posokhova, Haneya Naeem Qureshi, Usama Masood, Sajid Riaz, Kamran Ali, Charles N. John, and Muhammad Nabeel. AI4COVID-19: AI Enabled Preliminary Diagnosis for COVID-19 from Cough Samples via an App. arXiv, (2004.01275), 2020. 27 pages.
[12] Brian McFee, Colin Raffel, Dawen Liang, Daniel Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proc. Python in Science Conference, volume 8, pages 18–25, Austin, TX, 2015.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proc. Conference on Computer Vision and Pattern Recognition, pages 770–778, Las Vegas, NV, 2016.
[14] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. International Conference on Machine Learning, volume 37, pages 448–456, Lille, France, 2015.
[15] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge J. Belongie. Class-Balanced Loss Based on Effective Number of Samples. arXiv, (1901.05555), 2019. 11 pages.
[16] Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
[17] Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor. In Proc. ACM International Conference on Multimedia, pages 1459–1462, Florence, Italy, 2010.
[18] Björn Schuller, Stefan Steidl, Anton Batliner, Julia Hirschberg, Judee K. Burgoon, Alice Baird, Aaron Elkins, Yue Zhang, Eduardo Coutinho, and Keelan Evanini. The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language. In Proc. Interspeech 2016, pages 2001–2005, San Francisco, CA, 2016.
[19] Christopher Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. International Conference on Learning Representations, San Diego, CA, 2015.
[21] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
[22] Ashley McKimm. Call to action for the BMJ Innovations community after COVID-19. BMJ Innovations, 7(1):1–2, 2021.