End-2-End COVID-19 Detection from Breath & Cough Audio
Harry Coppock*, Alexander Gaskell*, Panagiotis Tzirakis, Alice Baird, Lyn Jones, Björn W. Schuller

GLAM – Group on Language, Audio, & Music, Imperial College London, London, UK
Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
Radiology Department, North Bristol NHS Trust, Bristol, UK
Keywords:
COVID-19, Computer Audition, Digital Health, Deep Learning, Audio
SUMMARY BOX
Our main contributions are as follows:
• We demonstrate the first attempt to diagnose COVID-19 using end-to-end deep learning from a crowd-sourced dataset of audio samples, achieving a ROC-AUC of 0.846.
• Our model, the COVID-19 Identification ResNet (CIdeR), has potential for rapid scalability, minimal cost, and improving performance as more data becomes available. This could enable regular COVID-19 testing at a population scale.
• We introduce a novel modelling strategy using a custom deep neural network to diagnose COVID-19 from a joint breath and cough representation.
• We release our four stratified folds for cross parameter optimisation and validation on a standard public corpus, and details on the models for reproducibility and future reference.
INTRODUCTION
The Coronavirus disease 2019 (COVID-19), caused by the severe-acute-respiratory-syndrome-coronavirus 2 (SARS-CoV-2), is the first global pandemic of the 21st century. Since its emergence in December 2019, it has led to over 75 million confirmed cases and more than 1.6 million deaths in over 200 countries (WHO). SARS-CoV-2 causes either asymptomatic infection or clinical disease, which ranges from mild to life-threatening [1]. Developing a swift and accurate test, able to identify both symptomatic and asymptomatic cases, is therefore essential for pandemic control.

Vocal biomarkers of SARS-CoV-2 infection have been described, thought to relate to the clinical and subclinical effects of the virus on the lower respiratory tract, neuro-muscular function, senses of taste and smell, and on proprioceptive feedback. Together, these produce a reduction in the complexity of the co-ordination of respiratory and laryngeal motion in both symptomatic and asymptomatic individuals [2].

Recently, several audio applications have been released that capture the breath or cough of individuals. Examples include ‘Coughvid’ [3], ‘Breath for Science’, ‘Coswara’ [4], and ‘CoughAgainstCovid’ [5]. With the release of these datasets, several studies have been published that leverage breath and/or cough signals alongside machine learning to detect the virus [6, 7, 8, 9, 10, 11]. However, these approaches compute representations of the breath and cough signals separately. In contrast, our approach computes a joint representation using a single model.

We postulate that end-to-end deep learning using convolutional neural networks (CNNs) could be successfully applied to this assessment task. This article describes a proof-of-concept study of automatic symptomatic and asymptomatic COVID-19 recognition using combined breathing and coughing information from audio recordings, using an end-to-end CNN design.

*Equal contribution
The code for our experiments and all details for reproduction of findings can be found at https://github.com/glam-imperial/cider.

METHODS
The objective is supervised-learning binary classification: diagnosing COVID-19 as positive or negative from audio signals. Our implementation, displayed in Figure 1, has two distinct stages, which are outlined below.

Figure 1: A schematic of the COVID-19 Identification ResNet (CIdeR). The figure shows a blow-up of a residual block, consisting of convolutional, batch normalisation, and Rectified Linear Unit (ReLU) layers.
1. Spectrogram extraction
As shown in Figure 1, each participant in the study carried out by the University of Cambridge [6] could submit waveform audio (WAV) files including a breath sample and a cough sample. We first compute the spectrogram of each of these WAV files to obtain a visual representation of the spectrum of audio frequencies against time. Next, we perform a log transformation, converting the spectrogram from an amplitude representation to a decibel representation. These transformations are implemented using the librosa [12] Python package.

Each WAV file lasts between one and forty-eight seconds, with a mean of ten seconds. As uniform duration is required for CNN input, we chunk the whole WAV file into s-second segments, using right padding for files shorter than s seconds. This creates an image of size {F, W}, where F ∝ fft_n and W ∝ sr × s, and fft_n and sr are parameters used when computing the spectrogram. During model training, we only process one WAV segment (sampled uniformly). At inference time, we perform majority voting, whereby each chunk is processed in parallel and the output label becomes the modal classification across the chunks.
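The chunking and voting steps above can be sketched in plain Python. This is a minimal illustration, not the authors' exact implementation: the function and parameter names (`chunk_waveform`, `majority_vote`, `sample_rate`, `s`) are ours, and in the real pipeline the chunks would be log spectrograms rather than raw sample lists.

```python
import math

def chunk_waveform(samples, sample_rate, s):
    """Split a waveform into s-second segments of sample_rate * s samples,
    right-padding the final segment with zeros so all chunks have equal length."""
    chunk_len = sample_rate * s
    n_chunks = max(1, math.ceil(len(samples) / chunk_len))
    padded = list(samples) + [0.0] * (n_chunks * chunk_len - len(samples))
    return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

def majority_vote(chunk_logits):
    """Modal classification over per-chunk model outputs; a tied vote is
    broken by the sign of the mean logit, as described in the text."""
    positives = sum(1 for logit in chunk_logits if logit > 0)
    negatives = len(chunk_logits) - positives
    if positives != negatives:
        return int(positives > negatives)
    return int(sum(chunk_logits) / len(chunk_logits) > 0)
```

For a 25-sample file at a (toy) rate of 10 samples per second with s = 1, `chunk_waveform` yields three 10-sample chunks, the last right-padded with five zeros.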
2. Convolutional Neural Network
CIdeR is based on ResNets [13], a variant of the CNN architecture which uses residual blocks. As shown in Figure 1, a residual block consists of two convolutions, batch normalisation [14], and a Rectified Linear Unit (ReLU) non-linearity. These blocks use "skip" connections, which add the output of these operations to the layer's input activations. This alleviates the vanishing gradient problem, facilitating deeper architectures with more layers and thereby permitting richer hierarchical learnt representations. The number of convolutional channels for each of CIdeR's nine layers is annotated in Figure 1.

We concatenate the log spectrograms of the breath and cough samples depth-wise, creating an {F, W, 2} matrix as the model input. The CNN outputs a single logit, which is then passed through a sigmoid layer to obtain a score in (0, 1) representing the probability of a COVID-positive sample. A weighted binary cross-entropy loss function [15] is used during training to address the class imbalance in the dataset.

In the case of a tied vote at inference, the mean of the output logits is taken.

Training strategy
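The class-imbalance handling can be illustrated with a positively-weighted binary cross-entropy. Note this is a generic sketch: the paper cites a class-balanced loss [15], and the `pos_weight` parameter here is our illustrative name for the weighting, not a value from the paper.

```python
import math

def weighted_bce(labels, probs, pos_weight):
    """Binary cross-entropy with the positive (COVID) class up-weighted,
    so that errors on the rarer positive class contribute more loss."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)
```

With `pos_weight = 1` this reduces to the standard binary cross-entropy; raising `pos_weight` penalises missed positives more heavily, which counteracts the roughly 1:5 positive-to-negative participant ratio in the dataset.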
Prior work [6] used "10-fold-like" cross-validation during training (see the paper for details). In contrast, we implement a stratified 3-fold cross-optimisation and additional validation partitioning, using 2 / 1 (rotating development + train) / 1 (always-held-out, fixed test) folds, respectively. This is to best optimise parameters independently of the test set with a small dataset, while ensuring that the test set remains a) fixed, for easier comparison with other work, and b) truly blind, eliminating the possibility of CIdeR overfitting to the test set. Our stratified sampling methodology ensures that our folds represent disjoint sets of participants and that each of the strata (next section) is approximately uniformly distributed across each fold. To enable reproducibility, the folds are fully released in the accompanying code.
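A participant-disjoint, stratified fold assignment of the kind described can be sketched as follows. This is an assumed round-robin construction for illustration only; the authors' released folds are the authoritative partitioning, and `assign_folds` is our own name.

```python
from collections import defaultdict

def assign_folds(participants, n_folds=4):
    """Map participant id -> fold index so that (a) all recordings from one
    participant land in the same fold (folds are participant-disjoint) and
    (b) each stratum is spread approximately uniformly across the folds,
    via round-robin assignment within each stratum."""
    by_stratum = defaultdict(list)
    for pid, stratum in participants:
        by_stratum[stratum].append(pid)
    fold_of = {}
    for stratum in sorted(by_stratum):
        for i, pid in enumerate(sorted(by_stratum[stratum])):
            fold_of[pid] = i % n_folds
    return fold_of
```

Because the assignment is keyed on participant id rather than on recordings, a participant who submitted both a breath and a cough file can never straddle the train/test boundary.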
Baseline
Our approach is not directly comparable with the study from [6], as they do not explicitly provide their folds and discard some audio samples. To create a performance reference for CIdeR, we therefore implement a linear-kernel Support Vector Machine (SVM) [16] baseline. We extract openSMILE features [17] for each wavefile following the Interspeech 2016 ComParE challenge format [18] and perform Principal Component Analysis (PCA) [19], selecting the top 100 components by highest explained variance. We follow the cross-optimisation procedure outlined above, using the development set to optimise the complexity parameter and reporting final results using the held-out test set.

DATASET
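The PCA step of the baseline can be sketched via a singular value decomposition of the centred feature matrix. This is a generic sketch of the technique, assuming nothing about the authors' exact tooling; `pca_reduce` is our own name, and in practice a library implementation (e.g. scikit-learn's) would be used.

```python
import numpy as np

def pca_reduce(features, k=100):
    """Project feature rows onto the top-k principal components,
    ranked by explained variance, via SVD of the centred data."""
    centred = features - features.mean(axis=0)
    # Right singular vectors of the centred matrix are the principal axes,
    # ordered by decreasing singular value (i.e. explained variance).
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:k].T
```

The projected components come out ordered by decreasing variance, so keeping the first 100 columns retains the directions that explain the most variance in the openSMILE feature space.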
The dataset used in this work consists of 517 crowdsourced coughing and breathing audio recordings from 355 participants, of whom 62 had tested positive for COVID-19 within 14 days of the recording. To be labelled COVID-negative, participants had to meet a number of stringent criteria described in [6]. These participants were then divided into 3 categories: those with no cough (healthy-no-symptoms), those with a cough (healthy-with-cough), and those who had asthma (asthma-with-cough). The COVID-positive class consists of the 62 COVID-positive participants and is further divided into the subclasses COVID-no-cough and COVID-cough, representing 39 COVID-positive participants without a cough and 23 participants with a cough, respectively.
EXPERIMENTS & RESULTS
As indicated above, we perform a 3-fold cross-optimisation using the rotating development plus train folds. Recall that the test set is fixed and always held out during optimisation. For evaluation metrics, we utilise the Area Under Curve of the Receiver Operating Characteristics curve (AUC-ROC) and Unweighted Average Recall (UAR), both of which are robust to imbalanced datasets. AUC-ROC maps the relationship between sensitivity and the false positive rate as the classification threshold is varied, and UAR computes the mean recall per class. The models' performance is sensitive to initialisation parameters, so we report the mean and standard deviation from three training runs. Table 1 details our hyperparameter search and the optimal values used for the final model.

Our model performs the three tasks described in the dataset publication [6], and an additional fourth task. Tasks 1-4 are as follows:
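Both evaluation metrics have short, standard definitions that can be written directly; this is a plain-Python sketch of the textbook formulations (the rank-based Mann-Whitney form for AUC-ROC), with function names of our choosing.

```python
def uar(y_true, y_pred):
    """Unweighted Average Recall: the mean of per-class recalls,
    so the majority class cannot dominate the score."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def auc_roc(y_true, scores):
    """AUC-ROC via its rank (Mann-Whitney) formulation: the probability
    that a randomly drawn positive outscores a randomly drawn negative,
    counting ties as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that always predicts the majority class scores 0.5 UAR on a binary task regardless of the imbalance, which is why both metrics suit this dataset.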
Task 1
Distinguishing between COVID-positive and the strata healthy-no-symptoms (62 vs 245 participants).
Task 2
Distinguishing between COVID-positive participants with a cough (COVID-cough) and the strata healthy-with-cough (23 vs 30 participants).

Learning-rate values were searched on a logarithmic scale.
The dataset used in this study is a small subset of the full dataset that has been collected by the University of Cambridge, which has yet to be made fully public. As of July 2020, the full dataset totalled 10 000 samples from roughly 7 000 participants.
Table 1
Overview of the hyperparameter search, detailing the interval, step size, and optimal parameters (used to obtain the reported figures in this article; for details cf. the above-named GitHub repository). Hyperparameters were optimised for task 4 and subsequently used on all tasks. *Interval constructed using a logarithmic scale. Adam [20] was used for optimisation.
Parameter        Min.   Max.   Step   Optimal
Learning rate*
Batch size       8      32     2*
fft_n*
sr [kHz]         24     48     2*

Task 3
Distinguishing between COVID-positive participants with a cough (COVID-cough) and the strata asthma-with-cough (23 vs 19 participants).
Task 4
Distinguishing between COVID-positive and COVID-negative (62 vs 293 participants).

Note that the number of participants deviates from [6], as we also use those audio clips shorter than two seconds, resulting in partially more participants being considered.

Results obtained for each task are shown in Table 2, alongside the baseline. CIdeR outperforms the baseline on all tasks bar task 2, with a high margin on both metrics. The results for tasks 1, 3, and 4 are statistically significant at a significance level of 0.01 in a two-sided two-sample t-test for a difference in sample means.
DISCUSSION
The results in Table 2 demonstrate two key points: 1) it is possible to diagnose COVID-19 using a CNN-based model trained on crowdsourced data; 2) CIdeR obtains a high AUC-ROC of 0.846 on task 4, the task which uses the entire sample and so represents the most pertinent task. These suggest that jointly processing breath and cough audio signals using a CNN-based classifier could act as an effective and scalable method for COVID-19 diagnosis.

Table 2: Results of the models on tasks 1-4 for 3-fold optimisation of the number of training epochs based on the rotated development sets, using the frozen optimal model parameters from Table 1. [Train+development / test] sample counts are displayed alongside the task. Testing is performed on the held-out test fold. The mean Area Under Curve of the Receiver Operating Characteristics curve (AUC(-ROC)) and the Unweighted Average Recall (UAR) are displayed. A 95% confidence interval is also shown, following [21] for AUC-ROC and the normal approximation method for UAR. Scores in bold indicate significant results with α = 0.01, using a 2-sample t-test for no difference in means between the baseline and CIdeR, based on the standard deviation from the 3-fold cross-validation.

                CIdeR                       Baseline
Task        AUC          UAR            AUC          UAR
1 [ / ]     .827 ± .051  .770 ± .053    .697 ± .066  .677 ± .059
2 [ / ]*    .570 ± .216  .535 ± .185    .628 ± .208  .583 ± .183
3 [ / ]*    .909 ± .130  .774 ± .145    .559 ± .220  .506 ± .173
4 [ / ]     .846 ± .040  .765 ± .044    .721 ± .053  .654 ± .050

*It is questionable whether the normality assumption holds at these small sample sizes. The confidence interval estimates should therefore be taken lightly.

The only task where CIdeR fails to outperform the baseline in our experiments is task 2. We posit this is jointly due to the small number of samples and the similarity of the audio patterns of healthy participants with a cough and those with COVID-19, creating a challenging task. We leave further analysis for future work.

A key limitation of this study is the size and demographics of the publicly available dataset [6]. We are limited to 62 COVID-positive participants, limiting the breadth of any conclusions we draw. Our control group, COVID-free participants, is not a random sample, as participation required that the subject lived in a country with low COVID-19 rates, among other criteria. How representative these audio biomarker features are of the wider population is therefore still an open question. Importantly, before such a technology can be deployed, evaluation on a larger, more representative dataset is necessary. As alluded to in [22], pandemics have historically led to breakthroughs in healthcare. If AI-driven screening is to be one of these breakthroughs from the 2020 COVID-19 pandemic, a more comprehensive dataset is required.

CONCLUSION
Wholesale testing of the population is a promising avenue for identifying and controlling the spread of COVID-19. A digital audio-collection and diagnostic system could be deployed to the majority of the population and performed daily at minimal cost, e.g., for pre-selection for more reliable diagnoses or for monitoring of spread, and therefore holds great potential. This study introduced the COVID-19 Identification ResNet (CIdeR), which demonstrated a strong proof of concept for applying end-to-end deep learning to jointly learning representations from breath and cough audio samples. This was despite a small dataset; given more samples, it seems likely that CIdeR's diagnostic capabilities would significantly increase.
ACKNOWLEDGEMENTS
The authors give their thanks for the help provided by their colleagues Mina A. Nessiem and Mostafa M. Mohamed. The University of Cambridge does not bear any responsibility for the analysis or interpretation of the data used herein, which represents the authors' own view.
REFERENCES

[1] Rafael Polidoro, Robert Hagan, Roberta de Santis Santiago, and Nathan Schmidt. Overview: Systemic Inflammatory Response Derived From Lung Injury Caused by SARS-CoV-2 Infection Explains Severe Outcomes in COVID-19. Frontiers in Immunology, 11:1626, 2020.
[2] Thomas Quatieri, Tanya Talkar, and Jeffrey Palmer. A Framework for Biomarkers of COVID-19 Based on Coordination of Speech-Production Subsystems. IEEE Open Journal of Engineering in Medicine and Biology, 1:203–206, 2020.
[3] Lara Orlandic, Tomás Teijeiro, and David Atienza. The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms. arXiv, (2009.11644), 2020. 11 pages.
[4] Neeraj Sharma, Prashant Krishnan, Rohit Kumar, Shreyas Ramoji, Srikanth Raj Chetupalli, Nirmala R., Prasanta Kumar Ghosh, and Sriram Ganapathy. Coswara – A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis. In Proc. Interspeech, pages 4811–4815, Shanghai, China, 2020.
[5] Piyush Bagad, Aman Dalmia, Jigar Doshi, Arsha Nagrani, Parag Bhamare, Amrita Mahale, Saurabh Rane, Neeraj Agarwal, and Rahul Panicker. Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds. arXiv, (2009.08790), 2020. 12 pages.
[6] Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Jing Han, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, and Cecilia Mascolo. Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data. In Proc. Knowledge Discovery and Data Mining, pages 3474–3484, 2020.
[7] Katrin D. Bartl-Pokorny, Florian B. Pokorny, Anton Batliner, Shahin Amiriparian, Anastasia Semertzidou, Florian Eyben, Elena Kramer, Florian Schmidt, Rainer Schönweiler, Markus Wehler, and Björn W. Schuller. The voice of COVID-19: Acoustic correlates of infection, 2020. 8 pages.
[8] Kotra Venkata Sai Ritwik, Shareef Babu Kalluri, and Deepu Vijayasenan. COVID-19 Patient Detection from Telephone Quality Speech Data. arXiv, (2011.04299), 2020. 6 pages.
[9] Jordi Laguarta, Ferran Hueto, and Brian Subirana. COVID-19 Artificial Intelligence Diagnosis using only Cough Recordings. IEEE Open Journal of Engineering in Medicine and Biology, pages 1–1, 2020.
[10] Gadi Pinkas, Yarden Karny, Aviad Malachi, Galia Barkai, Gideon Bachar, and Vered Aharonson. SARS-CoV-2 Detection From Voice. IEEE Open Journal of Engineering in Medicine and Biology, 1:268–274, 2020.
[11] Ali Imran, Iryna Posokhova, Haneya Naeem Qureshi, Usama Masood, Sajid Riaz, Kamran Ali, Charles N. John, and Muhammad Nabeel. AI4COVID-19: AI Enabled Preliminary Diagnosis for COVID-19 from Cough Samples via an App. arXiv, (2004.01275), 2020. 27 pages.
[12] Brian McFee, Colin Raffel, Dawen Liang, Daniel Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proc. Python in Science Conference, volume 8, pages 18–25, Austin, TX, 2015.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proc. Conference on Computer Vision and Pattern Recognition, pages 770–778, Las Vegas, NV, 2016.
[14] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. International Conference on Machine Learning, volume 37, pages 448–456, Lille, France, 2015.
[15] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge J. Belongie. Class-Balanced Loss Based on Effective Number of Samples. arXiv, (1901.05555), 2019. 11 pages.
[16] Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
[17] Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor. In Proc. ACM International Conference on Multimedia, pages 1459–1462, Florence, Italy, 2010.
[18] Björn Schuller, Stefan Steidl, Anton Batliner, Julia Hirschberg, Judee K. Burgoon, Alice Baird, Aaron Elkins, Yue Zhang, Eduardo Coutinho, and Keelan Evanini. The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language. In Proc. Interspeech 2016, pages 2001–2005, San Francisco, CA, 2016.
[19] Christopher Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. International Conference on Learning Representations, San Diego, CA, 2015.
[21] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
[22] Ashley McKimm. Call to action for the BMJ Innovations community after COVID-19. BMJ Innovations, 7(1):1–2, 2021.