Benchmarking of eight recurrent neural network variants for breath phase and adventitious sound detection on a self-developed open-access lung sound database-HF_Lung_V1

Abstract

A reliable, remote, and continuous real-time respiratory sound monitor with automated respiratory sound analysis ability is urgently required in many clinical scenarios-such as in monitoring disease progression of coronavirus disease 2019-to replace conventional auscultation with a handheld stethoscope. However, a robust computerized respiratory sound analysis algorithm has not yet been validated in practical applications. In this study, we developed a lung sound database (HF_Lung_V1) comprising 9,765 audio files of lung sounds (duration of 15 s each), 34,095 inhalation labels, 18,349 exhalation labels, 13,883 continuous adventitious sound (CAS) labels (comprising 8,457 wheeze labels, 686 stridor labels, and 4,740 rhonchi labels), and 15,606 discontinuous adventitious sound labels (all crackles). We conducted benchmark tests for long short-term memory (LSTM), gated recurrent unit (GRU), bidirectional LSTM (BiLSTM), bidirectional GRU (BiGRU), convolutional neural network (CNN)-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU models for breath phase detection and adventitious sound detection. We also conducted a performance comparison between the LSTM-based and GRU-based models, between unidirectional and bidirectional models, and between models with and without a CNN. The results revealed that these models exhibited adequate performance in lung sound analysis. The GRU-based models outperformed, in terms of F1 scores and areas under the receiver operating characteristic curves, the LSTM-based models in most of the defined tasks. Furthermore, all bidirectional models outperformed their unidirectional counterparts. Finally, the addition of a CNN improved the accuracy of lung sound analysis, especially in the CAS detection tasks.

Full PDF

11 Benchmarking of eight recurrent neural network variants for breath phase and adventitious sound detection on a self-developed open-access lung sound database—HF_Lung_V1

Fu-Shun Hsu , Shang-Ran Huang , Chien-Wen Huang , Chao-Jung Huang , Yuan-Ren Cheng , Chun-Chieh Chen , Jack Hsiao , Chung-Wei Chen , Li-Chin Chen , Yen-Chun Lai , Bi-Fang Hsu , Nian-Jhen Lin , Wan-Lin Tsai , Yi-Lin Wu , Tzu-Ling Tseng , Ching-Ting Tseng , Yi-Tsun Chen , Feipei Lai Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 10617, Taiwan Department of Critical Care Medicine, Far Eastern Memorial Hospital, New Taipei 22060, Taiwan Heroic Faith Medical Science Co. Ltd., New Taipei 23553, Taiwan Avalanche Computing Inc., Taipei 10687, Taiwan Department of Life Science, College of Life Science, National Taiwan University, Taipei 10617, Taiwan Institute of Biomedical Sciences, Academia Sinica, Taipei 11529, Taiwan HCC Healthcare Group, New Taipei 22060, Taiwan Research Center for Information Technology Innovation, Academia Sinica, Taipei 11529, Taiwan Division of Pulmonary Medicine, Far Eastern Memorial Hospital, New Taipei 22060, Taiwan

Short Title: Automated lung sound analysis database † Corresponding Author Feipei Lai Graduate Institute of Biomedical Electronics and Bioinformatics National Taiwan University No. 1, Sec. 4, Roosevelt Road Taipei 10617, Taiwan Tel: [email protected]

ABSTRACT

A reliable, remote, and continuous real-time respiratory sound monitor with automated respiratory sound analysis ability is urgently required in many clinical scenarios—such as in monitoring disease progression of coronavirus disease 2019—to replace conventional auscultation with a handheld stethoscope. However, a robust computerized respiratory sound analysis algorithm has not yet been validated in practical applications. In this study, we developed a lung sound database (HF_Lung_V1) comprising 9,765 audio files of lung sounds (duration of 15 s each), 34,095 inhalation labels, 18,349 exhalation labels, 13,883 continuous adventitious sound (CAS) labels (comprising 8,457 wheeze labels, 686 stridor labels, and 4,740 rhonchi labels), and 15,606 discontinuous adventitious sound labels (all crackles). We conducted benchmark tests for long short-term memory (LSTM), gated recurrent unit (GRU), bidirectional LSTM (BiLSTM), bidirectional GRU (BiGRU), convolutional neural network (CNN)-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU models for breath phase detection and adventitious sound detection. We also conducted a performance comparison between the LSTM-based and GRU-based models, between unidirectional and bidirectional models, and between models with and without a CNN. The results revealed that these models exhibited adequate performance in lung sound analysis. The GRU-based models outperformed, in terms of F1 scores and areas under the receiver operating characteristic curves, the LSTM-based models in most of the defined tasks. Furthermore, all bidirectional models outperformed their unidirectional counterparts. Finally, the addition of a CNN improved the accuracy of lung sound analysis, especially in the CAS detection tasks. Keywords: adventitious sound, auscultation, convolutional neural networks, lung sound, recurrent neural networks, respiratory monitor Introduction

Respiration is vital for the normal functioning of the human body. Therefore, clinical physicians are frequently required to examine respiratory conditions. Respiratory auscultation (Bohadana et al., 2014; Goettel & Herrmann, 2019; Sarkar et al., 2015) using a stethoscope has long been a crucial first-line physical examination. The chestpiece of a stethoscope is usually placed on a patient’s chest or back for lung sound auscultation or over the patient’s tracheal region for tracheal sound auscultation. During auscultation, breath cycles can be inferred, which help clinical physicians evaluate the patient’s respiratory rate. In addition, pulmonary pathologies are suspected when the frequency or intensity of respiratory sounds changes or when adventitious sounds, including continuous adventitious sounds (CASs) and discontinuous adventitious sounds (DASs), are identified (Bohadana et al., 2014; Goettel & Herrmann, 2019; Pramono et al., 2017). Patients with coronavirus disease 2019 exhibit adventitious sounds (Wang et al., 2020); hence, auscultation may be a useful approach for disease diagnosis (Raj et al., 2020) and disease progression tracking. However, auscultation performed using a conventional handheld stethoscope involves some limitations (Sovijärvi et al., 1997). First, the interpretation of auscultation results substantially depends on the subjectivity of the practitioners. Even experienced clinicians might not have high consensus rates in their interpretations of auscultatory manifestations (Berry et al., 2016; Grunnreis, 2016). Second, auscultation is a qualitative analysis method. Comparing auscultation results between individuals and quantifying the sound change by reviewing historical records are difficult tasks. Third, prolonged continuous monitoring of respiratory sound is almost impractical. To overcome the aforementioned limitations, computerized methods for respiratory sound recording and analyses based on traditional signal processing and machine learning have been proposed and reviewed (Gurung et al., 2011; Huq & Moussavi, 2012; Mesaros et al., 2016; Pasterkamp et al., 1997; Pramono et al., 2017). With the advent of the deep learning era, studies have developed novel deep learning–based methods for respiratory sound analysis. However, many of such studies have focused on only distinguishing healthy participants from participants with respiratory disorders (Chambres et al., 2018; Demir et al., 2020; Hosseini et al., 2020; Perna & Tagarelli, 2019; Pham et al., 2020) and distinguishing various types of normal breathing sounds from adventitious sounds (Acharya & Basu, 2020; Aykanat et al., 2017; Bardou et al., 2018; Chen et al., 2019; Grzywalski et al., 2019; Kochetov et al., 2018; Li et al., 2016). Only a few studies (Hsiao et al., 2020; Jácome et al., 2019; Liu et al., 2017; Messner et al., 2018) have explored the use of deep learning for detecting breath phases and adventitious sounds. Moreover, most previous studies on computerized lung sound analysis have been limited by insufficient data. As of writing this paper, the largest reported respiratory sound database is ICBHI 2017 Challenge (Rocha et al., 2017), which comprises 6,898 breath cycles and 10,775 events of wheezes and crackles acquired from 126 individuals. Data size plays a major role in the creation of a robust and accurate deep learning–based respiratory sound analysis algorithm (Hestness et al., 2017; Sun et al., 2017). Accordingly, the first aim of the present study was to establish a large and open-access respiratory sound database for training such algorithms for the detection of breath phase and adventitious sounds, mainly focusing on lung sounds. The second aim was to conduct a benchmark test on the established lung sound database by using eight recurrent neural network (RNN)-based models. RNNs (Elman, 1990) are effective for time-series analysis; long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) and gated recurrent unit (GRU; Cho et al., 2014) networks, which are two RNN variants, exhibit superior performance to the original RNN model. However, whether LSTM models are superior to GRU models (and vice versa) in many applications, particularly in respiratory sound analysis, is inconclusive. Bidirectional RNN models (Graves & Schmidhuber, 2005; Schuster & Paliwal, 1997) can transfer not only past information to the future but also future information to the past; these models consistently exhibit superior performance to unidirectional RNN models in many applications (Khandelwal et al., 2016; Linchuan Li et al., 2016; Parascandolo et al., 2016) as well as in breath phase and crackle detection (Messner et al., 2018). However, whether bidirectional RNN models outperform unidirectional RNN models in CAS detection has yet to be determined. Furthermore, the convolutional neural network (CNN)–RNN structure has been proven to be suitable for heart sound analysis (Deng et al., 2020), lung sound analysis (Acharya & Basu, 2020), and other tasks (Linchuan Li et al., 2016; Zhao et al., 2018). Nevertheless, the application of the CNN–RNN structure in respiratory sound detection has yet to be fully investigated. Benchmarking can enable demonstrating the reliability and goodness of a database; it can also be applied to investigate the performance of the RNN variants in respiratory analysis. In summary, the aims of this study are outlined as follows: ◼ Establish the largest open-access lung sound database as of writing this paper—HF_Lung_V1 (https://gitlab.com/techsupportHF/HF_Lung_V1). ◼ Conduct a performance comparison between LSTM and GRU models, between unidirectional and bidirectional models, and between models with and without a CNN in breath phase and adventitious sound detection based on lung sound data. ◼ Discuss factors influencing model performance. Establishment of the lung sound database

Data sources and patients

The lung sound database was established using two sources. The first source was a database used in a datathon in Taiwan Smart Emergency and Critical Care (TSECC), 2020, under the license of Creative Commons Attribution 4.0 (CC BY 4.0), provided by the Taiwan Society of Emergency and Critical Care Medicine. Lung sound recordings in the TSECC database were acquired from 261 patients. The second source was sound recordings acquired from 18 residents of a respiratory care ward (RCW) or a respiratory care center (RCC) in Northern Taiwan between August 2018 and October 2019. The recordings were approved by the Research Ethics Review Committee of Far Eastern

Memorial Hospital (case number: 107052-F). This study was conducted in accordance with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. All patients were Taiwanese and aged older than 20 years. Descriptive statistics regarding the patients’ demographic data, major diagnosis, and comorbidities are presented in Table 1; however, information on the patients in the TSECC database is missing. Moreover, all 18 RCW/RCC residents were under mechanical ventilation. Table 1

Demographic data of patients.

Subjects from RCW/RCC (n= 18) TSECC Database (n = 261) Gender (M/F) 11/7 NA Age 67.5 (36.7, 98.3) NA Height (cm) 163.6 (147.2, 180.0) NA Weight (kg) 62.1 (38.2, 86.1) NA BMI (kg/m ) 23.1 (15.6, 30.7) NA Respiratory Diseases ARF 4 (22.2%) NA CRF 8 (44.4%) NA COPD AE 1 (5.6%) NA COPD 2 (11.1%) NA Pneumonia 4 (22.2%) NA ARDS 1 (5.6%) NA Emphysema 1 (5.6%) NA Comorbidity CKD 1 (5.6%) NA AKI 3 (16.7%) NA CHF 2 (11.1%) NA DM 7 (38.9%) NA HTN 6 (33.3%) NA Malignancy 1 (5.6%) NA Arrythmia 1 (5.6%) NA CAD 1 (5.6%) NA RCW: respiratory care ward, RCC: respiratory care center, ARF: acute respiratory failure, CRF: chronic respiratory failure, COPD AE: chronic obstructive pulmonary disease acute exacerbation, COPD: chronic obstructive pulmonary disease, ARDS: acute respiratory distress syndrome, CKD: chronic kidney disease, AKI: acute kidney injury, CHF: chronic heart failure, DM: diabetes, HTN: hypertension, CAD: cardiovascular disease. The mean values of the age, height, weight, and BMI are presented, with the corresponding 95% CI in parentheses. Sound recording

Breathing lung sounds were recorded using two devices: (1) a commercial electronic stethoscope (Littmann 3200, 3M, Saint Paul, Minnesota, USA) and (2) a customized multichannel acoustic recording device (HF-Type-1) that supports the connection of eight electret microphones. The signals collected by the HF-Type-1 device were transmitted to a tablet (Surface Pro 6, Microsoft, Redmond, Washington, USA; Fig. 1). Breathing lung sounds were collected at the eight locations (denoted by L1–L8) indicated in Fig. 2a. The auscultation locations are described in detail in the caption of Fig. 2. The two devices had a sampling rate of 4,000 Hz and a bit depth of 16 bits. The audio files were recorded in the WAVE (.wav) format. All lung sounds in the TSECC database were collected using the Littmann 3200 device only, where 15.8-s recordings were obtained sequentially from L1 to L8 (Fig. 2b; Littmann 3200). One round of recording with the Littmann 3200 device entails a recording of lung sounds from L1 to L8. The TSECC database was composed of data obtained from one to three rounds of recording with the Littmann 3200 device for each patient. We recorded the lung sounds of the 18 RCW/RCC residents by using both the Littmann 3200 device and the HF-Type-1 device. The Littmann 3200 recording protocol was in accordance with that used in the TSECC database, except that data from four to five rounds of lung sound recording were collected instead. The HF-Type-1 device was used to record breath sounds at L1, L2, L4, L5, L6, and L8. One round of recording with the HF-Type-1 device entails a synchronous and continuous recording of breath sounds for 30 min (Fig. 2b; HF-Type-1). However, the recording with the HF-Type-1 device was occasionally interrupted; in this case, the recording duration was <30 min. Voluntary deep breathing was not mandated during the recording of lung sounds. The statistics of the recordings are listed in Table 2. Fig. 1.

Customized multichannel acoustic recording device (HF-Type-1) connected to a tablet. Fig. 2.

Auscultation locations and lung sound recording protocol. (a) Auscultation locations (L1–L8): L1: second intercostal space (ICS) on the right midclavicular line (MCL); L2: fifth ICS on the right MCL; L3: fourth ICS on the right midaxillary line (MAL); L4: tenth ICS on the right MAL; L5: second ICS on the left MCL; L6: fifth ICS on the left MCL; L7: fourth ICS on the left MAL; and L8: tenth ICS on the left MAL. (b) A standard round of breathing lung sound recording with Littmann 3200 and HF-Type-1 devices. The white arrows represent a continuous recording, and the small red blocks represent 15-s recordings. When the Littmann 3200 device was used, 15.8-s signals were recorded sequentially from L1 to L8. Subsequently, all recordings were truncated to 15 s. When the HF-Type-1 device was used, sounds at L1, L2, L4, L5, L6, and L8 were recorded simultaneously. Subsequently, each 2-min signal was truncated to generate new 15-s audio files. Audio file truncation

In this study, the standard duration of an audio signal used for inhalation, exhalation, and adventitious sound detection was 15 s. This duration was selected because a 15-s signal contains at least three complete breath cycles, which are adequate for a clinician to reach a clinical conclusion.

Table 2

Statistics of recordings and labels of HF_Lung_V1 database.

Littmann 3200 HF-Type-1 Total

Subjects n 261 18 261 Recordings Filename prefix steth_ trunc_ NA Rounds of recording 748 70 NA No of 15-sec recordings 4504 5261 9765 Total duration (min) 1126 1315.25 2441.25 Labels No of I 16535 17560 34095 Total duration of I (min) 257.17 271.02 528.19 Mean duration of I (sec) 0.93 0.93 0.93 No of E 9107 9242 18349 Total duration of E (min) 160.25 132.60 292.85 Mean duration of E (sec) 1.06 0.86 0.96 No of C/W/S/R 6984/3974/152/2858 6899/4483/534/1882 13883/8457/686/4740 Total duration of C/W/S/R (min) 105.90/63.92/1.94/40.04 85.26/55.80/7.52/21.94 191.16/119.73/9.46/61.98 Mean duration of C/W/S/R (sec) 0.91/0.97/0.76/0.84 0.74/0.75/0.85/0.70 0.83/0.85/0.83/0.78 No of D 7266 8340 15606 Total duration of D (min) 111.75 55.80 230.87 Mean duration of D (sec) 0.92 0.87 0.89

I: inhalation, E: exhalation, W: wheeze, S: stridor, R: rhonchus, C: continuous adventitious sound, D: discontinuous adventitious sound. W, S, and R were combined to form C. Furthermore, a 15-s breath sound was be used previously for verification and validation (Pasterkamp et al., 2016) . Because each audio file generated by the Littmann 3200 device had a length of 15.8 s, we cropped out the final 0.8-s signal from the files (Fig. 2b; Littmann 3200). Moreover, we used only the first 15 s of each 2-min signal of the audio files (Fig. 2b; HF-Type-1) generated by the HF-Type-1 device. Table 2 presents the number of truncated 15-s recordings and the total duration.

Data labeling

Because the data in the TSECC database contains only classification labels indicating whether a CAS or DAS exists in a recording, we attempted to label the event level of all sound recordings. Two board-certified respiratory therapists (NJL and YLW) and one board-certified nurse (WLT), with 8, 3, and 13 years of clinical experience, respectively, were recruited to label the start and end points of inhalation (I), exhalation (E), wheeze (W), stridor (S), rhonchus (R), and DAS (D) events in the recordings. They labeled the sound events by listening to the recorded breath sounds while simultaneously observing the corresponding patterns on a spectrogram by using customized labeling software (Hsu et al., 2021). The labelers were asked not to label sound events if they could not clearly identify the corresponding sound or if an incomplete event at the beginning or end of an audio file caused difficulty in identification. BFH held regular meetings to ensure that the labelers had good agreement on labeling criteria based on a few samples by judging the mean pseudo-κ value (Jácome et al., 2019). When developing artificial intelligence (AI) detection models, we combined the W, S, and R labels to form CAS labels (C). Moreover, the D labels comprised only crackles, which were not differentiated into coarse or fine crackles. The labelers were asked to label the period containing crackles but not a single explosive sound (generally less than 25 ms) of a crackle. Each recording was annotated by only one labeler; thus, the labels did not represent perfect ground truth. However, we used the labels as ground-truth labels for model training, validation, and testing. The statistics of the labels are listed in Table 2. Inhalation, exhalation, CAS, and DAS detection

Framework

The inhalation, exhalation, CAS, and DAS detection framework developed in this study is displayed in Fig. 3. The prominent advantage of the research framework is its modular design. Specifically, each unit of the framework can be tested separately, and the algorithms in different parts of the framework can be modified to achieve optimal overall performance. Moreover, the output of some blocks can be used for multiple purposes. For instance, the spectrogram generated by the preprocessing block can be used as the input of a model or for visualization in the user interface for real-time monitoring. Fig. 3.

Pipeline of detection framework. The framework comprises three parts: preprocessing, deep learning–based modeling, and postprocessing. The preprocessing part involves signal processing and feature engineering techniques. The deep learning–based modeling part entails the use of a well-designed neural network for obtaining a sequence of classification predictions rather than a single prediction. The postprocessing part involves merging the segment prediction results and eliminating the burst event.

Preprocessing

We processed the lung sound recordings at a sampling frequency of 4 kHz. First, to eliminate the 60-Hz electrical interference and a part of the heart sound noise, we applied a high-pass filter to the recordings by setting a filter order of 10 and cut-off frequency of 80 Hz. The filtered signals were then processed using the short-time Fourier transform (STFT). In the STFT, we set a Hanning window size of 256 and hop length of 64; no additional zero-padding was applied. Thus, a 15-s sound signal could be transformed into a corresponding spectrogram with a size of 938 × 129. To obtain the spectral information regarding the lung sounds, we extracted the following features (Chamberlain et al., 2016; Messner et al., 2018): ◼ Spectrogram: We extracted 129-bin log-magnitude spectrograms. ◼ Mel frequency cepstral coefficients (MFCCs): We extracted 20 static coefﬁcients, 20 delta coefﬁcients (Δ), and 20 acceleration coefﬁcients (Δ ). We used 40 mel bands within a frequency range of 0–4,000 Hz. The frame width used to calculate the delta and acceleration coefﬁcients was set to 9, which resulted in a 60-bin vector per frame. ◼ Energy summation: We computed the energy summation of four frequency bands, namely 0–250, 250–500, 500–1,000, and 0–2,000 Hz, and obtained four values per time frame. After extracting the aforementioned features, we concatenated them to form a 938 × 193 feature matrix. Subsequently, we conducted min–max normalization on each feature. The values of the normalized features ranged between 0 and 1.

Deep learning models

We investigated the performance of eight RNN models, namely LSTM, GRU, bidirectional LSTM (BiLSTM), bidirectional GRU (BiGRU), CNN-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU, in terms of inhalation, exhalation, and adventitious sound detection. Fig. 4 illustrates the detailed model structures. The outputs of the LSTM, GRU, BiLSTM, and BiGRU models were

938 × 1 vectors, and those of the CNN-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU models were 469 × 1 vectors. An element in these vectors was set to 1 if an inhalation, exhalation, CAS, or DAS occurred within a time segment in which the output value passed the thresholding criterion; otherwise, the element was set to 0. For a fairer comparison of the performance of the unidirectional and bidirectional models, we trained additional simplified (SIMP) BiLSTM, SIMP BiGRU, SIMP CNN-BiLSTM, and SIMP CNN-BiGRU models by adjusting the number of trainable parameters. Parameter adjustment was conducted by halving the number of cells of the LSTM and GRU layers. We used Adam as the optimizer in the benchmark model, and we set the initial learning rate to 0.0001 with a step decay (0.2×) when the validation loss did not decrease for 10 epochs. The learning process stopped when no improvement occurred over 50 consecutive epochs. Fig. 4.

Model architectures and postprocessing for inhalation, exhalation, CAS, and DAS segment and event detection. (a) LSTM and GRU models; (b) BiLSTM and BiGRU models; and (c) CNN-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU models. Postprocessing

The prediction vectors obtained using the adopted models can be further processed for different purposes. For example, we can transform the prediction result from frames to time for real-time monitoring. The breathing duration of most humans lies within a certain range; we considered this fact in our study. Accordingly, when the prediction results obtained using the models indicated that two consecutive inhalation events occurred within a very small interval, we checked the continuity of these two events and decided whether to merge them, as illustrated in the bottom panel of Fig. 4a. For example, when the interval between the j th and i th events was smaller than T s, we computed the difference in frequency between their energy peaks ( |𝒑 𝒋 − 𝒑 𝒊 | ). Subsequently, if the difference was below a given threshold P , the two events were merged into a single event. In the experiment, T was set to 0.5 s, and P was set to 25 Hz. After the merging process, we further assessed whether a burst event existed. If the duration of an event was shorter than 0.05 s, the event was deleted. Dataset arrangement and cross-validation

We adopted fivefold cross-validation in the training dataset to train and validate the models. Moreover, we used an independent testing dataset to test the performance of the trained models. According to our preliminary experience, the acoustic patterns of the breath sounds collected from one patient at different auscultation locations or between short intervals had many similarities. To avoid potential data leakage caused by our methods of collecting and truncating the breath sound signals, we assigned all truncated recordings collected on the same day to only one of the training, validation, or testing datasets; this is because these recordings might have been collected from the same patient within a short period. The statistics of the datasets are listed in Table 3. We used only audio files containing CASs and DASs to train and test their corresponding detection models. Table 3

Statistics of the datasets and labels of the HF_Lung_V1 database.

Training Dataset Testing Dataset Total

Recordings No of 15-sec recordings 7809 1956 9765 Total duration (min) 1952.25 489 2441.25 Labels No of I 27223 6872 34095 Total duration of I (min) 422.17 105.97 528.14 Mean duration of I (sec) 0.93 0.93 0.93 No of E 15601 2748 18349 Total duration of E (min) 248.05 44.81 292.85 Mean duration of E (sec) 0.95 0.98 0.96 No of C/W/S/R 11464/7027/657/3780 2419/1430/29/960 13883/8457/686/4740 Total duration of C/W/S/R (min) 160.16/100.71/9.10/50.35 31.01/19.02/0.36/11.63 191.16/119.73/9.46/61.98 Mean duration of C/W/S/R (sec) 0.84/0.86/0.83/0.80 0.77/0.80/0.74/0.73 0.83/0.85/0.83/0.78 No of D 13794 1812 15606 Total duration of D (min) 203.59 27.29 230.87 Mean duration of D (sec) 0.89 0.90 0.89 I: inhalation, E: exhalation, W: wheeze, S: stridor, R: rhonchus, C: continuous adventitious sound, D: discontinuous adventitious sound. W, S, and R were combined to form C. Task definition and evaluation metrics

Pramono et al. (2017) clearly defined classification and detection at the segment, event, and recording levels. In this study, we performed two tasks. The first task involved performing detection at the segment level. The acoustic signal of each lung sound recording was transformed into a spectrogram. The temporal resolution of the spectrogram depended on the window size and overlap ratio of the STFT. The aforementioned parameters were fixed such that each spectrogram was a matrix of size 938 × 129. Thus, each recording contained 938 time segments (time frames), and each time segment was automatically labeled (Fig. 5b) according to the ground-truth event labels (Fig. 5a) assigned by the labelers. The output of the prediction process was a sequential prediction matrix (Fig. 5c) of size 938 × 1 in the LSTM, GRU, BiLSTM, and BiGRU models and size 469 × 1 in the CNN-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU models. By comparing the sequential prediction with the ground-truth time segments, we could define true positive (TP; orange vertical bars in Fig. 5d), true negative (TN; green vertical bars in Fig. 5d), false positive (FP; black vertical bars in Fig. 5d), and false negative (FN; yellow vertical bars in Fig. 5d) time segments. Subsequently, the models’ sensitivity and specificity in classifying the segments in each recording were computed. Fig. 5.

Task definition and evaluation metrics. (a) Ground-truth event labels, (b) ground-truth time segments, (c) AI inference results, (d) segment classification, (e) event detection, and (f) legend. JI: Jaccard index.

The second task entailed event detection at the recording level. After completing the sequential prediction (Fig. 5c), we assembled the time segments associated with the same label into a corresponding event (Fig. 5e). We also derived the start and end times of each assembled event. The Jaccard index (JI; Jácome et al., 2019) was used to determine whether an AI inference result correctly matched the ground-truth event. For an assembled event to be designated as a TP event (orange horizontal bars in Fig. 5e), the corresponding JI value must be greater than 0.5. If the JI was between 0 and 0.5, the assembled event was designated as an FN event (yellow horizontal bars in Fig. 5e), and if it was 0, the assembled event was designated as an FP event (black horizontal bars in Fig. 5e). A TN event cannot be defined in the task of event detection. The performance of the models was evaluated using the F1 score, and that of segment detection was evaluated using the receiver operating characteristic (ROC) curve and area under the ROC curve (AUC). In addition, the mean absolute percentage error (MAPE) of event detection was derived. The accuracy, positive predictive value (PPV), sensitivity, specificity, and F1 score of the models are presented in Appendix A. Hardware and software

We trained the baseline models on an Ubuntu 18.04 server that was provided by the National Center for High-Performance Computing in Taiwan [Taiwan Computing Cloud (TWCC)] and was equipped with an Intel(R) Xeon(R) Gold 6154 @3.00 GHz CPU with 90 GB RAM. To manage the intensive computation involved in RNN training, we implemented the training module by using the TensorFlow 2.10, CUDA 10, and CuDNN 7 programs to run the NVIDIA Titan V100 card on the TWCC server for GPU acceleration. Results

LSTM versus GRU models

Table 4 presents the F1 scores used to compare the eight LSTM- and GRU-based models. When a CNN was not added, the GRU models outperformed the LSTM models by 0.7%–9.5% in terms of the F1 scores. However, the CNN-GRU and CNN-BiGRU models did not outperform the CNN-LSTM and CNN-BiLSTM models in terms of the F1 scores (and vice versa). According to the ROC curves presented in Fig. 6a–d, the GRU-based models outperformed the LSTM-based models in all compared pairs, except for one pair, in terms of DAS segment detection (AUC of 0.891 for CNN-BiLSTM vs 0.889 for CNN-BiGRU). Table 4

Comparison of F1 scores between LSTM-based models and GRU-based models. Inhalation

Exhalation

CASs

DASs Models n of trainable parameters F1 score F1 score F1 score F1 score Segment Detection Event Detection Segment Detection Event Detection Segment Detection Event Detection Segment Detection Event Detection LSTM 300,609 73.9% 76.1% 51.8% 57.0% 15.1% 12.2% 62.6% 59.1% GRU 227,265

BiLSTM 732,225 78.1% 84.0% 57.3% 63.9% 19.8% 19.1% 69.6% 70.0% BiGRU 552,769

CNN-LSTM 3,448,513 77.6% 81.1%

CNN-BiLSTM 6,959,809

CNN-BiGRU 5,240,513

The bold values indicate the higher F1 score between the compared pairs of models. Fig. 6.

ROC curves for (a) inhalation, (b) exhalation, (c) CAS, and (d) DAS segment detection. The corresponding AUC values are presented. Unidirectional versus bidirectional models

As presented in Table 5, the bidirectional models outperformed their unidirectional counterparts in all the defined tasks by 0.4%–9.8% in terms of the F1 scores, even when the bidirectional models had fewer trainable parameters after model adjustment. Models with CNN versus those without CNN

According to Table 6, the models with a CNN outperformed those without a CNN in 26 of the 32 compared pairs. The models with a CNN exhibited higher AUC values than did those without a CNN (Fig. 6a–d),

Table 5

Comparison of F1 scores between the unidirectional and bidirectional models. Inhalation

Exhalation

CASs

GRU 227,265 76.2% 78.9% 59.8% 65.6% 24.6% 20.1% 65.9% 62.5% SIMP BiGRU 178,113

CNN-LSTM 3,448,513 77.6% 81.1% 57.7% 62.1% 45.3% 42.5% 68.8% 64.4% SIMP CNN-BiLSTM 3,382,977

CNN-GRU 2,605,249 78.4% 82.0% 57.2% 62.0% 51.5% 49.8% 68.0% 64.6% SIMP CNN-BiGRU 2,556,097

The bold values indicate the higher F1 score between the compared pairs of models. SIMP means the number of trainable parameters is adjusted. except that BiGRU had a higher AUC value than did CNN-BiGRU in terms of inhalation detection (0.963 vs 0.961), GRU had a higher AUC value than did CNN-GRU in terms of exhalation detection (0.886 vs 0.883), and BiGRU had a higher AUC value than did CNN-BiGRU in terms of exhalation detection (0.911 vs 0.899). Moreover, compared with the LSTM, GRU, BiLSTM, and BiGRU models, the CNN-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU models exhibited flatter and lower MAPE curves over a wide range of threshold values in all event detection tasks (Fig. 7a–d). Table 6

Comparison of F1 scores between models without and with a CNN. Inhalation

Exhalation

CASs

BiLSTM 732,225 76.2% 78.9%

GRU 227,265 78.1% 84.0% 57.3% 63.9% 24.60% 20.10% 65.90% 62.50% CNN-GRU 2,605,249

BiGRU 178,113 80.3%

CNN-BiGRU 2,556,097

The bold values indicate the higher F1 score between the compared pairs of models. Fig. 7.

MAPE curves for (a) inhalation, (b) exhalation, (c) CAS, and (d) DAS event detection. Discussion

Benchmark results

According to the F1 scores presented in Table 4, among models without a CNN, the GRU and BiGRU models consistently outperformed the LSTM and BiLSTM models in all defined tasks. However, the GRU-based models did not have superior F1 scores among models with a CNN. Regarding the ROC curves and AUC values (Fig. 6a–d), the GRU-based models consistently outperformed the other models in all but one task. Accordingly, we can conclude that GRU-based models perform slightly better than LSTM-based models in lung sound analysis. Previous studies have also compared LSTM- and GRU-based models (Chung et al., 2014; Khandelwal et al., 2016; Shewalkar, 2018). Although a concrete conclusion cannot be drawn regarding whether LSTM-based models are superior to the GRU-based models (and vice versa), GRU-based models have been reported to outperform LSTM-based models in terms of computation time (Khandelwal et al., 2016; Shewalkar, 2018). As presented in Table 5, the bidirectional models outperformed their unidirectional counterparts in all defined tasks, a finding that is consistent with several previously obtained results (Graves & Schmidhuber, 2005; Khandelwal et al., 2016; Messner et al., 2018; Parascandolo et al., 2016). A CNN can facilitate the extraction of useful features and enhance the prediction accuracy of RNN-based models. The benefits engendered by a CNN are particularly vital in CAS detection. For the models with a CNN, the F1 score improvement ranged from 26.0% to 30.3% and the AUC improvement ranged from 0.067 to 0.089 in the CAS detection tasks. Accordingly, we can infer that considerable information used in CAS detection resides in the local positional arrangement of the features. Thus, a two-dimensional CNN facilitates the extraction of the associated information. Notably, CNN-induced improvements in model performance in the inhalation, exhalation, and DAS detection tasks were not as high as those observed in the CAS detection tasks. The MAPE curves (Fig. 7a–d) reveal that a model with a CNN has more consistent predictions over various threshold values. In our previous study (Hsiao et al., 2020), an attention-based encoder–decoder architecture based on ResNet and LSTM exhibited favorable performance in inhalation ( F1 score of 90.4%) and exhalation ( F1 score of 93.2%) segment detection tasks. However, the model was established on the basis of a very small dataset (489 recordings of 15-s-long lung sounds). Moreover, the model involves a complicated architecture; hence, it is impossible to implement real-time respiratory monitoring in devices with limited computing power, such as smartphones or medical-grade tablets. Few studies have performed event detection at the recording level by using a comparatively simple deep learning model. Messner et al. (2018) used the BiGRU model and one-dimensional labels (similar to those used in the present study) for breath phase and crackle detection. Their BiGRU model exhibited comparable performance to our models in terms of inhalation event detection ( F1 scores, 87.0% vs 86.2%) and in terms of DAS event detection ( F1 scores, 72.1% vs 71.4%). However, the performance of the BiGRU model differed considerably from that of our models in terms of exhalation detection ( F1 scores: 84.6% vs 70.9%). One of the reasons for this discrepancy is that Messner et al. (2018) established their ground-truth labels on the basis of the gold-standard signals of a pneumotachograph. Another reason is that an exhalation label is not always available following an inhalation label in our data. Finally, we did not specifically control the sounds we recorded; for example, we did not ask patients to perform voluntary deep breathing or keep ambient noise down. The factors influencing the model performance are further discussed in the next section. Factors influencing model performance

The benchmark performance of the proposed models may have been influenced by the following factors: (1) unusual breathing patterns; (2) imbalanced data; (3) low signal-to-noise ratio (SNR); (4) noisy labels, including class and attribute noise, in the database; and (5) sound overlapping. Fig. 8 displays most of the breath patterns present in the HF_Lung_V1 database. Fig. 8a illustrates the general pattern of a breath cycle in the lung sounds when the ratio of inhalation to exhalation durations is approximately 2:1 and an expiratory pause is noted (Pramono et al., 2017; Sarkar et al., 2015). Fig. 8b presents a frequent condition under which an exhalation is not completely heard by the labelers. However, because we did not ask the subjects to breath voluntarily when recording the sound, many unusual breath patterns might have been recorded, such as patterns caused by shallow breathing, fast breathing, and apnea as well as those caused by double triggering of the ventilator (Thille et al., 2006) and air trapping (Blanch et al., 2005; Miller et al., 2014). These unusual breathing patterns might confuse the labeling and learning processes and result in poor testing results. Fig. 8.

Patterns of normal breathing lung sounds. (a) General lung sound patterns and (b) general lung sound patterns with unidentifiable exhalations. “I” represents an identifiable inhalation event, “E” represents an identifiable exhalation event, and the black areas represent pause phases. The developed database contains imbalanced numbers of inhalation and exhalation labels (34,095 and 18,349, respectively) because not every exhalation was heard and labeled. In addition, the proposed models may possess the capability of learning the rhythmic rise and fall of breathing signals but not the capability of learning acoustic or texture features that can distinguish an inhalation from an exhalation. This may thus explain the models’ poor performance in exhalation detection. However, these models are suitable for respiratory rate estimation and apnea detection as long as appropriate inhalation detection is achieved. Furthermore, for all labels, the summation of the event duration was smaller than that of the background signal duration (these factors had a ratio of approximately 1:2.5 to 1:7). The aforementioned phenomenon can be regarded as foreground–background class imbalance (Oksuz et al., 2020) and will be addressed in future studies. Most of the sounds in the established database were not recorded during the patients performed deep breathing; thus, the signal quality was not maximized. However, training models with such nonoptimal data increase their adaptability to real-world scenarios. Moreover, the SNR may be reduced by noise, such as human voices; music; sounds from bedside monitors, televisions, air conditioners, fans, and radios; sounds generated by mechanical ventilators; electrical noise generated by touching or moving the parts of acoustic sensors; and friction sounds generated by the rubbing of two surfaces together (e.g., rubbing clothes with the skin). A poor SNR of audio signals can lead to difficulties in labeling and prediction tasks. The features of some noise types are considerably similar to those of adventitious sounds. The poor performance of the proposed models in CAS detection can be partly attributed to the noisy environment in which the lung sounds were recorded. In particular, the sounds generated by ventilators caused numerous FP events in the CAS detection tasks. Thus, additional effort is required to develop a superior preprocessing algorithm that can filter out influential noise or to identify a strategy to ensure that models focus on learning the correct CAS features. Furthermore, the integration of active noise-canceling technology (Wu et al., 2020) or noise suppression technology (Emmanouilidou et al., 2017) into respiratory sound monitors can help reduce the noise from auscultatory signals. The sound recordings in the HF_Lung_V1 database were labeled by only one labeler; thus, some noisy labels, including class and attribute noise, may exist in the database (Zhu & Wu, 2004). These noisy labels are attributable to (1) the different hearing abilities of the labeler, which can cause differences in the labeled duration; (2) the absence of clear criteria for differentiating between target and confusing events; (3) individual human errors; (4) tendency to not label events located close to the beginning and end of a recording; and (5) confusion caused by unusual breath patterns and poor SNRs. However, deep learning models exhibit high robustness to noisy labels (Rolnick et al., 2017). Accordingly, we are currently working toward establishing better ground-truth labels. Breathing generates CASs and DASs under abnormal respiratory conditions. This means that the breathing sound, CAS, and DAS might overlap with one another during the same period. This sound overlapping, along with the data imbalance, makes the CAS and DAS detection models learn to read the rise and fall of the breathing energy and falsely identify an inhalation or exhalation as CAS or DAS, respectively. This FP detection was observed in our benchmark results. In the future, strategies must be adopted to address the problem of sound overlap. Conclusion

We established a large open-access lung sound database, namely HF_Lung_V1 (https://gitlab.com/techsupportHF/HF_Lung_V1), that contains 9,765 audio files of lung sounds (each with a duration of 15 s), 34,095 inhalation labels, 18,349 exhalation labels, 13,883 CAS labels (comprising 8,457 wheeze labels, 686 stridor labels, and 4,740 rhonchus labels), and 15,606 DAS labels (all of which are crackles). We also investigated the performance of eight RNN-based models in terms of inhalation, exhalation, CAS detection, and DAS detection in the HF_Lung_V1 database. We determined that the bidirectional models outperformed the unidirectional models in lung sound analysis. Furthermore, the addition of a CNN to these models further improved their performance. Future studies can develop more accurate respiratory sound analysis models. First, highly accurate ground-truth labels should be established. Second, researchers should investigate the performance of RNN-based models containing state-of-the-art convolutional layers. Third, regional CNN variants can be adopted in lung sound analysis if the labels are expanded to two-dimensional bounding boxes (Jácome et al., 2019). Fourth, wavelet-based approaches, empirical mode decomposition, and other methods that can extract different features should be investigated (Pramono et al., 2017; Pramono et al., 2019). Finally, respiratory sound monitors should be equipped with the capability of tracheal breath sound analysis (Wu et al., 2020). Author Contributions

YCL, NJL, YLW, and BFH collected the breath sounds. NJL, YLW, and WLT established the labels. BFH and ZLT organized the labels. JH and CWC helped recruit the study participants. FSH, SRH, YRC, and FPL conceptualized the research. CWH and CCC trained the deep learning models. FSH, SRH, LCC, YTC, and CTT contributed to the final manuscript. FPL supervised the research.

Acknowledgments

This study was partially funded by the Raising Children Medical Foundation, Taiwan. The authors thank the employees of Heroic Faith Medical Science Co. Ltd. who have ever partially contributed to developing the HF-Type-1 and establishing the HF_Lung_V1 database. This manuscript was edited by Wallace Academic Editing. We also thank the All Vista Healthcare Center, Ministry of Science and Technology, Taiwan for the support.

Conflicts of Interest

The authors declare that they have no conflicts of interest relevant to this research. References

Acharya, J., & Basu, A. (2020). Deep Neural Network for Respiratory Sound Classification in Wearable Devices Enabled by Patient Specific Model Tuning.

IEEE transactions on biomedical circuits and systems, 14 , 535-544. Aykanat, M., Kılıç, Ö., Kurt, B., & Saryal, S. (2017). Classification of lung sounds using convolutional neural networks.

EURASIP Journal on Image and Video Processing, 2017 . Bardou, D., Zhang, K., & Ahmad, S. M. (2018). Lung sounds classification using convolutional neural networks.

Artif Intell Med, 88 , 58-69. Berry, M. P., Martí, J.-D., & Ntoumenopoulos, G. (2016). Inter-rater agreement of auscultation, palpable fremitus, and ventilator waveform sawtooth patterns between clinicians.

Respiratory care, 61 , 1374-1383. Blanch, L., Bernabé, F., & Lucangelo, U. (2005). Measurement of air trapping, intrinsic positive end-expiratory pressure, and dynamic hyperinflation in mechanically ventilated patients.

Respiratory care, 50 , 110-124. Bohadana, A., Izbicki, G., & Kraman, S. S. (2014). Fundamentals of lung auscultation.

New England Journal of Medicine, 370 , 744-751. Chamberlain, D., Kodgule, R., Ganelin, D., Miglani, V., & Fletcher, R. R. (2016). Application of semi-supervised deep learning to lung sound analysis. In (pp. 804-807): IEEE. Chambres, G., Hanna, P., & Desainte-Catherine, M. (2018). Automatic detection of patient with respiratory diseases using lung sound analysis. In (pp. 1-6): IEEE. Chen, H., Yuan, X., Pei, Z., Li, M., & Li, J. (2019). Triple-Classification of Respiratory Sounds Using Optimized S-Transform and Deep Residual Networks.

IEEE Access, 7 , 32845-32852. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 . Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 . Demir, F., Sengur, A., & Bajaj, V. (2020). Convolutional neural networks based efficient approach for classification of lung diseases.

Health Inf Sci Syst, 8 , 4. Deng, M., Meng, T., Cao, J., Wang, S., Zhang, J., & Fan, H. (2020). Heart sound classification based on improved MFCC features and convolutional recurrent neural networks.

Neural Networks . Elman, J. L. (1990). Finding structure in time.

Cognitive science, 14 , 179-211. Emmanouilidou, D., McCollum, E. D., Park, D. E., & Elhilali, M. (2017). Computerized lung sound screening for pediatric auscultation in noisy field environments.

IEEE Transactions on Biomedical Engineering, 65 , 1564-1574. Goettel, N., & Herrmann, M. J. (2019). Breath Sounds: From Basic Science to Clinical Practice.

Anesthesia & Analgesia, 128 , e42. Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures.

Neural Networks, 18 , 602-610. Grunnreis, F. O. (2016).

Intra-and interobserver variation in lung sound classification. Effect of training.

UiT Norges arktiske universitet. Grzywalski, T., Piecuch, M., Szajek, M., Breborowicz, A., Hafke-Dys, H., Kocinski, J., Pastusiak, A., & Belluzzo, R. (2019). Practical implementation of artificial intelligence algorithms in pulmonary auscultation examination.

Eur J Pediatr, 178 , 883-890. Gurung, A., Scrafford, C. G., Tielsch, J. M., Levine, O. S., & Checkley, W. (2011). Computerized lung sound analysis as diagnostic aid for the detection of abnormal lung sounds: a systematic review and meta-analysis.

Respiratory medicine, 105 , 1396-1403. Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M., Ali, M., Yang, Y., & Zhou, Y. (2017). Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409 . Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.

Neural computation, 9 , 1735-1780. Hosseini, M., Ren, H., Rashid, H.-A., Mazumder, A. N., Prakash, B., & Mohsenin, T. (2020). Neural Networks for Pulmonary Disease Diagnosis using Auditory and Demographic Information. arXiv preprint arXiv:2011.13194 . Hsiao, C.-H., Lin, T.-W., Lin, C.-W., Hsu, F.-S., Lin, F. Y.-S., Chen, C.-W., & Chung, C.-M. (2020). Breathing Sound Segmentation and Detection Using Transfer Learning Techniques on an Attention-Based Encoder-Decoder Architecture. In (pp. 754-759): IEEE. Hsu, F.-S., Huang, C.-J., Kuo, C.-Y., Huang, S.-R., Cheng, Y.-R., Wang, J.-H., Wu, Y.-L., Tzeng, T.-L., & Lai, F. (2021). Development of a respiratory sound labeling software for training a deep learning-based respiratory sound analysis model. arXiv preprint arXiv:2101.01352 . Huq, S., & Moussavi, Z. (2012). Acoustic breath-phase detection using tracheal breath sounds.

Medical & biological engineering & computing, 50 , 297-308. Jácome, C., Ravn, J., Holsbø, E., Aviles-Solis, J. C., Melbye, H., & Ailo Bongo, L. (2019). Convolutional neural network for breathing phase detection in lung sounds.

Sensors, 19 , 1798. Khandelwal, S., Lecouteux, B., & Besacier, L. (2016). Comparing GRU and LSTM for automatic speech recognition. Kochetov, K., Putin, E., Balashov, M., Filchenkov, A., & Shalyto, A. (2018). Noise Masking Recurrent Neural Network for Respiratory Sound Classification. In

Artificial Neural Networks and Machine Learning – ICANN 2018 (pp. 208-217). Li, L., Wu, Z., Xu, M., Meng, H. M., & Cai, L. (2016). Combining CNN and BLSTM to Extract Textual and Acoustic Features for Recognizing Stances in Mandarin Ideological Debate Competition. In

INTERSPEECH (pp. 1392-1396). Li, L., Xu, W., Hong, Q., Tong, F., & Wu, J. (2016). Classification between normal and adventitious lung sounds using deep neural network. In (pp. 1-5): IEEE. Liu, Y., Lin, Y., Gao, S., Zhang, H., Wang, Z., Gao, Y., & Chen, G. (2017). Respiratory sounds feature learning with deep convolutional neural networks. In (pp. 170-177): IEEE. Mesaros, A., Heittola, T., & Virtanen, T. (2016). Metrics for polyphonic sound event detection.

Applied Sciences, 6 , 162. Messner, E., Fediuk, M., Swatek, P., Scheidl, S., Smolle-Juttner, F.-M., Olschewski, H., & Pernkopf, F. (2018). Crackle and breathing phase detection in lung sounds with deep bidirectional gated recurrent neural networks. In (pp. 356-359): IEEE. Miller, W. T., Chatzkel, J., & Hewitt, M. G. (2014). Expiratory air trapping on thoracic computed tomography. A diagnostic subclassification.

Annals of the American Thoracic Society, 11 , 874-881. Oksuz, K., Cam, B. C., Kalkan, S., & Akbas, E. (2020). Imbalance problems in object detection: A review.

IEEE Transactions on Pattern Analysis and Machine Intelligence . Parascandolo, G., Huttunen, H., & Virtanen, T. (2016). Recurrent neural networks for polyphonic sound event detection in real life recordings. In (pp. 6440-6444): IEEE. Pasterkamp, H., Brand, P. L., Everard, M., Garcia-Marcos, L., Melbye, H., & Priftis, K. N. (2016). Towards the standardisation of lung sound nomenclature.

European Respiratory Journal, 47 , 724-732. Pasterkamp, H., Kraman, S. S., & Wodicka, G. R. (1997). Respiratory sounds: advances beyond the stethoscope.

American journal of respiratory and critical care medicine, 156 , 974-987. Perna, D., & Tagarelli, A. (2019). Deep Auscultation: Predicting Respiratory Anomalies and Diseases via Recurrent Neural Networks. In (pp. 50-55). Pham, L., McLoughlin, I., Phan, H., Tran, M., Nguyen, T., & Palaniappan, R. (2020). Robust Deep Learning Framework For Predicting Respiratory Anomalies and Diseases. arXiv preprint arXiv:2002.03894 . Pramono, R. X. A., Bowyer, S., & Rodriguez-Villegas, E. (2017). Automatic adventitious respiratory sound analysis: A systematic review.

PloS one, 12 , e0177926. Pramono, R. X. A., Imtiaz, S. A., & Rodriguez-Villegas, E. (2019). Evaluation of features for classification of wheezes and normal respiratory sounds.

PloS one, 14 , e0213659. Raj, V., Renjini, A., Swapna, M., Sreejyothi, S., & Sankararaman, S. (2020). Nonlinear time series and principal component analyses: Potential diagnostic tools for COVID-19 auscultation.

Chaos, Solitons & Fractals, 140 , 110246. Rocha, B., Filos, D., Mendes, L., Vogiatzis, I., Perantoni, E., Kaimakamis, E., Natsiavas, P., Oliveira, A., Jácome, C., & Marques, A. (2017). Α respiratory sound database for the development of automated classification. In

International Conference on Biomedical and Health Informatics (pp. 33-37): Springer. Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694 . Sarkar, M., Madabhavi, I., Niranjan, N., & Dogra, M. (2015). Auscultation of the respiratory system.

Annals of thoracic medicine, 10 , 158. Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks.

IEEE transactions on Signal Processing, 45 , 2673-2681. Shewalkar, A. N. (2018). Comparison of rnn, lstm and gru on speech recognition data. Sovijärvi, A., Vanderschoot, J., & Earis, J. (1997). Standardization of computerized respiratory sound analysis.

Crit Care Med, 156 , 974-987. Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In

Proceedings of the IEEE international conference on computer vision (pp. 843-852). Thille, A. W., Rodriguez, P., Cabello, B., Lellouche, F., & Brochard, L. (2006). Patient-ventilator asynchrony during assisted mechanical ventilation.

Intensive care medicine, 32 , 1515-1522. Wang, B., Liu, Y., Wang, Y., Yin, W., Liu, T., Liu, D., Li, D., Feng, M., Zhang, Y., & Liang, Z. (2020). Characteristics of Pulmonary auscultation in patients with 2019 novel coronavirus in china. Wu, Y., Liu, J., He, B., Zhang, X., & Yu, L. (2020). Adaptive Filtering Improved Apnea Detection Performance Using Tracheal Sounds in Noisy Environment: A Simulation Study.

BioMed Research International, 2020 . Zhao, H., Zarar, S., Tashev, I., & Lee, C.-H. (2018). Convolutional-recurrent neural networks for speech enhancement. In (pp. 2401-2405): IEEE. Zhu, X., & Wu, X. (2004). Class noise vs. attribute noise: A quantitative study.

Artificial intelligence review, 22 , 177-210. Appendix A