An Update of a Progressively Expanded Database for Automated Lung Sound Analysis
Fu-Shun Hsu, Shang-Ran Huang, Chien-Wen Huang, Yuan-Ren Cheng, Chun-Chieh Chen, Jack Hsiao, Chung-Wei Chen, Feipei Lai
Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 10617, Taiwan
Department of Critical Care Medicine, Far Eastern Memorial Hospital, New Taipei 22060, Taiwan
Heroic Faith Medical Science Co. Ltd., New Taipei 23553, Taiwan
Avalanche Computing Inc., Taipei 10687, Taiwan
Department of Life Science, College of Life Science, National Taiwan University, Taipei 10617, Taiwan
Institute of Biomedical Sciences, Academia Sinica, Taipei 11529, Taiwan
HCC Healthcare Group, New Taipei 22060, Taiwan
Research Center for Information Technology Innovation, Academia Sinica, Taipei 11529, Taiwan
Division of Pulmonary Medicine, Far Eastern Memorial Hospital, New Taipei 22060, Taiwan
Short Title: An update of a lung sound database † Corresponding Author Feipei Lai Graduate Institute of Biomedical Electronics and Bioinformatics National Taiwan University No. 1, Sec. 4, Roosevelt Road Taipei 10617, Taiwan Tel: [email protected]
Fu-Shun Hsu, Shang-Ran Huang, Chien-Wen Huang, Yuan-Ren Cheng, Chun-Chieh Chen, Jack Hsiao, Chung-Wei Chen, and Feipei Lai, Senior Member, IEEE
Abstract
A continuous, real-time automated respiratory sound analysis system is needed in clinical practice. Previously, we established an open-access lung sound database, HF_Lung_V1, and automated lung sound analysis algorithms capable of detecting inhalation, exhalation, continuous adventitious sounds (CASs), and discontinuous adventitious sounds (DASs). In this study, HF_Lung_V1 was expanded to HF_Lung_V2, increasing the number of audio files by a factor of 1.45. A convolutional neural network (CNN)-bidirectional gated recurrent unit (BiGRU) model was trained separately with the training datasets of HF_Lung_V1 (V1_Train) and HF_Lung_V2 (V2_Train), and the two trained models were compared in segment detection and event detection on the test datasets of both HF_Lung_V1 (V1_Test) and HF_Lung_V2 (V2_Test). Segment detection performance was measured by accuracy, positive predictive value (PPV), sensitivity, specificity, F1 score, receiver operating characteristic (ROC) curve, and area under the curve (AUC), whereas event detection performance was evaluated with PPV, sensitivity, and F1 score. The results indicate that the model trained with V2_Train performed better on both V1_Test and V2_Test in inhalation, CAS, and DAS detection, particularly CAS detection, and on V1_Test in exhalation detection.
I.
INTRODUCTION Respiration is vital for survival. Changes in the frequency or intensity of respiratory sounds and the presence of continuous adventitious sounds (CASs) and discontinuous adventitious sounds (DASs) are associated with pulmonary disorders [1-2]. Wheezes (W), stridor (S), and rhonchi (R) are classified as CASs, and crackles and pleural friction rubs are recognized as DASs [1-2]. Automated detection of adventitious breath sounds can promptly alert clinicians so that medical decisions are made in time. A systematic review of automatic adventitious respiratory sound analysis covered studies up to 2016 [3], and further studies have followed. An automated lung sound analysis system achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.86 for wheeze classification and an AUC of 0.74 for crackle classification [4]. Another automated system for classifying wheeze, crackle, and normal sounds reported accuracy, sensitivity, and specificity of up to 98.79%, 96.27%, and 100%, respectively [5]. For breath phase detection, an automated system was reported to have an average sensitivity and specificity of 97% and 84%, respectively [6]. An event-based automated breath phase and crackle detection system reported an F1 score of around 86% for breathing phase detection and around 72% for crackle detection [7]. However, the aforementioned studies mostly focused on classification only and were limited by small data size. In addition, no reported system could detect inhalation (I), exhalation (E), CASs (C), and DASs (D) at the same time. Our goal has been to build respiration-related sound databases and to further develop automated detection systems. Respiratory sound labeling software [8] was developed to establish an open-access lung sound database, HF_Lung_V1 (Lung_V1) (https://gitlab.com/techsupportHF/HF_Lung_V1) [9]. Several variants of recurrent neural networks (RNNs) were used to benchmark Lung_V1.
Results indicated the potential of using deep learning for automated I, E, C, and D detection. Because the performance of deep learning has been shown to improve with data size [10], it was worthwhile to keep collecting and labeling lung sound files. Hence, in this paper, we report the expansion of Lung_V1 to HF_Lung_V2 (Lung_V2) with more lung sound files and corresponding labels. Moreover, we investigated whether the detection models for I, E, C, and D trained with Lung_V2 improved in performance as the data size increased.
II.
MATERIALS AND METHODS The lung sound database Lung_V2 is an incremental expansion of the previous Lung_V1. Lung sounds were collected from August 15, 2018 to October 8, 2019 to build Lung_V1, and collection was extended to December 3, 2019 to build Lung_V2. A commercial electronic stethoscope, Littmann 3200 (Littmann) (3M, Saint Paul, Minnesota, USA), and a custom multichannel sound recording device, HF-Type-1 (Type-1) (Heroic Faith Medical Science, New Taipei City, Taiwan), were used to record lung sounds [9]. The protocol was approved by the Research Ethics Review Committee of Far Eastern Memorial Hospital (case number: 107052-F). Littmann could record breath sounds at only one location at a time. Therefore, a full round of Littmann recording was conducted sequentially at eight locations, namely, the second intercostal space (ICS) in the right and left midclavicular lines (MCLs), the fifth ICS in the right and left MCLs, the fourth ICS in the right and left midaxillary lines (MALs), and the 10th ICS in the right and left MALs. In contrast, Type-1 was used to record lung sounds simultaneously from six locations (the aforesaid locations except the fourth ICS in the right and left MALs) by multichannel acoustic recording [9]. A complete round of Type-1 recording contained a 30-minute continuous signal from each of these six locations. We collected more lung sounds from 10 residents of a respiratory care ward (RCW) or a respiratory care center (RCC) who were under long-term mechanical ventilation support; they were recorded with Littmann for 4-5 rounds and with Type-1 for 3-4 rounds. Additionally, the lung sounds of another 22 inpatients with apparent adventitious sounds in Far Eastern Memorial Hospital were collected with Littmann alone for 1-3 rounds. All participants were Taiwanese and older than 20 years. The sampling rate of both recording devices was 4,000 Hz and the bit depth was 16 bits. The length of audio files recorded by Littmann was 15.8 seconds.
Thus, the terminal 0.8 seconds were deleted to make each audio file 15 seconds long. As for the audio files recorded by Type-1, the first 15 seconds of every 2-minute signal were extracted for subsequent analysis. Two board-certified respiratory therapists with 8 and 4 years of clinical experience and one board-certified nurse with 13 years of clinical experience did the labeling. Each lung sound file was labeled by only one labeler, although regular consensus meetings were held so that the labelers applied the same labeling criteria. Self-developed software was used to label I, E, W, S, R, and D [8]. Labels of W, S, and R were combined to form C, whereas D labels contained all types of crackles but not pleural friction rubs. Please refer to [9] for the detailed recording protocol, data preparation, and labeling.
The acoustic patterns of breath sounds collected from one subject at different auscultation locations, or within short time intervals, bear many similarities. Therefore, to avoid potential data leakage, all audio files from a given subject were assigned together, at random, to either the training or the test dataset. The ratio of training to test dataset was intentionally maintained close to 4:1 based on the number of recordings. Five-fold cross validation was used in the training dataset. V1_Train and V1_Test are subsets of V2_Train and V2_Test, respectively.
A convolutional neural network (CNN)-bidirectional gated recurrent unit (BiGRU) model, presented in Fig. 1, outperformed all the other benchmark models in our previous study [9]. Therefore, in this study, the same CNN-BiGRU model was used to test Lung_V2.
Figure 1. Architecture of the CNN-BiGRU model.
The pipeline of preprocessing, deep learning process, and postprocessing is displayed in Fig. 2 and is the same as described previously [9]. The obtained signal was first processed with a high-pass filter with a cut-off frequency of 80 Hz.
Then, the spectrogram, mel frequency cepstral coefficients (MFCCs) [7], and energy summation were calculated from the filtered signal, normalized, and sent into the CNN-BiGRU model as input. The spectrogram was computed using the short-time Fourier transform with a Hanning window of size 256, a hop length of 64, and no zero-padding. The MFCCs included 20 static coefficients, 20 delta coefficients, and 20 acceleration coefficients. The energy summation is the summed energy in four frequency bands, namely, 0-250, 250-500, 500-1,000, and 0-2,000 Hz. The output of the CNN-BiGRU model was a 469 x 1 vector. An element in the vector was set to 1 if the output value passed a thresholding criterion; otherwise, the element was set to 0. A value of 1 indicated that an I, E, C, or D was detected in the corresponding time frame (segment). After the results of segment detection were obtained, the vector was sent to postprocessing, which merged neighboring segments and removed burst events to generate the results of event detection [9]. The performance of segment detection was measured by accuracy, PPV, sensitivity, specificity, F1 score, ROC curve, and AUC, whereas event detection was evaluated with PPV, sensitivity, and F1 score.
Figure 2. Pipeline of preprocessing, deep learning inference, and postprocessing.
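The thresholding and postprocessing steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [9]: the threshold value of 0.5, the parameters `max_gap` and `min_len`, and all function names are hypothetical and chosen only for the example.

```python
from typing import List, Tuple


def threshold(probs: List[float], thr: float = 0.5) -> List[int]:
    """Binarize per-frame model outputs; thr = 0.5 is an assumed value."""
    return [1 if p >= thr else 0 for p in probs]


def to_events(segments: List[int]) -> List[Tuple[int, int]]:
    """Group consecutive 1s into (start, end) frame indices, end exclusive."""
    events = []
    start = None
    for i, s in enumerate(segments):
        if s == 1 and start is None:
            start = i                      # a run of positive segments begins
        elif s == 0 and start is not None:
            events.append((start, i))      # the run ends before frame i
            start = None
    if start is not None:
        events.append((start, len(segments)))
    return events


def postprocess(segments: List[int], max_gap: int = 2,
                min_len: int = 3) -> List[Tuple[int, int]]:
    """Merge events separated by at most max_gap frames, then drop
    events shorter than min_len frames (treated here as bursts).
    Both parameters are illustrative assumptions."""
    merged: List[Tuple[int, int]] = []
    for start, end in to_events(segments):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)  # bridge the short gap
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_len]


# Toy per-frame outputs (not data from the study).
probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.9, 0.1, 0.1, 0.1, 0.1, 0.6, 0.1]
segments = threshold(probs)
events = postprocess(segments)
print(segments)  # [0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
print(events)    # [(1, 6)]
```

In the toy example, the two runs separated by a single negative frame are merged into one event, and the isolated one-frame detection near the end is discarded as a burst. In practice the gap and length limits would be tuned separately for I, E, C, and D.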
III.
RESULTS A. Statistics of lung sound files of both Lung_V1 and Lung_V2 databases
Statistics of the lung sound files of Lung_V1 and Lung_V2 recorded by Littmann and Type-1 are tabulated in Table I. The number of subjects increased from 261 to 303. The total number of 15-second recordings increased about 1.45-fold, from 9765 to 14145, and the total duration increased from 2441.25 min to 3536.25 min. The additional patients from the RCW/RCC in Lung_V2 resulted in a substantial increase in the quantities of I, E, C, and D labels. The number of labels increased approximately 1.5-fold for I (from 34095 to 49659), 1.3-fold for E (from 18349 to 24602), 1.6-fold for C (from 13883 to 22550), and 1.3-fold for D (from 15606 to 19651). The number of 15-second files recorded by Littmann increased from 4504 to 5163, whereas the number recorded by Type-1 increased from 5261 to 8982; the increase for Littmann was thus smaller than that for Type-1. The mean durations of I, E, C, and D labels were 0.93, 0.96, 0.83, and 0.86 seconds for Lung_V1 and 0.95, 0.92, 0.82, and 0.86 seconds for Lung_V2, respectively. The mean duration of I was similar between recordings by Littmann and Type-1 (0.93 vs. 0.93 for Lung_V1; 0.93 vs. 0.97 for Lung_V2). However, the mean durations of E (1.06 vs. 0.86 for Lung_V1; 1.05 vs. 0.82 for Lung_V2), C (0.91 vs. 0.74 for Lung_V1; 0.87 vs. 0.79 for Lung_V2), and D (0.92 vs. 0.87 for Lung_V1; 0.92 vs. 0.83 for Lung_V2) recorded by Type-1 were shorter than those recorded by Littmann.
B. Statistics of both training and test datasets of both Lung_V1 and Lung_V2 databases
Statistics of the training and test datasets of Lung_V1 and Lung_V2 are tabulated in Table II. The number of 15-second files increased from 7809 to 10742 in the training dataset and from 1956 to 3403 in the test dataset. The number of I labels increased from 27223 to 39343 in the training dataset and from 6872 to 10316 in the test dataset; that of E labels increased from 15601 to 18384 in the training dataset and from 2748 to 6218 in the test dataset; that of C labels increased from 11464 to 18353 in the training dataset and from 2419 to 4197 in the test dataset; and that of D labels increased from 13794 to 14273 in the training dataset and from 1812 to 5378 in the test dataset. The mean duration of I in the training versus test dataset was 0.93 vs. 0.93 for Lung_V1 and 0.96 vs. 0.94 for Lung_V2. However, the mean durations of E, C, and D in the training versus test dataset were 0.95 vs. 0.98, 0.84 vs. 0.77, and 0.89 vs. 0.90 for Lung_V1 and 0.96 vs. 0.79, 0.84 vs. 0.75, and 0.89 vs. 0.79 for Lung_V2, respectively.
C. Performance comparisons of Lung_V1 versus Lung_V2
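The segment-level measures compared in this subsection follow the standard confusion-matrix definitions. As a reference, a minimal sketch is given here; the function name `segment_metrics` and the toy label vectors are illustrative assumptions, not code or data from the study, and AUC is omitted because it requires the unthresholded model outputs.

```python
from typing import Dict, List


def segment_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, PPV, sensitivity, specificity, and F1 score
    from binary per-segment ground truth and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Return 0.0 when the denominator is zero (a convention assumed here).
        return num / den if den else 0.0

    ppv = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "ppv": ppv,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * ppv * sensitivity, ppv + sensitivity),
    }


# Toy example: 8 segments with 3 TP, 3 TN, 1 FP, and 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = segment_metrics(y_true, y_pred)
print(metrics)  # every measure equals 0.75 for this toy input
```

For event detection, Table III reports only PPV, sensitivity, and F1 score, which can be obtained from the same formulas once events rather than segments are counted as true positives, false positives, and false negatives.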
Statistics of the performance measurements are tabulated in Table III. The model that had higher values on more items was designated as the better-performing one. For example, in I detection, the model trained with V2_Train led in 6 items (accuracy, PPV, sensitivity, specificity, F1 score, and AUC) on V1_Test and 5 items (accuracy, PPV, specificity, F1 score, and AUC) on V2_Test in segment detection, and in 2 items in event detection on both V1_Test and V2_Test; therefore, the model trained with V2_Train was defined as performing better than the one trained with V1_Train in I detection. Based on this definition, in summary, the model trained with V2_Train performed better on both V1_Test and V2_Test in I, C, and D detection, as well as on V1_Test in E detection. Similar results can be observed from the ROC curves and AUCs of segment detection presented in Fig. 3.
Figure 3. The ROC curves and AUC of segment detection on the V1_Test and V2_Test based on the CNN-BiGRU model.
The trends in the F1 scores of segment and event detection resulting from the expansion of Lung_V1 to Lung_V2 are presented in Fig. 4. All F1 scores improved as the data size increased, except those of exhalation segment and event detection on V2_Test.
Figure 4. F1 scores of segment and event detection of (a) inhalation, (b) exhalation, (c) CASs, and (d) DASs based on the HF_Lung_V1 and HF_Lung_V2.
IV.
DISCUSSION In this paper, we report the effort of expanding Lung_V1 to Lung_V2. The performance of the CNN-BiGRU model trained with the expanded Lung_V2 improved in inhalation, CAS, and DAS detection. However, no clear improvement in exhalation detection was seen on V2_Test. This may result from a larger difference in exhalation characteristics between the training and test datasets, as indicated by the large difference in mean duration (0.96 vs. 0.79 seconds) for Lung_V2 compared with the small difference (0.95 vs. 0.98 seconds) for Lung_V1. A small data region, a power-law region, and an irreducible error region are present in the power-law learning curve of deep learning [10]. The generalization error (log-scale) decreases as the training dataset size (log-scale) increases in the power-law region [10]. The size of Lung_V2 is 1.45 times that of Lung_V1, although we did not investigate whether this increase falls within the power-law region. The promising improvement in the performance of inhalation, CAS, and DAS detection encourages us to keep collecting more lung sounds and build a larger dataset.
ACKNOWLEDGEMENT This study was partially funded by the Raising Children Medical Foundation, Taiwan. The authors thank the National Center for High-Performance Computing in Taiwan for providing the computing resources required for this research. The authors thank the employees of Heroic Faith Medical Science Corp. Ltd. who contributed to the establishment of the HF_Lung_V2 database.
REFERENCES
[1]
A. Bohadana, G. Izbicki, and S. S. Kraman. “Fundamentals of lung auscultation,” N Engl J Med, vol. 370, pp. 744-751, Feb. 2014.
[2]
S. Fouzas, M. B. Anthracopoulos, and A. Bohadana. “Clinical usefulness of breath sounds,” in Breath sounds-from basic science to clinical practice, K. N. Priftis, L. J. Hadjileontiadis, and M. L. Everard, Eds. Cham: Springer International Publishing AG, 2018, pp. 32-52.
[3]
R. X. A. Pramono, S. Bowyer, and E. Rodriguez-Villegas. “Automatic adventitious respiratory sound analysis: a systematic review,” PLoS One, vol. 12, no. 5, pp. e0177926, May 2017.
[4]
D. Chamberlain, R. Kodgule, D. Ganelin, V. Miglani, and R. R. Fletcher. “Application of semi-supervised deep learning to lung sound analysis,” Annu Int Conf IEEE Eng Med Biol Soc, vol. 2016, pp. 804-807, Aug. 2016.
[5]
H. Chen, X. Yuan, Z. Pei, M. Li, and J. Li. “Triple-classification of respiratory sounds using optimized s-transform and deep residual networks,” IEEE Access, vol. 7, pp. 32845-32852, 2019.
[6]
C. Jácome, J. Ravn, E. Holsbø, J. C. Aviles-Solis, H. Melbye, and L. A. Bongo. “Convolutional neural network for breathing phase detection in lung sounds,” Sensors (Basel), vol. 19, no. 8, pp. 1798, Apr. 2019.
[7]
E. Messner, M. Fediuk, P. Swatek, S. Scheidl, F.-M. Smolle-Juttner, H. Olschewski, and F. Pernkopf. “Crackle and breathing phase detection in lung sounds with deep bidirectional gated recurrent neural networks,” Annu Int Conf IEEE Eng Med Biol Soc, vol. 2018, pp. 356-359, Jul. 2018.
[8]
F. S. Hsu, C. J. Huang, C. Y. Kuo, S. R. Huang, Y. R. Cheng, J. H. Wang, Y. L. Wu, T. L. Tzeng, and F. Lai. (2021) “Development of a respiratory sound labeling software for training a deep learning-based respiratory sound analysis model,” arXiv preprint arXiv:2101.01352.
[9]
F. S. Hsu, S. R. Huang, C. W. Huang, C. J. Huang, Y. R. Cheng, C. C. Chen, J. Hsiao, C. W. Chen, L. C. Chen, Y. C. Lai, B. F. Hsu, N. J. Lin, W. L. Tsai, Y. L. Wu, T. L. Tseng, C. T. Tseng, Y. T. Chen, and F. Lai. (2021) “Benchmarking of eight recurrent neural network variants for breath phase and adventitious sound detection on a self-developed open-access lung sound database-HF_Lung_V1,” arXiv preprint arXiv:2102.03049.
[10]
J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou. (2017) “Deep learning scaling is predictable, empirically,” arXiv preprint arXiv:1712.00409.

Table I. Statistics of lung sound files in the HF_Lung_V1 and HF_Lung_V2 databases.
[Table body not reproduced. Column headers: Database (HF_Lung_V1, HF_Lung_V2); Recording device (Littmann, HF-Type-1). Rows: Subjects; No. of 15-sec. recordings; Total duration (min); Inhalation: No., Duration (min), Mean (sec); Exhalation: No., Duration (min), Mean (sec); CASs: No. C/W/S/R, Duration (min) C/W/S/R, Mean (sec) C/W/S/R; DASs: No., Duration (min), Mean (sec).]
CAS/C: continuous adventitious sound, DAS: discontinuous adventitious sound, W: wheeze, S: stridor, R: rhonchus, NA: not applicable.

Table II. Statistics and groups of both training and test datasets in both HF_Lung_V1 and HF_Lung_V2 databases.
[Table body not reproduced. Column headers: Database (HF_Lung_V1, HF_Lung_V2); Dataset (Training, Test). Rows: No. of 15-sec. recordings; Total duration (min); Inhalation: No., Duration (min), Mean (sec); Exhalation: No., Duration (min), Mean (sec); CASs: No. C/W/S/R, Duration (min) C/W/S/R, Mean (sec) C/W/S/R; DASs: No., Duration (min), Mean (sec).]
CAS/C: continuous adventitious sound, DAS: discontinuous adventitious sound, W: wheeze, S: stridor, R: rhonchus.

Table III. Performance comparisons between the CNN-BiGRU models trained by HF_Lung_V1 and HF_Lung_V2 on the V1_Test and V2_Test.
[Table body not reproduced. Columns: Accuracy, PPV, Sensitivity, Specificity, and F1 score, each for segment detection and event detection, and AUC for segment detection; accuracy and specificity are NA for event detection. Rows: Inhalation, Exhalation, CASs, and DASs, each with the four combinations of training dataset (V1_Train or V2_Train) and test dataset (V1_Test or V2_Test).]
PPV: positive predictive value, AUC: area under the curve, NA: not applicable, CAS: continuous adventitious sound, DAS: discontinuous adventitious sound, W: wheeze, S: stridor, R: rhonchus.