Automatic Detection of B-lines in Lung Ultrasound Videos From Severe Dengue Patients
Hamideh Kerdegari (a), Phung Tran Huy Nhat (a,b), Angela McBride (b), VITAL Consortium (e), Reza Razavi (a), Nguyen Van Hao (c), Louise Thwaites (b,d), Sophie Yacoub (b,d), Alberto Gomez (a)

(a) School of Biomedical Engineering & Imaging Sciences, King's College London, UK
(b) Oxford University Clinical Research Unit, Ho Chi Minh City, Vietnam
(c) Hospital for Tropical Diseases, Ho Chi Minh City, Vietnam
(d) Centre for Tropical Medicine and Global Health, University of Oxford, UK
(e) Corporate author; for the list of members, please see the section at the end of this manuscript.

ABSTRACT
Lung ultrasound (LUS) imaging is used to assess lung abnormalities, including the presence of B-line artefacts due to fluid leakage into the lungs caused by a variety of diseases. However, manual detection of these artefacts is challenging. In this paper, we propose a novel methodology to automatically detect and localize B-lines in LUS videos using deep neural networks trained with weak labels. To this end, we combine a convolutional neural network (CNN) with a long short-term memory (LSTM) network and a temporal attention mechanism. Four different models are compared using data from 60 patients. Results show that our best model can determine whether one-second clips contain B-lines or not with an F1 score of 0.81, and extracts a representative frame with B-lines with an accuracy of 87.5%.
Index Terms — Lung ultrasound (LUS), video analysis, classification
1. INTRODUCTION
Ultrasound imaging is gaining popularity for real-time patient management in intensive care units (ICU) because it is mobile, fast, non-invasive, safe for patients and relatively inexpensive. Specifically, lung ultrasound (LUS) is becoming the reference modality for rapid lung assessment, but unlike all other ultrasound imaging applications, the purpose of LUS is to capture image artefacts that indicate a pulmonary abnormality, including features of extravascular lung water such as oedema and effusions [1]. Fluid leakage is one of the characteristic clinical features of severe dengue, and accurate assessment of this is critical for dengue patient care [2]. To this end, ultrasound imaging can be used to assess leakage through the presence and appearance of B-lines (e.g. Fig. 1, frame 12), bright lines extending from the surface of the lung distally, following the direction of propagation of the sound waves. These lines appear and disappear during the respiratory cycle and may be found only in some regions of the affected lung [3]. As a result, manually detecting these lines is a very challenging task, particularly for inexperienced operators.

Recent advances in computer vision, machine learning and particularly deep learning have brought great progress in challenging computer vision tasks such as classification and object detection. Applied to medical imaging, these tasks could help automate problems such as B-line detection in LUS. Despite the wide use of such techniques in more common applications of ultrasound imaging, very few works, and only very recently, have been published on automatic analysis of LUS images. Related work can be organised in two categories: detection of B-lines (or other artefacts) in an image; and segmentation or localisation of lung lesions in an image. The first category of methods uses classification techniques to detect B-lines in individual LUS frames. For example, van Sloun et al. [4] applied classification and weakly-supervised localization of B-lines to LUS from COVID-19 patients. A fully convolutional network was trained to recognize abnormality in images, followed by class activation maps (CAMs) [5] to produce a weakly-supervised segmentation map of the input. Unlike [4], which used CAMs for localization, Roy et al. [6] exploited a spatial transformer network for weakly supervised localization. Further, they proposed an ordinal regression to predict the presence of COVID-19 related artefacts and a score connected to the disease severity. In another study [7], a single-shot CNN was employed to predict bounding boxes for B-lines. All these methods classify one frame at a time, either requiring a method to extract a frame from the ultrasound stream first, or needing a method to unify a prediction from all predictions done on individual frames from one patient.

This work was supported by the Wellcome Trust UK (110179/Z/15/Z, 203905/Z/16/Z). H. Kerdegari, N. Phung, R. Razavi and A. Gomez also acknowledge financial support from the Department of Health via the National Institute for Health Research (NIHR) comprehensive Biomedical Research Centre award to Guy's and St Thomas' NHS Foundation Trust in partnership with King's College London and King's College Hospital NHS Foundation Trust.
Fig. 1. Results of attention weights (blue line) and their associated ground truth (green line) on a sample one-second LUS video containing non-B-line and B-line frames. Some example frames with their related attention weights are visualized. For example, frame 12 is a B-line frame and has received a high attention weight, while frame 2, which shows a non-B-line frame, has received a low attention weight. Here, the attention weights are normalized between 0 and 1 using min-max normalization for visualization.

The second category of methods is focused on using attention mechanisms, particularly on CT and X-ray lung images. A residual attention U-Net for multi-class segmentation of COVID-19 chest CT images was proposed by Chen et al. [8]. A similar architecture was applied by Gaál et al. [9], but for X-ray lung segmentation of pneumonia. In [10], a 3D CNN network with online attention refinement and a dual-sampling strategy was developed to distinguish COVID-19 from pneumonia in chest CT images. A lesion-attention deep neural network (LA-DNN) was proposed by [11], learning two tasks: a primary binary classification task on the presence of COVID-19 and an auxiliary multi-label attention learning task on five lesions. It was shown that the auxiliary task promotes the primary task to focus attention on the lesion areas and consequently improves the classification performance.

In all the mentioned studies, CAMs and attention mechanisms have been used for the spatial localization of lung lesions. Differently, we leverage temporal analysis networks and use attention to find and localize the most important frames (i.e. B-line frames) within a video that contains B-lines. Indeed, B-line artefacts appear at arbitrary frames within a LUS video, hence the ability to first detect whether there are B-line frames in the video or not is essential for clinical applicability. A variety of classical models [12, 13] have been applied for temporal context modeling. Most recently, RNNs, and particularly LSTMs, have become popular due to their ability for end-to-end training when combined with CNNs. Several recent studies incorporated spatial/optical-flow CNN features with LSTM models for global temporal modeling of videos [14, 15]. We also incorporate CNN features with an LSTM for LUS video classification. However, we use a new variant of the LSTM model equipped with an attention network that allows it to focus on and highlight B-line artefacts as discriminative frames in the ultrasound video.

In summary, the novel contributions of this paper are: 1) analysis of ultrasound videos, instead of ultrasound frames, exploiting temporal information that captures the dynamic nature of the underlying anatomy; and 2) utilization of temporal attention to localise in time the video frames where B-lines are shown.
2. MODEL ARCHITECTURE
The overall model architecture is shown in Fig. 2a. It consists mainly of three parts: a convolutional neural network (CNN), a bidirectional long short-term memory (LSTM) network and a temporal attention mechanism. The input to the model is a sequence of N frames represented by a matrix X = (x_1, ..., x_N), X ∈ R^(N×D). Spatial features are extracted from this sequence using the CNN model described below. The CNN architecture (shown in Fig. 2b) consists of four layers of convolution with ReLU activation, and two max-poolings. Each convolution filter uses square kernels with unit stride. A fully connected layer is used at the end to produce a 256-dimensional feature vector to represent each frame in the input video. This feature vector is then passed as input to the bidirectional LSTM to extract temporal features. We use a bidirectional LSTM with 16 hidden units and a tanh activation function. The LSTM outputs are then passed to the attention network to generate an attention score. We adopt the temporal attention mechanism proposed by Bahdanau et al. [16] for neural machine translation. Specifically, this attention model computes an attention score e_t for each attended frame h_t at time step t:

    e_t = h_t w_a    (1)

Here h_t is the representation of the frame at time step t and w_a is the weight matrix for the attention layer. From the attention score e_t, an importance attention weight a_t is computed for each frame at each time t:

    a_t = exp(e_t) / Σ_{i=1}^{T} exp(e_i)    (2)

The importance attention weights are multiplied by the feature vector output by the LSTM, hence effectively learning which frame of the video to pay attention to. A higher attention weight reflects a more discriminative value of the frame with respect to the B-line detection task. The attention-weighted temporal feature vector is averaged over time, Ā = (1/n) Σ_{i=1}^{n} A_i, and passed to a fully connected layer for the final LUS video classification.

Fig. 2. The proposed architecture for LUS B-line detection. (a): This model consists of CNN layers, a bidirectional LSTM, and an attention module. (b): The detailed architecture of the CNN model.
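To make the pipeline concrete, the following is a minimal Keras sketch of the architecture described above, not the authors' exact implementation. The 256-dimensional frame features, the 16-unit bidirectional LSTM and the attention of Eqs. (1)-(2) follow the text; the input frame size, kernel sizes and channel counts are illustrative assumptions, and the batch normalization and dropout mentioned in Section 3 are omitted for brevity.

```python
# Sketch of CNN + bidirectional LSTM + temporal attention (assumed details noted).
import tensorflow as tf
from tensorflow.keras import layers, models

N_FRAMES = 30            # one-second clip at 30 fps (from the paper)
FRAME_H = FRAME_W = 64   # assumed input frame size for illustration

def build_frame_encoder():
    """Per-frame CNN: four conv+ReLU layers, two max-poolings, 256-d output."""
    inp = layers.Input(shape=(FRAME_H, FRAME_W, 1))
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)  # channels/kernels assumed
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256)(x)   # 256-dimensional feature vector per frame (from the text)
    return models.Model(inp, x)

class TemporalAttention(layers.Layer):
    """Temporal attention: e_t = h_t w_a (Eq. 1), a_t = softmax over time (Eq. 2)."""
    def build(self, input_shape):
        self.w_a = self.add_weight(name="w_a", shape=(input_shape[-1], 1),
                                   initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, h):                  # h: (batch, T, features) from the LSTM
        e = tf.matmul(h, self.w_a)      # attention scores e_t, shape (batch, T, 1)
        a = tf.nn.softmax(e, axis=1)    # importance weights a_t over the time axis
        # Weight the LSTM features by attention and average over time (A-bar).
        return tf.reduce_mean(a * h, axis=1), a

frames = layers.Input(shape=(N_FRAMES, FRAME_H, FRAME_W, 1))
feats = layers.TimeDistributed(build_frame_encoder())(frames)        # (batch, T, 256)
h = layers.Bidirectional(layers.LSTM(16, activation="tanh",
                                     return_sequences=True))(feats)  # (batch, T, 32)
context, attn = TemporalAttention()(h)
out = layers.Dense(1, activation="sigmoid")(context)  # B-line vs non-B-line clip label
model = models.Model(frames, out)
```

Note that the attention layer also returns the per-frame weights `attn`, which is what allows the model to localize B-line frames in time as in Fig. 1, in addition to producing the clip-level classification.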
3. DATA, MATERIALS AND EXPERIMENTS
In this section, the data collection procedure and materials used are first explained. Then, experiments and evaluation criteria are presented.
The LUS exams were carried out using a Sonosite M-Turbo machine (Fujifilm Sonosite, Inc., Bothell, WA) with a low-medium frequency (3.5-5 MHz) convex probe by qualified sonographers. LUS was performed using a standardised operating procedure based on the Kigali ARDS protocol [17]: assessment for B-lines [18, 19], consolidation and pleural effusion, performed at 6 points on each side of the chest (2 anterior, 2 lateral and 2 posterolateral).

For this study, data from 60 patients were acquired between June 2019 and June 2020. Each patient had an average of five LUS examinations, totaling 298 examinations. The videos were recorded at a frame rate of 30 fps. The acquired dataset comprises about five hours of LUS video data containing B-line and non-B-line videos. Four-second clips at each acoustic window were stored in AVI (Audio Video Interleave) format and fully anonymised through masking. These video clips were annotated by a qualified sonographer using the VGG annotator tool [20]. The annotation procedure was performed by assigning a label (either B-line or non-B-line) to each video clip and then localizing the B-line frames in the B-line videos. The annotation output was saved in JSON format, ready to be used by the model. For model training, each four-second clip was converted into shorter clips of one second, with an overlap of 20 percent between consecutive clips.

The proposed model was implemented using the Keras library with a TensorFlow backend. The standard Adam optimizer was used for network optimization, with the learning rate set to 0.0001. A batch size of 20 and batch normalization were utilized for both the convolutional and LSTM network layers. Dropout of 0.2 and L2 regularization were used. During the training stage, all input videos were resized to a fixed frame size. The dataset was augmented by adding horizontally-flipped frames to the training data. We used 5-fold cross validation and trained the network for 60 epochs.

As evaluation metrics for the classification task, precision, recall, and F1 score were reported. Intersection over Union (IoU) of the predicted and ground truth temporal labels was used as the attention error metric.

To evaluate the potential benefit of exploiting temporal information and the effectiveness of the attention mechanism, four model architectures were compared: as a baseline, 2D convolutions in the initial CNN subnet followed by the temporal attention module and no LSTM (C2D+A); a model with 3D convolutions in the CNN subnet followed by temporal attention and no LSTM (C3D+A); a model with a 2D CNN followed by an LSTM (C2D+LSTM); and last, a model with a 2D CNN followed by an LSTM and temporal attention (C2D+LSTM+A).
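The following is a short sketch of the clip preparation and training configuration described above. The stride arithmetic for the 20-percent overlap and the dummy array shapes are assumptions for illustration; `model` refers to the architecture sketch in Section 2.

```python
# Sketch: slice 4-second recordings into 1-second clips with 20% overlap,
# augment with horizontal flips, and train with the stated configuration.
import numpy as np
import tensorflow as tf

def make_clips(video, clip_len=30, overlap=0.2):
    """Cut a video array (T, H, W, C) into clip_len-frame windows with overlap."""
    step = int(clip_len * (1 - overlap))   # 24-frame stride for a 20% overlap
    return [video[s:s + clip_len]
            for s in range(0, len(video) - clip_len + 1, step)]

def augment(clips):
    """Add horizontally flipped copies of each clip (flip along the width axis)."""
    return clips + [np.flip(c, axis=2) for c in clips]

video = np.random.rand(120, 64, 64, 1).astype("float32")  # dummy 4 s clip at 30 fps
x_train = np.stack(augment(make_clips(video)))
y_train = np.zeros(len(x_train), dtype="float32")         # dummy clip-level labels

# Training configuration from the text: Adam, lr 0.0001, batch size 20, 60 epochs.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=20, epochs=60)
```

In the actual experiments this training loop would sit inside the 5-fold cross-validation described above, with folds split at the patient level.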
Results on our LUS video dataset are presented in Table 1. As shown, the C2D+A model has the lowest F1 score because it cannot model the temporal aspect of the data with a 2D CNN and no recursion over time. With the C3D+A model, performance improves, which shows the ability of the C3D+A structure to model the temporal aspect of the data, albeit with a short context span over time. Adding an LSTM to the C2D model (C2D+LSTM) shows that the LSTM part of the model is crucial for the final performance, as it considers the long temporal progression of the LUS video data. Finally, the C2D+LSTM+A model outperforms the other models and shows that with the temporal attention mechanism the F1 score improved from 0.79 (in C2D+LSTM) to 0.81 (+0.02). This experiment demonstrates that all the sub-components of the proposed method contribute to the final performance improvement.
Table 1. Precision, Recall and F1 score results on the LUS video dataset using different models.

Model        | Precision | Recall | F1
C2D+A        | 0.57      | 0.61   | 0.58
C3D+A        | 0.73      | 0.82   | 0.77
C2D+LSTM     | 0.75      | 0.85   | 0.79
C2D+LSTM+A   | –         | –      | 0.81
Besides improving the classification performance, Table 2 shows quantitatively that the temporal attention mechanism is able to highlight discriminative frames that contain B-lines. These predicted temporally localized frames are compared with the ground truth annotation at different IoU thresholds, achieving an accuracy of up to 67.1%. To illustrate the meaning of this number, the example shown in Fig. 1 had an IoU of 78%. Further, the representative frame with the highest attention weight matched a ground-truth B-line frame with an accuracy of 87.5%.
Table 2. B-line localization accuracy (%) at different IoU thresholds α.

Model        | α=0.1 | α=0.2 | α=0.3 | α=0.4
C2D+A        | 36.2  | 31.4  | 28.6  | 22.4
C3D+A        | 63.3  | 61.1  | 54.0  | 45.7
C2D+LSTM+A   | 67.1  | –     | –     | –
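A minimal sketch of this temporal localization evaluation follows: the IoU between predicted and ground-truth per-frame B-line masks, with a clip counted as correctly localized when its IoU reaches the threshold α. Binarizing the attention weights by min-max normalization and a 0.5 threshold, and picking the representative frame as the argmax of the attention weights, are assumptions for illustration, as the paper does not spell out these steps.

```python
# Sketch of the IoU-based attention error metric and representative-frame pick.
import numpy as np

def temporal_iou(pred_mask, gt_mask):
    """IoU of two equal-length binary per-frame masks."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    return np.logical_and(pred_mask, gt_mask).sum() / union if union else 1.0

def localization_accuracy(attn_weights, gt_masks, alpha=0.1):
    """Fraction of clips whose attention-derived mask reaches IoU >= alpha."""
    hits = 0
    for a, gt in zip(attn_weights, gt_masks):
        a = (a - a.min()) / (a.max() - a.min() + 1e-8)  # min-max normalize to [0, 1]
        hits += temporal_iou(a >= 0.5, gt) >= alpha      # assumed 0.5 binarization
    return hits / len(attn_weights)

def representative_frame(a):
    """Representative B-line frame: index of the maximum attention weight."""
    return int(np.argmax(a))
```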
4. CONCLUSION
We have proposed an attention-based convolutional + LSTM model capable of detecting B-line artefacts and localizing them within LUS videos. This architecture allows us to capture features from both spatial and temporal dimensions. Further, the temporal attention mechanism enables the localization of B-line frames. The performance of this model was evaluated on our LUS video dataset and showed a classification F1 score of 0.81 and a B-line localization accuracy of 67.1%. These results demonstrate the efficacy of our approach and are consistent with qualitative analysis via visual inspection of the calculated attentions, which highlight the frames with the most salient B-lines in the video.

Future work includes investigating more accurate spatial feature extractors such as the VGG19 [21] and ResNet101 [22] architectures, which will likely lead to better overall performance. In addition, it would be interesting to add a spatial attention mechanism to the model to detect B-line regions in the LUS video along with the B-line frames, which is the first step towards the quantification of the severity of the disease. Further, architectures like temporal convolutional networks [23] that have worked well in other domains for sequence modeling could be applied to LUS video analysis. Overall, our results on the automation of B-line detection using LUS will assist the fluid status assessment and management of patients with dengue and other diseases, especially for users with less ultrasound expertise.
5. COMPLIANCE WITH ETHICAL STANDARDS
This study was performed in line with the principles of the Declaration of Helsinki. Approval was granted by the Ethics Committee of the Hospital for Tropical Diseases, Ho Chi Minh City, and the Oxford Tropical Research Ethics Committee.
6. ACKNOWLEDGMENTS
A. G. is an advisor to Ultromics Ltd. The VITAL Consortium:
OUCRU: Dang Trung Kien, Dong Huu Khanh Trinh, Joseph Donovan, Du Hong Duc, Ronald Geskus, Ho Bich Hai, Ho Quang Chanh, Ho Van Hien, Hoang Minh Tu Van, Huynh Trung Trieu, Evelyne Kestelyn, Lam Minh Yen, Le Nguyen Thanh Nhan, Le Thanh Phuong, Luu Phuoc An, Nguyen Lam Vuong, Nguyen Than Ha Quyen, Nguyen Thanh Ngoc, Nguyen Thi Le Thanh, Nguyen Thi Phuong Dung, Ninh Thi Thanh Van, Pham Thi Lieu, Phan Nguyen Quoc Khanh, Phung Khanh Lam, Phung Tran Huy Nhat, Guy Thwaites, Louise Thwaites, Tran Minh Duc, Trinh Manh Hung, Hugo Turner, Jennifer Ilo Van Nuil, Sophie Yacoub.

Hospital for Tropical Diseases, Ho Chi Minh City: Cao Thi Tam, Duong Bich Thuy, Ha Thi Hai Duong, Ho Dang Trung Nghia, Le Buu Chau, Le Ngoc Minh Thu, Le Thi Mai Thao, Luong Thi Hue Tai, Nguyen Hoan Phu, Nguyen Quoc Viet, Nguyen Thanh Nguyen, Nguyen Thanh Phong, Nguyen Thi Kim Anh, Nguyen Van Hao, Nguyen Van Thanh Duoc, Nguyen Van Vinh Chau, Pham Kieu Nguyet Oanh, Phan Tu Qui, Phan Vinh Tho, Truong Thi Phuong Thao.

University of Oxford: David Clifton, Mike English, Heloise Greeff, Huiqi Lu, Jacob McKnight, Chris Paton.

Imperial College London: Pantellis Georgiou, Bernard Hernandez Perez, Kerri Hill-Cawthorne, Alison Holmes, Stefan Karolcik, Damien Ming, Nicolas Moser, Jesus Rodriguez Manzano.

King's College London: Alberto Gomez, Hamideh Kerdegari, Marc Modat, Reza Razavi.

ETH Zurich: Abhilash Guru Dutt, Walter Karlen, Michaela Verling, Elias Wicki.

Melbourne University: Linda Denehy, Thomas Rollinson.
7. REFERENCES

[1] G. Soldati et al., "Ultrasound patterns of pulmonary edema," Annals of Translational Medicine, vol. 7, no. 1, 2019.
[2] P. Mayo et al., "Thoracic ultrasonography: a narrative review," Intensive Care Medicine, pp. 1–12, 2019.
[3] C. Dietrich et al., "Lung B-line artefacts and their use," Journal of Thoracic Disease, vol. 8, no. 6, p. 1356, 2016.
[4] R. J. van Sloun et al., "Localizing B-lines in lung ultrasonography by weakly supervised deep learning, in-vivo results," IEEE JBHI, vol. 24, no. 4, pp. 957–964, 2019.
[5] B. Zhou et al., "Learning deep features for discriminative localization," in CVPR, 2016, pp. 2921–2929.
[6] S. Roy et al., "Deep learning for classification and localization of COVID-19 markers in point-of-care lung ultrasound," IEEE TMI, 2020.
[7] S. Kulhare et al., "Ultrasound-based detection of lung abnormalities using single shot detection convolutional neural networks," in MICCAI-PoCUS, 2018, pp. 65–73.
[8] X. Chen et al., "Residual attention U-Net for automated multi-class segmentation of COVID-19 chest CT images," arXiv:2004.05645, 2020.
[9] G. Gaál et al., "Attention U-Net based adversarial architectures for chest X-ray lung segmentation," arXiv:2003.10304, 2020.
[10] X. Ouyang et al., "Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia," IEEE TMI, 2020.
[11] B. Liu et al., "Online COVID-19 diagnosis with chest CT images: Lesion-attention deep neural networks," medRxiv, 2020.
[12] C. Sminchisescu et al., "Conditional models for contextual human motion recognition," CVIU, vol. 104, no. 2–3, pp. 210–220, 2006.
[13] N. Ikizler et al., "Searching video for complex activities with finite state models," in CVPR, 2007, pp. 1–8.
[14] N. Srivastava et al., "Unsupervised learning of video representations using LSTMs," in International Conference on Machine Learning, 2015, pp. 843–852.
[15] J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in CVPR, 2015, pp. 2625–2634.
[16] D. Bahdanau et al., "Neural machine translation by jointly learning to align and translate," arXiv:1409.0473, 2014.
[17] E. D. Riviello et al., "Hospital incidence and outcomes of the acute respiratory distress syndrome using the Kigali modification of the Berlin definition," Am. J. Respir. Crit. Care Med., vol. 193, no. 1, pp. 52–59, 2016.
[18] D. A. Lichtenstein, "Relevance of lung ultrasound in the diagnosis of acute respiratory failure: the BLUE protocol," Chest, vol. 134, no. 1, pp. 117–125, 2008.
[19] G. Volpicelli et al., "International evidence-based recommendations for point-of-care lung ultrasound," Intensive Care Medicine, vol. 38, no. 4, pp. 577–591, 2012.
[20] A. Dutta et al., "The VIA annotation software for images, audio and video," in ACM Multimedia, 2019.
[21] K. Simonyan et al., "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[22] K. He et al., "Deep residual learning for image recognition," in IEEE CVPR, 2016, pp. 770–778.
[23] C. Lea et al., "Temporal convolutional networks: A unified approach to action segmentation," in ECCV Workshops, 2016.