Chest X-ray lung and heart segmentation based on minimal training sets
Balázs Maga*

January 22, 2021
Abstract
As the COVID-19 pandemic aggravated the excessive workload of doctors globally, the demand for computer-aided methods in medical imaging analysis increased even further. Such tools can result in more robust diagnostic pipelines which are less prone to human errors. In our paper, we present a deep neural network, to which we refer as Attention BCDU-Net, and apply it to the task of lung and heart segmentation from chest X-ray (CXR) images, a basic but arduous step in the diagnostic pipeline, for instance in the detection of cardiomegaly. We show that the fine-tuned model exceeds previous state-of-the-art results, reaching . ± . Dice score and . ± . IoU score on the dataset of the Japanese Society of Radiological Technology (JSRT). Besides that, we demonstrate the relative simplicity of the task by attaining surprisingly strong results with training sets of size 10 and 20: in terms of Dice score, . ± . and . ± . , respectively, while in terms of IoU score, . ± . and . ± . , respectively. To achieve these scores, we capitalize on the mixup augmentation technique, which yields a remarkable gain in IoU score in the size-10 setup.
1 Introduction

All around the world, a plethora of radiographic examinations are performed day to day, producing images using different imaging modalities such as X-ray, computed tomography (CT), diagnostic ultrasound and magnetic resonance imaging (MRI). According to the publicly available, official data of the National Health Service ([13]), in the period from February 2017 to February 2018, the count of imaging activity was about 41 million in England alone. The thorough examination of this vast quantity of images imposes a huge workload on radiologists, which increases the number of avoidable human mistakes. Consequently, automated methods aiding the diagnostic processes are sought-after.

The examination of medical images customarily includes various segmentation tasks, in which detecting and pixel-wise annotating different tissues and certain anomalies is vital. Common examples include lung nodule segmentation in the diagnosis of lung cancer, lung and heart segmentation in the diagnosis of cardiomegaly, or plaque segmentation in the diagnosis of thrombosis. Even in the case of 2-dimensional modalities, such segmentation tasks can be extremely time-demanding, and the situation gets even worse in three dimensions. Taking into consideration that these tasks are easier to formalize as a standard computer vision exercise than the identification of a particular disease, it is not surprising that they sparked much activity in the field of automated medical imaging analysis.

Semantic segmentation – that is, assigning a pre-defined class to each pixel of an image – requires a high level of visual understanding, in which state-of-the-art performance is attained by methods utilizing Fully Convolutional Networks (FCN) [4]. An additional challenge of the field is posed by the strongly limited quantity of training data on which one can train machine learning models, as annotating medical images requires specialists, in contrast to "real-life" images.
To overcome this difficulty, the so-called U-Net architecture was proposed: its capability of being efficiently trained on small datasets has been demonstrated in [5]. Over the past few years, several modifications and improvements have been proposed on the original architecture, some of which involved different attention mechanisms designed to help the network detect the important parts of the images.

*Eötvös Loránd University, Pázmány Péter sétány 1/C, Budapest, H-1117 Hungary

In the present paper we introduce a new network primarily based on the ideas of [12] and [8], to which we refer as Attention BCDU-Net. We optimize its performance through hyperparameter tests on the depth of the architecture and the loss function, and we demonstrate the superiority of the resulting model compared to the state-of-the-art network presented in [15] in the task of lung and heart segmentation on chest X-rays. Besides that, we also give insight into two phenomena arising during our research which might be interesting for the AI medical imaging community: one of them is the very small data requirement of this particular task, while the other is the peculiar evolution of the loss curves over the training.

2 Related work

As already mentioned in Section 1, [5], which introduced U-Nets, is of paramount importance in the field. Since then U-Nets have been used to cope with diverse medical segmentation tasks, and numerous papers aimed to design U-Net variants and mechanisms such that the resulting models better tackle the problem considered. Some of these paid primary attention to the structure of the encoder and the decoder – that is, the downsampling and the upsampling path – of the original architecture. For example, in [18] the authors developed a network (CoLe-CNN) with multiple decoder branches and an Inception-v4-inspired encoder to achieve state-of-the-art results in 2-dimensional lung nodule segmentation.
In [10] and [14], the authors introduced U-Net++, a network equipped with intermediate upsampling paths and additional convolutional layers, leading to what is essentially an efficient ensemble of U-Nets of varying depths, and demonstrated its superiority compared to the standard U-Net in many image domains. Other works put emphasis on the design of skip connections and the way the higher resolution semantic information joins the features coming through the upsampling branch. In [12], the authors proposed the architecture BCDU-Net, in which, instead of the simple concatenation of the corresponding filters, the features of different levels are fused using a bidirectional ConvLSTM layer, which introduces nonlinearity into the model at this point and makes more precise segmentations available. In [8] it has been shown that for medical image analysis tasks the integration of so-called Attention Gates (AGs) improves the accuracy of segmentation models while preserving computational efficiency. In [15], this network was enhanced by a critic network in a GAN-like scheme following [9], and achieved state-of-the-art results in the task of lung and heart segmentation. Other attention mechanisms were introduced in [17] and [16].
3 Network architecture

The network architecture Attention BCDU-Net we propose is a modification of the Attention U-Net, shown in Figure 1.

Figure 1: Schematic architecture of Attention U-Net [8].

Figure 2: Schematic figure of the attention gate used in Attention U-Net [8]; the tensor addition to be altered is highlighted by an arrow.

In [12], the authors demonstrated that it is beneficial to use bidirectional ConvLSTM layers to introduce nonlinearity in the step of merging the semantic information gained through skip connections and the features arriving through the decoder. This inspired us to modify the attention gates (see Figure 2) in a similar manner: in these gates, the two pieces of information are merged via tensor addition, which is a linear operation as well. This addition is replaced by a bidirectional ConvLSTM layer, to which the outputs of W_g and W_x – the processed features and the semantic information, respectively – are fed in this order. We note that, to the best of our knowledge, there is a slight ambiguity about the structure of the resampling steps in the attention gate: while the official implementation is in accordance with the figure, there are widely utilized implementations in which the output of W_g is upsampled instead of downsampling the output of W_x in order to fit their shapes. We tested both solutions and did not experience a measurable difference in performance. We also experimented with the usage of additional spatial and channel attention layers as proposed by [17]; however, we found that they do not improve the performance of our model.

The depth of the network is to be determined by hyperparameter testing. Our tests confirmed that four downsampling steps result in the best performance; however, the differences are minuscule.

4 Metrics and loss functions

A standard score to compare segmentations is the Intersection over Union (IoU): given two sets of pixels X, Y, their IoU is

    IoU(X, Y) = |X ∩ Y| / |X ∪ Y|.

In the field of medical imaging, the Dice Score Coefficient (DSC) is probably the most widespread and simple way to measure the overlap ratio of the masks and the ground truth, and hence to compare and evaluate segmentations. It is a slight modification of IoU: given two sets of pixels X, Y, their DSC is

    DSC(X, Y) = 2|X ∩ Y| / (|X| + |Y|).

If Y is in fact the result of a test about which pixels are in X, we can rewrite it with the usual notation of true/false positives (TP/FP) and false negatives (FN) as

    DSC(X, Y) = 2TP / (2TP + FN + FP).
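For concreteness, both scores can be computed directly from a pair of binary masks; the following NumPy sketch (the helper name and the toy masks are ours, for illustration only) mirrors the TP/FP/FN formulation above.

```python
import numpy as np

def iou_and_dsc(pred, truth):
    """IoU and Dice score of two boolean masks of identical shape."""
    tp = np.logical_and(pred, truth).sum()   # true positives
    fp = np.logical_and(pred, ~truth).sum()  # false positives
    fn = np.logical_and(~pred, truth).sum()  # false negatives
    iou = tp / (tp + fp + fn)
    dsc = 2 * tp / (2 * tp + fp + fn)
    return iou, dsc

# Toy 1-D "masks": TP = 2, FP = 1, FN = 1
pred = np.array([1, 1, 1, 0], dtype=bool)
truth = np.array([0, 1, 1, 1], dtype=bool)
iou, dsc = iou_and_dsc(pred, truth)  # iou = 0.5, dsc = 2/3
```

Note that the two scores are monotonically related, DSC = 2·IoU/(1 + IoU), so they rank single predictions identically; averaged over a dataset, however, they can behave differently.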
We would like to use this concept in our setup. The class c we would like to segment corresponds to a set, but it is more appropriate to consider its indicator function g, that is, g_{i,c} ∈ {0, 1} equals 1 if and only if the i-th pixel belongs to the object. On the other hand, our prediction is a probability for each pixel, denoted by p_{i,c} ∈ [0, 1]. The Dice Score of the prediction in the spirit of the above description is then defined to be

    DSC = (2 Σ_{i=1}^{N} p_{i,c} g_{i,c} + ε) / (Σ_{i=1}^{N} (p_{i,c} + g_{i,c}) + ε),

where N is the total number of pixels, and ε is introduced for the sake of numerical stability and to avoid division by 0. The IoU of the prediction can be calculated in a similar manner. The linear Dice Loss (DL) of the multiclass prediction is then

    DL = Σ_c (1 − DSC_c).

A weakness of the Dice Loss is that it weights false positives and false negatives equally, which may be undesirable for imbalanced classes. This can be remedied by introducing α, β as tuneable parameters, resulting in the definition of the Tversky similarity index [1]:

    TI_c = (Σ_{i=1}^{N} p_{i,c} g_{i,c} + ε) / (Σ_{i=1}^{N} p_{i,c} g_{i,c} + α Σ_{i=1}^{N} p̄_{i,c} g_{i,c} + β Σ_{i=1}^{N} p_{i,c} ḡ_{i,c} + ε),

where p̄_{i,c} = 1 − p_{i,c} and ḡ_{i,c} = 1 − g_{i,c}, that is, the overline simply denotes the complement of the class. The Tversky Loss is obtained from the Tversky index as the Dice Loss was obtained from the Dice Score Coefficient:

    TL = Σ_c (1 − TI_c).

Another issue with the Dice Loss is that it struggles to segment small ROIs, as they do not contribute to the loss significantly. This difficulty was addressed in [11], where the authors introduced the Focal Tversky Loss in order to improve the performance of their lesion segmentation model:
    FTL = Σ_c (1 − TI_c)^{1/γ},

where γ ∈ [1, 3]. In practice, if a pixel is misclassified but the Tversky index is high, the Focal Tversky Loss is unaffected. However, if the Tversky index is small and the pixel is misclassified, the Focal Tversky Loss decreases significantly.

In our work we use multiclass DSC and IoU to evaluate segmentation performance. As our initial tests demonstrated that training our network with the Focal Tversky Loss results in better scores, we use this loss function. The optimal α, β, γ parameters should be determined by extensive hyperparameter testing and grid search; we worked below with α = 0. , β = 0. , γ = 0. .

5 Data and preprocessing

For training and validation data, we used the public Japanese Society of Radiological Technology (JSRT) dataset ([3]), available at [2]. The JSRT dataset contains a total of 247 images, all of resolution 2048 × 2048 with 12-bit grayscale levels. Both lung and heart segmentation masks are available for this dataset.

In terms of preprocessing, similarly to [15], the images were first resized to a lower resolution. X-rays are grayscale images with typically low contrast, which makes their analysis a difficult task. This obstacle might be overcome by using some sort of histogram equalization technique. The idea of standard histogram equalization is spreading out the most frequent intensity values to a higher range of the intensity domain by modifying the intensities so that their cumulative distribution function (CDF) on the complete modified image is as close to the CDF of the uniform distribution as possible. Improvements might be made by using adaptive histogram equalization, in which the above method is not utilized globally but separately on pieces of the image, in order to enhance local contrasts. However, this technique might overamplify noise in near-constant regions, hence our choice was to use Contrast Limited Adaptive Histogram Equalization (CLAHE), which counteracts this effect by clipping the histogram at a predefined value before calculating the CDF, and redistributing the clipped part equally among all the histogram bins.

Concerning data augmentation, we follow [7], in which the mixup method was used to improve glioma segmentation on brain MRIs. This slightly counter-intuitive augmentation technique was introduced by [6]: training data samples are obtained by taking random convex combinations of original image-mask pairs. That is, for image-mask pairs (x_1, y_1) and (x_2, y_2), we create a random mixed-up pair

    x = λx_1 + (1 − λ)x_2,    y = λy_1 + (1 − λ)y_2,

where λ is chosen from the beta distribution B(δ, δ) for some δ ∈ (0, ∞). In each epoch, the original samples are paired randomly, hence during the course of the training, a multitude of training samples are fed to the network. (From the mathematical point of view, as the coefficient λ is chosen independently in each case from a continuous probability distribution, the network will encounter pairwise distinct mixed-up training samples with probability 1, modulo floating point inaccuracy.) In [6], the authors argue that generating training samples via this method encourages the network to behave linearly in between training examples, which reduces the amount of undesirable oscillations when predicting outside the training examples. The choice of δ should be determined by hyperparameter testing for any network and task considered. In [6], δ ∈ [0. , 0. ] is proposed, while in [7] δ = 0. is applied.

6 Experiments

In our main tests, the JSRT dataset was randomly split so that 85% of it was used for training and the rest for validation and testing. This split was carried out independently in each case, enhancing the robustness of our results.
Besides that, we also experimented with small dataset training, in which rather modest sets of 10 and 20 X-rays were utilized as training sets. (The test set remained the same.) This enabled us to measure the benefits of mixup more transparently. In each of these cases, we trained our network with the Adam optimizer: in the former case for 50 epochs, while in the latter cases for 1000 and 500 epochs, respectively. As these epoch numbers are approximately inversely proportional to the sizes of the training sets, these choices correspond to each other in terms of training steps.
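The Focal Tversky Loss driving these trainings, as defined earlier, can be sketched as a minimal NumPy function; the α, β, γ values below are illustrative defaults in the spirit of [11], not the tuned values of our experiments, and the flattened (pixels × classes) layout is a simplifying assumption.

```python
import numpy as np

def focal_tversky_loss(p, g, alpha=0.7, beta=0.3, gamma=4 / 3, eps=1e-7):
    """p: (N, C) per-pixel class probabilities; g: (N, C) one-hot ground
    truth. alpha weights false negatives, beta false positives; the
    exponent 1/gamma focuses the loss on classes with a low Tversky index."""
    tp = (p * g).sum(axis=0)           # soft true positives per class
    fn = ((1.0 - p) * g).sum(axis=0)   # soft false negatives per class
    fp = (p * (1.0 - g)).sum(axis=0)   # soft false positives per class
    ti = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return float(((1.0 - ti) ** (1.0 / gamma)).sum())

# A perfect prediction drives the loss to ~0
g = np.eye(3)[np.array([0, 1, 2, 1])]  # 4 pixels, 3 classes, one-hot
loss_perfect = focal_tversky_loss(g.copy(), g)
loss_uniform = focal_tversky_loss(np.full((4, 3), 1 / 3), g)
```

Setting α = β = 0.5 and γ = 1 recovers the plain Dice-style loss, which makes the Tversky family convenient for hyperparameter search over a single parametrization.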
7 Results

Table 1 summarizes the numerical results we obtained during the testing of Attention BCDU-Net with different train sizes and choices of δ, while Figures 3-5 display visual results. Note that the highest DSC scores slightly exceed the ones attained by the state-of-the-art, adversarially enhanced Attention U-Net introduced in [15] ( . ± . ) and admit higher stability. The effect of augmentation is the most striking in the case of training on an X-ray set of size 10, when the choice δ = 0. results in a 5% increase of IoU compared to the no-mixup case. In general, we found this case particularly interesting: it was surprising that we could achieve IoU and DSC scores of this magnitude using such a small training set. Nevertheless, the predictions have some imperfections, displayed in Figure 3: the contours of the segmentation are less clear, and both the heart and the lung segmentation tend to contain small spots far from the ground truth. However, such conspicuous faults are unlikely to occur in the case of the best models for 20 train X-rays (Figure 4), which is still remarkable. The sufficiency of such small training sets is probably due to the relative simplicity of the task. Notably, lung and heart regions admit large similarity across a set of chest X-rays, and they are strongly correlated with simple intensity thresholds. Consequently, even small datasets have high representing potential. We note that as δ gets smaller, the probability density function of B(δ, δ) gets more strongly skewed towards the endpoints of the interval [0, 1], which results in mixed-up samples being closer to the original samples in general. The perceived optimality of δ = 0. in the small dataset cases shows that a considerable augmentation is beneficial and desirable, yet it is unadvised to use too wildly modified samples.

The benevolent effect of mixup gets more obscure as we increase the size of the training set.
Notably, the results of the different augmentation setups are almost indistinguishable from each other. We interpret this phenomenon as another consequence of the similarity of masks from different samples, which inherently drives the network towards simpler representations in the case of a sufficiently broad training set, even without using mixup.

We also note that in the case of 10 training samples, while the IoU differences between the no-mixup and the mixup regimes are striking, the gain in DSC is less remarkable. This hints that it is unadvised to rely merely on DSC when evaluating and comparing segmentation models.

Figure 3: Ground truth (left) compared to the prediction of the Attention BCDU-Net (right), train size: 10.

Figure 4: Ground truth (left) compared to the prediction of the Attention BCDU-Net (right), train size: 20.

Figure 5: Ground truth (left) compared to the prediction of the Attention BCDU-Net (right), train size: complete.

              | Train size: 10       | Train size: 20       | Train size: 209 (complete)
              | IoU        DSC       | IoU        DSC       | IoU        DSC
    No mixup  | . ± . %    . ± . %   | . ± . %    . ± . %   | . ± . %    . ± . %
    δ = 0.    | . ± . %    . ± . %   | . ± . %    . ± . %   | . ± . %    . ± . %
    δ = 0.    | . ± . %    . ± . %   | . ± . %    . ± . %   | . ± . %    . ± . %
    δ = 0.    | . ± . %    . ± . %   | . ± . %    . ± . %   | . ± . %    . ± . %
    δ = 0.    | . ± . %    . ± . %   | . ± . %    . ± . %   | . ± . %    . ± . %

Table 1: Dice scores and IoU scores of Attention BCDU-Net with different mixup parameters.

We would also like to draw attention to the peculiar loss curves we primarily encountered during the small dataset trainings, as displayed in Figure 6. Notably, the curve of the validation DSC flattens far below the also-flattening curve of the train DSC, strongly inciting the usage of early stopping. (Train DSC in fact reaches essentially 1, which is unsurprising with such a small training set.) However, in the later stages the validation DSC catches up, even though the train DSC does not have any room for further improvement.
We were especially puzzled by this behaviour in the 10-sized training setup, in which both the train and validation DSC seem completely stabilized from epoch 50 to epoch 400, yet the validation DSC skyrockets in the later stages in a very short amount of time. The same behaviour was experienced during each test run. We have yet to find an intuitive or theoretical explanation for this phenomenon, that is, how the generalizing ability of the model can improve further when it seems to be in a perfect state from the training perspective. We note that these observations naturally led us to experiment with even longer trainings, but to no avail.

Figure 6: From left to right: the evolution of the train DSC (blue) and the validation DSC (orange) with 10 training samples, 20 training samples, and the complete training dataset, respectively. The IoU curves admit similar patterns.

8 Conclusion

In the present work, we addressed the problem of automated lung and heart segmentation on chest X-rays. We introduced a new model, Attention BCDU-Net, a variant of Attention U-Net equipped with modified attention gates, and surpassed previous state-of-the-art results. We also demonstrated its ability to attain surprisingly reasonable results with strongly limited training sets. Performance in these cases was enhanced using the mixup augmentation technique, resulting in a highly notable contribution to the IoU score.

Concerning future work, a natural extension would be adding a structure-correcting adversarial network to the training scheme, similarly to [9] and [15], and measuring its effect on the performance, especially in the setup of limited training sets. We would also like to give some kind of explanation for the phenomenon of the peculiar loss curves.
Acknowledgements
The project was supported by the grant EFOP-3.6.3-VEKOP-16-2017-00002.
References

[1] Amos Tversky. "Features of similarity". In: Psychological Review 84.4 (1977), pp. 327-352.
[2] Japanese Society of Radiological Technology. Digital Image Database. http://db.jsrt.or.jp/eng.php. 2000.
[3] Junji Shiraishi et al. "Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules". In: American Journal of Roentgenology 174 (2000), pp. 71-74.
[4] Jonathan Long, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3431-3440.
[5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation". In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Springer International Publishing, 2015, pp. 234-241. ISBN: 978-3-319-24574-4. URL: https://doi.org/10.1007/978-3-319-24574-4_28.
[6] Hongyi Zhang et al. "mixup: Beyond Empirical Risk Minimization". In: (Oct. 2017).
[7] Zach Eaton-Rosen et al. "Improving Data Augmentation for Medical Image Segmentation". In: (2018).
[8] Ozan Oktay et al. "Attention U-Net: Learning Where to Look for the Pancreas". In: arXiv:1804.03999 (2018).
[9] Wei Dai et al. "SCAN: Structure Correcting Adversarial Network for Organ Segmentation in Chest X-Rays". In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings. Vol. 11045. Springer, 2018, p. 263.
[10] Zongwei Zhou et al. "UNet++: A Nested U-Net Architecture for Medical Image Segmentation". In: (July 2018).
[11] Nabila Abraham and Naimul Mefraz Khan. "A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation". In: 2019 IEEE International Symposium on Biomedical Imaging (ISBI). IEEE, 2019, pp. 683-687.
[12] Reza Azad et al. "Bi-Directional ConvLSTM U-Net with Densley Connected Convolutions". In: (2019), pp. 406-415.
[13] NHS England and NHS Improvement. "Diagnostic Imaging Dataset Statistical Release". In: (2019).
[14] Zongwei Zhou et al. "UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation". In: IEEE Transactions on Medical Imaging (Dec. 2019).
[15] Gusztáv Gaál, Balázs Maga, and András Lukács. "Attention U-Net Based Adversarial Architectures for Chest X-ray Lung Segmentation". In: arXiv:2003.10304 (Mar. 2020).
[16] Trinh Le Ba Khanh et al. "Enhancing U-Net with Spatial-Channel Attention Gate for Abnormal Tissue Segmentation in Medical Imaging". In: Applied Sciences 10 (Aug. 2020).
[17] Peng Zhao et al. "SCAU-Net: Spatial-Channel Attention U-Net for Gland Segmentation". In: Frontiers in Bioengineering and Biotechnology (2020). ISSN: 2296-4185.
[18] Giuseppe Pezzano, Vicent Ribas Ripoll, and Petia Radeva. "CoLe-CNN: Context-learning convolutional neural network with adaptive loss function for lung nodule segmentation". In: Computer Methods and Programs in Biomedicine 198 (2021), p. 105792. ISSN: 0169-2607. DOI: 10.1016/j.cmpb.2020.105792.