Chest X-ray lung and heart segmentation based on minimal training sets
Balázs Maga*

January 22, 2021
Abstract
As the COVID-19 pandemic aggravated the excessive workload of doctors globally, the demand for computer-aided methods in medical imaging analysis increased even further. Such tools can result in more robust diagnostic pipelines which are less prone to human errors. In our paper, we present a deep neural network, to which we refer as Attention BCDU-Net, and apply it to the task of lung and heart segmentation from chest X-ray (CXR) images, a basic but arduous step in the diagnostic pipeline, for instance in the detection of cardiomegaly. We show that the fine-tuned model exceeds previous state-of-the-art results, reaching . ± . Dice score and . ± . IoU score on the dataset of the Japanese Society of Radiological Technology (JSRT). Besides that, we demonstrate the relative simplicity of the task by attaining surprisingly strong results with training sets of size 10 and 20: in terms of Dice score, . ± . and . ± . , respectively, while in terms of IoU score, . ± . and . ± . , respectively. To achieve these scores, we capitalize on the mixup augmentation technique, which yields a remarkable gain in IoU score in the size-10 setup.
1 Introduction

All around the world, a plethora of radiographic examinations are performed day to day, producing images using different imaging modalities such as X-ray, computed tomography (CT), diagnostic ultrasound and magnetic resonance imaging (MRI). According to the publicly available, official data of the National Health Service ([13]), in the period from February 2017 to February 2018, the count of imaging activity was about 41 million in England alone. The thorough examination of this vast quantity of images imposes a huge workload on radiologists, which increases the number of avoidable human mistakes. Consequently, automated methods aiding the diagnostic processes are sought-after.

The examination of medical images customarily includes various segmentation tasks, in which detecting and pixel-wise annotating different tissues and certain anomalies is vital. Common examples include lung nodule segmentation in the diagnosis of lung cancer, lung and heart segmentation in the diagnosis of cardiomegaly, or plaque segmentation in the diagnosis of thrombosis. Even in the case of 2-dimensional modalities, such segmentation tasks can be extremely time-demanding, and the situation gets even worse in three dimensions. Taking into consideration that these tasks are easier to formalize as a standard computer vision exercise than the identification of a particular disease, it is not surprising that they sparked much activity in the field of automated medical imaging analysis.

Semantic segmentation – that is, assigning a pre-defined class to each pixel of an image – requires a high level of visual understanding, in which state-of-the-art performance is attained by methods utilizing Fully Convolutional Networks (FCN) [4]. An additional challenge of the field is posed by the strongly limited quantity of training data on which one can train machine learning models, as annotating medical images requires specialists, in contrast to "real-life" images.
To overcome this difficulty, the so-called U-Net architecture was proposed: its capability of being efficiently trained on small datasets has been demonstrated in [5]. Over the past few years, several modifications and improvements have been proposed on the original architecture, some of which involved different attention mechanisms designed to help the network detect the important parts of the images.

*Eötvös Loránd University, Pázmány Péter sétány 1/C, Budapest, H-1117 Hungary

In the present paper we introduce a new network primarily based on the ideas of [12] and [8], to which we refer as Attention BCDU-Net. We optimize its performance through hyperparameter tests on the depth of the architecture and the loss function, and we demonstrate the superiority of the resulting model compared to the state-of-the-art network presented in [15] in the task of lung and heart segmentation on chest X-rays. Besides that, we also give insight into two phenomena arising during our research which might be interesting for the AI medical imaging community: one of them is the very small data requirement of this particular task, while the other is the peculiar evolution of the loss curves over the training.

2 Related work

As already mentioned in Section 1, [5], which introduced U-Nets, is of paramount importance in the field. Since then U-Nets have been used to cope with diverse medical segmentation tasks, and numerous papers aimed to design U-Net variants and mechanisms such that the resulting models better tackle the problem considered. Some of these paid primary attention to the structure of the encoder and the decoder – that is, the downsampling and the upsampling path – of the original architecture. For example, in [18] the authors developed a network (CoLe-CNN) with multiple decoder branches and an Inception-v4-inspired encoder to achieve state-of-the-art results in 2-dimensional lung nodule segmentation.
In [10] and [14], the authors introduced U-Net++, a network equipped with intermediate upsampling paths and additional convolutional layers, leading to what is essentially an efficient ensemble of U-Nets of varying depths, and demonstrated its superiority compared to the standard U-Net in many image domains. Other works put emphasis on the design of skip connections and the way the higher resolution semantic information joins the features coming through the upsampling branch. In [12], the authors proposed the architecture BCDU-Net, in which, instead of the simple concatenation of the corresponding filters, the features of different levels are fused using a bidirectional ConvLSTM layer, which introduces nonlinearity into the model at this point and makes more precise segmentations available. In [8] it has been shown that for medical image analysis tasks the integration of so-called Attention Gates (AGs) improves the accuracy of segmentation models while preserving computational efficiency. In [15], this network was enhanced by a critic network in a GAN-like scheme following [9], and achieved state-of-the-art results in the task of lung and heart segmentation. Other attention mechanisms were introduced in [17] and [16].
3 Network architecture

The network architecture Attention BCDU-Net we propose is a modification of the Attention U-Net, shown in Figure 1.

Figure 1: Schematic architecture of Attention U-Net [8].

Figure 2: Schematic figure of the attention gate used in Attention U-Net [8]; the tensor addition to be altered is highlighted by an arrow.

In [12], the authors demonstrated that it is beneficial to use bidirectional ConvLSTM layers to introduce nonlinearity in the step of merging the semantic information gained through skip connections and the features arriving through the decoder. This inspired us to modify the attention gates (see Figure 2) in a similar manner: in these gates, the two pieces of information are merged via tensor addition, which is a linear operation as well. This addition is replaced by a bidirectional ConvLSTM layer, to which the outputs of W_g and W_x – the processed features and the semantic information, respectively – are fed in this order. We note that, to the best of our knowledge, there is a slight ambiguity about the structure of the resampling steps in the attention gate: while the official implementation is in accordance with the figure, there are widely utilized implementations in which the output of W_g is upsampled instead of downsampling the output of W_x in order to fit their shapes. We tested both solutions and did not experience a measurable difference in performance. We also experimented with the usage of additional spatial and channel attention layers as proposed by [17]; however, we found that they do not improve the performance of our model.

The depth of the network is to be determined by hyperparameter testing. Our tests confirmed that four downsampling steps result in the best performance; however, the differences are minuscule.

4 Metrics and loss functions

A standard score to compare segmentations is the Intersection over Union (IoU): given two sets of pixels X, Y, their IoU is

    IoU(X, Y) = |X ∩ Y| / |X ∪ Y|.

In the field of medical imaging, the Dice Score Coefficient (DSC) is probably the most widespread and simple way to measure the overlap ratio of the masks and the ground truth, and hence to compare and evaluate segmentations. It is a slight modification of IoU: given two sets of pixels X, Y, their DSC is

    DSC(X, Y) = 2|X ∩ Y| / (|X| + |Y|).

If Y is in fact the result of a test about which pixels are in X, we can rewrite it with the usual notation of true/false positives (TP/FP) and false negatives (FN) as

    DSC(X, Y) = 2TP / (2TP + FN + FP).
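For concreteness, both scores can be computed directly from a pair of binary masks; the following NumPy sketch (the helper name and the toy masks are ours, for illustration only) mirrors the TP/FP/FN formulation above.

```python
import numpy as np

def iou_and_dsc(pred, truth):
    """IoU and Dice score of two boolean masks of identical shape."""
    tp = np.logical_and(pred, truth).sum()   # true positives
    fp = np.logical_and(pred, ~truth).sum()  # false positives
    fn = np.logical_and(~pred, truth).sum()  # false negatives
    iou = tp / (tp + fp + fn)
    dsc = 2 * tp / (2 * tp + fp + fn)
    return iou, dsc

# Toy 1-D "masks": TP = 2, FP = 1, FN = 1
pred = np.array([1, 1, 1, 0], dtype=bool)
truth = np.array([0, 1, 1, 1], dtype=bool)
iou, dsc = iou_and_dsc(pred, truth)  # iou = 0.5, dsc = 2/3
```

Note that the two scores are monotonically related, DSC = 2·IoU/(1 + IoU), so they rank single predictions identically; averaged over a dataset, however, they can behave differently.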
We would like to use this concept in our setup. The class c we would like to segment corresponds to a set, but it is more appropriate to consider its indicator function g, that is, g_{i,c} ∈ {0, 1} equals 1 if and only if the i-th pixel belongs to the object. On the other hand, our prediction is a probability for each pixel, denoted by p_{i,c} ∈ [0, 1]. The Dice Score of the prediction in the spirit of the above description is then defined to be

    DSC = (2 Σ_{i=1}^{N} p_{i,c} g_{i,c} + ε) / (Σ_{i=1}^{N} (p_{i,c} + g_{i,c}) + ε),

where N is the total number of pixels, and ε is introduced for the sake of numerical stability and to avoid division by 0. The IoU of the prediction can be calculated in a similar manner. The linear Dice Loss (DL) of the multiclass prediction is then

    DL = Σ_c (1 − DSC_c).

A weakness of the Dice Loss is that it weights false positives and false negatives equally, which may be undesirable for imbalanced classes. This can be remedied by introducing α, β as tuneable parameters, resulting in the definition of the Tversky similarity index [1]:

    TI_c = (Σ_{i=1}^{N} p_{i,c} g_{i,c} + ε) / (Σ_{i=1}^{N} p_{i,c} g_{i,c} + α Σ_{i=1}^{N} p̄_{i,c} g_{i,c} + β Σ_{i=1}^{N} p_{i,c} ḡ_{i,c} + ε),

where p̄_{i,c} = 1 − p_{i,c} and ḡ_{i,c} = 1 − g_{i,c}, that is, the overline simply denotes the complement of the class. The Tversky Loss is obtained from the Tversky index as the Dice Loss was obtained from the Dice Score Coefficient:

    TL = Σ_c (1 − TI_c).

Another issue with the Dice Loss is that it struggles to segment small ROIs, as they do not contribute to the loss significantly. This difficulty was addressed in [11], where the authors introduced the Focal Tversky Loss in order to improve the performance of their lesion segmentation model:
    FTL = Σ_c (1 − TI_c)^{1/γ},

where γ ∈ [1, 3]. In practice, if a pixel is misclassified but the Tversky index is high, the Focal Tversky Loss is unaffected. However, if the Tversky index is small and the pixel is misclassified, the Focal Tversky Loss decreases significantly.

In our work we use multiclass DSC and IoU to evaluate segmentation performance. As our initial tests demonstrated that training our network with the Focal Tversky Loss results in better scores, we use this loss function. The optimal α, β, γ parameters should be determined by extensive hyperparameter testing and grid search; we worked below with α = 0. , β = 0. , γ = 0. .

5 Data and preprocessing

For training and validation data, we used the public Japanese Society of Radiological Technology (JSRT) dataset ([3]), available at [2]. The JSRT dataset contains a total of 247 images, all of resolution 2048 × 2048 with 12-bit grayscale levels. Both lung and heart segmentation masks are available for this dataset.

In terms of preprocessing, similarly to [15], the images were first resized to a lower resolution. X-rays are grayscale images with typically low contrast, which makes their analysis a difficult task. This obstacle might be overcome by using some sort of histogram equalization technique. The idea of standard histogram equalization is spreading out the most frequent intensity values to a higher range of the intensity domain by modifying the intensities so that their cumulative distribution function (CDF) on the complete modified image is as close to the CDF of the uniform distribution as possible. Improvements might be made by using adaptive histogram equalization, in which the above method is not utilized globally but separately on pieces of the image, in order to enhance local contrasts. However, this technique might overamplify noise in near-constant regions, hence our choice was to use Contrast Limited Adaptive Histogram Equalization (CLAHE), which counteracts this effect by clipping the histogram at a predefined value before calculating the CDF, and redistributing the clipped part equally among all the histogram bins.

Concerning data augmentation, we follow [7], in which the mixup method was used to improve glioma segmentation on brain MRIs. This slightly counter-intuitive augmentation technique was introduced by [6]: training data samples are obtained by taking random convex combinations of original image-mask pairs. That is, for image-mask pairs (x_1, y_1) and (x_2, y_2), we create a random mixed-up pair

    x = λx_1 + (1 − λ)x_2,    y = λy_1 + (1 − λ)y_2,

where λ is chosen from the beta distribution B(δ, δ) for some δ ∈ (0, ∞). In each epoch, the original samples are paired randomly, hence during the course of the training, a multitude of training samples are fed to the network. (From the mathematical point of view, as the coefficient λ is chosen independently in each case from a continuous probability distribution, the network will encounter pairwise distinct mixed-up training samples with probability 1, modulo floating point inaccuracy.) In [6], the authors argue that generating training samples via this method encourages the network to behave linearly in between training examples, which reduces the amount of undesirable oscillations when predicting outside the training examples. The choice of δ should be determined by hyperparameter testing for any network and task considered. In [6], δ ∈ [0. , 0. ] is proposed, while in [7] δ = 0. is applied.

6 Experiments

In our main tests, the JSRT dataset was randomly split so that 85% of it was used for training and the rest for validation and testing. This split was carried out independently in each case, enhancing the robustness of our results.
Besides that, we also experimented with small dataset training, in which rather modest sets of 10 and 20 X-rays were utilized as training sets. (The test set remained the same.) This enabled us to measure the benefits of mixup more transparently. In each of these cases, we trained our network with the Adam optimizer: in the former case for 50 epochs, while in the latter cases for 1000 and 500 epochs, respectively. As these epoch numbers are approximately inversely proportional to the sizes of the training sets, these choices correspond to each other in terms of training steps.
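The Focal Tversky Loss driving these trainings, as defined earlier, can be sketched as a minimal NumPy function; the α, β, γ values below are illustrative defaults in the spirit of [11], not the tuned values of our experiments, and the flattened (pixels × classes) layout is a simplifying assumption.

```python
import numpy as np

def focal_tversky_loss(p, g, alpha=0.7, beta=0.3, gamma=4 / 3, eps=1e-7):
    """p: (N, C) per-pixel class probabilities; g: (N, C) one-hot ground
    truth. alpha weights false negatives, beta false positives; the
    exponent 1/gamma focuses the loss on classes with a low Tversky index."""
    tp = (p * g).sum(axis=0)           # soft true positives per class
    fn = ((1.0 - p) * g).sum(axis=0)   # soft false negatives per class
    fp = (p * (1.0 - g)).sum(axis=0)   # soft false positives per class
    ti = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return float(((1.0 - ti) ** (1.0 / gamma)).sum())

# A perfect prediction drives the loss to ~0
g = np.eye(3)[np.array([0, 1, 2, 1])]  # 4 pixels, 3 classes, one-hot
loss_perfect = focal_tversky_loss(g.copy(), g)
loss_uniform = focal_tversky_loss(np.full((4, 3), 1 / 3), g)
```

Setting α = β = 0.5 and γ = 1 recovers the plain Dice-style loss, which makes the Tversky family convenient for hyperparameter search over a single parametrization.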
7 Results

Table 1 summarizes the numerical results we obtained during the testing of Attention BCDU-Net with different train sizes and choices of δ, while Figures 3-5 display visual results. Note that the highest DSC scores slightly exceed the ones attained by the state-of-the-art, adversarially enhanced Attention U-Net introduced in [15] ( . ± . ) and admit higher stability. The effect of augmentation is the most striking in the case of training on an X-ray set of size 10, when the choice δ = 0. results in a 5% increase of IoU compared to the no-mixup case. In general, we found this case particularly interesting: it was surprising that we could achieve IoU and DSC scores of this magnitude using such a small training set. Nevertheless, the predictions have some imperfections, displayed in Figure 3: the contours of the segmentation are less clear, and both the heart and the lung segmentation tend to contain small spots far from the ground truth. However, such conspicuous faults are unlikely to occur in the case of the best models for 20 train X-rays (Figure 4), which is still remarkable. The sufficiency of such small training sets is probably due to the relative simplicity of the task. Notably, lung and heart regions admit large similarity across a set of chest X-rays, and they are strongly correlated with simple intensity thresholds. Consequently, even small datasets have high representing potential. We note that as δ gets smaller, the probability density function of B(δ, δ) gets more strongly skewed towards the endpoints of the interval [0, 1], which results in mixed-up samples being closer to the original samples in general. The perceived optimality of δ = 0. in the small dataset cases shows that a considerable augmentation is beneficial and desirable, yet it is unadvised to use too wildly modified samples.

The benevolent effect of mixup gets more obscure as we increase the size of the training set.
Notably, the results of the different augmentation setups are almost indistinguishable from each other. We interpret this phenomenon as another consequence of the similarity of masks from different samples, which inherently drives the network towards simpler representations in the case of a sufficiently broad training set, even without using mixup.

We also note that in the case of 10 training samples, while the IoU differences between the no-mixup and the mixup regimes are striking, the gain in DSC is less remarkable. This hints that it is unadvised to rely merely on DSC when evaluating and comparing segmentation models.

Figure 3: Ground truth (left) compared to the prediction of the Attention BCDU-Net (right), train size: 10.

Figure 4: Ground truth (left) compared to the prediction of the Attention BCDU-Net (right), train size: 20.

Figure 5: Ground truth (left) compared to the prediction of the Attention BCDU-Net (right), train size: complete.

              | Train size: 10       | Train size: 20       | Train size: 209 (complete)
              | IoU        DSC       | IoU        DSC       | IoU        DSC
    No mixup  | . ± . %    . ± . %   | . ± . %    . ± . %   | . ± . %    . ± . %
    δ = 0.    | . ± . %    . ± . %   | . ± . %    . ± . %   | . ± . %    . ± . %
    δ = 0.    | . ± . %    . ± . %   | . ± . %    . ± . %   | . ± . %    . ± . %
    δ = 0.    | . ± . %    . ± . %   | . ± . %    . ± . %   | . ± . %    . ± . %
    δ = 0.    | . ± . %    . ± . %   | . ± . %    . ± . %   | . ± . %    . ± . %

Table 1: Dice scores and IoU scores of Attention BCDU-Net with different mixup parameters.

We would also like to draw attention to the peculiar loss curves we primarily encountered during the small dataset trainings, as displayed in Figure 6. Notably, the curve of the validation DSC flattens far below the also-flattening curve of the train DSC, strongly inciting the usage of early stopping. (Train DSC in fact reaches essentially 1, which is unsurprising with such a small training set.) However, in the later stages the validation DSC catches up, even though the train DSC does not have any room for further improvement.
We were especially puzzled by this behaviour in the 10-sized training setup, in which both the train and validation DSC seem completely stabilized from epoch 50 to epoch 400, yet the validation DSC skyrockets in the later stages in a very short amount of time. The same behaviour was experienced during each test run. We have yet to find an intuitive or theoretical explanation for this phenomenon, that is, how the generalizing ability of the model can improve further when it seems to be in a perfect state from the training perspective. We note that these observations naturally led us to experiment with even longer trainings, but to no avail.

Figure 6: From left to right: the evolution of the train DSC (blue) and the validation DSC (orange) with 10 training samples, 20 training samples, and the complete training dataset, respectively. The IoU curves admit similar patterns.

8 Conclusion

In the present work, we addressed the problem of automated lung and heart segmentation on chest X-rays. We introduced a new model, Attention BCDU-Net, a variant of Attention U-Net equipped with modified attention gates, and surpassed previous state-of-the-art results. We also demonstrated its ability to attain surprisingly reasonable results with strongly limited training sets. Performance in these cases was enhanced using the mixup augmentation technique, resulting in a highly notable contribution to the IoU score.

Concerning future work, a natural extension would be adding a structure-correcting adversarial network to the training scheme, similarly to [9] and [15], and measuring its effect on the performance, especially in the setup of limited training sets. We would also like to give some kind of explanation for the phenomenon of the peculiar loss curves.
Acknowledgements
The project was supported by the grant EFOP-3.6.3-VEKOP-16-2017-00002.
References

[1] Amos Tversky. "Features of similarity". In: Psychological Review 84.4 (1977), pp. 327-352.
[2] Japanese Society of Radiological Technology. Digital Image Database. http://db.jsrt.or.jp/eng.php. 2000.
[3] Junji Shiraishi et al. "Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules". In: American Journal of Roentgenology 174 (2000), pp. 71-74.
[4] Jonathan Long, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3431-3440.
[5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation". In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Springer International Publishing, 2015, pp. 234-241. ISBN: 978-3-319-24574-4. URL: https://doi.org/10.1007/978-3-319-24574-4_28.
[6] Hongyi Zhang et al. "mixup: Beyond Empirical Risk Minimization". In: (Oct. 2017).
[7] Zach Eaton-Rosen et al. "Improving Data Augmentation for Medical Image Segmentation". In: (2018).
[8] Ozan Oktay et al. "Attention U-Net: Learning Where to Look for the Pancreas". In: arXiv:1804.03999 (2018).
[9] Wei Dai et al. "SCAN: Structure Correcting Adversarial Network for Organ Segmentation in Chest X-Rays". In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings. Vol. 11045. Springer, 2018, p. 263.
[10] Zongwei Zhou et al. "UNet++: A Nested U-Net Architecture for Medical Image Segmentation". In: (July 2018).
[11] Nabila Abraham and Naimul Mefraz Khan. "A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation". In: 2019 IEEE International Symposium on Biomedical Imaging (ISBI). IEEE, 2019, pp. 683-687.
[12] Reza Azad et al. "Bi-Directional ConvLSTM U-Net with Densley Connected Convolutions". In: (2019), pp. 406-415.
[13] NHS England and NHS Improvement. "Diagnostic Imaging Dataset Statistical Release". In: (2019).
[14] Zongwei Zhou et al. "UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation". In: IEEE Transactions on Medical Imaging (Dec. 2019).
[15] Gusztáv Gaál, Balázs Maga, and András Lukács. "Attention U-Net Based Adversarial Architectures for Chest X-ray Lung Segmentation". In: arXiv:2003.10304 (Mar. 2020).
[16] Trinh Le Ba Khanh et al. "Enhancing U-Net with Spatial-Channel Attention Gate for Abnormal Tissue Segmentation in Medical Imaging". In: Applied Sciences 10 (Aug. 2020).
[17] Peng Zhao et al. "SCAU-Net: Spatial-Channel Attention U-Net for Gland Segmentation". In: Frontiers in Bioengineering and Biotechnology (2020). ISSN: 2296-4185.
[18] Giuseppe Pezzano, Vicent Ribas Ripoll, and Petia Radeva. "CoLe-CNN: Context-learning convolutional neural network with adaptive loss function for lung nodule segmentation". In: Computer Methods and Programs in Biomedicine 198 (2021), p. 105792. ISSN: 0169-2607. DOI: 10.1016/j.cmpb.2020.105792.