Reducing Labelled Data Requirement for Pneumonia Segmentation using Image Augmentations
Jitesh Seth, Rohit Lokwani, Viraj Kulkarni, Aniruddha Pant, Amit Kharat
Indian Institute of Science Education and Research, Pune
DeepTek Inc
Abstract
Deep learning semantic segmentation algorithms can localise abnormalities or opacities from chest radiographs. However, the task of collecting and annotating training data is expensive and requires expertise, which remains a bottleneck for algorithm performance. We investigate the effect of image augmentations on reducing the requirement of labelled data in the semantic segmentation of chest X-rays for pneumonia detection. We train fully convolutional network models on subsets of different sizes from the total training data. We apply a different image augmentation while training each model and compare it to the baseline trained on the entire dataset without augmentations. We find that rotate and mixup are the best augmentations amongst rotate, mixup, translate, gamma and horizontal flip, reducing the labelled data requirement by 70% while performing comparably to the baseline in terms of AUC and mean IoU in our experiments.
Introduction

The progress in computer vision has had a substantial impact on radiology [1; 2]. Deep learning approaches such as Convolutional Neural Networks (CNNs) have shown great success in the classification and segmentation of radiographs like chest X-rays and CT scans [3; 4]. Research on pathology identification in chest X-rays (CXRs) has mainly focussed on classification, where models predict a class label from a broad set of pathologies. Such an approach does not directly inform us of the regions in the CXRs responsible for the class label. It requires an additional step of plotting saliency maps or gradient class activation maps to interpret how the CNN is making decisions [5].

Segmentation and object detection algorithms provide an advantage over standard CNNs because they can predict the regions of interest. Architectures such as U-Net [6] and RetinaNet [7] localise the regions responsible for a specific pathology. Moreover, semantic segmentation models are more useful than classification models in two ways: they require less training data since they have pixel-level labels for every image, and they can assist radiologists in their work by localising the abnormalities [8].

However, the cost and effort to label the data is often a significant constraint on training models that perform well in practice [9]. There have been efforts on both fronts: to increase the amount of labelled data available and to create techniques that can learn better from small amounts of labelled data. The former include large databases such as CheXpert [10] and ChestX-ray8 [11], whereas the latter consists of image augmentations, semi-supervised learning, and special architectures such as U-Nets [6], GANs [12] and others.

Image augmentation is a vital data processing technique for improving the performance of machine learning models. Augmentations can make the CNN indifferent to naturally present variations in the data, such as position, scale, or different radiography equipment. Augmentations can be of varied types: rotate, scale, flip and similar transformations are called geometric augmentations, whereas photometric augmentations transform the colour space of the images. More complex transformations include elastic deformation or mixing multiple images. However, inflating the dataset with numerous augmentations would add to the training time and compute requirements without necessarily increasing performance. A drawback of augmentations is that they may cause overfitting by making the CNN invariant to some features but highly tailored to the training data in others [13].

The effect of augmentations on the performance of segmentation models in the medical domain is not sufficiently addressed in research. Knowledge of specific augmentations which reduce the labelled data requirement will help researchers and data scientists fine-tune their models faster and better. Moreover, data augmentation studies typically investigate the increase in model performance rather than the decrease in the labelled training data requirement. Our paper specifically addresses the reduction in the training data requirement using image augmentations.

In this paper, we implement five different augmentations in the training of CNN models on chest X-rays. We propose three criteria for identifying augmentations that reduce the labelled data requirement. First, the model should perform comparably to the baseline on a subset of the data. Second, the models with augmentation trained on partial data should perform better than models without any augmentations trained on the same data.
Third, the model should satisfy the criteria above for multiple test sets. We validated our results with an in-sample and an out-of-sample test set.

Related Work

This section covers the recent advancements in computer-aided diagnosis (CADx) using deep learning, image augmentations and semantic segmentation in the field of CXRs. CADx for chest radiography started in the 1960s [14], but deep learning has transformed and dominated the field in a few years. Van Ginneken [2] has comprehensively summarised the evolution of CADx from rule-based and machine learning approaches to deep learning ones. Kermany et al. [15] showed the generalizability of deep neural networks by using the same neural network to classify retinal Optical Coherence Tomography images and pediatric pneumonia in CXRs. The latter model achieved an accuracy of 92.8%, and the area under the ROC curve was 96.8%.

Recent studies have shown that image augmentations improve machine learning model performance. Sirazitdinov et al. [16] showed the effect of augmentations on the classification of chest radiographs. They concluded that a combination of increasing brightness, random rotation and horizontal flips led to the best performance on the ChestX-Ray14 dataset, with an AUC-ROC of 0.808 (compared to 0.785 without any augmentations). However, they do not quantify the extent of each augmentation, which limits reproducibility. After the introduction of mixup [17], Eaton-Rosen et al. [18] applied it to a dataset of MRI images. They provide a graphical overview of mixup compared to other augmentations and a baseline for a large (199 images) and a small (10 images) dataset.

Souza et al. [19] created an automatic method for segmentation and reconstruction of lungs, which can take into account lung opacities from pneumonia or tuberculosis, reconstruct the lung boundaries, and finally segment the lungs. They used the segmented lungs for a classification model, which achieved an accuracy of 96.97% and an average Dice coefficient of 0.94 on the Montgomery County Tuberculosis Control dataset [20]. Selvan et al. [21] tackled the same problem by treating high-opacity regions as missing data and using a variational auto-encoder for data imputation. They achieved an accuracy of 88.15% and a Dice coefficient of 0.8503 on a curated CXR dataset. Thus, segmentation was mainly used to demarcate the lungs for use in classification models. However, we can also use semantic segmentation algorithms to demarcate lung opacities. This became feasible after the publication of the Radiological Society of North America's (RSNA) pneumonia dataset [22], also used in this study.

Wu et al. took the concept of lung segmentation and opacity detection one step forward. They segmented both lungs, divided them into three zones each, and predicted the presence of pneumonia in each zone using the patient's radiology report, thus creating an object detection dataset from radiology reports. Using this dataset, they trained a RetinaNet model and tested it on the RSNA dataset. The model had a mean IoU of 0.29 per pneumonia-positive image. Hurt et al. have shown that semantic segmentation of CXRs can be used as a probability map to interpret the radiographs.
Their segmentation model on the RSNA dataset showed a Dice coefficient of 0.603, and the classification had an AUC of 0.854.

An essential aspect of deep learning in CADx is the usability of the models: we do not desire clever models that might end up being clinically irrelevant [23]. For example, both Pan and Cadrin-Chênevert's and Cheng's models for the RSNA dataset [24] systematically decreased the predicted bounding boxes by 12-17%, which increased the performance on the particular test set, but there is no medically relevant reason to do the same in practical settings. Another practical constraint with Pan's models was extensive ensembling, which requires the availability of high-end GPUs.
Data

For this study, we use the publicly available pneumonia dataset jointly annotated by the RSNA and the Society of Thoracic Radiology (STR) [22]. The dataset consists of DICOM images of chest X-rays (CXRs) with dimensions 1024 × 1024. Pneumonia-positive images contain bounding-box annotations of lung opacities. The dataset consists of about 30,000 frontal CXRs with bounding boxes around the lung opacities.

From this dataset, we select all the pneumonia-positive CXRs (n=6012) and a subset of the negative CXRs (n=8488) and divide them into training (n=10000), validation (n=1500) and test (n=3000) sets. Each CXR belongs to a unique individual. All three sets have the same prevalence of the positive class (41.4%).

We also test all trained models on an out-of-sample test set curated from PadChest [25] and four private hospitals and population screening programmes from India and Indonesia. This set contains 1125 pneumonia-positive CXRs with the corresponding polygonal or rectangular annotations and 1875 pneumonia-negative CXRs.

Model Training and Evaluation

We use a U-Net-like CNN with depthwise separable convolutions, implemented using the Keras library [26]. We resize the input into an image of shape (512, 512, 3). The network has two parts, an encoder and a decoder. The encoder uses residual connections, depthwise separable convolutions, 2D convolutions and max pooling. The decoder uses transposed convolutions, 2D upsampling and 2D convolutions.

We use binary cross-entropy as the loss function, and the evaluation metric is the mean intersection over union (IoU). We use Adam to optimise the loss function, and the learning rate is scheduled to decrease when the validation loss plateaus for 5 epochs. We train each model until it has shown no improvement in validation loss for at least 5 epochs and save the model weights with the lowest validation loss for evaluation.

Figure 1: Original CXR with opacities labelled (left) and result of augmentations with updated bounding boxes (right).

Figure 2: Two examples of CXRs with the ground truth mask (left) and the mask predicted by the baseline model (right).

After training, we evaluate each model on the in-sample and out-of-sample test sets. The predicted masks are compared to the ground-truth masks to calculate the mean IoU and the loss. We use the segmentation results to classify each CXR as positive or negative and calculate the area under the curve (AUC) of the receiver operating characteristic (ROC) based on this classification.
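A minimal sketch, assuming TensorFlow/Keras 2.6 or later, of a training configuration matching the description above. Only the loss, optimiser, metric and the 5-epoch patience values come from the text; `build_unet`, `train_ds`/`val_ds`, `test_images`/`test_labels`, the initial learning rate, the epoch cap and the max-probability image-scoring rule are assumptions for illustration.

```python
import tensorflow as tf
from sklearn.metrics import roc_auc_score

# Hypothetical builder for the U-Net-like model, e.g. adapted from the Keras
# segmentation example cited as [26].
model = build_unet(input_shape=(512, 512, 3))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # assumed initial LR
    loss=tf.keras.losses.BinaryCrossentropy(),               # pixel-wise BCE
    metrics=[tf.keras.metrics.BinaryIoU(threshold=0.5)],     # one way to track mean IoU
)

callbacks = [
    # Decrease the learning rate when the validation loss plateaus for 5 epochs.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", patience=5),
    # Stop once the validation loss has not improved for 5 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5),
    # Keep only the weights with the lowest validation loss.
    tf.keras.callbacks.ModelCheckpoint(
        "best_weights.h5", monitor="val_loss",
        save_best_only=True, save_weights_only=True,
    ),
]

model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)

# Image-level classification from the segmentation output: one plausible rule
# (the exact rule is not specified in the text) is to score each CXR by its
# maximum predicted opacity probability and compute the ROC AUC from that score.
probs = model.predict(test_images)                      # shape (N, 512, 512, 1)
image_scores = probs.reshape(len(probs), -1).max(axis=1)
auc = roc_auc_score(test_labels, image_scores)
```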
Image Augmentations

We have chosen five image augmentations for this study: random rotation within a small symmetric angle range, changing the gamma between 0.75 and 1.25, translating the image randomly by 0-5% of its length in the x and y directions, horizontal flips, and mixup [17]. In mixup, two images and their masks (represented by x₁ and x₂) are combined using the formula

x = λ x₁ + (1 − λ) x₂,

where λ is drawn from a Beta distribution. These augmentations have negligible computational cost. Mixup results in better-performing segmentation models according to recent studies [18].

We train the baseline (i.e. no augmentations) on 100% of the training set. For each of the six conditions (five augmentations and one with no augmentation, hereafter referred to as "NoAug"), models were trained using 30%, 50%, 70% and 90% of the training set.

Figure 3: (Top) original CXRs for mixup; (bottom left) result of mixup augmentation and (bottom right) corresponding mask.
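A minimal NumPy sketch of mixup for segmentation, in which the same λ blends both the images and their masks; the Beta parameter alpha is illustrative and not a value taken from this study.

```python
import numpy as np

def mixup(image_1, mask_1, image_2, mask_2, alpha=0.2):
    """Blend two image/mask pairs with a single lambda drawn from Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    image = lam * image_1 + (1.0 - lam) * image_2
    mask = lam * mask_1 + (1.0 - lam) * mask_2
    return image, mask
```

Because the same λ is applied to the masks, the blended masks remain valid soft targets for the pixel-wise binary cross-entropy loss.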
Statistical Analysis

We compare each model with the baseline and with the NoAug model trained on the same amount of data. We use the non-parametric DeLong test [27] to compare the AUCs of the models' classification performance, with the significance level set at 0.05. We use the pROC library in R [28] to perform these tests.
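The DeLong comparisons are performed with the pROC library in R; purely as an illustration, the sketch below (assuming the rpy2 bridge and the pROC package are installed, and hypothetical `labels`, `scores_a` and `scores_b` arrays holding ground-truth image labels and the per-image scores of two models) shows one way such a comparison could be scripted from Python.

```python
from rpy2.robjects import FloatVector, IntVector
from rpy2.robjects.packages import importr

proc = importr("pROC")  # R's pROC package [28]

def delong_pvalue(labels, scores_a, scores_b, alternative="two.sided"):
    """Return the DeLong test p-value comparing two ROC curves on the same labels."""
    roc_a = proc.roc(IntVector(labels), FloatVector(scores_a))
    roc_b = proc.roc(IntVector(labels), FloatVector(scores_b))
    test = proc.roc_test(roc_a, roc_b, method="delong", alternative=alternative)
    return float(test.rx2("p.value")[0])

# Example: one-tailed test of whether model A's AUC is greater than model B's.
# p = delong_pvalue(labels, scores_a, scores_b, alternative="greater")
```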
Results

For comparison of the models' segmentation performance, we plot each model's mean IoU in Fig. 4. We also compare the different models' classification performance using the AUC, in Fig. 5.

It is essential to note the behaviour of the NoAug models. In both Fig. 4 and Fig. 5, we see that the performance of the NoAug models is lower than the baseline for 30% and 50% of the data. However, the NoAug models with 70% and 90% of the data perform as well as, or even better than, the baseline. It is therefore unnecessary to study augmentations at 70% or more of the data in this study. The p-values for the one-tailed DeLong test between the AUCs of NoAug and the baseline are far below 0.05 on the internal test set.

Figure 4: Mean IoU on the (a) internal test set and (b) external test set for the various models trained; (c) the legend.

Figure 5: AUC ROC on the (a) internal and (b) external test sets for the various models.

Table 1: p-values of the DeLong test comparing each model to the baseline, on the internal and external test sets with 30% and 50% of the training data. The first value in each cell corresponds to the two-tailed DeLong test and the second to the one-tailed DeLong test. Bold indicates that the value satisfied our criteria, as explained in Results.

Augmentation        Internal 30%   Internal 50%   External 30%   External 50%
Baseline on 100%    0.8569         0.8569         0.9298         0.9298
None                0.8251         0.854          0.859          0.9058
Translate           0.8615         0.8515         0.9106         0.9232
Rotate              0.8626         0.8587         0.9373         0.9251
Gamma               0.8464         0.868          0.8954         0.9428
Mixup               0.8574         0.8638         0.9042         0.9329
Flip                0.8584         0.8483         0.9249         0.9135
Table 2: The AUC ROC of the different augmentations calculated on 30% and 50% data for both internal and external test sets.
To check the first criterion, we perform two DeLong hypothesis tests to see whether the augmentation models' performance is comparable to the baseline. The first has the null hypothesis that the AUC of the augmentation model is equal to that of the baseline trained on 100% of the data. In this case, a p-value larger than 0.05 means that we fail to reject the null hypothesis, and thus we can say the two AUCs are comparable. In the second test, the null hypothesis is that the AUC of the augmentation model is less than that of the baseline. Here, a p-value of less than 0.05 means that we can reject the null hypothesis, and the AUC of the augmentation model is significantly larger than that of the baseline. We find that, except for gamma with 30% data and flip with 50% data, all models trained on 30% and 50% data pass either of the two tests on the internal test set. However, for the external test set, only rotate and flip with 30% data, and translate, rotate, gamma and mixup with 50% data, pass the hypothesis tests (see Table 1 for the p-values).

We check the second criterion by performing a one-tailed DeLong test on the AUC of each augmentation against the NoAug model trained on the same data, on both test sets. We find that, for the external test set, all augmentations with 30% and 50% data perform significantly better than NoAug (p < 0.05). For the internal test set, all augmentations trained on 30% data perform better than NoAug with 30% data. However, on 50% data, only gamma and mixup perform better than NoAug.

We propose that good augmentations should satisfy both criteria above on both test sets. We find that rotate and flip with 30% data, and gamma and mixup with 50% data, accomplish this.

On the other hand, the mean IoU plot informs us of the best data augmentations at the pixel level. As we can see on the internal test set (Fig. 4(a)), rotate and mixup perform quite well at 30%, whereas mixup and flip perform better at 50%. For the external test set (Fig. 4(b)), we see rotate and mixup at 30% and flip and gamma at 50% performing better than NoAug and almost as well as the baseline.

Therefore, we find that rotate and mixup are the best augmentations for semantic segmentation on our dataset. These augmentations are capable of reducing the labelled data requirement by as much as 70%.

Discussion

Users of deep learning algorithms often overlook image augmentations as an extra step for a minor boost in performance. Our study has shown that augmentations are capable of reducing the amount of labelled training data required. To the best of our knowledge, this is the first study addressing augmentations in this manner.

There is an assumption that augmentations are most useful when the training and test data are from the same distribution [13]. Using two test sets from vastly different populations (the internal test set being from the RSNA dataset, sourced from the USA, and the out-of-sample test set sourced from India) and achieving good results on both, we have shown that this assumption need not hold.

While this study looked at individual augmentations, there is still scope for improving model performance and reducing the labelled data requirement further by combining multiple augmentations.
References

[1] M. P. McBee, O. A. Awan, A. T. Colucci, C. W. Ghobadi, N. Kadom, A. P. Kansagra, S. Tridandapani, and W. F. Auffermann, "Deep Learning in Radiology," Academic Radiology, vol. 25, pp. 1472–1480, Nov. 2018.
[2] B. van Ginneken, "Fifty years of computer analysis in chest imaging: rule-based, machine learning, deep learning," Radiological Physics and Technology, vol. 10, pp. 23–32, Feb. 2017.
[3] P. Lakhani and B. Sundaram, "Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks," Radiology, vol. 284, pp. 574–582, Apr. 2017.
[4] J. A. Dunnmon, D. Yi, C. P. Langlotz, C. Ré, D. L. Rubin, and M. P. Lungren, "Assessment of Convolutional Neural Networks for Automated Classification of Chest Radiographs," Radiology, vol. 290, pp. 537–544, Nov. 2018.
[5] F. Pasa, V. Golkov, F. Pfeiffer, D. Cremers, and D. Pfeiffer, "Efficient deep network architectures for fast chest x-ray tuberculosis screening and visualization," Scientific Reports, vol. 9, no. 1, p. 6268, 2019.
[6] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," arXiv:1505.04597 [cs], May 2015.
[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal Loss for Dense Object Detection," arXiv:1708.02002 [cs], Feb. 2018.
[8] B. Hurt, A. Yen, S. Kligerman, and A. Hsiao, "Augmenting Interpretation of Chest Radiographs With Deep Learning Probability Maps," Journal of Thoracic Imaging, vol. 35, pp. 285–293, Sept. 2020.
[9] L. M. Prevedello, S. S. Halabi, G. Shih, C. C. Wu, M. D. Kohli, F. H. Chokshi, B. J. Erickson, J. Kalpathy-Cramer, K. P. Andriole, and A. E. Flanders, "Challenges Related to Artificial Intelligence Research in Medical Imaging and the Importance of Image Analysis Competitions," Radiology: Artificial Intelligence, vol. 1, p. e180031, Jan. 2019.
[10] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, J. Seekins, D. A. Mong, S. S. Halabi, J. K. Sandberg, R. Jones, D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, and A. Y. Ng, "CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison," arXiv:1901.07031 [cs, eess], Jan. 2019.
[11] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, "ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3462–3471, July 2017. ISSN: 1063-6919.
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Networks," arXiv:1406.2661 [cs, stat], June 2014.
[13] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, vol. 6, no. 60, 2019.
[14] B. V. Ginneken, B. M. T. H. Romeny, and M. A. Viergever, "Computer-aided diagnosis in chest radiography: a survey," IEEE Transactions on Medical Imaging, vol. 20, pp. 1228–1241, Dec. 2001.
[15] D. S. Kermany, M. Goldbaum, W. Cai, C. C. S. Valentim, H. Liang, S. L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan, J. Dong, M. K. Prasadha, J. Pei, M. Y. L. Ting, J. Zhu, C. Li, S. Hewett, J. Dong, I. Ziyar, A. Shi, R. Zhang, L. Zheng, R. Hou, W. Shi, X. Fu, Y. Duan, V. A. N. Huu, C. Wen, E. D. Zhang, C. L. Zhang, O. Li, X. Wang, M. A. Singer, X. Sun, J. Xu, A. Tafreshi, M. A. Lewis, H. Xia, and K. Zhang, "Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning," Cell, vol. 172, pp. 1122–1131.e9, Feb. 2018.
[16] I. Sirazitdinov, M. Kholiavchenko, R. Kuleev, and B. Ibragimov, "Data Augmentation for Chest Pathologies Classification," in 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 1216–1219, Apr. 2019. ISSN: 1945-8452.
[17] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond Empirical Risk Minimization," arXiv:1710.09412 [cs, stat], Apr. 2018.
[18] Z. Eaton-Rosen, F. Bragman, S. Ourselin, and M. J. Cardoso, "Improving Data Augmentation for Medical Image Segmentation," Apr. 2018.
[19] J. C. Souza, J. O. B. Diniz, J. L. Ferreira, G. L. F. da Silva, A. C. Silva, and A. C. de Paiva, "An automatic method for lung segmentation and reconstruction in chest X-ray using deep neural networks," Computer Methods and Programs in Biomedicine, vol. 177, pp. 285–296, Aug. 2019.
[20] S. Jaeger, S. Candemir, S. Antani, Y.-X. J. Wáng, P.-X. Lu, and G. Thoma, "Two public chest X-ray datasets for computer-aided screening of pulmonary diseases," Quantitative Imaging in Medicine and Surgery, vol. 4, pp. 475–477, Dec. 2014.
[21] R. Selvan, E. B. Dam, N. S. Detlefsen, S. Rischel, K. Sheng, M. Nielsen, and A. Pai, "Lung Segmentation from Chest X-rays using Variational Data Imputation," arXiv:2005.10052 [cs, eess, stat], July 2020.
[22] G. Shih, C. C. Wu, S. S. Halabi, M. D. Kohli, L. M. Prevedello, T. S. Cook, A. Sharma, J. K. Amorosa, V. Arteaga, M. Galperin-Aizenberg, R. R. Gill, M. C. Godoy, S. Hobbs, J. Jeudy, A. Laroia, P. N. Shah, D. Vummidi, K. Yaddanapudi, and A. Stein, "Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia," Radiology: Artificial Intelligence, vol. 1, p. e180041, Jan. 2019.
[23] A. S. Lundervold and A. Lundervold, "An overview of deep learning in medical imaging focusing on MRI," Zeitschrift für Medizinische Physik, vol. 29, pp. 102–127, May 2019.
[24] I. Pan, A. Cadrin-Chênevert, and P. M. Cheng, "Tackling the Radiological Society of North America Pneumonia Detection Challenge," American Journal of Roentgenology, vol. 213, pp. 568–574, May 2019.
[25] A. Bustos, A. Pertusa, J.-M. Salinas, and M. de la Iglesia-Vayá, "PadChest: A large chest x-ray image dataset with multi-label annotated reports," Medical Image Analysis, vol. 66, p. 101797, Dec. 2020. arXiv: 1901.07441.
[26] F. Chollet, "Keras documentation: Image segmentation with a U-Net-like architecture." https://keras.io/examples/vision/oxford_pets_image_segmentation/. Accessed: 28-Dec-2020.
[27] E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, "Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach," Biometrics, pp. 837–845, 1988.
[28] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.-C. Sanchez, and M. Müller, "pROC: an open-source package for R and S+ to analyze and compare ROC curves," BMC Bioinformatics, vol. 12, p. 77, 2011.