Semi-Supervised Deep Learning for Abnormality Classification in Retinal Images
Bruno Lecouat, Ken Chang, Chuan-Sheng Foo, Balagopal Unnikrishnan, James M. Brown, Houssam Zenati, Andrew Beers, Vijay Chandrasekhar, Jayashree Kalpathy-Cramer, Pavitra Krishnaswamy
Institute for Infocomm Research, A*STAR, Singapore
Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Boston, MA, USA
Abstract
Supervised deep learning algorithms have enabled significant performance gains in medical image classification tasks. But these methods rely on large labeled datasets that require resource-intensive expert annotation. Semi-supervised generative adversarial network (GAN) approaches offer a means to learn from limited labeled data alongside larger unlabeled datasets, but have not been applied to discern fine-scale, sparse or localized features that define medical abnormalities. To overcome these limitations, we propose a patch-based semi-supervised learning approach and evaluate performance on classification of diabetic retinopathy from funduscopic images. Our semi-supervised approach achieves high AUC with just 10-20 labeled training images, and outperforms the supervised baselines by up to 15% when less than 30% of the training dataset is labeled. Further, our method implicitly enables interpretation of the SSL predictions. As this approach enables good accuracy, resolution and interpretability with lower annotation burden, it sets the pathway for scalable applications of deep learning in clinical imaging.
Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.

Deep learning is driving significant advances in automated analysis and interpretation of medical images for applications spanning reconstruction, segmentation, diagnosis, prognosis and treatment response assessment in radiology, dermatology, pathology, oncology and ophthalmology (5; 3; 6). Typically, medical imaging studies employ supervised convolutional neural networks (CNNs) that require large datasets annotated by experts to obtain high predictive performance. In many applications, this annotation burden is further exacerbated by the need for multiple annotations to reduce labeling noise (6; 8).

Semi-supervised deep learning (SSL) algorithms that combine small labeled datasets with larger unlabeled datasets offer a means to address these limitations. Recent works have explored SSL approaches based on generative adversarial networks (GANs), and showed applicability to classifying skin and heart disease (19; 10; 11). While these studies have demonstrated feasibility of the GAN approach, they have been limited to low resolution images ( × to × pixels) and to applications where clinically relevant features are present in large parts of the image. However, in many cases, clinical image classification relies on fine features that are only visible in high resolution images and/or are sparsely distributed throughout the image. Further, the ability to interpret classifier predictions would be desirable in practical scenarios.

To address these needs, we frame the semi-supervised medical image classification problem as one of learning from very few labeled images with granular annotations, alongside a larger set of unlabeled images. As an example, we consider patch-level annotations for the labeled set, and propose to perform SSL at the patch level. We then aggregate predictions from individual patches of a given image into an image-level classification without requiring additional annotation.
As this approach treats images as composites of finely labeled entities, it can overcome the resolution and interpretation limitations encountered in previous works employing SSL using GANs for medical imaging and computer vision applications (18; 15).

We demonstrate this approach on the task of classifying abnormalities in retinal fundoscopy images obtained from Diabetic Retinopathy (DR) patients. DR patients are diagnosed based on the presence of fine-scaled, sparse and localized microaneurysms, soft exudates, hard exudates, and hemorrhages, which are hard to annotate. Therefore, computer-aided screening for early detection and intervention (13) requires high-resolution images. We show that our GAN-based semi-supervised method can provide accurate classification with far fewer labeled samples than CNN-based supervised methods. Further, we demonstrate that our method can effectively detect fine-scale anomalies and classify spatially sparse abnormalities in an interpretable manner. Finally, we discuss directions to enable translation of semi-supervised deep learning methods for practical use in clinical imaging.

We propose a patch-based semi-supervised classification framework where high-resolution medical images are divided into equal sized patches before being used for training or prediction with a semi-supervised GAN (Figure 1). Predictions on individual patches are then aggregated to produce an image-level prediction. Our patch-based approach enables visualization and localization of salient predictive features by overlaying patch-level predictions onto the input image.
[Figure 1 panel labels: image size 1024 x 1024; 64 patches (128 x 128); resize to 32 x 32; SSL GAN; patch-level predictions; sum patch-level predictions; prediction: Diseased]
Figure 1: Overview of patch-based semi-supervised classification approach.
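The patch grid in Figure 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `split_into_patches`, the nested-list image representation, and the toy 16 x 16 input are all hypothetical. With a 1024 x 1024 image and an 8 x 8 grid, the same code would yield the 64 patches of 128 x 128 named in the figure.

```python
from typing import List

Image = List[List[float]]  # one 2-D channel as nested lists (illustrative only)


def split_into_patches(image: Image, grid: int = 8) -> List[Image]:
    """Split a square image into grid x grid equal, non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    size = len(image)
    if any(len(row) != size for row in image):
        raise ValueError("expected a square image")
    if size % grid != 0:
        raise ValueError("image side must be divisible by the grid size")
    p = size // grid  # patch side length
    patches = []
    for r in range(0, size, p):
        for c in range(0, size, p):
            # Take rows r..r+p-1 and, within each, columns c..c+p-1.
            patches.append([row[c:c + p] for row in image[r:r + p]])
    return patches


# Toy example: a 16 x 16 "image" on an 8 x 8 grid gives 64 patches of 2 x 2.
toy = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
toy_patches = split_into_patches(toy, grid=8)
```

Keeping the patches in row-major order means a patch index can be mapped back to its grid position, which is what later allows patch-level predictions to be overlaid on the input image.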
Semi-supervised GAN (SSL-GAN): GANs (4) are a class of deep generative neural network models that have been successful in modeling distributions over natural images (14). A typical GAN consists of a generator network $G$ and discriminator network $D$. In the course of training a GAN on a set of (unlabeled) images, the generator learns how to map a low-dimensional set of random variables $Z$ to generate images, while the discriminator learns how to differentiate between those generated images and real images present in the training set.

We build upon the semi-supervised feature-matching GAN framework (16). This method extends the discriminator $D$ to determine the specific class of the image in addition to determining whether the image is real or generated. Formally, suppose we are given a dataset of image patches $\mathcal{I} = \mathcal{I}_L \cup \mathcal{I}_U$ consisting of a labeled set $\mathcal{I}_L = \{(x_i, y_i)\}$ and unlabeled set $\mathcal{I}_U = \{x_i\}$ of patches; here $x_i$ and $y_i$ are the patches and labels (having $K$ classes) respectively. Then, during training, we simultaneously optimize $D$ and $G$ using stochastic gradient descent, minimizing loss functions $L_D$ for the discriminator and $L_G$ for the generator. $L_D$ is the sum of a loss term on the labeled image subset ($L_{\text{supervised}}$) and the vanilla GAN loss ($L_{\text{unsupervised}}$) so that

$$L_D = L_{\text{unsupervised}} + L_{\text{supervised}}$$

$$L_{\text{supervised}} = -\mathbb{E}_{(x,y) \sim \mathcal{I}_L}\left[\log p_D(y \mid x, y < K+1)\right]$$

$$L_{\text{unsupervised}} = -\mathbb{E}_{x \sim \mathcal{I}}\left[\log\left[1 - p_D(y = K+1 \mid x)\right]\right] - \mathbb{E}_{x \sim G}\left[\log\left[p_D(y = K+1 \mid x)\right]\right].$$

For $G$, we use the feature-matching loss, designed to encourage generated patches to have similar features to the real patches:

$$L_G = \left\| \mathbb{E}_{x \sim \mathcal{I}}\left[h(x)\right] - \mathbb{E}_{z \sim p_z(z)}\left[h(g(z))\right] \right\|$$

Here, $h(x)$ denotes activations on an intermediate layer of the discriminator.
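To make the loss terms above concrete, the following sketch evaluates them on plain lists of numbers. It rests on several assumptions that are not taken from the paper: the discriminator is assumed to emit $K+1$ explicit logits with the last one standing for the "generated" class, labels are 0-based, the norm in $L_G$ is taken to be the (unsquared) Euclidean norm, and all function names are hypothetical. It is an illustration of the formulas, not the training code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_loss(logits: Sequence[float], label: int, num_classes: int) -> float:
    """-log p_D(y | x, y < K+1): softmax restricted to the K real classes."""
    probs = softmax(logits[:num_classes])
    return -math.log(probs[label])


def unsupervised_loss(real_logits: Sequence[Sequence[float]],
                      fake_logits: Sequence[Sequence[float]],
                      num_classes: int) -> float:
    """Vanilla GAN term; index num_classes plays the role of class K+1."""
    real = [-math.log(1.0 - softmax(l)[num_classes]) for l in real_logits]
    fake = [-math.log(softmax(l)[num_classes]) for l in fake_logits]
    return sum(real) / len(real) + sum(fake) / len(fake)


def feature_matching_loss(real_feats: Sequence[Sequence[float]],
                          fake_feats: Sequence[Sequence[float]]) -> float:
    """Norm of the difference between mean real and mean generated features."""
    dim = len(real_feats[0])
    mean_real = [sum(f[d] for f in real_feats) / len(real_feats) for d in range(dim)]
    mean_fake = [sum(f[d] for f in fake_feats) / len(fake_feats) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_real, mean_fake)))


# Toy values with K = 2 real classes (healthy, diseased) plus one "generated" logit.
l_sup = supervised_loss([0.0, 0.0, 5.0], label=0, num_classes=2)
l_unsup = unsupervised_loss([[0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]], num_classes=2)
l_d = l_unsup + l_sup
l_g = feature_matching_loss([[1.0, 2.0], [3.0, 4.0]], [[5.0, 7.0]])
```

Note that the supervised term ignores the "generated" logit entirely, so a labeled patch is only asked to be assigned to the right real class.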
In our experiments, the activation layer after the Networks in Networks (NiN) layers (9) was chosen as intermediate $h(x)$. As the generator and discriminator networks in (16) were developed for benchmark natural image datasets, we adapted the networks to the larger image sizes and distinct image statistics of the retinal funduscopy datasets (Experimental Setup, Supplement 1).

Image level predictions: At evaluation time, we applied the semi-supervised GAN to derive patch-level predictions and then pooled these patch-level predictions to form an image-level abnormality score:

$$\mathrm{score}_i = \sum_{j=1}^{J} \sigma(l_{ij})$$

where $l_{ij}$ is the classifier logit of patch $j$ in image $i$ and $J$ is the number of patches in the image. If the image-level abnormality score exceeds a threshold, the classifier predicts that the image is diseased.

Dataset and Pre-processing:
We evaluated our SSL-GAN approach on color retinal fundus images from the IDRiD challenge dataset collected at an eye clinic located in Nanded, Maharashtra, India (12). This dataset contains images from 249 patients (168 healthy, 81 DR). DR patients are diagnosed based on the presence of microaneurysms, soft exudates, hard exudates, and hemorrhages in retinal fundus photographs (6); the IDRiD dataset contains segmentation masks for each of these abnormalities. We randomly split the dataset into Training (n=149), Validation (n=50), and Testing (n=50) cohorts. We resized the original images to × and normalized them by the maximum intensity value in each image. We then subdivided the normalized image into non-overlapping × patches based on a uniform 8x8 grid. To determine patch-level labels, we combined segmentation masks for each of the abnormalities into a single binary mask. Then, we labeled patches with 0 pixels overlapping the mask as healthy; and denoted patches with at least 1 overlapping pixel as diseased. A majority of the patches had very sparse disease features ( < of the patch overlapped abnormality masks).

Baselines:
We compared the semi-supervised approach (SSL-GAN) in the patch-based framework against two supervised baselines: a shallow CNN with architecture similar to the GAN (ConvNet), and a 34-layer residual network (ResNet34) (7). For both SSL-GAN and supervised baselines, we employed the same method to aggregate patch-level predictions to produce an image-level classification.
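The patch-labeling rule and the shared aggregation step described above can be sketched as follows. Several points are assumptions rather than facts from the paper: $\sigma$ is taken to be the logistic sigmoid, the threshold values are purely illustrative (the text does not say how the threshold was chosen), and the function names are hypothetical.

```python
import math
from typing import Sequence


def patch_label(mask_patch: Sequence[Sequence[int]]) -> int:
    """Return 1 (diseased) if at least one mask pixel is set, else 0 (healthy)."""
    return int(any(pixel != 0 for row in mask_patch for pixel in row))


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def image_score(patch_logits: Sequence[float]) -> float:
    """Image-level abnormality score: sum of sigmoid(logit) over all patches."""
    return sum(sigmoid(l) for l in patch_logits)


def classify_image(patch_logits: Sequence[float], threshold: float) -> str:
    """Predict 'diseased' when the score strictly exceeds the threshold."""
    return "diseased" if image_score(patch_logits) > threshold else "healthy"


# Toy example: four patches with logit 0 each contribute 0.5 to the score.
toy_logits = [0.0, 0.0, 0.0, 0.0]
toy_score = image_score(toy_logits)
toy_prediction = classify_image(toy_logits, threshold=1.5)  # hypothetical threshold
```

Because the same aggregation is applied to every model, differences in image-level AUC between SSL-GAN and the supervised baselines reflect the patch classifiers rather than the pooling rule.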
Model Training and Evaluation:
The SSL-GAN uses limited labeled data and a larger amount of unlabeled data during training. We sampled labeled images randomly from the training cohort, and used the remaining images in the training cohort as unlabeled data. We evaluated the algorithm with different ratios of labeled-to-unlabeled data in the training set. We report mean classification AUCs both at the patch and image levels over 5 random samplings with the associated standard deviations. For appropriate comparisons, we trained the supervised baselines with only the limited labeled data samples used in the SSL experiments, and repeated training with 5 different random seeds. We detail model architectures and training hyperparameters in Supplement 1.
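The evaluation protocol above (random labeled subsets, then mean AUC with standard deviation over repeats) can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, the AUC is computed by direct pairwise comparison rather than with any particular library, and the sample standard deviation is used because the text does not say which estimator was reported.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def split_labeled_unlabeled(n_train: int, n_labeled: int,
                            seed: int) -> Tuple[List[int], List[int]]:
    """Randomly pick labeled image indices; the rest serve as unlabeled data."""
    rng = random.Random(seed)
    labeled = sorted(rng.sample(range(n_train), n_labeled))
    labeled_set = set(labeled)
    unlabeled = [i for i in range(n_train) if i not in labeled_set]
    return labeled, unlabeled


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize(aucs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(aucs), statistics.stdev(aucs)


# Toy example: one labeled/unlabeled split and one AUC on four scored images.
labeled_ids, unlabeled_ids = split_labeled_unlabeled(n_train=149, n_labeled=20, seed=0)
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
toy_mean, toy_std = summarize([0.80, 0.90])
```

The pairwise AUC is quadratic in the number of examples, which is acceptable for a test cohort of 50 images but would be replaced by a rank-based computation for large patch-level evaluations.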
Classification Performance as a Function of Annotation: We present the semi-supervised classification results, and benchmark against supervised deep learning baselines, both at the patch and image levels. In particular, we vary the proportion of labeled data in the training set, and evaluate how annotation relates to classification performance across the different methods.

Table 1: AUC of Semi-supervised vs. Supervised Learning: Patch-level Classification
Columns (Labeled/Total Images): 10/149, 20/149, 40/149, 80/149, 149/149
Rows: SSL-GAN (Patch); ConvNet (Patch); ResNet34 (Patch)
Entries: mean AUC ± standard deviation

Table 2: AUC of Semi-supervised vs. Supervised Learning: Image-Level Classification
Columns (Labeled/Total Images): 10/149, 20/149, 40/149, 80/149, 149/149
Rows: SSL-GAN (Image); ConvNet (Image); ResNet34 (Image)
Entries: mean AUC ± standard deviation

At the patch-level, the SSL-GAN significantly outperforms the supervised baselines. We observe that the SSL-GAN image-level predictions tend to have about 10% higher AUC than the associated patch-level predictions (Table 1 vs. 2). For the final image-level predictions, the semi-supervised classifier shows significant improvements over supervised baselines, even when less than 10% of the training dataset is labeled. In particular, when less than 30% of the training dataset is labeled, the SSL-GAN outperforms CNNs by up to 15%. We also performed comparisons to a 50-layer residual network trained for direct classification on full-size images (Supplement 2).

Interpretation of Abnormality Predictions: To interpret the SSL classification results, we overlaid the localized patch-level abnormality scores spatially onto the image, and smoothed the resulting visualization with a Gaussian blur. Figure 2 shows some example testing results obtained from an SSL-GAN trained with 20 labeled images. The predictions are clinically meaningful: qualitative comparisons with ground truth segmentation masks suggest that the method can accurately detect exudates and hemorrhages, although there are some misses at the peripheral patches. We quantitatively compared the resulting localization masks against the ground truth annotations, and found that the SSL-GAN had an AUC gain of 16.30% over the CNN baselines.
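The overlay step described above (patch scores placed back on the image grid, then smoothed with a Gaussian blur) might look like the sketch below. Everything specific here is an assumption: the nearest-neighbour upsampling, the kernel radius of three standard deviations, the edge replication, and the function names are illustrative choices, and blending the heatmap with the fundus image is not shown.

```python
import math
from typing import List

Grid = List[List[float]]


def upsample(grid: Grid, factor: int) -> Grid:
    """Nearest-neighbour upsampling: each patch score fills a factor x factor block."""
    out: Grid = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Normalized 1-D Gaussian weights on the offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img: Grid, kernel: List[float]) -> Grid:
    """Convolve every row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    width = len(img[0])
    out: Grid = []
    for row in img:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(img: Grid) -> Grid:
    return [list(col) for col in zip(*img)]


def score_heatmap(patch_scores: Grid, patch_size: int, sigma: float) -> Grid:
    """Upsample the patch-score grid to pixel resolution and blur it separably."""
    kernel = gaussian_kernel(sigma, radius=max(1, int(3 * sigma)))
    full = upsample(patch_scores, patch_size)
    # A 2-D Gaussian blur is separable: blur the rows, then the columns.
    return transpose(blur_rows(transpose(blur_rows(full, kernel)), kernel))


# Toy example: a 2 x 2 grid of patch scores rendered at 4 pixels per patch.
heatmap = score_heatmap([[1.0, 0.0], [0.0, 0.0]], patch_size=4, sigma=1.0)
```

Since the kernel is normalized, the blurred values stay within the range of the input scores, so the heatmap can be colour-mapped on the same scale as the raw patch predictions.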
[Figure 2 panel labels: True Positive; True Negative; True Positive; False Positive]
Figure 2: Exemplar test-set images with overlay of patch-based abnormality scores predicted by the SSL-GAN. In each case, the image-level classification accuracy is indicated.

Taken together, these results suggest that our semi-supervised GAN approach can provide significant improvements over supervised baselines, maintain classification accuracy with large reductions in labeling burden, and enable localization for better interpretation of results. We performed a preliminary test to explore how the models trained on the IDRiD data generalize to an independent test on the Kaggle Diabetic Retinopathy dataset. The SSL-GAN had an AUC of 64% against 47% on the supervised CNN baselines, suggesting that the semi-supervised models could exhibit greater capacity to adapt and generalize to varying dataset, class distribution and cohort characteristics.
We have proposed a patch-based approach to extend SSL with GANs to high-resolution scenarios typical of medical applications. Our approach leverages granular annotations on a minimal dataset, and offers an effective, efficient alternative to supervised approaches that require coarse annotation for large datasets. To the best of our knowledge, this is the first report employing GANs for semi-supervised classification of fine-scale sparse abnormalities in images. Further, as our semi-supervised classifier produces patch-based predictions, it also implicitly provides a valuable means to interpret the image-level classification results. As such, our work demonstrates that it is possible to use GAN-based semi-supervised deep learning to concomitantly reduce annotation burden, obtain accurate classifications, maintain desirable resolutions and enable interpretation of predictions. This has implications for practical systems focused on scaling applications of deep learning in cross-sectional and multi-dimensional clinical imaging applications.

Although we demonstrated feasibility on retinal image classification, our approach generalizes to a range of classification tasks involving high spatial resolution images and/or sparse anomalous features. Example applications include digital pathology with gigapixel whole-slide images (17), cancer screening with cross-sectional CT/MRI (2) and severity grading in multiple sclerosis (1).

Existing SSL methods have been developed on standard computer vision datasets wherein image structure and composition vary significantly with the target labels. However, as medical images typically have more structural similarity and greater redundancy amongst samples, there is a need to design new methods for these unique requirements.
References
[1] Robert L Barry, Johanna S Vannesjo, Samantha By, John C Gore, and Seth A Smith. Spinal cord MRI at 7T. NeuroImage, 168:437–451, 2018.
[2] Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Fenyo, Andre Moreira, Narges Razavian, and Aristotelis Tsirigos. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nature Medicine, 24:1559–1567, 2018.
[3] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[5] Hayit Greenspan, Bram van Ginneken, and Ronald M Summers. Guest editorial: deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging, 35(5):1153–1159, 2016.
[6] Varun Gulshan, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, Ramasamy Kim, Rajiv Raman, Philip C. Nelson, Jessica L. Mega, and Dale R. Webster. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 316(22):2402, dec 2016.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Pages 770–778. IEEE, jun 2016.
[8] Jonathan Krause, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S. Corrado, Lily Peng, and Dale R. Webster. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology, oct 2018.
[9] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. ICLR, 2014.
[10] Ali Madani, Mehdi Moradi, Alexandros Karargyris, and Tanveer Syeda-Mahmood. Semi-supervised learning with Generative Adversarial Networks for Chest X-Ray Classification with Ability of Data Domain Adaptation. IEEE ISBI, 2018.
[11] Ali Madani, Jia Rui Ong, Anshul Tibrewal, and Mohammad RK Mofrad. Deep echocardiography: data-efficient supervised and semi-supervised deep learning towards automated diagnosis of cardiac disease. Nature Digital Medicine, 1:59, doi:10.1038/s41746-018-0065-x, 2018.
[12] P Porwal, S Pachade, R Kamble, M Kokare, G Deshmukh, V Sahasrabuddhe, and F Meriaudeau. Indian diabetic retinopathy image dataset (IDRiD). IEEE DataPort, 2018.
[13] Gwenolé Quellec, Katia Charrière, Yassine Boudi, Béatrice Cochener, and Mathieu Lamard. Deep image mining for diabetic retinopathy screening. Medical Image Analysis, 39:178–193, 2017.
[14] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv, pages 1–15, 2015.
[15] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-Supervised Learning with Ladder Networks. NIPS, 2015.
[16] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved Techniques for Training GANs. NIPS, 2016.
[17] Joel Saltz, Rajarsi Gupta, Le Hou, Tahsin Kurc, Pankaj Singh, Vu Nguyen, Dimitris Samaras, Kenneth R Shroyer, Tianhao Zhao, Rebecca Batiste, et al. Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Reports, 23(1):181, 2018.
[18] Xin Yi, Ekta Walia, and Paul Babyn. Generative Adversarial Network in Medical Imaging: A Review. arXiv:1809.07294v1, 2018.
[19] Xin Yi, Ekta Walia, and Paul Babyn. Unsupervised and semi-supervised learning with categorical generative adversarial networks assisted by wasserstein distance for dermascopy image classification. arXiv:1804.03700v1, 2018.
Acknowledgments
This project was supported by funding from the Deep Learning 2.0 program at A*STAR, Singapore, and a training grant from the US National Institute of Biomedical Imaging and Bioengineering (NIBIB, 5T32EB1680). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funders.

Appendices

A Architecture and Hyperparameters
The semi-supervised network architectures are detailed in Tables 3 and 4. Briefly, the discriminator comprises a 10-layer convolutional neural network with dropout and weight normalization; and the generator comprises a 5-layer convolutional neural network with batch normalization. For semi-supervised learning and the associated CNN, we downsampled the patches to × . We augmented the data during training by performing random cropping and flipping of the input training images. We used an exponential moving average of the parameters for inference on the testing set. We used the validation datasets to determine the model hyper-parameters (Supplement Table 6). The hyperparameters are maintained across all experiments. We will release our code in due course.

Table 3: Discriminator
[Layer listings for the network tables: the recoverable fragments are the column header "conv-large DR", dropout entries "p = 0.", and a "512 batchnorm ReLU" layer.]

[Hyperparameter table preceding Table 6; recoverable rows:]
Optimizer              (α = 1 ∗ − , β = 0 . )
Epoch                  early stopping with patience of 20
Loss                   binary crossentropy, balanced class-weighting
Learning rate decay    reduce LR to 10% of its value with a patience of 10
Augmentation           0-180 degree rotation, L-R flipping, U-D flipping, 0-30% horizontal shift, 0-30% vertical shift

Table 6: Hyperparameters semi-supervised GAN
Hyper-parameter              DR
Epoch                        1200
Batch size                   100
Leaky ReLU slope             0.2
Exp. moving average decay    0.999
Learning rate decay          linear to 0 after 1000 epochs
Optimizer                    ADAM (α = 3 ∗ − , β = 0 . )
Weight initialization        Isotropic gaussian (µ = 0, σ = 0 . )
Bias initialization          Constant (0)

B Image Level Classification Results
We present AUCs obtained from a direct fully supervised image-level classification (Table 7). These results serve as a baseline to assess how the use of finer patch-level annotations and aggregation of patch-level predictions affects supervised image classification.

Table 7: AUC of Semi-supervised vs. Supervised Learning: Direct Image-Level Classification
Columns (Labeled/Total Images): 10/149, 20/149, 40/149, 80/149, 149/149
Rows: Pretrained (Image); ResNet50 (Image)
Entries: mean AUC ± standard deviation

We note that pre-training reuses weights from networks that are trained on general scene databases (e.g., ImageNet, CIFAR-10), and produces good results even with low numbers of labeled images. However, such workarounds carry the risk of neglecting critical clinical information and reducing interpretability, and they do not apply to common clinical scenarios involving cross-sectional (3D) imaging or video data.
C Future Directions
Our approach enhances and adapts semi-supervised deep learning to the unique challenges of medical datasets. Future work on the translational front will focus on moving towards practical retinal image classification systems. On the methodological front, our study suggests directions towards improved semi-supervised deep learning methods.
Retinal Image Classification: The current method still shows a number of false negatives at patch level. Employing overlapping patches and/or other unsupervised mechanisms to pool patch-level predictions for image-level classification might help to improve sensitivity. Further, we used consensus labels as ground truth in this study, and will explore the effect of the inter-rater variability on the semi-supervised learning process, especially as the number of labeled samples reduces. Finally, we will investigate the effects of varying patch size, threshold to translate pixel-level labels into patch-level labels, and benchmark SSL performance for these variations of finer labeling against the naive coarse image-level labeling approach.

Semi-Supervised Deep Learning Methods