Conversion and Implementation of State-of-the-Art Deep Learning Algorithms for the Classification of Diabetic Retinopathy
1st Mihir Rao
Chatham High School
Chatham, New Jersey, [email protected]

2nd Michelle Zhu
Department of Computer Science, Montclair State University
Montclair, New Jersey, [email protected]

3rd Tianyang Wang
Department of Computer Science, Austin Peay State University
Tennessee, [email protected]
Abstract—Diabetic retinopathy (DR) is a retinal microvascular condition that emerges in diabetic patients. DR will continue to be a leading cause of blindness worldwide, with a predicted 191.0 million globally diagnosed patients in 2030. Microaneurysms, hemorrhages, exudates, and cotton wool spots are common signs of DR. However, they can be small and hard for human eyes to detect. Early detection of DR is crucial for effective clinical treatment. Existing methods to classify images require much time for feature extraction and selection, and are limited in their performance. Convolutional Neural Networks (CNNs), as an emerging deep learning (DL) method, have proven their potential in image classification tasks. In this paper, comprehensive experimental studies of implementing state-of-the-art CNNs for the detection and classification of DR are conducted in order to determine the top-performing classifiers for the task. Five CNN classifiers, namely Inception-V3, VGG19, VGG16, ResNet50, and InceptionResNetV2, are evaluated through experiments. They categorize medical images into five different classes based on DR severity. Data augmentation and transfer learning techniques are applied since annotated medical images are limited and imbalanced. Experimental results indicate that the ResNet50 classifier has top performance for binary classification and that the InceptionResNetV2 classifier has top performance for multi-class DR classification.
Index Terms—diabetic retinopathy, convolutional neural networks, transfer learning, binary classification, multi-class classification, optimizers
I. INTRODUCTION
Diabetic retinopathy (DR) is a retinal microvascular condition that emerges as a direct result of diabetes. High blood sugar levels allow glucose to block blood vessels, in this case in the retina [1]. This leads to microaneurysms, which are swollen sections of blood vessels in the retina. When these microaneurysms leak, they are called hemorrhages [2]. These hemorrhages allow cotton wool spots to form, which are accumulations of axoplasmic material in the back of the eye, along with exudates [3]. In order to effectively treat DR, it must be detected in its early stages. However, most people with the condition are unaware that they must have their vision examined often, thus allowing the condition to pass undetected from the early stages into the later stages [4]. Additionally, DR patients in resource-poor countries lack effective DR identification technology and clinicians to make official diagnoses and treatment plans [4]. This means that not only must DR be detected in its early stages, but the detection technology must also be easily accessible for people who do not have access to eye specialists and adequate equipment.

In 2030, it is estimated that there will be 191.0 million people with DR globally, approximately a 50% jump from the 126.6 million people with DR globally in 2010. Of those 191.0 million people, 56.3 million are expected to have vision-threatening diabetic retinopathy (VTDR) if action is not taken [5]. In the United States alone, the number of Americans aged 40 and older with DR in 2050 is predicted to be 16.0 million, while the number with VTDR is expected to be 3.4 million. These figures are approximately three times those of 2005, when there were 5.5 million people with DR and 1.2 million with VTDR [6].
Clearly, early and accurate DR detection is not only vital in the present day, but it will continue to be necessary for decades to come.

Many medical imaging techniques such as computed tomography (CT) and magnetic resonance imaging (MRI) have become indispensable tools in clinical research and diagnosis. Classifying medical images has been playing an important role in disease diagnosis and medical treatment [7].

At present, the most widely used method for the detection of diabetic retinopathy is a retinal eye exam [8]. This approach involves an eye specialist looking through a patient's pupil and at the back of their eye. The specialist looks for some of the most common symptoms of the disease, including microaneurysms, hemorrhages, exudates, and cotton wool spots. Additionally, some detection methods involve using fundus photography to take a picture of a patient's retina, allowing an eye specialist to conduct the same examination by looking at a retinal image on a computer screen. A lack of ophthalmologists will leave a large portion of patients undiagnosed. In addition, human errors are unavoidable.

Due to the vast amount of medical images and human fatigue as well as errors, relying on professional ophthalmologists will be very expensive and inefficient. Thus, machine learning approaches have been used for this purpose. Medical image classification can generally be categorized into supervised and unsupervised classification methods. Supervised methods require samples to be pre-annotated and include the K-nearest neighbor algorithm, Bayesian models, logistic regression, neural networks, and support vector machines.
Unsupervised methods automatically detect the similarity among samples and include K-means clustering, auto-encoders, and principal component analysis (PCA).

In recent years, deep learning (DL) has increasingly attracted researchers' attention for medical image classification. DL is a machine learning technique in which neural networks are trained on collections of data in order to learn patterns and extract features from them. Trained DL models can then be used to predict certain details about unseen test data, making them useful tools for medical image classification. Specifically, CNNs are DL models that have been proven effective for feature extraction and pattern recognition in image data, especially in a medical context.

Deep learning integrates supervised methods with unsupervised methods, and some image classification experiments using convolutional neural networks (CNNs) achieve performance close to what a specialized physician can achieve [9]. In this paper, the aim is to detect diabetic retinopathy by classifying a retinal image into one of five different levels (classes), as shown in Fig. 1, based on the disease severity. The medical images used in this paper are very limited and unbalanced due to data privacy concerns and labeling efforts.

To address these challenges in the medical imaging data, several techniques are used. Data augmentation is used to counteract the small data size by rotating and scaling the existing images. Noise is intentionally introduced to increase the noise tolerance of the models. Transfer learning utilizes a pre-trained model as a base to build the new model. This allows pre-acquired weights to be propagated into the classification task. During experiments, the selected models are trained on a publicly available dataset [10].

II. RELATED WORK
It has been proven that CNNs can automatically extract more distinct and effective features than handcrafted feature extraction methods. Deep learning methods usually outperform traditional machine learning methods, such as SVMs (support vector machines), since the SVM method is designed for a small sample size and is not suitable for large samples [9] [11]. Deep learning networks have been widely adopted to improve image classification performance since 2012 [12]. In particular, Krizhevsky et al. used a CNN to classify 1.2 million images into 1000 classes, with benchmark performance at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [13].

In order to improve accuracy, researchers have been working to add more layers to CNNs for large-scale image classification. VGG neural networks address the depth of CNNs and support up to 19 layers [14]. Very small 3×3 filters in all convolutional layers are used to reduce the number of parameters. Additionally, it has been demonstrated that 3×3 filters are most effective according to covariance analysis [15]. Significant performance gain has been observed by increasing the depth when compared to previous architectures [14].

However, simply adding layers to CNNs for higher accuracy causes complications such as overfitting, degradation, and computing and memory burden. Skip connections among the layers are proposed in residual networks, which learn an additive residual function with respect to an identity mapping derived from the preceding layer's inputs [16]. The residual architectures are capable of better fusing features and thus address the issue of exploding or vanishing gradients.

The large number of parameters and large data size contribute to the success of CNN models. However, training deep networks usually has high demands for high-performance computing resources, such as powerful GPUs and efficient storage systems. Multiple GPUs have been used to speed up training.
The data parallelism can be exploited by dividing each batch of training images into several smaller batches, computed in parallel on each GPU. For example, Simonyan and Zisserman trained their VGG nets of 144M parameters on four NVIDIA Titan Black GPUs [14].

Model-based transfer learning for neural networks contains two stages, namely network pre-training with benchmark datasets, such as ImageNet [17], and fine-tuning the pre-trained networks with specific target datasets. This two-stage method has been very popular in tasks involving medical images due to the limited sizes of specific target datasets. The pre-trained networks can capture some general features from similar benchmark images, and these features can be further fine-tuned in the second stage. Experiments show that feature reuse primarily happens in the lowest layers [18].

III. METHODS
A. Addressing Class Imbalance in the Dataset
A Kaggle dataset titled APTOS 2019 Blindness Detection (APTOS stands for Asia Pacific Tele-Ophthalmology Society) was used to train and test models [10]. The dataset consists of 3662 retinal images across five different stages of diabetic retinopathy (DR): no DR, mild DR, moderate DR, severe DR, and proliferative DR. These classes are annotated as values 0 through 4. As shown in Fig. 2, classes of images were grouped together and re-annotated based on the classification task at hand (binary, 3-class, or 5-class). Class imbalance was addressed by oversampling images in order to achieve a more uniform distribution of images across classes. Additionally, oversampled images were randomly rotated, reflected, and injected with noise in order to prevent training models on duplicate data and to increase the overall noise resistance of the system.
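The oversampling-with-augmentation step above can be sketched as follows. This is a minimal NumPy illustration, not the exact pipeline used in the experiments: rotation here is restricted to right angles via `np.rot90`, and the noise standard deviation of 5.0 is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img):
    """Randomly rotate, reflect, and add noise to one image array."""
    img = np.rot90(img, k=rng.integers(0, 4))   # random right-angle rotation
    if rng.random() < 0.5:
        img = img[:, ::-1]                      # random horizontal reflection
    noise = rng.normal(0.0, 5.0, img.shape)     # assumed noise level
    return np.clip(img + noise, 0, 255)

def oversample(images, target_count):
    """Duplicate-and-augment a minority class up to target_count images."""
    out = list(images)  # originals are kept unchanged
    while len(out) < target_count:
        out.append(augment(images[rng.integers(len(images))]))
    return out
```

Because each duplicate passes through `augment`, the oversampled class never contains exact copies of an existing image, which is the stated goal of the augmentation step.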
B. Image Pre-Processing
Fig. 1. Retinal images showing the progression of DR.
Fig. 2. Diagram showing the approach taken for image grouping and re-annotation based on the classification task at hand.

The original dataset used in this project consists of images of vastly varying dimensions and characteristics, including empty space around the actual retina and varying brightness across the image as a whole. In order to feed such data into a CNN, the images need to be pre-processed. This was done in a series of steps. First, by checking for pixels throughout the images that are completely black, the regions of empty space were detected and then cropped out. Second, in an effort to normalize the brightness of the images and to help bring out some of their important features, weighted arrays consisting of Gaussian-blurred versions of the images were added to their corresponding resized images. Third, the images were circle-cropped, with the center of the circle lying at the center of the image and the circumference of the circle touching the edges of the image. Not only did this allow excess space around the retina to be further eliminated, but it also led to more uniformity throughout the dataset. Lastly, the images were resized to a common dimension of 512×512 pixels. This specific dimension was chosen as a starting point since the raw images in the dataset had an average size of approximately 1527×2015 pixels. The smaller the resized image, the more detail from the raw image is lost. Resizing the images to 512×512 pixels thus allowed high image resolution to be maintained while achieving time and memory efficiency during training. Fig. 3 shows two examples of image pre-processing conducted on raw retinal images.
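A condensed sketch of the four pre-processing steps on a single-channel NumPy image is given below. The black-pixel threshold, blur sigma, and the 4/−4/+128 weighting are assumed values (the weighting follows a common fundus-normalization recipe), and the step order is simplified relative to the description above, which applies the blurred-copy addition to the resized image.

```python
import numpy as np
from scipy import ndimage

def preprocess(img, out_size=512):
    # 1) Crop away fully black rows and columns around the retina.
    mask = img > 10                      # assumed "non-black" threshold
    img = img[mask.any(axis=1)][:, mask.any(axis=0)]
    # 2) Add a weighted Gaussian-blurred copy to normalize brightness.
    blur = ndimage.gaussian_filter(img.astype(float), sigma=10)
    img = np.clip(4 * img - 4 * blur + 128, 0, 255)
    # 3) Circle-crop: zero out pixels outside the inscribed circle.
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    circle = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= (min(h, w) / 2) ** 2
    img = np.where(circle, img, 0)
    # 4) Resize to a common square dimension.
    return ndimage.zoom(img, (out_size / h, out_size / w), order=1)
```

The circle crop keeps the retina, which is itself roughly circular, while discarding the corner regions that carry no diagnostic information.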
C. Transfer Learning
Unlike training from scratch, transfer learning aims to transfer the knowledge learned from another dataset to a target problem. These models are repurposed, as their weights and biases come from their initial training. These parameters may still help with some high-level feature extraction due to a certain level of relevance between the two tasks. Additionally, the selected pre-trained models have been proven successful in other classification problems, further making them good candidates for the problem of DR detection and classification [14] [19] [20]. During experiments, the following pre-trained models were selected for further fine-tuning: ResNet50, VGG16, VGG19, Inception-V3, and InceptionResNetV2. Since these models were initially trained for tasks that categorized images into a large number of unique classes, the final layer of each model did not match the required architecture for the binary, 3-class, and 5-class classification tasks. In order to address this, a series of additional layers were added to the end of each classifier to fit the classification task at hand: a flatten layer to reduce the model output to a 1-dimensional space, a series of fully-connected dense layers, and a final dense layer with 1 node for the binary task, 3 nodes for the 3-class task, and 5 nodes for the 5-class task, respectively.
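The head described above can be sketched with tf.keras as below. The dense-layer width of 256 is an assumption (only a flatten layer, a series of dense layers, and the task-sized output layer are specified), and `weights=None` stands in for the ImageNet weights used in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(num_classes, input_size=512):
    # Pre-trained backbone without its original classification layer.
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=None,  # 'imagenet' in the experiments
        input_shape=(input_size, input_size, 3))
    base.trainable = False  # layers frozen, as in the binary experiments
    out_nodes = 1 if num_classes == 2 else num_classes
    activation = "sigmoid" if num_classes == 2 else "softmax"
    return models.Sequential([
        base,
        layers.Flatten(),                      # flatten to a 1-D vector
        layers.Dense(256, activation="relu"),  # assumed width
        layers.Dense(out_nodes, activation=activation),
    ])
```

The same head shape applies to the other four backbones by swapping the `tf.keras.applications` constructor.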
IV. EXPERIMENTS AND RESULTS
A. Experimental Setups
Fig. 3. Two examples of raw retinal images undergoing image pre-processing.

For the binary classification task, various models were tested across different optimization techniques. Specifically, the models tested were ResNet50, VGG16, VGG19, and Inception-V3. All of these models were tested across two optimizers: Adam and Stochastic Gradient Descent (SGD). A learning rate of 0.001 was used for both optimizers across all experiments. The final layer of each classifier in the binary classification task had a sigmoid activation function, providing outputs between 0 and 1. In differentiating between outputs of the negative and positive class, a threshold value of <=0.5 was used for the negative class while a value of >0.5 was used for the positive class. During training, the base transfer learning model's layers were frozen.

For the 3-class classification task, two phases of experiments were conducted. For the first phase, similar approaches were taken as for the binary classification task with respect to model architecture and optimizers. However, in the final layer of all 3-class classification models, a Softmax activation across 3 nodes was used, providing probabilities across the three classes. Again, a learning rate of 0.001 was used for both optimizers. In order to interpret the Softmax probabilities produced by the model as a prediction, the Argmax function was used, which returns the index of the element in the probability array with the highest value, thus returning the model's most confident prediction for a given image. Based on the results from this initial phase of experiments, adjustments were made and a second phase of experiments was conducted. The adjustments were the learning rate being decreased by a factor of 10, the kernels of the entire model being initialized using the He Uniform initializer [16], and the unfreezing of layers in the base transfer learning model. However, the adjustments yielded memory limitations for training ResNet50 and Inception-V3 due to their output tensors being extremely large once flattened in the classifier (524288 values for ResNet50 and 401408 values for Inception-V3).
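These flattened sizes follow directly from each backbone's final feature-map shape on a 512×512 input: a 32× spatial downsampling for ResNet50 and the VGG nets, and a 14×14 map for Inception-V3 owing to its valid-padded reductions.

```python
# Flattened size = height * width * channels of the final feature map.
def flattened_size(side, channels):
    return side * side * channels

print(flattened_size(16, 2048))  # ResNet50:     512/32 = 16 -> 524288
print(flattened_size(16, 512))   # VGG16/VGG19:  512/32 = 16 -> 131072
print(flattened_size(14, 2048))  # Inception-V3: 14x14 map   -> 401408
```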
This is several times larger than the number of output values for the VGG variants (131072 values). So, the VGG variants were trained using the above-mentioned method, while ResNet50 and Inception-V3 were trained on images of size 224×224 pixels and 299×299 pixels, respectively. This substantially reduced the size of the flattened output tensor, alleviating the memory limitation. 224×224 and 299×299 pixels are the default input sizes for ResNet50 and Inception-V3 trained on ImageNet, so feeding these sizes into ResNet50 and Inception-V3 allowed for training with more default settings. In order to collect more thorough results, training was also conducted for the VGG variants on 224×224 pixel images for comparison purposes. It is recommended that the ResNet50 and Inception-V3 models be trained just like the VGG variants (on 512×512 pixel images) as part of future work when adequate resources are available. Once the phase 2 results were analyzed, based on the relatively good performance of ResNet50 and Inception-V3, the InceptionResNetV2 architecture, also known as Inception-V4, was trained with the Adam optimizer using the same training parameters as the other models. Inception-V4 takes the Inception-V3 architecture and incorporates residual connections much like those in the ResNet variants [19]. The input image size for this model was 299×299 pixels. The results of testing this model are presented with the phase 2 experimentation results.

For the 5-class classification task, just like the 3-class classification task, two phases of experiments were conducted. The first phase conducted the same experiments as phase 1 of the 3-class classification, only the Softmax activation on the final layer of the classifier was modified to fit the 5-class classification task. Based on the results of this initial phase, a second phase of experimentation was conducted after making adjustments to the training parameters.
These adjustments were the same as those made prior to the second phase of the 3-class classification task. Additionally, the second phase for the 5-class classification task used the same experimental setups as phase 2 of the 3-class classification task. Based on the observed performance of the ResNet50 and Inception-V3 models, just as for the 3-class classification task, the InceptionResNetV2 model was experimented with using both the Adam and SGD optimizers. Again, the same training parameters were used for this model as for the 3-class classification task. The results of testing this model are presented with the phase 2 experimentation results for the 5-class classification task.

For all classification types, test sets were created using random 20% samples of the data, and validation sets were random 20% samples of the train set. During training, an early-stopping callback was implemented. This monitored validation accuracy during training and stopped training once the validation accuracy began to decrease, indicating overfitting. The callback would then restore the model's best weights from the penultimate epoch. Additionally, across all binary task experiments and experiments in phase 1 of the multi-class approaches, weight initialization for the base model was done using the ImageNet weights provided for the model being tested. These provided weights give the model high-level feature extraction abilities.
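The early-stopping behavior described above can be sketched in plain Python; a patience of one epoch is an assumption (in Keras, the `EarlyStopping` callback with `restore_best_weights=True` provides equivalent logic).

```python
def early_stopping(val_accuracies, patience=1):
    """Return (best_epoch, best_accuracy). Training stops once validation
    accuracy fails to improve for `patience` consecutive epochs, and the
    best epoch's weights would then be restored."""
    best_acc, best_epoch, waited = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation accuracy is decreasing: stop training
    return best_epoch, best_acc
```

For example, a validation-accuracy trace of 0.60, 0.70, 0.75, 0.74 stops after the fourth epoch and restores the weights from the third.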
B. Model Performance Analysis
Table I shows the collected results for the binary classification task. A confusion matrix in Fig. 4 and receiver operating characteristic (ROC) curves in Fig. 5 are shown for the binary classification experimental setup that yielded the best results. It was found that the ResNet50 architecture accompanied by the Adam optimizer yielded the best results for the binary classification task, including an accuracy of 96.59% when tested on unseen test data and micro- and macro-average area-under-the-curve (AUC) values of 0.99.

Table II shows the model testing accuracies for phase 1 of the 3-class classification experimentation. It can be seen that the majority of the models are not able to pass roughly 84% testing accuracy. This could be due to the fact that with 3-class classification, merely training the end classifier and not the base transfer learning model itself may not be sufficient to help the model differentiate between the mild/moderate and severe/proliferative classes. So, unfreezing the base model layers, initializing weights in a specific manner, and decreasing the learning rate to 0.0001 were necessary. Table III shows the results for phase 2 of experimentation for 3-class classification with 512×512 pixel images.
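The reported metrics (test accuracy, precision/recall, AUC, confusion matrices) can be reproduced with scikit-learn; a toy sketch on four hand-picked sigmoid outputs:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1])              # ground-truth labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])    # sigmoid outputs
y_pred = (y_score > 0.5).astype(int)         # the 0.5 threshold used here

cm = confusion_matrix(y_true, y_pred)        # rows: true, columns: predicted
auc = roc_auc_score(y_true, y_score)         # area under the ROC curve
print(cm.tolist(), auc)  # [[2, 0], [1, 1]] 0.75
```

Note that AUC is computed from the raw scores, not the thresholded predictions, which is why it can separate models whose thresholded accuracies are similar.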
TABLE I
BINARY CLASSIFICATION EXPERIMENTAL RESULTS

Adam
Metric              ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy       0.9659    —       —       —

Stochastic Gradient Descent
Metric              ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy       0.956     0.9119  0.9219  0.7827
Precision           0.95      0.91    0.92    0.82
Recall              0.95      0.91    0.92    0.79
Micro Average AUC   —         —       —       —

Table IV shows the results for phase 2 of experimentation for 3-class classification with 299×299 pixel images for the Inception-V3 and InceptionResNetV2 models and 224×224 pixel images for the other models. A confusion matrix in Fig. 6 and receiver operating characteristic (ROC) curves in Fig. 7 are shown for the 3-class classification experimental setup that yielded the best results. It was found that the InceptionResNetV2 architecture accompanied by the Adam optimizer yielded the best results for the 3-class classification task, including an accuracy of 88.14% and micro- and macro-average AUC values of 0.98 and 0.97, respectively.
TABLE II
3-CLASS CLASSIFICATION PHASE 1 TEST ACCURACIES

Adam
Metric          ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy   0.795     0.8088  0.8309  0.602

Stochastic Gradient Descent
Metric          ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy   —         —       —       —
TABLE III
3-CLASS CLASSIFICATION PHASE 2 EXPERIMENTAL RESULTS ON 512×512 PIXEL INPUT IMAGES

Adam
Metric              VGG16   VGG19
Test Accuracy       0.7454  0.7528
Precision           0.74    0.77
Recall              0.75    0.76
Micro Average AUC   0.90    0.90
Macro Average AUC   0.88    0.89
F1-Score            0.7434  0.7616

Stochastic Gradient Descent
Metric              VGG16   VGG19
Test Accuracy       0.7472  —
Precision           0.76    —
Recall              0.76    —
Micro Average AUC   0.90    —
Macro Average AUC   0.89    —
F1-Score            0.7563  —
Table V shows the accuracies for the first phase of 5-class classification experimentation. It can be seen that the majority of the models are not able to pass roughly 70% testing accuracy. With the binary approach, merely training the end classifier and not the base transfer learning model itself yielded promising results. However, as the 5-class classification task requires more detailed classification by the model, especially between the mild and moderate classes and between the severe and proliferative classes, the previously mentioned adjustments were made before phase 2 of experiments. Table VI shows the results for phase 2 of experimentation for 5-class classification with 512×512 pixel images. Table VII shows the results for phase 2 of experimentation for 5-class classification with 299×299 pixel images for the Inception-V3 and InceptionResNetV2 models and 224×224 pixel images for the other models.
TABLE IV
3-CLASS CLASSIFICATION PHASE 2 EXPERIMENTAL RESULTS ON 224×224 AND 299×299 PIXEL INPUT IMAGES

Adam
Metric              ResNet50  VGG16   VGG19   Inception-V3  InceptionResNetV2
Test Accuracy       0.7849    0.3695  0.3548  0.8116        0.8814
Precision           0.79      0.12    0.12    0.83          —
Recall              0.78      0.33    0.33    0.81          —
Micro Average AUC   0.92      0.52    0.52    0.94          0.98
Macro Average AUC   0.90      0.50    0.50    0.94          0.97
F1-Score            0.7778    0.1772  0.1759  0.8046        —

Stochastic Gradient Descent
Metric              ResNet50  VGG16   VGG19   Inception-V3  InceptionResNetV2
Test Accuracy       0.3493    0.3419  0.3906  0.3125        0.4164
Precision           0.33      0.14    0.37    0.29          0.35
Recall              0.35      0.33    0.40    0.32          0.44
Micro Average AUC   0.49      0.47    0.55    0.50          0.52
Macro Average AUC   0.49      0.41    0.57    0.48          0.56
F1-Score            0.2413    0.1807  0.3483  0.2239        0.346

Fig. 6. Confusion matrix for the testing results of the best 3-class classifier.
Fig. 7. Receiver operating characteristic (ROC) curves for the testing results of the best 3-class classifier.

A confusion matrix in Fig. 8 and receiver operating characteristic (ROC) curves in Fig. 9 are shown for the 5-class classification experimental setup that yielded the best results. It was found that the InceptionResNetV2 architecture accompanied by the Adam optimizer yielded the best results for the 5-class classification task, including an accuracy of 85.02% and micro- and macro-average AUC values of 0.97.
TABLE V
5-CLASS CLASSIFICATION PHASE 1 TEST ACCURACIES

Adam
Metric          ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy   0.6374    —       —       —

Stochastic Gradient Descent
Metric          ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy   0.6681    0.6519  0.6288  0.1832

TABLE VI
5-CLASS CLASSIFICATION PHASE 2 EXPERIMENTAL RESULTS ON 512×512 PIXEL INPUT IMAGES

Adam
Metric              VGG16   VGG19
Test Accuracy       —       —

Stochastic Gradient Descent
Metric              VGG16   VGG19
Test Accuracy       0.6853  0.7295
Precision           0.70    0.72
Recall              0.70    0.72
Micro Average AUC   0.92    0.93
Macro Average AUC   0.91    0.92
F1-Score            0.6931  0.7199

TABLE VII
5-CLASS CLASSIFICATION PHASE 2 EXPERIMENTAL RESULTS ON 224×224 AND 299×299 PIXEL INPUT IMAGES

Adam
Metric              ResNet50  VGG16   VGG19   Inception-V3  InceptionResNetV2
Test Accuracy       0.7893    0.2091  0.2042  0.6228        0.8502
Precision           0.80      0.04    0.04    0.55          —
Recall              0.79      0.20    0.20    0.63          —
Micro Average AUC   0.95      0.50    0.50    0.91          0.97
Macro Average AUC   0.94      0.50    0.50    0.88          0.97
F1-Score            0.7905    0.0692  0.0677  0.5746        —

Stochastic Gradient Descent
Metric              ResNet50  VGG16   VGG19   Inception-V3  InceptionResNetV2
Test Accuracy       0.222     0.2328  0.2548  0.1827        0.1994
Precision           0.35      0.16    0.38    0.18          0.13
Recall              0.21      0.24    0.25    0.19          0.19
Micro Average AUC   0.49      0.49    0.57    0.48          0.52
Macro Average AUC   0.50      0.49    0.53    0.47          0.55
F1-Score            0.1517    0.1595  0.1788  0.1042        0.1435

Fig. 8. Confusion matrix for the testing results of the best 5-class classifier.
Fig. 9. Receiver operating characteristic (ROC) curves for the testing results of the best 5-class classifier.
V. DISCUSSION
The results show that ResNet50 and InceptionResNetV2 have relatively better performance than the other evaluated models. A commonality between these models is the implementation of skip connections. Furthermore, using the default input size for a model also yields better results.

First, the implementation of skip connections may contribute to better performance by helping the model avoid the vanishing gradient problem (VGP). During training, backpropagation may result in vanishing gradients, especially in the earlier model layers. This could impact the model's overall ability to learn, as learning could slow down in those layers, hindering the model's performance potential. Skip connections may avoid the VGP because they allow gradients to flow between non-consecutive layers, thus skipping over layers that may include vanished gradients. Second, the use of model-specific default input sizes may contribute to better performance because changing the size of the input forces the model to learn from scratch, since the pre-trained parameters' sizes may not match the new input layer.
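The vanishing-gradient argument can be illustrated numerically. In a plain chain of layers the gradient magnitude is a product of per-layer factors, while a residual chain multiplies factors of the form (1 + g) because the identity path contributes a constant 1 to each layer's Jacobian. With illustrative per-layer factors below 1 (mimicking saturated activations), the plain product collapses while the residual one does not:

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 50
# Illustrative per-layer gradient factors, all below 1 (saturation regime).
layer_factors = rng.uniform(0.5, 0.9, size=depth)

plain_grad = np.prod(layer_factors)           # plain chain: product shrinks
residual_grad = np.prod(1.0 + layer_factors)  # skip path adds the identity's 1

print(plain_grad < 0.01, residual_grad > 1.0)  # True True
```

The factors here are toy values, not measured gradients; the point is only that the identity term keeps each factor above 1, so the product cannot vanish with depth.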
VI. CONCLUSION AND FUTURE WORK
Promising methods for binary, 3-class, and 5-class classification of DR have been demonstrated. Additional work can be conducted, especially for the multi-class classification tasks. Overall, it was found that, given the parameters of the conducted experiments, the ResNet50/Adam combination is best for the binary task, and the InceptionResNetV2/Adam combination is best for the 3-class and 5-class tasks.
A. Model Performance Improvement
In future research, other deep learning architectures and optimization techniques could be experimented with. Furthermore, the learning rate of the optimizer could be tuned further. A hybrid approach to layer freezing could also be taken, resulting in a mix of frozen and unfrozen layers. Lastly, based on the observed results, techniques explored with the multi-class tasks could be applied to the binary task in order to improve its results. However, high-accuracy 5-class classification is the ultimate goal, as it provides a medical professional with the most insight into the severity of the disease in an image, so more effort should be put towards it.
B. Implementation of a Multi-Stage Classification System
A multi-stage classification system could be developed, through which a series of binary classifications in a decision-tree-like manner leads to a 5-class classification. This could be a promising approach for DR detection, as binary classification has already shown extremely promising results. Sigmoid outputs of each binary model within the system would individually contribute to the final five-element prediction array, to which Argmax would be applied in order to determine the final classification by identifying the node with the highest value, indicative of the class with the highest probability.
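One way such a system could combine its binary stages is sketched below. The particular tree (no-DR vs. DR, then mild/moderate vs. severe/proliferative, then within each pair) and the multiplication of stage confidences are hypothetical design choices, not a specification from this paper.

```python
import numpy as np

def five_class_scores(p_dr, p_advanced, p_moderate, p_proliferative):
    # Hypothetical decision tree of binary sigmoid outputs:
    #   p_dr:            P(any DR)                              (stage 1)
    #   p_advanced:      P(severe or proliferative | DR)        (stage 2)
    #   p_moderate:      P(moderate | mild-or-moderate)         (stage 3a)
    #   p_proliferative: P(proliferative | severe-or-prolif.)   (stage 3b)
    scores = np.array([
        1.0 - p_dr,                                   # 0: no DR
        p_dr * (1 - p_advanced) * (1 - p_moderate),   # 1: mild
        p_dr * (1 - p_advanced) * p_moderate,         # 2: moderate
        p_dr * p_advanced * (1 - p_proliferative),    # 3: severe
        p_dr * p_advanced * p_proliferative,          # 4: proliferative
    ])
    return int(np.argmax(scores)), scores
```

Because the tree partitions the outcome space, the five scores sum to one, so the array behaves like the Softmax output of a single 5-class model and Argmax applies directly.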
C. Deployment of a Deep Learning-Based Medical Diagnostic Tool
There is a strong need for a reliable diagnostic tool for the detection of diabetic retinopathy given the global prevalence of the disease. Additionally, global regions with a lack of medical professionals could significantly benefit from such a tool. After undergoing clinical trials and testing, a cost-friendly deep learning-based diagnostic tool could be deployed through a cloud-based service in order to provide global access to a reliable and accurate detection system.

ACKNOWLEDGMENT