Conversion and Implementation of State-of-the-Art Deep Learning Algorithms for the Classification of Diabetic Retinopathy
1st Mihir Rao
Chatham High School
Chatham, New Jersey, [email protected]

2nd Michelle Zhu
Department of Computer Science, Montclair State University
Montclair, New Jersey, [email protected]

3rd Tianyang Wang
Department of Computer Science, Austin Peay State University
Tennessee, [email protected]
Abstract—Diabetic retinopathy (DR) is a retinal microvascular condition that emerges in diabetic patients. DR will continue to be a leading cause of blindness worldwide, with a predicted 191.0 million globally diagnosed patients in 2030. Microaneurysms, hemorrhages, exudates, and cotton wool spots are common signs of DR. However, they can be small and hard for human eyes to detect. Early detection of DR is crucial for effective clinical treatment. Existing methods to classify images require much time for feature extraction and selection, and are limited in their performance. Convolutional Neural Networks (CNNs), as an emerging deep learning (DL) method, have proven their potential in image classification tasks. In this paper, comprehensive experimental studies of implementing state-of-the-art CNNs for the detection and classification of DR are conducted in order to determine the top-performing classifiers for the task. Five CNN classifiers, namely Inception-V3, VGG19, VGG16, ResNet50, and InceptionResNetV2, are evaluated through experiments. They categorize medical images into five different classes based on DR severity. Data augmentation and transfer learning techniques are applied since annotated medical images are limited and imbalanced. Experimental results indicate that the ResNet50 classifier has top performance for binary classification and that the InceptionResNetV2 classifier has top performance for multi-class DR classification.
Index Terms—diabetic retinopathy, convolutional neural networks, transfer learning, binary classification, multi-class classification, optimizers
I. INTRODUCTION
Diabetic retinopathy (DR) is a retinal microvascular condition that emerges as a direct result of diabetes. High blood sugar levels allow glucose to block blood vessels, in this case in the retina [1]. This leads to microaneurysms, which are swollen sections of blood vessels in the retina. When these microaneurysms leak, they are called hemorrhages [2]. These hemorrhages allow cotton wool spots to form, which are accumulations of axoplasmic material in the back of the eye, along with exudates [3]. In order to effectively treat DR, it must be detected in its early stages. However, most people with the condition are unaware that they must have their vision examined often, thus allowing the condition to pass undetected from the early stages into the later stages [4]. Additionally, DR patients in resource-poor countries lack effective DR identification technology and clinicians to make official diagnoses and treatment plans [4]. This means that not only must DR be detected in its early stages, but the detection technology must also be easily accessible for people who do not have access to eye specialists and adequate equipment.

In 2030, it is estimated that there will be 191.0 million people with DR globally, approximately a 50% jump from the 126.6 million people with DR globally in 2010. Of those 191.0 million people, 56.3 million are expected to have vision-threatening diabetic retinopathy (VTDR) if action is not taken [5]. In the United States alone, the number of Americans aged 40 and older with DR in 2050 is predicted to be 16.0 million, while the number with VTDR is expected to be 3.4 million. These figures are approximately three times those of 2005, when there were 5.5 million people with DR and 1.2 million with VTDR [6].
Clearly, early and accurate DR detection is not only vital in the present day, but it will continue to be necessary for decades to come.

Many medical imaging techniques such as computed tomography (CT) and magnetic resonance imaging (MRI) have become indispensable tools in clinical research and diagnosis. Classifying medical images has been playing an important role in disease diagnosis and medical treatment [7].

At present, the most widely used method for the detection of diabetic retinopathy is a retinal eye exam [8]. This approach involves an eye specialist looking through a patient's pupil and at the back of their eye. The specialist looks for some of the most common symptoms of the disease, including microaneurysms, hemorrhages, exudates, and cotton wool spots. Additionally, some detection methods involve using fundus photography to take a picture of a patient's retina, allowing an eye specialist to conduct the same examination by looking at a retinal image on a computer screen. A lack of ophthalmologists will leave a large portion of patients undiagnosed. In addition, human errors are unavoidable.

Due to the vast amount of medical images and human fatigue as well as errors, relying on professional ophthalmologists will be very expensive and inefficient. Thus, machine learning approaches have been used for this purpose. Medical image classification can generally be categorized into supervised and unsupervised classification methods. Supervised methods require samples to be pre-annotated and include the K-nearest neighbor algorithm, Bayesian models, logistic regression, neural networks, and support vector machines.
Unsupervised methods automatically detect the similarity among samples and include K-means clustering, auto-encoders, and principal component analysis (PCA).

In recent years, deep learning (DL) has increasingly attracted researchers' attention for medical image classification. DL is a machine learning technique in which neural networks are trained on collections of data in order to learn patterns and extract features from them. Trained DL models can then be used to predict certain details about unseen test data, making them useful tools for medical image classification. Specifically, CNNs are DL models that have been proven effective for feature extraction and pattern recognition in image data, especially in a medical context.

Deep learning integrates supervised methods with unsupervised methods, and some image classification experiments using convolutional neural networks (CNNs) achieve performance close to what a specialized physician can achieve [9]. In this paper, the aim is to detect diabetic retinopathy by classifying a retinal image into one of five different levels (classes), as shown in Fig. 1, based on the disease severity. The medical images used in this paper are very limited and unbalanced due to data privacy concerns and labeling efforts.

To address these challenges in the medical imaging data, several techniques are used. Data augmentation is used to counteract the small data size by rotating and scaling the existing images. Noise is intentionally introduced to increase the noise tolerance of the models. Transfer learning utilizes a pre-trained model as a base to build the new model. This allows pre-acquired weights to be propagated into the classification task. During experiments, the selected models are trained on a publicly available dataset [10].

II. RELATED WORK
It has been proven that CNNs can automatically extract more distinct and effective features than handcrafted feature extraction methods. Deep learning methods usually outperform traditional machine learning methods, such as SVMs (support vector machines), since the SVM method is designed for a small sample size and is not suitable for large samples [9] [11]. Deep learning networks have been widely adopted to improve image classification performance since 2012 [12]. In particular, Krizhevsky et al. used a CNN to classify 1.2 million images into 1000 classes, with benchmark performance at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [13].

In order to improve accuracy, researchers have been working to add more layers to CNNs for large-scale image classification. VGG neural networks address the depth of CNNs and support up to 19 layers [14]. Very small 3×3 filters in all convolutional layers are used to reduce the number of parameters. Additionally, it has been demonstrated that 3×3 filters are most effective according to covariance analysis [15]. Significant performance gain has been observed by increasing the depth when compared to previous architectures [14].

However, simply adding layers to CNNs for higher accuracy causes complications such as overfitting, degradation, and computing and memory burden. Skip connections among the layers are proposed in residual networks, which learn an additive residual function with respect to an identity mapping derived from the preceding layer's inputs [16]. The residual architectures are capable of better fusing features and thus address the issue of exploding or vanishing gradients.

The large number of parameters and large data size contribute to the success of CNN models. However, training deep networks usually has high demands for high-performance computing resources, such as powerful GPUs and efficient storage systems. Multiple GPUs have been used to speed up training.
The data parallelism can be exploited by dividing each batch of training images into several smaller batches, computed in parallel on each GPU. For example, Simonyan and Zisserman trained their VGG nets of 144M parameters on four NVIDIA Titan Black GPUs [14].

Model-based transfer learning for neural networks contains two stages, namely network pre-training with benchmark datasets, such as ImageNet [17], and fine-tuning the pre-trained networks with specific target datasets. This two-stage method has been very popular in tasks involving medical images due to the limited sizes of specific target datasets. The pre-trained networks can capture some general features from similar benchmark images, and these features can be further fine-tuned in the second stage. Experiments show that feature reuse primarily happens in the lowest layers [18].

III. METHODS
A. Addressing Class Imbalance in the Dataset
A Kaggle dataset titled APTOS 2019 Blindness Detection (APTOS stands for Asia Pacific Tele-Ophthalmology Society) was used to train and test models [10]. The dataset consists of 3662 retinal images across five different stages of diabetic retinopathy (DR): no DR, mild DR, moderate DR, severe DR, and proliferative DR. These classes are annotated as values 0 through 4. As shown in Fig. 2, classes of images were grouped together and re-annotated based on the classification task at hand (binary, 3-class, or 5-class). Class imbalance was addressed by oversampling images in order to achieve a more uniform distribution of images across classes. Additionally, oversampled images were randomly rotated, reflected, and injected with noise in order to prevent training models on duplicate data and to increase the overall noise resistance of the system.
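The oversampling-with-augmentation step above can be sketched as follows. This is a minimal NumPy illustration, not the exact pipeline used in the experiments: rotation here is restricted to right angles via `np.rot90`, and the noise standard deviation of 5.0 is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img):
    """Randomly rotate, reflect, and add noise to one image array."""
    img = np.rot90(img, k=rng.integers(0, 4))   # random right-angle rotation
    if rng.random() < 0.5:
        img = img[:, ::-1]                      # random horizontal reflection
    noise = rng.normal(0.0, 5.0, img.shape)     # assumed noise level
    return np.clip(img + noise, 0, 255)

def oversample(images, target_count):
    """Duplicate-and-augment a minority class up to target_count images."""
    out = list(images)  # originals are kept unchanged
    while len(out) < target_count:
        out.append(augment(images[rng.integers(len(images))]))
    return out
```

Because each duplicate passes through `augment`, the oversampled class never contains exact copies of an existing image, which is the stated goal of the augmentation step.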
B. Image Pre-Processing
Fig. 1. Retinal images showing the progression of DR.
Fig. 2. Diagram showing the approach taken for image grouping and re-annotation based on the classification task at hand.

The original dataset used in this project consists of images of vastly varying dimensions and characteristics, including empty space around the actual retina and varying brightness across the image as a whole. In order to feed such data into a CNN, the images need to be pre-processed. This was done in a series of steps. First, by checking for pixels throughout the images that are completely black, the regions of empty space were detected and then cropped out. Second, in an effort to normalize the brightness of the images and to help bring out some of their important features, weighted arrays consisting of Gaussian-blurred versions of the images were added to their corresponding resized images. Third, the images were circle-cropped, with the center of the circle lying at the center of the image and the circumference of the circle touching the edges of the image. Not only did this allow excess space around the retina to be further eliminated, but it also led to more uniformity throughout the dataset. Lastly, the images were resized to a common dimension of 512×512 pixels. This specific dimension was chosen as a starting point since the raw images in the dataset had an average size of approximately 1527×2015 pixels. The smaller the resized image, the more detail from the raw image is lost. Resizing the images to 512×512 pixels thus allowed high image resolution to be maintained while achieving time and memory efficiency during training. Fig. 3 shows two examples of image pre-processing conducted on raw retinal images.
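A condensed sketch of the four pre-processing steps on a single-channel NumPy image is given below. The black-pixel threshold, blur sigma, and the 4/−4/+128 weighting are assumed values (the weighting follows a common fundus-normalization recipe), and the step order is simplified relative to the description above, which applies the blurred-copy addition to the resized image.

```python
import numpy as np
from scipy import ndimage

def preprocess(img, out_size=512):
    # 1) Crop away fully black rows and columns around the retina.
    mask = img > 10                      # assumed "non-black" threshold
    img = img[mask.any(axis=1)][:, mask.any(axis=0)]
    # 2) Add a weighted Gaussian-blurred copy to normalize brightness.
    blur = ndimage.gaussian_filter(img.astype(float), sigma=10)
    img = np.clip(4 * img - 4 * blur + 128, 0, 255)
    # 3) Circle-crop: zero out pixels outside the inscribed circle.
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    circle = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= (min(h, w) / 2) ** 2
    img = np.where(circle, img, 0)
    # 4) Resize to a common square dimension.
    return ndimage.zoom(img, (out_size / h, out_size / w), order=1)
```

The circle crop keeps the retina, which is itself roughly circular, while discarding the corner regions that carry no diagnostic information.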
C. Transfer Learning
Unlike training from scratch, transfer learning aims to transfer the knowledge learned from another dataset to a target problem. These models are repurposed, as their weights and biases come from their initial training. These parameters may still help with some high-level feature extraction due to a certain level of relevance between the two tasks. Additionally, the selected pre-trained models have been proven successful in other classification problems, further making them good candidates for the problem of DR detection and classification [14] [19] [20]. During experiments, the following pre-trained models were selected for further fine-tuning: ResNet50, VGG16, VGG19, Inception-V3, and InceptionResNetV2. Since these models were initially trained for tasks that categorized images into a large number of unique classes, the final layer of each model did not match the required architecture for the binary, 3-class, and 5-class classification tasks. In order to address this, a series of additional layers were added to the end of each classifier to fit the classification task at hand: a flatten layer to reduce the model output to a 1-dimensional space, a series of fully-connected dense layers, and a final dense layer with 1 node for the binary task, 3 nodes for the 3-class task, and 5 nodes for the 5-class task, respectively.
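The head described above can be sketched with tf.keras as below. The dense-layer width of 256 is an assumption (only a flatten layer, a series of dense layers, and the task-sized output layer are specified), and `weights=None` stands in for the ImageNet weights used in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(num_classes, input_size=512):
    # Pre-trained backbone without its original classification layer.
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=None,  # 'imagenet' in the experiments
        input_shape=(input_size, input_size, 3))
    base.trainable = False  # layers frozen, as in the binary experiments
    out_nodes = 1 if num_classes == 2 else num_classes
    activation = "sigmoid" if num_classes == 2 else "softmax"
    return models.Sequential([
        base,
        layers.Flatten(),                      # flatten to a 1-D vector
        layers.Dense(256, activation="relu"),  # assumed width
        layers.Dense(out_nodes, activation=activation),
    ])
```

The same head shape applies to the other four backbones by swapping the `tf.keras.applications` constructor.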
IV. EXPERIMENTS AND RESULTS
A. Experimental Setups
Fig. 3. Two examples of raw retinal images undergoing image pre-processing.

For the binary classification task, various models were tested across different optimization techniques. Specifically, the models tested were ResNet50, VGG16, VGG19, and Inception-V3. All of these models were tested across two optimizers: Adam and Stochastic Gradient Descent (SGD). A learning rate of 0.001 was used for both optimizers across all experiments. The final layer of each classifier in the binary classification task had a sigmoid activation function, providing outputs between 0 and 1. In differentiating between outputs of the negative and positive class, a threshold value of <=0.5 was used for the negative class while a value of >0.5 was used for the positive class. During training, the base transfer learning model's layers were frozen.

For the 3-class classification task, two phases of experiments were conducted. For the first phase, similar approaches were taken as for the binary classification task with respect to model architecture and optimizers. However, in the final layer of all 3-class classification models, a Softmax activation across 3 nodes was used, providing probabilities across the three classes. Again, a learning rate of 0.001 was used for both optimizers. In order to interpret the Softmax probabilities produced by the model as a prediction, the Argmax function was used, which returns the index of the element in the probability array with the highest value, thus returning the model's most confident prediction for a given image. Based on the results from this initial phase of experiments, adjustments were made and a second phase of experiments was conducted. The adjustments were the learning rate being decreased by a factor of 10, the kernels of the entire model being initialized using the He Uniform initializer [16], and the unfreezing of layers in the base transfer learning model. However, the adjustments yielded memory limitations for training ResNet50 and Inception-V3 due to their output tensors being extremely large once flattened in the classifier (524288 values for ResNet50 and 401408 values for Inception-V3).
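These flattened sizes follow directly from each backbone's final feature-map shape on a 512×512 input: a 32× spatial downsampling for ResNet50 and the VGG nets, and a 14×14 map for Inception-V3 owing to its valid-padded reductions.

```python
# Flattened size = height * width * channels of the final feature map.
def flattened_size(side, channels):
    return side * side * channels

print(flattened_size(16, 2048))  # ResNet50:     512/32 = 16 -> 524288
print(flattened_size(16, 512))   # VGG16/VGG19:  512/32 = 16 -> 131072
print(flattened_size(14, 2048))  # Inception-V3: 14x14 map   -> 401408
```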
This is several times larger than the number of output values for the VGG variants (131072 values). So, the VGG variants were trained using the above-mentioned method, while ResNet50 and Inception-V3 were trained on images of size 224×224 pixels and 299×299 pixels, respectively. This substantially reduced the size of the flattened output tensor, alleviating the memory limitation. 224×224 and 299×299 pixels are the default input sizes for ResNet50 and Inception-V3 trained on ImageNet, so feeding these sizes into ResNet50 and Inception-V3 allowed for training with more default settings. In order to collect more thorough results, training was also conducted for the VGG variants on 224×224 pixel images for comparison purposes. It is recommended that the ResNet50 and Inception-V3 models be trained just like the VGG variants (on 512×512 pixel images) as part of future work when adequate resources are available. Once the phase 2 results were analyzed, based on the relatively good performance of ResNet50 and Inception-V3, the InceptionResNetV2 architecture, also known as Inception-V4, was trained with the Adam optimizer using the same training parameters as the other models. Inception-V4 takes the Inception-V3 architecture and incorporates residual connections much like those in the ResNet variants [19]. The input image size for this model was 299×299 pixels. The results of testing this model are presented with the phase 2 experimentation results.

For the 5-class classification task, just like the 3-class classification task, two phases of experiments were conducted. The first phase conducted the same experiments as phase 1 of the 3-class classification, only the Softmax activation on the final layer of the classifier was modified to fit the 5-class classification task. Based on the results of this initial phase, a second phase of experimentation was conducted after making adjustments to the training parameters.
These adjustments were the same as those made prior to the second phase of the 3-class classification task. Additionally, the second phase for the 5-class classification task used the same experimental setups as phase 2 of the 3-class classification task. Based on the observed performance of the ResNet50 and Inception-V3 models, just as for the 3-class classification task, the InceptionResNetV2 model was experimented with using both the Adam and SGD optimizers. Again, the same training parameters were used for this model as for the 3-class classification task. The results of testing this model are presented with the phase 2 experimentation results for the 5-class classification task.

For all classification types, test sets were created using random 20% samples of the data, and validation sets were random 20% samples of the train set. During training, an early-stopping callback was implemented. This monitored validation accuracy during training and stopped training once the validation accuracy began to decrease, indicating overfitting. The callback would then restore the model's best weights from the penultimate epoch. Additionally, across all binary task experiments and experiments in phase 1 of the multi-class approaches, weight initialization for the base model was done using the ImageNet weights provided for the model being tested. These provided weights give the model high-level feature extraction abilities.
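The early-stopping behavior described above can be sketched in plain Python; a patience of one epoch is an assumption (in Keras, the `EarlyStopping` callback with `restore_best_weights=True` provides equivalent logic).

```python
def early_stopping(val_accuracies, patience=1):
    """Return (best_epoch, best_accuracy). Training stops once validation
    accuracy fails to improve for `patience` consecutive epochs, and the
    best epoch's weights would then be restored."""
    best_acc, best_epoch, waited = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation accuracy is decreasing: stop training
    return best_epoch, best_acc
```

For example, a validation-accuracy trace of 0.60, 0.70, 0.75, 0.74 stops after the fourth epoch and restores the weights from the third.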
B. Model Performance Analysis
Table I shows the collected results for the binary classification task. A confusion matrix in Fig. 4 and receiver operating characteristic (ROC) curves in Fig. 5 are shown for the binary classification experimental setup that yielded the best results. It was found that the ResNet50 architecture accompanied by the Adam optimizer yielded the best results for the binary classification task, including an accuracy of 96.59% when tested on unseen test data and micro- and macro-average area-under-the-curve (AUC) values of 0.99.

Table II shows the model testing accuracies for phase 1 of the 3-class classification experimentation. It can be seen that the majority of the models are not able to pass roughly 84% testing accuracy. This could be due to the fact that with 3-class classification, merely training the end classifier and not the base transfer learning model itself may not be sufficient to help the model differentiate between the mild/moderate and severe/proliferative classes. So, unfreezing the base model layers, initializing weights in a specific manner, and decreasing the learning rate to 0.0001 were necessary. Table III shows the results for phase 2 of experimentation for 3-class classification with 512×512 pixel images.
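The reported metrics (test accuracy, precision/recall, AUC, confusion matrices) can be reproduced with scikit-learn; a toy sketch on four hand-picked sigmoid outputs:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1])              # ground-truth labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])    # sigmoid outputs
y_pred = (y_score > 0.5).astype(int)         # the 0.5 threshold used here

cm = confusion_matrix(y_true, y_pred)        # rows: true, columns: predicted
auc = roc_auc_score(y_true, y_score)         # area under the ROC curve
print(cm.tolist(), auc)  # [[2, 0], [1, 1]] 0.75
```

Note that AUC is computed from the raw scores, not the thresholded predictions, which is why it can separate models whose thresholded accuracies are similar.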
TABLE I
BINARY CLASSIFICATION EXPERIMENTAL RESULTS

Adam
Metric              ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy       0.9659    —       —       —

Stochastic Gradient Descent
Metric              ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy       0.956     0.9119  0.9219  0.7827
Precision           0.95      0.91    0.92    0.82
Recall              0.95      0.91    0.92    0.79
Micro Average AUC   —         —       —       —

Table IV shows the results for phase 2 of experimentation for 3-class classification with 299×299 pixel images for the Inception-V3 and InceptionResNetV2 models and 224×224 pixel images for the other models. A confusion matrix in Fig. 6 and receiver operating characteristic (ROC) curves in Fig. 7 are shown for the 3-class classification experimental setup that yielded the best results. It was found that the InceptionResNetV2 architecture accompanied by the Adam optimizer yielded the best results for the 3-class classification task, including an accuracy of 88.14% and micro- and macro-average AUC values of 0.98 and 0.97, respectively.
TABLE II
3-CLASS CLASSIFICATION PHASE 1 TEST ACCURACIES

Adam
Metric          ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy   0.795     0.8088  0.8309  0.602

Stochastic Gradient Descent
Metric          ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy   —         —       —       —
TABLE III
3-CLASS CLASSIFICATION PHASE 2 EXPERIMENTAL RESULTS ON 512×512 PIXEL INPUT IMAGES

Adam
Metric              VGG16   VGG19
Test Accuracy       0.7454  0.7528
Precision           0.74    0.77
Recall              0.75    0.76
Micro Average AUC   0.90    0.90
Macro Average AUC   0.88    0.89
F1-Score            0.7434  0.7616

Stochastic Gradient Descent
Metric              VGG16   VGG19
Test Accuracy       0.7472  —
Precision           0.76    —
Recall              0.76    —
Micro Average AUC   0.90    —
Macro Average AUC   0.89    —
F1-Score            0.7563  —
Table V shows the accuracies for the first phase of 5-class classification experimentation. It can be seen that the majority of the models are not able to pass roughly 70% testing accuracy. With the binary approach, merely training the end classifier and not the base transfer learning model itself yielded promising results. However, as the 5-class classification task requires more detailed classification by the model, especially between the mild and moderate classes and between the severe and proliferative classes, the previously mentioned adjustments were made before phase 2 of experiments. Table VI shows the results for phase 2 of experimentation for 5-class classification with 512×512 pixel images. Table VII shows the results for phase 2 of experimentation for 5-class classification with 299×299 pixel images for the Inception-V3 and InceptionResNetV2 models and 224×224 pixel images for the other models.
TABLE IV
3-CLASS CLASSIFICATION PHASE 2 EXPERIMENTAL RESULTS ON 224×224 AND 299×299 PIXEL INPUT IMAGES

Adam
Metric              ResNet50  VGG16   VGG19   Inception-V3  InceptionResNetV2
Test Accuracy       0.7849    0.3695  0.3548  0.8116        0.8814
Precision           0.79      0.12    0.12    0.83          —
Recall              0.78      0.33    0.33    0.81          —
Micro Average AUC   0.92      0.52    0.52    0.94          0.98
Macro Average AUC   0.90      0.50    0.50    0.94          0.97
F1-Score            0.7778    0.1772  0.1759  0.8046        —

Stochastic Gradient Descent
Metric              ResNet50  VGG16   VGG19   Inception-V3  InceptionResNetV2
Test Accuracy       0.3493    0.3419  0.3906  0.3125        0.4164
Precision           0.33      0.14    0.37    0.29          0.35
Recall              0.35      0.33    0.40    0.32          0.44
Micro Average AUC   0.49      0.47    0.55    0.50          0.52
Macro Average AUC   0.49      0.41    0.57    0.48          0.56
F1-Score            0.2413    0.1807  0.3483  0.2239        0.346

Fig. 6. Confusion matrix for the testing results of the best 3-class classifier.
Fig. 7. Receiver operating characteristic (ROC) curves for the testing results of the best 3-class classifier.

A confusion matrix in Fig. 8 and receiver operating characteristic (ROC) curves in Fig. 9 are shown for the 5-class classification experimental setup that yielded the best results. It was found that the InceptionResNetV2 architecture accompanied by the Adam optimizer yielded the best results for the 5-class classification task, including an accuracy of 85.02% and micro- and macro-average AUC values of 0.97.
TABLE V
5-CLASS CLASSIFICATION PHASE 1 TEST ACCURACIES

Adam
Metric          ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy   0.6374    —       —       —

Stochastic Gradient Descent
Metric          ResNet50  VGG16   VGG19   Inception-V3
Test Accuracy   0.6681    0.6519  0.6288  0.1832

TABLE VI
5-CLASS CLASSIFICATION PHASE 2 EXPERIMENTAL RESULTS ON 512×512 PIXEL INPUT IMAGES

Adam
Metric              VGG16   VGG19
Test Accuracy       —       —

Stochastic Gradient Descent
Metric              VGG16   VGG19
Test Accuracy       0.6853  0.7295
Precision           0.70    0.72
Recall              0.70    0.72
Micro Average AUC   0.92    0.93
Macro Average AUC   0.91    0.92
F1-Score            0.6931  0.7199

TABLE VII
5-CLASS CLASSIFICATION PHASE 2 EXPERIMENTAL RESULTS ON 224×224 AND 299×299 PIXEL INPUT IMAGES

Adam
Metric              ResNet50  VGG16   VGG19   Inception-V3  InceptionResNetV2
Test Accuracy       0.7893    0.2091  0.2042  0.6228        0.8502
Precision           0.80      0.04    0.04    0.55          —
Recall              0.79      0.20    0.20    0.63          —
Micro Average AUC   0.95      0.50    0.50    0.91          0.97
Macro Average AUC   0.94      0.50    0.50    0.88          0.97
F1-Score            0.7905    0.0692  0.0677  0.5746        —

Stochastic Gradient Descent
Metric              ResNet50  VGG16   VGG19   Inception-V3  InceptionResNetV2
Test Accuracy       0.222     0.2328  0.2548  0.1827        0.1994
Precision           0.35      0.16    0.38    0.18          0.13
Recall              0.21      0.24    0.25    0.19          0.19
Micro Average AUC   0.49      0.49    0.57    0.48          0.52
Macro Average AUC   0.50      0.49    0.53    0.47          0.55
F1-Score            0.1517    0.1595  0.1788  0.1042        0.1435

Fig. 8. Confusion matrix for the testing results of the best 5-class classifier.
Fig. 9. Receiver operating characteristic (ROC) curves for the testing results of the best 5-class classifier.
V. DISCUSSION
The results show that ResNet50 and InceptionResNetV2 have relatively better performance than the other evaluated models. A commonality between these models is the implementation of skip connections. Furthermore, using the default input size for a model also yields better results.

First, the implementation of skip connections may contribute to better performance by helping the model avoid the vanishing gradient problem (VGP). During training, backpropagation may result in vanishing gradients, especially in the earlier model layers. This could impact the model's overall ability to learn, as learning could slow down in those layers, hindering the model's performance potential. Skip connections may avoid the VGP because they allow gradients to flow between non-consecutive layers, thus skipping over layers that may include vanished gradients. Second, the use of model-specific default input sizes may contribute to better performance because changing the size of the input forces the model to learn from scratch, since the pre-trained parameters' sizes may not match the new input layer.
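The vanishing-gradient argument can be illustrated numerically. In a plain chain of layers the gradient magnitude is a product of per-layer factors, while a residual chain multiplies factors of the form (1 + g) because the identity path contributes a constant 1 to each layer's Jacobian. With illustrative per-layer factors below 1 (mimicking saturated activations), the plain product collapses while the residual one does not:

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 50
# Illustrative per-layer gradient factors, all below 1 (saturation regime).
layer_factors = rng.uniform(0.5, 0.9, size=depth)

plain_grad = np.prod(layer_factors)           # plain chain: product shrinks
residual_grad = np.prod(1.0 + layer_factors)  # skip path adds the identity's 1

print(plain_grad < 0.01, residual_grad > 1.0)  # True True
```

The factors here are toy values, not measured gradients; the point is only that the identity term keeps each factor above 1, so the product cannot vanish with depth.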
VI. CONCLUSION AND FUTURE WORK
Promising methods for binary, 3-class, and 5-class classification of DR have been demonstrated. Additional work can be conducted, especially for the multi-class classification tasks. Overall, it was found that, given the parameters of the conducted experiments, the ResNet50/Adam combination is best for the binary task, and the InceptionResNetV2/Adam combination is best for the 3-class and 5-class tasks.
A. Model Performance Improvement
In future research, other deep learning architectures and optimization techniques could be experimented with. Furthermore, the learning rate of the optimizer could be tuned further. A hybrid approach to layer freezing could also be taken, resulting in a mix of frozen and unfrozen layers. Lastly, based on the observed results, techniques explored with the multi-class tasks could be applied to the binary task in order to improve its results. However, high-accuracy 5-class classification is the ultimate goal, as it provides a medical professional with the most insight into the severity of the disease in an image, so more effort should be put towards it.
B. Implementation of a Multi-Stage Classification System
A multi-stage classification system could be developed, through which a series of binary classifications in a decision-tree-like manner leads to a 5-class classification. This could be a promising approach for DR detection, as binary classification has already shown extremely promising results. Sigmoid outputs of each binary model within the system would individually contribute to the final five-element prediction array, to which Argmax would be applied in order to determine the final classification by identifying the node with the highest value, indicative of the class with the highest probability.
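One way such a system could combine its binary stages is sketched below. The particular tree (no-DR vs. DR, then mild/moderate vs. severe/proliferative, then within each pair) and the multiplication of stage confidences are hypothetical design choices, not a specification from this paper.

```python
import numpy as np

def five_class_scores(p_dr, p_advanced, p_moderate, p_proliferative):
    # Hypothetical decision tree of binary sigmoid outputs:
    #   p_dr:            P(any DR)                              (stage 1)
    #   p_advanced:      P(severe or proliferative | DR)        (stage 2)
    #   p_moderate:      P(moderate | mild-or-moderate)         (stage 3a)
    #   p_proliferative: P(proliferative | severe-or-prolif.)   (stage 3b)
    scores = np.array([
        1.0 - p_dr,                                   # 0: no DR
        p_dr * (1 - p_advanced) * (1 - p_moderate),   # 1: mild
        p_dr * (1 - p_advanced) * p_moderate,         # 2: moderate
        p_dr * p_advanced * (1 - p_proliferative),    # 3: severe
        p_dr * p_advanced * p_proliferative,          # 4: proliferative
    ])
    return int(np.argmax(scores)), scores
```

Because the tree partitions the outcome space, the five scores sum to one, so the array behaves like the Softmax output of a single 5-class model and Argmax applies directly.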
C. Deployment of a Deep Learning-Based Medical Diagnostic Tool
There is a strong need for a reliable diagnostic tool for the detection of diabetic retinopathy given the global prevalence of the disease. Additionally, global regions with a lack of medical professionals could significantly benefit from such a tool. After undergoing clinical trials and testing, a cost-friendly deep learning-based diagnostic tool could be deployed through a cloud-based service in order to provide global access to a reliable and accurate detection system.

ACKNOWLEDGMENT