Analysis of skin lesion images with deep learning
Josef Steppan, Sten Hanke
Abstract—Skin cancer is the most common cancer worldwide, with melanoma being the deadliest form. Dermoscopy is a skin imaging modality that has shown an improvement in the diagnosis of skin cancer compared to unaided visual examination. We evaluate the current state of the art in the classification of dermoscopic images based on the ISIC-2019 Challenge for the classification of skin lesions and current literature. Various deep neural network architectures pre-trained on the ImageNet data set are adapted to a combined training data set comprised of publicly available dermoscopic and clinical images of skin lesions using transfer learning and model fine-tuning. The performance and applicability of these models for the detection of eight classes of skin lesions are examined. Real-time data augmentation, which uses random rotation, translation, shear, and zoom within specified bounds, is used to increase the number of available training samples. Model predictions are multiplied by inverse class frequencies and normalized to better approximate actual probability distributions. Overall prediction accuracy is further increased by using the arithmetic mean of the predictions of several independently trained models. The best single model has been published as a web service. The source code is publicly available at http://github.com/j05t/lesion-analysis
Index Terms—Lesion, Skin, Melanoma, Deep Learning
I. INTRODUCTION

Skin cancer is the most common cancer worldwide, with melanoma being the deadliest form. A later stage in the diagnosis of melanoma is associated with a strong increase in melanoma mortality within 5 years of diagnosis [1]. Early detection of melanoma can significantly reduce both morbidity and mortality [2]. The risk of dying from the disease is directly related to the depth of the cancer, which is directly related to the time it has been growing. Self-examination of the skin by patients, full-body skin examinations by a doctor, and patient education are the keys to early detection. Self-examiners are generally diagnosed with thinner melanomas than non-self-examiners (0.77 mm versus 0.95 mm) [3].

This paper evaluates the current state of the art in the classification of dermoscopic images based on the ISIC-2019 Challenge for the classification of skin lesions and current literature. Since medical image data sets often show a class imbalance, several approaches for the training of deep neural networks on imbalanced data sets have been reviewed. Because the training of deep neural networks requires a large amount of training data, further publicly available dermoscopic as well as clinical image data sets of skin lesions have been evaluated for expanding the ISIC-2019 training data
set. Since the heterogeneity of the image data of the ISIC data set requires preprocessing, a suitable approach towards preprocessing, as well as the effects of preprocessing on the achieved accuracy of trained networks, have been investigated. Furthermore, the potential of real-time data augmentation to increase the number of available training patterns during training and to improve the prediction accuracy at inference time has been investigated. Current ensembling strategies and an overview of current architectures of deep neural networks for the classification of image content have been reviewed.

J. Steppan was with the Department of eHealth at FH Joanneum University of Applied Sciences, Alte Poststrasse 149, 8020 Graz, AUSTRIA (e-mail: [email protected]). S. Hanke is with the Department of eHealth at FH Joanneum University of Applied Sciences, Alte Poststrasse 149, 8020 Graz (e-mail: [email protected]).

II. IMAGE CLASSIFICATION
Convolutional Neural Networks (CNNs) [4] are currently the state of the art in image classification and have been exceeding the recognition rate of human experts in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [5] since 2015 [6]. The ILSVRC evaluates algorithms for object recognition and image classification on a large scale. An important motivation is to enable researchers to compare progress in recognition for a wider variety of objects. Another motivation is to measure the progress of computer vision algorithms for classifying images on a large scale. The ImageNet training data set contains 1,000 categories and 1.2 million images. Image classification algorithms are compared using a test data set of 150,000 images in 1,000 categories. The highest accuracy rates are currently achieved by the architectures SENet-154 [7] (81.3% top-1 accuracy), PNASNet-5 Large [8] (82.9%), AmoebaNet-C [9], [10] (83.9%), and EfficientNet-B7 [11] (84.4%) [12]. Algorithms for classifying image content are constantly being improved. Deep learning has shown enormous potential in this area due to the constantly increasing amounts of data [13], [14]. Some deep learning approaches outperform teams of certified dermatologists in the detection of melanoma in dermoscopic images [15], [16], [17] or achieve equivalent detection rates [18], [19].

III. SKIN LESION DATASETS
A. ISIC-2019
To make specialist knowledge more widely available, the International Skin Imaging Collaboration developed the ISIC archive, an international repository for dermoscopic images, both for clinical training purposes and to support technical research on automated algorithmic analysis by hosting the ISIC Challenges. The training data set of the ISIC-2019 Challenge consists of several dermoscopic image databases: BCN 20000 [20], with dermoscopic images of the most common classes of skin lesions: actinic keratosis, squamous cell carcinoma, basal cell carcinoma, seborrheic keratosis, solar lentigo, and dermatological lesions; the HAM10000 dataset [21], with 600x450 images centered and cropped on lesions; and the MSK data set [22], with images of different resolutions. A total of 25,331 images are available for training in 8 different categories. The test data set consists of 8,238 images whose labels are not publicly available. Also, the test data set contains an additional outlier class that is not contained in the training data and must be identified by developed systems. Predictions on the ISIC-2019 test data set are assessed by an automatic evaluation system. The goal of the ISIC-2019 Challenge is to classify dermoscopic images among nine different diagnostic categories:

1) Melanoma (MEL)
2) Melanocytic nevus (NV)
3) Basal cell carcinoma (BCC)
4) Actinic keratosis (AK)
5) Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis) (BKL)
6) Dermatofibroma (DF)
7) Vascular lesion (VASC)
8) Squamous cell carcinoma (SCC)
9) None of the others (UNK)

B. PH2 database
The PH2 database [23] includes manual segmentation, clinical diagnosis, and the identification of multiple dermoscopic structures performed by experienced dermatologists in a set of 200 dermoscopic images. The images were obtained in the dermatology department of the Pedro Hispano hospital (Matosinhos, Portugal) under the same conditions with the Tuebinger Mole Analyzer system using 20-fold magnification. These are 8-bit RGB color images with a resolution of 768x560 pixels. The image database contains a total of 200 dermoscopic images of melanocytic lesions, including 80 common nevi, 80 atypical nevi, and 40 melanomas. The PH2 database contains a medical annotation of all images, namely a medical segmentation of the lesion, a clinical and histological diagnosis, as well as the evaluation of several dermoscopic criteria (colors; pigment network; dots/globules; streaks; regression areas; blue-whitish veil). The database was made freely available for research and benchmarking purposes.

C. Light Field Image Dataset of Skin Lesions
Faria et al. [24] present a contribution to the research community in the form of a publicly available data set of skin lesions, the "Light Field Image Dataset of Skin Lesions" (SKINL2). The dataset contains 250 light fields [25], which were recorded with a focused plenoptic camera and divided into eight clinical categories depending on the type of lesion. Each light field consists of 81 different views of the same lesion. The database also contains the dermoscopic image of each lesion. The data set offers great potential for the further development of medical imaging research and the development of new classification algorithms based on light fields as well as for clinically oriented dermatological studies; however, only the dermoscopic images contained in the data set are taken into account for this work.

D. SD-198
In contrast to dermoscopic images with largely constant lighting and low image disturbances, clinical images are often created with a large number of different image recording devices, such as digital cameras or smartphones. The SD-198 data set [26] contains 6,584 clinical images from 198 classes, which vary according to scale, color, shape, and structure. The SD-198 benchmark data set is intended to stimulate further research into the visual classification of skin diseases. The authors also carry out an extensive analysis of this data set using modern methods including CNNs. The ground truth labels of the images were created via DermQuest, with each image being examined by qualified experts and labeled with the name of its class. To ensure the quality of the labels, two experts were also invited to check the data set.

E. 7-point criteria evaluation database
Kawahara et al. [27] provide a database for evaluating the computerized image-based prediction of the 7-point checklist for malignant skin lesions. The seven-point checklist, published in 1998, is one of the best-validated dermoscopic algorithms due to its high sensitivity and specificity, even when used by non-specialists. The seven criteria were originally tested on 342 melanocytic lesions (117 melanomas and 225 atypical nevi) and selected for their frequent association with melanoma [28]. Three of them were defined as major criteria (atypical network, blue-whitish veil, and atypical vascular pattern), while the remaining four were considered minor (irregular streaks, irregular dots or globules, irregular pigmentation, and regression structures) [29]. The data set contains over 2000 clinical and dermoscopic color images as well as corresponding structured metadata that are tailored to the training and evaluation of CAD (Computer Aided Diagnostic) systems.

F. MED-NODE
The MED-NODE data set [30] consists of 70 melanoma and 100 nevus images from the digital image archive of the Department of Dermatology at the University Hospital Groningen (UMCG), which is used for the development and testing of the MED-NODE decision support system for the detection of skin cancer using macroscopic images. The system proposed by the authors achieves results with a diagnostic accuracy of 81%. The final classification was achieved by a majority vote of the predictions of several models. The dataset is publicly available.

[Fig. 1 data: ISIC-2019 25,331 (MEL 4,904; NV 13,704; BCC 3,378; AK 867; BKL 2,733; DF 294; VASC 282; SCC 628); SD-198 5,944 (UNK 5,958); MED-NODE 170; 7-point criteria database 1,011; PH2 200; SKINL2 92; Train 29,469; Valid 3,279]
Fig. 1. Combined training data set from the data sets ISIC-2019, PH2, Light Field Image Dataset of Skin Lesions, SD-198, the 7-point criteria evaluation database, and MED-NODE. The "UNK" category is mainly formed from data from the SD-198 dataset. The combined data set is divided into a training (90%) and validation data set (10%), so 29,469 images are available for training and 3,279 images for assessing the generalizability of the predictions and for adapting hyperparameters on the validation data set. The ISIC-2019 test data set consists of 8,238 images whose labels are not publicly available. The test data set is not used for training or parameter adjustment.
IV. COMBINED TRAINING DATASET
A combined training data set has been created from all the data sets described in Section III. 32,748 images are available for training in total. Images from SD-198 were used exclusively for the creation of training data for the "UNK" class, after prior removal of image data from the eight categories of the ISIC-2019 training data set. The combined data set is still heavily imbalanced (Figure 1).

V. METHODOLOGY
A. Preprocessing
Training and test data of the ISIC-2019 dataset have been preprocessed to remove black areas surrounding dermoscopic images, and subsequently rescaled maintaining aspect ratio (Figure 2). Descriptive text appended to images in the SD-198 dataset has been removed.
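As an illustration, the border-removal step can be sketched with a simple intensity scan; the function name and the brightness threshold are assumptions for illustration, not taken from the paper's implementation.

```python
import numpy as np

def crop_black_borders(img, thresh=10):
    """Crop near-black borders from an RGB image array of shape (H, W, 3).

    A row/column is kept if the brightest pixel in it (mean over channels)
    exceeds `thresh`. The threshold value is an illustrative assumption.
    """
    gray = img.mean(axis=2)                 # (H, W) per-pixel brightness
    rows = np.where(gray.max(axis=1) > thresh)[0]
    cols = np.where(gray.max(axis=0) > thresh)[0]
    if rows.size == 0 or cols.size == 0:    # fully black image: leave as-is
        return img
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# toy example: a bright 4x4 lesion patch surrounded by a black frame
img = np.zeros((8, 8, 3), dtype=np.uint8)
img[2:6, 2:6] = 200
cropped = crop_black_borders(img)
print(cropped.shape)  # (4, 4, 3)
```

After cropping, the image would be rescaled to the model input size while preserving the aspect ratio, as described above.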
Fig. 2. Preprocessing of the ISIC-2019 dataset. Black image borders are detected and removed. The top row shows images of the original training data set; preprocessed images are shown below.

Fig. 3. Applied augmentations for a single training image. Random rotation, translation in the x and y directions as well as scaling within defined limits avoid overfitting on the training data and enable a better generalization of the model. The augmentation parameters used are: max_rotate=45, p_affine=0.5, do_flip=True, flip_vert=True, max_zoom=1.05, max_lighting=0.2, crop_pad(input_size), cutout(n_holes=(1,1), length=(16,16), p=.5).
B. Data Augmentation
To avoid overfitting [31] in neural networks, dropout [32] is often used. Another simple method for regularization (and expansion of the number of different training samples) of CNNs is data augmentation. During training, input data is changed randomly according to certain criteria (translation, rotation, scaling, etc.). Additionally, Cutout [33] has been used for regularization. Figure 3 shows the applied augmentations.
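The translation, flip, and Cutout parts of such a pipeline can be sketched in plain NumPy; rotation, shear, and zoom are omitted here since they are typically delegated to an image-processing library, and all parameter values are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, max_shift=10, cutout_len=16, p_flip=0.5, p_cutout=0.5):
    """Randomly translate, flip, and apply Cutout to an image array."""
    h, w = img.shape[:2]
    out = img.copy()
    # random translation within +/- max_shift pixels (wrap-around for simplicity)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, (dy, dx), axis=(0, 1))
    # random horizontal / vertical flips
    if rng.random() < p_flip:
        out = out[:, ::-1]
    if rng.random() < p_flip:
        out = out[::-1, :]
    # Cutout: zero out a random square patch
    if rng.random() < p_cutout:
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        y0, y1 = max(0, cy - cutout_len // 2), min(h, cy + cutout_len // 2)
        x0, x1 = max(0, cx - cutout_len // 2), min(w, cx + cutout_len // 2)
        out[y0:y1, x0:x1] = 0
    return out

# a fresh randomly augmented batch is produced every epoch ("real-time")
batch = [augment(np.ones((64, 64, 3), dtype=np.float32)) for _ in range(4)]
```

Because the augmentations are sampled anew for every pass over the data, the network effectively never sees the exact same image twice.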
C. Out of Distribution Detection
Neural networks offer little or no guarantee of reliable prediction when applied to data that was not generated through the same process that was used to create the network's training data. With such Out-of-Distribution (OOD) inputs, the prediction may not only be incorrect but also associated with a high level of confidence [34], [35] of the network, which restricts the reliability of deep learning classifiers in real-world applications. Often the predictions of (ensembles of) classifiers that have been trained on in-distribution data are examined for the presence of OOD inputs using statistical methods [36], [37]. Alternatively, the input distribution can be modeled directly by using generative models that do not require the presence of class labels. However, it has been shown that this method can also output higher probabilities on OOD inputs than on in-distribution inputs [38]. In the ISIC-2019 Challenge, classes that are not included in the training data set should be detected as OOD and recognized as class "UNK". In this work, a data-driven approach to the recognition of OOD inputs is pursued by using images (mostly from SD-198, see subsection III-D) as training data for the "UNK" class that are not labeled as one of the classes of the ISIC-2019 training data set. However, this approach is far from optimal, and OOD detection in deep learning classifiers remains an unsolved problem. Further work is needed to improve classifier performance regarding OOD detection.
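One common statistical baseline from the family cited above (not the data-driven approach ultimately used in this work) flags inputs whose maximum softmax probability falls below a threshold; the threshold value here is an illustrative assumption:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def flag_ood(logits, threshold=0.5):
    """Flag inputs whose maximum softmax probability is below `threshold`
    as out-of-distribution. The threshold is an illustrative assumption."""
    return softmax(logits).max(axis=1) < threshold

logits = np.array([[8.0, 0.1, 0.2],    # peaked output -> in-distribution
                   [0.4, 0.5, 0.45]])  # flat output   -> flagged as OOD
print(flag_ood(logits))  # [False  True]
```

The weakness noted in the text applies here as well: a network can produce a peaked softmax on an OOD input, so a low-confidence threshold alone is not a reliable detector.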
D. Dataset Imbalance
A common problem with deep learning-based applications is the fact that some classes have a significantly higher number of samples in the training set than other classes. This difference is known as class imbalance. There are many examples in areas such as computer vision [39], [40], [41], [42], [43], medical diagnosis [44], [45], fraud detection [46], and others [47], [48], [49] where this problem is highly significant and the incidence of one class (e.g. cancer) can be 1000 times less than that of another class (e.g. healthy patient) [50]. It has been shown that a class imbalance in training data sets can have a significant adverse effect on the training of traditional classifiers [51], including classic neural networks or multilayer perceptrons [52]. Class imbalance influences both the convergence of neural networks during the training phase and the generalization of a model to real or test data [50].
1) Undersampling / Oversampling: Undersampling and oversampling in data analysis are techniques to adjust the class distribution of a data set (i.e. the relationship between the different classes/categories represented). These terms are used in statistical sampling, survey design methodology, and machine learning. The goal of undersampling and oversampling is to create a balanced data set. Many machine learning techniques, such as neural networks, make more reliable predictions when trained on balanced data. Oversampling is generally used more often than undersampling. The reasons for using undersampling are mainly practical and often resource-dependent. With random oversampling, the training data is supplemented by multiple copies of samples from minority classes. This is one of the earliest proposed methods, which has also proven robust [53]. Instead of duplicating minority class samples, some of them can be chosen at random with substitution. Other methods of handling unbalanced data sets such as synthetic oversampling [54] are more suitable for traditional machine learning tasks [55] and were therefore not considered any further in this work.
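Random oversampling as described above can be sketched as follows; the helper name and the toy label set are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def oversample_indices(labels):
    """Randomly oversample (with replacement) so that every class
    contributes as many samples as the largest class."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        members = np.where(labels == c)[0]
        idx.append(rng.choice(members, size=target, replace=True))
    return np.concatenate(idx)

# toy imbalanced label set: 6 "NV" samples vs. 2 "MEL" samples
labels = ["NV"] * 6 + ["MEL"] * 2
idx = oversample_indices(labels)
balanced = np.asarray(labels)[idx]
print(np.unique(balanced, return_counts=True))  # 6 samples of each class
```

In practice the returned index array would drive a sampler that feeds mini-batches to the network, so minority-class images (after augmentation) appear more often per epoch.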
2) Weighted Cross-Entropy Loss: Weighted cross-entropy [56] is useful for training neural networks on unbalanced data sets. [57] suggest adding a margin-based loss value to the cross-entropy on in-distribution training patterns in order to ensure a minimum difference in average entropy between in-distribution and out-of-distribution data. This ensemble-based method is intended to surpass previous methods of recognizing out-of-distribution inputs such as ODIN [58]. Cross-entropy can be described as

L(x, y) = -log( exp(x[y]) / Σ_j exp(x[j]) ) = -x[y] + log( Σ_j exp(x[j]) )

or, by using class weights:

L(x, y) = W[y] ( -x[y] + log( Σ_j exp(x[j]) ) )

The arithmetic mean of the resulting loss values is calculated for each mini-batch. A weight vector can be calculated using effective class weights [59]: the effective number of samples of a class with n samples is (1 - β^n) / (1 - β), and the class weight is proportional to its inverse. The hyperparameter β was set to 0.999 (a choice of β equal to zero would not apply any weighting, while β approaching 1 corresponds to weighting by the inverse class frequency). In the simplest case, loss values can be weighted by multiplying by inverse class frequencies.
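A minimal sketch of the weighted cross-entropy and the effective-number class weights, following the formulas above; the function names and example class counts are illustrative:

```python
import numpy as np

def effective_number_weights(counts, beta=0.999):
    """Class weights from the effective number of samples [59]:
    E_c = (1 - beta**n_c) / (1 - beta), with W[c] proportional to 1/E_c,
    normalized here so the weights sum to the number of classes."""
    counts = np.asarray(counts, dtype=np.float64)
    eff_num = (1.0 - beta ** counts) / (1.0 - beta)
    w = 1.0 / eff_num
    return w * len(counts) / w.sum()

def weighted_cross_entropy(logits, y, weights):
    """W[y] * (-x[y] + log(sum_j exp(x[j]))), averaged over the mini-batch."""
    m = logits.max(axis=1)                                # for numerical stability
    log_sum_exp = np.log(np.exp(logits - m[:, None]).sum(axis=1)) + m
    losses = weights[y] * (-logits[np.arange(len(y)), y] + log_sum_exp)
    return losses.mean()

counts = [4904, 13704, 294]           # e.g. MEL, NV, DF sample counts
w = effective_number_weights(counts)  # the rare class receives the largest weight
logits = np.array([[2.0, 0.5, 0.1], [0.2, 3.0, 0.4]])
y = np.array([0, 1])
loss = weighted_cross_entropy(logits, y, w)
```

With β = 0.999 the weights grow much more slowly than pure inverse class frequency, which avoids overweighting extremely rare classes.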
3) Thresholding: Also referred to as threshold shifting or rescaling, thresholding adapts the decision threshold of a classifier. This method is used at inference time and involves changing the output class probabilities. There are several ways in which the network outputs can be rescaled. In general, an optimization algorithm can be used to configure the network to minimize any criterion [60]. The simplest method only compensates for a priori class probabilities [61]. It has been shown that neural networks estimate Bayesian a posteriori probabilities [61]. That is, for a given data point x, the output for class c is implicitly y_c(x) = p(c|x) = p(c) p(x|c) / p(x). The actual probabilities of class membership can therefore be approximated by dividing the output of the network by the estimated a priori probability p(c) = |c| / Σ_k |k|, where |c| is the number of samples of class c [50]. The resulting class probabilities are normalized after thresholding is applied. This simple method of handling an existing class imbalance can significantly improve the classifier's approximation of the class probability distribution.

E. Transfer Learning
Transfer learning in the context of machine learning is a technique that uses information obtained from solving a problem and applies it to a similar problem. When using transfer learning, a model that has already been trained on another data set is adapted to custom data. Ideally, the pre-trained model has been trained on similar data, but this is not strictly necessary. The final layers of the network are removed and replaced by output layers featuring appropriate dimensions. The model is then trained on the custom data. By using transfer learning, the time required for training a network can be greatly reduced [62], [63], [64]. The existing pre-trained model thus serves as a feature extractor, which forwards features such as edges, texture, and the position of recognized objects to the last layer for classification. A softmax function (normalized exponential function) transforms the network output into a vector of numbers between zero and one that sum up to one, which allows interpreting the output of the network as a probability distribution.
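The softmax output described here, combined with the prior-correction from the thresholding subsection, can be sketched as follows; the function names and class counts are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def rescale_by_priors(probs, class_counts):
    """Divide network outputs by the estimated prior p(c) = |c| / sum_k |k|
    and renormalize, as described in the thresholding subsection."""
    priors = np.asarray(class_counts, dtype=np.float64)
    priors = priors / priors.sum()
    adjusted = probs / priors
    return adjusted / adjusted.sum(axis=1, keepdims=True)

probs = softmax(np.array([[1.0, 1.0, 1.0]]))     # uniform softmax output
rescaled = rescale_by_priors(probs, [800, 150, 50])
# after rescaling, the rarest class receives the highest adjusted probability
print(rescaled.argmax(axis=1))  # [2]
```

This is exactly the inference-time rescaling by inverse class frequencies mentioned in the abstract: a classifier that merely echoes the training priors is pulled back toward a uniform posterior.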
F. Test Time Augmentation
Data augmentation is a technique widely used to improve neural network training performance and reduce generalization errors. The same image data augmentation technique can also be used at inference time to allow the model to make predictions for several different versions of each image in the test data. Test Time Augmentation (TTA) predictions are formed by averaging the regular predictions (with a weighting of beta=0.4) with the average of the predictions obtained by predicting on augmented versions of the image data (with a weighting of 1-beta). The transformations specified for the training set are applied with the following changes: Scaling with a factor of 1.05 controls the scaling for the zoom (which is not random for TTA). Furthermore, the cropping is not random, to ensure that the four corners of the picture are used. Reflection is not random but is applied once to each of these corner images (so that a total of 8 augmented versions are created).
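The TTA averaging rule described above can be sketched as follows; the toy classifier and the specific augmentations are illustrative assumptions (the actual pipeline uses the fixed zoom, corner crops, and reflections described in the text):

```python
import numpy as np

def tta_predict(predict_fn, image, augment_fns, beta=0.4):
    """Test Time Augmentation: combine the prediction on the original image
    (weight beta) with the mean prediction over augmented versions of it
    (weight 1 - beta)."""
    base = predict_fn(image)
    aug = np.mean([predict_fn(f(image)) for f in augment_fns], axis=0)
    return beta * base + (1.0 - beta) * aug

# hypothetical stand-in classifier and deterministic flip augmentations
predict_fn = lambda img: np.array([0.7, 0.2, 0.1])
augment_fns = [np.fliplr, np.flipud, lambda x: x[::-1, ::-1]]
pred = tta_predict(predict_fn, np.ones((4, 4)), augment_fns)
# pred remains a valid probability vector (sums to ~1.0)
```

Because the augmented views are deterministic at test time, TTA predictions are reproducible, unlike training-time augmentation.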
G. Ensembling
Ensembling is the use of several independently trained models to form an overall prediction. The basic idea of ensembling is that individual models have weaknesses in different areas, which are compensated by the combination with predictions of other independently trained models. Possible ensembling strategies are e.g. majority voting, the use of a weighted average based on classifier confidences, or simply using the arithmetic mean of several predictions of different models and model architectures [65].

VI. EXPERIMENTS
The CNN architectures Inception-ResNet-v2 [66], SE-ResNeXt-101 (32x4d) [7], NASNet-A-Large [8], EfficientNet-B4 and EfficientNet-B5 [11], pre-trained on the ImageNet data set, were adapted for the task of classifying the nine classes of the ISIC-2019 Challenge by replacing the final layers with a custom linear layer to output nine class probabilities. Real-time data augmentation has been used to improve the generalizability of the resulting models. Models have been trained on an NVIDIA GTX 1070 GPU. Batch sizes (the number of training samples that are used for a single forward pass) were adapted to individual architectures and input sizes to achieve optimal utilization of the available video memory. Images have been resized to fit model input sizes prior to training.

Models have been trained via transfer learning over 32 epochs, followed by model fine-tuning using differential learning rates until convergence using the One Cycle Policy [67], allowing very rapid convergence of the trained networks [68]. Appropriate learning rates were determined manually at regular intervals. The use of a weighted loss function has, contrary to expectations, only proven to be advantageous for training the NASNet-A-Large architecture, which was unable to converge without applying weighted loss. Other architectures could not benefit from training using a weighted loss function. Early stopping has been applied to avoid model overfitting. The best models have been selected based on their performance on the validation data. Out-of-distribution detection using thresholding proved to provide inferior results to the data-driven approach described in V-C.

The unsatisfactory balanced multiclass accuracy of the NASNet model may be caused by the relatively small batch size, which was limited to four due to the size of the model. As expected, improved performance of deep neural networks in the classification of ImageNet data can be directly translated to models trained on custom data sets. Improved CNN architectures, which achieve higher accuracy in the classification of the ImageNet data set, thus also provide better results in the classification of dermoscopic images.

Rescaling the outputs of the models by multiplying the output probabilities by inverse class frequencies has proven to be advantageous for the balanced multiclass accuracy of the network predictions in all cases where no weighted loss function has been used. Applying rescaling on models trained
using a weighted loss function did not improve balanced multiclass prediction accuracy. The outputs of several independently trained models were combined into an overall prediction using the arithmetic mean of all model predictions and transmitted to the automated evaluation system of the ISIC-2019 Challenge.

Table I shows results for individual models. The best performing models were used to form ensemble predictions. NASNet-A-Large was not included in the ensemble due to the unsatisfactory overall accuracy achieved. Although EfficientNet shows the best results of all trained network architectures, the combination with predictions from SE-ResNeXt-101 (32x4d) and Inception-ResNet-v2 models still leads to a higher average accuracy than any single model could achieve independently.

Table II shows metrics for the ensemble with 0.634 balanced multiclass accuracy, as computed by the ISIC challenge website. AUC: area under the receiver operating characteristic (ROC) curve. AUC, Sens > 80%: AUC restricted to the region of at least 80% sensitivity. Accuracy = sensitivity * prevalence + specificity * (1 - prevalence). Sensitivity (recall) measures true-positive predictions, specificity measures true-negative predictions of the classifier. The F1 score (Dice coefficient) is the harmonic mean of precision and recall, with an F1 score reaching its best value at 1 (perfect precision and recall). The F1 score is also known as the Sørensen-Dice coefficient or Dice similarity coefficient (DSC). The positive predictive value (PPV) is the likelihood that subjects who test positive actually have the disease.

TABLE I
SINGLE MODEL, ENSEMBLE BALANCED ACCURACY

Architecture                    Accuracy
EfficientNet-B5                 0.600
SE-ResNeXt-101 (32x4d)          0.582
EfficientNet-B4                 0.577
Inception-ResNet-v2             0.569
NASNet-A-Large                  0.504
Ensemble (excluding NASNet)     0.634

TABLE II
METRICS (ENSEMBLE)

Metric            Mean   MEL   NV    BCC   AK    BKL   DF    VASC  SCC   UNK
AUC               .902   .924  .957  .942  .917  .893  .977  .932  .936  .638
AUC, Sens > 80%   .813   .853  .926  .883  .829  .776  .966  .868  .876  .336
Avg. Precision    .561   .766  .923  .719  .366  .572  .586  .502  .326  .285
Accuracy          .923   .899  .894  .908  .933  .933  .983  .978  .969  .808
Sensitivity       .525   .581  .752  .666  .580  .384  .744  .614  .408  .00
Specificity       .973   .963  .962  .944  .952  .985  .986  .983  .982  1.00
Dice Coeff.       .491   .659  .821  .654  .468  .499  .523  .434  .364  .00
PPV               .609   .760  .905  .642  .392  .713  .404  .335  .328  1.00
NPV               .941   .919  .890  .950  .977  .944  .997  .995  .987  .808
The negative predictive value (NPV) is the likelihood that subjects who test negative really do not have the disease. Figure 4 shows the receiver operating characteristic curve for the ensemble.

Fig. 4. ROC curve for the 0.634 balanced multiclass accuracy ensemble. The ROC curve shows the diagnostic capability of a binary classifier as its decision threshold varies. The ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also referred to as sensitivity, recall, or detection probability, whereas the FPR corresponds to the false positive rate (1 - specificity).
VII. CONCLUSION
Deep learning has become a mature technology for the classification of image content and can achieve accuracy similar or superior to that of human experts in the classification of skin lesions. The use of deep learning applications that automatically evaluate clinical and dermoscopic images and classify skin lesions offers great potential for improving and implementing prevention and screening measures and increasing their efficiency. One of the main criticisms of deep learning applications, that these networks have to be treated as a black box and that there is no easy explanation of how they form their decisions, remains unchanged despite some progress in the visualization of network activations. Careful validation of trained models using real-world data sets before and also during use is essential. Progress in the development of more efficient architectures of deep neural networks and improved accuracy in the classification of images with high image quality does not automatically mean that results can be transferred to real-world applications. For instance, [69] examined the use of a classification system created by Google researchers to detect diabetic retinopathy in 11 clinics in Thailand and found that this technology does not yet work well in practice despite all the research advances. Advantages of deep learning applications in the medical field are the rapid availability of diagnoses compared to analysis by human specialists and the cost-effective provisioning of models for large numbers of simultaneous users. Central provisioning of deep learning models allows uncomplicated and transparent delivery of improved models without having to make changes to client software. Cloud applications can serve current deep learning models cost-effectively through automatic horizontal scaling of active services and flexible price calculations. Also, deep learning applications can help nursing staff to better argue their own assessments to specialists and to prioritize urgent cases accordingly.
Even if decisions made by deep learning models still have to be manually verified by human experts, automated image classifiers can support these human experts and reduce the workload by accelerating decision-making processes, therefore contributing to a more efficient utilization of the resources of health systems.

REFERENCES
[1] K. J. Wernli, N. B. Henrikson, C. C. Morrison, M. Nguyen, G. Pocobelli, and P. R. Blasi, "Screening for skin cancer in adults: updated evidence report and systematic review for the US Preventive Services Task Force," JAMA, vol. 316, no. 4, pp. 436–447, 2016.
[2] L. F. di Ruffano, Y. Takwoingi, J. Dinnes, N. Chuchu, S. E. Bayliss, C. Davenport, R. N. Matin, K. Godfrey, C. O'Sullivan, A. Gulati et al., "Computer-assisted diagnosis techniques (dermoscopy and spectroscopy-based) for diagnosing skin cancer in adults," Cochrane Database of Systematic Reviews, no. 12, 2018.
[3] P. Carli, V. De Giorgi, D. Palli, A. Maurichi, P. Mulas, C. Orlandi, G. L. Imberti, I. Stanganelli, P. Soma, D. Dioguardi et al., "Dermatologist detection and skin self-examination are associated with thinner melanomas: results from a survey of the Italian Multidisciplinary Group on Melanoma," Archives of Dermatology, vol. 139, no. 5, pp. 607–612, 2003.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[6] C. Langlotz, B. Allen, B. Erickson, J. Kalpathy-Cramer, K. Bigelow, T. Cook, A. Flanders, M. Lungren, D. Mendelson, J. Rudie, G. Wang, and K. Kandarpa, "A roadmap for foundational research on artificial intelligence in medical imaging: From the 2018 NIH/RSNA/ACR/The Academy workshop," Radiology, vol. 291, p. 190613, 2019.
[7] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[8] C. Liu, B. Zoph, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," CoRR, vol. abs/1712.00559, 2017. [Online]. Available: http://arxiv.org/abs/1712.00559
[9] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in The European Conference on Computer Vision (ECCV), September 2018.
[10] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
[11] M. Tan and Q. V. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," CoRR, vol. abs/1905.11946, 2019. [Online]. Available: http://arxiv.org/abs/1905.11946
[12] S. Bianco, R. Cadene, L. Celona, and P. Napoletano, "Benchmark analysis of representative deep neural network architectures," IEEE Access, vol. 6, pp. 64270–64277, 2018.
[13] X. Cui, R. Wei, L. Gong, R. Qi, Z. Zhao, H. Chen, K. Song, A. A. Abdulrahman, Y. Wang, J. Z. Chen et al., "Assessing the effectiveness of artificial intelligence methods for melanoma: A retrospective review," Journal of the American Academy of Dermatology, vol. 81, no. 5, pp. 1176–1180, 2019.
[14] Y. Fujisawa, S. Inoue, and Y. Nakamura, "The possibility of deep learning-based, computer-aided skin tumor classifiers," Frontiers in Medicine, vol. 6, p. 191, 2019.
[15] A. Hekler, J. S. Utikal, A. H. Enk, A. Hauschild, M. Weichenthal, R. C. Maron, C. Berking, S. Haferkamp, J. Klode, D. Schadendorf et al., "Superior skin cancer classification by the combination of human and artificial intelligence," European Journal of Cancer, vol. 120, pp. 114–121, 2019.
[16] R. C. Maron, M. Weichenthal, J. S. Utikal, A. Hekler, C. Berking, A. Hauschild, A. H. Enk, S. Haferkamp, J. Klode, D. Schadendorf et al., "Systematic outperformance of 112 dermatologists in multiclass skin cancer image classification by convolutional neural networks," European Journal of Cancer, vol. 119, pp. 57–65, 2019.
[17] T. J. Brinker, A. Hekler, A. H. Enk, J. Klode, A. Hauschild, C. Berking, B. Schilling, S. Haferkamp, D. Schadendorf, T. Holland-Letz et al., "Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task,"
EuropeanJournal of Cancer , vol. 113, pp. 47–54, 2019.[18] A. Blum, H. Luedtke, U. Ellwanger, R. Schwabe, G. Rassner, andC. Garbe, “Digital image analysis for diagnosis of cutaneous melanoma.development of a highly effective computer algorithm based on analysisof 837 melanocytic lesions,”
British Journal of Dermatology , vol. 151,no. 5, pp. 1029–1038, 2004.[19] M. Zortea, T. R. Schopf, K. Thon, M. Geilhufe, K. Hindberg, H. Kirch-esch, K. Møllersen, J. Schulz, S. O. Skrøvseth, and F. Godtliebsen,“Performance of a dermoscopy-based computer vision system for thediagnosis of pigmented skin lesions compared with visual evaluation byexperienced dermatologists,”
Artificial intelligence in medicine , vol. 60,no. 1, pp. 13–26, 2014.[20] M. Combalia, N. C. Codella, V. Rotemberg, B. Helba, V. Vilaplana,O. Reiter, A. C. Halpern, S. Puig, and J. Malvehy, “Bcn20000: Dermo-scopic lesions in the wild,” arXiv preprint arXiv:1908.02288 , 2019.[21] P. Tschandl, C. Rosendahl, and H. Kittler, “The ham10000 dataset,a large collection of multi-source dermatoscopic images of commonpigmented skin lesions,”
Scientific data , vol. 5, p. 180161, 2018.[22] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti,S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler et al. ,“Skin lesion analysis toward melanoma detection: A challenge at the2017 international symposium on biomedical imaging (isbi), hosted bythe international skin imaging collaboration (isic),” in . IEEE,2018, pp. 168–172.[23] T. Mendonc¸a, P. M. Ferreira, J. S. Marques, A. R. Marcal, and J. Rozeira,“Ph 2-a dermoscopic image database for research and benchmarking,”in . IEEE, 2013, pp. 5437–5440.[24] S. M. de Faria, J. N. Filipe, P. M. Pereira, L. M. Tavora, P. A. Assuncao,M. O. Santos, R. Fonseca-Pinto, F. Santiago, V. Dominguez, andM. Henrique, “Light field image dataset of skin lesions,” in . IEEE, 2019, pp. 3905–3908.[25] G. Wu, B. Masia, A. Jarabo, Y. Zhang, L. Wang, Q. Dai, T. Chai, andY. Liu, “Light field image processing: An overview,”
IEEE Journal ofSelected Topics in Signal Processing , vol. 11, no. 7, pp. 926–954, 2017.[26] X. Sun, J. Yang, M. Sun, and K. Wang, “A benchmark for automaticvisual classification of clinical skin disease images,” in
European Con-ference on Computer Vision . Springer, 2016, pp. 206–222.[27] J. Kawahara, S. Daneshvar, G. Argenziano, and G. Hamarneh, “Seven-point checklist and skin lesion classification using multitask multimodalneural nets,”
IEEE Journal of Biomedical and Health Informatics ,vol. 23, no. 2, pp. 538–546, 2019.[28] G. Argenziano, G. Fabbrocini, P. Carli, V. De Giorgi, E. Sammarco, andM. Delfino, “Epiluminescence microscopy for the diagnosis of doubtfulmelanocytic skin lesions: comparison of the abcd rule of dermatoscopyand a new 7-point checklist based on pattern analysis,”
Archives ofdermatology , vol. 134, no. 12, pp. 1563–1570, 1998.[29] H. Kittler, A. A. Marghoob, G. Argenziano, C. Carrera, C. Curiel-Lewandrowski, R. Hofmann-Wellenhof, J. Malvehy, S. Menzies,S. Puig, H. Rabinovitz et al. , “Standardization of terminology indermoscopy/dermatoscopy: results of the third consensus conferenceof the international society of dermoscopy,”
Journal of the AmericanAcademy of Dermatology , vol. 74, no. 6, pp. 1093–1106, 2016.[30] I. Giotis, N. Molders, S. Land, M. Biehl, M. F. Jonkman, and N. Petkov,“Med-node: a computer-assisted melanoma diagnosis system using non-dermoscopic images,”
Expert systems with applications , vol. 42, no. 19,pp. 6578–6585, 2015.[31] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R.Salakhutdinov, “Improving neural networks by preventing co-adaptationof feature detectors,” arXiv preprint arXiv:1207.0580 , 2012. [Online].Available: http://arxiv.org/abs/1207.0580[32] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov, “Dropout: a simple way to prevent neural networksfrom overfitting.”
Journal of machine learning research , vol. 15, no. 1,pp. 1929–1958, 2014. [33] T. DeVries and G. W. Taylor, “Improved regularization of convolutionalneural networks with cutout,” arXiv preprint arXiv:1708.04552 , 2017.[34] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessingadversarial examples,” arXiv preprint arXiv:1412.6572 , 2014.[35] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easilyfooled: High confidence predictions for unrecognizable images,” in
Proceedings of the IEEE conference on computer vision and patternrecognition , 2015, pp. 427–436.[36] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassifiedand out-of-distribution examples in neural networks,” arXiv preprintarXiv:1610.02136 , 2016.[37] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalablepredictive uncertainty estimation using deep ensembles,” in
Advances inneural information processing systems , 2017, pp. 6402–6413.[38] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon,and B. Lakshminarayanan, “Likelihood ratios for out-of-distributiondetection,” in
Advances in Neural Information Processing Systems , 2019,pp. 14 680–14 691.[39] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard,H. Adam, P. Perona, and S. Belongie, “The inaturalist species classifi-cation and detection dataset,” in
Proceedings of the IEEE conference oncomputer vision and pattern recognition , 2018, pp. 8769–8778.[40] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sundatabase: Large-scale scene recognition from abbey to zoo,” in . IEEE, 2010, pp. 3485–3492.[41] B. A. Johnson, R. Tateishi, and N. T. Hoan, “A hybrid pansharpen-ing approach and multiscale object-based image analysis for mappingdiseased pine and oak trees,”
International journal of remote sensing ,vol. 34, no. 20, pp. 6969–6982, 2013.[42] M. Kubat, R. C. Holte, and S. Matwin, “Machine learning for thedetection of oil spills in satellite radar images,”
Machine learning ,vol. 30, no. 2-3, pp. 195–215, 1998.[43] O. Beijbom, P. J. Edmunds, D. I. Kline, B. G. Mitchell, and D. Krieg-man, “Automated annotation of coral reef survey images,” in . IEEE, 2012,pp. 1170–1177.[44] J. W. Grzymala-Busse, L. K. Goodwin, W. J. Grzymala-Busse, andX. Zheng, “An approach to imbalanced data sets based on changing rulestrength,” in
Rough-neural computing . Springer, 2004, pp. 543–553.[45] B. Mac Namee, P. Cunningham, S. Byrne, and O. I. Corrigan, “Theproblem of bias in training data in regression problems in medicaldecision support,”
Artificial intelligence in medicine , vol. 24, no. 1, pp.51–70, 2002.[46] K. Philip and S. Chan, “Toward scalable learning with non-uniformclass and cost distributions: A case study in credit card fraud detection,”in
Proceeding of the Fourth International Conference on KnowledgeDiscovery and Data Mining , 1998, pp. 164–168.[47] P. Radivojac, N. V. Chawla, A. K. Dunker, and Z. Obradovic, “Clas-sification and knowledge discovery in protein databases,”
Journal ofBiomedical Informatics , vol. 37, no. 4, pp. 224–239, 2004.[48] C. Cardie and N. Nowe, “Improving minority class prediction usingcase-specific feature weights,” in
Proceedings of the Fourteenth Interna-tional Conference on Machine Learning , ser. ICML ’97. San Francisco,CA, USA: Morgan Kaufmann Publishers Inc., 1997, p. 57–65.[49] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, andG. Bing, “Learning from class-imbalanced data: Review of methods andapplications,”
Expert Systems with Applications , vol. 73, pp. 220–239,2017.[50] M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study ofthe class imbalance problem in convolutional neural networks,”
NeuralNetworks , vol. 106, pp. 249–259, 2018.[51] N. Japkowicz and S. Stephen, “The class imbalance problem: A system-atic study,”
Intelligent data analysis , vol. 6, no. 5, pp. 429–449, 2002.[52] M. A. Mazurowski, P. A. Habas, J. M. Zurada, J. Y. Lo, J. A. Baker,and G. D. Tourassi, “Training neural network classifiers for medicaldecision making: The effects of imbalanced datasets on classificationperformance,”
Neural networks , vol. 21, no. 2-3, pp. 427–436, 2008.[53] C. X. Ling and C. Li, “Data mining for direct marketing: Problems andsolutions.” in
Kdd , vol. 98, 1998, pp. 73–79.[54] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote:synthetic minority over-sampling technique,”
Journal of artificial intel-ligence research , vol. 16, pp. 321–357, 2002.[55] A. Fern´andez, S. Garcia, F. Herrera, and N. V. Chawla, “Smote forlearning from imbalanced data: progress and challenges, marking the15-year anniversary,”
Journal of artificial intelligence research , vol. 61,pp. 863–905, 2018. [56] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networksfor biomedical image segmentation,” in
International Conference onMedical image computing and computer-assisted intervention . Springer,2015, pp. 234–241.[57] A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. Willke,“Out-of-distribution detection using an ensemble of self supervisedleave-out classifiers,” in
Proceedings of the European Conference onComputer Vision (ECCV) , 2018, pp. 550–564.[58] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” arXiv preprintarXiv:1706.02690 , 2017.[59] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, “Class-balancedloss based on effective number of samples,” in
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , 2019, pp.9268–9277.[60] S. Lawrence, I. Burns, A. Back, A. C. Tsoi, and C. L. Giles, “Neuralnetwork classification and prior class probabilities,” in
Neural networks:tricks of the trade . Springer, 1998, pp. 299–313.[61] M. D. Richard and R. P. Lippmann, “Neural network classifiers estimatebayesian a posteriori probabilities,”
Neural computation , vol. 3, no. 4,pp. 461–483, 1991.[62] S. J. Pan, J. T. Kwok, and Q. Yang, “Transfer learning via dimensionalityreduction.” in
AAAI , vol. 8, 2008, pp. 677–682.[63] S. J. Pan, Q. Yang et al. , “A survey on transfer learning,”
IEEETransactions on knowledge and data engineering , vol. 22, no. 10, pp.1345–1359, 2010.[64] S. Hoo-Chang, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao,D. Mollura, and R. M. Summers, “Deep convolutional neural networksfor computer-aided detection: Cnn architectures, dataset characteristicsand transfer learning,”
IEEE transactions on medical imaging , vol. 35,no. 5, p. 1285, 2016.[65] K. Kowsari, M. Heidarysafa, D. E. Brown, K. J. Meimandi, and L. E.Barnes, “Rmdl: Random multimodel deep learning for classification,” in
Proceedings of the 2nd International Conference on Information Systemand Data Mining , 2018, pp. 19–28.[66] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-resnetand the impact of residual connections on learning,”
CoRR , vol.abs/1602.07261, 2016. [Online]. Available: http://arxiv.org/abs/1602.07261[67] L. N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weightdecay,” arXiv preprint arXiv:1803.09820 , 2018.[68] L. N. Smith and N. Topin, “Super-convergence: Very fast training ofneural networks using large learning rates,” in
Artificial Intelligenceand Machine Learning for Multi-Domain Operations Applications , vol.11006. International Society for Optics and Photonics, 2019, p.1100612.[69] E. Beede, E. Baylor, F. Hersch, A. Iurchenko, L. Wilcox, P. Ruamvi-boonsuk, and L. M. Vardoulakis, “A human-centered evaluation of adeep learning system deployed in clinics for the detection of diabeticretinopathy,” in