Analysis of skin lesion images with deep learning
Josef Steppan, Sten Hanke
Abstract—Skin cancer is the most common cancer worldwide, with melanoma being the deadliest form. Dermoscopy is a skin imaging modality that has shown an improvement in the diagnosis of skin cancer compared to unaided visual examination. We evaluate the current state of the art in the classification of dermoscopic images based on the ISIC-2019 Challenge for the classification of skin lesions and current literature. Various deep neural network architectures pre-trained on the ImageNet data set are adapted to a combined training data set comprised of publicly available dermoscopic and clinical images of skin lesions using transfer learning and model fine-tuning. The performance and applicability of these models for the detection of eight classes of skin lesions are examined. Real-time data augmentation, which uses random rotation, translation, shear, and zoom within specified bounds, is used to increase the number of available training samples. Model predictions are multiplied by inverse class frequencies and normalized to better approximate actual probability distributions. Overall prediction accuracy is further increased by using the arithmetic mean of the predictions of several independently trained models. The best single model has been published as a web service. The source code is publicly available at http://github.com/j05t/lesion-analysis
Index Terms—Lesion, Skin, Melanoma, Deep Learning
I. INTRODUCTION

Skin cancer is the most common cancer worldwide, with melanoma being the deadliest form. A later stage in the diagnosis of melanoma is associated with a strong increase in melanoma mortality within 5 years of diagnosis [1]. Early detection of melanoma can significantly reduce both morbidity and mortality [2]. The risk of dying from the disease is directly related to the depth of the cancer, which is directly related to the time it has been growing. Self-examination of the skin by patients, full-body skin examinations by a doctor, and patient education are the keys to early detection. Self-examiners are generally diagnosed with thinner melanomas than non-self-examiners (0.77 mm versus 0.95 mm) [3].

This paper evaluates the current state of the art in the classification of dermoscopic images based on the ISIC-2019 Challenge for the classification of skin lesions and current literature. Since medical image data sets often show a class imbalance, several approaches for the training of deep neural networks on imbalanced data sets have been reviewed. Because the training of deep neural networks requires a large amount of training data, further publicly available dermoscopic as well as clinical image data sets of skin lesions have been evaluated for expanding the ISIC-2019 training data
set. Since the heterogeneity of the image data of the ISIC data set requires preprocessing, a suitable approach towards preprocessing, as well as the effects of preprocessing on the achieved accuracy of trained networks, have been investigated. Furthermore, the potential of real-time data augmentation to increase the number of available training patterns during training and to improve the prediction accuracy at inference time has been investigated. Current ensembling strategies and an overview of current architectures of deep neural networks for the classification of image content have been reviewed.

J. Steppan was with the Department of eHealth at FH Joanneum University of Applied Sciences, Alte Poststrasse 149, 8020 Graz, AUSTRIA (e-mail: [email protected]). S. Hanke is with the Department of eHealth at FH Joanneum University of Applied Sciences, Alte Poststrasse 149, 8020 Graz (e-mail: [email protected]).

II. IMAGE CLASSIFICATION
Convolutional Neural Networks (CNNs) [4] are currently the state of the art in image classification and have been exceeding the recognition rate of human experts in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [5] since 2015 [6]. The ILSVRC evaluates algorithms for object recognition and image classification on a large scale. An important motivation is to enable researchers to compare progress in recognition for a wider variety of objects. Another motivation is to measure the progress of computer vision algorithms for classifying images on a large scale. The ImageNet training data set contains 1,000 categories and 1.2 million images. Image classification algorithms are compared using a test data set of 150,000 images in 1,000 categories. The highest accuracy rates are currently achieved by the architectures SENet-154 [7] (81.3% top-1 accuracy), PNASNet-5 Large [8] (82.9%), AmoebaNet-C [9], [10] (83.9%), and EfficientNet-B7 [11] (84.4%) [12]. Algorithms for classifying image content are constantly being improved. Deep learning has shown enormous potential in this area due to the constantly increasing amounts of data [13], [14]. Some deep learning approaches outperform teams of certified dermatologists in the detection of melanoma in dermoscopic images [15], [16], [17] or achieve equivalent detection rates [18], [19].

III. SKIN LESION DATASETS
A. ISIC-2019
To make specialist knowledge more widely available, the International Skin Imaging Collaboration developed the ISIC archive, an international repository for dermoscopic images, both for clinical training purposes and to support technical research on automated algorithmic analysis by hosting the ISIC Challenges. The training data set of the ISIC-2019 Challenge consists of several dermoscopic image databases: BCN 20000 [20], with dermoscopic images of the most common classes of skin lesions: actinic keratosis, squamous cell carcinoma, basal cell carcinoma, seborrheic keratosis, solar lentigo, and dermatological lesions; the HAM10000 dataset [21], with 600x450 images centered and cropped on lesions; and the MSK data set [22], with images of different resolutions. A total of 25,331 images are available for training in 8 different categories. The test data set consists of 8,238 images whose labels are not publicly available. Also, the test data set contains an additional outlier class that is not contained in the training data and must be identified by developed systems. Predictions on the ISIC-2019 test data set are assessed by an automatic evaluation system. The goal of the ISIC-2019 Challenge is to classify dermoscopic images among nine different diagnostic categories:

1) Melanoma (MEL)
2) Melanocytic nevus (NV)
3) Basal cell carcinoma (BCC)
4) Actinic keratosis (AK)
5) Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis) (BKL)
6) Dermatofibroma (DF)
7) Vascular lesion (VASC)
8) Squamous cell carcinoma (SCC)
9) None of the others (UNK)

B. PH2 database
The PH2 database [23] includes manual segmentation, clinical diagnosis, and the identification of multiple dermoscopic structures performed by experienced dermatologists in a set of 200 dermoscopic images. The images were obtained in the dermatology department of the Pedro Hispano hospital (Matosinhos, Portugal) under the same conditions with the Tuebinger Mole Analyzer system using 20-fold magnification. These are 8-bit RGB color images with a resolution of 768x560 pixels. The image database contains a total of 200 dermoscopic images of melanocytic lesions, including 80 common nevi, 80 atypical nevi, and 40 melanomas. The PH2 database contains a medical annotation of all images, namely a medical segmentation of the lesion, a clinical and histological diagnosis, as well as the evaluation of several dermoscopic criteria (colors; pigment network; dots/globules; streaks; regression areas; blue-whitish veil). The database was made freely available for research and benchmarking purposes.

C. Light Field Image Dataset of Skin Lesions
Faria et al. [24] present a contribution to the research community in the form of a publicly available data set of skin lesions, the "Light Field Image Dataset of Skin Lesions" (SKINL2). The dataset contains 250 light fields [25], which were recorded with a focused plenoptic camera and divided into eight clinical categories depending on the type of lesion. Each light field consists of 81 different views of the same lesion. The database also contains the dermoscopic image of each lesion. The data set offers great potential for the further development of medical imaging research and the development of new classification algorithms based on light fields as well as for clinically oriented dermatological studies; however, only the dermoscopic images contained in the data set are taken into account for this work.

D. SD-198
In contrast to dermoscopic images with largely constant lighting and low image disturbances, clinical images are often created with a large number of different image recording devices, such as digital cameras or smartphones. The SD-198 data set [26] contains 6,584 clinical images from 198 classes, which vary according to scale, color, shape, and structure. The SD-198 benchmark data set is intended to stimulate further research into the visual classification of skin diseases. The authors also carry out an extensive analysis of this data set using modern methods including CNNs. The ground truth labels of the images were created via DermQuest, with each image being examined by qualified experts and labeled with the name of its class. To ensure the quality of the labels, two experts were also invited to check the data set.

E. 7-point criteria evaluation database
Kawahara et al. [27] provide a database for evaluating the computerized image-based prediction of the 7-point checklist for malignant skin lesions. The seven-point checklist, published in 1998, is one of the best-validated dermoscopic algorithms due to its high sensitivity and specificity, even when used by non-specialists. The seven criteria were originally tested on 342 melanocytic lesions (117 melanomas and 225 atypical nevi) and selected for their frequent association with melanoma [28]. Three of them were defined as major criteria (atypical network, blue-whitish veil, and atypical vascular pattern), while the remaining four were considered minor (irregular streaks, irregular dots or globules, irregular pigmentation, and regression structures) [29]. The data set contains over 2000 clinical and dermoscopic color images as well as corresponding structured metadata that are tailored to the training and evaluation of CAD (Computer Aided Diagnostic) systems.

F. MED-NODE
The MED-NODE data set [30] consists of 70 melanoma and 100 nevus images from the digital image archive of the Department of Dermatology at the University Hospital Groningen (UMCG), which is used for the development and testing of the MED-NODE decision support system for the detection of skin cancer using macroscopic images. The system proposed by the authors achieves results with a diagnostic accuracy of 81%. The final classification was achieved by a majority vote of the predictions of several models. The dataset is publicly available.

[Fig. 1 data: ISIC-2019 25,331 (MEL 4,904; NV 13,704; BCC 3,378; AK 867; BKL 2,733; DF 294; VASC 282; SCC 628); SD-198 5,944 (UNK 5,958); MED-NODE 170; 7-point criteria database 1,011; PH2 200; SKINL2 92; Train 29,469; Valid 3,279]
Fig. 1. Combined training data set from the data sets ISIC-2019, PH2, Light Field Image Dataset of Skin Lesions, SD-198, the 7-point criteria evaluation database, and MED-NODE. The "UNK" category is mainly formed from data from the SD-198 dataset. The combined data set is divided into a training (90%) and validation data set (10%), so 29,469 images are available for training and 3,279 images for assessing the generalizability of the predictions and for adapting hyperparameters on the validation data set. The ISIC-2019 test data set consists of 8,238 images whose labels are not publicly available. The test data set is not used for training or parameter adjustment.
IV. COMBINED TRAINING DATASET
A combined training data set has been created from all the data sets described in Section III. 32,748 images are available for training in total. Images from SD-198 were used exclusively for the creation of training data for the "UNK" class, after prior removal of image data from the eight categories of the ISIC-2019 training data set. The combined data set is still heavily imbalanced (Figure 1).

V. METHODOLOGY
A. Preprocessing
Training and test data of the ISIC-2019 dataset have been preprocessed to remove black areas surrounding dermoscopic images, and subsequently rescaled maintaining aspect ratio (Figure 2). Descriptive text appended to images in the SD-198 dataset has been removed.
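As an illustration, the border-removal step can be sketched with a simple intensity scan; the function name and the brightness threshold are assumptions for illustration, not taken from the paper's implementation.

```python
import numpy as np

def crop_black_borders(img, thresh=10):
    """Crop near-black borders from an RGB image array of shape (H, W, 3).

    A row/column is kept if the brightest pixel in it (mean over channels)
    exceeds `thresh`. The threshold value is an illustrative assumption.
    """
    gray = img.mean(axis=2)                 # (H, W) per-pixel brightness
    rows = np.where(gray.max(axis=1) > thresh)[0]
    cols = np.where(gray.max(axis=0) > thresh)[0]
    if rows.size == 0 or cols.size == 0:    # fully black image: leave as-is
        return img
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# toy example: a bright 4x4 lesion patch surrounded by a black frame
img = np.zeros((8, 8, 3), dtype=np.uint8)
img[2:6, 2:6] = 200
cropped = crop_black_borders(img)
print(cropped.shape)  # (4, 4, 3)
```

After cropping, the image would be rescaled to the model input size while preserving the aspect ratio, as described above.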
Fig. 2. Preprocessing of the ISIC-2019 dataset. Black image borders are detected and removed. The top row shows images of the original training data set; preprocessed images are shown below.

Fig. 3. Applied augmentations for a single training image. Random rotation, translation in the x and y directions as well as scaling within defined limits avoid overfitting on the training data and enable a better generalization of the model. The augmentation parameters used are: max_rotate=45, p_affine=0.5, do_flip=True, flip_vert=True, max_zoom=1.05, max_lighting=0.2, crop_pad(input_size), cutout(n_holes=(1,1), length=(16,16), p=.5).
B. Data Augmentation
To avoid overfitting [31] in neural networks, dropout [32] is often used. Another simple method for regularization (and expansion of the number of different training samples) of CNNs is data augmentation. During training, input data is changed randomly according to certain criteria (translation, rotation, scaling, etc.). Additionally, Cutout [33] has been used for regularization. Figure 3 shows the applied augmentations.
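The translation, flip, and Cutout parts of such a pipeline can be sketched in plain NumPy; rotation, shear, and zoom are omitted here since they are typically delegated to an image-processing library, and all parameter values are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, max_shift=10, cutout_len=16, p_flip=0.5, p_cutout=0.5):
    """Randomly translate, flip, and apply Cutout to an image array."""
    h, w = img.shape[:2]
    out = img.copy()
    # random translation within +/- max_shift pixels (wrap-around for simplicity)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, (dy, dx), axis=(0, 1))
    # random horizontal / vertical flips
    if rng.random() < p_flip:
        out = out[:, ::-1]
    if rng.random() < p_flip:
        out = out[::-1, :]
    # Cutout: zero out a random square patch
    if rng.random() < p_cutout:
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        y0, y1 = max(0, cy - cutout_len // 2), min(h, cy + cutout_len // 2)
        x0, x1 = max(0, cx - cutout_len // 2), min(w, cx + cutout_len // 2)
        out[y0:y1, x0:x1] = 0
    return out

# a fresh randomly augmented batch is produced every epoch ("real-time")
batch = [augment(np.ones((64, 64, 3), dtype=np.float32)) for _ in range(4)]
```

Because the augmentations are sampled anew for every pass over the data, the network effectively never sees the exact same image twice.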
C. Out of Distribution Detection
Neural networks offer little or no guarantee of reliable prediction when applied to data that was not generated through the same process that was used to create the network's training data. With such Out-of-Distribution (OOD) inputs, the prediction may not only be incorrect but also associated with a high level of confidence [34], [35] of the network, which restricts the reliability of deep learning classifiers in real-world applications. Often the predictions of (ensembles of) classifiers that have been trained on in-distribution data are examined for the presence of OOD inputs using statistical methods [36], [37]. Alternatively, the input distribution can be modeled directly by using generative models that do not require the presence of class labels. However, it has been shown that this method can also output higher probabilities on OOD inputs than on in-distribution inputs [38]. In the ISIC-2019 Challenge, classes that are not included in the training data set should be detected as OOD and recognized as class "UNK". In this work, a data-driven approach to the recognition of OOD inputs is pursued by using images (mostly from SD-198, see subsection III-D) as training data for the "UNK" class that are not labeled as one of the classes of the ISIC-2019 training data set. However, this approach is far from optimal, and OOD detection in deep learning classifiers remains an unsolved problem. Further work is needed to improve classifier performance regarding OOD detection.
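One common statistical baseline from the family cited above (not the data-driven approach ultimately used in this work) flags inputs whose maximum softmax probability falls below a threshold; the threshold value here is an illustrative assumption:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def flag_ood(logits, threshold=0.5):
    """Flag inputs whose maximum softmax probability is below `threshold`
    as out-of-distribution. The threshold is an illustrative assumption."""
    return softmax(logits).max(axis=1) < threshold

logits = np.array([[8.0, 0.1, 0.2],    # peaked output -> in-distribution
                   [0.4, 0.5, 0.45]])  # flat output   -> flagged as OOD
print(flag_ood(logits))  # [False  True]
```

The weakness noted in the text applies here as well: a network can produce a peaked softmax on an OOD input, so a low-confidence threshold alone is not a reliable detector.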
D. Dataset Imbalance
A common problem with deep learning-based applications is the fact that some classes have a significantly higher number of samples in the training set than other classes. This difference is known as class imbalance. There are many examples in areas such as computer vision [39], [40], [41], [42], [43], medical diagnosis [44], [45], fraud detection [46], and others [47], [48], [49] where this problem is highly significant and the incidence of one class (e.g. cancer) can be 1000 times less than that of another class (e.g. healthy patient) [50]. It has been shown that a class imbalance in training data sets can have a significant adverse effect on the training of traditional classifiers [51], including classic neural networks or multilayer perceptrons [52]. Class imbalance influences both the convergence of neural networks during the training phase and the generalization of a model to real or test data [50].
1) Undersampling / Oversampling: Undersampling and oversampling in data analysis are techniques to adjust the class distribution of a data set (i.e. the relationship between the different classes/categories represented). These terms are used in statistical sampling, survey design methodology, and machine learning. The goal of undersampling and oversampling is to create a balanced data set. Many machine learning techniques, such as neural networks, make more reliable predictions when trained on balanced data. Oversampling is generally used more often than undersampling. The reasons for using undersampling are mainly practical and often resource-dependent. With random oversampling, the training data is supplemented by multiple copies of samples from minority classes. This is one of the earliest proposed methods, which has also proven robust [53]. Instead of duplicating minority class samples, some of them can be chosen at random with substitution. Other methods of handling unbalanced data sets such as synthetic oversampling [54] are more suitable for traditional machine learning tasks [55] and were therefore not considered any further in this work.
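Random oversampling as described above can be sketched as follows; the helper name and the toy label set are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def oversample_indices(labels):
    """Randomly oversample (with replacement) so that every class
    contributes as many samples as the largest class."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        members = np.where(labels == c)[0]
        idx.append(rng.choice(members, size=target, replace=True))
    return np.concatenate(idx)

# toy imbalanced label set: 6 "NV" samples vs. 2 "MEL" samples
labels = ["NV"] * 6 + ["MEL"] * 2
idx = oversample_indices(labels)
balanced = np.asarray(labels)[idx]
print(np.unique(balanced, return_counts=True))  # 6 samples of each class
```

In practice the returned index array would drive a sampler that feeds mini-batches to the network, so minority-class images (after augmentation) appear more often per epoch.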
2) Weighted Cross-Entropy Loss: Weighted cross-entropy [56] is useful for training neural networks on unbalanced data sets. [57] suggest adding a margin-based loss value to the cross-entropy on in-distribution training patterns in order to ensure a minimum difference in average entropy between in-distribution and out-of-distribution data. This ensemble-based method is intended to surpass previous methods of recognizing out-of-distribution inputs such as ODIN [58]. Cross-entropy can be described as

L(x, y) = -log( exp(x[y]) / Σ_j exp(x[j]) ) = -x[y] + log( Σ_j exp(x[j]) )

or, by using class weights:

L(x, y) = W[y] ( -x[y] + log( Σ_j exp(x[j]) ) )

The arithmetic mean of the resulting loss values is calculated for each mini-batch. A weight vector can be calculated using effective class weights [59]: the effective number of samples of a class with n samples is (1 - β^n) / (1 - β), and the class weight is proportional to its inverse. The hyperparameter β was set to 0.999 (a choice of β equal to zero would not apply any weighting, while β approaching 1 corresponds to weighting by the inverse class frequency). In the simplest case, loss values can be weighted by multiplying by inverse class frequencies.
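A minimal sketch of the weighted cross-entropy and the effective-number class weights, following the formulas above; the function names and example class counts are illustrative:

```python
import numpy as np

def effective_number_weights(counts, beta=0.999):
    """Class weights from the effective number of samples [59]:
    E_c = (1 - beta**n_c) / (1 - beta), with W[c] proportional to 1/E_c,
    normalized here so the weights sum to the number of classes."""
    counts = np.asarray(counts, dtype=np.float64)
    eff_num = (1.0 - beta ** counts) / (1.0 - beta)
    w = 1.0 / eff_num
    return w * len(counts) / w.sum()

def weighted_cross_entropy(logits, y, weights):
    """W[y] * (-x[y] + log(sum_j exp(x[j]))), averaged over the mini-batch."""
    m = logits.max(axis=1)                                # for numerical stability
    log_sum_exp = np.log(np.exp(logits - m[:, None]).sum(axis=1)) + m
    losses = weights[y] * (-logits[np.arange(len(y)), y] + log_sum_exp)
    return losses.mean()

counts = [4904, 13704, 294]           # e.g. MEL, NV, DF sample counts
w = effective_number_weights(counts)  # the rare class receives the largest weight
logits = np.array([[2.0, 0.5, 0.1], [0.2, 3.0, 0.4]])
y = np.array([0, 1])
loss = weighted_cross_entropy(logits, y, w)
```

With β = 0.999 the weights grow much more slowly than pure inverse class frequency, which avoids overweighting extremely rare classes.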
3) Thresholding: Also referred to as threshold shifting or rescaling, thresholding adapts the decision threshold of a classifier. This method is used at inference time and involves changing the output class probabilities. There are several ways in which the network outputs can be rescaled. In general, an optimization algorithm can be used to configure the network to minimize any criterion [60]. The simplest method only compensates for a priori class probabilities [61]. It has been shown that neural networks estimate Bayesian a posteriori probabilities [61]. That is, for a given data point x, the output for class c is implicitly y_c(x) = p(c|x) = p(c) p(x|c) / p(x). The actual probabilities of class membership can therefore be approximated by dividing the output of the network by the estimated a priori probability p(c) = |c| / Σ_k |k|, where |c| is the number of samples of class c [50]. The resulting class probabilities are normalized after thresholding is applied. This simple method of handling an existing class imbalance can significantly improve the classifier's approximation of the class probability distribution.

E. Transfer Learning
Transfer learning in the context of machine learning is a technique that uses information obtained from solving a problem and applies it to a similar problem. When using transfer learning, a model that has already been trained on another data set is adapted to custom data. Ideally, the pre-trained model has been trained on similar data, but this is not strictly necessary. The final layers of the network are removed and replaced by output layers featuring appropriate dimensions. The model is then trained on the custom data. By using transfer learning, the time required for training a network can be greatly reduced [62], [63], [64]. The existing pre-trained model thus serves as a feature extractor, which forwards features such as edges, texture, and the position of recognized objects to the last layer for classification. A softmax function (normalized exponential function) transforms the network output into a vector of numbers between zero and one that sum up to one, which allows interpreting the output of the network as a probability distribution.
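The softmax output described here, combined with the prior-correction from the thresholding subsection, can be sketched as follows; the function names and class counts are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def rescale_by_priors(probs, class_counts):
    """Divide network outputs by the estimated prior p(c) = |c| / sum_k |k|
    and renormalize, as described in the thresholding subsection."""
    priors = np.asarray(class_counts, dtype=np.float64)
    priors = priors / priors.sum()
    adjusted = probs / priors
    return adjusted / adjusted.sum(axis=1, keepdims=True)

probs = softmax(np.array([[1.0, 1.0, 1.0]]))     # uniform softmax output
rescaled = rescale_by_priors(probs, [800, 150, 50])
# after rescaling, the rarest class receives the highest adjusted probability
print(rescaled.argmax(axis=1))  # [2]
```

This is exactly the inference-time rescaling by inverse class frequencies mentioned in the abstract: a classifier that merely echoes the training priors is pulled back toward a uniform posterior.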
F. Test Time Augmentation
Data augmentation is a technique widely used to improve neural network training performance and reduce generalization errors. The same image data augmentation technique can also be used at inference time to allow the model to make predictions for several different versions of each image in the test data. Test Time Augmentation (TTA) predictions are formed by averaging the regular predictions (with a weighting of beta=0.4) with the average of the predictions obtained by predicting on augmented versions of the image data (with a weighting of 1-beta). The transformations specified for the training set are applied with the following changes: Scaling with a factor of 1.05 controls the scaling for the zoom (which is not random for TTA). Furthermore, the cropping is not random, to ensure that the four corners of the picture are used. Reflection is not random but is applied once to each of these corner images (so that a total of 8 augmented versions are created).
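The TTA averaging rule described above can be sketched as follows; the toy classifier and the specific augmentations are illustrative assumptions (the actual pipeline uses the fixed zoom, corner crops, and reflections described in the text):

```python
import numpy as np

def tta_predict(predict_fn, image, augment_fns, beta=0.4):
    """Test Time Augmentation: combine the prediction on the original image
    (weight beta) with the mean prediction over augmented versions of it
    (weight 1 - beta)."""
    base = predict_fn(image)
    aug = np.mean([predict_fn(f(image)) for f in augment_fns], axis=0)
    return beta * base + (1.0 - beta) * aug

# hypothetical stand-in classifier and deterministic flip augmentations
predict_fn = lambda img: np.array([0.7, 0.2, 0.1])
augment_fns = [np.fliplr, np.flipud, lambda x: x[::-1, ::-1]]
pred = tta_predict(predict_fn, np.ones((4, 4)), augment_fns)
# pred remains a valid probability vector (sums to ~1.0)
```

Because the augmented views are deterministic at test time, TTA predictions are reproducible, unlike training-time augmentation.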
G. Ensembling
Ensembling is the use of several independently trained models to form an overall prediction. The basic idea of ensembling is that individual models have weaknesses in different areas, which are compensated by the combination with predictions of other independently trained models. Possible ensembling strategies are e.g. majority voting, the use of a weighted average based on classifier confidences, or simply using the arithmetic mean of several predictions of different models and model architectures [65].

VI. EXPERIMENTS
The CNN architectures Inception-ResNet-v2 [66], SE-ResNeXt-101 (32x4d) [7], NASNet-A-Large [8], EfficientNet-B4 and EfficientNet-B5 [11], pre-trained on the ImageNet data set, were adapted for the task of classifying the nine classes of the ISIC-2019 Challenge by replacing the final layers with a custom linear layer to output nine class probabilities. Real-time data augmentation has been used to improve the generalizability of the resulting models. Models have been trained on an NVIDIA GTX 1070 GPU. Batch sizes (the number of training samples that are used for a single forward pass) were adapted to individual architectures and input sizes to achieve optimal utilization of the available video memory. Images have been resized to fit model input sizes prior to training.

Models have been trained via transfer learning over 32 epochs, followed by model fine-tuning using differential learning rates until convergence using the One Cycle Policy [67], allowing very rapid convergence of the trained networks [68]. Appropriate learning rates were determined manually at regular intervals. The use of a weighted loss function has, contrary to expectations, only proven to be advantageous for training the NASNet-A-Large architecture, which was unable to converge without applying weighted loss. Other architectures could not benefit from training using a weighted loss function. Early stopping has been applied to avoid model overfitting. The best models have been selected based on their performance on the validation data. Out-of-distribution detection using thresholding proved to provide inferior results to the data-driven approach described in V-C.

The unsatisfactory balanced multiclass accuracy of the NASNet model may be caused by the relatively small batch size, which was limited to four due to the size of the model. As expected, improved performance of deep neural networks in the classification of ImageNet data can be directly translated to models trained on custom data sets. Improved CNN architectures, which achieve higher accuracy in the classification of the ImageNet data set, thus also provide better results in the classification of dermoscopic images.

Rescaling the outputs of the models by multiplying the output probabilities by inverse class frequencies has proven to be advantageous for the balanced multiclass accuracy of the network predictions in all cases where no weighted loss function has been used. Applying rescaling on models trained
using a weighted loss function did not improve balanced multiclass prediction accuracy. The outputs of several independently trained models were combined into an overall prediction using the arithmetic mean of all model predictions and transmitted to the automated evaluation system of the ISIC-2019 Challenge.

Table I shows results for individual models. The best performing models were used to form ensemble predictions. NASNet-A-Large was not included in the ensemble due to the unsatisfactory overall accuracy achieved. Although EfficientNet shows the best results of all trained network architectures, the combination with predictions from SE-ResNeXt-101 (32x4d) and Inception-ResNet-v2 models still leads to a higher average accuracy than any single model could achieve independently.

Table II shows metrics for the ensemble with 0.634 balanced multiclass accuracy, as computed by the ISIC challenge website. AUC: area under the receiver operating characteristic (ROC) curve. AUC, Sens > 80%: AUC restricted to the region of at least 80% sensitivity. Accuracy = sensitivity * prevalence + specificity * (1 - prevalence). Sensitivity (recall) measures true-positive predictions, specificity measures true-negative predictions of the classifier. The F1 score (Dice coefficient) is the harmonic mean of precision and recall, with an F1 score reaching its best value at 1 (perfect precision and recall). The F1 score is also known as the Sørensen-Dice coefficient or Dice similarity coefficient (DSC). The positive predictive value (PPV) is the likelihood that subjects who test positive actually have the disease.

TABLE I
SINGLE MODEL, ENSEMBLE BALANCED ACCURACY

Architecture                    Accuracy
EfficientNet-B5                 0.600
SE-ResNeXt-101 (32x4d)          0.582
EfficientNet-B4                 0.577
Inception-ResNet-v2             0.569
NASNet-A-Large                  0.504
Ensemble (excluding NASNet)     0.634

TABLE II
METRICS (ENSEMBLE)

Metric            Mean   MEL   NV    BCC   AK    BKL   DF    VASC  SCC   UNK
AUC               .902   .924  .957  .942  .917  .893  .977  .932  .936  .638
AUC, Sens > 80%   .813   .853  .926  .883  .829  .776  .966  .868  .876  .336
Avg. Precision    .561   .766  .923  .719  .366  .572  .586  .502  .326  .285
Accuracy          .923   .899  .894  .908  .933  .933  .983  .978  .969  .808
Sensitivity       .525   .581  .752  .666  .580  .384  .744  .614  .408  .00
Specificity       .973   .963  .962  .944  .952  .985  .986  .983  .982  1.00
Dice Coeff.       .491   .659  .821  .654  .468  .499  .523  .434  .364  .00
PPV               .609   .760  .905  .642  .392  .713  .404  .335  .328  1.00
NPV               .941   .919  .890  .950  .977  .944  .997  .995  .987  .808
The negative predictive value (NPV) is the likelihood that subjects who test negative really do not have the disease. Figure 4 shows the receiver operating characteristic curve for the ensemble.

Fig. 4. ROC curve for the 0.634 balanced multiclass accuracy ensemble. The ROC curve shows the diagnostic capability of a binary classifier as its decision threshold varies. The ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also referred to as sensitivity, recall, or detection probability, whereas the FPR corresponds to the false positive rate (1 - specificity).
VII. CONCLUSION
Deep learning has become a mature technology for the classification of image content and can achieve accuracy similar or superior to that of human experts in the classification of skin lesions. The use of deep learning applications that automatically evaluate clinical and dermoscopic images and classify skin lesions offers great potential for improving and implementing prevention and screening measures and increasing their efficiency. One of the main criticisms of deep learning applications, that these networks have to be treated as a black box and that there is no easy explanation of how they form their decisions, remains unchanged despite some progress in the visualization of network activations. Careful validation of trained models using real-world data sets before and also during use is essential. Progress in the development of more efficient architectures of deep neural networks and improved accuracy in the classification of images with high image quality does not automatically mean that results can be transferred to real-world applications. For instance, [69] examined the use of a classification system created by Google researchers to detect diabetic retinopathy in 11 clinics in Thailand and found that this technology does not yet work well in practice despite all the research advances. Advantages of deep learning applications in the medical field are the rapid availability of diagnoses compared to analysis by human specialists and the cost-effective provisioning of models for large numbers of simultaneous users. Central provisioning of deep learning models allows uncomplicated and transparent delivery of improved models without having to make changes to client software. Cloud applications can serve current deep learning models cost-effectively through automatic horizontal scaling of active services and flexible price calculations. Also, deep learning applications can help nursing staff to better argue their own assessments to specialists and to prioritize urgent cases accordingly.
Even if decisions made by deep learning models still have to be manually verified by human experts, automated image classifiers can support these human experts and reduce the workload by accelerating decision-making processes, therefore contributing to a more efficient utilization of the resources of health systems.

REFERENCES
[1] K. J. Wernli, N. B. Henrikson, C. C. Morrison, M. Nguyen, G. Pocobelli, and P. R. Blasi, "Screening for skin cancer in adults: updated evidence report and systematic review for the US Preventive Services Task Force," JAMA, vol. 316, no. 4, pp. 436–447, 2016.
[2] L. F. di Ruffano, Y. Takwoingi, J. Dinnes, N. Chuchu, S. E. Bayliss, C. Davenport, R. N. Matin, K. Godfrey, C. O'Sullivan, A. Gulati et al., "Computer-assisted diagnosis techniques (dermoscopy and spectroscopy-based) for diagnosing skin cancer in adults," Cochrane Database of Systematic Reviews, no. 12, 2018.
[3] P. Carli, V. De Giorgi, D. Palli, A. Maurichi, P. Mulas, C. Orlandi, G. L. Imberti, I. Stanganelli, P. Soma, D. Dioguardi et al., "Dermatologist detection and skin self-examination are associated with thinner melanomas: results from a survey of the Italian Multidisciplinary Group on Melanoma," Archives of Dermatology, vol. 139, no. 5, pp. 607–612, 2003.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[6] C. Langlotz, B. Allen, B. Erickson, J. Kalpathy-Cramer, K. Bigelow, T. Cook, A. Flanders, M. Lungren, D. Mendelson, J. Rudie, G. Wang, and K. Kandarpa, "A roadmap for foundational research on artificial intelligence in medical imaging: From the 2018 NIH/RSNA/ACR/The Academy workshop," Radiology, vol. 291, p. 190613, 2019.
[7] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[8] C. Liu, B. Zoph, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," CoRR, vol. abs/1712.00559, 2017. [Online]. Available: http://arxiv.org/abs/1712.00559
[9] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in The European Conference on Computer Vision (ECCV), September 2018.
[10] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
[11] M. Tan and Q. V. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," CoRR, vol. abs/1905.11946, 2019. [Online]. Available: http://arxiv.org/abs/1905.11946
[12] S. Bianco, R. Cadene, L. Celona, and P. Napoletano, "Benchmark analysis of representative deep neural network architectures," IEEE Access, vol. 6, pp. 64270–64277, 2018.
[13] X. Cui, R. Wei, L. Gong, R. Qi, Z. Zhao, H. Chen, K. Song, A. A. Abdulrahman, Y. Wang, J. Z. Chen et al., "Assessing the effectiveness of artificial intelligence methods for melanoma: A retrospective review," Journal of the American Academy of Dermatology, vol. 81, no. 5, pp. 1176–1180, 2019.
[14] Y. Fujisawa, S. Inoue, and Y. Nakamura, "The possibility of deep learning-based, computer-aided skin tumor classifiers," Frontiers in Medicine, vol. 6, p. 191, 2019.
[15] A. Hekler, J. S. Utikal, A. H. Enk, A. Hauschild, M. Weichenthal, R. C. Maron, C. Berking, S. Haferkamp, J. Klode, D. Schadendorf et al., "Superior skin cancer classification by the combination of human and artificial intelligence," European Journal of Cancer, vol. 120, pp. 114–121, 2019.
[16] R. C. Maron, M. Weichenthal, J. S. Utikal, A. Hekler, C. Berking, A. Hauschild, A. H. Enk, S. Haferkamp, J. Klode, D. Schadendorf et al., "Systematic outperformance of 112 dermatologists in multiclass skin cancer image classification by convolutional neural networks," European Journal of Cancer, vol. 119, pp. 57–65, 2019.
[17] T. J. Brinker, A. Hekler, A. H. Enk, J. Klode, A. Hauschild, C. Berking, B. Schilling, S. Haferkamp, D. Schadendorf, T. Holland-Letz et al., "Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task,"
EuropeanJournal of Cancer , vol. 113, pp. 47–54, 2019.[18] A. Blum, H. Luedtke, U. Ellwanger, R. Schwabe, G. Rassner, andC. Garbe, “Digital image analysis for diagnosis of cutaneous melanoma.development of a highly effective computer algorithm based on analysisof 837 melanocytic lesions,”
British Journal of Dermatology , vol. 151,no. 5, pp. 1029–1038, 2004.[19] M. Zortea, T. R. Schopf, K. Thon, M. Geilhufe, K. Hindberg, H. Kirch-esch, K. Møllersen, J. Schulz, S. O. Skrøvseth, and F. Godtliebsen,“Performance of a dermoscopy-based computer vision system for thediagnosis of pigmented skin lesions compared with visual evaluation byexperienced dermatologists,”
Artificial intelligence in medicine , vol. 60,no. 1, pp. 13–26, 2014.[20] M. Combalia, N. C. Codella, V. Rotemberg, B. Helba, V. Vilaplana,O. Reiter, A. C. Halpern, S. Puig, and J. Malvehy, “Bcn20000: Dermo-scopic lesions in the wild,” arXiv preprint arXiv:1908.02288 , 2019.[21] P. Tschandl, C. Rosendahl, and H. Kittler, “The ham10000 dataset,a large collection of multi-source dermatoscopic images of commonpigmented skin lesions,”
Scientific data , vol. 5, p. 180161, 2018.[22] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti,S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler et al. ,“Skin lesion analysis toward melanoma detection: A challenge at the2017 international symposium on biomedical imaging (isbi), hosted bythe international skin imaging collaboration (isic),” in . IEEE,2018, pp. 168–172.[23] T. Mendonc¸a, P. M. Ferreira, J. S. Marques, A. R. Marcal, and J. Rozeira,“Ph 2-a dermoscopic image database for research and benchmarking,”in . IEEE, 2013, pp. 5437–5440.[24] S. M. de Faria, J. N. Filipe, P. M. Pereira, L. M. Tavora, P. A. Assuncao,M. O. Santos, R. Fonseca-Pinto, F. Santiago, V. Dominguez, andM. Henrique, “Light field image dataset of skin lesions,” in . IEEE, 2019, pp. 3905–3908.[25] G. Wu, B. Masia, A. Jarabo, Y. Zhang, L. Wang, Q. Dai, T. Chai, andY. Liu, “Light field image processing: An overview,”
IEEE Journal ofSelected Topics in Signal Processing , vol. 11, no. 7, pp. 926–954, 2017.[26] X. Sun, J. Yang, M. Sun, and K. Wang, “A benchmark for automaticvisual classification of clinical skin disease images,” in
European Con-ference on Computer Vision . Springer, 2016, pp. 206–222.[27] J. Kawahara, S. Daneshvar, G. Argenziano, and G. Hamarneh, “Seven-point checklist and skin lesion classification using multitask multimodalneural nets,”
IEEE Journal of Biomedical and Health Informatics ,vol. 23, no. 2, pp. 538–546, 2019.[28] G. Argenziano, G. Fabbrocini, P. Carli, V. De Giorgi, E. Sammarco, andM. Delfino, “Epiluminescence microscopy for the diagnosis of doubtfulmelanocytic skin lesions: comparison of the abcd rule of dermatoscopyand a new 7-point checklist based on pattern analysis,”
Archives ofdermatology , vol. 134, no. 12, pp. 1563–1570, 1998.[29] H. Kittler, A. A. Marghoob, G. Argenziano, C. Carrera, C. Curiel-Lewandrowski, R. Hofmann-Wellenhof, J. Malvehy, S. Menzies,S. Puig, H. Rabinovitz et al. , “Standardization of terminology indermoscopy/dermatoscopy: results of the third consensus conferenceof the international society of dermoscopy,”
Journal of the AmericanAcademy of Dermatology , vol. 74, no. 6, pp. 1093–1106, 2016.[30] I. Giotis, N. Molders, S. Land, M. Biehl, M. F. Jonkman, and N. Petkov,“Med-node: a computer-assisted melanoma diagnosis system using non-dermoscopic images,”
Expert systems with applications , vol. 42, no. 19,pp. 6578–6585, 2015.[31] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R.Salakhutdinov, “Improving neural networks by preventing co-adaptationof feature detectors,” arXiv preprint arXiv:1207.0580 , 2012. [Online].Available: http://arxiv.org/abs/1207.0580[32] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov, “Dropout: a simple way to prevent neural networksfrom overfitting.”
Journal of machine learning research , vol. 15, no. 1,pp. 1929–1958, 2014. [33] T. DeVries and G. W. Taylor, “Improved regularization of convolutionalneural networks with cutout,” arXiv preprint arXiv:1708.04552 , 2017.[34] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessingadversarial examples,” arXiv preprint arXiv:1412.6572 , 2014.[35] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easilyfooled: High confidence predictions for unrecognizable images,” in
Proceedings of the IEEE conference on computer vision and patternrecognition , 2015, pp. 427–436.[36] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassifiedand out-of-distribution examples in neural networks,” arXiv preprintarXiv:1610.02136 , 2016.[37] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalablepredictive uncertainty estimation using deep ensembles,” in
Advances inneural information processing systems , 2017, pp. 6402–6413.[38] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon,and B. Lakshminarayanan, “Likelihood ratios for out-of-distributiondetection,” in
Advances in Neural Information Processing Systems , 2019,pp. 14 680–14 691.[39] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard,H. Adam, P. Perona, and S. Belongie, “The inaturalist species classifi-cation and detection dataset,” in
Proceedings of the IEEE conference oncomputer vision and pattern recognition , 2018, pp. 8769–8778.[40] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sundatabase: Large-scale scene recognition from abbey to zoo,” in . IEEE, 2010, pp. 3485–3492.[41] B. A. Johnson, R. Tateishi, and N. T. Hoan, “A hybrid pansharpen-ing approach and multiscale object-based image analysis for mappingdiseased pine and oak trees,”
International journal of remote sensing ,vol. 34, no. 20, pp. 6969–6982, 2013.[42] M. Kubat, R. C. Holte, and S. Matwin, “Machine learning for thedetection of oil spills in satellite radar images,”
Machine learning ,vol. 30, no. 2-3, pp. 195–215, 1998.[43] O. Beijbom, P. J. Edmunds, D. I. Kline, B. G. Mitchell, and D. Krieg-man, “Automated annotation of coral reef survey images,” in . IEEE, 2012,pp. 1170–1177.[44] J. W. Grzymala-Busse, L. K. Goodwin, W. J. Grzymala-Busse, andX. Zheng, “An approach to imbalanced data sets based on changing rulestrength,” in
Rough-neural computing . Springer, 2004, pp. 543–553.[45] B. Mac Namee, P. Cunningham, S. Byrne, and O. I. Corrigan, “Theproblem of bias in training data in regression problems in medicaldecision support,”
Artificial intelligence in medicine , vol. 24, no. 1, pp.51–70, 2002.[46] K. Philip and S. Chan, “Toward scalable learning with non-uniformclass and cost distributions: A case study in credit card fraud detection,”in
Proceeding of the Fourth International Conference on KnowledgeDiscovery and Data Mining , 1998, pp. 164–168.[47] P. Radivojac, N. V. Chawla, A. K. Dunker, and Z. Obradovic, “Clas-sification and knowledge discovery in protein databases,”
Journal ofBiomedical Informatics , vol. 37, no. 4, pp. 224–239, 2004.[48] C. Cardie and N. Nowe, “Improving minority class prediction usingcase-specific feature weights,” in
Proceedings of the Fourteenth Interna-tional Conference on Machine Learning , ser. ICML ’97. San Francisco,CA, USA: Morgan Kaufmann Publishers Inc., 1997, p. 57–65.[49] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, andG. Bing, “Learning from class-imbalanced data: Review of methods andapplications,”
Expert Systems with Applications , vol. 73, pp. 220–239,2017.[50] M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study ofthe class imbalance problem in convolutional neural networks,”
NeuralNetworks , vol. 106, pp. 249–259, 2018.[51] N. Japkowicz and S. Stephen, “The class imbalance problem: A system-atic study,”
Intelligent data analysis , vol. 6, no. 5, pp. 429–449, 2002.[52] M. A. Mazurowski, P. A. Habas, J. M. Zurada, J. Y. Lo, J. A. Baker,and G. D. Tourassi, “Training neural network classifiers for medicaldecision making: The effects of imbalanced datasets on classificationperformance,”
Neural networks , vol. 21, no. 2-3, pp. 427–436, 2008.[53] C. X. Ling and C. Li, “Data mining for direct marketing: Problems andsolutions.” in
Kdd , vol. 98, 1998, pp. 73–79.[54] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote:synthetic minority over-sampling technique,”
Journal of artificial intel-ligence research , vol. 16, pp. 321–357, 2002.[55] A. Fern´andez, S. Garcia, F. Herrera, and N. V. Chawla, “Smote forlearning from imbalanced data: progress and challenges, marking the15-year anniversary,”
Journal of artificial intelligence research , vol. 61,pp. 863–905, 2018. [56] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networksfor biomedical image segmentation,” in
International Conference onMedical image computing and computer-assisted intervention . Springer,2015, pp. 234–241.[57] A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. Willke,“Out-of-distribution detection using an ensemble of self supervisedleave-out classifiers,” in
Proceedings of the European Conference onComputer Vision (ECCV) , 2018, pp. 550–564.[58] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” arXiv preprintarXiv:1706.02690 , 2017.[59] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, “Class-balancedloss based on effective number of samples,” in
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , 2019, pp.9268–9277.[60] S. Lawrence, I. Burns, A. Back, A. C. Tsoi, and C. L. Giles, “Neuralnetwork classification and prior class probabilities,” in
Neural networks:tricks of the trade . Springer, 1998, pp. 299–313.[61] M. D. Richard and R. P. Lippmann, “Neural network classifiers estimatebayesian a posteriori probabilities,”
Neural computation , vol. 3, no. 4,pp. 461–483, 1991.[62] S. J. Pan, J. T. Kwok, and Q. Yang, “Transfer learning via dimensionalityreduction.” in
AAAI , vol. 8, 2008, pp. 677–682.[63] S. J. Pan, Q. Yang et al. , “A survey on transfer learning,”
IEEETransactions on knowledge and data engineering , vol. 22, no. 10, pp.1345–1359, 2010.[64] S. Hoo-Chang, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao,D. Mollura, and R. M. Summers, “Deep convolutional neural networksfor computer-aided detection: Cnn architectures, dataset characteristicsand transfer learning,”
IEEE transactions on medical imaging , vol. 35,no. 5, p. 1285, 2016.[65] K. Kowsari, M. Heidarysafa, D. E. Brown, K. J. Meimandi, and L. E.Barnes, “Rmdl: Random multimodel deep learning for classification,” in
Proceedings of the 2nd International Conference on Information Systemand Data Mining , 2018, pp. 19–28.[66] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-resnetand the impact of residual connections on learning,”
CoRR , vol.abs/1602.07261, 2016. [Online]. Available: http://arxiv.org/abs/1602.07261[67] L. N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weightdecay,” arXiv preprint arXiv:1803.09820 , 2018.[68] L. N. Smith and N. Topin, “Super-convergence: Very fast training ofneural networks using large learning rates,” in
Artificial Intelligenceand Machine Learning for Multi-Domain Operations Applications , vol.11006. International Society for Optics and Photonics, 2019, p.1100612.[69] E. Beede, E. Baylor, F. Hersch, A. Iurchenko, L. Wilcox, P. Ruamvi-boonsuk, and L. M. Vardoulakis, “A human-centered evaluation of adeep learning system deployed in clinics for the detection of diabeticretinopathy,” in