Melanoma Detection using Adversarial Training and Deep Transfer Learning
Hasib Zunair and A. Ben Hamza
Concordia University, Montreal, QC, Canada
Abstract
Skin lesion datasets consist predominantly of normal samples with only a small percentage of abnormal ones, giving rise to the class imbalance problem. Also, skin lesion images are largely similar in overall appearance owing to the low inter-class variability. In this paper, we propose a two-stage framework for automatic classification of skin lesion images using adversarial training and transfer learning toward melanoma detection. In the first stage, we leverage the inter-class variation of the data distribution for the task of conditional image synthesis by learning the inter-class mapping and synthesizing under-represented class samples from the over-represented ones using unpaired image-to-image translation. In the second stage, we train a deep convolutional neural network for skin lesion classification using the original training set combined with the newly synthesized under-represented class samples. The training of this classifier is carried out by minimizing the focal loss function, which assists the model in learning from hard examples, while down-weighting the easy ones. Experiments conducted on a dermatology image benchmark demonstrate the superiority of our proposed approach over several standard baseline methods, achieving significant performance improvements. Interestingly, we show through feature visualization and analysis that our method leads to context based lesion assessment that can reach an expert dermatologist level.
Keywords: Adversarial training; transfer learning; domain adaptation; melanoma detection; skin lesion analysis.
1 Introduction

Melanoma is one of the most aggressive forms of skin cancer [1, 2]. It is diagnosed in more than 132,000 people worldwide each year, according to the World Health Organization. Hence, it is essential to detect melanoma early, before it spreads to other organs in the body and becomes more difficult to treat.

While visual inspection of suspicious skin lesions by a dermatologist is normally the first step in melanoma diagnosis, it is generally followed by dermoscopy imaging for further analysis. Dermoscopy is a noninvasive imaging procedure that acquires a magnified image of a region of the skin at a very high resolution to clearly identify the spots on the skin [3], and helps identify deeper levels of skin, providing more details of the lesions. Moreover, dermoscopy provides detailed visual context of regions of the skin and has proven to enhance the diagnostic accuracy of naked eye examination, but it is costly, error prone, and achieves only average sensitivity in detecting melanoma [4]. This has triggered the need for developing more precise computer-aided diagnosis systems that would assist in early detection of melanoma from dermoscopy images. Despite significant strides in skin lesion recognition, melanoma detection remains a challenging task for various reasons, including the high degree of visual similarity (i.e. low inter-class variation) between malignant and benign lesions, making it difficult to distinguish between melanoma and non-melanoma skin lesions during the diagnosis of patients. Also, the contrast variability and boundaries between skin regions owing to image acquisition make automated detection of melanoma an intricate task. In addition to the high intra-class variation of melanoma's color, texture, shape, size and location in dermoscopic images [5], there are also artifacts such as hair, veins, ruler marks, illumination variation, and color calibration charts that usually cause occlusions and blurriness, further complicating the situation [6].

Classification of skin lesion images is a central topic in medical imaging, having a relatively extensive literature. Some of the early methods for classifying melanoma and non-melanoma skin lesions focused mostly on low-level computer vision approaches, which involve hand-engineering features based on expert knowledge such as color [5], shape [7] and texture [8, 9]. By leveraging feature selection, approaches that use mid-level computer vision techniques have also been shown to achieve improved detection performance [10]. In addition to ensemble classification based techniques [11], other methods include two-stage approaches, which usually involve segmentation of skin lesions followed by a classification stage to further improve detection performance [4, 9, 10]. However, hand-crafted features often lead to unsatisfactory results on unseen data due to high intra-class variation and visual similarity, as well as the presence of artifacts in dermoscopy images. Moreover, such features are usually designed for specific tasks and do not generalize across different tasks.

Deep learning has recently emerged as a very powerful way to hierarchically find abstract patterns using large amounts of training data. The tremendous success of deep neural networks in image classification, for instance, is largely attributed to open source software, inexpensive computing hardware, and the availability of large-scale datasets [12]. Deep learning has proved valuable for various medical image analysis tasks such as classification and segmentation [13–18].
In particular, significant performance gains in melanoma recognition have been achieved by leveraging deep convolutional neural networks in a two-stage framework [19], which uses a fully convolutional residual network for skin lesion segmentation and a very deep residual network for skin lesion classification. However, the issues of low inter-class variation and class imbalance in skin lesion image datasets severely undermine the applicability of deep learning to melanoma detection [19, 20], as they often hinder the model's ability to generalize, leading to over-fitting [21].

In this paper, we employ conditional image synthesis without paired images to tackle the class imbalance problem by generating synthetic images for the minority class. Built on top of generative adversarial networks (GANs) [22], several image synthesis approaches, both conditional [23] and unconditional [24], have been recently adopted for numerous medical imaging tasks, including melanoma detection [25–27]. Also, approaches that enable the training of diverse models based on distribution matching with both paired and unpaired data were introduced in [28–31]. These approaches include image translation from CT-PET [32], CS-MRI [33], MR-CT [34], XCAT-CT [35] and H&E staining in histopathology [36, 37]. In [38, 39], image synthesis models that synthesize images from noise were developed in an effort to improve melanoma detection. However, Cohen et al. [40] showed that the training schemes used in several domain adaptation methods often lead to a high bias and may result in hallucinating features (e.g. adding or removing tumors, leading to a semantic change). This is due in large part to the source or target domains consisting of over- or under-represented samples during training (e.g. a source domain composed of 50% malignant and 50% benign images, or a target domain composed of 20% malignant and 80% benign images).

In this paper, we introduce MelaNet, a deep neural network based framework for melanoma detection, to overcome the aforementioned issues. Our approach mitigates the bias problem [40], while improving detection performance and reducing over-fitting. The proposed MelaNet framework consists of two integrated stages. In the first stage, we generate synthetic dermoscopic images for the minority class (i.e. malignant images) using unpaired image-to-image translation in a bid to balance the training set. These additional images are then used to boost training. In the second stage, we train a deep convolutional neural network classifier by minimizing the focal loss function, which assists the classification model in learning from hard examples, while down-weighting the easy ones. The main contributions of this paper can be summarized as follows:

• We propose an integrated deep learning based framework, which couples adversarial training and transfer learning to jointly address inter-class variation and class imbalance for the task of skin lesion classification.
• We train a deep convolutional network by iteratively minimizing the focal loss function, which assists the model in learning from hard examples, while down-weighting the easy ones.
• We show experimentally on a dermatology image analysis benchmark significant improvements over several baseline methods for the important task of melanoma detection.
• We show how our method enables visual discovery of high activations for the regions surrounding the skin lesion, leading to context based lesion assessment that can reach an expert dermatologist level.
The rest of this paper is organized as follows. In Section 2, we introduce a two-stage approach for melanoma detection that uses conditional image synthesis from benign to malignant lesions in an effort to mitigate the effect of class imbalance, followed by training a deep convolutional neural network via iterative minimization of the focal loss function in order to learn from hard examples. We also discuss in detail the major components of our approach, and summarize its main algorithmic steps. In Section 3, experiments performed on a dermatology image analysis dataset are presented to demonstrate the effectiveness of the proposed approach in comparison with baseline methods. Finally, we conclude in Section 4 and point out future work directions.
2 Method

In this section, we describe the main components and algorithmic steps of the proposed approach to melanoma detection.
In order to tackle the challenging issue of low inter-class variation in skin lesion datasets [19, 21], we partition the inter-classes into two domains for conditional image synthesis, with the goal of generating malignant lesions from benign lesions. This data generation process for the malignant minority class is performed in an effort to mitigate the class imbalance problem, as it is relatively easy to learn a transformation with given prior knowledge or conditioning for a narrowly defined task [34, 37]. Also, using unconditional image synthesis to generate data of a target distribution from noise often leads to artifacts and may result in training instabilities [41]. In recent years, various methods based on generative adversarial networks (GANs) have been used to tackle the conditional image synthesis problem, but most of them use paired training data for image-to-image translation [42], which requires the generation of a new image that is a controlled modification of a given image. Due to the unavailability of datasets consisting of paired examples for melanoma detection, we use cycle-consistent adversarial networks (CycleGAN), a technique that involves the automatic training of image-to-image translation models without paired examples [28]. These models are trained in an unsupervised fashion using a collection of images from the source and target domains. CycleGAN is a framework for training image-to-image translation models by learning mapping functions between two domains using the GAN model architecture in conjunction with cycle consistency. The idea behind cycle consistency is to ward off the learned mappings between these two domains from contradicting each other.

Given two image domains $B$ and $M$ denoting benign and malignant, respectively, the CycleGAN framework aims to learn to translate images of one type to another using two generators $G_B: B \to M$ and $G_M: M \to B$, and two discriminators $D_M$ and $D_B$, as illustrated in Figure 1. The generator $G_B$ (resp. $G_M$) translates images from benign to malignant (resp. malignant to benign), while the discriminator $D_M$ (resp. $D_B$) scores how real an image of $M$ (resp. $B$) looks. In other words, these discriminator models are used to determine how plausible the generated images are and update the generator models accordingly. The objective function of CycleGAN is defined as

$$\mathcal{L}(G_B, G_M, D_M, D_B) = \mathcal{L}_{\text{GAN}}(G_B, D_M, B, M) + \mathcal{L}_{\text{GAN}}(G_M, D_B, M, B) + \lambda \mathcal{L}_{\text{cyc}}(G_B, G_M), \quad (1)$$

which consists of two adversarial loss functions and a cycle consistency loss function regularized by a hyper-parameter $\lambda$ that controls the relative importance of these loss functions [28]. The first adversarial loss is given by

$$\mathcal{L}_{\text{GAN}}(G_B, D_M, B, M) = \mathbb{E}_{m \sim p_{\text{data}}(m)}[\log D_M(m)] + \mathbb{E}_{b \sim p_{\text{data}}(b)}[\log(1 - D_M(G_B(b)))], \quad (2)$$

where the generator $G_B$ tries to generate images $G_B(b)$ that look similar to malignant images, while $D_M$ aims to distinguish between generated samples $G_B(b)$ and real samples $m$. During training, as $G_B$ generates a malignant lesion, $D_M$ verifies whether the translated image is actually a real malignant lesion or a generated one. The data distributions of benign and malignant are $p_{\text{data}}(b)$ and $p_{\text{data}}(m)$, respectively. Similarly, the second adversarial loss is given by

$$\mathcal{L}_{\text{GAN}}(G_M, D_B, M, B) = \mathbb{E}_{b \sim p_{\text{data}}(b)}[\log D_B(b)] + \mathbb{E}_{m \sim p_{\text{data}}(m)}[\log(1 - D_B(G_M(m)))], \quad (3)$$

where $G_M$ takes a malignant image $m$ from $M$ as input, and tries to generate a realistic image $G_M(m)$ in $B$ that tricks the discriminator $D_B$.
Hence, the goal of $G_M$ is to generate a benign lesion such that it fools the discriminator $D_B$ into labeling it as a real benign lesion.

The third loss function is the cycle consistency loss given by

$$\mathcal{L}_{\text{cyc}}(G_B, G_M) = \mathbb{E}_{b \sim p_{\text{data}}(b)}[\|G_M(G_B(b)) - b\|_1] + \mathbb{E}_{m \sim p_{\text{data}}(m)}[\|G_B(G_M(m)) - m\|_1], \quad (4)$$

which basically quantifies the difference between the input image and the generated one using the $\ell_1$-norm. The idea of the cycle consistency loss is to enforce $G_M(G_B(b)) \approx b$ and $G_B(G_M(m)) \approx m$. In other words, the objective of CycleGAN is to learn two bijective generator mappings by solving the following optimization problem

$$G_B^*, G_M^* = \arg\min_{G_B, G_M} \max_{D_B, D_M} \mathcal{L}(G_B, G_M, D_M, D_B). \quad (5)$$

We adopt the U-Net architecture [13] for the generators and PatchGAN [29] for the discriminators. The U-Net architecture consists of an encoder subnetwork and a decoder subnetwork connected by a bridge section, while PatchGAN is basically a convolutional neural network classifier that determines whether an image patch is real or fake.
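To make the objective concrete, the following is a minimal TensorFlow/Keras sketch of how the losses in Eqs. (1)–(5) decompose into the quantities used for alternating generator/discriminator updates. Here `G_B`, `G_M`, `D_M` and `D_B` are assumed to be Keras models (U-Net generators and PatchGAN discriminators) built elsewhere; this is an illustration of the objective under those assumptions, not the authors' released implementation.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)

def adversarial_losses(G_B, D_M, b_real, m_real):
    """L_GAN(G_B, D_M, B, M) of Eq. (2), split into the discriminator part
    (real malignant -> 1, generated malignant -> 0) and the generator part
    (G_B tries to make D_M output 1 on its translations)."""
    m_fake = G_B(b_real, training=True)            # benign -> synthetic malignant
    d_real = D_M(m_real, training=True)
    d_fake = D_M(m_fake, training=True)
    d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    g_loss = bce(tf.ones_like(d_fake), d_fake)
    return d_loss, g_loss                          # Eq. (3) is symmetric in G_M, D_B

def cycle_loss(G_B, G_M, b_real, m_real, lam=10.0):
    """Cycle-consistency loss of Eq. (4): l1 reconstruction error of
    b -> G_B(b) -> G_M(G_B(b)) and m -> G_M(m) -> G_B(G_M(m)),
    weighted by lambda = 10 as in the implementation details."""
    b_rec = G_M(G_B(b_real, training=True), training=True)
    m_rec = G_B(G_M(m_real, training=True), training=True)
    return lam * (tf.reduce_mean(tf.abs(b_rec - b_real)) +
                  tf.reduce_mean(tf.abs(m_rec - m_real)))
```

In an alternating scheme, the discriminators are updated on `d_loss` while the generators are updated on `g_loss` plus the cycle term, mirroring the min-max problem in Eq. (5).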
Figure 1: Illustration of the generative adversarial training process for unpaired image-to-image translation. Lesions are translated from benign to malignant and then back to benign to ensure cycle consistency in the forward pass. The same procedure is applied in the backward pass from malignant to benign.

Due to limited training data, it is standard practice to leverage deep learning models that were pre-trained on large datasets [43]. The proposed melanoma classification model uses the pre-trained VGG-16 convolutional neural network without the fully connected (FC) layers, as illustrated in Figure 2. The VGG-16 network consists of 16 layers with learnable weights: 13 convolutional layers, and 3 fully connected layers [44]. As shown in Figure 2, the proposed architecture, dubbed VGG-GAP, consists of five blocks of convolutional layers, followed by a global average pooling (GAP) layer. Each of the first and second convolutional blocks is comprised of two convolutional layers with 64 and 128 filters, respectively. Similarly, each of the third, fourth and fifth convolutional blocks consists of three convolutional layers with 256, 512, and 512 filters, respectively. The GAP layer, which is widely used in classification tasks, computes the average output of each feature map in the previous layer and helps minimize overfitting by reducing the total number of parameters in the model. GAP turns a feature map into a single number by taking the average of the numbers in that feature map. Similar to max pooling layers, GAP layers have no trainable parameters and are used to reduce the spatial dimensions of a three-dimensional tensor. The GAP layer is followed by a single FC layer with a softmax function (i.e. a dense softmax layer of two units for the binary classification case) that yields the probabilities of the predicted classes.
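The VGG-GAP classifier described above can be assembled in a few lines of Keras, as in the sketch below. The input resolution is an illustrative assumption (the exact value used in the paper is not preserved here), and the ImageNet weights reflect the transfer learning discussed above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vgg_gap(input_shape=(256, 256, 3), num_classes=2):
    """VGG-16 convolutional base (ImageNet weights, FC layers removed),
    followed by global average pooling and a two-unit dense softmax layer.
    The input resolution is an illustrative assumption."""
    base = tf.keras.applications.VGG16(include_top=False,
                                       weights="imagenet",
                                       input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(base.output)  # one scalar per feature map
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(base.input, outputs, name="VGG-GAP")
```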
Figure 2: VGG-GAP architecture with a GAP layer, followed by an FC layer that in turn is fed into a softmax layer of two units.

Since we are addressing a binary classification problem with imbalanced data, we learn the weights of the VGG-GAP network by minimizing the focal loss function [45] defined as

$$\text{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), \quad (6)$$

where $p_t$ and $\alpha_t$ are given by

$$p_t = \begin{cases} p & \text{if } y = 1, \\ 1 - p & \text{otherwise}, \end{cases} \qquad \alpha_t = \begin{cases} \alpha & \text{if } y = 1, \\ 1 - \alpha & \text{otherwise}, \end{cases}$$

with $y \in \{-1, 1\}$ denoting the ground truth for the negative and positive classes, and $p \in [0, 1]$ denoting the model's predicted probability for the class with label $y = 1$. The weight parameter $\alpha \in [0, 1]$ balances the importance of positive and negative labeled samples, while the nonnegative tunable focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted. Note that when $\gamma = 0$, the focal loss function reduces to the cross-entropy loss. A positive value of the focusing parameter decreases the relative loss for well-classified examples, focusing more on hard, misclassified examples. Intuitively, the focal loss function emphasizes hard-to-classify examples: it down-weights the loss for well-classified examples so that their contribution to the total loss is small even if their number is large (a code sketch of this loss is given after Algorithm 1).

In order to achieve faster convergence, feature standardization is usually performed, i.e. we rescale the images to have values between 0 and 1. Given a data matrix $X = (x_1, \dots, x_n)^{\top}$, the standardized feature vector is given by

$$z_i = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}, \quad i = 1, \dots, n, \quad (7)$$

where $x_i$ is the $i$-th input data point, denoting a row vector. It is important to note that in our approach, no domain specific or application specific pre-processing or post-processing is employed.

On the other hand, data augmentation is usually carried out on medical datasets to improve performance in classification tasks [24, 46]. This is often done by creating modified versions of the input images in a dataset through random transformations, including horizontal and vertical flips, Gaussian noise, brightness and zoom augmentation, horizontal and vertical shifts, sampling noise once per pixel, color space conversion, and rotation. We do not perform on-the-fly (random) data augmentation during training, as it may add an unnecessary layer of complexity to training and evaluation. When designing our configurations, we first augment the data offline and then train the classifier using the augmented data. Also, we do not apply data augmentation in the proposed two-stage approach, as it would not give us an insight into which of the two approaches contributes more to the performance (data augmentation or image synthesis?). Hence, we keep these two configurations independent from each other.

The main algorithmic steps of our approach are summarized in Algorithm 1. The input is a training set consisting of skin lesion dermoscopic images, along with their associated class labels. In the first stage, the different classes are grouped together (e.g. for binary classification, we have two groups), and each image is resized to a fixed size. Then, we balance the inter-class data samples by performing undersampling. We train CycleGAN to learn a function of the inter-class variation between the two groups, i.e. we learn a transformation between melanoma and non-melanoma lesions. We apply CycleGAN to the over-represented class samples in order to synthesize the target class samples (i.e. the under-represented class).
After this transformation is applied, we acquire a balanced dataset composed of the original training data and the generated data. In the second stage, we employ the VGG-GAP classifier with the focal loss function. Finally, we evaluate the trained model on the test set to generate the predicted class labels.

Algorithm 1 MelaNet classifier

Input: Training set D = {(I_1, y_1), ..., (I_n, y_n)} of dermoscopic images, where y_i is the class label of the input I_i.
Output: Vector ŷ containing the predicted class labels.

for i = 1 to n do
    Group each lesion image according to its class label.
    Resize each image to a fixed size.
end for
Balance the inter-class data samples.
Train CycleGAN on the unpaired and balanced inter-class data.
for i = 1 to n do
    if the class label is benign then
        Translate to malignant using the generator network.
    end if
end for
Merge the synthesized under-represented class outputs and the original training set.
Shuffle.
Train VGG-GAP on the balanced training set.
Evaluate the model on the test set and generate the predicted class labels.
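As a companion to Algorithm 1, the sketch below shows one way to implement the second stage in Keras: the focal loss of Eq. (6) for one-hot targets, and the compilation of the VGG-GAP model (from the earlier sketch) with the Adadelta settings reported in the implementation details. The value α = 0.25 is the common default of Lin et al. [45] and stands in for the paper's exact (not preserved) choice; `x_balanced` and `y_balanced` denote the merged, shuffled training set and are assumed names.

```python
import tensorflow as tf

def focal_loss(alpha=0.25, gamma=2.0):
    """Focal loss of Eq. (6) for one-hot targets [benign, malignant].
    alpha_t = alpha for the positive (malignant) class, 1 - alpha otherwise.
    alpha = 0.25 is a placeholder for the paper's exact setting."""
    class_weights = tf.constant([1.0 - alpha, alpha])
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)
        p_t = tf.reduce_sum(y_true * y_pred, axis=-1)       # prob. of the true class
        alpha_t = tf.reduce_sum(y_true * class_weights, axis=-1)
        return -tf.reduce_mean(alpha_t * (1.0 - p_t) ** gamma * tf.math.log(p_t))
    return loss

# Second stage: fine-tune VGG-GAP (earlier sketch) on the balanced set.
model = build_vgg_gap()
model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.001),
              loss=focal_loss(gamma=2.0),
              metrics=[tf.keras.metrics.AUC(name="auc")])
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.1)
# Epoch count is illustrative; training continues until the focal loss stops improving:
# model.fit(x_balanced, y_balanced, batch_size=16, epochs=100, callbacks=[reduce_lr])
```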
3 Experiments

In this section, extensive experiments are conducted to evaluate the performance of the proposed two-stage approach on a standard benchmark dataset for skin lesion analysis.
Dataset.
The effectiveness of MelaNet is evaluated on the ISIC-2016 dataset, a publicly accessible dermatology image analysis benchmark challenge for skin lesion analysis towards melanoma detection [47], which leverages annotated skin lesion images from the International Skin Imaging Collaboration (ISIC) archive. The dataset contains a representative mix of images of both malignant and benign skin lesions, which were randomly partitioned into training and test sets, with 900 images in the training set and 379 images in the test set. These images contain different types of textures in both background and foreground, and also have poor contrast, making the task of melanoma detection a challenging problem. It is also noteworthy that the training set contains 727 benign cases and only 173 malignant cases, resulting in an inter-class ratio of 1:4. Sample benign and malignant images from the ISIC-2016 dataset are depicted in Figure 3, which shows that both categories have a high visual similarity, making the task of melanoma detection quite arduous. Note that there is a high intra-class variation among the malignant samples. These variations include color, texture and shape. On the other hand, it is important to point out that benign samples are not visually very different, and hence they exhibit low inter-class variation. Furthermore, there are artifacts present in the images, such as ruler markers and fine hair, which cause occlusions. Notice that most malignant images show more diffuse boundaries, owing to the possibility that before image acquisition the patient was already diagnosed with melanoma and the medical personnel acquired the dermoscopic images at a deeper level in order to better differentiate between the benign and malignant classes.

Figure 3: Sample malignant and benign images from the ISIC-2016 dataset. Notice the high intra-class variation among the malignant samples (left), while benign samples (right) are not visually very different.

The histogram of the training data is displayed in Figure 4, showing the class imbalance problem, where the number of images belonging to the minority class ("malignant") is far smaller than the number of images belonging to the majority class ("benign"). Also, the numbers of benign and malignant cases in the test set are 304 and 75, respectively, again with an inter-class ratio of 1:4.
Figure 4: Histogram of the ISIC-2016 training set, showing the class imbalance between malignant and benign cases.

Since the images in the ISIC-2016 dataset are of varying sizes, we first apply padding to make them square, in order to retain the original aspect ratio, and then resize them to a fixed resolution.
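A pad-then-resize step of this kind can be written as follows; the target resolution and the use of PIL are illustrative assumptions, and the division by 255 implements the [0, 1] rescaling of Eq. (7).

```python
import numpy as np
from PIL import Image

def pad_and_resize(path, size=256):
    """Zero-pad a lesion image to a square canvas (preserving the aspect
    ratio), resize it, and rescale intensities to [0, 1] as in Eq. (7).
    The target size is an illustrative assumption."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side))                 # black square canvas
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))   # center the lesion image
    return np.asarray(canvas.resize((size, size)), dtype=np.float32) / 255.0
```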
Training Procedure. Since we are tackling a binary classification problem with imbalanced data, we use the focal loss function for training the VGG-GAP model. The focal loss is designed to address the class imbalance problem by down-weighting easy examples and focusing training on the hard ones. Fine-tuning is essentially performed by re-training the whole VGG-GAP network while iteratively minimizing the focal loss function.
Baseline methods.
We compare the proposed MelaNet approach against VGG-GAP, VGG-GAP + Augment-5x, and VGG-GAP + Augment-10x. The VGG-GAP network is trained on the original training set, which consists of 900 samples. The VGG-GAP + Augment-5x model uses the same VGG-GAP architecture, but is trained on an augmented dataset composed of 5400 training samples, i.e. we increase the training set 5 times, from 900 to 5400 samples, using image augmentation. Similarly, the VGG-GAP + Augment-10x network is trained on an augmented set of 9900 training samples (i.e. 10 times the original set). We also ran experiments with augmented training sets larger than 10x the original one, but we did not observe improved performance, as the network tends to learn redundant representations.
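An offline augmentation pass of this Augment-kx type can be sketched as below. The specific transformation ranges are illustrative assumptions; the paper lists the families of transformations (flips, shifts, brightness, zoom, rotation, noise) but not their exact parameters.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Transformation ranges below are illustrative assumptions.
augmenter = ImageDataGenerator(rotation_range=30,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               brightness_range=(0.8, 1.2),
                               zoom_range=0.2,
                               horizontal_flip=True,
                               vertical_flip=True)

def augment_offline(x_train, y_train, k=5):
    """Materialize k randomly transformed copies of each image before
    training (offline augmentation, not on-the-fly), returning the
    original set plus the copies; k=5 yields the 5400-sample
    Augment-5x configuration for a 900-sample training set."""
    xs, ys = [x_train], [y_train]
    for _ in range(k):
        batch = next(augmenter.flow(x_train, batch_size=len(x_train), shuffle=False))
        xs.append(batch)
        ys.append(y_train)
    return np.concatenate(xs), np.concatenate(ys)
```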
Implementation details.
All experiments are carried out on a Linux server with 2x Intel Xeon E5-2650 V4 Broadwell @ 2.2 GHz, 256 GB RAM, and 4x NVIDIA P100 Pascal (12 GB HBM2 memory) GPU cards. The algorithms are implemented in Keras with a TensorFlow backend.

We train CycleGAN for 500 epochs using the Adam optimizer [48] with learning rate 0.0002 and batch size 1. We set the regularization parameter λ to 10. The VGG-GAP classifier, on the other hand, is trained using the Adadelta optimizer [49] with learning rate 0.001 and mini-batch size 16. A factor of 0.1 is used to reduce the learning rate once the loss stagnates. For the VGG-GAP model, we set the focusing parameter to γ = 2 together with a fixed weight α, so that α_t = α for positive labeled samples and α_t = 1 − α for negative labeled samples. Training of VGG-GAP is continued on all network layers until the focal loss stops improving, and then the best weights are retained. For fair comparison, we use the same set of hyper-parameters for VGG-GAP and the baseline methods. We choose Adadelta as an optimizer due to its robustness to noisy gradient information and minimal computational overhead.

The effectiveness of the proposed classifier is assessed by conducting a comprehensive comparison with the baseline methods using several performance evaluation metrics [18, 19, 50], including the receiver operating characteristic (ROC) curve, sensitivity, and the area under the ROC curve (AUC). Sensitivity is defined as the percentage of positive instances correctly classified, i.e.

$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \quad (8)$$

where TP, FP, TN and FN denote true positives, false positives, true negatives and false negatives, respectively. TP is the number of correctly predicted malignant lesions, while TN is the number of correctly predicted benign lesions. A classifier that reduces FN (ruling cancer out in cases that do have it) and FP (wrongly diagnosing cancer where there is none) indicates a better performance. Sensitivity, also known as recall or true positive rate (TPR), indicates how often a classifier misses a positive prediction. It is one of the most common measures used to evaluate a classifier in medical image classification tasks [51]. We use a classification threshold of 0.5.

Another common metric is the AUC, which summarizes the information contained in the ROC curve; the ROC curve plots the TPR versus the false positive rate FPR = FP/(FP + TN) at various thresholds. Larger AUC values indicate better performance at distinguishing between melanoma and non-melanoma images. It is worth pointing out that the accuracy metric is not used in this study, as it provides no interpretable information here and may lead to a false sense of superiority from classifying the majority class.

The performance comparison results of MelaNet and the baseline methods using AUC, FN and Sensitivity are depicted in Figure 5. We observe that our approach outperforms the baselines, achieving an AUC of 81.18% and a sensitivity of 91.76%, with performance improvements of 2.1% and 7.3% over the VGG-GAP baseline. Interestingly, MelaNet yields the lowest number of false negatives, which were reduced by more than 50% compared to the baseline methods, meaning it picked up on malignant cases that the baselines had missed. In other words, MelaNet caught instances of melanoma that would have otherwise gone undetected. This is a significant result for early melanoma detection, albeit MelaNet was trained on only 1627 samples, composed of 900 images from the original dataset and 727 synthesized images (benign and malignant) obtained via generative adversarial training.
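The reported quantities can be computed from the test-set predictions as in the following sketch, using scikit-learn; `y_prob` is assumed to hold the predicted probability of the malignant class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

def evaluate(y_true, y_prob, threshold=0.5):
    """Sensitivity (Eq. 8), false negatives, AUC and the ROC curve for a
    binary classifier; y_prob is the predicted malignant-class probability."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    fpr, tpr, _ = roc_curve(y_true, y_prob)        # points of the ROC curve
    return {"sensitivity": tp / (tp + fn),
            "FN": int(fn),
            "AUC": roc_auc_score(y_true, y_prob),
            "roc": (fpr, tpr)}
```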
Figure 5: Classification performance of MelaNet and the baseline methods using AUC, FN and Sensitivity as evaluation metrics on the ISIC-2016 test set.
Figure 6: ROC curves for MelaNet and the baseline methods, along with the corresponding AUC values (VGG-GAP: 79.08%; VGG-GAP + Augment-5x: 78.81%; VGG-GAP + Augment-10x: 79.56%; MelaNet: 81.18%).

Figure 6 displays the ROC curves, which show the better performance of our proposed MelaNet approach compared to the baseline methods. Each point on the ROC curve represents a different trade-off between false positives and false negatives. An ROC curve closer to the upper left indicates better performance (TPR is higher than FPR). Even though the ROC curve of MelaNet seems to fluctuate at certain points during the early and late stages, the overall performance is much higher than the baselines, as indicated by the AUC value. This better performance demonstrates that the conditional image synthesis procedure plays a crucial role and enables our model to learn effective representations, while mitigating data scarcity and class imbalance.

We also compare MelaNet to two other standard baseline methods [18, 19]. The top evaluation results on the ISIC-2016 dataset for classifying images as either benign or malignant are reported in [18]. The method presented in [19] is also a two-stage approach, consisting of a fully convolutional residual network for skin lesion segmentation followed by a very deep residual network for skin lesion classification. The classification results are displayed in Table 1, which shows that the proposed approach achieves significantly better results than the baseline methods.
Feature visualization and analysis.
Understanding and interpreting the predictions made by a deep learning model provides valuable insights into the input data and the features learned by the model, so that the results can be easily understood by human experts. In order to visually explain the decisions made by the proposed classifier and the baseline methods, we use gradient-weighted class activation mapping (Grad-CAM) [52] to generate saliency maps that highlight the most influential features affecting the predictions. Since convolutional feature maps retain spatial information and each pixel of a feature map indicates whether the corresponding visual pattern exists in its receptive field, the output from the last convolutional layer of the VGG-16 network shows the discriminative regions of the image.

The class activation maps displayed in Figure 7 show that even though the baseline methods demonstrate high activations for the region consisting of the lesion, they still fail to correctly classify the dermoscopic image. For our proposed MelaNet approach, we observe that the area surrounding the skin lesion is highly activated. Notice that most of the borders of the whole input image are highlighted, due largely to the fact that the classifiers are not looking at the regions of interest, and hence misclassify the image.

Table 1: Classification evaluation results of MelaNet and the baseline methods. Boldface numbers indicate the best performance.

Method                                  AUC (%)   Sensitivity (%)   FN
Gutman et al. [18]                      80.40     50.70             –
Yu et al. [19] (without segmentation)   78.20     42.70             –
Yu et al. [19] (with segmentation)      78.30     54.70             –
VGG-GAP                                 79.08     84.46             55
VGG-GAP + Augment-5x (ours)             78.81     85.34             51
VGG-GAP + Augment-10x (ours)            79.56     86.09             47
MelaNet (ours)                          81.18     91.76

We can also see in Figure 8 that while the proposed approach shows visual patterns similar to the baselines when correctly classifying the input image, it nevertheless outputs high activations for the regions surrounding the skin lesion in many cases. These regions consist of shapes and edges. Hence, our approach not only focuses on the skin lesion, but also captures its context, which helps in the final detection. This context-based approach is commonly used by expert dermatologists [51]. This observation is of great significance, and further shows the effectiveness of our approach.

Figure 7: Grad-CAM heat maps for the misclassified malignant cases by MelaNet and the baseline methods.

Figure 8: Grad-CAM heat maps for the correctly classified malignant cases by MelaNet and the baseline methods.
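For reference, a compact Grad-CAM sketch for the VGG-GAP model of the earlier code is given below; `block5_conv3` is the last convolutional layer of VGG-16, and the normalization to [0, 1] is an illustrative choice rather than a detail taken from the paper.

```python
import tensorflow as tf

def grad_cam(model, image, layer_name="block5_conv3", class_index=1):
    """Grad-CAM heat map: weight the last convolutional feature maps by the
    spatially averaged gradient of the class score, sum them, and apply ReLU.
    `image` is one preprocessed (H, W, 3) array; class_index=1 is malignant."""
    conv_layer = model.get_layer(layer_name)
    grad_model = tf.keras.Model(model.input, [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)                 # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))           # GAP over spatial dims
    cam = tf.nn.relu(tf.einsum("bhwc,bc->bhw", conv_out, weights))[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()     # normalize to [0, 1]
```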
In order to get a clear understanding of the data distribution, the learned features from both the original training set and the balanced dataset (i.e. with the additional data synthesized using adversarial training) are visualized using Uniform Manifold Approximation and Projection (UMAP) [53], a dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a two- or three-dimensional space. The UMAP embeddings shown in Figure 9 were generated by running the UMAP algorithm on the original training set with 900 samples (benign and malignant) and on the balanced dataset consisting of 1627 samples (benign and malignant).

From Figure 9 (left), it is evident that the inter-class variation is significantly small, due in large part to the very high visual similarity between malignant and benign skin lesions. Hence, the task of learning a decision boundary between the two categories is challenging. We can also see that the synthesized samples (malignant lesions shown in green) lie very close to the original data distribution. It is important to note that the outliers present in the dataset are not due to the image synthesis procedure; rather, this is a characteristic of the original training set. Therefore, the synthetically generated data are representative of the original under-represented class (i.e. malignant skin lesions).

Figure 9: Two-dimensional UMAP embeddings of the original ISIC-2016 training set (left), consisting of 900 samples (benign shown in blue and malignant in orange), and with the additional synthesized malignant data samples (shown in green), for a total of 1627 samples (right).
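A minimal sketch of producing such embeddings with the umap-learn package follows; the feature matrix and the UMAP hyper-parameters shown are assumptions (library defaults), not necessarily the authors' settings.

```python
import umap  # umap-learn package

def embed_2d(features, n_neighbors=15, min_dist=0.1, seed=0):
    """Project high-dimensional image features to 2-D with UMAP [53].
    `features` is an (n_samples, n_features) array, e.g. flattened images
    or activations from the trained classifier."""
    reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors,
                        min_dist=min_dist, random_state=seed)
    return reducer.fit_transform(features)
```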
Discussion.
With a training set consisting of only 1627 images, our proposed MelaNet approach is able to achieve improved performance. This better performance is largely due to the fact that, by leveraging the inter-class variation in medical images, the mapping between the source and target distributions for conditional image synthesis can be easily learned. Moreover, it is much easier to generate target images given prior information than to generate them from noise, which often results in training instability and artifacts [50]. It is important to note that even though image-to-image translation schemes are considered to hallucinate images by adding or removing image features [40], we showed that in our scheme the partition of the inter-classes does not result in a bias or unwanted feature hallucination. Figure 10 shows benign lesions sampled from the ISIC-2016 training set, which are translated to malignant samples using MelaNet. As can be seen, the benign and the corresponding synthesized malignant images have a high degree of visual similarity. This is largely due to the nature of the dataset, which is known to have a low inter-class variation.

In the synthetic minority over-sampling technique (SMOTE), when drawing random observations from the k-nearest neighbors of a sample, it is possible that a "border point", i.e. an observation very close to the decision boundary, is selected, resulting in synthetically generated observations lying too close to the decision boundary, and as a result the performance of the classifier may be degraded. The advantage of our approach over SMOTE is that we learn a transformation between a source and a target domain by solving an optimization problem in order to determine two bijective mappings. This enables the generator to synthesize observations that help improve the classification performance while learning the transformation/decision boundary.
Figure 10: Sample benign images from the ISIC-2016 dataset that are translated to malignant images using the proposed approach. Notice that the synthesized images display a reasonably good visual quality.

In order to gain a deeper insight into the performance of the proposed approach, we sample all the original benign lesions and a subset of the synthesized malignant lesions, consisting of 727 and 10 samples, respectively. For the benign group of images, the proposed MelaNet model yields a sensitivity score of 89%, with 77 misclassified images. By contrast, a 100% sensitivity score is obtained when performing predictions on the synthesized malignant group of images. In addition, the F-score values for MelaNet on the benign and synthesized malignant groups are 94% and 21%, respectively.
4 Conclusion

In this paper, we proposed a two-stage framework for melanoma detection. The first stage addresses the problem of data scarcity and class imbalance by formulating inter-class variation as conditional image synthesis for over-sampling, in order to synthesize under-represented class samples (e.g. melanoma from non-melanoma lesions). The newly synthesized samples are then used as additional data to train a deep convolutional neural network by minimizing the focal loss function, which assists the classifier in learning from hard examples. We demonstrate through extensive experiments that the proposed MelaNet approach improves sensitivity by a margin of 13.10% and the AUC by 0.78% using only 1627 dermoscopy images, compared to the baseline methods on the ISIC-2016 dataset. For future work, we plan to address the multi-class classification problem, which requires an independent generative model for each domain, leading to prohibitive computational overhead for adversarial training. We also intend to apply our method to other medical imaging modalities.
References

[1] E. Saito and M. Hori, "Melanoma skin cancer incidence rates in the world from the cancer incidence in five continents XI," Japanese Journal of Clinical Oncology, vol. 48, no. 12, pp. 1113–1114, 2018.
[2] R. Siegel, K. Miller, and A. Jemal, "Cancer statistics, 2019," CA: A Cancer Journal for Clinicians, vol. 69, pp. 7–34, 2019.
[3] M. Binder, M. Schwarz, A. Winkler, A. Steiner, A. Kaider, K. Wolff, and H. Pehamberger, "Epiluminescence microscopy: a useful tool for the diagnosis of pigmented skin lesions for formally trained dermatologists," Archives of Dermatology, vol. 131, no. 3, pp. 286–291, 1995.
[4] H. Ganster, P. Pinz, R. Rohrer, E. Wildling, M. Binder, and H. Kittler, "Automated melanoma recognition," IEEE Transactions on Medical Imaging, vol. 20, no. 3, pp. 233–239, 2001.
[5] Y. Cheng, R. Swamisai, S. E. Umbaugh, R. H. Moss, W. V. Stoecker, S. Teegala, and S. K. Srinivasan, "Skin lesion classification using relative color features," Skin Research and Technology, vol. 14, no. 1, pp. 53–64, 2008.
[6] Z. Liu and J. Zerubia, "Skin image illumination modeling and chromophore identification for melanoma diagnosis," Physics in Medicine & Biology, vol. 60, pp. 3415–3431, 2015.
[7] N. K. Mishra and M. E. Celebi, "An overview of melanoma detection in dermoscopy images using image processing and machine learning," arXiv preprint arXiv:1601.07843, 2016.
[8] L. Ballerini, R. B. Fisher, B. Aldridge, and J. Rees, "A color and texture based hierarchical K-NN approach to the classification of non-melanoma skin lesions," in Color Medical Image Analysis, pp. 63–86, Springer, 2013.
[9] T. Tommasi, E. La Torre, and B. Caputo, "Melanoma recognition using representative and discriminative kernel classifiers," in Proc. International Workshop on Computer Vision Approaches to Medical Image Analysis, pp. 1–12, 2006.
[10] M. E. Celebi, H. A. Kingravi, B. Uddin, H. Iyatomi, Y. A. Aslandogan, W. V. Stoecker, and R. H. Moss, "A methodological approach to the classification of dermoscopy images," Computerized Medical Imaging and Graphics, vol. 31, no. 6, pp. 362–373, 2007.
[11] G. Schaefer, B. Krawczyk, M. E. Celebi, and H. Iyatomi, "An ensemble classification approach for melanoma diagnosis," Memetic Computing, vol. 6, no. 4, pp. 233–240, 2014.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[13] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, 2015.
[14] H. R. Roth, L. Lu, A. Seff, K. M. Cherry, J. Hoffman, S. Wang, J. Liu, E. Turkbey, and R. M. Summers, "A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations," in Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 520–527, 2014.
[15] M. Anthimopoulos, S. Christodoulidis, L. Ebner, A. Christe, and S. Mougiakakou, "Lung pattern classification for interstitial lung diseases using a deep convolutional neural network," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1207–1216, 2016.
[16] K. Matsunaga, A. Hamada, A. Minagawa, and H. Koga, "Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble," arXiv preprint arXiv:1703.03108, 2017.
[17] N. C. Codella, Q.-B. Nguyen, S. Pankanti, D. A. Gutman, B. Helba, A. C. Halpern, and J. R. Smith, "Deep learning ensembles for melanoma recognition in dermoscopy images," IBM Journal of Research and Development, vol. 61, no. 4/5, pp. 5–1, 2017.
[18] D. Gutman, N. C. Codella, E. Celebi, B. Helba, M. Marchetti, N. Mishra, and A. Halpern, "Skin lesion analysis toward melanoma detection: A challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC)," arXiv preprint arXiv:1605.01397, 2016.
[19] L. Yu, H. Chen, Q. Dou, J. Qin, and P.-A. Heng, "Automated melanoma recognition in dermoscopy images via very deep residual networks," IEEE Transactions on Medical Imaging, vol. 36, no. 4, pp. 994–1004, 2016.
[20] C.-K. Shie, C.-H. Chuang, C.-N. Chou, M.-H. Wu, and E. Y. Chang, "Transfer representation learning for medical image analysis," in Proc. International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 711–714, 2015.
[21] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1285–1298, 2016.
[22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[23] D. Nie, R. Trullo, J. Lian, C. Petitjean, S. Ruan, Q. Wang, and D. Shen, "Medical image synthesis with context-aware generative adversarial networks," in Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 417–425, 2017.
[24] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, "Synthetic data augmentation using GAN for improved liver lesion classification," in Proc. IEEE International Symposium on Biomedical Imaging, pp. 289–293, 2018.
[25] X. Yi, E. Walia, and P. Babyn, "Generative adversarial network in medical imaging: A review," Medical Image Analysis, 2019.
[26] P. Costa, A. Galdran, M. I. Meyer, M. Niemeijer, M. Abràmoff, A. M. Mendonça, and A. Campilho, "End-to-end adversarial retinal image synthesis," IEEE Transactions on Medical Imaging, vol. 37, no. 3, pp. 781–791, 2017.
[27] T. Zhang, H. Fu, Y. Zhao, J. Cheng, M. Guo, Z. Gu, B. Yang, Y. Xiao, S. Gao, and J. Liu, "SkrGAN: Sketching-rendering unconditional generative adversarial networks for medical image synthesis," in Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 777–785, 2019.
[28] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.
[29] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134, 2017.
[30] M.-Y. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in Advances in Neural Information Processing Systems, pp. 700–708, 2017.
[31] A. M. Lamb, D. Hjelm, Y. Ganin, J. P. Cohen, A. C. Courville, and Y. Bengio, "GibbsNet: Iterative adversarial inference for deep graphical models," in Advances in Neural Information Processing Systems, pp. 5089–5098, 2017.
[32] A. Ben-Cohen, E. Klang, S. P. Raskin, M. M. Amitai, and H. Greenspan, "Virtual PET images from CT data using deep convolutional networks: initial results," in Proc. International Workshop on Simulation and Synthesis in Medical Imaging, pp. 49–57, 2017.
[33] G. Yang, S. Yu, H. Dong, G. Slabaugh, P. L. Dragotti, X. Ye, F. Liu, S. Arridge, J. Keegan, Y. Guo, et al., "DA-GAN: deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction," IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1310–1321, 2017.
[34] J. M. Wolterink, A. M. Dinkla, M. H. Savenije, P. R. Seevinck, C. A. van den Berg, and I. Išgum, "Deep MR to CT synthesis using unpaired data," in Proc. International Workshop on Simulation and Synthesis in Medical Imaging, pp. 14–23, 2017.
[35] T. Russ, S. Goerttler, A.-K. Schnurr, D. F. Bauer, S. Hatamikia, L. R. Schad, F. G. Zöllner, and K. Chung, "Synthesis of CT images from digital body phantoms using CycleGAN," International Journal of Computer Assisted Radiology and Surgery, vol. 14, no. 10, pp. 1741–1750, 2019.
[36] M. T. Shaban, C. Baur, N. Navab, and S. Albarqouni, "StainGAN: Stain style transfer for digital histological images," in Proc. IEEE International Symposium on Biomedical Imaging, pp. 953–956, 2019.
[37] T. de Bel, M. Hermsen, J. Kers, J. van der Laak, and G. Litjens, "Stain-transforming cycle-consistent generative adversarial networks for improved segmentation of renal histopathology," in Proc. International Conference on Medical Imaging with Deep Learning, vol. 102, pp. 151–163, 2018.
[38] A. Bissoto, F. Perez, E. Valle, and S. Avila, "Skin lesion synthesis with generative adversarial networks," in Proc. International Workshop on Computer-Assisted and Robotic Endoscopy, pp. 294–302, 2018.
[39] I. S. Ali, M. F. Mohamed, and Y. B. Mahdy, "Data augmentation for skin lesion using self-attention based progressive generative adversarial network," arXiv preprint arXiv:1910.11960, 2019.
[40] J. P. Cohen, M. Luck, and S. Honari, "Distribution matching losses can hallucinate features in medical image translation," in Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 529–536, 2018.
[41] Z. Zhao, Q. Sun, H. Yang, H. Qiao, Z. Wang, and D. O. Wu, "Compression artifacts reduction by improved generative adversarial networks," EURASIP Journal on Image and Video Processing, vol. 2019, no. 1, p. 62, 2019.
[42] S. Kazeminia, C. Baur, A. Kuijper, B. van Ginneken, N. Navab, S. Albarqouni, and A. Mukhopadhyay, "GANs for medical image analysis," arXiv preprint arXiv:1809.06222, 2018.
[43] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?," in Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.
[44] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, 2015.
[45] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE International Conference on Computer Vision, pp. 2980–2988, 2017.
[46] T. Araújo, G. Aresta, E. Castro, J. Rouco, P. Aguiar, C. Eloy, A. Polónia, and A. Campilho, "Classification of breast cancer histology images using convolutional neural networks," PLoS ONE, vol. 12, no. 6, p. e0177544, 2017.
[47] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, et al., "Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC)," in Proc. IEEE International Symposium on Biomedical Imaging, pp. 168–172, 2018.
[48] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
[49] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[50] F. Perez, C. Vasconcelos, S. Avila, and E. Valle, "Data augmentation for skin lesion analysis," in Proc. International Workshop on Computer-Assisted and Robotic Endoscopy, pp. 303–311, 2018.
[51] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, no. 7639, p. 115, 2017.
[52] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE International Conference on Computer Vision, pp. 618–626, 2017.
[53] L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform manifold approximation and projection for dimension reduction," arXiv preprint arXiv:1802.03426, 2018.