MelanoGANs: High Resolution Skin Lesion Synthesis with GANs
Christoph Baur
Computer Aided Medical Procedures (CAMP), TU Munich, Germany
[email protected]
Shadi Albarqouni
Computer Aided Medical Procedures (CAMP), TU Munich, Germany
Nassir Navab
Computer Aided Medical Procedures (CAMP), TU Munich, Germany
Whiting School of Engineering, Johns Hopkins University, Baltimore, United States
Abstract
Generative Adversarial Networks (GANs) have been successfully used to synthesize realistically looking images of faces, scenery and even medical images. Unfortunately, they usually require large training datasets, which are often scarce in the medical field, and to the best of our knowledge GANs have only been applied for medical image synthesis at fairly low resolution. However, many state-of-the-art machine learning models operate on high resolution data, as such data carries indispensable, valuable information. In this work, we try to generate realistically looking high resolution images of skin lesions with GANs, using only a small training dataset of 2000 samples. The nature of the data allows us to directly compare the image statistics of the generated samples and the real dataset. We both quantitatively and qualitatively compare state-of-the-art GAN architectures such as DCGAN and LAPGAN against a modification of the latter for the task of image generation at a resolution of 256x256px. Our investigation shows that we can approximate the real data distribution with all of the models, but we notice major differences when visually rating sample realism, diversity and artifacts. In a set of use-case experiments on skin lesion classification, we further show that we can successfully tackle the problem of heavy class imbalance with the help of synthesized high resolution melanoma samples.
Introduction

Generative Adversarial Networks (GANs) [6] have heavily disrupted the field of machine learning. In the computer vision community, they have been successfully used for the generation of realistically looking images of indoor and outdoor scenery [17, 4], faces [17] or handwritten digits [6]. Their conditional extension [14] has also set the new state of the art in the realms of super-resolution [11] and image-to-image translation [8]. Some of these successes have been translated to the medical domain, with applications for cross-modality image synthesis [19], CT image denoising [20] and even for the synthesis of biological images [16], PET images [2], prostate lesions [10] and OCT patches [18].

The synthesis of realistic images opens up various opportunities for machine learning, in particular for the data-hungry deep learning paradigm: Since deep learning requires vast amounts of labeled training data, which is often scarce in the medical field, realistically looking synthetic data may be used to increase the training dataset size, to cope with severe class imbalance and to potentially improve robustness and generalization capability of the models. Successful attempts for data augmentation using GANs have been made in [1, 5]. Additionally, a trained GAN can provide valuable insights into the latent structure behind data distributions, e.g. by investigating the connection between the latent manifold and the generated images [21], ultimately facilitating data simulation.

Figure 1: A comparison between the LAPGAN (left) and our DDGAN (right) architecture. While the LAPGAN focuses on the generation of realistically looking residual images from multiple sources of noise, our DDGAN directly optimizes for real images instead, while implicitly learning the residuals from a single source of noise.

GANs have been successfully used for various image synthesis tasks.
Thoroughly engineered architectures such as DCGAN [17] or LAPGAN [4] have proven to work well for high quality image synthesis, however at resolutions of 64x64px and 96x96px, respectively, as reported in the original papers. In the context of fine-grained image classification, recent work [15] has pointed out the importance of high resolution data: The authors reported that using synthetic 32x32px images upsampled to 128x128px rather than using synthetic 128x128px images right away leads to a noticeable decrease in classifier performance, which clearly motivates the generation of realistically looking images at higher resolutions.

Contribution
In this work, we aim to increase the resolution of generated images while maintaining high quality and realism. For our experiments, we choose the ISIC 2017 dataset [3], consisting of approx. 2000 dermoscopic images of benign and malignant skin lesions. The nature of the images in the dataset allows us to directly compare the image statistics of the real and the generated data. For data generation, we employ state-of-the-art architectures such as DCGAN and LAPGAN and rank them against a modification of the latter. More precisely, in contrast to the LAPGAN, which involves multiple sources of noise, we experiment with a single source of noise and a discrimination on real and synthesized images rather than residuals, and further try to learn an upsampling instead of using traditional interpolation. This leaves us with a network which can be trained end-to-end and has multiple discriminators attached to different levels of the generator, thus we refer to it as the deeply discriminated GAN (DDGAN). A comparison in terms of the aforementioned image statistics shows that all of the models match the training dataset distribution very well; however, visual exploration reveals noticeable differences in terms of sample diversity, sharpness and artifacts. In a variety of use-case experiments for skin lesion classification, we further show that synthetic high resolution skin lesion images can be successfully leveraged to tackle the problem of severe class imbalance.

The remainder of this manuscript is organized as follows: We first briefly recapitulate the GAN framework as well as the DCGAN and the LAPGAN before we introduce our proposed DDGAN architecture. This is followed by an experiments section, where we try to synthesize realistically looking skin lesion images at a resolution of 256x256px using DCGAN, LAPGAN and different instances of the DDGAN.
In the second part of the experiments section, we compare the performance of a state-of-the-art skin lesion classifier trained in the presence of severe class imbalance against models where the class imbalance has been resolved with the help of synthetic images.
Background
The original GAN framework consists of a pair of adversarial networks: A generator network G tries to transform random noise z ∼ p_z from a prior distribution p_z (usually a standard normal distribution) into realistically looking images G(z) ∼ p_fake. At the same time, a discriminator network D aims to classify well between samples coming from the real training data distribution x ∼ p_real and fake samples G(z) generated by the generator. By utilizing the feedback of the discriminator, the parameters of the generator G can be adjusted such that its samples are more likely to fool the discriminator network in its classification task. Mathematically speaking, the networks play a two-player minimax game against each other:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log(D(x))] + E_{z∼p_z(z)}[log(1 − D(G(z)))]   (1)

In consequence, as D and G are updated in an alternating fashion, the discriminator D becomes better in distinguishing between real and fake samples while the generator G learns to produce even more realistic samples, round by round.

DCGAN
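As a toy illustration of Eq. (1), the value function can be estimated empirically from discriminator outputs on a batch of real and generated samples. The following NumPy sketch is our own addition for illustration, not part of the original paper:

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-8):
    """Empirical estimate of the GAN value function V(D, G):
    the mean of log D(x) over real samples plus the mean of
    log(1 - D(G(z))) over generated samples. d_real and d_fake
    hold discriminator outputs in (0, 1)."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.log(d_real).mean() + np.log1p(-d_fake).mean()
```

A well-performing discriminator (D(x) near 1, D(G(z)) near 0) drives V toward its maximum of 0, while a generator that fools D pushes V down; in training, D ascends and G descends on this quantity in alternation.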
The DCGAN architecture is a popular and well engineered convolutional GAN that is fairly stable to train and yields high quality results. The architecture is carefully designed with leaky ReLU activations to avoid sparse gradients and a specific weight initialization to allow for robust training. It has proven to work reliably in the task of image synthesis at a resolution of 64x64px.
LAPGAN
The LAPGAN is a generative image synthesis framework inspired by the concept of Laplacian pyramids. Again, as in the standard GAN framework, a generator G_0 produces fake low resolution images I_{0,fake} from noise z_0 ∼ p_z. These images are then subject to an upsampling operation up(·) and fed, together with noise z_1, into the next generator G_1 of the pyramid, which is supposed to generate the fake residual image R_{1,fake}, i.e. the high frequency components which need to be added to the upsampled and thus blurry input image up(I_{0,fake}) to obtain a realistic, higher resolution image, i.e. I_{1,fake} = up(I_{0,fake}) + R_{1,fake}. The output is upsampled again and fed into the next higher resolution residual generator:

I_{0,fake} = G_0(z_0), z_0 ∼ p_z
I_{k,fake} = up(I_{k−1,fake}) + R_{k,fake},   R_{k,fake} = G_k(up(I_{k−1,fake}), z_k),   k > 0

A peculiarity of this approach is the discrimination between real and fake residual images rather than the discrimination between real and fake images, i.e. a discriminator at level k > 0 operates on R_{k,real} and R_{k,fake} rather than I_{k,real} and I_{k,fake} (here referred to as residual discrimination). Interestingly, the framework by default is not trained end-to-end, even though this is theoretically possible. Instead, the different generators are trained separately, which makes the approach very time-consuming. Noteworthy, the LAPGAN has proven to work for synthesizing 96x96px sized images of realistically looking outdoor scenery, but to the best of our knowledge it has not been applied to medical data yet.

DDGAN

Like the LAPGAN, our DDGAN (Fig. 1) starts with a generator G_0 which maps noise z ∼ p_z to low resolution image samples I_{0,fake}. A respective low resolution discriminator then has to distinguish between real and fake images and provides the lowest resolution generator with gradients. In succession, the generated images are upsampled and fed into another generator.
However, this generator differs from the LAPGAN generator in multiple aspects: First, opposed to the LAPGAN, the upsampled, generated images are not concatenated with another channel of noise. Second, any higher resolution generator G_k, k > 0, is simply a residual block [7] res(·, d) of depth d, whose output directly is an image I_{k,fake} rather than a residual map (Fig. 2a). Consequently, and different to the LAPGAN, the discrimination happens on real and fake images rather than real and fake residuals (referred to as image discrimination). This way, the high frequency residuals that have to be added to an upsampled version of the respective low resolution input image I_{k−1,fake} are learned implicitly by the residual block:

I_{0,fake} = G_0(z), z ∼ p_z
I_{k,fake} = G_k(up(I_{k−1,fake})) = res(up(I_{k−1,fake}), d),   k > 0

Figure 2: (a) Residual Deconvolution Block. (b) Skin lesions in the ISIC 2017 dataset.

Further, the non-parametric upsampling can be replaced by a deconvolutional layer, which effectively amounts to learning an upsampling. Whether image discrimination is preferable over residual discrimination as seen in the LAPGAN, and whether upsampling via deconvolution is somehow beneficial, is subject to research in this manuscript.
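The DDGAN sampling recursion can be sketched as follows. This is an illustrative NumPy sketch of our own: g0 and the entries of res_blocks are hypothetical callables standing in for the trained generator G_0 and the learned residual blocks res(·, d), whose actual weights are not part of this illustration:

```python
import numpy as np

def upsample2x(img):
    """Non-parametric nearest-neighbour 2x upsampling of an (H, W, C)
    array; plays the role of the up(.) operation in the text."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def ddgan_sample(g0, res_blocks):
    """Sketch of DDGAN sampling: a single source of noise enters only
    G_0; every higher level outputs a full image I_k = res(up(I_{k-1}), d),
    so the residual is learned implicitly and each level's output can be
    fed to an image discriminator directly."""
    img = g0()                      # I_0 = G_0(z)
    outputs = [img]
    for res in res_blocks:          # one residual block per pyramid level
        img = res(upsample2x(img))  # I_k: a full image, not a residual map
        outputs.append(img)
    return outputs
```

For example, with g0 producing a 64x64x3 array and two residual blocks, the returned pyramid holds images at 64x64, 128x128 and 256x256 resolution, matching the three levels used in our experiments.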
Experiments

In the first part of our experiments, we train a standard DCGAN, a LAPGAN and various DDGANs for skin lesion synthesis on the entire dataset and investigate the properties of the synthetic samples. In the second part, we utilize a selection of the frameworks to train synthesis models only on melanoma images in order to tackle class imbalance with the help of synthetic samples when training skin lesion classifiers.
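The class-imbalance strategy of the second part can be summarized in the following sketch. Note that synthesize(label, n) is a hypothetical wrapper around a generative model trained only on images of that class, not an API from this work:

```python
def restore_balance(dataset, synthesize):
    """Sketch of tackling class imbalance with synthetic samples: every
    under-represented class is topped up with generated images until all
    classes match the size of the largest class. `dataset` maps a class
    label to its list of training images; `synthesize(label, n)` returns
    n generated images of that class."""
    target = max(len(imgs) for imgs in dataset.values())
    balanced = {}
    for label, imgs in dataset.items():
        extra = synthesize(label, target - len(imgs))  # 0 for the majority class
        balanced[label] = list(imgs) + list(extra)
    return balanced
```

In our experiments the real melanoma class is first artificially reduced and then restored in exactly this fashion, using samples from the GAN under evaluation.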
We evaluate our method on the ISIC 2017 [3] dataset consisting of 2000 dermoscopic images of both benign and malignant skin lesions (1372 benign lesions, 254 seborrheic keratosis samples and 374 melanoma). The megapixel dermoscopic images are center cropped and downsampled to 256x256px, leading to 2000 training images. Fig. 2b shows some of these training samples.

A variety of methods have been proposed for evaluating the performance of GANs in capturing data distributions and for judging the quality of synthesized images. In order to evaluate visual fidelity, numerous works utilized either crowdsourcing or expert ratings to distinguish between real and synthetic samples. There have also been efforts to develop quantitative measures to rate realism and diversity of synthetic images, the most prominent being the so-called Inception Score, which relies on an ImageNet-pretrained GoogleNet. Unfortunately, we noticed that it does not provide meaningful scores for skin lesions, as the GoogleNet focuses on the properties of real objects and natural images. Odena et al. [15] rate sample diversity by computing the mean MS-SSIM metric among randomly chosen synthetic sample pairs, for which a high value indicates high sample diversity. Given the constrained nature of our images, this approach is also not applicable, since we obtain very high and comparable MS-SSIM values on the training dataset and on synthetic samples. Instead, per model we generate 2000 random samples, compute a normalized color histogram and compare it to the normalized color histogram of the training dataset in terms of the JS-Divergence and the Wasserstein Distance. Further, we discuss visual fidelity of the generated images with a focus on diversity, realism, sharpness and artifacts.

Table 1: Performance comparison of the DCGAN, LAPGAN and DDGAN. The models are compared in terms of the JS-Divergence and the Wasserstein Distance (EMD) between the histogram of the training images and the histogram of samples generated using the respective model.

Model                  EMD       JS-Divergence
DCGAN                  0.00821   0.00458
LAPGAN                 0.04098   0.01420
DDGAN upsampling       -         -
DDGAN deconvolution    -         -
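The histogram comparison underlying Table 1 can be sketched as follows. This NumPy re-implementation reflects our own assumptions for illustration: 1-D normalized intensity histograms, a base-2 JS divergence, and the 1-D earth mover's distance computed from CDFs; the bin count and value range are illustrative choices, not taken from the paper:

```python
import numpy as np

def color_histogram(pixels, bins=64):
    """Normalized histogram of intensity values in [0, 256)."""
    h, _ = np.histogram(np.asarray(pixels).ravel(), bins=bins, range=(0, 256))
    return h / h.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two normalized histograms (base 2)."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def emd_1d(p, q):
    """1-D earth mover's (Wasserstein) distance between two normalized
    histograms, equal to the L1 distance between their CDFs (in bin units)."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())
```

Both divergences are zero for identical histograms; the EMD additionally accounts for how far probability mass has to move across intensity bins, which makes it sensitive to global brightness shifts.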
We trained a standard DCGAN, a LAPGAN and various DDGANs for skin lesion synthesis at a resolution of 256x256px. For a valid comparison, both the LAPGAN and the DDGAN are designed to have the same number of trainable parameters. Notably, the DCGAN directly regresses a single source of Gaussian noise to images with a resolution of 256x256px, while LAPGAN and DDGAN increasingly regress from 64x64px over 128x128px up to 256x256px sized images. All of the models have been trained in an end-to-end fashion. The dimensionality of z is always set to . As a loss function for the discriminator network we employ the least-squares loss [12]. All models have been trained for 200 epochs, in minibatches of 8 due to GPU memory constraints on our nVidia 1080Ti, which took approx. 20h per model.

Overall, all of the models mimic the real data distribution fairly well (see Table 1 and Fig. 4). Interestingly, the DCGAN matches the training dataset intensity distribution best in all of the divergence measures, even though it shows the least sample variety and suffers from severe checkerboard artifacts. The LAPGAN produces a great diversity of samples, but suffers from high frequency artifacts as a result of high magnitude residuals. The DDGAN with standard upsampling and image discrimination matches the training dataset intensity distribution slightly better than the LAPGAN, but sample diversity seems to be slightly lower with any of the DDGAN models. Thereby, deconvolution seems to produce noisier and more unrealistic samples than standard upsampling.

Next, we repeated the synthesis experiment, but trained models from only the 374 melanoma images. Further, we split the ISIC dataset into a training set (60% of the data) and a validation set (40% of the data) while keeping the class distribution within each set, and utilized it to train a variety of skin lesion classifiers for classifying lesions into benign, melanoma and keratosis.
Similar to the winner of the ISIC 2017 challenge [13], we utilize a pretrained RESNET-50 and train a baseline model B_Full on the full training dataset, another baseline model B_Imb where the number of melanoma images was artificially reduced to 46 to obtain a severe class imbalance, as well as various RESNET-50 models where the original class distribution was restored with synthetic melanoma samples from DCGAN, LAPGAN, DDGAN upsampling and DDGAN deconvolution.

Figure 4: Samples generated with the different models. (a) DCGAN (b) LAPGAN (c) DDGAN deconvolution (d) DDGAN upsampling
Table 2: Results of our use-case experiments, reporting the training and validation accuracy for the different models.

Set         B_Full    B_Imb     DCGAN    LAPGAN    DDGAN upsampling    DDGAN deconvolution
Training    0.9809    0.8583    -        -         -                   -
Validation  -         -         -        -         -                   -

As expected, in the presence of class imbalance, B_Imb performs considerably worse than B_Full. Interestingly, when restoring the original class distribution of the training dataset with synthetic samples, we obtain even higher accuracies than with B_Full. The biggest improvements are made with samples from the LAPGAN, closely followed by DDGAN upsampling and DDGAN deconvolution.

Conclusion

In summary, we presented a comparison of the DCGAN, the LAPGAN and the DDGAN for the task of high resolution skin lesion synthesis and demonstrated that both the LAPGAN and the DDGAN are able to mimic the training dataset distribution with diverse and realistic samples, even when the training dataset is very small. In a set of use-case experiments, these synthetic samples have also been successfully used in the training of skin lesion classifiers for tackling class imbalance, even outperforming a baseline model purely trained on real data. We attribute the observation that the histogram divergences are not consistent with the synthesis quality of the models to the fact that the high frequency artifacts produced by the LAPGAN and the DDGAN bias the intensity histograms. This is reflected in the histogram obtained with DDGAN upsampling, which comes closer to the training dataset histogram than that of the LAPGAN, as it produces less high frequency artifacts than the latter. Our qualitative and quantitative results have further shown that a learnt upsampling with the help of deconvolution is not superior to non-parametric upsampling. In our use-case experiments, the best performance was obtained with the LAPGAN, leaving us with the conclusion that having multiple sources of noise is indeed beneficial for realism and sample diversity.
Interestingly, the high magnitude residual artifacts in the LAPGAN do not seem to negatively impact the skin lesion classifier. Irrespective of that, we suppose that more training iterations might resolve these artifacts, but we also want to emphasize that training the LAPGAN is very difficult, requiring constant supervision and adjustment of hyperparameters. In comparison, the DDGAN with image discrimination is easier to train, converges faster and does not suffer from severe high frequency artifacts, while only being slightly inferior to the LAPGAN in our use-case experiments. In future work, we aim to obtain feedback from dermatologists on sample realism, investigate the very recent approach for high resolution image synthesis presented in [9], and also conduct a variety of use-case experiments on data augmentation using synthetic samples.