Deep Transfer Learning for Automated Diagnosis of Skin Lesions from Photographs
Doyoon Kim ∗
Cleveland High School, California, US
Emma Rocheteau † ∗
Department of Computer Science and Technology, University of Cambridge, UK
Abstract
Melanoma is not the most common form of skin cancer, but it is the most deadly. Currently, the disease is diagnosed by expert dermatologists, which is costly and requires timely access to medical treatment. Recent advances in deep learning have the potential to improve diagnostic performance, expedite urgent referrals and reduce the burden on clinicians. Through smartphones, the technology could reach people who would not normally have access to such healthcare services, e.g. in remote parts of the world, due to financial constraints or, in 2020, COVID-19 cancellations. To this end, we have investigated various transfer learning approaches by leveraging model parameters pre-trained on ImageNet with fine-tuning on melanoma detection. We compare EfficientNet, MnasNet, MobileNet, DenseNet, SqueezeNet, ShuffleNet, GoogleNet, ResNet, ResNeXt, VGG and a simple CNN with and without transfer learning. We find that the mobile network EfficientNet (with transfer learning) achieves the best mean performance, with an area under the receiver operating characteristic curve (AUROC) of 0.931 ± [...]

Melanoma is the most common cause of skin cancer related deaths worldwide [28]. In the United States alone, it is estimated that there will be 100,350 cases and 6,850 melanoma-related deaths in 2020 [2]. Initially, it develops in melanocytes, where genetic mutations lead to unregulated growth and the ability to metastasise to other areas of the body [39]. Like many cancers, early detection is key to successful treatment. If melanoma is detected before spreading to the lymph nodes, the average five-year survival rate is 98%. However, this drops to 64% if it has spread to regional lymph nodes, and 23% if it has reached distant organs.

Currently, melanoma is diagnosed by professional medical examination [14]. A meta-analysis conducted by Phillips et al. [24] showed that when distinguishing between melanoma and benign skin lesions, primary care physicians (10 studies) achieve an area under the receiver operating characteristic curve (AUROC) of 0.83 ± [...]

∗ Equal contribution, † Corresponding author
Machine Learning for Mobile Health Workshop at NeurIPS 2020, Vancouver, Canada.

[...] clinicians, and when they are rigorously evaluated with significance testing and model interpretability, useful tools can be produced to support health in the community.

In this work, we investigate transfer learning with various Convolutional Neural Networks (CNNs) on the binary task of classifying melanoma and benign skin lesions. In addition, we perform post-hoc visualisation of the feature attributions using integrated gradients [32].

Recent work by Raghu et al. [25] cast doubt on the usefulness of transfer learning (TL) for medical imaging. However, a few TL works have achieved success on the problem of melanoma detection [8, 21, 22, 26, 38, 40] (although they do not necessarily compare the model with and without TL). We did not find an extensive survey of existing TL models comparable to ours. The highest AUROC for melanoma detection that we found on photographs was 0.880 in Bisla et al. [3], which is still lower than the performance of professional dermatologists found in Phillips et al. [24].
Our task is to distinguish benign nevi from malignant melanomas. For each patient we have a dermatoscopic photograph in RGB format, x, static features, s (age, gender and location on the body), and the binary label, y. Figure 1 shows the basic architecture of all models. Our code is publicly available at https://github.com/aimadeus/Transfer_learning_melanoma .
Figure 1: Model architecture. The CNN component (indicated in brackets) is different in each experiment. The static data is processed separately and concatenated to the CNN output before a final prediction is made.
Transfer Learning (TL) is a machine learning method where the weights of a trained model are used to initialise another model on a different task [34]. In our case, we investigate several CNN architectures using pre-training on ImageNet [4] (a database containing over 14 million images). The last fully connected layer is replaced with one that has a binary output, and whose weights are initialised using Kaiming initialisation [7]. Further description of the various CNN architectures is provided in Appendix A. We also train a standard 5-layer CNN with no transfer learning (hyperparameter optimisation and further implementation details are provided in Appendix B and C respectively).
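As an illustration of the head replacement described above, the following minimal sketch keeps a set of pre-trained parameters and swaps the final layer for a freshly initialised one. It is not the code from our repository: the dictionary-based model, the names `kaiming_normal` and `replace_head`, the feature size of 512 and the use of a single output unit are all assumptions made for the example. The only part taken from [7] is the rule that weights are drawn with variance 2 / fan_in.

```python
import math
import random


def kaiming_normal(fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix drawn from N(0, 2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def replace_head(pretrained, n_features, n_outputs=1, seed=0):
    """Reuse the pre-trained parameters but swap in a freshly initialised head."""
    rng = random.Random(seed)
    model = dict(pretrained)  # shallow copy: backbone entries are reused as-is
    model["head_weight"] = kaiming_normal(n_features, n_outputs, rng)
    model["head_bias"] = [0.0] * n_outputs
    return model


# Hypothetical stand-in for ImageNet pre-trained parameters (1000-class head).
pretrained = {
    "backbone": "pre-trained convolutional weights (placeholder)",
    "head_weight": [[0.0] * 512 for _ in range(1000)],
    "head_bias": [0.0] * 1000,
}
finetune_model = replace_head(pretrained, n_features=512)
```

After the swap, `finetune_model` shares its backbone with `pretrained`, while the new head has one output row of 512 weights; the original 1000-class head is left untouched.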
We use the International Skin Imaging Collaboration (ISIC) 2020 dataset [27] (released August 2020), containing labelled photographs taken from various locations on the body (see Table 3 in the Appendix). We noted a significant class imbalance, with only 2% of the data containing melanoma. To improve this ratio, we added a second dataset with additional malignant cases [1], which brought the total to 37,648 skin lesion images. The data was split such that 60%, 20% and 20% were used for training, validation and testing respectively. Data augmentation was performed on the training data to introduce small variations in the form of random rotations, horizontal and vertical flipping, resizing, and brightness and saturation shifts. This means the training data is subtly altered each time it is presented to the model. Figure 2 shows two examples of raw and augmented images.
Figure 2: Example photographs in the training data. In each pair of images, the raw data is shown on the left and an augmented image example is shown on the right.
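The 60/20/20 partition of the 37,648 images can be sketched as a shuffled index split. This is only an illustration under stated assumptions: the function name `split_indices`, the fixed seed and the use of a plain (non-stratified) shuffle are ours for the example, and the text does not specify whether the split was stratified by class.

```python
import random


def split_indices(n_samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(37648)
```

With these fractions the three parts contain 22,588, 7,529 and 7,531 images respectively, and no index appears in more than one part.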
Table 1 shows the test performance of all the models ((a) without and (b) with transfer learning). Eight of the ten CNN models performed significantly better with TL across all 4 metrics, and none of the models is significantly harmed by TL on any metric, demonstrating a clear benefit of transfer learning for melanoma detection. The best performing models are EfficientNet [17] and MnasNet [35], which significantly outperform dermatologists [24] on AUROC. From the ROC curves shown in Figure 3, we see that EfficientNet can achieve a true positive rate of 0.95 at a false positive rate of only 0.1.
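For readers less familiar with these metrics, the sketch below shows how an AUROC and a single point on the ROC curve can be computed from labels and predicted scores. The labels and scores are invented toy values, not results from our experiments, and the function names are illustrative. The AUROC is computed as the probability that a randomly chosen melanoma receives a higher score than a randomly chosen benign lesion, with ties counted as one half.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at a score threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos


# Toy labels (1 = melanoma) and predicted probabilities, invented for illustration.
labels = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
scores = [0.1, 0.2, 0.3, 0.35, 0.9, 0.8, 0.4, 0.6, 0.7, 0.05]

toy_auroc = auroc(labels, scores)           # 23 of 24 pairs are ranked correctly
toy_point = roc_point(labels, scores, 0.5)  # (FPR, TPR) at a threshold of 0.5
```

Sweeping the threshold from 1 down to 0 traces out the full ROC curve; a statement such as "a true positive rate of 0.95 at a false positive rate of 0.1" corresponds to one such point.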
Table 1: Performance of the models averaged over 10 independent training runs. Tables (a) and (b) show the performance without and with transfer learning respectively. The error margins are 95% confidence intervals (CIs). We report the accuracy, area under the receiver operating characteristic curve (AUROC), area under the precision recall curve (AUPRC) and the F1 Score. Within each table, the results are ordered from worst to best performance. In table (b), if the result is statistically better than the model without transfer learning in a one-tailed t-test (p < [...] ∗ and p < [...] ∗∗), then it is indicated with stars. Results that significantly outperform general practitioners and dermatologists on AUROC (determined by a recent meta-analysis † [24]) are indicated in green and blue respectively (p < [...]).

(a) Without transfer learning

Model                      Accuracy            AUROC             AUPRC       F1 Score
Standard CNN               0.914 ± [...]       [...]             [...]       [...]
General Practitioners †    -                   0.83 ± [...]      -           -
[...]                      (rows for the ten CNN models; names and values not recoverable)
Dermatologists †           -                   [...]             -           -

(b) With transfer learning

Model                      Accuracy            AUROC             AUPRC       F1 Score
General Practitioners †    -                   0.83 ± [...]      -           -
[...]                      [...] ∗∗            [...] ∗∗          [...] ∗∗    [...] ∗∗
ResNet-50 [6]              0.962 ± [...]       [...]             [...]       [...]
[...]                      [...]               [...]             [...]       [...]
[...]                      [...] ∗∗            [...] ∗∗          [...] ∗∗    [...] ∗∗
DenseNet [9]               0.966 ± [...] ∗∗    [...] ∗∗          [...] ∗∗    [...] ∗∗
Dermatologists †           -                   [...]             -           -
MobileNet [29]             0.969 ± [...] ∗∗    [...] ∗∗          [...] ∗∗    [...] ∗∗
ResNeXt [37]               0.971 ± [...] ∗∗    [...] ∗∗          [...] ∗∗    [...] ∗∗
GoogleNet [33]             0.973 ± [...] ∗∗    [...] ∗∗          [...] ∗∗    [...] ∗∗
MnasNet [35]               0.974 ± [...] ∗     [...] ∗∗          [...] ∗∗    [...] ∗∗
EfficientNet [17]          0.975 ± [...] ∗∗    0.931 ± [...] ∗∗  [...] ∗∗    [...] ∗∗

Figure 3: ROC curves of TL models and Standard CNN (we magnify the top left part of the curves in the right plot).

Figure 4: A test set image and corresponding integrated gradient attributions for the standard CNN model.

We used the integrated gradients method [32] to calculate feature attributions. This method computes the importance scores $\phi_i^{IG}$ by accumulating gradients interpolated between a baseline input $b$ (intended to represent the absence of data; in our case this is a black image) and the current input $x$:

$$\phi_i^{IG}(\psi, x, b) = \overbrace{(x_i - b_i)}^{\text{diff. from baseline}} \times \int_{\alpha=0}^{1} \overbrace{\frac{\delta \psi(b + \alpha(x - b))}{\delta x_i}}^{\text{acc. local grad.}} \, d\alpha \qquad (1)$$

The CNN model is represented as $\psi$. We observed that the models tend to focus primarily on the edges of the skin lesions (Figure 4). This aligns with our expectation, since uneven or notched edges are common in melanoma [19]. Secondary to the edges, some importance is given to the lesion itself and the surrounding skin. This is significant because melanomas can also show uneven texture or colour [19].

We have conducted an extensive investigation of transfer learning for the task of melanoma detection from photographs. We have demonstrated the benefit of transfer learning with ImageNet pre-training [4] for melanoma detection on the ISIC 2020 dataset [27].
Furthermore, we show that the best performing neural networks are EfficientNet and MnasNet, which are capable of outperforming dermatologists when distinguishing melanoma from benign skin lesions. In particular, we note that these networks have been specifically designed for mobile devices [17, 35]. This may be important when it comes to data privacy and medical data regulations (as the classification can be performed locally on the user’s personal device).

In future work, we aim to extend the binary classification task to classify other skin lesions such as benign keratosis, basal cell carcinoma, actinic keratosis, vascular lesions and dermatofibroma. Secondly, we would like to extend our interpretability study such that we can visualise the learnt features in the intermediate layers of the models. To do this we can leverage the approach of Mordvintsev et al. [20], whereby we obtain inputs designed to maximise the activation of hidden layers of the network. This will provide further insights as to why certain models outperform others in melanoma detection. Finally, we can validate the diagnostic technology in the community with an implementation study.
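As a concrete companion to the integrated gradients attribution in Equation (1), the sketch below approximates the integral over α with a midpoint Riemann sum. It is a toy illustration rather than our implementation: the model ψ is replaced by a three-input logistic function with made-up weights (`WEIGHTS`), its gradient is written analytically, and the all-zero baseline plays the role of the black image. All names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


WEIGHTS = [0.8, -0.5, 0.3]  # hypothetical toy model parameters


def psi(x):
    """Toy differentiable model standing in for the CNN."""
    return sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)))


def grad_psi(x):
    """Analytic gradient of psi with respect to each input feature."""
    p = psi(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, grad_fn, steps=100):
    """Approximate Equation (1) with a midpoint Riemann sum over alpha."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            total[i] += g / steps  # accumulate the local gradients
    # Scale by the difference from the baseline, feature by feature.
    return [(xi - b) * t for xi, b, t in zip(x, baseline, total)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # all-zero input, analogous to a black image
attributions = integrated_gradients(x, baseline, grad_psi)
```

A useful sanity check is the completeness property of integrated gradients [32]: the attributions should sum to approximately ψ(x) − ψ(b), which holds here up to the discretisation error of the Riemann sum.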
The automated diagnosis technology could be used to screen, triage, refer and follow up patients in the community. It also has the potential to reach patients who would not normally have access to dermatologists, e.g. in remote areas or the developing world. The high AUROC of EfficientNet (a high true positive rate coinciding with a low false positive rate) would make it well-suited to this purpose. Such a system could significantly reduce the cost and resources needed to screen and treat patients, as it reduces the pool of patients needing to see a dermatologist.

The background and intuition behind the integrated gradients method are explained clearly in Sturmfels et al. [31].

Acknowledgements

The authors would like to thank Horizon Academic for facilitating this research.
References

[1] melanoma external malignant 256, 2020.
[2] American Cancer Society. Key statistics for melanoma skin cancer, 2020.
[3] Devansh Bisla, Anna Choromanska, Jennifer A. Stein, David Polsky, and Russell S. Berman. Skin lesion segmentation and classification with deep learning system. CoRR, abs/1902.06061, 2019. URL http://arxiv.org/abs/1902.06061.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
[5] Y. Fujisawa, S. Inoue, and Y. Nakamura. The Possibility of Deep Learning-Based, Computer-Aided Skin Tumor Classifiers. Front. Med., 2019.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv 1512.03385, 12 2015.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015. URL http://arxiv.org/abs/1502.01852.
[8] Khalid M. Hosny, Mohamed A. Kassem, and Mohamed M. Foaud. Classification of skin lesions using transfer learning and augmentation with alex-net. PLOS ONE, 14(5):1–17, 05 2019. doi: 10.1371/journal.pone.0217293. URL https://doi.org/10.1371/journal.pone.0217293.
[9] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks, 2018.
[10] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size. arXiv:1602.07360, 02 2016.
[11] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305, 02 2012. ISSN 1532-4435.
[12] A.P. Kassianos, J.D. Emery, P. Murchie, and F.M. Walter. Smartphone applications for melanoma detection by community, patient and generalist clinician users: a review. British Journal of Dermatology, 05 2015. doi: 10.1111/bjd.13665.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[14] H. Kittler, H. Pehamberger, K. Wolff, and M. Binder. Diagnostic accuracy of dermoscopy. Database of Abstracts of Reviews of Effects (DARE): Quality-assessed Reviews [Internet], 2002.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[16] Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks, 2019.
[17] M. Tan and Q. V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv:1905.11946, 05 2019.
[18] N. Ma, X. Zhang, H. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. 2018.
[19] Collette McCourt, Olivia Dolan, and Gerry Gormley. Malignant melanoma: a pictorial review. The Ulster medical journal, 83(2):103–110, may 2014. ISSN 2046-4207.
[20] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going deeper into neural networks, 2015. URL https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html.
[21] Dennis H. Murphree and Che Ngufor. Transfer learning for melanoma detection: Participation in ISIC 2017 skin lesion classification challenge. CoRR, abs/1703.05235, 2017. URL http://arxiv.org/abs/1703.05235.
[22] Zabir Al Nazi and Tasnim Azad Abir. Automatic skin lesion segmentation and melanoma detection: Transfer learning approach with u-net and dcnn-svm. In Mohammad Shorif Uddin and Jagdish Chand Bansal, editors, Proceedings of International Joint Conference on Computational Intelligence, pages 371–381, Singapore, 2020. Springer Singapore. ISBN 978-981-13-7564-4.
[23] Adam Paszke, Sam Gross, Francisco Massa, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[24] Michael Phillips, Jack Greenhalgh, Helen Marsden, and Ioulios Palamaras. Detection of Malignant Melanoma Using Artificial Intelligence: An Observational Study of Diagnostic Accuracy. Dermatology practical & conceptual, 10(1):e2020011–e2020011, dec 2019. ISSN 2160-9381. doi: 10.5826/dpc.1001a11.
[25] Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning for medical imaging. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3347–3357. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8596-transfusion-understanding-transfer-learning-for-medical-imaging.pdf.
[26] A. Romero Lopez, X. Giro-i-Nieto, J. Burdick, and O. Marques. Skin lesion classification from dermoscopic images using deep learning techniques. In [...], pages 49–54, 2017.
[27] Veronica Rotemberg, Nicholas Kurtansky, Brigid Betz-Stablein, Liam Caffery, Emmanouil Chousakos, Noel Codella, Marc Combalia, Stephen Dusza, Pascale Guitera, David Gutman, Allan Halpern, Harald Kittler, Kivanc Kose, Steve Langer, Konstantinos Lioprys, Josep Malvehy, Shenara Musthaq, Jabpani Nanda, Ofer Reiter, George Shih, Alexander Stratigos, Philipp Tschandl, Jochen Weber, and H. Peter Soyer. A patient-centric dataset of images and metadata for identifying melanomas using clinical context, 2020.
[28] Eiko Saito and Megumi Hori. Melanoma skin cancer incidence rates in the world from the Cancer Incidence in Five Continents XI. Japanese Journal of Clinical Oncology, 48(12):1113–1114, 11 2018. ISSN 1465-3621. doi: 10.1093/jjco/hyy162. URL https://doi.org/10.1093/jjco/hyy162.
[29] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. 2019.
[30] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv 1409.1556, 09 2014.
[31] Pascal Sturmfels, Scott Lundberg, and Su-In Lee. Visualizing the impact of feature attribution baselines. Distill, 5(1):e22, 2020. URL https://distill.pub/2020/attribution-baselines/.
[32] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 3319–3328. JMLR.org, 2017.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. 2014.
[34] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu. A survey on deep transfer learning. Artificial Neural Networks and Machine Learning - ICANN 2018, Lecture Notes in Computer Science, vol 11141:270–279, 2018. URL https://doi.org/10.1007/978-3-030-01424-7_27.
[35] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile, 2019.
[36] J.A. Wolf, J.F. Moreau, O. Akilov, T. Patton, J.C. English 3rd, J. Ho, and L. K. Ferris. Diagnostic inaccuracy of smartphone applications for melanoma detection. 2013. doi: 10.1001/jamadermatol.2013.2382.
[37] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks, 2017.
[38] L. Yu, H. Chen, Q. Dou, J. Qin, and P. Heng. Automated melanoma recognition in dermoscopy images via very deep residual networks. IEEE Transactions on Medical Imaging, 36(4):994–1004, 2017.
[39] Blazej Zbytek, J Andrew Carlson, Jacqueline Granese, Jeffrey Ross, Martin C Mihm Jr, and Andrzej Slominski. Current concepts of metastasis in melanoma. Expert review of dermatology, 3(5):569–585, oct 2008. ISSN 1746-9872. doi: 10.1586/17469872.3.5.569.
[40] Hasib Zunair and A. Hamza. Melanoma detection using adversarial training and deep transfer learning. 04 2020.
A Transfer Learning Architectures
VGG
VGG [30] is an advancement of a previous deep neural network, AlexNet [15]. The model uses small receptive fields of 3x3 with five max-pooling layers. In our paper, we use VGG16.
GoogleNet
GoogleNet [33] was developed in 2014 to address the problem of overfitting by introducing the Inception Module, which uses filters of multiple sizes. Three filter sizes (1x1, 3x3 and 5x5) are used simultaneously, with the 1x1 convolution used to reduce the dimensions of the model. The GoogleNet architecture consists of 9 Inception Modules, with each module connected to an average pooling layer.
ResNet
ResNet [6], short for “Residual Network”, is a deep learning model developed in 2015 that won the ImageNet competition [4]. In our research, we use ResNet50, a variant of the ResNet model. The model consists of 48 convolutional layers, 1 max pooling layer and 1 average pooling layer. ResNet addresses the vanishing/exploding gradient problem by leveraging skip connections for identity mapping, simplifying the network.
SqueezeNet
SqueezeNet [10] uses fewer parameters while preserving similar performance to AlexNet [15]. There are several architectural features worth noting: the use of 1x1 convolution filters, a decreased number of input channels, and down-sampling later in the network.
DenseNet
DenseNet [9] is similar to the architecture of ResNet but with “DenseBlocks”. Each DenseBlock consists of a convolution layer, pooling layer, batch normalisation, and non-linear activation layer.
ResNeXt
Built on the Residual Network and VGG, ResNeXt [37] uses a similar split-transform-merge strategy with an additional cardinality dimension (the size of the set of transformations). The model borrows the repeating-layers strategy from VGG and ResNet and, according to its authors, performs better than ResNet [6] with only 50% of the complexity.
MobileNet
MobileNet [29] was developed for devices with limited computational power, such as smartphones. Unlike larger deep learning networks such as VGG, MobileNet uses depthwise separable convolution, which convolves each input channel separately and then combines the results with a pointwise convolution. In this way, low-latency models can be developed that are applicable to mobile devices.
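The saving from depthwise separable convolution can be seen by counting weights. The sketch below compares a standard convolution with its depthwise separable counterpart; the layer sizes (64 input channels, 128 output channels, 3x3 kernels) are arbitrary values chosen for illustration, biases are ignored, and the function names are ours.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1x1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes the channels
    return depthwise + pointwise


standard = standard_conv_params(64, 128, 3)          # 9 * 64 * 128 = 73728
separable = depthwise_separable_params(64, 128, 3)   # 576 + 8192 = 8768
```

For this layer the separable version needs roughly 8 times fewer weights, which is the kind of reduction that makes such networks attractive for mobile devices.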
ShuffleNet
ShuffleNet [18] was also designed for mobile devices with limited computational power. The model uses 1x1 convolution and channel shuffle, designed specifically for small networks. ShuffleNet is computationally efficient, achieving accuracy similar to AlexNet [15] while being thirteen times faster.
MnasNet
MnasNet [35] is another mobile network designed for efficient performance using a multi-objective neural architecture search approach that considers accuracy and latency. The network also uses a hierarchical search space, achieving speeds faster than MobileNet [29].
EfficientNet
EfficientNet [17] is a recent mobile network developed in 2019, which applies a compound coefficient for improved accuracy. Rather than scaling up the CNN model by an arbitrary amount, the authors use a grid search to find the relationship between the scaling dimensions, starting from a baseline found with the AutoML neural architecture search.
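The compound coefficient can be illustrated with a few lines of arithmetic. In the sketch below, network depth, width and input resolution are scaled by α^φ, β^φ and γ^φ for a single coefficient φ. The constants α = 1.2, β = 1.1 and γ = 1.15 are the grid-search values reported in the EfficientNet paper [17], not values from our experiments; the base resolution of 224 and the function name `compound_scale` are assumptions for the example, and the released model variants do not necessarily use exactly these rounded numbers.

```python
# Constants reported in the EfficientNet paper [17]; chosen there so that
# ALPHA * BETA**2 * GAMMA**2 is approximately 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    """Scale depth, width and resolution together with one coefficient phi."""
    depth = base_depth * ALPHA ** phi
    width = base_width * BETA ** phi
    resolution = int(round(base_resolution * GAMMA ** phi))
    return depth, width, resolution


baseline_scale = compound_scale(0)  # (1.0, 1.0, 224): the unscaled network
scaled_once = compound_scale(1)     # depth x1.2, width x1.1, resolution 258
```

Because the computational cost of a CNN grows roughly with depth × width² × resolution², each unit increase in φ approximately doubles the cost under these constants.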
B Hyperparameter Optimisation
For the standard CNN model, values in a range were tested for dropout, batch size, kernel size, learning rate, number of layers, pool size, and number of convolution filters using random search, which is a more efficient method of hyperparameter optimisation than grid or manual search [11]. The search ranges and final values are shown in Table 2.

Hyperparameter         Value      Lower    Upper    Scale
Dropout                0.4        0.0      0.5      Linear
Batch Size             8          4        512      log
Kernel Size            4          2        5        Linear
Learning Rate          0.00977    0.001    0.01     log
Number of Layers       5          5        10       Linear
Pool Size              3          3        4        Linear
Convolution Filters    11         6        12       Linear
Table 2: Hyperparameter search ranges and final values.
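A random search over the ranges in Table 2 can be sketched as follows. The ranges and scales are taken from the table; everything else is an assumption made for illustration, including the dictionary layout, the function names, the rounding of integer-valued hyperparameters and the number of sampled configurations. For parameters on a log scale, the sample is drawn uniformly in log space and then exponentiated.

```python
import math
import random

# (lower, upper, scale, type) for each hyperparameter, following Table 2.
SEARCH_SPACE = {
    "dropout": (0.0, 0.5, "linear", float),
    "batch_size": (4, 512, "log", int),
    "kernel_size": (2, 5, "linear", int),
    "learning_rate": (0.001, 0.01, "log", float),
    "n_layers": (5, 10, "linear", int),
    "pool_size": (3, 4, "linear", int),
    "conv_filters": (6, 12, "linear", int),
}


def sample_value(lower, upper, scale, kind, rng):
    """Draw one value uniformly on a linear or logarithmic scale."""
    if scale == "log":
        value = math.exp(rng.uniform(math.log(lower), math.log(upper)))
    else:
        value = rng.uniform(lower, upper)
    if kind is int:
        value = int(round(value))
    return min(max(value, lower), upper)  # guard against rounding past the bounds


def sample_config(rng):
    """Draw one complete hyperparameter configuration."""
    return {name: sample_value(*spec, rng) for name, spec in SEARCH_SPACE.items()}


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(20)]
```

Each sampled configuration would then be used to train and validate one model, and the configuration with the best validation performance retained.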
C Implementation
All deep learning methods were implemented in PyTorch [23] and were optimised using Adam [13]. The models were trained using Tesla P100 GPUs. Each model was trained over 10 independent training runs with early stopping [16] for a maximum of 15 epochs (the patience constant was set to 3). We used step decay in the learning rate (the decay was set to 0.4 with a learning patience of 1).
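The interaction between early stopping and learning rate decay can be sketched on a list of validation losses. This is one possible interpretation rather than our training code: we assume that the learning rate is multiplied by 0.4 after every epoch without improvement (learning patience of 1) and that training stops after 3 consecutive epochs without improvement. The function name `train_schedule`, the initial learning rate and the loss values are invented for illustration.

```python
def train_schedule(val_losses, lr=0.01, max_epochs=15, stop_patience=3,
                   lr_patience=1, decay=0.4):
    """Replay validation losses and return the (epoch, lr) history and best loss."""
    best = float("inf")
    epochs_since_best = 0
    history = []
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        history.append((epoch, lr))  # learning rate used during this epoch
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= stop_patience:
                break  # early stopping
            if epochs_since_best >= lr_patience:
                lr *= decay  # decay the learning rate for the next epoch
    return history, best


# Toy validation losses, invented for illustration.
toy_losses = [0.50, 0.40, 0.42, 0.38, 0.39, 0.40, 0.41, 0.30]
history, best_loss = train_schedule(toy_losses)
```

On these toy losses, training stops after epoch 7 (three epochs without improving on 0.38), so the final value of 0.30 is never reached; the learning rate has by then decayed from 0.01 to 0.00064.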
D Additional Tables and Figures
Location           Normal    Melanoma
Torso              17106     257
Lower extremity    8293      124
Upper extremity    4872      111
Head/neck          1781      74
Palms/soles        370       5
Oral/genital       120       4