Deep Transfer Learning for Automated Diagnosis of Skin Lesions from Photographs



Doyoon Kim∗, Cleveland High School, California, US

Emma Rocheteau†∗, Department of Computer Science and Technology, University of Cambridge, UK

Abstract

Melanoma is not the most common form of skin cancer, but it is the most deadly. Currently, the disease is diagnosed by expert dermatologists, which is costly and requires timely access to medical treatment. Recent advances in deep learning have the potential to improve diagnostic performance, expedite urgent referrals and reduce burden on clinicians. Through smartphones, the technology could reach people who would not normally have access to such healthcare services, e.g. in remote parts of the world, due to financial constraints or in 2020, COVID-19 cancellations. To this end, we have investigated various transfer learning approaches by leveraging model parameters pre-trained on ImageNet with finetuning on melanoma detection. We compare EfficientNet, MnasNet, MobileNet, DenseNet, SqueezeNet, ShuffleNet, GoogleNet, ResNet, ResNeXt, VGG and a simple CNN with and without transfer learning. We find the mobile network, EfficientNet (with transfer learning) achieves the best mean performance with an area under the receiver operating characteristic curve (AUROC) of 0.931 ± […]

Introduction

Melanoma is the most common cause of skin cancer related deaths worldwide [28]. In the United States alone, it is estimated that there will be 100,350 cases and 6,850 melanoma-related deaths in 2020 [2]. Initially, it develops in melanocytes where genetic mutations lead to unregulated growth and the ability to metastasise to other areas of the body [39]. Like many cancers, early detection is key to successful treatment. If melanoma is detected before spreading to the lymph nodes, the average five-year survival rate is 98%. However, this drops to 64% if it has spread to regional lymph nodes, and 23% if it has reached distant organs.

Currently, melanoma is diagnosed by professional medical examination [14]. A meta-analysis conducted by Phillips et al. [24] showed that when distinguishing between melanoma and benign skin lesions, primary care physicians (10 studies) achieve an area under the receiver operating characteristic curve (AUROC) of 0.83 ± […]

∗ Equal contribution, † Corresponding author
Machine Learning for Mobile Health Workshop at NeurIPS 2020, Vancouver, Canada.

[…] clinicians, and when they are rigorously evaluated with significance testing and model interpretability, useful tools can be produced to support health in the community.

In this work, we investigate transfer learning with various Convolutional Neural Networks (CNNs) on the binary task of classifying skin lesions as melanoma or benign. In addition, we perform post-hoc visualisation of the feature attributions using integrated gradients [32].

Recent work by Raghu et al. [25] cast doubt on the usefulness of transfer learning (TL) for medical imaging. However, a few TL works have achieved success on the problem of melanoma detection [8, 21, 22, 26, 38, 40] (although they do not necessarily compare the model with and without TL). We did not find an extensive survey on existing TL models such as ours. The highest AUROC for melanoma detection that we found on photographs was 0.880 in Bisla et al. [3], which is still lower than the performance for professional dermatologists found in Phillips et al. [24].

Our task is to distinguish between benign nevi and malignant melanomas. For each patient we have a dermatoscopic photograph in RGB format, x (a real-valued array indexed by height, width and colour channel), static features, s (a real-valued vector encoding age, gender and location on the body), and the binary label y. Figure 1 shows the basic architecture of all models. Our code is publicly available at https://github.com/aimadeus/Transfer_learning_melanoma.

Figure 1: Model architecture. The CNN component (indicated in brackets) is different in each experiment. The static data is processed separately and concatenated to the CNN output before a final prediction is made.

Transfer Learning (TL) is a machine learning method where the weights of a trained model are used to initialise another model on a different task [34]. In our case, we investigate several CNN architectures using pre-training on ImageNet [4] (a database containing over 14 million images). The last fully connected layer is replaced with one that has a binary output, and whose weights are initialised using Kaiming initialisation [7]. Further descriptions of the various CNN architectures are provided in Appendix A. We also train a standard 5-layer CNN with no transfer learning (hyperparameter optimisation and further implementation details are provided in Appendix B and C respectively).
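As an illustration of the head replacement described above, the following is a minimal pure-Python sketch of Kaiming (He) normal initialisation [7] for a new binary output layer. The feature size of 1280, the zero bias and the choice of the normal (rather than uniform) variant are assumptions made for the example and are not stated in this paper; in PyTorch the equivalent library call would be `torch.nn.init.kaiming_normal_`.

```python
import math
import random


def kaiming_normal(fan_in, fan_out, seed=0):
    """Return a fan_out x fan_in weight matrix drawn from N(0, 2 / fan_in)."""
    rng = random.Random(seed)
    # He et al. [7]: a variance of 2 / fan_in keeps activations stable under ReLU.
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


# Hypothetical sizes: a 1280-dimensional CNN feature vector feeding one output logit.
head_weights = kaiming_normal(fan_in=1280, fan_out=1)
head_bias = [0.0]  # assumption: bias starts at zero
```

Only the new head is initialised this way; all other layers would keep their ImageNet pre-trained weights before finetuning.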

We use the International Skin Imaging Collaboration (ISIC) 2020 dataset [27] (released August 2020), containing labelled photographs taken from various locations on the body (see Table 3 in the Appendix). We noted a significant class imbalance with only 2% of the data containing melanoma. To improve this ratio, we added a second dataset with additional malignant cases [1], which brought the total to 37,648 skin lesion images. The data were split such that 60%, 20% and 20% were used for training, validation and testing respectively. Data augmentation was performed on the training data to introduce small variations in the form of random rotations, horizontal and vertical flipping, resizing, brightness, and saturation shifts. This means the training data is subtly altered each time it is presented to the model. Figure 2 shows two examples of raw and augmented images respectively.
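The 60/20/20 split and the flipping part of the augmentation can be sketched as follows. This is an illustrative sketch only: the function names, the fixed seed, the rounding of split sizes and the 50% flip probability are assumptions, and the paper does not say whether the split was stratified by class or patient.

```python
import random


def split_indices(n_items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle item indices and cut them into train/validation/test partitions."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n_items)
    n_val = int(val_frac * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


def random_flip(image, rng):
    """Randomly flip a 2D image (a list of rows) horizontally and/or vertically."""
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]  # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1]  # vertical flip
    return image


train_idx, val_idx, test_idx = split_indices(37648)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because `random_flip` draws fresh random numbers on every call, the same training image can appear in a different orientation each epoch, which is the sense in which the data is subtly altered each time it is presented.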

Figure 2: Example photographs in the training data. In each pair of images, the raw data is shown on the left and an augmented image example is shown on the right.

Table 1 shows the test performance of all the models ((a) without and (b) with transfer learning). Eight of the ten CNN models performed significantly better with TL across all 4 metrics, and none of the models are significantly harmed by TL on any metric, demonstrating a clear benefit of transfer learning for melanoma detection. The best performing models are EfficientNet [17] and MnasNet [35], which significantly outperform dermatologists [24] on AUROC. From the ROC curves shown in Figure 3, we see that EfficientNet can achieve a true positive rate of 0.95 at a false positive rate of only 0.1.
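For readers less familiar with the metric, the sketch below shows how an ROC curve and its AUROC can be computed from predicted scores. It is a generic illustration, not the evaluation code used in this paper, and the six labels and scores are made-up toy values.

```python
def roc_curve(labels, scores):
    """Return (fpr, tpr) lists obtained by sweeping the threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example that shares this score so ties form one ROC point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auroc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))


# Toy data (hypothetical): 1 = melanoma, 0 = benign.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_fpr, toy_tpr = roc_curve(toy_labels, toy_scores)
toy_auroc = auroc(toy_fpr, toy_tpr)
```

Each (false positive rate, true positive rate) pair is one operating point, so a statement such as "a true positive rate of 0.95 at a false positive rate of 0.1" corresponds to reading off a single point on this curve.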

Table 1: Performance of the models averaged over 10 independent training runs. Tables (a) and (b) show the performance without and with transfer learning respectively. The error margins are 95% confidence intervals (CIs). We report the accuracy, area under the receiver operating characteristic curve (AUROC), area under the precision recall curve (AUPRC) and the F1 Score. Within each table, the results are ordered from worst to best performance. In table (b), if the result is statistically better than the model without transfer learning in a one-tailed t-test (p < […] ∗ and p < […] ∗∗), then it is indicated with stars. Results that significantly outperform general practitioners and dermatologists on AUROC (determined by a recent meta-analysis† [24]) are indicated in green and blue respectively (p < […]).

(a) Without transfer learning

Model                    Accuracy          AUROC             AUPRC      F1 Score
Standard CNN             0.914 ± […]       […]               […]        […]
General Practitioners†   -                 0.83 ± […]        […]        […]
[…]                      […]               […]               […]        […]
Dermatologists†          -                 […]               -          -

(b) With transfer learning

Model                    Accuracy          AUROC             AUPRC      F1 Score
General Practitioners†   -                 0.83 ± […]        […]        […]
[…]                      […] ∗∗            […] ∗∗            […] ∗∗     […] ∗∗
ResNet-50 [6]            0.962 ± […]       […]               […]        […]
[…]                      […]               […]               […]        […]
[…]                      […] ∗∗            […] ∗∗            […] ∗∗     […] ∗∗
DenseNet [9]             0.966 ± […] ∗∗    […] ∗∗            […] ∗∗     […] ∗∗
Dermatologists†          -                 […]               -          -
MobileNet [29]           0.969 ± […] ∗∗    […] ∗∗            […] ∗∗     […] ∗∗
ResNeXt [37]             0.971 ± […] ∗∗    […] ∗∗            […] ∗∗     […] ∗∗
GoogleNet [33]           0.973 ± […] ∗∗    […] ∗∗            […] ∗∗     […] ∗∗
MnasNet [35]             0.974 ± […] ∗     […] ∗∗            […] ∗∗     […] ∗∗
EfficientNet [17]        0.975 ± […] ∗∗    0.931 ± […] ∗∗    […] ∗∗     […] ∗∗

Figure 3: ROC curves of TL models and Standard CNN (we magnify the top left part of the curves in the right plot).

Figure 4: A test set image and corresponding integrated gradient attributions for the standard CNN model.

We used the integrated gradients method [32] to calculate feature attributions. This method computes the importance scores $\phi_i^{IG}$ by accumulating gradients interpolated between a baseline input $b$ (intended to represent the absence of data; in our case this is a black image) and the current input $x$:

$$\phi_i^{IG}(\psi, x, b) = \overbrace{(x_i - b_i)}^{\text{diff. from baseline}} \times \int_{\alpha=0}^{1} \overbrace{\frac{\delta \psi(b + \alpha(x - b))}{\delta x_i}}^{\text{acc. local grad.}} \, d\alpha \qquad (1)$$

The CNN model is represented as $\psi$. We observed that the models tend to focus primarily on the edges of the skin lesions (Figure 4). This aligns with our expectation, since uneven or notched edges are common in melanoma [19]. Secondary to the edges, some importance is assigned to the lesion itself and the surrounding skin. This is significant because melanomas can also show uneven texture or colour [19].

We have conducted an extensive investigation of transfer learning for the task of melanoma detection from photographs. We have demonstrated the benefit of transfer learning with ImageNet pre-training [4] for melanoma detection on the ISIC 2020 dataset [27]. Furthermore, we show that the best performing neural networks are EfficientNet and MnasNet, which are capable of outperforming dermatologists when distinguishing melanoma from benign skin lesions. In particular, we note that these networks have been specifically designed for mobile devices [17, 35]. This may be important when it comes to data privacy and medical data regulations (as the classification can be performed locally on the user's personal device).

In future work, we aim to extend the binary classification task to classify other skin lesions such as benign keratosis, basal cell carcinoma, actinic keratosis, vascular lesions and dermatofibroma. Secondly, we would like to extend our interpretability study such that we can visualise the learnt features in the intermediate layers of the models. To do this we can leverage the approach of Mordvintsev et al. [20] whereby we obtain inputs designed to maximise the activation of hidden layers of the network. This will provide further insights as to why certain models outperform others in melanoma detection. Finally, we can validate the diagnostic technology in the community with an implementation study.
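The integrated gradients attribution of Equation 1 can be approximated numerically by replacing the integral with a sum over interpolation steps. The sketch below does this for a toy logistic model whose gradient is known in closed form; the model, its weights, the input values and the midpoint rule with 200 steps are all assumptions chosen for illustration, whereas in the paper $\psi$ is a CNN and the gradient would come from automatic differentiation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    """Toy stand-in for psi: a logistic model over the input features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def model_grad(x, w):
    """Analytic gradient of the toy model with respect to each input feature."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint-rule approximation of Equation 1 for every feature i."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_grad(point, w)
        total = [t + g for t, g in zip(total, grad)]
    # (x_i - b_i) times the average gradient along the straight-line path.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


weights = [1.5, -2.0, 0.5]   # hypothetical model parameters
pixel = [0.8, 0.1, 0.6]      # hypothetical input
black = [0.0, 0.0, 0.0]      # black-image baseline, as in the paper
attributions = integrated_gradients(pixel, black, weights)
```

A useful sanity check is the completeness property of integrated gradients [32]: the attributions should sum to approximately `model(pixel, weights) - model(black, weights)`.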

The automated diagnosis technology could be used to screen, triage, refer and follow up patients in the community. It also has potential to reach patients who would not normally have access to dermatologists e.g. in remote areas or the developing world. The high AUROC of EfficientNet (high true positive rate coinciding with a low false positive rate) would make it well-suited to this purpose. Such a system could significantly reduce the cost and resources needed to screen and treat as it reduces the pool of patients needing to see the dermatologist.

The background and intuition behind the integrated gradients method are explained clearly in Sturmfels et al. [31].

Acknowledgements

The authors would like to thank Horizon Academic for facilitating this research.

References

[1] melanoma external malignant 256, 2020.
[2] American Cancer Society. Key statistics for melanoma skin cancer, 2020.
[3] Devansh Bisla, Anna Choromanska, Jennifer A. Stein, David Polsky, and Russell S. Berman. Skin lesion segmentation and classification with deep learning system. CoRR, abs/1902.06061, 2019. URL http://arxiv.org/abs/1902.06061.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
[5] Y. Fujisawa, S. Inoue, and Y. Nakamura. The Possibility of Deep Learning-Based, Computer-Aided Skin Tumor Classifiers. Front. Med., 2019.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv 1512.03385, 12 2015.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015. URL http://arxiv.org/abs/1502.01852.
[8] Khalid M. Hosny, Mohamed A. Kassem, and Mohamed M. Foaud. Classification of skin lesions using transfer learning and augmentation with alex-net. PLOS ONE, 14(5):1–17, 05 2019. doi: 10.1371/journal.pone.0217293. URL https://doi.org/10.1371/journal.pone.0217293.
[9] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks, 2018.
[10] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size. arXiv:1602.07360, 02 2016.
[11] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305, 02 2012. ISSN 1532-4435.
[12] A.P. Kassianos, J.D. Emery, P. Murchie, and F.M. Walter. Smartphone applications for melanoma detection by community, patient and generalist clinician users: a review. British Journal of Dermatology, 05 2015. doi: 10.1111/bjd.13665.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[14] H. Kittler, H. Pehamberger, K. Wolff, and M. Binder. Diagnostic accuracy of dermoscopy. Database of Abstracts of Reviews of Effects (DARE): Quality-assessed Reviews [Internet], 2002.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[16] Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks, 2019.
[17] M. Tan and Q. V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv:1905.11946, 05 2019.
[18] N. Ma, X. Zhang, H. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. 2018.
[19] Collette McCourt, Olivia Dolan, and Gerry Gormley. Malignant melanoma: a pictorial review. The Ulster medical journal, 83(2):103–110, may 2014. ISSN 2046-4207.
[20] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going deeper into neural networks, 2015. URL https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html.
[21] Dennis H. Murphree and Che Ngufor. Transfer learning for melanoma detection: Participation in ISIC 2017 skin lesion classification challenge. CoRR, abs/1703.05235, 2017. URL http://arxiv.org/abs/1703.05235.
[22] Zabir Al Nazi and Tasnim Azad Abir. Automatic skin lesion segmentation and melanoma detection: Transfer learning approach with u-net and dcnn-svm. In Mohammad Shorif Uddin and Jagdish Chand Bansal, editors, Proceedings of International Joint Conference on Computational Intelligence, pages 371–381, Singapore, 2020. Springer Singapore. ISBN 978-981-13-7564-4.
[23] Adam Paszke, Sam Gross, Francisco Massa, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[24] Michael Phillips, Jack Greenhalgh, Helen Marsden, and Ioulios Palamaras. Detection of Malignant Melanoma Using Artificial Intelligence: An Observational Study of Diagnostic Accuracy. Dermatology practical & conceptual, 10(1):e2020011–e2020011, dec 2019. ISSN 2160-9381. doi: 10.5826/dpc.1001a11.
[25] Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning for medical imaging. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3347–3357. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8596-transfusion-understanding-transfer-learning-for-medical-imaging.pdf.
[26] A. Romero Lopez, X. Giro-i-Nieto, J. Burdick, and O. Marques. Skin lesion classification from dermoscopic images using deep learning techniques. In […], pages 49–54, 2017.
[27] Veronica Rotemberg, Nicholas Kurtansky, Brigid Betz-Stablein, Liam Caffery, Emmanouil Chousakos, Noel Codella, Marc Combalia, Stephen Dusza, Pascale Guitera, David Gutman, Allan Halpern, Harald Kittler, Kivanc Kose, Steve Langer, Konstantinos Lioprys, Josep Malvehy, Shenara Musthaq, Jabpani Nanda, Ofer Reiter, George Shih, Alexander Stratigos, Philipp Tschandl, Jochen Weber, and H. Peter Soyer. A patient-centric dataset of images and metadata for identifying melanomas using clinical context, 2020.
[28] Eiko Saito and Megumi Hori. Melanoma skin cancer incidence rates in the world from the Cancer Incidence in Five Continents XI. Japanese Journal of Clinical Oncology, 48(12):1113–1114, 11 2018. ISSN 1465-3621. doi: 10.1093/jjco/hyy162. URL https://doi.org/10.1093/jjco/hyy162.
[29] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. 2019.
[30] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv 1409.1556, 09 2014.
[31] Pascal Sturmfels, Scott Lundberg, and Su-In Lee. Visualizing the impact of feature attribution baselines. Distill, 5(1):e22, 2020. URL https://distill.pub/2020/attribution-baselines/.
[32] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, page 3319–3328. JMLR.org, 2017.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. 2014.
[34] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu. A survey on deep transfer learning. Artificial Neural Networks and Machine Learning - ICANN 2018, Lecture Notes in Computer Science, vol 11141:270–279, 2018. URL https://doi.org/10.1007/978-3-030-01424-7_27.
[35] M. Tan, B. Chen, R. Pang, V. Vasudevan, Mark S., A. Howard, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile, 2019.
[36] J.A. Wolf, J.F. Moreau, O. Akilov, T. Patton, J.C. English 3rd, J. Ho, and L. K. Ferris. Diagnostic inaccuracy of smartphone applications for melanoma detection. 2013. doi: 10.1001/jamadermatol.2013.2382.
[37] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks, 2017.
[38] L. Yu, H. Chen, Q. Dou, J. Qin, and P. Heng. Automated melanoma recognition in dermoscopy images via very deep residual networks. IEEE Transactions on Medical Imaging, 36(4):994–1004, 2017.
[39] Blazej Zbytek, J Andrew Carlson, Jacqueline Granese, Jeffrey Ross, Martin C Mihm Jr, and Andrzej Slominski. Current concepts of metastasis in melanoma. Expert review of dermatology, 3(5):569–585, oct 2008. ISSN 1746-9872. doi: 10.1586/17469872.3.5.569.
[40] Hasib Zunair and A. Hamza. Melanoma detection using adversarial training and deep transfer learning. 04 2020.

A Transfer Learning Architectures

VGG

VGG [30] is an advancement of a previous deep neural network, AlexNet [15]. The model uses small receptive fields of 3x3 with five max-pooling layers. In our paper, we use VGG16.

GoogleNet

GoogleNet [33] was developed in 2014 to address the problem of overfitting by building an Inception Module, which applies filters of multiple sizes in parallel. Three filter sizes of 1x1, 3x3, and 5x5 are used simultaneously, whereby the 1x1 convolution is used to reduce the channel dimension before the larger filters. The GoogleNet architecture stacks 9 Inception Modules and ends with an average pooling layer before the classifier.

ResNet

ResNet [6], short for “Residual Network”, is a deep learning model developed in 2015 that won the ImageNet competition [4] that year. In our research, we use ResNet50, a variant of the ResNet model. The model consists of 48 convolutional layers, 1 max pooling layer and 1 average pooling layer. ResNet addresses the vanishing and exploding gradient problem by using skip connections that implement identity mappings, which makes very deep networks easier to optimise.

SqueezeNet

SqueezeNet [10] uses fewer parameters while preserving similar performance to AlexNet [15]. There are several architectural features worth noting: the use of 1x1 convolution filters, a decreased number of input channels, and down-sampling later in the network.

DenseNet

DenseNet [9] is similar to the architecture of ResNet but with “DenseBlocks”, in which each layer receives the concatenated feature maps of all preceding layers. Each DenseBlock consists of a convolution layer, pooling layer, batch normalisation, and non-linear activation layer.

ResNeXt

Built on the Residual Network and VGG, ResNeXt [37] uses a similar split-transform-merge strategy with an additional cardinality dimension (the size of the set of transformations). The model borrows the repeating layers strategy from VGG and ResNet and, according to the researchers, has better performance than ResNet [6] at only 50% of the complexity.

MobileNet

MobileNet [29] was developed for devices with less computational power, such as smartphones. Unlike bigger deep learning networks such as VGG, MobileNet uses depthwise separable convolution, which first convolves each input channel separately (depthwise) and then combines the channels with a pointwise (1x1) convolution. This way low latency models can be developed, which are applicable to mobile devices.
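To make the saving from depthwise separable convolution concrete, the sketch below counts the weights of a single layer under both designs. The kernel size and channel counts are hypothetical values chosen for illustration, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a pointwise (1x1) convolution."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernels, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
ratio = separable / standard
```

For this layer the separable version needs 2,336 weights instead of 18,432, a ratio of 1/64 + 1/9 (about 0.13), which is the kind of reduction that makes low latency inference on a phone feasible.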

ShuffleNet

ShuffleNet [18] was also designed for mobile devices with limited computational power. The model uses 1x1 convolution and channel shuffle, designed specifically for small networks. ShuffleNet is computationally efficient, achieving an accuracy similar to AlexNet [15] while running thirteen times faster.
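The channel shuffle operation mentioned above can be illustrated on a plain list of channel indices. This is a sketch of the general idea (reshape into groups, transpose, flatten), with the function name and the six-channel example being assumptions for illustration.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so that each output group mixes channels from every input group."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # Equivalent to reshaping to (groups, per_group), transposing, and flattening.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
print(shuffled)  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, a subsequent grouped convolution sees channels from both original groups, so information can flow between groups without the cost of a full convolution.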

MnasNet

MnasNet [35] is another mobile network designed for efficient performance using a multi-objective neural architecture search approach that considers accuracy and latency. The network also uses a hierarchical search space, achieving speeds faster than MobileNet [29].

EfficientNet

EfficientNet [17] is a recent mobile network developed in 2019, which applies a compound coefficient to scale the network for improved accuracy. Rather than scaling the depth, width and input resolution of the CNN by arbitrary amounts, the authors use a small grid search to find fixed scaling ratios between them, and apply these ratios to a baseline network obtained with AutoML neural architecture search.
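Compound scaling can be sketched in a few lines. The constants below (1.2 for depth, 1.1 for width and 1.15 for resolution) are the values reported in the EfficientNet paper [17], not results of this work, and the function names are assumptions for illustration.

```python
# Scaling constants reported by Tan and Le [17]; they were chosen by grid search
# so that ALPHA * BETA**2 * GAMMA**2 is approximately 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """Approximate growth in FLOPs: linear in depth, quadratic in width and resolution."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(phi, round(depth, 3), round(width, 3), round(resolution, 3),
          round(flops_multiplier(phi), 2))
```

With these ratios, each increment of the single coefficient `phi` roughly doubles the computational cost while keeping depth, width and resolution in balance.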

B Hyperparameter Optimisation

For the standard CNN model, values in a range were tested for dropout, batch size, kernel size, learning rate, number of layers, pool size, and number of convolution filters using random search, which is a more efficient method of hyperparameter optimisation than grid or manual search [11]. The search ranges and final values are shown in Table 2.

Hyperparameter        Value     Lower   Upper   Scale
Dropout               0.4       0.0     0.5     Linear
Batch Size            8         4       512     log
Kernel Size           4         2       5       Linear
Learning Rate         0.00977   0.001   0.01    log
Number of Layers      5         5       10      Linear
Pool Size             3         3       4       Linear
Convolution Filters   11        6       12      Linear

Table 2: Hyperparameter search ranges and ﬁnal values.
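A random search over the ranges in Table 2 could be sketched as below. The ranges and scales are taken from the table; the dictionary keys, the rounding of integer-valued hyperparameters and the clamping to the bounds are assumptions for illustration, and the paper does not state whether batch sizes were restricted to powers of two.

```python
import math
import random

# (lower, upper, scale, type) for each hyperparameter, following Table 2.
SEARCH_SPACE = {
    "dropout": (0.0, 0.5, "linear", float),
    "batch_size": (4, 512, "log", int),
    "kernel_size": (2, 5, "linear", int),
    "learning_rate": (0.001, 0.01, "log", float),
    "num_layers": (5, 10, "linear", int),
    "pool_size": (3, 4, "linear", int),
    "conv_filters": (6, 12, "linear", int),
}


def sample_value(lower, upper, scale, kind, rng):
    """Draw one value uniformly on a linear or logarithmic scale."""
    if scale == "log":
        value = math.exp(rng.uniform(math.log(lower), math.log(upper)))
    else:
        value = rng.uniform(lower, upper)
    if kind is int:
        value = round(value)  # simplification: round to the nearest integer
    return min(upper, max(lower, value))  # guard against floating-point overshoot


def sample_config(rng):
    """Sample one full hyperparameter configuration."""
    return {name: sample_value(*spec, rng) for name, spec in SEARCH_SPACE.items()}


rng = random.Random(0)
candidates = [sample_config(rng) for _ in range(5)]
```

Each candidate would then be trained and scored on the validation set, and the best-scoring configuration kept.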

C Implementation

All deep learning methods were implemented in PyTorch [23] and were optimised using Adam [13]. The models were trained using Tesla P100 GPUs. Each model was trained over 10 independent training runs with early stopping [16] for a maximum of 15 epochs (the patience constant was set to 3). We used step decay in the learning rate (the decay was set to 0.4 with a learning patience of 1).
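One way to read the training schedule above is sketched below. The exact semantics of the "learning patience" are not spelled out in the text, so this sketch assumes that the learning rate is multiplied by 0.4 once the validation loss has failed to improve for more than one consecutive epoch, and that training stops after three epochs without improvement. The function name, the initial learning rate and the loss values are hypothetical.

```python
def train_schedule(val_losses, initial_lr, max_epochs=15, stop_patience=3,
                   lr_decay=0.4, lr_patience=1):
    """Replay a sequence of validation losses and return (epoch, learning rate) pairs."""
    best = float("inf")
    bad_epochs_stop = 0  # epochs without improvement, for early stopping
    bad_epochs_lr = 0    # epochs without improvement since the last decay
    lr = initial_lr
    history = []
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            bad_epochs_stop = 0
            bad_epochs_lr = 0
        else:
            bad_epochs_stop += 1
            bad_epochs_lr += 1
        if bad_epochs_lr > lr_patience:
            lr *= lr_decay      # assumed decay rule, see lead-in
            bad_epochs_lr = 0
        history.append((epoch, lr))
        if bad_epochs_stop >= stop_patience:
            break               # early stopping
    return history


# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.67, 0.5]
schedule = train_schedule(losses, initial_lr=0.01)
```

In this example the learning rate is decayed once at epoch 5 and training stops at epoch 6, so the final loss value in the list is never reached.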

D Additional Tables and Figures

Location          Normal   Melanoma
Torso             17106    257
Lower extremity   8293     124
Upper extremity   4872     111
Head/neck         1781     74
Palms/soles       370      5
Oral/genital      120      4
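The class imbalance can be checked directly from the counts in the table above. The short script below only uses those counts; the variable names are assumptions for illustration.

```python
# (normal, melanoma) image counts per body location, copied from the table above.
COUNTS = {
    "Torso": (17106, 257),
    "Lower extremity": (8293, 124),
    "Upper extremity": (4872, 111),
    "Head/neck": (1781, 74),
    "Palms/soles": (370, 5),
    "Oral/genital": (120, 4),
}

total_normal = sum(normal for normal, _ in COUNTS.values())
total_melanoma = sum(melanoma for _, melanoma in COUNTS.values())
overall_fraction = total_melanoma / (total_normal + total_melanoma)

# Fraction of melanoma images within each location.
per_location = {
    location: melanoma / (normal + melanoma)
    for location, (normal, melanoma) in COUNTS.items()
}

print(total_normal, total_melanoma, round(100 * overall_fraction, 2))
```

The table lists 575 melanoma images out of 33,117 in total, i.e. about 1.7%, which is consistent with the roughly 2% figure quoted in the data description; the head/neck location has the highest melanoma fraction, at about 4%.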