Incorporating the Knowledge of Dermatologists to Convolutional Neural Networks for the Diagnosis of Skin Lesions
Iván González-Díaz
Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés, 28911, Spain
[email protected]
Abstract
This report describes our submission to the ISIC 2017 Challenge in Skin Lesion Analysis Towards Melanoma Detection. We have participated in Part 3: Lesion Classification with a system for automatic diagnosis of nevus, melanoma and seborrheic keratosis. Our approach aims to incorporate the expert knowledge of dermatologists into the well-known framework of Convolutional Neural Networks (CNNs), which have shown impressive performance in many visual recognition tasks. In particular, we have designed several networks providing lesion area identification, lesion segmentation into structural patterns and final diagnosis of clinical cases. Furthermore, novel blocks for CNNs have been designed to integrate this information into the diagnosis processing pipeline.

Figure 1: Main processing pipeline of our Automatic Diagnosis System
1 Introduction

The main pipeline of our system is depicted in Fig. 1. It comprises the following steps:

1. For each clinical case c, a dermoscopic image X_c feeds a Lesion Segmentation Network that generates a binary mask M_c outlining the area of the image which corresponds to the lesion. The description of this module is given in section 2.

2. Each clinical case c, which is now defined by the image-mask couple {X_c, M_c}, goes through the Data Augmentation Module. This module aims to extend the initial visual support of the lesion by generating new views v corresponding to different rotations and cropped areas. Hence, the output of this module is an extended set of images X̃_cv related to the lesion. Section 3 provides a detailed description of this data augmentation process.

3. The next step in the process is the Structure Segmentation Network. It aims to segment each view of the lesion X̃_cv into a set of eight global and local structures that have turned out to be very important for dermatologists in their daily diagnosis. Examples of these structures are dots/globules, regression areas, streaks, etc. Hence, the output of this system is a set of 8 segmentation maps S_cvs, s = 1...8, each one associated to a particular structure s of interest. This module is introduced in section 4.

4. Finally, the augmented set {X̃_cv, S_cvs} is passed to the Diagnosis Network, which is in charge of providing the final diagnosis Y_c for the clinical case. The description of this network can be found in section 5.

2 Lesion Segmentation Network

The Lesion Segmentation Network has been developed by learning a Fully Convolutional Network (FCN) [Shelhamer et al., 2016]. FCNs have achieved state-of-the-art results on the task of semantic image segmentation on general-content images, as demonstrated in the PASCAL VOC Segmentation challenge [Everingham et al., 2015].
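The four steps above can be summarized as a minimal sketch. All callables are hypothetical placeholders standing in for the trained networks (not the authors' actual API); only the data flow and the product fusion over views follow the text:

```python
# Sketch of the four-stage pipeline. The network callables are
# hypothetical placeholders, not the authors' released code.
import numpy as np

def diagnose_case(image, lesion_seg_net, augment, structure_seg_net, diagnosis_net):
    """Run one clinical case through the four stages and fuse the views."""
    mask = lesion_seg_net(image)                      # step 1: binary lesion mask M_c
    views = augment(image, mask)                      # step 2: rotated/cropped 256x256 views
    structs = [structure_seg_net(v) for v in views]   # step 3: 8 structure maps per view
    per_view = [diagnosis_net(v, s)                   # step 4: per-view class scores Y_cv
                for v, s in zip(views, structs)]
    fused = np.prod(per_view, axis=0)                 # independence across views
    return fused / fused.sum()                        # renormalize to a distribution
```

The final product over views corresponds to the independence assumption used later to fuse the per-view diagnoses into a single output per clinical case.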
In order to train a network for our particular task of lesion/skin segmentation, we have used the training set for the lesion segmentation task in the 2017 ISBI challenge. Let us note that the goal of this module is not to generate very accurate segmentation maps of a lesion, but to broadly identify the area of the image that corresponds to the lesion, giving place to a binary map M_c for each clinical case.

Figure 2: Example of a rotated and cropped view of a lesion and its Normalized Polar Coordinates. (Left) View of the lesion (Middle) Normalized Ratio (Right) Angle

3 Data Augmentation

It is well known that data augmentation notably boosts the performance of deep neural networks, mainly when the amount of training data is limited. Among all the potential image variations and artifacts, invariance to orientation is probably the main requirement of our method, as dermatologists do not follow a specific protocol during the capture of a lesion. Other more complex geometric transformations such as affine or projective transforms are less interesting here, as the dermatoscope is normally placed just over and orthogonally to the lesion surface. The particular process of data augmentation is described next:

1. First, starting from the pair {X_c, M_c}, we generate a set of rotated versions.

2. As rotating an image without losing any visual information requires incorporating new areas which were not present in the original view, we find and crop the largest inner rectangle ensuring that all pixels belong to the original image.

3. Finally, as our subsequent CNNs (Structure Segmentation and Diagnosis) require square input images of 256x256 pixels, we perform various square crops which are in turn re-sized to the required dimensions.

Considering the aforementioned rotations and crops, for each given clinical case c, we generate an augmented set of 24 images, represented by a tensor X̃_cv ∈ R^{256×256×3}, with v = 1...24.
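Step 2 above (cropping the largest inner rectangle of a rotated image) admits the standard closed-form solution sketched below. The function name is ours and this is not the authors' code, only an illustration of the geometry involved:

```python
import numpy as np

def largest_inner_rect(w, h, angle):
    """Width/height of the largest axis-aligned rectangle that fits
    entirely inside a w-by-h image rotated by `angle` radians, so the
    crop contains no invented border pixels.
    Standard closed-form solution; illustrative, not the authors' code."""
    sin_a, cos_a = abs(np.sin(angle)), abs(np.cos(angle))
    if sin_a < 1e-10:                       # no effective rotation: keep full image
        return float(w), float(h)
    side_long, side_short = max(w, h), min(w, h)
    if side_short <= 2.0 * sin_a * cos_a * side_long or abs(sin_a - cos_a) < 1e-10:
        # Half-constrained case: two opposite crop corners touch the
        # longer side of the rotated image.
        x = 0.5 * side_short
        wr, hr = (x / sin_a, x / cos_a) if w >= h else (x / cos_a, x / sin_a)
    else:
        # Fully-constrained case: every crop corner touches a side.
        cos_2a = cos_a * cos_a - sin_a * sin_a
        wr = (w * cos_a - h * sin_a) / cos_2a
        hr = (h * cos_a - w * sin_a) / cos_2a
    return wr, hr
```

Each rotated view would then be cropped to these dimensions around the image center, with square sub-crops re-sized to 256x256 for the subsequent networks.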
In addition, for each generated view X̃_cv, we compute the Normalized Polar Coordinates from the lesion mask. The goal of these new alternative coordinates is to support subsequent processing blocks by providing invariance against shifts, rotations, changes in size and even irregular shapes of the lesions. To do so, we transform pixel Cartesian coordinates (x_i, y_i) into normalized polar coordinates (ρ_i, θ_i), where ρ_i ∈ [0, 1] and θ_i ∈ [0, π) stand for the normalized ratio and angle, respectively. The process to compute this transformation is as follows: first, the mask of the lesion is approximated by an ellipse with the same second-order moments. Then, we learn the affine matrix that transforms the ellipse into a normalized (unit ratio) circle centered at (0, 0). Figure 2 shows an example of a rotated and cropped view of a lesion, and its corresponding normalized polar coordinates.

4 Structure Segmentation Network

The goal of this module is, given an input view of the lesion X̃_cv, to provide a corresponding segmentation into a pre-defined set of textural patterns and local structures that are of special interest for dermatologists in their diagnosis. In particular, we have considered a set of eight structures: 1) dots, globules and cobblestone pattern, 2) reticular patterns and pigmented networks, 3) homogeneous areas, 4) regression areas, 5) blue-white veil, 6) streaks, 7) vascular structures and 8) unspecific patterns.

The main challenge in developing this module is the generation of a strongly-labeled training dataset, in which each image has an associated ground-truth pixel-wise segmentation. This kind of annotation is often hard to obtain, as it requires a huge effort from the dermatologists to manually outline the segmentations. Alternatively, providing weak image-level labels indicating only which structural patterns are present in each lesion is much easier for dermatologists and therefore becomes more realistic.
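As a complement, the normalized polar transform of section 3 can be sketched with image moments in NumPy. The moment-based whitening below is our reading of the description (ellipse with the same second-order moments, mapped to a unit circle), not the authors' released code, and the function name is ours:

```python
import numpy as np

def normalized_polar(mask):
    """Normalized polar coordinates (rho, theta) for every pixel.

    The lesion mask is approximated by the ellipse sharing its
    second-order moments; a whitening affine maps that ellipse to a
    unit circle centered at the origin.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    mu = pts.mean(axis=0)                          # lesion centroid
    cov = np.cov(pts.T)                            # second-order central moments
    L_inv = np.linalg.inv(np.linalg.cholesky(cov))
    h, w = mask.shape
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    d = np.stack([gx - mu[0], gy - mu[1]], axis=-1)  # offsets from centroid
    white = d @ L_inv.T                            # whitened (circularized) offsets
    # For a uniformly filled ellipse the boundary sits at Mahalanobis
    # distance 2, so dividing by 2 puts the lesion border near rho = 1.
    rho = np.clip(np.linalg.norm(white, axis=-1) / 2.0, 0.0, 1.0)
    theta = np.mod(np.arctan2(white[..., 1], white[..., 0]), np.pi)  # folded to [0, pi)
    return rho, theta
```

By construction, the centroid maps to ρ ≈ 0 and the lesion border to ρ ≈ 1 regardless of the lesion's position, size or elongation, which is the invariance the subsequent blocks rely on.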
Hence, following this latter approach, we asked dermatologists of a collaborating medical institution, the Hospital Doce de Octubre in Madrid, to annotate the
ISIC 2016 training dataset with the presence or absence of the 8 considered structures. In particular, we asked them to provide one label for each structure: 0 if the structure is not present, 1 if it is locally present, 2 if it is present and large enough to be considered a global pattern in the lesion.

Given this weakly-annotated dataset, we have built our approach on the work of [Pathak et al., 2015], where the authors introduced a novel constrained optimization for weakly-labeled segmentation using CNNs. The output of this network is a reduced version of the input image (64x64 in our case) where, for each pixel location x_i, a softmax is used to transform the net outputs f_i(x_i; θ) into probabilities as follows:

p_i(x_i | θ) = (1/Z_i) exp(f_i(x_i | θ))   (1)

where θ represents the parameters of the CNN, and Z_i = Σ_{s=1...8} exp(f_i(s | θ)) is the partition function at location i. The presence or absence of a class, as well as an estimate of its size in the image, lead to particular constraints over the probability P_s = Σ_i p_i(s | θ) accumulated over all pixel locations in the segmentation map:

• If a structure s is not present in an image, the constraint acts as an upper bound over the accumulated probability P_s, which has to be nearly zero.

• If a structure s is local in an image, we impose a lower and an upper bound on the accumulated probability P_s in the image to control the total area of the structure in the lesion.

• If a structure s is global in an image, we impose a lower bound on the accumulated probability P_s in the image to ensure a minimum area corresponding to the structure.

In order to adapt this approach to our particular scenario, we have developed a set of modifications over the original approach, namely:

• We observed that using a simple softmax function led to situations in which many constraints over local structures were obeyed by assigning some residual probability to every location in the segmentation map.
From our point of view, this is an undesired behavior, as one would rather expect a small set of pixels showing large probabilities of belonging to the structure of interest. To overcome this limitation, we have used a parametric softmax p_i(x_i | γ, θ) = (1/Z_i) exp(γ f_i(x_i | θ)). The parameter γ provides a soft approximation towards the max function, and large values lead to scenarios in which each location shows high probability for just a very reduced set of structures. In our case, we have used a value of γ = 20.

• We added a new constraint that helps to learn structures that appear in specific spatial locations of the lesion: e.g. streaks tend to appear in the borders of a lesion. To that end, we accumulate probabilities P_s only in those locations that will likely contain the intended structure. At this point, we have defined these areas of interest over the Normalized Polar Coordinates described in section 3, which are more adequate than the original Cartesian coordinates.

We have implemented this module taking the well-known vgg-vdd [Simonyan and Zisserman, 2014] (the same network used as initialization for the lesion segmentation module), removing the top layers, and using the ISIC 2016 training dataset and the described constrained optimization with weak annotations [Pathak et al., 2015]. The output of this module is, for each view v of a clinical case c, a tensor S_cv ∈ R^{64×64×8} that contains the 8 probability maps of the considered structures.

Figure 3: Processing pipeline of the Diagnosis Network

5 Diagnosis Network

The
Diagnosis Network will gather the information from the previous modules in order to generate a diagnosis for each clinical case. As in the previous modules, our approach has taken a well-known CNN as a starting point and modified the top layers to better adapt it to our problem.

The network chosen as basis is the resnet-50 [He et al., 2015], which uses residual layers to avoid the degradation problem when more and more layers are stacked in the network. When applied to our 256x256 images, the last convolutional block (conv_5x) of this network produces a tensor T_c ∈ R^{8×8×2048}, which hopefully behaves as a detector of high-level concepts (objects in Imagenet, the dataset for which it was originally designed).

In the original work, an average pooling layer transformed this tensor into a single value per channel and image, T_s ∈ R^{1×1×2048}, which was followed by a fully-connected layer and a softmax to generate the final probabilities of the image containing the classes being detected. Hence, the goal of the average pooling was fusing detections at various locations of the input image and generating a unified score for each high-level concept.

In our approach, however, we have modified the structure of the top layers of the network, giving place to the structure presented in Figure 3. We basically subdivide the top fully-connected layer providing the lesion diagnosis into three arms: a) the original arm with an average pooling followed by a fully-connected layer (FC1), b) a second arm that performs a normalized polar pooling (R rings by A angles) followed by a fully-connected layer (FC2), and c) a third arm that estimates the asymmetry of the lesion based on the previous polar pooling and then applies a fully-connected layer (FC3). The results of the three arms are then linearly combined using a Sum block. We next describe the novel blocks that are required in this new structure and that have been specifically developed in this work:

1.
Modulation block: The goal of this block is to take advantage of the previous segmentations of the lesion into global and local structures, which are of great interest for dermatologists in their daily diagnosis. To do so, this block fuses the previous structure segmentation maps S_cv with the filter outputs of the conv_5x layer in resnet-50. In particular, we modulate the outputs of the layer (2048 channels in our case) using the probabilities of the 8 local and global structures described in section 4. By concatenating the resulting modulations with the original set of outputs, we finally generate a set of channels which is 9 times the original one (18432 in our case).

2. Polar Pooling: This block aims to perform pooling operations over data (average or max pooling) but, rather than using rectangular spatial regions, it employs sectors defined in polar coordinates. Hence, this block is defined for a given number of radial rings R (radii ranging from 0 to 1) and angular sectors A (angles ranging between 0 and π), producing an output of size R × A × C, where C is the number of input channels. Furthermore, in order to adapt to the irregular shapes of the lesions, we use the normalized polar coordinates described in section 3. Since, depending on the shape of the lesion and the size of the tensor being pooled, some combinations (r, a) may not contain pixels in the image, we can also define overlaps between adjacent radii and angles to regularize the outputs. In addition, the division of the lesion into rings is non-uniform and ensures that every ring contains the same number of pixels for a perfectly circular lesion.

3. Asymmetry: This block computes metrics that evaluate the asymmetry of a lesion for a given angle.
In particular, given a polar division of the lesion into R × A sectors, we compute the asymmetry for A/2 angles by folding the lesion over each angle and computing the accumulated absolute difference between corresponding sectors.

As shown in Figure 3, we combine these modules to generate a final output Y_cv for each considered view of a clinical case. Finally, in order to generate a final output Y_c for each clinical case, we consider independence between views, leading to the factorization:

Y_c = ∏_{v=1}^{V} Y_cv   (2)

It is also worth noting that our final submission has also incorporated into the factorization an extra classifier which depends only on external information about the clinical case, such as patient gender and age, and lesion area.

The code that implements this paper, as well as the Lesion Segmentation and Diagnosis Networks, is provided at the following link: https://github.com/igondia/matconvnet-dermoscopy.

Acknowledgments
We kindly thank dermatologists of
Hospital 12 de Octubre of Madrid for their invaluable help annotating the data contents with the weak labels of structural patterns. This work was supported in part by the National Grant TEC2014-53390-P and National Grant TEC2014-61729-EXP of the Spanish Ministry of Economy and Competitiveness. In addition, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X GPU used for this research.
References
M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

D. Pathak, P. Krähenbühl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.

E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1605.06211, 2016. URL http://arxiv.org/abs/1605.06211.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.