Breast lesion segmentation in ultrasound images with limited annotated data
Bahareh Behboodi ★†, Mina Amiri ★†, Rupert Brooks ★‡, Hassan Rivaz ★†
★ Department of Electrical and Computer Engineering, Concordia University, Canada
† PERFORM Center, Concordia University, Canada
‡ Nuance Communications
ABSTRACT
Ultrasound (US) is one of the most commonly used imaging modalities in both diagnosis and surgical interventions due to its low cost, safety, and non-invasive characteristics. US image segmentation is currently a unique challenge because of the presence of speckle noise. As manual segmentation requires considerable effort and time, the development of automatic segmentation algorithms has attracted researchers' attention. Although recent methodologies based on convolutional neural networks have shown promising performance, their success relies on the availability of a large number of training samples, which is prohibitively difficult to collect for many applications. Therefore, in this study we propose the use of simulated US images and natural images as auxiliary datasets in order to pre-train our segmentation network, which is then fine-tuned with limited in vivo data. We show that with only a small number of in vivo images, fine-tuning the pre-trained network improves the Dice score compared to training from scratch. We also demonstrate that if the same number of natural and simulated US images is available, pre-training on simulated data is preferable.

Index Terms — Segmentation, simulation, U-Net, fine-tuning
1. INTRODUCTION
Breast cancer has been reported as one of the leading causes of death among women worldwide. Although digital mammography is an effective modality for breast cancer detection, it has limitations in detecting dense lesions that are similar to dense tissues [1], and it further uses ionizing radiation. Therefore, ultrasound (US) imaging, as a safe and versatile screening and diagnostic modality, plays an important role in this regard. However, due to contamination of US images with speckle noise, US images have low resolution and poor contrast between the target tissue and the background; thus, their segmentation is currently a challenging task [2]. Researchers have utilized recent state-of-the-art deep learning techniques to overcome the limitations of manual segmentation. Despite the success of deep learning techniques in computer vision tasks, their performance depends on the size of the input data, which is limited especially in medical US imaging. The collection and annotation of US images require considerable effort and time, which motivates the need for a deep learning-based strategy that can be trained on as few annotated samples as possible. The U-Net architecture [3], one of the most well-known networks for segmentation, is built upon the fully convolutional network and involves several convolutional, max-pooling, and up-sampling layers. To cope with limited input data for training U-Net, researchers have proposed various strategies based on data augmentation and transfer learning [2, 4, 5]. Data augmentation cannot truly capture the characteristics of the real data when very limited data is available. To this end, we propose a methodology based on transfer learning which utilizes simulated US data and natural images as auxiliary datasets. The goal is to enhance the segmentation results when only a few images are available.
In our work, we first pre-train the U-Net with simulated US and natural images separately, and then fine-tune the network with only a fraction of the available in vivo images. We demonstrate an improvement in segmentation results when only a small number of images is available.
2. METHOD
In deep learning approaches, the improvement in results depends on the number of training samples; such techniques therefore perform better with a larger amount of training data. In medical imaging, and especially in US imaging, annotating a sufficient number of training samples is expensive, and thus we take advantage of simulated US data as well as natural images as auxiliary datasets for pre-training U-Net in our proposed workflow. To that end, our proposed workflow consists of three avenues, as shown in Fig. 1. In the first avenue, U-Net is trained using only a subset of the in vivo dataset. In the second avenue, U-Net is first pre-trained on the simulated data, and then fine-tuned using the same subset of the in vivo dataset that was used in the first avenue. The last avenue is similar to the second avenue, with the difference that natural images are used for pre-training. Section 2.5 clarifies each avenue in more detail.

Fig. 1: Proposed workflow for training U-Net when limited annotated data is available.
2.1. In vivo data

The in vivo dataset includes 163 breast B-mode US images with lesions and a mean image size of × . The images as well as the delineations of their lesions are publicly available upon request [1]. The breast lesions of interest are generally hypoechoic (i.e., tissues with lower echogenicity), that is, darker than the surrounding tissue. Only of the total number of in vivo images was used as the training and validation datasets, and the remaining images were set aside as the testing dataset. The size of the training dataset was selected to be 4 times larger than the size of the validation set, yielding , , and images for training, validation, and testing, respectively.

2.2. Simulated data

To simulate B-mode images, Field II [6], a MATLAB-based, publicly available US simulation software, was used. The number of RF lines, centre frequency, sampling frequency, and speed of sound were respectively set to , MHz, MHz, and m/s. In our simulation phantom, the surface started mm from the transducer surface, and the axial, lateral, and elevational distances were initiated as mm, mm, and mm, respectively. The scatterers were randomly distributed in our virtual phantom such that each mm³ of the phantom contained on average scatterers, to allow for fast ultrasound simulation. Each simulated image was randomly assigned either hyperechoic lesions (i.e., tissues with higher echogenicity, brighter than the surrounding tissue), hypoechoic lesions, or both at the same time, in order to let our network better learn the various possible textures of US images. The intensities of hyperechoic lesions were set k times higher than the background, where k was an integer in the range of − ; for hypoechoic lesions, the intensities were set to l times the background, where l was a random variable between and .
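The lesion-intensity scheme above can be sketched in a few lines of numpy. This is an illustrative stand-in for the actual Field II simulation in MATLAB: the grid size, lesion radius, and the ranges for the factors k and l are assumptions, since the paper's exact values are not given here.

```python
import numpy as np

def lesion_amplitude_map(size=256, rng=None):
    """Scatterer-amplitude map with one random circular lesion.

    Hyperechoic lesions scale the unit background amplitude by an
    integer factor k, hypoechoic lesions by a factor l below 1,
    mirroring the intensity scheme described in the text.
    """
    rng = np.random.default_rng(rng)
    amp = np.ones((size, size))            # unit background amplitude
    # random circular lesion (location and radius are illustrative)
    cy, cx = rng.integers(size // 4, 3 * size // 4, size=2)
    r = rng.integers(size // 16, size // 6)
    yy, xx = np.ogrid[:size, :size]
    lesion = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    if rng.random() < 0.5:                 # hyperechoic: k times brighter
        k = rng.integers(2, 6)             # integer range assumed
        amp[lesion] *= k
    else:                                  # hypoechoic: l times darker
        l = rng.uniform(0.0, 0.9)          # range below 1 assumed
        amp[lesion] *= l
    return amp, lesion

amp, mask = lesion_amplitude_map(size=128, rng=0)
```

In the real simulation, such an amplitude map would modulate the scatterer amplitudes fed to Field II rather than the pixel intensities directly.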
The locations of the lesions were randomly selected, with circular or ellipsoidal shapes. A total of 700 images were simulated and separated into training, validation, and testing sets with splitting factors of , , and of the total number of images, yielding , , and images, respectively. It is worth mentioning that, as the in vivo data consisted of hypoechoic lesions, in the masks of the simulated data only the pixels inside the hypoechoic lesions were set to 1, and the remaining pixels were set to 0. Therefore, some simulated images had no segmented lesions in their masks.

2.3. Natural images

The natural images are publicly available at [7]. The dataset consists of images of salient objects with their annotations. In our work, the dataset was split into training, validation, and testing sets with splitting factors of , , and of the total number of images, yielding , , and images, respectively.

2.4. Network architecture

The U-Net structure previously proposed in [3] utilizes several conv-block, max-pooling, up-sampling, and skip-connection layers, as illustrated in Fig. 2. Each conv-block consists of a repetition of two convolution layers in the contraction and expansion paths, followed by max-pooling and up-sampling layers, respectively. In this work, the kernel sizes of the convolution, max-pooling, and up-sampling layers were set to 3×3, 2×2, and 2×2, respectively. As a pre-processing step, all images were resized to × , mirrored with a mirroring factor of pixels, yielding images of size × , and normalized to the range [0, 1]. Thus, the sizes of the input and output data were (batch, , , ) and (batch, , , ), respectively, where batch indicates the number of images in each batch. The activation and loss functions, optimizer, learning rate, number of epochs, batch size, weight initializer, and kernel regularizer were initialized as stated in Table 1. The Dice score is defined as DSC = 2|G ∩ P| / (|G| + |P|), where G and P are the ground truth and predicted masks, respectively.

2.5. Proposed avenues

As previously mentioned, in this work we propose three avenues to study the impact of simulated and natural images as auxiliary datasets for US segmentation (see Fig. 1). In the following paragraphs, we explain each avenue in detail.

Fig. 2: U-Net structure with its contraction, bottle-neck, and expansion paths.
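The Dice score DSC = 2|G ∩ P| / (|G| + |P|) used in this work is straightforward to implement for binary masks; a minimal sketch (the paper's own implementation details are not given):

```python
import numpy as np

def dice_score(ground_truth, prediction, eps=1e-7):
    """DSC = 2|G ∩ P| / (|G| + |P|) for binary masks.

    eps guards against division by zero when both masks are empty.
    """
    g = ground_truth.astype(bool)
    p = prediction.astype(bool)
    intersection = np.logical_and(g, p).sum()
    return 2.0 * intersection / (g.sum() + p.sum() + eps)

g = np.array([[1, 1], [0, 0]])
p = np.array([[1, 0], [0, 0]])
# |G ∩ P| = 1, |G| = 2, |P| = 1  →  DSC = 2/3
```

For use as a training loss, the same expression is typically applied to the soft (un-binarized) network outputs and negated, so that maximizing the DSC minimizes the loss.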
Table 1: U-Net parameters

Parameter                                 | Value
------------------------------------------|---------------
Activation function (except last layer)   | ReLU [8]
Activation function (last layer)          | Softmax
Loss function                             | Dice score
Optimizer                                 | Adam [9]
Learning rate                             |
No. of epochs                             |
Batch size                                | 8
Weight initializer                        | He-normal [10]
Kernel regularizer                        | L2-norm
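The He-normal initializer listed in Table 1 [10] draws weights from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in); a minimal numpy version (the kernel shape below is an illustrative example, not taken from the paper):

```python
import numpy as np

def he_normal(shape, fan_in, rng=None):
    """He-normal initialization: N(0, sqrt(2 / fan_in)) [10]."""
    rng = np.random.default_rng(rng)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=shape)

# e.g., a 3x3 conv kernel with 64 input and 32 output channels:
# fan_in = kernel_h * kernel_w * in_channels = 3 * 3 * 64
w = he_normal((3, 3, 64, 32), fan_in=3 * 3 * 64, rng=0)
```

This scaling keeps the variance of activations roughly constant across layers when ReLU is used, which is why it pairs naturally with the ReLU activations in Table 1.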
Avenue 1: In the first avenue, the U-Net with the above-mentioned parameters was trained on in vivo images from scratch, using and images as the training and validation sets, respectively, and was tested on images. We call this trained network Pt_invivo. Due to the small number of training samples, we used 5-fold cross-validation to reduce variation in performance. Prior to each optimization iteration, we performed "on-the-fly" augmentation by applying random height shifts, width shifts, and zooming.

Avenue 2: In this avenue, U-Net was first trained using and simulated images as its training and validation sets, respectively. Similar to the first avenue, the U-Net was initialized using the parameters in Table 1. For simplicity, we refer to the U-Net trained with simulated data as Pt_sim. Afterwards, the contraction path of Pt_sim was fine-tuned on the in vivo training and validation sets based on the parameters in Table 1, except that the weights were initialized using the Pt_sim weights. We call the fine-tuned network Ft_sim_invivo, which was tested on the in vivo test set. 5-fold cross-validation and "on-the-fly" augmentation were used for fine-tuning our Ft_sim_invivo network.

Avenue 3: In this step, similar to Avenue 2 described above, U-Net was first pre-trained and then fine-tuned on the in vivo data. However, for pre-training the network we used and natural images as the training and validation sets, respectively. For simplicity, the U-Net pre-trained with natural images is referred to as Pt_nat, and the network fine-tuned from Pt_nat is referred to as Ft_nat_invivo. 5-fold cross-validation and "on-the-fly" augmentation were used in the fine-tuning step.
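One reading of the fine-tuning step above is that the contraction-path weights are carried over from the pre-trained network while the remaining layers keep their fresh initialization. A toy illustration with weight dictionaries (the layer names and values are hypothetical, purely to show the transfer pattern):

```python
import numpy as np

def transfer_contraction(pretrained, target, prefix="contraction"):
    """Copy contraction-path (encoder) weights from a pre-trained
    network into a freshly initialized one; all other layers keep
    their own initialization."""
    for name, w in pretrained.items():
        if name.startswith(prefix):
            target[name] = w.copy()
    return target

pretrained = {"contraction/conv1": np.full((3, 3), 0.5),
              "expansion/conv1":   np.full((3, 3), 0.5)}
fresh      = {"contraction/conv1": np.zeros((3, 3)),
              "expansion/conv1":   np.zeros((3, 3))}
tuned = transfer_contraction(pretrained, fresh)
# tuned["contraction/conv1"] now matches the pre-trained weights,
# while tuned["expansion/conv1"] keeps its fresh initialization.
```

In a deep learning framework this corresponds to loading the pre-trained checkpoint into the encoder layers only, then continuing training on the in vivo data.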
3. RESULTS

3.1. Evaluation Criteria
In this work, we used the Dice Similarity Coefficient (DSC) as our evaluation criterion. It is worth noting that we also used the DSC as our loss function; however, in the evaluation step, the predicted masks, which were the output of the last layer (i.e., the Softmax layer), were first binarized using the argmax function and then compared with the ground truth masks. As our dataset was unbalanced (i.e., the number of background pixels was higher than that of the lesion pixels), we only report the DSC scores of the foreground (i.e., lesion) masks, ignoring the DSC score of the background.
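The evaluation step described above (argmax over the softmax output, then DSC of the foreground class only) can be sketched as follows; the two-class output layout is an assumption consistent with a softmax over background/lesion channels:

```python
import numpy as np

def foreground_dice(softmax_out, ground_truth, eps=1e-7):
    """Binarize a (..., 2) softmax map with argmax, then compute the
    DSC of the foreground (lesion) class only."""
    pred = np.argmax(softmax_out, axis=-1).astype(bool)  # 1 = lesion
    gt = ground_truth.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

# two-pixel toy example: softmax favours the lesion class at the
# first pixel only
probs = np.array([[[0.2, 0.8], [0.9, 0.1]]])   # shape (1, 2, 2)
gt = np.array([[1, 1]])
# predicted foreground = [[1, 0]]  →  DSC = 2*1 / (1 + 2) = 2/3
```

Reporting only the foreground DSC avoids the inflated scores that the dominant background class would otherwise contribute.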
Table 2 presents the DSC scores of the predicted masks derived from the Pt_invivo, Pt_sim, Ft_sim_invivo, Pt_nat, and Ft_nat_invivo networks for both the training and testing in vivo sets. The DSC score for the test set increases when we fine-tune the pre-trained network, no matter what type of images was used during pre-training. Therefore, pre-training the network performs better than training from scratch with limited training data. It is worth mentioning that we used more natural images than simulated images during pre-training. However, when we decreased the number of natural images in Avenue 3 to 420, in order to be equal to the number of simulated images, the DSC score was reduced, as shown in Table 2 (Pt_nat420 and Ft_nat420_invivo denote the repetition of Avenue 3 using 420 natural images). As a result, pre-training the network on simulated data is preferable to pre-training on natural images when the same number of images from both datasets is available. Figure 3 demonstrates examples of the predicted masks with their DSC scores. Training the Ft_nat_invivo network on the natural images required hours; pre-training on the simulated data also took hours, while training or fine-tuning on the in vivo data took minutes. As more annotations become available, the U-Net is better trained, but more time is needed for the pre-training step.

Table 2: Mean and standard deviation of DSC scores for predicted masks of the in vivo train and test sets over 5-fold cross-validation

Avenue    | Network name      | Train in vivo | Test in vivo
----------|-------------------|---------------|-------------
Avenue 1  | Pt_invivo         | . ± .03       | . ± .
Avenue 2  | Pt_sim            | . ± .22       | . ± .
          | Ft_sim_invivo     | . ± .         | . ± .
Avenue 3  | Pt_nat            | . ± .27       | . ± .
          | Ft_nat_invivo     | . ± .         | . ± .
          | Ft_nat420_invivo  | . ± .15       | . ± .

Fig. 3: Examples of segmentation results and their DSC scores derived from Avenue 1, Avenue 2, Avenue 3, and Ft_nat420_invivo.
4. CONCLUSION AND FUTURE WORK
In this work, we showed that pre-training the network performs better than training the network from scratch, especially when the number of annotations is limited. We proposed the use of simulated US images as an auxiliary dataset for pre-training the network. In addition, we confirmed that natural images can also be considered as an auxiliary dataset; however, thousands of them are required for optimum results, which led to hours of pre-training. We therefore conclude that simulated US images are the preferred auxiliary dataset for pre-training the network. As future work, we will validate our strategy on different types of in vivo datasets and segmentation applications, such as prostate cancer and muscle segmentation.
5. ACKNOWLEDGEMENT
This research was funded by the Richard and Edith Strauss Foundation and by NSERC Discovery Grant RGPIN 04136. The authors would like to thank NVIDIA for donating the GPU.
6. REFERENCES

[1] MH Yap, G Pons, J Martí, S Ganau, M Sentís, R Zwiggelaar, AK Davison, and R Martí, "Automated breast ultrasound lesions detection using convolutional neural networks," IEEE Journal of Biomedical and Health Informatics, vol. 22, no. 4, pp. 1218–1226, 2017.
[2] S Liu, Y Wang, X Yang, B Lei, L Liu, SX Li, D Ni, and T Wang, "Deep learning in medical ultrasound analysis: A review," Engineering, 2019.
[3] O Ronneberger, P Fischer, and T Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[4] S Pereira, A Pinto, V Alves, and CA Silva, "Brain tumor segmentation using convolutional neural networks in MRI images," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1240–1251, 2016.
[5] HC Shin, NA Tenenholtz, JK Rogers, CG Schwarz, ML Senjem, JL Gunter, KP Andriole, and M Michalski, "Medical image synthesis for data augmentation and anonymization using generative adversarial networks," in International Workshop on Simulation and Synthesis in Medical Imaging. Springer, 2018, pp. 1–11.
[6] JA Jensen, "Field: A program for simulating ultrasound systems," Medical & Biological Engineering & Computing, vol. 34, sup. 1, pp. 351–353, 1996.
[7] C Xia, J Li, X Chen, A Zheng, and Y Zhang, "What is and what is not a salient object? Learning salient object detector by ensembling linear exemplar regressors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4142–4150.
[8] V Nair and GE Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[9] DP Kingma and J Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[10] K He, X Zhang, S Ren, and J Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.