Automated Pancreas Segmentation Using Multi-institutional Collaborative Deep Learning
Pochuan Wang, Chen Shen, Holger R. Roth, Dong Yang, Daguang Xu, Masahiro Oda, Kazunari Misawa, Po-Ting Chen, Kao-Lang Liu, Wei-Chih Liao, Weichung Wang, Kensaku Mori
National Taiwan University, Taiwan; Nagoya University, Japan; NVIDIA Corporation, United States; Aichi Cancer Center, Japan; National Taiwan University Hospital, Taiwan
(Pochuan Wang and Chen Shen contributed equally.)
Abstract.
The performance of deep learning based methods strongly relies on the amount of data used for training. Many efforts have been made to increase the data available in the medical image analysis field. However, unlike photographic images, it is hard to build centralized databases of medical images because of numerous technical, legal, and privacy issues. In this work, we study the use of federated learning between two institutions in a real-world setting to collaboratively train a model without sharing the raw data across national boundaries. We quantitatively compare the segmentation models obtained with federated learning and with local training alone. Our experimental results show that federated learning models have higher generalizability than those from standalone training.
Keywords:
Federated Learning · Pancreas Segmentation · Neural Architecture Search.
Recently, methods based on deep neural networks (DNNs) have been widely utilized in medical imaging research. High-performing models that are clinically useful always require vast, varied, and high-quality datasets. However, it is expensive to collect large datasets, especially in the medical field. Only well-trained experts can generate acceptable annotations for DNN training, making annotated medical images even more scarce. Furthermore, medical images from a single institution can be biased towards specific pathologies, equipment, acquisition protocols, and patient populations. The low generalizability of DNN models trained on insufficient data is critical when applying deep learning methods in clinical use.

To improve robustness with scant data, fine-tuning is an alternative way to transfer knowledge from pre-trained DNNs. The fine-tuning technique starts training from pre-trained weights instead of a random initialization, which has been proven helpful in medical image analysis [7,8] and exceeds the performance of training a DNN from scratch. However, fine-tuned models can still have serious deficiencies in generalizability [1]. When a model is pre-trained on one dataset (source data) and then fine-tuned on another (target data), the trained model tends to fit the target data but lose its representation of the source data [3].

Federated learning (FL) [4] is an innovation for solving this issue. It can collaboratively train DNNs using datasets from multiple institutions without creating a centralized dataset [2,6]. Each institution (client) trains with local data using the same network architecture, decided in advance. After a certain amount of local training, each institution regularly sends the trained model to the server. The server only centralizes the model weights for aggregation and then sends them back to each client.

In this work, we collaboratively generated and evaluated an FL model for pancreas segmentation without sharing the data. Our data consist of healthy and unhealthy pancreas cases collected at two institutions in different countries (Taiwan and Japan). Throughout this study, we utilized the model from the coarse-to-fine neural architecture search (C2FNAS) [10] with an additional variational auto-encoder (VAE) [5] branch attached to the encoder endpoint. FL dramatically improved the generalizability of the models on the server side and the client side for both datasets. To the best of our knowledge, this is the first time FL has been performed to build a pancreas segmentation model from data hosted at multi-national sites.
Fig. 1: The architecture of the federated learning system.
FL can be categorized into different types based on the distribution characteristics of the data [9]. In this work, we focus on the horizontal architecture, illustrated in Fig. 1. This type of FL allows us to train with datasets from different samples distributed across clients.

A horizontal FL system consists of two parts: a server and clients. The server manages the training process and generates a global model, and each client trains with local data to produce a local model. The server receives trained weights from each client and aggregates them into a global model; the clients train with their local datasets and send the weights to the server. We call the process of generating one global model a round. The workflow consists of the following steps:

1. Start the server. The server side sets the gRPC communication ports, the SSL certificate, and the maximum and minimum numbers of clients.
2. Start the clients. Each client is initialized with its client-side configuration and uses its credential to make a login request to the server.
3. Each client downloads the current global model from the server and fine-tunes it with the local dataset. It then submits only the model to the server and waits for the other clients.
4. Once the server has received models from the previously defined minimum number of clients, it aggregates them into a new global model.
5. The server updates the global model, which finishes one round.
6. Go back to step 3 for another round.

Only the model weight parameters are shared between the server and clients, protecting the privacy of the local data. To build server-client trust, the server side uses tokens throughout the process. An SSL certificate authority and gRPC communication ports are adopted to improve security.
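To make the aggregation step of this workflow concrete, the following is a minimal, framework-agnostic sketch of a FedAvg-style weighted average of client weights. It is an illustration only: the actual experiments use the NVIDIA Clara Train SDK, and the choice of weighting clients by their training-set sizes is our assumption, as are the function and variable names.

```python
import numpy as np

def aggregate_weights(client_weights, client_sizes):
    """FedAvg-style aggregation: weighted average of per-client model weights.

    client_weights: one list of numpy arrays per client (identical shapes,
                    since all clients train the same architecture).
    client_sizes:   number of local training samples per client, used here
                    as aggregation weights (an assumption, for illustration).
    """
    fractions = np.asarray(client_sizes, dtype=np.float64)
    fractions /= fractions.sum()
    global_weights = []
    for layer_idx in range(len(client_weights[0])):
        weighted = [frac * client[layer_idx]
                    for frac, client in zip(fractions, client_weights)]
        global_weights.append(np.sum(weighted, axis=0))
    return global_weights

# One round, conceptually: each client fine-tunes the current global model on
# its local data and uploads its weights; the server forms the new global model.
# Using the training-set sizes reported in this study (252 for C1, 286 for C2):
# new_global = aggregate_weights([c1_weights, c2_weights], [252, 286])
```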
We use two physically separated clients in this work in order to evaluate FL in a real-world setting, with two different datasets from two institutions in two different countries.

For Client 1, we utilize 420 portal-venous phase abdominal CT volumes collected for preoperative planning in gastric surgery, so the stomach is inflated. We did not notice any particular abnormalities of the pancreas. The voxel spacing (x, y, z) of the volumes ranges over (0.58-0.98, 0.58-0.98, 0.16-1.0) mm. Only the pancreas regions are manually annotated, using semi-automated segmentation tools. We randomly split the dataset into 252 training volumes, 84 validation volumes, and 84 testing volumes.

For Client 2, we collected 486 contrast-enhanced abdominal CT volumes, all from patients with pancreatic cancer. Within this dataset, 40 volumes have a voxel spacing (x, y, z) of (0.68, 0.68, 1.0) mm and the remaining 446 volumes have a spacing of (0.68, 0.68, 5.0) mm. The segmentation labels contain the normal part of the pancreas and the tumor of the pancreatic cancer, and all labels are manually segmented by physicians. We randomly split Client 2's dataset into a training set of 286 volumes, a validation set of 100 volumes, and a testing set of 100 volumes.
We re-sample all volumes to an isotropic voxel spacing of 1.0 × 1.0 × 1.0 mm. We clip the intensity values to a minimum intensity of −200 and a maximum intensity of 250, and then re-scale the value range to [−1.0, 1.0].
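The following is a minimal sketch of this preprocessing. The use of SimpleITK for resampling and NumPy for intensity normalization is our assumption for illustration; it is not necessarily the authors' actual pipeline.

```python
import numpy as np
import SimpleITK as sitk

def preprocess(image: sitk.Image) -> np.ndarray:
    # Re-sample to an isotropic 1.0 x 1.0 x 1.0 mm voxel spacing.
    old_spacing = image.GetSpacing()
    old_size = image.GetSize()
    new_spacing = (1.0, 1.0, 1.0)
    new_size = [int(round(sz * sp / ns))
                for sz, sp, ns in zip(old_size, old_spacing, new_spacing)]
    resampled = sitk.Resample(image, new_size, sitk.Transform(),
                              sitk.sitkLinear, image.GetOrigin(),
                              new_spacing, image.GetDirection(), 0.0,
                              image.GetPixelID())

    # Clip intensities to [-200, 250], then linearly re-scale to [-1, 1].
    array = sitk.GetArrayFromImage(resampled).astype(np.float32)
    array = np.clip(array, -200.0, 250.0)
    array = 2.0 * (array + 200.0) / 450.0 - 1.0
    return array
```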
We utilize the resulting model of the coarse-to-fine neural architecture search (C2FNAS) [10]. The C2FNAS search algorithm performs a coarse-level search followed by a fine-level search to determine the optimal neural network architecture for 3D medical image segmentation. In the coarse-level search, C2FNAS searches for the topology of U-Net-like models. In the fine-level search, C2FNAS searches for the optimal operation (2D convolution, 3D convolution, or pseudo-3D convolution) for each module of the topology found in the previous search.

Fig. 2: Model architecture of C2FNAS.

We add a VAE branch to the encoder endpoint of the C2FNAS model. The VAE branch shares the encoder layers with C2FNAS and estimates the mean and standard deviation of the encoded features for input image reconstruction. Two further losses introduced in [5], L_KL and L_2, are required for the VAE branch: L_KL measures the distance of the mean and standard deviation vectors from a Gaussian distribution, and L_2 computes the voxel-wise distance between the decoded volume and the input volume. The VAE is capable of regularizing the shared encoder of the segmentation model.

Our implementation of the VAE estimates the mean vector and the standard deviation vector by adding two separate dense layers with 128 output logits. During training, we construct the latent vector by adding the mean vector and the standard deviation vector weighted by random coefficients drawn from a normal distribution. During validation and testing, we treat the mean vector as the latent vector. To reconstruct the input image, we add a dense layer to recover the shape of the input features from the latent vector. From the recovered features, we use trilinear up-sampling and residual convolutional blocks to reconstruct the input image.

Fig. 3: Model architecture of image reconstruction for the variational auto-encoder.
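The latent-vector construction and the L_KL term described above can be sketched as follows, assuming TensorFlow 2.x. The layer and variable names are illustrative, the latent dimension of 128 follows the description above, and the direct estimation of the standard deviation is taken from the text; this is a sketch under those assumptions, not the authors' exact implementation.

```python
import tensorflow as tf

class VAEBranch(tf.keras.layers.Layer):
    """Estimates mean/std from encoder features and samples a latent vector."""

    def __init__(self, latent_dim=128, **kwargs):
        super().__init__(**kwargs)
        self.flatten = tf.keras.layers.Flatten()
        self.mean_fc = tf.keras.layers.Dense(latent_dim)  # mean vector
        self.std_fc = tf.keras.layers.Dense(latent_dim)   # standard deviation vector

    def call(self, encoder_features, training=False):
        x = self.flatten(encoder_features)
        z_mean = self.mean_fc(x)
        z_std = self.std_fc(x)
        if training:
            # Reparameterization: latent = mean + std * eps, with eps ~ N(0, 1).
            eps = tf.random.normal(tf.shape(z_std))
            z = z_mean + z_std * eps
        else:
            # At validation/test time, the mean vector is used as the latent vector.
            z = z_mean
        # KL-style regularization term, following the form used in [5].
        kl_loss = tf.reduce_mean(
            tf.square(z_mean) + tf.square(z_std)
            - tf.math.log(tf.square(z_std) + 1e-8) - 1.0)
        return z, kl_loss
```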
We use a batch size of 8 with 4 NVIDIA GPUs at each client (Tesla V100 32GB for Client 1 and Quadro RTX 8000 for Client 2) in all our experiments. The patches in each batch are randomly sampled and cropped from the input volume, with equal sampling rates for foreground and background patches. The input patch size used for training is [96, 96, 96], and the learning rate is decayed with a cosine annealing learning rate scheduler. The loss for C2FNAS segmentation is the Dice loss combined with the categorical cross-entropy loss. In the setting with VAE regularization, we add the VAE loss L_KL and the reconstruction loss L_2 to the total loss with constant coefficients.
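A sketch of how such a total training loss could be assembled is shown below, again assuming TensorFlow. The coefficients `vae_weight` and `rec_weight` are placeholders, since the exact constant values are not reproduced here, and all function names are illustrative.

```python
import tensorflow as tf

def dice_loss(y_true, y_prob, eps=1e-5):
    # Soft Dice loss; y_true is one-hot, y_prob is the softmax output.
    spatial_axes = list(range(1, len(y_prob.shape) - 1))
    intersection = tf.reduce_sum(y_true * y_prob, axis=spatial_axes)
    denominator = tf.reduce_sum(y_true + y_prob, axis=spatial_axes)
    return tf.reduce_mean(1.0 - 2.0 * intersection / (denominator + eps))

def total_loss(y_true, y_prob, recon, image, kl_loss,
               vae_weight=0.1, rec_weight=0.1):
    # Segmentation loss: Dice combined with categorical cross-entropy.
    seg = dice_loss(y_true, y_prob) + tf.reduce_mean(
        tf.keras.losses.categorical_crossentropy(y_true, y_prob))
    # VAE regularization: KL term plus L2 image-reconstruction term.
    # vae_weight and rec_weight are placeholder coefficients, not the
    # constants actually used in the experiments.
    rec = tf.reduce_mean(tf.square(recon - image))
    return seg + vae_weight * kl_loss + rec_weight * rec
```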
Our implementation of the C2FNAS model is based on TensorFlow. Our FL experiments utilize the NVIDIA Clara Train SDK (https://developer.nvidia.com/clara) for model training and for the communication of weights between the server and clients.

The experimental setups include standalone training on both clients and federated learning with the two clients. In the standalone setting, Client 1 (C1) and Client 2 (C2) each train a local model independently on their own dataset; the resulting models are C1 baseline and C2 baseline. In the federated learning setup, we set up an aggregation server with no access to any dataset, and the two clients train on their local datasets and send their model updates to the server every ten epochs. The resulting models for federated learning are FL global, C1 FL local, and C2 FL local.

Table 1 compares the standalone models (C1 baseline and C2 baseline) with the FL models (FL global, C1 FL local, and C2 FL local) on the C1 and C2 datasets. Note that the C1 dataset only has pancreas labels, whereas the C2 dataset is from pancreatic cancer patients and includes both pancreas and tumor labels. The standalone models do not perform well when predicting on the other client's dataset; the C2 tumor class even gets a mean Dice score of zero with the C1 baseline model, because the C1 dataset does not include the tumor class. For the FL models, the local models from both C1 and C2 improve greatly on the other client's dataset. Segmentation performance for tumors with the C1 FL local model is comparable to the standalone C2 model, even though the C1 dataset does not include the tumor class. The FL global model shows high generalizability on both the C1 and C2 datasets.
Table 1: Mean Dice scores for the pancreas and tumor on the C1 and C2 test sets for the C1 baseline, C2 baseline, FL global, C1 FL local, and C2 FL local models, along with per-model and per-column averages. Data from C1 only have labels for the pancreas. FL improves the generalizability of the models.
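For reference, the Dice coefficient reported in Table 1 can be computed per class from binary masks as in the following minimal sketch; the function and variable names are illustrative.

```python
import numpy as np

def dice_score(pred_mask: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient between a binary prediction mask and a ground-truth mask."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return float(2.0 * intersection / (pred.sum() + gt.sum() + eps))
```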
Fig. 4: Comparison of segmentation results with the Client 1 (C1) dataset. Blue and yellow indicate the pancreas and tumor, respectively. Only pancreas regions are labeled in the C1 dataset.

Fig. 4 shows the qualitative assessment on the C1 dataset. When predicting with the C2 baseline model, a small region of the pancreas was misdetected as a pancreatic tumor, although the CT volumes in the C1 dataset consist of healthy pancreas cases. This misdetection disappeared after FL, and the FL global model has the best pancreas segmentation performance on the C1 dataset.

In Fig. 5 we present a visualization of the segmentation of one sample volume from the C2 test set, with panels (a) ground truth, (b) C1 baseline, (c) C2 baseline, (d) FL global, (e) C1 FL local, and (f) C2 FL local.

Fig. 5: Comparison of segmentation results with the Client 2 (C2) dataset. The pancreas is labeled in blue and the tumor in yellow.

The prediction of the C1 baseline model misses most of the pancreas and the tumor. The prediction of the C2 baseline model is roughly in the correct area, but the shape of the tumor is incorrect and there is a false positive for a second tumor. The three federated learning models detect the areas of the pancreas and the tumor better. Although the tumor shape is still far from the ground truth in all predictions, the continuity of the region and the smoothness of the tumor boundary are significantly improved.
In the standalone training setup, both the C1 baseline and the C2 baseline models perform well on their corresponding local test sets. However, testing on the opposite test set shows a significant performance drop. As the properties of the C1 and C2 datasets are very different (healthy pancreases versus patients with pancreatic tumors), it is natural that the standalone models cannot generalize well to a different data distribution.

In the federated learning setup, the C1 FL local model performs slightly better than the C1 baseline on its own test set and has a remarkable performance gain on the C2 test set; its mean Dice scores for both the pancreas and the tumor on the C2 test set are comparable to the C2 baseline model. For the C2 FL local model, the mean Dice score for the healthy part of the pancreas is slightly better than the C2 baseline model, and the mean Dice score for the tumor drops only moderately. The result of the C2 FL local model on the C1 dataset also improves substantially over the C2 baseline model, with performance similar to that of the C1 FL local model. The FL global model predicts the pancreas well on both test sets, but its tumor predictions are notably lower than those of the two local models. This drop is possibly caused by the lack of any validation or model selection procedure on the server side. In the client training, we always keep the model with the highest local validation metric, but on the server side, the model aggregator only accepts gradients from the clients; the server cannot determine the quality of the model in our current training setting.
In this research, we conduct real-world federated learning to train neural networks across two institutions, without the need for data sharing between the sites and despite inconsistent data collection criteria. The results suggest that the federated learning framework can deal with highly unbalanced data distributions between clients and can deliver more generalizable models than standalone training.
References
1. Chang, K., Balachandar, N., Lam, C., Yi, D., Brown, J., Beers, A., Rosen, B., Rubin, D., Kalpathy-Cramer, J.: Distributed deep learning networks among institutions for medical imaging. Journal of the American Medical Informatics Association (8), 945-954 (2018). https://doi.org/10.1093/jamia/ocy017
2. Li, W., Milletarì, F., Xu, D., Rieke, N., Hancox, J., Zhu, W., Baust, M., Cheng, Y., Ourselin, S., Cardoso, M.J., Feng, A.: Privacy-preserving federated brain tumour segmentation. In: Suk, H.I., Liu, M., Yan, P., Lian, C. (eds.) Machine Learning in Medical Imaging. pp. 133-141. Springer International Publishing, Cham (2019)
3. Li, Z., Hoiem, D.: Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence (12), 2935-2947 (2017)
4. McMahan, H.B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: AISTATS (2017)
5. Myronenko, A.: 3D MRI brain tumor segmentation using autoencoder regularization. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. pp. 311-320. Springer International Publishing (2019)
6. Sheller, M.J., Reina, G.A., Edwards, B., Martin, J., Bakas, S.: Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. pp. 92-104. Springer International Publishing, Cham (2019)
7. Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging (5), 1285-1298 (2016)
8. Tajbakhsh, N., Shin, J.Y., Gurudu, S.R., Hurst, R.T., Kendall, C.B., Gotway, M.B., Liang, J.: Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging (5), 1299-1312 (2016)
9. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (2) (2019). https://doi.org/10.1145/3298981
10. Yu, Q., Yang, D., Roth, H., Bai, Y., Zhang, Y., Yuille, A.L., Xu, D.: C2FNAS: Coarse-to-fine neural architecture search for 3D medical image segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)