Cross-Modality Multi-Atlas Segmentation Using Deep Neural Networks
Wangbin Ding, Lei Li, Xiahai Zhuang*, and Liqin Huang*

College of Physics and Information Engineering, Fuzhou University, Fuzhou, China; School of Data Science, Fudan University, Shanghai, China; School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Biomedical Engineering and Imaging Sciences, King's College London, London, UK
Abstract.
Both image registration and label fusion in multi-atlas segmentation (MAS) rely on the intensity similarity between target and atlas images. However, such similarity can be problematic when target and atlas images are acquired using different imaging protocols. High-level structure information can provide reliable similarity measurement for cross-modality images when cooperating with deep neural networks (DNNs). This work presents a new MAS framework for cross-modality images, where both image registration and label fusion are achieved by DNNs. For image registration, we propose a consistent registration network, which can jointly estimate forward and backward dense displacement fields (DDFs). Additionally, an invertible constraint is employed in the network to reduce the correspondence ambiguity of the estimated DDFs. For label fusion, we adapt a few-shot learning network to measure the similarity of atlas and target patches. Moreover, the network can be seamlessly integrated into the patch-based label fusion. The proposed framework is evaluated on the MM-WHS dataset of MICCAI 2017. Results show that the framework is effective in both cross-modality registration and segmentation.
Keywords:
MAS · Cross-Modality · Similarity.
1 Introduction

Segmentation is an essential step in medical image processing. Many clinical applications rely on an accurate segmentation to extract specific anatomies or compute functional indices. Multi-atlas segmentation (MAS) has proved to be an effective method for medical image segmentation [22]. Generally, it contains two steps, i.e., a pair-wise registration between the target image and the atlases, and a label fusion among selected reliable atlases. Conventional MAS methods

* X. Zhuang and L. Huang are co-senior and corresponding authors: [email protected]; [email protected]. This work was funded by the National Natural Science Foundation of China (Grant No. 61971142) and the Shanghai Municipal Science and Technology Major Project (Grant No. 2017SHZDZX01).
Fig. 1.
The pipeline of the cross-modality MAS framework. The atlases are first warped to the target image by the registration model (see Section 2.1); each warped atlas label thus becomes a candidate segmentation of the target image. Then, each voting patch sampled from the warped atlases is weighted according to its similarity to the corresponding target patch (see Section 2.2). Based on the weights, one can obtain a final label for the target image using the PLF strategy.

normally process images from a single modality, but in many scenarios they could benefit from cross-modality image processing [7]. To obtain such a method, registration and label fusion algorithms that can adapt to cross-modality data are required.

To achieve cross-modality registration, a common approach is to design a modality-invariant similarity as the registration criterion, such as mutual information (MI) [9] or normalized mutual information (NMI) [15]. An alternative way is to employ structural representations of images, which are supposed to be invariant across multi-modality images [17,5]. Recently, several deep learning (DL) based multi-modality registration algorithms have been developed. For example, Hu et al. proposed a weakly-supervised multi-modality registration network by exploring the dense voxel correspondence from anatomical labels [6]. Qin et al. designed an unsupervised registration network based on disentangled shape representations, and then converted the multi-modality registration into a mono-modality problem in the latent shape space [11].

For label fusion, there are several widely utilized strategies, such as majority voting (MV), plurality voting, global or local weighted voting, joint label fusion (JLF) [18], statistical modeling approaches [19], and patch-based label fusion (PLF) [3]. To use cross-modality atlases, Kasiri et al. presented a similarity measurement based on the un-decimated wavelet transform for cross-modality atlas fusion [8]. Furthermore, Zhuang et al.
proposed a multi-scale patch strategy to extract multi-level structural information for multi-modality atlas fusion [23]. Recently, learning methods have been engaged to improve the performance of label fusion. Ding et al. proposed a DL-based label fusion strategy, namely VoteNet, which can locally select reliable atlases and fuse atlas labels by plurality voting [4]. To enhance the PLF strategy, Sanroma et al. and Yang et al. attempted to achieve a better deep-feature similarity between target and atlas patches through deep neural networks (DNNs) [13,21]. Similarly, Xie et al. incorporated a DNN to predict the weight of voting patches for the JLF strategy [20]. All these learning-based label fusion works assumed that atlas and target images come from the same modality.

This work aims to design a DNN-based approach to achieve accurate registration and label fusion for cross-modality MAS. Figure 1 presents the pipeline of our proposed MAS method. The main contributions of this work are summarized as follows: (1) We present a DNN-based MAS framework for cross-modality segmentation, and validate it using the MM-WHS dataset [22]. (2) We propose a consistent registration network, where an invertible constraint is employed to encourage the uniqueness of the transformation fields between cross-modality images. (3) We introduce a similarity network based on few-shot learning, which can estimate the patch-based similarity between the target and atlas images.
2 Method

2.1 Consistent Registration Network

Suppose we are given N atlases {(I_a^1, L_a^1), ..., (I_a^N, L_a^N)} and a target (I_t, L_t). For each pair of I_a^i and I_t, two registration procedures can be performed by switching the roles of I_a^i and I_t. We denote the dense displacement field (DDF) from I_a^i to I_t as U^i, and vice versa as V^i. For convenience, we abbreviate I_a^i, L_a^i, U^i and V^i as I_a, L_a, U and V when no confusion is caused. Consider a label as a mapping function from the common spatial space to the label space, Ω → L, so that

    ˜L_a(x) = L_a(x + U(x)),    (1)
    ˜L_t(x) = L_t(x + V(x)),    (2)

where ˜L_a and ˜L_t denote the warped L_a and L_t, respectively.

We develop a new registration network which can jointly estimate the forward (U) and inverse (V) DDFs for a pair of input images. The advantage of joint estimation is that it can reduce the ambiguous correspondences in the DDFs (see the next subsection). Figure 2 shows the overall structure of the registration network. The backbone of the network is based on the U-shape registration model [6]. Instead of using voxel-level ground-truth transformations, which are hard to obtain in cross-modality scenarios, the Dice coefficients of anatomical labels are used to train the network. Since the network is designed to produce both U and V, the pairwise registration errors caused by these two DDFs should be taken into account in the loss function. Thus, a symmetric Dice loss is designed as

    Loss_Dice = Dice(L_a, ˜L_t) + Dice(L_t, ˜L_a) + λ_1 (Ψ(U) + Ψ(V)),    (3)

where λ_1 is a hyperparameter, and Ψ(U) and Ψ(V) are smoothness regularizations for the DDFs.
Fig. 2.
The architecture of the consistent registration network.
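For concreteness, the losses of Eqs. (3)-(5), including the invertible term of the consistent constraint described below, can be sketched in NumPy as follows. Here Dice(·,·) is treated as a Dice loss (0 at perfect overlap, so Loss_inv → 0 for invertible DDFs, as in the text); the function names, array shapes, and the λ_1 = 0.3, λ_2 = 0.2 defaults (the values reported in the experiments) are illustrative, not the authors' implementation.

```python
import numpy as np

def soft_dice_loss(a, b, eps=1e-6):
    """Dice loss between two binary label maps: 0 means perfect overlap."""
    inter = np.sum(a * b)
    return 1.0 - (2.0 * inter + eps) / (np.sum(a) + np.sum(b) + eps)

def smoothness(ddf):
    """Psi(.): mean squared forward difference of a DDF shaped (3, D, H, W)."""
    return sum(np.mean(np.diff(ddf, axis=ax) ** 2) for ax in (1, 2, 3))

def loss_dice(L_a, L_t, L_a_warped, L_t_warped, U, V, lam1=0.3):
    """Eq. (3): symmetric Dice loss plus DDF smoothness regularization."""
    return (soft_dice_loss(L_a, L_t_warped) + soft_dice_loss(L_t, L_a_warped)
            + lam1 * (smoothness(U) + smoothness(V)))

def loss_inv(L_a, L_t, L_a_restored, L_t_restored):
    """Eq. (4): invertibility loss of the consistent constraint; 0 when
    warping by U then V (and V then U) restores the original labels."""
    return (soft_dice_loss(L_a_restored, L_a)
            + soft_dice_loss(L_t_restored, L_t))

def loss_reg(L_a, L_t, L_a_warped, L_t_warped,
             L_a_restored, L_t_restored, U, V, lam1=0.3, lam2=0.2):
    """Eq. (5): total trainable loss of the registration model."""
    return (loss_dice(L_a, L_t, L_a_warped, L_t_warped, U, V, lam1)
            + lam2 * loss_inv(L_a, L_t, L_a_restored, L_t_restored))
```

With identical labels and zero DDFs the total loss vanishes, matching the intuition that a perfect, invertible alignment is not penalized.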
Consistent Constraint:
The Loss_Dice only provides a voxel-level matching criterion for transformation field estimation. It is easily trapped in a local maximum due to the ambiguous correspondences in the voxel-level DDF. Inspired by the work of Christensen et al. [2], a consistent constraint is employed to encourage the uniqueness of the field, i.e., each voxel in L_a is mapped to only one voxel in L_t, and vice versa. To achieve this, an invertible loss Loss_inv is engaged to force each restored warped label L'_a (or L'_t) to be identical to its original label L_a (or L_t):

    Loss_inv = Dice(L'_a, L_a) + Dice(L'_t, L_t),    (4)

where L'_a(x) = ˜L_a(x + V(x)) and L'_t(x) = ˜L_t(x + U(x)). Ideally, Loss_inv is equal to 0 when U and V are the inverse of each other. Therefore, it can constrain the network to produce invertible DDFs. Finally, the total trainable loss of the registration model is

    Loss_reg = Loss_Dice + λ_2 Loss_inv.    (5)

Here, λ_2 is a hyperparameter of the model. As only anatomical labels are needed to train the network, the consistent registration network is naturally applicable to cross-modality registration.

2.2 Label Fusion Based on a Similarity Network

Based on the registration network, (I_a, L_a) can be deformed toward I_t to become the warped atlas (˜I_a, ˜L_a), where ˜L_a is a candidate segmentation of I_t. Given N atlases, the registration network produces N corresponding segmentations. Then, the target label of I_t is derived by combining the contributions of the warped atlases via the PLF strategy.

Fig. 3.
The architecture of the similarity network.

For a spatial point x, the optimal target label ¨L_t(x) is defined as

    ¨L_t(x) = argmax_{l ∈ {l_1, l_2, ..., l_k}} Σ_{i=1}^{N} w_i(x) δ(˜L_a^i(x), l),    (6)

where {l_1, l_2, ..., l_k} is the label set, w_i(x) is the contribution weight of the i-th warped atlas, and δ(˜L_a^i(x), l) is the Kronecker delta function. Usually, w_i(x) is measured according to the intensity similarity among local patches. Inspired by the idea of the prototypical method [14], there exists an embedding that can capture more discriminative features for similarity measurement. We design a convolutional network to map the original patches into a more distinguishable embedding space, so that similarities (contribution weights) can be computed according to the distance between the embedded atlas and target patches.

Figure 3 shows the architecture of the similarity network. It contains two convolutional pathways (f_ϕ and f_θ), which map the target and atlas patches into an embedding space separately. Following the prototypical method, we define the patch from the target image as the query patch (p_q), and the patches sampled from the warped atlases as support patches (p_s = {˜p_s^1, ˜p_s^2, ..., ˜p_s^M}). The similarity sim_i of p_q and ˜p_s^i is calculated as a softmax over the Euclidean distance between the embedded atlas patch f_θ(˜p_s^i) and target patch f_ϕ(p_q):

    sim_i = exp(−d(f_ϕ(p_q), f_θ(˜p_s^i))) / Σ_{m=1}^{M} exp(−d(f_ϕ(p_q), f_θ(˜p_s^m))).    (7)

Training Algorithm:
We explore training the similarity network using anatomical label information. Let y_i denote the ground-truth similarity between p_q and ˜p_s^i. The parameters of f_θ and f_ϕ can be optimized by minimizing the
Algorithm 1:
Pseudocode for training the similarity network
Input: (˜I_a, ˜L_a); (I_t, L_t); the batch size B; the number of training iterations C
Output: θ, ϕ

Initialize θ, ϕ
for c = 1 to C do
    J ← 0
    for b = 1 to B do
        (˜p_s^j, y_j), (˜p_s^k, y_k), p_q ← Sample((˜I_a, ˜L_a), (I_t, L_t), thr_1, thr_2)
        sim_j, sim_k ← Calculate(˜p_s^j, ˜p_s^k, p_q)   // see Eq. (7)
        J ← J − (1/B) Σ_{i∈{j,k}} y_i log(sim_i)
    end
    θ_new ← θ_old − ε ∇_θ J
    ϕ_new ← ϕ_old − ε ∇_ϕ J
end

cross-entropy loss (J) of the predicted and ground-truth similarities,

    J = − Σ_{i=1}^{M} y_i log(sim_i).    (8)

However, y_i is hard to obtain in cross-modality scenarios. To train the network, only support patches (˜p_s^i) that have a significant shape difference from, or similarity to, the query patch (p_q) are used, and their corresponding y_i is decided using the anatomical labels:

    y_i = 1 if Dice(l_q, ˜l_s^i) > thr_1;  y_i = 0 if Dice(l_q, ˜l_s^i) < thr_2,    (9)

where thr_1 and thr_2 are hard thresholds, and l_q and ˜l_s^i denote the anatomical labels of p_q and ˜p_s^i, respectively. The network is trained in a few-shot learning fashion: each training sample is composed of a query patch (p_q) and two support patches (˜p_s^j, ˜p_s^k) with significant shape differences (y_j ≠ y_k). In this way, the convolutional layers can learn to capture discriminative features for measuring cross-modality similarity. Algorithm 1 provides the pseudocode. For conciseness, the pseudocode only describes a one-atlas, one-target setup; the reader can easily extend it to N atlases and K targets in practice.

3 Experiments

Experiment setup:
We evaluated the framework on myocardial segmentation using the MM-WHS dataset [22]. The dataset provides 40 (20 CT and 20 MR) images with corresponding manual segmentations of the whole heart. For the cross-modality setup, MR (CT) images with their labels are used as the atlases, and CT (MR) images are treated as the targets. We randomly selected 24 (12 CT and 12 MR) images for training the registration network, and 8 (4 CT and 4 MR) images for training the similarity network. The remaining 8 (4 CT and 4 MR) images were used as test data. From each image, a 96 × 96 × 96 sub-image around the LV myocardium was cropped, and all sub-images were normalized to zero mean and unit variance. To improve performance, both affine and deformable transformations were adopted for data augmentation.
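The cropping and normalization step above can be sketched as follows; the function name and the `center` argument (e.g., a precomputed LV-myocardium centroid) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def crop_and_normalize(image, center, size=96):
    """Crop a size^3 sub-image around `center` (e.g., the LV myocardium
    centroid) and normalize it to zero mean and unit variance."""
    half = size // 2
    region = tuple(slice(c - half, c + half) for c in center)
    sub = image[region].astype(np.float64)
    return (sub - sub.mean()) / (sub.std() + 1e-8)
```

In practice the centroid would come from a rough localization of the myocardium, and centers closer than half the crop size to the image border would need padding.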
For training the registration network:
In each training iteration, a pair of CT-MR intensity images is fed into the registration network (see Figure 2). The network then produces U and V, with which the MR and CT labels can be warped to each other. Setting the hyperparameters λ_1 and λ_2 to 0.3 and 0.2, respectively, the total trainable loss (see Eq. (5)) of the network is calculated. Finally, the Adam optimizer is employed to train the parameters of the network. For training the similarity network:
For training the network, we extracted patches along the boundary of the LV myocardium (which usually covers different anatomical structures). In each training iteration, the patch size is set to 15 × 15 × 15 voxels, while thr_1 and thr_2 are set to 0.9 and 0.5, respectively (see Eq. (9)). A training sample (˜p_s^j, ˜p_s^k, p_q) is randomly selected and then mapped into the embedding space. Finally, the loss is accumulated and backpropagated to optimize the parameters of f_ϕ and f_θ (see Algorithm 1).

Table 1.
Comparison between the proposed registration network and other state-of-the-art methods.

Method               Dice (Myo)
Demons CT-MR [16]    36.1 ± … %
Demons MR-CT [16]    … ± … %
SyNOnly CT-MR [1]    … ± … %
SyNOnly MR-CT [1]    … ± … %
Our CT-MR            … ± … %
Our MR-CT            … ± … %

Table 2.
Comparison between the proposed MAS framework and other state-of-the-art methods.

Method            Dice (Myo)
U-Net [12]        86.1 ± … %
Seg-CNN [10]      … ± … %
MV MR-CT          84.4 ± … %
NLWV MR-CT [3]    … ± … %
Our MR-CT         … ± … %
Our CT-MR         … ± … %

Results:
The performance of the registration network is evaluated using the Dice score between the warped atlas label and the target gold-standard label. Table 1 shows the average Dice scores over 48 (12 atlases × 4 targets) registrations in each direction. Our method outperforms the conventional methods (SyNOnly [1] and Demons [16]). This is reasonable, as our method takes advantage of high-level information (anatomical labels) to train the registration model, which makes it more suitable for the challenging MM-WHS dataset.

Table 2 shows the results of three different MAS methods based on our registration network, i.e., MV, non-local weighted voting (NLWV) [3], and the proposed framework. Compared to other state-of-the-art methods [10,12], our framework achieves promising performance in cross-modality myocardial segmentation. Especially on MR images, compared to Seg-CNN [10], which won first place in the MM-WHS challenge, our framework improves the Dice score by almost 6%. However, our MR-CT result, which uses MR atlases to segment a CT target, is worse than other state-of-the-art methods. This is because the quality of the atlases affects MAS performance. Generally, MR is considered more challenging (lower-quality) data compared to CT [22]. The use of lower-quality MR atlases limits the segmentation accuracy of our MR-CT setup. Thus, Seg-CNN [10], which is trained purely on CT data, obtains an almost 3% better Dice score than our MR-CT method. In addition, Figure 4 demonstrates a series of intermediate results and segmentation details.

Figure 5 visualizes the performance of the similarity network. The target patch is randomly selected from a CT image, and the atlas patches are randomly cropped from MR images. Since the Dice coefficient computes the similarity of patches using gold-standard labels, it can be considered the gold standard for cross-modality similarity estimation. Results show that the estimated similarities correlate well with the Dice coefficient.
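To make the patch weighting and voting of Section 2.2 concrete, here is a minimal NumPy sketch of Eqs. (6)-(7), assuming the embeddings f_θ(·) and f_ϕ(·) have already been computed (plain 2-D vectors below); the function names and toy embeddings are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax_similarities(q_emb, s_embs):
    """Eq. (7): softmax over negative Euclidean distances between the
    embedded query patch and each embedded support patch."""
    d = np.array([np.linalg.norm(q_emb - s) for s in s_embs])
    e = np.exp(-(d - d.min()))  # shifting by d.min() keeps exp() stable
    return e / e.sum()

def plf_vote(voxel_labels, weights, label_set=(0, 1)):
    """Eq. (6): weighted plurality vote at a single voxel, given the label
    each warped atlas proposes there and its contribution weight w_i(x)."""
    score = {l: sum(w for lab, w in zip(voxel_labels, weights) if lab == l)
             for l in label_set}
    return max(score, key=score.get)

# Hypothetical 2-D embeddings: two support patches close to the query, one far.
sims = softmax_similarities(np.array([0.0, 0.0]),
                            [np.array([0.1, 0.0]),
                             np.array([3.0, 4.0]),
                             np.array([0.2, 0.0])])
label = plf_vote([1, 0, 1], sims)  # the two similar atlases dominate the vote
```

Shifting all distances by their minimum leaves the softmax unchanged but avoids underflow when distances are large, which matters for high-dimensional embeddings.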
4 Conclusion

We have proposed a cross-modality MAS framework to segment a target image using atlases from another modality. We have also described the consistent registration and similarity estimation algorithms based on DNN models. The experiments demonstrate that the proposed framework is capable of segmenting the myocardium from CT or MR images. Future research aims to extend the framework to other substructures of the whole heart and to investigate its performance on different datasets.
References
1. Avants, B.B., Tustison, N., Song, G.: Advanced normalization tools (ANTs). Insight Journal (365), 1–35 (2009)
2. Christensen, G.E., Johnson, H.J.: Consistent image registration. IEEE Transactions on Medical Imaging (7), 568–582 (2001)
3. Coupé, P., Manjón, J.V., Fonov, V., Pruessner, J., Robles, M., Collins, D.L.: Nonlocal patch-based label fusion for hippocampus segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 129–136. Springer (2010)
Fig. 4.
Visualization of the proposed framework. Images (a) and (b) are atlas images; (c) and (d) are the corresponding warped atlas images. All images are from the axial view. The segmentation details show different slices, where the region in the yellow box shows an error of our method, while the regions in the blue boxes indicate that the proposed method performs no worse than the gold standard. (The reader is referred to the colour web version of this article.)

4. Ding, Z., Han, X., Niethammer, M.: VoteNet: A deep learning label fusion method for multi-atlas segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 202–210. Springer (2019)
5. Heinrich, M.P., Jenkinson, M., Papież, B.W., Brady, M., Schnabel, J.A.: Towards realtime multimodal fusion for image-guided interventions using self-similarities. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 187–194. Springer (2013)
6. Hu, Y., Modat, M., Gibson, E., Li, W., Ghavami, N., Bonmati, E., Wang, G., Bandula, S., Moore, C.M., Emberton, M., et al.: Weakly-supervised convolutional neural networks for multimodal image registration. Medical Image Analysis, 1–13 (2018)
7. Iglesias, J.E., Sabuncu, M.R., Van Leemput, K.: A unified framework for cross-modality multi-atlas segmentation of brain MRI. Medical Image Analysis (8), 1181–1191 (2013)
8. Kasiri, K., Fieguth, P., Clausi, D.A.: Cross modality label fusion in multi-atlas segmentation. In: 2014 IEEE International Conference on Image Processing (ICIP). pp. 16–20. IEEE (2014)
9. Luan, H., Qi, F., Xue, Z., Chen, L., Shen, D.: Multimodality image registration by maximization of quantitative–qualitative measure of mutual information. Pattern Recognition (1), 285–298 (2008)
10. Payer, C., Štern, D., Bischof, H., Urschler, M.: Multi-label whole heart segmentation using CNNs and anatomical label configurations. In: International Workshop on Statistical Atlases and Computational Models of the Heart.
pp. 190–198. Springer (2017)
Target patch (CT); atlas patches (MR), each annotated with its Dice score and estimated similarity (e.g., Dice = 0.755; Dice = 0.309; sim = 0.443).
Fig. 5.
Visualization of estimated similarities from the proposed network. The Dice scores (Dice) are calculated with the gold-standard labels, while the similarities (sim) are estimated by feeding intensity patches into the similarity network. Please note that the sim values are normalized by the softmax function of the similarity network (see Eq. (7)). Ideally, sim should be positively related to Dice. This figure shows both correct (green box) and failed (red box) cases of our similarity estimation method.

11. Qin, C., Shi, B., Liao, R., Mansi, T., Rueckert, D., Kamen, A.: Unsupervised deformable registration for multi-modal images via disentangled representations. In: International Conference on Information Processing in Medical Imaging. pp. 249–261. Springer (2019)
12. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
13. Sanroma, G., Benkarim, O.M., Piella, G., Camara, O., Wu, G., Shen, D., Gispert, J.D., Molinuevo, J.L., Ballester, M.A.G., Initiative, A.D.N., et al.: Learning non-linear patch embeddings with neural networks for label fusion. Medical Image Analysis, 143–155 (2018)
14. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems. pp. 4077–4087 (2017)
15. Studholme, C., Hill, D.L., Hawkes, D.J.: An overlap invariant entropy measure of 3D medical image alignment. Pattern Recognition (1), 71–86 (1999)
16. Thirion, J.: Image matching as a diffusion process: an analogy with Maxwell's demons. Medical Image Analysis (3), 243–260 (1998)
17. Wachinger, C., Navab, N.: Entropy and Laplacian images: Structural representations for multi-modal registration. Medical Image Analysis (1), 1–17 (2012)
18. Wang, H., Suh, J.W., Das, S.R., Pluta, J.B., Craige, C., Yushkevich, P.A.: Multi-atlas segmentation with joint label fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (3), 611–623 (2012)
19. Warfield, S.K., Zou, K.H., Wells, W.M.: Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging (7), 903–921 (2004)
20. Xie, L., Wang, J., Dong, M., Wolk, D.A., Yushkevich, P.A.: Improving multi-atlas segmentation by convolutional neural network based patch error estimation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 347–355. Springer (2019)
21. Yang, H., Sun, J., Li, H., Wang, L., Xu, Z.: Neural multi-atlas label fusion: Application to cardiac MR images. Medical Image Analysis, 60–75 (2018)
22. Zhuang, X., Li, L., Payer, C., Štern, D., Urschler, M., Heinrich, M.P., Oster, J., Wang, C., Smedby, Ö., Bian, C., et al.: Evaluation of algorithms for multi-modality whole heart segmentation: An open-access grand challenge. Medical Image Analysis, 101537 (2019)
23. Zhuang, X., Shen, J.: Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Medical Image Analysis 31 (2016)