MSU-Net: Multiscale Statistical U-Net for Real-time 3D Cardiac MRI Video Segmentation
Tianchen Wang, Jinjun Xiong, Xiaowei Xu, Meng Jiang, Yiyu Shi, Haiyun Yuan, Meiping Huang, Jian Zhuang
University of Notre Dame: {twang9, xxu8, yshi4}@nd.edu
IBM Thomas J. Watson Research Center: [email protected]
Guangdong General Hospital: yhy [email protected], [email protected], [email protected]
Abstract.
Cardiac magnetic resonance imaging (MRI) is an essential tool for MRI-guided surgery and real-time intervention. The MRI videos are expected to be segmented on-the-fly in real practice. However, existing segmentation methods suffer from drastic accuracy loss when modified for speedup. In this work, we propose Multiscale Statistical U-Net (MSU-Net) for real-time 3D MRI video segmentation in cardiac surgical guidance. Our idea is to model the input samples as multiscale canonical form distributions for speedup, while the spatio-temporal correlation is still fully utilized. A parallel statistical U-Net is then designed to efficiently process these distributions. The fast data sampling and efficient parallel structure of MSU-Net enable fast and accurate inference. Compared with vanilla U-Net and a modified state-of-the-art method GridNet, our method achieves up to 268% and 237% speedup with 1.6% and 3.6% increased Dice scores, respectively.
1 Introduction

Real-time magnetic resonance imaging (MRI) techniques have been providing fast and accurate visual guidance in multiple fields. The duration of cardiac surgery (e.g., prosthetic valve implantation in the correct location at the aortic annulus) has been significantly shortened since interactive real-time MRI came into use [6]. Interventional real-time MRI has also been adopted for congenital, ischemic, and structural heart disease for its capacity of visualizing 3D anatomy and assessing myocardial tissue as well as local hemodynamics [2]. To achieve real-time MRI guidance, the images need to be segmented on-the-fly, at a speed of at least 30, and preferably up to 100, frames per second (FPS) [9,3].

However, performing real-time segmentation of cardiac MRI images is a challenging task. In addition to difficulties such as anisotropic resolution, cardiac border ambiguity, and large variations among targeting objects from patients [11], the requirement of real-time segmentation demands a lightweight and efficient processing framework. Existing approaches used complicated neural network architectures to achieve good accuracy and were not able to make inference in real time [4,12]. Recently, the Statistical Convolutional Neural Network (SCNN) was proposed to speed up conventional CNNs with little performance loss in video object detection [10]. Instead of feeding the input samples as deterministic values, SCNN uses Independent Component Analysis (ICA) to extract parameterized statistical distributions in canonical form that compactly model the temporally and contextually correlated information. The network model then propagates the distributions in canonical form more efficiently than deterministic values.

In this work, we propose Multiscale Statistical U-Net (MSU-Net) for real-time cardiac MRI segmentation.
We incorporate SCNN and a new multiscale data sampling method into the U-Net to capture the spatio-temporal correlation in the input data. Our model adopts a parallel architecture to efficiently propagate the multiscale distributions. Specifically, we apply ICA to multiple sets of temporal image patches to generate a cluster of canonical form distributions, each of which models the input data at a different scale. This multiscale sampling method preserves the spatio-temporal correlations at different scales. We then implement a number of parallel yet lightweight encoder-decoder style branches for efficient inference, each of which propagates a specific scale of canonical form distributions. Experimental results show that MSU-Net achieves up to 268% and 237% speedup with 1.6% and 3.6% increased Dice scores compared with vanilla U-Net and a modified state-of-the-art method GridNet [12], respectively.
2 Statistical Convolutional Neural Network

SCNN [10] was the first model that feeds CNNs with a reasonable number of statistical distributions decomposed from the input data. SCNN is lighter, and thus faster, than conventional CNNs that conduct deterministic operations (such as sum and max).

SCNN applies ICA to decompose video frames that exhibit spatio-temporal correlation into canonical form distributions as follows:

D = a_0 + a_1 X_1 + ... + a_m X_m + a_r R,    (1)

where (1) D is a random multivariate signal, which in video object detection represents the same pixel across multiple frames in a snippet; (2) a_0 is the mean value of D; (3) X_i (i ∈ {1, ..., m}) are additive independent subcomponents of D; (4) a_i (i ∈ {1, ..., m}) are the corresponding weights that act as the mixing matrix; (5) R denotes uncorrelated Gaussian noise and a_r is the weight of R; (6) m is the basis dimension of the canonical form distribution.

With the help of predefined core operations (weighted sum and max) that keep their outputs in canonical form distributions, SCNN needs little modification to the standard gradient descent based scheme. It can be trained using the same forward and backward propagation procedures as conventional CNNs. At the output, the results are mixed to form a temporal feature map for each sample by plugging in the values of the independent sources X_i from the ICA process. By processing multiple frames at a time through distributions, SCNN significantly speeds up object detection in videos over conventional CNNs with only slight accuracy degradation.
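To make the canonical form concrete, the following sketch (a simplified illustration with hypothetical sizes, not the authors' implementation) shows why propagating Equation 1 is cheap: a weighted-sum layer only needs to transform the coefficients (a_0, a_1, ..., a_m), since the independent sources X_i are shared and fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 4            # basis dimension of the canonical form
n_pixels = 16    # pixels represented by one distribution
t = 8            # frames in a snippet

# Fixed independent sources X_i (one value per frame), shared by all pixels.
X = rng.standard_normal((m, t))

# Canonical form of each pixel: mean a_0 plus mixing weights a_1..a_m.
a0 = rng.standard_normal(n_pixels)
A = rng.standard_normal((n_pixels, m))

# Deterministic view: the pixel values in every frame of the snippet.
D = a0[:, None] + A @ X            # shape (n_pixels, t)

# A "weighted sum" layer with weights W processes all t frames at once
# by transforming only the coefficients (a0, A), not the t frames.
W = rng.standard_normal((8, n_pixels))
out_a0 = W @ a0                    # new mean coefficients
out_A = W @ A                      # new mixing weights

# Mixing the propagated canonical form reproduces the per-frame result.
out_frames = out_a0[:, None] + out_A @ X
assert np.allclose(out_frames, W @ D)
```

The coefficient transform touches m + 1 columns instead of t frame columns, which is where the speedup comes from when m + 1 < t.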
3 Method

In this section, we first present a multiscale sampling method to extract canonical form distributions from input 3D MRI videos. Then we introduce the architecture of MSU-Net and explain how it processes these distributions for real-time segmentation.
3.1 Multiscale Data Sampling

In order to build linear distributions in parameterized canonical form (Equation 1) via ICA, we need to decide how to properly extract samples from the 3D MRI video to feed into ICA, i.e., what information each D should represent. In the approach of SCNN for video object detection [10], the video clips are resized and split into small snippets, and each distribution D models the same pixel across multiple frames in the same snippet. However, this cannot be applied to 3D MRI video directly, since many semantic details important to segmentation would be lost. Thus, we propose to use D to represent a patch within a small range (both spatially and temporally) where strong correlation exists. Specifically, we denote the dimensions of an input 3D MRI video as [X, Y, Z, T], where the X-Y plane is the short-axis plane, Z is the short axis, and T is the temporal dimension. The common issues of slice shifting as well as large inter-slice gaps along the short axis (Z axis) in cardiac MRI images [12] lead to minimal spatial correlation in the X-Z and Y-Z planes. Therefore, we extract patches within the dimensions [X, Y, T], independent of Z.

Before extracting the patches, the 3D MRI videos are first normalized to remove offsets among videos. Each patch is then extracted using a window of size (n, n) on the X-Y plane over t time steps. We call t the snippet span. We propose to allow different canonical forms to have different n and t, as such an approach covers potential spatio-temporal correlations at different scales. We call this cluster of distributions with multiple patch sizes multiscale distributions. An example of the extraction process of multiscale distributions with different patch sizes on one slice is shown in Fig. 1 (a). The patches are collected at the same position over time and fed to ICA to extract canonical form distributions. ICA has to be used because the propagation of the canonical form distributions requires all the bases to be independent. Other approaches such as PCA cannot guarantee this unless the samples follow Gaussian distributions, which is not the case in our problem. As a result, the snippet of 3D MRI video is "collapsed" into a smaller 3D image, with each voxel representing a canonical form distribution (Equation 1) that has both spatial and temporal correlations: with patch size (n, n, t) and a predetermined independent basis dimension d, a compression ratio of r = d/(n²t) is achieved.
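The patch collection step above can be sketched as follows (a minimal illustration with hypothetical sizes; `extract_patch_samples` is a helper written for this sketch, and in the actual pipeline each row would then be decomposed by ICA into a d-dimensional canonical form):

```python
import numpy as np

def extract_patch_samples(video, n, t):
    """Collect (n*n*t)-dimensional samples from a 3D MRI video of
    shape [X, Y, Z, T]. Patches are taken on the X-Y plane over t
    time steps, independently for each Z slice (minimal correlation
    along the short axis)."""
    Xdim, Ydim, Zdim, Tdim = video.shape
    samples = []
    for z in range(Zdim):
        for t0 in range(0, Tdim - t + 1, t):       # snippet span t
            for x in range(0, Xdim - n + 1, n):
                for y in range(0, Ydim - n + 1, n):
                    patch = video[x:x+n, y:y+n, z, t0:t0+t]
                    samples.append(patch.ravel())
    return np.stack(samples)                        # (N, n*n*t)

video = np.random.default_rng(0).random((28, 28, 4, 10))
n, t, d = 7, 5, 12            # d: independent basis dimension (hypothetical)
S = extract_patch_samples(video, n, t)
# Each row of S would be fed to ICA to obtain a d-dimensional
# canonical form; the compression ratio is r = d / (n^2 * t).
r = d / (n * n * t)
print(S.shape, round(r, 3))   # (128, 245) 0.049
```

With n = 7 and t = 5, each sample has 245 values, so a basis dimension of d = 24 or d = 6 would give the paper's r = 1/10 or r = 1/40 settings, respectively.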
Fig. 1: (a) Illustration of multiscale data sampling from a cropped 3D MRI video. Each canonical form is extracted using the samples from the same position in the X-Y plane, collected at different time steps. Different canonical forms can have different patch sizes. (b) Restored inputs with the multiscale data sampling method at different time steps (t) using different compression ratios (r). The restored inputs with r = 1/40 have more noise than those with r = 1/10.

To show the feasibility of the proposed data sampling, we extract the multiscale canonical form distributions using our procedure with various compression ratios r, by changing the basis dimension d with n = 7 and t = 5. The visual results along with the compression ratios are shown in Fig. 1 (b). With higher compression (r = 1/40, i.e., a smaller basis dimension of the canonical form distribution), the restored video gains more noise with vague contours, which would hinder the segmentation task. With lower compression (r = 1/10), the contours are well preserved, so we use r = 1/10 as the compression ratio in our following experiments.
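The compression/noise trade-off can also be checked numerically. The sketch below restores samples from d basis vectors and compares the relative reconstruction error at r = 1/10 and r = 1/40. Note that truncated SVD is used here purely as a stand-in for ICA (the paper requires ICA so that the bases are independent), and the synthetic data and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated samples: rows stand in for flattened (n, n, t) patches.
n, t = 7, 5
latent = rng.standard_normal((500, 10))
mix = rng.standard_normal((10, n * n * t))
S = latent @ mix + 0.05 * rng.standard_normal((500, n * n * t))

def restore(S, d):
    """Project samples onto d basis vectors, then mix back
    (truncated SVD as an illustrative stand-in for ICA)."""
    mean = S.mean(axis=0)
    U, sv, Vt = np.linalg.svd(S - mean, full_matrices=False)
    coeffs = (S - mean) @ Vt[:d].T     # d-dimensional coefficients
    return coeffs @ Vt[:d] + mean      # mixing step

for r in (1 / 10, 1 / 40):
    d = max(1, round(r * n * n * t))   # from r = d / (n^2 * t)
    err = np.linalg.norm(S - restore(S, d)) / np.linalg.norm(S)
    print(f"r=1/{round(1/r)}  d={d}  relative error={err:.3f}")
```

As in Fig. 1 (b), the reconstruction at r = 1/40 is visibly noisier (larger relative error) than at r = 1/10, because the smaller basis dimension drops part of the signal.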
3.2 MSU-Net Architecture

The multiscale canonical form distributions provide a compact data representation for efficient processing. In this subsection, we explore a parallel structure, namely MSU-Net, that can further speed up the segmentation.

Fig. 2 illustrates our MSU-Net, which consists of multiple DownTubes (DTs), UpTubes (UTs), Center blocks, and a final evaluator (FE). The DTs and UTs act as the encoders and decoders in U-Net for feature propagation. Multiple DTs are built for a set of splitting patch sizes, each consisting of multiple blocks with downscaling convolution layers to perform feature downscaling and reuse. The ICA process and the corresponding mixing operations are performed before and after the operations in a DT, respectively, and the operations in a DT are carried out on canonical form distributions, similar to SCNN [10]. The features in a UT are propagated and upscaled with blocks made of convolutional layers and transposed convolutional layers, respectively. The features after each upscaling are concatenated with the ones skipped from the DT for feature reuse. After the outputs are obtained from the UTs with various patch sizes, all features have the same dimensions; they are then concatenated and forwarded to the final evaluator to generate the final output. The number of blocks in a DT/UT varies to accommodate the input dimensions of the 3D images.

Fig. 2: The architecture of MSU-Net. The number of blocks in a DownTube/UpTube varies to accommodate the various input dimensions.
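The shape flow through one DT/UT pair can be sketched with simple bookkeeping (block counts, channel widths, and the halve-spatial/double-channel convention below are illustrative assumptions for this sketch, not the paper's exact configuration):

```python
def down_tube(shape, n_blocks, base_ch):
    """One DownTube: each Block_D halves the spatial size and doubles
    the channels (hypothetical convention); returns the bottom feature
    shape plus the skip shapes kept for the UpTube."""
    ch, h, w = base_ch, shape[1], shape[2]
    skips = []
    for _ in range(n_blocks):
        skips.append((ch, h, w))           # feature kept for the skip
        ch, h, w = ch * 2, h // 2, w // 2  # Conv, 3x3, /2
    return (ch, h, w), skips

def up_tube(shape, skips):
    """One UpTube: a transposed conv upscales by 2, then the skipped
    DownTube feature is concatenated along the channel axis (CC)."""
    c, h, w = shape
    for skip_c, skip_h, skip_w in reversed(skips):
        c, h, w = c // 2, h * 2, w * 2     # TransConv, 3x3, stride 2
        assert (h, w) == (skip_h, skip_w)  # spatial dims must match
        c += skip_c                        # channel concatenation
    return (c, h, w)

# One branch for a single patch scale; all numbers are illustrative.
bottom, skips = down_tube((16, 224, 224), n_blocks=3, base_ch=16)
out = up_tube(bottom, skips)
print(bottom, out)   # (128, 28, 28) (64, 224, 224)
```

Because every branch ends at the same spatial resolution, the per-scale outputs can be concatenated and passed to the final evaluator, as described above.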
4 Experiments

The evaluation task is to segment the right ventricle (RV), myocardium (MYO), and left ventricle (LV) from MRI video clips in real time. We evaluate the proposed MSU-Net and competitive baselines on segmenting the RV, MYO, and LV from the frames at the End Diastolic (ED) and End Systolic (ES) instants. These frames were collected from the ACDC MICCAI 2017 challenge dataset [1] with additional labeling done by experienced radiologists. These frames have similar properties to 3D cardiac MRI videos. The dataset has 150 exams from different patients, with 100 for training and 50 for testing. The images were collected following the common clinical SSFP cine-MRI sequence with a series of short-axis
slices starting from the mitral valves down to the apex of the left ventricle. We perform 5-fold cross-validation and use the Dice score to evaluate the segmentation accuracy.

We implement two versions of MSU-Net with specific snippet spans (t = 5 and t = 10, denoted as T5 and T10, respectively) for evaluation. The ICA processing time is included when we evaluate the inference time of MSU-Net. Existing approaches have reported their FPS on the same dataset: ∼ . Table 1 presents the comparison among U-Net, GridNet, and the proposed MSU-Net on Dice score and FPS. Our MSU-Nets achieve the fastest processing speed (highest FPS) and the best Dice score. Compared with the fastest baseline method, U-Net (D3+IF8), our MSU-Net (T10) runs 1.63× faster and improves the segmentation accuracy by 26%. Compared with the most accurate baseline method (with the highest Dice score), GridNet (D3+IF32), our MSU-Net (T5) achieves a slightly higher accuracy with a 2.75× faster processing speed. From the table, it is clear that MSU-Net is the only method capable of segmenting 3D MRI videos in real time.

Table 1: Comparison between the baseline methods and our MSU-Net on Dice score and FPS for 3D MRI video segmentation. "T5"/"T10" denotes the video snippet span in MSU-Net (t = 5 / t = 10). "D" and "IF" denote the depth of the network and the number of initial filters of the input layer, respectively.

| Methods           | FPS  | RV (Dice)   | MYO (Dice)  | LV (Dice)   | Average     |
|-------------------|------|-------------|-------------|-------------|-------------|
| GridNet (D3+IF32) | 15.7 | .842 ± .028 | .804 ± .026 | .901 ± .036 | .849 ± .014 |
| U-Net (D5+IF64)   | 16.1 | .865 ± .036 | .761 ± .039 | .911 ± .026 | .846 ± .025 |
| GridNet (D2+IF32) | 18.2 | .815 ± .025 | .812 ± .014 | .851 ± .033 | .826 ± .011 |
| U-Net (D3+IF16)   | 33.2 | .564 ± .071 | .738 ± .045 | .767 ± .026 | .690 ± .036 |
| U-Net (D3+IF8)    | 43.2 | .552 ±      | ± .060      | .759 ± .059 | .662 ± .058 |
| MSU-Net (T5)      |      | .855 ± .026 | .836 ± .022 | .897 ± .017 | .862 ± .011 |
| MSU-Net (T10)     |      | .837 ± .034 | .811 ± .049 | .854 ± .040 | .834 ± .020 |

For MSU-Net, a bigger video snippet span (T10) obtains a faster processing speed with a slight accuracy degradation (only 0.028). However, for U-Net, when it is modified into shallow/slim versions such as U-Net (D3+IF16) and U-Net (D3+IF8) for real-time processing (≥ 30 FPS), the accuracy degrades significantly: we observe that the accuracy drops from 0.846 to 0.690 and 0.662, respectively. We observe the same pattern for GridNet, and conclude that MSU-Net can achieve a stable accuracy when configured for segmentation in real time.

Finally, Fig. 3 shows examples of the MSU-Net segmentation results at various time steps. Note that our MSU-Net can accurately segment the target areas. The boundaries are clearly extracted on most of the slices. In the base and middle slices, the segmentation fits the contours of the targets. In some of the apex slices, the segmentation of the RV (labeled in blue) is not as accurate as that of the MYO and LV, because of the unclear boundaries between the instances.

Fig. 3: The segmentation results of our method MSU-Net (T5) on the testing data. The rows indicate the slices at the base, the middle, and the apex of the LV. The columns show the results at various time steps (t = 1, 5, 10, 15, 20, 25). RV, MYO, and LV are labeled in blue, green, and red, respectively.
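The Dice score used throughout the evaluation is the standard per-class overlap measure, 2|P ∩ G| / (|P| + |G|). A minimal implementation (a sketch of the standard definition, not the authors' evaluation code; labels and masks below are toy examples) is:

```python
import numpy as np

def dice_score(pred, target, label):
    """Dice coefficient for one class label:
    2|P ∩ G| / (|P| + |G|)."""
    p = (pred == label)
    g = (target == label)
    denom = p.sum() + g.sum()
    return 1.0 if denom == 0 else 2.0 * (p & g).sum() / denom

# Toy masks with labels 0 = background, 1 = RV, 2 = MYO, 3 = LV.
target = np.zeros((4, 4), dtype=int)
target[:2, :2] = 1
pred = np.zeros((4, 4), dtype=int)
pred[:2, :3] = 1                      # over-segments RV slightly

print(round(dice_score(pred, target, label=1), 3))   # 0.8
```

Per-structure scores (RV, MYO, LV) are obtained by calling the function once per label and are then averaged, matching the per-class columns of Table 1.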
5 Conclusion

In this paper, we proposed Multiscale Statistical U-Net (MSU-Net) for real-time 3D cardiac MRI video segmentation. Based on the scheme of the Statistical Convolutional Neural Network, we model the input samples as multiscale canonical form distributions for speedup, while the spatio-temporal correlation is still fully utilized. A parallel statistical U-Net is then proposed to process these multiscale distributions efficiently. On the 3D cardiac MRI videos from the ACDC MICCAI 2017 dataset, MSU-Net achieves up to 268% and 237% speedup with 1.6% and 3.6% increased Dice scores compared with vanilla U-Net and a modified state-of-the-art method GridNet, respectively.
Acknowledgments

This work was approved by the Research Ethics Committee of Guangdong General Hospital, Guangdong Academy of Medical Sciences, with protocol No. 20140316. This work was supported by the National Key Research and Development Program [2018YFC1002600], the Science and Technology Planning Project of Guangdong Province, China [No. 2017A070701013, 2017B090904034, 2017030314109, and 2019B020230003], National Science Foundation grant [CCF-1919167], and the Guangdong peak project [DFJH201802].
References
1. Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Ballester, M.A.G., et al.: Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Transactions on Medical Imaging (11), 2514–2525 (2018)
2. Campbell-Washburn, A.E., Tavallaei, M.A., Pop, M., Grant, E.K., Chubb, H., Rhode, K., Wright, G.A.: Real-time MRI guidance of cardiac interventions. Journal of Magnetic Resonance Imaging (4), 935–950 (2017)
3. Iltis, P.W., Frahm, J., Voit, D., Joseph, A.A., Schoonderwaldt, E., Altenmüller, E.: High-speed real-time magnetic resonance imaging of fast tongue movements in elite horn players. Quantitative Imaging in Medicine and Surgery (3), 374 (2015)
4. Isensee, F., Jaeger, P.F., Full, P.M., Wolf, I., Engelhardt, S., Maier-Hein, K.H.: Automatic cardiac disease assessment on cine-MRI via time-series segmentation and domain specific features. In: International Workshop on Statistical Atlases and Computational Models of the Heart. pp. 120–129. Springer (2017)
5. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 116–131 (2018)
6. McVeigh, E.R., Guttman, M.A., Lederman, R.J., Li, M., Kocaturk, O., Hunt, T., Kozlov, S., Horvath, K.A.: Real-time interactive MRI-guided cardiac surgery: Aortic valve replacement using a direct apical approach. Magnetic Resonance in Medicine (5), 958–964 (2006)
7. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
8. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4510–4520 (2018)
9. Schaetz, S., Voit, D., Frahm, J., Uecker, M.: Accelerated computing in magnetic resonance imaging: Real-time imaging using nonlinear inverse reconstruction. Computational and Mathematical Methods in Medicine (2017)