A Deep-Learning Approach For Direct Whole-Heart Mesh Reconstruction
Fanwei Kong a,∗, Nathan Wilson b, Shawn C. Shadden a,∗

a Mechanical Engineering Department, University of California, Berkeley, Berkeley, CA 94709
b Open Source Medical Software Corporation, Santa Monica, CA

∗ Corresponding author. E-mail: [email protected] (Fanwei Kong), [email protected] (Nathan Wilson), [email protected] (Shawn C. Shadden)
ABSTRACT

Automated construction of surface geometries of cardiac structures from volumetric medical images is important for a number of clinical applications. While deep-learning based approaches have demonstrated promising reconstruction precision, these approaches have mostly focused on voxel-wise segmentation followed by surface reconstruction and post-processing techniques. However, such approaches suffer from a number of limitations, including disconnected regions or incorrect surface topology due to erroneous segmentation, and stair-case artifacts due to limited segmentation resolution. We propose a novel deep-learning-based approach that directly predicts whole heart surface meshes from volumetric CT and MR image data. Our approach leverages a graph convolutional neural network to predict deformation on mesh vertices from a pre-defined mesh template to reconstruct multiple anatomical structures in a 3D image volume. Our method demonstrated promising performance in generating high-resolution and high-quality whole heart reconstructions and outperformed prior deep-learning based methods on both CT and MR data in terms of precision and surface quality. Furthermore, our method can more efficiently produce temporally-consistent and feature-corresponding surface mesh predictions for heart motion from CT or MR cine sequences, and therefore can potentially be applied for efficiently constructing 4D whole heart dynamics.

Keywords:
Whole heart segmentation, surface mesh reconstruction, graph convolutional networks, deep learning
1. Introduction
Three-dimensional (3D) geometries of anatomical structures reconstructed from volumetric medical images are increasingly used for a number of clinical applications, such as patient-specific visualization (González Izard et al., 2020), physics-based simulation, virtual surgery planning and morphology assessment (Prakosa et al., 2018; Bucioli et al., 2017). As cardiovascular diseases are the leading causes of mortality, one area of research that currently receives considerable attention is computational modeling and visualization of the heart from patient-specific image data (Prakosa et al., 2018; Chnafa et al., 2016; Khalafvand et al., 2012).

Creating accurate patient-specific models of the whole heart from image data has traditionally required significant time and human effort, limiting clinical applications and high-throughput, large-cohort analyses of patient-specific cardiac functions (Mittal et al., 2015). While surface representation of the whole heart is important for the aforementioned applications, most studies have focused on automatic whole heart segmentation rather than direct surface reconstruction (Payer et al., 2018; Bai et al., 2015; Ye et al., 2019). Nevertheless, accurate and reliable automatic whole heart segmentation remains an ongoing challenge and an active research direction (Zhuang et al., 2019; Zhuang, 2013; Peng et al., 2016). Much of the challenge is related to the complex geometries of the heart, large structural deformation over the cardiac cycle, difficulties in differentiating individual cardiac structures from each other and the surrounding tissue, as well as variations across individuals and different imaging modalities.

Prior cardiac model construction efforts have typically adopted a multistage approach whereby 3D segmentations of cardiac structures are first obtained from image volumes, meshes of the segmented regions are then generated using marching cube algorithms, and finally manual surface post-processing or editing is performed (Lorensen and Cline, 1987; Kong and Shadden, 2020; Maher et al., 2019; Augustin et al., 2016). The quality of reconstructed surfaces highly depends on the quality of the segmentation and the complexity of the anatomical structures. Automatic heart segmentation has been a popular research topic and previously published algorithms have been summarized in detail (Zhuang, 2013; Zhuang et al., 2019; Peng et al., 2016; Habijan et al., 2020). Generally, there are two common approaches to whole heart segmentation: multi-atlas segmentation (MAS) (Bai et al., 2015; Zhuang et al., 2015; Zhuang and Shen, 2016) and deep learning (DL) based segmentation (Ronneberger et al., 2015; Çiçek et al., 2016). Compared with MAS, DL based approaches have become more popular as they have demonstrated higher segmentation precision (Zhuang et al., 2019; Payer et al., 2018) and are much faster in practice. While a couple of recent studies have reduced the processing time of MAS approaches down to a couple of minutes (Bui et al., 2020; Bui et al., 2020), DL based approaches can generally produce a whole heart segmentation within a couple of seconds. However, while DL based methods may produce segmentations that achieve high average voxel-wise accuracy, they can contain extraneous regions and other nonphysical artifacts.
Correcting such artifacts would require a number of carefully designed post-processing steps and sometimes manual effort (Kong and Shadden, 2020). Indeed, since CNN-based segmentation methods are based on classification of each image voxel into a particular tissue class, the neural networks are often trained to reduce voxel-wise discrepancy between the predicted segmentation and the ground truth, and therefore lack awareness of the overall anatomy and topology of the target organs. Moreover, CNN-based 3D segmentation methods are memory intensive, and therefore require downsampling of the data to fit within memory and can only generate segmentation with limited resolution. However, high-resolution geometries are often required for down-stream applications such as computational simulations, and direct or low-resolution segmentation will often produce surfaces with staircase artifacts that require additional post-processing (Wei et al., 2018, 2013; Updegrove et al., 2016).

Compared with representing whole heart geometries as segmentations on a dense voxel grid, representing the geometries as meshes is more compact, as only point coordinates on the organ boundaries need to be stored. This advantage may enable efficient reconstruction of high-resolution surface meshes on a limited memory budget and avoid the stair-case artifacts of surfaces constructed from low-resolution 3D segmentation. Moreover, for low-resolution input images, a voxel-wise segmentation would be a coarse representation of the underlying cardiac structures, but a surface mesh representation can still function as a smoother and more realistic representation of the shapes, as the mesh vertices are defined in a continuous coordinate space and do not have to align with the input voxel grid.

Some studies have adopted a model-based approach to directly fit surface meshes of the heart to target images (Ecabert et al., 2008; Ecabert et al., 2011; Peters et al., 2010). Such approaches deform a template mesh using local optimization to match tissue boundaries in the input images. However, they are often sensitive to initialization and require complicated steps and manual effort to construct a mean template of the heart. In contrast, deep-learning methods do not require test-time optimization and therefore may be more effective at solving the mesh deformation problem.

Recent progress on geometric deep learning has extended the concepts of convolutional neural networks to irregular graphs (Defferrard et al., 2016; Bronstein et al., 2017). Recent deep-learning based approaches have shown promise in reconstructing shapes as surface meshes from image data using graph convolutional neural networks (Wang et al., 2020a; Wen et al., 2019; Pontes et al., 2019). However, these approaches have focused on reconstructing a single shape from a 2D camera image and thus cannot be directly applied to reconstructing multiple anatomical structures from volumetric medical image data. A recent study from Wickramasinghe et al. (2020) extended the work of Wang et al. (2020a) to 3D volumetric medical image data and demonstrated improved segmentation results. However, their method demonstrated success only on simple geometries such as the liver, hippocampus and synaptic junction, but not on the whole heart, which involves multiple cardiac structures with very different shapes.

To overcome these shortcomings, we explore the problem of using a deep-learning-based approach to directly predict surface meshes of multiple cardiac structures from volumetric image data.
Our approach leverages a graph convolutional neural network to predict deformation on mesh vertices from a pre-defined mesh template to fit multiple anatomical structures in a 3D image volume. The mesh deformation is conditioned on image features extracted by a CNN-based image encoder. Since cardiac structures such as heart chambers are diffeomorphic to a sphere, we use spheres as our initial mesh templates, which can be considered a topological prior of the cardiac structures. Compared with classification-based approaches, our approach can reduce extraneous regions that are anatomically inconsistent. Using a generic initial mesh also enables our approach to be easily adapted to other anatomical structures.

The key contributions of our work are as follows:

1. We propose the first end-to-end deep-learning-based approach for predicting multiple anatomical structures in the form of surface meshes from 3D image data. We show that our method was able to produce whole-heart geometries from both CT and MR images with better accuracy compared to classification-based approaches.

2. We investigate and compare the impact of dataset size and variability on the whole-heart reconstruction performance of different methods. When trained on both small and larger training datasets, our method demonstrated better Dice scores for most of the reconstructed cardiac structures than prior approaches.

3. As cardiac MR image data often have large variation across different data sources, we compare different methods and demonstrate the advantage of our approach on MR images with varying through-plane resolution as well as on low-resolution MR images that differ significantly from our training datasets.

4. Since our approach predicts deformation from a template mesh, we show that our reconstructions generally have point correspondence across different time frames and different patients by consistently mapping mesh vertices on the templates to similar structural regions of the heart. We demonstrate the potential application of our method to efficiently constructing 4D whole heart dynamics that capture the motion of a beating heart from a time-series of images.
2. Methods
Since cardiac medical image data is sensitive to a number of factors, including differences in vendors, modalities and acquisition protocols across clinical centers, deep-learning based methods can be easily biased by these factors. Therefore, we aimed to develop our models using whole heart image data collected from different sources, vendors and imaging modalities. We included data from four existing public datasets that contain contrast-enhanced CT images or MR images covering the whole heart. These four datasets are from the multi-modality whole heart segmentation challenge (MMWHS) (Zhuang et al., 2019), the OrCaScore challenge (Wolterink et al., 2016), the left atrial wall thickness challenge (SLAWT) (Karim et al., 2018) and the left atrial segmentation challenge (LASC) (Tobon-Gomez et al., 2015). The use of such diverse data enables us not only to better evaluate the reconstruction accuracy of our trained model but also to evaluate the impact of dataset size and variability on model performance.

Additional time-series CT and MR images were collected to evaluate the performance of our trained neural network models on time-series image data acquired from data sources different from the training data. The time-series CT data were from 10 patients with left ventricular diastolic dysfunction. The 9 sets of cine cardiac MR data were from 5 healthy subjects and 4 patients with cardiac diseases. All data was de-identified and previously collected for other purposes. The details of the datasets used and collected are described in the following sub-sections and summarized in Table 1. We followed the same method as Zhuang et al. (2019) to manually delineate seven cardiac structures: LV, LA, RA, RV, myocardium, aorta and pulmonary artery for the collected image data that did not have ground truth annotations of the whole heart.

Our framework consists of three components to predict the whole-heart meshes from a volumetric input image: (1) an image encoding module that extracts and encodes image features, (2) a mesh deformation module that combines features from images and meshes to predict deformation of mesh vertices, and (3) a segmentation module that predicts a binary segmentation map to allow additional supervision using ground truth annotations. Figure 1 shows the overall architecture.
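For concreteness, the interaction of the three components can be summarized in code. The following is a minimal PyTorch-style sketch, not our actual implementation; the module names (ImageEncoder, SegmentationDecoder, deformation blocks) and interfaces are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WholeHeartReconstructor(nn.Module):
    """Sketch of the three-module architecture: (1) image encoder,
    (2) mesh deformation blocks, (3) binary segmentation decoder."""

    def __init__(self, encoder, seg_decoder, deform_blocks, template_verts):
        super().__init__()
        self.encoder = encoder                              # (1) 3D CNN encoder
        self.seg_decoder = seg_decoder                      # (3) occupancy decoder
        self.deform_blocks = nn.ModuleList(deform_blocks)   # (2) GCN blocks
        # Pre-positioned template vertex coordinates, shape (N_total, 3).
        self.register_buffer("template_verts", template_verts)

    def forward(self, image):
        # Multi-resolution volumetric feature maps, most abstract first.
        feature_maps = self.encoder(image)
        # Dense auxiliary supervision: binary whole-heart occupancy map.
        seg_logits = self.seg_decoder(feature_maps)
        # Progressively deform the template; abstract features drive early
        # blocks, high-resolution features drive later blocks.
        verts = self.template_verts
        predictions = []
        for block, feats in zip(self.deform_blocks, feature_maps):
            verts = block(verts, feats)   # updated vertex coordinates
            predictions.append(verts)
        return predictions, seg_logits
```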
Table 1. Summary of data characteristics for whole heart CT and MR data included.
CT data:
  MMWHS (Zhuang et al., 2019):        Vendor: Philips; time-series images: N/A; public
  OrCaScore (Wolterink et al., 2016): Vendor: GE, Philips, Siemens and Toshiba; time-series images: N/A; public
  SLAWT (Karim et al., 2018):         Vendor: Philips Achieva 256 iCT; time-series images: N/A; public
  Time-series CT:                     Vendor: GE; time-series images: 100; private

MR data:
  MMWHS (Zhuang et al., 2019):        Vendor: 1.5T Philips and 1.5T Siemens Magnetom Avanto; time-series images: N/A; public
  LASC (Tobon-Gomez et al., 2015):    Vendor: 1.5T Philips Achieva; time-series images: N/A; public
  cine MR:                            Vendor: 1.5T Philips; time-series images: 50; private
For an input image, the image encoding module uses a series of 3D convolutional layers to extract volumetric image feature maps at multiple resolutions. These feature maps are required by the following mesh deformation module to predict whole-heart geometries. Therefore, the image encoder should both be effective for better geometric reconstruction and be memory-efficient enough to process a 128 × 128 × 128 volumetric input image in a single pass. Our image feature encoder is based on an improved 3D UNet architecture that was designed to work effectively for large volumetric image data (Isensee et al., 2018). Briefly, the feature encoder architecture consists of multiple levels of residual blocks that encode increasingly abstract representations of the input. Residual connections are known to facilitate gradient propagation during training and improve generalization (He et al., 2016). Each residual block contains two 3 × 3 × 3 convolutional layers.

While our purpose is to reconstruct surface meshes directly from image data, the ground truth segmentation can function as additional supervision to the network to further facilitate training. From our experiments, including the segmentation module helped avoid non-manifold geometries due to local minima and thus improved reconstruction accuracy. Since the ground truth mesh is a sparse representation of the cardiac structures compared with a volumetric segmentation, including the segmentation as a dense supervision with skip connections to the image feature encoder can improve gradient propagation to the image encoding module to better interpret the full volumetric input data. However, since we are only interested in reconstructing meshes, rather than predicting segmentations for all cardiac structures, our segmentation module is trained to predict only a binary segmentation representing the occupancy of the heart in the input image. The adopted network architecture is simplified from the decoder architecture of Isensee et al. (2018), with only a small number of filters in the convolutional layers. Briefly, the segmentation module contains multiple levels of decoder convolutional blocks that correspond to the residual blocks of the image encoding module to reconstruct the segmentation from extracted features. Following a 3 × 3 × 3 convolutional layer, segmentation predictions are generated at the different levels of the segmentation module and added together to form the final prediction.
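A residual block of the kind described above might look as follows; this is a sketch under assumed normalization and activation choices, not the exact configuration of Isensee et al. (2018).

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Two 3x3x3 convolutions with an identity (residual) connection,
    as used at each level of the image encoding module (a sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.norm1 = nn.InstanceNorm3d(channels)
        self.norm2 = nn.InstanceNorm3d(channels)
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.norm1(self.conv1(x)))
        out = self.norm2(self.conv2(out))
        # The residual connection eases gradient propagation through
        # deep stacks of blocks (He et al., 2016).
        return self.act(out + x)
```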
Fig. 1. Diagram of the proposed automatic whole heart reconstruction approach. The framework uses 3D convolutional layers (shown in blue) to encode image features and predict a binary segmentation map from an input image volume. The corresponding image features are sampled by pooling layers (shown in orange) based on the vertex coordinates of the template mesh. From the combined image and mesh features, graph convolutional layers (shown in green) are then used to predict the deformation of mesh vertices to generate the final mesh predictions.

Our neural network uses graph convolutions on a template mesh to predict deformation vectors on its vertices. Unlike for structured data such as images, convolution in the spatial domain is not well defined for manifold structures such as meshes. Therefore, we apply graph convolution in the frequency domain following recent progress in graph convolutional neural networks (Bronstein et al., 2017; Defferrard et al., 2016). Briefly, our template mesh is represented by a graph M = (V, E), where V = {v_i}_{i=1..N} is the set of N mesh vertices and E = {e_i}_{i=1..E} is the set of E edges that define the connections among mesh vertices. The graph adjacency matrix A ∈ {0, 1}^(N×N) is a sparse matrix that defines the connection between each pair of vertices, with A_ij = A_ji = 1 if vertices i and j are connected and 0 otherwise. D is a diagonal matrix that represents the degree of each vertex, with D_ii = Σ_j A_ji. The graph Laplacian matrix is therefore a real and symmetric matrix defined as L = D − A, which can then be normalized as L_norm = I − D^(−1/2) A D^(−1/2). The Laplacian matrix can be diagonalized by the Fourier basis of the graph, U ∈ R^(N×N), as L_norm = U Λ U^T. The columns of U are the orthogonal eigenvectors of L_norm and Λ is a diagonal matrix containing the corresponding eigenvalues. The Fourier transform of a function defined on mesh vertices, f ∈ L²(V), is thus described by f̂ = U^T f and the inverse Fourier transform is f = U f̂. Therefore, convolution between f and g ∈ L²(V) is described as f ∗ g = U((U^T f) ⊙ (U^T g)). If we parameterize g with learnable weights, a graph convolution layer can then be defined as f_out = σ(U g_θ(Λ) U^T f_in), where f_in and f_out are the input and output and σ is the ReLU activation function.

However, the above expression is computationally expensive for meshes with a large number of vertices, since U is not sparse and the number of parameters required can be as many as the number of vertices. Therefore, we followed Defferrard et al. (2016) to approximate g_θ(Λ) using Chebyshev polynomials and thus have U g_θ(Λ) U^T = Σ_{k=0..K} β_k T_k(L̃), where L̃ is the scaled sparse Laplacian matrix L̃ = 2 L_norm / λ_max − I. β_k is the parameter for the k-th order Chebyshev polynomial, and T_k is the k-th order polynomial that can be computed recursively as T_0 = I, T_1 = L̃ and T_k(L̃) = 2 L̃ T_{k−1}(L̃) − T_{k−2}(L̃). We chose K = 1 to effectively avoid fitting the noise on our ground truth surfaces and to reduce the number of parameters to learn. Therefore, the graph convolution on a mesh using a first-order Chebyshev polynomial approximation is described as f_out = σ(β_0 f_in + β_1 L̃ f_in), where β_0, β_1 ∈ R^(d_{l+1} × d_l) are trainable weights.
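In code, this first-order approximation reduces to two matrix products per layer. The following is a minimal PyTorch sketch (PyTorch itself is an assumption here), taking a precomputed scaled Laplacian L̃ as a sparse tensor:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """First-order Chebyshev graph convolution (a sketch):
    f_out = ReLU(f_in @ B0 + (L_tilde @ f_in) @ B1)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.beta0 = nn.Linear(in_dim, out_dim, bias=False)
        self.beta1 = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, f_in, L_tilde):
        # f_in: (N_vertices, in_dim) vertex features.
        # L_tilde: sparse (N, N) scaled Laplacian, 2*L_norm/lambda_max - I.
        neighbor_term = torch.sparse.mm(L_tilde, f_in)
        return self.act(self.beta0(f_in) + self.beta1(neighbor_term))
```

Because L̃ is sparse, the cost per layer scales with the number of mesh edges rather than N², which is what makes high-resolution templates tractable.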
Our method uses a single network to simultaneously deform multiple sphere templates to the corresponding cardiac structures in the input image. Since the relative locations and scales of the different cardiac structures of the heart are generally consistent across the population, we leverage this prior knowledge in our neural network by scaling and positioning each initial sphere mesh template based on the relative size and location of the corresponding cardiac structure. We then used a graph convolution layer to augment the coordinates of the initial meshes such that they have a contribution comparable to the image features, in terms of the length of feature vectors, in the following deformation block. Namely, we aligned our training images into the same coordinate space, and then, for each cardiac structure, we computed the average centroid location and the average length between surface and centroid from the ground truth meshes in the training data. We then scaled and translated the initial spheres accordingly. By having a closer initialization compared with using centered unit spheres as in prior approaches (Wickramasinghe et al., 2020; Wang et al., 2020a), our network starts with a reduced distance between predictions and ground truths and thus avoids large deformations during the early phase of training. From our experiments, this is an important and effective technique to avoid getting stuck in local minima and to achieve faster convergence.

Our proposed mesh deformation module consists of three deformation blocks with graph convolutional layers that progressively deform our initial template meshes based on both existing mesh vertex features and image features extracted from the image encoding module. The volumetric feature maps have an increasing level of abstraction but decreasing spatial resolution as we progress deeper in the image encoding module. Therefore, as shown in Figure 1, we used more abstract, high-level image feature maps for the initial mesh deformation blocks to learn the general shapes of the cardiac structures, while using low-level, high-resolution feature maps for the later mesh deformation blocks to produce more accurate predictions with detailed features. For each mesh deformation block, we project image features from the image encoding module onto the mesh vertices and then concatenate the extracted image feature vector with the existing vertex feature vector. As we deform the mesh through multiple deformation blocks, we decrease the size of the graph convolutional filters to reduce the dimension of the mesh feature vectors, matching the reduced number of filters used in the upper levels of the image encoding module. Within each mesh deformation block, the concatenated feature vectors are processed by three graph residual blocks, each of which contains two graph convolutional layers with residual connections. We then use an additional graph convolutional layer to predict the deformation as 3D feature vectors on mesh vertices and add those to the vertex coordinates of the initial mesh or the mesh from the previous deformation block to obtain the current predicted vertex coordinates. To project corresponding image features onto mesh vertices, from the vertex locations of the initial or previously deformed mesh, we compute the corresponding image coordinates in the volumetric image feature maps. We then tri-linearly interpolate the feature vectors that correspond to the 8 neighboring voxels of the computed image coordinates in the volumetric feature maps.
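The projection step is plain trilinear interpolation at continuous vertex coordinates. A minimal sketch using PyTorch's grid_sample follows (an assumption; the original implementation may interpolate differently, and vertex coordinates are assumed to be expressed in the feature map's voxel units):

```python
import torch
import torch.nn.functional as F

def project_features_to_vertices(feature_map, verts, volume_size):
    """Trilinearly sample a volumetric feature map at mesh vertex locations.

    feature_map: (1, C, D, H, W) tensor from one level of the image encoder.
    verts:       (N, 3) vertex coordinates in (x, y, z) voxel units of
                 that feature map.
    volume_size: (D, H, W) extent of the feature map in voxels.
    Returns an (N, C) tensor of per-vertex image feature vectors.
    """
    # grid_sample expects coordinates normalized to [-1, 1], (x, y, z) order.
    size = torch.tensor(volume_size[::-1], dtype=verts.dtype,
                        device=verts.device)               # (W, H, D)
    grid = 2.0 * verts / (size - 1) - 1.0
    grid = grid.view(1, 1, 1, -1, 3)                       # (1, 1, 1, N, 3)
    # For 5-D inputs, mode="bilinear" performs trilinear interpolation
    # over the 8 neighboring voxels.
    sampled = F.grid_sample(feature_map, grid, mode="bilinear",
                            align_corners=True)            # (1, C, 1, 1, N)
    return sampled.view(feature_map.shape[1], -1).t()      # (N, C)
```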
The training of our networks was supervised by 3D ground truth meshes of the whole heart as well as a binary segmentation indicating the occupancy of the heart on the voxel grid that corresponds to the input image volume. The whole heart meshes were extracted from the segmentation of cardiac structures using the marching cube algorithm, and the binary segmentation was also obtained from the segmentation by setting all non-background voxels to 1 and the rest to 0. We used two categories of loss functions in the training process: geometry consistency losses and regularization losses. The geometry consistency losses include point and normal consistency losses, while the regularization losses include edge length and Laplacian losses.
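Ground-truth meshes and occupancy maps of this kind can be derived from a multi-class segmentation with standard tools; the sketch below uses scikit-image's marching cubes (the label values and spacing are placeholders, not our dataset's actual encoding):

```python
import numpy as np
from skimage import measure

def ground_truth_from_segmentation(seg, structure_labels,
                                   spacing=(1.0, 1.0, 1.0)):
    """Extract per-structure surface meshes and a binary whole-heart
    occupancy map from a multi-class segmentation volume (a sketch)."""
    meshes = {}
    for label in structure_labels:   # e.g. LV, LA, RA, RV, Epi, Ao, PA
        verts, faces, normals, _ = measure.marching_cubes(
            (seg == label).astype(np.uint8), level=0.5, spacing=spacing)
        meshes[label] = (verts, faces, normals)
    # Binary supervision target: all non-background voxels set to 1.
    occupancy = (seg > 0).astype(np.uint8)
    return meshes, occupancy
```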
We used a hybrid loss function that contained both cross-entropy and Dice-score losses. Namely, let L_seg(I_p, I_g) denote the loss between the predicted occupancy probability map I_p and the ground truth binary segmentation of the whole heart I_g. The hybrid loss function was

    L_seg(I_p, I_g) = − Σ_{x∈I} [ I_g(x) log I_p(x) + (1 − I_g(x)) log(1 − I_p(x)) ] − 2 Σ_{x∈I} I_g(x) I_p(x) / ( Σ_{x∈I} I_g(x) + Σ_{x∈I} I_p(x) )        (1)

where x denotes a pixel in the input image I.

We used the Chamfer loss to regulate the accuracy of the vertex locations on the predicted meshes. For a point from the predicted mesh or the ground truth mesh, the Chamfer loss finds the nearest vertex in the other point set and adds up all pair-wise distances. The point loss is defined by

    L_point(P_i, G_i) = Σ_{p∈P_i} min_{g∈G_i} ‖p − g‖²₂ + Σ_{g∈G_i} min_{p∈P_i} ‖p − g‖²₂        (2)

where p and g are, respectively, points from the vertex sets of the predicted mesh P_i and the ground truth mesh G_i of cardiac structure i.

We used a normal consistency loss to regulate the accuracy of the surface normals on the predicted meshes. For each point, the surface normal is estimated by the cross product between two edges of a face connected to the point. The predicted surface normal is then compared with the ground truth surface normal at the nearest vertex. Namely,

    L_normal(P_i, G_i) = Σ_{p∈P_i} Σ_{g = argmin_{g∈G_i} ‖p−g‖} ‖ (p − p₁) × (p − p₂) − n_g ‖²₂        (3)

where p₁ and p₂ are the two vertices sharing the same face with vertex p, and n_g is the ground truth surface normal at g.

We used an edge length loss to encourage a more uniform mesh density on the predictions. That is, we regularize the difference between each edge length and an estimated average edge length μ_i of the corresponding cardiac structure G_i. Namely, we compute the average surface area of our ground truth mesh for each cardiac structure and estimate the average edge length based on the surface area ratio between the template and ground truth meshes, leading to

    L_edge(P_i, G_i) = Σ_{p∈P_i} Σ_{k∈N(p)} | ‖p − k‖ − μ_i |        (4)

where N(p) denotes the set of neighboring vertices of p.

To encourage a smoother mesh prediction, we used a Laplacian loss to regularize the difference between a vertex location p and the mean location of its neighboring vertices k ∈ N(p):

    L_lap(P_i) = Σ_{p∈P_i} ‖ p − (1/|N(p)|) Σ_{k∈N(p)} k ‖²₂        (5)
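For reference, minimal (un-optimized) PyTorch versions of the Chamfer, edge-length, and Laplacian losses above might read as follows; a production implementation would use accelerated nearest-neighbor kernels rather than a dense pairwise distance matrix.

```python
import torch

def chamfer_loss(pred, gt):
    """Eq. (2): symmetric squared nearest-neighbor distance between
    predicted vertices pred (N, 3) and ground-truth vertices gt (M, 3)."""
    d = torch.cdist(pred, gt) ** 2            # (N, M) squared distances
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()

def edge_loss(pred, edges, mu):
    """Eq. (4): penalize the deviation of each edge length from the
    structure's estimated average edge length mu; edges is (E, 2)."""
    lengths = (pred[edges[:, 0]] - pred[edges[:, 1]]).norm(dim=1)
    return (lengths - mu).abs().sum()

def laplacian_loss(pred, adj, degrees):
    """Eq. (5): distance of each vertex to the mean of its neighbors;
    adj is a sparse (N, N) adjacency matrix, degrees its row sums."""
    neighbor_mean = torch.sparse.mm(adj, pred) / degrees.unsqueeze(1)
    return ((pred - neighbor_mean) ** 2).sum()
```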
Prior approaches to mesh reconstruction from images commonly formulated the total loss function as a weighted sum of multiple loss functions (Wang et al., 2020a; Wickramasinghe et al., 2020). However, for multi-loss regression problems, the individual loss functions differ in scale, and manually tuning the weight assigned to each loss function is difficult and expensive since the magnitudes of the losses can differ by orders of magnitude. Therefore, we express the total loss on predicted meshes as a weighted geometric mean of the individual losses so that the gradient of each individual loss function is invariant to its scale relative to the other loss functions. Thus, for predicted meshes P and ground truth meshes G with N cardiac structures, the total mesh loss is expressed as

    L_mesh(P, G) = Σ_{i=1..N} L_point(P_i, G_i)^{λ₁} · L_normal(P_i, G_i)^{λ₂} · L_edge(P_i, G_i)^{λ₃} · L_lap(P_i)^{λ₄}        (6)

where each λ is a hyperparameter that weights an individual loss based on its importance without being affected by its scale. We can thus choose hyperparameters from a consistent range for all the losses. We generated 8 sets of random numbers ranging from 0 to 1 and chose the set of hyperparameters that produced the highest accuracy on the validation data, with λ₄ = 0.05.
For the total loss, we added up the mesh losses from all three deformation blocks as well as the binary segmentation loss:

    L_total = L_mesh(P_B1, G) + L_mesh(P_B2, G) + L_mesh(P_B3, G) + L_seg(I_p, I_g)        (7)

where P_Bk denotes the meshes predicted by the k-th deformation block.
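Evaluated naively, the product in Eq. (6) can underflow or overflow, so a log-domain formulation is numerically safer; it also makes the scale invariance explicit, since the gradient contribution of each term is λ_k ∇L_k / L_k, normalized by that loss's own magnitude. A minimal sketch (PyTorch and the loss dictionary keys are assumptions):

```python
import torch

def mesh_loss(losses, lambdas):
    """Weighted geometric mean of the individual mesh losses for one
    cardiac structure (Eq. 6), computed in the log domain for stability.
    losses/lambdas: dicts keyed by 'point', 'normal', 'edge', 'lap'."""
    log_total = sum(lambdas[k] * torch.log(losses[k] + 1e-8)
                    for k in lambdas)
    return torch.exp(log_total)

def total_loss(per_block_losses, lambdas, seg_loss):
    """Eq. (7): mesh losses summed over the three deformation blocks
    and over the cardiac structures, plus the segmentation loss."""
    mesh_total = sum(mesh_loss(structure_losses, lambdas)
                     for block in per_block_losses    # 3 deformation blocks
                     for structure_losses in block)   # N cardiac structures
    return mesh_total + seg_loss
```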
3. Experiments and Results
We considered three baselines to compare our method against: 2D UNet (Ronneberger et al., 2015), a residual 3D UNet (Isensee et al., 2018) and Voxel2Mesh (Wickramasinghe et al., 2020). The UNets are arguably the most successful architectures for medical image segmentation and thus can function as strong baselines. In particular, the 2D UNet is a part of the whole-heart segmentation framework implemented in Kong and Shadden (2020) that recently demonstrated state-of-the-art performance on the MMWHS challenge dataset. The residual 3D UNet has demonstrated improved performance over a regular 3D UNet and won the KiTS2019 Challenge (Isensee and Maier-Hein, 2019; Heller et al., 2021). To ensure a fair comparison, the same network architecture and convolutional filter numbers were used for the image encoding module between our method and the residual 3D UNet, and the same image pre-processing and augmentation methods were applied during the training of all methods. For Voxel2Mesh, we reduced the resolution of the template mesh such that the total memory consumption during training could fit within the memory available on our Nvidia GeForce GTX 1080 Ti GPU (11 GB). The final mesh resolution is thus halved compared to the original implementation (Wickramasinghe et al., 2020) and contains 3663 vertices for each cardiac structure. In contrast, our method can process a template mesh with 11494 mesh vertices for each cardiac structure within the available GPU memory.
We first compare the performance of whole-heart reconstruction from our method against our baselines. In this experiment, we trained and validated our method using both CT and MR images collected from the existing public datasets, except for the held-out test dataset of the MMWHS challenge, which we used for test-time evaluation. Our training set thus contained 87 CT images and 41 MR images, and the validation set contained 15 CT images and 6 MR images. The MMWHS held-out test dataset contained 40 CT images and 40 MR images. We analyzed the performance of our method against baselines in terms of both the accuracy and the quality of the surface reconstructions. The accuracy is evaluated against the ground truth segmentation using the executable provided by the MMWHS challenge organizers. We also manually labeled the testing images and compared this with the ground truth segmentation of the MMWHS challenge to provide a comparison between the evaluated reconstruction accuracy of our deep-learning based method and the inter-observer variability of manual delineations. The surface quality was qualitatively evaluated in terms of surface smoothness, normal consistency and topological correctness.

Table 2 shows the average Dice and Jaccard scores, ASSD and HD of the reconstruction results for both the whole heart and individual cardiac structures on the MMWHS test dataset. For both CT and MR data, our method consistently outperformed our baselines in terms of Dice and Jaccard scores for both the whole heart and all individual cardiac structures. In terms of surface ASSD and HD measures for the whole heart or individual cardiac structures, our method was the best or second best among the four deep-learning-based methods compared. To provide further details on segmentation accuracy, figure 13 gives the distribution of the different segmentation accuracy metrics for the whole heart and individual cardiac structures. Overall, our method demonstrated advantages for whole heart reconstruction on both CT and MR images, and 2D UNet was the closest to ours compared with 3D UNet or Voxel2Mesh. All methods produced better reconstructions for CT images than for MR images. Furthermore, there are no significant differences between the evaluated Dice scores of our method and those of our manual labeling, except for the left ventricle epicardium (p < 0.05).

Table 2. A comparison of prediction accuracy on MMWHS MR and CT test datasets from different methods. (Columns: Epi, LA, LV, RA, RV, Ao, PA, WH; rows report Dice and Jaccard scores, ASSD (mm) and HD (mm) for each method and for manual labeling, on CT and MR data.)

Fig. 2. Example reconstructions from our method for CT (left) and MR (right) data selected from the MMWHS test dataset. Our method reconstructs the whole heart consisting of seven cardiac structures, including the four heart chambers, left ventricle epicardium, aorta and pulmonary arteries. The geometry of each reconstructed cardiac structure is demonstrated in two different views, with the bottom view also displaying the meshes.

Figures 3 and 4 visualize the median and worst results from the different methods for CT and MR images, respectively, from the MMWHS test dataset. The surface meshes of 2D UNet and 3D UNet were extracted from the segmentation results using the marching cube algorithm.
As shown, our method is able to construct smooth geometries, while segmentation-based methods, such as 2D UNet and 3D UNet, produced surfaces with staircase artifacts. Such artifacts require surface post-processing techniques such as Laplacian smoothing that often also degrade true features. Generally, all four methods are able to produce reasonable median cases from CT data. For MR data, our method produced reasonable reconstructions, while the 2D UNet and 3D UNet produced reconstructions with disconnected regions that would require post-processing to remove or connect. Voxel2Mesh was unable to capture the detailed shapes of some structures, such as the bifurcation of the pulmonary artery branches. In the worst cases for both CT and MR, our method nonetheless produced realistic shapes. However, 2D UNet and 3D UNet predicted geometries with missing parts, noisy surfaces, incorrect classifications and/or disconnected regions that would require significant post-processing. Voxel2Mesh predicted worst-case geometries that deviated largely from the ground truths and had major surface artifacts.

Fig. 3. Visualizations of the median and worst reconstruction results among the MMWHS CT test dataset in terms of whole-heart Dice scores for all compared methods.

Fig. 4. Visualizations of the median and worst reconstruction results among the MMWHS MR test dataset in terms of whole-heart Dice scores for all compared methods.
Figures 5 and 6 provide further qualitative comparisons of the results from the different methods. As shown in Fig. 6, our method was able to generate smoother reconstructions than the ground truth segmentation on MR images that have relatively large voxel spacing. In contrast, 2D UNet, which produces segmentation in a slice-by-slice manner along the sagittal view, may suffer from inconsistency between adjacent slices, leading to coarse segmentation when viewed from the axial view that the 2D UNet was not trained on. 3D UNet, limited by the memory constraints of the GPU, can only produce coarse segmentation on a down-sampled voxel grid of 128 × 128 × 128.

Fig. 5. Comparison of the predicted whole heart surfaces from different methods for CT test cases. Different rows demonstrate the images and predictions from different test cases with the 10th, 50th, and 90th percentiles of Dice scores based on our method.

Fig. 6. Comparison of the predicted whole heart surfaces from different methods for MR test cases. Different rows demonstrate the images and predictions from different test cases with the 10th, 50th, and 90th percentiles of Dice scores based on our method.

Figure 7 shows more reconstruction results from our models for the 10 most challenging CT and MR images, for which 2D UNet, which demonstrated the closest performance to our method, predicted less accurate segmentations in terms of Dice scores compared with the rest of the images in the test datasets. For all 10 MR images and 8 out of the 10 CT images, our method produced whole-heart reconstructions with improved Dice scores. For all these CT cases, we were able to generate accurate reconstructions with Dice scores above 0.87 and smooth surfaces without obvious artifacts. However, for the 10 MR cases, although we demonstrated improvement against the 2D UNet predictions, we observed buckling and bumpiness on the mesh surfaces of one or more cardiac structures for 5 out of 10 cases.

Interestingly, as indicated by the point-correspondence color maps in Figure 7, although we did not explicitly train our method to generate feature-corresponding meshes across different input images, our method was generally able to consistently deform the template meshes to map mesh vertices to similar structural features of the heart for different images. This behavior allowed convenient generation of the mean whole heart shapes from the test dataset by computing the average coordinates of each vertex. Figure 8 demonstrates the mean whole heart shapes for MR and CT images from the MMWHS test dataset, respectively, and the distribution of the average surface distance errors on the whole heart compared with the manual ground truths. For both CT and MR data, locations that suffer from higher surface errors include the ends of the aorta and pulmonary arteries, the boundaries between the right ventricle and the pulmonary artery, the boundaries between the right atrium and the ventricle, and the inferior vena cava region on the right atrium. We note that several of the locations of largest error are artificial boundaries, or arbitrary truncations of vessels extending away from the heart.

Fig. 7. Whole heart reconstruction results from our models for the 10 most challenging CT and MR images, for which 2D UNet predicted less accurate segmentations in terms of Dice scores compared with the rest of the images in the MMWHS test datasets.
On top of each case, we show the whole-heart Dice score of our result and the difference in whole-heart Dice score compared with the 2D UNet reconstruction for the same case. The color map denotes the indices of mesh vertices and demonstrates the correspondence of mesh vertices across reconstructed meshes from different images.

Fig. 8. Distribution of the average surface distance errors on the mean whole heart shapes from the CT and MR data in the MMWHS test dataset.
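Because the predicted meshes are point-corresponded, the mean shapes and error maps of Figure 8 reduce to per-vertex averaging. A NumPy sketch (the array layout is an assumption) illustrates the computation:

```python
import numpy as np

def mean_shape(vertex_stack):
    """vertex_stack: (num_cases, N, 3) array of corresponded vertex
    coordinates, where index i refers to the same anatomical location
    in every case. Returns the (N, 3) mean whole-heart shape."""
    return vertex_stack.mean(axis=0)

def mean_vertex_error(error_stack):
    """error_stack: (num_cases, N) per-vertex surface distance errors
    against the manual ground truths; returns the (N,) average error
    map that can be rendered on the mean shape (as in Figure 8)."""
    return error_stack.mean(axis=0)
```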
Cardiac MR image data are often acquired in a slice-by-slice manner and thus often vary in through-plane resolution due to the use of different acquisition protocols and vendors. For MR images with low through-plane resolution, accurately constructing smooth surface geometries is challenging since a method would need to complete the cardiac structures that are not captured between the slices. Therefore, having trained our method on MR images with high through-plane resolution to produce detailed whole heart geometries, we evaluate the performance of our method on MR images with lower through-plane resolution and compare it with our baselines. To disentangle the effect of through-plane resolution from the effect of other variations of MR images, we first generate low-resolution MR images from our validation data by down-sampling the images to various slice thicknesses. We then evaluate the robustness of the different methods on challenging real low-resolution MR images that differ significantly from our training datasets. Namely, we used data from our cine MR images, which were acquired with large slice thicknesses (8-10 mm), different acquisition planes, and from a different clinical center.

Fig. 9. Robustness of different methods to through-plane resolution changes of MR images. The left panel shows the front and back views of the ground truth surfaces; the top panel shows example slices along the down-sampling axis of images down-sampled to varying slice thicknesses, and the bottom panel shows front and back views of the predicted whole-heart surfaces from different methods corresponding to different slice thickness values.

Figure 9 displays an example of down-sampling an input image to slice thicknesses of 1 mm, 6 mm, and 10 mm, as well as the corresponding predictions from our method, 2D UNet and 3D UNet, respectively. For low through-plane resolution images, the same linear resampling method was applied as before to interpolate the 3D image volume to the sizes required by the neural network models. As the slice thickness was increased up to 10 mm, while 2D UNet can generally produce consistent segmentation on 2D slices, it produces 3D geometries with severe staircase artifacts. In contrast, 3D UNet is able to produce smoother surfaces along the down-sampling axis by accounting for inter-slice information. However, as slice thickness increases, 3D UNet produces less accurate segmentation, such as incorrectly classifying a part of the RV as RA, as shown in figure 9. Our method, however, for all slice thicknesses, produces consistent reconstructions that closely resemble the ground truth surfaces and are free of any major artifacts. Figure 10 displays quantitative evaluations of the reconstruction performance at various image resolutions. Regardless of the slice thickness values considered, our method outperformed 2D UNet and 3D UNet both in terms of Dice and ASSD. Moreover, as slice thickness increases from 1 mm to 10 mm, we generally observed increasing improvement of our method compared with 2D UNet or 3D UNet. Furthermore, by taking a 3D image volume as the input, our method and 3D UNet are more robust to additional in-plane resolution changes than the 2D UNet. Both our method and the 3D UNet demonstrated a smaller reduction in accuracy with a 4 times reduction of in-plane resolution.

Fig. 10. Relation of Dice and ASSD values of whole-heart surfaces to through-plane resolution of MR images.
Comparisons between different methods and different in-plane resolutions are indicated by lines of different colors and different styles, respectively. The bottom panel shows the average percentage differences of Dice or ASSD values between our method and 2D UNet or 3D UNet across all validation images.

We evaluated the robustness of our method on the challenging cine MR dataset, which significantly differs from our training datasets in terms of the through-plane resolution, imaging plane orientation and coverage of the heart. Table 3 compares the reconstruction accuracy between our method and the baselines. The reconstruction accuracy was evaluated at two time frames, end diastole and end systole, for each patient. Overall, our method demonstrated high reconstruction accuracy and outperformed the other methods for most cardiac structures in terms of average Dice score and ASSD.

Figure 11 compares the whole-heart geometries reconstructed by our method with the others for one example of cine cardiac MR images. Our method was able to produce clean surface meshes while at the same time capturing most of the cardiac structures with reasonable accuracy. In contrast, since these images were acquired on imaging planes that differ from those used in acquiring the training data, 2D UNet produced inaccurate reconstructions and disconnected surfaces. 3D UNet produced more complete reconstructions of the cardiac structures but often produced many disconnected false-positive regions. Voxel2Mesh is able to produce clean surface meshes with generally correct topology, but the predictions are not accurate. Furthermore, as changes in the input images over different time frames are small, our method produced consistent reconstructions over different time phases. However, the segmentation-based methods, 2D UNet and 3D UNet, often produced inconsistent reconstructions with significant shape or topology changes, despite small changes in the input images over different time frames.

Table 3. A comparison of prediction accuracy on the cine MR dataset from different methods. (Columns: Epi, LA, LV, RA, RV, Ao, PA, WH; rows report Dice scores and ASSD (mm), mean ± standard deviation, for each method.)
Fig. 11. Short-axis and long-axis slices at different time frames for an example cine cardiac MR dataset and the corresponding reconstructed whole heart surfaces from different methods.

We further validated our method on the time-series CT dataset. Table 4 compares the reconstruction accuracy between our method and the other baseline methods. Similar to above, the reconstruction accuracy was evaluated at two time frames, end diastole and end systole, for each patient. Overall, our method demonstrated high reconstruction accuracy and outperformed the other methods for most cardiac structures in terms of average Dice score and ASSD.

Furthermore, we explored the potential capability of our method to reconstruct dynamic 4D whole-heart models that capture the motion of the heart from time-series image data. Figure 12 displays example whole-heart reconstruction results of our method on the time-series CT data, which consisted of images from 10 time frames over the cardiac cycle for each patient. Although our model predicts mesh reconstructions independently for each time frame, it is able to consistently deform the template meshes such that the same mesh vertices on the template meshes are generally mapped to the same regions of the reconstructed geometries across different time frames, as shown by the color maps of vertex IDs in Figure 12. Moreover, as demonstrated by the segmentation in Figure 12, our method is able to capture the minor changes between time frames. Therefore, our method can potentially be applied to efficiently construct 4D dynamic whole-heart models that capture the motion of a beating heart by computing the deformation field on each mesh vertex, as displayed in Figure 12.

Table 4. A comparison of prediction accuracy on the time-series CT dataset from different methods. (Columns: Epi, LA, LV, RA, RV, Ao, PA, WH; rows report Dice scores and ASSD (mm), mean ± standard deviation, for each method.)
Fig. 12. Whole-heart reconstruction results for time-series CT data. From left to right, each column displays results at one time frame from middle diastole to early diastole. The top row shows the predicted segmentation overlaid with the CT images, the middle row shows the computed mesh deformation vectors relative to the vertex positions at the middle diastole phase, and the bottom row shows the correspondence maps of mesh vertices across reconstructed meshes from different time frames, with the same color denoting the same mesh vertices on the reconstructed meshes.

We investigated how well our method can reconstruct whole-heart geometries using only a small number of training data. In this experiment, our neural network model was trained using only the training set of the MMWHS challenge, which consists of 20 CT images and 20 MR images. 16 out of the 20 image volumes from each modality were used for training and the rest were used for validation. We compared our method against the baseline methods on the same MMWHS test set described above. The baseline methods were trained using the same training and validation splits.

Table 5 compares the Dice and Jaccard scores, ASSD and HD of the reconstruction results for the methods trained with the reduced training set, as well as the accuracy differences compared with training the models using more data, as described above. For CT data, our method consistently outperformed the others in terms of Dice and Jaccard scores for the whole heart and individual cardiac structures, except for the pulmonary arteries. In terms of ASSD and HD, our method outperformed 3D UNet and Voxel2Mesh and was comparable to 2D UNet. For MR data, our method demonstrated better performance than the others in terms of whole heart Dice and Jaccard scores, as well as the surface HD of the whole heart. 2D UNet demonstrated the best whole heart ASSD performance. For individual cardiac structures, our method showed better Dice and Jaccard scores for Epi, LV, RA and RV, smaller ASSD values for Epi, LV and RA, and smaller surface HD values for most of the cardiac structures except for LA and Ao. Figure 14 shows the distribution of the different segmentation accuracy metrics for the whole heart and individual cardiac structures among the MMWHS test dataset.

As shown in Table 5, when trained with a smaller training dataset, the methods generally showed reduced Dice and Jaccard scores and increased ASSD and HD values for both the whole heart and individual cardiac structures, compared with when trained with the larger dataset as summarized in Table 2. Exceptions include the smaller HD values of Epi, LA, LV, RV and PA from our method for CT data, and the better LV and aorta segmentation from 3D UNet for MR data in terms of all four metrics. Compared with CT data, all methods generally demonstrated a more significant reduction of segmentation accuracy for MR data in terms of the average reduction across all four metrics. While the performance drop due to the reduced size of the training data is consistent, the actual amount of performance drop is minor for our method, 2D UNet and 3D UNet. For example, although the number of CT training images was reduced from 87 to 16, we only observed a small average reduction (0.01-0.02) of whole heart Dice scores for 2D UNet, 3D UNet and our method. However, the performance drop for Voxel2Mesh in relation to the number of training data was much more significant, with a 0.27-0.28 reduction of whole-heart Dice scores for CT and MR data.
Among all the cardiac structures, our method had the most significant performance reduction for PA reconstruction for both CT and MR data, while the segmentation-based approaches, 2D UNet and 3D UNet, demonstrated a more uniform performance drop across all cardiac structures. Indeed, the shape of the PA differs significantly from our initial sphere template mesh, and therefore accurately capturing the shapes of the PA might require more training data for our method.
4. Discussion
Image-based reconstruction of cardiac anatomy and the concomitant geometric representation using unstructured meshes is important to a number of applications, including visualization of patient-specific heart morphology and computational simulations of cardiac function. Prior deep learning based approaches have shown great promise in automatic whole heart segmentation (Zhuang et al., 2019); however, converting the segmentation results to topologically valid mesh structures requires additional, and often manual, post-processing, and is highly dependent on the resolution of the image data. In this work, we presented a novel deep-learning based approach that uses graph convolutional neural networks to directly generate meshes of multiple cardiac structures of the whole heart from volumetric medical image data. Our approach generally demonstrated improved whole heart reconstruction performance compared with the baseline methods in terms of the accuracy measures: Dice and Jaccard scores, ASSD and HD. Furthermore, our method demonstrated advantages in generating high-resolution, anatomically and temporally consistent geometries, which are not reflected by the accuracy measures.

Our method reconstructs cardiac structures by predicting the deformation of mesh vertices from sphere mesh templates. We have demonstrated the advantages of this approach over segmentation-based approaches in terms of both precision and surface quality. Namely, the use of a template mesh can introduce topological constraints so that the predicted cardiac structures are diffeomorphic to the template. Thus, our template based approach enables one to eliminate disconnected regions and greatly reduce the erroneous topological artifacts often encountered with existing deep-learning based segmentation methods. While the cardiac structures of interest were diffeomorphic to spheres, the presented method has the potential to be generalized to organs with different topology, by using a different template mesh with the required surface topology.

When trained on a relatively large dataset with 87 CT and 41 MR images, our method was able to achieve comparable accuracy to manual delineations, which are considered the gold standard. Furthermore, since we explicitly regularized the surface smoothness and normal consistency, our method produced smooth, high-quality meshes while capturing the detailed features of the cardiac structures. Namely, these factors along with the use of a template enable our method to generate realistic cardiac structures even when image quality was poor and segmentation methods struggled to provide realistic topology.

Table 5. A comparison of prediction accuracy on MMWHS MR and CT test datasets from different methods trained with images from the MMWHS training set. An asterisk (∗) indicates statistically significant accuracy differences, compared with Table 2, resulting from training on a smaller dataset, based on t-tests (p < 0.05). (Columns: Epi, LA, LV, RA, RV, Ao, PA, WH; rows report Dice and Jaccard scores, ASSD (mm) and HD (mm) for each method on CT and MR data.)
From our observations, the locations on the heart where our neural network models produced high surface errors are consistent with the locations that can suffer from high inter- or intra-observer variations, such as the arbitrary length of the aorta and pulmonary arteries, the boundaries between atria and ventricles, and the boundary between the right atrium and the inferior vena cava. Indeed, these boundaries are not distinguishable by voxel intensity differences and are often subject to uncertainties even for human observers.

Compared with segmentation-based approaches, our method predicts whole heart surfaces directly in physical space rather than on a voxel grid of the input image. Moreover, the whole heart geometries are represented using surface meshes rather than a dense voxel grid. Hence, our method is able to generate high-resolution reconstructions (10K mesh vertices for each cardiac structure) efficiently on a limited memory budget and within a shorter or comparable run-time (Table 7). High resolution segmentation may also be obtained by recent methods that represent geometries using implicit surfaces (Kirillov et al., 2019). Namely, for each point in physical space, this approach predicts the probability of the point belonging to a certain tissue class. Therefore, by sampling a large number of points in physical space, these methods can also achieve high-resolution reconstructions that are not constrained by the voxel resolution of the input image or GPU memory. However, the inference process for such methods is computationally expensive (Gupta and Chandraker, 2020) as it requires prediction on a large number of points. In contrast, our method represents the mesh as a graph (i.e., a sparse matrix) and takes less than a second to predict a high resolution whole heart mesh.

Compared with prior deep-learning based mesh reconstruction methods from image data (Wang et al., 2020a; Wickramasinghe et al., 2020), our method used a shared graph neural network to simultaneously predict surface meshes of multiple cardiac structures. This is made possible by initializing the template meshes at various scales and locations corresponding to the individual cardiac structures. We observed that proper template initialization is essential to avoid local minima caused by large mesh deformation at the beginning stage of training. In contrast, prior approaches are designed for predicting a single geometry from image data and require training a separate graph neural network for each anatomical structure, and thus do not easily scale to reconstructing multiple cardiac structures at high resolution from a single image volume. Furthermore, while prior approaches proposed various up-sampling schemes to construct a dense mesh from a coarse template (Wickramasinghe et al., 2020; Wang et al., 2020a), we directly deformed a high resolution template. Since the majority of the weights are in the image feature encoder that processes a dense volumetric input image, more mesh vertices can provide more effective gradient propagation to the image feature encoder. Indeed, using a coarse mesh with 3K mesh vertices for each cardiac structure, we observed a 2% reduction of whole heart Dice score, as shown in our supplemental materials (Table 6). However, our method was still able to outperform Voxel2Mesh by 3% and 10% for CT and MR data using a coarse mesh template with a similar number of mesh vertices.
These design choices allowed our method to demonstrate promising generalization to unseen MR images and to maintain good performance when trained with a smaller number of samples. In contrast, Voxel2Mesh suffered a large performance drop when trained on the smaller dataset.
When applied to time-resolved images, our method consistently deformed the template mesh such that mesh vertices were mapped to similar regions of the heart across different time frames. Learning such semantic correspondence is purely a consequence of our model architecture and did not require any explicit training. This behavior of producing semantically corresponding predictions was also observed in DeepOrganNet, which reconstructed lung shapes from single-view X-ray images by deforming lung templates (Wang et al., 2020b). Point-corresponding meshes across different input images are required for numerous applications, such as building statistical shape models, constructing 4D dynamic whole-heart models for motion analysis, and deriving boundary conditions for deforming-domain CFD simulations. Current approaches that construct feature-corresponding meshes for the heart mostly use surface or image registration methods to deform a reference mesh so that its boundary is consistent with the target surfaces or image segmentation (Ordas et al., 2007; Khalafvand et al., 2018; Kong and Shadden, 2020). However, registration algorithms are often computationally expensive when aligning high-resolution meshes, and they often suffer from inaccuracies for complex whole-heart geometries due to local minima during the optimization process. For time-series image data, our method naturally produces point-corresponding meshes at high resolution (10K mesh vertices per cardiac structure) across time frames within a couple of seconds, whereas prior methods could require hours to generate a 4D dynamic whole-heart model at a similar resolution. Although not considered here, it is possible to include an additional loss function that minimizes the distances between vertex locations on the predicted meshes and ground-truth landmarks, when available, to further enhance feature correspondence; a sketch of such a loss is given below.
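The landmark loss suggested above could take a simple form. The following is a hypothetical illustration (the index assignment `landmark_idx` and its weighting into the total loss are assumptions, not part of our trained model): it penalizes squared distances between designated template vertices and annotated landmark coordinates, and the fixed index assignment is what would enforce the correspondence.

```python
import numpy as np

def landmark_loss(pred_vertices, landmark_idx, landmark_xyz):
    """Mean squared distance between chosen mesh vertices and ground-truth landmarks.

    pred_vertices : (N, 3) predicted mesh vertex coordinates
    landmark_idx  : (K,)   template vertex indices assigned to each landmark
    landmark_xyz  : (K, 3) annotated landmark coordinates in physical space
    """
    d = pred_vertices[landmark_idx] - landmark_xyz
    return np.mean(np.sum(d * d, axis=1))

# Toy usage: 3 landmarks on a 100-vertex prediction.
rng = np.random.default_rng(0)
pred = rng.normal(size=(100, 3))
idx = np.array([5, 42, 87])
gt = pred[idx] + 0.01 * rng.normal(size=(3, 3))  # near-perfect correspondence
print(landmark_loss(pred, idx, gt))              # small value
```

In practice this term would be added to the total loss with a tunable weight, so that landmark supervision refines, rather than dominates, the geometric losses.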
5. Conclusions
We have developed a deep-learning based method to directly predict surface mesh reconstructions of the whole heart from volumetric image data. The approach leverages a graph convolutional neural network to predict the deformation of mesh vertices from a predefined mesh template to fit multiple anatomical structures in a 3D image volume. The mesh deformation is conditioned on image features extracted by a CNN-based image encoder. The method demonstrated promising performance in generating accurate, high-resolution, and high-quality whole-heart reconstructions and outperformed prior deep-learning based methods on both CT and MR data. It also demonstrated robust performance when evaluated on MR or CT images from new data sources that differ from the training datasets. Furthermore, the method produced temporally consistent and feature-corresponding predictions by consistently mapping mesh vertices on the templates to similar structural regions of the heart. Therefore, this method can potentially be applied to efficiently construct 4D dynamic whole-heart models that capture the motion of a beating heart from time-series image data.
Acknowledgments
This work was supported by the NSF, Award
References
Augustin, C.M., Neic, A., Liebmann, M., Prassl, A.J., Niederer, S.A., Haase, G., Plank, G., 2016. Anatomically accurate high resolution modeling of human whole heart electromechanics: A strongly scalable algebraic multigrid solver method for nonlinear deformation. Journal of Computational Physics 305, 622–646.
Bai, W., Shi, W., Ledig, C., Rueckert, D., 2015. Multi-atlas segmentation with augmented features for cardiac MR images. Medical Image Analysis 19, 98–109. doi:10.1016/j.media.2014.09.005.
Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P., 2017. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine 34, 18–42.
Bucioli, A.A.B., Cyrino, G.F., Lima, G.F.M., Peres, I.C.S., Cardoso, A., Lamounier, E.A., Neto, M.M., Botelho, R.V., 2017. Holographic real time 3D heart visualization from coronary tomography for multi-place medical diagnostics, in: 2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pp. 239–244.
Bui, V., Hsu, L.Y., Shanbhag, S.M., Tran, L., Bandettini, W.P., Chang, L.C., Chen, M.Y., 2020. Improving multi-atlas cardiac structure segmentation of computed tomography angiography: A performance evaluation based on a heterogeneous dataset. Computers in Biology and Medicine 125, 104019. doi:10.1016/j.compbiomed.2020.104019.
Bui, V., Shanbhag, S.M., Levine, O., Jacobs, M., Bandettini, W.P., Chang, L.C., Chen, M.Y., Hsu, L.Y., 2020. Simultaneous multi-structure segmentation of the heart and peripheral tissues in contrast enhanced cardiac computed tomography angiography. IEEE Access 8, 16187–16202.
Chnafa, C., Mendez, S., Franck, N., 2016. Image-based simulations show important flow fluctuations in a normal left ventricle: What could be the implications? Annals of Biomedical Engineering 44.
Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O., 2016. 3D U-Net: Learning dense volumetric segmentation from sparse annotation, in: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, Springer International Publishing, Cham. pp. 424–432.
Defferrard, M., Bresson, X., Vandergheynst, P., 2016. Convolutional neural networks on graphs with fast localized spectral filtering, in: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc. pp. 3844–3852. URL: https://proceedings.neurips.cc/paper/2016/file/04df4d434d481c5bb723be1b6df1ee65-Paper.pdf.
Ecabert, O., Peters, J., Schramm, H., Lorenz, C., von Berg, J., Walker, M.J., Vembar, M., Olszewski, M.E., Subramanyan, K., Lavi, G., Weese, J., 2008. Automatic model-based segmentation of the heart in CT images. IEEE Transactions on Medical Imaging 27, 1189–1201.
Ecabert, O., Peters, J., Walker, M.J., Ivanc, T., Lorenz, C., von Berg, J., Lessick, J., Vembar, M., Weese, J., 2011. Segmentation of the heart and great vessels in CT images using a model-based adaptation framework. Medical Image Analysis 15, 863–876. doi:10.1016/j.media.2011.06.004.
González Izard, S., Sánchez Torres, R., Alonso Plaza, Ó., Juanes Méndez, J.A., García-Peñalvo, F.J., 2020. Nextmed: Automatic imaging segmentation, 3D reconstruction, and 3D model visualization platform using augmented and virtual reality. Sensors (Basel, Switzerland) 20, 2962.
Gupta, K., Chandraker, M., 2020. Neural mesh flow: 3D manifold mesh generation via diffeomorphic flows. arXiv:2007.10973.
Habijan, M., Babin, D., Galić, I., Leventić, H., Romić, K., Velicki, L., Pižurica, A., 2020. Overview of the whole heart and heart chamber segmentation methods. Cardiovascular Engineering and Technology. doi:10.1007/s13239-020-00494-8.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Identity mappings in deep residual networks, in: Leibe, B., Matas, J., Sebe, N., Welling, M. (Eds.), Computer Vision – ECCV 2016, Springer International Publishing, Cham. pp. 630–645.
Heller, N., Isensee, F., Maier-Hein, K.H., Hou, X., Xie, C., Li, F., Nan, Y., Mu, G., Lin, Z., Han, M., Yao, G., Gao, Y., Zhang, Y., Wang, Y., Hou, F., Yang, J., Xiong, G., Tian, J., Zhong, C., Ma, J., Rickman, J., Dean, J., Stai, B., Tejpaul, R., Oestreich, M., Blake, P., Kaluzniak, H., Raza, S., Rosenberg, J., Moore, K., Walczak, E., Rengel, Z., Edgerton, Z., Vasdev, R., Peterson, M., McSweeney, S., Peterson, S., Kalapara, A., Sathianathen, N., Papanikolopoulos, N., Weight, C., 2021. The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 challenge. Medical Image Analysis 67, 101821. doi:10.1016/j.media.2020.101821.
Isensee, F., Kickingereder, P., Wick, W., Bendszus, M., Maier-Hein, K.H., 2018. Brain tumor segmentation and radiomics survival prediction: Contribution to the BraTS 2017 challenge, in: Crimi, A., Bakas, S., Kuijf, H., Menze, B., Reyes, M. (Eds.), Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Springer International Publishing, Cham. pp. 287–297.
Isensee, F., Maier-Hein, K., 2019. An attempt at beating the 3D U-Net. arXiv preprint.
Khalafvand, S.S., Ng, E., Zhong, L., Hung, T.K., 2012. Fluid-dynamics modelling of the human left ventricle with dynamic mesh for normal and myocardial infarction: Preliminary study. Computers in Biology and Medicine 42, 863–870.
Khalafvand, S.S., Voorneveld, J., Muralidharan, A., Gijsen, F., Bosch, J.G., Walsum, T., Haak, A., Jong, N., Kenjeres, S., 2018. Assessment of human left ventricle flow using statistical shape modelling and computational fluid dynamics. Journal of Biomechanics 74.
Kingma, D., Ba, J., 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations.
Kirillov, A., Wu, Y., He, K., Girshick, R., 2019. PointRend: Image segmentation as rendering.
Kong, F., Shadden, S.C., 2020. Automating model generation for image-based cardiac flow simulation. Journal of Biomechanical Engineering 142, 111011. doi:10.1115/1.4048032.
Lorensen, W.E., Cline, H.E., 1987. Marching cubes: A high resolution 3D surface construction algorithm. SIGGRAPH Comput. Graph. 21, 163–169. doi:10.1145/37402.37422.
Maher, G., Wilson, N., Marsden, A., 2019. Accelerating cardiovascular model building with convolutional neural networks. Medical & Biological Engineering & Computing 57, 2319–2335.
Mittal, R., Seo, J.H., Vedula, V., Choi, Y., Liu, H., Huang, H., Jain, S., Younes, L., Abraham, T., George, R., 2015. Computational modeling of cardiac hemodynamics: Current status and future outlook. Journal of Computational Physics 305.
Ordas, S., Oubel, E., Leta, R., Carreras, F., Frangi, A., 2007. A statistical shape model of the heart and its application to model-based segmentation. Proceedings of SPIE - The International Society for Optical Engineering 6511.
Payer, C., Štern, D., Bischof, H., Urschler, M., 2018. Multi-label whole heart segmentation using CNNs and anatomical label configurations, in: Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges, Springer, Cham. pp. 190–198.
Peng, P., Lekadir, K., Gooya, A., Shao, L., Petersen, S.E., Frangi, A.F., 2016. A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging. Magnetic Resonance Materials in Physics, Biology and Medicine 29, 155–195. doi:10.1007/s10334-015-0521-4.
Peters, J., Ecabert, O., Meyer, C., Kneser, R., Weese, J., 2010. Optimizing boundary detection via simulated search with applications to multi-modal heart segmentation. Medical Image Analysis 14, 70–84. doi:10.1016/j.media.2009.10.004.
Pontes, J.K., Kong, C., Sridharan, S., Lucey, S., Eriksson, A., Fookes, C., 2019. Image2Mesh: A learning framework for single image 3D reconstruction, in: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (Eds.), Computer Vision – ACCV 2018, Springer International Publishing, Cham. pp. 365–381.
Prakosa, A., Arevalo, H.J., Deng, D., Boyle, P.M., Nikolov, P.P., Ashikaga, H., Blauer, J.J.E., Ghafoori, E., Park, C.J., Blake, R.C., Han, F.T., MacLeod, R.S., Halperin, H.R., Callans, D.J., Ranjan, R., Chrispin, J., Nazarian, S., Trayanova, N.A., 2018. Personalized virtual-heart technology for guiding the ablation of infarct-related ventricular tachycardia. Nature Biomedical Engineering 2, 732–740. doi:10.1038/s41551-018-0282-2.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation, in: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Springer International Publishing, Cham. pp. 234–241.
Simard, P., Steinkraus, D., Platt, J.C., 2003. Best practices for convolutional neural networks applied to visual document analysis, in: Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings, pp. 958–963.
Tobon-Gomez, C., Geers, A.J., Peters, J., Weese, J., Pinto, K., Karim, R., Ammar, M., Daoudi, A., Margeta, J., Sandoval, Z., Stender, B., Zheng, Y., Zuluaga, M.A., Betancur, J., Ayache, N., Chikh, M.A., Dillenseger, J., Kelm, B.M., Mahmoudi, S., Ourselin, S., Schlaefer, A., Schaeffter, T., Razavi, R., Rhode, K.S., 2015. Benchmark for algorithms segmenting the left atrium from 3D CT and MRI datasets. IEEE Transactions on Medical Imaging 34, 1460–1473.
Updegrove, A., Wilson, N.M., Shadden, S.C., 2016. Boolean and smoothing of discrete polygonal surfaces. Advances in Engineering Software 95, 16–27. doi:10.1016/j.advengsoft.2016.01.015.
Wang, N., Zhang, Y., Li, Z., Fu, Y., Yu, H., Liu, W., Xue, X., Jiang, Y., 2020a. Pixel2Mesh: 3D mesh model generation via image guided deformation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
Wang, Y., Zhong, Z., Hua, J., 2020b. DeepOrganNet: On-the-fly reconstruction and visualization of 3D/4D lung models from single-view projections by deep deformation network. IEEE Transactions on Visualization and Computer Graphics 26, 960–970.
Wei, M., Shen, W., Qin, J., Wu, J., Wong, T.T., Heng, P.A., 2013. Feature-preserving optimization for noisy mesh using joint bilateral filter and constrained Laplacian smoothing. Optics and Lasers in Engineering 51, 1223–1234. doi:10.1016/j.optlaseng.2013.04.018.
Wei, M., Wang, J., Guo, X., Wu, H., Xie, H., Wang, F.L., Qin, J., 2018. Learning-based 3D surface optimization from medical image reconstruction. Optics and Lasers in Engineering 103, 110–118. doi:10.1016/j.optlaseng.2017.11.014.
Wen, C., Zhang, Y., Li, Z., Fu, Y., 2019. Pixel2Mesh++: Multi-view 3D mesh generation via deformation, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1042–1051.
Wickramasinghe, U., Remelli, E., Knott, G., Fua, P., 2020. Voxel2Mesh: 3D mesh model generation from volumetric data, in: Martel, A.L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, Springer International Publishing, Cham. pp. 299–308.
Wolterink, J.M., Leiner, T., de Vos, B.D., Coatrieux, J.L., Kelm, B.M., Kondo, S., Salgado, R.A., Shahzad, R., Shu, H., Snoeren, M., Takx, R.A.P., van Vliet, L.J., van Walsum, T., Willems, T.P., Yang, G., Zheng, Y., Viergever, M.A., Išgum, I., 2016. An evaluation of automatic coronary artery calcium scoring methods with cardiac CT using the orCaScore framework. Medical Physics 43, 2361–2373. doi:10.1118/1.4945696.
Ye, C., Wang, W., Zhang, S., Wang, K., 2019. Multi-depth fusion network for whole-heart CT image segmentation. IEEE Access 7, 23421–23429.
Zhuang, X., 2013. Challenges and methodologies of fully automatic whole heart segmentation: A review. Journal of Healthcare Engineering 4, 981729. doi:10.1260/2040-2295.4.3.371.
Zhuang, X., Bai, W., Song, J., Zhan, S., Qian, X., Shi, W., Lian, Y., Rueckert, D., 2015. Multiatlas whole heart segmentation of CT data using conditional entropy for atlas ranking and selection. Medical Physics 42, 3822–3833.
Zhuang, X., Li, L., Payer, C., Štern, D., Urschler, M., Heinrich, M.P., Oster, J., Wang, C., Smedby, Ö., Bian, C., Yang, X., Heng, P.A., Mortazi, A., Bagci, U., Yang, G., Sun, C., Galisot, G., Ramel, J.Y., Brouard, T., Tong, Q., Si, W., Liao, X., Zeng, G., Shi, Z., Zheng, G., Wang, C., MacGillivray, T., Newby, D., Rhode, K., Ourselin, S., Mohiaddin, R., Keegan, J., Firmin, D., Yang, G., 2019. Evaluation of algorithms for multi-modality whole heart segmentation: An open-access grand challenge. Medical Image Analysis 58, 101537. doi:10.1016/j.media.2019.101537.
Zhuang, X., Shen, J., 2016. Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Medical Image Analysis 31, 77–87. doi:10.1016/j.media.2016.02.006.
6. Supplementary Material
Intensity normalization and resizing were applied to all 3D image volumes to obtain consistent image dimensions and pixel intensity ranges. We followed the procedures in Kong and Shadden (2020) to normalize the pixel intensity values of each CT or MR image volume such that they ranged from -1 to 1. The 3D image volumes were then resized using linear interpolation to a dimension of 128 × 128 × 128.
Data augmentation techniques were applied during training to improve the robustness of the neural network models to variations of the input images. Specifically, we applied random scaling (−5% to 5%), random rotation (−5° to 5°), and random shearing (−10° to 10°), as well as elastic deformations (Simard et al., 2003), on the input images. For elastic deformations, 16 control points were placed along each dimension of the 3D image volume and were randomly perturbed. The input images were then warped according to the displacements of the control points using B-spline interpolation (a code sketch is provided at the end of this section).
The model parameters were computed by minimizing the total loss function using the Adam stochastic gradient descent algorithm (Kingma and Ba, 2014). The initial learning rate was set to 0.001, while β1 and β2 for the Adam algorithm were set to 0.9 and 0.999, respectively. Point losses were evaluated on the validation data after each training epoch, and the model was saved after an epoch only if the validation point loss had improved. We adopted a learning rate schedule whereby the learning rate was reduced by 20% if the validation point losses had not improved for 10 epochs. The minimum learning rate was 5 × 10^-5. The network was implemented in TensorFlow, and training was conducted on an Nvidia GeForce GTX 1080 Ti graphics processing unit (GPU) until the validation loss converged.
Fig. 13. Comparison of segmentation accuracy for whole heart and individual cardiac structures from different methods. White circles on the boxes indicate mean values across patients. Cardiac structures are sorted based on the accuracy of our method.
Fig. 14. Comparison of segmentation accuracy for whole heart and individual cardiac structures from different methods trained using the small MMWHS training dataset. White circles on the boxes indicate mean values across patients. Cardiac structures are sorted based on the accuracy of our method.
We compared the effect of design-choice changes on the whole-heart reconstruction performance of our method. Namely, we trained another three models while, respectively, using a reduced number of convolutional filters in the image encoding module, using a reduced resolution of the template meshes, or excluding the elastic deformation from our image augmentation techniques. Specifically, the numbers of convolutional filters in the last four residual blocks were reduced from 48, 96, 192, and 384 to 32, 64, 128, and 256, respectively. The number of mesh vertices of the template mesh was reduced from 11,494 to 3,260. As shown in Table 6, reducing the number of filters or template mesh vertices mildly reduced the reconstruction accuracy of the whole heart and of most cardiac structures compared with our final model. However, without elastic deformation augmentation, we observed a significant drop in reconstruction performance. Indeed, elastic deformation augmentation may improve model robustness to minor local perturbations of the anatomical structures, thereby facilitating accurate predictions of detailed cardiac structures.
Table 7 compares the total number of parameters and the prediction times of our method, our method with reduced convolutional filters or mesh vertices, 2D UNet, 3D UNet, and Voxel2Mesh. The prediction time was measured on an Nvidia GeForce GTX 1080 Ti GPU.
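As a concrete illustration of the preprocessing and elastic-deformation augmentation described above, the following sketch uses SciPy interpolation; the displacement scale `sigma` and the use of `ndimage.zoom` with cubic splines to upsample the control-point displacements are illustrative assumptions that approximate, rather than reproduce, the exact B-spline warping in our pipeline.

```python
import numpy as np
from scipy import ndimage

def preprocess(volume, out_size=128):
    """Scale intensities to [-1, 1] and resize with linear interpolation."""
    v = volume.astype(np.float32)
    v = 2.0 * (v - v.min()) / max(v.max() - v.min(), 1e-8) - 1.0
    zoom = [out_size / s for s in v.shape]
    return ndimage.zoom(v, zoom, order=1)  # order=1: linear interpolation

def elastic_deform(volume, n_ctrl=16, sigma=2.0, seed=None):
    """Randomly perturb a coarse control-point grid and warp the volume.

    Displacements on an (n_ctrl)^3 grid are upsampled to a dense field with
    cubic-spline (order=3) interpolation, approximating B-spline warping.
    """
    rng = np.random.default_rng(seed)
    shape = volume.shape
    dense = []
    for _ in range(3):
        ctrl = rng.normal(scale=sigma, size=(n_ctrl,) * 3)
        zoom = [s / n_ctrl for s in shape]
        dense.append(ndimage.zoom(ctrl, zoom, order=3))
    grid = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    coords = [g + d for g, d in zip(grid, dense)]
    return ndimage.map_coordinates(volume, coords, order=1, mode="nearest")

vol = np.random.rand(96, 112, 80)   # stand-in image volume
x = preprocess(vol)
x_aug = elastic_deform(x, seed=0)
print(x.shape, x_aug.shape)         # (128, 128, 128) twice
```

The learning-rate schedule described above matches the behavior of a standard plateau-based scheduler (e.g., Keras's ReduceLROnPlateau with factor=0.8, patience=10, min_lr=5e-5, monitoring the validation point loss), though the exact implementation in our code may differ.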
Table 6. A comparison of prediction accuracy on MMWHS MR and CT test datasets from different variants of our method.

                                           Epi     LA      LV      RA      RV      Ao      PA      WH
CT   Dice
     Final Model                           0.899   0.932   0.940   0.892   0.910   0.950   0.852   0.918
     Reduced Number of Filters             0.893   0.932   0.936   0.888   0.906   0.949   0.847   0.915
     Reduced Number of Vertices            0.842   0.929   0.931   0.888   0.892   0.943   0.837   0.899
     No Elastic Deformation Augmentation   0.546   0.882   0.878   0.801   0.757   0.882   0.552   0.773
     ASSD (mm)
     Final Model                           1.335   1.042   0.842   1.583   1.176   0.531   1.904   1.213
     Reduced Number of Filters             1.404   1.063   0.892   1.523   1.122   0.536   1.744   1.195
     Reduced Number of Vertices            1.768   1.126   1.029   1.664   1.425   0.583   1.768   1.378
     No Elastic Deformation Augmentation   3.055   1.742   1.731   2.479   3.188   1.117   5.823   2.920
MR   Dice
     Final Model                           0.797   0.881   0.922   0.888   0.892   0.890   0.816   0.882
     Reduced Number of Filters             0.813   0.873   0.919   0.888   0.881   0.876   0.789   0.879
     Reduced Number of Vertices            0.774   0.870   0.903   0.887   0.861   0.860   0.792   0.863
     No Elastic Deformation Augmentation   0.487   0.810   0.867   0.795   0.744   0.724   0.413   0.735
     ASSD (mm)
     Final Model                           2.198   1.401   1.183   1.611   1.333   2.648   2.689   1.775
     Reduced Number of Filters             2.053   1.556   1.238   1.488   1.429   2.143   2.205   1.645
     Reduced Number of Vertices            2.405   1.615   1.516   1.561   1.651   2.390   2.222   1.845
     No Elastic Deformation Augmentation   3.794   2.151   2.003   2.976   3.575   3.700   5.166   3.348