Binary segmentation of medical images using implicit spline representations and deep learning
Oliver J.D. Barrowclough a, Georg Muntingh a,∗, Varatharajan Nainamalai b, Ivar Stangeby c

a Department of Mathematics and Cybernetics, SINTEF, Forskningsveien 1, 0373 Oslo, Norway
b The Intervention Centre, Rikshospitalet, Oslo University Hospital, 0372 Oslo, Norway
c Cognite AS, Oksenøyveien 10, 1366 Lysaker, Norway
Abstract
We propose a novel approach to image segmentation based on combining implicit spline representations with deep convolutional neural networks. This is done by predicting the control points of a bivariate spline function whose zero-set represents the segmentation boundary. We adapt several existing neural network architectures and design novel loss functions that are tailored towards providing implicit spline curve approximations. The method is evaluated on a congenital heart disease computed tomography medical imaging dataset. Experiments are carried out by measuring performance in various standard metrics for different networks and loss functions. We determine that splines of bidegree (1, 1) with 128 × 128 coefficient resolution performed optimally for 512 × 512 resolution CT images. For our best network, we achieve an average volumetric test Dice score of almost 92%, which reaches the state of the art for this congenital heart disease dataset.

Keywords:
Implicit spline representations, Shape modelling, Deep learning, Medical imaging, Image segmentation
1. Introduction
Image segmentation is the process of partitioning an image of pixels into different regions according to shared attributes. It is a technology widely used in multiple fields ranging from self-driving vehicles and surveillance to medical imaging; see Minaee et al. (2020) for a recent comprehensive survey on image segmentation using deep learning.

Image segmentation techniques can be divided into three types: manual, semi-automatic, and fully automatic Işın et al. (2016). In manual segmentation, trained experts classify the region of interest one image at a time. An automatic image segmentation followed by a manual threshold, or a manual seeding of a region growing algorithm, is referred to as semi-automatic image segmentation. Fully automatic segmentation, mainly based on machine learning (ML) methods, applies prior knowledge based on learned features to new, unseen images.

Manual segmentation of images is a time-consuming and tedious job, so there is potentially great value in developing methods that can partially or fully automate the process. Semi-automatic and automatic methods for image segmentation date back several decades and include approaches such as thresholding, region growing, classifiers, clustering, Markov random-field models, deformable models, atlas-guided approaches and (artificial) neural networks Pham et al. (2000). In recent years, there has been an explosion in the use of neural networks, especially in the field of computer vision, where they typically significantly outperform other approaches.

∗ Corresponding author
Email addresses: [email protected] (Oliver J.D. Barrowclough), [email protected] (Georg Muntingh), [email protected] (Varatharajan Nainamalai), [email protected] (Ivar Stangeby)
Preprint submitted to GMP 2021, February 26, 2021.

One common feature of image segmentation is that the segmentation boundaries are often smooth, while at the same time exhibiting complex topological behaviour. This is particularly true in the case of medical imagery, but is also typical in many other types of image data. Complex topologies can occur, for example, when slicing a 3D object with a simpler topology, or from objects being partially occluded in photographs. Spline representations are well known in the field of computer-aided design for providing excellent compact representations of smooth geometries. Implicit representations, which represent a curve or a surface by the zero-set of a multivariate function, are well suited for modelling complex topologies and for visualization by ray tracing. By using tensor-product splines as an implicit function, we combine the respective benefits and obtain a representation that compactly models a wide variety of segmentation boundaries.

Splines have previously been used in the context of deep learning, but as far as we are aware, not for implicitly representing segmentation boundaries. For example, Catmull-Rom splines are employed by Ling et al. (2019) to define segmentation boundaries as parametric curves. Deep learning has also been used to compute parametrizations that can be used for parametric spline approximation Laube et al. (2018). One weakness of the parametric approach compared to the implicit approach is that it is difficult to model complex topologies with parametric geometries. Splines have also been used in a slightly different context, to extend convolutional neural networks (CNNs) (see Section 2.2) to irregular and geometric datasets Fey et al. (2018) by defining continuous B-spline kernels.
Similarly, more general continuous spline representations of the weight space have been considered Keskin and Izadi (2018), using splines to compactly model entire filter banks and fully connected layers.

The use of implicit representations in deep learning has appeared in a number of recent works involving a variety of problems. The problem of recovering a 3D model from a single image of an object using implicit representations has been investigated by several authors Xu et al. (2019a); Michalkiewicz et al. (2019), achieving state-of-the-art performance on certain datasets. Implicit representations based on neural networks with periodic activation functions have also been considered for modelling images, videos and sounds along with their derivatives Sitzmann et al. (2020), addressing issues with the level of detail provided by conventional implementations. Other works using level sets of spline functions for the purpose of image segmentation have appeared in the context of evolutionary processes Yang et al. (2006).

Within the field of medical imaging, segmentation is used for studying anatomical structures, estimating the volume of certain tissues, planning treatments and post-treatment follow-ups, and identifying and monitoring the development of tumors, lesions and abnormalities Sharma and Aggarwal (2010). There exist a number of non-invasive imaging techniques that can be subject to segmentation, including ultrasound, magnetic resonance imaging (MRI) and computed tomography (CT) imaging. In this paper, we focus on the problem of segmenting CT images from a dataset of congenital heart disease (CHD) patients Xu et al. (2019b). A CHD is a structural birth defect in the heart, or in blood vessels near the heart, that can disrupt the normal flow of blood.

In this paper we combine implicit representations and deep convolutional neural networks into a new method for image segmentation.
The method is applicable to general image segmentation problems, and we show that it reaches state-of-the-art results on a CHD CT image dataset. The novelties of this paper include:

• A new end-to-end procedure, based on deep convolutional neural networks, for segmenting images with implicitly represented splines that compactly represent smooth and topologically complex segmentation boundaries.

• Several neural network architectures based on existing networks, including truncated VGG-style networks Simonyan and Zisserman (2014) and an adaptive version of UNet Ronneberger et al. (2015). This new network is adaptive both in the sense of being able to take variable size input for fixed output (via repeated application of adaptive average pooling) and in the sense that it can adapt to different gridded basis functions (in our case, tensor-product splines).

• New loss functions that are tailored to the specific problem of modelling implicit splines, by mapping control points to a binary inside-outside mask.

• A parameter study that determines the uniform tensor-product spline spaces with optimal performance for the CHD CT image dataset.

Our implementation of the networks, loss functions, and training and data processing procedures is based on PyTorch. This code is open source and available as a GitHub repository Barrowclough et al. (2020).
2. Background
There have been some attempts to obtain ML-based automatic CT and MRI cardiac image segmentation for the full blood volume or for heart chambers. Xu et al. (2019b) used a UNet-based deep learning network for CHD CT images of 68 volumes and obtained average Dice scores of 0.7843 and 0.773 for blood volume and myocardium, respectively (a 2D UNet for blood volume segmentation, a 3D UNet for chamber and myocardium segmentation). Varatharajan et al. (2020) used the 3D DenseVNet CNN model from Gibson et al. (2018) for the same CHD CT image dataset, and obtained mean Dice scores of 0.9183 and 0.8519 for blood volume and myocardium, respectively. A few authors have used the cardiac CT angiography dataset from the MICCAI 2017 Multi-Modality Whole Heart Segmentation Challenge, consisting of seven structures of the heart: the left ventricle, the myocardium of the left ventricle, the left atrium, the right ventricle, the right atrium, the pulmonary artery and the ascending aorta. Payer et al. (2018) used a 3D UNet architecture model with a bounding box around all heart structures, and obtained a mean Dice score of 0.889 for whole heart CT image segmentation. Wang and Smedby (2018) used a 2D UNet with shape context estimation. Xu et al. (2018) used faster R-CNN and 3D UNet networks for whole heart CT image segmentation. Habijan et al. (2019) used a 3D UNet architecture CNN model with principal component analysis as a data augmentation technique, and obtained an average Dice score of 0.89 for whole heart CT image segmentation.
In this section, we recall the basics of CNNs, necessary for describing the truncated VGG-style Simonyan and Zisserman (2014) and UNet-style Ronneberger et al. (2015) architectures used in this paper.

Given an image I and a kernel (or filter) K, each taking real values on a finite subset of Z², their (discrete) convolution is defined as a new image

O = I ∗ K := ( Σ_m Σ_n K(m, n) I(i − m, j − n) )_{i,j}. (1)

Flipping the minus signs, one obtains a cross-correlation. In our setting, the kernel typically has small support, and the cross-correlation measures, in each location in the image, to what extent the image locally correlates with the kernel. A simple example is the kernel K whose only nonzero values are K(0, 0) = −1 and K(1, 0) = 1, compactly denoted by K = [−1, 1]ᵀ, measuring the forward difference of the image in the first coordinate. The resulting output image O then measures, at each pixel, the presence of a vertical edge (called a feature) in the input image. The values of K are often called weights.

Many neural network libraries implement cross-correlations but call them convolutions, and this is the core ingredient of a CNN. A convolutional layer typically takes many convolutions (with different filters) of the same input image in parallel, with the resulting output images stacked as channels of a triple array, each measuring the presence of a different feature in the input image. The input image can also consist of multiple channels, as is the case for instance for RGB images. In our case, each filter is a triple array, with two spatial dimensions and one channel dimension. Confusingly, it is common practice to omit the channel dimension in describing the size of the filter. For instance, a kernel of size 1 × 1 actually has a single entry for every channel, and it is used to take linear combinations between image channel values in corresponding spatial locations. In particular, in the case of RGB images, the kernel K = [1, 0, 0] ∈ R^{1,1,3} would measure the presence of the “redness feature” in each pixel.

In theory, it is possible to consider a single convolutional layer (shallow learning) with a vast number of filters that together account for any conceivable pattern in the image, but in practice this is not very efficient. The success of CNNs is to instead consider hierarchies of features that build complex features out of simple features (deep learning). For instance, if our first convolutional layer measures the presence of a horizontal line − and a vertical line | in separate channels, then in a subsequent layer the 1 × 1 kernel K = [1, 1] ∈ R^{1,1,2} is able to measure the presence of a +.
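The cross-correlation in (1) (with flipped signs) can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the paper's code; the forward-difference kernel and the step-edge image are the assumed examples:

```python
import numpy as np

def cross_correlate(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image,
    taking elementwise products summed over the kernel support."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(kernel * image[i:i + kh, j:j + kw])
    return out

# The forward-difference kernel from the text: K(0,0) = -1, K(1,0) = 1.
K = np.array([[-1.0], [1.0]])
# A toy image with a step edge between rows 1 and 2.
I = np.array([[0., 0., 0.],
              [0., 0., 0.],
              [1., 1., 1.],
              [1., 1., 1.]])
O = cross_correlate(I, K)  # responds exactly where the edge is
```

The output is nonzero only in the row straddling the edge, which is the "feature detection" interpretation described above.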
Being linear maps, simply composing filters just yields another filter, typically with fewer degrees of freedom than the sum of those of the individual filters. Instead, features of higher complexity are obtained by adding a non-linearity (an activation function) in-between subsequent layers, such as the Rectified Linear Unit (ReLU, defined as [x > 0] · x, i.e., max(0, x)) or the hyperbolic tangent (tanh).

Instead of hand-crafting these filters (called feature engineering), deep learning obtains them by minimizing a loss function expressing the discrepancy between the network prediction and a given label (the ground truth). The loss function is often minimized using a stochastic variant of gradient descent, where the gradient of the loss for the entire dataset is estimated by the loss for a representative subset (mini-batch). Such noisy gradients are desirable for escaping local minima and, more importantly, saddle points, which are prevalent in high-dimensional weight space. This gradient is then distributed across the weights using the chain rule, in a process called back-propagation. When ‘unfolding’ the gradient in this manner, earlier layers can sometimes receive too little (vanishing gradient) or too much (exploding gradient). This is resolved by adding a batch normalization layer in between convolutional layers, which shifts the distribution of a batch of inputs to each layer by applying a normalizing filter. Another technique for making the gradient propagate more effectively through the network is skip connections, which connect early layers (almost) directly to deeper layers (cf. Figure 2, bottom).

While deep hierarchical representations dramatically reduce the number of filters required for solving many tasks, still an enormous number of activation function values (activations) need to be stored in memory for learning these filters, especially when images are processed in parallel in (mini-)batches.
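The claim that composing filters without a non-linearity just yields another filter can be checked numerically. A minimal 1D sketch (the signal and kernels are hypothetical examples, not from the paper):

```python
import numpy as np

def relu(x):
    # ReLU written as in the text: [x > 0] * x.
    return (x > 0) * x

x = np.array([0., 1., 3., 2., 0.])
# Applying the forward-difference filter twice...
d1 = x[1:] - x[:-1]              # first difference
d2 = d1[1:] - d1[:-1]            # difference of the difference
# ...is the same as a single pass with the composed kernel [1, -2, 1]:
d2_direct = x[2:] - 2 * x[1:-1] + x[:-2]
assert np.allclose(d2, d2_direct)

# With a ReLU in between, the composition is no longer a single linear
# filter, so genuinely new features become expressible:
nd2 = relu(d1)[1:] - relu(d1)[:-1]
```

Here `nd2` differs from `d2`, illustrating why activation functions between layers increase expressiveness.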
In order to make deep networks feasible on existing hardware, the spatial resolution can be reduced. This is further justified by effectively increasing the receptive field of deeper features, referring to the pixels in the input image that are involved in its computation. In addition, it makes the features invariant to local perturbations, which is often considered a positive side effect.

One method for downscaling the spatial resolution is pooling spatial groups of features. The standard example involves dividing the image spatially into blocks, and replacing each block by the maximum (max pooling) or average (average pooling) of its feature values. This can be done either directly by specifying the block size, or adaptively by specifying the desired output resolution. Alternatively, the spatial resolution can be reduced using strided convolutions, which compute (1) only at pixels (i, j) in some lattice in Z², for instance 2Z × 2Z.

Repeated down-scaling is useful in object detection, where one desires a single scalar output interpreted as the probability of the presence of an object in the input image. However, for segmentation, an output resolution comparable to the input resolution is required. Observe that all convolutions are linear maps, with sparsity pattern determined by the support of the kernel and the stride. Hence the adjoint of a strided convolution (i.e., the linear map with transposed matrix) maps to a higher-dimensional space corresponding to a larger image. These learnable up-scalings are called up-convolutions (or transposed convolutions or convolutions with fractional stride). The UNet incorporates these convolutional layers, first in a contracting path with d down-scalings (the depth) resulting in a layer with the (bottleneck) b × b spatial resolution, followed by an expanding path involving d up-scalings. More detail is provided in Section 3.2.
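Adaptive pooling, which fixes the output resolution rather than the block size, can be sketched as follows. This is a NumPy stand-in for the behaviour of an adaptive average pooling layer (the bin computation mimics the floor/ceil splitting used by PyTorch's `AdaptiveAvgPool2d`, an assumption rather than the paper's code):

```python
import numpy as np

def adaptive_avg_pool2d(x, out_h, out_w):
    """Average-pool a 2D array to a fixed (out_h, out_w) resolution,
    regardless of the input resolution. Output cell (i, j) averages the
    input pixels in bin [floor(i*H/out_h), ceil((i+1)*H/out_h))."""
    in_h, in_w = x.shape
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        r0, r1 = (i * in_h) // out_h, -((-(i + 1) * in_h) // out_h)
        for j in range(out_w):
            c0, c1 = (j * in_w) // out_w, -((-(j + 1) * in_w) // out_w)
            out[i, j] = x[r0:r1, c0:c1].mean()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
y = adaptive_avg_pool2d(x, 2, 2)  # each output entry averages a 2x2 block
```

Because only the output size is specified, the same layer accepts variable-size inputs, which is what makes the UNetImplicit network described later adaptive.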
Previously, segmentation data has been discretely represented as boundary polygons Castrejón et al. (2017) or as graphs Acuna et al. (2018). Binary segmentation masks, or full segmentation maps with resolution corresponding to the input image, have also been considered Ronneberger et al. (2015). In the situation that the underlying topology is known, active contouring has also been used for boundary segmentation Aubert et al. (2003).

Figure 1: A CT slice segmentation represented as an implicit function, here shown with outside as negative for visualization purposes.

Alternatively, a smooth geometric representation for segmentation boundaries of unknown topology can be provided by implicit functions F. Here the boundary is not modeled using an explicit parametrization, but implicitly as the points (x, y) in the plane for which F(x, y) = 0. In addition, the points (x, y) inside the segmentation satisfy F(x, y) < 0, while those outside the segmentation satisfy F(x, y) > 0; sometimes this convention is reversed, see Figure 1.

The function F can be modelled in many ways, including with multivariate polynomials, radial basis functions, (signed) distance fields and with splines. In this paper, we choose such implicit functions from appropriate tensor-product spline spaces, which, due to their gridded nature, are well suited for representing smooth geometries implicitly as the output of fully convolutional neural networks. The implicit spline representation has the following additional features:

• Representation: It is compact, capable of representing complex topology, and has a built-in degree-dependent continuity establishing a smoothness prior on the segmentation. It can also represent features of arbitrarily small size, and is thus not limited by pixel resolution.
• Processing: Inside/outside computations are reduced to mere function evaluations, allowing for instance for swift volume computation; manipulation of the shape and comparison between shapes is efficient in terms of the spline control net; derivatives, tangent vectors and normal vectors are readily computed from the implicit form, yielding offsets and confidence bounds for the modelled shapes.
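The inside/outside and area (volume) computations mentioned above reduce to sign checks of function evaluations. A minimal sketch, using the unit circle as a hypothetical implicit function with the inside-negative convention from the text:

```python
import numpy as np

# A toy implicit function: the unit circle, F(x, y) = x^2 + y^2 - 1,
# with F < 0 inside and F > 0 outside.
def F(x, y):
    return x * x + y * y - 1.0

# Inside/outside tests are mere function evaluations; the enclosed area
# follows from the fraction of negative samples on a grid.
n = 1000
xs = np.linspace(-2, 2, n)
X, Y = np.meshgrid(xs, xs)
inside = F(X, Y) < 0
area = inside.mean() * 16.0  # fraction of the [-2, 2]^2 box, of area 16
```

For the circle, the grid estimate converges to the true area π as the grid is refined; the same sign-counting applies verbatim to the spline functions used later.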
Bivariate tensor-product splines are widely used for smooth surface approximation, due to their high approximation order and intuitive control net-based representation. For a given degree p and number of basis functions O (to be used for the output representation), consider a non-decreasing sequence (t_i)_{i=1}^{O+p+1}, whose entries are called knots. The constant B-spline basis functions B_{1,0}, ..., B_{O,0} are defined by

B_{i,0}(x) := 1 if t_i ≤ x < t_{i+1}, and 0 otherwise.

The higher-degree B-spline basis functions B_{i,p}(x) are defined recursively as

B_{i,p}(x) = (x − t_i)/(t_{i+p} − t_i) · B_{i,p−1}(x) + (t_{i+p+1} − x)/(t_{i+p+1} − t_{i+1}) · B_{i+1,p−1}(x),

with the convention that 0/0 evaluates to 0.

For a fixed degree p and (open) knot vector

t_1 = · · · = t_{p+1} < t_{p+2} < · · · < t_O < t_{O+1} = · · · = t_{O+p+1},

the set B := {B_{i,p}(x) : 1 ≤ i ≤ O} forms a basis of the vector space of C^{p−1}-smooth functions on the interval [t_1, t_{O+p+1}] restricting to polynomials of degree p on the intervals [t_i, t_{i+1}].

Using the tensor-product construction, the above univariate B-spline basis gives rise to a bivariate B-spline basis

B ⊗ B := {B_{i,p}(x) B_{j,p}(y) : 1 ≤ i, j ≤ O}

of the vector space of C^{p−1,p−1}-smooth functions restricting to polynomials of bidegree (p, p) on the rectangles [t_i, t_{i+1}] × [t_j, t_{j+1}].

In this paper we consider several different bidegrees, but due to the nature of the data, we restrict our attention to symmetric bidegrees of the form (p, p). We also consider different resolutions O × O of output spline coefficients, focusing on the cases O = 64, 128, and using open uniform knot vectors

(t_i)_{i=1}^{O+p+1} = [0, ..., 0 (p+1 times), 1, 2, ..., O − p − 1, O − p, ..., O − p (p+1 times)].
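The recursion above (the Cox-De Boor recursion, named in Section 3.3) can be written directly as code. A minimal sketch, not the paper's implementation, using a toy degree-1 open uniform knot vector with O = 4:

```python
def bspline_basis(i, p, t, x):
    """Cox-De Boor recursion for B_{i,p}(x) on knot vector t
    (0-indexed i here), with the 0/0 := 0 convention from the text."""
    if p == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    v = 0.0
    if t[i + p] > t[i]:
        v += (x - t[i]) / (t[i + p] - t[i]) * bspline_basis(i, p - 1, t, x)
    if t[i + p + 1] > t[i + 1]:
        v += (t[i + p + 1] - x) / (t[i + p + 1] - t[i + 1]) * \
             bspline_basis(i + 1, p - 1, t, x)
    return v

# Open uniform knot vector for degree p = 1 and O = 4 basis functions,
# following the construction in the text: O + p + 1 = 6 knots.
p, O = 1, 4
t = [0, 0, 1, 2, 3, 3]
# The O basis functions form a partition of unity on [0, O - p):
x = 1.3
s = sum(bspline_basis(i, p, t, x) for i in range(O))
```

The partition-of-unity property (the basis values sum to 1 at every parameter) is what later makes constant coefficient grids evaluate to constant functions.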
Any C^{p−1,p−1}-smooth piecewise polynomial of bidegree (p, p) on this knot vector takes the form

F(x, y) = Σ_{i=1}^{O} Σ_{j=1}^{O} c_{i,j} B_{i,p}(x) B_{j,p}(y), (2)

for certain spline coefficients c_{i,j} arranged in a rectangular grid of control points, also known as a control net. Our implementation works with batches of such coefficient grids, arranged as a triple array C = (c_{b,i,j})_{b=1,i=1,j=1}^{B,O,O} for a batch of size B.
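Evaluating the tensor-product form (2) amounts to a bilinear combination of the coefficient grid with univariate basis values. A self-contained sketch (toy 4 × 4 coefficient grid, not from the paper):

```python
import numpy as np

def bspline_basis(i, p, t, x):
    # Cox-De Boor recursion with the 0/0 := 0 convention (0-indexed i).
    if p == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    v = 0.0
    if t[i + p] > t[i]:
        v += (x - t[i]) / (t[i + p] - t[i]) * bspline_basis(i, p - 1, t, x)
    if t[i + p + 1] > t[i + 1]:
        v += (t[i + p + 1] - x) / (t[i + p + 1] - t[i + 1]) * \
             bspline_basis(i + 1, p - 1, t, x)
    return v

def eval_implicit_spline(C, p, t, x, y):
    """Evaluate F(x, y) = sum_{i,j} c_ij B_i(x) B_j(y), as in (2)."""
    O = C.shape[0]
    Bx = np.array([bspline_basis(i, p, t, x) for i in range(O)])
    By = np.array([bspline_basis(j, p, t, y) for j in range(O)])
    return Bx @ C @ By

# Toy 4x4 coefficient grid on the open uniform knot vector [0,0,1,2,3,3].
p, t = 1, [0, 0, 1, 2, 3, 3]
C = np.ones((4, 4))  # constant coefficients should reproduce F = 1
val = eval_implicit_spline(C, p, t, 1.3, 0.5)
```

By the partition-of-unity property in both directions, the constant coefficient grid evaluates to exactly 1 everywhere in the parameter domain.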
3. Methodology
In this paper, we propose a method for image segmentation by combining implicit spline representations with fully convolutional neural networks. Due to the gridded nature of the coefficients, tensor-product splines are well suited for representing smooth geometries implicitly as the output of fully convolutional neural networks.
The medical image CT dataset we consider consists of 66 volumes that capture images of and around patients’ hearts Xu et al. (2019b). The data has resolution 512 × 512 with a varying number of slices, from 130–340, in the z-direction. The pixel spacing is 0.25 mm. The volumes are processed as 2D slices, both for the image and mask volumes.

Our initial experiments involved an extra step, where we computed new spline coefficient ground truths as approximations of the original mask data. To do this, we performed a weighted least-squares approximation on a signed distance field computed directly from the mask using a fast marching method. However, this approach made experimentation cumbersome, as each change in spline resolution or degree required a computationally heavy recompute of distance fields and spline approximations. Eventually, we abandoned this approach in favour of one that includes spline evaluation as part of the loss function. In this way the masks provide the ground truth, despite the network outputting spline coefficients of a lower resolution. This approach will be described in more detail in Section 3.3.

Figure 2: The VGG-Implicit-128 (top left), VGG-Implicit-64 (top right) and UNetImplicit with depth d = 3 (bottom) network architectures, mapping CT slices to a spline coefficient grid. The yellow blocks are convolutional layers, with 3 × 3 kernel and stride 1 followed by batch normalization and a ReLU (ochre), or with 1 × 1 kernel and tanh (green) or linear activation, the latter in the final block. The light blue blocks are up-convolutions. Each (up-)convolutional block is labelled with its number of filters (upright) and output spatial resolution (slanted). Each orange block is a 2 × 2 max pooling, while the dark blue blocks are adaptive average poolings.

We have experimented with three different neural network architectures, adapted from existing models. These are shown in Figure 2. We summarize the networks as follows:

• VGG-Implicit-128: This network is a simple truncation of the VGG-16 network from Simonyan and Zisserman (2014) after the third convolutional block. Two new convolutional layers are then added to reduce the number of channels first from 256 to 16, and then from 16 to one, the first of which has batch normalization and tanh activations. These added convolutional layers use a kernel of size 1 × 1, introducing a non-linear combination of the features in each location to predict the value of the corresponding spline control point. Given that this network is heavily truncated, the extra non-linear activation is used to slightly increase the expressiveness of the network. VGG-Implicit-128 is a very compact network, which can be used for fast evaluation. The output resolution is constrained to be one quarter of the size of the input in each direction, so 512 × 512 input gives an output grid of 128 × 128 coefficients.

• VGG-Implicit-64: This network is similarly truncated, but this time after the fourth convolutional block. Again a convolutional layer with batch normalization and tanh activation is added to reduce the number of channels from 512 to 32, followed by a final convolutional layer that reduces the number of channels to 1. Both these layers use 1 × 1 kernels. This provides a deeper network with a larger receptive field. The output resolution of VGG-Implicit-64 is constrained to be one eighth of the size of the input in each direction, so 512 × 512 input gives an output grid of 64 × 64 coefficients.

• UNetImplicit: To further increase the receptive field and depth of the network, we also adapt the UNet architecture from Ronneberger et al. (2015) to our setting. Since, in principle, we are interested in arbitrary resolution input (images) but fixed resolution output (spline coefficients), we have implemented changes that make the network adaptive. This is done through several applications of adaptive average pooling layers.

In the original UNet paper, ‘copy and crop’ skip connections are used to attach the outputs of the convolutional blocks on the contracting path to the corresponding blocks on the expansive path. In UNetImplicit, we use adaptive average pooling instead of cropping to allow for freedom of choice in the output resolution. We also add an adaptive average pooling layer (of size b × b) at the bottleneck of the UNet, before the first up-convolution. Each up-convolution on the expansive path then doubles the resolution until we reach the desired output resolution. In order to adapt this network to different output resolutions, we can vary both the bottleneck size b and the depth d, which we define as the number of down-scalings (max poolings), which we always keep the same as the number of up-scalings (up-convolutions).
For example, setting b = 8 and d = 4, an input of 512 × 512 will produce an output of 128 × 128. To reduce the output to 64 × 64, we can either set b = 4 or set d = 3. Another change from the original UNet is that padding is used throughout to ensure consistent down- and up-scaling of the input tensor. We also experiment with reducing the number of filters in each convolutional layer, which vastly reduces the total number of training parameters in the network.

In addition to these networks, we experimented with bottleneck architectures of encoder-decoder type, but these yielded sub-optimal results, likely due to a lack of spatial awareness. Further details of the network implementation can be found in the code repository Barrowclough et al. (2020).

Because the networks we use are designed to output a specific spline coefficient resolution O (in each direction) that is smaller than the ground truth mask resolution I = 512, the first step in each loss function is to evaluate the spline specified by the predicted coefficients at uniformly spaced parameters. This makes it possible to compare arrays of the same shape. The general approach to spline evaluation as presented in Section 2.4 can be implemented as a recursive algorithm, known as the Cox-De Boor algorithm. However, spline evaluation can also be reduced to a simple matrix multiplication, which is much easier to implement in a way that is compatible with the automatic differentiation used to compute the gradient during back-propagation. This is particularly simple for open uniform knot vectors and repeated evaluation of the same points. To implement this evaluation, we first precompute a single univariate collocation matrix (assuming both the spline coefficient arrays and the masks have identical length in each direction):

U := (B_k(s_i))_{i=1,k=1}^{I,O}, s_i := (O − p)(i − 1)/(I − 1), i = 1, ..., I.
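Assembling U amounts to evaluating each basis function at the uniformly spaced parameters s_i. A small-scale sketch (toy I = 16, O = 4, p = 1, not the paper's resolutions; the clamping of the right endpoint is an implementation assumption needed because the degree-0 basis functions are half-open):

```python
import numpy as np

def bspline_basis(k, p, t, x):
    # Cox-De Boor recursion with 0/0 := 0 (0-indexed k).
    if p == 0:
        return 1.0 if t[k] <= x < t[k + 1] else 0.0
    v = 0.0
    if t[k + p] > t[k]:
        v += (x - t[k]) / (t[k + p] - t[k]) * bspline_basis(k, p - 1, t, x)
    if t[k + p + 1] > t[k + 1]:
        v += (t[k + p + 1] - x) / (t[k + p + 1] - t[k + 1]) * \
             bspline_basis(k + 1, p - 1, t, x)
    return v

def collocation_matrix(I, O, p):
    """U[i, k] = B_k(s_i) at s_i = (O - p)(i - 1)/(I - 1), i = 1..I,
    on the open uniform knot vector from the text."""
    t = [0] * (p + 1) + list(range(1, O - p)) + [O - p] * (p + 1)
    U = np.zeros((I, O))
    for i in range(I):
        s = (O - p) * i / (I - 1)
        s = min(s, (O - p) - 1e-12)  # clamp the right endpoint
        for k in range(O):
            U[i, k] = bspline_basis(k, p, t, s)
    return U

U = collocation_matrix(I=16, O=4, p=1)
# Each row sums to 1 (partition of unity), so a constant coefficient
# grid evaluates to a constant image.
```

In practice this matrix is precomputed once (as described below) and reused for every spline evaluation during training.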
This collocation matrix is computed once on the CPU before training, after which it is passed to the GPU for all subsequent evaluations.
Remark.
For certain special combinations of the input resolution I, output resolution O and degree p, the collocation matrix has a repeating structure. In these cases, the spline evaluation becomes a special case of an up-convolution in which the weights are cardinal B-splines evaluated at uniformly spaced intervals. We have chosen to use a precomputed collocation matrix rather than adding an up-convolutional layer with fixed weights for the reasons described in Section 2.3, and to allow for flexible combinations of I, O, and p.

The actual application of the collocation matrix U to the evaluation of a batch of B bivariate spline coefficient arrays, represented as a triple array C ∈ R^{B,O,O}, is done using a two-stage application of Einstein summation, which is available in PyTorch. The first application creates an intermediate tensor Z̃ ∈ R^{B,I,O} as

Z̃_{mil} = Σ_k U_{ik} C_{mkl},

followed by creation of the final tensor Z ∈ R^{B,I,I} as

Z_{mij} = Σ_l U_{jl} Z̃_{mil}.

Here m indexes a single element in the current batch of predicted coefficient arrays. Note that applying the tensor contraction in two stages via use of a univariate collocation matrix is necessary to ensure both memory and computational efficiency.

Once the predicted tensor C has been converted to a new tensor Z that matches the dimensions of the batch of ground truth masks, we can apply a number of different loss functions. The simplest approach is to apply traditional loss functions such as mean absolute error (MAE) loss or mean squared error (MSE) loss. Since we are interested in the zero set of the spline function, it is natural to first transform the mask to contain values in {−1, 1} before applying MAE or MSE loss. We thus name our loss functions mask-MAE (MMAE) and mask-MSE (MMSE), in order to distinguish them from a direct MAE/MSE loss on the coefficients.
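The two-stage Einstein summation above can be sketched in NumPy (`np.einsum` mirrors PyTorch's `torch.einsum`); the batch size and resolutions here are small hypothetical values for illustration:

```python
import numpy as np

# Batch of bivariate spline coefficient grids C (B x O x O) evaluated
# with a univariate collocation matrix U (I x O).  Contracting one
# direction at a time keeps the intermediate tensor at size B x I x O.
B, I, O = 2, 8, 4
rng = np.random.default_rng(0)
U = rng.random((I, O))
C = rng.random((B, O, O))

Zt = np.einsum('ik,mkl->mil', U, C)   # ~Z_{mil} = sum_k U_{ik} C_{mkl}
Z = np.einsum('jl,mil->mij', U, Zt)   #  Z_{mij} = sum_l U_{jl} ~Z_{mil}

# Sanity check: identical to the one-shot double contraction.
Z_direct = np.einsum('ik,jl,mkl->mij', U, U, C)
assert np.allclose(Z, Z_direct)
```

The one-shot contraction gives the same values but, done naively at full resolution, would involve a far larger intermediate operator, which is the memory-efficiency point made above.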
After mapping the ground truth mask Y ∈ {0, 1}^{I,I} to Ŷ := 2Y − 1 ∈ {−1, 1}^{I,I}, these loss functions, with mean reduction over the entire tensor, can be given as

L_MMAE := mean(|Z − Ŷ|) and L_MMSE := mean((Z − Ŷ)²),

respectively. Here, all tensor operations are applied elementwise.

Alternatively, we can base the loss functions on relevant metrics that are used for image segmentation, such as the Jaccard index, the Dice similarity coefficient, and the accuracy. The predicted binary segmentation is readily deduced from the predicted implicit form as the sign of the evaluated implicit form in each pixel. Hence, the resulting segmentations can be evaluated on a pixel-by-pixel basis; in each pixel, the prediction can be classified into four different categories as true positive (TP), true negative (TN), false positive (FP) and false negative (FN). The Jaccard index (Jaccard), also known as the intersection over union, is defined as the intersection between the predicted image and the manual reference segmentation divided by their union, that is:

Jaccard := TP / (TP + FP + FN). (3)

The Dice similarity coefficient (Dice) is a measure of the spatial overlap between the predicted image and the manual reference segmentation, written as:

Dice := 2TP / (2TP + FP + FN). (4)

The accuracy (Accuracy) is a measure of the closeness between the predicted image and the manual reference segmentation, written as:

Accuracy := (TP + TN) / (TP + FP + FN + TN). (5)

Based on these metrics, we define three new loss functions L_Jaccard, L_Dice, and L_Accuracy. These definitions are compatible with the automatic differentiation used during back-propagation. The first stage in computing these losses is to transform the tensor Z into a tensor Ẑ containing only values close to zero and one:

Ẑ := (1/2)(Z/(ε + |Z|) + 1),

where ε is a small number (typically 0.0001) used to avoid numerical issues with potential division by zero. We can now define the losses as:

L_Jaccard := 1 − sum(YẐ)/sum(Y + Ẑ − YẐ),
L_Dice := 1 − sum(2YẐ)/sum(Y + Ẑ), (6)
L_Accuracy := 1 − sum(1 − Y − Ẑ + 2YẐ)/sum(1),

where sums are taken over the entire tensor and 1 denotes the tensor with dimensions identical to Y and Ẑ and all entries equal to 1.

For training the models, we used an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz (8 cores), 64GB RAM, and an NVIDIA GeForce GTX 1080 Ti with 11GB of VRAM. We chose a batch size of 10, in order to have a consistent batch size between experiments. The limiting factor for the batch size was the GPU memory available for the higher resolution inputs and outputs. The networks were trained with a stochastic gradient descent optimizer with Nesterov momentum Sutskever et al. (2013) of 0.9 and a learning rate of 0.001. These values were set after some early experimentation to determine parameters that generally worked well. The loss functions used were as described in the previous section.
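The mask losses above — MMAE, MMSE, and the metric-based losses (6) built on the soft binarization Ẑ — can be sketched in NumPy as follows (a toy 2 × 2 mask, not the paper's code; under these definitions the mask value 1 corresponds to positive Z):

```python
import numpy as np

def soft_binarize(Z, eps=1e-4):
    # ~sign map from the text: Z/(eps + |Z|), rescaled from [-1,1] to [0,1].
    return 0.5 * (Z / (eps + np.abs(Z)) + 1.0)

def losses(Z, Y, eps=1e-4):
    """Mask losses: Z is the evaluated spline prediction, Y the {0,1}
    ground-truth mask (mask 1 <-> positive Z via Yhat = 2Y - 1)."""
    Yhat = 2.0 * Y - 1.0                      # mask mapped to {-1, 1}
    mmae = np.mean(np.abs(Z - Yhat))
    mmse = np.mean((Z - Yhat) ** 2)
    Zhat = soft_binarize(Z, eps)
    jac = 1.0 - np.sum(Y * Zhat) / np.sum(Y + Zhat - Y * Zhat)
    dice = 1.0 - np.sum(2.0 * Y * Zhat) / np.sum(Y + Zhat)
    acc = 1.0 - np.sum(1.0 - Y - Zhat + 2.0 * Y * Zhat) / Y.size
    return mmae, mmse, jac, dice, acc

# Perfect prediction: Z = +1 on the mask, -1 off it -> all losses ~ 0.
Y = np.array([[1., 1.], [0., 0.]])
Z = 2.0 * Y - 1.0
mmae, mmse, jac, dice, acc = losses(Z, Y)
```

Every operation here is elementwise or a sum, which is what keeps the definitions compatible with automatic differentiation.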
4. Experiments and results
We evaluate our approach under several different metrics in order to compare and determine optimal parameters for this dataset. The data parameters we consider are bidegree (p, p) and output resolution O, as well as parameters that determine the architecture of the UNetImplicit network (bottleneck size b = 4, 8, number of spatial down- and up-scalings d = 3, 4, 5, and number of filters per convolutional layer). We also compare our approach with state-of-the-art results.

We now turn our attention to the metrics used for evaluating our networks. Above we have described the accuracy, Dice index and Jaccard index. Contrary to the Dice and Jaccard scores, the accuracy is symmetric in true positives and true negatives. However, the dataset we consider contains large regions of negatives, meaning that in many cases high accuracy scores can be achieved just by predicting blank masks. Thus the high accuracies achieved should only be considered relative to other methods implemented on this dataset.

Next we consider another evaluation metric not mentioned in Section 3.3, as its computation is too slow for effective use as a loss function. For any subsets Y, Z of a metric space with metric d, one defines the Hausdorff distance (HD) as

HD(Y, Z) = max[ sup_{y ∈ Y} inf_{z ∈ Z} d(y, z),  sup_{z ∈ Z} inf_{y ∈ Y} d(y, z) ],   (7)

where sup and inf represent the supremum and infimum, respectively; see Karimi and Salcudean (2020). In our experiments, for the Euclidean distance d of R³, we compute the (3D) Hausdorff distance HD(Y, Z) of

Y := {(i, j, ℓ) : Yℓij = 1}  and  Z := {(i, j, ℓ) : Fℓ(i, j) < 0},

where Yℓ is the mask of the ℓ-th layer in a single volume and Fℓ is the tensor-product spline function defined by the predicted spline coefficients Cℓ on that layer. Smaller values of the Hausdorff distance correspond to better segmentation accuracy.
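For small point sets, the Hausdorff distance of (7) can be computed by brute force from the matrix of pairwise distances. The sketch below is our own illustrative code, not the evaluation pipeline used in the experiments; for full-resolution masks one would use chunking or a spatial data structure, since the pairwise matrix grows as |Y|·|Z|:

```python
import numpy as np

def hausdorff_distance(A, B):
    # Symmetric Hausdorff distance between two finite point sets in R^n:
    # max of the two directed distances sup_a inf_b d(a,b) and sup_b inf_a d(a,b).
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # D[i, j] = Euclidean distance between A[i] and B[j]
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```

Note that the maximum over both directed distances is what makes the quantity symmetric, and that a single outlying false positive immediately inflates the result, as discussed below.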
However, it should be noted that even small regions of false positives can cause large Hausdorff distances, if the false positive is far from the ground truth in pixel space.

Figure 3: Randomly selected images that show the variability of the test dataset, together with manually labelled masks and predictions using UNetImplicit with depth d = 4, bottleneck size b = 8, and degree p = 1, trained for 100 epochs.

We have performed a hyperparameter study to identify the contribution of the individual network hyperparameters to the performance of the network, as well as their optimal values. The hyperparameters considered were the spline degree p, network depth d, bottleneck size b (or equivalently output resolution O = b · 2^d), as well as the number of filters in the first layer (which we use to determine the number of filters in subsequent layers).

We observed that in almost all training runs, the validation curve had flattened out after 100 epochs. In addition, the validation curve did not significantly increase during any of our training runs, indicating that any overfitting is minimal. We thus performed most of our experiments by training the networks for 100 epochs, which takes approximately 11–15 hours on our hardware, depending on the network configuration. Our early experiments showed that UNetImplicit generally provided better results than the VGG-inspired networks. The UNetImplicit network can be defined with different depth d and bottleneck size b, and we performed a study to determine the optimal parameters. In Table 1, we summarize our experiments on the performance of UNetImplicit under these different parameters. Note that varying the depth and bottleneck size changes the coefficient output resolution.
In this table, we also consider the performance over different B-spline bidegrees p = 0, 1, 2. Note that the scores presented in this table are based on the validation dataset and correspond to scores computed over a batch of input layers. Hence they are not directly comparable with the scores presented in Tables 2 and 4, which are taken over entire volumes. The timings presented in the table are averages and standard deviations of per-layer inference runtimes, when performed with a batch size of one. To compute these timings we utilized the same hardware as described in Section 3.4.

The results of Table 1 show that the UNetImplicit network with d = 4, b = 8, and p = (1, 1) performs better than the other networks. This network also performs better than both the VGG-inspired networks when tested with corresponding parameters, see Table 4. We have thus selected UNetImplicit with these parameters as the best model for this dataset.

Table 1: A hyperparameter study of the performance (highest emphasized) for various UNetImplicit network architectures trained with the L_Dice loss (6) for 100 epochs, for input resolution I = 512, output coefficient resolutions O = 64, 128, depths d = 3, 4, 5, and spline degrees p = 0, 1, 2. All scores are for batches of validation data.

Figure 4: Comparison of (a) manual segmentation and (b) UNetImplicit network prediction for Volume 18, which has the highest Dice score in the test dataset.

The box plots in Figure 6 show the distribution of scores for the different networks, including the minimum, lower quantile, median, upper quantile and maximum scores over the test volumes. They confirm that, in general, UNetImplicit does indeed perform significantly better than the VGG-inspired networks. However, it should be noted that the lowest accuracy and the second-lowest Dice and Jaccard scores were obtained using UNetImplicit.
This suggests that in some edge cases the VGG networks can perform better. It is interesting to note that the VGG-Implicit network with output resolution O = 64 on average performs slightly better than the variant with O = 128, despite its output resolution being half as large. This could be explained by the fact that the former includes one convolutional block more than the latter, resulting in a deeper network with a larger receptive field.

After training for 320 epochs, we managed to improve the results for UNetImplicit with d = 4, b = 8 even further on batches of the validation data. The results of this network for each individual test volume are shown in Table 2. For these results, the predicted 2D images corresponding to each CT volume are combined to make full 3D volumes. Figures 4 and 5 show 3D views of the volumes with the best Dice score (0.9670) and the worst Dice score (0.8234), respectively.

Figure 5: Comparison of (a) manual segmentation and (b) UNetImplicit network prediction for Volume 40, which has the lowest Dice score in the test dataset.

Figure 6: For various networks (VGG-Implicit, UNetImplicit) and metrics, box plots of the values of these metrics for the volumes in the test data set: (a) Vol. accuracy, (b) Vol. Dice, (c) Vol. Jaccard, (d) Vol. Hausdorff.

The lowest Dice score and highest Hausdorff distance were observed for Volume 40 in the test dataset. However, in this case, the poor scores can be explained by the fact that the manual segmentation has not included all branches of the pulmonary artery, as shown in Figure 5(a). Since our model is trained with CT images that include the blood vessels of the pulmonary artery, these are picked up in the predictions, see Figure 5(b). Hence, the network exhibits over-segmentation, and the Dice score and Hausdorff distance suffer accordingly. We also observed some disconnected components in the predicted 3D model. Since the network is trained and tested on 2D images, some of the slices with few positives exhibit more variability in the quality of the predictions, which may be the cause of these disconnected components.

We performed some experiments to examine the effect of reducing the number of filters. For the sake of brevity we avoid presenting detailed results here, but we observed that halving the number of filters at each convolution led to a modest reduction in the overall score. Reducing to one quarter of the original number of filters reduced the score further, but the results could still be considered reasonable. Given that reducing the number of filters vastly reduces the size of the networks, doing so at a minor expense of accuracy could be considered beneficial in some applications (e.g. real-time segmentation for dynamic visualization).
Remark. Note that the results in this section are obtained before applying many of the standard “tricks” for boosting performance. In particular, further improvements might be possible by applying appropriate data augmentation and ensemble modelling. Removing small components that are disconnected from the main structures may also improve the scores, especially with respect to the Hausdorff distance.

Test volume number    Vol. Accuracy   Vol. Dice   Vol. Jaccard   Vol. Hausdorff
Volume 02             0.9965          0.9564      0.9164         59.3
Volume 12             0.9852          0.8865      0.7962         40.8
Volume 18             …               0.9670      …              …
…
Volume 60             0.9850          0.8689      0.7682         133.8
Average               0.9921          0.9188      0.8519         78.0
Standard deviation    0.0042          0.0371      0.0613         42.4

Table 2: 3D (volumetric) Accuracy, Dice, Jaccard and Hausdorff scores for UNetImplicit (trained for 320 epochs) applied to each of the test volumes, considered individually.
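As an illustration of the post-processing mentioned in the remark, small disconnected components of a binary mask can be removed with a simple breadth-first search. This is a self-contained sketch of the general technique (a production pipeline would more likely use a library routine for connected-component labelling, e.g. from scipy.ndimage); 4-connectivity on 2D slices is an assumption of ours:

```python
import numpy as np
from collections import deque

def remove_small_components(mask, min_size):
    # Keep only 4-connected components of a binary 2D mask with at
    # least min_size pixels; components are found by BFS flood fill.
    mask = np.asarray(mask, dtype=bool)
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    H, W = mask.shape
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                comp, queue = [], deque([(i, j)])
                seen[i, j] = True
                while queue:
                    a, b = queue.popleft()
                    comp.append((a, b))
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        x, y = a + da, b + db
                        if 0 <= x < H and 0 <= y < W and mask[x, y] and not seen[x, y]:
                            seen[x, y] = True
                            queue.append((x, y))
                if len(comp) >= min_size:
                    for a, b in comp:
                        out[a, b] = True
    return out
```

The same idea extends to 3D volumes with 6-connectivity, which is the variant relevant to the volumetric Hausdorff scores discussed above.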
We trained the UNetImplicit network with depth d = 4, bottleneck size b = 8, and degree p = 1 with the different loss functions described in Section 3.3. It is hard to compare the obtained validation losses directly, as the validation loss will typically favour the loss function that it is trained on. However, besides a qualitative comparison, it is possible to train on each loss and then evaluate on all the other loss functions, as shown in Table 3. This approach is inspired by the principle of cross-validation.

Table 3 shows that training with the accuracy, Dice and Jaccard losses achieves high scores on each other's evaluation metrics. Likely due to this, training on linear combinations of the Dice and Jaccard losses did not improve the scores. However, training with the accuracy, Dice and Jaccard losses does not give good scores with respect to the MMSE and MMAE evaluation metrics. This is expected, as the MMSE and MMAE metrics are tailored to approximating functions with values in {−1, 1}, and this constraint is not imposed on the results when training with the other loss functions. The MMSE and MMAE losses are very similar, both achieving high scores in all evaluation metrics, albeit with slightly lower scores on the accuracy, Dice and Jaccard evaluation metrics. When training directly for the MSE of the predicted coefficients and precomputed implicit spline approximations of the ground truth segmentation masks, a Jaccard score of around 0.8 was achieved on the validation set.

As far as we are aware, there exist two papers in the literature that make use of the same CHD CT dataset as us. We have organized the performances in terms of various metrics for our network and these networks in Table 4, bearing in mind that it is difficult to make entirely fair comparisons for the reasons described below. Note that UNetImplicit (with optimal parameters) can process about 200 slices per second, roughly corresponding to a single CT volume per second.
Increasing the batch size can be expected to yield a significant speed improvement, in particular for the smaller networks.

Training loss   −MMSE val   −MMAE val   Accuracy val   Dice val   Jaccard val
MMSE            0.9744      0.9685      0.9919         0.9275     0.8654
MMAE            0.9734      0.9743      0.9916         0.9234     0.8583
−Accuracy       0.5506      0.5701      0.9921         0.9297     0.8691
−Dice           −3.5035     −0.7471     0.9926         0.9336     0.8761
−Jaccard        −7.5589     −1.3616     0.9927         0.9341     0.8768
Table 3: Performance of UNetImplicit (depth d = 4, bottleneck size b = 8, degree p = 1, trained for 100 epochs) evaluated in various metrics when varying the training loss function.

Table 4: Model evaluation and comparison of 3D metrics (Vol. Accuracy, Vol. Dice, Vol. Jaccard, Vol. Hausdorff) and per-slice average inference time in ms (mean ± standard deviation) on the blood volume test dataset, with emphasis on the best scores, for the VGG-Implicit networks, UNetImplicit and DenseVNet. The top three networks are trained for 100 epochs, and the parameters used for UNetImplicit are b = 8, d = 4, and p = 1.

Xu et al. (2019b) have used a 2D UNet architecture, and observed an average Dice score of 0.7843 for blood volume. However, in their approach, the blood volume is split into several anatomically separate parts, and the average is taken over the individual Dice scores for these predicted parts. This appears to be a harder problem to solve, and may be the main cause of the lower score observed.

Varatharajan et al. (2020) considered a DenseVNet architecture, and obtained an average volumetric Dice score of 0.9183 for blood volume, which is very similar to the results we obtain with UNetImplicit. The other scores for DenseVNet are also presented in Table 4. They achieve a lower average Hausdorff distance than our approach, and this may in part be explained by the fact that the authors post-process the test results by removing components that are disconnected from the main structure. Since our approach is inherently 2D, we have chosen not to remove components that are disconnected in 3D in this paper.
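The per-slice inference cost quoted above includes evaluating the predicted coefficient grid on the pixel grid; since the basis is fixed, this reduces to two multiplications with precomputed collocation matrices. The following NumPy sketch illustrates this for uniform degree-1 (bilinear) B-splines; it is our own illustrative code, and the knot-vector conventions of the actual implementation may differ:

```python
import numpy as np

def linear_collocation(n_coeffs, n_samples):
    # Collocation matrix B with B[s, k] = N_k(t_s) for degree-1 B-splines
    # (hat functions) on a uniform integer knot grid; each row sums to 1.
    t = np.linspace(0.0, n_coeffs - 1.0, n_samples)
    k = np.minimum(t.astype(int), n_coeffs - 2)  # interval index of each sample
    w = t - k                                    # local coordinate in [0, 1]
    B = np.zeros((n_samples, n_coeffs))
    B[np.arange(n_samples), k] = 1.0 - w
    B[np.arange(n_samples), k + 1] = w
    return B

def evaluate_spline(C, n_samples):
    # Tensor-product evaluation F = B_u C B_v^T on an n_samples x n_samples grid.
    Bu = linear_collocation(C.shape[0], n_samples)
    Bv = linear_collocation(C.shape[1], n_samples)
    return Bu @ C @ Bv.T
```

Since B_u and B_v can be precomputed once per resolution, each slice costs only two dense matrix products, and the binary segmentation then follows from the sign of F in each pixel.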
5. Conclusion
Summary of the paper
In this paper, we have introduced a new method for segmenting image data using implicit spline representations and deep learning. We have shown that our approach is effective at segmenting blood volume in a medical imaging dataset and that we are able to achieve state-of-the-art results by using a modified version of the UNet architecture, which we call UNetImplicit.

By modelling the segmentation boundaries implicitly, we are able to perform segmentation even in the presence of complex topologies, and the use of spline representations ensures a compact representation that can subsequently be sampled at any desired resolution. In the case of our best network, the total number of output spline coefficients is one sixteenth of the total number of input pixels. Besides these representational advantages, our method is also amenable to geometric processing operations. In particular, inside/outside computations are reduced to mere function evaluations, and shapes can be efficiently manipulated and compared in terms of the spline control net.

While we have chosen to focus on medical imaging in this paper, the proposed method is not limited to this application domain. Early experiments performed on the Cityscapes dataset Cordts et al. (2016) yielded promising results, in which we experienced that a coarse grid of spline coefficients was sufficient for most segmentations. Hence the resolution of the spline coefficients required to obtain good results depends on the input data, both with respect to the variability of the data and the presence of sharp features. This also illustrates that in other application domains a further reduction in representation size is possible.

Limitations
One of the main limitations of our approach is that we need to use a fixed spline resolution for all samples. This imposes an upper limit on the number of oscillations the spline can have in a given region. Depending on the characteristics of the data, the spline resolution may need to be increased to capture all details. Our approach also has a limited ability to model sharp features, in that splines are inherently smooth. However, this is more a theoretical than a practical limitation, due to uncertainty or blurring at the pixel level, which is typical for imaging datasets (including the CHD CT dataset). Another limitation of using implicit representations is that they can only deal with a single segmentation class per output channel. A potential solution to this is to output multiple segmented classes in separate channels using weight sharing and multitask learning.
Future work
We foresee several directions for future work based on our approach. In this paper, we have restricted our attention to B-spline basis functions, but the only requirement is that the basis can be evaluated by multiplication with a pre-computed collocation matrix. Thus the method can adopt any basis functions arranged in a grid that is compatible with adaptive average pooling of the contracting path. This opens up avenues for exploring how different types of basis functions perform on different datasets.

Thus far we have employed implicit functions solely for the purpose of determining the segmentation boundary. A richer form of implicit function is a signed distance function to the boundary. In that case, the gradient of the implicit function (projected to the plane z = 0) is the normal direction along the segmentation boundary. Hence different level sets of the implicit (signed distance) function can be used to generate offsets or confidence bounds of the segmentation boundary. While signed distance functions are not typically smooth everywhere, they can be approximated closely by spline functions.

Although we have only considered 2D segmentation in this paper, the approach is directly generalizable to 3D. A 3D implementation will allow us to take advantage of smoothness in the axial direction, which should lead to even more compact representations. In particular, multislice methods could be employed to take advantage of gradients in the slice direction.

Finally, we envisage that our approach can be useful in other application areas. For example, reconstructing digital twins of 3D printed objects from images taken at each layer of the manufacturing process is one potential application. Our approach may also be applied to standard segmentation of photographs if the segmentation boundaries are of a generally smooth nature.

Acknowledgements
This project was supported by an IKTPLUSS grant, project number 270922, from the Research Council of Norway, as well as by the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement numbers 951956 and 860843. We would like to thank Dr. Xiaowei Xu from the Department of Computer Science and Engineering, University of Notre Dame, for providing us access to the CHD CT dataset. We wish to thank Artur Berzins and the anonymous referees for their valuable feedback on an earlier draft of this paper.
References
Acuna, D., Ling, H., Kar, A., Fidler, S., 2018. Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 859–868.
Aubert, G., Barlaud, M., Faugeras, O., Jehan-Besson, S., 2003. Image segmentation using active contours: Calculus of variations or shape gradients? SIAM Journal on Applied Mathematics 63, 2128–2154.
Barrowclough, O.J.D., Muntingh, G., Stangeby, I., 2020. Implicit reconstruction. https://github.com/sintefmath/implicit-recon/.
Castrejón, L., Kundu, K., Urtasun, R., Fidler, S., 2017. Annotating Object Instances with a Polygon-RNN, in: CVPR.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B., 2016. The Cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
Fey, M., Eric Lenssen, J., Weichert, F., Müller, H., 2018. SplineCNN: Fast geometric deep learning with continuous B-spline kernels, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 869–877.
Gibson, E., Giganti, F., Hu, Y., Bonmati, E., Bandula, S., Gurusamy, K., Davidson, B., Pereira, S.P., Clarkson, M.J., Barratt, D.C., 2018. Automatic Multi-Organ Segmentation on Abdominal CT With Dense V-Networks. IEEE Transactions on Medical Imaging 37, 1822–1834.
Habijan, M., Leventić, H., Galić, I., Babin, D., 2019. Whole Heart Segmentation from CT images Using 3D U-Net architecture, in: 2019 International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 121–126.
Işın, A., Direkoğlu, C., Şah, M., 2016. Review of MRI-based Brain Tumor Image Segmentation Using Deep Learning Methods. Procedia Computer Science 102, 317–324. 12th International Conference on Application of Fuzzy Systems and Soft Computing, ICAFS 2016, 29–30 August 2016, Vienna, Austria.
Karimi, D., Salcudean, S.E., 2020. Reducing the Hausdorff Distance in Medical Image Segmentation With Convolutional Neural Networks. IEEE Transactions on Medical Imaging 39, 499–513.
Keskin, C., Izadi, S., 2018. SplineNets: Continuous neural decision graphs. arXiv preprint arXiv:1810.13118.
Laube, P., Franz, M.O., Umlauf, G., 2018. Deep learning parametrization for B-spline curve approximation, in: 2018 International Conference on 3D Vision (3DV), IEEE, pp. 691–699.
Ling, H., Gao, J., Kar, A., Chen, W., Fidler, S., 2019. Fast interactive object annotation with Curve-GCN, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5257–5266.
Michalkiewicz, M., Pontes, J.K., Jack, D., Baktashmotlagh, M., Eriksson, A., 2019. Implicit surface representations as layers in neural networks, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4742–4751.
Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., Terzopoulos, D., 2020. Image segmentation using deep learning: A survey. CoRR abs/2001.05566. URL: https://arxiv.org/abs/2001.05566.
Payer, C., Štern, D., Bischof, H., Urschler, M., 2018. Multi-label Whole Heart Segmentation Using CNNs and Anatomical Label Configurations, in: Pop, M., Sermesant, M., Jodoin, P.M., Lalande, A., Zhuang, X., Yang, G., Young, A., Bernard, O. (Eds.), Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges, Springer International Publishing, Cham, pp. 190–198.
Pham, D.L., Xu, C., Prince, J.L., 2000. Current Methods in Medical Image Segmentation. Annual Review of Biomedical Engineering 2, 315–337.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 234–241.
Sharma, N., Aggarwal, L., 2010. Automated medical image segmentation techniques. Journal of Medical Physics 35, 3–14.
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sitzmann, V., Martel, J.N., Bergman, A.W., Lindell, D.B., Wetzstein, G., 2020. Implicit neural representations with periodic activation functions. arXiv preprint arXiv:2006.09661.
Sutskever, I., Martens, J., Dahl, G., Hinton, G., 2013. On the importance of initialization and momentum in deep learning, in: Proceedings of the 30th International Conference on Machine Learning, JMLR.org, pp. 1139–1147.
Varatharajan, N., Lippert, M., Brun, H., Elle, O.J., Kumar, R.P., 2020. Deep learning based blood pool segmentation of congenital heart disease. Submitted to journal.
Wang, C., Smedby, Ö., 2018. Automatic Whole Heart Segmentation Using Deep Learning and Shape Context, in: Pop, M., Sermesant, M., Jodoin, P.M., Lalande, A., Zhuang, X., Yang, G., Young, A., Bernard, O. (Eds.), Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges, Springer International Publishing, Cham, pp. 242–249.
Xu, Q., Wang, W., Ceylan, D., Mech, R., Neumann, U., 2019a. DISN: Deep Implicit Surface Network for High-quality Single-view 3D Reconstruction, in: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp. 492–502.
Xu, X., Wang, T., Shi, Y., Yuan, H., Jia, Q., Huang, M., Zhuang, J., 2019b. Whole Heart and Great Vessel Segmentation in Congenital Heart Disease Using Deep Neural Networks and Graph Matching, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2019.
Xu, Z., Wu, Z., Feng, J., 2018. CFUN: Combining Faster R-CNN and U-net Network for Efficient Whole Heart Segmentation. CoRR abs/1812.04914. URL: http://arxiv.org/abs/1812.04914.
Yang, H., Fuchs, M., Juttler, B., Scherzer, O., 2006. Evolution of T-spline level sets with distance field constraints for geometry reconstruction and image segmentation, in: IEEE International Conference on Shape Modeling and Applications 2006 (SMI'06), IEEE, pp. 37–37.