Iterative fully convolutional neural networks for automatic vertebra segmentation and identification
Nikolas Lessmann a, Bram van Ginneken b, Pim A. de Jong c,d, Ivana Išgum a

a Image Sciences Institute, University Medical Center Utrecht, The Netherlands
b Diagnostic Image Analysis Group, Radboud University Medical Center Nijmegen, The Netherlands
c Department of Radiology, University Medical Center Utrecht, The Netherlands
d Utrecht University, The Netherlands
Abstract
Precise segmentation and anatomical identification of the vertebrae provides the basis for automatic analysis of the spine, such as detection of vertebral compression fractures or other abnormalities. Most dedicated spine CT and MR scans as well as scans of the chest, abdomen or neck cover only part of the spine. Segmentation and identification should therefore not rely on the visibility of certain vertebrae or a certain number of vertebrae. We propose an iterative instance segmentation approach that uses a fully convolutional neural network to segment and label vertebrae one after the other, independently of the number of visible vertebrae. This instance-by-instance segmentation is enabled by combining the network with a memory component that retains information about already segmented vertebrae. The network iteratively analyzes image patches, using information from both image and memory to search for the next vertebra. To efficiently traverse the image, we include the prior knowledge that the vertebrae are always located next to each other, which is used to follow the vertebral column. The network concurrently performs multiple tasks: segmentation of a vertebra, regression of its anatomical label, and prediction of whether the vertebra is completely visible in the image, which allows incompletely visible vertebrae to be excluded from further analyses. The predicted anatomical labels of the individual vertebrae are additionally refined with a maximum likelihood approach, choosing the overall most likely labeling when all detected vertebrae are taken into account. This method was evaluated with five diverse datasets, including multiple modalities (CT and MR), various fields of view and coverages of different sections of the spine, and a particularly challenging set of low-dose chest CT scans. For vertebra segmentation, the average Dice score was 94.9 ±
1. Introduction
Segmentation and identification of the vertebrae is often a prerequisite for automatic analysis of the spine, such as detection of vertebral fractures (Yao et al., 2012), assessment of spinal deformities (Forsberg et al., 2013), or computer-assisted surgical interventions (Knez et al., 2016). Automatic spine analysis can be performed with a large variety of tomographic scans, including dedicated spine scans but also scans of the neck, chest or abdomen that incidentally cover part of the spine. A generic vertebra segmentation algorithm therefore needs to be robust with respect to different image resolutions and different coverages of the spine. This especially means that no assumptions should be made about the number of visible vertebrae and their anatomical identity, i.e., to which section of the spine they belong. Vertebra segmentation is therefore essentially an instance segmentation problem with an a priori unknown number of instances (i.e., vertebrae). However, in contrast to generic instance segmentation, the individual instances are not independent of each other. The instances are known to be located in close proximity to each other in the image, together forming the vertebral column. We propose to approach vertebra segmentation with an instance segmentation algorithm that explicitly incorporates this prior knowledge to locate instances, but that makes no further assumptions.

Approaching vertebra segmentation as an instance segmentation problem entails treating all vertebrae as instances of the same class of objects. However, an anatomical identification of the segmented vertebrae is often also needed, for instance, for further analysis steps or for reporting purposes. Especially in images originally not intended for spine imaging, anatomical labeling of the vertebrae can be challenging due to variations in the field of view.
These variations lead to variable coverage of the spine and also of structures that provide anatomical cues for identification of the vertebrae, such as the ribs or the sacrum. Additionally, neighboring vertebrae often have similar shape and appearance so that independent labeling of each vertebra may result in mistakes. Vertebra identification therefore requires a global rather than a per-instance approach to ensure an overall plausible, anatomically correct labeling.

Another challenge inherent to an instance segmentation approach is the identification of partially visible instances. While occlusion is a typical problem in two-dimensional but not in three-dimensional images, some vertebrae may be only partially visible due to the limited field of view of the scan.

Preprint submitted to Medical Image Analysis, January 19, 2019

If these incompletely visible vertebrae are included in subsequent analyses that are based on the obtained vertebra segmentations, such as measurement of vertebral heights for detection and classification of vertebral compression fractures (Grigoryan et al., 2003), their results may be unreliable. Therefore, incompletely visible instances need to be either ignored or explicitly identified as incomplete so that they can be excluded from subsequent analyses.

In this paper, we propose an iterative instance-by-instance segmentation approach for vertebra segmentation based on a fully convolutional neural network. This network performs vertebra detection, segmentation, anatomical identification and classification of their completeness concurrently and therefore presents an entirely supervised approach that can be trained end-to-end. While we propose to attempt a per-instance identification of the individual vertebrae together with the segmentation, the labeling is subsequently adjusted taking all segmented vertebrae into account.
In contrast to previous approaches, the presented method can be used for any imaging modality, any field of view and any number and type (cervical, thoracic, lumbar) of visible vertebrae because it avoids explicit modeling of shape and appearance of the vertebrae and the vertebral column. We evaluate these claims using a diverse selection of datasets, including scans from different modalities (CT and MR), various fields of view, cases with severe compression fractures and a particularly challenging set of low-dose chest CT scans.
2. Related work
While a few other methods have been published that address both vertebra segmentation and identification (Chu et al., 2015; Kelm et al., 2013; Klinder et al., 2009; Sekuboyina et al., 2017; Suzani et al., 2015), the majority of methods in the literature focused on one of these problems. The existing literature is therefore reviewed separately for vertebra segmentation and for vertebra identification. We also briefly review literature on general instance segmentation.
Vertebra segmentation has been approached predominantly as a model-fitting problem using statistical shape models and their variants, most often active shape models and shape-constrained deformable models (Castro-Mateos et al., 2015; Ibragimov et al., 2015, 2014; Kadoury et al., 2011, 2013; Klinder et al., 2009; Korez et al., 2015, 2016; Leventon et al., 2002; Mastmeyer et al., 2006; Mirzaalian et al., 2013; Pereanez et al., 2015; Rasoulian et al., 2013; Štern et al., 2011; Suzani et al., 2015; Yang et al., 2017a). Other approaches have been based on atlases (Wang et al., 2015), level-sets with shape priors (Leventon et al., 2002; Lim et al., 2014) and active contours (Athertya and Kumar, 2016; Hammernik et al., 2015).

More recently, machine learning has been increasingly used for vertebra segmentation. Kelm et al. (2013) used an iterative variant of marginal space learning to find bounding boxes for the intervertebral discs, which were used to initialize and guide vertebra segmentations based on Markov random fields and graph cuts. Zukić et al. (2014) applied the Viola-Jones object detection framework based on AdaBoost to find bounding boxes for the vertebral bodies, which were subsequently segmented by inflating a mesh from the center of each vertebral body. Chu et al. (2015) used random forest regression to detect the centers of the vertebral bodies and used these to define regions of interest in which vertebrae were segmented using random forest voxel classification. A similar method was proposed by Suzani et al. (2015), who used a multilayer perceptron to regress the distance to the nearest center of a vertebral body. The detected locations were used to initialize an adaptive shape model for segmentation of the vertebral bodies. Mirzaalian et al. (2013) also combined machine learning and shape models by using a probabilistic boosting-tree classifier for boundary detection, which was used to adapt a surface mesh to the vertebrae in combination with a statistical shape model.
The shape model was used for initialization of the mesh and to impose shape constraints. Korez et al. (2016) used a convolutional neural network (CNN) to generate probability maps for the vertebral bodies and used these maps to guide a deformable surface model to segment the vertebral bodies.

Even though the aforementioned methods contain a machine learning component beyond statistical modeling, machine learning was primarily used for vertebra detection and thus merely for initialization of the segmentation, which itself was performed with other methods. Many of the most recently published vertebra segmentation methods, however, are based on deep learning and have replaced explicit modeling of the vertebral shape and appearance with convolutional and recurrent neural networks. For instance, Sekuboyina et al. (2017) segmented the lumbar vertebrae in 2D sagittal slices using a multiclass CNN for pixel labeling. As a prior step, a simple multilayer perceptron estimated a bounding box of the lumbar region to identify the region of interest in the image. In subsequent work, Sekuboyina et al. (2018) used a patch-based 3D network for voxel classification in the entire image and additionally a 2D network to predict a low-resolution mask for the vertebral column, which was used to remove false positives outside the spinal region. Similarly, Janssens et al. (2018) relied on two consecutive networks, first using a regression CNN to estimate a bounding box of the lumbar region, followed by a classification CNN to perform voxel labeling within that bounding box to segment the lumbar vertebrae. In our preliminary work (Lessmann et al., 2018), we also applied a two-stage approach in which vertebrae were first segmented in downsampled images using an iterative strategy. The image was repeatedly analyzed by a CNN to segment the vertebrae one after the other.
A second network analyzed the full resolution images to refine the low-resolution segmentations. Even though all of these approaches relied on CNNs for segmentation of the vertebrae, they retained the separation into a detection and a segmentation task, and consequently used two dedicated networks.
Anatomical identification of individual vertebrae has been mainly approached with one of three strategies: with appearance and shape models (Cai et al., 2015; Glocker et al., 2012; Klinder et al., 2009), with machine learning based on hand-crafted features (Bromiley et al., 2016; Chu et al., 2015; Glocker et al., 2012, 2013; Kelm et al., 2013; Major et al., 2013; Suzani et al., 2015) and with deep neural networks (Cai et al., 2016; Chen et al., 2015; Forsberg et al., 2017; Janssens et al., 2018; Lessmann et al., 2018; Liao et al., 2018; Sekuboyina et al., 2017; Yang et al., 2017a,b). Most of these approaches combined a rough labeling of the vertebrae, typically by performing voxel classification or regression of vertebral centroids or bounding boxes, with a global model to refine the individual predictions, to discard outliers and to find an overall plausible solution. These models have often been graphical models, such as hidden Markov models (Chu et al., 2015; Glocker et al., 2012) and Markov random fields (Major et al., 2013), statistical shape models (Bromiley et al., 2016; Chen et al., 2015; Suzani et al., 2015) or recurrent neural networks (Liao et al., 2018; Yang et al., 2017b). Several methods relied on detection of a reference vertebra, such as the fifth lumbar vertebra (L5), and labeled the detected vertebrae relative to this reference vertebra (Cai et al., 2016; Forsberg et al., 2017; Lessmann et al., 2018). Multiclass CNNs for combined segmentation and identification through voxel classification were used in scans with a fixed field of view and a limited number of vertebrae, e.g., lumbar spine CT scans (Janssens et al., 2018; Sekuboyina et al., 2017). Furthermore, probabilistic modeling has been used to calculate a likelihood score for each possible labeling configuration based on shape or appearance similarities or spatial relationships of the vertebrae (Chen et al., 2015; Glocker et al., 2013; Kelm et al., 2013; Klinder et al., 2009).
Generic instance segmentation frameworks based on convolutional neural networks, such as Mask R-CNN (He et al., 2017), typically split the task into a detection and a segmentation task. Many of the publications on vertebra segmentation discussed above have used this approach as well, but have usually imposed constraints on the number of instances or other features. Recurrent networks have been used in a similar fashion by localizing individual instances based on attention and memory mechanisms, often also based on instance detection through region proposals and subsequent segmentation (Li et al., 2016; Ren and Zemel, 2017; Romera-Paredes and Torr, 2016; Stewart et al., 2016). Other approaches have relied on clustering instead of explicit instance detection, transforming images into abstract feature representations in which individual instances were detected as individual clusters (De Brabandere et al., 2017; Liang et al., 2018; Novotny et al., 2018; Uhrig et al., 2016). An iterative approach without region proposals or recurrent connections has been proposed by Li et al. (2016), who repeatedly fed the same image through a convolutional semantic segmentation network together with the previous prediction map.
3. Methods
We propose a vertebra segmentation and identification method based on a single fully convolutional neural network (FCN) that performs multiple tasks concurrently. In contrast to existing methods, this avoids a multi-stage process with successive instance detection and segmentation, or segmentation and instance separation steps. Other existing generic instance segmentation methods with 2D deep neural networks often do not generalize well to 3D image volumes because they analyze the entire image at once, which is currently not feasible with typical CT or MR volumes. We therefore apply a patch-based vertebra-by-vertebra segmentation approach in which the image is analyzed in patches large enough to contain at least one vertebra. The network segments a single vertebra in this patch and the anatomical knowledge that the following vertebra must be located in close proximity is used to reposition the patch for segmentation of the following vertebra.

In addition to this iterative inference strategy, our approach consists of four major components. The central component is a segmentation network that segments voxels from a 3D image patch by binary classification of all voxels in the patch. To enable this network to segment only voxels belonging to a specific instance rather than all vertebrae visible in the patch, we augment the network with an instance memory that informs the network about already segmented vertebrae. The network uses this information to always segment only the following not yet segmented vertebra. Once a vertebra is fully segmented, the instance memory is updated, which triggers the network in the next iteration to ignore this vertebra and focus on the next vertebra instead. The third component is an identification sub-network that predicts the anatomical label of each detected vertebra. The fourth component is a completeness classification sub-network that is added to the network to distinguish between completely visible and partially visible vertebrae.
The full network architecture is illustrated in Figure 1. Please note that the aforementioned components, referred to as networks and sub-networks, together form a single network.

In the following, we first describe the segmentation network (section 3.1) and the instance memory (section 3.2). These two components form the basis for the proposed iterative instance segmentation strategy (section 3.3). Additional network parts without influence on the iterative segmentation are described afterwards: regression of the anatomical label of each vertebra (section 3.4) and classification of its completeness (section 3.5). Finally, the loss function to optimize the network for all three tasks and the training process are detailed (section 3.6).
3.1. Segmentation network

The segmentation component of the network is an FCN that predicts a binary label for each voxel in an image patch. This label indicates whether the voxel belongs to the current instance or not. We used a patch size of 128 × 128 × 128 voxels, which is large enough to cover an entire vertebra. To ensure that all patches have the same resolution, even though the analyzed images may have different resolutions, we resample all input images to an isotropic resolution of 1 mm × 1 mm × 1 mm.
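As a concrete illustration of this preprocessing step, input volumes can be resampled to 1 mm isotropic spacing with standard tools. The sketch below uses `scipy.ndimage.zoom`; the function name and the choice of linear interpolation are ours, not specified in the paper:

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_isotropic(volume, spacing, new_spacing=(1.0, 1.0, 1.0), order=1):
    """Resample a 3D volume to isotropic voxel spacing.

    `spacing` is the original voxel size per axis in mm. Linear
    interpolation (order=1) is used here for intensity images;
    order=0 would be appropriate for label masks.
    """
    factors = [s / ns for s, ns in zip(spacing, new_spacing)]
    return zoom(volume, factors, order=order)

# Example: a 64 x 64 x 32 volume with 0.5 x 0.5 x 2.0 mm voxels
vol = np.zeros((64, 64, 32), dtype=np.float32)
iso = resample_to_isotropic(vol, spacing=(0.5, 0.5, 2.0))
print(iso.shape)  # (32, 32, 64): each axis now covers the same physical extent at 1 mm per voxel
```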
Figure 1: Schematic drawing of the network architecture. I, M and S are 3D volumes, L is the predicted label in form of a single value, and C is the predicted probability for complete visibility. Cubes represent 3D feature maps with 84 channels in the path from I and M to S and 48 feature maps in the two additional compression paths to L and C. Exceptions are the first cube after I and M, which has two channels as a result of the concatenation of I and M, and the cube before S, which has only one channel. Both dense layers map 48 features to a single value. The number on each cube indicates the size of the feature map (a cube "128" corresponds to a feature map of 128 × 128 × 128 voxels).
The architecture of the segmentation network is inspired by the U-net architecture (Çiçek et al., 2016; Ronneberger et al., 2015), i.e., the network consists of a compression and an expansion path with intermediate skip connections. We use a constant number of filters in all layers and add batch normalization as well as additional padding before all convolutional layers to obtain segmentation masks of the same size as the image patches. The segmentation masks are additionally refined in CT scans by removing voxels below 200 HU from the surface of each vertebra.
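The 200 HU surface refinement can be sketched as a simple morphological operation: find the surface voxels of the binary mask and drop those below the threshold. This single-pass version is an assumption of ours; the paper does not state whether the removal is applied iteratively:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def refine_ct_mask(mask, image_hu, threshold=200.0):
    """Remove low-density voxels from the surface of a binary vertebra mask.

    Surface voxels are mask voxels that disappear under one erosion step;
    of those, only voxels of at least `threshold` HU are kept.
    """
    mask = mask.astype(bool)
    interior = binary_erosion(mask)       # voxels whose 6-neighborhood is fully inside the mask
    surface = mask & ~interior            # the remaining mask voxels form the surface
    return interior | (surface & (image_hu >= threshold))
```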
3.2. Instance memory

In the proposed iterative segmentation scheme, which is described in detail in the following section, the network segments one vertebra after the other, one at a time. The purpose of the instance memory is to remind the network of vertebrae that were already segmented in previous iterations so that it can target the next not yet segmented vertebra. This memory is a binary flag for each voxel of the input image that indicates whether the voxel has been labeled as vertebra in any of the previous iterations. Binary rather than probabilistic flags are used for simplicity, but probabilistic flags could prove useful in future extensions that allow voxels labeled as part of a vertebra to be unlabeled or relabeled in later iterations. Together with an image patch, the network receives a corresponding memory patch as input. These are fed to the network as a two-channel input volume.
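Assembling this two-channel input is straightforward; the following sketch stacks an image patch with the corresponding memory patch. The function name is our own, and patches near the image border would additionally require padding, which is omitted here:

```python
import numpy as np

def make_network_input(image, memory, center, size=128):
    """Stack an image patch and the matching binary instance-memory patch
    into a two-channel input volume of shape (2, size, size, size).

    Boundary handling is omitted; patches near the image border would
    require padding in practice.
    """
    half = size // 2
    slices = tuple(slice(c - half, c + half) for c in center)
    return np.stack([image[slices], memory[slices].astype(image.dtype)])

img = np.random.rand(16, 16, 16).astype(np.float32)
mem = np.zeros((16, 16, 16), dtype=bool)
mem[:8] = True  # pretend the upper half of the volume is already segmented
patch = make_network_input(img, mem, center=(8, 8, 8), size=8)
print(patch.shape)  # (2, 8, 8, 8)
```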
3.3. Iterative instance segmentation

The iterative segmentation process is illustrated in Figure 2. This process follows either a top-down or a bottom-up scheme, i.e., the vertebrae are not segmented in random order but successively from top to bottom, or vice versa. The network learns to infer from the memory patch which vertebra to segment in the current patch. If the memory is empty, i.e., no vertebra has been detected yet, the network segments the top-most or bottom-most vertebra that is visible in the patch, depending on the chosen direction of traversal. Otherwise, the network segments the first following not yet segmented vertebra, even if multiple unsegmented vertebrae are visible. Other vertebrae that are the second or third not yet segmented vertebra in the direction of traversal are disregarded until they become the first not yet segmented vertebra themselves in a later iteration. Each instance of the network therefore has a fixed direction of traversal.

The patch size is chosen large enough to always contain part of the following vertebra when a vertebra is in the center of the patch. This enables utilizing prior knowledge about the spatial arrangement of the individual instances to move from vertebra to vertebra. The FCN iteratively analyzes a single patch centered at x_t, where t denotes the iteration step. Initially, the patch is moved over the image in a sliding window fashion with constant step size Δx, searching for the top-most vertebra when using a top-down approach, or the bottom-most vertebra when using a bottom-up approach. As soon as the network detects a large enough fragment of vertebral bone, with size v_t of at least v_min = 10 mm in our experiments, the patch is moved toward this fragment. The center of the bounding box of the detected fragment, referred to as b_t, becomes the center of the next patch:

x_{t+1} = x_t + Δx  if v_t < v_min,  otherwise  x_{t+1} = b_t

Even if initially only a small part of the vertebra is visible and detected in the patch, centering the following patch at the detected fragment ensures that a larger part of the vertebra becomes visible in the next iteration. Eventually, the entire vertebra becomes visible, in which case the patch position converges because no additional voxels are detected anymore that would affect b_t, and hence x_{t+1}. We detect convergence by comparing the current patch position x_t and the previous patch position x_{t−1}, testing whether they still differ by more than δ_max on any axis. Occasionally, the patch position does not converge but keeps alternating between positions that differ by slightly more than the threshold δ_max. We therefore limit the number of iterations per vertebra.

Figure 2: Illustration of the iterative instance segmentation and traversal strategy. The patch first moves in a sliding window fashion over the image (1), until a fragment of vertebral bone is detected (2). The patch is then moved to the center of the detected fragment. This process is repeated until the entire vertebra becomes visible and the patch thus stops moving (3). The segmented vertebra is added to the instance memory and the same patch is analyzed again, now yielding a fragment of the following vertebra because the updated memory forces the network to ignore the previous vertebra (4). The patch is now centered at the detected fragment of the following vertebra and the process repeats (5–7).
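The traversal rules above (slide while no sufficiently large fragment is found, re-center on the detected fragment otherwise, and stop when consecutive positions agree within δ_max) can be sketched as follows. This is a simplified sketch: the real sliding window scans the whole volume rather than stepping along a single fixed vector:

```python
import numpy as np

V_MIN = 10.0       # minimum detected fragment size that triggers re-centering (value from the text)
DELTA_MAX = 2.0    # per-axis convergence tolerance (value from the text)

def next_patch_center(x_t, v_t, b_t, step):
    """One traversal step: slide the window by `step` while the detected
    fragment is too small, otherwise re-center the patch on the fragment's
    bounding box center b_t."""
    if v_t < V_MIN:
        return np.asarray(x_t, dtype=float) + step
    return np.asarray(b_t, dtype=float)

def converged(x_t, x_prev):
    """Positions have converged when they differ by at most DELTA_MAX on every axis."""
    return bool(np.all(np.abs(np.asarray(x_t) - np.asarray(x_prev)) <= DELTA_MAX))

step = np.array([0.0, 0.0, 64.0])  # half the 128-voxel patch size, along one axis for illustration
print(next_patch_center((0, 0, 0), 4.0, (30, 40, 50), step))   # fragment too small: keep sliding
print(next_patch_center((0, 0, 0), 12.0, (30, 40, 50), step))  # large enough: jump to the fragment
```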
When this limit is reached, we assume that the patch has converged to the position between the two previous patch positions and accordingly move the patch to x_{t+1} = (x_t + x_{t−1}) / 2. In our experiments, δ_max was set to 2, Δx to half the patch size and the maximum number of iterations to 10.

Once the position has converged, the segmented vertebra is added to the output mask using a unique instance label and the segmentation mask obtained at this final position. Furthermore, the instance memory is updated. In the following iteration, the network analyzes the same patch again. The updated memory prompts the network to detect a fragment of the following vertebra and the patch is moved to the center of the detected new fragment, repeating the segmentation process for the next vertebra. Should no fragment of the following vertebra be immediately visible, traversal reverts to a sliding window motion until the next vertebra is found. The entire process continues until no further fragments are found, i.e., until all visible vertebrae are segmented.

3.4. Anatomical identification

For each detected vertebra, the network additionally predicts the anatomical label using an identification component that is added to the segmentation network. The U-net architecture, which provides the basis of the segmentation network, consists of a compression and an expansion path. These are commonly understood as recognition and segmentation paths, respectively. Since vertebra identification is a recognition task, the identification component is appended to the compression path as further compression steps. Essentially, it shares features with the segmentation network but further compresses the input patch into a single value. The vertebrae C1 to L5 are represented by the integers 1 to 24 and the value 0 is used during training as ground truth value for patches that do not contain any vertebral bone.
In the network, the label output unit is a rectified linear unit f(x) = max(0, x), which maps any negative activation to 0, the ground truth value for patches without vertebral bone. Regression of the label as a continuous value instead of classification with 25 output units and a softmax activation function ensures that the loss function penalizes predictions more strongly the further they deviate from the true label.

In order to find a plausible sequence of labels for the detected vertebrae, without duplicates or gaps, the predicted labels are interpreted as probabilities in a simple maximum likelihood estimation similar to Klinder et al. (2009). Depending on how many vertebrae are detected, several labeling sequences are theoretically possible – only if all 24 vertebrae are visible in a scan is there a single possible labeling sequence. For each possible sequence of labels, the average likelihood is calculated by interpreting the predicted labels as probabilities. For example, a regression output value of 22.8 is interpreted as 80 % probability for label 23 (L4), 20 % probability for label 22 (L3), and 0 % probability for any other label. In sequences in which this vertebra would be labeled as L3, it therefore contributes a probability of 20 % to the average. The sequence with maximal likelihood is finally used to label the detected vertebrae. This step is performed as a post-processing step when all vertebrae have been segmented from the image.

3.5. Completeness classification

An obvious strategy for disregarding incompletely visible vertebrae in the segmentation process would be to train the network only with examples of fully visible vertebrae. However, the traversal scheme requires detection of vertebral fragments in the patches. We therefore choose to include partially visible vertebrae in the training data but to add a classification component to the segmentation network that classifies each segmented vertebra as complete or incomplete.
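The maximum likelihood label refinement described above can be sketched as follows, assuming that candidate sequences are simply runs of consecutive labels; this simplification and the function names are ours:

```python
import math

def label_probability(pred, label):
    """Interpret a regression output as a probability for an integer label:
    a prediction of 22.8 gives 0.8 for label 23 (L4) and 0.2 for label 22 (L3)."""
    lo = math.floor(pred)
    if label == lo:
        return 1.0 - (pred - lo)
    if label == lo + 1:
        return pred - lo
    return 0.0

def most_likely_sequence(preds, n_labels=24):
    """Return the run of consecutive anatomical labels (1 = C1 ... 24 = L5)
    that maximizes the average per-vertebra probability."""
    n = len(preds)
    best_seq, best_score = None, -1.0
    for start in range(1, n_labels - n + 2):
        seq = list(range(start, start + n))
        score = sum(label_probability(p, l) for p, l in zip(preds, seq)) / n
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq

print(most_likely_sequence([21.2, 22.8, 23.9]))  # [22, 23, 24]: labeled L3, L4, L5
```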
Similar to the identification component, this completeness classification is a recognition and not a segmentation task. In the network, the classification path is therefore also a continuation of the compression path and has the same architecture as the identification path, with the difference that the output is a sigmoid unit. The output value is thus a single value in [0, 1].

3.6. Training

During training, we derived the status of the instance memory from the reference segmentation masks. The patches used to train the network were forced to contain vertebral bone by randomly selecting in each iteration a scan and a vertebra visible in that scan, followed by random patch sampling within the bounding box of that vertebra. However, 25 % of the patches were selected randomly from anywhere in the training images. If these patches contained vertebral bone, it was added to the instance memory so that the network could also learn to produce empty segmentation masks for patches without vertebral bone or without unsegmented vertebral bone.

Due to the size of the input patches, the Nvidia Titan X GPUs with 12 GB memory that we used for training allowed processing of only single patches instead of minibatches of multiple patches. We therefore used Adam (Kingma and Ba, 2014) for optimization with a fixed learning rate of 0.001 and an increased momentum of 0.99, which stabilizes the gradients. Furthermore, the network predicts labels for all voxels in the input patch and the loss term is accordingly not based on a single output value, but on more than two million output values per patch.
The loss term L combines terms for segmentation, anatomical labeling and classification errors:

L = λ · FP_soft + FN_soft + |p_L − t_L| + ( −t_C log p_C − (1 − t_C) log(1 − p_C) )

The first two terms constitute the segmentation error, the absolute difference |p_L − t_L| is the labeling error, and the final term is the completeness classification error.
We propose to optimize the segmentation by directly minimizing the number of incorrectly labeled voxels, i.e., the number of false positives and false negatives. This is similar to loss terms based on the Dice score (Milletari et al., 2016), which can be expressed as 2TP / (2TP + FP + FN), but because the number of true positives is not part of the term, the values are more consistent across empty (TP = 0) and non-empty (TP ≫ 0) patches. Given an input patch I and for all voxels i binary reference labels t_i and probabilistic predictions p_i, differentiable expressions for the number of false positive and false negative predictions are:

FP_soft = Σ_{i ∈ I} ω_i · (1 − t_i) p_i
FN_soft = Σ_{i ∈ I} ω_i · t_i (1 − p_i)

Here, ω_i are weights used to assign more importance to the voxels near the surface of the vertebra. This aims at improving the separation of neighboring vertebrae (Ronneberger et al., 2015). The weights are derived from the distance d_i of voxel i to the closest point on the surface of the targeted vertebra: ω_i = γ · exp(−d_i / σ) + 1. The values of γ and σ were fixed in our experiments.

The factor λ weights the cost of a false positive error relative to a false negative error, which is useful to counteract an imbalance between background and foreground voxels. In most segmentation problems, the number of background voxels is substantially larger than the number of foreground voxels and consequently a systematic false negative mistake is more favorable than a systematic false positive mistake. This can prevent the network from learning anything other than predicting an empty segmentation mask for any input. We therefore used a value for λ that increases from a small initial value λ_min over the course of training:

λ(n) = λ_min + (1 − λ_min) / (1 + e^{−ϑ(n)})

Here, n is the number of training iterations, i.e., backward passes, that have been performed, and ϑ(n) is an increasing function of n centered at n_max/2. The definition of ϑ ensures that false positive and false negative errors are equally weighted after about half of the maximum number of training iterations n_max. The network is therefore initially biased towards making false positive errors rather than false negative errors, but this bias is reduced over time to ensure that the network will not converge in a state where it tends to oversegment the vertebrae. We found that assigning weights to the individual error components that form the loss L was not necessary.

The labeling error is defined as the ℓ1 difference between predicted label and true label. The predicted labels are real numbers ≥ 0, and using the absolute difference of these two values as loss function penalizes large errors more strongly than small errors. This is especially beneficial because small deviations can more likely be corrected in the label refinement step.

The completeness classification error is defined as the binary cross entropy between the true label t_C, which is a binary value that indicates whether the vertebra is completely visible in the patch, and the predicted probability for complete visibility p_C.
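The weighted soft false positive/negative counts and the combined loss can be sketched in NumPy as follows. The gamma and sigma values are placeholders of our choosing, and a real implementation would use a differentiable framework such as PyTorch so that the loss can be backpropagated:

```python
import numpy as np

def soft_fp_fn(pred, target, dist, gamma=4.0, sigma=2.0):
    """Differentiable counts of false positive/negative voxels, weighted so
    that voxels close to the target vertebra surface (small `dist`) count
    more. gamma and sigma are illustrative placeholder values.
    """
    w = gamma * np.exp(-dist / sigma) + 1.0
    fp = np.sum(w * (1.0 - target) * pred)   # predicted foreground where reference is background
    fn = np.sum(w * target * (1.0 - pred))   # predicted background where reference is foreground
    return fp, fn

def total_loss(pred, target, dist, lam, p_label, t_label, p_c, t_c, eps=1e-7):
    """Combined loss: weighted soft FP/FN segmentation error, absolute
    labeling error, and binary cross-entropy for the completeness score."""
    fp, fn = soft_fp_fn(pred, target, dist)
    seg = lam * fp + fn
    label = abs(p_label - t_label)
    comp = -(t_c * np.log(p_c + eps) + (1 - t_c) * np.log(1 - p_c + eps))
    return seg + label + comp
```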
In our experiments, vertebrae were considered completely visible in a patch when they were manually marked as completely visible in the scan and when no more than 2 % of the volume of the reference segmentation of the vertebra lay outside the patch. This allows for some tolerance, as manual identification of incompletely visible vertebrae can be ambiguous in scans with low resolution or low-dose artifacts.

We used random elastic deformations, random Gaussian noise, random Gaussian smoothing as well as random cropping along the z-axis to augment the training data. Rectified linear units were used as activation function in all layers except the output layers of the segmentation and the classification paths, in which sigmoid functions were used. The network was implemented using the PyTorch framework. Training on an Nvidia Titan X GPU took about 4–5 days when training for 100 000 iterations.
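The loss terms and the λ schedule described above can be sketched as follows. This is an illustrative NumPy version (the paper used PyTorch), and the parameter values for γ, σ, λ_min and the schedule scale s are placeholders, not the values used in the paper:

```python
import numpy as np

def surface_weights(dist, gamma, sigma):
    # omega_i = gamma * exp(-d_i / sigma) + 1, from voxel-to-surface distances
    return gamma * np.exp(-dist / sigma) + 1.0

def segmentation_loss(p, t, dist, lam, gamma=8.0, sigma=4.0):
    """lam * FP_soft + FN_soft with surface-distance weights.
    p: probabilistic predictions, t: binary reference labels,
    dist: distance of each voxel to the target vertebra surface."""
    w = surface_weights(dist, gamma, sigma)
    fp_soft = np.sum(w * (1.0 - t) * p)   # soft false positive count
    fn_soft = np.sum(w * t * (1.0 - p))   # soft false negative count
    return lam * fp_soft + fn_soft

def lam_schedule(n, n_max, lam_min=0.1, s=8.0):
    """Sigmoid increase of lambda from about lam_min towards 1,
    centered at half of the maximum number of iterations."""
    theta = (n - n_max / 2.0) / (n_max / s)
    return lam_min + (1.0 - lam_min) / (1.0 + np.exp(-theta))
```

A perfect prediction yields zero loss, and λ reaches the midpoint value (1 + λ_min)/2 at n = n_max/2, saturating towards 1 afterwards, which matches the described behavior of equal FP/FN weighting in the second half of training.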
4. Evaluation
We trained and evaluated the method with five sets of CT and MR scans that visualize the spine. Reference segmentation masks for four of these datasets are publicly available, which allowed for a comparison with other publications that used the same data. Examples of images from the datasets are shown in Figure 3.

The thoracolumbar spine CT dataset consists of 15 dedicated spine CT scans that visualize all thoracic and lumbar vertebrae. It was originally used for the spine segmentation challenge held in conjunction with the Computational Spine Imaging (CSI) workshop at MICCAI 2014 (Yao et al., 2016). All subjects were young adults (20 to 34 years) without vertebral fractures who were scanned with IV-contrast administration. The scans were reconstructed to in-plane resolutions of 0.31 mm to 0.36 mm and slice thicknesses of 0.7 mm to 1.0 mm. Semi-automatically obtained reference segmentations were provided by the challenge organizers. To allow for a comparison with the challenge results, we used the same data split with 5 scans for evaluation and the remaining 10 scans for training and development.

The xVertSeg.v1 dataset consists of 15 lumbar spine CT scans of subjects with compression fractures of various grades and types (Ibragimov et al., 2017). Manual reference segmentations are available for the lumbar vertebrae and were defined through a consensus reading of two observers. The scans were reconstructed to in-plane resolutions of 0.29 mm to 0.80 mm and slice thicknesses of 1.0 mm to 1.9 mm. There are currently two other publications that used the same dataset, but with different evaluation/training separation (Janssens et al., 2018; Sekuboyina et al., 2017). We therefore used scans 1 to 5 for evaluation and the remaining 10 scans for training.

The low-dose chest CT dataset consists of 55 scans from the National Lung Screening Trial (The National Lung Screening Trial Research Team, 2011).
These scans were acquired for lung imaging and visualize, in addition to the lungs, a variable section of the thoracic and upper lumbar vertebrae. The scanned subjects were heavy smokers aged 50 to 74 years and therefore at increased risk for vertebral compression fractures due to their advanced age and smoking history. The scans were acquired with low radiation dose and reconstructed to in-plane resolutions of 0.54 mm to 0.82 mm and slice thicknesses of 1.0 mm to 2.5 mm. We created manual and semi-automatic reference segmentations for this dataset: 10 scans were used for evaluation and were therefore fully manually annotated by drawing along the contour of each vertebra in sagittal slices using an interactive live wire tool (Barrett and Mortensen, 1997). The contours were converted into segmentation masks, in which inaccuracies and other mistakes were corrected voxel-by-voxel. An additional set of 5 scans was annotated in the same way and was used to train a preliminary version of the network. This network was used to predict rough segmentations in the remaining 40 scans. These rough segmentations were manually inspected and corrected voxel-by-voxel, and were used for training of the final network. This strategy enabled us to create a large training set with substantially less manual annotation effort than fully manual segmentation would have required, exploiting that perfectly accurate reference segmentations are not strictly necessary for training data. Additionally, a second observer fully manually annotated two scans from the evaluation set for an estimation of the interobserver agreement. All fully manual and semi-automatic segmentations were performed in sagittal views by observers who received detailed instructions beforehand.
Additionally, all segmentations were validated by an experienced radiologist.

The lumbar spine CT dataset consists of 10 scans of healthy subjects and corresponding manual reference segmentations of the lumbar vertebrae (Ibragimov et al., 2014; Korez et al., 2015). The scans were reconstructed to in-plane resolutions of 0.28 mm to 0.79 mm and slice thicknesses of 0.7 mm to 1.5 mm. Because this dataset is the smallest of the datasets that we included, it was used for an external evaluation of our supervised approach. Scans from this dataset were therefore only used for evaluation and were not part of the training set.

The lumbar spine MR dataset consists of 23 T2-weighted turbo spin echo MR images acquired at 1.5 T in sagittal orientation (Chu et al., 2015). The scans have a resolution of 2 mm × … . Reference segmentations are provided for the vertebral bodies contained in the scan. None of the datasets contained multiple scans of the same subject.

Figure 3: Examples of the various image types. Shown are examples of (a) thoracolumbar spine CT, (b) lumbar spine CT with multiple compression fractures, (c) low-dose chest CT, (d) lumbar spine CT, and (e) lumbar spine MR (T2-weighted).
The segmentation performance was evaluated with the metrics most commonly reported in publications that used the same datasets. These metrics were the Dice coefficient to measure volume overlap and the average absolute symmetric surface distance (ASSD) to measure segmentation accuracy along the vertebral surface. Both metrics were calculated for individual vertebrae and then averaged across all scans. Each vertebra in the reference segmentation was compared with the vertebra in the automatic segmentation mask with which it had the largest overlap.

The identification performance was evaluated using the identification accuracy, i.e., the percentage of vertebrae that were assigned the correct anatomical label, and the linearly weighted kappa coefficient (Cohen, 1968). Through the linear weighting, the kappa coefficient also captures the magnitude of mistakes.

The completeness classification performance was evaluated using the classification accuracy and the average number of false positives and false negatives per scan. False positives were in this case vertebrae that were incompletely visible, but were classified as completely visible.
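The per-vertebra evaluation described above, matching each reference vertebra to the automatic label with the largest overlap and computing the Dice coefficient, can be sketched as follows. This is an illustrative NumPy implementation, not the evaluation code used in the paper:

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two binary masks."""
    a = a.astype(bool)
    b = b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def match_and_score(reference, automatic):
    """For each labeled vertebra in the integer reference mask, find the
    automatic label with the largest overlap and return its Dice score."""
    scores = {}
    for ref_label in np.unique(reference):
        if ref_label == 0:           # 0 = background
            continue
        ref_mask = reference == ref_label
        overlapping = automatic[ref_mask]
        overlapping = overlapping[overlapping > 0]
        if overlapping.size == 0:    # vertebra missed entirely
            scores[ref_label] = 0.0
            continue
        auto_label = np.bincount(overlapping).argmax()
        scores[ref_label] = dice(ref_mask, automatic == auto_label)
    return scores
```

The ASSD would additionally require extracting the vertebral surfaces and averaging bidirectional surface distances, which is omitted here for brevity.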
5. Experiments and Results
We trained modality-specific instances of the network adjusted to the different ground truths, i.e., to perform vertebra segmentation in CT and vertebral body segmentation in MR. The CT training set consisted of 60 scans, of which 10 were thoracolumbar spine CT scans, 10 lumbar spine CT scans with compression fractures and 40 NLST scans. The CT evaluation set consisted of 30 scans, of which 5 were thoracolumbar spine CT scans, 5 lumbar spine CT scans with compression fractures, 10 NLST scans, and 10 normal lumbar spine CT scans, which were not represented in the training set. The dataset for vertebral body segmentation in MR consisted of only 23 scans in total and we therefore performed 3-fold cross-validation with evaluation sets of 8, 8 and 7 scans and the remaining scans used for training. The MR images were normalized by clipping off values below the 5th and above the 95th percentile and transforming the values into a fixed value range.

Similar performance was achieved for vertebra segmentation across the various CT datasets, with an average Dice score of 94.9 %. Examples of the automatic segmentations and their differences from the ground truth segmentations are shown in Figure 4.

In the CT datasets, the segmentation was more accurate on high-resolution dedicated spine scans of healthy subjects than on low-dose, low-resolution chest CT scans and scans of subjects with in some cases severe compression fractures. This is also visible in the segmentation performance stratified by vertebra (Figure 5). Segmentations were more accurate for the lumbar (L1–L5) than for the thoracic vertebrae (T1–T12), which are covered by the more challenging low-dose chest CT scans.
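The percentile-based normalization of the MR images can be sketched as follows; the target range [−1, 1] is an assumption here, as the exact range is not stated above:

```python
import numpy as np

def normalize_mr(image, low=5, high=95):
    """Clip intensities to the [5th, 95th] percentile range and rescale
    linearly to [-1, 1] (target range assumed for illustration)."""
    lo, hi = np.percentile(image, [low, high])
    clipped = np.clip(image, lo, hi)
    return 2.0 * (clipped - lo) / (hi - lo) - 1.0
```

Percentile clipping makes the normalization robust to the long intensity tails that are typical of MR images, where a few bright voxels would otherwise dominate a min-max rescaling.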
Outliers among the lumbar vertebrae correspond to vertebrae from the xVertSeg.v1 dataset, which features a number of severely deformed lumbar vertebrae that are particularly challenging to segment.

In comparison with other vertebra segmentation methods, our iterative instance segmentation approach outperformed previous methods on the thoracolumbar spine CT dataset as well as on the lumbar spine CT dataset. In both cases, there was a substantial improvement in average Dice score and especially also in the surface distance (Table 1). On the xVertSeg.v1 dataset with various fractured vertebrae, our method performed comparably to the method of Sekuboyina et al. (2017) and not as well as the method of Janssens et al. (2018). However, both of these publications used a different separation between training and evaluation data and the results are therefore not directly comparable. For vertebral body segmentation in MR, our approach achieved on average higher Dice scores and lower surface distances than previous methods, but with higher variance compared to Korez et al. (2016). Although the automatic segmentation was overall accurate on low-dose chest CT, the performance was still slightly below the level of interobserver variation (average difference of 3.3 % in Dice score and 0.1 mm in surface distance).

The segmentation was optimized by minimizing a loss term based on false positive and false negative predictions. We additionally trained instances of the network using the categorical cross-entropy and the Dice coefficient as segmentation loss.

Figure 4: Segmentation results in different types of images: (a) low-dose chest CT, (b) lumbar spine CT (xVertSeg.v1 dataset), (c) lumbar spine MR. The segmentations are shown both as color overlay with different colors for different instances (left), and as difference maps with oversegmentation errors marked in red and undersegmentation errors in yellow (right). Some images have been cropped to better show the vertebral column.
Table 1: Quantitative results of automatic segmentation, anatomical identification and completeness classification. Data are reported as mean ± standard deviation and were obtained with modality-specific networks, i.e., with a network trained with CT images and another network trained with MR images.
Dataset: Dice score (%), ASSD (mm); identification accuracy (%), κ; completeness classification accuracy (%), FP/scan, FN/scan
Thoracolumbar spine CT: Proposed method, Dice 96.3
Lumbar spine CT with fractures∗: Proposed method, Dice 94.6; comparison methods†
Lumbar spine CT: Ibragimov et al. (2014)‡, Dice 93.6; Proposed method, Dice 94.4
Lumbar spine MR: Proposed method; comparison methods§
∗ xVertSeg.v1 dataset  † subset (2/10 scans)  ‡ only vertebral bodies  § ASD (non-symmetric)
However, the network optimized using the categorical cross-entropy converged in a state in which it predicted all voxels of any input patch as background. The network optimized using the Dice score performed substantially worse than the network trained with the proposed loss function, achieving for instance on low-dose chest CT scans an average Dice coefficient of 89.9 %, compared with 93.1 % for the proposed loss. Apart from minor differences, the segmentation performance of the networks with and without the additional tasks was overall comparable (Table 2).

The anatomical identification was correct for 93 % of the vertebrae in the CT datasets (κ = 0.99). The labeling of one thoracolumbar spine CT scan was offset by one vertebra, resulting in incorrect labeling of all 16 visible vertebrae. The range T2–L5 was predicted, but this patient had only four lumbar vertebrae and the correct range was therefore T1–L4. In the MR dataset, anatomical identification succeeded in all cases. Although scans covering different sections of the spine were included in our evaluation, most still visualized anatomical landmarks such as the sacrum that potentially simplified the anatomical identification. Even low-dose chest CT scans have a fairly standardized field of view defined by the location and size of the lungs, which might have simplified vertebra identification. To evaluate the identification performance on arbitrary field of view images, we performed an experiment with randomly cropped images. For each of the 15 evaluation scans in the thoracolumbar spine CT and the chest CT datasets, we created two new images by randomly cropping the original image along the z-axis. These new images were minimally 80 % and 60 % smaller than the original image, but we ensured that they still contained multiple vertebrae by enforcing a minimum size of 150 mm along the z-axis. Of the vertebrae visible in these cropped images, 93 % were correctly identified; mislabeled vertebrae were offset by only ±
1, hence the high κ score.

None of the scans in any of the datasets visualized the entire spine; all scans contained vertebrae that were only partially visible due to the limited field of view. In the CT datasets, 97 % of the vertebrae were correctly classified as completely or incompletely visible. In the MR dataset, all vertebrae were correctly classified.

Figure 5: Box and whisker plots for per-vertebra Dice scores (top) and average absolute symmetric surface distances (bottom) in the CT evaluation set with 25 scans. Note that not all vertebrae were visible in every scan. Only completely visible vertebrae are included in the evaluation.

Most mistakes were made in low-dose chest
CT scans, which was the type of scan with the least standardizedfield of view. Mistakes occurred only in the first or last visiblevertebrae, near the boundary of the field of view, and often invertebrae of which only a small part was missing from the scan(Figure 6). Notably, there were no mistakes where the networkpredicted an implausible sequence by classifying a vertebra be-tween two completely visible vertebrae as incompletely visible,or vice versa.
The networks in our experiments were trained to traverse upwards along the spine. To evaluate whether the direction of traversal influences the performance, we trained an additional instance of the network on the CT datasets to traverse downwards along the spine. We found no differences in anatomical identification and completeness classification of the detected vertebrae. However, while the segmentation performance was overall comparable with that of the same network traversing upwards, the segmentation performance deteriorated on the low-dose chest CT dataset (Table 2). In multiple low-dose chest CT scans, the topmost vertebrae were less well segmented (Figure 7). When traversing downwards, these are the first vertebrae to be segmented. Notably, the traversal process was still set off correctly in all cases.

In an additional experiment, we used two networks that were trained to traverse in opposite directions. The result of the upwards traversal was used to initialize the downwards traversal

Table 2: Comparison of the segmentation performance of a network consisting only of the segmentation path (S network) and a network consisting of the segmentation, the anatomical labeling and the completeness classification paths (S-L-C network), as proposed. Both networks traverse the spine upwards. Additionally, the segmentation performance is reported for the full network traversing downwards (S-L-C network ↓).
Dataset: Dice score (%), ASSD (mm)
Thoracolumbar spine CT: S network, Dice 96.2; S-L-C network; S-L-C network ↓
Lumbar spine CT with fractures∗: S network, Dice 95.3; S-L-C network; S-L-C network ↓
Low-dose chest CT: S network; S-L-C network; S-L-C network ↓
Lumbar spine CT: S network; S-L-C network; S-L-C network ↓
∗ xVertSeg.v1 dataset

by starting the downwards traversal from the last vertebra detected during upwards traversal. We hoped that this would result in improved segmentation performance because it relieves the downwards network from detecting the first vertebra without any context information provided by the instance memory. However, this proved not to be beneficial and resulted in virtually identical segmentation performance compared to only upwards traversal.

To verify that an instance segmentation approach is beneficial, we also trained the segmentation component of our network, which is a U-net-like 3D FCN with skip connections (Figure 1), to segment and identify the vertebrae using multiclass voxel classification instead of the proposed iterative binary segmentation. This network received an image patch of 128 × 128 ×
128 voxels as input, but unlike in our iterative approach not a corresponding patch from the instance memory. The network had 25 output classes, corresponding to the 24 different vertebrae and a background class. At inference time, the patch was moved over the entire image in a sliding window fashion with overlapping windows so that multiple predictions were obtained for each voxel. Each voxel was eventually labeled with the class label that had the highest average probability. Using non-overlapping windows resulted in substantially worse performance. Overall, the multiclass FCN achieved an average Dice score of 78.7 % and identified the vertebrae with a weighted κ of 0.98. While the segmentations were overall reasonably accurate, with a Dice score of 84.6 %, the individual vertebrae were often not well separated (Figure 8).

Figure 6: Examples of vertebra completeness classification results. Vertebrae classified as completely visible are marked in light green and vertebrae classified as incompletely visible are marked in red. Arrows indicate misclassified vertebrae. Shown are low-dose chest CT scans (left and right) and a lumbar spine CT scan (center).

The runtime of a single iteration step was about 1 s on standard hardware. The number of required iteration steps and thus the overall runtime per image depends on the size of the image, the number of visible vertebrae and their location within the image, which influences how many steps are initially needed to find the first vertebra. For instance, the low-dose chest CT scans covered on average 13.7 vertebrae and required 55 iteration steps. The thoracolumbar spine CT scans covered on average 17.8 vertebrae, but with a narrower field of view focused on the spine, and required a comparable number of iteration steps (56 on average). The average runtime per scan was about one minute in all datasets, excluding time required for loading the image and storing the results.
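The sliding-window scheme used for the multiclass FCN baseline, averaging per-class probabilities over overlapping windows and taking the per-voxel argmax, can be sketched as follows. This is an illustrative NumPy implementation; `predict_patch` stands in for the trained multiclass FCN, and the patch size, stride and class count are parameters:

```python
import itertools
import numpy as np

def grid_starts(size, patch, stride):
    """Start indices so that overlapping windows cover the full axis."""
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)  # ensure the final window reaches the edge
    return starts

def sliding_window_predict(image, predict_patch, patch, stride, n_classes):
    """Slide a cubic patch over a 3D image, accumulate the per-class
    probabilities of all overlapping windows, and label each voxel with
    the class that has the highest average probability."""
    probs = np.zeros((n_classes,) + image.shape)
    counts = np.zeros(image.shape)
    axes = [grid_starts(s, patch, stride) for s in image.shape]
    for z, y, x in itertools.product(*axes):
        sl = (slice(z, z + patch), slice(y, y + patch), slice(x, x + patch))
        probs[(slice(None),) + sl] += predict_patch(image[sl])
        counts[sl] += 1
    return np.argmax(probs / counts, axis=0)
```

With a stride smaller than the patch size, each voxel receives several predictions and the averaging smooths out boundary artifacts of individual windows, which is consistent with the observation above that non-overlapping windows performed substantially worse.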
6. Discussion
This paper demonstrates that fully convolutional neural networks, which have been widely used for semantic segmentation (Litjens et al., 2017), are also capable of learning a complex instance segmentation task. Vertebra segmentation performed instance-by-instance required the network to learn to infer from an additional memory input which vertebra to segment and which vertebrae to ignore. Additionally, the same network was able to perform multiple tasks concurrently, namely vertebra segmentation, identification and classification to determine whether the vertebra was completely contained in the scan. This approach outperformed all methods that participated in the CSI 2014 spine segmentation challenge (Yao et al., 2016) and performed better than or comparably to state-of-the-art methods on other datasets. In a particularly challenging set of low-dose chest CT scans, the performance was close to the interobserver variability.

The diverse selection of datasets that we used to evaluate the iterative segmentation approach demonstrates that this approach can cope with arbitrary fields of view, with low-resolution and low-dose scans, and with scans with normal as well as severely deformed vertebrae. The approach is entirely supervised, which enables transferring it to other modalities and other segmentation tasks, which we demonstrated by applying the same approach without any modifications to vertebral body segmentation in T2-weighted MR images. Moreover, we demonstrated that the networks did not overfit to the datasets that were represented in the training data, but instead outperformed state-of-the-art methods on an entirely unseen dataset of lumbar spine CT scans. Previous methods for vertebra segmentation were often tailored to specific image types and evaluated on homogeneous datasets.
Although our approach performed slightly worse than some of these methods on some of the datasets included in our evaluation, we demonstrated consistently high performance across multiple datasets.

The proposed iterative vertebra-by-vertebra segmentation performed substantially better than a regular multiclass FCN similar to a 3D U-net (Çiçek et al., 2016). However, the performance of the multiclass FCN exceeded our expectations and might further improve with more training data and with hardware that enables training of larger networks, or by including more context information via, e.g., a multi-scale approach (Kamnitsas et al., 2017; Moeskops et al., 2016). Even though the iterative approach requires the network to combine two inputs to identify a specific vertebra and ignore others, it also simplifies the segmentation problem from a multiclass into a binary voxel labeling task. The direct comparison of these two approaches indicates that individual instances of a target class of objects are better separated by an FCN if the network is trained to focus on individual instances. This strategy could also lead to improvements in other instance segmentation tasks, for instance, in histopathological image analysis.

We combined multiple tasks into a single network, which helps to simplify both training and inference: only a single network needs to be trained, and at inference time, each patch needs to be passed only through one network to obtain multiple predictions, one for each distinct task. Even though the segmentation path and the identification and completeness classification paths shared part of the network, we did not find that this improved the segmentation performance.
However, the segmentation performance also did not deteriorate when these additional tasks were added, which indicates that the proposed combination into a single network is useful. The proposed network architecture with the additional output paths uses more GPU memory than task-specific networks would and therefore more strongly limits the maximum number of filters per layer and the depth of the network. Future hardware generations with larger memory will enable training of larger networks, which might lead to further performance improvements.

Figure 7: Segmentations obtained with the proposed iterative segmentation approach, comparing models trained for either upwards or downwards traversal. Both examples are low-dose chest CT scans (cropped). White arrows indicate segmentation errors in the top-most visible vertebrae, which occur more often when traversing downwards.

Figure 8: Segmentations obtained with a multiclass FCN, thus without using the iterative segmentation strategy, compared with segmentations obtained with the proposed iterative approach. While the segmentations are overall fairly accurate, the individual vertebrae are not well separated.

The proposed spine traversal strategy can be applied to segment the vertebrae from top to bottom or vice versa. However, we observed better performance for upwards compared to downwards traversal on low-dose chest CT scans, with differences mostly in the upper vertebrae. The size of the vertebrae increases from cervical to lumbar vertebrae, i.e., from top to bottom. Hence, when traversing downwards, the first vertebra that needs to be found, without additional information that could be derived from the memory input, is the smallest vertebra that is visible in the scan.
Additionally, the region around the uppermost visible vertebrae is often affected by low-dose artifacts in low-dose chest CT scans, which makes starting the traversal from these vertebrae especially challenging. Because there were no substantial differences between downwards and upwards traversal in the other CT datasets, upwards traversal is overall the more robust strategy.

Even though we used a variety of datasets in our experiments, some important types of data were not present. These include scans covering the cervical vertebrae and scans of patients with implants near the spine, such as pedicle screws. However, since our approach is entirely supervised and trained end-to-end, it will likely be able to handle these cases if sufficient training examples are available. Furthermore, there were no scans of patients with irregular numbers of vertebrae present in the training set, which caused mislabeling of the vertebrae in one evaluation patient with an irregular number of vertebrae. Ensuring accurate anatomical labeling for such cases could be an interesting direction for further research. However, the high kappa scores indicate that the labeling would be only minimally offset in such cases. Depending on the exact clinical application, the impact of such labeling mistakes would therefore be limited.

Segmentation of the vertebrae one after the other, using information about the already segmented vertebrae as a prior, is inherently susceptible to cascading failure. Failure to find or correctly segment a single vertebra may cause failure to find or correctly segment all subsequent vertebrae. There is additionally no element that explicitly ensures that the predicted segmentation mask covers only a single vertebra if multiple are visible. While we did not observe these kinds of failures in our evaluation, they are likely to occur in images with extreme anatomical abnormalities or severe imaging artifacts.
Refinement of the labeling through a maximum likelihood approach suffers from a similar weakness: except for cases with an irregular number of vertebrae, the labeling can only be entirely correct or entirely offset, even if the correct labels were predicted for some of the vertebrae. This limitation could potentially be addressed in the future by employing a more sophisticated global labeling model, e.g., based on Markov models.

Manual vertebra segmentation is a time-consuming and tedious task, requiring annotation times of about 40 to 60 hours per scan in low-dose chest CT and even longer in scans with higher resolution and coverage of more vertebrae. Semi-automatic segmentation proved to be an effective strategy for generating reference segmentations for network training from only a few manual reference segmentations. Especially if little training data is available, additional priors or model fitting steps could help stabilize the performance. These could be statistical knowledge about typical sizes or shapes of vertebrae, or additional fitting of a deformable surface mesh model to the segmentation results (Korez et al., 2016).

Precise segmentation and identification of the vertebrae in CT and MR scans enables automatic spine analysis, notably also in images that were originally not intended for spine imaging. For instance, our iterative approach achieved a segmentation and identification performance on low-dose chest CT scans that is likely sufficient to analyze the shape of the vertebral bodies for detection of compression fractures. This could enable opportunistic screening for early signs of osteoporosis in lung cancer screening programs, in addition to screening for pulmonary abnormalities.

In conclusion, this paper presents an iterative instance-by-instance approach to vertebra segmentation and anatomical identification. This approach is fast, flexible and accurate across a large variety of both dedicated as well as non-dedicated spine scans.

Acknowledgements
We would like to thank the organizers of the CSI 2014 spine segmentation challenge, the Laboratory of Imaging Technologies at the University of Ljubljana and the authors of the MR dataset for making scans and reference segmentations publicly available. We are furthermore grateful to the United States National Cancer Institute (NCI) for providing access to NCI's data collected by the National Lung Screening Trial. The statements contained in this publication are solely ours and do not represent or imply concurrence or endorsement by NCI.
References
Athertya, J.S., Kumar, G.S., 2016. Automatic segmentation of vertebral contours from CT images using fuzzy corners. Computers in Biology and Medicine 72, 75–89.
Barrett, W.A., Mortensen, E.N., 1997. Interactive live-wire boundary extraction. Medical Image Analysis 1, 331–341.
Bromiley, P.A., Kariki, E.P., Adams, J.E., Cootes, T.F., 2016. Fully automatic localisation of vertebrae in CT images using random forest regression voting, in: International Workshop on Computational Methods and Clinical Applications for Spine Imaging. Springer. volume 10182 of
LNCS, pp. 51–63.
Cai, Y., Landis, M., Laidley, D.T., Kornecki, A., Lum, A., Li, S., 2016. Multi-modal vertebrae recognition using transformed deep convolution network. Computerized Medical Imaging and Graphics 51, 11–19.
Cai, Y., Osman, S., Sharma, M., Landis, M., Li, S., 2015. Multi-modality vertebra recognition in arbitrary views using 3D deformable hierarchical model. IEEE Transactions on Medical Imaging 34, 1676–93.
Castro-Mateos, I., Pozo, J.M., Pereanez, M., Lekadir, K., Lazary, A., Frangi, A.F., 2015. Statistical interspace models (SIMs): Application to robust 3D spine segmentation. IEEE Transactions on Medical Imaging 34, 1663–1675.
Chen, H., Shen, C., Qin, J., Ni, D., Shi, L., Cheng, J.C.Y., Heng, P.A., 2015. Automatic localization and identification of vertebrae in spine CT via a joint learning model with deep neural networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. volume 9349 of
LNCS, pp. 515–522.
Chu, C., Belavý, D.L., Armbrecht, G., Bansmann, M., Felsenberg, D., Zheng, G., 2015. Fully automatic localization and segmentation of 3D vertebral bodies from CT/MR images via a learning-based method. PLOS ONE 10, e0143327.
Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O., 2016. 3D U-Net: Learning dense volumetric segmentation from sparse annotation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. volume 9901 of
LNCS, pp. 234–241.
Cohen, J., 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70, 213–220.
De Brabandere, B., Neven, D., Van Gool, L., 2017. Semantic instance segmentation for autonomous driving, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 478–480.
Forsberg, D., Lundström, C., Andersson, M., Vavruch, L., Tropp, H., Knutsson, H., 2013. Fully automatic measurements of axial vertebral rotation for assessment of spinal deformity in idiopathic scoliosis. Physics in Medicine and Biology 58, 1775.
Forsberg, D., Sjöblom, E., Sunshine, J.L., 2017. Detection and labeling of vertebrae in MR images using deep learning with clinical annotations as training data. Journal of Digital Imaging 30, 406–412.
Glocker, B., Feulner, J., Criminisi, A., Haynor, D., Konukoglu, E., 2012. Automatic localization and identification of vertebrae in arbitrary field-of-view CT scans, in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. volume 7512 of
LNCS, pp. 590–598.
Glocker, B., Zikic, D., Konukoglu, E., Haynor, D.R., Criminisi, A., 2013. Vertebrae localization in pathological spine CT via dense classification from sparse annotations, in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. volume 8150 of
LNCS, pp. 262–270.
Grigoryan, M., Guermazi, A., Roemer, F.W., Delmas, P.D., Genant, H.K., 2003. Recognizing and reporting osteoporotic vertebral fractures. European Spine Journal 12, S104–S112.
Hammernik, K., Ebner, T., Stern, D., Urschler, M., Pock, T., 2015. Vertebrae segmentation in 3D CT images based on a variational framework, in: Recent Advances in Computational Methods and Clinical Applications for Spine Imaging. Springer. volume 20 of
Lecture Notes in Computational Vision and Biomechanics, pp. 227–233.
He, K., Gkioxari, G., Dollár, P., Girshick, R.B., 2017. Mask R-CNN, in: IEEE International Conference on Computer Vision (ICCV). arXiv:1703.06870.
Ibragimov, B., Korez, R., Likar, B., Pernuš, F., Vrtovec, T., 2015. Interpolation-based detection of lumbar vertebrae in CT spine images, in: Recent Advances in Computational Methods and Clinical Applications for Spine Imaging. Springer. volume 20 of
Lecture Notes in Computational Vision and Biomechanics, pp. 73–84.
Ibragimov, B., Korez, R., Likar, B., Pernus, F., Xing, L., Vrtovec, T., 2017. Segmentation of pathological structures by landmark-assisted deformable models. IEEE Transactions on Medical Imaging 36, 1457–1469.
Ibragimov, B., Likar, B., Pernuš, F., Vrtovec, T., 2014. Shape representation for efficient landmark-based segmentation in 3-D. IEEE Transactions on Medical Imaging 33, 861–874.
Janssens, R., Zeng, G., Zheng, G., 2018. Fully automatic segmentation of lumbar vertebrae from CT images using cascaded 3D fully convolutional networks, in: IEEE 15th International Symposium on Biomedical Imaging (ISBI), pp. 893–897.
Kadoury, S., Labelle, H., Paragios, N., 2011. Automatic inference of articulated spine models in CT images using high-order Markov random fields. Medical Image Analysis 15, 426–437.
Kadoury, S., Labelle, H., Paragios, N., 2013. Spine segmentation in medical images using manifold embeddings and higher-order MRFs. IEEE Transactions on Medical Imaging 32, 1227–1238.
Kamnitsas, K., Ledig, C., Newcombe, V.F., Simpson, J.P., Kane, A.D., Menon, D.K., Rueckert, D., Glocker, B., 2017. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis 36, 61–78.
Kelm, B.M., Wels, M., Zhou, S.K., Seifert, S., Suehling, M., Zheng, Y., Comaniciu, D., 2013. Spine detection in CT and MR using iterated marginal space learning. Medical Image Analysis 17, 1283–1292.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
Klinder, T., Ostermann, J., Ehm, M., Franz, A., Kneser, R., Lorenz, C., 2009. Automated model-based vertebra detection, identification, and segmentation in CT images. Medical Image Analysis 13, 471–482.
Knez, D., Likar, B., Pernuš, F., Vrtovec, T., 2016. Computer-assisted screw size and insertion trajectory planning for pedicle screw placement surgery. 
IEEE Transactions on Medical Imaging 35, 1420–1430.
Korez, R., Ibragimov, B., Likar, B., Pernuš, F., Vrtovec, T., 2015. A framework for automated spine and vertebrae interpolation-based detection and model-based segmentation. IEEE Transactions on Medical Imaging 34, 1649–1662.
Korez, R., Likar, B., Pernuš, F., Vrtovec, T., 2016. Model-based segmentation of vertebral bodies from MR images with 3D CNNs, in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. volume 9901 of LNCS, pp. 433–441.
Lessmann, N., van Ginneken, B., Išgum, I., 2018. Iterative convolutional neural networks for automatic vertebra identification and segmentation in CT images, in: Medical Imaging. volume 10574 of
Proceedings of SPIE, p. 1057408.
Leventon, M.E., Grimson, W.E.L., Faugeras, O., 2002. Statistical shape influence in geodesic active contours, in: 5th IEEE EMBS International Summer School on Biomedical Imaging.
Li, K., Hariharan, B., Malik, J., 2016. Iterative instance segmentation, in: Conference on Computer Vision and Pattern Recognition.
Liang, X., Lin, L., Wei, Y., Shen, X., Yang, J., Yan, S., 2018. Proposal-free network for instance-level semantic object segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 2978–2991.
Liao, H., Mesfin, A., Luo, J., 2018. Joint vertebrae identification and localization in spinal CT images by combining short- and long-range contextual information. IEEE Transactions on Medical Imaging 37, 1266–1275.
Lim, P.H., Bagci, U., Bai, L., 2014. A robust segmentation framework for spine trauma diagnosis, in: Computational Methods and Clinical Applications for Spine Imaging. Springer. volume 17 of
Lecture Notes in Computational Vision and Biomechanics, pp. 25–33.
Litjens, G.J.S., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A.W.M., van Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88.
Major, D., Hladůvka, J., Schulze, F., Bühler, K., 2013. Automated landmarking and labeling of fully and partially scanned spinal columns in CT images. Medical Image Analysis 17, 1151–1163.
Mastmeyer, A., Engelke, K., Fuchs, C., Kalender, W.A., 2006. A hierarchical 3D segmentation method and the definition of vertebral body coordinate systems for QCT of the lumbar spine. Medical Image Analysis 10, 560–577.
Milletari, F., Navab, N., Ahmadi, S.A., 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: International Conference on 3D Vision, pp. 565–571.
Mirzaalian, H., Wels, M., Heimann, T., Kelm, B.M., Suehling, M., 2013. Fast and robust 3D vertebra segmentation using statistical shape models, in: 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, p. 3.
Moeskops, P., Viergever, M.A., Mendrik, A.M., de Vries, L.S., Benders, M.J., Išgum, I., 2016. Automatic segmentation of MR brain images with a convolutional neural network. IEEE Transactions on Medical Imaging 35, 1252–1261.
Novotny, D., Albanie, S., Larlus, D., Vedaldi, A., 2018. Semi-convolutional operators for instance segmentation, in: European Conference on Computer Vision. Springer. number 11205 in LNCS, pp. 89–105.
Pereanez, M., Lekadir, K., Castro-Mateos, I., Pozo, J.M., Lazary, A., Frangi, A.F., 2015. Accurate segmentation of vertebral bodies and processes using statistical shape decomposition and conditional models. IEEE Transactions on Medical Imaging 34, 1627–1639.
Rasoulian, A., Rohling, R., Abolmaesumi, P., 2013. Lumbar spine segmentation using a statistical multi-vertebrae anatomical shape + pose model. 
IEEE Transactions on Medical Imaging 32, 1890–1900.
Ren, M., Zemel, R.S., 2017. End-to-end instance segmentation with recurrent attention, in: Conference on Computer Vision and Pattern Recognition.
Romera-Paredes, B., Torr, P.H.S., 2016. Recurrent instance segmentation, in: European Conference on Computer Vision. Springer. volume 9910 of LNCS, pp. 312–329.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. volume 9351 of
LNCS, pp. 234–241.
Sekuboyina, A., Kukačka, J., Kirschke, J.S., Menze, B.H., Valentinitsch, A., 2018. Attention-driven deep learning for pathological spine segmentation, in: Computational Methods and Clinical Applications in Musculoskeletal Imaging. Springer. volume 10734 of
LNCS, pp. 108–119.
Sekuboyina, A., Valentinitsch, A., Kirschke, J.S., Menze, B.H., 2017. A localisation-segmentation approach for multi-label annotation of lumbar vertebrae using deep nets. arXiv:1703.04347.
Štern, D., Likar, B., Pernuš, F., Vrtovec, T., 2011. Parametric modelling and segmentation of vertebral bodies in 3D CT and MR spine images. Physics in Medicine and Biology 56, 7505–7522.
Stewart, R., Andriluka, M., Ng, A.Y., 2016. End-to-end people detection in crowded scenes, in: Conference on Computer Vision and Pattern Recognition.
Suzani, A., Rasoulian, A., Seitel, A., Fels, S., Rohling, R.N., Abolmaesumi, P., 2015. Deep learning for automatic localization, identification, and segmentation of vertebral bodies in volumetric MR images, in: Medical Imaging. volume 9415 of
Proceedings of SPIE, p. 941514.
The National Lung Screening Trial Research Team, 2011. Reduced lung-cancer mortality with low-dose computed tomographic screening. New England Journal of Medicine 365, 395–409.
Uhrig, J., Cordts, M., Franke, U., Brox, T., 2016. Pixel-level encoding and depth layering for instance-level semantic labeling, in: German Conference on Pattern Recognition (GCPR). Springer. volume 9796 of
LNCS, pp. 14–25.
Wang, Y., Yao, J., Roth, H.R., Burns, J.E., Summers, R.M., 2015. Multi-atlas segmentation with joint label fusion of osteoporotic vertebral compression fractures on CT, in: International Workshop on Computational Methods and Clinical Applications for Spine Imaging. Springer. volume 9402 of
LNCS, pp. 74–84.
Yang, D., Xiong, T., Xu, D., Huang, Q., Liu, D., Zhou, S.K., Xu, Z., Park, J., Chen, M., Tran, T.D., et al., 2017a. Automatic vertebra labeling in large-scale 3D CT using deep image-to-image network with message passing and sparsity regularization, in: IPMI. Springer. volume 10265 of
LNCS, pp. 633–644.
Yang, D., Xiong, T., Xu, D., Zhou, S.K., Xu, Z., Chen, M., Park, J., Grbic, S., Tran, T.D., Chin, S.P., Metaxas, D., Comaniciu, D., 2017b. Deep image-to-image recurrent network with shape basis learning for automatic vertebra labeling in large-scale 3D CT volumes, in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. volume 10435 of
LNCS, pp. 498–506.
Yao, J., Burns, J.E., Forsberg, D., Seitel, A., Rasoulian, A., Abolmaesumi, P., Hammernik, K., Urschler, M., Ibragimov, B., Korez, R., et al., 2016. A multi-center milestone study of clinical vertebral CT segmentation. Computerized Medical Imaging and Graphics 49, 16–28.
Yao, J., Burns, J.E., Munoz, H., Summers, R.M., 2012. Detection of vertebral body fractures based on cortical shell unwrapping, in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. volume 7512 of
LNCS, pp. 509–516.
Zukić, D., Vlasák, A., Egger, J., Hořínek, D., Nimsky, C., Kolb, A., 2014. Robust detection and segmentation for diagnosis of vertebral diseases using routine MR images. Computer Graphics Forum 33, 190–204.