A Convolutional Approach to Vertebrae Detection and Labelling in Whole Spine MRI

Rhydian Windsor, Amir Jamaludin, Timor Kadir, and Andrew Zisserman

Visual Geometry Group, Department of Engineering Science, University of Oxford; Plexalis Ltd
Abstract.
We propose a novel convolutional method for the detection and identification of vertebrae in whole spine MRIs. This involves using a learnt vector field to group detected vertebrae corners together into individual vertebral bodies, and convolutional image-to-image translation followed by beam search to label vertebral levels in a self-consistent manner. The method can be applied without modification to lumbar, cervical and thoracic-only scans across a range of different MR sequences. The resulting system achieves a 98.1% detection rate and a 96.5% identification rate on a challenging clinical dataset of whole spine scans, and matches or exceeds the performance of previous systems at detecting and labelling vertebrae in lumbar-only scans. Finally, we demonstrate the clinical applicability of this method, using it for automated scoliosis detection in both lumbar and whole spine MR scans.
Keywords:
Vertebral bodies · Whole spine MRI · Scoliosis.
The objective of this paper is automated vertebrae detection and identification of vertebral levels. This is an important task for several reasons. Firstly, automated diagnosis of many spinal diseases such as disc degeneration [9,10] or spinal stenosis [13] relies on accurate localisation of vertebral structures or, in the case of pathological scoliosis, lordosis and kyphosis [8], analysing the geometry of the spinal column. Secondly, vertebral bodies can be used to infer other spinal structures of interest such as the spinal cord or ribs. Finally, vertebrae can act as points to allow registration between different scans.

There are several issues that make this task challenging. One of the most obvious is that vertebrae are highly repetitive and hence distinguishing between different levels can be hard. Labelling by simply counting down from the C2 vertebra is problematic as it assumes that all vertebrae have been detected, that C2 is visible, and that every patient has the same number of vertebrae, which is not always true. Furthermore, for clinical use, labelling must be robust to: variations in spinal anatomy (such as collapsed vertebrae, hemivertebrae and fused vertebrae); vertebrae numbers (around 11.3% of the population have one more or one less mobile vertebra [18]); different imaging parameters including MR weighting (e.g. T1, T2, STIR, TIRM and FLAIR); fields of view (e.g. lumbar, whole spine scans); scan resolution; and the number/thickness of slices in the scan.

This paper proposes a new approach to this challenge, in particular in the case of 3-D sagittal whole spine clinical MRIs, which are important for diagnosing several diseases such as ankylosing spondylitis and multiple myeloma.
We make the following contributions to the tasks of vertebral body detection and vertebral level labelling in clinical MRIs: (1) we propose a new convolutional method of detection based on localising the corners and centroids of vertebrae and then grouping them together (Section 2); (2) we reformulate the labelling task as a convolutional enhancement followed by a language-modelling-inspired sequential correction, removing the need for a recurrent network and showing robustness to variations in vertebra numbers (Section 3); (3) we show that the resulting system is robust to a variety of fields of view and pathologies by evaluating on a large clinical dataset of MR lumbar scans and, for the first time, on whole spine scans, achieving state-of-the-art performance at vertebra identification in both (Section 4); (4) we demonstrate a clinical application of this system by using it to automatically detect cases of scoliosis (Section 5).
Related work:
There have been several approaches to automated detection and labelling of vertebrae in MRIs, although most focus on fixed fields of view, e.g. lumbar or cervical scans only [3,9,12,13]. Previous vertebrae labelling methods tend to rely on either heuristic-based graphical models [3,11] or assuming the bottom vertebra is S1 and counting up [13]. Zhao et al. [21] perform labelling in MRIs with arbitrary fields of view, but only in the lower spine (from S1-T12 to L4-T10). Windsor and Jamaludin [19] report a method of detecting, though not labelling, vertebrae in full spine scans iteratively, but also require the location of the S1 vertebra for initialization. Cai et al. [2] also perform detection and labelling by a hierarchical deformation model and even report success in a single full spine scan, but do not evaluate the performance quantitatively on a whole spine dataset. Furthermore, such models are slow to apply and make strong assumptions on spinal geometry, such as a fixed number of vertebrae and a lack of major pathology (fused/collapsed vertebrae). Greater progress has been made in whole-body CT images, where the task is more straightforward: CT imaging protocols are highly standardised, with consistent image intensities representing the X-ray absorption at each voxel location. CT scans also tend to contain more 3-D information, with higher resolution in the sagittal plane than is typical of clinical MR. Approaches using graphical models and recurrent neural networks for labelling have been reported to achieve 70-89% identification rates in mixed field-of-view CT scans [4,5,20].
Detection of the vertebrae proceeds in three stages, as illustrated in Figure 1. First, corners and centroids of each vertebra are predicted in each sagittal slice. Second, each detected corner is assigned to a centroid by predicting a vector field for each corner type (e.g. top left, bottom right) which points to the corresponding centroid. The detected centroid and the four corners which point to it define the bounding quadrilateral for that vertebra in that slice. Third, these quadrilaterals are grouped across slices to define detected volumes for each vertebra.

Inference is performed here by a U-Net architecture (shown in detail in the appendix) which ingests an MRI (one slice) and outputs 13 channels: 5 channels are the landmark detection channels, each used to detect the centroid and 4 corners respectively of all vertebral bodies appearing in that image; the remaining 8 channels are the landmark grouping channels, corresponding to the x and y coordinates of the 4 vector fields used in grouping.

In more detail, the locations of corners and centroids are identified as modes of the heatmaps in the landmark detection channels. For each corner detected, there exists a corresponding grouping vector field from the landmark grouping channels; the 4 corner types give 4 vector fields. A group, forming a quadrilateral, is then constructed from the 4 closest corners pointing to a single centroid. This is done for each detected centroid. If two or more centroids are assigned the same corner, the centroid to which the corner's vector points closest remains, and the other centroid is discarded, preventing double detection of a single vertebra. If there is no detected corner within a fixed range of a centroid, the centroid is also discarded, making the system robust to spurious centroid detections. This is performed in each sagittal slice.
Finally, the vertebral bodies detected are grouped across slices by measuring the IoU between quadrilaterals and assigning them to the same vertebra if they have an overlap of greater than 0.5.

Discussion: Our approach of detecting vertebrae as a series of points and then grouping them differs from other methods, which have used region proposal networks [21], pixel/voxel-wise segmentation [23], deformable-part models [2] or simply detected vertebrae centroids [3,20]. However, this approach is receiving increasing attention in the computer vision literature, with Zhou et al. [22] showing state-of-the-art results by detecting objects as a series of keypoints. The advantage of this method is its high speed (1 second inference on GPU-enabled hardware) combined with more accurate bounding regions than standard bounding boxes.
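As a rough illustration of the grouping step described above, the following is a toy NumPy sketch (not the authors' implementation): each detected corner "votes" for a centroid position via its grouping vector, and each centroid collects the closest-voting corner of each type. The function name, data layout and `max_dist` threshold are illustrative assumptions, and the resolution of corners claimed by two centroids is omitted for brevity.

```python
import numpy as np

def group_corners(centroids, corners, vectors, max_dist=30.0):
    """Toy sketch of the corner-grouping step (not the authors' code).

    centroids : (M, 2) array of detected centroid (x, y) positions.
    corners   : dict mapping corner type ('tl', 'tr', 'bl', 'br') to an
                (N_t, 2) array of detected corner positions.
    vectors   : dict mapping corner type to an (N_t, 2) array of grouping
                vectors read from the grouping channels at each corner.
    Returns a list of (centroid, {corner_type: corner}) groups; centroids
    with no full quadrilateral of corners within `max_dist` are discarded.
    """
    groups = []
    for c in centroids:
        assigned = {}
        for t, pts in corners.items():
            if len(pts) == 0:
                continue
            # Where each corner of this type claims its centroid lies.
            predicted = pts + vectors[t]
            dists = np.linalg.norm(predicted - c, axis=1)
            i = int(np.argmin(dists))
            if dists[i] <= max_dist:
                assigned[t] = pts[i]
        # Keep only centroids with all four corner types assigned.
        if len(assigned) == 4:
            groups.append((c, assigned))
    return groups
```

In the full system, a corner claimed by two centroids is kept only for the centroid its vector points closest to, and the losing centroid is discarded.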
Training:
To train the network, we use separate loss functions for the detection and grouping channels. The ground truth annotations are coordinates of the corners of the vertebral bodies (centroids are computed from these). Target detection channel outputs are constructed by overlaying a Gaussian kernel on each annotated ground truth landmark in the detection channels, with variance proportional to the square root of the area of the detection. These channels are trained using a weighted L1 loss:

$$\mathcal{L}_{\text{detect}}(Y, \hat{Y}) = \sum_{k=1}^{5} \sum_{i,j} \alpha_{ijk} \, | y_{ijk} - \hat{y}_{ijk} |$$

where $Y$ is the response map output by the network and $\hat{Y}$ is the target response map; $y_{ijk}$ and $\hat{y}_{ijk}$ are the values of the response and ground truth maps respectively at image coordinate $(i, j)$ in detection channel $k$, and $\alpha_{ijk}$ is a weighting factor given by

$$\alpha_{ijk} = \begin{cases} \dfrac{N_k}{N_k + P_k} & \text{if } \hat{y}_{ijk} \ge T \\[4pt] \dfrac{P_k}{N_k + P_k} & \text{if } \hat{y}_{ijk} < T \end{cases}$$

where $P_k$ and $N_k$ are the number of pixels respectively above and below threshold $T$ in channel $k$. This weighting factor balances the loss from false positive responses in the heatmap against false negatives, speeding up training. In the experiments in this paper, T = 0.01 was used.

Fig. 1. The pipeline used to detect vertebrae in whole spine scans. The output detection channels are thresholded and each resulting connected volume becomes a corner or a centroid. Each corner has a corresponding vector in the grouping channels at the point of detection and 'points' with a magnitude and direction according to that vector. Each centroid is assigned the 4 corners that point closest to it.
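A minimal NumPy sketch of this class-balanced L1 loss may clarify the weighting (array shapes and the function name are assumptions; this is not the training code, which would operate on framework tensors):

```python
import numpy as np

def weighted_l1_detection_loss(y, y_hat, T=0.01):
    """Sketch of the class-balanced L1 detection loss.

    y, y_hat : (K, H, W) arrays -- predicted and target heatmaps for the
               K = 5 landmark detection channels.
    Pixels above/below threshold T in the target are reweighted so that the
    sparse near-landmark pixels are not swamped by the background.
    """
    loss = 0.0
    for k in range(y.shape[0]):
        pos = y_hat[k] >= T          # near-landmark pixels
        neg = ~pos                   # background pixels
        P, N = pos.sum(), neg.sum()
        # Positives are weighted by the background fraction and vice versa.
        alpha = np.where(pos, N / (N + P), P / (N + P))
        loss += (alpha * np.abs(y[k] - y_hat[k])).sum()
    return loss
```

Because landmark pixels are rare, the positive weight N/(N+P) is close to 1 while the background weight P/(N+P) is close to 0, so errors at landmarks dominate the loss.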
Landmark grouping channels are trained using an L2 loss:

$$\mathcal{L}_{\text{group}} = \sum_{l=1}^{4} \sum_{b} \sum_{(i,j) \in \mathcal{N}_{bl}} \| \mathbf{v}_{lij} - \mathbf{r}_{bij} \|_2$$

Here $l$ indexes each corner type/vector field (e.g. top left, bottom right), $b$ indexes each labelled ground truth vertebral body, and $\mathcal{N}_{bl}$ is a neighbourhood of pixels surrounding the $l$-th corner of vertebral body $b$. $\mathbf{v}_{lij}$ is the value of the vector field corresponding to corner $l$ at the pixel location $(i, j)$, and $\mathbf{r}_{bij}$ is the displacement between the centroid of vertebral body $b$ and $(i, j)$. As suggested by [15], heavy augmentation is used during training. Scans are padded, rotated, zoomed and flipped in the coronal plane. Non-square scans are split into squares overlapping by 40% and resized to 224 × 224 to ensure constant-sized input to the network.
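The grouping loss can be sketched as follows for one corner type (an illustrative NumPy version, not the training code; the sign convention of the target displacement, pointing from the corner pixel towards the centroid, is an assumption consistent with corners "pointing" at centroids at inference):

```python
import numpy as np

def grouping_loss(v, centroids, neighbourhoods):
    """Sketch of the L2 grouping loss for one corner type l.

    v             : (H, W, 2) predicted vector field for this corner type.
    centroids     : list of (x, y) ground-truth vertebra centroids.
    neighbourhoods: list of pixel lists; neighbourhoods[b] gives the (i, j)
                    pixels around corner l of vertebra b.
    Each pixel in a corner neighbourhood should predict the displacement
    from itself to its vertebra's centroid.
    """
    loss = 0.0
    for b, (cx, cy) in enumerate(centroids):
        for (i, j) in neighbourhoods[b]:
            # Target displacement from pixel (x=j, y=i) to the centroid.
            r = np.array([cx - j, cy - i])
            loss += np.linalg.norm(v[i, j] - r)
    return loss
```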
Once the vertebrae are detected, the next task is to label each vertebra with its level (e.g. S1, L5, L4 etc.). There are two types of information to consider when labelling a vertebra: the appearance of the vertebra (its intensity pattern, shape and size) and its context (its position in relation to other vertebrae that have been detected in the scan). As such, we train two networks: an appearance network to infer the level of a vertebra from its appearance alone, and a context network which takes as input the predictions of the appearance network along with the spatial configuration of the detections in the scan to improve the predictions. The final stage is to search for a consistent labelling of the vertebrae using a sequential 'language model' that builds in ordering constraints. The labelling pipeline is outlined in Figure 2. It should be noted that both networks are fully convolutional. This differs from the approaches outlined in [20] and [12], which use recurrent neural networks and graphical models respectively.

Fig. 2. Overview of the labelling pipeline. Volumes surrounding each of the detected vertebrae are extracted and input to an appearance network. This is then used to construct an input image to the context network, which gives the probability of a given label at a given height in the image. The resulting probability map is shown at the far right of the figure, followed by a beam search to generate the final level sequence. The architectures of both networks are shown in the appendix.
The labelling pipeline proceeds as follows. A volume around each detected vertebra is extracted and given as input to the appearance model, which predicts a softmax probability vector over the labels. Input volumes are created by first fitting a bounding box around the detection, then expanding the box by 100% to include nearby anatomical features, and resampling it to 224 × 224 × 16. The appearance predictions for all detections are then used to construct a probability-height map $P \in \mathbb{R}^{H \times N}$, where $H$ is the height of the image and $N = 24$ is the number of vertebral level classes (C2 to S1). If vertebra detection $v$ spans heights $h_v^{\min}$ to $h_v^{\max}$ in the scan, then $P_{hn} = p^a_{vn}$ for all $h_v^{\min} \le h \le h_v^{\max}$ and 0 otherwise, where $p^a_{vn}$ is the probability that $v$ has level $n$, given by the appearance network. Next, the context network refines this probability-height map, using the detections around each vertebra to update its predicted class. Finally, the output map is decoded into a logical sequence of vertebrae using a language-model-inspired beam search which imposes global constraints such as no repetitions. This is described in detail in Section 3.1.

Training: The appearance network is trained using a cross-entropy loss function. For re-calibration [7], a softmax operation with temperature T = 10 is applied to the logits layer. P is given to the context network, which produces P′, a refined version of the same map, as shown on the far right of Figure 2. The cross-entropy between the output probability map at the centroid height of each vertebra and its ground truth label is used as the loss function. When training the context network, augmentation is applied by randomly removing between 0 and 4 of the highest and lowest detections from the input image, and further removing each remaining vertebra's probability map from the input image with a fixed probability.
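The construction of the probability-height map can be sketched as below (the helper name and the detection tuple layout are illustrative assumptions; shapes follow the definition of P in the text):

```python
import numpy as np

def build_probability_height_map(H, detections, n_levels=24):
    """Sketch of building the probability-height map P.

    H          : scan height in pixels.
    detections : list of (h_min, h_max, p) tuples, one per detected
                 vertebra, where p is the appearance network's softmax
                 vector over the n_levels classes (C2 .. S1) and
                 h_min/h_max give the vertical extent of the detection.
    Rows spanned by a detection are filled with its appearance
    probabilities; all other rows are zero.
    """
    P = np.zeros((H, n_levels))
    for h_min, h_max, p in detections:
        P[h_min:h_max + 1, :] = p
    return P
```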
To predict the labels for each detected vertebra, the output probability-height map shown in Figure 2 must be decoded into a sequence of level predictions. A naive method would be to take the argmax of the probability map at each vertebra centroid height and use this as a label. However, this allows implausible sequences, such as repeated vertebral levels, to be predicted. If the detection $v$ at height $h_v$ has label index $l$, then a detection $v'$ at $h_{v'} > h_v$ should have label $l' > l$, with the greatest probability of being $l + 1$ (e.g. L4 should be followed by L3). The challenge of imposing such constraints is analogous to that faced in automatic speech recognition, where CTC training is followed by language modelling to predict a valid character or word sequence [16,6]. We take inspiration from this and use a beam search to obtain a valid labelling: beginning at the highest detection and searching down, sequences are generated and scored by selecting the highest probability labels for each vertebra. The k most likely sequences of levels are stored in memory at each step, where k is the beam width. Sequences with repetitions of levels are given probability 0, and those with skipped levels are penalised by multiplying the sequence probability by a penalty score. The method can also incorporate numerical variations in the number of vertebrae.

We evaluate the system at the tasks of vertebrae detection and labelling in whole spine and lumbar scans. Following [11,20], we define a correct detection to be when the ground truth vertebra centroid is contained entirely within a single bounding quadrilateral. For detection, we report precision, recall, and the localisation error (LE), defined to be the mean distance of ground truth centroids from the closest detected quadrilateral centroid. For labelling, we report the identification rate (IDR), the fraction of vertebrae detected and labelled correctly.
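The constrained beam search described above can be sketched as follows (a toy version under simplifying assumptions: one probability vector per detection rather than the full map, a multiplicative `skip_penalty` per skipped level, and no handling of numerical variations; function and parameter names are hypothetical):

```python
import numpy as np

def beam_search_levels(probs, k=3, skip_penalty=0.1):
    """Toy sketch of the constrained beam search (not the authors' code).

    probs : (V, L) array; probs[v, l] is the probability that the v-th
            detection (ordered top to bottom) has level index l (levels
            also ordered top to bottom, e.g. C2 = 0 .. S1 = 23).
    Enforces strictly increasing level indices (no repeats, no going back
    up) and multiplies in `skip_penalty` once per skipped level. Assumes
    there are at least as many levels as detections.
    Returns the highest-scoring label-index sequence.
    """
    beams = [([], 1.0)]                     # (sequence, score)
    for v in range(probs.shape[0]):
        candidates = []
        for seq, score in beams:
            start = seq[-1] + 1 if seq else 0   # next level must be below
            for l in range(start, probs.shape[1]):
                skipped = l - start
                candidates.append(
                    (seq + [l],
                     score * probs[v, l] * skip_penalty ** skipped))
        # Keep the k highest-scoring partial sequences.
        candidates.sort(key=lambda c: -c[1])
        beams = candidates[:k]
    return beams[0][0]
```

Note how a greedy per-detection argmax could output the same level twice, whereas the search is forced to pick the best globally consistent sequence.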
Three datasets are used in this work: OWS, Genodisc and Zukić. OWS is a dataset of 710 sagittal whole spine scans across 196 patients from the Picture Archiving and Communication System (PACS) of an orthopaedic centre. The dataset exhibits a wide range of pathologies such as hemivertebrae, fused vertebrae, numerical variations of vertebrae and scoliosis. Scans are taken from different scanners with a range of MR parameters (T1, T2, FLAIR, TIRM and STIR). The dataset is split into training, validation and testing sets with a 60/20/20% split at the patient level. Corners of vertebrae from S1 to C2 are annotated, and vertebral levels were marked by a radiologist in one scan for each patient, with S1 being the first vertebra attached to the pelvis. In the case of 25 vertebrae between S1 and C2 instead of the normal 24, an extra lumbar vertebra is labelled (L6). Networks were trained on the OWS training set, using an Adam optimizer (β₂ = 0.999) and a learning rate of 0.001. The Genodisc and Zukić datasets are used only for testing. Genodisc's test set has 421 clinical lumbar MRIs used by Lootus et al. [12]. Zukić [23] is a small dataset of 17 mostly lumbar sagittal MRIs available on the online SpineWeb platform.

Table 1. Performance of the pipeline on the three datasets. Our approach is compared with other methods using the same datasets and also with an LSTM labelling baseline. Results are reported at a per-vertebra level. Higher is better for detection precision (Prec.), detection recall (Rec.) and level identification rate (IDR); lower is better for localisation error (LE). We also report the percentage of vertebrae within one level of their ground truth value (IDR±1). Note, Windsor† [19] requires manual initialization by providing the location of the S1 vertebra, so is not directly comparable. [Table body not recoverable from the extracted source.]
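The two evaluation metrics reported in Table 1, localisation error (LE) and identification rate (IDR), amount to the following (a minimal sketch with hypothetical function names; the real LE matches ground truth centroids to detected quadrilateral centroids):

```python
import numpy as np

def localisation_error(gt_centroids, pred_centroids):
    """LE sketch: mean distance of each ground truth centroid to the
    closest predicted centroid."""
    gt = np.asarray(gt_centroids, dtype=float)
    pred = np.asarray(pred_centroids, dtype=float)
    # Pairwise distances between every ground truth and predicted centroid.
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def identification_rate(gt_labels, pred_labels):
    """IDR sketch: fraction of vertebrae detected and labelled correctly."""
    matches = sum(p == g for g, p in zip(gt_labels, pred_labels))
    return matches / len(gt_labels)
```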
The results of detection and labelling on all datasets are shown in Table 1, with comparisons to other methods reported on the same datasets where available. We also compare our convolutional labelling pipeline to a baseline recurrent approach for vertebra labelling, training a bidirectional LSTM on the appearance features extracted from each detected volume. Example predicted detections and labelling sequences across a range of pathologies are given in Figure 3. The LSTM baseline used is detailed in the appendix.
Discussion:
For whole spine detection on the OWS dataset, the proposed method achieves a high precision and recall of 99.0% and 98.1% respectively. It achieves a level identification rate of 96.5%, significantly exceeding the LSTM baseline. The few labelling errors the system makes are generally due to S2 being detected as S1, meaning all labels are out by one. In practice, this is a mistake radiologists often make, as it can be difficult to tell which is the first sacral vertebra without looking at axial scans to see which bone is joined to the pelvis. This also explains why [19] achieves slightly higher precision and recall at detection than ours: it is a semi-automated algorithm given the location of S1 at initialization, and thus bypasses this difficult problem of S1 recognition.
Fig. 3. Example detection and labelling of vertebrae for a range of whole spine scans. a) and b) are examples of typical spines. The other images show examples of the system dealing with pathologies: c) shows a spine with fused C3 and C4 vertebrae; d) shows a spine with a collapsed vertebra; e) shows a spine with an extra vertebra between S1 and C2; and f) shows an extremely kyphotic spine with a hemivertebra at the bottom of the thoracic spine.
On the Genodisc lumbar spine dataset, our method again significantly outperforms the baseline, and it also outperforms the prior method of Lootus et al. [11] (98.4% compared to 86.9%). Importantly, OWS and Genodisc are scans of patients with a wide range of pathologies imaged using typical clinical MR protocols; hence strong performance here gives evidence of the clinical usefulness of this approach. We show example results for pathological spines in the appendix. In the appendix we also report results by other groups for vertebrae detection and labelling in MRIs. However, these are for different datasets to which we could not get access, with different scanning protocols, fields of view (FoV) and patient sets, and thus cannot be compared to directly.
Finally, as an illustration of a potential application of the proposed approach, we explore the ability of the system to classify cases of scoliosis from sagittal MR scans. In clinical practice this is usually determined by measuring the Cobb angle in coronal views of X-ray scans [1]; however, measuring scoliosis in the supine position has also been shown to be possible [17,8], and MRI can be useful for understanding disease etiology and symptoms [14]. This is a more difficult task in sagittal scans as it requires sensitive detection of the sides of vertebrae, with a clear decision boundary as to when a vertebra is or is not present in a slice, which can be difficult in cases of partial visibility. In the entirety of the Genodisc dataset, scoliosis was reported by a radiologist in 198 of 3542 scans, across 2009 patients. By measuring statistics of a quintic polynomial fit through the vertebra centroids and using them as predictive features, we develop classifiers for this label. Specifically, we measure the maximum curvature of the polynomial, and the maximum deviation of the curve from a straight vertical line fit through the vertebrae, assuming that a low-curvature vertebral column with little deviation from the centreline corresponds to a non-scoliotic spine. While ground truth scoliosis labels were not available for the whole spine scans, we give qualitative results, comparing the features of curves fit through vertebrae of a scoliotic and a non-scoliotic scan from Zukić. Results of these experiments are shown in Figure 4.

Fig. 4. Results of scoliosis classification in sagittal scans using the proposed vertebrae detection system: (a) ROC curve for simple scoliosis classifiers based on statistics of polynomial curves fitted through vertebrae centroids in Genodisc scans; (b) a qualitative comparison of curves fit through detected vertebrae in full spine scans (Zukić scans C002 and F04: maximum curvatures 0.026 and 0.014, maximum distances from the centreline 17.6 mm and 6.70 mm respectively). Curves are overlaid on coronal slices synthesised from the sagittal slices.
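The two curve features described above can be sketched as follows (an illustrative NumPy implementation, not the authors' code; the coordinate convention of lateral position x as a function of height y, and the use of the mean lateral position as the vertical reference line, are assumptions):

```python
import numpy as np

def spinal_curve_features(centroids, degree=5):
    """Sketch of the scoliosis features: fit a quintic polynomial through
    the vertebra centroids and measure (a) the maximum curvature of the
    fit and (b) the maximum deviation from a vertical line through the
    mean lateral position.

    centroids : (V, 2) array of (x, y) vertebra centroids, with x the
                lateral position and y the height in the scan.
    """
    x, y = centroids[:, 0], centroids[:, 1]
    coeffs = np.polyfit(y, x, degree)            # x as a polynomial in y
    ys = np.linspace(y.min(), y.max(), 200)
    # Curvature of x(y): |x''| / (1 + x'^2)^(3/2)
    d1 = np.polyval(np.polyder(coeffs, 1), ys)
    d2 = np.polyval(np.polyder(coeffs, 2), ys)
    curvature = np.abs(d2) / (1 + d1 ** 2) ** 1.5
    deviation = np.abs(np.polyval(coeffs, ys) - x.mean())
    return curvature.max(), deviation.max()
```

A perfectly straight spinal column gives both features close to zero, matching the assumption that low curvature and low centreline deviation indicate a non-scoliotic spine.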
Discussion:
The results for automated scoliosis detection from sagittal scans are promising. Using simple classifiers, AUCs of 0.636-0.689 are achieved in a highly class-imbalanced problem. Of the features measured, the distance of the vertebrae from the vertical centreline performed best. The system is also shown to capture scoliotic curves in full spine scans. These experiments illustrate that the proposed method produces a strong geometric representation of the spine, which can be used in further downstream tasks.
We introduce a novel method for vertebrae detection and labelling in whole spine sagittal MRIs. It shows state-of-the-art results for vertebra identification in lumbar scans, with little performance drop in the whole spine case, and is robust to a range of spinal defects, numerical variations of vertebrae and different scanning protocols. We also demonstrate a potential diagnostic application: automated detection of scoliosis from sagittal MRIs. Future work will include integrating automated detection of different spinal pathologies and implementing the system for CT scans.
Acknowledgements.
The authors would like to thank Dr. Sarim Ather for useful discussions on spinal anatomy and clinical approaches to diagnosing disease, as well as assistance labelling the data. Rhydian Windsor is supported by Cancer Research UK as part of the EPSRC CDT in Autonomous Intelligent Machines and Systems (EP/L015897/1). Amir Jamaludin is supported by EPSRC Programme Grant Seebibyte (EP/M013774/1). The Genodisc data was obtained during the EC FP7 project GENODISC (HEALTH-F2-2008-201626).
References
1. Aebi, M.: The adult scoliosis. European Spine Journal (10), 925–948 (2005)
2. Cai, Y., Osman, S., Sharma, M., Landis, M., Li, S.: Multi-Modality Vertebra Recognition in Arbitrary Views Using 3D Deformable Hierarchical Model. IEEE Transactions on Medical Imaging (8), 1676–1693 (2015)
3. Forsberg, D., Sjöblom, E., Sunshine, J.L.: Detection and Labeling of Vertebrae in MR Images Using Deep Learning with Clinical Annotations as Training Data. Journal of Digital Imaging (4), 406–412 (2017)
4. Glocker, B., Feulner, J., Criminisi, A., Haynor, D.R., Konukoglu, E.: Automatic Localization and Identification of Vertebrae in Arbitrary Field-of-View CT Scans. In: Medical Image Computing and Computer-Assisted Intervention (2012)
5. Glocker, B., Zikic, D., Konukoglu, E., Haynor, D.R., Criminisi, A.: Vertebrae Localization in Pathological Spine CT via Dense Classification from Sparse Annotations. In: Medical Image Computing and Computer-Assisted Intervention. pp. 262–270 (2013)
6. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: International Conference on Machine Learning. ICML '06 (2006)
7. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On Calibration of Modern Neural Networks. In: International Conference on Machine Learning (2017)
8. Jamaludin, A., Kadir, T., Clark, E., Zisserman, A.: Predicting spine geometry and scoliosis from DXA scans. In: MICCAI Workshop: Computational Methods and Clinical Applications in Musculoskeletal Imaging (2019)
9. Jamaludin, A., Kadir, T., Zisserman, A.: SpineNet: Automated classification and evidence visualization in spinal MRIs. Medical Image Analysis, 63–73 (2017)
10. Jamaludin, A., Lootus, M., Kadir, T., Zisserman, A., Urban, J., Battié, M.C., Fairbank, J., McCall, I.: Automation of reading of radiological features from magnetic resonance images (MRIs) of the lumbar spine without human intervention is comparable with an expert radiologist. European Spine Journal (2017)
11. Lootus, M., Kadir, T., Zisserman, A.: Vertebrae detection and labelling in lumbar MR images. In: MICCAI Workshop: Computational Methods and Clinical Applications for Spine Imaging (2013)
12. Lootus, M., Kadir, T., Zisserman, A.: Radiological grading of spinal MRI. In: MICCAI Workshop: Computational Methods and Clinical Applications for Spine Imaging (2014)
13. Lu, J.T., Pedemonte, S., Bizzo, B., Doyle, S., Andriole, K.P., Michalski, M.H., Gonzalez, R.G., Pomerantz, S.R.: Deep Spine: Automated lumbar vertebral segmentation, disc-level designation, and spinal stenosis grading using deep learning. In: Machine Learning for Healthcare Conference (2018)
14. Ozturk, C., Karadereler, S., Ornek, I., Enercan, M., Ganiyusufoglu, K., Hamzaoglu, A.: The role of routine magnetic resonance imaging in the preoperative evaluation of adolescent idiopathic scoliosis. International Orthopaedics (4), 543–546 (Apr 2010)
15. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Medical Image Computing and Computer-Assisted Intervention (2015)
16. Scheidl, H., Fiel, S., Sablatnig, R.: Word Beam Search: A Connectionist Temporal Classification Decoding Algorithm. In: 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2018)
17. Taylor, H.J., Harding, I., Hutchinson, J., Nelson, I., Blom, A., Tobias, J.H., Clark, E.M.: Identifying scoliosis in population-based cohorts: development and validation of a novel method based on total-body dual-energy X-ray absorptiometric scans. Calcified Tissue International (6), 539–547 (Jun 2013)
18. Tins, B.J., Balain, B.: Incidence of numerical variants and transitional lumbosacral vertebrae on whole-spine MRI. Insights into Imaging (2), 199–203 (2016)
19. Windsor, R., Jamaludin, A.: The Ladder Algorithm: Finding repetitive structures in medical images by induction. In: IEEE International Symposium on Biomedical Imaging (2020)
20. Yang, D., Xiong, T., Xu, D., Huang, Q., Liu, D., Zhou, S.K., Xu, Z., Park, J., Chen, M., Tran, T.D., Chin, S.P., Metaxas, D., Comaniciu, D.: Automatic Vertebra Labeling in Large-Scale 3D CT Using Deep Image-to-Image Network with Message Passing and Sparsity Regularization. In: Information Processing in Medical Imaging (2017)
21. Zhao, S., Wu, X., Chen, B., Li, S.: Automatic Vertebrae Recognition from Arbitrary Spine MRI Images by a Hierarchical Self-calibration Detection Framework. In: Medical Image Computing and Computer Assisted Intervention (2019)
22. Zhou, X., Wang, D., Krähenbühl, P.: Objects as Points. arXiv preprint arXiv:1904.07850 (2019)
23. Zukić, D., Vlasák, A., Egger, J., Hořínek, D., Nimsky, C., Kolb, A.: Robust Detection and Segmentation for Diagnosis of Vertebral Diseases Using Routine MR Images. Computer Graphics Forum (6), 190–204 (2014)

Appendix
1. Overview of Proposed Pipeline
Fig. 1.
An overview of the proposed detection and labelling pipeline. Vertebrae landmarks are detected, grouped into quadrilaterals in each sagittal slice and finally grouped across slices. The output volumes are input into the labelling pipeline.
2. LSTM Baseline Architecture

Fig. 2. The LSTM architecture used as a labelling baseline. Output probability maps from the appearance network are used as input to a bidirectional LSTM, with the inputs ordered by the height of the vertebrae in the image. The architecture outputs a probability vector for each detection, giving the likelihood of the vertebra being each level from S1 to C2. In the experiments in Section 4, K = 2 is used.

3. Other Reported Results

Table 1.
Reported results of other methods proposed for the segmentation of vertebrae in CT and MR spinal scans, evaluated on different datasets. We attempted to access a publicly available dataset from [2] although did not get a reply prior to submission.

Method | FoV | Comment | Prec. (%) | Rec. (%) | LE (mm) | IDR (%)
Cai et al. [2] | Arbitrary | MR & CT, deformable shape model | - | - | 2.54-3.12 | 92-97
Forsberg et al. [4] | Lumbar, Cervical | MR, requires S1 or C2 visibility | 99.6-100 | 99.1-99.8 | 1.18-2.60 | 96-97
Glocker et al. [4] | Mostly Abdominal | CT, random forest classifier | - | - | 5.94-8.53 | 72-85
Glocker et al. [5] | Mostly Abdominal | CT, random forest classifier | - | - | 7-14.3 | 62-86
Yang et al. [18] | Arbitrary | MR, LSTM labelling | - | - | 6.9-9.0 | 83-89
4. Detection, Appearance and Context Network Architecture Diagrams

Fig. 3. The network architectures used in the detection and labelling pipeline: (a) the detection network, a U-Net with five down-convolution blocks (DOWNCONV1-5) and four up-convolution blocks (UPCONV1-4); (b) the appearance network used to label vertebrae volumes, ending in two fully connected layers and a sigmoid; (c) the context network, with the same five-down/four-up U-Net structure. Yellow blocks indicate convolutional layers, green blocks indicate batch normalisation, purple layers are fully connected and orange layers are sigmoid activation units. All layers except the last use ReLU activation units (not shown).

5. Detection and Labelling Accuracy Across Different Vertebral Levels

[Plot: detection and labelling accuracy (%) for each vertebral level from S1 to C2.]