Self-Supervised Discovery of Anatomical Shape Landmarks
Riddhish Bhalodia, Ladislav Kavan, and Ross T. Whitaker
Scientific Computing and Imaging Institute, University of Utah
School of Computing, University of Utah
Abstract.
Statistical shape analysis is a very useful tool in a wide range of medical and biological applications. However, it typically relies on the ability to produce a relatively small number of features that can capture the relevant variability in a population. State-of-the-art methods for obtaining such anatomical features rely on either extensive preprocessing or segmentation and/or significant tuning and post-processing. These shortcomings limit the widespread use of shape statistics. We propose that effective shape representations should provide sufficient information to align/register images. Using this assumption we propose a self-supervised, neural network approach for automatically positioning and detecting landmarks in images that can be used for subsequent analysis. The network discovers the landmarks corresponding to anatomical shape features that promote good image registration in the context of a particular class of transformations. In addition, we also propose a regularization for the proposed network which allows for a uniform distribution of these discovered landmarks. In this paper, we present a complete framework, which only takes a set of input images and produces landmarks that are immediately usable for statistical shape analysis. We evaluate the performance on a phantom dataset as well as 2D and 3D images.
Keywords:
Self-Supervised Learning, Shape Analysis, Landmark Localization
Statistical shape modeling (SSM)/morphological analysis [28] is an important resource for medical and biological applications. SSM broadly involves two distinct parts: (i) shape representation, which involves describing the anatomy/shape of interest by giving an implicit or explicit representation of the shape, and (ii) using the shape representation to perform the subsequent analysis on a shape population. Classical approaches relied on representing the shape via landmark points, often corresponding to distinct anatomical features. There have been many automated approaches for dense correspondence discovery which capture the underlying shape statistics [10, 26]. An alternate approach to shape representation is to leverage coordinate transformations between images or geometries, typically members of a population or to a common atlas [19]. Such a set of transformations implicitly captures the population shape statistics for the objects/anatomies contained in those images.

Automated shape representation via dense correspondences has its drawbacks; most such methods rely on heavily preprocessed data. Such preprocessing steps might include segmentation, smoothing, alignment, and cropping. These tasks typically require manual parameter settings and quality control, thereby making this preprocessing heavy on human resources. In several cases, especially for segmentation, a degree of specific anatomical, clinical, or biological expertise is also required, introducing even higher barriers to engaging in shape analysis. Additionally, automated landmark placement or registration relies on computationally expensive optimization methods, and often requires additional parameter tuning and quality control.
This heavy preprocessing and complex optimization often make statistical shape analysis difficult for nonexperts, especially when the data under study consists primarily of images/volumes.

Systems that produce transformations and/or dense correspondences will typically produce high-dimensional shape descriptors, whereas many users prefer lower-dimensional descriptors to perform subsequent statistical analyses such as clustering, regression, or hypothesis testing. Therefore, there is typically an additional set of processes (e.g., PCA in various forms) that require further expertise (and research) to interpret these complex, high-dimensional outputs and distill them down to usable quantities.
Fig. 1. Proposed method's advantage of being an end-to-end image-to-landmarks black-box.
These challenges point to a need for an end-to-end system that takes in images and automatically extracts shape landmarks for direct statistical shape analysis. In this paper, we propose a system that takes collections of images as input and produces a set of landmarks that leads to accurate registration. Using image registration as the proxy task, we propose a self-supervised learning method using neural networks that discovers anatomical shape landmarks from a given image set. This method circumvents the intensive preprocessing and optimization required by previous correspondence-based methods, while maintaining the simplicity of spatial landmarks of user-specified dimension and complexity. Figure 1 showcases the usefulness of the proposed system, i.e., a black-box framework which trains on roughly aligned 2D or 3D images and learns to identify anatomical shape landmarks on new testing images from the same distribution.
Statistical shape models have been extensively used in medical imaging (e.g., orthopedics [17], neuroscience [15], and cardiology [14]); however, they usually require some form of surface parameterization (e.g., segmentation or meshes) and cannot be applied directly to images. Explicit correspondences between surfaces have been established using geometric parameterizations [12, 25] as well as functional maps [22]. Particle distribution models (PDMs) [16] rely on a dense set of particles on surfaces whose positions are optimized to reduce the statistical complexity of the resulting model [10, 11]. These sets of particles are then projected to a low-dimensional shape representation using PCA to facilitate subsequent analysis [27]. Some recent works leverage convolutional neural networks (CNNs) to perform regression from images to a shape description of these dense correspondences [5, 7]. These methods are supervised and require an existing shape model or manual landmarks for their training.

Deformable registration is also widely used as a tool for modeling shape variation/statistics [3], as well as atlas building [19]. Recent works rely on neural networks to perform unsupervised registration to atlases [2] or random pairs [6], and the deformation fields produced by these methods can be used as a shape representation for individuals. Statistical deformation models [20, 24] learn a probability model for the manifold of deformation fields and reduce its dimension for use in shape analysis. However, in our experience, users across a wide range of applications in medicine and biology prefer the simplicity and interpretability of landmarks.
Many of these same users also balk at the complexity of the dimensionality reduction methods that are needed to make these deformation fields useful for statistical analysis. Another relevant line of work, from the computer vision literature, uses an image deformation loss to learn dense features/feature maps [13, 23]. Other works use convolutional neural networks to learn shape features on surfaces to discover correspondences [8], or to be used for subsequent correspondence optimizations [1]. In this work, we learn landmark points, rather than features/feature maps, in a purely unsupervised setting.
Fig. 2. Architecture for discovering anatomical shape landmarks in a self-supervised manner.
The proposed self-supervised approach posits that a neural network (NN) should produce landmarks that best align the objects of interest in pairs of images, using as input many randomly chosen pairs from a set of images. The architecture contains a landmark regressor (NN) which discovers a vector of landmark positions on an image. The architecture includes a spatial transformer stage that takes as input the NN-generated landmarks and compares the resulting aligned images. For deformation we use a landmark-guided, radial basis function, thin plate spline (TPS) image registration, as in Figure 2. For detecting landmarks, we need the system to be invariant to pairing, i.e., to always produce the same landmarks on an image. Therefore we use Siamese CNN networks [9] to extract the landmarks from the source (I_S) and the target (I_T) images, and we denote these points l_S and l_T respectively. A single instance/sibling of the fully trained Siamese pair is the detector, which can be applied to training data or unseen data to produce landmarks. The output of a network (from the Siamese pair) is a K x d matrix, where K is the number of landmark points and d is the dimensionality of the image, typically either 2D or 3D. These features are treated as correspondences that go into the TPS solve module. The TPS registration entails the following steps: (i) create the kernel matrix A from l_T and the output position vector b using l_S, (ii) solve the linear system of equations (Aw = b) to get the TPS transformation parameters w, and (iii) using w, warp the entire source grid to produce the registered image (I_R).

The loss for the entire network comprises two terms:

    L = L_match(I_T, I_R) + lambda * L_reg(A)    (1)

The first term is the image matching term; any image registration loss which allows for backpropagation can be employed, and in this work we use the L2 loss. However, more complex losses such as normalized cross-correlation [2] can also be used.
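The TPS solve in steps (i)-(iii) can be sketched as follows. This is a minimal NumPy illustration with function names of our own choosing; the actual network would implement this in an autodiff framework so that the loss in Eq. (1) can backpropagate through the solve to the landmark regressor.

```python
import numpy as np

def tps_kernel(r2):
    # TPS radial basis U(r) = r^2 log(r^2), with U(0) = 0
    return np.where(r2 == 0, 0.0, r2 * np.log(r2 + 1e-12))

def tps_solve(l_t, l_s):
    """Fit TPS parameters w mapping target landmarks l_t (K x d)
    onto source landmarks l_s, by solving the linear system Aw = b."""
    K, d = l_t.shape
    r2 = ((l_t[:, None, :] - l_t[None, :, :]) ** 2).sum(-1)
    U = tps_kernel(r2)
    P = np.hstack([np.ones((K, 1)), l_t])      # affine part of the TPS
    A = np.zeros((K + d + 1, K + d + 1))
    A[:K, :K] = U
    A[:K, K:] = P
    A[K:, :K] = P.T
    b = np.zeros((K + d + 1, d))
    b[:K] = l_s
    w = np.linalg.solve(A, b)                  # step (ii): Aw = b
    return A, w

def frobenius_condition(A):
    # kappa_F(A) = ||A||_F * ||A^-1||_F, the regularizer of Eq. (2)
    return np.linalg.norm(A) * np.linalg.norm(np.linalg.inv(A))
```

In step (iii), evaluating the fitted TPS at every grid point of the source image yields the warped (registered) image I_R.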
The second term is a regularization applied to the TPS kernel matrix A, so that it does not become ill-conditioned, which can happen from the random initialization of the network or during optimization. To regularize A we use the condition number of the matrix under the Frobenius norm (||A||_F = sqrt(sum_{i=1}^{M} sum_{j=1}^{N} a_{ij}^2)), defined as follows:

    L_reg(A) = kappa_F(A) = ||A||_F ||A^-1||_F    (2)

A useful side effect of this regularization is as follows. The TPS kernel matrix becomes singular/ill-conditioned when two landmark positions are very close together; therefore this regularization term, while preventing the matrix from becoming singular, also encourages the resulting shape landmarks to be distributed across the image domain. This regularization loss is weighted by a scalar hyperparameter lambda, and our experience is that the system produces useful and effective landmarks for a wide range of lambdas.

The network is trained on a set of 2D/3D images in a pair-wise fashion. For smaller datasets, it is feasible to present the network with all pairs of images, as we have in the examples here. For larger image datasets, the number of training pairs may be too large and/or redundant, and random pairs could be used. For robust training, we train the Siamese network on images with a small amount of added i.i.d. Gaussian noise on the input, while using the original images for the loss computation on the transformed image. Finally, we whiten the image data (to zero mean and unit variance) before feeding it into the networks.

Landmark removal:
As can be seen in Figure 1, when we discover landmarks there may be some which are redundant, i.e., they do not align with any significant anatomical features important for image registration. For this we propose a post-training method to weed out the redundant landmarks. After training the network, we re-register pairs of images by removing one landmark at a time, and we compute the difference between the registration loss computed using all the landmarks and that computed by leaving one of them out. Averaging over all image pairs gives us an indicator of the importance of each landmark, and we can threshold and remove the ones with low importance. We apply this technique to the phantom data for testing. Figure 3 shows images/landmarks with and without the redundancy removal, demonstrating that landmarks in non-informative regions are removed. There are many alternative methods for culling landmarks, such as L1-regularized weights that the network might learn. We consider the study of alternative methods for landmark culling as beyond the scope of this paper.
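The leave-one-out scoring above can be sketched as follows, assuming a callable that performs the TPS registration for a given landmark subset and returns the scalar loss; all names here are ours, not the paper's.

```python
import numpy as np

def landmark_importance(pairs, registration_loss):
    """Score each landmark by the average increase in registration loss
    when it is left out, over all image pairs.

    pairs: list of (l_s, l_t, I_s, I_t) tuples.
    registration_loss: callable(l_s, l_t, I_s, I_t) -> scalar loss,
    performing the TPS registration with the given landmarks.
    """
    K = pairs[0][0].shape[0]
    scores = np.zeros(K)
    for l_s, l_t, I_s, I_t in pairs:
        full = registration_loss(l_s, l_t, I_s, I_t)
        for k in range(K):
            keep = np.arange(K) != k
            # loss increase when landmark k is left out
            scores[k] += registration_loss(l_s[keep], l_t[keep], I_s, I_t) - full
    return scores / len(pairs)

def cull(landmarks, scores, threshold):
    # keep only landmarks whose importance exceeds the threshold
    return landmarks[scores > threshold]
```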
In this section we describe the performance of the proposed method on three datasets: (i) a Shepp-Logan (S-L) phantom (synthetic) dataset, (ii) a diatom dataset with 2D images of diatoms from four morphological classes, and (iii) a dataset of 3D cranial CT scans of infants for detection of craniosynostosis (a morphological disorder). We split each dataset into training, validation, and testing images with an 80%/10%/10% division, and the image pairs are taken from their respective sets. The detailed architecture for each dataset is presented in the supplementary material.
Fig. 3. Results on the S-L phantom dataset: (a) and (b) an example of a source-target image pair with learned landmarks overlaid, (c) the registered image (source image warped to target), and (d) and (e) the same source and target pair with landmarks after the redundant points are removed.
We create a synthetic dataset using the S-L phantom by generating randomly perturbed TPS warps using 6 control points, 2 on the endpoints of each ellipse: the boundary ellipse and the 2 black ellipses inside the phantom. We train our network on this dataset with 30 landmarks, with 4 constant landmarks placed on the corners of each image. We train the proposed network on this dataset for 20 epochs with a regularization coefficient of 0.0001, which brings the average registration loss to 0.01%. Figure 3 showcases the results on this dataset: 3(a) and 3(b) are the source and target images overlaid with predicted landmarks, respectively, and 3(c) shows the registered output. We also remove the redundant points as described in Section 3.2; the retained points are shown in Figure 3(d, e). Encouragingly, the redundancy metric shows high importance for points that are near the locations of the control points from which the data was generated.
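Randomly perturbed TPS warps of this kind can be generated with SciPy's thin-plate-spline interpolant. This is our own reconstruction of the data-generation scheme (the paper does not give implementation details beyond the 6 control points), and the helper name is hypothetical.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates

def random_tps_warp(image, control_pts, scale=3.0, seed=0):
    """Warp `image` with a thin-plate-spline fit to randomly perturbed
    control points; control_pts is (P, 2) in (row, col) coordinates."""
    rng = np.random.RandomState(seed)
    perturbed = control_pts + scale * rng.randn(*control_pts.shape)
    # backward map: output control positions -> original control positions
    tps = RBFInterpolator(perturbed, control_pts, kernel='thin_plate_spline')
    h, w = image.shape
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing='ij'),
                    axis=-1).reshape(-1, 2).astype(float)
    src = tps(grid)  # where each output pixel samples from in the input
    return map_coordinates(image, src.T, order=1).reshape(h, w)
```

With `scale=0` the interpolant reduces to the identity map, which is a convenient sanity check.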
Diatoms are single-cell algae which are classified into different categories based on their shape and texture, with applications in biology [18]. The dataset consists of 2D diatom images from four different classes: Eunotia (68 samples), Fragilariforma (100 samples), Gomphonema (100 samples), and Stauroneis (72 samples). We train the proposed network on a joint dataset of images from all four classes (all together, unlabeled/unsupervised), allowing the network to establish correspondences between different classes via the discovered landmark positions. The goal is to see how well the discovered landmark positions describe the shape variation and distinguish the classes in the unsupervised setting. We whiten (zero mean, unit variance) the images before passing them through the network, set the regularization parameter to 1e-5, and train the network to discover 26 landmarks. We train on this dataset for 20 epochs, which achieves a testing registration accuracy of 0.1%. As with the phantom data, we compute the redundancy metric and remove landmarks below a threshold. Figure 4 (middle column) shows the images with only the retained landmarks, and we see that the landmarks which did not lie on shape surfaces (left column) were removed for lack of impact on the loss function (point numbers 4, 6, 8, 14). We want to evaluate how well these landmarks, used as shape descriptors, perform in clustering these diatoms. For this, we use spectral clustering [21] on the learned landmarks. We cluster landmarks coming from the training images and then predict the clusters for the validation as well as testing images (using the landmarks). The results are shown in the right column, which shows a 2D PCA embedding of the landmarks (shape features) labeled with the ground-truth cluster labels and the predicted cluster labels. The mismatch between the Eunotia and Fragilariforma classes is due to the fact that they have very similar shapes/sizes and are typically also distinguished by intensity patterns, which are not part of this research.
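The clustering evaluation can be sketched as follows, flattening each image's discovered landmarks into a shape descriptor. This is a scikit-learn sketch under our own naming; note that spectral clustering has no out-of-sample predict, so assigning held-out images to existing clusters would need an extra step (e.g., nearest-centroid assignment in descriptor space).

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.decomposition import PCA

def cluster_shapes(landmarks, n_clusters=4, seed=0):
    """landmarks: (n_images, K, d) array of discovered landmark positions.
    Returns cluster labels and a 2D PCA embedding for visualization."""
    X = landmarks.reshape(len(landmarks), -1)  # flatten to shape descriptors
    labels = SpectralClustering(n_clusters=n_clusters,
                                random_state=seed).fit_predict(X)
    embedding = PCA(n_components=2).fit_transform(X)
    return labels, embedding
```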
Fig. 4. The left column rows ((a) Gomphonema, (b) Eunotia, (c) Stauroneis, (d) Fragilariforma) showcase 4 different images from the test set corresponding to the four different classes of diatoms, with their predicted landmarks shown in cyan. The middle column shows the same images with redundant landmarks removed. The right column shows spectral clustering (in the 2D PCA embedding space) evaluated on the test set and compared to the ground-truth labels.

Metopic craniosynostosis is a morphological disorder of the cranium in infants, caused by premature fusion of the metopic suture. The severity of metopic craniosynostosis is hypothesized to depend on the deviation of a skull's morphology from the shape statistics of the normal skull. There have been studies on diagnosing this condition [29] as well as on characterizing the severity of the pathology [4] by various shape representation and modeling techniques. For this example we use 120 CT scans of the cranium, with 93 normal phenotypical skulls with no deformity and 27 samples of metopic craniosynostosis (pathological scans for such conditions are generally less abundant). Each CT scan is a volume with an isotropic voxel resolution of 2mm. We whiten (zero mean, unit variance) the data for training the network and train for 10 epochs with a regularization weight of 0.00001 and 80 3D landmarks. The network saw both classes of scans but with no labels/supervision. After the redundancy removal we are left with 49 landmarks. The registration performance of the system is shown in Figure 5 (orange box) at the mid-axial slice of two source-target pairs. CT scans do not present apparent information inside the skulls; however, with a very specific contrast setting (used in the figures) we can see grainy outlines of the brain anatomy inside the skulls. An interesting thing to note is that certain landmarks adhere to these low-contrast features and are consistent across shapes.
In row two of the figure, we show a normal skull as the source image and a metopic skull as the target; trigonocephaly (a triangular frontal head shape), which is a key characteristic of metopic craniosynostosis, is captured by the three (in-plane) landmarks, one on the frontal rim and two near the ears. Capturing this morphological symptom is essential in distinguishing normal skulls from metopic ones. Using these 49 landmarks as the shape descriptors, we evaluate how well they can capture the deviation of the metopic skulls from the normal population.

Severity comparison:
First, we want to study whether the discovered landmarks are descriptive enough to accurately identify the pathology. Using all 120 images (irrespective of the training/testing split), we compute the Mahalanobis distance (Z-score) of each image from the base distribution formed using the normal skulls in the training dataset. Figure 5 (green box) shows the histogram of the Z-scores; it showcases that the Z-score of metopic skulls is larger on average than that of the normal skulls, which supports our hypothesis that the discovered low-dimensional landmark-based shape descriptor is enough to capture abnormal morphology. We also want to compare against a state-of-the-art, correspondence-based shape model. We use ShapeWorks [10] as our PDM, which involved segmenting the skulls, preprocessing (smoothing, anti-aliasing, etc.), and nonlinear optimizations to place 2048 particles on each skull. ShapeWorks has been used in other metopic craniosynostosis severity analyses, such as in [4]. We perform PCA to reduce the dense correspondences to 15 dimensions (95% of variability) and compute Z-scores (with the base data being the same scans as for the proposed method). We normalize the Z-scores from both methods (the PDM and the proposed one) and plot their scatter plot in Figure 5 (blue box). We see that the correlation between the two methods is significant (correlation coefficient of 0.81), and hence the proposed method can predict/quantify the severity of metopic craniosynostosis equally well. This demonstrates that the landmarks discovered by our method are a good substitute for the state-of-the-art PDM model, with very low pre- and postprocessing overhead. Also, it is much faster to obtain shape descriptors for new images as compared with conventional PDM methods.

Fig. 5. (Orange box) Two instances of source and target 3D volumes; the first two images of each also show a contrast-enhanced axial slice with discovered landmarks, and the third shows the overlay of the target segmentation and the registered segmentation. (Blue box) Correlation between Z-scores from a state-of-the-art PDM and Z-scores of the landmark shape descriptors from the proposed method. (Green box) Z-scores from the landmark shape descriptor, which is able to distinguish normal versus metopic skulls.
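The severity analysis can be sketched as follows: Mahalanobis Z-scores against a base distribution of normal shapes, with PCA used to reduce the dense PDM correspondences before scoring. This is a NumPy/scikit-learn sketch under our own naming; the small ridge term guards the covariance inverse when the descriptor dimension approaches the number of base samples.

```python
import numpy as np
from sklearn.decomposition import PCA

def z_scores(X_normal, X_all, eps=1e-6):
    """Mahalanobis distance (Z-score) of each row of X_all from the
    base distribution estimated on the normal shapes X_normal."""
    mu = X_normal.mean(axis=0)
    cov = np.cov(X_normal, rowvar=False) + eps * np.eye(X_normal.shape[1])
    cov_inv = np.linalg.inv(cov)
    diff = X_all - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

def pca_reduce(X, var=0.95):
    # keep enough components to explain `var` of the variance
    return PCA(n_components=var).fit_transform(X)

def normalized_correlation(z_a, z_b):
    """Correlation between the normalized Z-scores of two methods."""
    a = (z_a - z_a.mean()) / z_a.std()
    b = (z_b - z_b.mean()) / z_b.std()
    return float(np.corrcoef(a, b)[0, 1])
```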
In this paper, we present an end-to-end methodology for processing sets of roughly aligned 2D/3D images to discover landmarks on the images, which can then be used for subsequent shape analysis. Under the assumption that good landmarks produce a better image-to-image registration, we propose a self-supervised neural network architecture that optimizes the TPS registration loss between pairs of images, with an intermediate landmark discovery stage that feeds into the TPS registration module. Additionally, we propose a regularizer that ensures TPS solvability and encourages a more uniform landmark distribution. We also apply redundancy removal via the loss difference with and without a particular landmark, which manages to cull uninformative landmarks. For future work, there is the possibility of assigning importance to individual landmarks during training, which could be used for removing redundant points. This paper presents a combination of technologies that results in a turn-key system that will allow shape analysis to be used in a wide range of applications with much less expertise than previous methods. We showcase that our methodology places landmarks qualitatively corresponding to distinct anatomical features, and that these landmarks can work as shape descriptors for downstream tasks such as clustering for diatoms or severity prediction for metopic craniosynostosis.
References
1. Agrawal, P., Whitaker, R.T., Elhabian, S.Y.: Learning deep features for automated placement of correspondence points on ensembles of complex shapes. In: Medical Image Computing and Computer Assisted Intervention, MICCAI 2017. pp. 185–193. Springer International Publishing, Cham (2017)
2. Balakrishnan, G., Zhao, A., Sabuncu, M., Guttag, J., Dalca, A.V.: Voxelmorph: A learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging, 1788–1800 (2019)
3. Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International Journal of Computer Vision (2), 139–157 (2005)
4. Bhalodia, R., Dvoracek, L.A., Ayyash, A.M., Kavan, L., Whitaker, R., Goldstein, J.A.: Quantifying the severity of metopic craniosynostosis: A pilot study application of machine learning in craniofacial surgery. Journal of Craniofacial Surgery (2020)
5. Bhalodia, R., Elhabian, S.Y., Kavan, L., Whitaker, R.T.: DeepSSM: A deep learning framework for statistical shape modeling from raw images. Shape in Medical Imaging: International Workshop, ShapeMI, MICCAI 2018, 525–537 (2018)
12. Davies, R.H., Twining, C.J., Cootes, T.F., Waterton, J.C., Taylor, C.J.: 3D statistical shape models using direct optimisation of description length. In: 7th European Conference on Computer Vision (ECCV). vol. 3, pp. 3–21 (2002)
13. DeTone, D., Malisiewicz, T., Rabinovich, A.: Superpoint: Self-supervised interest point detection and description. In: CVPR Deep Learning for Visual SLAM Workshop (2018), http://arxiv.org/abs/1712.07629
14. Gardner, G., Morris, A., Higuchi, K., MacLeod, R., Cates, J.: A point-correspondence approach to describing the distribution of image features on anatomical surfaces, with application to atrial fibrillation. In: 2013 IEEE 10th International Symposium on Biomedical Imaging. pp. 226–229 (April 2013). https://doi.org/10.1109/ISBI.2013.6556453
15. Gerig, G., Styner, M., Jones, D., Weinberger, D., Lieberman, J.: Shape analysis of brain ventricles using SPHARM. In: Proceedings IEEE Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA 2001). pp. 171–178 (2001). https://doi.org/10.1109/MMBIA.2001.991731
16. Grenander, U., Chow, Y., Keenan, D.M.: Hands: A Pattern Theoretic Study of Biological Shapes. Springer, New York (1991)
17. Harris, M.D., Datar, M., Whitaker, R.T., Jurrus, E.R., Peters, C.L., Anderson, A.E.: Statistical shape modeling of cam femoroacetabular impingement. Journal of Orthopaedic Research (10), 1620–1626 (2013). https://doi.org/10.1002/jor.22389
18. Hicks, Y.A., Marshall, D., Rosin, P., Martin, R.R., Mann, D.G., Droop, S.J.M.: A model for diatom shape and texture for analysis, synthesis and identification. Machine Vision and Applications (2006)
19. Joshi, S., Davis, B., Jomier, M., Gerig, G.: Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage, Supplement issue on Mathematics in Brain Imaging (Supplement 1), S151–S160 (2004)
20. Joshi, S.C., Miller, M.I., Grenander, U.: On the geometry and shape of brain sub-manifolds. International Journal of Pattern Recognition and Artificial Intelligence (08), 1317–1343 (1997)
21. von Luxburg, U.: A tutorial on spectral clustering (2007)
22. Ovsjanikov, M., Ben-Chen, M., Solomon, J., Butscher, A., Guibas, L.: Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG) (4), 30 (2012)
23. Rocco, I., Arandjelović, R., Sivic, J.: Convolutional neural network architecture for geometric matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
24. Rueckert, D., Frangi, A.F., Schnabel, J.A.: Automatic construction of 3-D statistical deformation models of the brain using nonrigid registration. IEEE Transactions on Medical Imaging (8), 1014–1025 (2003)
25. Styner, M., Brechbuhler, C., Szekely, G., Gerig, G.: Parametric estimate of intensity inhomogeneities applied to MRI. IEEE Transactions on Medical Imaging (3), 153–165 (Mar 2000)
26. Styner, M., Oguz, I., Xu, S., Brechbuehler, C., Pantazis, D., Levitt, J., Shenton, M., Gerig, G.: Framework for the statistical shape analysis of brain structures using SPHARM-PDM (07 2006)
27. Bieging, E.T., Morris, A., Wilson, B.D., McGann, C.J., Marrouche, N.F., Cates, J.: Left atrial shape predicts recurrence after atrial fibrillation catheter ablation. Journal of Cardiovascular Electrophysiology. https://doi.org/10.1111/jce.13641
28. Thompson, D.W., et al.: On Growth and Form (1942)
29. Wood, B.C., Mendoza, C.S., Oh, A.K., Myers, E., Safdar, N., Linguraru, M.G., Rogers, G.F.: What's in a name? Accurately diagnosing metopic craniosynostosis using a computational approach. Plastic and Reconstructive Surgery (1), 205–213 (2016)

Appendix
The detailed architecture for each landmark regressor is given in Figure 6. In each of the convolution layers we use a fixed kernel size for 2D and for 3D. The non-linear activations used are ReLUs, and a hyperbolic tangent is used for the output layer.
Fig. 6. Landmark regressor architectures (a)-(c), composed of 2D/3D convolution layers, max pooling, and fully connected layers.