An End-to-end Deep Learning Approach for Landmark Detection and Matching in Medical Images
Monika Grewal,a Timo M. Deist,a Jan Wiersma,b Peter A. N. Bosman,a,c and Tanja Alderliestenb,∗

a Life Sciences & Health Research Group, Centrum Wiskunde & Informatica, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
b Department of Radiation Oncology, Amsterdam UMC, University of Amsterdam, P.O. Box 22660, 1100 DD Amsterdam, The Netherlands
c Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, P.O. Box 5, 2600 AA Delft, The Netherlands

∗ Tanja Alderliesten is currently working at the Department of Radiation Oncology, Leiden University Medical Center, P.O. Box 9600, 2300 RC Leiden, The Netherlands.
Further author information: (Send correspondence to M.G.) M.G.: Email: [email protected]
ABSTRACT
Anatomical landmark correspondences in medical images can provide additional guidance information for the alignment of two images, which, in turn, is crucial for many medical applications. However, manual landmark annotation is labor-intensive. Therefore, we propose an end-to-end deep learning approach to automatically detect landmark correspondences in pairs of two-dimensional (2D) images. Our approach consists of a Siamese neural network, which is trained to identify salient locations in images as landmarks and predict matching probabilities for landmark pairs from two different images. We trained our approach on 2D transverse slices from 168 lower abdominal Computed Tomography (CT) scans. We tested the approach on 22,206 pairs of 2D slices with varying levels of intensity, affine, and elastic transformations. The proposed approach finds an average of 639, 466, and 370 landmark matches per image pair for intensity, affine, and elastic transformations, respectively, with spatial matching errors of at most 1 mm. Further, more than 99% of the landmark pairs are within a spatial matching error of 2 mm, 4 mm, and 8 mm for image pairs with intensity, affine, and elastic transformations, respectively. To investigate the utility of our developed approach in a clinical setting, we also tested our approach on pairs of transverse slices selected from follow-up CT scans of three patients. Visual inspection of the results revealed landmark matches in both bony anatomical regions as well as in soft tissues lacking prominent intensity gradients.
Keywords: end-to-end, landmark detection, CT, deep learning, deformable image registration
1. INTRODUCTION
Deformable Image Registration (DIR) can be extremely valuable in work-flows related to image-guided diagnostics and treatment planning. However, DIR in medical imaging can be challenging due to large anatomical variations between images. This is particularly the case in the lower abdomen, where internal structures can undergo large deformations between two scans of a patient due to physical conditions such as the presence of gas pockets and bladder filling. Such scenarios are particularly challenging for intensity-based registration, as there are many local optima to overcome. Landmark correspondences between images can provide additional guidance information to DIR methods [1, 2] and increase the probability of finding the right transformation by adding landmark matches as an additional constraint or objective in the optimization. Since the manual annotation of anatomical landmarks is labor-intensive and requires expertise, developing methods for finding landmark correspondences automatically has great potential benefits.

The existing methods for obtaining landmark correspondences in medical images are based on large and time-consuming pipelines that involve identifying landmark locations followed by matching local feature descriptors within a restricted neighborhood. These methods rely upon multiple pre- and post-processing steps, multi-resolution search, and manual checking to achieve robustness, with each step adding more heuristics and empirical hyperparameters to an already complex pipeline. Further, existing methods for landmark detection that restrict the definition of landmarks to certain intensity gradient patterns specific to the underlying data set or anatomical region may not be easily adaptable to other contexts. Generalizing the definition of landmarks and reducing the number of heuristics would allow for faster adaptation of automated methods to different clinical settings. In addition, faster execution times for landmark detection and matching could benefit their clinical application.

Recently, deep Convolutional Neural Networks (CNNs) have shown promising results for classification and segmentation tasks in medical imaging due to their capability of learning discriminant feature descriptors from raw images.
There exist a few deep learning approaches for finding landmarks in medical images [13, 14].
However, in these approaches a neural network is trained in a supervised manner to learn a small number of manually annotated landmarks. It should be noted that a high density of landmark correspondences is desirable to effectively provide additional guidance to DIR methods. In a supervised setting, this means annotating thousands of landmarks per CT scan, which is intractable in terms of the required manual effort. On the other hand, many deep learning approaches have been developed for automatically finding object landmarks in natural images that do not require manual annotations. Some of these approaches focus on discovering a limited number of landmarks in an image dataset [15, 16], whereas others either fine-tune a pre-trained network or make use of incremental training in a self-supervised fashion [17, 18].

Our proposed approach is based on the above-mentioned approaches developed for natural images and is tailored to meet the specific requirements of medical images. We propose a two-headed Siamese neural network that, based on a pair of images, simultaneously predicts the landmarks and their feature descriptors corresponding to each image. These are then sent to another module to predict their matching probabilities. We train the neural network from scratch and gradients are back-propagated from end-to-end. To the best of our knowledge, this is the first endeavour to develop an end-to-end deep learning approach for finding landmark correspondences in medical images. Our approach has the following distinct advantages compared to existing methods for finding landmark correspondences:

· Our approach is end-to-end deep learning based; therefore, the need for data pre- and post-processing during inference is avoided. In addition, the proposed approach is faster at run-time and has fewer hyperparameters than traditional approaches.
· We do not impose any prior on the definition of a landmark in an image. Instead, we train the network in a way that the landmarks represent salient regions in the image that can be found repeatedly despite potential intensity variations and deformations.
· The proposed approach does not require manual annotations for training and learns from data in a self-supervised manner.
· Our approach improves over the existing approaches for natural images by avoiding the need for pre-training or incremental fine-tuning of the neural network.
2. MATERIALS AND METHODS

2.1 Data
In total, 222 lower abdominal Computed Tomography (CT) scans of female patients, acquired for radiation treatment planning purposes, were retrospectively included: 168 scans (24,923 two-dimensional (2D) slices) were used for training and 54 scans (7,402 2D slices) were used for testing. For a separate set of three patients, one original scan along with a follow-up CT scan was included. The scans of these three patients were used for testing the approach in a clinical setting. All CT scans had an in-plane resolution from 0.91 mm × …
Figure 1. Schematic representation of our approach. The weights are shared between the two branches of the Siamese neural network. The transformation is required only during training for calculating the ground truths. Abbreviations of the data input and output at various stages follow the description in the text.
In Figure 1, the different modules of our approach are illustrated along with the data flow between them. Our approach comprises a Siamese architecture consisting of two CNN branches with shared weights. The outputs of the CNN branches are sent to a module named Sampling Layer, followed by another module named Feature Descriptor Matching Module. The network takes two images $I_1$ and $I_2$ as inputs and predicts $K_1$ and $K_2$ landmarks in $I_1$ and $I_2$, respectively. In addition, the network predicts matching probabilities ($\hat{c}_{i,j}$) for each landmark $i \in \{1, 2, ..., K_1\}$ in $I_1$ to a landmark $j \in \{1, 2, ..., K_2\}$ in $I_2$. In the following paragraphs, a description of each module is provided.

The CNN branches of the Siamese neural network have shared weights and consist of an encoder-decoder type network similar to the U-Net architecture [10]. The only difference from the original implementation is that the number of convolutional filters in each layer is reduced by a factor of four to avoid overfitting. The implemented architecture contains 16, 32, 64, 128, and 256 convolutional filters in the successive downsampling blocks, respectively. The CNN branches give two outputs for each input image: a landmark probability map and feature descriptors. The landmark probability map is computed at the end of the upsampling path after applying the sigmoid non-linearity, and the feature descriptors are computed by concatenation of the feature maps from the last two downsampling blocks. The feature maps from different downsampling blocks intrinsically allow for feature matching at multiple resolutions and abstraction levels.

The sampling layer is a parameter-free module of the network. It performs the following tasks:

1. It samples $K_1$ and $K_2$ landmark locations in $I_1$ and $I_2$, respectively, which correspond to the highest probability score locations in the predicted landmark probability maps.
2. It extracts the predicted landmark probabilities $\hat{p}_i^{I_1}$ and $\hat{p}_j^{I_2}$ corresponding to the $K_1$ and $K_2$ locations in the landmark probability maps of images $I_1$ and $I_2$.
3. It extracts the feature descriptors $f_i^{I_1}$ and $f_j^{I_2}$ corresponding to the sampled landmark locations in $I_1$ and $I_2$, respectively, and creates feature descriptor pairs $(f_i^{I_1}, f_j^{I_2})$ for each $i \in \{1, 2, ..., K_1\}$ and $j \in \{1, 2, ..., K_2\}$.
4. During training, it generates the ground truths for landmark probabilities and feature descriptor matching probabilities on-the-fly as mentioned in Georgakis et al. [17]. Briefly, the sampled landmark locations of $I_2$ are projected onto $I_1$ based on the known transformation between the images. A landmark location $i$ in $I_1$ is decided to be matching to a landmark location $j$ in $I_2$ if the Euclidean distance between $i$ and the projection of $j$ on image $I_1$ is less than a predefined pixel threshold ($thresh_{pixels}$).

All the feature descriptor pairs $(f_i^{I_1}, f_j^{I_2})$ are fed to the feature descriptor matching module, which consists of a single fully connected layer that predicts the matching probability for each feature descriptor pair.
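To make the sampling layer and the matching module more concrete, below is a minimal PyTorch-style sketch of top-K landmark sampling and pairwise descriptor matching. The function and variable names are ours, and the use of concatenated descriptor pairs as input to the single fully connected layer is an assumption based on the description above, not code from the paper.

```python
import torch

def sample_landmarks(prob_map, feats, k=400):
    """Sample the k highest-probability landmark locations from one CNN branch.

    prob_map: (H, W) landmark probability map (sigmoid output of the branch)
    feats:    (C, H, W) feature descriptors, assumed resampled to the map's grid
    Returns pixel coordinates (k, 2), probabilities (k,), and descriptors (k, C).
    """
    h, w = prob_map.shape
    probs, idx = torch.topk(prob_map.flatten(), k)          # k best locations
    ys = torch.div(idx, w, rounding_mode="floor")
    xs = idx % w
    coords = torch.stack([ys, xs], dim=1)
    descriptors = feats[:, ys, xs].t()                      # (k, C)
    return coords, probs, descriptors

class MatchingHead(torch.nn.Module):
    """Single fully connected layer scoring all K1 x K2 descriptor pairs."""
    def __init__(self, c):
        super().__init__()
        self.fc = torch.nn.Linear(2 * c, 1)

    def forward(self, d1, d2):                              # d1: (K1, C), d2: (K2, C)
        k1, k2 = d1.shape[0], d2.shape[0]
        pairs = torch.cat([d1[:, None, :].expand(k1, k2, -1),
                           d2[None, :, :].expand(k1, k2, -1)], dim=-1)
        return torch.sigmoid(self.fc(pairs)).squeeze(-1)    # (K1, K2) matching probabilities
```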
Training image pairs were generated on-the-fly by sampling a reference image randomly and generating the target image by transforming the reference image with a known transformation (randomly simulated brightness or contrast jitter, rotation, scaling, shearing, or elastic transformation). During training, the ground truths for landmark probabilities and feature descriptor matching probabilities are generated in the sampling layer as described above. We trained the network by minimizing a multi-task loss defined as follows:

$$\mathrm{Loss} = \mathrm{LandmarkProbabilityLoss}_{I_1} + \mathrm{LandmarkProbabilityLoss}_{I_2} + \mathrm{DescriptorMatchingLoss} \tag{1}$$

The $\mathrm{LandmarkProbabilityLoss}_{I_n}$ for the probabilities of landmarks in image $I_n$, $n \in \{1, 2\}$, is defined as:

$$\mathrm{LandmarkProbabilityLoss}_{I_n} = \frac{1}{K_n} \sum_{i=1}^{K_n} \Big( (1 - \hat{p}_i^{I_n}) + \mathrm{CrossEntropy}(\hat{p}_i^{I_n}, p_i^{I_n}) \Big) \tag{2}$$

where $\mathrm{CrossEntropy}$ is the cross entropy loss between the predicted landmark probabilities $\hat{p}_i^{I_n}$ and the ground truths $p_i^{I_n}$. The term $(1 - \hat{p}_i^{I_n})$ in (2) encourages high probability scores at all the sampled landmark locations, whereas the cross entropy loss term forces low probability scores at the landmark locations that do not have a correspondence in the other image. As a consequence, the network is forced to predict high landmark probabilities only at the salient locations that have a correspondence in the other image as well.

Hinge loss is widely used for learning discriminant landmark descriptors between matching and non-matching landmark pairs. We observed that a positive margin for the matching pairs in the hinge loss encourages the network to focus on hard positive examples (i.e., non-trivial landmark matches). Therefore, we defined $\mathrm{DescriptorMatchingLoss}$ (equation 3) as a linear combination of hinge loss with a positive margin $m_{pos}$ on the L2-norm of feature descriptor pairs and cross entropy loss on the matching probabilities predicted by the feature descriptor matching module:

$$\mathrm{DescriptorMatchingLoss} = \sum_{i=1, j=1}^{K_1, K_2} \Bigg( \frac{c_{i,j} \max(0, \lVert f_i^{I_1} - f_j^{I_2} \rVert_2 - m_{pos})}{K_{pos}} + \frac{(1 - c_{i,j}) \max(0, m_{neg} - \lVert f_i^{I_1} - f_j^{I_2} \rVert_2)}{K_{neg}} + \frac{\mathrm{WeightedCrossEntropy}(\hat{c}_{i,j}, c_{i,j})}{K_{pos} + K_{neg}} \Bigg) \tag{3}$$

where $\hat{c}_{i,j}$ and $c_{i,j}$ are the predicted and the ground truth matching probabilities, respectively, for the feature descriptor pair $(f_i^{I_1}, f_j^{I_2})$; $K_{pos}$ and $K_{neg}$ are the numbers of matching (positive class) and non-matching (negative class) feature descriptor pairs; and $m_{pos}$ and $m_{neg}$ are the margins for the L2-norm of matching and non-matching feature descriptor pairs. $\mathrm{WeightedCrossEntropy}$ is the binary cross entropy loss in which the loss corresponding to the positive class is weighted by the frequency of negative examples and vice versa. The gradients are back-propagated from end-to-end as indicated by the dashed arrows in Figure 1.
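For illustration, the multi-task loss in equations (1)-(3) could be implemented along the following lines. This is a sketch under our own naming; in particular, the exact reduction and class weighting inside $\mathrm{WeightedCrossEntropy}$ are our interpretation of the text.

```python
import torch
import torch.nn.functional as F

def landmark_probability_loss(p_hat, p_gt):
    """Equation (2): reward high scores at sampled locations via (1 - p_hat),
    penalize locations without a correspondence via cross entropy."""
    return torch.mean(1.0 - p_hat) + F.binary_cross_entropy(p_hat, p_gt)

def descriptor_matching_loss(d1, d2, c_hat, c_gt, m_pos=0.1, m_neg=1.0):
    """Equation (3): hinge losses on descriptor L2 distances plus a
    class-frequency-weighted cross entropy on predicted matching probabilities.
    d1: (K1, C), d2: (K2, C); c_hat/c_gt: (K1, K2) predicted / ground-truth matches."""
    dist = torch.cdist(d1, d2)                              # (K1, K2) L2 norms
    k_pos = c_gt.sum().clamp(min=1.0)
    k_neg = (1.0 - c_gt).sum().clamp(min=1.0)
    hinge_pos = (c_gt * F.relu(dist - m_pos)).sum() / k_pos
    hinge_neg = ((1.0 - c_gt) * F.relu(m_neg - dist)).sum() / k_neg
    # weight each pair's cross entropy by the frequency of the opposite class
    weight = torch.where(c_gt > 0.5, k_neg, k_pos) / (k_pos + k_neg)
    ce = F.binary_cross_entropy(c_hat, c_gt, reduction="none")
    return hinge_pos + hinge_neg + (weight * ce).sum() / (k_pos + k_neg)

def multi_task_loss(p_hat1, p_gt1, p_hat2, p_gt2, d1, d2, c_hat, c_gt):
    """Equation (1): the total multi-task loss."""
    return (landmark_probability_loss(p_hat1, p_gt1)
            + landmark_probability_loss(p_hat2, p_gt2)
            + descriptor_matching_loss(d1, d2, c_hat, c_gt))
```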
2.4 Constraining Landmark Locations

A naive implementation of the approach may find all the landmarks clustered in a single anatomical region, which is not desirable. Therefore, to learn landmarks in all anatomical regions during training, we sample the landmarks on a coarse grid in the sampling layer, i.e., in each 8 × 8 grid cell. In addition, we used a valid mask for each image, which contained the value 1 at the location of body pixels and 0 elsewhere. The valid mask was generated by image binarization using intensity thresholding and removing small connected components in the binarized image. The network is trained to predict high landmark probabilities as well as feature descriptor matching probabilities only at the matching locations that correspond to a value of 1 in the valid mask. This allows the network to learn a content-based prior on the landmark locations and avoids the need for image pre-processing during inference.

During inference, only the locations in $I_1$ and $I_2$ with landmark probabilities above a threshold ($thresh_{landmark}$) are considered. Further, landmark pairs from different images are only matched if their matching is inverse consistent. Suppose locations $i \in \{1, ..., K_1\}$ in $I_1$ and locations $j \in \{1, ..., K_2\}$ in $I_2$ have landmark probabilities above $thresh_{landmark}$. A pair $(i^*, j^*)$ is considered matching if there is no other pair $(i^*, j')$ with $j' \in \{1, ..., K_2\}$ or $(i', j^*)$ with $i' \in \{1, ..., K_1\}$ that has a higher descriptor matching probability or a lower L2-norm for its feature descriptor pair $(f_{i^*}^{I_1}, f_{j'}^{I_2})$ or $(f_{i'}^{I_1}, f_{j^*}^{I_2})$.

We implemented our approach using PyTorch [19]. We trained the network for 50 epochs using the Adam optimizer [20] with a learning rate of 10^{−…} and a weight decay of 10^{−…}. The training was done with a batch size of 4 and took 28 hours on a GPU (NVIDIA GeForce RTX 2080 Ti). To allow for batching, a constant number of K (set to 400) landmarks was sampled from all images. The threshold for the Euclidean distance while generating the ground truth ($thresh_{pixels}$) was 2 pixels. The margin for the L2-norm of matching feature descriptors ($m_{pos}$) was set to 0.1 and the margin for the L2-norm of non-matching pairs ($m_{neg}$) was set to 1. During inference, $thresh_{landmark}$ = 0.5 was used.

The empirical values for the hyperparameters were decided based on experience in preliminary experiments. For example, the number of landmarks to be sampled during training (K) was decided such that the entire image was covered with sufficient landmark density, which was inspected visually. Similarly, the decision for $thresh_{pixels}$ was motivated by the fact that a threshold of less than 2 pixels did not yield any matching landmarks in the first few iterations of training, and hence the network could not be trained. We initially trained the network with default values of $m_{pos}$ and $m_{neg}$ ($m_{pos}$ = 0, $m_{neg}$ = 1). However, we noticed on the validation set that all the predicted landmark pairs were clustered in regions of no deformation. To avoid this behaviour, we trained the network with $m_{pos}$ = 0.…, $m_{pos}$ = 0.…, and $m_{pos}$ = 0.…. The value of $thresh_{landmark}$ was chosen to give the best trade-off between the number of landmarks per image pair and the spatial matching error on the validation set.
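As an illustration of the inverse-consistency rule at inference, a minimal sketch is given below. It keeps only mutual best matches based on the predicted matching probabilities; the additional tie-breaking on the L2-norms of the descriptor pairs described above is omitted for brevity.

```python
import numpy as np

def inverse_consistent_matches(match_prob):
    """Keep a pair (i, j) only if j is the best match for i AND i is the best
    match for j, computed over landmarks that passed thresh_landmark.

    match_prob: (K1, K2) array of predicted matching probabilities.
    Returns a list of (i, j) index pairs.
    """
    best_j = match_prob.argmax(axis=1)   # best candidate in image 2 for each i
    best_i = match_prob.argmax(axis=0)   # best candidate in image 1 for each j
    return [(i, j) for i, j in enumerate(best_j) if best_i[j] == i]
```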
3. EXPERIMENTS

3.1 Baseline
Scale Invariant Feature Transform (SIFT) [21] based keypoint detectors and feature descriptors are prevalent approaches in both natural image analysis as well as in medical image analysis. Therefore, we used the OpenCV [22] implementation of SIFT as the baseline approach for comparison. We used two matching strategies for SIFT: a) brute-force matching with inverse consistency (similar to our approach; we refer to this approach as SIFT-InverseConsistency), and b) brute-force matching with the ratio test (as described in the original paper; we refer to this approach as SIFT-RatioTest). Default values provided in the OpenCV implementation were used for all other hyperparameters.

Table 1. Description of predicted landmark matches. The median number of landmark matches per image pair with the Inter Quartile Range (IQR) in parentheses is provided together with the spatial matching error. The entries in bold represent the best value among all approaches.

                                          Intensity          Affine             Elastic
No. of landmarks
    Proposed Approach                     639 (547 - 729)    466 (391 - 555)    370 (293 - 452)
    SIFT-InverseConsistency               711 (594 - 862)    610 (509 - 749)    542 (450 - 670)
    SIFT-RatioTest                        698 (578 - 849)    520 (426 - 663)    418 (330 - 541)
Spatial matching error (mm)
    Proposed Approach                     …                  …                  …
    SIFT-InverseConsistency               1.0 (1.0 - 1.4)    1.0 (1.0 - 1.4)    1.0 (1.0 - 2.0)
    SIFT-RatioTest                        1.0 (1.0 - 1.4)    1.0 (1.0 - 1.4)    …
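For reference, the two baseline matching strategies can be reproduced with OpenCV roughly as follows. This is our own sketch, not the evaluation code used for Table 1; the ratio-test threshold of 0.75 follows the suggestion in the original SIFT paper and is an assumption here.

```python
import cv2

def sift_matches(img1, img2, use_ratio_test=True, ratio=0.75):
    """Sketch of the two SIFT baselines. img1/img2 are assumed to be 8-bit
    grayscale arrays (CT slices would first need rescaling to uint8)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if use_ratio_test:
        # SIFT-RatioTest: keep a match only if the best candidate is clearly
        # better than the second best (Lowe's ratio test)
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        good = []
        for pair in matcher.knnMatch(des1, des2, k=2):
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                good.append(pair[0])
        return kp1, kp2, good
    # SIFT-InverseConsistency: crossCheck=True keeps only mutual nearest neighbors
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    return kp1, kp2, matcher.match(des1, des2)
```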
The performance is evaluated on two test sets. First, for quantitative evaluation, we transformed all 7,402 testing images from 54 CT scans with three different types of transformations: intensity (jitter in pixel intensities = ±20% of the maximum intensity), affine (pixel displacement: median = 29 mm, Inter Quartile Range (IQR) = 14 mm - 51 mm), and elastic transformations (pixel displacement: median = 12 mm, IQR = 9 mm - 15 mm). Elastic transformations were generated by deforming the original image according to a deformation vector field representing randomly-generated 2D Gaussian deformations. The extent of the transformations was decided such that the intensity variations and the displacement of pixels represented the typical variations in thoracic and abdominal CT scan images [23, 24].
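As an illustration, randomly-generated 2D Gaussian deformations of this kind can be produced by smoothing a random displacement field with a Gaussian kernel and resampling the image along it, as sketched below. The field strength (alpha) and smoothness (sigma) values are illustrative assumptions, not the parameters used to build the test set.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def random_elastic_transform(image, alpha=300.0, sigma=20.0, seed=None):
    """Deform a 2D image with a randomly-generated Gaussian deformation field.
    Returns the deformed image and the displacement field (dx, dy), which also
    provides the known transformation for generating ground-truth matches."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys + dy, xs + dx])
    return map_coordinates(image, coords, order=1, mode="nearest"), (dx, dy)
```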
This resulted in three sets of 7,402 2D image pairs (22,206 pairs in total). Second, to test the generalizability of our approach in a clinical setting, image pairs were taken from two CT scans of the same patient acquired on different days. The two scans were aligned with each other using affine registration in the SimpleITK package [25]. This process was repeated for three patients.

For quantitative evaluation, we projected the predicted landmarks in the target images to the reference images and calculated the Euclidean distance to their corresponding matches in the reference images. We report the cumulative distribution of landmark pairs with respect to the Euclidean distance between them.

The performance of our approach on clinical data was assessed visually. We show the predicted results on four transverse slices belonging to different anatomical regions. To visually trace the predicted correspondences of landmarks, the colors of the landmarks in both images vary according to their location in the original CT slice. Similarly colored dots between slices from the original and follow-up image represent matched landmarks.
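A minimal sketch of this quantitative evaluation is given below, assuming a callable `transform` that maps target pixel coordinates back to the reference frame; the names are hypothetical.

```python
import numpy as np

def spatial_matching_errors(pts_target, pts_ref, transform, spacing_mm=1.0):
    """Project predicted landmarks from the target image back to the reference
    image with the known transformation and measure the Euclidean distance (mm)
    to their matched reference landmarks."""
    projected = transform(pts_target)              # (N, 2) in the reference frame
    return np.linalg.norm(projected - pts_ref, axis=1) * spacing_mm

def cumulative_distribution(errors_mm, thresholds_mm=(1, 2, 4, 8, 16, 32, 64)):
    """Fraction of landmark pairs within each spatial matching error threshold."""
    errors_mm = np.asarray(errors_mm)
    return [(errors_mm <= t).mean() for t in thresholds_mm]
```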
4. RESULTS
The inference time of our approach per 2D image pair is within 10 seconds on a modern CPU without any parallelization. On the GPU, the inference time is ∼20 milliseconds. The model predicted on average 639 (IQR = 547 - 729), 466 (IQR = 391 - 555), and 370 (IQR = 293 - 452) landmark matches per image pair for intensity, affine, and elastic transformations, respectively.

4.1 Simulated Transformations
Table 1 describes the number of landmark matches per image pair and the spatial matching error for both our approach and the two variants of SIFT. Though our approach finds fewer landmarks per image pair than the two variants of SIFT, the predicted landmarks have a smaller spatial matching error than those of the SIFT variants. Further, Figure 2 shows the cumulative distribution of landmark pairs with respect to the Euclidean distance between them. All approaches are able to find more than 90% of landmark matches within 2 mm error for intensity transformations. Predicting landmark correspondences under affine and elastic transformations is considerably more difficult; this can also be seen in the worse performance of all approaches. However, our approach is still able to find more than 99% of landmark matches within a spatial matching error of 4 mm and 8 mm for affine and elastic transformations, respectively. In contrast, a noticeable percentage (about 2% for affine transformations and 3% for elastic transformations) of the landmarks detected by SIFT-RatioTest is wrongly matched with landmarks from far-apart regions (more than 64 mm away). It should be noted that if landmark matches with such high inaccuracies are used to provide guidance to a registration method, they may have a deteriorating effect on the registration if the optimizer is not sufficiently regularized.

Figure 2. Cumulative distribution of landmarks. The cross-hairs in (b) and (c) correspond to the percentile of landmarks in SIFT-RatioTest at 64 mm.

For visual comparison, the landmark correspondences in pairs of original and elastically transformed images are shown in Figure 3 (rows a-b) for our approach as well as for SIFT. As can be seen, the cases of mismatch in the predictions of our approach (i.e., landmarks in the transformed slices not following the color gradient of the original slice) are rather scarce in comparison to the baseline approaches. Another interesting point to note is the difference in landmark locations between our approach and the two baseline approaches. Since SIFT is designed to predict landmarks at locations of local extrema, its landmark matches are concentrated on the edges in the images. Our approach, however, predicts matches in soft tissue regions as well. Further inspection reveals that our approach predicts a considerable number of landmark matches even in the deformed regions, in contrast to the baseline approaches. The capability to establish landmark correspondences in the soft tissues and deformed regions is important because DIR methods can especially benefit from guidance information in these regions.
Rows c-f in Figure 3 show landmark correspondences in pairs of transverse slices corresponding to the lower abdominal region in the original and follow-up CT for our approach as well as for SIFT. As can be seen, the original and follow-up slices show large differences in the local appearance of structures owing to contrast agent, bladder filling, and the presence or absence of gas pockets; such variation was not part of the training procedure. It is notable that the model is able to find a considerable number of landmark matches in these image pairs despite the changes in local appearance. Moreover, the spatial matching error of the landmarks appears similar to that in images with simulated transformations, in contrast to the baseline approach SIFT-InverseConsistency. Further, SIFT-RatioTest predicts fewer mismatched landmarks than SIFT-InverseConsistency, but this is achieved at the cost of a large decrease in the number of landmark matches per image pair.

Figure 3. Landmark correspondences for pairs of different transverse slices in abdominal CT scans. The landmark correspondences predicted by our approach are shown in comparison with two variants of SIFT. Rows (a-b) show predictions on pairs of original (left) and elastically transformed (right) slices. Rows (c-f) show transverse slices taken from different anatomical regions. The slices in the original CT (left) are matched with a similar slice from a follow-up CT scan (right) by affine registration.
5. DISCUSSION AND CONCLUSIONS
With the motivation to provide additional guidance information to DIR methods for medical images, we developed an end-to-end deep learning approach for the detection and matching of landmarks in an image pair. To the best of our knowledge, this is the first approach that simultaneously learns landmark locations as well as feature descriptors for establishing landmark correspondences in medical imaging. While the final version of this manuscript was being prepared, we came across a work on retinal images whose approach for landmark detection, using a U-Net architecture in a semi-supervised manner, is partly similar to ours. However, our approach not only learns the landmark locations, but also the feature descriptors and the feature matching, such that the entire pipeline for finding landmark correspondences can be replaced by a neural network. Therefore, our approach can be seen as an essential extension of the mentioned approach.

Our proposed approach does not require any expert annotation or prior knowledge regarding the appearance of landmarks in the learning process. Instead, it learns landmarks based on their distinctiveness in feature space despite local transformations. Such a definition of landmarks is generic enough to be applicable to any type of image and sufficient for the underlying application of establishing correspondences between image pairs. Further, in contrast to traditional unsupervised approaches for landmark detection in medical imaging, the proposed approach does not require any pre- or post-processing steps and has fewer hyperparameters.

The main challenge for intensity-based DIR methods is to overcome local optima caused by multiple low contrast regions in the image, which result in image folding and unrealistic transformations in the registered image. It can be speculated that the availability of landmark correspondences in low contrast image regions may prove to be beneficial for DIR methods. Moreover, a uniform coverage of the entire image is desirable for improved performance. Upon visual inspection of the landmarks predicted by our approach, we observed that our approach not only finds landmark correspondences in bony anatomical regions but also in soft tissue regions lacking intensity gradients. Moreover, a considerable density of landmarks (approximately 400 landmarks per image pair) was observed despite the presence of intensity, affine, or elastic transformations. Based on these observations, we are optimistic about the potential added value of our approach to DIR methods.

We validated our approach on images with simulated intensity, affine, and elastic transformations. The quantitative results show a low spatial matching error for the landmarks predicted by our approach. Additionally, the results on clinical data demonstrate the generalization capability of our approach. We compared the performance of our approach with two variants of the widely used SIFT keypoint detection approach. Our approach not only outperforms the SIFT-based approaches in terms of matching error under simulated transformations, but also finds more accurate matches in the clinical data. As such, the results look quite promising. However, the current approach is developed for 2D images, i.e., it overlooks the possibility of out-of-plane correspondences between two CT scans, which are quite likely, especially in the lower abdominal region. The extension of the approach to 3D is therefore imperative to investigate its benefits in providing additional guidance information to DIR methods.
6. ACKNOWLEDGEMENTS
This research is part of the Open Technology Programme (project number 15586), which is financed by the Dutch Research Council (NWO), Elekta, and Xomnia. Further, the work is co-funded by the public-private partnership allowance for top consortia for knowledge and innovation (TKIs) from the Ministry of Economic Affairs.
REFERENCES

[1] Alderliesten, T., Bosman, P. A. N., and Bel, A., "Getting the most out of additional guidance information in deformable image registration by leveraging multi-objective optimization," in [Medical Imaging 2015: Image Processing], Proc. SPIE 9413, 94131R, International Society for Optics and Photonics (2015).
[2] Han, D., Gao, Y., Wu, G., Yap, P.-T., and Shen, D., "Robust anatomical landmark detection with application to MR brain image registration," Comput. Med. Imaging Graph., 277–290 (2015).
[3] Yang, D., Zhang, M., Chang, X., Fu, Y., Liu, S., Li, H. H., Mutic, S., and Duan, Y., "A method to detect landmark pairs accurately between intra-patient volumetric medical images," Med. Phys. 44(11), 5859–5872 (2017).
[4] Werner, R., Duscha, C., Schmidt-Richberg, A., Ehrhardt, J., and Handels, H., "Assessing accuracy of non-linear registration in 4D image data using automatically detected landmark correspondences," in [Medical Imaging 2013: Image Processing], Proc. SPIE 8669, 86690Z, International Society for Optics and Photonics (2013).
[5] Rühaak, J., Polzin, T., Heldmann, S., Simpson, I. J. A., Handels, H., Modersitzki, J., and Heinrich, M. P., "Estimation of large motion in lung CT by integrating regularized keypoint correspondences into dense deformable registration," IEEE Trans. Med. Imaging 36(8), 1746–1757 (2017).
[6] Ghassabi, Z., Shanbehzadeh, J., Sedaghat, A., and Fatemizadeh, E., "An efficient approach for robust multimodal retinal image registration based on UR-SIFT features and PIIFD descriptors," EURASIP J. Image Video Process. 2013(1), 25 (2013).
[7] Chen, J., Tian, J., Lee, N., Zheng, J., Smith, R. T., and Laine, A. F., "A partial intensity invariant feature descriptor for multimodal retinal image registration," IEEE Trans. Biomed. Eng. 57(7), 1707–1718 (2010).
[8] Guo, Y., Bennamoun, M., Sohel, F., Lu, M., Wan, J., and Kwok, N. M., "A comprehensive performance evaluation of 3D local feature descriptors," Int. J. Comput. Vis. 116(1), 66–89 (2016).
[9] Hervella, Á. S., Rouco, J., Novo, J., and Ortega, M., "Multimodal registration of retinal images using domain-specific landmarks and vessel enhancement," Procedia Comput. Sci., 97–104 (2018).
[10] Ronneberger, O., Fischer, P., and Brox, T., "U-Net: Convolutional networks for biomedical image segmentation," in [International Conference on Medical Image Computing and Computer-Assisted Intervention], 234–241, Springer (2015).
[11] Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., Kim, R., Raman, R., Nelson, P. C., Mega, J. L., and Webster, D. R., "Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs," JAMA 316(22), 2402–2410 (2016).
[12] Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., and Thrun, S., "Dermatologist-level classification of skin cancer with deep neural networks," Nature 542(7639), 115–118 (2017).
[13] Bier, B., Unberath, M., Zaech, J.-N., Fotouhi, J., Armand, M., Osgood, G., Navab, N., and Maier, A., "X-ray-transform invariant anatomical landmark detection for pelvic trauma surgery," in [International Conference on Medical Image Computing and Computer-Assisted Intervention], 55–63, Springer (2018).
[14] Tuysuzoglu, A., Tan, J., Eissa, K., Kiraly, A. P., Diallo, M., and Kamen, A., "Deep adversarial context-aware landmark detection for ultrasound imaging," in [International Conference on Medical Image Computing and Computer-Assisted Intervention], 151–158, Springer (2018).
[15] Thewlis, J., Bilen, H., and Vedaldi, A., "Unsupervised learning of object landmarks by factorized spatial embeddings," in [The IEEE International Conference on Computer Vision], 5916–5925 (2017).
[16] Zhang, Y., Guo, Y., Jin, Y., Luo, Y., He, Z., and Lee, H., "Unsupervised discovery of object landmarks as structural representations," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 2694–2703 (2018).
[17] Georgakis, G., Karanam, S., Wu, Z., Ernst, J., and Košecká, J., "End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 1965–1973 (2018).
[18] DeTone, D., Malisiewicz, T., and Rabinovich, A., "SuperPoint: Self-supervised interest point detection and description," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops], 224–236 (2018).
[19] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A., "Automatic differentiation in PyTorch," in [Advances in Neural Information Processing Systems Workshops] (2017).
[20] Kingma, D. P. and Ba, J., "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980 (2014).
[21] Lowe, D. G., "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis. 60(2), 91–110 (2004).
[22] Bradski, G., "The OpenCV library," Dr. Dobb's J., 120–125 (2000).
[23] Vásquez Osorio, E. M., Hoogeman, M. S., Méndez Romero, A., Wielopolski, P., Zolnay, A., and Heijmen, B. J. M., "Accurate CT/MR vessel-guided nonrigid registration of largely deformed livers," Med. Phys. 39(5), 2463–2477 (2012).
[24] Polzin, T., Rühaak, J., Werner, R., Strehlow, J., Heldmann, S., Handels, H., and Modersitzki, J., "Combining automatic landmark detection and variational methods for lung CT registration," in [Fifth International Workshop on Pulmonary Image Analysis], 85–96 (2013).
[25] Lowekamp, B. C., Chen, D. T., Ibáñez, L., and Blezek, D., "The design of SimpleITK," Front. Neuroinform. 7, 45 (2013).