D-Net: Siamese based Network with Mutual Attention for Volume Alignment
Jian-Qing Zheng, Ngee Han Lim, and Bartłomiej W. Papież

Kennedy Institute of Rheumatology, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK
{jianqing.zheng, han.lim}@kennedy.ox.ac.uk
Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
[email protected]
Abstract.
Alignment of contrast and non contrast-enhanced imaging is essential for the quantification of changes in several biomedical applications. In particular, the extraction of cartilage shape from contrast-enhanced Computed Tomography (CT) of tibiae requires accurate alignment of the bone, currently performed manually. Existing deep learning-based methods for alignment require a common template or are limited in rotation range. Therefore, we present a novel network, D-net, to estimate arbitrary rotation and translation between 3D CT scans that additionally does not require a prior template. D-net is an extension to the branched Siamese encoder-decoder structure, connected by new mutual non-local links, which efficiently capture long-range connections of similar features between two branches. The 3D supervised network is trained and validated using preclinical CT scans of mouse tibiae with and without contrast enhancement in cartilage. The presented results show a significant improvement in the estimation of CT alignment, outperforming the current comparable methods.
Keywords:
Image registration · Deep learning · Mutual attention · Siamese network.
1 Introduction

It is currently impossible to accurately quantify the damage to cartilage during the progression of disease in small animal models of osteoarthritis. Visualisation of cartilage in Computed Tomography (CT) requires a contrast agent. The preclinical development of such a contrast agent [9] has highlighted the problem of accurate cartilage shape extraction from contrasted images; the partial volume effect and adjacency to bone necessitate the use of pre- and post-contrasted CT images. In preclinical scanners, the animal can be placed in various unsystematic (i.e. arbitrary) positions during the acquisition. Tibial cartilage shape may be extracted from the contrast-enhanced image by subtracting the non-contrasted scan, but this requires accurate alignment of the tibial bone. However, the current semi-accurate manual alignment using ImageJ requires over 1 hour and is prone to error, calling for an automated and accurate method to estimate rigid transformation between 3D volumes acquired in the preclinical setup.

The standardised protocols for image acquisition in clinical scanners mean that the range of rotation and translation required to register scans is small, and the bigger challenge is to perform deformable registration, especially between image modalities [14]. In pre-clinical studies, protocols are usually study- and machine-specific. The limbs of mice are particularly challenging as they may be extended or tucked, dependent on posture (prone, supine and on the side). Post-mortem ex vivo tissue may also be scanned in fixative solution, which increases the variability of orientations and positions of scans. Our initial dataset is comprised of such ex vivo tissue, which will later be used to validate the in vivo scans. Thus, estimation of large-range rigid transformations is required.

Classic approaches to preclinical image alignment [1] used state-of-the-art, iterative image registration with a similarity measure capturing intensity changes caused by contrast and an appropriate transformation model. However, such approaches are easily trapped in local minima, especially when large translation or rotation is present. More recently, deep learning approaches [15],[8],[10] were employed to improve the performance of iterative image registration algorithms; however, their slow performance and dependency on initialization motivate one-step transformation estimation via regression [4,5]. For example, a two-branch Siamese Encoder (SE), used to learn a similarity measure between two images, was applied to 2D brain image alignment [16]. A convolutional neural network called AIRNet [2] was used for affine registration of 3D brain Magnetic Resonance Imaging (MRI) with dense convolution layers as the SE.
The SE structure was also used within the framework of deformable image registration in [17,18] to estimate an initial, affine transformation between two volumes. Alternatively, an affine transformation can be estimated using the Global-net [6], with the input images being concatenated and fed into a one-branch encoder. Despite the success of the previous approaches, the capture range of rotation is heavily limited in both [16] and [2] (±0.8 rad in the latter), yielding unsatisfactory results in the preclinical imaging acquisition setup (shown in Sec. 3). The 3D pose estimation of an arbitrarily oriented subject was presented in [13], but it requires a prior standard template, which is not available for preclinical cartilage imaging.

In this paper, a new architecture, D-net, is proposed for estimation of arbitrary rigid transformation based on a Siamese Encoder Decoder (SED) with novel Mutual Non-local Links (MNL) between two Siamese branches, as described in Sec. 2.1 and Sec. 2.2. Data collection and experiment design are described in Sec. 2.3 and Sec. 2.4 respectively. Experimental results are shown in Sec. 3, discussed and concluded in Sec. 4.

The contributions of this work are as follows. We propose a new network with SED used for the first time for rigid registration, and we present a concept of MNL showing significantly improved performance on 3D volume alignment. Our network achieves consistent accuracy for a wide range of volume orientations, apparent in a challenging preclinical data set, while it does not require a prior atlas or template.

2 Methods

The objective of 3D image registration is to estimate the transformation

  f : ℝ^s → ℝ^s,  X_f ↦ X_m

between a fixed volume X_f ∈ ℝ^s and a moving volume X_m ∈ ℝ^s, where s = d × h × w, and d, h, w are the thickness, height, and width.
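As a concrete illustration of this setup, the sketch below applies a rigid pair (R, t) to a volume by inverse-mapping resampling; the shapes, the centre-of-volume convention and the use of `scipy` are illustrative assumptions, not details from the paper.

```python
import numpy as np
from scipy.ndimage import affine_transform

def apply_rigid(volume, R, t):
    """Resample `volume` under the rigid transform [R, t] about the volume centre.

    affine_transform uses the *inverse* mapping: output voxel x is sampled
    from input position R_inv @ x + offset, so we pass R^T (= R^-1 for a
    rotation) and the matching offset.
    """
    centre = (np.array(volume.shape) - 1) / 2.0
    R_inv = R.T
    offset = centre - R_inv @ (centre + t)
    return affine_transform(volume, R_inv, offset=offset, order=1)

# A toy 64^3 volume with an off-centre block, rotated 90 degrees in-plane;
# the 90-degree rotation maps grid points onto grid points, so the block
# mass is preserved exactly under linear interpolation.
vol = np.zeros((64, 64, 64))
vol[20:30, 40:50, 28:36] = 1.0
Rz = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
out = apply_rigid(vol, Rz, t=np.zeros(3))
```

The registration task is then the inverse problem: given X_f and X_m, recover the (R, t) that relates them.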
For 3D rigid registration, the transformation f_θ := [R, t] ∈ SE(3) consists of a rotation R ∈ SO(3) and a translation t ∈ ℝ³, with the parameters θ = [θ^r, θ^t] including θ^t for translation and θ^r for rotation. The task of the networks in registration is to estimate θ from the two preprocessed volumes X̃_f and X̃_m by the networks' mapping g : (X̃_f, X̃_m) ↦ θ̂, where θ̂ are the parameters estimated by the networks.

As the rotation parameterization θ^r is redundant, the 3D orthogonalization mapping of the 6D rotation representation [20] is used, O : ℝ⁶ → SO(3), θ^r_{1:6} ↦ R, calculated by:

  O(θ^r_{1:6}) = [r₁ r₂ r₃] := [ N([θ^r_{1:3}]),  N([θ^r_{4:6}] − (r₁^⊤[θ^r_{4:6}]) r₁),  det([r₁ r₂ e])^⊤ ]    (1)

where [θ_{i:j}] denotes a column vector consisting of θ_i, …, θ_j, N(·) denotes Euclidean normalization, det(·) denotes the determinant, and e is the vector of the 3 canonical basis vectors of 3D Euclidean space. This mapping keeps a continuous representation of 3D rotation and is equivalent to the Gram-Schmidt process for a rotation in a right-handed coordinate system, but requires just 6 input values. Thus the rigid transformation can be estimated by f̂_θ̂ = [O(θ̂^r), θ̂^t].

2.1 D-net

D-net consists of a SE part, a decoder part, and a regression part, and its schematic architecture is shown in Fig. 1(a). A similar SED structure was applied to segmentation [7] and tracking [3], but with a different connection structure between the contracting and expansive parts compared to D-net. The SE in D-net includes two branches of six Residual-down-sampling (Res-down) blocks with shared parameters. Four pairs of the Res-down blocks are linked by MNL, as detailed in Fig. 1(b). In MNL, two matching matrices, from the left branch to the right and the inverse, are computed by the dot product of each pair of voxels' feature vectors.
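A minimal numerical sketch of this mutual matching follows (the embedded Gaussian instantiation given in Eqs. (2)–(4)); the flattened feature maps, the weight names `Wq`, `Wk`, `Wv` and the toy sizes are hypothetical, not the paper's implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def mutual_nonlocal_link(Xf, Xm, Wq, Wk, Wv):
    """Mutual attention between two branches on flattened feature maps.

    Xf, Xm: (N, C) arrays, one row per voxel (N = d*h*w).
    One matching matrix is computed by dot products of embedded voxel
    features; its softmax (and the softmax of its transpose) connects
    voxels of one branch to all voxels of the other.
    """
    A = (Xf @ Wq.T) @ (Xm @ Wk.T).T            # matching matrix, (N, N)
    y_m2f = softmax(A, axis=1) @ (Xm @ Wv.T)   # moving features for each fixed voxel
    y_f2m = softmax(A.T, axis=1) @ (Xf @ Wv.T) # and, mutually, the inverse direction
    return y_m2f, y_f2m

rng = np.random.default_rng(0)
N, C, E = 8, 4, 4
Xf = rng.normal(size=(N, C))
Xm = rng.normal(size=(N, C))
Wq = rng.normal(size=(E, C))
Wk = rng.normal(size=(E, C))
Wv = rng.normal(size=(E, C))
y_m2f, y_f2m = mutual_nonlocal_link(Xf, Xm, Wq, Wk, Wv)
```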
The matrices from the two branches are normalized via softmax to correspond and connect the voxels of the feature maps between the two branches. MNL therefore captures the long-range connection of similar high- and low-level features between the two branches. The details of the Res-down blocks are illustrated in Fig. 1(c). The decoder part of D-net includes four Residual-up-sampling (Res-up) blocks receiving skip connections from the corresponding Mutually non-local linked Res-down blocks, as shown in Fig. 1(d). The regression part of D-net includes two fully connected layers with 128 and 12 neurons for the 12 transformation parameters. In Fig. 1, i is the block number, and d = (d_0 ⋯ d_6), h = (h_0 ⋯ h_6), w = (w_0 ⋯ w_6) and c = (c_0 ⋯ c_6) denote the sequences of thickness, height, width, and channel number of the input volume and the feature maps of each branch, respectively.

Fig. 1. The architecture of (a) D-net, (b) the novel Mutually non-local linked Res-down block, (c) the Res-down block, and (d) the Res-up block. (Legend: Conv3 denotes a 3×3×3 convolution, Conv1 a 1×1×1 convolution, Conv3/1 a Conv3 for i < 5 and a Conv1 for i = 5, Act a leaky-ReLU with α = 0.01, FC a fully connected layer, and T a transpose.)

2.2 Mutual Non-local Link

An approach with non-local links was presented in classic image registration, where non-local motion was estimated using a range of spatial scales, naturally captured by a graph representation [12]. Similarly, a concept of unique matching between a pair of voxels by a weighting function and mutual saliency was previously shown in [11]. Here, the mutual attention mechanism is proposed with the deep-learning design of MNL, inspired by the Self Non-local Link (SNL) on one branch proposed in [19]. In this section we provide the general definition of MNL:

  y^{m2f}_k := Σ_j φ(x^f_k, x^m_j) ψ(x^m_j) / Σ_j φ(x^f_k, x^m_j)
  y^{f2m}_k := Σ_j φ(x^m_k, x^f_j) ψ(x^f_j) / Σ_j φ(x^m_k, x^f_j)    (2)

where j, k are the indices of positions in a feature map, x^f, x^m are the input signals from the two branches, y^{m2f}, y^{f2m} are the output signals of this block, φ is the similarity measurement function, and ψ is a unary function. The instantiated MNL in D-net is based on the embedded Gaussian similarity representation with

  φ(x₁, x₂) := e^{x₁^⊤ W₁^⊤ W₂ x₂}    (3)

and

  ψ(x) := W₃ x    (4)

where W₁, W₂ and W₃ are matrices of trainable weights.

2.3 Data and Preprocessing

A total of 100 ex-vivo micro CT scans of tibiae from 50 mice were acquired using
a PerkinElmer Quantum FX scanner, with a resolution of 10 × 10 × 10 µm/vox and a volume size of 512 × 512 × 512. Each volume is preprocessed by the intensity transformation T : ℝ^s → ℝ^s, X ↦ X̃, given by:

  x̃_ijk = ReLU(x_ijk − x_th) / (x_max − x_th)    (5)

where x̃_ijk is an entry of X̃, x_max is the maximum intensity in the dataset, and x_th is a threshold value, set to 2000 because of the high density of the background solution. Finally, because of the data size, the input volumes are sub-sampled with linear interpolation to 64 × 64 × 64 at 80 × 80 × 80 µm/vox. The CT slices of two exemplar subjects are shown in Fig. 2.

In the training dataset, each CT volume is transformed to synthesize the fixed and the moving volumes, with uniformly distributed random translation of up to a fixed fraction of the whole volume size, and random rotation with angle ∼ U(−π, π) around random axes uniformly distributed on the 3D sphere surface. To enlarge the training dataset, data augmentation including intensity scaling and Gaussian noise is applied, where the intensity scale coefficient is ∼ U(0.95, 1.05) and each voxel is perturbed with zero-mean Gaussian noise.

[Fig. 2 panels: fixed volume (with contrast) and moving volume (without contrast) in a false-colour scale, their overlay, and the labeled cartilage.]
Fig. 2.
The tibial CT slices (at original resolution) of two exemplar subjects show the large ranges of spatial transformation between the fixed and moving volumes.
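The preprocessing of Eq. (5) above can be sketched as follows; passing the dataset-wide maximum explicitly is an assumption for illustration.

```python
import numpy as np

def preprocess(volume, x_th=2000.0, x_max=None):
    """Eq. (5): suppress the dense background solution below x_th via a
    ReLU shift, then rescale to [0, 1] by the dataset maximum."""
    if x_max is None:
        x_max = volume.max()  # stand-in for the dataset-wide maximum
    return np.maximum(volume - x_th, 0.0) / (x_max - x_th)

# 1500 and 2000 fall at or below the threshold and map to 0;
# 3000 maps to 0.5 and 4000 to 1.
vol = np.array([[1500.0, 2000.0], [3000.0, 4000.0]])
scaled = preprocess(vol, x_th=2000.0, x_max=4000.0)
```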
2.4 Experiment Design

The loss function in terms of θ and θ̂ is calculated as:

  L = α ‖θ^t − θ̂^t‖₂ / (‖θ^t‖₂ + ε) + β ‖θ^r − θ̂^r‖₂    (6)

where ‖·‖₂ := √(Σ(·)²) is the Euclidean norm, the weights of the relative translation error and the rotation error are set as α = β = 0.5, and ε = 0.01 avoids the singularity. Momentum Stochastic Gradient Descent was applied with learning rate 0.0001 and momentum 0.9.

We split our CT data set into two folds (A and B), each containing pairs of contrast and non contrast-enhanced CTs from the same mice, and two-fold cross-validation was performed across different mice. Furthermore, we performed four validation strategies: (S1) training on all data from fold A, and testing on B; (S2) training on all data from B, testing on A; (S3) training on non contrast-enhanced data from A, and testing on all data from B; (S4) training on non contrast-enhanced data from B, and testing on all data from A. Training on non contrast-enhanced data was performed to check whether our network can be used for alignment of follow-up contrast-enhanced data when only the baseline data are available for training, thus modeling the real scenario of data acquisition.
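The loss of Eq. (6) above can be sketched in a few lines; the parameter shapes (3 translation values, 6 rotation values) are an assumption for illustration.

```python
import numpy as np

def alignment_loss(theta_t, theta_t_hat, theta_r, theta_r_hat,
                   alpha=0.5, beta=0.5, eps=0.01):
    """Eq. (6): relative translation error plus rotation-parameter error.

    eps keeps the relative term finite when the ground-truth translation
    has zero norm.
    """
    te = np.linalg.norm(theta_t - theta_t_hat) / (np.linalg.norm(theta_t) + eps)
    re = np.linalg.norm(theta_r - theta_r_hat)
    return alpha * te + beta * re

# A perfect prediction incurs zero loss.
t = np.array([1.0, 2.0, 3.0])
r = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
perfect = alignment_loss(t, t, r, r)  # 0.0
```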
For each validation strategy, a known transformation (as described in Sec. 2.3) was applied to each volume to create a pair of CTs in the synthetic test, and in the real test, each pair of corresponding contrasted and non-contrasted CTs was registered. The synthetic test includes a rotation test, with a fixed translation and rotations uniformly ranging from −π to π around the axis (√3/3, √3/3, √3/3), as well as a translation test, with a fixed π rotation around the axis (√3/3, √3/3, √3/3) and 11 translations uniformly ranging from −√3 mm to √3 mm along the same axis. D-net was compared with other relevant image registration approaches:

• SITK: SimpleITK with the Joint Histogram Mutual Information metric and the Regular Step Gradient Descent optimizer, with gradient tolerance 0.0001, maximum iteration number 10k and learning rate 1.

• ME (Mixed Encoder): The "Global-net" [6], concatenating the two input volumes together and feeding them into one mixed branch. All the architecture settings are set to default values, with spatial sizes d = h = w = (64, 32, 16, 8, 4).

• SE (Siamese Encoder): An architecture employing two branches of 6 Res-down blocks for the SE and two fully connected layers for regression; a similar structure was used in [17], but without the residual structure and with fewer down-sampling blocks.

• SED (Siamese Encoder Decoder): A proposed architecture inserting 4 Res-up blocks between the SE and regression parts, with skip connections from the 4 latter Res-down blocks of the SE.

• SNL-SED (Self Non-local Linked Siamese Encoder Decoder): A proposed SED architecture with the 4 latter Res-down blocks self non-locally linked by the embedded Gaussian similarity based non-local block [19] in each branch.

SITK, ME and SE are previously published methods, while SED and SNL-SED are transitional forms towards D-net whose impact is separately validated. SE, SED, SNL-SED and D-net are validated with spatial sizes d = h = w = (64, 32, 16, 8, 4, 2, 1), halved across the six Res-down blocks.

Criteria
The Euclidean distance of the Translation Error (TE) between the predicted and expected translation,

  TE := ‖θ^t − θ̂^t‖₂,

and the Rotation Error (RE) between the predicted and expected rotation,

  RE := arccos( (tr(R^⊤ O(θ̂^r)) − 1) / 2 ),

are calculated for the synthetic tests. Since the ground truth for the real examples is unknown, the Dice Similarity Coefficient (DSC) between the cortical bone segmented from the contrasted and non-contrasted tibial CTs is calculated for both synthetic and real tests.
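These criteria, together with the orthogonalization O of Eq. (1), can be sketched as follows (a minimal numpy version; the cross product implements the det([r₁ r₂ e]) column):

```python
import numpy as np

def ortho6d(theta_r):
    """Gram-Schmidt mapping O of Eq. (1): 6 values -> rotation matrix."""
    a, b = theta_r[:3], theta_r[3:]
    r1 = a / np.linalg.norm(a)
    r2 = b - (r1 @ b) * r1          # remove the component along r1
    r2 = r2 / np.linalg.norm(r2)
    r3 = np.cross(r1, r2)           # det([r1 r2 e]) expanded over the basis e
    return np.stack([r1, r2, r3], axis=1)

def translation_error(t, t_hat):
    """TE: Euclidean distance between predicted and expected translation."""
    return np.linalg.norm(t - t_hat)

def rotation_error(R, theta_r_hat):
    """RE: geodesic angle between ground-truth R and predicted O(theta_r_hat)."""
    c = (np.trace(R.T @ ortho6d(theta_r_hat)) - 1.0) / 2.0
    return np.arccos(np.clip(c, -1.0, 1.0))

# The identity rotation encoded by the first two canonical basis vectors.
theta = np.array([1., 0., 0., 0., 1., 0.])
err = rotation_error(np.eye(3), theta)  # 0.0
```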
Fig. 3.
The training curves exemplified with (S1). Left: loss values; middle: Rotation Errors (RE); right: Translation Errors (TE), showing that SED, SNL-SED and D-net are trainable across the range of translations and rotations.
3 Results

All networks were trained for 120k iterations. In all training strategies used, ME and SE failed to converge, whereas SED, SNL-SED and D-net were trainable (exemplified in Fig. 3).

The results of the rotation test and the translation test with validation strategies (S1&S2) are shown in Fig. 4(a) and Fig. 4(b), where only D-net achieves sub-voxel average TE in the rotation and translation tests. The performance of ME, and especially SE, is sensitive to the initial translation and rotation, as shown in Fig. 4(a)-middle and Fig. 4(b)-left, because they tend to predict small-range transformations for any input volumes. DSCs for 30 subjects in the real test with strategies (S1&S2) are shown in Fig. 4(c), where SED increases DSC by over 0.1. The results of strategies (S1)−(S4) in the rotation test and the real test are plotted in Fig. 4(d), which shows that the average DSC of D-net is higher than all the others in both the rotation and the real tests, though slightly lower in the real test than in the rotation test.

Fig. 4. D-net outperforms all other methods, exemplified by (S1), in the (a) rotation (rot) test, (b) translation test, and (c) real test exemplified by 30 subjects, with Translation Errors (TE), Rotation Errors (RE) and DSC, avg ± std; (d) DSC, avg ± std, in the rot test and real test with validation strategies (S1)−(S4).

The tibial bone shapes of two registration examples for ME, SED, and D-net from the real test are shown in Fig. 5, with the same subjects previously shown in Fig. 2. The figure illustrates bone fragments and segmentation differences caused by the varying intensity influenced by contrast, decreasing the DSC values and making registration of preclinical tibia data particularly challenging. The visual results in Fig. 5 confirm that D-net performs robustly, and this is further supported by the quantitative results shown in Tab. 1, where TE, RE, and DSC for all methods are presented.

Fig. 5. Segmentation surfaces for the two exemplar volumes used in the real experiment. D-net achieves the most plausible registration (overlapping red and white surfaces) with the highest DSC, shown at the bottom right corner (images are shown at original resolution).

Compared with the others, D-net achieves the lowest TE and RE and the highest DSC, with consistent performance across the range of rotations. Using two-way Analysis of Variance (ANOVA) in the rotation and translation tests and one-way ANOVA in the real test, D-net significantly outperforms all other approaches on TE, RE and DSC in the rotation and translation tests and on DSC in the real test under strategies (S1&S2) and (S3&S4); SED significantly outperforms SE, ME and SITK on TE, RE and DSC in the rotation and translation tests and on DSC in the real test under all the strategies.

Table 1. Average values of Translation Error (TE/µm), Rotation Error (RE/°) and Dice Similarity Coefficient (DSC/%) for the different methods in the Translation (Transl), Rotation (Rot) and Real tests, with strategies of training on both contrasted and non-contrasted data (S1&S2) and on non-contrasted data only (S3&S4). The D-net values are illegible in the source and are left blank.

                SITK    ME              SE              SED             SNL-SED         D-net
Variable No.      -     0.7M            9.2M            5.2M            4.9M            4.9M
Strategy (S)      -     1&2    3&4      1&2    3&4      1&2    3&4      1&2    3&4      1&2    3&4
Transl   TE     710.5   406.4  395.1    472.6  472.5    124.3  132.7    163.4  146.2
         RE     135.9   82.7   83.4     89.8   89.2     7.1    8.0      9.3    9.7
         DSC    12.98   15.04  14.94    12.86  13.42    52.28  51.08    45.61  47.69
Rot      TE     696.2   575.4  569.4    692.4  692.4    154.4  144.1    201.5  175.7
         RE     119.6   92.6   95.1     98.5   98.0     7.4    8.1      9.1    9.5
         DSC    12.75   14.54  14.54    12.22  12.75    48.28  47.97    40.88  44.69
Real     DSC    16.91   16.75  15.16    18.86  17.23    43.66  42.52    41.85  39.26

4 Discussion and Conclusion

In this paper, we proposed a new network, D-net, and a new structure, the Mutual Non-local Link (MNL), for the estimation of transformation between CT volumes. The experimental results show that D-net outperforms the other methods and achieves state-of-the-art performance for rigid registration of preclinical mouse CT scans with and without contrast. While ME and SE did not converge during training using the full range of rotations, we were able to train them using a smaller range of rotations (30°), similarly as in [2,4]. This further shows the superiority of D-net at consistently estimating the full range of rotations.

The average DSCs of D-net in the real test are slightly lower than in the synthetic tests, potentially due to the difference in segmentation for contrast-enhanced volumes. However, D-net is still able to extract the common features of the tibial bone and align the two volumes plausibly, showing usefulness for shape extraction of cartilage from contrast-enhanced CT of tibiae.

For the rotation representation, the widely used quaternions, Euler angles and Lie algebra were not applied, due to the discontinuity of 3D rotation represented in a real Euclidean space with dimension lower than 5D [20].

In future work, a pipeline for cartilage shape extraction will be further validated for morphological analysis and in application to the diagnosis and staging of osteoarthritis, and D-net will be generalized to other modalities to explore inter-subject and inter-modality registration.
Acknowledgements

This work was supported by a Kennedy Trust for Rheumatology Research Studentship and the Centre for OA Pathogenesis Versus Arthritis (Versus Arthritis grant 21621). The authors acknowledge Patricia das Neves Borges as the researcher who collected the preclinical CT dataset, as part of the National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3R grant NC/M000141/1). B. W. Papież acknowledges the Rutherford Fund at Health Data Research UK.
References
1. Baiker, M., Staring, M., Löwik, C.W., Reiber, J.H., Lelieveldt, B.P.: Automated registration of whole-body follow-up microCT data of mice. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 516–523. Springer (2011)
2. Chee, E., Wu, Z.: AIRNet: Self-supervised affine registration for 3D medical images using neural networks. arXiv preprint arXiv:1810.02583 (2018)
3. Dunnhofer, M., Antico, M., Sasazawa, F., Takeda, Y., Camps, S., Martinel, N., Micheloni, C., Carneiro, G., Fontanarosa, D.: Siam-U-Net: encoder-decoder siamese network for knee cartilage tracking in ultrasound images. Medical Image Analysis 60, 101631 (2020)
4. Haskins, G., Kruecker, J., Kruger, U., Xu, S., Pinto, P.A., Wood, B.J., Yan, P.: Learning deep similarity metric for 3D MR-TRUS image registration. International Journal of Computer Assisted Radiology and Surgery 14(3), 417–425 (2019)
5. Haskins, G., Kruger, U., Yan, P.: Deep learning in medical image registration: a survey. Machine Vision and Applications 31(1), 8 (2020)
6. Hu, Y., Modat, M., Gibson, E., Ghavami, N., Bonmati, E., Moore, C.M., Emberton, M., Noble, J.A., Barratt, D.C., Vercauteren, T.: Label-driven weakly-supervised learning for multimodal deformable image registration. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). pp. 1070–1074. IEEE (2018)
7. Kwon, D., Ahn, J., Kim, J., Choi, I., Jeong, S., Lee, Y.S., Park, J., Lee, M.: Siamese U-Net with healthy template for accurate segmentation of intracranial hemorrhage. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 848–855. Springer (2019)
8. Liao, R., Miao, S., de Tournemire, P., Grbic, S., Kamen, A., Mansi, T., Comaniciu, D.: An artificial agent for robust image registration. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
9. Lim, N.H., Fowkes, M.M.: Radiopaque compound containing diiodotyrosine (Jun 5 2019), EU Patent EP3490614A1
10. Ma, K., Wang, J., Singh, V., Tamersoy, B., Chang, Y.J., Wimmer, A., Chen, T.: Multimodal image registration with deep context reinforcement learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 240–248. Springer (2017)
11. Ou, Y., Sotiras, A., Paragios, N., Davatzikos, C.: DRAMMS: Deformable registration via attribute matching and mutual-saliency weighting. Medical Image Analysis 15(4), 622–639 (2011)
12. Papież, B.W., Szmul, A., Grau, V., Brady, J.M., Schnabel, J.A.: Non-local graph-based regularization for deformable image registration. In: Medical Computer Vision and Bayesian and Graphical Models for Biomedical Imaging. pp. 199–207. Springer (2016)
13. Salehi, S.S.M., Khan, S., Erdogmus, D., Gholipour, A.: Real-time deep pose estimation with geodesic loss for image-to-template rigid registration. IEEE Transactions on Medical Imaging 37(2), 470–481 (2018)
14. Schnabel, J.A., Heinrich, M.P., Papież, B.W., Brady, J.M.: Advances and challenges in deformable image registration: from image fusion to complex motion modelling. Medical Image Analysis 33, 145–148 (2016)
15. Simonovsky, M., Gutiérrez-Becker, B., Mateus, D., Navab, N., Komodakis, N.: A deep metric for multimodal registration. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 10–18. Springer (2016)
16. Sloan, J.M., Goatman, K.A., Siebert, J.P.: Learning rigid image registration - utilizing convolutional neural networks for medical image registration. In: 11th International Joint Conference on Biomedical Engineering Systems and Technologies. pp. 89–99. SCITEPRESS (2018)
17. de Vos, B.D., Berendsen, F.F., Viergever, M.A., Sokooti, H., Staring, M., Išgum, I.: A deep learning framework for unsupervised affine and deformable image registration. Medical Image Analysis 52