Simultaneous Denoising and Motion Estimation for Low-dose Gated PET using a Siamese Adversarial Network with Gate-to-Gate Consistency Learning
Bo Zhou, Yu-Jung Tsai, and Chi Liu
Biomedical Engineering, Yale University, New Haven, CT, USA
Radiology and Biomedical Imaging, Yale University, New Haven, CT, USA
Abstract.
Gating is commonly used in PET imaging to reduce respiratory motion blurring and to facilitate more sophisticated motion correction methods. In low-dose PET applications, however, reducing the injection dose increases noise and reduces the signal-to-noise ratio (SNR), subsequently corrupting the motion estimation/correction steps and causing inferior image quality. To tackle these issues, we first propose a Siamese adversarial network (SAN) that can efficiently recover a high-dose gated image volume from a low-dose gated image volume. To ensure appearance consistency between the recovered gated volumes, we then incorporate a pre-trained motion estimation network into SAN to enforce a gate-to-gate (G2G) consistency constraint. With high-quality recovered gated volumes, gate-to-gate motion vectors can be simultaneously output by the motion estimation network. Comprehensive evaluations on a low-dose gated PET dataset of 29 subjects demonstrate that our method can effectively recover the low-dose gated PET volumes, with an average PSNR of 37.16 and SSIM of 0.97, and simultaneously generate robust motion estimates that could benefit subsequent motion correction.
Keywords:
Low-dose Gated PET, Denoising, Motion Estimation, Motion Correction
Introduction

PET is a commonly used functional imaging modality. To obtain high-quality images, a small amount of radioactive tracer is administered to the patient, introducing radiation exposure to both patients and healthcare providers [1]. PET data acquisition typically takes several minutes. During this period, patient breathing inevitably introduces blurring in the lung and abdominal regions. Respiratory gating, facilitated by external motion monitoring devices, is typically used to reduce respiration-induced motion blurring. However, each gated image is generated from only a fraction of the detected events, leading to high image noise in each gate. To address the noise issue, previous works proposed motion correction approaches involving non-rigid image registration among gated images, utilizing the motion vectors to correct motion with all detected events so as to reduce image noise [2]. In radiation dose reduction applications, reducing the injection dose is the first choice, but doing so increases the image noise and results in low signal-to-noise ratio (SNR). When respiratory gating is performed on low-dose data, the image noise is further increased, potentially causing errors in motion vector estimation that subsequently affect the final motion correction results, as illustrated in Figure 1. To address this challenge, we aim to simultaneously tackle the image denoising and motion estimation problems in low-dose gated PET data.
Fig. 1: Illustration of phase-gated PET acquisition with 6 gates at both the 100% full-count and 1.5% count levels. The end-expiration gate with the least intra-gate motion (G4) is used as the reference gate. Each low-dose gated volume needs to be denoised and registered to the reference gated volume.
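Phase gating as described above assigns each detected event to one of 6 bins according to its fractional position within the breathing cycle. A minimal NumPy sketch of this binning is given below; the function name and the use of per-cycle trigger timestamps (e.g. from the Anzai belt) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def phase_gate(event_times, cycle_starts, n_gates=6):
    """Assign listmode event timestamps to respiratory phase gates.

    event_times: 1-D array of event timestamps (s).
    cycle_starts: sorted timestamps marking the start of each breathing
                  cycle (s), e.g. from an external respiratory monitor.
    Returns an integer gate index in [0, n_gates) for each event.
    """
    # Locate the breathing cycle that encloses each event.
    idx = np.searchsorted(cycle_starts, event_times, side="right") - 1
    idx = np.clip(idx, 0, len(cycle_starts) - 2)
    t0 = cycle_starts[idx]          # start of the enclosing cycle
    t1 = cycle_starts[idx + 1]      # start of the next cycle
    phase = (event_times - t0) / (t1 - t0)   # fractional phase in [0, 1)
    return np.clip((phase * n_gates).astype(int), 0, n_gates - 1)
```

Events from the same phase bin across all cycles are then reconstructed together into one gated volume.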
Previous works on denoising low-dose PET fall into two categories: conventional post-processing [3,4,5] and deep learning based post-processing [6,7,8,9]. Conventional post-processing techniques, such as Gaussian filtering, are the standard approach to reducing PET image noise, but they struggle to preserve local structure. More recently, the non-local mean filter [3] and the block-matching 4D filter [4] have been proposed to denoise low-dose PET while better preserving structural information. Deep learning based methods, such as the deep auto-context CNN [6], 3D cGAN [7], UNet [8], and GAN [9], were developed for recovering standard-dose PET from low-dose PET. Compared to conventional methods, these deep learning based methods achieved promising denoising performance on static low-dose PET. However, none of these previous studies addressed denoising and motion estimation in low-dose respiratory gated PET in a unified fashion.

In this work, we propose a Siamese adversarial network (SAN) with gate-to-gate (G2G) consistency learning to simultaneously denoise low-dose gated volumes and estimate the motion among the gates. We evaluated our method on a challenging low-dose gated PET dataset with only a 1.5% count level. Our experimental results demonstrate that the proposed method can effectively reduce noise while preserving structural information, and improves the accuracy of motion estimation.
Methods

Problem Formulation. Assuming a phase-gated PET exam generates 6 gates with gate 4 as the reference gate, we denote the high-dose PET (HDPET) and low-dose PET (LDPET) gated volumes as H_n, L_n ∈ R^{h×w×d}, with gate index n ∈ {1, 2, 3, 4, 5, 6} and volume size h × w × d. The transformation predicted between {L_4, L_n} is expected to differ from the transformation predicted between {H_4, H_n} due to the high noise level of LDPET. Given that the distribution of HDPET is unknown, our goal is to recover H_n from the degraded L_n. Previous methods attempted to solve the inverse problem by finding a generative model P_D parameterized by θ_D such that P_D(Σ_{n=1}^{6} L_n; θ_D) = H̄_nmc ≈ H_nmc, where H̄_nmc is the non-gated denoised volume with no motion correction (nmc). Since no motion estimation and corresponding motion correction components are considered, degradation in the final image can be expected. Therefore, we aim to tackle these issues by recovering the HDPET from the LDPET for each gate while simultaneously estimating the motion field between gates. Specifically, we seek a single-gate generative model P_D such that P_D(L_n; θ_D) = H̄_n ≈ H_n, where H̄_n is the recovered HDPET for gate n. Then, the motion transformation between the reference gate (assumed to be gate 4 here) and gate n is estimated as T̄_n = P_R(P_D(L_4; θ_D), P_D(L_n; θ_D); θ_R) ≈ T_n, where T̄_n is the predicted transformation from our motion estimator P_R. In this work, our goal is to obtain the optimal P_D and P_R for simultaneous denoising and motion estimation.

The overall pipeline of our method is illustrated in Figure 2. It consists of three major parts: 1) Siamese generative networks supervised by our structure recovery loss; 2) an unsupervised motion estimation network; and 3) gate-to-gate consistency training.
The Siamese generator G maps the target-gate LDPET (L_tgt) and the reference-gate LDPET (L_ref) to the HDPET space simultaneously, thus generating denoised HDPET gated volumes. The generator G is first optimized using the structure recovery loss, which measures the dissimilarity between prediction and ground truth, yielding high-quality denoised HDPET volumes. In the meantime, the motion estimation network R is pre-trained using the ground-truth HDPET gated volumes H and concatenated to the Siamese generative networks. By replacing the input of R with the synthetic HDPET volumes H̄ generated by G, the joint network enforces gate-to-gate consistency in the transformed synthetic HDPET for each target gate, providing additional supervision for training G. The details are as follows.

Fig. 2: Our two-stage training procedure consists of the pre-training of our motion estimator (R) and the Siamese adversarial training of our generator. Two shared-weight generators G learn the mapping from LDPET to HDPET, supervised by a structure recovery loss (L_SR = L_1 + L_SSIM + L_adv) and a transform consistency loss (L_G2G = L_1 + L_KL), respectively. The motion estimator R is pre-trained with the ground-truth HDPET and concatenated to the generator for end-to-end optimization. Network architecture details are listed in the supplementary.

Siamese Generative Network is illustrated in Figure 2. The Siamese generative network G, with an encoding and decoding architecture, is first supervised by an L1 loss, a structural similarity (SSIM) loss, and an adversarial loss to ensure noise reduction and structure recovery. Specifically, we use the L1 loss to ensure general appearance recovery and the L_SSIM loss to ensure fine-detailed structure recovery. The L1 loss enables noise suppression and SNR improvement, at the expense of reduced image sharpness. On the other hand, the L_SSIM loss encourages the image to have high contrast, sharpness, and resolution.
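The global SSIM statistic used in this loss can be sketched in a few lines of NumPy. The stabilizing constants `c1` and `c2` are not specified in the text, so small positive values are assumed here; this is a whole-volume version of the statistic, not the paper's implementation.

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Global SSIM between two volumes: a luminance term times a
    contrast/structure term. Returns 1.0 for identical inputs."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()   # cross-covariance sigma_xy
    return ((2 * mx * my + c1) / (mx**2 + my**2 + c1)
            * (2 * cov + c2) / (vx + vy + c2))
```

The corresponding loss term is then simply `1 - ssim_global(h, h_pred)`, which vanishes when the recovered volume matches the ground truth.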
Given L_tgt and L_ref, the target and reference LDPET gated volumes respectively, G takes the pair [L_tgt, L_ref] and channel-wise concatenates each volume with the anatomical prior CT (ρ) to predict H̄_tgt = G(L_tgt, ρ; θ_G) and H̄_ref = G(L_ref, ρ; θ_G) simultaneously. The L1 loss and the L_SSIM loss can be written as:

L_1 = Σ_i || H_i − H̄_i ||_1 , i ∈ {tgt, ref}   (1)

L_SSIM = Σ_i [1 − SSIM(H_i, H̄_i)] , i ∈ {tgt, ref}   (2a)

SSIM(x, y) = (2 m_x m_y + C_1) / (m_x^2 + m_y^2 + C_1) · (2 σ_xy + C_2) / (σ_x^2 + σ_y^2 + C_2)   (2b)

where [m_x, m_y] and [σ_x, σ_y] denote the mean and standard deviation of an image pair [x, y]. The cross-covariance of [x, y] is denoted as σ_xy. C_1 and C_2 are constant parameters. The adversarial loss from the discriminator D provides an indication of the discrepancy between prediction and ground truth as both G and D are progressively optimized. Thus, the adversarial loss is also added to minimize the perceptual difference between prediction and ground truth from a CNN perspective. We utilize the adversarial loss of the Wasserstein GAN with gradient penalty (WGAN-GP) to achieve stable adversarial training [10], formulated as:

L_adv = Σ_i E[D(H̄_i)] − E[D(H_i)] + λ_gp E[(||∇_{Ḧ_i} D(Ḧ_i)||_2 − 1)^2] , i ∈ {tgt, ref}   (3)

where Ḧ represents a linear combination of H̄ and H with a weight t uniformly sampled between 0 and 1, and λ_gp controls the gradient penalty level and is set to 3 here. The combination of these three loss functions forms our Structure Recovery (SR) loss:

L_SR = β_1 L_1 + β_2 L_SSIM + β_3 L_adv   (4)

where β_1, β_2, and β_3 are loss weights. In our experiments, we empirically set β_1 = 1, β_2 = 1, and β_3 = 0.

Motion Estimation Network R aims to predict the transformation between the target and reference gated volumes. Here, we use a probabilistic generative model [11] to predict the transformation, as illustrated in the left section of Figure 2.
Assuming H_ref and H_tgt are the volumes to be registered and the transformation between them is parameterized by a sampled velocity field V, R aims to find the most likely registration field by optimizing the posterior probability p(V | H_ref, H_tgt). Thus, the loss function for network R can be derived and written as:

L_R(H_ref, H_tgt) = (1/K) Σ_k || H_ref − T ∘ H_tgt ||_1 + KL[q_{θ_R}(V | H_ref, H_tgt) || p(V)]   (5)

where K is the number of samples in each training batch and T is the transformation function parameterized by V ∼ q_{θ_R}(V | H_ref, H_tgt). The first term minimizes the L1 distance between the reference volume H_ref and the warped target volume T ∘ H_tgt. The second term ensures the distribution similarity between the posterior and the prior of V. L_R is the transform consistency loss in Figure 2. During the inference stage, the predicted V is fed into the scaling-and-squaring layer [12] to integrate V over [0, 1], yielding T. Then, T and the target volume H_tgt are input into the spatial transform layer to generate the warped target volume T ∘ H_tgt. A detailed derivation is provided in the supplementary.

Gate-to-Gate Consistency Learning
The Siamese generator in the first part maps L to H̄ with the SR loss L_SR for individual gates. However, the appearance consistency constraint between gates is not yet utilized: gate-to-gate consistency should hold once the gated volumes are registered. Gate-to-gate consistency learning is achieved by feeding the synthetic pair of HDPET volumes [H̄_tgt, H̄_ref], generated by the Siamese generative network G, into the pre-trained motion estimation network R after concatenating the two networks. Therefore, the transformation prediction process of the joint network can be described as:

T̄ = R(H̄_ref, H̄_tgt; θ_R) = R(G(L_ref, ρ; θ_G), G(L_tgt, ρ; θ_G); θ_R)   (6)

Given the transformation T̄, we warp the synthetic H̄_tgt to obtain T̄ ∘ H̄_tgt. We aim to minimize the distance between T̄ ∘ H̄_tgt and the ground truth H_ref, such that the transformed target gated volume and the reference gated volume are consistent. Thus, the gate-to-gate transform consistency loss can be formulated as:

L_G2G = (1/K) Σ_k || H_ref − T̄ ∘ H̄_tgt ||_1 + KL[q_{θ_R}(V | H̄_ref, H̄_tgt) || p(V)]   (7)

The first term encourages gate-to-gate appearance consistency using an L1 norm, and the second term ensures the distribution similarity between the posterior and the prior of V. L_G2G provides additional supervision for optimizing G by utilizing the inter-gate relationship. It is the key element of our Siamese network design that enables us to randomly sample pairs of gated volumes, which augments the number of available training pairs for each subject to 30. The denoising and structural recovery from LDPET to HDPET therefore become more reliable.

Finally, our full loss function for optimizing G is L_tot = L_SR + L_G2G, which is trained in an adversarial manner: G and R try to minimize this loss collaboratively, while D tries to maximize it.
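The augmentation factor of 30 follows directly from sampling ordered (target, reference) pairs of distinct gates: 6 × 5 = 30. A one-line sketch (zero-based gate indices, names illustrative):

```python
from itertools import permutations

def gate_pairs(n_gates=6):
    # All ordered (target, reference) pairs of distinct gates.
    # With 6 gates this yields 6 * 5 = 30 training pairs per subject,
    # matching the augmentation factor quoted in the text.
    return list(permutations(range(n_gates), 2))
```

At training time, one such pair would be drawn at random per subject to form the Siamese input.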
To optimize the overall network, we update G, R, and D alternately: optimizing D with G and R fixed, then optimizing G with D and R fixed.

Experiments

We collected 29 pancreas 18F-FPDTBZ [13] PET/CT studies with respiratory gating facilitated by the Anzai system. The total acquisition time was 120 min for each study. We used phase gating to generate 6 gates for each study. To eliminate the mismatch between the attenuation correction (AC) map and gated PET, instead of using CT as the AC map, we utilized the maximum likelihood estimation of activity and attenuation (MLAA) [14] to generate an AC map for each gated volume, ensuring phase-matched attenuation correction; CT was used as the initial estimate for the MLAA iterations. The HDPET volumes were reconstructed with 100% of the listmode data, mimicking high radiation dose data. The LDPET volumes were reconstructed with 1.5% of the listmode data obtained via random sampling. Each dataset was reconstructed into a 400 × 400 × 109 volume with a voxel size of 2. × . × . mm. The central 200 × 200 × 109 voxels were kept to remove most voxels outside the human body contour and resized to 128 × 128 × 128. We pre-trained R using the ground truth HDPET. For the comparative study, we compared our results against the following algorithms: Gaussian filtering (GAU), non-local mean filtering (NLM) [3], block-matching 4D filtering (BM4D) [4], UNet [8,15], and cGAN [7].

Fig. 3: Sample HD and 1.5% LD PET slices with enlarged subregions using various denoising methods for two sample subjects. The corresponding PSNR and SSIM are indicated at the bottom of the images. A comparison of intensity profiles is also shown on the right. -/+G2G denotes without/with gate-to-gate consistency learning.
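The low-dose simulation and cropping steps described above can be sketched as follows; the function names and array conventions (events as rows, volumes as X × Y × Z) are assumptions for illustration, not the paper's reconstruction code.

```python
import numpy as np

def subsample_listmode(events, fraction=0.015, rng=None):
    """Randomly keep a fraction of listmode events (1.5% here) to mimic
    a low-dose acquisition. `events` is an (N, ...) array of events."""
    rng = np.random.default_rng() if rng is None else rng
    n_keep = int(round(len(events) * fraction))
    keep = rng.choice(len(events), size=n_keep, replace=False)
    return events[keep]

def central_crop(vol, out_xy=200):
    """Keep the central out_xy x out_xy in-plane voxels of an
    (X, Y, Z) volume, removing voxels outside the body contour."""
    x0 = (vol.shape[0] - out_xy) // 2
    y0 = (vol.shape[1] - out_xy) // 2
    return vol[x0:x0 + out_xy, y0:y0 + out_xy, :]
```

The subsampled events would then be fed to the same reconstruction pipeline as the full-count data before cropping and resizing.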
Results

The qualitative comparison of various denoising methods is shown in Figure 3. As we can observe in the figure, conventional post-processing methods, such as NLM [3] and BM4D [4], have difficulty with structural recovery when only 1.5% of the total counts is available. The high noise level also introduced additional artifacts, resulting in inferior performance compared to standard Gaussian filtering. In contrast, deep learning based methods achieved better performance in noise reduction and structural recovery.

Table 1 presents the quantitative comparison of the different methods on PET image denoising. Both PSNR and SSIM were evaluated for each gated volume (G_n), along with the averaged values reported in the last column. Among the methods, our SAN without G2G outperforms the previous deep learning based methods, and the addition of G2G learning, which utilizes the information across gates, further improved the performance. In parallel, Figure 4 illustrates a qualitative comparison of motion estimation based on the discussed denoising methods. As we can see, our proposed SAN+G2G yields the most consistent motion vectors between the estimated and ground truth motion vectors. The quantitative comparison of motion estimation among the different denoising methods is given in Table 2. As shown in the table, our SAN+G2G improved the motion estimation accuracy by 20% on average, achieving the lowest averaged MVED of 0.264 compared to the other studied methods. Using our proposed method, denoised gated LDPET volumes can be generated along with the corresponding motion vectors to the reference gate. We then registered all gated volumes of LDPET, HDPET, and LDPET with
SAN+G2G to the reference gate. As shown in the example in Figure 5, the proposed network is able to generate gated PET volumes with a reduced noise level and, by averaging all registered image volumes, a final motion-corrected image with reduced motion blurring from low-dose gated data.
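The final motion-corrected image is simply the mean of the gated volumes after each has been warped into the reference frame. A toy 1-D sketch of this averaging step (with generic warp callables standing in for the estimated transforms T̄_n; all names are illustrative):

```python
import numpy as np

def motion_corrected_average(gated_volumes, warps):
    """Average gated volumes after warping each gate to the reference.

    `warps` is a list of callables (one per gate) mapping a gate volume
    into the reference gate's frame (identity for the reference gate).
    """
    registered = [warp(vol) for vol, warp in zip(gated_volumes, warps)]
    return np.mean(registered, axis=0)
```

With accurate warps the average retains all counts without motion blur, whereas a naive average of unregistered gates smears the activity.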
Table 1: Quantitative comparison of denoising results using PSNR (dB) and SSIM. Columns report the metrics for gates G1–G6 and their average. Among conventional post-processing methods and deep learning based methods, the optimal results are marked in red.

Fig. 4: Qualitative motion estimation results from different denoising methods. Ground truth (green arrows) and predicted (magenta arrows) motion estimation vectors are overlaid on the denoised images.

Fig. 5: Illustration of motion-blurred images (left) and the average image of all gates registered to the reference frame (middle). The green arrows indicate where significant motion reduction is observed after applying the proposed SAN+G2G.
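The PSNR values reported in Table 1 and Figure 3 follow the standard definition; a minimal sketch is below, with the peak value taken as the reference-image maximum (the paper does not state its exact convention, so this choice is an assumption).

```python
import numpy as np

def psnr(reference, prediction, data_range=None):
    """Peak signal-to-noise ratio in dB; data_range defaults to the
    reference maximum, a common choice for PET intensity images."""
    if data_range is None:
        data_range = reference.max()
    mse = np.mean((reference - prediction) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)
```

Higher PSNR indicates lower mean squared error relative to the image's dynamic range.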
Table 2: Quantitative comparison of motion estimation results evaluated in terms of MVED for gates G1–G6. G4 is the reference gate. Optimal results are marked in red.

Conclusion

In this work, we propose a Siamese adversarial network with gate-to-gate consistency learning, a novel framework for simultaneous low-dose gated PET denoising and motion estimation. We first pre-train our motion estimation network on the ground truth HDPET and concatenate it to our Siamese adversarial network, enabling the gate-to-gate consistency learning that improves the denoising performance. The denoised low-dose gated volumes are then fed into the motion estimation network for robust motion estimation. In our framework, the Siamese input design allows us to efficiently augment the training data from each patient and thus better train generalizable denoising and motion estimation models. We demonstrated the feasibility of our method on the tasks of PET image denoising and motion estimation with promising performance.

The potential clinical impact of our work is two-fold. First, since high noise levels and motion are inevitable in chest and abdominal low-dose PET acquisitions, they affect the visualization of small pathological findings, such as lung/liver lesions. Our work is potentially useful for recovering these small objects from noise and correcting motion to improve the delineation of distorted objects. Second, the estimated motion can be incorporated into motion-compensated PET reconstruction frameworks toward motion-free low-dose PET reconstruction, which would also improve reconstruction quality by reducing motion artifacts. We will explore these directions in future work.

References
1. Strauss, K.J., Kaste, S.C.: The ALARA (as low as reasonably achievable) concept in pediatric interventional and fluoroscopic imaging: striving to keep radiation doses as low as possible during fluoroscopy of pediatric patients, a white paper executive summary. Radiology (3) (2006) 621–622
2. Catana, C.: Motion correction options in PET/MRI. In: Seminars in Nuclear Medicine. Volume 45, Elsevier (2015) 212–223
3. Dutta, J., Leahy, R.M., Li, Q.: Non-local means denoising of dynamic PET images. PLoS ONE (12) (2013) e81390
4. Maggioni, M., Katkovnik, V., Egiazarian, K., Foi, A.: Nonlocal transform-domain filter for volumetric data denoising and reconstruction. IEEE Transactions on Image Processing (1) (2012) 119–133
5. Mejia, J., Mederos, B., Mollineda, R.A., Maynez, L.O.: Noise reduction in small animal PET images using a variational non-convex functional. IEEE Transactions on Nuclear Science (5) (2016) 2577–2585
6. Xiang, L., Qiao, Y., Nie, D., An, L., Lin, W., Wang, Q., Shen, D.: Deep auto-context convolutional neural networks for standard-dose PET image estimation from low-dose PET/MRI. Neurocomputing (2017) 406–416
7. Wang, Y., Yu, B., Wang, L., Zu, C., Lalush, D.S., Lin, W., Wu, X., Zhou, J., Shen, D., Zhou, L.: 3D conditional generative adversarial networks for high-quality PET image estimation at low dose. NeuroImage (2018) 550–562
8. Lu, W., Onofrey, J.A., Lu, Y., Shi, L., Ma, T., Liu, Y., Liu, C.: An investigation of quantitative accuracy for deep learning based denoising in oncological PET. Physics in Medicine & Biology (16) (2019) 165019
9. Kaplan, S., Zhu, Y.M.: Full-dose PET image estimation from low-dose PET image using deep learning: a pilot study. Journal of Digital Imaging (5) (2019) 773–778
10. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
11. Dalca, A.V., Balakrishnan, G., Guttag, J., Sabuncu, M.R.: Unsupervised learning for fast probabilistic diffeomorphic registration. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2018) 729–738
12. Arsigny, V., Commowick, O., Pennec, X., Ayache, N.: A log-Euclidean framework for statistics on diffeomorphisms. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2006) 924–931
13. Normandin, M.D., Petersen, K.F., Ding, Y.S., Lin, S.F., Naik, S., Fowles, K., Skovronsky, D.M., Herold, K.C., McCarthy, T.J., Calle, R.A., et al.: In vivo imaging of endogenous pancreatic β-cell mass in healthy and type 1 diabetic subjects using 18F-fluoropropyl-dihydrotetrabenazine and PET. Journal of Nuclear Medicine (6) (2012) 908–916
14. Rezaei, A., Michel, C., Casey, M.E., Nuyts, J.: Simultaneous reconstruction of the activity image and registration of the CT image in TOF-PET. Physics in Medicine & Biology (4) (2016) 1852
15. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2015) 234–241

Supplemental Materials
G, R, and D's network architectures are summarized in Table 1. Both G and R use a UNet backbone structure.

To avoid overfitting, we deployed two augmentation techniques: 1) we applied identical random cropping and 90-degree rotations along the x, y, and z axes to the Siamese input pair, and 2) we randomly chose 2 gates from the 6 gates of each patient as the Siamese input, which resulted in 30 training pairs per patient. The Adam solver was used to optimize the loss functions, with a momentum of 0.99 and a learning rate of 0.0001. The network was trained on a Quadro RTX 8000 GPU with 48 GB of memory.

Table 1: Configuration details of the generator (G), discriminator (D), and motion estimator (R) in our Siamese Adversarial Network. The input size is denoted as (batch-size × width × height × depth × channel). Operations: LReLU: Leaky-ReLU; BN: Batch Normalization; Concat: concatenation along the channel axis; Upsample2: ×2 upsampling; | denotes a skip connection.
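The first augmentation requires that the random crop and rotation be *identical* for both members of a Siamese pair, so that the two inputs stay spatially aligned. A NumPy sketch is below; the crop size and axis choices are illustrative, not the paper's exact settings.

```python
import numpy as np

def augment_pair(vol_a, vol_b, rng=None, crop=8):
    """Apply the same random crop and axis-aligned 90-degree rotation to
    both members of a Siamese input pair (crop size is illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    # identical random cubic crop
    starts = [rng.integers(0, s - crop + 1) for s in vol_a.shape]
    sl = tuple(slice(s, s + crop) for s in starts)
    a, b = vol_a[sl], vol_b[sl]
    # identical random 90-degree rotation about a random axis pair
    k = int(rng.integers(0, 4))
    axes = [(0, 1), (0, 2), (1, 2)][int(rng.integers(0, 3))]
    return np.rot90(a, k, axes), np.rot90(b, k, axes)
```

Because the same random draws are reused for both volumes, any voxel correspondence between the pair is preserved after augmentation.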
G:
Inputs: (n×128×128×128×1)
LReLU(BN(Conv3D(1,8,3,1,1)))
LReLU(BN(Conv3D(8,16,3,2,1)))
LReLU(BN(Conv3D(16,32,3,2,1)))
LReLU(BN(Conv3D(32,32,3,2,1)))
ReLU(BN(Conv3D(32,64,3,1,1)))
DeConv3D(64,64,2,2,0)
Concat()
ReLU(BN(Conv3D(64+32,64,3,1,1)))
DeConv3D(64,32,2,2,0)
Concat()
ReLU(BN(Conv3D(32+16,32,3,1,1)))
DeConv3D(32,16,2,2,0)
Concat()
ReLU(BN(Conv3D(16+8,16,3,1,1)))
ReLU(BN(Conv3D(16,8,3,1,1)))
ReLU(BN(Conv3D(8,1,1,1,0)))

D:
Inputs: (n×128×128×128×1)
LReLU(Conv3D(1,16,3,2,1))
LReLU(BN(Conv3D(16,32,3,2,1)))
LReLU(BN(Conv3D(32,64,3,2,1)))
LReLU(BN(Conv3D(64,128,3,2,1)))
LReLU(BN(Conv3D(128,256,3,2,1)))
FC(Flatten())

R:
Inputs: (n×128×128×128×1)
LReLU(Conv3D(1,16,3,2,1))
LReLU(Conv3D(16,32,3,2,1))
LReLU(Conv3D(32,32,3,2,1))
LReLU(Conv3D(32,32,3,2,1))
Upsample2(LReLU(Conv3D(32,32,3,1,1)))
Concat()
Upsample2(LReLU(Conv3D(32+32,32,3,1,1)))
Concat()
Upsample2(LReLU(Conv3D(32+32,32,3,1,1)))
Concat()
LReLU(Conv3D(32+32,16,3,1,1))
Concat()
Conv3D(16+16,1,1,1,0)
Denoting H_ref and H_tgt as the two volumes to be registered and V as a sampled stationary velocity field that parameterizes a transformation, the goal is to compute the posterior probability p(V | H_ref, H_tgt) so that we can obtain the most likely registration field for a volume pair [H_ref, H_tgt]. R generates μ_{V|H_ref,H_tgt} and Σ_{V|H_ref,H_tgt} for sampling V, which specifies a diffeomorphism. Assuming the prior probability p(V) and the modeled posterior q_{θ_R} of V are both multivariate normal distributions, we have:

p(V) = N(V; 0, Σ_V)   (1)

q_{θ_R} = N(V; μ_{V|H_ref,H_tgt}, Σ_{V|H_ref,H_tgt})   (2)

Thus, the KL divergence can be computed as:

min_{θ_R} KL[q_{θ_R}(V | H_ref, H_tgt) || p(V | H_ref, H_tgt)]
= min_{θ_R} E_q[log q_{θ_R}(V | H_ref, H_tgt) − log p(V | H_ref, H_tgt)]
= min_{θ_R} E_q[log q_{θ_R}(V | H_ref, H_tgt) − log (p(V, H_ref, H_tgt) / p(H_ref, H_tgt))]
= min_{θ_R} E_q[log q_{θ_R}(V | H_ref, H_tgt) − log p(V)] − E_q[log p(H_ref | V, H_tgt)]
= min_{θ_R} KL[q_{θ_R}(V | H_ref, H_tgt) || p(V)] − E_q[log p(H_ref | V, H_tgt)]   (3)

Then, we can train R by optimizing the variational lower bound from the above equation. The loss function can thus be rewritten as:

L_R(H_ref, H_tgt) = − E_q[log p(H_ref | V, H_tgt)] + KL[q_{θ_R}(V | H_ref, H_tgt) || p(V)]
= (1/K) Σ_k || H_ref − T ∘ H_tgt ||_1 + KL[q_{θ_R}(V | H_ref, H_tgt) || p(V)]   (4)

where K is the number of samples in each training batch and T is the transformation function parameterized by V ∼ q_{θ_R}(V | H_ref, H_tgt) = N(V; μ_{V|H_ref,H_tgt}, Σ_{V|H_ref,H_tgt}). The first term minimizes the L1 distance between the reference volume H_ref and the warped target volume T ∘ H_tgt. The second term ensures the distribution similarity between the posterior and the prior of V.

Additional motion estimation results are shown in Figure 1.
As we can see from the comparison, our SAN+G2G produces motion estimates much closer to the ground truth motion. Additional denoising results for each gate, as well as the corresponding motion estimation results, are shown in Figure 2. The average images of all gates, with and without applying the corresponding transformation T that deforms each gate to align it with the reference gate, are also provided.
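Under the Gaussian assumptions of Eqs. (1)-(2), the KL term that appears throughout the derivation has a closed form. A minimal NumPy sketch for diagonal covariances (the diagonal simplification is an assumption here; the function name is illustrative):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, var_p):
    """KL[q || p] for the diagonal Gaussians q = N(mu_q, diag(var_q))
    and the zero-mean prior p = N(0, diag(var_p)).

    Per-dimension closed form:
        0.5 * ( log(var_p / var_q) + (var_q + mu_q^2) / var_p - 1 )
    summed over dimensions.
    """
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + mu_q ** 2) / var_p - 1.0)
```

This is the quantity penalized by the second term of Eq. (4): it vanishes only when the predicted posterior matches the zero-mean prior.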