[PDF] Sampling possible reconstructions of undersampled acquisitions in MR imaging

Abstract

Undersampling the k-space during MR acquisitions saves time, however results in an ill-posed inversion problem, leading to an infinite set of images as possible solutions. Traditionally, this is tackled as a reconstruction problem by searching for a single "best" image out of this solution set according to some chosen regularization or prior. This approach, however, misses the possibility of other solutions and hence ignores the uncertainty in the inversion process. In this paper, we propose a method that instead returns multiple images which are possible under the acquisition model and the chosen prior. To this end, we introduce a low dimensional latent space and model the posterior distribution of the latent vectors given the acquisition data in k-space, from which we can sample in the latent space and obtain the corresponding images. We use a variational autoencoder for the latent model and the Metropolis adjusted Langevin algorithm for the sampling. This approach allows us to obtain multiple possible images and capture the uncertainty in the inversion process under the used prior. We evaluate our method on images from the Human Connectome Project dataset as well as in-house measured multi-coil images and compare to two different methods. The results indicate that the proposed method is capable of producing images that match the ground truth in regions where acquired k-space data is informative and construct different possible reconstructions, which show realistic structural variations, in regions where acquired k-space data is not informative. Keywords: Magnetic Resonance image reconstruction, uncertainty estimation, inverse problems, sampling, MCMC, deep learning, unsupervised learning.

Full PDF

SSampling possible reconstructions of undersampled acquisitions inMR imaging

Kerem C. Tezcan, Christian F. Baumgartner and Ender Konukoglu ∗† October 2, 2020

Abstract

Undersampling the k-space during MR acquisitionssaves time, however results in an ill-posed inversionproblem, leading to an inﬁnite set of images as pos-sible solutions. Traditionally, this is tackled as a re-construction problem by searching for a single ”best”image out of this solution set according to some cho-sen regularization or prior. This approach, however,misses the possibility of other solutions and henceignores the uncertainty in the inversion process. Inthis paper, we propose a method that instead returnsmultiple images which are possible under the acqui-sition model and the chosen prior. To this end, weintroduce a low dimensional latent space and modelthe posterior distribution of the latent vectors giventhe acquisition data in k-space, from which we cansample in the latent space and obtain the correspond-ing images. We use a variational autoencoder for thelatent model and the Metropolis adjusted Langevinalgorithm for the sampling. This approach allows usto obtain multiple possible images and capture theuncertainty in the inversion process under the usedprior. We evaluate our method on images from theHuman Connectome Project dataset as well as in-house measured multi-coil images and compare to twodiﬀerent methods. The results indicate that the pro-posed method is capable of producing images thatmatch the ground truth in regions where acquired k-space data is informative and construct diﬀerent pos-sible reconstructions, which show realistic structuralvariations, in regions where acquired k-space data isnot informative.

Keywords:

Magnetic Resonance image reconstruc-tion, uncertainty estimation, inverse problems, sam-pling, MCMC, deep learning, unsupervised learning. ∗ This work was supported by the Swiss National ScienceFoundation under Grant 205321 173016. † All authors are with the Computer Vision Lab, ETHZrich, Switzerland. Emails are { tezcan, baumgartner,kender } @vision.ee.ethz.ch Undersampling the k-space in MR imaging reducesscan time by speeding up acquisition, allowing ahigher throughput as well as higher comfort for pa-tients. However, contrary to a fully acquired k-space, where an inverse Fourier transform is mostlyenough to uniquely determine the underlying imageup to measurement noise, the undersampled acqui-sition leads to an underdetermined system of equa-tions. Mathematically, this means that there are in-ﬁnitely many images that match the acquired portionof the k-space data and it is impossible to know whichone is the underlying image in a general setting.Traditionally, this problem of inﬁnitely many solu-tions has been tackled as a deterministic reconstruc-tion problem, where diﬀerent methods were proposedto choose a single ”best” image out of the set ofpossible images. This was achieved by introducingsome prior information in addition to the data con-sistency term through deﬁning a regularization termthat implicitly prefers a solution according to its cer-tain properties, such as smoothness with Tikhonovregularization [1], sparsity of its gradients with to-tal variation regularization [2] or how well it matcheswith some lower resolution version of the image [3].This converts the problem into a well-posed regular-ized inverse problem and allows to obtain a singlesolution as the reconstructed image. Another way ofseeing this approach is from the Bayesian framework,where the data consistency and regularization termscorrespond to the data likelihood and prior terms,respectively. Here, again, a suitable prior along withthe data likelihood allows to write the correspondingposterior probability which then can be maximizedto obtain the maximum-a-posteriori (MAP) image asthe single “best” reconstruction [4].By providing a single reconstructed image as theiroutput, these approaches miss the uncertainty in thesolution that arises due to the missing portion of thek-space data. The reconstructed image is formedusing information from the measured data and the1 a r X i v : . [ ee ss . I V ] S e p rior. Measured data matches the underlying trueimage up to the measurement noise while the priorcomplements the unmeasured information. However,it is important to remember that the prior is indepen-dent of the underlying true image and hence, the im-age that best satisﬁes the prior need not be the sameas the true image. There is an inherent ambiguity.When a single reconstructed image is output, infor-mation coming from the prior is treated as deﬁnite.This treatment implicitly sacriﬁces the opportunityto reveal the inherent ambiguity in the reconstructedimage.An alternative approach, which we will pursue inthis work is producing multiple images as solutionsto the ill-posed inverse problem, where the imagesmatch the measured data while being highly likelyaccording to the prior at the same time. Such anapproach is capable of capturing the uncertainty inthe inversion due to the missing data. The idea isthat in this case the information from the prior is nottreated as deﬁnite, but rather possibility of diﬀerentimages is explored under the speciﬁc prior used.As the proposed approach provides multiple im-ages, clinical, research or further analysis tasks, suchas segmentation, can be applied separately on eachimage to propagate the uncertainty to any giventask’s output. Alternatively the uncertainty can bequantiﬁed at the image level and passed on the fol-lowing tasks, such as in the form of a mean imagealong with a standard deviation map. Inspecting theimages or the quantiﬁed uncertainty for diﬀerent re-gions can also be indicative of which parts of the im-ages might be more prone to diﬀer from the underly-ing true image. To the best of our knowledge, this isthe ﬁrst time such an approach is being proposed forundersampled MRI acquisition.In the recent years, though not directly related tothe uncertainty due to missing data in undersam-pled MRI, there has been research eﬀorts to quan-tify uncertainty in diﬀerent aspects of the image re-construction problem, especially with the deep learn-ing based models [5]. One such aspect is the epis-temic or model uncertainty, i.e. the uncertainty inthe mapping learned by the neural network used inthe reconstruction, which can be obtained using ap-proaches such as drop-out [6]. However, these ap-proaches quantify the uncertainty due to the ambigu-ity in the network parameters and do not capture theuncertainty due to the missing k-space data. Epis-temic uncertainty tends to zero as the training setsize goes to inﬁnity [7], however, increasing trainingsamples cannot be expected to diminish the ambigu-ity due to missing k-space data. Another aspect is theso called aleatoric uncertainty due to noise or other ambiguities in the images or the labels, which is morerelevant to reconstruction from undersampled MRI.Quantifying aleatoric uncertainty can be approachedusing, for instance, heteroscedastic models. Thesemodels predict a diﬀerent variance value for each out-put pixel and hence can in principle learn to predicthigh variance for pixels where the model expects tohave incorrect mean predictions. There is one impor-tant limitation to these models. They generate onlysecond order statistics of pixel-wise marginal distri-butions. Such a model may be useful in predictingpixels where predictions may be inaccurate, however,it cannot propose diﬀerent possible reconstructions.For this, non-Gaussian pixel-wise distributions andnon-trivial statistical dependencies across diﬀerentpixels are crucial to capture. Hence these models arelimited in the information they can provide [8]. Beingable to generate samples makes uncertainty propaga-tion trivial for any following task while only generat-ing second order statistics makes this non-trivial.One work that has similar aims as this paper isby Adler et al. [9]. In this work, the authors traina modiﬁed conditional Wasserstein generative adver-sarial network (cWGAN) that generates high dosecounterparts of CT images from low-dose measure-ments. However this approach does not explicitlymodel the known physics or the measurement noise ofthe imaging procedure, lacking an explicit data like-lihood term. As such, there are no guarantees thatthe samples will be from the true posterior. Insteadof using an explicit physics-based imaging model, thecWGAN requires supervised training with pairs ofundersampled-fully sampled images, hence a separatecWGAN has to be trained for all diﬀerent undersam-pling schemes and factors for best performance. Fur-thermore, though the authors modify their discrim-inator to reduce the mode collapse associated withthe WGAN, they do not completely avoid it, lead-ing to a possibly poorer implicit prior. Lastly, themodel aims to minimize the Wasserstein distance be-tween the predicted posterior and the true posterior.However, it is often impossible to have varying sam-ples from the true posterior, in reality only one fullymeasured image is available for each low-dose mea-surement image. Therefore, the available trainingsamples may not be able to support an accurate pos-terior approximation and minimizing the Wassersteindistance may converge to a degenerate version of thetrue posterior.Another work in line with our purposes is by Pede-monte et al. [10], where the authors use HamiltonianMonte Carlo (HMC) to sample from the posterior ofemission rates given the photon counts for positronemission tomography. However, they use a uniform,2.e. a non-informative prior for the emission rates,reducing the strength of the model heavily in con-trast to using more informative, data-driven priors.Furthermore the authors use a Riemannian HMCscheme [11] to make sampling eﬃcient despite thehigh dimensional posterior, where the use of the Rie-mannian metric speeds up the sampling by taking thegeometry of the space of probability distributions intoaccount. However this does not directly take into ac-count the geometry of the space of the emission rates.Here we identify two ideas that motivate us inproposing a new method that overcomes the limita-tions of the works mentioned above. The ﬁrst oneis regarding the geometry of the space of MR im-ages. We make the assumption that the MR imagesactually live around a low dimensional subspace inthe high dimensional image space and that we canlearn a mapping from a low dimensional latent spaceto this subspace. This assumption has been demon-strated empirically in our prior work [4]. Then ”walk-ing around” and sampling in the subspace of MR im-ages can be simply implemented as walking aroundand taking samples in the latent space. Secondly,deep learning based data-driven priors have showngreat value in inverse problems in general in the re-cent years, as well as speciﬁcally in MR image recon-struction [4,12,13]. Such methodology allows learninga powerful mapping between the latent space and theimage space, facilitating the sampling.Embodying the ideas mentioned above, we proposea novel method based on a latent Bayesian model andMarkov chain Monte Carlo (MCMC) sampling thataddresses the issue of uncertainty in the inversion pro-cess. To this end, we use a variational autoencoder(VAE) [14, 15] trained on fully sampled MR imagesas our prior and utilize its lower dimensional latentspace to do the sampling instead of sampling in thehigh dimensional image space. For the target dis-tribution of the MCMC, we use the posterior of thelatent vectors given the measured k-space data. Weobtain this distribution by marginalizing over the im-ages. We use the Metropolis adjusted Langevin Algo-rithm (MALA) [11] as the MCMC method due to itseﬀectiveness in high dimensional spaces. The latentsamples coming from MALA are then guaranteed tobe from the posterior and can be transformed to im-ages using the decoder of the VAE. Although we usea VAE and MALA in our implementation, the frame-work is generic and can be used with other genera-tive models as well as sampling schemes. We evaluateour method with data from the Human ConnectomeProject (HCP) [16] as well as in-house measured im-ages [4] for changing settings of undersampling ratiosand measurement noise levels and compare it to two other methods. We are interested in obtaining samples from the pos-terior distribution p ( x | y ) of images x ∈ C N given theobserved undersampled noisy k-space data y ∈ C Mc with c coils, (M ≤ N). We model the acquisition as y = Ex + η , where E ∈ C Mc × N is the extended MRencoding operation and η is complex Gaussian noisein the k-space with η ∼ N (0 , Σ ns ) with Σ ns as thenoise covariance matrix. Hence the data likelihoodterm is given as p ( y | x ) = N ( y ; Ex, σ ns ). The ex-tended encoding operation comprises of the usual coilsensitivity encoding [17], Fourier transform and un-dersampling operations, and additionally of a scalingfactor, a padding operator, an operation for combin-ing the phase with the magnitude image and an oper-ator for modeling the bias ﬁeld [18, 19] in the image,which we explain further in Section 2.5.Unfortunately, it is diﬃcult to draw samples di-rectly from their posterior p ( x | y ) since the x ’s arevery high dimensional. This renders simple sam-pling methods such as rejection sampling or vanillaMCMC [20] very ineﬃcient since they would needtoo many samples to adequately travel around in thespace to generate a good representation of the poste-rior. Moreover, the assumption that the high prob-ability regions in the image space form a lower di-mensional subspace would drastically lower the ac-ceptance ratio for the samples. For instance, if thesampling is done in the image space using MCMC,the random walk will try to move in dimensions thatwill take it out of this lower dimensional subspace andsampling will become very ineﬃcient. On the otherhand, if we have a latent space connected to the lowerdimensional subspace with a decoder, then we can im-plement the random walk in the latent space, and theimages decoded from the latent samples will alwaysbe around this lower dimensional subspace.The proposed model here samples directly in thelower dimensional latent space to address the diﬃcul-ties mentioned above. To this end we use a VAE asthe latent space model. This allows us to ﬁrst samplelatent vectors and then use the decoder of the VAEto obtain images from these.In the following we describe the method in moredetail. Firstly, we need a prior term for the MR imageswith which we can evaluate the probability of agiven image and also diﬀerentiate. To this end, we3se a VAE [14, 15] trained on fully sampled MRimages. The trained VAE consists of an encoder q ( z | x ) = N ( µ z ( x ) , Σ z ( x )) and a decoder p ( x | z ) = N ( µ x ( z ) , Σ x ) parameterized by neural networks with z ∈ R D (D << N) as latent vectors distributed accord-ing to a Gaussian prior p ( z ) = N ( µ pr , Σ pr ). We dropthe x and z dependencies in the rest of the text un-less necessary. Here we use a diagonal non-isotropiccovariance matrix for Σ z , an isotropic diagonal ma-trix for Σ x and a block diagonal for the Σ pr . TheVAE is trained to maximize the evidence lower bound(ELBO) which approximates the log likelihood p ( x ).For the prior p ( z ), we empirically estimate the pa-rameters µ pr and Σ pr from training data [21], whichis diﬀerent than the case in the vanilla VAE, whichwe explain in Section 2.4.Now, given the VAE, we can formalize our aim asto sample from the posterior of latent variables giventhe undersampled k-space data, namely p ( z | y ). Bydoing so we obtain latent samples that match themeasured k-space data and thus, the images associ-ated with these latent samples will match the datatoo. However, before we describe how we obtain thisdistribution, we ﬁrst introduce the sampling proce-dure.Sampling from p ( z | y ) can be implemented usingdiﬀerent methods. In this work, we use the Metropo-lis adjusted Langevin algorithm (MALA) [11].MALA is a variant of Markov chain Monte Carlo,consisting of a random walk given by Langevindynamics and an acceptance scheme following theMetropolis-Hastings algorithm. The random walk forMALA with the target distribution p ( z | y ) is writtenasˆ z t +1 = z t + τ ∇ z log p ( z | y ) | z = z t + √ τ ζ, ζ ∼ N (0 , τ . As can be observed, the update stepfor the random walk is composed of two terms. Thesecond term models the randomness with a Gaus-sian distributed variable ζ , aiming to discover thespace equally in all directions. As the latent spacecan be large, only sampling with the random partcan be ineﬃcient for sampling from p ( z | y ). Here, theﬁrst term τ ∇ z log p ( z | y ) comes to aid by pulling thesteps towards the high probability regions and pre-venting the random walk to move far away from suchregions. However, the discrete nature of the walkrequires a Metropolis-Hastings correction to be ap-plied to ensure convergence. This means a sampleis accepted as z t +1 = ˆ z t +1 with log probability α =min (cid:110) , log (cid:104) p (ˆ z t +1 | y ) q ( z t | ˆ z t +1 ) p ( z t ) q (ˆ z t +1 | z t ) (cid:105)(cid:111) , otherwise z t +1 = z t .The proposal distribution for MALA is given as q ( z (cid:48) | z ) ∝ exp (cid:0) − τ (cid:107) z (cid:48) − z − τ ∇ log p ( z | y ) (cid:107) (cid:1) . Further- Figure 1: Illustration of sampling in the latent space.Left side shows the latent space equipped with a priorshown by the green color. The contours show theregions where z values result in high data likelihoodvalues for the measured data. The random walk (redline) samples from the product of these two, i.e. theposterior z t ∼ p ( z | y ) ∝ ( y | z ) p ( z ). Right side showsthe low dimensional subspace around which the MRimages reside. Each sample in the Z space correspondto a distribution of images in the subspace, s.t. animage can be taken using the the mean of decoder ofthe VAE as x t = µ x ( z t ).more, in order to avoid a long burn-in period, we caninitialize the chain close to the mode of the posteriorinstead of starting with a randomly chosen z . Thiscan be achieved by encoding the MAP image into thelatent space and using its mean as z .Once the posterior samples { z t } are obtained, wethen take samples from the posterior of the imagesas x t = µ t , i.e. the mean of the decoder p ( x | z ) when z = z t . Notice that this procedure is quite similar toancestral sampling due to the hierarchical structureof the model, however, without the sampling step onthe image level. Now these image samples can bepassed on to further tasks or used to calculate empir-ical statistics such as pixel-wise mean and variance. p ( z | y ) The main component required for the random walkis the unnormalized posterior distribution of z , i.e. p ( y | z ) p ( z ) ∝ p ( z | y ), where we do not need the nor-malization constant p ( y ) since it does not appear inthe derivative nor the acceptance terms. To con-struct it, we use the trained VAE model with its prior p ( z ) = N ( µ pr , Σ pr ). We write the p ( y | z ) term as amarginalization over the images as p ( y | z ) = (cid:90) p ( y, x | z ) dx = (cid:90) p ( y | x ) p ( x | z ) dx, (2)where we use the decoder of the VAE as the p ( x | z ) and the conditional independence assumption p ( y | x, z ) = p ( y | x ). This expression can be interpretedas two terms glued together with the images func-tioning as the intermediate variables, connecting the4atent space with the k-space. This integral can beevaluated analytically to yield another Gaussian dis-tribution [22]. After isolating the terms constant with z and taking the logarithm, this distribution is givenaslog p ( y | z ) = µ Hx Σ − x (Σ − x + E H Σ − ns E ) − Σ − x µ x + 2Re (cid:8) y H Σ − ns E (Σ − x + E H Σ − ns E ) − Σ − x µ x (cid:9) (3) − µ Hx Σ − x µ x + C. where H denotes the conjugate transpose and C issome constant w.r.t. z . We provide the details of thisderivation in the Appendix due to space restrictions.For the random walk we also need the log gra-dient of the target distribution, which we write as ∇ z log p ( z | y ) = ∇ z log p ( y | z ) + ∇ z log p ( z ). The ∇ z log p ( y | z ) term can be easily obtained by auto-matic diﬀerentiation since outputs of the decoder µ x and Σ x are modeled as neural networks and, thus,diﬀerentiable w.r.t z , given that we can implement p ( y | z ) in a software package that allows such dif-ferentiation. Similarly, the prior ∇ z log p ( z ) term isalso straightforward and can be derived analytically.The more challenging part is the matrix inversion(Σ − x + E H Σ − ns E ) − in Eq. 3, which we investigatein the next section.Finally, looking at the two terms in the posterior p ( z | y ) ∝ p ( y | z ) p ( z ) reveals more insights regardingthe method. The ﬁrst term drives the chain to re-gions where the z t values, when decoded as p ( x | z t ),lead to images which satisfy the data likelihood term p ( y | x ) for x ’s coming form the p ( x | z t ). On the otherhand, the second term p ( z ) tries to pull the chaintowards the middle of the empirical Gaussian in thelatent space, discouraging the chain to move awayfrom the meaningful regions of the latent space. Assuch, a random walk in the latent space with this tar-get distribution explores the areas which satisfy thedata likelihood and the prior terms simultaneously.One natural alternative to the proposed method isto use the approximation posterior q ( z | x ) modeled inVAE directly to do the sampling around a MAP esti-mate x MAP , i.e. taking latent samples around theconditional distribution z t ∼ q ( z | x MAP ) and thenagain decoding these to the image space. This ap-proach is fundamentally diﬀerent than the proposedmethod in that it takes the MAP image as the ”truereconstruction” and samples only locally around it.Though these locally sampled images will still be inthe subspace of MR images and show some structuralvariation, they do not explore the possibility of diﬀer-ent images being the true underlying image. Hence the local sampling method is inherently very limitedand cannot in general identify the regions where thereconstruction has failed. In contrast, the proposedmethod can ”globally” explore the latent space aslong as the data likelihood is satisﬁed. Furthermore, q ( z | x ) is only an approximate posterior distributionfor a given p ( x | z ) and p ( z ) while we extract samplesfrom the exact posterior. We compare the two meth-ods also experimentally and show results for the localsampling in Section 4 as well. For the sampling procedure, we need to evaluate theterms in Eq. 3 at each iteration. This means theinverse of (Σ − x + E H Σ − ns E ) has to be recomputed ateach iteration, if the Σ x term depends on the z value.In this work we take Σ x to be constant, however,even then, inverting the matrix once and keeping itin memory is not an option, since it is a very big( N c × N c ) matrix. Approximating it as a diagonalmatrix is also not an option as this would result inthe loss of the aliasing information kept in the oﬀ-diagonals of the E H Σ − ns E term.Instead we propose to use an iterative matrix inver-sion that can also be applied when Σ x changes with z . To this end, we write the common term in Eq. 3in an approximation as γ ∗ = min γ || (Σ − x + E H Σ − ns E ) γ − Σ − x µ x || . (4)We then solve this inversion as an optimizationproblem using conjugate gradients (CG) and obtain γ ∗ , which we plug-in to Eq. 3 to yieldlog p ( y | z ) = µ Hx Σ − x γ ∗ + 2Re (cid:8) y H Σ − ns Eγ ∗ (cid:9) − µ Hx Σ − x µ x , (5)where we dropped the constant C.Though we use CG here, other gradient basedmethods can be used as well, since the gradientsare well deﬁned. Furthermore fast Fourier transform(FFT) can be used in the operations E and E H , re-voking the need to write the matrix explicitly andspeeding up computations.The key advantage here is that when the num-ber of iterations ( N γ ) for the CG is ﬁxed, the wholeinversion optimization also becomes a ﬁxed oper-ation, where the related vectors and matrices areadded or multiplied with each other with ﬁxed co-eﬃcients. Moreover, since automated diﬀerentiationis deﬁned for all these operations seperately, the gra-dient through the whole inversion operation can also5e taken using automated diﬀerentiation. This al-lows us to take the gradients of p ( y | z ) according to z through this ﬁxed optimization. Hence, before start-ing the Markov Chain we select the parameter N γ forwhich we obtain a small L error in Eq. 4 and keepthis throughout the sampling. We observe that theerror stays small throughout the sampling process forthe chosen parameters for diﬀerent µ x values. In the following we introduce the modiﬁed VAEmodel we use. Here we apply two modiﬁcations tothe vanilla VAE model [14]. Firstly, the VAE is fullyconvolutional and we use a 2D latent space, i.e. L x L x D dimensions with two spatial and one channeldimension, eﬀectively a latent image with D chan-nels. Each spatial position in such a latent imagehas a receptive ﬁeld on the image when traced backand the architecture is designed using strided convo-lutions such that the receptive ﬁelds have only min-imal overlap. This allows us to adhere to the inde-pendence assumption of the latent pixels reﬂected inthe model by the use of a diagonal covariance matrixfor q ( z | x ). However, in reality, contents in receptiveﬁelds corresponding to diﬀerent spatial locations in alatent image are not entirely independent from eachother, as there are global correlations in the image.To be able to model these, we introduce an empiricalprior as p ( z ) = N ( µ pr , Σ pr ), similar to [21].We obtain the parameters of N ( µ pr , Σ pr ) empir-ically by sampling T samples z i ∼ q ( z | x i ) from Tdiﬀerent training images x i , after the VAE has beentrained using a unit Gaussian prior. The mean µ pr then is calculated as the mean of these samples. Theestimation of a full covariance matrix is, however, dif-ﬁcult due to i) its size and ii) large number of sam-ples required for the estimation. Instead we apply aKolmogorov-Smirnov test [23] against the unit Gaus-sian seperately for each latent channel to ﬁnd thechannels that are the least unit Gaussian in termsof the p-values, i.e. approximately the most infor-mative. Then we form a combined block diagonalcovariance matrix, where we calculate the full covari-ance matrix for the K most informative channels (ofsize KL L x KL L ) and for the rest of the channelswe assume they are independent from each other andcalculate seperately for each channel only the spatialcovariance matrix (of size L L x L L ). The propercombination of these block matrices yields the Σ pr .In practice we do not form this matrix but imple-ment the operations as sparse matrix-vector multi-plications. This strategy also allows us to also re- duce the number of samples T required for the esti-mation. We set K = 10 as preliminary experimentshave shown this covers the informative channels suf-ﬁciently. We also set T high enough to make sure theestimated covariance matrix is full rank and found T = 20000 to be suﬃcient. We extend the usual encoding operation in order tobe able to model additional eﬀects of the image ac-quisition. The aim while modeling these is to closethe domain gap between a trained VAE and an ob-served k-space data by integrating acquisition speciﬁcknowledge as much as possible. Let us assume wehave a trained VAE and observed a k-space data forthe rest of this section.The extended encoding matrix E is based on theusual MR encoding matrix ˜ E = U F S , with S : C N →C Nc the sensitivity encoding matrix [17] with c coils, F : C Nc → C Nc the coilwise Fourier transform, U : C Nc → C Mc the undersampling operation. Then theextended encoding operation is given as E = ˜ EBϕP s. (6)In the following we explain each term seperately.Firstly, a discrepancy between the k-space data andthe space of images on which the VAE operates can bedue to diﬀerences in the ﬁeld of view (FOV). Thoughour fully convolutional architecture is agnostic to theimage size, the empirical prior is estimated for a spe-ciﬁc resolution and FOV, and the k-space size can bediﬀerent due to varying FOV during acquisition. Tobridge this gap, we introduce a padding operation P that pads or crops the images to ﬁt the required sizesfor the VAE.Secondly, for computational as well as implemen-tation related purposes, we assume that the phase ofstructural images are highly independent of the mag-nitude image and smooth, and hence a single phaseimage can be used for all posterior samples we take.This allows us to separate the magnitude and thephase of the image and run the sampling only on themagnitude of the image. However, note that this as-sumption is not a requirement for the proposed ideaas the phase could be sampled as well, but rather amethodological simpliﬁcation motivated by empiricalobservations. Following this assumption, we write thephase as a diagonal matrix ϕ acting on the image.Thirdly, we use a diagonal matrix B that explicitlymodels the bias ﬁeld in the acquisition [18]. The MRimages unavoidably have a bias ﬁeld due to severalfactors [19], and the bias ﬁeld is diﬃcult to avoid,however easy to estimate from the measured data. As6he bias varies between diﬀerent acquisitions, this isa potential source of discrepancy leading to a domainshift. In order to minimize this, we train the VAE onbias free images, which then gives us bias free imagesamples. However, as the measured data y has thebias ﬁeld in it, we estimate this ﬁeld and explicitlymodel it to bring the sampled images to the biaseddomain of the k-space data.Finally, we introduce a scale factor to make thedata likelihood invariant to any scaling diﬀerence be-tween the samples and the k-space. During the ran-dom walk in the latent space, the corresponding im-ages might get scaled at each step, meaning the imagemay be multiplied globally by a scale factor. If thisscale factor moves away from 1, this causes the datalikelihood to increase, since the scales of the k-spacedata and the image samples do not match. How-ever, from the perspective of sample quality, this doesnot pose a problem as long as the scaling factors areknown. The sampled images can be brought to thesame scale by multiplying them with the inverse ofthe scaling factor. Furthermore, allowing the scalefactor to be diﬀerent for each sample, allows morefreedom to the random walk in the latent space, as itis less constrained by the increase in data likelihooddue to scale changes. Hence such an invariance to thisscaling is desirable. To this end we introduce a scalar s , that keeps the data likelihood at the lowest, induc-ing an invariance to scaling. We calculate its valueby solving s ∗ = min s || Esµ x ( z t ) − y || , where we sep-arated only the s term from the extended encodingand used the mean of the decoder as the image. Thenwe take the derivative of the expression acc. to s andset it to zero to obtain the minimizing s value, whichis given analytically as s ∗ = Re { µ x ( z t ) H E H y } µ x ( z t ) H E H Eµ x ( z t ) . We dothis estimation seperately for each z t sample at eachstep.The complex conjugate of the extended en-coding operation operation is given as E H = s H P H ϕ H B H ˜ E H , where we implement P H as crop-ping if P is a padding operation and vice versa, ϕ H is multiplication with the complex conjugate of thephase, B H = B since the bias ﬁeld is real and s H = s ,again since the scale factor is real. We used T1 weighted slices from the full 3D volumesof 780 subjects from the HCP dataset [16] for train-ing of the VAE. There were in total 202800 slices of size 252x308, with an isotropic resolution of 0 . mm .We ran the N4 bias ﬁeld correction on the images andused the corrected images for training. The trainingran for 2250000 iterations. We also trained anotherVAE after downsampling the images to 1mm isotropicresolution to work with lower resolution images for1750000 iterations. For both, we augmented the im-ages by translating them randomly (-4 to +4 pixels)in both directions and trained till convergence.For testing we used 4 axial slices from subjects inthe HCP dataset, diﬀerent than those used in train-ing, without bias ﬁeld correction. We additionallytested with 3 axial slices from in-house measured T1weighted brain images of diﬀerent subjects [4]. Theseimages have similar acquisition parameters as HCPand have an isotropic resolution of 1 mm . Further-more these images are acquired with 13 coils and havenon-zero phase. We used ESPIRiT [24] to obtain thecoil sensitivity maps, which we used in the MAP es-timation and sampling.For the experiments, we retrospectively applyCartesian undersampling to the test images with dif-ferent patterns for each image. We obtain these pat-terns by generating 100 diﬀerent patterns and choos-ing the one with the highest peak-to-side ratio of theassociated point spread functions. In all patterns the15 central proﬁles are always sampled.For comparison purposes, we also modiﬁed thecode by Adler et al. [9] to work with magnitude ofMR images and evaluated in our experimental setting(the authors provided private access to their reposi-tory for the code). This method requires supervisedtraining, i.e. pairs of zero-ﬁlled and fully sampledimages. We generated such a training set with zero-ﬁlled images by undersampling the training images,which were also used for the VAE. We used the imageswith their bias ﬁeld. We generated an undersamplingpattern as described above, seperately for each of theimages. We trained the cWGAN for 800000 iterationstill convergence with the the decay ratio for the noisylinear cosine decay as 2000000 but otherwise with thedefault settings in the code provided by the authorsand the augmentation used for the VAE.We also trained a feed-forward heteroscedastic net-work as a baseline. This network has the same archi-tecture as the VAE, without the KLD in the loss andoutputs a pixelwise standard deviation alongside themean prediction. It is trained for 4000000 iterationsusing the supervised training setup described for thecWGAN method.Finally, we implement the local sampling methodwe described in Section 2.2 for comparison purposes.For these we use the same VAE that is used in theproposed method. We take the samples around the7AP reconstruction. We initialize the chain with the maximum-a-posteriori (MAP) images obtained by the deep den-sity prior (DDP) reconstruction [4] to avoid a longburn-in period and take 10000 samples in total. Weempirically determine the step size τ = 4 × − toobtain an acceptance ratio around 0.3-0.5 and use thesame for both the HCP images the in-house measuredimages. We take a lower number of samples (1000)for the cWGAN and local sampling methods as theeﬀective sample size is also lower for the MCMC chaindue to correlated samples.We used Tensorﬂow [25] for the implementation ofVAE related parts of the proposed method. The VAEis fully convolutional with all padded convolutionsand has a 2 dimensional latent space with D=60 chan-nels. We refer to the Appendix for the details of thearchitecture. For Σ x we use a diagonal matrix withequal diagonal values set at 0 .

02. The same value wasused for training of the VAE and sampling. Σ ns wasalso taken to be a diagonal matrix, where the val-ues were estimated by taking the variance of a smallregion at the upper 10 pixels of the fully sampled k-space center in the undersampled data, seperately foreach coil. Though this approach neglects the covari-ance between coils, it allows for a diﬀerent varianceper coil, which is important as these can diﬀer signif-icantly. Alternatively, the data can be pre-whitenedto decorrelate the coils. The number of iterations forthe matrix inversion were determined empirically as N γ = 25, which was enough to reduce the L2 error ofthe approximate inversion below 0.01%.For padding we use simple zero padding and crop-ping. For bias ﬁeld estimation we used the N4method [26] on the magnitude of the MAP estima-tion with default parameters. For the phase of thesamples we took the MAP phase estimate. We show most of the results at high undersamplingfactors on purpose to make sure that the uncertaintyin the inversion is high and demonstrate that themodel is able to capture it. We present the resultswith the bias ﬁeld put back in for convention althoughthe method provides the samples bias free. We alsomultiply the images with their corresponding scalevalues to bring them to the same scale as the ob-served k-space data. The scale values stay mostlyin the 0.95-1.1 range throughout the sampling pro-cedure. We note that it is quite diﬃcult to see the (a) FS (b) ZF (c) MAP(d) Samples(e) Mean (f) Std (g) Error(h) Zoomed region II (i) Zoomed region III(j) Zoomed region I Figure 2: Results for the proposed latent MALAalgorithm for R=5. FS, ZF and MAP denote thefully sampled, zero-ﬁlled and MAP estimation im-ages. Second row presents three randomly chosensamples. (e-f) show the mean and pixelwise standarddeviation (std) maps for all samples. (g) shows theabsolute error map between the mean and the fullysampled image (clipped to 0-0.3). (h-j) show threezoomed-in regions indicated in (d) for three diﬀer-ent samples as well as the pixelwise std maps. Thegrid lines are to aid visual inspection. As the varia-tions are extremely diﬃcult to see in this format, westrongly encourage the reader to look at the supple-mentary GIFs.8 a) FS (b) l-MALA mean (c) l-MALA std (d) l-MALA Error(e) MAP image (f) cWGAN mean (g) cWGAN std (h) cWGAN diff(i) ZF (j) Local mean (k) Local std (l) Local diff(m) H.sced mean (n) H.sced std (o) H.sced diff Figure 3: Sampling results for diﬀerent methods atR=5. The left most column shows the fully sampled(FS), the MAP estimate and the zero-ﬁlled (ZF) im-ages. In the rightmost three columns, the samplemean, pixelwise standard deviations and the abso-lute error maps between the mean and fully sampledimages from the respective method are given. Theerror maps are clipped to (0, 0.3), the std maps areclipped to (0, 0.06) and (0, 0.18) for the l-MALA andthe other two methods, respectively.variations in the samples presented in this paper inprint and highly encourage the reader to view theprovided GIFs .We start by showing three sample images from thelatent MALA model in Fig. 2 for an image undersam-pled with factor R=5. The structures in the samplesas well as in the mean image, obtained as the mean ofthe drawn samples, overlap well with the fully sam-pled image. On the other hand, structural variationsbetween the samples are present, which can also beseen in the std map. Nearly all pixels correspondingto tissue edges have a high std value. This is expectedsince the missing data in the k-space is mostly in thehigh-frequency regions, whose contributions are moreimportant for edge pixels. However, it is important to GIFs available for the shown results and additional images: https://polybox.ethz.ch/index.php/s/3DPrRoYQnyzANAF (a) FS

R=2R=3R=4R=5 (b) Histograms I-II-III(c) MAP (d) ZF (e) Mean (f) Std (g) Error Figure 4: Results for changing undersampling ratios.First row shows the fully sampled image (FS) and his-tograms of pixels values in all samples for the pixelsindicated on the FS image as I, II and III, respec-tively. Note the diﬀerent bin positions for the his-tograms. Rows two and three show results for R=2and R=4, respectively. Each row shows the MAP es-timation, the zero ﬁlled image (ZF), the pixelwisemean and standard deviation maps and the abso-lute error map between the mean and the FS image(clipped to (0,0.3)).note that the std values on the edges are not homoge-neous, indicating some parts of the edges have highervariability. Furthermore, the variations are not lim-ited to edges but also structures in the white matteras well. Examples of these can be seen in the zoomedregions, indicated with arrows. Region II shows anexample where the grey matter is connected and dis-connected in diﬀerent samples. In the lower part ofregion III one can see a gray matter structure insidethe white matter becoming more and less visible inthe samples. Similarly, in region I, one can see a graymatter structure showing variability in how deep itpenetrates the white matter. The pixel-wise std mapsare marginal maps, i.e. they present the variations inthe pixels as if they were independent. In reality thevariations are not pixel-wise, rather the structures ascollections of multiple pixels move between diﬀerentsamples, which can be observed better in the GIFs.Fig. 3 shows results for the diﬀerent samplingmethods for comparison purposes, namely the latentMALA, the cWGAN and the local VAE sampling.We use the same undersampling pattern for all themethods for comparability. Both the VAE cWGANmethods capture the underlying image fairly well in9he mean of the samples, which is reﬂected in the dif-ference images. The mean of the local VAE samplingis very blurry and cannot to capture the structuresin the underlying image as well as the other meth-ods. The heteroscedastic model performs worse in themean prediction as expected due to the lack of a dataconsistency term. The pixelwise standard deviationmaps for the VAE and cWGAN models are similar atﬁrst glance, in that both reﬂect the high uncertaintyregions at the tissue edges. However, the cWGANmaps are quite noisier and blurrier in comparison.Latent MALA provides a much ﬁner level distinction.This is expected since the proposed method generatessamples based on examination of the given data in-stead of relying on a trained model to generalize anddoes not make assumptions about data availabilityfrom the joint distribution of fully and undersampledimages as in cWGAN. The std maps from the het-eroscedastic model yields even more blurry results.All methods except the local sampling are capable ofindicating some of the regions where their mean mapsdiﬀer from the ground truth image, by showing highdiversity in the samples or high std values in those re-gions, as exempliﬁed by the arrows on the respectiveimage. The local sampling method also captures vari-ability on the edges. Furthermore, it can also indicatepossible diﬀerences in its mean and the ground truthimages, though only coincidentally. For example, inthe region shown with the green arrow, it also assignshigh std values, however this is rather due to the factthat there are two edges intersecting heavily in thatregion. In the region indicated by the red arrow, onthe other hand, although the mean map diﬀers fromthe ground truth, the local samples do not indicate ahigh variability in this region. We also present resultsin Appendix on how well the samples agree with thefully sampled image for the measured portion of thek-space and show that the l-MALA outperforms thelocal sampling and cWGAN methods in this regard.In Fig. 4, we show how the statistics from thesamples change with changing undersampling ratios.Firstly, we show histograms from three pixels indi-cated on the FS image for R=2, 3, 4 and 5, fromwhich one can observe that the pixel histogramsbecome wider with increasing R, indicating higheruncertainty. This increase is also reﬂected in thestd maps, which show an increase in std values forincreasing R. This result shows that the proposedmodel is able to capture increasing ambiguity due tohigher undersampling ratio.Next we present results in Fig. 5 to show the meth-ods sensitivity to the noise in the k-space. The qual-ity of the MAP image degrades due to the high noise.This is reﬂected less in the mean maps, however the (a) FS (b) MAP (c) Mean (d) Std ns=0ns=1ns=4ns=8 (e) Histograms I-II-III Figure 5: Results for changing the noise in k-spaceat R=5. First row shows the results with the basisHCP k-space noise. Second row shows the resultswith noise added on the k-space with 8 times theoriginal noise standard deviation. Third row showshistograms of values of the pixels indicated on the stdmap (with added noise 1, 4 and 8 times of the basisnoise). Note that the fully sampled (FS) image alsochanges due to the added noise. (a) FS (b) ZF (c) MAP img(d) Mean (e) Std (f) Error Figure 6: Results for multicoil in-house measured im-ages at R=2. Shown are the fulls sampled (FS), zero-ﬁlled (ZF), MAP estimate images, the mean and stdmaps for the latent-MALA samples as well as thediﬀerence image between the mean map and the FS(clipped to 0,0.3)).10tandard deviation values increase. This is how themodel should behave since the added noise increasesthe values in Σ ns , which then allows samples to movefarther away from the measured data and show higherdiversity. This is also reﬂected in the histograms ofthree pixel’s intensities, which are indicated in thetop std map, as the distributions become wider withincreasing noise.Finally we present results for an image from themulticoil in-house measured dataset in Fig 6 for R=2.The method yields similar results for this image aswell. The mean map can capture the underlyingstructures, and most variation is concentrated on theedges. Similar to the HCP results, std map can in-dicate potential discrepancies between mean and FSimage as well, for instance low intensity region indi-cated with the arrow, that is incorrectly representedin the mean image, has a high std value. The results show that the proposed method is ableto capture the underlying ambiguity in undersampledMRI acquisitions in that it generates samples, suchas those in Fig. 2, that show realistic structural diver-sity while retaining high ﬁdelity to the fully sampledimage. The variability also indicates potential dis-crepancies between the mean prediction and the FSimage.One observation is that the texture in the fullysampled image is not entirely captured in the sam-ples. This is ﬁrstly because we do not add the noiseon to the samples, of which the texture is partlycomposed of. Secondly, the lack of texture can beattributed to the VAE, which is known for prefer-ring blurry images and ignore very high-frequencychanges. We expect this aspect to improve with abetter prior model.More importantly, the variation in the samplessummarized in the std maps are capable of high-lighting the potential mistakes in their mean predic-tions as seen in Fig. 3 for the latent MALA, cW-GAN and heteroscedastic feed-forward network ap-proaches. This is very valuable information for fur-ther decision making, as such regions where uncer-tainty is high should be approached with doubt. Thisinformation, when taken directly in the form of thesamples or estimated standard deviations can be usedfor any decision making process for clinical or re-search purposes. The latent MALA and cWGANare advantageous compared to heteroscedastic mod-els in this respect as they can produce samples al-lowing quantiﬁcation of uncertainty of any task per- formed on the images. The heteroscedastic modelingapproach is limited in that it makes an independentGaussian assumption for each pixel and cannot gen-erate realistic samples.The std maps from the local VAE sampling do nothighlight the regions where its mean diﬀers from theground truth. This is expected, since it takes theMAP estimate as the underlying image and samplesonly around its latent representation, without explor-ing other areas of the latent space. Hence it can gen-erate variations only on the tissue edges of the MAPimage, but cannot explore possibility of diﬀerent tis-sue structures.We note that the cWGAN method was developedfor CT images and our modiﬁed implementation forMR is very straightforward and limited. By feedingonly the magnitude images, we set the phase to zero,which is the correct phase for the HCP dataset, giv-ing the method an advantage. However this is nota generic situation and the cWGAN method needsto be extended from CT to work in a realistic MRsetting.As the undersampling ratio R increases the inver-sion problem becomes harder, hence the mean mapsstart diverging more from the ground truth, which isalso reﬂected in the diﬀerence maps shown in Fig. 4.Proposed model successfully captures the higher am-biguity for higher undersampling ratios as can be seenin the higher values in the std maps. This increaseis also visible in the increasing spread of the his-tograms. Results presented in Fig. 5, from the exper-iments with increasing k-space noise Σ ns , show thatthe latent MALA model can incorporate the chang-ing k-space noise. When the k-space noise increasesambiguity in the observations increase. Mathemati-cally, the Σ ns term in the Gaussian data likelihoodincreases, which in turn allows accepting samples far-ther away from the measured k-space data y . Thisresults in higher sample diversity, which is reﬂected inthe increase in the standard deviation maps, whereasthe mean of the samples is not aﬀected much by this.Though we use here the VAE as the latent spacemodel, the outlined method is generic. The integralin Eq. 2, which relates the latent space and the k-space is a generic formulation and can be used withother probabilistic models that provide a determin-istic or probabilistic decoder structure. One impor-tant necessary property, however, is that the decoderstructure needs to be diﬀerentiable w.r.t. the latentvariables for the Langevin walk. Utilizing anotherdecoder structure here can also increase the qualityof the ﬁnal image samples, for instance models thatsuﬀer less from blurriness than VAEs.Another factor to consider regarding the VAE is11hat the aggregate posterior of the VAE, given as q ( z ) = (cid:82) q ( z | x ) p data ( x ) dx does not necessarily over-lap with the prior distribution p ( z ). This can thencause the random walk to move towards regions ofthe latent space which are not in the aggregate poste-rior or similarly miss parts of the aggregate posteriorwhich are zero in the prior. We corrected this dis-crepancy partially by introducing the empirical priorfor p ( z ) and have not observed problems regardingthis issue.Similarly, MALA is not the only way of doing thesampling. We choose it due to several factors, suchas its eﬃciency in high dimensional spaces, theoreti-cal guarantees on asymptotic convergence to the trueposterior, not requiring normalized target distribu-tions etc. The target distribution p ( z | y ) in our for-mulation is given by a Gaussian, however its covari-ance matrix is not given in a closed form, renderingdirect sampling diﬃcult. Furthermore, the Gaussianposterior is not a generic situation. In cases of morecomplicated distributions, approaches such as Hamil-tonian Monte Carlo [11] can be utilized so that thetypical set can be traversed more quickly. Similarly,in cases of multimodal distributions, approaches tai-lored to such distributions, such as Stein variationalgradient descent [27] can be considered. Furthermorenatural gradient based methods, where the geometryof the target distribution is taken into account by in-troducing an associated Riemannian metric can beconsidered to speed up the MALA [11].As discussed above, the proposed method has theadvantage of having a modular structure as the prioris decoupled from the data acquisition model and thetarget posterior is decoupled from the sampling pro-cedure. This, we believe, is quite advantageous interms of future research and improvement opportu-nities. This is in contrast to the cWGAN approach,where the loss function and the architecture, by de-sign, determine an implicit target distribution with-out explicitly modeling the prior or the acquisition.Lastly, the decoupling of the prior from the data ac-quisition model allows the latent MALA model tobe used for diﬀerent undersampling factors with thesame prior without retraining and also incorporateadditional details of the acquisition in a very straight-forward way, such as the bias ﬁeld without needingto retrain the prior.Lastly, the proposed model is sampling from thetrue posterior for a given p ( x ) and y . Predictive ap-proaches for uncertainty quantiﬁcation, such as cW-GAN, relies on a trained network to generalize for agiven sample while having serious training data deﬁ-ciency. They require training samples that show dif-ferent fully sampled images for a given undersampled image, which is not readily available unless a methodsuch as the one proposed here is used to constructthem. In this paper we proposed and evaluated a methodthat can provide multiple possible images for thegiven undersampled k-space data. In contrast to re-construction approaches, where a single image is out-put, the sampling approach can capture the uncer-tainty in the inversion process due to the missingdata. The variation in the samples is indicative ofthe uncertainty, open up new avenues for uncertaintyquantiﬁcation for following image analysis tasks andcan point to potential mistakes in the mean predic-tion. The method we propose has a modular struc-ture and can be improved by separately improving itscomponents, such as the prior term or the samplingscheme.

Acknowledgment

The authors thank GyroTools for their MRecon soft-ware, NVIDIA for their GPU donation and Klaas P.Pruessmann and Roger C. Luechinger for their helpwith data acquisition.

References [1] B. Bilgic, I. Chatnuntawech, A. P. Fan, K. Set-sompop, S. F. Cauley, L. L. Wald, and E. Adal-steinsson, “Fast image reconstruction with l2-regularization,”

Journal of Magnetic ResonanceImaging , vol. 40, no. 1, pp. 181–191, 2014.[2] M. Lustig, D. Donoho, and J. M. Pauly, “SparseMRI: The application of compressed sensingfor rapid MR imaging,”

Magnetic Resonance inMedicine , vol. 58, no. 6, pp. 1182–1195, 2007.[3] J. Tsao, P. Boesiger, and K. P. Pruessmann, “k-t blast and k-t sense: Dynamic mri with highframe rate exploiting spatiotemporal correla-tions,”

Magnetic Resonance in Medicine , vol. 50,no. 5, pp. 1031–1042, 2003.[4] K. C. Tezcan, C. F. Baumgartner,R. Luechinger, K. P. Pruessmann, andE. Konukoglu, “Mr image reconstructionusing deep density priors,”

IEEE Transactionson Medical Imaging , vol. 38, no. 7, 2019.125] A. Kendall and Y. Gal, “What uncertainties dowe need in bayesian deep learning for computervision?” in

Advances in Neural Information Pro-cessing Systems 30 , 2017, pp. 5574–5584.[6] Y. Gal and Z. Ghahramani, “Bayesian convo-lutional neural networks with Bernoulli approx-imate variational inference,” arXiv:1506.02158 ,2015.[7] R. Tanno, D. E. Worrall, A. G. Enrico Kaden,F. Grussu, A. Bizzi, S. N. Sotiropoulos, A. Crim-inisi, and D. C. Alexander, “Uncertainty quan-tiﬁcation in deep learning for safer neuroimageenhancement,” arXiv:1907.13418v1 , 2019.[8] J. Schlemper, D. C. Castro, W. Bai, C. Qin,O. Oktay, J. Duan, A. N. Price, J. Hajnal,and D. Rueckert, “Bayesian deep learning foraccelerated mr image reconstruction,” in

Ma-chine Learning for Medical Image Reconstruc-tion . Springer International Publishing, 2018.[9] J. Adler and O. ktem, “Deep bayesian inver-sion,” arXiv:1811.05910 , 2018.[10] S. Pedemonte, C. Catana, and K. Van Leemput,“Bayesian tomographic reconstruction using rie-mannian mcmc,” in

Medical Image Computingand Computer-Assisted Intervention – MICCAI2015 . Springer International Publishing, 2015,pp. 619–626.[11] M. Girolami and B. Calderhead, “Riemannmanifold langevin and hamiltonian monte carlomethods,”

Journal of the Royal Statistical Soci-ety: Series B (Statistical Methodology) , vol. 73,no. 2, 2011.[12] D. Narnhofer, K. Hammernik, F. Knoll, andT. Pock, “Inverse GANs for accelerated MRI re-construction,” in

Wavelets and Sparsity XVIII ,D. V. D. Ville, M. Papadakis, and Y. M. Lu,Eds., vol. 11138, International Society for Op-tics and Photonics. SPIE, 2019, pp. 381 – 392.[13] J. Schlemper, J. Caballero, J. V. Hajnal, A. N.Price, and D. Rueckert, “A deep cascade ofconvolutional neural networks for dynamic mrimage reconstruction,”

IEEE Transactions onMedical Imaging , vol. 37, no. 2, pp. 491–503,2018.[14] D. P. Kingma and M. Welling, “Auto-encodingvariational bayes.”

CoRR , vol. abs/1312.6114,2013. [15] D. J. Rezende, S. Mohamed, and D. Wierstra,“Stochastic backpropagation and approximateinference in deep generative models,” in

Pro-ceedings of the 31st International Conference onMachine Learning , ser. Proceedings of MachineLearning Research, vol. 32, no. 2. Bejing, China:PMLR, 22–24 Jun 2014, pp. 1278–1286.[16] D. V. Essen, K. Ugurbil, E. Auerbach, D. Barch,T. Behrens, R. Bucholz, A. Chang, L. Chen,M. Corbetta, S. Curtiss, S. D. Penna, D. Fein-berg, M. Glasser, N. Harel, A. Heath, L. Larson-Prior, D. Marcus, G. Michalareas, S. Moeller,R. Oostenveld, S. Petersen, F. Prior, B. Schlag-gar, S. Smith, A. Snyder, J. Xu, and E. Yacoub,“The human connectome project: A data acqui-sition perspective,”

NeuroImage , vol. 62, no. 4,pp. 2222 – 2231, 2012, connectivity.[17] K. P. Pruessmann, M. Weiger, P. Brnert, andP. Boesiger, “Advances in sensitivity encodingwith arbitrary k-space trajectories,”

MagneticResonance in Medicine , vol. 46, no. 4, pp. 638–651, 2001.[18] M. Gaillochet, K. Tezcan, and E. Konukoglu,“Joint reconstruction and bias ﬁeld correctionfor undersampled mr imaging,”

MICCAI , 2020.[19] P. G. Sled J.G., “Understanding intensity non-uniformity in mri.”

MICCAI , 1998.[20] C. M. Bishop,

Pattern Recognition and MachineLearning (Information Science and Statistics) .Berlin, Heidelberg: Springer-Verlag, 2006.[21] A. Volokitin, E. Erdil, N. Karani, K. C. Tezcan,X. Chen, L. V. Gool, and E. Konukoglu, “Mod-elling the distribution of 3d brain mri using a 2dslice vae,”

MICCAI , 2020.[22] K. T´othov´a, S. Parisot, M. C. H. Lee, E. Puyol-Ant´on, L. M. Koch, A. P. King, E. Konukoglu,and M. Pollefeys, “Uncertainty quantiﬁcation incnn-based surface prediction using shape priors,”in

Shape in Medical Imaging . Springer Interna-tional Publishing, 2018, pp. 300–310.[23] W. W. Daniel,

Applied Nonparametric Statistics .Duxbury, 2000.[24] U. Martin, L. Peng, M. M. J., V. Patrick,E. Michael, P. J. M., V. S. S., and L. Michael,“Espiritan eigenvalue approach to autocalibrat-ing parallel mri: Where sense meets grappa,”

Magnetic Resonance in Medicine , vol. 71, no. 3,pp. 990–1001, 2014.1325] M. Abadi, P. Barham, J. Chen, Z. Chen,A. Davis, J. Dean, M. Devin, S. Ghemawat,G. Irving, M. Isard, M. Kudlur, J. Levenberg,R. Monga, S. Moore, D. G. Murray, B. Steiner,P. Tucker, V. Vasudevan, P. Warden, M. Wicke,Y. Yu, and X. Zheng, “Tensorﬂow: A system forlarge-scale machine learning,” in , 2016, pp. 265–283.[26] N. J. Tustison, B. B. Avants, P. A. Cook,Y. Zheng, A. Egan, P. A. Yushkevich, and G. J.C., “N4itk: Improved n3 bias correction.”

IEEETransactions on Medical Imaging , vol. 29, 2010.[27] Q. Liu and D. Wang, “Stein variational gradientdescent: A general purpose bayesian inferencealgorithm,” 2016. 14

Appendix to “Sampling possible reconstructions of undersam-pled acquisitions in MR imaging”

A.1 Derivation of the closed form of p ( y | z ) As mentioned in the paper the form that is easiest to interpret is given by the marginalization, however thisintegral diﬃcult to evaluate directly. Instead, we do it implicitly by using conjugacy relations for Normaldistributions.We begin by writing p ( y | x, z ) p ( x | z ) = p ( y | z ) p ( x | y, z ) . (7)Since p ( y | x, z ) and p ( x | z ) are Normal distributions, due to the conjugacy, the posterior p ( x | y, z ) is also aNormal distribution given as N ( µ post , Σ post ). Then p ( y | z ) = p ( y | x, z ) p ( x | z ) N ( µ post , Σ post ) , or p ( y | z ) N ( µ post , Σ post ) = p ( y | x, z ) p ( x | z ) . (8)Hence the posterior p ( y | z ) acts as a normalizer to the product distribution to yield a Gaussian. We derive p ( y | z ) using this relation in Eqn. 8. In the following we also use the conditional independence p ( y | x, z ) = p ( y | x ) meaning that when the image is given, this posterior distribution in the k-space is determined withoutthe need for the latent variable.For the derivation we use this strategy: i) we ﬁrst write the product of the two distributions p ( y | x ) p ( x | z ),ii) then recognize the mean and covariance of the Normal posterior distribution N ( µ post , Σ post ) in this, iii)and separate a the Gaussian with these parameters from the whole expression. What is left gives us thetarget distribution.The product can be written as p ( y | x ) p ( x | z ) = (9)det(2 π Σ ns ) − / det(2 π Σ x ) − / exp (cid:26) − (cid:2) ( y − Ex ) H Σ − ns ( y − Ex ) (cid:3)(cid:27) (10) · exp (cid:26) − (cid:2) ( x − µ x ) H Σ − x ( x − µ x ) (cid:3)(cid:27) (11)= det(2 π Σ ns ) − / det(2 π Σ x ) − / exp (cid:26) − x H (Σ − x + E H Σ − ns E ) (cid:124) (cid:123)(cid:122) (cid:125) Σ − post x (12)+ Re { x H ( E H Σ − ns y + Σ − x µ x ) (cid:124) (cid:123)(cid:122) (cid:125) Σ − post µ post } − y H Σ − ns y − µ Hx Σ − x µ x (cid:27) , (13)where we have recognized the parameters of the posterior. With these we have enough information to com-plete the posterior Gaussian. We can replace the terms with posterior parameters and add the missing term ± µ Hpost Σ − post µ post to complete the quadratic form as well as the normalizing determinant det(2 π Σ post ) ± / .= det(2 π Σ ns ) − / det(2 π Σ x ) − / det(2 π Σ post ) +1 / det(2 π Σ post ) − / (14) · exp (cid:26) − x H Σ − post x + Re { x H Σ − post µ post } − µ Hpost Σ − post µ post (cid:124) (cid:123)(cid:122) (cid:125) − ( x − µ post ) H Σ − post ( x − µ post ) (15)+ 12 µ Hpost Σ − post µ post − y H Σ − ns y − µ Hx Σ − x µ x (cid:27) . (16)We can combine the quadratic term in the exponent with the determinant term and obtain the complete15osterior Gaussian. In this case the expression becomes p ( y | x ) p ( x | z ) = N ( µ post , Σ post ) det(2 π Σ ns ) − / det(2 π Σ x ) − / det(2 π Σ post ) +1 / (17) · exp (cid:26) + 12 µ Hpost Σ − post µ post − y H Σ − ns y − µ Hx Σ − x µ x (cid:27) . (18)Remembering Eqn. 8, we obtain p ( y | z ) = det(2 π Σ post ) +1 / det(2 π Σ ns ) / det(2 π Σ x ) / · exp (cid:26) − y H Σ − ns y + 12 µ Hpost Σ − post µ post − µ Hx Σ − x µ x (cid:27) . (19)Now taking the logarithm and leaving out the terms that are independent of z we can arrive at theexpression we use as log p ( y | z ) = + 12 µ Hpost Σ − post µ post − µ Hx Σ − x µ x + C, (20)where C denotes some constant with z. Notice that we could leave out the determinant term in the nominatordue to our model choice of constant Σ x .Now we need the closed form expression for the ﬁrst term in the above equation. Also we need to arrive atthis using the terms we have access to from the above equations 12 and 13, namely Σ − post µ post and Σ − post . Firstwe write µ post = (Σ − post ) − Σ − post µ post and rewrite the target term as µ Hpost Σ − post µ post = (Σ − post µ post ) H µ post .Combining the expressions and isoliting the terms constant with z as C then yields µ Hpost Σ − post µ post = µ Hx Σ − x (Σ − x + E H Σ − ns E ) − Σ − x µ x (21)+ 2Re (cid:8) y H Σ − ns E (Σ − x + E H Σ − ns E ) − Σ − x µ x (cid:9) + C (22)Applying the Woodburry identity on the term (Σ − x + E H Σ − ns E ) followed by some algebraic manipulationsreveals that this is equivalent to the expression given in [22]. A.2 The VAE architecture

All convolutions are padded and have a kernel size (3, 3) and stride (1, 1) and use a ReLU unless notedotherwise.The encoder begins with four convolutional layers with 32, 64, 64, 64 output channels, respectively. Thena convolutional layer with kernel size (14, 14), stride (19, 19) and 60 output channels produces the meanof q ( z | x ) from the fourth layer. Similarly another convolutional layer from the third layer produces the logstandard deviation values for q ( z | x ) with a kernel size of (14, 14), stride (19, 19), without ReLU and 60output channels. The network is fully convolutional, hence can work with diﬀerent image sizes. Assumingan input image size of 252x308 for demonstration, the latent space size becomes bx18x22x60, where b is thebatch size. We use the usual reparameterization trick to sample z ’s [14]. At the beginning of the decoder, weapply a scheme of increasing channel dimensions and using these to increase spatial dimensions. We do thisin two steps, once for the ﬁrst image dimension and once again for the second image dimension to obtaina proper reshaping while using the implementation of Tensorﬂow’s reshaping function. First convolutionallayer of the decoder does not use ReLU and has 64 · A.3 Distribution of voxelwise error in the measured parts of k-space and RMSE

Here we show the k-space error histograms in Figure 7 for a test slice at R=5. We calculate these as follows:we take 50 image samples { x s } s =1 from each method and apply the undersampled Fourier transform totransform each of them to k-space and take the measured voxels. Then we calculate the voxelwise diﬀerence16 .3 0.2 0.1 0.0 0.1 0.2 0.30510152025303540 l-MALAcWGANlocal (a) Real part (b) Imaginary part (c) Absolute error Figure 7: Histograms of the voxelwise error in the measured voxels in the k-space for three diﬀerent methodsfor a subject at R=5. As the error is complex, the real and imaginary parts as well as the magnitude of theerror values are shown seperately.between these and the measured data for all measured k-space voxels for all the samples together. Thehistogram then shows the distribution of the error for all these k-space voxels from all 50 samples. As thisdiﬀerence is complex, we show two histograms seperately for the real and imaginary parts and also for themagnitude values. We can also look at the image-wise the absolute error asabs. error s = (cid:88) all meas. voxels | Ex F S − Ex s | , (23)for a sample image x s , the fully sampled image x F S and the sum is over all measured k-space voxels. Whencalculated for all 50 samples, the mean (std) values for this slice are given as 484.23 (5.45), 584.71 (18.89)and 597.98 (15.69), for the l-MALA, cWGAN and local sampling methods, respectively.To show how this generalizes, we do a similar analysis using slices from 9 test subjects. We undersample theslices with diﬀerent undersampling patterns at R=5. Again, for each test subject we generate 50 samples forthe three methods each. We then calculate the absolute errors and report the mean (std) values in Table 1.For all subjects (except subject 3) the l-MALA method yields signiﬁcantly (p value lower than 0.001) lowerabsolute error for the 50 samples with the Wilcoxon signed-rank test. For subject 3, the cWGAN methodyields signiﬁcantly lower absolute error (p value lower than 0.001). Considering the mean absolute error foreach subject, the l-MALA method yields signiﬁcantly lower absolute error overall (p value lower than 0.011)We also calculate the root mean squared error (RMSE) between the 50 samples and the fully sampled image.Though achieving a low RMSE is not the purpose of any of the methods, we present these results as theystill provide some insight into the performance of the methods. To calculate the RMSE we use the formulagiven in [4] and use a mask to disregard the background. The RMSE values are signiﬁcantly lower for the50 samples for each subject (p value lower than 0.001) and for the mean RMSE values for each subject (pvalue lower than 0.008) for the l-MALA method compared to the other methods.17 ubject Abs. error RMSEl-MALA cWGAN Local l-MALA cWGAN Localubject Abs. error RMSEl-MALA cWGAN Local l-MALA cWGAN Local