Dynamic imaging using a deep generative SToRM (Gen-SToRM) model
Qing Zou, Abdul Haseeb Ahmed, Prashant Nagpal, Stanley Kruger, Mathews Jacob
Abstract—We introduce a generative smoothness regularization on manifolds (SToRM) model for the recovery of dynamic image data from highly undersampled measurements. The model assumes that the images in the dataset are non-linear mappings of low-dimensional latent vectors. We use a deep convolutional neural network (CNN) to represent the non-linear transformation. The parameters of the generator as well as the low-dimensional latent vectors are jointly estimated only from the undersampled measurements. This approach differs from traditional CNN approaches, which require extensive fully sampled training data. We penalize the norm of the gradients of the non-linear mapping to constrain the manifold to be smooth, while temporal gradients of the latent vectors are penalized to obtain a smoothly varying time series. The proposed scheme brings in the spatial regularization provided by the convolutional network. The main benefits of the proposed scheme are the improvement in image quality and the orders-of-magnitude reduction in memory demand compared to traditional manifold models. To minimize the computational complexity of the algorithm, we introduce an efficient progressive training-in-time approach and an approximate cost function. These approaches speed up the image reconstructions and offer better reconstruction performance.
Index Terms—Generative model; CNN; manifold approach; unsupervised learning; deep image prior
I. INTRODUCTION

The imaging of time-varying objects at high spatial and temporal resolution is key to several modalities, including MRI and microscopy. A central challenge is the need for high resolution in both space and time [1], [2]. Several computational imaging strategies have been introduced in MRI to improve the resolution, especially in the context of free-breathing and ungated cardiac MRI. A popular approach pursued by several groups is self-gating, where cardiac and respiratory information is obtained from central k-space regions (navigators) using bandpass filtering or clustering [3]–[7]. The data is then binned to the respective phases and recovered using total variation or other priors. Recently, approaches using smooth manifold regularization have been introduced. These approaches model the images in the time series as points on a high-dimensional manifold [8]–[12]. Manifold regularization algorithms, including the smoothness
regularization on manifolds (SToRM) framework [8]–[10], have shown good performance in several dynamic imaging applications. Since the data is not explicitly binned into specific phases as in the self-gating methods, manifold algorithms are less vulnerable to clustering errors than navigator-based corrections. Despite these benefits, a key challenge with current manifold methods is their high memory demand. Unlike self-gating methods that only recover specific phases, manifold methods recover the entire time series. The limited memory on current GPUs restricts the number of frames that can be recovered simultaneously, which makes it challenging to extend the model to higher dimensionalities. The high memory demand also makes it difficult to use spatial regularization priors on the images using deep learned models. Our main focus is to capitalize on the power of deep convolutional neural networks (CNNs) to introduce a memory-efficient generative or synthesis formulation of SToRM. CNN-based approaches are now revolutionizing image reconstruction, offering significantly improved image quality and fast image recovery [13]–[19]. In the context of MRI, several novel approaches have been introduced [20], [21], including transfer learning [22], domain adaptation [23], learning-based dynamic MRI [24], and generative-adversarial models [25]–[27].

Qing Zou is with the Applied Mathematics and Computational Sciences (AMCS) program at the University of Iowa, Iowa City, USA (e-mail: [email protected]). Abdul Haseeb Ahmed and Mathews Jacob are with the Department of Electrical and Computer Engineering, University of Iowa, Iowa City, USA (e-mail: [email protected] and [email protected]). Prashant Nagpal and Stanley Kruger are with the Department of Radiology, University of Iowa, Iowa City, USA (e-mail: [email protected] and [email protected]). This work is supported by NIH under Grants R01EB019961 and R01AG067078-01A1. This work was conducted on an MRI instrument funded by 1S10OD025025-01.
Unlike many CNN-based approaches, the proposed scheme does not require pre-training using large amounts of training data. This makes the approach desirable in free-breathing applications, where the acquisition of fully sampled training data is infeasible. We note that the classical SToRM approach can be viewed as an analysis regularization scheme (see Fig. 1(a)). Specifically, a non-linear injective mapping is applied to the images such that the mapped points of the alias-free images lie on a low-dimensional subspace [10], [28], [29]. When recovering images from undersampled data, a nuclear norm prior is applied in the transform domain to encourage the non-linear mappings to lie in a subspace. Unfortunately, this analysis approach requires the storage of all the image frames in the time series, which translates to a high memory demand. The proposed generative SToRM formulation offers quite significant compression of the data, which can overcome the above challenge. Both the relation between the analysis and synthesis formulations and the relation of the synthesis formulation to neural networks were established in earlier work [29].

We assume that the image volumes in the dataset are smooth non-linear functions of a few latent variables, i.e., x_t = G_θ(z_t), where z_t are the latent vectors in a low-dimensional space and x_t is the t-th generated image frame in the time series. This explicit formulation implies that the image volumes lie on a smooth non-linear manifold in a high-dimensional ambient space (see Fig. 1(b)). The latent variables capture the differences between the images (e.g., cardiac phase, respiratory phase, contrast dynamics, subject motion). We model G using a CNN, which offers a significantly compressed representation. Specifically, the number of parameters required by the model (CNN weights and latent vectors) is several orders of magnitude smaller than that required for the direct representation of the images.
The compact model proportionately reduces the number of measurements needed to recover the images. In addition, the compression also enables algorithms with a much smaller memory footprint and computational complexity. We propose to jointly optimize the network parameters θ and the latent vectors z_t based on the given measurements. The smoothness of the manifold generated by G_θ(z) depends on the gradient of G_θ with respect to its input. To enforce the learning of a smooth image manifold, we regularize the norm of the Jacobian of the mapping, ‖J_z G_θ‖. We experimentally observe that by penalizing the gradient of the mapping, the network is encouraged to learn meaningful mappings. Similarly, the images in the time series are expected to vary smoothly in time. Hence, we also use a Tikhonov smoothness penalty on the latent vectors z_t to further constrain the solutions. We use the ADAM optimizer with stochastic gradients, where random batches of z_i and b_i are chosen at each iteration to determine the parameters. Unlike traditional CNN methods that are fast during testing/inference, the direct application of this scheme to the dynamic MRI setting is computationally expensive. We use approximations, including progressive-in-time optimization and an approximate data term that avoids non-uniform fast Fourier transforms, to significantly reduce the computational complexity of the algorithm.

The proposed approach is inspired by the deep image prior (DIP), which was introduced for static imaging problems [30], as well as its extension to dynamic imaging [31]. The key difference of the proposed formulation is the joint optimization of the latent variables z and the generator G. The work of Jin et al. [31] was originally developed for CINE MRI, where the latent variables were obtained by linearly interpolating noise variables at the first and last frames.
Their extension to real-time applications involved setting noise latent vectors at multiples of a preselected period, followed by linearly interpolating the noise variables. This approach is not ideally suited for applications with free breathing, where the motion is not periodic. Another key distinction is the use of regularization priors on the network parameters and latent vectors, which encourages the mapping to be an isometry between the latent and image spaces. Unlike DIP methods, the performance of the network does not significantly degrade with iterations. While we call our algorithm "generative SToRM", we note that our goal is not to generate random images from stochastic inputs as in generative adversarial networks (GANs). In particular, we do not use adversarial loss functions where a discriminator is jointly learned, as in the literature [32], [33].

II. BACKGROUND
A. Dynamic MRI from undersampled data: problem setup
Our main focus is to recover a series of images x_1, ..., x_M from their undersampled multichannel MRI measurements. The multidimensional dataset is often compactly represented by its Casorati matrix

X = [x_1, x_2, ..., x_M].  (1)

Each of the images is acquired by a different multichannel measurement operator,

b_i = A_i(x_i) + n_i,  (2)

where n_i is a zero-mean Gaussian noise matrix that corrupts the measurements.

B. Smooth manifold models for dynamic MRI
The smooth manifold methods model the images x_i in the dynamic time series as points on a smooth manifold M. These methods are motivated by continuous-domain formulations that recover a function f on the manifold from its measurements as

f^* = arg min_f Σ_i ‖f(x_i) − b_i‖² + λ ∫_M ‖∇_M f‖² dx,  (3)

where the regularization term involves the smoothness of the function on the manifold. This problem is adapted to the discrete setting to solve for images lying on a smooth manifold from their measurements as

X^* = arg min_X Σ_{i=1}^M ‖A_i(x_i) − b_i‖² + λ trace(X L X^H),  (4)

where L is the graph Laplacian matrix. L is the discrete approximation of the Laplace–Beltrami operator on the manifold, which depends on the structure or geometry of the manifold. The Laplacian matrix L is estimated from k-space navigators. Different approaches, ranging from proximity-based methods [8] to kernel low-rank regularization [10] and sparse optimization [12], have been introduced.

The results of earlier work [10], [28] show that the above manifold regularization penalties can be viewed as an analysis prior. In particular, these schemes rely on a fixed non-linear mapping ϕ of the images. The theory shows that if the images x_1, ..., x_M lie on a smooth manifold/surface or a union of manifolds/surfaces, the mapped points live on a subspace or a union of subspaces. The low-dimensional property of the mapped points ϕ(x_1), ..., ϕ(x_M) is used to recover the images from undersampled data or to derive the manifold using a kernel low-rank minimization scheme:

X^* = arg min_X Σ_{i=1}^M ‖A_i(x_i) − b_i‖² + λ ‖[ϕ(x_1), ..., ϕ(x_M)]‖_*.  (5)

This nuclear norm regularization scheme is minimized using an iteratively reweighted algorithm, whose intermediate steps match (4).
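The discrete penalty trace(X L X^H) in (4) can be made concrete with a small numerical sketch. Here the Laplacian is built from Gaussian-kernel similarities between the frames themselves; this construction, the 8-pixel toy frames, the kernel width, and the noise level are all illustrative assumptions, since in practice L is estimated from k-space navigators.

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """X: (n_pixels, M) Casorati matrix. Returns the M x M Laplacian L = D - W,
    where W holds Gaussian-kernel similarities between frames."""
    d2 = np.sum(np.abs(X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    W = np.exp(-d2 / sigma**2)
    np.fill_diagonal(W, 0.0)                     # no self-edges
    return np.diag(W.sum(axis=1)) - W

def storm_penalty(X, L):
    """trace(X L X^H) = (1/2) * sum_ij W_ij ||x_i - x_j||^2."""
    return np.real(np.trace(X @ L @ X.conj().T))

# A fixed L (playing the role of the navigator-derived Laplacian) penalizes a
# noisy candidate reconstruction more than the smooth series it was built from.
t = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X_smooth = np.outer(np.ones(8), np.cos(t))       # 8-pixel "images", smooth in time
L = graph_laplacian(X_smooth, sigma=2.0)
rng = np.random.default_rng(0)
X_noisy = X_smooth + rng.standard_normal(X_smooth.shape)
```

Because L is positive semi-definite with zero row sums, the penalty is non-negative and vanishes only when all connected frames are identical, which is exactly the smoothness the prior encodes.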
The non-linear mapping ϕ may be viewed as an analysis operator that transforms the original images to a low-dimensional latent subspace, very similar to the analysis sparsity-based approaches used in compressed sensing.

C. Unsupervised learning using Deep Image Prior
The recent work on DIP uses the structure of the network as a prior [30], enabling the recovery of images from ill-posed measurements without any training data. Specifically, DIP relies on the property that CNN architectures favor image data more than noise. The regularized reconstruction of an image from undersampled and noisy measurements is posed in DIP as

θ^* = arg min_θ ‖A(x) − b‖²  such that  x = G_θ[z],  (6)

where x = G_{θ^*}(z) is the recovered image, generated by the CNN generator G_{θ^*} whose parameters are denoted by θ. Here, z is the latent variable, which is chosen as random noise and kept fixed. The above optimization problem is often solved using stochastic gradient descent (SGD). Since CNNs are efficient in learning natural images, the solution often converges quickly to a good image. However, when iterated further, the algorithm also learns to represent the noise in the measurements if the generator has sufficient capacity, resulting in poor image quality. The general practice is to rely on early termination to obtain good results. This approach was recently extended to the dynamic setting by Jin et al. [31], where a sequence of random vectors was used as the input.

III. DEEP GENERATIVE SToRM MODEL
We now introduce a synthesis SToRM formulation for the recovery of images in a time series from undersampled data (see Fig. 1(b)). Rather than relying on a non-linear mapping of images to a low-dimensional subspace [10] (see Fig. 1(a)), we model the images in the time series as non-linear functions of latent vectors living in a low-dimensional subspace.
A. Generative model
We model the images in the time series as

x_i = G_θ(z_i),  i = 1, ..., M,  (7)

where G_θ is a non-linear mapping, termed the generator. Inspired by the extensive work on generative image models [30], [34], [35], we represent G_θ by a deep CNN whose weights are denoted by θ. The parameters z_i are the latent vectors, which live in a low-dimensional subspace. The non-linear mapping G_θ may be viewed as the inverse of the image-to-latent-space mapping ϕ considered in the SToRM approach.

Fig. 1. Illustration of (a) analysis SToRM and (b) generative SToRM. Analysis SToRM considers a non-linear (e.g., exponential) lifting of the data. If the original points lie on a smooth manifold, the lifted points lie on a low-dimensional subspace. The analysis SToRM cost function in (5) is the sum of the fit of the recovered images to the undersampled measurements and the nuclear norm of the lifted points. A challenge with analysis SToRM is its high memory demand and the difficulty of adding spatial regularization. The proposed method models the images as the non-linear mapping G_θ of some latent vectors z_i, which lie in a very low-dimensional space. Note that the same generator is used to model all the images in the dataset. The number of parameters of the generator and the latent variables is around the size of a single image, which implies a highly compressed representation. In addition, the structure of the CNN offers spatial regularization, as shown in DIP. The proposed algorithm in (13) estimates the parameters of the generator and the latent variables from the measured data. A distance regularization prior is added to the generator to ensure that nearby points in the latent subspace are mapped to nearby points on the manifold. Similarly, a temporal regularization prior is added to the latent variables. The optimization is performed using ADAM with batches of a few images.

We propose to estimate the parameters of the network θ as well as the latent vectors z_i by fitting the model to the undersampled measurements. The main distinction of our framework from DIP, which is designed for a single image, is that we use the same generator for all the images in the dynamic dataset. The latent vector z_i for each image is different and is also estimated from the measurements. This strategy allows us to exploit non-local information in the dataset. For example, in free-breathing cardiac MRI, the latent vectors of images with the same cardiac and respiratory phase are expected to be similar. When the gradient of the network is bounded, the output images at these time points are expected to be the same. The proposed framework is hence expected to learn a common representation from these time points, which are often sampled using different sampling trajectories. Unlike conventional manifold methods [8], [10], [12], the use of the CNN generator also offers spatial regularization. It is often impossible to acquire fully sampled training data in many free-breathing dynamic imaging applications, and a key benefit of this framework over conventional neural network schemes is that no training data is required.
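The joint estimation of the generator parameters and the per-frame latent vectors can be sketched with a dependency-free toy example. Here the CNN generator is replaced by a linear map G(z) = Wz and each sampling operator A_i by the identity; the variable names and sizes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_lat, n_frames = 16, 2, 20

# Synthesize "measurements" b_i from a ground-truth linear generator.
W_true = rng.standard_normal((n_pix, n_lat))
Z_true = rng.standard_normal((n_lat, n_frames))
B = W_true @ Z_true                              # b_i = A_i(x_i), with A_i = I

# Jointly fit the generator weights (theta) AND the latent vectors z_i
# by gradient descent on the data term, starting from small random values.
W = 0.1 * rng.standard_normal((n_pix, n_lat))    # "generator parameters"
Z = 0.1 * rng.standard_normal((n_lat, n_frames)) # latent vectors, one per frame

lr = 0.01
for _ in range(5000):
    R = W @ Z - B                                # residual of the data term
    # Simultaneous gradient steps on both unknowns (old W used for Z's step).
    W, Z = W - lr * (R @ Z.T), Z - lr * (W.T @ R)

rel_err = np.linalg.norm(W @ Z - B) / np.linalg.norm(B)
```

The same joint-descent structure carries over when W is replaced by a CNN and the identity by the undersampled Fourier operators, with ADAM in place of plain gradient descent.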
As discussed previously, the number of parameters of the model in (7) is orders of magnitude smaller than the number of pixels in the dataset. The dramatic compression offered by the representation, together with mini-batch training, provides a highly memory-efficient alternative to current manifold-based and low-rank/tensor approaches. Although our focus is on establishing the utility of the scheme in 2-D settings in this paper, the approach can be readily translated to higher-dimensional applications. Another benefit is the implicit spatial regularization brought in by the convolutional network, as discussed for DIP. We now introduce novel regularization priors on the network and the latent vectors to further constrain the recovery and reduce the need for manual early stopping.

B. Distance/Network regularization
As in the case of analysis SToRM regularization [8], [10], our interest is in generating a manifold model that preserves distances. Specifically, we would like nearby points in the latent space to map to similar images on the manifold. With this interest, we now study the relation between the Euclidean distance between the latent vectors and the shortest distance between the corresponding points on the manifold (the geodesic distance). We consider two points z_1 and z_2 in the latent space, which are fed to the generator to obtain G(z_1) and G(z_2), respectively. We have the following result, which relates the Euclidean distance ‖z_1 − z_2‖ to the geodesic distance dist_M(G(z_1), G(z_2)), the shortest distance on the manifold. The setting is illustrated in Fig. 2, where the geodesic distance is indicated by the red curve.

Proposition 1.
Let z_1, z_2 ∈ R^n be two nearby points in the latent space, with mappings denoted by G(z_1), G(z_2) ∈ M. Here, M = {G(z) | z ∈ R^n}. Then, the geodesic distance on the manifold satisfies

dist_M(G(z_1), G(z_2)) ≤ ‖z_1 − z_2‖ ‖J_z(G(z_1))‖_F.  (8)

Proof.
The straight line between the latent vectors is denoted by c(s), described in its curvilinear abscissa (i.e., ‖c′(s)‖ = 1 for all s), with c(0) = z_1 and c(L) = z_2, where L = ‖z_1 − z_2‖. We note that G may map this line to the black curve in Fig. 2, which may be longer than the geodesic. The length of the curve G[c(s)] is

d = ∫_0^L ‖∇_s G[c(s)]‖ ds.  (9)

Using the chain rule, and denoting the Jacobian matrix of G by J_z, we can bound the above length as

d = ∫_0^L ‖J_z(G) c′(s)‖ ds ≤ ∫_0^L ‖J_z(G)‖_F ‖c′(s)‖ ds = ‖J_z(G[z_1])‖_F ∫_0^L ds = ‖J_z(G[z_1])‖_F ‖z_1 − z_2‖.  (10)

We used the Cauchy–Schwarz inequality in the second step; in the last step, we use the fact that J_z G(c(s)) = J_z G(z_1) + O(s) when the points z_1 and z_2 are close. Since the geodesic distance is the shortest distance on the manifold, we have dist_M(G(z_1), G(z_2)) ≤ d, and hence we obtain (8). ∎

The result in (8) shows that the Frobenius norm of the Jacobian matrix ‖J_z G‖_F controls how far apart G maps two vectors that are close in the latent space. We would like points that are close in the latent space to map to nearby points on the manifold. We hence use the gradient of the map,

R_distance = ‖J_z(G(z))‖_F²,  (11)

as a regularization penalty. We note that the above penalty also encourages the learning of a mapping G such that the length of the curve G(c(s)) approaches the geodesic distance. The above penalty can also be thought of as a network regularization. Similar gradient penalties are used in machine learning to improve generalization ability and robustness to adversarial attacks [36].
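When an analytic Jacobian is unavailable, the quantity penalized in (11) can be approximated numerically with central differences along each latent direction. The toy mapping G below is a stand-in for the CNN generator and is purely illustrative.

```python
import numpy as np

def jacobian_fro_norm(G, z, eps=1e-5):
    """Approximate ||J_z G(z)||_F by central differences along each latent axis."""
    cols = []
    for k in range(z.size):
        e = np.zeros_like(z)
        e[k] = eps
        cols.append((G(z + e) - G(z - e)) / (2 * eps))
    J = np.stack(cols, axis=-1)          # (output_dim, latent_dim) Jacobian
    return np.linalg.norm(J)             # Frobenius norm for a 2-D array

# Toy stand-in for the generator: at z = 0 the exact Jacobian is
# [[1, 0], [0, 0], [0, 0]], whose Frobenius norm is 1.
G = lambda z: np.array([np.sin(z[0]), z[0] * z[1], np.cos(z[1])])
penalty = jacobian_fro_norm(G, np.zeros(2))
```

In practice, autograd frameworks compute the same Jacobian by differentiating through the generator, but the finite-difference form makes the penalty's meaning explicit: it measures how strongly the output moves per unit perturbation of the latent vector.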
The use of the gradient penalty is observed to be qualitatively equivalent to penalizing the norm of the weights of the network.

Fig. 2. Illustration of the distance penalty. The length of the curve connecting the images corresponding to z_1 and z_2 depends on the Frobenius norm of the Jacobian of the mapping G as well as the Euclidean distance ‖z_1 − z_2‖. We are interested in learning a mapping that preserves distances; we would like nearby points in the latent space to map to similar images. We hence use the norm of the Jacobian as the regularization prior, with the goal of preserving distances.

C. Latent vector regularization penalty
The time frames in a dynamic time series have extensive redundancy between adjacent frames, which is usually capitalized on by temporal gradient regularization. Directly penalizing the temporal gradient norm of the images requires the computation of the entire image time series, which is difficult when the entire time series is not optimized in every batch. We instead consider the norm of the finite differences between images, ‖∇_t G[z_t]‖. Using a Taylor series expansion, we obtain ∇_t G[z_t] = J_z(G[z_t]) ∇_t z_t + higher-order terms. We thus have

‖∇_t G[z_t]‖ ≈ ‖J_z(G[z_t]) ∇_t z_t‖ ≤ ‖J_z(G[z_t])‖ ‖∇_t z_t‖.  (12)

Since ‖J_z(G[z])‖ is small because of the distance regularization, we propose to add a temporal regularizer on the latent vectors. For example, when applied to free-breathing cardiac MRI, we expect the latent vectors to capture the two main contributors of motion: cardiac motion and respiratory motion. The temporal regularization encourages the cardiac and respiratory phases to change slowly in time.

D. Proposed optimization criterion
Based on the above analysis, we derive the parameters of the network θ and the low-dimensional latent vectors z_i, i = 1, ..., M, from the measured data by minimizing

C(z, θ) = Σ_{i=1}^M ‖A_i(G_θ[z_i]) − b_i‖²  +  λ_1 ‖J_z G_θ(z)‖²  +  λ_2 ‖∇_t z_t‖²  (13)

with respect to z and θ; the three terms are the data term, the distance regularization, and the latent regularization, respectively. We use the ADAM optimizer to determine the optimal parameters, and random initialization is used for the network parameters and the latent variables. A potential challenge with directly solving (13) is its high computational complexity. Unlike supervised neural network approaches that offer fast inference, the proposed approach optimizes the network parameters based on the measured data. This strategy will amount to a long reconstruction time when there are several image frames in the time series.

E. Strategies to reduce computational complexity
To minimize the computational complexity, we now introduce some approximation strategies.
1) Approximate data term for accelerated convergence:
When the data is measured using non-Cartesian sampling schemes, M non-uniform fast Fourier transform (NUFFT) evaluations are needed to evaluate the data term, where M is the number of frames in the dataset. Similarly, M inverse non-uniform fast Fourier transform (INUFFT) evaluations are needed for each back-propagation step. These NUFFT evaluations are computationally expensive, resulting in slow algorithms. In addition, most non-Cartesian imaging schemes oversample the center of k-space. Since the least-squares data term in (13) weights errors in the center of k-space more heavily than in the outer k-space regions, it is associated with slow convergence.

To speed up the intermediate computations, we propose to use gridding with density compensation, together with a projection step, for the initial iterations. Specifically, we use the approximate data term

D(z, θ) = Σ_{i=1}^M ‖P_i(G_θ[z_i]) − g_i‖²  (14)

instead of Σ_i ‖A_i(G_θ[z_i]) − b_i‖² in the early iterations. Here, the g_i are the gridding reconstructions

g_i = (A_i^H A_i)^† A_i^H b_i ≈ A_i^H W b_i,  (15)

where W is a diagonal matrix corresponding to multiplication by the density compensation factors. The operators P_i in (14) are projection operators:

P_i x = (A_i^H A_i)^† (A_i^H A_i) x ≈ (A_i^H W A_i) x.  (16)

We note that the term (A_i^H W A_i) x can be efficiently computed using Toeplitz embedding, which eliminates the need for expensive NUFFT and INUFFT steps. In addition, the use of density compensation serves as a preconditioner, resulting in faster convergence. Once the algorithm has approximately converged, we switch the loss back to the exact data term in (13), since it is optimal from a maximum-likelihood perspective.
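The density-compensated gridding reconstruction g = A^H W b of (15) can be illustrated with a 1-D toy, where the NUFFT is written as a direct (slow) nonuniform DFT matrix. With a uniform, fully sampled "trajectory" the density weights are 1/N and gridding reduces to the exact inverse DFT; the trajectory and sizes below are illustrative assumptions.

```python
import numpy as np

def nudft_matrix(k, n):
    """Direct nonuniform DFT: rows exp(-i k x) evaluated on the grid x = 0..n-1."""
    return np.exp(-1j * np.outer(k, np.arange(n)))

n = 32
x_true = np.random.default_rng(1).standard_normal(n)

k = 2 * np.pi * np.arange(n) / n        # uniform sampling locations (toy case)
A = nudft_matrix(k, n)
b = A @ x_true                           # simulated k-space measurements
w = np.full(n, 1.0 / n)                  # density-compensation factors (uniform)
g = A.conj().T @ (w * b)                 # gridding reconstruction: A^H W b
```

For a genuinely nonuniform trajectory (e.g., golden-angle spirals), the weights w are no longer constant, and A^H W A is only approximately the identity; the Toeplitz-embedding trick mentioned above then evaluates this operator with two FFTs instead of NUFFT/INUFFT pairs.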
2) Progressive training-in-time:
To further speed up the algorithm, we introduce a progressive training strategy, similar to the multi-resolution strategies used in image processing. In particular, we start with a single frame obtained by pooling the measured data from all the time frames. Since this average frame is well sampled, the algorithm promptly converges to the optimal solution. The corresponding network serves as a good initialization for the next step. Following convergence, we increase the number of frames. The optimal θ parameters from the previous step are used to initialize the generator, while the latent vectors are initialized by the interpolated version of the latent vectors from the previous step. This process is repeated until the desired number of frames is reached. This progressive training-in-time approach significantly reduces the computational complexity of the proposed algorithm. In this work, we used a three-step algorithm; however, the number of steps (levels) of training can be chosen based on the dataset. The progressive training-in-time approach is illustrated in Fig. 3.

IV. IMPLEMENTATION DETAILS AND DATASETS
A. Structure of the generator
The structure of the generator used in this work is given in Table I. The output images have two channels, which correspond to the real and imaginary parts of the MR images. Note that we have a parameter d in the network; this user-defined parameter controls the size of the generator, i.e., the number of trainable parameters. Another user-defined parameter, ℓ(z), denotes the number of elements in each latent vector. In this work, it is chosen as ℓ(z) = 2, as we have two motion patterns in cardiac images. We use leaky ReLU for all the non-linear activations, except at the output layer, where tanh activation is used.

TABLE I. Architecture of the generator G_θ, where ℓ(z) denotes the number of elements in each latent vector: a 1 × 1 × ℓ(z) latent input is followed by a cascade of convolution and upsampling layers whose channel counts scale with the user-defined parameter d, ending in a two-channel output.
This research study was conducted using data acquired from human subjects. The Institutional Review Board at the local institution (the University of Iowa) approved the acquisition of the data, and written consents were obtained from all subjects. The experiments reported in this paper are based on datasets collected in the free-breathing mode using the golden-angle spiral trajectory. We acquired eight datasets on a GE 3T scanner. One dataset was used to identify the optimal hyperparameters of all the algorithms in the proposed scheme. We then used these hyperparameters to generate the experimental results for all the remaining datasets reported in this paper. The sequence parameters for the datasets are: TR = 8.4 ms, FOV = 320 mm × 320 mm, flip angle = 18°, slice thickness = 8 mm. The datasets were acquired using a cardiac multichannel array with 34 channels. We used an automatic algorithm to pre-select the eight coils that provide the best signal-to-noise ratio in the region of interest; the removal of the coils with low sensitivities provided improved reconstructions [37]. We used a PCA-based coil combination using SVD such that the approximation error was small. With the repetition time used, each frame corresponds to ≈ 14 lines/frame, which roughly translates to an acceleration factor of 24. Moreover, each dataset has more than 500 frames. During reconstruction, we omit the first 20 frames in each dataset and use the next 500 frames for the SToRM reconstruction; this is then used as the simulated ground truth for comparisons. The experiments were run on a machine with an Intel Xeon CPU at 2.40 GHz and a Tesla P100-PCIE 16 GB GPU.

Fig. 3. Illustration of the progressive training-in-time approach. In the first level of training, the k-space data of all the frames are binned into one, and we solve for the average image at this level. Upon convergence of the first step, the parameters and latent variables are transferred as the initialization of the second step. In the second level of training, we divide the k-space data into M groups and reconstruct the M average images. Following convergence, we move to the final level of training, where the parameters obtained in the second step, and the linear interpolation of the latent vectors from the second step, are chosen as the initializations of the final step of training.

C. Quality evaluation metric
In this work, the quantitative comparisons are made using the signal-to-error ratio (SER) metric, in addition to the standard peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM), defined as

SER = 20 · log_10 ( ‖x_orig‖ / ‖x_orig − x_recon‖ ),

where x_orig and x_recon represent the ground truth and the reconstructed image, respectively. The unit of SER is decibels (dB). The SER metric requires a reference image, which is chosen as the SToRM reconstruction with 500 frames. However, we note that this reference may be imperfect and may suffer from blurring and related artifacts. Hence, we also consider the blind/referenceless image spatial quality evaluator (BRISQUE) [39] to score the image quality. The BRISQUE score is a perceptual score based on a support vector regression model trained on an image database with corresponding differential mean opinion score values. The training image dataset contains images with different distortions. A smaller score indicates better perceptual quality.

D. State-of-the-art methods for comparison
We compare the proposed scheme with recent state-of-the-art methods for free-breathing and ungated cardiac MRI. We note that while there are many deep learning algorithms for static MRI, those methods are not readily applicable to our setting.
• Analysis SToRM [9], [10], published in 2020: The manifold Laplacian matrix is estimated from k-space navigators using kernel low-rank regularization, followed by solving for the images using (4).
• Time-DIP [31], implemented based on the arXiv version available at the time of submission of this article: This is an unsupervised learning scheme that fixes the latent variables as noise and solves for the generator parameters. For real-time applications, Time-DIP chooses a preset period, and the noise vectors of the frames corresponding to multiples of the period are chosen as independent Gaussian variables [31]. The latent variables of the intermediate frames are obtained using linear interpolation. We chose a period of 20 frames, which roughly corresponds to the period of the heartbeats.
• Low-rank [2]: The image frames in the time series are recovered using nuclear norm minimization.
E. Hyperparameter tuning
We used one of the acquired datasets to identify the hyperparameters of the proposed scheme. Since we do not have access to a fully sampled dataset, we used the SToRM reconstruction from 500 images (an acquisition time of ∼25 seconds) as a reference. The smoothness parameter λ of this method was manually selected to obtain the best recovery, as in the literature [9]. All of the comparisons relied on image recovery from 150 frames (an acquisition time of 7.5 seconds). The hyperparameter tuning yielded d = 40 and the two regularization parameters λ₁ and λ₂ for the proposed approach. We demonstrate the impact of tuning d in Fig. 6, while the impact of choosing λ₁ and λ₂ is shown in Fig. 4. Hyperparameter optimization of SToRM from 150 frames yielded the corresponding optimal smoothness parameter λ. For Time-DIP, we follow the network design of Jin et al. [31], where the generator consists of multiple layers of convolution and upsampling operations. To ensure a fair comparison, we used a similar architecture, where the base size of the network was tuned to obtain the best results.

Fig. 4. Illustration of the impact of the regularization terms in the proposed scheme with d = 24: (a) performance comparison; (b) latent codes with both terms; (c) without distance regularization; (d) without latent regularization; (e) visual and quantitative comparisons. We considered three cases: (1) using both regularizations, (2) using only the latent regularization, and (3) using only the network regularization; these correspond to the blue, orange, and yellow curves in (a). Panels (b), (c), and (d) show the learned latent vectors for the three cases, and panel (e) shows the visual and quantitative comparisons.

We use a three-step progressive training strategy. In the first step, only the network parameters are updated, and 1000 epochs are used. In the second step, both the network parameters and the latent variables are updated, each with its own learning rate; 600 epochs are used in this stage. In the final step, training continues in the same fashion for another 700 epochs.

V. EXPERIMENTS AND RESULTS
A. Impact of different regularization terms
We first study the impact of the two regularization terms in (13). The parameter d, which controls the size of the network (see Table I), was chosen as d = 24 in this case. In Fig. 4(a), we plot the reconstruction performance as a function of the number of epochs for three scenarios: (1) using both regularization terms; (2) using only the latent regularization; and (3) using only the distance/network regularization. In the experiment, we use 500 frames of SToRM reconstruction (∼25 seconds of acquisition), referred to as "SToRM500", as the reference for SER computations. We tested the reconstruction performance for the three scenarios using 150 frames, which corresponds to around 7.5 seconds of acquisition. From the plot, we observe that without the network regularization, the SER degrades with increasing epochs, similar to the behavior of DIP; in this case, an early stopping strategy is needed to obtain good recovery. The latent vectors corresponding to this setting are shown in Fig. 4(c), which shows mixing between the cardiac and respiratory waveforms. When the latent regularization is not used, the SER plot is roughly flat, but the latent variables show significant mixing, which translates to blurred reconstructions. By contrast, when both the network and latent regularizations are used, the algorithm converges to a better solution. The latent variables are also well decoupled; the blue curve captures the respiratory motion, while the orange one captures the cardiac motion. The reconstructions agree well with the SToRM reconstructions. The network now learns meaningful mappings, which translate to improved reconstructions compared to those obtained without the regularizers.
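The two priors discussed above can be sketched as penalty terms around a generic generator. The toy linear "generator", latent trajectory, and weight of 0.1 below are illustrative assumptions, and the network-smoothness (distance) term is approximated by finite differences over latent perturbations rather than an exact gradient norm:

```python
import numpy as np

def latent_smoothness(z):
    """Temporal smoothness of the latent trajectory: sum ||z_{t+1} - z_t||^2."""
    return float(np.sum((z[1:] - z[:-1]) ** 2))

def network_smoothness(gen, z, eps=1e-3):
    """Finite-difference surrogate for the gradient-norm penalty:
    penalize how strongly small latent perturbations change the output."""
    total = 0.0
    for t in range(len(z)):
        for j in range(z.shape[1]):
            dz = np.zeros_like(z[t])
            dz[j] = eps
            total += np.sum((gen(z[t] + dz) - gen(z[t])) ** 2) / eps ** 2
    return float(total)

# Toy linear "generator" standing in for the CNN G_theta.
W = np.array([[1.0, 0.5], [0.0, 2.0], [0.3, 0.3]])
gen = lambda zt: W @ zt

z = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1]])  # 3 frames, 2 latent dims
loss = latent_smoothness(z) + 0.1 * network_smoothness(gen, z)
print(round(loss, 3))  # → 1.659
```

In the actual scheme, both terms would be added to the data-fidelity loss and minimized jointly over the network parameters and latent variables; the sketch only shows how the two penalties act on a latent trajectory.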
B. Benefit of progressive training-in-time approach
In Fig. 5, we demonstrate the significant reduction in runtime offered by the progressive training strategy described in Section III-E2. Here, we consider recovery from 150 frames with and without the progressive strategy. Both regularization priors were used, and d was chosen as 24. We plot the reconstruction performance, measured by the SER, as a function of running time. The SER plots show that the proposed scheme converges in a fraction of the time, while the direct approach takes more than 2000 seconds. We also note from the SER plots that the solution obtained using progressive training is superior to the one obtained without it.

Fig. 5. Comparison of the reconstruction performance with and without the progressive training-in-time strategy using d = 40. The plot of SER vs. running time shows that progressive training-in-time yields better results with much less running time than training without it. Two reconstructed frames near the end of systole and diastole using SToRM500, the proposed scheme with progressive training-in-time, and the proposed scheme without it are also shown for comparison. The average BRISQUE scores for the three cases are reported in the figure.

C. Impact of size of the network
The architecture of the generator G_θ is given in Table I. Note that the size of the network is controlled by the user-defined parameter d, which dictates the number of convolution filters and hence the number of trainable parameters. In this section, we investigate the impact of d on the reconstruction performance. We tested the reconstruction performance for several values of d, including d = 8, 16, 24, and 40; the results are shown in Fig. 6. From the figure, we see that when d = 8 or d = 16, the generator network is too small to capture the dynamic variations. When d = 8, the generator is unable to capture either the cardiac or the respiratory motion. When d = 16, part of the respiratory motion is recovered, while the cardiac motion is still lost. The best SER scores with respect to SToRM with 500 frames are obtained for d = 24, while the lowest BRISQUE scores are obtained for d = 40. We also observe that features such as the papillary muscles and myocardium in the d = 40 results appear sharper than those of SToRM with 500 frames, even though the proposed reconstructions were performed from only 150 frames. We use d = 40 for the subsequent comparisons in the paper.

D. Comparison with the state-of-the-art methods
In this section, we compare our proposed scheme with several state-of-the-art methods for the reconstruction of dynamic images.

In Fig. 7, we compare the region of interest for SToRM500, SToRM with 150 frames (SToRM150), the proposed method with two different d values, the unsupervised Time-DIP approach, and the low-rank algorithm. From Fig. 7, we observe that the proposed scheme significantly reduces errors in comparison to SToRM150. Additionally, the proposed scheme captures the motion patterns better than Time-DIP, while the low-rank method is unable to capture them.

Fig. 6. Impact of network size on reconstruction performance. In the experiments, we varied d from 8 to 40 to investigate the reconstruction performance. We used 500 frames for the SToRM reconstructions (SToRM500) as the reference for SER comparisons; the reconstructions under investigation used 150 frames. The diastolic and systolic states and the temporal profiles are shown for each case, along with the BRISQUE scores and average SER. It is worth noting that when d = 40, the results are even less blurred than the SToRM500 results, even though only one-third of the data is used.

TABLE II
QUANTITATIVE COMPARISONS BASED ON SIX DATASETS: WE USED SIX DATASETS TO OBTAIN THE AVERAGE SER, PSNR, SSIM, BRISQUE SCORE, AND TIME USED FOR RECONSTRUCTION.

Methods     SToRM500   SToRM150   Proposed   Time-DIP
SER (dB)    NA         –          –          –
PSNR (dB)   NA         –          –          –
SSIM        NA         –          –          –
BRISQUE     –          –          –          –
Time (min)  47         13         17         57

From the time profile in Fig. 7, we notice that the proposed scheme is capable of recovering the abrupt change in blood-pool contrast between diastole and systole. This change is due to in-flow effects associated with gradient echo (GRE) acquisitions: blood entering the heart from regions outside the slice has not experienced the previous slice-selective excitation pulses, and the difference in magnetization between this fresh blood and the blood that was within the slice results in the abrupt change in intensity. We note that some of the competing methods, such as Time-DIP and low-rank, blur these details.

We also perform comparisons on a different dataset in Fig. 8, where we compare the proposed scheme with SToRM500, SToRM150, Time-DIP, and the low-rank approach. From the figure, we see that the proposed reconstructions appear less blurred than those of the conventional schemes.

We also compared the proposed scheme with SToRM500, SToRM150, and the unsupervised Time-DIP approach quantitatively. We omit the low-rank method here because it often failed on some datasets. The quantitative comparisons are shown in Table II; we used SToRM500 as the reference for the SER, PSNR, and SSIM calculations, and the results are averaged over six datasets.

Finally, we illustrate the proposed approach in Fig. 9 and Fig. 10. The proposed approach decouples the latent vectors corresponding to the cardiac and respiratory phases well, as shown in the representative examples in Fig. 9(a) and Fig. 10(a).

Fig. 7. Comparisons with the state-of-the-art methods: (a) visual comparisons; (b) time profiles. The first column of (a) corresponds to the reconstructions from 500 frames (∼25 s of acquisition time), while the remaining columns are recovered from 150 frames (∼7.5 s). The proposed (d = 40) reconstructions exhibit less blurring and fewer artifacts when compared to SToRM150 and the competing methods.

Fig. 8. Comparisons with the state-of-the-art methods: (a) visual comparisons; (b) time profiles. The first column of (a) corresponds to the reconstructions from 500 frames (∼25 s of acquisition time), while the remaining columns are recovered from 150 frames (∼7.5 s) with d = 40 for the proposed scheme. We observe that the proposed reconstructions appear less blurred than those of the conventional schemes.

Fig. 9. Illustration of the framework of the proposed scheme with d = 40 on the first dataset: (a) latent vectors; (b) systole in end-expiration (E-E); (c) systole in end-inspiration (E-I); (d) diastole in end-expiration (E-E); (e) diastole in end-inspiration (E-I). We plot the latent variables of 150 frames in a time series; a thin green line surrounds the liver in each image frame to indicate the respiratory phase. The latent vectors corresponding to the four phases are indicated in the plot of the latent vectors.

Fig. 10. Illustration of the framework of the proposed scheme with d = 40: (a) latent vectors; (b) systole in E-E; (c) systole in E-I; (d) diastole in E-E; (e) diastole in E-I. We plot the latent variables of 150 frames in a time series. The latent vectors corresponding to the four phases are indicated in the plot of the latent vectors.

VI. CONCLUSION
In this work, we introduced an unsupervised generative SToRM framework for the recovery of free-breathing cardiac images from spiral acquisitions. This work assumes that the images are generated by a non-linear CNN-based generator G_θ, which maps low-dimensional latent variables to high-resolution images. Unlike traditional supervised CNN methods, the proposed approach does not require any training data; the parameters of the generator and the latent variables are directly estimated from the undersampled data. The key benefit of this generative model is its ability to compress the data, which results in a memory-efficient algorithm. To improve the performance, we introduced a network/distance regularization and a latent variable regularization. The combination of these priors ensures the learning of representations that preserve distances and guarantee the temporal smoothness of the recovered images; the regularized approach provides improved reconstructions while minimizing the need for early stopping. To reduce the computational complexity, we introduced a fast approximation of the data loss term as well as a progressive training-in-time strategy. These approximations result in an algorithm whose computational complexity is comparable to that of our prior SToRM algorithm. The main benefits of the proposed scheme are improved performance and considerably reduced memory demand. While our main focus in this work was to establish the benefits of the framework in 2D, we plan to extend it to 3D applications in the future.

ACKNOWLEDGEMENT
The authors would like to thank Ms. Melanie Laverman from the University of Iowa for making editorial corrections to refine this paper.

REFERENCES
[1] J. Tsao, P. Boesiger, and K. P. Pruessmann, "k-t BLAST and k-t SENSE: Dynamic MRI with high frame rate exploiting spatiotemporal correlations," Magnetic Resonance in Medicine, vol. 50, no. 5, pp. 1031–1042, 2003.
[2] S. G. Lingala, Y. Hu, E. DiBella, and M. Jacob, "Accelerated dynamic MRI exploiting sparsity and low-rank structure: k-t SLR," IEEE Transactions on Medical Imaging, vol. 30, no. 5, pp. 1042–1054, 2011.
[3] L. Feng et al., "Golden-angle radial sparse parallel MRI: Combination of compressed sensing, parallel imaging, and golden-angle radial sampling for fast and flexible dynamic volumetric MRI," Magnetic Resonance in Medicine, vol. 72, no. 3, pp. 707–717, 2014.
[4] L. Feng, L. Axel, H. Chandarana, K. T. Block, D. K. Sodickson, and R. Otazo, "XD-GRASP: Golden-angle radial MRI with reconstruction of extra motion-state dimensions using compressed sensing," Magnetic Resonance in Medicine, vol. 75, no. 2, pp. 775–788, 2016.
[5] A. G. Christodoulou et al., "Magnetic resonance multitasking for motion-resolved quantitative cardiovascular imaging," Nature Biomedical Engineering, vol. 2, no. 4, pp. 215–226, 2018.
[6] C. Prieto et al., "Highly efficient respiratory motion compensated free-breathing coronary MRA using golden-step Cartesian acquisition," Journal of Magnetic Resonance Imaging, vol. 41, no. 3, pp. 738–746, 2015.
[7] A. Bustin, N. Fuin, R. M. Botnar, and C. Prieto, "From compressed-sensing to artificial intelligence-based cardiac MRI reconstruction," Frontiers in Cardiovascular Medicine, vol. 7, p. 17, 2020.
[8] S. Poddar and M. Jacob, "Dynamic MRI using smoothness regularization on manifolds (SToRM)," IEEE Transactions on Medical Imaging, vol. 35, no. 4, pp. 1106–1115, 2015.
[9] A. H. Ahmed, R. Zhou, Y. Yang, P. Nagpal, M. Salerno, and M. Jacob, "Free-breathing and ungated dynamic MRI using navigator-less spiral SToRM," IEEE Transactions on Medical Imaging, 2020.
[10] S. Poddar, Y. Q. Mohsin, D. Ansah, B. Thattaliyath, R. Ashwath, and M. Jacob, "Manifold recovery using kernel low-rank regularization: Application to dynamic imaging," IEEE Transactions on Computational Imaging, vol. 5, no. 3, pp. 478–491, 2019.
[11] U. Nakarmi, Y. Wang, J. Lyu, D. Liang, and L. Ying, "A kernel-based low-rank (KLR) model for low-dimensional manifold recovery in highly accelerated dynamic MRI," IEEE Transactions on Medical Imaging, vol. 36, no. 11, pp. 2297–2307, 2017.
[12] U. Nakarmi, K. Slavakis, and L. Ying, "MLS: Joint manifold-learning and sparsity-aware framework for highly accelerated dynamic magnetic resonance imaging," IEEE, 2018, pp. 1213–1216.
[13] G. Wang, "A perspective on deep imaging," IEEE Access, vol. 4, pp. 8914–8924, 2016.
[14] G. Wang, M. Kalra, and C. G. Orton, "Machine learning will transform radiology significantly within the next 5 years," Medical Physics, vol. 44, no. 6, pp. 2041–2044, 2017.
[15] G. Dardikman-Yoffe and Y. C. Eldar, "Learned SPARCOM: Unfolded deep super-resolution microscopy," arXiv preprint arXiv:2004.09270, 2020.
[16] J. C. Ye, Y. Han, and E. Cha, "Deep convolutional framelets: A general deep learning framework for inverse problems," SIAM Journal on Imaging Sciences, vol. 11, no. 2, pp. 991–1048, 2018.
[17] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, "Deep convolutional neural network for inverse problems in imaging," IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4509–4522, 2017.
[18] V. Monga, Y. Li, and Y. C. Eldar, "Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing," arXiv preprint arXiv:1912.10557, 2019.
[19] A. Pramanik, H. K. Aggarwal, and M. Jacob, "Deep generalization of structured low-rank algorithms (Deep-SLR)," IEEE Transactions on Medical Imaging, 2020.
[20] S. Wang et al., "Accelerating magnetic resonance imaging via deep learning," IEEE, 2016, pp. 514–517.
[21] S. Wang et al., "DeepcomplexMRI: Exploiting deep residual network for fast parallel MR imaging with complex convolution," Magnetic Resonance Imaging, vol. 68, pp. 136–147, 2020.
[22] S. U. H. Dar, M. Özbey, A. B. Çatlı, and T. Çukur, "A transfer-learning approach for accelerated MRI using deep neural networks," Magnetic Resonance in Medicine, vol. 84, no. 2, pp. 663–685, 2020.
[23] Y. Han et al., "Deep learning with domain adaptation for accelerated projection-reconstruction MR," Magnetic Resonance in Medicine, vol. 80, no. 3, pp. 1189–1205, 2018.
[24] T. Sanchez et al., "Scalable learning-based sampling optimization for compressive dynamic MRI," in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 8584–8588.
[25] S. U. H. Dar et al., "Prior-guided image reconstruction for accelerated multi-contrast MRI via generative adversarial networks," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 6, pp. 1072–1087, 2020.
[26] S. U. H. Dar et al., "Image synthesis in multi-contrast MRI with conditional generative adversarial networks," IEEE Transactions on Medical Imaging, vol. 38, no. 10, pp. 2375–2388, 2019.
[27] M. Yurt, S. U. H. Dar, A. Erdem, E. Erdem, and T. Çukur, "mustGAN: Multi-stream generative adversarial networks for MR image synthesis," arXiv preprint arXiv:1909.11504, 2019.
[28] Q. Zou, S. Poddar, and M. Jacob, "Sampling of planar curves: Theory and fast algorithms," IEEE Transactions on Signal Processing, vol. 67, no. 24, pp. 6455–6467, 2019.
[29] Q. Zou and M. Jacob, "Recovery of surfaces and functions in high dimensions: Sampling theory and links to neural networks," SIAM Journal on Imaging Sciences, in press.
[30] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep image prior," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9446–9454.
[31] K. H. Jin, H. Gupta, J. Yerly, M. Stuber, and M. Unser, "Time-dependent deep image prior for dynamic MRI," arXiv preprint arXiv:1910.01684, 2019.
[32] A. Bora, E. Price, and A. G. Dimakis, "AmbientGAN: Generative models from lossy measurements," in International Conference on Learning Representations, 2018.
[33] K. Kazuhiro et al., "Generative adversarial networks for the creation of realistic artificial brain magnetic resonance images," Tomography, vol. 4, no. 4, p. 159, 2018.
[34] I. Goodfellow et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[35] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.
[36] D. Varga, A. Csiszárik, and Z. Zombori, "Gradient regularization improves accuracy of discriminative models," arXiv preprint arXiv:1712.09936, 2017.
[37] R. Zhou et al., "Free-breathing cine imaging with motion-corrected reconstruction at 3T using spiral acquisition with respiratory correction and cardiac self-gating (SPARCS)," Magnetic Resonance in Medicine, vol. 82, no. 2, pp. 706–720, 2019.
[38] D. O. Walsh, A. F. Gmitro, and M. W. Marcellin, "Adaptive reconstruction of phased array MR imagery," Magnetic Resonance in Medicine, vol. 43, no. 5, pp. 682–690, 2000.
[39] A. Mittal, A. K. Moorthy, and A. C. Bovik, "No-reference image quality assessment in the spatial domain," IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.