Python Non-Uniform Fast Fourier Transform (PyNUFFT): multi-dimensional non-Cartesian image reconstruction package for heterogeneous platforms and applications to MRI
Jyh-Miin Lin a,b

a Department of Radiology, University of Cambridge, Cambridge CB2 0QQ, [email protected]
b Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
Abstract
This paper reports the development of a Python Non-Uniform Fast Fourier Transform (PyNUFFT) package, which accelerates non-Cartesian image reconstruction on heterogeneous platforms. Scientific computing with Python encompasses a mature and integrated environment. The NUFFT algorithm has been extensively used for non-Cartesian image reconstruction, but previously there was no native Python NUFFT library. The current PyNUFFT software enables multi-dimensional NUFFT on heterogeneous platforms. PyNUFFT also provides several solvers, including the conjugate gradient method, ℓ1 total-variation regularized ordinary least squares (L1TV-OLS) and ℓ1 total-variation regularized least absolute deviation (L1TV-LAD). The acceleration factors are 6.3 - 9.5× on a 32 thread CPU platform and 5.4 - 13× on the GPU.

Keywords:
Heterogeneous system architecture (HSA), graphic processing unit (GPU), multi-core system, metaprogramming, ℓ1 regularization

Preprint submitted to J. Parallel Distrib. Comput. October 10, 2017

Highlights

• A Python non-uniform fast Fourier transform (PyNUFFT) package was developed.
• Computations are accelerated on heterogeneous systems, including multi-core CPU and GPU.
• It provides the accelerated nonlinear conjugate gradient method, ℓ1 total-variation regularized least squares (L1TV-OLS) and least absolute deviation (L1TV-LAD) solvers.
• Pre-indexing enables multi-dimensional NUFFT and fast image gradients on heterogeneous platforms.
• The single-precision version is dual-licensed under MIT and LGPL-3.0.
• Applications to magnetic resonance imaging reconstruction are presented.

1. Introduction
Fast Fourier transform (FFT) is an exact fast algorithm to compute the discrete Fourier transform when data are acquired on an equispaced grid. In certain image processing fields, however, the frequency locations are irregularly distributed, which obstructs the use of the FFT. The alternative Non-Uniform Fast Fourier Transform (NUFFT) algorithm is a fast mapping for computing non-equispaced frequency components. Several previous non-Cartesian image reconstructions are summarized in the discussion section (Section 4.1).

Python is a fully-fledged and well-supported programming language in data science. The importance of the Python language is manifested in the recent surge of interest in machine learning. Developers have increasingly relied on Python to build software, taking advantage of its abundant libraries and active community. Yet the standard Python numerical environments lack a native implementation of NUFFT, and an efficient Python NUFFT may fill a gap in the image processing field. However, Python has been notorious for its slow execution speed, which hampers the implementation of an efficient Python NUFFT.

During the past decade, the speed of Python has been greatly improved by numerical libraries with rapid array manipulations and vendor-provided performance libraries. However, parallel computing using the multi-threading model cannot be easily implemented in Python. This problem is mostly due to the Global Interpreter Lock (GIL) of the Python interpreter, which allows only one thread to execute at a time, so the multi-threading capabilities of modern symmetric multiprocessing (SMP) processors cannot be exploited.

Recently, general purpose graphic processing unit (GPGPU) computing has enabled enormous acceleration by offloading computations to hundreds to several thousands of parallel processing units on GPUs. This emerging programming model may enable one to develop an efficient Python NUFFT package by circumventing the limitations of the GIL.
Two GPGPU architectures are commonly used: the proprietary Compute Unified Device Architecture (CUDA, NVIDIA, Santa Clara, CA, USA) and the Open Computing Language (OpenCL, Khronos Group, Beaverton, OR, USA). These two similar architectures can be ported to each other, or could be dynamically generated from Python code [1].

Thus, we have developed the Python NUFFT (PyNUFFT) package, which is optimized for heterogeneous systems, including multi-core CPU and GPU. The design of PyNUFFT aims to reduce the run-time while maintaining the readability of the Python program. PyNUFFT contains the following features: (1) algorithms written and tested for heterogeneous platforms (including multi-core CPU and GPU); (2) pre-indexing to handle multi-dimensional NUFFT and image gradients; (3) several nonlinear solvers.
2. Material and methods
The PyNUFFT software was developed on a 64-bit Linux system with a quad-core CPU (Intel i7-6700HQ processor, 6M cache, up to 3.50 GHz) and a GPU (NVIDIA GeForce GTX 965M, 1024 CUDA cores at 944 MHz base clock with boost, 2 GB GDDR5 memory at 2500 MHz memory clock). The current version was modified from a previous version which was accelerated with CUDA [2]. The new version is built on PyOpenCL, PyCUDA [1] and Reikna [3], licensed under dual MIT and GNU Lesser General Public License v3.0 (LGPL-3.0) [4]. Thus, PyNUFFT may be used in a variety of projects.
The execution of PyNUFFT proceeds through three major stages: (1) scaling, (2) oversampled FFT, and (3) interpolation. The three stages can be formulated as the composition of linear operations:

A = VFS        (1)

where A is the NUFFT, S is the scaling, F is the FFT, and V is the interpolation. The default kernel function in PyNUFFT is the min-max kernel [5], whereas other kernel functions can be used. The design of the three-stage NUFFT algorithm is illustrated in Figure 1(A). Batch-mode NUFFT can be performed with serial processing or parallel processing.

Scaling was performed by in-place multiplication with the cMultiplyVecInplace subroutine. The complex multi-dimensional array was offloaded to the device.
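The three-stage decomposition of Equation (1) can be sketched in a few lines of NumPy/SciPy. This is an illustrative model of the pipeline, not the PyNUFFT API: the names, the 1D layout, and the identity "interpolator" in the example are stand-ins.

```python
import numpy as np
import scipy.sparse as sp

def forward_nufft(x, s, outlist, Kd, V):
    """Toy forward NUFFT A = V F S (Eq. 1), 1D for brevity.

    x       : (N,) image
    s       : (N,) scaling factors (stage S)
    outlist : (N,) precomputed flat indexes placing x on the Kd grid
    Kd      : oversampled grid length
    V       : (M, Kd) sparse interpolator (stage V)
    """
    xx = x * s                          # stage 1: scaling (S)
    k = np.zeros(Kd, dtype=complex)
    k[outlist] = xx                     # pre-indexed zero-padding
    k = np.fft.fft(k)                   # stage 2: oversampled FFT (F)
    return V @ k                        # stage 3: interpolation (V)

# trivial example: unit scaling and an identity "interpolator",
# so the result is just the FFT of the zero-padded image
N, Kd = 8, 16
x = np.arange(N, dtype=complex)
y = forward_nufft(x, np.ones(N), np.arange(N), Kd, sp.eye(Kd, format="csr"))
```

With a real kernel, V holds a few non-zeros per row that pull each non-uniform frequency location off the oversampled grid.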
This stage is composed of two steps: (1) zero-padding, which copies the small array to a larger array, and (2) FFT. The first step recomputed the array indexes on the fly with logical operations, a step which is not well
Figure 1: (A) A 2D example of the forward PyNUFFT. (B) The methods provided in PyNUFFT. Forward NUFFT can be decomposed into three stages: scaling, FFT, and interpolation.

supported on GPUs. Here, a generalized pre-indexing procedure was implemented to avoid matrix reshaping on the fly: the cSelect subroutine copies array1 to array2 according to the precomputed indexes order1 and order2. This pre-indexing avoids multi-dimensional matrix reshaping on the fly, thus greatly simplifying the algorithm on GPU platforms. In addition, pre-indexing can be generalized to multi-dimensional arrays (with sizes N_in, N_out). Once the indexes (inlist, outlist) are obtained, the input array can be rapidly copied to a larger array. No matrix reshaping is needed during iterations.

The inverse FFT is based on the same subroutines as the oversampled FFT, but it reverses the order of computations. Thus, an IFFT is followed by array copying.

2.1.3. Interpolation

While the current PyNUFFT includes the min-max interpolator [5], other kernels could also be used. The scaling factor of the min-max interpolator was designed to minimize the error of off-grid samples [5]. The interpolation kernel is stored in the Compressed Sparse Row (CSR) format for C-order (row-major) arrays. Thus, the indexing was accommodated for C-order and the cSparseMatVec subroutine could quickly compute the interpolation without matrix reshaping. cSparseMatVec has been optimized to exploit data coalescence and parallel threads on heterogeneous platforms. The warp in CUDA (or wavefront in OpenCL) controls the size of the workgroups in the cSparseMatVec kernel. Note that the indexing of a C-ordered array is different from the Fortran-ordered (F-order) arrays used in MATLAB.

Gridding is the conjugate transpose of the interpolation, which also uses the same cSparseMatVec subroutine.
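As a concrete illustration of keeping the interpolation in CSR form, the sketch below builds a simple 2-tap linear kernel; this hypothetical kernel merely stands in for the min-max kernel, and gridding is then the conjugate transpose applied with the same sparse data.

```python
import numpy as np
import scipy.sparse as sp

def build_interpolator(om, K):
    """Sparse (M, K) CSR matrix pulling non-uniform frequency locations
    om (in grid units, 0 <= om < K-1) off a K-point oversampled grid.
    A 2-tap linear kernel stands in for the min-max kernel here."""
    lo = np.floor(om).astype(int)           # left neighbour on the grid
    w = om - lo                             # fractional distance
    rows = np.repeat(np.arange(len(om)), 2)
    cols = np.stack([lo, lo + 1], axis=1).ravel()
    vals = np.stack([1 - w, w], axis=1).ravel()
    return sp.csr_matrix((vals, (rows, cols)), shape=(len(om), K))

V = build_interpolator(np.array([0.5, 2.25]), K=8)
k = np.arange(8.0)          # oversampled k-space grid values
y = V @ k                   # interpolation: grid -> non-uniform samples
g = V.conj().T @ y          # gridding: conjugate transpose of V
```

A real kernel has more taps per row and a 2D/3D tensor-product structure, but the CSR storage and the matvec/transposed-matvec pair are the same.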
Adjoint NUFFT reverses the order of the forward NUFFT. Each stage is the conjugate transpose (Hermitian transpose) of the corresponding stage of the forward NUFFT:

A^H = S^H F^H V^H        (2)

which is also illustrated in Figure 2. In iterative algorithms, the cost function is used to represent data consistency:

x̂ = argmin_x ‖Ax − y‖²        (3)

in which the normal equation is composed of interpolation (A) and gridding (A^H):

A^H A x = A^H y        (5)

Thus, precomputing A^H A can improve the run-time efficiency [6]. See Figure 3.

Figure 2: (A) The adjoint PyNUFFT. (B) The methods provided in PyNUFFT. Adjoint NUFFT can be decomposed into three stages: gridding, IFFT, and rescaling.
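The adjoint relation of Equation (2) can be verified numerically through the inner-product identity ⟨y, Ax⟩ = ⟨A^H y, x⟩; in this sketch a random complex matrix stands in for the NUFFT operator.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 16, 8
# a random complex matrix stands in for the NUFFT operator A
A = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
y = rng.standard_normal(M) + 1j * rng.standard_normal(M)

lhs = np.vdot(y, A @ x)           # <y, A x>
rhs = np.vdot(A.conj().T @ y, x)  # <A^H y, x>
assert np.allclose(lhs, rhs)      # holds for any correct adjoint pair
```

The same check applies stage by stage: S^H is elementwise conjugate scaling, F^H is the (scaled) inverse FFT, and V^H is the transposed sparse matvec.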
The NUFFT package provides solvers to restore multi-dimensional images (or 1D signals in the time-domain) from non-equispaced frequency samples. A great number of reconstruction methods exist for non-Cartesian image reconstruction. These methods are usually categorized into three families: (1) density compensation and gridding, (2) least square regression in k-space, and (3) iterative NUFFT.

The sampling density compensation method generates a tapering function w (the sampling density compensation function), which can be represented as the following stable iterations [7]:

w_{i+1} = w_i / (A A^H w_i)        (6)

Once w is prepared, the sampling density compensated reconstruction is calculated by:

x = A^H (y · w)        (7)

Figure 3: (A) The self-adjoint PyNUFFT (Toeplitz). (B) The methods provided in PyNUFFT. The Toeplitz method precomputes the interpolation-gridding matrix (V^H V), which can be accelerated on CPU and GPU. See Table 1 for measured acceleration factors.

Least square regression is a solution to the inverse problem of image reconstruction. Consider the following problem:

x = argmin_x ‖y − Ax‖²        (8)

• Conjugate gradient method (cg): A solution to Equation (8) is the two-step solution:

k = argmin_k ‖y − Vk‖²        (9)

The most expensive step is to find k in Equation (9). Once k has been solved, the inverse of FS can be computed by the inverse FFT and then divided by the scaling factor:

x = S^{-1} F^{-1} k        (10)

The conjugate gradient method is an iterative solution when the sparse matrix is a symmetric Hermitian matrix:

V^H y = V^H V k, solve for k        (11)

Then each iteration generates the residue, which is used to compute the next value. The conjugate gradient method is also provided for heterogeneous systems.

• Other iterative least square solvers: Scipy [8] provides a variety of least square solvers, including lsmr, lsqr, bicg, bicgstab, gmres and lgmres. These solvers were integrated into the CPU version of PyNUFFT.
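The stable iteration of Equation (6) is easy to prototype. In this sketch the interpolation-gridding normal operator A A^H is replaced by a positive diagonal matrix purely to show the fixed-point behaviour; the function name and signature are illustrative, not the PyNUFFT API.

```python
import numpy as np

def density_compensation(AAH, M, n_iter=20):
    """Pipe-Menon-style iteration w_{i+1} = w_i / (A A^H w_i) (Eq. 6).

    AAH : (M, M) array standing in for the interpolation-gridding
    normal operator; in practice this would be a sparse operator."""
    w = np.ones(M)
    for _ in range(n_iter):
        w = w / np.abs(AAH @ w)
    return w

# toy operator: for AAH = diag(d) with d > 0, the fixed point is w = 1/d,
# reached after a single iteration
d = np.array([1.0, 2.0, 4.0])
w = density_compensation(np.diag(d), M=3)
```

For the least square problem of Equation (8), the CPU version can hand the normal equation to the scipy.sparse.linalg iterative solvers (cg, lsmr, lsqr, etc.), as the text describes.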
Iterative NUFFT reconstruction solves the inverse problem with various forms of image regularization. Due to the large size of the interpolation and FFT, iterative NUFFT is computationally expensive. Here, PyNUFFT is also optimized for iterative NUFFT reconstructions on heterogeneous systems.

• Pre-indexing for fast image gradient: Total-variation is a basic image regularization which has been extensively used in image denoising and image reconstruction. The image gradient computes the difference between adjacent pixels, which is represented as follows:

∇_i x = x(..., a_i + 1, ...) − x(..., a_i, ...)        (12)

where a_i is the index of the i-th axis. Computing the image gradient requires image rolling, followed by subtracting the two images. However, multi-dimensional image rolling on heterogeneous platforms is expensive, and PyNUFFT adopts pre-indexing to save run-time. This pre-indexing procedure generates the indexes for the rolled image, and the indexes are offloaded to the heterogeneous platform before the iterative algorithm begins. Thus, computing the image rolling is not needed during the iterations. The advantage of this pre-indexing procedure is demonstrated in Figure 4, in which pre-indexing makes the image gradient run faster on CPU and GPU.

• ℓ1 total-variation regularized least squares (L1TV-OLS): The ℓ1 total-variation regularized reconstruction solves:

x = argmin_x ( μ‖y − Ax‖² + λ TV(x) )        (13)

where TV(x) is the total-variation of the image:

TV(x) = Σ ( |∇_x x| + |∇_y x| )        (14)

Here, ∇_x and ∇_y are directional gradient operators applied to the image domain along the x and y axes.
Equation (14) is solved by the variable-splitting method, which has previously been developed [9, 10, 11]:

K = μ A^H A + λ ∇_x^T ∇_x + λ ∇_y^T ∇_y        (15)

The inner iterations are as follows:

rhs^k = μ (A^H y)^k + λ ∇_x^T (d_x^k − b_x^k) + λ ∇_y^T (d_y^k − b_y^k)        (16)
x^{k+1} = K^{-1} rhs^k        (17)
d_x^{k+1} = shrink(∇_x x^{k+1} + b_x^k, 1/λ)        (18)
d_y^{k+1} = shrink(∇_y x^{k+1} + b_y^k, 1/λ)        (19)
b_x^{k+1} = b_x^k + (∇_x x^{k+1} − d_x^{k+1})        (20)
b_y^{k+1} = b_y^k + (∇_y x^{k+1} − d_y^{k+1})        (21)

The outer iteration is:

(A^H y)^{k+1} = (A^H y)^k + A^H y − A^H A x^{k+1}        (22)

Figure 4: Pre-indexing for the fast image gradient (∇ and ∇^T). The image gradient can be slow if the indexes are recalculated in each iteration. Pre-indexing saves the time needed for recalculation of the image gradient during iterations.

• ℓ1 total-variation regularized least absolute deviation (L1TV-LAD): Least absolute deviation (LAD) is a statistical regression model which is robust to non-stationary noise distributions [12]. It is possible to solve the ℓ1 total-variation regularized LAD problem:

x = argmin_x ( μ‖y − Ax‖_1 + λ TV(x) )        (23)

where TV(x) is the total-variation of the image. Note that LAD replaces the squared ℓ2 data fidelity of Equation (13) with the ℓ1 norm. The variable-splitting method leads to:

K = μ A^H A + λ ∇_x^T ∇_x + λ ∇_y^T ∇_y        (24)

The inner iterations of ℓ1TV-LAD are:

rhs^k = μ (A^H y^k + d_f^k − b_f^k) + λ ∇_x^T (d_x^k − b_x^k) + λ ∇_y^T (d_y^k − b_y^k)        (25)
x^{k+1} = K^{-1} rhs^k        (26)
d_x^{k+1} = shrink(∇_x x^{k+1} + b_x^k, 1/λ)        (27)
d_y^{k+1} = shrink(∇_y x^{k+1} + b_y^k, 1/λ)        (28)
d_f^{k+1} = shrink(A^H A x^{k+1} − A^H y + b_f^k, 1/μ)        (29)
b_x^{k+1} = b_x^k + (∇_x x^{k+1} − d_x^{k+1})        (30)
b_y^{k+1} = b_y^k + (∇_y x^{k+1} − d_y^{k+1})        (31)
b_f^{k+1} = b_f^k + (A^H A x^{k+1} − A^H y − d_f^{k+1})        (32)

The outer iteration is:

(A^H y)^{k+1} = (A^H y)^k + A^H y − A^H A x^{k+1}        (33)

Note that Equation (29) is a shrinkage function, which is quickly solved on CPU as well as on heterogeneous systems.
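Two building blocks of the iterations above can be sketched directly: the pre-indexed image gradient of Equation (12) and the shrink (soft-thresholding) operator of Equations (18)-(19) and (27)-(29). Names here are illustrative, not PyNUFFT internals.

```python
import numpy as np

def plan_gradient_indices(shape, axis):
    """Precompute flat indexes of each pixel's neighbour along `axis`
    (with wrap-around) so iterations need no rolling at all."""
    idx = np.arange(np.prod(shape)).reshape(shape)
    return np.roll(idx, -1, axis=axis).ravel()

def gradient(x_flat, nbr):
    """Eq. (12): difference between adjacent pixels via one gather."""
    return x_flat[nbr] - x_flat

def shrink(v, t):
    """Complex soft-thresholding: (v/|v|) * max(|v| - t, 0)."""
    mag = np.abs(v)
    return np.where(mag > 0, v / np.maximum(mag, 1e-30), 0) * np.maximum(mag - t, 0)

x = np.arange(12.0).reshape(3, 4)
nbr = plan_gradient_indices(x.shape, axis=1)   # done once, before iterating
g = gradient(x.ravel(), nbr).reshape(x.shape)  # used inside each iteration
```

On the GPU, the precomputed index array is offloaded once and the per-iteration gradient reduces to an indexed gather and a subtraction, which is the point of the pre-indexing procedure.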
• Multi-coil regularized image reconstruction: In multi-coil regularized image reconstruction, the selfadjoint NUFFT in Equation (5) is extended to multi-channel data:

A_multi^H A_multi x = Σ_{i=1}^{N_c} A_i^H A_i x        (34)
                    = Σ_{i=1}^{N_c} conj(c_i) · A^H A · (c_i · x)        (35)

where the coil sensitivities (c_i) of the multiple channels multiply each channel before the NUFFT (A) and after the adjoint NUFFT (A^H).

2.3. Applications to brain MRI

A 3T brain MRI template [14] was used to simulate the non-Cartesian MRI acquisition. The image was resized to 512 ×
512 and the non-Cartesian k-space trajectory [15] was used to simulate the data.

PyNUFFT was tested on a multi-core CPU cluster and a GPU instance of a cloud-based Linux web service. All computations were completed with single-precision (FP32) complex numbers. The configurations of the CPU and GPU systems were as follows.

• Multi-core CPU: The CPU instance (m4.16xlarge, Amazon Web Services) is equipped with 64 vCPUs (Intel E5 2686 v4) and 61 GB of memory. The number of vCPUs can be dynamically controlled by the CPU hotplug functionality of the Linux system, and computations are offloaded to the Intel OpenCL CPU device with 1 to 64 threads. PyNUFFT was executed on the multi-core CPU instance with 1 - 64 threads. The PyNUFFT transforms were offloaded to the OpenCL CPU device and were executed 20 times. The times required for the transforms are compared with the run times on the single thread CPU. Iterative reconstructions were also tested on the multi-core CPU. We measured the execution times of the conjugate gradient method, L1TV-OLS and L1TV-LAD.

• GPU: The GPU instance (p2.xlarge, Amazon Web Services) is equipped with 4 vCPUs (Intel E5 2686 v4) and one Tesla K80 (NVIDIA, Santa Clara, CA, USA) with two GK210 GPUs. Each GPU is composed of 2496 parallel processing cores and 12 GB of memory. Computations are compiled and offloaded to one GPU by the CUDA and OpenCL APIs. Transforms were repeated 20 times to measure the averaged run times. Iterative reconstructions were also tested on the K80 GPU, and the execution times of the conjugate gradient method, L1TV-OLS and L1TV-LAD were measured.

3. Results

This work focuses on accelerating the PyNUFFT package on heterogeneous platforms, and a complete image quality assessment was not carried out in this paper. Some visual results of MRI reconstructions can be seen in Figure 5. Iterative NUFFT using L1TV-OLS and L1TV-LAD results in fewer ripples in the brain structure than the sampling density compensation method.
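The 20-repetition averaged timing protocol used for these benchmarks can be mimicked with a small harness; `benchmark` is a hypothetical helper, and for GPU kernels the device queue must be flushed before and after timing so launch latency is not mistaken for completion.

```python
import time

def benchmark(fn, n_repeat=20, sync=None):
    """Average wall-clock time of fn() over n_repeat runs.

    sync : optional callable flushing the device queue, e.g. a
    PyOpenCL queue.finish or PyCUDA context.synchronize."""
    if sync:
        sync()                         # finish any pending work first
    t0 = time.perf_counter()
    for _ in range(n_repeat):
        fn()
    if sync:
        sync()                         # wait for the last kernel to finish
    return (time.perf_counter() - t0) / n_repeat

t = benchmark(lambda: sum(range(1000)))
```

The single-thread baseline and the 32-thread/GPU runs can then be compared as ratios, which is how the acceleration factors in Table 1 are expressed.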
Figure 5: Simulated results of brain MRI. The brain template was reconstructed with two iterative NUFFT algorithms (L1TV-OLS and L1TV-LAD) and the sampling density compensation method. Compared with the sampling density compensation method, fewer ripples can be seen in L1TV-OLS and L1TV-LAD.

Table 1: Run time (and acceleration) of subroutines and solvers

Operations            1 vCPU    32 vCPUs         PyOpenCL K80      PyCUDA K80
Scaling (S)           709 µs    557 µs (1.3×)    133 µs (5.3×)     300 µs (2.4×)
FFT (F)               12.6 ms   2.03 ms (6.2×)   0.78 ms (16.2×)   1.12 ms (11.3×)
Interpolation (V)     25 ms     2 ms (12×)       4 ms (6×)         4 ms (6×)
Gridding (V^H)        24 ms     2 ms (12×)       5 ms (4.8×)       4.5 ms (5.4×)
IFFT (F^H)            16 ms     3.1 ms (5.16×)   4.0 ms (4×)       3.5 ms (4.6×)
Rescaling (S^H)       566 µs    595 µs (0.95×)   122 µs (4.64×)    283 µs (2×)
V^H V                 4.79 ms   —                0.18 ms (26×)     0.15 ms (31×)
Forward (A)           39 ms     5 ms (7.8×)      6 ms (6.5×)       6 ms (6.5×)
Adjoint (A^H)         38 ms     6 ms (6.3×)      7 ms (5.4×)       6 ms (6.3×)
Selfadjoint (A^H A)   34.7 ms   11 ms (3.15×)    9 ms (3.86×)      8 ms (4.34×)

Solvers               1 vCPU    32 vCPUs         PyOpenCL K80      PyCUDA K80
Conjugate gradient    11.8 s    2.97 s (4×)      1.32 s (8.9×)     1.89 s (6.3×)
L1TV-OLS              14.7 s    3.26 s (4.5×)    1.79 s (8.2×)     1.68 s (8.7×)
L1TV-LAD              15.1 s    3.62 s (4.2×)    1.93 s (7.8×)     1.78 s (8.5×)

• Multi-core CPU: Table 1 lists the detailed run-times of the different stages of forward NUFFT, adjoint NUFFT, and selfadjoint NUFFT (Toeplitz). The overall execution speed of PyNUFFT was faster on the multi-core CPU platform than on the single thread CPU. Yet the acceleration factors of each stage varied from 0.95 (no acceleration) to 15.6. Compared with computations on a single thread, 32 threads accelerate interpolation and gridding by a factor of 12, and the FFT and IFFT were accelerated by a factor of 6.2 - 15.6. The benefits of 32 threads are limited for certain computations, including scaling, rescaling and interpolation-gridding (V^H V).
In these computations, the acceleration factors of 32 threads range from 0.95 to 1.85. This limited performance gain is due to the high efficiency of the single thread CPU, which is already fast and leaves limited room for improvement. In particular, the integrated interpolation-gridding (V^H V) is already 10 times faster than the separate interpolation and gridding sequence. On a single-thread CPU, V^H V requires only 4.79 ms, whereas the separate interpolation (V) and gridding (V^H) require 49 ms. In this case, 32 threads deliver only an extra 83% of performance to V^H V.

Figure 6 illustrates the acceleration on the multi-core CPU against the single thread CPU. The performance of PyNUFFT improved by a factor of 5 - 10 when the number of threads increases from 1 to 20, and the software achieves peak performance with 30 - 32 threads (equivalent to 15 - 16 physical CPU cores). More than 32 threads bring no substantial benefit in performance. Forward NUFFT, adjoint NUFFT and selfadjoint NUFFT (Toeplitz) are accelerated on 32 threads with acceleration factors of 7.8 - 9.5. The acceleration factors of the iterative solvers (conjugate gradient method, L1TV-OLS and L1TV-LAD) are 4 - 4.5 on 32 threads.

• GPU: Table 1 shows that the GPU delivers a generally faster PyNUFFT transform, with acceleration factors ranging from 2 to 31. Scaling and rescaling show a moderate degree of acceleration. The most significant acceleration occurs in the interpolation-gridding (V^H V), in which the GPU is 26 - 31 times faster than the single-thread CPU. This acceleration is greater than the acceleration factors for separate interpolation (V, with 6× acceleration) and gridding (V^H, with 4 - 4.6× acceleration). Forward NUFFT, adjoint NUFFT and selfadjoint NUFFT (Toeplitz) are accelerated on the K80 GPU by factors of 5.4 - 13. Iterative solvers on the GPU are 6.3 - 8.9 times faster than single-thread, and about 2× faster than 32 threads.
4. Discussion
Different kernel functions (interpolators) are available in previous NUFFT implementations, including: (1) the min-max interpolator [5]; (2) fast radial basis functions [16, 17]; (3) the least square interpolator [18]; (4) the least mean-square error interpolator [19]; (5) fast Gaussian summation [20]; (6) the Kaiser-Bessel function [21]; (7) the linear system transfer function or inverse reconstruction [22, 23].

NUFFTs have been implemented in different programming languages: (1) Matlab (Mathworks Inc, MA, USA) [5, 19, 21, 24]; (2) C++ [25]; (3) CUDA [21, 25]; (4) Fortran [20]; (5) OpenACC using the PGI compiler (PGI Compilers & Tools) [26].

NUFFT has been accelerated on single and multiple GPUs. Fast iterative NUFFT using the Kaiser-Bessel function was accelerated on GPU with total-variation regularization [25] and generalized total-variation regularization [21]. A real-time inverse reconstruction was developed in [27] and by Murphy et al. [28], but this inverse reconstruction does not perform the full interpolation and gridding during iterations. The patent of Nadar et al. [29] describes a custom multi-GPU buffer to improve the memory access for image reconstruction with non-uniform k-space. The mripy package applies the Numba compiler in its NUFFT with a Gaussian kernel (https://github.com/peng-cao/mripy).

Other than NUFFT, the iterative discrete Fourier transform (DFT) is slower than NUFFT, and GPUs can bring enormous acceleration factors to iterative reconstruction [30, 31]. Still, iterative DFTs on GPUs may be slower than iterative NUFFTs on GPUs.

The NUFFT transforms (forward, adjoint, and Toeplitz) have been accelerated on multi-core CPU and GPU. In particular, the benefits of fast iterative solvers (including least square and iterative NUFFT) have been shown in the results of the benchmarks. The image reconstruction times (with 100 iterations) for one 256 ×
256 image are less than 4 seconds on a 32 thread CPU platform and less than 2 seconds on the GPU platform.

The current PyNUFFT has been tested with computations using single-precision floating point numbers (FP32). However, the number of FP64 units on a GPU is only a fraction of the number of FP32 units, which would reduce the performance of PyNUFFT with FP64 (to 1/3 of the FP32 performance).
5. Conclusion

An open-source PyNUFFT package can accelerate non-Cartesian image reconstruction on multi-core CPU and GPU platforms. The acceleration factors are 6.3 - 9.5× on a 32 thread CPU platform and 5.4 - 13× on a Tesla K80 GPU. The iterative solvers with 100 iterations can be completed within 4 seconds on the 32 thread CPU platform and 2 seconds on the GPU.

Acknowledgements

The work was supported by the Ministry of Science and Technology (MOST), Taiwan, partly by the Cambridge Commonwealth, European and International Trust (Cambridge, UK), as well as the Ministry of Education, Taiwan. The benchmarks were carried out on Amazon Web Services provided by AWS Educate credit. J.-M. Lin declares no conflict of interest.
6. References

[1] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, A. Fasih, PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation, Parallel Computing 38 (3) (2012) 157-174.

[2] J.-M. Lin, H.-W. Chung, PyNUFFT: python non-uniform fast Fourier transform for MRI, in: Building Bridges in Medical Sciences 2017, St John's College, Cambridge CB2 1TP, UK, 2017.

[3] B. Opanchuk, Reikna, a pure Python GPGPU library, http://reikna.publicfields.net/, Accessed: 2017-10-09.

[4] Free Software Foundation, GNU General Public License.

[5] J. Fessler, B. P. Sutton, Nonuniform fast Fourier transforms using min-max interpolation, IEEE Transactions on Signal Processing 51 (2) (2003) 560-574.

[6] J. Fessler, S. Lee, V. T. Olafsson, H. R. Shi, D. C. Noll, Toeplitz-based iterative image reconstruction for MRI with correction for magnetic field inhomogeneity, IEEE Transactions on Signal Processing 53 (9) (2005) 3393-3402.

[7] J. G. Pipe, P. Menon, Sampling density compensation in MRI: Rationale and an iterative numerical solution, Magnetic Resonance in Medicine 41 (1) (1999) 179-186.

[8] E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python (2001-).

[9] J.-M. Lin, A. J. Patterson, H.-C. Chang, T.-C. Chuang, H.-W. Chung, M. J. Graves, Whitening of colored noise in PROPELLER using iterative regularized PICO reconstruction, in: Proceedings of the 23rd Annual Meeting of International Society for Magnetic Resonance in Medicine, Toronto, Canada, 2015, p. 3738.

[10] J.-M. Lin, A. J. Patterson, C.-W. Lee, Y.-F. Chen, T. Das, D. Scoffings, H.-W. Chung, J. Gillard, M. Graves, Improved identification and clinical utility of pseudo-inverse with constraints (PICO) reconstruction for PROPELLER MRI, in: Proceedings of the 24th Annual Meeting of International Society for Magnetic Resonance in Medicine, Singapore, 2016, p. 1773.

[11] J.-M. Lin, S.-Y. Tsai, H.-C. Chang, H.-W. Chung, H. C. Chen, Y.-H. Lin, C.-W. Lee, Y.-F. Chen, D. Scoffings, T. Das, J. H. Gillard, A. J. Patterson, M. J. Graves, Pseudo-inverse constrained (PICO) reconstruction reduces colored noise of PROPELLER and improves the gray-white matter differentiation, in: Proceedings of the 25th Annual Meeting of International Society for Magnetic Resonance in Medicine, Honolulu, HI, USA, 2017, p. 1524.

[12] L. Wang, The penalized LAD estimator for high dimensional linear regression, Journal of Multivariate Analysis 120 (2013) 135-151. doi:10.1016/j.jmva.2013.04.001.

http://arxiv.org/abs/1701.08361

[28] M. Murphy, M. Alley, J. Demmel, K. Keutzer, S. Vasanawala, M. Lustig, Fast ℓ1-SPIRiT compressed sensing parallel imaging MRI: Scalable parallel implementation and clinically feasible runtime, IEEE Transactions on Medical Imaging 31 (6) (2012) 1250-1262.

[29] M. S. Nadar, S. Martin, A. Lefebvre, J. Liu, Multi-GPU FISTA implementation for MR reconstruction with non-uniform k-space sampling, US Patent App. 14/031,374 (2013).

[30] S. S. Stone, J. P. Haldar, S. C. Tsao, W.-m. W. Hwu, Z.-P. Liang, B. P. Sutton, Accelerating advanced MRI reconstructions on GPUs, in: Proceedings of the 5th Conference on Computing Frontiers, CF '08, ACM, New York, NY, USA, 2008, pp. 261-272. doi:10.1145/1366230.1366276.

[31] J. Gai, N. Obeid, J. L. Holtrop, X.-L. Wu, F. Lam, M. Fu, J. P. Haldar, W. H. Wen-mei, Z.-P. Liang, B. P. Sutton, More IMPATIENT: A gridding-accelerated Toeplitz-based strategy for non-Cartesian high-resolution 3D MRI on GPUs, Journal of Parallel and Distributed Computing 73 (5) (2013) 686-697.
Figure 6: Acceleration factors versus the number of vCPUs (2 × physical cores) for the forward NUFFT, adjoint NUFFT, selfadjoint NUFFT (Toeplitz), conjugate gradient method, L1 total variation (L1TV-OLS) and L1 total variation LAD (L1TV-LAD).