cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs

Yu-hsuan Shih
Courant Institute of Mathematical Sciences, New York University
New York, NY, USA

Garrett Wright
PACM, Princeton University
Princeton, NJ, USA

Joakim Andén
Department of Mathematics, KTH Royal Institute of Technology
Stockholm, Sweden

Johannes Blaschke
National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory
Berkeley, CA, USA

Alex H. Barnett
Center for Computational Mathematics, Flatiron Institute
New York, NY, USA
Abstract—Nonuniform fast Fourier transforms dominate the computational cost in many applications including image reconstruction and signal processing. We thus present a general-purpose GPU-based CUDA library for type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform) transforms in dimensions 2 and 3, in single or double precision. It achieves high performance for a given user-requested accuracy, regardless of the distribution of nonuniform points, via cache-aware point reordering and load-balanced blocked spreading in shared memory. At low accuracies, this gives on-GPU throughputs of around $10^9$ nonuniform points per second, and (even including host-device transfer) is typically 4–10$\times$ faster than the latest parallel CPU code FINUFFT (at 28 threads). It is competitive with two established GPU codes, being up to 90$\times$ faster at high accuracy and/or for type 1 clustered point distributions. Finally we demonstrate a 6–18$\times$ speedup versus CPU in an X-ray diffraction 3D iterative reconstruction task, observing excellent multi-GPU weak scaling up to one rank per GPU.

Index Terms—Nonuniform FFT, GPU, load balancing.
I. INTRODUCTION
Nonuniform (or nonequispaced) fast Fourier transforms (NUFFTs) are fast algorithms that generalize the FFT to the case of off-grid points. They thus have a wealth of applications in engineering and scientific computing, including image reconstruction from off-grid Fourier data (e.g. MRI gridding [1], optical coherence tomography [2], cryo electron microscopy [3]–[7], radioastronomy [8], coherent diffraction X-ray imaging [9]); wave diffraction [10]; partial differential equations [11], [12]; and long-range interactions in molecular [13] and particle dynamics [14]. For reviews, see [15]–[18].

In 2D, the type 1 NUFFT [15] (also known as the "adjoint nonequispaced fast Fourier transform" or "adjoint NFFT" [17]) evaluates the uniform $N_1 \times N_2$ grid of Fourier series coefficients $f_{k_1,k_2}$ due to a set of point masses of arbitrary strengths $c_j$ and locations $(x_j, y_j) \in [-\pi,\pi)^2$, $j = 1,\dots,M$:
$$f_{k_1,k_2} := \sum_{j=1}^{M} c_j\, e^{-i(k_1 x_j + k_2 y_j)}, \qquad (k_1,k_2) \in \mathcal{I}_{N_1,N_2}, \qquad (1)$$
where the 1D integer Fourier frequency grid is defined by
$$\mathcal{I}_N := \{k \in \mathbb{Z} : -N/2 \le k < N/2\}, \qquad (2)$$
and we use the notation $\mathcal{I}_{N_1,N_2} := \mathcal{I}_{N_1} \times \mathcal{I}_{N_2}$ for a 2D grid of Fourier frequencies.

The type 2 NUFFT (or "NFFT" [17]) is the adjoint of the type 1. Given a grid of Fourier coefficients $f_{k_1,k_2}$, it evaluates the resulting Fourier series at arbitrary (generally nonuniform) targets $(x_j, y_j) \in [-\pi,\pi)^2$, to give
$$c_j := \sum_{(k_1,k_2) \in \mathcal{I}_{N_1,N_2}} f_{k_1,k_2}\, e^{i(k_1 x_j + k_2 y_j)}, \qquad j = 1,\dots,M. \qquad (3)$$
In contrast to the FFT, type 1 is generally not the inverse of type 2: inverting a NUFFT usually requires iterative solution of a linear system [17], [19]. Definitions (1) and (3) generalize to 1D and 3D in the obvious fashion [15].

Naively the exponential sums in (1) and (3) take $O(NM)$ effort, where $N := N_1 \times \cdots \times N_d$ is the total number of Fourier modes, and $d$ the dimension.
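As a concrete reference point, the naive $O(NM)$ sums (1) and (3) can be sketched in a few lines of NumPy. This is our own illustrative code, not part of any library; the names `nufft2d1_direct` and `nufft2d2_direct` are ours. Since type 2 is the adjoint of type 1, the two evaluations must satisfy $\langle A c, f\rangle = \langle c, A^{\mathsf H} f\rangle$ up to roundoff, which the last lines check.

```python
import numpy as np

def freq_grid(N):
    # 1D integer frequency grid I_N = {-N/2, ..., N/2 - 1} of Eq. (2)
    return np.arange(-(N // 2), (N + 1) // 2)

def nufft2d1_direct(x, y, c, N1, N2):
    # type 1, Eq. (1): point masses -> uniform Fourier modes, O(N*M) cost
    k1 = freq_grid(N1)[:, None, None]          # shape (N1, 1, 1)
    k2 = freq_grid(N2)[None, :, None]          # shape (1, N2, 1)
    return np.sum(c * np.exp(-1j * (k1 * x + k2 * y)), axis=-1)

def nufft2d2_direct(x, y, f):
    # type 2, Eq. (3): uniform Fourier modes -> nonuniform targets
    N1, N2 = f.shape
    k1 = freq_grid(N1)[:, None, None]
    k2 = freq_grid(N2)[None, :, None]
    return np.sum(f[:, :, None] * np.exp(1j * (k1 * x + k2 * y)), axis=(0, 1))

rng = np.random.default_rng(0)
M, N1, N2 = 40, 8, 6
x, y = rng.uniform(-np.pi, np.pi, (2, M))
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)
f = rng.standard_normal((N1, N2)) + 1j * rng.standard_normal((N1, N2))
# adjointness: <type1(c), f> equals <c, type2(f)> up to roundoff
lhs = np.vdot(f, nufft2d1_direct(x, y, c, N1, N2))
rhs = np.vdot(nufft2d2_direct(x, y, f), c)
assert abs(lhs - rhs) <= 1e-8 * (1 + abs(lhs))
```

Such a direct evaluation is useful as a ground truth when testing the fast algorithms below, but is far too slow for the problem sizes of interest.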
The NUFFT uses fast algorithms [15], [20] to approximate these sums to a user-prescribed tolerance $\varepsilon$, typically with effort of only $O(N \log N + M \log^d(1/\varepsilon))$, i.e., quasi-linear in the data size. Most algorithms internally set up a "fine" grid of size $n_1 \times \cdots \times n_d$, where each $n_i = \sigma N_i$, for a given upsampling factor $\sigma > 1$. Then the type 1 transform has three steps:
i) spreading (convolution) of each weighted nonuniform point by a localized kernel, writing into the fine grid,
ii) performing a $d$-dimensional FFT of this fine grid, then
iii) selecting the central $N$ output modes from this fine grid, after pointwise division (deconvolution) by the kernel Fourier coefficients.
Type 2 simply reverses (transposes) these steps, with i) becoming kernel-weighted interpolation from the fine grid to the nonuniform target points. See Sec. II for details.

The NUFFT is often the rate-limiting step in applications, especially for iterative reconstruction [1], motivating the need for high throughput. Spreading and interpolation are often the dominant steps of NUFFTs, due to scattered memory writes and reads of kernel-sized blocks. Since this task demands high memory bandwidth, yet is data parallel, it is well suited to acceleration by a general-purpose GPU [21].

This potential of GPUs to accelerate the NUFFT has of course been noted, and to some extent exploited, in prior implementations [22]–[25]. However, the present work shows that it is possible to increase the efficiency significantly beyond that of prior codes, on the same hardware, via algorithmic innovations. With few exceptions [22], most prior GPU NUFFT implementations are packaged in a manner specific to a single science application (e.g. MRI [23]–[25], OCT [2], MD [13], or cryo-EM [5]), have unknown or limited accuracy, and lack mathematical documentation and testing, rendering them almost inaccessible to the general user. This motivates the need for an efficient, tested, general-purpose GPU code.

A. Contributions of this work
We present cuFINUFFT, a general-purpose GPU-based CUDA NUFFT library. It exploits a recently-developed kernel with optimally small width across the full range of user-chosen tolerances, yet more efficient to evaluate than prior ones [18], [26]. Its throughput is high, and largely insensitive to the point distribution.

One main contribution is to accelerate spreading in the type 1 NUFFT. Broadly speaking there have been two styles of parallelization in prior work: "input driven" [27] (or scatter [5]), which assigns one thread to each nonuniform point, and "output driven" [5], [17], [25], [28] (gather), which assigns each thread a distinct portion of the fine output grid to spread to. The input driven scheme, accumulating to GPU global memory, has been used in many prior GPU codes [14], [22], [29]; we will refer to our implementation of this baseline method as GM (global memory). While load-balanced, its memory access is arbitrary and can suffer from atomic collisions between writes (see Sec. IV-A). Yet a naive output driven approach, although collision-free, is poorly load-balanced for highly nonuniform point distributions [18, Rmk. 12]. To address these issues we propose two new spreading methods:
• GM-sort (global memory, sorted). This improves upon GM in that the work order of the nonuniform points is chosen by spatial sorting into bins (boxes) covering the fine grid. This regularizes the memory access pattern, enabling cache reuse.
• SM (shared memory). This sets up spreading "subproblems" executed in faster GPU shared memory. This is a hybrid scheme (see Fig. 1): each subproblem has a local copy of the fine grid lying within the kernel half-width of one bin (output driven), yet is load-balanced by capping its subset of nonuniform points (input driven). Local fine grids are added back into global memory using far fewer global atomic operations than GM methods, avoiding collisions. The result is 2–10$\times$ faster than GM-sort (depending on $d$ and clustering).
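As a CPU-side illustration of these two ingredients (not the library's CUDA code; all function and variable names here are ours), the bin-sort permutation used by GM-sort and the subproblem capping used by SM can be sketched as follows:

```python
import numpy as np

def bin_sort(x, y, n1, n2, m1=32, m2=32):
    # permutation t ordering points by the fine-grid bin (box) containing
    # them; bins ordered with the x axis fast and the y axis slow
    h1, h2 = 2 * np.pi / n1, 2 * np.pi / n2
    l1 = np.floor((x + np.pi) / h1).astype(int)   # fine-grid coordinates
    l2 = np.floor((y + np.pi) / h2).astype(int)
    nb1 = -(-n1 // m1)                            # ceil(n1/m1) bins along x
    bin_id = (l2 // m2) * nb1 + (l1 // m1)
    return np.argsort(bin_id, kind="stable"), bin_id

def make_subproblems(t, bin_id, M_sub):
    # split each bin's sorted point list into chunks of at most M_sub
    # indices; each chunk is one load-balanced "subproblem"
    subs, start = [], 0
    ordered = bin_id[t]
    for b in np.unique(ordered):
        end = start + int(np.count_nonzero(ordered == b))
        subs += [t[s:min(s + M_sub, end)] for s in range(start, end, M_sub)]
        start = end
    return subs

rng = np.random.default_rng(1)
M, n1, n2 = 5000, 256, 256
x, y = rng.uniform(-np.pi, np.pi, (2, M))
t, bin_id = bin_sort(x, y, n1, n2)
subs = make_subproblems(t, bin_id, M_sub=50)
assert np.all(np.diff(bin_id[t]) >= 0)            # points now in bin order
assert all(len(s) <= 50 for s in subs)            # load-balancing cap holds
```

On the GPU the sort is done with counting-sort-style kernels and each subproblem becomes one thread block; this sketch only shows the index bookkeeping.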
Bin sizes and shapes have been hand-tuned for performance; this is crucial for SM to ensure optimal use of limited shared memory.

Turning to the interpolation task in type 2, we propose to use the adjoint version of GM-sort, where grid writes are replaced by reads. We will also refer to this algorithm as GM-sort. For both tasks, while bin-sorting nonuniform points adds time, it accelerates the execution of spreading/interpolation. Thus our library uses a "plan, setup, execute, destroy" interface that allows efficient reuse of the same (sorted) nonuniform points with new strength vectors (e.g. new $c_j$ in (1)). This use case is common, e.g. in iterative methods for NUFFT inversion.

We benchmark in detail the speedup of cuFINUFFT over existing NUFFT libraries, for a range of accuracies, problem sizes, and point distributions. For example, including GPU memory allocation and transfer time, for low accuracy and quasi-uniform points, cuFINUFFT is on average 8$\times$ faster than FINUFFT [18] (28 threads), 5$\times$ faster than CUNFFT [22], and 78$\times$ faster than gpuNUFFT [24] for type 1 transforms. For type 2, cuFINUFFT is on average 6$\times$ faster than FINUFFT, 5$\times$ faster than gpuNUFFT, and performs similarly to CUNFFT but with 2–5$\times$ faster "execute" times.

The library also enables multi-GPU parallelism, essential for larger problems in HPC environments. In Sec. V we show this in the setting of 3D single particle reconstruction from coherent X-ray diffraction data, which demands thousands of 3D NUFFTs. On NERSC and OLCF supercomputer nodes, we demonstrate an order of magnitude speedup over the CPU version, and excellent weak scaling with respect to the number of GPU processes, up to one process per GPU.

The code and documentation for the library are available on GitHub and installable as a PyPI package in Python through pip install cufinufft.

B. Limitations
Our library has a few limitations. (1) Both GM-sort and SM have some GPU memory overhead, due to sorting index arrays; yet, even for a large 3D transform ($N_i = 256$, $i = 1, 2, 3$, with correspondingly large $M$), this overhead is modest. (2) The SM method, while providing a large acceleration for type 1, is currently limited to single precision, due to the small GPU shared memory per thread block (49 kB). (3) We fixed the upsampling factor $\sigma = 2$; reducing this could reduce memory overhead and run times [18]. (4) Our library does not yet provide NUFFTs in 1D, nor of type 3 (nonuniform to nonuniform) [30].

II. ALGORITHMS
We follow the standard three-step scheme presented in the previous section. Our Fourier transform convention is
$$\hat\phi(k) = \int_{-\infty}^{\infty} \phi(x)\, e^{-ikx}\,dx, \qquad \phi(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat\phi(k)\, e^{ikx}\,dk. \qquad (4)$$
We fix the upsampling factor $\sigma = 2$, and use the "exponential of semicircle" (ES) kernel from FINUFFT [18], [26],
$$\phi_\beta(z) := \begin{cases} e^{\beta(\sqrt{1-z^2}-1)}, & |z| \le 1, \\ 0, & \text{otherwise}. \end{cases} \qquad (5)$$
Given a user-requested tolerance $\varepsilon$, the kernel width $w$ in fine grid points, and the parameter $\beta$ in (5), are set via
$$w = \lceil \log_{10}(1/\varepsilon) \rceil + 1, \qquad \beta = 2.30\,w. \qquad (6)$$
This typically gives relative $\ell_2$ errors close to $\varepsilon$ [18]. As in FINUFFT, for FFT efficiency, the fine grid size $n_i$ is set to be the smallest integer of the form $2^{q} 3^{p} 5^{r}$ greater than or equal to $\max(\sigma N_i, 2w)$, in each dimension $i = 1,\dots,d$. (The code is available at https://github.com/flatironinstitute/cufinufft.)

A. Type 1: nonuniform to uniform
We now describe the algorithm to compute $\tilde f_{k_1,k_2}$, an approximation to $f_{k_1,k_2}$ in (1). We will write only the 2D case, the generalization to 3D being clear.

a) Step 1 (spreading): For each index $(l_1, l_2)$ in the fine grid, $0 \le l_1 < n_1$, $0 \le l_2 < n_2$, compute
$$b_{l_1,l_2} = \sum_{j=1}^{M} c_j\, \psi_{\rm per}(l_1 h_1 - x_j,\; l_2 h_2 - y_j), \qquad (7)$$
where $h_i := 2\pi/n_i$ is the fine grid spacing, and $\psi_{\rm per}(x,y)$ is the periodized tensor product of rescaled ES kernels
$$\psi(x,y) := \phi_\beta(x/\alpha_1)\,\phi_\beta(y/\alpha_2), \quad \alpha_i = w\pi/n_i, \; i = 1,2, \qquad \psi_{\rm per}(x,y) := \sum_{(m_1,m_2)\in\mathbb{Z}^2} \psi(x - 2\pi m_1,\; y - 2\pi m_2). \qquad (8)$$
Note that each nonuniform point $(x_j, y_j)$ only affects a nearby square of $w^2$ fine grid points.

b) Step 2: Use a plain 2D FFT to evaluate
$$\hat b_{k_1,k_2} = \sum_{l_1=0}^{n_1-1}\sum_{l_2=0}^{n_2-1} b_{l_1,l_2}\, e^{-2\pi i (l_1 k_1/n_1 + l_2 k_2/n_2)}, \qquad (k_1,k_2) \in \mathcal{I}_{n_1,n_2}. \qquad (9)$$

c) Step 3 (correction): Truncate the Fourier coefficients to the central $N_1 \times N_2$ frequencies, and scale them to give the final outputs
$$\tilde f_{k_1,k_2} = p_{k_1,k_2}\, \hat b_{k_1,k_2}, \qquad (k_1,k_2) \in \mathcal{I}_{N_1,N_2}. \qquad (10)$$
Here, the correction (deconvolution) factors $p_{k_1,k_2}$ are precomputed from samples of the kernel Fourier transform, via $p_{k_1,k_2} = h_1 h_2\, \hat\psi(k_1,k_2)^{-1} = (2/w)^2 \big(\hat\phi_\beta(\alpha_1 k_1)\,\hat\phi_\beta(\alpha_2 k_2)\big)^{-1}$.

B. Type 2: uniform to nonuniform
To compute $\tilde c_j$, an approximation to $c_j$ in (3), the above steps for type 1 are reversed. The correction factors $p_{k_1,k_2}$ and the periodized kernel $\psi_{\rm per}$ remain as above.

a) Step 1 (correction): Pre-correct then zero-pad the coefficients $f_{k_1,k_2}$ to the fine grid, i.e., for all indices $(k_1,k_2) \in \mathcal{I}_{n_1,n_2}$,
$$\hat b_{k_1,k_2} = \begin{cases} p_{k_1,k_2} f_{k_1,k_2}, & (k_1,k_2) \in \mathcal{I}_{N_1,N_2}, \\ 0, & (k_1,k_2) \in \mathcal{I}_{n_1,n_2} \setminus \mathcal{I}_{N_1,N_2}. \end{cases} \qquad (11)$$

b) Step 2: Use a plain inverse 2D FFT to evaluate
$$b_{l_1,l_2} = \sum_{(k_1,k_2)\in\mathcal{I}_{n_1,n_2}} \hat b_{k_1,k_2}\, e^{2\pi i (l_1 k_1/n_1 + l_2 k_2/n_2)}, \qquad l_i = 0,\dots,n_i-1, \; i = 1,2. \qquad (12)$$

c) Step 3 (interpolation): Compute a weighted sum of the $w^2$ grid values near each target nonuniform point $(x_j,y_j)$,
$$\tilde c_j = \sum_{l_1=0}^{n_1-1}\sum_{l_2=0}^{n_2-1} b_{l_1,l_2}\, \psi_{\rm per}(l_1 h_1 - x_j,\; l_2 h_2 - y_j), \qquad j = 1,\dots,M.$$

III. GPU IMPLEMENTATION
This section shows how the above three-step algorithms are implemented on the GPU using the CUDA API. For the FFT in both types, we use NVIDIA's cuFFT library. For the correction steps, since the task is embarrassingly parallel, we simply launch one thread for each of the $N_1 \times N_2$ Fourier modes. The factors $p_{k_1,k_2}$ are precomputed once in the planning stage. The major work lies in the spreading (type 1) and interpolation (type 2), to which we now turn.

A. Spreading
Recall that we use GM to denote a baseline input driven spreading implementation in global memory (as used in CUNFFT [22]). This launches one thread per nonuniform point, i.e. $M$ in total, in their user-supplied order. The thread given nonuniform point $j$ spreads it to the fine grid array $b$:
$$b_{l_1,l_2} \mapsto b_{l_1,l_2} + c_j\, \psi_{\rm per}(l_1 h_1 - x_j,\; l_2 h_2 - y_j), \qquad \forall (l_1,l_2).$$
This task comprises (a) reading $x_j, y_j, c_j$ from GPU global memory, (b) $w^2$ evaluations of the kernel function $\psi$, and (c) $w^2$ atomic adds to the $b$ array residing in GPU global memory. One downside of this approach is that nonuniform points assigned to threads within a thread block, and hence within a warp, can reside far from each other on the grid, which results in uncoalesced memory loads. (Note that assigning one thread block per nonuniform point may alleviate this issue [14].) Another downside is that global atomic operations for clustered points can essentially serialize the method.

GM-sort is a scheme to address the issue of uncoalesced access. We partition the $n_1 \times n_2$ fine grid into rectangular "bins" $R_i$, $i = 1,\dots,n_{\rm bins}$, each of integer sizes $m_1 \times m_2$ if possible, otherwise smaller. (A typical choice is $m_1 = m_2 = 32$.) Thus $n_{\rm bins} = \lceil n_1/m_1 \rceil \times \lceil n_2/m_2 \rceil$. Bins are ordered in a Cartesian grid with the $x$ axis fast and the $y$ axis slow, which echoes on a larger scale the ordering of the fine grid itself. We will say that nonuniform point $j$ is "inside" bin $R_i$ if the point's rounded integer fine grid coordinates $l_1 = \lfloor (x_j+\pi)/h_1 \rfloor$, $l_2 = \lfloor (y_j+\pi)/h_2 \rfloor$ lie in $R_i$. Suppose that there are $M_i$ nonuniform points inside bin $R_i$, for $i = 1,\dots,n_{\rm bins}$. We set up a permutation $t$ (a bijection from $\{1,\dots,M\}$ to itself), such that the nonuniform points with indices $t(1), t(2), \dots, t(M_1)$ are precisely those lying in bin $R_1$, then those with indices $t(M_1+1), t(M_1+2), \dots, t(M_1+M_2)$ are precisely those in bin $R_2$, etc. This is done in practice by first recording the bin index of each point, reading out this list in bin ordering, then inverting this permutation to give $t$. Nonuniform points are then assigned to threads in the permuted index order $t(1),\dots,t(M)$. This means that the threads within a warp now access parts of the $b$ fine grid array that, most of the time, are approximately adjacent. The GPU therefore has a better chance of coalescing these accesses into fewer global memory transactions.

Fig. 1: SM spreading scheme. The shaded gray squares (each $w \times w$ fine grid points) show the support of the spreading kernel, centered on each nonuniform point (black dots). Colors indicate the different nonempty bins. Step 0: Divide the $n_1 \times n_2$ fine grid into bins of size $m_1 \times m_2$. Step 1: Assign subproblems by bin-sorting nonuniform points, and, if needed, further splitting so that there are at most $M_{\rm sub}$ points per subproblem. Step 2: Spread the points inside each subproblem into a padded bin copy in faster shared memory (size $p_1 \times p_2$). Step 3: Atomic add each padded bin result back into global memory.

SM is a hybrid scheme which exploits GPU shared memory to further address the issue of slow global atomic operations. It partitions the fine grid into bins $R_i$ as above, then has three remaining steps, as follows (see Fig. 1).

Step 1: Assign subproblems using bin-sorting and blocking of nonuniform points. The nonuniform point index list $1,\dots,M$ is broken into the union of disjoint subsets $S_1, S_2, \dots$, each of which we call a "subproblem". This is done as follows. For bin $R_1$, if the number of points $M_1$ is larger than $M_{\rm sub}$, a parameter controlling the maximum subproblem size, then it is further broken into subsets (subproblems) of size at most $M_{\rm sub}$. These one or more subproblems are all associated to the bin $R_1$, in which their points lie. The same is repeated for the remainder of the bins $R_i$. Thread blocks are then launched, one per subproblem. Note that the cap on subproblem size is a (blocked) form of input driven load-balancing: if many points lie in a bin, their spreading tasks are well parallelized.

Step 2: Spread nonuniform points inside each subproblem to shared memory.
By the above partition, within a subproblem (a thread block), the nonuniform points can only affect the fine grid array $b$ within a padded bin of dimensions $p_1 \times p_2$, where
$$p_i = m_i + 2\lceil w/2 \rceil, \qquad i = 1,2. \qquad (13)$$
For the $k$th subproblem $S_k$, we accumulate its spreading result on a shared memory copy $b^{\rm shared}$ of its padded bin, local to its thread block,
$$b^{\rm shared}_{s_1,s_2} \mapsto b^{\rm shared}_{s_1,s_2} + \sum_{j \in S_k} c_j\, \psi_{\rm per}\big((s_1+\Delta_1)h_1 - x_j,\; (s_2+\Delta_2)h_2 - y_j\big), \qquad s_i = 0,\dots,p_i-1, \; i = 1,2, \qquad (14)$$
where $(s_1,s_2)$ are local indices within the padded bin copy, and $(\Delta_1,\Delta_2)$ is its offset within the fine grid (see Fig. 1).

Step 3: Atomic add the results back to global memory.
Once the spreading subproblem result is accumulated in the shared memory padded bin, we atomically add it back to the corresponding region of the global memory array $b$,
$$b_{l_1(s_1),\,l_2(s_2)} \mapsto b_{l_1(s_1),\,l_2(s_2)} + b^{\rm shared}_{s_1,s_2}, \qquad \forall (s_1,s_2), \qquad (15)$$
where $l_i(s_i) := (s_i + \Delta_i) \bmod n_i$, $i = 1,2$, maps each coordinate in the padded bin back to the fine grid, with periodic wrapping (see Fig. 1). This completes the spreading process. When there are many nonuniform points per bin, this incurs many fewer global atomic writes than GM-sort.

We generalize both the implementations GM-sort and SM to 3D by using cuboid bins of maximum dimension $m_1 \times m_2 \times m_3$.

Remark 1: In both methods GM-sort and SM, the performance is sensitive to the bin sizes. By hand-tuning (in powers of two), we have found $32 \times 32$ in 2D, and a flatter cuboid in 3D, to be optimal. This takes account of the area (or volume) ratio of bin to padded bin, the ordering of fine grid data, and the maximum size of GPU shared memory per thread block (49 kB), and is based on speed tests on an NVIDIA Tesla V100. We similarly set $M_{\rm sub} = 1024$, although we believe that its optimal value is problem-dependent.

Remark 2:
We implemented SM in both dimensions and precisions, apart from 3D double precision, where it is no longer advantageous. Here the shared memory constraint on the padded bin volume $(m_1+w)(m_2+w)(m_3+w)$, for the larger widths $w$ needed at small tolerances $\varepsilon$, forces the bin volume to be tiny compared to the padded bin volume, dramatically increasing the number of global operations. We test only GM-sort in this case.
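To make the padded-bin bookkeeping of (13)–(15) concrete, the following NumPy mock-up (our own illustrative code, not the library's CUDA kernels; it assumes one subproblem per nonempty bin, coordinates in $[0,2\pi)$, and a grid size divisible by the bin size) spreads each bin's points into a local padded array and wrap-adds it back, reproducing the directly spread fine grid:

```python
import numpy as np

def es_kernel(z, beta):
    # "exponential of semicircle" kernel of Eq. (5); zero for |z| >= 1
    out = np.zeros_like(z)
    inside = np.abs(z) < 1
    out[inside] = np.exp(beta * (np.sqrt(1 - z[inside] ** 2) - 1))
    return out

def point_window(u, n, w, beta):
    # fine-grid indices near u (possibly outside [0,n)) and their ES weights
    h, alpha, r = 2 * np.pi / n, np.pi * w / n, int(np.ceil(w / 2))
    f = int(np.floor(u / h))
    ls = np.arange(f - r, f + r + 1)
    return ls, es_kernel((ls * h - u) / alpha, beta)

def spread_direct(x, y, c, n, w, beta):
    # GM-style baseline: accumulate straight into the global fine grid
    b = np.zeros((n, n), dtype=complex)
    for xj, yj, cj in zip(x, y, c):
        l1, k1 = point_window(xj, n, w, beta)
        l2, k2 = point_window(yj, n, w, beta)
        b[np.ix_(l1 % n, l2 % n)] += cj * np.outer(k1, k2)
    return b

def spread_sm(x, y, c, n, w, beta, m=8):
    # SM-style: per-bin padded local grids, Eqs. (13)-(14), wrap-added
    # back into the global grid, Eq. (15); one subproblem per bin here
    r = int(np.ceil(w / 2))
    p = m + 2 * r                                 # padded bin size, Eq. (13)
    h = 2 * np.pi / n
    b = np.zeros((n, n), dtype=complex)
    bins = (np.floor(x / h).astype(int) // m) * (n // m) \
        + np.floor(y / h).astype(int) // m
    s = np.arange(p)
    for bid in np.unique(bins):
        o1, o2 = (bid // (n // m)) * m, (bid % (n // m)) * m
        d1, d2 = o1 - r, o2 - r                   # offsets (Delta_1, Delta_2)
        local = np.zeros((p, p), dtype=complex)   # stand-in for shared memory
        for j in np.flatnonzero(bins == bid):
            l1, k1 = point_window(x[j], n, w, beta)
            l2, k2 = point_window(y[j], n, w, beta)
            local[np.ix_(l1 - d1, l2 - d2)] += c[j] * np.outer(k1, k2)
        b[np.ix_((s + d1) % n, (s + d2) % n)] += local   # wrap-add, Eq. (15)
    return b

rng = np.random.default_rng(2)
n, w, M = 64, 6, 300
beta = 2.30 * w                                   # Eq. (6)
x, y = rng.uniform(0, 2 * np.pi, (2, M))
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)
b_direct = spread_direct(x, y, c, n, w, beta)
b_sm = spread_sm(x, y, c, n, w, beta)
assert np.allclose(b_direct, b_sm)
```

On the GPU the `local` array lives in shared memory and the final addition uses global atomics; here plain array additions stand in for both.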
B. Interpolation
For interpolation, we use the same idea as GM-sort for spreading, with read and write memory operations reversed. Threads are assigned to nonuniform points in the permuted order $t(1),\dots,t(M)$ coming from bin-sorting. Thus the $j$th thread performs the task
$$\tilde c_{t(j)} = \sum_{l_1=0}^{n_1-1}\sum_{l_2=0}^{n_2-1} b_{l_1,l_2}\, \psi_{\rm per}(l_1 h_1 - x_{t(j)},\; l_2 h_2 - y_{t(j)}).$$
We refer to the method by the same name, GM-sort, and use GM to refer to the unsorted version. The reason to bin-sort the nonuniform points is to coalesce the reads from the fine grid. Since there is no conflict between threads reading the same location in memory, this is fast; the benefit of applying an idea like SM to interpolation would be limited.

IV. PERFORMANCE TESTS
In this section, we report the performance of our GPU library, cuFINUFFT. We first show how the proposed spreading methods GM-sort and SM perform against GM. Then, we compare the performance of the interpolation step with (GM-sort) and without (GM) bin-sorting the nonuniform points. Finally, we show how the whole pipeline performance (spreading/interpolation, FFT, correction) of cuFINUFFT compares with the fastest known multithreaded CPU library FINUFFT [18], and two established GPU libraries, CUNFFT [22] and gpuNUFFT [5].

All GPU timings are for an NVIDIA Tesla V100 (released in 2017), with memory bandwidth 900 GB/s. We compile all codes with GCC v7.4.0 and NVCC v10.0.130. Unless specified, single precision and $M_{\rm sub} = 1024$ are used in all the tests.

Tasks.
We present results for the following two nonuniform point distributions, which are extreme cases:
• "rand": nonuniform points are iid random across the entire periodic domain box $[-\pi,\pi)^d$, $d = 2, 3$.
• "cluster": points are iid random in the small box $[0,h_1] \times \cdots \times [0,h_d]$, recalling that $h_i$ are the fine grid spacings.
We restrict to square/cube problems, i.e. $N_1 = N_2 (= N_3)$, being the most common application problem. We define the problem density $\rho$ to be the ratio of the number of nonuniform points to the number of upsampled grid points, i.e.,
$$\rho := \frac{M}{\prod_{i=1}^d n_i} = \frac{M}{\sigma^d \prod_{i=1}^d N_i}. \qquad (16)$$
We report tests with $\rho$ of order 1, since a) this is common in applications, and b) in the uncommon case $\rho \ll 1$ one ends up essentially comparing plain FFT speeds. In fact we have tested $\rho = 0.1$ and $\rho = 10$, as well as less extreme nonuniform point distributions; the conclusions are rather similar.

Fig. 2: Spreading method comparisons. Execution time per nonuniform point (smaller is better) is shown, for various fine grid sizes, and distributions "rand" and "cluster", in 2D and 3D. For GM-sort (cyan) and SM (dark blue), the "total" time (solid lines) includes the precomputation (bin-sorting or subproblem setup), whereas the "spread" time (dotted lines) excludes this precomputation. The baseline GM (red) needs no precomputation. Annotations are the speedups over GM.

A. Spreading performance
For high-accuracy single precision ($\varepsilon = 10^{-5}$, i.e. $w = 6$ via (6)), Fig. 2 compares our spreading methods GM-sort and SM against the baseline method GM, in 2D (top), 3D (bottom), and for "rand" (left) and "cluster" (right) distributions. Solid lines show total times (in nanoseconds per nonuniform point) including preprocessing (sorting) of new nonuniform points. Dotted lines show execution excluding this, so are relevant for repeated transforms with the same set of nonuniform points.

We can see from the "rand" results that for large grids, bin-sorting brings a large gain. In the extreme case (the largest $n_i$ tested), GM-sort is 3.9$\times$ faster than GM in 2D, and 7.6$\times$ faster in 3D. On the other hand, for small grids, because the memory accesses are already localized, we do not see any benefit of bin-sorting.

From the "cluster" results, since the nonuniform points all reside in a small region, bin-sorting brings no benefit. However, we see the clear advantage of doing local spreading in shared memory, in that SM is up to 12.8$\times$ faster than GM in 2D, and up to 3.2$\times$ faster in 3D. The speedup in 3D is limited because the padding of the bins, especially in the $z$ direction, grows the volume of global atomic adds needed in Step 3.

Comparing the dark blue curves in the left and right plots, we see distribution-robust performance for SM, in that similar throughput is achieved for the "rand" and "cluster" distributions. In comparison, GM-sort is on average 3.9$\times$ slower in 2D on "cluster" than on "rand". The 2D execution throughput (excluding precomputation) exceeds $10^9$ points/sec for large grids, and is close to this even when including precomputation.

Fig. 3: Interpolation comparisons. Execution time per nonuniform point is shown, for various fine grid sizes, and distribution "rand", in 2D and 3D. The "total" time (solid lines) includes the bin-sorting precomputation, whereas the "interp" time (dotted lines) excludes this precomputation.

B. Interpolation performance
For the same accuracy, Fig. 3 compares the interpolation method GM-sort against GM in 2D and 3D, for the "rand" distribution. (We exclude the "cluster" results, since, as with spreading, bin-sorting nonuniform points has no effect there.) We see that again GM-sort improves the performance for large grids. It is 4.5$\times$ faster than GM in 2D, and 12.7$\times$ faster in 3D, for the largest $n_i$ tested. A difference with spreading is that, because there are no global write conflicts, the execution time of GM-sort (excluding precomputation) never becomes slower than GM.

C. Benchmark comparisons against existing libraries
We now compare cuFINUFFT against the CPU library FINUFFT (which has already benchmarked favorably against other CPU libraries [18]), and the GPU libraries CUNFFT [22] and gpuNUFFT [5]. For FINUFFT, we used a high-end compute node equipped with 512 GB RAM and two Intel Xeon E5-2680 v4 processors (released in 2016). Each processor has 14 physical cores at 2.40 GHz. We ran multithreaded FINUFFT with 28 threads (1 thread per physical core).
• cuFINUFFT version 1.0. We used host compiler flags -fPIC -O3 -funroll-loops -march=native.
• FINUFFT version 2.0.2. Compiler flags were -O3 -funroll-loops -march=native -fcx-limited-range. We fixed the upsampling factor σ = 2 to match that used in cuFINUFFT.
• CUNFFT version 2.0. We compiled with fast Gaussian gridding (-DCOM_FG_PSI=ON). Default dimensions of thread blocks (THREAD_DIM_X=16, THREAD_DIM_Y=16) are used.
• gpuNUFFT version 2.1.0. We use its MATLAB interface. We used default host compiler flags (-std=c++11 -fPIC). We set MAXIMUM_ALIASING_ERROR to a small value to get more accurate results. We use the same sector width as in their demo codes.
We present three different timings for NUFFT executions:
• "total": Execution time (per nonuniform point) for inputs and output on the GPU.
• "total+mem": Execution time (per nonuniform point), including the time for GPU memory allocation plus transferring data from host to GPU and back.
• "exec": Execution time (per nonuniform point) for a transform, after its nonuniform points have already been preprocessed. This is a subset of the "total" time. It is the relevant time for the case of multiple fixed-size transforms with a fixed set of nonuniform points, but new strength or coefficient vectors.
There is a constant start-up cost (about 0.1–0.2 seconds) for calling the cuFFT library, so to exclude it we add a dummy call of cuFFTPlan1d before calling cuFINUFFT or CUNFFT. For gpuNUFFT, in "total+mem", we exclude the time for building the nonuniform FFT operator and creating the cuFFT plan. Note also that gpuNUFFT sorts the nonuniform points into sectors on the CPU and copies the arrays to the GPU when it builds the operator, so, to be generous, we do not include this in "total+mem" either. Finally, "total" is not shown for gpuNUFFT and CUNFFT, because gpuNUFFT takes CPU arrays as inputs and outputs, and CUNFFT allocates GPU memory in the initialization stage (cunfft_init) in a way that did not allow us to separate its timing.

We now discuss the results (Figures 4 to 7 and Table I). A wide range of relative $\ell_2$ errors $\epsilon$ are explored by varying the requested tolerance, or the kernel parameters (usually the width $w$), for each library. Error is measured against a ground truth computed by FINUFFT at a tolerance near machine precision in the relevant precision (double for double-precision runs, single for single-precision runs).

a) Single precision comparisons: Figs.
4 and 5 compare the performance of both type 1 (top) and type 2 (bottom) transforms in 2D (left) and 3D (right), for all libraries in single precision. We can see from the top plots that for type 1, cuFINUFFT outperforms all other libraries. For type 1, the best performance is achieved using the SM method (dark blue). The "exec" time of cuFINUFFT (SM) in 2D is around 10$\times$ faster than the "exec" time of FINUFFT, independent of the accuracy; and in 3D, it is 3–12$\times$ faster from high to low accuracy.

For type 2 (bottom plots), except at 2D low accuracies, where CUNFFT is comparable to cuFINUFFT, cuFINUFFT is again the fastest. Its "exec" time is 4–7$\times$ and 6–8$\times$ faster than the "exec" time of FINUFFT in 2D and 3D respectively.

In Fig. 6, we fix a tolerance achievable by all libraries, and examine the effect of the nonuniform point distribution and the number of Fourier modes (fixing density $\rho = 1$) on library performance.
TABLE I: cuFINUFFT 3D type 1 NUFFT GPU memory usage, and "exec" time, for distribution "rand" and for two relative tolerances. Columns are $N_1 = N_2 = N_3$, $M$, spreading method (GM-sort or SM), "exec" time (sec), RAM (MB), speedup vs FINUFFT, and spread fraction (%); rows cover $N_i = 32$ and $N_i = 256$ at each tolerance. Speedup is computed relative to the "exec" time from FINUFFT. Spread fraction is the percentage of "exec" time spent on spreading. RAM is measured using nvidia-smi. For the baseline spreading method GM, RAM use is 381 MB for $N_i = 32$ and 5113 MB for $N_i = 256$.
Fig. 4: Single precision NUFFT comparisons in 2D (left) and 3D (right), for type 1 (upper) and 2 (lower). "total+mem" ("total" for FINUFFT) time per nonuniform point vs accuracy is shown, for the named libraries, for the distribution "rand".

From the top plots, for type 1, we observe distribution-robust performance in cuFINUFFT (SM), FINUFFT and gpuNUFFT. The others, cuFINUFFT (GM-sort) and CUNFFT, slow down when the points are clustered: for an intermediate problem size, cuFINUFFT (GM-sort) "exec" is slowed by a factor of 3 when switching from "rand" to "cluster". Dramatically, CUNFFT is slowed by a factor of 200: it is very slow for clustered type 1 transforms.

For type 2 (lower plots in Fig. 6), the sensitivity to clustering is much weaker: all libraries tackle "cluster" at about the same speed they tackle "rand", apart from cuFINUFFT, which becomes 3–4$\times$ faster. While cuFINUFFT has a similar "total+mem" time to CUNFFT, its "exec" time is 2–5$\times$ faster than that of CUNFFT. In 3D our detailed findings are quite similar, and we do not show them.
Fig. 5: Single precision comparisons in 2D and 3D. "exec" time per nonuniform point vs accuracy ε (relative l2 error) is shown for the tested libraries, except gpuNUFFT. Panels: 2D type 1 (N_i = 1000), 3D type 1 (N_i = 100), 2D type 2 (N_i = 1000), 3D type 2 (N_i = 100). For explanation see the caption of Fig. 4.

Lastly, in Table I we detail the type 1 performance and RAM usage of cuFINUFFT in 3D, for two tolerances. We see again that higher speedup with respect to FINUFFT is achieved for low accuracy and large problem sizes. The spreading method SM achieves better performance, but at the cost of slightly more GPU RAM usage for large problems. Finally, spreading is still the performance bottleneck for 3D type 1: it occupies most of the "exec" time for all accuracies and problem sizes.

b) Double precision comparisons: Fig. 7 compares the performance of both types in 2D and 3D, for all libraries (except gpuNUFFT, whose ε appears always to exceed the plotted accuracy range). We see from the top left plot that for 2D type 1, cuFINUFFT outperforms the others by 1–2 orders of magnitude. The best performance is achieved by SM (blue) at high accuracy and by GM-sort (cyan) at low accuracy. The "exec" speedup of cuFINUFFT (taking the faster of SM and GM-sort) over FINUFFT ranges from 4–11×. From the top right plot, for 3D type 1, cuFINUFFT is only faster than FINUFFT at the lower accuracies, merely matching its speed at the highest accuracies. From the bottom plots, for type 2, cuFINUFFT is always the fastest, and by a large factor at high accuracies. The "exec" time of cuFINUFFT is on average 6× faster than that of FINUFFT in both dimensions. In 2D, and in low-accuracy 3D, we see that host-to-device transfer dominates: a several-fold speedup is available by maintaining data on the GPU.

Fig. 6: Detailed 2D type 1 (top) and type 2 (bottom) NUFFT comparisons (single precision). Execution time per nonuniform point vs number of Fourier modes N = N_1 N_2 is shown for the named libraries ("exec", "total", and "total+mem" timings), comparing "rand" (left) and "cluster" (right) at fixed density ρ and tolerance ε. Annotations give the speedup of "exec" of cuFINUFFT (SM) for type 1, and of cuFINUFFT (GM-sort) for type 2, vs "exec" of FINUFFT.

V. MULTI-GPU APPLICATIONS
Finally, we illustrate the multi-GPU performance of cuFINUFFT in 3D coherent X-ray image reconstruction. Single-particle imaging is a technique whereby the 3D electron density of a molecule may be recovered at sub-nanometer resolution from a large set of 2D far-field diffraction images, each taken in a single shot of a free-electron laser with a random, unknown molecular orientation [9]. Each 2D image measures the squared magnitude of the Fourier transform of the density on an (Ewald sphere) slice passing through the origin; see Fig. 8. The multi-tiered iterative phasing (M-TIP) algorithm is used for reconstruction [9]. Broadly speaking, one starts with a Fourier transform estimate on a 3D Cartesian grid, and estimated orientations, then iterates the following four steps:

i) "Slicing": a 3D type 2 NUFFT is used to evaluate the Fourier transform on a large set of Ewald sphere slices.
ii) Orientation matching: adjust each slice orientation using its 2D image data.
iii) "Merging": solve for the 3D Fourier transform data matching the 2D images on known slices, as in Fig. 8; this needs two 3D type 1 NUFFTs.
iv) Phasing: find the most likely phase of the 3D Fourier transform to give a real-space density of known support.

The code acceleration strategy, part of the Exascale Computing Project, has been to offload the intensive steps i)–iii) to GPUs.

Fig. 7: Double precision comparisons. All three timings "exec", "total", and "total+mem" are shown, for FINUFFT, cuFINUFFT (GM-sort), cuFINUFFT (SM), and CUNFFT. Panels: 2D type 1 (N_i = 1000), 3D type 1 (N_i = 100), 2D type 2 (N_i = 1000), 3D type 2 (N_i = 100). For more explanation see the caption of Fig. 4.

Fig. 8: M-TIP merging step: 3D Fourier transform data, collected on multiple Ewald sphere slices with arbitrary orientations, is merged onto a single uniform grid. Image credit: Jeffrey Donatelli (Lawrence Berkeley National Laboratory).

A. Work management and the cuFINUFFT Python interface
We use MPI to manage parallel processes, via the mpi4py package. Each MPI rank (i.e. process) is assigned a GPU and some data to handle. Since slicing and merging are linear operations, we can scatter (mpi4py scatter) before the slicing step, and reduce (mpi4py reduce) after the merging step. In modern HPC environments each compute node is furnished with several GPUs; e.g., NERSC's Cori GPU system has 8 NVIDIA V100s per node, while OLCF's Summit system has 6 V100s per node. Thus, depending on the ratio of GPUs to CPU cores, we can have several MPI ranks share the same GPU. For M-TIP, load balancing is simple because each rank does (roughly) the same amount of work, so we assign GPUs in a round-robin fashion. We use PyCUDA to transfer data between device and host. This allows cuFINUFFT to access numpy.ndarray objects as double[], hence requiring no specialized API calls to convert data between Python and cuFINUFFT. To ensure that the PyCUDA API sends the data to the correct device, we manually define the device context. Here is an example of a type 1 3D NUFFT with nonuniform coordinates X, Y, Z, and strengths nuvect (note: mpi4py provides rank):

import pycuda.driver
from cufinufft import cufinufft
from pycuda.gpuarray import GPUArray, to_gpu

GPUS_PER_NODE = 8
device_id = rank % GPUS_PER_NODE
pycuda.driver.init()
device = pycuda.driver.Device(device_id)
ctx = device.make_context()

ugrid_gpu = GPUArray(shape)
plan = cufinufft(1, shape, eps=eps, gpu_device_id=device_id)
plan.set_pts(to_gpu(X), to_gpu(Y), to_gpu(Z))
plan.execute(to_gpu(nuvect), ugrid_gpu)
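The round-robin assignment above is plain modular arithmetic on the rank index; a tiny self-contained sketch of the policy (a hypothetical helper, not part of cuFINUFFT):

```python
def assign_gpus(n_ranks, gpus_per_node):
    """Round-robin policy from the text: rank r uses device r % gpus_per_node,
    so consecutive ranks land on distinct GPUs until the node wraps around."""
    return [r % gpus_per_node for r in range(n_ranks)]
```

For example, 12 ranks on a 6-GPU Summit node give each device exactly two ranks.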
The cuFINUFFT Python interface allows the user to assign a cufinufft plan to a specific device by setting the gpu_device_id option. See the documentation on GitHub for details. The M-TIP code uses a single fixed tolerance ε throughout.

B. cuFINUFFT Performance on Cori GPU and Summit
To perform a single-node weak scaling study we used, per rank, the NUFFT problem sizes in Table II, which correspond to a fixed number of diffraction images per rank. The table shows the average wall-clock time spent performing NUFFTs during the slicing (type 2 NUFFT) and merging (type 1 NUFFT) steps for one M-TIP iteration. Comparing the GPU wall-clock times (including data movement) to the equivalent problem running on 40 CPU threads on a single Intel Skylake Cori GPU node (using the FINUFFT code), we find that cuFINUFFT on a single GPU is roughly similar to the CPU times, while for the larger problem distributed over the whole node (multi-GPU) it is 6–18× faster than on the CPU.

Fig. 9: Single-node multi-GPU weak scaling on NERSC Cori GPU (left) and OLCF Summit (right). We achieve close to ideal weak scaling (flat lines) up to a number of ranks matching the number of GPUs on the node (vertical dotted line).

Fig. 9 shows the weak scaling performance on single nodes of NERSC's Cori GPU and OLCF's Summit. Each rank is given the single-rank problem size from Table II. Solid lines show the total time including host-device transfer. Crosses show setup time (plan, input, and nonuniform point sorting), while squares show NUFFT execution time. In all cases (except type 2 on Summit), we see ideal weak scaling up to one rank per GPU. We found that enabling the multi-process service (MPS) made no measurable difference. We see rapid deterioration of weak scaling once each GPU is used by more than one rank, suggesting that cuFINUFFT uses each GPU to capacity.

VI. CONCLUSIONS
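Ideal weak scaling means the wall-clock time stays flat as ranks (and total problem size) grow together; a minimal way to quantify this from measured timings (a hypothetical helper, not part of the cuFINUFFT or M-TIP codes):

```python
def weak_scaling_efficiency(times):
    """times[i] = wall-clock time using i+1 ranks, with problem size fixed
    per rank. Ideal weak scaling keeps times flat, so the efficiency
    times[0] / times[i] stays near 1; it drops once GPUs are oversubscribed."""
    return [times[0] / t for t in times]

# A flat timing curve yields efficiency 1.0 at every rank count; a doubling
# of the time (e.g. two ranks sharing one GPU) yields efficiency 0.5.
```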
We presented a general-purpose GPU-based library for nonuniform fast Fourier transforms: cuFINUFFT. It supports transforms of both type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform), in 2D and 3D, with adjustable accuracy. By using an efficient kernel function, sorting the nonuniform points into bins, and leveraging shared memory to reduce write collisions, cuFINUFFT obtains a significant speedup over established CPU- and GPU-based libraries. On average, we observe a speedup of one order of magnitude over the FINUFFT parallel CPU library. We also observe up to an order of magnitude speedup over the CUNFFT GPU library, and more in the case of clustered type 1 transforms. Finally, we see excellent multi-GPU weak scaling in an iterative 3D X-ray reconstruction application.

There are several directions for future work. One is to extend the library to 1D and type 3 transforms; another is to support smaller upsampling factors σ, which can significantly reduce memory use. It is also worth exploring support for other GPUs via a library that provides a unified hardware API, such as OCCA or Kokkos.

Task | Uniform grid (per rank) N = N_1 = N_2 = N_3 | Nonuniform points (per rank) M | Density ρ | Parallelism | CPU time [s] (Intel Skylake) | GPU time [s] (Cori GPU) | GPU time [s] (Summit)
Slicing (type 2) | 41 | 1.…×10^… | … | single-rank | 0.11 | 0.075 (1.…×) | 0.076 (1.…×)
Slicing (type 2) | | | | whole-node | ….49 | 0.078 (18×) | 0.11 (13×)
Merging (type 1) | 81 | 1.…×10^… | … | single-rank | ….62 | 1.89 (0.…×) | 1.76 (0.…×)
Merging (type 1) | | | | whole-node | … | ….94 (6.…×) | 1.76 (6.…×)

TABLE II: Problem sizes and average NUFFT run times for a representative M-TIP iteration, on NERSC's Cori GPU (Intel Skylake with NVIDIA V100s) and OLCF's Summit (IBM Power9 with NVIDIA V100s). The problem sizes are fixed per MPI rank. The CPU code is FINUFFT v1.1.2, with 40 threads on the Intel Skylake. The relative speedup of single- or multi-GPU cuFINUFFT over the CPU code is shown in parentheses. The second of each pair of rows shows the wall-clock time for a problem scaled up by the number of GPUs available on each node (8 for Cori GPU and 6 for Summit), using this same number of ranks (i.e., one rank per GPU).

VII. ACKNOWLEDGMENTS
This research used resources of two U.S. Department of Energy Office of Science User Facilities: NERSC and OLCF.

REFERENCES
[1] J. A. Fessler and B. P. Sutton. Nonuniform fast Fourier transforms using min-max interpolation. IEEE Trans. Signal Process., 51(2):560–574, 2003.
[2] K. Zhang and J. U. Kang. Graphics processing unit accelerated non-uniform fast Fourier transform for ultrahigh-speed, real-time Fourier-domain OCT. Opt. Express, 18(22):23472–87, 2010.
[3] L. Wang, Y. Shkolnisky, and A. Singer. A Fourier-based approach for iterative 3D reconstruction from cryo-EM images. arXiv preprint arXiv:1307.5824, 2013.
[4] A. H. Barnett, M. Spivak, A. Pataki, and L. Greengard. Rapid solution of the cryo-EM reconstruction problem by frequency marching. SIAM J. Imaging Sci., 10(3):1170–1195, 2017.
[5] D. Střelák, C. Ó. S. Sorzano, J. M. Carazo, and J. Filipovič. A GPU acceleration of 3-D Fourier reconstruction in cryo-EM. Int. J. High Perform. Comput. Appl., 33(5):948–959, 2019.
[6] J. Andén and A. Singer. Structural variability from noisy tomographic projections. SIAM J. Imaging Sci., 11(2):1441–1492, 2018.
[7] Z. Zhao, Y. Shkolnisky, and A. Singer. Fast steerable principal component analysis. IEEE Trans. Comput. Imaging, 2(1):1–12, 2016.
[8] P. Arras, M. Reinecke, R. Westermann, and T. A. Enßlin. Efficient wide-field radio interferometry response, 2020. arXiv:2010.10122; in press, Astron. Astrophys.
[9] J. J. Donatelli, J. A. Sethian, and P. H. Zwart. Reconstruction from limited single-particle diffraction data via simultaneous determination of state, orientation, intensity, and phase. Proc. Natl. Acad. Sci., 114(28):7222–7227, 2017.
[10] A. H. Barnett. Efficient high-order accurate Fresnel diffraction via areal quadrature and the nonuniform FFT. J. Astron. Telesc. Instrum. Syst., 7(2):021211, 2021.
[11] Z. Gimbutas and S. Veerapaneni. A fast algorithm for spherical grid rotations and its application to singular quadrature. SIAM J. Sci. Comput., 35(6):A2738–A2751, 2013.
[12] L. af Klinteberg, T. Askham, and M. C. Kropinski. A fast integral equation method for the two-dimensional Navier–Stokes equations. J. Comput. Phys., 409:109353, 2020.
[13] R. Salomon-Ferrer, A. W. Götz, D. Poole, S. Le Grand, and R. C. Walker. Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. Explicit solvent particle mesh Ewald. J. Chem. Theory Comput., 9(9):3878–3888, 2013.
[14] A. M. Fiore, F. Balboa Usabiaga, A. Donev, and J. W. Swan. Rapid sampling of stochastic displacements in Brownian dynamics simulations. J. Chem. Phys., 146(12):124116, 2017.
[15] A. Dutt and V. Rokhlin. Fast Fourier transforms for nonequispaced data. SIAM J. Sci. Comput., 14:1369–1393, 1993.
[16] L. Greengard and J.-Y. Lee. Accelerating the nonuniform fast Fourier transform. SIAM Review, 46(3):443–454, 2004.
[17] J. Keiner, S. Kunis, and D. Potts. Using NFFT 3 – a software library for various nonequispaced fast Fourier transforms. ACM Trans. Math. Software, 36(4), 2009.
[18] A. H. Barnett, J. Magland, and L. af Klinteberg. A parallel nonuniform fast Fourier transform library based on an "exponential of semicircle" kernel. SIAM J. Sci. Comput., 41(5):C479–C504, 2019.
[19] L. Greengard, J.-Y. Lee, and S. Inati. The fast sinc transform and image reconstruction from nonuniform samples in k-space. Commun. Appl. Math. Comput. Sci., 1(1):121–131, 2006.
[20] G. Steidl. A note on fast Fourier transforms for nonequispaced grids. Adv. Comput. Math., 9:337–352, 1998.
[21] J. Cheng, M. Grossman, and T. McKercher. Professional CUDA C Programming. John Wiley & Sons, 2014.
[22] S. Kunis and S. Kunis. The nonequispaced FFT on graphics processing units. Proc. Appl. Math. Mech., 12:7–10, 2012.
[23] J.-M. Lin. Python non-uniform fast Fourier transform (PyNUFFT): An accelerated non-Cartesian MRI package on a heterogeneous platform (CPU/GPU). J. Imaging, 4(3):51, 2018.
[24] F. Knoll, A. Schwarzl, C. Diwoky, and D. K. Sodickson. gpuNUFFT—an open-source GPU library for 3D gridding with direct MATLAB interface. Proc. Intl. Soc. Mag. Reson. Med., 22:4297, 2014.
[25] J. Gai et al. More IMPATIENT: A gridding-accelerated Toeplitz-based strategy for non-Cartesian high-resolution 3D MRI on GPUs. J. Parallel Distrib. Comput., 73(5):686–697, 2013.
[26] A. H. Barnett. Aliasing error of the exp(β√(1−z²)) kernel in the nonuniform fast Fourier transform. Appl. Comput. Harmon. Anal., 51:1–16, 2021.
[27] A. Gregerson. Implementing fast MRI gridding on GPUs via CUDA, 2008. NVIDIA Whitepaper.
[28] D. S. Smith, S. Sengupta, S. A. Smith, and E. B. Welch. Trajectory optimized NUFFT: Faster non-Cartesian MRI reconstruction through prior knowledge and parallel architectures. Magn. Reson. Med., 81(3):2064–2071, 2019.
[29] S.-C. Yang, H.-J. Qian, and Z.-Y. Lu. A new theoretical derivation of NFFT and its implementation on GPU. Appl. Comput. Harmon. Anal., 44(2):273–293, 2018.
[30] J.-Y. Lee and L. Greengard. The type 3 nonuniform FFT and its applications.