Implementation of 3D degridding algorithm on the NVIDIA GPUs using CUDA
Karel Adámek, Peter Wortmann, Bojan Nikolic, Ben Mort, Wesley Armour
Faculty of Information Technology, Czech Technical University, Thákurova 9, 16000 Prague, Czech Republic
Astrophysics Group, Cavendish Laboratory, JJ Thomson Avenue, Cambridge CB3 0HE, United Kingdom
SKA Organisation, Jodrell Bank, Lower Withington, Macclesfield, Cheshire, SK11 9FT, United Kingdom
Oxford e-Research Centre, Department of Engineering Sciences, University of Oxford, 7 Keble Road, Oxford OX1 3QG, United Kingdom
February 26, 2021
Abstract
Practical aperture synthesis imaging algorithms work by iterating between estimating the sky brightness distribution and a comparison of a prediction based on this estimate with the measured data ("visibilities"). Accuracy in the latter step is crucial but is made difficult by the irregular and non-planar sampling of data by the telescope. In this work we present a GPU implementation of 3D degridding which accurately deals with these two difficulties and is designed for distributed operation. We address the load-balancing issues caused by the large variation in the number of visibilities that need to be computed. Using CUDA and NVIDIA GPUs we measure performance up to 1.2 billion visibilities per second.
1 Introduction

Degridding is essential to deconvolution methods like CLEAN [Cornwell, 2008, Schwab and Cotton, 1983], which iteratively recover the sky brightness distribution by comparing measured data, called visibilities, with a model of the sky. In this process the sky brightness distribution is first calculated from the irregularly sampled visibilities. The brightest components of the sky brightness distribution are added to the sky model, which represents an approximation to the sky brightness distribution. When the sky model is updated, regularly sampled visibilities of the model must be degridded back to where the measured visibilities have been sampled and then subtracted from the data. The process is repeated on the resulting residuals until convergence. The accuracy of the degridding step is crucial for an accurate image of the sky.

Degridding is part of many imaging pipelines [Offringa et al., 2014, Tasse et al., 2013]. A GPU implementation of the image-domain degridding algorithm was published by Veenboer et al. [2017]. In this work, we examine a GPU implementation of the deconvolution of a list of visibilities with pre-computed grid data, which is part of the improved w-stacking algorithm [Ye, 2019]. The source code is available at https://github.com/KAdamek/GPU_degridding

2 Implementation
We have implemented the degridding using CUDA for NVIDIA GPUs. The degridding algorithm is a sum of weighted regularly sampled visibilities. In the case of 3D degridding we assume that the regularly sampled visibilities with coordinates u, v, and w are in the form of a stack of subgrids. For the implementations we have further assumed that the input consists of:

• a precomputed oversampled gridding convolution function (GCF) for the uv plane, G_uv, with support 8 × 8;
• a precomputed oversampled GCF for the w planes, G_w, with support 4;
• a stack of S precomputed subgrids of size 512 × 512;
• N_vis,S visibilities per subgrid S which need to be deconvolved; and
• a list of the N_vis,S visibility coordinates [u, v, w] for each subgrid.

To test the performance, we have assumed that the visibilities which need to be deconvolved form a line with a constant step du, dv, and dw. However, the code does not rely on this assumption and works with any distribution of visibilities.

For each visibility i from a subgrid S we need to calculate two sets of coordinates. One set of coordinates, g = [x, y, z], refers to the position of the subgrid point where the deconvolution starts. The second set of coordinates gives the location of the correct weights within the oversampled GCFs G_uv and G_w. These coordinates are calculated within the threadblock; the resulting weighted sum is written out explicitly below.

In the optimised version of the deconvolution (D1), a single threadblock of 32 threads processes N_b visibilities, where N_b is adjustable. This offers some data reuse, as a single grid point may contribute to multiple visibilities. Reuse of the GCF weights is harder to achieve because the GCF is oversampled. The threads first cooperate on the calculation of all coordinates for the assigned visibilities. These values are then stored in shared memory and used by all threads in the deconvolution part.

The deconvolution proceeds in two phases. First, each thread accumulates its complex visibility along the subgrid's y-axis (v-axis). This allows threads to have contiguous access to memory for both the subgrid points and the GCF weights. In the second phase, these partial results are pooled using a parallel reduction to obtain the deconvolved visibility. As all these operations are confined to a single CUDA warp, no synchronization is required (a sketch of this scheme is given below). The visibilities are stored in shared memory and written out in bulk, in a contiguous manner, to the device memory at the end of the threadblock execution.

The number of threadblocks in the x-direction of the CUDA grid is given by the maximum number of visibilities per subgrid over all active subgrids,

$$ N_{th} = \max_S \left( N_{vis,S} / N_b \right). \qquad (1) $$

The number of threadblocks in the y-direction is equal to the number of subgrids S. This distribution of the work between threadblocks may pose a load-balancing issue, as not all subgrids may have the same number of visibilities. This may be a problem for the SKA, as some of the subgrids may have significantly more visibilities than others.

In addition to the deconvolution described above, we have also implemented different variations of the algorithm. Alternatively, the optimised GPU kernel can be launched with 64 threads, where the deconvolution is first reduced in the z-direction (w-direction) before all threads perform the parallel reduction (kernel D4). Since this involves multiple warps, it requires synchronization.

Lastly, we have implemented a basic version of the GPU kernel (D2), where each threadblock processes only one visibility.
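All kernel variants evaluate the same per-visibility weighted sum. Written out explicitly, as a sketch which assumes that the slices of the oversampled GCFs matching the visibility's fractional offsets have already been selected and which denotes the subgrid by A_S:

$$ V_i = \sum_{z=0}^{3} \sum_{y=0}^{7} \sum_{x=0}^{7} G_w(z)\, G_{uv}(y, x)\, A_S(g_z + z,\ g_y + y,\ g_x + x), $$

where g = [g_x, g_y, g_z] are the start coordinates calculated for visibility i, and the loop bounds follow from the GCF supports 8 × 8 and 4.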
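A minimal sketch of the D1 scheme is shown below, assuming one visibility per 32-thread block (the batching of N_b visibilities and the shared-memory staging of coordinates and results are omitted); the names, the [z][y][x] subgrid layout, and the GCF storage are our illustrative assumptions, not the code from the repository:

#include <cuda_runtime.h>
#include <cuComplex.h>

#define SUPP_UV 8    // GCF support in u and v (from the text)
#define SUPP_W  4    // GCF support in w (from the text)
#define SUBGRID 512  // assumed subgrid side length

// One 32-thread block (8 threads along u x 4 along w) deconvolves one
// visibility; each thread accumulates along v, then the warp reduces.
__global__ void degrid_d1_sketch(
    cuFloatComplex       *d_vis,      // output visibilities
    const cuFloatComplex *d_subgrid,  // one subgrid: SUPP_W planes of SUBGRID^2
    const float          *d_gcf_uv,   // oversampled uv GCF: [offset][8 x 8]
    const float          *d_gcf_w,    // oversampled w GCF: [offset][4]
    const int3           *d_start,    // start coordinates g = [x, y, z]
    const int            *d_off_uv,   // per-visibility uv oversampling offset
    const int            *d_off_w)    // per-visibility w oversampling offset
{
    int vis = blockIdx.x;
    int tx  = threadIdx.x % SUPP_UV;  // thread's position along u
    int tz  = threadIdx.x / SUPP_UV;  // thread's position along w
    int3 g  = d_start[vis];
    const float *guv = d_gcf_uv + d_off_uv[vis] * SUPP_UV * SUPP_UV;
    const float *gw  = d_gcf_w  + d_off_w[vis]  * SUPP_W;

    // Phase 1: accumulate along the v-axis; at each step of the loop,
    // consecutive threads read consecutive subgrid points (contiguous access).
    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int ty = 0; ty < SUPP_UV; ty++) {
        size_t idx = ((size_t)(g.z + tz) * SUBGRID + (g.y + ty)) * SUBGRID
                   + (g.x + tx);
        float w = gw[tz] * guv[ty * SUPP_UV + tx];
        acc.x += w * d_subgrid[idx].x;
        acc.y += w * d_subgrid[idx].y;
    }

    // Phase 2: warp-level parallel reduction of the 32 partial sums; all
    // operations stay within one warp, so no __syncthreads() is needed.
    for (int offset = 16; offset > 0; offset >>= 1) {
        acc.x += __shfl_down_sync(0xffffffff, acc.x, offset);
        acc.y += __shfl_down_sync(0xffffffff, acc.y, offset);
    }
    if (threadIdx.x == 0) d_vis[vis] = acc;
}

In the full implementation a threadblock processes N_b visibilities, and the kernel is launched on a grid of N_th × S threadblocks following Eq. (1), e.g. dim3 grid(N_th, S).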
This basic GPU kernel is also implemented using dynamic parallelism (D3). Dynamic parallelism allows us to launch GPU kernels from inside another GPU kernel. This means that each thread launches a separate CUDA grid with different dimensions for each subgrid, mitigating the load-balancing issue to a degree.
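One way the dynamic-parallelism variant can be structured is sketched below; the names and the offset bookkeeping are our assumptions, and device-side launches require building with relocatable device code (nvcc -rdc=true) and linking against cudadevrt:

#include <cuda_runtime.h>
#include <cuComplex.h>

// Child kernel: one threadblock per visibility of one subgrid (the basic
// D2 scheme); the deconvolution body is elided here.
__global__ void degrid_child(cuFloatComplex *d_vis, int subgrid, int n_vis)
{
    // ... per-visibility deconvolution for this subgrid ...
}

// Parent kernel: each thread launches a child grid sized to its subgrid's
// visibility count, so sparsely populated subgrids do not hold idle blocks.
__global__ void degrid_parent(cuFloatComplex *d_vis,
                              const int      *d_vis_per_subgrid,  // N_vis,S
                              const int      *d_vis_offset,       // prefix sums
                              int             n_subgrids)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= n_subgrids) return;

    int n_vis = d_vis_per_subgrid[s];
    if (n_vis > 0)
        degrid_child<<<n_vis, 32>>>(d_vis + d_vis_offset[s], s, n_vis);
}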
Figure 1: Number of visibilities processed per second for subgrids with a constant number of visibilities per subgrid.

Figure 2: Number of visibilities processed per second for subgrids with a varying number of visibilities per subgrid.

3 Results

To assess the performance of the deconvolution GPU kernels we have used two scenarios. In the first scenario, the number of visibilities per subgrid is the same for all subgrids. This is shown in Figure 1. In the second scenario, shown in Figure 2, the number of visibilities per subgrid increases with every subgrid, up to the maximum number of visibilities in the last subgrid.

The results show that the choice of the optimal GPU kernel depends on the expected number of visibilities per subgrid. For a lower number of visibilities, below 262 thousand, the reduction through the z-direction (D4) performs better. For a higher number of visibilities, the better performing kernel is D1. The results also show the advantage of dynamic parallelism for the basic GPU kernel, which was not repeated when dynamic parallelism was applied to the optimised kernels D1 or D4. The optimal number of visibilities processed per threadblock, N_b, depends on the steps du, dv, and dw; the differences are, however, marginal, and a single value of N_b = 32 is sufficient. For numbers of visibilities per subgrid below 16 thousand, it is substantially better to process only one visibility per threadblock.
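Taken together, these measurements suggest a simple selection rule. The following hypothetical host-side dispatch is our illustration of it, not an API from the repository; the crossover points are the per-subgrid visibility counts quoted above:

// Hypothetical kernel selection based on the measured crossover points.
enum DegridKernel { D1_REDUCE_V, D3_BASIC_DP, D4_REDUCE_W };

DegridKernel choose_kernel(long max_vis_per_subgrid)
{
    if (max_vis_per_subgrid < 16000)  return D3_BASIC_DP;  // one visibility per threadblock
    if (max_vis_per_subgrid < 262000) return D4_REDUCE_W;  // 64 threads, reduce along w first
    return D1_REDUCE_V;                                    // 32 threads, reduce along v
}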
Acknowledgements

This work has received support from STFC Grant (ST/T000570/1). The authors acknowledge the support of the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics".
References
Cornwell, T. J. (2008). Multiscale CLEAN Deconvolution of Radio Synthesis Images. IEEE Journal of Selected Topics in Signal Processing, 2(5):793–801.

Offringa, A. R. et al. (2014). WSCLEAN: an implementation of a fast, generic wide-field imager for radio astronomy. MNRAS, 444(1):606–619.

Schwab, F. R. and Cotton, W. D. (1983). Global fringe search techniques for VLBI. Astronomical Journal, 88:688–694.

Tasse, C. et al. (2013). Applying full polarization A-Projection to very wide field of view instruments: An imager for LOFAR. A&A, 553:A105.

Veenboer, B., Petschow, M., and Romein, J. W. (2017). Image-domain gridding on graphics processors. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 545–554.

Ye, H. (2019).