Implementation of 3D degridding algorithm on the NVIDIA GPUs using CUDA
Karel Adámek, Peter Wortmann, Bojan Nikolic, Ben Mort, Wesley Armour
Faculty of Information Technology, Czech Technical University, Thákurova 9, 16000 Prague, Czech Republic
Astrophysics Group, Cavendish Laboratory, JJ Thomson Avenue, Cambridge CB3 0HE, United Kingdom
SKA Organisation, Jodrell Bank, Lower Withington, Macclesfield, Cheshire, SK11 9FT, United Kingdom
Oxford e-Research Centre, Department of Engineering Sciences, University of Oxford, 7 Keble Road, Oxford OX1 3QG, United Kingdom
February 26, 2021
Abstract
Practical aperture synthesis imaging algorithms work by iterating between estimating the sky brightness distribution and a comparison of a prediction based on this estimate with the measured data ("visibilities"). Accuracy in the latter step is crucial but is made difficult by the irregular and non-planar sampling of data by the telescope. In this work we present a GPU implementation of 3D degridding which accurately deals with these two difficulties and is designed for distributed operation. We address the load-balancing issues caused by the large variation in the number of visibilities that need to be computed. Using CUDA and NVIDIA GPUs we measure performance up to 1.2 billion visibilities per second.
1 Introduction

Degridding is essential to deconvolution methods like CLEAN [Cornwell, 2008, Schwab and Cotton, 1983], which iteratively recover the sky brightness distribution by comparing measured data, called visibilities, with a model of the sky. In this process the sky brightness distribution is first calculated from the irregularly sampled visibilities. The brightest components of the sky brightness distribution are added to the sky model, which represents an approximation to the sky brightness distribution. When the sky model is updated, regularly sampled visibilities of the model must be degridded back to where the measured visibilities have been sampled and then subtracted from the data. The process is repeated on the resulting residuals until convergence. The accuracy of the degridding step is crucial for an accurate image of the sky.

Degridding is part of many imaging pipelines [Offringa et al., 2014, Tasse et al., 2013]. A GPU implementation of the image-domain degridding algorithm was published by Veenboer et al. [2017]. In this work, we examine a GPU implementation of the deconvolution of a list of visibilities with pre-computed grid data, which is part of the improved w-stacking algorithm [Ye, 2019]. The source code is available at https://github.com/KAdamek/GPU_degridding

2 Implementation
We have implemented the degridding using CUDA for NVIDIA GPUs. The degridding algorithm is a sum of weighted regularly sampled visibilities. In the case of 3D degridding we assume that the regularly sampled visibilities with coordinates u, v, and w are in the form of a stack of subgrids. For the implementations we have further assumed that the input consists of:

• a precomputed oversampled gridding convolution function (GCF) for the uv plane, G_uv, with support 8 × 8;
• a precomputed oversampled GCF for the w planes, G_w, with support 4;
• a stack of S precomputed subgrids of size 512 × 512;
• N_vis,S visibilities per subgrid S which need to be deconvolved; and
• a list of the N_vis,S visibility coordinates [u, v, w] for each subgrid.

To test the performance, we have assumed that the visibilities which need to be deconvolved form a line with a constant step du, dv, and dw. However, the code does not rely on this assumption and works with any distribution of visibilities.

For each visibility i from a subgrid S we need to calculate two sets of coordinates. One set of coordinates, g = [x, y, z], refers to the position of the subgrid point where the deconvolution starts. The second set of coordinates gives the location of the correct weights within the oversampled GCFs G_uv and G_w. These coordinates are calculated within the threadblock; the resulting weighted sum is written out explicitly below.

In the optimised version of the deconvolution (D1), a single threadblock of 32 threads processes N_b visibilities, where N_b is adjustable. This offers some data reuse, as a single grid point may contribute to multiple visibilities. Reuse of the GCF weights is harder to achieve because the GCF is oversampled. The threads first cooperate on the calculation of all coordinates for the assigned visibilities. These values are then stored in shared memory and used by all threads in the deconvolution part.

The deconvolution proceeds in two phases. First, each thread accumulates its complex visibility along the subgrid's y-axis (v-axis). This allows threads to have contiguous access to memory for both the subgrid points and the GCF weights. In the second phase, these partial results are pooled using a parallel reduction to obtain the deconvolved visibility. As all these operations are confined to a single CUDA warp, no synchronization is required (a sketch of this scheme is given below). The visibilities are stored in shared memory and written out in bulk, in a contiguous manner, to the device memory at the end of the threadblock execution.

The number of threadblocks in the x-direction of the CUDA grid is given by the maximum number of visibilities per subgrid over all active subgrids,

$$ N_{th} = \max_S \left( N_{vis,S} / N_b \right). \qquad (1) $$

The number of threadblocks in the y-direction is equal to the number of subgrids S. This distribution of the work between threadblocks may pose a load-balancing issue, as not all subgrids may have the same number of visibilities. This may be a problem for the SKA, as some of the subgrids may have significantly more visibilities than others.

In addition to the deconvolution described above, we have also implemented different variations of the algorithm. Alternatively, the optimised GPU kernel can be launched with 64 threads, where the deconvolution is first reduced in the z-direction (w-direction) before all threads perform the parallel reduction (kernel D4). Since this involves multiple warps, it requires synchronization.

Lastly, we have implemented a basic version of the GPU kernel (D2), where each threadblock processes only one visibility.
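All kernel variants evaluate the same per-visibility weighted sum. Written out explicitly, as a sketch which assumes that the slices of the oversampled GCFs matching the visibility's fractional offsets have already been selected and which denotes the subgrid by A_S:

$$ V_i = \sum_{z=0}^{3} \sum_{y=0}^{7} \sum_{x=0}^{7} G_w(z)\, G_{uv}(y, x)\, A_S(g_z + z,\ g_y + y,\ g_x + x), $$

where g = [g_x, g_y, g_z] are the start coordinates calculated for visibility i, and the loop bounds follow from the GCF supports 8 × 8 and 4.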
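A minimal sketch of the D1 scheme is shown below, assuming one visibility per 32-thread block (the batching of N_b visibilities and the shared-memory staging of coordinates and results are omitted); the names, the [z][y][x] subgrid layout, and the GCF storage are our illustrative assumptions, not the code from the repository:

#include <cuda_runtime.h>
#include <cuComplex.h>

#define SUPP_UV 8    // GCF support in u and v (from the text)
#define SUPP_W  4    // GCF support in w (from the text)
#define SUBGRID 512  // assumed subgrid side length

// One 32-thread block (8 threads along u x 4 along w) deconvolves one
// visibility; each thread accumulates along v, then the warp reduces.
__global__ void degrid_d1_sketch(
    cuFloatComplex       *d_vis,      // output visibilities
    const cuFloatComplex *d_subgrid,  // one subgrid: SUPP_W planes of SUBGRID^2
    const float          *d_gcf_uv,   // oversampled uv GCF: [offset][8 x 8]
    const float          *d_gcf_w,    // oversampled w GCF: [offset][4]
    const int3           *d_start,    // start coordinates g = [x, y, z]
    const int            *d_off_uv,   // per-visibility uv oversampling offset
    const int            *d_off_w)    // per-visibility w oversampling offset
{
    int vis = blockIdx.x;
    int tx  = threadIdx.x % SUPP_UV;  // thread's position along u
    int tz  = threadIdx.x / SUPP_UV;  // thread's position along w
    int3 g  = d_start[vis];
    const float *guv = d_gcf_uv + d_off_uv[vis] * SUPP_UV * SUPP_UV;
    const float *gw  = d_gcf_w  + d_off_w[vis]  * SUPP_W;

    // Phase 1: accumulate along the v-axis; at each step of the loop,
    // consecutive threads read consecutive subgrid points (contiguous access).
    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int ty = 0; ty < SUPP_UV; ty++) {
        size_t idx = ((size_t)(g.z + tz) * SUBGRID + (g.y + ty)) * SUBGRID
                   + (g.x + tx);
        float w = gw[tz] * guv[ty * SUPP_UV + tx];
        acc.x += w * d_subgrid[idx].x;
        acc.y += w * d_subgrid[idx].y;
    }

    // Phase 2: warp-level parallel reduction of the 32 partial sums; all
    // operations stay within one warp, so no __syncthreads() is needed.
    for (int offset = 16; offset > 0; offset >>= 1) {
        acc.x += __shfl_down_sync(0xffffffff, acc.x, offset);
        acc.y += __shfl_down_sync(0xffffffff, acc.y, offset);
    }
    if (threadIdx.x == 0) d_vis[vis] = acc;
}

In the full implementation a threadblock processes N_b visibilities, and the kernel is launched on a grid of N_th × S threadblocks following Eq. (1), e.g. dim3 grid(N_th, S).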
This basic GPU kernel is also implemented using dynamic parallelism (D3). Dynamic parallelism allows us to launch GPU kernels from inside another GPU kernel. This means that each thread launches a separate CUDA grid with different dimensions for each subgrid, mitigating the load-balancing issue to a degree.
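One way the dynamic-parallelism variant can be structured is sketched below; the names and the offset bookkeeping are our assumptions, and device-side launches require building with relocatable device code (nvcc -rdc=true) and linking against cudadevrt:

#include <cuda_runtime.h>
#include <cuComplex.h>

// Child kernel: one threadblock per visibility of one subgrid (the basic
// D2 scheme); the deconvolution body is elided here.
__global__ void degrid_child(cuFloatComplex *d_vis, int subgrid, int n_vis)
{
    // ... per-visibility deconvolution for this subgrid ...
}

// Parent kernel: each thread launches a child grid sized to its subgrid's
// visibility count, so sparsely populated subgrids do not hold idle blocks.
__global__ void degrid_parent(cuFloatComplex *d_vis,
                              const int      *d_vis_per_subgrid,  // N_vis,S
                              const int      *d_vis_offset,       // prefix sums
                              int             n_subgrids)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= n_subgrids) return;

    int n_vis = d_vis_per_subgrid[s];
    if (n_vis > 0)
        degrid_child<<<n_vis, 32>>>(d_vis + d_vis_offset[s], s, n_vis);
}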
Figure 1: Number of visibilities processed per second for subgrids with a constant number of visibilities per subgrid.

Figure 2: Number of visibilities processed per second for subgrids with a varying number of visibilities per subgrid.

3 Results

To assess the performance of the deconvolution GPU kernels we have used two scenarios. In the first scenario, the number of visibilities per subgrid is the same for all subgrids. This is shown in Figure 1. In the second scenario, shown in Figure 2, the number of visibilities per subgrid increases with every subgrid, up to the maximum number of visibilities in the last subgrid.

The results show that the choice of the optimal GPU kernel depends on the expected number of visibilities per subgrid. For a lower number of visibilities, below 262 thousand, the reduction through the z-direction (D4) performs better. For a higher number of visibilities, the better performing kernel is D1. The results also show the advantage of dynamic parallelism for the basic GPU kernel, which was not repeated when dynamic parallelism was applied to the optimised kernels D1 or D4. The optimal number of visibilities processed per threadblock, N_b, depends on the steps du, dv, and dw; the differences are, however, marginal, and a single value of N_b = 32 is sufficient. For numbers of visibilities per subgrid below 16 thousand, it is substantially better to process only one visibility per threadblock.
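Taken together, these measurements suggest a simple selection rule. The following hypothetical host-side dispatch is our illustration of it, not an API from the repository; the crossover points are the per-subgrid visibility counts quoted above:

// Hypothetical kernel selection based on the measured crossover points.
enum DegridKernel { D1_REDUCE_V, D3_BASIC_DP, D4_REDUCE_W };

DegridKernel choose_kernel(long max_vis_per_subgrid)
{
    if (max_vis_per_subgrid < 16000)  return D3_BASIC_DP;  // one visibility per threadblock
    if (max_vis_per_subgrid < 262000) return D4_REDUCE_W;  // 64 threads, reduce along w first
    return D1_REDUCE_V;                                    // 32 threads, reduce along v
}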
Acknowledgements

This work has received support from STFC Grant (ST/T000570/1). The authors acknowledge the support of the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics".
References
Cornwell, T. J. (2008). Multiscale CLEAN Deconvolution of Radio Synthesis Images. IEEE Journal of Selected Topics in Signal Processing, 2(5):793–801.

Offringa, A. R. et al. (2014). WSCLEAN: an implementation of a fast, generic wide-field imager for radio astronomy. MNRAS, 444(1):606–619.

Schwab, F. R. and Cotton, W. D. (1983). Global fringe search techniques for VLBI. Astronomical Journal, 88:688–694.

Tasse, C. et al. (2013). Applying full polarization A-Projection to very wide field of view instruments: An imager for LOFAR. A&A, 553:A105.

Veenboer, B., Petschow, M., and Romein, J. W. (2017). Image-domain gridding on graphics processors. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 545–554.

Ye, H. (2019).