Efficiency Near the Edge: Increasing the Energy Efficiency of FFTs on GPUs for Real-time Edge Computing
Karel Adámek, Jan Novotný, Jeyarajan Thiyagalingam, Wesley Armour
Karel Adámek, Jan Novotný, Jeyarajan Thiyagalingam, and Wesley Armour*
Oxford e-Research Centre, Department of Engineering Sciences, University of Oxford, 7 Keble Road, Oxford, OX1 3QG, United Kingdom
Faculty of Information Technology, Czech Technical University, Thákurova 9, 160 00, Prague, Czech Republic
Research Centre for Theoretical Physics and Astrophysics, Institute of Physics, Silesian University in Opava, Bezručovo náměstí 13, CZ-74601, Opava, Czech Republic
Rutherford Appleton Laboratory, Science and Technology Facilities Council, Harwell Campus, Didcot, OX11 0QX, UK
September 15, 2020
Abstract
The Square Kilometre Array (SKA) is an international initiative for developing the world's largest radio telescope, with a total collecting area of over a million square meters. The scale of the operation, combined with the remote location of the telescope, requires the use of energy-efficient computational algorithms. This, along with the extreme data rates that will be produced by the SKA and the requirement for a real-time observing capability, necessitates in-situ data processing in an edge-style computing solution. More generally, energy efficiency in the modern computing landscape is becoming of paramount concern, whether it be the power budget that can limit some of the world's largest supercomputers, or the limited power available to the smallest Internet-of-Things devices. In this paper, we study the impact of hardware frequency scaling on the energy consumption and execution time of the Fast Fourier Transform (FFT) on NVIDIA GPUs using the cuFFT library. The FFT is used in many areas of science and it is one of the key algorithms used in radio astronomy data processing pipelines. Through the use of frequency scaling, we show that we can lower the power consumption of the NVIDIA V100 GPU when computing the FFT by up to 60% compared to the boost clock frequency, with less than a 10% increase in the execution time. Furthermore, using one common core clock frequency for all tested FFT lengths, we show on average a 50% reduction in power consumption compared to the boost core clock frequency, with the increase in execution time still below 10%. We demonstrate how these results can be used to lower the power consumption of existing data processing pipelines. These savings, when considered over years of operation, can yield significant financial savings, but can also lead to a significant reduction of greenhouse gas emissions.

* E-mail address: [email protected]
Keywords — Energy efficiency, Green computing, High performance computing, Real-time systems, Parallel architectures
Introduction

The Fast Fourier Transform (FFT) is one of the most fundamental and widely used numerical algorithms in scientific computing, with applications in a diverse range of areas such as astronomy, image processing, audio and radar signal processing, numerical solvers (such as partial differential equation solvers), and mechanical systems [8]. The FFT is also an integral part of many data processing pipelines. For instance, the FFT is an important part of data processing pipelines in both image- [38, 29, 39, 14] and time-domain [12, 4, 3, 22] radio astronomy.

The upcoming, next-generation radio telescope, the Square Kilometre Array (SKA), will employ such complex data processing pipelines to deliver science products that will provide new and exciting insights into our Universe. Previous studies, for example [11], estimate that the SKA will require an exascale high performance computing (HPC) system to provide us with such scientific products, where the computational footprint of the FFT, depending on the data processing task, may occupy [20] up to 47% of the overall computational footprint measured in floating-point operations per second (FLOPS). This makes the FFT a critical algorithm for the SKA.

Processing the data captured by the SKA poses many challenges. The SKA will produce extremely large volumes of data at unprecedented rates. Furthermore, the telescope itself must be located in a radio-quiet area due to its extreme sensitivity. This makes the persistent storage of all data not viable, and the transportation of these data to a well equipped (and suitably powered) data centre impractical. Finally, some science cases, such as the study of Fast Radio Bursts (FRBs), necessitate near real-time data processing, meaning that data has to be processed close to the instrument itself. These constraints present significant challenges to software and system engineers; they demand high fractions of the peak performance of the hardware, whilst maintaining the best possible energy efficiency of both software and hardware.

To address the need to minimise the power consumption of the locally installed hardware, close attention must be paid to the energy efficiency of the data processing algorithms, specifically the FFT. Given the emphasis on lower power consumption in HPC in general, the ability to compute the FFT more efficiently is of interest to many computational domains.

The near real-time processing constraint means that the execution time of the data processing algorithms must not be increased significantly. An increase in the execution time might lead either to a failure to process data on time, and hence a loss of scientifically important data, or to increased capital and operational costs, as more hardware would be needed to meet the real-time requirement.

Motivated by this, we have studied the impact of dynamic frequency scaling (DFS) on the energy efficiency and execution time of the FFT on NVIDIA GPUs using the cuFFT library [27]. The GPU is the fastest and most energy-efficient choice of hardware for image-domain radio astronomy, as shown by [40], with FPGAs a close second. There are other FFT libraries for GPUs, notably the clFFT library, which uses the OpenCL framework.
clFFT is not a vendor-supported library and was shown by [34] to be slower than cuFFT on NVIDIA GPUs, thus we have not considered it for this work.

Our exhaustive study, conducted on a range of state-of-the-art GPUs, shows that careful tuning of the core clock frequency can save, in the case of the V100 GPU, up to 60% (compared to the boost core clock frequency) of the energy consumption of the FFT. This saving can have a significant impact on two fronts: financial savings in recurrent costs, and the associated reduced CO2 emissions. We also show that these carefully tuned frequencies can be replaced with a single frequency that is specific to each model of GPU and chosen floating-point precision, whilst still being able to save on average up to 50% of the FFT energy consumption (for the V100 GPU and boost core clock frequency).

The main contributions of this work are:

• We have performed an in-depth investigation of the cuFFT library's power consumption and execution time, and how these change with core clock frequency, for a wide range of problem sizes and numerical precisions (FP16, FP32 and FP64) on five NVIDIA GPUs.

• We identify an optimal core clock frequency with the highest energy efficiency for all problem sizes and numerical precisions, and have shown that a single mean optimal frequency per GPU model gives similar power savings regardless of problem size.

• We demonstrate how these results can be used to lower the power consumption of existing data processing pipelines.

Whilst this work has been motivated by the SKA radio telescope, the conclusions of the work are applicable to any computational task that employs cuFFT running on NVIDIA GPUs.
Power consumption in HPC is being addressed on multiple levels, from the construction of the cluster down to new energy-efficient hardware. The power consumption of specific hardware depends on the execution time, the time taken to finish a calculation, and also on the utilization of the hardware (memory, cache, computing cores). The software itself also plays an important role in power consumption. Energy can be saved through proper software design, making software stable [25], and through the use of appropriate algorithms.

However, concerns regarding energy efficiency in the modern computing landscape are not solely limited to HPC. Edge computing is becoming an increasingly important research area, driven by the explosion of Internet-of-Things devices. The basic premise of edge computing is to capture and process data as close to their sources as is possible by utilising lightweight processors. Because edge computing aims to process data locally, it minimizes wider latency and bandwidth needs and allows for real-time feedback. It is estimated that by 2025 around 150 billion devices will be connected and creating data in real-time [31], with the FFT playing an important role not only in the communication between devices, but also in processing the collected data. Hence optimising the energy efficiency of the FFT on edge devices is of importance from an environmental perspective. This has motivated us to include NVIDIA's Jetson Nano in our selection of hardware, since it represents NVIDIA's low-power edge computing solution.

The idea behind DFS, which is part of the dynamic voltage and frequency scaling (DVFS) method, is to make hardware more energy efficient under different loads by adjusting hardware performance, which is achieved by changing clock frequencies to fit the application running on it. By decreasing the clock frequency of a component we decrease its performance while increasing its utilization, and thus decrease the power consumption of that component. For example, Trefethen et al. [37] have investigated possible energy savings when running software on CPUs with different numbers of threads, compilers and CPU clock frequencies.

Applications can be broadly separated into two classes of performance. The first is where an application or algorithm is compute-bound. This is where the performance bottleneck of the application is the compute resource. This can be the number of floating-point operations which can be performed per second (FLOPS), but also the number of instructions which can be issued per second. The second broad category is memory bandwidth bound applications, where we have enough compute resources but we cannot supply the data through the memory bus to the computing cores quickly enough. In this case the performance is limited by the memory bandwidth. This bandwidth limitation can occur at any level in the computer's memory hierarchy; for example, this might be at the level of access to the GPU main memory (called device memory), or at the level of one of the caches.

We have investigated the cuFFT library using the NVIDIA Visual Profiler (NVVP). This shows that for all investigated problem sizes the GPU kernels used by the cuFFT library are device memory bandwidth bound.
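A back-of-the-envelope estimate (ours, not from the paper) shows why this is expected. Assuming an FP32 complex-to-complex transform of length N in which each 8-byte element is read and written once per pass over the data, the arithmetic intensity is

$$I = \frac{5N\log_2 N}{16N} = \frac{5\log_2 N}{16}\ \text{FLOP/byte},$$

so for N = 16384 we get I ≈ 4.4 FLOP/byte. At the V100's 900 GB/s device memory bandwidth this caps the throughput at roughly 3.9 TFLOPS, far below the card's FP32 peak of about 15.7 TFLOPS; the device memory bus therefore saturates first, and multi-kernel transforms, which make several passes over the data, are bounded even more tightly.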
The one-dimensional discrete Fourier transformation (DFT) of a signal x is given by

$$X_l = \sum_{n=0}^{N-1} x_n \exp\left[-i\,\frac{2\pi n l}{N}\right], \qquad (1)$$

where X_l is the l-th element of the transformed signal, x_n is the n-th element of the input signal, and N is the transformation length, or the FFT length.

The cuFFT library [27] uses the Cooley-Tukey algorithm [17] for FFT sizes that can be decomposed as multiples of powers of primes from 2 to 127, and Bluestein's algorithm [6] otherwise. For longer FFT lengths the cuFFT library uses multiple GPU kernels to compute the entire FFT, which can be seen by studying the cuFFT library using the NVVP. In many cases, the Fourier transform is calculated more quickly if the FFT length is increased by padding to a more optimized length, as was shown by [35].

The two-dimensional Fourier transformation is given by the formula

$$X_{l,k} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x_{n,m} \exp\left[-i\,2\pi\left(\frac{nl}{N} + \frac{mk}{M}\right)\right], \qquad (2)$$

where x_{n,m} and X_{l,k} are now elements of a matrix of size N × M. The sums in this equation can be evaluated independently, which allows us to decompose the two-dimensional Fourier transformation into two sets of one-dimensional Fourier transformations. This is routinely done, and it is indeed what cuFFT does when calculating higher-dimensional (2D, 3D) Fourier transformations, as shown by the NVVP. Thus, by investigating the energy efficiency of the one-dimensional Fourier transformation we are also investigating the energy efficiency of the higher-dimensional Fourier transforms.

The GPU design methodology is different to that of a CPU. A CPU architecture is aimed at low-latency computations, but also has lower throughput. In other words, the CPU can execute a wider range of complicated algorithms quickly, for example complicated branching code, but the number of concurrently running tasks is small. A GPU architecture has high latency but also high throughput; on a GPU one can execute thousands of simple tasks, but each task takes longer to process due to the simpler schedulers that are employed. Both platforms are broadening their focus: CPUs are adding more cores and increasing their vector lengths, while GPU architectures become more complex and GPU schedulers become more sophisticated.
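As a concrete illustration of the batched one-dimensional transforms measured in this work, the following minimal sketch plans and executes a batch of FP32 complex-to-complex FFTs with cuFFT. It is our illustration, not the paper's benchmark code; the length and batch size are arbitrary assumptions.

#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int N = 16384;      // FFT length (illustrative value)
    const int batch = 16384;  // number of transforms executed by one call

    // 2 GB of interleaved FP32 complex samples; the contents are irrelevant
    // for a timing/power run, so the buffer is left uninitialised.
    cufftComplex *data;
    if (cudaMalloc(&data, sizeof(cufftComplex) * (size_t)N * batch) != cudaSuccess)
        return 1;

    // One plan describes the whole batch; cuFFT selects the kernels internally.
    cufftHandle plan;
    if (cufftPlan1d(&plan, N, CUFFT_C2C, batch) != CUFFT_SUCCESS) return 1;

    // Execute all transforms in place and wait for the GPU to finish.
    if (cufftExecC2C(plan, data, data, CUFFT_FORWARD) != CUFFT_SUCCESS) return 1;
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}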
Figure 1: A schematic of the GPU architecture.

The GPU architecture, shown in simplified form in Fig. 1, is divided into the memory block and the compute block. The compute block is further divided into caches and streaming multiprocessors (SMs), which are responsible for executing the computations. The SMs are further divided into specialized units such as floating-point cores or special function units (which are responsible for computing things like transcendental functions). The memory hierarchy on the GPU is distributed between these two blocks. The device memory, which runs at the memory clock frequency, has the lowest bandwidth on the GPU card, and it is the memory that the CPU (host) can read/write via the PCIe bus. The L2 cache is shared between the SMs, the L1 cache is private to each SM, and the shared memory is shared amongst a group of threads called a threadblock. The L2, L1 and shared memory bandwidth is proportional to the core clock frequency; thus by using a lower core clock frequency we also decrease the bandwidth of these caches. The core clock frequency, as well as the memory clock frequency, can only be set to predefined values.

Different GPUs may use different memory modules. Amongst the tested GPUs were GPUs with GDDR memory modules (Titan XP, P4, Jetson Nano), which allow us to change the memory clock frequency, but also GPUs with HBM2 modules (Titan V, V100), which do not allow us to change the memory clock frequency.

When measuring the power consumption and performance of the GPU it is important to keep the GPU utilized. For example, the NVIDIA V100 GPU has 80 streaming multiprocessors (SMs), where each SM is able to run up to 2048 threads. This gives more than 150 thousand threads which can execute concurrently. Thus in our measurements we have used a fixed amount of data, containing a different number of individual Fourier transforms, to keep the GPU utilized for all tested FFT lengths.
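The concurrency figures quoted above can be read from the device properties at runtime. The short helper below (our illustration) reproduces the 80 SMs × 2048 threads ≈ 164k resident-thread bound for a V100:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;

    // Upper bound on concurrently resident threads: SMs x threads per SM.
    const long long resident =
        (long long)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    printf("%s: %d SMs, %d threads/SM -> %lld resident threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, resident);
    return 0;
}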
Data processing can be composed of a single step, but more often it is a series of processing steps which together form a data processing pipeline.

The ability of an application to process data in a real-time processing scenario can be described by the real-time speed-up factor. The real-time speed-up is calculated as S = t_a / t_p, where t_a is the time needed to acquire a given amount of data by the telescope, sensor, etc., and t_p is the time taken to process those data. When S > 1 the pipeline has spare processing capacity; when S < 1 it cannot keep up with the incoming data. A pipeline with S = 1 is processing the data in time but has no performance buffer to call on if needed. In such a case any increase in the execution time leads to S < 1.
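The margin implied by S is easy to make concrete; the helper below (ours, with made-up example values) computes S and the largest slowdown a pipeline can absorb while remaining real-time:

#include <cstdio>

// Real-time speed-up: S = t_a / t_p (acquisition time over processing time).
static double realtime_speedup(double t_acquire_s, double t_process_s) {
    return t_acquire_s / t_process_s;
}

int main() {
    // Example: 10 s of acquired data processed in 8 s.
    const double S = realtime_speedup(10.0, 8.0);
    // The pipeline stays real-time while execution time grows by less than (S - 1).
    printf("S = %.2f -> tolerates up to %.0f%% longer execution\n",
           S, (S - 1.0) * 100.0);
    return 0;
}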
As of November 2019, the first two positions in the Top500 list of supercomputers are held by systems that use GPUs. Within the top ten, five systems contain GPUs. In the Green500 list, GPUs are used in eight out of the top ten supercomputers. This is a clear demonstration that it is important to understand the power consumption, energy efficiency and potential energy savings for GPUs using DVFS.

The different approaches to measuring power consumption, power and performance modelling, and also the results of DVFS for selected applications, were reviewed by Mei et al. [24]. The authors note that the effect of DVFS depends not only on the architecture but also on the characteristics of the GPU application. They found the optimal frequency for 42 GPU applications, and found that 12 of them benefited from an increased core frequency compared to the default, whereas for 30 applications the optimal frequency was lower than the default core frequency; the values of these optimal frequencies were different for most GPU applications. The authors called for a deeper investigation into their differences. A useful review of the DVFS technique is provided by Mittal and Vetter [26]. The review by Bridges et al. [7] looked into the modelling of the power consumption of GPUs.

A number of published studies have investigated the reliability of power measurements using internal sensors. Burtscher et al. [9] published their experience of using built-in sensors when measuring the power consumption of NVIDIA K20 GPUs. They described several issues that they encountered when using these sensors and suggested methods to correct for these. The accuracy of the built-in sensors was investigated by Farad et al. [13], who found that the average mean error using an abstract model of a GPU is about 10% compared to measurements using external power meters. This error value was confirmed by Arafa et al. [5], who measured the energy consumption of almost all PTX instructions for four generations of NVIDIA GPUs. They found that the Maxwell and Turing generations of GPUs have high energy consumption when compared to the Pascal and Volta generations, which are found to be more energy efficient.

There are a number of papers where authors have used DVFS in the context of GPUs [2, 33, 41, 23, 15, 10, 21, 16, 18, 24, 36]. Guerreiro et al. [16] classified GPU applications into four different categories which describe their behaviour when DVFS is applied. These categories are an extension of the compute-bound, memory-bound distinction. Early work on GPU power consumption and DVFS was performed by Jiao et al. [18]. They investigated the behaviour of several GPU applications, which included the FFT algorithm; however, the cuFFT library was not studied because there were better performing FFT implementations at the time. The FFT was also indirectly included in Mei et al. [24] as part of the convolution, and in Tang et al. [36], where the author investigated the effect of DVFS on deep learning applications.

In relation to radio astronomy and the SKA, there are several works. Price et al. [30] made a detailed investigation into power consumption, voltage and frequency scaling of the GPU implementation of the correlator for the SKA. The power consumed by the GPU in the domain of radio astronomy was investigated by Romein [32]. The performance of the cuFFT library was investigated by Jondra et al. [19], along with its power consumption. However, increases in energy efficiency due to frequency scaling were not investigated.
The code that we have used for measurements of the energy efficiency of the FFT algorithm consists of a basic cuFFT benchmark, which can be found at https://github.com/KAdamek/cuFFT_benchmark. Power consumption was measured through the NVIDIA driver (nvidia-smi) for all GPU cards except the Jetson Nano, where we have used the tegrastats utility. For both we have specified the measurement interval to be 10 ms, as our tests have shown that setting the time sampling below 10 ms does not lead to an improvement in the time resolution of our data. The actual time between samples varied; the achieved sampling rate from the driver is on average 14.2 ms for all tested FFT lengths and cards. This sampling rate fulfills the criterion of a sampling interval of at most 15 ms (at least 66.7 Hz) recommended by Burtscher et al. [9] to accurately measure the energy consumption of real-world kernels.

For the localization of the FFT algorithm and establishing the execution time we have used the nvprof utility, where we have included the timestamp. Finally, we log the beginning and end of each GPU kernel execution to a file. This way we produce two files containing all of the needed metrics for all possible combinations of core clock frequencies for a specific FFT length, bit precision and GPU card. The final combination (via timestamp comparison) of these files is done using a simple R script. Here we compute all other metrics, including energy efficiency, optimal clock frequency, mean optimal core clock frequency and computational performance. In the script we also verify that the current core clock frequency is the same as the requested one, and compare the measured execution time from nvprof with the logged timestamps of the nvidia-smi query. Using this method we have found that, for the Titan V (driver version 450.36.06), the core clock frequency is capped to 1335 MHz by the driver during the computation, but during the copy of the results it is set to a higher core clock frequency (1837 MHz). For frequencies lower than 1335 MHz, no capping is observed. An example of the GPU kernel power consumption and active core clock frequency, localized using the log file timestamps, is shown for the V100 GPU in Fig. 2 (top). An example of the frequency capping on the Titan V GPU is shown in Fig. 2 (bottom).
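The sampling setup can be reproduced without nvidia-smi itself: NVML, the library that nvidia-smi queries, exposes the same power readings. The sketch below (ours, not the paper's tooling) polls the board power every 10 ms and timestamps each sample:

#include <chrono>
#include <cstdio>
#include <thread>
#include <nvml.h>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) return 1;

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i) {  // roughly 10 s of samples
        unsigned int mW = 0;          // the driver reports milliwatts
        if (nvmlDeviceGetPowerUsage(dev, &mW) == NVML_SUCCESS) {
            const double t = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - t0).count();
            printf("%.4f s  %.3f W\n", t, mW / 1000.0);
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    nvmlShutdown();
    return 0;
}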
Figure 2: Parts of the log file with the GPU kernel highlighted (red dots) by the R script between the two non-computing parts of the GPU run (grey dots), showing the reported power consumption. The blue line corresponds to the measured core clock frequency. Specifically, the data displayed are from measurements on the Tesla V100 (top) and Titan V (bottom) for an FFT length of 2^14, single precision, and the core clock frequency set to 1020 MHz (Tesla V100) and 1912 MHz (Titan V).

The choice of clock frequencies for both the memory bus and the computational cores is limited to a set of supported frequencies defined by the hardware itself. The supported core clock frequency can easily be changed via the driver API. The allowed clock frequencies of the device memory bus are limited, or not changeable, depending on the memory type. Since the cuFFT library is completely limited by device memory bandwidth, this suggests that lowering the memory frequency would not lead to substantial increases in the energy efficiency. Thus, we have not changed the memory clock frequency in this work. Moreover, the High Bandwidth Memory (HBM2) present on the newest GPU cards (Titan V, Tesla V100) operates at a fixed memory clock frequency. The ranges and step sizes of the core clock frequencies that we have used are summarized in Table 1.

Table 1: List of the allowed core clock frequencies, from the maximal (f_max) down to the minimal (f_min) frequency, for all cards and their corresponding frequency step size (f_step). The size of the frequency step alternates between the values shown in the column f_step, with the exception of the Jetson Nano.

Card name    f_max [MHz]  f_min [MHz]  f_step [MHz]
Tesla V100   1530         135          7, 8
Tesla P4     1531         455          12, 13
Titan XP     1911         379          12, 13
Titan V      1912         135          7, 8
Jetson Nano  921.6        76.8         76.8

The energy for a specific core clock frequency is defined as

$$E_f = \sum_i P_i \cdot t_i, \qquad (3)$$

where P_i corresponds to the reported power for sample index i and t_i is the time between the current sample and the previous one. The energy efficiency for a specific core clock frequency is then given as

$$E_{ef} = C_p \cdot t / E_f, \qquad (4)$$

where t corresponds to the time of the whole run of the computation, E_f is the energy, and C_p is the computational performance in FLOPS, given by

$$C_p = \left[5 N \log_2(N) \cdot N_b \cdot N_{FFT}\right] / t, \qquad (5)$$

where N_b is the number of FFT runs of length N and N_FFT is the number of FFTs computed per run. The number of Fourier transforms performed (N_FFT) depends on the FFT size as follows:

$$N_{FFT} = M_{GB} / (N \cdot B), \qquad (6)$$

where M_GB is the desired amount of memory used for FFTs in GB and B is the byte size of the input data type. The optimal core clock frequency for a specific FFT length is then found as the one with the minimal consumed energy.

We define the increase in energy efficiency as

$$I_{ef} = E_{ef,o} / E_{ef,d}, \qquad (7)$$

where E_{ef,o} and E_{ef,d} are the energy efficiencies for the optimal frequency and the boost frequency, respectively (given by (4)).

The measurement error, that is, the relative standard deviation, for the V100 GPU and the Jetson Nano is shown in Fig. 3. We have observed that the measurement error is, in general, around 5% for all cards except the Jetson Nano. The GPU cards use instrumentation amplifiers for the current/voltage/power monitors, hence the potential error in the measurement is expected to be around 3–5% [1].
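Equations (3)–(6) amount to a few lines of arithmetic. The sketch below (our illustration, with hypothetical power samples) turns a series of timestamped readings into the energy and GFLOPS/W figures used throughout the paper:

#include <cmath>
#include <cstdio>
#include <vector>

struct Sample { double t_s; double p_W; };  // timestamp [s], reported power [W]

int main() {
    // Hypothetical 10 ms samples covering one benchmark run.
    std::vector<Sample> s = {{0.00, 120.0}, {0.01, 180.0}, {0.02, 182.0}, {0.03, 179.0}};
    const double N = 16384.0, N_b = 1.0, N_fft = 16384.0;
    const double t = s.back().t_s - s.front().t_s;  // whole-run time

    // Equation (3): E_f = sum_i P_i * t_i.
    double E_f = 0.0;
    for (size_t i = 1; i < s.size(); ++i)
        E_f += s[i].p_W * (s[i].t_s - s[i - 1].t_s);

    // Equation (5): C_p = 5 N log2(N) * N_b * N_FFT / t   [FLOPS].
    const double C_p = 5.0 * N * std::log2(N) * N_b * N_fft / t;

    // Equation (4): E_ef = C_p * t / E_f   [FLOP per joule, i.e. FLOPS/W].
    const double E_ef = C_p * t / E_f;
    printf("E_f = %.3f J, efficiency = %.2f GFLOPS/W\n", E_f, E_ef / 1e9);
    return 0;
}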
The results of our power measurements correspond to the expected characteristics of the on-board chips. For Fourier transforms of higher radices (7+), or for Fourier transforms which use the Bluestein algorithm, we observe a measurement error of up to 5%. The measurement error increases with decreasing core clock frequency and with the increasing number of GPU kernels used for the FFT calculation.

The measurement error for the Jetson Nano is usually below 15% for all FFT lengths, and is below 10% for power-of-two FFT lengths. The highest measurement error that we have observed is for Bluestein FFT lengths.
Figure 3: Measurement error (V100 GPU at the top, Jetson Nano at the bottom) for all tested FFT lengths at all tested core clock frequencies.

For these lengths, cuFFT uses multiple kernels (for N = 139 eleven GPU kernels are used); thus the high measurement error is due to the different loads these GPU kernels exert on the GPU, and also the differing power consumption between them. The Bluestein FFT lengths represent a marginal case. Due to the large measurement errors for Bluestein FFT lengths on the Jetson Nano, we have not included these measurements in our calculation of the mean optimal frequency. However, we present these results for the sake of completeness.

For the measurement of the execution time we have used the NVIDIA Visual Profiler. Using this, we have found that the measurement error for the execution time was below 0.3%.

Using propagation of uncertainty, the error of the energy (3) is dominated by the measurement error of the power consumption. Based on that, the error in the increase in energy efficiency (7) is given by

$$\sigma_R(I_{ef}) = \sqrt{2}\,\sigma_R(E_{ef}), \qquad (8)$$

where \sigma_R is the relative error and we have assumed that the relative errors in E_{ef,o} and E_{ef,d} are equal (for a ratio of two independent quantities the squared relative errors add, which gives the factor of \sqrt{2}). This gives an error for the increase in the energy efficiency of 7% for all GPUs except the Jetson Nano, where the error is 21%. These values represent the worst case scenario, since most of the measurement errors are well below these values.

Results
For our investigation, we have used five different NVIDIA GPUs from three recent architecture generations, namely the V100 (Volta), Tesla P4 (Pascal), Jetson Nano (Maxwell), Titan V (Volta) and Titan XP (Pascal). The relevant hardware specifications can be found in Table 2. Both the V100 GPU and the Tesla P4 GPU are aimed at scientific applications; the P4 GPU also offers improved energy efficiency for its generation. The Jetson Nano is a low-powered all-in-one solution for autonomous systems. The Titan V and Titan XP are consumer grade GPUs.

GPUs have two different frequency settings: a base and a boost core clock frequency. If not stated otherwise, we have used the boost core clock frequencies. This is because the GPU's default behaviour is to perform calculations at the boost core clock frequency. This is indeed what is observed when the GPU is set to default mode and we run our cuFFT code. When reporting energy efficiency, we use both frequencies, as there is a non-linear dependency of the power consumption of a GPU on the core clock frequency.

We have measured the complex-to-complex (C2C) one-dimensional transform for three different floating-point precisions: double (FP64), float (FP32) and half (FP16). The Tesla P4, Titan XP and Jetson Nano GPUs have limited support for the double precision format. Furthermore, the Tesla P4 and the Titan XP do not support the half (FP16) floating-point precision. In addition, when using half precision (FP16), the cuFFT library supports only power-of-two FFT lengths.

We have investigated various FFT lengths, but focused on lengths that are powers-of-two, because FFT algorithms are not only best suited to processing such lengths, but also offer superior execution time performance with power-of-two lengths. When calculating non-power-of-two FFT lengths it is often faster [35] to pad the data which needs to be Fourier transformed to the nearest higher power-of-two FFT length and then transform.

First, we present execution times for processing a fixed amount of data, t_fix, which offers an insight into the level of optimization provided by the cuFFT library. The memory required to store the data needed for the Fourier transform grows linearly with the FFT length N. Since the cuFFT library is limited by the device memory bandwidth, the execution time consists of the time required to transfer the data to the computing cores and to store the result back in the device memory, t_i, and the time required for any additional overhead accesses to the device memory, t_o. If the performance limiting factor is different to the device memory bandwidth, we are unable to make such a distinction in this work. In an ideal case, where we would have a large enough cache, the execution time of the Fourier transform would be equal to the time t_i. However, because the cache size is limited, the time t_o will be non-zero and directly indicates the efficiency of the implementation. By fixing the amount of memory being processed, the time t_i will be constant and any increase in the execution time of the Fourier transform will be due to the time t_o.

If we fix the amount of data that is processed, then the number of FFTs performed, N_FFT, depends on the FFT length as given by (6). The execution time of a single FFT within a batch is given as t_t = t_fix / N_FFT. The execution time t_fix for processing a fixed amount of data for various FFT lengths is shown in Fig. 4 for FP32, and in Fig. 5 for FP16 and FP64 precision. The Jetson Nano was run on one quarter of the data volume, and the plotted execution time is therefore scaled as t_fix = 4·t̂_fix.
This is due to the low amount of available memory on the card.

The execution time t_fix increases in proportion to the length of the Fourier transform. However, we see regions of the same execution time, with sudden increases after specific FFT lengths. These abrupt changes represent a transition from one optimized GPU kernel to another, as is shown by the NVIDIA profiler. We must take these changes into account in our analysis, since these GPU kernels might behave differently. When the execution time t_fix does not increase for a given range of problem sizes (for example from FFT length N = 32 to N = 8192), it means that the higher number of floating-point operations which comes with a larger problem size utilizes GPU resources other than the device memory bandwidth. Given that the Titan XP, Tesla P4 and Jetson GPUs do not fully support all tested floating-point precisions, the execution times of Fourier transforms on these GPUs exhibit different behaviours.
Figure 4: The execution time t_fix (for FP32) required to process a fixed amount of data for different FFT lengths. The discontinuities in the execution time indicate a change of the optimised GPU kernel that is used to calculate the FFT. Results for the Jetson Nano are for one quarter of the memory size.

In this work, results are presented per FFT batch, which is the number of FFTs of a given length which fit into the fixed amount of memory that we have chosen to work with. However, most of our results, such as energy efficiency, are independent of the number of FFTs calculated, provided that the GPU is fully utilised. The execution time for different core clock frequencies is denoted by t_f. The execution time at the boost frequency is denoted as t_d and is taken as the execution time for the default settings. Furthermore, we have focused our discussion on the V100 GPU, as it is the most current (and widely used) scientific GPU, and on the Jetson Nano, as it represents NVIDIA's low power edge computing solution. We point out any deviations from these behaviours in the other tested GPUs when they occur.

Table 2: GPU card specifications. The shared memory bandwidth is calculated as BW (bytes/s) = (bank bandwidth (bytes)) × (clock frequency (Hz)) × (32 banks) × (number of SMs); for example, for the Titan XP this gives 4 B × 1.405 GHz × 32 × 30 ≈ 5395 GB/s.
                        Titan XP       Tesla P4      Titan V        Tesla V100     Jetson Nano
CUDA Cores              3840           2560          5120           5120           128
SMs                     30             20            80             80             2
Base/Boost Core Clock   1405/1480 MHz  810/1063 MHz  1220/1455 MHz  1200/1455 MHz  921 MHz
Memory Clock            5005 MHz       3003 MHz      850 MHz        877 MHz        1600 MHz
Device mem. bandwidth   547 GB/s       192 GB/s      652 GB/s       900 GB/s       25.6 GB/s
Memory modules          GDDR5          GDDR5         HBM2           HBM2           LPDDR4
Shared mem. bandwidth   5395 GB/s      2657 GB/s     14550 GB/s     14550 GB/s     230 GB/s
Memory size             12 GB          8 GB          12 GB          16 GB          4 GB
TDP                     250 W          75 W          250 W          300 W          5/10 W
CUDA version            10.0.130       10.0.130      10.0.130       10.0.130       JetPack 4.2 SDK
Figure 5: The execution time t_fix (for FP16 and FP64) required to process a fixed amount of data for different FFT lengths. The discontinuities in execution time indicate a change of the optimised GPU kernel that is used to calculate the FFT. Results for the Jetson Nano are for one quarter of the memory size.

First, we present the behaviour of the execution time with changing core clock frequency. This is shown as a ratio of the execution time t_f over the default execution time t_d in Fig. 6, which shows all tested configurations for FP32 precision. There are three distinct behaviours; the execution time is:

a) decreasing at first;
b) slightly increasing;
c) increasing notably with each frequency decrease.

In the case of the V100 GPU, the first two behaviours, a) and b), are in the majority. For a few specific FFT lengths (notably for N = 8192) we have observed behaviour c). We have observed this behaviour throughout multiple measurements, and always for the same FFT lengths. The other tested GPUs behaved similarly to the V100 GPU.

Figure 6: Ratio of the execution time t_f over the default execution time t_d, measured for the V100 GPU and the Jetson Nano. Every investigated FFT length is shown and represented by a single line.

The Jetson Nano exhibits a different behaviour, where most of the configurations belong to case c), with notable peaks present for Bluestein FFT lengths.

The energy consumed per FFT batch, calculated by equation (3), with fixed length N = 16384 for the different GPUs is shown in Fig. 7. For the measurement, we have used a batch of 16384 FFTs (in the case of FP32 this represents 2 GB of input data) in order to fully saturate the GPU. Notably, the energy per FFT batch on the Titan V GPU does not change above 1335 MHz. This is because the card does not run at the user selected frequency, but is capped by the driver to 1335 MHz.

As the core clock frequency decreases, the power consumption of the GPU changes non-linearly. This is shown in Fig. 8 for the V100 GPU and the Jetson Nano.

The frequency at which the energy per FFT batch reaches a minimum was selected as the optimal frequency. The optimal frequency is different for each tested FFT length for a given GPU and precision. The optimal frequency, expressed as a percentage of the default core clock frequency for all precisions, is shown in Fig. 9.
Figure 7: The energy consumed per FFT batch changes with core clock frequency. The minimum, emphasized by a black star for each tested GPU, represents the most efficient configuration and the value of the optimal frequency.

Figure 8: Averaged power consumption as a function of core clock frequency for all tested FFT lengths. The Jetson Nano is shown independently, as its behaviour is different from the rest of the tested GPUs, which are represented by the V100 GPU.
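Choosing the optimal frequency is then a one-dimensional argmin over the supported clocks. A small sketch (ours, with hypothetical per-frequency energies) of the selection plotted in Fig. 9:

#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // Hypothetical (core clock [MHz], measured energy per batch [J]) pairs.
    std::vector<std::pair<int, double>> runs = {
        {1530, 3.1}, {1200, 2.4}, {945, 2.0}, {700, 2.2}, {400, 3.0}};

    std::pair<int, double> best = runs[0];
    for (const auto &r : runs)
        if (r.second < best.second) best = r;  // minimal consumed energy wins

    printf("optimal core clock: %d MHz (%.2f J per batch)\n",
           best.first, best.second);
    return 0;
}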
Figure 9: Value of the optimal frequency expressed as a percentage of the boost clock frequency. The value of the optimal frequency is consistent across different precisions, with the exception of the Tesla P4 GPU.
To acquire the following results, we have selected the optimal frequency for each FFT length and measured the consumed power to calculate the energy efficiency using equation (4). The energy efficiency, expressed as the number of GFLOPS/W, is shown in Fig. 10.
Figure 10: Floating-point operations per second per Watt (GFLOPS/W) for the optimal frequency. The coloured region shows the improvement over the default frequency.

The change in the execution time for the optimal frequency, with respect to the default execution time, is shown as a percentage in Fig. 11. The change in GFLOPS is shown in Fig. 12. The peaks visible in Fig. 11 correspond to FFT lengths which displayed case c) type behaviour of the execution time (Fig. 6).

The increase in the energy efficiency (7) with respect to the boost core clock frequency is shown for different precisions in Fig. 13, and with respect to the base core clock frequency in Fig. 14.

We see that the optimal frequency for different FFT lengths, as shown in Fig. 9, is roughly the same for a given GPU and precision across all tested FFT lengths.
Figure 11: Increase in the execution time for optimal frequencies, as a percentage of the default execution time t_d.

Figure 12: Floating-point operations per second (GFLOPS) for optimal frequencies. The coloured region shows the change from the default frequency.
Figure 12: Floating-point operations per second(GFLOPS) for optimal frequencies. The colored regionshows the change from the default frequency.10 n c r ea s e i n e n e r gy e ffi c i e n c y Figure 13: The increase in the energy efficiency for opti-mal core clock frequencies with respect to the boost coreclock frequency for all tested FFT lengths. The twopeaks observed in the Jetson Nano data are due to theuse of the Bluestein algorithm. I n c r ea s e i n e n e r gy e ffi c i e n c y Figure 14: The increase in the energy efficiency for theoptimal core clock frequencies with respect to the basecore clock frequency for all tested FFT lengths. TheJetson Nano is not included since there is no base coreclock frequency.more, the optimal frequency is roughly the same acrossall numerical precisions for a given GPU with the excep-tion of Tesla P4 GPU. Based on this we have calculated a mean optimal frequency for a given GPU and precision byaveraging optimal frequencies which achieves a similar in-creases in energy efficiency for all measured FFT lengths.The increase in energy efficiency using the mean optimalfrequency is shown in Fig. 15 for the boost frequency andin Fig. 16 for the base frequency. The values of meanoptimal frequencies are listed in Table 3.When considering existing pipelines, it is also interest-ing to study the relationship between the increase in en-ergy efficiency and the increase in the execution time. Thisrelationship indicates the cost (in units of execution time)of any increase in energy efficiency. This is shown for theV100 GPU in Fig. 17 and for the Jetson Nano in Fig. 18. Table 3: Mean optimal core clock frequencies.
Card name    FP32 [MHz]  FP64 [MHz]  FP16 [MHz]
Tesla V100   945         945         937
Tesla P4     746         1126        NA
Titan V      952         967         1042
Titan XP     1151        1215        NA
Jetson Nano  460.8       460.8       460.8

Figure 15: The increase in the energy efficiency for the mean optimal frequency with respect to the boost core clock frequency for all tested FFT lengths. The two peaks observed in the Jetson Nano data are due to the use of the Bluestein algorithm.

Figure 16: The increase in the energy efficiency for the mean optimal frequency with respect to the base core clock frequency for all tested FFT lengths. The Jetson Nano is not included since there is no base core clock frequency.
To demonstrate the applicability of the mean optimal frequency in existing pipelines, we have employed part of the data processing pipeline used for the detection of pulsars in time-domain radio astronomy data. The pipeline uses several computational steps: the FFT; the power spectrum calculation; the mean and standard deviation calculation; and the harmonic sum. The harmonic sum adds the value of higher harmonics of the pulsar in the power spectrum to the pulsar's expected fundamental frequency, thus increasing the signal-to-noise ratio of the pulsar in the power spectrum. The source code for the pipeline is available at https://github.com/KAdamek/cuFFT_energy_efficiency_example.
Figure 17: Trade-off between the increase in energy efficiency in percent (represented by a number in each cell) and the increase in execution time (represented by a colour) for the V100 GPU.
Figure 18: Trade-off between the increase in energy efficiency in percent (represented by a number in each cell) and the increase in execution time (represented by a colour) for the Jetson Nano.

The harmonic sum can add up to 32 higher-order harmonics, which decreases the FFT's execution time footprint in the pipeline's total execution time.

To change the frequency during the pipeline execution, we have used the NVIDIA Management Library (NVML) [28]. This approach, however, has limitations, because the library is fully supported only on scientific (Tesla) NVIDIA GPUs. The measured power consumption and the core clock frequency for the V100 GPU are shown in Fig. 19, and the increase in energy efficiency for different configurations of the pipeline is listed in Table 4.

The usage of the NVML library is simple. Before the GPU kernel execution, the core clock frequency is set (for a given GPU) using nvmlDeviceSetGpuLockedClocks, providing the maximum and minimum core clock frequency. When the calculation is finished, the GPU core clock frequency is returned to the default by calling nvmlDeviceResetGpuLockedClocks.

The FFT length used for the computation was N = 5 · , which was not used in our measurements or in our calculation of the mean optimal frequency.

For profiling, we have used the NVIDIA Visual Profiler (NVVP). Based on the different behaviours of the execution time t_fix shown in Fig. 4, we have selected three representative power-of-two FFT lengths (N = 8192, N = 16k, N = 2M), which are calculated by different kernels. The profiling results for these kernels are shown in Fig. 20.

Table 4: Increase in energy efficiency for different configurations of our toy data processing pipeline.

Harmonics summed   cuFFT % of total exec. time   Increase in energy efficiency
2                  60.85                         1.291
4                  58.56                         1.290
8                  55.92                         1.267
16                 53.73                         1.260
32                 51.34                         1.240
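A minimal sketch of the NVML pattern described above (ours, not the pipeline's actual code; the 945 MHz value is the V100 FP32 mean optimal clock from Table 3, and error handling is omitted for brevity) locks the clock around the cuFFT call only:

#include <cuda_runtime.h>
#include <cufft.h>
#include <nvml.h>

// Run one batched FFT at the mean optimal core clock, then restore defaults.
void fft_at_locked_clock(cufftHandle plan, cufftComplex *data) {
    nvmlDevice_t dev;
    nvmlInit_v2();
    nvmlDeviceGetHandleByIndex_v2(0, &dev);

    // Lock the core clock for the duration of the FFT (min = max = 945 MHz).
    nvmlDeviceSetGpuLockedClocks(dev, 945, 945);

    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    // Return clock management to the driver's default behaviour.
    nvmlDeviceResetGpuLockedClocks(dev);
    nvmlShutdown();
}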
Figure 19: Measured power consumption (top) and core clock frequency (bottom) for part of a radio astronomy data processing pipeline.

For our study of compute utilization we have used two indicators. The first is the compute utilization as reported by the NVVP; the second metric is the issue slot utilization, which tells us how many instruction slots are used. The next quantity displayed in Fig. 20 is the device memory bandwidth utilization (device MBU). Fig. 20 also shows the normalized execution time, from fastest to slowest, to provide context for the other displayed quantities.

The dependency of the execution time on the core clock frequency is shown in Fig. 6. Fig. 6 displays the three previously discussed behaviours, a), b) and c). However, the Jetson Nano only exhibits the third type of behaviour, c). All other GPUs, represented by the V100 GPU, exhibit a composition of all three behaviours, with cases a) and b) being dominant.

The behaviour in case b) might be due to reduced cache contention, which slightly increases the hit rate of the unified cache, as shown by the NVVP. However, it might also be a systematic error caused by measurement using the NVIDIA driver, which is based on the GPU core clock frequency. In this case, as well as in case a), the GPU's compute resources are not fully utilized and the computations are limited by device memory bandwidth.

The reason for an increase in the execution time at a particular critical frequency is the saturation of the number of issued instructions (see Fig. 20).
Figure 20: Profiling results for the V100 GPU using the NVIDIA Visual Profiler. Longer FFT lengths use more than one GPU kernel to calculate the Fourier transform; these kernels are numbered.

This saturation leads to a reduction in memory requests to the device memory which, in turn, leads to poor latency hiding of the device memory accesses. Therefore most of the threads are waiting for data, but there are not enough threads with data to utilize the floating-point operation units. Thus the floating-point operation utilization remains mostly unchanged.

The sharp increases in the execution time t_fix at low frequencies, which are present in all cases, are due to the change of the P-state to a state corresponding to the idle status of the GPU, with reduced voltage, which severely reduces the available GPU resources.

Lastly, case c) occurs due to the high utilization of one of the caches. Since the cache bandwidth decreases with the core clock frequency, each decrease in frequency lowers a bandwidth which is already fully utilised, leading to a decrease in performance.

The average power consumption, shown in Fig. 8, tells us why, even with longer execution times, we can improve energy efficiency. The rate of the decrease in power consumption is higher than the rate at which the execution time increases. This is especially visible around f = 1000 MHz for the V100 GPU and around f = 450 MHz for the Jetson Nano. These frequencies roughly coincide with the mean optimal frequency for the given GPUs.

The energy efficiency is shown in Fig. 10, the change in the execution time is shown in Fig. 11, and the change in GFLOPS is shown in Fig. 12.

In the language of costs, Fig. 11 is equivalent to the increase in capital costs, as an increase in execution time directly translates into more hardware needed in order to meet the constraints of real-time data processing. On the other hand, the increase in energy efficiency (Fig. 10) is related to operational costs, where better energy efficiency translates into lower operational costs. However, we must bear in mind that operational costs include cooling, facility management, etc., which could be increased by the requirement for more hardware due to longer execution times.

For FP32 precision we see that the Jetson Nano is more energy efficient than the V100 GPU for almost all FFT lengths, especially for the small FFT lengths, where it is 50% more efficient. When we look at the change in the execution time, we see that the Jetson Nano requires approximately 60% more time to finish compared to the execution time at the boost core clock frequency, with one extreme case where the execution time is 140% longer. This means, on average, 60% more hardware to achieve real-time data processing at the best energy efficiency.

This behaviour is not reproduced by the V100 GPU, where the increase in energy efficiency is, for the most part, not at the expense of the execution time. The change in the execution time for the V100 GPU is below 5%. There are more significant increases in execution time for the non-power-of-two FFT lengths, which can be up to 20%. The small changes in the execution time on the V100 GPU offer the possibility to improve existing real-time processing pipelines without substantial changes in hardware.

We see similar behaviour for the V100 GPU at FP64 precision. The slow-down in execution time suffered by the V100 GPU due to the lower core clock frequencies is within 5%.
The execution time for most of the non-power-of-two FFT lengths does not increase above 20%. The Tesla P4 GPU, Titan XP GPU and Jetson Nano do not fully support FP64 precision. This manifests in less significant improvements in GFLOPS/W, much higher execution times and a decrease in GFLOPS. In the case of the Jetson Nano, we would have to double the number of cards in order to process data in real-time.

At FP16 precision we have only three GPUs which support this precision: the V100 GPU, the Titan V GPU and the Jetson Nano. Regarding energy efficiency, the V100 GPU and the Jetson Nano are comparable, but the V100 GPU is overall the more energy efficient GPU. When we look at the change in execution time, we see that the V100 GPU typically has a 10% increase or less, but at some FFT lengths the increase is as high as 40% (N = 64). This behaviour means that we have to be more careful about potential energy savings, since at some FFT lengths the increase in execution time might be too high for real-time data processing. The change in the execution time of the Jetson Nano is again large, and we would need almost twice the number of GPUs to process data in real time at the best possible energy efficiency.
The increase in the energy efficiency for the optimal frequency is shown in Fig. 13 and Fig. 14. The corresponding figures for the mean optimal frequency are Fig. 15 and Fig. 16. The difference in the increase in energy efficiency for the base core clock frequency between the optimal frequency and the mean optimal frequency is 5 percentage points. That is, the average increase in energy efficiency for the optimal frequency, which is tuned for each FFT length, is 29%, whereas the average increase in energy efficiency for the mean optimal frequency is 24%. For the V100 GPU this holds for all FFT lengths and precisions, with a very limited number of exceptions for FP16 precision. For the boost core clock frequency the loss is 10 percentage points. This allows us to use one core clock frequency and achieve similar energy savings without determining the optimal frequency for each FFT length. A similar result is observed for the Jetson Nano, with the exception of the Bluestein FFT lengths, which are responsible for the peaks in the results.

The dependency between the increase in energy efficiency and the change in the execution time, shown in Fig. 17 for the V100 GPU but more notably in Fig. 18 for the Jetson Nano, is non-linear. We see that we can achieve an interesting increase in energy efficiency even for increases in execution time which are below 10%.

Lastly, our practical test with our example data processing pipeline shows that we can dynamically change the core clock frequency in a very precise manner. Our code demonstrates how to target only the duration of the cuFFT library call within the pipeline and thus reduce power consumption. This technique can be applied to existing pipelines, or more generally to any software, with minimal changes to the codebase. The increases in energy efficiency (for the boost core clock frequency) summarized in Table 4 correspond to the expected values based on the FFT execution time footprint within the pipeline. For the first configuration, with 2 harmonics, the FFT execution time corresponds to 60% of the total execution time. The average increase in energy efficiency for the V100 GPU with the boost core clock frequency (based on Fig. 15) is about 50%. Considering the FFT execution time footprint, we should therefore get a 30% increase in energy efficiency (0.6 × 50%), which is indeed what we have measured. This behaviour is consistent with the other configurations of the pipeline.
Conclusions

We have measured the power consumption when calculating the Fourier transformation at different numerical precisions (FP32, FP64, FP16) on NVIDIA GPUs using the NVIDIA cuFFT library, and quantified the possible energy savings when DVFS techniques are used. For each tested GPU, precision, and a wide range of FFT lengths, we have found the optimal core clock frequency that minimises power consumption. We have also measured the change in the execution time of the Fourier transform when DVFS is applied, which is an important consideration for real-time data processing, because the execution time can increase when the core clock frequencies of the GPU are modified.

We have presented the achieved energy efficiency in GFLOPS/W. Along with this, we have presented the increase in energy efficiency when using our optimal core clock frequency compared to the boost and base core clock frequencies for each GPU. We have also presented the increase in the execution time of the Fourier transform when DVFS is applied.

The decrease in power consumption and the change in the execution time depend on the GPU used. In the case of the V100 GPU, the average increase in energy efficiency for FP32, FP64, and FP16 precisions is 60% compared to the boost core clock frequency. When compared to the base core clock frequency, an average increase in energy efficiency of 30% for FP32 and FP64 precision and 20% for FP16 precision is observed. The increase in the execution time is below 5% (with the few exceptions outlined). The Jetson Nano offers higher increases in energy efficiency than the V100 GPU: on average 70% for FP32, 55% for FP64 and 70% for FP16, but at the expense of execution time, which increases by more than 60%. For the P4 GPU and the Titan V GPU we have not achieved a significant increase in energy efficiency.

Our results have shown that the Volta architecture is significantly more energy efficient than the P4 GPU, which represents the most energy efficient GPU from the previous Pascal generation. When compared to the Jetson Nano, the V100 GPU is less energy efficient at FP32 precision. For short and long FFTs at FP32 precision the Jetson Nano is 50% more energy efficient than the V100 GPU. For FP16 precision the V100 GPU has similar energy efficiency to the Jetson Nano. The Jetson Nano does not fully support double precision, thus the V100 GPU is significantly more energy efficient at this precision.

We have shown that the values of the optimal core clock frequencies for all tested FFT lengths for a given GPU and numerical precision are similar, with few exceptions. This allowed us to define a mean optimal core clock frequency which is unique to each tested GPU and precision, but is the same for all FFT lengths. Using the mean optimal core clock frequency, we have achieved a similar energy efficiency when compared to the energy efficiency achieved with the optimal core clock frequency for each tested FFT length. For the V100 GPU the difference is only 5 percentage points. For the other GPUs the loss is similar.

We have also presented the practical implementation of these results in our example data processing pipeline, which is available as open source code.
The decrease in power consumption and the change in the execution time depend on the GPU used. In the case of the V100 GPU, the average increase in energy efficiency for FP32, FP64, and FP16 precisions is 60% compared to the boost core clock frequency. When compared to the base core clock frequency, an average increase in energy efficiency of 30% for FP32 and FP64 precision and 20% for FP16 precision is observed. The increase in the execution time is below 5% (with a few exceptions, as outlined). The Jetson Nano offers higher increases in energy efficiency than the V100 GPU, on average 70% for FP32, 55% for FP64, and 70% for FP16, but at the expense of the execution time, which increases by more than 60%. For the P4 GPU and the Titan V GPU we have not achieved a significant increase in energy efficiency.

Our results have shown that the Volta architecture is significantly more energy efficient than the P4 GPU, which represents the most energy efficient GPU from the previous Pascal generation. When compared to the Jetson Nano, the V100 GPU is less energy efficient at FP32 precision. For short and long FFTs at FP32 precision the Jetson Nano is 50% more energy efficient than the V100 GPU. For FP16 precision the V100 GPU has similar energy efficiency to the Jetson Nano. The Jetson Nano does not fully support double precision, thus the V100 GPU is significantly more energy efficient at this precision.

We have shown that the values of the optimal core clock frequencies for all tested FFT lengths for a given GPU and numerical precision are similar, with few exceptions. This allowed us to define a mean optimal core clock frequency which is unique to each tested GPU and precision, but is the same for all FFT lengths. Using the mean optimal core clock frequency, we have achieved a similar energy efficiency when compared to the energy efficiency achieved with the optimal core clock frequency for each tested FFT length. For the V100 GPU the difference is only 5 percentage points. For the other GPUs the loss is similar.

We have also presented the practical implementation of these results in our example data processing pipeline, which is available as open source code. We have demonstrated how to change the core clock frequency of the GPU to the mean optimal core clock frequency using the NVIDIA Management Library, and we have demonstrated a decrease in power consumption which is in agreement with the results presented in this work (a minimal sketch of this approach is given at the end of this section).

Finally, we have highlighted how, from an environmental perspective, increasing the energy efficiency of the FFT algorithm will be an important consideration for edge computing and IoT.
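For concreteness, the clock-switching approach described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' released pipeline code; it assumes a Volta-or-newer GPU (nvmlDeviceSetGpuLockedClocks is not available on older architectures), typically requires administrator privileges, and omits error handling for brevity.

// Minimal sketch: lock the GPU core clock to a chosen frequency only for
// the duration of a cuFFT call, then restore the default clock behaviour.
// Compile with, e.g.: nvcc fft_clock.cu -lcufft -lnvidia-ml
#include <nvml.h>
#include <cufft.h>
#include <cuda_runtime.h>

void fft_at_locked_clock(cufftComplex *d_data, int nfft, int batch,
                         unsigned int clock_mhz) // e.g. the mean optimal frequency
{
    nvmlDevice_t dev;
    nvmlInit();                                  // initialise NVML
    nvmlDeviceGetHandleByIndex(0, &dev);         // handle to GPU 0

    cufftHandle plan;
    cufftPlan1d(&plan, nfft, CUFFT_C2C, batch);  // batched 1D complex-to-complex plan

    // Lock the SM clock to the chosen frequency (Volta and newer only;
    // usually requires root privileges).
    nvmlDeviceSetGpuLockedClocks(dev, clock_mhz, clock_mhz);

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();                     // wait for the FFT to finish

    // Restore the default (boost) clock behaviour for the rest of the pipeline.
    nvmlDeviceResetGpuLockedClocks(dev);

    cufftDestroy(plan);
    nvmlShutdown();
}

In a real pipeline the NVML initialisation and the plan creation would be performed once, outside the processing loop, so that only the two clock calls bracket each FFT invocation.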
Acknowledgment
This work has received support from STFC grant ST/T000570/1. The authors acknowledge the support of the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16 019/0000765 "Research Center for Informatics". The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work (http://dx.doi.org/10.5281/zenodo.22558). The authors would like to express their gratitude to the Research Centre for Theoretical Physics and Astrophysics, Institute of Physics, Silesian University in Opava for institutional support.