Efficiency Near the Edge: Increasing the Energy Efficiency of FFTs on GPUs for Real-time Edge Computing
Karel Adámek, Jan Novotný, Jeyarajan Thiyagalingam, Wesley Armour
Karel Adámek, Jan Novotný, Jeyarajan Thiyagalingam, and Wesley Armour*
Oxford e-Research Centre, Department of Engineering Sciences, University of Oxford, 7 Keble Road, Oxford, OX1 3QG, United Kingdom
Faculty of Information Technology, Czech Technical University, Thákurova 9, 160 00, Prague, Czech Republic
Research Centre for Theoretical Physics and Astrophysics, Institute of Physics, Silesian University in Opava, Bezručovo náměstí 13, CZ-74601, Opava, Czech Republic
Rutherford Appleton Laboratory, Science and Technology Facilities Council, Harwell Campus, Didcot, OX11 0QX, UK
September 15, 2020
Abstract
The Square Kilometre Array (SKA) is an international initiative for developing the world's largest radio telescope, with a total collecting area of over a million square meters. The scale of the operation, combined with the remote location of the telescope, requires the use of energy-efficient computational algorithms. This, along with the extreme data rates that will be produced by the SKA and the requirement for a real-time observing capability, necessitates in-situ data processing in an edge-style computing solution. More generally, energy efficiency in the modern computing landscape is becoming of paramount concern, whether it be the power budget that can limit some of the world's largest supercomputers, or the limited power available to the smallest Internet-of-Things devices. In this paper, we study the impact of hardware frequency scaling on the energy consumption and execution time of the Fast Fourier Transform (FFT) on NVIDIA GPUs using the cuFFT library. The FFT is used in many areas of science and it is one of the key algorithms used in radio astronomy data processing pipelines. Through the use of frequency scaling, we show that we can lower the power consumption of the NVIDIA V100 GPU when computing the FFT by up to 60% compared to the boost clock frequency, with less than a 10% increase in the execution time. Furthermore, using one common core clock frequency for all tested FFT lengths, we show on average a 50% reduction in power consumption compared to the boost core clock frequency, with the increase in execution time still below 10%. We demonstrate how these results can be used to lower the power consumption of existing data processing pipelines. These savings, when considered over years of operation, can yield significant financial savings, but can also lead to a significant reduction of greenhouse gas emissions.

* E-mail address: [email protected]
Keywords — Energy efficiency, Green computing, High performance computing, Real-time systems, Parallel architectures
Introduction

The Fast Fourier Transform (FFT) is one of the most fundamental and widely used numerical algorithms in scientific computing, with applications in a diverse range of areas such as astronomy, image processing, audio and radar signal processing, numerical solvers (such as partial differential equation solvers), and mechanical systems [8]. The FFT is also an integral part of many data processing pipelines. For instance, the FFT is an important part of data processing pipelines in both image- [38, 29, 39, 14] and time-domain [12, 4, 3, 22] radio astronomy.

The upcoming, next-generation radio telescope, the Square Kilometre Array (SKA), will employ such complex data processing pipelines to deliver science products that will provide new and exciting insights into our Universe. Previous studies, for example [11], estimate that the SKA will require an exascale high performance computing (HPC) system to provide us with such scientific products, where the computational footprint of the FFT, depending on the data processing task, may occupy [20] up to 47% of the overall computational footprint measured in floating-point operations per second (FLOPS). This makes the FFT a critical algorithm for the SKA.

Processing the data captured by the SKA poses many challenges. The SKA will produce extremely large volumes of data at unprecedented rates. Furthermore, the telescope itself must be located in a radio-quiet area due to its extreme sensitivity. This makes the persistent storage of all data not viable, and the transportation of these data to a well equipped (and suitably powered) data centre impractical. Finally, some science cases, such as the study of Fast Radio Bursts (FRBs), necessitate near real-time data processing, meaning that data has to be processed close to the instrument itself. These constraints present significant challenges to software and system engineers; they demand high fractions of the peak performance of the hardware, whilst maintaining the best possible energy efficiency of both software and hardware.

To address the need to minimise the power consumption of the locally installed hardware, close attention must be paid to the energy efficiency of the data processing algorithms, specifically the FFT. Given the emphasis on lower power consumption in HPC in general, the ability to compute the FFT more efficiently is of interest to many computational domains.

The near real-time processing constraint means that the execution time of the data processing algorithms must not be increased significantly. An increase in the execution time might lead either to a failure to process data on time, and hence a loss of scientifically important data, or to increased capital and operational costs, as more hardware would be needed to meet the real-time requirement.

Motivated by this, we have studied the impact of dynamic frequency scaling (DFS) on the energy efficiency and execution time of the FFT on NVIDIA GPUs using the cuFFT library [27]. The GPU is the fastest and most energy-efficient choice of hardware for image-domain radio astronomy, as shown by [40], with FPGAs a close second. There are other FFT libraries for GPUs, notably the clFFT library, which uses the OpenCL framework.
clFFT is not a vendor-supported library and was shown by [34] to be slower than cuFFT on NVIDIA GPUs, thus we have not considered it for this work.

Our exhaustive study, conducted on a range of state-of-the-art GPUs, shows that careful tuning of the core clock frequency can save, in the case of the V100 GPU, up to 60% (compared to the boost core clock frequency) of the energy consumption of the FFT. This saving can have a significant impact on two fronts: financial savings in recurrent costs, and the associated reduced CO2 emissions. We also show that these carefully tuned frequencies can be replaced with a single frequency that is specific to each model of GPU and chosen floating-point precision, whilst still being able to save on average up to 50% of the FFT energy consumption (for the V100 GPU and boost core clock frequency).

The main contributions of this work are:

• We have performed an in-depth investigation of the cuFFT library's power consumption and execution time, and how these change with core clock frequency, for a wide range of problem sizes and numerical precisions (FP16, FP32 and FP64) on five NVIDIA GPUs.

• We identify an optimal core clock frequency with the highest energy efficiency for all problem sizes and numerical precisions, and have shown that a single mean optimal frequency per GPU model gives similar power savings regardless of problem size.

• We demonstrate how these results can be used to lower the power consumption of existing data processing pipelines.

Whilst this work has been motivated by the SKA radio telescope, the conclusions of the work are applicable to any computational task that employs cuFFT running on NVIDIA GPUs.
Power consumption in HPC is being addressed on multiple levels, from the construction of the cluster down to new energy-efficient hardware. The power consumption of specific hardware depends on the execution time, the time taken to finish a calculation, and also on the utilization of the hardware (memory, cache, computing cores). The software itself also plays an important role in power consumption. Energy can be saved through proper software design, making software stable [25], and through the use of appropriate algorithms.

However, concerns regarding energy efficiency in the modern computing landscape are not solely limited to HPC. Edge computing is becoming an increasingly important research area, driven by the explosion of Internet-of-Things devices. The basic premise of edge computing is to capture and process data as close to their sources as is possible by utilising lightweight processors. Because edge computing aims to process data locally, it minimizes wider latency and bandwidth needs and allows for real-time feedback. It is estimated that by 2025 around 150 billion devices will be connected and creating data in real-time [31], with the FFT playing an important role not only in the communication between devices, but also in processing the collected data. Hence optimising the energy efficiency of the FFT on edge devices is of importance from an environmental perspective. This has motivated us to include NVIDIA's Jetson Nano in our selection of hardware, since it represents NVIDIA's low-power edge computing solution.

The idea behind DFS, which is part of the dynamic voltage and frequency scaling (DVFS) method, is to make hardware more energy efficient under different loads by adjusting hardware performance, which is achieved by changing clock frequencies to fit the application running on it. By decreasing the clock frequency of a component we decrease its performance while increasing its utilization, and thus decrease the power consumption of that component. For example, Trefethen et al. [37] have investigated possible energy savings when running software on CPUs with different numbers of threads, compilers and CPU clock frequencies.

Applications can be broadly separated into two classes of performance. The first is where an application or algorithm is compute-bound. This is where the performance bottleneck of the application is the compute resource. This can be the number of floating-point operations which can be performed per second (FLOPS), but also the number of instructions which can be issued per second. The second broad category is memory bandwidth bound applications, where we have enough compute resources but we cannot supply the data through the memory bus to the computing cores quickly enough. In this case the performance is limited by the memory bandwidth. This bandwidth limitation can occur at any level in the computer's memory hierarchy; for example, this might be at the level of access to the GPU main memory (called device memory), or at the level of one of the caches.

We have investigated the cuFFT library using the NVIDIA Visual Profiler (NVVP). This shows that for all investigated problem sizes the GPU kernels used by the cuFFT library are device memory bandwidth bound.
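A back-of-the-envelope estimate (ours, not from the paper) shows why this is expected. Assuming an FP32 complex-to-complex transform of length N in which each 8-byte element is read and written once per pass over the data, the arithmetic intensity is

$$I = \frac{5N\log_2 N}{16N} = \frac{5\log_2 N}{16}\ \text{FLOP/byte},$$

so for N = 16384 we get I ≈ 4.4 FLOP/byte. At the V100's 900 GB/s device memory bandwidth this caps the throughput at roughly 3.9 TFLOPS, far below the card's FP32 peak of about 15.7 TFLOPS; the device memory bus therefore saturates first, and multi-kernel transforms, which make several passes over the data, are bounded even more tightly.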
The one-dimensional discrete Fourier transformation (DFT) of a signal x is given by

$$X_l = \sum_{n=0}^{N-1} x_n \exp\left[-i\,\frac{2\pi n l}{N}\right], \qquad (1)$$

where X_l is the l-th element of the transformed signal, x_n is the n-th element of the input signal, and N is the transformation length, or the FFT length.

The cuFFT library [27] uses the Cooley-Tukey algorithm [17] for FFT sizes that can be decomposed as multiples of powers of primes from 2 to 127, and Bluestein's algorithm [6] otherwise. For longer FFT lengths the cuFFT library uses multiple GPU kernels to compute the entire FFT, which can be seen by studying the cuFFT library using the NVVP. In many cases, the Fourier transform is calculated more quickly if the FFT length is increased by padding to a more optimized length, as was shown by [35].

The two-dimensional Fourier transformation is given by the formula

$$X_{l,k} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x_{n,m} \exp\left[-i\,2\pi\left(\frac{nl}{N} + \frac{mk}{M}\right)\right], \qquad (2)$$

where x_{n,m} and X_{l,k} are now elements of a matrix of size N × M. The sums in this equation can be evaluated independently, which allows us to decompose the two-dimensional Fourier transformation into two sets of one-dimensional Fourier transformations. This is routinely done, and it is indeed what cuFFT does when calculating higher-dimensional (2D, 3D) Fourier transformations, as shown by the NVVP. Thus, by investigating the energy efficiency of the one-dimensional Fourier transformation we are also investigating the energy efficiency of the higher-dimensional Fourier transforms.

The GPU design methodology is different to that of a CPU. A CPU architecture is aimed at low-latency computations, but also has lower throughput. In other words, the CPU can execute a wider range of complicated algorithms quickly, for example complicated branching code, but the number of concurrently running tasks is small. A GPU architecture has high latency but also high throughput; on a GPU one can execute thousands of simple tasks, but each task takes longer to process due to the simpler schedulers that are employed. Both platforms are broadening their focus: CPUs are adding more cores and increasing their vector lengths, while GPU architectures become more complex and GPU schedulers become more sophisticated.
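As a concrete illustration of the batched one-dimensional transforms measured in this work, the following minimal sketch plans and executes a batch of FP32 complex-to-complex FFTs with cuFFT. It is our illustration, not the paper's benchmark code; the length and batch size are arbitrary assumptions.

#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int N = 16384;      // FFT length (illustrative value)
    const int batch = 16384;  // number of transforms executed by one call

    // 2 GB of interleaved FP32 complex samples; the contents are irrelevant
    // for a timing/power run, so the buffer is left uninitialised.
    cufftComplex *data;
    if (cudaMalloc(&data, sizeof(cufftComplex) * (size_t)N * batch) != cudaSuccess)
        return 1;

    // One plan describes the whole batch; cuFFT selects the kernels internally.
    cufftHandle plan;
    if (cufftPlan1d(&plan, N, CUFFT_C2C, batch) != CUFFT_SUCCESS) return 1;

    // Execute all transforms in place and wait for the GPU to finish.
    if (cufftExecC2C(plan, data, data, CUFFT_FORWARD) != CUFFT_SUCCESS) return 1;
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}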
Figure 1: A schematic of the GPU architecture.

The GPU architecture, shown in simplified form in Fig. 1, is divided into the memory block and the compute block. The compute block is further divided into caches and streaming multiprocessors (SMs), which are responsible for executing the computations. The SMs are further divided into specialized units such as floating-point cores or special function units (which are responsible for computing things like transcendental functions). The memory hierarchy on the GPU is distributed between these two blocks. The device memory, which runs at the memory clock frequency, has the lowest bandwidth on the GPU card, and it is the memory that the CPU (host) can read/write via the PCIe bus. The L2 cache is shared between the SMs, the L1 cache is private to each SM, and the shared memory is shared amongst a group of threads called a threadblock. The L2, L1 and shared memory bandwidth is proportional to the core clock frequency; thus by using a lower core clock frequency we also decrease the bandwidth of these caches. The core clock frequency, as well as the memory clock frequency, can only be set to predefined values.

Different GPUs may use different memory modules. Amongst the tested GPUs were GPUs with GDDR memory modules (Titan XP, P4, Jetson Nano), which allow us to change the memory clock frequency, but also GPUs with HBM2 modules (Titan V, V100), which do not allow us to change the memory clock frequency.

When measuring the power consumption and performance of the GPU it is important to keep the GPU utilized. For example, the NVIDIA V100 GPU has 80 streaming multiprocessors (SMs), where each SM is able to run up to 2048 threads. This gives more than 150 thousand threads which can execute concurrently. Thus in our measurements we have used a fixed amount of data, containing a different number of individual Fourier transforms, to keep the GPU utilized for all tested FFT lengths.
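The concurrency figures quoted above can be read from the device properties at runtime. The short helper below (our illustration) reproduces the 80 SMs × 2048 threads ≈ 164k resident-thread bound for a V100:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;

    // Upper bound on concurrently resident threads: SMs x threads per SM.
    const long long resident =
        (long long)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    printf("%s: %d SMs, %d threads/SM -> %lld resident threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, resident);
    return 0;
}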
Data processing can be composed of a single step, but more often it is a series of processing steps which together form a data processing pipeline.

The ability of an application to process data in a real-time processing scenario can be described by the real-time speed-up factor. The real-time speed-up is calculated as S = t_a / t_p, where t_a is the time needed to acquire a given amount of data by the telescope, sensor, etc., and t_p is the time taken to process those data. When S > 1 the pipeline has spare processing capacity; when S < 1 it cannot keep up with the incoming data. A pipeline with S = 1 is processing the data in time but has no performance buffer to call on if needed. In such a case any increase in the execution time leads to S < 1.
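The margin implied by S is easy to make concrete; the helper below (ours, with made-up example values) computes S and the largest slowdown a pipeline can absorb while remaining real-time:

#include <cstdio>

// Real-time speed-up: S = t_a / t_p (acquisition time over processing time).
static double realtime_speedup(double t_acquire_s, double t_process_s) {
    return t_acquire_s / t_process_s;
}

int main() {
    // Example: 10 s of acquired data processed in 8 s.
    const double S = realtime_speedup(10.0, 8.0);
    // The pipeline stays real-time while execution time grows by less than (S - 1).
    printf("S = %.2f -> tolerates up to %.0f%% longer execution\n",
           S, (S - 1.0) * 100.0);
    return 0;
}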
As of November 2019, the first two positions in the Top500 list of supercomputers are held by systems that use GPUs. Within the top ten, five systems contain GPUs. In the Green500 list, GPUs are used in eight out of the top ten supercomputers. This is a clear demonstration that it is important to understand the power consumption, energy efficiency and potential energy savings for GPUs using DVFS.

The different approaches to measuring power consumption, power and performance modelling, and also the results of DVFS for selected applications, were reviewed by Mei et al. [24]. The authors note that the effect of DVFS depends not only on the architecture but also on the characteristics of the GPU application. They found the optimal frequency for 42 GPU applications, and found that 12 of them benefited from an increased core frequency compared to the default, whereas for 30 applications the optimal frequency was lower than the default core frequency; the values of these optimal frequencies were different for most GPU applications. The authors called for a deeper investigation into their differences. A useful review of the DVFS technique is provided by Mittal and Vetter [26]. The review by Bridges et al. [7] looked into the modelling of the power consumption of GPUs.

A number of published studies have investigated the reliability of power measurements using internal sensors. Burtscher et al. [9] published their experience of using built-in sensors when measuring the power consumption of NVIDIA K20 GPUs. They described several issues that they encountered when using these sensors and suggested methods to correct for these. The accuracy of the built-in sensors was investigated by Farad et al. [13], who found that the average mean error using an abstract model of a GPU is about 10% compared to measurements using external power meters. This error value was confirmed by Arafa et al. [5], who measured the energy consumption of almost all PTX instructions for four generations of NVIDIA GPUs. They found that the Maxwell and Turing generations of GPUs have high energy consumption when compared to the Pascal and Volta generations, which are found to be more energy efficient.

There are a number of papers where authors have used DVFS in the context of GPUs [2, 33, 41, 23, 15, 10, 21, 16, 18, 24, 36]. Guerreiro et al. [16] classified GPU applications into four different categories which describe their behaviour when DVFS is applied. These categories are an extension of the compute-bound, memory-bound distinction. Early work on GPU power consumption and DVFS was performed by Jiao et al. [18]. They investigated the behaviour of several GPU applications, which included the FFT algorithm; however, the cuFFT library was not studied because there were better performing FFT implementations at the time. The FFT was also indirectly included in Mei et al. [24] as part of the convolution, and in Tang et al. [36], where the author investigated the effect of DVFS on deep learning applications.

In relation to radio astronomy and the SKA, there are several works. Price et al. [30] made a detailed investigation into power consumption, voltage and frequency scaling of the GPU implementation of the correlator for the SKA. The power consumed by the GPU in the domain of radio astronomy was investigated by Romein [32]. The performance of the cuFFT library was investigated by Jondra et al. [19], along with its power consumption. However, increases in energy efficiency due to frequency scaling were not investigated.
The code that we have used for measurements of the energy efficiency of the FFT algorithm consists of a basic cuFFT benchmark, which can be found at https://github.com/KAdamek/cuFFT_benchmark. Power consumption was measured through the NVIDIA driver (nvidia-smi) for all GPU cards except the Jetson Nano, where we have used the tegrastats utility. For both we have specified the measurement interval to be 10 ms, as our tests have shown that setting the time sampling below 10 ms does not lead to an improvement in the time resolution of our data. The actual time between samples varied; the achieved sampling rate from the driver is on average 14.2 ms for all tested FFT lengths and cards. This sampling rate fulfills the criterion of a sampling interval of at most 15 ms (at least 66.7 Hz) recommended by Burtscher et al. [9] to accurately measure the energy consumption of real-world kernels.

For the localization of the FFT algorithm and establishing the execution time we have used the nvprof utility, where we have included the timestamp. Finally, we log the beginning and end of each GPU kernel execution to a file. This way we produce two files containing all of the needed metrics for all possible combinations of core clock frequencies for a specific FFT length, bit precision and GPU card. The final combination (via timestamp comparison) of these files is done using a simple R script. Here we compute all other metrics, including energy efficiency, optimal clock frequency, mean optimal core clock frequency and computational performance. In the script we also verify that the current core clock frequency is the same as the requested one, and compare the measured execution time from nvprof with the logged timestamps of the nvidia-smi query. Using this method we have found that, for the Titan V (driver version 450.36.06), the core clock frequency is capped to 1335 MHz by the driver during the computation, but during the copy of the results it is set to a higher core clock frequency (1837 MHz). For frequencies lower than 1335 MHz, no capping is observed. An example of the GPU kernel power consumption and active core clock frequency, localized using the log file timestamps, is shown for the V100 GPU in Fig. 2 (top). An example of the frequency capping on the Titan V GPU is shown in Fig. 2 (bottom).
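The sampling setup can be reproduced without nvidia-smi itself: NVML, the library that nvidia-smi queries, exposes the same power readings. The sketch below (ours, not the paper's tooling) polls the board power every 10 ms and timestamps each sample:

#include <chrono>
#include <cstdio>
#include <thread>
#include <nvml.h>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) return 1;

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i) {  // roughly 10 s of samples
        unsigned int mW = 0;          // the driver reports milliwatts
        if (nvmlDeviceGetPowerUsage(dev, &mW) == NVML_SUCCESS) {
            const double t = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - t0).count();
            printf("%.4f s  %.3f W\n", t, mW / 1000.0);
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    nvmlShutdown();
    return 0;
}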
Figure 2: Parts of the log file with the GPU kernel highlighted (red dots) by the R script between the two non-computing parts of the GPU run (grey dots), showing the reported power consumption. The blue line corresponds to the measured core clock frequency. Specifically, the data displayed are from measurements on the Tesla V100 (top) and Titan V (bottom) for an FFT length of 2^14, single precision, and the core clock frequency set to 1020 MHz (Tesla V100) and 1912 MHz (Titan V).

The choice of clock frequencies for both the memory bus and the computational cores is limited to a set of supported frequencies defined by the hardware itself. The supported core clock frequency can easily be changed via the driver API. The allowed clock frequencies of the device memory bus are limited, or not changeable, depending on the memory type. Since the cuFFT library is completely limited by device memory bandwidth, this suggests that lowering the memory frequency would not lead to substantial increases in the energy efficiency. Thus, we have not changed the memory clock frequency in this work. Moreover, the High Bandwidth Memory (HBM2) present on the newest GPU cards (Titan V, Tesla V100) operates at a fixed memory clock frequency. The ranges and step sizes of the core clock frequencies that we have used are summarized in Table 1.

Table 1: List of the allowed core clock frequencies, from the maximal (f_max) down to the minimal (f_min) frequency, for all cards and their corresponding frequency step size (f_step). The size of the frequency step alternates between the values shown in the column f_step, with the exception of the Jetson Nano.

Card name    f_max [MHz]  f_min [MHz]  f_step [MHz]
Tesla V100   1530         135          7, 8
Tesla P4     1531         455          12, 13
Titan XP     1911         379          12, 13
Titan V      1912         135          7, 8
Jetson Nano  921.6        76.8         76.8

The energy for a specific core clock frequency is defined as

$$E_f = \sum_i P_i \cdot t_i, \qquad (3)$$

where P_i corresponds to the reported power for sample index i and t_i is the time between the current sample and the previous one. The energy efficiency for a specific core clock frequency is then given as

$$E_{ef} = C_p \cdot t / E_f, \qquad (4)$$

where t corresponds to the time of the whole run of the computation, E_f is the energy, and C_p is the computational performance in FLOPS, given by

$$C_p = \left[5 N \log_2(N) \cdot N_b \cdot N_{FFT}\right] / t, \qquad (5)$$

where N_b is the number of FFT runs of length N and N_FFT is the number of FFTs computed per run. The number of Fourier transforms performed (N_FFT) depends on the FFT size as follows:

$$N_{FFT} = M_{GB} / (N \cdot B), \qquad (6)$$

where M_GB is the desired amount of memory used for FFTs in GB and B is the byte size of the input data type. The optimal core clock frequency for a specific FFT length is then found as the one with the minimal consumed energy.

We define the increase in energy efficiency as

$$I_{ef} = E_{ef,o} / E_{ef,d}, \qquad (7)$$

where E_{ef,o} and E_{ef,d} are the energy efficiencies for the optimal frequency and the boost frequency, respectively (given by (4)).

The measurement error, that is, the relative standard deviation, for the V100 GPU and the Jetson Nano is shown in Fig. 3. We have observed that the measurement error is, in general, around 5% for all cards except the Jetson Nano. The GPU cards use instrumentation amplifiers for the current/voltage/power monitors, hence the potential error in the measurement is expected to be around 3–5% [1].
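Equations (3)–(6) amount to a few lines of arithmetic. The sketch below (our illustration, with hypothetical power samples) turns a series of timestamped readings into the energy and GFLOPS/W figures used throughout the paper:

#include <cmath>
#include <cstdio>
#include <vector>

struct Sample { double t_s; double p_W; };  // timestamp [s], reported power [W]

int main() {
    // Hypothetical 10 ms samples covering one benchmark run.
    std::vector<Sample> s = {{0.00, 120.0}, {0.01, 180.0}, {0.02, 182.0}, {0.03, 179.0}};
    const double N = 16384.0, N_b = 1.0, N_fft = 16384.0;
    const double t = s.back().t_s - s.front().t_s;  // whole-run time

    // Equation (3): E_f = sum_i P_i * t_i.
    double E_f = 0.0;
    for (size_t i = 1; i < s.size(); ++i)
        E_f += s[i].p_W * (s[i].t_s - s[i - 1].t_s);

    // Equation (5): C_p = 5 N log2(N) * N_b * N_FFT / t   [FLOPS].
    const double C_p = 5.0 * N * std::log2(N) * N_b * N_fft / t;

    // Equation (4): E_ef = C_p * t / E_f   [FLOP per joule, i.e. FLOPS/W].
    const double E_ef = C_p * t / E_f;
    printf("E_f = %.3f J, efficiency = %.2f GFLOPS/W\n", E_f, E_ef / 1e9);
    return 0;
}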
The results of our power measurements correspond to the expected characteristics of the on-board chips. For Fourier transforms of higher radices (7+), or for Fourier transforms which use the Bluestein algorithm, we observe a measurement error of up to 5%. The measurement error increases with decreasing core clock frequency and with the increasing number of GPU kernels used for the FFT calculation.

The measurement error for the Jetson Nano is usually below 15% for all FFT lengths, and is below 10% for power-of-two FFT lengths. The highest measurement error that we have observed is for Bluestein FFT lengths.
Figure 3: Measurement error (V100 GPU at the top, Jetson Nano at the bottom) for all tested FFT lengths at all tested core clock frequencies.

For these lengths, cuFFT uses multiple kernels (for N = 139 eleven GPU kernels are used); thus the high measurement error is due to the different loads these GPU kernels exert on the GPU, and also the differing power consumption between them. The Bluestein FFT lengths represent a marginal case. Due to the large measurement errors for Bluestein FFT lengths on the Jetson Nano, we have not included these measurements in our calculation of the mean optimal frequency. However, we present these results for the sake of completeness.

For the measurement of the execution time we have used the NVIDIA Visual Profiler. Using this, we have found that the measurement error for the execution time was below 0.3%.

Using propagation of uncertainty, the error of the energy (3) is dominated by the measurement error of the power consumption. Based on that, the error in the increase in energy efficiency (7) is given by

$$\sigma_R(I_{ef}) = \sqrt{2}\,\sigma_R(E_{ef}), \qquad (8)$$

where \sigma_R is the relative error and we have assumed that the relative errors in E_{ef,o} and E_{ef,d} are equal (for a ratio of two independent quantities the squared relative errors add, which gives the factor of \sqrt{2}). This gives an error for the increase in the energy efficiency of 7% for all GPUs except the Jetson Nano, where the error is 21%. These values represent the worst case scenario, since most of the measurement errors are well below these values.

Results
For our investigation, we have used five different NVIDIA GPUs from three recent architecture generations, namely the V100 (Volta), Tesla P4 (Pascal), Jetson Nano (Maxwell), Titan V (Volta) and Titan XP (Pascal). The relevant hardware specifications can be found in Table 2. Both the V100 GPU and the Tesla P4 GPU are aimed at scientific applications; the P4 GPU also offers improved energy efficiency for its generation. The Jetson Nano is a low-powered all-in-one solution for autonomous systems. The Titan V and Titan XP are consumer grade GPUs.

GPUs have two different frequency settings: a base and a boost core clock frequency. If not stated otherwise, we have used the boost core clock frequencies. This is because the GPU's default behaviour is to perform calculations at the boost core clock frequency. This is indeed what is observed when the GPU is set to default mode and we run our cuFFT code. When reporting energy efficiency, we use both frequencies, as there is a non-linear dependency of the power consumption of a GPU on the core clock frequency.

We have measured the complex-to-complex (C2C) one-dimensional transform for three different floating-point precisions: double (FP64), float (FP32) and half (FP16). The Tesla P4, Titan XP and Jetson Nano GPUs have limited support for the double precision format. Furthermore, the Tesla P4 and the Titan XP do not support the half (FP16) floating-point precision. In addition, when using half precision (FP16), the cuFFT library supports only power-of-two FFT lengths.

We have investigated various FFT lengths, but focused on lengths that are powers-of-two, because FFT algorithms are not only best suited to processing such lengths, but also offer superior execution time performance with power-of-two lengths. When calculating non-power-of-two FFT lengths it is often faster [35] to pad the data which needs to be Fourier transformed to the nearest higher power-of-two FFT length and then transform.

First, we present execution times for processing a fixed amount of data, t_fix, which offers an insight into the level of optimization provided by the cuFFT library. The memory required to store the data needed for the Fourier transform grows linearly with the FFT length N. Since the cuFFT library is limited by the device memory bandwidth, the execution time consists of the time required to transfer the data to the computing cores and to store the result back in the device memory, t_i, and the time required for any additional overhead accesses to the device memory, t_o. If the performance limiting factor is different to the device memory bandwidth, we are unable to make such a distinction in this work. In an ideal case, where we would have a large enough cache, the execution time of the Fourier transform would be equal to the time t_i. However, because the cache size is limited, the time t_o will be non-zero and directly indicates the efficiency of the implementation. By fixing the amount of memory being processed, the time t_i will be constant and any increase in the execution time of the Fourier transform will be due to the time t_o.

If we fix the amount of data that is processed, then the number of FFTs performed, N_FFT, depends on the FFT length as given by (6). The execution time of a single FFT within a batch is given as t_t = t_fix / N_FFT. The execution time t_fix for processing a fixed amount of data for various FFT lengths is shown in Fig. 4 for FP32, and in Fig. 5 for FP16 and FP64 precision. The Jetson Nano was run on one quarter of the data volume, and the plotted execution time is therefore scaled as t_fix = 4·t̂_fix.
This is due to the low amount of available memory on the card.

The execution time t_fix increases in proportion to the length of the Fourier transform. However, we see regions of the same execution time, with sudden increases after specific FFT lengths. These abrupt changes represent a transition from one optimized GPU kernel to another, as is shown by the NVIDIA profiler. We must take these changes into account in our analysis, since these GPU kernels might behave differently. When the execution time t_fix does not increase for a given range of problem sizes (for example from FFT length N = 32 to N = 8192), it means that the higher number of floating-point operations which comes with a larger problem size utilizes GPU resources other than the device memory bandwidth. Given that the Titan XP, Tesla P4 and Jetson GPUs do not fully support all tested floating-point precisions, the execution times of Fourier transforms on these GPUs exhibit different behaviours.
Figure 4: The execution time t_fix (for FP32) required to process a fixed amount of data for different FFT lengths. The discontinuities in the execution time indicate a change of the optimised GPU kernel that is used to calculate the FFT. Results for the Jetson Nano are for one quarter of the memory size.

In this work, results are presented per FFT batch, which is the number of FFTs of a given length which fit into the fixed amount of memory that we have chosen to work with. However, most of our results, such as energy efficiency, are independent of the number of FFTs calculated, provided that the GPU is fully utilised. The execution time for different core clock frequencies is denoted by t_f. The execution time at the boost frequency is denoted as t_d and is taken as the execution time for the default settings. Furthermore, we have focused our discussion on the V100 GPU, as it is the most current (and widely used) scientific GPU, and on the Jetson Nano, as it represents NVIDIA's low power edge computing solution. We point out any deviations from these behaviours in the other tested GPUs when they occur.

Table 2: GPU card specifications. The shared memory bandwidth is calculated as BW (bytes/s) = (bank bandwidth (bytes)) × (clock frequency (Hz)) × (32 banks) × (number of SMs); for example, for the Titan XP this gives 4 B × 1.405 GHz × 32 × 30 ≈ 5395 GB/s.
                        Titan XP       Tesla P4      Titan V        Tesla V100     Jetson Nano
CUDA Cores              3840           2560          5120           5120           128
SMs                     30             20            80             80             2
Base/Boost Core Clock   1405/1480 MHz  810/1063 MHz  1220/1455 MHz  1200/1455 MHz  921 MHz
Memory Clock            5005 MHz       3003 MHz      850 MHz        877 MHz        1600 MHz
Device mem. bandwidth   547 GB/s       192 GB/s      652 GB/s       900 GB/s       25.6 GB/s
Memory modules          GDDR5          GDDR5         HBM2           HBM2           LPDDR4
Shared mem. bandwidth   5395 GB/s      2657 GB/s     14550 GB/s     14550 GB/s     230 GB/s
Memory size             12 GB          8 GB          12 GB          16 GB          4 GB
TDP                     250 W          75 W          250 W          300 W          5/10 W
CUDA version            10.0.130       10.0.130      10.0.130       10.0.130       JetPack 4.2 SDK
Figure 5: The execution time t_fix (for FP16 and FP64) required to process a fixed amount of data for different FFT lengths. The discontinuities in execution time indicate a change of the optimised GPU kernel that is used to calculate the FFT. Results for the Jetson Nano are for one quarter of the memory size.

First, we present the behaviour of the execution time with changing core clock frequency. This is shown as a ratio of the execution time t_f over the default execution time t_d in Fig. 6, which shows all tested configurations for FP32 precision. There are three distinct behaviours; the execution time is:

a) decreasing at first;
b) slightly increasing;
c) increasing notably with each frequency decrease.

In the case of the V100 GPU, the first two behaviours, a) and b), are in the majority. For a few specific FFT lengths (notably for N = 8192) we have observed behaviour c). We have observed this behaviour throughout multiple measurements, and always for the same FFT lengths. The other tested GPUs behaved similarly to the V100 GPU.

Figure 6: Ratio of the execution time t_f over the default execution time t_d, measured for the V100 GPU and the Jetson Nano. Every investigated FFT length is shown and represented by a single line.

The Jetson Nano exhibits a different behaviour, where most of the configurations belong to case c), with notable peaks present for Bluestein FFT lengths.

The energy consumed per FFT batch, calculated by equation (3), with fixed length N = 16384 for the different GPUs is shown in Fig. 7. For the measurement, we have used a batch of 16384 FFTs (in the case of FP32 this represents 2 GB of input data) in order to fully saturate the GPU. Notably, the energy per FFT batch on the Titan V GPU does not change above 1335 MHz. This is because the card does not run at the user selected frequency, but is capped by the driver to 1335 MHz.

As the core clock frequency decreases, the power consumption of the GPU changes non-linearly. This is shown in Fig. 8 for the V100 GPU and the Jetson Nano.

The frequency at which the energy per FFT batch reaches a minimum was selected as the optimal frequency. The optimal frequency is different for each tested FFT length for a given GPU and precision. The optimal frequency, expressed as a percentage of the default core clock frequency for all precisions, is shown in Fig. 9.
Figure 7: The energy consumed per FFT batch changes with core clock frequency. The minimum, emphasized by a black star for each tested GPU, represents the most efficient configuration and the value of the optimal frequency.

Figure 8: Averaged power consumption as a function of core clock frequency for all tested FFT lengths. The Jetson Nano is shown independently, as its behaviour is different from the rest of the tested GPUs, which are represented by the V100 GPU.
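Choosing the optimal frequency is then a one-dimensional argmin over the supported clocks. A small sketch (ours, with hypothetical per-frequency energies) of the selection plotted in Fig. 9:

#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // Hypothetical (core clock [MHz], measured energy per batch [J]) pairs.
    std::vector<std::pair<int, double>> runs = {
        {1530, 3.1}, {1200, 2.4}, {945, 2.0}, {700, 2.2}, {400, 3.0}};

    std::pair<int, double> best = runs[0];
    for (const auto &r : runs)
        if (r.second < best.second) best = r;  // minimal consumed energy wins

    printf("optimal core clock: %d MHz (%.2f J per batch)\n",
           best.first, best.second);
    return 0;
}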
Figure 9: Value of the optimal frequency expressed as a percentage of the boost clock frequency. The value of the optimal frequency is consistent across different precisions, with the exception of the Tesla P4 GPU.
To acquire the following results, we have selected the optimal frequency for each FFT length and measured the consumed power to calculate the energy efficiency using equation (4). The energy efficiency, expressed as the number of GFLOPS/W, is shown in Fig. 10.
Figure 10: Floating-point operations per second per Watt (GFLOPS/W) for the optimal frequency. The coloured region shows the improvement over the default frequency.

The change in the execution time for the optimal frequency, with respect to the default execution time, is shown as a percentage in Fig. 11. The change in GFLOPS is shown in Fig. 12. The peaks visible in Fig. 11 correspond to FFT lengths which displayed case c) type behaviour of the execution time (Fig. 6).

The increase in the energy efficiency (7) with respect to the boost core clock frequency is shown for different precisions in Fig. 13, and with respect to the base core clock frequency in Fig. 14.

We see that the optimal frequency for different FFT lengths, as shown in Fig. 9, is roughly the same for a given GPU and precision across all tested FFT lengths.
Figure 11: Increase in the execution time for optimal frequencies, as a percentage of the default execution time t_d.

Figure 12: Floating-point operations per second (GFLOPS) for optimal frequencies. The coloured region shows the change from the default frequency.
Figure 12: Floating-point operations per second(GFLOPS) for optimal frequencies. The colored regionshows the change from the default frequency.10 n c r ea s e i n e n e r gy e ffi c i e n c y Figure 13: The increase in the energy efficiency for opti-mal core clock frequencies with respect to the boost coreclock frequency for all tested FFT lengths. The twopeaks observed in the Jetson Nano data are due to theuse of the Bluestein algorithm. I n c r ea s e i n e n e r gy e ffi c i e n c y Figure 14: The increase in the energy efficiency for theoptimal core clock frequencies with respect to the basecore clock frequency for all tested FFT lengths. TheJetson Nano is not included since there is no base coreclock frequency.more, the optimal frequency is roughly the same acrossall numerical precisions for a given GPU with the excep-tion of Tesla P4 GPU. Based on this we have calculated a mean optimal frequency for a given GPU and precision byaveraging optimal frequencies which achieves a similar in-creases in energy efficiency for all measured FFT lengths.The increase in energy efficiency using the mean optimalfrequency is shown in Fig. 15 for the boost frequency andin Fig. 16 for the base frequency. The values of meanoptimal frequencies are listed in Table 3.When considering existing pipelines, it is also interest-ing to study the relationship between the increase in en-ergy efficiency and the increase in the execution time. Thisrelationship indicates the cost (in units of execution time)of any increase in energy efficiency. This is shown for theV100 GPU in Fig. 17 and for the Jetson Nano in Fig. 18. Table 3: Mean optimal core clock frequencies.
Card name    FP32 [MHz]  FP64 [MHz]  FP16 [MHz]
Tesla V100   945         945         937
Tesla P4     746         1126        NA
Titan V      952         967         1042
Titan XP     1151        1215        NA
Jetson Nano  460.8       460.8       460.8

Figure 15: The increase in the energy efficiency for the mean optimal frequency with respect to the boost core clock frequency for all tested FFT lengths. The two peaks observed in the Jetson Nano data are due to the use of the Bluestein algorithm.

Figure 16: The increase in the energy efficiency for the mean optimal frequency with respect to the base core clock frequency for all tested FFT lengths. The Jetson Nano is not included since there is no base core clock frequency.
To demonstrate the applicability of the mean optimal frequency in existing pipelines, we have employed part of the data processing pipeline used for the detection of pulsars in time-domain radio astronomy data. The pipeline uses several computational steps: the FFT; the power spectrum calculation; the mean and standard deviation calculation; and the harmonic sum. The harmonic sum adds the value of higher harmonics of the pulsar in the power spectrum to the pulsar's expected fundamental frequency, thus increasing the signal-to-noise ratio of the pulsar in the power spectrum. The source code for the pipeline is available at https://github.com/KAdamek/cuFFT_energy_efficiency_example.
Figure 17: Trade-off between the increase in energy efficiency in percent (represented by a number in each cell) and the increase in execution time (represented by a colour) for the V100 GPU.
Figure 18: Trade-off between the increase in energy efficiency in percent (represented by a number in each cell) and the increase in execution time (represented by a colour) for the Jetson Nano.

The harmonic sum can add up to 32 higher-order harmonics, which decreases the FFT's execution time footprint in the pipeline's total execution time.

To change the frequency during the pipeline execution, we have used the NVIDIA Management Library (NVML) [28]. This approach, however, has limitations, because the library is fully supported only on scientific (Tesla) NVIDIA GPUs. The measured power consumption and the core clock frequency for the V100 GPU are shown in Fig. 19, and the increase in energy efficiency for different configurations of the pipeline is listed in Table 4.

The usage of the NVML library is simple. Before the GPU kernel execution, the core clock frequency is set (for a given GPU) using nvmlDeviceSetGpuLockedClocks, providing the maximum and minimum core clock frequency. When the calculation is finished, the GPU core clock frequency is returned to the default by calling nvmlDeviceResetGpuLockedClocks.

The FFT length used for the computation was N = 5 · , which was not used in our measurements or in our calculation of the mean optimal frequency.

For profiling, we have used the NVIDIA Visual Profiler (NVVP). Based on the different behaviours of the execution time t_fix shown in Fig. 4, we have selected three representative power-of-two FFT lengths (N = 8192, N = 16k, N = 2M), which are calculated by different kernels. The profiling results for these kernels are shown in Fig. 20.

Table 4: Increase in energy efficiency for different configurations of our toy data processing pipeline.

Harmonics summed   cuFFT % of total exec. time   Increase in energy efficiency
2                  60.85                         1.291
4                  58.56                         1.290
8                  55.92                         1.267
16                 53.73                         1.260
32                 51.34                         1.240
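A minimal sketch of the NVML pattern described above (ours, not the pipeline's actual code; the 945 MHz value is the V100 FP32 mean optimal clock from Table 3, and error handling is omitted for brevity) locks the clock around the cuFFT call only:

#include <cuda_runtime.h>
#include <cufft.h>
#include <nvml.h>

// Run one batched FFT at the mean optimal core clock, then restore defaults.
void fft_at_locked_clock(cufftHandle plan, cufftComplex *data) {
    nvmlDevice_t dev;
    nvmlInit_v2();
    nvmlDeviceGetHandleByIndex_v2(0, &dev);

    // Lock the core clock for the duration of the FFT (min = max = 945 MHz).
    nvmlDeviceSetGpuLockedClocks(dev, 945, 945);

    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    // Return clock management to the driver's default behaviour.
    nvmlDeviceResetGpuLockedClocks(dev);
    nvmlShutdown();
}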
Figure 19: Measured power consumption (top) and core clock frequency (bottom) for part of a radio astronomy data processing pipeline.

For our study of compute utilization we have used two indicators. The first is the compute utilization as reported by the NVVP; the second metric is the issue slot utilization, which tells us how many instruction slots are used. The next quantity displayed in Fig. 20 is the device memory bandwidth utilization (device MBU). Fig. 20 also shows the normalized execution time, from fastest to slowest, to provide context for the other displayed quantities.

The dependency of the execution time on the core clock frequency is shown in Fig. 6. Fig. 6 displays the three previously discussed behaviours, a), b) and c). However, the Jetson Nano only exhibits the third type of behaviour, c). All other GPUs, represented by the V100 GPU, exhibit a composition of all three behaviours, with cases a) and b) being dominant.

The behaviour in case b) might be due to reduced cache contention, which slightly increases the hit rate of the unified cache, as shown by the NVVP. However, it might also be a systematic error caused by measurement using the NVIDIA driver, which is based on the GPU core clock frequency. In this case, as well as in case a), the GPU's compute resources are not fully utilized and the computations are limited by device memory bandwidth.

The reason for an increase in the execution time at a particular critical frequency is the saturation of the number of issued instructions (see Fig. 20).
Figure 20: Profiling results for the V100 GPU using the NVIDIA Visual Profiler. Longer FFT lengths use more than one GPU kernel to calculate the Fourier transform; these kernels are numbered.

This saturation leads to a reduction in memory requests to the device memory which, in turn, leads to poor latency hiding of the device memory accesses. Therefore most of the threads are waiting for data, but there are not enough threads with data to utilize the floating-point operation units. Thus the floating-point operation utilization remains mostly unchanged.

The sharp increases in the execution time t_fix at low frequencies, which are present in all cases, are due to the change of the P-state to a state corresponding to the idle status of the GPU, with reduced voltage, which severely reduces the available GPU resources.

Lastly, case c) occurs due to the high utilization of one of the caches. Since the cache bandwidth decreases with the core clock frequency, each decrease in frequency lowers a bandwidth which is already fully utilised, leading to a decrease in performance.

The average power consumption, shown in Fig. 8, tells us why, even with longer execution times, we can improve energy efficiency. The rate of the decrease in power consumption is higher than the rate at which the execution time increases. This is especially visible around f = 1000 MHz for the V100 GPU and around f = 450 MHz for the Jetson Nano. These frequencies roughly coincide with the mean optimal frequency for the given GPUs.

The energy efficiency is shown in Fig. 10, the change in the execution time is shown in Fig. 11, and the change in GFLOPS is shown in Fig. 12.

In the language of costs, Fig. 11 is equivalent to the increase in capital costs, as an increase in execution time directly translates into more hardware needed in order to meet the constraints of real-time data processing. On the other hand, the increase in energy efficiency (Fig. 10) is related to operational costs, where better energy efficiency translates into lower operational costs. However, we must bear in mind that operational costs include cooling, facility management, etc., which could be increased by the requirement for more hardware due to longer execution times.

For FP32 precision we see that the Jetson Nano is more energy efficient than the V100 GPU for almost all FFT lengths, especially for the small FFT lengths, where it is 50% more efficient. When we look at the change in the execution time, we see that the Jetson Nano requires approximately 60% more time to finish compared to the execution time at the boost core clock frequency, with one extreme case where the execution time is 140% longer. This means, on average, 60% more hardware to achieve real-time data processing at the best energy efficiency.

This behaviour is not reproduced by the V100 GPU, where the increase in energy efficiency is, for the most part, not at the expense of the execution time. The change in the execution time for the V100 GPU is below 5%. There are more significant increases in execution time for the non-power-of-two FFT lengths, which can be up to 20%. The small changes in the execution time on the V100 GPU offer the possibility to improve existing real-time processing pipelines without substantial changes in hardware.

We see similar behaviour for the V100 GPU at FP64 precision. The slow-down in execution time suffered by the V100 GPU due to the lower core clock frequencies is within 5%.
The execution time for most of the non-power-of-two FFT lengths does not increase above 20%. The Tesla P4 GPU, Titan XP GPU and Jetson Nano do not fully support FP64 precision. This manifests in less significant improvements in GFLOPS/W, much higher execution times and a decrease in GFLOPS. In the case of the Jetson Nano, we would have to double the number of cards in order to process data in real-time.

At FP16 precision we have only three GPUs which support this precision: the V100 GPU, the Titan V GPU and the Jetson Nano. Regarding energy efficiency, the V100 GPU and the Jetson Nano are comparable, but the V100 GPU is overall the more energy efficient GPU. When we look at the change in execution time, we see that the V100 GPU typically has a 10% increase or less, but at some FFT lengths the increase is as high as 40% (N = 64). This behaviour means that we have to be more careful about potential energy savings, since at some FFT lengths the increase in execution time might be too high for real-time data processing. The change in the execution time of the Jetson Nano is again large, and we would need almost twice the number of GPUs to process data in real time at the best possible energy efficiency.
The increase in the energy efficiency for the optimal frequency is shown in Fig. 13 and Fig. 14. The corresponding figures for the mean optimal frequency are Fig. 15 and Fig. 16. The difference in the increase in energy efficiency for the base core clock frequency between the optimal frequency and the mean optimal frequency is 5 percentage points. That is, the average increase in energy efficiency for the optimal frequency, which is tuned for each FFT length, is 29%, whereas the average increase in energy efficiency for the mean optimal frequency is 24%. For the V100 GPU this holds for all FFT lengths and precisions, with a very limited number of exceptions for FP16 precision. For the boost core clock frequency the loss is 10 percentage points. This allows us to use one core clock frequency and achieve similar energy savings without determining the optimal frequency for each FFT length. A similar result is observed for the Jetson Nano, with the exception of the Bluestein FFT lengths, which are responsible for the peaks in the results.

The dependency between the increase in energy efficiency and the change in the execution time, shown in Fig. 17 for the V100 GPU but more notably in Fig. 18 for the Jetson Nano, is non-linear. We see that we can achieve an interesting increase in energy efficiency even for increases in execution time which are below 10%.

Lastly, our practical test with our example data processing pipeline shows that we can dynamically change the core clock frequency in a very precise manner. Our code demonstrates how to target only the duration of the cuFFT library call within the pipeline and thus reduce power consumption. This technique can be applied to existing pipelines, or more generally to any software, with minimal changes to the codebase. The increases in energy efficiency (for the boost core clock frequency) summarized in Table 4 correspond to the expected values based on the FFT execution time footprint within the pipeline. For the first configuration, with 2 harmonics, the FFT execution time corresponds to 60% of the total execution time. The average increase in energy efficiency for the V100 GPU with the boost core clock frequency (based on Fig. 15) is about 50%. Considering the FFT execution time footprint, we should therefore get a 30% increase in energy efficiency (0.6 × 50%), which is indeed what we have measured. This behaviour is consistent with the other configurations of the pipeline.
Conclusions

We have measured the power consumption when calculating the Fourier transformation at different numerical precisions (FP32, FP64, FP16) on NVIDIA GPUs using the NVIDIA cuFFT library, and quantified the possible energy savings when DVFS techniques are used. For each tested GPU, precision, and a wide range of FFT lengths, we have found the optimal core clock frequency that minimises power consumption. We have also measured the change in the execution time of the Fourier transform when DVFS is applied, which is an important consideration for real-time data processing, because the execution time can increase when the core clock frequencies of the GPU are modified.

We have presented the achieved energy efficiency in GFLOPS/W. Along with this, we have presented the increase in energy efficiency when using our optimal core clock frequency compared to the boost and base core clock frequencies for each GPU. We have also presented the increase in the execution time of the Fourier transform when DVFS is applied.

The decrease in power consumption and the change in the execution time depend on the GPU used. In the case of the V100 GPU, the average increase in energy efficiency for FP32, FP64, and FP16 precisions is 60% compared to the boost core clock frequency. When compared to the base core clock frequency, an average increase in energy efficiency of 30% for FP32 and FP64 precision and 20% for FP16 precision is observed. The increase in the execution time is below 5% (with the few exceptions outlined). The Jetson Nano offers higher increases in energy efficiency than the V100 GPU: on average 70% for FP32, 55% for FP64 and 70% for FP16, but at the expense of execution time, which increases by more than 60%. For the P4 GPU and the Titan V GPU we have not achieved a significant increase in energy efficiency.

Our results have shown that the Volta architecture is significantly more energy efficient than the P4 GPU, which represents the most energy efficient GPU from the previous Pascal generation. When compared to the Jetson Nano, the V100 GPU is less energy efficient at FP32 precision. For short and long FFTs at FP32 precision the Jetson Nano is 50% more energy efficient than the V100 GPU. For FP16 precision the V100 GPU has similar energy efficiency to the Jetson Nano. The Jetson Nano does not fully support double precision, thus the V100 GPU is significantly more energy efficient at this precision.

We have shown that the values of the optimal core clock frequencies for all tested FFT lengths for a given GPU and numerical precision are similar, with few exceptions. This allowed us to define a mean optimal core clock frequency which is unique to each tested GPU and precision, but is the same for all FFT lengths. Using the mean optimal core clock frequency, we have achieved a similar energy efficiency when compared to the energy efficiency achieved with the optimal core clock frequency for each tested FFT length. For the V100 GPU the difference is only 5 percentage points. For the other GPUs the loss is similar.

We have also presented the practical implementation of these results in our example data processing pipeline, which is available as open source code.
The decrease in power consumption and the change in the execution time depend on the GPU used. In the case of the V100 GPU, the average increase in energy efficiency for FP32, FP64, and FP16 precisions is 60% compared to the boost core clock frequency. When compared to the base core clock frequency, an average increase in energy efficiency of 30% for FP32 and FP64 precision and 20% for FP16 precision is observed. The increase in the execution time is below 5% (with a few exceptions, as outlined). The Jetson Nano offers higher increases in energy efficiency than the V100 GPU, on average 70% for FP32, 55% for FP64, and 70% for FP16, but at the expense of the execution time, which increases by more than 60%. For the P4 GPU and the Titan V GPU we have not achieved a significant increase in energy efficiency.

Our results have shown that the Volta architecture is significantly more energy efficient than the P4 GPU, which represents the most energy efficient GPU from the previous Pascal generation. When compared to the Jetson Nano, the V100 GPU is less energy efficient at FP32 precision. For short and long FFTs at FP32 precision the Jetson Nano is 50% more energy efficient than the V100 GPU. For FP16 precision the V100 GPU has similar energy efficiency to the Jetson Nano. The Jetson Nano does not fully support double precision, thus the V100 GPU is significantly more energy efficient at this precision.

We have shown that the values of the optimal core clock frequencies for all tested FFT lengths for a given GPU and numerical precision are similar, with few exceptions. This allowed us to define a mean optimal core clock frequency which is unique to each tested GPU and precision, but is the same for all FFT lengths. Using the mean optimal core clock frequency, we have achieved a similar energy efficiency when compared to the energy efficiency achieved with the optimal core clock frequency for each tested FFT length. For the V100 GPU the difference is only 5 percentage points. For the other GPUs the loss is similar.

We have also presented the practical implementation of these results in our example data processing pipeline, which is available as open source code. We have demonstrated how to change the core clock frequency of the GPU to the mean optimal core clock frequency using the NVIDIA Management Library, and we have demonstrated a decrease in power consumption which is in agreement with the results presented in this work (a minimal sketch of this approach is given at the end of this section).

Finally, we have highlighted how, from an environmental perspective, increasing the energy efficiency of the FFT algorithm will be an important consideration for edge computing and IoT.
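For concreteness, the clock-switching approach described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' released pipeline code; it assumes a Volta-or-newer GPU (nvmlDeviceSetGpuLockedClocks is not available on older architectures), typically requires administrator privileges, and omits error handling for brevity.

// Minimal sketch: lock the GPU core clock to a chosen frequency only for
// the duration of a cuFFT call, then restore the default clock behaviour.
// Compile with, e.g.: nvcc fft_clock.cu -lcufft -lnvidia-ml
#include <nvml.h>
#include <cufft.h>
#include <cuda_runtime.h>

void fft_at_locked_clock(cufftComplex *d_data, int nfft, int batch,
                         unsigned int clock_mhz) // e.g. the mean optimal frequency
{
    nvmlDevice_t dev;
    nvmlInit();                                  // initialise NVML
    nvmlDeviceGetHandleByIndex(0, &dev);         // handle to GPU 0

    cufftHandle plan;
    cufftPlan1d(&plan, nfft, CUFFT_C2C, batch);  // batched 1D complex-to-complex plan

    // Lock the SM clock to the chosen frequency (Volta and newer only;
    // usually requires root privileges).
    nvmlDeviceSetGpuLockedClocks(dev, clock_mhz, clock_mhz);

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();                     // wait for the FFT to finish

    // Restore the default (boost) clock behaviour for the rest of the pipeline.
    nvmlDeviceResetGpuLockedClocks(dev);

    cufftDestroy(plan);
    nvmlShutdown();
}

In a real pipeline the NVML initialisation and the plan creation would be performed once, outside the processing loop, so that only the two clock calls bracket each FFT invocation.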
Acknowledgment
This work has received support from STFC grant ST/T000570/1. The authors acknowledge the support of the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16 019/0000765 "Research Center for Informatics". The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work (http://dx.doi.org/10.5281/zenodo.22558). The authors would like to express their gratitude to the Research Centre for Theoretical Physics and Astrophysics, Institute of Physics, Silesian University in Opava for institutional support.