Towards Green Computing: A Survey of Performance and Energy Efficiency of Different Platforms using OpenCL
Philip Heinisch, Katharina Ostaszewski, and Hendrik Ranocha
Institut für angewandte numerische Wissenschaft e.V. (IANW), Braunschweig, Germany
Institut für Geophysik und extraterrestrische Physik (IGeP), Technische Universität Braunschweig, Germany
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
March 8, 2020
Abstract

When considering different hardware platforms, not just the time-to-solution can be of importance but also the energy necessary to reach it. This is not only the case with battery powered and mobile devices but also with high-performance parallel cluster systems due to financial and practical limits on power consumption and cooling. Recent developments in hard- and software have given programmers the ability to run the same code on a range of different devices, giving rise to the concept of heterogeneous computing. Many of these devices are optimized for certain types of applications. To showcase the differences and give a basic outlook on the applicability of different architectures for specific problems, the cross-platform OpenCL framework was used to compare both time- and energy-to-solution. A large set of devices ranging from ARM processors to server CPUs and consumer and enterprise level GPUs has been used with different benchmarking testcases taken from applied research applications. While the results show the overall advantages of GPUs in terms of both runtime and energy efficiency compared to CPUs, ARM devices show potential for certain applications in massively parallel systems. This study also highlights how OpenCL enables the use of the same codebase on many different systems and hardware platforms without specific code adaptations.
Introduction

The maximum useful clock frequency of modern processors has become increasingly limited by signal propagation delays and power dissipation, the so-called power wall. To support the growing need for computational power, multi-processor architectures have become the new de-facto standard in consumer, industrial, and scientific applications. Together with the mainstream adoption of multi-core CPUs, initially task-specific Graphics Processing Units (GPUs) developed into highly parallel devices for General-Purpose computing on Graphics Processing Units (GPGPU). Simultaneously, ARM-based RISC processors transformed from application specific embedded processors to low-cost alternatives to the CISC-based x86 processors, driven by the fast increase in computational requirements for mobile and embedded devices. In recent years, accelerators using Field Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs) have also become available, e.g. to solve specific proof-of-work based systems as used in many cryptocurrencies.

In the past, most applications were developed for a specific CPU architecture with little need for portability. While the x86 architecture was the only option for computationally intensive problems, these processors were unsuitable for mobile applications due to their high power consumption and need for external support circuitry. Low-power ARM-based processors could easily be integrated into battery powered mobile devices but lacked the computational resources (i.e. dedicated floating point units) and support for large amounts of RAM or high-speed data buses needed for complex data processing or scientific computations. The use of FPGAs, ASICs, and GPUs was limited by the high cost of soft- and hardware development.

This has changed drastically in recent years, with increasingly power efficient x86 hardware, powerful multi-core ARM-based System-on-a-Chip (SoC) designs, and the widespread adoption of low-cost high-powered GPUs both as external hardware and integrated with ARM and x86 CPUs. Even FPGAs became available as accelerator cards or integrated within CPUs (e.g. Intel Xeon Gold 6138P or Xilinx Zynq). Combined with the increasing support for portability between different architectures by software development kits, libraries, and compilers, this led to the concept of heterogeneous computing. Frameworks like Khronos' Open Computing Language (OpenCL) or Microsoft's C++ AMP, as well as higher-level programming models like Intel's oneAPI or SYCL (a higher-level programming model for OpenCL), provide the necessary software portability to target these heterogeneous platforms with a single code-base, but still with limitations regarding performance portability [1, 19]. In many studies, the question of performance portability is primarily discussed in relation to runtime, but the problem also arises in regard to power consumption. Limiting factors for portability, for example with OpenCL, are plentiful and can range from workload sharing between compute units or cores to the optimization and usage of specific hardware acceleration features. Especially for embedded or mobile devices, efficient usage of specific hardware features can have a large impact on the required power while having comparatively little impact on overall runtime.

With all these hardware options readily available, the question arises which combination of devices is best suited to solve a given set of problems. The answer depends not only on the time-to-solution but also on the energy-to-solution and the price of the hardware.
Especially the energy necessary to reach a solution becomes increasingly important as battery powered systems have to handle computationally intensive tasks like image processing for autonomous electric vehicles or machine learning. Even for large-scale data centers or high performance computing (HPC) facilities, the energy cost over the lifetime of the systems can be higher than the initial cost of acquisition. This has led to programs like the Green500 list [22] to rank supercomputers based on energy efficiency. In this context the term performance per watt is widely used. This metric is of course linked to accompanying benchmark workloads. To provide some comparability, in many cases the linear algebra LINPACK suite is used [8], which primarily relies on standard microbenchmarks. This study aims at using complete self contained real world examples from computational physics and mathematics as well as signal and image processing to show the portability of a given implementation without specific optimizations and the differences in time- and energy-to-solution for different platforms. Due to its versatility and support by many different manufacturers, OpenCL was chosen to execute the set of testcases. Hence, this study can not only help answer the question of which kind of hardware is best suited for a given problem, but it also investigates the capabilities of OpenCL on a variety of hardware platforms.

As the results will show, different devices are optimized for certain workloads or are even missing certain hardware features like support for double or half precision floating point arithmetic. Especially with more task specific devices, calculating the performance per watt based on a specific benchmark testcase (like LINPACK) is not representative for a different set of tasks, and no general performance per watt exists. While such a measure can be used for more abstract comparisons or theoretical benchmarks, it is of less use when designing real world systems. Hence, several different testcases were used as part of this work, and the results will be called energy-to-solution in the following to clarify that these values are always relative to a specific problem set and cannot necessarily be generalized.

Previous studies have investigated the advantages of different architectures related to time-to-solution, mostly in the context of high-performance or scientific computing linked to very specific applications like Monte Carlo methods [24], solvers for systems of linear equations [5], aerodynamic Navier–Stokes solvers [19], or mesh interpolation [4]. Research on the trade-offs between time-to-solution and energy-to-solution has mainly focused on the comparison between ARM-based processors and x86 CPUs in the context of computing clusters [3, 9, 18] or was focused specifically on GPUs [12]. To measure the power requirements, the integrated power profiling features available in modern CPUs and GPUs were used. Knowledge of the realtime power consumption of compute devices is not only important to the user to optimize code efficiency but also an integral part of automatic power state and dynamic clock frequency management. Hence, manufacturers have started to implement both software based heuristics and hardware based current and voltage monitoring to provide power estimates for different components with very little overhead.
These vendor specific profiling solutions have a relatively high temporal resolution, typically with maximum sampling rates in the range of 100 Hz, and can be accessed by the user through vendor specific APIs.

All test-cases were implemented using OpenCL, as it is not only promising for scientific HPC [17], but it also allows to target heterogeneous systems even for mobile applications [23]. OpenCL also has the major advantage of transparently abstracting the low-level parallelization from the user. Targeting both CPUs and GPUs directly with C/C++ code on different operating systems would otherwise require different implementations, as only OpenCL allows targeting heterogeneous systems running one of the major operating systems. Additionally, this opens up the possibility to investigate the performance portability of typical algorithms, which becomes an increasingly important question with the adoption of GPUs [12] and ARM-based systems both in scientific HPC and in consumer applications.

Methods

Four different test cases were chosen and implemented in OpenCL to test both the time-to-solution and energy-to-solution on different platforms. The open source cross platform tool ToolkitICL [11] was used as a framework on the host to execute and profile the different kernels. This tool was specifically designed to execute generic OpenCL kernels on HPC systems both for production use and for profiling and benchmarking. The data input and output, kernel source, and OpenCL settings are completely handled by HDF5 files. This makes it possible to encapsulate the individual test cases into separate HDF5 files and use the same host application for all runs to ensure comparability.

ToolkitICL also includes support for power and temperature profiling for AMD, Intel, and Nvidia hardware. While the sampling rates (typically around 10 Hz) achieved by these profiling solutions do not allow for instruction level power profiling (see e.g. [16]), they are sufficient to estimate the total energy necessary to execute a larger set of instructions. To gauge the accuracy of these built-in power profiling tools, a true RMS digital multimeter was used to log the current consumption during execution on different platforms to provide a measurement based reference. The conversion efficiency of the power supply (typically between 80 % and 90 %) was also considered and the measurements corrected based on the specific device to represent the actual power consumption. All test runs were done with and without power profiling enabled, to gauge the overhead associated with software based profiling. To determine the energy-to-solution, the time series of power measurements (either from the profiling API or from the current measurement) was numerically integrated over the OpenCL kernel execution time. To prevent bias caused for example by storage or networking hardware, the baseline idle power consumption was determined and subtracted from the power measured during execution. Each run was repeated at least three times and only the averaged results were used for the final analysis.

Due to the large differences in architecture between hardware platforms, selecting generic test cases without intrinsically favoring specific setups is a challenge. Especially while testing highly parallel applications on multi-node cluster setups, like the one used by [9], performance can be dominated by the efficiency of interconnect networks and memory architecture. The results are therefore only meaningful for a specific system, and care must be taken when these results are generalized.
Hence, for this study benchmarking was only performed on single nodes to eliminate the influence of network architecture and the additional overhead associated with distributing the workload across nodes. To avoid the caveats of classical microbenchmarking while still being able to use less powerful ARM systems incapable of running, for example, large numerical simulations, a compromise between reducing the complexity of the program (i.e. multi-node support or file in- and output) and reducing the size of the problem itself (i.e. number of timesteps or data points) was chosen. OpenCL was selected to realize the test cases not only because it can target all intended platforms using the same source code but also because it has execution time profiling features already built in. This is especially important when comparing accelerators (i.e. GPUs) with dedicated memory against CPUs, as the additional memory transfer overhead has to be taken into account. To simulate extensive real world workloads, the individual test cases were executed back-to-back several thousand times using the same input data, to yield an average time- and energy-to-solution.

OpenCL uses vendor specific platform driver implementations to compile and run the OpenCL code on the specific devices. Especially for GPUs, the OpenCL implementation is provided directly by the manufacturer as part of the GPU driver. Currently, only Intel supplies OpenCL CPU drivers for their devices; AMD discontinued CPU based drivers and ARM only provides OpenCL GPU drivers. While optimized for their own set of CPUs, the Intel platform drivers can be used on AMD devices, but the performance is not guaranteed. Another open-source cross platform OpenCL implementation is the "Portable Computing Language" (POCL) [13]. It can be compiled directly on the target system to achieve the best possible optimization for the specific hardware. POCL was used in this study for all ARM CPUs and tested on some of the Intel devices for reference. For AMD CPUs, POCL, the Intel drivers, and the legacy AMD drivers were tested.
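To illustrate the built-in execution time profiling mentioned above, the following minimal host-side sketch in C times a single kernel via OpenCL's event profiling. It assumes a command queue created with CL_QUEUE_PROFILING_ENABLE and a kernel whose arguments are already set; error handling is omitted for brevity:

    #include <CL/cl.h>

    /* Returns the device-side execution time of `kernel` in seconds.
       Assumes `queue` was created with CL_QUEUE_PROFILING_ENABLE. */
    static double time_kernel(cl_command_queue queue, cl_kernel kernel, size_t global_size)
    {
        cl_event evt;
        cl_ulong t_start = 0, t_end = 0; /* device timestamps in nanoseconds */

        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &evt);
        clWaitForEvents(1, &evt);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
        clReleaseEvent(evt);

        return (double)(t_end - t_start) * 1.0e-9;
    }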
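The energy-to-solution described above can then be obtained by numerically integrating the sampled power trace over this kernel execution window and subtracting the idle baseline. A minimal sketch, assuming a hypothetical array layout of sample times (in s) and power readings (in W):

    /* Energy-to-solution from a sampled power trace via the trapezoidal rule,
       with the idle baseline subtracted as described in the text.
       Returns watt-seconds (joules); `t` and `p` hold n samples. */
    static double energy_to_solution(const double *t, const double *p, int n, double p_idle)
    {
        double energy = 0.0;
        for (int i = 1; i < n; i++) {
            double p0 = p[i - 1] - p_idle;
            double p1 = p[i] - p_idle;
            energy += 0.5 * (p0 + p1) * (t[i] - t[i - 1]);
        }
        return energy;
    }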
Test cases

Different computational problems were selected for testing to represent a wide variety of real world workloads encountered in scientific and commercial applications. Testcases were chosen specifically to represent not just microbenchmarks, but rather a selection of real-world problems comprised of more abstract benchmark cases and completely self contained filter or simulation implementations. The OpenCL benchmarks were tested against similar implementations of the algorithms in MATLAB, CUDA, C++ AMP, and C++/OpenMP to verify the results and ensure that comparable or better time-to-solution was achieved by the OpenCL implementation. This approach was also used to ensure that none of the testcases intrinsically favours one architecture. All necessary testcase files are open source and available as part of the ToolkitICL tool [11].
2D median filter

The median filter was selected because it is a common nonlinear edge preserving digital filtering technique typically used to reduce image noise [14]. While it is also used in 1D signal processing, it is most often applied as a stand alone filter in image editing or as a pre-processing step in image recognition or classification applications. As such it is used in real time image processing systems, which are often constrained by power limitations. It is also a typical candidate for a problem theoretically benefiting from sharing workloads between a CPU acquiring the input and an accelerator like an embedded GPU to handle the actual processing. The idea behind median filtering is to replace each pixel with the median of a window of n × n neighboring pixels (a kernel sketch is given below, following the dot product description). As image processing is mostly done on integer datatypes, this example was implemented solely using integer operations and does not use floating point variables. A 4K (3840 × 2160) image was used as input.

Dot product

The dot product is a standard algebraic operation, calculating the sum of the products of the corresponding entries of two vectors of equal length. Even though it is a more abstract test scenario, as a typical microbenchmark it combines a large number of multiply–accumulate operations, which can be used as a measure of the overall performance of a computational system. Additionally, the dot product reduces the input data to a single output value, which is inherently difficult to parallelize, as the individual steps are not independent. This kind of reduction requires atomic memory operations, which can incur large performance penalties, especially on GPUs. Hence, the dot product can be calculated more efficiently using a single threaded approach (in particular by employing techniques like loop unrolling), even for relatively large vectors. But as many different algorithms in computational mathematics and data processing require some kind of atomic data reduction, this dot product example can be considered as a generic testcase for these kinds of reduction-requiring algorithms.
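As an illustration of the reduction pattern just described, here is a hedged OpenCL C sketch of a dot product using a work-group local tree reduction followed by one atomic update per group. This is a generic example under the stated assumptions, not the paper's exact implementation; buffer names are hypothetical, the local size is assumed to be a power of two, and `result` is assumed to be zero-initialized by the host. Since OpenCL C has no native atomic add for float, a compare-and-swap loop is used:

    /* Accumulate `val` into a global float via atomic compare-and-swap. */
    inline void atomic_add_float(volatile __global float *addr, float val)
    {
        union { unsigned int u; float f; } expected, desired;
        do {
            expected.f = *addr;
            desired.f  = expected.f + val;
        } while (atomic_cmpxchg((volatile __global unsigned int *)addr,
                                expected.u, desired.u) != expected.u);
    }

    __kernel void dot_product(__global const float *x, __global const float *y,
                              __global float *result, const int n,
                              __local float *scratch)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);

        scratch[lid] = (gid < n) ? x[gid] * y[gid] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree reduction within the work-group. */
        for (int stride = (int)get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* One atomic accumulation per work-group. */
        if (lid == 0)
            atomic_add_float(result, scratch[0]);
    }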
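And the kernel sketch promised above for the 2D median filter testcase: a minimal integer-only OpenCL C version, assuming a 3×3 window, an 8-bit grayscale image, and clamped borders (the actual testcase's window size and image format may differ):

    /* Replace each pixel with the median of its 3x3 neighborhood. */
    __kernel void median3x3(__global const uchar *src, __global uchar *dst,
                            const int width, const int height)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        if (x >= width || y >= height) return;

        /* Gather the window, clamping coordinates at the image borders. */
        uchar w[9];
        int k = 0;
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                int xx = clamp(x + dx, 0, width - 1);
                int yy = clamp(y + dy, 0, height - 1);
                w[k++] = src[yy * width + xx];
            }
        }

        /* Insertion sort of the 9 window values; the median is w[4]. */
        for (int i = 1; i < 9; i++) {
            uchar v = w[i];
            int j = i - 1;
            while (j >= 0 && w[j] > v) { w[j + 1] = w[j]; j--; }
            w[j + 1] = v;
        }
        dst[y * width + x] = w[4];
    }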
Cross-correlation

Cross-correlation is a signal processing technique used to measure the similarity between two datasets depending on the displacement of one of the inputs. It is typically used for pattern recognition and machine learning, for example 1D speech recognition or 2D image pattern matching (e.g. [2]), or to find statistical links between datasets, for example in climate analysis. Cross-correlation also has applications in large scale scientific data processing for time of flight or propagation analysis (e.g. [10]). In these scientific applications, cross-correlation analysis is applied to extremely large data sets, requiring significant computational resources. Discrete implementations typically shift one of the signals relative to the other and calculate the correlation coefficient for each shift. As this workload can be separated into independent steps, cross-correlation can significantly benefit from parallel execution, especially on real time systems. This example implements a complete correlation analysis but with reduced dataset length, using single precision floating point variables, which is typical for many real-world applications to conserve memory.
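A hedged OpenCL C sketch of this shift-and-accumulate pattern, with one work-item per lag and zero-padded borders; buffer names are illustrative assumptions, and the actual testcase additionally normalizes the sums into correlation coefficients:

    /* One work-item per lag; launch with 2*max_lag + 1 work-items. */
    __kernel void xcorr(__global const float *a, __global const float *b,
                        __global float *r, const int n, const int max_lag)
    {
        int lag = (int)get_global_id(0) - max_lag; /* lags -max_lag .. +max_lag */
        float sum = 0.0f;

        for (int i = 0; i < n; i++) {
            int j = i + lag;
            if (j >= 0 && j < n)   /* zero padding outside the signal */
                sum += a[i] * b[j];
        }
        r[get_global_id(0)] = sum;
    }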
Runge–Kutta differential equation solver

Numerically solving (partial) differential equations is a problem encountered in many fields, not only computational mathematics, but also physics, engineering, and economics. Runge–Kutta methods are a set of iterative algorithms widely considered as one of the de-facto standards to numerically approximate the solution of these equations. This testcase was not just implemented as a microbenchmark of a specific solver, but as a full 2D simulation running for several thousand timesteps. It is also well-suited to compare the influence of the floating point precision on the time- and energy-to-solution, by executing the same algorithm using both single precision and double precision floating point variables. An open-source implementation with a corresponding physical problem used in current mathematical and physics research was chosen [20, 21], as the intention of this work is to use benchmark testcases as close to real world applications as possible.
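To recall the structure of this family of methods (the testcase itself uses the scheme of [20, 21]; the classical fourth-order method is shown here only as a representative example), one Runge–Kutta step of size $h$ for $u' = f(t, u)$ reads

    $k_1 = f(t_n,\, u_n)$
    $k_2 = f(t_n + h/2,\, u_n + (h/2)\,k_1)$
    $k_3 = f(t_n + h/2,\, u_n + (h/2)\,k_2)$
    $k_4 = f(t_n + h,\, u_n + h\,k_3)$
    $u_{n+1} = u_n + (h/6)\,(k_1 + 2k_2 + 2k_3 + k_4)$

Each stage evaluates the right-hand side $f$ over the whole 2D grid, which is what makes the testcase both floating point heavy and naturally data parallel.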
Hardware

The hardware used for this study was selected to represent different architectures and vendors while focusing on typical platforms widely available on the market and in use in scientific HPC. It was not limited to higher-end enterprise level devices, as applications like image processing are required to run even on lower-end systems. Linux, Windows, and Mac OS X operating systems were used to exclude a possible bias due to differences in the operating systems. In total, 22 different discrete GPUs, three different integrated GPUs, and 14 server and workstation CPUs manufactured by Intel, AMD, and Nvidia were tested. As ARM-based systems are becoming increasingly important, four ARM-based CPUs and two integrated ARM GPUs were tested as well. A complete list of devices and operating systems is available as supplementary material. No specific performance related changes, like overclocking, were made to the systems.
Results

The computational power of the used devices is vastly different. Therefore, the time-to-solution is not directly comparable. In contrast, the energy required to solve the given problems is not directly related to the computational power of the device but rather to the energy efficiency or performance per watt. This of course assumes that only the power required to run the actual computational device is considered, excluding peripherals, storage systems, and other related hardware. To guarantee the validity of the results, it was ensured that each of the individual runs was within 15 % of the average both for time- and energy-to-solution.
Fig. 1 shows the power requirements of an Nvidia Tesla C2075 GPU over time during execution of the single precision Runge–Kutta differential equation test case (see Section 2.1.4). This example shows the results of both the internal vendor specific power profiling and the external primary current measurement. As explained in Section 2, the externally measured power can vary from the internal profiling measurements, as overhead due to networking, storage, cooling, or the host CPU is not considered by the internal profiling solutions and cannot fully be removed by subtracting the idle current. Furthermore, external measurements also track changes in power due to interconnect link state changes and other automatic workload dependent power saving features. The power supply efficiency (between 80 % and 90 %, see Section 2) was also accounted for in the external measurements.
Figure 1: Power requirements of a Tesla C2075 GPU during execution of the single precision Runge–Kutta differential equation testcase, determined by external primary current measurement (blue) and internal vendor specific power profiling (black).
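As one example of the vendor specific power telemetry APIs used for such profiling, the following minimal C sketch reads the current board power of an Nvidia GPU via NVML; it assumes a single GPU at index 0 and linking against the NVML library, and AMD and Intel expose comparable but different interfaces:

    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        nvmlDevice_t dev;
        unsigned int milliwatts = 0;

        if (nvmlInit() != NVML_SUCCESS)
            return 1;
        nvmlDeviceGetHandleByIndex(0, &dev);
        nvmlDeviceGetPowerUsage(dev, &milliwatts); /* board power in mW */
        printf("current power draw: %.1f W\n", milliwatts / 1000.0);
        nvmlShutdown();
        return 0;
    }

Sampling such a call at the achievable rate (around 10 Hz, see Section 2) yields the power traces that are integrated into the energy-to-solution.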
Time-to-solution

Figures 2–6 illustrate the execution time for the different benchmark test-cases for selected devices. Overall, both energy- and time-to-solution improve with newer hardware. The latest GPUs performed best regarding time-to-solution, even outperforming the latest series of high-performance CPUs, especially with complex floating-point workloads. While older integrated GPUs are comparable in runtime to the actual CPU (like the Core i7-6500U), the CPU is noticeably more powerful in newer devices. Especially for the image processing example (see Fig. 2), which is a prime candidate for shared memory based workload sharing between a CPU for acquiring and later evaluating the images and an integrated GPU used for image processing and filtering, the CPUs outperform the GPU by a factor of ∼2.
In a real world application, the CPU would have the additional overhead of managing and synchronising the GPU tasks, which makes offloading less attractive and more challenging for the programmer. Hence, at least for the workloads used as part of this study, simply offloading computationally intensive tasks from the CPU to the integrated GPU does not positively impact runtime.

The advantage in execution time with GPUs is especially pronounced in tasks which rely heavily on complex single precision floating point operations, like the Runge–Kutta or cross-correlation examples. The difference in execution time is much less pronounced in double precision workloads, which is due to the fact that even most modern GPUs, with the exception of dedicated general purpose computing GPUs (like the Tesla V100/P100), are, in contrast to CPUs, not optimized for double precision tasks. In particular, computing GPUs optimized for machine learning and inference tasks, like the Tesla T4, lack the hardware acceleration for double precision workloads. Modern desktop CPUs like the Ryzen 7 or the Intel Core i7-8700K (both released in 2017) have already achieved the computational performance of older server grade Xeon CPUs like the E5-2640 (released 2012), even though they have fewer cores and lower energy requirements. While it was to be expected that the ARM devices are not comparable in runtime to modern CPUs or GPUs, it is noteworthy that the latest ARM GPUs like the Mali G52 (released 2017) approach the performance of older low-end GPUs and CPUs like the GT 610 (released 2012) or the Intel Celeron N2840 (released 2014). The possible advantages of ARM-based devices become more obvious when considering energy-to-solution.

To ensure that the vendor specific power profiling implemented in the ToolkitICL framework has no negative impact on the time-to-solution, baseline runs without power profiling were performed. As expected, no significant difference in runtime was found for GPUs, and even for CPUs the deviations were around 10 %, which is within the statistical margin of error of this study. Slower systems like older CPUs or the ARM devices have no built-in power profiling capabilities, hence the impact of the profiling on computationally less powerful devices could not be investigated, as external current measurement had to be used.

The 2D median image filtering testcase was also used to compare the OpenCL performance with OpenMP for validation (see Section 2.1.1). On average, the OpenMP implementation was slower by a factor of ∼3,
which showcases the efficiency of the OpenCL framework in automatically optimizing the code for parallel execution.

One notable exception for GPU performance are the AMD devices, which performed much worse than expected based on the computational power given by the manufacturer. Additional testing showed that AMD GPUs seem to require hardware specific OpenCL code tuning to even achieve performance roughly comparable to other GPUs or even CPUs. Due to the limited support for other software frameworks, it was not possible to check whether these tuning requirements are specific to OpenCL.
Energy-to-solution

Similar to the runtime results, GPUs outperformed CPUs in energy-to-solution as well. The individual results are shown in Figures 7–11. As expected, CPUs targeted for mobile devices were more energy efficient compared to the much faster, but less energy efficient server CPUs. While the ARM devices were expectedly outperformed by traditional CPUs and GPUs in regard to runtime, the possible advantages of ARM-based systems become clear when considering energy-to-solution. In most cases both ARM CPUs and GPUs are comparable in energy efficiency to modern x86 CPUs and integrated GPUs. Especially with computationally simpler workloads like the dot-product example or signal processing workloads, modern ARM GPUs like the Mali G52 are even comparable to or better than modern GPUs (see Figs. 8, 9). While they take longer to execute the tasks, ARM devices still require less energy overall. This exemplifies the potential of ARM devices as a basis for energy efficient massively parallel compute clusters for certain types of applications.

Integrated GPUs show no advantage over the corresponding CPU for integer and double precision floating point workloads. For single precision tasks, the integrated GPUs are up to a factor of two more efficient, with comparatively minor trade-offs in runtime (see Section 3.2). Especially with battery powered systems, offloading single precision tasks therefore seems reasonable from an energy standpoint based on these results.

The power consumption for most of the devices (AMD, Intel, and Nvidia) was measured using vendor specific internal power profiling APIs. As this approach is, for many devices, at least partially based on software heuristics and not just measurement, the accuracy of this method was validated for all APIs using external supply current measurement. For GPUs, the results were within 10 %, while the deviation was below 15 % for CPUs. The difference is most likely due to the fact that, for technical reasons, GPUs have at least some on-board power measurement, while most CPUs do not and rely completely on heuristics.
Development over time

As benchmark results are available for devices released between 2010 and 2019, spanning almost an entire decade, the development of the time-to-solution and energy-to-solution for the different devices has been investigated. Compared to CPUs, GPUs were initially intended as task specific hardware accelerators, and the development of GPUs into more general purpose computing devices is a rather recent trend. Hence, Fig. 12 and Fig. 13 show the results separately for CPUs and discrete GPUs. Based on the work of [15], the results were plotted semi-logarithmically. The runtime, especially for CPUs, shows no clear trend over time, which can primarily be attributed to the fact that an intentionally broad selection of devices with different computational capabilities was selected. It is unsurprising that older enterprise level server CPUs are still faster than modern consumer or even embedded CPUs. The selection of GPUs is inherently more homogeneous, as it is primarily comprised of discrete GPUs, which are by design intended to be used for more computationally intensive tasks; otherwise integrated graphics solutions are used.
Figure 2: Time-to-solution for the integer based 2D median image filter testcase for CPUs, GPUs and ARM devices (runtime per device; less is faster).

Figure 3: Time-to-solution for the single-precision floating point dot product testcase for CPUs, GPUs and ARM devices (runtime per device; less is faster).

Figure 4: Time-to-solution for the single-precision floating point cross-correlation testcase for CPUs, GPUs and ARM devices (runtime per device; less is faster).

Figure 5: Time-to-solution for the single-precision Runge–Kutta differential equation solver testcase for CPUs, GPUs and ARM devices (runtime per device; less is faster).

Figure 6: Time-to-solution for the double-precision Runge–Kutta differential equation solver testcase for CPUs, GPUs and ARM devices (runtime per device; less is faster).

Figure 7: Energy-to-solution for the integer based 2D median image filter testcase for CPUs, GPUs and ARM devices (energy per device in Ws; less is more efficient); results based on external current measurement are marked with an asterisk.

Figure 8: Energy-to-solution for the single-precision floating point dot product testcase for CPUs, GPUs and ARM devices (energy per device in Ws; less is more efficient); results based on external current measurement are marked with an asterisk.

Figure 9: Energy-to-solution for the single-precision floating point cross-correlation testcase for CPUs, GPUs and ARM devices (energy per device in Ws; less is more efficient); results based on external current measurement are marked with an asterisk.

Figure 10: Energy-to-solution for the single-precision Runge–Kutta differential equation solver testcase for CPUs, GPUs and ARM devices (energy per device in Ws; less is more efficient); results based on external current measurement are marked with an asterisk.

Figure 11: Energy-to-solution for the double-precision Runge–Kutta differential equation solver testcase for CPUs, GPUs and ARM devices (energy per device in Ws; less is more efficient); results based on external current measurement are marked with an asterisk.
It is therefore understandable that, compared to the CPU results, on average a decrease in runtime can be observed, but still without a statistically significant trend.

As already discussed in Section 3.3, the energy-to-solution and the related performance per watt are independent of the computational power. Increases in energy efficiency do not just translate into economical and ecological benefits, but also allow for more computationally powerful devices per unit area given a constant limit on thermal dissipation. Especially with modern many-core architectures, energy efficiency under load can be used as an overall metric of the computational capabilities of a system independent of the specific computational devices (i.e. number of cores or clock rates). With the set of devices used in this study, an exponential increase in energy efficiency over time was observed. An exponential function of the form $a \cdot e^{rt}$, where $a$ is a problem specific scaling constant, $t$ is the year, and $r$ is the growth rate, was fitted to the results (see dashed lines in Fig. 12 and Fig. 13). Based on these findings, on average the energy efficiency improves by a factor of 1.22 every two years for CPUs and by a factor of 1.50 for discrete GPUs, a difference of ∼19 %. This shows the slow down in development relative to the doubling of the number of components per integrated circuit postulated by Moore's law [15], which is linked to an even faster increase in performance per watt (based on the Dennard scaling law [7]). Even though the development seems to be slowing down, in general newer devices are still noticeably more energy efficient, but without comparable improvements in runtime. While this is less important for the average user, who is primarily interested in the time-to-solution, it significantly impacts the running costs of high-performance computing systems.
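The two-year improvement factors quoted above follow directly from the fitted rates given in Figures 12 and 13: with energy-to-solution modeled as $a \cdot e^{rt}$ and $t$ in years, the efficiency gain over two years is $e^{-2r}$. Averaging the fitted rates over the five testcases gives approximately

    $e^{-2\bar{r}_{\mathrm{CPU}}} \approx e^{2 \cdot 0.10} \approx 1.22$ for CPUs,
    $e^{-2\bar{r}_{\mathrm{GPU}}} \approx e^{2 \cdot 0.20} \approx 1.50$ for discrete GPUs.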
Figure 12: Runtime and efficiency (CPU). Time-to-solution (left panels) and energy-to-solution (right panels) plotted logarithmically over the initial year of release of the device for all CPUs except ARM devices. An exponential fit was used to approximate the development of the energy efficiency over time; fitted growth rates: 2D median filter r = -0.072, Runge–Kutta solver (single precision) r = -0.069, dot product r = -0.188, cross-correlation r = -0.083, Runge–Kutta solver (double precision) r = -0.091.

Figure 13: Runtime and efficiency (GPU). Time-to-solution (left panels) and energy-to-solution (right panels) plotted logarithmically over the initial year of release of the device for all discrete GPUs. An exponential fit was used to approximate the development of the energy efficiency over time; fitted growth rates: 2D median filter r = -0.211, Runge–Kutta solver (single precision) r = -0.184, dot product r = -0.244, cross-correlation r = -0.172, Runge–Kutta solver (double precision) r = -0.192.

Summary and Conclusion
Using benchmarking testcases based on real world scientific and engineering workloads, this study showed that GPUs outperformed CPUs both in time- and in energy-to-solution. In particular, with tasks relying heavily on complex single-precision floating point operations, even consumer grade GPUs can outperform server grade CPUs by a factor between 4 and 20 in runtime. While the difference in time-to-solution is less pronounced with double-precision or integer workloads (an average speedup between a factor of 1.7 and 10), the energy efficiency of GPUs in particular is superior and on average about an order of magnitude better, making GPUs a faster and more energy efficient alternative to CPU-only systems. On average, the energy efficiency improves by a factor of 1.22 every two years for CPUs and by a factor of 1.50 for GPUs.

This comparison is based on the actual execution time and the energy necessary to run the compute device without external peripherals, memory, networking, or other components. While these factors play a role in the overall runtime and energy efficiency, these components are system specific and therefore not comparable. They can matter in large cluster deployments, especially with extremely memory intensive operations, but become negligible with increasing interconnect bandwidth and GPU memory. While the energy requirements for storage and interconnect can be quite significant as well, they are nearly identical independent of the compute architecture. Although GPUs and other accelerators still need a host CPU, the impact on energy efficiency is comparatively low, as a single CPU can control multiple accelerators, while also sharing some of the compute workload by using heterogeneous compute frameworks like OpenCL. The measurements also showed that the CPUs required only between 10 % and 15 % of the maximum power under load during idle while just running the operating system and networking, so even if the CPUs in a cluster environment were just used to control accelerators like GPUs, the efficiency would still be favorable over a CPU-only system. Integrated GPUs have significantly longer runtimes compared to the corresponding CPU (up to a factor of ∼2 in the tested workloads), but for single precision tasks they can be up to a factor of two more energy efficient.

References
[1] G. Agosta, A. Barenghi, A. Di Federico, and G. Pelosi. "OpenCL performance portability for general-purpose computation on graphics processor units: an exploration on cryptographic primitives." In: Concurrency and Computation: Practice and Experience. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.3358
[2] J. Benesty, J. Chen, Y. Huang, and I. Cohen. "Pearson Correlation Coefficient." In: Noise Reduction in Speech Processing. Berlin, Heidelberg: Springer, 2009, pp. 1–4. URL: https://doi.org/10.1007/978-3-642-00296-0_5
[3] J. L. Bez, E. Bernart, F. Santos, L. Schnorr, and P. Navaux. "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors." In: Concurrency and Computation: Practice and Experience. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4014
[4] F. Büyükkeçeci, O. Awile, and I. F. Sbalzarini. "A portable OpenCL implementation of generic particle–mesh and mesh–particle interpolation in 2D and 3D." In: Parallel Computing. DOI: https://doi.org/10.1016/j.parco.2012.12.001
[5] A.-K. Cheik Ahamed and F. Magoulès. "Conjugate gradient method with graphics processing unit acceleration: CUDA vs OpenCL." In: Advances in Engineering Software 111 (2017): Advances in High Performance Computing: on the path to Exascale software, pp. 32–42. DOI: https://doi.org/10.1016/j.advengsoft.2016.10.002
[6] L. Dagum and R. Menon. "OpenMP: an industry standard API for shared-memory programming." In: Computational Science & Engineering, IEEE.
[7] R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc. "Design of ion-implanted MOSFET's with very small physical dimensions." In: IEEE Journal of Solid-State Circuits (1974).
[8] J. Dongarra. "The LINPACK Benchmark: An Explanation." In: Proceedings of the 1st International Conference on Supercomputing. London, UK: Springer-Verlag, 1988, pp. 456–474. URL: http://dl.acm.org/citation.cfm?id=647970.742568
[9] D. Göddeke, D. Komatitsch, M. Geveler, D. Ribbrock, N. Rajovic, N. Puzovic, and A. Ramirez. "Energy Efficiency vs. Performance of the Numerical Solution of PDEs: An Application Study on a Low-power ARM-based Cluster." In: J. Comput. Phys. 237 (Mar. 2013), pp. 132–150. URL: http://dx.doi.org/10.1016/j.jcp.2012.11.031
[10] P. Heinisch, H.-U. Auster, I. Richter, G. Haerendel, I. Apathy, K.-H. Fornacon, E. Cupido, and K.-H. Glassmeier. "Joint two-point observations of LF-waves at 67P/Churyumov–Gerasimenko." In: Monthly Notices of the Royal Astronomical Society. URL: http://dx.doi.org/10.1093/mnras/stx1175
[11] P. Heinisch, K. Ostaszewski, and H. Ranocha. toolkitICL: An open source tool for automated OpenCL kernel execution. https://github.com/IANW-Projects/toolkitICL. Nov. 2019.
[12] H. H. Holm, A. R. Brodtkorb, and M. L. Sætra. GPU Computing with Python: Performance, Energy Efficiency and Usability. 2019. arXiv preprint.
[13] P. Jääskeläinen, C. S. Lama, E. Schnetter, K. Raiskila, J. Takala, and H. Berg. "pocl: A Performance-Portable OpenCL Implementation." In: Int. J. Parallel Program. DOI: 10.1007/s10766-014-0320-y. URL: http://dx.doi.org/10.1007/s10766-014-0320-y
[14] J. S. Lim. Two-dimensional Signal and Image Processing. Upper Saddle River, NJ, USA: Prentice-Hall, 1990, pp. 469–476.
[15] G. E. Moore. "Progress in digital integrated electronics [reprint of Technical Digest, International Electron Devices Meeting, IEEE, 1975, pp. 11–13]." In: IEEE Solid-State Circuits Society Newsletter.
[16] L. Mukhanov, P. Petoumenos, Z. Wang, N. Parasyris, D. S. Nikolopoulos, B. R. De Supinski, and H. Leather. "ALEA: A Fine-Grained Energy Profiling Tool." In: ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3050436. URL: http://doi.acm.org/10.1145/3050436
[17] K. Ostaszewski, P. Heinisch, and H. Ranocha. "Advantages and Pitfalls of OpenCL in Computational Physics." In: Proceedings of the International Workshop on OpenCL (IWOCL '18). Oxford, United Kingdom: ACM, 2018. DOI: 10.1145/3204919.3204929. URL: http://doi.acm.org/10.1145/3204919.3204929
[18] E. L. Padoin, L. L. Pilla, M. Castro, F. Z. Boito, P. O. A. Navaux, and J. F. Méhaut. "Performance/energy trade-off in scientific computing: the case of ARM big.LITTLE and Intel Sandy Bridge." In: IET Computers & Digital Techniques.
[19] S. J. Pennycook, S. D. Hammond, S. A. Wright, J. A. Herdman, I. Miller, and S. A. Jarvis. "An Investigation of the Performance Portability of OpenCL." In: J. Parallel Distrib. Comput. URL: http://dx.doi.org/10.1016/j.jpdc.2012.07.005
[20] H. Ranocha, K. Ostaszewski, and P. Heinisch. InductionEq: A set of tools for numerically solving the nonlinear magnetic induction equation with Hall effect in OpenCL. https://github.com/IANW-Projects/InductionEq. Sept. 2018.
[21] H. Ranocha, K. Ostaszewski, and P. Heinisch. Numerical Methods for the Magnetic Induction Equation with Hall Effect and Projections onto Divergence-Free Vector Fields. Oct. 2018. arXiv preprint.
[22] S. Sharma, C.-H. H. Hsu, and W.-C. Feng. "Making a case for a Green500 list." In: Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium. Apr. 2006, p. 8.
[23] O. Valery, P. Liu, and J.-J. Wu. "A collaborative CPU–GPU approach for principal component analysis on mobile heterogeneous platforms." In: Journal of Parallel and Distributed Computing 120 (2018), pp. 44–61. DOI: https://doi.org/10.1016/j.jpdc.2018.05.006
[24] R. Weber, A. Gothandaraman, R. J. Hinde, and G. D. Peterson. "Comparing Hardware Accelerators in Scientific Applications: A Case Study." In: IEEE Transactions on Parallel and Distributed Systems. DOI: 10.1109/TPDS.2010.125