[PDF] Towards Green Computing: A Survey of Performance and Energy Efficiency of Different Platforms using OpenCL

Abstract

When considering different hardware platforms, not just the time-to-solution can be of importance but also the energy necessary to reach it. This is not only the case with battery powered and mobile devices but also with high-performance parallel cluster systems due to financial and practical limits on power consumption and cooling. Recent developments in hard- and software have given programmers the ability to run the same code on a range of different devices giving rise to the concept of heterogeneous computing. Many of these devices are optimized for certain types of applications. To showcase the differences and give a basic outlook on the applicability of different architectures for specific problems, the cross-platform OpenCL framework was used to compare both time- and energy-to-solution. A large set of devices ranging from ARM processors to server CPUs and consumer and enterprise level GPUs has been used with different benchmarking testcases taken from applied research applications. While the results show the overall advantages of GPUs in terms of both runtime and energy efficiency compared to CPUs, ARM devices show potential for certain applications in massively parallel systems. This study also highlights how OpenCL enables the use of the same codebase on many different systems and hardware platforms without specific code adaptations.

Full PDF

TTowards Green Computing: A Survey ofPerformance and Energy Eﬃciency of DiﬀerentPlatforms using OpenCL

Philip Heinisch , Katharina Ostaszewski , and Hendrik Ranocha Institut für angewandte numerische Wissenschaft e.V. (IANW), Braunschweig, Germany Institut für Geophysik und extraterrestrische Physik (IGeP), Technische Universität Braunschweig, Germany King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

March 8, 2020

When considering diﬀerent hardware platforms, not just the time-to-solution can beof importance but also the energy necessary to reach it. This is not only the case withbattery powered and mobile devices but also with high-performance parallel clustersystems due to ﬁnancial and practical limits on power consumption and cooling. Recentdevelopments in hard- and software have given programmers the ability to run thesame code on a range of diﬀerent devices giving rise to the concept of heterogeneouscomputing. Many of these devices are optimized for certain types of applications.To showcase the diﬀerences and give a basic outlook on the applicability of diﬀerentarchitectures for speciﬁc problems, the cross platform OpenCL framework was used tocompare both time- and energy-to-solution. A large set of devices ranging from ARMprocessors to server CPUs and consumer and enterprise level GPUs has been used withdiﬀerent benchmarking testcases taken from applied research applications. While theresults show the overall advantages of GPUs in terms of both runtime and energyeﬃciency compared to CPUs, ARM devices show potential for certain applications inmassively parallel systems. This study also highlights how OpenCL enables the use ofthe same codebase on many diﬀerent systems and hardware platforms without speciﬁccode adaptations.

The maximum useful clock frequency of modern processors has become increasingly limited bysignal propagation delays and power dissipation, the so-called power wall. To support the growingneed for computational power, multi-processor architectures have become the new de-facto stan-dard in consumer, industrial and scientiﬁc applications. Together with the mainstream adoptionof multi-core CPUs, initially task-speciﬁc Graphics Processing Units (GPUs) developed into highlyparallel General-Purpose computing on Graphics Processing Units. Simultaneously, ARM-basedRISC processors transformed from application speciﬁc embedded processors to low-cost alterna-tives to the CISC-based x86 processors, driven by the fast increase in computational requirementsfor mobile and embedded devices. In recent years, accelerators using Field Programmable GateArrays (FPGAs) or Application-Speciﬁc Integrated Circuits (ASICs) have also become available,e.g. to solve speciﬁc proof-of-work based system as used in many cryptocurrencies.1 a r X i v : . [ c s . PF ] M a r n the past, most applications were developed for a speciﬁc CPU architecture with little need forportability. While the x86 architecture was the only option for computationally intensive problems,these processors were unsuitable for mobile applications, due to the high power consumption andneed for external support circuitry. Low-power ARM-based processors could easily be integratedinto battery powered mobile devices but lacked the computational resources (i.e. dedicated ﬂoatingpoint units) and support for large amounts of RAM or high-speed data buses needed for complexdata processing or scientiﬁc computations. The use of FPGAs, ASICs and GPUs was limited bythe high cost of soft- and hardware development.This has changed drastically in recent years, with increasingly power eﬃcient x86 hardware,powerful multi-core ARM-based System-on-a-Chip (SoC) designs and the widespread adoption oflow-cost high-powered GPUs both as external hardware and integrated with ARM and x86 CPUs.Even FPGAs became available as accelerator cards or integrated within CPUs (e.g. Intel Xeon Gold6138P or Xilinx Zynq). Combined with the increasing support for portability between diﬀerentarchitectures by software development kits, libraries, and compilers this lead to the concept ofheterogeneous computing. Frameworks like Khronos’ Open Computing Language (OpenCL),Microsoft’s C++ AMP or higher level programming models like Intel’s oneAPI or SYCL as a higher-level programming model for OpenCL provide the necessary software portability to target theseheterogeneous platforms with a single code-base, but still with limitations regarding performanceportability [1, 19]. In many studies, the question of performance portability is primarily discussedin relation to runtime, but the problem also arises in regard to power consumption. Limitingfactors for portability for example with OpenCL are plentiful and can range from workload sharingbetween compute units or cores to the optimization and usage of speciﬁc hardware accelerationfeatures. Especially for embedded or mobile devices, eﬃcient usage of speciﬁc hardware featurescan have a large impact on the required power, while having comparatively little impact on overallruntime.With all these hardware options readily available, the question arises which combination ofdevices is best suited to solve a given set of problems. The answer depends not only on the time-to-solution but also on the energy-to-solution and the price of the hardware. Especially the energynecessary to reach a solution becomes increasingly important as battery powered systems haveto handle computational intensive tasks like image processing for autonomous electric vehiclesor machine learning. Even for large-scale data centers or high performance computing (HPC)facilities the energy cost over the lifetime of the systems can be higher than the initial cost ofacquisition. This has lead to programs like the Green500 list [22] to rank supercomputers basedon energy eﬃciency. In this context the term performance per watt is widely used. This metricis of course linked to accompanying benchmark workloads. To provide some comparability inmany cases the linear algebra LINPACK suite is used [8], which primarily relies on standardmicrobenchmarks. This study aims at using complete self contained real world examples fromcomputational physics and mathematics as well as signal and image processing to show theportability of a given implementation without speciﬁc optimizations and the diﬀerences in time-and energy-to-solution for diﬀerent platforms. Due to its versatility and support by many diﬀerentmanufactures, OpenCL was chosen to execute the set of testcases. Hence, this study can not onlyhelp answer the question of which kind of hardware is best suited for a given problem, but it alsoinvestigates the capabilities of OpenCL on a variety of hardware platforms.As the results will show, diﬀerent devices are optimized for certain workloads or are even missingcertain hardware features like support for double or half precision ﬂoating point arithmetic.Especially with more task speciﬁc devices, calculating the performance per watt based on a speciﬁcbenchmark testcase (like LINPACK) is not representative for a diﬀerent set of tasks and no generalperformance per watt exists. While such a measure can be used for more abstract comparisonsor theoretical benchmarks, it is of less use when designing real world systems. Hence, as part ofthis work several diﬀerent testcases were used and the results will be called energy-to-solution inthe following, to clarify that these values are always relative to a speciﬁc problem set and cannot2ecessarily be generalized.Previous studies have instigated the advantages of diﬀerent architectures related to time-to-solution mostly in the context of high-performance or scientiﬁc computing linked to very speciﬁcapplications like Monte Carlo methods [24], solvers for systems of linear equations [5], aerodynamicNavier–Stokes solvers [19], or mesh interpolation [4]. Research on the trade-oﬀs between time-to-solution and energy-to-solution has mainly focused on the comparison between ARM-basedprocessors and x86 CPUs in the context of computing clusters [3, 9, 18] or was focused speciﬁcally onGPUs [12]. To measure the power requirements, the integrated power proﬁling features availablein modern CPUs and GPUs were used. Knowledge of the realtime power consumption of computedevices is not only important to the user to optimize code eﬃciency but also an integral part ofautomatic power state and dynamic clock frequency management. Hence, manufacturers havestarted to implement both software based heuristics and hardware based current and voltagemonitoring to provide power estimates for diﬀerent components with very little overhead. Thesevendor speciﬁc proﬁling solutions have a relatively high temporal resolutions, typically withmaximum sampling rates in the range of 100 Hz and can be accessed by the user through vendorsspeciﬁc APIs.All test-cases were implemented using OpenCL, as it is not only promising for scientiﬁc HPC [17],but it also allows to target heterogeneous systems even for mobile applications [23]. OpenCL alsohas the major advantage of transparently abstracting the low-level parallelization from the user.Targeting both CPUs and GPUs directly with C/C++ code on diﬀerent operating systems wouldotherwise require diﬀerent implementations, as only OpenCL allows targeting heterogeneoussystems running one of the major operating systems. Additionally this opens up the possibilityto investigate the performance portability of typical algorithms, which becomes an increasinglyimportant question with the adoption of GPUs [12] and ARM-based systems both in scientiﬁc HPCand in consumer applications. Four diﬀerent test cases were chosen and implemented in OpenCL to test both the time-to-solutionand energy-to-solution on diﬀerent platforms. The open source cross platform tool ToolkitICL[11] was used as a framework on the host to execute and proﬁle the diﬀerent kernels. This toolwas speciﬁcally designed to execute generic OpenCL kernels on HPC systems both for productionuse and for proﬁling and benchmarking. The data input and output, kernel source, and OpenCLsettings are completely handled by HDF5 ﬁles. This makes it possible to encapsulate the individ-ual test cases into separate HDF5 ﬁles and use the same host application for all runs to ensurecomparability.ToolkitICL also includes support for power and temperature proﬁling for AMD, Intel, and Nvidiahardware. While the sampling rates (typically around 10 Hz) achieved by these proﬁling solutionsdo not allow for instruction level power proﬁling (see e.g. [16]), they are suﬃcient to estimate thetotal energy necessary to execute a larger set of instructions. To gauge the accuracy of these built-in power proﬁling tools, a true RMS digital multimeter was used to log the current consumptionduring execution on diﬀerent platforms to provide a measurement based reference. The conversioneﬃciency of the power supply (typically between 80% and 90%) was also considered and themeasurements corrected based on the speciﬁc device to represent the actual power consumption.All test runs were done with and without power proﬁling enabled, to gauge the overhead associatedwith software based proﬁling. To determine the energy-to-solution, the time series of powermeasurements (either from proﬁling API or current measurement) was numerically integrated overthe OpenCL kernel execution time. To prevent bias caused for example by storage or networkinghardware, the baseline idle power consumption was determined and subtracted from the powermeasured during execution. Each run was repeated at least three times and only the averagedresults were used for the ﬁnal analysis. 3ue to the large diﬀerences in architecture between hardware platforms, selecting generic testcases without intrinsically favoring speciﬁc setups is a challenge. Especially while testing highlyparallel applications on multi-node cluster setups, like the one used by [9], performance can bedominated by the eﬃciency of interconnect networks and memory architecture. The results aretherefore only meaningful for a speciﬁc system and care must be taken when these results aregeneralized. Hence for this study benchmarking was only performed on single nodes to eliminatethe inﬂuence of network architecture and additional overhead associated with distributing theworkload across nodes. To avoid the caveats of classical microbenchmarking while still being ableto use less powerful ARM systems incapable of running for example large numerical simulations,a compromise between reducing the complexity of the program (i.e. multi-node support or ﬁlein- and output) and reducing the size of the problem itself (i.e. number of timesteps or datapoints) was chosen. OpenCL was selected to realize the test cases not only because it can targetall intended platforms using the same source code but also because it has execution time proﬁlingfeatures already build in. This is especially important when comparing accelerators (i.e. GPUs)with dedicated memory against CPUs, as the additional memory transfer overhead has to be takeninto account. To simulate extensive real world workloads, the individual test cases were executedback-to-back several thousand times using the same input data, to yield an average time- andenergy-to-solution.OpenCL uses vendor speciﬁc platform driver implementations to compile and run the OpenCLcode on the speciﬁc devices. Especially for GPUs, the OpenCL implementation is provideddirectly by the manufacturer as part of the GPU driver. Currently, only Intel supplies OpenCLCPU drivers for their devices. AMD discontinued CPU based drivers and ARM only providesOpenCL GPU drivers. While optimized for their own set of CPUs, the Intel platform drivers canbe used on AMD devices, but the performance is not guaranteed. Another open-source crossplatform OpenCL implementation is the "Portable Computing Language" (POCL) [13]. It can becompiled directly on the target system, to achieve the best possible optimization for the speciﬁchardware. POCL was used in this study for all ARM CPUs and tested on some of the Intel devicesfor reference. For AMD CPUs, POCL, Intel and the legacy AMD drivers were tested.

Diﬀerent computational problems were selected for testing to represent a wide variety of realworld workloads encountered in scientiﬁc and commercial applications. Testcases were chosenspeciﬁcally to represent not just microbenchmarks, but rather a selection of real-world problemscomprised of more abstract benchmark cases and completely self contained ﬁlter or simulationimplementations. The OpenCL benchmarks were tested against similar implementations of thealgorithms in MATLAB, CUDA, C++ AMP, and C++/OpenMP to verify the results and ensurethat comparable or better time-to-solution was achieved by the OpenCL implementation. Thisapproach was also used to ensure that none of the testcases intrinsically favours one architecture.While it has been All necessary testcase ﬁles are open source and available as part of the ToolkitICLtool [11].

The median ﬁlter was selected because it is a common nonlinear edge preserving digital ﬁlteringtechnique typically used to reduce image noise [14]. While it is also used in 1D signal processing,it is most often applied as a stand alone ﬁlter in image editing or as a pre-processing step inimage recognition or classiﬁcation applications. As such it is used in real time image processingsystems, which are often constrained by power limitations. It is also a typical candidate for aproblem theoretically beneﬁting from sharing workloads between a CPU acquiring the input andan accelerator like an embedded GPU to handle the actual processing. The idea behind medianﬁltering is to replace each pixel with the median of a window of n × n neighboring pixels. As4mage processing is mostly done on integer datatypes, this example was implemented solely usinginteger operations and does not use ﬂoating point variables. A 4K (3840 × The dot product is a standard algebraic operation, calculating the sum of the products of thecorresponding entries of two vectors of equal length. As a typical microbenchmark, even thoughit is a more abstract test scenario it combines a large number of multiply–accumulate operations,which can be used as a measure of the overall performance of a computational system. Additionally,the dot product reduces the input data to a single output value, which is inherently diﬃcult toparallelize, as the individual steps are not independent. This kind of reduction requires atomicmemory operation which can incur large performance penalties, especially on GPUs. Hence, thedot product can be calculated more eﬃciently using a single threaded approach (in particular byemploying techniques like loop unrolling), even for relatively large vectors, but as many diﬀerentalgorithms in computational mathematics and data processing require some kind of atomic datareduction, this dot product example can be considered as a generic testcase for these kinds ofreduction-requiring algorithms.

Cross-correlation is a signal processing technique used to measure the similarity between twodatasets depending on the displacement of one of the inputs. It is typically used for pattern recog-nition, machine learning, for example 1D speech recognition or 2D image pattern matching (e.g.[2]) or to ﬁnd statistical links between datasets for example in climate analysis. Cross-correlationalso has applications in large scale scientiﬁc data processing for time of ﬂight or propagation anal-ysis (e.g. [10]). In these scientiﬁc applications, cross-correlation analysis is applied to extremelylarge data sets, requiring signiﬁcant computational resources. Discrete implementations typicallyshift one of the signals relative to the other and calculate the correlation coeﬃcient for each shift.As this workload can be separated into independent steps, cross-correlation can signiﬁcantly ben-eﬁt from parallel execution, especially on real time systems. This example implements a completecorrelation analysis but with reduced dataset length using single precision ﬂoating point variables,which is typical for many real-world applications to conserve memory.

Numerically solving (partial) diﬀerential equations is a problem encountered in many ﬁelds notonly computational mathematics, but also physics, engineering and economics. Runge–Kuttamethods are a set of iterative algorithms, widely considered as one of the de-facto standards tonumerically approximate the solution of these equations. This testcase was not just implementedas a microbenchmark of a speciﬁc solver, but as full 2D simulation running for several thousandtimesteps. It is also well-suited to compare the inﬂuence of the ﬂoating point precision on the time-and energy-to-solution, by executing the same algorithm using both single precision and doubleprecision ﬂoating point variables. An open-source implementation with a corresponding physicalproblem used in current mathematical and physics research was chosen [20, 21], as the intentionof this work is to use benchmark testcases as close to real world applications as possible.5 .2 Hardware

The hardware used for this study was selected to represent diﬀerent architectures and vendorswhile focusing on typical platforms widely available on the market and in use in scientiﬁc HPC.It was not limited to higher-end enterprise level devices, as applications like image processing arerequired to run even on lower-end systems. Linux, Windows, and Mac OS X operating systemswere used to exclude a possible bias due to diﬀerences in the operating systems. In total, 22diﬀerent discrete GPUs, three diﬀerent integrated GPUs, and 14 server and workstation CPUsmanufactured by Intel, AMD and Nvidia were tested. As ARM-based system are becomingincreasingly important, four ARM-based CPUs and two integrated ARM GPUs were tested aswell. A complete list of devices and operating systems is available as supplementary material. Nospeciﬁc performance related changes, like overclocking, were made to the systems.

The computational power of the used devices is vastly diﬀerent. Therefore, the time-to-solution isnot directly comparable. In contrast, the energy required to solve the given problems is not directlyrelated to the computational power of the device but rather to the energy eﬃciency or performanceper watt. This of course assumes that only the power required to run the actual computationaldevice is considered, excluding peripherals, storage systems and other related hardware. Toguarantee the validity of the results, it was ensured that each of the individual runs was within15 % of the average both for time- and energy-to-solution.

Fig. 1 shows the power requirements of an Nvidia C2075 GPU during execution of the singleprecision Runge–Kutta diﬀerential equation test case (see Section 2.1.4) over time. This exampleshows the results of both the internal vendor speciﬁc power proﬁling and external primary currentmeasurement. As explained in Section 2, the externally measured power can vary from theinternal proﬁling measurements. As overhead due to networking, storage, cooling or the hostCPU which is not considered by the internal proﬁling solutions and can not fully be removed bysubtracting the idle current. Furthermore, external measurements also track changes in powerdue to interconnect link state changes and other automatic workload dependent power savingfeatures. The power supply eﬃciency ( ∼

10 15 20 25 30

Time (s) M ea s u r ed P o w e r ( W ) P r o f ili ng P o w e r ( W ) Figure 1: Power requirements of a Tesla C2075 GPU during execution of the single precision Runge–Kuttadiﬀerential equation testcase, determined by external primary current measurement (blue) andinternal vendor speciﬁc power proﬁling (black).

Figures 2–6 illustrate the execution time for the diﬀerent benchmark test-cases for selected devices.Overall both energy- and time-to-solution improve with newer hardware. The latest GPUs per-formed best regarding time to solution, even outperforming the latest series of high-performanceCPUs, especially with complex ﬂoating-point workloads. While older integrated GPUs are compa-rable in runtime to the actual CPU (like the Core i7-6500U), the CPU is noticeably more powerful innewer devices. Especially for the image processing example (see Fig. 2), which is a prime candidatefor shared memory based workload sharing between a CPU for acquiring and later evaluating theimages and an integrated GPU used for image processing and ﬁltering, the CPUs outperform theGPU by a factor of ∼

2. In a real world application, the CPU would have the additional overheadof managing and synchronising the GPU tasks, which makes oﬄoading less attractive and morechallenging for the programmer. Hence, at least for the workloads used as part of this study,simply oﬄoading computationally intensive tasks from the CPU to the integrated GPU does notpositively impact runtime.The advantage in execution time with GPUs is especially pronounced in tasks which rely heavilyon complex single precision ﬂoating point operations, like the Runge–Kutta or cross-correlationexamples. The diﬀerence in execution time is much less pronounced in double precision workloads,which is due to the fact that even most modern GPUs, with the exception of dedicated generalpurpose computing GPUs (like the Tesla V100/P100), are in contrast to CPUs not optimizedfor double precision tasks. In particular computing GPUs optimized for machine learning andinference tasks, like the Tesla T4, lack the hardware acceleration for double precision workloads.Modern desktop CPUs like the Ryzen 7 or the Intel Core i7-8700K (both released in 2017) havealready achieved the computational performance of older server grade Xeon CPUs like the E5-2640 (released 2012) even though they have less cores and lower energy requirements. Whileit was to be expected that the ARM devices are not comparable in runtime to modern CPUs orGPUs, it is noteworthy that the latest ARM GPUs like the Mali G52 (released 2017) approach theperformance of older low-end GPUs and CPUs like the GT 610 (released 2012) or the Intel CeleronN2840 (released 2014). The possible advantages of ARM-based devices become more obviouswhen considering energy-to-solution.To ensure that the vendor speciﬁc power proﬁling implemented in the ToolkitICL framework hasno negative impact on the time-to-solution, baseline runs without power proﬁling were performed.As expected, no signiﬁcant diﬀerence in runtime was found for GPUs and even for CPUs, thedeviations were around 10 %, which is within the statistical margin of error of this study. Slowersystems like older CPUs or the ARM devices have no built-in power proﬁling capabilities, hencethe impact of the proﬁling with computationally less powerful devices could not be investigated,as external current measurement had to be used.7he 2D median image ﬁltering testcase was also used to compare the OpenCL performance withOpenMP for validation (see section 2.1.1). On average, the OpenMP implementation was slowerby a factor of ∼

3, which showcases the eﬃciency of the OpenCL framework in automaticallyoptimizing the code for parallel execution.One notable exception for GPU performance are the AMD devices, which performed muchworse than expected based on the computational power given by the manufacturer. Additionaltesting showed that AMD GPUs seem to require hardware speciﬁc OpenCL code tuning to evenachieve performance roughly comparable to other GPUs or even CPUs. Due to the limited supportfor other software frameworks, it was not possible to check whether these tuning requirements arespeciﬁc to OpenCL.

Similar to the runtime results, GPUs outperformed CPUs in energy-to-solution as well. Theindividual results are shown in Figures 7–11. As expected, CPUs targeted for mobile deviceswere more energy eﬃcient compared to the much faster, but less energy eﬃcient server CPUs.While the ARM devices were expectedly outperformed by traditional CPUs and GPUs in regard toruntime, the possible advantages of ARM-based systems become clear when considering energy-to-solution. In most cases both ARM CPUs and GPUs are comparable in energy eﬃciency tomodern x86 CPUs and integrated GPUs. Especially with computationally simpler workloads likethe dot-product example or signal processing workloads, modern ARM GPUs like the Mali G52are even comparable or better then modern GPUs (see Fig. 8, 9). While they require longer toexecute the tasks, ARM devices still require less energy overall. This exempliﬁes the potential ofARM devices as basis for energy eﬃcient massively parallel compute clusters, for certain types ofapplications.Integrated GPUs show no advantage over the corresponding CPU for integer and double preci-sion ﬂoating point workloads. For single precision tasks, the integrated GPUs are up to a factor oftwo more eﬃcient, but have comparatively minor trade oﬀs in runtime (see Section 3.2). Especiallywith battery powered systems, oﬄoading single precision tasks seems reasonable from an energystandpoint based on these results.The power consumption for most of the devices (AMD, Intel and Nvidia) was measured usingvendor speciﬁc internal power proﬁling APIs. As this approach is for many devices at least partiallybased on software heuristics and not just measurement, the accuracy of this method was validatedfor all APIs using external supply current measurement. For GPUs, the results were within 10 %,while the deviation was below 15 % for CPUs. The diﬀerence is most likely due to the fact, that fortechnical reasons, GPUs have at least some on-board power measurement, while most CPUs havenot and rely completely on heuristics.

As benchmark results are available for devices released between 2010 and 2019 spanning almostan entire decade, the development of the time-to-solution and energy-to-solution for the diﬀerentdevices has been investigated. Compared to CPUs, GPUs were initially intended as task speciﬁchardware accelerators and the development of GPUs into more general purpose computing devicesis a rather recent trend. Hence, Fig. 12 and Fig. 13 show the results separately for CPUs anddiscrete GPUs. Based on the work of [15], the results were plotted semi-logarithmically. Theruntime, especially for CPUs, shows no clear trend over time, which can primarily be attributed tothe fact that an intentionally broad selection of devices with diﬀerent computational capabilitieswas selected. It is unsurprising that older enterprise level server CPUs are still faster than modernconsumer or even embedded CPUs. The selection of GPUs is inherently more homogeneous, asit is primarily comprised of discrete GPUs, which are by design intended to be used for morecomputationally intensive tasks, otherwise integrated graphics solutions are used. Therefore, it8

D Median Filter

GTX 570 (6.22 s) GT 610 (75.45 s)GTX 1050 Ti MQ (3.87 s)GTX 1060 (2.03 s)GTX 1060 MQ (2.31 s)GTX 1070 Ti (1.22 s)GTX 1080 Ti (0.9 s)RTX 2070 (1.01 s)RTX 2080 Ti (0.59 s)GTX Titan (3.41 s)RTX Titan (0.55 s)GRID K520 (7.44 s)Tesla C2075 (8.75 s)Tesla M60 (2.3 s)Tesla K80 (4.29 s)Tesla T4 (1.37 s)Tesla P100 (0.99 s)Tesla V100 (0.65 s)Quadro RTX 4000 (1.12 s)Radeon R9 280X (7.03 s)FirePro S9150 (6.23 s)Radeon VII (3.97 s)Ryzen 5 1600X (3.96 s)Ryzen 7 1700X (3.36 s)Ryzen 5 2600X (4.73 s)Ryzen 5 3600 (3.07 s)Xeon CPU E5-2640 v0 (x2) (9.91 s)Xeon CPU E5-2620 v3 (x2) (5.36 s)Xeon CPU E5-2666 v3 (x2) (5.11 s)Xeon CPU E7-8880 v3 (x2) (3.66 s)Xeon Gold 6230 CPU (x2) (0.8 s) Celeron N2840 (319.5 s)Core i5-2400 (30.42 s)Core i7-6500U (23.53 s)Core i7-8700K (5.71 s)Core i7-8750H (6.78 s)Core i7-6500U HD520 GPU (14.73 s)Core i7-8700K UHD630 GPU (12.93 s)Core i7-8750H UHD630 GPU (18.78 s) ARM Mali T-628 MP6-4C (249.73 s)ARM Mali G52 (76.02 s)Samsung Exynos 5422 (172.8 s)Amlogic S905 (283.16 s)Amlogic S922X (148.7 s)Broadcom BCM2837 (219.97 s)

Runtime (s) - less is faster

Figure 2: Time-to-solution for the integer based 2D median image ﬁlter testcase for CPUs, GPUs and ARMdevices. ot Product GTX 570 (0.525 s) GT 610 (2.747 s)GTX 1050 Ti MQ (0.334 s)GTX 1060 (0.173 s)GTX 1060 MQ (0.251 s)GTX 1070 Ti (0.156 s)GTX 1080 Ti (0.165 s)RTX 2070 (0.161 s)RTX 2080 Ti (0.154 s)GTX Titan (0.32 s)RTX Titan (0.154 s)GRID K520 (0.43 s)Tesla C2075 (0.691 s)Tesla M60 (0.451 s)Tesla K80 (0.455 s)Tesla T4 (0.327 s)Tesla P100 (0.352 s)Tesla V100 (0.324 s)Quadro RTX 4000 (0.167 s)Radeon R9 280X (1.513 s) FirePro S9150 (7.714 s)Radeon VII (1.202 s)Ryzen 5 1600X (0.683 s)Ryzen 7 1700X (0.567 s)Ryzen 5 2600X (0.6 s)Ryzen 5 3600 (0.466 s)Xeon CPU E5-2640 v0 (x2) (0.785 s)Xeon CPU E5-2620 v3 (x2) (0.535 s)Xeon CPU E5-2666 v3 (x2) (0.54 s)Xeon CPU E7-8880 v3 (x2) (0.422 s)Xeon Gold 6230 CPU (x2) (0.148 s) Celeron N2840 (9.289 s)Core i5-2400 (1.81 s)Core i7-6500U (1.805 s)Core i7-8700K (0.505 s)Core i7-8750H (0.622 s)Core i7-6500U HD520 GPU (1.556 s)Core i7-8700K UHD630 GPU (1.789 s)Core i7-8750H UHD630 GPU (1.829 s) ARM Mali T-628 MP6-4C (10.547 s)ARM Mali G52 (5.73 s)Samsung Exynos 5422 (6.713 s)Amlogic S905 (19.52 s)Amlogic S922X (8.527 s)Broadcom BCM2837 (18.614 s)

Runtime (ms) - less is faster

Figure 3: Time-to-solution for the single-precision ﬂoating point dot product testcase for CPUs, GPUs andARM devices. ross-Correlation GTX 570 (6.5 s) GT 610 (78.16 s)GTX 1050 Ti MQ (4.9 s)GTX 1060 (2.57 s)GTX 1060 MQ (3.32 s)GTX 1070 Ti (1.58 s)GTX 1080 Ti (1.58 s)RTX 2070 (1.21 s)RTX 2080 Ti (0.81 s)GTX Titan (3.88 s)RTX Titan (0.77 s)GRID K520 (6.54 s)Tesla C2075 (9.77 s)Tesla M60 (2.88 s)Tesla K80 (4.7 s)Tesla T4 (1.5 s)Tesla P100 (1.36 s)Tesla V100 (0.92 s)Quadro RTX 4000 (1.27 s)Radeon R9 280X (3.41 s)FirePro S9150 (2.85 s)Radeon VII (1.35 s) Ryzen 5 1600X (19.67 s)Ryzen 7 1700X (16.12 s)Ryzen 5 2600X (18.71 s)Ryzen 5 3600 (12.28 s)Xeon CPU E5-2640 v0 (x2) (17.42 s)Xeon CPU E5-2620 v3 (x2) (13.64 s)Xeon CPU E5-2666 v3 (x2) (13.7 s)Xeon CPU E7-8880 v3 (x2) (9.66 s)Xeon Gold 6230 CPU (x2) (2.93 s) Celeron N2840 (331.07 s)Core i5-2400 (43.17 s)Core i7-6500U (53.08 s)Core i7-8700K (14.51 s)Core i7-8750H (18.58 s)Core i7-6500U HD520 GPU (26.11 s)Core i7-8700K UHD630 GPU (21.66 s)Core i7-8750H UHD630 GPU (24.82 s) ARM Mali T-628 MP6-4C (158.27 s)ARM Mali G52 (91.87 s)Samsung Exynos 5422 (230.43 s)Amlogic S905 (172.84 s)Amlogic S922X (187.15 s)Broadcom BCM2837 (261.18 s)

Runtime (ms) - less is faster

Figure 4: Time-to-solution for the single-precision ﬂoating point cross-correlation testcase for CPUs, GPUsand ARM devices. unge-Kutta Differential Equation Solver (single precision) GTX 570 (6.48 s) GT 610 (79.78 s)GTX 1050 Ti MQ (4.18 s)GTX 1060 (2.46 s)GTX 1060 MQ (3.41 s)GTX 1070 Ti (1.71 s)GTX 1080 Ti (1.8 s)RTX 2070 (1.02 s)RTX 2080 Ti (0.9 s)GTX Titan (3.71 s)RTX Titan (0.74 s)GRID K520 (7.57 s)Tesla C2075 (8.84 s)Tesla M60 (4.26 s)Tesla K80 (4.93 s)Tesla T4 (1.15 s)Tesla P100 (1.17 s)Tesla V100 (0.84 s)Quadro RTX 4000 (0.95 s) Radeon R9 280X (64.96 s)FirePro S9150 (20.97 s)Radeon VII (11.85 s) Ryzen 5 1600X (29.59 s)Ryzen 7 1700X (24.01 s)Ryzen 5 2600X (22.12 s)Ryzen 5 3600 (23.71 s)Xeon CPU E5-2640 v0 (x2) (24.22 s)Xeon CPU E5-2620 v3 (x2) (17.21 s)Xeon CPU E5-2666 v3 (x2) (17.25 s)Xeon CPU E7-8880 v3 (x2) (13.64 s)Xeon Gold 6230 CPU (x2) (7.66 s) Celeron N2840 (581.5 s)Core i5-2400 (65.92 s)Core i7-6500U (64.6 s)Core i7-8700K (17.23 s)Core i7-8750H (23.84 s)Core i7-6500U HD520 GPU (26.02 s)Core i7-8700K UHD630 GPU (28.61 s)Core i7-8750H UHD630 GPU (35.41 s)ARM Mali T-628 MP6-4C (398.78 s)ARM Mali G52 (189.88 s)Samsung Exynos 5422 (674.48 s)Amlogic S905 (793.8 s)Amlogic S922X (321.16 s)Broadcom BCM2837 (1246.33 s)

Runtime (s) - less is faster

Figure 5: Time-to-solution for the single-precision Runge-Kutta diﬀerential equation solver testcase forCPUs, GPUs and ARM devices. unge-Kutta Differential Equation Solver (double precision) GTX 570 (26.76 s) GT 610 (401.39 s)GTX 1050 Ti MQ (46.85 s)GTX 1060 (23.28 s)GTX 1060 MQ (29.31 s)GTX 1070 Ti (13.59 s)GTX 1080 Ti (9.71 s)RTX 2070 (15.71 s)RTX 2080 Ti (9.06 s)GTX Titan (18.17 s)RTX Titan (8.12 s)GRID K520 (43.8 s)Tesla C2075 (31.04 s)Tesla M60 (26.28 s)Tesla K80 (15.45 s)Tesla T4 (16.56 s)Tesla P100 (2.06 s)Tesla V100 (1.44 s)Quadro RTX 4000 (15.98 s)Radeon R9 280X (127.33 s)FirePro S9150 (136.8 s)Radeon VII (38.15 s)Ryzen 5 1600X (48.61 s)Ryzen 7 1700X (38.97 s)Ryzen 5 2600X (35.17 s)Ryzen 5 3600 (29.49 s)Xeon CPU E5-2640 v0 (x2) (31.5 s)Xeon CPU E5-2620 v3 (x2) (25.02 s)Xeon CPU E5-2666 v3 (x2) (22.51 s)Xeon CPU E7-8880 v3 (x2) (18.74 s)Xeon Gold 6230 CPU (x2) (9.39 s)Core i5-2400 (81.76 s)Core i7-8700K (21.73 s)Core i7-8750H (26.84 s)Core i7-8700K UHD630 GPU (61.78 s)Core i7-8750H UHD630 GPU (61.69 s) ARM Mali T-628 MP6-4C (1199.75 s)Samsung Exynos 5422 (1311.37 s)Amlogic S905 (1280.97 s)Amlogic S922X (481.6 s)Broadcom BCM2837 (2028.82 s)

Runtime (s) - less is faster

Figure 6: Time-to-solution for the double-precision Runge-Kutta diﬀerential equation solver testcase forCPUs, GPUs and ARM devices. D Median Filter

GTX 570* (751.01 Ws)GT 610* (1795.68 Ws)GTX 1050 Ti MQ* (250.34 Ws)GTX 1060 (170.56 Ws)GTX 1060 MQ (96.02 Ws)GTX 1070 Ti (140.61 Ws)GTX 1080 Ti (127.86 Ws)RTX 2070 (102.02 Ws)RTX 2080 Ti (81.75 Ws) GTX Titan (434.75 Ws)RTX Titan (92.39 Ws)GRID K520 (295.5 Ws)Tesla C2075 (464.26 Ws)Tesla M60 (177.26 Ws)Tesla K80 (252.28 Ws)Tesla T4 (52.68 Ws)Tesla P100 (116.94 Ws)Tesla V100 (93.17 Ws)Quadro RTX 4000 (76.89 Ws) Radeon R9 280X* (660.58 Ws)FirePro S9150 (274.75 Ws)Radeon VII (313.26 Ws)Ryzen 5 1600X (383.1 Ws)Ryzen 7 1700X (272.67 Ws)Ryzen 5 2600X (358.78 Ws)Ryzen 5 3600 (156.38 Ws) Xeon CPU E5-2640 v0 (x2) (1220 Ws)Xeon CPU E5-2620 v3 (x2) (380.03 Ws)Xeon CPU E5-2666 v3 (x2) (550.99 Ws)Xeon CPU E7-8880 v3 (x2) (456.13 Ws)Xeon Gold 6230 CPU (x2) (179.18 Ws) Celeron N2840* (1051.8 Ws)Core i5-2400* (891.35 Ws)Core i7-6500U (283.64 Ws)Core i7-8700K (454.54 Ws)Core i7-8750H (384.62 Ws)Core i7-6500U HD520 GPU (339.94 Ws)Core i7-8700K UHD630 GPU (626.75 Ws)Core i7-8750H UHD630 GPU (407.16 Ws)ARM Mali T-628 MP6-4C* (437.4 Ws)ARM Mali G52* (85.5 Ws) Samsung Exynos 5422* (723.87 Ws)Amlogic S905* (412.2 Ws)Amlogic S922X* (410.4 Ws)Broadcom BCM2837* (470.7 Ws)