Run-Time Power Modelling in Embedded GPUs with Dynamic Voltage and Frequency Scaling

Abstract

This paper investigates the application to embedded GPUs of a robust CPU-based power modelling methodology that performs an automatic search of explanatory events derived from performance counters. A 64-bit Tegra TX1 SoC is configured with DVFS enabled and multiple CUDA benchmarks are used to train and test models optimized for each frequency and voltage point. These optimized models are then compared with a simpler unified model that uses a single set of model coefficients for all frequency and voltage points of interest. To obtain this unified model, a number of experiments are conducted to extract information on idle, clock and static power and to derive power usage from a single reference equation. The results show that the unified model offers competitive accuracy, with an average 5% error using four explanatory variables on the test data set, and that it correctly predicts the impact of voltage, frequency and temperature on power consumption. This model could be used to replace direct power measurements when these are not available due to hardware limitations, or for worst-case analysis in emulation platforms.


Jose Nunez-Yanez, Kris Nikov, Kerstin Eder, Mohammad Hosseinabady
{j.l.nunez-yanez,kris.nikov,kerstin.eder}@bristol.ac.uk, [email protected] of Electrical and Electronic Engineering, Bristol


CCS CONCEPTS

• Hardware → Power estimation and optimization.

KEYWORDS

Heterogeneous architecture, GPU power modelling, DVFS, multiple linear regression, embedded GPU

ACM Reference Format:

Jose Nunez-Yanez, Kris Nikov, Kerstin Eder, Mohammad Hosseinabady. 2020. Run-Time Power Modelling in Embedded GPUs with Dynamic Voltage and Frequency Scaling. In

ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3381427.3381429

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

PARMA-DITAM’20, January 21, 2020, Bologna, Italy. © 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-7545-0/20/01...$15.00. https://doi.org/10.1145/3381427.3381429

Embedded GPUs (Graphics Processing Units), which are physically present in the same chip as the central processing unit (CPU), are popular as general-purpose accelerators in power-constrained applications such as unmanned aerial vehicles (UAVs), self-driving cars and robotics. The power profiles of these GPUs are in the order of Watts, compared with the hundreds of Watts needed by their desktop counterparts, which invariably use the PCIe bus to communicate with the host CPU and memory. Power optimization in these embedded devices is of critical importance since the power sources tend to be batteries and the systems must operate untethered for as long as possible. In this paper we investigate the application of the power modelling framework created in [6] for heterogeneous embedded CPUs to embedded GPUs capable of general-purpose computing thanks to their support for languages such as CUDA and OpenCL. Our framework, called ROSE (RObust Statistical search of explanatory activity Events), can be used to automatically collect activity and power data and then perform a complete search for optimal events across a large range of frequency and voltage pairs as defined in the device DVFS (Dynamic Voltage and Frequency Scaling) tables.
ROSE multiple linear regression optimization uses OLS (Ordinary Least Squares), which is well understood for both desktop and integrated CPUs but has been less studied in GPUs, which are characterized by a proprietary black-box architecture with restricted access to internal microarchitecture details. Taking these points into account, the novelty of this work can be summarized as follows:

1. We perform power modelling on an embedded GPU device with integrated power measurement and voltage/frequency scaling, whereas previous work has largely focused on desktop GPUs.
2. We limit the number of explanatory variables used in the model to enable the run-time collection of this information using a limited number of hardware registers.
3. We propose a novel unified model that includes temperature, frequency and voltage as global states and compare it against models with coefficients optimized for single temperatures and frequency/voltage pairs.

This paper is organized as follows. Section 2 introduces related work in the area of power modelling with GPUs. Section 3 presents the methodology based on our previous work in this area, the set of CUDA benchmarks for model training/verification and the techniques used to obtain run-time measures of power and event information. Section 4 develops models based on this methodology with coefficients optimized for individual voltage and frequency pairs. Section 5 proposes a unified model so that the power of a single per-frequency model can be scaled to an extended range of voltage and frequency points. Section 6 investigates temperature effects on power consumption and model accuracy. Finally, Section 7 concludes the paper.

As previously indicated, there is a significant amount of work on power modelling of CPU cores and CPU-based systems and the interested reader is referred to [7] for a review of results and techniques. In the field of general-purpose GPU-based computing, the amount of power modelling research is more limited. The authors of [4] investigate how performance counters can be used to model power on a desktop NVIDIA GPU connected to a host computer via a PCIe interface. The PCIe interface is instrumented with current clamp sensors and the host computer samples these sensors while collecting performance counter information. The authors identify a total of 13 CUDA performance counters but, since only four hardware registers are available, multiple runs are needed to access all of them. The authors also identify that certain kernels that perform texture reads, such as Rodinia Leukocyte, show significant power errors of up to 50% due to the lack of relevant counter information. The impact of DVFS on power modelling is not considered. Also targeting desktop GPUs, the authors of [8] introduce a support vector regression (SVR) model instead of the least squares linear regression more commonly used. A total of five variables, such as vertex shader busy and texture busy, are used to build the model. Instead of predicting the power of full kernels as done in [4], they predict the power of the different execution phases, as is also done in our work. The authors show a slight advantage of SVR in accuracy, although the GPU power of some execution phases cannot be modelled correctly. The performance and power characterization done in [1] considers different desktop NVIDIA GPU families (i.e. Tesla, Fermi and Kepler). The external power measurements apply to the entire system, which includes the GPU and CPU, and not to the individual components. The proposed power model uses performance counters and linear regression and introduces a frequency scaling parameter in the power equation to account for the different performance levels possible in the GPU. It does not consider the operating voltage and, with multiple voltage levels possible for a single frequency, this could explain the errors in prediction accuracy, which are measured at around 20 to 30%. The work of [3] also focuses on desktop GPUs, with a review showing that the number of explanatory variables used varies between 8 and 23. It considers the use of neural networks to perform the prediction, indicating how neural networks can address the nonlinear dependencies of the input variables at the expense of significantly higher complexity. However, this could make the models harder to deploy as part of an energy-aware operating system. In this paper, we focus on using a low number of explanatory variables to make the models easy to deploy at run-time and investigate the accuracy of multiple linear regression for power modelling considering voltage, frequency and temperature in embedded GPUs.

The methodology is based on our previous work targeting ARM big.LITTLE SoCs and introduced in [6]. The CUDA benchmarks used for model creation and validation are shown in Table 1. Training and testing benchmarks are independent and have been obtained from the Rodinia and CUDA SDK benchmark sets.

Table 1: Deployed benchmarks

CUDA Rodinia (Train Set)
stream_cluster      srad_v1
particle_filter     srad_v2
mmumergpu           pathfinder
leukocyte           myocite
lavaMD              kmeans
backprop            bfs
b+tree              cfd
heartwall           hotspot3d
hotspot             hybridsort

CUDA SDK (Test Set)
binomialOptions     Montecarlo
blackscholes        particles
SobolQRNG           Radixsort
Transpose           FDTD3d
Texture3D           nbody

We have modified the collection and processing stages to account for the differences in counter availability, power and current sensors and DVFS implementations. In the CPU-based power model done in [6] the DVFS table contains fixed pairs of voltage and frequency and the preferred way to build the power model is to use a per-frequency model in which a distinct set of coefficient values is calculated for each pair. Using this approach directly on the TX1 is problematic due to temperature dependencies and the practical difficulties of adjusting the temperature of the device to each possible value during a data collection run that typically executes over several days. To manage this complexity, in this work we distinguish between local events, such as the number of instructions executed or the number of memory accesses, that affect power in certain regions of the device, and global states, such as the operating frequency, voltage and temperature, that affect power globally. This approach enables us to propose a unified model with a single set of coefficients that can be used for multiple combinations of voltage, frequency and temperature. The development of this unified model and its comparison with the per-frequency models is conducted in Sections 5 and 6. The performance counters considered in this work are shown in Figure 1. The number of physical registers available in the GPU device to collect activity information in parallel is limited and for this reason limiting the number of model counters is preferred.
The methodology presented in [6] implements different types of automatic searches and analysis of the effects of different counters on the power model accuracy. In the methodology flow, the octave_makemodel script receives with -r a measurement.txt text file containing the power and activity counter samples (around 12,000 samples in our case) and with -b a benchmark.txt file that identifies which benchmarks should be used for training and which for testing. The switch -f lists all the frequency values to be considered (each frequency value also corresponds to a different voltage as determined by the DVFS table), -p identifies the column number in measurement.txt that contains power information, -m set to 1 selects the bottom-up search heuristic, -l lists the performance counters selected for analysis as columns in measurement.txt, and -n set to 4 instructs the framework to search for the four performance counters that result in the most accurate model, as indicated with -c 1. This means that the script will search for up to a maximum of 4 performance counters in the list provided, across all the frequencies and voltages automatically. The result is a set of coefficients for each frequency/voltage pair. To minimize temperature interference, the experiments are initially conducted with the available TX1 fan set to maximum speed. If, for example, the user is interested in obtaining models across all possible frequencies for a particular set of pre-selected events, the switch -e can be used to specify the four columns in power_measurement.txt with the events that need to be analyzed.

inst_executed_cs: Instructions executed by compute shaders (CS), not including replays.
sm_inst_executed_texture: Texture instructions executed.
sm_executed_ipc: The average instructions executed per active cycle per SM.
sm_issued_ipc: The average instructions issued per active cycle per SM.
threads_launched: Total threads launched. Increments by 1 per thread launched.
sm_active_cycles: Sum of cycles that SM was active. Increments by 0-NumSMs per cycle.
sm_active_warps: Sum of warps that SM was active. Increments by 0-64 per cycle per SM.
sm_warps_launched: Warps launched. Increments by 1 per warp launched.
gpu_busy: Cycles the graphics engine or the compute engine is busy.
l2_write_bytes: Number of bytes written to L2 cache.
l2_read_bytes: Number of bytes read from L2 cache.
sm_inst_executed_global_loads: The number of executed global loads.
sm_inst_executed_global_stores: The number of executed global stores.

Figure 1: Analyzed GPU performance counters
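The bottom-up search described above can be pictured as a greedy forward selection over counter columns. The following Python sketch illustrates that idea with NumPy; it is not the octave_makemodel implementation. The function names, the use of mean absolute percentage error on a held-out set as the selection criterion, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y ~ c0 + X @ c by ordinary least squares and return [c0, c...]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mean_abs_pct_error(X, y, coef):
    """Mean absolute percentage error of the linear model on (X, y)."""
    pred = coef[0] + X @ coef[1:]
    return float(np.mean(np.abs(pred - y) / y) * 100.0)

def bottom_up_search(X_train, y_train, X_test, y_test, max_events=4):
    """Greedy forward selection: repeatedly add the counter column that
    gives the lowest test error, stopping at max_events columns or when
    no remaining column improves the error."""
    chosen, best_err = [], float("inf")
    remaining = list(range(X_train.shape[1]))
    while remaining and len(chosen) < max_events:
        trials = []
        for j in remaining:
            cols = chosen + [j]
            coef = fit_ols(X_train[:, cols], y_train)
            trials.append((mean_abs_pct_error(X_test[:, cols], y_test, coef), j))
        err, j = min(trials)
        if err >= best_err:
            break  # no candidate improves the model
        chosen.append(j)
        remaining.remove(j)
        best_err = err
    return chosen, best_err

# Synthetic data: six hypothetical activity rates, power driven by columns 1 and 4.
rng = np.random.default_rng(0)
rates = rng.uniform(0.0, 1.0, size=(400, 6))
power = 0.4 + 2.0 * rates[:, 1] + 1.0 * rates[:, 4]
chosen, err = bottom_up_search(rates[:300], power[:300], rates[300:], power[300:])
```

On this noise-free synthetic data the two columns that actually drive power are picked first, which is the behaviour one would hope for from a search of this kind on real counter samples.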

Equation 1 shows the general form of the power model proposed in this work. Compared with the previous work done in [6], we normalize the total event count by the total number of cycles available in the time slot to obtain an activity density measurement that should remain constant as frequency changes. For example, if the frequency doubles then the number of events (e.g. instructions executed) in the same time period should also double, but since the number of clock cycles also doubles the ratio should remain constant.

$$P_{GPU_{freq}} = \alpha_0 + \alpha_1 \times \frac{events_1}{cycles} + \ldots + \alpha_n \times \frac{events_n}{cycles} \tag{1}$$

We limit all experiments to a maximum of four counters to account for the limited number of registers available in commercial GPUs. Figure 2 shows four sets of performance counters that the methodology selects as the most accurate, identified as models A, B, C and D. The coefficients shown are for a single example frequency of 76 MHz with a corresponding voltage of 0.82 V; a similar set of coefficients exists for each of the other 12 possible frequency and voltage pairs at a constant temperature.

Entries are listed as counter / coefficient value at 76 MHz.

Model A: inst_executed_cs / 0.0005, inst_executed_global_stores / 0.0029, gpu_busy / 6.45E-05, sm_active_cycles / 0.0003
Model B: inst_executed_cs / 0.0005, sm_inst_executed_texture / 0.0019, sm_active_warps / 2.0038E-06, sm_inst_executed_global_loads / -0.00020
Model C: inst_executed_cs / 0.0009, sm_inst_executed_texture / 0.0047, sm_active_cycles / 0.00066, gpu_busy / -0.00021
Model D: inst_executed_cs / 0.0011, sm_active_warps / 4.6137E-06, sm_inst_global_stores / 0.030, gpu_busy / -3.3929E-05, constant at 76 MHz 0.4324

Figure 2: Model parameters
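As a small illustration of the normalization argument behind Equation 1, the Python sketch below evaluates a model of that form and checks that doubling both the event counts and the cycle count leaves the result unchanged. The coefficients, constant and counter samples are hypothetical placeholders, not values from Figure 2; the only number taken from the paper is the 0.5 s sampling period used to size the cycle count.

```python
def per_frequency_power(constant, coefficients, event_counts, cycles):
    """Equation 1: a constant plus coefficient-weighted events per cycle."""
    return constant + sum(a * (e / cycles) for a, e in zip(coefficients, event_counts))

# Hypothetical coefficients and counter samples, for illustration only.
coeffs = [0.5, 2.0, -0.1, 0.01]
events_76mhz = [1_000_000, 50_000, 3_000_000, 20_000_000]
cycles_76mhz = 38_000_000  # 76 MHz over a 0.5 s sample

p_low = per_frequency_power(0.4, coeffs, events_76mhz, cycles_76mhz)

# Doubling the frequency doubles both the events and the cycles in the slot,
# so the activity densities fed to the model are unchanged.
events_double = [2 * e for e in events_76mhz]
p_double = per_frequency_power(0.4, coeffs, events_double, 2 * cycles_76mhz)
```

This only demonstrates that the normalized inputs are frequency-invariant; in the per-frequency approach each frequency/voltage pair still has its own coefficients.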

Figure 3 compares the accuracy of these four models across all the frequency and voltage pairs. The performance of models A, C and D is similar, with an overall error below 5%. Model D offers a slightly better overall accuracy, as shown in the overall value, and will be taken forward to derive a unified power model in the next section. The accuracy also varies across frequencies, and this is largely defined by the model parameters.

Figure 3: Model comparison (error % at each frequency (MHz)/voltage (V) point for models A, B, C and D)

The previous per-frequency models contain a total of 13 times 5 parameters, with four event coefficients and a constant parameter for each of the 13 voltage/frequency points. They are obtained at fixed voltage and frequency pairs and do not take into account the multiple voltage levels available for each frequency in the TX1 DVFS table. In this section, we propose a new type of model that unifies the previous models with a single set of coefficients and includes independent variables for frequency and voltage. Equation 2 shows the general form of this unified power model, with two terms corresponding to dynamic and static power. The approach consists of scaling the power predicted by a power model at a single frequency to fit the rest of the frequencies and voltages.

$$P_{GPU_{freq\_x}} = \left(P_{GPU_{freq}} - P_{GPU_{sta}}\right) \times \frac{freq_x}{freq} \times \left(\frac{volt_x}{volt}\right)^2 + P_{GPU_{sta}} \times \left(\frac{volt_x}{volt}\right)^2 \tag{2}$$

Here freq and volt are the reference operating point of the per-frequency model P_GPU_freq, freq_x and volt_x are the target operating point, and P_GPU_sta is the static power at the reference voltage.

Scaling is possible because the model uses normalized activity rates that should remain constant at different frequencies, since both events and cycles should reduce proportionally. The scaling is based on how voltage and frequency affect the dynamic and static power of a chip. Dynamic power is proportional to the square of the voltage and to the frequency. In our experiments, we observe that accuracy also improves when static power is scaled with the square of the voltage. Static power, or leakage, is the power of the device when the frequency is zero, so the frequency term should not be used to scale it. To isolate the static power in the second term of the equation and scale it correctly, we need to measure it first. It is important to note that the per-frequency model contains a constant component that represents the device power with no activity; this power can be defined as idle power, as shown in Equation 4. This idle power is formed mainly by the static power and the clock power, since the clocks remain active when there is no active load.
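The scaling idea of Equation 2 can be sketched in a few lines of Python. The version below assumes that the dynamic part scales with the target-to-reference frequency ratio and the square of the voltage ratio, and that the static part scales with the square of the voltage ratio only, as the text describes. The function name and the 1.0 W reference prediction are hypothetical; 380 MHz, 0.82 V, 0.21 W, 998 MHz and 1.07 V are values quoted elsewhere in this section.

```python
def scale_power(p_ref, p_static_ref, freq_ref, volt_ref, freq_x, volt_x):
    """Scale a reference-point power prediction to (freq_x, volt_x)."""
    v2 = (volt_x / volt_ref) ** 2
    dynamic = (p_ref - p_static_ref) * (freq_x / freq_ref) * v2  # scales with f and V^2
    static = p_static_ref * v2                                   # scales with V^2 only
    return dynamic + static

# Hypothetical reference prediction of 1.0 W at 380 MHz / 0.82 V, static power 0.21 W.
same_point = scale_power(1.0, 0.21, 380.0, 0.82, 380.0, 0.82)
high_point = scale_power(1.0, 0.21, 380.0, 0.82, 998.0, 1.07)
```

Scaling to the reference point itself returns the reference prediction, which is a quick sanity check for any implementation of this kind.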
A direct way to measure static power would be to clock gate the GPU device; however, the Linux for Tegra (L4T) JetPack 4.2.1 release for the TX1 SoC used in this work does not implement this feature and only allows frequency configurations that are part of the DVFS table. To extract the static power, we use an indirect method as follows. We sweep all the points available in the DVFS table with no benchmarks running to obtain the idle power. The first few frequency points in the DVFS table do not affect the supply voltage of 0.82 V, and this results in a linear relation between power and frequency, as shown in Figure 4 for these points at a reference temperature of 23 °C at full fan speed. We take the point at which this line intersects the Y axis as the frequency of zero, and the corresponding value, rounded to 0.21 W, as the static power present in the device at that voltage level and temperature. With this information, we can create a unified model based on Equation 2 using as reference point any frequency that has the common voltage of 0.82 V and should therefore have constant static power.

Figure 4: Idle power estimation (power in Watts against frequency in MHz; fitted line y = 0.1112x + 0.2079)

We consider two possible reference points at the minimum frequency of 76 MHz and the middle frequency of 380 MHz, which both share the voltage of 0.82 V. We call these models UAL and UAM for "unified anchor low" and "unified anchor middle", respectively. Another alternative is to use a reference point at a high frequency if we can estimate the static power at that level. Since we know that the dynamic power follows Equation 3, we can solve Equation 4 and, with the available values for P_idle, P_static, V and f, obtain the α × C term that we treat as a constant. Our hypothesis is that the activity rate α × C should remain constant within a small sample interval across different frequencies because we are measuring events divided by cycles in the sample interval. We can now extract the static power for the high frequency point of 998 MHz and 1.07 V by obtaining P_dynamic_clock and subtracting it from P_idle. We call this model UAH for "unified anchor high".

Figure 5: Unified power model comparison (error % at each frequency (MHz)/voltage (V) point for model D PF, UAM, UAL and UAH)
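The indirect static-power extraction can be illustrated with a short Python sketch. It fits a line to idle power against frequency at a fixed voltage, takes the intercept as static power, derives α × C from Equations 3 and 4 (with dynamic clock power taken as α × C × V² × f), and then reuses that constant at a high-voltage point. All sample values, including the idle power at the high point, are invented for illustration; only the 0.82 V, 998 MHz and 1.07 V operating points come from the text.

```python
import numpy as np

# Hypothetical idle-power samples (W) at DVFS points sharing one supply voltage.
freq_mhz = np.array([76.0, 152.0, 228.0, 304.0, 380.0])
idle_w = 0.21 + 0.0003 * freq_mhz

# A linear fit of idle power against frequency; the intercept is the power
# extrapolated to a frequency of zero, which is taken as the static power.
slope, intercept = np.polyfit(freq_mhz, idle_w, 1)
p_static = float(intercept)

# From P_idle = alpha*C*V^2*f + P_static, solve for the alpha*C constant.
v_low = 0.82
alpha_c = float((idle_w[-1] - p_static) / (v_low**2 * freq_mhz[-1]))

# Reuse alpha*C at a high anchor to split a (made-up) idle reading there.
f_high, v_high, idle_high = 998.0, 1.07, 1.0
p_clock_high = alpha_c * v_high**2 * f_high
p_static_high = idle_high - p_clock_high
```

With real measurements the fit would not be exact, so the quality of the linear relation at the shared-voltage points is worth checking before trusting the intercept.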

Figure 5 shows the accuracy of the UAL, UAM and UAH unified models and how they compare with the PF (per-frequency) model D from which they are derived. The anchor low and anchor high models obtain the best accuracy at their respective reference points but suffer a significant degradation as the frequency/voltage moves further away from the reference point. On the other hand, the anchor middle model offers a largely identical accuracy to model D, with an overall percentage error of around 5%. This result shows that the unified model can be competitive in terms of accuracy compared with the per-frequency models developed in the previous section. This unified power model with 380 MHz and 0.82 V as the reference frequency and voltage is shown in Equation 5, where P_GPUfreq_ref is obtained by Equation 6. Equation 6 contains a negative coefficient which, in principle, is not an intuitive result, but it can be explained by the fact that the different explanatory variables are correlated (e.g. GPU busy increases as the number of instructions executed increases) and the multiple linear regression process finds this negative value as one that improves the model fit to the training data.

$$P_{dynamic\_clock} = \alpha \times C \times V^2 \times f \tag{3}$$

$$P_{idle} = P_{dynamic\_clock} + P_{static} \tag{4}$$

$$P_{GPU_{freq\_x}} = \left(P_{GPU_{freq\_ref}} - 0.21\,W\right) \times \frac{freq_x}{380\,MHz} \times \left(\frac{volt_x}{0.82\,V}\right)^2 + 0.21\,W \times \left(\frac{volt_x}{0.82\,V}\right)^2 \tag{5}$$

$$P_{GPU_{freq\_ref}} = \alpha_0 + \alpha_1 \times \frac{inst\_executed\_cs}{cycles} + \alpha_2 \times \frac{executed\_global\_stores}{cycles} - \alpha_3 \times \frac{gpu\_busy}{cycles} + \alpha_4 \times \frac{active\_warps}{cycles} \tag{6}$$

In Equation 6, α_0 to α_4 (in Watts) stand for the model D constant and coefficient magnitudes fitted at the reference point.

Finally, Figure 6 compares the power predictions of the per-frequency model D and the derived unified model with the values measured at run-time for a full sweep of the test benchmarks at different voltages and frequencies. We observe that the power consumption ranges from below 1 Watt to over 13 Watts depending on the operating point and benchmark. The power predictions follow the different execution phases, although the errors are more noticeable at the highest points of power consumption. The measured power also tends to show low spikes between benchmarks that the model does not predict. This effect could be due to our sampling frequency, which is limited to one sample every 0.5 seconds. Further research is needed to increase this sample rate, taking into account that the thread that samples the power sensors is also executed by the CPU cores, so higher sampling rates could mean that the processors are not available to launch the CUDA benchmarks, which could introduce artifacts.

Figure 6: Run-time prediction (power in Watts per time-step sample for the measured values, the unified model and the per-frequency model; panels labelled FREQ = 998, VOLT = 1.07; FREQ = 76, VOLT = 0.82; FREQ = 768, VOLT = 0.94; FREQ = 384, VOLT = 0.82)

The power equation presented in Equation 5 accounts for possible changes in voltage and frequency due to temperature changes as defined in the DVFS table, but it does not consider the changes in power due to temperature itself. Temperature has a direct effect on the static power consumption of the device, as shown in [2]. The static power depends linearly on leakage current and supply voltage, while the leakage current itself depends exponentially on the supply voltage and temperature. In this analysis we approximate these exponential relations linearly for the range in which device temperatures occur [2]. To understand the dependency between static power and temperature, we run a number of experiments with no load on the GPU, varying the fan rate and frequency at a constant voltage of 0.82 V to generate different temperature profiles. For each run we obtain a linear relation between frequency and power that enables us to estimate the static power by setting the frequency to zero. We use the same approach to obtain a linear relation between power and temperature that allows us to estimate the temperature at frequency zero for each of the runs.

Figure 7: Temperature and static power (power in Watts against temperature in Celsius, 20 to 34 °C; fitted line y = 0.0051x + 0.0849)
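As a rough worked example of what the fitted line in Figure 7 implies, the static power at 0.82 V can be evaluated at two temperatures within the plotted range. The choice of 23 °C and 34 °C is only illustrative (23 °C is the reference temperature mentioned for Figure 4, 34 °C is the upper end of the Figure 7 axis), and the line is assumed to be y = 0.0051x + 0.0849 with x in degrees Celsius and y in Watts.

```latex
\begin{align*}
P_{static}(23\,^{\circ}\mathrm{C}) &= 0.0051 \times 23 + 0.0849 = 0.2022\ \mathrm{W} \\
P_{static}(34\,^{\circ}\mathrm{C}) &= 0.0051 \times 34 + 0.0849 = 0.2583\ \mathrm{W} \\
\frac{P_{static}(34\,^{\circ}\mathrm{C})}{P_{static}(23\,^{\circ}\mathrm{C})} &= \frac{0.2583}{0.2022} \approx 1.28
\end{align*}
```

Under these assumptions the static component alone grows by a little over a quarter across an 11 °C rise, and the 23 °C value of about 0.20 W is close to the 0.21 W estimated from the idle-power sweep.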

We can now plot the points of temperature and power as shown in Figure 7 and obtain a linear relation between temperature and static power at 0.82 V. Using this information we can now replace the P_static in Equation 5 to obtain Equation 7.

$$P_{GPU_{freq\_x}} = \left(P_{GPU_{freq\_ref}} - (T_{ref} \times 0.0051 + 0.0849)\,W\right) \times \frac{freq_x}{380\,MHz} \times \left(\frac{volt_x}{0.82\,V}\right)^2 + (T \times 0.0051 + 0.0849)\,W \times \left(\frac{volt_x}{0.82\,V}\right)^2 \tag{7}$$

Here T is the current device temperature and T_ref the temperature of the reference measurements, both in degrees Celsius, following the fitted line of Figure 7.

Two temperature and power profiles generated by varying the fan activity are used to test the temperature-aware model. Overall the results show that under the same workload, voltage and frequency, temperature results in a power variation higher than 20%. This result justifies the importance of capturing temperature in a power model as done in this work. We evaluate the temperature-aware power equation in Figures 8 and 9 against the other models. Figure 8 shows that under the same temperature conditions considered in the previous section, the model operates with a similar level of accuracy. Figure 9 shows that when the device heats up, the accuracy of the original models degrades significantly while the temperature-aware model largely maintains the same level of accuracy.

Figure 8: Model comparison at low temperatures (error % at each frequency (MHz)/voltage (V) point for model D PF, UAM, UAL, UAH and UAMT)

Figure 9: Model comparison at high temperatures (error % at each frequency (MHz)/voltage (V) point for model D PF, UAM and UAMT)
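A temperature-aware evaluation of this kind can be sketched in Python as follows. The sketch assumes that Equation 7 is Equation 5 with the fixed 0.21 W static term replaced by the Figure 7 line, evaluated at the reference temperature inside the dynamic term and at the current temperature in the static term. The function names, the 1.0 W reference prediction and the two temperatures are hypothetical.

```python
def static_power_w(temp_c):
    """Static power (W) at 0.82 V, using the fitted line of Figure 7."""
    return 0.0051 * temp_c + 0.0849

def temperature_aware_power(p_ref, temp_ref_c, temp_c, freq_x_mhz, volt_x,
                            freq_ref_mhz=380.0, volt_ref=0.82):
    """Unified model with a temperature-dependent static term."""
    v2 = (volt_x / volt_ref) ** 2
    # Dynamic part: remove the static power present when the reference was taken.
    dynamic = (p_ref - static_power_w(temp_ref_c)) * (freq_x_mhz / freq_ref_mhz) * v2
    # Static part: evaluated at the current temperature, scaled by V^2 only.
    return dynamic + static_power_w(temp_c) * v2

# Hypothetical 1.0 W reference prediction taken at 23 C, evaluated at 23 C and 34 C.
p_cool = temperature_aware_power(1.0, 23.0, 23.0, 380.0, 0.82)
p_warm = temperature_aware_power(1.0, 23.0, 34.0, 380.0, 0.82)
```

At the reference operating point the only difference between the two results is the change in static power along the fitted line, which makes the temperature contribution easy to inspect in isolation.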

The proposed per-frequency and unified power models are kept simple by using only four performance counters collected in parallel. We also extend the unified model with temperature-aware capabilities to improve accuracy when the device can work in multiple temperature regimes, such as in a fanless configuration. We observe that the temperature impact on static power can increase power by over 20%, which is the rationale for making it a part of the proposed model. Overall, the research shows that the CPU power methodology can be applied successfully to GPU devices even though the performance counters are very different in nature, and that the prediction error can be maintained at around 5% using a combination of local events represented by the performance counters and global states represented by voltage, frequency and temperature variables. The simplicity of these models means that they could be deployed as part of an energy-aware operating system and scheduling framework. The unified model could be particularly useful since it can capture multiple voltage levels for one frequency level with a single set of coefficients. Our future work involves further validation of the methodology and its application with additional benchmarks, improving the data collection approach to increase the granularity of the samples and better capture the different phases of benchmark execution, and experimenting with inter-prediction strategies across different GPU devices and technologies. The power modelling methodology used in this paper is available open-source at the following GitHub repository [5].

ACKNOWLEDGMENTS

This work was partially supported by the EPSRC ENEAC grant number EP/N002539/1, the H2020 TeamPlay project Grant agreement No.: 779882 and the Royal Society industrial fellowship MINET (Award: INF\R2\192044).

REFERENCES

[1] Abe, Y., Sasaki, H., Kato, S., Inoue, K., Edahiro, M., Peres, M.: Power and performance characterization and modeling of gpu-accelerated systems. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 113–122 (2014). DOI 10.1109/IPDPS.2014.23
[2] Goel, B., McKee, S.A., Själander, M.: Chapter two - techniques to measure, model, and manage power. pp. 7–54. Elsevier (2012). DOI https://doi.org/10.1016/B978-0-12-396528-8.00002-X
[3] Mei, X., Wang, Q., Chu, X.: A survey and measurement study of gpu dvfs on energy conservation. ArXiv abs/1610.01784 (2016)
[4] Nagasaka, H., Maruyama, N., Nukada, A., Endo, T., Matsuoka, S.: Statistical power modeling of gpu kernels using performance counters. pp. 115–122 (2010). DOI 10.1109/GREENCOMP.2010.5598315
[5] Nikov, K.: Buildmodel. https://github.com/kranik/BUILDMODEL/tree/master/ARMPM_buildmodel (2017). [Online; accessed 01-Aug-2017]
[6] Nikov, K., Nunez-Yanez, J.L.: Inter and intra-core power modelling for single-isa heterogeneous processors (2020). DOI 10.1504/IJES.2020.10021023
[7] Nunez-Yanez, J., Lore, G.: Enabling accurate modeling of power and energy consumption in an arm-based system-on-chip. Microprocessors and Microsystems 37