Adaptive Performance Optimization under Power Constraint in Multi-thread Applications with Diverse Scalability

Stefano Conoci, Pierangelo Di Sanzo, Bruno Ciciani
DIAG - Sapienza University of Rome
Email: {[email protected], [email protected], [email protected]}

Francesco Quaglia
DICII - University of Rome Tor Vergata
Email: [email protected]
Abstract—In modern data centers, energy usage represents one of the major factors affecting operational costs. Power capping is a technique that limits the power consumption of individual systems, which allows reducing the overall power demand at both cluster and data center levels. However, literature power capping approaches do not fit well the nature of important applications based on first-class multi-thread technology. For these applications, performance may not grow linearly as a function of the thread-level parallelism because of the need for thread synchronization while accessing shared resources, such as shared data. In this paper we consider the problem of maximizing the application performance under a power cap by dynamically tuning the thread-level parallelism and the power state of the CPU-cores. Based on experimental observations, we design an adaptive technique that selects in linear time the optimal combination of thread-level parallelism and CPU-core power state for the specific workload profile of the multi-threaded application. We evaluate our proposal by relying on different benchmarks, configured to use different thread synchronization methods, and compare its effectiveness to different state-of-the-art techniques.
I. INTRODUCTION
Multi-core architectures are nowadays dominating the market. Also, thanks to the support they offer for sharing memory among CPU-cores, they have become the mainstream reference hardware for applications based on first-class multi-thread technology. On the downside, powering many-core machines implies high energy delivery to each single multi-core server. Therefore, over the last years, energy and power consumption have emerged as a core concern to cope with, especially in (large) data centers. Such concern led manufacturers to introduce hardware mechanisms oriented to improve energy efficiency in operational contexts. These include Dynamic Voltage and Frequency Scaling (DVFS), which allows lowering the voltage and the frequency (hence the power consumption) of a processor/core in a controlled manner, and Clock Gating, which disables some processor/core circuitry during idle periods. Contextually, today's Operating Systems offer power management tools, like the Linux CPUFreq governor [1], which expose to the user code interfaces to dynamically change the power state of cores via DVFS, thus allowing to tune the performance of cores and their power demand according to the needs of specific applications/workloads.

In this context, one interesting challenge is that of controlling the power demand of an application in order to keep it below a given threshold, also known as the power cap. However, an even more interesting challenge is that of ensuring that an application runs at maximum performance under a given power cap. Such an achievement, in addition to performance benefits, would also improve the application efficiency in terms of energy per task. Various power capping techniques for multi-core servers have been proposed. As for the specific case of multi-thread applications, the problem of regulating the number of threads and the core frequency to control the balance between performance and power consumption has been originally considered in [2], and subsequently in [3].
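As a concrete illustration of such tools, the Linux CPUFreq subsystem exposes per-core DVFS knobs as plain sysfs files. The sketch below (the helper functions are ours, and writing these files requires root privileges and a driver that honors them) shows how they can be read and written:

```python
from pathlib import Path

# Standard location of the cpufreq sysfs tree on Linux.
CPUFREQ_BASE = Path("/sys/devices/system/cpu")

def read_cpufreq(cpu: int, field: str, base: Path = CPUFREQ_BASE) -> str:
    """Read a cpufreq attribute, e.g. 'scaling_governor' or 'scaling_cur_freq'."""
    return (base / f"cpu{cpu}" / "cpufreq" / field).read_text().strip()

def set_max_freq_khz(cpu: int, khz: int, base: Path = CPUFREQ_BASE) -> None:
    """Cap the frequency of one core by writing scaling_max_freq (in kHz)."""
    (base / f"cpu{cpu}" / "cpufreq" / "scaling_max_freq").write_text(str(khz))
```

With the `userspace` governor, `scaling_setspeed` can be written in the same way to pin a core to a fixed frequency, which is the typical way user-level tuning logic actuates a P-state change.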
The main drawback of these approaches is that the tuning strategies they rely on do not account for complex and dynamic effects on performance that may be caused by thread contention on hardware resources and/or shared data. In more detail, when multiple threads concurrently run on different CPU-cores, the presence of shared hardware resources, such as memory interconnections and cache levels, leads them to contend for their utilization. This impacts both performance and the power consumption profile of the application. Also, in common multi-thread applications that are not disjoint-access parallel, threads share data whose accesses may require synchronization. This again affects performance and the power consumption profile, also depending on the specific synchronization mechanisms (either speculative or not) that are employed by the application code. An additional factor of complexity in the presence of synchronization is that the speed-up achieved by running the application with different numbers of threads may change depending on the workload profile, which in turn can be dynamic on its own. Also, the speed-up can be non-linear as a function of the number of threads, depending on the workload profile, as well as on the underlying hardware settings. Specifically, performance can even decrease when increasing the level of parallelism. This indicates that synchronization costs, including the energy spent while performing synchronization operations, can show complex profiles to deal with. Overall, to select the right combination of thread parallelism and core power state, which ensures the best performance under a power cap, it looks mandatory to take into account the (possibly) limited scalability of a multi-thread application, just like it manifests at run-time due to actual synchronization dynamics. Further, it is mandatory to react to variations of the workload profile.

To cope with this problem, we present an adaptive technique that uses a novel on-line exploration-based tuning strategy. We devised our technique exploiting empirical observations of the effects on both performance and power consumption associated with the combined variation of thread-level parallelism and core power state. Specifically, by the results of experiments we conducted with different multi-thread benchmarks characterized by a non-negligible incidence of synchronization, e.g. because of thread contention while accessing shared data, we highlight that their scalability is not affected by the variation of the power state of the CPU-cores. Based on this, we defined an optimized tuning strategy where the exploration moves along specific directions that depend on the power cap value and on the intrinsic scalability of the application. Remarkably, we prove that the proposed technique finds in linear time the optimal configuration of concurrent threads and CPU frequency/voltage, i.e. the configuration that provides the highest performance among the configurations with power consumption lower than the power cap. Also, we present a refinement of our technique that exploits continuous fluctuations between configurations, in terms of thread-level parallelism and core power state, to further improve the application performance and reduce the possibility/incidence of power cap violations. We demonstrate the advantages of our proposal via an experimental study based on various application contexts, including various benchmarks that use different thread synchronization methods. This allows us to robustly assess our technique via disparate test cases where contention among threads affects the application scalability in significantly different ways.

The remainder of this article is structured as follows. In Section II we discuss related works. Section III defines our target problem and presents the results of the preliminary analysis.
Section IV illustrates the proposed optimization technique, proves that the selected configuration is optimal and analyzes the time complexity of the exploration procedure. Section V describes the most relevant implementation details and presents the experimental results.

II. RELATED WORK
A work specifically focused on optimizing the energy demand at application level is presented in [2]. The proposed technique, called Pack and Cap, aims at selecting the best configuration, in terms of number of cores to be assigned to an application and the related core frequency, which ensures a given power cap for multi-thread workloads. Based on experimental measurements of the performance and power consumption obtained running benchmarks from the Parsec suite, the authors conclude that the configuration that provides the highest performance at a given level of power consumption always assigns to the application the highest possible number of cores. However, as extensively shown in the following of this article, this selection strategy is not optimal for general multi-threaded applications with less than linear scalability. The work in [3] considers the problem of maximizing performance under a power cap while also taking into account the effects of contention. The solution defines an ordered set of power knobs that are progressively tuned by performing a binary search on the respective domain, selecting the setting that provides the highest performance for the considered power knob while operating within the power cap. In particular, the solution first selects the optimal number of cores that should be assigned to an application while running at the slowest available frequency/voltage, then selects the optimal CPU P-state setting for the previously selected number of assigned cores. Therefore, by tuning the power knobs independently, it does not consider the changing energy/performance trade-offs at different levels of parallelism for the specific workload. As an example, if an application shows a limited speed-up when increasing the number of cores, the solution would still pick the highest value that provides a power consumption within the cap, even if the same power budget could provide higher performance if spent to further increase the frequency of a lower number of cores.

Other works in the literature investigate the problem of improving application performance under power constraints considering different power management variables. FastCap [4] defines an approach for optimizing performance under a system-wide power cap considering both CPU and memory DVFS. It defines a non-linear optimization problem solved through a queuing model that considers the interaction between CPU-cores and memory banks communicating over a shared bus. Unfortunately, memory DVFS has only been proposed recently [5] [6] and is not yet available in commercial systems. Kanduri et al. propose approximation as another knob that can be used in power capping, combined with DVFS and Clock Gating, to define a trade-off between performance and accuracy of the results [7]. However, in order to dynamically switch between different levels of accuracy, it requires multiple implementations of the same application. PPEP [8] is an online prediction framework that, based on hardware performance events and on-chip temperature measurements, estimates the performance, power consumption and energy efficiency for each different CPU P-state. Therefore, it allows the definition of a power capping technique that can meet power targets in a single step without requiring any exploration. However, it does not consider the possibility of altering the number of cores assigned to an application, thus it would provide sub-optimal performance for multi-thread applications showing less than linear scalability.

III. PROBLEM STATEMENT AND PRELIMINARY ANALYSIS
As discussed, in our study we consider the problem of adaptively tuning the system configuration to ensure the highest application performance under a power cap. We consider two tuning parameters: the number of concurrent threads and the power state of the cores. We focus on the general scenario of multi-thread applications executed on a working-thread pool (e.g. multi-threaded web/application servers) whose size can be tuned at run-time. However, we should note that the proposed technique is orthogonal with respect to the chosen thread regulation mechanism. Also, we assume that the power state of cores can be changed, affecting both power consumption and performance. In practice, this is what happens when changing the so-called P-state in modern multi-core processors, which determines a variation of the core voltage and frequency, thus modifying both the power consumption and the instruction processing speed. We adhere to the notation of the ACPI standard, which establishes that P0 denotes the core state with maximum power and performance, and P1, P2, ... progressively identify states with less power and performance. Also, we consider the core idle states (C-states), where C0 denotes the fully operating core state, and C1, C2, ... progressively identify lower power states where the core is idle, i.e. it does not execute instructions. A core can transit from C0 to a deeper C-state when it has no instructions to execute. Hence, when the number of running threads goes below the number of available cores, unused cores can transit to low power states, thus reducing the total power consumption.

To provide the reader with real data demonstrating the effects on power consumption associated with the variation of the P-state and the number of concurrent threads, we show in Figure 1 the results of an experiment where we run the multi-thread Intruder benchmark from the STAMP suite [9] for Transactional Memory systems [10]. Intruder emulates a signature-based network intrusion detection system where network packets are processed in parallel by concurrent threads. We executed different runs while changing the P-state and the number of concurrent threads on top of a machine with two Intel Xeon E5 processors, 20 physical cores in total, 256 GB of ECC DDR4 memory, and core clock frequency ranging from 1.2 GHz (whose P-state is denoted as P-11) to 2.2 GHz (denoted as P-1), plus TurboBoost from 2.2 GHz to 3.1 GHz (denoted as P-0). Since we focus on the effects of the joint variation of core power state and thread parallelism, we consider power consumption data related to the CPU and memory subsystems, which we collected via the Intel RAPL interface [11]. The plot shows the power consumption as a function of the couple (p, t), where p is the P-state and t is the number of concurrent threads. The results clearly outline that the power consumption grows while decrementing the first variable or incrementing the second one. Given a power cap value, if {(p, t)} is the set of all possible configurations, we denote as {(p, t)}_ac ⊆ {(p, t)} the subset of all acceptable configurations, that is, the configurations for which the power cap is not violated. Formally, it is the subset such that pwr(p, t) ≤ C, where pwr(p, t) is the power consumption with configuration (p, t) and C is the power cap value. Since the function pwr(p, t) monotonically increases when decreasing p or increasing t, the subsets of acceptable and unacceptable configurations are separated by a frontier, as shown in Figure 3. Our goal is to find the configuration (p, t)* ∈ {(p, t)}_ac for which the performance of the application is maximized.
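Given the monotonicity of pwr(p, t), the acceptable set {(p, t)}_ac is fully described by its frontier, i.e. the largest acceptable thread count for each P-state. A small sketch computing that frontier on a purely synthetic power model (the coefficients below are illustrative assumptions, not measured data):

```python
def frontier(pwr, C, p_max, t_max):
    """For each P-state p, return the largest t with pwr(p, t) <= C (0 if none).

    pwr(p, t) is assumed monotonically non-decreasing in t and
    non-increasing in p (lower P-state index = faster core = more power)."""
    front = {}
    for p in range(p_max + 1):
        t_ok = 0
        for t in range(1, t_max + 1):
            if pwr(p, t) <= C:
                t_ok = t
            else:
                break  # monotone in t: no larger t can be acceptable
        front[p] = t_ok
    return front

# Illustrative monotone model: more threads and lower P-state draw more power.
pwr = lambda p, t: 30 + 4 * t + 3 * (11 - p)
```

The frontier grows with p, matching the shape visible in the measured data: slower P-states leave power budget for more threads.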
Figure 1. Power consumption (Watts) vs. number of concurrent threads and P-state.
Without loss of generality, we consider the application throughput as the performance metric. In any case, with our approach other metrics could also be used, depending on the specific application, such as the application runtime or the operation response time. We denote as thr(p, t) the application throughput for configuration (p, t). In multi-thread applications, the variation of the application throughput as a function of t plays a key role when finding the best configuration. Due to hardware and data contention phenomena, the profile of the application throughput curve is generally characterized by two parts, i.e. an initial ascending part, where the throughput increases while increasing t, followed by a descending part, where the throughput decreases while increasing t. However, we note that in the case of high contention the initial ascending part may not exist (i.e. the throughput always decreases when increasing t). Conversely, in the case of low contention the throughput may never decrease.

In Figure 2, we report the results of an experimental study we conducted with four different multi-thread applications still taken from STAMP, namely Intruder, Genome, Vacation and Ssca2. We selected these applications since their scalability trends are very different. Also, in our experiments we considered two different implementations of the thread synchronization logic: a) a coarse-grained lock-based approach, where critical sections are synchronized by a single global lock, and b) a fine-grained approach based on software transactional memory, where shared data accesses are synchronized by transactions. We purposely used a coarse-grained locking scheme to evaluate our approach in various and antithetical scenarios, spanning from applications with very limited to very high scalability. The plots in Figure 2 confirm that the profile of the throughput curves shows an ascending part followed by a descending part, and that in some cases the ascending or the descending part may not exist.
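For offline experimentation with tuning logic, the two-part profile described above can be mimicked by a simple synthetic throughput model: a curve that ascends up to a peak thread count and descends afterwards, scaled by a per-P-state speed factor. All coefficients below are illustrative assumptions, not measurements:

```python
def thr(p, t, peak=15, speed=lambda p: 1.0 + 0.1 * (11 - p)):
    """Synthetic throughput: unimodal in t with its maximum at `peak` threads,
    proportionally scaled by the core speed associated with P-state p."""
    base = t if t <= peak else peak - 0.5 * (t - peak)  # ascend, then descend
    return base * speed(p)
```

By construction the model also satisfies the two properties observed in the measurements: the peak thread count does not depend on p, and, at a fixed t, throughput grows as the P-state decreases.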
Also, the plots show that, when changing the application and/or the synchronization approach, the shapes of the throughput curves change. In particular, the number of threads that provides the best throughput is generally different. It ranges from 1 (in the case of workloads with very limited scalability, such as Intruder Lock-based, Vacation Lock-based and Ssca2 Lock-based) up to 20 (in the case of fully scalable workloads such as Genome Transaction-based or Vacation Transaction-based). Notably, in some cases it lies in the middle (as for Intruder Transaction-based, Genome Lock-based or Ssca2 Transaction-based).

Figure 2. Throughput vs. number of concurrent threads for the lock-based and transaction-based versions of Genome, Intruder, Vacation and Ssca2.

On the other hand, once the application and the synchronization approach are fixed, the throughput curves preserve their shape when varying the P-state. The curves appear proportionally translated, but the number of threads that provides the best throughput does not change, except for small and unpredictable variations due to measurement noise. Finally, the plots show that, for a fixed number of threads, the throughput increases when decreasing the P-state. We exploit these experimental findings to define the exploration-based technique presented in the next section.

IV. THE ADAPTIVE POWER CAPPING TECHNIQUE
The adaptive power capping technique we propose aims at finding the optimal configuration (p, t)*, i.e. the configuration that provides the highest performance among the configurations in the set {(p, t)}_ac, assuming that it may change due to variations of the workload profile. The technique is based on an on-line tuning strategy that periodically performs an exploration procedure. The latter aims to identify the optimal configuration (p, t)* for the current workload profile, which is actuated until the exploration procedure restarts after a given period. During the exploration procedure, the power consumption and the throughput of the application are measured while moving along configurations within a given path, discarding the explored configurations that are not in the set {(p, t)}_ac. Then, the one with the highest throughput is selected. We should note that the number of threads t* of the optimal configuration may be different from the number of threads that provides the highest throughput for the specific application, since decreasing the CPU P-state might provide a higher performance increase than increasing the number of threads. The procedure is able to identify the optimal configuration by exploring only a subset of configurations. In effect, we note that the full set of configurations may be very large, particularly when a large number of cores is available. Thus, reducing the exploration space is fundamental to implement an on-line exploration-based strategy.
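The periodic scheme just described can be sketched as a plain control loop that alternates exploration and actuation. Everything below (names, callbacks, the sampling period) is an illustrative skeleton rather than the actual implementation:

```python
import time

def capping_controller(explore, apply_config, measure, cap_watts,
                       start=(6, 1), period_s=30.0, rounds=3):
    """Periodically re-run the exploration and actuate its result.

    explore(start, cap, measure) -> best (p, t) for the current workload;
    apply_config((p, t)) actuates the chosen configuration;
    measure((p, t)) -> (throughput, power) samples one configuration."""
    config = start
    for _ in range(rounds):        # in a real deployment: while True
        config = explore(config, cap_watts, measure)
        apply_config(config)
        time.sleep(period_s)       # run with the chosen config for a while
    return config
```

The output of each exploration becomes the starting point of the next one, matching the restart behavior described in the next subsection.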
A. The Exploration Procedure
The exploration procedure takes as input a starting configuration (p_s, t_s) and a power cap value C, and returns (p, t)*. For the first execution of the procedure, the starting configuration is established by the user, while in subsequent executions it corresponds to the output configuration of the previous one. We note that, based on the shapes of the throughput curves and the observations made in our preliminary analysis, a set of configurations can be excluded from the exploration, thus reducing the configuration exploration space.

Figure 3. Example of the exploration phases performed by the basic strategy.

Specifically, if during the exploration:
1) a configuration (p_j, t_k) such that thr(p_j, t_k) ≤ thr(p_j, t_k − 1) is found, then all configurations (p, t) where t ≥ t_k, for whichever p, can be excluded (since we are in the descending part of the throughput curve and the throughput curves preserve their shape while varying the P-state);
2) a configuration (p_j, t_k) such that pwr(p_j, t_k) ≤ C is found, then all configurations (p, t_k) with p > p_j can be excluded (since increasing the P-state reduces the application throughput);
3) a configuration (p_j, t_k) such that pwr(p_j, t_k) > C is found, then all configurations (p, t) where t ≥ t_k and p ≤ p_j can be excluded (since decreasing the P-state or increasing the number of concurrent threads increments the power consumption).
Based on the above observations, we built an exploration procedure divided in three phases, plus a final selection phase. The phases are described below. A graphical example is shown in Figure 3, which refers to an execution where the number of concurrent threads providing the highest throughput is equal to 15 and C = 50 Watts.
The phases are the following ones:
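As a sketch, the three exclusion rules can be expressed as set-filtering predicates applied after each sampled configuration. The set-based representation is only illustrative; the actual procedure never materializes the full configuration space:

```python
def prune(candidates, pj, tk, *, descending=False, within_cap=None):
    """Apply the exclusion rules after sampling configuration (pj, tk).

    descending: thr(pj, tk) <= thr(pj, tk - 1) was observed (rule 1).
    within_cap: True if pwr(pj, tk) <= C (rule 2),
                False if pwr(pj, tk) > C (rule 3), None if unknown.
    candidates is a set of (p, t) pairs; returns the pruned set."""
    out = set(candidates)
    if descending:            # rule 1: descending part reached, for any P-state
        out = {(p, t) for (p, t) in out if t < tk}
    if within_cap is True:    # rule 2: slower P-states at the same t are dominated
        out = {(p, t) for (p, t) in out if not (t == tk and p > pj)}
    if within_cap is False:   # rule 3: more power can only violate the cap again
        out = {(p, t) for (p, t) in out if not (t >= tk and p <= pj)}
    return out
```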
Phase 1: this phase starts from the initial configuration (p_s, t_s) and, keeping the P-state fixed, aims at finding the number of threads providing the highest throughput without violating the power cap. We denote as (p_s, t_1) the configuration returned by this phase. It performs a search inspired by the hill-climbing technique. Specifically, it increments by one the number of threads while the throughput increases and the power cap is not violated (since it is moving along the ascending part of the throughput curve), then it returns the configuration with the highest throughput within the power cap. If the throughput does not grow after the first increment, or the power cap is violated, it starts decreasing the number of threads (since it is moving along the descending part of the throughput curve, or the power consumption has to be reduced) until the throughput starts decreasing. Then, it returns the configuration with the highest throughput if it does not violate the power cap; otherwise, if all the explored configurations violate the power cap, or if the exploration reaches a number of threads equal to 1, it returns (p_s, 1). In the example in Figure 3, the exploration performed in Phase 1 is represented by the green line. It starts with (p_s, t_s), where p_s = 6, then increases the number of threads and terminates when it explores a configuration that violates the power cap. It returns (p_s, t_1), which is within the power cap.

Phase 2:
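Phase 1 is essentially a one-dimensional hill climb on t at a fixed P-state. A condensed sketch, assuming callables thr(p, t) and pwr(p, t) that return measurements for a configuration (measurement noise and ties are ignored for simplicity):

```python
def phase1(ps, ts, C, thr, pwr, t_max):
    """Hill-climb on the number of threads at fixed P-state ps.

    Returns (ps, t) where t has the highest observed throughput that does
    not violate the cap C, falling back to t = 1."""
    best_t, best_thr = None, float("-inf")

    def visit(t):
        nonlocal best_t, best_thr
        if pwr(ps, t) <= C and thr(ps, t) > best_thr:
            best_t, best_thr = t, thr(ps, t)

    t = ts
    visit(t)
    # Climb upward while throughput grows and the cap holds.
    while t < t_max and pwr(ps, t + 1) <= C and thr(ps, t + 1) > thr(ps, t):
        t += 1
        visit(t)
    if t == ts:
        # No upward progress: walk down the descending (or cap-violating) part.
        while t > 1 and (pwr(ps, t) > C or thr(ps, t - 1) >= thr(ps, t)):
            t -= 1
            visit(t)
    return (ps, best_t if best_t is not None else 1)
```

On a unimodal curve this stops either at the throughput peak or at the cap frontier, whichever comes first, which is exactly the behavior described above.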
This phase starts from the configuration returnedby phase 1 ( p s , t ) and is executed only if this configurationdoes not violate the power cap (otherwise it jumps to the nextphase). The goal of phase 2 is to continue the exploration forlower values of P-state (we remark that lower values of
P-state lead to both higher core performance and higher powerconsumption). Specifically, it explores by moving from thecurrent configuration ( p, t ) to configuration ( p − , t ) . Ifthe latter configuration does not violate the power cap,it continues to reduce the value of P-state . If the explo-ration reaches a configurations such that pwr ( p, t ) > C ,it starts reducing the number of threads, thus moving toconfiguration ( p, t − , then ( p, t − and so on (sincedecreasing the number of concurrent threads reduces thepower consumption) until the power cap is not violated.After, it restarts the exploration by decreasing the value of P-state . The exploration terminates when p reaches and thecurrent configuration does not violate the power cap, whenit reaches configuration (0 , , or when a configuration with t = 1 violates the power cap. Then, the phase returns theexplored configuration with the highest throughput withinthe power cap, that we denote as ( p , t ) , or configuration (0 , . In Figure 3, the exploration of Phase 2 is shown bythe blue line. It starts from ( p s , t ) = (6 , , then exploresup to configuration (0 , . It returns ( p , t ) = (3 , . Phase 3:
This phase starts again from the configuration (p_s, t_1) returned by Phase 1 and aims at continuing the exploration for higher values of the P-state. If the configuration returned by Phase 1 is such that t_1 is the number of threads providing the highest throughput and is within the power cap, Phase 3 is not executed (since incrementing the value of the P-state would only lead to lower throughput). Otherwise, it increments by one the value of the P-state and starts increasing the number of concurrent threads until the power cap is violated or the throughput decreases. In the former case, if the maximum value of the P-state has not been reached, it increments by one the value of the P-state and starts again incrementing the number of threads. In all the other cases the exploration terminates. Then, the phase returns the explored configuration with the highest throughput within the power cap, which we denote as (p_3, t_3), or configuration (p_max, t_1), where p_max is the maximum value of the P-state. In Figure 3, the exploration of Phase 3 is represented by the yellow line. It starts from (p_s, t_1), then explores up to a configuration with p_3 = 8, where it stops since the throughput decreases (in the example, the number of concurrent threads providing the highest throughput is equal to 15). It returns (p_3, t_3), with p_3 = 8.

Final phase: this phase selects the configuration with the highest throughput, among the configurations (p_s, t_1), (p_2, t_2) and (p_3, t_3), which does not violate the power cap, or returns null if none of them is within the power cap.

B. Proof of Optimality
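Putting the phases together, the whole procedure can be prototyped and sanity-checked against a brute-force scan of the grid on a synthetic workload model. The sketch below follows the three-phase path in a simplified form and is not the authors' implementation:

```python
def explore(ps, ts, C, thr, pwr, p_max, t_max):
    """Three-phase exploration sketch: best sampled (p, t) under cap C."""
    sampled = {}

    def visit(p, t):
        if 0 <= p <= p_max and 1 <= t <= t_max and (p, t) not in sampled:
            sampled[(p, t)] = (thr(p, t), pwr(p, t))

    # Phase 1: hill-climb on t at the starting P-state ps.
    t = ts
    visit(ps, t)
    while t < t_max and pwr(ps, t + 1) <= C and thr(ps, t + 1) > thr(ps, t):
        t += 1; visit(ps, t)
    while t > 1 and (pwr(ps, t) > C or thr(ps, t - 1) >= thr(ps, t)):
        t -= 1; visit(ps, t)
    t1 = t

    # Phase 2: lower P-states (faster cores), shedding threads at cap violations.
    p, t = ps - 1, t1
    while p >= 0 and t >= 1:
        visit(p, t)
        if pwr(p, t) > C:
            t -= 1          # reduce threads to get back under the cap
        else:
            p -= 1          # keep speeding the cores up

    # Phase 3: higher P-states (slower cores), adding threads while useful.
    p, t = ps + 1, t1
    while p <= p_max:
        visit(p, t)
        if t < t_max and pwr(p, t + 1) <= C and thr(p, t + 1) > thr(p, t):
            t += 1
        elif t < t_max and pwr(p, t + 1) > C:
            p += 1          # cap hit: a slower P-state frees power for threads
        else:
            break           # throughput stopped growing: terminate

    ok = [(c, th) for c, (th, pw) in sampled.items() if pw <= C]
    return max(ok, key=lambda x: x[1])[0] if ok else None
```

On models satisfying the hypotheses of the next subsection, the configuration returned matches an exhaustive search while sampling only a thin path around the cap frontier.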
In this subsection we prove that the proposed exploration procedure returns the optimal configuration, i.e. the configuration (p, t)* that provides the highest level of performance with a power consumption lower than the power cap. The proof assumes that the observations discussed in Section III always hold true. Specifically, we take as hypotheses that:
1) the shape of the throughput curve, for each fixed P-state and varying number of active threads, is characterized by an initial ascending part followed by a descending part. Also, one of these parts may be missing;
2) if thr(p_j, t_k) > thr(p_j, t_k + 1) then thr(p, t_k) > thr(p, t_k + 1) for each p (the throughput curves preserve their shape while varying the P-state);
3) if p_j < p_k then thr(p_j, t) > thr(p_k, t) for each fixed t (decreasing the P-state with a fixed number of threads always increases the throughput);
4) pwr(p, t) ≥ pwr(p_j, t_k) for each p ≤ p_j and t ≥ t_k (decreasing the P-state or increasing the number of threads increases the power consumption);
5) the workload is static during the exploration procedure;
6) the samples of throughput and power consumption obtained for each explored configuration are equivalent to their real values.
We should note that hypotheses 5 and 6 are necessary for any exploration-based solution that relies on data gathered at run-time. In particular, they guarantee that if the optimal configuration is explored it will also be selected by the algorithm as the best configuration. Hypotheses 1, 2, 3 and 4, as shown in Figure 2, reflect properties that appear to be valid for all the considered workloads.
Proof:
We can partition the search space defined by theconfigurations ( p, t ) in three distinct sub-spaces, delimitedby the starting configuration ( p, t ) s , such that: • p = p s ; • p < p s ; • p > p s ;We denote as the optimal configuration for the sub-spaceof configurations S , the configuration ( p, t ) q ∈ S thatprovides the highest performance while operating within thepower cap compared to all the configurations ( p, t ) ∈ S .Considering that the sum of these three sub-spaces coversthe complete space of configurations, the configuration thatprovides the highest performance between the optimal con-figuration of all the sub-spaces will be the optimal configu-ration ( p, t ) ∗ . Thus, proving that the exploration procedurefinds the optimal configuration for each of these sub-space is equivalent to prove that it finds the optimal configurationfor the whole space of configurations. p = p s : Phase 1 of the exploration procedure starts from ( p, t ) s and searches for the number of active threads at P-state p s that maximizes performance with power con-sumption within the power cap. Therefore, it exploresthe considered sub-space. Phase 1 is based on an hill-climbing optimization algorithm which generally finds thelocal optima, which might not be the best possible solu-tion. However, for hypothesis 1, the local optima is alsothe global optima as it is not possible for a non-globaloptima to exist for a function with a single ascending partfollowed by a single descending part or, in case one ofthose is missing, for a monotonic function. We should notethat, unlike traditional hill-climbing algorithms, the selectedconfiguration might not be the global optima as it mightrequire a power consumption higher than the power cap. 
Inthis case, exploiting the shape of the throughput function,the configuration with the highest number of active threadwith a power consumption lower than the power cap isselected which is clearly the optimal configuration for thesub-space as the optimum is always located either at theend of the ascending part or at 1 thread if the ascendingpart does not exist. Else, the global optima is selected. Thus,phase 1 selects the optimal configuration for the sub-spaceof configurations with p = p s . p < p s : Assume that the optimal configuration ( p, t ) k +1 forthe sub-space of configurations with p = k + 1 is known.The optimal configurations for the sub-space with p = k must have t k < = t k +1 since if t k +1 is the optimal numberof threads for p = k + 1 it must be that either: • throughput with t = t k +1 + 1 is lower than with t = t k +1 with p = k + 1 . Thus, for hypothesis 2,all configurations ( p, t ) where t > t k +1 , for whichever p , can not be optimal; • pwr ( p k +1 , t k +1 +1) > C which implies, for hypothesis4, that for p = k < k + 1 all configurations with t > = t k +1 would have a power consumption higher than thepower cap and thus can not be optimal.If t k +1 = 1 we can already conclude that the optimum forthe sub-space of configurations with p = p k is ( p = k, t =1) . Differently, for t > we can state that the throughputat P-state k monotonically increases in the range from 1 to t = t k +1 . If t k +1 > it must be that thr ( p k +1 , t k +1 ) > = thr ( p k +1 , t k +1 − which implies for hypothesis 2 that thr ( p k , t k +1 ) > = thr ( p k , t k +1 − and consequently thatthe throughput curve with p = k for t < t k +1 is inthe ascending part. Therefore, as performed by Phase 2,starting the exploration of the sub-space of configurationswith p = k from the configuration p k , t k +1 and, if necessary,decreasing the number of threads until the power cap isreached assures that the the optimal configuration for thesub-space is explored. 
The configuration returned by phase 1 (which is the optimal configuration for the sub-space with p = p_s) is used as the base case. The union of the sub-spaces of configurations with {p = j | j ∈ [0, p_s − 1]} is equal to the sub-space of configurations with p < p_s. Therefore, the configuration with the highest performance among the optimal configurations of these sub-spaces is the optimal configuration for the entire sub-space with p < p_s.

p > p_s: Assume that the optimal configuration (p, t)_{k−1} for the sub-space of configurations with p = p_{k−1} is known. We can state that thr(p_k, t) ≤ thr(p, t)_{k−1} for each t ≤ t_{k−1} since:
• if t_{k−1} is the optimal value of t for p = p_{k−1}, it must be true that thr(p_k, t_{k−1}) ≥ thr(p_k, t) for each t ≤ t_{k−1} (hypothesis 2);
• thr(p_k, t_{k−1}) ≤ thr(p_{k−1}, t_{k−1}) (hypothesis 3).
In addition, we can state that if thr(p_k, t_{k−1}) > thr(p_k, t_{k−1} + 1), then for each configuration (p, t) with p > p_k it must be true that thr(p, t) < thr(p_k, t_{k−1}), since:
• by hypothesis 2, increasing the number of threads over t_{k−1} does not improve the throughput for any P-state;
• by hypothesis 3, increasing the P-state reduces the throughput.
Therefore, considering that pwr(p_k, t_{k−1}) < C (hypothesis 4) and that (p_k, t_{k−1}) is included in the sub-space of configurations with p > p_s, if thr(p_k, t_{k−1}) > thr(p_k, t_{k−1} + 1), then no configuration in the sub-space of configurations with p > p_k can be the optimal configuration of the sub-space with p > p_s. Starting from the configuration returned by phase 1 (which is the optimal configuration for p = p_s), phase 3 increments the P-state and increments the number of threads until the power cap is violated or the throughput decreases, which ensures that the optimal configuration for the sub-space is explored. If the throughput decreases when increasing the number of threads, or when the maximum P-state is reached, phase 3 is completed. By induction, it explores the optimal configuration of each sub-space of configurations with {p = j | j ∈ [p_s + 1, p_max]}, excluding the sub-spaces that we proved cannot contain the optimum. Therefore, the configuration with the highest performance among the optimal configurations of the considered sub-spaces is the optimal configuration for the entire sub-space with p > p_s.

C. Time Complexity Analysis
The time complexity of the exploration procedure is expressed as the number of exploration steps required by the procedure to return the optimal configuration (p, t)*. Let p_tot be the total number of P-states supported by the system and t_tot the maximum number of concurrent threads for the specific application, which, in HPC applications, is usually set equal to the number of physical/virtual cores available in the system. Considering that the exploration procedure does not explore the descending part of the throughput curve, we could also define t_tot as the maximum number of concurrent threads that provide, for at least a portion of the execution time, the highest performance for the specific application run on the specific hardware. We analyze the time complexity of each exploration phase separately:
• phase 1: each configuration with a different number of concurrent threads and p = p_s is explored at most once, thus the time complexity is O(t_tot);
• phase 2: starting from a configuration (p, t), phase 2 either reduces the value of p or reduces t. Starting from the configuration returned by phase 1, it can reduce p at most p_tot times and reduce t at most t_tot times. Thus, the time complexity of phase 2 is O(p_tot + t_tot);
• phase 3: starting from a configuration (p, t), phase 3 either increments the value of p or increments t. Thus, by the same reasoning used for phase 2, the time complexity of phase 3 is O(p_tot + t_tot).
Therefore, the overall time complexity of the exploration procedure is O(p_tot + t_tot).

D. The Enhanced Tuning Strategy
In this section we present an enhancement of the tuning strategy that further improves performance and reduces the power cap violation probability. It exploits the possible gap between the power cap value and the power consumption of configuration (p, t)*, which is due to the discrete domain of power consumption values of the different configurations. Specifically, it is unlikely that pwr(p, t)* is exactly equal to C. Rather, we can have C − pwr(p, t)* > 0. Statistically, the greater the difference in power consumption between adjacent configurations, the larger C − pwr(p, t)*. To reduce the performance penalization due to this gap, the enhanced tuning strategy relies on continual fluctuations between two configurations (rather than always remaining in (p, t)*) along the time interval between the end of one exploration procedure and the start of the next one. All the phases are equal to the previous tuning strategy, except that an additional configuration (p, t)^H is selected. (p, t)^H is the configuration with higher throughput than (p, t)* (if any) such that the ratio between throughput and power consumption is the largest among the explored configurations. Thus, it is the configuration with the highest efficiency in terms of throughput over power consumption. We note that, since (p, t)* is the configuration within the power cap with the highest throughput, (p, t)^H is a configuration that violates the power cap.

At the end of the exploration procedure, the enhanced strategy continuously fluctuates between (p, t)* and (p, t)^H in order to take advantage of the higher throughput of configuration (p, t)^H, while preventing the average power consumption, over a given time window w, from exceeding C. To this aim, if the average power consumption exceeds C, then configuration (p, t)* is set. Conversely, when the average power consumption falls below C, configuration (p, t)^H is set, and so on.
To limit the fluctuation frequency, an upper and a lower tolerance threshold, C + l and C − l, are used. In real scenarios, the length of w can be set equal to the actual time window used to calculate the power consumption of the machine.

Another factor that may impact the effectiveness of our technique is the variation over time of the power consumption of the selected configurations. For example, pwr(p, t)* may change due to variations of the workload profile, thus leading to power cap violations. If this happens, with our tuning strategy it may not be detected until the next exploration procedure starts. To limit the effect of this delay on the power cap violation, the enhanced tuning strategy selects a third configuration, which we denote as (p, t)^L. It is the configuration with lower power consumption than (p, t)* (if any) with the highest efficiency in terms of throughput over power consumption. Thus, if pwr(p, t)* exceeds C, the strategy fluctuates between (p, t)* and (p, t)^L rather than between (p, t)* and (p, t)^H. This reduces the probability of power cap violation as long as the workload profile variation is such that pwr(p, t)^L < C. Similarly, for the same goal of promptly adapting to workload variations, if pwr(p, t)^L > C (pwr(p, t)^H < C), the P-state of all configurations is shifted up (down) by one.

V. EXPERIMENTAL RESULTS
In this section, we present the results of an experimental study we conducted to assess the proposed power capping technique. As in previous studies on power capping (e.g., [2], [12]), we consider two evaluation metrics: the application performance and the average power cap error. The latter is the average difference between the power consumption and the power cap value along the time intervals where the power cap is violated. We run experiments for all the application scenarios that we considered in our preliminary study (see Section III). Thus, we use Intruder, Genome, Vacation and Ssca2 as benchmark applications from STAMP, with both locks and transactions as the synchronization method. These applications were specifically selected to cover a wide range of different scalability scenarios. We compared our technique with:
1) a reference power capping technique, referred to as baseline, that selects the configuration with the lowest P-state from the set of configurations with the highest number of threads among the configurations with power consumption lower than the power cap. It implements the selection strategy proposed in [2];
2) a technique, referred to as dual-phase, that initially tunes the number of threads starting from the lowest P-state, and subsequently tunes the CPU P-state keeping the number of threads fixed. The initial phase is equivalent to phase 1 of the proposed exploration procedure. The selection strategy of this technique is similar to the one presented in [3].
The comparison with the first technique allows us to quantify the performance benefits achievable by properly allocating the power budget taking into consideration the scalability of the specific multi-threaded application. Additionally, we considered the dual-phase technique in the evaluation to quantify the possible performance benefits achievable by exploring the whole bi-dimensional space of configurations over two distinct mono-dimensional explorations, which might not find the optimal configuration. We should note that, despite exploring a larger set of configurations, the proposed technique has the same time complexity as the dual-phase technique.
A. Implementation details
We developed a controller module that implements our technique and the baseline technique. All the software of our experimental study, including the benchmark applications, is developed in the C language for Linux. The controller module alters the number of concurrent threads by exploiting the pause() system call and thread-specific signals for reactivation. The CPU P-state is regulated through the cpufreq Linux sub-system, while energy readings are obtained from the powercap sub-system. Both these sub-systems are included by default in recent versions of the Linux kernel and expose their respective interfaces through the /sys virtual file system.

The exploration procedure relies on statistical results of the previous step, such as average power consumption and throughput, to select the next configuration to explore. Each step of statistics collection is determined by a fixed amount of processed units of work. We cannot rely on application-independent metrics, such as the number of retired CPU operations, since they would also count instructions related to spin-locking or aborted transactions that do not provide execution progress. For applications based on locks, we define the unit of work as the execution of one critical section guarded by a global lock. Differently, for transactions we define the unit of work as one commit. The statistics are collected in a round-robin fashion by all the active threads to reduce execution overhead and provide NUMA-aware results on modern multi-package systems.

For the executions presented in the experimental results, we set the units of work per step to 5000, resulting in tens of milliseconds per step for all the considered applications and synchronization methods. In addition, we set to 150 the number of steps after which the exploration procedure is restarted.
B. Experimental results
We consider both tuning strategies of our technique, referred to as the basic strategy and the enhanced strategy. We analyze the performance results of our strategies in terms of speed-up with respect to the throughput of the baseline technique. As anticipated, we also compare the average power cap error. For each test case, we present the results with three different power cap values, i.e., 50, 60 and 70 watts. (The source code is available at github.com/StefanoConoci/STMEnergyOptimization.)

Figure 4. Throughput Speed-up and Power Cap Error with Locks

Results for the case of lock-based synchronization are reported in Figure 4. Overall, the results show an evident performance improvement with both strategies of our technique with respect to the baseline technique. Only for the case of Genome is the performance comparable. In the best cases, i.e., with Intruder, the performance improvement reaches 2.2x (2.32x) and 2.15x (2.19x) for the basic (enhanced) strategy when the power cap is equal to 50 and 60 watts, respectively, and it is close to 1.9x for both the proposed strategies with the power cap set to 70 watts. The enhanced strategy further improves performance compared to the baseline technique by up to 12.5% in Intruder at 50 watts, and by 5.3% on average. For lock-based synchronization, the results of the dual-phase technique are similar to those achieved by the basic strategy.

As for the power cap error, with both the strategies of our technique and the dual-phase technique, it is clearly reduced compared to the baseline. Also, the results show that with the enhanced strategy in many cases there is a reduction of the power cap error compared to the basic strategy. Indeed, except for the case of Vacation with the power cap equal to 60 watts, where it is increased by less than 0.1%, the error with the enhanced strategy is lower. In the best case it is about 0.1%, while it is about 2% and 4.8% with the basic strategy and the baseline technique, respectively.

Results for the case of transaction-based synchronization are reported in Figure 5. Overall, the performance results confirm the advantage of our technique compared to the baseline technique. However, with transactions the speed-up is generally slightly lower than with locks. In the best cases, it reaches about 1.9x. Also, there is one case (Genome with power cap = 50 watts) where it is slightly less than 1 with both strategies. As for the power cap error, it increases with the basic strategy compared to the case with locks, exceeding the error of the baseline technique in most cases. However, it does not exceed 2% in any case. The error is considerably reduced with the enhanced strategy. Particularly, it is clearly lower than with the baseline technique for all applications when the power cap is equal to 50 watts, and for Intruder when the power cap is equal to 60 watts, while the results are similar for the other power cap values. In addition, the enhanced strategy can further increase performance by up to 8% (Vacation with power cap set to 50 watts) and by 3.5% on average.
Differently from the lock-based case, both strategies of the proposed technique show a higher speed-up compared to the dual-phase technique, by up to 21% (Ssca2 with the power cap set to 50 watts), and by 7.7% and 10.7% on average for the basic strategy and the enhanced strategy, respectively.
C. Analysis of the Results
As a first observation, the results show that in various cases with locks, the error of our technique and of the dual-phase technique is very close to zero. This is due to the fact that, in our study, the scalability is limited for all applications when using locks. In these scenarios, the number of concurrent threads providing the highest throughput (which is selected by our technique and by the dual-phase technique) is low, thus the P-state can be lowered down to 0 while the power cap frontier is still far. This keeps the error very close to 0, since it is unlikely that the power cap is violated during the exploration procedure or due to workload variations.

The error is generally reduced with the enhanced strategy compared to the basic strategy, while also improving performance. This arises since the former is able to react, along the time between two consecutive exploration procedures, to the possible variations of the power consumption of the selected configurations, as discussed at the end of Section IV-D.

The speed-up with our technique is less than 1 in only one case, i.e., for Genome with transactions when the power cap value is equal to 50 watts. We note that Genome with transactions is highly scalable (see Figure 2). This leads both the baseline technique and our technique to select 20 as the number of concurrent threads. As shown by the plot in Figure 2, the throughput of Genome with transactions is subject to noise when close to 20 threads. Also, we remark that our technique reacts to workload variations also in terms of scalability. In this scenario, these factors cause lower performance with our technique due to the noise, which sometimes (wrongly) leads to temporarily selecting a less than optimal number of concurrent threads.

Figure 5. Throughput Speed-up and Power Cap Error with Transactions

As expected, for lock-based synchronization the proposed technique shows results similar to the dual-phase technique, since both techniques return the same configuration when the ascending part of the throughput curve is missing. For transaction-based synchronization, the highest speed-up improvements over the dual-phase technique are obtained for Ssca2 and Genome, which show a less than linear ascending part of the throughput curve for each fixed P-state (Figure 2). As the most significant example, in Ssca2 the throughput only slightly increases when increasing the number of threads from 6 to 15, which makes the dual-phase technique select a configuration with 15 threads. Differently, the proposed technique allocates the power budget more efficiently by selecting a configuration with a lower number of threads at an increased frequency. We should note that the benefits of the proposed technique over the dual-phase technique are not limited to applications that rely on transaction-based synchronization. Effectively, performance benefits should be obtained for any application with a throughput function that shows an ascending part followed by a descending one, or only an ascending part that is less than linear.

Overall, the results of our experimental study show that it is possible to achieve significant performance benefits by appropriately selecting the number of concurrent threads and the CPU P-state taking into consideration the scalability of the specific multi-threaded application. As expected, compared to the baseline technique, the proposed solution achieves the best results with poorly scalable applications, i.e., where contention is not minimal. Compared to the dual-phase technique, the exploration of the whole bi-dimensional space of configurations performed by the proposed technique can provide an appreciable improvement in performance for some applications, while achieving the same results for others. Finally, the enhanced strategy manages to further improve performance and reduce the power cap error over the basic strategy.

VI. CONCLUSIONS
In this work we introduced a novel power capping technique that, by jointly tuning the CPU performance state and the number of concurrent threads, improves the performance of multi-thread applications, specifically of applications that show less than linear scalability due to contention. Exploiting the results of a preliminary analysis, the proposed technique can return in linear time the optimal configuration, i.e., the one providing the highest performance among all configurations with power consumption lower than the power cap. We also presented an enhanced strategy that, by fluctuating between different configurations, optimizes the dynamic allocation of the power budget, resulting in both increased performance and reduced power cap error. Compared to the baseline technique, which always assigns to the application the highest possible number of cores, our strategy provides an average speed-up of 1.48x, with individual test cases reaching up to 2.32x. Furthermore, we showed that by exploring the overall bi-dimensional space of configurations, the proposed technique can improve performance by up to 21% compared to techniques that tune the number of threads and the CPU performance state independently.
REFERENCES

[1] V. Pallipadi and A. Starikovskiy, “The ondemand governor: past, present and future,” in Proceedings of the Linux Symposium, vol. 2, pp. 223–238, 2006.
[2] S. Reda, R. Cochran, and A. Coskun, “Adaptive power capping for servers with multithreaded workloads,” IEEE Micro, vol. 32, no. 5, pp. 64–75, Sep. 2012. [Online]. Available: http://dx.doi.org/10.1109/MM.2012.59
[3] H. Zhang and H. Hoffmann, “Maximizing performance under a power cap: A comparison of hardware, software, and hybrid techniques,” in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’16. New York, NY, USA: ACM, 2016, pp. 545–559.
[4] Y. Liu, G. Cox, Q. Deng, S. C. Draper, and R. Bianchini, “FastCap: An efficient and fair algorithm for power capping in many-core systems,” ISPASS 2016 - International Symposium on Performance Analysis of Systems and Software, no. 3, pp. 57–68, 2016.
[5] Q. Deng, L. Ramos, R. Bianchini, D. Meisner, and T. Wenisch, “Active low-power modes for main memory with memScale,” IEEE Micro, vol. 32, no. 3, pp. 60–69, 2012.
[6] H. David, C. Fallin, E. Gorbatov, U. R. Hanebutte, and O. Mutlu, “Memory Power Management via Dynamic Voltage/Frequency Scaling,” Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 31–40, 2011.
[7] A. Kanduri, M.-H. Haghbayan, A. M. Rahmani, P. Liljeberg, A. Jantsch, N. Dutt, and H. Tenhunen, “Approximation knob: power capping meets energy efficiency,” Proceedings of the 35th International Conference on Computer-Aided Design - ICCAD ’16, pp. 1–8, 2016.
[8] B. Su, J. Gu, L. Shen, W. Huang, J. L. Greathouse, and Z. Wang, “PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration,” pp. 445–457, 2014.
[9] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun, “STAMP: Stanford transactional applications for multi-processing,” in Proc. 4th IEEE Int. Symposium on Workload Characterization. IEEE, 2008, pp. 35–46.
[10] N. Shavit and D. Touitou, “Software transactional memory,” in Proc. 14th ACM Symposium on Principles of Distributed Computing. ACM, 1995, pp. 204–213.
[11] “Intel 64 and IA-32 architectures software developer’s manual, volume 3c: System programming guide, part 3,” (Accessed on 06/26/2017).
[12] C. Lefurgy, X. Wang, and M. Ware, “Power capping: A prelude to power shifting,”