Adaptive Performance Optimization under Power Constraint in Multi-thread Applications with Diverse Scalability

Stefano Conoci, Pierangelo Di Sanzo, Bruno Ciciani
DIAG - Sapienza University of Rome
Email: {[email protected], [email protected], [email protected]}

Francesco Quaglia
DICII - University of Rome Tor Vergata
Email: [email protected]
Abstract—In modern data centers, energy usage represents one of the major factors affecting operational costs. Power capping is a technique that limits the power consumption of individual systems, which allows reducing the overall power demand at both cluster and data center levels. However, literature power capping approaches do not fit well the nature of important applications based on first-class multi-thread technology. For these applications, performance may not grow linearly as a function of the thread-level parallelism because of the need for thread synchronization while accessing shared resources, such as shared data. In this paper we consider the problem of maximizing the application performance under a power cap by dynamically tuning the thread-level parallelism and the power state of the CPU-cores. Based on experimental observations, we design an adaptive technique that selects in linear time the optimal combination of thread-level parallelism and CPU-core power state for the specific workload profile of the multi-threaded application. We evaluate our proposal by relying on different benchmarks, configured to use different thread synchronization methods, and compare its effectiveness to different state-of-the-art techniques.
I. INTRODUCTION
Multi-core architectures are nowadays dominating the market. Also, thanks to the support they offer for sharing memory among CPU-cores, they have become the mainstream reference hardware for applications based on first-class multi-thread technology. On the downside, powering many-core machines implies high energy delivery to each single multi-core server. Therefore, over the last years, energy and power consumption have emerged as a core concern to cope with, especially in (large) data centers. Such concern led manufacturers to introduce hardware mechanisms oriented to improve energy efficiency in operational contexts. These include Dynamic Voltage and Frequency Scaling (DVFS), which allows lowering the voltage and the frequency (hence the power consumption) of a processor/core in a controlled manner, and Clock Gating, which disables some processor/core circuitry during idle periods. Contextually, today's Operating Systems offer power management tools, like the Linux CPUFreq governor [1], which expose to the user code interfaces to dynamically change the power state of cores via DVFS, thus allowing to tune the performance of cores and their power demand according to the needs of specific applications/workloads.

In this context, one interesting challenge is that of controlling the power demand of an application in order to keep it below a given threshold, also known as the power cap. However, an even more interesting challenge is that of ensuring that an application runs at maximum performance under a given power cap. Such an achievement, in addition to performance benefits, would also improve the application efficiency in terms of energy per task. Various power capping techniques for multi-core servers have been proposed. As for the specific case of multi-thread applications, the problem of regulating the number of threads and the core frequency to control the balance between performance and power consumption has been originally considered in [2], and subsequently in [3].
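As a concrete illustration of such tools, the Linux CPUFreq subsystem exposes per-core DVFS knobs as plain sysfs files. The sketch below (the helper functions are ours, and writing these files requires root privileges and a driver that honors them) shows how they can be read and written:

```python
from pathlib import Path

# Standard location of the cpufreq sysfs tree on Linux.
CPUFREQ_BASE = Path("/sys/devices/system/cpu")

def read_cpufreq(cpu: int, field: str, base: Path = CPUFREQ_BASE) -> str:
    """Read a cpufreq attribute, e.g. 'scaling_governor' or 'scaling_cur_freq'."""
    return (base / f"cpu{cpu}" / "cpufreq" / field).read_text().strip()

def set_max_freq_khz(cpu: int, khz: int, base: Path = CPUFREQ_BASE) -> None:
    """Cap the frequency of one core by writing scaling_max_freq (in kHz)."""
    (base / f"cpu{cpu}" / "cpufreq" / "scaling_max_freq").write_text(str(khz))
```

With the `userspace` governor, `scaling_setspeed` can be written in the same way to pin a core to a fixed frequency, which is the typical way user-level tuning logic actuates a P-state change.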
The main drawback of these approaches is that the tuning strategies they rely on do not account for complex and dynamic effects on performance that may be caused by thread contention on hardware resources and/or shared data. In more detail, when multiple threads concurrently run on different CPU-cores, the presence of shared hardware resources, such as memory interconnections and cache levels, leads them to contend for their utilization. This impacts both performance and the power consumption profile of the application. Also, in common multi-thread applications that are not disjoint-access parallel, threads share data whose accesses may require synchronization. This again affects performance and the power consumption profile, also depending on the specific synchronization mechanisms (either speculative or not) that are employed by the application code. An additional factor of complexity in the presence of synchronization is that the speed-up achieved by running the application with different numbers of threads may change depending on the workload profile, which in turn can be dynamic on its own. Also, the speed-up can be non-linear as a function of the number of threads, depending on the workload profile, as well as on the underlying hardware settings. Specifically, performance can even decrease when increasing the level of parallelism. This indicates that synchronization costs, including the energy spent while performing synchronization operations, can show complex profiles to deal with. Overall, to select the right combination of thread parallelism and core power state, which ensures the best performance under a power cap, it looks mandatory to take into account the (possibly) limited scalability of a multi-thread application, just like it manifests at run-time due to actual synchronization dynamics. Further, it is mandatory to react to variations of the workload profile.

To cope with this problem, we present an adaptive technique that uses a novel on-line exploration-based tuning strategy. We devised our technique exploiting empirical observations of the effects on both performance and power consumption associated with the combined variation of thread-level parallelism and core power state. Specifically, by the results of experiments we conducted with different multi-thread benchmarks characterized by a non-negligible incidence of synchronization, e.g. because of thread contention while accessing shared data, we highlight that their scalability is not affected by the variation of the power state of the CPU-cores. Based on this, we defined an optimized tuning strategy where the exploration moves along specific directions that depend on the power cap value and on the intrinsic scalability of the application. Remarkably, we prove that the proposed technique finds in linear time the optimal configuration of concurrent threads and CPU frequency/voltage, i.e. the configuration that provides the highest performance among the configurations with power consumption lower than the power cap. Also, we present a refinement of our technique that exploits continuous fluctuations between configurations, in terms of thread-level parallelism and core power state, to further improve the application performance and reduce the possibility/incidence of power cap violations. We demonstrate the advantages of our proposal via an experimental study based on various application contexts, including various benchmarks that use different thread synchronization methods. This allows us to robustly assess our technique via disparate test cases where contention among threads affects the application scalability in significantly different ways.

The remainder of this article is structured as follows. In Section II we discuss related works. Section III defines our target problem and presents the results of the preliminary analysis.
Section IV illustrates the proposed optimization technique, proves that the selected configuration is optimal and analyzes the time complexity of the exploration procedure. Section V describes the most relevant implementation details and presents the experimental results.

II. RELATED WORK
A work specifically focused on optimizing the energy demand at application level is presented in [2]. The proposed technique, called Pack and Cap, aims at selecting the best configuration, in terms of number of cores to be assigned to an application and the related core frequency, which ensures a given power cap for multi-thread workloads. Based on experimental measurements of the performance and power consumption obtained running benchmarks from the Parsec suite, the authors conclude that the configuration that provides the highest performance at a given level of power consumption always assigns to the application the highest possible number of cores. However, as extensively shown in the following of this article, this selection strategy is not optimal for general multi-threaded applications with less than linear scalability. The work in [3] considers the problem of maximizing performance under a power cap while also taking into account the effects of contention. The solution defines an ordered set of power knobs that are progressively tuned by performing a binary search on the respective domain, selecting the setting that provides the highest performance for the considered power knob while operating within the power cap. In particular, the solution first selects the optimal number of cores that should be assigned to an application while running at the slowest available frequency/voltage, then selects the optimal CPU P-state setting for the previously selected number of assigned cores. Therefore, by tuning the power knobs independently, it does not consider the changing energy/performance trade-offs at different levels of parallelism for the specific workload. As an example, if an application shows a limited speed-up when increasing the number of cores, the solution would still pick the highest value that provides a power consumption within the cap, even if the same power budget could provide higher performance if spent to further increase the frequency of a lower number of cores.

Other works in the literature investigate the problem of improving application performance under power constraints considering different power management variables. FastCap [4] defines an approach for optimizing performance under a system-wide power cap considering both CPU and memory DVFS. It defines a non-linear optimization problem solved through a queuing model that considers the interaction between CPU-cores and memory banks communicating over a shared bus. Unfortunately, memory DVFS has only been proposed recently [5] [6] and is not yet available in commercial systems. Kanduri et al. propose approximation as another knob that can be used in power capping, combined with DVFS and Clock Gating, to define a trade-off between performance and accuracy of the results [7]. However, in order to dynamically switch between different levels of accuracy, it requires multiple implementations of the same application. PPEP [8] is an online prediction framework that, based on hardware performance events and on-chip temperature measurements, estimates the performance, power consumption and energy efficiency for each different CPU P-state. Therefore, it allows the definition of a power capping technique that can meet power targets in a single step without requiring any exploration. However, it does not consider the possibility of altering the number of cores assigned to an application, thus it would provide sub-optimal performance for multi-thread applications showing less than linear scalability.

III. PROBLEM STATEMENT AND PRELIMINARY ANALYSIS
As discussed, in our study we consider the problem of adaptively tuning the system configuration to ensure the highest application performance under a power cap. We consider two tuning parameters: the number of concurrent threads and the power state of the cores. We focus on the general scenario of multi-thread applications executed on a working-thread pool (e.g. multi-threaded web/application servers) whose size can be tuned at run-time. However, we should note that the proposed technique is orthogonal with respect to the chosen thread regulation mechanism. Also, we assume that the power state of cores can be changed, affecting both power consumption and performance. In practice, this is what happens when changing the so-called P-state in modern multi-core processors, which determines a variation of the core voltage and frequency, thus modifying both the power consumption and the instruction processing speed. We adhere to the notation of the ACPI standard, which establishes that P0 denotes the core state with maximum power and performance, and P1, P2, ... progressively identify states with less power and performance. Also, we consider the core idle states (C-states), where C0 denotes the fully operating core state, and C1, C2, ... progressively identify lower power states where the core is idle, i.e. it does not execute instructions. A core can transit from C0 to a deeper C-state when it has no instructions to execute. Hence, when the number of running threads goes below the number of available cores, unused cores can transit to low power states, thus reducing the total power consumption.

To provide the reader with real data demonstrating the effects on power consumption associated with the variation of the P-state and the number of concurrent threads, we show in Figure 1 the results of an experiment where we run the multi-thread Intruder benchmark from the STAMP suite [9] for Transactional Memory systems [10]. Intruder emulates a signature-based network intrusion detection system where network packets are processed in parallel by concurrent threads. We executed different runs while changing the P-state and the number of concurrent threads on top of a machine with two Intel Xeon E5 processors, 20 physical cores in total, 256 GB of ECC DDR4 memory, and core clock frequency ranging from 1.2 GHz (whose P-state is denoted as P-11) to 2.2 GHz (denoted as P-1), plus TurboBoost from 2.2 GHz to 3.1 GHz (denoted as P-0). Since we focus on the effects of the joint variation of core power state and thread parallelism, we consider power consumption data related to the CPU and memory subsystems, which we collected via the Intel RAPL interface [11]. The plot shows the power consumption as a function of the couple (p, t), where p is the P-state and t is the number of concurrent threads. The results clearly outline that the power consumption grows while decrementing the first variable or incrementing the second one. Given a power cap value, if {(p, t)} is the set of all possible configurations, we denote as {(p, t)}_ac ⊆ {(p, t)} the subset of all acceptable configurations, that is, the configurations for which the power cap is not violated. Formally, it is the subset such that pwr(p, t) ≤ C, where pwr(p, t) is the power consumption with configuration (p, t) and C is the power cap value. Since the function pwr(p, t) monotonically increases when decreasing p or increasing t, the subsets of acceptable and unacceptable configurations are separated by a frontier, as shown in Figure 3. Our goal is to find the configuration (p, t)* ∈ {(p, t)}_ac for which the performance of the application is maximized.
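Given the monotonicity of pwr(p, t), the acceptable set {(p, t)}_ac is fully described by its frontier, i.e. the largest acceptable thread count for each P-state. A small sketch computing that frontier on a purely synthetic power model (the coefficients below are illustrative assumptions, not measured data):

```python
def frontier(pwr, C, p_max, t_max):
    """For each P-state p, return the largest t with pwr(p, t) <= C (0 if none).

    pwr(p, t) is assumed monotonically non-decreasing in t and
    non-increasing in p (lower P-state index = faster core = more power)."""
    front = {}
    for p in range(p_max + 1):
        t_ok = 0
        for t in range(1, t_max + 1):
            if pwr(p, t) <= C:
                t_ok = t
            else:
                break  # monotone in t: no larger t can be acceptable
        front[p] = t_ok
    return front

# Illustrative monotone model: more threads and lower P-state draw more power.
pwr = lambda p, t: 30 + 4 * t + 3 * (11 - p)
```

The frontier grows with p, matching the shape visible in the measured data: slower P-states leave power budget for more threads.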
Figure 1. Power consumption (Watts) vs. number of concurrent threads and P-state.
Without loss of generality, we consider the application throughput as the performance metric. In any case, with our approach other metrics could also be used, depending on the specific application, such as the application runtime or the operation response time. We denote as thr(p, t) the application throughput for configuration (p, t). In multi-thread applications, the variation of the application throughput as a function of t plays a key role when finding the best configuration. Due to hardware and data contention phenomena, the profile of the application throughput curve is generally characterized by two parts, i.e. an initial ascending part, where the throughput increases while increasing t, followed by a descending part, where the throughput decreases while increasing t. However, we note that in the case of high contention the initial ascending part may not exist (i.e. the throughput always decreases when increasing t). Conversely, in the case of low contention the throughput may never decrease.

In Figure 2, we report the results of an experimental study we conducted with four different multi-thread applications still taken from STAMP, namely Intruder, Genome, Vacation and Ssca2. We selected these applications since their scalability trends are very different. Also, in our experiments we considered two different implementations of the thread synchronization logic: a) a coarse-grained lock-based approach, where critical sections are synchronized by a single global lock, and b) a fine-grained approach based on software transactional memory, where shared data accesses are synchronized by transactions. We purposely used a coarse-grained locking scheme to evaluate our approach in various and antithetical scenarios, spanning from applications with very limited to very high scalability. The plots in Figure 2 confirm that the profile of the throughput curves shows an ascending part followed by a descending part, and that in some cases the ascending or the descending part may not exist.
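For offline experimentation with tuning logic, the two-part profile described above can be mimicked by a simple synthetic throughput model: a curve that ascends up to a peak thread count and descends afterwards, scaled by a per-P-state speed factor. All coefficients below are illustrative assumptions, not measurements:

```python
def thr(p, t, peak=15, speed=lambda p: 1.0 + 0.1 * (11 - p)):
    """Synthetic throughput: unimodal in t with its maximum at `peak` threads,
    proportionally scaled by the core speed associated with P-state p."""
    base = t if t <= peak else peak - 0.5 * (t - peak)  # ascend, then descend
    return base * speed(p)
```

By construction the model also satisfies the two properties observed in the measurements: the peak thread count does not depend on p, and, at a fixed t, throughput grows as the P-state decreases.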
Also, the plots show that, when changing the application and/or the synchronization approach, the shapes of the throughput curves change. In particular, the number of threads that provides the best throughput is generally different. It ranges from 1 (in the case of workloads with very limited scalability, such as Intruder Lock-based, Vacation Lock-based and Ssca2 Lock-based) up to 20 (in the case of fully scalable workloads such as Genome Transaction-based or Vacation Transaction-based). Notably, in some cases it lies in the middle (as for Intruder Transaction-based, Genome Lock-based or Ssca2 Transaction-based).

Figure 2. Throughput vs. number of concurrent threads for the lock-based and transaction-based versions of Genome, Intruder, Vacation and Ssca2.

On the other hand, once the application and the synchronization approach are fixed, the throughput curves preserve their shape when varying the P-state. The curves appear proportionally translated, but the number of threads that provides the best throughput does not change, except for small and unpredictable variations due to measurement noise. Finally, the plots show that, for a fixed number of threads, the throughput increases when decreasing the P-state. We exploit these experimental findings to define the exploration-based technique presented in the next section.

IV. THE ADAPTIVE POWER CAPPING TECHNIQUE
The adaptive power capping technique we propose aims at finding the optimal configuration (p, t)*, i.e. the configuration that provides the highest performance among the configurations in the set {(p, t)}_ac, assuming that it may change due to variations of the workload profile. The technique is based on an on-line tuning strategy that periodically performs an exploration procedure. The latter aims to identify the optimal configuration (p, t)* for the current workload profile, which is actuated until the exploration procedure restarts after a given period. During the exploration procedure, the power consumption and the throughput of the application are measured while moving along configurations within a given path, discarding the explored configurations that are not in the set {(p, t)}_ac. Then, the one with the highest throughput is selected. We should note that the number of threads t* of the optimal configuration may be different from the number of threads that provides the highest throughput for the specific application, since decreasing the CPU P-state might provide a higher performance increase than increasing the number of threads. The procedure is able to identify the optimal configuration by exploring only a subset of configurations. In effect, we note that the full set of configurations may be very large, particularly when a large number of cores is available. Thus, reducing the exploration space is fundamental to implement an on-line exploration-based strategy.
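The periodic scheme just described can be sketched as a plain control loop that alternates exploration and actuation. Everything below (names, callbacks, the sampling period) is an illustrative skeleton rather than the actual implementation:

```python
import time

def capping_controller(explore, apply_config, measure, cap_watts,
                       start=(6, 1), period_s=30.0, rounds=3):
    """Periodically re-run the exploration and actuate its result.

    explore(start, cap, measure) -> best (p, t) for the current workload;
    apply_config((p, t)) actuates the chosen configuration;
    measure((p, t)) -> (throughput, power) samples one configuration."""
    config = start
    for _ in range(rounds):        # in a real deployment: while True
        config = explore(config, cap_watts, measure)
        apply_config(config)
        time.sleep(period_s)       # run with the chosen config for a while
    return config
```

The output of each exploration becomes the starting point of the next one, matching the restart behavior described in the next subsection.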
A. The Exploration Procedure
The exploration procedure takes as input a starting configuration (p_s, t_s) and a power cap value C, and returns (p, t)*. For the first execution of the procedure, the starting configuration is established by the user, while in subsequent executions it corresponds to the output configuration of the previous one. We note that, based on the shapes of the throughput curves and the observations made in our preliminary analysis, a set of configurations can be excluded from the exploration, thus reducing the configuration exploration space.

Figure 3. Example of the exploration phases performed by the basic strategy.

Specifically, if during the exploration:
1) a configuration (p_j, t_k) such that thr(p_j, t_k) ≤ thr(p_j, t_k − 1) is found, then all configurations (p, t) where t ≥ t_k, for whichever p, can be excluded (since we are in the descending part of the throughput curve and the throughput curves preserve their shape while varying the P-state);
2) a configuration (p_j, t_k) such that pwr(p_j, t_k) ≤ C is found, then all configurations (p, t_k) with p > p_j can be excluded (since increasing the P-state reduces the application throughput);
3) a configuration (p_j, t_k) such that pwr(p_j, t_k) > C is found, then all configurations (p, t) where t ≥ t_k and p ≤ p_j can be excluded (since decreasing the P-state or increasing the number of concurrent threads increments the power consumption).
Based on the above observations, we built an exploration procedure divided in three phases, plus a final selection phase. The phases are described below. A graphical example is shown in Figure 3, which refers to an execution where the number of concurrent threads providing the highest throughput is equal to 15 and C = 50 Watts.
The phases are the following ones:
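As a sketch, the three exclusion rules can be expressed as set-filtering predicates applied after each sampled configuration. The set-based representation is only illustrative; the actual procedure never materializes the full configuration space:

```python
def prune(candidates, pj, tk, *, descending=False, within_cap=None):
    """Apply the exclusion rules after sampling configuration (pj, tk).

    descending: thr(pj, tk) <= thr(pj, tk - 1) was observed (rule 1).
    within_cap: True if pwr(pj, tk) <= C (rule 2),
                False if pwr(pj, tk) > C (rule 3), None if unknown.
    candidates is a set of (p, t) pairs; returns the pruned set."""
    out = set(candidates)
    if descending:            # rule 1: descending part reached, for any P-state
        out = {(p, t) for (p, t) in out if t < tk}
    if within_cap is True:    # rule 2: slower P-states at the same t are dominated
        out = {(p, t) for (p, t) in out if not (t == tk and p > pj)}
    if within_cap is False:   # rule 3: more power can only violate the cap again
        out = {(p, t) for (p, t) in out if not (t >= tk and p <= pj)}
    return out
```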
Phase 1: this phase starts from the initial configuration (p_s, t_s) and, keeping the P-state fixed, aims at finding the number of threads providing the highest throughput without violating the power cap. We denote as (p_s, t_1) the configuration returned by this phase. It performs a search inspired by the hill-climbing technique. Specifically, it increments by one the number of threads while the throughput increases and the power cap is not violated (since it is moving along the ascending part of the throughput curve), then it returns the configuration with the highest throughput within the power cap. If the throughput does not grow after the first increment, or the power cap is violated, it starts decreasing the number of threads (since it is moving along the descending part of the throughput curve, or the power consumption has to be reduced) until the throughput starts decreasing. Then, it returns the configuration with the highest throughput if it does not violate the power cap; otherwise, if all the explored configurations violate the power cap, or if the exploration reaches a number of threads equal to 1, it returns (p_s, 1). In the example in Figure 3, the exploration performed in Phase 1 is represented by the green line. It starts with (p_s, t_s), where p_s = 6, then increases the number of threads and terminates when it explores a configuration that violates the power cap. It returns (p_s, t_1), which is within the power cap.

Phase 2:
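Phase 1 is essentially a one-dimensional hill climb on t at a fixed P-state. A condensed sketch, assuming callables thr(p, t) and pwr(p, t) that return measurements for a configuration (measurement noise and ties are ignored for simplicity):

```python
def phase1(ps, ts, C, thr, pwr, t_max):
    """Hill-climb on the number of threads at fixed P-state ps.

    Returns (ps, t) where t has the highest observed throughput that does
    not violate the cap C, falling back to t = 1."""
    best_t, best_thr = None, float("-inf")

    def visit(t):
        nonlocal best_t, best_thr
        if pwr(ps, t) <= C and thr(ps, t) > best_thr:
            best_t, best_thr = t, thr(ps, t)

    t = ts
    visit(t)
    # Climb upward while throughput grows and the cap holds.
    while t < t_max and pwr(ps, t + 1) <= C and thr(ps, t + 1) > thr(ps, t):
        t += 1
        visit(t)
    if t == ts:
        # No upward progress: walk down the descending (or cap-violating) part.
        while t > 1 and (pwr(ps, t) > C or thr(ps, t - 1) >= thr(ps, t)):
            t -= 1
            visit(t)
    return (ps, best_t if best_t is not None else 1)
```

On a unimodal curve this stops either at the throughput peak or at the cap frontier, whichever comes first, which is exactly the behavior described above.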
This phase starts from the configuration returnedby phase 1 ( p s , t ) and is executed only if this configurationdoes not violate the power cap (otherwise it jumps to the nextphase). The goal of phase 2 is to continue the exploration forlower values of P-state (we remark that lower values of
P-state lead to both higher core performance and higher powerconsumption). Specifically, it explores by moving from thecurrent configuration ( p, t ) to configuration ( p − , t ) . Ifthe latter configuration does not violate the power cap,it continues to reduce the value of P-state . If the explo-ration reaches a configurations such that pwr ( p, t ) > C ,it starts reducing the number of threads, thus moving toconfiguration ( p, t − , then ( p, t − and so on (sincedecreasing the number of concurrent threads reduces thepower consumption) until the power cap is not violated.After, it restarts the exploration by decreasing the value of P-state . The exploration terminates when p reaches and thecurrent configuration does not violate the power cap, whenit reaches configuration (0 , , or when a configuration with t = 1 violates the power cap. Then, the phase returns theexplored configuration with the highest throughput withinthe power cap, that we denote as ( p , t ) , or configuration (0 , . In Figure 3, the exploration of Phase 2 is shown bythe blue line. It starts from ( p s , t ) = (6 , , then exploresup to configuration (0 , . It returns ( p , t ) = (3 , . Phase 3:
This phase starts again from the configuration (p_s, t_1) returned by Phase 1 and aims at continuing the exploration for higher values of the P-state. If the configuration returned by Phase 1 is such that t_1 is the number of threads providing the highest throughput and is within the power cap, Phase 3 is not executed (since incrementing the value of the P-state would only lead to lower throughput). Otherwise, it increments by one the value of the P-state and starts increasing the number of concurrent threads until the power cap is violated or the throughput decreases. In the former case, if the maximum value of the P-state has not been reached, it increments by one the value of the P-state and starts again incrementing the number of threads. In all the other cases the exploration terminates. Then, the phase returns the explored configuration with the highest throughput within the power cap, which we denote as (p_3, t_3), or configuration (p_max, t_1), where p_max is the maximum value of the P-state. In Figure 3, the exploration of Phase 3 is represented by the yellow line. It starts from (p_s, t_1), then explores up to a configuration with p_3 = 8, where it stops since the throughput decreases (in the example, the number of concurrent threads providing the highest throughput is equal to 15). It returns (p_3, t_3), with p_3 = 8.

Final phase: this phase selects the configuration with the highest throughput, among the configurations (p_s, t_1), (p_2, t_2) and (p_3, t_3), which does not violate the power cap, or returns null if none of them is within the power cap.

B. Proof of Optimality
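Putting the phases together, the whole procedure can be prototyped and sanity-checked against a brute-force scan of the grid on a synthetic workload model. The sketch below follows the three-phase path in a simplified form and is not the authors' implementation:

```python
def explore(ps, ts, C, thr, pwr, p_max, t_max):
    """Three-phase exploration sketch: best sampled (p, t) under cap C."""
    sampled = {}

    def visit(p, t):
        if 0 <= p <= p_max and 1 <= t <= t_max and (p, t) not in sampled:
            sampled[(p, t)] = (thr(p, t), pwr(p, t))

    # Phase 1: hill-climb on t at the starting P-state ps.
    t = ts
    visit(ps, t)
    while t < t_max and pwr(ps, t + 1) <= C and thr(ps, t + 1) > thr(ps, t):
        t += 1; visit(ps, t)
    while t > 1 and (pwr(ps, t) > C or thr(ps, t - 1) >= thr(ps, t)):
        t -= 1; visit(ps, t)
    t1 = t

    # Phase 2: lower P-states (faster cores), shedding threads at cap violations.
    p, t = ps - 1, t1
    while p >= 0 and t >= 1:
        visit(p, t)
        if pwr(p, t) > C:
            t -= 1          # reduce threads to get back under the cap
        else:
            p -= 1          # keep speeding the cores up

    # Phase 3: higher P-states (slower cores), adding threads while useful.
    p, t = ps + 1, t1
    while p <= p_max:
        visit(p, t)
        if t < t_max and pwr(p, t + 1) <= C and thr(p, t + 1) > thr(p, t):
            t += 1
        elif t < t_max and pwr(p, t + 1) > C:
            p += 1          # cap hit: a slower P-state frees power for threads
        else:
            break           # throughput stopped growing: terminate

    ok = [(c, th) for c, (th, pw) in sampled.items() if pw <= C]
    return max(ok, key=lambda x: x[1])[0] if ok else None
```

On models satisfying the hypotheses of the next subsection, the configuration returned matches an exhaustive search while sampling only a thin path around the cap frontier.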
In this subsection we prove that the proposed exploration procedure returns the optimal configuration, i.e. the configuration (p, t)* that provides the highest level of performance with a power consumption lower than the power cap. The proof assumes that the observations discussed in Section III always hold true. Specifically, we take as hypotheses that:
1) the shape of the throughput curve, for each fixed P-state and varying number of active threads, is characterized by an initial ascending part followed by a descending part. Also, one of these parts may be missing;
2) if thr(p_j, t_k) > thr(p_j, t_k + 1) then thr(p, t_k) > thr(p, t_k + 1) for each p (the throughput curves preserve their shape while varying the P-state);
3) if p_j < p_k then thr(p_j, t) > thr(p_k, t) for each fixed t (decreasing the P-state with a fixed number of threads always increases the throughput);
4) pwr(p, t) ≥ pwr(p_j, t_k) for each p ≤ p_j and t ≥ t_k (decreasing the P-state or increasing the number of threads increases the power consumption);
5) the workload is static during the exploration procedure;
6) the samples of throughput and power consumption obtained for each explored configuration are equivalent to their real values.
We should note that hypotheses 5 and 6 are necessary for any exploration-based solution that relies on data gathered at run-time. In particular, they guarantee that if the optimal configuration is explored it will also be selected by the algorithm as the best configuration. Hypotheses 1, 2, 3 and 4, as shown in Figure 2, reflect properties that appear to be valid for all the considered workloads.
Proof:
We can partition the search space defined by theconfigurations ( p, t ) in three distinct sub-spaces, delimitedby the starting configuration ( p, t ) s , such that: • p = p s ; • p < p s ; • p > p s ;We denote as the optimal configuration for the sub-spaceof configurations S , the configuration ( p, t ) q ∈ S thatprovides the highest performance while operating within thepower cap compared to all the configurations ( p, t ) ∈ S .Considering that the sum of these three sub-spaces coversthe complete space of configurations, the configuration thatprovides the highest performance between the optimal con-figuration of all the sub-spaces will be the optimal configu-ration ( p, t ) ∗ . Thus, proving that the exploration procedurefinds the optimal configuration for each of these sub-space is equivalent to prove that it finds the optimal configurationfor the whole space of configurations. p = p s : Phase 1 of the exploration procedure starts from ( p, t ) s and searches for the number of active threads at P-state p s that maximizes performance with power con-sumption within the power cap. Therefore, it exploresthe considered sub-space. Phase 1 is based on an hill-climbing optimization algorithm which generally finds thelocal optima, which might not be the best possible solu-tion. However, for hypothesis 1, the local optima is alsothe global optima as it is not possible for a non-globaloptima to exist for a function with a single ascending partfollowed by a single descending part or, in case one ofthose is missing, for a monotonic function. We should notethat, unlike traditional hill-climbing algorithms, the selectedconfiguration might not be the global optima as it mightrequire a power consumption higher than the power cap. 
Inthis case, exploiting the shape of the throughput function,the configuration with the highest number of active threadwith a power consumption lower than the power cap isselected which is clearly the optimal configuration for thesub-space as the optimum is always located either at theend of the ascending part or at 1 thread if the ascendingpart does not exist. Else, the global optima is selected. Thus,phase 1 selects the optimal configuration for the sub-spaceof configurations with p = p s . p < p s : Assume that the optimal configuration ( p, t ) k +1 forthe sub-space of configurations with p = k + 1 is known.The optimal configurations for the sub-space with p = k must have t k < = t k +1 since if t k +1 is the optimal numberof threads for p = k + 1 it must be that either: • throughput with t = t k +1 + 1 is lower than with t = t k +1 with p = k + 1 . Thus, for hypothesis 2,all configurations ( p, t ) where t > t k +1 , for whichever p , can not be optimal; • pwr ( p k +1 , t k +1 +1) > C which implies, for hypothesis4, that for p = k < k + 1 all configurations with t > = t k +1 would have a power consumption higher than thepower cap and thus can not be optimal.If t k +1 = 1 we can already conclude that the optimum forthe sub-space of configurations with p = p k is ( p = k, t =1) . Differently, for t > we can state that the throughputat P-state k monotonically increases in the range from 1 to t = t k +1 . If t k +1 > it must be that thr ( p k +1 , t k +1 ) > = thr ( p k +1 , t k +1 − which implies for hypothesis 2 that thr ( p k , t k +1 ) > = thr ( p k , t k +1 − and consequently thatthe throughput curve with p = k for t < t k +1 is inthe ascending part. Therefore, as performed by Phase 2,starting the exploration of the sub-space of configurationswith p = k from the configuration p k , t k +1 and, if necessary,decreasing the number of threads until the power cap isreached assures that the the optimal configuration for thesub-space is explored. 
The configuration returned by phase 1 (which is the optimal configuration for the sub-space with p = p_s) is used as the base case. The union of the sub-spaces of configurations with {p = j | j ∈ [0, p_s − 1]} is equal to the sub-space of configurations with p < p_s. Therefore, the configuration with the highest performance among the optimal configurations of these sub-spaces is the optimal configuration for the entire sub-space with p < p_s.

p > p_s: Assume that the optimal configuration (p, t)_{k−1} for the sub-space of configurations with p = p_{k−1} is known. We can state that thr(p_k, t) ≤ thr(p, t)_{k−1} for each t ≤ t_{k−1} since:
• if t_{k−1} is the optimal value of t for p = p_{k−1}, it must be true that thr(p_k, t_{k−1}) ≥ thr(p_k, t) for each t ≤ t_{k−1} (hypothesis 2);
• thr(p_k, t_{k−1}) ≤ thr(p_{k−1}, t_{k−1}) (hypothesis 3).
In addition, we can state that if thr(p_k, t_{k−1}) > thr(p_k, t_{k−1} + 1), then for each configuration (p, t) with p > p_k it must be true that thr(p, t) < thr(p_k, t_{k−1}), since:
• by hypothesis 2, increasing the number of threads over t_{k−1} does not improve the throughput for any P-state;
• by hypothesis 3, increasing the P-state reduces the throughput.
Therefore, considering that pwr(p_k, t_{k−1}) < C (hypothesis 4) and that (p_k, t_{k−1}) is included in the sub-space of configurations with p > p_s, if thr(p_k, t_{k−1}) > thr(p_k, t_{k−1} + 1), then no configuration in the sub-space of configurations with p > p_k can be the optimal configuration of the sub-space with p > p_s. Starting from the configuration returned by phase 1 (which is the optimal configuration for p = p_s), phase 3 increments the P-state and increments the number of threads until the power cap is violated or the throughput decreases, which ensures that the optimal configuration for the sub-space is explored. If the throughput decreases when increasing the number of threads, or when the maximum P-state is reached, phase 3 is completed. By induction, it explores the optimal configuration of each sub-space of configurations with {p = j | j ∈ [p_s + 1, p_max]}, excluding the sub-spaces that we proved cannot contain the optimum. Therefore, the configuration with the highest performance among the optimal configurations of the considered sub-spaces is the optimal configuration for the entire sub-space with p > p_s.

C. Time Complexity Analysis
The time complexity of the exploration procedure is expressed as the number of exploration steps required by the procedure to return the optimal configuration (p, t)*. Let p_tot be the total number of P-states supported by the system and t_tot the maximum number of concurrent threads for the specific application, which, in HPC applications, is usually set equal to the number of physical/virtual cores available in the system. Considering that the exploration procedure does not explore the descending part of the throughput curve, we could also define t_tot as the maximum number of concurrent threads that provide, for at least a portion of the execution time, the highest performance for the specific application run on the specific hardware. We analyze the time complexity of each exploration phase separately:
• phase 1: each configuration with a different number of concurrent threads and p = p_s is explored at most once, thus the time complexity is O(t_tot);
• phase 2: starting from a configuration (p, t), phase 2 either reduces the value of p or reduces t. Starting from the configuration returned by phase 1, it can reduce p at most p_tot times and reduce t at most t_tot times. Thus, the time complexity of phase 2 is O(p_tot + t_tot);
• phase 3: starting from a configuration (p, t), phase 3 either increments the value of p or increments t. Thus, by the same reasoning used for phase 2, the time complexity of phase 3 is O(p_tot + t_tot).
Therefore, the overall time complexity of the exploration procedure is O(p_tot + t_tot).

D. The Enhanced Tuning Strategy
In this section we present an enhancement of the tuning strategy that further improves performance and reduces the power cap violation probability. It exploits the possible gap between the power cap value and the power consumption of configuration (p, t)*, which is due to the discrete domain of power consumption values of the different configurations. Specifically, it is unlikely that pwr(p, t)* is exactly equal to C. Rather, we can have C − pwr(p, t)* > 0. Statistically, the greater the difference in power consumption between adjacent configurations, the larger C − pwr(p, t)*. To reduce the performance penalization due to this gap, the enhanced tuning strategy relies on continual fluctuations between two configurations (rather than always remaining in (p, t)*) along the time interval between the end of one exploration procedure and the start of the next one. All the phases are equal to the previous tuning strategy, except that an additional configuration (p, t)^H is selected. (p, t)^H is the configuration with higher throughput than (p, t)* (if any) such that the ratio between throughput and power consumption is the largest among the explored configurations. Thus, it is the configuration with the highest efficiency in terms of throughput over power consumption. We note that, since (p, t)* is the configuration within the power cap with the highest throughput, (p, t)^H is a configuration that violates the power cap.

At the end of the exploration procedure, the enhanced strategy continuously fluctuates between (p, t)* and (p, t)^H in order to take advantage of the higher throughput of configuration (p, t)^H, while preventing the average power consumption, over a given time window w, from exceeding C. To this aim, if the average power consumption exceeds C, then configuration (p, t)* is set. Conversely, when the average power consumption falls below C, configuration (p, t)^H is set, and so on.
To limit the fluctuation frequency, an upper and a lower tolerance threshold, C + l and C − l, are used. In real scenarios, the length of w can be set equal to the actual time window used to calculate the power consumption of the machine.

Another factor that may impact the effectiveness of our technique is the variation over time of the power consumption of the selected configurations. For example, pwr(p, t)* may change due to variations of the workload profile, thus leading to power cap violations. If this happens, with our tuning strategy it may not be detected until the next exploration procedure starts. To limit the effect of this delay on the power cap violation, the enhanced tuning strategy selects a third configuration, which we denote as (p, t)^L. It is the configuration with lower power consumption than (p, t)* (if any) with the highest efficiency in terms of throughput over power consumption. Thus, if pwr(p, t)* exceeds C, the strategy fluctuates between (p, t)* and (p, t)^L rather than between (p, t)* and (p, t)^H. This reduces the probability of power cap violation as long as the workload profile variation is such that pwr(p, t)^L < C. Similarly, for the same goal of promptly adapting to workload variations, if pwr(p, t)^L > C (pwr(p, t)^H < C), the P-state of all configurations is shifted up (down) by one.

V. EXPERIMENTAL RESULTS
In this section, we present the results of an experimental study we conducted to assess the proposed power capping technique. As in previous studies on power capping (e.g., [2], [12]), we consider two evaluation metrics: the application performance and the average power cap error. The latter is the average difference between the power consumption and the power cap value along the time intervals where the power cap is violated. We run experiments for all the application scenarios that we considered in our preliminary study (see Section III). Thus, we use Intruder, Genome, Vacation and Ssca2 as benchmark applications from STAMP, with both locks and transactions as the synchronization method. These applications were specifically selected to cover a wide range of different scalability scenarios. We compared our technique with:
1) a reference power capping technique, referred to as baseline, that selects the configuration with the lowest P-state from the set of configurations with the highest number of threads among the configurations with power consumption lower than the power cap. It implements the selection strategy proposed in [2];
2) a technique, referred to as dual-phase, that initially tunes the number of threads starting from the lowest P-state, and subsequently tunes the CPU P-state keeping the number of threads fixed. The initial phase is equivalent to phase 1 of the proposed exploration procedure. The selection strategy of this technique is similar to the one presented in [3].
The comparison with the first technique allows us to quantify the performance benefits achievable by properly allocating the power budget taking into consideration the scalability of the specific multi-threaded application. Additionally, we considered the dual-phase technique in the evaluation to quantify the possible performance benefits achievable by exploring the whole bi-dimensional space of configurations over two distinct mono-dimensional explorations, which might not find the optimal configuration. We should note that, despite exploring a larger set of configurations, the proposed technique has the same time complexity as the dual-phase technique.
A. Implementation details
We developed a controller module that implements our technique and the baseline technique. All the software of our experimental study, including the benchmark applications, is developed in the C language for Linux. The controller module alters the number of concurrent threads by exploiting the pause() system call and thread-specific signals for reactivation. The CPU P-state is regulated through the cpufreq Linux sub-system, while energy readings are obtained from the powercap sub-system. Both these sub-systems are included by default in recent versions of the Linux kernel and expose their respective interfaces through the /sys virtual file system.

The exploration procedure relies on statistical results of the previous step, such as average power consumption and throughput, to select the next configuration to explore. Each step of statistics collection is determined by a fixed amount of processed units of work. We cannot rely on application-independent metrics, such as the number of retired CPU operations, since they would also count instructions related to spin-locking or aborted transactions that do not provide execution progress. For applications based on locks, we define the unit of work as the execution of one critical section guarded by a global lock. Differently, for transactions we define the unit of work as one commit. The statistics are collected in a round-robin fashion by all the active threads to reduce execution overhead and provide NUMA-aware results on modern multi-package systems.

For the executions presented in the experimental results, we set the units of work per step to 5000, resulting in tens of milliseconds per step for all the considered applications and synchronization methods. In addition, we set to 150 the number of steps after which the exploration procedure is restarted.
B. Experimental results
We consider both tuning strategies of our technique, referred to as the basic strategy and the enhanced strategy. We analyze the performance results of our strategies in terms of speed-up with respect to the throughput of the baseline technique. As anticipated, we also compare the average power cap error. For each test case, we present the results with three different power cap values, i.e., 50, 60 and 70 watts. (The source code is available at github.com/StefanoConoci/STMEnergyOptimization.)

Figure 4. Throughput Speed-up and Power Cap Error with Locks

Results for the case of lock-based synchronization are reported in Figure 4. Overall, the results show an evident performance improvement with both strategies of our technique with respect to the baseline technique. Only for the case of Genome is the performance comparable. In the best cases, i.e., with Intruder, the performance improvement reaches 2.2x (2.32x) and 2.15x (2.19x) for the basic (enhanced) strategy when the power cap is equal to 50 and 60 watts, respectively, and it is close to 1.9x for both the proposed strategies with the power cap set to 70 watts. The enhanced strategy further improves performance compared to the baseline technique by up to 12.5% in Intruder at 50 watts, and by 5.3% on average. For lock-based synchronization, the results of the dual-phase technique are similar to those achieved by the basic strategy.

As for the power cap error, with both the strategies of our technique and the dual-phase technique, it is clearly reduced compared to the baseline. Also, the results show that with the enhanced strategy in many cases there is a reduction of the power cap error compared to the basic strategy. Indeed, except for the case of Vacation with the power cap equal to 60 watts, where it is increased by less than 0.1%, the error with the enhanced strategy is lower. In the best case it is about 0.1%, while it is about 2% and 4.8% with the basic strategy and the baseline technique, respectively.

Results for the case of transaction-based synchronization are reported in Figure 5. Overall, the performance results confirm the advantage of our technique compared to the baseline technique. However, with transactions the speed-up is generally slightly lower than with locks. In the best cases, it reaches about 1.9x. Also, there is one case (Genome with power cap = 50 watts) where it is slightly less than 1 with both strategies. As for the power cap error, it increases with the basic strategy compared to the case with locks, exceeding the error of the baseline technique in most cases. However, it does not exceed 2% in any case. The error is considerably reduced with the enhanced strategy. Particularly, it is clearly lower than with the baseline technique for all applications when the power cap is equal to 50 watts, and for Intruder when the power cap is equal to 60 watts, while the results are similar for the other power cap values. In addition, the enhanced strategy can further increase performance by up to 8% (Vacation with power cap set to 50 watts) and by 3.5% on average.
Differently from the lock-based case, both strategies of the proposed technique show a higher speed-up compared to the dual-phase technique, by up to 21% (Ssca2 with the power cap set to 50 watts), and by 7.7% and 10.7% on average for the basic strategy and the enhanced strategy, respectively.
C. Analysis of the Results
As a first observation, the results show that in various cases with locks, the error of our technique and of the dual-phase technique is very close to zero. This is due to the fact that, in our study, the scalability is limited for all applications when using locks. In these scenarios, the number of concurrent threads providing the highest throughput (which is selected by our technique and by the dual-phase technique) is low, thus the P-state can be lowered down to 0 while the power cap frontier is still far. This keeps the error very close to 0, since it is unlikely that the power cap is violated during the exploration procedure or due to workload variations.

The error is generally reduced with the enhanced strategy compared to the basic strategy, while also improving performance. This arises since the former is able to react, along the time between two consecutive exploration procedures, to the possible variations of the power consumption of the selected configurations, as discussed at the end of Section IV-D.

The speed-up with our technique is less than 1 in only one case, i.e., for Genome with transactions when the power cap value is equal to 50 watts. We note that Genome with transactions is highly scalable (see Figure 2). This leads both the baseline technique and our technique to select 20 as the number of concurrent threads. As shown by the plot in Figure 2, the throughput of Genome with transactions is subject to noise when close to 20 threads. Also, we remark that our technique reacts to workload variations also in terms of scalability. In this scenario, these factors cause lower performance with our technique due to the noise, which sometimes (wrongly) leads to temporarily selecting a less than optimal number of concurrent threads.

Figure 5. Throughput Speed-up and Power Cap Error with Transactions

As expected, for lock-based synchronization the proposed technique shows results similar to the dual-phase technique, since both techniques return the same configuration when the ascending part of the throughput curve is missing. For transaction-based synchronization, the highest speed-up improvements over the dual-phase technique are obtained for Ssca2 and Genome, which show a less than linear ascending part of the throughput curve for each fixed P-state (Figure 2). As the most significant example, in Ssca2 the throughput only slightly increases when increasing the number of threads from 6 to 15, which makes the dual-phase technique select a configuration with 15 threads. Differently, the proposed technique allocates the power budget more efficiently by selecting a configuration with a lower number of threads at an increased frequency. We should note that the benefits of the proposed technique over the dual-phase technique are not limited to applications that rely on transaction-based synchronization. Effectively, performance benefits should be obtained for any application with a throughput function that shows an ascending part followed by a descending one, or only an ascending part that is less than linear.

Overall, the results of our experimental study show that it is possible to achieve significant performance benefits by appropriately selecting the number of concurrent threads and the CPU P-state taking into consideration the scalability of the specific multi-threaded application. As expected, compared to the baseline technique, the proposed solution achieves the best results with poorly scalable applications, i.e., where contention is not minimal. Compared to the dual-phase technique, the exploration of the whole bi-dimensional space of configurations performed by the proposed technique can provide an appreciable improvement in performance for some applications, while achieving the same results for others. Finally, the enhanced strategy manages to further improve performance and reduce the power cap error over the basic strategy.

VI. CONCLUSIONS
In this work we introduced a novel power capping technique that, by jointly tuning the CPU performance state and the number of concurrent threads, improves the performance of multi-thread applications, specifically of applications that show less than linear scalability due to contention. Exploiting the results of a preliminary analysis, the proposed technique can return in linear time the optimal configuration, i.e., the one providing the highest performance among all configurations with power consumption lower than the power cap. We also presented an enhanced strategy that, by fluctuating between different configurations, optimizes the dynamic allocation of the power budget, resulting in both increased performance and reduced power cap error. Compared to the baseline technique, which always assigns to the application the highest possible number of cores, our strategy provides an average speed-up of 1.48x, with individual test cases reaching up to 2.32x. Furthermore, we showed that by exploring the overall bi-dimensional space of configurations, the proposed technique can improve performance by up to 21% compared to techniques that tune the number of threads and the CPU performance state independently.
REFERENCES

[1] V. Pallipadi and A. Starikovskiy, “The ondemand governor: past, present and future,” in Proceedings of the Linux Symposium, vol. 2, pp. 223–238, 2006.
[2] S. Reda, R. Cochran, and A. Coskun, “Adaptive power capping for servers with multithreaded workloads,” IEEE Micro, vol. 32, no. 5, pp. 64–75, Sep. 2012. [Online]. Available: http://dx.doi.org/10.1109/MM.2012.59
[3] H. Zhang and H. Hoffmann, “Maximizing performance under a power cap: A comparison of hardware, software, and hybrid techniques,” in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’16. New York, NY, USA: ACM, 2016, pp. 545–559.
[4] Y. Liu, G. Cox, Q. Deng, S. C. Draper, and R. Bianchini, “FastCap: An efficient and fair algorithm for power capping in many-core systems,” ISPASS 2016 - International Symposium on Performance Analysis of Systems and Software, no. 3, pp. 57–68, 2016.
[5] Q. Deng, L. Ramos, R. Bianchini, D. Meisner, and T. Wenisch, “Active low-power modes for main memory with memScale,” IEEE Micro, vol. 32, no. 3, pp. 60–69, 2012.
[6] H. David, C. Fallin, E. Gorbatov, U. R. Hanebutte, and O. Mutlu, “Memory Power Management via Dynamic Voltage/Frequency Scaling,” Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 31–40, 2011.
[7] A. Kanduri, M.-H. Haghbayan, A. M. Rahmani, P. Liljeberg, A. Jantsch, N. Dutt, and H. Tenhunen, “Approximation knob: power capping meets energy efficiency,” Proceedings of the 35th International Conference on Computer-Aided Design - ICCAD ’16, pp. 1–8, 2016.
[8] B. Su, J. Gu, L. Shen, W. Huang, J. L. Greathouse, and Z. Wang, “PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration,” pp. 445–457, 2014.
[9] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun, “STAMP: Stanford transactional applications for multi-processing,” in Proc. 4th IEEE Int. Symposium on Workload Characterization. IEEE, 2008, pp. 35–46.
[10] N. Shavit and D. Touitou, “Software transactional memory,” in Proc. 14th ACM Symposium on Principles of Distributed Computing. ACM, 1995, pp. 204–213.
[11] “Intel 64 and IA-32 architectures software developer’s manual, volume 3c: System programming guide, part 3,” (Accessed on 06/26/2017).
[12] C. Lefurgy, X. Wang, and M. Ware, “Power capping: A prelude to power shifting,”