Dim Silicon and the Case for Improved DVFS Policies
Technical Report
Mathias Gottschlag, Yussuf Khalil, Frank Bellosa
Operating Systems Group, Karlsruhe Institute of Technology
E-mail: [email protected]
Abstract
Due to thermal and power supply limits, modern Intel CPUs reduce their frequency when AVX2 and AVX-512 instructions are executed. As the CPUs wait for a fixed delay before increasing the frequency again, the performance of some heterogeneous workloads is reduced. In this paper, we describe parallels between this situation and dynamic power management as well as between the policy implemented by these CPUs and fixed-timeout device shutdown policies. We show that the policy implemented by Intel CPUs is not optimal and describe potential better policies. In particular, we present a mechanism to classify applications based on their likeliness to cause frequency reduction. Our approach takes either the resulting classification information or information provided by the application and generates hints for the DVFS policy. We show that faster frequency changes based on these hints are able to improve performance for a web server using the OpenSSL library.

1 Introduction

In recent years, performance became increasingly limited by power consumption as Dennard scaling has come to an end [33]. The effect where the available power budget allows for different maximum frequencies depending on the number of cores is called dim silicon [17]. The same effect also applies to different instruction mixes. As different operations cause different switching activity on the chip, they consume different amounts of energy, so complex instructions have to be executed at a lower frequency. Similarly, if unused parts of the chip are power-gated because they are not required by simpler operations, the resulting power savings can be used to increase the frequency.
The power budget is not only limited due to thermal constraints but also due to power supply limitations, where even short-term transgressions could cause instability due to voltage drops.¹

As the large size of the SIMD registers used by recent SIMD instruction set extensions causes high power variation, recent CPUs have started to vary their frequency based on the workload to maximize performance under power budget constraints. For example, Intel CPUs reduce their clock speed as soon as code containing AVX2 and AVX-512 instructions is executed [5]. However, every frequency change causes some overhead [26], because the system has to wait for voltages to change and clock signals to stabilize.² Therefore, even if no AVX2 and AVX-512 instructions are executed anymore, these CPUs delay increasing the clock speed [6]. This mechanism ensures that if the code continues executing these vectorized instructions shortly after, no excessive numbers of frequency changes are performed.

For some workloads, the delay causes overhead, though, as parts of the software which could be executed at a higher frequency are needlessly slowed down. For example, a simple benchmark using the nginx web server is slowed down by 10% if the SSL library used by the web server is compiled with support for AVX-512, as the CPU frequency is reduced during AVX-512-heavy encryption and decryption, but the frequency change also affects the non-vectorized parts of the web server [20].

A policy similar to this constant-delay policy is employed in the area of dynamic power management. In this area, a similar trade-off is found, as disabling devices saves energy but incurs overhead both during shutdown and reactivation. The widely-used fixed-timeout policy shuts down devices after a fixed delay [7], where the delay is usually equal to the break-even time in order to improve worst-case power consumption [18]. In the area of power management, research has brought up a plethora of other shutdown strategies promising higher energy savings [7] and has shown that input from the application can be used to further improve power efficiency [34]. It is likely that similar approaches can be used to reduce DVFS overhead for partially power-intensive workloads. In this work, we show that, in particular, input from the application can be used to predict whether immediate reclocking makes sense. Our contributions are as follows:

• We describe the parallels between DVFS in dim silicon scenarios and dynamic power management. The duality allows applying research from the area of dynamic power management to the former.

• We determine the frequency change cost on a current server system and calculate the break-even time for frequency changes. We use this result to show that the delay specified by Intel does not provide optimal worst-case behavior.

• We show that application knowledge about execution phases or the instruction types used by individual processes can be used to improve performance by passing hints about future instruction set usage to the DVFS policy. We validate this finding through simulation of different DVFS policies on a web server workload.

• We describe a mechanism to determine at runtime whether individual processes will trigger frequency reductions due to their usage of power-intensive instructions. Unlike existing approaches, our design can reliably distinguish between all three frequency levels provided by current Intel CPUs. This information can be used as input for an improved DVFS policy to trigger frequency changes during context switches.

¹ In our tests, recent Intel CPUs have reported maximum current as the most common reason for frequency changes in AVX-512-heavy workloads.
² The frequency can only be increased when sufficient voltage is available, leading to frequency change delays and a resulting "underclocking loss" [26].
Starting with the Haswell microarchitecture, which introduced the AVX2 instruction set, Intel introduced a separate maximum frequency for AVX2-intensive code segments [15]. The Skylake microarchitecture added AVX-512 instructions and a third AVX-512 frequency level [29]. Table 1 shows the maximum turbo frequencies for the Intel Xeon Gold 6130 server processor. The maximum frequency depends both on the number of active cores – with larger numbers of active cores requiring a larger frequency reduction – as well as on the type of instructions executed. AVX-512 causes a particularly large frequency reduction due to the complexity of operations on 512-bit vectors. As described above, the reduced frequency is maintained longer than necessary to prevent excessive reclocking overhead.

There are two situations where this delay can cause the frequency reduction to negatively affect unrelated non-AVX code and cause a significant performance reduction. First, on a system with simultaneous multithreading (SMT), if one of the hardware threads causes the frequency of the physical core to be reduced, the other hardware threads on the same core also execute at a lower frequency even if their code is not as energy-intensive [22]. Second, in heterogeneous applications consisting of power-intensive and less power-intensive parts – or if the OS frequently switches between power-intensive and less power-intensive tasks – the delay before increasing the frequency causes reduced performance for the less power-intensive code [13].

As an example for the latter, previous work describes overhead caused by AVX-512 in a web server workload, where the nginx web server provides up to 10% lower performance when the SSL library uses cryptography primitives implemented with AVX-512 instructions, because unrelated web server code is slowed down following calls into the SSL library [20].
We replicated this experiment; the result is shown in Figure 1 alongside other experiments with workloads consisting of multiple different processes to show that the performance impact is also present in such scenarios. For these other experiments, we execute different non-AVX workloads while concurrently executing the x265 video encoder configured to use AVX, AVX2, or AVX-512 instructions. The experiments are conducted on a system with an Intel Xeon Gold 6130 processor.

[Table 1: Maximum turbo frequencies of the Intel Xeon Gold 6130 at the normal, AVX2, and AVX-512 levels for 1-2, 3-4, 5-8, 9-12, and 13-16 active cores; the numeric values are not recoverable from the extraction.]

Our first multi-process experiment determines the impact on an interactive web server workload: We executed the nginx web server alongside the x265 video encoder and configured the wrk2 client to generate a fixed number of requests to the web server. This setup imitates the scenario where a web server is not fully utilized and the remaining CPU time is used for background batch tasks. Figure 1 shows the normalized CPU time required by the nginx web server to serve an unencrypted static file ("nginx+x265"). The results show a 6.6% performance impact when the background process uses AVX2 instructions and a 21.8% performance impact for AVX-512. As the web server is not operating at 100% utilization, the background process is often executed in between two consecutive requests or is executed in parallel on the other hardware thread of the same core, causing a particularly large performance impact.

[Figure 1: CPU time (normalized) required to run various benchmarks under the influence of different instruction sets (AVX, AVX2, AVX-512) to measure the impact of AVX frequency reduction.]

To show that the problem affects both interactive and batch workloads, we also execute various benchmarks from the Parsec [8] benchmark suite and the Phoronix Test Suite (PTS) [1] benchmarks in parallel to the x265 video encoder. As shown in Figure 1, all these benchmarks are also affected by the frequency changes caused by x265. The Parsec benchmarks experience an average performance reduction of 10.0% for AVX-512. Similarly, the PTS benchmarks are slowed down by 12.4%.

As described above, one major mechanism for slowdown that is targeted by other approaches [22] is that software on one hardware thread slows down other hardware threads of the same core. To show that some of the slowdown is also experienced on systems without hyperthreading, we repeat all the benchmarks on a system with hyperthreading disabled. The results of this experiment are shown in Figure 2 and show that CPU-intensive non-interactive workloads are not significantly slowed down anymore once hyperthreading is disabled, as the system does not switch between the processes often enough for frequency change delays to have a significant effect. For example, on a system with the default Linux CFS scheduler, we observe only one context switch every 10 to 20 ms for the blackscholes workload, whereas frequency increases are only delayed by less than one millisecond.
Although disabling hyperthreading reduces the performance of the system and is therefore not a viable technique against the overhead caused by AVX-heavy code in these scenarios, other techniques such as core specialization [13] and core scheduling [22] can make sure that whenever possible either both hyperthreads are executing AVX-intensive code or none of them is.

Overhead caused by hyperthreading is out of the scope of this paper, though. Instead, the goal of our approach is to reduce the overhead in applications which periodically execute short sections of AVX-512 or AVX2 code as well as in workloads which frequently switch between AVX-512 or AVX2 and non-AVX applications on a single core. From the benchmarks shown in Figure 2, an example for the former is the nginx/OpenSSL benchmark, which executes AVX-512 instructions only when OpenSSL functions are called. The nginx/x265 benchmark as well as the Apache, MySQL and SQLite benchmarks from PTS, instead, trigger frequent context switches between the AVX-512-enabled background task and the benchmarked application and are therefore examples for the latter behaviour. Note that such workloads are often latency-critical and therefore particularly suffer from degraded performance.

[Figure 2: CPU time required for the experiment shown in Figure 1 on a system with hyperthreading disabled. Note that benchmarks with few context switches do not suffer from frequency changes anymore once hyperthreading is disabled, whereas benchmarks with frequent switches between AVX and non-AVX code (i.e., heterogeneous programs with short AVX-heavy phases as well as workloads consisting of an interactive service and an AVX-heavy background task) still suffer from the frequency change delay.]

These types of benchmarks are the benchmarks which show overhead even when hyperthreading is disabled: For AVX-512, the nginx benchmarks are slowed down by 7.0% on average, whereas the three PTS benchmarks are slowed down by 12.4% on average.

Due to the frequent switches between AVX-512/AVX2 and non-AVX code during these workloads, the upclocking delay implemented by the CPU's existing hardware DVFS policy is the main source of the overhead caused by AVX instructions. To isolate this overhead source and to demonstrate that improved DVFS policies are able to mitigate its effects, we conduct all further experiments in this paper with hyperthreading disabled. The assumption of CPUs without hyperthreading significantly simplifies the design of some parts of our approach. This does not mean that improved DVFS policies are inherently ineffective on systems with hyperthreading, although more research has to be conducted to identify appropriate heuristics for improved DVFS decisions.
As described above, the complex frequency behavior of modern CPUs stems from the fact that it is not economically viable to cool modern CPUs when they are executing power-intensive code at their maximum frequency [17]. Instead, available thermal headroom is used to temporarily use higher frequencies (a form of computational sprinting [27]). In this scenario, the more the energy consumption per instruction varies, the higher is the thermal headroom for code executing simple instructions. Therefore, modern Intel CPUs use different turbo frequencies for different types of code, with AVX2 and AVX-512 instructions triggering a transition to significantly lower frequency levels [29]. As shown by the registers provided by these CPUs to determine the reason for frequency changes, not only thermal headroom plays a role in these frequency reductions, though: The power dissipation of the chip correlates with the current required from the power supply, and frequency changes are also required to prevent voltage drops due to increased current draw.

The frequency changes required to use the available headroom come at a cost. For example, Mazouz et al. have measured the cost of a single frequency change to be approximately 10 µs on an Intel Ivy Bridge system [24], and our own experiments presented in Section 4.1 arrive at a similar cost (between 9 µs and 19 µs) on more recent Skylake server CPUs. Therefore, increasing the frequency to use thermal headroom is only viable if the higher frequency can be applied long enough that the performance improvement makes up for the frequency change overhead. This trade-off is similar to the problem of dynamic power management, where devices are temporarily switched off or transitioned to a low-power state in order to save energy [7]. Here, the energy cost for the state transition means that switching devices off for only short periods of time is frequently unviable.
As the operating system, however, does not know how long a device is going to stay unused, it is in general not possible to determine in advance whether shutting a device off is going to result in a net improvement.

In the area of dynamic power management, significant effort has gone into developing heuristic approaches to guess when to shut down devices [7]. One metric to measure the quality of heuristic approaches is their competitiveness in a worst-case scenario. The competitiveness is the worst-case ratio between the energy required by the approach and the energy required by an oracle policy that can determine in advance whether shutting off a device is viable. Karlin et al. [18] showed that at most 2-competitiveness (meaning that the approach uses at most twice as much energy) is possible for deterministic algorithms. In dynamic power management, 2-competitiveness can be achieved by switching a device off after a fixed timeout. When that timeout equals the break-even time (i.e., the time of inactivity during which the low-power state would have made up for the transition costs), the device uses at most twice as much energy if it wakes up directly after being sent to a low-power state. Intel CPUs show a very similar behavior as they delay increasing the frequency by a fixed timeout after the CPU has stopped executing any AVX instructions [29]. However, the fixed delay is not optimal in terms of competitiveness because, as we show in Section 4, DVFS has wildly varying break-even times in different scenarios. Neither is the DVFS policy implemented by current Intel CPUs optimal for real-world workloads, as we show in Section 6.2.

There are approaches that can, depending on the situation, perform better than simple heuristic approaches.
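The 2-competitiveness of the fixed-timeout policy can be illustrated with a small simulation. The sketch below uses arbitrary unit costs rather than measured values and compares the fixed-timeout policy against the oracle over a range of idle period lengths; the worst case occurs for idle periods just past the break-even time, where the policy pays both the full wait and the transition cost.

```python
# Minimal sketch of the 2-competitiveness argument for the fixed-timeout
# shutdown policy. All costs are hypothetical unit values, not measurements.

P_ACTIVE = 1.0   # power while the device stays on (arbitrary units)
T_BE = 100.0     # break-even time: transition energy equals P_ACTIVE * T_BE
E_TRANSITION = P_ACTIVE * T_BE

def oracle_energy(idle):
    # The oracle knows the idle period length in advance and either stays
    # on or pays the transition cost, whichever is cheaper.
    return min(P_ACTIVE * idle, E_TRANSITION)

def fixed_timeout_energy(idle, timeout=T_BE):
    if idle <= timeout:
        return P_ACTIVE * idle                   # never shuts down
    return P_ACTIVE * timeout + E_TRANSITION     # waits, then transitions

worst = max(fixed_timeout_energy(t) / oracle_energy(t)
            for t in [1.0, 50.0, 99.0, 100.5, 101.0, 1000.0, 10000.0])
print(worst)  # 2.0, reached for idle periods just above the break-even time
```

The ratio never exceeds 2 because, by the time the policy finally shuts the device down, it has spent exactly the transition energy waiting.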
For example, applications can give hints about expected future behavior to let the OS make better informed decisions [23], or the OS can use the deadlines of I/O requests to change the device usage pattern to save more energy [35]. Both these approaches can be applied to DVFS policies in dim silicon scenarios. In this paper, we show an example for the former approach. As software developers often know whether the application is going to execute no power-intensive code – i.e., no AVX2 and AVX-512 – in the near future, that information can be used by the CPU to forego the frequency change delay and immediately change frequencies for improved performance.

According to the optimization manual, recent Intel CPUs implement a fixed-timeout policy where the CPU waits approximately 2 ms after the last section of AVX-intensive code before increasing the frequency again [6, p. 2-13]. In addition, before lowering the frequency, the core requests a power license from the package control unit (PCU), which takes up to 500 µs before granting the license. However, as shown by Schöne et al. [29], the behavior of the hardware does not match the documentation. Instead, the processor waits for a significantly shorter timeout (approx. 670 µs as measured in our experiments) before upclocking. We were able to confirm the observed behavior on a system with an Intel Core i9-7940X, where we measured the delay for frequency changes when executing sections of code consisting of scalar, AVX2, or AVX-512 instructions. Note that frequency reduction is triggered almost immediately when AVX2 or AVX-512 instructions are executed, as required to prevent excessive power consumption. The upclocking delay is constant independent of the number of cores in use.
As described in the last section, maximum competitiveness in worst-case scenarios is reached when the timeout equals the break-even time, but the break-even time depends not only on the cost of the frequency transition but also on the performance advantage at a higher frequency. In this case, the frequency difference is larger if more cores are active [5], so the performance overhead for downclocking is higher and the break-even time is shorter when more cores are active. Therefore, the policy implemented by Intel does not provide maximum competitiveness. To show the potential for improved timeout-based policies, the following sections describe experiments to determine both the frequency transition overhead as well as the performance impact in different situations to determine the corresponding break-even times.

One factor required to determine the break-even time is the frequency change overhead: If the cost of individual frequency changes increases, more time between consecutive changes is required in order to make up for the overhead. For the Intel Ivy Bridge architecture, Mazouz et al. determined that a CPU is stopped for approximately 10 µs during a frequency change [24]. This pause is required to allow the new frequency to stabilize [26]. However, in particular in the case of frequency changes caused by AVX instructions, additional factors increase the overall overhead.
Therefore, and because our systems use a newer CPU architecture than the one considered by Mazouz et al., we measure the overhead of frequency changes on a system with an Intel Xeon Gold 6130 CPU.

To measure the overhead due to frequency reduction caused by AVX2 and AVX-512 instructions, we execute the same amount of such instructions twice, once when the system is already at the appropriate frequency, and once when it executes at a higher frequency and the code triggers a frequency change. The overhead of the frequency change can be calculated as the difference of the two runtimes. The results of this experiment for all combinations of scalar, AVX2, and AVX-512 instructions are shown in Figure 3, which shows significantly higher overhead than measured by Mazouz et al. [24]. For example, a transition from the maximum frequency to the AVX2 frequency level takes 17 µs on average, whereas a transition to the AVX-512 frequency level takes 24 µs. The reason for this increased overhead is likely the reduced IPC due to additional throttling before the frequency switch is complete [11]. As AVX2 and AVX-512 instructions would draw excessive power at the previous higher frequency, the system temporarily employs throttling to reduce power consumption [9]. Note that the overhead appears to vary slightly for the different frequencies and frequency differences caused by different numbers of active cores.

[Figure 3: Overhead (µs) over the number of active cores when the frequency is reduced (scalar → AVX2, scalar → AVX-512, AVX2 → AVX-512), measured as the mean of 1000 runs. The error bars indicate the standard deviation. The overhead seems to vary slightly based on the number of active cores and on the resulting frequencies. Note that a transition from scalar to AVX-512 frequencies incurs two separate frequency transitions.]

[Figure 4: Overhead (µs) over the number of active cores when the frequency is increased (AVX2 → scalar, AVX-512 → scalar, AVX-512 → AVX2). No variation based on the number of active cores can be observed.]

Measuring the overhead of frequency increases is slightly more complex due to the large – and, in our experiment, somewhat variable – delay before the system restores the non-AVX frequency level. In this case, we employ the technique used by Mazouz et al. to determine frequency change costs [24]: We start with a system running at either AVX2 or AVX-512 frequencies and repeatedly execute a short code section which consists of instructions allowing a higher frequency. We measure the runtime of the code section each time, so that frequency changes show up as spikes in the measured runtime. As other sources such as the activation of additional cores can trigger an additional reduction of the maximum frequency, we simply assume that the first frequency change is the one triggered by the lack of AVX2 and AVX-512 instructions and discard any further runtime spikes. The size of the spike is assumed to be the overhead of the frequency change, which is plotted in Figure 4. The results closely match those of Mazouz et al. [24] and show no variation based on the absolute frequency of the core or the magnitude of the frequency change, both of which vary with the number of active cores. Note, however, that this experiment does not consider the performance loss due to the system temporarily executing at a lower frequency while the voltage is ramped up to the level required for the frequency change [26]. For many dynamic power management approaches, state changes can be predicted in advance, so voltage changes can likely be conducted speculatively, removing the need for such additional delays. For example, for fixed-timeout policies, the timeout can be slightly reduced accordingly.
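The spike-detection step of this measurement technique can be sketched as follows. The trace values are synthetic and only illustrate how the overhead is extracted from the per-iteration runtimes, with later spikes discarded as described above.

```python
# Sketch of the spike-detection step: the first per-iteration runtime that
# clearly exceeds the baseline is assumed to contain the frequency change;
# its excess over the baseline is taken as the reclocking overhead.
# The trace below is synthetic, not measured data.

def upclock_overhead(samples_us, baseline_us):
    for t in samples_us:
        if t > 1.5 * baseline_us:
            # First spike found; later spikes (e.g. caused by other cores
            # becoming active) are intentionally ignored.
            return t - baseline_us
    return 0.0

# Synthetic trace: ~2 us per iteration, an 11 us iteration when the core
# reclocks, and a later unrelated 6 us spike that must be discarded.
trace = [2.0, 2.0, 2.0, 11.0, 2.0, 2.0, 6.0, 2.0]
print(upclock_overhead(trace, baseline_us=2.0))  # 9.0
```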
The break-even time for frequency changes depends not only on the overhead for frequency transitions but also on the relative performance advantage due to the higher frequency. Whereas the performance of CPU-bound tasks is nearly proportional to the CPU frequency, the same is not true for memory-heavy workloads, as the memory latency is independent of the CPU frequency. In this work, to simplify the prototype, we assume the former. The result of this simplification is that the break-even time is underestimated for memory-heavy applications. To quantify this error for the workloads used in this paper, we executed most of the individual applications described in Section 2 – nginx, x265, the Parsec benchmarks and the PTS benchmarks, with the exception of mysql due to its long execution time – at different frequencies. Maximizing the number of active cores should maximize the working set of the application and should therefore maximize the impact of memory accesses on performance.

[Figure 5: IPC (instructions per cycle) of various Parsec and PTS benchmarks as well as the nginx/OpenSSL workload described in Section 2 when executed at different frequencies; the frequency values in the caption are not recoverable from the extraction.]

Figure 5 shows the results of this experiment. Counterintuitively, IPC consistently improves when the frequency is increased between 2.… and 2.… GHz (x265 failed to fully saturate all cores due to inter-thread dependencies).

The break-even time t_BE – i.e., the time after which the performance increase due to increased frequencies offsets the cost to increase and decrease the frequency – can be calculated according to the following formula:

    p_low · t_BE = p_high · (t_BE − t_o)

In this formula, p_low and p_high are the performance at the lower and higher frequency, respectively, and t_o = t_o,d + t_o,u is the total overhead for reducing (t_o,d) and increasing (t_o,u) the frequency, measured as the equivalent CPU time as in Section 4.1. If we insert the results from the last sections and calculate t_BE, we arrive at the times shown in Figure 6.

[Figure 6: Break-even time (µs) over the number of active cores for frequency changes (AVX2 → scalar, AVX-512 → scalar, AVX-512 → AVX2), calculated from Figures 3 and 4, assuming performance to be proportional to frequency. The break-even times vary with the number of active cores due to the different magnitude of the frequency change. The results show that a single fixed timeout as implemented by Intel CPUs cannot be optimal in terms of worst-case competitiveness.]

As the performance is dominated by the frequency whereas the overhead is fairly constant, the break-even time is significantly affected by the number of active cores. For example, for a transition between AVX2 and non-AVX frequencies, the break-even time in situations with fewer than four active cores is approximately 1000 µs due to the low frequency swing of only 100 MHz (see Table 1), whereas for more than eight cores frequency changes between 400 and 500 MHz cause break-even times between 150 and 190 µs.

As Karlin et al. [18] show, a fixed-timeout policy achieves optimal competitiveness – in our case, minimal overhead when the system has to switch back to a lower frequency at the least opportune time – when the timeout equals the break-even time. In this case, the timeout before the CPU increases its frequency should therefore be based on the frequency difference to achieve good competitiveness in all cases. Intel CPUs, however, only implement one fixed timeout for all core counts and instruction sets. As shown in Section 2, some applications are negatively affected by the overhead of frequency changes, which shows that an improved DVFS policy with a variable timeout based on the frequency difference can likely have a positive impact on these applications.
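Solving the break-even formula for t_BE gives t_BE = p_high · t_o / (p_high − p_low). With performance assumed proportional to frequency, a rough sanity check reproduces the order of magnitude of these break-even times; the frequencies and the combined overhead of 27 µs below are illustrative, not the exact measured values.

```python
# Solving p_low * t_BE = p_high * (t_BE - t_o) for t_BE, with performance
# assumed proportional to frequency. Frequencies are illustrative values.

def break_even_us(f_low_ghz, f_high_ghz, overhead_us):
    # t_BE = p_high * t_o / (p_high - p_low)
    return f_high_ghz * overhead_us / (f_high_ghz - f_low_ghz)

# Small swing (e.g. 100 MHz with few active cores): break-even near 1000 us.
print(break_even_us(3.6, 3.7, 27.0))
# Larger swing (e.g. 500 MHz with many active cores): considerably shorter.
print(break_even_us(2.8, 3.3, 27.0))
```

The small-swing case shows why a single fixed timeout cannot fit all core counts: the break-even time shrinks by a factor of five or more as the frequency difference grows.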
While the 2-competitive fixed-timeout policy is optimal in the worst case for unpredictable workloads, it is not when the behavior of the workload is predictable, in which case earlier decisions to increase the CPU frequency can result in higher performance. In this work, we focus on two types of predictions about whether the system is going to use AVX-512 in the near future. First, the application developer has knowledge about the structure of the application and can tell the operating system when AVX-intensive parts begin and end, which can aid workloads where one process switches between AVX-intensive code and code without power-intensive instructions. Second, the operating system can statistically determine whether a process is likely to require a reduced frequency and can change the CPU frequency during context switches in order to immediately let non-power-intensive processes profit from higher frequencies.
If an application consists of vectorized and non-vectorized parts and those are executed alternately – such as the web server example in Section 1 – the non-vectorized part is slowed down due to the frequency change caused by the vectorized part. Often, software developers know which part of the application is vectorized and how long the execution of each part takes. In that case, assuming that a suitable hardware-software interface exists, they can notify the CPU after each vectorized code portion if the next scalar portion is likely long enough to warrant an early frequency increase. The CPU could use that hint to immediately switch to a higher frequency. Such a hint could therefore improve performance, as the existing DVFS policy of the CPU would instead needlessly keep the frequency reduced for some time.
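A suitable hardware-software interface of this kind does not exist today, so the following sketch only illustrates the intended control flow. The `dvfs_hint` function is a hypothetical stand-in for such an interface and here merely records the hints an application would emit around a vectorized phase.

```python
# Sketch of a hypothetical hint interface between application and DVFS
# policy. `dvfs_hint` stands in for a system call or MSR write that does
# not exist on current hardware; here it only records the hints.

from contextlib import contextmanager

HINTS = []

def dvfs_hint(expect_power_intensive):
    # Hypothetical: tell the DVFS policy whether upcoming code is AVX-heavy.
    HINTS.append(expect_power_intensive)

@contextmanager
def avx_phase():
    dvfs_hint(True)       # power-intensive code follows
    try:
        yield
    finally:
        dvfs_hint(False)  # scalar code follows: reclock now, skip the delay

with avx_phase():
    pass  # e.g. a call into AVX-512 cryptography primitives

print(HINTS)  # [True, False]
```

Wrapping the vectorized phase in a context manager keeps the "end of AVX code" hint from being forgotten on error paths, which matters because a stale hint would needlessly prolong the reduced frequency.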
Even if each individual application is sufficiently uniform, it is still possible that context switches between different applications cause overhead, as an application is slowed down by the preceding AVX-enabled application as described in Section 2. For most workloads, this overhead is avoidable, as scheduler time slices are usually longer than the break-even time. During a switch from an AVX-enabled application to a non-AVX application, the scheduler should usually immediately select a higher frequency.

To trigger such frequency changes, the scheduler needs a categorization of the individual processes based on their instruction set usage and their expected frequency reduction. To this end, we introduce the notion of a power score which serves as a measure of the expected power consumption of the instruction mix executed by a process. A high power score signals that the process will likely trigger significant frequency reductions. More specifically, a power score of 1 means that the process is assumed to execute at AVX2 frequencies, whereas a power score of 2 means that the process likely causes a reduction down to AVX-512 frequency levels.

This power score could potentially be determined either via a static analysis of the application binary or via a dynamic analysis of the frequency changes at runtime. A static analysis can detect whether an executable contains any AVX2 or AVX-512 instructions that could trigger a frequency reduction. However, applications might contain such instructions even if they do not execute them frequently enough to significantly reduce the average frequency. Also, functions like memset make use of AVX-512 instructions, but only for inputs of a certain size, which is hard to detect via static analysis.
Overall, a static analysis is therefore bound to be unreliable.

We expect dynamic analyses to yield a better estimate of the instruction set usage of individual processes as they are able to observe the effects of the actual execution patterns within the process. Simply mapping the frequency level to the active process is, however, not accurate in situations with frequent context switches, because the delays mean that some of the time spent at lower frequencies is attributed to the wrong processes. Counting the AVX2 and AVX-512 instructions executed by the active process might be sufficient to draw conclusions about the resulting frequency requirements in most cases, but recent Intel CPUs only provide performance counters for specific types of such instructions [4, p. 19-20f]. In any case, though, more accurate statistics would be possible if the processor provided the operating system with information about whether the conditions for each frequency level were fulfilled at each point in time, for example via appropriate performance counters. Current hardware does not provide such performance counters, either.

As a method to collect reliable information about frequency requirements and to determine the processes responsible for frequency reductions, we therefore suggest distinguishing between two cases based on the time between subsequent scheduler invocations.

If the time between subsequent scheduler invocations is significantly longer than the frequency increase delay of 670 µs, the scheduler can sample the CPU frequency level and can directly attribute the frequency to the last process, as any influence of its predecessor on the CPU frequency has ended. To determine the CPU frequency level, we configure the performance counters to track the cycles spent at power license levels
0, 1, and 2, which correspond to the frequency levels for non-AVX, AVX2, and AVX-512 code, respectively [6].

If the time between subsequent scheduler invocations is shorter than the frequency increase delay, such an approach would risk misattributing frequency changes. In this case, our main observation is that if the frequency is reduced during the execution of a process, then that process is most likely responsible for the change. For short periods of execution of a process, we therefore only attribute the resulting frequency to the process in case of a frequency change during the period. In some rare cases, however, frequency changes can occur during the execution of a process that did not trigger the change – most likely due to delays during frequency selection as documented by Intel [6]. Therefore, the power score is calculated as the moving average over all CPU frequency samples attributed to a process to reduce the impact of occasional misattribution. The following steps are conducted to calculate the power score of the processes:

1. Initially, the power score of new processes is set to 0, i.e., the system assumes that new processes will not use AVX-512 or AVX2.

2. At each scheduler invocation, we detect the current power license level by sampling all power license level performance counters twice in a row. The counter that is incremented during the short time in between indicates the current frequency level.

3. We compare the level during two consecutive context switches. If the levels match, the power license did not change. In this case, for short CPU bursts, the current process might not have had enough time to have an impact on the power license, so the power score is not updated.

4.
If context switches are more than 1 ms apart – longer than the frequency delay as reasoned above – or if the power license decreases below or increases above the current power score, however, the power score of the process is updated as the exponential moving average of such power license changes. Assuming S_{t-1} is the old power score and L_t is the new power license, the new power score is S_t = α · L_t + (1 − α) · S_{t-1} for a fixed smoothing factor 0 < α < 1.

The resulting power score indicates the potential frequency reduction caused by the process. The dynamic analysis of frequency changes can be combined with the results of a static analysis of the executable – e.g., by overriding the score to be 0 if the executable contains neither AVX2 nor AVX-512 instructions – and with manual instrumentation as described in Section 5.1, in which case hints from the developer override the automatically determined power score.

Note that with hyperthreading the frequency is determined by two programs. Thus, this technique only works on systems with deactivated hyperthreading and on systems which always schedule the same program on both cores, as recently suggested for the Linux kernel [10]. On other systems, the hardware has to be modified to provide a more reliable source of information about the energy consumption of the instructions executed by individual processes.

Once predictions about the instruction set use are available, the system can use this information to improve performance.
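The per-process update rule from the classification steps above can be sketched as follows. The smoothing factor `alpha` is an assumed placeholder (the concrete EMA weight is a tunable), and the function is a simplified model of what the scheduler hook computes, not the kernel code itself:

```python
def update_power_score(score, level, long_slice, alpha=0.25):
    """One update of a process's power score at a scheduler invocation.
    `level` is the sampled power license level (0, 1, or 2), and
    `long_slice` is True when the time since the previous invocation
    exceeded the frequency increase delay. For short bursts in which
    the power license did not move below or above the current score,
    the score is left unchanged; otherwise the sampled level is folded
    into the exponential moving average."""
    if long_slice or level > score or level < score:
        return alpha * level + (1 - alpha) * score
    return score

# A new process starts at score 0; repeated long slices at the
# AVX-512 license level converge toward a score of 2.
score = 0.0
for _ in range(30):
    score = update_power_score(score, 2, True)
```

The EMA damps occasional misattributed samples, as intended: a single stray sample moves the score by at most `alpha` times the license difference.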
When the code running on a core – i.e., all hardware threads in the case of a system with hardware multithreading – indicates that no power-intensive instructions are going to be executed in the near future, for example via the mechanisms presented in Sections 5.1 or 5.2, the system can eagerly increase the frequency when it is not already at the highest level possible for the expected instructions.

Ideally, the DVFS policy should be implemented in the CPU to be able to provide quick reactions to changing instruction usage and to prevent power budget violations, so any hint about future instruction usage needs to be communicated to the CPU using an appropriate software-hardware interface. For example, the operating system or the application software could temporarily configure a different frequency change timeout depending on the type of executed code, to force earlier frequency changes or to prevent any changes.
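For comparison, the only frequency-setting mechanism available to software today is the OS-level DVFS interface; on Linux, with the cpufreq `userspace` governor active, a frequency request is a sysfs write. A minimal sketch (the sysfs path is the standard Linux cpufreq one; the path parameter exists so the helper can also be exercised against an ordinary file):

```python
import pathlib

def set_frequency_khz(freq_khz,
                      path="/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed"):
    """Request a core frequency (in kHz) through the Linux cpufreq
    `userspace` governor. Requires root and the governor to be set to
    `userspace`; on the CPUs discussed here, the hardware may still
    delay the actual change by up to 500 µs."""
    pathlib.Path(path).write_text(f"{freq_khz}\n")
```

As the following paragraphs explain, this request path is far too slow and coarse to implement the policies proposed here.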
Current hardware does not provide any such interface. It does, however, provide a mechanism to manually set the CPU frequency, which can be used to implement a wide range of DVFS policies in software. For the dim silicon scenario described in this paper, the limitations of the hardware prevent both practical software-based implementations of DVFS policies and limited implementations to estimate the performance of hardware-based implementations.

Any practical implementation of a DVFS solution for AVX-512 or AVX2 code is prevented both by the inability to detect problematic AVX-512 or AVX2 code and by the delay of manual frequency changes. First, conservative detection of problematic code is necessary so that the OS knows when frequency reductions are required. Our approach in Section 5.2 is not usable as it only results in an approximate long-term classification of applications. In contrast, conservative short-term estimation based on register set usage can detect any access to 512-bit and 256-bit vector registers, but will often select lower frequencies than necessary, as we show in our evaluation in Section 6.1, leading to reduced performance. Second, software-based DVFS policy implementations require the ability to change the frequency at a precise point in time, yet current CPUs delay frequency changes significantly. As described by Hackenberg et al. [15], the frequency selection logic of Intel CPUs starting with the Haswell microarchitecture only allows frequency changes once every 500 µs, so any frequency change request is delayed until the end of the next such 500 µs window. The immediate throttling of AVX-512 instructions [11], however, shows that immediate power reduction is necessary for stability, so such delays are unacceptable.

These limitations not only prevent practical software solutions but unfortunately also prevent the construction of a prototype based on existing hardware to evaluate the performance of hardware implementations.
Such a prototype would not necessarily have to be able to ensure system stability, but would have to trigger frequency changes in a way that results in equal performance compared to a complete implementation. As, from the point of view of the OS, the frequency change delay often appears to be random with an even distribution, a naïve approach might assume that the average delay of frequency increases cancels out the average delay of frequency reductions. However, for short sections of AVX or non-AVX code, both the frequency increase and the decrease might occur within the same 500 µs window, in which case our experiments showed that no frequency change occurs.

In this paper, we suggest improved DVFS policies as a method to reduce the overhead caused by AVX2 and AVX-512. As we cannot use existing hardware to conduct a performance evaluation, we are limited to demonstrating the performance impact through simulations and microbenchmarks as shown in Section 6.2.

As described in the last section, this paper proposes using hints from the application or the operating system to provide improved frequency scaling. Our approach consists of two main pieces, namely the classification of the processes – or, alternatively, hints from the application developer – and a modified DVFS algorithm that takes those hints into account. For existing processors, it is impossible to build a complete implementation of this design, as the existing DVFS policy implemented by the CPU cannot be extended as required. Deactivating all AVX-induced frequency changes and completely reimplementing the policy in software is impossible due to the latency of software-triggered frequency changes, which can be as long as 500 µs. Our evaluation is therefore limited to qualitatively showing that the individual components are functional and that application-directed DVFS can have an advantage over the existing policy.
The main goal of the process classification mechanism described in Section 5.2 is to be able to detect the required power license of individual processes even if they are running in a heterogeneous multi-process workload where the effects of one process on the CPU frequency might shadow the effects of another process. To show that the mechanism fulfills this goal, we constructed a prototype based on Linux 5.2. We modified the kernel's completely fair scheduler (CFS) and inserted the power license detection code in the main scheduler function schedule(). Our implementation uses the Linux perf framework to read the power license performance counters.

We let our prototype estimate the power score of the x265 video encoder using different instruction sets running in isolation. To show that our prototype is able to correctly distinguish between different processes executing on the same system and is able to attribute frequency changes to the correct process, we also executed the Apache benchmark from the Phoronix Test Suite as well as the swaptions benchmark from Parsec in parallel with x265. The two applications were configured to share the same set of cores without any restrictions to scheduling. Note that we specifically selected an interactive benchmark as well as a batch workload, to show that the classification works with both. Table 2 shows both the expected power score for the applications – we expected our prototype to classify x265 according to the instruction set used, and neither of the other two benchmarks used significant amounts of vector instructions – as well as the estimated power score from our prototype, averaged over the runtime of the application.

The first three rows show that x265 was correctly classified in all cases, except for some uncertainty if neither AVX2 nor AVX-512 instructions were used. The next two table rows then show the results for the mixed scenarios.
In both cases, our prototype can correctly identify x265 as the process responsible for the frequency reduction. For x265 executed alone, we compared the performance of our prototype to a stock Linux kernel and were not able to measure any statistically significant performance overhead.

We compare our approach to the state-of-the-art technique available in the Linux kernel. Linux provides the time elapsed since the last use of AVX-512 as part of the arch_status file in the proc file system [2]. The time since the last use of AVX-512 is calculated by checking the state of the FPU registers at each context switch. Like our approach, this mechanism is able to detect AVX-512 usage in the benchmarks described above, as shown in the upper half of Table 2.

The approach found in the Linux kernel has a significant drawback, though, as the use of specific FPU registers is only loosely connected to the resulting frequency change. For example, a dense sequence of multiplication instructions on 512-bit vector registers causes the CPU to transition to the lowest frequency, whereas other instructions only trigger the intermediate "AVX2" frequency. Therefore, in a workload consisting of processes showing the former behavior as well as processes of the latter type, the time since the last 512-bit register usage cannot be used to identify the processes responsible for a frequency reduction.
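For reference, the kernel's per-task view can be read directly from procfs. A small sketch that parses the AVX512_elapsed_ms field (the field layout follows the Linux /proc/<pid>/arch_status interface; to our understanding, the kernel reports −1 when AVX-512 state was never observed for the task):

```python
def avx512_elapsed_ms(arch_status_text):
    """Parse the AVX512_elapsed_ms field from the contents of
    /proc/<pid>/arch_status: the milliseconds since the kernel last
    observed AVX-512 register state for the task at a context switch,
    or -1 if such state was never observed."""
    for line in arch_status_text.splitlines():
        if line.startswith("AVX512_elapsed_ms:"):
            return int(line.split(":", 1)[1])
    raise ValueError("kernel does not expose AVX512_elapsed_ms")

# On a live system:
# with open(f"/proc/{pid}/arch_status") as f:
#     ms = avx512_elapsed_ms(f.read())
```

Note that this yields a single elapsed time per task and, as discussed next, no information about which of the three frequency levels the task actually requires.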
We demonstrate this effect by executing a sequence of 512-bit and 256-bit multiplications and additions both with our approach and on an unmodified Linux 5.5 kernel. The results shown in the lower half of Table 2 show that our prototype is correctly able to detect the three different frequency levels caused by different types of instructions, whereas the stock Linux kernel is only able to detect whether 512-bit registers are used.

To show that the problem also affects real-world workloads, we execute a web server benchmark using nginx and OpenSSL similar to the one described in Section 2 and measure the average time since the last AVX-512 usage as determined by the Linux 5.5 kernel on a system running Fedora 30. We let the nginx web server serve a static file with compression at runtime and use OpenSSL compiled with either AVX2 or AVX-512 instruction support for TLS encryption. As shown above, the web server provides significantly higher performance when using AVX2 instructions due to the resulting higher frequencies. Even in this case the system uses 512-bit registers, though, as the C library provides AVX-512 variants of memset(), memmove(), and memcpy(). Therefore, the stock Linux kernel detects AVX-512 usage in both cases, with similar reported average time since the last usage of 512-bit registers. Note that the implementation tests whether registers are in use only during context switches. Different scheduling causes large variation in the resulting values, making a quantitative comparison for such experiments difficult.

Scenario                   Predominant     Average      AVX512
                           Freq. Level     AVX Score    elapsed
x265 (AVX)                 0               0.356        N/A
x265 (AVX2)                1               1.005        N/A
x265 (AVX-512)             2               1.899        103.4 ms
x265 (AVX-512)             2               1.827        181.3 ms
  + pts-apache             0               0.428        N/A
x265 (AVX-512)             2               1.679        86.0 ms
  + parsec-swaptions       0               0.099        N/A
512-bit FMA                2               1.821        0.10 ms
512-bit add                1               0.917        0.10 ms
256-bit FMA                1               0.934        N/A
256-bit add                0               0            N/A

Table 2: Estimated AVX scores for different scenarios and a comparison to the mechanism found in the Linux kernel to track AVX-512 usage. The first three rows show the score for an isolated instance of x265 using different instruction sets. The next two rows show the scores in scenarios with two different applications running concurrently on the same set of cores, to show that the score is estimated correctly on a per-process basis even if one process affects the frequency of another. The remaining rows show how our approach is able to distinguish between the three frequency levels of the CPU, whereas the stock Linux kernel is only able to track AVX-512 register usage.
Once it is known which parts of the system use power-intensive instructions – either using manual annotation as described in Section 5.1 or via automatic detection as described in the previous section – this information can be used to optimize performance. Whereas other approaches perform core specialization to separate AVX-512 code from non-AVX code [13, 22], we, as described in Section 5, suggest that improved DVFS policies can also significantly reduce the overhead caused by AVX-512 instructions and similarly power-intensive instruction sets. In particular, we suggest that the delay for frequency increases as implemented by recent Intel CPUs is unnecessary if the system can predict that the software executed in the near future does not require power-intensive instructions.
The most direct method to show the potential of improved DVFS policies would be to compare the performance of a benchmarked application when using a fixed-timeout policy such as the one implemented by the processor to the same benchmark instrumented to change the processor frequency at points in the program selected by the developer. However, recent Intel CPUs delay frequency change requests by up to 500 µs [15], making it impossible to precisely specify the points in the program at which frequency changes occur. Therefore, our evaluation relies on simulation of different DVFS policies based on a trace generated while running a web server benchmark (Section 6.2.2) and uses a microbenchmark to demonstrate the potential performance impact of a single eager frequency change (Section 6.3) and to check the accuracy of the simulation.

The following experiments were conducted on a system with an Intel Core i9-7940X processor, with the simulation configured to match this system. This processor was selected because, as it is designed for overclocking, it allows configuration of the AVX offsets which specify the frequency reduction caused by AVX2 and AVX-512 instructions. For tests to determine the baseline performance of the system, we configured the offsets to match the frequencies reported in news articles [32], where the base frequency of the processor is reported to be 3.… GHz.

For our simulation experiments, the workload used is the nginx web server example from Section 2. We configure the web server to serve a single static file using gzip compression and we encrypt HTTP requests and replies using the OpenSSL library. The library is configured to vectorize encryption and decryption using AVX-512 instructions, which in other experiments has resulted in a 10% slowdown. We instrument the web server to record the times when the OpenSSL functions for encryption and decryption are called and when they return.
When generating the log of the OpenSSL function calls, we execute the benchmark with minimal AVX offsets. Although the resulting frequencies would not be stable and would result in frequent system crashes with all cores utilized, this setup yields more representative timing input for the simulator, as the simulator itself is supposed to slow down the AVX-512 portions of the simulated workload. To ensure system stability and to simplify simulation, the web server is only executed on a single core. We do not expect individual web server threads to behave significantly differently when additional web server threads are placed on the other cores of the system.

The resulting application trace contains a list of periods where the system is assumed to execute only AVX-512 code (the function calls into OpenSSL) alternating with periods where the system is assumed not to execute any AVX-512 or AVX2 instructions. We feed this trace into a simple model-based simulator as shown in Figure 7 to estimate the application runtime resulting from different DVFS policies. The simulator applies a DVFS policy to the trace and dilates the time during periods where the CPU would be executing at a lower frequency. During the simulation, to get results more representative for a server scenario, we assume that most of the cores are active and assume a correspondingly large frequency reduction whenever AVX-512 code is executed.

We implement fixed-timeout policies with the timeout used by Intel processors as well as with a timeout of 180 µs which was shown to be more competitive in Section 4.3. As an example for a policy based on developer input, we also implement a policy which only increases the frequency when the last packet of an HTTP request was received and decrypted, which we identify by the return value of the corresponding OpenSSL function call.
After this call, the web server processes the request and takes a significant amount of time before any further AVX-512 code is executed when the HTTP reply is sent, so at this point eager frequency changes are most likely to be beneficial for application performance. (A more generic and robust implementation would be to instrument the HTTP request parsing logic to increase the frequency whenever the end of an HTTP request is detected; our implementation suffices to show that the approach is generally possible.) For all the policies, the simulator assumes a performance impact of 16 µs per frequency change, similar to the values determined experimentally in Section 4.1.

The simulation result shows that a lower timeout than what is used by Intel CPUs results in a 2.9% higher performance in the simulated scenario. With a lower timeout, the policy can exploit shorter non-AVX program phases and wastes less time at lower frequencies throughout the program. The resulting performance improvement outweighs the (simulated) overhead of the larger number of frequency changes. Even though the difference is small, the result shows that the timeout does have a measurable impact on application performance.

The developer-directed DVFS policy performed even better, with a 3.9% performance improvement compared to the policy implemented by Intel CPUs, as the policy was able to completely mitigate overhead due to low CPU frequencies during the longest non-AVX phases of the program. While this improvement might seem minor, it covers most of the 5.7% overhead caused by AVX-512 for this workload as shown in Figure 2. Workloads with more frequent AVX-512 phases might benefit more from improved policies. In addition, the policy did not increase the frequency during some other non-AVX phases where a frequency change would have been beneficial, showing that a carefully optimized prototype might achieve higher performance.

[Figure 8 plot omitted: three frequency-over-time panels (a)–(c) showing non-AVX code, with a 130 µs speedup in (b) and a 98 µs speedup in (c).]
Figure 8: To determine the potential performance improvement for a single frequency change, we execute a fixed amount of non-AVX code directly following some AVX-512 code. We compare the time required at default AVX offsets (a) to the time required at minimal AVX offsets (b) as well as with eager frequency changes simulated by a manually inserted frequency change at the beginning of the non-AVX code (c).
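The time-dilation step of a fixed-timeout policy in such a trace-based simulator can be sketched as follows. The phase format, frequencies, and per-change cost below are illustrative placeholders, not the values or code of the paper's actual simulator:

```python
def simulate_fixed_timeout(trace, timeout, f_high, f_low, change_cost):
    """Replay a trace of (is_avx512, duration) phases recorded at full
    speed under a fixed-timeout DVFS policy: the frequency drops to
    f_low when an AVX-512 phase starts and returns to f_high only after
    `timeout` seconds of non-AVX execution. Every frequency change adds
    `change_cost` seconds. All times in seconds; returns total runtime."""
    slowdown = f_high / f_low  # dilation factor while at f_low
    total, low = 0.0, False
    for is_avx512, duration in trace:
        if is_avx512:
            if not low:
                total += change_cost     # frequency reduction
                low = True
            total += duration * slowdown
        elif low:
            if duration > timeout:
                # wait out the timeout at f_low, then switch up
                total += timeout * slowdown + change_cost
                total += duration - timeout
                low = False
            else:
                total += duration * slowdown
        else:
            total += duration
    return total

# A shorter timeout wastes less of each long non-AVX phase at f_low:
trace = [(True, 100e-6), (False, 1000e-6)] * 100
intel_like = simulate_fixed_timeout(trace, 670e-6, 3.0e9, 2.4e9, 16e-6)
shorter = simulate_fixed_timeout(trace, 180e-6, 3.0e9, 2.4e9, 16e-6)
```

In this toy trace both policies perform the same number of frequency changes, so the shorter timeout wins purely through less time spent at the reduced frequency, mirroring the 2.9% result above.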
When looking at a single developer-directed eager frequency change, the simulation resulted in a CPU time saving of 195 µs for sufficiently long stretches of non-AVX code compared to a fixed timeout of 670 µs as implemented by current Intel CPUs, as the CPU was operating 30% faster during this time. To show that the assumptions made in our simulator yield realistic results, we validate this value against measurements based on a simple microbenchmark. The microbenchmark first executes a series of AVX-512 instructions and then executes a fixed amount of non-AVX instructions. The number of instructions is chosen so that they take longer than the frequency change timeout implemented by the CPU. We measure the time required for the code section to determine the impact of the frequency change caused by the preceding AVX-512 code in different configurations. All experiments are repeated 1000 times.

First, to measure the overall impact of such frequency changes on the CPU, we compare the average time at default frequencies (Figure 8a) with the average time with minimal AVX offsets (Figure 8b). Our experiment shows that with minimal frequency changes the code executes 130 µs faster. In this configuration the AVX-512 code still reduces the CPU frequency by 100 MHz as described in Section 6.2.1, and the measured runtime still includes the overhead of the corresponding frequency change, which needs to be taken into account when comparing the values with the model used for our simulation.

Second, we manually insert frequency changes into our prototype so that the frequency is reduced when the AVX-512 code starts and is immediately increased when the non-AVX code starts. Note that, as described in Section 6.2.1, frequency changes are applied with a random delay of up to 500 µs.
Therefore, for this experiment, we do not take the average time but instead take the 5th percentile as this value represents the situation when an optimized DVFS policy implementation almost immediately triggers frequency changes. In this experiment, we measure a runtime for the non-AVX code which is 32 µs slower than the result with minimum frequency changes, but 98 µs faster than regular frequency changes (Figure 8c). The performance is slightly lower than in the experiment with minimal AVX offsets because the benchmark triggers not one but two frequency changes – one by the hardware due to the 100 MHz reduction described above, and one manual frequency change to simulate the DVFS policy. Apart from this overhead and minor overhead due to the additional system calls, the runtime mostly matches the optimal case, which supports our model that eager frequency changes can mitigate most of the overhead caused by AVX instructions.

However, the absolute runtime differences are lower than determined by the simulation. As described above, two potential reasons for the deviation are the larger number of frequency changes as well as some remaining frequency reduction. As shown in Figure 4, the additional frequency change costs approximately 10 µs, and an expected 3% performance overhead due to the 100 MHz frequency difference costs another 20 µs. While the measured results mostly match our model when taking these effects into account, further analysis of the CPU behavior has to be conducted to provide a better quantitative model of the performance in similar situations.

In this paper, we showed that the fixed timeout policy implemented by recent Intel CPUs for AVX frequencies yields less-than-optimal average processor frequencies for heterogeneous workloads. We also argue that better timeouts and developer-directed frequency changes can improve performance.
Even though our evaluation lacks experiments to directly demonstrate the effects on real-world workloads, the estimate generated by our simulation makes such a performance improvement highly likely. This basic result opens up a number of further research questions which we will discuss in the following sections.
In Section 6.2.1, we show why the frequency change delays on current Intel CPUs prevent constructing a full prototype demonstrating our approach. Even if frequency changes were triggered instantly, though, a software-only DVFS policy implementation would not be viable for two reasons: First, the CPU would still need to be able to autonomously reduce power consumption when executing AVX-512 instructions to ensure system stability, for example by reducing the frequency or applying other forms of throttling. Second, not all applications in the system would be modified to make use of developer-directed frequency scaling, making a hardware fallback necessary.

If the DVFS policy is implemented in hardware, a software-hardware interface is required to influence policy decisions. We propose the combination of two such interfaces:

1.
Configurable frequency change delay:
As we show, the problem of AVX-induced frequency changes is similar to the dynamic power management problem, and the main decision is whether to immediately increase the frequency when possible or whether to wait or not increase the frequency at all. While it would be possible to tell the CPU to immediately increase the frequency after the next section of AVX code, we expect such an interface not to be viable in many situations, because the boundaries of AVX-intensive program execution phases are not well defined and variations in the program's control flow might cause unnecessary frequency changes. Instead, we suggest an interface to manually set a different frequency change timeout for individual parts of the program – i.e., until the application manually reverts the change or sets a different timeout – to allow applications to enable eager frequency changes in certain situations.

2.
Forced immediate frequency change:
In addition, the CPU should provide an interface to immediately increase the frequency to the maximum frequency, for use by the operating system to increase the frequency during context switches when it is known that the next task is unlikely to use AVX-512 or AVX2.

Further work has to be conducted to test whether these interfaces are sufficiently flexible to implement a wide range of DVFS policies in software.
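The intended semantics of the two proposed interfaces can be modeled in software. The sketch below is purely illustrative: the class and method names are hypothetical, not an existing hardware or kernel API, and the default timeout simply mirrors the 670 µs delay discussed earlier:

```python
class AvxFrequencyModel:
    """Toy model of the two proposed interfaces: a reconfigurable
    frequency-change timeout (interface 1) and a forced immediate
    return to the maximum frequency (interface 2)."""
    def __init__(self, timeout_us=670.0):
        self.timeout_us = timeout_us  # delay before raising frequency
        self.low = False              # currently at a reduced frequency?
        self.non_avx_us = 0.0         # non-AVX time since last AVX use

    def set_timeout(self, timeout_us):   # interface 1
        self.timeout_us = timeout_us

    def force_max_frequency(self):       # interface 2 (context switch)
        self.low, self.non_avx_us = False, 0.0

    def advance(self, us, executing_avx):
        """Advance model time: AVX use drops the frequency immediately;
        non-AVX time raises it only after the configured timeout."""
        if executing_avx:
            self.low, self.non_avx_us = True, 0.0
        else:
            self.non_avx_us += us
            if self.non_avx_us >= self.timeout_us:
                self.low = False
```

An application would shorten the timeout around code regions it knows to be free of AVX phases, while the OS would call the forced increase when switching to a task with a low power score.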
One significant limitation of our work is that all our experiments were conducted with a system in mind that does not use hardware multithreading. On a system with hardware multithreading, the CPU frequency has to be reduced when either of the threads executes AVX instructions, thereby limiting the potential performance advantage of developer-directed approaches, as it is hard to predict when another, completely unrelated hyperthread will affect the frequency. Also, as shown in Section 2, on systems with hyperthreading many additional types of workloads experience slowdown due to frequency reductions. Despite the differences, improved DVFS policies might be viable and their effectiveness might even be amplified as more code is affected by frequency reductions. More research should be conducted to create a statistical model of the CPU frequency selection in systems with hardware multithreading and to develop suitable DVFS policies.
In our controlled experiment, we used a benchmark that had a clearly defined performance metric. In general, however, it is not always clear whether the overhead caused by AVX-512 is large enough to warrant the usage of techniques to reduce it, and it is not always clear whether these techniques are successful. In particular, when techniques have the potential to cause additional overhead – for example, due to increased numbers of frequency changes – it would be beneficial to be able to profile a system to estimate the impact of AVX-512 on performance. The result of such a profiler could also be used to implement closed-loop policies. For example, the system could repeatedly try out different DVFS policies depending on the resulting performance change.

The performance counters on current CPUs, however, cannot be used to construct such a profiler, as they can only be used to count cycles spent at reduced frequencies but do not provide sufficient information about how long the reduced frequencies are actually required. In particular, the performance monitoring units of these CPUs cannot be used to detect any executed AVX2 and AVX-512 instructions as they can only count floating-point instructions. Instead, we envision an approach which periodically samples the frequency of the system, pauses the system to let the CPU switch back to the highest possible frequency, and then checks whether the system will immediately switch back to a lower frequency when the workload is continued. The latter check determines whether a frequency reduction is required due to ongoing AVX code or whether the reduction represents avoidable overhead. Further experiments have to determine the accuracy of such an approach, and further work has to be conducted to show whether modified hardware-software interfaces can provide a more accurate profiling mechanism with lower CPU time overhead.
This paper presents improved DVFS policies as a method to reduce the overhead of the frequency reduction caused by AVX and AVX-512 instructions on recent Intel CPUs. Other approaches to this and similar problems have used core specialization or have modified the application to reduce the impact of varying power consumption and of frequent frequency changes.
Another method to limit the performance impact of AVX and AVX-512 code on unrelated non-AVX code is to place the AVX and non-AVX parts of the workload on separate sets of cores. As performance problems occur when non-AVX code is executed on the same core following AVX code which reduced the frequency, specialization of cores can prevent such overhead. Approaches for core specialization either targeted heterogeneous programs consisting of AVX and non-AVX code within one process [13] or targeted workloads consisting of AVX and non-AVX processes [22]. The former detects the usage of AVX instructions either by instrumentation inserted by the developer or by reconfiguring the CPU to trigger exceptions when executing AVX instructions [13, 14]. Based on this information, individual threads are migrated between cores to concentrate the AVX part of the program on as few cores as possible. The latter technique, which is targeted at multi-process workloads, instead relies on heuristics to identify processes using AVX-512 instructions and modifies the scheduler to prevent scheduling an AVX-512 and a non-AVX task on hardware threads of the same core at the same time [22]. This approach currently uses the Linux arch_status interface, which only gives a rough estimate of AVX-512 usage. In this paper, we present a method to identify applications which cause frequency reductions with higher accuracy.

Note that all these approaches can cause significant performance overhead themselves. Task migrations can increase cache miss rates, and restricting scheduling of different processes on the same core at the same time can cause significant overhead with some workloads [10].
We present a technique which might provide advantages in situations where other approaches cause too much overhead.

The fact that co-scheduling applications on the hardware threads of a single core can cause varying overhead depending on the type of the applications has been observed by other works before, and many scheduling techniques have been developed to improve the performance of SMT systems. For example, existing approaches use sampling-based techniques [31], cache conflict detection [30], or performance counters [12, 25] to determine whether two tasks are suited for parallel scheduling on the same physical core. We describe a similar approach which uses performance counters to identify tasks requiring execution at reduced frequency and which can likely be used for improved co-scheduling of AVX-512 applications as described above.
The approach in this paper is designed for applications which are either only available in binary form or which genuinely benefit from AVX2 and AVX-512 instructions. If a program only makes use of such instructions in very short execution phases, those parts could alternatively be rewritten to use instructions with lower power consumption.

Kumar et al. [21] use such an approach to improve the efficiency of power-gating the processor's SIMD unit. In this scenario, devectorizing parts of the program reduces the speedup provided by SIMD instructions, but also reduces the power-gating overhead. The authors use a profiler to determine the SIMD instruction usage in individual parts of the program. As static recompilation based on this information is problematic because the profiling results are only accurate for specific input data, the authors integrate the profiler into a system which uses dynamic translation at runtime to devectorize those parts which only rarely use SIMD instructions. Such an approach could likely be applied to AVX-512 to improve average CPU frequencies, although hardware modifications would be required: current CPUs can only count floating-point AVX-512 instructions, but not integer operations [4, p. 19.20f]. Even with such hardware changes, however, the approach cannot be used with existing ahead-of-time compilers. In our work, we explore techniques usable within the existing software environment.

Roy et al. [28] instead suggest a similar technique that uses information from dynamic profiling to insert static power management code into an application at compile time. Their approach inserts instructions for power-gating parts of the processor in order to save energy. A similar approach, however, could potentially be used to let the application guide the frequency selection decisions of the processor.
Modern Intel CPUs reduce their frequency whenever power-intensive AVX2 or AVX-512 instructions are executed to prevent violating power limits. The frequency is only increased again after a fixed timeout has elapsed, in order to prevent excessive numbers of frequency changes. This behavior reduces the performance of heterogeneous workloads in which code sections with and without such AVX instructions alternate, as parts of the latter are executed at a lower frequency than necessary.

We show the similarity between this behavior and mechanisms from dynamic power management. We show that the constant delay before increasing the frequency is not optimal in terms of worst-case competitiveness, and we show how the delay should depend on the magnitude of the frequency change. We also sketch how information from the OS or the developer can be used to inform the CPU about future system behavior so that the CPU can implement more efficient DVFS policies. Although we do not have a complete implementation due to constraints of the hardware, we show that it is possible to reliably determine whether an application will cause frequency changes, and that eager frequency changes based on such information about the workload can improve performance.
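The competitiveness argument can be illustrated numerically. The following Python sketch (our own illustration, with arbitrary cost units) models a single non-AVX gap during which the core remains at the reduced frequency: waiting wastes performance at some rate, while raising the frequency costs a fixed transition penalty. A clairvoyant policy pays the minimum of the two; a fixed-timeout policy with the timeout set to the break-even time pays at most twice that, matching the classic bound for fixed-timeout shutdown policies [18], whereas an eager policy (timeout zero) has an unbounded worst-case ratio for short gaps:

```python
def policy_cost(gap, timeout, waste_rate, switch_cost):
    # Fixed-timeout up-clocking: stay at the reduced frequency for up to
    # `timeout` time units (wasting performance at `waste_rate`), then pay
    # `switch_cost` to raise the frequency if the gap outlasts the timeout.
    if gap <= timeout:
        return gap * waste_rate
    return timeout * waste_rate + switch_cost

def oracle_cost(gap, waste_rate, switch_cost):
    # A clairvoyant policy either rides out the gap at the low frequency
    # or switches immediately, whichever is cheaper.
    return min(gap * waste_rate, switch_cost)

def worst_case_ratio(timeout, waste_rate, switch_cost, gaps):
    # Competitive ratio of the timeout policy over a set of gap lengths.
    return max(policy_cost(g, timeout, waste_rate, switch_cost)
               / oracle_cost(g, waste_rate, switch_cost) for g in gaps)
```

With the break-even timeout (switch_cost / waste_rate), the ratio never exceeds 2 for any gap length; shrinking the timeout only helps if short gaps are known to be rare, which is exactly the kind of workload knowledge our hint mechanism provides.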
Although we show that an oracle-style DVFS policy can improve performance, it remains to be seen whether other approaches from the area of dynamic power management can be applied as well. In particular, some shutdown strategies achieve lower power consumption than the simple fixed-timeout policy even without application-level knowledge.

In addition, due to hardware constraints, we do not present a complete implementation of our approach. We plan to construct a testbed for other DVFS policies and to use it to evaluate different hardware-software interfaces which would allow input from the operating system or from applications to affect hardware-controlled frequency scaling.
References

[1] Phoronix test suite. https://phoronix-test-suite.com/.
[2] The /proc filesystem. Linux, Documentation/filesystems/proc.txt.
[3] WikiChip: Xeon Gold 6130 – Intel. https://en.wikichip.org/wiki/intel/xeon_gold/6130.
[4] Intel 64 and IA-32 Architectures Software Developer's Manual – Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, May 2018.
[5] Intel Xeon Processor Scalable Family – Specification Update. Intel Corporation, Feb. 2018.
[6] Intel 64 and IA-32 Architectures Optimization Reference Manual, Sept. 2019.
[7] L. Benini, A. Bogliolo, and G. De Micheli. A survey of design techniques for system-level dynamic power management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 8(3):299–316, 2000.
[8] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.
[9] N. Bonen, R. Gabor, Z. Sperber, V. Svilan, D. N. Mackintosh, J. A. B. Paredes, N. Kumar, and S. Gupta. Performing local power gating in a processor, Sept. 26 2017. US Patent 9,772,674.
[10] J. Corbet. Core scheduling, Feb. 28 2019. https://lwn.net/Articles/780703/.
[11] T. Downs. Gathering intel on Intel AVX-512 transitions, Jan. 17 2020.
[12] A. El-Moursy, R. Garg, D. H. Albonesi, and S. Dwarkadas. Compatible phase co-scheduling on a CMP of multi-threaded processors. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium, 10 pp. IEEE, 2006.
[13] M. Gottschlag and F. Bellosa. Reducing AVX-induced frequency variation with core specialization. In The 9th Workshop on Systems for Multicore and Heterogeneous Architectures, Dresden, Germany, Mar. 25 2019.
[14] M. Gottschlag, P. Brantsch, and F. Bellosa. Automatic core specialization for AVX-512 applications. In Proceedings of the 13th ACM International Systems and Storage Conference (to appear). ACM, 2020.
[15] D. Hackenberg, R. Schöne, T. Ilsche, D. Molka, J. Schuchart, and R. Geyer. An energy efficiency feature survey of the Intel Haswell processor. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pages 896–904. IEEE, 2015.
[16] R. Hebbar SR and A. Milenković. Impact of thread and frequency scaling on performance and energy efficiency: An evaluation of Core i7-8700K using SPEC CPU2017. Pages 1–7. IEEE, 2019.
[17] W. Huang, K. Rajamani, M. R. Stan, and K. Skadron. Scaling with design constraints: Predicting the future of big chips. IEEE Micro, 31(4):16–29, 2011.
[18] A. R. Karlin, M. S. Manasse, L. A. McGeoch, and S. Owicki. Competitive randomized algorithms for nonuniform problems. Algorithmica, 11(6):542–571, 1994.
[19] G. Keramidas, V. Spiliopoulos, and S. Kaxiras. Interval-based models for run-time DVFS orchestration in superscalar processors. In Proceedings of the 7th ACM International Conference on Computing Frontiers, pages 287–296, 2010.
[20] V. Krasnov. On the dangers of Intel's frequency scaling, Oct. 10 2017. https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/.
[21] R. Kumar, A. Martinez, and A. Gonzalez. Efficient power gating of SIMD accelerators through dynamic selective devectorization in an HW/SW codesigned environment. ACM Transactions on Architecture and Code Optimization (TACO), 11(3):25, 2014.
[22] A. Li. Core scheduling: prevent fast instructions from slowing you down. Linux Plumbers Conference, Sept. 9 2019.
[23] Y.-H. Lu, L. Benini, and G. De Micheli. Power-aware operating systems for interactive systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 10(2):119–134, 2002.
[24] A. Mazouz, A. Laurent, B. Pradelle, and W. Jalby. Evaluation of CPU frequency transition latency. Computer Science – Research and Development, 29(3-4):187–195, 2014.
[25] R. L. McGregor, C. D. Antonopoulos, and D. S. Nikolopoulos. Scheduling algorithms for effective thread pairing on hybrid multiprocessors. 10 pp. IEEE, 2005.
[26] S. Park, J. Park, D. Shin, Y. Wang, Q. Xie, M. Pedram, and N. Chang. Accurate modeling of the delay and energy overhead of dynamic voltage and frequency scaling in modern microprocessors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(5):695–708, 2013.
[27] A. Raghavan, Y. Luo, A. Chandawalla, M. Papaefthymiou, K. P. Pipe, T. F. Wenisch, and M. M. Martin. Computational sprinting. In IEEE International Symposium on High-Performance Computer Architecture, pages 1–12. IEEE, 2012.
[28] S. Roy, N. Ranganathan, and S. Katkoori. A framework for power-gating functional units in embedded microprocessors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17(11):1640–1649, 2009.
[29] R. Schöne, T. Ilsche, M. Bielert, A. Gocht, and D. Hackenberg. Energy efficiency features of the Intel Skylake-SP processor and their impact on performance. arXiv preprint arXiv:1905.12468, 2019.
[30] A. Settle, J. Kihm, A. Janiszewski, and D. Connors. Architectural support for enhanced SMT job scheduling. In Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques (PACT 2004), pages 63–73. IEEE, 2004.
[31] A. Snavely and D. M. Tullsen. Symbiotic job scheduling for a simultaneous multithreaded processor. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 234–244, 2000.
[32] C. Spille. Skylake X: Das heitere AVX-Takteraten hat ein Ende, Sept. 8 2017.
[33] M. B. Taylor. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. Pages 1131–1136. IEEE, 2012.
[34] V. Venkatachalam and M. Franz. Power reduction techniques for microprocessor systems. ACM Computing Surveys (CSUR), 37(3):195–237, 2005.
[35] A. Weissel, B. Beutel, and F. Bellosa. Cooperative I/O: A novel I/O semantics for energy-aware applications.