FlexWatts: A Power- and Workload-Aware Hybrid Power Delivery Network for Energy-Efficient Microprocessors

Abstract

Modern client processors typically use one of three commonly-used power delivery network (PDN) architectures: 1) motherboard voltage regulators (MBVR), 2) integrated voltage regulators (IVR), and 3) low dropout voltage regulators (LDO). We observe that the energy-efficiency of each of these PDNs varies with the processor power (e.g., thermal design power (TDP) and dynamic power-state) and workload characteristics. This leads to energy inefficiency and performance loss, as modern client processors operate across a wide spectrum of power consumption and execute a wide variety of workloads. We propose FlexWatts, a hybrid adaptive PDN for modern client processors whose goal is to provide high energy-efficiency across the processor's wide range of power consumption and workloads by dynamically allocating PDNs to processor domains. FlexWatts is based on three key ideas. First, it combines IVRs and LDOs in a novel way to share multiple on-chip and off-chip resources. This hybrid PDN is allocated for processor domains with a wide power consumption range and it dynamically switches between two modes, IVR-Mode and LDO-Mode, depending on the power consumption. Second, for all other processor domains, FlexWatts statically allocates off-chip VRs. Third, FlexWatts introduces a prediction algorithm that switches the hybrid PDN to the mode that is the most beneficial. To evaluate the tradeoffs of PDNs, we develop and open-source PDNspot, the first validated architectural PDN model that enables quantitative analysis of PDN metrics. Using PDNspot, we evaluate FlexWatts on a wide variety of SPEC CPU2006, 3DMark06, and battery life workloads against IVR, the state-of-the-art PDN in modern client processors. For a 4 W TDP processor, FlexWatts improves the average performance of the SPEC CPU2006 and 3DMark06 workloads by 22% and 25%, respectively. FlexWatts has comparable cost and area overhead to IVR.


FlexWatts: A Power- and Workload-Aware Hybrid Power Delivery Network for Energy-Efficient Microprocessors

Jawad Haj-Yahya§, Mohammed Alser§, Jeremie S. Kim§, Lois Orosa§, Efraim Rotem*, Avi Mendelson†‡, Anupam Chattopadhyay‡, Onur Mutlu§
§ETH Zürich, *Intel, †Technion, ‡Nanyang Technological University

Modern client processors typically use one of three commonly-used power delivery network (PDN) architectures: 1) motherboard voltage regulators (MBVR), 2) integrated voltage regulators (IVR), and 3) low dropout voltage regulators (LDO). We observe that the energy-efficiency of each of these PDNs varies with the processor power (e.g., thermal design power (TDP) and dynamic power-state) and workload characteristics (e.g., workload type and computational intensity). This leads to energy-inefficiency and performance loss, as modern client processors operate across a wide spectrum of power consumption and execute a wide variety of workloads.

To address this inefficiency, we propose FlexWatts, a hybrid adaptive PDN for modern client processors whose goal is to provide high energy-efficiency across the processor's wide range of power consumption and workloads. FlexWatts provides high energy-efficiency by intelligently and dynamically allocating PDNs to processor domains depending on the processor's power consumption and workload. FlexWatts is based on three key ideas. First, FlexWatts combines IVRs and LDOs in a novel way to share multiple on-chip and off-chip resources and thus reduce cost, as well as board and die area overheads. This hybrid PDN is allocated for processor domains with a wide power consumption range (e.g., CPU cores and graphics engines) and it dynamically switches between two modes, IVR-Mode and LDO-Mode, depending on the power consumption. Second, for all other processor domains (that have a low and narrow power range, e.g., the IO domain), FlexWatts statically allocates off-chip VRs, which have high energy-efficiency for low and narrow power ranges. Third, FlexWatts introduces a novel prediction algorithm that automatically switches the hybrid PDN to the mode (IVR-Mode or LDO-Mode) that is the most beneficial based on processor power consumption and workload characteristics.

To evaluate the tradeoffs of PDNs, we develop and open-source PDNspot, the first validated architectural PDN model that enables quantitative analysis of PDN metrics. Using PDNspot, we evaluate FlexWatts on a wide variety of SPEC CPU2006, graphics (3DMark06), and battery life (e.g., video playback) workloads against IVR, the state-of-the-art PDN in modern client processors. For a 4 W thermal design power (TDP) processor, FlexWatts improves the average performance of the SPEC CPU2006 and 3DMark06 workloads by 22% and 25%, respectively. For battery life workloads, FlexWatts reduces the average power consumption of video playback by 11% across all tested TDPs (4 W–50 W). FlexWatts has comparable cost and area overhead to IVR. We conclude that FlexWatts provides high energy-efficiency across a modern client processor's wide range of power consumption and wide variety of workloads, with minimal overhead.

1. Introduction

Architecting an efficient power delivery network (PDN) for client processors (e.g., tablets, laptops, desktops) is a well-known challenge that has been hotly debated in industry and academia in recent years. Due to multiple constraints, a modern client processor typically implements only one of three types of commonly-used PDNs: 1) motherboard voltage regulators (MBVR [29, 41, 63, 97]), 2) low dropout voltage regulators (LDO [15, 18, 111, 112, 113, 120]), and 3) integrated voltage regulators (IVR [21, 61, 88, 117]). We find that the energy-efficiency of each of the three commonly-used PDN types varies differently with the processor power (e.g., thermal design power (TDP) and dynamic power-state) and workload characteristics (e.g., workload type and computational intensity). In particular, each PDN is designed for energy-efficient operation at a different TDP, power-state, workload type, and workload computational intensity. This leads to energy-inefficiency and performance loss as modern client processors operate across a wide range of power consumption and execute a wide variety of workloads.

Footnote (TDP): As the processor dissipates power, the temperature of the silicon junction (T_j) increases; T_j should be kept below the maximum junction temperature (T_jmax). Overheating may cause permanent damage to the processor. Hence, every processor has a thermal design power (TDP) limit.

Architects of modern client processors typically build a single PDN architecture (i.e., MBVR, IVR, or LDO) that supports all TDPs of a client processor family for two reasons. First, doing so allows system manufacturers to configure a processor's TDP (known as configurable TDP [5, 63, 132] or cTDP) to enable the processor to operate at higher or lower performance levels, depending on the available cooling capacity and desired power consumption. For example, the Intel Skylake processor uses an MBVR PDN [26, 117] for all TDP ranges (from 3 W [56] to 91 W [57]) and recent AMD client processors use an LDO PDN [3, 4, 15, 18, 111, 112], while enabling cTDP [56, 57]. Second, it reduces non-recurring engineering (NRE [81]) cost and design complexity to allow competitive product prices and enable meeting strict time-to-market requirements.

Modern client processors operate across a wide power range (i.e., the range of power consumption between light-load and heavy-load conditions) for two reasons. First, modern workloads have a wide range of computational intensity (leading to power consumption between tens of milliwatts, e.g., for an idle workload that is in Connected-Standby power-mode [42], and tens of watts on average, e.g., for a workload that activates Turbo Boost [98]). Second, processors must support multiple market segments that have very different TDPs. For example, the recent Intel Skylake processor architecture can scale from nearly 3 W [56] of TDP (for passively-cooled small systems, e.g., a tablet) up to 91 W [57] of TDP (for a high-performance desktop computer). The recent AMD client processors follow similar trends [3, 4, 15, 18, 111, 112].

Based on our empirical evaluations, we find that a single PDN architecture that supports a wide power range is energy-inefficient.
For instance, the IVR PDN is energy-inefficient for low-TDP processors (e.g., tablets, convertible laptop-tablets), while the MBVR and the LDO PDNs are energy-inefficient for high-TDP processors (e.g., high performance laptops, desktops). We also observe that even if we build a dedicated PDN matching the TDP of the processor, e.g., IVR PDN for high-TDP processors and MBVR or LDO PDN for low-TDP processors, these processors will still suffer from significant energy inefficiency because 1) the IVR PDN is energy-inefficient in high-TDP processors when running a computationally light workload, 2) a low-TDP processor can potentially execute computationally heavy workloads that exceed the TDP, e.g., via Turbo Boost [98], and 3) the TDP of modern client processors can be dynamically configured using cTDP [5, 132].

Various works focus on improving the processor PDN using various techniques (e.g., thermal-aware voltage regulators (VRs) [72], re-configurable PDN [32], VR phase scaling [11], VR efficiency-aware power management [12], on-chip VRs for fast DVFS [53, 73, 137], voltage stacking [33, 90, 142], PDNs for waferscale processors [90], voltage noise reduction [16, 35, 36, 44, 74, 84, 95, 96, 108, 119], voltage noise modeling [141, 143], multiple voltage domains [100, 138], voltage optimizations [115], and adaptive DVFS [22, 91]).
These works focus on adapting power management techniques that already exist in modern client processors (such as voltage noise reduction and modeling, power management techniques that optimize VR efficiency, using fast VRs for better DVFS, utilizing on-chip VRs for building multiple voltage domains to improve energy-efficiency), but they do not alleviate the inherent energy inefficiencies of commonly-used PDNs in client processors due to operating across a wide range of power and wide variety of workloads.

In this paper, we propose FlexWatts, a power- and workload-aware hybrid adaptive PDN whose goal is to maintain high energy efficiency in a modern client processor throughout the processor's wide spectrum of power and workloads with a low bill of materials (BOM [66]) and board area overhead. FlexWatts is based on three key ideas. First, FlexWatts combines IVRs and LDOs in a novel way to share multiple on-chip and off-chip resources and thus reduce BOM, as well as board and die area overheads. This hybrid PDN is allocated for processor domains with a wide power consumption range (e.g., CPU cores and graphics engines) and it dynamically switches between two modes, IVR-Mode and LDO-Mode, depending on the power consumption. For example, when a domain operates under high power conditions (e.g., high TDP, power-hungry applications), it uses the PDN in IVR-Mode. Otherwise (e.g., low TDP, light-load), it uses the PDN in LDO-Mode. Second, for all other processor domains (that have a low and narrow power range, e.g., the IO domain), FlexWatts statically allocates off-chip VRs that have high energy-efficiency for low and narrow power ranges. Third, FlexWatts introduces a new prediction algorithm that automatically switches the hybrid PDN to the mode (i.e., IVR-Mode or LDO-Mode) that is predicted to be the most beneficial based on processor power consumption and workload characteristics.

Footnote (BOM): Given a specific product, a BOM is a list of the immediate components with which it is built and the components' relationships.

To assess the tradeoffs of commonly-used PDNs, and architect a PDN that is highly efficient in the metrics of interest (e.g., energy consumption, performance, board area, BOM), an accurate architecture-level quantitative analysis of these metrics is needed. Unfortunately, no model or tool is available to the computer architecture research community for such analysis. To this end, we develop PDNspot, a validated architectural open-source PDN framework whose goal is to enable architects to study the tradeoffs of various PDNs. PDNspot provides a versatile framework that enables multi-dimensional architecture-space exploration of modern processor PDNs. PDNspot evaluates the effect of multiple PDN parameters, TDP, and workloads on the metrics of interest. We open-source PDNspot [104].

Using PDNspot, we evaluate FlexWatts on a wide variety of SPEC CPU2006, graphics (3DMark06), and battery life (e.g., video playback) workloads against IVR [21], the state-of-the-art PDN in modern client processors. For a 4 W TDP processor, FlexWatts improves the average performance of the SPEC CPU2006 and 3DMark06 workloads by 22% and 25%, respectively. For battery life workloads, FlexWatts reduces the average power consumption of video playback by 11% across all tested TDPs (4 W–50 W). FlexWatts has comparable BOM and area overhead to IVR.

This paper makes the following major contributions:
• We introduce FlexWatts, a novel adaptive hybrid PDN that maintains high efficiency and high performance in metrics of interest in client processors across a wide spectrum of power consumption and workloads. To our knowledge, FlexWatts is the first hybrid PDN to use two types of on-chip voltage regulators (IVR and LDO) to simultaneously leverage the advantages of both.
• We develop a versatile framework, PDNspot, that enables multi-dimensional architecture-level exploration of modern processor PDNs. To our knowledge, PDNspot is the first tool that can evaluate the effects of multiple PDN parameters, TDP, and workload characteristics on prominent system metrics such as energy consumption, performance, board area, and bill of materials (BOM). We open-source PDNspot [104].
• We provide a thorough experimental evaluation of the power, performance, area, and BOM of IVR, MBVR, LDO, and FlexWatts PDNs across various processor TDPs and workloads. Our evaluation shows that our new adaptive hybrid PDN, FlexWatts, provides large benefits in metrics of interest (performance, energy, cost, area) with minimal overhead, compared to the state-of-the-art PDN.

2. Background

We provide the necessary background on the architecture of a modern client processor and its power delivery network (PDN), the electrical system that provides supply voltage to the transistors within an integrated circuit via voltage regulators. We also explain some of the parameters (e.g., tolerance band and load-line) that affect the system-level efficiency of PDNs.

Architecture.

To illustrate the usage of a PDN in modern client processors, we first summarize the architecture of Intel's client processor [8, 20, 21, 83, 101] in Table 1. Similar architectures are widely used for modern processors from various vendors, such as AMD, IBM, and ARM [15, 18, 89, 94, 111, 112, 120].

Table 1: Summary of the processor architecture

| Domain | Description |
|---|---|
| Two CPU Cores (Core 0/1) | Single clock domain to all cores. Clock frequency can scale from 0.8 GHz to 4 GHz |
| Graphics Engines (GFX) | GFX frequency can scale from 0.1 GHz to 1.2 GHz |
| Last Level Cache (LLC) | The LLC size scales proportionally to the CPU core and graphics engine frequencies |
| System-Agent (SA) | The SA includes a memory controller, display controller, IO fabric, and other IPs (e.g., Camera, PCIe, Voice), each of which operates at a fixed frequency (not scaled with load) |
| Input/Output (IOs) | Includes the processor IOs, such as DDRIO and display IO, which operate at fixed frequencies |

Power Delivery Networks.

The Power Delivery Network (PDN) is the electrical system that provides supply voltage to the transistors within an integrated circuit (IC) or domain (e.g., CPU core, graphics engine) in a processor. The objective of a PDN in a processor is to provide a stable desired voltage to each processor domain. In particular, a PDN should support three distinct capabilities: 1) supply a stable voltage to each processor domain, 2) provide the transient current required by a processor domain, and 3) filter out the noise currents injected by a processor domain [64, 116, 123].

A PDN consists of 1) a power supply (e.g., power supply unit (PSU) or battery), which provides high voltage (e.g., 7.2–20 V) to the motherboard, 2) voltage regulators (VRs) (also known as DC–DC converters), used in either one or two stages to reduce the voltage level from the power supply to the desired operational voltage for a domain (typically 0.5–1.1 V), 3) a network of interconnections, which distributes the voltage from the voltage regulators to the PDN components and processor domains, 4) decoupling capacitors distributed on the motherboard, package, and die, which act as reservoirs to store charge and reduce voltage noise from instantaneous current draw, and 5) power-gates to turn off a processor domain when it is idle. Before discussing the common PDN designs in more detail, we first discuss types of voltage regulators, an essential component in PDNs for converting voltage.

The main objective of a voltage regulator (VR) is to convert the input voltage level to another voltage level. There are multiple types of VRs and each has pros and cons with respect to power conversion efficiency, voltage noise, design complexity, and size. In this section, we describe the switching VR (SVR) and the low dropout VR (LDO VR), each of which is a key component (on-chip and/or off-chip) in modern client processor PDNs.
The System-Agent houses the traditional North Bridge and containsseveral modules such as the memory and IO controllers [38, 122, 129].

Switching Voltage Regulator (SVR).

Modern processors typically use a step-down SVR (i.e., a buck converter [49, 73, 93]), which converts the input voltage level to a lower voltage level. An SVR consists of an inductor, diode, capacitor, switch, and control modules. Traditionally, SVRs are placed on the motherboard. However, recent PDN designs integrate SVRs into the chip package and die [21, 61, 88, 117]. The main advantage of an SVR over other types of VRs is its ability to maintain a high power conversion efficiency (typically > […] V for an input voltage of 1.8 V).

Low Dropout Voltage Regulator (LDO VR).

An LDO VR is a type of linear voltage regulator [64, 79, 85] that consists of a power switch, a differential amplifier (error amplifier), and resistors. The LDO VR has five advantages over an SVR: an LDO VR 1) is immune to switching noise due to the absence of capacitors, 2) has a simpler and smaller design as it does not include large inductors, 3) can regulate the output voltage even when the input voltage level is very close to the output voltage level, 4) can even operate in bypass-mode [112], in which the input voltage signal is directly connected to the output to avoid voltage regulation, and 5) can have higher efficiency than an SVR when the input voltage level is very close to the output voltage level (e.g., input/output voltage of 1 V/0.9 V). However, the main disadvantage of the LDO VR is its inefficiency in converting the input voltage if it is very different from the output voltage (e.g., input/output voltage of 1 V/0.5 V).

Fig. 1 shows the high-level organization of each of the three commonly-used PDNs in modern client processors: 1) integrated voltage regulator (IVR [21, 61, 88, 117]; Fig. 1(a)), 2) motherboard voltage regulator (MBVR [29, 41, 63, 97]; Fig. 1(b)), and 3) low dropout voltage regulator (LDO VR [15, 18, 111, 112, 113, 120]; Fig. 1(c)).
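To make the LDO VR efficiency trade-off described above concrete, the following minimal sketch evaluates the two input/output voltage pairs mentioned in the text. It assumes the common linear-regulator approximation η ≈ (V_out / V_in) × current efficiency, with a current efficiency of about 99%; the function name `ldo_efficiency` is hypothetical and is not part of any tool discussed in this paper.

```python
def ldo_efficiency(v_in: float, v_out: float, current_efficiency: float = 0.99) -> float:
    """Approximate LDO VR efficiency as (V_out / V_in) * current efficiency.

    The 0.99 default is an assumed, typical current efficiency.
    """
    if v_in <= 0 or v_out <= 0:
        raise ValueError("voltages must be positive")
    if v_out > v_in:
        raise ValueError("a linear regulator cannot raise the voltage")
    return (v_out / v_in) * current_efficiency


# The two cases from the text: input close to output vs. far from output.
near = ldo_efficiency(v_in=1.0, v_out=0.9)
far = ldo_efficiency(v_in=1.0, v_out=0.5)
print(f"1.0 V -> 0.9 V: {near:.1%}")
print(f"1.0 V -> 0.5 V: {far:.1%}")
```

Under this approximation, the efficiency falls roughly in proportion to the output/input voltage ratio, which is why an LDO VR is attractive only when the two voltages are close.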

Integrated Voltage Regulator (IVR) PDN.

The IVR PDN is a state-of-the-art PDN in modern client processors and is used in Intel's 4th, 5th, and 10th generation Core processors [21, 61, 88]. The IVR PDN integrates most of the SVR components (i.e., diodes, capacitors, control modules, and switches) into the processor die while some components are placed on the package (e.g., interconnections) and off-chip (e.g., inductors). Since circuit elements in modern processors cannot tolerate the high input voltage of a power supply (7.2–20 V) due to their small process technology node size, the IVR PDN regulates voltage in two stages, as illustrated in Fig. 1(a). The first stage of voltage conversion is handled by a single motherboard SVR (i.e., V_IN VR), which converts input voltage from the power supply unit (PSU) or battery (7.2–20 V) to a level typically less than 2 V (e.g., 1.8 V). The second stage is handled by an integrated SVR (i.e., IVR), which is a sequential buck converter that converts the input voltage (i.e., output of the first stage VR) to the desired voltage level (typically 0.5–1.1 V) of a processor domain (e.g., a CPU core). In a processor, multiple IVRs are used (e.g., six as shown in Fig. 1(a)) to supply different voltage levels to each processor domain.

The IVR PDN has two main advantages over other PDNs: 1) it enables fast voltage level changes, and 2) it reduces a chip's input current (i.e., the output current of the first stage VR into the processor die) by using a high input voltage level (e.g., 1.8 V compared to 0.5–1.1 V using a traditional MBVR), thereby reducing I²R power losses and the maximum current (i.e., Icc_max) requirement of the first stage VR. However, the IVR PDN has three main disadvantages over other PDNs: 1) low power-conversion efficiency in computationally light workloads due to the two-stage voltage regulation [41], 2) high design complexity as it is normally designed along with the chip, which adds extra design constraints and consumes silicon die area [86], and 3) higher sensitivity to di/dt noise than the MBVR PDN due to a limited amount of decoupling capacitors available on the processor's die [86].
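The I²R argument above can be checked with a short calculation. The sketch below compares the resistive loss on the processor input path when the same power is delivered at 1.8 V versus 1.0 V. The delivered power (10 W) and path resistance (2 mΩ) are illustrative assumptions, not values from the paper, and `i2r_loss` is a hypothetical helper.

```python
def i2r_loss(power_w: float, voltage_v: float, resistance_ohm: float) -> float:
    """Resistive loss when `power_w` is delivered at `voltage_v` through `resistance_ohm`."""
    current_a = power_w / voltage_v          # I = P / V
    return current_a ** 2 * resistance_ohm   # P_loss = I^2 * R


POWER_W = 10.0     # assumed power drawn by the die
R_PATH = 0.002     # assumed 2 mOhm input-path resistance

loss_high_vin = i2r_loss(POWER_W, 1.8, R_PATH)   # IVR-style 1.8 V input
loss_low_vin = i2r_loss(POWER_W, 1.0, R_PATH)    # MBVR-style ~1.0 V input
ratio = loss_low_vin / loss_high_vin

print(f"loss at 1.8 V: {loss_high_vin * 1e3:.1f} mW")
print(f"loss at 1.0 V: {loss_low_vin * 1e3:.1f} mW")
print(f"ratio: {ratio:.2f}x")
```

For a fixed delivered power and path resistance, the loss scales with 1/V², so the ratio depends only on the two voltages, here (1.8/1.0)² = 3.24.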

Motherboard Voltage Regulator (MBVR) PDN.

The MBVR PDN is the traditional PDN for processors and is used in Intel's 2nd, 3rd, 6th, 7th, 8th, and 9th generation Core processors [29, 63, 97, 130, 131]. As shown in Fig. 1(b), the MBVR PDN uses several one-stage motherboard SVRs and multiple on-chip power-gates. An MBVR PDN has four advantages over other PDNs: 1) it decouples the VR design from the processor design, thereby reducing system design complexity, 2) heat generated due to VR power conversion losses is kept outside the processor chip, 3) it enables placing enough decoupling capacitors on the motherboard, package, and die (due to the long path from processor die to the off-chip VR) to reduce voltage noise, and 4) it is efficient at executing computationally light workloads. However, the MBVR PDN has two major disadvantages: 1) voltage level changes are slow as the VR is far from the load (i.e., processor domain), and 2) computationally-intensive (high current) workloads suffer high I²R power losses due to high processor input current and high impedance (load-line) on the path from the board VRs to the processor domains.

Low Dropout Voltage Regulator (LDO) PDN.

The LDO PDN is used in AMD's recent Zen [15, 111, 112] processors. As shown in Fig. 1(c), the LDO PDN statically allocates two types of VRs to different domains based on their power demands: it allocates 1) one-stage motherboard SVRs (similar to the MBVR PDN) to domains with a low and narrow power range (e.g., IO and SA) and 2) two-stage VRs to domains with a wide power range (e.g., CPU cores, graphics engines, and LLC). The first stage is a single motherboard SVR (i.e., V_IN VR) and the second stage is an integrated LDO VR. Multiple LDO VRs are used (e.g., four as shown in Fig. 1(c)), which supply different voltage levels to each of the processor domains. For the two-stage VR, the processor's power management unit adjusts V_IN to the maximum voltage required across all domains. For domains that require the same voltage level as the input voltage, the domain's LDO VR operates in bypass-mode to avoid voltage regulation by simply connecting the input voltage signal to the output. For other domains that require a lower voltage, the LDO VR adjusts the input voltage by operating in regulation-mode. For idle domains, the LDO VR acts as a power-gate.

The LDO PDN has three advantages over other PDNs: it 1) requires less board area compared to the MBVR PDN, 2) is simpler than the IVR PDN as the integration of an LDO VR into the die is simpler than that of an SVR, and 3) has higher power-conversion efficiency than an IVR PDN when running computationally light workloads. However, the LDO PDN has two main disadvantages compared to other PDNs: 1) low power-conversion efficiency in computationally intensive workloads due to the high processor input current and high impedance (load-line) on the path from the board VRs to the processor domains, and 2) higher design complexity than MBVR as it is designed along with the chip, which adds extra design constraints and complexity to the power management algorithms.

Power-Conversion Efficiency (η).
The ratio of the total output power ($P_{out}$) of a VR to the total input power ($P_{in}$) is known as efficiency ($\eta$), as given in Equation 1.

$$\text{Efficiency} = \eta = \frac{P_{out}}{P_{in}} = \frac{P_{out}}{P_{out} + P_{loss}} \tag{1}$$

For an SVR, power-conversion efficiency is not constant, but rather a function of 1) the load current and 2) the input and output voltages [12, 34, 39, 40]. The LDO VR power-conversion efficiency, $\eta_{LDO}$, is the ratio of the desired output voltage, $V_{out}$, to the input voltage, $V_{in}$, times the LDO VR current efficiency (typically around 99% in a modern LDO VR [50, 79]), thus $\eta_{LDO} \approx V_{out} / V_{in}$.

The power-conversion efficiency is also defined for the entire PDN, also known as the PDN end-to-end power-conversion efficiency (ETEE). The ETEE of a PDN at a given time is the ratio between the total load's nominal power (i.e., the sum of all loads' nominal power) and the effective power consumed by the main power supply (e.g., battery, PSU).

Figure 1: The three commonly-used PDNs in client processors. The processor consists of six loads: two CPU cores, a last-level-cache (LLC), graphics engines (GFX), system-agent (SA), and IO. (a) The IVR PDN uses one off-chip VR (V_IN) and six different on-chip IVRs (V_Core0/1, V_LLC, V_GFX, V_SA and V_IO). (b) The MBVR PDN uses four off-chip VRs (V_Cores, V_GFX, V_SA and V_IO) and six on-chip power-gates. (c) The LDO PDN uses three off-chip VRs (V_IN, V_SA and V_IO), four on-chip LDO VRs (V_Core0/1, V_LLC, V_GFX), and two on-chip power-gates.

VR Tolerance Band (TOB).

The tolerance band (TOB) of a VR [58] is the maximum voltage variation for the VR across temperature, manufacturing variation, and age factors (e.g., $V_{TOB}$ = 25 mV). The standard VR TOB can be divided into three main categories: controller tolerance, current sense variation, and voltage ripple. The supply voltage is maintained at a higher value than the nominal voltage required by the load, to compensate for TOB voltage variations. This excess voltage due to the TOB leads to wasted power that cannot be utilized by the load.

Application Ratio (AR).

AR is a term used in power/performance modeling to quantify the computational intensity of a workload [34]. AR describes the switching rate of a processor component (e.g., CPU core, graphics engine, IO) for a workload when compared to the highest possible power, $P_{peak}$, that can be consumed by the most computationally-intensive workload (also known as the power-virus workload [31, 77, 88]). AR and $P_{peak}$ can be estimated 1) offline using power modeling tools (such as McPAT [77], SYMPO [31], or Intel's Blizzard [9]), and 2) at runtime using activity sensors implemented in the processor components [7, 10, 19, 30, 78, 102, 110, 126].

Load-line.

The load-line or adaptive voltage positioning [59] is a model that describes the voltage and current relationship under a given system impedance ($R_{LL}$). This relationship is defined as $V_{cc} = V_{IN} - V_{TOB} - R_{LL} \cdot I_{cc}$, where $V_{cc}$ and $I_{cc}$ are the voltage and current at the load, respectively, $V_{TOB}$ is the tolerance band (TOB) voltage variation, and $V_{IN}$ is the input voltage to the system. From this equation, we can see that the voltage at the load input ($V_{cc}$) decreases when the current of the load ($I_{cc}$) increases (e.g., when running a workload with a high AR). Therefore, to keep the voltage at the load ($V_{cc}$) above a minimum functional voltage under even the most computationally-intensive workload (i.e., power-virus [31, 77, 88], for which AR = 1), the input voltage ($V_{IN}$) is set to a level that provides enough guardband.
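The load-line relationship above can be turned into a small worked example. The sketch below first solves the equation for the input voltage needed to keep the load at a minimum functional voltage under the worst-case current, and then shows the excess voltage a lighter workload sees at that setting. All numeric values (0.80 V minimum voltage, 20 mV TOB, 2.5 mΩ load-line, 20 A peak current) are illustrative assumptions, and both function names are hypothetical.

```python
def load_voltage(v_in: float, v_tob: float, r_ll: float, i_cc: float) -> float:
    """Load-line equation: V_cc = V_IN - V_TOB - R_LL * I_cc."""
    return v_in - v_tob - r_ll * i_cc


def required_input_voltage(v_min: float, v_tob: float, r_ll: float, i_cc_max: float) -> float:
    """Input voltage that keeps V_cc >= v_min at the worst-case current."""
    return v_min + v_tob + r_ll * i_cc_max


V_MIN = 0.80      # assumed minimum functional voltage (V)
V_TOB = 0.020     # assumed tolerance band (V)
R_LL = 0.0025     # assumed load-line impedance (Ohm)
I_MAX = 20.0      # assumed power-virus current (A)

v_in = required_input_voltage(V_MIN, V_TOB, R_LL, I_MAX)
v_light = load_voltage(v_in, V_TOB, R_LL, i_cc=4.0)   # light workload at the same V_IN

print(f"V_IN set for worst case: {v_in:.3f} V")
print(f"V_cc at 4 A:             {v_light:.3f} V")
print(f"excess over V_min:       {(v_light - V_MIN) * 1e3:.0f} mV")
```

The excess voltage seen by the light workload is the guardband that is paid for but not needed, which is the source of the inefficiency the load-line discussion points to.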

3. PDNspot

We develop PDNspot, a framework that models the three commonly-used PDNs in modern client processors and evaluates multiple metrics of interest (i.e., performance, energy, BOM, and board area). PDNspot provides a versatile framework that enables multi-dimensional architecture-space exploration of modern processor PDNs. PDNspot evaluates the effect of multiple PDN parameters, TDP, and workloads on the metrics of interest. In this section, we present the core models of PDNspot: 1) an end-to-end power-conversion efficiency (ETEE) model for each PDN that we use to assess the average power and current consumption of a PDN, 2) board area and BOM models, and 3) a performance model of the processor that we use to assess each PDN's impact on performance.

Footnote (nominal power): A load's nominal power at a given time is a function of the load's 1) power state (e.g., active vs. idle), 2) activity factor, 3) frequency, 4) nominal voltage, and 5) temperature [34, 39, 40, 45].

We present three high-level power models. Each model takes multiple inputs (main inputs tabulated in Table 2) to calculate the end-to-end power consumption of a domain (shown on the right side of each PDN in Fig. 1), starting from the nominal power of each load ($P_{NOM}$ in Fig. 1) until the power supply (shown on the left side of each PDN in Fig. 1). The calculations follow the symbols shown in Fig. 1 on each PDN to estimate the total power (i.e., $P_{IVR}$, $P_{MBVR}$, and $P_{LDO}$) consumed by the main power supply (i.e., PSU or battery).

We calculate the end-to-end power-conversion efficiency (ETEE) of each PDN as the ratio of the total input power of the PDN (i.e., the sum of the nominal input power of all loads, $\sum P_{NOM}$) to the total effective power (i.e., $P_{IVR}$, $P_{MBVR}$, and $P_{LDO}$) consumed by the main power supply. We begin by discussing MBVR PDN modeling as it is the simplest PDN.
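The ETEE definition above reduces to a single ratio. As a minimal sketch (not PDNspot's actual code), the function below computes it from per-load nominal powers and the total power drawn from the supply. The function name `etee`, the load names, and all numbers are illustrative assumptions.

```python
from typing import Mapping


def etee(nominal_power_w: Mapping[str, float], supply_power_w: float) -> float:
    """End-to-end efficiency: sum of the loads' nominal power / power drawn from the supply."""
    if supply_power_w <= 0:
        raise ValueError("supply power must be positive")
    total_nominal = sum(nominal_power_w.values())
    if total_nominal > supply_power_w:
        raise ValueError("loads cannot consume more than the supply delivers")
    return total_nominal / supply_power_w


# Hypothetical per-load nominal powers (W) and total supply power (W).
loads = {"core0": 1.0, "core1": 1.0, "llc": 0.5, "gfx": 1.0, "sa": 0.6, "io": 0.4}
efficiency = etee(loads, supply_power_w=5.625)
print(f"ETEE = {efficiency:.1%}")
```

In the models that follow, the denominator is the quantity each PDN model produces ($P_{MBVR}$, $P_{IVR}$, or $P_{LDO}$), so a lower modeled supply power for the same loads directly translates into a higher ETEE.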

Table 2: Main parameters used in our PDNspot models

| Parameter | IVR | MBVR | LDO |
|---|---|---|---|
| Load-line Impedance R_LL (mΩ) | IN = 1 | Cores, GFX, SA, IO = 2.5, 2.5, 7, 4 | IN, SA, IO = 1.25, 7, 4 |
| VR Tolerance Band TOB (mV) | 18–22 | 18–20 | 16–18 |
| On-chip VR Efficiency η (%) | 81%–88% | — | (V_out / V_in) · η |
| η (%) | η_IN,GFX,SA,IO (V_in, V_out, I_out, power-state) = 72%–93% (all PDNs) | | |
| Leakage Fraction F_L (%) | 20%–45% depending on the domain (all PDNs) | | |
| Cores Nom. Power P_NOM (W) | 0.6 W–30 W for TDP range 4–50 W (all PDNs) | | |
| LLC Nom. Power P_NOM (W) | 0.5 W–4 W for TDP range 4–50 W (all PDNs) | | |
| GFX Nom. Power P_NOM (W) | 0.58 W–29.4 W for TDP range 4–50 W (all PDNs) | | |
| PG Impedance R_PG (mΩ) | 1–2 mΩ depending on the domain (all PDNs) | | |

MBVR PDN Power Modeling.

In order to calculate the total power consumption of the MBVR PDN, denoted by $P_{MBVR}$, we first calculate $P_{GB}$, which is the power consumption after applying a voltage guardband on the nominal power $P_{NOM}$. This voltage guardband, $V_{GB}$, guarantees proper circuit timing across voltage variations ($V_{TOB}$, explained in Sec. 2.4). The leakage and dynamic power consumption scale differently as voltage increases from $V_{NOM}$ to $V_{NOM} + V_{GB}$ (i.e., when the nominal voltage, $V_{NOM}$, is increased by a voltage guardband, $V_{GB}$). The dynamic power consumption is proportional to the voltage squared (i.e., $\left(\frac{V_{NOM} + V_{GB}}{V_{NOM}}\right)^{2}$), while the leakage power consumption scales exponentially with voltage and depends on several other parameters such as threshold voltage, temperature, and other design and fabrication characteristics [34, 39, 40, 45, 64]. As an approximation, we use a model based on polynomial curve fitting, where leakage power scales polynomially with supply voltage (i.e., $\left(\frac{V_{NOM} + V_{GB}}{V_{NOM}}\right)^{\delta}$). We validate our model with measurements on a commercial client processor (Intel Core i7-6600U Processor [55]). Assuming the same temperature, the leakage power scales by the power of $\delta$ = ∼[…]. We use a leakage fraction ($F_L$) of 45% for the graphics domain and 22% for the rest (e.g., cores, LLC, SA), similarly to Rusu et al. [103]. Therefore, $P_{GB}$ can be calculated with Equation 2.

$$P_{GB} = P_{NOM} \cdot \left[ F_L \cdot \left(\frac{V_{NOM} + V_{GB}}{V_{NOM}}\right)^{\delta} + (1 - F_L) \cdot \left(\frac{V_{NOM} + V_{GB}}{V_{NOM}}\right)^{2} \right] \tag{2}$$

For domains with power-gates (e.g., $L_{Core}$, $L_{LLC}$ in Fig. 1(b)), there is an additional voltage drop on the power-gate ($V_{PG}$, e.g., 10 mV) due to its impedance ($R_{PG}$). The power consumption ($P_{PG}$ in Fig. 1(b)) due to this increase in the supply voltage is calculated similarly to Equation 2 (i.e., by substituting $V_{PG}$, $P_{GB}$, and $(V_{NOM} + V_{GB})$ for $V_{GB}$, $P_{NOM}$, and $V_{NOM}$, respectively).

Next, we need to compensate for the voltage drop on the load-line impedance ($R_{LL}$, discussed in Sec. 2.4) by raising the on-board VR output voltage (i.e., applying a voltage guardband). The voltage guardband needs to account for the maximum possible voltage drop, which is attained when the processor consumes the maximum possible power, $P_{peak}$, by running the most computationally-intensive workload, which is also known as a power-virus workload [31, 77, 88]. Next, we obtain $P_D$, the total power consumption of a group of domains that share the same off-chip VR (e.g., {Core0, Core1, LLC}, {GFX}), by summing the $P_{PG}$ values of the domains in the group. We use the application ratio (AR, discussed in Sec. 2.4) to obtain $P_{peak}$ by scaling $P_D$, i.e., $P_{peak} = P_D / AR$. The voltage and power after accounting for the voltage drop on the load-line impedance of each group of domains (i.e., $R_{D\_LL}$ in Fig. 1(b)) are given by Equations 3 and 4, respectively.

$$V_{D\_LL} = V_D + \frac{P_{peak}}{V_D} \cdot R_{D\_LL} \tag{3}$$

$$P_{D\_LL} = V_{D\_LL} \cdot I_D = V_{D\_LL} \cdot \frac{P_D}{V_D} \tag{4}$$

The total power, $P_{MBVR}$, consumed from the battery/PSU is obtained by summing the effective power of each domain, which can be calculated by dividing the output power of each on-board VR by its power conversion efficiency ($\eta_D$), as shown in Equation 5.

$$P_{MBVR} = \sum \frac{P_{D\_LL}}{\eta_D} \tag{5}$$

IVR PDN Power Modeling.

Using the same approach as for modeling MBVR PDN power consumption, we calculate the total power of an IVR PDN, $P_{IVR}$, consumed from the battery/PSU, as shown in Fig. 1(a). We calculate $P_{GB}$ by applying a voltage guardband due to the VR tolerance band (i.e., TOB, discussed in Sec. 2.4) using Equation 2. $P_{IVR\_D}$ (in Fig. 1(a)) is the power consumption after accounting for the IVR loss at a specific domain. Given the IVR power conversion efficiency $\eta_{IVR}$, $P_{IVR\_D}$ can be calculated using Equation 6.

$$P_{IVR\_D} = \frac{P_{GB}}{\eta_{IVR}} \tag{6}$$

Next, we calculate $P_{IN}$ (shown in Fig. 1(a)) by summing the power consumed by all domains connected to the V_IN VR (i.e., $P_{IN} = \sum P_{IVR\_D}$). Similarly to the MBVR PDN, the voltage ($V_{IN\_LL}$) and power consumption ($P_{IN\_LL}$) after accounting for the voltage drop on the load-line impedance (i.e., $R_{IN\_LL}$) are calculated with Equations 7 and 8, respectively, where $P_{IN_{peak}} = P_{IN} / AR$. Finally, we obtain the total power ($P_{IVR}$) consumed from the battery/PSU by dividing the output power (i.e., $P_{IN\_LL}$) of the V_IN VR by the power conversion efficiency of the V_IN VR (i.e., $\eta_{IN}$), as shown in Equation 9.

$$V_{IN\_LL} = V_{IN} + \frac{P_{IN_{peak}}}{V_{IN}} \cdot R_{IN\_LL} \tag{7}$$

$$P_{IN\_LL} = V_{IN\_LL} \cdot \frac{P_{IN}}{V_{IN}} \tag{8}$$

$$P_{IVR} = \frac{P_{IN\_LL}}{\eta_{IN}} \tag{9}$$

LDO PDN Power Modeling.

Similarly to the other two models, $P_{GB}$ (shown in Fig. 1(c)) is calculated using Equation 2. For the four domains with LDO VRs (i.e., the L_Core0/1, L_LLC, and L_GFX domains), we calculate the power of each domain after including the LDO VR power conversion losses, denoted by $P_{LDO\_D}$ in Fig. 1(c). $P_{LDO\_D}$ is obtained by dividing the output power of the LDO ($P_{GB}$) by the power conversion efficiency of the LDO ($\eta_{LDO}$), as shown in Equation 11. $\eta_{LDO}$ is the ratio of the desired output voltage to the input voltage multiplied by the LDO VR current efficiency ($I_{effi}$, e.g., 99%), as shown in Equation 10. Next, we obtain the power that the LDO domains consume from the shared VR (V_IN) using two steps. First, we sum the power of each LDO domain to obtain $P_{IN}$ (i.e., $P_{IN} = \sum P_{LDO\_D}$). Second, we calculate the power consumption ($P_{IN\_LL}$) after accounting for the voltage drop on the load-line impedance (i.e., $R_{IN\_LL}$) using Equations 7 and 8 (similar to the calculations in IVR PDN power modeling).

$$\eta_{LDO} = \frac{V_{OUT}}{V_{IN}} \cdot I_{effi} \tag{10}$$

$$P_{LDO\_D} = \frac{P_{GB}}{\eta_{LDO}} \tag{11}$$

For domains that use motherboard VRs (i.e., L_SA and L_IO), we calculate the power ($P_{D\_LL}$) that each of these domains consumes from the motherboard VRs (i.e., V_SA and V_IO) using Equations 3 and 4 (similar to our calculations in MBVR PDN power modeling). Finally, the total power (i.e., $P_{LDO}$) that the LDO PDN consumes from the battery/PSU is calculated by summing the power that each motherboard VR consumes from the battery/PSU, as shown in Equation 12.

$$P_{LDO} = \frac{P_{IN\_LL}}{\eta_{IN}} + \sum \frac{P_{D\_LL}}{\eta_D} \tag{12}$$

The board area and BOM of an off-chip VR are functions of mainly the maximum current ($Icc_{max}$) that the VR can support. $Icc_{max}$ is the maximum current that the VR must be electrically designed to support. Exceeding the $Icc_{max}$ limit can result in irreversible damage to the VR or the processor's chip [34, 39, 40, 59, 62, 80, 86, 135, 141]. A higher $Icc_{max}$ implies a larger VR and higher cost. VR sharing between multiple domains (e.g., the LDO PDN shares the V_IN VR for cores, LLC, and graphics, as shown in Fig. 1(c)) effectively reduces the maximum current required, $Icc_{max}$, thereby reducing the area and BOM of the off-chip VR.

To reduce system area and cost, many platforms use a power management integrated circuit (PMIC [52, 109, 134]) that incorporates multiple VRs (and other functions) into one integrated circuit. In our model, the VR area and cost are calculated based on the $Icc_{max}$ requirements for each domain of a PDN. We assume an optimized solution with a PMIC for processors with TDPs up to 18 W for all PDNs. Higher-TDP processors typically use a traditional voltage regulator module (VRM [59]) instead of a PMIC due to the high current requirements of these processors [52, 109]. We obtain the actual mapping between the $Icc_{max}$ and the area/cost directly from a VR vendor, Texas Instruments [118].
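The three power models (Equations 2–12) share a small set of building blocks, which can be sketched in code. The following is a minimal illustration and not the PDNspot implementation: all function and parameter names are our own, the power-gate step ($P_{PG}$) is omitted for brevity, impedances are in ohms, and the leakage exponent `delta` is left as a caller-supplied parameter.

```python
def guardband_power(p_nom, v_nom, v_gb, f_leak, delta):
    """Equation 2: power after raising V_NOM by the guardband V_GB."""
    ratio = (v_nom + v_gb) / v_nom
    return p_nom * (f_leak * ratio ** delta + (1.0 - f_leak) * ratio ** 2)


def load_line_power(p_d, v_d, r_ll, ar):
    """Equations 3-4 (or 7-8): VR output power after compensating for the
    worst-case voltage drop on the load-line impedance r_ll (ohms)."""
    p_peak = p_d / ar                     # power-virus power, P_D / AR
    v_ll = v_d + (p_peak / v_d) * r_ll    # raised VR output voltage
    return v_ll * p_d / v_d


def mbvr_power(groups):
    """Equation 5. groups: iterable of (p_d, v_d, r_ll, ar, eta_d)."""
    return sum(load_line_power(p, v, r, ar) / eta
               for p, v, r, ar, eta in groups)


def ivr_power(p_gb_list, eta_ivr, v_in, r_in_ll, ar, eta_in):
    """Equations 6-9: per-domain IVR loss, then the shared V_IN VR."""
    p_in = sum(p_gb / eta_ivr for p_gb in p_gb_list)
    return load_line_power(p_in, v_in, r_in_ll, ar) / eta_in


def ldo_efficiency(v_out, v_in, i_eff=0.99):
    """Equation 10: voltage ratio times current efficiency."""
    return (v_out / v_in) * i_eff


def ldo_power(ldo_domains, v_in, r_in_ll, ar, eta_in, mb_groups, i_eff=0.99):
    """Equations 11-12. ldo_domains: iterable of (p_gb, v_out);
    mb_groups: motherboard-VR domains in the mbvr_power() format."""
    p_in = sum(p_gb / ldo_efficiency(v_out, v_in, i_eff)
               for p_gb, v_out in ldo_domains)
    return load_line_power(p_in, v_in, r_in_ll, ar) / eta_in + mbvr_power(mb_groups)
```

Dividing the summed nominal power of all domains by any of the three totals gives the corresponding end-to-end efficiency. Keeping the load-line step in one helper mirrors the way the IVR and LDO models reuse the MBVR calculation.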

To understand the impact of PDN end-to-end power-conversion efficiency (ETEE) on the workload performance of a client processor, we build a performance model. Our performance model aims to estimate the performance improvement of a CPU- (graphics-) intensive workload when increasing the power-budget allocated to the CPU cores (graphics engines).

We build the performance model of the compute domain (i.e., CPU cores and graphics engines) using empirical measurements on a real system in three steps. First, we run a CPU- (graphics-) intensive workload with high performance-scalability, e.g., 416.gamess of SPEC CPU2006 [114] (3DMark06 [124]), on a real Intel Skylake system, whose specifications are in Table 3. Second, we sweep the frequency of the CPU cores (graphics engines) in steps of 100 MHz (50 MHz), the finest CPU core (graphics engine) frequency granularity that the Skylake architecture supports. Third, we measure the total power consumption of the processor and log the increase in power consumption compared to the measurement done at the previous (i.e., lower) frequency. By doing so, we build power-frequency curves that we use along with the workload's performance-scalability to estimate performance as a function of power.

Using our performance model, we plot in Fig. 2(a) the additional power-budget required (y-axis) to increase the clock frequency of a CPU/graphics domain by 1% when running CPU-/graphics-intensive workloads, relative to the baseline frequency of each TDP (x-axis). We observe that, compared to a high-TDP (e.g., 50 W) processor, a low-TDP (e.g., 4 W) processor requires only a small amount of power (e.g., ∼9 mW) to increase the clock frequency of a CPU/graphics domain by 1%. Fig. 2(b) shows the percentage (y-axis) of the total TDP power-budget (x-axis) that is allocated to the CPU-cores, LLC, IO and SA, and PDN power losses for a CPU-intensive workload (no budget is allocated to graphics in this workload). For each TDP, we use the PDN among the three commonly-used PDNs (i.e., MBVR, IVR, LDO) that maximizes PDN power loss (e.g., IVR for 4 W and MBVR for 50 W), to show the effect of using an unoptimized PDN on different processor domains' power budgets. We find that in a low-TDP processor, a relatively small fraction of the budget (e.g., only 13% of a 4 W TDP) is allocated to the CPU-cores compared to a higher-TDP processor (e.g., about 52% of a 50 W TDP), while PDN power loss is 25% or more (i.e., an ETEE of 75% or less). If we use a PDN with a higher ETEE for each TDP (e.g., 5% higher ETEE, which translates to 5% lower PDN power loss), we can increase the CPU-cores' power-budget by the power saved on PDN loss (e.g., 5%), thereby increasing the workload's performance. We illustrate the impact of a PDN's ETEE with the following example.
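The budget arithmetic behind this observation can be sketched numerically. The snippet below is an illustration only: it assumes a constant power cost per 1% of frequency (the real power-frequency curves are measured, not linear), and the function names are hypothetical.

```python
def total_power(nominal_w, etee):
    """Total processor power for a given nominal power and PDN ETEE."""
    return nominal_w / etee


def freq_gain_percent(nominal_w, etee_low, etee_high, mw_per_percent):
    """Frequency increase (%) affordable from the power saved by moving
    from the less efficient PDN to the more efficient one."""
    saved_w = total_power(nominal_w, etee_low) - total_power(nominal_w, etee_high)
    return saved_w * 1000.0 / mw_per_percent


# 4 W TDP figures quoted in this section: ~3 W nominal, ~9 mW per 1% frequency.
gain = freq_gain_percent(3.0, 0.75, 0.80, 9.0)   # approximately 27.8
```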

Impact of PDN ETEE on System Performance.

For a 4 W TDP processor, the domains' nominal power consumption (i.e., the sum of each domain's nominal power consumption) is approximately 3 W. To find the total processor power consumption, we must account for the PDN power conversion loss by dividing the domains' nominal power consumption by the PDN's ETEE. Therefore, the PDN's ETEE can dictate the amount of remaining power budget for reallocation across the domains to improve system performance. For example, we can increase the CPU-cores' clock frequency by 1% for each 9 mW increase in the CPU-cores' power budget at a 4 W TDP (shown in Fig. 2(a)).

To show how even a small difference in ETEE can have a significant impact on system performance, assume we have two PDNs: 1) PDN_1 with ETEE = 75%, and 2) PDN_2 with ETEE = 80%. The total processor power consumption of PDN_1 and PDN_2 is 4 W (3 W / 0.75) and 3.75 W (3 W / 0.8), respectively. According to our model (shown in Fig. 2(a)), the additional 250 mW (4 W – 3.75 W) saved by using PDN_2 (instead of PDN_1) could be allocated to increasing the CPU cores' clock frequency by 28% (250 mW at 9 mW per 1%). This would increase the performance of a highly-scalable workload by 28%.

Footnote: We define the performance scalability of a workload with respect to CPU frequency as the performance improvement the workload experiences with a unit increase in frequency, as described in [46, 139]. Modern processors predict the performance-scalability of a workload at runtime using performance counters [139]. The performance-scalability metric is used by current power management algorithms, such as Intel's SpeedShift [98] and EARtH [27], which first appeared in the Intel Skylake processor [8].

Figure 2: Using our performance model, we show (a) the additional power-budget required (y-axis, in mW) to increase the clock frequency of a CPU/graphics domain by 1% when running CPU-/graphics-intensive workloads, relative to the baseline frequency of each TDP (x-axis), and (b) the percentage (y-axis) of the total TDP power-budget (x-axis) that is allocated to CPU-cores, LLC, IO and SA, and PDN power loss for a CPU-intensive workload.

Assumptions.

Our PDNspot model makes three main assumptions. First, PDNspot assumes that the system operates within a thermal design power (TDP) limit. The power management unit allocates 1) a power-budget to the SA and IO domains, which have nearly constant power consumption across different TDPs, and 2) the remaining power-budget to the compute domain (cores and graphics). The compute domain power-budget is divided between the cores and the graphics engines based on the running workload (e.g., CPU- versus graphics-intensive workload). Second, PDNspot assumes the same routing resources for all PDNs. Therefore, for PDNs in which multiple domains share a single VR (e.g., IVR, LDO), the routing resources of these domains are combined. Third, PDNspot assumes that all voltage emergencies are handled by both 1) existing decoupling capacitors and 2) existing architectural techniques. This is a reasonable assumption for modern client processors [7, 102, 112].

Limitations.

Our PDNspot model has two main limitations. First, the model predicts the ETEE based on average values of inputs over a time interval (e.g., during residency in a power state). To provide the dynamic ETEE of a workload (e.g., during multiple system power states within a workload), PDNspot should be run for each time interval separately with the appropriate input for the examined time interval. However, this is not a big limitation since doing so can be automated (e.g., using a script) once data for multiple intervals is collected. Second, the model considers the processor and the off-chip VRs as a single thermal domain (i.e., as sharing the same TDP), which is true for many systems [92]. However, the PDNspot model does not provide the effect of thermals on power and performance for a system in which the processor and off-chip VRs are in two different thermal domains.
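The per-interval automation mentioned in the first limitation could look roughly like the sketch below. It rests on two assumptions of ours: `predict_etee` is a hypothetical callable standing in for one PDNspot model evaluation, and the workload-level ETEE is taken as total nominal energy divided by total input energy, which is our reading rather than a formula given in the text.

```python
from typing import Callable, Iterable, Tuple

# One interval: (duration in seconds, average nominal power in watts).
Interval = Tuple[float, float]


def workload_etee(intervals: Iterable[Interval],
                  predict_etee: Callable[[float], float]) -> float:
    """Energy-weighted ETEE over a sequence of intervals."""
    out_energy = 0.0   # energy delivered to the domains (J)
    in_energy = 0.0    # energy drawn from the battery/PSU (J)
    for duration_s, p_nom_w in intervals:
        etee = predict_etee(p_nom_w)          # one model run per interval
        out_energy += p_nom_w * duration_s
        in_energy += p_nom_w * duration_s / etee
    return out_energy / in_energy
```

For instance, `workload_etee([(1.0, 2.0), (1.0, 2.0)], lambda p: 0.8)` evaluates a two-interval trace with a constant, made-up efficiency.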

4. PDNspot Validation

PDNs in modern client processors have complex designs, and they involve several components integrated on die, package, and board. For example, the IVR design includes multiple components such as 1) buck regulator bridges [21], 2) control modules that generate the pulse width modulation (PWM) signals [49, 73, 93] and activate IVR phases, 3) air core inductors (ACI) [21, 49], and 4) Metal Insulator Metal (MIM) capacitors [21]. In addition, several IVR parameters (e.g., thresholds for voltage-regulator power-states) and algorithms (e.g., phase-shedding management) are typically configured and tuned post-silicon. Therefore, modeling these designs with, for example, SPICE [87] is inaccurate and unsuitable for validating our power models. Instead, we obtain the input parameters (shown in Table 2) to PDNspot and validate the three power models of PDNspot with real experimental data from our lab that we collect using two different sets of benchmark traces that are typically used to evaluate client processors.

In this section, we present the 1) experimental setup used to obtain PDNspot model parameters, 2) methodology for obtaining PDNspot model parameters, and 3) PDNspot validation process.

System Setup.

To measure power and validate our power models, we use two systems with the configurations shown in Table 3. Intel Broadwell and Skylake architectures use IVR [88] and MBVR [26] PDNs, respectively.

Table 3: Processor configurations and PDNs

Processors | 1) i7-5600U [54], Broadwell architecture, PDN topology: IVR [88]; 2) i7-6600U [55], Skylake architecture, PDN topology: MBVR [26]; L3 (LLC) cache: 4 MB; process technology node: 14 nm
Memory | DDR3L-1600 MHz [65], non-ECC, dual-channel, 8 GB capacity

Benchmark Traces.

To obtain the input parameters (shown in Table 2) for our models and validate the models, we use approximately 5000 traces from a wide variety of benchmarks typically used in evaluating client processors. We use ∼… ∼… ∼750 graphics traces comprising 1) representative CPU- and graphics-intensive workloads including SPEC CPU2006 [114], Sunspider [128], PhotoShop [2], Illustrator [1], SYSmark [14], HandBrake [133], 3DMark06 [124], and Crysis [28], 2) representative battery life workloads such as office productivity workloads (e.g., MobileMark [13]), video conferencing and streaming workloads, and web-browsing workloads [6], and 3) synthetic traces of power-virus [26] for each domain, which can be generated using tools such as McPAT [77], SYMPO [31], or Intel's Blizzard [9].

Power Measurements.

For the platform power measurements, we use a Keysight N6705B DC power analyzer [69] equipped with an N6781A source measurement unit (SMU) [70]. The N6705B (equipped with the N6781A) accuracy is around 99.975% [70]. The power analyzer measures and logs the instantaneous power consumption of different device components. Keysight's control and analysis software [69] is used for data visualization and measurement management. For more detail, we refer the reader to the Keysight manual [69] and to our prior work [42].

We describe the process we use to obtain each of the input parameters to the PDNspot models. A summary of the main parameters is shown in Table 2.

VR Efficiency Curves – Input Parameters.

We measure two sets of parameters for 1) on-chip VR efficiency (i.e., $\eta_{IVR}$ and $\eta_{LDO}$) and 2) off-chip VR efficiency (i.e., $\eta_{VIN}$, $\eta_{GFX}$, $\eta_{SA}$, and $\eta_{IO}$). We perform the measurements on our systems across multiple values in the operational range of the 1) VR input voltage (e.g., 7.2 V, 9 V, 12 V for off-chip VRs; 1.6 V and 1.8 V for IVR), 2) VR output voltage (e.g., 0.5 V, 0.6 V, 0.7 V, 1 V, 1.8 V), and 3) load current.

We measure the off-chip VR efficiency ($\eta_{VIN}$, $\eta_{GFX}$, $\eta_{SA}$, and $\eta_{IO}$) by connecting the VR input (output) to channel A (B) of the DC power analyzer, which we configure as the power supply (DC electronic load) [71]. This setup enables us to 1) measure the input and output power, and 2) sweep over the ranges of the load current, output voltage, and input voltage values, and log the data into the host PC that runs the control and analysis software. We also measure the efficiency for each VR power-state for VRs that support multiple power-states (e.g., V_IN supports PS0, PS1, PS3, and PS4). Fig. 3 shows the efficiency curves for the off-chip VRs (i.e., V_Core, V_GFX, V_SA, V_IO, and V_IN) as a function of multiple output voltages, one input voltage (7.2 V), and two VR power-states (PS0 and PS1).

Figure 3: Off-chip VR efficiency curves (efficiency in %, y-axis) as a function of: 1) output current (Iout in A, x-axis), 2) output voltage (Vout = 0.6 V, 0.7 V, 1 V, 1.8 V), 3) VR power-states (only PS0 and PS1 shown), and 4) input voltage (Vin, only 7.2 V is shown).

We measure IVR efficiency ($\eta_{IVR}$) using the Broadwell processor. Since the IVR is integrated into the processor, it is impossible to disconnect the native load (e.g., cores, graphics engines) and connect a high current load directly to the output of an IVR. Therefore, to measure the IVR efficiency, we operate the processor in a special Design For Test (DFT) mode [21]. We also operate the processor clock tree at varying frequencies to enable a large effective adjustable load current. We measure the current and voltage at the output and input (i.e., the output of the V_IN VR in Fig. 1) of the IVR [21]. Next, we calculate the input and output power and plot the efficiency curves as a function of load current and output voltage. Table 2 (On-chip VR Efficiency) shows the range of the measured IVR efficiency (81%–88%). The actual curves in PDNspot plot the efficiency as a function of input voltage, output voltage, and output current.

We measure the LDO VR efficiency ($\eta_{LDO}$) in two steps. First, since the LDO VR is not implemented in our experimental systems, we emulate the LDO VR static behavior using the power-gates that exist in the MBVR PDN of the Skylake processor, a technique which is used by Intel [79] to implement an LDO VR. Second, we measure the input and output power of the LDO VR under varying load current, input and output voltages and plot the efficiency curves. The LDO VR efficiency is the ratio between the output and the input voltage times the ratio between the output and the input current (also known as current efficiency), i.e., $\eta_{LDO} = (V_{OUT} / V_{IN}) \cdot (I_{OUT} / I_{IN})$. Our measurements show that the current efficiency, i.e., $I_{OUT} / I_{IN}$, is more than 99%, as tabulated in Table 2.

Footnote: By controlling the number of the conducting power-gate transistors and their gate voltages, the power-gate behaves like an LDO VR. The actual LDO VR implementation has additional circuitry (e.g., to handle load transient response, digital control of the LDO VR output).

Nominal Power of Domains – Input Parameter.

We measure the nominal power ($P_{NOM}$) input parameter of each domain (i.e., cores, LLC, graphics, SA, and IO) directly on the Skylake system when running traces of single-threaded, multi-threaded, and graphics workloads. We log the measured power of each trace and its application ratio (i.e., AR, discussed in Sec. 2.4).

Other Input Parameters.

We measure the load-line impedance ($R_{LL}$) from a domain's input to the output of the off-chip VRs for each domain directly on the Skylake and Broadwell systems. We measure peak power (i.e., $P_{peak}$) when running power-virus traces. We estimate the leakage-power fraction ($F_L$) using a post-silicon technique, thermal conditioning [23, 25, 47], by 1) increasing the processor temperature while running a load with constant voltage and frequency (i.e., constant dynamic power), 2) measuring the associated changes in power consumption, and 3) extrapolating the domain's power fraction that is affected by temperature, as the leakage power depends exponentially on temperature whereas the dynamic power is not affected by temperature [34, 39, 40, 64].

We validate PDNspot by comparing the predicted ETEE obtained from each PDNspot model (i.e., IVR, MBVR, and LDO) with the ETEE measurements on real systems.

To validate PDNspot, we use as reference the total power consumption of real Intel processors (Broadwell, Skylake, and Skylake with emulated LDO PDN) measured from the main power supply (battery/PSU) for each of the PDNs ($P_{IVR}$, $P_{MBVR}$, and $P_{LDO}$). We use PDNspot to obtain the predicted power consumption of each PDN. We use a subset (200) of the benchmark traces (single-thread, multi-programmed, and graphics, described in Sec. 4.1) that have various application ratios (AR). We calculate the measured (predicted) ETEE of each PDN by dividing the total nominal power consumption (i.e., PDN output power) by the measured (predicted) total power consumption (i.e., PDN input power). Finally, we calculate the accuracy of PDNspot by comparing the measured ETEE to the predicted ETEE of each PDN.

We find that our three IVR, MBVR, and LDO PDN models in PDNspot have an average (min/max) accuracy of 99.1% (98.7%/99.3%), 99.4% (98.9%/99.7%), and 99.2% (98.6%/99.6%), respectively, across all our 200 workloads. Fig. 4(a–i) shows the validation results (measured vs. predicted ETEEs) for 4 W, 18 W, and 50 W TDPs when running single-threaded, multi-programmed, and graphics traces with an AR between 40% and 80%. Fig. 4(j) shows the results for the battery life related power-states: C0 with minimum frequency (C0_MIN) and package C-states (C2/3/6/7/8) [34, 39, 40].
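The validation metrics described above can be sketched as follows. The ETEE definition (output power over input power) comes from the text; the accuracy formula is an assumption on our part (one minus the relative error of the predicted ETEE), since the exact definition is not spelled out here, and the function names are hypothetical.

```python
def etee(nominal_power_w, total_power_w):
    """End-to-end efficiency: PDN output power over PDN input power."""
    return nominal_power_w / total_power_w


def accuracy(measured_etee, predicted_etee):
    """Assumed definition: 1 - |predicted - measured| / measured."""
    return 1.0 - abs(predicted_etee - measured_etee) / measured_etee


def summarize(pairs):
    """pairs: iterable of (measured_etee, predicted_etee), one per trace.
    Returns (average, minimum, maximum) accuracy."""
    accs = [accuracy(m, p) for m, p in pairs]
    return sum(accs) / len(accs), min(accs), max(accs)
```

Reporting the minimum and maximum alongside the average, as the text does, guards against a good mean hiding a poorly predicted trace.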

5. Motivation: PDN Inefficiencies in Client Processors

This section makes three key empirical observations about the three most commonly-used PDN architectures (i.e., IVR [21, 61, 88], MBVR [29, 63, 97], LDO [15, 18, 111, 112, 113, 120]) in modern high-end client processors to motivate the need for a hybrid and adaptive PDN that leverages the advantages of each one of the three PDN architectures.

We use our validated model, PDNspot, to evaluate the efficiency of the three PDNs. We estimate the off-chip current consumption, the ETEE with a breakdown into multiple sources of power-conversion losses, and the average power consumption of a processor using each of the three PDNs. We use a total of 300 CPU-intensive, graphics-intensive, and video playback workload traces to evaluate each PDN.

Based on our evaluation results shown in Figures 4 and 5, we make three key observations.

Figure 4: PDNspot validation results (measured vs. predicted ETEE for the IVR, MBVR, and LDO PDNs). (a)–(i) End-to-end power-conversion efficiency (ETEE, y-axis) for single-threaded, multi-threaded, and graphics traces at 4 W, 18 W, and 50 W TDP with varying application ratios (AR, x-axis, 40%–80%): (a) Single-Thread, 4 W; (b) Single-Thread, 18 W; (c) Single-Thread, 50 W; (d) Multi-Thread, 4 W; (e) Multi-Thread, 18 W; (f) Multi-Thread, 50 W; (g) Graphics, 4 W; (h) Graphics, 18 W; (i) Graphics, 50 W. (j) shows the results, for all TDPs, for the battery life related power-states: C0 with minimum frequency (C0_MIN) and package C-states (C2/3/6/7/8) [34, 39, 40].

Figure 5: Breakdown of the power conversion loss of the three PDNs (IVR, MBVR, LDO) when running a CPU-intensive workload (AR = …) at 4 W, 18 W, and 50 W TDPs. The loss categories are on- and off-chip VR inefficiencies, conduction loss ($I^2R$) in the core and GFX domains, conduction loss ($I^2R$) in the SA and IO domains, and others. Conduction loss ($I^2R$) and on-chip and off-chip VR inefficiencies are the most prominent losses. Normalized (to the IVR PDN) chip input current ($I$, i.e., from off-chip VRs) and load-line impedance ($R_{LL}$) are shown as line plots. Figure annotations: "∼6% higher loss due to IVR's …"; "Core and GFX $I^2R_{LL}$ losses of MBVR/LDO PDNs scale faster with TDP than IVR PDN due to MBVR/LDO's higher 1) input current, 2) $R_{LL}$".

Observation 1.

We observe that when executing CPU- and graphics-intensive workloads, the IVR PDN has a lower ETEE at the 4 W TDP (Figures 4(a, d, g)) and a higher ETEE at the 50 W TDP (Figures 4(c, f, i)) compared to the MBVR and LDO PDNs across the entire range of tested ARs. The ETEE crossover point, at which the IVR ETEE becomes higher than the MBVR/LDO ETEE, exists at some TDP between 4 W and 50 W.

Fig. 5 provides more insight into this observation with breakdowns of PDN power conversion loss. We find that at a 4 W TDP, the dominant contributors to the PDN power conversion loss are the on-chip and off-chip VR inefficiencies. At a 4 W TDP, the IVR PDN has a lower ETEE than the MBVR and LDO PDNs due to the higher power conversion inefficiencies of the IVR PDN's on-chip and off-chip VRs. At a 50 W TDP, we find that the MBVR and LDO PDNs have lower ETEEs due to their high $I^2R$ loss in the core and graphics domains. The high $I^2R$ loss is due to: 1) a ∼…× higher chip input current in the MBVR and LDO PDNs compared to the IVR PDN, and 2) a 2.5×/1.3× higher load-line impedance ($R_{LL}$) in the MBVR/LDO PDNs compared to the IVR PDN. We conclude that the MBVR and LDO PDNs are more efficient at a low TDP (e.g., 4 W) compared to the IVR PDN, while the IVR PDN is more efficient at a high TDP (e.g., 50 W).

Observation 2.

We observe that the PDN ETEE is affected not only by the TDP (as discussed in Observation 1) but also by the workload's Application Ratio (AR) and the workload type, i.e., single-threaded, multi-threaded, and graphics.

Fig. 4(a–i) shows that the MBVR and LDO PDN ETEEs increase with AR, which is most pronounced at the 18 W and 50 W TDPs. This phenomenon is due to the load-line (described in Sec. 2), which results in a lower voltage guardband when running workloads with higher ARs.

Fig. 4(b, e, h) shows that the single-thread, multi-thread, and graphics workloads (all at the same TDP of 18 W) have different ETEE curves. For example, for the graphics workload in Fig. 4(h), the IVR PDN is less efficient than the other two PDNs for the entire AR range (with a crossover point around 21 W TDP, not shown in Fig. 4, at which the IVR ETEE becomes higher than the MBVR/LDO ETEE), while the other two workloads have crossover points at different ARs within the 18 W TDP.

Fig. 4(a–f) shows that the LDO ETEE is higher than the MBVR ETEE for CPU-intensive (single- and multi-threaded) workloads, but is lower than the MBVR ETEE for graphics-intensive workloads. Note that the LDO inefficiency is more dominant in graphics workloads due to the high voltage difference between the core and graphics domains: in graphics-intensive workloads, the graphics engine runs at relatively high frequencies (and voltages) while the cores are kept at low frequencies (and voltages). Therefore, the LDO PDN 1) sets the off-chip (i.e., first-stage) VR voltage to the high voltage level required by the graphics engines (e.g., 0.9 V) while activating the graphics engines' on-chip LDO (i.e., second-stage) VR in bypass-mode, and 2) uses the core's on-chip LDO (i.e., second-stage) VR to regulate the voltage down to the low voltage level required by the core (e.g., 0.5 V). Doing so results in a very low power conversion efficiency for the core's LDO VR (bounded, per Equation 10, by the ratio of the output voltage to the input voltage, e.g., 0.5 V / 0.9 V).

Footnote: The IVR PDN reduces the chip input current because it uses a high input voltage from the first-stage VR into the chip (Sec. 2).

Footnote: The IVR and LDO PDNs have lower $R_{LL}$ compared to MBVR because both IVR and LDO PDNs share routing resources from external VRs into the chip's package and die.

Observation 3.

We observe that the ETEE of the IVR PDN is significantly lower than that of the MBVR and LDO PDNs for computationally light workloads (e.g., video playback, web browsing, office productivity applications [6, 13, 14]) and low-power states across all TDPs. Fig. 4(j) shows the ETEE of the three PDNs in 1) C0_MIN, an active power-state in which the core and graphics domains operate at their lowest frequencies, and 2) package C-states (C2, C3, C6, C7, and C8 [34, 39, 40]), low power-states of the processor. The processor uses these power-states, for all TDPs, to reduce energy consumption (thereby increasing the battery life of battery-powered devices) when the processor runs a light (i.e., low computational intensity) workload or once the processor is partially/fully idle. We explain the effects of ETEE in these power-states on battery life using a video playback workload example.

The video playback [6] workload is a computationally light workload that operates in three main power-states during each video-frame. First, a C0_MIN power-state, which consumes $P_{C0_{MIN}}$ = 2.5 W nominal power for $R_{C0_{MIN}}$ = 10% of the frame's time ($R_{C0_{MIN}}$ is the residency of power state C0_MIN in terms of the fraction of execution time). In this state, the cores and graphics engines prepare a video-frame and store it in main memory. Second, a C2 power-state, which consumes $P_{C2}$ = 1.2 W nominal power for $R_{C2}$ = 5% of the frame's time. The cores and graphics engines are idle (power-gated) in this state. In C2, the display controller fetches part of the frame from main memory into a local buffer inside the display controller. Third, a C8 power-state, which consumes $P_{C8}$ = 0.13 W nominal power for $R_{C8}$ = 85% of the frame's time. In C8, the display controller reads frame data from its local buffer and displays it on the display panel, while the rest of the processor is idle (e.g., main memory is in self-refresh). We calculate the average power of the video playback workload by summing the fractional power of each power-state, taking into account the ETEE in each state (denoted by $\eta_{C0_{MIN}}$, $\eta_{C2}$, and $\eta_{C8}$). Hence, the average power is given by:

$$P_{avg} = \frac{P_{C0_{MIN}} \cdot R_{C0_{MIN}}}{\eta_{C0_{MIN}}} + \frac{P_{C2} \cdot R_{C2}}{\eta_{C2}} + \frac{P_{C8} \cdot R_{C8}}{\eta_{C8}}$$

The video playback average-power results (shown in Fig. 8(c)) show that the MBVR and LDO PDNs have 12% and 11% lower average power, respectively, than the IVR PDN. We conclude that the IVR PDN is energy-inefficient for computationally-light workloads, which negatively impacts both energy consumption and battery life.

Summary.

We conclude that there is no single PDN for modern client processors that maintains a high ETEE across all TDPs, workload types, and application ratios (ARs). These observations motivate us to build a hybrid and adaptive PDN that utilizes the advantages of each one of the three PDN architectures, as we describe in Sec. 6.
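The video playback average-power expression from Observation 3 can be evaluated with a short sketch. The nominal powers and residencies below are the ones quoted in the text (2.5 W at 10%, 1.2 W at 5%, 0.13 W at 85%); the per-state ETEE values are placeholders of ours, since the measured values appear only in Fig. 4(j), and the function name is hypothetical.

```python
def average_power(states):
    """states: iterable of (nominal_power_w, residency, etee) tuples,
    one per power-state; residencies are fractions of the frame time."""
    return sum(p * r / eta for p, r, eta in states)


# (P, R) from the text; the ETEE values are illustrative placeholders only.
video_playback = [
    (2.5, 0.10, 0.80),   # C0_MIN
    (1.2, 0.05, 0.75),   # C2
    (0.13, 0.85, 0.70),  # C8
]
avg_w = average_power(video_playback)
```

Because each term is divided by its own efficiency, a PDN that is inefficient in a low-power state with long residency can noticeably raise the average power even though that state's nominal power is small.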

6. FlexWatts

We present FlexWatts, a hybrid adaptive PDN for modern processors that maintains a high ETEE for the wide power consumption range and workload diversity of client processors. FlexWatts is based on three key ideas. First, it combines IVRs and LDOs in a novel way to share multiple on-chip and off-chip resources and thus reduce BOM, as well as board and die area overheads, as illustrated in Fig. 6. This hybrid PDN is allocated for processor domains with a wide power consumption range (e.g., CPU cores and graphics engines) and it dynamically switches between two modes, IVR-Mode and LDO-Mode, based on the efficiency of each mode, using a special power-management flow. Second, FlexWatts statically allocates an off-chip VR to each system domain with a low and narrow power consumption range (i.e., SA and IO domains). This is because, unlike in compute domains, the power consumption of the system-agent (SA) and IO domains does not significantly scale with TDP (as shown earlier in Fig. 2(b)) or the workload's AR. Thus, it is more energy-efficient to place each of them on a dedicated off-chip VR compared to using an on-chip VR. Third, FlexWatts introduces a new prediction algorithm that automatically determines which PDN mode (IVR-Mode or LDO-Mode) would be the most beneficial based on system and workload characteristics. For example, FlexWatts can operate in LDO-Mode (IVR-Mode) when the processor runs a light (heavy) workload such as video playback (Turbo Boost), or when the processor operates at a low (high) TDP such as 4 W (50 W). FlexWatts uses a runtime ETEE prediction algorithm to select the operation mode (i.e., LDO-Mode or IVR-Mode) that maximizes ETEE.

Footnote: AMD uses the same strategy for their LDO PDNs [112] (Fig. 1(c)).

Hybrid PDN and Resource Sharing.

We build the FlexWatts PDN by modifying a baseline IVR PDN, shown in Fig. 1(a), in two ways. First, we replace the two on-chip IVRs of the SA and IO domains (i.e., V_SA and V_IO IVRs) with two off-chip VRs and two on-chip power-gates, as illustrated in Fig. 6. Second, we implement hybrid VRs, which extend each of the remaining IVRs (i.e., V_Core0/1, V_LLC and V_GFX IVRs in Fig. 1(a)) by implementing an LDO VR using the existing resources of the IVR, as illustrated in Fig. 6 (right side). By doing so, we enable a hybrid PDN that has two modes of operation, IVR-Mode and LDO-Mode, with low cost and low area overhead. As illustrated in Fig. 6, each hybrid VR shares between the two modes 1) on-chip resources such as the high-side (HS) NMOS power switch [21] and decoupling capacitors (both on package and on die) of the baseline on-chip IVR, and 2) off-chip VRs (i.e., V_IN). We use the HS power-switch to implement the LDO VR, similar to Luria et al. [79], a work carried out by Intel that utilizes the power-gate’s power-switch to implement an LDO VR. This architecture enables both PDN modes to share routing resources and the power grid across board, package, and die during operation, as illustrated in Fig. 6.

Figure 6: Our hybrid adaptive PDN (FlexWatts). FlexWatts allocates an off-chip VR to each system domain with a low and narrow power consumption range (i.e., SA and IO domains). For system domains with a wide power consumption range (e.g., CPU cores and graphics engines), FlexWatts allocates a hybrid PDN. This hybrid PDN can dynamically switch between two modes, IVR-Mode and LDO-Mode, based on the expected ETEE benefits of each mode for the current workload and power consumption. The hybrid PDN shares between IVR and LDO modes 1) on-chip resources such as the high-side (HS) NMOS power switch in the IVR PDN as illustrated on the right side, and 2) off-chip VRs (V_IN).

Voltage Noise-Free Mode-Switching.

FlexWatts mode-switching transitions the hybrid PDN between two modes (IVR-Mode and LDO-Mode). Carrying out the mode-switching while the compute domains are active may introduce voltage noise because the two modes have very different operation principles. In IVR-Mode, the off-chip VR (V_IN) is set to a relatively high voltage (e.g., 1.8 V) and the on-chip IVRs regulate the voltage to the level the domain needs (e.g., 0.6 V–1.1 V). In LDO-Mode, the V_IN voltage is set to the maximum voltage required by all domains (e.g., 0.6 V–1.1 V) and the on-chip LDOs regulate this maximum voltage to the level the domain needs. Therefore, the mode-switching should configure the on-chip and off-chip VRs and change their voltage levels while transitioning from one mode to the other.

To prevent any voltage noise during mode-switching, FlexWatts performs mode-switching while the compute domains are idle. To do so, we 1) place the processor in an idle power-state for a short period, 2) configure the hybrid PDN and update the on-chip and off-chip VR levels, and 3) exit the idle power-state and resume the processor with the new PDN mode. To this end, we utilize a power-management flow that places the processor into the idle power-state (which exists in most modern processors [26, 34, 39, 40, 42, 43, 48, 51, 121]), in which the cores, LLC, and graphics units are turned off after their contexts are saved into a dedicated SRAM. We leverage the C6 package C-state power management firmware flow [42] to implement FlexWatts’s mode-switching transition flow. FlexWatts takes the following three steps to switch between two PDN modes. First, the power management unit (PMU) places the system into the package C6 idle power state, during which the PMU saves the context of the hybrid PDN domains (i.e., the CPU cores, LLC, and graphics) and turns off their clock and voltage. Second, the PMU performs the actual mode switching actions of the hybrid PDN by 1) adjusting the V_IN VR voltage to a level suitable for the new mode (i.e., 1.8 V for IVR-Mode, or 0.6 V–1.1 V for LDO-Mode), and 2) configuring the hybrid VRs to operate in the new mode (as illustrated in Fig. 6). Third, the PMU exits the package C6 idle power-state and switches to the active state. Doing so allows the processor to resume execution while the hybrid PDN domains use the new PDN mode.
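The three-step flow can be illustrated with a minimal sketch. This is a toy software model, not PMU firmware: the class `HybridPdn`, the function `switch_mode`, and the state labels `"C0"`/`"C6"` are hypothetical names introduced here, and the only values taken from the text are the 1.8 V V_IN level for IVR-Mode and the rule that V_IN in LDO-Mode equals the maximum voltage required by the domains.

```python
from dataclasses import dataclass, field
from enum import Enum


class PdnMode(Enum):
    IVR = "IVR-Mode"
    LDO = "LDO-Mode"


@dataclass
class HybridPdn:
    """Toy model of the hybrid PDN state (all names are illustrative)."""
    mode: PdnMode = PdnMode.IVR
    v_in: float = 1.8            # off-chip V_IN level in volts
    package_state: str = "C0"    # "C0" = active, "C6" = idle (assumed labels)
    log: list = field(default_factory=list)


def switch_mode(pdn: HybridPdn, new_mode: PdnMode, max_domain_voltage: float) -> None:
    """Switch PDN modes only while the compute domains are idle."""
    if new_mode is pdn.mode:
        return  # already in the requested mode

    # Step 1: enter package C6 (context saved, clocks and voltage off).
    pdn.package_state = "C6"
    pdn.log.append("enter C6")

    # Step 2: retarget V_IN and reconfigure the hybrid VRs.
    pdn.v_in = 1.8 if new_mode is PdnMode.IVR else max_domain_voltage
    pdn.mode = new_mode
    pdn.log.append(f"V_IN={pdn.v_in:.2f} V, hybrid VRs -> {new_mode.value}")

    # Step 3: exit package C6 and resume execution in the new mode.
    pdn.package_state = "C0"
    pdn.log.append("exit C6")


pdn = HybridPdn()
switch_mode(pdn, PdnMode.LDO, max_domain_voltage=0.9)
print(pdn.mode.value, pdn.v_in, pdn.log)
```

The point of the sketch is the ordering: the voltage and VR reconfiguration in step 2 happens strictly between the idle entry and the idle exit, so no compute domain is active while V_IN moves.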

Runtime PDN Mode-Prediction Algorithm.

So far, we explained how to switch between two PDN modes (i.e., mode-switching flow) without describing when to switch. FlexWatts relies on our new runtime mode-prediction algorithm whose goal is to predict which PDN mode, among the two modes, IVR-Mode and LDO-Mode, provides the best end-to-end power-conversion efficiency (ETEE).

As shown in Fig. 4, ETEE is a function of 1) the AR and the workload type (i.e., single-thread, multi-thread, and graphics), and 2) the TDP and the power-state of the system. ETEE depends on the AR due to the load-line effect (discussed in Sec. 2.4 and shown in Equation 3). The workload type affects ETEE because each of the three workload types stresses the underlying power delivery network differently, as explained in Sec. 3.1.

Algorithm 1 depicts our mode prediction algorithm. The key idea of our algorithm is two-fold. First, we store two sets of ETEE curves inside the PMU firmware, one set for the IVR PDN and the other set for the LDO PDN. A PDN ETEE curve set is a multidimensional table that includes an ETEE curve corresponding to a TDP for each workload type (i.e., three curves for each TDP point). Each ETEE curve stores the ETEE values as a function of the AR (as shown in Fig. 4(a-i)). We also include one ETEE curve for power states (as shown in Fig. 4(j)). Second, for every evaluation interval (e.g., 10 ms), we estimate each of the algorithm’s input parameters (i.e., TDP, AR, workload type, and power-state). We use the estimated parameters to access the corresponding ETEE curve to obtain the ETEE values for both IVR-Mode and LDO-Mode. The algorithm chooses the mode that maximizes the ETEE. Next, we explain how we estimate the inputs to our algorithm (i.e., TDP, AR, workload type, and power-state) at runtime.

Footnote: The context is stored into dedicated SRAMs, using power from an always-on VR (not shown in Fig. 1) that retains the dedicated SRAMs’ contents in idle states [42, 43].

Footnote: A modern PMU implements multiple curves (as tables) such as leakage power as a function of temperature and voltage, voltage as a function of frequency, and VR power-conversion efficiency as a function of input-voltage, output-voltage and output-current [34, 39, 40, 98, 101].

Algorithm 1 FlexWatts Mode Prediction Algorithm
procedure Determine_FlexWatts_Mode
    Input: TDP, AR, WL_TYPE, PS /* power-state */
    Output: PDN_Mode (IVR-Mode or LDO-Mode)
    IVR_ETEE = estimate_IVR_ETEE(TDP, AR, WL_TYPE, PS)
    LDO_ETEE = estimate_LDO_ETEE(TDP, AR, WL_TYPE, PS)
    if IVR_ETEE ≥ LDO_ETEE
        return IVR-Mode
    else
        return LDO-Mode
end procedure

Runtime Estimation of the Algorithm Inputs.

The PMU of a modern processor uses the TDP, AR, workload-type, and power-state in multiple power management algorithms such as 1) power-budget management (PBM) algorithm [24, 26, 101], 2) Turbo Boost algorithm [26, 98, 101], and 3) system maximum current protection [7, 102].

The runtime-configured TDP value is available to the PMU [5, 132]. To estimate the AR, the PMU uses activity sensors [7, 10, 19, 30, 78, 102, 110, 126] that are implemented in multiple domains of the Intel Skylake processor [19, 26, 102, 110]. These activity sensors estimate each domain’s activity using internal events in each domain, such as active execution ports in the core, memory stalls, and the type of instructions being executed (e.g., scalar, or vector instructions of 128-bits/256-bits/512-bits). A dedicated weight is associated with each event, and the weighted sum of the events in a domain is periodically (e.g., every millisecond) sent to the PMU. The weights of the activity sensors are calibrated post-silicon to provide a proxy of the AR.

The PMU estimates the workload-type (WL_TYPE) based on the power-state (i.e., active/idle) of the cores and graphics engines. For example, if the graphics engines are active, then the workload-type is set to graphics, while if more than one core is active and the graphics engines are idle, then it is set to multi-threaded.

The power-state, i.e., package power-state, of the processor is known to the PMU firmware as the PMU carries out the transitions from one package C-state to another [34, 39, 40].
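As a rough illustration of how Algorithm 1 and its inputs fit together, the following sketch stores ETEE curves as small lookup tables and selects the mode with the higher estimated ETEE. Everything numeric here is a placeholder and not data from Fig. 4; the table layout, the linear interpolation between AR points, the `"C0"` label for the active state, and all function and variable names other than the Algorithm 1 inputs are assumptions made for this example.

```python
from bisect import bisect_left

# Placeholder ETEE tables (NOT values from Fig. 4): one curve per
# (TDP in watts, workload type), stored as sorted (AR, ETEE) points.
IVR_CURVES = {
    (4, "multi_thread"): [(0.2, 0.62), (0.6, 0.66), (1.0, 0.68)],
    (50, "multi_thread"): [(0.2, 0.78), (0.6, 0.80), (1.0, 0.81)],
}
LDO_CURVES = {
    (4, "multi_thread"): [(0.2, 0.80), (0.6, 0.79), (1.0, 0.77)],
    (50, "multi_thread"): [(0.2, 0.79), (0.6, 0.77), (1.0, 0.74)],
}
# Placeholder ETEE per idle package power-state.
IVR_IDLE = {"C2": 0.55, "C8": 0.50}
LDO_IDLE = {"C2": 0.65, "C8": 0.62}


def interpolate(points, ar):
    """Piecewise-linear lookup of ETEE at a given AR, clamped at both ends."""
    xs = [x for x, _ in points]
    if ar <= xs[0]:
        return points[0][1]
    if ar >= xs[-1]:
        return points[-1][1]
    i = bisect_left(xs, ar)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (ar - x0) / (x1 - x0)


def estimate_etee(curves, idle, tdp, ar, wl_type, ps):
    """Use the power-state curve when idle, else the AR curve for (TDP, type)."""
    if ps != "C0":
        return idle[ps]
    return interpolate(curves[(tdp, wl_type)], ar)


def estimate_ar(event_counts, weights):
    """Weighted sum of activity-sensor events, used as a proxy of the AR."""
    return sum(weights[name] * count for name, count in event_counts.items())


def classify_workload(active_cores, graphics_active):
    """Derive WL_TYPE from which cores and graphics engines are active."""
    if graphics_active:
        return "graphics"
    return "multi_thread" if active_cores > 1 else "single_thread"


def determine_flexwatts_mode(tdp, ar, wl_type, ps):
    """Algorithm 1: pick the mode with the higher estimated ETEE."""
    ivr_etee = estimate_etee(IVR_CURVES, IVR_IDLE, tdp, ar, wl_type, ps)
    ldo_etee = estimate_etee(LDO_CURVES, LDO_IDLE, tdp, ar, wl_type, ps)
    return "IVR-Mode" if ivr_etee >= ldo_etee else "LDO-Mode"


wl_type = classify_workload(active_cores=2, graphics_active=False)
print(determine_flexwatts_mode(tdp=4, ar=0.6, wl_type=wl_type, ps="C0"))
```

In this sketch the comparison is re-evaluated once per call; in the text it would run once per evaluation interval (e.g., 10 ms), with the inputs refreshed from the PMU's existing sensors and state.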

FlexWatts Overhead.

We estimate the latency of our FlexWatts mode switching flow with techniques used by previous works that estimate the package C-state latencies [105, 106]. We find that 1) placing the processor into package C6 power state takes 45 µs (without voltage changes), 2) adjusting the on-chip and off-chip VR voltage levels (assuming a latency of ≤ µs for on-chip VRs [21, 79], and a slew rate of 50 mV/µs [60] for off-chip VRs) takes 19 µs, and 3) exiting the C6 power state takes about 30 µs. Hence, the overall flow takes nearly 94 µs. It should be noted that the DVFS (P-state) latency on Intel processors can take up to 500 µs [34, 37, 51, 82] depending on the processor’s internal state, which shows that the FlexWatts flow latency is within an acceptable range.

The area overhead of FlexWatts over the IVR PDN is minimal. The additional area required to implement the LDO mode using the IVR resources (i.e., the high-side NMOS power switch) is around 0.041 mm² [79] at the 14 nm process technology node. This corresponds to only 0.04% and 0.03% of the Intel dual and quad core client die sizes [129], respectively.

7. Experimental Results

We evaluate FlexWatts with respect to performance, battery life, board area and bill of materials (BOM), compared to the three commonly-used state-of-the-art PDNs in modern processors: IVR, MBVR, and LDO. We also include a comparison with a hybrid PDN (used in Intel Skylake-X processors [62]) that combines IVR and MBVR PDNs, which we refer to as I+MBVR. Similar to the LDO PDN, I+MBVR uses off-chip VRs for the SA and IO domains and, similar to the IVR PDN, it uses IVRs for the other domains. We evaluate the PDNs using our new PDNspot framework described in Sec. 3.

We evaluate the performance of FlexWatts compared to other PDN architectures (IVR, MBVR, LDO, I+MBVR), under the following scenarios:
• When running SPEC CPU2006 [114] core performance benchmarks, on processors with 4 W TDP. We also show the average performance of SPEC CPU2006 as TDP varies between 4 W and 50 W.
• When running 3DMark06 [124] graphics performance workloads, as TDP varies between 4 W and 50 W.

We evaluate the performance of CPU- and graphics-intensive workloads assuming a fan-less system. Therefore, we use a junction temperature (T_j) of 80 °C for TDPs between 4–8 W and 100 °C for TDPs higher than 8 W.

SPEC CPU2006 Benchmarks at 4W TDP.

We evaluate SPEC CPU2006 [114] benchmarks with the maximum allowed frequency (i.e., 0.9 GHz) for a 4 W TDP system. For these benchmarks, the two cores run at the same frequency and voltage, as in all recent client processors [21, 29, 63, 88, 97, 111]. In addition, the voltage design point for the LLC matches the core voltage domain as described in Rotem et al. [100]. Thus, the core0, core1, and LLC domains have nearly the same voltage requirements (except for voltage variations due to manufacturing process variation).

Fig. 7 plots the performance improvement (normalized to that of the IVR PDN at 100%) of each SPEC CPU2006 benchmark when using each of the five PDNs in a 4 W TDP system. PDNspot uses the performance-scalability metric of the SPEC CPU2006 benchmarks to estimate performance (as we discuss in Sec. 3.3). Based on Fig. 7, we make four key observations. 1) The performance improvement of MBVR, LDO, and FlexWatts, averaged across all benchmarks, is greater than 22% for the 4 W TDP system. This is because MBVR, LDO, and FlexWatts (which mainly operates in LDO-Mode at 4 W TDP) each have a higher ETEE than IVR at low TDP. At low TDPs, IVR has a larger power conversion loss due to the two-stage (on-chip and off-chip) voltage regulation. 2) FlexWatts has a very small (i.e., less than 1%) performance degradation compared to LDO and MBVR PDNs (the highest performing PDNs at 4 W TDP). FlexWatts performs only slightly worse than the LDO and MBVR PDNs due to FlexWatts’s higher load-line that is a result of resource sharing between its LDO and IVR components within FlexWatts’s hybrid PDN (discussed in Sec. 6). 3) The I+MBVR PDN provides higher performance than the IVR PDN (6% on average) since I+MBVR removes the two-stage voltage regulation of the SA and IO domains. This change improves the ETEE of the I+MBVR PDN over the IVR PDN, and therefore increases the power-budget of the CPU core domain. 4) The performance improvement of the five PDNs correlates with the performance-scalability of the workloads, since the performance-scalability metric reflects how the performance of an application improves as the CPU clock frequency increases (due to the additional power-budget allocated to the CPU cores).

We conclude that FlexWatts significantly improves the CPU core performance compared to the state-of-the-art PDN (IVR) at a low TDP point by operating in LDO-Mode, which results in a higher ETEE than that of the IVR PDN.

Footnote: The junction temperature (T_j) of a fan-less small form factor device (e.g., smartphone, tablet) is typically limited by the outer surface temperature of the device [99, 136].

Figure 7: SPEC CPU2006 performance (normalized to the IVR PDN) with five PDNs at 4 W TDP, sorted (in ascending order) by the average performance-scalability of each benchmark. [Plot: Normalized Performance (%) of IVR, MBVR, LDO, I+MBVR, and FlexWatts for each benchmark and the average, together with Perf. Scalability (%).]

SPEC CPU2006 Benchmarks at 4W to 50W TDP.

We examine the effects of using different processor TDP levels, ranging from 4 W to 50 W, on CPU performance. Fig. 8(a) plots the average performance across the SPEC CPU2006 benchmarks for several TDP levels. Based on Fig. 8(a), we make three key observations. 1) At TDPs lower than 18 W, FlexWatts provides up to 22% higher performance over the IVR PDN by operating mainly in LDO-Mode, which has a higher ETEE than the IVR PDN at low TDPs. Compared to the highest-performing PDNs (MBVR/LDO) at low TDPs, FlexWatts performs only slightly (i.e., less than 1%) worse due to the higher load-line of FlexWatts’s LDO-Mode. 2) At TDPs higher than 18 W, FlexWatts provides up to 7%/4% higher performance over the MBVR/LDO PDNs by operating mainly in IVR-Mode, which has a higher ETEE than the MBVR/LDO PDNs at high TDPs. Compared to the highest-performing PDN (IVR) at high TDPs, FlexWatts performs only slightly (i.e., less than 1%) worse due to the higher load-line of FlexWatts’s IVR-Mode. 3) The I+MBVR PDN provides higher (up to 6%) performance than the IVR PDN across the tested TDP range since I+MBVR removes the two-stage voltage regulation of the SA and IO domains. This change improves the ETEE of the I+MBVR over the IVR PDN, and therefore increases the power-budget of the CPU core domain. However, I+MBVR provides significantly lower performance (up to 15%) than FlexWatts at low TDPs, since the I+MBVR PDN uses two-stage voltage regulation (i.e., for the CPU cores, LLC, and graphics domains), which results in a lower ETEE compared to FlexWatts at low TDPs (e.g., 4 W).

Graphics Workloads at 4W to 50W TDP.

We evaluate different PDN architectures using the 3DMark06 graphics workloads [124]. While running these workloads, 10% to 20% of the processor’s power-budget is allocated to the CPU cores, while the rest is allocated to the graphics engines. In addition, since the graphics workloads require high memory bandwidth, the LLC domain operates at a higher frequency and higher voltage than the CPU domain.

Fig. 8(b) shows the average performance of the 3DMark06 graphics workloads with the five PDN architectures when running at 4 W to 50 W TDP. We make four key observations. 1) At TDPs lower than 25 W, FlexWatts provides up to 25% higher performance over the IVR PDN by operating mainly in LDO-Mode, which has a higher ETEE than the IVR PDN at low TDPs. 2) At TDPs higher than 25 W, FlexWatts provides up to 3%/6% higher performance over MBVR/LDO PDNs by mainly operating in IVR-Mode, which has a higher ETEE than the MBVR/LDO PDNs at high TDPs. 3) FlexWatts performs slightly worse (i.e., up to 2% lower) than MBVR/LDO PDNs due to i) the higher load-line of FlexWatts, and ii) the large difference in operating voltages across the CPU core, LLC and graphics domains while running graphics workloads (i.e., the core domain requires low voltage, e.g., 0.5 V, while the graphics domain requires high voltage, e.g., 0.9 V), which degrades the ETEE of both FlexWatts (in LDO-Mode) and LDO PDNs (as we discuss in Sec. 2.3). 4) The I+MBVR PDN provides up to 6% higher performance than the IVR PDN across the tested TDP range. I+MBVR improves the power conversion efficiency for the SA and IO domains (which results in I+MBVR having a higher ETEE than the IVR PDN), and increases the power-budget of the graphics domain. However, I+MBVR provides significantly lower performance (up to 19%) than FlexWatts at low TDPs, since the I+MBVR PDN’s two-stage voltage regulation (similar to IVR PDN) at low TDPs (e.g., 4 W) results in a lower ETEE than FlexWatts.

Based on our extensive CPU- and graphics-intensive workload evaluations, we conclude that FlexWatts increases the performance of a low TDP (e.g., 4 W) processor by up to 25%, while maintaining a low (i.e., less than 2%) performance degradation for high TDP processors compared to the state-of-the-art IVR PDN, over a wide range of TDPs (i.e., 4 W–50 W). This is because FlexWatts 1) allocates the hybrid PDN to domains with a wide power consumption range (i.e., CPU cores, LLC, and graphics), thereby maintaining a high ETEE across the wide power range, and 2) allocates an off-chip VR to each domain with a low and narrow power consumption range (i.e., SA and IO), thereby maintaining high power conversion efficiency in these domains, which increases FlexWatts’s ETEE across all TDPs and workloads compared to the IVR PDN.

Figure 8: Evaluation of the five PDNs normalized to IVR PDN (the state-of-the-art PDN [21, 61, 88]): (a) SPEC CPU2006 average performance, (b) 3DMark06 performance, (c) Battery life workloads, (d) BOM, and (e) Board area. [Plots: panels (a) and (b) show Norm. Performance (%) at 4 W, 8 W, 10 W, 18 W, 25 W, 36 W, and 50 W TDP; panel (c) shows Norm. Average Power (%) for Video Playback, Video Conf., Web Browsing, and Light Gaming; panels (d) and (e) show Normalized BOM Cost and Normalized Area across the same TDPs. Series in all panels: IVR, MBVR, LDO, I+MBVR, FlexWatts.]

Battery Life Workloads.

We choose four workloads that are commonly used to evaluate the battery life of mobile processors [6, 17, 140]: video playback [6, 17], video conferencing [13, 17], web browsing [13, 14], and light gaming [107] benchmarks. For our modeled system, video playback, video conferencing, web browsing, and light gaming have 10%, 20%, 30%, and 40% active state with minimum frequency (C0_MIN) residencies, respectively. During the remaining execution time, compute domains (cores, LLC, and graphics engines) are idle, but the system agent (SA) has activity at the display-controller (in package-C8 state) and performs periodic (every few hundreds of microseconds) memory accesses (in package-C2 state). We note that these workloads have nearly the same average power consumption regardless of the TDP of the system. In active and idle states, we assume the same nominal power at all TDPs. We evaluate battery life workloads at a T_j of 50 °C. Fig. 8(c) shows the average (normalized to IVR) power consumption of the five PDNs. We observe that FlexWatts consumes up to 1% more power than MBVR, but 8% to 11% less power than IVR when running the four battery life workloads. I+MBVR consumes up to 6% less average power than IVR and 5% higher average power than FlexWatts.

We conclude that FlexWatts is almost as energy-efficient as both MBVR and LDO and up to 11% more energy-efficient than IVR, for battery life workloads. This is mainly because, in low power states (i.e., package C-states) and the low-frequency active state (i.e., C0_MIN) of the battery life workloads, FlexWatts operates in LDO-Mode, which has better power conversion efficiency than IVR in these low power consumption states, thereby maintaining high power conversion efficiency across battery life workloads.
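The average-power expression used earlier for video playback (a sum of nominal power times residency divided by ETEE over the C0_MIN, C2, and C8 states) can be evaluated with a few lines of code. The sketch below is only illustrative: the function name `average_power` is hypothetical, and the nominal powers, the C2/C8 residency split, and the ETEE values are invented placeholders. The only number taken from the text is the 10% C0_MIN residency of video playback.

```python
def average_power(states):
    """Return the sum of P * R / eta over power states.

    states: iterable of (nominal_power_w, residency, etee) tuples,
    where the residencies must sum to 1.
    """
    states = list(states)
    total_residency = sum(r for _, r, _ in states)
    if abs(total_residency - 1.0) > 1e-9:
        raise ValueError("residencies must sum to 1")
    return sum(p * r / eta for p, r, eta in states)


# Video playback: 10% C0_MIN residency (from the text); the 5%/85% split
# between C2 and C8 and all power values below are assumptions.
NOMINAL = {"C0_MIN": 1.5, "C2": 0.6, "C8": 0.15}      # watts, placeholder
RESIDENCY = {"C0_MIN": 0.10, "C2": 0.05, "C8": 0.85}

# Placeholder per-state ETEE for two PDNs (not values from Fig. 4).
ETEE_IVR = {"C0_MIN": 0.65, "C2": 0.55, "C8": 0.50}
ETEE_LDO = {"C0_MIN": 0.78, "C2": 0.65, "C8": 0.58}


def workload_power(etee):
    return average_power(
        (NOMINAL[s], RESIDENCY[s], etee[s]) for s in NOMINAL
    )


ivr_w = workload_power(ETEE_IVR)
ldo_w = workload_power(ETEE_LDO)
print(f"IVR: {ivr_w:.3f} W, LDO: {ldo_w:.3f} W, ratio: {ldo_w / ivr_w:.2f}")
```

Because the idle states dominate the residency, the ETEE in C8 has the largest influence on the result, which is consistent with the observation that a PDN with better low-power efficiency wins on battery life workloads.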

BOM.

Fig. 8(d) shows the BOM of the five PDNs normalized to IVR for 4 W–50 W TDPs. We make two key observations. 1) FlexWatts and I+MBVR PDNs have comparable cost to IVR. 2) MBVR and LDO have 2.1×–4.2× and 1.6×–3.1× higher BOM, respectively, compared to IVR, across the wide TDP range.

Board Area.

Fig. 8(e) shows the board area of the five PDNs normalized to IVR for the 4 W–50 W TDP range. We make two key observations. 1) FlexWatts and I+MBVR have comparable board area to IVR. 2) MBVR and LDO have 1.5×–4.5× and 1.1×–3.3× higher area, respectively, compared to IVR.

Why does FlexWatts have better BOM and board area than LDO and MBVR?

The advantage of FlexWatts in BOM and board area over MBVR and LDO is due to its reduced maximum current, Icc_max. This happens for two reasons. First, FlexWatts uses a shared voltage regulator for the high power domains (i.e., cores, graphics, and LLC), which enables current sharing between these three domains. Second, FlexWatts has reduced current (by nearly 50%) in IVR-Mode compared to LDO, and as such, the shared VR (between the cores, graphics, and LLC) is designed with a maximum-current level similar to that of IVR. When a high power (and thus high current) workload (e.g., Turbo Boost [98]) is requested, the hybrid PDN switches to IVR-Mode, and thus FlexWatts has comparable maximum current to IVR.

We conclude that FlexWatts provides significant performance and energy improvements with a low BOM and area overhead compared to the state-of-the-art PDN, over a wide power consumption range and a wide variety of workloads.
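A back-of-the-envelope sketch can make the current argument concrete. It rests on two simplifying assumptions that are not stated in the text: an on-chip IVR is modeled as a switching converter with a fixed efficiency, so its input current is load power divided by (efficiency × V_IN), and an LDO is modeled as a series pass device, so its input current is roughly the load current. The 0.9 efficiency, the 0.9 V load voltage, and the 9 W load are placeholder values; only the 1.8 V V_IN level comes from the text.

```python
def input_current_ivr(load_power_w, v_in=1.8, ivr_efficiency=0.9):
    """Current drawn from V_IN when an on-chip IVR steps 1.8 V down.

    Assumes a fixed conversion efficiency (placeholder value).
    """
    return load_power_w / (ivr_efficiency * v_in)


def input_current_ldo(load_power_w, v_load):
    """Current drawn from V_IN through an LDO.

    Assumes the LDO passes the load current through its series switch,
    so input current is approximately load power / load voltage.
    """
    return load_power_w / v_load


load_w = 9.0     # hypothetical compute-domain load in watts
v_load = 0.9     # hypothetical domain voltage in volts

i_ivr = input_current_ivr(load_w)
i_ldo = input_current_ldo(load_w, v_load)
reduction = 1.0 - i_ivr / i_ldo
print(f"IVR-Mode: {i_ivr:.2f} A, LDO: {i_ldo:.2f} A, reduction: {reduction:.0%}")
```

Under these assumed numbers the off-chip VR supplies a little over half the current in IVR-Mode that it would supply to an LDO, which is the same direction and rough magnitude as the "nearly 50%" figure quoted above; the exact ratio depends on the load voltage and the IVR efficiency.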

8. Related Work

To our knowledge, this is the first work to 1) provide a versatile framework, PDNspot, that enables multi-dimensional architecture-level exploration of modern processor power delivery networks (PDNs), and 2) propose a novel adaptive hybrid PDN, FlexWatts, that provides high efficiency and performance in client processors across a wide spectrum of power consumption and workloads, compared to four state-of-the-art PDNs [18, 62, 88, 117], as we demonstrate both qualitatively and quantitatively. We discuss other related works here.

A recent work [76] proposes an adaptive PDN that can dynamically manage on/off-chip VRs in hybrid PDN systems based on the dynamic workload. The proposed solution uses many on-chip and off-chip VRs, and targets many-core systems that are optimized for only a single TDP. Unlike FlexWatts, this solution is not optimized for cost, area, or client (laptop and desktop) systems.

Many existing works investigate the potential of integrated VRs [21, 67, 75, 125, 127, 137]. PowerSoC [127] is an analytical model of a PDN system that includes on-chip VRs, off-chip VRs, and PDN models, providing a platform to evaluate performance and explore the design space of the entire PDN system. The authors show that hybrid PDN architectures with both on-chip and off-chip VRs can achieve a better tradeoff between area and efficiency requirements compared to traditional off-chip paradigms. Haoran et al. [75] compare the characteristics of different PDNs for many-core systems using on-chip and/or off-chip VRs using an analytical model. Yan et al. [137] propose a hybrid PDN that optimizes the area-energy tradeoff to improve the energy-efficiency of multi-core architectures by using several redundant cores powered by dedicated on-chip or off-chip VRs and migrating workloads that can benefit from fast DVFS to cores powered by on-chip VRs. Other works [21, 67, 125] claim that the fully-integrated voltage regulator, first adopted in Intel’s 4th generation Core processors [21], improves performance and increases battery life in client systems. These works have at least one of two main shortcomings. First, several of these prior works [75, 127, 137] are not optimized for three key design parameters for client processors: cost, area, or different TDPs. Second, some works [21, 67, 125] do not address the inefficiencies of the IVR PDN in terms of performance (e.g., at low TDPs) and energy (e.g., for computationally-light workloads), which makes these works inefficient for client processors across a wide power and workload range.

Compared to all aforementioned works, our experimental study 1) models a wide TDP range, showing which PDN is better for high performance and high energy efficiency at each TDP level, and 2) evaluates a wide variety of mobile client system workloads, providing an understanding of which PDN architecture is more efficient for each workload.

9. Conclusion

In this work, we first develop PDNspot, a framework that enables architectural exploration of power delivery network (PDN) architectures with respect to multiple metrics: performance, battery life, BOM and board area. Using PDNspot, we observe multiple energy inefficiencies in the PDNs of recent client processors. We introduce a new power- and workload-aware hybrid PDN, FlexWatts, to improve the performance and energy-efficiency of client processors for a wide power and workload range. We provide a practical implementation of FlexWatts, where we design a mode-switching power management flow that safely switches the hybrid PDN between the two PDN modes, without undesirable voltage noise. We present a new algorithm that automatically switches FlexWatts to the PDN mode that results in the highest energy-efficiency, battery life, and performance. Our evaluations show that FlexWatts provides significant performance and energy improvements with a very small BOM and area overhead compared to the state-of-the-art PDN, over a wide power consumption range and a wide variety of workloads. We hope that our open-source release of PDNspot fills a gap in the space of publicly-available experimental PDN infrastructures and, along with FlexWatts, inspires new studies, ideas, and methodologies in PDN system design.

Acknowledgments

We thank the anonymous reviewers of MICRO 2020 for feedback and the SAFARI group members for feedback and the stimulating intellectual environment they provide.

References

Hot-Chips, 2016.
[9] K. Anshumali, T. Chappell, W. Gomes, J. Miller, N. Kurd, and R. Kumar, “Circuit and Process Innovations to Enable High-Performance, and Power and Area Efficiency on The Nehalem and Westmere Family of Intel Processors.” Intel Technology Journal, 2010.
[10] F. Ardanaz, J. Eastep, and R. Greco, “Hierarchical Autonomous Capacitance Management,” US Patent 10,048,738. Aug. 14 2018.
[11] H. Asghari-Moghaddam, H. R. Ghasemi, A. A. Sinkar, I. Paul, and N. S. Kim, “VR-scale: Runtime Dynamic Phase Scaling of Processor Voltage Regulators for Improving Power Efficiency,” in DAC, 2016.
[12] Y. Bai, V. W. Lee, and E. Ipek, “Voltage Regulator Efficiency Aware Power Management,” ASPLOS, 2017.
[13] BAPCo, “MobileMark 2014,” online, accessed May 2020, https://bapco.com/products/mobilemark-2018.
[14] BAPCo, “SYSmark 2014,” online, accessed May 2020, https://bapco.com/products/sysmark-2014-se/.
[15] N. Beck, S. White, M. Paraschou, and S. Naffziger, “Zeppelin: An SoC for Multichip Architectures,” in ISSCC, 2018.
[16] R. Bertran, A. Buyuktosunoglu, P. Bose, T. J. Slegel, G. Salem, S. Carey, R. F. Rizzolo, and T. Strach, “Voltage Noise in Multi-core Processors: Empirical Characterization and Optimization Opportunities,” in MICRO, 2014.
[17] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan et al., “Google Workloads for Consumer devices: Mitigating Data Movement Bottlenecks,” in ASPLOS, 2018.
[18] T. Burd, N. Beck, S. White, M. Paraschou, N. Kalyanasundharam, G. Donley, A. Smith, L. Hewitt, and S. Naffziger, “Zeppelin: An SoC for Multichip Architectures,” JSSC, 2019.
[19] J. S. Burns, A. V. Choubal, A. Raman, and J. G. Van De Groenendaal, “Method and System for Run-time Reallocation of Leakage Current and Dynamic Power Supply Current,” US Patent 9,335,813. May 10 2016.
[20] B. Burres, J. van de Groenendaal, P. Mosur, J. Robinson, I. Steiner, Y.-F. Liu, S. S. Tan, E. McShane, B. Kuttanna, and S. Lakshmanamurthy, “Intel Atom C2000 Processor Family: Power-efficient Datacenter Processing,” IEEE Micro, 2015.
[21] E. A. Burton, G. Schrom, F. Paillet, J. Douglas, W. J. Lambert, K. Radhakrishnan, and M. J. Hill, “FIVR - Fully Integrated Voltage Regulators on 4th Generation Intel® Core SoCs,” in APEC, 2014.
[22] Y. Çakmak, W. Toms, J. Navaridas, M. Luján et al., “Cyclic Power-gating as an Alternative to Voltage and Frequency Scaling,” CAL, 2015.
[23] R. Cochran, A. N. Nowroz, and S. Reda, “Post-silicon Power Characterization Using Thermal Infrared Emissions,” in ISLPED, 2010.
[24] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le, “RAPL: Memory Power Estimation and Capping,” in ISLPED, 2010.
[25] K. Dev, A. N. Nowroz, and S. Reda, “Power Mapping and Modeling of Multi-core Processors,” in ISLPED, 2013.
[26] J. Doweck, W.-F. Kao, A. K.-y. Lu, J. Mandelblat, A. Rahatekar, L. Rappoport, E. Rotem, A. Yasin, and A. Yoaz, “Inside 6th-Generation Intel Core: New Microarchitecture Code-Named Skylake,” IEEE Micro, 2017.
[27] R. Efraim, R. Ginosar, C. Weiser, and A. Mendelson, “Energy Aware Race to Halt: A Down to EARtH Approach for Platform Energy Management,” CAL
ISSCC, 2016.
[30] E. Fetzer, R. J. Reidlinger, D. Soltis, W. J. Bowhill, S. Shrimali, K. Sistla, E. Rotem, R. Kumar, V. Garg, A. Naveh et al., “Managing Power Consumption in a Multi-core Processor,” US Patent 9,069,555. Jun. 30 2015.
[31] K. Ganesan, J. Jo, W. L. Bircher, D. Kaseridis, Z. Yu, and L. K. John, “System-level Max Power (SYMPO)-A Systematic Approach for Escalating System-level Power Consumption Using Synthetic Benchmarks,” in PACT, 2010.
[32] W. Godycki, C. Torng, I. Bukreyev, A. Apsel, and C. Batten, “Enabling Realistic Fine-grain Voltage Scaling with Reconfigurable Power Distribution Networks,” in MICRO, 2014.
[33] B. Gopireddy and J. Torrellas, “Designing Vertical Processors in Monolithic 3D,” in ISCA, 2019.
[34] C. Gough, I. Steiner, and W. Saunders, “CPU Power Management,” in Energy Efficient Servers: Blueprints for Data Center Optimization. Springer, 2015.
[35] E. Grochowski, D. Ayers, and V. Tiwari, “Microarchitectural Simulation and Control of di/dt-induced Power Supply Voltage Variation,” in HPCA, 2002.
[36] M. S. Gupta, K. K. Rangan, M. D. Smith, G.-Y. Wei, and D. Brooks, “DeCoR: A Delayed Commit and Rollback Mechanism for Handling Inductive Noise in Processors,” in HPCA, 2008.
[37] D. Hackenberg, R. Schöne, T. Ilsche, D. Molka, J. Schuchart, and R. Geyer, “An Energy Efficiency Feature Survey of the Intel Haswell processor,” in IPDPSW, 2015.
[38] J. Haj-Yahya, M. Alser, J. Kim, A. G. Yaglıkçı, N. Vijaykumar, E. Rotem, and O. Mutlu, “SysScale: Exploiting Multi-domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors,” in ISCA, 2020.
[39] J. Haj-Yahya, A. Mendelson, Y. B. Asher, and A. Chattopadhyay, Energy Efficient High Performance Processors: Recent Approaches for Designing Green High Performance Computing. Springer, 2018.
[40] J. Haj-Yahya, A. Mendelson, Y. B. Asher, and A. Chattopadhyay, “Power Management of Modern Processors,” in Energy Efficient High Performance Processors. Springer, 2018.
[41] J. Haj-Yahya, E. Rotem, A. Mendelson, and A. Chattopadhyay, “A Comprehensive Evaluation of Power Delivery Schemes for Modern Microprocessors,” in ISQED, 2019.
[42] J. Haj-Yahya, Y. Sazeides, M. Alser, E. Rotem, and O. Mutlu, “Techniques for Reducing the Connected-Standby Energy Consumption of Mobile Devices,” in HPCA, 2020.
[43] J. Haj-Yihia, “Connected Standby Sleep State,” US Patent 8,458,503. Jun. 4 2013.
[44] J. Haj-Yihia, Y. B. Asher, E. Rotem, A. Yasin, and R. Ginosar, “Compiler-directed Power Management for Superscalars,” TACO, 2015.
[45] J. Haj-Yihia, A. Yasin, Y. B. Asher, and A. Mendelson, “Fine-grain Power Breakdown of Modern Out-of-order Cores and its Implications on Skylake-based Systems,” TACO, 2016.
[46] J. Haj-Yihia, A. Yasin, and Y. Ben-Asher, “DOEE: Dynamic Optimization Framework for Better Energy Efficiency,” in HiPC, 2015.
[47] H. F. Hamann, A. Weger, J. A. Lacey, Z. Hu, P. Bose, E. Cohen, and J. Wakil, “Hotspot-limited Microprocessors: Direct Temperature and Power Distribution Measurements,” JSSC, 2006.
[48] P. Hammarlund, A. J. Martinez, A. A. Bajwa, D. L. Hill, E. Hallnor, H. Jiang, M. Dixon, M. Derr, M. Hunsaker, R. Kumar et al., “Haswell: The Fourth-generation Intel Core Processor,” IEEE Micro, 2014.
[49] P. Hazucha, G. Schrom, J. Hahn, B. A. Bloechel, P. Hack, G. E. Dermer, S. Narendra, D. Gardner, T. Karnik, V. De et al., “A 233-MHz 80%-87% Efficient Four-phase DC-DC Converter Utilizing Air-core Inductors on Package,”

IEEE Micro , 2014.[49] P. Hazucha, G. Schrom, J. Hahn, B. A. Bloechel, P. Hack, G. E. Dermer,S. Narendra, D. Gardner, T. Karnik, V. De et al. , “A 233-MHz 80%-87%Efficient Four-phase DC-DC Converter Utilizing Air-core Inductors onPackage,”

JSSC , 2005.[50] M. Huang, Y. Lu, S.-W. Sin, U. Seng-Pan, and R. P. Martins, “A Fully Inte-grated Digital LDO with Coarse–fine-tuning and Burst-mode Operation,”

TCAS II , 2016.[51] S. Huang, M. Lang, S. Pakin, and S. Fu, “Measurement and Characteriza-tion of Haswell Power and Energy Consumption,” in

E2SC , 2015.[52] W. Huang, J. A. A. Qahouq, and Z. Dang, “CCM–DCM Power-multiplexed Control Scheme for Single-inductor Multiple-output DC–DC Power Converter with no Cross Regulation,”

IAS , 2016.[53] W. Huang, C. Lefurgy, W. Kuk, A. Buyuktosunoglu, M. Floyd, K. Raja-mani, M. Allen-Ware, and B. Brock, “Accurate Fine-grained ProcessorPower Proxies,” in

ISCA , 2012.[54] Intel, “Intel Core i7-5600U Processor,” online, accessed Aug 2020,https://intel.ly/3lJXYD9.[55] Intel, “Intel Core i7-6600U Processor,” online, accessed Aug 2020,https://intel.ly/2EPKPYz.

[56] Intel, “Intel Core m5-6Y57 Processor,” online, accessed Aug 2020, https://intel.ly/2Dm5RgV.

[57] Intel, “Intel® Core™ i7-6700K Processor,” online, accessed Aug 2020, https://intel.ly/3lCHv3T.

[58] Intel, “Voltage Regulator-Down 11.1: Processor Power Delivery Design Guide,” online, accessed Aug 2020, https://intel.ly/2YUX3pW.

[59] Intel, “Module, Voltage Regulator and Enterprise Voltage Regulator-Down (EVRD) 11.1 Design Guidelines,” 2009.

[60] Intel, “Intel 64 and IA-32 Architectures Optimization Reference Manual,” 2016.

[61] Intel, “Icelake, 10th Generation Intel® Core™ Processor Families,” https://intel.ly/3frvxpK. July 2019.

[62] Intel, “Skylake-X, 6th Generation Intel Core X-series Processors Families,” https://intel.ly/30SP8uX. July 2019.

[63] S. Jahagirdar, V. George, I. Sodhi, and R. Wells, “Power Management of the Third Generation Intel Core Micro Architecture Formerly Code-named Ivy Bridge,” in Hot-Chips, 2012.

[64] R. Jakushokas, M. Popovich, A. V. Mezhiba, S. Köse, and E. G. Friedman, Power Distribution Networks with On-chip Decoupling Capacitors. Springer, 2010.

[65] JEDEC, “Low Power Double Data Rate 3 (LPDDR3),” Standard No. JESD209-3B, 2013.

[66] J. Jiao, M. M. Tseng, Q. Ma, and Y. Zou, “Generic bill-of-materials-and-operations for High-variety Production Management,” Concurrent Engineering, 2000.

[67] D. Kanter, “Haswell FIVR Extends Battery Life,” Microprocessor Report, The Linley Group, 2013.

[68] M. K. Kazimierczuk, Pulse-width Modulated DC-DC Power Converters. John Wiley & Sons, 2015.

[69] Keysight, “Keysight N6705B DC Power Analyzer,” online, accessed June 2019, https://bit.ly/2MZ9Hhv.

[70] Keysight, “Keysight Source Measure Units Power Modules,” online, accessed June 2019, https://bit.ly/38kkStt.

[71] Keysight, “Voltage Regulator Efficiency Testing Using the N6782A Source Measure Unit,” online, accessed April 2020, https://community.keysight.com/thread/18983.

[72] S. K. Khatamifard, L. Wang, W. Yu, S. Köse, and U. R. Karpuzcu, “ThermoGater: Thermally-aware On-chip Voltage Regulation,” in ISCA, 2017.

[73] W. Kim, M. S. Gupta, G.-Y. Wei, and D. Brooks, “System Level Analysis of Fast, Per-core DVFS Using On-chip Switching Regulators,” in ISCA, 2008.

[74] J. Leng, Y. Zu, and V. J. Reddi, “GPU Voltage Noise: Characterization and Hierarchical Smoothing of Spatial and Temporal Voltage Noise Interference in GPU Architectures,” in HPCA, 2015.

[75] H. Li, X. Wang, J. Xu, Z. Wang, R. K. Maeda, Z. Wang, P. Yang, L. H. Duong, and Z. Wang, “Energy-efficient Power Delivery System Paradigms for Many-core Processors,” TCAD, 2017.

[76] H. Li, J. Xu, Z. Wang, R. K. Maeda, P. Yang, and Z. Tian, “Workload-aware Adaptive Power Delivery System Management for Many-core Processors,” TCAD, 2018.

[77] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: an Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures,” in MICRO, 2009.

[78] H. Linda, “Dynamic Intelligent Allocation and Utilization of Package Maximum Operating Current Budget,” US Patent App. 13/539,411. Jan. 2 2014.

[79] K. Luria, J. Shor, M. Zelikson, and A. Lyakhov, “Dual-mode Low-drop-out Regulator/power Gate with Linear and On–Off Conduction for Microprocessor Core On-die Supply Voltages in 14 nm,” JSSC, 2016.

[80] R. M. Ma, C. Forbell, S. Soe, and J. Haj-Yihia, “Maximum Current Throttling,” US Patent App. 13/537,319. Jan. 2 2014.

[81] P. Magarshack and P. G. Paulin, “System-on-chip Beyond the Nanometer Wall,” in DAC, 2003.

[82] A. Mazouz, A. Laurent, B. Pradelle, and W. Jalby, “Evaluation of CPU Frequency Transition Latency,” CSRD, 2014.

[83] P. Meinerzhagen, C. Tokunaga, A. Malavasi, V. Vaidya, A. Mendon, D. Mathaikutty, J. Kulkarni, C. Augustine, M. Cho, S. Kim et al., “An Energy-efficient Graphics Processor Featuring Fine-grain DVFS with Integrated Voltage Regulators, Execution-unit Turbo, and Retentive Sleep in 14nm Tri-gate CMOS,” in ISSCC, 2018.

[84] T. N. Miller, R. Thomas, X. Pan, and R. Teodorescu, “VRSync: Characterizing and Eliminating Synchronization-induced Voltage Emergencies in Many-core Processors,” in ISCA, 2012.

[85] R. J. Milliken, J. Silva-Martínez, and E. Sánchez-Sinencio, “Full On-chip CMOS Low-dropout Voltage Regulator,” TCAS I, 2007.

[86] S. Naffziger, “Integrated Power Conversion Strategies Across Laptop Server and Graphics Products,” in PwrSoC, 2016.

[87] L. W. Nagel and D. Pederson, “SPICE (Simulation Program with Integrated Circuit Emphasis),” EECS Department, University of California, Berkeley, Tech. Rep., 1973.

[88] A. Nalamalpu, N. Kurd, A. Deval, C. Mozak, J. Douglas, A. Khanna, F. Paillet, G. Schrom, and B. Phelps, “Broadwell: A Family of IA 14nm Processors,” in VLSI Circuits, 2015.

[89] V. P. Nikolskiy, V. V. Stegailov, and V. S. Vecher, “Efficiency of the Tegra K1 and X1 Systems-on-chip for Classical Molecular Dynamics,” in HPCS, 2016.

[90] S. Pal, D. Petrisko, M. Tomei, P. Gupta, S. S. Iyer, and R. Kumar, “Architecting Waferscale Processors-A GPU Case Study,” in HPCA, 2019.

[91] G. Papadimitriou, A. Chatzidimitriou, and D. Gizopoulos, “Adaptive Voltage/Frequency Scaling and Core Allocation for Balanced Energy and Performance on Multicore CPUs,” in HPCA, 2019.

[92] F. Paterna and T. Š. Rosing, “Modeling and Mitigation of Extra-SoC Thermal Coupling Effects and Heat Transfer Variations in Mobile Devices,” in ICCAD, 2015.

[93] D. J. Perreault, J. Hu, J. M. Rivas, Y. Han, O. Leitermann, R. C. Pilawa-Podgurski, A. Sagneri, and C. R. Sullivan, “Opportunities and Challenges in Very High Frequency Power Conversion,” in APEC, 2009.

[94] Qualcomm Technologies, “Qualcomm Snapdragon 410E Processor Device Specification,” online, accessed Aug 2020, https://bit.ly/2xsSB7m.

[95] V. J. Reddi, M. S. Gupta, G. Holloway, G.-Y. Wei, M. D. Smith, and D. Brooks, “Voltage Emergency Prediction: Using Signatures to Reduce Operating Margins,” in HPCA, 2009.

[96] V. J. Reddi, S. Kanev, W. Kim, S. Campanoni, M. D. Smith, G.-Y. Wei, and D. Brooks, “Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors Via Software-guided Thread scheduling,” in MICRO, 2010.

[97] E. Rotem, A. Naveh, D. Rajwan, A. Ananthakrishnan, and E. Weissmann, “Power Management Architecture of the 2nd Generation Intel Core Microarchitecture, Formerly Codenamed Sandy Bridge,” in Hot-Chips, 2011.

[98] E. Rotem, “Intel Architecture, Code Name Skylake Deep Dive: A New Architecture to Manage Power Performance and Energy Efficiency,” in Intel Developer Forum, 2015.

[99] E. Rotem, R. Ginosar, A. Mendelson, and U. C. Weiser, “Power and Thermal Constraints of Modern System-on-Chip Computer,” in THERMINIC, 2013.

[100] E. Rotem, A. Mendelson, R. Ginosar, and U. Weiser, “Multiple Clock and Voltage Domains for Chip Multi Processors,” in MICRO, 2009.

[101] E. Rotem, A. Naveh, A. Ananthakrishnan, E. Weissmann, and D. Rajwan, “Power-management Architecture of the Intel Microarchitecture Code-named Sandy Bridge,” IEEE Micro, 2012.

[102] E. Rotem, N. Rosenzweig, D. Rajwan, N. Shulman, G. Leibovich, T. Ziv, A. Gabai, J. P. Rodriguez, and J. A. Carlson, “System Maximum Current Protection,” US Patent 9,477,243. Oct. 25 2016.

[103] S. Rusu, H. Muljono, D. Ayers, S. Tam, W. Chen, A. Martin, S. Li, S. Vora, R. Varada, and E. Wang, “5.4 Ivytown: A 22nm 15-Core Enterprise Xeon® Processor Family,” in ISSCC, 2014.

[104] SAFARI Research Group, “PDNspot — GitHub Repository,” https://github.com/CMU-SAFARI/PDNspot.

[105] R. Schöne, T. Ilsche, M. Bielert, A. Gocht, and D. Hackenberg, “Energy Efficiency Features of the Intel Skylake-SP Processor and Their Impact on Performance,” arXiv, 2019.

[106] R. Schöne, D. Molka, and M. Werner, “Wake-up Latencies for Processor Idle States on Current x86 Processors,” CSRD, 2015.

[107] N. Shaker, M. Shaker, and J. Togelius, “Evolving Playable Content for Cut the Rope Through a Simulation-based Approach,” in AIIDE, 2013.

[108] M. Shevgoor, J.-S. Kim, N. Chatterjee, R. Balasubramonian, A. Davis, and A. N. Udipi, “Quantifying the Relationship Between the Power Delivery Network and Architectural Policies in a 3D-stacked Memory Device,” in MICRO, 2013.

[109] C. Shi, B. C. Walker, E. Zeisel, B. Hu, and G. H. McAllister, “A Highly Integrated Power Management IC for Advanced Mobile Applications,” JSSC, 2007.

[110] J. J. Shrall, S. H. Gunther, K. V. Sistla, R. D. Wells, and S. M. Conrad, “Controlling Configurable Peak Performance Limits of a Processor,” US Patent 9,671,854. Jun. 6 2017.

[111] T. Singh, S. Rangarajan, D. John, C. Henrion, S. Southard, H. McIntyre, A. Novak, S. Kosonocky, R. Jotwani, A. Schaefer et al., “3.2 Zen: A Next-Generation High-Performance x86 Core,” in ISSCC, 2017.

[112] T. Singh, A. Schaefer, S. Rangarajan, D. John, C. Henrion, R. Schreiber, M. Rodriguez, S. Kosonocky, S. Naffziger, and A. Novak, “Zen: An Energy-Efficient High-Performance x86 Core,” JSSC, 2018.

[113] A. A. Sinkar, H. R. Ghasemi, M. J. Schulte, U. R. Karpuzcu, and N. S. Kim, “Low-cost Per-core Voltage Domain Support for Power-constrained High-performance Processors,” TVLSI

HPCA, 2017.

[116] M. Swaminathan and E. Engin, Power Integrity Modeling and Design for Semiconductors and Systems. Pearson Education, 2007.

[117] S. M. Tam, H. Muljono, M. Huang, S. Iyer, K. Royneogi, N. Satti, R. Qureshi, W. Chen, T. Wang, H. Hsieh et al., “SkyLake-SP: A 14nm 28-Core Xeon® Processor,” in ISSCC, 2018.

[118] Texas Instruments, “DC to DC switching regulators,” online, accessed March 2018, https://bit.ly/32T8zTE.

[119] R. Thomas, K. Barber, N. Sedaghati, L. Zhou, and R. Teodorescu, “Core Tunneling: Variation-aware Voltage Noise Mitigation in GPUs,” in HPCA, 2016.

[120] Z. Toprak-Deniz, M. Sperling, J. Bulzacchelli, G. Still, R. Kruse, S. Kim, D. Boerstler, T. Gloekler, R. Robertazzi, K. Stawiasz et al., “5.2 Distributed System of Digitally Controlled Microregulators Enabling Per-core DVFS for the POWER8 Microprocessor,” in ISSCC, 2014.

[121] S. Tu, “Atom-x5/x7 series processor, codenamed cherry trail,” in Hot-Chips, 2015.

[122] H. Usui, L. Subramanian, K. K.-W. Chang, and O. Mutlu, “DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators,” TACO, 2016.

[123] I. Vaisband and E. G. Friedman, “Heterogeneous Methodology for Energy Efficient Distribution of On-chip Power Supplies,” TPEL

ISLPED, 2015.

[126] V. Vogman, “Method and Apparatus for Precision CPU Maximum Power Detection,” US Patent 9,874,927. Jan. 23 2018.

[127] X. Wang, J. Xu, Z. Wang, K. J. Chen, X. Wu, Z. Wang, P. Yang, and L. H. Duong, “An Analytical Study of Power Delivery Systems for Many-core Processors Using On-chip and Off-chip Voltage Regulators,” TCAD

et al., “Characterization of Micro-bump C4 Interconnects for Si-carrier SOP Applications,” in ECTC, 2006.

[136] Q. Xie, M. J. Dousti, and M. Pedram, “Therminator: a Thermal Simulator for Smartphones Producing Accurate Chip and Skin temperature Maps,” in ISLPED, 2014.

[137] G. Yan, Y. Li, Y. Han, X. Li, M. Guo, and X. Liang, “AgileRegulator: A Hybrid Voltage Regulator Scheme Redeeming Dark Silicon for Power Efficiency in a Multicore Architecture,” in HPCA, 2012.

[138] G. Yan, X. Liang, Y. Han, and X. Li, “Leveraging the Core-level Complementary Effects of PVT Variations to Reduce Timing Emergencies in Multi-core Processors,” in ISCA, 2010.

[139] A. Yasin, N. Rosenzweig, E. Weissmann, and E. Rotem, “Performance Scalability Prediction,” US Patent 9,829,957. Nov. 28 2017.

[140] H. Zhang, P. V. Rengasamy, S. Zhao, N. C. Nachiappan, A. Sivasubramaniam, M. T. Kandemir, R. Iyer, and C. R. Das, “Race-to-sleep + Content Caching + Display Caching: a Recipe for Energy-efficient Video Streaming on Handhelds,” in ISCA, 2017.

[141] R. Zhang, K. Wang, B. H. Meyer, M. R. Stan, and K. Skadron, “Architecture Implications of Pads as a Scarce Resource,” in ISCA, 2014.

[142] A. Zou, J. Leng, X. He, Y. Zu, C. D. Gill, V. J. Reddi, and X. Zhang, “Voltage-Stacked GPUs: A Control Theory Driven Cross-Layer Solution for Practical Voltage Stacking in GPUs,” in MICRO, 2018.

[143] A. Zou, J. Leng, Y. Zu, T. Tong, V. J. Reddi, D. Brooks, G.-Y. Wei, and X. Zhang, “Ivory: Early-stage Design Space Exploration Tool for Integrated Voltage Regulators,” in DAC, 2017.