Toward an End-to-End Auto-tuning Framework in HPC PowerStack
Xingfu Wu, Aniruddha Marathe, Siddhartha Jana, Ondrej Vysocky, Jophin John, Andrea Bartolini, Lubomir Riha, Michael Gerndt, Valerie Taylor, Sridutt Bhalachandra
Xingfu Wu
Argonne National Laboratory / The University of Chicago, USA. Email: [email protected]
Aniruddha Marathe
Lawrence Livermore National Laboratory, Livermore, CA, USA. Email: [email protected]
Siddhartha Jana
Intel Corp., USA. Email: [email protected]
Ondrej Vysocky
IT4Innovations National Supercomputing Center, Czech Republic. Email: [email protected]
Jophin John
Technical University of Munich, Munich, Germany. Email: [email protected]
Andrea Bartolini
University of Bologna, Bologna, Italy. Email: [email protected]
Lubomir Riha
IT4Innovations National Supercomputing Center, Czech Republic. Email: [email protected]
Michael Gerndt
Technical University of Munich, Munich, Germany. Email: [email protected]
Valerie Taylor
Argonne National Laboratory / The University of Chicago, USA. Email: [email protected]
Sridutt Bhalachandra
Lawrence Berkeley National Laboratory, USA. Email: [email protected]
Abstract—Efficiently utilizing procured power and optimizing the performance of scientific applications under power and energy constraints are challenging. The HPC PowerStack defines a software stack to manage the power and energy of high-performance computing systems and standardizes the interfaces between different components of the stack. This survey paper presents the findings of a working group focused on the end-to-end tuning of the PowerStack. First, we provide a background on the PowerStack layer-specific tuning efforts in terms of their high-level objectives, constraints and optimization goals, layer-specific telemetry, and control parameters, and we list the existing software solutions that address those challenges. Second, we propose the PowerStack end-to-end auto-tuning framework, identify the opportunities in co-tuning different layers in the PowerStack, and present specific use cases and solutions. Third, we discuss the research opportunities and challenges for collective auto-tuning of two or more management layers (or domains) in the PowerStack. This paper takes the first steps in identifying and aggregating the important R&D challenges in streamlining the optimization efforts across the layers of the PowerStack.
1. Introduction
As we enter the exascale computing era, power and energy management are key design points and constraints for any next generation of supercomputers [2]. Efficiently utilizing procured power and optimizing the performance of scientific applications under power and energy constraints are challenging for several reasons, including dynamic phase behavior, manufacturing variation, and increasing system-level heterogeneity. While several individual techniques have been proposed for the automatic and efficient management of power and energy, the majority of these techniques have been devised to meet the needs of a specific high-performance computing (HPC) center or specific optimization goals. Specifications such as PowerAPI [14], [15], IPMI [17], and Redfish [8] provide high-level power management interfaces for accessing power knobs. A recent survey [22] conducted by the EEHPC WG [9] concluded that the majority of such techniques lack the application awareness required to achieve the best system performance and throughput. Furthermore, each technique tends to improve the management of power and energy for a different subset of the site or system hardware and at different (and often conflicting) granularities. Unfortunately, the existing techniques have not been designed to coexist simultaneously on one site and cooperate on management in a streamlined fashion.

To address these gaps, the HPC community needs a holistic stack for power and energy management. The HPC PowerStack [2], [16] started in May 2018 as a working group to gather the experience of active developers in industry, computing centers, and academia in building software interfaces and solutions for handling and optimizing the power and energy consumption of HPC systems in production. PowerStack defines a software stack that manages the power and energy of HPC systems and standardizes the interfaces between different levels of software components in the stack. One key aspect of PowerStack is to define the vision of a holistic power and energy management stack that is extensible by design and capable of optimizing the target power- or energy-efficiency application-aware metric, so that it can trade off power, energy, and time to solution in order to optimize the efficiency of an HPC application. Its second aspect is to define a standard interface for interacting with optimization software and hardware knobs across different vendor HPC systems. Based on the state of the art of the components available in the community for power and energy management [8], [9], [14], [15], [17], [22], a hierarchical straw-man PowerStack design [1], [2] was proposed to manage power and energy at three levels of granularity: the system level, the job level, and the node level. This implies the need to put the following in place incrementally:

• Define a site-level requirement, a power-aware system Resource Manager (RM) / job scheduler, a power-aware job-level manager, and a power-aware node manager.
• Define the interfaces between these layers to translate objectives at each layer into actionable items at the adjacent lower layer.
• Drive end-to-end optimization across the different layers of the PowerStack.

To address this need, we formed a PowerStack End-to-End Auto-tuning Working Group in 2019. A plethora of literature on power-aware tuning exists, including notable works by the members of this working group.
These efforts include system and hardware tuning, tuning with runtime systems, application-level tuning, compiler-level parameter tuning, loop-level parameter tuning, and deep learning-based hyperparameter tuning.

A primary limitation of most, if not all, of these efforts is that the tuning research has been limited solely to individual layers of the PowerStack. The opportunities for further gains in power efficiency from collectively tuning two or more layers of the PowerStack have largely remained untapped. The goal of this paper is to explore those untapped opportunities by addressing the following specific questions.

• Opportunity analysis: How do we quantify the potential benefits of end-to-end auto-tuning across the different layers of the PowerStack? What experimentation is required to achieve a baseline quantification of the benefits of end-to-end auto-tuning?
• Identification of high-level challenges: What are the high-level research questions to be explored in the end-to-end auto-tuning of the PowerStack? What engineering solutions and research approaches are needed to address these questions?
• Interaction of existing layer-specific tuning: Based on the conceptual diagram of the PowerStack, what interactions are required across the layers of the PowerStack with existing layer-specific tuning approaches as a precursor to end-to-end auto-tuning?
• Extension toward end-to-end auto-tuning: How do we combine the existing auto-tuning approaches to develop comprehensive end-to-end auto-tuning solutions for the high-level power and energy goals?

To our knowledge, this paper is the first attempt at identifying and aggregating the important R&D challenges in streamlining the interoperation across the layers of the PowerStack. We present the important high-level questions, concrete ideas, and ongoing efforts discussed by the members of the PowerStack working group. The remainder of this paper is organized as follows. Section 2 surveys the layer-specific tuning efforts in terms of their high-level objectives; discusses the target metrics, layer-specific control parameters, and specific methods; and lists the existing software components. Section 3 proposes the PowerStack end-to-end auto-tuning framework, identifies the opportunities in collective tuning (henceforth co-tuning) of different layers in the PowerStack, and presents specific use cases. Section 4 identifies further opportunities and open challenges for co-tuning of two or more management layers (or domains) in the PowerStack. Section 5 summarizes our conclusions.
2. Survey of PowerStack Layer-Specific Tuning
In this section, we first describe the high-level objectives of the existing layer-specific tuning approaches at the different layers of the PowerStack: system (i.e., cluster), job/application, and node. Next, we outline the target metrics for the existing tuning approaches. We then present the layer-specific control parameters, telemetry, and specific methods used to accomplish the objectives at the individual layers.
The common objective of each layer of the PowerStack is to operate within the power constraints or energy goals assigned by the upper layer. A power constraint is applied and measured over a time window. An energy goal is assigned and measured over the job execution or system uptime. The smallest supported time window is defined by what can be supported at each layer and what is acceptable to the upper layer. Along with the primary objective of adhering to a power constraint, the following secondary metrics are targeted:

• Power-constrained performance optimization
• Performance-constrained energy optimization, RM-brokered SLA-compliant performance
• Guaranteed rate of change, or lower and upper bounds on power usage in the system in a specified time window
• Thermal-constrained performance optimization

Some of the secondary metrics monitored and affected in this process are as follows:

• System utilization, resource utilization
• Thermal metrics: ambient temperature, water temperature
• Job turnaround time, queuing delay, throughput
• Reduced memory footprint, reduced data movement and I/O footprint

TABLE 1. Survey of parameters and methods used by the layers of the PowerStack
The objectives at the different layers of the PowerStack can be realized by using measured or derived metrics at those layers as follows:

• Job-level power (watts) or energy (watt-hours or joules) usage
• Execution time (seconds/minutes/hours)
• Operating frequency (Hz)
• Performance (FLOPS, IPC, IPS)
• Power efficiency (FLOPS/watt, IPC/watt)
• Energy efficiency (ED, ED^2, FLOPS/joule, IPC/joule; the energy-delay products are defined after this list)
• Node utilization (% of time in use, % of resource in use)
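For clarity, the energy-delay metrics in the list above can be written in terms of the measured energy to solution E (joules) and time to solution T (seconds). The formulation below follows the common definition of these products; the work term W_flop is simply the total floating-point operation count.

```latex
\begin{align*}
\mathrm{ED}   &= E \cdot T,   && \text{energy-delay product (lower is better)} \\
\mathrm{ED^2} &= E \cdot T^2, && \text{weights time to solution more heavily} \\
\mathrm{FLOPS/joule} &= W_{\mathrm{flop}} / E, && \text{useful work per unit energy (higher is better)}
\end{align*}
```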
The objectives described above are realized by each layer by managing a set of available controls provided by the adjacent lower layer. The parameters are tuned through the available methods provided by the hardware or are indirectly managed by the lower layers (runtime, node-level manager, system software, etc.). We describe the parameters and methods used by the individual layers of the PowerStack at the system, job/runtime, application, and node levels, as shown in Table 1.
TABLE 2. Existing tools/solutions at each layer of the PowerStack
In Table 2 we list several existing software components at the four layers: resource manager / job scheduler, job-level runtime system, node-level management, and application-level tuning. By integrating appropriate software components from each layer, we can perform the proposed end-to-end auto-tuning in the PowerStack.
3. End-to-End Auto-Tuning Framework
Before we delve into end-to-end tuning of the PowerStack, we define the term tuning. At a high level, tuning is the process of improving the target metric through better handling of available control parameters and configuration options without violating operating constraints (if any). The process of tuning in the layers of the PowerStack (a) typically targets performance or power efficiency as the primary metric, (b) complies with the operating power constraint imposed on the layer, and (c) attempts to improve the management and orchestration of the available control parameters that affect the application and/or hardware performance. In this process, the other layers are either treated as black boxes or are ignored altogether in order to keep the research problem tractable. Extending this definition, we define co-tuning as the process of improving the target metrics of two or more layers of the PowerStack by incorporating cross-layer characteristics in the orchestration process. End-to-end auto-tuning aims to perform holistic co-tuning of all layers of the PowerStack.

In 2019, the PowerStack community sketched a schematic diagram outlining the different components of a power management stack, shown in Figure 1. A site has one or more HPC systems, site policies, and a power budget. Each system is constrained under a derived system-level power budget. For end-to-end auto-tuning in the PowerStack, we focus on tuning at the system, job/application, and node levels. We propose a high-level overview of the end-to-end auto-tuning framework (orange boxed portion) in Figure 1. We describe the knobs at each layer, which control knobs can be modified in the temporal and spatial dimensions, who controls the knobs (actors), and which metrics can be measured. We define tunable parameters at each layer and then discuss how to auto-tune the combination of different parameters at the distinct layers (the parameter space) for an optimal solution (the smallest runtime, the lowest power, or the lowest energy) under a system power cap.

Figure 1. High-level overview of the end-to-end auto-tuning framework (orange boxed portion)

The traditional PowerStack design has focused largely on the engineering challenges in standardization and deployment. In contrast, this paper focuses on extending the current design of the PowerStack to address novel research challenges in end-to-end auto-tuning of the PowerStack. Specifically, we extend the traditional PowerStack model by considering two additional, largely static layers: 1) Application: we consider the application as its own auto-tuning layer; and 2) System software: we consider system software such as the compiler toolchain, system-level dependencies such as MPI, OpenMP, and thread-management libraries, and other external entities that play an important role in realizing the PowerStack but have no direct interfaces in the traditional design.
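To make the framework concrete, the following minimal sketch shows the shape of such an end-to-end search: a cross-layer configuration space spanning resource-manager, job-runtime, and node knobs is explored for the lowest-energy configuration that respects a system power cap. All names, values, and the toy performance model inside run_job() are illustrative assumptions, not part of any existing framework.

```python
import itertools
import random

SPACE = {
    "nodes":        [16, 32, 64],        # resource-manager knob (spatial)
    "node_power_w": [200, 250, 300],     # job-level runtime knob (per-node cap)
    "cpu_freq_ghz": [1.8, 2.1, 2.4],     # node-level knob
}
SYSTEM_POWER_CAP_W = 12_000              # system-level constraint

def run_job(cfg):
    """Stand-in for launching the job and collecting telemetry."""
    work = 1.0e6                                           # abstract work units
    rate = cfg["nodes"] * cfg["cpu_freq_ghz"]              # toy performance model
    runtime_s = work / rate * random.uniform(0.95, 1.05)   # measurement noise
    avg_power_w = 0.9 * cfg["nodes"] * cfg["node_power_w"]
    return runtime_s, avg_power_w

best = None
for values in itertools.product(*SPACE.values()):
    cfg = dict(zip(SPACE, values))
    if cfg["nodes"] * cfg["node_power_w"] > SYSTEM_POWER_CAP_W:
        continue                          # infeasible under the system cap
    runtime_s, avg_power_w = run_job(cfg)
    energy_j = runtime_s * avg_power_w    # objective: lowest energy
    if best is None or energy_j < best[0]:
        best = (energy_j, cfg)

print("best energy (J):", round(best[0]), "config:", best[1])
```

In practice the exhaustive product would be replaced by a model-based or online search, and run_job() by the RM, runtime, and node interfaces surveyed in Section 2.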
In this section, we survey the opportunities in collective auto-tuning of two or more management layers in the PowerStack. The goal of this survey is to find potential areas for research and to prepare a list of broad research questions that the PowerStack community is collectively interested in tackling. An outcome of this survey would be a set of research activities that the PowerStack community can collaboratively participate in, depending on the area of expertise.

Before we discuss the co-tuning opportunities for individual layers, we define the important terms used in the rest of this paper. These terms are listed in Table 3. Figure 2 shows a high-level overview of the placement of the Resource Manager, Job, Runtime System, and Application in the PowerStack, and the interaction between the layers (orange and green lines).
The objective of this co-tuning is to explore and leverage opportunities in the simultaneous tuning of the resource manager with empirical or online knowledge of the dynamic behavior of a power-aware runtime system, subject to a target power constraint or energy efficiency metric. We consider two directions of interaction in this section: from the resource manager (RM) to the runtime system and vice versa. Several types of interactions between the RM and the runtime system can occur, as outlined below.

TABLE 3. Definitions of terms

Figure 2. Placement of Resource Manager, Job, Runtime System, and Application (orange lines)
Static interaction: These interactions define the management decisions made by the resource manager at job launch.

• How many nodes to use (for moldable jobs). The user provides a minimum and a maximum number of nodes the job can use.
• Which nodes (or compute resources) to select for job launch, for managing inefficiencies in the system such as thermal hot spots and processor manufacturing variation.
• Which job to run (or backfill) from the job queue.
• Which binary dependencies to pick given the situation on the cluster.
Dynamic interaction: These interactions define the management decisions made by the resource manager during job runtime.

• How much power to reassign to a running job (reduce or increase).
• Which job to pause (run queue) or restart (wait queue), if supported by the job.
• Whether the resource manager or the runtime system can leverage job malleability. Some resource managers and runtime systems already leverage malleability at the thread level through concurrency throttling.
• Job relaunch. Some resource managers may explore just-in-time (JIT) compilation of the application to relaunch the job, either through checkpoint-restart in the same job or through pause/resume over different allocations.

Another way to categorize the interactions between the resource manager and the runtime system is based on job awareness.

• Job-aware interactions: These are interactions between the resource manager and runtime system that take job behavior into account when applying power management decisions. The job awareness is based either on an empirical profile of the application or on runtime telemetry collected from the application.
• Job-agnostic interactions: These interactions between the resource manager and runtime system do not take job behavior into account. They are primarily interactions that are agnostic (or transparent) to the application itself.

The interaction from the resource manager to the runtime system may occur through reassignment of power controls and reporting of degradation in performance or efficiency observed at the system level with a heartbeat signal. The interaction from the runtime system to the resource manager may occur through reporting of job-level power usage, requests for additional power or returning unused power, and non-power-related controls that indirectly affect job and system power efficiency. The potential metrics by which the opportunity can be measured are system-level energy efficiency and system-level job throughput.
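As a concrete illustration of the dynamic, job-aware direction, the sketch below mimics one RM decision step: job runtimes report power telemetry, the RM reclaims slack from jobs that are not using their budget, and pending requests for additional power are granted from the reclaimed pool. The Job fields and the rebalance() heuristic are illustrative assumptions, not an existing scheduler interface.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    power_cap_w: float        # current job-level power budget
    measured_w: float         # telemetry reported by the job runtime
    requested_extra_w: float  # runtime's pending request for more power

def rebalance(jobs, slack_threshold_w=50.0):
    """One RM decision step: reclaim slack, then grant pending requests."""
    pool_w = 0.0
    for job in jobs:                      # runtime -> RM: unused power
        slack = job.power_cap_w - job.measured_w
        if slack > slack_threshold_w:
            job.power_cap_w -= slack      # lower the job's cap to its usage
            pool_w += slack
    for job in sorted(jobs, key=lambda j: -j.requested_extra_w):
        grant = min(job.requested_extra_w, pool_w)
        job.power_cap_w += grant          # RM -> runtime: new power cap
        pool_w -= grant
    return pool_w                         # power returned to the system pool

jobs = [Job("lulesh", 2000.0, 1600.0, 0.0), Job("hypre", 1500.0, 1490.0, 300.0)]
leftover_w = rebalance(jobs)
for j in jobs:
    print(f"{j.name}: cap {j.power_cap_w:.0f} W")
print(f"returned to system: {leftover_w:.0f} W")
```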
The objectives of this co-tuning are to assign resources to the job, control their consumption, select heuristics to maximize power efficiency at the system level, and select optimal application control parameters at application launch. Note that the job-level runtime is either absent or agnostic to the co-tuning of the resource manager, the application, and the system software.

The goal is to understand what opportunities exist for tuning two interactions in this co-tuning space: (a) tuning the control loop in the RM by making it aware of application characteristics through empirical data or online monitoring and (b) requesting changes to resource assignments by the application for further tuning of control parameters within the application. The objective is to maximize the performance of all applications on the system under a system-wide power limit by maximizing per-job power efficiency (minimum performance impact).

We assume that in this interaction, the control and telemetry information to and from the application is limited to application-centric data. This does not include power-related controls and telemetry as noted in the definitions, since the application itself is not involved in power management decisions. The interaction from the resource manager to the application involves the following: 1) the power and/or energy budget, if the application understands the metric; 2) power efficiency translated into an application-level metric such as watts per timestep (although this will be application-specific and will not scale to all applications); 3) the number of resources; 4) other operating constraints, from the job to the managed resource; 5) power/frequency/concurrency control; 6) application control parameters; and 7) application launch parameters. Interaction from the application to the resource manager may occur in terms of reported metrics describing application progress. The required interfaces between the resource manager and the application include power consumption monitoring, power limit specification, the expected efficiency metric to monitor from the job-level runtime to the RM, and the expected efficiency metric to monitor from the application to the job-level runtime.
The objective is to fine-tune system parameters, software, and application parameters at job launch as well as at runtime in order to maximize job power efficiency under the power budget. The runtime system may leverage JIT compilation of the application with static actions such as the amount of required computational resources.
The objective of this co-tuning space is to explore opportunities for how the resource manager, runtime system, and application (along with system software) can be co-tuned to maximize application performance under a power constraint. The interactions are the required interfaces for interaction across the three layers. The interfaces must be defined to answer the following questions: 1) What interfaces are needed to translate high-level targets at the resource manager into targets for the job-level runtime system and the application? and 2) What telemetry interfaces need to be provided from the application to the job runtime and from the runtime system to the resource manager? The discussions of the pairwise co-tuning processes described previously cover the interfaces required in this co-tuning process.

For example, a target metric of throughput under a system-level power constraint at the resource manager level needs to be translated into power efficiency targets or total runtimes of individual jobs managed by the job-level runtime system subject to a job-level power constraint. This in turn must be translated into improvements in the calculations per simulation step per watt at the application level.
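A minimal sketch of this top-down translation, under the assumption that each job's runtime reports a scalar efficiency metric (e.g., simulation steps per second per watt): the system-level budget is divided among jobs in proportion to reported efficiency, yielding the job-level power constraints that the job runtimes then enforce. The reported numbers are purely illustrative.

```python
def translate_budget(system_budget_w, job_efficiency):
    """Split a system power budget into job-level budgets in proportion
    to each job's reported efficiency (more efficient jobs get more power)."""
    total = sum(job_efficiency.values())
    return {name: system_budget_w * eff / total
            for name, eff in job_efficiency.items()}

# Hypothetical efficiency reports (steps/s/W) from the job-level runtimes.
reported = {"qmcpack": 0.8, "nekbone": 0.5, "hypre": 0.7}
print(translate_budget(20_000, reported))   # job-level caps summing to 20 kW
```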
In this section, we discuss seven specific use cases and solutions for collective tuning of the resource manager, runtime, application, and node layers.
This use case describes the co-tuning of the resource manager, runtime system, and application. The target application is a 27-point Laplacian problem implemented as part of the test program shipped with the Hypre linear solver library [11]. The control parameters exposed by Hypre are primarily at the algorithmic and subalgorithmic level. Specifically, Hypre enables the user (or the runtime system) to select the input preconditioner, linear solver, subsolver options, and postconditioner. In our experience, several thousand combinations of these options can be selected from at job launch. While tuning Hypre parameters has been an extensively researched topic, our empirical analysis showed that imposing system-level power constraints severely affects the applicability of previous tuning efforts [23]. We observed that the best-case combination of the tuning knobs for Hypre is often inefficient when subject to a hardware power constraint. Consequently, tuning the resource manager and runtime system without taking into account how the power-aware runtime system and Hypre collectively respond to the decision layers leaves performance on the table. We use the Conductor runtime system [24] to transparently optimize the job-level power budget on the allocated nodes. Conductor exposes control parameters that impact the granularity and efficiency of its power-balancing algorithm under the assigned job-level power limit.

Our target metrics are twofold: 1) improve power efficiency in terms of instructions per cycle (IPC) per watt at the runtime system level, and 2) improve job throughput at the resource manager level.
Figure 3. High-level overview of multijob GEOPM policy assignment
At the application level, we will explore several parameters, including the choice of solver algorithm, subsolver options, and data preconditioner (smoother and coarsening). At the runtime system level, we explore traditional platform knobs such as power limiting and frequency scaling. At the resource manager level, we explore the number of nodes and MPI tasks as the control parameters. The approach relies on telemetry at the different layers of the PowerStack. The primary metrics for telemetry include power usage (watts), operating frequency (GHz), and IPC. The primary challenges include managing static interactions between the layers at application launch and estimating the impact of several control parameters on the application behavior.
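The sketch below illustrates the shape of this search at the application/runtime boundary: candidate Hypre solver/preconditioner pairs are ranked by measured IPC per watt under a fixed job-level power cap. The option names are genuine Hypre components, but measure() is a stand-in for running the Laplacian test problem under Conductor and reading the hardware counters and power telemetry.

```python
import itertools
import random

SOLVERS = ["PCG", "GMRES", "BiCGSTAB"]
PRECONDITIONERS = ["BoomerAMG", "ParaSails", "Euclid"]
JOB_POWER_CAP_W = 1800

def measure(solver, precond, power_cap_w):
    """Stand-in: run the test problem with the given knobs and return
    (IPC, average watts) from counter and power telemetry."""
    ipc = random.uniform(0.8, 2.0)
    avg_power_w = random.uniform(0.7, 1.0) * power_cap_w
    return ipc, avg_power_w

results = []
for solver, precond in itertools.product(SOLVERS, PRECONDITIONERS):
    ipc, watts = measure(solver, precond, JOB_POWER_CAP_W)
    results.append((ipc / watts, solver, precond))

for eff, solver, precond in sorted(results, reverse=True)[:3]:
    print(f"{solver}+{precond}: {eff:.5f} IPC/W")   # top three combinations
```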
One of the critical challenges at HPC facilities is to limit the energy/power consumption of their in-house systems based on their contractual obligations with utility service providers [9]. In order to enhance efficiency, a system-level power-aware software agent (such as the resource manager) is needed to frequently monitor and control consumption in tandem with an application-level power management software agent (such as GEOPM [10], [12]). Figure 3 illustrates how facility-level power policies filter down to job-level granularity.

The community is working toward a prototype that facilitates a bidirectional communication channel for interoperation between the resource manager and GEOPM. The objectives are systemwide characterization of frequency, power, and thermal variation across the system plus node outlier detection; node energy savings or FLOPS improvements via adapting CPU/GPU power management controls according to application phases or characteristics; overall application energy savings or runtime reductions via steering power between nodes according to load imbalance patterns; and enabling other tools or resource managers to perform system-wide power/performance management by leveraging GEOPM.

GEOPM has three main modes of community site-level policies.

1) Enforcing a static preconfigured sitewide policy: This method relies on preconfigured policy configuration files passed to GEOPM during job launch. These files can either be baked directly into the node boot image on a pseudo-filesystem or passed along to the job launcher. Adding extra awareness of these files within the RM enables it to enforce power management control with or without GEOPM.

2) Enforcing job-specific policies: This method is suitable for HPC sites that frequently run a finite set of applications with historic profile information about their power/energy/thermal characteristics. Such sites typically maintain a database that maps applications to specific policy parameters (e.g., fixed frequency or power cap).

3) Fully dynamic policy: This ongoing R&D work relies on the entire power management stack participating in dynamic cooperation between the electric grid supply, the resource manager, and the job-level runtime (GEOPM). This mode enables job-specific hardware resources to react in accordance with the instantaneous system-level power/energy requirements realized at the site level.

Interfaces to system-level agents: The communication channel/interface can take the form of environment variables, preconfiguration files, job-launcher command line options, or shared memory. Shared memory facilitates efficient interaction and is referred to as an endpoint, which acts as a gateway between a persistent compute node daemon (such as SLURM) and an application power-management daemon (such as the GEOPM root controller). GEOPM provides a plugin-based interface that enables users to plug and play their own algorithms of choice. By default, a typical GEOPM installation comes prepackaged with five different algorithms that correspond to the most common policies among HPC sites: energy efficiency under a performance degradation threshold, power load balancing based on the average node power cap, static frequency assignment for the entire lifetime of the application, static power cap assignment for the entire lifetime of the application, and monitoring of application energy/power metrics.
The open design challenges are fault tolerance, security, multitenancy, interaction with other OS/user-level/RAS daemons, and handling of conflicting power management mechanisms on the same system.
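As an illustration of the second (job-specific) policy mode, the sketch below shows a site policy database that the RM could consult at job launch to emit a policy file for GEOPM. The table contents, key names, and JSON encoding are assumptions for illustration only; a real deployment would follow GEOPM's own policy file format and agent parameters.

```python
import json

# Hypothetical site database mapping known applications to policy parameters.
SITE_POLICY_DB = {
    "lammps":  {"CPU_FREQ_MAX": 2.1e9},        # fixed-frequency policy (Hz)
    "qmcpack": {"CPU_POWER_LIMIT": 180.0},     # per-node power cap (W)
}
DEFAULT_POLICY = {"CPU_POWER_LIMIT": 200.0}    # fallback for unknown codes

def write_policy_file(app_name, path):
    """Look up the job's policy and write it where the launcher can find it."""
    policy = SITE_POLICY_DB.get(app_name, DEFAULT_POLICY)
    with open(path, "w") as f:
        json.dump(policy, f)
    return policy

print(write_policy_file("qmcpack", "geopm_policy.json"))
```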
In the auto-tuning ytopt project [31], we use the new Clang loop tiling, interchange, packing, and/or jam pragmas as examples to illustrate the process of auto-tuning the pragma parameters to achieve optimal performance. Figure 4 shows our auto-tuning framework.

Figure 4. Auto-tuning application and runtime parameter selection

We analyze an application code to identify the important parameters to focus on, then replace these parameters (application parameters, loop transformation parameters) with symbols that the auto-tuner substitutes with candidate values.
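The sketch below illustrates the substitution step under stated assumptions: pragma parameters in the source are replaced with symbolic placeholders (#P0, #P1, ...), each sampled configuration is substituted into a copy of the code, and the resulting variant is compiled and timed. The pragma text, the symbol naming, the random search (ytopt itself supports model-based search), and compile_and_time() are all illustrative.

```python
import random

TEMPLATE = """
#pragma clang loop tile sizes(#P0,#P1)
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    C[i][j] = A[i][j] + B[i][j];
"""

SPACE = {"#P0": [4, 8, 16, 32], "#P1": [4, 8, 16, 32]}

def compile_and_time(source):
    """Stand-in for compiling the substituted variant and timing one run."""
    return random.uniform(1.0, 5.0)

best_time, best_cfg = float("inf"), None
for _ in range(20):                              # random search over the space
    cfg = {sym: random.choice(vals) for sym, vals in SPACE.items()}
    source = TEMPLATE
    for sym, val in cfg.items():
        source = source.replace(sym, str(val))   # substitute symbols
    t = compile_and_time(source)
    if t < best_time:
        best_time, best_cfg = t, cfg

print(f"best variant {best_cfg} in {best_time:.2f} s")
```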
The Horizon 2020 project Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing (READEX) [28] provided an idea and a tool suite for splitting a parallel application into regions with different resource requirements and dynamically tuning the hardware parameters to suit the needs of each region. Besides hardware and system parameter tuning, the READEX tool suite supports static and dynamic tuning of application parameters. For this purpose, a static application configuration tuning (ACP) plugin and a dynamic application tuning parameters (ATP) plugin [21] for the Periscope Tuning Framework [13] were developed. Static tuning requires the identification of parameters in the application's launch configuration file, which changes at the beginning of each run. Dynamic tuning requires instrumenting the application with the ATP API to identify the tuned parameters. Each parameter's value is then changed at every x-th iteration of the loop in which the instrumentation resides, which requires support on the application side to handle such modification during runtime. The key input information for both static and dynamic application parameter tuning is not only a list of parameter values to set but also dependency conditions that express which combinations of parameters are not allowed.

The ESPRESO FETI solver [27] was tuned by using the READEX tools for an optimal hardware configuration, as well as by using the ATP to find an optimal solver, preconditioner, and domain size, all of which have a major impact on the time to solution of the given problem. The application was instrumented with a set of regions, as presented in Figure 5. This optimization led to major runtime and energy savings as well as to improved scalability [21], [27].

Figure 5. Graph of the ESPRESO FETI solver regions instrumented in the source code

The complete READEX tool suite searches for the optimal configuration automatically using one of many supported algorithms for the state-space search, which can potentially be very large. However, the challenge of this approach lies in instrumenting the application and its configuration file with valid and complete dependency conditions for the application parameters.

The site-level power budget (power corridor) is usually enforced by using job cancellation, idle node shutdown, power capping, and dynamic voltage and frequency scaling (DVFS). A new proactive power corridor management strategy was developed with our dynamic resource management infrastructure [7], [18], comprising an Invasive Resource Manager (IRM) and the Invasive Message Passing Interface (IMPI). The core research challenge is power management using dynamic resource redistribution among applications.

The power usage of running applications is predicted and analyzed for power budget violations. If a violation is predicted, steps are taken to formulate a new resource redistribution heuristic to satisfy the site-level power constraints. As shown in Figure 6, the node distribution is dynamically changed by the IRM to maintain the power budget. This dynamic strategy can also cater to dynamic power budgets arising from renewable energy sources.

Figure 6. Dynamic resource redistribution to enforce the power corridor

We also constructed a programming paradigm called Elastic Phase Oriented Programming (EPOP) [19] on top of IMPI for developing dynamic applications and for co-tuning.
EPOP measures the power as well as performance characteristics of the application and communicates with the IRM upon request. Using EPOP, the programmer can explicitly inform the IRM about the application phases where resource redistribution is or is not needed. A dynamic resource manager also requires knowledge of application constraints (for example, the requirement of a cubic number of processes in LULESH) for efficient resource redistribution. Hence, co-tuning among resource managers, runtime systems, and applications is necessary in order to exploit this dynamism. A lack of running dynamic applications and of application-specific constraints can hinder the calculation of a successful resource redistribution. Power-aware scheduling with other strategies such as power capping and DVFS can address this shortcoming.
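A minimal sketch of the prediction-and-redistribution loop described above, with illustrative numbers: if the predicted cluster power for the next interval exceeds the corridor's upper bound, the manager shrinks malleable jobs (as the IRM would via IMPI) until the prediction fits, falling back to other mechanisms when no malleable job remains.

```python
CORRIDOR_W = (8_000, 10_000)   # (lower, upper) site power bounds

jobs = [  # name, current nodes, predicted watts per node, malleable?
    {"name": "lulesh",  "nodes": 40, "w_per_node": 220, "malleable": True},
    {"name": "nekbone", "nodes": 16, "w_per_node": 180, "malleable": False},
]

def total_power(jobs):
    return sum(j["nodes"] * j["w_per_node"] for j in jobs)

def enforce_corridor(jobs):
    """Shrink malleable jobs until the predicted power fits the corridor.
    (Raising power toward the lower bound is omitted for brevity.)"""
    _, upper_w = CORRIDOR_W
    while total_power(jobs) > upper_w:             # predicted violation
        candidates = [j for j in jobs if j["malleable"] and j["nodes"] > 1]
        if not candidates:
            break                                   # fall back to power capping
        victim = max(candidates, key=lambda j: j["w_per_node"])
        victim["nodes"] -= 1                        # IRM resizes the job via IMPI
    return total_power(jobs)

print("predicted power after redistribution:", enforce_corridor(jobs), "W")
```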
COUNTDOWN is a runtime library for performance-neutral energy saving in MPI applications developed by the University of Bologna and CINECA [5], [6]. The energy saving is obtained transparently to the user, without requiring application code modifications or recompilation of the application. The target metric is energy efficiency with no performance degradation, achieved seamlessly by leveraging MPI communication phases. We consider the following metrics: runtime, power consumption, and energy consumption. The control parameters/units are the types and sections of the MPI calls in which to reduce the power consumption and CPU performance. The COUNTDOWN runtime intercepts those calls and separates waiting time from copy time and computing time. The COUNTDOWN configuration can be set at the beginning of a job run to (i) profile only the MPI communication regions, (ii) reduce the power consumption during MPI wait and copy time, or (iii) reduce the power consumption during MPI wait time only. The power reduction is achieved by interacting with the node level, while at the system level the resource manager interacts with the COUNTDOWN configuration to select the level of aggressiveness (i.e., energy saving). This enables co-tuning of the application and of communication-phase power management.

The challenges and open issues are the dynamic adaptation of the COUNTDOWN configuration during job execution, its integration with the job scheduler, and the estimation during job execution of the accumulated time-to-solution overhead.
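The essence of the mechanism can be sketched as a wrapper around blocking MPI calls: drop the core frequency on entry to a communication phase and restore it on exit. This is only a schematic in Python with a stand-in knob; the real COUNTDOWN library intercepts MPI at link/load time in C, distinguishes wait time from copy time, and requires no source changes.

```python
from functools import wraps

MIN_FREQ_KHZ, MAX_FREQ_KHZ = 1_200_000, 2_400_000

def set_cpu_freq(khz):
    """Stand-in for the actual frequency knob (MSR/sysfs/cpufreq write)."""
    print(f"[knob] cpu frequency -> {khz} kHz")

def countdown(mpi_call):
    """Wrap a blocking MPI call in a frequency-reduction window."""
    @wraps(mpi_call)
    def wrapper(*args, **kwargs):
        set_cpu_freq(MIN_FREQ_KHZ)       # entering a wait phase: slow down
        try:
            return mpi_call(*args, **kwargs)
        finally:
            set_cpu_freq(MAX_FREQ_KHZ)   # back to computation: full speed
    return wrapper

@countdown
def fake_barrier():                      # stand-in for a blocking MPI_Barrier
    print("[mpi] barrier")

fake_barrier()
```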
This use case uses two runtime systems at the same time to optimize the hardware parameters for the same metric. COUNTDOWN is a powerful tool for optimizing the communication phases of MPI applications. However, it does not take into account other application characteristics, such as I/O-bound, memory-bound, and compute-bound regions of the code, that may provide major room for exploiting savings [20]. The MERIC library [29] implements the READEX approach: it tunes the application based on its instrumentation, which should cover the whole runtime of the application, and it provides a specific tuned-parameter configuration for each instrumented region. The instrumentation can be inserted manually into the source code or automatically into the binary. Since we usually instrument only regions for which we can obtain at least 100 power samples for a reliable energy consumption measurement (e.g., for RAPL this means a minimum region size of 100 ms), neither manual nor automatic instrumentation is likely to provide annotation as fine-grained as the application's communication phases.

Both tools are under active stand-alone development. The challenge is to implement a communication layer that allows these tools to work in synergy, guaranteeing that both tools retain knowledge of which tool is in charge and what the current and future hardware settings are, without creating conflicts. This is a work-in-progress idea.
4. Research Opportunities and Challenges
In this section, we identify the important unsolved challenges in collective auto-tuning of two or more management layers (or domains) in the PowerStack. Identifying such challenges will provide a platform for collaborative research on co-tuning of different layers in the PowerStack. For co-tuning the resource manager and runtime system, the objective is to define interactions (static and dynamic) based on the type of job (moldable/malleable vs. non-moldable/malleable) and job-agnostic interactions that are common to these two types of jobs. For co-tuning the resource manager and application, the objective is to target how the application can fully utilize the allocated resources for efficient execution while complying with the job-level power budget. For co-tuning the runtime system and application, the objective is to fine-tune system parameters, software, and application parameters at runtime under the job-level power budget. For co-tuning the resource manager, runtime system, and application, the objective is to explore ways in which the application can fully utilize the dynamically allocated resources for efficient execution while also maximizing job throughput under the system power limit. When we consider co-tuning all four layers, we have to carefully analyze the tunable parameters from each layer and identify how they impact each other across layers, in order to find the best combination for optimal performance under a power budget.

We identify four areas where further research effort is required to extend end-to-end auto-tuning for the PowerStack:
Collective tuning of PowerStack layers has remained a challenging task because of the lack of interfaces for (1) translating high-level goals into subsequent lower-level goals, (2) translating monitored metrics at lower layers into derived metrics at higher layers, and (3) enabling a custom configuration (e.g., resource reallocation and remapping) at application launch and during runtime. Developing such interfaces will enable the platform to be used to explore several research directions for the reallocation of resources to job and sub-job components. In this research direction, the following research questions should be explored.

• What target metrics are important to the individual layers of the PowerStack? How can the target metrics be translated into metrics understood by the lower layers of the stack, assuming top-down control transfer?
• What are the different approaches to quantify the potential for performance improvement while tuning resource allocation and mapping across the stack? Potential approaches include exhaustive empirical exploration, model-based estimation, and emulation.
• What customization is required in the software and hardware components participating in resource reallocation? For example, for the reallocation of resources to a running job, what features must be supported by the layers of the PowerStack to support this flexibility?
• What operating challenges would be incurred when performing resource reallocation? For example, performing job reallocation while maintaining a minimum power draw across the system may be critical in order to comply with site-specific policies and will require additional control logic to be in place at several layers of the PowerStack.
• Can we leverage technologies such as containers and virtual machines to address some of the challenges in resource reallocation?
• Which layers of the PowerStack can leverage job moldability and malleability? At what point in the application run should this be done? What interfaces are provided to utilize moldability and/or malleability?

4.2. Offline/Static Co-tuning
The PowerStack typically operates with the software components and target binaries provided to it to solve a target problem. Several software components not directly included in the formal definition of the PowerStack play an important role in the eventual outcome of the PowerStack. Such components include compiler toolchains and the optimization features they provide, variants of commonly used libraries (e.g., implementations of MPI, OpenMP, and thread-management libraries), and input decks of the target application. These indirectly affect the efficiency of the individual layers of the PowerStack as well as the efficiency of the science being performed on the system. We identify the following open research questions in this topic.

• Can we quantify the impact of different compiler optimization flags on one or more target metrics of the layers of the PowerStack and the application? (A small sketch follows this list.)
• Can we inform the compiler toolchain about the runtime (or online) situation on the system, including resource constraints, the choice of runtime algorithms, and hardware characteristics, while applying optimization techniques?
• Can we quantify the impact of using several variants of the application dependencies (outside of the PowerStack) on the efficiency of the PowerStack?
• Can we identify correlations between black-box and/or white-box characteristics of these dependencies and the efficiency metrics relevant to the PowerStack?
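For the first question above, a baseline quantification can be as simple as sweeping flag sets and recording time and energy per build variant, as sketched below. build_and_run() is a stand-in for recompiling the application and reading an energy counter (e.g., RAPL) around a run; the flag sets and numbers are illustrative.

```python
import random

FLAG_SETS = ["-O2", "-O3", "-O3 -march=native", "-O3 -flto"]

def build_and_run(flags):
    """Stand-in: rebuild with `flags`, run once, return (seconds, joules)."""
    runtime_s = random.uniform(80, 120)
    energy_j = runtime_s * random.uniform(150, 250)  # ~average node power
    return runtime_s, energy_j

for flags in FLAG_SETS:
    t, e = build_and_run(flags)
    print(f"{flags:20s} time {t:6.1f} s   energy {e/1000:6.1f} kJ")
```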
Hardware overprovisioning has been suggested as a viable approach to address the challenges associated with site-wide or cluster-level power constraints [25]. Since an overprovisioned cluster has more compute and storage devices than can be powered up at any given time, the problem of selecting which components to power up and how to operate them becomes challenging. With heterogeneous computing platforms on the rise to meet the demands of exascale performance, this problem becomes even more challenging. Developing solutions to tackle this problem, which are often site-specific, will require exploring the following research questions.

• How can one quantify the trade-off between the number of compute devices on the system and system-level efficiency? (A sketch follows this list.)
• How can one quantify the trade-off between designing software approaches that efficiently use fewer or more compute resources and system efficiency?
• How can one correlate software-level features with efficient usage of compute resources on the system under a power constraint?
• What are the existing discrepancies between the interfaces required to drive runtime management of online or offline compute devices and the existing interfaces to manage those devices?
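The following sketch frames the first trade-off under stated assumptions (an Amdahl-style scaling model and a square-root frequency/power relation, both purely illustrative): given a fixed cluster power cap on an overprovisioned system, sweep the number of powered-up nodes and pick the count that maximizes modeled performance.

```python
CLUSTER_CAP_W = 50_000
NODE_MIN_W, NODE_MAX_W = 150, 300        # valid per-node operating range
PARALLEL_FRACTION = 0.99                 # assumed Amdahl parallel fraction

def relative_perf(nodes):
    node_w = min(NODE_MAX_W, CLUSTER_CAP_W / nodes)
    if node_w < NODE_MIN_W:
        return 0.0                       # cannot power up that many nodes
    speedup = 1.0 / ((1 - PARALLEL_FRACTION) + PARALLEL_FRACTION / nodes)
    freq_factor = (node_w / NODE_MAX_W) ** 0.5   # assumed freq ~ sqrt(power)
    return speedup * freq_factor

best_n = max(range(1, 400), key=relative_perf)
print(best_n, "nodes, relative speedup", round(relative_perf(best_n), 1))
```

Under these assumptions the sweet spot lands well below the maximum node count the cap could power, because the per-node slowdown eventually outweighs the scaling gain; real answers are site- and workload-specific.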
Several research challenges exist in fine-tuning the control parameters exposed by the different software components in the PowerStack. The key motivation behind such tuning is that individual layers in the stack are tuned based on a local view of those layers, without taking into account the impact of the tuning decisions outside of those layers. We identify the following research questions in this space.

• How do the existing control parameters exposed by the different layers of the PowerStack and the application interact with each other on the scale of power efficiency? What are the approaches to quantify that interaction and describe the potential benefits of such exploration?
• How can one find the correlation between the characteristics of the allocation and management algorithms used by the layers of the PowerStack and how those algorithms interact with the application? Can we augment this analysis with application-level algorithms and sub-algorithmic controls exposed by the developer?
• Can we extend the algorithms at the different layers of the PowerStack to incorporate semantic information from the application (e.g., the state of the molecular dynamics simulation at each time step)?
5. Conclusions
The PowerStack community effort has demonstrated the need for a holistic software stack for power and energy management and for standardized interfaces between the layers of the PowerStack [2]. In this paper, we presented a survey of existing efforts and the important open challenges to enable end-to-end auto-tuning in the PowerStack. First, we surveyed the layer-specific tuning efforts, describing the high-level objectives, target metrics, layer-specific control parameters, and methods, and we listed the existing software components. Then, we proposed the PowerStack end-to-end auto-tuning framework, identified the opportunities in co-tuning different layers in the PowerStack, and presented seven use cases. We also presented our vision of the research opportunities and challenges for collective auto-tuning of two or more management layers (or domains) in the PowerStack. This paper takes the first step in identifying and aggregating all the R&D challenges related to interoperation among multiple layers of the PowerStack. As part of future work, we invite participation in the collaborative effort to develop the end-to-end auto-tuning framework and to integrate the individual research activities, in order to realize streamlined auto-tuning of all layers of the PowerStack.

Acknowledgments
This work was supported in part by LDRD funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under contract DE-AC02-06CH11357, and in part by NSF grant CCF-1801856.
References